
Abstract Dynamic Programming

Dimitri P. Bertsekas
Massachusetts Institute of Technology

WWW site for book information and orders

http://www.athenasc.com

Athena Scientific, Belmont, Massachusetts


Athena Scientific
Post Office Box 805
Nashua, NH 03061-0805
U.S.A.

Email: [email protected]
WWW: http://www.athenasc.com

© 2013 Dimitri P. Bertsekas


All rights reserved. No part of this book may be reproduced in any form
by any electronic or mechanical means (including photocopying, recording,
or information storage and retrieval) without permission in writing from
the publisher.

Publisher’s Cataloging-in-Publication Data


Bertsekas, Dimitri P.
Abstract Dynamic Programming
Includes bibliographical references and index
1. Mathematical Optimization. 2. Dynamic Programming. I. Title.
QA402.5 .B465 2013 519.703 01-75941

ISBN-10: 1-886529-42-6, ISBN-13: 978-1-886529-42-7


Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . p. 1
1.1. Structure of Dynamic Programming Problems . . . . . . . p. 2
1.2. Abstract Dynamic Programming Models . . . . . . . . . . p. 5
1.2.1. Problem Formulation . . . . . . . . . . . . . . . . p. 5
1.2.2. Monotonicity and Contraction Assumptions . . . . . . p. 7
1.2.3. Some Examples . . . . . . . . . . . . . . . . . . p. 9
1.2.4. Approximation-Related Mappings . . . . . . . . . . p. 21
1.3. Organization of the Book . . . . . . . . . . . . . . . . p. 23
1.4. Notes, Sources, and Exercises . . . . . . . . . . . . . . . p. 25

2. Contractive Models . . . . . . . . . . . . . . . . . p. 29
2.1. Fixed Point Equation and Optimality Conditions . . . . . . p. 30
2.2. Limited Lookahead Policies . . . . . . . . . . . . . . . p. 37
2.3. Value Iteration . . . . . . . . . . . . . . . . . . . . . p. 42
2.3.1. Approximate Value Iteration . . . . . . . . . . . . . p. 43
2.4. Policy Iteration . . . . . . . . . . . . . . . . . . . . . p. 46
2.4.1. Approximate Policy Iteration . . . . . . . . . . . . p. 48
2.5. Optimistic Policy Iteration . . . . . . . . . . . . . . . . p. 52
2.5.1. Convergence of Optimistic Policy Iteration . . . . . . p. 52
2.5.2. Approximate Optimistic Policy Iteration . . . . . . . p. 57
2.6. Asynchronous Algorithms . . . . . . . . . . . . . . . . p. 61
2.6.1. Asynchronous Value Iteration . . . . . . . . . . . . p. 61
2.6.2. Asynchronous Policy Iteration . . . . . . . . . . . . p. 67
2.6.3. Policy Iteration with a Uniform Fixed Point . . . . . . p. 72
2.7. Notes, Sources, and Exercises . . . . . . . . . . . . . . . p. 79

3. Semicontractive Models . . . . . . . . . . . . . . . p. 85
3.1. Semicontractive Models and Regular Policies . . . . . . . . p. 86
3.1.1. Fixed Points, Optimality Conditions, and
Algorithmic Results . . . . . . . . . . . . . . . . p. 90
3.1.2. Illustrative Example: Deterministic Shortest
Path Problems . . . . . . . . . . . . . . . . . . p. 97
3.2. Irregular Policies and a Perturbation Approach . . . . . p. 100
3.2.1. The Case Where Irregular Policies Have Infinite
Cost . . . . . . . . . . . . . . . . . . . . . . p. 100
3.2.2. The Case Where Irregular Policies Have Finite
       Cost - Perturbations . . . . . . . . . . . . . . p. 107


3.3. Algorithms . . . . . . . . . . . . . . . . . . . . . . p. 116
3.3.1. Asynchronous Value Iteration . . . . . . . . . . . p. 117
3.3.2. Asynchronous Policy Iteration . . . . . . . . . . . p. 118
3.3.3. Policy Iteration with Perturbations . . . . . . . . . p. 124
3.4. Notes, Sources, and Exercises . . . . . . . . . . . . . . p. 125

4. Noncontractive Models . . . . . . . . . . . . . . p. 129


4.1. Noncontractive Models . . . . . . . . . . . . . . . . p. 130
4.2. Finite Horizon Problems . . . . . . . . . . . . . . . . p. 133
4.3. Infinite Horizon Problems . . . . . . . . . . . . . . . p. 139
4.3.1. Fixed Point Properties and Optimality Conditions . . p. 143
4.3.2. Value Iteration . . . . . . . . . . . . . . . . . . p. 154
4.3.3. Policy Iteration . . . . . . . . . . . . . . . . . p. 157
4.4. Semicontractive-Monotone Increasing Models . . . . . . . p. 163
4.4.1. Value and Policy Iteration Algorithms . . . . . . . p. 163
4.4.2. Some Applications . . . . . . . . . . . . . . . . p. 166
4.4.3. Linear-Quadratic Problems . . . . . . . . . . . . p. 168
4.5. Affine Monotonic Models . . . . . . . . . . . . . . . p. 171
4.5.1. Increasing Affine Monotonic Models . . . . . . . . p. 172
4.5.2. Nonincreasing Affine Monotonic Models . . . . . . . p. 173
4.5.3. Exponential Cost Stochastic Shortest Path
Problems . . . . . . . . . . . . . . . . . . . . p. 175
4.6. An Overview of Semicontractive Models and Results . . . p. 179
4.7. Notes, Sources, and Exercises . . . . . . . . . . . . . . p. 179

5. Models with Restricted Policies . . . . . . . . . . . p. 187


5.1. A Framework for Restricted Policies . . . . . . . . . . . p. 188
5.1.1. General Assumptions . . . . . . . . . . . . . . . p. 192
5.2. Finite Horizon Problems . . . . . . . . . . . . . . . . p. 196
5.3. Contractive Models . . . . . . . . . . . . . . . . . . p. 198
5.4. Borel Space Models . . . . . . . . . . . . . . . . . . p. 200
5.5. Notes, Sources, and Exercises . . . . . . . . . . . . . . p. 201

Appendix A: Notation and Mathematical Conventions . p. 203


Appendix B: Contraction Mappings . . . . . . . . . p. 207
Appendix C: Measure Theoretic Issues . . . . . . . p. 216
Appendix D: Solutions of Exercises . . . . . . . . . p. 230
References . . . . . . . . . . . . . . . . . . . p. 241
Index . . . . . . . . . . . . . . . . . . . . . . p. 247
Preface

This book aims at a unified and economical development of the core the-
ory and algorithms of total cost sequential decision problems, based on
the strong connections of the subject with fixed point theory. The analy-
sis focuses on the abstract mapping that underlies dynamic programming
(DP for short) and defines the mathematical character of the associated
problem. Our discussion centers on two fundamental properties that this
mapping may have: monotonicity and (weighted sup-norm) contraction. It
turns out that the nature of the analytical and algorithmic DP theory is
determined primarily by the presence or absence of these two properties,
and the rest of the problem’s structure is largely inconsequential.
In this book, with some minor exceptions, we will assume that mono-
tonicity holds. Consequently, we organize our treatment around the con-
traction property, and we focus on four main classes of models:
(a) Contractive models, discussed in Chapter 2, which have the richest
and strongest theory, and are the benchmark against which the the-
ory of other models is compared. Prominent among these models are
discounted stochastic optimal control problems. The development of
these models is quite thorough and includes the analysis of recent ap-
proximation algorithms for large-scale problems (neuro-dynamic pro-
gramming, reinforcement learning).
(b) Semicontractive models, discussed in Chapter 3 and parts of Chap-
ter 4. The term “semicontractive” is used qualitatively here, to refer
to a variety of models where some policies have a regularity/contrac-
tion-like property but others do not. A prominent example is stochas-
tic shortest path problems, where one aims to drive the state of
a Markov chain to a termination state at minimum expected cost.
These models also have a strong theory under certain conditions, of-
ten nearly as strong as those of the contractive models.
(c) Noncontractive models, discussed in Chapter 4, which rely on just
monotonicity. These models are more complex than the preceding
ones and much of the theory of the contractive models generalizes in
weaker form, if at all. For example, in general the associated Bell-
man equation need not have a unique solution, the value iteration
method may work starting with some functions but not with others,
and the policy iteration method may not work at all. Infinite hori-
zon examples of these models are the classical positive and negative
DP problems, first analyzed by Dubins and Savage, Blackwell, and


Strauch, which are discussed in various sources. Some new semicon-


tractive models are also discussed in this chapter, further bridging
the gap between contractive and noncontractive models.
(d) Restricted policies and Borel space models, which are discussed
in Chapter 5. These models are motivated in part by the complex
measurability questions that arise in mathematically rigorous theories
of stochastic optimal control involving continuous probability spaces.
Within this context, the admissible policies and DP mapping are
restricted to have certain measurability properties, and the analysis
of the preceding chapters requires modifications. Restricted policy
models are also useful when there is a special class of policies with
favorable structure, which is “closed” with respect to the standard DP
operations, in the sense that analysis and algorithms can be confined
within this class.
We do not consider average cost DP problems, whose character bears
a much closer connection to stochastic processes than to total cost prob-
lems. We also do not address specific stochastic characteristics underlying
the problem, such as for example a Markovian structure. Thus our re-
sults apply equally well to Markovian decision problems and to sequential
minimax problems. While this makes our development general and a con-
venient starting point for the further analysis of a variety of different types
of problems, it also ignores some of the interesting characteristics of special
types of DP problems that require an intricate probabilistic analysis.
Let us describe the research content of the book in summary, de-
ferring a more detailed discussion to the end-of-chapter notes. A large
portion of our analysis has been known for a long time, but in a somewhat
fragmentary form. In particular, the contractive theory, first developed by
Denardo [Den67], has been known for the case of the unweighted sup-norm,
but does not cover the important special case of stochastic shortest path
problems where all policies are proper. Chapter 2 transcribes this theory
to the weighted sup-norm contraction case. Moreover, Chapter 2 develops
extensions of the theory to approximate DP, and includes material on asyn-
chronous value iteration (based on the author’s work [Ber82], [Ber83]), and
asynchronous policy iteration algorithms (based on the author’s joint work
with Huizhen (Janey) Yu [BeY10a], [BeY10b], [YuB11a]). Most of this
material is relatively new, having been presented in the author’s recent
book [Ber12a] and survey paper [Ber12b], with detailed references given
there. The analysis of infinite horizon noncontractive models in Chapter 4
was first given in the author’s paper [Ber77], and was also presented in the
book by Bertsekas and Shreve [BeS78], which in addition contains much
of the material on finite horizon problems, restricted policies models, and
Borel space models. These were the starting point and main sources for
our development.
The new research presented in this book is primarily on the semi-

contractive models of Chapter 3 and parts of Chapter 4. Traditionally,


the theory of total cost infinite horizon DP has been bordered by two ex-
tremes: discounted models, which have a contractive nature, and positive
and negative models, which do not have a contractive nature, but rely
on an enhanced monotonicity structure (monotone increase and monotone
decrease models, or in classical DP terms, positive and negative models).
Between these two extremes lies a gray area of problems that are not con-
tractive, and either do not fit into the categories of positive and negative
models, or possess additional structure that is not exploited by the theory
of these models. Included are stochastic shortest path problems, search
problems, linear-quadratic problems, a host of queueing problems, multi-
plicative and exponential cost models, and others. Together these problems
represent an important part of the infinite horizon total cost DP landscape.
They possess important theoretical characteristics, not generally available
for positive and negative models, such as the uniqueness of solution of Bell-
man’s equation within a subset of interest, and the validity of useful forms
of value and policy iteration algorithms.
Our semicontractive models aim to provide a unifying abstract DP
structure for problems in this gray area between contractive and noncon-
tractive models. The analysis is motivated in part by stochastic shortest
path problems, where there are two types of policies: proper , which are
the ones that lead to the termination state with probability one from all
starting states, and improper , which are the ones that are not proper.
Proper and improper policies can also be characterized through their Bell-
man equation mapping: for the former this mapping is a contraction, while
for the latter it is not. In our more general semicontractive models, policies
are also characterized in terms of their Bellman equation mapping, through
a notion of regularity, which generalizes the notion of a proper policy and
is related to classical notions of asymptotic stability from control theory.
In our development a policy is regular within a certain set if its cost
function is the unique asymptotically stable equilibrium (fixed point) of
the associated DP mapping within that set. We assume that some policies
are regular while others are not , and impose various assumptions to ensure
that attention can be focused on the regular policies. From an analytical
point of view, this brings to bear the theory of fixed points of monotone
mappings. From the practical point of view, this allows application to a
diverse collection of interesting problems, ranging from stochastic short-
est path problems of various kinds, where the regular policies include the
proper policies, to linear-quadratic problems, where the regular policies
include the stabilizing linear feedback controllers.
The definition of regularity is introduced in Chapter 3, and its theoret-
ical ramifications are explored through extensions of the classical stochastic
shortest path and search problems. In Chapter 4, semicontractive models
are discussed in the presence of additional monotonicity structure, which
brings to bear the properties of positive and negative DP models. With the

aid of this structure, the theory of semicontractive models can be strength-


ened and can be applied to several additional problems, including risk-
sensitive/exponential cost problems.
The book has a theoretical research monograph character, but re-
quires a modest mathematical background for all chapters except the last
one, essentially a first course in analysis. Of course, prior exposure to DP
will definitely be very helpful to provide orientation and context. A few
exercises have been included, either to illustrate the theory with exam-
ples and counterexamples, or to provide applications and extensions of the
theory. Solutions of all the exercises can be found in Appendix D, at the
book’s internet site
http://www.athenasc.com/abstractdp.html
and at the author’s web site
http://web.mit.edu/dimitrib/www/home.html
Additional exercises and other related material may be added to these sites
over time.
I would like to express my appreciation to a few colleagues for inter-
actions, recent and old, which have helped shape the form of the book. My
collaboration with Steven Shreve on our 1978 book provided the motivation
and the background for the material on models with restricted policies and
associated measurability questions. My collaboration with John Tsitsiklis
on stochastic shortest path problems provided inspiration for the work on
semicontractive models. My collaboration with Janey Yu played an im-
portant role in the book’s development, and is reflected in our joint work
on asynchronous policy iteration, on perturbation models, and on risk-
sensitive models. Moreover Janey contributed significantly to the material
on semicontractive models with many insightful suggestions. Finally, I am
thankful to Mengdi Wang, who went through portions of the book with
care, and gave several helpful comments.

Dimitri P. Bertsekas, Spring 2013



NOTE ADDED TO THE CHINESE EDITION

The errata of the original edition, as per March 1, 2014, have been incor-
porated in the present edition of the book. The following two papers have
a strong connection to the book, and amplify on the range of applications
of the semicontractive models of Chapters 3 and 4:
(1) D. P. Bertsekas, “Robust Shortest Path Planning and Semicontractive
Dynamic Programming,” Lab. for Information and Decision Systems
Report LIDS-P-2915, MIT, Feb. 2014.
(2) D. P. Bertsekas, “Infinite-Space Shortest Path Problems and Semicon-
tractive Dynamic Programming,” Lab. for Information and Decision
Systems Report LIDS-P-2916, MIT, Feb. 2014.
These papers may be viewed as “on-line appendixes” of the book. They
can be downloaded from the book’s internet site and the author’s web page.
1

Introduction

Contents

1.1. Structure of Dynamic Programming Problems . . . . . p. 2


1.2. Abstract Dynamic Programming Models . . . . . . . . p. 5
1.2.1. Problem Formulation . . . . . . . . . . . . . . p. 5
1.2.2. Monotonicity and Contraction Assumptions . . . . p. 7
1.2.3. Some Examples . . . . . . . . . . . . . . . . p. 9
1.2.4. Approximation-Related Mappings . . . . . . . . p. 21
1.3. Organization of the Book . . . . . . . . . . . . . . p. 23
1.4. Notes, Sources, and Exercises . . . . . . . . . . . . . p. 25


1.1 STRUCTURE OF DYNAMIC PROGRAMMING PROBLEMS

Dynamic programming (DP for short) is the principal method for analysis
of a large and diverse class of sequential decision problems. Examples are
deterministic and stochastic optimal control problems with a continuous
state space, Markov and semi-Markov decision problems with a discrete
state space, minimax problems, and sequential zero sum games. While the
nature of these problems may vary widely, their underlying structures turn
out to be very similar. In all cases there is an underlying mapping that de-
pends on an associated controlled dynamic system and corresponding cost
per stage. This mapping, the DP operator, provides a “compact signature”
of the problem. It defines the cost function of policies and the optimal cost
function, and it provides a convenient shorthand notation for algorithmic
description and analysis.
More importantly, the structure of the DP operator defines the math-
ematical character of the associated problem. The purpose of this book is to
provide an analysis of this structure, centering on two fundamental prop-
erties: monotonicity and (weighted sup-norm) contraction. It turns out
that the nature of the analytical and algorithmic DP theory is determined
primarily by the presence or absence of these two properties, and the rest
of the problem’s structure is largely inconsequential.

A Deterministic Optimal Control Example

To illustrate our viewpoint, let us consider a discrete-time deterministic


optimal control problem described by a system equation

xk+1 = f (xk , uk ), k = 0, 1, . . . . (1.1)

Here xk is the state of the system taking values in a set X (the state space),
and uk is the control taking values in a set U (the control space). At stage
k, there is a cost
α^k g(xk, uk)
incurred when uk is applied at state xk , where α is a scalar in (0, 1] that has
the interpretation of a discount factor when α < 1. The controls are chosen
as a function of the current state, subject to a constraint that depends on
that state. In particular, at state x the control is constrained to take values
in a given set U (x) ⊂ U . Thus we are interested in optimization over the
set of (nonstationary) policies
Π = { {µ0, µ1, . . .} | µk ∈ M, k = 0, 1, . . . },

where M is the set of functions µ : X → U defined by

M = { µ | µ(x) ∈ U(x), ∀ x ∈ X }.

The total cost of a policy π = {µ0 , µ1 , . . .} over an infinite number of


stages and starting at an initial state x0 is

Jπ(x0) = Σ_{k=0}^∞ α^k g(xk, µk(xk)),        (1.2)

where the state sequence {xk } is generated by the deterministic system


(1.1) under the policy π:

xk+1 = f(xk, µk(xk)),   k = 0, 1, . . . .

The optimal cost function is †

J*(x) = inf_{π∈Π} Jπ(x),   x ∈ X.

For any policy π = {µ0 , µ1 , . . .}, consider the policy π1 = {µ1 , µ2 , . . .}


and write by using Eq. (1.2),

Jπ(x) = g(x, µ0(x)) + αJπ1(f(x, µ0(x))).

We have for all x ∈ X


J*(x) = inf_{π={µ0,π1}∈Π} { g(x, µ0(x)) + αJπ1(f(x, µ0(x))) }
      = inf_{µ0∈M} { g(x, µ0(x)) + α inf_{π1∈Π} Jπ1(f(x, µ0(x))) }
      = inf_{µ0∈M} { g(x, µ0(x)) + αJ*(f(x, µ0(x))) }.

The minimization over µ0 ∈ M can be written as minimization over all


u ∈ U (x), so we can write the preceding equation as
J*(x) = inf_{u∈U(x)} { g(x, u) + αJ*(f(x, u)) },   ∀ x ∈ X.        (1.3)

This equation is an example of Bellman’s equation, which plays a


central role in DP analysis and algorithms. If it can be solved for J * ,
an optimal stationary policy {µ∗ , µ∗ , . . .} may typically be obtained by
minimization of the right-hand side for each x, i.e.,
µ*(x) ∈ arg min_{u∈U(x)} { g(x, u) + αJ*(f(x, u)) },   ∀ x ∈ X.        (1.4)

† For the informal discussion of this section, we will disregard a few mathe-
matical issues. In particular, we assume that the series defining Jπ in Eq. (1.2)
is convergent for all allowable π, and that the optimal cost function J ∗ is real-
valued. We will address such issues later.

We now note that both Eqs. (1.3) and (1.4) can be stated in terms of
the expression
H(x, u, J) = g(x, u) + αJ(f(x, u)),   x ∈ X, u ∈ U(x).

Defining
(Tµ J)(x) = H(x, µ(x), J),   x ∈ X,
and
(T J)(x) = inf_{u∈U(x)} H(x, u, J) = inf_{µ∈M} (Tµ J)(x),   x ∈ X,

we see that Bellman’s equation (1.3) can be written compactly as


J * = T J *,
i.e., J * is the fixed point of T , viewed as a mapping from the set of real-
valued functions on X into itself. Moreover, it can be similarly seen that
Jµ , the cost function of the stationary policy {µ, µ, . . .}, is a fixed point of
Tµ . In addition, the optimality condition (1.4) can be stated compactly as
Tµ∗ J * = T J * .
We will see later that additional properties, as well as a variety of algorithms
for finding J * can be analyzed using the mappings T and Tµ .
One more property that holds in some generality is worth noting. For
a given policy π = {µ0, µ1, . . .} and a terminal cost α^N J̄(xN) for the state
xN at the end of N stages, consider the N -stage cost function
Jπ,N(x0) = α^N J̄(xN) + Σ_{k=0}^{N−1} α^k g(xk, µk(xk)).        (1.5)

Then it can be verified by induction that for all initial states x0 , we have
Jπ,N (x0 ) = (Tµ0 Tµ1 · · · TµN −1 J¯)(x0 ). (1.6)
Here Tµ0 Tµ1 · · · TµN −1 is the composition of the mappings Tµ0 , Tµ1 , . . . TµN −1 ,
i.e., for all J ,
(Tµ0 Tµ1 J)(x) = (Tµ0 (Tµ1 J))(x),   x ∈ X,
and more generally
(Tµ0 Tµ1 · · · TµN−1 J)(x) = (Tµ0 (Tµ1 (· · · (TµN−1 J))))(x),   x ∈ X,
(our notational conventions are summarized in Appendix A). Thus the
finite horizon cost functions Jπ,N of π can be defined in terms of the map-
pings Tµ [cf. Eq. (1.6)], and so can their infinite horizon limit Jπ :
Jπ(x) = lim_{N→∞} (Tµ0 Tµ1 · · · TµN−1 J̄)(x),   x ∈ X,        (1.7)

where J¯ is the zero function, J¯(x) = 0 for all x ∈ X (assuming the limit
exists).
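To make the preceding formulas concrete, here is a minimal computational sketch (in Python; not part of the original text) of the mappings Tµ and T and of value iteration for a hypothetical three-state deterministic problem. The system equation f, the cost g, and the constraint sets U(x) below are invented purely for illustration.

# A minimal numerical sketch of Eqs. (1.3)-(1.7) (not from the book): a toy
# deterministic problem with a small finite state space X = {0, 1, 2} and a
# discount factor alpha < 1, so value iteration J <- T J converges to J*.
# All model data (f, g, U) below are hypothetical, chosen only for illustration.

X = [0, 1, 2]                      # state space
U = {0: [0, 1], 1: [0, 1], 2: [0]} # feasible controls U(x)
alpha = 0.9

def f(x, u):                       # system equation x_{k+1} = f(x_k, u_k)
    return (x + u) % 3

def g(x, u):                       # cost per stage
    return 1.0 + x - 0.5 * u

def T_mu(J, mu):                   # (T_mu J)(x) = g(x, mu(x)) + alpha * J(f(x, mu(x)))
    return {x: g(x, mu[x]) + alpha * J[f(x, mu[x])] for x in X}

def T(J):                          # (T J)(x) = min over u in U(x) of g(x, u) + alpha * J(f(x, u))
    return {x: min(g(x, u) + alpha * J[f(x, u)] for u in U[x]) for x in X}

# Value iteration starting from the zero function J_bar, cf. Eq. (1.7).
J = {x: 0.0 for x in X}
for _ in range(200):
    J = T(J)

# Greedy policy from Eq. (1.4): mu*(x) attains the minimum in Bellman's equation.
mu_star = {x: min(U[x], key=lambda u: g(x, u) + alpha * J[f(x, u)]) for x in X}
print(J, mu_star)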

Connection with Fixed Point Methodology

The Bellman equation (1.3) and the optimality condition (1.4), stated in
terms of the mappings Tµ and T , highlight the central theme of this book,
which is that DP theory is intimately connected with the theory of abstract
mappings and their fixed points. Analogs of the Bellman equation, J * =
T J * , optimality conditions, and other results and computational methods
hold for a great variety of DP models, and can be stated compactly as
described above in terms of the corresponding mappings Tµ and T . The
gain from this abstraction is greater generality and mathematical insight,
as well as a more unified, economical, and streamlined analysis.

1.2 ABSTRACT DYNAMIC PROGRAMMING MODELS

In this section we formally introduce and illustrate with examples an ab-


stract DP model, which embodies the ideas discussed in the preceding
section.

1.2.1 Problem Formulation

Let X and U be two sets, which we loosely refer to as a set of “states”


and a set of “controls,” respectively. For each x ∈ X, let U (x) ⊂ U be a
nonempty subset of controls that are feasible at state x. We denote by M
the set of all functions µ : X → U with µ(x) ∈ U(x), for all x ∈ X.
In analogy with DP, we refer to sequences π = {µ0 , µ1 , . . .}, with
µk ∈ M for all k, as “nonstationary policies,” and we refer to a sequence
{µ, µ, . . .}, with µ ∈ M, as a “stationary policy.” In our development,
stationary policies will play a dominant role, and with slight abuse of ter-
minology, we will also refer to any µ ∈ M as a “policy” when confusion
cannot arise.
Let R(X) be the set of real-valued functions J : X → ℜ, and let
H : X × U × R(X) → ℜ be a given mapping. † For each policy µ ∈ M, we
consider the mapping Tµ : R(X) → R(X) defined by

(Tµ J)(x) = H(x, µ(x), J),   ∀ x ∈ X, J ∈ R(X),

and we also consider the mapping T defined by ‡

(T J)(x) = inf_{u∈U(x)} H(x, u, J),   ∀ x ∈ X, J ∈ R(X).

† Our notation and mathematical conventions are outlined in Appendix A.


In particular, we denote by ℜ the set of real numbers, and by ℜ^n the space of
n-dimensional vectors with real components.
‡ We assume that H, Tµ J, and T J are real-valued for J ∈ R(X) in the
present chapter and in Chapter 2. In Chapters 3-5 we will allow H(x, u, J), and
hence also (Tµ J)(x) and (T J)(x), to take the values ∞ and −∞.

Similar to the deterministic optimal control problem of the preceding


section, the mappings Tµ and T serve to define a multistage optimization
problem and a DP-like methodology for its solution. In particular, for some
function J¯ ∈ R(X), and nonstationary policy π = {µ0 , µ1 , . . .}, we define
for each integer N ≥ 1 the functions

Jπ,N (x) = (Tµ0 Tµ1 · · · TµN −1 J¯)(x), x ∈ X,

where Tµ0 Tµ1 · · · TµN −1 denotes the composition of the mappings Tµ0 , Tµ1 ,
. . . , TµN −1 , i.e.,
Tµ0 Tµ1 · · · TµN−1 J = (Tµ0 (Tµ1 (· · · (TµN−2 (TµN−1 J))) · · ·)),   J ∈ R(X).

We view Jπ,N as the “N -stage cost function” of π [cf. Eq. (1.5)]. Consider
also the function

Jπ(x) = lim sup_{N→∞} Jπ,N(x) = lim sup_{N→∞} (Tµ0 Tµ1 · · · TµN−1 J̄)(x),   x ∈ X,

which we view as the “infinite horizon cost function” of π [cf. Eq. (1.7); we
use lim sup for generality, since we are not assured that the limit exists].
We want to minimize Jπ over π, i.e., to find

J*(x) = inf_π Jπ(x),   x ∈ X,

and a policy π ∗ that attains the infimum, if one exists.


The key connection with fixed point methodology is that J * “typi-
cally” (under mild assumptions) can be shown to satisfy

J*(x) = inf_{u∈U(x)} H(x, u, J*),   ∀ x ∈ X,

i.e., it is a fixed point of T . We refer to this as Bellman’s equation [cf. Eq.


(1.3)]. Another fact is that if an optimal policy π ∗ exists, it “typically” can
be selected to be stationary, π ∗ = {µ∗ , µ∗ , . . .}, with µ∗ ∈ M satisfying an
optimality condition, such as for example

Tµ∗ J * = T J *

[cf. Eq. (1.4)]. Several other results of an analytical or algorithmic nature


also hold under appropriate conditions, which will be discussed in detail
later.
However, Bellman’s equation and other related results may not hold
without Tµ and T having some special structural properties. Prominent
among these are a monotonicity assumption that typically holds in DP
problems, and a contraction assumption that holds for some important
classes of problems.

1.2.2 Monotonicity and Contraction Assumptions

Let us now formalize the monotonicity and contraction assumptions. We


will require that both of these assumptions hold throughout the next chap-
ter, and we will gradually relax the contraction assumption in Chapters
3-5. Recall also our assumption that Tµ and T map R(X) (the space of
real-valued functions over X) into R(X). In Chapters 3-5 we will relax this
assumption as well.

Assumption 1.2.1: (Monotonicity) If J, J′ ∈ R(X) and J ≤ J′, then

H(x, u, J) ≤ H(x, u, J′),   ∀ x ∈ X, u ∈ U(x).

Note that by taking infimum over u ∈ U (x), we have

J ≤ J′   ⇒   inf_{u∈U(x)} H(x, u, J) ≤ inf_{u∈U(x)} H(x, u, J′),   ∀ x ∈ X,

or equivalently,
J ≤ J′   ⇒   T J ≤ T J′.
Another way to arrive at this relation is to note that the monotonicity
assumption is equivalent to

J ≤ J′   ⇒   Tµ J ≤ Tµ J′,   ∀ µ ∈ M,

and to use the simple but important fact

inf_{u∈U(x)} H(x, u, J) = inf_{µ∈M} (Tµ J)(x),   ∀ x ∈ X, J ∈ R(X),

i.e., infimum over u is equivalent to infimum over µ, which holds in view
of the definition M = { µ | µ(x) ∈ U(x), ∀ x ∈ X }. We will be writing this
relation as T J = inf_{µ∈M} Tµ J.
For the contraction assumption, we introduce a function v : X → ℜ
with
v(x) > 0,   ∀ x ∈ X.
Let us denote by B(X) the space of real-valued functions J on X such
that J(x)/v(x) is bounded as x ranges over X, and consider the weighted
sup-norm

‖J‖ = sup_{x∈X} |J(x)| / v(x)

on B(X). The properties of B(X) and some of the associated fixed point
theory are discussed in Appendix B. In particular, as shown there, B(X)
is a complete normed space, so any mapping from B(X) to B(X) that is a
contraction or an m-stage contraction for some integer m > 1, with respect
to ‖ · ‖, has a unique fixed point (cf. Props. B.1 and B.2).

Figure 1.2.1. Illustration of the monotonicity and the contraction assumptions in
one dimension. The mapping Tµ on the left is monotone but is not a contraction.
The mapping Tµ on the right is both monotone and a contraction. It has a unique
fixed point at Jµ.

Assumption 1.2.2: (Contraction) For all J ∈ B(X) and µ ∈ M,
the functions Tµ J and T J belong to B(X). Furthermore, for some
α ∈ (0, 1), we have

‖Tµ J − Tµ J′‖ ≤ α‖J − J′‖,   ∀ J, J′ ∈ B(X), µ ∈ M.        (1.8)

Figure 1.2.1 illustrates the monotonicity and the contraction assump-


tions. It is important to note that the contraction condition (1.8) implies
that
‖T J − T J′‖ ≤ α‖J − J′‖,   ∀ J, J′ ∈ B(X),        (1.9)

so that T is also a contraction with modulus α. To see this we use Eq.
(1.8) to write

(Tµ J)(x) ≤ (Tµ J′)(x) + α‖J − J′‖ v(x),   ∀ x ∈ X,

from which, by taking infimum of both sides over µ ∈ M, we have

( (T J)(x) − (T J′)(x) ) / v(x) ≤ α‖J − J′‖,   ∀ x ∈ X.

Reversing the roles of J and J′, we also have

( (T J′)(x) − (T J)(x) ) / v(x) ≤ α‖J − J′‖,   ∀ x ∈ X,

and combining the preceding two relations, and taking the supremum of
the left side over x ∈ X, we obtain Eq. (1.9).
Nearly all mappings related to DP satisfy the monotonicity assump-
tion, and many important ones satisfy the weighted sup-norm contraction
assumption as well. When both assumptions hold, the most powerful an-
alytical and computational results can be obtained, as we will show in
Chapter 2. These are:
(a) Bellman’s equation has a unique solution, i.e., T and Tµ have unique
fixed points, which are the optimal cost function J * and the cost
functions Jµ of the stationary policies {µ, µ, . . .}, respectively [cf. Eq.
(1.3)].
(b) A stationary policy {µ∗ , µ∗ , . . .} is optimal if and only if

Tµ∗ J * = T J * ,

[cf. Eq. (1.4)].


(c) J * and Jµ can be computed by the value iteration method,

J* = lim_{k→∞} T^k J,     Jµ = lim_{k→∞} Tµ^k J,

starting with any J ∈ B(X).


(d) J * can be computed by the policy iteration method, whereby we gen-
erate a sequence of stationary policies via

Tµk+1 Jµk = T Jµk ,

starting from some initial policy µ0 [here Jµk is obtained as the fixed
point of Tµk by several possible methods, including value iteration as
in (c) above].
These are the most favorable types of results one can hope for in the
DP context, and they are supplemented by a host of other results, involving
approximate and/or asynchronous implementations of the value and policy
iteration methods, and other related methods that combine features of
both. As the contraction property is relaxed and is replaced by various
weaker assumptions, some of the preceding results may hold in weaker
form. For example J * turns out to be a solution of Bellman’s equation in
all the models to be discussed, but it may not be the unique solution. The
interplay between the monotonicity and contraction-like properties, and
the associated results of the form (a)-(d) described above is the recurring
analytical theme in this book.
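As a supplementary illustration of results (c) and (d) above, the following minimal Python sketch (not from the book) runs policy iteration for an abstract tabular mapping H, with each policy evaluated by repeated application of Tµ, as permitted in (d). The particular H, built from randomly generated costs and transition probabilities, is a hypothetical contractive example.

# A minimal sketch (not from the book) of the policy iteration scheme in (d),
# for an abstract finite model: H is a tabular mapping H(x, u, J) that is
# monotone and a sup-norm contraction in J. Policy evaluation approximates the
# fixed point J_mu of T_mu by repeated application of T_mu, as in (c).
# The particular H used below is hypothetical, chosen only for illustration.

import numpy as np

n_states, n_controls, alpha = 4, 3, 0.9
rng = np.random.default_rng(0)
g = rng.uniform(0.0, 1.0, size=(n_states, n_controls))              # stage costs
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_controls))   # P[x, u] is a distribution over next states

def H(x, u, J):
    # An illustrative contractive mapping: expected cost plus discounted expectation of J.
    return g[x, u] + alpha * P[x, u] @ J

def T_mu(J, mu):
    return np.array([H(x, mu[x], J) for x in range(n_states)])

def policy_evaluation(mu, iters=1000):
    J = np.zeros(n_states)
    for _ in range(iters):                 # J_mu = lim T_mu^k J
        J = T_mu(J, mu)
    return J

def policy_improvement(J):
    # mu^{k+1} satisfies T_{mu^{k+1}} J_{mu^k} = T J_{mu^k}
    return np.array([np.argmin([H(x, u, J) for u in range(n_controls)]) for x in range(n_states)])

mu = np.zeros(n_states, dtype=int)         # initial policy mu^0
for _ in range(20):
    J_mu = policy_evaluation(mu)
    mu_new = policy_improvement(J_mu)
    if np.array_equal(mu_new, mu):
        break
    mu = mu_new
print("J* (approx):", policy_evaluation(mu), "optimal policy:", mu)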

1.2.3 Some Examples

In what follows in this section, we describe a few special cases, which indi-
cate the connections of appropriate forms of the mapping H with the most

popular total cost DP models. In all these models the monotonicity As-
sumption 1.2.1 (or some closely related version) holds, but the contraction
Assumption 1.2.2 may not hold, as we will indicate later. Our descriptions
are by necessity brief, and the reader is referred to the relevant textbook
literature for more detailed discussion.

Example 1.2.1 (Stochastic Optimal Control - Markovian Decision Problems)

Consider the stationary discrete-time dynamic system

xk+1 = f (xk , uk , wk ), k = 0, 1, . . . , (1.10)

where for all k, the state xk is an element of a space X, the control uk is


an element of a space U , and wk is a random “disturbance,” an element of
a space W . We consider problems with infinite state and control spaces, as
well as problems with discrete (finite or countable) state space (in which case
the underlying system is a Markov chain). However, for technical reasons
that relate to measure theoretic issues to be explained later in Chapter 5, we
assume that W is a countable set.
The control uk is constrained to take values in a given nonempty subset
U (xk ) of U , which depends on the current state xk [uk ∈ U (xk ), for all
xk ∈ X]. The random disturbances wk , k = 0, 1, . . ., are characterized by
probability distributions P (· | xk , uk ) that are identical for all k, where P (wk |
xk , uk ) is the probability of occurrence of wk , when the current state and
control are xk and uk , respectively. Thus the probability of wk may depend
explicitly on xk and uk , but not on values of prior disturbances wk−1 , . . . , w0 .
Given an initial state x0 , we want to find a policy π = {µ0 , µ1 , . . .},
where µk : X "→ U , µk (xk ) ∈ U (xk ), for all xk ∈ X, k = 0, 1, . . ., that
minimizes the cost function
Jπ(x0) = lim sup_{N→∞} E_{wk, k=0,1,...} { Σ_{k=0}^{N−1} α^k g(xk, µk(xk), wk) },        (1.11)

subject to the system equation constraint


xk+1 = f(xk, µk(xk), wk),   k = 0, 1, . . . .

This is a classical problem, which is discussed extensively in various sources,


including the author’s text [Ber12a]. It is usually referred to as the stochastic
optimal control problem or the Markovian Decision Problem (MDP for short).
Note that the expected value of the N -stage cost of π,
E_{wk, k=0,1,...} { Σ_{k=0}^{N−1} α^k g(xk, µk(xk), wk) },

is defined as a (possibly countably infinite) sum, since the disturbances wk ,


k = 0, 1, . . ., take values in a countable set. Indeed, the reader may verify

that all the subsequent mathematical expressions that involve an expected


value can be written as summations over a finite or a countable set, so they
make sense without resort to measure-theoretic integration concepts. †
In what follows we will often impose appropriate assumptions on the
cost per stage g and the scalar α, which guarantee that the infinite horizon
cost Jπ (x0 ) is defined as a limit (rather than as a lim sup):
Jπ(x0) = lim_{N→∞} E_{wk, k=0,1,...} { Σ_{k=0}^{N−1} α^k g(xk, µk(xk), wk) }.

In particular, it can be shown that the limit exists if α < 1 and g is uniformly
bounded, i.e., for some B > 0,
& &
&g(x, u, w)& ≤ B, ∀ x ∈ X, u ∈ U (x), w ∈ W. (1.12)

In this case, we obtain the classical discounted infinite horizon DP prob-


lem, which generally has the most favorable structure of all infinite horizon
stochastic DP models (see [Ber12a], Chapters 1 and 2).
To make the connection with abstract DP, let us define
H(x, u, J) = E{ g(x, u, w) + αJ(f(x, u, w)) },

so that

(Tµ J)(x) = E{ g(x, µ(x), w) + αJ(f(x, µ(x), w)) },

and

(T J)(x) = inf_{u∈U(x)} E{ g(x, u, w) + αJ(f(x, u, w)) }.

Similar to the deterministic optimal control problem of Section 1.1, the
N-stage cost of π can be expressed in terms of Tµ:

(Tµ0 · · · TµN−1 J̄)(x0) = E_{wk, k=0,1,...} { Σ_{k=0}^{N−1} α^k g(xk, µk(xk), wk) },

† As noted in Appendix A, the formula for the expected value of a random


variable w defined over a space Ω is

E{w} = E{w+ } + E{w− },

where w+ and w− are the positive and negative parts of w,

w+(ω) = max{0, w(ω)},   w−(ω) = min{0, w(ω)},   ∀ ω ∈ Ω.

In this way, taking also into account the rule ∞−∞ = ∞ (see Appendix A), E{w}
is well-defined as an extended real number if Ω is finite or countably infinite.
where J̄ is the zero function, J̄(x) = 0 for all x ∈ X. The same is true for
the infinite stages cost [cf. Eq. (1.11)]:

Jπ(x0) = lim sup_{N→∞} (Tµ0 · · · TµN−1 J̄)(x0).

It can be seen that the mappings Tµ and T are monotone, and it is


well-known that if α < 1 and the boundedness condition (1.12) holds, they
are contractive as well (under the unweighted sup-norm); see e.g., [Ber12a],
Chapter 1. In this case, the model has the powerful analytical and algorith-
mic properties (a)-(d) mentioned at the end of the preceding subsection. In
particular, the optimal cost function J ∗ [i.e., J ∗ (x) = inf π Jπ (x) for all x ∈ X]
can be shown to be the unique solution of the fixed point equation J ∗ = T J ∗ ,
also known as Bellman’s equation, which has the form

J*(x) = inf_{u∈U(x)} E{ g(x, u, w) + αJ*(f(x, u, w)) },   x ∈ X,

and parallels the one given for deterministic optimal control problems [cf. Eq.
(1.3)].
These properties can be expressed and analyzed in an abstract setting
by using just the mappings Tµ and T , both when Tµ and T are contractive
(see Chapter 2), and when they are only monotone and not contractive (see
Chapter 4). Moreover, under some conditions, it is possible to analyze these
properties in cases where Tµ is contractive for some but not all µ (see Chapter
3, and Sections 4.4-4.5).
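As a small supplementary sketch (not part of the example), when the disturbance w takes values in a finite set, H and a single application of Tµ can be evaluated by direct summation; the disturbance distribution P, and the functions f and g below, are hypothetical.

# A small sketch (not part of the example) of how
# H(x, u, J) = E{ g(x, u, w) + alpha * J(f(x, u, w)) } reduces to a finite sum
# when the disturbance w takes values in a finite set W. The distribution
# P(w | x, u) and the functions f and g below are hypothetical.

W = [0, 1]                                     # countable (here finite) disturbance set
alpha = 0.95

def P(w, x, u):                                # P(w | x, u), sums to 1 over w in W
    return 0.7 if w == 0 else 0.3

def f(x, u, w):                                # system equation
    return (x + u + w) % 3

def g(x, u, w):                                # cost per stage
    return 1.0 + 0.1 * x - 0.05 * u + w

def H(x, u, J):                                # expectation written as a summation over W
    return sum(P(w, x, u) * (g(x, u, w) + alpha * J[f(x, u, w)]) for w in W)

J = {0: 0.0, 1: 1.0, 2: 2.0}                   # any real-valued function on X = {0, 1, 2}
mu = {0: 1, 1: 0, 2: 1}                        # a stationary policy
T_mu_J = {x: H(x, mu[x], J) for x in J}        # (T_mu J)(x) = H(x, mu(x), J)
print(T_mu_J)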

Example 1.2.2 (Finite-State Discounted Markovian Decision Problems)

In the special case of the preceding example where the number of states is
finite, the system equation (1.10) may be defined in terms of the transition
probabilities

pxy(u) = Prob{ y = f(x, u, w) | x },   x, y ∈ X, u ∈ U(x),

so H takes the form

H(x, u, J) = Σ_{y∈X} pxy(u) ( g(x, u, y) + αJ(y) ).

When α < 1 and the boundedness condition


& &
&g(x, u, y)& ≤ B, ∀ x, y ∈ X, u ∈ U (x),

[cf. Eq. (1.12)] holds (or more simply when U is a finite set), the mappings Tµ
and T are contraction mappings with respect to the standard (unweighted)
sup-norm. This is a classical problem, referred to as discounted finite-state
MDP , which has a favorable theory and has found extensive applications (cf.
[Ber12a], Chapters 1 and 2). The model is additionally important, because it
is often used for computational solution of continuous state space problems
via discretization.
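The following supplementary Python sketch (not from the book) constructs T from hypothetical transition probabilities pxy(u) and costs g(x, u, y), and numerically checks the sup-norm contraction inequality [cf. Eq. (1.9)] on a randomly chosen pair of functions.

# A minimal sketch (not from the book) of the finite-state discounted MDP mapping
# H(x, u, J) = sum_y p_xy(u) (g(x, u, y) + alpha J(y)), with a numerical check of
# the unweighted sup-norm contraction property of T. All data are hypothetical.

import numpy as np

n, m, alpha = 5, 3, 0.9
rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(n), size=(n, m))      # p[x, u, y] = p_xy(u)
g = rng.uniform(0.0, 1.0, size=(n, m, n))       # g(x, u, y)

def T(J):
    # (T J)(x) = min_u sum_y p_xy(u) (g(x, u, y) + alpha J(y))
    Q = np.einsum('xuy,xuy->xu', p, g) + alpha * p @ J   # Q[x, u]
    return Q.min(axis=1)

J1, J2 = rng.normal(size=n), rng.normal(size=n)
lhs = np.max(np.abs(T(J1) - T(J2)))             # ||T J1 - T J2|| (unweighted sup-norm)
rhs = alpha * np.max(np.abs(J1 - J2))           # alpha ||J1 - J2||
print(lhs <= rhs + 1e-12)                       # contraction inequality, cf. Eq. (1.9)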

Example 1.2.3 (Discounted Semi-Markov Problems)

With x, y, and u as in Example 1.2.2, consider a mapping of the form


H(x, u, J) = G(x, u) + Σ_{y∈X} mxy(u)J(y),

where G is some function representing expected cost per stage, and mxy (u)
are nonnegative scalars with
Σ_{y∈X} mxy(u) < 1,   ∀ x ∈ X, u ∈ U(x).

The equation J ∗ = T J ∗ is Bellman’s equation for a finite-state continuous-


time semi-Markov decision problem, after it is converted into an equivalent
discrete-time problem (cf. [Ber12a], Section 1.4). Again, the mappings Tµ and
T are monotone and can be shown to be contraction mappings with respect
to the unweighted sup-norm.

Example 1.2.4 (Discounted Zero-Sum Dynamic Games)

Let us consider a zero sum game analog of the finite-state MDP Example 1.2.2.
Here there are two players that choose actions at each stage: the first (called
the minimizer ) may choose a move i out of n moves and the second (called
the maximizer ) may choose a move j out of m moves. Then the minimizer
gives a specified amount aij to the maximizer, called a payoff . The minimizer
wishes to minimize aij , and the maximizer wishes to maximize aij .
The players use mixed strategies, whereby the minimizer selects a prob-
ability distribution u = (u1 , . . . , un ) over his n possible moves and the max-
imizer selects a probability distribution v = (v1 , . . . , vm ) over his m possible
moves. Since the probability of selecting i and j is ui vj, the expected pay-
off for this stage is Σ_{i,j} aij ui vj, or u′Av, where A is the n × m matrix with
components aij.
In a single-stage version of the game, the minimizer must minimize
max_{v∈V} u′Av and the maximizer must maximize min_{u∈U} u′Av, where U and
V are the sets of probability distributions over {1, . . . , n} and {1, . . . , m},
respectively. A fundamental result (which will not be proved here) is that
these two values are equal:

min_{u∈U} max_{v∈V} u′Av = max_{v∈V} min_{u∈U} u′Av.        (1.13)

Let us consider the situation where a separate game of the type just
described is played at each stage. The game played at a given stage is repre-
sented by a “state” x that takes values in a finite set X. The state evolves
according to transition probabilities qxy (i, j) where i and j are the moves
selected by the minimizer and the maximizer, respectively (here y represents
the next game to be played after moves i and j are chosen at the game rep-
resented by x). When the state is x, under u ∈ U and v ∈ V , the one-stage

expected payoff is u′A(x)v, where A(x) is the n × m payoff matrix, and the
state transition probabilities are

pxy(u, v) = Σ_{i=1}^n Σ_{j=1}^m ui vj qxy(i, j) = u′Qxy v,

where Qxy is the n × m matrix that has components qxy (i, j). Payoffs are
discounted by α ∈ (0, 1), and the objectives of the minimizer and maximizer,
roughly speaking, are to minimize and to maximize the total discounted ex-
pected payoff. This requires selections of u and v to strike a balance between
obtaining favorable current stage payoffs and playing favorable games in fu-
ture stages.
We now introduce an abstract DP framework related to the sequential
move selection process just described. We consider the mapping G given by
G(x, u, v, J) = u′A(x)v + α Σ_{y∈X} pxy(u, v)J(y)
              = u′( A(x) + α Σ_{y∈X} Qxy J(y) )v,        (1.14)

where α ∈ (0, 1) is the discount factor, and the mapping H given by

H(x, u, J) = max_{v∈V} G(x, u, v, J).

The corresponding mappings Tµ and T are

(Tµ J)(x) = max_{v∈V} G(x, µ(x), v, J),   x ∈ X,

and
(T J)(x) = min_{u∈U} max_{v∈V} G(x, u, v, J).

It can be shown that Tµ and T are monotone and (unweighted) sup-norm


contractions. Moreover, the unique fixed point J ∗ of T satisfies

J*(x) = min_{u∈U} max_{v∈V} G(x, u, v, J*),   ∀ x ∈ X,

(see [Ber12a], Section 1.6.2).


We now note that since
A(x) + α Σ_{y∈X} Qxy J(y)

[cf. Eq. (1.14)] is a matrix that is independent of u and v, we may view J ∗ (x)
as the value of a static game (which depends on the state x). In particular,
from the fundamental minimax equality (1.13), we have

min_{u∈U} max_{v∈V} G(x, u, v, J*) = max_{v∈V} min_{u∈U} G(x, u, v, J*),   ∀ x ∈ X.

This implies that J* is also the unique fixed point of the mapping

(T̄ J)(x) = max_{v∈V} H̄(x, v, J),

where
H̄(x, v, J) = min_{u∈U} G(x, u, v, J),

i.e., J ∗ is the fixed point regardless of the order in which minimizer and
maximizer select mixed strategies at each stage.
In the preceding development, we have introduced J ∗ as the unique
fixed point of the mappings T and T̄. However, J* also has an interpretation
in game theoretic terms. In particular, it can be shown that J ∗ (x) is the value
of a dynamic game, whereby at state x the two opponents choose multistage
(possibly nonstationary) policies that consist of functions of the current state,
and continue to select moves using these policies over an infinite horizon. For
further discussion of this interpretation, we refer to [Ber12a] and to books on
dynamic games such as [FiV96]; see also [PaB99] and [Yu11] for an analysis
of the undiscounted case (α = 1) where there is a termination state, as in the
stochastic shortest path problems of the subsequent Example 1.2.6.
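As a supplementary illustration (not from the book), the following sketch performs value iteration for this game model: (T J)(x) is computed as the value of the static matrix game with payoff A(x) + α Σ_{y} Qxy J(y) [cf. Eq. (1.14)], obtained from a standard linear programming formulation via scipy.optimize.linprog. All payoff matrices and transition probabilities are randomly generated and purely hypothetical.

# A sketch (not from the book) of value iteration for the discounted zero-sum game:
# (T J)(x) is the minimax value of the static matrix game with payoff matrix
# A(x) + alpha * sum_y Q_xy J(y), cf. Eq. (1.14), computed by a standard LP.
# All data below are hypothetical.

import numpy as np
from scipy.optimize import linprog

n_moves, m_moves, n_states, alpha = 2, 2, 3, 0.9
rng = np.random.default_rng(2)
A = rng.uniform(-1.0, 1.0, size=(n_states, n_moves, m_moves))            # payoff matrices A(x)
q = rng.dirichlet(np.ones(n_states), size=(n_states, n_moves, m_moves))  # q[x, i, j, y] = q_xy(i, j)

def game_value(M):
    # min over u in U of max over v in V of u' M v, as an LP in (u, t):
    # minimize t subject to M' u <= t * 1, sum(u) = 1, u >= 0.
    n, m = M.shape
    c = np.r_[np.zeros(n), 1.0]
    A_ub = np.hstack([M.T, -np.ones((m, 1))])
    A_eq = np.r_[np.ones(n), 0.0].reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(m), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * n + [(None, None)])
    return res.fun

def T(J):
    # One-stage game at x with payoff A(x) + alpha * sum_y Q_xy J(y), cf. Eq. (1.14).
    return np.array([game_value(A[x] + alpha * q[x] @ J) for x in range(n_states)])

J = np.zeros(n_states)
for _ in range(200):
    J = T(J)
print("fixed point J*:", J)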

Example 1.2.5 (Minimax Problems)

Consider a minimax version of Example 1.2.1, where w is not random but is


rather chosen by an antagonistic player from a set W (x, u). Let
H(x, u, J) = sup_{w∈W(x,u)} { g(x, u, w) + αJ(f(x, u, w)) }.

Then the equation J ∗ = T J ∗ is Bellman’s equation for an infinite horizon


minimax DP problem. A special case of this mapping arises in zero-sum
dynamic games (cf. Example 1.2.4).
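A small supplementary sketch (not part of the example): when each set W(x, u) is finite, H and T can be evaluated by direct maximization over W(x, u); the sets and functions below are hypothetical.

# A small sketch (not part of the example) of the minimax mapping
# H(x, u, J) = sup over w in W(x, u) of { g(x, u, w) + alpha * J(f(x, u, w)) },
# for finite disturbance sets W(x, u). All data below are hypothetical.

alpha = 0.9

def W(x, u):                       # antagonistic disturbance set W(x, u)
    return [0, 1]

def f(x, u, w):
    return (x + u + w) % 3

def g(x, u, w):
    return 1.0 + w - 0.5 * u

def H(x, u, J):
    return max(g(x, u, w) + alpha * J[f(x, u, w)] for w in W(x, u))

def T(J, U):                       # (T J)(x) = min over u in U(x) of H(x, u, J)
    return {x: min(H(x, u, J) for u in U[x]) for x in J}

J = {0: 0.0, 1: 0.0, 2: 0.0}
U = {0: [0, 1], 1: [0, 1], 2: [0]}
for _ in range(100):               # value iteration for the minimax model
    J = T(J, U)
print(J)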

Example 1.2.6 (Stochastic Shortest Path Problems)

The stochastic shortest path (SSP for short) problem is the special case of
the stochastic optimal control Example 1.2.1 where:
(a) There is no discounting (α = 1).
(b) The state space is X = {0, 1, . . . , n} and we are given transition prob-
abilities, denoted by

pxy (u) = P (xk+1 = y | xk = x, uk = u), x, y ∈ X, u ∈ U (x).

(c) The control constraint set U (x) is finite for all x ∈ X.


(d) A cost g(x, u) is incurred when control u ∈ U (x) is selected at state x.

(e) State 0 is a special termination state, which is absorbing and cost-free,


i.e.,
p00 (u) = 1,
and for all u ∈ U (0), g(0, u) = 0.
To simplify the notation, we have assumed that the cost per stage does not
depend on the successor state, which amounts to using expected cost per
stage in all calculations.
Since the termination state 0 is cost-free and absorbing, the cost starting
from 0 is zero for every policy. Accordingly, for all cost functions, we ignore
the component that corresponds to 0, and define
H(x, u, J) = g(x, u) + Σ_{y=1}^n pxy(u)J(y),   x = 1, . . . , n, u ∈ U(x), J ∈ ℜ^n.

The mappings Tµ and T are defined by


(Tµ J)(x) = g(x, µ(x)) + Σ_{y=1}^n pxy(µ(x)) J(y),   x = 1, . . . , n,

(T J)(x) = min_{u∈U(x)} { g(x, u) + Σ_{y=1}^n pxy(u)J(y) },   x = 1, . . . , n.

Note that the matrix that has components pxy (u), x, y = 1, . . . , n, is sub-
stochastic (some of its row sums may be less than 1) because there may be
positive transition probability from a state x to the termination state 0. Con-
sequently Tµ may be a contraction for some µ, but not necessarily for all
µ ∈ M.
The SSP problem has been discussed in many sources, including the
books [Pal67], [Der70], [Whi82], [Ber87], [BeT89], [HeL99], and [Ber12a],
where it is sometimes referred to by earlier names such as “first passage
problem” and “transient programming problem.” In the framework that is
most relevant to our purposes, there is a classification of stationary policies
for SSP into proper and improper . We say that µ ∈ M is proper if, when
using µ, there is positive probability that termination will be reached after at
most n stages, regardless of the initial state; i.e., if

ρµ = max_{x=1,...,n} P{ xn ≠ 0 | x0 = x, µ } < 1.

Otherwise, we say that µ is improper. It can be seen that µ is proper if and


only if in the Markov chain corresponding to µ, each state x is connected to
the termination state with a path of positive probability transitions.
For a proper policy µ, it can be shown that Tµ is a weighted sup-norm
contraction, as well as an n-stage contraction with respect to the unweighted
sup-norm. For an improper policy µ, Tµ is not a contraction with respect to
any norm. Moreover, T also need not be a contraction with respect to any
norm (think of the case where there is only one policy, which is improper).

However, T is a weighted sup-norm contraction in the important special case


where all policies are proper (see [BeT96], Prop. 2.2, or [Ber12a], Prop. 3.3.1).
Nonetheless, even in the case where there are improper policies and T
is not a contraction, results comparable to the case of discounted finite-state
MDP are available for SSP problems assuming that:
(a) There exists at least one proper policy.
(b) For every improper policy there is an initial state that has infinite cost
under this policy.
Under the preceding two assumptions, it was shown in [BeT91] that T has a
unique fixed point J ∗ , the optimal cost function of the SSP problem. More-
over, a policy {µ∗ , µ∗ , . . .} is optimal if and only if
Tµ∗ J ∗ = T J ∗ .
In addition, J ∗ and Jµ can be computed by value iteration,

J* = lim_{k→∞} T^k J,     Jµ = lim_{k→∞} Tµ^k J,

starting with any J ∈ ℜ^n (see [Ber12a], Chapter 3, for a textbook account).
These properties are in analogy with the desirable properties (a)-(c), given at
the end of the preceding subsection in connection with contractive models.
Regarding policy iteration, it works in its strongest form when there are
no improper policies, in which case the mappings Tµ and T are weighted sup-
norm contractions. When there are improper policies, modifications to the
policy iteration method are needed; see [YuB11a], [Ber12a], and also Sections
3.3.2, 3.3.3, where these modifications will be discussed in an abstract setting.
Let us also note that there is an alternative line of analysis of SSP
problems, whereby favorable results are obtained assuming that there exists
an optimal proper policy, and the one-stage cost is nonnegative, g(x, u) ≥ 0
for all (x, u) (see [Pal67], [Der70], [Whi82], and [Ber87]). This analysis will
also be generalized in Chapter 3 and in Section 4.4, and the nonnegativity
assumption on g will be relaxed.
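The following supplementary sketch (not from the book) illustrates the properness test and value iteration on a tiny hypothetical SSP instance: ρµ is computed as the largest row sum of the n-th power of the substochastic matrix with entries pxy(µ(x)), x, y = 1, . . . , n.

# A sketch (not from the book) for the SSP example: with P_mu the substochastic
# matrix of entries p_xy(mu(x)), x, y = 1, ..., n (termination state 0 excluded),
# rho_mu = max_x P{x_n != 0 | x_0 = x, mu} is the largest row sum of P_mu^n, so
# mu is proper iff rho_mu < 1. The transition and cost data below are hypothetical.

import numpy as np

n = 3
# p[x, u, y], y = 0 is the termination state; each row sums to 1 over y = 0, ..., n.
p = np.array([[[0.3, 0.7, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
              [[0.5, 0.0, 0.5, 0.0], [0.0, 0.0, 0.0, 1.0]],
              [[0.2, 0.0, 0.0, 0.8], [0.0, 1.0, 0.0, 0.0]]])
g = np.array([[1.0, 2.0], [1.5, 0.5], [2.0, 3.0]])          # g(x, u)

def is_proper(mu):
    P_mu = np.array([p[x, mu[x], 1:] for x in range(n)])    # substochastic n x n matrix
    rho = np.max(np.linalg.matrix_power(P_mu, n).sum(axis=1))
    return rho < 1.0

def T(J):
    # (T J)(x) = min_u [ g(x, u) + sum_{y=1}^n p_xy(u) J(y) ]  (no discounting)
    Q = g + p[:, :, 1:] @ J
    return Q.min(axis=1)

print([is_proper(mu) for mu in ([0, 0, 0], [1, 1, 1])])     # first is proper, second is not
J = np.zeros(n)
for _ in range(500):                                        # value iteration
    J = T(J)
print("J*:", J)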

Example 1.2.7 (Deterministic Shortest Path Problems)

The special case of the SSP problem where the state transitions are determin-
istic is the classical shortest path problem. Here, we have a graph of n nodes
x = 1, . . . , n, plus a destination 0, and an arc length axy for each directed arc
(x, y). At state/node x, a policy µ chooses an outgoing arc from x. Thus the
controls available at x can be identified with the outgoing neighbors of x [the
nodes u such that (x, u) is an arc]. The corresponding mapping H is

H(x, u, J) = { axu + J(u)   if u ≠ 0,
             { ax0          if u = 0,
                                          x = 1, . . . , n.

A stationary policy µ defines a graph whose arcs are (x, µ(x)), x =
1, . . . , n. The policy µ is proper if and only if this graph is acyclic (it consists of
a tree of directed paths leading from each node to the destination). Thus there

exists a proper policy if and only if each node is connected to the destination
with a directed path. Furthermore, an improper policy has finite cost starting
from every initial state if and only if all the cycles of the corresponding graph
have nonnegative cycle cost. It follows that the favorable analytical and
algorithmic results described for SSP in the preceding example hold if the
given graph is connected and the costs of all its cycles are positive. We will
see later that significant complications result if the cycle costs are allowed to
be nonpositive, even though the shortest path problem is still well posed in
the sense that shortest paths exist if the given graph is connected (see Section
3.1.2).
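A small supplementary sketch (not part of the example) of the shortest path mapping H and value iteration on a hypothetical graph with positive arc lengths; under the conditions just described, the iterates converge to the shortest path lengths.

# A small sketch (not part of the example) of the shortest path mapping
#   H(x, u, J) = a_xu + J(u) if u != 0,  and  a_x0 if u = 0,
# with value iteration on a hypothetical 3-node graph plus destination 0.

n = 3
a = {(1, 0): 4.0, (1, 2): 1.0, (2, 3): 1.0, (2, 0): 5.0, (3, 0): 1.0}   # arc lengths a_xy

def H(x, u, J):
    return a[(x, u)] + (J[u] if u != 0 else 0.0)

def T(J):
    # min over the outgoing neighbors u of x, i.e., the controls available at x
    return {x: min(H(x, u, J) for (xx, u) in a if xx == x) for x in range(1, n + 1)}

J = {x: 0.0 for x in range(1, n + 1)}
for _ in range(50):          # value iteration; converges to the shortest path lengths here
    J = T(J)
print(J)                     # shortest path lengths from each node to the destination 0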

Example 1.2.8 (Multiplicative and Risk-Sensitive Models)

With x, y, u, and transition probabilities pxy (u), as in the finite-state MDP


of Example 1.2.2, consider the mapping
H(x, u, J) = Σ_{y∈X} pxy(u)g(x, u, y)J(y) = E{ g(x, u, y)J(y) | x, u },        (1.15)

where g is a scalar function with g(x, u, y) ≥ 0 for all x, y, u (this is necessary


for H to be monotone). This mapping corresponds to the multiplicative model
of minimizing over all π = {µ0 , µ1 , . . .} the cost
Jπ(x0) = lim sup_{N→∞} E{ g(x0, µ0(x0), x1) g(x1, µ1(x1), x2) · · ·
                          g(xN−1, µN−1(xN−1), xN) | x0 },        (1.16)

where the state sequence {x0, x1, . . .} is generated using the transition prob-
abilities pxkxk+1(µk(xk)).
To see that the mapping H of Eq. (1.15) corresponds to the cost function
(1.16), let us consider the unit function

J̄(x) ≡ 1,   x ∈ X,

and verify that for all x0 ∈ X, we have


(Tµ0 Tµ1 · · · TµN−1 J̄)(x0) = E{ g(x0, µ0(x0), x1) g(x1, µ1(x1), x2) · · ·
                                g(xN−1, µN−1(xN−1), xN) | x0 },        (1.17)

so that

Jπ(x) = lim sup_{N→∞} (Tµ0 Tµ1 · · · TµN−1 J̄)(x),   x ∈ X.

Indeed, taking into account that J¯(x) ≡ 1, we have

(TµN−1 J̄)(xN−1) = E{ g(xN−1, µN−1(xN−1), xN) J̄(xN) | xN−1 }
                = E{ g(xN−1, µN−1(xN−1), xN) | xN−1 },

(TµN−2 TµN−1 J̄)(xN−2) = (TµN−2 (TµN−1 J̄))(xN−2)
                      = E{ g(xN−2, µN−2(xN−2), xN−1)
                           · E{ g(xN−1, µN−1(xN−1), xN) | xN−1 } | xN−2 },

and continuing similarly,

(Tµ0 Tµ1 · · · TµN−1 J̄)(x0) = E{ g(x0, µ0(x0), x1) E{ g(x1, µ1(x1), x2) · · ·
                             E{ g(xN−1, µN−1(xN−1), xN) | xN−1 } | xN−2 } · · · | x0 },

which by using the iterated expectations formula (see e.g., [BeT08]) proves
the expression (1.17).
An important special case of a multiplicative model is when g has the
form
g(x, u, y) = e^{h(x,u,y)}
for some one-stage cost function h. We then obtain a finite-state MDP with
an exponential cost function,
Jπ(x0) = lim sup_{N→∞} E{ e^{h(x0,µ0(x0),x1)+···+h(xN−1,µN−1(xN−1),xN)} },

which is often used to introduce risk aversion in the choice of policy through
the convexity of the exponential.
There is also a multiplicative version of the infinite state space stochas-
tic optimal control problem of Example 1.2.1. The mapping H takes the
form

H(x, u, J) = E{ g(x, u, w) J(f(x, u, w)) },
where xk+1 = f (xk , uk , wk ) is the underlying discrete-time dynamic system;
cf. Eq. (1.10).
Multiplicative models and related risk-sensitive models are discussed
extensively in the literature, mostly for the exponential cost case and under
different assumptions than ours; see e.g., [HoM72], [Jac73], [Rot84], [ChS87],
[Whi90], [JBE94], [FlM95], [HeM96], [FeM97], [BoM99], [CoM99],
[BoM02], [BBB08]. The works of references [DeR79], [Pat01], and [Pat07]
relate to the stochastic shortest path problems of Example 1.2.6, and are the
closest to the semicontractive models discussed in Chapters 3 and 4.
Issues of risk-sensitivity have also been dealt within frameworks that
do not quite conform to the multiplicative model of the preceding example,
and are based on the theory of multi-stage risk measures; see e.g., [Rus10],
[CaR12], and the references quoted there. Still these formulations involve
abstract monotone DP mappings and are covered by our theory.
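As a supplementary illustration (not from the book), the N-stage multiplicative cost (Tµ · · · Tµ J̄)(x) of Eq. (1.17) can be computed for a stationary policy µ by repeatedly applying Tµ to the unit function J̄ ≡ 1; the transition probabilities and one-stage costs h below are hypothetical.

# A sketch (not from the book) of the multiplicative mapping of Eq. (1.15) for a
# fixed stationary policy mu in a finite-state model with exponential cost
# g(x, u, y) = exp(h(x, u, y)): applying T_mu repeatedly to the unit function
# J_bar = 1 yields the N-stage multiplicative cost, cf. Eqs. (1.16)-(1.17).
# All data below are hypothetical.

import numpy as np

n = 3
rng = np.random.default_rng(3)
p_mu = rng.dirichlet(np.ones(n), size=n)           # p_xy(mu(x)) for the fixed policy mu
h = rng.uniform(-0.5, 0.5, size=(n, n))            # one-stage costs h(x, mu(x), y)
g = np.exp(h)                                      # multiplicative factors g = e^h

def T_mu(J):
    # (T_mu J)(x) = sum_y p_xy(mu(x)) g(x, mu(x), y) J(y), cf. Eq. (1.15)
    return (p_mu * g) @ J

N = 10
J = np.ones(n)                                     # the unit function J_bar(x) = 1
for k in range(N):
    J = T_mu(J)
# J now equals (T_mu^N J_bar)(x) = E{ g(x0, ..., x1) ... g(x_{N-1}, ..., x_N) | x0 = x },
# cf. Eq. (1.17); the infinite horizon cost is the lim sup of these iterates.
print("N-stage multiplicative cost:", J)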

Example 1.2.9 (Distributed Aggregation)


The abstract DP framework is useful not only in modeling DP problems, but
also in modeling algorithms arising in DP and even other contexts. We il-
lustrate this with an example from [BeY10b] that relates to the distributed

solution of large-scale discounted finite-state MDP using cost function ap-


proximation based on aggregation. † It involves a partition of the n states
into m subsets for the purposes of distributed computation, and yields a cor-
responding approximation (V1 , . . . , Vm ) to the cost vector J ∗ .
In particular, we have a discounted n-state MDP (cf. Example 1.2.2),
and we introduce aggregate states S1 , . . . , Sm , which are disjoint subsets of
the original state space {1, . . . , n}. We assume that these sets form a parti-
tion, i.e., each x ∈ {1, . . . , n} belongs to one and only one of the aggregate
state/subsets. We envision a network of processors ℓ = 1, . . . , m, each as-
signed to the computation of a local cost function Vℓ, defined on the corre-
sponding aggregate state/subset Sℓ:

Vℓ = {Vℓy | y ∈ Sℓ}.

Processor ℓ also maintains a scalar aggregate cost Rℓ for its aggregate state,
which is a weighted average of the detailed cost values Vℓx within Sℓ:

Rℓ = Σ_{x∈Sℓ} dℓx Vℓx,

where dℓx are given probabilities with dℓx ≥ 0 and Σ_{x∈Sℓ} dℓx = 1. The aggre-
gate costs Rℓ are communicated between processors and are used to perform
the computation of the local cost functions Vℓ (we will discuss computation
models of this type in Section 2.6).
We denote J = (V1 , . . . , Vm , R1 , . . . , Rm ), so that J is a vector of di-
mension n + m. We introduce the mapping H(x, u, J) defined for each of the
n states x by

H(x, u, J) = W! (x, u, V! , R1 , . . . , Rm ), if x ∈ S! .

where for x ∈ S!
n
! !
W! (x, u, V! , R1 , . . . , Rm ) = pxy (u)g(x, u, y) + α pxy (u)V!y
y=1 y∈S!
!
+α pxy (u)Rs(y) ;
y ∈S
/ !

and for each original system state y, we denote by s(y) the index of the subset
to which y belongs [i.e., y ∈ Ss(y) ].
We may view H as an abstract mapping on the space of J, and aim to
find its fixed point J ∗ = (V1∗ , . . . , Vm

, R1∗ , . . . , Rm

). Then, for ! = 1, . . . , m, we

† See [Ber12a], Section 6.5.2, for a more detailed discussion. Other examples
of algorithmic mappings that come under our framework arise in asynchronous
policy iteration (see Sections 2.6.3, 3.3.2, and [BeY10a], [BeY10b], [YuB11a]),
and in constrained forms of policy iteration (see [Ber11c], or [Ber12a], Exercise
2.7).
Sec. 1.2 Abstract Dynamic Programming Models 21

may view V!∗ as an approximation to the optimal cost vector of the original
MDP starting at states x ∈ S! , and we may view R!∗ as a form of aggregate
cost for S! . The advantage of this formulation is that it involves significant
decomposition and parallelization of the computations among the processors,
when performing various DP algorithms. In particular, the computation of
W! (x, u, V! , R1 , . . . , Rm ) depends on just the local vector V! , whose dimension
may be potentially much smaller than n.

1.2.4 Approximation-Related Mappings

Given an abstract DP model described by a mapping H, we may be inter-


ested in fixed points of related mappings other than T and Tµ . Such map-
pings may arise in various contexts; we have seen one that is related to dis-
tributed asynchronous aggregation in Example 1.2.9. An important context
is subspace approximation, whereby Tµ and T are restricted onto a subspace
of functions for the purpose of approximating their fixed points. Much of
the theory of approximate DP and reinforcement learning relies on such ap-
proximations (see e.g., the books by Bertsekas and Tsitsiklis [BeT96], Sut-
ton and Barto [SuB98], Gosavi [Gos03], Cao [Cao07], Chang, Fu, Hu, and
Marcus [CFH07], Meyn [Mey07], Powell [Pow07], Borkar [Bor08], Haykin
[Hay08], Busoniu, Babuska, De Schutter, and Ernst [BBD10], Szepesvari
[Sze10], Bertsekas [Ber12a], and Vrabie, Vamvoudakis, and Lewis [VVL13]).
For an illustration, consider the approximate evaluation of the cost
vector of a discrete-time Markov chain with states i = 1, . . . , n. We assume
that state transitions (i, j) occur at time k according to given transition
probabilities pij , and generate a cost αk g(i, j), where α ∈ (0, 1) is a discount
factor. The cost function over an infinite number of stages can be shown to
be the unique fixed point of the Bellman equation mapping T : "n #→ "n
whose components are given by
n
! " #
(T J )(i) = pij (u) g(i, j) + αJ (j) , i = 1, . . . , n, J ∈ "n .
j=1
This is the same as the mapping T in the discounted finite-state MDP
Example 1.2.2, except that we restrict attention to a single policy. Find-
ing the cost function of a fixed policy is the important policy evaluation
subproblem which arises prominently within the context of policy iteration.
The approximation of the fixed point of T is often based on the so-
lution of lower-dimensional equations defined on the subspace {ΦR | R ∈
"s } that is spanned by the columns of a given n × s matrix Φ. Two
such approximating equations have been studied extensively (see [Ber12a],
Chapter 6, for a detailed account and references; also [BeY07], [BeY09],
[YuB10], [Ber11a] for extensions to abstract contexts beyond approximate
DP). These are:
(a) The projected equation
ΦR = Πξ T (ΦR), (1.18)
22 Introduction Chap. 1

where Πξ denotes projection onto S with respect to a weighted Eu-


clidean norm n
!
!J !ξ = ξi J (i) (1.19)
i=1

with ξ = (ξ1 , . . . , ξn ) being a probability distribution with positive


components.
(b) The aggregation equation

ΦR = ΦDT (ΦR), (1.20)

with D being an s×n matrix whose rows are restricted to be probabil-


ity distributions; these are known as the disaggregation probabilities.
Also, in this approach, the rows of Φ are restricted to be probability
distributions; they are known as the aggregation probabilities.
We now see that solution of the projected equation (1.18) and the
aggregation equation (1.20) amounts to finding a fixed point of the map-
pings Πξ T and ΦDT , respectively. These mappings derive their structure
from the DP operator T , so they have some DP-like properties, which can
be exploited for analysis and computation.
An important fact is that the aggregation mapping ΦDT preserves
the monotonicity and the sup-norm contraction property of T , while the
projected equation mapping Πξ T does not. The reason for preservation of
monotonicity is the nonnegativity of the components of the matrices Φ and
D. The reason for preservation of sup-norm contraction is that the matrices
Φ and D are sup-norm nonexpansive, because their rows are probability
distributions. In fact, it can be shown that the solution R of Eq. (1.20)
can be viewed as the exact DP solution of an “aggregate” DP problem that
represents a lower-dimensional approximation of the original (see [Ber12a],
Section 6.5).
By contrast, the projected equation mapping Πξ T need not be mono-
tone, because the components of Πξ need not be nonnegative. Moreover
while the projection Πξ is nonexpansive with respect to the projection norm
!·!ξ , it need not be nonexpansive with respect to the sup-norm. As a result
the projected equation mapping Πξ T need not be a sup-norm contraction.
These facts play a significant role in approximate DP methodology.
Let us also mention that multistep versions of the mapping T have
been used widely for approximations, particularly in connection with the
projected equation approach. For example, the popular temporal difference
methods, such as TD(λ), LSTD(λ), and LSPE(λ) (see the book references
on reinforcement learning and approximate DP cited earlier), are based on
the mapping T (λ) : #n $→ #n whose components are given by

!
" #
T (λ) J (i) = (1 − λ) λ#−1 (T # J )(i), i = 1, . . . , n, J ∈ #n ,
k=1
Sec. 1.3 Organization of the Book 23

for λ ∈ (0, 1], where T ! is the "-fold composition of T with itself " times.
Here the mapping T (λ) is used in place of T in the projected equation
(1.18). In the context of the aggregation equation approach, a multistep
method based on the mapping T (λ) is the λ-aggregation method, given for
the case of hard aggregation in [Ber12a], Section 6.5, as well as other forms
of aggregation (see [Ber12a], [YuB12]).
A more general form of multistep approach, introduced and studied
in [YuB12], uses instead the mapping T (w) : "n #→ "n , with components

#
! "
T (w) J (i) = wi! (T ! J )(i), i = 1, . . . , n, J ∈ "n ,
!=1

where for each i, (wi1 , wi2 , . . .) is a probability distribution over the positive
integers. Then the multistep analog of the projected Eq. (1.18) is

ΦR = Πξ T (w) (ΦR), (1.21)

while the multistep analog of the aggregation Eq. (1.20) is

ΦR = ΦDT (w) (ΦR). (1.22)

The mapping T (λ) is obtained for wi! = (1 − λ)λ!−1 , independently of the


state i. The solution of Eqs. (1.21) and (1.22) by simulation-based methods
is discussed in [YuB12] and [Yu12].
In fact, a connection between projected equations of the form (1.21)
and aggregation equations of the form (1.22) was established in [YuB12]
through the use of a seminorm [this is given by the same expression as the
norm & · &ξ of Eq. (1.19), with some of the components of ξ allowed to be 0].
In particular, the most prominent classes of aggregation equations can be
viewed as seminorm projected equations because it turns out that ΦD is a
seminorm projection (see [Ber12a], p. 639, [YuB12], Section 4). Moreover
they can be viewed as projected equations where the projection is oblique
(see [Ber12a], Section 7.3.6).
The preceding observations are important for our purposes, as they in-
dicate that much of the theory developed in this book applies to approxima-
tion-related mappings based on aggregation. However, this is not true to
nearly the same extent for approximation-related mappings based on pro-
jection.

1.3 ORGANIZATION OF THE BOOK

The examples of the preceding section have illustrated how the monotonic-
ity assumption is satisfied for many DP models, while the contraction as-
sumption may or may not be satisfied. In particular, the contraction as-
sumption is satisfied for the mapping H in Examples 1.2.1-1.2.5, assuming
24 Introduction Chap. 1

that there is discounting and that the cost per stage is bounded, but it
need not hold in the SSP Example 1.2.6 and the multiplicative Example
1.2.8.
The main theme of this book is that the presence or absence of mono-
tonicity and contraction is the primary determinant of the analytical and
algorithmic theory of a typical total cost DP model. In our development,
with some minor exceptions, we will assume that monotonicity holds. Con-
sequently, the rest of the book is organized around the presence or absence
of the contraction property. In the next four chapters we will discuss the
following four types of models.
(a) Contractive models: These models, discussed in Chapter 2, have
the richest and strongest algorithmic theory, and are the benchmark
against which the theory of other models is compared. Prominent
among them are discounted stochastic optimal control problems (cf.
Example 1.2.1), finite-state discounted MDP (cf. Example 1.2.2), and
some special types of SSP problems (cf. Example 1.2.6).
(b) Semicontractive models: In these models Tµ is monotone but it
need not be a contraction for all µ ∈ M. Instead policies are sepa-
rated into those that “behave well” with respect to our optimization
framework and those that do not. It turns out that the notion of
contraction is not sufficiently general for our purposes. We will thus
introduce a related notion of “regularity,” which is based on the idea
that a policy µ should be considered “well-behaved” if the dynamic
system defined by Tµ has Jµ as an asymptotically stable equilibrium
within some domain. Our models and analysis are patterned to a
large extent after the SSP problems of Example 1.2.6 (the regular µ
correspond to the proper policies). One of the complications here is
that policies that are not regular, may have cost functions that take
the value +∞ or −∞. Still under certain conditions, which directly
or indirectly guarantee that there exists an optimal regular policy,
the complications can be dealt with, and we can prove strong prop-
erties for these models, sometimes almost as strong as those of the
contractive models.
(c) Noncontractive models: These models rely on just the monotonic-
ity property of Tµ , and are more complex than the preceding ones.
As in semicontractive models, the various cost functions of the prob-
lem may take the values +∞ or −∞, and the mappings Tµ and T
must accordingly be allowed to deal with such functions. However,
the optimal cost function may take the values ∞ and −∞ as a matter
of course (rather than on an exceptional basis, as in semicontractive
models). The complications are considerable, and much of the the-
ory of the contractive models generalizes in weaker form, if at all.
For example, in general the fixed point equation J = T J need not
Sec. 1.4 Notes, Sources, and Exercises 25

have a unique solution, the value iteration method may work start-
ing with some functions but not with others, and the policy iteration
method may not work at all. Of course some of these weaknesses may
not appear in the presence of additional structure, and we will discuss
noncontractive models that also have some semicontractive structure,
and corresponding favorable properties.
(d) Restricted Policies Models: These models are variants of some of
the preceding ones, where there are restrictions of the set of policies,
so that M may be a strict subset of the set of functions µ : X !→ U
with µ(x) ∈ U (x) for all x ∈ X. Such restrictions may include mea-
surability (needed to establish a mathematically rigorous probabilistic
framework) or special structure that enhances the characterization of
optimal policies and facilitates their computation.
Examples of DP problems from each of the above model categories,
mostly special cases of the specific DP models discussed in Section 1.2, are
scattered throughout the book, both to illustrate the theory and its excep-
tions, and to illustrate the beneficial role of additional special structure.
The discussion of algorithms centers on abstract forms of value and policy
iteration, and is organized along three characteristics: exact, approximate,
and asynchronous.
The exact algorithms represent idealized versions, the approximate
represent implementations that use approximations of various kinds, and
the asynchronous involve irregular computation orders, where the costs and
controls at different states are updated at different iterations (for example
the cost of a single state being iterated at a time, as in Gauss-Seidel and
other methods; see [Ber12a]). Approximate and asynchronous implemen-
tations have been the subject of intensive investigations in the last twenty
five years, in the context of the solution of large-scale problems. Some of
this methodology relies on the use of simulation, which is asynchronous by
nature and is prominent in approximate DP and reinforcement learning.

1.4 NOTES, SOURCES, AND EXERCISES

The connection between DP and fixed point theory may be traced to Shap-
ley [Sha53], who exploited contraction mappings in analysis of the two-
player dynamic game model of Example 1.2.4. Since that time the under-
lying contraction properties of discounted DP problems with bounded cost
per stage have been explicitly or implicitly used by most authors that have
dealt with the subject. Moreover, the value of the abstract viewpoint as
the basis for economical and insightful analysis has been widely recognized.
An abstract DP model, based on unweighted sup-norm contraction
assumptions, was introduced in the paper by Denardo [Den67]. This model
pointed to the fundamental connections between DP and fixed point the-
26 Introduction Chap. 1

ory, and provided generality and insight into the principal analytical and
algorithmic ideas underlying the discounted DP research up to that time.
Abstract DP ideas were also researched earlier, notably in the paper by
Mitten (Denardo’s Ph.D. thesis advisor) [Mit64]; see also Denardo and
Mitten [DeM67]. The properties of monotone contractions were also used
in the analysis of sequential games by Zachrisson [Zac64].
Denardo’s model motivated a related abstract DP model by the au-
thor [Ber77], which relies only on monotonicity properties, and was pat-
terned after the positive DP problem of Blackwell [Bla65] and the negative
DP problem of Strauch [Str66]. These two abstract DP models were used
extensively in the book by Bertsekas and Shreve [BeS78] for the analysis
of both discounted and undiscounted DP problems, ranging over MDP,
minimax, multiplicative, Borel space models, and models based on outer
integration. Extensions of the analysis of [Ber77] were given by Verdu
and Poor [VeP87], which considered additional structure that allows the
development of backward and forward value iterations, and in the thesis
by Szepesvari [Sze98a], [Sze98b], which introduced non-Markovian poli-
cies into the abstract DP framework. The model of [Ber77] was also used
by Bertsekas [Ber82], and Bertsekas and Yu [BeY10b], to develop asyn-
chronous value and policy iteration methods for abstract contractive and
noncontractive DP models. Another line of related research involving ab-
stract DP mappings that are not necessarily scalar-valued was initiated by
Mitten [Mit74], and was followed up by a number of authors, including
Sobel [Sob75], Morin [Mor82], and Carraway and Morin [CaM88].
Restricted policies models that aim to address measurability issues
in the context of abstract DP were first considered in [BeS98]. Followup
research on this highly technical subject has been limited, and some issues
have not been fully worked out beyond the classical discounted, positive,
and negative stochastic optimal control problems; see Chapter 5.
Generally, noncontractive total cost DP models with some special
structure beyond monotonicity, fall in three major categories: monotone in-
creasing models principally represented by negative DP, monotone decreas-
ing models principally represented by positive DP, and transient models,
exemplified by the SSP model of Example 1.2.6, where the decision process
terminates after a period that is random and subject to control. Abstract
DP models patterned after the first two categories have been known since
[Ber77] and are further discussed in Section 4.3. The semicontractive mod-
els of Chapters 3 and 4, are patterned after the third category, and their
analysis is based on the idea of separating policies into those that are
well-behaved (have contraction-like properties) and those that are not (but
their detrimental effects can be effectively limited thanks to the problem’s
structure). As far as the author knows, this idea is new in the context of
abstract DP. One of the aims of the present monograph is to develop this
idea and to show that it leads to an important and insightful paradigm for
conceptualization and solution of major classes of practical DP problems.
Sec. 1.4 Notes, Sources, and Exercises 27

E X ER CI S E S

1.1 (Multistep Contraction Mappings)

This exercise shows how starting with an abstract mapping, we can obtain mul-
tistep mappings with the same fixed points and a stronger contraction modulus.
Consider a set of mappings Tµ : B(X) !→ B(X), µ ∈ M, satisfying the con-
traction Assumption 1.2.2, let m be a positive integer, and let Mm be the set
of m-tuples ν = (µ0 , . . . , µm−1 ), where µk ∈ M, k = 1, . . . , m − 1. For each
ν = (µ0 , . . . , µm−1 ) ∈ Mm , define the mapping T ν , by
T ν J = Tµ0 · · · Tµm−1 J, ∀ J ∈ B(X).
Show that we have the contraction properties
&T ν J − T ν J " & ≤ αm &J − J " &, ∀ J, J " ∈ B(X), (1.23)
and
&T J − T J " & ≤ αm &J − J " &, ∀ J, J " ∈ B(X), (1.24)
where T is defined by
(T J)(x) = inf (Tµ0 · · · Tµm−1 J)(x), ∀ J ∈ B(X), x ∈ X.
(µ0 ,...,µm−1 )∈Mm

1.2 (State-Dependent Weighted Multistep Mappings [YuB12])

Consider a set of mappings Tµ : B(X) !→ B(X), µ ∈ M, satisfying the con-


(w)
traction Assumption 1.2.2. Consider also the mappings Tµ : B(X) !→ B(X)
defined by

!
(Tµ(w) J)(x) w" (x) Tµ" J (x),
" #
= x ∈ X, J ∈ B(X),
"=1

where w" (x) are nonnegative scalars such that for all x ∈ X,

!
w" (x) = 1.
"=1

Show that
$ (w)
$(Tµ J)(x) − (Tµ(w) J " )(x)$ ∞
$
!
≤ w" (x)α" &J − J " &, ∀ x ∈ X,
v(x)
"=1
(w)
where α is the contraction modulus of Tµ , so that Tµ is a contraction with
modulus

!
ᾱ = sup w" (x) α" ≤ α.
x∈X
"=1
(w)
Show also that Tµ and Tµ have a common fixed point for all µ ∈ M.
2
Contractive Models

Contents

2.1. Fixed Point Equation and Optimality Conditions . . . . p. 30


2.2. Limited Lookahead Policies . . . . . . . . . . . . . p. 37
2.3. Value Iteration . . . . . . . . . . . . . . . . . . . p. 42
2.3.1. Approximate Value Iteration . . . . . . . . . . . p. 43
2.4. Policy Iteration . . . . . . . . . . . . . . . . . . . p. 46
2.4.1. Approximate Policy Iteration . . . . . . . . . . p. 48
2.5. Optimistic Policy Iteration . . . . . . . . . . . . . . p. 52
2.5.1. Convergence of Optimistic Policy Iteration . . . . p. 52
2.5.2. Approximate Optimistic Policy Iteration . . . . . p. 57
2.6. Asynchronous Algorithms . . . . . . . . . . . . . . p. 61
2.6.1. Asynchronous Value Iteration . . . . . . . . . . p. 61
2.6.2. Asynchronous Policy Iteration . . . . . . . . . . p. 67
2.6.3. Policy Iteration with a Uniform Fixed Point . . . . p. 72
2.7. Notes, Sources, and Exercises . . . . . . . . . . . . . p. 79

29
30 Contractive Models Chap. 2

In this chapter we consider the abstract DP model of Section 1.2 under the
most favorable assumptions: monotonicity and weighted sup-norm contrac-
tion. Important special cases of this model are the discounted problems
with bounded cost per stage (Example 1.2.1-1.2.5), the stochastic shortest
path problem of Example 1.2.6 in the case where all policies are proper,
as well as other problems involving special structures. We first provide
some basic analytical results and then focus on two types of algorithms:
value iteration and policy iteration. In addition to exact forms of these
algorithms, we discuss combinations and approximate versions, as well as
asynchronous distributed versions.

2.1 FIXED POINT EQUATION AND OPTIMALITY


CONDITIONS
In this section we recall the abstract DP model of Section 1.2, and derive
some of its basic properties under the monotonicity and contraction as-
sumptions of Section 1.3. We consider a set X of states and a set U of
controls, and for each x ∈ X, a nonempty control constraint set U (x) ⊂ U .
We denote by M the set of all functions µ : X #→ U with µ(x) ∈ U (x) for
all x ∈ X, which we refer to as policies.
We denote by R(X) the set of real-valued functions J : X #→ %. We
have a mapping H : X × U × R(X) #→ % and for each policy µ ∈ M, we
consider the mapping Tµ : R(X) #→ R(X) defined by
! "
(Tµ J )(x) = H x, µ(x), J , ∀ x ∈ X.

We also consider the mapping T defined by

(T J )(x) = inf H(x, u, J ) = inf (Tµ J )(x), ∀ x ∈ X.


u∈U (x) µ∈M

We want to find a function J * ∈ R(X) such that

J * (x) = inf H(x, u, J * ), ∀ x ∈ X,


u∈U (x)

i.e., to find a fixed point of T within R(X). We also want to obtain a policy
µ∗ ∈ M such that Tµ∗ J * = T J * .
Let us restate for convenience the contraction and monotonicity as-
sumptions of Section 1.2.2.

Assumption 2.1.1: (Monotonicity) If J, J # ∈ R(X) and J ≤ J # ,


then
H(x, u, J ) ≤ H(x, u, J # ), ∀ x ∈ X, u ∈ U (x).
Sec. 2.1 Fixed Point Equation and Optimality Conditions 31

Note that the monotonicity assumption implies the following proper-


ties, for all J, J ! ∈ R(X) and k = 0, 1, . . ., which we will use extensively:

J ≤ J! ⇒ T kJ ≤ T kJ !, Tµk J ≤ Tµk J ! , ∀ µ ∈ M,

J ≤ TJ ⇒ T k J ≤ T k+1 J, Tµk J ≤ Tµk+1 J, ∀ µ ∈ M.

Here T k and Tµk denotes the k-fold composition of T and Tµ , respectively.


For the contraction assumption, we introduce a function v : X %→ '
with
v(x) > 0, ∀ x ∈ X.

We consider the weighted sup-norm


! !
!J (x)!
(J ( = sup
x∈X v(x)

on B(X), the space of real-valued functions J on X such that J (x)/v(x)


is bounded over x ∈ X (see Appendix B for a discussion of the properties
of this space).

Assumption 2.1.2: (Contraction) For all J ∈ B(X) and µ ∈ M,


the functions Tµ J and T J belong to B(X). Furthermore, for some
α ∈ (0, 1), we have

(Tµ J − Tµ J ! ( ≤ α(J − J ! (, ∀ J, J ! ∈ B(X), µ ∈ M.

The classical DP models where both the monotonicity and contraction


assumptions are satisfied are the discounted finite-state Markovian decision
problem of Example 1.2.2, and the stochastic shortest path problem of
Example 1.2.6 in the special case where all policies are proper. We refer
the reader to the textbook [Ber12a] for an extensive discussion of these
problems.
The following proposition summarizes some of the basic consequences
of the contraction assumption.

Proposition 2.1.1: Let the contraction Assumption 2.1.2 hold. Then:


(a) The mappings Tµ and T are contraction mappings with modulus
α over B(X), and have unique fixed points in B(X), denoted Jµ
and J * , respectively.
32 Contractive Models Chap. 2

(b) For any J ∈ B(X) and µ ∈ M,

lim "J * − T k J " = 0, lim "Jµ − Tµk J " = 0.


k→∞ k→∞

(c) We have Tµ J * = T J * if and only if Jµ = J * .


(d) For any J ∈ B(X),

1 α
"J * − J " ≤ "T J − J ", "J * − T J " ≤ "T J − J ".
1−α 1−α

(e) For any J ∈ B(X) and µ ∈ M,

1 α
"Jµ − J " ≤ "Tµ J − J ", "Jµ − Tµ J " ≤ "Tµ J − J ".
1−α 1−α

Proof: We showed in Section 1.2.2 that T is a contraction with modulus


α over B(X). Parts (a) and (b) follow from Prop. B.1 of Appendix B.
To show part (c), note that if Tµ J * = T J * , then in view of T J * = J * ,
we have Tµ J * = J * , which implies that J * = Jµ , since Jµ is the unique
fixed point of Tµ . Conversely, if Jµ = J * , we have Tµ J * = Tµ Jµ = Jµ =
J * = T J *.
To show part (d), we use the triangle inequality to write for every k,
k
! k
!
"T k J − J " ≤ "T ! J − T !−1 J " ≤ α!−1 "T J − J ".
!=1 !=1
Taking the limit as k → ∞ and using part (b), the left-hand side inequality
follows. The right-hand side inequality follows from the left-hand side and
the contraction property of T . The proof of part (e) is similar to part (d)
[indeed it
" is the
# special case of part (d) where T is equal to Tµ , i.e., when
U (x) = µ(x) for all x ∈ X]. Q.E.D.

Part (c) of the preceding proposition shows that there exists a µ ∈ M


such that Jµ = J * if and only if the minimum of H(x, u, J * ) over U (x) is
attained for all x ∈ X. Of course the minimum is attained if U (x) is
finite for every x, but otherwise this is not guaranteed in the absence of
additional assumptions. Part (d) provides a useful error bound: we can
evaluate the proximity of any function J ∈ B(X) to the fixed point J * by
applying T to J and computing "T J − J ". The left-hand side inequality of
part (e) (with J = J * ) shows that for every " > 0, there exists a µ" ∈ M
such that "Jµ! − J * " ≤ ", which may be obtained by letting µ" (x) minimize
H(x, u, J * ) over U (x) within an error of (1 − α)" v(x), for all x ∈ X.
Sec. 2.1 Fixed Point Equation and Optimality Conditions 33

The preceding proposition and some of the subsequent results may


also be proved if B(X) is replaced by a closed subset B(X) ⊂ B(X).
This is because the contraction mapping fixed point theorem (Prop. B.1)
applies to closed subsets of complete spaces. For simplicity, however, we
will disregard this possibility in the present chapter.
An important consequence of monotonicity of H, when it holds in
addition to contraction, is that it implies an optimality property of J * .

Proposition 2.1.2: Let the monotonicity and contraction Assump-


tions 2.1.1 and 2.1.2 hold. Then

J * (x) = inf Jµ (x), ∀ x ∈ X.


µ∈M

Furthermore, for every ! > 0, there exists µ! ∈ M such that

J * (x) ≤ Jµ! (x) ≤ J * (x) + !, ∀ x ∈ X. (2.1)

Proof: We note that the right-hand side of Eq. (2.1) holds by Prop.
2.1.1(e) (see the remark following its proof). Thus inf µ∈M Jµ (x) ≤ J * (x)
for all x ∈ X. To show the reverse inequality as well as the left-hand side
of Eq. (2.1), we note that for all µ ∈ M, we have T J * ≤ Tµ J * , and since
J * = T J * , it follows that J * ≤ Tµ J * . By applying repeatedly Tµ to both
sides of this inequality and by using the monotonicity Assumption 2.1.1,
we obtain J * ≤ Tµk J * for all k > 0. Taking the limit as k → ∞, we see
that J * ≤ Jµ for all µ ∈ M. Q.E.D.

Note that without monotonicity, we may have inf µ∈M Jµ (x) < J * (x)
for some x. This is illustrated by the following example.

Example 2.1.1 (Counterexample Without Monotonicity)

Let X = {x1 , x2 }, U = {u1 , u2 }, and let


!
−αJ(x2 ) if u = u1 , 0 if u = u1 ,
"
H(x1 , u, J) = H(x2 , u, J) =
−1 + αJ(x1 ) if u = u2 , B if u = u2 ,

where B is a positive scalar. Then it can be seen that


1
J ∗ (x1 ) = − , J ∗ (x2 ) = 0,
1−α
and Jµ∗ = J ∗ where µ∗ (x1 ) = u2 and µ∗ (x2 ) = u1 . On the other hand, for
µ(x1 ) = u1 and µ(x2 ) = u2 , we have Jµ (x1 ) = −αB and Jµ (x2 ) = B, so
Jµ (x1 ) < J ∗ (x1 ) for B sufficiently large.
34 Contractive Models Chap. 2

Optimality over Nonstationary Policies


The connection with DP motivates us to consider the set Π of all sequences
π = {µ0 , µ1 , . . .} with µk ∈ M for all k (nonstationary policies in the DP
context), and define

Jπ (x) = lim sup (Tµ0 · · · Tµk J¯)(x), ∀ x ∈ X,


k→∞

with J¯ being some function in B(X), where Tµ0 · · · Tµk J denotes the com-
position of the mappings Tµ0 , . . . , Tµk applied to J , i.e,
! "
Tµ0 · · · Tµk J = Tµ0 Tµ1 · · · (Tµk−1 (Tµk J )) · · · .

Note that under the contraction Assumption 2.1.2, the choice of J¯ in


the definition of Jπ does not matter, since for any two J, J # ∈ B(X), we
have
#Tµ0 Tµ1 · · · Tµk J − Tµ0 Tµ1 · · · Tµk J # # ≤ αk+1 #J − J # #,
so the value of Jπ (x) is independent of J¯. Since by Prop. 2.1.1(b), Jµ (x) =
limk→∞ (Tµk J )(x) for all µ ∈ M, J ∈ B(X), and x ∈ X, in the DP context
we recognize Jµ as the cost function of the stationary policy {µ, µ, . . .}.
We now claim that under the monotonicity and contraction Assump-
tions 2.1.1 and 2.1.2, J * , which was defined as the unique fixed point of T ,
is equal to the optimal value of Jπ , i.e.,

J * (x) = inf Jπ (x), ∀ x ∈ X.


π∈Π

Indeed, since M defines a subset of Π, we have from Prop. 2.1.2,

J * (x) = inf Jµ (x) ≥ inf Jπ (x), ∀ x ∈ X,


µ∈M π∈Π

while for every π ∈ Π and x ∈ X, we have

Jπ (x) = lim sup (Tµ0 Tµ1 · · · Tµk J¯)(x) ≥ lim (T k+1 J¯)(x) = J * (x)
k→∞ k→∞

[the monotonicity Assumption 2.1.1 can be used to show that

Tµ0 Tµ1 · · · Tµk J¯ ≥ T k+1 J¯,

and the last equality holds by Prop. 2.1.1(b)]. Combining the preceding
relations, we obtain J * (x) = inf π∈Π Jπ (x).
Thus, in DP terms, we may view J * as an optimal cost function over
all policies. At the same time, Prop. 2.1.2 states that stationary policies
are sufficient in the sense that the optimal cost can be attained to within
arbitrary accuracy with a stationary policy [uniformly for all x ∈ X, as Eq.
(2.1) shows].
Sec. 2.1 Fixed Point Equation and Optimality Conditions 35

Error Bounds and Other Inequalities

The analysis of abstract DP algorithms and related approximations requires


the use of some basic inequalities that follow from the assumptions of con-
traction and monotonicity. We have obtained two such results in Prop.
2.1.1(d),(e), which assume only the contraction assumption. These results
can be strengthened if in addition to contraction, we have monotonicity.
To this end we first show the following useful characterization.

Proposition 2.1.3: The monotonicity and contraction Assumptions


2.1.1 and 2.1.2 hold if and only if for all J, J ! ∈ B(X), µ ∈ M, and
scalar c ≥ 0, we have

J ≤ J! + c v ⇒ Tµ J ≤ Tµ J ! + αc v, (2.2)

where v is the weight function of the weighted sup-norm % · %.

Proof: Let the contraction and monotonicity assumptions hold. If J ! ≤


J + c v, we have

H(x, u, J ! ) ≤ H(x, u, J + c v) ≤ H(x, u, J ) + αc v(x),∀ x ∈ X, u ∈ U (x),


(2.3)
where the left-side inequality follows from the monotonicity assumption and
the right-side inequality follows from the contraction assumption, which
together with %v% = 1, implies that

H(x, u, J + c v) − H(x, u, J )
≤ α%J + c v − J % = αc.
v(x)

The condition (2.3) implies the desired condition (2.2). Conversely, con-
dition (2.2) for c = 0 yields the monotonicity assumption, while for c =
%J ! − J % it yields the contraction assumption. Q.E.D.

We can now derive the following useful variant of Prop. 2.1.1(d),(e),


which involves one-sided inequalities. This variant will be used in the
derivation of error bounds for various computational methods.

Proposition 2.1.4: (Error Bounds Under Contraction and


Monotonicity) Let the monotonicity and contraction Assumptions
2.1.1 and 2.1.2 hold.
36 Contractive Models Chap. 2

(a) For any J ∈ B(X) and c ≥ 0, we have


c
TJ ≤ J + cv ⇒ J* ≤ J + v,
1−α
c
J ≤ TJ + cv ⇒ J ≤ J* + v.
1−α

(b) For any J ∈ B(X), µ ∈ M, and c ≥ 0, we have


c
Tµ J ≤ J + c v ⇒ Jµ ≤ J + v,
1−α
c
J ≤ Tµ J + c v ⇒ J ≤ Jµ + v.
1−α

(c) For all J ∈ B(X), c ≥ 0, and k = 0, 1, . . ., we have

αk c
TJ ≤ J + cv ⇒ J * ≤ T kJ + v,
1−α

αk c
J ≤ TJ + cv ⇒ T kJ ≤ J * + v.
1−α

Proof: (a) We show the first relation. Applying Eq. (2.2) with J ! and J
replaced by J and T J , respectively, and taking infimum over µ ∈ M, we
see that if T J ≤ J + c v, then T 2 J ≤ T J + αc v. Proceeding similarly, it
follows that
T ! J ≤ T !−1 J + α!−1 c v.

We now write for every k,

k
! k
!
T kJ − J = (T ! J − T !−1 J ) ≤ α!−1 c v,
!=1 !=1

" #
from which, by taking the limit as k → ∞, we obtain J * ≤ J + c/(1 − α) v.
The second relation follows similarly.
(b) This part is the special case of part (a) where T is equal to Tµ .
(c) We show the first relation. From part (a), the inequality T J ≤ J + c v
implies that
c
J* ≤ J + v.
1−α
Sec. 2.2 Limited Lookahead Policies 37

Applying T k to both sides of this inequality, and using the monotonicity


and fixed point property of T k , we have
! "
* k
c
J ≤T J+ v .
1−α

Using Eq. (2.2) with Tµ and α replaced by T k and αk , respectively, we


obtain ! "
c αk c
T k J+ v ≤ T kJ + v,
1−α 1−α
and the first relation to be shown follows from the preceding two relations.
The second relation follows similarly. Q.E.D.

2.2 LIMITED LOOKAHEAD POLICIES

In this section, we discuss a basic building block in the algorithmic method-


ology of abstract DP. Given some function J˜ that approximates J * , we
obtain a policy by solving a finite-horizon problem where J˜ is the termi-
nal cost function. The simplest possibility is a one-step lookahead policy µ
defined by
µ(x) ∈ arg min H(x, u, J˜), x ∈ X. (2.4)
u∈U (x)

The following proposition gives some bounds for its performance.

Proposition 2.2.1: (One-Step Lookahead Error Bounds) Let


the contraction Assumption 2.1.2 hold, and let µ be a one-step looka-
head policy obtained by minimization in Eq. (2.4), i.e., satisfying
Tµ J˜ = T J.
˜ Then

˜ ≤ α
$Jµ − T J$ $T J˜ − J˜$, (2.5)
1−α

where $ · $ denotes the weighted sup-norm. Moreover


$Jµ − J * $ ≤ $J˜ − J * $, (2.6)
1−α

and
2
$Jµ − J * $ ≤ $T J̃ − J˜$. (2.7)
1−α

Proof: Equation (2.5) follows from the second relation of Prop. 2.1.1(e)
with J = J˜. Also from the first relation of Prop. 2.1.1(e) with J = J * , we
38 Contractive Models Chap. 2

have
1
!Jµ − J * ! ≤ !Tµ J * − J * !.
1−α
By using the triangle inequality, and the relations Tµ J˜ = T J˜ and J * = T J * ,
we obtain

!Tµ J * − J * ! ≤ !Tµ J * − Tµ J˜!| + !Tµ J˜ − T J!|


˜ + !T J̃ − J * !
= !Tµ J * − Tµ J˜!| + !T J̃ − T J * !
≤ α!J * − J˜! + α!J˜ − J * !
= 2α !J˜ − J * !,

and Eq. (2.6) follows by combining the preceding two relations.


Also, from the first relation of Prop. 2.1.1(d) with J = J˜,
1
!J * − J˜! ≤ !T J̃ − J˜!. (2.8)
1−α
Thus
˜ + !T J˜ − J˜! + !J˜ − J * !
!Jµ − J * ! ≤ !Jµ − T J!
α 1
≤ !T J˜ − J˜! + !T J˜ − J˜! + !T J̃ − J˜!
1−α 1−α
2
= !T J˜ − J˜!,
1−α

where the second inequality follows from Eqs. (2.5) and (2.8). This proves
Eq. (2.7). Q.E.D.

Equation (2.5) provides a computable bound on the cost function Jµ


of the one-step lookahead policy. The bound (2.6) says that if the one-step
lookahead approximation J˜ is within " of the optimal, the performance
of the one-step lookahead policy is within 2α"/(1 − α) of the optimal.
Unfortunately, this is not very reassuring when α is close to 1, in which
case the error bound is very large relative to ". Nonetheless, the following
example from [BeT96], Section 6.1.1, shows that this bound is tight in the
sense that for any α < 1, there is a problem with just two states where
the error bound is satisfied with equality. What is happening is that an
O(")
! difference
" in single stage cost between two controls can generate an
O "/(1 − α) difference in policy costs, yet it can be “nullified” in the fixed
point equation J * = T J * by an O(") difference between J * and J˜.

Example 2.2.1

Consider a discounted optimal control problem with two states, 1 and 2, and
deterministic transitions. State 2 is absorbing, but at state 1 there are two
possible decisions: move to state 2 (policy µ∗ ) or stay at state 1 (policy µ).
Sec. 2.2 Limited Lookahead Policies 39

The cost of each transition is 0 except for the transition from 1 to itself under
policy µ, which has cost 2α", where " is a positive scalar and α ∈ [0, 1) is the
discount factor. The optimal policy µ∗ is to move from state 1 to state 2, and
the optimal cost-to-go function is J ∗ (1) = J ∗ (2) = 0. Consider the vector J˜
˜
with J(1) = −" and J˜(2) = ", so that

#J˜ − J ∗ # = ",

as assumed in Eq. (2.6) (cf. Prop. 2.2.1). The policy µ that decides to stay
at state 1 is a one-step lookahead policy based on J˜, because

2α" + αJ˜(1) = α" = 0 + αJ˜(2).

We have
2α" 2α
Jµ (1) = = #J˜ − J ∗ #,
1−α 1−α
so the bound of Eq. (2.6) holds with equality.

Multistep Lookahead Policies with Approximations

Let us now consider a more general form of lookahead involving multiple


stages as well as other approximations of the type that we will consider later
in the implementation of various approximate value and policy iteration
algorithms. In particular, we will assume that given any J ∈ B(X), we
cannot compute exactly T J , but instead we can compute J˜ ∈ B(X) and
µ ∈ M such that

"J˜ − T J " ≤ δ, "Tµ J − T J " ≤ ", (2.9)

where δ and " are nonnegative scalars. These scalars may be unknown, so
the resulting analysis will have a mostly qualitative character.
The case δ > 0 arises when the state space is either infinite or it is
finite but very large. Then instead of calculating (T J )(x) for all states x,
one may do so only for some states and estimate (T J )(x) for the remain-
ing states x by some form of interpolation. Alternatively, one may use
simulation data [e.g., noisy values of (T J )(x) for some or all x] and some
kind of least-squares error fit of (T J )(x) with a function from a suitable
parametric class. The function J˜ thus obtained will satisfy "J˜ − T J " ≤ δ
with δ > 0. Note that δ may not be small in this context, and the resulting
performance degradation may be a primary concern.
Cases where " > 0 may arise when the control space is infinite or
finite but large, and the minimization involved in the calculation of (T J )(x)
cannot be done exactly. Note, however, that it is possible that

δ > 0, " = 0,
40 Contractive Models Chap. 2

and in fact this occurs often in practice. In an alternative scenario, we may


first obtain the policy µ subject to a restriction that it belongs to a certain
subset of structured policies, so it satisfies

!Tµ J − T J ! ≤ !

for some ! > 0, and then we may set J˜ = Tµ J . In this case we have ! = δ
in Eq. (2.9).
In a multistep method with approximations, we are given a posi-
tive integer m and a lookahead function Jm , and we successively compute
(backwards in time) Jm−1 , . . . , J0 and policies µm−1 , . . . , µ0 satisfying

!Jk − T Jk+1 ! ≤ δ, !Tµk Jk+1 − T Jk+1 ! ≤ !, k = 0, . . . , m − 1. (2.10)

Note that in the context of MDP, Jk can be viewed as an approximation to


the optimal cost function of an (m − k)-stage problem with terminal cost
function Jm . We have the following proposition.

Proposition 2.2.2: (Multistep Lookahead Error Bound) Let


the contraction Assumption 2.1.2 hold. The periodic policy

π = {µ0 , . . . , µm−1 , µ0 , . . . , µm−1 , . . .}

generated by the method of Eq. (2.10) satisfies

2αm ! α(! + 2δ)(1 − αm−1 )


!Jπ − J * ! ≤ !J m − J *! + + .
1 − αm 1 − αm (1 − α)(1 − αm )
(2.11)

Proof: Using the triangle inequality, Eq. (2.10), and the contraction prop-
erty of T , we have for all k

!Jm−k − T k Jm ! ≤ !Jm−k − T Jm−k+1 ! + !T Jm−k+1 − T 2 Jm−k+2 !


+ · · · + !T k−1 Jm−1 − T k Jm !
≤ δ + αδ + · · · + αk−1 δ,
(2.12)
showing that

δ(1 − αk )
!Jm−k − T k Jm ! ≤ , k = 1, . . . , m. (2.13)
1−α
Sec. 2.2 Limited Lookahead Policies 41

From Eq. (2.10), we have !Jk − Tµk Jk+1 ! ≤ δ + ", so for all k

!Jm−k − Tµm−k · · · Tµm−1 Jm ! ≤ !Jm−k − Tµm−k Jm−k+1 !


+ !Tµm−k Jm−k+1 − Tµm−k Tµm−k+1 Jm−k+2 !
+ ···
+ !Tµm−k · · · Tµm−2 Jm−1 − Tµm−k · · · Tµm−1 Jm !
≤ (δ + ") + α(δ + ") + · · · + αk−1 (δ + "),

showing that

(δ + ")(1 − αk )
!Jm−k − Tµm−k · · · Tµm−1 Jm ! ≤ , k = 1, . . . , m.
1−α
(2.14)
Using the fact !Tµ0 J1 − T J1 ! ≤ " [cf. Eq. (2.10)], we obtain

!Tµ0 · · · Tµm−1 Jm − T m Jm ! ≤ !Tµ0 · · · Tµm−1 Jm − Tµ0 J1 !


+ !Tµ0 J1 − T J1 ! + !T J1 − T m Jm !
≤ α!Tµ1 · · · Tµm−1 Jm − J1 ! + " + α!J1 − T m−1 Jm !
α(" + 2δ)(1 − αm−1 )
≤"+ ,
1−α

where the last inequality follows from Eqs. (2.13) and (2.14) for k = m − 1.
From this relation and the fact that Tµ0 · · · Tµm−1 and T m are con-
tractions with modulus αm , we obtain

!Tµ0 · · · Tµm−1 J * − J * ! ≤ !Tµ0 · · · Tµm−1 J * − Tµ0 · · · Tµm−1 Jm !


+ !Tµ0 · · · Tµm−1 Jm − T m Jm ! + !T mJm − J * !
α(" + 2δ)(1 − αm−1 )
≤ 2αm !J * − Jm ! + " + .
1−α

We also have using Prop. 2.1.1(e), applied in the context of the multistep
mapping of Example 1.3.1,

1
!Jπ − J * ! ≤ !Tµ0 · · · Tµm−1 J * − J * !.
1 − αm

Combining the last two relations, we obtain the desired result. Q.E.D.

Note that for m = 1 and δ = " = 0, i.e., the case of one-step lookahead
policy µ with lookahead function J1 and no approximation error in the
minimization involved in T J1 , Eq. (2.11) yields the bound


!Jµ − J * ! ≤ !J1 − J * !,
1−α
42 Contractive Models Chap. 2

which coincides with the bound (2.6) derived earlier.


Also, in the special case where ! = δ and Jk = Tµk Jk+1 (cf. the
discussion preceding Prop. 2.2.2), the bound (2.11) can be strengthened
somewhat. In particular, we have for all k, Jm−k = Tµm−k · · · Tµm−1 Jm , so
the right-hand side of Eq. (2.14) becomes 0 and the preceding proof yields,
with some calculation,

2αm δ αδ(1 − αm−1 )


!Jπ − J * ! ≤ !Jm − J * ! + +
1−α m 1−α m (1 − α)(1 − αm )
2αm δ
= m
!Jm − J * ! + .
1−α 1−α
We finally note that Prop. 2.2.2 shows that as the lookahead size m
increases, the corresponding bound for !Jπ −J * ! tends to !+α(!+2δ)/(1−
α), or
! + 2αδ
lim sup !Jπ − J * ! ≤ .
m→∞ 1−α
We will see that this error bound is superior to corresponding error bounds
for approximate versions of value and policy iteration by essentially a factor
1/(1 − α).
There is an alternative and often used form of multistep lookahead,
whereby at each stage we compute a multistep policy {µ0 , . . . , µm−1 }, we
apply the first component µ0 of that policy, then at the next stage we
recompute a new multistep policy {µ̄0 , . . . , µ̄m−1 }, apply µ̄0 , etc. However,
no error bound for this type of lookahead is currently known.

2.3 VALUE ITERATION

In this section, we discuss value iteration (VI for short), the algorithm
that starts with some J ∈ B(X), and generates T J, T 2 J, . . .. Since T is
a weighted sup-norm contraction under Assumption 2.1.2, the algorithm
converges to J * , and the rate of convergence is governed by

!T k J − J * ! ≤ αk !J − J * !, k = 0, 1, . . . .

Similarly, for a given policy µ ∈ M, we have

!Tµk J − Jµ ! ≤ αk !J − Jµ !, k = 0, 1, . . . .

From Prop. 2.1.1(d), we also have the error bound


α
!T k+1 J − J * ! ≤ !T k+1 J − T k J !, k = 0, 1, . . . .
1−α
This bound does not rely on the monotonicity Assumption 2.1.1.
Sec. 2.3 Value Iteration 43

The VI algorithm is often used to compute an approximation J˜ to J * ,


and then to obtain a policy µ by minimization of H(x, u, J˜) over u ∈ U (x)
for each x ∈ X. In other words J˜ and µ satisfy

"J˜ − J * " ≤ γ, Tµ J˜ = T J,
˜

where γ is some positive scalar. Then by using Eq. (2.6), we have

2α γ
"Jµ − J * " ≤ . (2.15)
1−α

If the set of policies is finite, this procedure can be used to compute an


optimal policy with a finite but sufficiently large number of exact VI, as
shown in the following proposition.

Proposition 2.3.1: Let the contraction Assumption 2.1.2 hold and


let J ∈ B(X). If the set of policies M is finite, there exists an integer
k ≥ 0 such that Jµ∗ = J * for all µ∗ and k ≥ k with Tµ∗ T k J = T k+1 J .

Proof: Let M̃ be the set of policies such that Jµ &= J * . Since M̃ is finite,
we have
inf "Jµ − J * " > 0,
µ∈M̃

so by Eq. (2.15), there exists sufficiently small β > 0 such that

"J˜ − J * " ≤ β and Tµ J˜ = T J˜ ⇒ "Jµ − J * " = 0 µ∈⇒


/ M̃.
(2.16)
It follows that if k is sufficiently large so that "T k J − J * " ≤ β, then
/ M̃ so Jµ∗ = J * . Q.E.D.
Tµ∗ T k J = T k+1 J implies that µ∗ ∈

2.3.1 Approximate Value Iteration

We will now consider situations where the VI method may be imple-


mentable only through approximations. In particular, given a function
J , assume that we may only be able to calculate an approximation J˜ to
T J such that
!J˜ − T J ! ≤ δ,
! !

where δ is a given positive scalar. In the corresponding approximate VI


method, we start from an arbitrary bounded function J0 , and we generate
a sequence {Jk } satisfying

"Jk+1 − T Jk " ≤ δ, k = 0, 1, . . . . (2.17)


44 Contractive Models Chap. 2

This approximation may be the result of representing Jk+1 compactly, as a


linear combination of basis functions, through a projection or aggregation
process, as is common in approximate DP.
We may also simultaneously generate a sequence of policies {µk } such
that
!Tµk Jk − T Jk ! ≤ !, k = 0, 1, . . . , (2.18)
where ! is some scalar [which could be equal to 0, as in case of Eq. (2.10),
considered earlier]. The following proposition shows that the corresponding
!
cost functions Jµk “converge” to J * to within an error of order O δ/(1 −
" ! "
α)2 [plus a less significant error of order O !/(1 − α) ].

Proposition 2.3.2: (Error Bounds for Approximate VI) Let


the contraction Assumption 2.1.2 hold. A sequence {Jk } generated by
the approximate VI method (2.17)-(2.18) satisfies

δ
lim sup !Jk − J * ! ≤ , (2.19)
k→∞ 1−α

while the corresponding sequence of policies {µk } satisfies

! 2αδ
lim sup !Jµk − J * ! ≤ + . (2.20)
k→∞ 1−α (1 − α)2

Proof: Using the triangle inequality, Eq. (2.17), and the contraction prop-
erty of T , we have

!Jk − T k J0 ! ≤ !Jk − T Jk−1 !


+ !T Jk−1 − T 2 Jk−2 ! + · · · + !T k−1 J1 − T k J0 !
≤ δ + αδ + · · · + αk−1 δ,

and finally
(1 − αk )δ
!Jk − T k J0 ! ≤ , k = 0, 1, . . . . (2.21)
1−α
By taking limit as k → ∞ and by using the fact limk→∞ T k J0 = J * , we
obtain Eq. (2.19).
We also have using the triangle inequality and the contraction prop-
erty of Tµk and T ,

!Tµk J * − J * ! ≤ !Tµk J * − Tµk Jk ! + !Tµk Jk − T Jk ! + !T Jk − J * !


≤ α!J * − Jk ! + ! + α!Jk − J * !,
Sec. 2.3 Value Iteration 45

while by using also Prop. 2.1.1(e), we obtain

1 " 2α
!Jµk − J * ! ≤ !T k J * − J * ! ≤ + !Jk − J * !.
1−α µ 1−α 1−α
By combining this relation with Eq. (2.19), we obtain Eq. (2.20). Q.E.D.

The error bound (2.20) relates to stationary policies obtained from


the functions Jk by one-step lookahead. We may also obtain an m-step
periodic policy π from Jk by using m-step lookahead. Then Prop. 2.2.2
shows that the corresponding bound for !Jπ − J * ! tends to " + 2αδ/(1 − α)
as m → ∞, which improves on the error bound (2.20) by a factor 1/(1 − α).
Finally, let us note that the error bound of Prop. 2.3.2 is predicated
upon generating a sequence {Jk } satisfying !Jk+1 − T Jk ! ≤ δ for all k [cf.
Eq. (2.17)]. Unfortunately, some practical approximation schemes guaran-
tee the existence of such a δ only if {Jk } is a bounded sequence. The fol-
lowing example shows that boundedness of the iterates is not automatically
guaranteed, and is a serious issue that should be addressed in approximate
VI schemes.

Example 2.3.1 (Error Amplification in Approximate


Value Iteration)

Consider a two-state discounted MDP with states 1 and 2, and a single policy.
The transitions are deterministic: from state 1 to state 2, and from state 2 to
state 2. These transitions are also cost-free. Thus we have J ∗ (1) = J ∗ (2) = 0.
We consider a VI scheme that approximates!cost functions "within the
one-dimensional subspace of linear functions S = (r, 2r) | r ∈ " by using
a weighted least squares minimization; i.e., we approximate a vector J by its
weighted Euclidean projection onto S. In particular, given Jk = (rk , 2rk ), we
find Jk+1 = (rk+1 , 2rk+1 ), where for weights w1 , w2 > 0, rk+1 is obtained as
# $ %2 $ %2 &
rk+1 = arg min w1 r − (T Jk )(1) + w2 2r − (T Jk )(2) .
r

Since for a zero cost per stage and the given deterministic transitions, we
have T Jk = (2αrk , 2αrk ), the preceding minimization is written as

rk+1 = arg min w1 (r − 2αrk )2 + w2 (2r − 2αrk )2 ,


' (
r

which by writing the corresponding optimality condition yields rk+1 = αβrk ,


where β = 2(w1 + 2w2 )(w1 + 4w2 ) > 1. Thus if α > 1/β, the sequence
{rk } diverges and so does {Jk }. Note that in this example the optimal cost
function J ∗ = (0, 0) belongs to the subspace S. The difficulty here is that
the approximate VI mapping that generates Jk+1 as the weighted Euclidean
projection of T Jk is not a contraction (this is a manifestation of an important
issue in approximate DP and projected equation approximation; see [Ber12a]).
At the same time there is no δ such that $Jk+1 − T Jk $ ≤ δ for all k, because
of error amplification in each approximate VI.
46 Contractive Models Chap. 2

2.4 POLICY ITERATION

In this section, we discuss policy iteration (PI for short), an algorithm


whereby we maintain and update a policy µk , starting from some initial
policy µ0 . The typical iteration has the following form.

Policy iteration given the current policy µk :


Policy evaluation: We compute Jµk as the unique solution of the
equation
Jµk = Tµk Jµk .

Policy improvement: We obtain a policy µk+1 that satisfies

Tµk+1 Jµk = T Jµk .

We assume that the minimum of H(x, u, Jµk ) over u ∈ U (x) is at-


tained for all x ∈ X, so that the improved policy µk+1 is defined (we
use this assumption for all the PI algorithms of the book). The following
proposition establishes a basic cost improvement property, as well as finite
convergence for the case where the set of policies is finite.

Proposition 2.4.1: (Convergence of PI) Let the monotonicity


and contraction Assumptions 2.1.1 and 2.1.2 hold, and let {µk } be
a sequence generated by the PI algorithm. Then for all k, we have
Jµk+1 ≤ Jµk , with equality if and only if Jµk = J * . Moreover,

lim #Jµk − J * # = 0,
k→∞

and if the set of policies is finite, we have Jµk = J * for some k.

Proof: We have
Tµk+1 Jµk = T Jµk ≤ Tµk Jµk = Jµk .
Applying Tµk+1 to this inequality while using the monotonicity Assumption
2.1.1, we obtain
Tµ2k+1 Jµk ≤ Tµk+1 Jµk = T Jµk ≤ Tµk Jµk = Jµk .
Similarly, we have for all m > 0,
Tµmk+1 Jµk ≤ T Jµk ≤ Jµk ,
Sec. 2.4 Policy Iteration 47

and by taking the limit as m → ∞, we obtain

Jµk+1 ≤ T Jµk ≤ Jµk , k = 0, 1, . . . . (2.22)

If Jµk+1 = Jµk , it follows that T Jµk = Jµk , so Jµk is a fixed point of T and
must be equal to J * . Moreover by using induction, Eq. (2.22) implies that

Jµk ≤ T k Jµ0 , k = 0, 1, . . . ,

Since
J * ≤ Jµk , lim $T k Jµ0 − J * $ = 0,
k→∞

it follows that limk→∞ $Jµk − J * $ = 0. Finally, if the number of policies is


finite, Eq. (2.22) implies that there can be only a finite number of iterations
for which Jµk+1 (x) < Jµk (x) for some x, so we must have Jµk+1 = Jµk for
some k, at which time Jµk = J * as shown earlier. Q.E.D.

In the case where the set of policies is infinite, we may assert the
convergence of the sequence of generated policies under some compactness
and continuity conditions. In particular, we will assume that the state
space is finite, X = {1, . . . , n}, and that each control constraint set U (x)
is a compact subset of &m . We will view a cost function J as an element
of &n , and a policy µ as an element of the compact set U (1) × · · · ×
U (n) ⊂ &mn . Then {µk } has at least one limit point µ, which must
be an admissible policy. The following proposition guarantees, under an
additional continuity assumption for H(x, ·, ·), that every limit point µ is
optimal.

Assumption 2.4.1: (Compactness and Continuity)


(a) The state space is finite, X = {1, . . . , n}.
(b) Each control constraint set U (x), x = 1, . . . , n, is a compact
subset of &m .
(c) Each function H(x, ·, ·), x = 1, . . . , n, is continuous over U (x) ×
&n .

Proposition 2.4.2: Let the monotonicity and contraction Assump-


tions 2.1.1 and 2.1.2 hold, together with Assumption 2.4.1, and let
{µk } be a sequence generated by the PI algorithm. Then for every
limit point µ of {µk }, we have Jµ = J ∗ .
48 Contractive Models Chap. 2

Proof: We have Jµk → J * by Prop. 2.4.1. Let µ be the limit of a sub-


sequence {µk }k∈K . We will show that Tµ J * = T J * , from which it follows
that Jµ = J * [cf. Prop. 2.1.1(c)]. Indeed, we have Tµ J * ≥ T J * , so we focus
on showing the reverse inequality. From the equation Tµk Jµk−1 = T Jµk−1
we have
! "
H x, µk (x), Jµk−1 ≤ H(x, u, Jµk−1 ), x = 1, . . . , n, u ∈ U (x).

By taking limit in this relation as k → ∞, k ∈ K, and by using the


continuity of H(x, ·, ·) [cf. Assumption 2.4.1(c)], we obtain
! "
H x, µ(x), J * ≤ H(x, u, J * ), x = 1, . . . , n, u ∈ U (x).

By taking the minimum of the right-hand side over u ∈ U (x), we obtain


Tµ J * ≤ T J * . Q.E.D.

2.4.1 Approximate Policy Iteration

We now consider the PI method where the policy evaluation step and/or
the policy improvement step of the method are implemented through ap-
proximations. This method generates a sequence of policies {µk } and a
corresponding sequence of approximate cost functions {Jk } satisfying

&Jk − Jµk & ≤ δ, &Tµk+1 Jk − T Jk & ≤ ", k = 0, 1, . . . , (2.23)

where δ and " are some scalars, &·& denotes the sup-norm and v is the weight
function of the weighted sup-norm (it is important to use v rather than the
unit function in the above equation, in order for the bounds obtained to
have a clean form). The following proposition provides an error bound for
this algorithm.

Proposition 2.4.3: (Error Bound for Approximate PI) Let


the monotonicity and contraction Assumptions 2.1.1 and 2.1.2 hold.
The sequence {µk } generated by the approximate PI algorithm (2.23)
satisfies
" + 2αδ
lim sup &Jµk − J * & ≤ . (2.24)
k→∞ (1 − α)2

The essence of the proof is contained in the following proposition,


which quantifies the amount of approximate policy improvement at each
iteration.
Sec. 2.4 Policy Iteration 49

Proposition 2.4.4: Let the monotonicity and contraction Assump-


tions 2.1.1 and 2.1.2 hold. Let J , µ, and µ satisfy

!J − Jµ ! ≤ δ, !Tµ J − T J ! ≤ ",

where δ and " are some scalars. Then


" + 2αδ
!Jµ − J * ! ≤ α!Jµ − J * ! + . (2.25)
1−α

Proof: Using Eq. (2.25) and the contraction property of T and Tµ , which
implies that !Tµ Jµ − Tµ J ! ≤ αδ and !T J − T Jµ ! ≤ αδ, and hence Tµ Jµ ≤
Tµ J + αδ v and T J ≤ T Jµ + αδ v, we have

Tµ Jµ ≤ Tµ J + αδ v ≤ T J + (" + αδ) v ≤ T Jµ + (" + 2αδ) v. (2.26)

Since T Jµ ≤ Tµ Jµ = Jµ , this relation yields

Tµ Jµ ≤ Jµ + (" + 2αδ) v,

and applying Prop. 2.1.4(b) with µ = µ, J = Jµ , and " = " + 2αδ, we


obtain
" + 2αδ
Jµ ≤ Jµ + v. (2.27)
1−α
Using this relation, we have

α(" + 2αδ)
Jµ = Tµ Jµ = Tµ Jµ + (Tµ Jµ − Tµ Jµ ) ≤ Tµ Jµ + v,
1−α

where the inequality follows by using Prop. 2.1.3 and Eq. (2.27). Subtract-
ing J * from both sides, we have

α(" + 2αδ)
Jµ − J * ≤ T µ Jµ − J * + v, (2.28)
1−α

Also by subtracting J * from both sides of Eq. (2.26), and using the
contraction property

T Jµ − J * = T Jµ − T J * ≤ α!Jµ − J * ! v,

we obtain

Tµ Jµ − J * ≤ T Jµ − J * + (" + 2αδ) v ≤ α!Jµ − J * ! v + (" + 2αδ) v.


50 Contractive Models Chap. 2

Combining this relation with Eq. (2.28), yields

α(" + 2αδ) " + 2αδ


Jµ −J * ≤ α#Jµ −J * # v+ v+("+αδ)e = α#Jµ −J * # v+ v,
1−α 1−α

which is equivalent to the desired relation (2.25). Q.E.D.

Proof of Prop. 2.4.3: Applying Prop. 2.4.4, we have

" + 2αδ
#Jµk+1 − J * # ≤ α#Jµk − J * # + ,
1−α

which by taking the lim sup of both sides as k → ∞ yields the desired result.
Q.E.D.

We note that the error bound of Prop. 2.4.3 is tight, as can be shown
with an example from [BeT96], Section 6.2.3. The error bound is com-
parable to the one for approximate VI, derived earlier in Prop. 2.3.2. In
particular, the error #Jµk −J * # is asymptotically proportional to 1/(1−α)2
and to the approximation error in policy evaluation or value iteration, re-
spectively. This is noteworthy, as it indicates that contrary to the case of
exact implementation, approximate PI need not hold a convergence rate
advantage over approximate VI, despite its greater overhead per iteration.
On the other hand, approximate PI does not exhibit the same kind
of error amplification difficulty that was illustrated by Example 2.3.1 for
approximate VI. In particular, if the set of policies is finite, so that the
sequence {Jµk } is guaranteed to be bounded, the assumption of Eq. (2.23) is
not hard to satisfy in practice with commonly used function approximation
methods.
Note that when δ = " = 0, Eq. (2.25) yields

#Jµk+1 − J * # ≤ α#Jµk − J * #.

Thus in the case of an infinite state space and/or control space, exact
PI converges at a geometric rate under the contraction and monotonicity
assumptions of this section. This rate is the same as the rate of convergence
of exact VI.

The Case Where Policies Converge

Generally, the policy sequence {µk } generated by approximate PI may


oscillate between several policies. However, under some circumstances this
sequence may be guaranteed to converge to some µ, in the sense that

µk+1 = µk = µ for some k. (2.29)


Sec. 2.4 Policy Iteration 51

An example arises when the policy sequence {µk } is generated by exact


PI applied with a different mapping H̃ in place of H, but the bounds of
Eq. (2.23) are satisfied. The mapping H̃ may for example correspond to
an approximation of the original problem (as in aggregation methods; see
[Ber12a] for further discussion). In this case we can show the following
bound, which is much more favorable than the one of Prop. 2.4.3.

Proposition 2.4.5: (Error Bound for Approximate PI when


Policies Converge) Let the monotonicity and contraction Assump-
tions 2.1.1 and 2.1.2 hold, and let µ be a policy generated by the ap-
proximate PI algorithm (2.23) and satisfying condition (2.29). Then
we have

    ‖Jµ − J*‖ ≤ (ε + 2αδ)/(1 − α).        (2.30)

Proof: Let J̃ be the cost function obtained by approximate policy evaluation
of µ [i.e., J̃ = Jk, where k satisfies the condition (2.29)]. Then we
have

    ‖J̃ − Jµ‖ ≤ δ,    ‖Tµ J̃ − T J̃‖ ≤ ε,        (2.31)

where the latter inequality holds since we have

    ‖Tµ J̃ − T J̃‖ = ‖Tµk+1 Jk − T Jk‖ ≤ ε,

cf. Eq. (2.23). Using Eq. (2.31) and the fact Jµ = Tµ Jµ, we have

    ‖T Jµ − Jµ‖ ≤ ‖T Jµ − T J̃‖ + ‖T J̃ − Tµ J̃‖ + ‖Tµ J̃ − Jµ‖
               = ‖T Jµ − T J̃‖ + ‖T J̃ − Tµ J̃‖ + ‖Tµ J̃ − Tµ Jµ‖
               ≤ α‖Jµ − J̃‖ + ε + α‖J̃ − Jµ‖        (2.32)
               ≤ ε + 2αδ.

Using Prop. 2.1.1(d) with J = Jµ, we obtain the error bound (2.30).
Q.E.D.

The preceding error bound can be extended to the case where two
successive policies generated by the approximate PI algorithm are “not too
different” rather than being identical. In particular, suppose that µ and µ̄
are successive policies, which in addition to

    ‖J̃ − Jµ‖ ≤ δ,    ‖Tµ̄ J̃ − T J̃‖ ≤ ε,

[cf. Eq. (2.23)], also satisfy

    ‖Tµ̄ J̃ − Tµ J̃‖ ≤ ζ,

where ζ is some scalar (instead of µ̄ = µ, which is the case where policies
converge exactly). Then we also have

    ‖T J̃ − Tµ J̃‖ ≤ ‖T J̃ − Tµ̄ J̃‖ + ‖Tµ̄ J̃ − Tµ J̃‖ ≤ ε + ζ,

and by replacing ε with ε + ζ in Eq. (2.32), we obtain

    ‖Jµ − J*‖ ≤ (ε + ζ + 2αδ)/(1 − α).

When ζ is small enough to be of the order of max{δ, ε}, this error bound
is comparable to the one for the case where policies converge.

2.5 OPTIMISTIC POLICY ITERATION

In this section, we discuss optimistic PI (also called “modified” PI, see


e.g., [Put94]), a variant of the PI algorithm of the preceding section, which
tends to be more computationally efficient, based on empirical evidence.
Here each policy µk is evaluated by solving the equation Jµk = Tµk Jµk
approximately, using a finite number of VI. Thus, starting with a function
J0 ∈ B(X), we generate sequences {Jk } and {µk } with the algorithm
    Tµk Jk = T Jk,    Jk+1 = Tµk^{mk} Jk,    k = 0, 1, . . . ,        (2.33)

where {mk } is a sequence of positive integers. There is no systematic


guideline for selecting the integers mk . Usually their best values are chosen
empirically, and tend to be considerably larger than 1 (when mk = 1 for all
k, the method coincides with the VI method).
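For concreteness, a minimal Python sketch of the iteration (2.33) for a hypothetical finite-state, finite-control discounted MDP follows; the arrays P (transition probabilities), g (expected one-stage costs), and the fixed choice mk ≡ m are illustrative assumptions.

    import numpy as np

    def optimistic_pi(P, g, alpha, m=10, num_iter=100):
        """Optimistic PI (2.33): policy improvement followed by m applications of T_mu.

        P[x, u, y]: transition probabilities, g[x, u]: expected one-stage costs.
        """
        n = P.shape[0]
        J = np.zeros(n)                     # J_0
        for _ in range(num_iter):
            Q = g + alpha * P @ J           # Q[x, u] = H(x, u, J)
            mu = Q.argmin(axis=1)           # T_mu J = T J (policy improvement)
            for _ in range(m):              # J_{k+1} = T_mu^m J_k (partial evaluation)
                J = g[np.arange(n), mu] + alpha * P[np.arange(n), mu] @ J
        return J, mu

With m = 1 the sketch reduces to VI, while large m approaches exact PI.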

2.5.1 Convergence of Optimistic Policy Iteration

The following two propositions provide the convergence properties of the


optimistic PI algorithm (2.33).

Proposition 2.5.1: (Convergence of Optimistic PI) Let the


monotonicity and contraction Assumptions 2.1.1 and 2.1.2 hold, and
let {(Jk, µk)} be a sequence generated by the optimistic PI algorithm
(2.33). Then

    lim_{k→∞} ‖Jk − J*‖ = 0,

and if the number of policies is finite, we have Jµk = J * for all k


greater than some index k̄.

Proposition 2.5.2: Let the monotonicity and contraction Assump-


tions 2.1.1 and 2.1.2 hold, together with Assumption 2.4.1, and let
{(Jk, µk)} be a sequence generated by the optimistic PI algorithm
(2.33). Then every limit point µ̄ of {µk} satisfies Jµ̄ = J*.

We develop the proofs of the propositions through four lemmas. The


first lemma collects some properties of monotone weighted sup-norm con-
tractions, variants of which we noted earlier, and we restate for convenience.

Lemma 2.5.1: Let W : B(X) → B(X) be a mapping that satisfies
the monotonicity assumption

    J ≤ J′    ⇒    W J ≤ W J′,    ∀ J, J′ ∈ B(X),

and the contraction assumption

    ‖W J − W J′‖ ≤ α‖J − J′‖,    ∀ J, J′ ∈ B(X),

for some α ∈ (0, 1).

(a) For all J, J′ ∈ B(X) and scalar c ≥ 0, we have

    J ≥ J′ − c v    ⇒    W J ≥ W J′ − αc v.        (2.34)

(b) For all J ∈ B(X), c ≥ 0, and k = 0, 1, . . ., we have

    J ≥ W J − c v    ⇒    W^k J ≥ J* − (α^k/(1 − α)) c v,        (2.35)

    W J ≥ J − c v    ⇒    J* ≥ W^k J − (α^k/(1 − α)) c v,        (2.36)

where J* is the fixed point of W.

Proof: The proof of part (a) follows the one of Prop. 2.1.4(b), while the
proof of part (b) follows the one of Prop. 2.1.4(c). Q.E.D.

Lemma 2.5.2: Let the monotonicity and contraction Assumptions


2.1.1 and 2.1.2 hold, and let J ∈ B(X) and c ≥ 0 satisfy

    J ≥ T J − c v,

and let µ ∈ M be such that Tµ J = T J. Then for all k > 0, we have

    T J ≥ Tµ^k J − (α/(1 − α)) c v,        (2.37)

and

    Tµ^k J ≥ T (Tµ^k J) − α^k c v.        (2.38)

Proof: Since J ≥ T J − c v = Tµ J − c v, by using Lemma 2.5.1(a) with
W = Tµ^j and J′ = Tµ J, we have for all j ≥ 1,

    Tµ^j J ≥ Tµ^{j+1} J − α^j c v.        (2.39)

By adding this relation over j = 1, . . . , k − 1, we have

    T J = Tµ J ≥ Tµ^k J − (Σ_{j=1}^{k−1} α^j) c v = Tµ^k J − ((α − α^k)/(1 − α)) c v ≥ Tµ^k J − (α/(1 − α)) c v,

showing Eq. (2.37). From Eq. (2.39) for j = k, we obtain

    Tµ^k J ≥ Tµ^{k+1} J − α^k c v = Tµ (Tµ^k J) − α^k c v ≥ T (Tµ^k J) − α^k c v,

showing Eq. (2.38). Q.E.D.

The next lemma applies to the optimistic PI algorithm (2.33) and


proves a preliminary bound.

Lemma 2.5.3: Let the monotonicity and contraction Assumptions
2.1.1 and 2.1.2 hold, let {(Jk, µk)} be a sequence generated by the PI
algorithm (2.33), and assume that for some c ≥ 0 we have

    J0 ≥ T J0 − c v.

Then for all k ≥ 0,

    T Jk + (α/(1 − α)) βk c v ≥ Jk+1 ≥ T Jk+1 − βk+1 c v,        (2.40)

where βk is the scalar given by

    βk = 1 if k = 0,    βk = α^{m0 + ··· + m_{k−1}} if k > 0,        (2.41)

with mj, j = 0, 1, . . ., being the integers used in the algorithm (2.33).



Proof: We prove Eq. (2.40) by induction on k, using Lemma 2.5.2. For
k = 0, using Eq. (2.37) with J = J0, µ = µ0, and k = m0, we have

    T J0 ≥ J1 − (α/(1 − α)) c v = J1 − (α/(1 − α)) β0 c v,

showing the left-hand side of Eq. (2.40) for k = 0. Also by Eq. (2.38) with
µ = µ0 and k = m0, we have

    J1 ≥ T J1 − α^{m0} c v = T J1 − β1 c v,

showing the right-hand side of Eq. (2.40) for k = 0.
    Assuming that Eq. (2.40) holds for k − 1 ≥ 0, we will show that it
holds for k. Indeed, the right-hand side of the induction hypothesis yields

    Jk ≥ T Jk − βk c v.

Using Eqs. (2.37) and (2.38) with J = Jk, µ = µk, and k = mk, we obtain

    T Jk ≥ Jk+1 − (α/(1 − α)) βk c v,

and

    Jk+1 ≥ T Jk+1 − α^{mk} βk c v = T Jk+1 − βk+1 c v,

respectively. This completes the induction. Q.E.D.

The next lemma essentially proves the convergence of the optimistic


PI (Prop. 2.5.1) and provides associated error bounds.

Lemma 2.5.4: Let the monotonicity and contraction Assumptions
2.1.1 and 2.1.2 hold, let {(Jk, µk)} be a sequence generated by the PI
algorithm (2.33), and let c ≥ 0 be a scalar such that

    ‖J0 − T J0‖ ≤ c.        (2.42)

Then for all k ≥ 0,

    Jk + (α^k/(1 − α)) c v ≥ Jk + (βk/(1 − α)) c v ≥ J* ≥ Jk − ((k + 1) α^k/(1 − α)) c v,        (2.43)

where βk is defined by Eq. (2.41).

Proof: Using the relation J0 ≥ T J0 − c v [cf. Eq. (2.42)] and Lemma 2.5.3,
we have
    Jk ≥ T Jk − βk c v,    k = 0, 1, . . . .

Using this relation in Lemma 2.5.1(b) with W = T and k = 0, we obtain

    Jk ≥ J* − (βk/(1 − α)) c v,

which together with the fact α^k ≥ βk, shows the left-hand side of Eq.
(2.43).
    Using the relation T J0 ≥ J0 − c v [cf. Eq. (2.42)] and Lemma 2.5.1(b)
with W = T, we have

    J* ≥ T^k J0 − (α^k/(1 − α)) c v,    k = 0, 1, . . . .        (2.44)

Using again the relation J0 ≥ T J0 − c v in conjunction with Lemma 2.5.3,
we also have

    T Jj ≥ Jj+1 − (α/(1 − α)) βj c v,    j = 0, . . . , k − 1.

Applying T^{k−j−1} to both sides of this inequality and using the monotonicity
and contraction properties of T^{k−j−1}, we obtain

    T^{k−j} Jj ≥ T^{k−j−1} Jj+1 − (α^{k−j}/(1 − α)) βj c v,    j = 0, . . . , k − 1,

cf. Lemma 2.5.1(a). By adding this relation over j = 0, . . . , k − 1, and using
the fact βj ≤ α^j, it follows that

    T^k J0 ≥ Jk − ( Σ_{j=0}^{k−1} (α^{k−j}/(1 − α)) α^j ) c v = Jk − (k α^k/(1 − α)) c v.        (2.45)

Finally, by combining Eqs. (2.44) and (2.45), we obtain the right-hand side
of Eq. (2.43). Q.E.D.

Proof of Props. 2.5.1 and 2.5.2: Let c be a scalar satisfying Eq. (2.42).
Then the error bounds (2.43) show that lim_{k→∞} ‖Jk − J*‖ = 0, i.e., the first
part of Prop. 2.5.1. The second part (finite termination when the number
of policies is finite) follows similarly to Prop. 2.4.1. The proof of Prop. 2.5.2
follows using the compactness and continuity Assumption 2.4.1, and the
convergence argument of Prop. 2.4.2. Q.E.D.

Convergence Rate Issues

Let us consider the convergence rate bounds of Lemma 2.5.4 for optimistic
PI, and write them in the form

    ‖J0 − T J0‖ ≤ c    ⇒    Jk − ((k + 1) α^k/(1 − α)) c v ≤ J* ≤ Jk + (α^{m0 + ··· + m_{k−1}}/(1 − α)) c v.        (2.46)
We may contrast these bounds with the ones for VI, where

    ‖J0 − T J0‖ ≤ c    ⇒    T^k J0 − (α^k/(1 − α)) c v ≤ J* ≤ T^k J0 + (α^k/(1 − α)) c v        (2.47)

[cf. Prop. 2.1.4(c)].


In comparing the bounds (2.46) and (2.47), we should also take into
account the associated overhead for a single iteration of each method: op-
timistic PI requires at iteration k a single application of T and mk − 1
applications of Tµk (each being less time-consuming than an application of
T ), while VI requires a single application of T . It can then be seen that the
upper bound for optimistic PI is better than the one for VI (same bound
for less overhead), while the lower bound for optimistic PI is worse than the
one for VI (worse bound for more overhead). This suggests that the choice
of the initial condition J0 is important in optimistic PI, and in particular
it is preferable to have J0 ≥ T J0 (implying convergence to J * from above)
rather than J0 ≤ T J0 (implying convergence to J * from below). This is
consistent with the results of other works, which indicate that the conver-
gence properties of the method are fragile when the condition J0 ≥ T J0
does not hold (see [WiB93], [BeT96], [BeY10a], [BeY10b], [YuB11a]).

2.5.2 Approximate Optimistic Policy Iteration

We will now derive error bounds for the case where the policy evaluation
and policy improvement operations are approximate, similar to the nonop-
timistic PI case of Section 2.4.1. In particular, we consider a method that
generates a sequence of policies {µk } and a corresponding sequence of ap-
proximate cost functions {Jk } satisfying
    ‖Jk − Tµk^{mk} Jk−1‖ ≤ δ,    ‖Tµk+1 Jk − T Jk‖ ≤ ε,    k = 0, 1, . . . ,        (2.48)

[cf. Eq. (2.23)]. For example, we may compute (perhaps approximately,
by simulation) the values (Tµk^{mk} Jk−1)(x) for a subset of states x, and use a
least squares fit of these values to select Jk from some parametric class of
functions.
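A rough sketch of such a fitting step is given below, assuming (purely for illustration) a linear architecture Jk(x) ≈ φ(x)′r and precomputed sampled backup values; none of these names appear in the text.

    import numpy as np

    def fit_value_function(phi, targets):
        """Least-squares fit J_k(x) ~ phi(x)' r to sampled values of (T_mu^m J_{k-1})(x).

        phi: (num_samples, num_features) feature matrix for the sampled states.
        targets: (num_samples,) sampled backup values (possibly noisy, from simulation).
        Returns the weight vector r of the fitted linear architecture.
        """
        r, *_ = np.linalg.lstsq(phi, targets, rcond=None)
        return r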
We will prove the same error bound as for the nonoptimistic case, cf.
Eq. (2.24). However, for this we will need the following condition, which

is stronger than the contraction and monotonicity conditions that we have


been using so far.

Assumption 2.5.1: (Semilinear Monotonic Contraction) For


all J ∈ B(X) and µ ∈ M, the functions Tµ J and T J belong to B(X).
Furthermore, for some α ∈ (0, 1), we have, for all J, J′ ∈ B(X), µ ∈ M,
and x ∈ X,

    ((Tµ J′)(x) − (Tµ J)(x)) / v(x) ≤ α sup_{y∈X} (J′(y) − J(y)) / v(y).        (2.49)

This assumption implies both the monotonicity and contraction As-


sumptions 2.1.1 and 2.1.2, as can be easily verified. Moreover the assump-
tion is satisfied in the discounted DP examples of Section 1.2, as well as
the stochastic shortest path problem of Example 1.2.6. It holds if Tµ is a
linear mapping involving a matrix with nonnegative components that has
spectral radius less than 1 (or more generally if Tµ is the minimum or the
maximum of a finite number of such linear mappings).
For any function y ∈ B(X), let us use the notation

    M(y) = sup_{x∈X} y(x)/v(x).

Then the condition (2.49) can be written, for all J, J′ ∈ B(X) and µ ∈ M,
as

    M(Tµ J − Tµ J′) ≤ α M(J − J′),        (2.50)

and also implies the following multistep versions, for ℓ ≥ 1,

    Tµ^ℓ J − Tµ^ℓ J′ ≤ α^ℓ M(J − J′) v,    M(Tµ^ℓ J − Tµ^ℓ J′) ≤ α^ℓ M(J − J′),        (2.51)

which can be proved by induction using Eq. (2.50). We have the following
proposition.

Proposition 2.5.3: (Error Bound for Optimistic Approximate


PI) Let Assumption 2.5.1 hold. Then the sequence {µk } generated
by the optimistic approximate PI algorithm (2.48) satisfies

    lim sup_{k→∞} ‖Jµk − J*‖ ≤ (ε + 2αδ)/(1 − α)².

Proof: Let us fix k ≥ 1 and for simplicity let us denote

    J = Jk−1,    J̄ = Jk,

    µ = µk,    µ̄ = µk+1,    m = mk,    m̄ = mk+1,

    s = Jµ − Tµ^m J,    s̄ = Jµ̄ − Tµ̄^m̄ J̄,    t = Tµ^m J − J*,    t̄ = Tµ̄^m̄ J̄ − J*.

We have

    Jµ − J* = Jµ − Tµ^m J + Tµ^m J − J* = s + t.        (2.52)

We will derive recursive relations for s and t, which will also involve the
residual functions

    r = Tµ J − J,    r̄ = Tµ̄ J̄ − J̄.

We first obtain a relation between r and r̄. We have

    r̄ = Tµ̄ J̄ − J̄
      = (Tµ̄ J̄ − Tµ J̄) + (Tµ J̄ − J̄)
      ≤ (Tµ̄ J̄ − T J̄) + (Tµ J̄ − Tµ (Tµ^m J)) + (Tµ^m J − J̄) + (Tµ^m (Tµ J) − Tµ^m J)
      ≤ ε v + α M(J̄ − Tµ^m J) v + δ v + α^m M(Tµ J − J) v
      ≤ (ε + δ) v + αδ v + α^m M(r) v,

where the first inequality follows from Tµ̄ J̄ ≥ T J̄, and the second and third
inequalities follow from Eqs. (2.48) and (2.51). From this relation we have

    M(r̄) ≤ ε + (1 + α)δ + β M(r),

where β = α^m. Taking lim sup as k → ∞ in this relation, we obtain

    lim sup_{k→∞} M(r) ≤ (ε + (1 + α)δ)/(1 − β̂),        (2.53)

where β̂ = α^{lim inf_{k→∞} mk}.
    Next we derive a relation between s and r. We have

    s = Jµ − Tµ^m J
      = Tµ^m Jµ − Tµ^m J
      ≤ α^m M(Jµ − J) v
      ≤ (α^m/(1 − α)) M(Tµ J − J) v
      = (α^m/(1 − α)) M(r) v,

where the first inequality follows from Eq. (2.51) and the second inequality
follows by using Prop. 2.1.4(b). Thus we have M(s) ≤ (α^m/(1 − α)) M(r), from
which by taking lim sup of both sides and using Eq. (2.53), we obtain

    lim sup_{k→∞} M(s) ≤ β̂ (ε + (1 + α)δ) / ((1 − α)(1 − β̂)).        (2.54)

    Finally we derive a relation between t, t̄, and r. We first note that

    T J̄ − T J* ≤ α M(J̄ − J*) v
              = α M(J̄ − Tµ^m J + Tµ^m J − J*) v
              ≤ α M(J̄ − Tµ^m J) v + α M(Tµ^m J − J*) v
              ≤ αδ v + α M(t) v.

Using this relation, and Eqs. (2.48) and (2.51), we have

    t̄ = Tµ̄^m̄ J̄ − J*
      = (Tµ̄^m̄ J̄ − Tµ̄^{m̄−1} J̄) + ··· + (Tµ̄² J̄ − Tµ̄ J̄) + (Tµ̄ J̄ − T J̄) + (T J̄ − T J*)
      ≤ (α^{m̄−1} + ··· + α) M(Tµ̄ J̄ − J̄) v + ε v + αδ v + α M(t) v,

so finally

    M(t̄) ≤ ((α − α^m̄)/(1 − α)) M(r̄) + (ε + αδ) + α M(t).

By taking lim sup of both sides and using Eq. (2.53), it follows that

    lim sup_{k→∞} M(t) ≤ (α − β̂)(ε + (1 + α)δ) / ((1 − α)²(1 − β̂)) + (ε + αδ)/(1 − α).        (2.55)

    We now combine Eqs. (2.52), (2.54), and (2.55). We obtain

    lim sup_{k→∞} M(Jµk − J*) ≤ lim sup_{k→∞} M(s) + lim sup_{k→∞} M(t)
        ≤ β̂ (ε + (1 + α)δ) / ((1 − α)(1 − β̂)) + (α − β̂)(ε + (1 + α)δ) / ((1 − α)²(1 − β̂)) + (ε + αδ)/(1 − α)
        = (β̂(1 − α) + (α − β̂)) (ε + (1 + α)δ) / ((1 − α)²(1 − β̂)) + (ε + αδ)/(1 − α)
        = α (ε + (1 + α)δ) / (1 − α)² + (ε + αδ)/(1 − α)
        = (ε + 2αδ)/(1 − α)².

This proves the result, since in view of Jµk ≥ J*, we have M(Jµk − J*) =
‖Jµk − J*‖. Q.E.D.

Note that optimistic PI with approximations is susceptible to the


error amplification difficulty illustrated by Example 2.3.1. In particular,
when mk = 1 for all k in Eq. (2.48), the method becomes essentially iden-
tical to approximate VI. However, it appears that choices of mk that are
significantly larger than 1 should be helpful in connection with this diffi-
culty. In particular, it can be verified that in Example 2.3.1, the method
converges to the optimal cost function if mk is sufficiently large.
A remarkable fact is that approximate VI, approximate PI, and ap-
proximate optimistic PI have very similar error bounds (cf. Props. 2.3.2,
2.4.3, and 2.5.3). Approximate VI has a slightly better bound, but insignif-
icantly so in practical terms. When approximate PI produces a convergent
sequence of policies, the associated error bound is much better (cf. Prop.
2.4.5).

2.6 ASYNCHRONOUS ALGORITHMS

In this section, we extend further the computational methods of VI and


PI for abstract DP models, by embedding them within an asynchronous
computation framework.

2.6.1 Asynchronous Value Iteration

Each VI of the form described in Section 2.3 applies the mapping T defined
by
    (T J)(x) = inf_{u∈U(x)} H(x, u, J),    ∀ x ∈ X,

for all states simultaneously, thereby producing the sequence T J, T 2J, . . .


starting with some J ∈ B(X). In a more general form of VI, at any one
iteration, J (x) may be updated and replaced by (T J )(x) only for a subset
of states. An example is the Gauss-Seidel method for the finite-state case,
where at each iteration, J(x) is updated only for a single selected state x
and J(x̄) is left unchanged for all other states x̄ ≠ x (see [Ber12a]). In that
method the states are taken up for iteration in a cyclic order, but more
complex iteration orders are possible, deterministic as well as randomized.
Methods of the type just described are called asynchronous VI meth-
ods and may be motivated by several considerations such as:
(a) Faster convergence. Generally, computational experience with DP
as well as analysis, have shown that convergence is accelerated by
incorporating the results of VI updates for some states as early as
possible into subsequent VI updates for other states. This is known
as the Gauss-Seidel effect , which is discussed in some detail in the
book [BeT89].
(b) Parallel and distributed asynchronous computation. In this context,
we have several processors, each applying VI for a subset of states, and

communicating the results to other processors (perhaps with some


delay). One objective here may be faster computation by taking
advantage of parallelism. Another objective may be computational
convenience in problems where useful information is generated and
processed locally at geographically dispersed points. An example is
data or sensor network computations, where nodes, gateways, sensors,
and data collection centers collaborate to route and control the flow
of data, using DP or shortest path-type computations.
(c) Simulation-based implementations. In simulation-based versions of
VI, iterations at various states are often performed in the order that
the states are generated by some form of simulation.
With these contexts in mind, we introduce a model of asynchronous
distributed solution of abstract fixed point problems of the form J = T J .
Let R(X) be the set of real-valued functions defined on some given set
X and let T map R(X) into R(X). We consider a partition of X into
disjoint nonempty subsets X1 , . . . , Xm , and a corresponding partition of J
as J = (J1, . . . , Jm), where Jℓ is the restriction of J on the set Xℓ. Our
computation framework involves a network of m processors, each updating
corresponding components of J. In a (synchronous) distributed VI
algorithm, processor ℓ updates Jℓ at iteration t according to

    Jℓ^{t+1}(x) = T(J1^t, . . . , Jm^t)(x),    ∀ x ∈ Xℓ, ℓ = 1, . . . , m.        (2.56)

Here to accommodate the distributed algorithmic framework and its over-


loaded notation, we will use superscript t to denote iterations/times where
some (but not all) processors update their corresponding components, re-
serving the index k for computation stages involving all processors, and
also reserving subscript ! to denote component/processor index.
In an asynchronous VI algorithm, processor ℓ updates Jℓ only for t in
a selected subset Rℓ of iterations, and with components Jj, j ≠ ℓ, supplied
by other processors with communication “delays” t − τℓj(t):

    Jℓ^{t+1}(x) = T(J1^{τℓ1(t)}, . . . , Jm^{τℓm(t)})(x)   if t ∈ Rℓ, x ∈ Xℓ,
    Jℓ^{t+1}(x) = Jℓ^t(x)                                  if t ∉ Rℓ, x ∈ Xℓ.        (2.57)
Communication delays arise naturally in the context of asynchronous dis-
tributed computing systems of the type described in many sources (an
extensive reference is the book [BeT89]). Such systems are interesting for
solution of large DP problems, particularly for methods that are based on
simulation, which is naturally well-suited for distributed computation. On
the other hand, if the entire algorithm is centralized at a single physical
processor, the algorithm (2.57) ordinarily will not involve communication
delays, i.e., τ!j (t) = t for all !, j, and t.
The simpler case where X is a finite set and each subset X! consists
of a single element ! arises often, particularly in the context of simulation.

In this case we may simplify the notation of iteration (2.57) by writing Jℓ^t
in place of the scalar component Jℓ^t(ℓ), as we do in the following example.

Example 2.6.1 (One-State-at-a-Time Iterations)

Assuming X = {1, . . . , n}, let us view each state as a processor by itself, so
that Xℓ = {ℓ}, ℓ = 1, . . . , n, and execute VI one-state-at-a-time, according
to some state sequence {x0, x1, . . .}, which is generated in some way, possibly
by simulation. Thus, starting from some initial vector J0, we generate a
sequence {J^t}, with J^t = (J1^t, . . . , Jn^t), as follows:

    Jℓ^{t+1} = T(J1^t, . . . , Jn^t)(ℓ)   if ℓ = xt,
    Jℓ^{t+1} = Jℓ^t                       if ℓ ≠ xt,

where T(J1^t, . . . , Jn^t)(ℓ) denotes the ℓ-th component of the vector

    T(J1^t, . . . , Jn^t) = T J^t,

and for simplicity we write Jℓ^t instead of Jℓ^t(ℓ). This algorithm is a special
case of iteration (2.57) where the set of times at which Jℓ is updated is
Rℓ = {t | xt = ℓ}, and there are no communication delays (as in the case
where the entire algorithm is centralized at a single physical processor).
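A minimal Python sketch of this one-state-at-a-time iteration for a hypothetical finite discounted MDP (the arrays and the random state sequence are illustrative assumptions) is the following.

    import numpy as np

    def async_vi_one_state(P, g, alpha, num_updates=20000, seed=0):
        """Asynchronous VI: at each time t, update J at a single randomly drawn state.

        P[x, u, y]: transition probabilities, g[x, u]: expected one-stage costs.
        """
        rng = np.random.default_rng(seed)
        n = P.shape[0]
        J = np.zeros(n)
        for _ in range(num_updates):
            x = rng.integers(n)                    # state selected for this update
            # (T J)(x) = min_u [ g(x,u) + alpha * sum_y p_xy(u) J(y) ]
            J[x] = (g[x] + alpha * P[x] @ J).min()
        return J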

Note also that if X is finite, we can assume without loss of generality


that each state is assigned to a separate processor. The reason is that a
physical processor that updates a group of states may be replaced by a
group of fictitious processors, each assigned to a single state, and updating
their corresponding components of J simultaneously.
We will now discuss the convergence of the asynchronous algorithm
(2.57). To this end we introduce the following assumption.

Assumption 2.6.1: (Continuous Updating and Information


Renewal)
(1) The set of times Rℓ at which processor ℓ updates Jℓ is infinite,
    for each ℓ = 1, . . . , m.
(2) lim_{t→∞} τℓj(t) = ∞ for all ℓ, j = 1, . . . , m.

Assumption 2.6.1 is natural, and is essential for any kind of conver-


gence result about the algorithm. † In particular, the condition τ!j (t) → ∞

† Generally, convergent distributed iterative asynchronous algorithms are


classified in totally and partially asynchronous (cf. Chapters 6 and 7 of the book
[BeT89]). In the former, there is no bound on the communication delays, while
in the latter there must be a bound (which may be unknown). The algorithms of
the present section are totally asynchronous, as reflected by Assumption 2.6.1.

guarantees that outdated information about the processor updates will


eventually be purged from the computation. It is also natural to assume
that τ!j (t) is monotonically increasing with t, but this assumption is not
necessary for the subsequent analysis.
We wish to show that Jℓ^t → Jℓ* for all ℓ, and to this end we employ
the following convergence theorem for totally asynchronous iterations from
the author’s paper [Ber83], which has served as the basis for the treatment
of totally asynchronous iterations in the book [BeT89] (Chapter 6), and
their application to DP (i.e., VI and PI), and asynchronous gradient-based
optimization. For the statement of the theorem, we say that a sequence
{J k } ⊂ R(X) converges pointwise to J ∈ R(X) if limk→∞ J k (x) = J (x)
for all x ∈ X.

Proposition 2.6.1 (Asynchronous Convergence Theorem): Let


T have a unique fixed point J*, let Assumption 2.6.1 hold, and assume
that there is a sequence of nonempty subsets {S(k)} ⊂ R(X) with

S(k + 1) ⊂ S(k), k = 0, 1, . . . ,

and is such that if {V k } is any sequence with V k ∈ S(k), for all k ≥ 0,


then {V k } converges pointwise to J * . Assume further the following:
(1) Synchronous Convergence Condition: We have

T J ∈ S(k + 1), ∀ J ∈ S(k), k = 0, 1, . . . .

(2) Box Condition: For all k, S(k) is a Cartesian product of the form

S(k) = S1 (k) × · · · × Sm (k),

where Sℓ(k) is a set of real-valued functions on Xℓ, ℓ = 1, . . . , m.


Then for every J 0 ∈ S(0), the sequence {J t } generated by the asyn-
chronous algorithm (2.57) converges pointwise to J * .

Proof: To explain the idea of the proof, let us note that the given condi-
tions imply that updating any component Jℓ, by applying T to a function
J ∈ S(k), while leaving all other components unchanged, yields a function
in S(k). Thus, once enough time passes so that the delays become “irrel-
evant,” then after J enters S(k), it stays within S(k). Moreover, once a
component Jℓ enters the subset Sℓ(k) and the delays become “irrelevant,”
Jℓ gets permanently within the smaller subset Sℓ(k + 1) at the first time that
Jℓ is iterated on with J ∈ S(k). Once each component Jℓ, ℓ = 1, . . . , m,
gets within Sℓ(k + 1), the entire function J is within S(k + 1) by the Box

Condition. Thus the iterates from S(k) eventually get into S(k + 1) and
so on, and converge pointwise to J * in view of the assumed properties of
{S(k)}.
With this idea in mind, we show by induction that for each k ≥ 0,
there is a time tk such that:
(1) J t ∈ S(k) for all t ≥ tk .
(2) For all ℓ and t ∈ Rℓ with t ≥ tk, we have

    (J1^{τℓ1(t)}, . . . , Jm^{τℓm(t)}) ∈ S(k).

[In words, after some time, all fixed point estimates will be in S(k) and all
estimates used in iteration (2.57) will come from S(k).]
    The induction hypothesis is true for k = 0 since J0 ∈ S(0). Assuming
it is true for a given k, we will show that there exists a time tk+1 with the
required properties. For each ℓ = 1, . . . , m, let t(ℓ) be the first element of
Rℓ such that t(ℓ) ≥ tk. Then by the Synchronous Convergence Condition,
we have T J^{t(ℓ)} ∈ S(k + 1), implying (in view of the Box Condition) that

    Jℓ^{t(ℓ)+1} ∈ Sℓ(k + 1).

Similarly, for every t ∈ Rℓ, t ≥ t(ℓ), we have Jℓ^{t+1} ∈ Sℓ(k + 1). Between
elements of Rℓ, Jℓ^t does not change. Thus,

    Jℓ^t ∈ Sℓ(k + 1),    ∀ t ≥ t(ℓ) + 1.

Let t̄k = maxℓ{t(ℓ)} + 1. Then, using the Box Condition we have

    J^t ∈ S(k + 1),    ∀ t ≥ t̄k.

Finally, since by Assumption 2.6.1, we have τℓj(t) → ∞ as t → ∞, t ∈ Rℓ,
we can choose a time tk+1 ≥ t̄k that is sufficiently large so that τℓj(t) ≥ t̄k
for all ℓ, j, and t ∈ Rℓ with t ≥ tk+1. We then have, for all t ∈ Rℓ
with t ≥ tk+1 and j = 1, . . . , m, Jj^{τℓj(t)} ∈ Sj(k + 1), which (by the Box
Condition) implies that

    (J1^{τℓ1(t)}, . . . , Jm^{τℓm(t)}) ∈ S(k + 1).

The induction is complete. Q.E.D.

Figure 2.6.1 illustrates the assumptions of the preceding convergence


theorem. The challenge in applying the theorem is to identify the set se-
quence {S(k)} and to verify the assumptions of Prop. 2.6.1. In abstract
DP, these assumptions are satisfied in two primary contexts of interest.
The first is when S(k) are weighted sup-norm spheres centered at J * , and

Figure 2.6.1 Geometric interpretation of the conditions of asynchronous con-


vergence theorem. We have a nested sequence of boxes {S(k)} such that T J ∈
S(k + 1) for all J ∈ S(k).


Figure 2.6.2 Geometric interpretation of the mechanism for asynchronous con-


vergence. Iteration on a single component of a function J ∈ S(k), say Jℓ, keeps J
in S(k), while it moves Jℓ into the corresponding component Sℓ(k + 1) of S(k + 1),
where it remains throughout the subsequent iterations. Once all components Jℓ
have been iterated on at least once, the iterate is guaranteed to be in S(k + 1).

can be used in conjunction with the contraction framework of the preced-


ing section (see the following proposition). The second context is based
on monotonicity conditions. It will be used in Section 3.3 in conjunction
with semicontractive models for which there is no underlying sup-norm
contraction. It is also relevant to the noncontractive models of Section 4.3
where again there is no underlying contraction. Figure 2.6.2 illustrates the
mechanism by which asynchronous convergence is achieved.
We note a few extensions of the theorem. It is possible to allow T to be
time-varying, so in place of T we operate with a sequence of mappings Tk ,
k = 0, 1, . . .. Then if all Tk have a common fixed point J * , the conclusion
of the theorem holds (see Exercise 2.2 for a more precise statement). This
extension is useful in some of the algorithms to be discussed later. Another
extension is to allow T to have multiple fixed points and introduce an
assumption that roughly says that ∩∞ k=0 S(k) is the set of fixed points.
Then the conclusion is that any limit point (in an appropriate sense) of
{J t } is a fixed point.

We now apply the preceding convergence theorem to the totally asyn-


chronous VI algorithm under the contraction assumption. Note that the
monotonicity Assumption 2.1.1 is not necessary (just like it is not needed
for the synchronous convergence of {T k J } to J * ).

Proposition 2.6.2: Let the contraction Assumption 2.1.2 hold, to-


gether with Assumption 2.6.1. Then if J 0 ∈ B(X), a sequence {J t }
generated by the asynchronous VI algorithm (2.57) converges point-
wise to J * .

Proof: We apply Prop. 2.6.1 with

    S(k) = { J ∈ B(X) | ‖J − J*‖ ≤ α^k ‖J0 − J*‖ },    k = 0, 1, . . . .
S(k) = J ∈ B(X) | "J k − J * " ≤ αk "J 0 − J * " , k = 0, 1, . . . .

Since T is a contraction with modulus α, the synchronous convergence


condition is satisfied. Since T is a weighted sup-norm contraction, the box
condition is also satisfied, and the result follows. Q.E.D.

2.6.2 Asynchronous Policy Iteration

We will now develop asynchronous PI algorithms that have comparable


properties to the asynchronous VI algorithm of the preceding subsection.
The processors collectively maintain and update an estimate J t of the op-
timal cost function, and an estimate µt of an optimal policy. The local
portions of J^t and µ^t of processor ℓ are denoted Jℓ^t and µℓ^t, respectively,
i.e., Jℓ^t(x) = J^t(x) and µℓ^t(x) = µ^t(x) for all x ∈ Xℓ.
For each processor ℓ, there are two disjoint subsets of times Rℓ, R̄ℓ ⊂
{0, 1, . . .}, corresponding to policy improvement and policy evaluation iter-
ations, respectively. At the times t ∈ Rℓ ∪ R̄ℓ, the local cost function Jℓ^t of
processor ℓ is updated using “delayed” local costs Jj^{τℓj(t)} of other processors
j ≠ ℓ, where 0 ≤ τℓj(t) ≤ t. At the times t ∈ Rℓ (the local policy improve-
ment times), the local policy µℓ^t is also updated. For various choices of Rℓ
and R̄ℓ, the algorithm takes the character of VI (when Rℓ = {0, 1, . . .}),
and PI (when R̄ℓ contains a large number of time indices between succes-
sive elements of Rℓ). As before, we view t − τℓj(t) as a “communication
delay,” and we require Assumption 2.6.1. †
In a natural asynchronous version of optimistic PI, at each time t,
each processor " does one of the following:

† As earlier in all PI algorithms we assume that the infimum over u ∈ U (x)


in the policy improvement operation is attained, and we write min in place of
inf.

Figure 2.6.3 Illustration of optimistic asynchronous PI. When started with J 0


and µ0 satisfying
J 0 ≥ T J 0 = Tµ0 J 0 ,

the algorithm converges monotonically to J ∗ (see the trajectory on the right).


However, for other initial conditions, there is a possibility for oscillations, since
with changing values of µ, the mappings Tµ have different fixed points and “aim
at different targets.” It turns out that such oscillations are not possible when
the algorithm is implemented synchronously (cf. Prop. 2.5.1), but may occur in
asynchronous implementations (see the trajectory on the left, which illustrates a
cycle between three policies µ, µ′, µ′′).

(a) Local policy improvement: If t ∈ Rℓ, processor ℓ sets for all x ∈ Xℓ,

        Jℓ^{t+1}(x) = min_{u∈U(x)} H(x, u, J1^{τℓ1(t)}, . . . , Jm^{τℓm(t)}),        (2.58)

        µℓ^{t+1}(x) = arg min_{u∈U(x)} H(x, u, J1^{τℓ1(t)}, . . . , Jm^{τℓm(t)}).        (2.59)

(b) Local policy evaluation: If t ∈ R̄ℓ, processor ℓ sets for all x ∈ Xℓ,

        Jℓ^{t+1}(x) = H(x, µ^t(x), J1^{τℓ1(t)}, . . . , Jm^{τℓm(t)}),        (2.60)

    and leaves µℓ unchanged, i.e., µℓ^{t+1}(x) = µℓ^t(x) for all x ∈ Xℓ.

(c) No local change: If t ∉ Rℓ ∪ R̄ℓ, processor ℓ leaves Jℓ and µℓ un-
    changed, i.e., Jℓ^{t+1}(x) = Jℓ^t(x) and µℓ^{t+1}(x) = µℓ^t(x) for all x ∈ Xℓ.
Unfortunately, even when implemented without the delays τ!j (t), the
preceding PI algorithm is unreliable. The difficulty is that the algorithm
involves a mix of applications of T and various mappings Tµ that have dif-
ferent fixed points, so in the absence of some systematic tendency towards
J * there is the possibility of oscillation (see Fig. 2.6.3). While this does not

happen in synchronous versions (cf. Prop. 2.5.1), asynchronous versions of


the algorithm (2.33) may oscillate unless J 0 satisfies some special condition
such as
T J0 ≤ J0
(examples of this type of oscillation have been constructed in the paper
[WiB93]; see also [Ber10], which translates an example from [WiB93] to
the notation of the present book).
In this subsection and the next we will develop two distributed asyn-
chronous PI algorithms, each embodying a mechanism that precludes the
oscillatory behavior just described. In the first algorithm, there is a simple
randomization scheme, according to which a policy evaluation of the form
(2.60) is replaced by a policy improvement (2.58)-(2.59) with some positive
probability. In the second algorithm, we introduce a mapping Fµ , which
has a common fixed point property: its fixed point is related to J * and is
the same for all µ, so the anomaly illustrated in Fig. 2.6.3 cannot occur.
The first algorithm is simple but requires some restrictions, including that
the set of policies is finite. The second algorithm is more sophisticated and
does not require this restriction.

An Optimistic Asynchronous Algorithm with Randomization

We introduce a scheme that provides a randomization mechanism for avoid-


ing oscillatory behavior. It is defined by a small probability p > 0, ac-
cording to which a policy evaluation iteration is replaced by a policy im-
provement iteration with probability p, independently of the results of past
iterations. We model this randomization by assuming that before the al-
gorithm is started, we restructure the sets Rℓ and R̄ℓ as follows: we take
each element of each set R̄ℓ, and with probability p, remove it from R̄ℓ,
and add it to Rℓ (independently of other elements).
We will assume the following:

Assumption 2.6.2:
(a) The set of policies M is finite.
(b) There exists an integer B ≥ 0 such that

        (Rℓ ∪ R̄ℓ) ∩ {τ | t < τ ≤ t + B} ≠ Ø,    ∀ t, ℓ.

(c) There exists an integer B′ ≥ 0 such that

        0 ≤ t − τℓj(t) ≤ B′,    ∀ t, ℓ, j.



Assumption 2.6.2 guarantees that each processor ℓ will execute at
least one policy evaluation or policy improvement iteration within every
block of B consecutive iterations, and places a bound B′ on the communi-
cation delays. The convergence of the algorithm is shown in the following
block of B consecutive iterations, and places a bound B ! on the communi-
cation delays. The convergence of the algorithm is shown in the following
proposition.

Proposition 2.6.3: Under the contraction Assumption 2.1.2, and As-


sumptions 2.6.1, and 2.6.2, for the preceding algorithm with random-
ization, we have

    lim_{t→∞} J^t(x) = J*(x),    ∀ x ∈ X,

with probability one.

Proof: Let J * and Jµ be the fixed points of T and Tµ , respectively, and


denote by M∗ the set of optimal policies:

M∗ = {µ ∈ M | Jµ = J * } = {µ ∈ M | Tµ J * = T J * }.

We will show that the algorithm eventually (with probability one) enters
a small neighborhood of J * within which it remains, generates policies in
M∗ , becomes equivalent to asynchronous VI, and therefore converges to
J * by Prop. 2.6.2. The idea of the proof is twofold.
(1) There exists a small enough weighted sup-norm sphere centered at
J * , call it S ∗ , within which policy improvement generates only poli-
cies in M∗ , so policy evaluation with such policies as well as policy
improvement keep the algorithm within S ∗ if started there, and re-
duce the weighted sup-norm distance to J * , in view of the contraction
and common fixed point property of T and Tµ , µ ∈ M∗ . This is a
consequence of Prop. 2.3.1 [cf. Eq. (2.16)].
(2) With probability one, thanks to the randomization device, the algo-
rithm will eventually enter permanently S ∗ with a policy in M∗ .
We now establish (1) and (2) in suitably refined form to account for
the presence of delays and asynchronism. We first define a bounded set
within which the algorithm remains at all times. Consider the set

    Ab = { J | ‖J − Jµ‖ ≤ b, ∀ µ ∈ M },

where ‖ · ‖ denotes the weighted sup-norm and b is sufficiently large so that Ab
is nonempty. Then Ab is bounded since M is a finite set by Assumption
2.6.2. Note that we have Tµ J ∈ Ab for all µ ∈ M and J ∈ Ab, since

    ‖Tµ J − Jµ‖ = ‖Tµ J − Tµ Jµ‖ ≤ α‖J − Jµ‖ ≤ αb < b.



Let b be sufficiently large so that J0 ∈ Ab, and define

    S(k) = { J | ‖J − J*‖ ≤ α^k c },
S(k) = J | "J − J * " ≤ αk c ,

where c is sufficiently large so that Ab ⊂ S(0). Then J t ∈ Ab and hence


J t ∈ S(0) for all t.
Let k ∗ be such that

J ∈ S(k ∗ ) and Tµ J = T J ⇒ µ ∈ M∗ . (2.61)

Such a k ∗ exists in view of the finiteness of M and Prop. 2.3.1 [cf. Eq.
(2.16)].
We now claim that with probability one, for any given k ≥ 1, J^t
will eventually enter S(k) and stay within S(k) for at least B′ additional
consecutive iterations. This is because our randomization scheme is such
that for any t and k, with probability at least p^{k(B+B′)} the next k(B + B′)
iterations are policy improvements, so that J^{t+k(B+B′)−ξ} ∈ S(k) for all
ξ with 0 ≤ ξ < B′ [if t ≥ B′ − 1, we have J^{t−ξ} ∈ S(0) for all ξ with
0 ≤ ξ < B′, so J^{t+B+B′−ξ} ∈ S(1) for 0 ≤ ξ < B′, which implies that
J^{t+2(B+B′)−ξ} ∈ S(2) for 0 ≤ ξ < B′, etc.].
It follows that with probability one, for some t̄ we will have J^τ ∈ S(k*)
for all τ with t̄ − B′ ≤ τ ≤ t̄, as well as µ^{t̄} ∈ M* [cf. Eq. (2.61)]. Based
on property (2.61) and the definition (2.59)-(2.60) of the algorithm, we see
that at the next iteration, we have µ^{t̄+1} ∈ M* and

    ‖J^{t̄+1} − J*‖ ≤ ‖J^{t̄} − J*‖ ≤ α^{k*} c,

so J^{t̄+1} ∈ S(k*); this is because in view of Jµ^{t̄} = J*, and the contraction
property of T and Tµ^{t̄}, we have

    |Jℓ^{t̄+1}(x) − Jℓ*(x)| / v(x) ≤ α‖J^{t̄} − J*‖ ≤ α^{k*+1} c,        (2.62)

for all x ∈ Xℓ and ℓ such that t̄ ∈ Rℓ ∪ R̄ℓ, while J^{t̄+1}(x) = J^{t̄}(x) for
all other x. Proceeding similarly, it follows that for all t > t̄ we will have
J^τ ∈ S(k*) for all τ with t − B′ ≤ τ ≤ t, as well as µ^t ∈ M*. Thus, after at
most B iterations following t̄ [after all components Jℓ are updated through
policy evaluation or policy improvement at least once so that

    |Jℓ^{t+1}(x) − Jℓ*(x)| / v(x) ≤ α‖J^t − J*‖ ≤ α^{k*+1} c,

for every ℓ, x ∈ Xℓ, and some t with t̄ ≤ t < t̄ + B, cf. Eq. (2.62)], J^t will
enter S(k* + 1) permanently, with µ^t ∈ M* (since µ^t ∈ M* for all t ≥ t̄
as shown earlier). Then, with the same reasoning, after at most another

B′ + B iterations, J^t will enter S(k* + 2) permanently, with µ^t ∈ M*, etc.


Thus J t will converge to J * with probability one. Q.E.D.

The proof of Prop. 2.6.3 shows that eventually (with probability one
after some iteration) the algorithm will become equivalent to asynchronous
VI (each policy evaluation will produce the same results as a policy im-
provement), while generating optimal policies exclusively. However, the
expected number of iterations for this to happen can be very large. More-
over the proof depends on the set of policies being finite. These observa-
tions raise questions regarding the practical effectiveness of the algorithm.
However, it appears that for many problems the algorithm works well, par-
ticularly when oscillatory behavior is a rare occurrence.
A potentially important issue is the choice of the randomization prob-
ability p. If p is too small, convergence may be slow because oscillatory
behavior may go unchecked for a long time. On the other hand if p is
large, a correspondingly large number of policy improvement iterations
may be performed, and the hoped for benefits of optimistic PI may be lost.
Adaptive schemes which adjust p based on algorithmic progress may be an
interesting possibility for addressing this issue.

2.6.3 Policy Iteration with a Uniform Fixed Point


We will now discuss another approach to address the convergence difficul-
ties of the “natural” asynchronous PI algorithm (2.58)-(2.60). As illus-
trated in Fig. 2.6.3 in connection with optimistic PI, the mappings T and
Tµ have different fixed points. As a result, optimistic and distributed PI,
which involve an irregular mixture of applications of Tµ and T , do not have
a “consistent target” at which to aim.
With this in mind, we introduce a new mapping that is parametrized
by µ and has a common fixed point for all µ, which in turn yields J * . This
mapping is a weighted sup-norm contraction with modulus α, so it may be
used in conjunction with asynchronous VI and PI. An additional benefit
is that the monotonicity Assumption 2.1.1 is not needed for convergence
(see Exercise 2.3 where the algorithm of the present section is applied to a
nonmonotonic contractive model).
The mapping operates on a pair (V, Q) where:
• V is a function with a component V (x) for each x (in the DP context
it may be viewed as a cost function).
• Q is a function with a component Q(x, u) for each pair (x, u) [in the
DP context Q(x, u) is known as a Q-factor].
The mapping produces a pair

    (M Fµ(V, Q), Fµ(V, Q)),

where

• Fµ(V, Q) is a function with a component Fµ(V, Q)(x, u) for each (x, u),
  defined by

      Fµ(V, Q)(x, u) = H(x, u, min{V, Qµ}),        (2.63)

  where for any Q and µ, we denote by Qµ the function of x defined by

      Qµ(x) = Q(x, µ(x)),    x ∈ X,

  and for any two functions V1 and V2 of x, we denote by min{V1, V2}
  the function of x given by

      min{V1, V2}(x) = min{V1(x), V2(x)},    x ∈ X.

• M Fµ(V, Q) is a function with a component (M Fµ(V, Q))(x) for each
  x, where M denotes minimization over u, so that

      (M Fµ(V, Q))(x) = min_{u∈U(x)} Fµ(V, Q)(x, u).        (2.64)

Example 2.6.2 (Asynchronous Optimistic Policy Iteration for


Discounted Finite-State MDP)

Consider the special case of the finite-state discounted MDP of Example 1.2.2.
We have
    H(x, u, J) = Σ_{y=1}^n pxy(u) ( g(x, u, y) + α J(y) ),

and

    Fµ(V, Q)(x, u) = H(x, u, min{V, Qµ})
                   = Σ_{y=1}^n pxy(u) ( g(x, u, y) + α min{V(y), Q(y, µ(y))} ),

    (M Fµ(V, Q))(x) = min_{u∈U(x)} Σ_{y=1}^n pxy(u) ( g(x, u, y) + α min{V(y), Q(y, µ(y))} ),

[cf. Eqs. (2.63)-(2.64)]. Note that Fµ (V, Q) is the mapping that defines Bell-
man’s equation for the Q-factors of a policy µ in an optimal stopping problem
where the stopping cost at state y is equal to V (y).
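In computational terms these two mappings are simple array operations; the following sketch (array shapes and names are illustrative assumptions) evaluates Fµ(V, Q) and M Fµ(V, Q) for such a finite-state MDP.

    import numpy as np

    def F_mu(P, g, alpha, V, Q, mu):
        """F_mu(V, Q)(x, u) = sum_y p_xy(u) [ g(x,u,y) + alpha * min(V(y), Q(y, mu(y))) ].

        P[x, u, y]: transition probs, g[x, u, y]: costs, V[x], Q[x, u], mu[x].
        """
        n = P.shape[0]
        stop_or_continue = np.minimum(V, Q[np.arange(n), mu])    # min{V, Q_mu}(y)
        return (P * (g + alpha * stop_or_continue)).sum(axis=2)  # shape (n, num_controls)

    def M_F_mu(P, g, alpha, V, Q, mu):
        """(M F_mu(V, Q))(x) = min_u F_mu(V, Q)(x, u)."""
        return F_mu(P, g, alpha, V, Q, mu).min(axis=1)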

We now consider the mapping Gµ given by

    Gµ(V, Q) = (M Fµ(V, Q), Fµ(V, Q)),        (2.65)

and show that it has a uniform contraction property and a corresponding


uniform fixed point. To this end, we introduce the norm

    ‖(V, Q)‖ = max{ ‖V‖, ‖Q‖ }

in the space of (V, Q), where ‖V‖ is the weighted sup-norm of V, and ‖Q‖
is defined by

    ‖Q‖ = sup_{x∈X, u∈U(x)} |Q(x, u)| / v(x).
We have the following proposition.

Proposition 2.6.4: Let the contraction Assumption 2.1.2 hold. Con-


sider the mapping Gµ defined by Eqs. (2.63)-(2.65). Then for all µ:
(a) (J * , Q* ) is the unique fixed point of Gµ , where Q* is defined by

Q* (x, u) = H(x, u, J * ), x ∈ X, u ∈ U (x). (2.66)

(b) The following uniform contraction property holds for all (V, Q),
    (Ṽ, Q̃):

        ‖Gµ(V, Q) − Gµ(Ṽ, Q̃)‖ ≤ α‖(V, Q) − (Ṽ, Q̃)‖.

Proof: (a) Using the definition (2.66) of Q* , we have

    J*(x) = (T J*)(x) = inf_{u∈U(x)} H(x, u, J*) = inf_{u∈U(x)} Q*(x, u),    ∀ x ∈ X,

so that

    min{J*(x), Q*(x, µ(x))} = J*(x),    ∀ x ∈ X, µ ∈ M.

Using the definition (2.63) of Fµ, it follows that Fµ(J*, Q*) = Q* and also
that M Fµ(J*, Q*) = J*, so (J*, Q*) is a fixed point of Gµ for all µ. The
uniqueness of this fixed point will follow from the contraction property of
part (b).
(b) Using the definition (2.63) of Fµ, it can be shown that for all (V, Q),
(Ṽ, Q̃),

    ‖Fµ(V, Q) − Fµ(Ṽ, Q̃)‖ ≤ α‖min{V, Qµ} − min{Ṽ, Q̃µ}‖
                          ≤ α max{‖V − Ṽ‖, ‖Q − Q̃‖}.        (2.67)

Indeed, the first inequality follows from the contraction Assumption 2.1.2
and the second inequality follows from a nonexpansiveness property of the
minimization map: for any J1, J2, J̃1, J̃2, we have

    ‖min{J1, J2} − min{J̃1, J̃2}‖ ≤ max{‖J1 − J̃1‖, ‖J2 − J̃2‖};        (2.68)

[to see this, write for every x,

    Jm(x)/v(x) ≤ max{‖J1 − J̃1‖, ‖J2 − J̃2‖} + J̃m(x)/v(x),    m = 1, 2,

take the minimum of both sides over m, exchange the roles of Jm and J̃m,
and take supremum over x]. Here we use the relation (2.68) for J1 = V,
J̃1 = Ṽ, and J2(x) = Q(x, µ(x)), J̃2(x) = Q̃(x, µ(x)), for all x ∈ X.
    We next note that for all Q, Q̃, †

    ‖M Q − M Q̃‖ ≤ ‖Q − Q̃‖,

which together with Eq. (2.67) yields

    max{ ‖M Fµ(V, Q) − M Fµ(Ṽ, Q̃)‖, ‖Fµ(V, Q) − Fµ(Ṽ, Q̃)‖ } ≤ α max{‖V − Ṽ‖, ‖Q − Q̃‖},

or equivalently ‖Gµ(V, Q) − Gµ(Ṽ, Q̃)‖ ≤ α‖(V, Q) − (Ṽ, Q̃)‖. Q.E.D.

Because of the uniform contraction property of Prop. 2.6.4(b), a dis-


tributed fixed point iteration, like the VI algorithm of Eq. (2.57), can be
used in conjunction with the mapping (2.65) to generate asynchronously
a sequence {(V t , Qt )} that is guaranteed to converge to (J * , Q* ) for any
sequence {µt }. This can be verified using the proof of Prop. 2.6.2 (more
precisely, a proof that closely parallels the one of that proposition); the
mapping (2.65) plays the role of T in Eq. (2.57). ‡

† For a proof, we write

    Q(x, u)/v(x) ≤ ‖Q − Q̃‖ + Q̃(x, u)/v(x),    ∀ u ∈ U(x), x ∈ X,

take infimum of both sides over u ∈ U (x), exchange the roles of Q and Q̃, and
take supremum over x ∈ X and u ∈ U (x).
‡ Because Fµ and Gµ depend on µ, which changes as the algorithm pro-
gresses, it is necessary to use a minor extension of the asynchronous convergence
theorem, given in Exercise 2.2, for the convergence proof.

Asynchronous PI Algorithm

We now describe a PI algorithm, which applies asynchronously the com-


ponents M Fµ (V, Q) and Fµ (V, Q) of the mapping Gµ (V, Q) of Eq. (2.65).
The first component is used for local policy improvement and makes a local
update to V and µ, while the second component is used for local policy
evaluation and makes a local update to Q. The algorithm draws its validity
from the weighted sup-norm contraction property of Prop. 2.6.4(b) and the
asynchronous convergence theory (Prop. 2.6.2 and Exercise 2.2).
For the asynchronous computation framework, we consider again m
processors, a partition of X into sets X1, . . . , Xm, and assignment of each
subset Xℓ to a processor ℓ ∈ {1, . . . , m}. The following modification of
the “natural” asynchronous PI algorithm (2.59)-(2.60) [without the “com-
munication delays” t − τℓj(t)] can be shown to converge, in the sense that
V^t → J*, Q^t → Q*. For each ℓ, there are two infinite disjoint subsets of
times Rℓ, R̄ℓ ⊂ {0, 1, . . .}, corresponding to policy improvement and pol-
icy evaluation iterations, respectively. Each processor ℓ operates on V^t(x),
Q^t(x, u), and µ^t(x), only for x in its “local” state space Xℓ. In particular,
at each time t, each processor ℓ does one of the following:
(a) Local policy improvement: If t ∈ Rℓ, processor ℓ sets for all x ∈ Xℓ, †

        V^{t+1}(x) = min_{u∈U(x)} H(x, u, min{V^t, Q^t_{µ^t}}) = (M Fµt(V^t, Q^t))(x),

    sets µ^{t+1}(x) to a u that attains the minimum, and leaves Q un-
    changed, i.e., Q^{t+1}(x, u) = Q^t(x, u) for all x ∈ Xℓ and u ∈ U(x).

(b) Local policy evaluation: If t ∈ R̄ℓ, processor ℓ sets for all x ∈ Xℓ and
    u ∈ U(x),

        Q^{t+1}(x, u) = H(x, u, min{V^t, Q^t_{µ^t}}) = Fµt(V^t, Q^t)(x, u),

    and leaves V and µ unchanged, i.e., V^{t+1}(x) = V^t(x) and µ^{t+1}(x) =
    µ^t(x) for all x ∈ Xℓ.

(c) No local change: If t ∉ Rℓ ∪ R̄ℓ, processor ℓ leaves Q, V, and µ
    unchanged, i.e., Q^{t+1}(x, u) = Q^t(x, u) for all x ∈ Xℓ and u ∈ U(x),
    V^{t+1}(x) = V^t(x), and µ^{t+1}(x) = µ^t(x) for all x ∈ Xℓ.
Note that while this algorithm does not involve the “communication
delays” t − τ!j (t), it can clearly be extended to include them. The reason
is that our asynchronous convergence analysis framework in combination
with the uniform weighted sup-norm contraction property of Prop. 2.6.4
can tolerate the presence of such delays.

† As earlier we assume that the infimum over u ∈ U (x) in the policy im-
provement operation is attained, and we write min in place of inf.

Reduced Space Implementation


The preceding PI algorithm is well-suited for the calculation of both J * and
Q* . However, if the objective is just to calculate J * , a simpler and more
efficient algorithm is possible. To this end, we observe that the preceding
algorithm can be operated so that it does not require the maintenance of
the entire function Q. The reason is that the values Q^t(x, u) with u ≠
µ^t(x) do not appear in the calculations, and hence we need only the values
Q^t(x, µ^t(x)), which we store in a function J^t:

    J^t(x) = Q^t(x, µ^t(x)).

This observation is the basis for the following algorithm.


At each time t and for each processor ℓ:

(a) Local policy improvement: If t ∈ Rℓ, processor ℓ sets for all x ∈ Xℓ,

        J^{t+1}(x) = V^{t+1}(x) = min_{u∈U(x)} H(x, u, min{V^t, J^t}),        (2.69)

    and sets µ^{t+1}(x) to a u that attains the minimum.

(b) Local policy evaluation: If t ∈ R̄ℓ, processor ℓ sets for all x ∈ Xℓ,

        J^{t+1}(x) = H(x, µ^t(x), min{V^t, J^t}),        (2.70)

    and leaves V and µ unchanged, i.e., for all x ∈ Xℓ,

        V^{t+1}(x) = V^t(x),    µ^{t+1}(x) = µ^t(x).

(c) No local change: If t ∉ Rℓ ∪ R̄ℓ, processor ℓ leaves J, V, and µ
    unchanged, i.e., for all x ∈ Xℓ,

        J^{t+1}(x) = J^t(x),    V^{t+1}(x) = V^t(x),    µ^{t+1}(x) = µ^t(x).

Example 2.6.3 (Asynchronous Optimistic Policy Iteration for


Discounted Finite-State MDP - Continued)

As an illustration of the preceding reduced space implementation, consider


the special case of the finite-state discounted MDP of Example 1.2.2. Here
    H(x, u, J) = Σ_{y=1}^n pxy(u) ( g(x, u, y) + α J(y) ),

and the mapping Fµ(V, Q) given by

    Fµ(V, Q)(x, u) = Σ_{y=1}^n pxy(u) ( g(x, u, y) + α min{V(y), Q(y, µ(y))} )

defines the Q-factors of µ in a corresponding stopping problem. In the PI


algorithm (2.69)-(2.70), policy evaluation of µ aims to solve this stopping
problem, rather than solve a linear system of equations, as in classical PI. In
particular, the policy evaluation iteration (2.70) is
n
t+1
! #$ " &'
pxy µt (x) g x, µt (x), y + α min V t (y), J t (y)
" # %
J (x) = ,
y=1

for all x ∈ X! . The policy improvement iteration (2.69) is a VI for the


stopping problem:
n $
! &'
J t+1 (x) = V t+1 (x) = min pxy (u) g(x, u, y) + α min V t (y), J t (y)
%
,
u∈U (x)
y=1

for all x ∈ X! , while the current policy is locally updated by


n $
! &'
µt+1 (x) = arg min pxy (u) g(x, u, y) + α min V t (y), J t (y)
%
,
u∈U (x)
y=1

for all x ∈ X! . The “stopping cost” V t (y) is the most recent cost value,
obtained by local policy improvement at y.
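The following Python sketch implements the iteration (2.69)-(2.70) for such a finite-state MDP in a centralized, one-state-at-a-time fashion; the random scheduling of improvement versus evaluation steps and all array names are illustrative assumptions.

    import numpy as np

    def async_optimistic_pi(P, g, alpha, num_updates=50000, p_improve=0.1, seed=0):
        """Asynchronous optimistic PI with the min{V, J} convergence mechanism.

        P[x, u, y]: transition probs, g[x, u, y]: costs. Each update touches one state,
        doing either a local policy improvement (2.69) or a local policy evaluation (2.70).
        """
        rng = np.random.default_rng(seed)
        n = P.shape[0]
        V = np.zeros(n); J = np.zeros(n); mu = np.zeros(n, dtype=int)
        for _ in range(num_updates):
            x = rng.integers(n)
            W = np.minimum(V, J)                          # min{V^t, J^t}
            q = (P[x] * (g[x] + alpha * W)).sum(axis=1)   # H(x, u, min{V, J}) for all u
            if rng.random() < p_improve:                  # local policy improvement (2.69)
                mu[x] = q.argmin()
                J[x] = V[x] = q[mu[x]]
            else:                                         # local policy evaluation (2.70)
                J[x] = q[mu[x]]
        return J, V, mu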

Example 2.6.4 (Asynchronous Optimistic Policy Iteration for


Dynamic Games)

Consider the optimistic PI algorithm (2.69)-(2.70) for the case of the dis-
counted dynamic game problem of Example 1.2.4 of Chapter 1. In the con-
text of this problem, a local policy evaluation step [cf. Eq. (2.70)] consists of
a local VI for the maximizer’s DP problem assuming a fixed policy for the
minimizer, and a stopping cost V t as per Eq. (2.70). A local policy improve-
ment step [cf. Eq. (2.69)] at state x consists of the solution of a static game
with a payoff matrix that also involves min{V t , J t } in place of J t , as per Eq.
(2.69).

A Variant with Interpolation

While the use of min{V t , J t } (rather than J t ) in Eq. (2.70) provides a


convergence enforcement mechanism for the algorithm, it may also become
a source of inefficiency, particularly when V t (x) approaches its limit J * (x)
from lower values for many x. Then J t+1 (x) is set to a lower value than
the iterate
    Ĵ^{t+1}(x) = H(x, µ^t(x), J^t),        (2.71)
given by the “standard” policy evaluation iteration, and in some cases this
may slow down the algorithm.

A possible way to address this is to use an algorithmic variation that


modifies appropriately Eq. (2.70), using interpolation with a parameter
γt ∈ (0, 1], with γt → 0. In particular, for t ∈ R̄ℓ and x ∈ Xℓ, we calculate
the values J^{t+1}(x) and Ĵ^{t+1}(x) given by Eqs. (2.70) and (2.71), and if

    J^{t+1}(x) < Ĵ^{t+1}(x),        (2.72)

we reset J^{t+1}(x) to

    (1 − γt) J^{t+1}(x) + γt Ĵ^{t+1}(x).        (2.73)

The idea of the algorithm is to aim for a larger value of J t+1 (x)
when the condition (2.72) holds. Asymptotically, as γt → 0, the iteration
(2.72)-(2.73) becomes identical to the convergent update (2.70). For a
more detailed analysis of an algorithm of this type, we refer to the paper
[BeY10b].
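A single interpolated local policy evaluation update at a state x could be sketched as follows, for the same hypothetical finite MDP arrays as in the earlier sketches; the γt schedule and all names are illustrative assumptions.

    import numpy as np

    def interpolated_evaluation_update(P, g, alpha, V, J, mu, x, gamma_t):
        """One local policy evaluation at state x with the interpolation (2.72)-(2.73).

        P[x, u, y]: transition probs, g[x, u, y]: costs, V[x], J[x], mu[x].
        """
        W = np.minimum(V, J)
        J_new = (P[x, mu[x]] * (g[x, mu[x]] + alpha * W)).sum()   # update (2.70)
        J_hat = (P[x, mu[x]] * (g[x, mu[x]] + alpha * J)).sum()   # standard update (2.71)
        if J_new < J_hat:                                         # condition (2.72)
            J_new = (1 - gamma_t) * J_new + gamma_t * J_hat       # reset per (2.73)
        return J_new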

2.7 NOTES, SOURCES, AND EXERCISES

The abstract contractive DP model of this chapter was introduced by


Denardo [Den67], who focused on unweighted sup-norm contractions. We
have transcribed his model and results in Section 2.1 so that they apply
under weighted sup-norm contraction assumptions.
The abstraction of the computational methodology for finite-state
discounted MDP within the broad framework of weighted sup-norm con-
tractions follows the author’s survey [Ber12b], and relies on several earlier
analyses that use more specialized assumptions. In particular, the multi-
stage error bound of Prop. 2.2.2 follows the work of Scherrer [Sch12], which
explores the use of periodic policies in approximate VI and PI in finite-state
discounted MDP (see also Scherrer and Lesner [ShL12], who give an exam-
ple showing that the bound for approximate VI of Prop. 2.3.2 is essentially
sharp for discounted finite-state MDP). For a related discussion of approx-
imate VI, including the error amplification phenomenon of Example 2.3.1,
and associated error bounds, see Munos and Szepesvari [MuS08]. The error
bound and approximate PI analysis of Section 2.4.1 (Prop. 2.4.3) extends
the one of Bertsekas and Tsitsiklis [BeT96] (Section 6.2.2), which was given
for finite-state discounted MDP, and was shown to be tight by an example.
Optimistic PI has received a lot of attention in the literature, par-
ticularly for finite-state discounted MDP, and it is generally thought to
be computationally more efficient than ordinary PI (see e.g., Puterman
[Put94], who refers to the method as “modified PI”). The convergence
analysis of the synchronous optimistic PI (Section 2.5.1) follows Rothblum
[Rot79], who considered the case of an unweighted sup-norm (v = e); see
also Canbolat and Rothblum [CaR13], which considers optimistic PI meth-
ods where the minimization in the policy improvement operation is approx-
imate, within some ε > 0. The error bound for optimistic PI (Section 2.5.2)

is due to Thierry and Scherrer [ThS10b], given for the case of a finite-state
discounted MDP. We follow closely their proof. Related error bounds and
analysis are given by Scherrer [Sch11].
An alternative form of optimistic PI is the λ-PI method, introduced
by Bertsekas and Ioffe [BeI96] (see also [BeT96]), and further studied in
approximate DP contexts by Thierry and Scherrer [ThS10a], Bertsekas
[Ber11b], and Scherrer [Sch11], and in a modified form by Yu and Bertsekas
[YuB12]. The λ-PI method is defined by

    Tµk Jk = T Jk,    Jk+1 = Tµk^{(λ)} Jk,

where J0 is an initial function in B(X), and for any policy µ and λ ∈ (0, 1),
the mapping Tµ^{(λ)} is defined by

    Tµ^{(λ)} J = (1 − λ) Σ_{ℓ=0}^∞ λ^ℓ Tµ^{ℓ+1} J.

To compare λ-PI and optimistic PI, where


m
Jk+1 = Tµkk Jk ,

note that they both involve multiple applications of the VI mapping Tµk : a
fixed number mk in the latter case, and a geometrically weighted number in
the former case. We will revisit λ-PI in Section 4.3.3, where we will show
that it offers some implementation advantages in the contexts discussed
there.
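As a rough sketch, the λ-PI evaluation step can be approximated by truncating the geometrically weighted sum after a finite number of terms; the truncation and all array names below are illustrative assumptions.

    import numpy as np

    def lambda_pi_evaluation(P, g, alpha, J, mu, lam, num_terms=50):
        """Approximate T_mu^(lambda) J = (1 - lambda) * sum_{l>=0} lambda^l * T_mu^{l+1} J,
        truncating the sum after num_terms terms.

        P[x, u, y]: transition probabilities, g[x, u]: expected one-stage costs.
        """
        n = P.shape[0]
        P_mu = P[np.arange(n), mu]
        g_mu = g[np.arange(n), mu]
        result = np.zeros(n)
        J_l = J.copy()
        for l in range(num_terms):
            J_l = g_mu + alpha * P_mu @ J_l          # J_l = T_mu^{l+1} J
            result += (1 - lam) * lam**l * J_l
        return result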
Asynchronous VI (Section 2.6.1) for finite-state discounted MDP and
games, shortest path problems, and abstract DP models, was proposed in
Bertsekas [Ber82]. The asynchronous convergence theorem (Prop. 2.6.1)
was first given in the author’s paper [Ber83], where it was applied to a
variety of algorithms, including VI for discounted and undiscounted DP,
and gradient methods for unconstrained optimization (see also Bertsekas
and Tsitsiklis [BeT89], where a textbook account is presented). There are
earlier references on distributed asynchronous iterative algorithms, includ-
ing the early work of Chazan and Miranker [ChM69] on linear relaxation
methods (who attributed the original idea to Rosenfeld [Ros67]), and also
Baudet [Bau78] on sup-norm contractive iterations. We refer to [BeT89]
for detailed references.
Asynchronous algorithms have also been studied and applied to simu-
lation-based DP, particularly in the context of Q-learning, which may be
viewed as a stochastic version of VI. Two principal approaches for the
convergence analysis of such algorithms have been suggested. The first
approach, initiated in the paper by Tsitsiklis [Tsi94], considers the totally
asynchronous computation of fixed points of abstract sup-norm contractive
mappings and monotone mappings, which are defined in terms of an ex-


pected value. The algorithm of [Tsi94] contains as special cases Q-learning
algorithms for finite-spaces discounted MDP and SSP problems. The anal-
ysis of [Tsi94] shares some ideas with the theory of Section 2.6.1, and also
relies on the theory of stochastic approximation methods. For a subsequent
analysis of the convergence of Q-learning for SSP, which addresses the is-
sue of boundedness of the iterates, we refer to Yu and Bertsekas [YuB11b].
The second approach, treats asynchronous algorithms of the stochastic ap-
proximation type under some restrictions on the size of the communication
delays or on the time between consecutive updates of a typical compo-
nent. This approach was initiated in the paper by Tsitsiklis, Bertsekas,
and Athans [TBA86], and was also developed in the book by Bertsekas
and Tsitsiklis [BeT89] for stochastic gradient optimization methods. A re-
lated analysis that uses the ODE approach for more general fixed point
problems was given in the paper by Borkar [Bor98], and was refined in
the papers by Abounadi, Bertsekas, and Borkar [ABB02], and Borkar and
Meyn [BeM00], which also considered applications to Q-learning. We refer
to the monograph by Borkar [Bor08] for a more comprehensive discussion.
The convergence of asynchronous PI for finite-state discounted MDP
under the condition J 0 ≥ Tµ0 J 0 was shown by Williams and Baird [WiB93],
who also gave examples showing that without this condition, cycling of the
algorithm may occur. The asynchronous PI algorithms of Sections 2.6.2
and 2.6.3 belong to the algorithmic framework of the papers by Bertsekas
and Yu [BeY10a], [BeY10b], which address this difficulty. Our analysis
follows closely the analysis of these papers, which also propose related Q-
learning algorithms with and without cost function approximation.
In addition to discounted finite-state MDP, the results of this chapter
find application in the context of the stochastic shortest path problem of
Example 1.2.6, in the special case where all policies are proper. Then
it can be shown that T and Tµ are weighted sup-norm contractions with
respect to a special norm. It follows that the analysis and algorithms of this
chapter fully apply in this special case. For a detailed discussion, we refer
to [BeT96], and the survey [Ber12b]. Some of the theory and algorithms
also extend to the case of countable state space; see the textbook [Ber12a],
Section 3.3, and Hinderer and Waldmann [HiW05].

E X ER CI S E S

2.1 (Periodic Policies)

Consider the multistep mappings T ν = Tµ0 · · · Tµm−1 , ν ∈ Mm , defined in Exer-


cise 1.1 of Chapter 1, where Mm is the set of m-tuples ν = (µ0 , . . . , µm−1 ), with
µk ∈ M, k = 0, . . . , m − 1, and m is a positive integer. Assume that the mappings
Tµ satisfy the monotonicity and contraction Assumptions 2.1.1 and 2.1.2, so that
the same is true for the mappings T ν (with the contraction modulus of T ν being
αm , cf. Exercise 1.1).
(a) Show that the unique fixed point of T ν is Jπ , where π is the nonstationary
but periodic policy
π = {µ0 , . . . , µm−1 , µ0 , . . . , µm−1 , . . .}.

(b) Show that the multistep mappings Tµ0 · · · Tµm−1 , Tµ1 · · · Tµm−1 Tµ0 , . . . ,
Tµm−1 Tµ0 · · · Tµm−2 , have unique corresponding fixed points J0 , J1 , . . .,
Jm−1 , which satisfy
J0 = Tµ0 J1 ,  J1 = Tµ1 J2 ,  . . . ,  Jm−2 = Tµm−2 Jm−1 ,  Jm−1 = Tµm−1 J0 .
Hint: Apply Tµ0 to the fixed point relation
J1 = Tµ1 · · · Tµm−1 Tµ0 J1
to show that Tµ0 J1 is the fixed point of Tµ0 · · · Tµm−1 , i.e., is equal to J0 .
Similarly, apply Tµ1 to the fixed point relation
J2 = Tµ2 · · · Tµm−1 Tµ0 Tµ1 J2 ,
to show that Tµ1 J2 is the fixed point of Tµ1 · · · Tµm−1 Tµ0 , i.e., is equal to
J1 , etc.
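
Since each Tµ is affine for a finite-state discounted MDP, the fixed points in part (b) can be computed in closed form and the relations of the hint checked directly. The sketch below (with m = 2 and hypothetical data chosen only for illustration) does this.

```python
import numpy as np

# Hypothetical 2-state discounted MDP data for two stationary policies mu0 and mu1.
alpha = 0.9
g0, g1 = np.array([1.0, 2.0]), np.array([0.5, 3.0])   # one-stage costs under mu0, mu1
P0 = np.array([[0.8, 0.2], [0.3, 0.7]])               # transition matrices
P1 = np.array([[0.5, 0.5], [0.9, 0.1]])
I = np.eye(2)

T0 = lambda J: g0 + alpha * P0 @ J        # T_mu0 (affine in J)
T1 = lambda J: g1 + alpha * P1 @ J        # T_mu1

# Fixed point of T_mu0 T_mu1:  J0 = g0 + alpha*P0 @ (g1 + alpha*P1 @ J0).
J0 = np.linalg.solve(I - alpha**2 * P0 @ P1, g0 + alpha * P0 @ g1)
# Fixed point of T_mu1 T_mu0:  J1 = g1 + alpha*P1 @ (g0 + alpha*P0 @ J1).
J1 = np.linalg.solve(I - alpha**2 * P1 @ P0, g1 + alpha * P1 @ g0)

# The relations of the hint: J0 = T_mu0 J1 and J1 = T_mu1 J0.
print(np.allclose(J0, T0(J1)), np.allclose(J1, T1(J0)))   # True True
```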

2.2 (Asynchronous Convergence Theorem for Time-Varying Maps)

In reference to the asynchronous computation framework of Section 2.6.1, let


{Tt } be a sequence of mappings from R(X) to R(X) that have a common unique
fixed point J ∗ , let Assumption 2.6.1 hold, and assume that there is a sequence
of nonempty subsets {S(k)} ⊂ R(X) with S(k + 1) ⊂ S(k) for all k, and with
the following properties:
(1) Synchronous Convergence Condition: Every sequence {J k } with J k ∈ S(k)
for each k, converges pointwise to J ∗ . Moreover, we have
Tt J ∈ S(k + 1), ∀ J ∈ S(k), k, t = 0, 1, . . . .

(2) Box Condition: For all k, S(k) is a Cartesian product of the form
S(k) = S1 (k) × · · · × Sm (k),
where Sℓ (k) is a set of real-valued functions on Xℓ , ℓ = 1, . . . , m.
Then for every J 0 ∈ S(0), the sequence {J t } generated by the asynchronous
algorithm
$$J_\ell^{t+1}(x) = \begin{cases} T_t\big(J_1^{\tau_{\ell 1}(t)}, \ldots, J_m^{\tau_{\ell m}(t)}\big)(x) & \text{if } t \in R_\ell,\ x \in X_\ell, \\ J_\ell^t(x) & \text{if } t \notin R_\ell,\ x \in X_\ell, \end{cases}$$
[cf. Eq. (2.57)] converges pointwise to J ∗ . Hint: Use the proof of Prop. 2.6.1.
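
A minimal sketch of the iteration above, assuming a hypothetical two-state discounted MDP, Tt = T for all t, and zero communication delays (so each update uses the most recent values of the other components); the synchronous iterates are shown for comparison.

```python
import numpy as np

# Hypothetical 2-state, 2-control discounted MDP, used only to illustrate the iteration.
alpha, rng = 0.9, np.random.default_rng(0)
g = np.array([[1.0, 2.0], [0.5, 3.0]])
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])

def T_component(J, x):
    """(T J)(x) = min_u [ g(x, u) + alpha * sum_y p_xy(u) J(y) ]."""
    return min(g[x, u] + alpha * P[x, u] @ J for u in range(2))

# Synchronous VI, for reference.
J_sync = np.zeros(2)
for _ in range(300):
    J_sync = np.array([T_component(J_sync, x) for x in range(2)])

# Asynchronous VI: at each t a single randomly chosen component x is updated
# (t belongs to R_x), using the latest values of the other components.
J_async = np.zeros(2)
for _ in range(3000):
    x = rng.integers(2)
    J_async[x] = T_component(J_async, x)

print("synchronous:", J_sync, "  asynchronous:", J_async)   # both approximate the fixed point J*
```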
2.3 (Nonmonotonic Contractive Models – Fixed Points of Concave Sup-Norm Contractions)

The purpose of this exercise is to make a connection between our abstract DP


model and the problem of finding the fixed point of a mapping that is a sup-norm
contraction and has concave components. Let T : ℜn → ℜn be a real-valued
function whose n scalar components are concave. Then the components of T can
be represented as

$$(TJ)(x) = \inf_{u \in U(x)} \big\{ F(x,u) - J'u \big\}, \qquad x = 1, \ldots, n, \tag{2.74}$$

where u ∈ ℜn , J ′u denotes the inner product of J and u, F (x, ·) is the conju-
gate convex function of the convex function −(T J)(x), and U (x) = {u ∈ ℜn |
F (x, u) < ∞} is the effective domain of F (x, ·) (for the definition of these terms,
we refer to books on convex analysis, such as [Roc70] and [Ber09]). Assuming
that the infimum in Eq. (2.74) is attained for all x, show how the VI algorithm of
Section 2.6.1 and the PI algorithm of Section 2.6.3 can be used to find the fixed
point of T in the case where T is a sup-norm contraction, but not necessarily
monotone.

2.4 (Discounted Problems with Unbounded Cost per Stage)

Consider a countable-state MDP, where X = {1, 2, . . .}, the discount factor is


α ∈ (0, 1), the transition probabilities are denoted pxy (u) for x, y ∈ X and
u ∈ U (x), and the expected cost per stage is denoted by g(x, u), x ∈ X, u ∈ U (x).
The constraint set U (x) may be infinite. For a positive weight sequence v =
{v(1), v(2), . . .}, we consider the space B(X) of sequences J = {J(1), J(2), . . .}
such that ‖J‖ < ∞, where ‖ · ‖ is the corresponding weighted sup-norm. We
assume the following.
(1) The sequence G = {G1 , G2 , . . .}, where

$$G_x = \sup_{u \in U(x)} \big| g(x,u) \big|, \qquad x \in X,$$

belongs to B(X).
(2) The sequence V = {V1 , V2 , . . .}, where

$$V_x = \sup_{u \in U(x)} \sum_{y \in X} p_{xy}(u)\, v(y), \qquad x \in X,$$

belongs to B(X).
(3) We have

$$\frac{\sum_{y \in X} p_{xy}(u)\, v(y)}{v(x)} \le 1, \qquad \forall\ x \in X,\ u \in U(x).$$
Consider the monotone mappings Tµ and T , given by

$$(T_\mu J)(x) = g\big(x, \mu(x)\big) + \alpha \sum_{y \in X} p_{xy}\big(\mu(x)\big) J(y), \qquad x \in X,$$

$$(TJ)(x) = \inf_{u \in U(x)} \Big\{ g(x,u) + \alpha \sum_{y \in X} p_{xy}(u) J(y) \Big\}, \qquad x \in X.$$

Show that Tµ and T map B(X) into B(X), and are contraction mappings with
modulus α.
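
The contraction property can be sanity-checked numerically. The sketch below (illustrative data only, with a single policy and a hand-picked weight vector v satisfying condition (3)) samples the weighted sup-norm ratio ‖Tµ J − Tµ J ′ ‖ / ‖J − J ′ ‖ over random pairs; it stays below α.

```python
import numpy as np

# Illustrative finite instance: X = {0, 1, 2}, one policy mu, weight vector v,
# and transition probabilities chosen so that sum_y p_xy v(y) <= v(x) (condition (3)).
alpha = 0.9
v = np.array([1.0, 2.0, 4.0])
P_mu = np.array([[1.0,  0.0,  0.0 ],     # expected weight 1.00 <= v(0) = 1
                 [0.5,  0.5,  0.0 ],     # expected weight 1.50 <= v(1) = 2
                 [0.25, 0.25, 0.5 ]])    # expected weight 2.75 <= v(2) = 4
g_mu = np.array([1.0, -2.0, 3.0])
assert np.all(P_mu @ v <= v + 1e-12)

def T_mu(J):
    return g_mu + alpha * P_mu @ J

def weighted_norm(J):
    """Weighted sup-norm ||J|| = max_x |J(x)| / v(x)."""
    return np.max(np.abs(J) / v)

rng = np.random.default_rng(0)
ratios = []
for _ in range(1000):
    J, Jp = rng.normal(size=3) * v, rng.normal(size=3) * v
    ratios.append(weighted_norm(T_mu(J) - T_mu(Jp)) / weighted_norm(J - Jp))
print("largest sampled modulus:", max(ratios), " (<= alpha =", alpha, ")")
```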

2.5 (Solution by Mathematical Programming)

Let the monotonicity and contraction Assumptions 2.1.1 and 2.1.2 hold. Show
that if J ≤ T J and J ∈ B(X), then J ≤ J ∗ , and use this fact to show that
if X = {1, . . . , n} and U (i) is finite for each i = 1, . . . , n, then J ∗ (1), . . . , J ∗ (n)
solves the following problem (in z1 , . . . , zn ):

$$\text{maximize} \ \ \sum_{i=1}^{n} z_i$$

$$\text{subject to} \ \ z_i \le H(i, u, z), \qquad i = 1, \ldots, n, \ u \in U(i),$$

where z = (z1 , . . . , zn ). Note: This is a linear or nonlinear program (depending


on whether H is linear in J or not) with n variables and as many as n × m
constraints, where m is the maximum number of elements in the sets U (i).
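
When H is linear in J (e.g., a finite-state discounted MDP), the above is a linear program that off-the-shelf solvers handle directly. The following sketch (hypothetical data, shown only as an illustration of the formulation) solves it with scipy.optimize.linprog and cross-checks against value iteration.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 2-state, 2-control discounted MDP: H(i, u, z) = g(i, u) + alpha * sum_j p_ij(u) z_j.
alpha = 0.9
g = np.array([[1.0, 2.0], [0.5, 3.0]])
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
n, m = 2, 2

# maximize sum_i z_i  <=>  minimize -sum_i z_i,
# subject to  z_i - alpha * sum_j p_ij(u) z_j <= g(i, u)  for all i and u in U(i).
A_ub, b_ub = [], []
for i in range(n):
    for u in range(m):
        row = -alpha * P[i, u].copy()
        row[i] += 1.0
        A_ub.append(row)
        b_ub.append(g[i, u])
res = linprog(c=-np.ones(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n)
print("J* from the linear program:", res.x)

# Cross-check with value iteration.
J = np.zeros(n)
for _ in range(500):
    J = np.array([min(g[i, u] + alpha * P[i, u] @ J for u in range(m)) for i in range(n)])
print("J* from value iteration:  ", J)
```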

2.6 (Convergence of Nonexpansive Monotone Fixed Point Iterations)

Consider the mapping H of Section 2.1 under the monotonicity Assumption 2.1.1.
Assume that X is a finite set and that instead of the contraction Assumption
2.1.2, H satisfies the following:
(1) For every J ∈ B(X), the function T J belongs to B(X).
(2) For all J, J ′ ∈ B(X), H satisfies

$$\frac{\big| H(x,u,J) - H(x,u,J') \big|}{v(x)} \le \| J - J' \|, \qquad \forall\ x \in X,\ u \in U(x).$$

(3) T has a unique fixed point within B(X), denoted J ∗ .


Show that for every J ∈ B(X), we have ‖T k J − J ∗ ‖ → 0.
3

Semicontractive Models

Contents

3.1. Semicontractive Models and Regular Policies . . . . . . p. 86


3.1.1. Fixed Points, Optimality Conditions, and
Algorithmic Results . . . . . . . . . . . . . . p. 90
3.1.2. Illustrative Example: Deterministic Shortest
Path Problems . . . . . . . . . . . . . . . . p. 97
3.2. Irregular Policies and a Perturbation Approach . . . p. 100
3.2.1. The Case Where Irregular Policies Have Infinite
Cost . . . . . . . . . . . . . . . . . . . . p. 100
3.2.2. The Case Where Irregular Policies Have Finite
Cost - Perturbations . . . . . . . . . . . . . p. 107
3.3. Algorithms . . . . . . . . . . . . . . . . . . . . p. 116
3.3.1. Asynchronous Value Iteration . . . . . . . . . p. 117
3.3.2. Asynchronous Policy Iteration . . . . . . . . . p. 118
3.3.3. Policy Iteration with Perturbations . . . . . . . p. 124
3.4. Notes, Sources, and Exercises . . . . . . . . . . . . p. 125


We will now consider abstract DP models that are intermediate between


the contractive models of Chapter 2, where all stationary policies involve a
contraction mapping, and noncontractive models to be discussed in Chap-
ter 4, where there are no contraction-like assumptions. A representative
example of such an intermediate model is the stochastic shortest path prob-
lem of Example 1.2.6 (SSP for short). In one of the main versions of an
SSP theory, there are two types of policies: those that are proper where
the mapping Tµ is a contraction with respect to a weighted sup-norm, and
those that are improper , where Tµ is not a contraction with respect to any
norm. As noted in Example 1.2.6, results that are comparable to the ones
for discounted finite-state MDP have been obtained with appropriate as-
sumptions that guarantee among others that improper policies are “bad”
in the sense that they have infinite cost from some initial state.
In this chapter we introduce models where, as in SSP problems, poli-
cies are divided into two groups, one of which has favorable characteristics.
We loosely refer to such models as semicontractive to indicate that these
favorable characteristics include contraction-like properties of the mapping
Tµ . To develop a more broadly applicable theory, we replace the notion of
contraction of Tµ with a notion of regularity of µ within an appropriate set
S (roughly, this is a form of “local stability” of Tµ , which ensures that the
cost function Jµ is the unique fixed point of Tµ within S, and that Tµk J
converges to Jµ regardless of the choice of J from within S). We allow that
some policies are regular in this sense while others are not, and impose
further conditions ensuring that there exist optimal policies that are regu-
lar. Under a variety of assumptions, we show results that resemble those
available for SSP problems: that J * is a fixed point of T , that the Bellman
equation J = T J has a unique solution, at least within a suitable class of
functions, and that variants of the VI and PI algorithms are valid.
We note that the term “semicontractive” is not used in a precise
mathematical sense here. Rather it refers qualitatively to a collection of
models where some policies have a regularity/contraction-like property but
others do not. In particular, in this chapter and the next one, we consider
several alternative assumptions for different semicontractive models.
The chapter is organized as follows. In Section 3.1, we introduce
the notion of a regular policy, and we develop some of the basic results.
In Section 3.2, we discuss related lines of analysis to address some major
special cases that bear similarity to SSP problems. In Section 3.3, we focus
on VI and PI-type algorithms.

3.1 SEMICONTRACTIVE MODELS AND REGULAR POLICIES

Our basic model for this chapter and the next one is similar to the one
of Chapter 2, but the assumptions are different. We will maintain the
monotonicity assumption, but we will weaken the contraction assumption,
and we will introduce some other conditions in its place.
In our analysis of this chapter, the optimal cost function J * will typi-
cally be real-valued. However, the cost function Jµ of some policies µ may
take infinite values for some states. To accommodate this, we will use the
set of extended real numbers ℜ∗ = ℜ ∪ {∞, −∞}, and the set of all ex-
tended real-valued functions J : X → ℜ∗ , which we denote by E(X). We
denote by R(X) the set of real-valued functions J : X → ℜ, and by B(X)
the set of real-valued functions J : X → ℜ that are bounded with respect
to a given weighted sup-norm. Throughout this chapter and the next two,
when we write lim, lim sup, or lim inf of a sequence of functions we mean
it to be pointwise. We also write Jk → J to mean that Jk (x) → J (x) for
each x ∈ X; see our notational conventions in Appendix A.
As in Chapters 1 and 2, we introduce the set X of states and the
set U of controls, and for each x ∈ X, the nonempty control constraint
set U (x) ⊂ U . We denote by M the set of all functions µ : X %→ U with
µ(x) ∈ U (x), for all x ∈ X, and by Π the set of nonstationary policies
π = {µ0 , µ1 , . . .}, with µk ∈ M for all k. We refer to a stationary policy
{µ, µ, . . .} simply as µ. We introduce a mapping H : X × U × E(X) → ℜ∗ ,
satisfying the following condition.

Assumption 3.1.1: (Monotonicity) If J, J ′ ∈ E(X) and J ≤ J ′ , then

H(x, u, J ) ≤ H(x, u, J ′ ), ∀ x ∈ X, u ∈ U (x).

The preceding monotonicity assumption will be in effect throughout


this chapter. Consequently, we will not mention it explicitly in various
propositions. We define the mapping T : E(X) → E(X) by

$$(TJ)(x) = \inf_{u \in U(x)} H(x, u, J), \qquad \forall\ x \in X,\ J \in E(X),$$

and for each µ ∈ M the mapping Tµ : E(X) → E(X) by

$$(T_\mu J)(x) = H\big(x, \mu(x), J\big), \qquad \forall\ x \in X,\ J \in E(X).$$

The monotonicity assumption implies the following properties for all J, J ′ ∈
E(X) and k = 0, 1, . . .,

J ≤ J ′   ⇒   T k J ≤ T k J ′ ,   Tµk J ≤ Tµk J ′ ,   ∀ µ ∈ M,

J ≤ T J   ⇒   T k J ≤ T k+1 J,   Tµk J ≤ Tµk+1 J,   ∀ µ ∈ M.


We now define cost functions associated with Tµ and T . In Chapter
2 our starting point was to define Jµ and J * as the unique fixed points of
Tµ and T , respectively, based on the contraction assumption used there.
However, under our assumptions in this chapter this is not possible, so we


use a different definition, which nonetheless is consistent with the one of
Chapter 2 (see Section 2.1, following Prop. 2.1.2). We introduce a function
J¯ ∈ E(X), and we define the infinite horizon cost of a policy in terms of
the limit of its finite horizon costs with J¯ being the cost function at the
end of the horizon.

Definition 3.1.1: Given a function J¯ ∈ E(X), for a policy π ∈ Π


with π = {µ0 , µ1 , . . .}, we define the cost function of π by

Jπ (x) = lim sup (Tµ0 · · · Tµk J¯)(x), ∀ x ∈ X.


k→∞

In the case of a stationary policy µ ∈ M, the cost function of µ is


denoted by Jµ and is given by

Jµ (x) = lim sup (Tµk J¯)(x), ∀ x ∈ X.


k→∞

The optimal cost function J * is given by

J * (x) = inf Jπ (x), ∀ x ∈ X.


π∈Π

An optimal policy π ∗ ∈ Π is one for which Jπ∗ = J * . Note two important


differences from Chapter 2:
(1) Jµ is defined in terms of a pointwise lim sup rather than lim, since we
don’t know whether the limit exists.
(2) Jπ and Jµ in general depend on J¯, so J¯ becomes an important part
of the problem definition.
Similar to Chapter 2, under the assumptions to be introduced in this chap-
ter, stationary policies will turn out to be “sufficient” in the sense that the
optimal cost obtained with nonstationary policies is matched by the one
obtained by stationary ones.

Regular Policies

Our objective in this chapter is to construct an analytical framework with


a strong connection to fixed point theory, based on the idea of separating
policies into those that have “favorable” characteristics and those that do
not. It would then appear that a favorable property for a policy µ is that
Jµ is a fixed point of Tµ . However, Jµ may depend on J¯, even though
Tµ does not depend on J¯. It would thus appear that another favorable
property for µ is that Jµ stays the same if J¯ is changed arbitrarily within
some set S. We express these two properties with the following definition.
Figure 3.1.1. Illustration of S-regular and S-irregular policies. Policy µ is
S-regular because Jµ ∈ S and Tµk J → Jµ for all J ∈ S. Policy µ̄ is S-irregular.

Definition 3.1.2: Given a set of functions S ⊂ E(X), we say that a


stationary policy µ is S-regular if:
(a) Jµ ∈ S and Jµ = Tµ Jµ .
(b) Tµk J → Jµ for all J ∈ S.
A policy that is not S-regular is called S-irregular .

Thus a policy µ is S-regular if the VI iteration corresponding to µ,


Jk+1 = Tµ Jk , represents a dynamic system that has Jµ as its unique equi-
librium within S, and is asymptotically stable in the sense that the iteration
converges to Jµ , starting from any J ∈ S (see Fig. 3.1.1).
For orientation purposes, we note the distinction between the set S
and the problem data: S is an analytical device, and is not part of the
problem’s definition. Its choice, however, can enable analysis and clarify
properties of Jµ and J * . For example, we will later prove local fixed point
statements such as
“J * is the unique fixed point of T within S”
or local region of attraction assertions such as
“the VI sequence {T k J } converges to J * starting from any J ∈ S.”
Results of this type and their proofs depend on the choice of S: they may
hold for some choices but not for others.
Generally, with our selection of S we will aim to differentiate between
S-regular and S-irregular policies in a manner that produces useful results
for the given problem and does not necessitate restrictive assumptions. Ex-
amples of sets S that we will use are R(X), B(X), and subsets of R(X),
B(X), and E(X) involving functions J satisfying J ≥ J * or J ≥ J¯. How-
ever, there is a diverse range of other possibilities, so it makes sense to
Figure 3.1.2. Illustration of S-regular and S-irregular policies for the case where
there is only one state and S = ℜ. There are three mappings Tµ corresponding
to S-irregular policies: one crosses the 45-degree line at multiple points, another
crosses at a single point but at an angle greater than 45 degrees, and the third is
discontinuous and does not cross at all. The mapping Tµ of the ℜ-regular policy
has Jµ as its unique fixed point and satisfies Tµk J → Jµ for all J ∈ ℜ.

postpone making the choice of S more specific. Figure 3.1.2 illustrates the
mappings Tµ of some S-regular and S-irregular policies for the case where
there is a single state and S = ℜ. Figure 3.1.3 illustrates the mapping
Tµ of an S-regular policy µ, where Tµ has multiple fixed points, and upon
changing S, the policy may become S-irregular.
3.1.1 Fixed Points, Optimality Conditions, and Algorithmic
Results
We will now introduce an analytical framework where S-regular policies
are central. Our focus is reflected by our first assumption in the following
proposition, which is that optimal policies can be found among the S-
regular policies, i.e., that for some S-regular µ∗ we have J * = Jµ∗ . This
assumption implies that

J * = Jµ∗ = Tµ∗ Jµ∗ ≥ T Jµ∗ = T J * ,

where the second equality follows from the S-regularity of µ∗ . Thus the
Bellman equation J * = T J * follows if µ∗ attains the infimum in the rela-
tion T J * = inf µ∈M Tµ J * , which is our second assumption. In addition to
existence of solution of the Bellman equation, the regularity of µ∗ implies
a uniqueness assertion and a convergence result for the VI algorithm, as
shown in the following proposition.
Figure 3.1.3. Illustration of a mapping Tµ where there is only one state and S
is a subset of the real line. Here Tµ has two fixed points, Jµ and J̃. If S is as
shown, µ is S-regular. If S is enlarged to include J̃, µ becomes S-irregular.

Proposition 3.1.1: Let S be a given subset of E(X). Assume that:


(1) There exists an S-regular policy µ∗ that is optimal, i.e., Jµ∗ = J ∗ .
(2) The policy µ∗ satisfies Tµ∗ J ∗ = T J ∗ .
Then the following hold:
(a) The optimal cost function J * is the unique fixed point of T within
the set {J ∈ S | J ≥ J * }.
(b) We have T k J → J * for every J ∈ S with J ≥ J * .
(c) An S-regular policy µ that satisfies Tµ J * = T J * is optimal. Con-
versely if µ is an S-regular optimal policy, it satisfies Tµ J * =
T J *.

Proof: (a) The proof uses a more refined version of the argument preced-
ing the statement of the proposition. We first show that any fixed point J
of T that lies in S satisfies J ≤ J * . Indeed, if J = T J , then for the optimal
S-regular policy µ∗ , we have J ≤ Tµ∗ J , so in view of the monotonicity of
Tµ∗ and the S-regularity of µ∗ ,
J ≤ lim Tµk∗ J = Jµ∗ = J * .
k→∞

Thus the only function within {J ∈ S | J ≥ J * } that can be a fixed point


of T is J * . Using the optimality and S-regularity of µ∗ , and condition (2),


we have
J * = Jµ∗ = Tµ∗ Jµ∗ = Tµ∗ J * = T J * ,

so J * is a fixed point of T . Finally, J * ∈ S since J * = Jµ∗ and µ∗ is


S-regular, so J * is the unique fixed point of T within {J ∈ S | J ≥ J * }.
(b) For the optimal S-regular policy µ∗ and any J ∈ S with J ≥ J * , we
have
Tµk∗ J ≥ T k J ≥ T k J * = J * , k = 0, 1, . . . .

Taking the limit as k → ∞, and using the fact

lim Tµk∗ J = Jµ∗ = J * ,


k→∞

which holds since µ∗ is S-regular and optimal, we see that T k J → J * .


(c) If µ satisfies Tµ J * = T J * , then using part (a), we have Tµ J * = J *
and hence limk→∞ Tµk J * = J * . If µ is in addition S-regular, then Jµ =
limk→∞ Tµk J * = J * and µ is optimal. Conversely, if µ is optimal and S-
regular, then Jµ = J * and Jµ = Tµ Jµ , which combined with J * = T J * [cf.
part (a)], yields Tµ J * = T J * . Q.E.D.

Note that given condition (1) of the proposition, condition (2) is


equivalent to the seemingly weaker assumption that some S-regular µ sat-
isfies Tµ J * = T J * . To see this note that if this latter condition holds
together with condition (1), we have

Jµ∗ = Tµ∗ Jµ∗ = Tµ∗ J * ≥ T J * = Tµ J * = Tµ Jµ∗ ≥ lim Tµk Jµ∗ = Jµ ,


k→∞

where the first two equalities follow from the S-regularity and optimality of
µ∗ , the second inequality follows from the monotonicity of Tµ , and the last
equality follows from the S-regularity of µ. Since µ∗ is optimal, it follows
that µ is also optimal, so equality holds in the above relation, and we have
Tµ∗ J * = T J * , implying condition (2) as stated in the proposition.
Let us also show an equivalent variation of the preceding proposition,
for problems where the validity of Bellman’s equation J * = T J * can be
independently verified. We will later encounter models where this can be
done (e.g., the perturbation model of Section 3.2.2, and the monotone
increasing and monotone decreasing models of Section 4.3).

Proposition 3.1.2: Let S be a given subset of E(X). Assume that:


(1) There exists an S-regular policy µ∗ that is optimal, i.e., Jµ∗ = J ∗ .
(2) We have J * = T J * .
Then the assumptions and the conclusions of Prop. 3.1.1 hold.

Proof: We have
J * = Jµ∗ = Tµ∗ Jµ∗ = Tµ∗ J * ,
so, using also the assumption J * = T J * , we obtain Tµ∗ J * = T J * . Hence
condition (2) of Prop. 3.1.1 holds. Q.E.D.

The following proposition is a special case of Prop. 3.1.1. It applies


when functions in S are real-valued, and through its condition (1), it re-
quires that S-irregular policies have a certain “infinite cost-type” property.
Conditions of this type will appear prominently in Section 3.2 [see Assump-
tions 3.2.1(c) and 3.2.3(b)].

Proposition 3.1.3: Let S be a given subset of R(X). Assume that:


(1) There exists an optimal S-regular policy, and for every S-irregular
policy µ, there is at least one state x ∈ X such that

lim sup (Tµk J * )(x) = ∞.


k→∞

(2) There exists a policy µ such that Tµ J * = T J * .


Then the assumptions and the conclusions of Prop. 3.1.1 hold.

Proof: In view of the remark following the proof of Prop. 3.1.1, it will
suffice to show that the policy µ of condition (2) is S-regular. Let µ satisfy
Tµ J * = T J * , and let µ∗ be an optimal S-regular policy. Then for all k ≥ 1,

Jµ∗ = Tµ∗ Jµ∗ ≥ T Jµ∗ = Tµ Jµ∗ ≥ Tµk Jµ∗ ,

where the first equality follows from the definition of an S-regular policy,
and the second inequality follows from the monotonicity of Tµ . If µ is
S-irregular, by taking the limit as k → ∞ in the preceding relation, the
right-hand side tends to ∞ for some x ∈ X, while the left-hand side is finite
since Jµ∗ ∈ S ⊂ R(X) - a contradiction. Thus µ is S-regular. Q.E.D.

The examples of Fig. 3.1.2 show how Bellman’s equation may fail in
the absence of existence of an optimal S-regular policy [cf. condition (1)
of Props. 3.1.1-3.1.3]. Consider for instance a problem where there is only
one policy µ that is S-irregular and Tµ has no fixed point.

Figure 3.1.4. Illustration of why condition (2) is essential in Prop. 3.1.1. Here
there is only one state and S = ℜ. There are two stationary policies: µ for which
Tµ is a contraction, so µ is ℜ-regular, and µ̄ for which Tµ̄ has multiple fixed
points, so µ̄ is ℜ-irregular. Moreover, Tµ̄ is discontinuous from above at Jµ̄ as
shown. Here, it can be verified that Tµ0 · · · Tµk J̄ ≥ Jµ for all µ0 , . . . , µk and k,
so that Jπ ≥ Jµ for all π and the S-regular policy µ is optimal. However, µ does
not satisfy Tµ J ∗ = T J ∗ [cf. condition (2) of Prop. 3.1.1] and we have J ∗ ≠ T J ∗ .
Here the conclusions (a) and (c) of Prop. 3.1.1 are violated.

Another example that illustrates the need for existence of an optimal


S-regular policy is the classical blackmailer problem, described in Exercise
3.1. This is a one-state problem, where Tµ is a contraction for all µ, so
all policies are ℜ-regular, but we have J * = −∞, so there is no optimal
stationary policy. Here Bellman’s equation, J = T J , has no solution within
ℜ (although we do have J * = T J * ). Moreover, it can be shown that
there exists a nonstationary optimal policy for this problem; see [Ber12a]
(Example 3.2.1).
Figure 3.1.4 shows what may happen if condition (2) of Prop. 3.1.1
is violated. The figure shows how we can then have J * ≠ T J * , and un-
derscores the importance of existence of an S-regular µ satisfying the op-
timality equation Tµ J * = T J * for a strong connection of our framework
with fixed point theory. This condition will be assumed either directly or
indirectly, via other conditions, throughout our analysis of semicontractive
models.
The two conditions of Props. 3.1.1-3.1.3 may not be easily verified in
a given problem. However, they can often be guaranteed through other
reasonable conditions, in which case Props. 3.1.1-3.1.3 can be brought to
bear on the analysis. We will encounter several such instances in Sections
3.2, 4.4, and 4.5.
Regardless of their assumptions, some of the conclusions (a)-(c) of
Props. 3.1.1-3.1.3 are not as strong as one would like. In particular, part (a)
Figure 3.1.5. Illustration of multiplicity of fixed points J satisfying J ≤ J ∗ ,
under the assumptions of Props. 3.1.1-3.1.2. Here there is only one state, so S is
a subset of the real line, and there is only one policy µ (so Jµ = J ∗ ), which is
S-regular for S = {J | J ≥ Jµ }.

asserts uniqueness of the fixed point of T only within the set {J ∈ S | J ≥


J * }. One may hope that there would be a unique fixed point, at least within
S. Similarly, part (b) asserts the convergence of T k J to J * only for J in
the set {J ∈ S | J ≥ J * }, and one would like to assert convergence starting
anywhere within S. These results cannot be improved in the absence of
additional conditions, as can be illustrated with simple examples involving
a single policy; see Fig. 3.1.5. The deterministic shortest path problem with
zero length cycles provides an interesting practical context where there may
exist additional fixed points J of T that do not satisfy J ≥ J * (see the next
subsection). However, with additional conditions, one may be able to show
uniqueness of the fixed point of T within S, and demonstrate an enlarged
region of initial conditions J from which T k J converges to J * (see for
example Sections 3.2.1 and 4.4.1).
We finally note a subtle point in part (c) of Props. 3.1.1-3.1.2, which
leaves open the possibility that the optimality condition Tµ J * = T J * is
satisfied by a nonoptimal S-irregular µ. Indeed this can happen as can be
shown with simple examples; see Fig. 3.1.6. †

† In the important case where J̄ ≤ J ∗ , we can show that the condition


Tµ J ∗ = T J ∗ implies that µ is optimal, regardless of whether it is S-regular or
not. The reason is that in this case we have

Tµk J̄ ≤ Tµk J ∗ = T k J ∗ = J ∗ ,

and taking the lim sup as k → ∞, we obtain Jµ ≤ J ∗ , so µ is optimal. Note that


the condition J̄ ≤ J ∗ holds for the monotone increasing models of Section 4.3.

Figure 3.1.6. Illustration of a nonoptimal S-irregular policy µ̄ that satisfies the
optimality condition Tµ̄ J ∗ = T J ∗ . Here there is only one state and S = ℜ. There
are two policies: µ for which Tµ is a contraction, so µ is ℜ-regular, and µ̄ for
which Tµ̄ has two fixed points, so µ̄ is ℜ-irregular. For J̄ as shown in the figure,
µ̄ is nonoptimal, yet it satisfies Tµ̄ J ∗ = T J ∗ .

Policy Iteration

We established in Chapter 2 the significance of PI and its variations as


a major class of computational methods for abstract DP. However, for
semicontractive models, the convergence properties of PI are complicated,
and under the conditions of Props. 3.1.1-3.1.2, the sequence of generated
policies may not converge to an optimal policy.
What can be proved is that if µ and µ̄ are S-regular policies that
satisfy the policy improvement equation Tµ̄ Jµ = T Jµ , then Jµ̄ ≤ Jµ . To
see this note that by using the S-regularity of µ, we have

Jµ = Tµ Jµ ≥ T Jµ = Tµ̄ Jµ ≥ Tµ̄k Jµ , k ≥ 1,

where the last inequality holds by the monotonicity of Tµ̄ . By taking the
limit as k → ∞ in the preceding relation and using the S-regularity of µ̄,
we obtain Jµ ≥ Jµ̄ .
The policy improvement relation Jµ ≥ Jµ̄ shows that the PI algo-
rithm, when restricted to S-regular policies, generates a nonincreasing
sequence {Jµk }, but this does not guarantee that Jµk ↓ J * . Moreover,
guaranteeing that the policies µk are S-regular may not be easy since the
equation Tµ̄ Jµ = T Jµ may be satisfied by an S-irregular µ̄, in which case
there is no guarantee that Jµ̄ ≤ Jµ , and an oscillation between policies
may occur. This can be seen from the example of Fig. 3.1.7, where there
Figure 3.1.7. Oscillation of PI between two nonoptimal policies: an S-regular
policy µ and an S-irregular policy µ̄ satisfying

Tµ̄ Jµ = T Jµ ,    Tµ Jµ̄ = T Jµ̄ .

Here there is only one state and S = ℜ. In addition to µ and µ̄, there is a third
policy µ∗ , which is S-regular and optimal. In this example all the assumptions
and conclusions of Props. 3.1.1-3.1.2 are satisfied.
is oscillation between two nonoptimal policies: an ℜ-regular policy µ and
an ℜ-irregular policy µ̄ satisfying

Tµ̄ Jµ = T Jµ ,    Tµ Jµ̄ = T Jµ̄ .

A similar example of PI oscillation will be given for deterministic short-


est path problems with zero length cycles in the next subsection. Thus
additional assumptions or modifications of the PI algorithm are needed to
improve its reliability. We will address this issue in Section 3.3.2, as well
as in Sections 4.4 and 4.5 of the next chapter.

3.1.2 Illustrative Example: Deterministic Shortest Path Problems
In this section, we will highlight some of the analytical issues raised in
the preceding subsection through the classical deterministic shortest path
problem described in Example 1.2.7. We have a graph of n nodes x =
1, . . . , n, plus the destination 0, and an arc length axy for each directed
arc (x, y). Here X = {1, . . . , n}. A policy chooses at state/node x ∈ X
an outgoing arc from x. Thus the controls available at x can be identified


with the outgoing neighbors of x [the nodes u such that (x, u) is an arc].
The corresponding mapping H is given by

$$H(x, u, J) = \begin{cases} a_{xu} + J(u) & \text{if } u \neq 0, \\ a_{x0} & \text{if } u = 0, \end{cases}$$

and J̄ = 0.
     We will consider S-regularity with S = ℜn . A policy µ defines a
graph whose arcs are (x, µ(x)), x = 1, . . . , n. If this graph contains a cycle
with m arcs, x1 → x2 → · · · → xm → x1 , with length L = ax1 x2 + · · · +
axm−1 xm + axm x1 , then it can be seen that for all k ≥ 1, we have

(Tµkm J )(x1 ) = kL + J (x1 ).

Thus such a policy cannot be ℜn -regular (and if L ≠ 0, its cost function
Jµ has some infinite entries, so it is outside ℜn ). By contrast if a policy
defines a graph that is acyclic, it can be verified to be ℜn -regular.
Let us assume now that all cycles have positive cost (L > 0 above),
and that every node is connected to the destination with some path (this
is a common assumption in deterministic shortest path problems, which
will be revisited and generalized considerably in Section 3.2.1). Then every
ℜn -irregular policy has infinite cost starting from some node/state, and it
can be shown that there exists an optimal ℜn -regular policy. Thus Prop.
3.1.3 applies, and guarantees that J * is the unique fixed point of T within
the set {J | J ≥ J * }, and that the VI algorithm converges to J * starting
only from within that set. Actually the uniqueness of the fixed point and
the convergence of VI can be shown within the entire space ℜn . This is
well-known in shortest path theory, and will be covered by results to be
given in Section 3.2.1.
In the other extreme case where there is a cycle of negative cost, there
are ℜn -irregular policies that are optimal and no ℜn -regular policy can be
optimal. Thus Props. 3.1.1-3.1.3 do not apply in this case.
The case where there is a cycle with zero cost exhibits the most com-
plex behavior and will be illustrated for the example of Fig. 3.1.8. Here
X = {1, 2}, U (1) = {0, 2}, U (2) = {1}, J¯(1) = J¯(2) = 0.
There are two policies:
µ : where µ(1) = 0, corresponding to the path 2 → 1 → 0,
µ̄ : where µ̄(1) = 2, corresponding to the cycle 1 → 2 → 1,
and the corresponding mapping H is

$$H(x, u, J) = \begin{cases} b & \text{if } x = 1,\ u = 0, \\ a + J(2) & \text{if } x = 1,\ u = 2, \\ a + J(1) & \text{if } x = 2,\ u = 1. \end{cases} \tag{3.1}$$

Figure 3.1.8. A deterministic shortest path problem with nodes 1, 2, and desti-
nation 0. Arc lengths are shown next to the arcs.

The Bellman equation is given by

J (1) = min{b, a + J (2)},    J (2) = a + J (1),        (3.2)

while the VI algorithm takes the form

Jk+1 (1) = min{b, a + Jk (2)},    Jk+1 (2) = a + Jk (1).        (3.3)

Here the policy µ is ℜ2 -regular [the VI for µ is Jk+1 (1) = b, Jk+1 (2) =
a + Jk (1)], while the policy µ̄ is ℜ2 -irregular [the VI for µ̄ is Jk+1 (1) =
a + Jk (2), Jk+1 (2) = a + Jk (1)].
In the case where the cycle has zero length, so a = 0, there are two
possibilities, b ≤ 0 and b > 0, which we will consider separately:
(a) a = 0, b ≤ 0: Here the ℜ2 -regular policy µ is optimal, and Prop. 3.1.1
applies. Bellman’s equation (3.2) has the unique solution

J * (1) = b,    J * (2) = b,

within the set {J ∈ ℜ2 | J ≥ J * }, and the VI algorithm (3.3) converges
to J * from any starting J0 ≥ J * . However, it can be verified that Bellman’s
equation has multiple solutions within ℜ2 : the set of solutions is

{J | J (1) = J (2), J ≤ J * },

and VI starting from one of these solutions will keep generating that
solution. Moreover we can verify that PI may oscillate between the
optimal ℜ2 -regular policy and the ℜ2 -irregular policy (which is nonop-
timal if b < 0). Indeed, the ℜ2 -irregular policy µ̄ is evaluated as
Jµ̄ (1) = Jµ̄ (2) = 0, while the ℜ2 -regular policy µ is evaluated as
Jµ (1) = Jµ (2) = b, so in the policy improvement phase of the algo-
rithm, we have

µ̄(1) ∈ arg min{b, Jµ (2)},    µ(1) ∈ arg min{b, Jµ̄ (2)}.
Thus policy improvement starting with µ̄ yields µ, and starting with
µ may yield µ̄, with the oscillatory sequence {µ, µ̄, µ, µ̄, . . .} resulting.
Note that here we have Tµ̄ J * = T J * , so the optimality condition of
Prop. 3.1.1(c) is attained by µ̄, which is nonoptimal when b < 0.
(b) a = 0, b > 0: Here the unique optimal policy is the ℜ2 -irregular µ̄,
and Props. 3.1.1-3.1.3 do not apply. The policy generates a sequence
of cycles 1 → 2 → 1, rather than a path that leads to the destination.
In this case, the abstract DP model based on the mapping H of Eq.
(3.1) cannot be used to model the shortest path problem (its optimal
solution does not yield paths from nodes 1 and 2 to the destination).
However, we may still be interested in finding an optimal policy within
the class of ℜ2 -regular policies. It turns out that we may address
this problem by using a perturbation approach, which is described in
Section 3.2.2. In particular, we will add a small amount δ > 0 to the
length of each arc. This has a strong effect on the problem: the cost
function of the ℜ2 -irregular policy becomes infinite, while the cost
function of the ℜ2 -regular policy changes by an O(δ) amount. Thus
with δ > 0, we obtain the acyclic ℜ2 -regular policy µ.
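
The behavior described in cases (a) and (b) above is easy to reproduce numerically. The sketch below (illustrative only) runs the VI iteration (3.3) from two starting points for the zero-length-cycle case a = 0, b = −1, and then a plain PI scheme; the rule of breaking ties in favor of the cycle in the improvement step is an assumption made here to exhibit the oscillation.

```python
import numpy as np

a, b = 0.0, -1.0     # zero-length cycle with b <= 0: case (a) above

def T(J):
    """The VI mapping (3.3): J(1) = min{b, a + J(2)}, J(2) = a + J(1)."""
    return np.array([min(b, a + J[1]), a + J[0]])

# VI from two starting points.
for J0 in ([0.0, 0.0], [-5.0, -5.0]):
    J = np.array(J0)
    for _ in range(100):
        J = T(J)
    print("VI from", J0, "->", J)   # from (0,0): J* = (-1,-1); (-5,-5) is itself a fixed point

# PI, with ties in the improvement step broken in favor of the cycle (u = 2).
def evaluate(u1):
    """Cost of the stationary policy with control u1 at node 1 (J-bar = 0; valid for a = 0)."""
    return np.array([b, a + b]) if u1 == 0 else np.array([0.0, 0.0])

u1 = 0                                   # start from the regular policy mu
for k in range(6):
    J = evaluate(u1)
    u1 = 0 if b < a + J[1] else 2        # improvement at node 1; ties go to the cycle
    print("k =", k, " improved control at node 1:", u1)
# The printed controls alternate 2, 0, 2, 0, ...: PI oscillates between mu and mu-bar.
```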

3.2 IRREGULAR POLICIES AND A PERTURBATION APPROACH

In this section we will use the model and the results of the preceding section
as a starting point and motivation for the analysis of various special cases.
In particular, we will introduce various analytical techniques and alterna-
tive conditions, in order to strengthen the results of Props. 3.1.1-3.1.3, and
to extend the existing theory of the SSP problem of Example 1.2.6.

3.2.1 The Case Where Irregular Policies Have Infinite Cost

A weakness of Props. 3.1.1-3.1.3 is that it is sometimes difficult to verify


their assumptions, particularly the existence of an optimal S-regular policy.
In this section we will use the following assumption that combines elements
of the assumptions of these propositions, indirectly guarantees the existence
of an optimal S-regular policy, and yields stronger results.

Assumption 3.2.1: We are given a subset S ⊂ R(X) such that the


following hold:
(a) S contains J¯, and has the property that if J1 , J2 are two functions
in S, then S contains all functions J with J1 ≤ J ≤ J2 .
(b) The function Jˆ given by

Jˆ(x) = inf Jµ (x), x ∈ X,


µ: S -regular

belongs to S.
(c) For each S-irregular policy µ and each J ∈ S, there is at least
one state x ∈ X such that

lim sup (Tµk J )(x) = ∞. (3.4)


k→∞

(d) The control set U is a metric space, and the set

{u ∈ U (x) | H(x, u, J ) ≤ λ}

is compact for every J ∈ S, x ∈ X, and λ ∈ ℜ.


(e) For each sequence {Jm } ⊂ S with Jm ↑ J for some J ∈ S we
have

lim H(x, u, Jm ) = H (x, u, J ) , ∀ x ∈ X, u ∈ U (x).


m→∞

(f) For each function J ∈ S, there exists a function J ′ ∈ S such that
J ′ ≤ J and J ′ ≤ T J ′ .

Note that the preceding assumption requires that S is a set of real-


valued functions; this is similar to Prop. 3.1.3, and it allows us to make use
of Eq. (3.4) in the subsequent analysis. Part (a) holds for some common
choices of S, such as when S = R(X) or S = B(X), or when S is a
subset of R(X) or B(X) of the form S = {J ∈ R(X) | J ≥ J ′ } or
S = {J ∈ B(X) | J ≥ J ′ } for a given function J ′ in R(X) or B(X),
respectively. For a finite number n of states, R(X) and B(X) can both
be identified with ℜn , while otherwise R(X) may be simpler as it does not
require the use of a norm. On the other hand, for an infinite number of
states, the choice between R(X) and B(X) may have a substantial impact
on whether a policy µ is S-regular or not. In particular, Tµk J may converge
to Jµ for all J ∈ B(X) but not for all J ∈ R(X), because Tµ J ∈ B(X) for
J ∈ B(X) while Tµ J ∉ R(X) for some J ∈ R(X). This can happen even
if Tµ is a contraction mapping with respect to the norm of B(X). Thus
some care is needed in deciding whether S should be a subset of R(X) or
a subset of B(X).
Since part (b) requires that Jˆ belongs to S, and is therefore real-


valued, it also implies that there exists at least one S-regular policy [other-
wise the infimum in part (b) is taken over the empty set, and Jˆ(x) = ∞ for
all x]. If S satisfies part (a), then part (b) holds if and only if there exists
at least one S-regular policy, and also there exists a function in S that is
a lower bound to the cost function of all S-regular policies. The existence
of such a lower bound is essential, as can be shown by the blackmailer
problem of Exercise 3.1.
Part (c) asserts among others that an S-irregular policy cannot be
optimal, and is consistent with our analytical approach for semicontractive
models, which relies on the dominance of S-regular policies (cf. Prop. 3.1.3).
Part (e) is a mild technical continuity assumption that is needed for the
subsequent analysis. Part (f) is also of technical nature, but it may not be
satisfied in some problems of interest, a notable case being multiplicative
models discussed in Example 1.2.8, where

$$H(x, u, J) = \sum_{y \in X} p_{xy}(u)\, g(x, u, y)\, J(y),$$

S contains only strictly positive functions, and the function g may take
values less than 1; see also the affine monotonic and exponential cost SSP
models of Section 4.5.
The compactness part (d) of the semicontraction Assumption 3.2.1
plays a key role for asserting the existence of an optimal S-regular policy, as
well as for various proof arguments (see Exercise 3.1 for counterexamples).
It implies that for every J ∈ S, the infimum in the equation

$$(TJ)(x) = \inf_{u \in U(x)} H(x, u, J), \tag{3.5}$$

is attained for all x ∈ X, and it also implies that for every J ∈ S, there
exists a policy µ such that Tµ J = T J . This will be shown as part of the
proof of the following proposition.
The compactness condition of Assumption 3.2.1(d) can be verified in
a few interesting cases involving both finite and infinite state and control
spaces:
(1) The case where U is a finite set.
(2) Cases where for each x, H(x, u, J ) depends on J only through its
values J (y) for y in a finite set Yx . For an illustration, consider a
mapping like the one of the SSP Example 1.2.6:

$$H(x, u, J) = g(x, u) + \sum_{y \in Y_x} p_{xy}(u) J(y).$$

Then the infimum in Eq. (3.5) is attained if U (x) is compact, g(x, ·)


is lower semicontinuous, and pxy (·) is continuous for each y ∈ Yx ,
since then H(x, ·, J ) is lower semicontinuous as a function of u. This


covers important cases of finite-state and countable-state MDP with
compact control spaces (see [BeT91]).
(3) Cases of stochastic optimal control problems such as those of Example
1.2.1, under conditions involving a continuous state space, a compact
constraint set U (x), and a problem structure implying that H(x, ·, J )
is continuous. However, in such cases one must often require that
policies obey additional measurability restrictions, and to this end a
more complex mathematical formulation is needed to address these
restrictions. Such a formulation and corresponding analysis is given
for abstract contractive DP models in Chapter 5; see also [BeS78],
Ch. 6, and [JaC06] for SSP problems. The extension to the semicon-
tractive models of this section, while in principle straightforward, has
not been worked out.
We will show the following proposition, which is the main result of
this section.

Proposition 3.2.1: Let Assumption 3.2.1 hold. Then:


(a) The optimal cost function J * is the unique fixed point of T within
the set S.
(b) We have T k J → J * for all J ∈ S. Moreover, there exists an
optimal S-regular policy.
(c) A policy µ is optimal if and only if Tµ J * = T J * .
(d) For any J ∈ S, if J ≤ T J we have J ≤ J * , and if J ≥ T J we
have J ≥ J * .

In comparing this proposition with Props. 3.1.1-3.1.3 of Section 3.1,


we see that it requires more conditions (cf. Assumption 3.2.1). However,
the two assumptions of Props. 3.1.1-3.1.3 require some prior analysis for
verification. Assumption 3.2.1 provides a reasonably convenient way to
verify these two assumptions in the case where S consists of real-valued
functions, but goes further with additional technical conditions, and in
doing so it leads to stronger conclusions. These are the uniqueness of the
fixed point of T within S (not just within the set {J ∈ S | J ≥ J * }), and
the convergence of the VI sequence {T k J } starting from any J ∈ S (not
just starting from J ∈ S with J ≥ J * ).
The proof of Prop. 3.2.1 is long and is developed through several
lemmas. These lemmas will also help to illuminate the implications of the
various parts of Assumption 3.2.1, and to identify the roles of these parts in
the major steps of the proof. The lemmas culminate with showing that the
function Jˆ of Assumption 3.2.1(b) is the unique fixed point of T , and that
any policy µ satisfying Tµ Jˆ = T Jˆ is optimal within the set of S-regular


policies. Then the proposition is proved by first showing that T k J → Jˆ
for all J ∈ S, and then by using this to prove that Jˆ = J * and that there
exists an optimal S-regular policy.

Lemma 3.2.1: Let Assumption 3.2.1(d) hold. For every J ∈ S, there


exists a policy µ such that Tµ J = T J .

Proof: Since H is in general extended real-valued, for a given x ∈ X, we


need to consider separately the cases (T J )(x) < ∞ and (T J )(x) = ∞.
Consider any x ∈ X with (T J )(x) < ∞, and let {λm (x)} be a decreasing
scalar sequence with

$$\lambda_m(x) \downarrow \inf_{u \in U(x)} H(x, u, J).$$

The set

$$U_m(x) = \big\{ u \in U(x) \mid H(x, u, J) \le \lambda_m(x) \big\}$$

is nonempty, and by Assumption 3.2.1(d), it is compact. The set of points
attaining the infimum of H(x, u, J ) over U (x) is $\cap_{m=0}^{\infty} U_m(x)$, and is there-
fore nonempty. Let ux be a point in this intersection. Then we have

H(x, ux , J ) ≤ λm (x), ∀ m ≥ 0. (3.6)

Consider the policy µ formed by the point ux , for x with (T J )(x) < ∞,
and by any point ux ∈ U (x) for x with (T J )(x) = ∞. Taking the limit
in Eq. (3.6) as m → ∞ shows that µ satisfies (Tµ J )(x) = (T J )(x) for
x with (T J )(x) < ∞. For x with (T J )(x) = ∞, we also have trivially
(Tµ J )(x) = (T J )(x), so Tµ J = T J . Q.E.D.

Lemma 3.2.2: Let Assumption 3.2.1(c) hold. A policy µ that satis-


fies Tµ J ≤ J for some J ∈ S is S-regular.

Proof: By the monotonicity of Tµ , we have Tµk J ≤ J , for all k ≥ 1. Thus


lim supk→∞ Tµk J ≤ J and since J is real-valued, from Assumption 3.2.1(c)
it follows that µ cannot be S-irregular. Q.E.D.

Lemma 3.2.3: Let Assumption 3.2.1(a),(d),(e) hold. Then if J ′ ∈ S
and T k J ′ ↑ J for some J ∈ S, we have J = T J .
Proof: We first note that since J ′ ∈ S and J ∈ S, from Assumption
3.2.1(a) it follows that T k J ′ ∈ S. We fix x ∈ X, and consider the sets

$$U_k(x) = \big\{ u \in U(x) \mid H(x, u, T^k J') \le J(x) \big\}, \qquad k = 0, 1, \ldots, \tag{3.7}$$

which are compact by Assumption 3.2.1(d). Let uk ∈ U (x) be such that

$$H(x, u_k, T^k J') = \inf_{u \in U(x)} H(x, u, T^k J') = (T^{k+1} J')(x) \le J(x);$$

(such a point exists by Lemma 3.2.1). Then uk ∈ Uk (x).
     For every k, consider the sequence {ui }i≥k . Since T k J ′ ↑ J, it follows
using the monotonicity of H, that for all i ≥ k,

H(x, ui , T k J ′ ) ≤ H(x, ui , T i J ′ ) ≤ J (x).

Therefore from the definition (3.7), we have {ui }i≥k ⊂ Uk (x). Since Uk (x)
is compact, all the limit points of {ui }i≥k belong to Uk (x) and at least one
limit point exists. Hence the same is true for the limit points of the whole
sequence {ui }. Thus if ũ is a limit point of {ui }, we have

$$\tilde{u} \in \cap_{k=0}^{\infty} U_k(x).$$

By Eq. (3.7), this implies that

H(x, ũ, T k J ′ ) ≤ J (x),    k = 0, 1, . . . .

Taking the limit as k → ∞ and using Assumption 3.2.1(e), we obtain

(T J )(x) ≤ H(x, ũ, J ) ≤ J (x).

Thus, since x was chosen arbitrarily within X, we have T J ≤ J . To show
the reverse inequality, we write T k J ′ ≤ J , apply T to this inequality, and
take the limit as k → ∞, so that J = limk→∞ T k+1 J ′ ≤ T J . It follows
that J = T J . Q.E.D.

Lemma 3.2.4: Let Assumption 3.2.1(b),(c),(d) hold. Then:


(a) The function Jˆ of Assumption 3.2.1(b),

Jˆ(x) = inf Jµ (x), x ∈ X,


µ: S -regular

is the unique fixed point of T within S.


(b) Every policy µ satisfying Tµ Jˆ = T Jˆ is optimal within the set
of S-regular policies, i.e., µ is S-regular and Jµ = Jˆ. Moreover,
there exists at least one such policy.
Proof: For all S-regular policies µ, we have Jµ ≥ Jˆ, and by applying Tµ


to this relation, we have

Jµ = Tµ Jµ ≥ Tµ Jˆ ≥ T Jˆ ,

where the first equality follows from the S-regularity of µ. Taking the
infimum in this relation over all S-regular policies µ and using the definition
of Jˆ, we obtain Jˆ ≥ T Jˆ .
To prove the reverse relation, let µ be any policy such that Tµ Jˆ = T Jˆ
(there exists one by Lemma 3.2.1). In view of the inequality Jˆ ≥ T Jˆ just
shown, we have Jˆ ≥ Tµ Jˆ, so µ is S-regular by Lemma 3.2.2. Thus we have,
using also the monotonicity of Tµ ,

Jˆ ≥ T Jˆ = Tµ Jˆ ≥ lim Tµk Jˆ = Jµ .
k→∞

From the definition of Jˆ, it follows that equality holds throughout in the
preceding relation, so µ is optimal within the class of S-regular policies,
and Jˆ is a fixed point of T .
Next we show that Jˆ is the unique fixed point of T within S. Indeed
if J ′ ∈ S is another fixed point, we choose an S-regular µ such that Jµ = Jˆ
(there exists one by the preceding argument), and we have

$$J' = TJ' \le T_\mu J' \le \lim_{k\to\infty} T_\mu^k J' = J_\mu = \hat{J}.$$

Let µ′ be such that J ′ = T J ′ = Tµ′ J ′ (cf. Lemma 3.2.1). Then µ′ is
S-regular (cf. Lemma 3.2.2), and we have

$$J' = \lim_{k\to\infty} T_{\mu'}^k J' = J_{\mu'}.$$

Combining the preceding two relations, we have J ′ = Jµ′ ≤ Jˆ, which in
view of the definition of Jˆ, implies that J ′ = Jˆ. Q.E.D.

Proof of Prop. 3.2.1: (a), (b) We will first prove that T k J → Jˆ for all
J ∈ S, and we will use this to prove that Jˆ = J * and that there exists
an optimal S-regular policy. Thus both parts (a) and (b) will be shown
simultaneously.
We fix J ∈ S, and choose J ′ ∈ S such that J ′ ≤ J and J ′ ≤ T J ′
[cf. Assumption 3.2.1(f)]. By the monotonicity of T , we have T k J ′ ↑ J˜ for
some J˜ ∈ E(X). Let µ be an S-regular policy such that Jµ = Jˆ [cf. Lemma
3.2.4(b)]. Then we have, using again the monotonicity of T ,

$$\tilde{J} = \lim_{k\to\infty} T^k J' \le \limsup_{k\to\infty} T^k J \le \lim_{k\to\infty} T_\mu^k J = J_\mu = \hat{J}. \tag{3.8}$$

Since J ′ and Jˆ belong to S, and J ′ ≤ J˜ ≤ Jˆ, Assumption 3.2.1(a) implies
that J˜ ∈ S. From Lemma 3.2.3, it then follows that J˜ = T J˜ . Since Jˆ
is the unique fixed point of T within S [cf. Lemma 3.2.4(a)], it follows


that J˜ = Jˆ. Thus equality holds throughout in Eq. (3.8), proving that
limk→∞ T k J = Jˆ.
There remains to show that Jˆ = J * and that there exists an optimal
S-regular policy. To this end, we note that by the monotonicity Assumption
3.1.1, for any policy π = {µ0 , µ1 , . . .}, we have
Tµ0 · · · Tµk−1 J¯ ≥ T k J¯.
Taking the limit of both sides as k → ∞, we obtain
Jπ ≥ lim T k J¯ = Jˆ,
k→∞

where the equality follows since T k J → Jˆ for all J ∈ S (as shown earlier),
and J¯ ∈ S [cf. Assumption 3.2.1(a)]. Thus for all π ∈ Π, Jπ ≥ Jˆ = Jµ ,
implying that the policy µ that is optimal within the class of S-regular
policies is optimal over all policies, and that Jˆ = J * .
(c) If µ is optimal, then Jµ = J * ∈ S, so by Assumption 3.2.1(c), µ is S-
regular and therefore Tµ Jµ = Jµ . Hence, Tµ J * = Tµ Jµ = Jµ = J * = T J * .
Conversely, if J * = T J * = Tµ J * , µ is S-regular (cf. Lemma 3.2.2), so
J * = limk→∞ Tµk J * = Jµ . Therefore, µ is optimal.
(d) If J ∈ S and J ≤ T J , by repeatedly applying T to both sides and using
the monotonicity of T , we obtain J ≤ T k J for all k. Taking the limit as
k → ∞ and using the fact T k J → J * [cf. part (b)], we obtain J ≤ J * . The
proof that J ≥ T J implies J ≥ J * is similar. Q.E.D.

3.2.2 The Case Where Irregular Policies Have Finite Cost - Perturbations
In this section, we consider problems where some S-irregular policies may
have finite cost for all states, so Prop. 3.2.1 does not apply. An example is
SSP problems where all one-stage costs are nonpositive and J * (x) > −∞
for all x. The following example describes a classical problem of this type.

Example 3.2.1 (Search Problem)

Consider a situation where the objective is to move within a finite set of


states searching for a state to stop while minimizing the expected cost. We
formulate this as a DP problem with finite state space X, and two controls
at each x ∈ X: stop, which yields an immediate cost s(x), and continue, in
which case we move to a state f (x, w) at cost g(x, w), where w is a random
variable with given distribution that may depend on x. The mapping H has
the form

$$H(x, u, J) = \begin{cases} s(x) & \text{if } u = \text{stop}, \\ E\big\{ g(x, w) + J\big(f(x, w)\big) \big\} & \text{if } u = \text{continue}, \end{cases}$$
and the function J̄ is identically 0.


Letting S = R(X), we note that the policy µ that stops nowhere is S-
irregular, since Tµ cannot have a unique fixed point (adding any multiple of
the unit function to J adds the same multiple to Tµ J). This policy may violate As-
sumption 3.2.1(c) of the preceding subsection, because its cost may be finite
for all states. A special case where this occurs is when g(x, w) ≡ 0 for all x.
Then the cost function of µ is identically 0.
Note that case (a) of the three-node shortest path problem given in
Section 3.1.2, which involves a zero length cycle, is a special case of the
search problem just described. Therefore, the anomalous behavior we saw
there (nonconvergence of VI starting from J0 < J ∗ and oscillation of PI) may
also arise in the context of the present example.

In this section, we show that the results of Props. 3.1.1-3.1.3 (unique-


ness of fixed point of T within the set {J ∈ S | J ≥ J * } and convergence of
VI starting from within that set) hold, provided that there exists an optimal
S-regular policy, and the assumptions are suitably modified by introduc-
ing a positive perturbation into Tµ . The idea is that with a perturbation,
the cost functions of S-irregular policies may increase disproportionately
relative to the cost functions of the S-regular policies, thereby making the
problem more amenable to analysis.
In particular, for each δ ≥ 0 and policy µ, we consider the mappings
Tµ,δ and Tδ given by
! "
(Tµ,δ J )(x) = H x, µ(x), J + δ, x ∈ X, Tδ J = inf Tµ,δ J.
µ∈M

We define the corresponding cost functions of policies π = {µ0 , µ1 , . . .} ∈ Π


and µ ∈ M, and optimal cost function Jδ* by
Jπ,δ(x) = lim supk→∞ (Tµ0,δ · · · Tµk,δ J̄)(x),      Jµ,δ(x) = lim supk→∞ (T^k_{µ,δ} J̄)(x),

Jδ* = infπ∈Π Jπ,δ.
We refer to the problem associated with the mappings Tµ,δ as the δ-
perturbed problem.
The following proposition shows that if the δ-perturbed problem is
“well-behaved” with respect to the S-regular policies, then its cost function
Jδ* can be used to approximate the optimal cost function over the S-regular
policies only.

Proposition 3.2.2: Given a set S ⊂ E(X), assume that:


(1) For every δ > 0, there exists an optimal S-regular policy for the
δ-perturbed problem.
(2) If µ is an S-regular policy, we have

Jµ,δ ≤ Jµ + wµ (δ), ∀ δ > 0,



where wµ is a function such that limδ↓0 wµ (δ) = 0.


Then
limδ↓0 Jδ* = infµ: S-regular Jµ,

where Jδ* is the optimal cost function of the δ-perturbed problem.

Proof: For all δ > 0, we have by using condition (2),

infµ: S-regular Jµ ≤ Jµ∗_δ ≤ Jµ∗_δ,δ = Jδ* ≤ Jµ′,δ ≤ Jµ′ + wµ′(δ),   ∀ µ′: S-regular,

where µ∗δ is an optimal S-regular policy of the δ-perturbed problem [cf.


condition (1)]. By taking the limit as δ ↓ 0 and then the infimum over all
µ′ that are S-regular, it follows that

infµ: S-regular Jµ ≤ limδ↓0 Jδ* ≤ infµ: S-regular Jµ.
Q.E.D.
The preceding proposition does not require the existence of an optimal S-regular policy for the original problem. It applies even if the optimal
cost function J * does not belong to S and we may have limδ↓0 Jδ* (x) >
J * (x) for some x ∈ X. This is illustrated by the following example.

Example 3.2.2

Consider the case of a single state where J¯ = 0, and there are two policies,
µ∗ and µ, with

Tµ∗ J = J,      Tµ J = 1,      ∀ J ∈ ℜ.

Here we have Jµ∗ = 0 and Jµ = 1. Moreover, it can be verified that for any set
S ⊂ ℜ that contains the point 1, the optimal policy µ∗ is not S-regular while
the suboptimal policy µ is S-regular. For δ > 0, the δ-perturbed problem has
optimal cost Jδ∗ = 1 + δ, the unique solution of the Bellman equation

J = Tδ J = min{1, J} + δ,

and its optimal policy is the S-regular policy µ (see Fig. 3.2.1). We also have

limδ↓0 Jδ* = Jµ = 1 > 0 = J*,

consistent with Prop. 3.2.2.



Figure 3.2.1: The mapping T and its perturbed version Tδ in Example 3.2.2.
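As a quick numerical check of the situation depicted in Fig. 3.2.1, the following sketch (a minimal illustration only) iterates Tδ J = min{1, J} + δ and T J = min{1, J} from several starting points: Tδ drives every starting point to its unique fixed point 1 + δ, while T leaves every J ≤ 1 where it is.

def T(J):
    return min(1.0, J)

def T_delta(J, delta):
    return min(1.0, J) + delta

delta = 0.1
for J0 in (-5.0, 0.0, 3.0):
    J = J0
    for _ in range(200):
        J = T_delta(J, delta)
    print(f"T_delta iterates from {J0:5.1f} converge to {J:.4f}")  # 1 + delta

for J0 in (-5.0, 0.0, 3.0):
    J = J0
    for _ in range(200):
        J = T(J)
    print(f"T iterates from {J0:5.1f} settle at {J:.4f}")  # any J0 <= 1 stays put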

A simple way to guarantee that limδ↓0 Jδ* = J * is to assume that


there exists an optimal S-regular policy for the unperturbed problem. This
will also guarantee Bellman’s equation J * = T J * , under some additional
conditions which are collected in the following assumption.

Assumption 3.2.2: We are given a set S ⊂ E(X) such that the


following hold:
(a) There exists an S-regular policy µ∗ that is optimal, i.e., Jµ∗ = J ∗ ,
and satisfies
Jµ∗ ,δ ≤ Jµ∗ + w(δ), ∀ δ > 0,

where w is a function such that limδ↓0 w(δ) = 0.


(b) The optimal cost function Jδ* of the δ-perturbed problem belongs
to S and satisfies the Bellman equation Jδ* = Tδ Jδ* for each δ > 0.
(c) For each sequence {Jm } ⊂ S with Jm ↓ J for some J ≥ J * , we
have
Tµ Jm ↓ Tµ J, ∀ µ ∈ M.

Under the preceding assumption we will show that J * = T J * . This


will allow us to use Prop. 3.1.2 and yield the results of Props. 3.1.1-3.1.3.

Proposition 3.2.3: Let Assumption 3.2.2 hold. Then:


(a) The optimal cost function J * is the unique fixed point of T within
the set {J ∈ S | J ≥ J * }.

(b) We have T k J → J * for every J ∈ S with J ≥ J * .


(c) An S-regular policy µ that satisfies Tµ J * = T J * is optimal. Con-
versely if µ is an S-regular optimal policy, it satisfies Tµ J * =
T J *.

Proof: By the monotonicity of Tµ , we clearly have J * ≤ Jδ* , so using


Assumption 3.2.2(a), we obtain for all δ > 0,
J * ≤ Jδ* ≤ Jµ∗ ,δ ≤ Jµ∗ + w(δ) = J * + w(δ),
and limδ↓0 Jδ* = J * . The Bellman equation Jδ* = Tδ Jδ* is written as
Jδ* = infµ∈M Tµ Jδ* + δ e.      (3.9)

From this relation, the fact limδ↓0 Jδ* = J * just shown, and Assumption
3.2.2(c), we have
J* = limδ↓0 infµ∈M Tµ Jδ* ≤ infµ∈M limδ↓0 Tµ Jδ* = infµ∈M Tµ J* = T J*.      (3.10)

We also have
T J* ≤ T Jδ* = infµ∈M Tµ Jδ* = Jδ* − δ e,   ∀ δ > 0,

where the last equality follows from Eq. (3.9). By taking the limit as
δ ↓ 0, we obtain T J * ≤ J * , which combined with Eq. (3.10), shows that
J * = T J * . Thus the assumptions of Prop. 3.1.2 are satisfied and the
conclusions follow from this proposition. Q.E.D.

The following example illustrates the preceding line of analysis. For


another application, see Section 4.5.3 on exponential cost models.

Example 3.2.3 (Search Problem Continued)

Consider the search problem of Example 3.2.1, assuming that the expected costs for not stopping are nonnegative, E{g(x, w)} ≥ 0 for all x. Then for
all policies µ that don’t stop with probability 1 starting from state x we have
Jµ,δ (x) = ∞ for all δ > 0, since an expected cost of at least δ is incurred at
each transition in the δ-perturbed problem.
If the costs s(x) for stopping are nonpositive for all x, then from known
results on SSP problems (cf. Example 1.2.6 and [BeT91]), it follows that there
exists an optimal R(X)-regular policy, which stops with probability 1 starting
from every state. In this case, Assumption 3.2.2 holds, and Prop. 3.2.3 applies.
If some stopping costs s(x) are positive, it may happen that each optimal
policy is S-irregular, and there is no optimal R(X)-regular policy. In this
case, however, there is an optimal R(X)-regular policy for the δ-perturbed
problem, for all δ > 0, and Prop. 3.2.2 applies. Thus limδ↓0 Jδ* yields the best
that can be achieved when restricted to policies that stop with probability 1.

An Alternative Line of Analysis

A weakness of Assumption 3.2.2(b) is that it requires the verification of


Bellman’s equation Jδ* = Tδ Jδ* for the δ-perturbed problem. An alternative
line of analysis makes instead regularity assumptions on the policies of this
problem, and is based on the following definition.

Definition 3.2.1: Given a set S ⊂ E(X) and a scalar δ ≥ 0, we say


that a stationary policy µ is δ-S-regular if Jµ,δ is the unique fixed
point of Tµ,δ within S, and

T^k_{µ,δ} J → Jµ,δ,   ∀ J ∈ S.

A policy that is not δ-S-regular is called δ-S-irregular .

Thus µ is δ-S-regular if and only if it is 0-S-regular for the δ-perturbed


problem. Our earlier notions of S-regular and S-irregular policies are equiv-
alent to 0-S-regular and 0-S-irregular policies, respectively. To illustrate,
consider the case where Tµ is a weighted sup-norm contraction (e.g., a
proper µ in a finite-state SSP problem). Then assuming also that Tµ,δ
maps B(X) into B(X) [e.g., when e ∈ B(X)], Tµ,δ is a weighted sup-norm
contraction for all δ ≥ 0, µ is δ-B(X)-regular for all δ ≥ 0.
Note that δ-S-regularity for all δ > 0 does not imply 0-S-regularity,
nor does it imply that Jµ,δ ↓ Jµ as δ ↓ 0. This can be illustrated by a
single-state example, a variation of Example 3.2.2, where J¯ = 0, and there
is a single policy µ with Tµ J = min{1, J}. Here for S = ℜ, µ is δ-S-regular
for all δ > 0, but it is not 0-S-regular, and in fact it can be verified that
Jµ = 0 while Jµ,δ = 1 + δ (cf. Fig. 3.2.1).
We introduce the following assumption, whose conditions resemble
the ones of Prop. 3.1.3. In particular, parts (a)-(c) of the assumption are
patterned after conditions (1) and (2) of Prop. 3.1.3.

Assumption 3.2.3: We are given a set S ⊂ R(X) that contains the


function J¯, and is such that the following hold:
(a) There exists an S-regular policy µ∗ that is optimal, i.e., Jµ∗ = J ∗ .
(b) Each S-regular policy is δ-S-regular for every δ > 0. Moreover,
if µ is a δ-S-irregular policy for a given δ > 0, then there is at
least one state x ∈ X such that

lim supk→∞ (T^k_{µ,δ} Jµ∗,δ)(x) = ∞,      (3.11)

where µ∗ is the optimal S-regular policy of (a).


(c) For each δ > 0, there exists a policy µ such that Tµ Jµ∗ ,δ =
T Jµ∗ ,δ , where µ∗ is the optimal S-regular policy of (a).
(d) For each sequence {Jµk ,δk }, where for all k, µk is S-regular, δk >
0, δk ↓ 0, and Jµk ,δk ↓ J ,

limk→∞ H(x, u, Jµk,δk) = H(x, u, J),   ∀ x ∈ X, u ∈ U(x).

Note that similar to Assumption 3.2.1, the preceding assumption re-


quires that S is a set of real-valued functions; this allows us to take advan-
tage of Eq. (3.11). We first show a preliminary lemma.

Lemma 3.2.5: Let Assumption 3.2.3 hold. Then:


(a) If 0 ≤ δ ≤ δ′, then for all policies µ, k ≥ 1, and functions J, J′ ∈ E(X) with J ≤ J′, we have T^k_{µ,δ} J ≤ T^k_{µ,δ′} J′.

(b) A policy that is δ-S-irregular for some δ > 0 is δ′-S-irregular for all δ′ ∈ (δ, ∞).
(c) A policy that is δ-S-regular for some δ > 0 is δ′-S-regular for all δ′ ∈ (0, δ).
(d) For every S-regular policy µ and sequence {δk } with δk > 0 for
all k, δk ↓ 0, we have Jµ,δk ↓ Jµ .
(e) Let µ̄ be a policy that satisfies Tµ̄ Jµ,δ = T Jµ,δ, where δ > 0 and µ is a δ-S-regular policy. Then

Tµ̄,δ Jµ,δ = Tµ̄ Jµ,δ + δ e = T Jµ,δ + δ e ≤ Tµ Jµ,δ + δ e = Tµ,δ Jµ,δ = Jµ,δ.

Proof: (a) For k = 1, we have Tµ,δ J ≤ Tµ,δ′ J ≤ Tµ,δ′ J′. The proof is completed by induction.
(b) Assume, to arrive at a contradiction, that µ is δ-S-irregular for some δ > 0, and is δ′-S-regular for some δ′ > δ. Then, if µ∗ is the optimal S-regular policy of Assumption 3.2.3(a), by the δ-S-irregularity of µ and Assumption 3.2.3(b), there exists x ∈ X such that

Jµ,δ′(x) = limk→∞ (T^k_{µ,δ′} Jµ∗,δ)(x) ≥ lim supk→∞ (T^k_{µ,δ} Jµ∗,δ)(x) = ∞,

where the first equality holds because µ is δ′-S-regular and Jµ∗,δ ∈ S, and

the inequality uses part (a). This contradicts the δ′-S-regularity of µ, which implies that Jµ,δ′ belongs to S and is therefore real-valued.
(c) Follows from part (b).
(d) Since µ is an S-regular policy, it is δ-S-regular for all δ > 0 by Assump-
tion 3.2.3(b). The sequence {Jµ,δk } is monotonically nonincreasing, belongs
to S, and is bounded below by Jµ , so Jµ,δk ↓ J + for some J + ≥ Jµ ≥ J * .
Hence, by Assumption 3.2.3(d), we have J+ ∈ S and for all x ∈ X,

H(x, µ(x), J+) = limk→∞ H(x, µ(x), Jµ,δk) = limk→∞ (Jµ,δk(x) − δk) = J+(x),

where the second equality follows from the definition of Jµ,δk as the fixed
point of Tµ,δk . Thus J + satisfies Tµ J + = J + and is therefore equal to Jµ ,
since µ is S-regular. Hence Jµ,δk ↓ Jµ .
(e) In the desired relation, repeated below for convenience,

Tµ̄,δ Jµ,δ = Tµ̄ Jµ,δ + δ e = T Jµ,δ + δ e ≤ Tµ Jµ,δ + δ e = Tµ,δ Jµ,δ = Jµ,δ,

the inequality is evident, the second equality is an assumption, and the other equalities follow from the definitions of Tµ̄,δ and Tµ,δ, and the fixed point property Jµ,δ = Tµ,δ Jµ,δ. Q.E.D.

We have the following proposition, whose conclusions are identical to


the ones of the earlier Prop. 3.2.3.

Proposition 3.2.4: Under Assumption 3.2.3 the conclusions of Prop.


3.2.3 hold.

Proof: We will show that J * = T J * . The proof will then follow from Prop.
3.1.2. Let µ∗ be the optimal S-regular policy of Assumption 3.2.3(a), and
let {δk } be a positive sequence such that δk ↓ 0. Using Assumption 3.2.3(c),
we may choose a policy µk such that

Tµk Jµ∗ ,δk = T Jµ∗ ,δk .

Using Lemma 3.2.5(e) with µ̄ = µk and µ = µ∗, which applies since µ∗ is δ-S-regular for all δ > 0 [cf. Assumption 3.2.3(b)], we have for all m ≥ 1,

T^m_{µk,δk} Jµ∗,δk ≤ Tµk,δk Jµ∗,δk ≤ T Jµ∗,δk + δk e ≤ Jµ∗,δk,

where the first inequality follows from the monotonicity of Tµk ,δk . Taking
the limit as m → ∞, and using Assumption 3.2.3(b) [cf. Eq. (3.11)], it
follows that µk is δk-S-regular, and we have

Jµ∗ ≤ Jµk ≤ Jµk ,δk ≤ T Jµ∗ ,δk + δk e ≤ Jµ∗ ,δk , (3.12)



where the second inequality follows from Lemma 3.2.5(a). Since Jµ∗ ,δk ↓
Jµ∗ [cf. Lemma 3.2.5(d)], by taking the limit as k → ∞ in Eq. (3.12), we
obtain
Jµ∗ = limk→∞ T Jµ∗,δk.      (3.13)

We thus obtain

(Tµ∗ Jµ∗)(x) = Jµ∗(x)
            = limk→∞ infu∈U(x) H(x, u, Jµ∗,δk)
            ≤ infu∈U(x) limk→∞ H(x, u, Jµ∗,δk)
            = infu∈U(x) H(x, u, Jµ∗)
            = (T Jµ∗)(x)
            ≤ (Tµ∗ Jµ∗)(x),

where the first equality is the Bellman equation for the S-regular policy µ∗, the second equality is Eq. (3.13), and the third equality follows from Assumption 3.2.3(d) and the fact Jµ∗,δk ↓ Jµ∗ [cf. Lemma 3.2.5(d)]. Thus equality holds throughout above and we obtain Jµ∗ = T Jµ∗. Since µ∗ is optimal, we
obtain J * = T J * , and the conclusions follow from Prop. 3.1.2. Q.E.D.

To see what may happen if there is no optimal S-regular policy, even


though there exists a δ-S-regular policy for all δ > 0, consider the example
given following the Definition 3.2.1 of a δ-S-regular policy. Here there is a
single state, J¯ = 0, and there is a single policy µ with

Tµ J = T J = min{1, J },

and Jµ = J* = 0. Then for S = ℜ, µ is S-irregular, and Assumptions


3.2.3(a) and 3.2.3(b) are violated. As a result, contrary to the assertion of
Prop. 3.2.4, the set of fixed points of T is {J | −∞ < J ≤ 1} and contains points J < J* = 0, while VI starting from every J ≠ 0 does not converge to J*.
The following example addresses a class of SSP problems where the
perturbation approach applies and yields interesting results.
Example 3.2.4 (Stochastic Shortest Path Problems with an Optimal Proper Policy)

Consider the finite-spaces SSP problem of Example 1.2.6, and let S = R(X).
We assume that there exists an optimal proper policy, and we will show that
Assumption 3.2.3 is satisfied, so that Prop. 3.2.4 applies.
Indeed, according to known results for SSP problems discussed in Ex-
ample 1.2.6 (e.g., [BeT96], Prop. 2.2, [Ber12a], Prop. 3.3.1), a policy µ is
proper (stops with probability 1 starting from any x ∈ X) if and only if Tµ
is a contraction with respect to a weighted sup-norm. It follows that for a

proper policy µ, Tµ,δ is a weighted sup-norm contraction for all δ ≥ 0, so µ


is δ-S-regular and Jµ,δ ∈ R(X). Moreover for an improper policy µ we have
Jµ (x) > −∞ for all x (since there exists an optimal policy that is proper and
hence its cost function is real-valued). Thus if δ > 0, we obtain
lim supk→∞ (T^k_{µ,δ} J)(x) = ∞,   ∀ J ∈ R(X);

this is because an additional cost of δ is incurred each time that the policy
does not stop. Since for the optimal proper policy µ∗ , we have Jµ∗ ,δ ∈
R(X), Assumption 3.2.3(b) holds. In addition the conditions (c) and (d) of
Assumption 3.2.3 are clearly satisfied.
Thus if there exists an optimal proper policy, Assumption 3.2.3 holds, and the results of Prop. 3.2.4 apply. In particular, J* is the unique fixed point of T within the set {J ∈ R(X) | J ≥ J*}, and the VI algorithm converges to J* starting from any function J within this set. These results also apply to
the search problem of Example 3.2.1, assuming that there exists an optimal
policy that stops with probability 1.
We finally note that similar to the search problem, if we just assume that
there exists at least one proper policy, while J ∗ (x) > −∞ for all x ∈ X, Prop.
3.2.2 applies and shows that limδ↓0 Jδ∗ yields the best that can be achieved
when restricted to proper policies only.

The results for the preceding SSP example cannot be improved, in


the sense that uniqueness of the fixed point of T within R(X) cannot be
shown. This can be verified using case (a) of the shortest path example of
Section 3.1.2. Moreover, as shown by the same example, the PI algorithm
may oscillate between an optimal and a nonoptimal policy. This motivates
modifications of the PI framework, which we will discuss in Section 3.3.3.
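As a rough numerical illustration of this point (using a hypothetical one-node instance rather than the three-node example of Section 3.1.2 itself), consider a single nondestination node where we may stop at cost b = −1 (a proper, hence S-regular, policy) or traverse a zero-length self-cycle (an improper policy with finite cost). The sketch below shows that T has many fixed points below J*, so VI started below J* stalls, while the δ-perturbed mapping has a unique fixed point.

# Hypothetical one-node shortest path: stop at cost b, or loop at cost 0.
b = -1.0          # with b < 0 the stopping (proper) policy is optimal, J* = b

def T(J, delta=0.0):
    return min(b + delta, J + delta)   # min over {stop, loop}, both perturbed

# Every J <= b is a fixed point of T, so VI started below J* stalls;
# the delta-perturbed mapping has the single fixed point b + delta.
for delta in (0.0, 0.1):
    J = -5.0
    for _ in range(1000):
        J = T(J, delta)
    print(f"delta = {delta}: VI limit from J0 = -5.0 is {J}")

J = 2.0                                 # starting above J*, plain VI does converge
for _ in range(10):
    J = T(J)
print("delta = 0.0: VI limit from J0 = 2.0 is", J)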

3.3 ALGORITHMS

In this section, we will discuss VI and PI algorithms for finding J * and


an optimal policy under the assumptions of the preceding section. We
have already shown that the VI algorithm converges to the optimal cost
function J * for any starting function J ∈ S in the case of Assumption 3.2.1
(cf. Prop. 3.2.1), and also for any starting function J ∈ S with J ≥ J * in
the case of Assumption 3.2.3 (cf. Prop. 3.2.4). We will discuss asynchronous
versions of VI under these two assumptions in Section 3.3.1, and will prove
satisfactory convergence properties.
In Section 3.3.2, we will show that there is a valid version of the PI
algorithm, which starting from an S-regular µ0 , generates a sequence of
S-regular policies {µk } such that Jµk → J * . We will briefly discuss this
algorithm, and then focus on a modified version of PI that is unaffected
by the presence of S-irregular policies. This algorithm is similar to the PI
algorithm of Section 2.6.3, and can also be implemented in a distributed
asynchronous environment. Finally, we will discuss in Section 3.3.3 a ver-
sion of PI that is based on the perturbation approach of Section 3.2.2.

3.3.1 Asynchronous Value Iteration

Let us consider the model of Section 2.6.1 for asynchronous distributed


computation of the fixed point of a mapping T , and the asynchronous
distributed VI method described there. The model involves a partition of X
into disjoint nonempty subsets X1 , . . . , Xm , and a corresponding partition
of J as J = (J1, . . . , Jm), where Jℓ is the restriction of J on the set Xℓ.
We consider a network of m processors, each updating asynchronously corresponding components of J. In particular, we assume that Jℓ is updated only by processor ℓ, and only for times t in a selected subset Rℓ of iterations. Moreover, as in Section 2.6.1, processor ℓ uses components Jj supplied by other processors j ≠ ℓ with communication “delays” t − τℓj(t) ≥ 0:

J^{t+1}_ℓ(x) = (T(J_1^{τℓ1(t)}, . . . , J_m^{τℓm(t)}))(x)   if t ∈ Rℓ, x ∈ Xℓ,      (3.14)
J^{t+1}_ℓ(x) = J^t_ℓ(x)                                      if t ∉ Rℓ, x ∈ Xℓ.

Under the assumptions of Section 3.2, we can prove convergence by


using the asynchronous convergence theorem (cf. Prop. 2.6.1), and the fact
that T is monotone and has J * as its unique fixed point within the ap-
propriate set. We will consider two types of conditions, corresponding
to the Assumptions of Sections 3.2.1 and 3.2.2, respectively. In the first
case, we will choose S = B(X), while in the second case we will choose S = {J ∈ B(X) | J ≥ J*}. The reason for using B(X) instead of R(X) is that it may make it easier for policies to be S-regular, since we allow an infinite state space (cf. the remarks following Assumption 3.2.1).
Consider first the case where Assumption 3.2.1 holds with S = B(X), and assume that the continuous updating and information renewal Assumption 2.6.1 holds. Assume further that we have two functions V and V̄ in S such that

V ≤ T V ≤ T V̄ ≤ V̄,      (3.15)

so that, by Prop. 3.2.1, T^k V ≤ J* ≤ T^k V̄ for all k, and

T^k V ↑ J*,      T^k V̄ ↓ J*.

Then we can show asynchronous convergence of the VI algorithm (3.14), starting from any function J^0 with V ≤ J^0 ≤ V̄.
Indeed, let us apply Prop. 2.6.1 with the sets S(k) given by

S(k) = {J ∈ S | T^k V ≤ J ≤ T^k V̄},   k = 0, 1, . . . .

The sets S(k) satisfy S(k + 1) ⊂ S(k) in view of Eq. (3.15) and the mono-
tonicity of T . Using Prop. 3.2.1, we also see that S(k) satisfy the syn-
chronous convergence and box conditions of Prop. 2.6.1. Thus, together
with Assumption 2.6.1, all the conditions of Prop. 2.6.1 are satisfied, and
the convergence of the algorithm follows starting from any J 0 ∈ S(0).
Consider next the case where Assumption 3.2.3 holds with S = {J ∈ B(X) | J ≥ J*}. In this case we use the sets S(k) given by

S(k) = {J ∈ S | J* ≤ J ≤ T^k V̄},   k = 0, 1, . . . ,

where V̄ is a function in S with J* ≤ T V̄ ≤ V̄. These sets satisfy the synchronous convergence and box conditions of Prop. 2.6.1, and we can similarly show asynchronous convergence to J* of the generated sequence {J^t} starting from any J^0 ∈ S(0).
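The sketch below illustrates the mechanics of the asynchronous iteration (3.14) on a hypothetical two-state discounted problem; the transition data and discount factor are illustrative assumptions chosen so that the fixed point is easy to check, while the asynchronous bookkeeping (own-component updates at selected times, stale reads of the other components) is as in Eq. (3.14).

import numpy as np

# Hypothetical 2-state, 2-control discounted problem (illustrative data only).
alpha = 0.9
g = np.array([[1.0, 2.0], [0.5, 3.0]])          # g(x, u)
P = np.array([[[0.8, 0.2], [0.3, 0.7]],         # P(x' | x, u)
              [[0.5, 0.5], [0.9, 0.1]]])

def H(x, u, J):
    return g[x, u] + alpha * P[x, u] @ J

def T_component(x, J):
    return min(H(x, u, J) for u in range(2))

J = np.zeros(2)
history = [J.copy()]                             # past iterates, read with delays
for t in range(300):
    ell = t % 2                                  # processor ell updates at times in R_ell
    delay = min(3, t)                            # communication delay t - tau(t)
    J_delayed = history[-1 - delay].copy()
    J_delayed[ell] = J[ell]                      # own component is always current
    J_new = J.copy()
    J_new[ell] = T_component(ell, J_delayed)     # cf. Eq. (3.14)
    J = J_new
    history.append(J.copy())
print(J)    # approaches the unique fixed point J* of T despite the stale reads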

3.3.2 Asynchronous Policy Iteration

In this section, we focus on PI methods, under Assumption 3.2.1 and some


additional assumptions to be introduced shortly. We first discuss briefly
a natural form of PI algorithm, which generates S-regular policies exclu-
sively. Let µ0 be an initial S-regular policy [there exists one by Assumption
3.2.1(b)]. At the typical iteration k, we have an S-regular policy µk , and
we compute a policy µk+1 such that Tµk+1 Jµk = T Jµk (this is possible by
Lemma 3.2.1). Then µk+1 is S-regular, by Lemma 3.2.2, and we have
Jµk = Tµk Jµk ≥ T Jµk = Tµk+1 Jµk ≥ limm→∞ T^m_{µk+1} Jµk = Jµk+1.      (3.16)

We can thus construct a sequence of S-regular policies {µk } and a cor-


responding nonincreasing sequence {Jµk }. Under some additional mild
conditions it is then possible to show that Jµk ↓ J * (see Exercise 3.6).
Unfortunately, when there are S-irregular policies, the preceding PI
algorithm is somewhat limited, because an initial S-regular policy may not
be known, and also because when asynchronous versions of the algorithm
are implemented, it is difficult to guarantee that all the generated policies
are S-regular. In what follows in this section, we will discuss a PI algorithm
that works in the presence of S-irregular policies, and can operate in a
distributed asynchronous environment. We will need a few assumptions
that are in addition to the ones of Section 3.2.1. For analytical simplicity,
we include in these assumptions finiteness of the state and control spaces.

Assumption 3.3.1: The set S is equal to R(X), and Assumption


3.2.1 holds with this choice of S. Furthermore, the following hold:
(a) H(x, u, J ) is real-valued for all J ∈ S, x ∈ X, and u ∈ U (x).
(b) X and U are finite sets.
(c) For all scalars r > 0 and functions J ∈ S, we have

H(x, u, J + r e) ≤ H(x, u, J) + r,   ∀ x ∈ X, u ∈ U(x),      (3.17)

where e is the unit function.

In view of the requirement that S = R(X) with X being a finite set,


part (c) of the preceding assumption is a nonexpansiveness condition for
H(x, u, ·), which also implies continuity of H(x, u, ·).
Similar to Section 2.6.3, we introduce a new mapping that is parametri-
zed by µ and can be shown to have a common fixed point for all µ. It
operates on a pair (V, Q) where:
• V is a function with a component V (x) for each x.
• Q is a function with a component Q(x, u) for each pair (x, u) with
u ∈ U (x).
The mapping produces a pair

(M Fµ(V, Q), Fµ(V, Q)),

where
• Fµ(V, Q) is a function with a component Fµ(V, Q)(x, u) for each (x, u), defined by

Fµ(V, Q)(x, u) = H(x, u, min{V, Qµ}),      (3.18)

where for any Q and µ, we denote by Qµ the function of x defined by

Qµ(x) = Q(x, µ(x)),   x ∈ X,

and for any two functions V1 and V2, we denote by min{V1, V2} the function of x given by

min{V1, V2}(x) = min{V1(x), V2(x)},   x ∈ X.

• M Fµ(V, Q) is a function with a component (M Fµ(V, Q))(x) for each x, where M is the operator of pointwise minimization over u:

(M Q)(x) = minu∈U(x) Q(x, u),

so that

(M Fµ(V, Q))(x) = minu∈U(x) Fµ(V, Q)(x, u).

We consider an algorithm that is similar to the asynchronous PI al-


gorithm given in Section 2.6.3 for contractive models. It applies asyn-
chronously the mapping M Fµ (V, Q) for local policy improvement and up-
date of V and µ, and the mapping Fµ (V, Q) for local policy evaluation
and update of Q. The algorithm involves a partition of the state space
into sets X1, . . . , Xm, and assignment of each subset Xℓ to a processor ℓ ∈ {1, . . . , m}. For each ℓ, there are two infinite disjoint subsets of times

Rℓ, R̄ℓ ⊂ {0, 1, . . .}, corresponding to policy improvement and policy evaluation iterations, respectively. At time t, each processor ℓ operates on V^t(x), Q^t(x, u), and µ^t(x), only for x in its “local” state space Xℓ. In particular, at each time t, each processor ℓ does one of the following:

(a) Local policy improvement: If t ∈ Rℓ, processor ℓ sets for all x ∈ Xℓ,

V^{t+1}(x) = minu∈U(x) H(x, u, min{V^t, Q^t_{µ^t}}) = (M Fµ^t(V^t, Q^t))(x),      (3.19)

sets µ^{t+1}(x) to a u that attains the minimum, and leaves Q unchanged, i.e., Q^{t+1}(x, u) = Q^t(x, u) for all x ∈ Xℓ and u ∈ U(x).

(b) Local policy evaluation: If t ∈ R̄ℓ, processor ℓ sets for all x ∈ Xℓ and u ∈ U(x),

Q^{t+1}(x, u) = H(x, u, min{V^t, Q^t_{µ^t}}) = Fµ^t(V^t, Q^t)(x, u),      (3.20)

and leaves V and µ unchanged, i.e., V^{t+1}(x) = V^t(x) and µ^{t+1}(x) = µ^t(x) for all x ∈ Xℓ.

(c) No local change: If t ∉ Rℓ ∪ R̄ℓ, processor ℓ leaves Q, V, and µ unchanged, i.e., Q^{t+1}(x, u) = Q^t(x, u) for all x ∈ Xℓ and u ∈ U(x), V^{t+1}(x) = V^t(x), and µ^{t+1}(x) = µ^t(x) for all x ∈ Xℓ.
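A minimal sketch of the local updates (3.19)-(3.20) is given below. The two-state problem data are hypothetical, and a simple round-robin alternation between improvement and evaluation steps stands in for one admissible choice of the sets Rℓ and R̄ℓ; in this contractive instance the pair (V, Q) settles at (J*, Q*).

import numpy as np

# Hypothetical 2-state, 2-control problem data (illustrative only).
alpha = 0.9
g = np.array([[1.0, 2.0], [0.5, 3.0]])
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])

def H(x, u, J):
    return g[x, u] + alpha * P[x, u] @ J

V = np.zeros(2)
Q = np.zeros((2, 2))
mu = np.zeros(2, dtype=int)

for t in range(400):
    ell = t % 2                                    # processor owning state ell
    Q_mu = np.array([Q[x, mu[x]] for x in range(2)])
    J_eff = np.minimum(V, Q_mu)                    # min{V^t, Q^t_{mu^t}}
    if (t // 2) % 2 == 0:                          # t in R_ell: local policy improvement
        vals = [H(ell, u, J_eff) for u in range(2)]
        V[ell] = min(vals)                         # cf. Eq. (3.19)
        mu[ell] = int(np.argmin(vals))
    else:                                          # t in R_bar_ell: local policy evaluation
        for u in range(2):
            Q[ell, u] = H(ell, u, J_eff)           # cf. Eq. (3.20)

print(V)               # approaches J*
print(Q.min(axis=1))   # M Q also approaches J*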
Under Assumption 3.3.1, we will show convergence of the algorithm
to (J * , Q* ), where Q* is defined by

Q* (x, u) = H(x, u, J * ), x ∈ X, u ∈ U (x). (3.21)

To this end, we first show that Q* is the unique fixed point of the mapping F defined by

(F Q)(x, u) = H(x, u, M Q),   x ∈ X, u ∈ U(x).

Indeed, under our assumption, Prop. 3.2.1 applies, so J * is the unique


fixed point of T , and we have M Q* = T J * = J * . Thus, from the definition
(3.21), Q* is a fixed point of F . To show uniqueness of the fixed point
of F , note that if Q is a fixed point of F , then Q(x, u) = H(x, u, M Q)
for all x ∈ X, u ∈ U (x), and by minimization over u ∈ U (x), we have
M Q = T (M Q). Hence M Q is equal to the unique fixed point J * of T , so
that the equation Q = F Q yields Q(x, u) = H(x, u, M Q) = H(x, u, J * ),
for all (x, u). From the definition (3.21) of Q* , it then follows that Q = Q∗ .
We introduce the µ-dependent mapping

Lµ(V, Q) = (M Q, Fµ(V, Q)),      (3.22)

where Fµ (V, Q) is given by Eq. (3.18). Note that the policy evaluation part
of the algorithm [cf. Eq. (3.20)] amounts to applying the second component

of Lµ , while the policy improvement part of the algorithm [cf. Eq. (3.19)]
amounts to applying the second component of Lµ , and then applying the
first component of Lµ . The following proposition shows that (J * , Q* ) is
the common fixed point of the mappings Lµ , for all µ.

Proposition 3.3.1: Let Assumption 3.3.1 hold. Then for all µ ∈ M,


the mapping Lµ of Eq. (3.22) is monotone and has (J * , Q* ) as its
unique fixed point.

Proof: Monotonicity of Lµ follows from the monotonicity of the operators


M and Fµ . To show that Lµ has (J * , Q* ) as its unique fixed point, we first
note that J * = M Q* and Q* = F Q* , as shown earlier. Then, using also
the definition of Fµ , we have

J * = M Q* , Q* = F Q* = Fµ (J * , Q* ),

which shows that (J*, Q*) is a fixed point of Lµ. To show uniqueness, let (V̄, Q̄) be a fixed point of Lµ, i.e., V̄ = M Q̄ and Q̄ = Fµ(V̄, Q̄). Then

Q̄ = Fµ(V̄, Q̄) = F Q̄,

where the last equality follows from V̄ = M Q̄. Thus Q̄ is a fixed point of F, and since Q* is the unique fixed point of F, as shown earlier, we have Q̄ = Q*. It follows that V̄ = M Q* = J*, so (J*, Q*) is the unique fixed point of Lµ. Q.E.D.

The uniform fixed point property of Lµ just shown is, however, in-
sufficient for the convergence proof of the asynchronous algorithm, in the
absence of a contraction property. For this reason, we introduce two mappings L and L̄ that are associated with the mappings Lµ and satisfy

L(V, Q) ≤ Lµ(V, Q) ≤ L̄(V, Q),   ∀ µ ∈ M.      (3.23)

These are the mappings defined by

L(V, Q) = (M Q, minµ∈M Fµ(V, Q)),      L̄(V, Q) = (M Q, maxµ∈M Fµ(V, Q)),      (3.24)

where the min and max over µ are attained in view of the finiteness of M [cf. Assumption 3.3.1(b)]. We will show that L and L̄ also have (J*, Q*) as their unique fixed point. Note that there exists µ̄ that attains the maximum in Eq. (3.24), uniformly for all V and (x, u), namely a policy µ̄ for which

Q(x, µ̄(x)) = maxu∈U(x) Q(x, u),   ∀ x ∈ X,

[cf. Eq. (3.18)]. Similarly, there exists µ that attains the minimum in Eq. (3.24), uniformly for all V and (x, u). Thus for any given (V, Q), we have

L(V, Q) = Lµ(V, Q),      L̄(V, Q) = Lµ̄(V, Q),      (3.25)

where µ and µ̄ are some policies. The following proposition shows that (J*, Q*), the common fixed point of the mappings Lµ, for all µ, is also the unique fixed point of L and L̄.

Proposition 3.3.2: Let Assumption 3.3.1 hold. Then the mappings L and L̄ of Eq. (3.24) are monotone, and have (J*, Q*) as their unique fixed point.

Proof: Monotonicity is clear from the monotonicity of the operators M and Fµ. Since (J*, Q*) is the common fixed point of Lµ for all µ (cf. Prop. 3.3.1), and there exists µ such that L(J*, Q*) = Lµ(J*, Q*) [cf. Eq. (3.25)], it follows that (J*, Q*) is a fixed point of L. To show uniqueness, suppose that (V, Q) is a fixed point, so (V, Q) = L(V, Q). Then by Eq. (3.25), we have

(V, Q) = L(V, Q) = Lµ(V, Q)

for some µ ∈ M. Since by Prop. 3.3.1, (J*, Q*) is the only fixed point of Lµ, it follows that (V, Q) = (J*, Q*), so (J*, Q*) is the only fixed point of L. Similarly, we show that (J*, Q*) is the unique fixed point of L̄. Q.E.D.

We are now ready to construct a sequence of sets needed to apply Prop. 2.6.1 and prove convergence. For a scalar c ≥ 0, we denote

J_c^− = J* − c e,      Q_c^− = Q* − c e_Q,
J_c^+ = J* + c e,      Q_c^+ = Q* + c e_Q,

where e and e_Q are the unit functions in the spaces of J and Q, respectively.

Proposition 3.3.3: Let Assumption 3.3.1 hold. Then for all c > 0,

L^k(J_c^−, Q_c^−) ↑ (J*, Q*),      L̄^k(J_c^+, Q_c^+) ↓ (J*, Q*),      (3.26)

where L^k (or L̄^k) denotes the k-fold composition of L (or L̄, respectively).

Proof: For any µ, using the assumption (3.17), we have for all (x, u),

Fµ(J_c^+, Q_c^+)(x, u) = H(x, u, min{J_c^+, Q_c^+})
                       = H(x, u, min{J*, Q*} + c e)
                       ≤ H(x, u, min{J*, Q*}) + c
                       = Q*(x, u) + c
                       = Q_c^+(x, u),

and similarly

Q_c^−(x, u) ≤ Fµ(J_c^−, Q_c^−)(x, u).

We also have M Q_c^+ = J_c^+ and M Q_c^− = J_c^−. From these relations, the definition of Lµ, and the fact Lµ(J*, Q*) = (J*, Q*) (cf. Prop. 3.3.1),

(J_c^−, Q_c^−) ≤ Lµ(J_c^−, Q_c^−) ≤ (J*, Q*) ≤ Lµ(J_c^+, Q_c^+) ≤ (J_c^+, Q_c^+).

Using this relation and Eqs. (3.23) and (3.25), we obtain

(J_c^−, Q_c^−) ≤ L(J_c^−, Q_c^−) ≤ (J*, Q*) ≤ L̄(J_c^+, Q_c^+) ≤ (J_c^+, Q_c^+).      (3.27)

Denote for k = 0, 1, . . . ,

(V̄^k, Q̄^k) = L̄^k(J_c^+, Q_c^+),      (V^k, Q^k) = L^k(J_c^−, Q_c^−).

From the monotonicity of L and L̄ and Eq. (3.27), we have that (V̄^k, Q̄^k) converges monotonically from above to some (V̄, Q̄) ≥ (J*, Q*), while (V^k, Q^k) converges monotonically from below to some (V, Q) ≤ (J*, Q*). By taking the limit in the equation

(V̄^{k+1}, Q̄^{k+1}) = L̄(V̄^k, Q̄^k),

and using the continuity property of L̄, implied by Eq. (3.17) and the finiteness of the control constraint set, it follows that (V̄, Q̄) = L̄(V̄, Q̄), so (V̄, Q̄) must be equal to (J*, Q*), the unique fixed point of L̄. Thus, L̄^k(J_c^+, Q_c^+) ↓ (J*, Q*). Similarly, L^k(J_c^−, Q_c^−) ↑ (J*, Q*). Q.E.D.

To show asynchronous convergence of the algorithm (3.19)-(3.20), consider the sets

S(k) = {(V, Q) | L^k(J_c^−, Q_c^−) ≤ (V, Q) ≤ L̄^k(J_c^+, Q_c^+)},   k = 0, 1, . . . ,
whose intersection is (J * , Q* ) [cf. Eq. (3.26)]. By Prop. 3.3.3 and Eq. (3.23),
this set sequence together with the mappings Lµ satisfy the synchronous
convergence and box conditions of the asynchronous convergence theorem
of Prop. 2.6.1 (more precisely, its time-varying version of Exercise 2.2). This
proves the convergence of the algorithm (3.19)-(3.20) for starting points
(V, Q) ∈ S(0). Since c can be chosen arbitrarily large, it follows that the
algorithm is convergent from an arbitrary starting point.
Finally, let us note some variations of the asynchronous PI algorithm.
One such variation is to allow “communication delays” t − τℓj(t). Another
variation, for the case where we want to calculate just J * , is to use a
reduced space implementation similar to the one discussed in Section 2.6.3.
There is also a variant with interpolation, cf. Section 2.6.3.

3.3.3 Policy Iteration With Perturbations

Let us now consider PI for problems where there exists an optimal S-


regular policy, but S-irregular policies may have real-valued cost functions,
and the perturbation approach of Section 3.2.2 applies. We will develop an
algorithm that generates a sequence of policies {µk } such that Jµk → J * ,
under the following assumption, which among others, is satisfied in the SSP
problem of Example 3.2.4.

Assumption 3.3.2: Assumption 3.2.3 holds, and in addition:


(a) For every δ > 0 and δ-S-regular policy µ, there exists a policy µ
such that Tµ Jµ,δ = T Jµ,δ .
(b) For every δ > 0 and δ-S-irregular policy µ, and for every J ∈ S,
there exists at least one state x ∈ X such that

lim supk→∞ (T^k_{µ,δ} J)(x) = ∞.

We generate a sequence {µk } with a perturbed version of PI as follows.


Let {δk } be a positive sequence with δk ↓ 0, and let µ0 be any δ0 -S-regular
policy. At the typical iteration k, we have a δk -S-regular policy µk , and we
generate µk+1 according to

Tµk+1 Jµk ,δk = T Jµk ,δk . (3.28)

Such µk+1 exists by Assumption 3.3.2(a), and we claim that µk+1 is δk+1 -
S-regular. To see this, note that from Lemma 3.2.5(e), we have

Tµk+1 ,δk Jµk ,δk = T Jµk ,δk + δk e ≤ Tµk Jµk ,δk + δk e = Jµk ,δk ,

so that

T^m_{µk+1,δk} Jµk,δk ≤ Tµk+1,δk Jµk,δk = T Jµk,δk + δk e ≤ Jµk,δk,   ∀ m ≥ 1.      (3.29)
Since Jµk ,δk ∈ R(X), from Assumption 3.3.2(b) it follows that µk+1 is δk -
S-regular, and hence also δk+1 -S-regular, by Lemma 3.2.5(c). Thus the se-
quence {µk } generated by the perturbed PI algorithm (3.28) is well-defined
and consists of δk -S-regular policies. We have the following proposition.

Proposition 3.3.4: Let Assumption 3.3.2 hold. Then the sequence


{Jµk } generated by the algorithm (3.28) satisfies Jµk → J * .

Proof: Using Eq. (3.29), we have

Jµk+1,δk+1 ≤ Jµk+1,δk = limm→∞ T^m_{µk+1,δk} Jµk,δk ≤ T Jµk,δk + δk e ≤ Jµk,δk,

where the equality holds because µk+1 is δk -S-regular, as shown earlier.


Taking the limit as k → ∞, and noting that Jµk+1 ,δk+1 ≥ J * , we see that
Jµk ,δk ↓ J + for some J + ≥ J * , and we obtain

J* ≤ J+ = limk→∞ T Jµk,δk.      (3.30)

We also have

infu∈U(x) H(x, u, J+) ≤ limk→∞ infu∈U(x) H(x, u, Jµk,δk)
                      ≤ infu∈U(x) limk→∞ H(x, u, Jµk,δk)
                      = infu∈U(x) H(x, u, J+),

where the equality follows from Assumption 3.2.3(d). It follows that equal-
ity holds throughout above, so that

limk→∞ T Jµk,δk = T J+.      (3.31)

Combining Eqs. (3.30) and (3.31), we obtain J * ≤ J + = T J + . Since by


Prop. 3.2.4, J * is the unique fixed point of T within {J ∈ S | J ≥ J * }, it
follows that J + = J * . Thus Jµk ,δk ↓ J * , and since Jµk ,δk ≥ Jµk ≥ J * , we
have Jµk → J * . Q.E.D.

Note that when X and U are finite sets, as in the SSP problem of
Example 3.2.4, Prop. 3.3.4 implies that the generated policies µk will be
optimal for all k sufficiently large. The reason is that in this case, the set
of policies is finite and there exists a sufficiently small ε > 0, such that for all nonoptimal µ there is some state x such that Jµ(x) ≥ J*(x) + ε.
In the absence of finiteness of X and U , Prop. 3.3.4 guarantees the
monotonic pointwise convergence of {Jµk ,δk } to J * (see the preceding proof)
and the (possibly nonmonotonic) pointwise convergence of {Jµk } to J * .
This convergence behavior should be contrasted with the behavior of PI
without perturbations, which may lead to oscillation between two nonop-
timal policies, as noted earlier.
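As a rough illustration of the perturbed PI iteration (3.28), the sketch below applies it to the hypothetical one-node stopping problem used earlier (stop at cost b = −1, or traverse a zero-length self-cycle); the choice δk = 2^{−k} and the problem data are illustrative assumptions, not part of the analysis above.

# Perturbed PI on a hypothetical one-node stopping problem, J_bar = 0:
# control "stop" costs b, control "loop" traverses a zero-length self-cycle.
b = -1.0

def H(u, J, delta=0.0):
    return (b if u == "stop" else J) + delta

def evaluate(u, delta, iters=2000):
    """Policy evaluation in the delta-perturbed problem by iterating T_{u,delta}."""
    J = 0.0                                        # start from J_bar = 0
    for _ in range(iters):
        J = H(u, J, delta)
    return J

mu = "stop"                                        # a delta_0-S-regular starting policy
for k in range(8):
    delta_k = 2.0 ** (-k)
    J_k = evaluate(mu, delta_k)                    # J_{mu^k, delta_k}
    mu = min(("stop", "loop"), key=lambda u: H(u, J_k))   # T_{mu^{k+1}} J = T J, cf. (3.28)
    print(f"k={k}: delta_k={delta_k:.4f}  J_mu_delta={J_k:.4f}  next policy: {mu}")
# J_{mu^k, delta_k} decreases to J* = b = -1 and the policy stays at "stop",
# whereas unperturbed PI on this instance can oscillate between the two policies.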

3.4 NOTES, SOURCES, AND EXERCISES

The semicontractive model framework of this chapter and the material of


Section 3.1 are new. The framework is inspired from the analysis of the SSP

problem of Example 1.2.6, which involves finite state and control spaces,
as well as a termination state. In the absence of a termination state, a
key idea has been to generalize the notion of a proper policy from one that
leads to termination with probability 1, to one that is S-regular for an
appropriate set of functions S.
The line of proof of Prop. 3.1.1 dates back to an analysis of SSP
problems with finite state and control spaces, given in the author’s [Ber87],
Section 6.4, which assumes existence of an optimal proper policy and non-
negativity of the one-stage cost. Proposition 3.2.1 is patterned after a
similar result in Bertsekas and Tsitsiklis [BeT91] for SSP problems with fi-
nite state space and compact control constraint sets. The proof given there
contains an intricate part (Lemma 3 of [BeT91]) to show a lower bound on
the cost functions of proper policies, which is assumed here in part (b) of
the semicontraction Assumption 3.2.1.
The perturbation analysis of Section 3.2.2, including the PI algorithm
of Section 3.3.3, are new and are based on unpublished collaboration of the
author with H. Yu. The results for SSP problems using this analysis (cf.
Prop. 3.2.4) strengthen the results of [Ber87] (Section 6.4) and [BeT91]
(Prop. 3), in that the one-stage cost need not be assumed nonnegative.
We have given two different perturbation approaches in Section 3.2.2. The
first approach places assumptions on the optimal cost function Jδ* of the
δ-perturbed problem (cf. Prop. 3.2.2 and Assumption 3.2.2), while the sec-
ond places assumptions on policies (cf. Assumption 3.2.3) and separates
them into δ-S-regular and δ-S-irregular. The first approach is simpler an-
alytically, and at least in part, does not require existence of an S-regular
policy (cf. Prop. 3.2.2). The second approach allows the development of
a perturbed PI algorithm and the corresponding analysis of Section 3.3.3
(under the extra conditions of Assumption 3.3.2).
The asynchronous PI algorithm of Section 3.3.2 is essentially the same
as one of the optimistic PI algorithms of Yu and Bertsekas [YuB11a] for
the SSP problem of Example 1.2.6. This paper also analyzed asynchronous
stochastic iterative versions of the algorithms, and proved convergence re-
sults that parallel those for classical Q-learning for SSP, given in Tsitsiklis
[Tsi94] and Yu and Bertsekas [YuB11b]. We follow the line of analysis of
that paper. A related paper, which deals with a slightly different asyn-
chronous PI algorithm in an abstract setting and without a contraction
structure, is Bertsekas and Yu [BeY10b].
By allowing an infinite state space, the analysis of the present chapter
applies among others to SSP problems with a countable state space. Such
problems often arise in queueing control problems where the termination
state corresponds to an empty queue. The problem then is to empty the
system with minimum expected cost. Generalized forms of SSP problems,
which involve an infinite (uncountable) number of states, in addition to the
termination state, are analyzed by Pliska [Pli78], Hernandez-Lerma et al.
[HCP99], and James and Collins [JaC06]. The latter paper allows improper

policies, assumes that J * is bounded below, and generalizes the results of


[BeT91] to infinite (Borel) state spaces, using a similar line of proof.
An important case of an SSP problem where the state space is infinite
arises under imperfect state information. There the problem is converted
to a perfect state information problem whose states are the belief states,
i.e., the posterior probability distributions of the original state given the
observations thus far. Patek [Pat07] proves results that are similar to the
ones for SSP problems with perfect state information. These results can
also be derived from the analysis of this chapter. In particular, the critical
condition that the cost functions of proper policies are bounded below by
some real-valued function [cf. Assumption 3.2.1(b)] is proved as Lemma
5 in [Pat07], using the fact that the cost functions of the proper policies
are bounded below by the optimal cost function of a corresponding perfect
state information problem.

EXERCISES

3.1 (Blackmailer’s Dilemma)

Consider an SSP problem where there is only one state x = 1, in addition to the
termination state 0. At state 1, we can choose a control u with 0 < u ≤ 1, while
incurring a cost −u; we then move to state 0 with probability u2 , and stay in state
1 with probability 1 − u2 . We may regard u as a demand made by a blackmailer,
and state 1 as the situation where the victim complies. State 0 is the situation
where the victim (permanently) refuses to yield to the blackmailer’s demand.
The problem then can be seen as one whereby the blackmailer tries to maximize
his total gain by balancing his desire for increased demands with keeping his
victim compliant. In terms of abstract DP we have
X = {1},   U(1) = (0, 1],   J̄(1) = 0,   H(1, u, J) = −u + (1 − u²)J(1).

(a) Verify that Tµ is a sup-norm contraction for each µ. In addition, show that Jµ(1) = −1/µ(1), so that J*(1) = −∞, that there is no optimal policy, and that T has no fixed points within ℜ. Which parts of Assumption 3.2.1 with S = ℜ are violated?
(b) Consider a variant of the problem where at state 1, we terminate at no cost
with probability u, and stay in state 1 at a cost −u with probability 1 − u.
Here we have
H(1, u, J) = (1 − u)(−u) + (1 − u)J(1).

Verify that J*(1) = −1, that there is no optimal policy, and that T has multiple fixed points within ℜ. Which parts of Assumption 3.2.1 with S = ℜ are violated?

(c) Repeat part (b) for the case where at state 1, we may also choose u = 0 at a cost c. Show that the policy µ that chooses µ(1) = 0 is ℜ-irregular. What are the optimal policies and the fixed points of T in the three cases where c > 0, c = 0, and c < 0? Which parts of Assumption 3.2.1 with S = ℜ are violated in each of these three cases?

3.2 (Equivalent Semicontractive Conditions)

Let S be a given subset of E(X). Show that the assumptions of Prop. 3.1.1 hold
if and only if J ∗ ∈ S, T J ∗ ≤ J ∗ , and there exists an S-regular policy µ such that
Tµ J ∗ = T J ∗ .

3.3

Consider the three-node shortest path example of Section 3.1.2. Try to apply
Prop. 3.1.1 with S = [−∞, ∞) × [−∞, ∞). What conclusions can you obtain for various values of a and b, and how do they compare with those for S = ℜ²?

3.4 (Changing J¯)

Let the assumptions of Prop. 3.1.1 hold, and let J ∗ be the optimal cost function.
Suppose that J¯ is changed to some function J ∈ S.
(a) Show that following the change, J ∗ continues to be the optimal cost func-
tion over just the S-regular policies.
(b) Consider the three-node shortest path problem of Section 3.1.2 for the case where a = 0, b < 0. Change J̄ from J̄ = 0 to J̄ = r e where r ∈ ℜ. Verify the result of part (a) for this example. For which values of r is the ℜ²-irregular policy optimal?

3.5 (Alternative Semicontractive Conditions)

The purpose of this exercise and the next one is to provide conditions that imply
the results of Prop. 3.1.1. Let S be a given subset of E(X). Assume that:
(1) There exists an optimal S-regular policy.
(2) For every S-irregular policy µ, we have Tµ J ∗ ≥ J ∗ .
Then the assumptions and the conclusions of Prop. 3.1.1 hold.

3.6 (Convergence of PI)

Let Assumption 3.2.1 hold, and let {µk } be the sequence generated by the PI
algorithm described at the start of Section 3.3.2 [cf. Eq. (3.16)]. Let also J∞ =
limk→∞ Jµk , and assume that H(x, u, Jµk ) → H(x, u, J∞ ) for all x ∈ X and
u ∈ U (x). Show that J∞ = J ∗ .
4

Noncontractive Models

Contents

4.1. Noncontractive Models . . . . . . . . . . . . . . p. 130


4.2. Finite Horizon Problems . . . . . . . . . . . . . . p. 133
4.3. Infinite Horizon Problems . . . . . . . . . . . . . p. 139
4.3.1. Fixed Point Properties and Optimality Conditions p. 143
4.3.2. Value Iteration . . . . . . . . . . . . . . . . p. 154
4.3.3. Policy Iteration . . . . . . . . . . . . . . . p. 157
4.4. Semicontractive-Monotone Increasing Models . . . . . p. 163
4.4.1. Value and Policy Iteration Algorithms . . . . . p. 163
4.4.2. Some Applications . . . . . . . . . . . . . . p. 166
4.4.3. Linear-Quadratic Problems . . . . . . . . . . p. 168
4.5. Affine Monotonic Models . . . . . . . . . . . . . p. 171
4.5.1. Increasing Affine Monotonic Models . . . . . . p. 172
4.5.2. Nonincreasing Affine Monotonic Models . . . . . p. 172
4.5.3. Exponential Cost Stochastic Shortest Path
Problems . . . . . . . . . . . . . . . . . . p. 175
4.6. An Overview of Semicontractive Models and Results . p. 180
4.7. Notes, Sources, and Exercises . . . . . . . . . . . . p. 181


In this chapter, we consider abstract DP models similar to the ones of


the earlier chapters, but without assuming any contraction-like property.
We will discuss both finite and infinite horizon models, and introduce just
enough assumptions (including monotonicity) to obtain some minimal re-
sults, which we will strengthen as we go along.
In Section 4.2, we discuss a general type of finite horizon problem.
We show under some reasonable assumptions the standard results that one
may expect in an abstract setting.
In Section 4.3, we consider an infinite horizon problem that is mo-
tivated by the well-known positive and negative DP models (see [Ber12a],
Chapter 4). These are the special cases of the infinite horizon stochastic
optimal control problem of Example 1.2.1, where the cost per stage g is
uniformly nonpositive or uniformly nonnegative. For these models there
is interesting theory (the validity of Bellman’s equation and the availabil-
ity of optimality conditions in a DP context), which we discuss in Section
4.3.1. There are also interesting computational methods, patterned after
the value and policy iteration methods, which we discuss in Sections 4.3.2
and 4.3.3. However, the performance guarantees for these methods are
not as powerful as in the contractive case, and their validity hinges upon
certain additional assumptions.
In Section 4.4, we discuss a semicontractive model, which combines
elements of the monotonic noncontractive models of Section 4.3 and the
semicontractive models of Chapter 3.
Finally, in Section 4.5 we apply the semicontractive analysis of Chap-
ter 3 and Section 4.4 to the class of monotonic affine models, which includes
the multiplicative and exponential cost/risk-sensitive models of Example
1.2.8.

4.1 NONCONTRACTIVE MODELS

Throughout this chapter we will continue to use the model of Chapter 3,


which involves the set of extended real numbers

ℜ* = ℜ ∪ {∞, −∞}

and the set E(X) of all extended real-valued functions J : X → ℜ*. We denote by R(X) the set of real-valued functions J : X → ℜ, and by B(X) the set of real-valued functions J : X → ℜ that are bounded with
respect to a given weighted sup-norm. The operations with ∞ and −∞ are
summarized in Appendix A. In particular, we adopt standard conventions
regarding ordering, addition, and multiplication in ℜ*, except that we take
∞ − ∞ = −∞ + ∞ = ∞, and we take the product of 0 and ∞ or −∞
to be 0, so that the sum and product of two extended real numbers are
well-defined (see Appendix A for details).

To repeat some of the basic definitions, we have a set X of states and


a set U of controls, and for each x ∈ X, the nonempty control constraint
set U(x) ⊂ U. We denote by M the set of all functions µ : X → U with
µ(x) ∈ U (x), for all x ∈ X, and by Π the set of “nonstationary policies”
π = {µ0 , µ1 , . . .}, with µk ∈ M for all k. We refer to a stationary policy
{µ, µ, . . .} simply as µ.
We introduce a mapping H : X × U × E(X) → ℜ*, and we define the mapping T : E(X) → E(X) by

(T J)(x) = infu∈U(x) H(x, u, J),   ∀ x ∈ X, J ∈ E(X),

and for each µ ∈ M the mapping Tµ : E(X) → E(X) by

(Tµ J)(x) = H(x, µ(x), J),   ∀ x ∈ X, J ∈ E(X).

We continue to use the following assumption throughout this chapter, with-


out mentioning it explicitly in various propositions.

Assumption 4.1.1: (Monotonicity) If J, J′ ∈ E(X) and J ≤ J′, then

H(x, u, J) ≤ H(x, u, J′),   ∀ x ∈ X, u ∈ U(x).

A fact that we will be using frequently is that for each J ∈ E(X) and scalar ε > 0, there exists a µε ∈ M such that for all x ∈ X,

(Tµε J)(x) ≤ (T J)(x) + ε   if (T J)(x) > −∞,
(Tµε J)(x) ≤ −(1/ε)          if (T J)(x) = −∞.

In particular, if J is such that (T J)(x) > −∞ for all x ∈ X, then for each ε > 0, there exists a µε ∈ M such that

(Tµε J)(x) ≤ (T J)(x) + ε,   ∀ x ∈ X.

We will often use in our analysis the unit function e, defined by e(x) ≡ 1, so for example, we will write the above relation in shorthand as

Tµε J ≤ T J + ε e.

We define cost functions for policies consistently with Chapters 2 and


3. In particular, we are given a function J¯ ∈ E(X), and we consider for
every policy π = {µ0 , µ1 , . . .} ∈ Π and positive integer N the function
JN,π ∈ E(X) defined by

JN,π (x) = (Tµ0 · · · TµN −1 J¯)(x), ∀ x ∈ X,



and the function Jπ ∈ E(X) defined by

Jπ(x) = lim supk→∞ (Tµ0 · · · Tµk J̄)(x),   ∀ x ∈ X.

We refer to JN,π as the N -stage cost function of π and to Jπ as the infinite


horizon cost function of π (or just “cost function” if the length of the
horizon is clearly implied by the context). For a stationary policy π =
{µ, µ, . . .} we also write Jπ as Jµ .
In Section 4.2, we consider the N-stage optimization problem

minimize JN,π(x)   subject to π ∈ Π,      (4.1)

while in Section 4.3 we discuss its infinite horizon version

minimize Jπ(x)   subject to π ∈ Π.      (4.2)

For a fixed x ∈ X, we denote by J*_N(x) and J*(x) the optimal costs for these problems, i.e.,

J*_N(x) = infπ∈Π JN,π(x),      J*(x) = infπ∈Π Jπ(x),   ∀ x ∈ X.

We say that a policy π* ∈ Π is N-stage optimal if

JN,π*(x) = J*_N(x),   ∀ x ∈ X,

and (infinite horizon) optimal if

Jπ*(x) = J*(x),   ∀ x ∈ X.

For a given ε > 0, we say that πε is N-stage ε-optimal if

JN,πε(x) ≤ J*_N(x) + ε   if J*_N(x) > −∞,
JN,πε(x) ≤ −(1/ε)         if J*_N(x) = −∞,

and we say that πε is ε-optimal if

Jπε(x) ≤ J*(x) + ε   if J*(x) > −∞,
Jπε(x) ≤ −(1/ε)       if J*(x) = −∞.

4.2 FINITE HORIZON PROBLEMS

Consider the N -stage problem (4.1), where the cost function JN,π is defined
by
JN,π (x) = (Tµ0 · · · TµN −1 J¯)(x), ∀ x ∈ X.
Based on the theory of finite horizon DP, we expect that (at least under some conditions) the optimal cost function J*_N is obtained by N successive applications of the DP mapping T on the initial function J̄, i.e.,

J*_N = infπ∈Π JN,π = T^N J̄.

This is the analog of Bellman’s equation for the finite horizon problem in
a DP context.

The Case Where Uniformly N -Stage Optimal Policies Exist

A favorable case where the analysis is simplified and we can easily show that J*_N = T^N J̄ is when the finite horizon DP algorithm yields an optimal policy during its execution. By this we mean that the algorithm that starts with J̄, and sequentially computes T J̄, T² J̄, . . . , T^N J̄, also yields corresponding µ*_{N−1}, µ*_{N−2}, . . . , µ*_0 ∈ M such that

Tµ*_k (T^{N−k−1} J̄) = T^{N−k} J̄,   k = 0, . . . , N − 1.      (4.3)

While µ*_{N−1}, . . . , µ*_0 ∈ M satisfying this relation need not exist (because the corresponding infimum in the definition of T is not attained), if they do exist, they form an optimal policy and also guarantee that

J*_N = T^N J̄.

The proof is simple: we have for every π = {µ0, µ1, . . .} ∈ Π,

JN,π = Tµ0 · · · TµN−1 J̄ ≥ T^N J̄ = Tµ*_0 · · · Tµ*_{N−1} J̄,

where the inequality follows from the monotonicity assumption and the definition of T, and the last equality follows from Eq. (4.3). Thus {µ*_0, µ*_1, . . .} has no worse N-stage cost function than every other policy, so it is N-stage optimal and J*_N = Tµ*_0 · · · Tµ*_{N−1} J̄. By taking the infimum of the left-hand side over π ∈ Π, we obtain J*_N = T^N J̄.

The preceding argument can also be used to show that {µ∗k , µ∗k+1 , . . .}
is (N − k)-stage optimal for all k = 0, . . . , N − 1. Such a policy is called
uniformly N -stage optimal . The fact that the finite horizon DP algorithm
provides an optimal solution of all the k-stage problems for k = 1, . . . , N ,
rather than just the last one, is a manifestation of the classical principle

of optimality, expounded by Bellman in the early days of DP (the tail


portion of an optimal policy obtained by DP minimizes the corresponding
tail portion of the finite horizon cost). Note, however, that there may exist
an N -stage optimal policy that is not k-stage optimal for some k < N .
We state the result just derived as a proposition.

Proposition 4.2.1: Suppose that a policy {µ*_0, µ*_1, . . .} satisfies the condition (4.3). Then this policy is uniformly N-stage optimal, and we have J*_N = T^N J̄.
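When X and U are finite, a policy satisfying condition (4.3) is produced directly by the backward DP recursion, as in the following sketch; the two-state problem data are hypothetical and purely illustrative.

import numpy as np

# Hypothetical finite-horizon problem: 2 states, 2 controls, N = 3 stages.
N = 3
g = np.array([[1.0, 2.0], [0.5, 3.0]])          # g(x, u)
P = np.array([[[0.8, 0.2], [0.3, 0.7]],         # P(x' | x, u)
              [[0.5, 0.5], [0.9, 0.1]]])
J_bar = np.zeros(2)

def H(x, u, J):
    return g[x, u] + P[x, u] @ J

J = J_bar.copy()
mu_star = []                                    # collects mu*_{N-1}, ..., mu*_0
for _ in range(N):                              # compute T^{k+1} J_bar from T^k J_bar
    vals = np.array([[H(x, u, J) for u in range(2)] for x in range(2)])
    mu_star.append(vals.argmin(axis=1))         # a minimizer exists, cf. Eq. (4.3)
    J = vals.min(axis=1)

print("J*_N = T^N J_bar:", J)
print("mu*_{N-1}, ..., mu*_0:", [m.tolist() for m in mu_star])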

While the preceding result is theoretically limited, it is very useful


in practice, because the existence of a policy satisfying the condition (4.3)
can often be simply established. For example, this condition is trivially
satisfied if the control space is finite. The following proposition provides a
generalization.

Proposition 4.2.2: Let the control space U be a metric space, and assume that for each x ∈ X, λ ∈ ℜ, and k = 0, 1, . . . , N − 1, the set

Uk(x, λ) = {u ∈ U(x) | H(x, u, T^k J̄) ≤ λ}

is compact. Then there exists a uniformly N-stage optimal policy.

Proof: We will show that the infimum in the relation

(T^{k+1} J̄)(x) = infu∈U(x) H(x, u, T^k J̄)

is attained for all x ∈ X and k. Indeed if H(x, u, T^k J̄) = ∞ for all u ∈ U(x), then every u ∈ U(x) attains the infimum. If for a given x ∈ X,

infu∈U(x) H(x, u, T^k J̄) < ∞,

the corresponding part of the proof of Prop. 3.2.1 (cf. Lemma 3.2.1) applies and shows that the above infimum is attained. Q.E.D.

The General Case

We now consider the case where there is no uniformly N-stage optimal policy. By using the definitions of J*_N and T^N J̄, the equation J*_N = T^N J̄ can be equivalently written as

inf_{µ0,...,µN−1∈M} Tµ0 · · · TµN−1 J̄ = inf_{µ0∈M} Tµ0 ( inf_{µ1∈M} Tµ1 ( · · · inf_{µN−1∈M} TµN−1 J̄ ) ).

Thus we have J*_N = T^N J̄ if the operations infµ and Tµ can be interchanged
in the preceding equation. We will introduce two alternative assumptions,
which guarantee that this interchange is valid. Our first assumption is a
form of continuity from above of H with respect to J .

Assumption 4.2.1: For each sequence {Jm} ⊂ E(X) with Jm ↓ J and H(x, u, J0) < ∞ for all x ∈ X and u ∈ U(x), we have

limm→∞ H(x, u, Jm) = H(x, u, J),   ∀ x ∈ X, u ∈ U(x).      (4.4)

Note that if {Jm} and {Tµ Jm} are monotonically nonincreasing, then

infm Jm = limm→∞ Jm,      infm (Tµ Jm) = limm→∞ (Tµ Jm),

so for all µ ∈ M, Eq. (4.4) implies that

infm (Tµ Jm) = limm→∞ (Tµ Jm) = Tµ ( limm→∞ Jm ) = Tµ ( infm Jm ).

This equality can be extended for any µ1, . . . , µk ∈ M as follows:

infm (Tµ1 · · · Tµk Jm) = Tµ1 ( infm (Tµ2 · · · Tµk Jm) )
                       = · · ·
                       = Tµ1 · · · Tµk−1 ( infm (Tµk Jm) )      (4.5)
                       = Tµ1 · · · Tµk ( infm Jm ).

We use this relation to prove the following proposition.

Proposition 4.2.3: Let Assumption 4.2.1 hold, and assume further that Jk,π(x) < ∞, for all x ∈ X, π ∈ Π, and k ≥ 1. Then J*_N = T^N J̄.

Proof: We select for each k = 0, . . . , N − 1, a sequence {µ_k^m} ⊂ M such that

Tµ_k^m (T^{N−k−1} J̄) ↓ T^{N−k} J̄   as m → ∞.

Since J*_N ≤ Tµ0 · · · TµN−1 J̄ for all µ0, . . . , µN−1 ∈ M, we have, using also Eq. (4.5) and the assumption Jk,π(x) < ∞ for all k, π, and x,

J*_N ≤ infm0 · · · infmN−1 Tµ_0^{m0} · · · Tµ_{N−1}^{mN−1} J̄
     = infm0 · · · infmN−2 Tµ_0^{m0} · · · Tµ_{N−2}^{mN−2} ( infmN−1 Tµ_{N−1}^{mN−1} J̄ )
     = infm0 · · · infmN−2 Tµ_0^{m0} · · · Tµ_{N−2}^{mN−2} (T J̄)
     ...
     = infm0 Tµ_0^{m0} (T^{N−1} J̄)
     = T^N J̄.

On the other hand, it is clear from the definitions that T^N J̄ ≤ JN,π for all N and π ∈ Π, so that T^N J̄ ≤ J*_N. Thus, J*_N = T^N J̄. Q.E.D.

We now introduce an alternative assumption, which in addition to J*_N = T^N J̄, guarantees the existence of an ε-optimal policy.

Assumption 4.2.2: We have

Jk* (x) > −∞, ∀ x ∈ X, k = 1, . . . , N.

Moreover, there exists a scalar α ∈ (0, ∞) such that for all scalars
r ∈ (0, ∞) and functions J ∈ E(X), we have

H(x, u, J + r e) ≤ H(x, u, J ) + α r, ∀ x ∈ X, u ∈ U (x). (4.6)

Proposition 4.2.4: Let Assumption 4.2.2 hold. Then J*_N = T^N J̄, and for every ε > 0, there exists an ε-optimal policy.

Proof: Note that since by assumption J*_N(x) > −∞ for all x ∈ X, an N-stage ε-optimal policy πε ∈ Π is one for which

J*_N ≤ JN,πε ≤ J*_N + ε e.

We use induction. The result clearly holds for N = 1. Assume that it holds for N = k, i.e., J*_k = T^k J̄ and for a given ε > 0, there is a πε ∈ Π

with Jk,πε ≤ J*_k + ε e. Using Eq. (4.6), we have for all µ ∈ M,

J*_{k+1} ≤ Tµ Jk,πε ≤ Tµ J*_k + αε e.

Taking the infimum over µ and then the limit as ε → 0, we obtain J*_{k+1} ≤ T J*_k. By using the induction hypothesis J*_k = T^k J̄, it follows that J*_{k+1} ≤ T^{k+1} J̄. On the other hand, we have clearly T^{k+1} J̄ ≤ Jk+1,π for all π ∈ Π, so that T^{k+1} J̄ ≤ J*_{k+1}, and hence T^{k+1} J̄ = J*_{k+1}.
We now turn to the existence of an !-optimal policy part of the in-
duction argument. Using the assumption Jk* (x) > −∞ for all x ∈ x ∈ X,
for any ! > 0, we can choose π = {µ0 , µ1 , . . .} such that

!
Jk,π ≤ Jk* + e, (4.7)

and µ ∈ M such that


!
Tµ Jk* ≤ T Jk* + e.
2
Let π " = {µ, µ0 , µ1 , . . .}. Then

!
Jk+1,π ! = Tµ Jk,π ≤ Tµ Jk* + e ≤ T Jk* + ! e = Jk+1
* + ! e,
2

where the first inequality is obtained by applying Tµ to Eq. (4.7) and using
Eq. (4.6). The induction is complete. Q.E.D.

We now provide some counterexamples showing that the conditions


of the preceding propositions are necessary, and that for exceptional (but
otherwise very simple) problems, the Bellman equation JN * = T N J¯ may

not hold and/or there may not exist an !-optimal policy.

Example 4.2.1 (Counterexample to Bellman’s Equation I)

Let
X = {0}, U (0) = (−1, 0], ¯
J(0) = 0,
!
u if −1 < J(0),
H(0, u, J) =
J(0) + u if J(0) ≤ −1.

Then
(Tµ0 · · · TµN −1 J¯)(0) = µ0 (0),

and JN (0) = −1, while (T N J¯)(0) = −N for every N . Here Assumption 4.2.1,
and the condition (4.6) (cf. Assumption 4.2.2) are violated, even though the
condition Jk∗ (x) > −∞ for all x ∈ X (cf. Assumption 4.2.2) is satisfied.
138 Noncontractive Models Chap. 4

Example 4.2.2 (Counterexample to Bellman’s Equation II)

Let
X = {0, 1}, U (0) = U (1) = (−∞, 0], ¯
J(0) ¯
= J(1) = 0,
!
u if J(1) = −∞,
H(0, u, J) = H(1, u, J) = u.
0 if J(1) > −∞,
Then
(Tµ0 · · · TµN −1 J¯)(0) = 0, ¯
(Tµ0 · · · TµN −1 J)(1) = µ0 (1), ∀ N ≥ 1.
∗ ∗
It can be seen that for N ≥ 2, we have JN (0)
= 0 and JN
= −∞, but (1)
¯
(T N J)(0) = (T N J¯)(1) = −∞. Here Assumption 4.2.1, and the condition
Jk∗ (x) > −∞ for all x ∈ X (cf. Assumption 4.2.2) are violated, even though
the condition (4.6) of Assumption 4.2.2 is satisfied.

In the preceding two examples, the anomalies are due to discontinu-


ity of the mapping H with respect to J . In classical finite horizon DP, the
mapping H is generally continuous when it takes finite values, but coun-
terexamples arise in unusual problems where infinite values occur. The
next example is a simple stochastic optimal control problem, which involves
some infinite expected values of random variables and we have J2* != T 2 J¯.

Example 4.2.3 (Counterexample to Bellman’s Equation III)

Let
X = {0, 1}, U (0) = U (1) = &, J¯(0) = J¯(1) = 0,
let w be a real-valued random variable with E{w} = ∞, and let
! " #
E w + J(1) if x = 0,
H(x, u, J) = ∀ x ∈ X, u ∈ U (x).
u + J(1) if x = 1,
Then if Jm is real-valued for all m, and Jm (1) ↓ J(1) = −∞, we have
" #
lim H(0, u, Jm ) = lim E w + Jm (1) = ∞,
m→∞ m→∞

while $ % " #
H 0, u, lim Jm = E w + J(1) = −∞,
m→∞

so Assumption 4.2.1 is violated. Indeed, the reader may verify with a straight-
¯
forward calculation that J2∗ (0) = ∞, J2∗ (1) = −∞, while (T 2 J)(0) = −∞,
2 ¯ ∗ 2 ¯
(T J)(1) = −∞, so J2 (= T J. Note that Assumption 4.2.2 is also violated
because J2∗ (1) = −∞.

In the next counterexample, Bellman’s equation holds, but there is


no !-optimal policy. This is an undiscounted deterministic optimal control
problem of the type discussed in Section 1.1, where Jk∗ (x) = −∞ for some
x and k, so Assumption 4.2.2 is violated. We use the notation introduced
there.
Sec. 4.3 Infinite Horizon Problems 139

Example 4.2.4 (Counterexample to Existence of an


!-Optimal Policy)

Let α = 1 and

N = 2, X = {0, 1, . . .}, U (x) = (0, ∞), ¯


J(x) = 0, ∀ x ∈ X,

f (x, u) = 0, ∀ x ∈ X, u ∈ U (x),
−u if x = 0,
!
g(x, u) = , ∀ u ∈ U (x),
x if x =
% 0,
so that
H(x, u, J) = g(x, u) + J(0).
Then for π ∈ Π and x %= 0, we have J2,π (x) = x − µ1 (0), so that J2∗ (x) = −∞
for all x ∈ X. Here Assumption 4.2.1, as well as Eq. (4.6) (cf. Assumption
4.2.2) are satisfied, and indeed we have J2∗ (x) = (T 2 J¯)(x) = −∞ for all
x ∈ X. However, the condition Jk∗ (x) > −∞ for all x and k (cf. Assumption
4.2.2) is violated, and it is seen that there does not exist a two-stage #-optimal
policy for any # > 0, since an #-optimal policy π = {µ0 , µ1 } must satisfy
1
J2,π (x) = x − µ1 (0) ≤ − , ∀ x ∈ X,
#
[in view of J2∗ (x) = −∞ for all x ∈ X], which is impossible.

4.3 INFINITE HORIZON PROBLEMS

Consider the infinite horizon problem (4.2), where the cost function of a
policy π = {µ0 , µ1 , . . .} is
Jπ (x) = lim sup (Tµ0 · · · Tµk J¯)(x), ∀ x ∈ X.
k→∞

In this section one of the following two assumptions will be in effect.

Assumption I: (Monotone Increase)


(a) We have

−∞ < J¯(x) ≤ H(x, u, J¯), ∀ x ∈ X, u ∈ U (x).

(b) For each sequence {Jm } ⊂ E(X) with Jm ↑ J and J¯ ≤ Jm for


all m ≥ 0, we have

lim H(x, u, Jm ) = H (x, u, J ) , ∀ x ∈ X, u ∈ U (x).


m→∞
140 Noncontractive Models Chap. 4

(c) There exists a scalar α ∈ (0, ∞) such that for all scalars r ∈
(0, ∞) and functions J ∈ E(X) with J¯ ≤ J , we have

H(x, u, J + r e) ≤ H(x, u, J ) + α r, ∀ x ∈ X, u ∈ U (x).

Assumption D: (Monotone Decrease)


(a) We have

J¯(x) ≥ H(x, u, J¯), ∀ x ∈ X, u ∈ U (x).

(b) For each sequence {Jm } ⊂ E(X) with Jm ↓ J and Jm ≤ J¯ for


all m ≥ 0, we have

lim H(x, u, Jm ) = H (x, u, J ) , ∀ x ∈ X, u ∈ U (x).


m→∞

Assumptions I and D apply to the classes of the negative and positive


DP models, respectively (see [Ber12a], Chapter 4). These are the special
cases of the infinite horizon stochastic optimal control problem of Example
1.2.1, where J¯(x) ≡ 0 and the cost per stage g is uniformly nonnegative or
uniformly nonpositive, respectively. The latter occurs often when we want
to maximize positive rewards.
It is important to note that Assumptions I and D allow Jπ to be
defined as a limit rather than as a lim sup. In particular, part (a) of the
assumptions and the monotonicity of H imply that

J¯ ≤ Tµ0 J¯ ≤ Tµ0 Tµ1 J¯ ≤ · · · ≤ Tµ0 · · · Tµk J¯ ≤ · · ·

under Assumption I, and

J¯ ≥ Tµ0 J¯ ≥ Tµ0 Tµ1 J¯ ≥ · · · ≥ Tµ0 · · · Tµk J¯ ≥ · · ·

under Assumption D. Thus we have

Jπ (x) = lim (Tµ0 · · · Tµk J¯)(x), ∀ x ∈ X,


k→∞

with the limit being a real number or ∞ or −∞.


Sec. 4.3 Infinite Horizon Problems 141

= 0 Tµ J = 0 Tµ J

J¯ Jµ J TJ Jµ J¯ J TJ

Figure 4.3.1. Illustration of the consequences of lack of continuity of Tµ from


below or from above [cf. part (b) of Assumption I or D, respectively]. In the
figure on the left, we have J¯ ≤ Tµ J¯ but Tµ is discontinuous from below at Jµ , so
Assumption I does not hold, and Jµ is not a fixed point of Tµ . In the figure on the
right, we have J¯ ≥ Tµ J¯ but Tµ is discontinuous from above at Jµ , so Assumption
D does not hold, and Jµ is not a fixed point of Tµ .

The conditions of part (b) of Assumptions I and D are continuity as-


sumptions designed to preclude some of the anomalies of the type encoun-
tered also in Chapter 3, and addressed with the use of S-regular policies.
In particular, these conditions are essential for making a connection with
fixed point theory: they ensure that Jµ is a fixed point of Tµ , as shown in
the following proposition.

Proposition 4.3.1: Let Assumption I or Assumption D hold. Then


for every policy µ ∈ M, we have

Jµ = T µ Jµ .

Proof: Let Assumption I hold. Then for all k ≥ 0,


(Tµk+1 J¯)(x) = H x, µ(x), Tµk J¯ ,
! "
x ∈ X,
and by taking the limit as k → ∞, and using part (b) of Assumption I,
and the fact Tµk J¯ ↑ Jµ , we have for all x ∈ X,
Jµ (x) = lim H x, µ(x), Tµk J¯ = H x, µ(x), lim Tµk J¯ = H x, µ(x), Jµ ,
! " ! " ! "
k→∞ k→∞
or equivalently Jµ = Tµ Jµ . The proof for the case of Assumption D is
similar. Q.E.D.
142 Noncontractive Models Chap. 4

J Tµ J

J¯ Tµk J¯

Tµ J T

=0 Jµ Jµ J¯ = 0 J TJ
J∗ = T J∗

Figure 4.3.2. An example where nonstationary policies are dominant under As-
sumption D. Here there is only one state and S = !. There are two stationary
policies µ and µ with cost functions Jµ and Jµ as shown. However, by considering
a nonstationary policy of the form πk = {µ, . . . , µ, µ, µ, . . .}, with a number k of
policies µ, we can obtain a sequence {Jπk } that converges to the value J ∗ shown.
Note that here there is no optimal policy, stationary or not.

Figure 4.3.1 illustrates how Jµ may fail to be a fixed point of Tµ if


part (b) of Assumption I or D is violated. Note also that continuity of Tµ
does not imply continuity of T , and for example, under Assumption I, T
may be discontinuous from below. We will see later that as a result, the
value iteration sequence {T k J¯} may fail to converge to J * in the absence
of additional conditions (see Section 4.3.2). Part (c) of Assumption I is a
technical condition that facilitates the analysis, and assures the existence
of !-optimal policies.
Despite the similarities between Assumptions I and D, the corre-
sponding results that one may obtain involve some substantial differences.
In particular, Fig. 4.3.2 shows that nonstationary policies may become im-
portant, and may dominate all stationary policies under Assumption D.
It turns out that this cannot happen under Assumption I. An important
fact, which breaks the symmetry between the two cases, is that J * is ap-
proached by Jµ from above, but it is approached by T k J¯ from below in the
case of Assumption I and from above in the case of Assumption D. Another
important fact about Assumption I is that since we assume
J¯(x) > −∞, ∀ x ∈ X,
all the functions J encountered in the analysis under this assumption (such
Sec. 4.3 Infinite Horizon Problems 143

as T k J¯, Jπ , and J * ) also satisfy

J (x) > −∞, ∀ x ∈ X.

In particular, if J ≥ J¯, we have (T J )(x) ≥ (T J¯)(x) > −∞ for all x ∈ X,


and for every ! > 0 there exists µ" ∈ M such that

Tµ! J ≤ T J + ! e.

This property is not available under Assumption D and accounts in part


for the different character of the results that can be obtained under the
two assumptions.

4.3.1 Fixed Point Properties and Optimality Conditions

We first consider the question whether the optimal cost function J * is a


fixed point of T . This is indeed true, but the lines of proof are different
under the Assumptions I and D. We begin with the proof under Assump-
tion I, and as a preliminary step we show the following result, which is of
independent interest.

Proposition 4.3.2: Let Assumption I hold. Then given any ! > 0,


there exists a policy π" ∈ Π such that

J * ≤ Jπ! ≤ J * + ! e.

Furthermore, if the scalar α in part (c) of Assumption I satisfies α < 1,


the policy π" can be taken to be stationary.

Proof: Let {!k } be a sequence such that !k > 0 for all k and

!
αk !k = !. (4.8)
k=0
" #
For each x ∈ X, consider a sequence of policies πk [x] ⊂ Π of the form
" #
πk [x] = µk0 [x], µk1 [x], . . . , (4.9)

such that for k = 0, 1, . . . ,

Jπk [x] (x) ≤ J * (x) + !k . (4.10)

Such a sequence exists, since we have assumed that J¯(x) > −∞, and
therefore J * (x) > −∞, for all x ∈ X.
144 Noncontractive Models Chap. 4

The preceding notation should be interpreted as follows. The policy


πk [x] of Eq. (4.9) is associated with x. Thus µki [x] denotes for each x and
k, a function in M, while µki [x](z) denotes the value of µki [x] at an element
z ∈ X. In particular, µki [x](x) denotes the value of µki [x] at x ∈ X.
Consider the functions µk defined by

µk (x) = µk0 [x](x), ∀ x ∈ X, (4.11)

and the functions J¯k defined by


! "
J¯k (x) = H x, µk (x), lim Tµk [x] · · · Tµkm [x] J¯ , ∀ x ∈ X, k = 0, 1, . . . .
m→∞ 1
(4.12)
By using Eqs. (4.10), (4.11), and part (b) of Assumption I, we obtain for
all x ∈ X and k = 0, 1, . . .

J¯k (x) = lim Tµk [x] · · · Tµkm [x] J¯ (x)


# $
m→∞ 0

= Jπk [x] (x) (4.13)


≤ J * (x) + "k .

From Eqs. (4.12), (4.13), and part (c) of Assumption I, we have for all
x ∈ X and k = 1, 2, . . .,

(Tµk−1 J¯k )(x) = H x, µk−1 (x), J¯k


# $
# $
≤ H x, µk−1 (x), J * + "k e
# $
≤ H x, µk−1 (x), J * + α"k
! "
≤ H x, µk−1 (x), lim Tµk−1 [x] · · · Tµk−1 [x] J¯ + α"k
m→∞ 1 m

= J¯k−1 (x) + α"k ,

and finally
Tµk−1 J¯k ≤ J¯k−1 + α"k e, k = 1, 2, . . . .
Using this inequality and part (c) of Assumption I, we obtain

Tµk−2 Tµk−1 J¯k ≤ Tµk−2 (J¯k−1 + α"k e)


≤ Tµ J¯k−1 + α2 "k e
k−2

≤ J¯k−2 + (α"k−1 + α2 "k ) e.

Continuing in the same manner, we have for k = 1, 2, . . . ,

k
% '
Tµ0 · · · Tµk−1 J¯k ≤ J¯0 + (α"1 + · · · + αk "k ) e ≤ J * +
&
αi "i e.
i=0
Sec. 4.3 Infinite Horizon Problems 145

Since J¯ ≤ J¯k , it follows that


! k
#
Tµ0 · · · Tµk−1 J¯ ≤ J * +
"
αi "i e.
i=0

Denote π! = {µ0 , µ1 , . . .}. Then by taking the limit in the preceding in-
equality and using Eq. (4.8), we obtain

Jπ! ≤ J * + " e.
$ %
If α < 1, we take "k = "(1−α) for all k, and πk [x] = µ0 [x], µ1 [x], . . .
in Eq. (4.10). The stationary policy π! = {µ, µ, . . .}, where µ(x) = µ0 [x](x)
for all x ∈ X, satisfies Jπ! ≤ J * + " e. Q.E.D.

Note that the assumption α < 1 is essential in order to be able to take


π! stationary in the preceding proposition. As an example, let X = {0},
U (0) = (0, ∞), J¯(0) = 0, H(0, u, J ) = u + J (0). Then J * (0) = 0, but for
any µ ∈ M, we have Jµ (0) = ∞.
By using Prop. 4.3.2 we can prove the following.

Proposition 4.3.3: Let Assumption I hold. Then

J * = T J *.

Furthermore, if J ! ∈ E(X) is such that J ! ≥ J¯ and J ! ≥ T J ! , then


J ! ≥ J *.

Proof: For every π = {µ0 , µ1 , . . .} ∈ Π and x ∈ X, we have using part (b)


of Assumption I,

Jπ (x) = lim (Tµ0 Tµ1 · · · Tµk J¯)(x)


k→∞
& '
¯
= Tµ0 lim Tµ1 · · · Tµk J (x)
k→∞

≥ (Tµ0 J * )(x)
≥ (T J * )(x).

By taking the infimum of the left-hand side over π ∈ Π, we obtain

J * ≥ T J *.

To prove the reverse inequality, let "1 and "2 be any positive scalars,
and let π = {µ0 , µ1 , . . .} be such that

Tµ0 J * ≤ T J * + "1 e, Jπ1 ≤ J * + "2 e,


146 Noncontractive Models Chap. 4

where π1 = {µ1 , µ2 , . . .} (such a policy exists by Prop. 4.3.2). The sequence


{Tµ1 · · · Tµk J¯} is monotonically nondecreasing, so by using the preceding
relations and part (c) of Assumption I, we have
! "
¯ ¯
Tµ0 Tµ1 · · · Tµk J ≤ Tµ0 lim Tµ1 · · · Tµk J
k→∞

= Tµ0 Jπ1
≤ Tµ0 J * + α#2 e
≤ T J * + (#1 + α#2 ) e.
Taking the limit as k → ∞, we obtain
J * ≤ Jπ = lim Tµ0 Tµ1 · · · Tµk J¯ ≤ T J * + (#1 + α#2 ) e.
k→∞
Since #1 and #2 can be taken arbitrarily small, it follows that
J * ≤ T J *.
Hence J * = T J * .
Assume that J # ∈ E(X) satisfies J # ≥ J¯ and J # ≥ T J # . Let {#k } be
any sequence with #k > 0 for all k, and consider a policy π = {µ0 , µ1 , . . .} ∈
Π such that
Tµk J # ≤ T J # + #k e, k = 0, 1, . . . .
We have from part (c) of Assumption I
J * = inf lim Tµ0 · · · Tµk J¯
π∈Π k→∞

≤ inf lim inf Tµ0 · · · Tµk J #


π∈Π k→∞

≤ lim inf Tµ0 · · · Tµk J #


k→∞
≤ lim inf Tµ0 · · · Tµk−1 (T J # + #k e)
k→∞
≤ lim inf Tµ0 · · · Tµk−1 (J # + #k e)
k→∞
≤ lim inf (Tµ0 · · · Tµk−1 J # + αk #k e)
k→∞
..
.
k
# # % %
$
≤ lim T J# + αi #i e
k→∞
i=0
# k %
$
≤ J# + αi #i e.
i=0
&k
Since we may choose i=0 αi #i as small as desired, it follows that J * ≤ J # .
Q.E.D.

The following counterexamples show that parts (b) and (c) of As-
sumption I are essential for the preceding proposition to hold.
Sec. 4.3 Infinite Horizon Problems 147

Example 4.3.1 (Counterexample to Bellman’s Equation I)

Let

X = {0, 1}, U (0) = U (1) = (−1, 0], ¯


J(0) = J¯(1) = −1,
!
u if J(1) ≤ −1,
H(0, u, J) = H(1, u, J) = u.
0 if J(1) > −1,
Then for N ≥ 1,

¯
(Tµ0 · · · TµN −1 J)(0) = 0, (Tµ0 · · · TµN −1 J¯)(1) = µ0 (1).

Thus

J ∗ (0) = 0, J ∗ (1) = −1, (T J ∗ )(0) = −1, (T J ∗ )(1) = −1,

and hence J ∗ $= T J ∗ . Notice also that J¯ is a fixed point of T , while J¯ ≤ J ∗


and J¯ $= J ∗ , so the second part of Prop. 4.3.3 fails when J¯ = J # . Here
parts (a) and (b) of Assumption I are satisfied, but part (c) is violated, since
H(0, u, ·) is discontinuous at J = −1 when u < 0.

Example 4.3.2 (Counterexample to Bellman’s Equation II)

Let

X = {0, 1}, U (0) = U (1) = {0}, J¯(0) = J(1)


¯ = 0,
!
0 if J(1) < ∞,
H(0, 0, J) = H(1, 0, J) = J(1) + 1.
∞ if J(1) = ∞,
Here there is only one policy, which we denote by µ. For all N ≥ 1, we have

¯
(TµN J)(0) = 0, ¯
(TµN J)(1) = N,

so J ∗ (0) = 0, J ∗ (1) = ∞. On the other hand, we have (T J ∗ )(0) = (T J ∗ )(1) =


∞ and J ∗ $= T J ∗ . Here parts (a) and (c) of Assumption I are satisfied, but
part (b) is violated.

As a corollary to Prop. 4.3.3 we obtain the following.

Proposition 4.3.4: Let Assumption I hold. Then for every µ ∈ M,


we have
Jµ = T µ Jµ .
Furthermore, if J ! ∈ E(X) is such that J ! ≥ J¯ and J ! ≥ Tµ J ! , then
J ! ≥ Jµ .
148 Noncontractive Models Chap. 4

Proof: Consider the variant of the ! infinite


" horizon problem where the
control constraint set is Uµ (x) = µ(x) rather than U (x) for all x ∈ X.
Application of Prop. 4.3.3 yields the result. Q.E.D.

We now provide the counterpart of Prop. 4.3.3 under Assumption D.


We first prove a preliminary result regarding the convergence of the value
iteration method, which is of independent interest (we will see later that
this result need not hold under Assumption I).

Proposition 4.3.5: Let Assumption D hold. Then T N J¯ = JN *,


*
where JN is the optimal cost function for the N -stage problem. More-
over
J * = lim JN *.
N →∞

Proof: By repeating the proof of Prop. 4.2.3, we have T N J¯ = JN * [part (b)

of Assumption D is essentially identical to the assumption of that propo-


sition]. Clearly we have J * ≤ JN * for all N , and hence J * ≤ lim *
N →∞ JN .
Also for all π = {µ0 , µ1 , . . .} ∈ Π, we have

Tµ0 · · · TµN −1 J¯ ≥ JN
*,

*,
so by taking the limit of both sides as N → ∞, we obtain Jπ ≥ limN →∞ JN
* * *
and by taking infimum over π, J ≥ limN →∞ JN . Thus J = limN →∞ JN *.

Q.E.D.

Proposition 4.3.6: Let Assumption D hold. Then

J * = T J *.

Furthermore, if J # ∈ E(X) is such that J # ≤ J¯ and J # ≤ T J # , then


J # ≤ J *.

Proof: For any π = {µ0 , µ1 , . . .} ∈ Π, we have

Jπ = lim Tµ0 Tµ1 · · · Tµk J¯ ≥ lim Tµ0 T k J¯ ≥ Tµ0 J * ,


k→∞ k→∞

where the last inequality follows from the fact T k J¯ ↓ J * (cf. Prop. 4.3.5).
Taking the infimum of both sides over π ∈ Π, we obtain J * ≥ T J * .
To prove the reverse inequality, we select any µ ∈ M, and we apply
Tµ to both sides of the equation J * = limN →∞ T N J¯ (cf. Prop. 4.3.5). By
Sec. 4.3 Infinite Horizon Problems 149

using part (b) of assumption D, we obtain


! "
Tµ J * = Tµ lim T N J¯ = lim Tµ T N J¯ ≥ lim T N +1 J¯ = J * .
N →∞ N →∞ N →∞

Taking the infimum of the left-hand side over µ ∈ M, we obtain T J * ≥ J * ,


showing that T J * = J * .
To complete the proof, let J # ∈ E(X) be such that J # ≤ J¯ and
J # ≤ T J # . Then we have

J * = inf lim Tµ0 · · · TµN −1 J¯


π∈Π N →∞

≥ lim inf Tµ0 · · · TµN −1 J¯


N →∞ π∈Π

≥ lim inf Tµ0 · · · TµN −1 J #


N →∞ π∈Π

≥ lim T N J #
N →∞
≥ J #,

where the last inequality follows from the hypothesis J # ≤ T J # . Thus


J * ≥ J # . Q.E.D.

Counterexamples to Bellman’s equation can be readily constructed if


part (b) of Assumption D (continuity from above) is violated. In particular,
in Examples 4.2.1 and 4.2.2, part (a) of Assumption D is satisfied but part
(b) is not. In both cases we have J * $= T J * , as the reader can verify with
a straightforward calculation.
Similar to Prop. 4.3.4, we obtain the following.

Proposition 4.3.7: Let Assumption D hold. Then for every µ ∈ M,


we have
Jµ = T µ Jµ .
Furthermore, if J # ∈ E(X) is such that J # ≤ J¯ and J # ≤ Tµ J # , then
J # ≤ Jµ .

Proof: Consider# the$ variation of our problem where the control constraint
set is Uµ (x) = µ(x) rather than U (x) for all x ∈ X. Application of Prop.
4.3.6 yields the result. Q.E.D.

An examination of the proof of Prop. 4.3.6 shows that the only point
where we need part (b) of Assumption D was in establishing the relations
! "
* = T
lim T JN *
lim JN
N →∞ N →∞
150 Noncontractive Models Chap. 4

and JN* = T N J¯. If these relations can be established independently, then

the result of Prop. 4.3.6 follows. In this manner we obtain the following
proposition.

Proposition 4.3.8: Let part (a) of Assumption D hold, assume that


X is a finite set, and that J * (x) > −∞ for all x ∈ X. Assume further
that there exists a scalar α ∈ (0, ∞) such that for all scalars r ∈ (0, ∞)
and functions J ∈ E(X) with J ≤ J¯, we have

H(x, u, J ) − α r ≤ H(x, u, J − r e), ∀ x ∈ X, u ∈ U (x). (4.14)

Then
J * = T J *.
Furthermore, if J ! ∈ E(X) is such that J ! ≤ J¯ and J ! ≤ T J ! , then
J ! ≤ J *.

Proof: A nearly verbatim repetition of Prop. 4.2.4 shows that under our
* = T N J¯ for all N . We will show that
assumptions we have JN

! "
* ) ≤ H x, u, lim J *
lim H(x, u, JN
N →∞ N ,
N →∞
∀ x ∈ X, u ∈ U (x).

Then the result follows as in the proof of Prop. 4.3.6.


Assume the contrary, i.e., that for some x̃ ∈ X, ũ ∈ U (x̃), and " > 0,
there holds
! "
H(x̃, ũ, Jk* ) − " > H x̃, ũ, lim JN
* , k = 1, 2, . . . .
N →∞

* (x) > −∞ for


From the finiteness of X and the fact J * (x) = limN →∞ JN
all x, we know that for some integer k > 0

Jk* − ("/α)e ≤ lim JN


*, ∀ k ≥ k.
N →∞

By using the condition (4.14), we obtain for all k ≥ k

! "
H(x̃, ũ, Jk* ) − " ≤ H x̃, ũ, Jk* − ("/α) e ≤ H x̃, ũ, lim JN
*
# $
,
N →∞

which contradicts the earlier inequality. Q.E.D.


Sec. 4.3 Infinite Horizon Problems 151

Characterization of Optimal Policies

We now provide necessary and sufficient conditions for optimality of a sta-


tionary policy. These conditions are markedly different under Assumptions
I and D.

Proposition 4.3.9: Let Assumption I hold. Then a stationary policy


µ is optimal if and only if

Tµ J * = T J * .

Proof: If µ is optimal, then Jµ = J * so that the equation J * = T J * (cf.


Prop. 4.3.3) implies that Jµ = T Jµ . Since Jµ = Tµ Jµ (cf. Prop. 4.3.4), it
follows that Tµ J * = T J * .
Conversely, if Tµ J * = T J * , then since J * = T J * , it follows that
Tµ J = J * . By Prop. 4.3.4, it follows that
* Jµ ≤ J * , so µ is optimal.
Q.E.D.

Proposition 4.3.10: Let Assumption D hold. Then a stationary


policy µ is optimal if and only if

T µ Jµ = T J µ .

Proof: If µ is optimal, then Jµ = J * , so that the equation J * = T J *


(cf. Prop. 4.3.6) can be written as Jµ = T Jµ . Since Jµ = Tµ Jµ (cf. Prop.
4.3.4), it follows that Tµ Jµ = T Jµ .
Conversely, if Tµ Jµ = T Jµ , then since Jµ = Tµ Jµ , it follows that
Jµ = T Jµ . By Prop. 4.3.7, it follows that Jµ ≤ J * , so µ is optimal.
Q.E.D.

An example showing that under Assumption I, the condition Tµ Jµ =


T Jµ does not guarantee optimality of µ is given in Exercise 4.3. Under
Assumption D, we note that by Prop. 4.3.1 or 4.3.7, we have Jµ = Tµ Jµ
for all µ, so if µ is a stationary optimal policy, the fixed point equation
J * (x) = inf H(x, u, J * ), ∀ x ∈ X, (4.15)
u∈U (x)

and the optimality condition of Prop. 4.3.10, yield


T J * = J * = Jµ = T µ Jµ = T µ J * .
152 Noncontractive Models Chap. 4

Thus under D, a stationary optimal policy attains the infimum in the fixed
point Eq. (4.15) for all x. However, there may exist nonoptimal stationary
policies also attaining the infimum for all x, as shown by Fig. 4.3.2, and by
case (a) of the three-node shortest path example of Section 3.1.2. Moreover,
it is possible that this infimum is attained but no optimal policy exists, as
shown by Fig. 4.3.2 (see also Exercise 4.7).
Proposition 4.3.9 shows that under Assumption I, there exists a sta-
tionary optimal policy if and only if the infimum in the optimality equation

J * (x) = inf H(x, u, J * )


u∈U (x)

is attained for every x ∈ X. When the infimum is not attained for some x ∈
X, this optimality equation can still be used to yield an !-optimal policy,
which can be taken to be stationary whenever the scalar α in Assumption
I(c) is strictly less than 1. This is shown in the following proposition.

Proposition 4.3.11: Let Assumption I hold. Then:


(a) If ! > 0, the sequence {!k } satisfies ∞
! k
k=0 α !k = !, and !k > 0
∗ ∗ ∗
for all k, and the policy π = {µ0 , µ1 , . . .} ∈ Π is such that

Tµ∗ J * ≤ T J * + !k e, ∀ k = 0, 1, . . . ,
k

then
J * ≤ Jπ∗ ≤ J * + ! e.

(b) If ! > 0, the scalar α in part (c) of Assumption I is strictly less


than 1, and µ∗ ∈ M is such that

Tµ∗ J * ≤ T J * + !(1 − α) e,

then
J * ≤ Jµ∗ ≤ J * + ! e.

Proof: (a) Since T J * = J * , we have

Tµ∗ J * ≤ J * + !k e,
k

and applying Tµ∗ to both sides, we obtain


k−1

Tµ∗ Tµ∗ J * ≤ Tµ∗ J * + α!k e ≤ J * + (!k−1 + α!k ) e.


k−1 k k−1
Sec. 4.3 Infinite Horizon Problems 153

Applying Tµ∗ throughout and repeating the process, we obtain for every
k−2
k = 1, 2, . . .,
! k #
"
Tµ∗0 · · · Tµ∗ J * ≤ J * + αi "i e, k = 1, 2, . . . .
k
i=0
Since J¯ ≤ J * , it follows that
! k
#
Tµ∗0 · · · Tµ∗ J¯ ≤ J * +
"
αi "i e, k = 1, 2, . . . .
k
i=0
By taking the limit as k → ∞, we obtain Jπ∗ ≤ J * + " e.
(b) This part is proved by taking "k = "(1 − α) and µ∗k = µ∗ for all k in
the preceding argument. Q.E.D.

Under Assumption D, the existence of an "-optimal policy is harder


to establish, and requires some restrictive conditions.

Proposition 4.3.12: Let Assumption D hold, and let the additional


assumptions of Prop. 4.3.8 hold. Then for any " > 0, there exists an
"-optimal policy.

Proof: For each N , denote


"
"N = ,
2(1 + α + · · · + αN −1 )
and let
πN = {µN N N
0 , µ1 , . . . , µN −1 , µ, µ . . .}
be such that µ ∈ M, and for k = 0, . . . , N − 1, µN
k ∈ M and
T N T N −k−1 J¯ = T N −k J¯ + "N e.
µk

We have TµNN −1 J¯ ≤ T J¯+"N e, and applying TµNN −2 to both sides, we obtain


TµNN −2 TµNN −1 J¯ ≤ TµN T J¯ + α"N e ≤ T 2 J¯ + (1 + α)"N e.
N −2

Continuing in the same manner, we have


TµN0 · · · TµNN −1 J¯ ≤ T N J¯ + (1 + α + · · · + αN −1 )"N e,
from which we obtain for N = 0, 1, . . .,
Jπ ≤ T N J¯ + ("/2) e.
N

By Prop. 4.3.5, we have J* = limN →∞ T N J¯, so let N̄ be such that


T N̄ J¯ ≤ J * + ("/2) e
[such a N̄ exists using the assumptions of finiteness of X and J * (x) > −∞
for all x ∈ X]. Then we obtain JπN̄ ≤ J * + " e, and πN̄ is the desired
policy. Q.E.D.
154 Noncontractive Models Chap. 4

4.3.2 Value Iteration

We will now discuss algorithms for abstract DP under Assumptions I and


and D. We first consider value iteration (VI for short), which consists of
¯ T 2 J¯, . . .. Note that because T need not be a
successively generating T J,
contraction, it may have multiple fixed points J all of which satisfy J ≥ J *
under Assumption I (cf. Prop. 4.3.3) or J ≤ J * under Assumption D (cf.
Prop. 4.3.6). Thus, in the absence of additional conditions, it is essential to
start VI with J¯ or an initial J0 such that J¯ ≤ J0 ≤ J * under Assumption
I or J¯ ≥ J0 ≥ J * under Assumption D. In the next two propositions,
we show that for such initial conditions, we have convergence of VI to J *
under Assumption D, and with an additional compactness condition, under
Assumption I.

Proposition 4.3.13: Let Assumption D hold, and assume that J0 ∈


E(X) is such that J¯ ≥ J0 ≥ J * . Then

lim T k J0 = J * .
k→∞

Proof: The condition J¯ ≥ J0 ≥ J * implies that T k J¯ ≥ T k J0 ≥ J * for all


k. By Prop. 4.3.5, T k J¯ → J * , and the result follows. Q.E.D.

The convergence of VI under I requires an additional compactness


condition, which is satisfied in particular if U (x) is a finite set for all x ∈ X.

Proposition 4.3.14: Let Assumption I hold, let U be a metric space,


and assume that the sets

Uk (x, λ) = u ∈ U (x)" H(x, u, T k J¯) ≤ λ


! " #
(4.16)

are compact for every x ∈ X, λ ∈ %, and for all k greater than some
integer k. Assume that J0 ∈ E(X) is such that J¯ ≤ J0 ≤ J * . Then

lim T k J0 = J * .
k→∞

Furthermore, there exists a stationary optimal policy.

Proof: Similar to the proof of Prop. 4.3.13, it will suffice to show that
T k J¯ → J * . Since J¯ ≤ J * , we have T k J¯ ≤ T k J * = J * , so that
J¯ ≤ T J¯ ≤ · · · ≤ T k J¯ ≤ · · · ≤ J * .
Sec. 4.3 Infinite Horizon Problems 155

Thus we have T k J¯ ↑ J∞ for some J∞ ∈ E(X) satisfying T k J¯ ≤ J∞ ≤ J *


for all k. Applying T to this relation, we obtain

(T k+1 J¯)(x) = min H(x, u, T k J¯) ≤ (T J∞ )(x),


u∈U (x)

and by taking the limit as k → ∞, it follows that

J ∞ ≤ T J∞ .

Assume to arrive at a contradiction that there exists a state x̃ ∈ X such


that
J∞ (x̃) < (T J∞ )(x̃). (4.17)
Similar to Lemma 3.2.1, there exists a point uk attaining the minimum in

(T k+1 J¯)(x̃) = inf H(x̃, u, T k J¯);


u∈U (x̃)

i.e., uk is such that

(T k+1 J¯)(x̃) = H(x̃, uk , T k J¯).

Clearly, by Eq. (4.17), we must have J∞ (x̃) < ∞. For every k, consider
the set
" # %
Uk x̃, J∞ (x̃) = u ∈ U (x̃) $ H(x̃, uk , T k J¯) ≤ J∞ (x̃) ,
! $

and the sequence {ui }∞ k ¯


i=k . Since T J ↑ J∞ , it follows that for all i ≥ k,

H(x̃, ui , T k J¯) ≤ H(x̃, ui , T i J¯) ≤ J∞ (x̃).


! " ! "
Therefore {ui }∞
i=k ⊂ Uk x̃, J∞ (x̃) , and since
! Uk x̃," J∞ (x̃) is compact, all
the limit points of {ui }∞
i=k belong to Uk x̃, J∞ (x̃) and at least one such
limit point exists. Hence the same is true of the limit points of the whole
sequence {ui }. It follows that if ũ is a limit point of {ui } then
! "
ũ ∈ ∩∞
k=0 Uk x̃, J∞ (x̃) .

By Eq. (4.16), this implies that for all k ≥ k

J∞ (x̃) ≥ H(x̃, ũ, T k J¯) ≥ (T k+1 J¯)(x̃).

Taking the limit as k → ∞, and using part (b) of Assumption I, we obtain

J∞ (x̃) ≥ H(x̃, ũ, J∞ ) ≥ (T J∞ )(x̃). (4.18)


156 Noncontractive Models Chap. 4

which contradicts Eq. (4.17). Hence J∞ = T J∞ , which implies that J∞ ≥


J * in view of Prop. 4.3.3. Combined with the inequality J∞ ≤ J * , which
was shown earlier, we have J∞ = J * .
To show that there exists an optimal stationary policy, observe that
the relation J * = J∞ = T J∞ and Eq. (4.18) [whose proof is valid for all
x̃ ∈ X such that J * (x̃) < ∞] imply that ũ attains the infimum in
J * (x̃) = inf H(x̃, u, J * )
u∈U (x̃)

for all x̃ ∈ X with J * (x̃) < ∞. For x̃ ∈ X such that J * (x̃) = ∞, every
u ∈ U (x̃) attains the preceding minimum. Hence by Prop. 4.3.9 an optimal
stationary policy exists. Q.E.D.

The reader may verify by inspection of the preceding proof that if


µk (x̃), k = 0, 1, . . ., attains the infimum in the relation
(T k+1 J¯)(x̃) = inf H(x̃, u, T k J¯),
u∈U (x)

and µ∗ (x̃) is a limit point of {µk (x̃)}, for every x̃ ∈ X, then the stationary
policy µ∗ is optimal. Furthermore, {µk (x̃)} has at least one limit point
for every x̃ ∈ X for which J * (x̃) < ∞. Thus the VI algorithm under the
assumption of Prop. 4.3.14 yields in the limit not only the optimal cost
function J * but also an optimal stationary policy.
On the other hand, under Assumption I but in the absence of the
compactness condition (4.16), T k J¯ need not converge to J * . What is hap-
pening here is that while the mappings Tµ are continuous from below as
required by Assumption I(b), T may not be, and a phenomenon like the
one illustrated in the left-hand side of Fig. 4.3.1 may occur, whereby
! "
lim T k J¯ ≤ T lim T k J¯ ,
k→∞ k→∞

with strict inequality for some x ∈ X. This can happen even in simple
deterministic optimal control problems, as shown by the following example.

Example 4.3.3 (Counterexample to Convergence of VI)

Let
X = [0, ∞), U (x) = (0, ∞), ¯
J(x) = 0, ∀ x ∈ X,
and
# $
H(x, u, J) = min 1, x + J(2x + u) , ∀ x ∈ X, u ∈ U (x).

Then it can be verified that for all x ∈ X and policies µ, we have Jµ (x) = 1,
as well as J ∗ (x) = 1, while it can be seen by induction that starting with J¯,
the VI algorithm yields
¯
(T k J)(x) = min 1, (1 + 2k−1 )x ,
# $
∀ x ∈ X, k = 1, 2 . . . .
Sec. 4.3 Infinite Horizon Problems 157

¯
Thus we have 0 = limk→∞ (T k J)(0) != J * (0) = 1.

The range of convergence of VI may be expanded under additional as-


sumptions. In particular, in Chapter 3, under various conditions involving
the existence of optimal S-regular policies, we showed that VI converges
to J * assuming that the initial condition J0 satisfies J0 ≥ J * . Thus if the
assumptions of Prop. 4.3.14 hold in addition, we are guaranteed conver-
gence of VI starting from any J satisfying J ≥ J¯. Results of this type will
be obtained in the Section 4.4.1, where semicontractive models satisfying
Assumption I will be discussed.

Asynchronous Value Iteration

The concepts of asynchronous VI that we developed in Section 2.6.1 apply


also under the Assumptions I and D of this section. Under Assumption I,
if J * is real-valued, we may apply Prop. 2.6.1 with the sets S(k) defined by

S(k) = {J | T k J¯ ≤ J ≤ J * }, k = 0, 1, . . . .

Assuming that T k J¯ → J * (cf. Prop. 4.3.14), it follows that the asyn-


chronous form of VI converges pointwise to J * starting from any func-
tion in S(0). This result can also be shown for the case where J * is not
real-valued, by using a simple extension of Prop. 2.6.1, where the set of
real-valued functions R(X) is replaced by the set of all J ∈ E(X) with
J¯ ≤ J ≤ J * .
Under Assumption D similar conclusions hold for the asynchronous
version of VI that starts with a function J with J * ≤ J ≤ J¯. Asynchronous
pointwise convergence to J * can be shown, based on an extension of the
asynchronous convergence theorem (Prop. 2.6.1), where R(X) is replaced
by the set of all J ∈ E(X) with J * ≤ J ≤ J¯.

4.3.3 Policy Iteration


Unfortunately, in the absence of further conditions, policy iteration (PI
for short) is not guaranteed to yield the optimal cost function and/or an
optimal policy under either Assumption I or D. However, there are conver-
gence results for nonoptimistic and optimistic variants of PI under some
conditions. These results have a different character than the ones we have
seen so far in Chapters 2 and 3. One major difference is that they guaran-
tee convergence to J * , but not to an optimal policy (in fact in the case of
Assumption D, there need not exist an optimal policy).
In what follows in this section we will provide an analysis of various
types of PI, mainly under Assumption D. The analysis of PI under Assump-
tion I will be given primarily in the next section, as it requires different
assumptions and methods of proof, and will be coupled with ideas relating
to the semicontractive models of Chapter 3.
158 Noncontractive Models Chap. 4

Optimistic Policy Iteration Under D

A surprising fact under Assumption D is that nonoptimistic/exact PI may


generate a policy that is strictly inferior over the preceding one. Moreover
there may be an oscillation between nonoptimal policies even when the
state and control spaces are finite. This is illustrated by the example of
Fig. 3.1.7, and by case (a) of the three-node shortest path example of
Section 3.1.2, where it can be verified that exact PI may oscillate between
the policy that moves to the destination from node 1 and the policy that
does not (see also Exercises 4.7 and 4.8). A related difficulty is that, under
Assumption D, we may have Tµ J * = T J * without µ being optimal, so
starting from an optimal policy, we may obtain a nonoptimal policy by PI.
On the other hand optimistic PI under Assumption D has much better
convergence properties, because it embodies the mechanism of VI, which
is convergent to J * as we saw in the preceding subsection. Indeed, let
us consider an optimistic PI algorithm that generates a sequence {Jk , µk }
according to †
m
Tµk Jk = T Jk , Jk+1 = Tµkk Jk , (4.19)
where mk is a positive integer. We assume that the algorithm starts with a
function J0 ∈ E(X) that satisfies J¯ ≥ J0 ≥ T J0 and J0 ≥ J * . For example,
we may choose J0 = J¯. We have the following proposition.

Proposition 4.3.15: Let Assumption D hold and let {Jk , µk } be a


sequence generated by the optimistic PI algorithm (4.19), assuming
that J¯ ≥ J0 ≥ J * and J0 ≥ T J0 . Then Jk ↓ J ∗ .

Proof: We have

J0 ≥ Tµ0 J0 ≥ Tµm00 J0 = J1 ≥ Tµm00 +1 J0 = Tµ0 J1 ≥ T J1 = Tµ1 J1 ≥ · · · ≥ J2 ,

where the first, second, and third inequalities hold because the assumption
J0 ≥ T J0 = Tµ0 J0 implies that Tµ!0 J0 ≥ Tµ!+1
0 J0 for all ! ≥ 0. Continuing
similarly we obtain

Jk ≥ T Jk ≥ Jk+1 , ∀ k ≥ 0. (4.20)

Moreover, we can show by induction that Jk ≥ J * . Indeed this is true for


k = 0 by assumption. If Jk ≥ J * , we have
m
Jk+1 = Tµkk Jk ≥ T mk Jk ≥ T mk J * = J * , (4.21)

† As with all PI algorithms in this book, we assume that the policy im-
provement operation is well-defined, in the sense that there exists µk such that
Tµk Jk = T Jk for all k.
Sec. 4.3 Infinite Horizon Problems 159

where the last equality follows from the fact T J * = J * (cf. Prop. 4.3.6),
thus completing the induction. Thus, by combining the preceding two
relations, we have

Jk ≥ T Jk ≥ Jk+1 ≥ J * , ∀ k ≥ 0. (4.22)

We will now show by induction that

T k J0 ≥ Jk ≥ J * , ∀ k ≥ 0. (4.23)

Indeed this relation holds by assumption for k = 0, and assuming that it


holds for some k ≥ 0, we have by applying T to it and by using Eq. (4.22),

T k+1 J0 ≥ T Jk ≥ Jk+1 ≥ J * ,

thus completing the induction. By applying Prop. 4.3.13 to Eq. (4.23), we


obtain Jk ↓ J ∗ . Q.E.D.

For an illustration of the convergence of optimistic PI in an example


where nonoptimistic PI oscillates between nonoptimal policies, the reader
may trace the generated sequence {Jk } in the example of Fig. 3.1.7, and
verify that Jk ↓ J ∗ . Note that the preceding proposition makes no asser-
tion about obtaining an optimal policy or about convergence of Jµk to
J * . Indeed in the example of Fig. 3.1.7 the two available policies are both
nonoptimal, so Jµk does not converge to J * (in this example, J * is ap-
proached by the cost functions of nonstationary policies). Note also that
we have Jk → J * , even if J * (x) = −∞ for some x; for an example, see
the blackmailer’s problem of Exercise 4.6. The reason why optimistic PI
can deal with the absence of an optimal policy is that it acts as a form
of VI, which is convergent to J * under Assumption D (cf. Prop. 4.3.13).
However, optimistic PI tends to be more computationally efficient than VI,
as experience has shown, so it is usually preferable in practice.

λ-Policy Iteration Under D

We now consider an alternative optimistic PI algorithm, which for undis-


counted finite-state MDP, can be implemented by using matrix inversion,
just like nonoptimistic PI for discounted finite-state MDP. This is the λ-PI
algorithm, which is defined by
(λ)
Tµk Jk = T Jk , Jk+1 = Tµk Jk , (4.24)

(λ)
where for any policy µ and scalar λ ∈ (0, 1), Tµ is the mapping defined
by

(λ)
!
Tµ J = (1 − λ) λ" Tµ"+1 J.
"=0
160 Noncontractive Models Chap. 4

Here we assume that Tµ maps R(X) to R(X), and that for all µ ∈ M
and J ∈ R(X), the limit of the series above is well-defined as a function in
R(X).
We also assume a linearity property for Tµ , whereby we have

(λ) (λ)
Tµ (Tµ J ) = Tµ (Tµ J ), ∀ µ ∈ M, J ∈ R(X). (4.25)

This assumption is commonly satisfied in DP problems where Tµ is linear,


such as the stochastic optimal control problem of Example 1.2.1.
To compare the λ-PI method (4.24) with the optimistic PI method
(λ) m
(4.19), we note that both mappings Tµk and Tµkk appearing in Eqs. (4.24)
and (4.19), respectively, involve multiple applications of the VI mapping
Tµk : a fixed number mk in the latter case, and a geometrically weighted
number in the former case. Thus λ-PI and optimistic PI are similar: they
just use the mapping Tµk to apply VI in different ways.
Since λ-PI is a form of optimistic PI, it is not surprising that it has
the same type of convergence properties as the earlier optimistic PI method
(4.19). Similar to Prop. 4.3.15, we have the following.

Proposition 4.3.16: Let Assumption D hold and let {Jk , µk } be a


sequence generated by the λ-PI algorithm (4.24), assuming Eq. (4.25),
and that J¯ ≥ J0 ≥ J * and J0 ≥ T J0 . Then Jk ↓ J ∗ .

Proof: As in the proof of Prop. 4.3.15, by using Assumption D, the mono-


tonicity of Tµ , and the hypothesis J0 ≥ T J0 , we have

(λ) (λ)
J0 ≥ T J0 = Tµ0 J0 ≥ Tµ0 J0 = J1 ≥ Tµ0 J1 ≥ T J1 = Tµ1 J1 ≥ Tµ1 J0 = J2 ,

where for the third inequality, we use the relation J0 ≥ Tµ0 J0 , the definition
of J1 , and the assumption (4.25). Continuing in the same manner,

Jk ≥ T Jk ≥ Jk+1 , ∀ k ≥ 0.

Similar to the proof of Prop. 4.3.15, we show by induction that Jk ≥ J * ,


using the fact that if Jk ≥ J * , then

(λ) (λ)
!
Jk+1 = Tµk Jk ≥ Tµk J * = (1 − λ) λ" T "+1 J * = J * ,
"=0

[cf. the induction step of Eq. (4.21)]. By combining the preceding two
relations, we obtain Eq. (4.22), and the proof is completed by using the
argument following that equation. Q.E.D.
Sec. 4.3 Infinite Horizon Problems 161

The λ-PI algorithm has a useful property, which involves the mapping
Wk : R(X) !→ R(X) given by

Wk J = (1 − λ)Tµk Jk + λ Tµk J. (4.26)

In particular Jk+1 is a fixed point of Wk . Indeed, using the definition


(λ)
Jk+1 = Tµk Jk [cf. Eq. (4.24)], and the linearity assumption (4.25), we
have
! "
(λ)
Wk Jk+1 = (1 − λ)Tµk Jk + λTµk Tµk Jk
(λ)
= (1 − λ)Tµk Jk + λ Tµk (Tµk Jk )
(λ)
= Tµk Jk
= Jk+1 .

Thus Jk+1 can be calculated as a fixed point of Wk .


Consider now the case where Tµk is nonexpansive with respect to
some norm. Then from Eq. (4.26), it is seen that Wk is a contraction of
modulus λ with respect to that norm, so Jk+1 is the unique fixed point
of Wk . Moreover, if the norm is a weighted sup-norm, Jk+1 can be found
using the methods of Chapter 2 for contractive models. The following
example applies this idea to finite-state SSP, where contrary to the models
of Section 3.2, we do not assume that an optimal proper policy exists.

Example 4.3.4 (Stochastic Shortest Path Problems with


Nonpositive Costs)

Consider the SSP problem of Example 1.2.6 with states 1, . . . , n, plus the
termination state 0. For all u ∈ U (x), the state following x is y with prob-
ability pxy (u) and the expected cost incurred is nonpositive. This problem
arises when we wish to maximize nonnegative rewards up to termination. It
includes the classical search problem of Section 3.2.2, where the aim, roughly
speaking, is to move through the state space looking for states with favor-
able termination rewards. We have noted earlier the difficulties regarding PI
for this type of problem. In particular, case (a) of the three-node shortest
path problem of Section 3.1.2 provided an example where nonoptimistic PI
oscillates between an optimal and a nonoptimal policy (see also Exercise 4.8).
The PI method with perturbations of Section 3.3.3 applies only under the
conditions of that section. Here we will consider instead the λ-PI method,
which applies under the different conditions of Assumption D.
We view the problem within our abstract framework with J¯(x) ≡ 0 and

Tµ J = gµ + Pµ J, (4.27)

with gµ ∈ #n being the corresponding nonpositive one-stage cost vector, and


Pµ being an n × # n substochastic
$ matrix. The components of Pµ are the
probabilities pxy µ(x) , x, y = 1, . . . , n. Clearly Assumption D holds.
162 Noncontractive Models Chap. 4

Consider the λ-PI method (4.24), with Jk+1 computed by solving the
fixed point equation J = Wk J, cf. Eq. (4.26). This is a nonsingular n-
dimensional system of linear equations, and can be solved by matrix inversion,
just like in exact PI for discounted n-state MDP. In particular, using Eqs.
(4.26) and (4.27), we have

Jk+1 = (I − λPµk )−1 gµk + (1 − λ)Pµk Jk ,


! "
(4.28)

For a small number of states n, this matrix inversion-based policy evaluation


may be simpler than the optimistic PI policy evaluation equation
m
Jk+1 = Tµkk Jk

[cf. Eq. (4.19)], which points to an advantage of λ-PI.


Note that as λ → 1, the policy evaluation Eq. (4.28) resembles the
policy evaluation equation

Jµk = (I − λPµk )−1 gµk

for λ-discounted n-state MDP. An important difference, however, is that for


a discounted finite-state MDP, exact PI will find an optimal policy in a finite
number of iterations, while this is not guaranteed for λ-PI. Indeed λ-PI does
not require that there exists an optimal policy or even that J ∗ (x) is finite
for all x. Note also that the λ-PI method applies even to SSP problems with
nonpositive costs where there is an optimal improper policy and no optimal
proper policy. This is true for the optimistic PI method (4.19) as well.

Policy Iteration Under I

Contrary to the case of Assumption D, the important cost improvement


property of PI holds under Assumption I. Thus, if µ is a policy and µ̄
satisfies the policy improvement equation Tµ̄ Jµ = T Jµ , we have

Jµ = Tµ Jµ ≥ T Jµ = Tµ̄ Jµ ,

from which we obtain Jµ ≥ limk→∞ Tµ̄k Jµ . Since Jµ ≥ J¯ and Jµ̄ =


limk→∞ Tµ̄k J¯, it follows that

Jµ ≥ T Jµ ≥ Jµ̄ . (4.29)

However, this cost improvement property is not sufficient by itself to


guarantee the validity of PI under Assumption I. For example, it is not
clear that strict inequality holds in Eq. (4.29) for at least one state x ∈ X
when µ is not optimal. As Exercise 4.3 shows, there may exist a nonoptimal
policy µ such that Tµ Jµ = T Jµ , and PI will stop when started with such a
policy. Thus additional conditions are needed to obtain solid convergence
guarantees. The next section provides a semicontractive framework for
this purpose, which bears a connection with the corresponding framework
of Chapter 3.
Sec. 4.4 Semicontractive-Monotone Increasing Models 163

4.4 SEMICONTRACTIVE-MONOTONE INCREASING MODELS

In this section we combine Assumption I with the semicontractive abstract


DP framework involving S-regular policies, as per Chapter 3. To this end,
we introduce a special set of functions S that satisfies

S ⊂ J ∈ E(X) | J ≥ J¯ , J¯ ∈ S.
! "
(4.30)

The further specification of S will be left open for the moment, but we will
see that its choice is important for the application of the following analysis
in specific contexts. We will use the notion of S-regularity as defined in
Section 3.1.1 (cf. Def. 3.1.1), i.e., µ is called S-regular if Jµ is the unique
fixed point of Tµ within S, and

Tµk J → Jµ , ∀ J ∈ S. (4.31)

Our analysis bears similarity to the one of semicontractive models in


Chapter 3, but also takes advantage of Assumption I, thereby bringing to
bear the results of Section 4.3. For example, we typically assume that there
exists an S-regular optimal policy, but this is not a severe restriction, since
under Assumption I, optimal stationary policies usually exist (any µ such
that Tµ J * = T J * is an optimal stationary policy, cf. Prop. 4.3.9), and the
verification that these policies are S-regular is often simple.
4.4.1 Value and Policy Iteration Algorithms

The following proposition addresses the issue of uniqueness of solution of


the fixed point equation J = T J , and the convergence of the VI algorithm.

Proposition 4.4.1: (Fixed Point Properties and Convergence


of Value Iteration) Let Assumption I hold, and assume that there
exists an optimal S-regular policy, where S satisfies Eq. (4.30). Then:
(a) The optimal cost function J * is the unique fixed point of T within
S.
(b) We have T k J → J * for every J ∈ S with J ≥ J * .
(c) If limk→∞ T k J¯ = J * (as it is true under the assumptions of Prop.
4.3.14), then we have T k J → J * for every J ∈ S.

Proof: (a), (b) In view of the existence of an optimal S-regular policy and
the fixed point property J * = T J * (cf. Prop. 4.3.3), Prop. 3.1.2 applies,
and part (b) follows. Also by Prop. 3.1.2, J * is the unique fixed point of
T in the set {J | J ≥ J * }, while by Prop. 4.3.3, there are no fixed points
of T that belong to S and are outside this set. This proves part (a).
164 Noncontractive Models Chap. 4

(c) Let µ∗ be an optimal S-regular policy. For any J ∈ S, we have

Tµk∗ J ≥ T k J ≥ T k J¯,

where the second inequality holds since J ≥ J¯ for all J ∈ S. Taking the
limit in the preceding relation, and using the fact limk→∞ Tµk∗ J = Jµ∗ = J *
(which holds by the regularity and optimality of µ∗ ), and limk→∞ T k J¯ = J *
(which holds by assumption), we obtain limk→∞ T k J = J * . Q.E.D.

Policy Iteration

We now consider a PI algorithm that starts with a policy µ0 and generates


a sequence {µk } of policies according to

Tµk+1 Jµk = T Jµk . (4.32)

Thanks to the cost improvement property [cf. Eq. (4.29)], the sequence
{µk } (regardless of whether it consists of S-regular policies) satisfies

Jµk ≥ T Jµk ≥ Jµk+1 , k = 0, 1, . . . , (4.33)

and it follows that Jµk ↓ J∞ for some J∞ ≥ J * . The following proposition


gives conditions for J∞ = J * .

Proposition 4.4.2: (Convergence of PI) Let Assumption I hold,


and assume that:
(1) There exists an optimal S-regular policy, where S satisfies Eq.
(4.30).
(2) For each sequence {Jm } ⊂ S with Jm ↓ J we have

lim H(x, u, Jm ) = H (x, u, J ) , ∀ x ∈ X, u ∈ U (x). (4.34)


m→∞

(3) For some initial µ0 ∈ M, the monotonically nonincreasing se-


quence {Jµk } generated by the PI algorithm (4.32) belongs to S
and its limit J∞ also belongs to S.
Then J∞ = J * .

Proof: From Eq. (4.33), we have Jµk ≥ T Jµk for all k, and by taking the
limit as k → ∞,
J∞ ≥ lim T Jµk ≥ T J∞ , (4.35)
k→∞
Sec. 4.4 Semicontractive-Monotone Increasing Models 165

where the second inequality follows from the fact Jµk ≥ J∞ . We also have
for all x ∈ X and u ∈ U (x),

H(x, u, J∞ ) = lim H(x, u, Jµk ) ≥ lim (T Jµk )(x) = J∞ (x),


k→∞ k→∞

where the first equality follows from Eq. (4.34) and the second equality
follows from Eq. (4.33). By taking the infimum of the left-hand side over
u ∈ U (x), we obtain T J∞ ≥ J∞ , which combined with Eq. (4.35), yields
J∞ = T J∞ . Since there exists an optimal S-regular policy and J∞ ∈ S, we
have J∞ = J * in view of the uniqueness property shown in Prop. 4.4.1(a).
Q.E.D.

The condition (3) regarding PI convergence in the preceding propo-


sition automatically holds if we can choose

S = J ∈ R(X) | J ≥ J¯ S = J ∈ B(X) | J ≥ J¯
! " ! "
or

without affecting other assumptions, where R(X) [or B(X)] are the sets of
functions on X that are real-valued (or bounded with respect to a weighted
sup-norm, respectively). Note that this condition requires that the PI algo-
rithm works as described for some initial policy µ0 (not all initial policies).
This provides flexibility in selecting a suitable µ0 .

Optimistic Policy Iteration

We have discussed so far the VI algorithm in Prop. 4.4.1(b), and the PI


algorithm in Prop. 4.4.2. There is a variant of Prop. 4.4.2, which relates to
optimistic PI, where policies are evaluated inexactly, with a finite number
of VIs. In particular, this algorithm starts with J0 ∈ S such that J0 ≥ T J0 ,
and generates {Jk , µk } according to

m
Tµk Jk = T Jk , Jk+1 = Tµkk Jk , k = 0, 1, . . . , (4.36)

where mk is a positive integer for each k. Then we have

J0 ≥ Tµ0 J0 ≥ Tµm00 J0 = J1 ≥ Tµm00 +1 J0 = Tµ0 J1 ≥ T J1 = Tµ1 J1 ≥ · · · ≥ J2 ,

and continuing similarly, we have

Jk ≥ T Jk ≥ Jk+1 , k = 0, 1, . . . . (4.37)

Thus {Jk } is monotonically nonincreasing and converges to a limit J∞ .


The following proposition gives conditions for J∞ = J ∗ .
166 Noncontractive Models Chap. 4

Proposition 4.4.3: (Convergence of Optimistic PI) Let As-


sumption I hold, and assume that:
(1) There exists an optimal S-regular policy, where S satisfies Eq.
(4.30).
(2) For each sequence {Jm } ⊂ S with Jm ↓ J we have

lim H(x, u, Jm ) = H (x, u, J ) , ∀ x ∈ X, u ∈ U (x).


m→∞

(3) For some initial J0 ∈ S such that J0 ≥ T J0 , the monotonically


nonincreasing sequence {Jµk } generated by the optimistic PI al-
gorithm (4.36) belongs to S and its limit J∞ also belongs to S.
Then J∞ = J * .

Proof: We have shown that Jk ≥ T Jk for all k ≥ 0 [cf. Eq. (4.37)]. The
result follows by taking the limit in this equation to obtain
J∞ ≥ lim T Jk ≥ T J∞ ,
k→∞

[cf. Eq. (4.35)], and then by repeating the corresponding part of the proof
of Prop. 4.4.2. Q.E.D.

Note that a similar result may be shown when the λ-PI method is
used in place of the optimistic PI method (4.36); cf. Prop. 4.3.16.

4.4.2 Some Applications

The key to applying Props. 4.4.1 and 4.4.2 is to choose the set S in a
way that there exists an optimal S-regular policy, and the other required
assumptions are satisfied. We describe a few applications.

Example 4.4.1 (Stochastic Shortest Path Problems with


Countable State Space and Nonnegative Costs)

A context where the preceding analysis applies is SSP problems with count-
able state space, and nonnegative cost per stage, where with the choice
¯
J(x) ≡ 0, the monotone increase condition
¯
J(x) ¯
≤ H(x, u, J), ∀ x ∈ X, u ∈ U (x),

is satisfied. Then to obtain strong results it is not necessary that all S-


irregular or δ-S-irregular policies have infinite cost from some initial state, as
in Section 3.2.1. Instead we may choose

S = J ∈ R(X) | J ≥ J¯ S = J ∈ B(X) | J ≥ J¯ .
! " ! "
or
Sec. 4.4 Semicontractive-Monotone Increasing Models 167

where B(X) is the space of bounded functions with respect to a weighted


sup-norm, and apply Props. 4.4.1 and 4.4.2, assuming the conditions of these
propositions are satisfied. Then:
(a) J ∗ is the unique fixed point of T within S [cf. Prop. 4.4.1(a)].
(b) The VI algorithm converges to J ∗ if started with any J ≥ J ∗ [cf. Prop.
4.4.1(b)].
(c) If the control space is finite, the VI algorithm converges to J ∗ if started
with any J ≥ 0 [cf. Prop. 4.4.1(c), which applies because T k J¯ → J ∗ in
view of the finiteness of the control space, as shown in Prop. 4.3.14].
(d) The PI algorithm, started with any policy µ0 , generates a cost function
sequence that converges to J ∗ (cf. Prop. 4.4.2).
Except for (d), these results are well-known for the case of finite-state prob-
lems; see [Ber87] (Section 6.4) and [BeT91] (Prop. 3).

Here is another example involving a discounted stochastic optimal


control problem where Tµ need not be a contraction for all µ.

Example 4.4.2 (Discounted Semicontractive Optimal Control


Problems)

Consider the stochastic optimal control problem of Example 1.2.1, where


! " #$
H(x, u, J) = E g(x, u, w) + αJ f (x, u, w) , (4.38)

¯
α ∈ (0, 1), and J(x) ≡ 0. Let

S = J ∈ B(X) | J ≥ J¯ .
! $

Note that X need not be a finite set, and that the functions in S need not have
finite unweighted sup-norm. Assume that there is a set of policies M ⊂ M
such that:
! $
(a) The function E g(x, µ(x), w) belongs to S for all µ ∈ M.
! " #$
(b) The function E J f (x, µ(x), w) belongs to S for all µ ∈ M and
J ∈ S.
(c) For each J ∈ S there exists µ ∈ M such that Tµ J = T J.
(d) There exists an optimal policy within M.
It can be seen that S is a closed subset of B(X), and that Tµ maps S
into S and is a contraction with modulus α within S for all µ ∈ M. † This

† To see this note that by Eq. (4.38), we have for all J, J " ∈ B(X) and x ∈ X,
% %
(Tµ J)(x) − (Tµ J " )(x) %J(y) − J " (y)%
≤ α sup = α(J − J " (,
v(x) y∈X v(y)
168 Noncontractive Models Chap. 4

implies that all policies in M are S-regular, since Jµ is the unique fixed point
of Tµ within S, and Tµk J converges to the fixed point Jµ for all J ∈ S.
In view of property (c) above, there exists µ ∈ M such that Tµ J ∗ =

T J , which by Prop. 4.3.9, is optimal. Thus there exists an optimal S-regular
policy, and it may be verified that Prop. 4.4.1 applies. It follows from Prop.
4.4.1(a) that J ∗ is the unique fixed point of T within S. By Prop. 4.4.1(b),
J ∗ can be obtained in the limit by the VI algorithm starting with J ∈ S such
that J ≥ J ∗ [also starting with J ∈ S such that J ≥ J̄, if we can show that
T k J¯ → J ∗ (e.g., if the control space is finite); cf. Prop. 4.4.1(c)]. By Prop.
4.4.2, J ∗ can be obtained in the limit by the PI algorithm, which starts from
a policy µ0 ∈ M and generates exclusively policies in M [this is possible in
view of property (c) above], provided condition (2) of the proposition can be
verified.

4.4.3 Linear-Quadratic Problems

In this subsection, we consider a classical problem from control theory,


where an effective PI algorithm can be developed. Here introducing con-
traction mappings is difficult because the optimal cost function is quadratic
and unbounded above. However, it is possible to use a semicontractive
framework with an appropriate set S, such that S-regular policies can be
related to policies that are stable in traditional control theoretic terms.
Consider the linear system

xk+1 = Axk + Buk + wk , k = 0, 1, . . . ,

where xk ∈ "n , uk ∈ "m for all k, and A and B are given matrices.
We assume that the random disturbances wk are independent identically
distributed with zero mean and finite second moment. The cost function
of a policy π = {µ0 , µ1 , . . .} has the form
!N −1 %
" # $
Jπ (x0 ) = lim E αk x$k Qxk + µk (xk )$ Rµk (xk ) ,
N →∞
k=0

where x$ denotes the transpose of a column vector x, α ∈ (0, 1), Q is a


positive semidefinite symmetric n × n matrix, and R is a positive definite

where v is the weight function of the sup-norm of B(X). Interchanging J and


J " , we obtain
& & & &
&(Tµ J)(x) − (Tµ J " )(x)& &J(y) − J " (y)&
≤ α sup = α&J − J " &,
v(x) y∈X v(y)

and by taking supremum over x ∈ X, we see that Tµ is a contraction with mo-


dulus α.
Sec. 4.4 Semicontractive-Monotone Increasing Models 169

symmetric m × m matrix. This is a special case of the stochastic optimal


control problem of Example 1.2.1.
To convert the problem to the abstract format of this section, we let

X = "n , U (x) = "m , ∀ x ∈ X,


! "
H(x, u, J ) = x! Qx + u! Ru + αE J (Ax + Bu + w) ,
J¯(x) = 0, ∀ x ∈ X.
Clearly Assumption I is satisfied.
The theory of this problem is well-known and is discussed in various
forms in many sources, including [Ber12a] (Section 4.2) whose notation
and presentation we follow. In summary, the optimal cost function J * is
quadratic of the form
J * (x) = x! Kx + c,
where K is a positive definite symmetric n×n matrix and c is a nonnegative
constant. The matrix K is the unique solution of the equation
\[
K = A'\bigl( \alpha K - \alpha^2 K B (\alpha B'KB + R)^{-1} B'K \bigr) A + Q, \tag{4.39}
\]
within the class of nonnegative definite matrices, and
\[
c = \frac{\alpha}{1-\alpha}\, E\{ w'Kw \}.
\]
The optimal policy is linear of the form

\[
\mu^*(x) = -\alpha(\alpha B'KB + R)^{-1} B'KA\, x, \qquad x \in \Re^n.
\]

The results just stated require that the pairs (A, B) and (A, C), where
Q = C ! C, are controllable and observable, respectively (see [Ber12a] for a
definition of these terms and corresponding proofs).
We now discuss how the problem can be solved by PI based on the the-
ory of this section. We define S to be the set of positive definite quadratic
functions plus a nonnegative constant,
\[
S = \bigl\{ J \mid J(x) = x'Wx + \gamma,\ W \text{ positive definite symmetric},\ \gamma \ge 0 \bigr\}.
\]
It is easily seen that the continuity condition (4.34) of Prop. 4.4.2 is satis-
fied.
Consider the set M̂ ⊂ M of all linear controllers µ that stabilize the
system, i.e.,
\[
\mu(x) = Lx, \qquad \forall\ x \in \Re^n,
\]
where L is an m × n matrix such that the matrix

A + BL

has eigenvalues strictly within the unit circle. We call such a controller
linear-stable. It is well-known that under the controllability and observ-
ability assumption noted earlier, there exists a linear-stable controller that
is optimal (see e.g., [Ber12a]). Moreover, from the definition of regularity
(4.31), it can be seen that a linear-stable controller µ is S-regular.
Consider now the PI algorithm of this section, starting with a linear-
stable controller µ0 (x) = L0 x. Then the cost corresponding to µ0 has the
form
\[
J_{\mu^0}(x) = x'K_0 x + \text{(nonnegative constant)},
\]

where K0 is a positive definite symmetric matrix satisfying the (linear)
equation
\[
K_0 = \alpha(A + BL_0)'K_0(A + BL_0) + Q + L_0'RL_0.
\]

Let µ1 (x) attain the minimum for each x in the expression

\[
\min_{u}\, \bigl\{ u'Ru + \alpha(Ax + Bu)'K_0(Ax + Bu) \bigr\}.
\]

Then it can be seen that

\[
J_{\mu^1}(x) = x'K_1 x + \text{constant} \le J_{\mu^0}(x), \qquad \forall\ x \in X,
\]

where K1 is some positive semidefinite symmetric matrix, and we have


Jµ1 ∈ S (see e.g., [Ber12a]). Continuing similarly, the PI algorithm yields a
sequence of linear-stable controllers and corresponding sequence of matrices
{Kk } such that
Kk → K∞ ,

where K∞ is a positive definite symmetric matrix. The corresponding


sequence {Jµk } as well as its limit belong to S. By Prop. 4.4.2, K∞ is
equal to K, the optimal cost matrix of the problem. Thus the problem can
be solved by PI, which generates a sequence of linear-stable controllers of
the form
µk (x) = Lk x.

The cost function of each of these controllers is quadratic with positive


semidefinite matrices Kk that can be obtained by solving the linear matrix
equation
\[
K_k = \alpha(A + BL_k)'K_k(A + BL_k) + Q + L_k'RL_k,
\]

as can be verified from Eq. (4.39). This equation can be solved with efficient
specialized methods, for which we refer to the literature.
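As an illustration of the PI iteration just described, the following Python sketch (using NumPy, with A, B, Q, R, α, and the initial stabilizing gain L chosen arbitrarily for the example; none of these numbers come from the text) evaluates each linear controller by solving the linear equation above via vectorization, and then improves it with the minimization formula given earlier.

    import numpy as np

    def evaluate_policy(L, A, B, Q, R, alpha):
        # Policy evaluation: solve K = alpha (A+BL)' K (A+BL) + Q + L'RL for K.
        M = A + B @ L
        n = A.shape[0]
        lhs = np.eye(n * n) - alpha * np.kron(M.T, M.T)
        rhs = (Q + L.T @ R @ L).reshape(-1)
        return np.linalg.solve(lhs, rhs).reshape(n, n)

    def improve_policy(K, A, B, R, alpha):
        # Policy improvement: L = -alpha (alpha B'KB + R)^{-1} B'KA.
        return -alpha * np.linalg.solve(alpha * B.T @ K @ B + R, B.T @ K @ A)

    # Illustrative (assumed) data, with a linear-stable initial controller L.
    A = np.array([[1.0, 0.2], [0.0, 1.0]])
    B = np.array([[0.0], [1.0]])
    Q = np.eye(2)
    R = np.array([[1.0]])
    alpha = 0.95
    L = np.array([[-0.5, -1.0]])

    for k in range(30):
        K = evaluate_policy(L, A, B, Q, R, alpha)
        L = improve_policy(K, A, B, R, alpha)
    print(K)   # approximates the optimal cost matrix K of Eq. (4.39)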

4.5 AFFINE MONOTONIC MODELS

In this section, we consider the case

Tµ J = Aµ J + bµ , (4.40)

where for each µ, bµ is a given function in R+ (X), the set of all nonnegative
real-valued functions on X, and Aµ : E+(X) ↦ E+(X) is a given mapping,
where E + (X) is the set of all nonnegative extended real-valued functions
on X. We assume that Aµ has the “linearity” property

Aµ (J1 + J2 ) = Aµ J1 + Aµ J2 , ∀ J1 , J2 ∈ E + (X). (4.41)

Thus if J, J′ ∈ E+(X) with J′ ≥ J, we have Aµ(J′ − J) ≥ 0 [since Aµ
maps E+(X) to E+(X)] and hence [using also Eq. (4.41)] AµJ′ = AµJ +
Aµ(J′ − J) ≥ AµJ, so that Aµ and Tµ are monotone in the sense that
\[
J, J' \in E^+(X),\ J \le J' \quad \Rightarrow \quad A_\mu J \le A_\mu J',\ \ T_\mu J \le T_\mu J'.
\]

(In the preceding equations we use our convention ∞ + ∞ = ∞ − ∞ =


r + ∞ = ∞ + r = ∞ for any real number r; see Appendix A.) We refer to
this model, with a function J¯ ∈ R+ (X), as an affine monotonic model.
An example of this model is when X is a countable set, Aµ is defined
by the transition probabilities corresponding to µ, and J¯(x) ≡ 0. Then we
obtain the countable-state case of the negative DP model of [Str66], which
is fully covered by the theory of Section 4.3, under Assumption I.
Another special case is the multiplicative model of Example 1.2.8,
where X and U are finite sets, J¯ is the unit function (J¯ = e), and for
transition probabilities pxy (u) and function g(x, u, y) ≥ 0, we have
\[
H(x, u, J) = \sum_{y\in X} p_{xy}(u)\, g(x, u, y)\, J(y). \tag{4.42}
\]

Thus with bµ = 0 and the matrix Aµ having components


" # " #
Aµ (x, y) = pxy µ(x) g x, µ(x), y ,

we obtain an affine monotonic model.


In a variant of the multiplicative model that involves a cost-free and
absorbing termination state 0, similar to SSP problems, H may contain a
“constant” term, i.e., have the form
\[
H(x, u, J) = p_{x0}(u)\, g(x, u, 0) + \sum_{y\in X} p_{xy}(u)\, g(x, u, y)\, J(y), \tag{4.43}
\]
in which case bµ(x) = px0(µ(x)) g(x, µ(x), 0). A special case of this model
is the risk-sensitive SSP problem with exponential cost function, which will
be discussed later in Section 4.5.3.

In the next two subsections we will consider two alternative lines of


semicontractive model analysis. The first assumes the monotone increase
condition T J¯ ≥ J¯, and relies on Assumption I of this chapter. The second
line of analysis follows the approach of Section 3.2.1 (irregular policies
have infinite cost for some x ∈ X), based on Assumption 3.2.1 with an
appropriate choice of a subset S of real-valued functions. Analyses based
on the monotone decrease condition T J¯ ≤ J¯, and on the perturbation-
based approach of Section 3.2.2 are also possible, but will not be pursued
in detail. Of course the strong results of Chapter 2 may also apply when
there is a weighted sup-norm for which Aµ is a contraction for all µ over
B(X), and with the same modulus.

4.5.1 Increasing Affine Monotonic Models

In this subsection we assume that the condition T J¯ ≥ J¯ holds and that


the remaining two conditions of Assumption I are satisfied. Then the affine
monotonic model admits a straightforward analysis with a choice
\[
S \subset \bigl\{ J \in E^+(X) \mid J \ge \bar J \bigr\}, \tag{4.44}
\]
based on the theory of Section 4.4.1 and the parts of Section 4.3 that relate
to the monotone increase Assumption I. In particular, we have the following
proposition.

Proposition 4.5.1: Consider the affine monotonic model, assuming


that T J¯ ≥ J¯ and that the remaining conditions of Assumption I hold.
Assume that there exists an optimal S-regular policy, where S satisfies
Eq. (4.44). Then:
(a) The optimal cost function J * is the unique fixed point of T within
S.
(b) A policy µ is optimal if and only if Tµ J * = T J * .
(c) Under the compactness assumptions of Prop. 4.3.14, we have
T k J → J * for every J ∈ S.

Proof: (a) Follows from Prop. 4.4.1(a).


(b) Follows from Prop. 4.3.9.
(c) Follows from Prop. 4.4.1(c). Q.E.D.

4.5.2 Nonincreasing Affine Monotonic Models

We now consider the affine monotonic model without assuming the mono-
tone increase condition T J¯ ≥ J¯. We will use the approach of Section 3.2.1,

assuming that J¯ ∈ S and that S is equal to one of the three choices


\[
S = R^+(X) = \bigl\{ J \mid 0 \le J(x) < \infty,\ \forall\ x \in X \bigr\},
\]
\[
S = R_p^+(X) = \bigl\{ J \mid 0 < J(x) < \infty,\ \forall\ x \in X \bigr\},
\]
\[
S = R_b^+(X) = \Bigl\{ J \,\Bigm|\, 0 < \inf_{x\in X} J(x) \le \sup_{x\in X} J(x) < \infty \Bigr\}.
\]

Note that if X is finite, we have Rp+(X) = Rb+(X).


We first derive an expression for the cost function of a policy and
obtain conditions for S-regularity. Using the form of Tµ and the “linearity”
condition (4.41), we have
\[
T_\mu^k J = A_\mu^k J + \sum_{m=0}^{k-1} A_\mu^m b_\mu, \qquad \forall\ J \in S,\ k = 1, 2, \ldots.
\]

By definition, µ is S-regular if Jµ ∈ S, and limk→∞ Tµk J = Jµ for all J ∈ S,


or equivalently if for all J ∈ S we have
\[
\limsup_{k\to\infty} A_\mu^k J + \sum_{m=0}^{\infty} A_\mu^m b_\mu \;=\; \limsup_{k\to\infty} A_\mu^k \bar J + \sum_{m=0}^{\infty} A_\mu^m b_\mu \;\in\; S.
\]

Letting J = 2J¯ and using the fact Akµ (2J¯) = 2Akµ J¯ [cf. Eq. (4.41)], we see
that Akµ J¯ → 0. It follows that µ is S-regular if and only if

\[
\lim_{k\to\infty} A_\mu^k J = 0, \quad \forall\ J \in S, \qquad \text{and} \qquad \sum_{m=0}^{\infty} A_\mu^m b_\mu \in S. \tag{4.45}
\]
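When X is finite, condition (4.45) is easy to test numerically: the first part holds exactly when the spectral radius of the nonnegative matrix Aµ is less than 1, and in that case Σ_{m=0}^∞ Aµ^m bµ = (I − Aµ)^{-1} bµ is finite. A minimal Python sketch follows; the matrix Aµ and vector bµ are made up for illustration.

    import numpy as np

    def check_regularity(A_mu, b_mu, tol=1e-12):
        # For finite X: A_mu^k J -> 0 for every finite nonnegative J iff the
        # spectral radius of A_mu is < 1; then J_mu = sum_m A_mu^m b_mu solves
        # J = A_mu J + b_mu and is finite.
        rho = max(abs(np.linalg.eigvals(A_mu)))
        if rho >= 1.0 - tol:
            return False, None
        J_mu = np.linalg.solve(np.eye(len(b_mu)) - A_mu, b_mu)
        return True, J_mu

    A_mu = np.array([[0.3, 0.4],
                     [0.2, 0.5]])
    b_mu = np.array([1.0, 2.0])
    print(check_regularity(A_mu, b_mu))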

We will now consider conditions for Assumption 3.2.1 to hold, so


that the results of Prop. 3.2.1 will follow. For the choices S = R+ (X) and
S = Rb+ (X), parts (a), (b), and (f) of this assumption are automatically
satisfied [a proof, to be given later, will be required for part (f) and the
case S = Rb+ (X)]. For the choice S = Rp+ (X), part (a) of this assumption
is automatically satisfied, while part (b),

\[
\inf_{\mu:\ R_p^+(X)\text{-regular}} J_\mu \;\in\; R_p^+(X),
\]

and part (f) will be assumed in the proposition that follows. The com-
pactness condition of Assumption 3.2.1(d) and the technical condition of
Assumption 3.2.1(e) are needed, and they will be assumed.
The critical part of Assumption 3.2.1 is (c), which requires that for
each S-irregular policy µ and each J ∈ S, there is at least one state x ∈ X
such that

\[
\limsup_{k\to\infty}\, (T_\mu^k J)(x) \;=\; \limsup_{k\to\infty}\, (A_\mu^k J)(x) + \sum_{m=0}^{\infty} (A_\mu^m b_\mu)(x) \;=\; \infty.
\]

This part is satisfied if and only if for each S-irregular µ and J ∈ S, there
is at least one x ∈ X such that
\[
\limsup_{k\to\infty}\, (A_\mu^k J)(x) = \infty \qquad \text{or} \qquad \sum_{m=0}^{\infty} (A_\mu^m b_\mu)(x) = \infty. \tag{4.46}
\]

Note that this cannot be true if S = R+(X) and bµ = 0 [as in the multi-
plicative cost case of Eq. (4.42)], because for J = 0, the preceding condition
is violated. On the other hand, if S = Rp+(X) or S = Rb+(X), the condition
(4.46) is satisfied even if bµ = 0, provided that for each S-irregular µ and
J ∈ S, there is at least one x ∈ X with
\[
\limsup_{k\to\infty}\, (A_\mu^k J)(x) = \infty.
\]

We have the following proposition.

Proposition 4.5.2: Consider the affine monotonic model and let S =


R+ (X) or S = Rp+ (X) or S = Rb+ (X). Assume that the following hold:
(1) There exists an S-regular policy.
(2) If µ is an S-irregular policy, then for each function J ∈ S, Eq.
(4.46) holds for at least one x ∈ X.
(3) The function Ĵ given by
\[
\hat J(x) = \inf_{\mu:\ S\text{-regular}} J_\mu(x), \qquad x \in X,
\]
belongs to S.
(4) The control set U is a metric space, and the set
\[
\bigl\{ u \in U(x) \mid H(x, u, J) \le \lambda \bigr\}
\]
is compact for every J ∈ S, x ∈ X, and λ ∈ ℜ.
(5) For each sequence {Jm} ⊂ S with Jm ↑ J for some J ∈ S we
have
\[
\lim_{m\to\infty} (A_\mu J_m)(x) = (A_\mu J)(x), \qquad \forall\ x \in X,\ \mu \in M.
\]
(6) In the case where S = Rp+(X), for each function J ∈ S, there
exists a function J′ ∈ S such that J′ ≤ J and J′ ≤ T J′.
Then:
(a) The optimal cost function J * is the unique fixed point of T within
S.

(b) We have T k J → J * for every J ∈ S. Moreover there exists an


optimal S-regular policy.
(c) A policy µ is optimal if and only if Tµ J * = T J * .

Proof: If S = R+ (X) or S = Rp+ (X), it can be verified that all the


parts of Assumption 3.2.1 are satisfied, and the results follow from Prop.
3.2.1 [this includes part (f), which is satisfied by assumption in the case of
S = Rp+ (X); cf. condition (6)]. If S = Rb+ (X), the proof is similar, but to
apply Prop. 3.2.1, we need to show that Assumption 3.2.1(f) is satisfied.
To this end, we will show that for each J ∈ S, there exists a J′ ∈ S of
the form J′ = αĴ, where α is a scalar with 0 < α < 1, such that J′ ≤ J and
J′ ≤ T J′, so again the results will follow from Prop. 3.2.1. Indeed, from
Lemma 3.2.4, we have that Ĵ is a fixed point of T. For any J ∈ S, choose
J′ = αĴ, with α ∈ (0, 1), such that J′ ≤ J, and let µ be an S-regular
policy such that Tµ J′ = T J′ [cf. Lemma 3.2.1 and condition (4)]. Then,
we have T J′ = Tµ J′ = Tµ(αĴ) = αAµ Ĵ + bµ ≥ α(Aµ Ĵ + bµ) = αTµ Ĵ ≥
αT Ĵ = αĴ = J′. Q.E.D.

Note the difference between Props. 4.5.1 and 4.5.2: in the former,
the uniqueness of fixed point of T is guaranteed within a smaller set of
functions when J¯ ∈ Rp+ (X). Similarly, the convergence of VI is guaranteed
from within a smaller range of starting functions when J¯ ∈ Rp+ (X).

4.5.3 Exponential Cost Stochastic Shortest Path Problems

We will now apply the analysis of the affine monotonic model to SSP prob-
lems with an exponential cost function, which is introduced to incorporate
risk sensitivity in the control selection process.
Consider an SSP problem with finite state and control spaces, transi-
tion probabilities pxy (u), and real-valued transition costs h(x, u, y). State 0
is a termination state, which is cost-free and absorbing. Instead of the stan-
dard additive cost function (cf. Example 1.2.6), we consider an exponential
cost function of the form

\[
J_\mu(x) = \lim_{k\to\infty} E\left\{ \exp\left( \sum_{m=0}^{k-1} h\bigl(x_m, \mu(x_m), x_{m+1}\bigr) \right) \,\Bigm|\, x_0 = x \right\}, \qquad x \in X,
\]

where {x0 , x1 , . . .} denotes the trajectory produced by the Markov chain


under policy µ. This is an affine monotonic model with J¯ = e and mapping

Tµ given by
\[
(T_\mu J)(x) = \sum_{y\in X} p_{xy}\bigl(\mu(x)\bigr) \exp\bigl( h(x, \mu(x), y) \bigr) J(y)
+ p_{x0}\bigl(\mu(x)\bigr) \exp\bigl( h(x, \mu(x), 0) \bigr), \qquad x \in X, \tag{4.47}
\]
[cf. Eq. (4.43)]. Here Aµ and bµ have components
\[
A_\mu(x, y) = p_{xy}\bigl(\mu(x)\bigr) \exp\bigl( h(x, \mu(x), y) \bigr), \tag{4.48}
\]
\[
b_\mu(x) = p_{x0}\bigl(\mu(x)\bigr) \exp\bigl( h(x, \mu(x), 0) \bigr). \tag{4.49}
\]
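For a finite-state problem, the matrix Aµ and vector bµ of Eqs. (4.48)-(4.49) are easily assembled from the problem data. A small Python sketch follows; the indexing convention (state 0 is the cost-free termination state, states 1, . . . , n are the remaining states, and the arrays p and h are already evaluated at u = µ(x)) is an assumption made for the illustration.

    import numpy as np

    def exponential_ssp_model(p, h):
        # p[x, y] = p_xy(mu(x)), h[x, y] = h(x, mu(x), y); row/column 0 is the
        # termination state.  Returns A_mu and b_mu of Eqs. (4.48)-(4.49).
        n = p.shape[0] - 1
        A = np.zeros((n, n))
        b = np.zeros(n)
        for x in range(1, n + 1):
            b[x - 1] = p[x, 0] * np.exp(h[x, 0])
            for y in range(1, n + 1):
                A[x - 1, y - 1] = p[x, y] * np.exp(h[x, y])
        return A, b

The resulting pair (A, b) can then be fed, for example, to the regularity check sketched after Eq. (4.45).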
Note that there is a distinction between S-irregular policies and im-
proper policies (the ones that never terminate). In particular, there may
exist improper policies, which are S-regular because they can generate
some negative transition costs h(x, u, y), which make Aµ contractive [cf.
Eq. (4.47)]. Similarly, there may exist proper policies (i.e., terminate with
probability one), which are S-irregular because for the corresponding Aµ
and bµ we have Σ_{m=0}^∞ (Aµ^m bµ)(x) = ∞ for some x.
We may consider the two cases where the condition T J¯ ≥ J¯ holds (cf.
Section 4.5.1) and where it does not (cf. Section 4.5.2), as well as a third
case where none of these conditions applies, but the perturbation-based
theory of Section 3.2.2 or the contractive theory of Chapter 2 can be used.
Consider first the case where T J¯ ≥ J¯. An example is when
\[
h(x, u, y) \ge 0, \qquad \forall\ x, y \in X,\ u \in U(x),
\]
so that from Eq. (4.47), we have exp(h(x, u, y)) ≥ 1, and since J̄ = e, it
follows that Tµ J̄ = Aµ J̄ + bµ ≥ J̄ for all µ ∈ M. As in Section 4.5.1, by
letting
\[
S \subset \bigl\{ J \in E^+(X) \mid J \ge \bar J \bigr\},
\]

and by assuming the existence of an optimal S-regular policy, we can apply


Prop. 4.5.1 to obtain the corresponding conclusions. In particular, J * is
the unique fixed point of T within S [cf. Eq. (4.44)], all optimal policies are
S-regular and satisfy the optimality condition Tµ J * = T J * , and VI yields
J * in the limit, when initialized from within S.
On the other hand, there are interesting applications where the con-
dition T J¯ ≥ J¯ does not hold. The following is an example.

Example 4.5.1 (Optimal Stopping with Risk-Sensitive Cost)

Consider an SSP problem where there are two controls at each x: stop, in
which case we move to the termination state 0 with a cost s(x), and continue,
in which case we move to a state y, with given transition probabilities pxy [at
no cost if y ≠ 0 and a cost s̄(x) if y = 0]. The mapping H has the form
\[
H(x, u, J) = \begin{cases} \exp\bigl( s(x) \bigr) & \text{if } u = \text{stop}, \\[4pt] \displaystyle\sum_{y\in X} p_{xy} J(y) + p_{x0} \exp\bigl( \bar s(x) \bigr) & \text{if } u = \text{continue}, \end{cases}
\]

and J̄ is the unit function e. Here the stopping cost s(x) is often naturally
negative for some x (this is true for example in search problems of the type
discussed in Example 3.2.1), so the condition T J̄ ≥ J̄ can be written as
\[
\min\Bigl\{ \exp\bigl( s(x) \bigr),\ \sum_{y\in X} p_{xy} + p_{x0} \exp\bigl( \bar s(x) \bigr) \Bigr\} \ge 1, \qquad \forall\ x \in X,
\]

and is violated when s(x) < 0 for some x.
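As a quick numerical illustration, the following Python sketch (with made-up stopping costs s, s̄ and transition probabilities p; none of these numbers come from the text) evaluates (T J̄)(x) for the stopping problem above and tests the condition T J̄ ≥ J̄ = e componentwise.

    import numpy as np

    s    = np.array([-0.5, 0.2, 1.0])         # stopping costs; s(1) < 0
    sbar = np.array([0.3, 0.3, 0.3])          # cost of moving to 0 under "continue"
    p    = np.array([[0.1, 0.4, 0.3, 0.2],    # p[x, y], with y = 0 the termination state
                     [0.0, 0.5, 0.5, 0.0],
                     [0.2, 0.1, 0.3, 0.4]])

    Jbar = np.ones(3)                         # Jbar = e
    stop_val = np.exp(s)
    cont_val = p[:, 1:] @ Jbar + p[:, 0] * np.exp(sbar)
    TJbar = np.minimum(stop_val, cont_val)
    print(TJbar >= Jbar)                      # fails at state 1, where s(1) < 0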

When the condition T J¯ ≥ J¯ does not hold, we may use the analysis
of Section 4.5.2, under the conditions of Prop. 4.5.2, chief among which
is that an S-regular policy exists, and for every S-irregular policy µ and
J ∈ S, there exists x ∈ X such that

\[
\limsup_{k\to\infty}\, (A_\mu^k J)(x) = \infty \qquad \text{or} \qquad \sum_{m=0}^{\infty} (A_\mu^m b_\mu)(x) = \infty,
\]

where Aµ and bµ are given by Eqs. (4.48), (4.49) [cf. Eq. (4.46)], and
S = R+ (X) or S = Rp+ (X) or S = Rb+ (X).
If these conditions do not hold, we may also use the approach of
Section 3.2.2, which is based on adding a perturbation δ to bµ . We assume
that the optimal cost function Jδ* of the δ-perturbed problem is a fixed
point of the mapping Tδ given by
\[
(T_\delta J)(x) = \min_{u\in U(x)} \Bigl\{ \sum_{y\in X} p_{xy}(u) \exp\bigl( h(x, u, y) \bigr) J(y)
+ p_{x0}(u) \exp\bigl( h(x, u, 0) \bigr) + \delta \Bigr\}, \qquad x \in X,
\]

and we assume existence of an optimal S-regular policy with


\[
S = \bigl\{ J \in B(X) \mid J(x) > 0,\ \forall\ x \in X \bigr\},
\]
where B(X) is the space of bounded functions with respect to some weighted
sup-norm. The remaining conditions of Assumption 3.2.2 are relatively
mild and we assume that they hold. Then Prop. 3.2.3 applies and shows
that J * is equal to limδ↓0 Jδ* and is the unique fixed point of T within the
set {J ∈ S | J ≥ J * }, and that the VI sequence {T k J } converges to J *
starting from a function J ∈ S with J ≥ J * . Under some circumstances
where there is no optimal S-regular policy, we may also be able to use
Prop. 3.2.2. In particular, it may happen that for some x ∈ X, J * (x) is
strictly smaller than limδ↓0 Jδ* (x), the optimal cost over all S-regular poli-
cies, while there may exist S-irregular policies that are optimal and attain
J * , in which case Prop. 3.2.2 applies.
The following example illustrates the possibilities, and highlights the
ranges of applicability of Props. 4.5.1 and 4.5.2 (which are special cases of
Props. 4.4.1 and 3.2.1, respectively), and Props. 3.2.2 and 3.2.3.

[Figure 4.5.1. Shortest path problem with exponential cost function. The
cost that is exponentiated is shown next to each arc. The figure also records
the Bellman equation for this example: J(1) = min{exp(b), exp(a)J(2)} and
J(2) = exp(a)J(1).]

Example 4.5.2 (Shortest Paths with Risk-Sensitive Cost)

Consider the context of the three-node shortest path problem of Section 3.1.2,
but with the exponential cost function of the present subsection (see Fig.
4.5.1). Here the DP model has two states: x = 1, 2. There are two policies,
denoted µ and µ̄: the 1st policy (µ) is 2 → 1 → 0, while the 2nd policy (µ̄) is
2 → 1 → 2. The corresponding mappings Tµ and Tµ̄ are given by
\[
(T_\mu J)(1) = \exp(b), \qquad (T_\mu J)(2) = \exp(a)\, J(1),
\]
\[
(T_{\bar\mu} J)(1) = \exp(a)\, J(2), \qquad (T_{\bar\mu} J)(2) = \exp(a)\, J(1).
\]
Moreover, for k ≥ 2, we have
\[
(T_\mu^k J)(1) = \exp(b), \qquad (T_\mu^k J)(2) = \exp(a + b),
\]
and
\[
(T_{\bar\mu}^k J)(1) = \begin{cases} \bigl(\exp(a)\bigr)^k J(1) & \text{if } k \text{ is even}, \\ \bigl(\exp(a)\bigr)^k J(2) & \text{if } k \text{ is odd}, \end{cases}
\qquad
(T_{\bar\mu}^k J)(2) = \begin{cases} \bigl(\exp(a)\bigr)^k J(1) & \text{if } k \text{ is odd}, \\ \bigl(\exp(a)\bigr)^k J(2) & \text{if } k \text{ is even}. \end{cases}
\]
The cost functions of µ and µ̄, with J̄ = e, are
\[
J_\mu(1) = \exp(b), \qquad J_\mu(2) = \exp(a + b),
\]
\[
J_{\bar\mu}(1) = J_{\bar\mu}(2) = \lim_{k\to\infty} \exp\Bigl( \sum_{m=0}^{k-1} a \Bigr) = \lim_{k\to\infty} \bigl( \exp(a) \bigr)^k.
\]
Clearly the proper policy µ is S-regular, since Tµ^k J = Jµ for all k ≥ 2.
The improper policy µ̄ is S-irregular when a > 0 [since Jµ̄(1) = Jµ̄(2) = ∞],
and when a = 0 (since Tµ̄^k J depends on J), for any reasonable choice of S.
However, in the case where a < 0 and there is a negative cycle 2 → 1 → 2,
µ̄ is optimal and R+(X)-regular [but not Rp+(X)-regular], since Tµ̄^k J =
(exp(a))^k J → 0 ∈ R+(X) for all J ∈ R+(X).

The major lines of analysis of semicontractive models that we have


discussed are all illustrated in the five possible combinations of values of a
and b given below. Each of these five combinations exhibits significantly
different characteristics, and in each case the assertion about the set of fixed
points of T is based on a different proposition!
(a) Case a > 0: Here the regular policy µ is optimal, and the irregular
policy µ̄ has infinite cost for all x. It can be seen that the assumptions of
Prop. 4.5.2 with S = Rp+(X) apply. Note here that bµ̄ = 0, so condition
(2) of Prop. 4.5.2 is violated when S = R+(X) [the condition (4.46) is
violated for J = 0]. Consistently with this fact, T has the additional
fixed point J = 0 within R+(X), while value iteration starting from
J0 = 0 generates T^k J0 = 0 for all k, and does not converge to J*.
(b) Case a = 0 and b > 0: Here the irregular policy µ̄ is optimal, and
the assumptions of Props. 4.5.1 and 4.5.2, with both S = R+(X) and
S = Rp+(X), are violated [despite the fact that Assumption I holds for
this case]. The assumptions of Prop. 3.2.3 are also violated because the
only optimal policy is irregular. However, consistent with Prop. 3.2.2,
limδ↓0 J*_δ is the optimal cost over the regular policies only, which is Jµ.
In particular, we have
\[
J_\mu(1) = \exp(b) = \lim_{\delta\downarrow 0} J^*_\delta(1) > J^*(1) = 1.
\]
Here the set of fixed points of T is
\[
\bigl\{ J \mid J \le \exp(b)\, e,\ J(1) = J(2) \bigr\},
\]
and contains vectors J from the range J > J* as well as from the range
J < J* (however, J* = e is the “smallest” fixed point with the property
J ≥ J̄ = e, consistently with Prop. 4.3.3).
(c) Case a = 0 and b = 0: Here µ and µ̄ are both optimal, and the results
of Prop. 4.5.1 apply with S = {J | J ≥ J* = J̄ = e}. However, the
assumptions of Prop. 4.5.2 are violated, and indeed T has multiple fixed
points within both Rp+(X) and (a fortiori) R+(X); the set of its fixed
points is
\[
\bigl\{ J \mid J \le e,\ J(1) = J(2) \bigr\}.
\]
(d) Case a = 0 and b < 0: Here the regular policy µ is optimal. However,
the assumptions of Props. 4.5.1 and 4.5.2 are violated. On the other
hand, Prop. 3.2.3 applies with S = {J | J ≥ J*}, so T has a unique
fixed point within S, while value iteration converges to J* starting from
within S. Here again T has multiple fixed points within Rp+(X) and (a
fortiori) R+(X); the set of its fixed points is
\[
\bigl\{ J \mid J \le \exp(b)\, e,\ J(1) = J(2) \bigr\}.
\]
(e) Case a < 0: Here µ̄ is optimal and also R+(X)-regular [but not Rp+(X)-
regular, since Jµ̄ = 0 ∉ Rp+(X)]. However, the assumptions of Prop.
4.5.1, and Prop. 4.5.2 with both S = R+(X) and S = Rp+(X) = Rb+(X),
are violated. Still, however, our analysis applies and in a stronger form,
because both Tµ and Tµ̄ are contractions. Thus we are dealing with a
contractive model for which the results of Chapter 2 apply (J* = 0 is
the unique fixed point of T over the entire space ℜ², and value iteration
converges to J* starting from any J ∈ ℜ²).
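The five cases above are easy to explore numerically. The sketch below runs value iteration for the two-state example (the operator T is the one recorded in Fig. 4.5.1; the particular values of a and b are chosen for illustration). For instance, with a > 0 one can check that the iterates starting from J = 0 remain at the extraneous fixed point 0, while the iterates starting from J ≥ J̄ = e converge to J*.

    import numpy as np

    def T(J, a, b):
        # Bellman operator of Fig. 4.5.1: J(1) = min{e^b, e^a J(2)}, J(2) = e^a J(1).
        return np.array([min(np.exp(b), np.exp(a) * J[1]),
                         np.exp(a) * J[0]])

    def value_iteration(J0, a, b, iters=200):
        J = np.array(J0, dtype=float)
        for _ in range(iters):
            J = T(J, a, b)
        return J

    a, b = 1.0, 0.5                                  # case (a): a > 0
    print(value_iteration([1.0, 1.0], a, b))         # converges to (e^b, e^{a+b}) = J*
    print(value_iteration([0.0, 0.0], a, b))         # stays at the extra fixed point 0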

4.6 AN OVERVIEW OF SEMICONTRACTIVE MODELS AND RESULTS
Several semicontractive models and results have been discussed in this
chapter and in Chapter 3, under several different assumptions, and it may
be worth summarizing them. Three types of models have been considered:
(a) Models where the set S may include extended real-valued functions,
an optimal S-regular policy is assumed to exist, and no other con-
ditions are placed on S-irregular policies. These models are covered
by Props. 3.1.1, 3.1.2, and 4.4.1, and they may require substantial
analysis to verify the corresponding assumptions. Note here that the
existence of an optimal stationary policy (regular or irregular) may
not be easily verified. However, in the special case where Assumption
I and the compactness assumption of Prop. 4.3.14 hold, existence of
an optimal stationary policy is guaranteed, and then requiring the
existence of an optimal S-regular policy may not be restrictive.
(b) Models where S consists of real-valued functions, and conditions are
placed on S-irregular policies, which roughly imply that their cost
is infinite from some states. There are two propositions that apply
to such models: Prop. 3.1.3, which assumes also that an optimal
S-regular policy exists, and Prop. 3.2.1 (and its specialized version,
Prop. 4.5.2, for affine monotonic models), which indirectly guarantees
existence of an optimal S-regular policy through other assumptions.
(c) Perturbation models, where S-irregular policies cannot be adequately
differentiated from S-regular ones on the basis of their cost functions,
but they become differentiated once a positive additive perturbation
is added to their associated mapping. These models include the ones
of Sections 3.2.2 and are covered by Props. 3.2.2-3.2.4.
Variants of these models may also include special structure that en-
hances the power of the analysis, as for example in SSP problems, linear
quadratic problems, and affine monotonic and exponential cost models.
The two significant issues in the analysis of semicontractive models
are how to select the set S so that an optimal S-regular policy exists, and
how to verify the existence of such a policy. There seems to be no universal
approach for addressing these issues, as can be evidenced by the variety
of alternative sets of assumptions that we have introduced, and by the

variety of special cases that we have analyzed. Thus the analysis of the
special structure of the given model becomes important. This analysis may
be guided by the general ideas and results of the semicontractive theory
that we have developed.

4.7 NOTES, SOURCES, AND EXERCISES

The analysis of noncontractive abstract DP models of Sections 4.3.1 and


4.3.2 dates to the author’s paper [Ber77]. This analysis and its finite hori-
zon counterpart of Section 4.2 were also presented in the monograph by
Bertsekas and Shreve [BeS78].
Important examples of noncontractive infinite horizon models are the
classical positive DP problems, analyzed by Blackwell [Bla65], and by
Dubins and Savage [DuS65], and the negative DP problems analyzed in
Strauch [Str66] (and also in Strauch’s Ph.D. thesis, written under the su-
pervision of Blackwell). The monograph by Bertsekas and Shreve [BeS78]
provides a detailed treatment of these two models, which also resolves the
associated measurability questions. A simpler textbook treatment, which
bypasses the measurability questions, is given in [Ber12a], Chapter 4.
The compactness condition that guarantees convergence of VI to J *
under Assumption I (cf. Prop. 4.3.14) has been discussed by the author in
[Ber73] for reachability problems (see Exercise 4.9), and in [Ber75], [Ber77]
for negative DP models; also by Schal [Sch75] and Whittle [Whi80] for
negative DP models. A more refined analysis of the question of convergence
of VI to J * is possible. This analysis provides a necessary and sufficient
condition for convergence and improves over the compactness condition
of Prop. 4.3.14. In particular, the following characterization is shown in
[Ber77], Prop. 11 (see also [BeS78], Prop. 5.9):
For a set C ⊂ X × U × ℜ, let Π(C) be the projection of C onto X × ℜ:
\[
\Pi(C) = \bigl\{ (x, \lambda) \mid (x, u, \lambda) \in C \text{ for some } u \in U(x) \bigr\},
\]
and denote also
\[
\bar\Pi(C) = \bigl\{ (x, \lambda) \mid \lambda_m \to \lambda \text{ for some sequence } \{\lambda_m\} \text{ with } \{(x, \lambda_m)\} \subset \Pi(C) \bigr\}.
\]
Consider the sets Ck ⊂ X × U × ℜ given by
\[
C_k = \bigl\{ (x, u, \lambda) \mid H(x, u, T^k \bar J) \le \lambda,\ x \in X,\ u \in U(x) \bigr\}, \qquad k = 0, 1, \ldots.
\]
Then under Assumption I we have T^k J̄ → J* if and only if
\[
\bar\Pi\bigl( \cap_{k=0}^{\infty} C_k \bigr) = \cap_{k=0}^{\infty} \bar\Pi(C_k).
\]

Moreover we have T k J¯ → J * and in addition there exists an optimal sta-


tionary policy if and only if
\[
\Pi\bigl( \cap_{k=0}^{\infty} C_k \bigr) = \cap_{k=0}^{\infty} \bar\Pi(C_k). \tag{4.50}
\]

For a connection with Prop. 4.3.14, it can be shown that compactness of
\[
U_k(x, \lambda) = \bigl\{ u \in U(x) \mid H(x, u, T^k \bar J) \le \lambda \bigr\}
\]

implies Eq. (4.50) (see [Ber77], Prop. 12, or [BeS78], Prop. 5.10).
Optimistic PI and λ-PI have not been considered earlier under As-
sumption D, and the corresponding analysis of Section 4.3.3 is new. See
[BeI96], [ThS10a], [ThS10b], [Ber11b], [Sch11] for analyses of λ-PI for dis-
counted and SSP problems. A related extension, called Λ-PI, is discussed
in Yu and Bertsekas [YuB12], Section 5.
The connection of the monotone increasing model with the semicon-
tractive model, discussed in Section 4.4, and the corresponding PI algo-
rithms are also new. Some finite-state SSP problems with nonpositive
costs may be related to average cost MDP, with possibly multichain poli-
cies, for which there are special PI methods (see [Put94], Section 10.4).
However, these PI methods are more complicated than the ones discussed
here. Linear-quadratic problems have been exhaustively analyzed in the
literature, and the convergence of PI for these problems was first shown by
Kleinman [Kle68].
The monotonic affine models of Sections 4.5.1 and 4.5.2 have not
been considered earlier, to the author’s knowledge. The analysis of the
SSP model with exponential cost of Section 4.5.3 is based on unpublished
collaboration with H. Yu. Two prominent papers for the latter model are
by Denardo and Rothblum [DeR79], and by Patek [Pat01]. The paper
[DeR79] assumes that the state and control spaces are finite, that there
exists at least one S-regular policy (a transient policy in the terminology
of [DeR79]), and that every improper policy is S-irregular. The paper
[Pat01] assumes that the state space is finite, that the control constraint
set is compact, and that the expected one-stage cost is strictly positive for
all state-control pairs.

EXERCISES

4.1 (Example of Nonexistence of an Optimal Policy Under D)

This is an example of a deterministic stopping problem where Assumption D


holds, and an optimal policy does not exist, even though only two controls are

available at each state (stop and continue). The state space is X = {1, 2, . . .}.
Continuation from state x leads to state x + 1 with certainty and no cost, while
the stopping cost is −1 + (1/x), so that there is an incentive to delay stopping
at every state. Here for all x, J̄(x) = 0, and
\[
H(x, u, J) = \begin{cases} J(x+1) & \text{if } u = \text{continue}, \\ -1 + (1/x) & \text{if } u = \text{stop}. \end{cases}
\]

Show that J ∗ (x) = −1 for all x, but there is no policy (stationary or not) that
attains the optimal cost starting from x.

4.2 (Counterexample for Optimality Condition Under D)

For the problem of Exercise 4.1, show that the policy µ that never stops is
nonoptimal and satisfies Tµ J ∗ = T J ∗ .

4.3 (Counterexample for Optimality Condition Under I)

Let
\[
X = \Re, \qquad U(x) \equiv (0, 1], \qquad \bar J(x) \equiv 0,
\]
\[
H(x, u, J) = |x| + J(ux), \qquad \forall\ x \in X,\ u \in U(x).
\]
Let µ(x) = 1 for all x ∈ X. Then Jµ(x) = ∞ if x ≠ 0 and Jµ(0) = 0. Verify that
Tµ Jµ = T Jµ. Verify also that J*(x) = |x|, and hence µ is not optimal.

4.4 (Solution by Mathematical Programming)

This exercise shows that under Assumptions I and D, it is possible to devise a


computational method based on mathematical programming when X = {1, . . . , n}.
(a) Under Assumption I, show that J ∗ is the unique solution of the following
optimization problem in z = (z1 , . . . , zn ):
\[
\begin{aligned} & \text{minimize} \ \ \sum_{i=1}^{n} z_i \\ & \text{subject to} \ \ z_i \ge \bar J(i), \quad z_i \ge \inf_{u\in U(i)} H(i, u, z), \qquad i = 1, \ldots, n. \end{aligned}
\]

(b) Under Assumption D, show that J ∗ is the unique solution of the following
optimization problem in z = (z1 , . . . , zn ):
\[
\begin{aligned} & \text{maximize} \ \ \sum_{i=1}^{n} z_i \\ & \text{subject to} \ \ z_i \le \bar J(i), \quad z_i \le H(i, u, z), \qquad i = 1, \ldots, n,\ u \in U(i). \end{aligned}
\]

Note: Generally, these programs may not be linear or even convex.
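As an illustration, for a finite discounted MDP with nonpositive expected costs per stage (so that Assumption D holds with J̄ = 0 and H(i, u, z) = g(i, u) + α Σ_j p_ij(u) z_j is linear in z), the program of part (b) becomes a linear program and can be handed to an off-the-shelf solver. The following Python sketch uses scipy; the data g, P, α are made up for the example.

    import numpy as np
    from scipy.optimize import linprog

    alpha = 0.9
    g = np.array([[-1.0, -0.5],                      # g[i, u] <= 0
                  [-2.0, -0.3]])
    P = np.array([[[0.8, 0.2], [0.1, 0.9]],          # P[i, u, j]
                  [[0.5, 0.5], [0.7, 0.3]]])
    n, m = g.shape

    # Constraints z_i - alpha * sum_j p_ij(u) z_j <= g(i, u); z_i <= 0 via bounds.
    A_ub, b_ub = [], []
    for i in range(n):
        for u in range(m):
            row = -alpha * P[i, u]
            row[i] += 1.0
            A_ub.append(row)
            b_ub.append(g[i, u])

    res = linprog(c=-np.ones(n),                     # maximize sum_i z_i
                  A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, 0)] * n, method="highs")
    print(res.x)                                     # approximates J* for this toy problem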



4.5 (Semicontractive Discounted Problems with Unbounded Cost per Stage)

This exercise applies the theory of semicontractive models under Assumption I


(cf. Section 4.4.1) to a broad class of discounted problems. It is related to Exam-
ple 4.4.2. Consider a countable-state MDP where X = {1, 2, . . .}, the discount
factor is α ∈ (0, 1), the transition probabilities are denoted pxy (u) for x, y ∈ X
and u ∈ U (x), and the expected cost per stage is denoted by g(x, u), x ∈ X,
u ∈ U (x). We assume that
\[
g(x, u) \ge 0, \qquad \forall\ x \in X,\ u \in U(x).
\]
For a positive weight sequence v = {v(1), v(2), . . .}, we consider the space B(X)
of sequences J = {J(1), J(2), . . .} such that ‖J‖ < ∞, where ‖·‖ is the corre-
sponding weighted sup-norm. Let
\[
S = \bigl\{ J \in B(X) \mid J(x) \ge 0 \bigr\},
\]
and consider the monotone mappings Tµ and T, given by
\[
(T_\mu J)(x) = g\bigl(x, \mu(x)\bigr) + \alpha \sum_{y\in X} p_{xy}\bigl(\mu(x)\bigr) J(y), \qquad x \in X,
\]
\[
(T J)(x) = \inf_{u\in U(x)} \Bigl\{ g(x, u) + \alpha \sum_{y\in X} p_{xy}(u) J(y) \Bigr\}, \qquad x \in X.
\]

(a) Assume that there exists a subset M̂ of policies such that for all µ ∈ M̂:

(1) The sequence G = {G1, G2, . . .}, where
\[
G_x = g\bigl(x, \mu(x)\bigr), \qquad x \in X,
\]
belongs to S.

(2) The sequence V = {V1, V2, . . .}, where
\[
V_x = \sum_{y\in X} p_{xy}\bigl(\mu(x)\bigr) v(y), \qquad x \in X,
\]
belongs to S.

(3) We have
\[
\frac{\sum_{y\in X} p_{xy}\bigl(\mu(x)\bigr) v(y)}{v(x)} \le 1, \qquad x \in X.
\]

Show that for all µ ∈ M̂, Tµ maps S into S, and is a contraction mapping
with modulus α. In particular, all policies in M̂ are S-regular.

(b) Assume further that for each J ∈ S there exists µ ∈ M̂ such that Tµ J =
T J, and that there exists an optimal policy within M̂. Show that Props.
4.4.1 and 4.4.2 apply, so that J* is the unique fixed point of T within S,
and it can be obtained as the limit of cost functions generated by the PI
algorithm, starting from a policy µ0 ∈ M̂.

4.6 (Blackmailer’s Dilemma)

Consider the blackmailer’s dilemma (cf. Exercise 3.1), where

\[
X = \{1\}, \qquad U(1) = (0, 1], \qquad \bar J(1) = 0,
\]
and
\[
H(1, u, J) = -u + (1 - u^2)\, J(1).
\]
From Exercise 3.1 we have J ∗ (1) = −∞ and that there is no optimal stationary
policy. Show that:
(a) The PI algorithm generates a sequence {µk } of policies according to
\[
J_{\mu^k}(1) = -\frac{1}{\mu^k(1)}, \qquad \mu^{k+1}(1) = -\frac{1}{2\, J_{\mu^k}(1)},
\]

so that Jµk+1 = 2Jµk , and Jµk ↓ J ∗ = −∞ since Jµ0 < 0.


(b) The optimistic PI and λ-PI algorithms starting from J0 = J¯ = 0 generate
sequences {Jk } with Jk ↓ J ∗ .

4.7 (Counterexample for Policy Improvement Under D - Infinite State Space)

Consider the deterministic stopping problem of Exercise 4.1. Let µ be the policy
that stops at every state.
(a) Show that the next policy µ̄ generated by nonoptimistic PI is to continue
at every state, and that we have
\[
J_{\bar\mu}(x) = 0 > -1 + \frac{1}{x} = J_\mu(x), \qquad x \in X.
\]
Moreover, the method oscillates between the policies µ and µ̄, neither of which
is optimal.
(b) The optimistic PI and λ-PI algorithms starting from J0 = J¯ = 0 generate
sequences {Jk } with Jk ↓ J ∗ .

4.8 (Counterexample for Policy Improvement Under D - Finite State Space)

Consider a finite-state version of the problem of Exercises 4.1 and 4.7, where
the state space is X = {1, . . . , n} and there are two controls: stop and continue.
Continuation from states x = 1, . . . , n − 1 leads to state x + 1 at no cost, while
continuation at state x = n keeps the state at n with no cost. The stopping cost
is −1 + (1/x) for all x > 0. Here J̄(x) ≡ 0, and
\[
H(x, u, J) = \begin{cases} J(x+1) & \text{if } x = 1, \ldots, n-1, \text{ and } u = \text{continue}, \\ J(x) & \text{if } x = n, \text{ and } u = \text{continue}, \\ -1 + (1/x) & \text{if } x = 1, \ldots, n, \text{ and } u = \text{stop}. \end{cases}
\]

(a) Show that the pathology of Exercise 4.7(a) can still occur: nonoptimistic
PI can oscillate between the policy µ that stops at every state and the
policy µ̄ that continues at every state, and that neither of these two policies
is optimal (at any state except x = n). Note: This can be viewed as an
example of a finite-state SSP problem where there exists an optimal proper
policy, and all improper policies are not optimal (but have finite cost for
all initial states). The latter fact invalidates Assumption 3.2.1 of Chapter
3, and as a result the proof of Prop. 3.2.1 breaks down.
(b) Show that the pathology of part (a) still occurs if µ is replaced by the
optimal policy, which is to stop at x = n and to continue at x ≠ n. Note:
Here we have Tµ̄ J* = T J* even though µ̄ is not optimal.

4.9 (Infinite Time Reachability [Ber72])

This exercise provides an instance of an interesting problem where the mapping


H is naturally extended real-valued. Consider a dynamic system
xk+1 = f (xk , uk , wk ),
where wk is viewed as an uncertain disturbance that may be any point in a
set W (xk , uk ) (this is known in the literature as an “unknown but bounded”
disturbance, and is the basis for a worst case/minimax treatment of uncertainty
in the control of uncertain dynamic systems). We introduce an abstract DP model
where the objective is to find a policy that keeps the state xk of the system within
a given set X at all times, for all possible values of the sequence {wk }. This is a
common objective, which arises in a variety of control theory contexts, including
model predictive control.
Let
\[
\bar J(x) = \begin{cases} 0 & \text{if } x \in X, \\ \infty & \text{otherwise}, \end{cases}
\]
and
\[
H(x, u, J) = \begin{cases} 0 & \text{if } J(x) = 0,\ u \in U(x), \text{ and } J\bigl(f(x, u, w)\bigr) = 0,\ \forall\ w \in W(x, u), \\ \infty & \text{otherwise}. \end{cases}
\]
(a) Show that Assumption I holds, and that the optimal cost function has the
form
\[
J^*(x) = \begin{cases} 0 & \text{if } x \in X^*, \\ \infty & \text{otherwise}, \end{cases}
\]
where X* is some subset of X.
(b) Consider the sequence of sets {Xk }, where
\[
X_k = \bigl\{ x \in X \mid (T^k \bar J)(x) = 0 \bigr\}.
\]

Show that Xk+1 ⊂ Xk for all k, and that X* ⊂ ∩_{k=0}^∞ Xk. Show also that
convergence of VI (i.e., T^k J̄ → J*) is equivalent to X* = ∩_{k=0}^∞ Xk.
(c) Show that X* = ∩_{k=0}^∞ Xk and there exists an optimal stationary policy if
the sets
\[
\hat U_k(x) = \bigl\{ u \in U(x) \mid f(x, u, w) \in X_k,\ \forall\ w \in W(x, u) \bigr\}
\]
are compact for all k greater than some index k̄. Hint: Use Prop. 4.3.14.
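For finite state, control, and disturbance sets, the sets Xk of part (b) can be computed by the recursion X0 = X and X_{k+1} = {x ∈ Xk | there exists u ∈ U(x) with f(x, u, w) ∈ Xk for all w ∈ W(x, u)}, which stabilizes after at most |X| steps. A minimal Python sketch follows; the particular f, U, W below describe a made-up toy problem.

    def reachability_sets(X, U, W, f, max_iters=100):
        # Computes the decreasing sets X_k of part (b) until they stabilize.
        Xk = set(X)
        for _ in range(max_iters):
            Xnext = {x for x in Xk
                     if any(all(f(x, u, w) in Xk for w in W(x, u)) for u in U(x))}
            if Xnext == Xk:
                break
            Xk = Xnext
        return Xk

    # Toy instance: keep an integer state within {0, ..., 4} despite a +/-1 disturbance.
    X = range(5)
    U = lambda x: [-1, 0, 1]
    W = lambda x, u: [-1, 1]
    f = lambda x, u, w: x + u + w
    print(sorted(reachability_sets(X, U, W, f)))    # here every state can be kept in X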
5

Models with Restricted Policies

Contents

5.1. A Framework for Restricted Policies . . . . . . . . . p. 188


5.1.1. General Assumptions . . . . . . . . . . . . . p. 192
5.2. Finite Horizon Problems . . . . . . . . . . . . . . p. 196
5.3. Contractive Models . . . . . . . . . . . . . . . . p. 198
5.4. Borel Space Models . . . . . . . . . . . . . . . . p. 200
5.5. Notes, Sources, and Exercises . . . . . . . . . . . . p. 201


In this chapter, we discuss variants of some of the models of the preceding


chapters, where there are restrictions on the set of policies. In particular,
policies may be selected from a strict subset M of the set of functions
µ : X ↦ U with µ(x) ∈ U (x) for all x ∈ X. One potential use of such a
restriction arises when M consists of specially structured policies that re-
sult in convenient characterization and computation. Classical examples of
situations of this type are linear policies in linear-quadratic optimal control
problems, (s, S) policies in inventory control, and various threshold policies
in queueing and scheduling problems (see e.g., [Ber05a], [Ber12a], [Put94]).
In this context, the main focus of the analysis is to show that attention
can be confined to the policies µ ∈ M and their associated mappings Tµ .
If this can be shown, then the analytical solution of Bellman’s equation
may be enhanced, and the computational solution using for example policy
iteration that is restricted within M may be facilitated (cf. Section 4.4.3,
which deals with linear-quadratic optimal control).
Another major use arises in mathematically rigorous probabilistic
treatments of stochastic optimal control problems, and relates to the need
for µ to be measurable in an appropriate sense. We take this as the starting
point and motivation for the development of a general theory of restricted
models in the next section, and we develop this theory in subsequent sec-
tions for finite horizon and for contractive models. In the last section of
this chapter, we return to the treatment of measurability issues using the
theory developed in the earlier sections.

5.1 A FRAMEWORK FOR RESTRICTED POLICIES

As a motivating example, let us consider the mapping H of the stochastic


optimal control Example 1.2.1,
\[
H(x, u, J) = E\Bigl\{ g(x, u, w) + \alpha J\bigl( f(x, u, w) \bigr) \Bigr\}. \tag{5.1}
\]

We have considered so far the case where w is a discrete random variable,


taking a finite or a countably infinite set of values, so that the mapping
Tµ is defined for all µ ∈ M in terms of a summation. In the general case,
however, where w is a continuous random variable, H is well-defined only
if the expected value over w is well-defined. For this it is necessary to
introduce an appropriate probability space for w, and for g and f to be
appropriately measurable.
Most importantly, we cannot simply take J to be an arbitrary extended
real-valued function over X. We must restrict J to a subset E(X) of the set
of all such functions, consisting of appropriately measurable functions. In
addition, to define Tµ as a function from E(X) to E(X) by
\[
(T_\mu J)(x) = H\bigl(x, \mu(x), J\bigr),
\]

or equivalently
\[
(T_\mu J)(x) = E\Bigl\{ g\bigl(x, \mu(x), w\bigr) + \alpha J\bigl( f(x, \mu(x), w) \bigr) \Bigr\},
\]
µ must belong to a subclass M of appropriately measurable policies. Appendix
C provides a further illustration of the measurability is-
terminology and background on measure theoretic issues, based on Borel
measurability, which we will use in this chapter (the monograph [BeS78]
contains an extensive account of this material).
The preceding discussion indicates that to define a restricted policies
model, we need at least two basic subsets of functions and policies:
(a) A restricted set E(X) of extended real-valued functions on X.
(b) A restricted set M of policies, such that
\[
T_\mu J \in E(X), \qquad \forall\ J \in E(X),\ \mu \in M,
\]
where Tµ is given by
\[
(T_\mu J)(x) = H\bigl(x, \mu(x), J\bigr), \qquad \forall\ x \in X.
\]

To conduct an analysis similar to the one of earlier chapters, we also


need to concern ourselves with the corresponding mapping T . To this end,
we assume, without loss of generality, that
\[
U(x) = \bigl\{ \mu(x) \mid \mu \in M \bigr\}, \qquad \forall\ x \in X,
\]
so that the mapping T can be interchangeably defined as
\[
(T J)(x) = \inf_{\mu\in M}\, (T_\mu J)(x), \qquad \forall\ J \in E(X),\ x \in X,
\]
or as
\[
(T J)(x) = \inf_{u\in U(x)} H(x, u, J), \qquad \forall\ J \in E(X),\ x \in X. \tag{5.2}
\]

An issue to contend with here is whether the infimum in the definition of


T J is attained, exactly or within a small tolerance ε > 0, by a policy in
M (simultaneously for all x), i.e., whether there exists µ ∈ M such that
(Tµ J )(x) is close to (T J )(x) uniformly for all x ∈ X. This issue turns out
to be quite delicate in the context of stochastic optimal control problems,
as we will discuss shortly. We first consider the case where the infimum is
exactly attained, and then address the more complex case where it is not.

Models Admitting Exact Selection

Let us assume that E(X) and M have been defined as described above.
Consider the case where there exists µ ∈ M such that the infimum in Eq.
(5.2) is attained at µ(x) for all x ∈ X, i.e., for all J ∈ E(X) there exists a
µ ∈ M such that
Tµ J = T J. (5.3)
Otherwise stated, for all J ∈ E(X), the minimization in Eq. (5.2) admits
an exact selector µ from within M. Then assuming also that T and Tµ ,
µ ∈ M, preserve membership in E(X), i.e.,
Tµ J ∈ E(X), T J ∈ E(X), ∀ J ∈ E(X), µ ∈ M, (5.4)
a large portion of the analysis of the preceding chapters carries through
verbatim, and much of the remainder can be extended with minimal mod-
ifications.
In particular, in the finite horizon problems of Chapter 4, under this
condition, the condition J̄ ∈ E(X), and the monotonicity assumption
\[
H(x, u, J) \le H(x, u, J'), \qquad \forall\ J, J' \in E(X),\ x \in X,\ u \in U(x), \tag{5.5}
\]
we have J*_N = T^N J̄, and there exists an N-stage optimal policy. Such
a policy can be obtained via the DP algorithm that starts with the ter-
minal cost function J̄, and sequentially computes T J̄, T² J̄, . . . , T^N J̄, and
corresponding µ*_{N−1}, µ*_{N−2}, . . . , µ*_0 ∈ M such that
\[
T_{\mu^*_k} T^{N-k-1} \bar J = T^{N-k} \bar J, \qquad k = 0, \ldots, N - 1, \tag{5.6}
\]
(cf. the discussion in the early part of Section 4.2).

(cf. the discussion in the early part of Section 4.2).


To extend the analysis of the contractive models of Chapter 2, under
Eqs. (5.3) and (5.4), we need to assume that E(X) is a closed subset of
B(X), the space of functions J : X ↦ ℜ that are bounded with respect to
a weighted sup-norm. This is necessary so that the fixed point theorems
of Appendix B apply. We also need to assume that the mappings Tµ are
contractions for all µ ∈ M with respect to a common weighted sup-norm.
Then the relevant portion of the analysis of Chapter 2 carries through with
hardly any change. The analysis of the semicontractive and infinite horizon
noncontractive models of Chapters 3 and 4, also admit a similar treatment,
under the exact selection assumption (5.3) and Eq. (5.4).
Generally, when the exact selection property (5.3) holds in the con-
text of the stochastic optimal control example where H is defined by Eq.
(5.1), there are few complications in providing a rigorous mathematical
treatment, even when w is a continuous random variable. Typically X,
U , and W are taken to be Borel spaces, E(X) and M are chosen to be
the spaces of Borel measurable functions from X to '* , and from X to U ,
respectively. Also g, f , and the probability space of w must satisfy certain
Borel measurability conditions (see Appendix C).

Models Without Exact Selection

When the exact selection property (5.3) may not hold, to conduct any kind
of meaningful analysis, it is necessary to adopt a restriction framework for
policies and functions, which guarantees that T J can be approximated
by Tµ J, with appropriate choice of µ. To this end, a seemingly natural
assumption would be that given J ∈ E(X) and ε > 0, there exists an
ε-optimal selector, that is, a µ_ε ∈ M such that
\[
(T_{\mu_\varepsilon} J)(x) \le \begin{cases} (T J)(x) + \varepsilon & \text{if } (T J)(x) > -\infty, \\ -(1/\varepsilon) & \text{if } (T J)(x) = -\infty, \end{cases} \qquad \forall\ x \in X. \tag{5.7}
\]
However, in the Borel space model noted earlier and described in Ap-
pendix C, there is a serious difficulty: if E(X) and M are the spaces of
Borel measurable functions from X to ℜ*, and from X to U, respectively,
there need not exist an ε-optimal selector. For this reason, Borel measura-
bility of cost functions and policies is not the most appropriate probabilistic
framework for stochastic optimal control problems. † Instead, in the most
general framework for bypassing this difficulty, it is necessary to consider
a different kind of measurability, which is described in Appendix C. In this
framework:
(a) E(X) is taken to be the class of universally measurable functions from
X to ℜ*.
(b) M is taken to be the class of universally measurable functions from
X to U .
(c) g, f , and the probability space of w must satisfy certain Borel mea-
surability conditions.
A key fact is that an ε-selection property holds, whereby there exists a
µ_ε ∈ M such that
\[
(T_{\mu_\varepsilon} J)(x) \le \begin{cases} (T J)(x) + \varepsilon & \text{if } (T J)(x) > -\infty, \\ -(1/\varepsilon) & \text{if } (T J)(x) = -\infty, \end{cases} \qquad \forall\ x \in X, \tag{5.8}
\]

† There have been efforts to address the lack of an ε-optimal selector within
the Borel measurability framework using the concept of a “p-ε-optimal selector,”
whereby the concept of ε-optimal selection is modified to hold over a set of
states that has p-measure 1, with p being any chosen probability measure over X
(see [Str66], [Str75], [DyY79]). This leads to a theory based on p-ε-optimal and
p-optimal policies, i.e., policies that depend on the choice of p and are optimal
only for states in a subset of X that has p-measure 1 (rather than over all states
as in our case). It seems difficult to extend the abstract framework of this book
based on this inherently probabilistic viewpoint. For a related discussion and a
comparison of the p-ε-optimal approach with ours, we refer to [BeS78].

for each J in the class Ê(X) of lower semianalytic functions from X to ℜ*;
this is a strict subset of E(X), the set of universally measurable functions
(see Appendix C).
Because of the difficulty with ε-selection within a Borel measurabil-
ity framework, to construct a more generally applicable restricted policies
model, it is necessary to introduce the set of lower semianalytic functions
Ê(X) within the Borel space framework, as a third subset, additional to
E(X) and M. In summary:
(a) We take E(X) to be the class of universally measurable functions from
X to ℜ*, and M to be the class of universally measurable functions
from X to U. Then, Tµ J ∈ E(X) for all µ ∈ M and J ∈ E(X), so
the cost function of a policy lies in E(X).
(b) We take Ê(X) to be the set of lower semianalytic functions from X to
ℜ*. Then we have T J ∈ Ê(X) for all J ∈ Ê(X), and the ε-selection
property (5.8) holds. As a result, the VI algorithm produces functions
in Ê(X), if started within Ê(X), and J* can be proved to lie in
Ê(X).
Motivated by the preceding discussion, we will now introduce a model
that involves a set of policies M, and two sets of functions E(X) and Ê(X)
with properties that are analogous to the ones just discussed for Borel space
models for stochastic optimal control, along with some additional technical
assumptions. We will use this model for the analysis of abstract finite and
infinite horizon problems, and we will review later its use within the Borel
measurability framework.

5.1.1 General Assumptions

As in Section 4.1, we have the sets X and U , and we introduce a set M


of functions µ : X ↦ U, which we view as a restricted set of stationary
policies. We define
\[
U(x) = \bigl\{ \mu(x) \mid \mu \in M \bigr\}, \qquad \forall\ x \in X, \tag{5.9}
\]
and we assume that U(x) is nonempty for all x ∈ X. The corresponding
set of (nonstationary) policies π = {µ0, µ1, . . .} with µk ∈ M, k = 0, 1, . . . ,
is denoted by Π. We introduce two subsets Ê(X) and E(X) of the set of
extended real-valued functions on X, such that

Ê(X) ⊂ E(X),

and the following assumption is satisfied.



Assumption 5.1.1:
(a) For each sequence {Jm } ⊂ E(X) with Jm → J , we have J ∈
E(X), and for each sequence {Jm } ⊂ Ê(X) with Jm → J , we
have J ∈ Ê(X).
(b) For all r ∈ ℜ, we have

J ∈ E(X) ⇒ J + r e ∈ E(X),

and
J ∈ Ê(X) ⇒ J + r e ∈ Ê(X),
where e is the unit function, e(x) ≡ 1.

We also introduce a mapping H : X × U × E(X) ↦ ℜ* satisfying the


monotonicity assumption.

Assumption 5.1.2: (Monotonicity) If J, J′ ∈ E(X) and J ≤ J′,
then
\[
H(x, u, J) \le H(x, u, J'), \qquad \forall\ x \in X,\ u \in U(x).
\]

We define the mapping T : E(X) ↦ ℜ* by
\[
(T J)(x) = \inf_{u\in U(x)} H(x, u, J), \qquad \forall\ x \in X,\ J \in E(X),
\]
and for each µ ∈ M, the mapping Tµ : E(X) ↦ ℜ* by
\[
(T_\mu J)(x) = H\bigl(x, \mu(x), J\bigr), \qquad \forall\ x \in X,\ J \in E(X).
\]
Note that in view of the definition (5.9) of U(x), we also have
\[
(T J)(x) = \inf_{\mu\in M}\, (T_\mu J)(x), \qquad \forall\ x \in X,\ J \in E(X),
\]
or in shorthand
\[
T J = \inf_{\mu\in M} T_\mu J.
\]

The sets M, E(X), and Ê(X), and the mappings Tµ and T are
assumed to satisfy the following.

Assumption 5.1.3:
(a) For all µ ∈ M, we have

J ∈ E(X) ⇒ Tµ J ∈ E(X).

(b) We have
J ∈ Ê(X) ⇒ T J ∈ Ê(X).

Finally, we require that Ê(X) has the following critical ε-selection
property.

Assumption 5.1.4: (ε-Selection) For each J ∈ Ê(X) and ε > 0,
there exists a µ_ε ∈ M such that
\[
(T_{\mu_\varepsilon} J)(x) \le \begin{cases} (T J)(x) + \varepsilon & \text{if } (T J)(x) > -\infty, \\ -(1/\varepsilon) & \text{if } (T J)(x) = -\infty, \end{cases} \qquad \forall\ x \in X. \tag{5.10}
\]

The relevant selection theorem, which guarantees that Assumption
5.1.4 holds in the stochastic optimal control context, is given as Prop.
C.5 in Appendix C. Note that Assumption 5.1.3 does not guarantee that
Tµ J ∈ Ê(X) for J ∈ Ê(X). As a result, the function T_{µ_ε} J of Assumption
5.1.4 is only guaranteed to belong to the larger set E(X).

Problem Formulation

We are given a function J¯ ∈ Ê(X) satisfying

J¯(x) > −∞, ∀ x ∈ X, (5.11)

and we consider for every policy π = {µ0 , µ1 , . . .} ∈ Π and positive integer


N the function JN,π ∈ E(X) defined by

JN,π (x) = (Tµ0 · · · TµN −1 J¯)(x), ∀ x ∈ X,

and the function Jπ defined by

\[
J_\pi(x) = \limsup_{k\to\infty}\, (T_{\mu_0} \cdots T_{\mu_k} \bar J)(x), \qquad \forall\ x \in X.
\]

For a stationary policy π = {µ, µ, . . .} we also write Jµ in place of Jπ .



As earlier, we consider the N -stage optimization problem

\[
\begin{aligned} & \text{minimize} \ \ J_{N,\pi}(x) \\ & \text{subject to} \ \ \pi \in \Pi, \end{aligned} \tag{5.12}
\]
and its infinite horizon version
\[
\begin{aligned} & \text{minimize} \ \ J_{\pi}(x) \\ & \text{subject to} \ \ \pi \in \Pi. \end{aligned} \tag{5.13}
\]
For a fixed x ∈ X, we denote by J*_N(x) and J*(x) the optimal costs
for these problems, i.e.,
\[
J^*_N(x) = \inf_{\pi\in\Pi} J_{N,\pi}(x), \qquad J^*(x) = \inf_{\pi\in\Pi} J_\pi(x), \qquad \forall\ x \in X.
\]
We say that a policy π* ∈ Π is N-stage optimal if
\[
J_{N,\pi^*}(x) = J^*_N(x), \qquad \forall\ x \in X,
\]
and (infinite horizon) optimal if
\[
J_{\pi^*}(x) = J^*(x), \qquad \forall\ x \in X.
\]
For a given ε > 0, we say that π_ε ∈ Π is N-stage ε-optimal if
\[
J_{N,\pi_\varepsilon}(x) \le \begin{cases} J^*_N(x) + \varepsilon & \text{if } J^*_N(x) > -\infty, \\ -(1/\varepsilon) & \text{if } J^*_N(x) = -\infty, \end{cases}
\]
and we say that π_ε is (infinite horizon) ε-optimal if
\[
J_{\pi_\varepsilon}(x) \le \begin{cases} J^*(x) + \varepsilon & \text{if } J^*(x) > -\infty, \\ -(1/\varepsilon) & \text{if } J^*(x) = -\infty. \end{cases}
\]

Note that since J¯ ∈ Ê(X), the function T k J¯ belongs to Ê(X) for all k
[cf. Assumption 5.1.3(b)]. Similar to Chapter 4, we will aim to show under
various assumptions that J*_N = T^N J̄, and that J* ∈ Ê(X) and J* = T J*.

5.2 FINITE HORIZON PROBLEMS

To show that J*_N = T^N J̄, we use an analysis that is similar to the one of

Section 4.2. In particular, we introduce the following assumption, which is


analogous to Assumption 4.2.1.

Assumption 5.2.1: For each sequence {Jm } ⊂ E(X) with Jm ↓ J


for some J ∈ E(X), we have

\[
\lim_{m\to\infty} H(x, u, J_m) \le H(x, u, J), \qquad \forall\ x \in X,\ u \in U(x). \tag{5.14}
\]

Note that Assumption 5.1.1 implies that if {Jm } ⊂ E(X) and Jm ↓


J ∈ E(X), we have
\[
J = \lim_{m\to\infty} J_m = \inf_{m=0,1,\ldots} J_m \in E(X),
\]
so for all µ ∈ M, by Assumption 5.2.1,
\[
\inf_m\, (T_\mu J_m) = T_\mu\Bigl( \inf_m J_m \Bigr).
\]
This equality can be extended for any µ1, . . . , µk ∈ M as follows:
\[
\inf_m\, (T_{\mu_1} \cdots T_{\mu_k} J_m) = T_{\mu_1}\Bigl( \inf_m\, (T_{\mu_2} \cdots T_{\mu_k} J_m) \Bigr) = \cdots = T_{\mu_1} \cdots T_{\mu_k} \Bigl( \inf_m J_m \Bigr). \tag{5.15}
\]

We have the following proposition, which extends Prop. 4.2.3 to the re-
stricted policies framework.

Proposition 5.2.1: Let Assumptions 5.1.1-5.1.4 and 5.2.1 hold. Then

J*_N = T^N J̄.

Proof: We select for each k = 0, . . . , N − 1, a sequence {µ^m_k} ⊂ M such
that
\[
T_{\mu^m_k} (T^{N-k-1} \bar J) \;\downarrow\; T^{N-k} \bar J \qquad \text{as } m \to \infty.
\]

This is possible in view of the ε-selection property of Assumption 5.1.4,
since T^{N−k} J̄ ∈ Ê(X) [cf. Assumption 5.1.3(b)]. Using Eq. (5.15), we have
\[
\begin{aligned}
J^*_N &\le \inf_{m_0} \cdots \inf_{m_{N-1}} T_{\mu^{m_0}_0} \cdots T_{\mu^{m_{N-1}}_{N-1}} \bar J \\
&= \inf_{m_0} \cdots \inf_{m_{N-2}} T_{\mu^{m_0}_0} \cdots T_{\mu^{m_{N-2}}_{N-2}} \Bigl( \inf_{m_{N-1}} T_{\mu^{m_{N-1}}_{N-1}} \bar J \Bigr) \\
&= \inf_{m_0} \cdots \inf_{m_{N-2}} T_{\mu^{m_0}_0} \cdots T_{\mu^{m_{N-2}}_{N-2}} T \bar J \\
&= \cdots \\
&= T^N \bar J,
\end{aligned}
\]
where the last equality is obtained by repeating the process used to obtain
the previous equalities. On the other hand, it is clear from the definitions
that T^N J̄ ≤ J_{N,π} for all N and π ∈ Π, so that T^N J̄ ≤ J*_N. Thus,
J*_N = T^N J̄. Q.E.D.

We also introduce the following alternative assumption, which paral-


lels Assumption 4.2.2.

Assumption 5.2.2: The k-stage optimal cost function J*_k satisfies
\[
J^*_k(x) > -\infty, \qquad \forall\ x \in X,\ k = 1, \ldots, N.
\]
Moreover, there exists a scalar α ∈ (0, ∞) such that for all scalars
r ∈ (0, ∞) and functions J ∈ E(X), we have
\[
H(x, u, J) \le H(x, u, J + r\, e) \le H(x, u, J) + \alpha\, r, \qquad \forall\ x \in X,\ u \in U(x). \tag{5.16}
\]
(5.16)

We have the following proposition whose statement and proof parallel


the ones of Prop. 4.2.4.

Proposition 5.2.2: Let Assumptions 5.1.1-5.1.4, and 5.2.2 hold. Then
J*_N = T^N J̄, and for every ε > 0, there exists an ε-optimal policy.

Proof: Note that since by assumption, J*_N(x) > −∞ for all x ∈ X, an
N-stage ε-optimal policy π_ε ∈ Π is one for which
\[
J^*_N \le J_{N,\pi_\varepsilon} \le J^*_N + \varepsilon\, e.
\]
We use induction. The result clearly holds for N = 1. Assume that
it holds for N = k, i.e., J*_k = T^k J̄ ∈ Ê(X) and for a given ε > 0, there is a
π_ε ∈ Π with J_{k,π_ε} ≤ J*_k + ε e. Using Eq. (5.16), we have for all µ ∈ M,
\[
J^*_{k+1} \le T_\mu J_{k,\pi_\varepsilon} \le T_\mu J^*_k + \alpha \varepsilon\, e.
\]
Taking the infimum over µ and then the limit as ε → 0, we obtain J*_{k+1} ≤
T J*_k. By using the induction hypothesis, it follows that J*_{k+1} ≤ T^{k+1} J̄. On
the other hand, we have clearly T^{k+1} J̄ ≤ J*_{k+1}, and hence T^{k+1} J̄ = J*_{k+1}.
Using the assumption J*_k(x) > −∞ for all x ∈ X, for any ε > 0, we
can choose π = {µ0, µ1, . . .} ∈ Π such that
\[
J_{k,\pi} \le J^*_k + \frac{\varepsilon}{2\alpha}\, e,
\]
and µ ∈ M such that
\[
T_\mu J^*_k \le T J^*_k + \frac{\varepsilon}{2}\, e.
\]
Let π′ = {µ, µ0, µ1, . . .}. Then
\[
J_{k+1,\pi'} = T_\mu J_{k,\pi} \le T_\mu J^*_k + \frac{\varepsilon}{2}\, e \le T J^*_k + \varepsilon\, e = J^*_{k+1} + \varepsilon\, e,
\]
where the first inequality is obtained by using Eq. (5.16). The induction is
complete. Q.E.D.

5.3 CONTRACTIVE MODELS

In this section, we will discuss briefly the infinite horizon problem

minimize Jπ (x)
(5.17)
subject to π ∈ Π,

where for a policy {µ0 , µ1 , . . .} ∈ Π, Jπ ∈ E(X) is defined by

\[
J_\pi(x) = \limsup_{k\to\infty}\, (T_{\mu_0} \cdots T_{\mu_k} \bar J)(x), \qquad \forall\ x \in X.
\]

We analyze this problem under a contraction assumption, similar to


Chapter 2. To this end, we introduce a function v : X ↦ ℜ with

v(x) > 0, ∀ x ∈ X,

and we consider the weighted sup-norm


\[
\|J\| = \sup_{x\in X} \frac{\bigl|J(x)\bigr|}{v(x)}
\]

on B(X), the space of real-valued functions J on X such that J (x)/v(x)


is bounded over x ∈ X.
In addition to the Assumptions 5.1.1-5.1.4, we assume the following.

Assumption 5.3.1: (Contraction)


(a) The sets E(X) and Ê(X) are closed subsets of B(X).
(b) For some α ∈ (0, 1), we have

"Tµ J − Tµ J ! " ≤ α"J − J ! ", ∀ J, J ! ∈ E(X), µ ∈ M. (5.18)

The analysis of Chapter 2 carries through with few modifications. In


particular the following analog of the major analytical result of Chapter 2,
can be proved with an essentially identical proof. The key fact here is that
E(X) and Ê(X) are closed subsets of B(X), so the Contraction Mapping
Theorem (Prop. B.1) applies.

Proposition 5.3.1: Let Assumptions 5.1.1-5.1.4 and the contraction


Assumption 5.3.1 hold. Then:
(a) For all µ ∈ M, the mapping Tµ is a contraction mapping with
modulus α over E(X), and its unique fixed point within E(X)
is Jµ .
(b) The mapping T is a contraction mapping with modulus α over
Ê(X), and its unique fixed point within Ê(X) is equal to J * .
(c) For any J ∈ E(X) and µ ∈ M, we have

lim Tµk J = Jµ .
k→∞

(d) For any J ∈ Ê(X), we have

lim T k J = J * .
k→∞

(e) We have Tµ J * = T J * if and only if Jµ = J * .


(f) For every ε > 0 there exists an ε-optimal policy within M.

Proof: As in Section 1.2, Tµ is a contraction with modulus α over E(X). Similarly, T is a contraction with modulus α over Ê(X). Parts (a), (b), (c), and (d) follow from Prop. B.1 of Appendix B.
To show part (e), note that if Tµ J* = T J*, then in view of T J* = J*, we have Tµ J* = J*, which implies that J* = Jµ, since Jµ is the unique fixed point of Tµ. Conversely, if Jµ = J*, we have

Tµ J* = Tµ Jµ = Jµ = J* = T J*.

Part (f) follows similarly, using the proof of Prop. 2.1.2. Q.E.D.
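
As an informal illustration of parts (b) and (d) of Prop. 5.3.1 in the simplest setting, where measurability plays no role, the following sketch runs value iteration T^k J → J* for a finite-state discounted problem, an instance of the abstract mapping with H(x, u, J) = g(x, u) + α Σ_y p(y | x, u) J(y). The data (g, P, α) below are hypothetical and chosen only for illustration.

```python
import numpy as np

# Minimal sketch: value iteration T^k J -> J* for a finite discounted problem,
# an instance of H(x, u, J) = g(x, u) + alpha * sum_y p(y | x, u) J(y).
# All problem data are hypothetical.
alpha = 0.9
g = np.array([[1.0, 2.0],                              # g[x, u]
              [0.5, 1.5],
              [2.0, 0.2]])
P = np.array([[[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]],      # P[x, u, y]
              [[0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],
              [[0.5, 0.2, 0.3], [0.1, 0.1, 0.8]]])

def T(J):
    """Bellman mapping (T J)(x) = min_u [ g(x, u) + alpha * sum_y P[x, u, y] J(y) ]."""
    return np.min(g + alpha * P @ J, axis=1)

J = np.zeros(3)                                        # J-bar = 0
for k in range(200):
    J_new = T(J)
    if np.max(np.abs(J_new - J)) < 1e-12:              # sup-norm (v = 1) stopping test
        break
    J = J_new

print("fixed point J* ~", J_new, "after", k + 1, "iterations")
print("T J* - J* =", T(J_new) - J_new)                 # ~ 0, as in part (b)
```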

5.4 BOREL SPACE MODELS

We will now apply the preceding analysis to models where the set of policies
is restricted in order to address measurability issues. The Borel space
model is the most general such model, and we will focus on it. Appendix
C provides a motivation and an outline of the model for finite horizon
problems, including the associated mathematical definitions, some basic
results, and a two-stage example. In this section we will provide a brief
discussion of an infinite horizon contractive model.
We consider the mapping H defined by

H(x, u, J) = g(x, u) + α ∫ J(y) p(dy | x, u).     (5.19)

Here X and U are Borel spaces, α is a scalar in (0, 1], J is an extended real-valued function on X, and p(dy | x, u) is a transition probability measure for each x and u ∈ U(x). To make mathematical sense of the expression in the right-hand side of Eq. (5.19), J must satisfy certain measurability restrictions, so we assume that g is Borel measurable and that p(dy | x, u) is a Borel measurable stochastic kernel. We let

E(X) = the subset of universally measurable functions from E(X).

Then the integral in Eq. (5.19) is well-defined as an extended real number for every (x, u) for all J ∈ E(X) (recall that in our integration framework we allow the sum ∞ − ∞ to appear and interpret it as ∞).
A requirement of the framework of this chapter is that Tµ J must belong to E(X) for each J ∈ E(X). For this it is sufficient that g be Borel measurable as a function of (x, u), and the policy µ be a universally measurable function of x. However, as noted earlier, universal measurability of H [as a function of (x, u) for fixed J] is insufficient to guarantee that T J ∈ E(X). We thus let

Ê(X) = the subset of lower semianalytic functions from E(X).

Then assuming that J ∈ Ê(X), that g(x, u) and p(dy | x, u) are Borel measurable, and that the set

{(x, u) | u ∈ U(x)}

is Borel measurable, we have that the function H(·, ·, J) is lower semianalytic and T J ∈ Ê(X), as discussed in Appendix C (cf. Prop. C.4).
Note that our framework requires that ε-optimal selection is possible, i.e., that for every ε > 0, there exists a universally measurable µ such that the conditions of Assumption 5.1.4 are satisfied. This is ensured under our assumptions by the selection theorem of Prop. C.5 in Appendix C.
We now consider a contractive infinite horizon problem, which is
based on the mapping H defined by Eq. (5.19), where α ∈ (0, 1) is a dis-
count factor, and the preceding conditions hold. In addition g is assumed
to be not only Borel measurable, but also bounded above and below, in
which case we obtain a contractive model in the space of bounded functions
B(X) with respect to the unweighted sup-norm. It is important to note
that E(X) and Ê(X) are closed subsets of B(X), since the pointwise limit
of universally measurable and lower semianalytic functions are universally
measurable and lower semianalytic, respectively (see [BeS78]).
Thus Prop. 5.3.1 applies and provides the basic analytical results for
contractive Borel space models, which are:
(a) J * is the unique fixed point of T within Ê(X).
(b) For every J ∈ Ê(X), T k J ∈ Ê(X) for all k, and we have T k J → J * .
(c) There exists a universally measurable optimal policy if and only if the
infimum of H(x, u, J * ) over u is attained for each x ∈ X.
(d) For any ε > 0, there exists an ε-optimal universally measurable policy.
For a detailed discussion and proofs of these results, we refer to [BeS78].

5.5 NOTES, SOURCES, AND EXERCISES

The restricted model framework of this chapter was treated briefly in the
book [BeS78] (Chapter 6). This book focused far more extensively on the
classical type of stochastic optimal control problems (cf. Example 1.2.1),
rather than the more general abstract restricted model case.
The restricted policies framework may also be applied to the so-called
semicontinuous models similar to how it was applied to Borel models in Sec-
tion 5.4. The semicontinuous models provide more powerful results regard-
ing the character of the cost functions, but require additional assumptions,
which may be restrictive, namely that the cost function g and the stochastic
kernel p in Eq. (5.19) have certain upper or lower semicontinuity properties.
The relevant mathematical background is given in Section 7.5 of [BeS78],

and the critical selection theorems (with Borel measurable selection) are
given in Props. 7.33 and 7.34 of that reference. Detailed related references
to the literature may also be found in [BeS78].
APPENDIX A:
Notation and Mathematical
Conventions
In this appendix we collect our notation, and some related mathematical
facts and conventions.

A.1 SET NOTATION AND CONVENTIONS

If X is a set and x is an element of X, we write x ∈ X. A set can be


specified in the form X = {x | x satisfies P }, as the set of all elements
satisfying property P . The union of two sets X1 and X2 is denoted by
X1 ∪ X2 , and their intersection by X1 ∩ X2 . The empty set is denoted by
Ø. The symbol ∀ means “for all.”
The set of real numbers (also referred to as scalars) is denoted by ℜ. The set of extended real numbers is denoted by ℜ*:

ℜ* = ℜ ∪ {∞, −∞}.

We write −∞ < x < ∞ for all real numbers x, and −∞ ≤ x ≤ ∞ for all extended real numbers x. We denote by [a, b] the set of (possibly extended) real numbers x satisfying a ≤ x ≤ b. A rounded, instead of square, bracket denotes strict inequality in the definition. Thus (a, b], [a, b), and (a, b) denote the set of all x satisfying a < x ≤ b, a ≤ x < b, and a < x < b, respectively.
Generally, we adopt standard conventions regarding addition and multiplication in ℜ*, except that we take

∞ − ∞ = −∞ + ∞ = ∞,


and we take the product of 0 and ∞ or −∞ to be 0. In this way the sum and product of two extended real numbers is well-defined. Division by 0 or ∞ does not appear in our analysis. In particular, we adopt the following rules in calculations involving ∞ and −∞:

α + ∞ = ∞ + α = ∞,  ∀ α ∈ ℜ*,

α − ∞ = −∞ + α = −∞,  ∀ α ∈ [−∞, ∞),

α · ∞ = ∞,  α · (−∞) = −∞,  ∀ α ∈ (0, ∞],

α · ∞ = −∞,  α · (−∞) = ∞,  ∀ α ∈ [−∞, 0),

0 · ∞ = ∞ · 0 = 0 = 0 · (−∞) = (−∞) · 0,  −(−∞) = ∞.

Under these rules, the following laws of arithmetic are still valid within ℜ*:

α1 + α2 = α2 + α1,  (α1 + α2) + α3 = α1 + (α2 + α3),

α1 α2 = α2 α1,  (α1 α2) α3 = α1 (α2 α3).

We also have

α(α1 + α2) = αα1 + αα2

if either α ≥ 0 or else (α1 + α2) is not of the form ∞ − ∞.
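
For readers who implement these conventions numerically, note that IEEE floating-point arithmetic does not follow them (inf − inf and 0 · inf both evaluate to NaN), so the special cases must be handled explicitly. The following is a minimal sketch; the helper names are our own.

```python
import math

INF = math.inf

def ext_add(a, b):
    """Extended-real addition with the convention inf - inf = -inf + inf = inf."""
    if (a == INF and b == -INF) or (a == -INF and b == INF):
        return INF
    return a + b

def ext_mul(a, b):
    """Extended-real multiplication with the convention 0 * (+-inf) = 0."""
    if a == 0 or b == 0:
        return 0.0
    return a * b

# Plain float arithmetic would return nan in both of these cases.
assert ext_add(INF, -INF) == INF
assert ext_mul(0.0, -INF) == 0.0
```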

Inf and Sup Notation

The supremum of a nonempty set X ⊂ ℜ*, denoted by sup X, is defined as the smallest y ∈ ℜ* such that y ≥ x for all x ∈ X. Similarly, the infimum of X, denoted by inf X, is defined as the largest y ∈ ℜ* such that y ≤ x for all x ∈ X. For the empty set, we use the convention

sup Ø = −∞,  inf Ø = ∞.

If sup X is equal to an x ∈ ℜ* that belongs to the set X, we say that x is the maximum point of X and we write x = max X. Similarly, if inf X is equal to an x ∈ ℜ* that belongs to the set X, we say that x is the minimum point of X and we write x = min X. Thus, when we write
max X (or min X) in place of sup X (or inf X, respectively), we do so just
for emphasis: we indicate that it is either evident, or it is known through
earlier analysis, or it is about to be shown that the maximum (or minimum,
respectively) of the set X is attained at one of its points.

A.2 FUNCTIONS

If f is a function, we use the notation f : X ↦ Y to indicate the fact that f is defined on a nonempty set X (its domain) and takes values in a set Y (its range). Thus when using the notation f : X ↦ Y, we implicitly assume that X is nonempty. We will often use the unit function e : X ↦ ℜ, defined by

e(x) = 1,  ∀ x ∈ X.

Given a set X, we denote by R(X) the set of real-valued functions J : X ↦ ℜ, and by E(X) the set of all extended real-valued functions J : X ↦ ℜ*. For any collection {Jγ | γ ∈ Γ} ⊂ E(X), parameterized by the elements of a set Γ, we denote by inf_{γ∈Γ} Jγ the function taking the value inf_{γ∈Γ} Jγ(x) at each x ∈ X.
For two functions J1, J2 ∈ E(X), we use the shorthand notation J1 ≤ J2 to indicate the pointwise inequality

J1(x) ≤ J2(x),  ∀ x ∈ X.

We use the shorthand notation inf_{i∈I} Ji to denote the function obtained by pointwise infimum of a collection {Ji | i ∈ I} ⊂ E(X), i.e.,

(inf_{i∈I} Ji)(x) = inf_{i∈I} Ji(x),  ∀ x ∈ X.

We use similar notation for sup.
Given subsets S1, S2, S3 ⊂ E(X) and mappings T1 : S1 ↦ S3 and T2 : S2 ↦ S1, the composition of T1 and T2 is the mapping T1T2 : S2 ↦ S3 defined by

(T1T2 J)(x) = (T1(T2 J))(x),  ∀ J ∈ S2, x ∈ X.

In particular, given a subset S ⊂ E(X) and mappings T1 : S ↦ S and T2 : S ↦ S, the composition of T1 and T2 is the mapping T1T2 : S ↦ S defined by

(T1T2 J)(x) = (T1(T2 J))(x),  ∀ J ∈ S, x ∈ X.

Similarly, given mappings Tk : S ↦ S, k = 1, . . . , N, their composition is the mapping (T1 · · · TN) : S ↦ S defined by

(T1T2 · · · TN J)(x) = (T1(T2(· · · (TN J))))(x),  ∀ J ∈ S, x ∈ X.

In our notation involving compositions we minimize the use of parentheses, as long as clarity is not compromised. In particular, we write T1T2 J instead of (T1T2 J) or (T1T2)J or T1(T2 J), but we write (T1T2 J)(x) to indicate the value of T1T2 J at x ∈ X.
If X and Y are nonempty sets, a mapping T : S1 ↦ S2, where S1 ⊂ E(X) and S2 ⊂ E(Y), is said to be monotone if for all J, J' ∈ S1,

J ≤ J'   ⇒   T J ≤ T J'.

Sequences of Functions

For a sequence of functions {Jk} ⊂ E(X) that converges pointwise, we denote by lim_{k→∞} Jk the pointwise limit of {Jk}. We denote by lim sup_{k→∞} Jk (or lim inf_{k→∞} Jk) the pointwise limit superior (or inferior, respectively) of {Jk}. If {Jk} ⊂ E(X) converges pointwise to J, we write Jk → J. Note that we reserve this notation for pointwise convergence. To denote convergence with respect to a norm ‖ · ‖, we write ‖Jk − J‖ → 0.
A sequence of functions {Jk} ⊂ E(X) is said to be monotonically nonincreasing (or monotonically nondecreasing) if Jk+1 ≤ Jk for all k (or Jk+1 ≥ Jk for all k, respectively). Such a sequence always has a (pointwise) limit within E(X). We write Jk ↓ J (or Jk ↑ J) to indicate that {Jk} is monotonically nonincreasing (or monotonically nondecreasing, respectively) and that its limit is J.
Let {Jmn} ⊂ E(X) be a double indexed sequence, which is monotonically nonincreasing separately for each index in the sense that

J(m+1)n ≤ Jmn,  Jm(n+1) ≤ Jmn,  ∀ m, n = 0, 1, . . . .

For such sequences, a useful fact is that

lim_{m→∞} ( lim_{n→∞} Jmn ) = lim_{m→∞} Jmm.

There is a similar fact for monotonically nondecreasing sequences.

Expected Values

Given a random variable w defined over a probability space Ω, the expected


value of w is defined by

E{w} = E{w+ } + E{w− },

where w+ and w− are the positive and negative parts of w,

w+(ω) = max{0, w(ω)},  w−(ω) = min{0, w(ω)}.

In this way, taking also into account the rule ∞ − ∞ = ∞, the expected
value E{w} is well-defined if Ω is finite or countably infinite. In more gen-
eral cases, E{w} is similarly defined by the appropriate form of integration,
as will be discussed in more detail at specific points as needed.
APPENDIX B:
Contraction Mappings

B.1 CONTRACTION MAPPING FIXED POINT THEOREMS

The purpose of this appendix is to provide some background on contraction mappings and their properties. Let Y be a real vector space with a norm ‖ · ‖, i.e., a real-valued function satisfying for all y ∈ Y, ‖y‖ ≥ 0, ‖y‖ = 0 if and only if y = 0, and

‖ay‖ = |a| ‖y‖,  ∀ a ∈ ℜ,    ‖y + z‖ ≤ ‖y‖ + ‖z‖,  ∀ y, z ∈ Y.

Let Ȳ be a closed subset of Y. A function F : Ȳ ↦ Ȳ is said to be a contraction mapping if for some ρ ∈ (0, 1), we have

‖F y − F z‖ ≤ ρ ‖y − z‖,  ∀ y, z ∈ Ȳ.

The scalar ρ is called the modulus of contraction of F.

Example B.1 (Linear Contraction Mappings in ℜn)

Consider the case of a linear mapping F : ℜn ↦ ℜn of the form

F y = b + Ay,

where A is an n × n matrix and b is a vector in ℜn. Let σ(A) denote the spectral radius of A (the largest modulus among the moduli of the eigenvalues of A). Then it can be shown that A is a contraction mapping with respect to some norm if and only if σ(A) < 1.
Specifically, given ε > 0, there exists a norm ‖ · ‖s such that

‖Ay‖s ≤ (σ(A) + ε) ‖y‖s,  ∀ y ∈ ℜn.     (B.1)

Thus, if σ(A) < 1 we may select ε > 0 such that ρ = σ(A) + ε < 1, and obtain the contraction relation

‖F y − F z‖s = ‖A(y − z)‖s ≤ ρ ‖y − z‖s,  ∀ y, z ∈ ℜn.     (B.2)

The norm ‖ · ‖s can be taken to be a weighted Euclidean norm, i.e., it may have the form ‖y‖s = ‖My‖, where M is a square invertible matrix, and ‖ · ‖ is the standard Euclidean norm, i.e., ‖x‖ = √(x'x). †
Conversely, if Eq. (B.2) holds for some norm ‖ · ‖s and all real vectors y, z, it also holds for all complex vectors y, z, with the squared norm ‖c‖s² of a complex vector c defined as the sum of the squares of the norms of the real and the imaginary components. Thus from Eq. (B.2), by taking y − z = u, where u is an eigenvector corresponding to an eigenvalue λ with |λ| = σ(A), we have σ(A) ‖u‖s = ‖Au‖s ≤ ρ ‖u‖s. Hence σ(A) ≤ ρ, and it follows that if F is a contraction with respect to a given norm, we must have σ(A) < 1.
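
The following small numerical sketch illustrates the point of the example: the (hypothetical) matrix A below has σ(A) < 1 but Euclidean operator norm larger than 1, so F y = b + Ay is not a contraction in the standard Euclidean norm, yet its iterates still converge to the unique fixed point, consistently with Eqs. (B.1)-(B.2) for a suitable norm ‖ · ‖s.

```python
import numpy as np

# Hypothetical 2x2 example: sigma(A) = 0.5 although the Euclidean operator norm
# of A exceeds 1, so F y = b + A y contracts only in a suitably weighted norm.
A = np.array([[0.5, 2.0],
              [0.0, 0.5]])
b = np.array([1.0, 1.0])

rho = max(abs(np.linalg.eigvals(A)))      # spectral radius sigma(A)
op2 = np.linalg.norm(A, 2)                # Euclidean operator norm (> 1 here)
print("sigma(A) =", rho, " ||A||_2 =", op2)

# Iterates y_{k+1} = b + A y_k converge to the unique fixed point y* = (I - A)^{-1} b.
y_star = np.linalg.solve(np.eye(2) - A, b)
y = np.array([100.0, -50.0])
for _ in range(200):
    y = b + A @ y
print("max |y - y*| after iteration:", np.max(np.abs(y - y_star)))
```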

A sequence {yk} ⊂ Y is said to be a Cauchy sequence if ‖ym − yn‖ → 0 as m, n → ∞, i.e., given any ε > 0, there exists N such that ‖ym − yn‖ ≤ ε for all m, n ≥ N. The space Y is said to be complete under the norm ‖ · ‖ if every Cauchy sequence {yk} ⊂ Y is convergent, in the sense that for some ȳ ∈ Y, we have ‖yk − ȳ‖ → 0. Note that a Cauchy sequence is always bounded. Also, a Cauchy sequence of real numbers is convergent, implying that the real line is a complete space and so is every real finite-dimensional vector space. On the other hand, an infinite dimensional space may not be complete under some norms, while it may be complete under other norms.
When Y is complete and Ȳ is a closed subset of Y, an important property of a contraction mapping F : Ȳ ↦ Ȳ is that it has a unique fixed point within Ȳ, i.e., the equation

y = F y

has a unique solution y* ∈ Ȳ, called the fixed point of F. Furthermore, the sequence {yk} generated by the iteration

yk+1 = F yk

† We may show Eq. (B.1) by using the Jordan canonical form of A, which is denoted by J. In particular, if P is a nonsingular matrix such that P^{-1}AP = J and D is the diagonal matrix with 1, δ, . . . , δ^{n−1} along the diagonal, where δ > 0, it is straightforward to verify that D^{-1}P^{-1}APD = Ĵ, where Ĵ is the matrix that is identical to J except that each nonzero off-diagonal term is replaced by δ. Defining P̂ = PD, we have A = P̂ Ĵ P̂^{-1}. Now if ‖ · ‖ is the standard Euclidean norm, we note that for some β > 0, we have ‖Ĵz‖ ≤ (σ(A) + βδ) ‖z‖ for all z ∈ ℜn and δ ∈ (0, 1]. For a given δ ∈ (0, 1], consider the weighted Euclidean norm ‖ · ‖s defined by ‖y‖s = ‖P̂^{-1}y‖. Then we have for all y ∈ ℜn,

‖Ay‖s = ‖P̂^{-1}Ay‖ = ‖P̂^{-1}P̂ Ĵ P̂^{-1}y‖ = ‖Ĵ P̂^{-1}y‖ ≤ (σ(A) + βδ) ‖P̂^{-1}y‖,

so that ‖Ay‖s ≤ (σ(A) + βδ) ‖y‖s, for all y ∈ ℜn. For a given ε > 0, we choose δ = ε/β, so the preceding relation yields Eq. (B.1).

converges to y ∗ , starting from an arbitrary initial point y0 .

Proposition B.1: (Contraction Mapping Fixed-Point Theorem) Let Y be a complete vector space and let Ȳ be a closed subset of Y. Then if F : Ȳ ↦ Ȳ is a contraction mapping with modulus ρ ∈ (0, 1), there exists a unique y* ∈ Ȳ such that

y* = F y*.

Furthermore, the sequence {F^k y} converges to y* for any y ∈ Ȳ, and we have

‖F^k y − y*‖ ≤ ρ^k ‖y − y*‖,  k = 1, 2, . . . .

Proof: Let y ∈ Ȳ and consider the iteration yk+1 = F yk starting with y0 = y. By the contraction property of F,

‖yk+1 − yk‖ ≤ ρ ‖yk − yk−1‖,  k = 1, 2, . . . ,

which implies that

‖yk+1 − yk‖ ≤ ρ^k ‖y1 − y0‖,  k = 1, 2, . . . .

It follows that for every k ≥ 0 and m ≥ 1, we have

‖yk+m − yk‖ ≤ Σ_{i=1}^{m} ‖yk+i − yk+i−1‖
           ≤ ρ^k (1 + ρ + · · · + ρ^{m−1}) ‖y1 − y0‖
           ≤ (ρ^k / (1 − ρ)) ‖y1 − y0‖.

Therefore, {yk} is a Cauchy sequence in Ȳ and must converge to a limit y* ∈ Ȳ, since Y is complete and Ȳ is closed. We have for all k ≥ 1,

‖F y* − y*‖ ≤ ‖F y* − yk‖ + ‖yk − y*‖ ≤ ρ ‖y* − yk−1‖ + ‖yk − y*‖,

and since yk converges to y*, we obtain F y* = y*. Thus, the limit y* of yk is a fixed point of F. It is a unique fixed point because if ỹ were another fixed point, we would have

‖y* − ỹ‖ = ‖F y* − F ỹ‖ ≤ ρ ‖y* − ỹ‖,

which implies that y* = ỹ.
To show the convergence rate bound of the last part, note that

‖F^k y − y*‖ = ‖F^k y − F y*‖ ≤ ρ ‖F^{k−1} y − y*‖.

Repeating this process for a total of k times, we obtain the desired result. Q.E.D.

The convergence rate exhibited by F^k y in the preceding proposition is said to be geometric, and F^k y is said to converge to its limit y* geometrically. This is in reference to the fact that the error ‖F^k y − y*‖ converges to 0 faster than some geometric progression (ρ^k ‖y − y*‖ in this case).
In some contexts of interest to us one may encounter mappings that are not contractions, but become contractions when iterated a finite number of times. In this case, one may use a slightly different version of the contraction mapping fixed point theorem, which we now present.
We say that a function F : Ȳ ↦ Ȳ is an m-stage contraction mapping if there exists a positive integer m and some ρ < 1 such that

‖F^m y − F^m y'‖ ≤ ρ ‖y − y'‖,  ∀ y, y' ∈ Ȳ,

where F^m denotes the composition of F with itself m times. Thus, F is an m-stage contraction if F^m is a contraction. Again, the scalar ρ is called the modulus of contraction. We have the following generalization of Prop. B.1.

Proposition B.2: (m-Stage Contraction Mapping Fixed-Point Theorem) Let Y be a complete vector space and let Ȳ be a closed subset of Y. Then if F : Ȳ ↦ Ȳ is an m-stage contraction mapping with modulus ρ ∈ (0, 1), there exists a unique y* ∈ Ȳ such that

y* = F y*.

Furthermore, {F^k y} converges to y* for any y ∈ Ȳ.

Proof: Since F^m maps Ȳ into Ȳ and is a contraction mapping, by Prop. B.1, it has a unique fixed point in Ȳ, denoted y*. Applying F to both sides of the relation y* = F^m y*, we see that F y* is also a fixed point of F^m, so by the uniqueness of the fixed point, we have y* = F y*. Therefore y* is a fixed point of F. If F had another fixed point, say ỹ, then we would have ỹ = F^m ỹ, which by the uniqueness of the fixed point of F^m implies that ỹ = y*. Thus, y* is the unique fixed point of F.
To show the convergence of {F^k y}, note that by Prop. B.1, we have for all y ∈ Ȳ,

lim_{k→∞} ‖F^{mk} y − y*‖ = 0.

Using F^ℓ y in place of y, we obtain

lim_{k→∞} ‖F^{mk+ℓ} y − y*‖ = 0,  ℓ = 0, 1, . . . , m − 1,

which proves the desired result. Q.E.D.

B.2 WEIGHTED SUP-NORM CONTRACTIONS

In this section, we will focus on contraction mappings within a specialized context that is particularly important in DP. Let X be a set (typically the state space in DP), and let v : X ↦ ℜ be a positive-valued function,

v(x) > 0,  ∀ x ∈ X.

Let B(X) denote the set of all functions J : X ↦ ℜ such that J(x)/v(x) is bounded as x ranges over X. We define a norm on B(X), called the weighted sup-norm, by

‖J‖ = sup_{x∈X} |J(x)| / v(x).     (B.3)

It is easily verified that ‖ · ‖ thus defined has the required properties for being a norm. Furthermore, B(X) is complete under this norm. To see this, consider a Cauchy sequence {Jk} ⊂ B(X), and note that ‖Jm − Jn‖ → 0 as m, n → ∞ implies that for all x ∈ X, {Jk(x)} is a Cauchy sequence of real numbers, so it converges to some J*(x). We will show that J* ∈ B(X) and that ‖Jk − J*‖ → 0. To this end, it will be sufficient to show that given any ε > 0, there exists a K such that

|Jk(x) − J*(x)| / v(x) ≤ ε,  ∀ x ∈ X, k ≥ K.

This will imply that

sup_{x∈X} |J*(x)| / v(x) ≤ ε + ‖Jk‖,  ∀ k ≥ K,

so that J* ∈ B(X), and will also imply that ‖Jk − J*‖ ≤ ε, so that ‖Jk − J*‖ → 0. Assume the contrary, i.e., that there exist an ε > 0 and a subsequence {x_{m1}, x_{m2}, . . .} ⊂ X such that mi < mi+1 and

ε < |J_{mi}(x_{mi}) − J*(x_{mi})| / v(x_{mi}),  ∀ i ≥ 1.

The right-hand side above is less than or equal to

|J_{mi}(x_{mi}) − Jn(x_{mi})| / v(x_{mi}) + |Jn(x_{mi}) − J*(x_{mi})| / v(x_{mi}),  ∀ n ≥ 1, i ≥ 1.

The first term in the above sum is less than ε/2 for i and n larger than some threshold; fixing i and letting n be sufficiently large, the second term can also be made less than ε/2, so the sum is made less than ε, a contradiction. In conclusion, the space B(X) is complete, so the fixed point results of Props. B.1 and B.2 apply.
In our discussions, we will always assume that B(X) is equipped with the weighted sup-norm above, where the weight function v will be clear from the context. There will be frequent occasions where the norm will be unweighted, i.e., v(x) ≡ 1 and ‖J‖ = sup_{x∈X} |J(x)|, in which case we will explicitly state so.

Finite-Dimensional Cases

Let us now focus on the finite-dimensional case X = {1, . . . , n}, in which case R(X) and B(X) can be identified with ℜn. Consider a linear mapping F : ℜn ↦ ℜn of the form

F y = b + Ay,

where A is an n × n matrix with components aij, and b is a vector in ℜn (cf. Example B.1). Then it can be shown (see the following proposition) that F is a contraction with respect to the weighted sup-norm ‖y‖ = max_{i=1,...,n} |yi|/v(i) if and only if

Σ_{j=1}^{n} |aij| v(j) / v(i) < 1,  i = 1, . . . , n.

Let us also denote by |A| the matrix whose components are the absolute values of the components of A and let σ(|A|) denote the spectral radius of |A|. Then it can be shown that F is a contraction with respect to some weighted sup-norm if and only if σ(|A|) < 1. A proof of this may be found in [BeT89], Ch. 2, Cor. 6.2. Thus any substochastic matrix P (pij ≥ 0 for all i, j, and Σ_{j=1}^{n} pij ≤ 1 for all i) is a contraction with respect to some weighted sup-norm if and only if σ(P) < 1.
Finally, let us consider a nonlinear mapping F : ℜn ↦ ℜn that has the property

|F y − F z| ≤ P |y − z|,  ∀ y, z ∈ ℜn,

for some matrix P with nonnegative components and σ(P) < 1. Here, we generically denote by |w| the vector whose components are the absolute values of the components of w, and the inequality is componentwise. Then we claim that F is a contraction with respect to some weighted sup-norm. To see this note that by the preceding discussion, P is a contraction with respect to some weighted sup-norm ‖y‖ = max_{i=1,...,n} |yi|/v(i), and we have

(|F y − F z|)(i) / v(i) ≤ (P |y − z|)(i) / v(i) ≤ α ‖y − z‖,  ∀ i = 1, . . . , n,

for some α ∈ (0, 1), where (|F y − F z|)(i) and (P |y − z|)(i) are the ith components of the vectors |F y − F z| and P |y − z|, respectively. Thus, F is a contraction with respect to ‖ · ‖. For additional discussion of linear and nonlinear contraction mapping properties and characterizations such as the one above, see the book [OrR70].
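
As a quick numerical check of the weighted sup-norm criterion above, the sketch below evaluates max_i Σ_j |aij| v(j)/v(i) for a hypothetical substochastic matrix. The weight vector v = (I − P)^{-1}e used to certify the contraction is our own choice (it satisfies P v = v − e, so the modulus is 1 − 1/max_i v(i) < 1); it is not part of the text.

```python
import numpy as np

def weighted_supnorm_modulus(A, v):
    """max_i sum_j |a_ij| v(j) / v(i): F y = b + A y is a contraction of this
    modulus in the weighted sup-norm ||y|| = max_i |y_i| / v(i) iff it is < 1."""
    return np.max((np.abs(A) @ v) / v)

# Hypothetical substochastic matrix: nonnegative, every row sums to at most 1.
P = np.array([[0.3, 0.6, 0.0],
              [0.2, 0.0, 0.8],
              [0.0, 0.5, 0.5]])

print(weighted_supnorm_modulus(P, np.ones(3)))   # = 1.0: no contraction for v = 1

# Since sigma(P) < 1 here, v = (I - P)^{-1} e > 0 satisfies P v = v - e, so the
# weighted sup-norm modulus equals 1 - 1/max_i v(i) < 1.
v = np.linalg.solve(np.eye(3) - P, np.ones(3))
print(weighted_supnorm_modulus(P, v))            # < 1: contraction certified
```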

Linear Mappings on Countable Spaces

The case where X is countable (or, as a special case, finite) is frequently


encountered in DP. The following proposition provides some useful criteria
for verifying the contraction property of mappings that are either linear or
are obtained via a parametric minimization of other contraction mappings.

Proposition B.3: Let X = {1, 2, . . .}.

(a) Let F : B(X) ↦ B(X) be a linear mapping of the form

(F J)(i) = bi + Σ_{j∈X} aij J(j),  i ∈ X,

where bi and aij are some scalars. Then F is a contraction with modulus ρ with respect to the weighted sup-norm (B.3) if and only if

Σ_{j∈X} |aij| v(j) / v(i) ≤ ρ,  i ∈ X.     (B.4)

(b) Let F : B(X) ↦ B(X) be a mapping of the form

(F J)(i) = inf_{µ∈M} (Fµ J)(i),  i ∈ X,

where M is a parameter set, and for each µ ∈ M, Fµ is a contraction mapping from B(X) to B(X) with modulus ρ. Then F is a contraction mapping with modulus ρ.

Proof: (a) Assume that Eq. (B.4) holds. For any J, J' ∈ B(X), we have

‖F J − F J'‖ = sup_{i∈X} | Σ_{j∈X} aij (J(j) − J'(j)) | / v(i)
            ≤ sup_{i∈X} Σ_{j∈X} |aij| v(j) ( |J(j) − J'(j)| / v(j) ) / v(i)
            ≤ sup_{i∈X} ( Σ_{j∈X} |aij| v(j) / v(i) ) ‖J − J'‖
            ≤ ρ ‖J − J'‖,

where the last inequality follows from the hypothesis.
Conversely, arguing by contradiction, let's assume that Eq. (B.4) is violated for some i ∈ X. Define J(j) = v(j) sgn(aij) and J'(j) = 0 for all j ∈ X. Then we have ‖J − J'‖ = ‖J‖ = 1, and

|(F J)(i) − (F J')(i)| / v(i) = Σ_{j∈X} |aij| v(j) / v(i) > ρ = ρ ‖J − J'‖,

showing that F is not a contraction of modulus ρ.
(b) Since Fµ is a contraction of modulus ρ, we have for any J, J' ∈ B(X),

(Fµ J)(i) / v(i) ≤ (Fµ J')(i) / v(i) + ρ ‖J − J'‖,  i ∈ X,

so by taking the infimum over µ ∈ M,

(F J)(i) / v(i) ≤ (F J')(i) / v(i) + ρ ‖J − J'‖,  i ∈ X.

Reversing the roles of J and J', we obtain

|(F J)(i) − (F J')(i)| / v(i) ≤ ρ ‖J − J'‖,  i ∈ X,

and by taking the supremum over i, the contraction property of F is proved. Q.E.D.

The preceding proposition assumes that F J ∈ B(X) for all J ∈ B(X).


The following proposition provides conditions, particularly relevant to the
DP context, which imply this assumption.

Proposition B.4: Let X = {1, 2, . . .}, let M be a parameter set, and for each µ ∈ M, let Fµ be a linear mapping of the form

(Fµ J)(i) = bi(µ) + Σ_{j∈X} aij(µ) J(j),  i ∈ X.

(a) We have Fµ J ∈ B(X) for all J ∈ B(X) provided b(µ) ∈ B(X) and V(µ) ∈ B(X), where

b(µ) = (b1(µ), b2(µ), . . .),   V(µ) = (V1(µ), V2(µ), . . .),

with

Vi(µ) = Σ_{j∈X} |aij(µ)| v(j),  i ∈ X.

(b) Consider the mapping F

(F J)(i) = inf_{µ∈M} (Fµ J)(i),  i ∈ X.

We have F J ∈ B(X) for all J ∈ B(X), provided b ∈ B(X) and V ∈ B(X), where

b = (b1, b2, . . .),   V = (V1, V2, . . .),

with bi = sup_{µ∈M} bi(µ) and Vi = sup_{µ∈M} Vi(µ).

Proof: (a) For all µ ∈ M, J ∈ B(X) and i ∈ X, we have

(Fµ J)(i) ≤ |bi(µ)| + Σ_{j∈X} |aij(µ)| ( |J(j)| / v(j) ) v(j)
         ≤ |bi(µ)| + ‖J‖ Σ_{j∈X} |aij(µ)| v(j)
         = |bi(µ)| + ‖J‖ Vi(µ),

and similarly (Fµ J)(i) ≥ −|bi(µ)| − ‖J‖ Vi(µ). Thus

|(Fµ J)(i)| ≤ |bi(µ)| + ‖J‖ Vi(µ),  i ∈ X.

By dividing this inequality with v(i) and by taking the supremum over i ∈ X, we obtain

‖Fµ J‖ ≤ ‖b(µ)‖ + ‖J‖ ‖V(µ)‖ < ∞.

(b) By doing the same as in (a), but after first taking the infimum of (Fµ J)(i) over µ, we obtain

‖F J‖ ≤ ‖b‖ + ‖J‖ ‖V‖ < ∞.

Q.E.D.
APPENDIX C:
Measure Theoretic Issues
A general theory of stochastic dynamic programming must deal with the
formidable mathematical questions that arise from the presence of uncount-
able probability spaces. The purpose of this appendix is to motivate the
theory and to provide some mathematical background to the extent needed
for the development of Chapter 5. The research monograph by Bertsekas
and Shreve [BeS78] (freely available from the internet), contains a detailed
development of mathematical background and terminology on Borel spaces
and related subjects. We will explore here the main questions by means
of a simple two-stage example described in Section C.1. In Section C.2,
we develop a framework, based on universally measurable policies, for the
rigorous mathematical development of the standard DP results for this
example and for more general finite horizon models.

C.1 A TWO-STAGE EXAMPLE

Suppose that the initial state x0 is a point on the real line ℜ. Knowing x0, we must choose a control u0 ∈ ℜ. Then the new state x1 is generated according to a transition probability measure p(dx1 | x0, u0) on the Borel σ-algebra of ℜ (the one generated by the open sets of ℜ). Then, knowing x1, we must choose a control u1 ∈ ℜ and incur a cost g(x1, u1), where g is a real-valued function that is bounded either above or below. Thus a cost is incurred only at the second stage.
A policy π = {µ0, µ1} is a pair of functions from state to control, i.e., if π is employed and x0 is the initial state, then u0 = µ0(x0), and if x1 is the subsequent state, then u1 = µ1(x1). The expected value of the cost corresponding to π when x0 is the initial state is given by

Jπ(x0) = ∫ g(x1, µ1(x1)) p(dx1 | x0, µ0(x0)).     (C.1)

We wish to find π to minimize Jπ (x0 ).


To formulate the problem properly, we must insure that the integral
in Eq. (C.1) is defined. Various sufficient conditions can be used for this;
for example it is sufficient that g, µ0 , and µ1 be Borel measurable, and
that p(B | x0 , u0 ) is a Borel measurable function of (x0 , u0 ) for every Borel
set B (see [BeS78]). However, our aim in this example is to discuss the
necessary measure theoretic framework not only for the cost Jπ (x0 ) to be
defined, but also for the major DP-related results to hold. We thus leave
unspecified for the moment the assumptions on the problem data and the
measurability restrictions on the policy π.
The optimal cost is

J*(x0) = inf_π Jπ(x0),

where the infimum is over all policies π = {µ0, µ1} such that µ0 and µ1 are measurable functions from ℜ to ℜ with respect to σ-algebras to be specified later. Given ε > 0, a policy π is ε-optimal if

Jπ(x0) ≤ J*(x0) + ε,  ∀ x0 ∈ ℜ.

A policy π is optimal if

Jπ(x0) = J*(x0),  ∀ x0 ∈ ℜ.

The DP Algorithm

The DP algorithm for the preceding two-stage problem takes the form

J1(x1) = inf_{u1∈ℜ} g(x1, u1),  ∀ x1 ∈ ℜ,     (C.2)

J0(x0) = inf_{u0∈ℜ} ∫ J1(x1) p(dx1 | x0, u0),  ∀ x0 ∈ ℜ,     (C.3)

and assuming that

J0(x0) > −∞, ∀ x0 ∈ ℜ,    J1(x1) > −∞, ∀ x1 ∈ ℜ,

the results we expect to be able to prove are:

R.1: There holds

J*(x0) = J0(x0),  ∀ x0 ∈ ℜ.

R.2: Given any ε > 0, there is an ε-optimal policy.

R.3: If µ1*(x1) and µ0*(x0) attain the infimum in the DP algorithm (C.2), (C.3) for all x1 ∈ ℜ and x0 ∈ ℜ, respectively, then π* = {µ0*, µ1*} is optimal.

We will see that to establish these results, we will need to address two main issues:
(1) The cost function Jπ of a policy π, and the functions J0 and J1 produced by DP, should be well-defined within a mathematical framework which ensures that the integrals in Eqs. (C.1)-(C.3) make sense.
(2) Since J0(x0) is easily seen to be a lower bound to Jπ(x0) for all x0 and π = {µ0, µ1}, the equality of J0 and J* will be ensured if the class of policies has an ε-selection property, which guarantees that the minima in Eqs. (C.2) and (C.3) can be nearly attained by µ1(x1) and µ0(x0) for all x1 and x0, respectively.
To get a better sense of these issues, consider the following informal derivation of R.1:

J*(x0) = inf_π Jπ(x0)
       = inf_{µ0} inf_{µ1} ∫ g(x1, µ1(x1)) p(dx1 | x0, µ0(x0))     (C.4a)
       = inf_{µ0} ∫ [ inf_{µ1} g(x1, µ1(x1)) ] p(dx1 | x0, µ0(x0))     (C.4b)
       = inf_{µ0} ∫ [ inf_{u1} g(x1, u1) ] p(dx1 | x0, µ0(x0))
       = inf_{µ0} ∫ J1(x1) p(dx1 | x0, µ0(x0))     (C.4c)
       = inf_{u0} ∫ J1(x1) p(dx1 | x0, u0)     (C.4d)
       = J0(x0).

In order to make this derivation meaningful and mathematically rigorous, the following points need to be justified:
(a) g and µ1 must be such that g(x1, µ1(x1)) can be integrated in a well-defined manner in Eq. (C.4a).
(b) The interchange of infimization and integration in Eq. (C.4b) must be legitimate.
(c) g must be such that the function

J1(x1) = inf_{u1} g(x1, u1)

can be integrated in a well-defined manner in Eq. (C.4c).
We first discuss these points in the easier context where the state space is essentially countable.

Countable Space Problems

We observe that if for each (x0 , u0 ), the measure p(dx1 | x0 , u0 ) has count-
able support , i.e., is concentrated on a countable number of points, then for
a fixed policy π and initial state x0 , the integral defining the cost Jπ (x0 )
of Eq. (C.1) is defined in terms of (possibly infinite) summation. Simi-
larly, the DP algorithm (C.2), (C.3) is defined in terms of summation, and
the same is true for the integrals in Eqs. (C.4a)-(C.4d). Thus, there is no
need to impose measurability restrictions of any kind for the integrals to
make sense, and for the summations/integrations to be well-defined, it is
sufficient that g is bounded either above or below.
It can also be shown that the interchange of infimization and summation in Eq. (C.4b) is justified in view of the assumption

inf_{u1} g(x1, u1) > −∞,  ∀ x1 ∈ ℜ.

To see this, for any ε > 0, select µ̄1 : ℜ ↦ ℜ such that

g(x1, µ̄1(x1)) ≤ inf_{u1} g(x1, u1) + ε,  ∀ x1 ∈ ℜ.     (C.5)

Then

inf_{µ1} ∫ g(x1, µ1(x1)) p(dx1 | x0, µ0(x0)) ≤ ∫ g(x1, µ̄1(x1)) p(dx1 | x0, µ0(x0))
                                           ≤ ∫ [ inf_{u1} g(x1, u1) ] p(dx1 | x0, µ0(x0)) + ε.

Since ε > 0 is arbitrary, it follows that

inf_{µ1} ∫ g(x1, µ1(x1)) p(dx1 | x0, µ0(x0)) ≤ ∫ [ inf_{u1} g(x1, u1) ] p(dx1 | x0, µ0(x0)).

The reverse inequality also holds, since for all µ1, we can write

∫ [ inf_{u1} g(x1, u1) ] p(dx1 | x0, µ0(x0)) ≤ ∫ g(x1, µ1(x1)) p(dx1 | x0, µ0(x0)),

and then we can take the infimum over µ1. It follows that the interchange of infimization and summation in Eq. (C.4b) is justified, with the ε-optimal selection property of Eq. (C.5) being the key step in the proof.
We have thus shown that when the measure p(dx1 | x0 , u0 ) has count-
able support, g is bounded either above or below, and J0 (x0 ) > −∞ for
all x0 and J1 (x1 ) > −∞ for all x1 , the derivation of Eq. (C.4) is valid and
proves that the DP algorithm produces the optimal cost function J * (cf.
property R.1). † A similar argument proves the existence of an ε-optimal policy (cf. R.2); it uses the ε-optimal selection (C.5) for the second stage and a similar ε-optimal selection for the first stage, i.e., the existence of a µ̄0 : ℜ ↦ ℜ such that

∫ J1(x1) p(dx1 | x0, µ̄0(x0)) ≤ inf_{u0} ∫ J1(x1) p(dx1 | x0, u0) + ε.     (C.6)

Also R.3 follows easily using the fact that there are no measurability restrictions on µ0 and µ1.
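
In the countable (here, finite) case just discussed, the DP algorithm (C.2)-(C.3) and the equality J0 = J* of R.1 can be checked directly. The following sketch does this on a hypothetical three-state, two-control instance, comparing the DP output with brute-force enumeration of all policies; measurability plays no role.

```python
import itertools
import numpy as np

# Finite-state, finite-control instance of the two-stage example (C.1)-(C.3);
# all data are hypothetical.
X, U = range(3), range(2)
g = np.array([[4.0, 1.0],                            # g[x1, u1]: second-stage cost
              [2.0, 3.0],
              [0.0, 5.0]])
p = np.array([[[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]],    # p[x0, u0, x1]
              [[0.3, 0.4, 0.3], [0.5, 0.5, 0.0]],
              [[0.2, 0.6, 0.2], [0.4, 0.1, 0.5]]])

# DP algorithm (C.2)-(C.3).
J1 = g.min(axis=1)                                   # J1(x1) = min_{u1} g(x1, u1)
J0 = (p @ J1).min(axis=1)                            # J0(x0) = min_{u0} sum_{x1} p * J1

# Brute force over all policies pi = (mu0, mu1): J*(x0) = min_pi Jpi(x0).
J_star = np.full(3, np.inf)
for mu0 in itertools.product(U, repeat=3):
    for mu1 in itertools.product(U, repeat=3):
        g_mu1 = g[np.arange(3), mu1]                 # g(x1, mu1(x1))
        J_pi = np.array([p[x0, mu0[x0]] @ g_mu1 for x0 in X])
        J_star = np.minimum(J_star, J_pi)

print("J0 from DP       :", J0)
print("J* by enumeration:", J_star)                  # equal, as asserted in R.1
```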

Approaches for Uncountable Space Problems

To address the case where p(dx1 | x0 , u0 ) does not have countable support,
two approaches have been used. The first is to expand the notion of inte-
gration, and the second is to place appropriate measurability restrictions
on g, p, and {µ0 , µ1 }. Expanding the notion of integration is possible by
interpreting the integrals appearing in the preceding equations as outer
integrals. Since the outer integral can be defined for any function, mea-
surable or not, there is no need to impose any measurability assumptions,
and the arguments given above go through just as in the countable distur-
bance case. We do not discuss this approach further except to mention that
the book [BeS78] shows that the basic results for finite and infinite hori-
zon problems of perfect state information carry through within an outer
integration framework. However, there are inherent limitations in this ap-
proach centering around the pathologies of outer integration, as discussed
in [BeS78].
The second approach is to impose a suitable measurability structure
that allows the key proof steps of the validity of the DP algorithm. These
are:
(a) Properly interpreting the integrals in the definition (C.2)-(C.3) of the
DP algorithm and the derivation (C.4).
(b) The !-optimal selection property (C.5), which in turn justifies the
interchange of infimization and integration in Eq. (C.4b).
To enable (a), the required properties of the problem structure must include
the preservation of measurability under partial minimization. In particu-
lar, it is necessary that when g is measurable in some sense, the partial
minimum function
J1 (x1 ) = inf g(x1 , u1 )
u1

† The condition that g is bounded either above or below may be replaced by


any condition that guarantees that the infinite sum/integral of J1 in Eq. (C.3)
is well-defined. Note also that if g is bounded below, then the assumption that
J0 (x0 ) > −∞ for all x0 and J1 (x1 ) > −∞ for all x1 is automatically satisfied.

is also measurable in the same sense, so that the integration in Eq. (C.3) is
well-defined. It turns out that this is a major difficulty with Borel measur-
ability, which may appear to be a natural framework for formulating the
problem: J1 need not be Borel measurable even when g is Borel measurable.
For this reason it is necessary to pass to a larger class of measurable func-
tions, which is closed under the key operation of partial minimization (and
also under some other common operations, such as addition and functional
composition). †
One such class is lower semianalytic functions and the related class
of universally measurable functions, which will be the focus of the next
section. They are the basis for a problem formulation that enables a DP
theory as powerful as the one for problems where measurability is of no
concern (e.g., those where the state and control spaces are countable).

C.2 RESOLUTION OF THE MEASURABILITY ISSUES

The example of the preceding section indicates that if measurability re-


strictions are necessary for the problem data and policies, then measurable
selection and preservation of measurability under partial minimization, be-
come crucial parts of the analysis. We will discuss measurability frame-
works that are favorable in this regard, and to this end, we will use the
theory of Borel spaces.

Borel Spaces and Analytic Sets

Given a topological space Y , we denote by BY the σ-algebra generated by


the open subsets of Y , and refer to the members of BY as the Borel subsets
of Y . A topological space Y is a Borel space if it is homeomorphic to a
Borel subset of a complete separable metric space. The concept of Borel
space is quite broad, containing any “reasonable” subset of n-dimensional
Euclidean space. Any Borel subset of a Borel space is again a Borel space,
as is any homeomorphic image of a Borel space and any finite or countable

† It is also possible to use a smaller class of functions that is closed under the
same operations. This has led to the so-called semicontinuous models, where the
state and control spaces are Borel spaces, and g and p have certain semicontinu-
ity and other properties. These models are also analyzed in detail in the book
[BeS78] (Section 8.3). However, they are not as useful and widely applicable as
the universally measurable models we will focus on, because they involve assump-
tions that may be restrictive and/or hard to verify. By contrast, the universally
measurable models are simple and very general. They allow a problem formula-
tion that brings to bear the power of DP analysis under minimal assumptions.
This analysis can in turn be used to prove more specific results based on special
characteristics of the model.

Cartesian product of Borel spaces. Let Y and Z be Borel spaces, and


consider a function h : Y !→ Z. We say that h is Borel measurable if
h−1 (B) ∈ BY for every B ∈ BZ .
Borel spaces have a deficiency in the context of optimization: even in
the unit square, there exist Borel sets whose projections onto an axis are
not Borel subsets of that axis. In fact, this is the source of the difficulty
we mentioned earlier regarding Borel measurability in the DP context: if
g(x1, u1) is Borel measurable, the partial minimum function

J1(x1) = inf_{u1} g(x1, u1)

need not be, because its level sets are defined in terms of projections of the level sets of g as

{x1 | J1(x1) < c} = P({(x1, u1) | g(x1, u1) < c}),
where c is a scalar and P (·) denotes projection on the space of x1 . As an
example, take g to be the indicator of a Borel subset of the unit square
whose projection on the x1 -axis is not Borel. Then J1 is the indicator
function of this projection, so it is not Borel measurable. This leads us to
the notion of an analytic set.
A subset A of a Borel space Y is said to be analytic if there exists
a Borel space Z and a Borel subset B of Y × Z such that A = projY (B),
where projY is the projection mapping from Y × Z to Y . It is clear that
every Borel subset of a Borel space is analytic.
Analytic sets have many interesting properties, which are discussed
in detail in [BeS78]. Some of these properties are particularly relevant to
DP analysis. For example, let Y and Z be Borel spaces. Then:
(i) If A ⊂ Y is analytic and h : Y !→ Z is Borel measurable, then h(A)
is analytic. In particular, if Y is a product of Borel spaces Y1 and
Y2 , and A ⊂ Y1 × Y2 is analytic, then projY1 (A) is analytic. Thus,
the class of analytic sets is closed with respect to projection, a critical
property for DP, which the class of Borel sets is lacking, as mentioned
earlier.
(ii) If A ⊂ Z is analytic and h : Y !→ Z is Borel measurable, then h−1 (A)
is analytic.
(iii) If A1, A2, . . . are analytic subsets of Y, then ∪_{k=1}^∞ Ak and ∩_{k=1}^∞ Ak are analytic.
However, the complement of an analytic set need not be analytic, so the
collection of analytic subsets of Y need not be a σ-algebra.

Lower Semianalytic Functions


Let Y be a Borel space and let h : Y !→ [−∞, ∞] be a function. We say
that h is lower semianalytic if the level set
{y ∈ Y | h(y) < c}

is analytic for every c ∈ ℜ. The following proposition states that lower
analyticity is preserved under partial minimization, a key result for our
purposes. The proof follows from the preservation of analyticity of a subset
of a product space under projection onto one of the component spaces, as
in (i) above (see [BeS78], Prop. 7.47).

Proposition C.1: Let Y and Z be Borel spaces, and let h : Y × Z ↦ [−∞, ∞] be lower semianalytic. Then h* : Y ↦ [−∞, ∞] defined by

h*(y) = inf_{z∈Z} h(y, z)

is lower semianalytic.

By comparing the DP equation J1 (x1 ) = inf u1 g(x1 , u1 ) [cf. Eq. (C.2)]


and Prop. C.1, we see how lower semianalytic functions can arise in DP. In
particular, J1 is lower semianalytic if g is. Let us also give two additional
properties of lower semianalytic functions that play an important role in
DP (for a proof, see [BeS78], Lemma 7.40).

Proposition C.2: Let Y be a Borel space, and let h : Y $→ [−∞, ∞]


and l : Y $→ [−∞, ∞] be lower semianalytic. Suppose that for every
y ∈ Y , the sum h(y) + l(y) is defined, i.e., is not of the form ∞ − ∞.
Then h + l is lower semianalytic.

Proposition C.3: Let Y and Z be Borel spaces, let h : Y $→ Z be


Borel measurable, and let l : Z $→ [−∞, ∞] be lower semianalytic.
Then the composition l ◦ h is lower semianalytic.

Universal Measurability

To address questions relating to the definition of the integrals appearing in


the DP algorithm, we must discuss the measurability properties of lower
semianalytic functions. In addition to the Borel σ-algebra BY mentioned
earlier, there is the universal σ-algebra UY , which is the intersection of all
completions of BY with respect to all probability measures. Thus, E ∈ UY
if and only if, given any probability measure p on (Y, BY ), there is a Borel
set B and a p-null set N such that E = B ∪ N . Clearly, we have BY ⊂ UY .
It is also true that every analytic set is universally measurable (for a proof,

see [BeS78], Corollary 7.42.1), and hence the σ-algebra generated by the
analytic sets, called the analytic σ-algebra, and denoted AY , is contained
in UY :
BY ⊂ AY ⊂ UY .

Let X, Y , and Z be Borel spaces, and consider a function h : Y "→ Z.


We say that h is universally measurable if h−1 (B) ∈ UY for every B ∈ BZ .
It can be shown that if U ⊂ Z is universally measurable and h is universally
measurable, then h−1 (U ) is also universally measurable. As a result, if
g : X "→ Y , h : Y "→ Z are universally measurable functions, then the
composition (g ◦ h) : X "→ Z is universally measurable.
We say that h : Y "→ Z is analytically measurable if h−1 (B) ∈ AY
for every B ∈ BZ . It can be seen that every lower semianalytic function is
analytically measurable, and in view of the inclusion AY ⊂ UY , it is also
universally measurable.

Integration of Lower Semianalytic Functions

If p is a probability measure on (Y, BY), then p has a unique extension to a probability measure p̄ on (Y, UY). We write simply p instead of p̄ and ∫ h dp in place of ∫ h dp̄. In particular, if h is lower semianalytic, then ∫ h dp is interpreted in this manner.
Let Y and Z be Borel spaces. A stochastic kernel q(dz | y) on Z given
Y is a collection of probability measures on (Z, BZ ) parameterized by the
elements of Y . If for each Borel set B ∈ BZ , the function q(B | y) is Borel
measurable (universally measurable) in y, the stochastic kernel q(dz | y)
is said to be Borel measurable (universally measurable, respectively). The
following proposition provides another basic property for the DP context
(for a proof, see [BeS78], Props. 7.46 and 7.48). †

† We use here a definition of the integral of an extended real-valued function that is always defined as an extended real number (see also Appendix A). In particular, for a probability measure p, the integral of an extended real-valued function f, with positive and negative parts f+ and f−, is defined as

∫ f dp = ∫ f+ dp − ∫ f− dp,

where we adopt the rule ∞ − ∞ = ∞ for the case where ∫ f+ dp = ∞ and ∫ f− dp = ∞. With this expanded definition, the integral of an extended real-valued function is always defined as an extended real number (consistently also with Appendix A).

Proposition C.4: Let Y and Z be Borel spaces, and let q(dz | y) be a stochastic kernel on Z given Y. Let also h : Y × Z ↦ [−∞, ∞] be a function.

(a) If q is Borel measurable and h is lower semianalytic, then the function l : Y ↦ [−∞, ∞] given by

l(y) = ∫_Z h(y, z) q(dz | y)

is lower semianalytic.
(b) If q is universally measurable and h is universally measurable, then the function l : Y ↦ [−∞, ∞] given by

l(y) = ∫_Z h(y, z) q(dz | y)

is universally measurable.

Returning to the DP algorithm (C.2)-(C.3) of Section C.1, note that


if the cost function g is lower semianalytic and bounded either above or
below, then the partial minimum function J1 given by the DP Eq. (C.2)
is lower semianalytic (cf. Prop. C.1), and bounded either above or below,
respectively. Furthermore, if the transition kernel p(dx1 | x0 , u0 ) is Borel
measurable, then the integral

∫ J1(x1) p(dx1 | x0, u0)     (C.7)

is a lower semianalytic function of (x0 , u0 ) (cf. Prop. C.4), and in view of


Prop. C.1, the same is true of the function J0 given by the DP Eq. (C.3),
which is the partial minimum over u0 of the expression (C.7). Thus, with
lower semianalytic g and Borel measurable p, the integrals appearing in
the DP algorithm make sense.
Note that in the example of Section C.1, there is no cost incurred in
the first stage of the system operation. When such a cost, call it g0 (x0 , u0 ),
is introduced, the expression minimized in the DP Eq. (C.3) becomes

g0(x0, u0) + ∫ J1(x1) p(dx1 | x0, u0),

which is still a lower semianalytic function of (x0 , u0 ), provided g0 is lower


semianalytic and the sum above is not of the form ∞ − ∞ for any (x0 , u0 )
(Prop. C.2). Also, for alternative models defined in terms of a system func-
tion rather than a stochastic kernel (e.g., the total cost model of Chapter
1), Prop. C.3 provides some of the necessary machinery to show that the
functions generated by the DP algorithm are lower semianalytic.

Universally Measurable Selection


The preceding discussion has shown that if g is lower semianalytic, and
p is Borel measurable, the DP algorithm (C.2)-(C.3) is well-defined and
produces lower semianalytic functions J1 and J0 . However, this does not
by itself imply that J0 is equal to the optimal cost function J * . For this
it is necessary that the chosen class of policies has the !-optimal selection
property (C.5). It turns out that universally measurable policies have this
property.
The following is the key selection theorem given in a general form,
which also addresses the question of existence of optimal policies that can
be obtained from the DP algorithm (for a proof, see [BeS78], Prop. 7.50).
The theorem shows that if any functions µ̄1 : ! → ! and µ̄0 : ! → ! can
be found such that µ̄1 (x1 ) and µ̄0 (x0 ) attain the respective minima in Eqs.
(C.2) and (C.3), for every x1 and x0 , then µ̄1 and µ̄0 can be chosen to be
universally measurable, the DP algorithm yields the optimal cost function
and π = (µ̄0 , µ̄1 ) is optimal, provided that g is lower semianalytic and the
integral in Eq. (C.3) is a lower semianalytic function of (x0 , u0 ).

Proposition C.5: (Measurable Selection Theorem) Let Y and Z be Borel spaces and let h : Y × Z ↦ [−∞, ∞] be lower semianalytic. Define h* : Y ↦ [−∞, ∞] by

h*(y) = inf_{z∈Z} h(y, z),

and let

I = {y ∈ Y | there exists a zy ∈ Z for which h(y, zy) = h*(y)},

i.e., I is the set of points y for which the infimum above is attained. For any ε > 0, there exists a universally measurable function φ : Y ↦ Z such that

h(y, φ(y)) = h*(y),  ∀ y ∈ I,

h(y, φ(y)) ≤ h*(y) + ε,  ∀ y ∉ I with h*(y) > −∞,
h(y, φ(y)) ≤ −1/ε,  ∀ y ∉ I with h*(y) = −∞.

Universal Measurability Framework: A Summary

In conclusion, the preceding discussion shows that in the two-stage example


of Section C.1, the measurability issues are resolved in the following sense:
the DP algorithm (C.2)-(C.3) is well-defined, produces lower semianalytic

functions J1 and J0 , and yields the optimal cost function (as in R.1), and
furthermore there exist !-optimal and possibly exactly optimal policies (as
in R.2 and R.3), provided that:
(a) The stage cost function g is lower semianalytic; this is needed to show
that the function J1 of the DP Eq. (C.2) is lower semianalytic and
hence also universally measurable (cf. Prop. C.1). The more “nat-
ural” Borel measurability assumption on g implies lower analyticity
of g, but will not keep the functions J1 and J0 produced by the DP
algorithm within the domain of Borel measurability. This is because
the partial minimum operation on Borel measurable functions takes
us outside that domain (cf. Prop. C.1).
(b) The stochastic kernel p is Borel measurable. This is needed in order
for the integral in the DP Eq. (C.3) to be defined as a lower semi-
analytic function of (x0 , u0 ) (cf. Prop. C.4). In turn, this is used to
show that the function J0 of the DP Eq. (C.3) is lower semianalytic
(cf. Prop. C.1).
(c) The control functions µ0 and µ1 are allowed to be universally mea-
surable, and we have J0 (x0 ) > −∞ for all x0 and J1 (x1 ) > −∞ for
all x1 . This is needed in order for the calculation of Eq. (C.4) to go
through (using the measurable selection property of Prop. C.5), and
show that the DP algorithm produces the optimal cost function (cf.
R.1). It is also needed (using again Prop. C.5) in order to show the
associated existence of solutions results (cf. R.2 and R.3).

Extension to General Finite-Horizon DP

Let us now extend our analysis to an N -stage model with state xk and
control uk that take values in Borel spaces X and U , respectively. We
assume stochastic/transition kernels pk (dxk+1 | xk , uk ), which are Borel
measurable, and stage cost functions gk : X × U $→ (−∞, ∞], which are
lower semianalytic and bounded either above or below. † Furthermore, we
allow policies π = {µ0 , . . . , µN −1 } that are randomized: each component
µk is a universally measurable stochastic kernel µk (duk | xk ) from X to U .
If for every xk and k, µk (duk | xk ) assigns probability 1 to a single control
uk , π is said to be nonrandomized .
Each policy π and initial state x0 define a unique probability measure
with respect to which gk (xk , uk ) can be integrated to produce the expected
value of gk . The sum of these expected values for k = 0, . . . , N − 1, is the
cost Jπ (x0 ). It is convenient to write this cost in terms of the following

† Note that since gk may take the value ∞, constraints of the form uk ∈
Uk (xk ) may be implicitly introduced by letting gk (xk , uk ) = ∞ when uk ∈
/
Uk (xk ).

DP-like backwards recursion (see [BeS78], Section 8.1):

Jπ,N−1(xN−1) = ∫ gN−1(xN−1, uN−1) µN−1(duN−1 | xN−1),

Jπ,k(xk) = ∫ [ gk(xk, uk) + ∫ Jπ,k+1(xk+1) pk(dxk+1 | xk, uk) ] µk(duk | xk),  k = 0, . . . , N − 2.

The function obtained at the last step is the cost of π starting at x0:

Jπ(x0) = Jπ,0(x0).

We can interpret Jπ,k(xk) as the expected cost-to-go starting from xk at time k, and using π. Note that by Prop. C.4, the functions Jπ,k are all universally measurable.
The DP algorithm is given by

JN−1(xN−1) = inf_{uN−1∈U} gN−1(xN−1, uN−1),  ∀ xN−1,

Jk(xk) = inf_{uk∈U} [ gk(xk, uk) + ∫ Jk+1(xk+1) pk(dxk+1 | xk, uk) ],  ∀ xk, k.

By essentially replicating the analysis of the two-stage example, we can show that the integrals in the above DP algorithm are well-defined, and that the functions JN−1, . . . , J0 are lower semianalytic.
It can be seen from the preceding expressions that we have for all policies π

Jk(xk) ≤ Jπ,k(xk),  ∀ xk, k = 0, . . . , N − 1.

To show equality within ε ≥ 0 in the above relation, we may use the measurable selection theorem (Prop. C.5), assuming that

Jk(xk) > −∞,  ∀ xk, k,

so that ε-optimal universally measurable selection is possible in the DP algorithm. In particular, define π = {µ0, . . . , µN−1} such that µk : X ↦ U is universally measurable, and for all xk and k,

gk(xk, µk(xk)) + ∫ Jk+1(xk+1) pk(dxk+1 | xk, µk(xk)) ≤ Jk(xk) + ε/N.     (C.8)

Then, we can show by induction that

Jk(xk) ≤ Jπ,k(xk) ≤ Jk(xk) + (N − k)ε/N,  ∀ xk, k = 0, . . . , N − 1,

and in particular, for k = 0,

J0 (x0 ) ≤ Jπ (x0 ) ≤ J0 (x0 ) + !, ∀ x0 .

and hence also


J * (x0 ) = inf Jπ (x0 ) = J0 (x0 ).
π

Thus, the DP algorithm produces the optimal cost function, and via the
approximate minimization of Eq. (C.8), an !-optimal policy. Similarly,
if the infimum is attained for all xk and k in the DP algorithm, then
there exists an optimal policy. Note that both the !-optimal and the exact
optimal policies can be taken be nonrandomized.
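
The ε/N selection of Eq. (C.8) and the resulting bound Jπ,k ≤ Jk + (N − k)ε/N can be mimicked on a finite model, where measurability is not an issue. The sketch below uses randomly generated, purely hypothetical data and deliberately picks a control that is merely within ε/N of the minimum at each state and stage.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite-horizon sketch of the backward recursion and the eps/N selection (C.8);
# a finite model stands in for the Borel one, and all data are hypothetical.
N, nX, nU, eps = 4, 5, 3, 0.5
g = rng.random((N, nX, nU))                                        # g_k(x, u)
p = rng.random((N, nX, nU, nX))
p /= p.sum(axis=3, keepdims=True)                                  # p_k(y | x, u)

J = np.zeros((N + 1, nX))                                          # terminal J_N = 0
mu = np.zeros((N, nX), dtype=int)
for k in range(N - 1, -1, -1):
    Q = g[k] + p[k] @ J[k + 1]                                     # Q_k(x, u)
    J[k] = Q.min(axis=1)
    for x in range(nX):
        # any u with Q_k(x, u) <= J_k(x) + eps/N is acceptable, cf. Eq. (C.8)
        mu[k, x] = np.flatnonzero(Q[x] <= J[k, x] + eps / N)[-1]

# Cost-to-go of the selected policy, via the backward recursion for J_{pi,k}.
Jpi = np.zeros((N + 1, nX))
for k in range(N - 1, -1, -1):
    Jpi[k] = g[k, np.arange(nX), mu[k]] + p[k, np.arange(nX), mu[k]] @ Jpi[k + 1]

print("max over k, x of (Jpi_k - J_k):", np.max(Jpi[:N] - J[:N]), "<= eps =", eps)
```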
The assumptions of Borel measurability of the stochastic kernels,
lower semianalyticity of the costs per stage, and universally measurable
policies, are the basis for the framework adopted by Bertsekas and Shreve
[BeS78], which provides a comprehensive analysis of finite and infinite hori-
zon total cost problems. There is also additional analysis in [BeS78] on
problems of imperfect state information, as well as various refinements
of the measurability framework just described. Among others, these re-
finements involve analytically measurable policies, and limit measurable
policies (measurable with respect to the, so-called, limit σ-algebra, the
smallest σ-algebra that has the properties necessary for a DP theory that
is comparably powerful to the one for the universal σ-algebra).
APPENDIX D:
Solutions of Exercises

CHAPTER 1

1.1 (Multistep Contraction Mappings)

By the contraction property of Tµ0, . . . , Tµm−1, we have for all J, J' ∈ B(X),

‖T_ν J − T_ν J'‖ = ‖Tµ0 · · · Tµm−1 J − Tµ0 · · · Tµm−1 J'‖
               ≤ α ‖Tµ1 · · · Tµm−1 J − Tµ1 · · · Tµm−1 J'‖
               ≤ α² ‖Tµ2 · · · Tµm−1 J − Tµ2 · · · Tµm−1 J'‖
               · · ·
               ≤ α^m ‖J − J'‖,

thus showing Eq. (1.23).
We have from Eq. (1.23)

(Tµ0 · · · Tµm−1 J)(x) ≤ (Tµ0 · · · Tµm−1 J')(x) + α^m ‖J − J'‖ v(x),  ∀ x ∈ X,

and by taking the infimum of both sides over (Tµ0 · · · Tµm−1) ∈ M^m and dividing by v(x), we obtain

( (T J)(x) − (T J')(x) ) / v(x) ≤ α^m ‖J − J'‖,  ∀ x ∈ X.

Similarly,

( (T J')(x) − (T J)(x) ) / v(x) ≤ α^m ‖J − J'‖,  ∀ x ∈ X,

and by combining the last two relations and taking the supremum over x ∈ X, Eq. (1.24) follows.


1.2 (State-Dependent Weighted Multistep Mappings [YuB12])

By the contraction property of Tµ, we have for all J, J′ ∈ B(X) and x ∈ X,

    |(Tµ^(w) J)(x) − (Tµ^(w) J′)(x)| / v(x)
        = | Σ_{ℓ=1}^∞ wℓ(x)(Tµ^ℓ J)(x) − Σ_{ℓ=1}^∞ wℓ(x)(Tµ^ℓ J′)(x) | / v(x)
        ≤ Σ_{ℓ=1}^∞ wℓ(x) ‖Tµ^ℓ J − Tµ^ℓ J′‖
        ≤ ( Σ_{ℓ=1}^∞ wℓ(x) α^ℓ ) ‖J − J′‖,

showing the contraction property of Tµ^(w).
Let Jµ be the fixed point of Tµ. We have for all x ∈ X, by using the relation
(Tµ^ℓ Jµ)(x) = Jµ(x),

    (Tµ^(w) Jµ)(x) = Σ_{ℓ=1}^∞ wℓ(x) (Tµ^ℓ Jµ)(x) = ( Σ_{ℓ=1}^∞ wℓ(x) ) Jµ(x) = Jµ(x),

so Jµ is the fixed point of Tµ^(w) [which is unique since Tµ^(w) is a contraction].
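The weighted multistep fixed point property is easy to check numerically. The sketch below truncates the infinite sum to finitely many weights wℓ(x) that sum to one, and takes Tµ linear; all data and names (P, g, T_w, L) are illustrative assumptions of the sketch.

    import numpy as np

    np.random.seed(2)
    n, alpha, L = 4, 0.9, 6                   # L terms stand in for the infinite sum
    P = np.random.rand(n, n); P /= P.sum(axis=1, keepdims=True)
    g = np.random.randn(n)
    T = lambda J: g + alpha * P @ J           # a linear contraction playing the role of T_mu

    w = np.random.rand(L, n); w /= w.sum(axis=0, keepdims=True)   # w_l(x) >= 0, sum_l w_l(x) = 1

    def T_w(J):
        """(T_mu^(w) J)(x) = sum_l w_l(x) (T_mu^l J)(x)."""
        out, Jl = np.zeros(n), J.copy()
        for l in range(L):
            Jl = T(Jl)                        # Jl = T_mu^{l+1} J
            out += w[l] * Jl
        return out

    J_mu = np.linalg.solve(np.eye(n) - alpha * P, g)   # fixed point of T_mu
    print(np.allclose(T_w(J_mu), J_mu))                # J_mu is also the fixed point of T_mu^(w)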

CHAPTER 2

2.1 (Periodic Policies)

(a) Let us define

    J0 = lim_{k→∞} Tν^k J̄,   J1 = lim_{k→∞} Tν^k (Tµ0 J̄),   . . . ,   Jm−1 = lim_{k→∞} Tν^k (Tµ0 · · · Tµm−2 J̄).

Since Tν is a contraction mapping, J0, . . . , Jm−1 are all equal to the unique fixed
point of Tν. Since J0, . . . , Jm−1 are all equal, they are also equal to Jπ (by the
definition of Jπ). Thus Jπ is the unique fixed point of Tν.
(b) Follow the hint.

2.2 (Totally Asynchronous Convergence Theorem for Time-Varying Maps)

A straightforward replication of the proof of Prop. 2.6.1.



2.3 (Nonmonotonic-Contractive Models – Fixed Points of Concave Sup-Norm Contractions)
The analysis of Sections 2.6.1 and 2.6.3 does not require monotonicity of the
mapping Tµ given by

    (Tµ J)(x) = F(x, µ(x)) − J′µ(x).

2.4 (Discounted Problems with Unbounded Cost per Stage)

We have

    |(Tµ J)(x)| / v(x) ≤ G(x)/v(x) + α Σ_{y∈X} ( pxy(µ(x)) v(y) / v(x) ) ( |J(y)| / v(y) ),   ∀ x ∈ X, µ ∈ M,

from which, using assumptions (1) and (2),

    |(Tµ J)(x)| / v(x) ≤ ‖G‖ + α ‖V‖ ‖J‖,   ∀ x ∈ X, µ ∈ M.

A similar argument shows that

    |(T J)(x)| / v(x) ≤ ‖G‖ + α ‖V‖ ‖J‖,   ∀ x ∈ X.

It follows that Tµ J ∈ B(X) and T J ∈ B(X) if J ∈ B(X).
For any J, J′ ∈ B(X) and µ ∈ M, we have

    ‖Tµ J − Tµ J′‖ = sup_{x∈X} | α Σ_{y∈X} pxy(µ(x)) (J(y) − J′(y)) | / v(x)
                   ≤ sup_{x∈X} α Σ_{y∈X} pxy(µ(x)) v(y) ( |J(y) − J′(y)| / v(y) ) / v(x)
                   ≤ sup_{x∈X} α ( Σ_{y∈X} pxy(µ(x)) v(y) / v(x) ) ‖J − J′‖
                   ≤ α ‖J − J′‖,

where the last inequality follows from assumption (3). Hence Tµ is a contraction
of modulus α.
To show that T is a contraction, we note that

    (Tµ J)(x)/v(x) ≤ (Tµ J′)(x)/v(x) + α ‖J − J′‖,   x ∈ X, µ ∈ M,

so by taking the infimum over µ ∈ M, we obtain

    (T J)(x)/v(x) ≤ (T J′)(x)/v(x) + α ‖J − J′‖,   x ∈ X.

Similarly,

    (T J′)(x)/v(x) ≤ (T J)(x)/v(x) + α ‖J − J′‖,   x ∈ X,

and by combining the last two relations the contraction property of T follows.
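The weighted sup-norm contraction property established above is easy to test numerically for a finite-state model once Σ_y pxy(u) v(y) ≤ v(x) holds (this is what assumption (3) is used for in the last inequality). The sketch below builds such a pair (P, v) by letting transitions move only toward states with smaller weight; the construction and all data are assumptions of the sketch, not part of the exercise.

    import numpy as np

    np.random.seed(3)
    n, alpha = 5, 0.9
    v = np.sort(np.random.rand(n) + 1.0)      # positive weights, increasing in the state index

    # a stochastic matrix that only moves to states with smaller (or equal) weight, so P @ v <= v
    P = np.tril(np.random.rand(n, n)) + 1e-3 * np.eye(n)
    P /= P.sum(axis=1, keepdims=True)
    assert np.all(P @ v <= v + 1e-12)

    G = np.random.randn(n) * v                # per-stage cost bounded by a multiple of v
    T = lambda J: G + alpha * P @ J
    wnorm = lambda J: np.max(np.abs(J) / v)   # weighted sup-norm used in the exercise

    J, Jp = np.random.randn(n) * v, np.random.randn(n) * v
    print(wnorm(T(J) - T(Jp)), alpha * wnorm(J - Jp))   # contraction of modulus alpha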

2.5 (Solution by Mathematical Programming)

If J ≤ T J, by monotonicity we have J ≤ limk→∞ T k J = J ∗ . Any feasible


solution z of the optimization problem satisfies zi ≤ H(i, u, z) for all i = 1, . . . , n
and u ∈ U (i), so that z ≤ T z. It follows that z ≤ J ∗ , which implies that J ∗ is
an optimal solution of the optimization problem.
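For a discounted finite MDP the mapping H is linear in z, so the optimization problem of the exercise becomes a linear program: with the standard objective of maximizing Σ_i z_i, the constraints are z_i ≤ g(i,u) + α Σ_j p_ij(u) z_j for all (i,u). A hedged sketch with illustrative data (and scipy's linprog as the solver) is given below; the LP solution agrees with value iteration.

    import numpy as np
    from scipy.optimize import linprog

    np.random.seed(4)
    n, m, alpha = 4, 3, 0.9                   # states, controls, discount (illustrative)
    g = np.random.rand(n, m)                  # stage costs g(i,u)
    P = np.random.rand(n, m, n); P /= P.sum(axis=2, keepdims=True)

    # constraints z_i - alpha * sum_j p_ij(u) z_j <= g(i,u), one per state-control pair
    A_ub, b_ub = [], []
    for i in range(n):
        for u in range(m):
            row = -alpha * P[i, u]
            row[i] += 1.0
            A_ub.append(row); b_ub.append(g[i, u])
    res = linprog(c=-np.ones(n),              # maximize sum_i z_i
                  A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n, method="highs")

    J = np.zeros(n)                           # value iteration for comparison
    for _ in range(2000):
        J = np.min(g + alpha * np.einsum("iuj,j->iu", P, J), axis=1)
    print(np.max(np.abs(res.x - J)))          # near zero: the LP solution is J*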

2.6 (Convergence of Nonexpansive Monotone Fixed Point Iterations)

For any c > 0, let Vk = T^k(J* + c v) for k ≥ 1, and note that J* = T^k J* ≤ Vk.
From condition (2), we have

    H(x, u, J* + c v) ≤ H(x, u, J*) + c v(x),   x ∈ X, u ∈ U(x),

and by taking the infimum over u ∈ U(x), we obtain T(J* + c v) ≤ J* + c v, i.e., V1 ≤
V0. From this and the monotonicity of T it follows that {Vk(x)} is monotonically
nonincreasing, and converges to some scalar V(x) ≥ J*(x) for each x ∈ X.
Moreover, the corresponding function V is in B(X), since V0 ≥ V ≥ J*, and
also satisfies ‖Vk − V‖ → 0 (since X is finite). From condition (2), we have
‖T Vk − T V‖ ≤ ‖Vk − V‖, so ‖T Vk − T V‖ → 0, which together with the fact
T Vk = Vk+1 → V, implies that V = T V. Thus V = J* by the uniqueness of the
fixed point of T, and it follows that {Vk} converges monotonically to J* from
above.
Similarly, define Wk = T^k(J* − c v), and by an argument symmetric to the
above, {Wk} converges monotonically to J* from below. Now let c = ‖J − J*‖
in the definition of Vk and Wk. Then J* − c v ≤ J0 = J ≤ J* + c v, so by the
monotonicity of T, Wk ≤ T^k J ≤ Vk as well as Wk ≤ J* ≤ Vk for all k. Therefore

    |(T^k J)(x) − J*(x)| / v(x) ≤ |Wk(x) − Vk(x)| / v(x) ≤ ‖Wk − Vk‖,   ∀ x ∈ X.

Since ‖Wk − Vk‖ ≤ ‖Wk − J*‖ + ‖Vk − J*‖ → 0, the conclusion follows.
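The sandwich argument above can be visualized on a small deterministic shortest path example, where the mapping is monotone, nonexpansive in the (unweighted) sup norm, and has the shortest distances as its unique fixed point, which is the setting the argument needs. The graph data below are illustrative assumptions of the sketch.

    import numpy as np

    INF = float("inf")
    a = np.array([[INF, 1.0, 4.0, INF],       # arc lengths; node 3 is the destination
                  [INF, INF, 2.0, 6.0],
                  [INF, INF, INF, 3.0],
                  [INF, INF, INF, INF]])
    n, dest = 4, 3

    def T(J):
        """Shortest path mapping: (T J)(i) = min_j [a(i,j) + J(j)], with (T J)(dest) = 0."""
        out = np.array([np.min(a[i] + J) for i in range(n)])
        out[dest] = 0.0
        return out

    J_star = np.zeros(n)                      # unique fixed point (shortest distances)
    for _ in range(n):
        J_star = T(J_star)

    c = 5.0
    V, W = J_star + c, J_star - c             # V_k from above, W_k from below
    for k in range(5):
        V, W = T(V), T(W)
        print(k, np.max(V - J_star), np.min(W - J_star))   # both gaps are monotone and vanish within n steps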

CHAPTER 3

3.1 (Blackmailer’s Dilemma)

(a) Clearly Tµ is a sup-norm contraction with modulus 1 − µ(1)². Hence Jµ is
the unique fixed point of Tµ and we have

    Jµ(1) = (Tµ Jµ)(1) = −µ(1) + (1 − µ(1)²) Jµ(1),

which yields Jµ(1) = −1/µ(1). The mapping T is given by

    (T J)(1) = inf_{0<u≤1} { −u + (1 − u²) J(1) },

and J ∈ ℜ is a fixed point of T if and only if

    0 = inf_{0<u≤1} { −u + u² J(1) }.

However, it can be seen that this equation has no solution. Here parts (b) and
(d) of Assumption 3.2.1 are violated.
(b) Here Tµ is again a sup-norm contraction, with modulus 1 − µ(1). For Jµ, the
unique fixed point of Tµ, we have

    Jµ(1) = (Tµ Jµ)(1) = −(1 − µ(1)) µ(1) + (1 − µ(1)) Jµ(1),

which yields Jµ(1) = −1 + µ(1). Hence J* = −1, but there is no optimal µ. The
mapping T is given by

    (T J)(1) = inf_{0<u≤1} { −u + u² + (1 − u) J(1) },

and J ∈ ℜ is a fixed point of T if and only if

    0 = inf_{0<u≤1} { −u + u² − u J(1) }.

It can be verified that the set of fixed points of T within ℜ is {J | J ≤ −1}. Here
part (d) of Assumption 3.2.1 is violated.
(c) For the policy µ that chooses µ(1) = 0, we have

    (Tµ J)(1) = c + J(1),

and µ is ℜ-irregular since lim_{k→∞} Tµ^k J either does not belong to ℜ or depends
on J. Moreover, the mapping T is given by

    (T J)(1) = min{ c + J(1), inf_{0<u≤1} { −u + u² + (1 − u) J(1) } }.

When c > 0, we have Jµ(1) = lim_{k→∞} (Tµ^k J̄)(1) = ∞. It can be verified
that there is no optimal policy, and the set of fixed points of T within ℜ is
{J | J ≤ −1}. Here part (d) of Assumption 3.2.1 is violated.
When c = 0, we have Jµ(1) = lim_{k→∞} (Tµ^k J̄)(1) = 0. Again it can be
verified that there is no optimal policy, and the set of fixed points of T within ℜ
is {J | J ≤ −1}. Here part (c) of Assumption 3.2.1 is violated.
When c < 0, we have Jµ(1) = lim_{k→∞} (Tµ^k J̄)(1) = −∞, and the ℜ-irregular
policy µ is optimal. The mapping T has no fixed point within ℜ. Here parts (c)
and (d) of Assumption 3.2.1 are violated.
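A quick numerical companion to part (a), with the formulas taken from the solution above (the iteration counts and sample values below are arbitrary choices): value iteration confirms Jµ(1) = −1/µ(1), and a scan over J(1) shows that inf_{0<u≤1}{−u + u² J(1)} is always strictly negative, so the fixed point condition 0 = inf_u{−u + u² J(1)} indeed has no solution.

    import numpy as np

    # value iteration for (T_mu J)(1) = -u + (1 - u**2) J(1), for a few choices u = mu(1)
    for u in (1.0, 0.5, 0.2):
        J = 0.0
        for _ in range(20000):
            J = -u + (1 - u**2) * J
        print(u, J, -1.0 / u)                 # converges to J_mu(1) = -1/mu(1)

    # scan the fixed point condition 0 = inf_{0<u<=1} [-u + u**2 J(1)]
    us = np.linspace(1e-4, 1.0, 2000)
    for J1 in np.linspace(-5.0, 5.0, 11):
        print(J1, np.min(-us + us**2 * J1))   # strictly negative for every J1: no fixed point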

3.2 (Equivalent Semicontractive Conditions)

Let the assumptions of Prop. 3.1.1 hold, and let µ∗ be the S-regular policy that is
optimal. Then condition (1) implies that J ∗ = Jµ∗ ∈ S and J ∗ = Tµ∗ J ∗ ≥ T J ∗ ,
while condition (2) implies that there exists an S-regular policy µ such that
Tµ J ∗ = T J ∗ .
Conversely, assume that J ∗ ∈ S, T J ∗ ≤ J ∗ , and there exists an S-regular
policy µ such that Tµ J ∗ = T J ∗ . Then we have Tµ J ∗ = T J ∗ ≤ J ∗ . Hence
Tµk J ∗ ≤ J ∗ for all k, and by taking the limit as k → ∞, we obtain Jµ ≤ J ∗ .
Hence the S-regular policy µ is optimal, and both conditions of Prop. 3.1.1 hold.

3.3

The mapping H here is

    H(x, u, J) = b          if x = 1, u = 0,
                 a + J(2)   if x = 1, u = 2,
                 a + J(1)   if x = 2, u = 1.

The Bellman equation is given by

    J(1) = min{ b, a + J(2) },   J(2) = a + J(1).

There are two policies:

    µ : where µ(1) = 0, corresponding to the path 2 → 1 → 0,

    µ̄ : where µ̄(1) = 2, corresponding to the cycle 1 → 2 → 1.

The case where S = ℜ² has been discussed in Section 3.1.2. Here µ is
S-regular, as can be seen from the form of Tµ,

    (Tµ J)(1) = b,   (Tµ J)(2) = a + J(1),

but µ̄ is S-irregular, as can be seen from the form of Tµ̄,

    (Tµ̄ J)(1) = a + J(2),   (Tµ̄ J)(2) = a + J(1).

Briefly there are four cases of interest:

(1) a > 0: Here Prop. 3.1.1 applies.
(2) a = 0 and b ≤ 0: Here Prop. 3.1.1 applies.
(3) a = 0 and b > 0: Here Prop. 3.1.1 does not apply because the S-regular
policy µ is not optimal.
(4) a < 0: Here Prop. 3.1.1 does not apply because the S-regular policy µ is
not optimal.
Consider now the case where S = [−∞, ∞) × [−∞, ∞). Then µ is S-regular
in all cases (1)-(4), but µ̄ is S-irregular only in cases (1)-(3), and it is S-regular
in case (4) because Jµ̄(1) = Jµ̄(2) = −∞ and

    lim_{k→∞} (Tµ̄^k J)(1) = lim_{k→∞} (Tµ̄^k J)(2) = −∞,   ∀ J ∈ S,

while Jµ̄ is the unique fixed point of T within S. In cases (1) and (2), Prop.
3.1.1 applies, because the S-regular policy µ is optimal. In case (3), Prop. 3.1.1
does not apply because the S-regular policy µ is not optimal. Finally, in case (4),
contrary to the case S = ℜ², Prop. 3.1.1 applies, because the policy µ̄ is optimal
and also S-regular. Case (3) cannot be analyzed with the aid of Props. 3.1.1,
3.1.2, or 3.2.1.

3.4 (Changing J¯)

(a) By the definition of an S-regular policy, we have Tµ^k J → Jµ for all S-regular µ
and J ∈ S. Thus, changing J̄ to any J ∈ S leaves the cost function of all S-regular
policies unchanged.
(b) Here

    H(x, u, J) = b      if x = 1, u = 0,
                 J(2)   if x = 1, u = 2,
                 J(1)   if x = 2, u = 1.

When J̄ = 0, the ℜ²-regular policy is optimal and J* = b e, as shown in Section
3.1.2. When J̄ = r e, the cost function of the ℜ²-regular policy µ [µ(1) = 0]
continues to be

    Jµ(1) = Jµ(2) = b,

while the cost function of the ℜ²-irregular policy µ̄ [µ̄(1) = 2] is

    Jµ̄(1) = Jµ̄(2) = r.

For r ≤ b, the ℜ²-irregular policy is optimal, but J* = b e continues to be the
optimal cost over just the ℜ²-regular policies (there is only one in this example).

3.5 (Alternative Semicontractive Conditions)

We will show that conditions (1) and (2) imply that J* = T J*, and the result
will follow from Prop. 3.1.2. Assume, to obtain a contradiction, that J* ≠ T J*.
Then J* ≥ T J*, as can be seen from the relations

    J* = Jµ* = Tµ* Jµ* ≥ T Jµ* = T J*,

where µ* is an optimal S-regular policy. Thus the relation J* ≠ T J* implies
that there exists a µ such that

    J*(x) ≥ (Tµ J*)(x),   ∀ x ∈ X,

with strict inequality for some x [note here that we can choose µ(x) = µ*(x) for
all x such that J*(x) = (T J*)(x), and we can choose µ(x) to satisfy J*(x) >
(Tµ J*)(x) for all other x]. If µ were S-regular, we would have

    J* ≥ Tµ J* ≥ lim_{k→∞} Tµ^k J* = Jµ,

with strict inequality for some x ∈ X, which is impossible. Hence µ is S-irregular,
which contradicts condition (2).

3.6 (Convergence of PI)

We have

    Jµk ≥ T Jµk ≥ Jµk+1,   k = 0, 1, . . . .   (3.1)

Denote

    J∞ = lim_{k→∞} T Jµk = lim_{k→∞} Jµk.

For all k, we have Jµk ≥ Ĵ ∈ S, where Ĵ is the optimal cost function
over S-regular policies [cf. Assumption 3.2.1(b)]. It follows that J∞ ≥ Ĵ, and by
Assumption 3.2.1(a), we obtain J∞ ∈ S. By taking the limit in Eq. (3.1), we
have

    J∞ = lim_{k→∞} T Jµk ≥ T J∞,   (3.2)

where the inequality follows from the fact Jµk ↓ J∞. Using also the given as-
sumption, we have for all x ∈ X and u ∈ U(x),

    H(x, u, J∞) = lim_{k→∞} H(x, u, Jµk) ≥ lim_{k→∞} (T Jµk)(x) = J∞(x).

By taking the infimum of the left-hand side over u ∈ U(x), we obtain T J∞ ≥ J∞,
which combined with Eq. (3.2), yields J∞ = T J∞. Since J* is the unique fixed
point of T within S, we obtain J∞ = J*.

CHAPTER 4

4.1 (Example of Nonexistence of an Optimal Policy Under D)

Since a cost is incurred only upon stopping, and the stopping cost is greater than
−1, we have Jµ(x) > −1 for all x and µ. On the other hand, starting from any
state x and stopping at x + n yields a cost −1 + 1/(x + n), so by taking n sufficiently
large, we can attain a cost arbitrarily close to −1. Thus J*(x) = −1 for all x, but
no policy can attain this optimal cost.

4.2 (Counterexample for Optimality Condition Under D)

We have J*(x) = −1 and Jµ(x) = 0 for all x ∈ X. Thus µ is nonoptimal, yet
attains the minimum in Bellman's equation

    J*(x) = min{ J*(x + 1), −1 + 1/x }

for all x.

4.3 (Counterexample for Optimality Condition Under I)

The verification of Tµ Jµ = T Jµ is straightforward. To show that J*(x) = |x|,
we first note that |x| is a fixed point of T, so by Prop. 4.3.2, J*(x) ≤ |x|.
Also (T J̄)(x) = |x| for all x, while under Assumption I, we have J* ≥ T J̄, so
J*(x) ≥ |x|. Hence J*(x) = |x|.

4.4 (Solution by Mathematical Programming)

(a) Any feasible solution z of the given optimization problem satisfies z ≥ J¯ as


well as zi ≥ inf u∈U (i) H(i, u, z) for all i = 1, . . . , n, so that z ≥ T z. It follows
from Prop. 4.3.3 that z ≥ J ∗ , which implies that J ∗ is an optimal solution of the
given optimization problem. Also J* is the unique optimal solution, since if z is
feasible and z ≠ J*, the inequality z ≥ J* implies that Σ_i z_i > Σ_i J*(i), so z
cannot be optimal.
(b) Any feasible solution z of the given optimization problem satisfies z ≤ J¯ as
well as zi ≤ H(i, u, z) for all i = 1, . . . , n and u ∈ U (i), so that z ≤ T z. It follows
from Prop. 4.3.6 that z ≤ J ∗ , which implies that J ∗ is an optimal solution of
the given optimization problem. Similar to part (a), J ∗ is the unique optimal
solution.

4.5 (Semicontractive Discounted Problems with Unbounded Cost per Stage)

(a) See Exercise 2.4.


(b) Since all policies in M are S-regular and there exists an optimal policy within
M, it follows that Prop. 4.4.1 applies, so that J ∗ is the unique fixed point of T
within S. Similarly, the assumption that for each J ∈ S there exists µ ∈ M such
that Tµ J = T J, and the structure of H and S imply that Prop. 4.4.2 applies.

4.6 (Blackmailer’s Dilemma)

(a) From Exercise 3.1, the cost function of any policy µ is

    Jµ(1) = −1/µ(1),

so the policy evaluation equation given in part (a) is correct. Moreover, we have
Jµ(1) ≤ −1 since µ(1) ∈ (0, 1]. The policy improvement equation is

    µk+1(1) ∈ arg min_{u∈(0,1]} { −u + (1 − u²) Jµk(1) }.   (4.1)

By setting the gradient of the expression within braces to 0,

    0 = −1 − 2u Jµk(1),

we see that its unconstrained minimum is

    uk = −1 / (2 Jµk(1)),

which is less than or equal to 1/2 since Jµ(1) ≤ −1 for all µ. Hence uk is equal to
the constrained minimum in Eq. (4.1), and we have

    µk+1(1) = −1 / (2 Jµk(1)).

(b) Follows from Props. 4.3.14 and 4.3.15.
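A short sketch of the recursion derived in part (a) (the starting demand below is an arbitrary choice): each policy improvement halves the blackmailer's demand, and the corresponding policy costs decrease without bound, so no generated policy is optimal.

    # policy iteration for the blackmailer's problem, using the formulas of part (a):
    # evaluation J_{mu^k}(1) = -1/mu^k(1), improvement mu^{k+1}(1) = -1/(2 J_{mu^k}(1))
    u = 1.0                                   # mu^0(1), an arbitrary starting demand in (0, 1]
    for k in range(10):
        J = -1.0 / u                          # policy evaluation
        u = -1.0 / (2.0 * J)                  # policy improvement, cf. Eq. (4.1)
        print(k, u, J)
    # the demands are halved at each iteration and the costs J_{mu^k}(1) = -2**k
    # decrease without bound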

4.7 (Counterexample for Policy Improvement Under D - Infinite State Space)

(a) The policy µ that stops at every state has cost function

    Jµ(x) = −1 + 1/x,   x ∈ X.

Policy improvement starting with µ yields µ̄ with

    µ̄(x) ∈ arg min{ Jµ(x), −1 + 1/x },

so µ̄(x) can be either to continue or to stop at every x. Let µ̄ be to continue at
every x. Then Jµ̄(x) = 0 > Jµ(x) for all x. Moreover, the next policy obtained
from µ̄ by policy improvement is µ.
(b) Follows from Props. 4.3.14 and 4.3.15.

4.8 (Counterexample for Policy Improvement Under D - Finite State Space)

(a) Essentially the same as the one of Exercise 4.7.


(b) Straightforward.

4.9 (Infinite Time Reachability [Ber72])

Let Ê(X) be the subset of E(X) that consists of functions that take only the two
values 0 and ∞, and for all J ∈ Ê(X) denote

    D(J) = { x ∈ X | J(x) = 0 }.

Note that for all J ∈ Ê(X) we have Tµ J ∈ Ê(X), T J ∈ Ê(X), and that

    D(Tµ J) = { x ∈ X | x ∈ D(J), f(x, µ(x), w) ∈ D(J), ∀ w ∈ W(x, µ(x)) },

D(T J) = ∪µ∈M D(Tµ J).

(a) For all J ∈ Ê(X), we have D(Tµ J) ⊂ D(J) and Tµ J ≥ J, so condition (1) of
Assumption I holds, and it is easily verified that the remaining two conditions of
Assumption I also hold. We have J¯ ∈ Ê(X), so for any policy π = {µ0 , µ1 , . . .},
we have Tµ0 · · · Tµk J¯ ∈ Ê(X). It follows that Jπ , given by

    Jπ = lim_{k→∞} Tµ0 · · · Tµk J̄,

also belongs to Ê(X), and the same is true for J ∗ = inf π∈Π Jπ . Thus J ∗ has the
given form with D(J ∗ ) = X ∗ .
(b) Since {T^k J̄} is monotonically nondecreasing, we have D(T^{k+1} J̄) ⊂ D(T^k J̄),
or equivalently Xk+1 ⊂ Xk for all k. Generally, for a sequence {Jk} ⊂ Ê(X), if
Jk ↑ J, we have J ∈ Ê(X) and D(J) = ∩_{k=0}^∞ D(Jk). Thus convergence of VI (i.e.,
T^k J̄ ↑ J*) is equivalent to D(J*) = ∩_{k=0}^∞ D(T^k J̄), or X* = ∩_{k=0}^∞ Xk.

(c) The compactness condition of Prop. 4.3.14 guarantees that T^k J̄ ↑ J*, or
equivalently by part (b), X* = ∩_{k=0}^∞ Xk. This condition requires that the sets

    Uk(x, λ) = { u ∈ U(x) | H(x, u, T^k J̄) ≤ λ }

are compact for every x ∈ X, λ ∈ ℜ, and for all k greater than some integer k̄.
It can be seen that Uk(x, λ) is equal to the set

    Ûk(x) = { u ∈ U(x) | f(x, u, w) ∈ Xk, ∀ w ∈ W(x, u) }
given in the statement of the exercise.
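The set iteration Xk+1 = {x ∈ Xk | there exists u ∈ U(x) with f(x,u,w) ∈ Xk for all w ∈ W(x,u)} behind parts (a)-(c) is straightforward to run on a finite example. Everything below (states, controls, successor sets) is an illustrative assumption of the sketch; state 4 plays the role of leaving the set X within which the state must be kept.

    # reachability set iteration for a toy finite system (all data illustrative)
    X = {0, 1, 2, 3}                          # the set within which the state must be kept
    U = {0: [0, 1], 1: [0, 1], 2: [0, 1], 3: [0]}
    f = {(0, 0): {0, 1}, (0, 1): {2, 4},      # f[(x,u)] = {f(x,u,w) : w in W(x,u)}
         (1, 0): {0, 4}, (1, 1): {2},
         (2, 0): {2},    (2, 1): {3},
         (3, 0): {1, 4}}                      # state 4 is outside X

    Xk, history = set(X), []
    while True:
        history.append(sorted(Xk))
        Xnext = {x for x in Xk if any(f[(x, u)] <= Xk for u in U[x])}
        if Xnext == Xk:
            break
        Xk = Xnext
    print(history, sorted(Xk))                # here X_0 = {0,1,2,3} and X_1 = X* = {0,1,2}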


References

[ABB02] Abounadi, J., Bertsekas, D. P., and Borkar, V. S., 2002. “Stochastic
Approximation for Non-Expansive Maps: Q-Learning Algorithms,” SIAM J. on
Control and Opt., Vol. 41, pp. 1-22.
[BBB08] Basu, A., Bhattacharyya, and Borkar, V., 2008. “A Learning Algorithm
for Risk-Sensitive Cost,” Math. of OR, Vol. 33, pp. 880-898.
[BBD10] Busoniu, L., Babuska, R., De Schutter, B., and Ernst, D., 2010. Rein-
forcement Learning and Dynamic Programming Using Function Approximators,
CRC Press, N. Y.
[Bau78] Baudet, G. M., 1978. “Asynchronous Iterative Methods for Multiproces-
sors,” Journal of the ACM, Vol. 25, pp. 226-244.
[BeI96] Bertsekas, D. P., and Ioffe, S., 1996. “Temporal Differences-Based Policy
Iteration and Applications in Neuro-Dynamic Programming,” Lab. for Info. and
Decision Systems Report LIDS-P-2349, MIT.
[BeS78] Bertsekas, D. P., and Shreve, S. E., 1978. Stochastic Optimal Control:
The Discrete Time Case, Academic Press, N. Y.; may be downloaded from
http://web.mit.edu/dimitrib/www/home.html
[BeT89] Bertsekas, D. P., and Tsitsiklis, J. N., 1989. Parallel and Distributed
Computation: Numerical Methods, Prentice-Hall, Engl. Cliffs, N. J.; may be
downloaded from http://web.mit.edu/dimitrib/www/home.html
[BeT91] Bertsekas, D. P., and Tsitsiklis, J. N., 1991. “An Analysis of Stochastic
Shortest Path Problems,” Math. of OR, Vol. 16, pp. 580-595.
[BeT96] Bertsekas, D. P., and Tsitsiklis, J. N., 1996. Neuro-Dynamic Program-
ming, Athena Scientific, Belmont, MA.
[BeT08] Bertsekas, D. P., and Tsitsiklis, J. N., 2008. Introduction to Probability,
2nd Ed., Athena Scientific, Belmont, MA.
[BeY07] Bertsekas, D. P., and Yu, H., 2007. “Solution of Large Systems of Equa-
tions Using Approximate Dynamic Programming Methods,” Lab. for Info. and
Decision Systems Report LIDS-P-2754, MIT.
[BeY09] Bertsekas, D. P., and Yu, H., 2009. “Projected Equation Methods for Ap-
proximate Solution of Large Linear Systems,” J. of Computational and Applied
Mathematics, Vol. 227, pp. 27-50.
[BeY10a] Bertsekas, D. P., and Yu, H., 2010. “Q-Learning and Enhanced Policy
Iteration in Discounted Dynamic Programming,” Lab. for Info. and Decision
Systems Report LIDS-P-2831, MIT; Math. of OR, Vol. 37, 2012, pp. 66-94.


[BeY10b] Bertsekas, D. P., and Yu, H., 2010. “Asynchronous Distributed Policy
Iteration in Dynamic Programming,” Proc. of Allerton Conf. on Communication,
Control and Computing, Allerton Park, Ill, pp. 1368-1374.
[Ber72] Bertsekas, D. P., 1972. “Infinite Time Reachability of State Space Regions
by Using Feedback Control,” IEEE Trans. Aut. Control, Vol. AC-17, pp. 604-613.
[Ber75] Bertsekas, D. P., 1975. “Monotone Mappings in Dynamic Programming,”
1975 IEEE Conference on Decision and Control, pp. 20-25.
[Ber77] Bertsekas, D. P., 1977. “Monotone Mappings with Application in Dy-
namic Programming,” SIAM J. on Control and Opt., Vol. 15, pp. 438-464.
[Ber82] Bertsekas, D. P., 1982. “Distributed Dynamic Programming,” IEEE Trans.
Aut. Control, Vol. AC-27, pp. 610-616.
[Ber83] Bertsekas, D. P., 1983. “Asynchronous Distributed Computation of Fixed
Points,” Math. Programming, Vol. 27, pp. 107-120.
[Ber87] Bertsekas, D. P., 1987. Dynamic Programming: Deterministic and Stochas-
tic Models, Prentice-Hall, Englewood Cliffs, N. J.
[Ber05a] Bertsekas, D. P., 2005. Dynamic Programming and Optimal Control,
Vol. I, 3rd Edition, Athena Scientific, Belmont, MA.
[Ber05b] Bertsekas, D. P., 2005. “Dynamic Programming and Suboptimal Con-
trol: A Survey from ADP to MPC,” Fundamental Issues in Control, Special Issue
for the CDC-ECC 05, European J. of Control, Vol. 11, Nos. 4-5.
[Ber09] Bertsekas, D. P., 2009. Convex Optimization Theory, Athena Scientific,
Belmont, MA.
[Ber10] Bertsekas, D. P., 2010. “Williams-Baird Counterexample for Q-Factor
Asynchronous Policy Iteration,”
http://web.mit.edu/dimitrib/www/Williams-Baird Counterexample.pdf
[Ber11a] Bertsekas, D. P., 2011. “Temporal Difference Methods for General Pro-
jected Equations,” IEEE Trans. Aut. Control, Vol. 56, pp. 2128-2139.
[Ber11b] Bertsekas, D. P., 2011. “λ-Policy Iteration: A Review and a New Im-
plementation,” Lab. for Info. and Decision Systems Report LIDS-P-2874, MIT;
appears in Reinforcement Learning and Approximate Dynamic Programming for
Feedback Control, by F. Lewis and D. Liu (eds.), IEEE Press, 2012.
[Ber11c] Bertsekas, D. P., 2011. “Approximate Policy Iteration: A Survey and
Some New Methods,” J. of Control Theory and Applications, Vol. 9, pp. 310-335.
[Ber12a] Bertsekas, D. P., 2012. Dynamic Programming and Optimal Control,
Vol. II, 4th Edition: Approximate Dynamic Programming, Athena Scientific,
Belmont, MA.
[Ber12b] Bertsekas, D. P., 2012. “Weighted Sup-Norm Contractions in Dynamic
Programming: A Review and Some New Applications,” Lab. for Info. and Deci-
sion Systems Report LIDS-P-2884, MIT.
[Bla65] Blackwell, D., 1965. “Positive Dynamic Programming,” Proc. Fifth Berke-
ley Symposium Math. Statistics and Probability, pp. 415-418.
[BoM99] Borkar, V. S., Meyn, S. P., 1999. “Risk Sensitive Optimal Control:
Existence and Synthesis for Models with Unbounded Cost,” SIAM J. Control
and Opt., Vol. 27, pp. 192-209.
[BoM00] Borkar, V. S., Meyn, S. P., 2000. “The O.D.E. Method for Convergence
of Stochastic Approximation and Reinforcement Learning,” SIAM J. Control and
Opt., Vol. 38, pp. 447-469.
[BoM02] Borkar, V. S., Meyn, S. P., 2002. “Risk-Sensitive Optimal Control for
Markov Decision Processes with Monotone Cost,” Math. of OR, Vol. 27, pp.
192-209.
[Bor98] Borkar, V. S., 1998. “Asynchronous Stochastic Approximation,” SIAM
J. Control Opt., Vol. 36, pp. 840-851.
[Bor08] Borkar, V. S., 2008. Stochastic Approximation: A Dynamical Systems
Viewpoint, Cambridge Univ. Press, N. Y.
[CFH07] Chang, H. S., Fu, M. C., Hu, J., Marcus, S. I., 2007. Simulation-Based
Algorithms for Markov Decision Processes, Springer, N. Y.
[CaM88] Carraway, R. L., and Morin, T. L., 1988. “Theory and Applications of
Generalized Dynamic Programming: An Overview,” Computers and Mathemat-
ics with Applications, Vol. 16, pp. 779-788.
[CaR12] Cavus, O., and Ruszczynski, A., 2012. “Risk-Averse Control of Undis-
counted Transient Markov Models,” Rutgers Univ. Report, available on Opti-
mization On-Line at http://www.optimization-online.org/
[CaR13] Canbolat, P. G., and Rothblum, U. G., 2013. “(Approximate) Iterated
Successive Approximations Algorithm for Sequential Decision Processes,” Annals
of Operations Research, Vol. 208, pp. 309-320.
[Cao07] Cao, X. R., 2007. Stochastic Learning and Optimization: A Sensitivity-
Based Approach, Springer, N. Y.
[ChM69] Chazan D., and Miranker, W., 1969. “Chaotic Relaxation,” Linear Al-
gebra and Applications, Vol. 2, pp. 199-222.
[ChS87] Chung, K.-J., and Sobel, M. J., 1987. “Discounted MDPs: Distribution
Functions and Exponential Utility Maximization,” SIAM J. Control and Opt.,
Vol. 25, pp. 49-62.
[CoM99] Coraluppi, S. P., and Marcus, S. I., 1999. “Risk-Sensitive and Minimax
Control of Discrete-Time, Finite-State Markov Decision Processes,” Automatica,
Vol. 35, pp. 301-309.
[DeM67] Denardo, E. V., and Mitten, L. G., 1967. “Elements of Sequential De-
cision Processes,” J. Indust. Engrg., Vol. 18, pp. 106-112.
[DeR79] Denardo, E. V., and Rothblum, U. G., 1979. “Optimal Stopping, Ex-
ponential Utility, and Linear Programming,” Math. Programming, Vol. 16, pp.
228-244.
[Den67] Denardo, E. V., 1967. “Contraction Mappings in the Theory Underlying
Dynamic Programming,” SIAM Review, Vol. 9, pp. 165-177.
[Der70] Derman, C., 1970. Finite State Markovian Decision Processes, Academic
Press, N. Y.
[DuS65] Dubins, L., and Savage, L. M., 1965. How to Gamble If You Must,
McGraw-Hill, N. Y.
[DyY79] Dynkin, E. B., and Yushkevich, A. A., 1979. Controlled Markov Pro-
cesses, Springer-Verlag, Berlin and New York.
[FeM97] Fernandez-Gaucherand, E., and Marcus, S. I., 1997. “Risk-Sensitive
Optimal Control of Hidden Markov Models: Structural Results,” IEEE Trans.
Aut. Control, Vol. AC-42, pp. 1418-1422.


[FiV96] Filar, J., and Vrieze, K., 1996. Competitive Markov Decision Processes,
Springer, N. Y.
[FlM95] Fleming, W. H., and McEneaney, W. M., 1995. “Risk-Sensitive Control
on an Infinite Time Horizon,” SIAM J. Control and Opt., Vol. 33, pp. 1881-1915.
[Gos03] Gosavi, A., 2003. Simulation-Based Optimization: Parametric Optimiza-
tion Techniques and Reinforcement Learning, Springer, N. Y.
[HCP99] Hernandez-Lerma, O., Carrasco, O., and Perez-Hernandez, 1999. “Mar-
kov Control Processes with the Expected Total Cost Criterion: Optimality, Sta-
bility, and Transient Models,” Acta Appl. Math., Vol. 59, pp. 229-269.
[Hay08] Haykin, S., 2008. Neural Networks and Learning Machines, (3rd Edition),
Prentice-Hall, Englewood-Cliffs, N. J.
[HeL99] Hernandez-Lerma, O., and Lasserre, J. B., 1999. Further Topics on
Discrete-Time Markov Control Processes, Springer, N. Y.
[HeM96] Hernandez-Hernandez, D., and Marcus, S. I., 1996. “Risk Sensitive Con-
trol of Markov Processes in Countable State Space,” Systems and Control Letters,
Vol. 29, pp. 147-155.
[HiW05] Hinderer, K., and Waldmann, K.-H., 2005. “Algorithms for Countable
State Markov Decision Models with an Absorbing Set,” SIAM J. of Control and
Opt., Vol. 43, pp. 2109-2131.
[HoM72] Howard, R. S., and Matheson, J. E., 1972. “Risk-Sensitive Markov
Decision Processes,” Management Science, Vol. 8, pp. 356-369.
[JBE94] James, M. R., Baras, J. S., Elliott, R. J., 1994. “Risk-Sensitive Control
and Dynamic Games for Partially Observed Discrete-Time Nonlinear Systems,”
IEEE Trans. Aut. Control, Vol. AC-39, pp. 780-792.
[JaC06] James, H. W., and Collins, E. J., 2006. “An Analysis of Transient Markov
Decision Processes,” J. Appl. Prob., Vol. 43, pp. 603-621.
[Jac73] Jacobson, D. H., 1973. “Optimal Stochastic Linear Systems with Ex-
ponential Performance Criteria and their Relation to Deterministic Differential
Games,” IEEE Transactions on Automatic Control, Vol. AC-18, pp. 124-131.
[Kle68] Kleinman, D. L., 1968. “On an Iterative Technique for Riccati Equation
Computations,” IEEE Trans. Aut. Control, Vol. AC-13, pp. 114-115.
[Mey07] Meyn, S., 2007. Control Techniques for Complex Networks, Cambridge
Univ. Press, N. Y.
[Mit64] Mitten, L. G., 1964. “Composition Principles for Synthesis of Optimal
Multistage Processes,” Operations Research, Vol. 12, pp. 610-619.
[Mor82] Morin, T. L., 1982. “Monotonicity and the Principle of Optimality,” J.
of Math. Analysis and Applications, Vol. 88, pp. 665-674.
[OrR70] Ortega, J. M., and Rheinboldt, W. C., 1970. Iterative Solution of Non-
linear Equations in Several Variables, Academic Press, N. Y.
[PaB99] Patek, S. D., and Bertsekas, D. P., 1999. “Stochastic Shortest Path
Games,” SIAM J. on Control and Opt., Vol. 36, pp. 804-824.
[Pal67] Pallu de la Barriere, R., 1967. Optimal Control Theory, Saunders, Phila;
republished by Dover, N. Y., 1980.

[Pat01] Patek, S. D., 2001. “On Terminating Markov Decision Processes with a
Risk Averse Objective Function,” Automatica, Vol. 37, pp. 1379-1386.
[Pat07] Patek, S. D., 2007. “Partially Observed Stochastic Shortest Path Prob-
lems with Approximate Solution by Neuro-Dynamic Programming,” IEEE Trans.
on Systems, Man, and Cybernetics Part A, Vol. 37, pp. 710-720.
[Pli78] Pliska, S. R., 1978. “On the Transient Case for Markov Decision Chains
with General State Spaces,” in Dynamic Programming and its Applications, by
M. L. Puterman (ed.), Academic Press, N. Y.
[Pow07] Powell, W. B., 2007. Approximate Dynamic Programming: Solving the
Curses of Dimensionality, J. Wiley and Sons, Hoboken, N. J; 2nd ed., 2011.
[Put94] Puterman, M. L., 1994. Markovian Decision Problems, J. Wiley, N. Y.
[Roc70] Rockafellar, R. T., 1970. Convex Analysis, Princeton Univ. Press, Prince-
ton, N. J.
[Ros67] Rosenfeld, J., 1967. “A Case Study on Programming for Parallel Proces-
sors,” Research Report RC-1864, IBM Res. Center, Yorktown Heights, N. Y.
[Rot79] Rothblum, U. G., 1979. “Iterated Successive Approximation for Sequen-
tial Decision Processes,” in Stochastic Control and Optimization, by J. W. B.
van Overhagen and H. C. Tijms (eds), Vrije University, Amsterdam.
[Rot84] Rothblum, U. G., 1984. “Multiplicative Markov Decision Chains,” Math.
of OR, Vol. 9, pp. 6-24.
[Rus10] Ruszczynski, A., 2010. “Risk-Averse Dynamic Programming for Markov
Decision Processes,” Math. Programming, Ser. B, Vol. 125, pp. 235-261.
[ScL12] Scherrer, B., and Lesner, B., 2012. “On the Use of Non-Stationary Policies
for Stationary Infinite-Horizon Markov Decision Processes,” NIPS 2012 - Neural
Information Processing Systems, South Lake Tahoe, Ne.
[Sch75] Schal, M., 1975. “Conditions for Optimality in Dynamic Programming
and for the Limit of n-Stage Optimal Policies to be Optimal,” Z. Wahrschein-
lichkeitstheorie und Verw. Gebiete, Vol. 32, pp. 179-196.
[Sch11] Scherrer, B., 2011. “Performance Bounds for Lambda Policy Iteration
and Application to the Game of Tetris,” Report RR-6348, INRIA, France; to
appear in J. of Machine Learning Research.
[Sch12] Scherrer, B., 2012. “On the Use of Non-Stationary Policies for Infinite-
Horizon Discounted Markov Decision Processes,” INRIA Lorraine Report, France.
[Sha53] Shapley, L. S., 1953. “Stochastic Games,” Proc. Nat. Acad. Sci. U.S.A.,
Vol. 39.
[Str66] Strauch, R., 1966. “Negative Dynamic Programming,” Ann. Math. Statist.,
Vol. 37, pp. 871-890.
[Str75] Striebel, 1975. Optimal Control of Discrete Time Stochastic Systems,
Springer-Verlag, Berlin and New York.
[SuB98] Sutton, R. S., and Barto, A. G., 1998. Reinforcement Learning, MIT
Press, Cambridge, MA.
[Sze98a] Szepesvari, C., 1998. Static and Dynamic Aspects of Optimal Sequential
Decision Making, Ph.D. Thesis, Bolyai Institute of Mathematics, Hungary.
[Sze98b] Szepesvari, C., 1998. “Non-Markovian Policies in Sequential Decision
Problems,” Acta Cybernetica, Vol. 13, pp. 305-318.

[Sze10] Szepesvari, C., 2010. Algorithms for Reinforcement Learning, Morgan
and Claypool Publishers, San Francisco, CA.
[TBA86] Tsitsiklis, J. N., Bertsekas, D. P., and Athans, M., 1986. “Distributed
Asynchronous Deterministic and Stochastic Gradient Optimization Algorithms,”
IEEE Trans. Aut. Control, Vol. AC-31, pp. 803-812.
[ThS10a] Thiery, C., and Scherrer, B., 2010. “Least-Squares λ-Policy Iteration:
Bias-Variance Trade-off in Control Problems,” in ICML’10: Proc. of the 27th
Annual International Conf. on Machine Learning.
[ThS10b] Thiery, C., and Scherrer, B., 2010. “Performance Bound for Approxi-
mate Optimistic Policy Iteration,” Technical Report, INRIA, France.
[Tsi94] Tsitsiklis, J. N., 1994. “Asynchronous Stochastic Approximation and Q-
Learning,” Machine Learning, Vol. 16, pp. 185-202.
[VVL13] Vrabie, V., Vamvoudakis, K. G., and Lewis, F. L., 2013. Optimal Adap-
tive Control and Differential Games by Reinforcement Learning Principles, The
Institution of Engineering and Technology, London.
[VeP87] Verdu, S., and Poor, H. V., 1987. “Abstract Dynamic Programming
Models under Commutativity Conditions,” SIAM J. on Control and Opt., Vol.
25, pp. 990-1006.
[Vei66] Veinott, A. F., 1966. “On Finding Optimal Policies in Discrete Dynamic
Programming with no Discounting,” Ann. Math. Stat., Vol. 37, pp. 1284-1294.
[Whi80] Whittle, P., 1980. “Stability and Characterization Conditions in Negative
Programming,” Journal of Applied Probability, Vol. 17, pp. 635-645.
[Whi90] Whittle, P., 1990. Risk-Sensitive Optimal Control, Wiley, Chichester.
[WiB93] Williams, R. J., and Baird, L. C., 1993. “Analysis of Some Incremen-
tal Variants of Policy Iteration: First Steps Toward Understanding Actor-Critic
Learning Systems,” Report NU-CCS-93-11, College of Computer Science, North-
eastern University, Boston, MA.
[YuB10] Yu, H., and Bertsekas, D. P., 2010. “Error Bounds for Approximations
from Projected Linear Equations,” Math. of OR, Vol. 35, pp. 306-329.
[YuB11a] Yu, H., and Bertsekas, D. P., 2011. “Q-Learning and Policy Iteration
Algorithms for Stochastic Shortest Path Problems,” Lab. for Info. and Decision
Systems Report LIDS-P-2871, MIT; Annals of Operations Research, Vol. 208,
2013, pp. 95-132.
[YuB11b] Yu, H., and Bertsekas, D. P., 2011. “On Boundedness of Q-Learning
Iterates for Stochastic Shortest Path Problems,” Lab. for Info. and Decision Sys-
tems Report LIDS-P-2859, MIT; to appear in Math. of OR.
[YuB12] Yu, H., and Bertsekas, D. P., 2012. “Weighted Bellman Equations and
their Applications in Dynamic Programming,” Lab. for Info. and Decision Sys-
tems Report LIDS-P-2876, MIT.
[Yu11] Yu, H., 2011. “Stochastic Shortest Path Games and Q-Learning,” Lab.
for Info. and Decision Systems Report LIDS-P-2875, MIT.
[Yu12] Yu, H., 2012. “Least Squares Temporal Difference Methods: An Analysis
Under General Conditions,” SIAM J. Control and Opt., Vol. 50, pp. 3310-3343.
[Zac64] Zachrisson, L. E., 1964. “Markov Games,” in Advances in Game Theory,
by M. Dresher, L. S. Shapley, and A. W. Tucker, (eds.), Princeton Univ. Press,
Princeton, N. J., pp. 211-253.
INDEX

A
Affine monotonic model, 171
Aggregation, 21
Aggregation, distributed, 19
Aggregation, multistep, 22
Aggregation equation, 21
Analytic set, 222
Analytically measurable, 224
Asynchronous algorithms, 25, 61, 80, 116, 118, 157
Asynchronous convergence theorem, 64, 82, 117, 123, 157
Asynchronous policy iteration, 67, 76, 118
Asynchronous value iteration, 61, 117, 157

B
Bellman's equation, 6, 73, 86, 90, 92-94, 99, 108, 111, 115, 133, 137, 138, 147
Borel space, 221
Borel space model, 200
Borel measurable, 200, 208, 222
Box condition, 64

C
Cauchy sequence, 208
Complete space, 208, 211
Contraction assumption, 8, 31, 199
Contraction mapping fixed-point theorem, 209, 210
Contraction mapping, 207, 210, 213
Contractive model, 23, 29, 198
Cost function, 6, 10, 34, 88, 131, 132

D
δ-S-regular, 112
δ-S-irregular, 112
Discounted MDP, 12, 83, 167, 183
Distributed computation, 61, 117

E
ε-optimal policy, 33, 132, 136, 143, 152, 153, 195, 197, 217
Error amplification, 45, 50, 61, 79
Error bounds, 35, 37, 40, 51, 58, 79
Euclidean norm, 208
Exponential cost function, 19, 175, 182

F
Finite-horizon problems, 133, 196
First passage problem, 16
Fixed point, 208

G
Games, dynamic, 13, 78
Gauss-Seidel method, 25, 61
Geometric convergence rate, 210

H
Hard aggregation, 22

I
Imperfect state information, 127
Improper policy, vii, 16, 86
Interpolation, 78

J, K

L
λ-aggregation, 23
λ-policy iteration, 80, 159, 166, 181
LSPE(λ), 22
LSTD(λ), 22
Least squares approximation, 45
Limited lookahead policy, 37
Linear contraction mappings, 207, 213
Linear-quadratic problems, 168
Lower semianalytic, 192, 200, 222

M
MDP, 10
Markovian decision problem, see MDP
Mathematical programming, 84, 183
Measurability issues, 188, 216
Measurable selection, 191, 194, 226
Minimax problems, 15
Modulus of contraction, 207
Monotone mapping, 205
Monotone decrease assumption, 140
Monotone increase assumption, 139
Monotonicity assumption, 7, 30, 87, 131, 193
Multiplicative model, 18
Multistage lookahead, 39
Multistep aggregation, 23
Multistep mapping, 23, 26, 27
Multistep methods, 22

N
Negative DP model, 25, 130, 140, 180
Noncontractive model, 24, 129
Nonmonotonic-contractive model, 83
Nonstationary policy, 5, 34, 87, 94, 142

O
ODE approach, 81
Oblique projection, 23
Optimality conditions, 32, 90, 151

P
p-ε-optimality, 191
Parallel computation, 61
Partially asynchronous algorithms, 63
Periodic policies, 40, 81
Perturbations, 107, 124, 126
Policy, 5, 30, 34, 87, 131, 188
Policy evaluation, 21, 46, 68, 76, 77, 120
Policy improvement, 46, 68, 76, 77, 120
Policy iteration, 9, 25, 46, 96, 157, 164
Policy iteration, approximate, 48
Policy iteration, constrained, 19
Policy iteration, convergence, 48, 128
Policy iteration, modified, 79
Policy iteration, optimistic, 52, 67, 69, 73, 77, 78, 79, 158, 165
Policy iteration, perturbations, 124, 126
Positive DP model, 25, 130, 140, 180
Projected equation, 21, 45
Proper policy, vii, 16, 86, 115

Q
Q-factor, 72, 73
Q-learning, 80, 126

R
Reachability, 185
Reduced space implementation, 77, 123
Reinforcement learning, 21
Restricted policies model, 24, 188
Risk-sensitive model, 18, 176

S
SSP problems, 15, 115, 161, 166, 171, 175
S-irregular policy, 89, 163, 172
S-regular policy, 89, 163, 172
Search problems, 107, 110, 176
Semi-Markov problem, 13
Seminorm projection, 23
Semicontinuous model, 201
Semicontractive model, 24, 85, 179
Shortest path problem, 17, 97, 178
Simulation, 62
Spectral radius, 207
Stationary policy, 5, 30, 34
Stochastic approximation, 81
Stochastic kernel, 224
Stochastic shortest path problems, see SSP problems
Stopping problems, 73, 107, 176
Synchronous convergence condition, 64

T
TD(λ), 22
Temporal differences, 22
Totally asynchronous algorithms, 63, 80
Transient programming problem, 16

U
Uniform fixed point, 72, 121
Uniformly N-stage optimal, 133
Unit function, 205
Universally measurable, 191, 200, 223
Universally measurable selection, 226

V
Value iteration, 9, 25, 42, 43, 90, 103, 154, 163, 174

W
Weighted multistep mapping, 22, 27
Weighted sup norm, 211, 212
Weighted sup-norm contraction, 211
