
Reinforcement Learning and Optimal Control

by
Dimitri P. Bertsekas
Massachusetts Institute of Technology

Chapter 1
Exact Dynamic Programming
DRAFT
This is Chapter 1 of the draft textbook “Reinforcement Learning and
Optimal Control.” The chapter represents “work in progress,” and it will
be periodically updated. It more than likely contains errors (hopefully not
serious ones). Furthermore, its references to the literature are incomplete.
Your comments and suggestions to the author at [email protected] are
welcome. The date of last revision is given below.

December 14, 2018



Exact Dynamic Programming

Contents

1.1. Deterministic Dynamic Programming . . . . . . . . . p. 2


1.1.1. Deterministic Problems . . . . . . . . . . . . p. 2
1.1.2. The Dynamic Programming Algorithm . . . . . . p. 7
1.1.3. Approximation in Value Space . . . . . . . . . p. 12
1.1.4. Model-Free Approximate Solution - Q-Learning . . p. 13
1.2. Stochastic Dynamic Programming . . . . . . . . . . . p. 14
1.3. Examples, Variations, and Simplifications . . . . . . . p. 17
1.3.1. Deterministic Shortest Path Problems . . . . . . p. 18
1.3.2. Discrete Deterministic Optimization . . . . . . . p. 19
1.3.3. Problems with a Terminal State . . . . . . . . p. 23
1.3.4. Forecasts . . . . . . . . . . . . . . . . . . . p. 26
1.3.5. Problems with Uncontrollable State Components . p. 27
1.3.6. Partial State Information and Belief States . . . . p. 32
1.3.7. Linear Quadratic Optimal Control . . . . . . . . p. 35
1.4. Reinforcement Learning and Optimal Control - Some Terminology . . p. 38
1.5. Notes and Sources . . . . . . . . . . . . . . . . . p. 40


In this chapter, we provide some background on exact dynamic program-


ming (DP for short), with a view towards the suboptimal solution methods
that are the main subject of this book. These methods are known by
several essentially equivalent names: reinforcement learning, approximate
dynamic programming, and neuro-dynamic programming. In this book, we
will use primarily the most popular name: reinforcement learning (RL for
short).
We first consider finite horizon problems, which involve a finite se-
quence of successive decisions, and are thus conceptually and analytically
simpler. We defer the discussion of the more intricate infinite horizon
problems to Chapter 4 and later chapters. We also discuss separately de-
terministic and stochastic problems (Sections 1.1 and 1.2, respectively).
The reason is that deterministic problems are simpler and lend themselves
better as an entry point to the optimal control methodology. Moreover,
they have some favorable characteristics, which allow the application of a
broader variety of methods. For example simulation-based methods are
greatly simplified and sometimes better understood in the context of de-
terministic optimal control.
Finally, in Section 1.3 we provide various examples of DP formula-
tions, illustrating some of the concepts of Sections 1.1 and 1.2. The reader
with substantial background in DP may wish to just scan Section 1.3 and
skip to the next chapter, where we start the development of the approxi-
mate DP methodology.

1.1 DETERMINISTIC DYNAMIC PROGRAMMING

All DP problems involve a discrete-time dynamic system that generates a


sequence of states under the influence of control. In finite horizon problems
the system evolves over a finite number N of time steps (also called stages).
The state and control at time k are denoted by xk and uk , respectively. In
deterministic systems, xk+1 is generated nonrandomly, i.e., it is determined
solely by xk and uk .

1.1.1 Deterministic Problems

A deterministic DP problem involves a discrete-time dynamic system of


the form
xk+1 = fk (xk , uk ), k = 0, 1, . . . , N − 1, (1.1)
where
k is the time index,
xk is the state of the system, an element of some space,
uk is the control or decision variable, to be selected at time k from some
given set Uk (xk ) that depends on xk ,

fk is a function of (xk , uk ) that describes the mechanism by which the
state is updated from time k to time k + 1,
N is the horizon or number of times control is applied.
The space of all possible xk is called the state space at time k. It can
be any set and can depend on k; this generality is one of the great strengths
of the DP methodology. Similarly, the space of all possible uk is called the
control space at time k. Again it can be any set and can depend on k.
The problem also involves a cost function that is additive in the sense
that the cost incurred at time k, denoted by gk (xk , uk ), accumulates over
time. Formally, gk is a function of (xk , uk ) that takes real number values,
and may depend on k. For a given initial state x0 , the total cost of a control
sequence {u0 , . . . , uN −1 } is
J(x0; u0, . . . , uN−1) = gN(xN) + Σ_{k=0}^{N−1} gk(xk, uk),        (1.2)

where gN (xN ) is a terminal cost incurred at the end of the process. This
cost is a well-defined number, since the control sequence {u0 , . . . , uN −1 }
together with x0 determines exactly the state sequence {x1 , . . . , xN } via
the system equation (1.1). We want to minimize the cost (1.2) over all
sequences {u0 , . . . , uN −1 } that satisfy the control constraints, thereby ob-
taining the optimal value†

J*(x0) = min_{uk∈Uk(xk), k=0,...,N−1} J(x0; u0, . . . , uN−1),

as a function of x0 .
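As a concrete rendering of this notation, here is a minimal Python sketch that rolls the system (1.1) forward and evaluates the cost (1.2) of a given control sequence; the particular functions in the usage lines are assumed illustrative choices, not data from the text.

    def total_cost(x0, controls, f, g, gN):
        """Evaluate J(x0; u0, ..., u_{N-1}) = gN(xN) + sum_k gk(xk, uk)
        by rolling the system xk+1 = fk(xk, uk) forward from x0."""
        x, cost = x0, 0.0
        for k, u in enumerate(controls):
            cost += g(k, x, u)          # stage cost gk(xk, uk)
            x = f(k, x, u)              # system equation xk+1 = fk(xk, uk)
        return cost + gN(x)             # add the terminal cost gN(xN)

    # Assumed illustrative problem: xk+1 = xk + uk, gk = uk**2, gN = xN**2.
    f = lambda k, x, u: x + u
    g = lambda k, x, u: u ** 2
    gN = lambda x: x ** 2
    print(total_cost(5.0, [-2.0, -1.0], f, g, gN))   # 4 + 1 + 4 = 9.0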
We will next illustrate deterministic problems with some examples.

Discrete Optimal Control Problems

There are many situations where the state and control are naturally discrete
and take a finite number of values. Such problems are often conveniently
specified in terms of an acyclic graph specifying for each state xk the pos-
sible transitions to next states xk+1 . The nodes of the graph correspond
to states xk and the arcs of the graph correspond to state-control pairs
(xk , uk ). Each arc with start node xk corresponds to a choice of a single
control uk ∈ Uk (xk ) and has as end node the next state fk (xk , uk ). The
cost of an arc (xk , uk ) is defined as gk (xk , uk ); see Fig. 1.1.1. To handle the
final stage, an artificial terminal node t is added. Each state xN at stage
N is connected to the terminal node t with an arc having cost gN (xN ).

† We use throughout “min” (in place of “inf”) to indicate minimal value over
a feasible set of controls, even when we are not sure that the minimum is attained
by some feasible control.


Figure 1.1.1 Transition graph for a deterministic finite-state system. Nodes


correspond to states, and arcs correspond to state-control pairs (xk , uk ). An arc
(xk , uk ) has start and end nodes xk and xk+1 = fk (xk , uk ), respectively. We
view the cost gk (xk , uk ) of the transition as the length of this arc. The problem is
equivalent to finding a shortest path from the initial node s to the terminal node
t.

Note that control sequences correspond to paths originating at the


initial state (node s at stage 0) and terminating at one of the nodes corre-
sponding to the final stage N . If we view the cost of an arc as its length,
we see that a deterministic finite-state finite-horizon problem is equivalent
to finding a minimum-length (or shortest) path from the initial node s of
the graph to the terminal node t. Here, by a path we mean a sequence of
arcs of the form (j1 , j2 ), (j2 , j3 ), . . . , (jk−1 , jk ), and by the length of a path
we mean the sum of the lengths of its arcs.†
Generally, combinatorial optimization problems can be formulated
as deterministic finite-state finite-horizon optimal control problems. The
following scheduling example illustrates the idea.

Example 1.1.1 (A Deterministic Scheduling Problem)

Suppose that to produce a certain product, four operations must be performed


on a certain machine. The operations are denoted by A, B, C, and D. We
assume that operation B can be performed only after operation A has been
performed, and operation D can be performed only after operation C has
been performed. (Thus the sequence CDAB is allowable but the sequence
CDBA is not.) The setup cost Cmn for passing from any operation m to any
other operation n is given. There is also an initial startup cost SA or SC for
starting with operation A or C, respectively (cf. Fig. 1.1.2). The cost of a

† It turns out also that any shortest path problem (with a possibly nona-
cyclic graph) can be reformulated as a finite-state deterministic optimal control
problem, as we will see in Section 1.3.1. See also [Ber17], Section 2.1, and also
[Ber98] for an extensive discussion of shortest path methods.


Figure 1.1.2 The transition graph of the deterministic scheduling problem


of Example 1.1.1. Each arc of the graph corresponds to a decision leading
from some state (the start node of the arc) to some other state (the end node
of the arc). The corresponding cost is shown next to the arc. The cost of the
last operation is shown as a terminal cost next to the terminal nodes of the
graph.

sequence is the sum of the setup costs associated with it; for example, the
operation sequence ACDB has cost

SA + CAC + CCD + CDB .

We can view this problem as a sequence of three decisions, namely the


choice of the first three operations to be performed (the last operation is
determined from the preceding three). It is appropriate to consider as state
the set of operations already performed, the initial state being an artificial
state corresponding to the beginning of the decision process. The possible
state transitions corresponding to the possible states and decisions for this
problem are shown in Fig. 1.1.2. Here the problem is deterministic, i.e., at
a given state, each choice of control leads to a uniquely determined state.
For example, at state AC the decision to perform operation D leads to state
ACD with certainty, and has cost CCD . Thus the problem can be conveniently
represented in terms of the transition graph of Fig. 1.1.2. The optimal solution
corresponds to the path that starts at the initial state and ends at some state
at the terminal time and has minimum sum of arc costs plus the terminal
cost.


Figure 1.1.3 The linear-quadratic problem of Example 1.1.2 for N = 2. The


temperature of the material evolves according to the system equation xk+1 =
(1 − a)xk + auk , where a is some scalar with 0 < a < 1.

Continuous-Spaces Optimal Control Problems

Many classical problems in control theory involve a state that belongs to a


Euclidean space, i.e., the space of an n-dimensional vector of real variables,
where n is a positive integer. The following is representative of the class
of linear-quadratic problems, where the system equation is linear, the cost
function is quadratic, and there are no control constraints. In our example,
the states and controls are one-dimensional, but there are multidimensional
extensions, which are very popular (see [Ber17], Section 3.1).

Example 1.1.2 (A Linear-Quadratic Problem)

A certain material is passed through a sequence of N ovens (see Fig. 1.1.3).


Denote
x0 : initial temperature of the material,
xk , k = 1, . . . , N : temperature of the material at the exit of oven k,
uk−1 , k = 1, . . . , N : heat energy applied to the material in oven k.
In practice there will be some constraints on uk , such as nonnegativity.
However, for analytical tractability one may also consider the case where
uk is unconstrained, and check later if the solution satisfies some natural
restrictions in the problem at hand.
We assume a system equation of the form

xk+1 = (1 − a)xk + auk , k = 0, 1, . . . , N − 1,

where a is a known scalar from the interval (0, 1). The objective is to get
the final temperature xN close to a given target T , while expending relatively
little energy. We express this with a cost function of the form
r(xN − T)^2 + Σ_{k=0}^{N−1} uk^2,

where r > 0 is a given scalar.

Linear-quadratic problems with no constraints on the state or the control admit a nice analytical solution, as we will see later in Section 1.3.7.

In another frequently arising optimal control problem there are linear con-
straints on the state and/or the control. In the preceding example it would
have been natural to require that ak ≤ xk ≤ bk and/or ck ≤ uk ≤ dk , where
ak , bk , ck , dk are given scalars. Then the problem would be solvable not only
by DP but also by quadratic programming methods. Generally determin-
istic optimal control problems with continuous state and control spaces
(in addition to DP) admit a solution by nonlinear programming methods,
such as gradient, conjugate gradient, and Newton’s method, which can be
suitably adapted to their special structure.

1.1.2 The Dynamic Programming Algorithm

The DP algorithm rests on a very simple idea, the principle of optimality,


which roughly states the following rather obvious fact.

Principle of Optimality
Let {u∗0 , . . . , u∗N −1 } be an optimal control sequence, which together
with x0 determines the corresponding state sequence {x∗1 , . . . , x∗N }
via the system equation (1.1). Consider the subproblem whereby we
start at x∗m at time m and wish to minimize the “cost-to-go” from
time m to time N ,

gm(x∗m, um) + Σ_{k=m+1}^{N−1} gk(xk, uk) + gN(xN),

over {um , . . . , uN −1 } with uk ∈ Uk (xk ), k = m, . . . , N − 1. Then the


truncated control sequence {u∗m , u∗m+1 , . . . , u∗N −1 } is optimal for this
subproblem.

The intuitive justification of the principle of optimality is very simple.


If the truncated control sequence {u∗m , u∗m+1 , . . . , u∗N −1 } were not optimal
as stated, we would be able to reduce the cost further by switching to an
optimal sequence for the subproblem once we reach x∗m (since the preceding
choices u∗0 , . . . , u∗m−1 of controls do not restrict our future choices). For
an auto travel analogy, suppose that the fastest route from Los Angeles to
Boston passes through Chicago. The principle of optimality translates to
the obvious fact that the Chicago to Boston portion of the route is also the
fastest route for a trip that starts from Chicago and ends in Boston.
The principle of optimality suggests that an optimal control sequence
can be constructed in piecemeal fashion, first constructing an optimal se-
quence for the “tail subproblem” involving the last stage, then extending
the optimal policy to the “tail subproblem” involving the last two stages,

and continuing in this manner until an optimal policy for the entire problem
is constructed.
The DP algorithm is based on this idea: it proceeds sequentially, by
solving all the tail subproblems of a given time length, using the solution
of the tail subproblems of shorter time length. We illustrate the algorithm
with the scheduling problem of Example 1.1.1. The calculations are simple
but tedious, and may be skipped without loss of continuity. However, they
may be worth going over by a reader that has no prior experience in the
use of DP.

Example 1.1.1 (Scheduling Problem - Continued)

Let us consider the scheduling Example 1.1.1, and let us apply the principle of
optimality to calculate the optimal schedule. We have to schedule optimally
the four operations A, B, C, and D. The numerical values of the transition
and setup costs are shown in Fig. 1.1.4 next to the corresponding arcs.
According to the principle of optimality, the “tail” portion of an optimal
schedule must be optimal. For example, suppose that the optimal schedule
is CABD. Then, having scheduled first C and then A, it must be optimal to
complete the schedule with BD rather than with DB. With this in mind, we
solve all possible tail subproblems of length two, then all tail subproblems of
length three, and finally the original problem that has length four (the sub-
problems of length one are of course trivial because there is only one operation
that is as yet unscheduled). As we will see shortly, the tail subproblems of
length k + 1 are easily solved once we have solved the tail subproblems of
length k, and this is the essence of the DP technique.

Tail Subproblems of Length 2 : These subproblems are the ones that involve
two unscheduled operations and correspond to the states AB, AC, CA, and
CD (see Fig. 1.1.4).
State AB : Here it is only possible to schedule operation C as the next op-
eration, so the optimal cost of this subproblem is 9 (the cost of schedul-
ing C after B, which is 3, plus the cost of scheduling D after C, which
is 6).
State AC : Here the possibilities are to (a) schedule operation B and then
D, which has cost 5, or (b) schedule operation D and then B, which has
cost 9. The first possibility is optimal, and the corresponding cost of
the tail subproblem is 5, as shown next to node AC in Fig. 1.1.4.
State CA: Here the possibilities are to (a) schedule operation B and then
D, which has cost 3, or (b) schedule operation D and then B, which has
cost 7. The first possibility is optimal, and the corresponding cost of
the tail subproblem is 3, as shown next to node CA in Fig. 1.1.4.
State CD: Here it is only possible to schedule operation A as the next
operation, so the optimal cost of this subproblem is 5.
Tail Subproblems of Length 3 : These subproblems can now be solved using
the optimal costs of the subproblems of length 2.


Figure 1.1.4 Transition graph of the deterministic scheduling problem, with


the cost of each decision shown next to the corresponding arc. Next to each
node/state we show the cost to optimally complete the schedule starting from
that state. This is the optimal cost of the corresponding tail subproblem (cf.
the principle of optimality). The optimal cost for the original problem is equal
to 10, as shown next to the initial state. The optimal schedule corresponds
to the thick-line arcs.

State A: Here the possibilities are to (a) schedule next operation B (cost
2) and then solve optimally the corresponding subproblem of length 2
(cost 9, as computed earlier), a total cost of 11, or (b) schedule next
operation C (cost 3) and then solve optimally the corresponding sub-
problem of length 2 (cost 5, as computed earlier), a total cost of 8.
The second possibility is optimal, and the corresponding cost of the tail
subproblem is 8, as shown next to node A in Fig. 1.1.4.
State C : Here the possibilities are to (a) schedule next operation A (cost
4) and then solve optimally the corresponding subproblem of length 2
(cost 3, as computed earlier), a total cost of 7, or (b) schedule next
operation D (cost 6) and then solve optimally the corresponding sub-
problem of length 2 (cost 5, as computed earlier), a total cost of 11.
The first possibility is optimal, and the corresponding cost of the tail
subproblem is 7, as shown next to node C in Fig. 1.1.4.
Original Problem of Length 4 : The possibilities here are (a) start with oper-
ation A (cost 5) and then solve optimally the corresponding subproblem of
length 3 (cost 8, as computed earlier), a total cost of 13, or (b) start with
operation C (cost 3) and then solve optimally the corresponding subproblem
of length 3 (cost 7, as computed earlier), a total cost of 10. The second pos-

sibility is optimal, and the corresponding optimal cost is 10, as shown next
to the initial state node in Fig. 1.1.4.
Note that having computed the optimal cost of the original problem
through the solution of all the tail subproblems, we can construct the opti-
mal schedule: we begin at the initial node and proceed forward, each time
choosing the operation that starts the optimal schedule for the corresponding
tail subproblem. In this way, by inspection of the graph and the computa-
tional results of Fig. 1.1.4, we determine that CABD is the optimal schedule.

Finding an Optimal Control Sequence by DP

We now state the DP algorithm for deterministic finite horizon problems by


translating into mathematical terms the heuristic argument underlying the
principle of optimality. The algorithm constructs functions

JN*(xN), JN−1*(xN−1), . . . , J0*(x0),

sequentially, starting from JN*, and proceeding backwards to JN−1*, JN−2*, etc.

DP Algorithm for Deterministic Finite Horizon Problems


Start with

JN*(xN) = gN(xN),   for all xN,        (1.3)

and for k = 0, . . . , N − 1, let

Jk*(xk) = min_{uk∈Uk(xk)} [ gk(xk, uk) + Jk+1*( fk(xk, uk) ) ],   for all xk.        (1.4)
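For a problem with finitely many states and controls, this recursion can be carried out by straightforward tabulation. The following minimal Python sketch illustrates it; the packaging of the problem data as functions states(k), U(k, x), f(k, x, u), g(k, x, u), and gN(x) is an assumption made here for concreteness, not notation from the text.

    def backward_dp(N, states, U, f, g, gN):
        """Tabulate Jk*(xk) for k = N, N-1, ..., 0 via
        JN*(x) = gN(x) and Jk*(x) = min over u of [gk(x, u) + J*_{k+1}(fk(x, u))]."""
        J = [dict() for _ in range(N + 1)]
        for x in states(N):
            J[N][x] = gN(x)                                    # terminal condition (1.3)
        for k in range(N - 1, -1, -1):                         # backwards in time
            for x in states(k):
                J[k][x] = min(g(k, x, u) + J[k + 1][f(k, x, u)]
                              for u in U(k, x))                # DP equation (1.4)
        return J   # J[0][x0] is the optimal cost J*(x0)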

Note that at stage k, the calculation in (1.4) must be done for all states
xk before proceeding to stage k−1. The key fact about the algorithm is that
for every initial state x0 , the optimal cost J * (x0 ) is equal to the number
J0* (x0 ), which is obtained at the last step of the DP algorithm. Indeed, a
more general fact can be shown, namely that for all m = 0, 1, . . . , N − 1,
and all states xm at time m, we have

Jm*(xm) = min_{uk∈Uk(xk), k=m,...,N−1} J(xm; um, . . . , uN−1),        (1.5)

where

J(xm; um, . . . , uN−1) = gN(xN) + Σ_{k=m}^{N−1} gk(xk, uk),        (1.6)

i.e., Jm*(xm) is the optimal cost for an (N − m)-stage tail subproblem that
starts at state xm and time m, and ends at time N.†
We can prove this by induction. The assertion holds for m = N in
view of the initial condition JN*(xN) = gN(xN). To show that it holds for
all m, we use Eqs. (1.5) and (1.6) to write

Jm*(xm) = min_{uk∈Uk(xk), k=m,...,N−1} [ gN(xN) + Σ_{k=m}^{N−1} gk(xk, uk) ]

        = min_{um∈Um(xm)} [ gm(xm, um) + min_{uk∈Uk(xk), k=m+1,...,N−1} [ gN(xN) + Σ_{k=m+1}^{N−1} gk(xk, uk) ] ]

        = min_{um∈Um(xm)} [ gm(xm, um) + Jm+1*( fm(xm, um) ) ],

where for the last equality we use the induction hypothesis.‡


Note that the algorithm solves every tail subproblem, i.e., the problem
of minimization of the cost accumulated additively starting from an inter-
mediate state up to the end of the horizon. Once the functions J0*, . . . , JN*
have been obtained, we can use the following algorithm to construct an op-
timal control sequence {u∗0 , . . . , u∗N −1 } and corresponding state trajectory
{x∗1 , . . . , x∗N } for the given initial state x0 .

Construction of Optimal Control Sequence {u∗0 , . . . , u∗N −1 }


Set

u0* ∈ arg min_{u0∈U0(x0)} [ g0(x0, u0) + J1*( f0(x0, u0) ) ],

and

x1* = f0(x0, u0*).

Sequentially, going forward, for k = 1, 2, . . . , N − 1, set

uk* ∈ arg min_{uk∈Uk(xk*)} [ gk(xk*, uk) + Jk+1*( fk(xk*, uk) ) ],        (1.7)

and

xk+1* = fk(xk*, uk*).        (1.8)

† Based on this fact, we call Jm*(xm) the optimal cost-to-go at state xm and time m, and refer to Jm* as the optimal cost-to-go function or optimal cost function at time m. In maximization problems the DP algorithm (1.4) is written with maximization in place of minimization, and then Jm* is referred to as the optimal value function at time m.

‡ A subtle mathematical point here is that, through the minimization operation, the cost-to-go functions Jm* may take the value −∞ for some xm. Still the preceding induction argument is valid even if this is so.

The same algorithm can be used to find an optimal control sequence


for any tail subproblem. Figure 1.1.4 traces the calculations of the DP
algorithm for the scheduling Example 1.1.1. The numbers next to the
nodes give the corresponding cost-to-go values, and the thick-line arcs
give the construction of the optimal control sequence using the preceding
algorithm.
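The forward construction can be coded in the same spirit, reusing the cost-to-go tables produced by the backward_dp sketch given earlier; the problem interface is again an assumption made for illustration.

    def forward_optimal_sequence(x0, N, U, f, g, J):
        """Construct {u0*, ..., u*_{N-1}} and {x1*, ..., xN*} from the cost-to-go
        tables J returned by backward_dp, using the minimizations (1.7)-(1.8)."""
        x, us, xs = x0, [], []
        for k in range(N):
            # uk* attains the minimum of gk(xk*, u) + J*_{k+1}(fk(xk*, u)) over Uk(xk*)
            u = min(U(k, x), key=lambda u: g(k, x, u) + J[k + 1][f(k, x, u)])
            x = f(k, x, u)                                     # x*_{k+1} = fk(xk*, uk*)
            us.append(u)
            xs.append(x)
        return us, xs

For instance, with the scheduling data of Example 1.1.1 encoded in this interface, the forward pass would retrace the optimal schedule CABD found earlier.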

1.1.3 Approximation in Value Space

The preceding forward optimal control sequence construction is possible


only after we have computed Jk* (xk ) by DP for all xk and k. Unfortu-
nately, in practice this is often prohibitively time-consuming, because the
number of possible xk and k can be very large. However, a similar
forward algorithmic process can be used if the optimal cost-to-go functions
Jk* are replaced by some approximations J˜k . This is the basis for approx-
imation in value space, which will be central in our future discussions. It
constructs a suboptimal solution {ũ0 , . . . , ũN −1 } in place of the optimal
{u∗0 , . . . , u∗N −1 }, based on using J˜k in place of Jk* in the DP procedure
(1.7).

Approximation in Value Space - Use of J˜k in Place of Jk*


Start with

ũ0 ∈ arg min_{u0∈U0(x0)} [ g0(x0, u0) + J̃1( f0(x0, u0) ) ],

and set

x̃1 = f0(x0, ũ0).

Sequentially, going forward, for k = 1, 2, . . . , N − 1, set

ũk ∈ arg min_{uk∈Uk(x̃k)} [ gk(x̃k, uk) + J̃k+1( fk(x̃k, uk) ) ],        (1.9)

and

x̃k+1 = fk(x̃k, ũk).        (1.10)

The construction of suitable approximate cost-to-go functions J˜k is


a major focal point of the RL methodology. There are several possible
methods, depending on the context, and they will be taken up starting
with the next chapter.
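In code, approximation in value space is the same forward pass with J̃ substituted for J*. The sketch below is a minimal illustration; the closing choice J̃ ≡ 0, which reduces the scheme to a one-step greedy rule, is just an assumed example of an approximation.

    def value_space_sequence(x0, N, U, f, g, J_tilde):
        """Forward pass (1.9)-(1.10) with approximations J~k in place of Jk*."""
        x, us = x0, []
        for k in range(N):
            u = min(U(k, x), key=lambda u: g(k, x, u) + J_tilde(k + 1, f(k, x, u)))
            us.append(u)
            x = f(k, x, u)                                     # x~_{k+1} = fk(x~k, u~k)
        return us

    greedy_approximation = lambda k, x: 0.0    # J~ = 0: minimize the stage cost only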

1.1.4 Model-Free Approximate Solution - Q-Learning

Reducing the computational burden of DP through the use of approximate


cost-to-go functions J˜k is a major objective of the RL methodology. How-
ever, there is another potential benefit: given J˜k , we may be able to carry
out the forward minimization (1.9) by using a simulator/computer model
rather than a mathematical model of the functions fk and gk .
In particular, suppose that we have J˜k and a computer program,
which for any pair (xk , uk ) can generate the next state fk (xk , uk ) and the
stage cost gk (xk , uk ). Then we can compute the expression

Q̃k(xk, uk) = gk(xk, uk) + J̃k+1( fk(xk, uk) ),
[cf. the right-hand side of Eq. (1.9)]; this is also known as the (approxi-
mate) Q-factor of (xk , uk ). We can then implement the computation of
the approximately optimal control (1.9) through the minimization

ũk ∈ arg min_{uk∈Uk(x̃k)} Q̃k(x̃k, uk).

This may be viewed as a model-free approximate solution method, a


simple example from a variety of methods that we will consider in sub-
sequent chapters. Some of these methods are based on Monte-Carlo sim-
ulation, particularly for problems that involve stochastic uncertainty. Of
course, for a method to fully qualify as model-free, the functions J˜k should
also be obtained without the use of a model. There are several approaches
for doing this, and they will be discussed starting with the next chapter.
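As a minimal illustration of the model-free viewpoint, the sketch below selects ũk using only a black-box simulator; the interface sim(k, x, u) -> (next_state, stage_cost) is an assumption made for this sketch, not something prescribed in the text.

    def q_factor_control(k, x, U, sim, J_tilde):
        """Select u~k by minimizing the approximate Q-factor
        Q~k(x, u) = gk(x, u) + J~_{k+1}(fk(x, u)), where fk and gk are available
        only through the simulator call sim(k, x, u) -> (next_x, cost)."""
        def q(u):
            next_x, cost = sim(k, x, u)        # one simulator query; no formulas needed
            return cost + J_tilde(k + 1, next_x)
        return min(U(k, x), key=q)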
There are also methods that use as starting point an alternative (and
equivalent) form of the DP algorithm, which instead of the optimal cost-
to-go functions Jk* , generates the optimal Q-factors defined for all pairs
(xk , uk ) and k by

Qk*(xk, uk) = gk(xk, uk) + Jk+1*( fk(xk, uk) ).        (1.11)

Thus the optimal Q-factors are simply the expressions that are minimized
in the right-hand side of the DP equation (1.4). Note that this equation
implies that the optimal cost function Jk* can be recovered from the optimal
Q-factor Q*k by means of

Jk*(xk) = min_{uk∈Uk(xk)} Qk*(xk, uk).

Moreover, using the above relation, the DP algorithm can be written in an


essentially equivalent form that involves Q-factors only

Qk*(xk, uk) = gk(xk, uk) + min_{uk+1∈Uk+1(fk(xk,uk))} Qk+1*( fk(xk, uk), uk+1 ).

We will see later that exact and approximate forms of related algorithms
can be implemented by using model-free simulation, in the context of a
class of RL methods known as Q-learning.

1.2 STOCHASTIC DYNAMIC PROGRAMMING

The stochastic finite horizon optimal control problem differs from the de-
terministic version primarily in the nature of the discrete-time dynamic
system that governs the evolution of the state xk . This system includes a
random “disturbance” wk , which is characterized by a probability distri-
bution Pk (· | xk , uk ) that may depend explicitly on xk and uk , but not on
values of prior disturbances wk−1 , . . . , w0 . The system has the form

xk+1 = fk (xk , uk , wk ), k = 0, 1, . . . , N − 1,

where, as before, xk is an element of some state space Sk and the control uk is
an element of some control space. The control uk is constrained to take
values in a given subset Uk (xk ), which depends on the current state xk .
An important difference is that we optimize not over control sequences
{u0 , . . . , uN −1 }, but rather over policies (also called closed-loop control
laws, or feedback policies) that consist of a sequence of functions

π = {µ0 , . . . , µN −1 },

where µk maps states xk into controls uk = µk (xk ), and is such that


µk (xk ) ∈ Uk (xk ) for all xk ∈ Sk . Such policies will be called admissi-
ble. Policies are more general objects than control sequences, and in the
presence of stochastic uncertainty, they can result in improved cost, since
they allow choices of controls uk that incorporate knowledge of the state
xk . Without this knowledge, the controller cannot adapt appropriately
to unexpected values of the state, and as a result the cost can be ad-
versely affected. This is a fundamental distinction between deterministic
and stochastic optimal control problems.
Another important distinction between deterministic and stochastic
problems is that in the latter, the evaluation of various quantities such
as cost function values involves forming expected values, and this often
necessitates the use of Monte Carlo simulation. As a result many of the
methods that we will discuss for stochastic problems will involve the use of
simulation.

Given an initial state x0 and a policy π = {µ0 , . . . , µN −1 }, the fu-


ture states xk and disturbances wk are random variables with distributions
defined through the system equation


xk+1 = fk( xk, µk(xk), wk ),   k = 0, 1, . . . , N − 1.

Thus, for given functions gk , k = 0, 1, . . . , N , the expected cost of π starting


at x0 is
Jπ(x0) = E{ gN(xN) + Σ_{k=0}^{N−1} gk( xk, µk(xk), wk ) },

where the expected value operation E{·} is over the random variables wk
and xk . An optimal policy π ∗ is one that minimizes this cost; i.e.,

Jπ∗(x0) = min_{π∈Π} Jπ(x0),

where Π is the set of all admissible policies.


The optimal cost depends on x0 and is denoted by J ∗ (x0 ); i.e.,

J∗(x0) = min_{π∈Π} Jπ(x0).

It is useful to view J ∗ as a function that assigns to each initial state x0 the


optimal cost J ∗ (x0 ) and call it the optimal cost function or optimal value
function, particularly in problems of maximizing reward.

Finite Horizon Stochastic Dynamic Programming

The DP algorithm for the stochastic finite horizon optimal control problem
has a similar form to its deterministic version, and shares several of its
major characteristics:
(a) Using tail subproblems to break down the minimization over multiple
stages to single stage minimizations.
(b) Generating backwards for all k and xk the values Jk* (xk ), which give
the optimal cost-to-go starting at stage k at state xk .
(c) Obtaining an optimal policy by minimization in the DP equations.
(d) A structure that is suitable for approximation in value space, whereby
we replace Jk* by approximations J˜k , and obtain a suboptimal policy
by the corresponding minimization.

DP Algorithm for Stochastic Finite Horizon Problems


Start with

JN*(xN) = gN(xN),        (1.12)

and for k = 0, . . . , N − 1, let

Jk*(xk) = min_{uk∈Uk(xk)} E{ gk(xk, uk, wk) + Jk+1*( fk(xk, uk, wk) ) }.        (1.13)

If u∗k = µ∗k (xk ) minimizes the right side of this equation for each xk
and k, the policy π ∗ = {µ∗0 , . . . , µ∗N −1 } is optimal.
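When each disturbance wk takes finitely many values, the expectation in (1.13) is a finite sum and the algorithm can again be tabulated. The sketch below is a minimal illustration; supplying the distribution as a function P(k, x, u) that returns (w, probability) pairs is an assumed packaging of the problem data.

    def stochastic_backward_dp(N, states, U, f, g, gN, P):
        """Tabulate Jk*(xk) via (1.12)-(1.13), computing the expectation over wk
        as a finite sum using the distribution P(k, x, u) = [(w, prob), ...]."""
        J = [dict() for _ in range(N + 1)]
        mu = [dict() for _ in range(N)]                        # optimal policy mu_k*
        for x in states(N):
            J[N][x] = gN(x)
        for k in range(N - 1, -1, -1):
            for x in states(k):
                def expected_cost(u):
                    return sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                               for w, p in P(k, x, u))
                best = min(U(k, x), key=expected_cost)
                mu[k][x] = best
                J[k][x] = expected_cost(best)
        return J, mu   # J[0][x0] is the optimal cost; mu collects an optimal policy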

The key fact is that for every initial state x0 , the optimal cost J * (x0 )
is equal to the function J0* (x0 ), obtained at the last step of the above DP
algorithm. This can be proved by induction similar to the deterministic
case; we will omit the proof (see the discussion of Section 1.3 in the textbook
[Ber17]).†
As in deterministic problems, the DP algorithm can be very time-
consuming, in fact more so since it involves the expected value operation
in Eq. (1.13). This motivates suboptimal control techniques, such as ap-
proximation in value space whereby we replace Jk* with easier obtainable
approximations J˜k . We will discuss this approach at length in subsequent
chapters.

Q-factors for Stochastic Problems

We can define optimal Q-factors for stochastic problems, similar to the


case of deterministic problems [cf. Eq. (1.11)], as the expressions that are
minimized in the right-hand side of the stochastic DP equation (1.13).
They are given by
Qk*(xk, uk) = E{ gk(xk, uk, wk) + Jk+1*( fk(xk, uk, wk) ) }.

The optimal cost-to-go functions Jk* can be recovered from the optimal
Q-factors Q*k by means of

Jk*(xk) = min_{uk∈Uk(xk)} Qk*(xk, uk),

† There are some technical/mathematical difficulties here, having to do with


the expected value operation in Eq. (1.13) being well-defined and finite. These
difficulties are of no concern in practice, and disappear completely when the
disturbances wk can take only a finite number of values, in which case all
expected values consist of sums of finitely many terms.

and the DP algorithm can be written in terms of Q-factors as



Qk*(xk, uk) = E{ gk(xk, uk, wk) + min_{uk+1∈Uk+1(fk(xk,uk,wk))} Qk+1*( fk(xk, uk, wk), uk+1 ) }.

We will later discuss approximation in value space techniques based


on approximately optimal Q-factors Q̃k (xk , uk ). The corresponding subop-
timal policy, denoted by {µ̃0 , . . . , µ̃N −1 }, is obtained from the minimization

µ̃k(xk) ∈ arg min_{uk∈Uk(xk)} Q̃k(xk, uk),

for all xk and k.

1.3 EXAMPLES, VARIATIONS, AND SIMPLIFICATIONS

In this section we provide some examples to illustrate problem formulation


techniques, solution methods, and adaptations of the basic DP algorithm
to various contexts. As a guide for formulating optimal control problems in
a manner that is suitable for DP solution, the following two-stage process
is suggested:
(a) Identify the controls/decisions uk and the times k at which these con-
trols are applied. Usually this step is fairly straightforward. However,
in some cases there may be some choices to make. For example in
deterministic problems, where the objective is to select an optimal
sequence of controls {u0 , . . . , uN −1 }, one may lump multiple controls
to be chosen together, e.g., view the pair (u0 , u1 ) as a single choice.
This is usually not possible in stochastic problems, where distinct de-
cisions are differentiated by the information/feedback available when
making them.
(b) Select the states xk . The basic guideline here is that xk should en-
compass all the information that is known to the controller at time k
and can be used with advantage in choosing uk .
Note that there may be multiple possibilities for selecting the states,
because information may be packaged in several different ways that are
equally useful from the point of view of control. It is thus worth considering
alternative ways to choose the states; for example try to use states that
minimize the dimensionality of the state space. For a trivial example that
illustrates the point, if a quantity xk qualifies as state, then (xk−1 , xk ) also
qualifies as state, since (xk−1 , xk ) contains all the information contained

within xk that can be useful to the controller when selecting uk . However,


using (xk−1 , xk ) in place of xk gains nothing in terms of optimal cost
while complicating the DP algorithm which would be defined over a larger
space. The concept of a sufficient statistic, which refers to a quantity that
summarizes all the essential content of the information available to the
controller, may be useful in reducing the size of the state space (see the
discussion in Section 4.3 of [Ber17]).
Generally minimizing the dimension of the state makes sense but there
are exceptions. A case in point is problems involving partial or imperfect
state information, where we collect measurements to use for control of some
quantity of interest yk that evolves over time (for example, yk may be the
position/velocity vector of a moving vehicle). If Ik is the collection of all
measurements up to time k, it is correct to use Ik as state. However,
a better alternative may be to use as state the conditional probability
distribution Pk (yk | Ik ), called belief state, which may subsume all the
information that is useful for the purposes of choosing a control. On the
other hand, the belief state Pk (yk | Ik ) is an infinite-dimensional quantity,
whereas Ik may be finite dimensional, so the best choice may be problem-
dependent; see [Ber17] for further discussion of partial state information
problems.
We refer to DP textbooks for extensive additional discussions of mod-
eling and problem formulation techniques. The subsequent chapters do not
rely substantially on the material of this section, so the reader may selec-
tively skip forward to the next chapter and return to this material later as
needed.

1.3.1 Deterministic Shortest Path Problems

Let {1, 2, . . . , N, t} be the set of nodes of a graph, and let aij be the cost of
moving from node i to node j [also referred to as the length of the arc (i, j)
that joins i and j]. Node t is a special node, which we call the destination.
By a path we mean a sequence of arcs such that the end node of each arc
in the sequence is the start node of the next arc. The length of a path from
a given node to another node is the sum of the lengths of the arcs on the
path. We want to find a shortest (i.e., minimum length) path from each
node i to node t.
We make an assumption relating to cycles, i.e., paths of the form
(i, j1 ), (j1 , j2 ), . . . , (jk , i) that start and end at the same node. In particular,
we exclude the possibility that a cycle has negative total length. Otherwise,
it would be possible to decrease the length of some paths to arbitrarily small
values simply by adding more and more negative-length cycles. We thus
assume that all cycles have nonnegative length. With this assumption, it is
clear that an optimal path need not take more than N moves, so we may
limit the number of moves to N . We formulate the problem as one where
we require exactly N moves but allow degenerate moves from a node i to

itself with cost aii = 0. We also assume that for every node i there exists
at least one path from i to t.
We can formulate this problem as a deterministic DP problem with N
stages, where the states at any stage 0, . . . , N − 1 are the nodes {1, . . . , N },
the destination t is the unique state at stage N , and the controls correspond
to the arcs (i, j), including the self arcs (i, i). Thus at each state i we select
a control (i, j) and move to state j at cost aij .
We can write the DP algorithm for our problem, with the optimal
cost-to-go functions Jk having the meaning

Jk (i) = optimal cost of getting from i to t in N − k moves,

for i = 1, . . . , N , k = 0, . . . , N − 1. The cost of the optimal path from i to


t is J0 (i). The DP algorithm takes the intuitively clear form

optimal cost from i to t in N − k moves
    = min over all arcs (i, j) of [ aij + (optimal cost from j to t in N − k − 1 moves) ],

or

Jk(i) = min_{all arcs (i,j)} [ aij + Jk+1(j) ],   k = 0, 1, . . . , N − 2,

with

JN−1(i) = ait,   i = 1, 2, . . . , N.
This algorithm is also known as the Bellman-Ford algorithm for shortest
paths.
The optimal policy when at node i after k moves is to move to a node
j ∗ that minimizes aij + Jk+1 (j) over all j such that (i, j) is an arc. If the
optimal path obtained from the algorithm contains degenerate moves from
a node to itself, this simply means that the path involves in reality less
than N moves.
Note that if for some k > 0, we have Jk (i) = Jk+1 (i) for all i,
then subsequent DP iterations will not change the values of the cost-to-go
[Jk−m (i) = Jk (i) for all m > 0 and i], so the algorithm can be terminated
with Jk (i) being the shortest distance from i to t, for all i.
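A minimal sketch of this recursion, including the early-termination check just described, is given below; representing the graph as a dictionary a of arc lengths keyed by node pairs is an assumption made for illustration (degenerate self moves at zero cost are treated implicitly).

    def shortest_distances_to_t(nodes, t, a):
        """Bellman-Ford style DP: J_{N-1}(i) = ait, and
        Jk(i) = min over arcs (i, j) of [aij + J_{k+1}(j)], with the degenerate
        move i -> i at zero cost allowed.  Returns shortest distances to t."""
        INF = float("inf")
        N = len(nodes)
        J = {i: a.get((i, t), INF) for i in nodes}             # J_{N-1}
        for _ in range(N - 1):                                 # k = N-2, ..., 0
            J_new = {i: min(min((a[(i, j)] + J[j] for j in nodes if (i, j) in a),
                                default=INF),
                            J[i])                              # degenerate move i -> i
                     for i in nodes}
            if J_new == J:                                     # Jk = J_{k+1}: stop early
                break
            J = J_new
        return J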
To demonstrate the algorithm, consider the problem shown in Fig.
1.3.1(a) where the costs aij with i ≠ j are shown along the connecting line
segments (we assume that aij = aji ). Figure 1.3.1(b) shows the optimal
cost-to-go Jk (i) at each i and k together with the optimal paths.

1.3.2 Discrete Deterministic Optimization

Discrete optimization problems can be formulated as DP problems by


breaking down each feasible solution into a sequence of decisions/controls.
This formulation will often lead to an intractable DP computation because


Figure 1.3.1 (a) Shortest path problem data. The destination is node 5. Arc
lengths are equal in both directions and are shown along the line segments con-
necting nodes. (b) Costs-to-go generated by the DP algorithm. The number along
stage k and state i is Jk (i). Arrows indicate the optimal moves at each stage and
node. The optimal paths are

1 → 5, 2 → 3 → 4 → 5, 3 → 4 → 5, 4 → 5.

of an exponential explosion of the number of states. However, it brings


to bear approximate DP methods, such as rollout and others that we will
discuss in future chapters. We illustrate the reformulation by means of an
example and then we generalize.

Example 1.3.1 (The Traveling Salesman Problem)

An important model for scheduling a sequence of operations is the classical


traveling salesman problem. Here we are given N cities and the travel time
between each pair of cities. We wish to find a minimum time travel that visits
each of the cities exactly once and returns to the start city. To convert this
problem to a DP problem, we form a graph whose nodes are the sequences
of k distinct cities, where k = 1, . . . , N . The k-city sequences correspond to
the states of the kth stage. The initial state x0 consists of some city, taken
as the start (city A in the example of Fig. 1.3.2). A k-city node/state leads
to a (k + 1)-city node/state by adding a new city at a cost equal to the travel
time between the last two of the k + 1 cities; see Fig. 1.3.2. Each sequence of
N cities is connected to an artificial terminal node t with an arc of cost equal
to the travel time from the last city of the sequence to the starting city, thus
completing the transformation to a DP problem.
The optimal costs-to-go from each node to the terminal state can be
obtained by the DP algorithm and are shown next to the nodes. Note, how-
ever, that the number of nodes grows exponentially with the number of cities
N . This makes the DP solution intractable for large N . As a result, large
traveling salesman and related scheduling problems are typically addressed
with approximation methods, some of which are based on DP, and will be
discussed as part of our subsequent development.


Figure 1.3.2 Example of a DP formulation of the traveling salesman problem.


The travel times between the four cities A, B, C, and D are shown in the table.
We form a graph whose nodes are the k-city sequences and correspond to the
states of the kth stage. The transition costs/travel times are shown next to the
arcs. The optimal costs-to-go are generated by DP starting from the terminal
state and going backwards towards the initial state, and are shown next to the
nodes. There are two optimal sequences here (ABDCA and ACDBA), and they
are marked with thick lines. Both optimal sequences can be obtained by forward
minimization [cf. Eq. (1.7)], starting from the initial state x0 .
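The DP formulation just described can be coded directly, with a state being a tuple of distinct cities beginning at the start city; the memoized recursion below is equivalent to the backward DP over k-city sequences. The four-city travel times in the usage lines are assumed illustrative numbers, not the values of Fig. 1.3.2, and the exponential growth of the number of sequences noted above shows up directly in the size of the cache.

    from functools import lru_cache

    def tsp_dp(start, cities, times):
        """Backward DP on k-city sequences: the terminal cost of an N-city
        sequence is the travel time back to the start city.  Returns the
        optimal tour cost and one optimal city sequence."""
        @lru_cache(maxsize=None)
        def J(seq):
            if len(seq) == len(cities):                # all cities visited: close the tour
                return times[seq[-1]][start], seq
            best_cost, best_seq = float("inf"), None
            for c in cities:
                if c not in seq:                       # extend the sequence by one city
                    tail_cost, tail_seq = J(seq + (c,))
                    total = times[seq[-1]][c] + tail_cost
                    if total < best_cost:
                        best_cost, best_seq = total, tail_seq
            return best_cost, best_seq
        return J((start,))

    # Assumed illustrative, symmetric travel times between cities A, B, C, D:
    times = {"A": {"B": 5, "C": 1, "D": 15}, "B": {"A": 5, "C": 20, "D": 4},
             "C": {"A": 1, "B": 20, "D": 3}, "D": {"A": 15, "B": 4, "C": 3}}
    print(tsp_dp("A", ("A", "B", "C", "D"), times))    # optimal cost and one optimal tour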

Let us now extend the ideas of the preceding example to the general
discrete optimization problem:

minimize G(u)
subject to u ∈ U,

where U is a finite set of feasible solutions and G(u) is a cost function.


We assume that each solution u has N components; i.e., it has the form
u = (u1 , . . . , uN ), where N is a positive integer. We can then view the


Figure 1.3.3. Formulation of a discrete optimization problem as a DP problem


with N + 1 stages. There is a cost G(u) only at the terminal stage on the arc
connecting an N -solution u = (u1 , . . . , uN ) to the artificial terminal state. Al-
ternative formulations may use fewer states by taking advantage of the problem’s
structure.

problem as a sequential decision problem, where the components u1 , . . . , uN


are selected one-at-a-time. A k-tuple (u1 , . . . , uk ) consisting of the first k
components of a solution is called a k-solution. We associate k-solutions
with the kth stage of the finite horizon DP problem shown in Fig. 1.3.3.
In particular, for k = 1, . . . , N , we view as the states of the kth stage all
the k-tuples (u1 , . . . , uk ). The initial state is an artificial state denoted s.
From this state we may move to any state (u1 ), with u1 belonging to the
set

U1 = { ũ1 | there exists a solution of the form (ũ1 , ũ2 , . . . , ũN ) ∈ U }.

Thus U1 is the set of choices of u1 that are consistent with feasibility.


More generally, from a state (u1 , . . . , uk ), we may move to any state
of the form (u1 , . . . , uk , uk+1 ), with uk+1 belonging to the set

Uk+1(u1, . . . , uk) = { ũk+1 | there exists a solution of the form (u1, . . . , uk, ũk+1, . . . , ũN) ∈ U }.

At state (u1 , . . . , uk ) we must choose uk+1 from the set Uk+1 (u1 , . . . , uk ).
These are the choices of uk+1 that are consistent with the preceding choices
u1 , . . . , uk , and are also consistent with feasibility. The terminal states
correspond to the N -solutions u = (u1 , . . . , uN ), and the only nonzero cost
is the terminal cost G(u). This terminal cost is incurred upon transition
from u to an artificial end state; see Fig. 1.3.3.
Let Jk* (u1 , . . . , uk ) denote the optimal cost starting from the k-solution
(u1 , . . . , uk ), i.e., the optimal cost of the problem over solutions whose first

k components are constrained to be equal to ui , i = 1, . . . , k, respectively.


The DP algorithm is described by the equation

Jk*(u1, . . . , uk) = min_{uk+1∈Uk+1(u1,...,uk)} Jk+1*(u1, . . . , uk, uk+1),        (1.14)

with the terminal condition

JN*(u1, . . . , uN) = G(u1, . . . , uN).

The algorithm (1.14) executes backwards in time: starting with the known
function JN* = G, we compute JN−1*, then JN−2*, and so on up to computing
J1*. An optimal solution (u1*, . . . , uN*) is then constructed by going forward
through the algorithm

uk+1* ∈ arg min_{uk+1∈Uk+1(u1*,...,uk*)} Jk+1*(u1*, . . . , uk*, uk+1),   k = 0, . . . , N − 1,        (1.15)
first compute u∗1 , then u∗2 , and so on up to u∗N ; cf. Eq. (1.7).
Of course here the number of states typically grows exponentially with
N , but we can use the DP minimization (1.15) as a starting point for the use
of approximation methods. For example we may try to use approximation
in value space, whereby we replace Jk+1 * with some suboptimal J˜k+1 in Eq.
(1.15). One possibility is to use as

J˜k+1 (u∗1 , . . . , u∗k , uk+1 ),

the cost generated by a heuristic method that solves the problem sub-
optimally with the values of the first k + 1 decision components fixed at
u∗1 , . . . , u∗k , uk+1 . This is called a rollout algorithm and it is a very simple
and effective approach for approximate combinatorial optimization. It will
be discussed later in this book, in Chapter 2 for finite horizon stochastic
problems, and in Chapter 4 for infinite horizon problems, where it will be
related to the method of policy iteration.
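A minimal sketch of the rollout idea for the generic problem above follows: each candidate next component is scored by the cost of the base heuristic's completion of the partial solution. The interface (feasible_next, heuristic_completion, G) is an assumed packaging of the problem data, with heuristic_completion returning a full N-component solution (and returning an already complete tuple unchanged).

    def rollout(N, feasible_next, heuristic_completion, G):
        """Build a solution (u1, ..., uN) one component at a time; each candidate
        u_{k+1} is scored by the cost G of the heuristic completion of
        (u1, ..., uk, u_{k+1}), i.e., the heuristic's cost plays the role of J~_{k+1}."""
        partial = ()
        for k in range(N):
            best_u, best_cost = None, float("inf")
            for u in feasible_next(partial):               # u consistent with feasibility
                full = heuristic_completion(partial + (u,))
                if G(full) < best_cost:
                    best_u, best_cost = u, G(full)
            partial = partial + (best_u,)
        return partial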

1.3.3 Problems with a Terminal State

Many DP problems of interest involve a terminal state, i.e., a state t that


is cost-free and absorbing in the sense that

gk (t, uk , wk ) = 0, fk (t, uk , wk ) = t, for all uk ∈ Uk (t), k = 0, 1, . . . .

Thus the control process terminates upon reaching t, even if this happens
before the end of the horizon. One may reach t by choice if a special
stopping decision is available, or by means of a transition from another
state.

Generally, when it is known that an optimal policy will reach the


terminal state within at most some given number of stages N , the DP
problem can be formulated as an N -stage horizon problem.† The reason
is that even if the terminal state t is reached at a time k < N , we can
extend our stay at t for an additional N − k stages at no additional cost.
An example is the deterministic shortest path problem that we discussed
in Section 1.3.1.
Discrete deterministic optimization problems generally have a close
connection to shortest path problems as we have seen in Section 1.3.2. In
the problem discussed in that section, the terminal state is reached after
exactly N stages (cf. Fig. 1.3.3), but in other problems it is possible that
termination can happen earlier. The following well known puzzle is an
example.

Example 1.3.2 (The Four Queens Problem)

Four queens must be placed on a 4 × 4 portion of a chessboard so that no


queen can attack another. In other words, the placement must be such that
every row, column, or diagonal of the 4 × 4 board contains at most one queen.
Equivalently, we can view the problem as a sequence of problems; first, placing
a queen in one of the first two squares in the top row, then placing another
queen in the second row so that it is not attacked by the first, and similarly
placing the third and fourth queens. (It is sufficient to consider only the first
two squares of the top row, since the other two squares lead to symmetric
positions; this is an example of a situation where we have a choice between
several possible state spaces, but we select the one that is smallest.)
We can associate positions with nodes of an acyclic graph where the
root node s corresponds to the position with no queens and the terminal
nodes correspond to the positions where no additional queens can be placed
without some queen attacking another. Let us connect each terminal position
with an artificial terminal node t by means of an arc. Let us also assign to
all arcs cost zero except for the artificial arcs connecting terminal positions
with less than four queens with the artificial node t. These latter arcs are
assigned a cost of 1 (see Fig. 1.3.4) to express the fact that they correspond
to dead-end positions that cannot lead to a solution. Then, the four queens
problem reduces to finding a minimal cost path from node s to node t, with
an optimal sequence of queen placements corresponding to cost 0.
Note that once the states/nodes of the graph are enumerated, the prob-
lem is essentially solved. In this 4 × 4 problem the states are few and can
be easily enumerated. However, we can think of similar problems with much
larger state spaces. For example consider the problem of placing N queens
on an N × N board without any queen attacking another. Even for moder-
ate values of N , the state space for this problem can be extremely large (for
N = 8 the number of possible placements with exactly one queen in each row is $8^8 = 16{,}777{,}216$). It can be shown that there exist solutions to this problem for all $N \geq 4$.

† When an upper bound on the number of stages to termination is not known,


the problem must be formulated as an infinite horizon problem, as will be dis-
cussed in a subsequent chapter.

Figure 1.3.4 Discrete optimization formulation of the four queens problem.


Symmetric positions resulting from placing a queen in one of the rightmost
squares in the top row have been ignored. Squares containing a queen have
been darkened. All arcs have length zero except for those connecting dead-end
positions to the artificial terminal node.

There are also several variants of the N queens problem. For example
finding the minimal number of queens that can be placed on an N × N board
so that they either occupy or attack every square; this is known as the queen
domination problem. The minimal number can be found in principle by DP,
and it is known for some N (for example the minimal number is 5 for N = 8),
but not for all N (see e.g., the paper by Fernau [Fe10]).
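
For the original placement problem, the enumeration of Fig. 1.3.4 can be carried out directly. The following is a small illustrative sketch (not part of the text): partial placements with one queen per processed row are the nodes, rows with no safe square are dead ends, and a complete placement corresponds to a zero-cost path to $t$.

```python
# Illustrative enumeration for the N-queens placement problem of Fig. 1.3.4:
# a partial placement (one column index per processed row) is a node, a row
# with no safe square is a dead end (cost-1 arc to t), and a complete
# placement is a zero-cost path from s to t.

def solve_queens(N):
    def safe(cols, col):
        row = len(cols)                       # row where the new queen goes
        return all(c != col and abs(c - col) != row - r
                   for r, c in enumerate(cols))

    def extend(cols):
        if len(cols) == N:
            return cols                       # complete non-attacking placement
        for col in range(N):
            if safe(cols, col):
                result = extend(cols + [col])
                if result is not None:
                    return result
        return None                           # dead-end position

    return extend([])

print(solve_queens(4))   # e.g., [1, 3, 0, 2]: column of the queen in each row
```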

1.3.4 Forecasts

Consider a situation where at time k the controller has access to a forecast


yk that results in a reassessment of the probability distribution of wk and
possibly of future disturbances. For example, yk may be an exact prediction
of wk or an exact prediction that the probability distribution of wk is a
specific one out of a finite collection of distributions. Forecasts of interest
in practice are, for example, probabilistic predictions on the state of the
weather, the interest rate for money, and the demand for inventory.
Generally, forecasts can be handled by introducing additional states
corresponding to the information that the forecasts provide. We will illus-
trate the process with a simple example.
Assume that at the beginning of each stage k, the controller receives
an accurate prediction that the next disturbance wk will be selected ac-
cording to a particular probability distribution out of a given collection of
distributions {P1 , . . . , Pm }; i.e., if the forecast is i, then wk is selected ac-
cording to Pi . The a priori probability that the forecast will be i is denoted
by pi and is given.
The forecasting process can be represented by means of the equation

$$y_{k+1} = \xi_k,$$

where yk+1 can take the values 1, . . . , m, corresponding to the m possible


forecasts, and ξk is a random variable taking the value i with probability
pi . The interpretation here is that when ξk takes the value i, then wk+1
will occur according to the distribution Pi .
By combining the system equation with the forecast equation $y_{k+1} = \xi_k$,
we obtain an augmented system given by
$$\begin{pmatrix} x_{k+1} \\ y_{k+1} \end{pmatrix} = \begin{pmatrix} f_k(x_k, u_k, w_k) \\ \xi_k \end{pmatrix}.$$

The new state is $\tilde x_k = (x_k, y_k)$, and the new disturbance is $\tilde w_k = (w_k, \xi_k)$,
and its probability distribution is determined by the distributions Pi and
the probabilities pi , and depends explicitly on x̃k (via yk ) but not on the
prior disturbances.
Thus, by suitable reformulation of the cost, the problem can be cast
into the basic problem format. Note that the control applied depends on
both the current state and the current forecast. The DP algorithm takes
the form
$$J_N(x_N, y_N) = g_N(x_N),$$
$$J_k(x_k, y_k) = \min_{u_k \in U_k(x_k)} E_{w_k}\Big\{ g_k(x_k, u_k, w_k) + \sum_{i=1}^{m} p_i\, J_{k+1}\big(f_k(x_k, u_k, w_k),\, i\big) \;\Big|\; y_k \Big\}, \tag{1.16}$$
where $y_k$ may take the values $1, \ldots, m$, and the expectation over $w_k$ is
taken with respect to the distribution $P_{y_k}$.
It should be clear that the preceding formulation admits several ex-
tensions. One example is the case where forecasts can be influenced by
the control action and involve several future disturbances. However, the
price for these extensions is increased complexity of the corresponding DP
algorithm.
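
For a finite-state problem, the recursion (1.16) can be carried out directly over the augmented state $(x_k, y_k)$. The sketch below is illustrative only; `states`, `controls`, `dists`, `p`, `g`, `f`, and `gN` are hypothetical problem data (the disturbance distributions $P_1, \ldots, P_m$ are given as lists of probability-value pairs).

```python
# Sketch of the forecast-augmented recursion (1.16) for finite state, control,
# and disturbance sets.  Hypothetical problem data:
#   states               - list of states x
#   controls(x)          - list of feasible controls at x
#   dists[i]             - list of (probability, w) pairs for forecast i (P_i)
#   p[i]                 - a priori probability of forecast i
#   g(k, x, u, w), f(k, x, u, w), gN(x) - stage cost, system, terminal cost

def forecast_dp(N, states, controls, dists, p, g, f, gN):
    m = len(p)
    J = {(x, y): gN(x) for x in states for y in range(m)}    # J_N(x, y) = g_N(x)
    for k in reversed(range(N)):
        Jnew = {}
        for x in states:
            for y in range(m):                               # current forecast
                best = float("inf")
                for u in controls(x):
                    val = 0.0
                    for prob_w, w in dists[y]:               # w_k drawn from P_y
                        nxt = f(k, x, u, w)
                        future = sum(p[i] * J[(nxt, i)] for i in range(m))
                        val += prob_w * (g(k, x, u, w) + future)
                    best = min(best, val)
                Jnew[(x, y)] = best
        J = Jnew
    return J                                                 # costs J_0(x_0, y_0)
```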

1.3.5 Problems with Uncontrollable State Components

In many problems of interest the natural state of the problem consists of


several components, some of which cannot be affected by the choice of
control. In such cases the DP algorithm can be simplified considerably,
and be executed over the controllable components of the state. Before
describing how this can be done in generality, let us consider an example.

Example 1.3.3 (Parking)

A driver is looking for inexpensive parking on the way to his destination. The
parking area contains N spaces, and a garage at the end. The driver starts
at space 0 and traverses the parking spaces sequentially, i.e., from space k he
goes next to space k + 1, etc. Each parking space k costs c(k) and is free with
probability p(k) independently of whether other parking spaces are free or
not. If the driver reaches the garage without having parked, he must park at
the garage, which costs C. The driver can observe whether a parking space
is free only when he reaches it, and then, if it is free, he makes a decision to
park in that space or not to park and check the next space. The problem is
to find the minimum expected cost parking policy.
We formulate the problem as a DP problem with N stages, corresponding
to the parking spaces, with the garage reached at the end of the horizon, and
an artificial terminal state t that corresponds to having parked; see Fig. 1.3.5. At
each stage $k = 0, \ldots, N - 1$, in addition to $t$, we have two states $(k, F)$ and
$(k, \overline F)$, corresponding to space $k$ being free or taken, respectively. The deci-
sion/control is to park or continue at state $(k, F)$ [there is no choice at states
$(k, \overline F)$ and the garage].
Let us now derive the form of the DP algorithm, denoting
$J_k^*(F)$: the optimal cost-to-go upon arrival at a space $k$ that is free,
$J_k^*(\overline F)$: the optimal cost-to-go upon arrival at a space $k$ that is taken,
$J_N^* = C$: the cost-to-go upon arrival at the garage,
$J_k^*(t) = 0$: the terminal cost-to-go (once parked).


Figure 1.3.5 Cost structure of the parking problem. The driver may park at
space k = 0, 1, . . . , N − 1 at cost c(k), if the space is free, or continue to the
next space k + 1 at no cost. At space N (the garage) the driver must park at
cost C.

The DP algorithm for $k = 0, \ldots, N - 1$ takes the form
$$J_k^*(F) = \begin{cases} \min\Big[c(k),\; p(k)J_{k+1}^*(F) + \big(1 - p(k)\big)J_{k+1}^*(\overline F)\Big] & \text{if } k = 0, \ldots, N - 2, \\ \min\big[c(N-1),\, C\big] & \text{if } k = N - 1, \end{cases}$$
$$J_k^*(\overline F) = \begin{cases} p(k)J_{k+1}^*(F) + \big(1 - p(k)\big)J_{k+1}^*(\overline F) & \text{if } k = 0, \ldots, N - 2, \\ C & \text{if } k = N - 1, \end{cases}$$
(we omit here the obvious equations for the terminal state $t$ and the garage
state $N$).
While this algorithm is easily executed, it can be written in a simpler
and equivalent form, which takes advantage of the fact that the second compo-
nent ($F$ or $\overline F$) of the state is uncontrollable. This can be done by introducing
the scalars

$$\hat J_k = p(k) J_k^*(F) + \big(1 - p(k)\big) J_k^*(\overline F), \qquad k = 0, \ldots, N - 1,$$

which can be viewed as the optimal expected cost-to-go upon arriving at space
k but before verifying its free or taken status.
Indeed, using the preceding DP algorithm, and setting $\hat J_N = C$ (the cost-to-go
upon arrival at the garage), we have
$$\hat J_k = p(k) \min\big[c(k),\, \hat J_{k+1}\big] + \big(1 - p(k)\big)\hat J_{k+1}, \qquad k = 0, \ldots, N - 1.$$
From this algorithm we can also obtain the optimal parking policy, which is
to park at space $k = 0, \ldots, N - 1$ if it is free and $c(k) \leq \hat J_{k+1}$.
Figure 1.3.6 provides a plot of $\hat J_k$ for the case where
$$p(k) \equiv 0.05, \qquad c(k) = N - k, \qquad C = 100, \qquad N = 200. \tag{1.17}$$



Figure 1.3.6 Optimal cost-to-go and optimal policy for the parking problem with
the data in Eq. (1.17). The optimal policy is to travel from space 0 to space 165
and then to park at the first available space.

The optimal policy is to travel to space 165 and then to park at the first
available space. The reader may verify that this type of policy, characterized
by a single threshold distance, is optimal assuming that c(k) is monotonically
decreasing with k.
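
As a numerical check (not from the text), the reduced algorithm is easy to run for the data of Eq. (1.17); with the convention $\hat J_N = C$ used above, it should reproduce the single-threshold policy of Fig. 1.3.6, with the threshold near space 165.

```python
# Numerical check for the parking problem with the data of Eq. (1.17),
# using the convention J^_N = C for the garage.

N, C = 200, 100.0
p = [0.05] * N                           # p(k): probability that space k is free
c = [float(N - k) for k in range(N)]     # c(k) = N - k

J_hat = [0.0] * (N + 1)
J_hat[N] = C
for k in reversed(range(N)):
    J_hat[k] = p[k] * min(c[k], J_hat[k + 1]) + (1 - p[k]) * J_hat[k + 1]

# Park at space k (if free) whenever c(k) <= J^_{k+1}.
threshold = next(k for k in range(N) if c[k] <= J_hat[k + 1])
print(threshold)    # expected to be near space 165, as in Fig. 1.3.6
```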

We will now formalize the procedure illustrated in the preceding ex-


ample. Let the state of the system be a composite (xk , yk ) of two compo-
nents xk and yk . The evolution of the main component, xk , is affected by
the control uk according to the equation

xk+1 = fk (xk , yk , uk , wk ),

where the probability distribution Pk (wk | xk , yk , uk ) is given. The evolu-


tion of the other component, yk , is governed by a given conditional distri-
bution Pk (yk | xk ) and cannot be affected by the control, except indirectly
through xk . One is tempted to view yk as a disturbance, but there is a
difference: yk is observed by the controller before applying uk , while wk
occurs after uk is applied, and indeed wk may probabilistically depend on
uk .
We will formulate a DP algorithm that is executed over the control-
lable component of the state, with the dependence on the uncontrollable
component being “averaged out” similar to the preceding example. In par-
ticular, let Jk (xk , yk ) denote the optimal cost-to-go at stage k and state
(xk , yk ), and define

$$\hat J_k(x_k) = E_{y_k}\big\{ J_k(x_k, y_k) \mid x_k \big\}.$$

We will derive a DP algorithm that generates Ĵ k (xk ).



Indeed, we have

$$\begin{aligned}
\hat J_k(x_k) &= E_{y_k}\big\{ J_k(x_k, y_k) \mid x_k \big\} \\
&= E_{y_k}\Big\{ \min_{u_k \in U_k(x_k, y_k)} E_{w_k, x_{k+1}, y_{k+1}}\big\{ g_k(x_k, y_k, u_k, w_k) + J_{k+1}(x_{k+1}, y_{k+1}) \,\big|\, x_k, y_k, u_k \big\} \,\Big|\, x_k \Big\} \\
&= E_{y_k}\Big\{ \min_{u_k \in U_k(x_k, y_k)} E_{w_k, x_{k+1}}\big\{ g_k(x_k, y_k, u_k, w_k) + E_{y_{k+1}}\big\{ J_{k+1}(x_{k+1}, y_{k+1}) \,\big|\, x_{k+1} \big\} \,\big|\, x_k, y_k, u_k \big\} \,\Big|\, x_k \Big\},
\end{aligned}$$

and finally

$$\hat J_k(x_k) = E_{y_k}\Big\{ \min_{u_k \in U_k(x_k, y_k)} E_{w_k}\big\{ g_k(x_k, y_k, u_k, w_k) + \hat J_{k+1}\big(f_k(x_k, y_k, u_k, w_k)\big) \big\} \,\Big|\, x_k \Big\}. \tag{1.18}$$

The advantage of this equivalent DP algorithm is that it is executed


over a significantly reduced state space. For example, if xk takes n possible
values and yk takes m possible values, then DP is executed over n states
instead of nm states. Note, however, that the minimization in the right-
hand side of the preceding equation yields an optimal control law as a
function of the full state (xk , yk ).
As an example, consider the augmented state resulting from the in-
corporation of forecasts, as described earlier in Section 1.3.4. Then, the
forecast $y_k$ represents an uncontrolled state component, so that the DP al-
gorithm can be simplified as in Eq. (1.18). In particular, using the notation
of Section 1.3.4, by defining
$$\hat J_k(x_k) = \sum_{i=1}^{m} p_i J_k(x_k, i), \qquad k = 0, 1, \ldots, N - 1,$$

and
$$\hat J_N(x_N) = g_N(x_N),$$
we have, using Eq. (1.16),
$$\hat J_k(x_k) = \sum_{i=1}^{m} p_i \min_{u_k \in U_k(x_k)} E_{w_k}\Big\{ g_k(x_k, u_k, w_k) + \hat J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \,\Big|\, y_k = i \Big\},$$



which is executed over the space of xk rather than xk and yk . This is a


simpler algorithm than the one of Eq. (1.16).
Uncontrollable state components often occur in arrival systems, such
as queueing, where action must be taken in response to a random event
(such as a customer arrival) that cannot be influenced by the choice of
control. Then the state of the arrival system must be augmented to include
the random event, but the DP algorithm can be executed over a smaller
space, as per Eq. (1.18). Here is another example of similar type.

Figure 1.3.7 Illustration of a tetris board.

Example 1.3.4 (Tetris)

Tetris is a popular video game played on a two-dimensional grid. Each square


in the grid can be full or empty, making up a “wall of bricks” with “holes”
and a “jagged top” (see Fig. 1.3.7). The squares fill up as blocks of different
shapes fall from the top of the grid and are added to the top of the wall. As a
given block falls, the player can move horizontally and rotate the block in all
possible ways, subject to the constraints imposed by the sides of the grid and
the top of the wall. The falling blocks are generated independently according
to some probability distribution, defined over a finite set of standard shapes.
The game starts with an empty grid and ends when a square in the top row
becomes full and the top of the wall reaches the top of the grid. When a
row of full squares is created, this row is removed, the bricks lying above this
row move one row downward, and the player scores a point. The player’s
objective is to maximize the score attained (total number of rows removed)
within N steps or up to termination of the game, whichever occurs first.
We can model the problem of finding an optimal tetris playing strategy
as a stochastic DP problem. The control, denoted by u, is the horizontal
positioning and rotation applied to the falling block. The state consists of
two components:
(1) The board position, i.e., a binary description of the full/empty status
of each square, denoted by x.

(2) The shape of the current falling block, denoted by y.


There is also an additional termination state which is cost-free. Once the
state reaches the termination state, it stays there with no change in cost.
The shape y is generated according to a probability distribution p(y),
independently of the control, so it can be viewed as an uncontrollable state
component. The DP algorithm (1.18) is executed over the space of x and has
the intuitive form
$$\hat J_k(x) = \sum_{y} p(y) \max_{u}\Big[ g(x, y, u) + \hat J_{k+1}\big(f(x, y, u)\big) \Big], \qquad \text{for all } x,$$

where
g(x, y, u) is the number of points scored (rows removed),
f (x, y, u) is the board position (or termination state),
when the state is (x, y) and control u is applied, respectively. Note, however,
that despite the simplification in the DP algorithm achieved by eliminating
the uncontrollable portion of the state, the number of states x is enormous,
and the problem can only be addressed by suboptimal methods, which will
be discussed later in this book.
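
Purely for illustration of the form of the recursion (and not as a practical method, for the reason just noted), a sketch over an enumerated set of board positions might look as follows; `boards`, `p`, `moves`, `g`, `f`, and `TERMINAL` are hypothetical placeholders.

```python
# Schematic form of the recursion (1.18) for tetris: the shape y is averaged
# out, and the recursion runs over board positions x only.  All inputs are
# hypothetical placeholders: boards (enumerated positions, including a
# TERMINAL game-over state), p (dict of shape probabilities), moves(x, y),
# score g(x, y, u), and successor board f(x, y, u).

def tetris_dp(N, boards, p, moves, g, f, TERMINAL):
    J = {x: 0.0 for x in boards}              # J^_N = 0: no points after stage N
    for _ in range(N):
        Jnew = {}
        for x in boards:
            if x == TERMINAL:
                Jnew[x] = 0.0                 # game over: no further score
                continue
            Jnew[x] = sum(
                p[y] * max(g(x, y, u) + J[f(x, y, u)] for u in moves(x, y))
                for y in p
            )
        J = Jnew
    return J                                  # expected score-to-go for each board
```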

1.3.6 Partial State Information and Belief States

We have assumed so far that the controller has access to the exact value of
the current state xk , so a policy consists of a sequence of functions µk (xk ),
k = 0, . . . , N − 1. However, in many practical settings this assumption is
unrealistic, because some components of the state may be inaccessible for
measurement, the sensors used for measuring them may be inaccurate, or
the cost of obtaining accurate measurements may be prohibitive.
Often in such situations the controller has access to only some of
the components of the current state, and the corresponding measurements
may also be corrupted by stochastic uncertainty. For example in three-
dimensional motion problems, the state may consist of the six-tuple of po-
sition and velocity components, but the measurements may consist of noise-
corrupted radar measurements of the three position components. This gives
rise to problems of partial or imperfect state information, which have re-
ceived a lot of attention in the optimization and artificial intelligence
literature (see e.g., [Ber17], Ch. 4). Even though there are DP algorithms
for partial information problems, these algorithms are far more computa-
tionally intensive than in the perfect information case. For this reason,
in the absence of an analytical solution, partial information problems are
typically solved suboptimally in practice.
On the other hand it turns out that conceptually, partial state infor-
mation problems are no different than the perfect state information prob-
lems we have been addressing so far. In fact by various reformulations, we


Figure 1.3.8 Schematic illustration of a control system with imperfect state


observations. The belief state pk is the conditional probability distribution of xk
given all the observations up to time k.

can reduce a partial state information problem to one with perfect state
information (see [Ber17], Ch. 4). The most common approach is to replace
the state xk with a belief state, which is the probability distribution of xk
given all the observations that have been obtained by the controller up to
time k (see Fig. 1.3.8). This probability distribution can in principle be
computed, and it can serve as “state” in an appropriate DP algorithm. We
illustrate this process with a simple example.

Example 1.3.5 (Treasure Hunting)

In a classical problem of search, one has to decide at each of N periods


whether to search a site that may contain a treasure. If a treasure is present,
the search reveals it with probability β, in which case the treasure is removed
from the site. Here the state xk has two values: either a treasure is present in
the site or it is not. The control uk takes two values: search and not search. If
the site is searched, we obtain an observation, which takes one of two values:
treasure found or not found. If the site is not searched, no information is
obtained.
Denote

pk : probability a treasure is present at the beginning of period k.

This is the belief state at time k and it evolves according to the equation

$$p_{k+1} = \begin{cases} p_k & \text{if the site is not searched at time } k, \\ 0 & \text{if the site is searched and a treasure is found,} \\ \dfrac{p_k(1-\beta)}{p_k(1-\beta) + 1 - p_k} & \text{if the site is searched but no treasure is found.} \end{cases} \tag{1.19}$$
The third relation above follows by application of Bayes’ rule (pk+1 is equal to
the kth period probability of a treasure being present and the search being un-
successful, divided by the probability of an unsuccessful search). The second
relation holds because the treasure is removed after a successful search.

Let us view pk as the state of a “belief system” given by Eq. (1.19),


and write a DP algorithm, assuming that the treasure’s worth is V , that each
search costs C, and that once we decide not to search at a particular time,
then we cannot search at future times. The algorithm takes the form
$$\hat J_k(p_k) = \max\bigg[ \hat J_{k+1}(p_k),\; -C + p_k \beta V + (1 - p_k \beta)\, \hat J_{k+1}\bigg( \frac{p_k(1-\beta)}{p_k(1-\beta) + 1 - p_k} \bigg) \bigg], \tag{1.20}$$
with $\hat J_N(p_N) = 0$.


This DP algorithm can be used to obtain an analytical solution. In
particular, it is straightforward to show by induction that the functions Jˆk
satisfy Jˆk (pk ) ≥ 0 if pk ∈ [0, 1] and

$$\hat J_k(p_k) = 0 \quad \text{if } p_k \leq \frac{C}{\beta V}.$$

From this it follows that it is optimal to search at period k if and only if

$$\frac{C}{\beta V} \leq p_k.$$

Thus, it is optimal to search if and only if the expected reward from the next
search, $p_k \beta V$, is greater than or equal to the cost $C$ of the search; that is, a
myopic policy that focuses on just the next stage is optimal.
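
As a numerical illustration (not from the text), the DP recursion (1.20) can be run on a discretized belief grid and the resulting search threshold compared with $C/(\beta V)$; the horizon, parameter values, and grid resolution below are assumptions made for the example.

```python
# Solve the treasure-hunting recursion (1.20) on a discretized belief grid and
# compare the computed search threshold with C / (beta * V).  The horizon,
# data, and grid resolution below are illustrative assumptions.

N, beta, V, C = 10, 0.6, 100.0, 20.0
M = 1000                                       # grid points for p in [0, 1]
grid = [i / M for i in range(M + 1)]

def interp(J, p):
    """Piecewise-linear interpolation of grid values J at belief p."""
    i = min(int(p * M), M - 1)
    w = p * M - i
    return (1 - w) * J[i] + w * J[i + 1]

J = [0.0] * (M + 1)                            # J^_N(p) = 0
for k in reversed(range(N)):
    Jnew = []
    for p in grid:
        p_next = p * (1 - beta) / (p * (1 - beta) + 1 - p)   # Bayes update (1.19)
        search = -C + p * beta * V + (1 - p * beta) * interp(J, p_next)
        Jnew.append(max(interp(J, p), search))               # not search vs. search
    J = Jnew

threshold = next(p for p, v in zip(grid, J) if v > 1e-9)
print(threshold, C / (beta * V))               # the two should nearly coincide
```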

Of course the preceding example is extremely simple, involving a state


xk that takes just two values. As a result, the belief state pk takes val-
ues within the interval [0, 1]. Still there are infinitely many values in this
interval, and if a computational solution were necessary, the belief state
would have to be discretized and the DP algorithm (1.20) would have to
be adapted to the discretization.
In problems where the state xk can take many values, say n, the belief
state takes values in an n-dimensional simplex, so discretization becomes
problematic for large n. As a result, alternative suboptimal solution meth-
ods are often used in partial state information problems. Some of these
methods will be described in future chapters.
The following is a simple example of a partial state information prob-
lem whose belief state is complicated, and a solution by exact DP is im-
possible.

Example 1.3.6 (Bidirectional Parking)

Let us consider a more complex version of the parking problem of Example


1.3.3. As in that example, a driver is looking for inexpensive parking on the

way to his destination, along a line of L parking spaces with a garage at the
end. The difference is that the driver can move in either direction, rather
than just forward towards the garage. In particular, at space i, the driver can
park at cost $c(i)$ if $i$ is free, can move to $i - 1$ at cost $\beta_i^-$ or can move to $i + 1$
at cost $\beta_i^+$. Moreover, the driver records the free/taken status of the spaces
previously visited and may return to any of these spaces.
Let us assume that the probability p(i) of a space i being free changes
over time, i.e., a space found free (or taken) at a given visit may get taken
(or become free, respectively) by the time of the next visit. The initial prob-
abilities p(i), before visiting any spaces, are known, and the mechanism by
which these probabilities change over time is also known to the driver. As an
example, we may assume that each time period, p(i) increases by a certain
known factor with some probability ξ and decreases by another known factor
with the complementary probability 1 − ξ.
Here the belief state is the vector of current probabilities

$$\big(p(1), \ldots, p(L)\big),$$

and it is updated at each time based on the new observation: the free/taken
status of the space visited at that time. Thus the belief state belongs to
the unit simplex of L-dimensional vectors and can be perfectly computed
by the driver, given the parking status observations of the spaces visited
thus far. While it is possible to state an exact DP algorithm that is defined
over the simplex of belief states, and we will do so later, the algorithm is
impossible to execute in practice.† Thus the problem can only be solved with
approximations.

1.3.7 Linear Quadratic Optimal Control

In a few exceptional special cases the DP algorithm yields an analytical


solution, which can be used among other purposes, as a starting point
for approximate DP schemes. Prominent among such cases are various
linear quadratic optimal control problems, which involve a linear (possibly
multidimensional) system, a quadratic cost function, and no constraints
on the control. Let us illustrate this with the deterministic scalar linear
quadratic Example 1.1.2. We will apply the DP algorithm for the case of
just two stages (N = 2), and illustrate the method for obtaining a nice
analytical solution.
As defined in Example 1.1.2, the terminal cost is
$$g_2(x_2) = r(x_2 - T)^2.$$

† The problem as stated is an infinite horizon problem because there is noth-


ing to prevent the driver from moving forever in the parking lot without ever
parking. We can convert the problem to a finite horizon problem by restricting
the number of moves to a given upper limit, say N > L, and requiring that if the
driver is at a distance of k spaces from the garage at time N − k, then driving in
the direction away from the garage is not an option.

Thus the DP algorithm starts with

$$J_2(x_2) = g_2(x_2) = r(x_2 - T)^2,$$

[cf. Eq. (1.3)].


For the next-to-last stage, we have [cf. Eq. (1.4)]
$$J_1(x_1) = \min_{u_1}\big[ u_1^2 + J_2(x_2) \big] = \min_{u_1}\Big[ u_1^2 + J_2\big((1-a)x_1 + a u_1\big) \Big].$$

Substituting the previous form of J2 , we obtain


$$J_1(x_1) = \min_{u_1}\Big[ u_1^2 + r\big((1-a)x_1 + a u_1 - T\big)^2 \Big]. \tag{1.21}$$

This minimization will be done by setting to zero the derivative with respect
to u1 . This yields

$$0 = 2u_1 + 2ra\big((1-a)x_1 + a u_1 - T\big),$$

and by collecting terms and solving for u1 , we obtain the optimal temper-
ature for the last oven as a function of x1 :


$$\mu_1(x_1) = \frac{ra\big(T - (1-a)x_1\big)}{1 + ra^2}. \tag{1.22}$$

By substituting the optimal u1 in the expression (1.21) for J1 , we


obtain
$$\begin{aligned}
J_1(x_1) &= \frac{r^2 a^2\big((1-a)x_1 - T\big)^2}{(1 + ra^2)^2} + r\bigg((1-a)x_1 + \frac{ra^2\big(T - (1-a)x_1\big)}{1 + ra^2} - T\bigg)^2 \\
&= \frac{r^2 a^2\big((1-a)x_1 - T\big)^2}{(1 + ra^2)^2} + r\bigg(\frac{ra^2}{1 + ra^2} - 1\bigg)^2 \big((1-a)x_1 - T\big)^2 \\
&= \frac{r\big((1-a)x_1 - T\big)^2}{1 + ra^2}.
\end{aligned}$$

We now go back one stage. We have [cf. Eq. (1.4)]


$$J_0(x_0) = \min_{u_0}\big[ u_0^2 + J_1(x_1) \big] = \min_{u_0}\Big[ u_0^2 + J_1\big((1-a)x_0 + a u_0\big) \Big],$$

and by substituting the expression already obtained for $J_1$, we have
$$J_0(x_0) = \min_{u_0}\Bigg[ u_0^2 + \frac{r\big((1-a)^2 x_0 + (1-a)a u_0 - T\big)^2}{1 + ra^2} \Bigg].$$

We minimize with respect to u0 by setting the corresponding derivative to


zero. We obtain

$$0 = 2u_0 + \frac{2r(1-a)a\big((1-a)^2 x_0 + (1-a)a u_0 - T\big)}{1 + ra^2}.$$

This yields, after some calculation, the optimal temperature of the first
oven: 

$$\mu_0(x_0) = \frac{r(1-a)a\big(T - (1-a)^2 x_0\big)}{1 + ra^2\big(1 + (1-a)^2\big)}. \tag{1.23}$$
The optimal cost is obtained by substituting this expression in the formula
for J0 . This leads to a straightforward but lengthy calculation, which in
the end yields the rather simple formula
$$J_0(x_0) = \frac{r\big((1-a)^2 x_0 - T\big)^2}{1 + ra^2\big(1 + (1-a)^2\big)}.$$

This completes the solution of the problem.


Note that the algorithm has simultaneously yielded an optimal policy
{µ∗0 , µ∗1 } via Eqs. (1.23) and (1.22): a rule that tells us the optimal oven
temperatures u0 = µ∗0 (x0 ) and u1 = µ∗1 (x1 ) for every possible value of the
states x0 and x1 , respectively. Thus the DP algorithm solves all the tail
subproblems and provides a feedback policy.
A noteworthy feature in this example is the facility with which we
obtained an analytical solution. A little thought while tracing the steps of
the algorithm will convince the reader that what simplifies the solution is
the quadratic nature of the cost and the linearity of the system equation.
Indeed, it can be shown in generality that when the system is linear and the
cost is quadratic, the optimal policy and cost-to-go function are given by
closed-form expressions, regardless of the number of stages N (see [Ber17],
Section 3.1).
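
As a quick sanity check (not part of the text), the closed-form expressions (1.22), (1.23), and the formula for $J_0$ can be compared against a brute-force grid minimization; the parameter values below are illustrative assumptions.

```python
# Cross-check of the two-stage closed-form solution (Eqs. (1.22)-(1.23) and
# the formula for J0) by brute-force grid search; parameter values are
# illustrative assumptions.

a, r, T, x0 = 0.7, 2.0, 70.0, 20.0
grid = [u / 100.0 for u in range(-10000, 10001)]      # u in [-100, 100], step 0.01

def argmin(f):
    return min(grid, key=f)

def J1(x):
    # cost-to-go of the last stage, as derived above
    return r * ((1 - a) * x - T) ** 2 / (1 + r * a**2)

# Last stage at some x1: compare the grid minimizer with Eq. (1.22).
x1 = 15.0
u1_grid = argmin(lambda u: u**2 + r * ((1 - a) * x1 + a * u - T) ** 2)
u1_form = r * a * (T - (1 - a) * x1) / (1 + r * a**2)

# First stage: compare the grid minimizer and cost with Eq. (1.23) and J0.
u0_grid = argmin(lambda u: u**2 + J1((1 - a) * x0 + a * u))
u0_form = r * (1 - a) * a * (T - (1 - a) ** 2 * x0) / (1 + r * a**2 * (1 + (1 - a) ** 2))
J0_grid = u0_grid**2 + J1((1 - a) * x0 + a * u0_grid)
J0_form = r * ((1 - a) ** 2 * x0 - T) ** 2 / (1 + r * a**2 * (1 + (1 - a) ** 2))

print(u1_grid, u1_form)                 # should agree to within the grid step
print(u0_grid, u0_form)
print(J0_grid, J0_form)
```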

Stochastic Linear Quadratic Problems - Certainty Equivalence

Let us now introduce a zero-mean stochastic additive disturbance in the


linear system equation. Remarkably, it turns out that the optimal policy
remains unaffected. To see this, assume that the material’s temperature
evolves according to

$$x_{k+1} = (1-a)x_k + a u_k + w_k, \qquad k = 0, 1,$$

where w0 and w1 are independent random variables with given distribution,


zero mean
E{w0 } = E{w1 } = 0,

and finite variance. Then the equation for J1 [cf. Eq. (1.4)] becomes
$$\begin{aligned}
J_1(x_1) &= \min_{u_1} E_{w_1}\Big\{ u_1^2 + r\big((1-a)x_1 + a u_1 + w_1 - T\big)^2 \Big\} \\
&= \min_{u_1}\Big[ u_1^2 + r\big((1-a)x_1 + a u_1 - T\big)^2 + 2r E\{w_1\}\big((1-a)x_1 + a u_1 - T\big) + r E\{w_1^2\} \Big].
\end{aligned}$$


Since E{w1 } = 0, we obtain


$$J_1(x_1) = \min_{u_1}\Big[ u_1^2 + r\big((1-a)x_1 + a u_1 - T\big)^2 \Big] + r E\{w_1^2\}.$$

Comparing this equation with Eq. (1.21), we see that the presence of $w_1$ has
resulted in an additional inconsequential constant term, $rE\{w_1^2\}$. There-
fore, the optimal policy for the last stage remains unaffected by the presence
of $w_1$, while $J_1(x_1)$ is increased by $rE\{w_1^2\}$. It can be seen that a similar
situation also holds for the first stage. In particular, the optimal cost is
given by the same expression as before except for an additive constant that
depends on $E\{w_0^2\}$ and $E\{w_1^2\}$.
Generally, if the optimal policy is unaffected when the disturbances
are replaced by their means, we say that certainty equivalence holds. This
occurs in several types of problems involving a linear system and a quadratic
cost; see [Ber17], Sections 3.1 and 4.2. For other problems, certainty equiv-
alence can be used as a basis for problem approximation, e.g., assume
that certainty equivalence holds (i.e., replace stochastic quantities by some
typical values, such as their expected values) and apply exact DP to the
resulting deterministic optimal control problem (see Section 2.3.2).
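
The following small Monte Carlo sketch (illustrative only, with assumed parameter values and noise distribution) makes the certainty equivalence property concrete for the last stage: the minimizer of the sampled expected cost matches the deterministic rule (1.22), up to grid and sampling error.

```python
# Monte Carlo illustration of certainty equivalence for the last stage:
# with zero-mean noise w1 added to the system equation, the minimizer of the
# sampled expected cost matches the deterministic rule (1.22).  Parameter
# values and the noise distribution are illustrative assumptions.

import random

a, r, T, x1 = 0.7, 2.0, 70.0, 15.0
random.seed(0)
samples = [random.gauss(0.0, 3.0) for _ in range(2000)]      # zero-mean w1 draws

def expected_cost(u):
    return sum(u**2 + r * ((1 - a) * x1 + a * u + w - T) ** 2
               for w in samples) / len(samples)

grid = [u / 10.0 for u in range(0, 1001)]                     # u1 in [0, 100]
u1_noisy = min(grid, key=expected_cost)
u1_det = r * a * (T - (1 - a) * x1) / (1 + r * a**2)          # Eq. (1.22)
print(u1_noisy, u1_det)     # should agree up to grid and sampling error
```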

1.4 REINFORCEMENT LEARNING AND OPTIMAL CONTROL - SOME TERMINOLOGY

There has been intense interest in DP-related approximations in view of
their promise to deal with the curse of dimensionality (the explosion of the
computation as the number of states increases, which is addressed through the
use of approximate cost functions) and the curse of modeling (a simulator/computer
model may be used in place of a mathematical model of the problem). The
current state of the subject owes much to an enormously beneficial cross-
fertilization of ideas from optimal control (with its traditional emphasis
on decision making over time and formal optimization methodologies), and
from artificial intelligence (and its traditional emphasis on learning through
observation and experience, heuristic evaluation functions in game-playing
programs, and the use of feature-based and other representations).
The boundaries between these two fields are now diminished thanks
to a deeper understanding of the foundational issues, and the associated

methods and core applications. Unfortunately, however, there have been


substantial differences in language and emphasis in RL-based discussions
(where artificial intelligence-related terminology is used) and DP-based dis-
cussions (where the optimal control-related terminology is used). This in-
cludes the typical use of maximization/value function/reward in the former
field and the use of minimization/cost function/cost per stage in the latter
field, and goes much further.
The notation and terminology used in this book are standard in DP
and optimal control. In an effort to forestall confusion for readers who
are accustomed to either the RL or the optimal control terminology, we
provide a list of selected terms commonly used in RL, and their optimal
control counterparts.
(a) Agent = Controller or decision maker.
(b) Action = Control.
(c) Environment = System.
(d) Reward of a stage = (Opposite of) Cost of a stage.
(e) State value = (Opposite of) Cost starting from a state.
(f) Value or reward (or state-value) function = (Opposite of) Cost
function.
(g) Maximizing the value function = Minimizing the cost function.
(h) Action (or state-action) value = Q-factor of a state-control pair.
(i) Planning = Solving a DP problem with a known mathematical
model.
(j) Learning = Solving a DP problem in model-free fashion.
(k) Self-learning (or self-play in the context of games) = Solving a DP
problem using policy iteration.
(l) Deep reinforcement learning = Approximate DP using value
and/or policy approximation with deep neural networks.
(m) Prediction = Policy evaluation.
(n) Generalized policy iteration = Optimistic policy iteration.
(o) State abstraction = Aggregation.
(p) Episodic task or episode = Finite-step system trajectory.
(q) Continuing task = Infinite-step system trajectory.
(r) Backup = Applying the DP operator at some state.
(s) Sweep = Applying the DP operator at all states.

(t) Greedy policy with respect to a cost function J = Minimizing


policy in the DP expression defined by J.
(u) Afterstate = Post-decision state.
Some of the preceding terms will be introduced in future chapters. The
reader may then wish to return to this section as an aid in connecting with
the relevant RL literature.

1.5 NOTES AND SOURCES

Our discussion of exact DP in this chapter has been brief since our fo-
cus in this book will be on approximate DP and RL. The author’s DP
textbooks [Ber12], [Ber17] provide an extensive discussion of exact DP
and its applications. The mathematical aspects of exact DP are discussed
in the monograph by Bertsekas and Shreve [BeS78], particularly the fine
probabilistic/measure-theoretic issues associated with stochastic optimal
control. The author’s abstract DP monograph [Ber18a] aims at a unified
development of the core theory and algorithms of total cost sequential de-
cision problems, based on the strong connections of the subject with fixed
point theory.
The approximate DP literature has expanded tremendously since the
connections between DP and RL became apparent in the late 80s and
early 90s. We will restrict ourselves to mentioning textbooks, research
monographs, and broad surveys, which supplement our discussions and
collectively provide a guide to the literature. Thus the author wishes to
apologize in advance for the many omissions of references from the research
literature.
Two books were written on our subject in the 1990s, setting the
tone for subsequent developments in the field. One in 1996 by Bertsekas
and Tsitsiklis [BeT96], which reflects a decision, control, and optimization
viewpoint, and another in 1998 by Sutton and Barto, which reflects an
artificial intelligence viewpoint (a 2nd edition, [SuB18], was published in
2018). We refer to the former book and also to the author’s DP textbooks
[Ber12], [Ber17] for a broader discussion of some of the topics of the present
book.
More recent books are the 2003 book by Gosavi (a much expanded
2nd edition [Gos15] appeared in 2015), which emphasizes simulation-based
optimization and RL algorithms, Cao [Cao07], which emphasizes a sensi-
tivity approach to simulation-based methods, Chang, Fu, Hu, and Marcus
[CFH07], which emphasizes finite-horizon/limited lookahead schemes and
adaptive sampling, Busoniu et al. [BBD10], which focuses on function ap-
proximation methods for continuous space systems and includes a discus-
sion of random search methods, Powell [Pow11], which emphasizes resource
allocation and operations research applications, and Vrabie, Vamvoudakis,
and Lewis [VVL13], which discusses neural network-based methods and

continuous-time optimal control applications. The book by Haykin [Hay08]


discusses approximate DP in the broader context of neural network-related
subjects. The book by Borkar [Bor08] is an advanced monograph that
addresses rigorously many of the convergence issues of iterative stochastic
algorithms in approximate DP, mainly using the so called ODE approach.
The book by Meyn [Mey07] is broader in its coverage, but touches upon
some of the approximate DP algorithms that we discuss.
Several survey papers in the volumes by Si, Barto, Powell, and Wun-
sch [SBP04], and Lewis and Liu [LeL12], and the special issue by Lewis,
Liu, and Lendaris [LLL08] describe approximation methodology that we
will not be covering in this book: linear programming-based approaches
(De Farias [DeF04]), large-scale resource allocation methods (Powell and
Van Roy [PoV04]), and deterministic optimal control approaches (Fer-
rari and Stengel [FeS04], and Si, Yang, and Liu [SYL04]). The volume
by White and Sofge [WhS92] contains several surveys that describe early
work in the field. Influential surveys were written, from an artificial intelli-
gence viewpoint, by Barto, Bradtke, and Singh [BBS95], and by Kaelbling,
Littman, and Moore [KLM96]. More recent surveys are Borkar [Bor09] (a
methodological point of view that explores connections with other Monte
Carlo schemes), Lewis and Vrabie [LeV09] (a control theory point of view),
Werbos [Web09] (which reviews potential connections between brain in-
telligence, neural networks, and DP), Szepesvari [Sze10] (which provides
a detailed description of approximation in value space from a RL point of
view), Browne et al. [BPW12] (which focuses on Monte Carlo Tree Search),
Grondman et al. [GBL12] (which focuses on policy gradient methods), and
the author’s [Ber05a] (which focuses on rollout algorithms and model pre-
dictive control), [Ber10a] (which focuses on approximate policy iteration),
and [Ber18b] (which focuses on aggregation methods).
