Reinforcement Learning and Optimal Control
by
Dimitri P. Bertsekas
Massachusetts Institute of Technology
Chapter 1
Exact Dynamic Programming
DRAFT
This is Chapter 1 of the draft textbook “Reinforcement Learning and
Optimal Control.” The chapter represents “work in progress,” and it will
be periodically updated. It more than likely contains errors (hopefully not
serious ones). Furthermore, its references to the literature are incomplete.
Your comments and suggestions to the author at [email protected] are
welcome. The date of last revision is given below.
where gN (xN ) is a terminal cost incurred at the end of the process. This
cost is a well-defined number, since the control sequence {u0 , . . . , uN −1 }
together with x0 determines exactly the state sequence {x1 , . . . , xN } via
the system equation (1.1). We want to minimize the cost (1.2) over all
sequences {u0 , . . . , uN −1 } that satisfy the control constraints, thereby ob-
taining the optimal value†
\[
J^*(x_0) = \min_{\substack{u_k \in U_k(x_k) \\ k=0,\ldots,N-1}} J(x_0; u_0, \ldots, u_{N-1})
\]
as a function of x_0.
We will next illustrate deterministic problems with some examples.
There are many situations where the state and control are naturally discrete
and take a finite number of values. Such problems are often conveniently
specified in terms of an acyclic graph specifying for each state xk the pos-
sible transitions to next states xk+1 . The nodes of the graph correspond
to states xk and the arcs of the graph correspond to state-control pairs
(xk , uk ). Each arc with start node xk corresponds to a choice of a single
control uk ∈ Uk (xk ) and has as end node the next state fk (xk , uk ). The
cost of an arc (xk , uk ) is defined as gk (xk , uk ); see Fig. 1.1.1. To handle the
final stage, an artificial terminal node t is added. Each state xN at stage
N is connected to the terminal node t with an arc having cost gN (xN ).
† We use throughout “min” (in place of “inf”) to indicate minimal value over
a feasible set of controls, even when we are not sure that the minimum is attained
by some feasible control.
[Figure 1.1.1: Transition graph of a deterministic finite-state system. The nodes correspond to the states x_k at stages 0, 1, 2, \ldots, N, starting from the initial state; an artificial terminal node t is connected to every state x_N of the final stage by a terminal arc with cost g_N(x_N).]
† It turns out also that any shortest path problem (with a possibly nona-
cyclic graph) can be reformulated as a finite-state deterministic optimal control
problem, as we will see in Section 1.3.1. See also [Ber17], Section 2.1, and also
[Ber98] for an extensive discussion of shortest path methods.
[Figure: Transition graph of the deterministic scheduling problem. Starting from the initial state, the states are the partial schedules (A, C, AB, AC, CA, CD, ABC, ACB, ACD, CAB, CAD, CDA), and each arc corresponds to scheduling the next operation.]
sequence is the sum of the setup costs associated with it; for example, the operation sequence ACDB has cost
\[
5 + 3 + 6 + 3 = 17.
\]
where a is a known scalar from the interval (0, 1). The objective is to get
the final temperature xN close to a given target T , while expending relatively
little energy. We express this with a cost function of the form
\[
r(x_N - T)^2 + \sum_{k=0}^{N-1} u_k^2,
\]
In another frequently arising optimal control problem there are linear con-
straints on the state and/or the control. In the preceding example it would
have been natural to require that ak ≤ xk ≤ bk and/or ck ≤ uk ≤ dk , where
ak , bk , ck , dk are given scalars. Then the problem would be solvable not only
by DP but also by quadratic programming methods. More generally, deterministic optimal control problems with continuous state and control spaces admit a solution not only by DP but also by nonlinear programming methods, such as gradient, conjugate gradient, and Newton methods, which can be suitably adapted to their special structure.
Principle of Optimality
Let {u*_0, \ldots, u*_{N−1}} be an optimal control sequence, which together with x_0 determines the corresponding state sequence {x*_1, \ldots, x*_N} via the system equation (1.1). Consider the subproblem whereby we start at x*_m at time m and wish to minimize the “cost-to-go” from time m to time N,
\[
g_m(x^*_m, u_m) + \sum_{k=m+1}^{N-1} g_k(x_k, u_k) + g_N(x_N),
\]
over {u_m, \ldots, u_{N−1}} with u_k ∈ U_k(x_k), k = m, \ldots, N−1. Then the truncated optimal control sequence {u*_m, \ldots, u*_{N−1}} is optimal for this subproblem.
and continuing in this manner until an optimal policy for the entire problem
is constructed.
The DP algorithm is based on this idea: it proceeds sequentially, by
solving all the tail subproblems of a given time length, using the solution
of the tail subproblems of shorter time length. We illustrate the algorithm
with the scheduling problem of Example 1.1.1. The calculations are simple
but tedious, and may be skipped without loss of continuity. However, they may be worth going over by a reader who has no prior experience in the use of DP.
Let us consider the scheduling Example 1.1.1, and let us apply the principle of
optimality to calculate the optimal schedule. We have to schedule optimally
the four operations A, B, C, and D. The numerical values of the transition
and setup costs are shown in Fig. 1.1.4 next to the corresponding arcs.
According to the principle of optimality, the “tail” portion of an optimal
schedule must be optimal. For example, suppose that the optimal schedule
is CABD. Then, having scheduled first C and then A, it must be optimal to
complete the schedule with BD rather than with DB. With this in mind, we
solve all possible tail subproblems of length two, then all tail subproblems of
length three, and finally the original problem that has length four (the sub-
problems of length one are of course trivial because there is only one operation
that is as yet unscheduled). As we will see shortly, the tail subproblems of
length k + 1 are easily solved once we have solved the tail subproblems of
length k, and this is the essence of the DP technique.
Tail Subproblems of Length 2 : These subproblems are the ones that involve
two unscheduled operations and correspond to the states AB, AC, CA, and
CD (see Fig. 1.1.4)
State AB : Here it is only possible to schedule operation C as the next op-
eration, so the optimal cost of this subproblem is 9 (the cost of schedul-
ing C after B, which is 3, plus the cost of scheduling D after C, which
is 6).
State AC : Here the possibilities are to (a) schedule operation B and then
D, which has cost 5, or (b) schedule operation D and then B, which has
cost 9. The first possibility is optimal, and the corresponding cost of
the tail subproblem is 5, as shown next to node AC in Fig. 1.1.4.
State CA: Here the possibilities are to (a) schedule operation B and then
D, which has cost 3, or (b) schedule operation D and then B, which has
cost 7. The first possibility is optimal, and the corresponding cost of
the tail subproblem is 3, as shown next to node CA in Fig. 1.1.4.
State CD: Here it is only possible to schedule operation A as the next
operation, so the optimal cost of this subproblem is 5.
Tail Subproblems of Length 3 : These subproblems can now be solved using
the optimal costs of the subproblems of length 2.
[Figure 1.1.4: Transition graph of the scheduling problem. The cost of each decision (initial or setup cost) is shown next to the corresponding arc, and the optimal cost-to-go of each tail subproblem is shown next to the corresponding node.]
State A: Here the possibilities are to (a) schedule next operation B (cost
2) and then solve optimally the corresponding subproblem of length 2
(cost 9, as computed earlier), a total cost of 11, or (b) schedule next
operation C (cost 3) and then solve optimally the corresponding sub-
problem of length 2 (cost 5, as computed earlier), a total cost of 8.
The second possibility is optimal, and the corresponding cost of the tail
subproblem is 8, as shown next to node A in Fig. 1.1.4.
State C : Here the possibilities are to (a) schedule next operation A (cost
4) and then solve optimally the corresponding subproblem of length 2
(cost 3, as computed earlier), a total cost of 7, or (b) schedule next
operation D (cost 6) and then solve optimally the corresponding sub-
problem of length 2 (cost 5, as computed earlier), a total cost of 11.
The first possibility is optimal, and the corresponding cost of the tail subproblem is 7, as shown next to node C in Fig. 1.1.4.
Original Problem of Length 4 : The possibilities here are (a) start with oper-
ation A (cost 5) and then solve optimally the corresponding subproblem of
length 3 (cost 8, as computed earlier), a total cost of 13, or (b) start with
operation C (cost 3) and then solve optimally the corresponding subproblem
of length 3 (cost 7, as computed earlier), a total cost of 10. The second pos-
sibility is optimal, and the corresponding optimal cost is 10, as shown next
to the initial state node in Fig. 1.1.4.
Note that having computed the optimal cost of the original problem
through the solution of all the tail subproblems, we can construct the opti-
mal schedule: we begin at the initial node and proceed forward, each time
choosing the operation that starts the optimal schedule for the corresponding
tail subproblem. In this way, by inspection of the graph and the computa-
tional results of Fig. 1.1.4, we determine that CABD is the optimal schedule.
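To make the tail-subproblem calculation concrete, here is a minimal Python sketch that solves the scheduling example by DP, using memoized recursion over tail subproblems. The initial and setup costs below are inferred from the worked calculations above rather than quoted from Fig. 1.1.4, so they should be viewed as illustrative.

```python
# Backward DP over tail subproblems for the four-operation scheduling example.
# Costs are inferred from the calculations in the text (e.g., "C after B costs 3",
# "D after C costs 6"); they are illustrative, not copied from Fig. 1.1.4.

START = {"A": 5, "C": 3}                      # cost of the first operation
SETUP = {("A", "B"): 2, ("A", "C"): 3, ("A", "D"): 4,
         ("B", "C"): 3, ("B", "D"): 1,
         ("C", "A"): 4, ("C", "B"): 4, ("C", "D"): 6,
         ("D", "A"): 3, ("D", "B"): 3}        # cost of scheduling b right after a
OPS = {"A", "B", "C", "D"}
PRECEDENCE = [("A", "B"), ("C", "D")]         # A must precede B, C must precede D

def feasible(seq):
    # b may appear only if a appears earlier in the partial schedule
    return all(b not in seq or (a in seq and seq.index(a) < seq.index(b))
               for a, b in PRECEDENCE)

def tail(seq, memo={}):
    """Optimal cost-to-go and optimal completion of the partial schedule `seq`."""
    if len(seq) == len(OPS):
        return 0, seq
    if seq not in memo:
        best = (float("inf"), None)
        for op in OPS - set(seq):
            new = seq + (op,)
            if feasible(new):
                step = START[op] if not seq else SETUP[(seq[-1], op)]
                cost, completion = tail(new)
                best = min(best, (step + cost, completion))
        memo[seq] = best
    return memo[seq]

cost, schedule = tail(())
print(cost, "".join(schedule))   # reproduces the optimal cost 10 and schedule CABD
```

With these inferred costs, the sketch reproduces the tail subproblem costs computed above (9, 5, 3, 5 for the length-two subproblems, 8 and 7 for the length-three subproblems) as well as the optimal cost 10 and the schedule CABD.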
Proceeding backwards, the DP algorithm constructs the optimal cost-to-go functions
\[
J^*_N(x_N),\ J^*_{N-1}(x_{N-1}),\ \ldots,\ J^*_0(x_0).
\]
Note that at stage k, the calculation in (1.4) must be done for all states
xk before proceeding to stage k−1. The key fact about the algorithm is that
for every initial state x0 , the optimal cost J * (x0 ) is equal to the number
J0* (x0 ), which is obtained at the last step of the DP algorithm. Indeed, a
more general fact can be shown, namely that for all m = 0, 1, . . . , N − 1,
and all states xm at time m, we have
\[
J^*_m(x_m) = \min_{\substack{u_k \in U_k(x_k) \\ k=m,\ldots,N-1}} J(x_m; u_m, \ldots, u_{N-1}), \tag{1.5}
\]
where
\[
J(x_m; u_m, \ldots, u_{N-1}) = g_N(x_N) + \sum_{k=m}^{N-1} g_k(x_k, u_k). \tag{1.6}
\]
Once the functions J^*_0, \ldots, J^*_N have been obtained, we can use the following algorithm to construct an optimal control sequence {u*_0, \ldots, u*_{N−1}} and corresponding state trajectory {x*_1, \ldots, x*_N} for the given initial state x_0.
Set
\[
u^*_0 \in \arg\min_{u_0 \in U_0(x_0)} \big[ g_0(x_0, u_0) + J^*_1\big(f_0(x_0, u_0)\big) \big],
\]
and
\[
x^*_1 = f_0(x_0, u^*_0).
\]
Sequentially, going forward, for k = 1, 2, \ldots, N − 1, set
\[
u^*_k \in \arg\min_{u_k \in U_k(x^*_k)} \big[ g_k(x^*_k, u_k) + J^*_{k+1}\big(f_k(x^*_k, u_k)\big) \big], \tag{1.7}
\]
and
\[
x^*_{k+1} = f_k(x^*_k, u^*_k). \tag{1.8}
\]
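For finite state and control spaces, the backward recursion (1.4) and the forward construction (1.7)-(1.8) can be sketched in a few lines of Python. The function and argument names below (states, controls, f, g, gN) are illustrative assumptions, not notation from the text.

```python
def backward_dp(states, controls, f, g, gN, N):
    """states[k]: iterable of states at stage k; controls(k, x): feasible controls at x.
    Returns J with J[k][x] = optimal cost-to-go from x at stage k (cf. Eq. (1.4))."""
    J = [dict() for _ in range(N + 1)]
    J[N] = {x: gN(x) for x in states[N]}
    for k in range(N - 1, -1, -1):
        for x in states[k]:
            J[k][x] = min(g(k, x, u) + J[k + 1][f(k, x, u)] for u in controls(k, x))
    return J

def forward_optimal_trajectory(x0, controls, f, g, J, N):
    """Recover an optimal control sequence and state trajectory, as in (1.7)-(1.8)."""
    x, us, xs = x0, [], [x0]
    for k in range(N):
        u = min(controls(k, x), key=lambda u: g(k, x, u) + J[k + 1][f(k, x, u)])
        x = f(k, x, u)
        us.append(u)
        xs.append(x)
    return us, xs
```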
The same forward construction can be used with approximation in value space, where the optimal cost-to-go functions J^*_{k+1} are replaced by approximations J̃_{k+1}: set
\[
\tilde u_0 \in \arg\min_{u_0 \in U_0(x_0)} \big[ g_0(x_0, u_0) + \tilde J_1\big(f_0(x_0, u_0)\big) \big],
\]
and set
\[
\tilde x_1 = f_0(x_0, \tilde u_0).
\]
Sequentially, going forward, for k = 1, 2, \ldots, N − 1, set
\[
\tilde u_k \in \arg\min_{u_k \in U_k(\tilde x_k)} \big[ g_k(\tilde x_k, u_k) + \tilde J_{k+1}\big(f_k(\tilde x_k, u_k)\big) \big], \tag{1.9}
\]
and
\[
\tilde x_{k+1} = f_k(\tilde x_k, \tilde u_k). \tag{1.10}
\]
[cf. the right-hand side of Eq. (1.9)]; this is also known as the (approximate) Q-factor of (x_k, u_k). We can then implement the computation of the approximately optimal control (1.9) through the minimization
\[
\tilde u_k \in \arg\min_{u_k \in U_k(\tilde x_k)} \tilde Q_k(\tilde x_k, u_k).
\]
Thus the optimal Q-factors are simply the expressions that are minimized in the right-hand side of the DP equation (1.4). Note that this equation implies that the optimal cost function J^*_k can be recovered from the optimal Q-factors Q^*_k by means of
\[
J^*_k(x_k) = \min_{u_k \in U_k(x_k)} Q^*_k(x_k, u_k).
\]
We will see later that exact and approximate forms of related algorithms
can be implemented by using model-free simulation, in the context of a
class of RL methods known as Q-learning.
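In code, the Q-factor viewpoint amounts to minimizing the one-stage cost plus the (approximate) cost-to-go of the next state over the available controls; a tiny sketch, with illustrative names, is given below.

```python
def q_factor(k, x, u, f, g, J_tilde):
    # (approximate) Q-factor of (x, u) at stage k: one-stage cost plus cost-to-go
    return g(k, x, u) + J_tilde[k + 1][f(k, x, u)]

def control_from_q_factors(k, x, controls, f, g, J_tilde):
    # cf. Eq. (1.9): choose a control that minimizes the (approximate) Q-factor
    return min(controls(k, x), key=lambda u: q_factor(k, x, u, f, g, J_tilde))
```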
The stochastic finite horizon optimal control problem differs from the de-
terministic version primarily in the nature of the discrete-time dynamic
system that governs the evolution of the state xk . This system includes a
random “disturbance” wk , which is characterized by a probability distri-
bution Pk (· | xk , uk ) that may depend explicitly on xk and uk , but not on
values of prior disturbances wk−1 , . . . , w0 . The system has the form
xk+1 = fk (xk , uk , wk ), k = 0, 1, . . . , N − 1,
π = {µ0 , . . . , µN −1 },
xk+1 = fk (xk , µk (xk ), wk ), k = 0, 1, . . . , N − 1.
where the expected value operation E{·} is over the random variables wk
and xk . An optimal policy π ∗ is one that minimizes this cost; i.e.,
The DP algorithm for the stochastic finite horizon optimal control problem
has a similar form to its deterministic version, and shares several of its
major characteristics:
(a) Using tail subproblems to break down the minimization over multiple
stages to single stage minimizations.
(b) Generating backwards for all k and xk the values Jk* (xk ), which give
the optimal cost-to-go starting at stage k at state xk .
(c) Obtaining an optimal policy by minimization in the DP equations.
(d) A structure that is suitable for approximation in value space, whereby
we replace Jk* by approximations J˜k , and obtain a suboptimal policy
by the corresponding minimization.
If u*_k = µ*_k(x_k) minimizes the right side of the DP equation (1.13),
\[
J^*_k(x_k) = \min_{u_k \in U_k(x_k)} E_{w_k}\Big\{ g_k(x_k, u_k, w_k) + J^*_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Big\},
\]
for each x_k and k, the policy π* = {µ*_0, \ldots, µ*_{N−1}} is optimal.
The key fact is that for every initial state x0 , the optimal cost J * (x0 )
is equal to the function J0* (x0 ), obtained at the last step of the above DP
algorithm. This can be proved by induction similar to the deterministic
case; we will omit the proof (see the discussion of Section 1.3 in the textbook
[Ber17]).†
As in deterministic problems, the DP algorithm can be very time-
consuming, in fact more so since it involves the expected value operation
in Eq. (1.13). This motivates suboptimal control techniques, such as approximation in value space, whereby we replace J^*_k with more easily obtainable approximations J̃_k. We will discuss this approach at length in subsequent chapters.
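When the disturbance w_k takes finitely many values, the stochastic DP recursion can be sketched as below. The function P(k, x, u), returning pairs (disturbance value, probability), and the other names are illustrative assumptions.

```python
def stochastic_backward_dp(states, controls, f, g, gN, P, N):
    """Backward DP for a stochastic problem with finitely many disturbance values.
    P(k, x, u) yields pairs (w, prob); J[k][x] is the optimal expected cost-to-go."""
    J = [dict() for _ in range(N + 1)]
    J[N] = {x: gN(x) for x in states[N]}
    for k in range(N - 1, -1, -1):
        for x in states[k]:
            J[k][x] = min(
                sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)]) for w, p in P(k, x, u))
                for u in controls(k, x)
            )
    return J
```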
The optimal cost-to-go functions J^*_k can be recovered from the optimal Q-factors Q^*_k by means of
\[
J^*_k(x_k) = \min_{u_k \in U_k(x_k)} Q^*_k(x_k, u_k).
\]
Let {1, 2, . . . , N, t} be the set of nodes of a graph, and let aij be the cost of
moving from node i to node j [also referred to as the length of the arc (i, j)
that joins i and j]. Node t is a special node, which we call the destination.
By a path we mean a sequence of arcs such that the end node of each arc
in the sequence is the start node of the next arc. The length of a path from
a given node to another node is the sum of the lengths of the arcs on the
path. We want to find a shortest (i.e., minimum length) path from each
node i to node t.
We make an assumption relating to cycles, i.e., paths of the form
(i, j1 ), (j1 , j2 ), . . . , (jk , i) that start and end at the same node. In particular,
we exclude the possibility that a cycle has negative total length. Otherwise,
it would be possible to decrease the length of some paths to arbitrarily small
values simply by adding more and more negative-length cycles. We thus
assume that all cycles have nonnegative length. With this assumption, it is
clear that an optimal path need not take more than N moves, so we may
limit the number of moves to N . We formulate the problem as one where
we require exactly N moves but allow degenerate moves from a node i to
itself with cost aii = 0. We also assume that for every node i there exists
at least one path from i to t.
We can formulate this problem as a deterministic DP problem with N
stages, where the states at any stage 0, . . . , N − 1 are the nodes {1, . . . , N },
the destination t is the unique state at stage N , and the controls correspond
to the arcs (i, j), including the self arcs (i, i). Thus at each state i we select
a control (i, j) and move to state j at cost aij .
We can write the DP algorithm for our problem, with the optimal cost-to-go functions J_k(i) having the meaning
J_k(i): optimal cost of getting from i to t in N − k moves,
so that J_0(i) is the shortest distance from i to t. The algorithm is
\[
J_k(i) = \min_{\text{all arcs } (i,j)} \big[ a_{ij} + J_{k+1}(j) \big], \qquad k = 0, 1, \ldots, N-2,
\]
with
\[
J_{N-1}(i) = a_{it}, \qquad i = 1, 2, \ldots, N.
\]
This algorithm is also known as the Bellman-Ford algorithm for shortest
paths.
The optimal policy when at node i after k moves is to move to a node
j ∗ that minimizes aij + Jk+1 (j) over all j such that (i, j) is an arc. If the
optimal path obtained from the algorithm contains degenerate moves from a node to itself, this simply means that the path in reality involves fewer than N moves.
Note that if for some k > 0, we have Jk (i) = Jk+1 (i) for all i,
then subsequent DP iterations will not change the values of the cost-to-go
[Jk−m (i) = Jk (i) for all m > 0 and i], so the algorithm can be terminated
with Jk (i) being the shortest distance from i to t, for all i.
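A minimal sketch of this iteration is given below; `arcs[i]` lists the outgoing arcs (j, a_ij) among the nodes 1, ..., N, `a_to_t[i]` gives the arc length a_it to the destination, and the early-termination test J_k = J_{k+1} is included. The data layout is an assumption for illustration.

```python
def shortest_paths_dp(nodes, arcs, a_to_t, N):
    # J_{N-1}(i) = a_it; nodes with no arc to the destination start at infinity
    J = {i: a_to_t.get(i, float("inf")) for i in nodes}
    for _ in range(N - 1):                      # stages k = N-2, ..., 0
        J_new = {}
        for i in nodes:
            best = J[i]                         # degenerate move i -> i at cost a_ii = 0
            for j, a_ij in arcs.get(i, []):
                best = min(best, a_ij + J[j])
            J_new[i] = best
        if J_new == J:                          # J_k = J_{k+1}: the values have converged
            break
        J = J_new
    return J                                    # shortest distances to the destination t
```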
To demonstrate the algorithm, consider the problem shown in Fig.
1.3.1(a), where the costs a_ij with i ≠ j are shown along the connecting line
segments (we assume that aij = aji ). Figure 1.3.1(b) shows the optimal
cost-to-go Jk (i) at each i and k together with the optimal paths.
Figure 1.3.1 (a) Shortest path problem data. The destination is node 5. Arc
lengths are equal in both directions and are shown along the line segments con-
necting nodes. (b) Costs-to-go generated by the DP algorithm. The number along
stage k and state i is Jk (i). Arrows indicate the optimal moves at each stage and
node. The optimal paths are
1 → 5, 2 → 3 → 4 → 5, 3 → 4 → 5, 4 → 5.
[Figure: DP formulation of the preceding example. Starting from the initial state x_0, the states at each stage are the sequences of distinct elements of {A, B, C, D} selected so far (A, AB, ABC, ABCD, and the other permutations); the complete sequences are connected to a terminal state t, and costs are shown next to the arcs.]
Let us now extend the ideas of the preceding example to the general
discrete optimization problem:
minimize G(u)
subject to u ∈ U,
[Figure 1.3.3: Formulation of the discrete optimization problem as a DP problem with N stages. The state at the end of stage k is the k-solution (u_1, \ldots, u_k); an artificial start state precedes stage 1, and the N-solutions u = (u_1, \ldots, u_N) are connected to an artificial end state with the terminal cost G(u).]
At state (u1 , . . . , uk ) we must choose uk+1 from the set Uk+1 (u1 , . . . , uk ).
These are the choices of uk+1 that are consistent with the preceding choices
u1 , . . . , uk , and are also consistent with feasibility. The terminal states
correspond to the N -solutions u = (u1 , . . . , uN ), and the only nonzero cost
is the terminal cost G(u). This terminal cost is incurred upon transition
from u to an artificial end state; see Fig. 1.3.3.
Let J*_k(u_1, \ldots, u_k) denote the optimal cost starting from the k-solution (u_1, \ldots, u_k), i.e., the optimal cost of the problem over solutions whose first k components are constrained to be equal to u_1, \ldots, u_k. The DP algorithm takes the form
\[
J^*_k(u_1, \ldots, u_k) = \min_{u_{k+1} \in U_{k+1}(u_1, \ldots, u_k)} J^*_{k+1}(u_1, \ldots, u_k, u_{k+1}), \tag{1.14}
\]
with the terminal condition
\[
J^*_N(u_1, \ldots, u_N) = G(u_1, \ldots, u_N).
\]
The algorithm (1.14) executes backwards in time: starting with the known function J*_N = G, we compute J*_{N−1}, then J*_{N−2}, and so on up to computing J*_1. An optimal solution (u*_1, \ldots, u*_N) is then constructed by going forward through the algorithm
\[
u^*_{k+1} \in \arg\min_{u_{k+1} \in U_{k+1}(u^*_1, \ldots, u^*_k)} J^*_{k+1}(u^*_1, \ldots, u^*_k, u_{k+1}), \qquad k = 0, \ldots, N-1.
\]
If, in this forward construction, the optimal cost-to-go J*_{k+1}(u*_1, \ldots, u*_k, u_{k+1}) is replaced by the cost generated by a heuristic method that solves the problem suboptimally with the values of the first k + 1 decision components fixed at u*_1, \ldots, u*_k, u_{k+1}, we obtain a rollout algorithm, which is a very simple and effective approach for approximate combinatorial optimization. It will
be discussed later in this book, in Chapter 2 for finite horizon stochastic
problems, and in Chapter 4 for infinite horizon problems, where it will be
related to the method of policy iteration.
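As a preview, a rollout scheme for the discrete optimization problem above can be sketched as follows. The functions `choices` (feasible next components) and `complete_with_heuristic` (the suboptimal base heuristic) are assumptions for illustration.

```python
def rollout(N, choices, complete_with_heuristic, G):
    """Greedy forward construction: fix one component at a time by evaluating, for each
    candidate, the cost of the solution produced by the base heuristic."""
    partial = ()
    for _ in range(N):
        best_u, best_cost = None, float("inf")
        for u in choices(partial):
            full = complete_with_heuristic(partial + (u,))   # heuristic completion
            if G(full) < best_cost:
                best_u, best_cost = u, G(full)
        partial = partial + (best_u,)
    return partial, G(partial)
```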
Thus the control process terminates upon reaching t, even if this happens
before the end of the horizon. One may reach t by choice if a special
stopping decision is available, or by means of a transition from another
state.
row is 8^8 = 16,777,216). It can be shown that there exist solutions to this
problem for all N ≥ 4.
There are also several variants of the N queens problem. For example
finding the minimal number of queens that can be placed on an N × N board
so that they either occupy or attack every square; this is known as the queen
domination problem. The minimal number can be found in principle by DP,
and it is known for some N (for example the minimal number is 5 for N = 8),
but not for all N (see e.g., the paper by Fernau [Fe10]).
1.3.4 Forecasts
yk+1 = ξk ,
A driver is looking for inexpensive parking on the way to his destination. The
parking area contains N spaces, and a garage at the end. The driver starts
at space 0 and traverses the parking spaces sequentially, i.e., from space k he
goes next to space k + 1, etc. Each parking space k costs c(k) and is free with
probability p(k) independently of whether other parking spaces are free or
not. If the driver reaches the garage without having parked, he must park at
the garage, which costs C. The driver can observe whether a parking space
is free only when he reaches it, and then, if it is free, he makes a decision to
park in that space or not to park and check the next space. The problem is
to find the minimum expected cost parking policy.
We formulate the problem as a DP problem with N = n + 1 stages,
corresponding to the parking spaces (including the garage), and an artifi-
cial terminal state t that corresponds to having parked; see Fig. 1.3.5. At
each stage k = 0, . . . , N − 1, in addition to t, we have two states (k, F) and (k, F̄), corresponding to space k being free or taken, respectively. The decision/control is to park or continue at state (k, F) [there is no choice at states (k, F̄) and the garage].
Let us now derive the form of the DP algorithm, denoting
J*_k(F): the optimal cost-to-go upon arrival at a space k that is free,
J*_k(F̄): the optimal cost-to-go upon arrival at a space k that is taken,
J*_N = C: the cost-to-go upon arrival at the garage,
J*_k(t) = 0: the cost-to-go of the terminal “parked” state t.
Figure 1.3.5 Cost structure of the parking problem. The driver may park at
space k = 0, 1, . . . , N − 1 at cost c(k), if the space is free, or continue to the
next space k + 1 at no cost. At space N (the garage) the driver must park at
cost C.
which can be viewed as the optimal expected cost-to-go upon arriving at space
k but before verifying its free or taken status.
Indeed, from the preceding DP algorithm, we have
Figure 1.3.6 Optimal cost-to-go and optimal policy for the parking problem with
the data in Eq. (1.17). The optimal policy is to travel from space 0 to space 165
and then to park at the first available space.
The optimal policy is to travel to space 165 and then to park at the first
available space. The reader may verify that this type of policy, characterized
by a single threshold distance, is optimal assuming that c(k) is monotonically
decreasing with k.
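The DP recursion implied by this formulation can be sketched as follows, under the cost structure of Fig. 1.3.5 (park at a free space k at cost c(k), move on at no cost, park at the garage at cost C). The data at the bottom is illustrative only; it is not the data of Eq. (1.17).

```python
def parking_threshold(c, p, C, N):
    """Backward DP for the parking problem. Returns the first space at which parking
    (if free) is optimal, together with the expected cost-to-go at space 0."""
    J_hat = C                       # expected cost-to-go upon arrival at the garage
    threshold = N
    for k in range(N - 1, -1, -1):
        if c(k) <= J_hat:           # at a free space k, park iff c(k) <= J_hat at k+1
            threshold = k
        park_if_free = min(c(k), J_hat)
        # expected cost upon arrival at space k, before observing whether it is free
        J_hat = p(k) * park_if_free + (1 - p(k)) * J_hat
    return threshold, J_hat

# illustrative data: 200 spaces, c(k) = 200 - k, each free with probability 0.05, garage cost 100
print(parking_threshold(c=lambda k: 200 - k, p=lambda k: 0.05, C=100.0, N=200))
```

Because c(k) is monotonically decreasing in this illustrative data, the computed policy is of the single-threshold type discussed above.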
xk+1 = fk (xk , yk , uk , wk ),
Indeed, we have
\[
\hat J_k(x_k) = E_{y_k}\big\{ J_k(x_k, y_k) \mid x_k \big\}
\]
\[
= E_{y_k}\Big\{ \min_{u_k \in U_k(x_k, y_k)} E_{w_k, x_{k+1}, y_{k+1}}\big\{ g_k(x_k, y_k, u_k, w_k) + J_{k+1}(x_{k+1}, y_{k+1}) \mid x_k, y_k, u_k \big\} \Bigm| x_k \Big\}
\]
\[
= E_{y_k}\Big\{ \min_{u_k \in U_k(x_k, y_k)} E_{w_k, x_{k+1}}\big\{ g_k(x_k, y_k, u_k, w_k) + E_{y_{k+1}}\{ J_{k+1}(x_{k+1}, y_{k+1}) \mid x_{k+1} \} \mid x_k, y_k, u_k \big\} \Bigm| x_k \Big\},
\]
and finally
\[
\hat J_k(x_k) = E_{y_k}\Big\{ \min_{u_k \in U_k(x_k, y_k)} E_{w_k}\big\{ g_k(x_k, y_k, u_k, w_k) + \hat J_{k+1}\big(f_k(x_k, y_k, u_k, w_k)\big) \big\} \Bigm| x_k \Big\}. \tag{1.18}
\]
and
\[
\hat J_N(x_N) = g_N(x_N),
\]
we have, using Eq. (1.16),
\[
\hat J_k(x_k) = \sum_{i=1}^{m} p_i \min_{u_k \in U_k(x_k)} E_{w_k}\Big\{ g_k(x_k, u_k, w_k) + \hat J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \Bigm| y_k = i \Big\},
\]
where
g(x, y, u) is the number of points scored (rows removed),
f (x, y, u) is the board position (or termination state),
when the state is (x, y) and control u is applied, respectively. Note, however,
that despite the simplification in the DP algorithm achieved by eliminating
the uncontrollable portion of the state, the number of states x is enormous,
and the problem can only be addressed by suboptimal methods, which will
be discussed later in this book.
We have assumed so far that the controller has access to the exact value of
the current state xk , so a policy consists of a sequence of functions µk (xk ),
k = 0, . . . , N − 1. However, in many practical settings this assumption is
unrealistic, because some components of the state may be inaccessible for
measurement, the sensors used for measuring them may be inaccurate, or
the cost of obtaining accurate measurements may be prohibitive.
Often in such situations the controller has access to only some of
the components of the current state, and the corresponding measurements
may also be corrupted by stochastic uncertainty. For example in three-
dimensional motion problems, the state may consist of the six-tuple of po-
sition and velocity components, but the measurements may consist of noise-
corrupted radar measurements of the three position components. This gives
rise to problems of partial or imperfect state information, which have received a lot of attention in the optimization and artificial intelligence literature (see e.g., [Ber17], Ch. 4). Even though there are DP algorithms
for partial information problems, these algorithms are far more computa-
tionally intensive than in the perfect information case. For this reason,
in the absence of an analytical solution, partial information problems are
typically solved suboptimally in practice.
On the other hand it turns out that conceptually, partial state infor-
mation problems are no different than the perfect state information prob-
lems we have been addressing so far. In fact by various reformulations, we
[Figure 1.3.8: Control with a belief state. The observations produced by the system x_{k+1} = f_k(x_k, u_k, w_k) are used to compute the belief state p_k, and the controller applies the control u_k = µ_k(p_k).]
can reduce a partial state information problem to one with perfect state
information (see [Ber17], Ch. 4). The most common approach is to replace
the state xk with a belief state, which is the probability distribution of xk
given all the observations that have been obtained by the controller up to
time k (see Fig. 1.3.8). This probability distribution can in principle be
computed, and it can serve as “state” in an appropriate DP algorithm. We
illustrate this process with a simple example.
This is the belief state at time k, and it evolves according to the equation
\[
p_{k+1} =
\begin{cases}
p_k & \text{if the site is not searched at time } k,\\[2pt]
0 & \text{if the site is searched and a treasure is found},\\[2pt]
\dfrac{p_k(1-\beta)}{p_k(1-\beta) + 1 - p_k} & \text{if the site is searched but no treasure is found}.
\end{cases} \tag{1.19}
\]
The third relation above follows by application of Bayes’ rule (pk+1 is equal to
the kth period probability of a treasure being present and the search being un-
successful, divided by the probability of an unsuccessful search). The second
relation holds because the treasure is removed after a successful search.
\[
\hat J_k(p_k) = \max\bigg[ 0,\; -C + p_k \beta V + (1 - p_k \beta)\, \hat J_{k+1}\Big( \frac{p_k(1-\beta)}{p_k(1-\beta) + 1 - p_k} \Big) \bigg]. \tag{1.20}
\]
From this recursion it can be verified by induction that
\[
\hat J_k(p_k) = 0 \qquad \text{if } p_k \le \frac{C}{\beta V},
\]
and that it is optimal to search at time k if and only if
\[
\frac{C}{\beta V} \le p_k.
\]
Thus, it is optimal to search if and only if the expected reward from the next search, p_k β V, is greater than or equal to the cost C of the search; this is a myopic policy that focuses on just the next stage.
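The recursion (1.20), together with the belief update (1.19), can be checked numerically on a grid of belief values. A sketch follows, assuming the terminal condition that no reward is collected after the horizon and using illustrative values of C, V, β, and the horizon.

```python
def treasure_search_values(N, C, V, beta, grid=1001):
    """Tabulate J_k(p) of Eq. (1.20) on a grid of beliefs p, backwards from J_N = 0."""
    ps = [i / (grid - 1) for i in range(grid)]
    J_next = [0.0] * grid                                        # terminal condition (assumed)
    for _ in range(N):
        J = []
        for p in ps:
            p_next = p * (1 - beta) / (p * (1 - beta) + 1 - p)   # Eq. (1.19), unsuccessful search
            j_next = J_next[round(p_next * (grid - 1))]
            search = -C + p * beta * V + (1 - p * beta) * j_next
            J.append(max(0.0, search))                           # either do not search, or search
        J_next = J
    return ps, J_next

ps, J0 = treasure_search_values(N=10, C=1.0, V=4.0, beta=0.5)
# searching first becomes worthwhile just above the threshold C / (beta * V) = 0.5
print(next(p for p, j in zip(ps, J0) if j > 0))
```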
way to his destination, along a line of L parking spaces with a garage at the
end. The difference is that the driver can move in either direction, rather
than just forward towards the garage. In particular, at space i, the driver can
park at cost c(i) if i is free, can move to i − 1 at cost βi− or can move to i + 1
at cost βi+ . Moreover, the driver records the free/taken status of the spaces
previously visited and may return to any of these spaces.
Let us assume that the probability p(i) of a space i being free changes
over time, i.e., a space found free (or taken) at a given visit may get taken
(or become free, respectively) by the time of the next visit. The initial prob-
abilities p(i), before visiting any spaces, are known, and the mechanism by
which these probabilities change over time is also known to the driver. As an
example, we may assume that each time period, p(i) increases by a certain
known factor with some probability ξ and decreases by another known factor
with the complementary probability 1 − ξ.
Here the belief state is the vector of current probabilities
\[
\big(p(1), \ldots, p(L)\big),
\]
and it is updated at each time based on the new observation: the free/taken
status of the space visited at that time. Thus the belief state belongs to
the unit simplex of L-dimensional vectors and can be perfectly computed
by the driver, given the parking status observations of the spaces visited
thus far. While it is possible to state an exact DP algorithm that is defined
over the simplex of belief states, and we will do so later, the algorithm is
impossible to execute in practice.† Thus the problem can only be solved with
approximations.
This minimization will be done by setting to zero the derivative with respect
to u1 . This yields
\[
0 = 2u_1 + 2ra\big( (1-a)x_1 + a u_1 - T \big),
\]
and by collecting terms and solving for u1 , we obtain the optimal temper-
ature for the last oven as a function of x1 :
\[
\mu_1^*(x_1) = \frac{ra\big(T - (1-a)x_1\big)}{1 + ra^2}. \tag{1.22}
\]
This yields, after some calculation, the optimal temperature of the first
oven:
\[
\mu_0^*(x_0) = \frac{r(1-a)a\big(T - (1-a)^2 x_0\big)}{1 + ra^2\big(1 + (1-a)^2\big)}. \tag{1.23}
\]
The optimal cost is obtained by substituting this expression in the formula
for J0 . This leads to a straightforward but lengthy calculation, which in
the end yields the rather simple formula
\[
J_0(x_0) = \frac{r\big( (1-a)^2 x_0 - T \big)^2}{1 + ra^2\big(1 + (1-a)^2\big)}.
\]
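The closed-form expressions (1.22) and (1.23) can be checked numerically against a brute-force minimization of the two-stage cost; the values of a, r, T, and x_0 below are arbitrary illustrative choices.

```python
import itertools

a, r, T, x0 = 0.7, 2.0, 100.0, 20.0

def total_cost(u0, u1):
    x1 = (1 - a) * x0 + a * u0
    x2 = (1 - a) * x1 + a * u1
    return u0 ** 2 + u1 ** 2 + r * (x2 - T) ** 2

# closed-form optimal controls from Eqs. (1.23) and (1.22)
u0_star = r * (1 - a) * a * (T - (1 - a) ** 2 * x0) / (1 + r * a ** 2 * (1 + (1 - a) ** 2))
x1_star = (1 - a) * x0 + a * u0_star
u1_star = r * a * (T - (1 - a) * x1_star) / (1 + r * a ** 2)

# coarse brute-force search over a grid of control pairs
grid = [0.25 * i for i in range(801)]        # u in [0, 200]
u0_b, u1_b = min(itertools.product(grid, grid), key=lambda uu: total_cost(*uu))

print(u0_star, u1_star, total_cost(u0_star, u1_star))
print(u0_b, u1_b, total_cost(u0_b, u1_b))    # should nearly match the line above
```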
and finite variance. Then the equation for J_1 [cf. Eq. (1.4)] becomes
\[
J_1(x_1) = \min_{u_1} E_{w_1}\Big\{ u_1^2 + r\big( (1-a)x_1 + a u_1 + w_1 - T \big)^2 \Big\}
\]
\[
= \min_{u_1} \Big[ u_1^2 + r\big( (1-a)x_1 + a u_1 - T \big)^2 + 2 r E\{w_1\} \big( (1-a)x_1 + a u_1 - T \big) + r E\{w_1^2\} \Big].
\]
Comparing this equation with Eq. (1.21), we see that the presence of w1 has
resulted in an additional inconsequential constant term, rE{w12 }. There-
fore, the optimal policy for the last stage remains unaffected by the presence
of w1 , while J1 (x1 ) is increased by rE{w12 }. It can be seen that a similar
situation also holds for the first stage. In particular, the optimal cost is
given by the same expression as before except for an additive constant that
depends on E{w02 } and E{w12 }.
Generally, if the optimal policy is unaffected when the disturbances
are replaced by their means, we say that certainty equivalence holds. This
occurs in several types of problems involving a linear system and a quadratic
cost; see [Ber17], Sections 3.1 and 4.2. For other problems, certainty equiv-
alence can be used as a basis for problem approximation, e.g., assume
that certainty equivalence holds (i.e., replace stochastic quantities by some
typical values, such as their expected values) and apply exact DP to the
resulting deterministic optimal control problem (see Section 2.3.2).
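A sketch of this certainty equivalence approximation, with illustrative names, is given below: the disturbances are replaced by their means, the resulting deterministic problem is solved by exact DP, and the resulting cost-to-go functions are used for one-step lookahead on the stochastic system.

```python
def certainty_equivalent_control(k, x, controls, f, g, J_det, w_mean):
    """One-step lookahead using cost-to-go functions J_det computed by exact DP for the
    deterministic problem in which each disturbance w_k is fixed at its mean w_mean[k]."""
    return min(controls(k, x),
               key=lambda u: g(k, x, u, w_mean[k]) + J_det[k + 1][f(k, x, u, w_mean[k])])
```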
Our discussion of exact DP in this chapter has been brief since our fo-
cus in this book will be on approximate DP and RL. The author’s DP
textbooks [Ber12], [Ber17] provide an extensive discussion of exact DP
and its applications. The mathematical aspects of exact DP are discussed
in the monograph by Bertsekas and Shreve [BeS78], particularly the fine
probabilistic/measure-theoretic issues associated with stochastic optimal
control. The author’s abstract DP monograph [Ber18a] aims at a unified
development of the core theory and algorithms of total cost sequential de-
cision problems, based on the strong connections of the subject with fixed
point theory.
The approximate DP literature has expanded tremendously since the
connections between DP and RL became apparent in the late 80s and
early 90s. We will restrict ourselves to mentioning textbooks, research
monographs, and broad surveys, which supplement our discussions and
collectively provide a guide to the literature. Thus the author wishes to
apologize in advance for the many omissions of references from the research
literature.
Two books were written on our subject in the 1990s, setting the tone for subsequent developments in the field: one in 1996 by Bertsekas and Tsitsiklis [BeT96], which reflects a decision, control, and optimization viewpoint, and another in 1998 by Sutton and Barto, which reflects an artificial intelligence viewpoint (a 2nd edition, [SuB18], was published in 2018). We refer to the former book and also to the author's DP textbooks [Ber12], [Ber17] for a broader discussion of some of the topics of the present book.
More recent books are the 2003 book by Gosavi (a much expanded
2nd edition [Gos15] appeared in 2015), which emphasizes simulation-based
optimization and RL algorithms, Cao [Cao07], which emphasizes a sensi-
tivity approach to simulation-based methods, Chang, Fu, Hu, and Marcus
[CFH07], which emphasizes finite-horizon/limited lookahead schemes and
adaptive sampling, Busoniu et al. [BBD10], which focuses on function ap-
proximation methods for continuous space systems and includes a discus-
sion of random search methods, Powell [Pow11], which emphasizes resource
allocation and operations research applications, and Vrabie, Vamvoudakis,
and Lewis [VVL13], which discusses neural network-based methods and