Dynamic Programming and Optimal Control
Adi Ben-Israel
RUTCOR–Rutgers Center for Operations Research, Rutgers University, 640 Bartholomew Rd., Piscataway, NJ 08854-8003, USA
E-mail address: [email protected]
LECTURE 1
Dynamic Programming
1.1. Recursion
A recursion is a rule for computing a value using previously computed values; for example, the rule
\[ f_{k+1} = f_k + f_{k-1} \tag{1.1} \]
computes the Fibonacci sequence, given the initial values \(f_0 = f_1 = 1\).
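As a quick illustration, (1.1) can be computed by memoizing previously computed values; a minimal Python sketch, assuming the initial values \(f_0 = f_1 = 1\) above:

```python
from functools import lru_cache

@lru_cache(maxsize=None)          # memoize: each f_k is computed only once
def fib(k):
    """Fibonacci numbers via the recursion (1.1), with f_0 = f_1 = 1."""
    return 1 if k < 2 else fib(k - 1) + fib(k - 2)

print([fib(k) for k in range(10)])   # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```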
Example 1.1. A projectile of mass 1 is shot straight up against earth's gravity g. The initial velocity of the projectile is \(v_0\). What maximal altitude will it reach?
Solution. Let y(v) be the maximal altitude reachable with initial velocity v. After time \(\Delta t\), the projectile has advanced approximately \(v\,\Delta t\), and its velocity has decreased to approximately \(v - g\,\Delta t\). Therefore the recursion
\[ y(v) \approx v\,\Delta t + y(v - g\,\Delta t) \tag{1.2} \]
gives
\[ \frac{y(v) - y(v - g\,\Delta t)}{\Delta t} \approx v; \]
letting \(\Delta t \to 0\), the left side tends to \(g\,y'(v)\), so
\[ \therefore\quad y'(v) = \frac{v}{g}, \qquad \therefore\quad y(v) = \frac{v^2}{2g} \tag{1.3} \]
(using \(y(0) = 0\)), and the maximal altitude reached by the projectile is \(v_0^2/2g\).
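A numeric check of (1.2)–(1.3), iterating the recursion with a small time step (the step size and g = 9.81 are illustrative choices):

```python
G = 9.81   # gravity, illustrative value

def max_altitude(v0, dt=1e-4):
    """Iterate the recursion (1.2): accumulate v*dt while the velocity is positive."""
    y, v = 0.0, v0
    while v > 0:
        y += v * dt
        v -= G * dt
    return y

print(max_altitude(10.0), 10.0**2 / (2 * G))   # both are approximately 5.097
```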
Example 1.2. Consider the partitioned matrix
\[ A = \begin{pmatrix} B & c \\ r & \alpha \end{pmatrix} \tag{1.4} \]
where B is nonsingular, c is a column, r is a row, and α is a scalar. Then A is nonsingular iff
\[ \alpha - rB^{-1}c \neq 0, \tag{1.5} \]
(verify!) in which case
\[ A^{-1} = \begin{pmatrix} B^{-1} + \beta B^{-1}crB^{-1} & -\beta B^{-1}c \\ -\beta rB^{-1} & \beta \end{pmatrix}, \tag{1.6a} \]
where
\[ \beta = \frac{1}{\alpha - rB^{-1}c}. \tag{1.6b} \]
Can this result be used in a recursive computation of \(A^{-1}\)? Try, for example,
\[ A = \begin{pmatrix} 1 & 2 & 3 \\ 1 & 2 & 4 \\ 1 & 3 & 4 \end{pmatrix}, \quad\text{with inverse}\quad A^{-1} = \begin{pmatrix} 4 & -1 & -2 \\ 0 & -1 & 1 \\ -1 & 1 & 0 \end{pmatrix}. \]
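One way to use (1.4)–(1.6) recursively is "bordering": invert the leading 1×1 block, then add one row and column at a time. A minimal sketch, assuming every leading principal submatrix is nonsingular — note that the matrix above violates this (its leading 2×2 block is singular), which is the catch the "try" invites:

```python
import numpy as np

def bordering_inverse(A):
    """Invert A by bordering: start from the leading 1x1 block and apply
    (1.6a)-(1.6b) to add one row and one column at a time.
    Assumes every leading principal submatrix is nonsingular."""
    n = A.shape[0]
    Binv = np.array([[1.0 / A[0, 0]]])              # inverse of the 1x1 block
    for k in range(1, n):
        c = A[:k, k:k+1]                            # new column
        r = A[k:k+1, :k]                            # new row
        beta = 1.0 / (A[k, k] - r @ Binv @ c)       # (1.6b)
        tl = Binv + beta * (Binv @ c) @ (r @ Binv)  # top-left block of (1.6a)
        Binv = np.block([[tl, -beta * (Binv @ c)],
                         [-beta * (r @ Binv), beta]])
    return Binv

# The example matrix above has the singular leading 2x2 block [[1,2],[1,2]],
# so the recursion divides by zero there; after swapping its last two rows
# all leading blocks are nonsingular and the method succeeds.
A = np.array([[1., 2., 3.], [1., 3., 4.], [1., 2., 4.]])
print(np.allclose(bordering_inverse(A), np.linalg.inv(A)))   # True
```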
Example 1.3. Consider the LP
\[ \max\ c^T x \quad\text{s.t.}\quad Ax = b,\ x \ge 0, \]
and denote its optimal value by V(A, b). Let the matrix A be partitioned as \(A = (A_1, a_n)\), where \(a_n\) is the last column, and similarly partition the vector c as \(c^T = (c_1^T, c_n)\). Is the recursion
\[ V(A, b) = \max_{x_n \ge 0}\ \{ c_n x_n + V(A_1, b - a_n x_n) \} \tag{1.7} \]
valid? Is it useful for solving the problem?
An optimal policy has the property that whatever the initial state and the
initial decisions are, the remaining decisions must constitute an optimal
policy with regard to the state resulting from the first decision, [5, p. 15].
The principle of optimality (PO) can be used to recursively compute the optimal value (OV) functions
\[ V_k(x) = \max_{u \in U_k(x)}\ \{ r_k(x, u) + V_{k+1}(T_k(x, u)) \}, \quad k \in 1, N, \tag{1.11} \]
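In pseudocode form, (1.11) is a backward sweep over k. The following Python sketch assumes finite state and control sets; the reward r, transition T, control sets U and the terminal values are hypothetical placeholders for the data of a concrete problem.

```python
def solve_dp(N, states, U, r, T, V_terminal):
    """Backward recursion (1.11): V[k] from V[k+1], storing a maximizing control."""
    V = {x: V_terminal(x) for x in states}            # V_{N+1}
    policy = {}
    for k in range(N, 0, -1):                         # k = N, N-1, ..., 1
        V_new = {}
        for x in states:
            u_best = max(U(k, x), key=lambda u: r(k, x, u) + V[T(k, x, u)])
            V_new[x] = r(k, x, u_best) + V[T(k, x, u_best)]
            policy[k, x] = u_best
        V = V_new
    return V, policy                                  # V_1 and an optimal policy
```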
Example 1.5. ([29, pp. 294–295]) Consider the graph in Figure 1.1. It is required to find the shortest path from node 1 to node 7, where the length of a path is defined as the length of its longest arc. For example,
\[ \text{length}\{1, 2, 5, 7\} = 4. \]
If the nodes are viewed as states, then the path {1, 2, 4, 6, 7} is optimal w.r.t. s = 1. However, the path {2, 4, 6, 7} is not optimal w.r.t. the node s = 2, as its length is greater than the length of {2, 5, 7}.
[Figure 1.1: a directed graph on the nodes 1, ..., 7, with the length of each arc marked.]
Exercises.
Exercise 1.1. Use DP to maximize the entropy
\[ \max\Big\{ -\sum_{i=1}^N p_i \log p_i \ :\ \sum_{i=1}^N p_i = 1 \Big\}. \]
Exercise 1.2. Let \(\mathbb{Z}_+\) denote the nonnegative integers. Use DP to write a recursion for the knapsack problem
\[ \max\Big\{ \sum_{i=1}^N f_i(x_i) \ :\ \sum_{i=1}^N w_i(x_i) \le W,\ x_i \in \mathbb{Z}_+ \Big\}, \]
where \(f_i(\cdot), w_i(\cdot)\) are given functions \(\mathbb{Z}_+ \to \mathbb{R}\) and W > 0 is given. State any additional properties of \(f_i(\cdot)\) and \(w_i(\cdot)\) that are needed in your analysis.
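One natural recursion, sketched under the extra assumptions that W and the weights are integers and each \(w_i\) is strictly increasing, takes the remaining capacity as the state:

```python
from functools import lru_cache

def knapsack(f, w, W):
    """V(i, cap) = best value from items i..N-1 with capacity cap left:
    V(i, cap) = max over x >= 0 with w_i(x) <= cap of f_i(x) + V(i+1, cap - w_i(x))."""
    N = len(f)

    @lru_cache(maxsize=None)
    def V(i, cap):
        if i == N:
            return 0
        best, x = 0, 0
        while w[i](x) <= cap:                 # terminates: w_i strictly increasing
            best = max(best, f[i](x) + V(i + 1, cap - w[i](x)))
            x += 1
        return best

    return V(0, W)

# Hypothetical data: linear values and weights.
f = [lambda x: 3 * x, lambda x: 5 * x]
w = [lambda x: 2 * x, lambda x: 4 * x]
print(knapsack(f, w, 10))   # 15: five units of the first item
```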
Exercise 1.3. ([7]) A cash amount of x cents can be represented by
x1 coins of 50 cents ,
x2 coins of 25 cents ,
x3 coins of 10 cents ,
x4 coins of 5 cents , and
x5 coins of 1 cent .
The representation is:
x = 50x1 + 25x2 + 10x3 + 5x4 + x5 .
(a) Use DP to find the representation with the minimal number of coins.
(b) Show that your solution agrees with the "greedy" solution:
\[ x_1 = \Big\lfloor \frac{x}{50} \Big\rfloor, \quad x_2 = \Big\lfloor \frac{x - 50x_1}{25} \Big\rfloor, \quad\text{etc.}, \]
where \(\lfloor\alpha\rfloor\) is the greatest integer \(\le \alpha\).
(c) Suppose a new coin of 20 cents is introduced. Will the DP solution still agree with the
greedy solution?
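A sketch for experimenting with (a)–(c): the DP recursion V(x) = 1 + min over denominations d ≤ x of V(x − d), compared with the greedy rule; the second run (with a 20-cent coin) suggests the answer to (c).

```python
from functools import lru_cache

def min_coins(x, denoms):
    """Fewest coins summing to x cents: V(x) = 1 + min_{d <= x} V(x - d)."""
    @lru_cache(maxsize=None)
    def V(r):
        return 0 if r == 0 else 1 + min(V(r - d) for d in denoms if d <= r)
    return V(x)

def greedy(x, denoms):
    """Repeatedly take the largest coin that fits."""
    count = 0
    for d in sorted(denoms, reverse=True):
        count += x // d
        x %= d
    return count

print(min_coins(40, (50, 25, 10, 5, 1)), greedy(40, (50, 25, 10, 5, 1)))          # 3 3
print(min_coins(40, (50, 25, 20, 10, 5, 1)), greedy(40, (50, 25, 20, 10, 5, 1)))  # 2 3
```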
Exercise 1.4. The energy required to compress a gas from pressure p1 to pressure pN +1
in N stages is proportional to
\[ \Big(\frac{p_2}{p_1}\Big)^{\alpha} + \Big(\frac{p_3}{p_2}\Big)^{\alpha} + \cdots + \Big(\frac{p_{N+1}}{p_N}\Big)^{\alpha} \]
with α a positive constant. Show how to choose the intermediate pressures p2 , · · · , pN so as
to minimize the energy requirement.
Exercise 1.5. Consider the following variation of the game NIM, defined in terms of N
piles of matches containing x1 , x2 , · · · , xN matches. The rules are:
(i) two players make moves in alternating turns,
(ii) if matches remain, a move consists of removing any number of matches all from the same
pile,
where
\[ V(x_1, x_2, \cdots, x_N) = 0 \quad\text{if all } x_i = 0 \text{ except, say, } x_j = 1. \]
Exercise 1.6. We have a number of coins, all of the same weight except for one which
is of different weight, and a balance.
(a) Determine the weighing procedures which minimize the maximum time required to locate
the distinctive coin in the following cases:
• the coin is known to be heavier,
• it is not known whether the coin is heavier or lighter.
(b) Determine the weighing procedures which minimize the expected time required to locate
the coin.
(c) Consider the more general problem where there are two or more distinctive coins, under
various assumptions concerning the distinctive coins.
Exercise 1.7. A rocket consists of k stages carrying fuel and a nose cone carrying the payload. After the fuel carried in stage k is consumed, this stage drops off, leaving a (k−1)-stage rocket. Let
\(W_0\) = weight of nose cone,
\(w_k\) = initial gross weight of stage k,
\(W_k = W_{k-1} + w_k\), the initial gross weight of sub-rocket k,
\(p_k\) = initial propellant weight of stage k,
\(v_k\) = change in rocket velocity during burning of stage k.
Assume that the change in velocity \(v_k\) is a known function of \(W_k\) and \(p_k\), so that
\[ v_k = v(W_k, p_k), \]
from which
\[ p_k = p(W_k, v_k). \]
Since \(W_k = W_{k-1} + w_k\), and the weight of the kth stage is a known function, \(g(p_k)\), of the propellant carried in the stage, we have
\[ w_k = g(p(W_{k-1} + w_k, v_k)), \]
whence, solving for \(w_k\), we have
\[ w_k = w(W_{k-1}, v_k). \]
(a) Use DP to design a k-stage rocket of minimum weight which will attain a final velocity
v.
(b) Describe an algorithm for finding the optimal number of stages k ∗ .
(c) Discuss the factors resulting in an increase of \(k^*\), i.e., in more stages of smaller size.
Exercise 1.8. Suppose that we are given the information that a ball is in one of N
boxes, and the a priori probability, pk , that it is in the kth box.
(a) Show that the procedure which minimizes the expected time required to find the ball
consists of looking in the most likely box first.
(b) Consider the more general problem where the time consumed in examining the kth box
is tk , and where there is a probability qk that the examination of the kth box will yield
no information about its contents. When this happens, we continue the search with the
information already available. Let F (p1 , p2 , · · · , pN ) be the expected time required to find
the ball under an optimal policy. Find the functional equation that F satisfies.
(c) Prove that if we wish to “obtain” the ball, the optimal policy consists of examining first the box for which
\[ \frac{p_k(1 - q_k)}{t_k} \]
is a maximum. On the other hand, if we merely wish to “locate” the box containing the ball in the minimum expected time, the box for which this quantity is maximum is examined first, or not at all.
Exercise 1.9. A company has m jobs that are numbered 1 through m. Job i requires ki
employees. The “natural” monthly wage of job i is wi dollars, with wi ≤ wi+1 for all i. The
jobs are to be grouped into n labor grades, each grade consisting of several consecutive jobs.
All employees in a given labor grade receive the highest of the natural wages of the jobs in
that grade. A fraction \(r_j\) of the employees in each job quits each month. Vacancies must
be filled by promoting from the next lower grade. For instance, a vacancy in the highest of n
labor grades causes n − 1 promotions and a hire into the lowest labor grade. It costs t dollars
to train an employee to do any job. Write a functional equation whose solution determines
the number n of labor grades and the set of jobs in each labor grade that minimizes the sum
of the payroll and training costs.
Exercise 1.10. (The Jeep problem) Quoting from D. Gale, [18, p. 493],
· · · the problem concerns a jeep which is able to carry enough fuel to travel a distance d, but is required to cross a desert whose distance is greater than d (for example 2d). It is to do this by carrying fuel from its home base and establishing fuel depots at various points along its route so that it can refuel as it moves further out. It is then required to cross the desert on the minimum possible amount of fuel.
Without loss of generality, we can assume d = 1.
A desert of length 4/3 can be crossed using 2 units of fuel, as follows:
Trip 1: The jeep travels a distance of 1/3, deposits 1/3 unit of fuel and returns to base.
Trip 2: The jeep travels a distance of 1/3, refuels, and travels 1 unit of distance.
The inverse problem is to determine the maximal desert that can be crossed, given that the quantity of fuel is x. The answer is:
\[ 1 + \frac{1}{3} + \frac{1}{5} + \cdots + \frac{1}{2\lceil x\rceil - 3} + \frac{x - \lceil x\rceil + 1}{2\lceil x\rceil - 1}. \]
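The formula is easy to evaluate; a small sketch, checked against the 4/3 example above:

```python
import math

def jeep_range(x):
    """Maximal desert width crossable with x >= 1 units of fuel (tank capacity 1),
    per the formula above: full terms 1/(2j-1), j = 1..ceil(x)-1, plus a fraction."""
    k = math.ceil(x)
    full_terms = sum(1.0 / (2 * j - 1) for j in range(1, k))
    return full_terms + (x - k + 1) / (2 * k - 1)

print(jeep_range(2.0))   # 1.3333... = 4/3, matching the two-trip plan above
```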
References: [1], [9], [11], [15], [17], [18], [21], [25].
LECTURE 2
Theorem 2.1. Let A(x) and S(x) be convex with limit +∞ at x = ±∞, and let K = 0. Then the optimal policy is
\[ u_t = \begin{cases} 0 & \text{if } x_t > S_t, \\ S_t - x_t & \text{otherwise}, \end{cases} \tag{2.7} \]
for some \(S_t > 0\) (the order-up-to level on day t).
Proof. Let C be the class of convex functions with limit +∞ as the argument approaches ±∞. The function minimized in (2.6) is
\[ \theta(x + u) = E\,V_{t+1}^*(x + u - w). \tag{2.8} \]
If θ(x) ∈ C and has its minimum at \(S_t\), then the rule (2.7) is optimal. The minimized function is
\[ \min_u \theta(x + u) = \begin{cases} \theta(S_t) & \text{if } x \le S_t, \\ \theta(x) & \text{if } x > S_t, \end{cases} \tag{2.9} \]
that is, constant for \(x \le S_t\).
Writing (2.6) as \(V_t := \mathcal{L}V_{t+1}\), it follows that \(V_t \in C\) if \(V_{t+1} \in C\). Since \(V_{N+1} \in C\) by (2.3), it follows that \(V_t = \mathcal{L}^{N+1-t}V_{N+1}\) is in C. □
The optimal \(S_t\) is determined from
\[ V_t^*(x) = A(x) + E\,V_{t+1}^*(S - w) = A(x) + \theta(S), \quad x \le S, \]
and the optimal S minimizes \(E\,A(S - w)\).
Theorem 2.2. If A(x) and S(x) are convex with limit +∞ at x = ±∞, and if K ≠ 0, then the optimal order policy is
\[ u = \begin{cases} 0 & \text{if } x > s, \\ S - x & \text{otherwise}, \end{cases} \tag{2.10} \]
where S > s if K > 0.
Proof. The relevant optimality condition is (2.6). The expression to be minimized is
\[ K H(u) + \theta(x + u) = \begin{cases} \theta(x) & \text{if } u = 0, \\ K + \theta(x + u) & \text{if } u > 0, \end{cases} \tag{2.11} \]
where θ is given by (2.8). If θ ∈ C with minimum at S, then the rule (2.10) is optimal, with s the smaller root of
\[ \theta(s) = K + \theta(S). \]
In general the functions \(V^*\) and θ are not convex. The proof resumes after Lemma 2.4 below. □
Definition 2.1. A scalar function φ(x) is K-convex if
\[ \phi(x + u) \ge \phi(x) + u\,\phi'(x) - K, \quad \forall u, \tag{2.12} \]
where φ' is the derivative from the right,
\[ \phi'(x) = \lim_{u \downarrow 0} \frac{\phi(x + u) - \phi(x)}{u}. \]
LECTURE 3
Calculus of Variations
where \(n_k\) is the unit normal to the surface ∂D. If y is given on ∂D then η = 0 there, and the Euler–Lagrange equation becomes
\[ F_y - \sum_{k=1}^n \frac{\partial F_k}{\partial x_k} = 0. \tag{3.37} \]
Therefore any solution of (3.38) is an extremal of the variational problem of finding the stationary values of
\[ \lambda = \frac{J_1}{J_2} = \frac{\int_{x_0}^{x_1} \big( a(x)\,y'^2(x) - b(x)\,y^2(x) \big)\,dx}{\int_{x_0}^{x_1} c(x)\,y^2(x)\,dx}. \tag{3.39} \]
Exercises.
Exercise 3.3. Reverse the above argument, starting from the variational problem (3.39)
and deriving the Euler–Lagrange equation.
subject to (3.42). Such a problem is called isoperimetric, for historical reasons (see Example 3.2). The analog of (3.25) is
\[ f(x, y, z) = \min_{y'} \big[ F(x, y, y')\Delta + f(x + \Delta,\ y + y'\Delta,\ z - G(x, y, y')\Delta) \big] \tag{3.44} \]
with limit
\[ 0 = \min_{y'} \Big[ F(x, y, y') + \frac{\partial f}{\partial x} + y'\frac{\partial f}{\partial y} - G(x, y, y')\frac{\partial f}{\partial z} \Big]. \tag{3.45} \]
Therefore,
\[ 0 = F_{y'} + \frac{\partial f}{\partial y} - G_{y'}\frac{\partial f}{\partial z}, \tag{3.46} \]
and
\[ 0 = F + \frac{\partial f}{\partial x} + y'\frac{\partial f}{\partial y} - G\frac{\partial f}{\partial z}, \tag{3.47} \]
holding for all x, y, z, y' satisfying (3.46). Differentiating (3.46) w.r.t. x and (3.47) w.r.t. y, and combining, we get
\[ \frac{d}{dx}\,\frac{\partial}{\partial y'}\Big( F - \frac{\partial f}{\partial z}\,G \Big) - \frac{\partial}{\partial y}\Big( F - \frac{\partial f}{\partial z}\,G \Big) = 0. \tag{3.48} \]
Partial differentiation of (3.47) w.r.t. z yields
\[ \frac{\partial^2 f}{\partial x\,\partial z} + y'\,\frac{\partial^2 f}{\partial y\,\partial z} - G\,\frac{\partial^2 f}{\partial z^2} = 0, \tag{3.49} \]
\[ \therefore\quad \frac{d}{dx}\Big( \frac{\partial f}{\partial z} \Big) = 0, \tag{3.50} \]
i.e. \(\frac{\partial f}{\partial z}\) is a constant, say λ, and (3.48) is the Euler–Lagrange equation for F − λG. The parameter λ is the Lagrange multiplier of the constraint (3.42).
Example 3.2. ([10, p. 22]) The Dido problem is to find the smooth plane curve of perimeter L which encloses the maximum area. We show that this curve is a circle. Fix any point P of the curve C, and use polar coordinates with origin at P and θ = 0 along the (half) tangent of C at P. The area enclosed by C is \(A = \int_0^\pi \frac12 r^2\,d\theta\), and the perimeter is \(L = \int_0^\pi \big( (\frac{dr}{d\theta})^2 + r^2 \big)^{1/2}\,d\theta\). Define F := A + λL, and maximize
\[ F = \int_0^\pi \Big[ \tfrac12 r^2 + \lambda\Big( \Big(\frac{dr}{d\theta}\Big)^2 + r^2 \Big)^{1/2} \Big]\,d\theta = \int_0^\pi \big[ \tfrac12 r^2 + \lambda T \big]\,d\theta, \quad\text{where } T := \Big( \Big(\frac{dr}{d\theta}\Big)^2 + r^2 \Big)^{1/2}. \]
Exercises.
Exercise 3.5. A flexible fence of length L is used to enclose an area bounded on one
side by a straight wall. Find the maximum area that can be enclosed.
Exercise 3.6. (The catenary equation) A flexible uniform cable of length 2a hangs between the fixed points (0, 0) and (2b, 0), where b < a. Find the curve y = y(x) minimizing
\[ \int_0^{2b} (1 + (y')^2)^{1/2}\,y\,dx \quad\text{(potential energy)} \]
\[ \text{s.t.}\quad \int_0^{2b} (1 + (y')^2)^{1/2}\,dx = 2a \quad\text{(given length)}. \]
3.9. Corners
Consider the Calculus of Variations problem
\[ \text{opt} \int_0^T F(t, x, x')\,dt \quad\text{s.t.}\quad x(0), x(T) \text{ given},\ x \text{ piecewise smooth}. \]
An optimal \(x^*\) satisfies the Euler–Lagrange equation
\[ F_x = \frac{d}{dt} F_{x'} \]
in any subinterval of [0, T] where \(x^*\) is continuously differentiable. The discontinuity points of \(\frac{d}{dt}x^*\) are called its corners.
The Weierstrass–Erdmann corner conditions: the functions \(F_{x'}\) and \(F - x'F_{x'}\) are continuous at corners.
Example 3.4. Consider
\[ \min \int_0^3 (x - 2)^2 (x' - 1)^2\,dt \quad\text{s.t.}\quad x(0) = 0,\ x(3) = 2. \]
The lower bound, 0, is attained if at each t, 0 ≤ t ≤ 3, either x = 2 or x' = 1; e.g. \(x^*(t) = t\) for 0 ≤ t ≤ 2 and \(x^*(t) = 2\) for 2 ≤ t ≤ 3. This function has a corner at t = 2, but
\[ F_{x'} = 2(x - 2)^2(x' - 1) \quad\text{and}\quad F - x'F_{x'} = -(x - 2)^2(x' - 1)(x' + 1) \]
are continuous at t = 2.
LECTURE 4
The Variational Principles of Mechanics
This lecture is based on [23] and [24]. Other useful references are [20] and [32].
[Fig. 4.1: the double pendulum, with angles φ1, φ2, arm lengths l1, l2, and masses m1, m2.]
Hamilton's Principle, [23, pp. 111–114]: the motion of a mechanical system between any two points \((t_1, q(t_1))\) and \((t_2, q(t_2))\) occurs in such a way that the definite integral
\[ S := \int_{t_1}^{t_2} L(q, \dot q, t)\,dt \tag{4.2} \]
is stationary, i.e. δS = 0.
\[ \therefore\quad \delta\widehat{S} = 0 \iff \delta S = 0. \]
The Lagrangian \(L(q, \dot q, t)\) is therefore defined only up to a total time derivative \(\frac{d}{dt}f(q, t)\).
If the particles interact with each other, but not with anything outside the system (in which case the system is called closed), the Lagrangian is
\[ L = \sum_k \tfrac12 m_k v_k^2 - U(r_1, r_2, \ldots) \tag{4.8} \]
where \(r_k\) is the position of the kth particle. Here
\(T := \sum_k \tfrac12 m_k v_k^2\) is the kinetic energy, and
\(U := U(r_1, r_2, \ldots)\) is the potential energy.
If t is replaced by −t the Lagrangian does not change (time is isotropic).
Example 4.3. Consider the double pendulum of Fig. 4.1. For the first particle,
\[ T_1 = \tfrac12 m_1 l_1^2 \dot\phi_1^2, \qquad U_1 = -m_1 g\,l_1 \cos\phi_1. \]
For the second particle,
\[ x_2 = l_1\sin\phi_1 + l_2\sin\phi_2, \qquad y_2 = l_1\cos\phi_1 + l_2\cos\phi_2, \]
\[ \therefore\quad T_2 = \tfrac12 m_2(\dot x_2^2 + \dot y_2^2) = \tfrac12 m_2\big( l_1^2\dot\phi_1^2 + l_2^2\dot\phi_2^2 + 2 l_1 l_2 \cos(\phi_1 - \phi_2)\,\dot\phi_1\dot\phi_2 \big), \]
\[ \therefore\quad L = \tfrac12 (m_1 + m_2)\,l_1^2 \dot\phi_1^2 + \tfrac12 m_2 l_2^2 \dot\phi_2^2 + m_2 l_1 l_2 \cos(\phi_1 - \phi_2)\,\dot\phi_1\dot\phi_2 + (m_1 + m_2)\,g\,l_1\cos\phi_1 + m_2\,g\,l_2\cos\phi_2. \]
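As a symbolic check of Example 4.3, the Euler–Lagrange equations \(\frac{d}{dt}\frac{\partial L}{\partial\dot\phi_i} - \frac{\partial L}{\partial\phi_i} = 0\) can be generated from this Lagrangian with sympy (a sketch; the printed equations are expanded but not simplified):

```python
import sympy as sp

t = sp.symbols('t')
m1, m2, l1, l2, g = sp.symbols('m1 m2 l1 l2 g', positive=True)
phi1, phi2 = sp.Function('phi1')(t), sp.Function('phi2')(t)
dphi1, dphi2 = phi1.diff(t), phi2.diff(t)

# The Lagrangian of Example 4.3
L = (sp.Rational(1, 2) * (m1 + m2) * l1**2 * dphi1**2
     + sp.Rational(1, 2) * m2 * l2**2 * dphi2**2
     + m2 * l1 * l2 * sp.cos(phi1 - phi2) * dphi1 * dphi2
     + (m1 + m2) * g * l1 * sp.cos(phi1)
     + m2 * g * l2 * sp.cos(phi2))

# Euler-Lagrange equations: d/dt (dL/dphi_i') - dL/dphi_i = 0
for q in (phi1, phi2):
    print(sp.Eq(sp.expand(L.diff(q.diff(t)).diff(t) - L.diff(q)), 0))
```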
with no term \(\frac{\partial L}{\partial t}\). Using the Euler–Lagrange equations \(\frac{\partial L}{\partial q_i} = \frac{d}{dt}\frac{\partial L}{\partial \dot q_i}\) we write
\[ \frac{dL}{dt} = \sum_i \frac{d}{dt}\Big( \frac{\partial L}{\partial\dot q_i} \Big)\dot q_i + \sum_i \frac{\partial L}{\partial\dot q_i}\,\ddot q_i = \sum_i \frac{d}{dt}\Big( \frac{\partial L}{\partial\dot q_i}\,\dot q_i \Big), \]
\[ \therefore\quad \frac{d}{dt}\Big( \sum_i \dot q_i\,\frac{\partial L}{\partial\dot q_i} - L \Big) = 0. \]
Therefore
\[ E := \sum_i \dot q_i\,\frac{\partial L}{\partial\dot q_i} - L \tag{4.12} \]
remains constant during the motion of a closed system, see also (3.33).
If \(T(q, \dot q)\) is a quadratic function of \(\dot q\), and \(L = T(q, \dot q) - U(q)\), then (4.12) becomes
\[ E = \sum_i \dot q_i\,\frac{\partial L}{\partial\dot q_i} - L = 2T - L = T + U. \tag{4.13} \]
Since ε is arbitrary,
\[ \delta L = 0 \iff \sum_i \frac{\partial L}{\partial r_i} = 0, \]
\[ \therefore\quad \sum_i \frac{d}{dt}\,\frac{\partial L}{\partial v_i} = 0, \quad\text{by Euler–Lagrange,} \]
\[ \therefore\quad \frac{d}{dt}\sum_i \frac{\partial L}{\partial v_i} = 0. \]
Therefore
\[ \mathbf{P} := \sum_i \frac{\partial L}{\partial v_i} \tag{4.14} \]
remains constant during motion. For the Lagrangian (4.8),
\[ \mathbf{P} = \sum_i m_i v_i, \tag{4.15} \]
the momentum of the system.
Newton's 3rd Law. From \(\sum_i \frac{\partial L}{\partial r_i} = 0\) we conclude \(\sum_i F_i = 0\), where \(F_i = -\frac{\partial U}{\partial r_i}\); i.e. the sum of forces on all particles in a closed system is zero.
4.12. Conservation of Angular Momentum
The isotropy of space implies that the Lagrangian is invariant under an infinitesimal rotation δφ, see Fig. 4.2:
\[ \delta r = \delta\phi \times r, \]
or \(\|\delta r\| = r\sin\theta\,\|\delta\phi\|\). Similarly,
\[ \delta v = \delta\phi \times v. \]
Substituting these in
\[ \delta L = \sum_i \Big( \frac{\partial L}{\partial r_i}\cdot\delta r_i + \frac{\partial L}{\partial v_i}\cdot\delta v_i \Big) = 0, \]
and using \(p_i = \frac{\partial L}{\partial v_i}\), \(\dot p_i = \frac{\partial L}{\partial r_i}\), we get
\[ \sum_i \big( \dot p_i\cdot\delta\phi\times r_i + p_i\cdot\delta\phi\times v_i \big) = 0, \]
or
\[ \delta\phi\cdot\sum_i \big( r_i\times\dot p_i + v_i\times p_i \big) = \delta\phi\cdot\frac{d}{dt}\sum_i r_i\times p_i = 0. \]
[Fig. 4.2: an infinitesimal rotation δφ and the resulting displacement δr.]
we have
\[ dH = -\sum_i \dot p_i\,dq_i + \sum_i \dot q_i\,dp_i. \]
Comparing with
\[ dH = \sum_i \frac{\partial H}{\partial q_i}\,dq_i + \sum_i \frac{\partial H}{\partial p_i}\,dp_i \]
we get the Hamilton equations
\[ \dot q_i = \frac{\partial H}{\partial p_i}, \tag{4.21} \]
\[ \dot p_i = -\frac{\partial H}{\partial q_i}. \tag{4.22} \]
The total time derivative of H is
\[ \frac{dH}{dt} = \frac{\partial H}{\partial t} + \sum_i \frac{\partial H}{\partial q_i}\,\dot q_i + \sum_i \frac{\partial H}{\partial p_i}\,\dot p_i, \]
and by (4.21)–(4.22),
\[ \frac{dH}{dt} = \frac{\partial H}{\partial t}. \]
4.15. The Action as Function of Coordinates
The action \(S = \int_{t_1}^{t_2} L\,dt\) has the Lagrangian as its time derivative, \(\frac{dS}{dt} = L\).
\[ \delta S = \Big[ \frac{\partial L}{\partial\dot q}\,\delta q \Big]_{t_1}^{t_2} + \int_{t_1}^{t_2} \Big( \frac{\partial L}{\partial q} - \frac{d}{dt}\frac{\partial L}{\partial\dot q} \Big)\delta q\,dt = \sum_i p_i\,\delta q_i \quad\therefore\quad \frac{\partial S}{\partial q_i} = p_i, \]
\[ \therefore\quad L = \frac{dS}{dt} = \frac{\partial S}{\partial t} + \sum_i \frac{\partial S}{\partial q_i}\,\dot q_i = \frac{\partial S}{\partial t} + \sum_i p_i\dot q_i, \]
\[ \therefore\quad \frac{\partial S}{\partial t} = L - \sum_i p_i\dot q_i = -H, \]
\[ \therefore\quad dS = \sum_i p_i\,dq_i - H\,dt. \]
Maupertuis' Principle. The motion of a mechanical system minimizes the abbreviated action
\[ \int \Big( \sum_i p_i\,dq_i \Big). \]
This holds if energy is conserved.
\[ \frac{\partial S}{\partial t} + H = 0, \tag{4.23} \]
or
\[ \frac{\partial S}{\partial t} + H(q_1, \ldots, q_s, p_1, \ldots, p_s, t) = 0. \tag{4.24} \]
Solution. If \(\frac{\partial H}{\partial t} = 0\) then
\[ \frac{\partial S}{\partial t} + H\Big( q_1, \ldots, q_s, \frac{\partial S}{\partial q_1}, \ldots, \frac{\partial S}{\partial q_s} \Big) = 0 \]
with solution
\[ S = -ht + V(q_1, \ldots, q_s, \alpha_1, \ldots, \alpha_s, h), \]
where \(h, \alpha_1, \ldots, \alpha_s\) are arbitrary constants.
\[ \therefore\quad H\Big( q_1, \ldots, q_s, \frac{\partial V}{\partial q_1}, \ldots, \frac{\partial V}{\partial q_s} \Big) = h. \]
Separation of variables. If
\[ H = G\big( f_1(q_1, p_1), f_2(q_2, p_2), \ldots, f_s(q_s, p_s) \big) \]
then
\[ G\Big( f_1\Big(q_1, \frac{\partial V}{\partial q_1}\Big), f_2\Big(q_2, \frac{\partial V}{\partial q_2}\Big), \ldots, f_s\Big(q_s, \frac{\partial V}{\partial q_s}\Big) \Big) = h. \]
Let
\[ \alpha_i = f_i\Big( q_i, \frac{\partial V}{\partial q_i} \Big), \quad i \in 1, s, \]
\[ \therefore\quad V = \sum_{i=1}^s \int f_i(q_i, \alpha_i)\,dq_i, \]
\[ \therefore\quad S = -G(\alpha_1, \ldots, \alpha_s)\,t + \sum_{i=1}^s \int f_i(q_i, \alpha_i)\,dq_i. \]
LECTURE 5
Optimal Control, Unconstrained Control
Integrating by parts,
\[ -\int_0^T \lambda(t)\,x'(t)\,dt = -\lambda(T)x(T) + \lambda(0)x(0) + \int_0^T x(t)\,\lambda'(t)\,dt, \]
\[ \therefore\quad \int_0^T f(t, x, u)\,dt = \int_0^T \big[ f(t, x, u) + \lambda g(t, x, u) + x\lambda' \big]\,dt - \lambda(T)x(T) + \lambda(0)x(0). \]
Let
• \(u^*(t)\) be an optimal control,
• \(u^*(t) + \varepsilon h(t)\) a comparison control, with parameter ε and h fixed,
• \(y(t, \varepsilon)\) the resulting state, with \(y(t, 0) = x^*(t)\) and \(y(0, \varepsilon) = x(0)\), \(y(T, \varepsilon) = x(T)\) for all ε.
Define
\[ J(\varepsilon) := \int_0^T f(t, y(t,\varepsilon), u^*(t) + \varepsilon h(t))\,dt, \]
\[ \therefore\quad J(\varepsilon) = \int_0^T \big[ f(t, y(t,\varepsilon), u^*(t) + \varepsilon h(t)) + \lambda(t)\,g(t, y(t,\varepsilon), u^*(t) + \varepsilon h(t)) + y(t,\varepsilon)\,\lambda'(t) \big]\,dt - \lambda(T)\,y(T,\varepsilon) + \lambda(0)\,y(0,\varepsilon). \]
Since \(u^*\) is the maximizing control, J(ε) has a local maximum at ε = 0. Therefore J'(0) = 0:
\[ J'(0) = \int_0^T \big[ (f_x + \lambda g_x + \lambda')\,y_\varepsilon + (f_u + \lambda g_u)\,h \big]\,dt = 0, \tag{5.4} \]
where \(f_x := \frac{\partial}{\partial x} f(t, x^*(t), u^*(t))\), etc.
(5.4) holds for all y, h iff, along \((x^*, u^*)\),
\[ \lambda'(t) = -\big[ f_x(t, x(t), u(t)) + \lambda(t)\,g_x(t, x(t), u(t)) \big], \tag{5.5} \]
\[ f_u(t, x(t), u(t)) + \lambda(t)\,g_u(t, x(t), u(t)) = 0. \tag{5.6} \]
Define the Hamiltonian
\[ H(t, x(t), u(t), \lambda(t)) := f(t, x, u) + \lambda\,g(t, x, u). \tag{5.7} \]
Then
\[ x' = \frac{\partial H}{\partial\lambda} \iff (5.3), \qquad \lambda' = -\frac{\partial H}{\partial x} \iff (5.5), \qquad \frac{\partial H}{\partial u} = 0 \iff (5.6). \]
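Conditions (5.3), (5.5)–(5.6) turn the optimal control problem into a two-point boundary value problem. A numeric sketch on the illustrative problem \(\min\int_0^1 (x^2 + u^2)\,dt\), x' = u, x(0) = 1, x(1) free (so λ(1) = 0): eliminating u = −λ/2 from \(H_u = 0\) leaves a BVP in (x, λ), solved here with scipy and checked against the closed form \(x(t) = \cosh t - \tanh(1)\sinh t\).

```python
import numpy as np
from scipy.integrate import solve_bvp

# H = x^2 + u^2 + lam*u; H_u = 0 gives u = -lam/2; lam' = -H_x = -2x.
def rhs(t, y):                        # y[0] = x, y[1] = lam
    return np.vstack([-y[1] / 2, -2 * y[0]])

def bc(ya, yb):
    return np.array([ya[0] - 1.0,     # x(0) = 1
                     yb[1]])          # lam(1) = 0 (free right endpoint)

t = np.linspace(0, 1, 50)
sol = solve_bvp(rhs, bc, t, np.zeros((2, t.size)))
x_exact = np.cosh(t) - np.tanh(1) * np.sinh(t)
print(np.max(np.abs(sol.sol(t)[0] - x_exact)))   # small, e.g. ~1e-6
```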
subject to
\[ x_i'(t) = g_i(t, x, u), \quad i \in 1, n \tag{5.9} \]
\[ x_i(0) \ \text{fixed}, \quad i \in 1, n \tag{5.10} \]
\[ x_i(T) \ \text{fixed}, \quad i \in 1, q \tag{5.11} \]
\[ x_i(T) \ \text{free}, \quad i \in q+1, r \tag{5.12} \]
\[ x_i(T) \ge 0, \quad i \in r+1, s \tag{5.13} \]
\[ K(x_{s+1}, \ldots, x_n, t) \ge 0 \ \text{at}\ t = T \tag{5.14} \]
LECTURE 6
Optimal Control, Constrained Control
If
\[ \lambda' = -(f_x + \lambda g_x), \quad \lambda(T) = 0, \tag{6.5} \]
then
\[ \delta J = \int_0^T (f_u + \lambda g_u)\,\delta u\,dt. \]
If \((x^*, u^*)\) is optimal for (6.1)–(6.2), then there is a function λ such that \((x^*, u^*, \lambda)\) satisfies (6.2), (6.3), (6.5) and (6.6). Here
\[ (6.2) \iff x' = H_\lambda, \qquad (6.5) \iff \lambda' = -H_x, \]
for the Hamiltonian
\[ H(t, x, u, \lambda) = f(t, x, u) + \lambda\,g(t, x, u). \tag{6.7} \]
Then (6.6) is the solution of the NLP
\[ \max\ H = f + \lambda g \quad\text{s.t.}\quad a \le u \le b, \]
whose Lagrangian is
\[ L := f(t, x, u) + \lambda g(t, x, u) + w_1(b - u) + w_2(u - a), \]
giving the necessary conditions
\[ \frac{\partial L}{\partial u} = f_u + \lambda g_u - w_1 + w_2 = 0, \tag{6.8} \]
\[ w_1 \ge 0, \quad w_1(b - u) = 0, \tag{6.9} \]
\[ w_2 \ge 0, \quad w_2(u - a) = 0, \tag{6.10} \]
which are equivalent to (6.6). For example, if \(u^*(t) = b\) then
\[ u^* - a > 0, \quad w_2 = 0 \quad\text{and}\quad f_u + \lambda g_u = w_1 \ge 0. \]
The control u is singular if its value does not change H, for example, the middle case in
(6.6). A singular control cannot be determined from H.
6.4. Corners
The optimality of discontinuous controls raises questions about the continuity of λ and H at points where u is discontinuous. Recall the Weierstrass–Erdmann corner conditions of § 3.9: the functions \(F_{x'}\), \(F - x'F_{x'}\) are continuous at corners. Returning to optimal control, consider
\[ \text{opt} \int_0^T F(t, x, u)\,dt \quad\text{s.t.}\quad x' = u, \]
with H = F + λu:
\[ H_u = F_u + \lambda = 0 \quad\therefore\quad \lambda = -F_{x'}. \]
The Weierstrass–Erdmann corner conditions imply that
\[ \lambda = -F_{x'}, \qquad H = F - x'F_{x'} \]
are continuous at corners of u.
\[ \max \int_0^T \big[ F(t, x) + u f(t, x) \big]\,dt \tag{6.13} \]
\[ \text{s.t.}\quad x' = G(t, x) + u\,g(t, x), \quad x(0) \text{ given, and } u \text{ satisfies (6.3)}. \]
The Hamiltonian is
\[ H := (F + \lambda G) + u(f + \lambda g) \tag{6.14} \]
and the necessary conditions include
\[ \lambda' = -H_x, \tag{6.15} \]
\[ u = \begin{cases} a & \text{if } f + \lambda g < 0, \\ ? & \text{if } f + \lambda g = 0, \\ b & \text{if } f + \lambda g > 0. \end{cases} \tag{6.16} \]
with Lagrangian
\[ L = 1 + \lambda_1 x_2 + \lambda_2 u + w_1(u - 1) - w_2(u + 1) \]
and necessary conditions
\[ \frac{\partial L}{\partial u} = \lambda_2 + w_1 - w_2 = 0, \tag{6.17} \]
\[ w_1 \ge 0, \quad w_1(u - 1) = 0, \]
\[ w_2 \ge 0, \quad w_2(u + 1) = 0, \]
\[ \lambda_1' = -\frac{\partial L}{\partial x_1} = 0, \tag{6.18} \]
\[ \lambda_2' = -\frac{\partial L}{\partial x_2} = -\lambda_1, \tag{6.19} \]
\[ L(T) = 0, \quad\text{since } T \text{ is free}. \tag{6.20} \]
Indeed, \(L(t) \equiv 0\), 0 ≤ t ≤ T.
(6.18) ⟹ λ1 = constant. ∴ (6.19) ⟹ λ2 = −λ1 t + C.
If λ2 = 0 in some interval, then λ1 = 0 and L(T) = 1 + 0 + 0, contradicting (6.20).
∴ λ2 changes sign at most once.
• If λ2 > 0 then w2 > 0 and u = −1.
• If λ2 < 0 then w1 > 0 and u = 1.
Consider an interval where u = 1. Then
\[ x_2' = u = 1 \quad\therefore\quad x_2 = t + C_0, \]
\[ \therefore\quad x_1 = \frac{t^2}{2} + C_0 t + C_1 = \frac{(x_2 - C_0)^2}{2} + C_0(x_2 - C_0) + C_1 = \frac{x_2^2}{2} + C_2, \qquad C_2 = -\frac{C_0^2}{2} + C_1. \]
Similarly, in an interval where u = −1,
\[ x_1 = -\frac{x_2^2}{2} + C_3. \]
The optimal path must end on one of the parabolas
\[ x_1 = \frac{x_2^2}{2} \quad\text{or}\quad x_1 = -\frac{x_2^2}{2} \]
passing through \(x_1 = x_2 = 0\). The optimal path begins on a parabola passing through \((x_0, 0)\), and ends up on the switching curve, see Figure 6.3.
[Figure 6.2: the parabolas \(x_1 = \pm\frac{x_2^2}{2} + C\) and the switching curve.]
[Figure 6.3: an optimal path from \((x_0, 0)\) to the origin along the parabolas and the switching curve.]
Differentiating (6.23),
\[ p'f'(K) + p f''(K)\,K' = (r + b)c' - c'', \]
and substituting K' = I − bK we get the singular solution I > 0.
Between investment periods we have I = 0. To see what this means, collect m-terms in (6.22) and integrate:
\[ e^{-(r+b)t}\,m(t) = \int_t^\infty e^{-(r+b)s}\,p(s)\,f'(K(s))\,ds. \]
Also
\[ e^{-(r+b)t}\,c(t) = -\int_t^\infty \frac{d}{ds}\big[ e^{-(r+b)s} c(s) \big]\,ds = \int_t^\infty e^{-(r+b)s}\big[ c(s)(r+b) - c'(s) \big]\,ds. \]
The condition m(t) ≤ c(t) gives
\[ \int_t^\infty e^{-(r+b)s}\big[ p(s)\,f'(K(s)) - (r+b)\,c(s) + c'(s) \big]\,ds \le 0, \]
with equality if I > 0. Therefore at any interval \([t_1, t_2]\) with I = 0 (I > 0 for \((t_1)^-\) and \((t_2)^+\)),
\[ \int_t^{t_2} e^{-(r+b)s}\big[ p f' - (r+b)c + c' \big]\,ds \le 0, \]
with equality at t = t1.
where: x(t) = stock at time t, F (x) = the growth function, u(t) = the harvest rate, p = unit
price, and c(x) = unit cost.
Substituting (6.24) in the objective,
\[ \max \int_0^\infty e^{-rt}\,(p - c(x))\,(F(x) - \dot x)\,dt, \]
or \(\max \int_0^\infty \phi(t, x, \dot x)\,dt\), with Euler–Lagrange equation
\[ \frac{\partial\phi}{\partial x} = \frac{d}{dt}\,\frac{\partial\phi}{\partial\dot x}, \]
where
\[ \frac{\partial\phi}{\partial x} = e^{-rt}\big\{ -c'(x)\big[ F(x) - \dot x \big] + \big[ p - c(x) \big]\,F'(x) \big\}, \]
\[ \frac{d}{dt}\,\frac{\partial\phi}{\partial\dot x} = \frac{d}{dt}\big\{ -e^{-rt}\big[ p - c(x) \big] \big\} = e^{-rt}\big\{ r\big[ p - c(x) \big] + c'(x)\,\dot x \big\}, \]
\[ \therefore\quad F'(x) - \frac{c'(x)\,F(x)}{p - c(x)} = r. \tag{6.25} \]
Economic interpretation: write (6.25) as
\[ F'(x)\big[ p - c(x) \big] - c'(x)F(x) = r\big[ p - c(x) \big], \]
or
\[ \frac{d}{dx}\big\{ \big[ p - c(x) \big]\,F(x) \big\} = r\,\big[ p - c(x) \big]. \tag{6.26} \]
RHS = value, one instant later, of catching fish #(x+1);
LHS = value, one instant later, of not catching fish #(x+1).
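Equation (6.25) determines the singular ("golden rule") stock level \(x^*\). A numeric sketch with hypothetical data F(x) = x(1 − x), c(x) = 0.2/x, p = 1, r = 0.05:

```python
import numpy as np
from scipy.optimize import brentq

p, r = 1.0, 0.05
F  = lambda x: x * (1 - x)          # growth function (logistic; hypothetical)
Fp = lambda x: 1 - 2 * x
c  = lambda x: 0.2 / x              # unit harvest cost (hypothetical)
cp = lambda x: -0.2 / x**2

def golden(x):                      # LHS minus RHS of (6.25)
    return Fp(x) - cp(x) * F(x) / (p - c(x)) - r

x_star = brentq(golden, 0.25, 0.99) # root on an interval where c(x) < p
print(x_star)
```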
LECTURE 7
Stochastic Dynamic Programming
\[ V_k(i) = \max_{a\in A}\Big\{ R(i, a) + \sum_{j=1}^N P_{ij}(a)\,V_{k-1}(j) \Big\}, \quad k = n, n-1, \cdots, 2, \tag{7.1a} \]
logarithmic utility used in (7.2b). If p ≤ 1/2 then it is optimal to bet zero (α = 0) and \(V_n(x) = \log x\). Suppose p > 1/2. Then
\[ V_1(x) = \max_\alpha\ \{ p\log(x + \alpha x) + q\log(x - \alpha x) \} = \max_\alpha\ \{ p\log(1 + \alpha) + q\log(1 - \alpha) \} + \log x, \]
\[ \therefore\quad \alpha = p - q, \tag{7.3} \]
\[ \therefore\quad V_1(x) = C + \log x, \quad\text{where } C = \log 2 + p\log p + q\log q. \tag{7.4} \]
Similarly,
\[ V_n(x) = nC + \log x, \tag{7.5} \]
and the optimal policy is to bet the fraction (7.3) of the current fortune. See also Exercise 7.1.
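A numeric check of (7.3)–(7.4) for the illustrative value p = 0.6, maximizing over a grid of betting fractions:

```python
import numpy as np

p = 0.6
q = 1 - p
a = np.linspace(0.0, 0.999, 100_000)
g = p * np.log(1 + a) + q * np.log(1 - a)

print(a[np.argmax(g)])                                 # ~0.2 = p - q, as in (7.3)
print(g.max(), np.log(2) + p*np.log(p) + q*np.log(q))  # both equal C of (7.4)
```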
7.1.2. A Stock–Option Model. The model considered here is a special case of [30].
Let \(S_k\) denote the price of a given stock on day k, and suppose
\[ S_{k+1} = S_k + X_{k+1} = S_0 + \sum_{i=1}^{k+1} X_i, \]
where \(X_1, X_2, \cdots\) are i.i.d. RVs with distribution F and finite mean \(\mu_F\). You have an option
to buy one share at a fixed price c, and N days in which to exercise this option. If the option
is exercised (on a day) when the stock price is s, the profit is s − c.
Let Vn (s) denote the maximal expected profit if the current stock price is s and n days
remain in the life of the option. Then
\[ V_n(s) = \max\Big\{ s - c\ ,\ \int V_{n-1}(s + x)\,dF(x) \Big\}, \quad n \ge 1, \tag{7.6a} \]
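A numeric sketch of (7.6a) on a price grid, with a hypothetical discrete distribution for the daily increments and \(V_0(s) = \max(s - c, 0)\) at expiry (np.interp clamps at the grid edges):

```python
import numpy as np

xs = np.array([-1.0, 0.0, 1.0])      # increment values (hypothetical)
ps = np.array([0.3, 0.3, 0.4])       # their probabilities; mean 0.1 > 0
c, N = 10.0, 5                       # strike price, days to expiry
s_grid = np.arange(0.0, 30.0 + 1e-9, 1.0)

V = np.maximum(s_grid - c, 0.0)      # V_0: exercise (if profitable) or let expire
for n in range(1, N + 1):
    wait = np.zeros_like(s_grid)     # E V_{n-1}(s + X)
    for x, prob in zip(xs, ps):
        wait += prob * np.interp(s_grid + x, s_grid, V)
    V = np.maximum(s_grid - c, wait) # (7.6a): exercise now vs. wait a day
print(V[12])                         # option value at s = 12 with 5 days left
```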
and (7.7) follows. Conversely, suppose (7.7) holds. Then for all \(x_1 > x_2\) and \(y_1 > y\),
\[ \frac{g(x_1, y_1) - g(x_1, y)}{y_1 - y} \ge \frac{g(x_2, y_1) - g(x_2, y)}{y_1 - y}, \]
and as \(y_1 \downarrow y\),
\[ \frac{\partial}{\partial y} g(x_1, y) \ge \frac{\partial}{\partial y} g(x_2, y), \]
implying (7.8). □
Example 7.1. An optimal allocation problem with penalty costs. There are I identical jobs to be done consecutively in N stages, at most one job per stage, with uncertainty about completing it. If in any stage a budget y is allocated, the job will be completed with
probability P (y), an increasing function with P (0) = 0. If a job is not completed, the amount
allocated is lost. At the end of the N th stage, if there are i unfinished jobs, a penalty C(i)
must be paid.
The problem is to determine the optimal allocation at each stage so as to minimize the
total expected cost.
Take the state of the system to be the number i of unfinished jobs, and let Vn(i) be the minimal expected cost in state i with n stages to go. Then
\[ V_n(i) = \min_{y \ge 0}\ \big\{\, y + P(y)\,V_{n-1}(i-1) + [1 - P(y)]\,V_{n-1}(i) \,\big\}. \tag{7.9a} \]
Clearly Vn(i) increases in i and decreases in n. Let \(y_n(i)\) denote the minimizing y in (7.9a). One may expect that
\[ y_n(i) \ \text{increases in } i \text{ and decreases in } n. \tag{7.10} \]
It follows that \(y_n(i)\) increases in i if
\[ \frac{\partial^2}{\partial i\,\partial y}\,\big\{ y + P(y)V_{n-1}(i-1) + [1 - P(y)]V_{n-1}(i) \big\} \le 0, \]
or
\[ P'(y)\,\frac{\partial}{\partial i}\,\big[ V_{n-1}(i-1) - V_{n-1}(i) \big] \le 0, \]
where i is formally interpreted as continuous, so \(\frac{\partial}{\partial i}\) makes sense. Since P'(y) ≥ 0, it follows that
\[ y_n(i) \ \text{increases in } i \text{ if } V_{n-1}(i-1) - V_{n-1}(i) \text{ decreases in } i. \tag{7.11} \]
Similarly,
\[ y_n(i) \ \text{decreases in } n \text{ if } V_{n-1}(i-1) - V_{n-1}(i) \text{ increases in } n. \tag{7.12} \]
It can be shown that if C(i) is convex in i, then (7.11)–(7.12) hold.
7.1.4. Accepting the best offer. This section is based on [19]. Given n offers in
sequence, which one to accept? Assume:
(a) if any offer is accepted, the process stops,
(b) if an offer is rejected, it is lost forever,
(c) the relative rank of an offer, relative to previous offers, is known, and
(d) information about future offers is unavailable.
The objective is to maximize the probability of accepting the best offer, assuming that
all n! arrangements of offers are equally likely.
Let P(i) denote the probability that the ith offer is the best:
\[ P(i) = \text{Prob}\{\text{offer is best of } n \mid \text{offer is best of first } i\} = \frac{1/n}{1/i} = \frac{i}{n}, \]
and let H(i) denote the best outcome if offer i is rejected, i.e. the maximal probability of accepting the best offer if the first i offers were rejected. Clearly H(i) is decreasing in i.
The maximal probability of accepting the best offer is
\[ V(i) = \max\{ P(i), H(i) \} = \max\Big\{ \frac{i}{n},\ H(i) \Big\}, \quad i = 1, \cdots, n. \]
It follows (since i/n is increasing and H(i) is decreasing) that for some j,
\[ \frac{i}{n} \le H(i), \quad i \le j, \qquad\quad \frac{i}{n} > H(i), \quad i > j, \]
and the optimal policy is: reject the first j offers, then accept the first offer that is better than all of its predecessors.
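The cutoff j can be computed exactly: the policy "reject the first j, then take the first offer better than all before it" wins with probability \(\frac{j}{n}\sum_{i=j+1}^n \frac{1}{i-1}\) (and 1/n for j = 0). A short sketch, recovering the familiar j ≈ n/e:

```python
from fractions import Fraction

def best_cutoff(n):
    """Cutoff j maximizing the probability of accepting the best of n offers."""
    def win(j):
        if j == 0:
            return Fraction(1, n)
        return Fraction(j, n) * sum(Fraction(1, i - 1) for i in range(j + 1, n + 1))
    return max(range(n), key=win)

print(best_cutoff(10), best_cutoff(100))   # 3 37, roughly n/e
```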
Exercises.
Exercise 7.1. What happens in the gambling model in § 7.1.1 if the boundary condition
(7.2b) is replaced by V0 (x) = x ?
Exercise 7.2. For the stock option problem of § 7.1.2 show: if µF ≥ 0 then sn = ∞ for
n ≥ 1.
A policy π ∗ is optimal if
Vπ∗ (i) = V (i) , ∀ i .
where Wπ(j) is the expected discounted return from time 1 on, under policy π. Since
\[ W_\pi(j) \le \alpha V(j), \]
it follows that
\[ V_\pi(i) = \sum_{a\in A} P_a\Big[ R(i,a) + \sum_j P_{ij}(a)\,W_\pi(j) \Big] \le \sum_{a\in A} P_a\Big[ R(i,a) + \alpha\sum_j P_{ij}(a)\,V(j) \Big] \le \max_a\Big\{ R(i,a) + \alpha\sum_j P_{ij}(a)\,V(j) \Big\} \tag{7.17} \]
\[ \therefore\quad V(i) \le \max_a\Big\{ R(i,a) + \alpha\sum_j P_{ij}(a)\,V(j) \Big\}, \tag{7.18} \]
since π is arbitrary. To prove the reverse inequality, let a policy π begin with a decision \(a_0\) satisfying
\[ R(i,a_0) + \alpha\sum_j P_{ij}(a_0)\,V(j) = \max_a\Big\{ R(i,a) + \alpha\sum_j P_{ij}(a)\,V(j) \Big\}. \tag{7.19} \]
We prove now that a stationary policy satisfying the optimality equation (7.16) is optimal.
Theorem 7.3. Let f be a stationary policy such that a = f(i) is an action maximizing the RHS of (7.16):
\[ R(i,f(i)) + \alpha\sum_j P_{ij}(f(i))\,V(j) = \max_{a\in A}\Big\{ R(i,a) + \alpha\sum_j P_{ij}(a)\,V(j) \Big\}, \quad\forall i. \]
Then
\[ V_f(i) = V(i), \quad\forall i. \]
Proof. Using
\[ V(i) = \max_{a\in A}\Big\{ R(i,a) + \alpha\sum_j P_{ij}(a)\,V(j) \Big\} = R(i,f(i)) + \alpha\sum_j P_{ij}(f(i))\,V(j) \tag{7.20} \]
we interpret V as the expected return from a 2-stage process with policy f in stage 1, and terminal reward V in stage 2. Repeating the argument,
\[ V(i) = E\{ n\text{-stage return under } f \mid X_0 = i \} + \alpha^n\,E\big( V(X_n) \mid X_0 = i \big) \to V_f(i), \quad\text{by (7.15)}, \]
hence \(V_f(i) = V(i)\). □
\[ d(x_m, x_n) \le \frac{\alpha^m}{1-\alpha}\,d(x_1, x_0), \tag{7.22} \]
\[ \therefore\quad d(x_m, x_n) \to 0 \ \text{as}\ m, n \to \infty. \]
Since {X, d} is complete, the sequence \(\{x_n\}\) converges, say to \(x^* \in X\). Then \(T(x^*) = x^*\), since
\[ d(x_{n+1}, T(x^*)) = d(T(x_n), T(x^*)) \le \alpha\,d(x_n, x^*) \to 0 \ \text{as}\ n \to \infty. \]
If \(x^{**}\) is another fixed point of T, then
\[ d(x^*, x^{**}) = d(T(x^*), T(x^{**})) \le \alpha\,d(x^*, x^{**}), \]
and \(x^* = x^{**}\) since α < 1. Since (7.22) is independent of n, it follows that
\[ d(x_m, x^*) \le \frac{\alpha^m}{1-\alpha}\,d(x_1, x_0), \quad m = 1, 2, \cdots \tag{7.23} \]
□
A mapping T satisfying (7.21) is called a contraction, and Theorem 7.4 is also called the contraction mapping principle.
Theorem 7.4 applies here as follows. Let U be the set of bounded functions on the state space, endowed with the sup norm
\[ \|u\| = \sup_i |u(i)|. \tag{7.24} \]
For any stationary policy f let \(T_f\) be an operator mapping U into itself,
\[ (T_f u)(i) := R(i,f(i)) + \alpha\sum_j P_{ij}(f(i))\,u(j), \]
and therefore
\[ \sup_i |(T_f u)(i) - (T_f v)(i)| \le \alpha\,\sup_j |u(j) - v(j)|, \quad\forall u, v \in U, \tag{7.25} \]
i.e. \(T_f\) is a contraction in the sup norm (since 0 < α < 1). It follows that
\[ T_f^n u \to V_f \ \text{as}\ n \to \infty. \]
It follows that the optimal value function V is the unique bounded solution of (7.16).
7.2.3. Successive approximations of the optimal value. Let \(V_0(i)\) be any bounded function, \(i \in 1, N\), and define
\[ V_n(i) := \max_{a\in A}\Big\{ R(i,a) + \alpha\sum_j P_{ij}(a)\,V_{n-1}(j) \Big\}, \quad n = 1, 2, \cdots \tag{7.26} \]
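A sketch of the iteration (7.26) on a toy two-state, two-action MDP (the data R, P, α are hypothetical); the stopping rule reflects the error bound (7.23):

```python
import numpy as np

alpha = 0.9
R = np.array([[1.0, 0.0],            # R[i, a]: reward in state i under action a
              [0.0, 2.0]])
P = np.array([[[0.8, 0.2],           # P[a, i, j]: transition matrix of action a
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.3, 0.7]]])

V = np.zeros(2)                      # any bounded V_0 will do
while True:
    Q = R + alpha * np.einsum('aij,j->ia', P, V)   # Q[i, a]
    V_new = Q.max(axis=1)                          # the update (7.26)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print(V, Q.argmax(axis=1))           # optimal values and a maximizing action
```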
Proposition 7.2. Let g be a stationary policy with expected value \(V_g\), and define a policy h by
\[ R(i,h(i)) + \alpha\sum_j P_{ij}(h(i))\,V_g(j) = \max_a\Big\{ R(i,a) + \alpha\sum_j P_{ij}(a)\,V_g(j) \Big\}. \tag{7.27} \]
Then
\[ V_h(i) \ge V_g(i), \quad\forall i, \]
and if \(V_h(i) = V_g(i)\), ∀i, then \(V_g = V_h = V\).
The policy improvement method starts with any policy g, and repeats (7.27) until no improvement is possible. Finiteness follows if the state and action spaces are finite.
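A sketch of the policy improvement method on the same toy MDP: each stationary policy g is evaluated exactly by solving the linear system \((I - \alpha P_g)V_g = R_g\), then improved via (7.27) until it is stable:

```python
import numpy as np

alpha = 0.9
R = np.array([[1.0, 0.0], [0.0, 2.0]])             # R[i, a]
P = np.array([[[0.8, 0.2], [0.1, 0.9]],            # P[a, i, j]
              [[0.5, 0.5], [0.3, 0.7]]])

def policy_iteration(R, P, alpha):
    n = R.shape[0]
    g = np.zeros(n, dtype=int)
    while True:
        P_g = P[g, np.arange(n), :]                # row i is P_{i.}(g(i))
        R_g = R[np.arange(n), g]
        V_g = np.linalg.solve(np.eye(n) - alpha * P_g, R_g)   # evaluate g
        h = (R + alpha * np.einsum('aij,j->ia', P, V_g)).argmax(axis=1)  # (7.27)
        if np.array_equal(h, g):
            return g, V_g                          # no improvement: optimal
        g = h

print(policy_iteration(R, P, alpha))
```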
for bounded functions u. Then (7.28) means u ≥ Tu, and therefore \(u \ge T^n u\) for all n. □
It follows that the optimal value function V is the unique solution of the problem
\[ \min \sum_i u(i) \quad\text{s.t.}\quad u(i) \ge \max_a\Big\{ R(i,a) + \alpha\sum_j P_{ij}(a)\,u(j) \Big\}, \quad\forall i, \]
or equivalently
\[ \min \sum_i u(i) \quad\text{s.t.}\quad u(i) \ge R(i,a) + \alpha\sum_j P_{ij}(a)\,u(j), \quad\forall i,\ \forall a \in A. \tag{7.29} \]
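The LP (7.29) can be handed to any solver. A sketch with scipy.optimize.linprog on the same toy MDP: each constraint is rewritten as \(-(I - \alpha P_a)u \le -R_a\), and the variables must be left unbounded:

```python
import numpy as np
from scipy.optimize import linprog

alpha = 0.9
R = np.array([[1.0, 0.0], [0.0, 2.0]])             # R[i, a]
P = np.array([[[0.8, 0.2], [0.1, 0.9]],            # P[a, i, j]
              [[0.5, 0.5], [0.3, 0.7]]])
n, m = R.shape

A_ub = np.vstack([-(np.eye(n) - alpha * P[a]) for a in range(m)])
b_ub = np.concatenate([-R[:, a] for a in range(m)])
res = linprog(c=np.ones(n), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * n)           # default bounds are x >= 0
print(res.x)                                       # the optimal value function V
```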
7.3. Negative DP
Assume a countable state space and a finite action space. "Negative" means here that the rewards are negative, i.e. the objective is to minimize costs. Let a cost C(i, a) ≥ 0 result from action a in state i, and for any policy π define
\[ V_\pi(i) = E_\pi\Big[ \sum_{n=0}^\infty C(X_n, a_n) : X_0 = i \Big], \qquad V(i) = \inf_\pi V_\pi(i), \]
and call a policy \(\pi^*\) optimal if
\[ V_{\pi^*}(i) = V(i), \quad\forall i. \]
The optimality principle here is
\[ V(i) = \min_a\Big\{ C(i,a) + \sum_j P_{ij}(a)\,V(j) \Big\}. \]
Then f is optimal.
Proof. From the optimality principle it follows that
\[ C(i,f(i)) + \sum_j P_{ij}(f(i))\,V(j) = V(i), \quad\forall i. \tag{7.30} \]
If V(j) is the cost of stopping in state j, then (7.30) expresses indifference between stopping and continuing one more stage. Repeating the argument,
\[ E_f\Big[ \sum_{t=0}^{n-1} C(X_t, a_t) : X_0 = i \Big] + E_f\big[ V(X_n) : X_0 = i \big] = V(i), \]
and, since costs are nonnegative,
\[ E_f\Big[ \sum_{t=0}^{n-1} C(X_t, a_t) : X_0 = i \Big] \le V(i), \]
[1] G.C. Alway, Crossing the desert, Math. Gazette 41(1957), 209
[2] V.I. Arnold, Mathematical Methods of Classical Mechanics (Second Edition, English
Translation), Springer, New York, 1989
[3] R.E. Bellman, On the theory of dynamic programming, Proc. Nat. Acad. Sci. 38(1952),
716–719
[4] R.E. Bellman, Dynamic Programming, Princeton University Press, 1957
[5] R.E. Bellman and S.E. Dreyfus, Applied Dynamic Programming, Princeton University
Press, 1962
[6] A. Ben-Israel and S.D. Flåm, Input optimization for infinite horizon discounted programs,
J. Optimiz. Th. Appl. 61(1989), 347–357
[7] A. Ben-Israel and W. Koepf, Making change with DERIVE: Different approaches, The
International DERIVE Journal 2(1995), 72–78
[8] D.P. Bertsekas, Dynamic Programming and Stochastic Control, Academic Press, 1976
[9] U. Brauer and W. Brauer, A new approach to the jeep problem, Bull. Euro. Assoc. for
Theoret. Comp. Sci. 38(1989), 145–154
[10] J.W. Craggs, Calculus of Variations, Problem Solvers # 9, George Allen & Unwin,
London, 1973
[11] A.K. Dewdney, Computer recreations, Scientific American, June 1987, 128–131
[12] S.E. Dreyfus and A.M. Law, The Art and Theory of Dynamic Programming, Academic
Press, 1977
[13] L. Elsgolts, Differential Equations and the Calculus of Variations (English Translation),
MIR, Moscow, 1970
[14] G.M. Ewing, Calculus of Variations with Applications, Norton, 1969 (reprinted by
Dover, 1985)
[15] N.J. Fine, The jeep problem, Amer. Math. Monthly 54(1947), 24–31
[16] C. Fox, An Introduction to the Calculus of Variations, Oxford University Press, 1950
(reprinted by Dover, 1987)
[17] J.N. Franklin, The range of a fleet of aircraft, J. Soc. Indust. Appl. Math. 8(1960),
541–548
[18] D. Gale, The jeep once more or jeeper by the dozen, Amer. Math. Monthly 77(1970),
493–501
[19] J. Gilbert and F. Mosteller, Recognizing the maximum of a sequence, J. Amer. Statist.
Assoc. 61(1966), 35–73
[20] H. Goldstein, Classical Mechanics, Addison-Wesley, Reading, 1950
[21] A. Hausrath, B. Jackson, J. Mitchem, E. Schmeichel, Gale’s round-trip jeep problem,
Amer. Math. Monthly 102(1995), 299–309
[22] J. Kelly, A new interpretation of information rate, Bell System Tech. J. 35(1956), 917–
926
[23] C. Lanczos, The Variational Principles of Mechanics (4th Edition), University of
Toronto Press, 1970. Reprinted by Dover, 1987.
[24] L.D. Landau and E.M. Lifshitz, Course of Theoretical Physics, Vol 1: Mechanics (Eng-
lish Translation), Pergamon Press, Oxford 1960
[25] C.G. Phipps, The jeep problem, Amer. Math. Monthly 54(1947), 458–462
[26] D.A. Pierre, Optimization Theory with Applications, Wiley, New York, 1969 (reprinted
by Dover, 1986)
[27] W. Ritz, Über eine neue Methode zur Lösung gewisser Variationsprobleme der mathe-
matischen Physik, J. Reine Angew. Math. 135(1908)
[28] S.M. Ross, Introduction to Stochastic Dynamic Programming, Academic Press, New
York, 1983
[29] M. Sniedovich, Dynamic Programming, M. Dekker, 1992
[30] H. Taylor, Evaluating a call option and optimal timing strategy in the stock market,
Management Sci. 12(1967), 111–120
[31] D.M. Topkis, Minimizing a submodular function on a lattice, Operations Res. 26(1978),
305–321
[32] R. Weinstock, Calculus of Variations with Applications to Physics & Engineering, Mc-
Graw Hill, New York, 1952 (reprinted by Dover, 1974)
[33] H.S. Wilf, Mathematics for the Physical Sciences, Wiley, New York, 1962 (reprinted by
Dover, 1978)