Dynamic Programming and Markov Processes
Dynamic Programming and Markov Processes
Laval
js
ae
beainten
Ge) |
eof
, .
|
ry i\e
wd then! ;
haa!
Wt
bt a
uy : :
et Ll
Cn) co
ko | .
re
)
¢l
i OM
—
=
==
’
=
igs Bree
RaNeae) ae
£290)
x een
‘
Mi
7
iy Ott , :
ae ‘ae
Le
i” oe.)
‘eee
Digitized by the Internet Archive
in 2022 with funding from
Kahle/Austin Foundation
https://archive.org/details/dynamicprogrammi0000unse_x8u5
Dynamic Programming
and Markov Processes
Dynamic Programming
and Markov Processes
RONALD A. HOWARD
McConneil Library
Radford University
Copyright © 1960
by
The Massachusetts Institute of Technology
101502
Preface
PREFACE
INTRODUCTION
CHAPTER 1 Markov Processes
The Toymaker Example—State Probabilities
The z-Transformation
z-Transform Analysis of Markov Processes wont
<
We
> Pu =1
j=l
where the probability that the system will remain in 7, pu, has been
included. Since the pi are probabilities,
0<
py <1
D, mit) = 1 (1.1)
x(1) = n(0)P = [1 02
enleo
bole
Ea)
and
m1) = (3 2]
After one week, the toymaker is equally likely to be successful or un-
successful. After two weeks,
and
(2) =[g0 20
so that the toymaker is slightly more likely to be unsuccessful.
6 MARKOV PROCESSES
After three weeks, (3) = 7(2)P = [3% 200], and the probability
of occupying each state is little changed from the values after two
weeks. Note that since
89 Jil
P3 a ey
jel gle
250 250
(3) could have been obtained directly from (3) = 7(0)P°.
An interesting tendency appears if we calculate 7() as a function of
nm as shown in Table 1.1.
or absolute state probabilities. It follows from Eq. 1.3 that the vector
m™ must obey the equation
Tw = TP (1.5)
and, of course, the sum of the components of m must be 1.
S ti; = (1.6)
We may use Eqs. 1.5 and 1.6 to find the limiting state probabilities
for any process. For the toymaker example, Eq. 1.5 yields
T1 = 971 + 572
2
5
3
m2 = $71 + 572
whereas Eq. 1.6 becomes 71 + m2 = 1.
The three equations for the two unknowns 7 and z2 have the unique
solution =; = #, mz = 3. These are, of course, the same values for
the limiting state probabilities that we inferred from our tables of 7;(n).
In many applications the limiting state probabilities are the only
quantities of interest. It may be sufficient to know, for example, that
our toymaker is fortunate enough to have a successful toy $ of the time
and is unfortunate 3 of the time. The difficulty involved in finding
the limiting state probabilities is precisely that of solving a set of N
linear simultaneous equations. We must remember, however, that
the quantities 7; are a sufficient description of the process only if enough
transitions have occurred for the memory of starting position to be
lost. In the following section, we shall gain more insight into the
behavior of the process during the transient period when the state
probabilities are approaching their limiting values.
The z-Transformation
fQ)
f(O)
f(2)
f(3)
0 1 2 3 n
Fig. 1.1. An arbitrary discrete-time function.
The relationship between /(m) and its transform f(z) is unique; each
time function has only one transform, and the inverse transformation
of the transform will produce once more the original time function.
The z-transformation is useful in Markov processes because the prob-
ability transients in Markov processes are geometric sequences. The
z-transform provides us with a closed-form expression for such sequences.
Let us find the z-transforms of the typical time functions that we
shall soon encounter. Consider first the step function
1 nm = 0; 1, 2,3, -+:
I (n)=
0 n< 0
The z-transform is
fe) = Df alt rt
n=0
ett. orf) = ar
f(z)
Z)= aI Ae
) Pages n or f(z) = ee
Note that if
then
n=0
and
non — d ee d 1 az
-
Zale Oe hd ass 7a; ” =) \ (soz)
Thus we have obtained as a derived result that, if the time function we
are dealing with is /(m) = na”, its z-transform is f(z) = az/(1 — az)2.
z-TRANSFORM ANALYSIS OF MARKOV PROCESSES 9
From these and other easily derived results, we may compile the table
of z-transforms shown as Table 1.3. In particular, note that, if a time
function /(7) with transform /(z) is shifted to the right one unit so as
to become /(z + 1), then the transform of the shifted function is
The reader should become familiar with the results of Table 1.3
because they will be used extensively in examples and proofs.
f(n) £ (2)
film) + fo(n) fi(z) + f£2(2)
kf (n) (Rk is a constant) kf (z)
F(a 1) zf(z)
f(n + 1) zf(z) — f(0)]
an :
1 — az
rf]+ 3
For this case
5 5
so that
=P) 1 — tz
2 —tz2
( a) |—2z 1 — |
and
5 des Sieh dz a
ere Ch 2)(Lo rgzy eT is)
— 2 -i =
$2 ee
(1 — 2)(1 — Yoz) (= 2)(E— Yo2)
Each element of (I — zP)~! is a function of z with a factorable de-
nominator (1 — z)(1 — jz). By partial-fraction expansion? we can
express each element as the sum of two terms: one with denominator
1 — z and one with denominator 1 — jz. The (I — zP)-! matrix
now becomes
: ae i Peek! 9
To gain further insight into the Markov process, let us use the
z-transform approach to analyze processes that exhibit typical behavior
patterns. In the toymaker’s problem, both states had a finite proba-
bility of occupancy after a large number of transitions. It is possible
even in a completely ergodic process for some of the states to have a
limiting state probability of zero. Such states are called transient
states because we are certain that they will not be occupied after a
long time. A two-state problem with a transient state is described by
p_ fi i
alone
with transition diagram
0 1 — 3z
TRANSIENT, MULTICHAIN, AND PERIODIC BEHAVIOR 13
Pe |
a aaa —Teoh ee
aes Talo | + alo 0
Thus
H(7)
fo f+
If the system is started in state 1 so that 7(0) =[1
ol 0], then m1(n) =
(2)", m2(”) = 1 — ()". If the system is started in state 2 with
x(0) = (0 1), then naturally mi(m) = 0, mo(m) = 1. In either case
we see that the limiting state probability of state 1 is zero, so our
assertion that it is a transient state is correct. Of course the limiting
state probabilities could have been determined from Eqs. 1.5 and 1.6
in the manner described earlier.
A transient state need not lead the system into a trapping state.
The system may leave a transient state and enter a set of states that
are connected by possible transitions in such a way that the system
makes jumps within this set of states indefinitely but never jumps
outside the set. Such a set of states is called a recurrent chain of the
Markov process; every Markov process must have at least one recurrent
chain. A Markov process that has only one recurrent chain must be
completely ergodic because no matter where the processis started it
will end up making jumps among the members of the recurrent chain.
However, if a process has two or more recurrent chains, then the
completely ergodic property no longer holds, for if the system is started
in a state of one chain then it will continue to make transitions within
that chain but never to a state of another chain. In this sense, each
recurrent chain is a generalized trapping state; once it is entered, it can
never be left. We may now think of a transient state as a state that
the system occupies before it becomes committed to one of the recurrent
chains.
The possibility of many recurrent chains forces us to revise our think-
ing concerning S, the steady-state component of H(m). Since the
limiting state probability distribution is now dependent on how the
system is started, the rows of the stochastic matrix S are no longer equal.
Rather, the 7th row of S represents the limiting state probability distri-
bution that would exist if the system were started in the 7th state. The
ith row of the T(m) matrix is as before the set of transient components
of the state probability if 7 is the starting state.
Let us investigate a very simple three-state process with two recurrent
chains described by
tT HS
Pri) Ota 0
1S
3-6
yee
«4S
14 MARKOV PROCESSES
i
3
State 1 constitutes one recurrent chain; state 2 the other. Both are
trapping states, but the general behavior would be unchanged if each
were a collection of connected states. State 3 is a transient state that
may lead the system to either of the recurrent chains. To find H(n)
for this process, we first find
pasty 0 0
(I-2P)=] 0 ec, 0
—2z —k 1 — iz
and
1 — z)(1 — tz
(eres ? 9
1 — z)(1 — +z
So ait aie : ee :
(1 — z)dz (1 — z)4z (haz)?
(1. = -2)7(L = $2) (1 = 2)2( tee ee) (esa eee
Thus
ere UO i Gi. A0zn6
waa 1 jcae | Ghyll j
$5 San ; alte treat oe
and
100 Tete Yue
Ho = | 1 j| Oe eg j
2 2 0 mat Jet
= S + T(n)
If the system is started in state 1, mi(m) = 1, me(n) = x3(n) = 0.
TRANSIENT, MULTICHAIN, AND PERIODIC BEHAVIOR 15
aC
1
a - 2) = | : ma
and
16 MARKOV PROCESSES
This H(n) does represent the solution to the problem because, for
example, if the system is started in state 1, mi(m) = 4{1 + (—1)"] and
m2(n) = 4[1 — (—1)”]. These expressions produce the same results that
we saw intuitively. However, what is to be the interpretation placed on
Sand T(n) inthis problem? The matrix T(m) contains components that
do not die away for larger , but rather continue to oscillate indefinitely.
On the other hand, T(m) can still be considered as a perturbation to the
set of limiting state probabilities defined by S. The best interpretation
of the limiting state probabilities of S is that they represent the proba-
bility that the system will be found in each of its states at a tome
chosen at random in the future. For periodic processes, the original
concept of limiting state probabilities is not relevant since we know
the state of the system at all future times. However, in many practical
cases, the random-time interpretation introduced above is meaningful
and useful. Whenever we consider the limiting state probabilities of a
periodic Markov process, we shall use them in this sense. Incidentally,
if Eqs. 1.5 and 1.6 are used to find the limiting state probabilities, they
yield x1 = ze = 4, in agreement with our understanding.
We have now investigated the behavior of Markov processes using the
mechanism of the z-transform. This particular approach is useful
because it circumvents the difficulties that arise because of multiple
characteristic values of stochastic matrices. Many otherwise elegant
discussions of Markov processes based on matrix theory are markedly
complicated by this difficulty. The structure of the transform method
can be even more appreciated if use is made of the work that has been
done on signal-flow-graph models of Markov processes, but this is
beyond our present scope; references 3 and 4 may be useful.
The following chapter will begin the analysis of Markov processes
that have economic rewards associated with state transitions.
Markov Processes with Rewards
One question we might ask concerning this game is: What will be
the player’s expected winnings in the next ” jumps if the frog is now in
state 7 (sitting on the lily pad numbered 7)? To answer this question,
let us define v;(m) as the expected total earnings in the next m transitions
if the system is now in state 7.
ed
18 MARKOV PROCESSES WITH REWARDS
[s —
R =
9 3
Recalling that
oe Os
a A
we can find q from Eq. 2.3:
HL
Inspection of the q vector shows that if the toymaker has a successful
toy he expects to make 6 units in the following week; if he has no
successful toy, the expected loss for the next week is 3 units.
Suppose that the toymaker knows that he is going to go out of busi-
ness after » weeks. He is interested in determining the amount of
money he may expect to make in that time, depending on whether or
not he now has a successful toy. The recurrence relations Eq. 2.4
or Eq. 2.5 may be directly applied to this problem, but a set of
boundary values v;(0) must be specified. These quantities represent
the expected return the toymaker will receive on the day he ceases
operation. If the business is sold to another party, v1(0), would be the
purchase price if the firm had a successful toy on the selling date, and
ve(0) would be the purchase price if the business were not so situated
onthat day. Arbitrarily, for computational convenience, the boundary
values v;(0) will be set equal to zero in our example.
We may now use Eq. 2.4 to prepare Table 2.1 that shows v;(m) for
each state and for several values of n.
Thus, if the toymaker is four weeks from his shutdown time,he expects
to make 9.555 units in the remaining time if he now has a successful
toy and to lose 0.444 unit if he does not have one. Note that vi(m) —
v(m) seems to be approaching 10 as » becomes large, whereas both
vi(m) — vi(m — 1) and ve(m) — ve(m — 1) seem to approach the value
1 for large m. In other words, when x is large, having a successful
11 Points for
v, (n)
10 :
77 \_Asymptote of v, (n)
slope = 1
units
in
Earnings
monetary
Asymptote of vo (n)
slope = 1 bg
-2 Cs
Vi
me
-3 Points for
Ug (n)
0 1 2 3 4 5 6
n (weeks remaining)
Fig. 2.1. Toymaker’s problem; total expected reward in each state as a function
of weeks remaining.
be called v(z) where v(z) = > v(m)z". Equation 2.5 may be written
n=0
as
v(m + 1) = q + Pv(n) n= 0, 1,2,--- (2.6)
Finding the transform v(z) requires the inverse of the matrix (I — zP),
which also appeared in the solution for the state probabilities. This is
not surprising since the presence of rewards does not affect the proba-
bilistic structure of the process.
For the toymaker’s problem, v(0) is identically zero, so that Eq. 2.7
reduces to
For the toymaker process, the inverse matrix (I — zP)~1 was previously
found to be
22 MARKOV PROCESSES WITH REWARDS
4 3 ; 10 = et.O) See i
—
esCosrgeress lech
era +
R
Let the matrix F(m) be the inverse transform of [z/(1 — z)](I — 2P)7}.
Then
vi(n) = n + 32
von) =n — 49lo
are the equations of the asymptotes shown in Fig. 2.1. Note that, for
large n, both v1(m) and v2(m) have slope 1 and vi(m) — ve(n) = 10, as
we saw previously. For large n, the slope of v1(m) or v(m) is the average
reward per transition, in this case 1. If the toymaker were many,
many weeks from shutdown, he would expect to make 1 unit of return
per week. We call the average reward per transition the ‘‘gain”’; in
this case the gain is 1 unit.
Asymptotic Behavior
What can be said in general about the total expected earnings of a
ASYMPTOTIC BEHAVIOR 23
process of long duration? To answer this question, let us return to
Hqr2ak
z
U(2) = (I — zP)—1q + (I — zP)-!v(0) (2)
te
gi = > $4944
j=1
Qe TAG i (2.14)
(tpt ee i i]+i
T—al¢ 8] "T—wel-$ 8 of
4
so that
Since
The discussion of Markov processes with rewards has been the means
to anend. This end is the analysis of decisions in sequential processes
that are Markovian in nature. This chapter will describe the type of
process under consideration and will show a method of solution based on
recurrence relations.
Introduction of Alternatives
i=1 J=1
i=20 j=2
i=30 j=3
| |
! 1
1 1
i=NO Oj=N
Fig. 3.1. Diagram of states and alternatives.
Suppose that the toymaker has weeks remaining before his business
will close down. We shall call » the number of stages remaining in the
process. The toymaker would like to know as a function of ” and his
present state what alternative he should use for the next transition
(week) in order to maximize the total earnings of his business over the
n-week period.
We shall define d;(m) as the number of the alternative in the 7th state
that will be used at stage n. We call di(m) the ‘“‘decision”’ in state 7 at
the mth stage. When d;(m) has been specified for all 7 and all n, a
‘‘policy’’ has been determined. The optimal policy is the one that
maximizes total expected return for each 7 and n.
To analyze this problem, let us redefine v(m) as the total expected
TOYMAKER'S PROBLEM SOLVED BY VALUE ITERATION 29
return in # stages starting from state i if an optimal policy is followed.
It follows that for any 1
N
The method that has just been described for the solution of the
sequential process may be called the value-iteration method because the
vi(m) or “‘values’’ are determined iteratively. This method has some
important limitations. It must be clear to the reader that not many
enterprises or processes operate with the specter of termination so
imminent. For the most part, systems operate on an indefinite basis
with no clearly defined end point. It does not seem efficient to have
to iterate v;(m) for m = 1, 2, 3, and so forth, until we have a sufficiently
large m that termination is very remote. We would much rather have
a method that directed itself to the problem of analyzing processes of
indefinite duration, processes that will make many transitions before
termination.
Such a technique has been developed; it will be presented in the next
chapter. Recall that, even if we were patient enough to solve the
EVALUATION OF THE VALUE-ITERATION APPROACH 3]
g= > mg (2.14)
t=1
—— k Alternatives
. jSucceeding
Byes Pyrhe Prhg | state
i
i Present state
Poy Toy!
J}
vee
Poy" LrToe
Pools
ee
ee
i=" I
w
WRNN
An optimal policy is defined as a policy that maximizes the gain,
or average return per transition.* In the five-state problem dia-
grammed in Fig. 4.1, there are 4 x 3 x 2 x 1 x 5 = 120 different
* We shall assume for the moment that all policies produce completely ergodic
Markov processes. This assumption will be relaxed in Chapter 6.
34 THE POLICY-ITERATION METHOD
N N
or
> 7404(0)
vi(m) = ng + V4 ve(n) = ng + ve
A relevant question at this point is this: If we are seeking only the gain
of the given policy, why did we not use Eq. 2.14 rather than Eq. 4.1?
As a matter of fact, why are we bothering to find such things as relative
values at all? The answer is first, that although Eq. 2.14 does find
the gain of the process it does not inform us about how to find a better
policy. We shall see that the relative values hold the key to finding
better and better policies and ultimately the best policy.
A second part of the answer is that the amount of computational
effort required to solve Eqs. 4.1 for the gain and relative values is about
the same as that required to find the limiting state probabilities using
Eqs. 1.5 and 1.6, because both computations require the solution of N
linear simultaneous equations. From the point of view of finding the
gain, Eqs. 2.14 and 4.1 are a standoff; however, Eqs. 4.1 are to be pre-
ferred because they yield the relative values that will be shown to be
necessary for policy improvement.
From the point of view of computation, it is interesting to note that
we have considerable freedom in scaling our rewards because of the
linearity of Eqs. 4.1. If the rewards 7; of a process with gain g and
relative values 1; are modified by a linear transformation to yield new
rewards 7;;' in the sense 74’ = arij + 6, then since
N
gi = > Pures
j=1
the new expected immediate rewards q;’ will be gi’ = agi + b,so that the
qi are subjected to the same transformation. Equations 4.1 become
edt oy? = ets
EM cs ola + > pus 4: AID nt WN
j=l
Or
Bt ul = qi + D pyoy
j=1
POLICY-IMPROVEMENT ROUTINE Be
gk + > pik
(ng + 25) (4.3)
j=
as the test quantity to be maximized in each state. Since
N
> put = 1
j=1
using the relative values determined under the old policy. This
alternative k now becomes dj, the decision in the 7th state. A new
policy has been determined when this procedure has been performed
for every state.
We have now, by somewhat heuristic means, described a method for
finding a policy that is an improvement over our original policy. We
shall soon prove that the new policy will have a higher gain than the
old policy. First, however, we shall show how the value-determination
operation and the policy-improvement routine are combined in an
iteration cycle whose goal is the discovery of the policy that has highest
gain among all possible policies.
Policy-Improvement Routine
For each state 2, find the alternative k’ that maximizes
N
qk + > piyko;
1
using the relative values vj of the previous policy. Then k’
becomes the new decision in the ith state, q;*’ becomes q;, and
piy®’ becomes pi;.
Am i] onReale aaa
We are now ready to begin the value-determination operation that
will evaluate our initial policy. From Eqs. 4.1,
gt+v1 = 6 + 0.501 + 0.5ve2 gt+ve= —3 + 0.401 + 0.6v2
Setting ve = 0 and solving these equations, we obtain
ol v1 = 10 ve = 0
(Recall that by use of a different method the gain of 1 was obtained
earlier for this policy.) We are now ready to enter the policy-improve-
ment routine as shown in Table 4.1.
1 1 6 + 0.5(10).4-,0.5(0) == 141
Z 4 + 0.8(10) + 0.2(0) = 12<
qk + > pighoy
j=l
than does the first alternative. Thus the policy composed of the second
alternative in each state will have a higher gain than our original policy.
THE TOYMAKER’S PROBLEM 4]
However, we must continue our procedure because we are not yet sure
that the new policy is the best we can find. For this policy,
Pri Ps
tetraacetic
Equations 4.1 for this case become
gt v1 = 4 + 0.8u1 + 0.2v2
& + ve = —5 + 0.701 + 0.3v2
With v2 = 0, the results of the value-determination operation are
g=2 vi = 10 Ve = 0
Equations 4.11 are identical in form to Eqs. 4.1 except that they are
written in terms of differences rather than in terms of absolute quanti-
ties. Just as the solution for g obtained from Eqs. 4.1 is
N
= 2,Tigi
gr = > mv (4.12)
t=1
An Example—Taxicab Operation
Consider the problem of a taxicab driver whose territory encompasses
three towns, A, B, and C. If he is in town A, he has three alterna-
tives:
1 1 4+ Ff 7 LO eas 8
2 Teak tty ee 2.75
3 do 4 oS 4 6 4 4.25
2 1 4. 07 4 14 O 18 16
2 a i od 8 16 8 15
2 1 + +f $4 LOZ ns 7
2 + #2 4 6) 45 2 4
3 i ve ie 4 0 8 4.5
d= }1
1
> py
ge+u=at 1=1,2,---,N
j=l
46 EXAMPLES OF THE METHOD
qt + > puto;
j=1
for all 7 and &, as shown in Table 5.2.
1 1 10.53<
8.43
3 DOD
2 1 16.67
2 21.62<—
3 1 9.20
2 9.77<—
3 Se)
ad =Fr2
2
This means that if the driver is in town A he should cruise; if he is
in town B or C, he should drive to the nearest stand.
We have now
1 1 i‘
(ae es 8
P=
ot eke
|¥¢ Se
es q= ]15
1. ee |
Stain Be 4
TAXICAB OPERATION 47
Returning to the value-determination operation, we solve the equations
gtv= 84+ dvi + dve
+ dvs
& + ve = 15 + evr + gue + Vers
&E+tvs
= 44 fv. + Bue
+ fv
Again with v3 = 0, we obtain
1 1 9.27
72 12.14<
3 4.89
2 i 14.06
2 26.00 <
3 1 9.24
2 SO
3 Zoo
di= |2
Z
The driver should proceed to the nearest stand, regardless of the town
in which he finds himself.
With this policy
a i vs 2.75
ae
sae ale |
Ae Geel 4
o. £5 238
1 1 10.58
2 217 =
3 5.54
D 1 15.41
2 24.42<—
3 1 9.87
2 13.34<
3 4.41
The new policy is
2
d= |Z
Z
but this is equal to the previous policy, so that the process has con-
verged, and g has attained its maximum, namely, 13.34. The cab
driver should drive to the nearest stand in any city. Following this
policy will yield a return of 13.34 units per trip on the average, almost
half as much again as the policy of always cruising found by maxi-
mizing expected immediate reward. The calculations are summarized
in Table 5.5.
Table 5.5. SUMMARY OF TAXICAB PROBLEM SOLUTION
vy 0 1.33 — 3.88 BesWE)
v9 0 7.47 12.85 12.66
v3 0 0 0 0
g =H 9.20 13.15 13.34
oNig Gp Seg ALPS Aeas
dy 1 1 2 2
do 1 2 2 2. STOP
ds 1 2 2 2
Pp indicates that this step takes place in the policy-improvement routine.
Vv indicates that this step takes place in the value-determination operation.
AVBASEBALL) PROBEBM 49
to a policy B described by
The quantities y; defined by Eq. 4.6 may be obtained from Table 5.3.
They are the differences between the test quantities for each policy.
We find yi = 12.14 — 9.27 = 2.87, whereas y2 = y3 = 0 because the
decisions in states 2 and 3 are the same for both policies A and B.
Application of Eqs. 1.5 and 1.6 to the transition-probability matrix
for policy B yields the limiting state probabilities:
x1 = 0.0672 ma, = 0.8571 mg = 0.0757
From Eq. 4.12 we then have that
g& = (0.0672)(2.87) = 0.19
The change of policy from A to B should thus have produced an in-
crease in gain of 0.19 unit. Since g4 = 13.15 and g® = 13.34, our
prediction is correct.
A Baseball Problem
Single 0.15 1 ”) g H
Double 0.07 fe 3 H H
Triple 0.05 3 H H H
Home run 0.03 H H la! H
Base on balls 0.10 1 2 3 (if forced) H (if forced)
Strike out 0.30 Out 1 2 3
Fly out 0.10 Out 1 2, H (if less than
2 outs)
Ground out 0.10 Out 2 So H (if less than
2 outs)
Double play 0.10 Out The player nearest first is out.
0000 1 0.03 3
0100 5 0.05 2
0110 7 0.07 1
0111 8 0.25 0 gat = 0.26
1011 12 0.40 0
1110 15 0.10 0
2010 19 0.10 0
0111 8 0.05 0
1011 12 0.30 0 ee
1110 15" 0,60 0 ee
2010 19 0.05 0
Number
of Alter- Initial
Alternative Alternative Alternative Alternative natives Policy
State Description 1 2 3 4 in State dj if v4
Bases k=1 k=2 k=3 k=4 1 set = 0
+ Outs 3 2 1 ql qe qe qt
1 0 0 0 O - 0.03 Hit _ — — 1 1
2 0 OFF Ores Oxi rat 0 Bunt 0 Steal 2 — 3 1
3 0 OQ. OT 40.18 eit 0 Bunt 0 Steal 3 — 3 1
4 0 QO dlr 026 Hit 0 Bunt 0 Steal 3 — 3 1
5 0 1 (OO OT OS38i tit 0.65 Bunt 0.20 Steal H — g 2
6 0 (O05 53. 0:6) Birt 0.65 Bunt 0 Steal2 0.20 Steal H 4 ?)
Zz 0 LE i 0" O68 Eve 0.65 Bunt 0.20 Steal H — 3 1
8 0 tet et 10:86 Hit 0.65 Bunt 0.20 Steal H — 3 1
9 1 O- ‘O91 0.03: Fit - — Hs 1 1
10 1 Or Ors (OC Rit Q Bunt 0 Steal 2 — 3 1
11 1 Oo + 0,018 hit 0 Bunt 0 Steal 3 — 3 1
12 1 Q iP? 0.26 Hat 0 Bunt 0 Steal 3 — 3 1
13 1 i “O3n@ (0:53 He 0.65 Bunt 0.20 Steal H — 3 2
14 1 be Ome te f O.6h Ft 0.65 Bunt 0 Steal2 0.20 Steal H 4 2
15 1 1 1 O- 0.68 Hit 0.65 Bunt 0.20 Steal H — 3 1
16 1 bot (O86 Hie 0.65 Bunt 0.20 Steal H — 3 1
17 2 GO” .O2530. . 0:03 Ee — — — 1 1
18 2 De OWE “Oa hic 0 Bunt 0 Steal 2 — 3 1
19 2 0 > <2.) >-0:18Ent 0° Bunty 0 Steal 3 = 3 1
20 2 QO 2° , 026 Ne 0 Bunt. (0 Steal 3 — 3 1
21 Z S LOO > OSs Fhe 0.05 Bunt 0.20 Steal H — 3 1
22 2 fy Oi ads >) 0:47 Fist 0.05 Bunt 0 Steal 2 0.20 Steal H 4 1
23 2 1, f {0 “O48 But 0.05 Bunt 0.20 Steal H — 3 1
24 2 ft) Ste? 0:66-bire 0.05 Bunt 0.20 Steal H — 3 1
25 3 — — — 0 Trapped — — — 1 1
starts each inning in state 1, or “‘no outs, no men on,” then v1 may be
interpreted as the expected number of runs per inning under the given
policy. The initial policy yields 0.75 for v1, whereas the optimal
policy yields 0.81. In other words, the team will earn about 0.06
more runs per inning on the average if it uses the optimal policy
rather than the policy that maximizes expected immediate reward.
Note that under both policies the gain was zero as expected, since
after an infinite number of moves the system will be in state 25 and will
always make reward 0. Note also that, in spite of the fact that the
gain could not be increased, the policy-improvement routine yielded
values for the optimal policy that are all greater than or equal to those
for the initial policy. The appendix shows that the policy-improvement
routine will maximize values if it is impossible to increase gain.
The values v; can be used in comparing the usefulness of states. For
example, under either policy the manager would rather be in a position
with two men out and bases loaded than be starting a new inning
(compare v24 with v1). However, he would rather start a new inning
than have two men out and men on second and third (compare ve
with vi). Many other interesting comparisons can be made. Under
the optimal policy, having no men out and a player on first is just about
as valuable a position as having one man out and players on first and
54 EXAMPLES OF THE METHOD
pi air 1
pyk# =<1—pi j = 40 for
k =1
0 other 7
Pr-2 g=zk—1
pij® = 41 — pro 7 = 40 for k >1
0 other 7
56 EXAMPLES OF THE METHOD
The actual data used in the problem are listed in Table 5.10 and
graphed in Figure 5.1. The discontinuities in the cost and trade-in
functions were introduced in order to characterize typical model-year
effects.
1.00
Survival 0.90
1600 + : — g-- Probability 0.80
Man 0.70
w 1200 ; ee 0.60 2
0.50 8
iss}
= Cost C;
© 800 Pew 0.40 €
Trade-in
value 7. eee
16 24
* Of course, chaos for the automobile industry would result if everyone followed
this policy. Where would the 3-year-old cars come from? Economic forces
would increase the price of such cars to a point where the 3 to 6% policy is
no longer optimal. The preceding analysis must assume that there are enough
people in the market buying cars for psychological reasons that so-called
“‘rational’’ buyers are a negligible influence.
‘sanyea / uoreIE}] ay} JO Yora 0} ‘19 deios v yo onTeA oy} ‘O8$ Surppe Aq poznduroo st onyea paysn{pe oyL ‘sIe][Op
ore “re9 yuesaid ayy deoxy suvow y ve ‘sported ut o8e }eY} JO Ivo & IOF Ope} SUROU UUINTOS UOTSIOep dy} UT Joquinu VW
possoidxo
sonte,
pue
ut
sures
08 0 (a! 0 (a! 0 Ai 0 cl 0 61 0 0¢ 0 MI Ov
L8 je. CL if, cl L CT L cl L 61 L 0¢ St Ds| 6£
S6 st ca! St CL ST Ai ST cl St 61 ST 0¢ cS Dsi 8€
sol Sc 6)! Sc cL Sc cl Sc cL Sc 61 Sc 0@ ss 4 Le
OL 0€ CL 0€ cL 0€ a! O€ ai O£ 61 0€ 0c +8 BE 9€
SII ce (a! Se a! Sc (a! S€ cl Se 61 Se 0c IIL Ds| Oe
OcT OF cl Ov Gl Ov cL OF cl OF 61 OF 0c Ort Ds| ve
O€T Os (ai Os cl os (a! Os Gi Os 61 OS 0¢ OL BE £e
Sel ss a! ss cr Ss (ai ss a! ss 61 ss 0c 817 BE ce
OFT 09 cL 09 Gs 09 cL 09 cl 09 61 09 07 19¢ Be le
Stl $9 Gs $9 as $9 cL $9 a! $9 61 s9 07 90€ BI 0€
OST OL cL OL cL OL cL OL cl OL 61 OL 0@ 9sé a 6¢
O9T 08 CL 08 cL 08 cL 08 cL 08 61 08 07 Clh xo 87
METHOD
OLT 06 cl 06 a! 06 CL 06 cl 06 61 06 0c 69+ a L7
O8T OOT (a! OOT cL L6 Ds| OOT a! OOT 61 OOT 0c oes wt 9¢
161 IIL DE OLT cL 601 DE OIL a! OIL 61 Ol 0c c6S BE sé
907 9C1 Mt SCL Ds tcl w als Gs 9EL I 0cI 0c 859 xo 144
vCT trl ae trl ou a De| O€T a! 991 My O€t 07 8cL Ds| £C
947 991 Bi SOL M b9l w StI a! L61 i Stl 0c 108 a4 (Zé
697 681 w 681 DE 881 DsI 091 al 6¢7 w £1? a 9L8 a 17
THE
300
nmonfo)
hoio)fo)
oO
a oO
(o) a
(dollars)
Cost
period
per
50
0 i 2 3 4 5 6 7 8
Iteration number
that had two recurrent chains. Suppose that the process had an
1
expected immediate reward vector q = P| expressed in dollars. The
3
matrix of limiting-state probability vectors was found in Chapter 1
to be
Ue 0
sf 1 j
+ 4 0
VALUE-DETERMINATION OPERATION 61
1
The gain vector g = Sq = |a |and we interpret g as follows: If the
iho
process were started in state 1, it would earn $1.00 per transition. A
start in state 2 would earn $2.00 per transition. Finally, since the
system is equally likely to enter state 1 or state 2 after many transitions
if it is started in state 3, such a starting position is expected to earn
$1.50 per transition on the average. The averaging involved is per-
formed over several independent trials starting in state 3, because in
any given trial either $1.00 or $2.00 per transition will be ultimately
earned.
The gain of the system thus depends upon the state in which it is
started. A start in state 7 produces a gain gi, so that we may think of
the gain as being a function of the state as well as of the process. Our
new task is to find the policy for the system that will maximize the
gain of all states of the system. We are fortunate that the policy-
iteration method of Chapter 4 can be extended to the case of multiple-
gain processes. We shall now proceed to this extension.
and
N
setu=aut
> py 1=1,2,---,N (6.4)
j=1
Zitvu=1+ 0) g2 + vg = 2 + ve
g3 + vg = 3 + 4u1 + dug + 4u3
Uv1= 0 v2 = 0 v3 = UADES
are the gains and relative values for each state of the process. The
gains are of course the same as those obtained earlier.
Diskg;
j=l
ll
64 MULTIPLE-CHAIN PROCESSES
Policy Evaluation
Use pij and q for a given policy to solve the double set of
equations
N
& = >, Pugs
j=1
N
Policy Improvement
> Pukey
vat
using the gains of the previous policy, and make it the new
decision in the 7th state.
If
N
> putes
Ly
is the same for all alternatives, or if several alternatives are
equally good according to this test, the decision must be made
on the basis of relative values rather than gains. Therefore,
if the gain test fails, break the tie by determining the alter-
native k that maximizes
N
gk + > pikoy
j=l
Fig. 6.1. General iteration cycle for discrete sequential decision processes.
MULTICHAIN EXAMPLE 65
the gain test quantity, using the gains of the old policy. However,
when all alternatives have the same value of
N
> dikes
j=1
or when a group of alternatives have the same maximum value of the
gain test quantity, the tie is broken by choosing the alternative that
maximizes the value test quantity,
N
qt + > pistes
imi
by using the relative values of the old policy. The relative values may
be used for the value test because, as we shall see, the test is not affected
by a constant added to the v; of all states in a recurrent chain.
The general iteration cycle is shown in Fig. 6.1. Note that it reduces
to our iteration cycle of Fig. 4.2 for completely ergodic processes. An
example with more than one chain will now be discussed, followed by
the relevant proofs of optimality.
A Multichain Example
Let us find the optimal policy for the three-state system whose
transition probabilities and rewards are shown in Table 6.1. The
transition probabilities are all 1 or 0, first for ease of calculation and
second to show that no difficulties are introduced by such a structure.
This system has the possibility of multiple-chain policies.
2 1 1 0 0 6
2 0 i 0 4
3 0 0 1 5
3 1 1 0 0 8
2 0 1 0 9
3 0 0 1 7
66 MULTIPLE-CHAIN PROCESSES
Let us begin with the policy that maximizes expected immediate
reward. This policy is composed of the third alternative in the first
state, the first alternative in the second state, and the second alternative
in the third state. For this policy
3 Onn t 3
daa Diayiecorne ae
2 Onalion 9
We are now ready to enter the policy-evaluation part of the iteration
cycle. Equations 6.3 yield
Caen e3 aa ou Bee
These results show that there is only one recurrent chain and that
all three states are members of it. If we call its gain g, then
£1 = £2 = g3 = g; the relative value vg is arbitrarily set equal to zero.
If we use these results in writing Eqs. 6.4, the following equations are
obtained:
f+ =3 gtvu=6+ 01 g=9
+ v2
g1 = 6 ge = 6 g3
=6
and
v1 = —3 ve = —3 v3 = 0
2 1 6 6+.(=—3),=) 3
2 6 43) sl
3 6 Bi Oi" “SS
3 1 6 Sital=3) ay 5
2 6 9-4: (—3). =" 6
3 6 Fee (Ol ee
MULTICHAIN EXAMPLE 67
Since the gain test produced ties in all cases, the value test was
necessary. The new policy is
3 001 3
d= |3 P=/0 0 1 q= |5
3 0 OP 7
This policy must now be evaluated. Equations 6.3 yield
§1 = &3 §2 = §3 &3
= §3
We may let gi = ge = g3 = g, Set vg = 0, and use Eqs. 6.4 to obtain
£+01 = 3 gtvu.=5 fue 7]
The solution is g = 7, v1 = —4, ve = —2, and so
1 1 7, —3
2 7, 0
3 7 =
2 1 qi 2
2 a 2
3 7 Se
3 1 7 4
2 7 7
3 7 <
Since once more the gain test was indeterminate, it was necessary to
rely on the relative-value comparison. In state 3, alternatives 2 and
3 are tied in the value test. However, because alternative 3 was our
old decision, it will remain as our new decision. We have thus obtained
the same policy twice in succession; it must therefore be the optimal
policy. The optimal policy has a gain of 7 in all states. The policy
3
d= b which was possible because of the equality of the value test
2
in state 3, is also optimal.
68 MULTIPLE-CHAIN PROCESSES
Although this system had the capacity for multichain behavior, such
behavior did not appear if we chose as our starting point the policy
that maximized expected immediate reward. Nevertheless, other
choices of starting policy will create this behavior.
Let us assume the following initial policy:
3 pins 3
dari Peto yt BO q=t\4
1 1= 03/0 8
To evaluate this policy, we first apply Eqs. 6.3 and obtain
SITES Se ee Sooo
There are two recurrent chains. Chain 1 is composed of states 1 and
3, chain 2 of state 2 alone. Therefore, g1 = gs = 1g, g2 = 2g, and we
may set ve = v3 = 0. Equations 6.4 then yield
lg + vy = 3 29 = 4plea oieae
The solution of these equations is lg = 4, 2g = 4, v1 = — 3, and so
gi = "9 eyre4 gs = "9
and
Uu= —3 v2 = 0 v3 = 0
1 1 a} —}
Z 4 2
3 43 3<
2 1 ad 7
% 4 4
3 oa 5<
3 1 at an
2 4 9
3 oe 7<
has been produced is the optimal policy that was found earlier, and so
there is no need to continue the procedure because we would only
repeat our earlier work.
In the preceding example we began with a two-chain policy and ended
with the optimal one-chain policy. The reader should start with such
1 1
policies as d = P| and d = | to see how the optimal policy with
3 i
gain 7 for all states may be reached by various routes. Note that in
no case is it necessary to use the true limiting values v4; the relative
values are adequate for policy-improvement purposes.
> Ptgs4
j=l
and we let gi4 = gi — gid, then
N
np) leaks ee |
| | |
i}
Qe te PP
\
fey
|
rls rea
|
nO
fee See i ee
| |
| |
ps —
oe h 2yA p ai
gi = : yvi= : py = ; Y= :
LgA LyA Ly Ly
Itlga L+1yA Ltly a,
is started in a state of the 7th chain; "x = 'n’’P, and the sum of the
components of each "x for 7 = 1, 2,---, Lis 1. The subvector 4*17
has all components zero because all states in the group L + 1 are
transient.
Equations 6.12 and 6.13 in vector form are
go = + PF g4 (6.14)
gi+visyst PByv4 (6.15)
Since
Tr = Tr TTP
it follows that
rb = 0 (6.20)
Because all states in the vth chain are recurrent, "% contains all positive
elements. We know from our earlier discussion that all 4 are greater
than or equal to zero. From Eq. 6.20 we see that, in any of the 7
groups 7 = 1, 2,---, L, } must be zero. It follows that in each re-
current chain of the policy B the decision in each state must be based
on value rather than gain.
Equations 6.16 thus become
rgd = 7rp rgd (6.21)
We know that the solution of these equations is that all "g;4 = "g4, so
that all states in the vth group experience the same increase in gain as
the policy is changed from A to B. If this result is used in Eq. 6.18,
we find that
rgh = rary (6.22)
PROPERTIES OF THEVYTERATION CYCLE 73
Thus the increase in gain for each state in the 7th group is equal to
the vector of limiting state probabilities for the 7th group times the
vector of increases in the value test quantity for that group. Since,
for each group 7 < L, "); = 0, then"y; > 0. Equation 6.22 shows that
an increase in gain for each recurrent state of policy B will occur un-
less policies A and B are equivalent.
We have yet to determine whether or not the gain of the transient
states of policy B is increased. Equation 6.17 shows that
L
(EH — LHL L+1p)Lt+iga — LHy + » L+1,8P sg (6.23)
s=1
3 he pew| 3
of r=|0 0 j [5
3 001 7
by means of the policy-improvement routine of Table 6.4. The first
policy we have called policy A, the second, policy B. From Table 6.4
we see that
0 0
p= ] vee F
0 3
If the identity of states 3 and 1 is interchanged, we have
P2=|1/0
Pietiot® Y 2 ‘8°
0 p= |3 ‘aa oy meRees
1/0 0 0 0 8
Thus L = 1, there is one recurrent chain, and 4P = [1]. We notice
that in the new state 1 (the old state 3) the decision was based on values
rather than gains. The limiting-state-probability vector for s = 1,
Iq, is [1]. Hence from Eq. 6.22
Since
GoSS)
tole
neo
PROPER DTESIOLSTME Iie RATTON-GYCLE 75:
i= >:Pisrij
j=1
76
DISCOUNTING Th
to obtain for the basic recurrence relation
7
vi(m) = qe + B> pivj(n — 1) fe 1, Bos aul
j=1
= Ie 2; 3,25 (7.2)
v(z) = 1—z
——(I
— BzP)-1q
or
Since 8 = f,
1 ed Sones ©
btay 195a
rid
(I — 42P) = |
and
ws dz
ee
302) (1 2 42)(1 vs 352)
(1 =, $2) (1 _
(I — 42P)-1
5° tee: hy
(eo) (te
Thus
122
2(1 = 702)
1p
es AN 30
pt p-k 8
1 8 WO)
1 Zlig 19 1l—'}21-3 —"s
i
1 _100
71
100
aa
o's, 80 80
LePSraye Tepe =
ea
and
ALO! 100
1, 71 171
+ (z6) _80_ “|
171 i71
Since v(x) = H()q, the problem of finding v(m) has been solved for an
arbitrary q. Forq = ‘
As we shall soon prove, the total expected rewards v;(m) will increase
with m and approach the values v(m) = 22.2 and ve(n) = 12.3 as n
becomes very large. The policy of the toymaker should be to use the
second alternative in each state ifm > 1. Since we have seen how the
vi(m) approach asymptotic values for large m, we might ask if there is
any way we can by-pass the recurrence relation and develop a technique
that will yield directly the optimal policy for the system of very long
duration. The answer is that we do have such a procedure and that
it is completely analogous to the policy-iteration technique used for
VALUE-DETERMINATION OPERATION &1
processes without discounting. Since the concept of gain has no
meaning when rewards are discounted, the optimal policy is the one
that produces the highest present value in all states. We shall now
describe the new forms that the value-determination operation and the
policy-improvement routine assume. We shall see that the sequential
decision process with discounting is as easy to solve as the completely
ergodic process without discounting. We need no longer be concerned
with the chain structure of the Markov process.
and that 7 (8z) now refers to components that fall to zero even faster
as » grows large. Then Eq. 7.4 becomes
2 1 - 1
v(z) = ae li = ee + 7) 4 + F ety S + F (62)|v0) (7.7)
Let us investigate the behavior of Eq. 7.7 for large n. The coefficient
of v(0) represents terms that decay to zero, so that this term disappears.
The coefficient of q represents a step component that will remain
plus transient components that will vanish. By partial-fraction
expansion the step component has magnitude [1/(1 — §)]S + 7 (8).
Thus for large n, v(z) becomes {[1/(1 — z)][1/(1 — 8)JS + 7 (8)}q.
For large n, v(m) takes the form {[1/(1 — 8)]S + 7(8)}q. However,
{{1/(1 — 8)]S + 7(8)} is equal to (I — BP), by Eq. 7.6. Therefore,
for large m, v(m) approaches a limit, designated v, that is defined by
v = (I — BP)-1q (7.8)
The vector v may be called the vector of present values, because each
of its elements v; is the present value of an infinite number of future
expected rewards discounted by the discount factor f.
82 DISCOUNTING
M=G+B>
pyyy 1=1,2,---,N (7.9)
j=1
qi Bd, buto,(n)
with respect to all alternatives kin the 7th state. Since we are now deal-
ing only with processes that have a large number of stages, we may
substitute the present value v; for vj(m) in this expression. We must
now maximize
N
gi + BD pesto;
j=1
gk + BD desko;
j=1
using the v; determined for the original policy. This k now becomes
the new decision in the 7th state. A new policy has been determined
when this procedure has been performed for every state.
The policy-improvement routine can then be combined with the
84 DISCOUNTING
Policy-Improvement Routine
For each state 7, find the alternative k’ that maximizes
N
gk +8 > pykyy
j=1
using the present values v; from the previous policy. Then
k’ becomes the new decision in the ith state, g;*’ becomes qj,
and p4;*’ becomes pij.
Fig. 7.1. Iteration cycle for discrete decision processes with discounting.
The iteration cycle may be entered in either box. An initial policy may
be selected and the iteration begun with the value-determination opera-
tion, or an initial set of present values may be chosen and the iteration
started in the policy-improvement routine. Ifthereisnoa priori basis for
choosing a close-to-optimal policy, then it is often convenient to start the
process in the policy-improvement routine with all v; set equal to zero.
The initial policy selected will then be the one that maximizes expected
immediate reward, a very satisfactory starting point in most cases.
The iteration cycle will be able to make policy improvements until
the policies on two successive iterations are identical. At this point it
has found the optimal policy, and the problem is completed. It will
be shown after the example of the next section that the policy-improve-
ment routine must increase or leave unchanged the present values of
every state and that it cannot converge on a nonoptimal policy.
An Example
Let us solve the toymaker’s problem that was solved by value
iteration earlier in this chapter. The data were given in Table 3.1,
AN EXAMPLE 85
and as before 8 = 0.9. We seek the policy that the toymaker should
follow if his rewards are discounted and he is going to continue his
business indefinitely. The optimal policy is the one that maximizes the
present value of all his future rewards.
Let us choose as the initial policy the one that maximizes his
expected immediate reward. This is the policy formed by the first
alternative in each state, so that
1 UFce fad8) 6
H 7 ie ‘a 1 ies
d = = = |}
The second alternative in each state provides a better policy, so that now
2 0.39.02 4
LG A il bee bal Aig 3
The value-determination operation for this policy provides the
equations
v=4+ 0.9(0.801 + 0.2v2) v2= =—5 + 0.9(0.701 + 0.3v2)
2 1 11.6
2 12.3<—
The present values of the two states under the optimal policy are 22.2
and 12.3, respectively; these present values must be higher than those
of any other policy. The reader should check the policies d = ,
We have seen that if the discount factor is 0.9 the optimal no-
discounting policy found in Chapter 4 is still optimal for the toymaker.
We shall say more about how the discount factor affects the optimal
policy after we prove the properties of the iteration cycle.
— BY pijAv,4
j=1
If viA = v,;B — v;4, the increase in present value in the 7th state, then
N
ved = ye + BD pisBosA
j=l
This set of equations has the same form as our original present-value
equations (Eq. 7.9), but it is written in terms of the increase in present
values. We know that the solution in vector form is
v4 = [I — BP2}-1y (7.14)
where y is the vector with components y;._ It was shown earlier that
(I — 8P}-! has nonnegative elements and has values of at least 1 on the
main diagonal. Hence, if any y; > 0, at least one v;4 must be greater
than zero, and no v;4 can be less than zero. Therefore, the policy-
improvement routine must increase the present values of at least one
state and cannot decrease the present values of any state.
Is it possible for the routine to converge on policy A when policy B
produces a higher present value in some state? No, because if the
policy-improvement routine converges on A, then all yi < 0, and hence,
all v;4 < 0. It follows that when the policy-improvement routine has
converged on a policy no other policy can have higher present values.
Table 7.4. OptimAL PoLIcy AND PRESENT VALUES FOR THE TAXICAB
PROBLEM AS A FUNCTION OF THE DiscouNT FAcTOR 3
Discount Factor Optimal Policy Decisions Present Values
6B State1 State2 State 3 State1 State2 State3
-| fl A
Policy Policy Policy Policy
For 6 > 0.77, the second alternative in each state is the optimal
policy; the driver should always proceed to the nearest stand. In
Region I, the policy that maximizes expected immediate reward is
optimal; in Region IV, the no-discounting policy is best. An inter-
mediate policy should be followed in Regions II and III.
The behavior first described enables us to draw several conclusions
THE AUTOMOBILE PROBLEM WITH DISCOUNTING 89
about the place of processes with discounting in the analysis of sequen-
tial decision processes. First, even if the no-discounting process
described earlier is the preferred model of the system, the present
analysis will tell us how large the discounting element of the problem
must be before the no-discounting solution is no longer applicable.
Second, one criticism of a model that includes discounting is the
frequent difficulty of determining what the appropriate discount rate
should be. Figure 7.2 shows us that if the uncertainty about the
discount rate spans only one of our regions, the same policy will be
optimal, and the exact discount rate will affect only the present values.
Third, because it becomes increasingly difficult to solve the process
using discounting when § is near 1, in such a situation we are better
advised to solve the problem for the optimal policy without discounting.
Summary
The solution of sequential decision processes is of the same order of
difficulty whether or not discounting is introduced. In either case it is
necessary to solve repeatedly a set of linear simultaneous equations.
Each solution is followed by a set of comparisons to discover an im-
proved policy; convergence on the optimal policy is assured. Dis-
counting is useful when the cost of money is important or when there is
uncertainty concerning the duration of the process.
The Continuous- lime
Decision Process
There are two mutually exclusive ways in which the system can
occupy the state; at f+ dt. First, it could have been in state j at time
t and made no transitions during the interval dt. These events have
probability 7,(¢) and 1 — y aj; dt, respectively, because we have said
tAj
that the probability of multiple transitions is of order higher than dt
and is negligible, and because the probability of making no transition
in dt is 1 minus the probability of making a transition in dt to some
state 7 #7. The second way the system could be in state 7 at ¢ + dt
is to have been at state 7 # 7 at time ¢ and to have made a transition
from 7 to state 7 during the time dt. These events have probabilities
mi(t) and aj; dt, respectively. The probabilities must be multiplied
and added over all z that are not equal to 7 because the system could
have entered 7 from any other state 7. Thus we see how Eq. 8.1 was
obtained.
Let us define the diagonal elements of the A matrix by
ayj= » ajt (8.2)
i#j
Upon dividing both sides of this equation by dé and taking the limit
as dt — 0, we have
ad =
- w(t) = » T(t) aij timeil 1Z, ete. (8.3)
4=1
SS ay = 0
j=l
As mentioned earlier, a matrix whose rows sum to zero is called a
differential matrix. As we shall see, the differential matrix A is a
very close relative of the stochastic matrix P.
In the following section we shall discuss the use of Laplace transforms
in the solution of continuous-time Markov processes described by Eq.
8.4. Weshall find that our knowledge of discrete-time Markov processes
will be most helpful in our new work.
f(t) f(s |
Ailt) + fat) fils) + f2(s)
kf (t) (& is a constant) hf (s)
uO
d.
f(s) = f(0)
e7at 1
S$ +4
1 (unit step) .
oe 1
pit (s + a)2
t (unit ramp) 3
e-atf(t) f(s + a)
or
II(s)(sI — A) = 2(0)
where I is the identity matrix. Finally, we have
II(s) = 7(0)(sI — A)! (8.6)
The Laplace transform of the state-probability vector is thus the
initial-state-probability vector postmultiplied by the inverse of the
matrix (sl — A). The matrix (sI — A)~! is the continuous-process
counterpart of the matrix (I — zP)~!. We shall find that it has proper-
ties analogous to those of (I — zP)~1 and that it constitutes a complete
solution of the continuous-time Markov process.
By inspection, we see that the solution of Eq. 8.4 is
m(t) = m(0)e4? (8.7)
where the matrix function e4¢ is to be interpreted as the exponential
series
t2 {3
I+tA+ =
SI NB
31
5 ee) Noble ce
96 THE CONTINUOUS-TIME DECISION PROCESS
which will converge to e4¢. For discrete processes, Eqs. 1.4 yielded
n(n) = 1(0)P” i Nl borage (1.4)
Suppose that we wish to find the matrix A for the continuous-time
process that will have the same state probabilities as the discrete
process described by P at the times ¢ = 0, 1, 2,---, where a unit of
time is defined as the time for one transition of the discrete process.
Then, by comparison of Eqs. 8.7 and 1.4 when ¢ = 1, we see that
i
or
A=InP (8.8)
Recall the toymaker’s initial policy, for which the transition-proba-
bility matrix was
Polka
pi
5 pole
aes
Suppose that we should like to find the continuous process that will
have the same state probabilities at the end of each week for an
arbitrary starting position. Then we would have to solve Eq. 8.8
to find the matrix A. Methods for accomplishing this exist,4 and if we
apply them to the toymaker’s P we find
In 10 bss ‘
ARS iG om 4
Since the constant factor (In 10)/9 is annoying from the point of view of
calculation, we may as well solve a problem that is analogous to the
toymaker’s problem but that is not encumbered by the constants
necessary for complete correspondence in the sense just described. We
shall let A be simply
A= [75
4 a4.| (8.9)
Since we are abandoning complete correspondence, we may as well
treat ourselves to a change in problem interpretations at the same time.
We shall call this new problem “‘the foreman’s dilemma.’ A machine-
shop foreman has a cantankerous machine that may be either working
(state 1) or not working (state 2). If it is working, there is a probability
5 dt that it will break down in a short interval dt; if it is not working,
there is a probability 4 d¢ that it will be repaired in dt. We thus
obtain the transition-rate matrix (Eq. 8.9). The assumptions regarding
breakdown and repair are equivalent to saying that the operating time
between breakdowns is exponentially distributed with mean }, while
SOLUTION BY LAPLACE TRANSFORMATION 97
the time required for repair is exponentially distributed with mean }.
If we take 1 hour as our time unit, we expect a breakdown to occur
after 12 minutes of operation, and we expect a repair to take 15 minutes.
The standard deviation of operating and repair times is also equal to
12 and 15, respectively.
For the foreman’s problem we would like to find, for example, the
probability that the machine will be operating at time ¢ if it is operating
when ¢ = 0. To answer such a question, we must apply the analysis
of Eq. 8.6 to the matrix (Eq. 8.9). We find
TA S 5 —5
: jae = |
4 5
s(s + 9) s(s
+ 9)
(sI — A)-1 =
4 $+5
s(s + 9) s(s + 9)
Partial-fraction expansion permits
> 3 3 —3
Ree nara an ees
(sI — A)-1=
3 —§ 3 .
Ss IO, eA AAG
or
PNG 4 edleren:
apd alec ty
Let the matrix H(t) be the inverse transform of (sI — A)~!. Then
Eq. 8.6 becomes by means of inverse transformation
m(t) = 7(0)H(2) (8.10)
By comparing Eqs. 8.7 and 8.10, we see that H(#) is a closed-form
expression for eA?,
For the foreman example,
|9 ake9
H() = |i
9
=(9¥
s
clon
colon 9 9
are 1 (8.14)
1=1
1—- De ai; at, On the other hand, the system may make a transition
to core state 7 # 7 during the time interval dé with probability ai; dt.
In this case the system would receive the reward rij; plus the expected
reward to be made if it starts in state 7 with time ¢ remaining, v,(¢). The
product of probability and reward must then be summed over all
states 7 # 1 to obtain the total contribution to the expected values.
Using Eq. 8.2, we may write Eq. 8.15 as
ui(t + dt) = (1 + au dt)[ru dt + vi()] + > ay dtfry + 04(2)]
j#%
or
ad of :
di vi(t) = reg + D aijrig + 2 aizv;(t) ae ey ee Af
or
and finally
0(s) = AGE
S
—-A)=ig + (61 — A=2¥(0) (8.19)
Thus we find that Eq. 8.19 relates v(s), the Laplace transform of
v(t), to (sl — A)~!, the earning-rate vector q and the termination-
reward vector v(0), respectively. The reward vector v(¢) may be found
by inverse transformation of Eq. 8.19.
Let us apply the result (Eq. 8.19) to the foreman’s problem. The
transition-rate matrix and reward vector are
—5 5 6
Spytee ac aed
We shall assume that the machine will be thrown away at ¢ = 0, so that
v1(0) = v2(0) = 0. We found earlier that for this problem
ie ie 5
s(s + 9) s(s°9)
(sI — A)-! =
4 s+5
s(s + 9) s(s + 9)
102 THE CONTINUOUS-TIME DECISION PROCESS
le al ox lees
3 9
9 9
ii
81
eo
81
op
81
ery
81
6
Or
iA ‘s + F(s) (8.11)
where S is the matrix of limiting state probabilities and J (s) consists of
transforms of purely transient components. If Eq. 8.11 is substituted
into Eq. 8.19, we have
Il | Nn + ‘q &
104 THE CONTINUOUS-TIME DECISION PROCESS
so that
4 58 ig; 8
s = [iid] +ale 0
9 9
| Bees
eek 81
1
In addition, q = | i: From Eq. 8.22, we have g = Sq = |:
5
and from Eq. 8.23, since v(0) = 0, we have v = 7 (0)q = |-
5
Therefore by Eq. 8.25, it follows that for large ¢ we may write v1(t)
and v(t) in the form
vi(t) =t+3 ve(t) =t —§
Suppose that our machine shop foreman has to decide upon a main-
tenance and repair policy for the machinery. When the system is in
state 1, or working, the foremaa must decide what kind of maintenance
he will use. Let us suppose that if he uses normal maintenance pro-
cedures the facility will earn $6 per unit time and will have a probability
5 dt of breaking down in a short time dt. Note that this is equivalent
to saying that the length of operating intervals of the machine is
exponentially distributed with mean 4.
The foreman also has the option of a more expensive maintenance
procedure that will reduce earnings to $4 per unit time but will also
reduce the probability of a breakdown in dt to 2 dt. Under neither of
these maintenance schemes is there a cost associated with the break-
down fer se. If we number the two alternatives in state 1 as 1 and 2,
respectively, then we have for the first alternative
aig = 5 rit = 6 rigt = 0
and for the second alternative
@1o* = 2 112 = 4 7122 = 0
Finally, we obtain by using Eq. 8.16 that
gi=6 and qi? =
CONTINUOUS-TIME DECISION PROBLEM 105
Now we must consider what can happen when the machinery is not
working and the system occupies state 2. Let us suppose that the
foreman also has two alternatives in this state. First, he may have the
repair work done by his own men. For this alternative the repair will
cost $1 per unit time that the men are working, plus $0.50 fixed charge
per breakdown, and there is a probability 4 dt that the machine will
be repaired in a short time df (repair time is exponential with mean }).
The parameters of this alternative are thus
a1! = 4 Yoo = —] %o11 = —0.5
and
got =15 46°7 (205), = "25
The foreman must decide which alternative to use in each state in
order to maximize his profits in the long run. The data for the problem
are summarized in Table 8.2.
The concepts of alternative, decision, and policy carry over from the
discrete situation. Since each of the four possible policies contained
in Table 8.2 represents a completely ergodic process, each has a unique
gain that is independent of the starting state of the system. The
foreman would like to find the policy that has highest gain; this is the
optimal policy.
One way to find the optimal policy is to find the gain for each of the
four policies and see which gain is largest. Although this is feasible
for small problems, it is not feasible for problems that have many
states and many alternatives in each state.
Note also that the value-iteration method available for discrete-time
processes is no longer practical in the continuous-time case. It is not
possible to use simple recursive relations that will lead ultimately to the
optimal policy because we are now dealing with differential rather than
difference equations.
A policy-iteration method has been developed for the solution of the
long-duration continuous-time decision problem. It is in all major
respects completely analogous to the procedure used in discrete-time
processes. As before, the heart of the procedure is an iteration cycle
composed of a value-determination operation and a policy-improvement
routine. We shall now discuss each section in detail.
For large t we may write v_i(t) = g_i t + v_i; substitution into Eqs. 8.17
then gives

    g_i = q_i + Σ_{j=1}^N a_ij (g_j t + v_j)        i = 1, 2, ..., N        (8.26)

or

    g_i = q_i + t Σ_{j=1}^N a_ij g_j + Σ_{j=1}^N a_ij v_j

If Eqs. 8.26 are to hold for all large t, then we obtain the two sets of
linear algebraic equations

    Σ_{j=1}^N a_ij g_j = 0        i = 1, 2, ..., N        (8.27)

    g_i = q_i + Σ_{j=1}^N a_ij v_j        i = 1, 2, ..., N        (8.28)
Equations 8.27 and 8.28 are analogous to Eqs. 6.3 and 6.4 for the
discrete-time process. Solution of Eqs. 8.27 expresses the gain of
each state in terms of the gains of the recurrent chains in the process.
The relative value of one state in each chain is set equal to zero, and
Eqs. 8.28 are used to solve for the remaining relative values and the
gains of the recurrent chains.
In the policy-improvement routine we first find, for each state i, the
alternative k that maximizes

    Σ_{j=1}^N a_ij^k g_j        (8.31)

the gain test quantity, using the gains of the old policy. However,
when all alternatives produce the same value of Expression 8.31, or
when a group of alternatives produces the same maximum value, then
the tie is broken by the alternative that maximizes

    q_i^k + Σ_{j=1}^N a_ij^k v_j        (8.32)

the value test quantity, using the relative values of the old policy. The
relative values may be used for the value test because a constant
difference will not affect decisions within a chain.
The general iteration cycle is shown in Fig. 8.1. It corresponds
completely with Fig. 6.1 for the discrete-time case and has a completely
analogous proof. The rules for starting and stopping the process are
unchanged.
Policy Evaluation

Use the a_ij and q_i for a given policy to solve the double set of
equations

    Σ_{j=1}^N a_ij g_j = 0        i = 1, 2, ..., N

    g_i = q_i + Σ_{j=1}^N a_ij v_j        i = 1, 2, ..., N

for the gains g_i and the relative values v_i, setting the value of one
state in each recurrent chain to zero.

Policy Improvement

For each state i, determine the alternative k that maximizes

    Σ_{j=1}^N a_ij^k g_j

using the gains g_j of the previous policy, and make it the new
decision in the ith state.

If

    Σ_{j=1}^N a_ij^k g_j

is the same for all alternatives of a state, or if several alternatives
are tied for the maximum, determine the alternative k that maximizes

    q_i^k + Σ_{j=1}^N a_ij^k v_j

using the relative values of the previous policy. This alternative
becomes the new decision in the ith state. A new policy has been found
when this procedure has been performed for every state.
The iteration cycle for completely ergodic continuous-time systems
is shown in Fig. 8.2. It is completely analogous to that shown in Fig.
4.2 for discrete-time processes. Note that, if the iteration is started
in the policy-improvement routine with all v_j = 0, the initial policy
selected is the one that maximizes the earning rate of each state.
This policy is analogous to the policy of maximizing expected immediate
reward for discrete-time processes.
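To make the cycle concrete, the following minimal sketch carries out the
iteration for a completely ergodic continuous-time process, using the
foreman's dilemma as data (Python with NumPy assumed; the helper name
value_determination and the data layout are mine, and the second alternative
in state 2 is taken to have a_21 = 7 and q_2 = -5, the values implied by the
test quantities shown later in the example).

    import numpy as np

    # (rate row a_i., earning rate q_i) for each alternative in each state
    alternatives = [
        [(np.array([-5.0, 5.0]), 6.0), (np.array([-2.0, 2.0]), 4.0)],    # state 1
        [(np.array([4.0, -4.0]), -3.0), (np.array([7.0, -7.0]), -5.0)],  # state 2
    ]

    def value_determination(A, q):
        # Solve g = q_i + sum_j a_ij v_j with v_N = 0 (Eqs. 8.27-8.28, single chain).
        N = len(q)
        coef = np.hstack([A[:, :N - 1], -np.ones((N, 1))])  # unknowns: v_1..v_{N-1}, g
        x = np.linalg.solve(coef, -q)
        return x[-1], np.append(x[:-1], 0.0)                # gain g, relative values v

    policy = [0, 0]          # with all v_j = 0, this is the earning-rate-maximizing policy
    while True:
        A = np.array([alternatives[i][policy[i]][0] for i in range(2)])
        q = np.array([alternatives[i][policy[i]][1] for i in range(2)])
        g, v = value_determination(A, q)
        print("policy", policy, "gain", g, "relative values", v)
        # Policy improvement: maximize q_i^k + sum_j a_ij^k v_j in each state.
        new_policy = [max(range(len(alternatives[i])),
                          key=lambda k: alternatives[i][k][1] + alternatives[i][k][0] @ v)
                      for i in range(2)]
        if new_policy == policy:
            break
        policy = new_policy

Started from the earning-rate-maximizing policy, the sketch evaluates it
(gain 1), switches to the second alternative in both states, and stops with
gain 2.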
The proof of the properties of the iteration cycle for the continuous-
time case is very close to the proof for discrete time. We shall illustrate
this remark by the proof of policy improvement for the iteration cycle of
Fig. 8.2.
Consider two policies, A and B. The policy-improvement routine
has produced policy B as a successor to policy A. Therefore we know
    q_i^B + Σ_{j=1}^N a_ij^B v_j^A ≥ q_i^A + Σ_{j=1}^N a_ij^A v_j^A

or

    γ_i = q_i^B + Σ_{j=1}^N a_ij^B v_j^A - q_i^A - Σ_{j=1}^N a_ij^A v_j^A ≥ 0        (8.34)
Policy-Improvement Routine
For each state i, find the alternative k' that maximizes

    q_i^k + Σ_{j=1}^N a_ij^k v_j

using the relative values v_j of the previous policy. Then k'
becomes the new decision in the ith state, q_i^{k'} becomes q_i, and
a_ij^{k'} becomes a_ij.
If Eq. 8.36 is subtracted from Eq. 8.35 and if Eq. 8.34 is used to
eliminate q_i^B - q_i^A, we obtain

    g^B - g^A = γ_i + Σ_{j=1}^N a_ij^B (v_j^B - v_j^A)        i = 1, 2, ..., N

These equations have the same form as the value-determination equations,
with γ_i in the place of q_i, so the solution is

    g^B - g^A = Σ_{i=1}^N π_i^B γ_i

where the π_i^B are the limiting state probabilities of policy B. Since
all γ_i ≥ 0 and all π_i^B ≥ 0, it follows that g^B ≥ g^A; the gain will
actually increase if an improvement (a γ_i > 0)
can be made in any state i that is recurrent under policy B.
The proof that the iteration cycle must converge on the optimal policy
is the same as that given in Chapter 4 for the discrete case.
    State        Alternative        Test Quantity  q_i^k + Σ_j a_ij^k v_j
      1                1              6 - 5(1) = 1
                       2              4 - 2(1) = 2  <--
      2                1             -3 + 4(1) = 1
                       2             -5 + 7(1) = 2  <--
Computational Considerations
These are the value-determination equations (Eqs. 8.27 and 8.28) for
the continuous-time decision process. Thus if we have a program for
the solution of Eqs. 6.3 and 6.4 for the discrete process, we may use
it for the solution of the continuous process described by the matrix
A by transforming the transition rates to "pseudo" transition probabilities
according to the relation p_ij = a_ij + δ_ij.
As far as the policy-improvement routine is concerned, in the discrete
case we maximize either
    Σ_{j=1}^N (p_ij^k - δ_ij) g_j        or        q_i^k + Σ_{j=1}^N (p_ij^k - δ_ij) v_j

in state i, since only terms dependent upon k affect decisions. In terms
of a_ij^k = p_ij^k - δ_ij, the quantities to be maximized are

    Σ_{j=1}^N a_ij^k g_j        and        q_i^k + Σ_{j=1}^N a_ij^k v_j

which are exactly the test quantities of the continuous-time policy-
improvement routine.
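A minimal sketch of this transformation (Python with NumPy assumed; the
numbers are the foreman's rates used earlier) shows that the discrete-form
quantities built from the pseudo-probabilities differ from the continuous-form
quantities only by the term v_i, which does not depend on the alternative chosen.

    import numpy as np

    A = np.array([[-5.0, 5.0], [4.0, -4.0]])   # transition-rate matrix
    P = A + np.eye(2)                          # pseudo-probabilities p_ij = a_ij + delta_ij
    q = np.array([6.0, -3.0])
    v = np.array([1.0, 0.0])                   # any set of relative values

    # Note P may have negative diagonal entries (here p_11 = -4), hence "pseudo".
    print(q + P @ v - v)   # discrete-form quantity minus v_i ...
    print(q + A @ v)       # ... equals the continuous-form quantity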
In this equation we assume that rewards are paid at the end of the
interval dt and that the process receives no reward from termination.
Using the definition given by Eq. 8.2, we may rewrite Eq. 8.40 as
    v_i(t + dt) = (1 + a_ii dt)[r_ii dt + v_i(t)] + Σ_{j≠i} a_ij dt [r_ij + v_j(t)] - α dt v_i(t)
where terms of higher order than dt have been neglected.
Introduction of the earning rate from Eq. 8.16 and rearrangement
yield
    v_i(t + dt) - v_i(t) + α dt v_i(t) = q_i dt + Σ_{j=1}^N a_ij dt v_j(t)
If this equation is divided by dt and the limit taken as dt approaches
zero, we have

    dv_i(t)/dt + α v_i(t) = q_i + Σ_{j=1}^N a_ij v_j(t)        i = 1, 2, ..., N        (8.41)
Equations 8.41 are analogous to Eqs. 8.17 and reduce to them if
α = 0. In vector form, Eqs. 8.41 become

    dv(t)/dt + αv(t) = q + Av(t)        (8.42)
Since Eq. 8.42 is a linear constant-coefficient differential equation,
we should expect a Laplace transformation to be useful. If the
transform of Eq. 8.42 is taken, we obtain
    sv(s) - v(0) + αv(s) = (1/s)q + Av(s)

or

    [(s + α)I - A]v(s) = (1/s)q + v(0)

and finally

    v(s) = (1/s)[(s + α)I - A]^{-1}q + [(s + α)I - A]^{-1}v(0)        (8.43)
We might use Eq. 8.43 and inverse transformation to find v(t) for a
given process. As usual, however, we are interested in processes of
long duration, so that only the asymptotic form of v(t) for large t
interests us. Let us recall from Eq. 8.11 that

    (sI - A)^{-1} = (1/s)S + T(s)

Replacing s by s + α gives

    [(s + α)I - A]^{-1} = [1/(s + α)]S + T(s + α)        (8.44)

so that [(s + α)I - A]^{-1} has all transient components. If Eq. 8.44 is
used in Eq. 8.43, we have
    v(s) = (1/s){[1/(s + α)]S + T(s + α)}q + {[1/(s + α)]S + T(s + α)}v(0)        (8.45)
We now wish to know which components of v(t) will be nonzero for
large t. The matrix multiplying q contains a step component of
magnitude [(1/α)S + T(α)]; all other terms of Eq. 8.45 represent
transient components.
Since the terms multiplying v(0) are entirely transient, for processes of
long duration we have

    v = [(1/α)S + T(α)]q

or

    v = (αI - A)^{-1}q        (8.46)
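For a given policy, Eq. 8.46 is simply a set of N simultaneous linear
equations. A minimal sketch (Python with NumPy assumed; the discount rate
α = 1/9 is an illustrative choice) is:

    import numpy as np

    A = np.array([[-5.0, 5.0], [4.0, -4.0]])   # rates of the policy being evaluated
    q = np.array([6.0, -3.0])                  # earning rates
    alpha = 1.0 / 9.0                          # illustrative discount rate

    v = np.linalg.solve(alpha * np.eye(2) - A, q)   # v = (alpha*I - A)^{-1} q, Eq. 8.46
    print(v)   # about [9.549, 8.561], i.e. 783/82 and 702/82 for this choice of alpha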
Policy Improvement
Policy-Improvement Routine
For each state i, find the alternative k' that maximizes

    q_i^k + Σ_{j=1}^N a_ij^k v_j

using the present values v_j from the previous policy. Then k'
becomes the new decision in the ith state, q_i^{k'} becomes q_i, and
a_ij^{k'} becomes a_ij.
Fig. 8.3. Iteration cycle for continuous-time decision processes with discounting.
Equivalently,

    γ_i = q_i^B + Σ_{j=1}^N a_ij^B v_j^A - q_i^A - Σ_{j=1}^N a_ij^A v_j^A ≥ 0        for all i
Let us now apply the procedure to the foreman's problem. Following the
usual starting rule, we choose as our initial policy the one that maximizes
earning rate; that is, the policy that uses the first alternative in each
state. The value-determination equations (Eqs. 8.47) for this policy have
the solution

    v_1 = 783/82        v_2 = 702/82
The policy-improvement routine then gives

    State        Alternative        Test Quantity  q_i^k + Σ_{j=1}^N a_ij^k v_j
      1                1              6 - 5(783/82) + 5(702/82) =  87/82
                       2              4 - 2(783/82) + 2(702/82) = 166/82  <--
      2                1             -3 + 4(783/82) - 4(702/82) =  78/82
                       2             -5 + 7(783/82) - 7(702/82) = 157/82  <--
The value-determination equations (Eqs. 8.47) are then solved for the
new policy, which uses the second alternative in each state. Their solution is

    v_1 = 1494/82        v_2 = 1413/82
Note that the present values have once more increased. The policy-
improvement routine is entered again; since v_1 - v_2 = 81/82 just as
before, the test quantities are unchanged, the second alternative is again
selected in each state, and the procedure has converged on the optimal
policy.
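The complete iteration cycle of Fig. 8.3 parallels the sketch given earlier
for the undiscounted case, with Eq. 8.46 replacing the value-determination
equations. A minimal version (Python with NumPy assumed; the discount rate
α = 1/9 and the alternative data are the illustrative values used above) is:

    import numpy as np

    alpha = 1.0 / 9.0
    alternatives = [
        [(np.array([-5.0, 5.0]), 6.0), (np.array([-2.0, 2.0]), 4.0)],    # state 1
        [(np.array([4.0, -4.0]), -3.0), (np.array([7.0, -7.0]), -5.0)],  # state 2
    ]

    policy = [0, 0]        # with all v_j = 0, maximize earning rate
    while True:
        A = np.array([alternatives[i][policy[i]][0] for i in range(2)])
        q = np.array([alternatives[i][policy[i]][1] for i in range(2)])
        v = np.linalg.solve(alpha * np.eye(2) - A, q)     # present values, Eq. 8.46
        print("policy", policy, "present values", v)
        # Policy improvement: maximize q_i^k + sum_j a_ij^k v_j in each state.
        new_policy = [max(range(2),
                          key=lambda k: alternatives[i][k][1] + alternatives[i][k][0] @ v)
                      for i in range(2)]
        if new_policy == policy:
            break
        policy = new_policy

The present values rise from roughly (9.55, 8.56) to roughly (18.2, 17.2),
after which the same policy is selected again and the iteration stops.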
With the transformations given below, the value-determination equations
(Eqs. 8.47) take the form

    v_i = q_i + β Σ_{j=1}^N p_ij v_j        i = 1, 2, ..., N

a set of equations of the same form as those for the discrete case. Thus
if we have a continuous problem described by α, q', and A, we may use
the program for the discrete problem described by β, q, and P by making
the transformations

    β = 1/(1 + α)        q = βq'        P = A + I
As for the policy-improvement routine, the test quantity for the discrete
case with discounting is

    q_i^k + β Σ_{j=1}^N p_ij^k v_j

while in the continuous case we maximize q_i'^k + Σ_{j=1}^N a_ij^k v_j, or

    q_i'^k + Σ_{j=1}^N (p_ij^k - δ_ij) v_j

where a_ij^k = p_ij^k - δ_ij. We now have an expression equivalent to

    q_i'^k + Σ_{j=1}^N p_ij^k v_j

since the term -v_i does not depend on k. Because q_i'^k = q_i^k/β, this is

    (1/β)[q_i^k + β Σ_{j=1}^N p_ij^k v_j]

and this, of course, is proportional to

    q_i^k + β Σ_{j=1}^N p_ij^k v_j
which is the test quantity for the discrete case. Thus the same trans-
formation that allowed us to use a program for the discrete case in the
solution of the value-determination operation allows us to use a
program for the policy-improvement routine that is based upon the
discrete process.
We see that by suitable transformations a single program suffices for
both the discrete and continuous cases with discounting. Since we
showed the same relation earlier for cases without discounting, it is
clear that the continuous-time decision process, with or without dis-
counting, is computationally equivalent to its discrete counterpart.
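The equivalence is easy to verify numerically for the illustrative data used
above (Python with NumPy assumed): transform the continuous problem and
solve the resulting discrete-form equations.

    import numpy as np

    alpha = 1.0 / 9.0                                  # continuous discount rate
    A = np.array([[-5.0, 5.0], [4.0, -4.0]])           # rate matrix
    q_prime = np.array([6.0, -3.0])                    # continuous earning rates q'

    beta = 1.0 / (1.0 + alpha)                         # discrete discount factor
    P = A + np.eye(2)                                  # pseudo transition probabilities
    q = beta * q_prime                                 # transformed rewards

    v_cont = np.linalg.solve(alpha * np.eye(2) - A, q_prime)   # (alpha*I - A)^{-1} q'
    v_disc = np.linalg.solve(np.eye(2) - beta * P, q)          # v = q + beta P v
    print(np.allclose(v_cont, v_disc))                         # True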
Conclusion
Appendix: The Relationship of Transient to Recurrent Behavior
If we define a matrix M = [m_ij] by

    M = | 1 - p_11      -p_12      ...    -p_1,N-1      1 |
        |   -p_21     1 - p_22     ...    -p_2,N-1      1 |
        |     .            .                  .         . |
        |   -p_N1       -p_N2      ...    -p_N,N-1      1 |

and a vector of unknowns

    v = | v_1     |
        | v_2     |
        |  .      |
        | v_{N-1} |
        |  g      |

then Equation A.1 in the v_i and the g can be written in matrix
form as

    Mv = q

or

    v = M^{-1}q        (A.2)
where q is the vector of expected immediate rewards. The matrix
M^{-1} will exist if the system is completely ergodic, as we have assumed.
Thus, by inverting M to obtain M^{-1} and then postmultiplying M^{-1} by q,
the v_i for 1 ≤ i ≤ N - 1 and g will be determined.
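A minimal sketch of this construction (Python with NumPy assumed; the
two-state transition probabilities and rewards are illustrative rather than
taken from the text) is:

    import numpy as np

    P = np.array([[0.5, 0.5], [0.4, 0.6]])   # illustrative transition probabilities
    q = np.array([6.0, -3.0])                # illustrative expected immediate rewards
    N = len(q)

    M = np.zeros((N, N))
    M[:, :N - 1] = np.eye(N)[:, :N - 1] - P[:, :N - 1]   # columns for v_1 ... v_{N-1}
    M[:, N - 1] = 1.0                                    # last column is the coefficient of g

    x = np.linalg.solve(M, q)                 # x = M^{-1} q = [v_1, ..., v_{N-1}, g]
    v, g = np.append(x[:N - 1], 0.0), x[-1]
    print(v, g)                               # about [10. 0.] and 1.0

Here the solution is v_1 = 10, v_2 = 0, and g = 1.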
Suppose that state N is a recurrent state and a trapping state, so
that p_Nj = 0 for j ≠ N, and p_NN = 1. Furthermore, let there be no
recurrent states among the remaining N - 1 states of the problem.
We know that

    v = M^{-1}q

where M assumes the special form

    M = | 1 - p_11       -p_12       ...      -p_1,N-1        1 |
        |   -p_21      1 - p_22      ...      -p_2,N-1        1 |
        |      .            .                      .          . |
        |  -p_{N-1,1}  -p_{N-1,2}    ...   1 - p_{N-1,N-1}     1 |
        |      0            0        ...          0            1 |
The matrix W or U^{-1} has the form of the matrix [^{L+1}I - ^{L+1,L+1}P]
used in Eq. 6.23 of Chapter 6. Here we would interpret u_ij as the
expected number of times the system will enter one of the transient
states j in the group L + 1 before it enters some recurrent chain if it
is started in state i of the group L + 1. With this definition, the
elements of [^{L+1}I - ^{L+1,L+1}P]^{-1} must all be nonnegative by the same
argument given here.
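The nonnegativity is easy to see numerically; a minimal sketch with a
hypothetical transient-to-transient block Q (Python with NumPy assumed) is:

    import numpy as np

    Q = np.array([[0.2, 0.3],    # hypothetical transient-to-transient probabilities
                  [0.1, 0.4]])
    U = np.linalg.inv(np.eye(2) - Q)
    print(U)            # u_ij >= 0: expected number of times state j is entered
                        # before leaving the transient group, starting from state i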
Using W^{-1} = U, Eq. A.2, and the partitioned form of M^{-1}, we may
write

    v_i = Σ_{j=1}^{N-1} u_ij q_j - q_N Σ_{j=1}^{N-1} u_ij        i = 1, 2, ..., N - 1

or

    v_i = Σ_{j=1}^{N-1} u_ij (q_j - q_N)        i = 1, 2, ..., N - 1
Index
Alternatives, 26-28, 33, 104, 105
Barnes, J. L., 94
Baseball problem, best long-run policy, 52
  computational requirements, 52
  evaluation of base situations, 53
Bellman, R., 1, 29
Car problem, best long-run policy, 57, 58, 89, 90
  computational requirements, 58, 89
  solution in special situations, 58, 89
Chain, periodic, 15
  recurrent, 13
Computational considerations, 36, 37, 49-52, 112-114, 120-123
Decision, 28, 33, 105
Decision vector, 33
Discount factor, 76
Discount rate, 114
Earning rate, 100
Equivalence of discrete- and continuous-time processes, 96, 112-114, 120-122
Foreman's dilemma, 96-105, 111, 112, 119, 120
Frog example, 3, 17, 18
Gain, changes related to policy changes, 43, 49, 72-75, 111
  of a process, 22, 24, 32, 36, 102
  of a state, 23, 24, 61-63, 103
  in multiple-chain processes, 60-63
Gardner, M. F., 94
Iteration cycle, for continuous-time processes, 108, 110
  for continuous-time processes with discounting, 117
  for discrete-time processes, 38, 64
  for discrete-time processes with discounting, 84
  for multiple-chain processes, 64
Laplace transforms, definition, 94
  of vectors and matrices, 95
  table of, 95
Markov processes, definition, 3, 92
Matrices, differential, 12, 23, 94, 98
  stochastic, 11, 23, 94, 98
Partial-fraction expansion, 10
Policy, 28, 33, 105
  optimal, 28, 33, 61, 81, 105, 116
Policy improvement by a change in chain structure, 68