Dynamic Programming and Markov Processes

This document introduces dynamic programming and Markov processes as methods for modeling sequential decision-making problems with probabilistic elements. It provides the mathematical foundations for representing systems as Markov processes and solving for optimal policies using techniques like value and policy iteration. The goal is to develop analytical tools that can handle complex, real-world problems involving both chance and choice.
Dynamic Programming
and Markov Processes

RONALD A. HOWARD

The M.I.T. Press


Massachusetts Institute of Technology
Cambridge, Massachusetts

Copyright © 1960
by
The Massachusetts Institute of Technology

All Rights Reserved


This book or any part thereof must not
be reproduced in any form without the
written permission of the publisher.

SECOND PRINTING, MAY 1962
THIRD PRINTING, AUGUST 1964
FOURTH PRINTING, MARCH 1966

Library of Congress Catalog Card Number: 60-11030


Printed in the United States of America

Preface

This monograph is the outgrowth of an Sc.D. thesis submitted to
the Department of Electrical Engineering, M.I.T., in June, 1958.  It
contains most of the results of that document, subsequent extensions,
and sufficient introductory material to afford the interested technical
reader a complete understanding of the subject matter.
The monograph was stimulated by widespread interest in dynamic
programming as a method for the solution of sequential problems.
This material has been used as part of a graduate course in systems
engineering and operations research offered in the Electrical Engineering
Department of M.I.T.  As a result, the present text emphasizes above
all else clarity of presentation at the graduate level. It is hoped that
it will find use both as collateral reading in graduate and advanced
undergraduate courses in operations research, and as a reference for
professionals who are interested in the Markov process as a system model.
The thesis from which this work evolved could not have been
written without the advice and encouragement of Professors Philip
M. Morse and George E. Kimball. Professor Morse aroused my interest
in this area; Professor Kimball provided countless helpful suggestions
that guided my thinking on basic problems. Conversations with Pro-
fessors Samuel J. Mason and Bernard Widrow and with Dr. Jerome
D. Herniter were also extremely profitable.
The final text was carefully reviewed by Dr. Robert L. Barringer,
to whom I owe great appreciation. He and his colleagues at the
Operations Research Group of Arthur D. Little, Inc., have continually
offered sympathy and encouragement.
This work was done in part at the Massachusetts Institute of Tech-
nology Computation Center, Cambridge, Massachusetts, and was
supported in part by the Research Laboratory of Electronics.
RONALD A. HOWARD
Cambridge, Massachusetts
February, 1960
Contents

PREFACE

INTRODUCTION

CHAPTER 1  Markov Processes
    The Toymaker Example—State Probabilities
    The z-Transformation
    z-Transform Analysis of Markov Processes
    Transient, Multichain, and Periodic Behavior

CHAPTER 2  Markov Processes with Rewards
    Solution by Recurrence Relation
    The Toymaker Example
    z-Transform Analysis of the Markov Process with Rewards
    Asymptotic Behavior

CHAPTER 3  The Solution of the Sequential Decision Process by Value Iteration
    Introduction of Alternatives
    The Toymaker's Problem Solved by Value Iteration
    Evaluation of the Value-Iteration Approach

CHAPTER 4  The Policy-Iteration Method for the Solution of Sequential Decision Processes
    The Value-Determination Operation
    The Policy-Improvement Routine
    The Iteration Cycle
    The Toymaker's Problem
    A Proof of the Properties of the Policy-Iteration Method

CHAPTER 5  Use of the Policy-Iteration Method in Problems of Taxicab Operation, Baseball, and Automobile Replacement
    An Example—Taxicab Operation
    A Baseball Problem
    The Replacement Problem

CHAPTER 6  The Policy-Iteration Method for Multiple-Chain Processes
    The Value-Determination Operation
    The Policy-Improvement Routine
    A Multichain Example
    Properties of the Iteration Cycle

CHAPTER 7  The Sequential Decision Process with Discounting
    The Sequential Decision Process with Discounting Solved by Value Iteration
    The Value-Determination Operation
    The Policy-Improvement Routine
    An Example
    Proof of the Properties of the Iteration Cycle
    The Sensitivity of the Optimal Policy to the Discount Factor
    The Automobile Problem with Discounting
    Summary

CHAPTER 8  The Continuous-Time Decision Process
    The Continuous-Time Markov Process
    The Solution of Continuous-Time Markov Processes by Laplace Transformation
    The Continuous-Time Markov Process with Rewards
    The Continuous-Time Decision Problem
    The Value-Determination Operation
    The Policy-Improvement Routine
    Completely Ergodic Processes
    The Foreman's Dilemma
    Computational Considerations
    The Continuous-Time Decision Process with Discounting
    Policy Improvement
    An Example
    Comparison with Discrete-Time Case

CHAPTER 9  Conclusion

APPENDIX: The Relationship of Transient to Recurrent Behavior

REFERENCES

INDEX
Introduction

The systems engineer or operations researcher is often faced with
devising models for operational systems.  The systems usually contain
both probabilistic and decision-making features, so that we should
expect the resultant model to be quite complex and analytically intrac-
table. This has indeed been the case for the majority of models that
have been proposed. The exposition of dynamic programming by
Richard Bellman [1] gave hope to those engaged in the analysis of complex
systems, but this hope was diminished by the realization that more
problems could be formulated by this technique than could be solved.
Schemes that seemed quite reasonable often ran into computational
difficulties that were not easily circumvented.
The intent of this work is to provide an analytic structure for a
decision-making system that is at the same time both general enough
to be descriptive and yet computationally feasible. It is based on the
Markov process as a system model, and uses an iterative technique
similar to dynamic programming as its optimization method.
We begin with a discussion of discrete-time Markov processes in
Chapter 1 and then add generalizations of the model as we progress.
These generalizations include the addition of economic rewards in
Chapter 2 and the introduction of the decision process in Chapter 3.
The policy-iteration method for the solution of decision processes
with simple probabilistic structures is discussed in Chapter 4 and then
examples are presented in Chapter 5. Chapter 6 introduces the case
of more complicated probabilistic structures, while Chapter 7 presents
the extension of the model to the case where the discounting of future
rewards is important.  Chapter 8 generalizes all the preceding chapters
to continuous-time rather than discrete-time Markov processes. Finally,
Chapter 9 contains a few concluding remarks.
It is unfortunate that the nature of the work prevents discussion of
the linear programming formulation of the policy-optimization scheme,
but this very interesting viewpoint will have to be postponed to another
time. Readers who are familiar with linear programming will in any
event be able to see familiar structures in the linear forms with which
we deal.
Markov Processes

A Markov process is a mathematical model that is useful in the study
of complex systems.  The basic concepts of the Markov process are
those of "state" of a system and state "transition."  We say that a
system occupies a state when it is completely described by the values
of variables that define the state. A system makes state transitions
when its describing variables change from the values specified for one
state to those specified for another.
A graphic example of a Markov process is presented by a frog in a
lily pond. As time goes by, the frog jumps from one lily pad to another
according to his whim of the moment. The state of the system is the
number of the pad currently occupied by the frog; the state transition
is of course his leap. If the number of lily pads is finite, then we have a
finite-state process. All our future remarks will be confined to such a
process.
If we focus our attention on the state transitions of the system and
merely index the transitions in time, then we may profitably think
of the system as a discrete-time process. If the time between transi-
tions is a random variable that is of interest, then we may consider the
system to be a continuous-time process. Further discussion of this
latter case will occur in Chapter 8.
To study the discrete-time process, we must specify the probabilistic
nature of the state transition. It is convenient to assume that the
time between transitions is a constant. Suppose that there are N
states in the system numbered from 1 to N. If the system is a simple
Markov process, then the probability of a transition to state $j$ during
the next time interval, given that the system now occupies state $i$, is a
function only of $i$ and $j$ and not of any history of the system before its
arrival in $i$.  In other words, we may specify a set of conditional proba-
bilities $p_{ij}$ that a system which now occupies state $i$ will occupy state
$j$ after its next transition.  Since the system must be in some state
after its next transition,

$$\sum_{j=1}^{N} p_{ij} = 1$$

where the probability that the system will remain in $i$, $p_{ii}$, has been
included.  Since the $p_{ij}$ are probabilities,

$$0 \le p_{ij} \le 1$$

The Toymaker Example—State Probabilities

A very simple example of a discrete-time Markov process of the type
we have defined can be thought of as the toymaker's process.  The
toymaker is involved in the novelty toy business.  He may be in either
of two states.  He is in the first state if the toy he is currently producing
has found great favor with the public.  He is in the second state if his
toy is out of favor.  Let us suppose that when he is in state 1 there is
a 50 per cent chance of his remaining in state 1 at the end of the following
week and, consequently, a 50 per cent chance of an unfortunate tran-
sition to state 2.  When he is in state 2, he experiments with new toys,
and he may return to state 1 after a week with probability $\tfrac{2}{5}$ or remain
unprofitable in state 2 with probability $\tfrac{3}{5}$.  Thus $p_{11} = \tfrac{1}{2}$, $p_{12} = \tfrac{1}{2}$,
$p_{21} = \tfrac{2}{5}$, $p_{22} = \tfrac{3}{5}$.  In matrix form we have

$$P = [p_{ij}] = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{2}{5} & \tfrac{3}{5} \end{bmatrix}$$

A corresponding transition diagram of the system shows the states
and transition probabilities in graphical form.  [Transition diagram not
reproduced.]

The transition matrix $P$ is thus a complete description of the Markov
process.  The rows of this matrix sum to 1, and it is composed of
nonnegative elements that are not greater than 1; such a matrix is called
a stochastic matrix.  We make use of this matrix to answer all questions
about the process.  We may wish to know, for example, the probability
that the toymaker will be in state 1 after $n$ weeks if we know he is in
state 1 at the beginning of the $n$-week period.  To answer this and other
questions, we define a state probability $\pi_i(n)$, the probability that the
system will occupy state $i$ after $n$ transitions if its state at $n = 0$ is
known.  It follows that

$$\sum_{i=1}^{N} \pi_i(n) = 1 \tag{1.1}$$

$$\pi_j(n+1) = \sum_{i=1}^{N} \pi_i(n)\,p_{ij} \qquad n = 0, 1, 2, \ldots \tag{1.2}$$

If we define a row vector of state probabilities m(m) with components


m(), then
n(n + 1) = x(n)P nise-Q, 1, 2,-> (1.3)
Since by recursion
m(1) = 7(0)P
(2) = 2(1)P = x(0)P2
m(3) = m(2)P = z(0)P3
in general,
m(n) = 1(0)P” n = 0, 1, 2,--- (1.4)
Thus it is possible to find the probability that the system occupies
each of its states after » moves, x(n), by postmultiplying the initial-
state probability vector m(0) by the mth power of the transition matrix
Pp.
Let us illustrate these relations by applying them to the toymaking
process.  If the toymaker starts with a successful toy, then $\pi_1(0) = 1$
and $\pi_2(0) = 0$, so that $\pi(0) = [1 \quad 0]$.  Therefore, from Eq. 1.3,

$$\pi(1) = \pi(0)P = \begin{bmatrix} 1 & 0 \end{bmatrix}\begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{2}{5} & \tfrac{3}{5} \end{bmatrix}$$

and

$$\pi(1) = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix}$$

After one week, the toymaker is equally likely to be successful or un-
successful.  After two weeks,

$$\pi(2) = \pi(1)P = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix}\begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{2}{5} & \tfrac{3}{5} \end{bmatrix}$$

and

$$\pi(2) = \begin{bmatrix} \tfrac{9}{20} & \tfrac{11}{20} \end{bmatrix}$$

so that the toymaker is slightly more likely to be unsuccessful.
After three weeks, $\pi(3) = \pi(2)P = \begin{bmatrix} \tfrac{89}{200} & \tfrac{111}{200} \end{bmatrix}$, and the probability
of occupying each state is little changed from the values after two
weeks.  Note that since

$$P^3 = \begin{bmatrix} \tfrac{89}{200} & \tfrac{111}{200} \\ \tfrac{111}{250} & \tfrac{139}{250} \end{bmatrix}$$

$\pi(3)$ could have been obtained directly from $\pi(3) = \pi(0)P^3$.
An interesting tendency appears if we calculate $\pi(n)$ as a function of
$n$ as shown in Table 1.1.

Table 1.1.  SUCCESSIVE STATE PROBABILITIES OF TOYMAKER STARTING WITH
A SUCCESSFUL TOY

   n    =   0      1      2       3       4        5
  π1(n)     1      0.5    0.45    0.445   0.4445   0.44445
  π2(n)     0      0.5    0.55    0.555   0.5555   0.55555

It appears as if $\pi_1(n)$ is approaching $\tfrac{4}{9}$ and $\pi_2(n)$ is approaching $\tfrac{5}{9}$
as $n$ becomes very large.  If the toymaker starts with an unsuccessful
toy, so that $\pi_1(0) = 0$, $\pi_2(0) = 1$, then the table for $\pi(n)$ becomes
Table 1.2.

Table 1.2.  SUCCESSIVE STATE PROBABILITIES OF TOYMAKER STARTING
WITHOUT A SUCCESSFUL TOY

   n    =   0      1      2       3       4        5
  π1(n)     0      0.4    0.44    0.444   0.4444   0.44444
  π2(n)     1      0.6    0.56    0.556   0.5556   0.55556
For this case, $\pi_1(n)$ again appears to approach $\tfrac{4}{9}$ for large $n$, while
$\pi_2(n)$ approaches $\tfrac{5}{9}$.  The state-occupancy probabilities thus appear to
be independent of the starting state of the system if the number of
state transitions is large.  Many Markov processes exhibit this property.
We shall designate as a completely ergodic process any Markov process
whose limiting state probability distribution is independent of starting
conditions.  We shall investigate in a later discussion those Markov
processes whose state-occupancy probabilities for large numbers of
transitions are dependent upon the starting state of the system.
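As a quick check of Eq. 1.2, the recursion is easy to carry out numerically.  The following minimal Python sketch (an illustration added here, not part of the original text) iterates $\pi(n+1) = \pi(n)P$ for the toymaker's transition matrix and reproduces the entries of Tables 1.1 and 1.2.

```python
import numpy as np

# Toymaker transition matrix from the text: state 1 = successful toy, state 2 = not.
P = np.array([[0.5, 0.5],
              [0.4, 0.6]])

def state_probabilities(pi0, P, n_max):
    """Iterate pi(n+1) = pi(n) P  (Eq. 1.2) and return pi(0), ..., pi(n_max)."""
    pi = np.asarray(pi0, dtype=float)
    history = [pi]
    for _ in range(n_max):
        pi = pi @ P          # row vector postmultiplied by P
        history.append(pi)
    return np.array(history)

# Table 1.1: start with a successful toy; Table 1.2: start without one.
for start in ([1.0, 0.0], [0.0, 1.0]):
    print("start =", start)
    for n, pi in enumerate(state_probabilities(start, P, 5)):
        print(f"  n={n}:  pi1={pi[0]:.5f}  pi2={pi[1]:.5f}")
```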
For completely ergodic Markov processes, we may define a quantity
$\pi_i$ as the probability that the system occupies the $i$th state after a large
number of moves.  The row vector* $\pi$ with components $\pi_i$ is thus the
limit as $n$ approaches infinity of $\pi(n)$; it is called the vector of limiting
or absolute state probabilities.

* $\pi(n)$ and $\pi$ are the only row vectors that we shall consider in our work; other
vectors will be column vectors.

It follows from Eq. 1.3 that the vector $\pi$ must obey the equation

$$\pi = \pi P \tag{1.5}$$

and, of course, the sum of the components of $\pi$ must be 1,

$$\sum_{i=1}^{N} \pi_i = 1 \tag{1.6}$$

We may use Eqs. 1.5 and 1.6 to find the limiting state probabilities
for any process.  For the toymaker example, Eq. 1.5 yields

$$\pi_1 = \tfrac{1}{2}\pi_1 + \tfrac{2}{5}\pi_2 \qquad \pi_2 = \tfrac{1}{2}\pi_1 + \tfrac{3}{5}\pi_2$$
whereas Eq. 1.6 becomes $\pi_1 + \pi_2 = 1$.
The three equations for the two unknowns $\pi_1$ and $\pi_2$ have the unique
solution $\pi_1 = \tfrac{4}{9}$, $\pi_2 = \tfrac{5}{9}$.  These are, of course, the same values for
the limiting state probabilities that we inferred from our tables of $\pi_i(n)$.
In many applications the limiting state probabilities are the only
quantities of interest.  It may be sufficient to know, for example, that
our toymaker is fortunate enough to have a successful toy $\tfrac{4}{9}$ of the time
and is unfortunate $\tfrac{5}{9}$ of the time.  The difficulty involved in finding
the limiting state probabilities is precisely that of solving a set of $N$
linear simultaneous equations.  We must remember, however, that
the quantities $\pi_i$ are a sufficient description of the process only if enough
transitions have occurred for the memory of starting position to be
lost.  In the following section, we shall gain more insight into the
behavior of the process during the transient period when the state
probabilities are approaching their limiting values.
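Finding the limiting state probabilities is, as stated, just the solution of a set of linear simultaneous equations.  A brief Python sketch (illustrative; it replaces one redundant equation of $\pi = \pi P$ with the normalization of Eq. 1.6):

```python
import numpy as np

P = np.array([[0.5, 0.5],
              [0.4, 0.6]])
N = P.shape[0]

# pi = pi P together with sum(pi) = 1.  Rewrite pi (I - P) = 0 as
# (I - P)^T pi^T = 0 and replace the last equation by the normalization.
A = (np.eye(N) - P).T
A[-1, :] = 1.0
b = np.zeros(N)
b[-1] = 1.0

pi = np.linalg.solve(A, b)
print(pi)          # [0.4444... 0.5555...], i.e., pi1 = 4/9, pi2 = 5/9
```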

The z-Transformation

For the study of transient behavior and for theoretical convenience,
it is useful to study the Markov process from the point of view of the
generating function or, as we shall call it, the z-transform.  Consider
a time function $f(n)$ that takes on arbitrary values $f(0)$, $f(1)$, $f(2)$, and
so on, at nonnegative, discrete, integrally spaced points of time and
that is zero for negative time.  Such a time function is shown in Fig. 1.1.
For time functions that do not increase in magnitude with $n$ faster
than a geometric sequence, it is possible to define a z-transform $f(z)$
such that

$$f(z) = \sum_{n=0}^{\infty} f(n)\,z^n \tag{1.7}$$
[Fig. 1.1.  An arbitrary discrete-time function $f(n)$ plotted against $n$.  Figure not
reproduced.]

The relationship between $f(n)$ and its transform $f(z)$ is unique; each
time function has only one transform, and the inverse transformation
of the transform will produce once more the original time function.
The z-transformation is useful in Markov processes because the prob-
ability transients in Markov processes are geometric sequences.  The
z-transform provides us with a closed-form expression for such sequences.
Let us find the z-transforms of the typical time functions that we
shall soon encounter.  Consider first the step function

$$f(n) = \begin{cases} 1 & n = 0, 1, 2, 3, \ldots \\ 0 & n < 0 \end{cases}$$

The z-transform is

$$f(z) = \sum_{n=0}^{\infty} z^n \quad \text{or} \quad f(z) = \frac{1}{1 - z}$$

For the geometric sequence $f(n) = \alpha^n$, $n \ge 0$,

$$f(z) = \sum_{n=0}^{\infty} \alpha^n z^n \quad \text{or} \quad f(z) = \frac{1}{1 - \alpha z}$$

Note that if

$$f(z) = \sum_{n=0}^{\infty} \alpha^n z^n$$

then

$$\frac{d}{dz}f(z) = \sum_{n=0}^{\infty} n\alpha^n z^{n-1}$$

and

$$\sum_{n=0}^{\infty} n\alpha^n z^n = z\frac{d}{dz}f(z) = z\frac{d}{dz}\left(\frac{1}{1 - \alpha z}\right) = \frac{\alpha z}{(1 - \alpha z)^2}$$

Thus we have obtained as a derived result that, if the time function we
are dealing with is $f(n) = n\alpha^n$, its z-transform is $f(z) = \alpha z/(1 - \alpha z)^2$.

From these and other easily derived results, we may compile the table
of z-transforms shown as Table 1.3.  In particular, note that, if a time
function $f(n)$ with transform $f(z)$ is shifted one unit so as to become
$f(n + 1)$, then the transform of the shifted function is

$$\sum_{n=0}^{\infty} f(n+1)\,z^n = z^{-1}\sum_{m=1}^{\infty} f(m)\,z^m = z^{-1}[f(z) - f(0)]$$

The reader should become familiar with the results of Table 1.3
because they will be used extensively in examples and proofs.

Table 1.3.  z-TRANSFORM PAIRS

  Time Function for n ≥ 0              z-Transform
  f(n)                                 f(z)
  f1(n) + f2(n)                        f1(z) + f2(z)
  kf(n)  (k is a constant)             kf(z)
  f(n - 1)                             zf(z)
  f(n + 1)                             z^-1[f(z) - f(0)]
  α^n                                  1/(1 - αz)
  1  (unit step)                       1/(1 - z)
  nα^n                                 αz/(1 - αz)^2
  n  (unit ramp)                       z/(1 - z)^2
  α^n f(n)                             f(αz)
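The geometric-sequence pairs in Table 1.3 are easy to spot-check numerically.  The short Python sketch below (an added illustration) compares a truncated sum $\sum_n n\alpha^n z^n$ with the closed form $\alpha z/(1 - \alpha z)^2$ for arbitrarily chosen values of $\alpha$ and $z$ with $|\alpha z| < 1$:

```python
import numpy as np

def truncated_transform(f_values, z):
    """Partial sum of the z-transform  f(z) = sum_n f(n) z^n."""
    n = np.arange(len(f_values))
    return np.sum(f_values * z**n)

alpha, z = 0.9, 0.5          # any values with |alpha * z| < 1 will do
n = np.arange(200)
partial = truncated_transform(n * alpha**n, z)
closed  = alpha * z / (1 - alpha * z)**2

print(partial, closed)       # both approximately 1.4876
```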

z-Transform Analysis of Markov Processes


We shall now use the z-transform to analyze Markov processes.  It
is possible to take the z-transform of vectors and matrices by taking
the z-transform of each component of the array.  If the transform of
Eq. 1.3 is taken in this sense, and if the vector z-transform of the vector
$\pi(n)$ is given the symbol $\Pi(z)$, then we obtain

$$z^{-1}[\Pi(z) - \pi(0)] = \Pi(z)P \tag{1.8}$$

Through rearrangement we have

$$\Pi(z) - z\Pi(z)P = \pi(0) \qquad \Pi(z)(I - zP) = \pi(0)$$

and finally

$$\Pi(z) = \pi(0)(I - zP)^{-1} \tag{1.9}$$

In this expression $I$ is the identity matrix.  The transform of the
state probability vector is thus equal to the initial-state-probability
vector postmultiplied by the inverse of the matrix $I - zP$; the inverse
of $I - zP$ will always exist.  Note that the solution to all transient
problems is contained in the matrix $(I - zP)^{-1}$.  To obtain the complete
solution to any transient problem, all we must do is to weight the rows
of $(I - zP)^{-1}$ by the initial state probabilities, sum, and then take the
inverse transform of each element in the result.
Let us investigate the toymaker's problem by z-transformation.
For this case

$$P = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{2}{5} & \tfrac{3}{5} \end{bmatrix}$$

so that

$$(I - zP) = \begin{bmatrix} 1 - \tfrac{1}{2}z & -\tfrac{1}{2}z \\ -\tfrac{2}{5}z & 1 - \tfrac{3}{5}z \end{bmatrix}$$

and

$$(I - zP)^{-1} = \begin{bmatrix} \dfrac{1 - \tfrac{3}{5}z}{(1 - z)(1 - \tfrac{1}{10}z)} & \dfrac{\tfrac{1}{2}z}{(1 - z)(1 - \tfrac{1}{10}z)} \\[10pt] \dfrac{\tfrac{2}{5}z}{(1 - z)(1 - \tfrac{1}{10}z)} & \dfrac{1 - \tfrac{1}{2}z}{(1 - z)(1 - \tfrac{1}{10}z)} \end{bmatrix}$$

Each element of $(I - zP)^{-1}$ is a function of $z$ with a factorable de-
nominator $(1 - z)(1 - \tfrac{1}{10}z)$.  By partial-fraction expansion [2] we can
express each element as the sum of two terms: one with denominator
$1 - z$ and one with denominator $1 - \tfrac{1}{10}z$.  The $(I - zP)^{-1}$ matrix
now becomes

$$(I - zP)^{-1} = \begin{bmatrix} \dfrac{\tfrac{4}{9}}{1 - z} + \dfrac{\tfrac{5}{9}}{1 - \tfrac{1}{10}z} & \dfrac{\tfrac{5}{9}}{1 - z} - \dfrac{\tfrac{5}{9}}{1 - \tfrac{1}{10}z} \\[10pt] \dfrac{\tfrac{4}{9}}{1 - z} - \dfrac{\tfrac{4}{9}}{1 - \tfrac{1}{10}z} & \dfrac{\tfrac{5}{9}}{1 - z} + \dfrac{\tfrac{4}{9}}{1 - \tfrac{1}{10}z} \end{bmatrix}$$

Let the matrix $H(n)$ be the inverse transform of $(I - zP)^{-1}$ on an
element-by-element basis.  Then from Table 1.3, we see that

$$H(n) = \begin{bmatrix} \tfrac{4}{9} & \tfrac{5}{9} \\ \tfrac{4}{9} & \tfrac{5}{9} \end{bmatrix} + \left(\tfrac{1}{10}\right)^n \begin{bmatrix} \tfrac{5}{9} & -\tfrac{5}{9} \\ -\tfrac{4}{9} & \tfrac{4}{9} \end{bmatrix}$$

and finally, by taking the inverse transform of Eq. 1.9,

$$\pi(n) = \pi(0)H(n) \tag{1.10}$$
By comparison with Eq. 1.4 we see that $H(n) = P^n$, and that we
have found a convenient way to calculate the $n$th power of the tran-
sition-probability matrix in closed form.  The state-probability vector
at time $n$ can thus be found by postmultiplying the initial-state-prob-
ability vector by the response matrix $H(n)$.  The $ij$th element of the
matrix $H(n)$ represents the probability that the system will occupy
state $j$ at time $n$, given that it occupied state $i$ at time $n = 0$.  If the
toymaker starts in the successful state 1, then $\pi(0) = [1 \quad 0]$ and
$\pi(n) = [\tfrac{4}{9} \quad \tfrac{5}{9}] + (\tfrac{1}{10})^n[\tfrac{5}{9} \quad -\tfrac{5}{9}]$, or $\pi_1(n) = \tfrac{4}{9} + \tfrac{5}{9}(\tfrac{1}{10})^n$, $\pi_2(n) = \tfrac{5}{9} - \tfrac{5}{9}(\tfrac{1}{10})^n$.
Note that the expressions for $\pi_1(n)$ and $\pi_2(n)$ are exact analytic
representations for the state probabilities found in Table 1.1 by matrix
multiplication.  Note further that as $n$ becomes very large $\pi_1(n)$ tends
to $\tfrac{4}{9}$ and $\pi_2(n)$ tends to $\tfrac{5}{9}$; they approach the limiting state probabilities
of the process.
If the toymaker starts in state 2, then $\pi(0) = [0 \quad 1]$, $\pi(n) = [\tfrac{4}{9} \quad \tfrac{5}{9}]
+ (\tfrac{1}{10})^n[-\tfrac{4}{9} \quad \tfrac{4}{9}]$, so that $\pi_1(n) = \tfrac{4}{9} - \tfrac{4}{9}(\tfrac{1}{10})^n$ and $\pi_2(n) = \tfrac{5}{9} + \tfrac{4}{9}(\tfrac{1}{10})^n$.
We have now obtained analytic forms for the data in Table 1.2.  Once
more we see that for large $n$ the state probabilities become the limiting
state probabilities of the process.
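The closed form $H(n) = S + (\tfrac{1}{10})^n T$ can be checked directly against powers of the transition matrix; a minimal Python sketch (illustrative):

```python
import numpy as np

P = np.array([[0.5, 0.5],
              [0.4, 0.6]])
S = np.array([[4/9, 5/9],
              [4/9, 5/9]])               # steady-state part
D = np.array([[ 5/9, -5/9],
              [-4/9,  4/9]])             # differential (transient) part

for n in range(6):
    H_n = S + (0.1)**n * D               # closed form found by z-transform
    assert np.allclose(H_n, np.linalg.matrix_power(P, n))
print("H(n) = S + (1/10)^n T agrees with P^n")
```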
It is possible to make some general statements about the form that
$H(n)$ may take.  First, it will always have among its component matrices
at least one that is a stochastic matrix and that arises from a term of
$(I - zP)^{-1}$ of the form $1/(1 - z)$.  This statement is equivalent to
saying that the determinant of $I - zP$ vanishes for $z = 1$ or that a
stochastic matrix always has at least one characteristic value equal to 1.
If the process is completely ergodic, then there will be exactly one
stochastic matrix in $H(n)$.  Furthermore, the rows of this matrix will
be identical and will each be the limiting-state-probability vector of the
process.  We call this portion of $H(n)$ the steady-state portion and
give it the symbol $S$ since it is not a function of $n$.
The remainder of the terms of $H(n)$ represent the transient behavior
of the process.  These terms are matrices multiplied by coefficients of
the form $\alpha^n$, $n\alpha^n$, $n^2\alpha^n$, and so on.  Naturally, $|\alpha|$ must not be greater
than 1, for if any $\alpha$ were greater than 1, that component of probability
would grow without bound, a situation that is clearly impossible.
The transient matrices represent the decreasing geometric sequences of
probability components that are typical of Markov processes.  The
transient component of $H(n)$ may be given the symbol $T(n)$ since it is
a function of $n$.  Since for completely ergodic processes $|\alpha| < 1$ for all
$\alpha$, the transient component $T(n)$ vanishes as $n$ becomes very large.
The matrices that compose $T(n)$ are also of interest because they sum
to zero across each row.  The transient components must sum to zero
since they may be considered as perturbations applied to the limiting
state probabilities.  Matrices that sum to zero across all rows are called
differential matrices.  Finally, for a completely ergodic process,

$$H(n) = S + T(n) \tag{1.11}$$

where $S$ is a stochastic matrix all of whose rows are equal to the limiting
state-probability vector and where $T(n)$ is the sum of a number of
differential matrices with geometric coefficients that tend to zero as $n$
becomes very large.

Transient, Multichain, and Periodic Behavior

To gain further insight into the Markov process, let us use the
z-transform approach to analyze processes that exhibit typical behavior
patterns. In the toymaker’s problem, both states had a finite proba-
bility of occupancy after a large number of transitions. It is possible
even in a completely ergodic process for some of the states to have a
limiting state probability of zero. Such states are called transient
states because we are certain that they will not be occupied after a
long time.  A two-state problem with a transient state is described by

$$P = \begin{bmatrix} \tfrac{3}{4} & \tfrac{1}{4} \\ 0 & 1 \end{bmatrix}$$

with a transition diagram (not reproduced here).

If the system is in state 1, it has probability $\tfrac{1}{4}$ of making a transi-
tion to state 2.  However, if a transition to 2 occurs, then the system
will remain in 2 for all future time.  State 1 is a transient state;
state 2 is a trapping state (a state $i$ for which $p_{ii} = 1$).
By applying the z-transform analysis, we find

$$(I - zP) = \begin{bmatrix} 1 - \tfrac{3}{4}z & -\tfrac{1}{4}z \\ 0 & 1 - z \end{bmatrix}$$

and

$$(I - zP)^{-1} = \begin{bmatrix} \dfrac{1}{1 - \tfrac{3}{4}z} & \dfrac{\tfrac{1}{4}z}{(1 - z)(1 - \tfrac{3}{4}z)} \\[10pt] 0 & \dfrac{1}{1 - z} \end{bmatrix} = \frac{1}{1 - z}\begin{bmatrix} 0 & 1 \\ 0 & 1 \end{bmatrix} + \frac{1}{1 - \tfrac{3}{4}z}\begin{bmatrix} 1 & -1 \\ 0 & 0 \end{bmatrix}$$

Thus

$$H(n) = \begin{bmatrix} 0 & 1 \\ 0 & 1 \end{bmatrix} + \left(\tfrac{3}{4}\right)^n \begin{bmatrix} 1 & -1 \\ 0 & 0 \end{bmatrix}$$
If the system is started in state 1 so that $\pi(0) = [1 \quad 0]$, then $\pi_1(n) =
(\tfrac{3}{4})^n$, $\pi_2(n) = 1 - (\tfrac{3}{4})^n$.  If the system is started in state 2 with
$\pi(0) = [0 \quad 1]$, then naturally $\pi_1(n) = 0$, $\pi_2(n) = 1$.  In either case
we see that the limiting state probability of state 1 is zero, so our
assertion that it is a transient state is correct.  Of course the limiting
state probabilities could have been determined from Eqs. 1.5 and 1.6
in the manner described earlier.
A transient state need not lead the system into a trapping state.
The system may leave a transient state and enter a set of states that
are connected by possible transitions in such a way that the system
makes jumps within this set of states indefinitely but never jumps
outside the set. Such a set of states is called a recurrent chain of the
Markov process; every Markov process must have at least one recurrent
chain. A Markov process that has only one recurrent chain must be
completely ergodic because no matter where the process is started it
will end up making jumps among the members of the recurrent chain.
However, if a process has two or more recurrent chains, then the
completely ergodic property no longer holds, for if the system is started
in a state of one chain then it will continue to make transitions within
that chain but never to a state of another chain. In this sense, each
recurrent chain is a generalized trapping state; once it is entered, it can
never be left. We may now think of a transient state as a state that
the system occupies before it becomes committed to one of the recurrent
chains.
The possibility of many recurrent chains forces us to revise our think-
ing concerning $S$, the steady-state component of $H(n)$.  Since the
limiting state probability distribution is now dependent on how the
system is started, the rows of the stochastic matrix $S$ are no longer equal.
Rather, the $i$th row of $S$ represents the limiting state probability distri-
bution that would exist if the system were started in the $i$th state.  The
$i$th row of the $T(n)$ matrix is as before the set of transient components
of the state probability if $i$ is the starting state.
Let us investigate a very simple three-state process with two recurrent
chains described by

$$P = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ \tfrac{1}{4} & \tfrac{1}{4} & \tfrac{1}{2} \end{bmatrix}$$

with a transition diagram (not reproduced here).  State 1 constitutes one
recurrent chain; state 2 the other.  Both are
trapping states, but the general behavior would be unchanged if each
were a collection of connected states.  State 3 is a transient state that
may lead the system to either of the recurrent chains.  To find $H(n)$
for this process, we first find

$$(I - zP) = \begin{bmatrix} 1 - z & 0 & 0 \\ 0 & 1 - z & 0 \\ -\tfrac{1}{4}z & -\tfrac{1}{4}z & 1 - \tfrac{1}{2}z \end{bmatrix}$$

and

$$(I - zP)^{-1} = \begin{bmatrix} \dfrac{1}{1 - z} & 0 & 0 \\[8pt] 0 & \dfrac{1}{1 - z} & 0 \\[8pt] \dfrac{\tfrac{1}{4}z}{(1 - z)(1 - \tfrac{1}{2}z)} & \dfrac{\tfrac{1}{4}z}{(1 - z)(1 - \tfrac{1}{2}z)} & \dfrac{1}{1 - \tfrac{1}{2}z} \end{bmatrix}$$

Thus, by partial-fraction expansion,

$$(I - zP)^{-1} = \frac{1}{1 - z}\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & 0 \end{bmatrix} + \frac{1}{1 - \tfrac{1}{2}z}\begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ -\tfrac{1}{2} & -\tfrac{1}{2} & 1 \end{bmatrix}$$

and

$$H(n) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ \tfrac{1}{2} & \tfrac{1}{2} & 0 \end{bmatrix} + \left(\tfrac{1}{2}\right)^n \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ -\tfrac{1}{2} & -\tfrac{1}{2} & 1 \end{bmatrix} = S + T(n)$$
If the system is started in state 1, $\pi_1(n) = 1$, $\pi_2(n) = \pi_3(n) = 0$;
similarly, if it is started in state 2, $\pi_2(n) = 1$, $\pi_1(n) = \pi_3(n) = 0$.  If
the system is started in state 3, $\pi_1(n) = \pi_2(n) = \tfrac{1}{2}[1 - (\tfrac{1}{2})^n]$, $\pi_3(n) = (\tfrac{1}{2})^n$.
We may summarize by saying that if the system is started in state 1
or state 2 it will remain in its starting state indefinitely.  If it is started
in state 3, it will be after many moves in state 1 with probability $\tfrac{1}{2}$ and
in state 2 with probability $\tfrac{1}{2}$.  These results may be seen directly
from the rows of $S$, which are, after all, the limiting state probability
distributions for each starting condition.
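A numerical illustration of this multichain behavior (added here, using the transition matrix as reconstructed above): powering $P$ shows its rows approaching the corresponding rows of $S$.

```python
import numpy as np

# Three-state example: states 1 and 2 are trapping states, state 3 is transient.
P = np.array([[1.0,  0.0,  0.0],
              [0.0,  1.0,  0.0],
              [0.25, 0.25, 0.5]])

for n in (1, 5, 20):
    print(f"P^{n}:\n{np.linalg.matrix_power(P, n)}")
# The third row tends to [0.5, 0.5, 0.0]: started in state 3, the system is
# eventually trapped in state 1 or state 2 with probability 1/2 each.
```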
The multichain Markov process is thus treated with ease by z-trans-
formation methods. There is one other case that requires discussion,
however, before we may feel at all confident of our knowledge. That is
the case of periodic chains. A periodic chain is a recurrent chain
with the property that if the system occupies some state at the present
time it will be certain to occupy that same state after $p$, $2p$, $3p$, $4p, \ldots$
transitions, where $p$ is an integer describing the periodicity of the system.
The simplest periodic system is the two-state system of period 2 with
transition matrix

$$P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$

and transition diagram (not reproduced here).

If the system is started in state 1, it will be once more in state 1 after
even numbers of transitions and in state 2 after odd numbers of transi-
tions.  There is no need for analysis to understand this type of be-
havior, but let us investigate the results obtained by the transformation
method.  We have

$$(I - zP) = \begin{bmatrix} 1 & -z \\ -z & 1 \end{bmatrix}$$

and

$$(I - zP)^{-1} = \begin{bmatrix} \dfrac{1}{1 - z^2} & \dfrac{z}{1 - z^2} \\[8pt] \dfrac{z}{1 - z^2} & \dfrac{1}{1 - z^2} \end{bmatrix} = \frac{1}{1 - z}\begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix} + \frac{1}{1 + z}\begin{bmatrix} \tfrac{1}{2} & -\tfrac{1}{2} \\ -\tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix}$$

The response matrix $H(n)$ is thus

$$H(n) = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix} + (-1)^n \begin{bmatrix} \tfrac{1}{2} & -\tfrac{1}{2} \\ -\tfrac{1}{2} & \tfrac{1}{2} \end{bmatrix} = S + T(n)$$

This $H(n)$ does represent the solution to the problem because, for
example, if the system is started in state 1, $\pi_1(n) = \tfrac{1}{2}[1 + (-1)^n]$ and
$\pi_2(n) = \tfrac{1}{2}[1 - (-1)^n]$.  These expressions produce the same results that
we saw intuitively. However, what is to be the interpretation placed on
$S$ and $T(n)$ in this problem?  The matrix $T(n)$ contains components that
do not die away for large $n$, but rather continue to oscillate indefinitely.
On the other hand, $T(n)$ can still be considered as a perturbation to the
set of limiting state probabilities defined by $S$.  The best interpretation
of the limiting state probabilities of $S$ is that they represent the proba-
bility that the system will be found in each of its states at a time
chosen at random in the future.  For periodic processes, the original
concept of limiting state probabilities is not relevant since we know
the state of the system at all future times.  However, in many practical
cases, the random-time interpretation introduced above is meaningful
and useful.  Whenever we consider the limiting state probabilities of a
periodic Markov process, we shall use them in this sense.  Incidentally,
if Eqs. 1.5 and 1.6 are used to find the limiting state probabilities, they
yield $\pi_1 = \pi_2 = \tfrac{1}{2}$, in agreement with our understanding.
We have now investigated the behavior of Markov processes using the
mechanism of the z-transform. This particular approach is useful
because it circumvents the difficulties that arise because of multiple
characteristic values of stochastic matrices. Many otherwise elegant
discussions of Markov processes based on matrix theory are markedly
complicated by this difficulty. The structure of the transform method
can be even more appreciated if use is made of the work that has been
done on signal-flow-graph models of Markov processes, but this is
beyond our present scope; references 3 and 4 may be useful.
The following chapter will begin the analysis of Markov processes
that have economic rewards associated with state transitions.
Markov Processes with Rewards

Suppose that an $N$-state Markov process earns $r_{ij}$ dollars when it
makes a transition from state $i$ to state $j$.  We call $r_{ij}$ the "reward"
associated with the transition from $i$ to $j$.  The set of rewards for the
process may be described by a reward matrix $R$ with elements $r_{ij}$.
The rewards need not be in dollars; they could be voltage levels, units
of production, or any other physical quantity relevant to the problem.
In most of our work, however, we shall find that economic units such as
dollars will be the pertinent interpretation.
The Markov process now generates a sequence of rewards as it
makes transitions from state to state.  The reward is thus a random
variable with a probability distribution governed by the probabilistic
relations of the Markov process.  Recalling our frog pond, we can
picture a game where the player receives an amount of money $r_{ij}$
if the frog jumps from pad $i$ to pad $j$.  As some of the $r_{ij}$ might
be negative, the player on occasion would have to contribute to the
pot.

Solution by Recurrence Relation

One question we might ask concerning this game is: What will be
the player's expected winnings in the next $n$ jumps if the frog is now in
state $i$ (sitting on the lily pad numbered $i$)?  To answer this question,
let us define $v_i(n)$ as the expected total earnings in the next $n$ transitions
if the system is now in state $i$.
Some reflection on this definition allows us to write the recurrence
relation

$$v_i(n) = \sum_{j=1}^{N} p_{ij}[r_{ij} + v_j(n-1)] \qquad i = 1, 2, \ldots, N \quad n = 1, 2, 3, \ldots \tag{2.1}$$

If the system makes a transition from $i$ to $j$, it will earn the amount $r_{ij}$
plus the amount it expects to earn if it starts in state $j$ with one move
fewer remaining.  As shown in Eq. 2.1, these earnings from a transition
to $j$ must be weighted by the probability of such a transition, $p_{ij}$, to
obtain the total expected earnings.
Notice that Eq. 2.1 may be written in the form

$$v_i(n) = \sum_{j=1}^{N} p_{ij}r_{ij} + \sum_{j=1}^{N} p_{ij}v_j(n-1) \qquad i = 1, 2, \ldots, N \quad n = 1, 2, 3, \ldots \tag{2.2}$$
so that if a quantity $q_i$ is defined by

$$q_i = \sum_{j=1}^{N} p_{ij}r_{ij} \qquad i = 1, 2, \ldots, N \tag{2.3}$$

Eq. 2.1 takes the form

$$v_i(n) = q_i + \sum_{j=1}^{N} p_{ij}v_j(n-1) \qquad i = 1, 2, \ldots, N \quad n = 1, 2, 3, \ldots \tag{2.4}$$

The quantity $q_i$ may be interpreted as the reward to be expected in
the next transition out of state $i$; it will be called the expected immediate
reward for state $i$.  In terms of the frog-jumping game, $q_i$ is the amount
that the player expects to receive from the next jump of the frog if it
is now on lily pad $i$.  Rewriting Eq. 2.1 as Eq. 2.4 shows us that it is
not necessary to specify both a $P$ matrix and an $R$ matrix in order to
determine the expected earnings of the system.  All that is needed is a
$P$ matrix and a $q$ column vector with $N$ components $q_i$.  The reduction
in data storage is significant when large problems are to be solved on a
digital computer.  In vector form, Eq. 2.4 may be written as

$$v(n) = q + Pv(n-1) \qquad n = 1, 2, 3, \ldots \tag{2.5}$$

where $v(n)$ is a column vector with $N$ components $v_i(n)$, called the
total-value vector.

The Toymaker Example


To investigate the problem of expected earnings in greater detail, let
us add a reward structure to the toymaker’s problem. Suppose that
when the toymaker has a successful toy (the system is in state 1) and
again has a successful toy the following week (the system makes a
transition from state 1 to state 1) he earns a reward of 9 units for that
week (perhaps $900).  Thus $r_{11}$ is equal to 9.  If the week has resulted
in a transition from unsuccessful to unsuccessful (state 2 to state 2),
then the toymaker loses 7 units, or $r_{22} = -7$.  Finally, if the week has
produced a change from unsuccessful to successful or from successful
to unsuccessful, the earnings are 3 units, so that $r_{21} = r_{12} = 3$.  The
reward matrix $R$ is thus

$$R = \begin{bmatrix} 9 & 3 \\ 3 & -7 \end{bmatrix}$$

Recalling that

$$P = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ \tfrac{2}{5} & \tfrac{3}{5} \end{bmatrix}$$

we can find $q$ from Eq. 2.3:

$$q = \begin{bmatrix} 6 \\ -3 \end{bmatrix}$$
Inspection of the q vector shows that if the toymaker has a successful
toy he expects to make 6 units in the following week; if he has no
successful toy, the expected loss for the next week is 3 units.
Suppose that the toymaker knows that he is going to go out of busi-
ness after $n$ weeks.  He is interested in determining the amount of
money he may expect to make in that time, depending on whether or
not he now has a successful toy.  The recurrence relations Eq. 2.4
or Eq. 2.5 may be directly applied to this problem, but a set of
boundary values $v_i(0)$ must be specified.  These quantities represent
the expected return the toymaker will receive on the day he ceases
operation.  If the business is sold to another party, $v_1(0)$ would be the
purchase price if the firm had a successful toy on the selling date, and
$v_2(0)$ would be the purchase price if the business were not so situated
on that day.  Arbitrarily, for computational convenience, the boundary
values $v_i(0)$ will be set equal to zero in our example.
We may now use Eq. 2.4 to prepare Table 2.1, which shows $v_i(n)$ for
each state and for several values of $n$.

Table 2.1.  TOTAL EXPECTED REWARD FOR TOYMAKER AS A FUNCTION OF
STATE AND NUMBER OF WEEKS REMAINING

   n    =   0      1      2       3       4        5
  v1(n)     0      6      7.5     8.55    9.555    10.5555
  v2(n)     0     -3     -2.4    -1.44   -0.444     0.5556
Thus, if the toymaker is four weeks from his shutdown time, he expects
to make 9.555 units in the remaining time if he now has a successful
toy and to lose 0.444 unit if he does not have one.  Note that $v_1(n) - v_2(n)$
seems to be approaching 10 as $n$ becomes large, whereas both
$v_1(n) - v_1(n-1)$ and $v_2(n) - v_2(n-1)$ seem to approach the value
1 for large $n$.

[Fig. 2.1.  Toymaker's problem; total expected reward in each state as a function
of weeks remaining: the points $v_1(n)$ and $v_2(n)$ (earnings in monetary units) are
plotted against $n$ (weeks remaining), together with their asymptotes, each of
slope 1.  Figure not reproduced.]

In other words, when $n$ is large, having a successful toy seems to be
worth about 10 units more than having an unsuccessful


one, as far as future return is concerned.  Also, for large $n$, an additional
week's operation brings about 1 unit of profit on the average.  The
behavior of $v_i(n)$ for large $n$ is even more clear when the data of Table
2.1 are plotted as Fig. 2.1.  The distance between the asymptotes to
the value expressions is 10 units, whereas the slope of each asymptote is
1 unit. We shall be very much interested in the asymptotic behavior
of total-earnings functions.
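As a concrete illustration of the recurrence of Eq. 2.5, the following minimal Python sketch (not part of the original text) reproduces Table 2.1 for the toymaker's transition and reward matrices:

```python
import numpy as np

P = np.array([[0.5, 0.5],
              [0.4, 0.6]])
R = np.array([[9.0,  3.0],
              [3.0, -7.0]])

q = (P * R).sum(axis=1)        # expected immediate rewards, Eq. 2.3: q = [6, -3]

v = np.zeros(2)                # boundary values v(0) = 0
for n in range(1, 6):
    v = q + P @ v              # Eq. 2.5: v(n) = q + P v(n-1)
    print(f"n={n}:  v1={v[0]:.4f}  v2={v[1]:.4f}")
# Reproduces Table 2.1; note v1(n) - v2(n) -> 10 and the successive
# differences -> 1, the gain of the process.
```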

z-Transform Analysis of the Markov Process with Rewards

Let us analyze the Markov process with rewards by means of the
z-transformation.  The z-transform of the total-value vector $v(n)$ will
be called $v(z)$, where $v(z) = \sum_{n=0}^{\infty} v(n)z^n$.  Equation 2.5 may be written
as

$$v(n+1) = q + Pv(n) \qquad n = 0, 1, 2, \ldots \tag{2.6}$$

If we take the z-transformation of this equation, we obtain

$$z^{-1}[v(z) - v(0)] = \frac{1}{1 - z}\,q + Pv(z)$$

$$v(z) - v(0) = \frac{z}{1 - z}\,q + zPv(z)$$

$$(I - zP)v(z) = \frac{z}{1 - z}\,q + v(0)$$

or

$$v(z) = \frac{z}{1 - z}(I - zP)^{-1}q + (I - zP)^{-1}v(0) \tag{2.7}$$

Finding the transform $v(z)$ requires the inverse of the matrix $(I - zP)$,
which also appeared in the solution for the state probabilities.  This is
not surprising since the presence of rewards does not affect the proba-
bilistic structure of the process.
For the toymaker's problem, $v(0)$ is identically zero, so that Eq. 2.7
reduces to

$$v(z) = \frac{z}{1 - z}(I - zP)^{-1}q \tag{2.8}$$

For the toymaker process, the inverse matrix $(I - zP)^{-1}$ was previously
found to be
$$(I - zP)^{-1} = \frac{1}{1 - z}\begin{bmatrix} \tfrac{4}{9} & \tfrac{5}{9} \\ \tfrac{4}{9} & \tfrac{5}{9} \end{bmatrix} + \frac{1}{1 - \tfrac{1}{10}z}\begin{bmatrix} \tfrac{5}{9} & -\tfrac{5}{9} \\ -\tfrac{4}{9} & \tfrac{4}{9} \end{bmatrix}$$

Let the matrix $F(n)$ be the inverse transform of $[z/(1 - z)](I - zP)^{-1}$.
Then

$$F(n) = n\begin{bmatrix} \tfrac{4}{9} & \tfrac{5}{9} \\ \tfrac{4}{9} & \tfrac{5}{9} \end{bmatrix} + \tfrac{10}{9}\left[1 - \left(\tfrac{1}{10}\right)^n\right]\begin{bmatrix} \tfrac{5}{9} & -\tfrac{5}{9} \\ -\tfrac{4}{9} & \tfrac{4}{9} \end{bmatrix}$$

The total-value vector $v(n)$ is then $F(n)q$ by inverse transformation of
Eq. 2.8, and, since $q = \begin{bmatrix} 6 \\ -3 \end{bmatrix}$,

$$v(n) = n\begin{bmatrix} 1 \\ 1 \end{bmatrix} + \tfrac{10}{9}\left[1 - \left(\tfrac{1}{10}\right)^n\right]\begin{bmatrix} 5 \\ -4 \end{bmatrix}$$

In other words,

$$v_1(n) = n + \tfrac{50}{9}\left[1 - \left(\tfrac{1}{10}\right)^n\right] \qquad v_2(n) = n - \tfrac{40}{9}\left[1 - \left(\tfrac{1}{10}\right)^n\right] \tag{2.9}$$


We have thus found a closed-form expression for the total expected
earnings starting in each state.
Equations 2.9 could be used to construct Table 2.1 or to draw Fig. 2.1.
We see that, as $n$ becomes very large, $v_1(n)$ takes the form $n + \tfrac{50}{9}$,
whereas $v_2(n)$ takes the form $n - \tfrac{40}{9}$.  The asymptotic relations

$$v_1(n) = n + \tfrac{50}{9} \qquad v_2(n) = n - \tfrac{40}{9}$$

are the equations of the asymptotes shown in Fig. 2.1.  Note that, for
large $n$, both $v_1(n)$ and $v_2(n)$ have slope 1 and $v_1(n) - v_2(n) = 10$, as
we saw previously.  For large $n$, the slope of $v_1(n)$ or $v_2(n)$ is the average
reward per transition, in this case 1.  If the toymaker were many,
many weeks from shutdown, he would expect to make 1 unit of return
per week.  We call the average reward per transition the "gain"; in
this case the gain is 1 unit.

Asymptotic Behavior
What can be said in general about the total expected earnings of a
process of long duration?  To answer this question, let us return to
Eq. 2.7:

$$v(z) = \frac{z}{1 - z}(I - zP)^{-1}q + (I - zP)^{-1}v(0) \tag{2.7}$$

It was shown in Chapter 1 that the inverse transform of $(I - zP)^{-1}$
assumed the form $S + T(n)$.  In this expression, $S$ is a stochastic
matrix whose $i$th row is the vector of limiting state probabilities if the
system is started in the $i$th state, and $T(n)$ is a set of differential matrices
with geometrically decreasing coefficients.  We shall write this relation
in the form

$$(I - zP)^{-1} = \frac{1}{1 - z}S + \mathcal{T}(z) \tag{2.10}$$

where $\mathcal{T}(z)$ is the z-transform of $T(n)$.  If we substitute Eq. 2.10 into
Eq. 2.7, we obtain

$$v(z) = \frac{z}{(1 - z)^2}Sq + \frac{z}{1 - z}\mathcal{T}(z)q + \frac{1}{1 - z}Sv(0) + \mathcal{T}(z)v(0) \tag{2.11}$$

By inspection of this equation for $v(z)$, we can identify the components
of $v(n)$.  The term $[z/(1 - z)^2]Sq$ represents a ramp of magnitude $Sq$.
Partial-fraction expansion shows that the term $[z/(1 - z)]\mathcal{T}(z)q$ repre-
sents a step of magnitude $\mathcal{T}(1)q$ plus geometric terms that tend to
zero as $n$ becomes very large.  The quantity $[1/(1 - z)]Sv(0)$ is a step
of magnitude $Sv(0)$, whereas $\mathcal{T}(z)v(0)$ represents geometric components
that vanish when $n$ is large.  The asymptotic form that $v(n)$ assumes
for large $n$ is thus

$$v(n) = nSq + \mathcal{T}(1)q + Sv(0) \tag{2.12}$$

If a column vector $g$ with components $g_i$ is defined by $g = Sq$, then

$$v(n) = ng + \mathcal{T}(1)q + Sv(0) \tag{2.13}$$

The quantity $g_i$ is equal to the sum of the immediate rewards $q_j$
weighted by the limiting state probabilities that result if the system is
started in the $i$th state, or

$$g_i = \sum_{j=1}^{N} s_{ij}q_j$$

It is also the average return per transition of the system if it is started
in the $i$th state and allowed to make many transitions; we may call $g_i$
the gain of the $i$th state.  Equivalently, it is the slope of the asymptote
of $v_i(n)$.  Since all member states of the same recurrent chain have
identical rows in the $S$ matrix, such states all have the same gain.
If there is only one recurrent chain in the system so that it is completely
ergodic, then all rows of $S$ are the same and equal to the limiting state
probability distribution for the process, $\pi$.  It follows that in this case
all states have the same gain, say $g$, and that

$$g = \sum_{i=1}^{N} \pi_i q_i \tag{2.14}$$

The column vector $\mathcal{T}(1)q + Sv(0)$ represents the intercepts at $n = 0$
of the asymptotes of $v(n)$.  These intercepts are jointly determined by
the transient behavior of the process $\mathcal{T}(1)q$ and by the boundary effect
$Sv(0)$.  We shall denote by $v_i$ the asymptotic intercepts of $v_i(n)$, so
that for large $n$

$$v_i(n) = ng_i + v_i \qquad i = 1, 2, \ldots, N \tag{2.15}$$

The column vector with components $v_i$ may be designated by $v$, so that
$v = \mathcal{T}(1)q + Sv(0)$.  Equations 2.15 then become

$$v(n) = ng + v \qquad \text{for large } n \tag{2.16}$$

If the system is completely ergodic, then, of course, all $g_i = g$, and
we may call $g$ the gain of the process rather than the gain of a state, so
that Eqs. 2.15 become

$$v_i(n) = ng + v_i \qquad i = 1, 2, \ldots, N \quad \text{for large } n \tag{2.17}$$
By way of illustration for the toymaker's problem,

$$\mathcal{T}(z) = \frac{1}{1 - \tfrac{1}{10}z}\begin{bmatrix} \tfrac{5}{9} & -\tfrac{5}{9} \\ -\tfrac{4}{9} & \tfrac{4}{9} \end{bmatrix}$$

so that

$$\mathcal{T}(1) = \tfrac{10}{9}\begin{bmatrix} \tfrac{5}{9} & -\tfrac{5}{9} \\ -\tfrac{4}{9} & \tfrac{4}{9} \end{bmatrix} = \begin{bmatrix} \tfrac{50}{81} & -\tfrac{50}{81} \\ -\tfrac{40}{81} & \tfrac{40}{81} \end{bmatrix}$$

Since

$$q = \begin{bmatrix} 6 \\ -3 \end{bmatrix} \qquad \mathcal{T}(1)q = \begin{bmatrix} \tfrac{50}{9} \\ -\tfrac{40}{9} \end{bmatrix} \qquad g = Sq = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$

By assumption, $v(0) = 0$; then

$$v = \mathcal{T}(1)q + Sv(0) = \begin{bmatrix} \tfrac{50}{9} \\ -\tfrac{40}{9} \end{bmatrix}$$

Therefore, from Eqs. 2.15,

$$v_1(n) = n + \tfrac{50}{9} \qquad v_2(n) = n - \tfrac{40}{9} \qquad \text{for large } n$$

as we found before.
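A numerical check of the asymptotic form of Eq. 2.17 (an added illustration): iterate the exact recurrence of Eq. 2.5 for many stages and compare with $ng + v$.

```python
import numpy as np

P = np.array([[0.5, 0.5], [0.4, 0.6]])
q = np.array([6.0, -3.0])

g = 1.0                                   # gain of the toymaker process
v_intercept = np.array([50/9, -40/9])     # asymptotic intercepts found above

v = np.zeros(2)
for n in range(1, 51):
    v = q + P @ v                         # exact total expected reward, Eq. 2.5
print(v - (50 * g + v_intercept))         # differences are essentially zero
```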
We have now discussed the analysis of Markov processes with
rewards. Special attention has been paid to the asymptotic behavior
of the total expected reward function, for reasons that will become
clear in later chapters.
The Solution of the
Sequential Decision Process
by Value Iteration

The discussion of Markov processes with rewards has been the means
to an end.  This end is the analysis of decisions in sequential processes
that are Markovian in nature. This chapter will describe the type of
process under consideration and will show a method of solution based on
recurrence relations.

Introduction of Alternatives

The toymaker's problem that we have been discussing may be de-
scribed as follows.  If the toymaker is in state 1 (successful toy), he
makes transitions to state 1 and state 2 (unsuccessful toy) according
to a probability distribution $[p_{1j}] = [0.5 \quad 0.5]$ and earns rewards
according to the reward distribution $[r_{1j}] = [9 \quad 3]$.  If the toymaker
is in state 2, the pertinent probability and reward distributions are
$[p_{2j}] = [0.4 \quad 0.6]$ and $[r_{2j}] = [3 \quad -7]$.  This process has been analyzed
in detail; we know how to calculate the expected earnings for any
number of transitions before the toymaker goes out of business.
Suppose now that the toymaker has other courses of action open to
him that will change the probabilities and rewards governing the process.
For example, when the toymaker has a successful toy, he may use
advertising to decrease the chance that the toy will fall from favor.
However, because of the advertising cost, the profits to be expected
per week will generally be lower.  To be specific, suppose that the
probability distribution for transitions from state 1 will be
$[p_{1j}] = [0.8 \quad 0.2]$ when advertising is employed, and that the
corresponding reward distribution will be $[r_{1j}] = [4 \quad 4]$.  The toymaker
now has two alternatives when he is in state 1: He may use no advertis-
ing or he may advertise.  We shall call these alternatives 1 and 2,
respectively.  Each alternative has its associated reward and prob-
ability distributions for transitions out of state 1.  We shall use a super-
script $k$ to indicate the alternatives in a state.  Thus, for alternative
1 in state 1, $[p_{1j}^1] = [0.5 \quad 0.5]$, $[r_{1j}^1] = [9 \quad 3]$; and for alternative 2 in
state 1, $[p_{1j}^2] = [0.8 \quad 0.2]$, $[r_{1j}^2] = [4 \quad 4]$.
There may also be alternatives in state 2 of the system (the company
has an unsuccessful toy).  Increased research expenditures may in-
crease the probability of obtaining a successful toy, but they will also
increase the cost of being in state 2.  Under the original alternative
in state 2, which we may call alternative 1 and interpret as a limited
research alternative, the transition probability distribution was
$[p_{2j}^1] = [0.4 \quad 0.6]$, and the reward distribution was $[r_{2j}^1] = [3 \quad -7]$.
Under the research alternative, alternative 2, the probability and reward
distributions might be $[p_{2j}^2] = [0.7 \quad 0.3]$ and $[r_{2j}^2] = [1 \quad -19]$.  Thus,
for alternative 1 in state 2,

$$[p_{2j}^1] = [0.4 \quad 0.6] \qquad [r_{2j}^1] = [3 \quad -7]$$

and for alternative 2 in state 2,

$$[p_{2j}^2] = [0.7 \quad 0.3] \qquad [r_{2j}^2] = [1 \quad -19]$$


The concept of alternative for an $N$-state system is presented
graphically in Fig. 3.1.  [Fig. 3.1.  Diagram of states and alternatives:
present state $i$ of the system on the left, succeeding state $j$ on the right.
Figure not reproduced.]  In this diagram, two alternatives have been
allowed in the first state.  If we pick alternative 1 ($k = 1$), then the
transition from state 1 to state 1 will be governed by the probability
$p_{11}^1$, the transition from state 1 to state 2 will be governed by $p_{12}^1$,
from 1 to 3 by $p_{13}^1$, and so on.  The rewards associated with these
transitions are $r_{11}^1$, $r_{12}^1$, $r_{13}^1$, and so on.  If the second alternative in
state 1 is chosen ($k = 2$), then $p_{11}^2$, $p_{12}^2$, $p_{13}^2, \ldots, p_{1N}^2$ and $r_{11}^2$,
$r_{12}^2$, $r_{13}^2, \ldots, r_{1N}^2$, and so on, would be the pertinent probabilities and
rewards, respectively.  In Fig. 3.1 we see that, if alternative 1 in state
1 is selected, we make transitions according to the solid lines; if alterna-
tive 2 is chosen, transitions are made according to the dashed lines.
The number of alternatives in any state must be finite, but the number
of alternatives in each state may be different from the numbers in
other states.

The Toymaker’s Problem Solved by Value Iteration


The alternatives for the toymaker are presented in Table 3.1.  The
quantity $q_i^k$ is the expected reward from a single transition from state
$i$ under alternative $k$.  Thus, $q_i^k = \sum_{j=1}^{N} p_{ij}^k r_{ij}^k$.

Table 3.1.  THE TOYMAKER'S SEQUENTIAL DECISION PROBLEM

                                            Transition                   Expected
                                            Probabilities   Rewards      Immediate
  State                  Alternative        p_i1^k  p_i2^k  r_i1^k r_i2^k  Reward q_i^k
  1 (Successful toy)     1 (No advertising)  0.5     0.5      9      3        6
                         2 (Advertising)     0.8     0.2      4      4        4
  2 (Unsuccessful toy)   1 (No research)     0.4     0.6      3     -7       -3
                         2 (Research)        0.7     0.3      1    -19       -5

Suppose that the toymaker has $n$ weeks remaining before his business
will close down.  We shall call $n$ the number of stages remaining in the
process.  The toymaker would like to know as a function of $n$ and his
present state what alternative he should use for the next transition
(week) in order to maximize the total earnings of his business over the
$n$-week period.
We shall define $d_i(n)$ as the number of the alternative in the $i$th state
that will be used at stage $n$.  We call $d_i(n)$ the "decision" in state $i$ at
the $n$th stage.  When $d_i(n)$ has been specified for all $i$ and all $n$, a
"policy" has been determined.  The optimal policy is the one that
maximizes total expected return for each $i$ and $n$.
To analyze this problem, let us redefine $v_i(n)$ as the total expected
return in $n$ stages starting from state $i$ if an optimal policy is followed.
It follows that for any $i$

$$v_i(n+1) = \max_k \sum_{j=1}^{N} p_{ij}^k[r_{ij}^k + v_j(n)] \qquad n = 0, 1, 2, \ldots \tag{3.1}$$

Suppose that we have decided which alternatives to follow at stages $n$,
$n - 1, \ldots, 1$ in such a way that we have maximized $v_j(n)$ for $j = 1, 2,
\ldots, N$.  We are at stage $n + 1$ and are seeking the alternative we
should follow in the $i$th state in order to make $v_i(n + 1)$ as large as
possible; this is $d_i(n + 1)$.  If we used alternative $k$ in the $i$th state,
then our expected return for $n + 1$ stages would be

$$\sum_{j=1}^{N} p_{ij}^k[r_{ij}^k + v_j(n)] \tag{3.2}$$

by the argument of Chapter 2.  We are seeking the alternative in the $i$th
state that will maximize Expression 3.2.  For this alternative, $v_i(n + 1)$
will be equal to Expression 3.2; thus we have derived Eq. 3.1,* which we
may call the value iteration equation.  Equation 3.1 may be written
in terms of the expected immediate rewards from each alternative in
the form

$$v_i(n+1) = \max_k \left[ q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j(n) \right] \tag{3.3}$$

* Equation 3.1 is the application of the "Principle of Optimality" of dynamic
programming to the Markovian decision process; this and other applications are
discussed by Bellman.


The use of the recursive relation (Eq. 3.3) will tell the toymaker
which alternative to use in each state at each stage and will also provide
him with his expected future earnings at each stage of the process.
To apply this relation, we must specify $v_i(0)$, the boundary condition
for the process.  We shall assign the value 0 to both $v_1(0)$ and $v_2(0)$,
as we did in Chapter 2. Now Eq. 3.3 will be used to solve the toy-
maker’s problem as presented in Table 3.1. The results are shown in
Table 3.2.

Table 3.2.  TOYMAKER'S PROBLEM SOLVED BY VALUE ITERATION

   n    =   0      1      2      3       4
  v1(n)     0      6      8.2    10.22   12.222
  v2(n)     0     -3     -1.7     0.23    2.223
  d1(n)     -      1      2       2       2
  d2(n)     -      1      2       2       2
The calculation will be illustrated by finding the alternatives and
rewards at the first stage.  Since $v(0) = 0$, $v_i(1) = \max_k q_i^k$.  The
alternative to be used in state 1 at the first stage is that with the largest
expected immediate reward.  Since $q_1^1 = 6$ and $q_1^2 = 4$, the first
alternative in state 1 is the better one to use at the first stage, and
$v_1(1) = 6$.  Similarly, $v_2(1) = \max_k q_2^k$, and, since $q_2^1 = -3$ and
$q_2^2 = -5$, the first alternative in state 2 is the better alternative and
$v_2(1) = -3$.  Having now calculated $v_i(1)$ for all states, we may
again use Eq. 3.3 to calculate $v_i(2)$ and to determine the alternatives
to be used at the second stage.  The process may be continued for as
many $n$ as we care to calculate.
Suppose that the toymaker has three weeks remaining and that he is
in state 1.  Then we see from Table 3.2 that he expects to make 10.22
units of reward in this period of time, $v_1(3) = 10.22$, and that he should
advertise during the coming week, $d_1(3) = 2$.  We may similarly
interpret any other situation in which the toymaker may find himself.
Note that for $n = 2$, 3, and 4, the second alternative in each state is
to be preferred.  This means that the toymaker is better advised to
advertise and to carry on research in spite of the costs of these activities.
The changes produced in the transition probabilities more than make
up for the additional cost.  It has been shown [1] that the iteration
process (Eq. 3.3) will converge on a best alternative for each state as $n$
becomes very large.  For this problem the convergence seems to have
taken place at $n = 2$, and the second alternative in each state has been
chosen.  However, in many problems it is difficult to tell when con-
vergence has been obtained.
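As a concrete illustration of Eq. 3.3, the following minimal Python sketch (an added illustration, not part of the original text) carries out the value iteration for the toymaker's data of Table 3.1 and reproduces Table 3.2:

```python
import numpy as np

# Table 3.1: alternatives[i][k] = (transition probabilities, expected immediate reward)
alternatives = [
    [(np.array([0.5, 0.5]),  6.0),    # state 1, alternative 1 (no advertising)
     (np.array([0.8, 0.2]),  4.0)],   # state 1, alternative 2 (advertising)
    [(np.array([0.4, 0.6]), -3.0),    # state 2, alternative 1 (no research)
     (np.array([0.7, 0.3]), -5.0)],   # state 2, alternative 2 (research)
]

v = np.zeros(2)                       # boundary condition v(0) = 0
for n in range(1, 5):
    new_v, decision = np.zeros(2), [0, 0]
    for i, alts in enumerate(alternatives):
        returns = [q + p @ v for p, q in alts]        # Eq. 3.3 for each alternative k
        decision[i] = int(np.argmax(returns)) + 1     # best alternative (1-based)
        new_v[i] = max(returns)
    v = new_v
    print(f"n={n}:  v={np.round(v, 4)}  d={decision}")
# Output matches Table 3.2: d = [1, 1] at n = 1 and d = [2, 2] thereafter.
```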

Evaluation of the Value-Iteration Approach

The method that has just been described for the solution of the
sequential process may be called the value-iteration method because the
$v_i(n)$ or "values" are determined iteratively.  This method has some
important limitations. It must be clear to the reader that not many
enterprises or processes operate with the specter of termination so
imminent. For the most part, systems operate on an indefinite basis
with no clearly defined end point. It does not seem efficient to have
to iterate $v_i(n)$ for $n = 1, 2, 3$, and so forth, until we have a sufficiently
large $n$ that termination is very remote.  We would much rather have
a method that directed itself to the problem of analyzing processes of
indefinite duration, processes that will make many transitions before
termination.
Such a technique has been developed; it will be presented in the next
chapter.  Recall that, even if we were patient enough to solve the
long-duration process by value iteration, the convergence on the best
alternative in each state is asymptotic and difficult to measure analyti-
cally. The method to be presented circumvents this difficulty.
Even though the value-iteration method is not particularly suited
to long-duration processes, it is relevant to those systems that face
termination in a relatively short time. However, it is important to
recognize that often the process need not have many stages before a
long-duration analysis becomes meaningful.
The Policy-Iteration Method
for the Solution of
Sequential Decision Processes

Consider a completely ergodic $N$-state Markov process described by
a transition-probability matrix $P$ and a reward matrix $R$.  Suppose
that the process is allowed to make transitions for a very, very long
time and that we are interested in the earnings of the process. The total
expected earnings depend upon the total number of transitions that the
system undergoes, so that this quantity grows without limit as the
number of transitions increases. A more useful quantity is the average
earnings of the process per unit time. It was shown in Chapter 2
that this quantity is meaningful if the process is allowed to make many
transitions; it was called the gain of the process.
Since the system is completely ergodic, the limiting state proba-
bilities $\pi_i$ are independent of the starting state, and the gain $g$ of the
system is

$$g = \sum_{i=1}^{N} \pi_i q_i \tag{2.14}$$

where $q_i$ is the expected immediate return in state $i$ defined by Eq. 2.3.
Every completely ergodic Markov process with rewards will have a
gain given by Eq. 2.14. If we have several such processes and we
should like to know which would be most profitable on a long-term
basis, we could find the gain of each and then select the one with highest
gain.
The sequential decision process of Chapter 3 requires consideration
of many possible processes because the alternatives in each state may
be selected independently. By way of illustration, consider the
three-dimensional array of Fig. 4.1, which presents in graphical form
the states and alternatives.

[Fig. 4.1.  A possible five-state problem: a three-dimensional array with axes for
present state $i$, succeeding state $j$, and alternative $k$, whose cells hold the
parameters $p_{ij}^k$ and $r_{ij}^k$.  Figure not reproduced.]

The array as drawn illustrates a five-state problem that has four
alternatives in the first state, three in the second, two in the third, one
in the fourth, and five in the fifth. Entered on the face of the array
are the parameters for the first alternative in each state, the second
row in depth of the array contains the parameters for the second
alternative in each state, and so forth. An X indicates that we have
chosen a particular alternative in a state with a probability and reward
distribution that will govern the behavior of the system at any time
that it enters that state. The alternative thus selected is called the
"decision" for that state; it is no longer a function of n. The set of
X’s or the set of decisions for all states is called a “‘policy.”’ Selection
of a policy thus determines the Markov process with rewards that will
describe the operations of the system. The policy indicated in the
diagram requires that the probability and reward matrices for the system
be composed of the first alternative in state 4, the second alternative in
states 2 and 3, and the third alternative in states 1 and 5. It is possible
to describe the policy by a decision vector d whose elements represent
the number of the alternative selected in each state. In this case

i=" I

w
WRNN
An optimal policy is defined as a policy that maximizes the gain,
or average return per transition.* In the five-state problem diagrammed
in Fig. 4.1, there are 4 × 3 × 2 × 1 × 5 = 120 different policies. It is
conceivable that we could find the gain for each of these policies in order
to find the policy with the largest gain. However feasible this may be
for 120 policies, it becomes unfeasible for very large problems. For
example, a problem with 50 states and 50 alternatives in each state
contains 50^50 (about 10^85) policies.

* We shall assume for the moment that all policies produce completely ergodic
Markov processes. This assumption will be relaxed in Chapter 6.
The policy-iteration method that will be described will find the optimal
policy in a small number of iterations. It is composed of two parts, the
value-determination operation and the policy-improvement routine.
We shall first discuss the value-determination operation.

The Value-Determination Operation


Suppose that we are operating the system under a given policy so that
we have specified a given Markov process with rewards. If this process
were to be allowed to operate for n stages or transitions, we could define
v_i(n) as the total expected reward that the system will earn in n moves if
it starts from state i under the given policy.
The quantity v_i(n) must obey the recurrence relation (Eq. 2.4)
derived in Chapter 2:

v_i(n) = q_i + \sum_{j=1}^{N} p_{ij} v_j(n - 1)        i = 1, 2, ..., N;  n = 1, 2, 3, ...        (2.4)

There is no need for a superscript k to appear in this equation because
the establishment of a policy has defined the probability and reward
matrices that describe the system.
It was shown in Chapter 2 that for completely ergodic Markov processes
v_i(n) had the asymptotic form

v_i(n) = n g + v_i        i = 1, 2, ..., N, for large n        (2.17)

In this chapter we are concerned only with systems that have a very,
very large number of stages. We are then justified in using Eq. 2.17
in Eq. 2.4. We obtain the equations

n g + v_i = q_i + \sum_{j=1}^{N} p_{ij} [(n - 1) g + v_j]        i = 1, 2, ..., N

or

n g + v_i = q_i + (n - 1) g \sum_{j=1}^{N} p_{ij} + \sum_{j=1}^{N} p_{ij} v_j

Since \sum_{j=1}^{N} p_{ij} = 1, these equations become

g + v_i = q_i + \sum_{j=1}^{N} p_{ij} v_j        i = 1, 2, ..., N        (4.1)
We have now obtained a set of N linear simultaneous equations that
relate the quantities v_i and g to the probability and reward structure
of the process. However, a count of unknowns reveals N v_i's and one g
to be determined, a total of N + 1 unknowns. The nature of this
difficulty may be understood if we examine the result of adding a constant
a to all v_i in Eqs. 4.1. These equations become

g + v_i + a = q_i + \sum_{j=1}^{N} p_{ij} (v_j + a)

or

g + v_i = q_i + \sum_{j=1}^{N} p_{ij} v_j

The original equations have been obtained once more, so that the
absolute values of the v_i cannot be determined by the equations. However,
if we set one of the v_i equal to zero, perhaps v_N, then only N unknowns
are present, and Eqs. 4.1 may be solved for g and the remaining v_i.
Notice that the v_i so obtained will not be those defined by Eq. 2.17 but
will differ from them by a constant amount. Nevertheless, because the
true values of the v_i contain a constant term that depends on the boundary
values v_j(0), as shown in Eq. 2.13, they have no real significance in processes
that continue for a very large number of transitions. The v_i produced by
the solution of Eqs. 4.1 with v_N = 0 will be sufficient for our purposes;
they will be called the relative values of the policy.
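To make the computation concrete, the short sketch below (an illustration added to this edition, not the author's program) solves Eqs. 4.1 numerically for a single policy, taking the transition matrix P and reward vector q as NumPy arrays and fixing v_N = 0; the trailing example uses the toymaker's first policy from Chapter 3.

import numpy as np

def value_determination(P, q):
    """Solve g + v_i = q_i + sum_j p_ij v_j with v_N = 0 (illustrative sketch).

    P is the N x N transition matrix and q the expected-immediate-reward
    vector of one completely ergodic policy; returns the gain g and the
    vector of relative values v.
    """
    N = len(q)
    # Unknowns are v_1, ..., v_{N-1} and g; v_N is fixed at zero.
    A = np.column_stack([np.eye(N)[:, :N - 1] - P[:, :N - 1], np.ones(N)])
    x = np.linalg.solve(A, q)
    return x[-1], np.append(x[:-1], 0.0)

P = np.array([[0.5, 0.5], [0.4, 0.6]])      # toymaker, alternative 1 in both states
q = np.array([6.0, -3.0])
print(value_determination(P, q))            # gain 1.0, relative values [10., 0.]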
The relative values may be given a physical interpretation. Consider
the first two states, 1 and 2. For any large n, Eq. 2.17 yields

v_1(n) = n g + v_1        v_2(n) = n g + v_2

The difference v_1(n) - v_2(n) = v_1 - v_2 for any large n; it is equal
to the increase in the long-run expected earnings of the system caused
by starting in state 1 rather than state 2. Since the difference v_1 - v_2
is independent of any absolute level, the relative values may be used to
find the difference. In other words, the difference in the relative values
of the two states, v_1 - v_2, is equal to the amount that a rational man
would be just willing to pay in order to start his transitions from
state 1 rather than state 2 if he is going to operate the system for
many, many transitions. We shall exploit this interpretation of the
relative values in the examples of Chapter 5.
If Eqs. 4.1 are multiplied by \pi_i, the limiting state probability of the
ith state, and then summed over i, we obtain

g \sum_{i=1}^{N} \pi_i + \sum_{i=1}^{N} \pi_i v_i = \sum_{i=1}^{N} \pi_i q_i + \sum_{j=1}^{N} \sum_{i=1}^{N} \pi_i p_{ij} v_j

The basic equations (Eqs. 1.5 and 1.6) show that this expression is
equivalent to Eq. 2.14:

g = \sum_{i=1}^{N} \pi_i q_i        (2.14)

A relevant question at this point is this: If we are seeking only the gain
of the given policy, why did we not use Eq. 2.14 rather than Eq. 4.1?
As a matter of fact, why are we bothering to find such things as relative
values at all? The answer is first, that although Eq. 2.14 does find
the gain of the process it does not inform us about how to find a better
policy. We shall see that the relative values hold the key to finding
better and better policies and ultimately the best policy.
A second part of the answer is that the amount of computational
effort required to solve Eqs. 4.1 for the gain and relative values is about
the same as that required to find the limiting state probabilities using
Eqs. 1.5 and 1.6, because both computations require the solution of N
linear simultaneous equations. From the point of view of finding the
gain, Eqs. 2.14 and 4.1 are a standoff; however, Eqs. 4.1 are to be pre-
ferred because they yield the relative values that will be shown to be
necessary for policy improvement.
From the point of view of computation, it is interesting to note that
we have considerable freedom in scaling our rewards because of the
linearity of Eqs. 4.1. If the rewards r_{ij} of a process with gain g and
relative values v_i are modified by a linear transformation to yield new
rewards r_{ij}' in the sense r_{ij}' = a r_{ij} + b, then since

q_i = \sum_{j=1}^{N} p_{ij} r_{ij}

the new expected immediate rewards q_i' will be q_i' = a q_i + b, so that the
q_i are subjected to the same transformation. Equations 4.1 become

g' + v_i' = q_i' + \sum_{j=1}^{N} p_{ij} v_j'        i = 1, 2, ..., N

or

(a g + b) + (a v_i) = q_i' + \sum_{j=1}^{N} p_{ij} (a v_j)

and

g + v_i = q_i + \sum_{j=1}^{N} p_{ij} v_j

The gain g' of the process with transformed rewards is thus a g + b,
whereas the values v_i' of this process will equal a v_i. The effect of
changes in the units of measurement and in the absolute level of the
reward system upon the gain and relative values is easily calculated.
Thus we could normalize all rewards to be between 0 and 1, solve the
entire sequential decision process, and then use the inverse of our
original transformation to return the gain and relative values to their
original levels.
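As a quick numerical check of this scaling property (again an illustrative sketch rather than anything from the original computer program), the fragment below solves Eqs. 4.1 for the toymaker's first policy of Chapter 3 before and after an arbitrary transformation q_i' = a q_i + b and confirms that g' = a g + b and v_i' = a v_i.

import numpy as np

def gain_and_values(P, q):
    """Solve Eqs. 4.1 with v_N = 0 for one fixed policy (sketch)."""
    N = len(q)
    A = np.column_stack([np.eye(N)[:, :N - 1] - P[:, :N - 1], np.ones(N)])
    x = np.linalg.solve(A, q)
    return x[-1], np.append(x[:-1], 0.0)     # gain g, relative values v

P = np.array([[0.5, 0.5], [0.4, 0.6]])       # toymaker, alternative 1 in both states
q = np.array([6.0, -3.0])
a, b = 0.1, 0.7                              # an arbitrary linear transformation

g, v = gain_and_values(P, q)
g2, v2 = gain_and_values(P, a * q + b)       # q_i' = a q_i + b

print(g, v)                                  # 1.0  [10.  0.]
print(g2, a * g + b)                         # both 0.8
print(v2, a * v)                             # both [1.  0.]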
We have now shown that for a given policy we can find the gain and
relative values of that policy by solving the N linear simultaneous
equations (Eqs. 4.1) with vy = 0. We shall now show how the relative
values may be used to find a policy that has higher gain than the original
policy.

The Policy-Improvement Routine


In Chapter 3 we found that if we had an optimal policy up to stage n,
we could find the best alternative in the ith state at stage n + 1 by
maximizing

q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j(n)        (4.2)

over all alternatives in the ith state. For large n, we could substitute
Eq. 2.17 to obtain

q_i^k + \sum_{j=1}^{N} p_{ij}^k (n g + v_j)        (4.3)

as the test quantity to be maximized in each state. Since

\sum_{j=1}^{N} p_{ij}^k = 1

the contribution of ng and any additive constant in the v_j becomes a
test-quantity component that is independent of k. Thus, when we are
making our decision in state i, we can maximize

q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j        (4.4)

with respect to the alternatives in the ith state. Furthermore, we can
use the relative values (as given by Eqs. 4.1) for the policy that was
used up to stage n.
The policy-improvement routine may be summarized as follows: For
each state i, find the alternative k that maximizes the test quantity

q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j

using the relative values determined under the old policy. This
alternative k now becomes d_i, the decision in the ith state. A new
policy has been determined when this procedure has been performed
for every state.
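A sketch of the routine in code follows (illustrative only; the nested-list layout P_alts[i][k], q_alts[i][k] and the function name are assumptions of this edition, not the book's notation). The old decision is retained whenever it ties the best test quantity, in line with the stopping rule discussed later in this chapter.

import numpy as np

def improve_policy(P_alts, q_alts, v, old_policy, tol=1e-9):
    """One pass of the policy-improvement routine (illustrative sketch).

    P_alts[i][k] is the transition-probability row and q_alts[i][k] the
    expected immediate reward for alternative k in state i; v holds the
    relative values of the old policy.
    """
    new_policy = []
    for i, (rows, qs) in enumerate(zip(P_alts, q_alts)):
        tests = [qk + float(np.dot(row, v)) for row, qk in zip(rows, qs)]
        best = int(np.argmax(tests))
        # Do not abandon the old decision over a tie.
        if tests[old_policy[i]] >= tests[best] - tol:
            best = old_policy[i]
        new_policy.append(best)
    return new_policy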
We have now, by somewhat heuristic means, described a method for
finding a policy that is an improvement over our original policy. We
shall soon prove that the new policy will have a higher gain than the
old policy. First, however, we shall show how the value-determination
operation and the policy-improvement routine are combined in an
iteration cycle whose goal is the discovery of the policy that has highest
gain among all possible policies.

The Iteration Cycle


The basic iteration cycle may be diagrammed as shown in Figure 4.2.

Value-Determination Operation

Use p_{ij} and q_i for a given policy to solve

g + v_i = q_i + \sum_{j=1}^{N} p_{ij} v_j        i = 1, 2, ..., N

for all relative values v_i and g by setting v_N to zero.

Policy-Improvement Routine

For each state i, find the alternative k' that maximizes

q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j

using the relative values v_j of the previous policy. Then k' becomes
the new decision in the ith state, q_i^{k'} becomes q_i, and p_{ij}^{k'} becomes p_{ij}.

Fig. 4.2. The iteration cycle.

The upper box, the value-determination operation, yields the g and v_i
corresponding to a given choice of q_i and p_{ij}. The lower box yields
the p_{ij} and q_i that increase the gain for a given set of v_i. In other words,
the value-determination operation yields values as a function of policy,
whereas the policy-improvement routine yields the policy as a function
of the values.
We may enter the iteration cycle in either box. If the value-
determination operation is chosen as the entrance point, an initial
policy must be selected. If the cycle is to start in the policy-improve-
ment routine, then a starting set of values is necessary. If there is no
a priori reason for selecting a particular initial policy or for choosing a
certain starting set of values, then it is often convenient to start the
process in the policy-improvement routine with all v; = 0. In this
case, the policy-improvement routine will select a policy as follows:
For each i, it will find the alternative k' that maximizes q_i^k and then
set d_i = k'.
This starting procedure will consequently cause the policy-improve-
ment routine to select as an initial policy the one that maximizes the
expected immediate reward in each state. The iteration will then
proceed to the value-determination operation with this policy, and the
iteration cycle will begin. The selection of an initial policy that maxi-
mizes expected immediate reward is quite satisfactory in the majority
of cases.
At this point it would be wise to say a few words about how to stop
the iteration cycle once it has done its job. The rule is quite simple:
The optimal policy has been reached (g is maximized) when the
policies on two successive iterations are identical. In order to prevent
the policy-improvement routine from quibbling over equally good alter-
natives in a particular state, it is only necessary to require that the
old d; be left unchanged if the test quantity for that d; is as large as
that of any other alternative in the new policy determination.
In summary, the policy-iteration method just described has the
following properties:

1. The solution of the sequential decision process is reduced to solving


sets of linear simultaneous equations and subsequent comparisons.
2. Each succeeding policy found in the iteration cycle has a higher
gain than the previous one.
3. The iteration cycle will terminate on the policy that has largest
gain attainable within the realm of the problem; it will usually find this
policy in a small number of iterations.
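To summarize the cycle in one place, here is a compact sketch of the whole method under the assumptions already stated (completely ergodic policies, alternatives stored per state); it is an illustration prepared for this edition, not a reproduction of the research program mentioned in Chapter 5.

import numpy as np

def value_determination(P, q):
    """Gain and relative values of one policy, with v_N = 0 (sketch)."""
    N = len(q)
    A = np.column_stack([np.eye(N)[:, :N - 1] - P[:, :N - 1], np.ones(N)])
    x = np.linalg.solve(A, q)
    return x[-1], np.append(x[:-1], 0.0)

def policy_iteration(P_alts, q_alts, tol=1e-9):
    """Iterate until the policy repeats; every policy is assumed completely ergodic."""
    N = len(q_alts)
    # Initial policy: maximize expected immediate reward (all v_i = 0).
    policy = [int(np.argmax(qs)) for qs in q_alts]
    while True:
        P = np.array([P_alts[i][policy[i]] for i in range(N)])
        q = np.array([q_alts[i][policy[i]] for i in range(N)])
        g, v = value_determination(P, q)
        new_policy = []
        for i in range(N):
            tests = [q_alts[i][k] + float(np.dot(P_alts[i][k], v))
                     for k in range(len(q_alts[i]))]
            k_best = int(np.argmax(tests))
            if tests[policy[i]] >= tests[k_best] - tol:   # keep the old decision on ties
                k_best = policy[i]
            new_policy.append(k_best)
        if new_policy == policy:
            return policy, g, v
        policy = new_policy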
Before proving properties 2 and 3, let us see the policy-iteration
method in action by applying it to the toymaker’s problem.

The Toymaker’s Problem


The data for the toymaker’s problem were presented in Table 3.1.
There are two states and two alternatives in each state, so that there
are four possible policies for the toymaker, each with associated proba-
bilities and rewards. He would like to know which of these four policies
he should follow into the indefinite future to make his average earnings
per week as large as possible.
Let us suppose that we have no a priori knowledge about which policy
is best. Then if we set v_1 = v_2 = 0 and enter the policy-improvement
routine, it will select as an initial policy the one that maximizes expected
immediate reward in each state. For the toymaker, this policy con-
sists of selection of alternative 1 in both states 1 and 2. For this
policy

P = \begin{bmatrix} 0.5 & 0.5 \\ 0.4 & 0.6 \end{bmatrix}        q = \begin{bmatrix} 6 \\ -3 \end{bmatrix}
We are now ready to begin the value-determination operation that
will evaluate our initial policy. From Eqs. 4.1,
g + v_1 = 6 + 0.5 v_1 + 0.5 v_2        g + v_2 = -3 + 0.4 v_1 + 0.6 v_2
Setting ve = 0 and solving these equations, we obtain
g = 1        v_1 = 10        v_2 = 0
(Recall that by use of a different method the gain of 1 was obtained
earlier for this policy.) We are now ready to enter the policy-improve-
ment routine as shown in Table 4.1.

Table 4.1. Toymaker Policy-Improvement Routine

State   Alternative    Test Quantity
  i         k          q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j

  1         1          6 + 0.5(10) + 0.5(0) = 11
            2          4 + 0.8(10) + 0.2(0) = 12 ←

  2         1          -3 + 0.4(10) + 0.6(0) = 1
            2          -5 + 0.7(10) + 0.3(0) = 2 ←

The policy-improvement routine reveals that the second alternative


in each state produces a higher value of the test quantity
q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j
than does the first alternative. Thus the policy composed of the second
alternative in each state will have a higher gain than our original policy.
However, we must continue our procedure because we are not yet sure
that the new policy is the best we can find. For this policy,

P = \begin{bmatrix} 0.8 & 0.2 \\ 0.7 & 0.3 \end{bmatrix}        q = \begin{bmatrix} 4 \\ -5 \end{bmatrix}

Equations 4.1 for this case become

g + v_1 = 4 + 0.8 v_1 + 0.2 v_2        g + v_2 = -5 + 0.7 v_1 + 0.3 v_2

With v_2 = 0, the results of the value-determination operation are

g = 2        v_1 = 10        v_2 = 0

The gain of the policy d = (2, 2) is thus twice that of the original


policy. We must now enter the policy-improvement routine again,
but, since the relative values are coincidentally the same as those for
the previous iteration, the calculations in Table 4.1 are merely repeated.
The policy d = (2, 2) is found once more, and, since we have found the
same policy twice in succession, we have found the optimal policy.
The toymaker should follow the second alternative in each state. If he
does, he will earn 2 units per week on the average, and this will be a
higher average earning rate than that offered by any other policy. The
reader should verify, for example, that both policy d = (1, 2) and policy
d = (2, 1) have inferior gains.
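That verification is easy to carry out mechanically. The sketch below (an addition for this edition) computes the gain of each of the four toymaker policies from its limiting state probabilities, using the alternative data of Table 3.1.

import numpy as np

# (alternative in state 1, alternative in state 2) -> transition matrix and rewards
policies = {
    (1, 1): (np.array([[0.5, 0.5], [0.4, 0.6]]), np.array([6.0, -3.0])),
    (1, 2): (np.array([[0.5, 0.5], [0.7, 0.3]]), np.array([6.0, -5.0])),
    (2, 1): (np.array([[0.8, 0.2], [0.4, 0.6]]), np.array([4.0, -3.0])),
    (2, 2): (np.array([[0.8, 0.2], [0.7, 0.3]]), np.array([4.0, -5.0])),
}

for d, (P, q) in policies.items():
    # Limiting state probabilities from pi P = pi and sum(pi) = 1 (Eqs. 1.5 and 1.6).
    A = np.vstack([(P - np.eye(2)).T, np.ones(2)])
    pi = np.linalg.lstsq(A, np.array([0.0, 0.0, 1.0]), rcond=None)[0]
    print(d, round(float(pi @ q), 2))
# Gains of about 1.0, 1.42, 1.67, and 2.0; the policy (2, 2) has the largest gain.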

For the optimal policy, v_1 = 10, v_2 = 0, so that v_1 - v_2 = 10.


This means that, even when the toymaker is following the optimal
policy by using advertising and research, he is willing to pay up to
10 units to an outside inventor for a successful toy at any time that
he does not have one. The relative values of the optimal policy may
be used in this way to aid the toymaker in making ‘‘one-of-a-kind”’
decisions about whether to buy rights to a successful toy when business
is bad.
The optimal policy for the toymaker was found by value iteration in
Chapter 3. The similarities and differences of the two methods should
now be clear. Note how the policy-iteration method stopped of its
own accord when it achieved policy convergence; there is no comparable
behavior in the value-iteration method. The policy-iteration method
has a simplicity of form and interpretation that makes it very desirable
from a computational point of view. However, we must always bear
in mind that it may be applied only to continuing processes or to those
whose termination is remote.

A Proof of the Properties of the Policy-Iteration Method


Suppose that we have evaluated a policy A for the operation of the
system and that the policy-improvement routine has produced a policy
B that is different from A. Then if we use superscripts A and B to
indicate the quantities relevant to policies A and B, we seek to prove
that g^B ≥ g^A.
It follows from the definition of the policy-improvement routine
that, since B was chosen over A,

q_i^B + \sum_{j=1}^{N} p_{ij}^B v_j^A \ge q_i^A + \sum_{j=1}^{N} p_{ij}^A v_j^A        i = 1, 2, ..., N        (4.5)

Let

\gamma_i = q_i^B + \sum_{j=1}^{N} p_{ij}^B v_j^A - q_i^A - \sum_{j=1}^{N} p_{ij}^A v_j^A        (4.6)

so that \gamma_i \ge 0. The quantity \gamma_i is the improvement in the test quantity
that the policy-improvement routine was able to achieve in the
ith state. For policies A and B individually, we have from Eqs. 4.1

g^B + v_i^B = q_i^B + \sum_{j=1}^{N} p_{ij}^B v_j^B        i = 1, 2, ..., N        (4.7)

g^A + v_i^A = q_i^A + \sum_{j=1}^{N} p_{ij}^A v_j^A        i = 1, 2, ..., N        (4.8)

If Eq. 4.8 is subtracted from Eq. 4.7, then the result is

g^B - g^A + v_i^B - v_i^A = q_i^B - q_i^A + \sum_{j=1}^{N} p_{ij}^B v_j^B - \sum_{j=1}^{N} p_{ij}^A v_j^A        (4.9)

If Eq. 4.6 is solved for q_i^B - q_i^A and this result is substituted into
Eq. 4.9, then we have

g^B - g^A + v_i^B - v_i^A = \gamma_i - \sum_{j=1}^{N} p_{ij}^B v_j^A + \sum_{j=1}^{N} p_{ij}^A v_j^A + \sum_{j=1}^{N} p_{ij}^B v_j^B - \sum_{j=1}^{N} p_{ij}^A v_j^A

or

g^B - g^A + v_i^B - v_i^A = \gamma_i + \sum_{j=1}^{N} p_{ij}^B (v_j^B - v_j^A)        (4.10)

Let g^\Delta = g^B - g^A and v_i^\Delta = v_i^B - v_i^A. Then Eq. 4.10 becomes

g^\Delta + v_i^\Delta = \gamma_i + \sum_{j=1}^{N} p_{ij}^B v_j^\Delta        i = 1, 2, ..., N        (4.11)

Equations 4.11 are identical in form to Eqs. 4.1 except that they are
written in terms of differences rather than in terms of absolute quantities.
Just as the solution for g obtained from Eqs. 4.1 is

g = \sum_{i=1}^{N} \pi_i q_i

so the solution for g^\Delta in Eqs. 4.11 is

g^\Delta = \sum_{i=1}^{N} \pi_i^B \gamma_i        (4.12)

where \pi_i^B is the limiting state probability of state i under policy B.
Since all \pi_i^B \ge 0 and all \gamma_i \ge 0, it follows that g^\Delta \ge 0. In particular,
g^B will be greater than g^A if an improvement in the test quantity can
be made in any state that will be recurrent under policy B. We see
from Eq. 4.12 that the increases in gain caused by improvements in
each recurrent state of the new policy are additive. Even if we
performed our policy improvement on only one state and left other
decisions unchanged, the gain of the system would increase if this
state is recurrent under the new policy.
We shall now show that it is impossible for a better policy to exist and
not be found at some time by the policy-improvement routine. Assume
that, for two policies A and B, g^B > g^A, but the policy-improvement
routine has converged on policy A. Then in all states \gamma_i \le 0, where
\gamma_i is defined by Eq. 4.6. Since \pi_i^B \ge 0 for all i, Eq. 4.12 shows that
g^B - g^A \le 0. But g^B > g^A by assumption, so that a contradiction
has been reached. It is thus impossible for a superior policy to remain
undiscovered.
The following chapter will present further examples of the policy-
iteration method that show how it may be applied to a variety of
problems.
Use of the Policy-Iteration Method
in Problems of Taxicab Operation,
Baseball, and Automobile Replacement

An Example—Taxicab Operation
Consider the problem of a taxicab driver whose territory encompasses
three towns, A, B, and C. If he is in town A, he has three alterna-
tives:

1. He can cruise in the hope of picking up a passenger by being


hailed.
2. He can drive to the nearest cab stand and wait in line.
3. He can pull over and wait for a radio call.

If he is in town C, he has the same three alternatives, but if he is in


town B, the last alternative is not present because there is no radio cab
service in that town. For a given town and given alternative, there is a
probability that the next trip will go to each of the towns A, B, and C
and a corresponding reward in monetary units associated with each
such trip. This reward represents the income from the trip after all
necessary expenses have been deducted. For example, in the case of
alternatives 1 and 2, the cost of cruising and of driving to the nearest
stand must be included in calculating the rewards. The probabilities
of transition and the rewards depend upon the alternative because
a different customer population will be encountered under each alter-
native.
If we identify being in towns A, B, and C with states 1, 2, and 3,
respectively, then we have Table 5.1.
Table 5.1. Data for Taxicab Problem

State  Alternative        Probability                   Reward              Expected Immediate Reward
  i        k         p_{i1}^k  p_{i2}^k  p_{i3}^k   r_{i1}^k  r_{i2}^k  r_{i3}^k   q_i^k = \sum_j p_{ij}^k r_{ij}^k

  1        1           1/2      1/4      1/4          10        4         8                 8
           2           1/16     3/4      3/16          8        2         4                 2.75
           3           1/4      1/8      5/8           4        6         4                 4.25

  2        1           1/2      0        1/2          14        0        18                16
           2           1/16     7/8      1/16          8       16         8                15

  3        1           1/4      1/4      1/2          10        2         8                 7
           2           1/8      3/4      1/8           6        4         2                 4
           3           3/4      1/16     3/16          4        0         8                 4.5

The reward is measured in some arbitrary monetary unit; the numbers
in Table 5.1 are chosen more for ease of calculation than for any
other reason.
In order to start the decision-making process, suppose that we make
v_1, v_2, and v_3 = 0, so that the policy improvement will choose initially
the policy that maximizes expected immediate reward. By examining
the q_i^k, we see that this policy consists of choosing the first alternative
in each state. In other words, the policy vector d, whose ith element is
the decision in the ith state, is

d = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}

or the policy is always cruise.


The transition probabilities and expected immediate rewards corresponding
to this policy are

P = \begin{bmatrix} 1/2 & 1/4 & 1/4 \\ 1/2 & 0 & 1/2 \\ 1/4 & 1/4 & 1/2 \end{bmatrix}        q = \begin{bmatrix} 8 \\ 16 \\ 7 \end{bmatrix}

Now the value-determination operation is entered, and we solve the
equations

g + v_i = q_i + \sum_{j=1}^{N} p_{ij} v_j        i = 1, 2, ..., N
In this case we have

g + v_1 = 8 + (1/2) v_1 + (1/4) v_2 + (1/4) v_3
g + v_2 = 16 + (1/2) v_1 + (1/2) v_3
g + v_3 = 7 + (1/4) v_1 + (1/4) v_2 + (1/2) v_3

Setting v_3 = 0 arbitrarily and solving these equations, we obtain

v_1 = 1.33        v_2 = 7.47        v_3 = 0        g = 9.2
Under a policy of always cruising, the driver will make 9.2 units per
trip on the average.
Returning to the policy-improvement routine, we calculate the
quantities
q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j

for all i and k, as shown in Table 5.2.

Table 5.2. First Policy Improvement for Taxicab Problem

State   Alternative    Test Quantity
  i         k          q_i^k + \sum_j p_{ij}^k v_j

  1         1          10.53 ←
            2           8.43
            3           5.52

  2         1          16.67
            2          21.62 ←

  3         1           9.20
            2           9.77 ←
            3           5.97

We see that for i = 1 the quantity in the right-hand column is maximized
when k = 1. For i = 2 or 3, it is maximized when k = 2.
In other words, our new policy is

d = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}

This means that if the driver is in town A he should cruise; if he is
in town B or C, he should drive to the nearest stand.
We have now

P = \begin{bmatrix} 1/2 & 1/4 & 1/4 \\ 1/16 & 7/8 & 1/16 \\ 1/8 & 3/4 & 1/8 \end{bmatrix}        q = \begin{bmatrix} 8 \\ 15 \\ 4 \end{bmatrix}
Returning to the value-determination operation, we solve the equations

g + v_1 = 8 + (1/2) v_1 + (1/4) v_2 + (1/4) v_3
g + v_2 = 15 + (1/16) v_1 + (7/8) v_2 + (1/16) v_3
g + v_3 = 4 + (1/8) v_1 + (3/4) v_2 + (1/8) v_3

Again with v_3 = 0, we obtain

v_1 = -3.88        v_2 = 12.85        v_3 = 0        g = 13.15


Note that g has increased from 9.2 to 13.15 as desired, so that the
cab earns 13.15 units per trip on the average. A second policy-
improvement routine is shown in Table 5.3.

Table 5.3. Second Policy Improvement for Taxicab Problem

State   Alternative    Test Quantity
  i         k          q_i^k + \sum_j p_{ij}^k v_j

  1         1           9.27
            2          12.14 ←
            3           4.89

  2         1          14.06
            2          26.00 ←

  3         1           9.24
            2          13.15 ←
            3           2.39

The new policy is thus

d = \begin{bmatrix} 2 \\ 2 \\ 2 \end{bmatrix}

The driver should proceed to the nearest stand, regardless of the town
in which he finds himself.
With this policy

P = \begin{bmatrix} 1/16 & 3/4 & 3/16 \\ 1/16 & 7/8 & 1/16 \\ 1/8 & 3/4 & 1/8 \end{bmatrix}        q = \begin{bmatrix} 2.75 \\ 15 \\ 4 \end{bmatrix}

Entering the value-determination operation, we have

g + v_1 = 2.75 + (1/16) v_1 + (3/4) v_2 + (3/16) v_3
g + v_2 = 15 + (1/16) v_1 + (7/8) v_2 + (1/16) v_3
g + v_3 = 4 + (1/8) v_1 + (3/4) v_2 + (1/8) v_3

With v_3 = 0, the solution to these equations is

v_1 = -1.18        v_2 = 12.66        v_3 = 0        g = 13.34
Note that there has been a small but definite increase in g from 13.15
to 13.34; however, we as yet have no evidence that the optimal policy
has been found. The next policy improvement is shown in Table 5.4.

Table 5.4. Third Policy Improvement for Taxicab Problem

State   Alternative    Test Quantity
  i         k          q_i^k + \sum_j p_{ij}^k v_j

  1         1          10.58
            2          12.17 ←
            3           5.54

  2         1          15.41
            2          26.00 ←

  3         1           9.87
            2          13.34 ←
            3           4.41
The new policy is

d = \begin{bmatrix} 2 \\ 2 \\ 2 \end{bmatrix}

but this is equal to the previous policy, so that the process has converged,
and g has attained its maximum, namely, 13.34. The cab
driver should drive to the nearest stand in any town. Following this
policy will yield a return of 13.34 units per trip on the average, almost
half as much again as the policy of always cruising found by maximizing
expected immediate reward. The calculations are summarized
in Table 5.5.
Table 5.5. Summary of Taxicab Problem Solution

v_1     0       1.33     -3.88    -1.18
v_2     0       7.47     12.85    12.66
v_3     0       0         0        0
g       -       9.20     13.15    13.34
        ↓P      ↑V ↓P    ↑V ↓P    ↑V ↓P
d_1     1       1         2        2
d_2     1       2         2        2      STOP
d_3     1       2         2        2

P indicates that this step takes place in the policy-improvement routine.
V indicates that this step takes place in the value-determination operation.
Notice that the optimal policy of always driving to a stand is the


worst policy in terms of immediate reward. This is roughly equivalent
to saying that if a cab driver is to conduct his affairs in the best way
he must consider not only the fare from a trip but also the destination
of the trip with respect to the expectation of further trips. Any
experienced cab driver will verify the wisdom of such reasoning. It
often happens in the sequential decision process that the birds in the
bush are worth more than the one in the hand.
The policy-improvement routine of Table 5.3 provides us with an
opportunity to check Eq. 4.12. The policy changed as a result of this
routine from a policy A for which

d^A = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}

to a policy B described by

d^B = \begin{bmatrix} 2 \\ 2 \\ 2 \end{bmatrix}
The quantities \gamma_i defined by Eq. 4.6 may be obtained from Table 5.3.
They are the differences between the test quantities for each policy.
We find \gamma_1 = 12.14 - 9.27 = 2.87, whereas \gamma_2 = \gamma_3 = 0 because the
decisions in states 2 and 3 are the same for both policies A and B.
Application of Eqs. 1.5 and 1.6 to the transition-probability matrix
for policy B yields the limiting state probabilities:

\pi_1 = 0.0672        \pi_2 = 0.8571        \pi_3 = 0.0757

From Eq. 4.12 we then have that

g^\Delta = (0.0672)(2.87) = 0.19

The change of policy from A to B should thus have produced an increase
in gain of 0.19 unit. Since g^A = 13.15 and g^B = 13.34, our
prediction is correct.
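The same check can be reproduced numerically; the sketch below (illustrative, not the research program of the next section) recomputes the limiting state probabilities of policy B and the gain increase predicted by Eq. 4.12.

import numpy as np

# Policy B of the taxicab problem: drive to the nearest stand in every town.
P_B = np.array([[1/16, 3/4, 3/16],
                [1/16, 7/8, 1/16],
                [1/8,  3/4, 1/8 ]])
gamma = np.array([2.87, 0.0, 0.0])      # improvements gamma_i read from Table 5.3

# Limiting state probabilities from pi P = pi together with sum(pi) = 1.
N = P_B.shape[0]
A = np.vstack([(P_B - np.eye(N)).T, np.ones(N)])
b = np.append(np.zeros(N), 1.0)
pi = np.linalg.lstsq(A, b, rcond=None)[0]

print(pi.round(4))                      # about [0.0672, 0.8571, 0.0757]
print(round(float(pi @ gamma), 2))      # gain increase of about 0.19 (Eq. 4.12)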

A Baseball Problem

It is interesting to explore computational methods of solving the


discrete sequential decision problem. The policy-improvement routine
is a simple computational problem compared to the value-determination
operation. In order to determine the gain and the values, it is necessary
to solve a set of simultaneous equations that may be quite large.
A computer program for solving the problem that we have been
discussing has been developed as an instrument of research. This
program performs the value-determination operation by solving a set
of simultaneous equations using the Gauss-Jordan reduction. Prob-
lems possessing up to 50 states and with up to 50 alternatives in each
state may be solved.

Table 5.6. BASEBALL PROBLEM DATA

1. Manager tells player at bat to try for a hit.


Outcome          Probability     Batter     Player on First   Player on Second   Player on Third
                 of Outcome      Goes to    Goes to           Goes to            Goes to

Single            0.15           1          2                 3                  H
Double            0.07           2          3                 H                  H
Triple            0.05           3          H                 H                  H
Home run          0.03           H          H                 H                  H
Base on balls     0.10           1          2                 3 (if forced)      H (if forced)
Strike out        0.30           Out        1                 2                  3
Fly out           0.10           Out        1                 2                  H (if less than 2 outs)
Ground out        0.10           Out        2                 3                  H (if less than 2 outs)
Double play       0.10           Out        The player nearest first is out.

The interpretation of these outcomes is not described in detail. For instance,


if there are no men on base, then hitting into a double play is counted simply as
making an out.

2. Manager tells player at bat to bunt.


Outcome Probability Effect

Single 0.05 Runners advance one base.


Sacrifice 0.60 Batter out; runners advance one base.
Fielder’s choice 0.20 Batter safe; runner nearest to making run is
out, other runners stay put unless forced.
Strike or foul out 0.10 Batter out; runners do not advance.
Double play 0.05 Batter and player nearest first are out.

3. Manager tells player on first to steal second.

4. Manager tells player on second to steal third.


In either case, the attempt is successful with probability 0.4, the player’s position
is unchanged with probability 0.2, and the player is out with probability 0.4.

5. Manager tells player on third to steal home.


The outcomes are the same as those above, but the corresponding probabilities
are 0.2, 0.1, and 0.7.
Baseball fans please note: No claim is made for the validity of either assumptions
or data.

When this program was used to solve the taxicab problem, it of


course yielded the same solutions we obtained earlier, but with more
significant figures. The power of the technique can be appreciated
only in a more complex problem possessing several states. As an
illustration of such a problem, let us analyze the game of baseball
using suitable simplifying assumptions to make the problem manageable.
Consider the half of an inning of a baseball game when one team is
at bat. This team is unusual because all its players are identical in
athletic ability and their play is unaffected by the tensions of the game.
The manager makes all decisions regarding the strategy of the team,
and his alternatives are limited in number. He may tell the batter
to hit or bunt, tell a man on first to steal second, a man on second to
steal third, or a man on third to steal home. For each situation during
the inning and for each alternative, there will be a probability of reaching
each other situation that could exist and an associated reward expressed
in runs. Let us specify the probabilities of transition under each
alternative as shown in Table 5.6.
The state of the system depends upon the number of outs and upon
the situation on the bases. We may designate the state of the system
by a four-digit number d_1 d_2 d_3 d_4, where d_1 is the number of outs (0, 1, 2,
or 3) and the digits d_2 d_3 d_4 are 1 or 0 corresponding to whether there
is or is not a player on bases 3, 2, and 1, respectively. Thus the state
designation 2110 would identify the situation "2 outs; players on second
and third," whereas 1111 would mean "1 out; bases loaded." The
states are also given a decimal number equal to 1 + 8 d_1 + (decimal
number corresponding to the binary number d_2 d_3 d_4). The state 0000 would
be state 1, and the state 3000 would be state 25; 2110 corresponds
to 23, 1111 to 16. There are eight base situations possible for each of
the three out situations 0, 1, 2. There is also the three-out case 3---,
where the situation on base is irrelevant and we may arbitrarily call
3--- the state 3000. Therefore, we have a 25-state problem.
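The decimal numbering just described is easy to mechanize; the small helper below (a hypothetical function written for this edition, not part of the original program) reproduces it.

def baseball_state(outs, on_third, on_second, on_first):
    """Decimal state number used in the text: 1 + 8*d1 + binary d2 d3 d4."""
    return 1 + 8 * outs + 4 * on_third + 2 * on_second + on_first

assert baseball_state(0, 0, 0, 0) == 1    # 0000: start of an inning
assert baseball_state(2, 1, 1, 0) == 23   # 2110: two out, players on second and third
assert baseball_state(1, 1, 1, 1) == 16   # 1111: one out, bases loaded
assert baseball_state(3, 0, 0, 0) == 25   # 3000: three out, the trapping state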
The number of alternatives in each state is not the same. State
1000 or 9 has no men on base, so that none of the stealing alternatives
are applicable, and only the hit or bunt options are present. State
0101 or 6 has four alternatives: hit, bunt, steal second, or steal home.
State 3000 or 25 has only 1 alternative, and that alternative causes it
to return to itself with probability 1 and reward 0. State 25 is a trap-
ping or recurrent state; it is the only state that the system may occupy
as the number of transitions becomes infinite.
To fix ideas still more clearly, let us list explicitly in Table 5.7 the
transition probabilities p_{ij}^k and rewards r_{ij}^k for a typical state, say 0011
or 4. In state 4 (i = 4), three alternatives apply: hit, bunt, and steal
third. Only the nonzero p_{4j}^k are listed. The highest expected immediate
reward in this state would be obtained by following alternative 1,
Hit.

Table 5.7. Probabilities and Rewards for State 4 of Baseball Problem (0011)

First alternative: Hit, k = 1.

Next State    j     p_{4j}^1    r_{4j}^1
0000          1       0.03         3
0100          5       0.05         2
0110          7       0.07         1
0111          8       0.25         0        q_4^1 = 0.26
1011         12       0.40         0
1110         15       0.10         0
2010         19       0.10         0

Second alternative: Bunt, k = 2.

Next State    j     p_{4j}^2    r_{4j}^2
0111          8       0.05         0
1011         12       0.30         0        q_4^2 = 0
1110         15       0.60         0
2010         19       0.05         0

Third alternative: Steal third, k = 3.

Next State    j     p_{4j}^3    r_{4j}^3
0011          4       0.20         0
0101          6       0.40         0        q_4^3 = 0
1001         10       0.40         0

Table 5.8, entitled "Summary of Baseball Problem Input," shows
for each state i the state description, the alternatives open to the manager
in that state, and q_i^k, the expected immediate reward (in runs) from
following alternative k in state i. The final column shows the policy
that would be obtained by maximizing expected immediate reward in
each state. This policy is to bunt in states 5, 6, 13, and 14, and to hit
in all others. States 5, 6, 13, and 14 may be described as those states
with a player on third, none on second, and with less than two outs.
The foregoing data were used as an input to the computer program
described earlier. Since the program chooses an initial policy by
maximizing expected immediate reward, the initial policy was the one
just mentioned. The machine had to solve the equations only twice to
reach a solution. Its results are summarized in Table 5.9.
The optimal policy is to hit in every state. The v_i may be interpreted
as the expected number of runs that will be made if the inning is now
in state i and it is played until three outs are incurred. Since a team
Table 5.8. Summary of Baseball Problem Input

                       Alternative 1   Alternative 2   Alternative 3   Alternative 4   Number of      Initial Policy
State   Description    (k = 1)         (k = 2)         (k = 3)         (k = 4)         Alternatives   d_i if all
        Outs Bases     q_i^1           q_i^2           q_i^3           q_i^4           in State       v_i set = 0
             3 2 1

  1      0   0 0 0     0.03 Hit        -               -               -               1              1
  2      0   0 0 1     0.11 Hit        0 Bunt          0 Steal 2       -               3              1
  3      0   0 1 0     0.18 Hit        0 Bunt          0 Steal 3       -               3              1
  4      0   0 1 1     0.26 Hit        0 Bunt          0 Steal 3       -               3              1
  5      0   1 0 0     0.53 Hit        0.65 Bunt       0.20 Steal H    -               3              2
  6      0   1 0 1     0.61 Hit        0.65 Bunt       0 Steal 2       0.20 Steal H    4              2
  7      0   1 1 0     0.68 Hit        0.65 Bunt       0.20 Steal H    -               3              1
  8      0   1 1 1     0.86 Hit        0.65 Bunt       0.20 Steal H    -               3              1
  9      1   0 0 0     0.03 Hit        -               -               -               1              1
 10      1   0 0 1     0.11 Hit        0 Bunt          0 Steal 2       -               3              1
 11      1   0 1 0     0.18 Hit        0 Bunt          0 Steal 3       -               3              1
 12      1   0 1 1     0.26 Hit        0 Bunt          0 Steal 3       -               3              1
 13      1   1 0 0     0.53 Hit        0.65 Bunt       0.20 Steal H    -               3              2
 14      1   1 0 1     0.61 Hit        0.65 Bunt       0 Steal 2       0.20 Steal H    4              2
 15      1   1 1 0     0.68 Hit        0.65 Bunt       0.20 Steal H    -               3              1
 16      1   1 1 1     0.86 Hit        0.65 Bunt       0.20 Steal H    -               3              1
 17      2   0 0 0     0.03 Hit        -               -               -               1              1
 18      2   0 0 1     0.11 Hit        0 Bunt          0 Steal 2       -               3              1
 19      2   0 1 0     0.18 Hit        0 Bunt          0 Steal 3       -               3              1
 20      2   0 1 1     0.26 Hit        0 Bunt          0 Steal 3       -               3              1
 21      2   1 0 0     0.33 Hit        0.05 Bunt       0.20 Steal H    -               3              1
 22      2   1 0 1     0.41 Hit        0.05 Bunt       0 Steal 2       0.20 Steal H    4              1
 23      2   1 1 0     0.48 Hit        0.05 Bunt       0.20 Steal H    -               3              1
 24      2   1 1 1     0.66 Hit        0.05 Bunt       0.20 Steal H    -               3              1
 25      3   - - -     0 Trapped       -               -               -               1              1

starts each inning in state 1, or “‘no outs, no men on,” then v1 may be
interpreted as the expected number of runs per inning under the given
policy. The initial policy yields 0.75 for v1, whereas the optimal
policy yields 0.81. In other words, the team will earn about 0.06
more runs per inning on the average if it uses the optimal policy
rather than the policy that maximizes expected immediate reward.
Note that under both policies the gain was zero as expected, since
after an infinite number of moves the system will be in state 25 and will
always make reward 0. Note also that, in spite of the fact that the
gain could not be increased, the policy-improvement routine yielded
values for the optimal policy that are all greater than or equal to those
for the initial policy. The appendix shows that the policy-improvement
routine will maximize values if it is impossible to increase gain.
The values v_i can be used in comparing the usefulness of states. For
example, under either policy the manager would rather be in a position
with two men out and the bases loaded than be starting a new inning
(compare v_24 with v_1). However, he would rather start a new inning
than have two men out and men on second and third (compare v_23
with v_1). Many other interesting comparisons can be made. Under
the optimal policy, having no men out and a player on first is just about
as valuable a position as having one man out and players on first and
second (compare v_2 with v_12). It is interesting to see how the preceding
comparisons compare with our intuitive notions of the relative values
of baseball positions.

Table 5.9. SUMMARY OF BASEBALL PROBLEM SOLUTION


Iteration 1 Iteration 2
g=0 g=0
State Description Decision Value v_i        State Description Decision Value v_i

1 0000 Hit 0.75 1 0000 Hit 0.81


2 0001 Hit 1.08        2 0001 Hit 1.25
3 0010 Hit 1.18        3 0010 Hit 1.35
4 0011 Hit 1.82 4 0011 Hit 1.89
5 0100 Bunt 1.18 5 0100 Hit 1.56
6 0101 Bunt 1.56 6 0101 Hit 2.07
7 0110 Hit 2.00 7 0110 Hit Dag,
8 0111 Hit 2.07), 8 0111 Hit 2.74
9 1000 Hit 0.43 9 1000 Hit 0.46
10 1001 Hit 0.75        10 1001 Hit 0.77
11 1010 Hit 0.79 11 1010 Hit 0.86
12 1011 Hit 1.21        12 1011 Hit 1.23
13 1100 Bunt 0.88 13 1100 Hit Ea
14 1101 Bunt 1.10 14 1101 Hit 1.44
15 1110 Hit 1.46        15 1110 Hit 1.53
16 1111 Hit 1.93        16 1111 Hit 1.95
17 2000 Hit 0.17        17 2000 Hit 0.17
18 2001 Hit 0.34 18 2001 Hit 0.34
19 2010 Hit 0.40 19 2010 Hit 0.40
20 2011 Hit 0.59 20 2011 Hit 0.59
21 2100 Hit 0.51 21 2100 Hit 0.51
22 2101 Hit 0.68 22 2101 Hit 0.68
23 2110 Hit 0.74 23 2110 Hit 0.74
24 2111 Hit 0.99        24 2111 Hit 0.99
25 3000 Hit 0 25 3000 Hit 0

The Replacement Problem


The examples of the policy-iteration method presented up to this
point have been somewhat far removed from the realm of practical
problems. It would be extremely interesting to see the method applied
to a problem that is of major importance to industry. As an example
of such a practical application, the replacement problem was chosen.
This is the problem of when to replace a piece of capital equipment
that deteriorates with time. The question to be answered is this:
If we now own a machine of a certain age, should we keep it or should
we trade it in; further, if we trade it in, how old a machine should we
buy?
In order to fix ideas, let us consider the problem of automobile
replacement over a time interval of ten years. We agree to review


our current situation every three months and to make a decision on
keeping our present car or trading it in at that time. The state of the
system, i, is described by the age of the car in three-month periods; i
may run from 1 to 40. In order to keep the number of states finite,
a car of age 40 remains a car of age 40 forever (it is considered to be
essentially worn out). The alternatives available in each state are
these: The first alternative, k = 1, is to keep the present car for another
quarter. The other alternatives, k > 1, are to buy a car of age k — 2,
where k - 2 may be as large as 39. We have then 40 states with 41
alternatives in each state, with the result that there are 41^40 possible
policies.
The data supplied are the following:

C_i, the cost of buying a car of age i
T_i, the trade-in value of a car of age i
E_i, the expected cost of operating a car of age i until it reaches age
i + 1
p_i, the probability that a car of age i will survive to be age i + 1
without incurring a prohibitively expensive repair

The probability defined here is necessary to limit the number of


states. A car of any age that has a hopeless breakdown is immediately
sent to state 40. Naturally, p_40 = 0.
The basic equations governing the system when it is in state i are
the following: If k = 1 (keep present car),

g + v_i = -E_i + p_i v_{i+1} + (1 - p_i) v_{40}

If k > 1 (trade for car of age k - 2),

g + v_i = T_i - C_{k-2} - E_{k-2} + p_{k-2} v_{k-1} + (1 - p_{k-2}) v_{40}

It is simple to phrase these equations in terms of our earlier notation.
For instance,

q_i^k = -E_i  for k = 1        q_i^k = T_i - C_{k-2} - E_{k-2}  for k > 1

p_{ij}^k = \begin{cases} p_i & j = i + 1 \\ 1 - p_i & j = 40 \\ 0 & \text{other } j \end{cases}  for k = 1

p_{ij}^k = \begin{cases} p_{k-2} & j = k - 1 \\ 1 - p_{k-2} & j = 40 \\ 0 & \text{other } j \end{cases}  for k > 1
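For readers who wish to reproduce the computation, the sketch below (an illustration for this edition; the original program is not listed in the text) assembles the p_ij^k and q_i^k arrays directly from these definitions and the Table 5.10 data. The array names and layout are assumptions of the illustration.

import numpy as np

def build_replacement_model(C, T, E, p):
    """Transition probabilities and rewards for the replacement problem (sketch).

    C[a], T[a], E[a], p[a] hold the Table 5.10 data for a car of age
    a = 0, ..., 40.  States i = 1, ..., 40 are ages in quarters; alternative
    k = 1 keeps the present car, and k = 2, ..., 41 trades for a car of age k - 2.
    """
    N, K = 40, 41
    P = np.zeros((N, K, N))                    # P[i-1, k-1, j-1] = p_ij^k
    q = np.zeros((N, K))                       # q[i-1, k-1]      = q_i^k
    for i in range(1, N + 1):
        # k = 1: keep the present car for another quarter.
        q[i - 1, 0] = -E[i]
        if i < N:
            P[i - 1, 0, i] = p[i]              # survives to age i + 1
        P[i - 1, 0, N - 1] += 1 - p[i]         # hopeless breakdown sends it to state 40
        # k > 1: trade the present car for one of age k - 2.
        for k in range(2, K + 1):
            a = k - 2
            q[i - 1, k - 1] = T[i] - C[a] - E[a]
            P[i - 1, k - 1, k - 2] = p[a]      # the newly bought car survives to age k - 1
            P[i - 1, k - 1, N - 1] += 1 - p[a]
    return P, q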

The actual data used in the problem are listed in Table 5.10 and
graphed in Figure 5.1. The discontinuities in the cost and trade-in
functions were introduced in order to characterize typical model-year
effects.

Table 5.10. Automobile Replacement Data

Age in    Cost     Trade-in     Operating      Survival
Periods    C_i     Value T_i    Expense E_i    Probability p_i

  0      $2000      $1600         $50            1.000
  1       1840       1460          53            0.999
  2       1680       1340          56            0.998
  3       1560       1230          59            0.997
  4       1300       1050          62            0.996
  5       1220        980          65            0.994
  6       1150        910          68            0.991
  7       1080        840          71            0.988
  8        900        710          75            0.985
  9        840        650          78            0.983
 10        780        600          81            0.980
 11        730        550          84            0.975
 12        600        480          87            0.970
 13        560        430          90            0.965
 14        520        390          93            0.960
 15        480        360          96            0.955
 16        440        330         100            0.950
 17        420        310         103            0.945
 18        400        290         106            0.940
 19        380        270         109            0.935
 20        360        255         112            0.930
 21        345        240         115            0.925
 22        330        225         118            0.919
 23        315        210         122            0.910
 24        300        200         125            0.900
 25        290        190         129            0.890
 26        280        180         133            0.880
 27        265        170         137            0.865
 28        250        160         141            0.850
 29        240        150         145            0.820
 30        230        145         150            0.790
 31        220        140         155            0.760
 32        210        135         160            0.730
 33        200        130         167            0.660
 34        190        120         175            0.590
 35        180        115         182            0.510
 36        170        110         190            0.430
 37        160        105         205            0.300
 38        150         95         220            0.200
 39        140         87         235            0.100
 40        130         80         250            0

[Figure: cost C_i, trade-in value T_i, and survival probability p_i plotted against age in periods.]

Fig. 5.1. Automobile replacement data.


The automobile replacement problem was solved by the policy-
iteration method in seven iterations. The sequence of policies, gains,
and values is shown in Table 5.11. The optimal policy given by iteration
7 is this: If you have a car that is more than 1/2 year old but less
than 6 1/2 years old, keep it. If you have a car of any other age, trade
it in on a 3-year-old car. This seems to correspond quite well
with our intuitive notions concerning the economics of automobile
ownership. Note that if we at present have a car that is 3 or 6 months
old we should trade it for a 3-year-old car, but that if our car's
age is between 6 months and 6 1/2 years, we should keep it. These rules
enable us to enter the 3 to 6 1/2 cycle; once the cycle is entered, the car
we own will always be between 3 and 6 1/2 years old.* It is satisfying
to note that the program at any iteration requires that, if we are going
to trade, we must trade for a car whose age is independent of our
present car's age. This is just the result that the logic of the situation
would dictate.
If we follow our optimal policy, we shall keep a car until it is 6 1/2 years
old and then buy a 3-year-old car. Suppose, however, that when
our car is 4 years old, a friend offers to swap his 1-year-old car for
ours for an amount X. Should we take up his offer? In order to answer
this question, we must look at the values.
In each of the iterations, the value of state 40 was set equal to zero
for computational purposes. Table 5.11 also shows the values under the
best policy when the value of state 40 is set equal to $80, the trade-in
value of a car of that age. When this is done, each v_i represents the
value of a car of age i to a person who is following the optimal policy.
In order to answer the question just posed, we must compare the value
of a 1-year-old car, v_4 = $1152, with the value of a 4-year-old car,
v_16 = $422. If his asking price X is less than v_4 - v_16 = $730, we
should make the trade; otherwise, we should not. It is, of course, not
necessary to change v_40 from zero in order to answer this problem;
however, making v_40 = $80 does give the values an absolute physical
interpretation as well as a relative one.
If the optimal policy is followed, the yearly cost of transportation is
about $604 (4 x $150.95). If the policy of maximizing immediate
reward shown in iteration 1 were followed, the yearly cost would
be $1000. Thus, following a policy that maximizes future reward

* Of course, chaos for the automobile industry would result if everyone followed
this policy. Where would the 3-year-old cars come from? Economic forces
would increase the price of such cars to a point where the 3 to 6 1/2 policy is
no longer optimal. The preceding analysis must assume that there are enough
people in the market buying cars for psychological reasons that so-called
"rational" buyers are a negligible influence.
Table 5.11. Automobile Replacement Results. [The table lists, for each of the 40 states, the decision and value produced by each of the seven iterations. A number in the decision column means trade for a car of that age in periods; a K means keep the present car. Values and gains are expressed in dollars; the gains of the successive policies are -$250.00, -$192.89, -$162.44, -$151.07, -$151.05, -$150.99, and -$150.95 per period. The adjusted value in the final column is computed by adding $80, the value of a scrap car, to each of the iteration-7 values.]
rather than immediate reward has resulted in a saving of almost $400


per year. The decrease of period cost with iteration is shown in Fig.
5.2. The gain approaches the optimal value roughly exponentially.
Notice that the gains for the last three iterations are so close that for
all practical purposes the corresponding policies may be considered
to be equivalent.

[Figure: cost per period in dollars versus iteration number.]

Fig. 5.2. Quarterly cost of automobile operation as a function of iteration.

The fact that a 3-year-old car is the best buy is


discovered as early as iteration 4. The model-year discontinuity
occurring at 3 years is no doubt responsible for this particular
selection.
The replacement problem described in this section is typical of a
large class of industrial replacement problems. Placing these problems
in the framework of the policy-iteration method requires only a thorough
understanding of their peculiarities and some foresight in selecting a
suitable formulation.
The Policy-Iteration Method
for Multiple-Chain Processes

The developments of Chapter 4 assumed that all the possible policies


for the system were completely ergodic. Complete ergodicity meant
that each policy defined a Markov process with only one recurrent chain,
and thus with a unique gain. Our problem was simply to find the
policy that had highest gain; the method of Chapter 4 accomplished
this purpose. This iteration technique is satisfactory for most problems
because we can usually define a problem in such a way as to meet the
requirement that it have only completely ergodic policies. This was
the case for the examples of Chapter 5.
However, it is not difficult to think of processes that have multiple
chains. In Chapter 1 we discussed a three-state process with transition-probability
matrix

P = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1/3 & 1/3 & 1/3 \end{bmatrix}

that had two recurrent chains. Suppose that the process had an
expected immediate reward vector

q = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}

expressed in dollars. The matrix of limiting-state-probability vectors
was found in Chapter 1 to be

S = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1/2 & 1/2 & 0 \end{bmatrix}
The gain vector is

g = S q = \begin{bmatrix} 1 \\ 2 \\ 1.5 \end{bmatrix}

and we interpret g as follows: If the
process were started in state 1, it would earn $1.00 per transition. A
start in state 2 would earn $2.00 per transition. Finally, since the
system is equally likely to enter state 1 or state 2 after many transitions
if it is started in state 3, such a starting position is expected to earn
$1.50 per transition on the average. The averaging involved is per-
formed over several independent trials starting in state 3, because in
any given trial either $1.00 or $2.00 per transition will be ultimately
earned.
The gain of the system thus depends upon the state in which it is
started. A start in state i produces a gain g_i, so that we may think of
the gain as being a function of the state as well as of the process. Our
new task is to find the policy for the system that will maximize the
gain of all states of the system. We are fortunate that the policy-
iteration method of Chapter 4 can be extended to the case of multiple-
gain processes. We shall now proceed to this extension.

The Value-Determination Operation


Equations 2.15 show the asymptotic form that the total expected
reward of the system assumes when the system is started in state i
and allowed to make a large number of transitions:

v_i(n) = n g_i + v_i        i = 1, 2, ..., N        (2.15)

Each state has its own g_i, but, as discussed in Chapter 2, all states that
are members of the same recurrent chain have the same gain. If we
agree to study the unending process, Eqs. 2.15 may be used with the
basic recurrence relation for total expected earnings,

v_i(n + 1) = q_i + \sum_{j=1}^{N} p_{ij} v_j(n)        i = 1, 2, ..., N        (6.1)

to yield

(n + 1) g_i + v_i = q_i + \sum_{j=1}^{N} p_{ij} (n g_j + v_j)

or

n g_i + g_i + v_i = q_i + n \sum_{j=1}^{N} p_{ij} g_j + \sum_{j=1}^{N} p_{ij} v_j        (6.2)

If Eq. 6.2 is to be satisfied for any large n, it follows that

g_i = \sum_{j=1}^{N} p_{ij} g_j        i = 1, 2, ..., N        (6.3)
and

g_i + v_i = q_i + \sum_{j=1}^{N} p_{ij} v_j        i = 1, 2, ..., N        (6.4)

We now have the two sets of N linear simultaneous equations (Eqs.
6.3 and 6.4) that we may use to solve for the N g_i and the N v_i. However,
Eqs. 6.3 may not be solved uniquely for the g_i. The matrix [I - P]
has a singular determinant, so that the solution for the g_i obtained from
Eqs. 6.3 will contain arbitrary constants. The number of arbitrary
constants is equal to the number of recurrent chains in the process.
Equations 6.3 essentially relate the gains of each state to the gains
of each recurrent chain. For example, in an L-chain process there will
be L independent gains. The gains of the states that are transient
will be related by Eqs. 6.3 to the L independent gains and so will be
determined when the independent gains are determined.
The N equations (Eqs. 6.4) must now be used to determine the L
independent gains and also the N v_i. We thus have L too many unknowns.
However, suppose that we extend our former procedure by
setting equal to zero the v_i for one state in each recurrent chain, so
that a total of L v_i will be equated to zero. We shall generally choose
the highest numbered state in each chain to be the one whose v_i is
set equal to zero. We find that Eqs. 6.4 may now be solved for the L
independent gains and for the remaining (N - L) v_i.
The v_i determined by the solution of Eqs. 6.4 may still be called
relative values if we remember that they are relative within a chain.
The difficulty of solving Eqs. 6.3 and 6.4 is about the same as that of
finding the limiting-state-probability matrix S for a multiple-chain
process. We shall see that the relative values v_i are as useful as the
true limiting v_i defined by Eqs. 2.15, as far as the search for the optimal
policy is concerned.
To illustrate these remarks, let us find the gain and relative values of
the two-chain process discussed at the beginning of this section.
Equations 6.3 yield

g_1 = g_1        g_2 = g_2        g_3 = (1/3) g_1 + (1/3) g_2 + (1/3) g_3

Thus there are two independent gains g_1 and g_2. The gain of state 3
is expressed in terms of g_1 and g_2 by g_3 = (1/2) g_1 + (1/2) g_2. If we could find
g_1 and g_2, we should know the gain of every state. In general, we shall
call 1g the gain of chain 1, 2g the gain of chain 2, and so on, and then express
the gain of each state in terms of 1g, 2g, .... This notation cannot be
used until the states are identified with respect to chain membership.
For this problem, g_1 = 1g, g_2 = 2g, and g_3 = (1/2) 1g + (1/2) 2g.
Equations 6.4 yield

g_1 + v_1 = 1 + v_1        g_2 + v_2 = 2 + v_2        g_3 + v_3 = 3 + (1/3) v_1 + (1/3) v_2 + (1/3) v_3

If we now express g_3 in terms of g_1 and g_2 and then set equal to zero
the relative value of one state in each recurrent chain so that
v_1 = v_2 = 0, we obtain

g_1 = 1        g_2 = 2        (1/2) g_1 + (1/2) g_2 + v_3 = 3 + (1/3) v_3

The solution of this set of equations is g_1 = 1, g_2 = 2, v_3 = 2.25, so
that

v_1 = 0        v_2 = 0        v_3 = 2.25

are the gains and relative values for each state of the process. The
gains are of course the same as those obtained earlier.
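These figures are easy to verify numerically; the sketch below (an illustration added to this edition) approximates the matrix S of limiting state probabilities by raising P to a high power, which converges for this example, and then forms the gain vector g = Sq.

import numpy as np

P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1/3, 1/3, 1/3]])
q = np.array([1.0, 2.0, 3.0])

S = np.linalg.matrix_power(P, 200)   # for this chain P^n converges to S
print(S.round(4))                    # rows (1, 0, 0), (0, 1, 0), (0.5, 0.5, 0)
print((S @ q).round(2))              # gain vector [1.0, 2.0, 1.5]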

The Policy-Improvement Routine


We shall now show how the gains and the relative values of a policy
may be used to find the optimal policy for the system. Following
the argument of Chapter 4, if we now have a policy that we have been
following up to stage n, we may find a better decision for the ith state
at stage n + 1 by maximizing

q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j(n)        (4.2)

with respect to all alternatives in state i. For large n, in Expression
4.2 we may substitute the relation in Eqs. 2.15 to obtain

q_i^k + \sum_{j=1}^{N} p_{ij}^k (n g_j + v_j)

or

n \sum_{j=1}^{N} p_{ij}^k g_j + q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j        (6.5)

as the test quantity to be maximized. When n is large, Expression
6.5 is of course maximized by the alternative that maximizes

\sum_{j=1}^{N} p_{ij}^k g_j
Policy Evaluation

Use p_{ij} and q_i for a given policy to solve the double set of
equations

g_i = \sum_{j=1}^{N} p_{ij} g_j

g_i + v_i = q_i + \sum_{j=1}^{N} p_{ij} v_j

for all v_i and g_i by setting the value of one v_i in each recurrent
chain to zero.

Policy Improvement

For each state i, determine the alternative k that maximizes

\sum_{j=1}^{N} p_{ij}^k g_j

using the gains of the previous policy, and make it the new
decision in the ith state.
If

\sum_{j=1}^{N} p_{ij}^k g_j

is the same for all alternatives, or if several alternatives are
equally good according to this test, the decision must be made
on the basis of relative values rather than gains. Therefore,
if the gain test fails, break the tie by determining the alternative
k that maximizes

q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j

using the relative values of the previous policy, and by making
it the new decision in the ith state.
Regardless of whether the policy-improvement test is based
on gains or values, if the old decision in the ith state yields as
high a value of the test quantity as any other alternative,
leave the old decision unchanged. This rule assures convergence
in the case of equivalent policies.
When this procedure has been repeated for all states, a
new policy has been determined and new [p_{ij}] and [q_i] matrices
have been obtained. If the new policy is the same as
the previous one, the iteration process has converged, and the
best policy has been found; otherwise, enter the upper box.

Fig. 6.1. General iteration cycle for discrete sequential decision processes.

the gain test quantity, using the gains of the old policy. However,
when all alternatives have the same value of

Σ_{j=1}^{N} p_ij^k g_j

or when a group of alternatives have the same maximum value of the
gain test quantity, the tie is broken by choosing the alternative that
maximizes the value test quantity,

q_i^k + Σ_{j=1}^{N} p_ij^k v_j

by using the relative values of the old policy. The relative values may
be used for the value test because, as we shall see, the test is not affected
by a constant added to the v_j of all states in a recurrent chain.
The general iteration cycle is shown in Fig. 6.1. Note that it reduces
to our iteration cycle of Fig. 4.2 for completely ergodic processes. An
example with more than one chain will now be discussed, followed by
the relevant proofs of optimality.
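The policy-improvement box of Fig. 6.1 is mechanical once the gains and relative values of the current policy are known. The sketch below is one way of writing it in Python with NumPy; the data layout (a list of alternatives per state, each a pair of a probability row and an expected immediate reward) is an assumption of this illustration, not part of the text.

```python
import numpy as np

def improve_policy(alts, g, v, old_d, tol=1e-9):
    """One pass of the policy-improvement routine of Fig. 6.1.

    alts[i] is a list of (p, q) pairs for state i; g and v are the gains and
    relative values of the old policy; old_d[i] is the old decision in state i
    (an index into alts[i]).
    """
    new_d = list(old_d)
    for i, choices in enumerate(alts):
        gain_test = [np.dot(p, g) for p, _q in choices]
        best_gain = max(gain_test)
        # keep only the alternatives tied for the best gain test quantity
        tied = [k for k, t in enumerate(gain_test) if t > best_gain - tol]
        if len(tied) > 1:
            # break the tie on the value test quantity
            value_test = {k: choices[k][1] + np.dot(choices[k][0], v) for k in tied}
            best_value = max(value_test.values())
            tied = [k for k in tied if value_test[k] > best_value - tol]
        # leave the old decision unchanged if it is among the winners
        new_d[i] = old_d[i] if old_d[i] in tied else tied[0]
    return new_d
```

Applied to the data of Table 6.1 with the gains and relative values of each successive policy, this routine reproduces the selections marked by arrows in Tables 6.2 through 6.4.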

A Multichain Example
Let us find the optimal policy for the three-state system whose
transition probabilities and rewards are shown in Table 6.1. The
transition probabilities are all 1 or 0, first for ease of calculation and
second to show that no difficulties are introduced by such a structure.
This system has the possibility of multiple-chain policies.

Table 6.1. A MULTICHAIN EXAMPLE

State   Alternative     Transition Probabilities      Expected Immediate Reward
  i          k          p_i1^k   p_i2^k   p_i3^k                q_i^k
  1          1             1        0        0                    1
             2             0        1        0                    2
             3             0        0        1                    3
  2          1             1        0        0                    6
             2             0        1        0                    4
             3             0        0        1                    5
  3          1             1        0        0                    8
             2             0        1        0                    9
             3             0        0        1                    7
Let us begin with the policy that maximizes expected immediate
reward. This policy is composed of the third alternative in the first
state, the first alternative in the second state, and the second alternative
in the third state. For this policy

      3            0  0  1            3
d =   1       P =  1  0  0       q =  6
      2            0  1  0            9
We are now ready to enter the policy-evaluation part of the iteration
cycle. Equations 6.3 yield

g1 = g3      g2 = g1      g3 = g2

These results show that there is only one recurrent chain and that
all three states are members of it. If we call its gain g, then
g1 = g2 = g3 = g; the relative value v3 is arbitrarily set equal to zero.
If we use these results in writing Eqs. 6.4, the following equations are
obtained:

g + v1 = 3      g + v2 = 6 + v1      g = 9 + v2

Their solution is g = 6, v1 = v2 = -3, so that

g1 = 6      g2 = 6      g3 = 6

and

v1 = -3      v2 = -3      v3 = 0

We are now ready to seek a policy improvement as shown in Table


6.2.

Table 6.2. FIRST POLICY IMPROVEMENT FOR MULTICHAIN EXAMPLE

State   Alternative   Gain Test Quantity      Value Test Quantity
  i          k        Σ_j p_ij^k g_j          q_i^k + Σ_j p_ij^k v_j
  1          1               6                1 + (-3) = -2
             2               6                2 + (-3) = -1
             3               6                3 +   0  =  3 ←
  2          1               6                6 + (-3) =  3
             2               6                4 + (-3) =  1
             3               6                5 +   0  =  5 ←
  3          1               6                8 + (-3) =  5
             2               6                9 + (-3) =  6
             3               6                7 +   0  =  7 ←
Since the gain test produced ties in all cases, the value test was
necessary. The new policy is

      3            0  0  1            3
d =   3       P =  0  0  1       q =  5
      3            0  0  1            7

This policy must now be evaluated. Equations 6.3 yield

g1 = g3      g2 = g3      g3 = g3

We may let g1 = g2 = g3 = g, set v3 = 0, and use Eqs. 6.4 to obtain

g + v1 = 3      g + v2 = 5      g + v3 = 7

The solution is g = 7, v1 = -4, v2 = -2, and so

g1 = 7      g2 = 7      g3 = 7

and

v1 = -4      v2 = -2      v3 = 0

The policy-improvement routine is shown in Table 6.3.

Table 6.3. SECOND POLICY IMPROVEMENT FOR MULTICHAIN EXAMPLE

State   Alternative   Gain Test Quantity      Value Test Quantity
  i          k        Σ_j p_ij^k g_j          q_i^k + Σ_j p_ij^k v_j
  1          1               7                       -3
             2               7                        0
             3               7                        3 ←
  2          1               7                        2
             2               7                        2
             3               7                        5 ←
  3          1               7                        4
             2               7                        7
             3               7                        7 ←

Since once more the gain test was indeterminate, it was necessary to
rely on the relative-value comparison. In state 3, alternatives 2 and
3 are tied in the value test. However, because alternative 3 was our
old decision, it will remain as our new decision. We have thus obtained
the same policy twice in succession; it must therefore be the optimal
policy. The optimal policy has a gain of 7 in all states. The policy
d = (3, 3, 2), which was possible because of the equality of the value test
in state 3, is also optimal.
Although this system had the capacity for multichain behavior, such
behavior did not appear if we chose as our starting point the policy
that maximized expected immediate reward. Nevertheless, other
choices of starting policy will create this behavior.
Let us assume the following initial policy:

      3            0  0  1            3
d =   2       P =  0  1  0       q =  4
      1            1  0  0            8

To evaluate this policy, we first apply Eqs. 6.3 and obtain

g1 = g3      g2 = g2      g3 = g1

There are two recurrent chains. Chain 1 is composed of states 1 and
3, chain 2 of state 2 alone. Therefore, g1 = g3 = 1g, g2 = 2g, and we
may set v2 = v3 = 0. Equations 6.4 then yield

1g + v1 = 3      2g = 4      1g = 8 + v1

The solution of these equations is 1g = 5.5, 2g = 4, v1 = -2.5, and so

g1 = 5.5      g2 = 4      g3 = 5.5

and

v1 = -2.5      v2 = 0      v3 = 0

Table 6.4 shows the policy-improvement routine.

Table 6.4. POLICY IMPROVEMENT BY A CHANGE IN CHAIN STRUCTURE

State   Alternative   Gain Test Quantity      Value Test Quantity
  i          k        Σ_j p_ij^k g_j          q_i^k + Σ_j p_ij^k v_j
  1          1              5.5                     -1.5
             2              4                        2
             3              5.5                      3 ←
  2          1              5.5                      3.5
             2              4                        4
             3              5.5                      5 ←
  3          1              5.5                      5.5
             2              4                        9
             3              5.5                      7 ←

The policy improvement in this case was performed by means of


both gains and values. The gain test selected two alternatives in each
State, and the value test then decided between them. The policy that

has been produced is the optimal policy that was found earlier, and so
there is no need to continue the procedure because we would only
repeat our earlier work.
In the preceding example we began with a two-chain policy and ended
with the optimal one-chain policy. The reader should start with such
policies as d = (1, 2, 3) and d = (1, 1, 1) to see how the optimal policy with
gain 7 for all states may be reached by various routes. Note that in
no case is it necessary to use the true limiting values v_i; the relative
values are adequate for policy-improvement purposes.

Properties of the Iteration Cycle


We shall now show that the iteration cycle of Fig. 6.1 will lead to
the policy that has a higher gain in each state than any other policy.
Suppose that a policy A has been evaluated so that its gains and values
are known. The policy-improvement routine will use these gains and
values to produce a new policy B. We need to determine the relation-
ship between policies A and B.
If in state i the decision was made on the basis of gains, we know
that

Σ_{j=1}^{N} p_ij^B g_j^A ≥ Σ_{j=1}^{N} p_ij^A g_j^A

where superscripts A and B are used to denote the quantities pertaining
to each policy. In particular, we may define

γ_i = Σ_{j=1}^{N} p_ij^B g_j^A - Σ_{j=1}^{N} p_ij^A g_j^A      (6.6)

The quantity γ_i is greater than zero if the decision in the ith state is
based on gain and is equal to zero if it is based on values. If γ_i is equal
to zero, so that a value decision is made, we know that

q_i^B + Σ_{j=1}^{N} p_ij^B v_j^A ≥ q_i^A + Σ_{j=1}^{N} p_ij^A v_j^A

If we let

ψ_i = q_i^B + Σ_{j=1}^{N} p_ij^B v_j^A - q_i^A - Σ_{j=1}^{N} p_ij^A v_j^A      (6.7)

then ψ_i ≥ 0. If both γ_i and ψ_i = 0, then the policies A and B are
equivalent as far as the test quantities in state i are concerned. In such
a case we would arbitrarily use the decision in state i pertaining to
policy A.
The policy-evaluation equations may now be written for both policies
A and B according to Eqs. 6.3 and 6.4. For policy A we have

g_i^A = Σ_{j=1}^{N} p_ij^A g_j^A      i = 1, 2, ..., N      (6.8)

g_i^A + v_i^A = q_i^A + Σ_{j=1}^{N} p_ij^A v_j^A      i = 1, 2, ..., N      (6.9)

For policy B the corresponding relations are

g_i^B = Σ_{j=1}^{N} p_ij^B g_j^B      i = 1, 2, ..., N      (6.10)

g_i^B + v_i^B = q_i^B + Σ_{j=1}^{N} p_ij^B v_j^B      i = 1, 2, ..., N      (6.11)

Subtraction of Eq. 6.8 from Eq. 6.10 yields

g_i^B - g_i^A = Σ_{j=1}^{N} p_ij^B g_j^B - Σ_{j=1}^{N} p_ij^A g_j^A

If Eq. 6.6 is used to eliminate

Σ_{j=1}^{N} p_ij^A g_j^A

and we let g_i^Δ = g_i^B - g_i^A, then

g_i^Δ = γ_i + Σ_{j=1}^{N} p_ij^B g_j^Δ      i = 1, 2, ..., N      (6.12)

Similarly, if Eq. 6.9 is subtracted from Eq. 6.11, we obtain

g_i^B - g_i^A + v_i^B - v_i^A = q_i^B - q_i^A + Σ_{j=1}^{N} p_ij^B v_j^B - Σ_{j=1}^{N} p_ij^A v_j^A

Equation 6.7 may be used to eliminate q_i^B - q_i^A. Then if we let
v_i^Δ = v_i^B - v_i^A, we have

g_i^Δ + v_i^Δ = ψ_i + Σ_{j=1}^{N} p_ij^B v_j^Δ      i = 1, 2, ..., N      (6.13)

We have now found that the changes in gains and values must satisfy
the two sets of equations (Eqs. 6.12 and 6.13). Equations 6.13 are
identical to Eqs. 6.4 except that they are written in terms of differences
in gain and value rather than the absolute quantities, and ψ_i appears
instead of q_i. However, Eqs. 6.12 differ from Eqs. 6.3 because of the
term γ_i; otherwise, if γ_i were zero, Eqs. 6.12 would bear the same
relation to Eqs. 6.3 that Eqs. 6.13 bear to Eqs. 6.4. Let us investigate
further the nature of Eqs. 6.12.
The policy B described by parameters p_ij^B and q_i^B may of course have
many independent chains. If there are L recurrent chains in the proc-
ess, then we are able to identify L groups of states with the property
that if the system is started in any state within a group it will always
make transitions within that group. In addition there will be an (L + 1)st
group of transient states with the property that if the system is started
in any state of this group it will ultimately make a transition into one
of the L recurrent chains. By a renumbering of states, it is possible
to write the matrix P^B in the partitioned form

         | 11P       0        ...   0        0         |
         | 0         22P      ...   0        0         |
  P^B =  | .         .              .        .         |
         | 0         0        ...   LLP      0         |
         | L+1,1P    L+1,2P   ...   L+1,LP   L+1,L+1P  |

The square submatrices 11P, 22P, ..., LLP are the transition matrices
for the chains 1, 2, ..., L after the renumbering; each is itself a stochas-
tic matrix. Submatrices of the form rsP are composed of zero elements
if r ≠ s and r ≠ L + 1. The submatrix L+1,L+1P is the matrix of
transition probabilities among transient states. Some of the elements
of the submatrices L+1,sP for s = 1, 2, ..., L must be positive.

If the same renumbering scheme is used on the vectors g^Δ, v^Δ, γ, ψ,
and π, we obtain a set of vectors composed of L + 1 subvectors; these
vectors are

g^Δ = (1g^Δ, 2g^Δ, ..., Lg^Δ, L+1g^Δ)      v^Δ = (1v^Δ, 2v^Δ, ..., Lv^Δ, L+1v^Δ)
γ = (1γ, 2γ, ..., Lγ, L+1γ)      ψ = (1ψ, 2ψ, ..., Lψ, L+1ψ)
π = (1π, 2π, ..., Lπ, L+1π)

The vector π is the state-probability vector for the L-chain process.
Each subvector rπ is the limiting-state-probability vector if the system
is started in a state of the rth chain; rπ = rπ rrP, and the sum of the
components of each rπ for r = 1, 2, ..., L is 1. The subvector L+1π
has all components zero because all states in the group L + 1 are
transient.

Equations 6.12 and 6.13 in vector form are

g^Δ = γ + P^B g^Δ      (6.14)

g^Δ + v^Δ = ψ + P^B v^Δ      (6.15)

If the partitioned forms are used in Eq. 6.14, we obtain

rg^Δ = rγ + rrP rg^Δ      r = 1, 2, ..., L      (6.16)

and

L+1g^Δ = L+1γ + Σ_{s=1}^{L+1} L+1,sP sg^Δ      (6.17)

Partitioning transforms Eq. 6.15 into

rg^Δ + rv^Δ = rψ + rrP rv^Δ      r = 1, 2, ..., L      (6.18)

and

L+1g^Δ + L+1v^Δ = L+1ψ + Σ_{s=1}^{L+1} L+1,sP sv^Δ      (6.19)

Suppose that Eq. 6.16 is premultiplied by rπ so that

rπ rg^Δ = rπ rγ + rπ rrP rg^Δ

Since

rπ = rπ rrP

it follows that

rπ rγ = 0      (6.20)

Because all states in the rth chain are recurrent, rπ contains all positive
elements. We know from our earlier discussion that all γ_i are greater
than or equal to zero. From Eq. 6.20 we see that, in any of the
groups r = 1, 2, ..., L, rγ must be zero. It follows that in each re-
current chain of the policy B the decision in each state must be based
on value rather than gain.

Equations 6.16 thus become

rg^Δ = rrP rg^Δ      (6.21)

We know that the solution of these equations is that all rg_i^Δ = rg^Δ, so
that all states in the rth group experience the same increase in gain as
the policy is changed from A to B. If this result is used in Eq. 6.18,
we find that

rg^Δ = rπ rψ      (6.22)

Thus the increase in gain for each state in the rth group is equal to
the vector of limiting state probabilities for the rth group times the
vector of increases in the value test quantity for that group. Since,
for each group r ≤ L, rγ_i = 0, then rψ_i ≥ 0. Equation 6.22 shows that
an increase in gain for each recurrent state of policy B will occur un-
less policies A and B are equivalent.
We have yet to determine whether or not the gain of the transient
states of policy B is increased. Equation 6.17 shows that

(L+1I - L+1,L+1P) L+1g^Δ = L+1γ + Σ_{s=1}^{L} L+1,sP sg^Δ      (6.23)

where L+1I is an identity matrix of the same size as the number of
states in the transient group L + 1. The change in gain of the transient
states is thus given by

L+1g^Δ = (L+1I - L+1,L+1P)^{-1} [L+1γ + Σ_{s=1}^{L} L+1,sP sg^Δ]      (6.24)

In the appendix it is shown that (L+1I - L+1,L+1P)^{-1} exists and has
no negative elements. We know that all γ_i are greater than or equal
to zero, that some elements of the matrices L+1,sP for s = 1, 2, ..., L
are positive and that none are negative, and that the changes in gain
for the L recurrent groups cannot be negative. It follows that the
change in gain for all the transient states of group L + 1 cannot be
negative and will be positive if either or both of two conditions occur.
First, the gain of a transient state will increase if its probabilistic be-
havior is changed so that it is more likely to run into chains of higher
gain. Second, the gain of the transient state will increase if the gains
of the chains into which the transient state runs are increased.
Thus we have shown that under the iteration cycle of Fig. 6.1 the
gain of no state can decrease, and that the gain of some state must
increase unless equivalent policies exist. We have now to show that
the iteration cycle will find the policy that has highest gain in all states.
Suppose that policy B has higher gain in some state than policy A,
but that the iteration cycle has converged on policy A. It follows that
all γ_i ≤ 0, and that, if γ_i = 0, then ψ_i ≤ 0. Equation 6.22 shows that
all rg^Δ are nonpositive, so that no recurrent state of policy B can have
a higher gain than the same state under policy A. Since Eq. 6.24
shows that all L+1g_i^Δ are nonpositive, no transient state of policy B
can have a higher gain than the same state under policy A. Conse-
quently, no state can have a higher gain under policy B than it has
under policy A and still have the iteration cycle converge on policy A.
We have thus shown that the iteration cycle increases the gain of
all states until it converges on the policy that has highest gain in all
states, the optimal policy.
The preceding discussion may be illustrated by means of the multi-
chain example of Table 6.1. Recall the case when the policy

      3            0  0  1            3
d =   2       P =  0  1  0       q =  4
      1            1  0  0            8

was changed to the policy

      3            0  0  1            3
d =   3       P =  0  0  1       q =  5
      3            0  0  1            7

by means of the policy-improvement routine of Table 6.4. The first
policy we have called policy A, the second, policy B. From Table 6.4
we see that

       0              0
γ =   1.5      ψ =    1
       0             1.5

If the identity of states 3 and 1 is interchanged, we have

         1  0  0              0              1.5
P^B =    1  0  0       γ =   1.5      ψ =     1
         1  0  0              0               0

Thus L = 1, there is one recurrent chain, and 11P = [1]. We notice
that in the new state 1 (the old state 3) the decision was based on values
rather than gains. The limiting-state-probability vector for s = 1,
1π, is [1]. Hence from Eq. 6.22

1g^Δ = 1π 1ψ = 1.5

Since

L+1,L+1P = 0      so that      (L+1I - L+1,L+1P)^{-1} = L+1I

we see from Eq. 6.24 that

L+1g^Δ = L+1γ + L+1,1P 1g^Δ = (1.5, 0) + (1, 1)(1.5) = (3, 1.5)

so that

g^Δ = (1.5, 3, 1.5)

If now the renumbering of states 1 and 3 is reversed, the vector g^Δ is
unchanged. Hence we find that, in going from policy A to policy B,
states 1 and 3 should have their gain increased by 1.5, while state 2
should have its gain increased by 3. Reference to the policy-evaluation
equations solved earlier for policies A and B shows that this was indeed
the case.
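The same figures fall out of a direct calculation from the definitions; the following is a minimal numerical sketch, assuming NumPy, with the gains and relative values of policy A taken from the evaluation above:

```python
import numpy as np

# Policies A and B of the example (alternatives from Table 6.1).
P_A = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0]], float); q_A = np.array([3., 4., 8.])
P_B = np.array([[0, 0, 1], [0, 0, 1], [0, 0, 1]], float); q_B = np.array([3., 5., 7.])

g_A = np.array([5.5, 4.0, 5.5])     # gains of policy A
v_A = np.array([-2.5, 0.0, 0.0])    # relative values of policy A

gamma = P_B @ g_A - P_A @ g_A                   # Eq. 6.6
psi = q_B + P_B @ v_A - (q_A + P_A @ v_A)       # Eq. 6.7

# Under policy B the single recurrent chain is {state 3}; states 1 and 2 are
# transient and move to it in one step, so Eqs. 6.22 and 6.24 apply directly.
g_delta_3 = psi[2]                 # Eq. 6.22 with 1π = [1]
g_delta_12 = gamma[:2] + g_delta_3 # Eq. 6.24, with (I - 0)^(-1) = I

print(gamma, psi)               # [0.  1.5 0. ]  [0.  1.  1.5]
print(g_delta_12, g_delta_3)    # [1.5 3. ] 1.5, so gains rise from (5.5, 4, 5.5) to (7, 7, 7)
```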
We have seen that the multichain sequential decision process may be
solved by a method extremely analogous to that for single-chain
processes. However, in most practical problems knowledge of the
process enables us to use the simpler single-chain approach.
The Sequential Decision Process
with Discounting

In many economic systems the cost of money is very important.


We might criticize the automobile replacement problem of Chapter 5,
for example, because a dollar of expenditure in the future was given as
much weight as a dollar spent at the present time. This chapter will
overcome such criticisms by extending our analysis of sequential decision
processes to the case in which discounting of future returns is important.
Consider a Markov process with rewards described by a transition-
probability matrix P and a reward matrix R. Let the quantity β be
defined as the value at the beginning of a transition interval of a unit
sum received at the end of the interval. It follows that the discount
factor β must be the reciprocal of 1 plus the interest rate that is appli-
cable to the time interval required for a transition. For a nonzero
interest rate, 0 < β < 1.
Let us suppose that r_ij in such processes is received at the beginning
of the transition from i to j. Then, if v_i(n) is defined as the present
value of the total expected reward for a system in state i with n transi-
tions remaining before termination, we obtain

v_i(n) = Σ_{j=1}^{N} p_ij [r_ij + β v_j(n - 1)]      i = 1, 2, ..., N;  n = 1, 2, 3, ...      (7.1)

by analogy with Eqs. 2.1. Once again we may use the set of expected
immediate rewards

q_i = Σ_{j=1}^{N} p_ij r_ij

to obtain for the basic recurrence relation

v_i(n) = q_i + β Σ_{j=1}^{N} p_ij v_j(n - 1)      i = 1, 2, ..., N;  n = 1, 2, 3, ...      (7.2)

Equations 7.2 may also be used to analyze processes where rewards
are received at the end of a transition rather than at the beginning.
All that is required is that we interpret q_i as the expected present value
of the rewards received in the next transition out of state i. In this
way we may use Eqs. 7.2 to analyze situations where rewards are
distributed in some arbitrary fashion over the transition interval.
Furthermore, Eqs. 7.2 may be used to analyze processes where
discounting is not present but where there is uncertainty concerning the
duration of the process. To see this, let β be defined as the probability
that the process will continue to earn rewards after the next transition.
Then 1 - β is the probability that the process will stop at its present
stage. If the process receives no reward from stopping, then Eqs.
7.1 and 7.2 still describe the process. It will thus not be necessary
in the following to distinguish between processes with discounting and
processes with indefinite duration.
Let v(n) be the vector of total expected rewards and q be the vector
of expected immediate rewards. Equations 7.2 may be written as

v(n + 1) = q + βP v(n)      (7.3)

If the vector v(z) is defined as the z-transform of the vector v(n), then
by the techniques of Chapter 1, we may take the z-transform of Eq. 7.3
to obtain the matrix equation

z^{-1}[v(z) - v(0)] = [1/(1 - z)] q + βP v(z)

Then

v(z) - v(0) = [z/(1 - z)] q + βzP v(z)

(I - βzP) v(z) = [z/(1 - z)] q + v(0)

and finally

v(z) = [z/(1 - z)](I - βzP)^{-1} q + (I - βzP)^{-1} v(0)      (7.4)

We have thus found the z-transform of v(n). It is now possible to
find a closed-form expression for v(n) in any problem, so that it is not
necessary to rely on the recurrence relation (Eq. 7.3).
Let us illustrate these results by applying them to the toymaker's
problem of Table 3.1. Suppose that the toymaker is following the
policy

d = (1, 1)      P = [0.5  0.5 ; 0.4  0.6]      q = (6, -3)

He is not advertising and not conducting research. Suppose also that
for each week there is a probability of 1/2 that he will go completely out
of business before the next week starts. If he goes out of business, he
still receives his immediate rewards for the present week but receives
no other reward. The problem as stated fits our framework with
β = 1/2. We shall assume that v(0) = 0; therefore, if he does survive
for all n stages his business will have no value at that time.
For this problem, Eq. 7.4 becomes

v(z) = [z/(1 - z)](I - (1/2)zP)^{-1} q

or

v(z) = H(z) q

where we consider H(z) to be the z-transform of a response function
H(n). Finally, v(n) = H(n) q.

We must first find (I - (1/2)zP)^{-1}. Since β = 1/2,

(I - (1/2)zP) = [1 - z/4      -z/4 ; -z/5      1 - 3z/10]

and

(I - (1/2)zP)^{-1} = 1/[(1 - z/2)(1 - z/20)] [1 - 3z/10      z/4 ; z/5      1 - z/4]

Thus

H(z) = z/[(1 - z)(1 - z/2)(1 - z/20)] [1 - 3z/10      z/4 ; z/5      1 - z/4]

By partial-fraction expansion

H(z) = 1/(1 - z) [28/19   10/19 ; 8/19   30/19]
       + 1/(1 - z/2) [-8/9   -10/9 ; -8/9   -10/9]
       + 1/(1 - z/20) [-100/171   100/171 ; 80/171   -80/171]

and

H(n) = [28/19   10/19 ; 8/19   30/19]
       + (1/2)^n [-8/9   -10/9 ; -8/9   -10/9]
       + (1/20)^n [-100/171   100/171 ; 80/171   -80/171]

Since v(n) = H(n)q, the problem of finding v(n) has been solved for an
arbitrary q. For q = (6, -3),

v(n) = [138/19 ; -42/19] + (1/2)^n [-2 ; -2] + (1/20)^n [-100/19 ; 80/19]
If the toymaker is in state 1 and has n possible stages remaining,
the present value of his expected rewards from the n stages is
v1(n) = 138/19 - 2(1/2)^n - (100/19)(1/20)^n. The corresponding quantity if he is in
state 2 is v2(n) = -42/19 - 2(1/2)^n + (80/19)(1/20)^n. Note that v1(0) = v2(0)
= 0, as required. For v(0) = 0, Eqs. 7.2 show that v1(1) = 6 and
v2(1) = -3; these results are also confirmed by our solution. The
z-transform method is thus a straightforward way to find the present
value of the future rewards of a process at any stage.
We note that as n becomes very large v1(n) approaches 138/19 and
v2(n) approaches -42/19. For a process with discounting, the expected
future reward does not grow with n as it did in the no-discounting case.
Indeed, the present value of future returns approaches a constant
value as n becomes very large. We shall have more to say of this
behavior.
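The same numbers can be produced without transform work by iterating Eq. 7.3 directly and comparing with the limit (I - βP)^{-1}q; a minimal sketch, assuming NumPy:

```python
import numpy as np

beta = 0.5
P = np.array([[0.5, 0.5], [0.4, 0.6]])
q = np.array([6.0, -3.0])

v = np.zeros(2)                       # v(0) = 0
for n in range(1, 11):
    v = q + beta * P @ v              # Eq. 7.3
    print(n, v)

v_limit = np.linalg.solve(np.eye(2) - beta * P, q)
print(v_limit)                        # [ 7.263 -2.211 ] = (138/19, -42/19)
```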

The Sequential Decision Process with Discounting Solved by Value


Iteration
Just as we could use the value-iteration method to solve the sequential
decision process when discounting was not important, we may now use
it when discounting is important. We desire to find at each stage n
the alternative we should choose in each state to make v;(m), the present
value of future rewards, as large as possible. By analogy with the
recurrence equation (Eq. 3.3) of the no-discounting case, we obtain
for the case where discounting is important the equation

v_i(n + 1) = max_k [ q_i^k + β Σ_{j=1}^{N} p_ij^k v_j(n) ]      (7.5)

In this equation v_i(n) is defined as the present value of the rewards
from the remaining stages if the system is now in state i and if the
optimal selection of alternatives has been performed at each stage
through stage n. For each state, the alternative k that maximizes

q_i^k + β Σ_{j=1}^{N} p_ij^k v_j(n)

is used as the decision for the ith state at stage n + 1, or d_i(n + 1).
Since the v_j(n) are known for stage n, all the quantities needed to
make the test at stage n + 1 are at hand. Once v(0) is specified, the
procedure can be carried through to any stage desired.
Let us work the toymaker example described by Table 3.1. We shall
assume that β = 0.9, so that either the toymaker has an interest rate
on his operation of 11.1 per cent per week or there is a probability 0.1
that he will go out of business in each week. The interest rate is
absurdly high, but it illustrates how such a problem is handled. If
transitions were made once a year, such an interest rate might be more
realistic.
The solution of this problem with use of Eq. 7.5 is shown in Table
7.1. Once more we assume that v1(0) = v2(0) = 0.

Table 7.1. SOLUTION OF TOYMAKER'S PROBLEM WITH DISCOUNTING
USING VALUE ITERATION

n =        0        1        2         3           4
v1(n)      0        6       7.78     9.2362     10.533658
v2(n)      0       -3      -2.03    -0.6467      0.644197
d1(n)      -        1        2         2           2
d2(n)      -        1        2         2           2
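Equation 7.5 is easy to mechanize; the sketch below (assuming NumPy, with the toymaker's alternatives of Table 3.1 written out explicitly) generates the successive v_i(n) and d_i(n) of Table 7.1:

```python
import numpy as np

beta = 0.9
# (probability row, expected immediate reward) for each alternative, by state
alts = [
    [(np.array([0.5, 0.5]), 6.0), (np.array([0.8, 0.2]), 4.0)],    # state 1
    [(np.array([0.4, 0.6]), -3.0), (np.array([0.7, 0.3]), -5.0)],  # state 2
]

v = np.zeros(2)                                   # v(0) = 0
for n in range(1, 5):
    tests = [[q + beta * p @ v for p, q in a] for a in alts]
    d = [int(np.argmax(t)) + 1 for t in tests]    # decision d_i(n)
    v = np.array([max(t) for t in tests])         # v_i(n) from Eq. 7.5
    print(n, d, v)
```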

As we shall soon prove, the total expected rewards v_i(n) will increase
with n and approach the values v1(n) = 22.2 and v2(n) = 12.3 as n
becomes very large. The policy of the toymaker should be to use the
second alternative in each state if n > 1. Since we have seen how the
v_i(n) approach asymptotic values for large n, we might ask if there is
any way we can by-pass the recurrence relation and develop a technique
that will yield directly the optimal policy for the system of very long
duration. The answer is that we do have such a procedure and that
it is completely analogous to the policy-iteration technique used for
processes without discounting. Since the concept of gain has no
meaning when rewards are discounted, the optimal policy is the one
that produces the highest present value in all states. We shall now
describe the new forms that the value-determination operation and the
policy-improvement routine assume. We shall see that the sequential
decision process with discounting is as easy to solve as the completely
ergodic process without discounting. We need no longer be concerned
with the chain structure of the Markov process.

The Value-Determination Operation


Suppose that the system is operating under a given policy so that a
given Markov process with rewards has been specified. Then the
z-transform of v(n), the vector of present values of expected reward in
n stages, is given by Eq. 7.4 as

v(z) = [z/(1 - z)](I - βzP)^{-1} q + (I - βzP)^{-1} v(0)      (7.4)

It was shown in Chapter 1 that (I - zP)^{-1} could be written in the form
[1/(1 - z)]S + T(z), where S is the matrix of limiting state probabilities
and T(z) is the transformed matrix of components that fall to zero as n
becomes large. It follows that (I - βzP)^{-1} can be written in the form

(I - βzP)^{-1} = [1/(1 - βz)] S + T(βz)      (7.6)

and that T(βz) now refers to components that fall to zero even faster
as n grows large. Then Eq. 7.4 becomes

v(z) = [z/(1 - z)] {[1/(1 - βz)] S + T(βz)} q + {[1/(1 - βz)] S + T(βz)} v(0)      (7.7)

Let us investigate the behavior of Eq. 7.7 for large n. The coefficient
of v(0) represents terms that decay to zero, so that this term disappears.
The coefficient of q represents a step component that will remain
plus transient components that will vanish. By partial-fraction
expansion the step component has magnitude [1/(1 - β)]S + T(β).
Thus for large n, v(z) becomes [1/(1 - z)]{[1/(1 - β)]S + T(β)}q.
For large n, v(n) takes the form {[1/(1 - β)]S + T(β)}q. However,
{[1/(1 - β)]S + T(β)} is equal to (I - βP)^{-1}, by Eq. 7.6. Therefore,
for large n, v(n) approaches a limit, designated v, that is defined by

v = (I - βP)^{-1} q      (7.8)

The vector v may be called the vector of present values, because each
of its elements v_i is the present value of an infinite number of future
expected rewards discounted by the discount factor β.
We may also derive Eq. 7.8 directly from Eq. 7.3:

v(n + 1) = q + βP v(n)      (7.3)

If we write v(1), v(2), v(3), ... in explicit form, we find

v(1) = q + βP v(0)
v(2) = q + βPq + β²P² v(0)
v(3) = q + βPq + β²P²q + β³P³ v(0)

The general form of these equations is

v(n) = [Σ_{j=0}^{n-1} (βP)^j] q + β^n P^n v(0)

Since 0 < β < 1,

lim_{n→∞} v(n) = Σ_{j=0}^{∞} (βP)^j q

Because P is a stochastic matrix, all of its eigenvalues are less
than or equal to 1 in magnitude. The matrix βP therefore has eigen-
values that are strictly less than 1 in magnitude because 0 < β < 1.

We may thus write Σ_{j=0}^{∞} (βP)^j = (I - βP)^{-1} and obtain lim_{n→∞} v(n) = v =

(I - βP)^{-1} q, or Eq. 7.8.
The present value of future rewards in each state is finite and equal
to the inverse of the (I - βP) matrix postmultiplied by the q vector.
Note for future reference that, since P is a matrix with nonnegative

elements, (I - βP)^{-1} = Σ_{j=0}^{∞} (βP)^j must have nonnegative elements and,

moreover, must have numbers at least as great as 1 on the main diagonal.
This result is understandable from physical considerations because a q
with nonnegative elements must produce a v with nonnegative elements.
Since no rewards are negative, no present value can be negative.
We are now in a position to describe the value determination itself.
Because we are interested in the sequential decision process for large n,
we may substitute the present values v_i = lim_{n→∞} v_i(n) for the quantities
v_i(n) in Eq. 7.2 to obtain the equations

v_i = q_i + β Σ_{j=1}^{N} p_ij v_j      i = 1, 2, ..., N      (7.9)

For a given set of transition probabilities p_ij and expected immediate
rewards q_i, we may use Eqs. 7.9 to solve for the present values of the
process. We are interested in the present values not only because they
are the quantities that we seek to maximize in the system but also
because they are the key to finding the optimal policy, as we shall see
when we discuss the policy-improvement routine.
Let us find the present values for β = 1/2 of the toymaker's policy
defined by

d = (1, 1)      P = [0.5  0.5 ; 0.4  0.6]      q = (6, -3)

Equations 7.9 yield

v1 = 6 + (1/4)v1 + (1/4)v2      v2 = -3 + (1/5)v1 + (3/10)v2

The solution is v1 = 138/19, v2 = -42/19. These are the limiting values
for v1(n) and v2(n) found earlier.
We shall now see how to use the present values for policy improve-
ment.

The Policy-Improvement Routine


The optimal policy is the one that has highest present values in all
states. If we had a policy that was optimal up to stage n, according
to Eq. 7.5 we should maximize

q_i^k + β Σ_{j=1}^{N} p_ij^k v_j(n)

with respect to all alternatives k in the ith state. Since we are now deal-
ing only with processes that have a large number of stages, we may
substitute the present value v_j for v_j(n) in this expression. We must
now maximize

q_i^k + β Σ_{j=1}^{N} p_ij^k v_j

with respect to all alternatives in the ith state.

Suppose that the present values for an arbitrary policy have been
determined. Then a better policy, one with higher present values in
every state, can be found by the following procedure, which we call
the policy-improvement routine.
For each i, find the alternative k that maximizes

q_i^k + β Σ_{j=1}^{N} p_ij^k v_j

using the v_j determined for the original policy. This k now becomes
the new decision in the ith state. A new policy has been determined
when this procedure has been performed for every state.
The policy-improvement routine can then be combined with the
The policy-improvement routine can then be combined with the

value-determination operation in the iteration cycle shown in Fig. 7.1.

Value-Determination Operation

Use p_ij and q_i for the given policy to solve the set of equations

v_i = q_i + β Σ_{j=1}^{N} p_ij v_j      i = 1, 2, ..., N

for all present values v_i.

Policy-Improvement Routine

For each state i, find the alternative k' that maximizes

q_i^{k'} + β Σ_{j=1}^{N} p_ij^{k'} v_j

using the present values v_j from the previous policy. Then
k' becomes the new decision in the ith state, q_i^{k'} becomes q_i,
and p_ij^{k'} becomes p_ij.

Fig. 7.1. Iteration cycle for discrete decision processes with discounting.

The iteration cycle may be entered in either box. An initial policy may
be selected and the iteration begun with the value-determination opera-
tion, or an initial set of present values may be chosen and the iteration
started in the policy-improvement routine. If there is no a priori basis for
choosing a close-to-optimal policy, then it is often convenient to start the
process in the policy-improvement routine with all v; set equal to zero.
The initial policy selected will then be the one that maximizes expected
immediate reward, a very satisfactory starting point in most cases.
The iteration cycle will be able to make policy improvements until
the policies on two successive iterations are identical. At this point it
has found the optimal policy, and the problem is completed. It will
be shown after the example of the next section that the policy-improve-
ment routine must increase or leave unchanged the present values of
every state and that it cannot converge on a nonoptimal policy.
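The complete cycle of Fig. 7.1 takes only a few lines of code; the following is a minimal sketch (assuming NumPy), written for the same per-state list-of-alternatives layout used above and started, as suggested, from v = 0 in the policy-improvement box. Ties are broken toward the lower-numbered alternative in this sketch.

```python
import numpy as np

def policy_iteration_discounted(alts, beta):
    """Iteration cycle of Fig. 7.1.  alts[i] is a list of (p_row, q) pairs."""
    N = len(alts)
    v = np.zeros(N)
    d = None
    while True:
        # policy-improvement routine
        new_d = [max(range(len(a)), key=lambda k: a[k][1] + beta * a[k][0] @ v)
                 for a in alts]
        if new_d == d:
            return d, v
        d = new_d
        # value-determination operation: solve v = q_d + beta * P_d v
        P = np.array([alts[i][d[i]][0] for i in range(N)])
        q = np.array([alts[i][d[i]][1] for i in range(N)])
        v = np.linalg.solve(np.eye(N) - beta * P, q)

# Toymaker data (Table 3.1), beta = 0.9: converges to the second alternative
# in each state with present values of about 22.2 and 12.3.
alts = [
    [(np.array([0.5, 0.5]), 6.0), (np.array([0.8, 0.2]), 4.0)],
    [(np.array([0.4, 0.6]), -3.0), (np.array([0.7, 0.3]), -5.0)],
]
print(policy_iteration_discounted(alts, 0.9))
```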

An Example
Let us solve the toymaker’s problem that was solved by value
iteration earlier in this chapter. The data were given in Table 3.1,
AN EXAMPLE 85
and as before 8 = 0.9. We seek the policy that the toymaker should
follow if his rewards are discounted and he is going to continue his
business indefinitely. The optimal policy is the one that maximizes the
present value of all his future rewards.
Let us choose as the initial policy the one that maximizes his
expected immediate reward. This is the policy formed by the first
alternative in each state, so that

d = (1, 1)      P = [0.5  0.5 ; 0.4  0.6]      q = (6, -3)

Equations 7.9 of the value-determination operation yield

v1 = 6 + 0.9(0.5v1 + 0.5v2)      v2 = -3 + 0.9(0.4v1 + 0.6v2)

The solution is v1 = 15.5, v2 = 5.6. The policy-improvement routine
is now used as shown in Table 7.2.
is now used as shown in Table 7.2.

Table 7.2. FIRST POLICY IMPROVEMENT FOR TOYMAKER'S PROBLEM
WITH DISCOUNTING

State   Alternative   Value Test Quantity
  i          k        q_i^k + β Σ_j p_ij^k v_j

  1          1         6 + 0.9[0.5(15.5) + 0.5(5.6)] = 15.5
             2         4 + 0.9[0.8(15.5) + 0.2(5.6)] = 16.2 ←

  2          1        -3 + 0.9[0.4(15.5) + 0.6(5.6)] =  5.6
             2        -5 + 0.9[0.7(15.5) + 0.3(5.6)] =  6.3 ←

The second alternative in each state provides a better policy, so that now

d = (2, 2)      P = [0.8  0.2 ; 0.7  0.3]      q = (4, -5)

The value-determination operation for this policy provides the
equations

v1 = 4 + 0.9(0.8v1 + 0.2v2)      v2 = -5 + 0.9(0.7v1 + 0.3v2)

From these equations we find v1 = 22.2, v2 = 12.3.


Notice that there has been a significant increase in present values in
both states. The policy-improvement routine must be used once more
as we see in Table 7.3.
The policy-improvement routine has found the same policy that it did
in the previous iteration, so that this policy must be optimal. The
second alternative in each state should be used if the present values of
both states are to be maximized. The toymaker should advertise and
do research even if faced by the 11.1 per cent interest rate per week.
Table 7.3. SECOND POLICY IMPROVEMENT FOR TOYMAKER'S PROBLEM
WITH DISCOUNTING

State   Alternative   Value Test Quantity
  i          k        q_i^k + β Σ_j p_ij^k v_j
  1          1              21.5
             2              22.2 ←

  2          1              11.6
             2              12.3 ←

The present values of the two states under the optimal policy are 22.2
and 12.3, respectively; these present values must be higher than those
of any other policy. The reader should check the policies d = (1, 2)
and d = (2, 1) to make sure that this is the case.

We have seen that if the discount factor is 0.9 the optimal no-
discounting policy found in Chapter 4 is still optimal for the toymaker.
We shall say more about how the discount factor affects the optimal
policy after we prove the properties of the iteration cycle.

Proof of the Properties of the Iteration Cycle


Consider a policy A and its successor policy B produced by the
policy-improvement routine. Since B was generated from A, it follows
that

q_i^B + β Σ_{j=1}^{N} p_ij^B v_j^A ≥ q_i^A + β Σ_{j=1}^{N} p_ij^A v_j^A      (7.10)

in every state i. We also know for the policies taken individually that

v_i^A = q_i^A + β Σ_{j=1}^{N} p_ij^A v_j^A      (7.11)

v_i^B = q_i^B + β Σ_{j=1}^{N} p_ij^B v_j^B      (7.12)

Let

γ_i = q_i^B + β Σ_{j=1}^{N} p_ij^B v_j^A - q_i^A - β Σ_{j=1}^{N} p_ij^A v_j^A      (7.13)

Thus γ_i is the improvement in the test quantity that the policy-improve-
ment routine was able to achieve in the ith state; from the preceding
definition, γ_i ≥ 0. If we subtract Eq. 7.11 from Eq. 7.12, we obtain

v_i^B - v_i^A = q_i^B - q_i^A + β Σ_{j=1}^{N} p_ij^B v_j^B - β Σ_{j=1}^{N} p_ij^A v_j^A

             = γ_i - β Σ_{j=1}^{N} p_ij^B v_j^A + β Σ_{j=1}^{N} p_ij^A v_j^A
                   + β Σ_{j=1}^{N} p_ij^B v_j^B - β Σ_{j=1}^{N} p_ij^A v_j^A

If v_i^Δ = v_i^B - v_i^A, the increase in present value in the ith state, then

v_i^Δ = γ_i + β Σ_{j=1}^{N} p_ij^B v_j^Δ

This set of equations has the same form as our original present-value
equations (Eq. 7.9), but it is written in terms of the increase in present
values. We know that the solution in vector form is

v^Δ = (I - βP^B)^{-1} γ      (7.14)

where γ is the vector with components γ_i. It was shown earlier that
(I - βP^B)^{-1} has nonnegative elements and has values of at least 1 on the
main diagonal. Hence, if any γ_i > 0, at least one v_i^Δ must be greater
than zero, and no v_i^Δ can be less than zero. Therefore, the policy-
improvement routine must increase the present values of at least one
state and cannot decrease the present values of any state.
Is it possible for the routine to converge on policy A when policy B
produces a higher present value in some state? No, because if the
policy-improvement routine converges on A, then all γ_i ≤ 0, and hence,
all v_i^Δ ≤ 0. It follows that when the policy-improvement routine has
converged on a policy no other policy can have higher present values.

The Sensitivity of the Optimal Policy to the Discount Factor


The taxicab problem discussed in Chapter 5 was solved for discount
factors β ranging from 0 to 0.95 with intervals of 0.05. In this example,
1 - β might be considered to be the probability that the driver's cab
will break down before his next trip. The optimal policy and present
values for each situation are shown in Table 7.4. We see that, although
the present values change as β increases, the optimal policy changes
only as we pass certain critical values of β. More detailed calculation
reveals that these critical values of β are approximately 0.13, 0.53, and
0.77. The solution for the optimal policy for different values of β
is shown in Fig. 7.2. For β between 0 and 0.13, the first alternative in
each state is the optimal policy; the driver should cruise in every town.
Table 7.4. OPTIMAL POLICY AND PRESENT VALUES FOR THE TAXICAB
PROBLEM AS A FUNCTION OF THE DISCOUNT FACTOR β

Discount Factor    Optimal Policy Decisions        Present Values
      β            State 1  State 2  State 3    State 1  State 2  State 3

     0                1        1        1         8.00    16.00     7.00
     0.05             1        1        1         8.51    16.40     7.50
     0.10             1        1        1         9.08    16.86     8.05
     0.15             1        2        1         9.71    17.46     8.67
     0.20             1        2        1        10.44    18.48     9.38
     0.25             1        2        1        11.27    19.63    10.21
     0.30             1        2        1        12.24    20.93    11.16
     0.35             1        2        1        13.38    22.43    12.28
     0.40             1        2        1        14.72    24.17    13.61
     0.45             1        2        1        16.33    26.21    15.21
     0.50             1        2        1        18.30    28.64    17.16
     0.55             1        2        2        20.79    31.61    19.83
     0.60             1        2        2        24.03    35.55    23.46
     0.65             1        2        2        28.28    40.10    28.13
     0.70             1        2        2        34.06    46.44    34.37
     0.75             1        2        2        42.32    55.29    43.11
     0.80             2        2        2        55.08    68.56    56.27
     0.85             2        2        2        77.25    90.81    78.43
     0.90             2        2        2       121.65   135.31   122.84
     0.95             2        2        2       255.02   268.76   256.20

[Figure: the interval 0 ≤ β ≤ 1 divided at β = 0.13, 0.53, and 0.77 into
Regions I, II, III, and IV, with a different optimal policy in each region:
(1, 1, 1) in Region I, (1, 2, 1) in Region II, (1, 2, 2) in Region III, and
(2, 2, 2) in Region IV.]

Fig. 7.2. Optimal policy as a function of discount factor for taxicab problem.

For β > 0.77, the second alternative in each state is the optimal
policy; the driver should always proceed to the nearest stand. In
Region I, the policy that maximizes expected immediate reward is
optimal; in Region IV, the no-discounting policy is best. An inter-
mediate policy should be followed in Regions II and III.
The behavior first described enables us to draw several conclusions
about the place of processes with discounting in the analysis of sequen-
tial decision processes. First, even if the no-discounting process
described earlier is the preferred model of the system, the present
analysis will tell us how large the discounting element of the problem
must be before the no-discounting solution is no longer applicable.
Second, one criticism of a model that includes discounting is the
frequent difficulty of determining what the appropriate discount rate
should be. Figure 7.2 shows us that if the uncertainty about the
discount rate spans only one of our regions, the same policy will be
optimal, and the exact discount rate will affect only the present values.
Third, because it becomes increasingly difficult to solve the process
using discounting when β is near 1, in such a situation we are better
advised to solve the problem for the optimal policy without discounting.
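A sensitivity study of the kind summarized in Fig. 7.2 can be carried out for any problem simply by re-solving it over a grid of discount factors. The sketch below does this for the toymaker data used earlier (an illustration only; the taxicab data of Chapter 5 are not reproduced here). Since the optimal policy maximizes the present value of every state simultaneously, any aggregate of the present values identifies it; with only four stationary policies we can afford to enumerate them all.

```python
import numpy as np
from itertools import product

# toymaker alternatives (Table 3.1)
alts = [
    [(np.array([0.5, 0.5]), 6.0), (np.array([0.8, 0.2]), 4.0)],
    [(np.array([0.4, 0.6]), -3.0), (np.array([0.7, 0.3]), -5.0)],
]

def present_values(d, beta):
    P = np.array([alts[i][d[i]][0] for i in range(len(alts))])
    q = np.array([alts[i][d[i]][1] for i in range(len(alts))])
    return np.linalg.solve(np.eye(len(alts)) - beta * P, q)

previous = None
for beta in np.arange(0.05, 1.0, 0.05):
    best = max(product(range(2), repeat=2),
               key=lambda d: present_values(d, beta).sum())
    if best != previous:                       # report each policy change
        print(f"beta = {beta:.2f}: optimal policy {tuple(k + 1 for k in best)}")
        previous = best
```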

The Automobile Problem with Discounting


The automobile replacement problem discussed in Chapter 5 was
solved using a discount factor β = 0.97. This discount factor corre-
sponds to an annual interest rate of approximately 12 per cent, a fairly
realistic cost of money for the average car purchaser. Recall that the
optimal no-discounting policy was found in seven iterations and that
it was to buy a 3-year-old car and keep it until it was 6½ years old.
The optimal policy with discounting was found in nine iterations; it is
to buy a 3-year-old car and trade it when it is 6¾ years old. The
optimal no-discounting and discounting policies are very similar;
if the 3 to 6½ policy is evaluated with a discount factor β = 0.97, its
present values differ negligibly from those of the 3 to 6¾ policy. This
result emphasizes the point made above that the no-discounting policy
is often adequate for relatively low interest rates.
The present values of the optimal policy with discounting are of
interest; they are presented in Table 7.5, along with the decision in
each state. Note that if we have a car less than 1 year old we should
trade it for a 3-year-old car. The present values are negative be-
cause they represent a discounted stream of future costs. The present
value of a 1-year-old car is -$4332, while the present value of a
4-year-old car is -$4946. These figures tell us that if we have a
4-year-old car we should depart from our optimal 3 to 6¾ policy if
we can trade it for a 1-year-old car and pay less than (-$4332)
- (-$4946) = $614. In the no-discounting case the corresponding
quantity was $730, so that we were somewhat more willing to make such
a trade when the cost of money was not important.
An interesting business opportunity is presented by Table 7.5. It
appears that for a cash deposit of about $5000 some entrepreneur
Table 7.5. OptimAL PoLIcy AND PRESENT VALUES OF AUTOMOBILE
REPLACEMENT PROBLEM FOR DiscounT Factor 8 = 0.97
State: Age of Car in
Quarterly Periods Decision Present Value

1 Trade for 12-period car — $3925


2 Trade for 12-period car — $4045
3 Trade for 12-period car — $4155
4 Keep present car — $4332
5 Keep present car — $4398
6 Keep present car — $4462
7 Keep present car — $4523
8 Keep present car — $4581
9 Keep present car — $4635
10 Keep present car — $4688
11 Keep present car — $4738
12 Keep present car — $4785
13 Keep present car — $4829
14 Keep present car — $4870
15 Keep present car — $4909
16 Keep present car — $4946
17 Keep present car — $4979
18 Keep present car — $5011
19 Keep present car — $5041
20 Keep present car — $5069
21 Keep present car — $5096
22 Keep present car — $5121
23 Keep present car — $5145
24 Keep present car — $5167
25 Keep present car — $5186
26 Keep present car — $5202
27 Trade for 12-period car — $5215
28 Trade for 12-period car — $5225
29 Trade for 12-period car — $5235
30 Trade for 12-period car — $5240
31 Trade for 12-period car — $5245
32 Trade for 12-period car — $5250
33 Trade for 12-period car — $5255
34 Trade for 12-period car — $5265
35 Trade for 12-period car — $5270
36 Trade for 12-period car — $5275
37 Trade for 12-period car — $5280
38 Trade for 12-period car — $5290
39 Trade for 12-period car — $5298
40 Trade for 12-period car — $5305

should be willing to supply us with the use of a car between 3 and 6¾
years old forever. In order to make the deal more appealing to him,
we might make the deposit $6000 and allow him some profit. How
unusual it would be to pay for a lifetime of car ownership in advance
rather than by an unending stream of time payments and gasoline
bills.

Summary
The solution of sequential decision processes is of the same order of
difficulty whether or not discounting is introduced. In either case it is
necessary to solve repeatedly a set of linear simultaneous equations.
Each solution is followed by a set of comparisons to discover an im-
proved policy; convergence on the optimal policy is assured. Dis-
counting is useful when the cost of money is important or when there is
uncertainty concerning the duration of the process.
The Continuous-Time
Decision Process

In the previous chapters we have been discussing Markov processes


that make state transitions at discrete, uniformly spaced intervals of
time. In this chapter we shall extend our previous work to the case
in which the process may make transitions at random time intervals.

The Continuous-Time Markov Process

The first problem we face is how to describe an N-state process whose


time between transitions is random. Reflection shows that the sig-
nificant parameters of the process must be transition rates rather than
transition probabilities. Let us call aj; the transition rate of a process
from state 7 to statej, fori #7. The quantity aj; is defined as follows:
In a short time interval dt, a process that is now in state 7 will make a
transition to state 7 with probability aij dt (1 # 7). The probability of
two or more state transitions is of the order of (dt)? or higher and is
assumed to be zero if dt is taken sufficiently small. The correspondence
between this definition and the assumptions of the Poisson process
should be clear. We shall consider only those processes for which the
transition rates a;; are constants, an assumption equivalent in the dis-
crete-time case to the assumption that the transition probabilities do
not change with time. We may now describe the continuous-time
Markov process by a transition-rate matrix A with components 4j;;
the diagonal elements have not yet been defined.
The probability that the system occupies state 7 at a time ¢ after the
start of the process is the state probability x;,(¢) by analogy with m(m).
92
CONTINUOUS-TIME MARKOV PROCESS 93
We may relate the state probabilities at time t to those a short time
dt later by the equations

π_j(t + dt) = π_j(t)[1 - Σ_{i≠j} a_ji dt] + Σ_{i≠j} π_i(t) a_ij dt      j = 1, 2, ..., N      (8.1)

There are two mutually exclusive ways in which the system can
occupy the state j at t + dt. First, it could have been in state j at time
t and made no transitions during the interval dt. These events have
probability π_j(t) and 1 - Σ_{i≠j} a_ji dt, respectively, because we have said
that the probability of multiple transitions is of order higher than dt
and is negligible, and because the probability of making no transition
in dt is 1 minus the probability of making a transition in dt to some
state i ≠ j. The second way the system could be in state j at t + dt
is to have been at state i ≠ j at time t and to have made a transition
from i to state j during the time dt. These events have probabilities
π_i(t) and a_ij dt, respectively. The probabilities must be multiplied
and added over all i that are not equal to j because the system could
have entered j from any other state i. Thus we see how Eq. 8.1 was
obtained.
Let us define the diagonal elements of the A matrix by

a_jj = -Σ_{i≠j} a_ji      (8.2)

If Eq. 8.2 is used in Eq. 8.1, we have

π_j(t + dt) = π_j(t)[1 + a_jj dt] + Σ_{i≠j} π_i(t) a_ij dt

or

π_j(t + dt) - π_j(t) = Σ_{i=1}^{N} π_i(t) a_ij dt

Upon dividing both sides of this equation by dt and taking the limit
as dt → 0, we have

(d/dt) π_j(t) = Σ_{i=1}^{N} π_i(t) a_ij      j = 1, 2, ..., N      (8.3)

Equations 8.3 are a set of N linear constant-coefficient differential


equations that relate the state probabilities to the transition-rate
matrix A. The initial conditions π_j(0) for j = 1, 2, ..., N must be
specified if a solution is to be obtained.
We see that the transition-rate matrix A for continuous-time proc-
esses plays the same central role that the transition-probability matrix
P played for discrete-time processes. However, we now have a set of
differential equations (Eqs. 8.3) rather than a set of difference equations
(Eqs. 1.2). In matrix form we may write Eqs. 8.3 as

(d/dt) π(t) = π(t) A      (8.4)

where π(t) is the vector of state probabilities at time t. The matrix
A is of itself interesting. The off-diagonal elements of A are given by
the transition rates of the process. The diagonal elements of A are
given by Eq. 8.2. As a result the rows of A sum to zero, or

Σ_{j=1}^{N} a_ij = 0
As mentioned earlier, a matrix whose rows sum to zero is called a
differential matrix. As we shall see, the differential matrix A is a
very close relative of the stochastic matrix P.
In the following section we shall discuss the use of Laplace transforms
in the solution of continuous-time Markov processes described by Eq.
8.4. Weshall find that our knowledge of discrete-time Markov processes
will be most helpful in our new work.

The Solution of Continuous-Time Markov Processes by Laplace Trans-


formation
The Laplace transform of a time function f(t) which is zero for t < 0
is defined by

f(s) = ∫_0^∞ f(t) e^{-st} dt      (8.5)

The Laplace transform exists for any such time function that does
not grow faster than exponentially. Consider, for example, the
function f(t) = e^{-at} for t ≥ 0 and f(t) = 0 for t < 0. Using Eq. 8.5, we
find

f(s) = ∫_0^∞ e^{-at} e^{-st} dt = -[1/(s + a)] e^{-(s+a)t} |_0^∞ = 1/(s + a)

Table 8.1 shows some typical time functions and their corresponding
Laplace transforms derived using Eq. 8.5.
The properties of Laplace transforms are widely known and are
thoroughly discussed in such references as Gardner and Barnes.2 The
Laplace transform of a time function is unique; there is a one-to-one
correspondence between the time function and its Laplace trans-
formation. These transforms are particularly suited to the analysis
of systems that can be described by linear constant-coefficient differ-
ential equations.
Table 8.1. LAPLACE TRANSFORM PAIRS

Time Function for t ≥ 0              Laplace Transform

f(t)                                 f(s)
f1(t) + f2(t)                        f1(s) + f2(s)
k f(t)  (k is a constant)            k f(s)
(d/dt) f(t)                          s f(s) - f(0)
e^{-at}                              1/(s + a)
1 (unit step)                        1/s
t e^{-at}                            1/(s + a)²
t (unit ramp)                        1/s²
e^{-at} f(t)                         f(s + a)

The continuous-time Markov process is described by Eq. 8.4, so we
should expect Laplace transformations to be useful in the solution of
such a process. Let us designate by Π(s) the Laplace transform of
the state-probability vector π(t). The Laplace transform of any
matrix of time functions is the matrix of the Laplace transforms of the
individual time functions. If we take the Laplace transform of Eq. 8.4,
we obtain

sΠ(s) - π(0) = Π(s) A

or

Π(s)(sI - A) = π(0)

where I is the identity matrix. Finally, we have

Π(s) = π(0)(sI - A)^{-1}      (8.6)

The Laplace transform of the state-probability vector is thus the
initial-state-probability vector postmultiplied by the inverse of the
matrix (sI - A). The matrix (sI - A)^{-1} is the continuous-process
counterpart of the matrix (I - zP)^{-1}. We shall find that it has proper-
ties analogous to those of (I - zP)^{-1} and that it constitutes a complete
solution of the continuous-time Markov process.
By inspection, we see that the solution of Eq. 8.4 is

π(t) = π(0) e^{At}      (8.7)

where the matrix function e^{At} is to be interpreted as the exponential
series

I + tA + (t²/2!)A² + (t³/3!)A³ + ...

which will converge to e^{At}. For discrete processes, Eqs. 1.4 yielded

π(n) = π(0) P^n      n = 1, 2, 3, ...      (1.4)
Suppose that we wish to find the matrix A for the continuous-time
process that will have the same state probabilities as the discrete
process described by P at the times t = 0, 1, 2, ..., where a unit of
time is defined as the time for one transition of the discrete process.
Then, by comparison of Eqs. 8.7 and 1.4 when t = 1, we see that

e^A = P

or

A = ln P      (8.8)
Recall the toymaker's initial policy, for which the transition-proba-
bility matrix was

P = [0.5  0.5 ; 0.4  0.6]

Suppose that we should like to find the continuous process that will
have the same state probabilities at the end of each week for an
arbitrary starting position. Then we would have to solve Eq. 8.8
to find the matrix A. Methods for accomplishing this exist,4 and if we
apply them to the toymaker's P we find

A = [(ln 10)/9] [-5   5 ; 4   -4]

Since the constant factor (ln 10)/9 is annoying from the point of view of
calculation, we may as well solve a problem that is analogous to the
toymaker's problem but that is not encumbered by the constants
necessary for complete correspondence in the sense just described. We
shall let A be simply

A = [-5   5 ; 4   -4]      (8.9)
Since we are abandoning complete correspondence, we may as well
treat ourselves to a change in problem interpretations at the same time.
We shall call this new problem "the foreman's dilemma." A machine-
shop foreman has a cantankerous machine that may be either working
(state 1) or not working (state 2). If it is working, there is a probability
5 dt that it will break down in a short interval dt; if it is not working,
there is a probability 4 dt that it will be repaired in dt. We thus
obtain the transition-rate matrix (Eq. 8.9). The assumptions regarding
breakdown and repair are equivalent to saying that the operating time
between breakdowns is exponentially distributed with mean 1/5, while
the time required for repair is exponentially distributed with mean 1/4.
If we take 1 hour as our time unit, we expect a breakdown to occur
after 12 minutes of operation, and we expect a repair to take 15 minutes.
The standard deviation of operating and repair times is also equal to
12 and 15, respectively.
For the foreman's problem we would like to find, for example, the
probability that the machine will be operating at time t if it is operating
when t = 0. To answer such a question, we must apply the analysis
of Eq. 8.6 to the matrix (Eq. 8.9). We find

(sI - A) = [s + 5   -5 ; -4   s + 4]

(sI - A)^{-1} = [ (s + 4)/(s(s + 9))   5/(s(s + 9)) ; 4/(s(s + 9))   (s + 5)/(s(s + 9)) ]

Partial-fraction expansion permits

(sI - A)^{-1} = (1/s) [4/9   5/9 ; 4/9   5/9] + [1/(s + 9)] [5/9   -5/9 ; -4/9   4/9]

Let the matrix H(t) be the inverse transform of (sI - A)^{-1}. Then
Eq. 8.6 becomes by means of inverse transformation

π(t) = π(0) H(t)      (8.10)

By comparing Eqs. 8.7 and 8.10, we see that H(t) is a closed-form
expression for e^{At}.
For the foreman example,

H(t) = [4/9   5/9 ; 4/9   5/9] + e^{-9t} [5/9   -5/9 ; -4/9   4/9]

The state-probability vector π(t) may be obtained by postmultiplying
the initial-state-probability vector π(0) by the matrix H(t). If the
machine is operating at t = 0, so that π(0) = [1   0], then π(t) =
[4/9   5/9] + e^{-9t}[5/9   -5/9], or π1(t) = 4/9 + (5/9)e^{-9t}, π2(t) = 5/9 - (5/9)e^{-9t}. Both
π1(t) and π2(t) have a constant term plus an exponentially decaying
term. The constant term represents the limiting state probability as t
becomes very large. Thus the probability that the machine is operating,
π1(t), falls exponentially from 1 to 4/9 as t increases. The time constant
for this exponential decay is 1/9.
Similarly, if the machine is not working at t = 0, π(0) = [0   1],
and π(t) = [4/9   5/9] + e^{-9t}[-4/9   4/9], so that π1(t) = 4/9 - (4/9)e^{-9t}, π2(t) = 5/9 +
(4/9)e^{-9t}. Note that the probability that the machine is working rises
exponentially from 0 to its steady-state value of 4/9 as t becomes large.
The limiting state probabilities of the process are 4/9 and 5/9 for states
1 and 2, respectively. They are independent of the state of the system
at t = 0.
The similarity between the discrete-time and continuous-time Markov
processes is now apparent. Both have limiting state probabilities and
transient components of probability. The transients in the discrete
case were geometric; in the continuous case they are exponential. The
matrix (sI - A)^{-1} will always have one term of the form 1/s times a
stochastic matrix S. This is true because s is a factor of the determinant
of (sI - A); a differential matrix always has one characteristic value
that equals zero. The stochastic matrix S is the matrix of limiting-
state-probability vectors, as it was in the discrete case. The ith row
of S is the limiting-state-probability vector of the process if it is started
in the ith state. The remarks concerning recurrent chains still apply
in the continuous-time process.
The remaining terms of (sI - A)^{-1} represent transient components
of the form e^{-at}, t e^{-at}, and so on, that vanish for large t. The matrices
multiplying these components are themselves differential matrices.
We may call T(s) the transient part of (sI - A)^{-1} and write

(sI - A)^{-1} = (1/s) S + T(s)      (8.11)

or

H(t) = S + T(t)      (8.12)

where S is the stochastic matrix of limiting state probabilities and
T(t) contains the transient components of probability. For the fore-
man's problem

S = [4/9   5/9 ; 4/9   5/9]      T(t) = e^{-9t} [5/9   -5/9 ; -4/9   4/9]

The rows of S are identical because the process is completely ergodic.

It is not necessary to find (sI - A)^{-1} if only the limiting state proba-
bilities are required. Suppose that the process is completely ergodic.
Since the limiting state probabilities are constants, we know that
dπ(t)/dt = 0 for large t. If we denote the limiting-state-probability
vector by π, then Eq. 8.4 becomes

0 = πA      (8.13)

This set of simultaneous equations plus the requirement

Σ_{i=1}^{N} π_i = 1      (8.14)

is sufficient to determine the limiting state probabilities. For the
matrix A (Eq. 8.9) we have from Eq. 8.13

-5π1 + 4π2 = 0      5π1 - 4π2 = 0

and from Eq. 8.14

π1 + π2 = 1

The solution of these equations is π1 = 4/9, π2 = 5/9, in accordance
with our earlier results.
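These results are easy to confirm numerically; the following minimal sketch (assuming NumPy and SciPy) evaluates π(t) = π(0)e^{At} for the foreman's matrix and then recovers the limiting probabilities from Eqs. 8.13 and 8.14:

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[-5.0, 5.0],
              [4.0, -4.0]])
pi0 = np.array([1.0, 0.0])          # machine working at t = 0

for t in (0.0, 0.1, 0.5, 1.0):
    print(t, pi0 @ expm(A * t))     # approaches (4/9, 5/9) = (0.444..., 0.555...)

# limiting probabilities from 0 = pi A together with pi1 + pi2 = 1
M = np.vstack([A.T, np.ones(2)])
pi = np.linalg.lstsq(M, np.array([0.0, 0.0, 1.0]), rcond=None)[0]
print(pi)                           # [0.444... 0.555...]
```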

The Continuous-Time Markov Process with Rewards

Just as the notion of continuous time made us think in terms of


transition rates rather than transition probabilities, so we must redefine
the concept of reward. Let us suppose that the system earns a reward
at the rate of r_ii dollars per unit time during all the time that it occupies
state i. Suppose further that when the system makes a transition from
state i to state j (i ≠ j) it receives a reward of r_ij dollars. (Note that
r_ii and r_ij have different dimensions.) It is not necessary that the
system earn according to both reward rates and transition rewards, but
these definitions give us generality.
We are interested in the expected earnings of the system if it operates
for a time t with a given initial condition. If we let v_i(t) be the expected
total reward that the system will earn in a time t if it starts in state i,
then we can relate the total expected reward in a time t + dt, v_i(t + dt),
to v_i(t) by Eq. 8.15. Here dt represents, as before, a very short time
interval:

v_i(t + dt) = (1 - Σ_{j≠i} a_ij dt)[r_ii dt + v_i(t)] + Σ_{j≠i} a_ij dt [r_ij + v_j(t)]      (8.15)

Equation 8.15 may be interpreted as follows. During the time
interval dt the system may remain in state i or make a transition to
some other state j. If it remains in state i for a time dt, it will earn a
reward r_ii dt plus the expected reward that it will earn in the remaining
t units of time, v_i(t). The probability that it remains in state i for a
time dt is 1 minus the probability that it makes a transition in dt, or
1 - Σ_{j≠i} a_ij dt. On the other hand, the system may make a transition
to some state j ≠ i during the time interval dt with probability a_ij dt.
In this case the system would receive the reward r_ij plus the expected
reward to be made if it starts in state j with time t remaining, v_j(t). The
product of probability and reward must then be summed over all
states j ≠ i to obtain the total contribution to the expected values.
Using Eq. 8.2, we may write Eq. 8.15 as

    v_i(t + dt) = (1 + a_ii dt)[r_ii dt + v_i(t)] + Σ_{j≠i} a_ij dt [r_ij + v_j(t)]

or

    v_i(t + dt) = r_ii dt + v_i(t) + a_ii v_i(t) dt + Σ_{j≠i} a_ij r_ij dt + Σ_{j≠i} a_ij v_j(t) dt

where terms of higher order than dt have been neglected. Finally, if
we subtract v_i(t) from both sides of the equation and divide by dt, we
have

    [v_i(t + dt) − v_i(t)]/dt = r_ii + Σ_{j≠i} a_ij r_ij + Σ_{j=1}^N a_ij v_j(t)

If we take the limit as dt → 0, we obtain

    (d/dt) v_i(t) = r_ii + Σ_{j≠i} a_ij r_ij + Σ_{j=1}^N a_ij v_j(t),    i = 1, 2, ..., N

We now have a set of N linear constant-coefficient differential


equations that completely define v_i(t) when the v_i(0) are known. Let
us define a quantity q_i as the "earning rate" of the system where

    q_i = r_ii + Σ_{j≠i} a_ij r_ij    (8.16)

In the foreman's problem, for example, the machine might have
earning rates q_1 = 6, q_2 = −3. These earning rates could be composed
of many different combinations of reward rates and transition rewards.
Thus, if the reward rate in state 1 is $6 per unit time, the reward rate
in state 2 is −$3 per unit time, and there is no reward associated with
transitions, then r_11 = 6, r_22 = −3, r_12 = r_21 = 0, and we obtain the
earning rates just mentioned. In a later section we consider the q_i
to be obtained partly from transition rewards, but for the moment it
makes no difference.
With use of the definition of earning rate, our equations become
    (d/dt) v_i(t) = q_i + Σ_{j=1}^N a_ij v_j(t),    i = 1, 2, ..., N    (8.17)
Equations 8.17 are a set of linear, constant-coefficient differential
equations that relate the total reward in time t from a start in state i
to the quantities q_i and a_ij. If v(t) is designated as the column vector
with elements v_i(t), the total expected rewards, and if q is designated
as the earning-rate vector with components q_i, then Eqs. 8.17 can be
written in matrix form as

    (d/dt) v(t) = q + A v(t)    (8.18)


To obtain a solution to Eq. 8.18, we must of course specify v(0).
Since Eq. 8.18 is a linear constant-coefficient differential equation,
the Laplace transform should provide a useful method of solution.
If the Laplace transform of Eq. 8.18 is taken according to Table 8.1,
we have

    s v(s) − v(0) = (1/s) q + A v(s)

or

    (sI − A) v(s) = (1/s) q + v(0)

and finally

    v(s) = (1/s)(sI − A)^{-1} q + (sI − A)^{-1} v(0)    (8.19)

Thus we find that Eq. 8.19 relates v(s), the Laplace transform of
v(t), to (sI − A)^{-1}, the earning-rate vector q, and the termination-
reward vector v(0). The reward vector v(t) may be found
by inverse transformation of Eq. 8.19.
Let us apply the result (Eq. 8.19) to the foreman’s problem. The
transition-rate matrix and reward vector are

    A = [−5   5]        q = [ 6]
        [ 4  −4]            [−3]

We shall assume that the machine will be thrown away at t = 0, so that
v1(0) = v2(0) = 0. We found earlier that for this problem

    (sI − A)^{-1} = [ (s + 4)/(s(s + 9))        5/(s(s + 9)) ]
                    [       4/(s(s + 9))  (s + 5)/(s(s + 9)) ]
To use Eq. 8.19, we must find (1/s)(sI − A)^{-1}. This is

    (1/s)(sI − A)^{-1} = [ (s + 4)/(s²(s + 9))        5/(s²(s + 9)) ]
                         [       4/(s²(s + 9))  (s + 5)/(s²(s + 9)) ]
Using partial-fraction expansion, we obtain

    (1/s)(sI − A)^{-1} = (1/s²) [4/9  5/9]  +  (1/s) [ 5/81  −5/81]  +  [1/(s + 9)] [−5/81   5/81]
                                [4/9  5/9]           [−4/81   4/81]                [ 4/81  −4/81]

Thus, since v(s) = (1/s)(sI − A)^{-1} q, by inverse transformation

    v(t) = t [4/9  5/9] [ 6]  +  [ 5/81  −5/81] [ 6]  +  e^{-9t} [−5/81   5/81] [ 6]
             [4/9  5/9] [−3]     [−4/81   4/81] [−3]             [ 4/81  −4/81] [−3]

or

    v(t) = t [1]  +  [ 5/9]  +  e^{-9t} [−5/9]
             [1]     [−4/9]             [ 4/9]

The total expected reward in time t if the system is started in state 1
is thus

    v1(t) = t + 5/9 − (5/9)e^{-9t}

and if started in state 2 is

    v2(t) = t − 4/9 + (4/9)e^{-9t}
Note that regardless of the starting state the machine will earn, on the
average, $1 per unit time when t is large because the coefficient of t
in both v1(t) and v2(t) is 1. The average reward per unit time for a
system is called the gain of the system by analogy with the discrete-
time case. As before, the gain will depend upon the starting state if the
system is not completely ergodic. We also see that for large t, v1(t)
and v2(t) may be written in the form v_i(t) = g_i t + v_i; in the above case,
v1 = 5/9, v2 = −4/9. Let us prove that this relation holds for a general
continuous-time Markov process.
Equation 8.19 is

    v(s) = (1/s)(sI − A)^{-1} q + (sI − A)^{-1} v(0)    (8.19)


We know from Eq. 8.11 that

    (sI − A)^{-1} = (1/s) S + T(s)    (8.11)

where S is the matrix of limiting state probabilities and T(s) consists of
transforms of purely transient components. If Eq. 8.11 is substituted
into Eq. 8.19, we have

    v(s) = [(1/s²) S + (1/s) T(s)] q + [(1/s) S + T(s)] v(0)

or

    v(s) = (1/s²) S q + (1/s) T(s) q + (1/s) S v(0) + T(s) v(0)    (8.20)


We shall investigate the behavior of v(t) for large t by determining the
behavior of each component of Eq. 8.20. The term (1/s²)Sq represents
a ramp of magnitude Sq. The second term (1/s)T(s)q refers to both
step and exponential transient components. The transient components
vanish for large t; the step component has magnitude T(0)q. The
term (1/s)Sv(0) represents a step of magnitude Sv(0); the term T(s)v(0)
refers to transient components that vanish for large t.
Thus when t is large, v(t) has the form

    v(t) = tSq + T(0)q + Sv(0)    (8.21)

If a vector g of state gains g_i is defined by

    g = Sq    (8.22)

and if a vector v with components v_i is defined by

    v = T(0)q + Sv(0)    (8.23)

then Eq. 8.21 becomes

    v(t) = tg + v    for large t    (8.24)

or

    v_i(t) = t g_i + v_i    for large t    (8.25)
We see that the total expected reward in time t for a continuous-time
system started in state i has the same form as the corresponding
quantity in the discrete-time case (Eq. 2.15) except that n has been
replaced by t. For the foreman's problem,

    (sI − A)^{-1} = (1/s) [4/9  5/9]  +  [1/(s + 9)] [ 5/9  −5/9]
                          [4/9  5/9]                 [−4/9   4/9]

so that

    S = [4/9  5/9]        T(0) = [ 5/81  −5/81]
        [4/9  5/9]               [−4/81   4/81]

In addition,

    q = [ 6]
        [−3]

From Eq. 8.22, we have

    g = Sq = [1]
             [1]

and from Eq. 8.23, since v(0) = 0, we have

    v = T(0)q = [ 5/9]
                [−4/9]

Therefore by Eq. 8.25, it follows that for large t we may write v1(t)
and v2(t) in the form

    v1(t) = t + 5/9        v2(t) = t − 4/9

These expressions agree with those found previously.
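As a further numerical check, Eq. 8.18 can be integrated directly and the result compared with the asymptote tg + v. The sketch below is only an illustration (Python with numpy); the crude Euler step used here is an assumption of the sketch, chosen small compared with the 1/9 time constant of the process.

    import numpy as np

    A = np.array([[-5.0, 5.0],
                  [ 4.0, -4.0]])      # transition-rate matrix (foreman's problem)
    q = np.array([6.0, -3.0])         # earning-rate vector
    v = np.zeros(2)                   # v(0) = 0: machine thrown away at t = 0

    dt, T = 1e-4, 5.0                 # Euler step and total time
    for _ in range(int(T / dt)):
        v = v + dt * (q + A @ v)      # dv/dt = q + A v   (Eq. 8.18)

    print(v)                              # approximately [5.556, 4.556]
    print(T + 5.0 / 9.0, T - 4.0 / 9.0)   # asymptote: [t + 5/9, t - 4/9]

The transient e^{-9t} terms are already negligible at t = 5, so the integrated rewards agree with the asymptotic form to within the integration error.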


We have now completed our analysis of the continuous-time Markov
process with given earning rates in each state. The reader should
compare the results of the foreman’s problem analyzed in this section
with those found for the analogous toymaker’s problem in order to
understand clearly the similarities and differences of discrete and
continuous-time Markov processes. We shall now turn to a study of the
continuous-time decision problem.

The Continuous-Time Decision Problem

Suppose that our machine shop foreman has to decide upon a main-
tenance and repair policy for the machinery. When the system is in
state 1, or working, the foreman must decide what kind of maintenance
he will use. Let us suppose that if he uses normal maintenance pro-
cedures the facility will earn $6 per unit time and will have a probability
5 dt of breaking down in a short time dt. Note that this is equivalent
to saying that the length of operating intervals of the machine is
exponentially distributed with mean 1/5.
The foreman also has the option of a more expensive maintenance
procedure that will reduce earnings to $4 per unit time but will also
reduce the probability of a breakdown in dt to 2 dt. Under neither of
these maintenance schemes is there a cost associated with the break-
down per se. If we number the two alternatives in state 1 as 1 and 2,
respectively, then we have for the first alternative

    a_12^1 = 5        r_11^1 = 6        r_12^1 = 0

and for the second alternative

    a_12^2 = 2        r_11^2 = 4        r_12^2 = 0

Finally, we obtain by using Eq. 8.16 that

    q_1^1 = 6    and    q_1^2 = 4
Now we must consider what can happen when the machinery is not
working and the system occupies state 2. Let us suppose that the
foreman also has two alternatives in this state. First, he may have the
repair work done by his own men. For this alternative the repair will
cost $1 per unit time that the men are working, plus $0.50 fixed charge
per breakdown, and there is a probability 4 dt that the machine will
be repaired in a short time dt (repair time is exponential with mean 1/4).
The parameters of this alternative are thus

    a_21^1 = 4        r_22^1 = −1        r_21^1 = −0.5

and using Eq. 8.16, we have

    q_2^1 = −1 + (4)(−0.5) = −3
The second alternative for the supervisor when the machine is not
working is to use an outside repair firm. For this alternative the fixed
charge per breakdown is the same, $0.50. However, these men will
cost $1.50 per unit time, but will increase the probability of a repair in
dt to 7 dt. Thus, for this alternative

    a_21^2 = 7        r_22^2 = −1.5        r_21^2 = −0.5

and

    q_2^2 = −1.5 + (7)(−0.5) = −5

The foreman must decide which alternative to use in each state in
order to maximize his profits in the long run. The data for the problem
are summarized in Table 8.2.

Table 8.2. THE FOREMAN'S DILEMMA

    State i                      Alternative k                 Transition Rates         Earning Rate
                                                               a_i1^k      a_i2^k       q_i^k
    1 (Facility operating)       1 (Normal maintenance)          −5           5             6
                                 2 (Expensive maintenance)       −2           2             4
    2 (Facility out of order)    1 (Inside repair)                4          −4            −3
                                 2 (Outside repair)               7          −7            −5

The concepts of alternative, decision, and policy carry over from the
discrete situation. Since each of the four possible policies contained
in Table 8.2 represents a completely ergodic process, each has a unique
gain that is independent of the starting state of the system. The
foreman would like to find the policy that has highest gain; this is the
optimal policy.
One way to find the optimal policy is to find the gain for each of the
four policies and see which gain is largest. Although this is feasible

for small problems, it is not feasible for problems that have many
states and many alternatives in each state.
Note also that the value-iteration method available for discrete-time
processes is no longer practical in the continuous-time case. It is not
possible to use simple recursive relations that will lead ultimately to the
optimal policy because we are now dealing with differential rather than
difference equations.
A policy-iteration method has been developed for the solution of the
long-duration continuous-time decision problem. It is in all major
respects completely analogous to the procedure used in discrete-time
processes. As before, the heart of the procedure is an iteration cycle
composed of a value-determination operation and a policy-improvement
routine. We shall now discuss each section in detail.

The Value-Determination Operation


For a given policy the total expected reward of the system in time
t is governed by Eqs. 8.17

    (d/dt) v_i(t) = q_i + Σ_{j=1}^N a_ij v_j(t),    i = 1, 2, ..., N    (8.17)

Since we are concerned only with processes whose termination is remote,
we may use the asymptotic expression (Eq. 8.25) for v_i(t)

    v_i(t) = t g_i + v_i    for large t    (8.25)

and transform Eqs. 8.17 into

    g_i = q_i + Σ_{j=1}^N a_ij (t g_j + v_j)

or

    g_i = q_i + t Σ_{j=1}^N a_ij g_j + Σ_{j=1}^N a_ij v_j,    i = 1, 2, ..., N    (8.26)

If Eqs. 8.26 are to hold for all large t, then we obtain the two sets of
linear algebraic equations

    Σ_{j=1}^N a_ij g_j = 0,    i = 1, 2, ..., N    (8.27)

    g_i = q_i + Σ_{j=1}^N a_ij v_j,    i = 1, 2, ..., N    (8.28)

Equations 8.27 and 8.28 are analogous to Eqs. 6.3 and 6.4 for the
discrete-time process. Solution of Eqs. 8.27 expresses the gain of
each state in terms of the gains of the recurrent chains in the process.
The relative value of one state in each chain is set equal to zero, and
Eqs. 8.28 are used to solve for the remaining relative values and the
gains of the recurrent chains.

The Policy-Improvement Routine

Suppose that we have a policy that is optimal when t units of time
remain, and that this policy has expected total rewards v_i(t). If we are
considering what policy to follow if more time than t is available, we
see from Eqs. 8.17 that we may maximize our rate of increase of v_i(t) by
maximizing

    q_i^k + Σ_{j=1}^N a_ij^k v_j(t)    (8.29)

with respect to the alternatives k in state i. If t is large, we may use
v_j(t) = t g_j + v_j to obtain

    q_i^k + Σ_{j=1}^N a_ij^k (t g_j + v_j)

or

    q_i^k + Σ_{j=1}^N a_ij^k v_j + t Σ_{j=1}^N a_ij^k g_j    (8.30)

as the quantity to be maximized in the ith state. For large t, Expression
8.30 is maximized by the alternative that maximizes

    Σ_{j=1}^N a_ij^k g_j    (8.31)

the gain test quantity, using the gains of the old policy. However,
when all alternatives produce the same value of Expression 8.31 or
when a group of alternatives produces the same maximum value, then
the tie is broken by the alternative that maximizes

    q_i^k + Σ_{j=1}^N a_ij^k v_j    (8.32)

the value test quantity, using the relative values of the old policy. The
relative values may be used for the value test because a constant differ-
ence will not affect decisions within a chain.
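A sketch of this two-stage test in code (Python with numpy; the data layout, the function name improve_policy, and the tolerance are choices made for this illustration, not part of the text):

    import numpy as np

    def improve_policy(a, qr, g, v, old_d, tol=1e-9):
        """One policy-improvement pass using the gain test of Expression 8.31,
        with ties broken by the value test of Expression 8.32.

        a[i][k]  : row of transition rates a_ij^k for alternative k in state i
        qr[i][k] : earning rate q_i^k
        g, v     : gains and relative values of the previous policy
        old_d    : previous decisions, kept whenever they remain as good
        """
        new_d = []
        for i in range(len(a)):
            ks = range(len(a[i]))
            gain_test = [float(np.dot(a[i][k], g)) for k in ks]
            best_gain = max(gain_test)
            tied = [k for k in ks if gain_test[k] >= best_gain - tol]
            # break ties with the value test quantity q_i^k + sum_j a_ij^k v_j
            value_test = [float(qr[i][k] + np.dot(a[i][k], v)) for k in ks]
            best_value = max(value_test[k] for k in tied)
            winners = [k for k in tied if value_test[k] >= best_value - tol]
            # leave the old decision unchanged if it is among the winners
            new_d.append(old_d[i] if old_d[i] in winners else winners[0])
        return new_d

    # Example call with the foreman's data and the values g = [1, 1], v = [1, 0]
    # found for the initial policy in a later section (alternatives indexed from 0):
    a  = [[[-5.0, 5.0], [-2.0, 2.0]], [[4.0, -4.0], [7.0, -7.0]]]
    qr = [[6.0, 4.0], [-3.0, -5.0]]
    print(improve_policy(a, qr, np.array([1.0, 1.0]), np.array([1.0, 0.0]), [0, 0]))
    # -> [1, 1]: both states switch to their second alternative

In a completely ergodic problem every alternative passes the gain test with the same value, so the decision is always made by the value test, as the simplified procedure of the next section exploits.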
The general iteration cycle is shown in Fig. 8.1. It corresponds
completely with Fig. 6.1 for the discrete-time case and has a completely
analogous proof. The rules for starting and stopping the process are
unchanged.
Policy Evaluation
Use a_ij and q_i for a given policy to solve the double set of
equations

    Σ_{j=1}^N a_ij g_j = 0,    i = 1, 2, ..., N

    g_i = q_i + Σ_{j=1}^N a_ij v_j,    i = 1, 2, ..., N

for all v_i and g_i, by setting the value of one v_i in each recurrent
chain to zero.

Policy Improvement
For each state i, determine the alternative k that maximizes

    Σ_{j=1}^N a_ij^k g_j

using the gains g_j of the previous policy, and make it the new
decision in the ith state.
If

    Σ_{j=1}^N a_ij^k g_j

is the same for all alternatives, or if several alternatives are
equally good according to this test, the decision must be made
on the basis of relative values rather than gains. Therefore,
if the gain test fails, break the tie by determining the alter-
native k that maximizes

    q_i^k + Σ_{j=1}^N a_ij^k v_j

using the relative values of the previous policy, and by making
it the new decision in the ith state.
Regardless of whether the policy-improvement test is based
on gains or values, if the old decision in the ith state yields as
high a value of the test quantity as any other alternative,
leave the old decision unchanged. This rule assures conver-
gence in the case of equivalent policies.
When this procedure has been repeated for all states, a
new policy has been determined, and new [a_ij] and [q_i] ma-
trices have been obtained. If the new policy is the same as
the previous one, the iteration process has converged, and the
best policy has been found; otherwise, enter the upper box.

Fig. 8.1. General iteration cycle for continuous-time decision processes.


Completely Ergodic Processes
If, as is usually the case, all possible policies of the problem are
completely ergodic, the computational process may be considerably
simplified. Since all states of each Markov process have the same gain
g, the value-determination operation involves only the solution of the
equations

    g = q_i + Σ_{j=1}^N a_ij v_j,    i = 1, 2, ..., N    (8.33)

with v_N set equal to zero. The solution for g and the remaining v_i is
then used to find an improved policy. Multiplication of Eqs. 8.33 by
the limiting state probability π_i and summation over i show that

    g = Σ_{i=1}^N π_i q_i

a result previously obtained.


The policy-improvement routine becomes simply: For each state i,
find the alternative k that maximizes

    q_i^k + Σ_{j=1}^N a_ij^k v_j

using the relative values of the previous policy. This alternative
becomes the new decision in the ith state. A new policy has been found
when this procedure has been performed for every state.
The iteration cycle for completely ergodic continuous-time systems
is shown in Fig. 8.2. It is completely analogous to that shown in Fig.
4.2 for discrete-time processes. Note that, if the iteration is started
in the policy-improvement routine with all v_j = 0, the initial policy
selected is the one that maximizes the earning rate of each state.
This policy is analogous to the policy of maximizing expected immediate
reward for discrete-time processes.
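The whole cycle for the completely ergodic case fits in a few lines. The following sketch is written for this exposition and is not the book's program (Python with numpy); the data arrays follow Table 8.2, with alternatives indexed from 0, and the iteration starts, as suggested above, from the policy that maximizes the earning rate.

    import numpy as np

    # A_alt[i][k]: transition-rate row for alternative k in state i (Table 8.2)
    # q_alt[i][k]: earning rate for that alternative
    A_alt = [[np.array([-5.0, 5.0]), np.array([-2.0, 2.0])],
             [np.array([4.0, -4.0]), np.array([7.0, -7.0])]]
    q_alt = [[6.0, 4.0], [-3.0, -5.0]]
    N = 2

    def evaluate(d):
        """Solve g = q_i + sum_j a_ij v_j (Eq. 8.33) with v_N = 0."""
        M = np.zeros((N, N))            # unknowns: v_1, ..., v_{N-1}, g
        q = np.zeros(N)
        for i in range(N):
            a = A_alt[i][d[i]]
            M[i, :N-1] = -a[:N-1]       # coefficients of the relative values
            M[i, N-1] = 1.0             # coefficient of g
            q[i] = q_alt[i][d[i]]
        x = np.linalg.solve(M, q)
        return np.append(x[:N-1], 0.0), x[N-1]     # (relative values, gain)

    d = [0, 0]                          # normal maintenance, inside repair
    while True:
        v, g = evaluate(d)
        # value test of the policy-improvement routine (ties, which do not
        # occur in this example, are not handled specially here)
        new_d = [max(range(len(A_alt[i])),
                     key=lambda k: q_alt[i][k] + A_alt[i][k] @ v)
                 for i in range(N)]
        if new_d == d:
            break
        d = new_d
    print(d, g, v)    # [1, 1] (alternative 2 in each state), gain 2, v = [1, 0]

Running the sketch reproduces the hand computation of the next section.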
The proof of the properties of the iteration cycle for the continuous-
time case is very close to the proof for discrete time. We shall illustrate
this remark by the proof of policy improvement for the iteration cycle of
Fig. 8.2.
Consider two policies, A and B. The policy-improvement routine
has produced policy B as a successor to policy A. Therefore we know

    q_i^B + Σ_{j=1}^N a_ij^B v_j^A ≥ q_i^A + Σ_{j=1}^N a_ij^A v_j^A

or

    γ_i = q_i^B + Σ_{j=1}^N a_ij^B v_j^A − q_i^A − Σ_{j=1}^N a_ij^A v_j^A    (8.34)
Value-Determination Operation
Use a_ij and q_i for a given policy to solve the set of equations

    g = q_i + Σ_{j=1}^N a_ij v_j,    i = 1, 2, ..., N

for all relative values v_i and g by setting v_N to zero.

Policy-Improvement Routine
For each state i, find the alternative k' that maximizes

    q_i^{k'} + Σ_{j=1}^N a_ij^{k'} v_j

using the relative values v_j of the previous policy. Then k'
becomes the new decision in the ith state, q_i^{k'} becomes q_i, and
a_ij^{k'} becomes a_ij.

Fig. 8.2. Iteration cycle for completely ergodic continuous-time decision
processes.

and γ_i ≥ 0. From the equations of the value-determination operation
we know

    g^B = q_i^B + Σ_{j=1}^N a_ij^B v_j^B    (8.35)

    g^A = q_i^A + Σ_{j=1}^N a_ij^A v_j^A    (8.36)

If Eq. 8.36 is subtracted from Eq. 8.35 and if Eq. 8.34 is used to
eliminate q_i^B − q_i^A, we obtain

    g^B − g^A = γ_i + Σ_{j=1}^N a_ij^B (v_j^B − v_j^A)    (8.37)

Let g^Δ = g^B − g^A and v_j^Δ = v_j^B − v_j^A. Then Eq. 8.37 becomes

    g^Δ = γ_i + Σ_{j=1}^N a_ij^B v_j^Δ,    i = 1, 2, ..., N    (8.38)

Equations 8.38 are the equations of the value-determination operation
written in terms of differences rather than absolute values. We know
the solution is

    g^Δ = Σ_{j=1}^N π_j^B γ_j    (8.39)

where π_j^B is the limiting state probability of state j under policy B.
Since all π_j^B ≥ 0, and all γ_j ≥ 0, therefore g^Δ ≥ 0. In particular,
g^B will be greater than g^A if an improvement in the test quantity

    q_i^k + Σ_{j=1}^N a_ij^k v_j

can be made in any state i that is recurrent under policy B.
The proof that the iteration cycle must converge on the optimal policy
is the same as that given in Chapter 4 for the discrete case.

The Foreman’s Dilemma

Let us solve the foreman’s problem shown in Table 8.2. Which


maintenance service and which repair service will provide greatest
earnings per unit time? Since all policies in the system are completely
ergodic, the simplified procedure of Fig. 8.2 can be used. Let us choose
as our initial policy the one that maximizes the earning rate for each
state. This is the policy consisting of normal maintenance and inside
repair. For this policy
    d = [1]        A = [−5   5]        q = [ 6]
        [1]            [ 4  −4]            [−3]
The value-determination equations (Eqs. 8.33) are

    g = 6 − 5v1 + 5v2        g = −3 + 4v1 − 4v2

The solution of these equations with v2 = 0 is

    g = 1        v1 = 1        v2 = 0
To find a policy with higher gain, we perform the policy-improvement
routine as shown in Table 8.3.

Table 8.3. POLICY IMPROVEMENT FOR FOREMAN'S DILEMMA

    State i    Alternative k    Test Quantity q_i^k + Σ_{j=1}^N a_ij^k v_j
    1          1                6 − 5(1) + 5(0) = 1
               2                4 − 2(1) + 2(0) = 2 ←
    2          1                −3 + 4(1) − 4(0) = 1
               2                −5 + 7(1) − 7(0) = 2 ←

The second alternative in each state is selected as a better policy. It


has been found that the policy of using expensive maintenance and
outside repair is more profitable than that of using normal services.
We evaluate this policy

using Eqs. 8.33. We have

g = 4 — 201 + 202 g= —5 + 71 — 7ve

The solution of these equations with ve = 0 is


p= 2 y= 1 ve = 0

Note that the gain is larger than it was before.


We must now enter the policy-improvement routine to see if we can
find a still better policy. However, since the values have coincidentally
not been changed, the policy-improvement routine would yield once
more the policy d = [2, 2]. Because this policy has been achieved twice
in succession, it must be the optimal policy. Hence, the foreman
should use more expensive maintenance and outside repair; in this way
he will increase his profits from $1 to $2 per hour, on the average.
Note that, since v1 − v2 = 1, the foreman should be willing to pay as
much as $1 for an instantaneous repair. The reader may investigate
the policies d = [1, 2] and d = [2, 1] to assure himself that they do have
lower earnings per hour than the optimal policy.
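For a problem of this size the claim is easy to verify by enumeration, using the relation g = Σ_i π_i q_i derived earlier. The brief sketch below (Python with numpy) is only an illustration; the indexing of alternatives from 0 is an artifact of the code.

    import numpy as np
    from itertools import product

    A_alt = [[np.array([-5.0, 5.0]), np.array([-2.0, 2.0])],
             [np.array([4.0, -4.0]), np.array([7.0, -7.0])]]
    q_alt = [[6.0, 4.0], [-3.0, -5.0]]

    for d in product(range(2), repeat=2):          # the four possible policies
        A = np.array([A_alt[i][d[i]] for i in range(2)])
        q = np.array([q_alt[i][d[i]] for i in range(2)])
        coeff = A.T.copy(); coeff[-1, :] = 1.0     # pi A = 0 with sum pi = 1
        pi = np.linalg.solve(coeff, [0.0, 1.0])
        print(d, pi @ q)                           # gain of this policy

    # gains: (0,0) -> 1, (0,1) -> 17/12, (1,0) -> 5/3, (1,1) -> 2;
    # the policy (1,1), i.e. alternative 2 in both states, is indeed best.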

Computational Considerations

We have seen that the solution of the continuous-time decision process


involves about the same amount of computation as the solution of the
corresponding discrete process. As a matter of fact, the two types of
processes are computationally equivalent, so that the same computer
program may be used for the solution of both. To see this, let us write
the value-determination equations (Eqs. 6.3 and 6.4) for the discrete
process

    g_i = Σ_{j=1}^N p_ij g_j,    i = 1, 2, ..., N    (6.3)

    g_i + v_i = q_i + Σ_{j=1}^N p_ij v_j,    i = 1, 2, ..., N    (6.4)

These equations may be written as

    Σ_{j=1}^N (p_ij − δ_ij) g_j = 0

    g_i = q_i + Σ_{j=1}^N (p_ij − δ_ij) v_j

where δ_ij is the Kronecker delta; δ_ij = 1 if i = j and 0 if i ≠ j. If we
now let a_ij = p_ij − δ_ij, we have

    Σ_{j=1}^N a_ij g_j = 0

    g_i = q_i + Σ_{j=1}^N a_ij v_j

These are the value-determination equations (Eqs. 8.27 and 8.28) for
the continuous-time decision process. Thus if we have a program for
the solution of Eqs. 6.3 and 6.4 for the discrete process, we may use
it for the solution of the continuous process described by the matrix
A by transforming the transition rates to "pseudo" transition proba-
bilities according to the relation p_ij = a_ij + δ_ij.*
As far as the policy-improvement routine is concerned, in the discrete
case we maximize either
    Σ_{j=1}^N p_ij^k g_j    or    q_i^k + Σ_{j=1}^N p_ij^k v_j

with respect to all alternatives k in state i.
Our decisions would be unchanged if we instead maximized

    Σ_{j=1}^N (p_ij^k − δ_ij) g_j    or    q_i^k + Σ_{j=1}^N (p_ij^k − δ_ij) v_j

in state i, since only terms dependent upon k affect decisions. In terms
of a_ij^k = p_ij^k − δ_ij, the quantities to be maximized are

    Σ_{j=1}^N a_ij^k g_j    and    q_i^k + Σ_{j=1}^N a_ij^k v_j

However, these are the test quantities for the policy-improvement
routine of the continuous process. As a result, a policy-improvement
routine programmed for the discrete process may be used for the
continuous process if the transformation p_ij^k = a_ij^k + δ_ij is performed.

* If the computer program assumes that 0 ≤ p_ij ≤ 1, it will be necessary to
scale the a_ij so that −1 ≤ a_ij ≤ 0.

The discrete and continuous decision processes are thus computation-


ally equivalent. The same computer program may be used for the
solution of both processes by a simple data transformation.
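One way to carry out the transformation, including the scaling mentioned in the footnote, is sketched below (Python with numpy). The choice of scale factor is an assumption of this illustration; any constant at least as large as the largest exit rate will do.

    import numpy as np

    def to_pseudo_probabilities(A, q):
        """Convert (A, q) of a continuous-time process into data a discrete-time
        policy-iteration program can digest: P = A/c + I, with c chosen so that
        every entry of P lies between 0 and 1.  Dividing A and q by c measures
        time in units of c: the optimal policy and the relative values are
        unchanged, and the resulting gain must be multiplied by c afterwards."""
        c = max(1.0, float(np.max(-np.diag(A))))   # assumed scale: largest exit rate
        P = A / c + np.eye(A.shape[0])
        return P, q / c, c

    A = np.array([[-5.0, 5.0], [4.0, -4.0]])
    q = np.array([6.0, -3.0])
    P, q_scaled, c = to_pseudo_probabilities(A, q)
    print(P)              # a legitimate stochastic matrix: [[0, 1], [0.8, 0.2]]
    print(q_scaled, c)    # scaled rewards, and the factor c to undo at the end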

The Continuous-Time Decision Process with Discounting


In Chapter 7 we studied the discrete sequential process with dis-
counting or with an indefinite duration. We may analyze continuous-
time decision processes with similar elements by use of an analogous
approach. Let us define a discount rate 0 < α < ∞ in such a way that
a unit quantity of money received after a very short time interval dt
is now worth 1 − α dt. This definition corresponds to continuous com-
pounding at the rate α. An alternate interpretation of α that allows
the process an indefinite duration is that there is a probability α dt
that the process will terminate in the interval dt.
If v_i(t) is the total expected earnings of the system in time t, then by
analogy with Eq. 8.15 we have

    v_i(t + dt) = (1 − α dt){(1 − Σ_{j≠i} a_ij dt)[r_ii dt + v_i(t)] + Σ_{j≠i} a_ij dt [r_ij + v_j(t)]}    (8.40)

In this equation we assume that rewards are paid at the end of the
interval dt and that the process receives no reward from termination.
Using the definition given by Eq. 8.2, we may rewrite Eq. 8.40 as

    v_i(t + dt) = (1 − α dt){(1 + a_ii dt)[r_ii dt + v_i(t)] + Σ_{j≠i} a_ij dt [r_ij + v_j(t)]}

or

    v_i(t + dt) = (1 − α dt)[(r_ii + Σ_{j≠i} a_ij r_ij) dt + v_i(t) + Σ_{j=1}^N a_ij dt v_j(t)]

and

    v_i(t + dt) = (r_ii + Σ_{j≠i} a_ij r_ij) dt + v_i(t) + Σ_{j=1}^N a_ij dt v_j(t) − α dt v_i(t)

where terms of higher order than dt have been neglected.
Introduction of the earning rate from Eq. 8.16 and rearrangement
yield

    v_i(t + dt) − v_i(t) + α dt v_i(t) = q_i dt + Σ_{j=1}^N a_ij dt v_j(t)
If this equation is divided by dt and the limit taken as dt approaches
zero, we have

    dv_i(t)/dt + α v_i(t) = q_i + Σ_{j=1}^N a_ij v_j(t),    i = 1, 2, ..., N    (8.41)

Equations 8.41 are analogous to Eqs. 8.17 and reduce to them if
α = 0. In vector form, Eqs. 8.41 become

    dv(t)/dt + α v(t) = q + A v(t)    (8.42)
Since Eq. 8.42 is a linear constant-coefficient differential equation,
we should expect a Laplace transformation to be useful. If the
transform of Eq. 8.42 is taken, we obtain

    s v(s) − v(0) + α v(s) = (1/s) q + A v(s)

or

    [(s + α)I − A] v(s) = (1/s) q + v(0)

and finally

    v(s) = (1/s)[(s + α)I − A]^{-1} q + [(s + α)I − A]^{-1} v(0)    (8.43)
We might use Eq. 8.43 and inverse transformation to find v(t) for a
given process. As usual, however, we are interested in processes of
long duration, so that only the asymptotic form of v(t) for large t
interests us. Let us recall from Eq. 8.11 that

    (sI − A)^{-1} = (1/s) S + T(s)    (8.11)

where S is the matrix of limiting state probabilities and T(s) is a matrix
consisting of only transient components. It follows that

    [(s + α)I − A]^{-1} = [1/(s + α)] S + T(s + α)    (8.44)

so that [(s + α)I − A]^{-1} has all transient components. If Eq. 8.44 is
used in Eq. 8.43, we have

    v(s) = {[1/(s(s + α))] S + (1/s) T(s + α)} q + {[1/(s + α)] S + T(s + α)} v(0)    (8.45)

We now wish to know which components of v(t) will be nonzero for
large t. The matrix multiplying q contains a step component of magni-
tude [(1/α)S + T(α)]; all other terms of Eq. 8.45 represent transient
components of v(t). Therefore, if we define a vector v of present
values v_i so that

    v = lim_{t→∞} v(t)

we have

    v = [(1/α) S + T(α)] q

or

    v = (αI − A)^{-1} q    (8.46)

using Eq. 8.11.


The vector v represents the discounted future earnings in a very long
time if the system is started in each state. Equation 8.46 shows how
these present values are related to the discount rate α, the transition-
rate matrix A, and the earning-rate vector q. Equation 8.46 may also
be written in the form

    α v_i = q_i + Σ_{j=1}^N a_ij v_j,    i = 1, 2, ..., N    (8.47)

We may solve Eqs. 8.47 to find the present values of any continuous-
time decision process with discounting.
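Equation 8.46 is a single linear solve. A sketch follows (Python with numpy, added for illustration); the discount rate 1/9 and the policy data anticipate the example treated later in this chapter.

    import numpy as np

    def present_values(alpha, A, q):
        """Solve (alpha I - A) v = q, i.e. Eqs. 8.46-8.47."""
        return np.linalg.solve(alpha * np.eye(A.shape[0]) - A, q)

    alpha = 1.0 / 9.0
    A = np.array([[-5.0, 5.0], [4.0, -4.0]])   # normal maintenance, inside repair
    q = np.array([6.0, -3.0])
    print(present_values(alpha, A, q))         # [9.549, 8.561] = [783/82, 702/82]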

Policy Improvement

We are interested not only in evaluating a given policy but also in


finding the policy that has highest present values in all states. We
should like to be able to solve a problem such as that posed by Table
8.2 when discounting is an important element. Equations 8.47
constitute a value-determination operation; we still require a policy-
improvement routine.
If we desired to maximize the rate of growth of v_i(t) at time t in
Eq. 8.41, we should maximize

    q_i^k + Σ_{j=1}^N a_ij^k v_j(t) − α v_i(t)

with respect to all the alternatives k in the ith state. If we are interested
only in large t, we may use the asymptotic present value v_j rather than
v_j(t) to obtain the test quantity

    q_i^k + Σ_{j=1}^N a_ij^k v_j − α v_i

However, since v_i does not depend upon k, the expression

    q_i^k + Σ_{j=1}^N a_ij^k v_j

is a sufficient test quantity to be maximized with respect to all alterna-
tives k in state i.
The policy-improvement routine is thus: For each state i, find the
alternative k that maximizes

    q_i^k + Σ_{j=1}^N a_ij^k v_j

using the present values of the previous policy. This alternative
becomes the new decision in the ith state. When the procedure has
been repeated for all states, a new policy has been determined. This
new policy must have present values that are greater than those of the
previous policy unless the two policies are identical. In the latter
case the optimal policy has been found.
The value-determination operation and the policy-improvement
routine are shown in the iteration cycle of Fig. 8.3. The rules for
entering and leaving the cycle are the same as those given for earlier
cases. We shall now prove the properties of the cycle, following the
lines of the proof for the discrete case given in Chapter 7.

Value-Determination Operation
Use a_ij and q_i for a given policy to solve the set of equations

    α v_i = q_i + Σ_{j=1}^N a_ij v_j,    i = 1, 2, ..., N

for all present values v_i.

Policy-Improvement Routine
For each state i, find the alternative k' that maximizes

    q_i^{k'} + Σ_{j=1}^N a_ij^{k'} v_j

using the present values v_j from the previous policy. Then k'
becomes the new decision in the ith state, q_i^{k'} becomes q_i, and
a_ij^{k'} becomes a_ij.

Fig. 8.3. Iteration cycle for continuous-time decision processes with discounting.
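Assembled into code, the cycle of Fig. 8.3 might look as follows. This is an illustrative sketch (Python with numpy), not the book's program; the data are those of Table 8.2, the discount rate is the 1/9 of the example that follows, and ties, which do not occur here, are not treated specially.

    import numpy as np

    A_alt = [[np.array([-5.0, 5.0]), np.array([-2.0, 2.0])],
             [np.array([4.0, -4.0]), np.array([7.0, -7.0])]]
    q_alt = [[6.0, 4.0], [-3.0, -5.0]]
    alpha, N = 1.0 / 9.0, 2

    d = [0, 0]                        # initial policy: maximize earning rate
    while True:
        A = np.array([A_alt[i][d[i]] for i in range(N)])
        q = np.array([q_alt[i][d[i]] for i in range(N)])
        v = np.linalg.solve(alpha * np.eye(N) - A, q)      # Eqs. 8.47
        new_d = [max(range(len(A_alt[i])),
                     key=lambda k: q_alt[i][k] + A_alt[i][k] @ v)
                 for i in range(N)]
        if new_d == d:
            break
        d = new_d
    print(d, v)   # alternative 2 in both states; v = [1494/82, 1413/82]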
Suppose that the iteration cycle produces a policy B as a successor
to policy A. Since B followed A, we know that

    q_i^B + Σ_{j=1}^N a_ij^B v_j^A ≥ q_i^A + Σ_{j=1}^N a_ij^A v_j^A    in every state i.

Equivalently,

    γ_i = q_i^B + Σ_{j=1}^N a_ij^B v_j^A − q_i^A − Σ_{j=1}^N a_ij^A v_j^A ≥ 0    for all i

where γ_i is the improvement in the test quantity that the policy-
improvement routine was able to achieve in the ith state. For the
individual policies the value-determination operation yields

    α v_i^A = q_i^A + Σ_{j=1}^N a_ij^A v_j^A

    α v_i^B = q_i^B + Σ_{j=1}^N a_ij^B v_j^B
If the first equation is subtracted from the second and the relation
for γ_i is used to eliminate q_i^B − q_i^A, we obtain

    α(v_i^B − v_i^A) = γ_i + Σ_{j=1}^N a_ij^B (v_j^B − v_j^A)

or

    α v_i^Δ = γ_i + Σ_{j=1}^N a_ij^B v_j^Δ
where v_j^Δ = v_j^B − v_j^A. These equations are the same as our value-
determination equations except that they are written in terms of
differences in present values. In vector form their solution is

    v^Δ = (αI − A)^{-1} γ

where γ is the vector with components γ_i. All elements of (αI − A)^{-1}
are nonnegative, as were those of (I − βP)^{-1} in the discrete case, again
on either physical or mathematical grounds. If any γ_i > 0, at least
one v_i^Δ must be greater than zero and no v_i^Δ can be less than zero. The
policy-improvement routine must increase the present values of at
least one state and can decrease the present value of no state.
Similarly, no policy B that has some higher present values than
policy A can remain undiscovered because of convergence on A.
This is true because in such a case all γ_i would be ≤ 0, while at least one
v_i^Δ would be > 0; this situation would contradict the relation derived
above. When the iteration cycle has converged on a policy, that policy
has higher present values than any other nonequivalent policy.
An Example
Let us use our results to solve the sequential decision problem pre-
sented in Table 8.2 with α = 1/9. We may interpret this to mean
that the duration of the foreman's operation is exponentially distrib-
uted with mean 9 hours, or we may think of some investment situation
in which the interest rate is important. As is the custom, we shall
choose as our initial policy the one that maximizes earning rate; that is,

    d = [1]
        [1]

The value-determination equations (Eqs. 8.47) are

    (1/9)v1 = 6 − 5v1 + 5v2        (1/9)v2 = −3 + 4v1 − 4v2

Their solution is

    v1 = 783/82        v2 = 702/82

Proceeding to find a better policy, we employ the policy-improvement


routine as shown in Table 8.4.

Table 8.4. FIRST POLICY IMPROVEMENT FOR FOREMAN'S DILEMMA
WITH DISCOUNTING

    State i    Alternative k    Test Quantity q_i^k + Σ_{j=1}^N a_ij^k v_j
    1          1                6 − 5(783/82) + 5(702/82) = 87/82
               2                4 − 2(783/82) + 2(702/82) = 166/82 ←
    2          1                −3 + 4(783/82) − 4(702/82) = 78/82
               2                −5 + 7(783/82) − 7(702/82) = 157/82 ←

The second alternative in each state constitutes a better policy, so that

    d = [2]
        [2]

The value-determination equations (Eqs. 8.47) are

    (1/9)v1 = 4 − 2v1 + 2v2        (1/9)v2 = −5 + 7v1 − 7v2

Their solution is

    v1 = 1494/82        v2 = 1413/82
Note that the present values have once more increased. The policy-
improvement routine is entered again, with results shown in Table 8.5.

Table 8.5. SECOND POLICY IMPROVEMENT FOR FOREMAN'S DILEMMA
WITH DISCOUNTING

    State i    Alternative k    Test Quantity q_i^k + Σ_{j=1}^N a_ij^k v_j
    1          1                87/82
               2                166/82 ←
    2          1                78/82
               2                157/82 ←

Since v1 − v2 has remained unchanged, the values for the test
quantities are the same as those in Table 8.4. The policy d = [2, 2]
has been found twice in succession. Therefore, it is the optimal policy;
it has higher present values in all states than any other policy. Even
when the expected duration of the process is only 9 hours, the foreman
should use expensive maintenance and outside repair.
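The fractions above can be confirmed with exact rational arithmetic. The short sketch below uses Python's fractions module and the explicit inverse of a 2 × 2 matrix; it is only an illustration of the computation.

    from fractions import Fraction as F

    def pv(alpha, a, q):
        """Present values for a two-state process: solve (alpha*I - A) v = q exactly."""
        m11, m12 = alpha - a[0][0], -a[0][1]
        m21, m22 = -a[1][0], alpha - a[1][1]
        det = m11 * m22 - m12 * m21
        v1 = (m22 * q[0] - m12 * q[1]) / det
        v2 = (-m21 * q[0] + m11 * q[1]) / det
        return v1, v2

    alpha = F(1, 9)
    # first policy:  v1 = 783/82, v2 = 702/82 (printed in lowest terms)
    print(pv(alpha, [[-5, 5], [4, -4]], [6, -3]))
    # second policy: v1 = 1494/82 (= 747/41), v2 = 1413/82
    print(pv(alpha, [[-2, 2], [7, -7]], [4, -5]))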

Comparison with Discrete-Time Case


In the discrete sequential decision process with discounting the value-
determination equations are
    v_i = q_i + β Σ_{j=1}^N p_ij v_j,    i = 1, 2, ..., N    (7.9)
If we have developed a computer program for this operation, we might
be interested in knowing whether such a program would be useful in
the continuous case. For the continuous case the analogous equations
are
    α v_i = q_i' + Σ_{j=1}^N a_ij v_j,    i = 1, 2, ..., N    (8.48)
where q_i' has been used to distinguish the continuous from the discrete
case. We may define a_ij = p_ij − δ_ij and write Eq. 8.48 as

    α v_i = q_i' + Σ_{j=1}^N (p_ij − δ_ij) v_j

or

    (1 + α) v_i = q_i' + Σ_{j=1}^N p_ij v_j

and

    v_i = [1/(1 + α)] q_i' + [1/(1 + α)] Σ_{j=1}^N p_ij v_j

If we define β = 1/(1 + α) and q_i = [1/(1 + α)] q_i', then we have

    v_i = q_i + β Σ_{j=1}^N p_ij v_j

a set of equations of the same form as those for the discrete case. Thus
if we have a continuous problem described by α, q', and A, we may use
the program for the discrete problem described by β, q, and P by making
the transformations

    β = 1/(1 + α)        q = β q'        P = A + I

In the policy-improvement routine for the discrete case, the test


quantity is
    q_i^k + β Σ_{j=1}^N p_ij^k v_j

For the continuous case it is

    q_i'^k + Σ_{j=1}^N a_ij^k v_j

This quantity may be rewritten as


    q_i'^k + Σ_{j=1}^N (p_ij^k − δ_ij) v_j

where a_ij^k = p_ij^k − δ_ij. We now have an expression equivalent to

    q_i'^k + Σ_{j=1}^N p_ij^k v_j

since v_i does not depend on k. If q_i'^k = (1/β) q_i^k, where β = 1/(1 + α),
then we have

    (1/β) q_i^k + Σ_{j=1}^N p_ij^k v_j

and this, of course, is proportional to

    q_i^k + β Σ_{j=1}^N p_ij^k v_j
which is the test quantity for the discrete case. Thus the same trans-
formation that allowed us to use a program for the discrete case in the
solution of the value-determination operation allows us to use a
program for the policy-improvement routine that is based upon the
discrete process.
We see that by suitable transformations a single program suffices for
both the discrete and continuous cases with discounting. Since we
showed the same relation earlier for cases without discounting, it is
clear that the continuous-time decision process, with or without dis-
counting, is computationally equivalent to its discrete counterpart.
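The corresponding data transformation is again a short piece of code. The sketch below (Python with numpy; the function name is illustrative) maps a discounted continuous-time problem onto the discrete form and checks that the two give the same present values.

    import numpy as np

    def continuous_to_discrete(alpha, A, q_prime):
        """Map the discounted continuous-time data (alpha, q', A) onto the
        discrete-time form (beta, q, P) discussed above:
        beta = 1/(1 + alpha), q = beta*q', P = A + I."""
        beta = 1.0 / (1.0 + alpha)
        return beta, beta * q_prime, A + np.eye(A.shape[0])

    beta, q, P = continuous_to_discrete(1.0 / 9.0,
                                        np.array([[-5.0, 5.0], [4.0, -4.0]]),
                                        np.array([6.0, -3.0]))
    # P = A + I need not have entries in [0, 1] when rates exceed 1;
    # the algebra below still holds, but a program that insists on true
    # probabilities would first require the scaling discussed earlier.
    v = np.linalg.solve(np.eye(2) - beta * P, q)   # v = q + beta P v
    print(v)                                       # [783/82, 702/82] again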
Conclusion

With the discussion of continuous-time processes we have completed


our present investigation of dynamic programming and Markov proc-
esses. We have seen that the analysis of discrete-time and con-
tinuous-time Markov processes is very similar. In the discrete case
the z-transform is a powerful analytic technique, whereas for the con-
tinuous case the Laplace transform assumes this role. In either
situation the pertinent transformation has allowed us to analyze the
special cases of periodicity and multiple chains that so often complicate
other analytic approaches.
Even when a structure of rewards is added to the process, the trans-
formational methods are useful for calculating total expected rewards
as a function of time and for determining the asymptotic forms of the
reward expressions. For a system operating under a fixed policy, a
knowledge of the total expected rewards of the process constitutes a
complete understanding of the system.
The most interesting case arises when there are alternatives available
for the operation of the system. In general, we should like to find
which set of alternatives or policy will yield the maximum total
expected reward. If we are dealing with a discrete system, and if we
wish to maximize the total expected reward over only a few stages of
the process, then a value-iteration approach is indicated. If, however,
we expect the process to have an indefinite duration, the policy-
iteration method is preferable. This method will find the policy that
has a higher average return per transition than any other policy under
consideration. Even in processes with possible multiple-chain behavior,

no serious difficulties arise. The computational scheme involved is


simple, practical, and easily implemented.
If, however, we are interested in maximizing total expected reward
for a continuous-time system, our choice is more limited. The con-
tinuous analogue of the value-iteration approach is so laborious that
practicality forces us to make simplifications. If we are especially
interested in processes of short duration, then the easiest course is to
approximate the continuous-time process by a discrete-time process
and then use value-iteration. If, on the other hand, we are interested
in processes of long duration, the policy-iteration method is just as
applicable as it was in the discrete-time case. Furthermore, the
computational requirements of the two types of processes are so similar
that the same general computer program will suffice for the solution of
both classes of problems. We may conclude that the policy-iteration
method is especially important in the solution of continuous-time
processes because of the lack of practical alternatives.
We have found that the presence of discounting does not change the
basic nature of the decision-making problem. The earlier remarks
comparing value- and policy-iteration methods for discrete- and
continuous-time processes apply with equal weight when discounting is
present. The existence of discounting does have some interesting
features, however. First, for processes of long duration the concept
of gain is replaced by that of present value, and our objective in policy
improvement is to maximize the present values of all states. Second,
the chain structure of the process can be ignored in our computations.
Third, there will exist regions of discount-factor values that have the
same optimal policy. These features, however, change our com-
putational procedure very little. A well-designed computer program
can solve both discrete- and continuous-time processes, with or
without discounting.
Whenever the policy-iteration method is applied, a by-product of the
calculation of the optimal policy is a set of state values that permits
the evaluation of departures from this policy in special circumstances.
In most systems these values are more interesting and useful than their
origin might indicate. It is important to remember in using these
numbers that their validity rests on the assumption that the optimal
policy is being followed almost always.
The examples in baseball strategy, automobile replacement, and so
forth, that have been presented are so simplified that they only whet
the appetite for further applications. The considerations involved in
selecting possible applications are these. First, can the system be
adequately described by a number of states small enough to make the
solution of the corresponding simultaneous equations computationally
feasible? Second, are the data necessary to describe the alternatives
of the system available? If the answers to these questions are affirma-
tive, then a possible application has been discovered. There is every
reason to believe that a possible application when combined with
diligent work will yield a successful application.
Appendix:

The Relationship of Transient


to Recurrent Behavior

In the value-determination operation for a completely ergodic


process, we must solve the following equation for the v; and the g:
    g + v_i = q_i + Σ_{j=1}^N p_ij v_j,    i = 1, 2, ..., N    (A.1)
Rearranging, we have
    v_i − Σ_{j=1}^N p_ij v_j + g = q_i,    i = 1, 2, ..., N

When v_N = 0, as we have assumed, then

    v_i − Σ_{j=1}^{N−1} p_ij v_j + g = q_i,    i = 1, 2, ..., N

If we define a matrix

    M = [m_ij]

then

    M = [1 − p11    −p12       ...    −p1,N−1     1]
        [−p21       1 − p22    ...    −p2,N−1     1]
        [   .           .                 .       .]
        [−pN1       −pN2       ...    −pN,N−1     1]
Note that the matrix M is formed by taking the P matrix, making all
elements negative, adding ones to the main diagonal, and replacing the
last column by ones.
If we also define a vector v̄ where

    v̄_i = v_i,    i = 1, 2, ..., N − 1
    v̄_N = g

then

    v̄ = [  v1   ]
        [  v2   ]
        [   .   ]
        [v_{N−1}]
        [   g   ]
Equation A.1 in the v; and the g can then be written in matrix
form as
    M v̄ = q

or

    v̄ = M^{-1} q    (A.2)

where q is the vector of expected immediate rewards. The matrix
M^{-1} will exist if the system is completely ergodic, as we have assumed.
Thus, by inverting M to obtain M^{-1} and then postmultiplying M^{-1} by q,
v_i for 1 ≤ i ≤ N − 1 and g will be determined.
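A sketch of this construction and solution follows (Python with numpy, added for illustration). The two-state transition matrix and rewards used as a test are the numbers usually quoted for the toymaker example of the earlier chapters and should be treated here as an assumption.

    import numpy as np

    def solve_values_and_gain(P, q):
        """Build M from P as described above and solve M vbar = q (Eq. A.2).
        Returns (v, g) with v_N set to zero."""
        N = P.shape[0]
        M = np.eye(N) - P          # ones on the diagonal, -p_ij elsewhere
        M[:, -1] = 1.0             # replace the last column by ones
        vbar = np.linalg.solve(M, q)
        v = np.append(vbar[:-1], 0.0)
        return v, vbar[-1]

    # assumed toymaker data: P and expected immediate rewards q
    P = np.array([[0.5, 0.5], [0.4, 0.6]])
    q = np.array([6.0, -3.0])
    print(solve_values_and_gain(P, q))     # v = [10, 0], g = 1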
Suppose that state N is a recurrent state and a trapping state, so
that p_Nj = 0 for j ≠ N, and p_NN = 1. Furthermore, let there be no
recurrent states among the remaining N − 1 states of the problem.
We know that

    v̄ = M^{-1} q

where M assumes the special form

    M = [1 − p11       −p12         ...    −p1,N−1            1]
        [−p21          1 − p22      ...    −p2,N−1            1]
        [    .              .                   .             .]
        [−p_{N−1,1}    −p_{N−1,2}   ...    1 − p_{N−1,N−1}    1]
        [    0              0       ...         0             1]

      = [W   f]
        [0   1]

in partitioned form.
where the nature of W and f are evident by comparison with the M
defined above. From the relations for partitioned matrices we have

    M^{-1} = [W^{-1}    −W^{-1} f]
             [  0           1    ]

It is clear that MM^{-1} = M^{-1}M = I as required. The nature of f
shows us that the elements in the first N − 1 rows of the last column
of M^{-1} are each equal to the negative sum of the first N − 1 elements
in each row. Also, from Eq. A.2, g = q_N as expected.
What is the significance of W^{-1} and W^{-1}f? Let us consider the
relations for the number of times the system enters each transient
state before it is absorbed by the recurrent state. Let u_ij equal the
expected number of times that a system started in state i will enter
state j before it enters state N.
The balancing relations for the u_ij are

    u_ij = Σ_{k=1}^{N−1} u_ik p_kj + δ_ij,    i, j ≤ N − 1    (A.3)
Equation A.3 may be developed as follows: The number of times a
state j will be occupied for a given starting state i depends primarily
on its probabilistic relations with other states. For example, if the
system spends an average amount of time u_ik in some state k and if a
fraction p_kj of the times state k is occupied a transition is made to state
j, then the expected number of transitions into state j from state k is
u_ik p_kj. This contribution to u_ij must be summed over all of the N
states with the exception of the trapping state, and so we have the first
term of Eq. A.3. In addition to this mechanism, however, u_ij will be
increased by 1 if j is the state in which the system is started; this
accounts for the δ_ij term of Eq. A.3.
Let us define an N — 1 by N — 1 square matrix U with components
uij. Then if we write Eq. A.3 in the form
    Σ_{k=1}^{N−1} u_ik (δ_kj − p_kj) = δ_ij

we see that we have in matrix form

    U W = I

or

    W^{-1} = U

That is, the matrix W^{-1} is the matrix U of average times spent in each
state for each starting state. Since these quantities must be non-
negative, the elements of W^{-1} are nonnegative.
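This interpretation is easy to confirm numerically. The sketch below is only an illustration (Python with numpy); the three-state chain with one trapping state is invented for the purpose, and the Monte Carlo run is merely a sanity check.

    import numpy as np

    # A made-up chain with transient states 1, 2 and trapping state 3.
    P = np.array([[0.2, 0.5, 0.3],
                  [0.4, 0.1, 0.5],
                  [0.0, 0.0, 1.0]])
    Pt = P[:2, :2]                     # transitions among the transient states
    W = np.eye(2) - Pt
    U = np.linalg.inv(W)               # u_ij: expected visits to j from a start in i
    print(U)

    # Monte Carlo check of the first row of U (start in state 1):
    rng = np.random.default_rng(0)
    visits = np.zeros(2)
    trials = 20000
    for _ in range(trials):
        s = 0
        while s != 2:
            visits[s] += 1
            s = rng.choice(3, p=P[s])
    print(visits / trials)             # close to the first row of U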
The matrix W or U^{-1} has the form of the matrix [^{L+1}I − ^{L+1,L+1}P]
used in Eq. 6.23 of Chapter 6. Here we would interpret the u_ij as the
expected number of times the system will enter one of the transient
states j in the group L + 1 before it enters some recurrent chain if it
is started in state i of the group L + 1. With this definition, the
elements of [^{L+1}I − ^{L+1,L+1}P]^{-1} must all be nonnegative by the same
argument given here.
Using W^{-1} = U, Eq. A.2, and the partitioned form of M^{-1}, we may
write

    v_i = Σ_{j=1}^{N−1} u_ij q_j − q_N Σ_{j=1}^{N−1} u_ij,    i = 1, 2, ..., N − 1

or, since the gain of the recurrent state N is g_N = q_N,

    v_i = Σ_{j=1}^{N−1} u_ij q_j − g_N Σ_{j=1}^{N−1} u_ij    (A.4)

The v_i may now be interpreted in the following way. The v_i
represent the sum of the expected number of times the system will enter
each state j multiplied by the expected immediate reward in that state,
less the total number of times any state other than N will be entered
multiplied by the gain for state N, all given that the system started in
state i.
In particular, if the reward q_N in the recurrent state is zero, and if all
q_i ≥ 0 for 1 ≤ i ≤ N − 1, then

    v_i = Σ_{j=1}^{N−1} u_ij q_j ≥ 0,    1 ≤ i ≤ N − 1

This was exactly the situation encountered in the baseball example of


Chapter 5, where we found that no negative values occurred relative
to the recurrent state.
Suppose that we are investigating various policies for a system that
has only one recurrent state, state N. Suppose further that we have at
some stage found a policy B as a successor to policy A. Equation 4.11
must hold for the changes in gain and values
    g^Δ + v_i^Δ = γ_i + Σ_{j=1}^N p_ij^B v_j^Δ,    i = 1, 2, ..., N    (4.11)

Since for this particular system we know that these equations are
equivalent to Eq. A.4,

    v_i^Δ = Σ_{j=1}^{N−1} u_ij^B γ_j − g^Δ Σ_{j=1}^{N−1} u_ij^B,    i = 1, 2, ..., N − 1
If there has been no change in gain between A and B, then g^Δ = 0, and
we have left the sum of nonnegative terms so that v_i^Δ must be non-
we have left the sum of nonnegative terms so that v;4 must be non-
negative. We thus see that when increases in gain are not possible
the policy-improvement routine will attempt to maximize the values
of the transient states. This is the behavior observed in the baseball
problem, where at first glance it appeared as if we were violating our
ground rules by working with a system in which the gain was zero for all
policies.
If the words "recurrent chain with gain g" are substituted for
"single recurrent state," the preceding development is virtually un-
changed. The policy-improvement routine will not only maximize
the gain of a recurrent chain, it will also maximize the values of transient
states that run into that chain.
References

1. R. Bellman, Dynamic Programming, Princeton University Press, Princeton, N.J., 1957, Chapter XI.
2. M. F. Gardner and J. L. Barnes, Transients in Linear Systems, John Wiley & Sons, New York, 1942.
3. R. W. Sittler, "Systems Analysis of Discrete Markov Processes," IRE Trans. on Circuit Theory, CT-3, No. 1, 257 (1956).
4. Operations Research Center, M.I.T., Notes on Operations Research 1959, Technology Press, Cambridge, 1959, Chapters 3, 5, 7.

General References

E. F. Beckenbach, Ed., Modern Mathematics for the Engineer, McGraw-Hill Book Company, New York, 1956.
R. Bellman, "A Markovian Decision Process," J. Math. and Mech., 6, 679 (1957).
J. L. Doob, Stochastic Processes, John Wiley & Sons, New York, 1953.
G. Elfving, "Zur Theorie der Markoffschen Ketten," Acta Soc. Sci. Fennicae, 2, No. 8, 1937.
W. Feller, An Introduction to Probability Theory and Its Applications, Vol. I, 2nd Ed., John Wiley & Sons, New York, 1957.
B. Friedman, Principles and Techniques of Applied Mathematics, John Wiley & Sons, New York, 1956.
W. H. Huggins, "Signal-Flow Graphs and Random Signals," Proc. I.R.E., 45, 74 (1957).
J. G. Kemeny and J. L. Snell, Finite Markov Chains, D. Van Nostrand Company, Princeton, 1960.
V. I. Romanovskii, Diskretnye Tsepi Markova, State Publishers, Moscow, 1949.
T. A. Sarymsakov, Osnovy Teorii Protsessov Markova, State Publishers, Moscow, 1954.
Index

Alternatives, 26-28, 33, 104, 105
Barnes, J. L., 94
Baseball problem, best long-run policy, 52
    computational requirements, 52
    evaluation of base situations, 53
Bellman, R., 1, 29
Car problem, best long-run policy, 57, 58, 89, 90
    computational requirements, 58, 89
    solution in special situations, 58, 89
Chain, periodic, 15
    recurrent, 13
Computational considerations, 36, 37, 49-52, 112-114, 120-123
Decision, 28, 33, 105
Decision vector, 33
Discount factor, 76
Discount rate, 114
Earning rate, 100
Equivalence of discrete- and continuous-time processes, 96, 112-114, 120-122
Foreman's dilemma, 96-105, 111, 112, 119, 120
Frog example, 3, 17, 18
Gain, changes related to policy changes, 43, 49, 72-75, 111
    of a process, 22, 24, 32, 36, 102
    of a state, 23, 24, 61-63, 103
    in multiple-chain processes, 60-63
Gardner, M. F., 94
Iteration cycle, for continuous-time processes, 108, 110
    for continuous-time processes with discounting, 117
    for discrete-time processes, 38, 64
    for discrete-time processes with discounting, 84
    for multiple-chain processes, 64
Laplace transforms, definition, 94
    of vectors and matrices, 95
    table of, 95
Markov processes, definition, 3, 92
Matrices, differential, 12, 23, 94, 98
    stochastic, 11, 23, 94, 98
Partial-fraction expansion, 10
Policy, 28, 33, 105
    optimal, 28, 33, 61, 81, 105, 116
Policy improvement by a change in chain structure, 68
Present values, 81, 116
    changes related to policy changes, 87, 118
Principle of Optimality, 29
Process, completely ergodic, 6, 13, 24, 32, 60, 98, 109-111
    continuous-time, 3, 92
    discrete-time, 3
    discrete-time related to continuous-time, 96, 98
    finite-state, 3
Relative values, 35, 62, 107
    interpretation of, 35, 41, 112
    sufficiency of, 35, 65
Reward, 17, 76
    expected immediate, 18
    scaling, 36, 37
Reward rate, 99
Stage of a process, 28
State, definition, 3
State probability, 5, 92
    in periodic chains, 16
    limiting, 7, 98, 99
State-probability vector, 5
Taxicab problem, best long-run policy, 48, 87, 88
Total expected reward, for continuous-time processes, 99-104
    for continuous-time processes with discounting, 114-116
    for discrete-time processes, 18
    for discrete-time processes with discounting, 76-79
Total-value vector, 18, 21
Toymaker example, 4-7, 10, 11, 18-21, 39-42, 80, 84-86
Transient state, 12
Transition, 3
Transition probability, 4
Transition-probability matrix, 4
Transition rate, 92
Transition-rate matrix, 93
Transition reward, 99
Trapping state, 12
Value iteration, for discrete-time processes, 28-31
    for discrete-time processes with discounting, 80, 81
    limitations, 30, 31
z-transforms, definition, 7
    examples of, 8
    of vectors and matrices, 9
    table of, 9