AN INTRODUCTION TO
DYNAMIC PROGRAMMING
by
BRIAN GLUSS
BASIC PROPERTIES
The basic characteristic of problems in dynamic programming is
that they are formulated as models involving N-stage decision
processes, where the desired solution is the determination of
'policies' or 'strategies' satisfying given criteria: for example,
minimization of cost, or, in stochastic models, of the statistical
expectation of the cost, or maximization of profit. To
avoid any confusion with insurance terminology, the term
'strategy' will be used throughout this paper. One of the basic
techniques used is to reduce the N-dimensional optimization
problem to N sequential one-dimensional problems, using Bellman's
so-called 'Principle of Optimality':
'An optimal strategy has the property that whatever the initial
state and initial decision are, the remaining decisions must constitute
an optimal strategy with regard to the state resulting from the first
decision'
These one-dimensional problems lead to sequential equations
which are often amenable to mathematical iterative solution, and
even when this is not so, the iterative equations often prove much
simpler to handle on a digital computer for specific solutions than
the original N-dimensional optimization problem, reducing con-
siderably the amount of storage required.
An example of this type of reduction—applied to a problem
independent of time—is as follows.

Suppose it is desired to find the maximum of

$$R(x_1, x_2, \ldots, x_N) = g_1(x_1) + g_2(x_2) + \cdots + g_N(x_N) \qquad (1)$$

subject to the constraints

$$x_1 + x_2 + \cdots + x_N = c, \qquad x_n \geq 0.$$

Define $f_N(c)$ to be the maximum so obtained for an $N$-stage process with total resource $c$. Then

$$f_1(c) = g_1(c), \qquad (2)$$

and, by the Principle of Optimality, for $N = 2, 3, \ldots$

$$f_N(c) = \max_{0 \le x_N \le c} \left[ g_N(x_N) + f_{N-1}(c - x_N) \right]. \qquad (3)$$

Thus $f_2(c)$ is obtained from equation (3) with $N = 2$, using the known function $f_1$; $f_3(c)$ is then obtained in exactly the same way, using equation (3)
with N = 3; and so on.
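By way of illustration only, the recursion (2)-(3) can be carried out numerically by discretizing the resource $c$. The following sketch (the stage-return functions $g_n$ and the grid size are invented, not taken from the paper) computes $f_N(c)$ on a grid and recovers one maximizing allocation.

```python
# A minimal sketch of the stage-by-stage reduction: maximize
# g_1(x_1) + ... + g_N(x_N) subject to x_1 + ... + x_N = c, x_n >= 0,
# using f_N(c) = max_{0 <= x <= c} [g_N(x) + f_{N-1}(c - x)] on a grid.
import math

def solve_allocation(g_funcs, c_total, steps=100):
    """Return the maximal total return and one maximizing allocation."""
    grid = [c_total * i / steps for i in range(steps + 1)]
    # f[j] = best value attainable with resource grid[j] using the stages so far
    f = [g_funcs[0](c) for c in grid]
    choices = []            # choices[n][j] = grid index allotted to stage n+2 at resource grid[j]
    for g in g_funcs[1:]:
        new_f, choice = [], []
        for j in range(steps + 1):
            # try every feasible allotment x = grid[i], i <= j, to the current stage
            best_i, best_val = 0, -math.inf
            for i in range(j + 1):
                val = g(grid[i]) + f[j - i]
                if val > best_val:
                    best_i, best_val = i, val
            new_f.append(best_val)
            choice.append(best_i)
        f, choices = new_f, choices + [choice]
    # trace back one optimal allocation, last stage first
    alloc, j = [], steps
    for choice in reversed(choices):
        i = choice[j]
        alloc.append(grid[i])
        j -= i
    alloc.append(grid[j])   # whatever remains goes to the first stage
    return f[steps], list(reversed(alloc))

# Example use with three invented concave stage returns:
value, allocation = solve_allocation(
    [lambda x: math.sqrt(x), lambda x: math.log(1 + x), lambda x: 2 * math.sqrt(x)],
    c_total=10.0)
```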
Note that this kind of technique of maximization eliminates the
trouble that plagues such well-known techniques as partial differen-
tiation and Lagrangian multipliers; that is, that if the solution
happens to be on a boundary, as so often happens in maximization
problems (cf. for example, linear programming), then the equations
break down. This trouble is never encountered in Bellman's
maximization procedures.
The crux of dynamic programming formulation, then, is to
define functions $f_N(x)$—or, in multi-dimensional cases, $f_N(x_1, \ldots, x_r)$
—so that they may be expressed in terms of $f_i(x)$, $i < N$. The most
important part of the procedure is to choose the 'right' function
so that such equations may be developed and solved, and this to
a great extent is a matter of experience and intuition, which may be
partially acquired by delving through the literature. Reference [1]
is an obvious starting point towards this end, and the bibliography
in it gives ample further references.
The following example comes from reference [1] (example 45,
chapter I) and is chosen because of its similarity to valuation
problems. The notation has been slightly amended.
Example (i)
Suppose that we have a machine whose output in time periods
1, 2, 3, ... is $r_1, r_2, r_3, \ldots$, and whose upkeep costs are $u_1, u_2, u_3, \ldots$. The
purchase price of a new machine is $p$, and its trade-in value at the
end of the $t$th period is $s_t$, with $p > s_0$, where $s_0$ is defined as the trade-in
value at the beginning of the first period. The discounting factor
is $v$.

Let $f_t$ denote the present value of future returns using an optimal strategy
from the end of the $t$th period, where $f_0$ is defined as at the
beginning of the first period but immediately after purchase.
Assume $r_t$ and $u_t$ are to be discounted from the end of the
period.
The recurrence relationships are

$$f_0 = \max_{n} \left[ \sum_{t=1}^{n} v^t (r_t - u_t) + v^n f_n \right],$$

where $n$ is the period at the end of which the machine is first traded in, and

$$f_n = s_n - p + f_0,$$

since trading in yields $s_n$, a new machine costs $p$, and the process then starts afresh. Eliminating $f_n$,

$$f_0 = \max_{n} \left[ \sum_{t=1}^{n} v^t (r_t - u_t) + v^n (s_n - p + f_0) \right],$$

i.e., for the maximizing $n$,

$$f_0 = \frac{\sum_{t=1}^{n} v^t (r_t - u_t) + v^n (s_n - p)}{1 - v^n}.$$
Hence to maximize the present value of future returns, n must be
chosen to maximize this expression.
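A minimal numerical sketch of this calculation: for each candidate trade-in period $n$ the last expression is evaluated and the best $n$ retained. The output, upkeep and trade-in figures below are invented purely for illustration.

```python
# Evaluate f_0 = [sum_{t=1..n} v^t (r_t - u_t) + v^n (s_n - p)] / (1 - v^n)
# for each candidate replacement age n and keep the maximizing n.
def best_replacement_age(r, u, s, p, v):
    """r[t-1], u[t-1], s[t-1] are output, upkeep and trade-in value for period t."""
    best_n, best_f0 = None, float("-inf")
    for n in range(1, len(r) + 1):
        returns = sum(v**t * (r[t - 1] - u[t - 1]) for t in range(1, n + 1))
        f0 = (returns + v**n * (s[n - 1] - p)) / (1 - v**n)
        if f0 > best_f0:
            best_n, best_f0 = n, f0
    return best_n, best_f0

# Invented figures: falling output, rising upkeep, declining trade-in value.
r = [100, 95, 90, 85, 80, 75]
u = [10, 15, 22, 31, 42, 55]
s = [60, 45, 34, 26, 20, 15]
n_star, value = best_replacement_age(r, u, s, p=80, v=0.95)
```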
Example (ii)
Reference [1], chapter I, example 47, asks further if it is uniformly
true that the optimal policy, if given an over-age machine, is to
turn it in immediately for a new one.
Consider the value of the process again. If we are given a machine
aged $T$ and sell it immediately, then the value of the process is

$$s_T - p + f_0,$$

whereas if we run it for one further period and then trade it in, the value is

$$v (r_{T+1} - u_{T+1}) + v (s_{T+1} - p + f_0).$$

Hence, if

$$v (r_{T+1} - u_{T+1}) + v (s_{T+1} - p + f_0) > s_T - p + f_0,$$

it pays to keep the over-age machine for at least one more period, and immediate replacement is not uniformly optimal.
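Assuming the notation and the value $f_0$ computed in the previous sketch, this one-period deferral comparison (as set out above, not a formula quoted from [1]) can be checked numerically as follows.

```python
def keep_one_more_period(T, r, u, s, p, v, f0):
    """True if running a machine aged T one further period beats trading it in at once."""
    sell_now = s[T - 1] - p + f0                       # trade in immediately
    run_then_sell = v * (r[T] - u[T]) + v * (s[T] - p + f0)   # one more period, then trade in
    return run_then_sell > sell_now

# e.g. keep_one_more_period(T=5, r=r, u=u, s=s, p=80, v=0.95, f0=value)
```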
Example (iii)
Reference [1], chapter I, example 48, asks how one would
formulate the problem to take into account technological improve-
ment in machines and operating procedure.
Here the inclusion in the recurrence relationship of a penalty for
outdating would be reasonable, as a loss is incurred due to the
inefficiency of the old machine relative to a new type. Hence the
equation would read something like

$$f_0 = \sum_{t=1}^{n} v^t (r_t - u_t - d_t) + v^n f_n$$

and

$$f_n = s_n - p + f_0,$$

where $d_t$ is the penalty in period $t$ for the inefficiency of the ageing
machine relative to the current new type, all of which can be tabulated. Then

$$\text{value} = f_0 = \frac{\sum_{t=1}^{n} v^t (r_t - u_t - d_t) + v^n (s_n - p)}{1 - v^n},$$

where, as before, $n$ is the trade-in period. It would be required to find the value of $n$ which maximized it.
Example (iv)
Suppose an insurance company has a certain number of life
insurance policies and must decide how large its liquid reserves
should be in order to handle all claims. If the reserves are too
large, the company is losing money by failing to invest the surfeit;
if they are too small, additional costs are incurred in meeting the
excess claims, for example, by being forced to sell securities under
unfavourable circumstances. What, then, is the optimal reserve
level to adopt?
Let it be assumed that if there are reserves $x$ on hand at time $t$, we
wish to determine what additional reserves $y - x$ should be put
aside in the next time-period $(t, t+1)$ for payment of claims, given
that the probability distribution for the amount of claims in the
time-period is $\varphi(s)\,ds$, that it costs $k(y - x)$ to make these additional
reserves available (for example, in administrative costs and lost
investment dividends), that a cost $p(z)$ is incurred if there is an
excess $z$ of claims over reserves, and that there is a discount ratio $v$,
$0 < v < 1$, per unit time.

Then, if $f(x)$ = expected total discounted cost of the process
starting with initial reserves $x$ and using an optimal policy,

$$f(x) = \min_{y \ge x} \left[ k(y - x) + \int_y^{\infty} p(s - y)\,\varphi(s)\,ds + v f(0) \int_y^{\infty} \varphi(s)\,ds + v \int_0^y f(y - s)\,\varphi(s)\,ds \right]. \qquad (8)$$
The first term is the cost of adding reserves $y - x$; the second is
the integral of the probability that $s > y$ and a cost $p(s - y)$ is
incurred; the third term is the cost of the process from the next
time-period given that reserves have been completely exhausted
(i.e. $s > y$); and the fourth term is the integral of the probability
that $s < y$ and we are left at time $t+1$ with reserves $y - s$, from which
point the cost of the process is $f(y - s)$.
This is precisely equation (7), p. 156, of [1], and the reader is
referred to it for a discussion of the solution. An exact solution
is given for k(z) = kz and p(z) = pz; further cases are considered
there and also in references [2] and [3]. The optimization criterion
used was that of minimizing f(x). It turns out, incidentally, that
the exact solutions in some of the cases just referred to are constant
reserve level solutions. That is, if at time t our reserves are x, we
order reserves y — x, where y is independent of x and depends only
upon the parameters of the system. This would appear to be
actuarially convenient.
For example, for the particular claim-amount distribution $\varphi(s)\,ds$ treated there,
with $k(z) = kz$ and $p(z) = pz$, it is given on p. 160 that if $vp > k$, and some other
simple practical conditions are also satisfied, then the solution is of the constant
reserve level type: reserves are always brought up to a level $\bar{x}$, where $\bar{x}$
is the unique root of an equation in the parameters of the distribution and the costs
$k$, $p$ and $v$.
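As a rough numerical illustration of equation (8), the following sketch applies straightforward value iteration on a reserve grid. The exponential claim density, the linear costs and every parameter value are assumptions chosen for the illustration; the exact treatment is the one referred to above in [1], pp. 156-160.

```python
# Value iteration for f(x) = min_{y >= x} [k(y-x) + shortfall + ruin + carry-over],
# assuming phi(s) = a*exp(-a*s), k(z) = k*z and p(z) = p*z (all figures invented).
import math

a, k, p, v = 0.5, 1.0, 4.0, 0.9          # claim rate, reserve cost, shortfall cost, discount
x_max, n = 20.0, 100                      # reserve grid 0, h, 2h, ..., x_max
h = x_max / n
phi = [a * math.exp(-a * i * h) for i in range(n + 1)]   # claim density on the grid

f = [0.0] * (n + 1)
for _ in range(100):                      # v < 1 makes the recursion a contraction
    # G[y] collects every term of equation (8) except the -k*x part:
    #   k*y + int_y^inf p(s-y) phi(s) ds + v f(0) P(s > y) + v int_0^y f(y-s) phi(s) ds
    G = []
    for y in range(n + 1):
        shortfall = sum(p * (i - y) * h * phi[i] * h for i in range(y + 1, n + 1))
        ruin = v * f[0] * math.exp(-a * y * h)            # exponential tail probability
        carry = v * sum(f[y - i] * phi[i] * h for i in range(y + 1))
        G.append(k * y * h + shortfall + ruin + carry)
    # f(x) = min over y >= x of [G[y] - k*x], computed with a running suffix minimum
    suffix_min, new_f = math.inf, [0.0] * (n + 1)
    for x in range(n, -1, -1):
        suffix_min = min(suffix_min, G[x])
        new_f[x] = suffix_min - k * x * h
    f = new_f

# Order-up-to level for each starting reserve x (based on the last iteration's G);
# for small x it is (approximately) one constant level, as described in the text.
policy = [min(range(x, n + 1), key=lambda y: G[y]) * h for x in range(n + 1)]
```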
Consider now a search problem in which an object is hidden in one of $N$ cells,
the probability of its being in cell $r$ being $p_r$; also, let there be corresponding
cost parameters $t_1, \ldots, t_N$ for examining the respective cells. Let it
be assumed further that if a cell is searched and the object is in
a neighbouring cell it will move one cell further away, unless it is
in either of the cells $1$, $N$, when it will remain where it is. Now,
in the process in which no movement is permitted, it has been
determined that the strategy that minimizes the statistical ex-
pectation of the total cost of the search is to examine in order of
increasing values of $t_r/p_r$. This is an intuitively reasonable solution,
since the procedure involves examining cells with low cost and high
probability first.
Hence let us use this strategy as a first approximation to our
more complicated process, and simulate the process a large number
of times, noting what the costs of the search are. For successive
approximations, consider the subset of possible strategies defined
by minimizing $t_r/(p_r)^{k_r}$, where the $k_r$ are initially set equal to
unity, and then varied incrementally (and independently). For
example, searching near the boundaries will have a different effect
from searching elsewhere, and we may therefore decide to see
what happens when we vary the $k_r$ for $r$ near $1$ and $N$ first. Hence
the whole problem of determining the strategy in this subset may
be (computer) programmed from beginning to end to simulate
the process, change the $k_r$ incrementally, observe when these
changes produce positive and negative effects (i.e. decrease and
increase the cost), and increment and decrement accordingly until
the suboptimal strategy is determined. Another subset of strategies
could comprise criteria of the form $k_r t_r/p_r$, for example. An
interesting research problem for the reader, incidentally, would be to
attempt to obtain the mathematically exact optimal strategy, which
has not as yet been found. More sophisticated methods exist of
applying the concept of successive approximation using Monte
Carlo simulation on a computer; the example above and the
strategy subsets considered were chosen to attempt to explain the
concept as simply as possible.
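The simulation-plus-successive-approximation scheme just outlined might be sketched as follows. The cell probabilities, search costs and tuning schedule are all invented for illustration; the movement rule is the one assumed in the text, and a cap on the number of sweeps merely keeps each simulated search finite.

```python
import random

# Invented data: prior probability p_r of the object being in cell r, and cost t_r
# of examining cell r.
p = [0.05, 0.10, 0.20, 0.30, 0.20, 0.10, 0.05]
t = [1.0, 1.5, 2.0, 1.0, 2.0, 1.5, 1.0]
N = len(p)

def simulate_search(k, rng, max_sweeps=100):
    """Cost of one search when cells are examined in order of increasing t_r/(p_r)^k_r."""
    pos = rng.choices(range(N), weights=p)[0]        # true (hidden) cell of the object
    order = sorted(range(N), key=lambda r: t[r] / p[r] ** k[r])
    cost = 0.0
    for _ in range(max_sweeps):                      # cap keeps the sketch finite
        for r in order:
            cost += t[r]
            if r == pos:
                return cost
            # searching a cell next to the object drives it one cell further away,
            # unless it is already in one of the end cells
            if abs(r - pos) == 1 and 0 < pos < N - 1:
                pos += 1 if pos > r else -1
    return cost

def mean_cost(k, trials=20000, seed=1):
    rng = random.Random(seed)                        # common random numbers for fair comparison
    return sum(simulate_search(k, rng) for _ in range(trials)) / trials

# Successive approximation: start from k_r = 1 (the no-movement rule t_r/p_r) and
# nudge each k_r up and down, keeping any change that lowers the simulated mean cost.
k = [1.0] * N
best = mean_cost(k)
for _ in range(3):                                   # a few coordinate-wise passes
    for r in range(N):
        for step in (0.1, -0.1):
            trial = k.copy()
            trial[r] += step
            cost = mean_cost(trial)
            if cost < best:
                k, best = trial, cost
```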
FIELDS OF APPLICATION
The number of areas of application, actual and potential, of dynamic
programming is extremely large, one might say virtually inexhaust-
ible; and many research problems of considerable importance are being
attacked successfully from this new angle. Probably one of the
most important and promising of these areas is the field of learning
processes [4], feed-back processes, and adaptive control pro-
cesses [5], [6]. Much thought has been given in the last decade to
the construction of machines that can 'learn'. The definition of
this phrase as far as machines are concerned is to some extent a
philosophical problem; simply, what we usually mean is to
program the machine, or give it sufficient capacity, instructions
and logic, to enable it to learn from its own experiences, and adapt
its decisions, or strategies, according to the criteria for success that
we feed into it. Hence, as it observes its successes and failures
using certain strategies, it will modify them to increase the success
rate, hoping eventually to achieve complete success. This is
obviously possible with a computer of infinite storage space,
because the totality of past experience may be stored and drawn upon
in future decisions; however, since the machine cannot have infinite
storage space, other concepts, of a probabilistic nature, have to be
developed.
Another important area is that concerning search processes, in
which we may be considering a system in which we wish to find
an object such as the cause of a breakdown in the system, for ex-
ample [7], [8], or even find it and automatically repair it [9]. There
naturally are also applications to war games in which it is desired
to derive optimal strategies for detecting enemy objects, such as
submarines. A further application is the retrieval of information
from large libraries of data or documents. A simple example of
an information retrieval problem is given on pp. 50-51 of [1].
Also, inventory control, and scheduling problems, as previously
cited, have been considered with success from the dynamic pro-
gramming viewpoint.
Dynamic programming has also been applied to such diverse
fields as communication theory (see pp. 140-143 of [1], [10]),
allocation processes (chapter I of [1]), and communication network
theory [11], [12].
It should of course be realized that some of the fields and
references mentioned are to a certain extent overlapping, depending
upon one's definitions of these respective fields. However, the variety
of applications should nevertheless be obvious, and it is to be hoped
that this introduction to dynamic programming may attract
further research workers into the field.
REFERENCES
[1] BELLMAN, R. (1957). Dynamic Programming. Princeton University Press.
(The reader should perhaps be cautioned that, as is to be expected for any
important mathematical first edition, there are a certain number of slight
errors.)
[2] GLUSS, B. (1960). 'An optimal inventory solution for some specific
demand distributions'. Naval Research Logistics Quarterly, 7, no. 1, 45-48.
[3] LEVY, J. (1959). 'Further notes on the loss resulting from the use of incorrect
data in computing an optimal policy'. Naval Research Logistics Quarterly,
6, no. 1, 25-31.
[4] BELLMAN, R. & KALABA, R. (1958). 'On communication processes involving
learning and random duration'. RAND Research Memorandum P-1194.
[5] BELLMAN, R. (1961). Adaptive Control Processes: A Guided Tour. Princeton
University Press.
[6] KALABA, R. (1959). 'Some aspects of adaptive control processes'. RAND
Research Memorandum P-1809.
[7] JOHNSON, S. M. (1956). 'Optimal sequential testing'. RAND Research
Memorandum RM-1652.
[8] GLUSS, B. (1959). 'An optimum policy for detecting a fault in a complex
system'. Operations Research, 7, no. 4, 467-477.
[9] FIRSTMAN, S. I. & GLUSS, B. (1960). 'Optimum search routines for auto-
matic fault location'. Operations Research, 8, no. 4, 512-523.
[10] BELLMAN, R. & KALABA, R. (1957). 'On the role of dynamic programming in
statistical communication theory'. IRE Transactions of the Professional
Group on Information Theory, IT-3, no. 3.
[11] KALABA, R. & JUNCOSA, M. L. (1956). 'Optimal design and utilization of
communication networks'. RAND Research Memorandum P-782.
[12] KALABA, R. (1959). 'On some communication network problems'. RAND
Research Memorandum P-1325.