Markov Chains and Decision Processes for Engineers and Managers
Theodore J. Sheskin
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
Preface
The goal of this book is to provide engineers and managers with a unified
treatment of Markov chains and Markov decision processes. The unified
treatment of both subjects distinguishes this book from many others. The
prerequisites are matrix algebra and elementary probability. In addition, lin-
ear programming is used in several sections of Chapter 5 as an alternative
procedure for finding an optimal policy for a Markov decision process. These
sections may be omitted without loss of continuity. The book will be of inter-
est to seniors and beginning graduate students in quantitative disciplines
including engineering, science, applied mathematics, operations research,
management, and economics. Although written as a textbook, the book is
also suitable for self-study by engineers, managers, and other quantitatively
educated individuals. People who study this book will be prepared to con-
struct and solve Markov models for a variety of random processes.
Many books on Markov chains or decision processes are either highly the-
oretical, with few examples, or else highly prescriptive, with little justifica-
tion for the steps of the algorithms used to solve Markov models. This book
balances both algorithms and applications. Engineers and quantitatively
trained managers will be reluctant to use a formula or execute an algorithm
without an explanation of the logical relationships on which they are based.
On the other hand, they are not interested in proving theorems. In this book,
formulas and algorithms are derived informally, occasionally in a section
labeled as an optional insight. The validity of a formula is often justified by
applying it to a small generic Markov model for which transition probabili-
ties or rewards are expressed as symbols. The validity of other relationships
is demonstrated by applying them to larger Markov models with numerical
transition probabilities or rewards. Informal derivations and demonstrations
of the validity of formulas are carried out in considerable detail.
Since engineers and managers are interested in applications, considerable
attention is devoted to the construction of Markov models. A large number of
simplified Markov models are constructed for a wide assortment of processes
important to engineers and managers including the weather, gambling, dif-
fusion of gases, a waiting line, inventory, component replacement, machine
maintenance, selling a stock, a charge account, a career path, patient flow in
a hospital, marketing, and a production line. The book is distinguished by
the high level of detail with which the construction and solution of Markov
models are described. Many of these Markov models have numerical transi-
tion probabilities and rewards. Descriptions of the step-by-step calculations
made by the algorithms implemented for solving these models will facilitate
the student’s ability to apply the algorithms.
Theodore J. Sheskin
FIGURE 1.1
Sequence of consecutive epochs. [Figure: epochs 0, 1, 2, …, n on a time line, separated by periods 1 through n.]

FIGURE 1.2
Sequence of states. [Figure: states X0, X1, X2, …, Xn observed at epochs 0, 1, 2, …, n.]
epoch n + 1, the next epoch, is denoted by P(Xn+1 = j). The conditional prob-
ability that the chain will be in state j at epoch n + 1, given that it is in state
i at epoch n, is denoted by P(Xn+1 = j|Xn = i). This conditional probability
is called a transition probability, and is denoted by pij. Thus, the transition probability is

pij = P(Xn+1 = j|Xn = i).

The transition probabilities for a generic four-state Markov chain are collected in the following transition probability matrix:

     Xn\Xn+1    1     2     3     4
        1      p11   p12   p13   p14
P =     2      p21   p22   p23   p24          (1.3)
        3      p31   p32   p33   p34
        4      p41   p42   p43   p44
Each row of P represents the present state, at epoch n. Each column repre-
sents the next state, at epoch n + 1. Since the probability of a transition is
conditioned on the present state, the entries in every row of P sum to one. In
addition, all entries are nonnegative, and no entry is greater than one. The
matrix P is called a stochastic matrix.
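As a quick numerical check of these defining properties, the following minimal sketch (assuming NumPy is available; the two-state matrix is an arbitrary illustrative example, not one from the text) verifies that a candidate matrix is stochastic:

```python
import numpy as np

# Hypothetical two-state transition matrix, used only to illustrate the
# defining properties of a stochastic matrix; the values are not from the text.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Every entry must be nonnegative and no larger than one ...
assert np.all((P >= 0) & (P <= 1))
# ... and the entries in every row must sum to one.
assert np.allclose(P.sum(axis=1), 1.0)
print("P is a valid stochastic matrix")
```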
When the number of states in a Markov chain is small, the chain can be
represented by a graph, which consists of nodes connected by directed arcs.
In a transition probability graph, a node i denotes a state. An arc directed from node i to node j denotes a transition from state i to state j with transition probability pij. For example, consider a two-state generic Markov chain
with the following transition probability matrix:
     State    1     2
P =    1     p11   p12          (1.4)
       2     p21   p22

FIGURE 1.3
Transition probability graph for a two-state Markov chain. [Figure: nodes 1 and 2 with self-loops p11 and p22, an arc p12 from node 1 to node 2, and an arc p21 from node 2 to node 1.]
FIGURE 1.4
State transitions for a sample path. [Figure: states X0 = a, X1 = b, X2 = c, X3 = d, X4 = e at epochs 0 through 4, with initial probability pa(0) and transition probabilities pab, pbc, pcd, pde.]
P(X0 = a, X1 = b, X2 = c, X3 = d, X4 = e)
   = P(X0 = a) P(X1 = b|X0 = a) P(X2 = c|X1 = b) P(X3 = d|X2 = c) P(X4 = e|X3 = d)          (1.7)
   = pa(0) pab pbc pcd pde.

P(X1 = b, X2 = c, X3 = d, X4 = e|X0 = a)
   = P(X1 = b|X0 = a) P(X2 = c|X1 = b) P(X3 = d|X2 = c) P(X4 = e|X3 = d)          (1.8)
   = pab pbc pcd pde.
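Equations (1.7) and (1.8) translate directly into a short routine. A minimal sketch, assuming NumPy; the two-state matrix, initial vector, and path are arbitrary illustrative values, not data from the text:

```python
import numpy as np

# Hypothetical two-state chain and initial distribution (illustrative values only).
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
p0 = np.array([0.5, 0.5])

def path_probability(path, P, p0=None):
    """Probability of observing the given sequence of states.

    If p0 is supplied, the unconditional probability of Equation (1.7) is
    returned; otherwise the probability is conditioned on the first state
    in the path, as in Equation (1.8).
    """
    prob = p0[path[0]] if p0 is not None else 1.0
    for i, j in zip(path, path[1:]):
        prob *= P[i, j]          # multiply the one-step transition probabilities
    return prob

print(path_probability([0, 1, 1, 0], P, p0))  # P(X0=0, X1=1, X2=1, X3=0)
print(path_probability([0, 1, 1, 0], P))      # P(X1=1, X2=1, X3=0 | X0=0)
```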
daily weather is assumed to have the Markov property, the weather will be
modeled as a Markov chain with four states. The state Xn denotes the weather
on day n for n = 0, 1, 2, 3, …. The states are indexed below:
State, X n Description
1 Raining
2 Snowing
3 Cloudy
4 Sunny
The 15 remaining transition probabilities for the four-state Markov chain are
obtained in similar fashion to produce the following transition probability
matrix:
     Xn\Xn+1    1     2     3     4            State    1     2     3     4
        1      p11   p12   p13   p14              1    0.3   0.1   0.4   0.2
P =     2      p21   p22   p23   p24      =       2    0.2   0.5   0.2   0.1          (1.9)
        3      p31   p32   p33   p34              3    0.3   0.2   0.1   0.4
        4      p41   p42   p43   p44              4     0    0.6   0.3   0.1
Suppose that an initial state probability vector for the four-state Markov chain model of the weather is

p(0) = [p1(0)  p2(0)  p3(0)  p4(0)] = [0.2  0.3  0.4  0.1].          (1.10)
Since p3(0) = P(X0 = 3) = 0.4, the chain has a probability of 0.4 of starting in
state 3, representing a cloudy day. This vector also indicates that the chain
has a probability of 0.2 of starting in state 1, denoting a rainy day, a probabil-
ity of 0.3 of starting in state 2, indicating a snowy day, and a 0.1 probability of
starting in state 4, designating a sunny day. Note that the entries of p(0) sum
to one.
Suppose the chain starts in state 3, indicating a cloudy day, so that
P(X0 = 3) = p3(0) = 1. The probability of a particular sample path during the
next 5 days, given that the initial state is X0 = 3, is calculated below:
FIGURE 1.5
Random walk with five states. [Figure: states 0, 1, 2, 3, 4 arranged on a line.]
For a state i such that 0 < i < 4, the probability of moving one position to
the right is
pi , i +1 = P(X n +1 = i + 1|X n = i) = p.
1.4.1 Barriers
Random walks may have various types of barriers. Three common types of
barriers are absorbing, reflecting, and partially reflecting.
            Xn\Xn+1    0      1      2      3      4
               0       1      0      0      0      0
               1      1−p     0      p      0      0
P = [pij] =    2       0     1−p     0      p      0          (1.11)
               3       0      0     1−p     0      p
               4       0      0      0      0      1
the money, a total of $4. The gambler who loses all her money is ruined. To
model this game as a random walk, let the state, Xn, denote the amount of
money player A has after n trials. Since the total money is $4, the state space,
in units of dollars, is E = {0, 1, 2, 3, 4}. Since player A starts with $2, X0 = 2.
If player A wins each $1 bet with probability p, and loses each $1 bet with
probability 1 − p, the five-state gambler’s ruin model has the transition prob-
ability matrix given for the random walk in Section 1.4.1.1. States 0 and 4 are
both absorbing states, which indicate that the game is over. If the game ends
in state 0, then player A has lost all her initial money, $2. If the game ends
in state 4, then player A has won all of her opponent’s money, increasing her
total to $4. The transition probability matrix is shown in Equation (1.11). This
model of a gambler’s ruin is treated in Section 3.3.3.2.1.
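The gambler's ruin chain is also easy to simulate directly from its description. A minimal sketch, assuming NumPy; the win probability p = 0.5 and the number of simulated games are illustrative choices, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def play_until_absorbed(start=2, total=4, p=0.5):
    """Simulate one game; return the absorbing state (0 = ruin, total = win)."""
    state = start
    while 0 < state < total:
        state += 1 if rng.random() < p else -1   # win or lose one $1 bet
    return state

games = [play_until_absorbed() for _ in range(10_000)]
print("estimated P(player A is ruined):", np.mean(np.array(games) == 0))
```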
            Xn\Xn+1    0      1      2      3      4
               0       0      1      0      0      0
               1      1−p     0      p      0      0
P = [pij] =    2       0     1−p     0      p      0          (1.12)
               3       0      0     1−p     0      p
               4       0      0      0      1      0

            Xn\Xn+1    0      1      2      3      4
               0       0      p      0      0     1−p
               1      1−p     0      p      0      0
P = [pij] =    2       0     1−p     0      p      0          (1.13)
               3       0      0     1−p     0      p
               4       p      0      0     1−p     0
State, X n Outcome
A Product is accepted
W Product is sent back to be reworked
R Product is rejected
The state space has three entries given by E = {A, W, R}. After 61 inspections,
suppose that the following sequence of states has been observed: {A, A, A, W,
A, R, A, A, A, W, W, A, A, R, A, A, A, W, R, A, A, W, A, W, A, A, R, A, A, A, R,
W, A, R, A, A, A, W, W, A, A, A, A, R, A, A, W, A, A, R, A, A, R, W, A, A, R, A,
A, W, A}.
A transition occurs when one inspection follows another. A transition
probability, pij, is estimated by the proportion of inspections in which state i
is followed by state j. Let nij denote the number of transitions from state i to
state j. Let ni denote the number of transitions made from state i. By counting
transitions, Table 1.1 of transition counts is obtained.
Note that the 61st or last outcome is not counted in the sum of transitions
because it is not followed by a transition to another outcome. The estimated
transition probability is pij = nij/ni.
TABLE 1.1
Transition Counts

Xn\Xn+1    A    W    R    Row Sum
   A      21    8    9      38
   W       9    2    1      12
   R       8    2    0      10
                          Σ = 60
TABLE 1.2
Matrix of Estimated Transition Probabilities

     Xn\Xn+1       A             W             R
        A      pAA = 21/38   pAW = 8/38    pAR = 9/38
P =     W      pWA = 9/12    pWW = 2/12    pWR = 1/12
        R      pRA = 8/10    pRW = 2/10    pRR = 0/10
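The counting procedure behind Tables 1.1 and 1.2 can be automated. A minimal sketch, assuming NumPy; the observation sequence shown is a shortened illustrative one, but substituting the full 61-outcome sequence reproduces the estimates in Table 1.2:

```python
import numpy as np

states = ["A", "W", "R"]
index = {s: k for k, s in enumerate(states)}

# Observed sequence of inspection outcomes (shortened here for illustration;
# the text's full 61-outcome sequence can be substituted directly).
sequence = ["A", "A", "A", "W", "A", "R", "A", "A", "A", "W", "W", "A"]

counts = np.zeros((3, 3))
for i, j in zip(sequence, sequence[1:]):       # count transitions n_ij
    counts[index[i], index[j]] += 1

row_sums = counts.sum(axis=1, keepdims=True)   # n_i, transitions made from state i
P_hat = counts / row_sums                      # estimated p_ij = n_ij / n_i
print(counts)
print(np.round(P_hat, 3))
```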
WA: the previous product was sent back to be reworked, and the present product was accepted,
RW: the previous product was rejected, and the present product was
sent back to be reworked.
TABLE 1.3
Transition Counts for the Enlarged Process
State AA AW AR WA WW WR RA RW RR Row Sum
AA 7 7 7 0 0 0 0 0 0 21
AW 0 0 0 5 2 1 0 0 0 8
AR 0 0 0 0 0 0 7 2 0 9
WA 5 1 2 0 0 0 0 0 0 8
WW 0 0 0 2 0 0 0 0 0 2
WR 0 0 0 0 0 0 1 0 0 1
RA 8 0 0 0 0 0 0 0 0 8
RW 0 0 0 2 0 0 0 0 0 2
RR 0 0 0 0 0 0 0 0 0 0
∑ = 59
The 61 outcomes for the sequence of inspections indicate that the process has
moved from state AA to AA to AW to WA to AR to RA to AA to AA to AW to
WW to WA to AA to AR to RA to AA to AA to AW to WR to RA to AA to AW
to WA to AW to WA to AA to AR to RA to AA to AA to AR to RW to WA to
AR to RA to AA to AA to AW to WW to WA to AA to AA to AA to AR to RA
to AA to AW to WA to AA to AR to RA to AA to AR to RW to WA to AA to AR
to RA to AA to AW, and lastly to WA. Observe that the first letter of the state
to which the process moves must agree with the second letter of the state from
which it moves, since they both refer to the outcome of the same inspection.
Since the new model has nine states, the new transition matrix has 9² = 81
entries. Table 1.3 of transition counts is obtained for the enlarged process.
Note that unless additional products are inspected, state RR is not a pos-
sible outcome. Hence, the number of states can be reduced by one. A transi-
tion probability, pij,jk, is estimated by the proportion of inspections in which the pair of outcomes i, j is followed by the pair of outcomes j, k. The estimated transition probability is pij,jk = nij,jk/nij, the count of transitions from state ij to state jk in Table 1.3 divided by the row sum for state ij.
TABLE 1.4
Matrix of Estimated Transition Probabilities for the Enlarged Process

Xn−1 = i, Xn = j\Xn = j, Xn+1 = k     AA     AW     AR     WA     WW     WR     RA     RW
   AA                                7/21   7/21   7/21    0      0      0      0      0
   AW                                 0      0      0     5/8    2/8    1/8     0      0
   AR                                 0      0      0      0      0      0     7/9    2/9
   WA                                5/8    1/8    2/8     0      0      0      0      0
   WW                                 0      0      0     2/2     0      0      0      0
   WR                                 0      0      0      0      0      0     1/1     0
   RA                                8/8     0      0      0      0      0      0      0
   RW                                 0      0      0     2/2     0      0      0      0
By the rules of matrix multiplication, pij(2), the (i, j)th element of the two-step
transition matrix, P(2), is also the (i, j)th element of the product matrix, P2,
obtained by multiplying the matrix of one-step transition probabilities by
itself. That is,
P(2) = P ⋅ P = P 2 . (1.18)
Observe that
p34(2) = p31 p14 + p32 p24 + p33 p34 + p34 p44 = Σ_{k=1}^{4} p3k pk4

p34(2) = P(X2 = 4|X0 = 3) = P(Xn+2 = 4|Xn = 3) = 0.16
       = P(Sunny 2 days from now|Cloudy today).
The result for a two-step transition matrix can be generalized to show that
for an N-state Markov chain, a matrix of n-step transition probabilities is
obtained by raising the matrix of one-step transition probabilities to the nth
power. That is,
P( n ) = P n = PP n −1 = P n −1 P. (1.21)
P ( n+ m ) = P n+ m = P n P m = P ( n ) P ( m ) . (1.22)
In algebraic form,

pij(n+m) = Σ_{k=1}^{N} pik(n) pkj(m),  for i = 1, 2, …, N; j = 1, 2, 3, …, N.          (1.23)
To verify this relationship, condition on the state k occupied at the intermediate epoch n:

pij(n+m) = P(Xn+m = j|X0 = i)
   = Σ_{k=1}^{N} P(Xn+m = j, Xn = k|X0 = i)
   = Σ_{k=1}^{N} P(Xn+m = j, Xn = k, X0 = i)/P(X0 = i)
   = Σ_{k=1}^{N} P(Xn+m = j|Xn = k, X0 = i) P(Xn = k, X0 = i)/P(X0 = i)
   = Σ_{k=1}^{N} P(Xn+m = j|Xn = k, X0 = i) P(Xn = k|X0 = i) P(X0 = i)/P(X0 = i)
   = Σ_{k=1}^{N} P(Xn+m = j|Xn = k) P(Xn = k|X0 = i)
   = Σ_{k=1}^{N} P(Xn = k|X0 = i) P(Xn+m = j|Xn = k)
   = Σ_{k=1}^{N} pik(n) pkj(m),

where the Markov property is used to drop the conditioning on X0 = i.
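A minimal numerical check of Equations (1.21) through (1.23), assuming NumPy and using the four-state weather matrix of Equation (1.9):

```python
import numpy as np

# Weather chain of Equation (1.9).
P = np.array([[0.3, 0.1, 0.4, 0.2],
              [0.2, 0.5, 0.2, 0.1],
              [0.3, 0.2, 0.1, 0.4],
              [0.0, 0.6, 0.3, 0.1]])

P2 = P @ P                       # two-step transition probabilities, P(2) = P^2
P3 = P @ P2                      # three-step transition probabilities, P(3) = P^3
P5 = np.linalg.matrix_power(P, 5)

# Chapman-Kolmogorov check: P(n+m) = P(n) P(m), here with n = 2 and m = 3.
assert np.allclose(P5, P2 @ P3)
print(np.round(P2, 4))
```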
The state probabilities for an N-state Markov chain after n steps are collected
in a 1 × N row vector, p(n), termed a vector of state probabilities. The vector
of state probabilities specifies the probability distribution of the state after n
steps. For a four-state Markov chain, the vector of state probabilities is represented by p(n) = [p1(n)  p2(n)  p3(n)  p4(n)]. The state probabilities after n steps are obtained from the initial state probabilities and the n-step transition probabilities:

P(Xn = j) = Σ_{i=1}^{4} P(X0 = i) P(Xn = j|X0 = i),  or  pj(n) = Σ_{i=1}^{4} pi(0) pij(n).          (1.26)
In matrix form,
p( n ) = p(0) P( n ) . (1.27)
Note that p(1) = p(0)P and p(2) = p(1)P = p(0)P².
By induction,
p( n ) = p(0) P( n ) = p( n −1) P.
Thus, the vector of state probabilities is equal to the initial state probability
vector post multiplied by the matrix of n-step transition probabilities, and is
also equal to the vector of state probabilities after n − 1 transitions post mul-
tiplied by the matrix of one-step transition probabilities.
Consider the four-state Markov chain model of the weather for which the
transition probability matrix is given in Equation (1.9). Suppose that the ini-
tial state probability vector is given by Equation (1.10). After 1 day, the vector
of state probabilities is
Thus, after 1 day, the process has a probability of 0.31 of being in state 2, and
a 0.24 probability of being in state 4. The matrix of two-step transition prob-
abilities is given by
Thus, the probability of moving from state 1 to state 4 in 2 days is 0.25. After
2 days the vector of state probabilities is equal to
After 2 days the probability of being in state 2 is 0.365. Finally, the matrix of
three-step transition probabilities is given by
Thus, the probability of moving from state 1 to state 4 in 3 days is 0.195. After
3 days the vector of state probabilities is equal to
p(0) = [0 1 0 0 ]. (1.36)
Observe that when p(0) = [0 1 0 0], the entries of vector p(3) are identical to the entries in row 2 of matrix P(3). One can generalize this observation to conclude that when a Markov chain starts in state i, then after n steps, the entries of vector p(n) are identical to those in row i of matrix P(n).
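The matrices and vectors whose numerical displays are omitted above can be reproduced with a few lines of NumPy. A minimal sketch for the weather chain of Equation (1.9) with the initial vector of Equation (1.10); it recovers the quoted values 0.31, 0.24, 0.25, 0.365, and 0.195, and confirms the row-selection observation:

```python
import numpy as np

P = np.array([[0.3, 0.1, 0.4, 0.2],
              [0.2, 0.5, 0.2, 0.1],
              [0.3, 0.2, 0.1, 0.4],
              [0.0, 0.6, 0.3, 0.1]])
p0 = np.array([0.2, 0.3, 0.4, 0.1])   # initial state probability vector (1.10)

p1 = p0 @ P                            # p(1) = p(0)P; p1[1] = 0.31, p1[3] = 0.24
p2 = p1 @ P                            # p(2); p2[1] = 0.365
p3 = p2 @ P                            # p(3)
P2 = np.linalg.matrix_power(P, 2)      # P(2); P2[0, 3] = 0.25
P3 = np.linalg.matrix_power(P, 3)      # P(3); P3[0, 3] = 0.195
print(np.round(p1, 4), np.round(p2, 4), np.round(p3, 4))

# Starting in state 2 (index 1): p(3) equals row 2 of P(3).
e2 = np.array([0.0, 1.0, 0.0, 0.0])
assert np.allclose(e2 @ P3, P3[1])
```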
so that the chain can move from state 4 to state 1 in two or more steps.
1.9.1 Unichain
A Markov chain is termed unichain if it consists of a single closed set of
recurrent states plus a possibly empty set of transient states.
1.9.1.1 Irreducible
A unichain with no transient states is called an irreducible or recurrent
chain. All states in an irreducible chain are recurrent states, which belong
to a single closed communicating class. In an irreducible chain, it is pos-
sible to move from every state to every other state, not necessarily in one
step [4].
An irreducible or recurrent Markov chain is called a regular chain if some
power of the transition matrix has only positive elements. The irreducible
four-state Markov chain model of weather in Section 1.3 is a regular chain.
Another example of a regular Markov chain is the model for diffusion of two
gases treated in Section 1.10.1.1.1.1 for which
         0     1     2     3
   0     0     1     0     0
   1    1/9   4/9   4/9    0
P =                                       (1.38)
   2     0    4/9   4/9   1/9
   3     0     0     1     0
The easiest way to check regularity is to keep track of whether the entries in
the powers of P are positive. This can be done without computing numerical
values by putting an x in the entry if it is positive and a 0 otherwise. To check
regularity, let
     0    0  x  0  0
     1    x  x  x  0
P =
     2    0  x  x  x
     3    0  0  x  0

             0    0  x  0  0     0  x  0  0        x  x  x  0
             1    x  x  x  0     x  x  x  0        x  x  x  x
P² = PP =                                      =
             2    0  x  x  x     0  x  x  x        x  x  x  x
             3    0  0  x  0     0  0  x  0        0  x  x  x

               0    x  x  x  0     x  x  x  0        x  x  x  x
               1    x  x  x  x     x  x  x  x        x  x  x  x
P⁴ = P²P² =                                      =                .
               2    x  x  x  x     x  x  x  x        x  x  x  x
               3    0  x  x  x     0  x  x  x        x  x  x  x
Since all entries in P4 are positive, the chain is regular. Note that the test for
regularity is made faster by squaring the result each time.
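The x-pattern bookkeeping can be mechanized with Boolean matrices. A minimal sketch, assuming NumPy, applied to the diffusion chain of Equation (1.38); the function name and the cap on the number of squarings are illustrative choices:

```python
import numpy as np

P = np.array([[0,   1,   0,   0  ],
              [1/9, 4/9, 4/9, 0  ],
              [0,   4/9, 4/9, 1/9],
              [0,   0,   1,   0  ]])

def is_regular(P, max_doublings=10):
    """Check whether some power of P has only positive entries,
    tracking just the pattern of positive entries and squaring it."""
    B = P > 0                                     # Boolean pattern of positive entries
    for _ in range(max_doublings):
        if B.all():
            return True
        B = (B.astype(int) @ B.astype(int)) > 0   # pattern of the squared power
    return False

print(is_regular(P))   # True: the chain of Equation (1.38) is regular
```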
The irreducible Markov chain constructed in Section 1.10.1.1.1.2 for the Ehrenfest model of diffusion has the transition probability matrix
         0     1     2     3
   0     0     1     0     0
   1    1/3    0    2/3    0
P =                                       (1.39)
   2     0    2/3    0    1/3
   3     0     0     1     0
To check regularity, track the pattern of positive entries:

     0    0  x  0  0
     1    x  0  x  0
P =
     2    0  x  0  x
     3    0  0  x  0

             0    0  x  0  0     0  x  0  0        x  0  x  0
             1    x  0  x  0     x  0  x  0        0  x  0  x
P² = PP =                                      =
             2    0  x  0  x     0  x  0  x        x  0  x  0
             3    0  0  x  0     0  0  x  0        0  x  0  x

               0    x  0  x  0     x  0  x  0        x  0  x  0
               1    0  x  0  x     0  x  0  x        0  x  0  x
P⁴ = P²P² =                                      =                .
               2    x  0  x  0     x  0  x  0        x  0  x  0
               3    0  x  0  x     0  x  0  x        0  x  0  x
Observe that even powers of P will have 0s in the odd numbered entries of
row 0. Furthermore,
              0    x  0  x  0     0  x  0  0        0  x  0  x
              1    0  x  0  x     x  0  x  0        x  0  x  0
P³ = P²P =                                      =
              2    x  0  x  0     0  x  0  x        0  x  0  x
              3    0  x  0  x     0  0  x  0        x  0  x  0

               0    x  0  x  0     0  x  0  x        0  x  0  x
               1    0  x  0  x     x  0  x  0        x  0  x  0
P⁵ = P²P³ =                                      =                .
               2    x  0  x  0     0  x  0  x        0  x  0  x
               3    0  x  0  x     x  0  x  0        x  0  x  0
Note that odd powers of P will have 0s in the even numbered entries of
row 0. This chain is not regular because no power of the transition matrix
has only positive elements. This example has demonstrated that a periodic
chain cannot be regular. Hence, a regular Markov chain is irreducible and
aperiodic. Regular Markov chains are the subject of Chapter 2.
               1     2     3     4
      1      p11   p12    0     0
      2      p21   p22    0     0
P = [pij] =                                       (1.40)
      3      p31   p32   p33   p34
      4      p41   p42   p43   p44
If the recurrent chain consists of a single absorbing state, the reducible uni-
chain is called an absorbing unichain or an absorbing Markov chain. The
transition probability matrix for a generic three-state absorbing unichain
is shown below. The transition matrix is partitioned to show that state 1 is
absorbing, and states 2 and 3 are transient. The recurrent chain is denoted by
R = {1}. The transient set of states is denoted by T = {2, 3}.
1 1 0 0
P = [ pij ] = 2 p21 p22 p23 . (1.41)
3 p31 p32 p33
1.9.2 Multichain
A Markov chain is termed multichain if it consists of two or more closed
sets of recurrent states plus a possibly empty set of transient states. Transient
states are those states that do not belong to any of the recurrent closed clas-
ses. A multichain with transient states is called a reducible multichain. The
state space for a reducible multichain can be partitioned into two or more
mutually exclusive closed communicating classes of recurrent states plus
one or more transient states. The mutually exclusive closed sets of recur-
rent states are called recurrent chains. There is no interaction among the
recurrent chains. Hence, each recurrent chain, which may consist of only a
single absorbing state, may be analyzed separately by treating it as an irre-
ducible Markov chain. If every recurrent chain consists solely of one absorb-
ing state, then the reducible multichain is called an absorbing multichain
or an absorbing Markov chain. For brevity, a reducible multichain is often
called a multichain. A multichain with no transient states is called a recur-
rent multichain.
The transition probability matrix for a generic five-state reducible multi-
chain is shown below. The transition matrix is partitioned to show that the
chain has two recurrent chains plus two transient states.
1 1 0 0 0 0
2 0 p22 p23 0 0
P = 3 0 p32 p33 0 0 . (1.42)
4 p41 p42 p43 p44 p45
5 p51 p52 p53 p54 p55
State 1 is an absorbing state, and constitutes the first recurrent chain, denoted
by R1 = {1}. Recurrent states 2 and 3 belong to the second recurrent chain,
denoted by R 2 = {2, 3}. States 4 and 5 are transient. The set of transient states
is denoted by T = {4, 5}.
      S   0
P =             .          (1.43)
      D   Q
For example, the aggregated canonical form of the transition matrix for the
generic five-state reducible multichain shown in Equation (1.42) appears
below:
            1    1     0     0     0     0
            2    0    p22   p23    0     0
                                                     S   0
P = [pij] = 3    0    p32   p33    0     0     =           ,          (1.44)
                                                     D   Q
            4   p41   p42   p43   p44   p45
            5   p51   p52   p53   p54   p55

where

      1    1     0     0               4   p41   p42   p43              4   p44   p45
S =   2    0    p22   p23  ,     D =                          ,   Q =                  ,   and
      3    0    p32   p33              5   p51   p52   p53              5   p54   p55

      1    0    0
0 =   2    0    0  .
      3    0    0
1.10.1 Unichain
Unichain models include irreducible chains, which have no transient states,
and unichains, which have transient states.
1.10.1.1 Irreducible
The model of the weather in Section 1.3, and the circular random walk in
Section 1.4.2 are both irreducible chains. Since these two Markov chains are
aperiodic, they are also regular chains. Several additional models of regular
Markov chains will be constructed, plus one model of an irreducible, peri-
odic chain in Section 1.10.1.1.1.2.
1.10.1.1.1 Diffusion
This section will describe two simplified Markov chain models for diffu-
sion. In the first model, two different gases are diffused. The second model,
which produces a periodic chain, is for the diffusion of one gas between two
containers [4, 5].
TABLE 1.5
Number of Molecules in Each Container
When the System Is in State i
Gas/Container A B
Gas U k−i i
Gas V i k−i
molecule of the same gas is chosen from each container, the system remains
in state i. Hence,
pii = (i/k)((k − i)/k) + ((k − i)/k)(i/k) = 2i(k − i)/k².
pi,i+1 = ((k − i)/k)² = ((3 − i)/3)² = (3 − i)²/9,  for i = 0, 1, 2.
         0     1     2     3
   0     0     1     0     0
   1    1/9   4/9   4/9    0
P =                                       (1.38)
   2     0    4/9   4/9   1/9
   3     0     0     1     0
This model for the diffusion of two gases is a regular Markov chain.
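The transition probabilities of this diffusion model can be assembled programmatically for any k. A minimal sketch, assuming NumPy; the probability pi,i−1 = (i/k)² used below follows from the requirement that each row sum to one:

```python
import numpy as np

def two_gas_diffusion_matrix(k=3):
    """Transition matrix for the two-gas diffusion model with k molecules
    per container: p_{i,i-1} = (i/k)^2, p_{i,i+1} = ((k-i)/k)^2,
    and p_{i,i} = 2*i*(k-i)/k^2."""
    P = np.zeros((k + 1, k + 1))
    for i in range(k + 1):
        P[i, i] = 2 * i * (k - i) / k**2
        if i > 0:
            P[i, i - 1] = (i / k) ** 2
        if i < k:
            P[i, i + 1] = ((k - i) / k) ** 2
    return P

P = two_gas_diffusion_matrix(3)
print(np.round(P, 4))                      # matches Equation (1.38)
assert np.allclose(P.sum(axis=1), 1.0)     # every row sums to one
```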
pi,i−1 = i/k

pi,i+1 = (k − i)/k.

pi,i−1 = i/k = i/3,  for i = 1, 2, 3

pi,i+1 = (k − i)/k = (3 − i)/3,  for i = 0, 1, 2.
         0     1     2     3
   0     0     1     0     0
   1    1/3    0    2/3    0
P =                                       (1.39)
   2     0    2/3    0    1/3
   3     0     0     1     0
As Section 1.9.1.1 has indicated, this Ehrenfest model of diffusion can be inter-
preted as a four-state random walk with reflecting barriers in which p = 2/3.
Hence, this Ehrenfest model is a periodic Markov chain with a period of 2.
X n + 1 = min(3, An + 1 ), if X n = 0.
X n + 1 = min(3, X n − 1 + An + 1 ) if X n > 0.
Thus, the transition probabilities for state 1 are the same as those for state 0. When Xn = 2, the next state is

Xn+1 = j = min(3, Xn − 1 + An+1) = min(3, 2 − 1 + An+1) = min(3, 1 + An+1).
The transition probabilities for the four-state Markov chain model of the
waiting line inside the recruiter’s office are collected to construct the follow-
ing transition probability matrix:
     State       0           1           2           3
       0     P(An = 0)   P(An = 1)   P(An = 2)   P(An ≥ 3)
P =    1     P(An = 0)   P(An = 1)   P(An = 2)   P(An ≥ 3)          (1.45)
       2         0       P(An = 0)   P(An = 1)   P(An ≥ 2)
       3         0           0       P(An = 0)   P(An ≥ 1)

     State     0       1       2       3
       0      0.30    0.25    0.20    0.25
  =    1      0.30    0.25    0.20    0.25          (1.46)
       2       0      0.30    0.25    0.45
       3       0       0      0.30    0.70
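The min(3, ·) logic above can be turned into code for an arbitrary arrival distribution. A minimal sketch, assuming NumPy; the arrival probabilities are the ones implied by Equation (1.46), and the assumption that one recruit is processed per period when any are waiting simply mirrors the recursion for Xn+1:

```python
import numpy as np

# Arrival distribution implied by Equation (1.46): P(An = 0), P(An = 1),
# P(An = 2), and P(An >= 3).  The last entry can be treated as an arrival
# of "3" because the waiting line is capped at 3 by the min(3, .) rule.
arrivals = np.array([0.30, 0.25, 0.20, 0.25])

def waiting_line_matrix(arrivals, capacity=3):
    P = np.zeros((capacity + 1, capacity + 1))
    for i in range(capacity + 1):
        base = max(i - 1, 0)                 # one person is processed if any are waiting
        for a, prob in enumerate(arrivals):
            j = min(capacity, base + a)      # X_{n+1} = min(3, max(X_n - 1, 0) + A_{n+1})
            P[i, j] += prob
    return P

print(waiting_line_matrix(arrivals))         # matches Equation (1.46)
```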
Demand dn in period n 0 1 2 3
Probability, p(dn ) 0.3 0.4 0.1 0.2
If Xn−1 < 2, the retailer orders cn−1 = 3 − Xn−1 computers, which are delivered
immediately to increase the beginning inventory to three computers.
If Xn−1 ≥ 2, the retailer does not order, so that the beginning inventory
remains Xn−1 computers.
This policy has the form (s, S), and is called an (s, S) inventory ordering policy.
The quantity s = 2 is called the reorder point, and the quantity S = 3 is called
the reorder level. Observe that under this (2, 3) policy, the retailer orders three computers in state 0, two computers in state 1, and no computers in states 2 and 3. In state 0, three computers are ordered. When Xn−1 = i = 0 and cn−1 = 3, the next state is

Xn = Xn−1 + cn−1 − dn = j = i + cn−1 − dn = 0 + 3 − dn = 3 − dn.
p0 j = P(X n = j X n −1 = 0) = P(X n = 3 − dn X n −1 = 0) = P( j = 3 − dn ).
Hence, when j = 0, p00 = P(dn = 3) = 0.2. When j = 1, p01 = P(dn = 2) = 0.1. When j = 2, p02 = P(dn = 1) = 0.4. When j = 3, p03 = P(dn = 0) = 0.3.
In state 1, two computers are ordered. When Xn−1 = i = 1 and cn−1 = 2, the next
state is
X n = X n− 1 + cn− 1 − dn = j
= j = i + cn −1 − dn = 1 + 2 − dn = 3 − dn .
p1 j = P(X n = j X n −1 = 1) = P(X n = 3 − dn X n −1 = 1) = P( j = 3 − dn ).
Hence, when j = 0, p10 = P(dn = 3) = 0.2. When j = 1, p11 = P(dn = 2) = 0.1. When j = 2, p12 = P(dn = 1) = 0.4. When j = 3, p13 = P(dn = 0) = 0.3.
When dn = 3, the demand exceeds the beginning inventory plus the quan-
tity ordered, and the sale of one computer is lost. To ensure that the ending
inventory is nonnegative, the equation for the next state, which represents
the ending inventory, is expressed in the form
X n = j = max(2 − dn , 0).
For example, when dn = 3, Xn = max(2 − 3, 0) = 0. Hence, when j = 0, p20 = P(dn ≥ 2) = 0.1 + 0.2 = 0.3. When j = 1, p21 = P(dn = 1) = 0.4. When j = 2, p22 = P(dn = 0) = 0.3. When j = 3, p23 = 0, because the ending inventory cannot exceed the two computers available.
In state 3, zero computers are ordered. When Xn−1 = i = 3 and cn−1 = 0, the next
state is
X n = X n− 1 + cn− 1 − dn = j
= j = i + cn− 1 − dn = 3 + 0 − dn = 3 − dn .
p3 j = P(X n = j X n −1 = 3) = P(X n = 3 − dn X n −1 = 3) = P( j = 3 − dn ).
Hence, when j = 0, p30 = P(dn = 3) = 0.2. When j = 1, p31 = P(dn = 2) = 0.1. When j = 2, p32 = P(dn = 1) = 0.4. When j = 3, p33 = P(dn = 0) = 0.3.
     Beginning          Order,
     Inventory, Xn−1    cn−1    Xn−1 + cn−1      State       0           1           2           3
          0               3          3             0     P(dn = 3)   P(dn = 2)   P(dn = 1)   P(dn = 0)
P =       1               2          3             1     P(dn = 3)   P(dn = 2)   P(dn = 1)   P(dn = 0)          (1.47)
          2               0          2             2     P(dn ≥ 2)   P(dn = 1)   P(dn = 0)       0
          3               0          3             3     P(dn = 3)   P(dn = 2)   P(dn = 1)   P(dn = 0)

              State     0      1      2      3
                0      0.2    0.1    0.4    0.3
P = [pij] =     1      0.2    0.1    0.4    0.3          (1.48)
                2      0.3    0.4    0.3     0
                3      0.2    0.1    0.4    0.3
Observe that the transition probabilities in states 0, 1, and 3 are the same. The
inventory system is enlarged to create a Markov chain with rewards in Section
4.2.3.4.1 and a Markov decision process in Sections 5.1.3.3.1 and 5.2.2.4.1.
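Because the (2, 3) ordering rule and the demand distribution determine every row of Equation (1.48), the matrix can also be generated directly. A minimal sketch, assuming NumPy; the function name and its arguments are illustrative:

```python
import numpy as np

demand_probs = {0: 0.3, 1: 0.4, 2: 0.1, 3: 0.2}   # P(d_n = k) from the demand table

def inventory_matrix(s=2, S=3, demand_probs=demand_probs):
    """Transition matrix for the (s, S) policy: order up to S when the
    beginning inventory is below s, otherwise order nothing."""
    P = np.zeros((S + 1, S + 1))
    for i in range(S + 1):
        stock = S if i < s else i               # inventory after any immediate delivery
        for d, prob in demand_probs.items():
            j = max(stock - d, 0)               # ending inventory; unmet demand is lost
            P[i, j] += prob
    return P

print(inventory_matrix())                       # matches Equation (1.48)
```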
TABLE 1.6
Mortality Data for 100 Components Used Over a 4-Week Period
End of Week # of Survivors # of Failures P(Failures) Cond. P(Failure)
n Sn Fn =Fn/100 =Fn/Sn−1
0 100 0 0 –
1 80 20 0.20 0.2 = 20/100
2 50 30 0.30 0.375 = 30/80
3 10 40 0.40 0.8 = 40/50
4 0 10 0.10 1 = 10/10
The mortality data indicates that p00 = 0.2, p10 = 0.375, and p20 = 0.8. Since a
component of age 3 is certain to fail during the current week and be replaced
at the end of the week, the transition probability for a 3-week-old component
is p30 = P (Xn = 0|Xn−1 = 3) = F4/S3 = 1.
The age of the component in use during week n − 1 is Xn−1 = i. For
any n − 1, Xn = 0 if the component failed during week n. If the compo-
nent survived during week n, then the component is 1 week older, so that
Xn = Xn−1 + 1 = i + 1. The conditional probabilities that a component of age
TABLE 1.7
Conditional Probability That a Component of Age i Will Fail During Week i + 1 of Life

Age, i, of a component in weeks            0              1               2             3
Conditional probability of failing    0.2 = 20/100   0.375 = 30/80   0.8 = 40/50    1 = 10/10
during week i + 1 of life               = F1/S0        = F2/S1         = F3/S2        = F4/S3
pi,i+1 = P(Xn = i + 1|Xn−1 = i)
       = P(Component survives 1 additional week|Component has survived i weeks)
       = Si+1/Si
       = 1 − pi0 = 1 − Fi+1/Si = 1 − (Si − Si+1)/Si
p01 = P( X n = 1 X n −1 = 0) = S1/S0 = 80/100 = 0.8
     State     0       1       2       3
       0      0.2     0.8      0       0
P =    1      0.375    0      0.625    0          (1.50)
       2      0.8      0       0      0.2
       3      1        0       0       0
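Both Table 1.7 and the matrix of Equation (1.50) follow mechanically from the survivor counts of Table 1.6. A minimal sketch, assuming NumPy:

```python
import numpy as np

S = np.array([100, 80, 50, 10, 0])      # survivors S_0, ..., S_4 from Table 1.6

def replacement_matrix(S):
    n = len(S) - 1                       # ages 0, 1, ..., n-1 are the states
    P = np.zeros((n, n))
    for i in range(n):
        fail = (S[i] - S[i + 1]) / S[i]      # conditional P(fail in week i+1 of life)
        P[i, 0] = fail                       # failed components are replaced by new ones
        if i + 1 < n:
            P[i, i + 1] = 1 - fail           # surviving components age by one week
    return P

print(replacement_matrix(S))             # matches Equation (1.50)
```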
The possible scores and the probabilities of these scores are the same for
every applicant. The process of interviewing and assigning scores to can-
didates is an independent trials process because the score assigned to one
candidate does not affect the score assigned to any other candidate.
To formulate this independent trials process as an irreducible Markov
chain, let the state Xn denote the score assigned to the nth candidate, for
n = 1, 2, 3, … . The state space is E = {15, 20, 25, 30}. The sequence {X1, X2, X3, …}
is a collection of independent, identically distributed random variables. The
probability distribution of Xn is shown in Table 1.8.
The score or state Xn+1 of the next applicant is independent of the state
Xn of the current applicant. Hence, pij = P(Xn+1 = j|Xn = i) = P(Xn+1 = j). The
TABLE 1.8
Probability Distribution of Candidate Scores
Candidate Poor Fair Good Excellent
State Xn = i 15 20 25 30
P(Xn = i) 0.3 0.4 0.2 0.1
sequence {X1, X2, X3,…} for this independent trials process forms a Markov
chain. The transition probability matrix is
     State         15              20              25              30
      15     P(Xn+1 = 15)    P(Xn+1 = 20)    P(Xn+1 = 25)    P(Xn+1 = 30)
P =   20     P(Xn+1 = 15)    P(Xn+1 = 20)    P(Xn+1 = 25)    P(Xn+1 = 30)          (1.51)
      25     P(Xn+1 = 15)    P(Xn+1 = 20)    P(Xn+1 = 25)    P(Xn+1 = 30)
      30     P(Xn+1 = 15)    P(Xn+1 = 20)    P(Xn+1 = 25)    P(Xn+1 = 30)

     State    15     20     25     30
      15     0.3    0.4    0.2    0.1
  =   20     0.3    0.4    0.2    0.1
      25     0.3    0.4    0.2    0.1
      30     0.3    0.4    0.2    0.1
Note that all the rows of P are identical. This property holds for every
independent trials process. Two extended forms of this problem, called
the secretary problem, will be formulated as Markov decision processes in
Sections 5.1.3.3.3 and 5.2.2.4.2.
P(Xn+1 = i + 1|Xn = i) = bi for i = 0, 1, 2, and 0 for i = 3

P(Xn+1 = i − 1|Xn = i) = di for i = 1, 2, 3, and 0 for i = 0

P(Xn+1 = i|Xn = i) = 1 − bi − di.

     State       0              1              2              3
       0      1 − b0           b0              0              0
P =    1        d1        1 − b1 − d1          b1             0          (1.52)
       2        0              d2         1 − b2 − d2         b2
       3        0              0               d3          1 − d3
1.10.1.2.1.1 Selling a Stock for a Target Price Suppose that at the end of a
month a woman buys one share of a certain stock for $10. The share price,
rounded to the nearest $10, has been varying among the prices $0, $10, and
$20 from month to month. She plans to sell the stock at the end of first month
in which the share price rises to $20. She believes that the price of the stock
can be modeled as a Markov chain in which the state, Xn, denotes the share
price at the end of month n. The state space for the stock price is E = {$0,
$10, $20}. The state Xn = $20 is an absorbing state, reached when the stock is
sold. The two remaining states, which are entered when the stock is held,
are transient. She models her investment as an absorbing unichain with the
following transition probability matrix represented in the canonical form of
Equation (1.43).
     Xn\Xn+1    20     10      0
       20        1      0      0          1   0
P =    10       0.1    0.6    0.3    =           .          (1.53)
        0       0.4    0.2    0.4          D   Q
A model for selling a stock with two target prices will be treated in
Sections 4.2.5.4.2 and 4.2.5.5.
TABLE 1.9
States of a Machine
State Description
1 Not Working (NW), Inoperable
2 Working, with a Major Defect (WM)
3 Working, with a Minor Defect (Wm)
4 Working Properly (WP)
(Note that the states are labeled so that as the index of the state increases,
the condition of the machine improves.) The state of the machine at the start
of tomorrow depends only on its state at the start of today, and is indepen-
dent of its past history. Hence, the condition of the machine can be modeled
as a four-state Markov chain. Let Xn−1 denote the state of the machine when
it is observed at the start of day n. The state space is E = {1, 2, 3, 4}. Assume
that at the start of each day, the engineer in charge of the production pro-
cess does nothing to respond to the deterioration of the machine, that is,
she does not perform maintenance. Therefore, the condition of the machine
will either deteriorate by one or more states or remain unchanged. In other
words, if the engineer responsible for the machine always does nothing, then
at the start of tomorrow, the condition of the machine will be worse than or
equal to the condition today. All state transitions caused by deterioration are
assumed to occur at the end of the day. A transition probability matrix for
the machine when it is left alone for one day is given below in the canonical
form of Equation (1.43):
     1  Not Working          1      0      0      0
     2  Major Defect        0.6    0.4     0      0          1   0
P =                                                     =           .          (1.54)
     3  Minor Defect        0.2    0.3    0.5     0          D   Q
     4  Working Properly    0.3    0.2    0.1    0.4
The transition probability matrix for a machine, which is left alone, repre-
sents an absorbing Markov chain. As the machine deteriorates daily, it will
eventually enter state 1, an absorbing state, where it will remain, not working.
For example, if today a machine is in state 3, working, with a minor defect,
then tomorrow, with transition probability p32 = P(Xn = 2|Xn−1 = 3) = 0.3, the
machine will be in state 2, working, with a major defect. One day later,
with transition probability p21 = P(Xn = 1|Xn−1 = 2) = 0.6, the chain will be
absorbed in state 1, where the machine will remain, not working. The model
of machine deterioration will be revisited in Sections 3.3.2.1 and 3.3.3.1.
TABLE 1.10
Four Possible Maintenance Actions
Decision Action Outcome
1 Do nothing The condition tomorrow will be worse than or equal to the
(DN) condition today
2 Overhaul If, with probability 0.8, an overhaul in states 1, 2, or 3, is
(OV) successful, the condition tomorrow will be superior by one
state to the condition today. If unsuccessful, the condition
tomorrow will be unchanged
3 Repair (RP) If, with probability 0.7, a repair in state 1 or 2 is successful, the
condition tomorrow will be superior by two states to the
condition today. If unsuccessful, the condition tomorrow will
be unchanged
4 Replace The machine will work properly tomorrow
(RL)
TABLE 1.11
Original Maintenance Policy
State, i Description Decision, k Maintenance Action
1 Not Working, Inoperable 2 Overhaul
2 Working, with a major defect 1 Do Nothing
3 Working, with a minor defect 2 Overhaul
4 Working Properly 1 Do Nothing
TABLE 1.12
Modified Maintenance Policy
State, i Description Decision, k Maintenance Action
1 Not Working, Inoperable 2 Overhaul
2 Working, with a major defect 1 Do Nothing
3 Working, with a minor defect 1 Do Nothing
4 Working Properly 1 Do Nothing
Observe that this Markov chain is unichain as states 1 and 2 form a recurrent
closed class while states 3 and 4 are transient.
Now suppose that the engineer who manages the production process
modifies the original maintenance policy by always doing nothing when the
machine is in state 3 instead of overhauling it in state 3. The modified mainte-
nance policy, under which the engineer always overhauls the machine in state
1 and does nothing in the other three states, is summarized in Table 1.12.
The transition probability matrix associated with the modified mainte-
nance policy appears below in Equation (1.56).
Observe that this Markov chain under the modified maintenance policy
is also unichain as states 1 and 2 form a recurrent closed class while states 3
and 4 are transient. The machine maintenance model will be revisited in
Sections 3.3.2.1, 3.3.3.1, 3.5.4.2, 4.2.4.1, 4.2.4.2, 4.2.4.3, and 5.1.4.4.
State, X n Description
1 Management of Hardware
2 Management of Software
3 Management of Marketing
4 Engineering Product Design
5 Engineering Systems Integration
6 Engineering Systems Testing
(1.57)
State Description
0 0 months (1−30 days) old
1 1 month (31−60 days) old
2 2 months (61−90 days) old
P Paid in full
B Bad debt
A charge account is classified according to the oldest unpaid debt, starting from
the billing date. (Assume that all debt includes interest and finance charges.) For
example, if a customer has one unpaid bill, which is 1 month old, and a second
unpaid bill, which is 2 months old, then the account is classified as 2 months
old. If she makes a payment less than the 2-month-old bill, the account remains
classified as 2 months old. However, if she makes a payment greater than the
2-month-old bill but less than the sum of the 1-month-old and 2-month-old bills,
the account is reclassified as 1 month old. If at any time she pays the entire bal-
ance owed, the account is labeled as paid in full. When an account becomes
3 months old, it is labeled as a bad debt and sent to a collection agency.
Assume that the change of status of a charge account depends only on
its present classification. Then the process can be modeled as a five-state
Markov chain. Let Xn denote the state of an account at the nth month since
the account was opened. The state space is E = {0, 1, 2, P, B}. The states 0, 1,
and 2 indicate the age of an account, in months. An account that is 0 months
old is a new account with only current charges. State P indicates that an
account is paid in full, and state B indicates a bad debt. When the account is
in states 0, 1, or 2, it may stay in its present state. When the account is in state
0, it may move to state 1. When the account is in state 1, it may move to states
0 or 2. When the account is in state 2, it may move to states 0 or 1. Because an
account can be paid in full at any time, transitions are possible from states
0, 1, and 2 to state P. Since an account is reclassified as a bad debt only when
it becomes 3 months old, state B is reached only by a transition from state 2.
States 0, 1, and 2 are transient states because eventually the charge account
will either be paid in full or labeled as a bad debt. States P and B are absorb-
ing states because once one of these states is entered, the account is settled
and no further activity is possible. The process is an absorbing multichain.
After rearranging the five states so that the two absorbing states appear first,
the transition probability matrix is given below in the canonical form of
Equation (1.43):
     State    P     B     0     1     2
       P      1     0     0     0     0
       B      0     1     0     0     0          I1   0    0          I   0
P =    0     p0P    0    p00   p01    0     =    0    I2   0    =            .
       1     p1P    0    p10   p11   p12         D1   D2   Q          D   Q
       2     p2P   p2B   p20   p21   p22
Suppose that observations over a period of time have produced the follow-
ing transition probability matrix, also displayed in the canonical form of
Equation (1.43):
     State    P     B     0     1     2
       P      1     0     0     0     0
       B      0     1     0     0     0          I   0
P =    0     0.5    0    0.2   0.3    0     =            .          (1.58)
       1     0.2    0    0.1   0.4   0.3         D   Q
       2     0.1   0.2   0.4   0.2   0.1
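With the states already ordered so that the absorbing states come first, the blocks of the canonical form (1.43) are just slices of the matrix. A minimal sketch, assuming NumPy, applied to the charge-account chain of Equation (1.58):

```python
import numpy as np

# Charge-account chain of Equation (1.58); state order is P, B, 0, 1, 2.
P = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.2, 0.3, 0.0],
              [0.2, 0.0, 0.1, 0.4, 0.3],
              [0.1, 0.2, 0.4, 0.2, 0.1]])

n_recurrent = 2                       # states P and B are absorbing
S = P[:n_recurrent, :n_recurrent]     # here S = I, the 2 x 2 identity
D = P[n_recurrent:, :n_recurrent]     # one-step probabilities from transient to absorbing states
Q = P[n_recurrent:, n_recurrent:]     # one-step probabilities among the transient states
print(D)
print(Q)
```

The D and Q blocks isolated this way are the ingredients used when absorbing chains are analyzed in later chapters.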
State Department
0 Discharged
1 Diagnostic
2 Outpatient
3 Surgery
4 Physical Therapy
5 Morgue
During a given day, 40% of all diagnostic patients will not be moved, 10%
will become outpatients, 20% will enter surgery, and 30% will begin physi-
cal therapy. Also, 15% of all outpatients will be discharged, 5% will die, 10%
     State     0      5      1      2      3      4
       0       1      0      0      0      0      0
       5       0      1      0      0      0      0
                                                             S   0        I   0
P =    1       0      0     0.4    0.1    0.2    0.3    =          =             .          (1.59)
       2     0.15   0.05    0.1    0.2    0.3    0.2         D   Q        D   Q
       3     0.07   0.03    0.2    0.1    0.4    0.2
       4       0      0     0.3    0.4    0.2    0.1
State Operation
1 Scrapped
2 Sold
3 Training Engineers
4 Training Technicians
5 Training Technical Writers
6 Stage 3
7 Stage 2
8 Stage 1
The next operation on an item depends only on the outcome of the current
operation. Therefore, the production process can be modeled as a Markov
chain. An epoch is the instant at which an item passes through a production
stage and is inspected, or is transferred to an employee in the training cen-
ter, or is transferred within the training center. Let the state Xn denote the
operation on an item at epoch n. Since output that is scrapped will not be
reused, and output that is sold will not be returned, states 1 and 2 are absorb-
ing states. Output sent to the training center will remain there permanently,
and will therefore not rejoin the production stages or be scrapped or sold.
Output received by the training center is dedicated exclusively to training
engineers, technicians, and technical writers, and will be shared by these
employees. Hence, states 3, 4, and 5 form a closed communicating class of
recurrent states. States 6, 7, and 8 are transient because all output must even-
tually leave the production stages to be scrapped, sold, or sent to the training
center. Thus, the model is a reducible multichain, which has two absorbing
states, one closed class of three recurrent states, and three transient states.
Observe that production stage i is represented by transient state (9 − i). An
item enters the production process at stage 1, which is transient state 8.
The following transition probabilities for the transient states are expressed
in terms of the probabilities of producing output, which has a major defect, a
minor defect, is blemished, or has no defect:
p88 = 0.75 = P(output from stage 1 has a minor defect and is reworked)
p77 = 0.65 = P(output from stage 2 has a minor defect and is reworked)
p66 = 0.55 = P(output from stage 3 has a minor defect and is reworked)
p87 = 0.15 = P(output from stage 1 has no defect and is passed to stage 2)
p76 = 0.20 = P(output from stage 2 has no defect and is passed to stage 3)
p62 = 0.16 = P(output from stage 3 has no defect or blemish and is sold)
p71 = 0.15 = P(output from stage 2 has a major defect and is scrapped)
p61 = 0.20 = P(output from stage 3 has a major defect and is scrapped).
Scrapped 1 1 0 0 0 0 0 0 0
Sold 2 0 1 0 0 0 0 0 0
Training Engineers 3 0 0 0.50 0.30 0.20 0 0 0
Training Technicians 4 0 0 0.30 0.45 0.25 0 0 0
P=
Training Tech. Writers 5 0 0 0.10 0.35 0.55 0 0 0
Stage 3 6 0.20 0.16 0.04 0.03 0.02 0.55 0 0
Stage 2 7 0.15 0 0 0 0 0.20 0.65 0
Stage 1 8 0.10 0 0 0 0 0 0.15 0.75
     P1   0    0    0          I    0    0    0
     0    P2   0    0          0    I    0    0          S   0
  =                       =                         =           .          (1.60)
     0    0    P3   0          0    0    P3   0          D   Q
     D1   D2   D3   Q          D1   D2   D3   Q
FIGURE 1.6
Passage of an item through a three-stage production process. [Figure: each stage reworks its own minor-defect output (probabilities 0.75, 0.65, and 0.55), passes defect-free output to the next stage (0.15 and 0.20) or sells acceptable output from stage 3 (0.16), and scraps defective output (0.10, 0.15, and 0.20).]
PROBLEMS
1.1 The condition of a machine, which is observed at the beginning of
each day, can be represented by one of the following four states:
State Description
State Description
The use of two states to distinguish the 2 days of the repair pro-
cess allows the next state of the machine to depend only on the
present state and to be independent of its past history. Hence,
the condition of the machine can be modeled as a five-state
Markov chain. Let Xn−1 denote the state of the machine when it is
observed at the start of day n. The state space is E = {1, 2, 3, 4, 5}.
Assume that at the start of each day, the engineer in charge of
the machine does nothing to respond to the condition of the
machine when it is in states 3, 4, or 5.
A machine in states 3, 4, or 5 that fails will enter state 1
(NW1). One day later the machine will move from state 1 to
State Description
0 (W) Working
1 (D1) Not Working, in first day of repair
2 (D2) Not Working, in second day of repair
3 (D3) Not Working, in third day of repair
4 (D4) Not Working, in fourth day of repair
Since the next state of the machine depends only on its present
state, the condition of the machine can be modeled as a five-
state Markov chain.
Construct the transition probability matrix.
1.4 Many products are classified as either acceptable or defective.
Such products are often shipped in lots, which may contain a
large number of individual items. A purchaser wants assurance
that the proportion of defective items in a lot is not excessive.
Instead of inspecting each item in a lot, a purchaser may follow
an acceptance sampling plan under which a random sample
selected from the lot is inspected. An acceptance sampling plan
is said to be sequential if, after each item is inspected, one of
the following decisions is made: accept a lot, reject it, or inspect
another item. Suppose that the proportion of defective items in
a lot is denoted by p. Consider the following sequential inspec-
tion plan: accept the lot if four acceptable items are found, reject
the lot if two defective items are found, or inspect another item
if neither four acceptable items nor two defective items have
been found.
The sequential inspection plan represents an independent
trials process in which the condition of the nth item to be
inspected is independent of the condition of its predecessors.
Hence, the sequential inspection plan can be modeled as a
Markov chain. A state is represented by a pair of numbers.
The fi rst number in the pair is the number of acceptable items
inspected, and the second is the number of defective items
inspected. The model is an absorbing Markov chain because
when the lot is accepted or rejected, the inspection process
stops in an absorbing state. The transient states indicate that
the inspection process will continue. The states are indexed
and identified in the table below:
Demand dn on day n, dn = k 0 1 2 3
P(dn = k ) P(dn = 0) P(dn = 1) P(dn = 2) P(dn = 3)
Age i of an IC in years, i 0 1 2 3 4
Number surviving to age i S0 S1 S2 S3 S4 = 0
Observe that at the end of the 4-year period, none of the S0 ICs
has survived, that is, all have failed. Every IC that fails is replaced
with a new one at the end of the year in which it has failed. The life
of an IC can be modeled as a four-state recurrent Markov chain.
Let the state Xn denote the age of an IC at the end of year n.
Construct the transition probability matrix for the Markov
chain.
1.10 A small refinery produces one barrel of gasoline per hour. Each
barrel of gasoline has an octane rating of either 120, 110, 100, or
90. The refinery engineer models the octane rating of the gas-
oline as a Markov chain in which the state Xn represents the
octane rating of the nth barrel of gasoline. The states of the
Markov chain, {X0, X1, X2, …}, are shown below:
Suppose that the Markov chain, {X0, X1, X2, …}, has the follow-
ing transition probability matrix:
State 1 2 3 4
1 p11 p12 b k −b
P= 2 p21 p22 k−d d
3 p31 p32 e h−e
4 p41 p42 h−u u
p13 + p14 = b + (k − b) = k.
p23 + p24 = (k − d) + d = k.
Hence,
p33 + p34 = e + ( h − e ) = h.
p43 + p44 = ( h − u) + u = h.
Hence,
        P, if Xn = 1 or 2
Yn =
        E, if Xn = 3 or 4

     State    P      E
G =    P     gPP    gPE
       E     gEP    gEE
Now suppose that the Markov chain {X0, X1, X2, …} has the
following transition probability matrix, such that p11 + p12 ≠ p21 +
p22 and p31 + p32 ≠ p41 + p42.
State 1 2 3 4
1 p11 p12 p13 p14
P= 2 p21 p22 p23 p24
3 p31 p32 p33 p34
4 p41 p42 p43 p44
Since p11 + p12 ≠ p21 + p22 and p31 + p32 ≠ p41 + p42, the Markov chain is
not lumpable with respect to the partition P = {1, 2} and E = {3, 4}.
The process {Y0, Y1, Y2, …} is not a Markov chain.
References
1. Bhat, N., Elements of Applied Stochastic Processes, 2nd ed., Wiley, New York,
1985.
2. Cinlar, E., Introduction to Stochastic Processes, Prentice-Hall, Englewood Cliffs,
NJ, 1975.
3. Clarke, A. B. and Disney R. L., Probability and Random Processes: A First Course
with Applications, 2nd ed., Wiley, New York, 1985.
4. Kemeny, J. G., Mirkil, H., Snell, J. L., and Thompson, G. L., Finite Mathematical
Structures, Prentice-Hall, Englewood Cliffs, NJ, 1959.
5. Kemeny, J. G. and Snell, J. L., Finite Markov Chains, Van Nostrand, Princeton, NJ,
1960. Reprinted by Springer-Verlag, New York, 1976.
6. Heyman, D. P. and Sobel, M. J., Stochastic Models in Operations Research, vol. 1,
McGraw Hill, New York, 1982.
That is, after n transitions, as n becomes very large, the n-step transition probability pij(n) approaches a limiting probability that does not depend on the starting state i. If πj denotes the limiting probability for state j in an N-state Markov chain, then the limiting probability is defined by the formula πj = lim_{n→∞} pij(n), for every starting state i. The limiting probabilities are collected in the row vector
π = [π 1 π 2 … π N ]. (2.5)
Σ_{j=1}^{N} πj = 1.          (2.6)

                                         1     π1   π2   …   πN
                                         2     π1   π2   …   πN
lim_{n→∞} P(n) = lim_{n→∞} P^n = Π =     ⋮     ⋮    ⋮         ⋮     .          (2.7)
                                         N     π1   π2   …   πN
For the four-state Markov chain model of the weather, the rows of P(8) calcu-
lated in Equation (2.2) indicate that
For large n, the state probability p(j n ) approaches the limiting probability πj.
That is,
and does not depend on the starting state. Thus, the vector π of steady-state
probabilities is equal to the limit, as the number of transitions approaches
infinity, of the vector p(n) of state probabilities. That is,
π = lim_{n→∞} p(n).          (2.11)
P(Xn+1 = j) = Σ_{i=1}^{N} P(Xn = i) P(Xn+1 = j|Xn = i),  or  pj(n+1) = Σ_{i=1}^{N} pi(n) pij.

Letting n → ∞,

lim_{n→∞} pj(n+1) = lim_{n→∞} Σ_{i=1}^{N} pi(n) pij,

so that

πj = Σ_{i=1}^{N} πi pij.
In matrix form,
π = πP
πI = πP
π (I − P) = 0.
πj = Σ_{i=1}^{N} πi pij,  for j = 1, 2, 3, …, N

Σ_{j=1}^{N} πj = 1          (2.12)

πi > 0,  for i = 1, 2, 3, …, N.
π = πP
πe = 1          (2.13)
π > 0
                        p11   p12
[π1  π2] = [π1  π2]
                        p21   p22
                                             (2.14)
[π1  π2][1  1]^T = 1.
π1 = p11π1 + p21π2
π2 = p12π1 + p22π2          (2.15)
π1 + π2 = 1.
Note that if the substitutions p12 = 1 − p11 and p22 = 1 − p21 are made in the
second equation, the second equation is transformed into the first equation,
as shown below:
π 2 = (1 − p11 )π 1 + (1 − p21 )π 2
π 2 = π 1 − p11π 1 + π 2 − p21π 2
π 1 = p11π 1 + p21π 2 .
π 1 = p11π 1 + p21π 2 ,
or
(1 − p11 )π 1 − p21π 2 = 0,
π 1 + π 2 = 1.
After multiplying both sides of the second equation by p21 and adding the result to the first equation, the solution for the vector of steady-state probabilities is

π = [π1  π2] = [p21/(p12 + p21)   p12/(p12 + p21)].          (2.16)

For the numerical example, the vector of steady-state probabilities is

π = [π1  π2] = [p21/(p12 + p21)   p12/(p12 + p21)] = [3/7   4/7].          (2.18)
π 1 = p11π 1 + p21π 2
π 2 = p12π 1 + p22π 2
The first equation is solved to express π1 as the following constant times π2:
π1 = (p21/(1 − p11)) π2 = (p21/p12) π2.
This expression is substituted into the normalizing equation (2.6) to solve for π2.
(p21/p12) π2 + π2 = 1.
Once again, the solution for the vector of steady-state probabilities is given
by Equation (2.16) .
When the numerical values given in Equation (1.9) are substituted for the
transition probabilities, the following system of five equations in four
unknowns is produced:
When the first approach is followed, and the fourth equation is arbitrarily
deleted, the resulting system of four equations in four unknowns is shown
below:
This solution almost matches the approximate one obtained in Equation (2.9)
by calculating P8.
In the second approach, the normalizing equation (2.6) is initially ignored.
The first three equations contained in the system π = πP are solved to express
π1, π2, and π3 as the following constants times π4:
These values for π1, π2, and π3 expressed in terms of π4 are substituted into the
normalizing equation to solve for π4.
1 = π1 + π 2 + π 3 + π 4
1 = (1407/1309)π 4 + (2499/1309) π 4 + (1617/1309) π 4 + (1309/1309 )π 4 .
The result is
π 4 = (1309/6832 ). (2.24)
Substituting the result for π4 to solve for the other steady-state probabilities
gives the values obtained by following the first approach:
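The numerical values omitted from the displays above can be reproduced by solving the system (2.13) directly. A minimal sketch, assuming NumPy, that follows the first approach (replace one balance equation with the normalizing equation); it should return π4 = 1309/6832 ≈ 0.1916 together with the other three steady-state probabilities:

```python
import numpy as np

# Weather chain of Equation (1.9).
P = np.array([[0.3, 0.1, 0.4, 0.2],
              [0.2, 0.5, 0.2, 0.1],
              [0.3, 0.2, 0.1, 0.4],
              [0.0, 0.6, 0.3, 0.1]])

def steady_state(P):
    """Solve pi = pi P by dropping one balance equation and
    appending the normalizing equation sum(pi) = 1."""
    N = P.shape[0]
    A = np.vstack([(P.T - np.eye(N))[:-1], np.ones(N)])
    b = np.zeros(N)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

pi = steady_state(P)
print(np.round(pi, 4))        # pi_4 = 1309/6832 = 0.1916
print(np.round(1 / pi, 4))    # reciprocals are the mean recurrence times, discussed later
```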
                                     0.2     0.8     0       0
                                     0.375   0       0.625   0
[π0  π1  π2  π3] = [π0  π1  π2  π3]                                        (2.28)
                                     0.8     0       0       0.2
                                     1       0       0       0

[π0  π1  π2  π3][1  1  1  1]^T = 1.
The steady-state probability πi represents the long run probability that a ran-
domly selected component is i weeks old. For example, the long run probabil-
ity that a component has just been replaced is given by π0 = 0.4167. Similarly,
the long run probability that a component has been in service 3 weeks is
equal to π3 = 0.0417.
Consider a generic four-state regular Markov chain with the transition prob-
ability matrix shown in Equation (1.3).
Let

fj(n) = [fij(n)] = [f1j(n)   f2j(n)   f3j(n)   f4j(n)]^T,   where j = 1, 2, 3, or 4,          (2.31)
denote the column vector of n-step first passage time probabilities to a target
state j. To obtain a formula for computing the probability distribution of the
n-step first passage times, let j = 1. Suppose that f 41( n ), the distribution of n-step
first passage time probabilities from state 4 to a target state 1 is desired. To
compute f 41( n ), one may start with n = 1. The probability of going from state 4
to state 1 for the first time in one step, f 41(1), is simply the one-step transition
probability, p41. That is,
f41(1) = p41(1) = p41.          (2.32)
To move from state 4 to state 1 for the first time in two steps, the chain must go
from state 4 to any nontarget state k different from state 1 on the first step, and
from that nontarget state k to state 1 on the second step. Therefore, for k = 2, 3, 4,
f 41(2) = p42 p21 + p43 p31 + p44 p41 = p42 f 21(1) + p43 f 31(1) + p44 f 41(1) . (2.33)
To move from state 4 to state 1 for the first time in three steps, the chain must
go from state 4 to any nontarget state k different from state 1 on the first step,
and from that nontarget state k to state 1 for the first time after two additional
steps. Hence, for k = 2, 3, 4,
f41(n) = p42 f21(n−1) + p43 f31(n−1) + p44 f41(n−1) = Σ_{k≠1} p4k fk1(n−1).
Thus, the n-step probability distribution of first passage times from a state i to a target state j can be computed recursively by using the algebraic formula

fij(n) = Σ_{k≠j} pik fkj(n−1),   with fij(1) = pij.          (2.35)
To express formula (2.35) for recursively calculating fij(n) in matrix form, let j denote the target state in a generic four-state regular Markov chain. Suppose that j = 1. Let column vector f1(1) denote column 1 of P, so that

           p11        f11(1)
           p21        f21(1)
f1(1) =           =            .          (2.36)
           p31        f31(1)
           p41        f41(1)
Thus f1(1) is the column vector of one-step transition probabilities from any
state to state 1, the target state. Let the matrix Z be the matrix P with column
j of the target state replaced by a column of zeroes. When j = 1,
After n steps
f j( n ) = Zf j( n −1) = Z n −1 f j(1) .
Matrix Z is the matrix of probabilities of not entering the target state in one
step because all entries in the column of the target state are zero. Therefore,
Zn−1 is the matrix of probabilities of not entering the target state in n − 1 steps.
Vector fj(1) is the vector of probabilities of entering the target state in one step.
Hence, f j( n ) = Z n− 1 f j(1) is the vector of probabilities of not entering the target
state during the first n − 1 steps, and then entering the target state for the first
time on the nth step.
To see that these recursive matrix formulas are equivalent to the corre-
sponding algebraic formulas, they will be applied to a generic regular four-
state Markov chain to calculate the vectors f1(2) and f1(3) when state 1 is the
target state. The column vector of probabilities of first passage to state 1 in
two steps is represented by
                    0   p12   p13   p14      p11        p12 p21 + p13 p31 + p14 p41        f11(2)
                    0   p22   p23   p24      p21        p22 p21 + p23 p31 + p24 p41        f21(2)
f1(2) = Zf1(1) =                                    =                                   =            .          (2.39)
                    0   p32   p33   p34      p31        p32 p21 + p33 p31 + p34 p41        f31(2)
                    0   p42   p43   p44      p41        p42 p21 + p43 p31 + p44 p41        f41(2)
Observe that f 41(2), the fourth entry in column vector f1(2), is given by
f 41(2) = p42 f 21(1) + p43 f 31(1) + p44 f 41(1) = p42 p21 + p43 p31 + p44 p41 ,
in agreement with the algebraic formula (2.33) obtained earlier. The vector of
probabilities of first passage to state 1 in three steps is given by
                     0   p12p22 + p13p32 + p14p42   p12p23 + p13p33 + p14p43   p12p24 + p13p34 + p14p44      p11
                     0   p22p22 + p23p32 + p24p42   p22p23 + p23p33 + p24p43   p22p24 + p23p34 + p24p44      p21
f1(3) = Z²f1(1) =
                     0   p32p22 + p33p32 + p34p42   p32p23 + p33p33 + p34p43   p32p24 + p33p34 + p34p44      p31
                     0   p42p22 + p43p32 + p44p42   p42p23 + p43p33 + p44p43   p42p24 + p43p34 + p44p44      p41

     (p12p22 + p13p32 + p14p42)p21 + (p12p23 + p13p33 + p14p43)p31 + (p12p24 + p13p34 + p14p44)p41
     (p22p22 + p23p32 + p24p42)p21 + (p22p23 + p23p33 + p24p43)p31 + (p22p24 + p23p34 + p24p44)p41
  =                                                                                                   .          (2.40)
     (p32p22 + p33p32 + p34p42)p21 + (p32p23 + p33p33 + p34p43)p31 + (p32p24 + p33p34 + p34p44)p41
     (p42p22 + p43p32 + p44p42)p21 + (p42p23 + p43p33 + p44p43)p31 + (p42p24 + p43p34 + p44p44)p41
Observe that f41(3), the fourth entry in column vector f1(3), is given by
f 41(3) = ( p42 p22 + p43 p32 + p44 p42 )p21 + ( p42 p23 + p43 p33 + p44 p43 )p31
+ ( p42 p24 + p43 p34 + p44 p44 )p41
= p42 ( p22 p21 + p23 p31 + p24 p41 ) + p43 ( p32 p21 + p33 p31 + p34 p41 )
+ p44 ( p42 p21 + p43 p31 + p44 p41 )
= p42 f 21(2) + p43 f 31(2) + p44 f 41(2) ,
                    0   p12   p13   p14      f11(2)        p12 f21(2) + p13 f31(2) + p14 f41(2)        f11(3)
                    0   p22   p23   p24      f21(2)        p22 f21(2) + p23 f31(2) + p24 f41(2)        f21(3)
f1(3) = Zf1(2) =                                       =                                           =            .          (2.41)
                    0   p32   p33   p34      f31(2)        p32 f21(2) + p33 f31(2) + p34 f41(2)        f31(3)
                    0   p42   p43   p44      f41(2)        p42 f21(2) + p43 f31(2) + p44 f41(2)        f41(3)
Observe that f 41(3), the fourth entry in column vector f1(3), is given by
The vectors of n-step first passage probabilities, for n = 2, 3, and 4, are calcu-
lated below, along with Zn−1:
Alternatively,
Alternatively,
The probability that the chain moves from state 4 to target state 1 for the first
time in four steps is given by f41(4) = 0.1197. Therefore, the probability that the
next rainy day (state 1) will appear for the first time 4 days after a sunny day
(state 4) is 0.1197.
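The recursion fj(n) = Zfj(n−1) is one matrix-vector product per step. A minimal sketch, assuming NumPy, for the weather chain of Equation (1.9) with target state 1; it reproduces f41(4) = 0.1197:

```python
import numpy as np

# Weather chain of Equation (1.9); target state 1 is column index 0.
P = np.array([[0.3, 0.1, 0.4, 0.2],
              [0.2, 0.5, 0.2, 0.1],
              [0.3, 0.2, 0.1, 0.4],
              [0.0, 0.6, 0.3, 0.1]])

target = 0
Z = P.copy()
Z[:, target] = 0.0                    # zero out the target column
f = P[:, target].copy()               # f_1^(1): one-step probabilities into the target

for n in range(2, 5):                 # apply f^(n) = Z f^(n-1) for n = 2, 3, 4
    f = Z @ f
    print(n, np.round(f, 4))          # f[3] at n = 4 is f_41^(4) = 0.1197
```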
mij = E(Tij) = Σ_{n=1}^{∞} n P(Tij = n) = Σ_{n=1}^{∞} n fij(n),          (2.48)
      j    1     0     0     0     0
      1   p1j   p11   p12   p13   p14
                                              1   0
PM =  2   p2j   p21   p22   p23   p24    =           ,          (2.50a)
                                              D   Q
      3   p3j   p31   p32   p33   p34
      4   p4j   p41   p42   p43   p44
where
To find the MFPT from state 4 to a target state j, condition on the outcome of the first step, multiply each outcome by the probability that it occurs, and sum these products over all possible outcomes. Given that the process is initially in state 4, then either (a) the first step, with probability p_{4j}, is to target state j, in which case the MFPT is exactly one step, or (b) the first step, with probability p_{4k}, is to a nontarget state, k ≠ j, in which case the MFPT will be (1 + m_{kj}), equal to the one step already taken plus m_{kj}, the MFPT from nontarget state k to target state j. Weighting each outcome by the probability that it occurs produces the following formula for the MFPT from state 4 to state j:

m_{4j} = p_{4j}(1) + \sum_{k \neq j} p_{4k}(1 + m_{kj}) = 1 + \sum_{k \neq j} p_{4k}m_{kj}.
Since m4j in a five-state regular Markov chain is a linear function of the four
unknowns, m1j, m2j, m3j, and m4j, m4j cannot be found by solving a single equation.
Instead, the following linear system of four equations in the four unknowns,
m_{1j}, m_{2j}, m_{3j}, and m_{4j}, must be solved to find the MFPTs to a target state j:

m_{ij} = 1 + \sum_{k \neq j} p_{ik}m_{kj}, \quad i = 1, 2, 3, 4. \tag{2.53}
For the five-state regular Markov chain, the matrix form of the system of
algebraic equations (2.53) for calculating the vector of MFPTs to a target
state j is

\begin{bmatrix} m_{1j} \\ m_{2j} \\ m_{3j} \\ m_{4j} \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} + \begin{bmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \\ p_{41} & p_{42} & p_{43} & p_{44} \end{bmatrix} \begin{bmatrix} m_{1j} \\ m_{2j} \\ m_{3j} \\ m_{4j} \end{bmatrix}. \tag{2.54}
For an (N + 1)-state regular Markov chain, the concise form of the matrix
equation for computing the vector of MFPTs to a target state j is
m_j = e + Qm_j, \tag{2.56}

where m_j = [m_{1j} \; m_{2j} \; \cdots \; m_{Nj}]^T is the column vector of MFPTs to target state j, and e is an N-component column vector with all entries equal to one.
Suppose that the vector of MFPTs to target state j is desired. When target
state j is made an absorbing state, the modified transition probability matrix,
PM, is partitioned in the following manner to put it in canonical form of
Equation (1.43):
P_M = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 0.4 & 0.1 & 0.2 & 0.3 \\ 0.2 & 0.1 & 0.2 & 0.3 & 0.2 \\ 0.1 & 0.2 & 0.1 & 0.4 & 0.2 \\ 0 & 0.3 & 0.4 & 0.2 & 0.1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ D & Q \end{bmatrix}, \tag{2.58}

where the states are ordered j, 1, 2, 3, 4, and

D = \begin{bmatrix} 0 \\ 0.2 \\ 0.1 \\ 0 \end{bmatrix}, \quad Q = \begin{bmatrix} 0.4 & 0.1 & 0.2 & 0.3 \\ 0.1 & 0.2 & 0.3 & 0.2 \\ 0.2 & 0.1 & 0.4 & 0.2 \\ 0.3 & 0.4 & 0.2 & 0.1 \end{bmatrix}. \tag{2.59}
In algebraic form, Equations (2.53) or (2.54) for the MFPTs are

m_{1j} = 1 + 0.4m_{1j} + 0.1m_{2j} + 0.2m_{3j} + 0.3m_{4j}
m_{2j} = 1 + 0.1m_{1j} + 0.2m_{2j} + 0.3m_{3j} + 0.2m_{4j}
m_{3j} = 1 + 0.2m_{1j} + 0.1m_{2j} + 0.4m_{3j} + 0.2m_{4j}
m_{4j} = 1 + 0.3m_{1j} + 0.4m_{2j} + 0.2m_{3j} + 0.1m_{4j}.

The solution is

m_j = [m_{1j} \; m_{2j} \; m_{3j} \; m_{4j}]^T = [15.7143 \; 12.1164 \; 13.8624 \; 14.8148]^T. \tag{2.62}
For example, the mean number of steps to move from state 2 to state j for the
first time is given by m2j = 12.1164. Observe that matrix Q in Equation (2.59)
is identical to matrix Q in Equation (3.20). However, the entries in the vector
of MFPTs in Equation (2.62) differ slightly from those calculated by different
methods in Equations (3.22) and (6.68). Discrepancies are due to roundoff
error because only the first four significant decimal digits were stored.
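The MFPTs quoted above can be checked numerically by solving Equation (2.56) with the submatrix Q of Equation (2.58); a minimal sketch:

import numpy as np

# Submatrix Q from Equation (2.58): transitions among the nontarget states 1-4.
Q = np.array([[0.4, 0.1, 0.2, 0.3],
              [0.1, 0.2, 0.3, 0.2],
              [0.2, 0.1, 0.4, 0.2],
              [0.3, 0.4, 0.2, 0.1]])

e = np.ones(4)
# Equation (2.56): m_j = e + Q m_j, i.e. (I - Q) m_j = e.
m = np.linalg.solve(np.eye(4) - Q, e)
print(np.round(m, 4))   # approximately [15.7143 12.1164 13.8624 14.8148]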
P_M = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0.375 & 0 & 0.625 & 0 \\ 0.8 & 0 & 0 & 0.2 \\ 1 & 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ D & Q \end{bmatrix}, \tag{2.63}

with the states ordered 0, 1, 2, 3.
The matrix form of Equation (2.56) to be solved to find the vector m_0 of MFPTs is

\begin{bmatrix} m_{10} \\ m_{20} \\ m_{30} \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} + \begin{bmatrix} 0 & 0.625 & 0 \\ 0 & 0 & 0.2 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} m_{10} \\ m_{20} \\ m_{30} \end{bmatrix}.
P_M = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0.2 & 0.5 & 0.2 & 0.1 \\ 0.3 & 0.2 & 0.1 & 0.4 \\ 0 & 0.6 & 0.3 & 0.1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ D & Q \end{bmatrix}, \tag{2.66}

with the states ordered 1, 2, 3, 4.
The matrix equation (2.56) for computing m1, the vector of MFPTs to target
state 1, is
m_1 = [m_{21} \; m_{31} \; m_{41}]^T = [5.3234 \; 5.1244 \; 6.3682]^T. \tag{2.68}
Snow, state 2, is followed by rain after a mean interval of 5.3234 days, and a
sunny day, state 4, is followed by rain after a mean interval of 6.3682 days.
m_{jj} = 1 + \sum_{k \neq j} p_{jk}m_{kj} \tag{2.69}
for the case in which i = j. The second method is to calculate the steady-state probabilities for the original N-state regular Markov chain. The mean recurrence times are the reciprocals of the steady-state probabilities [1, 4, 5].
m_{jj} = 1 + p_{j1}m_{1j} + p_{j2}m_{2j} + p_{j3}m_{3j} + p_{j4}m_{4j}
= 1 + (0.1)m_{1j} + (0.2)m_{2j} + (0.4)m_{3j} + (0)m_{4j} \tag{2.70}
= 1 + (0.1)(15.7143) + (0.2)(12.1164) + (0.4)(13.8624) + (0)(14.8148)
= 10.5397.
Alternatively, if the steady-state probabilities are known, the mean recurrence time m_{jj} for a target state j is simply the reciprocal of the steady-state probability π_j for the target state j. The steady-state probability vector for the five-state regular Markov chain for which the transition probability matrix is given in Equation (2.57) is obtained by solving the system of equations (2.13). The solution gives π_j = 0.0949 for the target state j. Hence, the mean recurrence time for state j is m_{jj} = 1/π_j = 1/0.0949 = 10.537, which is close to the result obtained by the first method. Discrepancies are due to roundoff error.
Next, the mean recurrence times for states 1 and 2, respectively, are calculated
by solving Equation (2.69) in terms of the MFPTs previously calculated.
m_{jj} = 1/\pi_j. \tag{2.76}

To see why Equation (2.76) holds, consider a generic two-state regular Markov chain with transition probability matrix P, and let

M = \begin{bmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{bmatrix} \tag{2.77}

denote its matrix of MFPTs and mean recurrence times.
The systems of Equations (2.54) and (2.69) used to compute the MFPTs and the mean recurrence times can be written in the following form:

M = P(M - M_D) + E,

where

M_D = \begin{bmatrix} m_{11} & 0 \\ 0 & m_{22} \end{bmatrix}, \quad E = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad \pi = [\pi_1 \; \pi_2].

Premultiplying by the steady-state probability vector π gives

\pi M = \pi P(M - M_D) + \pi E
= \pi(M - M_D) + \pi E, \quad \text{since } \pi P = \pi
= \pi M - \pi M_D + \pi E.

Hence

\pi M_D = \pi E = [1 \; 1],

that is,

[\pi_1 \; \pi_2] \begin{bmatrix} m_{11} & 0 \\ 0 & m_{22} \end{bmatrix} = [1 \; 1],

so that

\pi_1 m_{11} = 1, \quad \pi_2 m_{22} = 1.
The conclusion that the mean recurrence time is the reciprocal of the
steady-state probability can be generalized to apply to any regular Markov
chain with a finite number of states.
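A small sketch of the two methods, using a hypothetical three-state regular chain (the matrix P below is illustrative only), confirms that the mean recurrence time computed from Equation (2.69) equals the reciprocal of the steady-state probability.

import numpy as np

# Hypothetical three-state regular chain used only to illustrate m_jj = 1/pi_j.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
N = P.shape[0]

# Steady-state probabilities: solve pi P = pi together with pi e = 1.
A = np.vstack([P.T - np.eye(N), np.ones(N)])
b = np.append(np.zeros(N), 1.0)
pi = np.linalg.lstsq(A, b, rcond=None)[0]

# Mean recurrence time for each target state j, first by Equation (2.69).
for j in range(N):
    others = [i for i in range(N) if i != j]
    Q = P[np.ix_(others, others)]                            # transitions among nontarget states
    m = np.linalg.solve(np.eye(N - 1) - Q, np.ones(N - 1))   # MFPTs to state j
    m_jj = 1.0 + P[j, others] @ m                            # Equation (2.69)
    print(j + 1, round(m_jj, 4), round(1.0 / pi[j], 4))      # the two methods agree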
PROBLEMS
2.1 This problem refers to Problem 1.1.
(a) If the machine is observed today operating in state 4 (WP),
what is the probability that it will enter state 2 (MD) after
4 days?
(b) If the machine is observed today operating in state 4 (WP),
what is the probability that it will enter state 2 (MD) for the
first time after 4 days?
State Description
0 (W) Working
1 (D1) Not working, in first day of repair
2 (D2) Not working, in second day of repair
3 (D3) Not working, in third day of repair
Since the next state of the machine depends only on its present
state, the condition of the machine can be modeled as a four-
state Markov chain. Answer the following questions if p = 0.1,
q1 = 0.5, q2 = 0.3, and q3 = 0.2:
(a) Construct the transition probability matrix for this four state
recurrent Markov chain.
(b) If the machine is observed today in state 3 (D3), what is the
probability that it will enter state 2 (D2) after 3 days?
(c) If the machine is observed today in state 1 (D1), what is the
probability that it will enter state 3 (D3) for the first time
after 3 days?
\frac{r-p}{r(1-r)-p(1-p)s^3}\,[\,1-r \;\; s \;\; s^2 \;\; (1-p)s^3\,], \quad \text{where } s = \frac{p(1-r)}{(1-p)r}.
(b) What is the long run proportion of time that the checkout
clerk is idle?
(c) What is the long run proportion of time that an arriving cus-
tomer will be denied admission to the store?
In Problem 1.7, let p = 0.24 and r = 0.36.
(d) Construct the transition probability matrix.
(e) If the checkout clerk starts the day with one customer inside
the convenience store, find the probability distribution for
the number of customers inside the store at the end of the
third 30-min interval.
(f) Find the steady-state probability distribution for the number
of customers inside the convenience store.
(g) If the checkout clerk starts the day with two customers inside
the convenience store, find the expected length of time until
the store has three customers.
2.9 In Problem 1.11, let
w0 = 0.14, w1 = 0.10, w2 = 0.20, w3 = 0.18, w4 = 0.24, and w5 = 0.14.
(a) Construct the transition probability matrix.
(b) If the dam starts the week with three units of water, find the
probability distribution for the volume of water stored in the
dam after 3 weeks.
(c) Find the steady-state probability distribution for the volume
of water stored in the dam.
(d) If the dam starts the week with one unit of water, find the
expected length of time until the dam has four units of
water.
2.10 In Problem 1.14, let p = 0.6, r = 0.8, and q = 0.7.
(a) Construct the transition probability matrix.
(b) If the optical shop starts the day with both the optometrist
and the optician idle, find the probability state vector after
three 30-min intervals.
(c) Find the steady-state probability vector for the optical shop.
(d) If the optical shop starts the day with the optometrist blocked
and the optician working, find the expected length of time
until the optometrist is working and the optician is idle.
References
1. Bhat, N., Elements of Applied Stochastic Processes, 2nd ed., Wiley, New York,
1985.
2. Cinlar, E., Introduction to Stochastic Processes, Prentice-Hall, Englewood Cliffs,
NJ, 1975.
3. Clarke, A. B. and Disney R. L., Probability and Random Processes: A First Course
with Applications, 2nd ed., Wiley, New York, 1985.
4. Kemeny, J. G., Mirkil, H., Snell, J. L., and Thompson, G. L., Finite Mathematical
Structures, Prentice-Hall, Englewood Cliffs, NJ, 1959.
5. Kemeny, J. G. and Snell, J. L., Finite Markov Chains, Van Nostrand, Princeton, NJ,
1960. Reprinted by Springer-Verlag, New York, 1976.
All Markov chains treated in this book have a finite number of states. Recall
from Section 1.9 that a reducible Markov chain has both recurrent states and
transient states. The state space for a reducible chain can be partitioned into
one or more mutually exclusive closed communicating classes of recurrent
states plus a set of transient states. A closed communicating class of recur-
rent states is often called a recurrent class or a recurrent chain. The state
space for a reducible unichain can be partitioned into one recurrent chain
plus a set of transient states. The state space for a reducible multichain can be
partitioned into two or more mutually exclusive recurrent chains plus a set
of transient states. There is no interaction among different recurrent chains
within a reducible multichain. Hence, each recurrent chain can be analyzed
separately by treating it as an irreducible Markov chain. A reducible chain,
which starts in a transient state, will eventually leave the set of transient
states to enter a recurrent chain, within which it will continue to make tran-
sitions indefinitely.
3.1.1 Unichain
Suppose that the states in a unichain are rearranged so that the recurrent
states come first, numbered consecutively, followed by the transient states,
which are also numbered consecutively. The closed class of recurrent states
is denoted by R, and the set of transient states is denoted by T. This rearrange-
ment of states can be used to partition the transition probability matrix for
the unichain into a recurrent chain plus a set of transient states. As Equation
(1.43) indicates, the transition probability matrix, P, is represented in the fol-
lowing canonical form:
P = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}. \tag{3.1}
The square submatrix S governs transitions within the closed class of recur-
rent states. Rectangular submatrix 0 consists entirely of zeroes indicating
that no transitions from recurrent states to transient states are possible.
Rectangular submatrix D governs transitions from transient states to recur-
rent states. The square submatrix Q governs transitions among transient
states. Submatrix Q is substochastic, which means that at least one row sum
is less than one.
Consider the generic four-state reducible unichain described in Section
1.9.1.2. The transition probability matrix is partitioned in Equation (1.40) to
show that states 1 and 2 belong to the recurrent chain denoted by R = {1, 2},
and states 3 and 4 are transient. The set of transient states is denoted by
T = {3, 4}. The canonical form of the transition matrix is
P = [p_{ij}] = \begin{bmatrix} p_{11} & p_{12} & 0 & 0 \\ p_{21} & p_{22} & 0 & 0 \\ p_{31} & p_{32} & p_{33} & p_{34} \\ p_{41} & p_{42} & p_{43} & p_{44} \end{bmatrix} = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}, \tag{3.2}

where

S = \begin{bmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{bmatrix}, \quad D = \begin{bmatrix} p_{31} & p_{32} \\ p_{41} & p_{42} \end{bmatrix}, \quad Q = \begin{bmatrix} p_{33} & p_{34} \\ p_{43} & p_{44} \end{bmatrix}.
When the recurrent chain consists of a single absorbing state, as in the following generic three-state unichain, the canonical form is

P = [p_{ij}] = \begin{bmatrix} 1 & 0 & 0 \\ p_{21} & p_{22} & p_{23} \\ p_{31} & p_{32} & p_{33} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ D & Q \end{bmatrix},

where

D = \begin{bmatrix} p_{21} \\ p_{31} \end{bmatrix}, \quad Q = \begin{bmatrix} p_{22} & p_{23} \\ p_{32} & p_{33} \end{bmatrix}. \tag{3.3}
3.1.2 Multichain
Consider a reducible multichain that has M recurrent chains, denoted by
R1, . . . , R M, with the respective transition probability matrices P1, . . . ,PM, plus a
set of transient states denoted by T. The states can be rearranged if necessary
so that the transition probability matrix, P, can be represented in the follow-
ing canonical form:
P = \begin{bmatrix} P_1 & 0 & \cdots & 0 & 0 \\ 0 & P_2 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & P_M & 0 \\ D_1 & D_2 & \cdots & D_M & Q \end{bmatrix}. \tag{3.4}
The M square submatrices, P1, . . . ,PM, are the transition probability matrices,
which specify the transitions after the chain has entered the corresponding
closed class of recurrent states. The M rectangular submatrices, D1, . . . ,DM,
are the transition probability matrices, which govern transitions from tran-
sient states to the corresponding closed classes of recurrent states. The square
submatrix Q is the transition probability matrix, which governs transitions
among the transient states. The submatrix 0 consists entirely of zeroes.
The transition probability matrix for a generic five-state reducible multi-
chain is shown in Equation (1.42). As Section 1.9.2 indicates, the multichain
has M = 2 recurrent closed sets, one of which is an absorbing state, plus two
transient states. The two recurrent sets of states are R1 = {1} and R 2 = {2, 3}. The
set of transient states is denoted by T = {4, 5}.
The canonical form of the transition matrix is
P = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & p_{22} & p_{23} & 0 & 0 \\ 0 & p_{32} & p_{33} & 0 & 0 \\ p_{41} & p_{42} & p_{43} & p_{44} & p_{45} \\ p_{51} & p_{52} & p_{53} & p_{54} & p_{55} \end{bmatrix} = \begin{bmatrix} P_1 & 0 & 0 \\ 0 & P_2 & 0 \\ D_1 & D_2 & Q \end{bmatrix}, \tag{3.5}

where

P_1 = [1], \quad P_2 = \begin{bmatrix} p_{22} & p_{23} \\ p_{32} & p_{33} \end{bmatrix}, \quad D_1 = \begin{bmatrix} p_{41} \\ p_{51} \end{bmatrix}, \quad D_2 = \begin{bmatrix} p_{42} & p_{43} \\ p_{52} & p_{53} \end{bmatrix}, \quad Q = \begin{bmatrix} p_{44} & p_{45} \\ p_{54} & p_{55} \end{bmatrix}.
P = \begin{bmatrix} P_1 & 0 & 0 & 0 \\ 0 & P_2 & 0 & 0 \\ 0 & 0 & P_3 & 0 \\ D_1 & D_2 & D_3 & Q \end{bmatrix} = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix},

where

S = \begin{bmatrix} P_1 & 0 & 0 \\ 0 & P_2 & 0 \\ 0 & 0 & P_3 \end{bmatrix}, \quad 0 = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \quad D = [D_1 \; D_2 \; D_3]. \tag{3.6}
States 6, 7, and 8 are the transient states. The aggregated transition matrix in canonical form is shown in Equation (1.60) and is repeated below with the submatrices identified individually. The states, in order, are 1 (Scrapped), 2 (Sold), 3 (Training engineers), 4 (Training technicians), 5 (Training technical writers), 6 (Production stage 3), 7 (Production stage 2), and 8 (Production stage 1):

P = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0.50 & 0.30 & 0.20 & 0 & 0 & 0 \\ 0 & 0 & 0.30 & 0.45 & 0.25 & 0 & 0 & 0 \\ 0 & 0 & 0.10 & 0.35 & 0.55 & 0 & 0 & 0 \\ 0.20 & 0.16 & 0.04 & 0.03 & 0.02 & 0.55 & 0 & 0 \\ 0.15 & 0 & 0 & 0 & 0 & 0.20 & 0.65 & 0 \\ 0.10 & 0 & 0 & 0 & 0 & 0 & 0.15 & 0.75 \end{bmatrix} = \begin{bmatrix} P_1 & 0 & 0 & 0 \\ 0 & P_2 & 0 & 0 \\ 0 & 0 & P_3 & 0 \\ D_1 & D_2 & D_3 & Q \end{bmatrix} = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix},

where

S = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0.50 & 0.30 & 0.20 \\ 0 & 0 & 0.30 & 0.45 & 0.25 \\ 0 & 0 & 0.10 & 0.35 & 0.55 \end{bmatrix} = \begin{bmatrix} P_1 & 0 & 0 \\ 0 & P_2 & 0 \\ 0 & 0 & P_3 \end{bmatrix}, \quad Q = \begin{bmatrix} 0.55 & 0 & 0 \\ 0.20 & 0.65 & 0 \\ 0 & 0.15 & 0.75 \end{bmatrix},

D = \begin{bmatrix} 0.20 & 0.16 & 0.04 & 0.03 & 0.02 \\ 0.15 & 0 & 0 & 0 & 0 \\ 0.10 & 0 & 0 & 0 & 0 \end{bmatrix} = [D_1 \; D_2 \; D_3],

P_1 = P_2 = [1], \quad P_3 = \begin{bmatrix} 0.50 & 0.30 & 0.20 \\ 0.30 & 0.45 & 0.25 \\ 0.10 & 0.35 & 0.55 \end{bmatrix}, \quad D_1 = \begin{bmatrix} 0.20 \\ 0.15 \\ 0.10 \end{bmatrix}, \quad D_2 = \begin{bmatrix} 0.16 \\ 0 \\ 0 \end{bmatrix}, \quad D_3 = \begin{bmatrix} 0.04 & 0.03 & 0.02 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}. \tag{3.7}
The transition probability matrices, P1, P2, and P3, correspond to the three
closed classes of recurrent states, denoted by R1 = {1}, R 2 = {2}, and R3 = {3, 4, 5},
respectively. Note that P1 = [1] and P2 = [1] are each associated with a single
absorbing state. Each absorbing state forms its own closed communicating
class. The set of transient states is denoted by T = {6, 7, 8}.
As Section 1.10.2.1 indicates, a special case of a reducible multichain is an
absorbing multichain in which the submatrix S is an identity matrix, I. Thus,
in an absorbing multichain, which is generally called an absorbing Markov
chain, all the recurrent states are absorbing states. All the remaining states
are transient states. The canonical form of the transition matrix for an absorb-
ing multichain is
P = \begin{bmatrix} I & 0 \\ D & Q \end{bmatrix}. \tag{3.8}
P = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}. \tag{3.1}
U = (I − Q)−1. (3.10)
To see how the fundamental matrix can be used to answer the three ques-
tions of Section 3.2, it is instructive to consider, for simplicity, the canonical
form of the transition matrix for the reducible generic four-state unichain
shown in Equation (3.2). Observe that
U = (I - Q)^{-1} = \begin{bmatrix} 1 - p_{33} & -p_{34} \\ -p_{43} & 1 - p_{44} \end{bmatrix}^{-1} = \frac{1}{(1 - p_{33})(1 - p_{44}) - p_{34}p_{43}} \begin{bmatrix} 1 - p_{44} & p_{34} \\ p_{43} & 1 - p_{33} \end{bmatrix}, \tag{3.11}

with rows and columns indexed by the transient states 3 and 4.
Suppose that i and j are different transient states, so that i ≠ j. If the first
transition from transient state i is to any transient state k, with probability
pik, then state j will be entered a mean number of ukj times. State j will be
entered zero times if the first transition is to a recurrent state. For the case
in which i ≠ j,
u_{ij} = \sum_{k \in T} p_{ik}u_{kj}, \quad i \neq j. \tag{3.13}
U = \begin{bmatrix} u_{33} & u_{34} \\ u_{43} & u_{44} \end{bmatrix}, \quad Q = \begin{bmatrix} p_{33} & p_{34} \\ p_{43} & p_{44} \end{bmatrix}.
U = I + QU
U − QU = I
IU − QU = I
(I − Q)U = I
U = (I − Q)−1.
U = I + QU. (3.15)
U = (I − Q)−1. (3.16)
N_i = \sum_{j \in T} N_{ij},
where the sum is over all transient states, j. Let ui denote the mean num-
ber of times the chain enters all transient states before the chain eventually
enters a recurrent state, given that the chain starts in transient state i. It fol-
lows that
u_i = E(N_i) = E\!\left(\sum_{j \in T} N_{ij}\right) = \sum_{j \in T} E(N_{ij}) = \sum_{j \in T} u_{ij}, \tag{3.17}
where the sum is over all transient states, j. Thus, ui is the sum of the elements
in row i of U, the fundamental matrix. To answer question two, if the chain
begins in transient state i, it will make a mean total number of ui transitions
before eventually entering a recurrent state.
This result can be expressed in the form of a matrix equation. Suppose
that a reducible Markov chain has q transient states. Let e be a q-component
column vector with all entries one. Then the ith component of the column
vector
u = Ue (3.18)
represents the mean total number of transitions to all transient states before
the chain eventually enters a recurrent state, given that the chain started in
transient state i. The column vector Ue is computed for the reducible four-state
unichain for which the fundamental matrix is shown in Equation (3.11).
The sum

u_4 = u_{43} + u_{44} = \frac{p_{43} + 1 - p_{33}}{(1 - p_{33})(1 - p_{44}) - p_{34}p_{43}},
which is the last entry of column vector Ue, represents the mean number
of time periods that the reducible unichain, starting in transient state 4,
spends in transient states 3 and 4 before the chain eventually enters recurrent
state 1 or 2.
The transition probability matrix for the absorbing multichain is given in canonical form in Equation (1.59). For this model,

U = (I - Q)^{-1} = \begin{bmatrix} 5.2364 & 2.8563 & 4.2847 & 3.3321 \\ 2.9263 & 3.2798 & 3.4383 & 2.4682 \\ 3.5078 & 2.4858 & 5.0253 & 2.8382 \\ 3.8254 & 2.9621 & 4.0731 & 3.9494 \end{bmatrix}, \tag{3.21}

with rows and columns indexed by the transient states 1, 2, 3, and 4.
To answer question one of Section 3.2, uij, the (i, j)th entry of the funda-
mental matrix, represents the mean number of days a patient spends in tran-
sient state j before the patient eventually enters an absorbing state, given
that the patient starts in transient state i. Therefore, u43 = 4.0731 is the mean
number of days that a patient who is initially in physical therapy (transient
state 4) spends in surgery (transient state 3) before the patient is eventually
discharged (absorbed in state 0) or dies (absorbed in state 5).
To answer question two, the column vector Ue has as components the
mean total number of days a patient spends in each transient state before
the patient eventually enters an absorbing state. In other words, compo-
nent ui of the column vector Ue denotes the mean time to absorption for a
patient who starts in transient state i. The column vector Ue for the hospital
model is

Ue = [15.7095 \; 12.1126 \; 13.8571 \; 14.8100]^T. \tag{3.22}
The sum u4 = u41 + u42 + u43 + u44 = 14.81, which is the fourth entry of col-
umn vector Ue, represents the mean number of days that a patient initially
in physical therapy (transient state 4) spends in the diagnostic, outpatient,
surgery, and physical therapy departments before the patient eventually is
discharged or dies. Observe that matrix Q in Equation (3.20) is identical to
matrix Q in Equation (2.58). However, the entries in the vector of mean first
passage times (MFPTs) in Equation (3.22) differ slightly from those calcu-
lated by different methods in Equations (2.62) and (6.68). Discrepancies are
due to roundoff error because only the first four significant decimal digits
were stored.
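A minimal numerical check of the fundamental matrix and the mean times to absorption for the hospital model, using the submatrix Q quoted above:

import numpy as np

# Submatrix Q for the hospital model (identical to Q in Equation (2.58)).
Q = np.array([[0.4, 0.1, 0.2, 0.3],
              [0.1, 0.2, 0.3, 0.2],
              [0.2, 0.1, 0.4, 0.2],
              [0.3, 0.4, 0.2, 0.1]])

U = np.linalg.inv(np.eye(4) - Q)    # fundamental matrix, Equation (3.10)
print(np.round(U, 4))               # compare with Equation (3.21); U[3, 2] is about 4.0731
print(np.round(U.sum(axis=1), 4))   # Ue: mean times to absorption; last entry is about 14.81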
Imj = e + Qmj
Imj − Qmj = e
(I − Q)mj = e (3.23)
mj = (I − Q)−1 e = Ue.
m_j = \begin{bmatrix} m_{1j} \\ m_{2j} \\ m_{3j} \\ m_{4j} \end{bmatrix} = (I - Q)^{-1}e = Ue = \begin{bmatrix} u_{11} & u_{12} & u_{13} & u_{14} \\ u_{21} & u_{22} & u_{23} & u_{24} \\ u_{31} & u_{32} & u_{33} & u_{34} \\ u_{41} & u_{42} & u_{43} & u_{44} \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} u_{11}+u_{12}+u_{13}+u_{14} \\ u_{21}+u_{22}+u_{23}+u_{24} \\ u_{31}+u_{32}+u_{33}+u_{34} \\ u_{41}+u_{42}+u_{43}+u_{44} \end{bmatrix} = \begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \end{bmatrix}. \tag{3.24}
f_{ij}^{(1)} = p_{ij}, \tag{3.25}

because the probability of entering state j for the first time after one step is simply the one-step transition probability. To enter recurrent state j for the first time after two steps, the chain must enter any transient state k after one step, and move from k to j after one additional step. Since each transition is conditioned only on the present state, these two transitions are independent, and

f_{ij}^{(2)} = \sum_{k \in T} p_{ik}p_{kj} = \sum_{k \in T} p_{ik}f_{kj}^{(1)}, \tag{3.26}
where the sum is over all transient states, k. To enter recurrent state j for the
first time after three steps, the chain must enter any transient state k after
one step, and move from k to j for the first time after two additional steps.
Therefore,
f_{ij}^{(3)} = \sum_{k \in T} p_{ik}f_{kj}^{(2)} = \sum_{k \in T} p_{ik}f_{kj}^{(3-1)}, \tag{3.27}
where, once again, the sum is over all transient states, k. By continuing in this
manner, the following algebraic equation is obtained for recursively comput-
ing f ij( n ):
f_{ij}^{(n)} = \sum_{k \in T} p_{ik}f_{kj}^{(n-1)}, \tag{3.28}
where the sum is over all transient states, k. Note that in order to reach the
target recurrent state j for the first time on the nth step, the first step, with
probability pik, must be to any transient state k. This probability is multi-
plied by the probability, f kj( n− 1), of moving from transient state k to the target
recurrent state j for the first time in n − 1 additional steps. These products are
summed over all transient states k.
The recursive equation (3.28) can also be expressed in matrix form. Let
f j( n ) = [ f ij( n ) ] be the column vector of probabilities of first passage in n steps
from transient states to recurrent state j. Then, the algebraic equation (3.25)
has the matrix form
f_j^{(n)} = Q^{n-1}d_j. \tag{3.33}
P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0.6 & 0.4 & 0 & 0 \\ 0.2 & 0.3 & 0.5 & 0 \\ 0.3 & 0.2 & 0.1 & 0.4 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ D_1 & Q \end{bmatrix}, \tag{1.54}

where the states, in order, are 1 (Not Working), 2 (Working, with a Major Defect), 3 (Working, with a Minor Defect), and 4 (Working Properly).
In this model, state 1, Not Working, is the absorbing state, and states 2, 3, and 4 are transient. The vector of probabilities of absorption in 1 day from the three transient states is

f_1^{(1)} = d_1 = \begin{bmatrix} 0.6 \\ 0.2 \\ 0.3 \end{bmatrix}.

The vector of probabilities of absorption for the first time in 3 days is

f_1^{(3)} = Q^2d_1 = \begin{bmatrix} 0.16 & 0 & 0 \\ 0.27 & 0.25 & 0 \\ 0.19 & 0.09 & 0.16 \end{bmatrix} \begin{bmatrix} 0.6 \\ 0.2 \\ 0.3 \end{bmatrix} = \begin{bmatrix} 0.096 \\ 0.212 \\ 0.180 \end{bmatrix}. \tag{3.36}
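A short sketch that reproduces Equation (3.36) by iterating f_1^{(n)} = Qf_1^{(n-1)}:

import numpy as np

# From Equation (1.54): Q governs the transient states 2, 3, 4; d1 holds the
# one-step transition probabilities into the absorbing state 1.
Q = np.array([[0.4, 0.0, 0.0],
              [0.3, 0.5, 0.0],
              [0.2, 0.1, 0.4]])
d1 = np.array([0.6, 0.2, 0.3])

f = d1.copy()                       # f_1^{(1)} = d1
for n in range(2, 4):
    f = Q @ f                       # f_1^{(n)} = Q^{n-1} d1, Equation (3.33)
    print(f"n = {n}:", np.round(f, 4))   # n = 3 gives approximately [0.096 0.212 0.180]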
P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0.2 & 0.5 & 0.2 & 0.1 \\ 0.3 & 0.2 & 0.1 & 0.4 \\ 0 & 0.6 & 0.3 & 0.1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ D_1 & Q \end{bmatrix}, \tag{3.38}

where the states, in order, are 1 (Rain), 2 (Snow), 3 (Cloudy), and 4 (Sunny).
The vector of probabilities of first passage to rain in 3 days is

f_1^{(3)} = Q^2d_1 = \begin{bmatrix} 0.35 & 0.15 & 0.14 \\ 0.36 & 0.17 & 0.10 \\ 0.42 & 0.18 & 0.19 \end{bmatrix} \begin{bmatrix} 0.2 \\ 0.3 \\ 0 \end{bmatrix} = \begin{bmatrix} 0.115 \\ 0.123 \\ 0.138 \end{bmatrix}. \tag{3.41}
The probability that a cloudy day, state 3, will be absorbed in state 1 and
change to a rainy day for the first time 4 days later is given by f 31(4) = 0.0905 in
Equation (3.42). The same results are obtained in Equations (2.46) and (2.47).
Observe that the matrix Q in the treatment of the problem as an absorbing
chain in this section is a submatrix of the matrix Z in the treatment as a reg-
ular chain in Section 2.2.1. The matrix Q is obtained by deleting the first row
and first column of Z. As a result, the first entry, f_{jj}^{(n)}, of vector f_j^{(n)} in the treatment as a regular chain is omitted from the vector f_j^{(n)} in the treatment as an absorbing chain. Hence, the treatment as a regular chain in Section 2.2.1 yields additional information, namely, f_{jj}^{(n)}, the n-step first passage probability or n-step recurrence probability for the target state j in the original, regular Markov chain.
3.3.2.1.3 Unichain with Nonabsorbing Recurrent States
A reducible unichain may consist of a closed class of r recurrent states, none
of which are absorbing states, plus one or more transient states. The recur-
sive equation (3.33) for computing the vector of probabilities of first passage
in n steps can be extended such that it has the form

F^{(n)} = Q^{n-1}D, \tag{3.43}
where F(n) is the matrix of n-step first passage probabilities, vector f j( n ) is the
jth column of matrix F(n), and vector dj is the jth column of matrix D.
The probability of entering a recurrent closed set for the first time in n steps
is the sum of the n-step first passage probabilities to the states, which belong
to the recurrent closed set. If the recurrent closed set of states is denoted by
R, and f iR( n ) denotes the n-step first passage probability from transient state i
to the recurrent set R, then
f_{iR}^{(n)} = \sum_{j \in R} f_{ij}^{(n)}. \tag{3.44}
Consider the generic four-state reducible unichain for which the transition
matrix is shown below:
P = [p_{ij}] = \begin{bmatrix} p_{11} & p_{12} & 0 & 0 \\ p_{21} & p_{22} & 0 & 0 \\ p_{31} & p_{32} & p_{33} & p_{34} \\ p_{41} & p_{42} & p_{43} & p_{44} \end{bmatrix} = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}. \tag{3.2}
The probability of first passage in one step from transient state 4 to recurrent state 2 is f_{42}^{(1)} = p_{42}. Using Equation (3.44), the probability of first passage in one step from transient state 4 to the recurrent set of states R = {1, 2} is f_{4R}^{(1)} = f_{41}^{(1)} + f_{42}^{(1)} = p_{41} + p_{42}.
F^{(2)} = Q^{2-1}D = QD = \begin{bmatrix} p_{33} & p_{34} \\ p_{43} & p_{44} \end{bmatrix} \begin{bmatrix} p_{31} & p_{32} \\ p_{41} & p_{42} \end{bmatrix} = \begin{bmatrix} p_{33}p_{31}+p_{34}p_{41} & p_{33}p_{32}+p_{34}p_{42} \\ p_{43}p_{31}+p_{44}p_{41} & p_{43}p_{32}+p_{44}p_{42} \end{bmatrix} = \begin{bmatrix} f_{31}^{(2)} & f_{32}^{(2)} \\ f_{41}^{(2)} & f_{42}^{(2)} \end{bmatrix} = [\,f_1^{(2)} \; f_2^{(2)}\,]. \tag{3.48}
The probability of first passage in two steps from transient state 4 to recurrent state 2 is f_{42}^{(2)} = p_{43}p_{32} + p_{44}p_{42}. The matrix of probabilities of first passage in three steps is
F^{(3)} = Q^{3-1}D = Q^2D = \begin{bmatrix} p_{33}p_{33}+p_{34}p_{43} & p_{33}p_{34}+p_{34}p_{44} \\ p_{43}p_{33}+p_{44}p_{43} & p_{43}p_{34}+p_{44}p_{44} \end{bmatrix} \begin{bmatrix} p_{31} & p_{32} \\ p_{41} & p_{42} \end{bmatrix}
= \begin{bmatrix} (p_{33}p_{33}+p_{34}p_{43})p_{31}+(p_{33}p_{34}+p_{34}p_{44})p_{41} & (p_{33}p_{33}+p_{34}p_{43})p_{32}+(p_{33}p_{34}+p_{34}p_{44})p_{42} \\ (p_{43}p_{33}+p_{44}p_{43})p_{31}+(p_{43}p_{34}+p_{44}p_{44})p_{41} & (p_{43}p_{33}+p_{44}p_{43})p_{32}+(p_{43}p_{34}+p_{44}p_{44})p_{42} \end{bmatrix}
= \begin{bmatrix} f_{31}^{(3)} & f_{32}^{(3)} \\ f_{41}^{(3)} & f_{42}^{(3)} \end{bmatrix} = [\,f_1^{(3)} \; f_2^{(3)}\,].
The probability of first passage in three steps from transient state 4 to recur-
rent state 2 is
f 42(3) = ( p43 p33 + p44 p43 )p32 + ( p43 p34 + p44 p44 )p42.
Using Equation (3.44), the probability of first passage in three steps from
transient state 4 to the recurrent set of states R = {1, 2} is
f_{4R}^{(3)} = f_{41}^{(3)} + f_{42}^{(3)}. \tag{3.49}
P = \begin{bmatrix} 0.2 & 0.8 & 0 & 0 \\ 0.6 & 0.4 & 0 & 0 \\ 0.2 & 0.3 & 0.5 & 0 \\ 0.3 & 0.2 & 0.1 & 0.4 \end{bmatrix} = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}. \tag{1.56}
F^{(1)} = Q^{1-1}D = Q^0D = D = \begin{bmatrix} 0.2 & 0.3 \\ 0.3 & 0.2 \end{bmatrix} = \begin{bmatrix} f_{31}^{(1)} & f_{32}^{(1)} \\ f_{41}^{(1)} & f_{42}^{(1)} \end{bmatrix} = [\,f_1^{(1)} \; f_2^{(1)}\,]. \tag{3.50}
F^{(2)} = QD = \begin{bmatrix} 0.5 & 0 \\ 0.1 & 0.4 \end{bmatrix} \begin{bmatrix} 0.2 & 0.3 \\ 0.3 & 0.2 \end{bmatrix} = \begin{bmatrix} 0.10 & 0.15 \\ 0.14 & 0.11 \end{bmatrix} = \begin{bmatrix} f_{31}^{(2)} & f_{32}^{(2)} \\ f_{41}^{(2)} & f_{42}^{(2)} \end{bmatrix} = [\,f_1^{(2)} \; f_2^{(2)}\,]. \tag{3.51}

F^{(3)} = Q^2D = \begin{bmatrix} 0.25 & 0 \\ 0.09 & 0.16 \end{bmatrix} \begin{bmatrix} 0.2 & 0.3 \\ 0.3 & 0.2 \end{bmatrix} = \begin{bmatrix} 0.050 & 0.075 \\ 0.066 & 0.059 \end{bmatrix} = \begin{bmatrix} f_{31}^{(3)} & f_{32}^{(3)} \\ f_{41}^{(3)} & f_{42}^{(3)} \end{bmatrix} = [\,f_1^{(3)} \; f_2^{(3)}\,]. \tag{3.52}
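A sketch that reproduces Equations (3.50) through (3.52) and, via Equation (3.44), the n-step first passage probabilities to the recurrent set R = {1, 2} as row sums:

import numpy as np

# Submatrices D and Q from Equation (1.56): transient states 3 and 4.
D = np.array([[0.2, 0.3],
              [0.3, 0.2]])        # one-step transitions into recurrent states 1 and 2
Q = np.array([[0.5, 0.0],
              [0.1, 0.4]])        # transitions among transient states

F = D.copy()                      # F^{(1)} = D
for n in range(1, 4):
    print(f"n = {n}:", np.round(F, 4), "row sums f_iR:", np.round(F.sum(axis=1), 4))
    F = Q @ F                     # F^{(n+1)} = Q F^{(n)} = Q^n D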
P = \begin{bmatrix} P_1 & 0 & 0 \\ 0 & P_2 & 0 \\ D_1 & D_2 & Q \end{bmatrix} = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}, \tag{3.4}

where

S = \begin{bmatrix} P_1 & 0 \\ 0 & P_2 \end{bmatrix}, \quad 0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad D = [D_1 \; D_2].
The transition probability matrix for the generic five-state reducible multichain, which has M = 2 recurrent chains, is

P = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & p_{22} & p_{23} & 0 & 0 \\ 0 & p_{32} & p_{33} & 0 & 0 \\ p_{41} & p_{42} & p_{43} & p_{44} & p_{45} \\ p_{51} & p_{52} & p_{53} & p_{54} & p_{55} \end{bmatrix} = \begin{bmatrix} P_1 & 0 & 0 \\ 0 & P_2 & 0 \\ D_1 & D_2 & Q \end{bmatrix}. \tag{3.5}
F_2^{(2)} = QD_2 = \begin{bmatrix} p_{44} & p_{45} \\ p_{54} & p_{55} \end{bmatrix} \begin{bmatrix} p_{42} & p_{43} \\ p_{52} & p_{53} \end{bmatrix} = \begin{bmatrix} p_{44}p_{42}+p_{45}p_{52} & p_{44}p_{43}+p_{45}p_{53} \\ p_{54}p_{42}+p_{55}p_{52} & p_{54}p_{43}+p_{55}p_{53} \end{bmatrix} = \begin{bmatrix} f_{42}^{(2)} & f_{43}^{(2)} \\ f_{52}^{(2)} & f_{53}^{(2)} \end{bmatrix} = [\,f_2^{(2)} \; f_3^{(2)}\,]. \tag{3.55}
Observe that the probability of first passage in two steps from transient state 4 to recurrent state 2 is f_{42}^{(2)} = p_{44}p_{42} + p_{45}p_{52}. Using Equation (3.44), the probability of first passage in two steps from state 4 to R_2 is f_{4R_2}^{(2)} = f_{42}^{(2)} + f_{43}^{(2)}.
P = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0.4 & 0.1 & 0.2 & 0.3 \\ 0.15 & 0.05 & 0.1 & 0.2 & 0.3 & 0.2 \\ 0.07 & 0.03 & 0.2 & 0.1 & 0.4 & 0.2 \\ 0 & 0 & 0.3 & 0.4 & 0.2 & 0.1 \end{bmatrix} = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix} = \begin{bmatrix} I & 0 \\ D & Q \end{bmatrix}, \tag{1.59a}

where the states are ordered 0, 5, 1, 2, 3, 4, and

D = \begin{bmatrix} 0 & 0 \\ 0.15 & 0.05 \\ 0.07 & 0.03 \\ 0 & 0 \end{bmatrix}, \quad Q = \begin{bmatrix} 0.4 & 0.1 & 0.2 & 0.3 \\ 0.1 & 0.2 & 0.3 & 0.2 \\ 0.2 & 0.1 & 0.4 & 0.2 \\ 0.3 & 0.4 & 0.2 & 0.1 \end{bmatrix}.
F^{(1)} = D = \begin{bmatrix} 0 & 0 \\ 0.15 & 0.05 \\ 0.07 & 0.03 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} f_{10}^{(1)} & f_{15}^{(1)} \\ f_{20}^{(1)} & f_{25}^{(1)} \\ f_{30}^{(1)} & f_{35}^{(1)} \\ f_{40}^{(1)} & f_{45}^{(1)} \end{bmatrix}. \tag{3.59}
F^{(n)} = [\,f_1^{(n)} \; f_2^{(n)} \; f_3^{(n)} \; f_4^{(n)} \; f_5^{(n)}\,], \tag{3.61}
After one step, an item in state 6, production stage 3, will be in state 3, used
to train engineers, with probability f 63(1) = 0.04, or in state 4, used to train
technicians, with probability f 64(1) = 0.03, or in state 5, used to train techni-
cal writers, with probability f 65(1) = 0.02. Therefore, after one step, an item in
state 6, production stage 3, will be sent to the training center with probability
f 63(1) + f 64(1) + f 65(1) = 0.04 + 0.03 + 0.02 = 0.09, where it will remain.
The matrix of two-step absorption probabilities is

F^{(2)} = QD = \begin{bmatrix} 0.1100 & 0.0880 & 0.0220 & 0.0165 & 0.0110 \\ 0.1375 & 0.0320 & 0.0080 & 0.0060 & 0.0040 \\ 0.1175 & 0.0160 & 0.0040 & 0.0030 & 0.0020 \end{bmatrix},

with rows indexed by the transient states 6, 7, and 8 and columns by the recurrent states 1 through 5.
After two steps, an item in state 6, production stage 3, will, for the first time,
be in state 3, used to train engineers, with probability f 63(2) = 0.022, or in state
4, used to train technicians, with probability f 64(2) = 0.0165, or in state 5, used
to train technical writers, with probability f 65(2) = 0.011. Therefore, after two
steps, an item in state 6, production stage 3, will, for the first time, be sent to the training center with probability f_{63}^{(2)} + f_{64}^{(2)} + f_{65}^{(2)} = 0.022 + 0.0165 + 0.011 = 0.0495.
P = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}. \tag{3.1}
In matrix form,

F = \sum_{n=1}^{\infty} F^{(n)} = F^{(1)} + F^{(2)} + F^{(3)} + \cdots = D + F^{(2)} + F^{(3)} + \cdots
= D + \sum_{n=2}^{\infty} F^{(n)} = ID + \sum_{n=2}^{\infty} F^{(n)}
= ID + \sum_{n=2}^{\infty} Q^{n-1}D = ID + (Q + Q^2 + Q^3 + \cdots)D
= (I + Q + Q^2 + Q^3 + \cdots)D. \tag{3.68}
The entries of Qn give the probabilities of being in each of the transient states
after n steps for each possible transient starting state. After zero steps the
chain is in the transient state in which it started, so that Q 0 = I. As the chain
is certain to eventually enter a closed set of recurrent states within which
it will remain forever, the chain will eventually never return to a transient
state. Therefore, the probability of being in the transient states after n steps
approaches zero. In other words, if i and j are both transient states, then
limn→∞ pij( n ) = 0, irrespective of the transient starting state i. It follows that
every entry of Qn must approach zero as n→∞. That is,
\lim_{n \to \infty} Q^n = 0, \tag{3.69}
Let Y = I + Q + Q^2 + \cdots + Q^{n-1}. Then

QY = Q + Q^2 + Q^3 + Q^4 + \cdots + Q^{n-1} + Q^n, \tag{3.71}

Y - QY = IY - QY = (I - Q)Y = I - Q^n. \tag{3.72}

Letting n \to \infty, so that Q^n \to 0,

(I - Q)Y = I
Y = (I - Q)^{-1} = I + Q + Q^2 + Q^3 + \cdots, \tag{3.73}
where the sum of the infinite series, Y = (I−Q)−1, is defined in Equation (3.10)
as the fundamental matrix of a reducible Markov chain, denoted by U.
Therefore, the fundamental matrix can be expressed as the sum of an infinite
series of substochastic matrices, Qn, as shown in Equation (3.74).
This result is analogous to the formula for the sum of an infinite geometric series, which indicates that

1 + q + q^2 + q^3 + \cdots = \frac{1}{1 - q},

where q is a number less than one in absolute value. Using this result in the calculation of F,

F = (I + Q + Q^2 + Q^3 + \cdots)D = (I - Q)^{-1}D = UD. \tag{3.76}
1. The chain enters target recurrent state j after one step with probabil-
ity pij.
2. With probability pih the chain enters a nontarget recurrent state h,
which communicates with the target recurrent state j. In this case the
first step, with probability pih, does not contribute to fij, the probabil-
ity of eventual passage from transient state i to the target state j, but
By combining the two relevant events, (1) and (4), in which the target state j is
reached after starting in a transient state i, the following system of algebraic
equations is produced:
f_{ij} = p_{ij} + \sum_{k \in T} p_{ik}f_{kj}, \tag{3.77}
where the sum is over all transient states k. The formula in Equation (3.77)
represents a system of equations because the unknown fij is expressed
in terms of all the unknowns f kj. In matrix form, the system of equations
(3.77) is
F = D + QF
F − QF = D
IF − QF = D (3.78)
( I − Q )F = D
F = (I − Q)−1 D = UD, (3.76)
which confirms the previous result. Thus, fij, the probability of eventual pas-
sage from a transient state i to a target recurrent state j, can be calculated by
solving either the system (3.77) of algebraic equations, or the matrix equation
(3.76). In Equation (3.76), F = [fij] is the matrix of probabilities of eventual pas-
sage from transient states to recurrent states. Observe that two alternative
matrix forms of the system of equations (3.77) are given in Equations (3.78)
and (3.76).
P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0.6 & 0.4 & 0 & 0 \\ 0.2 & 0.3 & 0.5 & 0 \\ 0.3 & 0.2 & 0.1 & 0.4 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ D & Q \end{bmatrix}, \tag{1.54}

where the states, in order, are 1 (Not Working), 2 (Working, with a Major Defect), 3 (Working, with a Minor Defect), and 4 (Working Properly).
I - Q = \begin{bmatrix} 0.6 & 0 & 0 \\ -0.3 & 0.5 & 0 \\ -0.2 & -0.1 & 0.6 \end{bmatrix}, \quad U = (I - Q)^{-1} = \begin{bmatrix} 1.6667 & 0 & 0 \\ 1 & 2 & 0 \\ 0.7222 & 0.3333 & 1.6667 \end{bmatrix},

with rows and columns indexed by the transient states 2, 3, and 4.
P = [p_{ij}] = \begin{bmatrix} p_{11} & p_{12} & 0 & 0 \\ p_{21} & p_{22} & 0 & 0 \\ p_{31} & p_{32} & p_{33} & p_{34} \\ p_{41} & p_{42} & p_{43} & p_{44} \end{bmatrix} = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}. \tag{3.2}
Suppose that the probability f32 of eventual passage from transient state 3
to recurrent state 2 is of interest. Since Equation (3.77) for calculating f32 is
expressed in terms of both f32 and f42, the following system of two algebraic
equations must be solved:

f_{32} = p_{32} + p_{33}f_{32} + p_{34}f_{42}
f_{42} = p_{42} + p_{43}f_{32} + p_{44}f_{42}.
These two sums both equal one because eventual passage from a tran-
sient state to the single recurrent closed class is certain. This result can be
extended to apply to any reducible unichain. One may conclude that start-
ing from any particular transient state, the sum of the probabilities of even-
tual passage to all of the recurrent states is one. For the generic four-state
reducible unichain, note that the probabilities of eventual passage, f31, f32,
f41, and f42, are independent of the transition probabilities for the recurrent
closed class, p11, p12, p21, and p22. Therefore, if states 1 and 2 were converted
to absorbing states, making the chain a generic four-state absorbing multi-
chain, the probabilities of eventual passage, which would be called absorp-
tion probabilities, would be unchanged. To illustrate this result, consider a
generic four-state absorbing multichain, in which states 1 and 2 are absorb-
ing states. The absorbing multichain has the following transition probability
matrix in canonical form:
P = [p_{ij}] = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ p_{31} & p_{32} & p_{33} & p_{34} \\ p_{41} & p_{42} & p_{43} & p_{44} \end{bmatrix} = \begin{bmatrix} I & 0 \\ D & Q \end{bmatrix}. \tag{3.85}
The same result is obtained for F as the one obtained in Equation (3.82) for
the matrix of probabilities of eventual passage to recurrent states for the
generic four-state reducible unichain. However, in this case, as Section 3.5.2
indicates, F is called the matrix of absorption probabilities for the generic
four-state absorbing multichain.
associated with the modified maintenance policy, under which the engineer
always does nothing to a machine in state 3, is given in canonical form in
Equation (1.56).
P = \begin{bmatrix} 0.2 & 0.8 & 0 & 0 \\ 0.6 & 0.4 & 0 & 0 \\ 0.2 & 0.3 & 0.5 & 0 \\ 0.3 & 0.2 & 0.1 & 0.4 \end{bmatrix} = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}. \tag{1.56}
The recurrent set of states is R = {1,2}. The set of transient states is T = {3,4}.
Solving Equation (3.84), the matrix of probabilities of eventual passage from
transient states to recurrent states is
F = \begin{bmatrix} f_{31} & f_{32} \\ f_{41} & f_{42} \end{bmatrix} = \begin{bmatrix} 0.4 & 0.6 \\ 0.5667 & 0.4333 \end{bmatrix}. \tag{3.86}
When the machine is used under the modified maintenance policy pre-
scribed in Section 1.10.1.2.2.1, the probability of eventual passage from tran-
sient state 3, working, with a minor defect, to recurrent state 1, not working,
is f31 = 0.4. Also, under the modified maintenance policy, the probability
of eventual passage from transient state 4, working properly, to recurrent
state 2, working, with a major defect, is f42 = 0.4333.
Alternatively, using Equation (3.76), where

U = (I - Q)^{-1} = \frac{1}{0.3} \begin{bmatrix} 0.6 & 0 \\ 0.1 & 0.5 \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ 0.3333 & 1.6667 \end{bmatrix},

F = UD = (I - Q)^{-1}D = \begin{bmatrix} 2 & 0 \\ 0.3333 & 1.6667 \end{bmatrix} \begin{bmatrix} 0.2 & 0.3 \\ 0.3 & 0.2 \end{bmatrix} = \begin{bmatrix} 0.4 & 0.6 \\ 0.5667 & 0.4333 \end{bmatrix},

which agrees with Equation (3.86).
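A minimal numerical check of Equation (3.86) via F = UD:

import numpy as np

# Submatrices D and Q from Equation (1.56) for the modified maintenance policy.
D = np.array([[0.2, 0.3],
              [0.3, 0.2]])
Q = np.array([[0.5, 0.0],
              [0.1, 0.4]])

U = np.linalg.inv(np.eye(2) - Q)    # fundamental matrix
F = U @ D                           # Equation (3.76): probabilities of eventual passage
print(np.round(F, 4))               # approximately [[0.4 0.6] [0.5667 0.4333]]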
P = [p_{ij}] = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 1-p & 0 & 0 & p & 0 \\ 0 & 0 & 1-p & 0 & p \\ 0 & p & 0 & 1-p & 0 \end{bmatrix} = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix} = \begin{bmatrix} I & 0 \\ D & Q \end{bmatrix}, \tag{1.11}

where the states are ordered 0, 4, 1, 2, 3.
Observe that
(I - Q) = \begin{bmatrix} 1 & -p & 0 \\ -(1-p) & 1 & -p \\ 0 & -(1-p) & 1 \end{bmatrix}, \tag{3.88}

with rows and columns indexed by the transient states 1, 2, and 3.
U = (I - Q)^{-1} = \frac{1}{p^2 + (1-p)^2} \begin{bmatrix} p + (1-p)^2 & p & p^2 \\ 1-p & 1 & p \\ (1-p)^2 & 1-p & p^2 + (1-p) \end{bmatrix}, \tag{3.89}

with rows and columns indexed by the transient states 1, 2, and 3.
Note that when player A starts with $2, the expected number of times that
she will have $2 is given by
u_{22} = \frac{1}{p^2 + (1-p)^2}, \tag{3.90}
the entry in row 2 and column 2 of the fundamental matrix. The expected
number of times that she will have $2, given that she starts with $2, varies
between one and two. This expected value is one if p = 0 or p = 1. If player
A starts with $2 and p = 0, she will have $1 remaining after the first bet, and
lose the game with $0 remaining after the second bet. If player A starts with
$2 and p = 1, she will have a total of $3 after the first bet, and win the game
with a total of $4 after the second bet. Therefore, if p = 0 or 1, and player A
starts in state 2, she will have $2 only until she makes the first bet. If player
A starts with $2 and p = 1/2, the expected number of times that she will have
$2 is two.
Using Equation (3.76), the matrix of absorption probabilities is equal to

F = UD = (I - Q)^{-1}D
= \frac{1}{p^2 + (1-p)^2} \begin{bmatrix} p + (1-p)^2 & p & p^2 \\ 1-p & 1 & p \\ (1-p)^2 & 1-p & p^2 + (1-p) \end{bmatrix} \begin{bmatrix} 1-p & 0 \\ 0 & 0 \\ 0 & p \end{bmatrix}
= \frac{1}{p^2 + (1-p)^2} \begin{bmatrix} p(1-p) + (1-p)^3 & p^3 \\ (1-p)^2 & p^2 \\ (1-p)^3 & p(1-p) + p^3 \end{bmatrix}, \tag{3.91}

with rows indexed by the transient states 1, 2, and 3 and columns by the absorbing states 0 and 4.
If player A starts with $2, the probability that she will eventually lose all her
money and be ruined is given by
f_{20} = \frac{(1-p)^2}{p^2 + (1-p)^2}, \tag{3.92}
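The symbolic expressions in Equations (3.89) through (3.92) can be verified with a short computer algebra sketch (sympy); the printed form may differ in appearance but is algebraically equivalent.

from sympy import symbols, Matrix, eye, simplify

p = symbols('p', positive=True)

# Transient states 1, 2, 3 of the gambler's ruin chain in Equation (1.11).
Q = Matrix([[0, p, 0],
            [1 - p, 0, p],
            [0, 1 - p, 0]])
D = Matrix([[1 - p, 0],
            [0, 0],
            [0, p]])            # columns: absorbing states 0 and 4

U = (eye(3) - Q).inv()
F = simplify(U * D)             # Equation (3.76)
print(simplify(F[1, 0]))        # f_20, equivalent to (1 - p)**2 / (p**2 + (1 - p)**2)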
P = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0.4 & 0.1 & 0.2 & 0.3 \\ 0.15 & 0.05 & 0.1 & 0.2 & 0.3 & 0.2 \\ 0.07 & 0.03 & 0.2 & 0.1 & 0.4 & 0.2 \\ 0 & 0 & 0.3 & 0.4 & 0.2 & 0.1 \end{bmatrix} = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix} = \begin{bmatrix} I & 0 \\ D & Q \end{bmatrix}, \tag{1.59}

where the states are ordered 0, 5, 1, 2, 3, 4. Using Equation (3.76), the matrix of absorption probabilities is

F = UD = \begin{bmatrix} 0.7284 & 0.2714 \\ 0.7327 & 0.2671 \\ 0.7246 & 0.2750 \\ 0.7294 & 0.2703 \end{bmatrix}, \tag{3.93}

with rows indexed by the transient states 1, 2, 3, and 4 and columns by the absorbing states 0 and 5.
The entry f30 = 0.7246 shows that a surgery patient has a probability of
0.7246 of eventually being discharged, while the entry f35 = 0.2750 shows that
a surgery patient has a probability of 0.2750 of dying. The entries in the first
column of the matrix F indicate that the probabilities that a patient in the
diagnostic, outpatient, surgery, and physical therapy departments, respec-
tively, will eventually be discharged are 0.7284, 0.7327, 0.7246, and 0.7294,
respectively. Of course, the absorption probabilities in each row of matrix F
sum to one because eventually a patient must be discharged or die.
Now assume that the fundamental matrix for the absorbing multichain has
not been calculated. As Section 3.3.3 indicates, fi0, the probability of absorp-
tion in state 0, given that the chain starts in a transient state i, can also be
calculated by solving the system of equations (3.77).
f_{i0} = p_{i0} + \sum_{k \in T} p_{ik}f_{k0}. \tag{3.77}
P = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & p_{22} & p_{23} & 0 & 0 \\ 0 & p_{32} & p_{33} & 0 & 0 \\ p_{41} & p_{42} & p_{43} & p_{44} & p_{45} \\ p_{51} & p_{52} & p_{53} & p_{54} & p_{55} \end{bmatrix} = \begin{bmatrix} P_1 & 0 & 0 \\ 0 & P_2 & 0 \\ D_1 & D_2 & Q \end{bmatrix}. \tag{3.5}
Since the five-state reducible multichain has M = 2 recurrent chains, and the target recurrent state 2 belongs to the second recurrent chain R_2 = {2, 3}, matrix F_2 will be calculated.
Thus,

F_2 = \begin{bmatrix} f_{42} & f_{43} \\ f_{52} & f_{53} \end{bmatrix}. \tag{3.101}
U = (I - Q)^{-1} = \begin{bmatrix} 0.45 & 0 & 0 \\ -0.20 & 0.35 & 0 \\ 0 & -0.15 & 0.25 \end{bmatrix}^{-1} = \begin{bmatrix} 2.2222 & 0 & 0 \\ 1.2698 & 2.8571 & 0 \\ 0.7619 & 1.7143 & 4 \end{bmatrix}, \tag{3.102}

with rows and columns indexed by the transient states 6, 7, and 8.
F = UD = U[D_1 \; D_2 \; D_3] = \begin{bmatrix} 2.2222 & 0 & 0 \\ 1.2698 & 2.8571 & 0 \\ 0.7619 & 1.7143 & 4 \end{bmatrix} \begin{bmatrix} 0.20 & 0.16 & 0.04 & 0.03 & 0.02 \\ 0.15 & 0 & 0 & 0 & 0 \\ 0.10 & 0 & 0 & 0 & 0 \end{bmatrix}
= \begin{bmatrix} 0.4444 & 0.3556 & 0.0889 & 0.0667 & 0.0444 \\ 0.6825 & 0.2032 & 0.0508 & 0.0381 & 0.0254 \\ 0.8095 & 0.1219 & 0.0305 & 0.0229 & 0.0152 \end{bmatrix} = [F_1 \; F_2 \; F_3],

with rows indexed by the transient states 6, 7, and 8 and columns by the recurrent states 1 through 5,
where F_1 = [f_{61} \; f_{71} \; f_{81}]^T, F_2 = [f_{62} \; f_{72} \; f_{82}]^T, and F_3 = [f_{ij}], i = 6, 7, 8, j = 3, 4, 5, is the matrix of probabilities of eventual passage from the transient states to the recurrent states in R_3.
For an entering item, which starts in transient state 8, the probability of being
scrapped, or absorbed in state 1, is f81 = 0.8095, while the probability of being
sold, or absorbed in state 2, is f82 = 0.1219. The probability that an entering
item will enter the training center to be used initially for training engineers
is f_{83} = 0.0305. The probability that an entering item will eventually enter the training center, to be used for training engineers, technicians, and technical writers, is given by f_{83} + f_{84} + f_{85} = 0.0305 + 0.0229 + 0.0152 = 0.0686.
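A sketch that reproduces the absorption probabilities and the aggregated probability f_{8R_3} for the production model:

import numpy as np

# Submatrices from Equation (3.7): transient states 6, 7, 8.
Q = np.array([[0.55, 0.00, 0.00],
              [0.20, 0.65, 0.00],
              [0.00, 0.15, 0.75]])
D = np.array([[0.20, 0.16, 0.04, 0.03, 0.02],
              [0.15, 0.00, 0.00, 0.00, 0.00],
              [0.10, 0.00, 0.00, 0.00, 0.00]])   # columns: recurrent states 1, 2, 3, 4, 5

U = np.linalg.inv(np.eye(3) - Q)
F = U @ D                                        # probabilities of eventual passage
print(np.round(F, 4))                            # row for state 8: [0.8095 0.1219 0.0305 0.0229 0.0152]
print(round(F[2, 2:].sum(), 4))                  # f_8R3 = f_83 + f_84 + f_85, about 0.0686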
Suppose that an order for 100 items is received from a customer. If exactly 100 items are started, then the firm can expect to sell only 100f_{82} = 100(0.1219) = 12.19, or about 12, of them.
f_{64} = p_{64} + \sum_{k=6,7,8} p_{6k}f_{k4} = p_{64} + p_{66}f_{64} + p_{67}f_{74} + p_{68}f_{84}
f_{74} = p_{74} + \sum_{k=6,7,8} p_{7k}f_{k4} = p_{74} + p_{76}f_{64} + p_{77}f_{74} + p_{78}f_{84} \tag{3.107a}
f_{84} = p_{84} + \sum_{k=6,7,8} p_{8k}f_{k4} = p_{84} + p_{86}f_{64} + p_{87}f_{74} + p_{88}f_{84}.
Equation (3.109) produces the same solution for the vector [f64, f 74, f84]T as does
Equation (3.103).
P = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & p_{22} & p_{23} & 0 & 0 \\ 0 & p_{32} & p_{33} & 0 & 0 \\ p_{41} & p_{42} & p_{43} & p_{44} & p_{45} \\ p_{51} & p_{52} & p_{53} & p_{54} & p_{55} \end{bmatrix} = \begin{bmatrix} P_1 & 0 & 0 \\ 0 & P_2 & 0 \\ D_1 & D_2 & Q \end{bmatrix}. \tag{3.5}
Recall that the probabilities f42, f43, f52, and f53 of eventual passage were com-
puted in Section 3.3.3.2.3. To compute the probability of eventual passage
from a transient state to the recurrent closed class denoted by R 2 = {2,3}, R 2 is
replaced by an absorbing state. The following associated absorbing Markov
chain is formed with transition probability matrix denoted by PA.
P_A = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ p_{41} & p_{42}+p_{43} & p_{44} & p_{45} \\ p_{51} & p_{52}+p_{53} & p_{54} & p_{55} \end{bmatrix} = \begin{bmatrix} I & 0 \\ [D_1 \; d_2] & Q \end{bmatrix}, \quad d_2 = \begin{bmatrix} p_{42}+p_{43} \\ p_{52}+p_{53} \end{bmatrix}, \tag{3.110}
where the transition probability in each row of vector d2 is equal to the sum
of the transition probabilities in the same row of matrix D2. As the matrix
Q for the original multichain is unchanged, the fundamental matrix U for
the absorbing multichain, calculated in Equation (3.100), is also unchanged.
Applying Equation (3.99) to calculate f2, which denotes the vector of probabil-
ities of eventual passage from the transient states to the closed class of recur-
rent states, R2 = {2,3},
f_2 = Ud_2 = \begin{bmatrix} f_{42}+f_{43} \\ f_{52}+f_{53} \end{bmatrix} = \begin{bmatrix} f_{4R_2} \\ f_{5R_2} \end{bmatrix}, \tag{3.111}
where

f_{4R_2} = \frac{(1-p_{55})(p_{42}+p_{43}) + p_{45}(p_{52}+p_{53})}{(1-p_{44})(1-p_{55}) - p_{45}p_{54}}, \quad f_{5R_2} = \frac{p_{54}(p_{42}+p_{43}) + (1-p_{44})(p_{52}+p_{53})}{(1-p_{44})(1-p_{55}) - p_{45}p_{54}}. \tag{3.112}
In this example,

f_{iR_2} = f_{i2} + f_{i3}. \tag{3.113}
P_A = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0.20 & 0.16 & 0.09 & 0.55 & 0 & 0 \\ 0.15 & 0 & 0 & 0.20 & 0.65 & 0 \\ 0.10 & 0 & 0 & 0 & 0.15 & 0.75 \end{bmatrix} = \begin{bmatrix} I & 0 \\ [D_1 \; D_2 \; d_3] & Q \end{bmatrix}, \tag{3.114}

where the states, in order, are 1 (Scrapped), 2 (Sold), R_3 (Training center), 6 (Stage 3), 7 (Stage 2), 8 (Stage 1), and

D_1 = \begin{bmatrix} 0.20 \\ 0.15 \\ 0.10 \end{bmatrix}, \quad D_2 = \begin{bmatrix} 0.16 \\ 0 \\ 0 \end{bmatrix}, \quad d_3 = \begin{bmatrix} 0.04+0.03+0.02 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 0.09 \\ 0 \\ 0 \end{bmatrix}, \quad Q = \begin{bmatrix} 0.55 & 0 & 0 \\ 0.20 & 0.65 & 0 \\ 0 & 0.15 & 0.75 \end{bmatrix},
and the transition probability in each row of vector d3 is equal to the sum of
the transition probabilities in the same row of matrix D3. As the matrix Q
for the original multichain is unchanged, the fundamental matrix U for the
absorbing multichain, calculated in Equation (3.102), is also unchanged.
Applying Equation (3.98),
F = UD = U[D_1 \; D_2 \; d_3] = \begin{bmatrix} 2.2222 & 0 & 0 \\ 1.2698 & 2.8571 & 0 \\ 0.7619 & 1.7143 & 4 \end{bmatrix} \begin{bmatrix} 0.20 & 0.16 & 0.09 \\ 0.15 & 0 & 0 \\ 0.10 & 0 & 0 \end{bmatrix}
= \begin{bmatrix} 0.4444 & 0.3556 & 0.2 \\ 0.6825 & 0.2032 & 0.1143 \\ 0.8095 & 0.1219 & 0.0686 \end{bmatrix} = \begin{bmatrix} f_{61} & f_{62} & f_{6R_3} \\ f_{71} & f_{72} & f_{7R_3} \\ f_{81} & f_{82} & f_{8R_3} \end{bmatrix}, \tag{3.115a}

so that

\begin{bmatrix} f_{6R_3} \\ f_{7R_3} \\ f_{8R_3} \end{bmatrix} = \begin{bmatrix} 0.2 \\ 0.1143 \\ 0.0686 \end{bmatrix}. \tag{3.115b}
f_{iR} = \sum_{j \in R} f_{ij}, \tag{3.116}
where the recurrent states j belong to the recurrent closed set R. For this
example, the probability that an entering item will eventually be sent to the training center is equal to f_{8R_3} = f_{83} + f_{84} + f_{85} = 0.0686.
The second method is to compute the probability of eventual passage from a transient state to a recurrent closed class R by solving the system of equations

f_{iR} = \sum_{k \in R} p_{ik} + \sum_{h \in T} p_{ih}f_{hR}, \tag{3.118}

where i is a transient starting state, the first sum is over all recurrent states, k, which belong to R, and the second sum is over all transient states, h, which belong to T, the set of all transient states.
Consider again the following transition matrix for the generic five-state
reducible multichain treated by method one in Section 3.4.1.1. The chain has
two recurrent closed sets, one of which is an absorbing state, plus two tran-
sient states. The transition probability matrix is represented in canonical form in Equation (3.5).
Method 2 will be applied to compute the probabilities of absorption in
state 1 from the transient states 4 and 5. These absorption probabilities are
denoted by f41 and f51, respectively. Applying Equation (3.118),
P = \begin{bmatrix} P_1 & 0 & \cdots & 0 \\ 0 & P_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & P_M \end{bmatrix}, \tag{3.123}
where π(k)j is the steady-state probability for state j, which belongs to the
recurrent chain Rk. Let
\pi(k) = [\pi(k)_1 \; \pi(k)_2 \; \cdots \; \pi(k)_N]. \tag{3.125}

The limiting transition probability matrix is

\lim_{n \to \infty} P^n = \begin{bmatrix} \Pi(1) & 0 & \cdots & 0 \\ 0 & \Pi(2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Pi(M) \end{bmatrix},

where Π(k) is a matrix with each row π(k), the steady-state probability vector for the transition probability matrix P_k.
The limiting transition probability matrix is computed for the following
numerical example of a four-state recurrent multichain with M = 2 recurrent
chains:
P = \begin{bmatrix} 0.2 & 0.8 & 0 & 0 \\ 0.6 & 0.4 & 0 & 0 \\ 0 & 0 & 0.7 & 0.3 \\ 0 & 0 & 0.5 & 0.5 \end{bmatrix} = \begin{bmatrix} P_1 & 0 \\ 0 & P_2 \end{bmatrix}. \tag{3.128}

\lim_{n \to \infty} P^n = \begin{bmatrix} \lim_{n \to \infty} P_1^n & 0 \\ 0 & \lim_{n \to \infty} P_2^n \end{bmatrix} = \begin{bmatrix} \Pi(1) & 0 \\ 0 & \Pi(2) \end{bmatrix} = \begin{bmatrix} \pi(1)_1 & \pi(1)_2 & 0 & 0 \\ \pi(1)_1 & \pi(1)_2 & 0 & 0 \\ 0 & 0 & \pi(2)_3 & \pi(2)_4 \\ 0 & 0 & \pi(2)_3 & \pi(2)_4 \end{bmatrix} = \begin{bmatrix} 0.4286 & 0.5714 & 0 & 0 \\ 0.4286 & 0.5714 & 0 & 0 \\ 0 & 0 & 0.625 & 0.375 \\ 0 & 0 & 0.625 & 0.375 \end{bmatrix}. \tag{3.129}
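Raising P to a large power gives a quick numerical check of the limiting matrix in Equation (3.129); a minimal sketch:

import numpy as np

# Four-state recurrent multichain of Equation (3.128).
P = np.array([[0.2, 0.8, 0.0, 0.0],
              [0.6, 0.4, 0.0, 0.0],
              [0.0, 0.0, 0.7, 0.3],
              [0.0, 0.0, 0.5, 0.5]])

# P raised to a large power approximates the limiting matrix.
print(np.round(np.linalg.matrix_power(P, 50), 4))
# the block rows approach [0.4286 0.5714] and [0.625 0.375], respectively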
P^2 = \begin{bmatrix} I & 0 \\ D & Q \end{bmatrix} \begin{bmatrix} I & 0 \\ D & Q \end{bmatrix} = \begin{bmatrix} I & 0 \\ D + QD & Q^2 \end{bmatrix} = \begin{bmatrix} I & 0 \\ (I + Q)D & Q^2 \end{bmatrix}. \tag{3.130}

P^3 = PP^2 = \begin{bmatrix} I & 0 \\ D & Q \end{bmatrix} \begin{bmatrix} I & 0 \\ (I + Q)D & Q^2 \end{bmatrix} = \begin{bmatrix} I & 0 \\ D + Q(I + Q)D & Q^3 \end{bmatrix} = \begin{bmatrix} I & 0 \\ (I + Q + Q^2)D & Q^3 \end{bmatrix}. \tag{3.131}
By raising P to successively higher powers, one can show that the canonical
form of the n-step transition matrix is
P^n = \begin{bmatrix} I & 0 \\ (I + Q + Q^2 + \cdots + Q^{n-1})D & Q^n \end{bmatrix}. \tag{3.132}
Hence,

\lim_{n \to \infty} P^n = \begin{bmatrix} I & 0 \\ \lim_{n \to \infty}(I + Q + Q^2 + \cdots + Q^{n-1})D & \lim_{n \to \infty} Q^n \end{bmatrix} = \begin{bmatrix} I & 0 \\ (I - Q)^{-1}D & \lim_{n \to \infty} Q^n \end{bmatrix} = \begin{bmatrix} I & 0 \\ UD & \lim_{n \to \infty} Q^n \end{bmatrix} = \begin{bmatrix} I & 0 \\ F & \lim_{n \to \infty} Q^n \end{bmatrix}. \tag{3.133}

Since

\lim_{n \to \infty} Q^n = 0, \tag{3.69}

\lim_{n \to \infty} P^n = \begin{bmatrix} I & 0 \\ F & \lim_{n \to \infty} Q^n \end{bmatrix} = \begin{bmatrix} I & 0 \\ F & 0 \end{bmatrix}. \tag{3.134}
P = [p_{ij}] = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ p_{31} & p_{32} & p_{33} & p_{34} \\ p_{41} & p_{42} & p_{43} & p_{44} \end{bmatrix} = \begin{bmatrix} I & 0 \\ D & Q \end{bmatrix}. \tag{3.85}
\lim_{n \to \infty} P^n = \begin{bmatrix} I & 0 \\ F & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ f_{31} & f_{32} & 0 & 0 \\ f_{41} & f_{42} & 0 & 0 \end{bmatrix}. \tag{3.135}
The limiting transition probability matrix for the hospital absorbing multichain is

\lim_{n \to \infty} P^n = \begin{bmatrix} I & 0 \\ F & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0.7284 & 0.2714 & 0 & 0 & 0 & 0 \\ 0.7327 & 0.2671 & 0 & 0 & 0 & 0 \\ 0.7246 & 0.2750 & 0 & 0 & 0 & 0 \\ 0.7294 & 0.2703 & 0 & 0 & 0 & 0 \end{bmatrix}, \tag{3.136}

with the states ordered 0, 5, 1, 2, 3, 4.
P^2 = PP = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix} \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix} = \begin{bmatrix} S^2 & 0 \\ DS + QD & Q^2 \end{bmatrix} = \begin{bmatrix} S^2 & 0 \\ D_2 & Q^2 \end{bmatrix}, \tag{3.137}

where

D_2 = DS + QD. \tag{3.138}

P^3 = PP^2 = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix} \begin{bmatrix} S^2 & 0 \\ D_2 & Q^2 \end{bmatrix} = \begin{bmatrix} S^3 & 0 \\ DS^2 + QD_2 & Q^3 \end{bmatrix} = \begin{bmatrix} S^3 & 0 \\ D_3 & Q^3 \end{bmatrix}, \tag{3.139}

where

D_3 = DS^2 + QD_2. \tag{3.140}
By raising P to successively higher powers, one can show that the canonical
form of the n-step transition matrix is
P^n = \begin{bmatrix} S^n & 0 \\ D_n & Q^n \end{bmatrix}, \tag{3.141}
where
D1 = D (3.142)
and
Dn = Dn−1S+Qn−1D. (3.143)
\lim_{n \to \infty} P^n = \begin{bmatrix} \lim_{n \to \infty} S^n & 0 \\ \lim_{n \to \infty} D_n & \lim_{n \to \infty} Q^n \end{bmatrix} = \begin{bmatrix} \lim_{n \to \infty} S^n & 0 \\ \lim_{n \to \infty} D_n & 0 \end{bmatrix}. \tag{3.144}
If states i and j are both recurrent, then they belong to the same recurrent
closed class, which acts as a separate irreducible chain with transition prob-
ability matrix S. The unichain has only one recurrent closed class, which is
therefore certain to be reached from a transient state. Once the unichain has
entered the recurrent closed class, its limiting behavior is governed by the
steady-state probability vector for the recurrent closed class.
If π = [π_j] is the steady-state probability vector for the recurrent closed class, then

\pi = \pi S, \quad \pi e = 1, \quad \pi_j = \lim_{n \to \infty} p_{ij}^{(n)}. \tag{3.145}
Thus,

\lim_{n \to \infty} S^n = \Pi, \tag{3.146}

where Π is a matrix with each row equal to the steady-state probability vector π.
\lim_{n \to \infty} P^n = \begin{bmatrix} \lim_{n \to \infty} S^n & 0 \\ \lim_{n \to \infty} D_n & 0 \end{bmatrix} = \begin{bmatrix} \Pi & 0 \\ \Pi & 0 \end{bmatrix}. \tag{3.148}
Once the unichain has entered the recurrent closed set, which is certain to
be reached, its limiting behavior is governed by the steady-state probability
vector for the closed set.
P = [p_{ij}] = \begin{bmatrix} p_{11} & p_{12} & 0 & 0 \\ p_{21} & p_{22} & 0 & 0 \\ p_{31} & p_{32} & p_{33} & p_{34} \\ p_{41} & p_{42} & p_{43} & p_{44} \end{bmatrix} = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}. \tag{3.2}
The steady-state probability vector for the generic two-state transition matrix S was obtained in Equation (2.16). Using Equation (3.148), the limiting transition probability matrix for P is

\lim_{n \to \infty} P^n = \begin{bmatrix} \Pi & 0 \\ \Pi & 0 \end{bmatrix} = \begin{bmatrix} \pi_1 & \pi_2 & 0 & 0 \\ \pi_1 & \pi_2 & 0 & 0 \\ \pi_1 & \pi_2 & 0 & 0 \\ \pi_1 & \pi_2 & 0 & 0 \end{bmatrix}. \tag{3.149}
P = [p_{ij}] = \begin{bmatrix} 0.2 & 0.8 & 0 & 0 \\ 0.6 & 0.4 & 0 & 0 \\ 0.2 & 0.3 & 0.5 & 0 \\ 0.3 & 0.2 & 0.1 & 0.4 \end{bmatrix} = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}. \tag{1.56}

P = [p_{ij}] = \begin{bmatrix} 0.2 & 0.8 & 0 & 0 \\ 0.6 & 0.4 & 0 & 0 \\ 0 & 0 & 0.2 & 0.8 \\ 0.3 & 0.2 & 0.1 & 0.4 \end{bmatrix} = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}. \tag{1.55}
Observe that under both maintenance policies, states 1 and 2 form a recurrent
closed class, while states 3 and 4 are transient. Submatrix S, which governs tran-
sitions for the recurrent chain, is identical under both maintenance policies.
Using Equation (2.16), the vector of steady-state probabilities for submatrix S is π = [π_1 \; π_2] = [3/7 \; 4/7].
Under both policies, using Equation (3.149), the limiting transition probabil-
ity matrix is
\lim_{n \to \infty} P^n = \begin{bmatrix} \Pi & 0 \\ \Pi & 0 \end{bmatrix} = \begin{bmatrix} \pi_1 & \pi_2 & 0 & 0 \\ \pi_1 & \pi_2 & 0 & 0 \\ \pi_1 & \pi_2 & 0 & 0 \\ \pi_1 & \pi_2 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 3/7 & 4/7 & 0 & 0 \\ 3/7 & 4/7 & 0 & 0 \\ 3/7 & 4/7 & 0 & 0 \\ 3/7 & 4/7 & 0 & 0 \end{bmatrix}. \tag{3.151}
In the long run, under both maintenance policies, the machine will be in
state 1, not working, 3/7 of the time, and in state 2, working, with a major
defect, the remaining 4/7 of the time.
P = \begin{bmatrix} P_1 & 0 & \cdots & 0 & 0 \\ 0 & \ddots & \cdots & 0 & 0 \\ 0 & 0 & \cdots & P_M & 0 \\ D_1 & D_2 & \cdots & D_M & Q \end{bmatrix} = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}, \tag{3.4}

where

S = \begin{bmatrix} P_1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & P_M \end{bmatrix}, \quad D = [D_1 \; \cdots \; D_M]. \tag{3.152}
\lim_{n \to \infty} P^n = \begin{bmatrix} \Pi(1) & 0 & \cdots & 0 & 0 \\ 0 & \ddots & \cdots & 0 & 0 \\ 0 & 0 & \cdots & \Pi(M) & 0 \\ \lim_{n \to \infty} D_{1,n} & \cdots & \cdots & \lim_{n \to \infty} D_{M,n} & 0 \end{bmatrix}, \tag{3.153}
where ∏(j) is a matrix with each row π(j), the steady-state probability vector
for the transition probability matrix Pj, and Dj,n is the matrix of n-step transi-
tion probabilities from transient states to the recurrent chain Rj.
As Section 3.4.1 and Equation (3.116) indicate, in a reducible multichain, the
probability of eventual passage from a transient state i to a recurrent closed
class is simply the sum of the probabilities of eventual passage from the tran-
sient state i to all of the recurrent states within the closed class. Suppose
that Rk represents a recurrent chain for which the transition probabilities are
governed by the matrix Pk. The probability fi,Rk of eventual passage from a
transient state i to the recurrent closed class Rk is given by
f_{iR_k} = \sum_{j \in R_k} f_{ij}, \tag{3.116}

where the recurrent states j belong to the recurrent closed class R_k.
The limiting probability of a transition from a transient state i to a state j
belonging to the recurrent closed class Rk is equal to the product of fi,Rk, the
probability of eventual passage from the transient state i to the recurrent
chain Rk, and π(k)j, the steady-state probability for state j, which belongs to the
recurrent closed class R_k. That is,

\lim_{n \to \infty} p_{ij}^{(n)} = f_{iR_k}\pi(k)_j. \tag{3.154}
P = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & p_{22} & p_{23} & 0 & 0 \\ 0 & p_{32} & p_{33} & 0 & 0 \\ p_{41} & p_{42} & p_{43} & p_{44} & p_{45} \\ p_{51} & p_{52} & p_{53} & p_{54} & p_{55} \end{bmatrix} = \begin{bmatrix} P_1 & 0 & 0 \\ 0 & P_2 & 0 \\ D_1 & D_2 & Q \end{bmatrix}. \tag{3.5}
Recall that state 1 is an absorbing state. The two recurrent closed sets are
denoted by R1 = {1} and R2 = {2, 3}. The set of transient states is denoted by
T = {4, 5}. The probabilities of eventual passage to the individual recurrent
states, states 2 and 3, were computed in Section 3.3.3.2.3 in Equations (3.97)
and (3.101). The probabilities of eventual passage to R 2 = {2, 3} were computed
in Section 3.4.1.1 in Equations (3.111) through (3.113), and in Section 3.4.2 in
Equation (3.122). The limiting transition probability matrix for the reducible
multichain has the form
\lim_{n \to \infty} P^n = \lim_{n \to \infty} \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & p_{22}^{(n)} & p_{23}^{(n)} & 0 & 0 \\ 0 & p_{32}^{(n)} & p_{33}^{(n)} & 0 & 0 \\ p_{41}^{(n)} & p_{42}^{(n)} & p_{43}^{(n)} & p_{44}^{(n)} & p_{45}^{(n)} \\ p_{51}^{(n)} & p_{52}^{(n)} & p_{53}^{(n)} & p_{54}^{(n)} & p_{55}^{(n)} \end{bmatrix}
= \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & \pi_2 & \pi_3 & 0 & 0 \\ 0 & \pi_2 & \pi_3 & 0 & 0 \\ f_{41} & \lim_{n \to \infty} p_{42}^{(n)} & \lim_{n \to \infty} p_{43}^{(n)} & 0 & 0 \\ f_{51} & \lim_{n \to \infty} p_{52}^{(n)} & \lim_{n \to \infty} p_{53}^{(n)} & 0 & 0 \end{bmatrix}, \tag{3.155}
where f41 and f 51 are the absorption probabilities for transient states
4 and 5, respectively, and π2 and π3 are the steady-state probabilities for
recurrent states 2 and 3, respectively. Using Equation (3.116), the probability f_{iR_2} of eventual passage from a transient state i to the recurrent chain R_2 = {2, 3} is

f_{iR_2} = f_{i2} + f_{i3}, \quad i = 4, 5. \tag{3.156}

Using Equation (3.154), the limiting probability, \lim_{n \to \infty} p_{ij}^{(n)}, of a transition from a transient state i to a state j belonging to the recurrent closed class R_2 = {2, 3} is

\lim_{n \to \infty} p_{ij}^{(n)} = f_{iR_2}\pi_j = (f_{i2} + f_{i3})\pi_j, \quad j = 2, 3. \tag{3.157}
Therefore, the limiting transition probability matrix for the five-state reducible multichain is

\lim_{n \to \infty} P^n = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & \pi_2 & \pi_3 & 0 & 0 \\ 0 & \pi_2 & \pi_3 & 0 & 0 \\ f_{41} & (f_{42}+f_{43})\pi_2 & (f_{42}+f_{43})\pi_3 & 0 & 0 \\ f_{51} & (f_{52}+f_{53})\pi_2 & (f_{52}+f_{53})\pi_3 & 0 & 0 \end{bmatrix}
= \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & \pi_2 & \pi_3 & 0 & 0 \\ 0 & \pi_2 & \pi_3 & 0 & 0 \\ f_{41} & f_{4R_2}\pi_2 & f_{4R_2}\pi_3 & 0 & 0 \\ f_{51} & f_{5R_2}\pi_2 & f_{5R_2}\pi_3 & 0 & 0 \end{bmatrix}. \tag{3.158}
The limiting transition probability matrix for the eight-state reducible multichain representing the production process has the following form:

\lim_{n \to \infty} P^n = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & \pi_3 & \pi_4 & \pi_5 & 0 & 0 & 0 \\ 0 & 0 & \pi_3 & \pi_4 & \pi_5 & 0 & 0 & 0 \\ 0 & 0 & \pi_3 & \pi_4 & \pi_5 & 0 & 0 & 0 \\ f_{61} & f_{62} & (f_{63}+f_{64}+f_{65})\pi_3 & (f_{63}+f_{64}+f_{65})\pi_4 & (f_{63}+f_{64}+f_{65})\pi_5 & 0 & 0 & 0 \\ f_{71} & f_{72} & (f_{73}+f_{74}+f_{75})\pi_3 & (f_{73}+f_{74}+f_{75})\pi_4 & (f_{73}+f_{74}+f_{75})\pi_5 & 0 & 0 & 0 \\ f_{81} & f_{82} & (f_{83}+f_{84}+f_{85})\pi_3 & (f_{83}+f_{84}+f_{85})\pi_4 & (f_{83}+f_{84}+f_{85})\pi_5 & 0 & 0 & 0 \end{bmatrix}
= \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & \pi_3 & \pi_4 & \pi_5 & 0 & 0 & 0 \\ 0 & 0 & \pi_3 & \pi_4 & \pi_5 & 0 & 0 & 0 \\ 0 & 0 & \pi_3 & \pi_4 & \pi_5 & 0 & 0 & 0 \\ f_{61} & f_{62} & f_{6R_3}\pi_3 & f_{6R_3}\pi_4 & f_{6R_3}\pi_5 & 0 & 0 & 0 \\ f_{71} & f_{72} & f_{7R_3}\pi_3 & f_{7R_3}\pi_4 & f_{7R_3}\pi_5 & 0 & 0 & 0 \\ f_{81} & f_{82} & f_{8R_3}\pi_3 & f_{8R_3}\pi_4 & f_{8R_3}\pi_5 & 0 & 0 & 0 \end{bmatrix}, \tag{3.159}
where the steady-state probabilities π_3, π_4, and π_5 for the recurrent chain R_3 satisfy π_3 + π_4 + π_5 = 1.
\lim_{n \to \infty} P^n = \begin{bmatrix} \Pi(1) & 0 & 0 & 0 \\ 0 & \Pi(2) & 0 & 0 \\ 0 & 0 & \Pi(3) & 0 \\ \lim_{n \to \infty} D_{1,n} & \lim_{n \to \infty} D_{2,n} & \lim_{n \to \infty} D_{3,n} & 0 \end{bmatrix}. \tag{3.153}
The limiting transition probability matrix for the production process is

\lim_{n \to \infty} P^n = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0.2909 & 0.3727 & 0.3364 & 0 & 0 & 0 \\ 0 & 0 & 0.2909 & 0.3727 & 0.3364 & 0 & 0 & 0 \\ 0 & 0 & 0.2909 & 0.3727 & 0.3364 & 0 & 0 & 0 \\ 0.4444 & 0.3556 & 0.0582 & 0.0745 & 0.0673 & 0 & 0 & 0 \\ 0.6825 & 0.2032 & 0.0332 & 0.0426 & 0.0385 & 0 & 0 & 0 \\ 0.8095 & 0.1219 & 0.0199 & 0.0256 & 0.0231 & 0 & 0 & 0 \end{bmatrix}. \tag{3.165}
\bar{p}_{ij} = P(i \to j \text{ in unichain})
= P(i \to j \text{ in multichain} \mid \text{multichain state } i \text{ absorbed in state 2})
= \frac{P(i \to j \text{ in multichain} \cap \text{multichain state } j \text{ absorbed in state 2})}{P(\text{multichain state } i \text{ absorbed in state 2})} \tag{3.166}
= \frac{P(i \to j \text{ in multichain})\,P(\text{multichain state } j \text{ absorbed in state 2})}{P(\text{multichain state } i \text{ absorbed in state 2})}
= \frac{p_{ij}f_{j2}}{f_{i2}}.
Hence, the mean time to absorption in state 2 for an entering item is 9.0793 steps, the sum of the entries in the third row of the fundamental matrix for this conditional unichain.
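The 9.0793 figure can be reproduced with a short sketch. The construction of the conditional unichain is only summarized by Equation (3.166) here, so the steps below are an assumption consistent with that equation rather than a reproduction of the full derivation.

import numpy as np

# Original transient submatrix Q (states 6, 7, 8) and the one-step probabilities
# of moving into state 2 (sold).
Q = np.array([[0.55, 0.00, 0.00],
              [0.20, 0.65, 0.00],
              [0.00, 0.15, 0.75]])
d2 = np.array([0.16, 0.00, 0.00])

U = np.linalg.inv(np.eye(3) - Q)
f2 = U @ d2                              # probabilities of absorption in state 2: f_62, f_72, f_82

# Equation (3.166): pbar_ij = p_ij * f_j2 / f_i2 for transient i, j.
Qbar = Q * f2[np.newaxis, :] / f2[:, np.newaxis]
Ubar = np.linalg.inv(np.eye(3) - Qbar)   # fundamental matrix of the conditional unichain
print(round(Ubar[2, :].sum(), 4))        # about 9.079: mean time to absorption in state 2 from state 8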
PROBLEMS
3.1 Three basketball players compete in a basketball foul shoot-
ing contest. The eligible players are allowed one foul shot per
round. After each round, all players who miss their foul shots
are eliminated, and the remaining players participate in the
next round. The contest ends when a single player who has not
missed a foul shot remains, and is declared the winner, or when
all players have been eliminated, and no one wins. In the past,
player A has made an average of 80% of his foul shots, player B
has made an average of 75%, and player C has made an average
of 70%.
This contest can be modeled as an absorbing multichain by
choosing as states all sets of players who have not been elimi-
nated. For example, if two players remain, the corresponding
states are the three pairs (A, B), (A, C), and (B, C).
(a) Construct the transition probability matrix.
(b) What is the probability that the contest will end without a
winner?
(c) What is the probability that player A will win the contest?
(d) If players B and C are the remaining contestants, what is the
expected number of rounds needed before C wins?
3.2 Three hockey players compete in a defensive contest. In each
round, each eligible player is allowed one shot at the goal of the
remaining player who has the highest career scoring average.
After each round, all players who allow their opponent to score
a goal are eliminated, and the remaining players participate
in the next round. The contest ends when a single player who
has not allowed a goal remains, and is declared the winner, or
when all players have been eliminated, and no one wins. The
career scoring averages of players A, B, and C are 40%, 35%, and
20%, respectively.
The contest begins with all three players eligible to compete
in the first round. Player A will shoot his puck toward the goal
of B, A’s competitor who has the highest career scoring aver-
age. Next, player B will shoot his puck toward the goal of A, B’s
competitor who has the highest career scoring average. Finally,
player C will shoot his puck toward the goal of A, C’s competi-
tor who has the highest career scoring average. When the round
ends, any player who has allowed a goal to be scored is elimi-
nated. Note that after the first round, in which all three players
compete, players A and B can both be eliminated if each scores a
goal against the other. Player C, who has the lowest career scor-
ing average, cannot be eliminated after the first round because
no other player will shoot his puck toward the goal of C. The
surviving players enter the next round.
This contest can be modeled as an absorbing multichain
by choosing as states all sets of players who have not
been eliminated. For example, if two players remain,
the corresponding states are the three pairs (A, B),
(A, C), and (B, C).
(a) Construct the transition probability matrix.
(b) What is the probability that the contest will end without a
winner?
(c) What is the probability that player A will win the contest?
(d) If players B and C are the remaining contestants, what is the
expected number of rounds needed before C wins?
3.3 A woman needs $5,000 for a down payment on a condominium.
She will try to raise the money for the down payment by gam-
bling. She will place a sequence of bets until she either accumu-
lates $5,000 or loses all her money. She starts with $2,000, and
will wager $1,000 on each bet. Each time that she bets $1,000,
she will win $1,000 with probability 0.4, or lose $1,000 with
probability 0.6.
(a) Model the woman’s gambling experience as a six-state absorb-
ing multichain. Let the state Xn denote the gambler’s revenue
after the nth bet. Construct the transition probability matrix.
(b) What is the expected number of bets that she will make?
(c) What is the probability that she will obtain her $5,000 down
payment?
3.4 Suppose that the woman seeking a $5,000 down payment in
Problem 3.3 has the option of betting either $1,000 or $2,000. She
chooses the following aggressive strategy. If she has $1,000 or
$4,000, she will bet $1,000, and will win $1,000 with probability
0.4 or lose $1,000 with probability 0.6. If she has $2,000 or $3,000,
she will bet $2,000, and will either win $2,000 with probabil-
ity 0.05, or win $1,000 with probability 0.15, or lose $2,000 with
probability 0.8.
(a) Model the woman’s gambling experience under the aggres-
sive strategy as a six-state absorbing Markov chain. Let the
state represent the amount of money that she has when she
places a bet. Construct the transition probability matrix.
(b) What is the expected number of bets that she will make?
(c) What is the probability that she will obtain her $5,000 down
payment?
3.5 A consumer electronics dealer sells new flat panel televisions
for $1,000. The dealer offers customers a 4-year warranty. The
warranty provides free replacement of a television that fails
within the 4-year warranty period, but does not cover the cost
of a replaced TV.
The dealer discloses that 5% of new flat panel televisions fail
during their first year of life, 10% of 1-year-old televisions fail
during their second year of life, 15% of 2-year-old televisions
fail during their third year of life, and 20% of 3-year-old tele-
visions fail during their fourth year of life. Suppose that the
dealer sells a consumer a 4-year warranty for $140 along with
the television.
(a) Model the 4-year warranty experience of the dealer as a six-
state absorbing multichain with two absorbing states. Choose
one absorbing state to represent the replacement of a TV,
which has failed during the warranty period, and the other
to represent the survival of the TV until the end of the war-
ranty period. Choose the transient states to represent the age
of the television. Construct the transition probability matrix.
(b) What is the probability that the TV will have to be replaced
during the warranty period?
(c) What is the dealer’s expected revenue from selling a 4-year
warranty?
3.6 A production process contains three stages in series. An enter-
ing item starts in the first manufacturing stage. The output from
each stage is inspected. An item of acceptable quality is passed
on to the next stage, a defective item is scrapped, and an item
of marginal quality is reworked at the current stage. An item at
stage 1 has a 0.08 probability of being defective, a 0.12 probabil-
ity of being of marginal quality, and a 0.80 probability of being
of acceptable quality. An item at stage 2 has a 0.06 probability
of being defective, a 0.09 probability of being of marginal qual-
ity, and a 0.85 probability of being acceptable. The probabilities
that an item at stage 3 will be defective, marginal in quality,
or acceptable, are 0.04, 0.06, and 0.90, respectively. All items of
acceptable quality produced by stage 3 are sold.
(a) Model the production process as a five-state absorbing mul-
tichain with two absorbing states. Choose one absorbing
state to represent scrapping a defective item, and the other to
State   Description
1       On course to HEO
2       On course to LEO
3       Minor deviation from course to HEO
4       Major deviation from course to HEO
5       Abort mission

        State                1     2     3     4     5
        1, On course to HEO  1     0     0     0     0
        2, On course to LEO  0     1     0     0     0
P =     3, Minor deviation   0.55  0.30  0.10  0.05  0
        4, Major deviation   0     0.40  0.30  0.20  0.10
        5, Abort             0     0     0     0     1
                          State                     1            2              3
                          1 = Assistant professor   1/(d + p)    p/(d + p)^2    p^2/(d + p)^3
U = (I − Q)^{−1} =        2 = Associate professor   0            1/(d + p)      p/(d + p)^2
                          3 = Professor             0            0              1/(d + p)
(b) Find the average number of years that a newly hired assis-
tant professor spends working for this college (in any aca-
demic rank).
(c) Find the average number of years that a newly hired assis-
tant professor spends working for this college as an asso-
ciate professor.
(d) Find the average number of years that a professor spends
working for this college (as a professor).
(e) Find the probability that a newly hired assistant professor
will eventually retire as a professor.
(f) Find the probability that an associate professor will eventu-
ally retire as a professor.
(g) Assuming that a faculty member at this college begins her
career as an assistant professor, find the probability that she
will be an associate professor after 2 years.
3.9 In Problem 1.13, let d1 = 0.6, d2 = 0.5, d3 = 0.4, r1 = 0.3, p1 = 0.1,
r2 = 0.3, p2 = 0.2, pR = 0.1, pA = 0.2, r3 = 0.3, r4 = 0.7, and r5 = 0.8.
(a) Find the average number of years that a newly hired instruc-
tor will spend working for this college (in any academic
rank).
     Xn \ Xn+1   0     5     10    15    20
     0           0.34  0.12  0.26  0.10  0.18
     5           0.12  0.28  0.32  0.20  0.08
P =  10          0.22  0.24  0.30  0.14  0.10
     15          0.10  0.20  0.40  0.24  0.06
     20          0.08  0.18  0.38  0.22  0.14
The investor intends to sell the stock at the end of the first
month in which the share price rises to $20 or falls to $0. Under
this policy, the share price can be modeled as an absorbing
multichain with two absorbing states, $0 and $20. An absorbing
state is reached when the stock is sold. The remaining states,
which are entered when the stock is held during the present
month, are transient.
(a) Represent the transition probability matrix in canonical
form.
(b) Find the probability that she will eventually sell the stock
for $20.
(c) Find the probability that she will sell the stock after
3 months.
(d) Find the expected number of months until she sells the
stock.
References
1. Bhat, N., Elements of Applied Stochastic Processes, 2nd ed., Wiley, New York,
1985.
2. Cinlar, E., Introduction to Stochastic Processes, Prentice-Hall, Englewood Cliffs,
NJ, 1975.
3. Clarke, A. B. and Disney R. L., Probability and Random Processes: A First Course
with Applications, 2nd ed., Wiley, New York, 1985.
4. Kemeny, J. G., Mirkil, H., Snell, J. L., and Thompson, G. L., Finite Mathematical
Structures, Prentice-Hall, Englewood Cliffs, NJ, 1959.
5. Kemeny, J. G. and Snell, J. L., Finite Markov Chains, Van Nostrand, Princeton, NJ,
1960. Reprinted by Springer-Verlag, New York, 1976.
When income or costs are associated with the states of a Markov chain, the
system is called a Markov chain with rewards, or MCR. This chapter, which
treats an MCR, has two objectives. The first is to show how to calculate the
economic value of an MCR. The second is to use an MCR to link a Markov
chain to a Markov decision process (MDP), thereby unifying the treatment
of both subjects. In Chapter 5, an MDP is constructed by associating decision
alternatives with a set of MCRs. Thus, an MDP can be viewed simply as a set
of Markov chains with rewards plus decisions.
4.1 Rewards
This chapter develops procedures for constructing a reward vector and cal-
culating the expected economic value of an MCR. As Section 1.2 indicates,
all Markov chains with rewards treated in this book are assumed to have a
finite number of states [2–4, 6, 7].
FIGURE 4.1
Planning horizon of length T periods.
Note that epoch 1 marks the end of period 1, which coincides with the
beginning of period 2. The present time, denoted by epoch 0, marks the
beginning of period 1. A future time is denoted by epoch n, where n > 0.
Epoch T marks the end of period T, which is also the end of the planning
horizon. Often, the following terms will be used interchangeably: epoch n,
period n, time n, step n, and transition n.
qi = Σ_{j=1}^{N} pij rij,   i = 1, 2, ..., N.    (4.1)
FIGURE 4.2
Sequence of states, transitions, and rewards for an MCR.
In the most general case, the reward received in the present state is the sum
of a constant term and an expected immediate reward earned from a transi-
tion to the next state. In this book, qi will simply be termed a reward received
in state i, regardless of how it is earned.
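As a small illustration of Equation (4.1), the following Python sketch (an added
illustration, not part of the original text; it assumes the NumPy library and a
hypothetical two-state transition probability matrix P and transition-reward
matrix R) forms the reward vector by weighting each transition reward by its
transition probability.

import numpy as np

# Illustrative (hypothetical) two-state data: P[i, j] is the transition
# probability and R[i, j] is the reward r_ij earned on a transition from i to j.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
R = np.array([[ 5.0, -2.0],
              [ 1.0,  8.0]])

# Equation (4.1): q_i = sum_j p_ij * r_ij, the expected reward received in state i.
q = (P * R).sum(axis=1)
print(q)   # [0.7*5 + 0.3*(-2), 0.4*1 + 0.6*8] = [2.9, 5.2]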
Figure 1.3 in Section 1.2 demonstrates that a small Markov chain can be
represented by a transition probability graph. Similarly, a small MCR can
be represented by a transition probability and reward graph. For example,
consider a two-state generic MCR with the following transition probability
matrix P and reward vector q:
q = [q1 q2 . . . qN ]T . (4.3)
FIGURE 4.3
Transition probability and reward graph for a two-state Markov chain with rewards.
Vectors of expected rewards, P^n q, and expected reward scalars, p(n)q, will be
calculated for an example of an MCR model in Section 4.2.2.1.
TABLE 4.1
States, Actions, and Rewards for Monthly Sales

Monthly Sales Quartile   State   Action                        Reward
First (lowest)           1       Offer employee buyouts        −$20,000
Second                   2       Reduce executive salaries     $5,000
Third                    3       Invest in new technology      −$5,000
Fourth (highest)         4       Make strategic acquisitions   $25,000
Using Equation (4.5), the expected reward received at the start of month
one is
p(0)q = [0.3 0.1 0.4 0.2][−20; 5; −5; 25] = −2.5.    (4.8)
p(0)Pq = p(1)q = [0.225 0.240 0.325 0.210][−20; 5; −5; 25] = 0.325.    (4.10)
Using Equation (4.4), the vector of expected rewards after 1 month, which is
independent of the initial probability state vector, is equal to
p(1)Pq = p(2)q = [0.21125 0.24175 0.332 0.215][−20; 5; −5; 25] = 0.69875.    (4.14)
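A brief computation, written here in Python with NumPy as an added illustration,
reproduces these expected-reward quantities for the monthly sales model; the
transition probability matrix and reward vector are those used throughout this
example.

import numpy as np

# Monthly sales MCR data for this example.
P = np.array([[0.60, 0.30, 0.10, 0.00],
              [0.25, 0.30, 0.35, 0.10],
              [0.05, 0.25, 0.50, 0.20],
              [0.00, 0.10, 0.30, 0.60]])
q = np.array([-20.0, 5.0, -5.0, 25.0])   # rewards, in thousands of dollars
p0 = np.array([0.3, 0.1, 0.4, 0.2])      # initial state probability vector

p = p0.copy()
for n in range(3):
    # p(n)q is the expected reward received at epoch n; P^n q is the
    # corresponding vector of expected rewards, independent of p(0).
    print(f"epoch {n}: p({n})q = {p @ q:.5f}, P^{n} q = {np.linalg.matrix_power(P, n) @ q}")
    p = p @ P          # advance the state probability vector one month
# epoch 0: -2.50000, epoch 1: 0.32500, epoch 2: 0.69875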
Using Equation (4.4), the vector of expected rewards received after 2 months,
which is independent of the initial probability state vector, is equal to
FIGURE 4.4
Cash flow diagram for reward vectors over planning horizon of length T periods.
v(0), is equal to the sum of the expected reward vectors, P^k q, earned at each
epoch k, for k = 0, 1, 2, ..., T − 1, plus the expected salvage value, P^T v(T),
received at epoch T. That is,
FIGURE 4.5
Cash flow diagram for reward vectors over four-period planning horizon.
FIGURE 4.6
One-period cash flow diagram.
FIGURE 4.7
Two-period cash flow diagram.
FIGURE 4.8
Three-period cash flow diagram.
Using Equation (4.17) with T = 4, the solution for v(0) is given below:
In compact algebraic form, using summation signs, the four recursive equa-
tions (4.31) are
v1(n) = q1 + Σ_{j=1}^{4} p1j vj(n + 1)
v2(n) = q2 + Σ_{j=1}^{4} p2j vj(n + 1)
v3(n) = q3 + Σ_{j=1}^{4} p3j vj(n + 1)    (4.32)
v4(n) = q4 + Σ_{j=1}^{4} p4j vj(n + 1).

The four algebraic equations (4.32) can be replaced by the single algebraic
equation

vi(n) = qi + Σ_{j=1}^{4} pij vj(n + 1),    (4.33)

for n = 0, 1, ..., T − 1, and i = 1, 2, 3, and 4.

vi(n) = qi + Σ_{j=1}^{N} pij vj(n + 1),    (4.34)

for n = 0, 1, ..., T − 1, and i = 1, 2, ..., N.
Recall that vi(n) denotes the expected total reward earned until the end of
the planning horizon if the system is in state i at epoch n. The value iter-
ation equation (4.34) indicates that the total expected reward, vi(n), can be
expressed as the sum of two terms. The first term is the reward, qi, earned at
epoch n. The second term is the expected total reward, vj(n + 1), that will be
earned if the chain starts in state j at epoch n + 1, weighted by the probability,
pij, that state j can be reached in one step from state i.
In summary, the value iteration equation expresses a backward recursive
relationship because it starts with a known set of salvage values for vector
v(T) at epoch T, the end of the planning horizon. Next, vector v(T − 1) is cal-
culated in terms of v(T). The value iteration procedure moves backward one
epoch at a time by calculating v(n) in terms of v(n + 1). The vectors, v(T − 2),
v(T − 3), … , v(1), and v(0) are computed in succession. The backward recursive
procedure, or backward recursion, stops at epoch 0 after v(0) is calculated in
terms of v(1). The component vi(0) of v(0) is the expected total reward earned
until the end of the planning horizon if the system starts in state i at epoch 0.
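A minimal Python sketch of this backward recursion follows (an added
illustration assuming NumPy; the monthly sales data are used only as a test
case, with zero salvage values at the end of the horizon).

import numpy as np

def backward_value_iteration(P, q, v_T, T):
    """Return v(0), ..., v(T) for an MCR via v(n) = q + P v(n+1)."""
    v = [None] * (T + 1)
    v[T] = np.asarray(v_T, dtype=float)      # salvage values at the end of the horizon
    for n in range(T - 1, -1, -1):           # epochs T-1, T-2, ..., 0
        v[n] = q + P @ v[n + 1]
    return v

# Example: the monthly sales MCR over a 3-month horizon with zero salvage values.
P = np.array([[0.60, 0.30, 0.10, 0.00],
              [0.25, 0.30, 0.35, 0.10],
              [0.05, 0.25, 0.50, 0.20],
              [0.00, 0.10, 0.30, 0.60]])
q = np.array([-20.0, 5.0, -5.0, 25.0])
v = backward_value_iteration(P, q, np.zeros(4), T=3)
print(v[0])   # approximately [-38.15, 1.0375, 0.6875, 47.95], as in Table 4.2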
Figure 4.9 is a tree diagram of the value iteration equation in algebraic form
for a two-state MCR.
FIGURE 4.9
Tree diagram of value iteration for a two-state MCR.
n = T = 3
v(3) = v(T) = 0

[v1(3); v2(3); v3(3); v4(3)] = [0; 0; 0; 0]    (4.36)

n = 2
v(2) = q + Pv(3)

[v1(2); v2(2); v3(2); v4(2)] = [−20; 5; −5; 25]
    + [0.60 0.30 0.10 0; 0.25 0.30 0.35 0.10; 0.05 0.25 0.50 0.20; 0 0.10 0.30 0.60][0; 0; 0; 0]
    = [−20; 5; −5; 25]    (4.37)

n = 1
v(1) = q + Pv(2)

[v1(1); v2(1); v3(1); v4(1)] = [−20; 5; −5; 25]
    + [0.60 0.30 0.10 0; 0.25 0.30 0.35 0.10; 0.05 0.25 0.50 0.20; 0 0.10 0.30 0.60][−20; 5; −5; 25]
    = [−31; 2.25; −2.25; 39]    (4.38)

n = 0
v(0) = q + Pv(1)

[v1(0); v2(0); v3(0); v4(0)] = [−20; 5; −5; 25]
    + [0.60 0.30 0.10 0; 0.25 0.30 0.35 0.10; 0.05 0.25 0.50 0.20; 0 0.10 0.30 0.60][−31; 2.25; −2.25; 39]
    = [−38.15; 1.0375; 0.6875; 47.95].    (4.39)
The vector v(0) indicates that if this MCR operates over a 3-month planning
horizon, the expected total reward will be –38.15 if the system starts in state 1,
1.0375 if it starts in state 2, 0.6875 if it starts in state 3, and 47.95 if it starts in
state 4. The calculations for the 3-month planning horizon are summarized
in Table 4.2.
The solution by value iteration for v(0) over a 3-month planning horizon
has verified that, for v(3) = 0,
equal to the sum of the results obtained in Equations (4.11) and (4.15).
TABLE 4.2
Expected Total Rewards for Monthly Sales
Calculated by Value Iteration Over a 3-Month
Planning Horizon

                            n
End of Month      0         1       2     3
v1(n)          −38.15     −31     −20     0
v2(n)            1.0375     2.25    5     0
v3(n)            0.6875    −2.25   −5     0
v4(n)           47.95      39      25     0
v(−1) = q + Pv(0)
v(−2) = q + Pv(−1)
  ⋮
v(−Δ) = q + Pv(−Δ + 1).
Note that if period 0 denotes the first period placed ahead of period 1 to lengthen
the horizon, and the additional periods added are labeled with consecutively
increasing negative integers, then epoch n denotes the end of period n, for
n = 0, − 1, − 2, … , (−∆ + 1), as well as for n = 1, 2, … , T. For example, suppose that
T = 2 and ∆ = 3. Then the lengthened planning horizon appears in Figure 4.10.
Lastly, all the epochs may be renumbered, so that the epochs of the length-
ened planning horizon are numbered consecutively with nonnegative inte-
gers from epoch 0 at the origin to epoch (T + ∆) at the end, by adding ∆ to the
numerical index of each epoch, starting at epoch (−∆) and ending at epoch T.
If the epochs are renumbered, then
FIGURE 4.10
Two-period horizon lengthened by three periods.
v(−1) = q + Pv(0)

[v1(−1); v2(−1); v3(−1); v4(−1)] = [−20; 5; −5; 25]
    + [0.60 0.30 0.10 0; 0.25 0.30 0.35 0.10; 0.05 0.25 0.50 0.20; 0 0.10 0.30 0.60][−38.15; 1.0375; 0.6875; 47.95]
    = [−42.51; 0.8094; 3.2856; 54.08].    (4.41)
The calculations for the entire 7-month planning horizon are summarized
in Table 4.4.
Finally, ∆ = 4 may be added to each end-of-month index, n, to renumber the
epochs of the seven-period planning horizon consecutively from 0 to 7.
FIGURE 4.11
Lengthened planning horizon with epochs renumbered.
TABLE 4.3
Three-Month Horizon for Monthly Sales Lengthened by 4 Months

                                      n
End of Month   −4   −3   −2   −1       0         1       2     3
v1(n)                               −38.15     −31     −20     0
v2(n)                                 1.0375     2.25    5     0
v3(n)                                 0.6875    −2.25   −5     0
v4(n)                                47.95      39      25     0
TABLE 4.4
Expected Total Rewards for Monthly Sales Calculated by Value Iteration Over a
7-Month Planning Horizon

                                           n
End of Month     −4         −3         −2        −1        0         1       2     3
v1(n)         −46.3092   −46.0552   −44.9346   −42.51   −38.15     −31     −20     0
v2(n)           2.8782     1.9073     1.1733     0.8094   1.0375     2.25    5     0
v3(n)           9.3101     7.5174     5.5357     3.2856   0.6875    −2.25   −5     0
v4(n)          64.578     61.8868    58.5146    54.08    47.95      39      25     0
g = lim_{n→∞} P^(n) q.    (4.43)
Each component gi of the gain vector represents the expected average reward
per transition, or the gain, if the MCR starts in state i. Suppose now that the
MCR is recurrent. Using Equation (2.7),
g = lim_{n→∞} P^(n) q = Πq = [π1 π2 ... πN; π1 π2 ... πN; ...; π1 π2 ... πN][q1; q2; ...; qN]
  = [Σ_{i=1}^{N} πi qi; Σ_{i=1}^{N} πi qi; ...; Σ_{i=1}^{N} πi qi] = [πq; πq; ...; πq] = [g; g; ...; g].    (4.44)
Hence, the gain for starting in every state of a recurrent MCR is the same.
The gain in every state of a recurrent MCR is a scalar constant, denoted by
g, where
g1 = g2 = ⋯ = gN = g.    (4.45)
The gain g in every state of a recurrent MCR is equal to the sum, over all
states, of the reward qi received in state i weighted by the steady-state prob-
ability, πi. The algebraic form of the equation to calculate the gain g for a
recurrent MCR is
g = Σ_{i=1}^{N} πi qi.    (4.46)
The matrix form of the equation to calculate the gain g for a recurrent
MCR is
g = π q. (4.47)
g = πq = [π1 π2][q1; q2] = [p21/(p12 + p21)  p12/(p12 + p21)][q1; q2]
  = (p21 q1 + p12 q2)/(p12 + p21).    (4.48)

g = [g1; g2] = lim_{n→∞} P^(n) q = Πq = [π1 π2; π1 π2][q1; q2] = [π1 q1 + π2 q2; π1 q1 + π2 q2]
  = [(p21 q1 + p12 q2)/(p12 + p21); (p21 q1 + p12 q2)/(p12 + p21)].    (4.49)

As the components of the gain vector have demonstrated, the gain in every
state is equal to the scalar

g = g1 = g2 = (p21 q1 + p12 q2)/(p12 + p21).    (4.50)
For example, consider the following two-state recurrent MCR for which:
g = πq = [0.1908 0.2368 0.3421 0.2303][−20; 5; −5; 25] = 1.4143.    (4.53)
In the steady state, the process earns an average reward of $1,414.30 per
month.
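The gain of a recurrent MCR can be computed directly from Equations (4.46)
and (4.47); the sketch below, an added Python illustration assuming NumPy,
solves πP = π together with the normalizing condition that the probabilities
sum to one, and then forms the inner product πq for the monthly sales model.

import numpy as np

P = np.array([[0.60, 0.30, 0.10, 0.00],
              [0.25, 0.30, 0.35, 0.10],
              [0.05, 0.25, 0.50, 0.20],
              [0.00, 0.10, 0.30, 0.60]])
q = np.array([-20.0, 5.0, -5.0, 25.0])

# Solve pi (P - I) = 0 together with the normalizing equation sum(pi) = 1.
N = P.shape[0]
A = np.vstack([P.T - np.eye(N), np.ones(N)])     # (N+1) x N consistent system
b = np.zeros(N + 1); b[-1] = 1.0
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

g = pi @ q
print(pi.round(4), round(g, 4))   # pi ~ [0.1908 0.2368 0.3421 0.2303], g ~ 1.4145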
v(0) = Σ_{k=0}^{T−1} P^k q + P^T v(T).    (4.17)

v(0) = lim_{T→∞} [ Σ_{k=0}^{T−1} P^k q + P^T v(T) ]
     = lim_{T→∞} Σ_{k=0}^{T−1} P^k q + lim_{T→∞} P^T v(T).    (4.54)

Interchanging the limit and the sum, v(0) is expressed as the sum of the lim-
iting transition probability matrices for P.

v(0) = Σ_{k=0}^{T−1} lim_{k→∞} P^k q + lim_{T→∞} P^T v(T).    (4.55)

v(0) = Σ_{k=0}^{T−1} Πq + Πv(T)
     = TΠq + Πv(T).    (4.56)

vi(0) = T Σ_{j=1}^{N} πj qj + Σ_{j=1}^{N} πj vj(T),   for i = 1, 2, ..., N.    (4.57)
Recall from Section 4.2.3.1 that the average reward per period, or gain, for a
recurrent MCR is
g = Σ_{j=1}^{N} πj qj.

Let

vi = Σ_{j=1}^{N} πj vj(T).    (4.58)
Substituting quantities from Equations (4.46) and (4.58) into Equation (4.57),
as T becomes large, vi(0) has the form
vi (0) ≈ Tg + vi . (4.59)
Similarly, over a planning horizon that is one period shorter,

vj(1) ≈ (T − 1)g + vj.    (4.60)
vi(0) = qi + Σ_{j=1}^{N} pij vj(1),   for i = 1, 2, ..., N.    (4.61)
Substituting the linear approximations for vi(0) in Equation (4.59) and vj(1)
in Equation (4.60), assumed to be valid for large T, into Equation (4.61) gives
the result
Tg + vi = qi + Σ_{j=1}^{N} pij [(T − 1)g + vj],   for i = 1, 2, ..., N,

        = qi + Tg Σ_{j=1}^{N} pij − g Σ_{j=1}^{N} pij + Σ_{j=1}^{N} pij vj.

Substituting Σ_{j=1}^{N} pij = 1,

Tg + vi = qi + Tg − g + Σ_{j=1}^{N} pij vj

 g + vi = qi + Σ_{j=1}^{N} pij vj,   i = 1, 2, ..., N.    (4.62)
v1 + g = q1 + p11 v1 + p12 v2
v2 + g = q2 + p21 v1 + p22 v2.    (4.63)

v1 + g = q1 + p11 v1
     g = q2 + p21 v1.    (4.64)

The solution is

v1 = (q1 − q2)/(1 − p11 + p21) = (q1 − q2)/(p12 + p21)

g = q2 + p21 (q1 − q2)/(1 − p11 + p21) = [q2 (p12 + p21) + p21 (q1 − q2)]/(p12 + p21)
  = (p21 q1 + p12 q2)/(p12 + p21),    (4.65)
which agrees with the value of the gain obtained in Equations (4.48) and (4.49).
and
Then
g + v = q + Pv. (4.68)
Consider the four-state MCR model of monthly sales for which the transition
probability matrix, P, and reward vector, q, are shown in Equation (4.6).
The matrix equation for the VDEs, after setting v4 = 0, is
Thus, if this MCR operates over an infinite planning horizon, the average
return per period or the gain will be 1.4145. This agrees with the value of
the gain calculated in Equation (4.53). (Discrepancies are due to roundoff
error.) Suppose the system starts in state 4, for which v4 = 0. Then it will earn
116.5022 more than it will earn if it starts in state 1. It will also earn 64.9671
more than it will earn if it starts in state 2, and it will earn 56.9627 more than
it will earn if it starts in state 3.
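The VDEs can also be solved numerically as a single linear system. The
following Python sketch (an added illustration assuming NumPy) fixes the
relative value of the highest numbered state at zero and solves for the
remaining relative values together with the gain, using the monthly sales data.

import numpy as np

def solve_vdes(P, q):
    """Solve g + v = q + P v with v[N-1] = 0; return (g, v)."""
    N = len(q)
    # Unknowns: v_1, ..., v_{N-1}, g  (v_N is fixed at 0).
    A = np.zeros((N, N))
    A[:, :N - 1] = P[:, :N - 1] - np.eye(N)[:, :N - 1]   # coefficients of v_j
    A[:, N - 1] = -1.0                                    # coefficient of g
    x = np.linalg.solve(A, -q)
    v = np.append(x[:N - 1], 0.0)
    return x[N - 1], v

P = np.array([[0.60, 0.30, 0.10, 0.00],
              [0.25, 0.30, 0.35, 0.10],
              [0.05, 0.25, 0.50, 0.20],
              [0.00, 0.10, 0.30, 0.60]])
q = np.array([-20.0, 5.0, -5.0, 25.0])
g, v = solve_vdes(P, q)
print(round(g, 4), v.round(4))   # g ~ 1.4145, v ~ [-116.5022, -64.9671, -56.9627, 0]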
vi (0) = Tg + vi , (4.59)
and
vN (0) = Tg + vN . (4.59b)
The difference vi(0) − vN(0) is termed the expected relative reward earned in
state i because it is equal to the increase in the long run expected total reward
earned if the system starts in state i rather than in state N. This difference is
also the reason why vi is called the relative value of the process when it starts
in state i. If value iteration satisfies a stopping condition after a large number
of iterations, then the expected relative rewards will converge to the relative
values.
Hence, as T grows large, the difference in the expected total rewards earned
in state i over two planning horizons, which differ in length by one period,
approaches the gain, g. The convergence of vi(0) − vi(1) to the gain can be
used to obtain upper and lower bounds on the gain, and also to obtain an
approximate solution for the gain. An upper bound on the gain over a plan-
ning horizon of length T is given by
Since upper and lower bounds on the gain have been obtained, the gain is
approximately equal to the arithmetic average of its upper and lower bounds.
g ≈ [gU(T) + gL(T)]/2.    (4.75)
gU(T) − gL(T) < ε,    (4.76)

or

max_{i=1,...,N} [vi(0) − vi(1)] − min_{i=1,...,N} [vi(0) − vi(1)] < ε.    (4.77)
vi(n) = qi + Σ_{j=1}^{N} pij vj(n + 1),   for i = 1, 2, ..., N.

Step 3. If max_{i=1,...,N} [vi(n) − vi(n + 1)] − min_{i=1,...,N} [vi(n) − vi(n + 1)] < ε, go to step 4.
Otherwise, decrement n by 1 and return to step 2.
Step 4. Stop.
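A compact Python rendering of this procedure is sketched below (an added
illustration assuming NumPy; the tolerance ε and the iteration limit are
choices of the user, not prescribed by the algorithm statement).

import numpy as np

def value_iteration_gain(P, q, eps=1e-4, max_iter=10_000):
    """Iterate v <- q + P v and stop when the spread of the one-step
    differences (the upper bound gU minus the lower bound gL on the gain)
    falls below eps.  Returns the approximate gain and the last v."""
    v_next = np.zeros(len(q))                 # terminal values at the end of the horizon
    for _ in range(max_iter):
        v = q + P @ v_next                    # one backward step of value iteration
        diff = v - v_next                     # v_i(n) - v_i(n+1)
        gU, gL = diff.max(), diff.min()       # bounds on the gain
        if gU - gL < eps:
            return (gU + gL) / 2.0, v         # Equation (4.75): average of the bounds
        v_next = v
    raise RuntimeError("value iteration did not converge within max_iter steps")

# Monthly sales example: the approximate gain should be close to 1.4145.
P = np.array([[0.60, 0.30, 0.10, 0.00],
              [0.25, 0.30, 0.35, 0.10],
              [0.05, 0.25, 0.50, 0.20],
              [0.00, 0.10, 0.30, 0.60]])
q = np.array([-20.0, 5.0, -5.0, 25.0])
g, _ = value_iteration_gain(P, q)
print(round(g, 4))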
TABLE 4.5
Expected Total Rewards for Monthly Sales Calculated by Value Iteration During the
Last 7 Months of an Infinite Planning Horizon

                                          n
Epoch           −7         −6         −5        −4        −3        −2     −1    0
v1(n)        −46.3092   −46.0552   −44.9346   −42.51   −38.15     −31     −20    0
v2(n)          2.8782     1.9073     1.1733     0.8094   1.0375     2.25    5    0
v3(n)          9.3101     7.5174     5.5357     3.2856   0.6875    −2.25   −5    0
v4(n)         64.578     61.8868    58.5146    54.08    47.95      39      25    0
TABLE 4.6
Differences Between the Expected Total Rewards Earned Over Planning Horizons
Which Differ in Length by One Period

                                   n
Epoch              −7        −6        −5        −4       −3       −2      −1
i               vi(−7)−   vi(−6)−   vi(−5)−   vi(−4)−  vi(−3)−  vi(−2)−  vi(−1)−
                 vi(−6)    vi(−5)    vi(−4)    vi(−3)   vi(−2)   vi(−1)   vi(0)
1               −0.254L  −1.1206L  −2.4246L  −4.36L   −7.15L   −11L     −20L
2                0.9709   0.734     0.3639   −0.2263  −1.2143   −2.75     5
3                1.7927   1.9817    2.2501    2.5981   2.9375    2.75    −5
4                2.6912U  3.3722U   4.4346U   6.13U    8.95U    14U      25U
Max [vi(n) − vi(n+1)]
  = gU(T)        2.6912   3.3722    4.4346    6.13     8.95     14       25
Min [vi(n) − vi(n+1)]
  = gL(T)       −0.254   −1.1206   −2.4246   −4.36    −7.15    −11      −20
gU(T) − gL(T)    2.9452   4.4928    6.8592   10.49    16.1      25       45
In Table 4.6, a suffix U identifies gU(T), the maximum difference for each
epoch. The suffix L identifies gL(T), the minimum difference for each epoch.
The differences, gU(T) − gL(T), obtained for all the epochs are listed in the
bottom row of Table 4.6. In Equations (4.53) and (4.70), the gain was found
to be 1.4145. When seven periods remain in the planning horizon, Table 4.6
shows that the bounds on the gain obtained by value iteration are given by
−0.254 ≤ g ≤ 2.6912. The gain is approximately equal to the arithmetic aver-
age of its upper and lower bounds, so that g ≈ [2.6912 + (−0.254)]/2 = 1.2186.
Table 4.7 demonstrates that as the horizon grows longer, the expected rel-
ative reward earned for starting in each state slowly approaches the corre-
sponding relative value for that state when the relative value for the highest
numbered state is set equal to zero.
The expected revenue is $300(1.2) = $360. The ordering cost is $20 + $120cn−1 =
$20 + $120(3) = $380. The next state, which represents the ending inventory,
is given by Xn = Xn−1 + cn−1 − dn = 0 + 3 − dn = 3 − dn.
If the demand is less than three computers, the retailer will have (3 − dn)
unsold computers remaining in stock at the end of a period. The expected
number of computers not sold during a period is given by
E(3 − dn) = Σ_{dn=0}^{3} (3 − dn) p(dn)
          = Σ_{k=0}^{3} (3 − k) P(dn = k)
          = (3 − 0)P(dn = 0) + (3 − 1)P(dn = 1) + (3 − 2)P(dn = 2) + (3 − 3)P(dn = 3)    (4.80)
          = 3(0.3) + 2(0.4) + 1(0.1) + 0(0.2) = 1.8.
$50 Σ_{k=0}^{3} (3 − k) P(dn = k) = $50[3(0.3) + 2(0.4) + 1(0.1) + 0(0.2)] = $50(1.8) = $90.    (4.81)
Since the ending inventory is given by Xn = 3−dn, and the demand will never
exceed three computers, the retailer will never have a shortage of computers.
Hence, she will never incur a shortage cost in state 0.
The expected reward or profit vector for the retailer’s MCR model of her
inventory system is denoted by q = [q0 q1 q2 q3]T. The expected reward or
profit in state 0, denoted by q0, equals the expected revenue minus the order-
ing cost minus the expected holding cost minus the expected shortage cost.
The expected reward in state 0 is
Xn = Xn−1 + cn−1 − dn = 1 + 2 − dn = 3 − dn.

Σ_{k=0}^{3} (3 − k) P(dn = k) = 1.8.    (4.83)
The expected holding cost is also equal to its value when Xn−1 = 0, and is
given by
$50 Σ_{k=0}^{3} (3 − k) P(dn = k) = $90.    (4.84)
E[min(2, dn)] = Σ_{dn=0}^{3} min(2, dn) p(dn) = Σ_{k=0}^{3} min(2, k) P(dn = k)
              = min(2, 0)P(dn = 0) + min(2, 1)P(dn = 1) + min(2, 2)P(dn = 2) + min(2, 3)P(dn = 3)    (4.86)
              = (0)(0.3) + (1)(0.4) + (2)(0.1) + (2)(0.2) = 1.
Xn = Xn−1 + cn−1 − dn = 2 + 0 − dn = 2 − dn.    (4.87)

E[max(2 − dn, 0)] = Σ_{dn=0}^{2} (2 − dn) p(dn)
                  = Σ_{k=0}^{2} (2 − k) P(dn = k)
                  = (2 − 0)P(dn = 0) + (2 − 1)P(dn = 1) + (2 − 2)P(dn = 2)    (4.88)
                  = 2(0.3) + 1(0.4) + 0(0.1) = 1.
$50 Σ_{k=0}^{2} (2 − k) P(dn = k) = $50[2(0.3) + 1(0.4)] = $50(1) = $50.    (4.89)
When the demand exceeds two computers, the retailer will have a shortage
of (dn − 2) computers, and thus will incur a shortage cost. Since the demand
will never exceed three computers, the retailer can have a maximum short-
age of one computer. The expected number of computers not available to
satisfy the demand during a period is equal to the expected number of short-
ages. The expected number of shortages is given by
E[max(dn − 2, 0)] = Σ_{dn=2}^{3} (dn − 2) p(dn) = Σ_{k=2}^{3} (k − 2) P(dn = k)
                  = (2 − 2)P(dn = 2) + (3 − 2)P(dn = 3)    (4.90)
                  = (0)(0.1) + (1)(0.2) = 0.2.
$40 Σ_{k=2}^{3} (k − 2) P(dn = k) = $40[1(0.2)] = $40(0.2) = $8.    (4.91)
The expected reward in state 2, denoted by q2, equals the expected revenue
minus the ordering cost minus the expected holding cost minus the expected
shortage cost. The expected reward in state 2 is
When Xn−1 = 3, the order quantity is cn−1 = 0, so that no computers are ordered,
and no ordering cost is incurred. In state 3, as in states 0 and 1, the expected
revenue is $300(1.2) = $360.
The next state is
Xn = Xn−1 + cn−1 − dn = 3 + 0 − dn = 3 − dn.    (4.93)

Σ_{k=0}^{3} (3 − k) P(dn = k) = 1.8.    (4.94)
The expected holding cost is equal to its value when Xn−1 = 0 and Xn−1 = 1, and
is given by
$50 Σ_{k=0}^{3} (3 − k) P(dn = k) = $50(1.8) = $90.    (4.95)
In state 3, as in states 0 and 1, the retailer will never have a shortage of com-
puters, and thus will never incur a shortage cost.
The expected reward in state 3, denoted by q3, equals the expected rev-
enue minus the ordering cost minus the expected holding cost minus the
expected shortage cost. The expected reward in state 3 is
Beginning Inventory, Xn−1   Order, cn−1   Xn−1 + cn−1
          0                     3              3
          1                     2              3
          2                     0              2
          3                     0              3

    State [ 0    1    2    3 ]        Reward
    0     [0.2  0.1  0.4  0.3]        [−110]
P = 1     [0.2  0.1  0.4  0.3],   q = [  10].    (4.97)
    2     [0.3  0.4  0.3  0  ]        [ 242]
    3     [0.2  0.1  0.4  0.3]        [ 270]
Her expected average reward per period, or gain, is equal to the sum, over
all states, of the expected reward in each state multiplied by the steady-state
probability for that state. The steady-state probability vector for the inven-
tory system was calculated in Equation (2.27). Using Equation (4.47), the gain
earned by following a (2, 3) inventory ordering policy is
g = πq = [0.2364 0.2091 0.3636 0.1909][−110; 10; 242; 270] = 115.6212.    (4.98)
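The reward vector and the gain for the retailer's (2, 3) policy can be reproduced
with a short calculation; the Python sketch below is an added illustration
assuming NumPy and uses the demand distribution, prices, and costs stated in
this example.

import numpy as np

# Demand distribution for one period: P(d = k), k = 0, 1, 2, 3.
pd = np.array([0.3, 0.4, 0.1, 0.2])
d = np.arange(4)

def expected_reward(stock_after_order, order_qty):
    """Expected one-period profit: revenue - ordering - holding - shortage."""
    revenue  = 300 * np.sum(np.minimum(stock_after_order, d) * pd)   # $300 per computer sold
    ordering = 20 + 120 * order_qty if order_qty > 0 else 0          # fixed plus per-unit cost
    holding  = 50 * np.sum(np.maximum(stock_after_order - d, 0) * pd)
    shortage = 40 * np.sum(np.maximum(d - stock_after_order, 0) * pd)
    return revenue - ordering - holding - shortage

# Under the (2, 3) policy, states 0 and 1 order up to 3; states 2 and 3 order nothing.
orders = {0: 3, 1: 2, 2: 0, 3: 0}
q = np.array([expected_reward(x + c, c) for x, c in orders.items()])
print(q)          # reward vector of Equation (4.97): [-110, 10, 242, 270]

pi = np.array([0.2364, 0.2091, 0.3636, 0.1909])   # steady-state vector, Equation (2.27)
print(round(pi @ q, 4))                            # gain ~ 115.62, as in Equation (4.98)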
qi = $2 + $10 pi 0 . (4.99)
The recurrent MCR model of component replacement has the following tran-
sition probability matrix, P, and cost vector, q:
    State [0      1      2      3  ]          Cost
    0     [0.2    0.8    0      0  ]       0 [ $4   ]
P = 1     [0.375  0      0.625  0  ],  q = 1 [ $5.75].    (4.101)
    2     [0.8    0      0      0.2]       2 [ $10  ]
    3     [1      0      0      0  ]       3 [ $12  ]
The expected average cost per week, or negative gain, for this component
replacement policy is equal to the sum, over all states, of the cost in each
state multiplied by the steady-state probability for that state. The steady-state
probability vector for the component replacement model is calculated in
Equation (2.29). Using Equation (4.47), the negative gain incurred by follow-
ing this replacement policy is:
g = πq = [0.4167 0.3333 0.2083 0.0417][$4; $5.75; $10; $12] = $6.17.    (4.102)
P = [S 0; D Q],   q = [qR; qT],    (4.103)
where the components of the reward vector q are the vectors qR and qT. The
entries of vector qR are the rewards received for visits to the recurrent states.
The entries of vector qT are the rewards received for visits to the transient
states.
lim_{n→∞} P^n = [lim_{n→∞} S^n  0; lim_{n→∞} Dn  0] = [Π 0; Π 0].    (3.148)

g = lim_{n→∞} P^(n) q = [lim_{n→∞} S^n  0; lim_{n→∞} Rn  0][qR; qT]
  = [Π 0; Π 0][qR; qT] = [ΠqR; ΠqR] = [gR; gR],    (4.104)
where each row of the matrix Π is the steady-state probability vector, π = [πi],
for the recurrent chain S, and gR is the gain vector for the recurrent states.
Observe that gR = ΠqR. By the rules of matrix multiplication, all the compo-
nents of the gain vector for a unichain MCR are equal. All states, both recur-
rent and transient, have the same gain, which is equal to the gain of the closed
class S of recurrent states. If g denotes the gain in every state of a unichain
MCR, then the scalar gain g in every state of a unichain MCR with N recur-
rent states is
g = Σ_{i=1}^{N} πi qi.    (4.105)
In vector form,
g = π qR . (4.106)
P = [pij] = [p11 p12 0 0; p21 p22 0 0; p31 p32 p33 p34; p41 p42 p43 p44] = [S 0; D Q],
q = [q1; q2; q3; q4] = [qR; qT],

g = lim_{n→∞} P^(n) q = [π1 π2 0 0; π1 π2 0 0; π1 π2 0 0; π1 π2 0 0][q1; q2; q3; q4]
  = [g1; g2; g3; g4] = [g; g; g; g]
  = [π1 q1 + π2 q2; π1 q1 + π2 q2; π1 q1 + π2 q2; π1 q1 + π2 q2].    (4.108)
As Section 4.2.4.1 has indicated, the gain for the unichain MCR is the same
in every state. Treating the two-state recurrent chain separately, the gain,
calculated by using (4.47), is
g = πq = πqR = [π1 π2][q1; q2] = π1 q1 + π2 q2.    (4.109)
Since all rows of the limiting transition probability matrix for the four-state
unichain are identical, the gain in every state of the unichain MCR can also
be computed by applying Equation (4.47) to the unichain MCR. That is,
g = πq = [π1 π2 0 0][q1; q2; q3; q4] = [π1 π2][q1; q2] = π1 q1 + π2 q2.    (4.110)
The generic two-state recurrent chain has the transition probability matrix
    1 [p11  p12]
S = 2 [p21  p22].    (4.111)

g = π1 q1 + π2 q2 = (p21 q1 + p12 q2)/(p12 + p21).    (4.112)
P = [1 0; D Q],   q = [q1; qT],    (4.113)
where the scalar q1 is the reward received when state 1, the absorbing state,
is visited. The steady-state probability for the absorbing state is π1 = 1. Hence
the gain in every state of an absorbing unichain MCR is equal to the reward
received in that state. That is,
g = π 1 q1 = 1q1 = q1 . (4.114)
TABLE 4.8
Daily Revenue Earned in Every State

State   Condition                           Daily Revenue
1       Not Working (NW)                    $0
2       Working, with a Major Defect (WM)   $200
3       Working, with a Minor Defect (Wm)   $500
4       Working Properly (WP)               $1,000
TABLE 4.9
Daily Maintenance Costs

Decision   Action            Daily Cost
1          Do Nothing (DN)   $0
2          Overhaul (OV)     $300
3          Repair (RP)       $700
4          Replace (RL)      $1,200
repair it is $700. The daily cost to replace the machine with a new machine
is $1,200. The daily maintenance costs in every state are summarized in
Table 4.9.
The reward vector associated with the original maintenance pol-
icy of Section 1.10.1.2.2.1, under which the decision is to overhaul the
machine in states 1 and 3, and do nothing in states 2 and 4, appears in
Equation (4.115).
         Reward
    [  −$300]   [q1]
q = [   $200] = [q2].    (4.117)
    [   $500]   [q3]
    [ $1,000]   [q4]
g = πq = [π1 π2 0 0][q1; q2; q3; q4] = [π1 π2][q1; q2] = π1 q1 + π2 q2
  = [3/7 4/7][−300; 200] = −100/7 = −14.29.    (4.118)
Note that the gain is the same under both maintenance policies.
a solution of the VDEs will also produce the relative values. Since all states in
a unichain MCR have the same gain, denoted by g, the VDEs for a unichain
MCR are identical to those for a recurrent MCR. However, it is not necessary
to simultaneously solve the full set of VDEs for a unichain MCR. It is suffi-
cient to first solve the subset of the VDEs associated with the recurrent states
to determine the gain and the relative values of the recurrent states. The rel-
ative values for the transient states may be obtained next by substituting the
relative values of the recurrent states into the VDEs for the transient states.
The relative values for all the states are used to execute policy improvement
for an MDP in Section 5.1.2.3.3 of Chapter 5.
g + vi = qi + Σ_{j=1}^{4} pij vj = qi + pi1 v1 + pi2 v2 + pi3 v3 + pi4 v4,   for i = 1, 2, 3, 4.    (4.119)

Setting p13 = p14 = 0 and p23 = p24 = 0 for the unichain MCR, the four VDEs
become

g + v1 = q1 + p11 v1 + p12 v2
g + v2 = q2 + p21 v1 + p22 v2
g + v3 = q3 + p31 v1 + p32 v2 + p33 v3 + p34 v4    (4.121)
g + v4 = q4 + p41 v1 + p42 v2 + p43 v3 + p44 v4.
The gain can be calculated by solving the subset of two VDEs for the recur-
rent chain separately. Since states 1 and 2 are the two recurrent states, the
VDEs for the two-state recurrent chain within the four-state unichain are
g + v1 = q1 + p11v1 + p12 v2
(4.122)
g + v2 = q2 + p21v1 + p22 v2.
Setting v2 = 0 for the highest numbered state in the recurrent chain, the VDEs
for the recurrent chain become
g + v1 = q1 + p11v1 (4.123)
g = q2 + p21v1 .
v1 = (q1 − q2)/(1 − p11 + p21) = (q1 − q2)/(p12 + p21)

g = q2 + p21 (q1 − q2)/(1 − p11 + p21) = [q2 (p12 + p21) + p21 (q1 − q2)]/(p12 + p21)    (4.124)
  = (p21 q1 + p12 q2)/(p12 + p21) = π1 q1 + π2 q2,
which agree with the gain and the relative values obtained for a generic two-
state recurrent MCR in Equation (4.65). The solution for the gain also agrees
with the solution obtained for the generic four-state MCR in Section 4.2.4.1.1.
By substituting the gain and the relative values of the recurrent states into
the VDEs for the transient states, the latter two VDEs can be solved to find
the relative values of the transient states. The two VDEs for the transient
states are
Rewriting the two VDEs with the two unknowns, the relative values v3 and
v4 for the two transient states, on the left hand side,
In matrix form, the two VDEs for the transient states become
Setting v2 = 0 for the highest numbered state in the recurrent chain, the VDEs
for the recurrent chain become
g + v1 = −300 + 0.2v1
(4.130)
g = 200 + 0.6v1.
v1 = −357.14
 g = −14.29.    (4.131)
The same value for the gain was obtained in Equation (4.118).
Step 2. The VDEs for the two transient states are
Substituting the gain, g =−14.29, and the relative values, v1 = −357.14, v2 = 0, for
the recurrent states obtained in step 1, the VDEs for the transient states are
Step 3. The solutions for v3 and v4, the relative values of the transient states, are
v3 = 885.72
v4 = 1659.53.    (4.134)
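The two-stage solution of the VDEs for this unichain can be carried out
numerically as follows; the Python sketch is an added illustration assuming
NumPy and uses the data of the modified maintenance policy.

import numpy as np

# Machine-maintenance unichain MCR (states 1, 2 recurrent; states 3, 4 transient).
P = np.array([[0.2, 0.8, 0.0, 0.0],
              [0.6, 0.4, 0.0, 0.0],
              [0.2, 0.3, 0.5, 0.0],
              [0.3, 0.2, 0.1, 0.4]])
q = np.array([-300.0, 200.0, 500.0, 1000.0])

# Step 1: VDEs for the recurrent chain {1, 2} with v2 = 0.
# Unknowns (v1, g):  g + v1 = q1 + p11*v1   and   g = q2 + p21*v1.
A = np.array([[1 - P[0, 0], 1.0],
              [   -P[1, 0], 1.0]])
v1, g = np.linalg.solve(A, q[:2])
v_rec = np.array([v1, 0.0])

# Steps 2-3: VDEs for the transient states, (I - Q) vT = qT + D v_rec - g.
D, Q = P[2:, :2], P[2:, 2:]
vT = np.linalg.solve(np.eye(2) - Q, q[2:] + D @ v_rec - g)
print(round(g, 2), round(v1, 2), vT.round(2))
# -14.29, -357.14, [885.71, 1659.52], matching the values above up to roundoff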
To begin value iteration, the terminal values at the end of the planning hori-
zon are set equal to zero for all states.
n = 0

[v1(0); v2(0); v3(0); v4(0)] = [0; 0; 0; 0].
Value iteration is executed over the last three periods of an infinite planning
horizon.
n = −1
v(−1) = q + Pv(0)

[v1(−1); v2(−1); v3(−1); v4(−1)] = [−300; 200; 500; 1000]
    + [0.2 0.8 0 0; 0.6 0.4 0 0; 0.2 0.3 0.5 0; 0.3 0.2 0.1 0.4][0; 0; 0; 0]
    = [−300; 200; 500; 1000]    (4.136)

n = −2
v(−2) = q + Pv(−1)

[v1(−2); v2(−2); v3(−2); v4(−2)] = [−300; 200; 500; 1000]
    + [0.2 0.8 0 0; 0.6 0.4 0 0; 0.2 0.3 0.5 0; 0.3 0.2 0.1 0.4][−300; 200; 500; 1000]
    = [−200; 100; 750; 1400]    (4.137)

n = −3
v(−3) = q + Pv(−2)

[v1(−3); v2(−3); v3(−3); v4(−3)] = [−300; 200; 500; 1000]
    + [0.2 0.8 0 0; 0.6 0.4 0 0; 0.2 0.3 0.5 0; 0.3 0.2 0.1 0.4][−200; 100; 750; 1400]
    = [−260; 120; 865; 1595].    (4.138)
After executing value iteration over four additional periods, the results are
summarized in Table 4.10.
Table 4.11 gives the differences between the expected total rewards earned
over planning horizons, which differ in length by one period.
TABLE 4.10
Expected Total Rewards for Machine Maintenance Under the Modified
Maintenance Policy Calculated by Value Iteration During the Last Seven Periods
of an Infinite Planning Horizon

                                             n
Epoch         −7          −6         −5        −4      −3     −2     −1    0
v1(n)      −304.416    −288.96    −277.6    −256    −260   −200   −300    0
v2(n)        53.312      66.72      83.2      92     120    100    200    0
v3(n)       930.6065    936.765    934.65    916.5   865    750    500    0
v4(n)      1703.2945   1707.405   1701.45   1670.5  1595   1400   1000    0
TABLE 4.11
Differences Between the Expected Total Rewards Earned Over Planning Horizons
Which Differ in Length by One Period

Epoch                             −7        −6       −5      −4      −3     −2      −1
Max [vi(n) − vi(n+1)] = gU(T)     −4.1105    5.955   30.95    75.5   195    400    1000
Min [vi(n) − vi(n+1)] = gL(T)    −15.456   −16.48   −21.6    −28    −60   −100    −300
gU(T) − gL(T)                     11.3455   22.435   52.55   103.5  255    500    1300
In Table 4.11, the row labeled gU(T) lists the maximum difference for each
epoch, and the row labeled gL(T) lists the minimum difference. The differences,
gU(T) − gL(T), obtained for all the epochs are listed in the bottom row of
Table 4.11. In Equations (4.118) and (4.131), the gain was found
to be –14.29 thousand dollars per period. When seven periods remain in the
planning horizon, Table 4.11 shows that the bounds on the gain are given by
−15.456 ≤ g ≤ −4.1105. The bounds are quite loose. After seven iterations, using
Equation (4.75), the gain is approximately equal to the arithmetic average of
its upper and lower bounds, so that g ≈ [−4.1105 + (−15.456)]/2 = −9.78.
Many more iterations will be needed to see if value iteration will generate a
close approximation to the gain.
Subtracting v2(n), the expected total reward for the highest numbered state
in the recurrent chain, from all of the other expected total rewards in Table 4.10
produces Table 4.12, which shows the expected relative rewards, vi(n) − v2(n),
during the last seven epochs of the planning horizon.
TABLE 4.12
Expected Relative Rewards, vi(n) − v2(n), During the Last Seven Epochs of the
Planning Horizon

                                                n
Epoch            −7          −6         −5        −4      −3     −2     −1    0
v1(n) − v2(n)  −357.728   −355.68    −360.8    −348    −380   −300   −500    0
v3(n) − v2(n)   877.2945   870.045    851.45    824.5   745    650    300    0
v4(n) − v2(n)  1649.9825  1640.685   1618.25   1578.5  1465   1300    800    0
In Equations (4.131) and (4.134), the relative values obtained by solving the
VDEs are, by comparison,
gR = [g; g],  vT = [v3; v4],  qT = [q3; q4],  and  Q = [p33 p34; p43 p44],    (4.141)

in Equation (4.125), the matrix equation for the vector vT of relative values for
the transient states is

gR + vT = qT + QvT
IvT − QvT = qT − gR
(I − Q)vT = qT − gR    (4.142)
vT = (I − Q)^{−1}(qT − gR).
The first term, UqT, represents the vector of expected total rewards received
before passage to the closed class of recurrent states, given that the chain
started in a transient state. The second term, UgR, is the vector of expected
total rewards received after passage to the recurrent closed class. Observe
that when the components of vector qR are set equal to zero, then the vector
gR = 0 because
gR = πqR = [π1 π2][q1; q2] = [π1 π2][0; 0] = 0    (4.144)
and
Thus, if the rewards received in all the recurrent states are set equal to zero,
then vT, the vector of relative values for the transient states, is equal to UqT, the
vector of expected total rewards received before passage to the closed class
of recurrent states, given that the chain started in a transient state. In other
words, by setting qR = 0, vT = UqT can be found by solving the VDEs.
The following alternative approach to interpreting UqT as the vector of
expected total rewards received before passage to a closed class of recurrent
states, given that the chain started in a transient state, does not involve solv-
ing the VDEs for the relative values of the transient states. Recall from Section
3.2.2 that uij, the (i, j)th entry of the fundamental matrix, U, specifies the
expected number of times that the chain is in transient state j before eventual
passage to a recurrent state, given that the chain started in transient state i. The
entry in row j of the vector qT, denoted by (qT)j, is the reward received each time
the chain is in transient state j. Therefore, the entry in row i of the vector UqT,
denoted by (UqT)i, is equal to the following sum of products:
(UqT)i = Σ_{j∈T} uij (qT)j.    (4.146)
Both of these approaches have demonstrated that the following result holds for
any unichain MCR. Suppose that P is the transition probability matrix, U is the
fundamental matrix, qT is the vector of rewards received in the transient states,
and T denotes the set of transient states. Then the ith component of the vector
UqT represents the expected total reward earned before eventual passage to
the recurrent closed class, given that the chain started in transient state i.
    1 [$15,000]
    2 [ 14,000]
    3 [ 16,000]   [qR]
q = 4 [ 12,000] = [qT],    (4.147)
    5 [ 13,000]
    6 [ 11,000]
                       4       5       6                        4        5        6
                 4 [ 0.70   −0.16   −0.24]^{−1}          4 [1.9918   1.0215   0.9193]
U = (I − Q)^{−1} = 5 [−0.26    0.60   −0.18]         =    5 [1.1300   2.5026   1.0023].    (4.148)
                 6 [−0.14   −0.32    0.72]               6 [0.8895   1.3109   2.0131]
The entry in row i of the vector UqT represents the expected total salary
earned by an engineer before she is promoted to a management position,
given that she started in a transient state i. For example, an engineer who
started in state 6, systems testing, can expect to earn $49,859.80 prior to being
promoted to management.
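The fundamental matrix and the vector UqT can be computed in a few lines; the
Python sketch below is an added illustration assuming NumPy, with the submatrix
Q implied by Equation (4.148) and the monthly salaries of the transient states
taken from Equation (4.147).

import numpy as np

# Transient (engineering) states 4, 5, 6.
Q = np.array([[0.30, 0.16, 0.24],
              [0.26, 0.40, 0.18],
              [0.14, 0.32, 0.28]])
qT = np.array([12_000.0, 13_000.0, 11_000.0])   # monthly salaries in states 4, 5, 6

U = np.linalg.inv(np.eye(3) - Q)     # fundamental matrix, U = (I - Q)^(-1)
print(U.round(4))                    # matches Equation (4.148)
print((U @ qT).round(2))             # expected total salary before promotion:
                                     # about [47293.4, 57119.1, 49859.8] for states 4, 5, 6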
Of course, if the engineer is interested solely in calculating the expected
total salary that she will earn before she is promoted to a management posi-
tion, she can merge the three recurrent states, that is, states 1, 2, and 3, into a
single absorbing state, denoted by 0, which will represent management. The
result will be the following four-state absorbing unichain:
    0 Management                        [1     0     0     0   ]
    4 Engineering Product Design        [0.30  0.30  0.16  0.24]   [1 0]
P = 5 Engineering Systems Integration   [0.16  0.26  0.40  0.18] = [D Q].    (4.150)
    6 Engineering Systems Testing       [0.26  0.14  0.32  0.28]
Since the matrix Q and the vector qT are unchanged, the same entries will be
obtained for the vector UqT.
Suppose that the fundamental matrix has not been calculated. If the com-
ponents of qR, the vector of monthly salaries earned in the recurrent states,
are set equal to zero, then, as Section 4.2.4.4.1 indicates, solving the VDEs
for the vector vT of relative values for the transient states is an alternative
way of calculating the expected total reward received before passage to the
recurrent closed class, given that the chain started in a transient state. For
example, suppose that the modified MCR is
In vector form, the VDEs for the relative values of the transient states are
g R + vT = qT + QvT
gR = 0 (4.152)
vT = qT + QvT .
The solution is
vT = [v4; v5; v6] = [47,293.40; 57,119.10; 49,859.80].
These are the same values calculated in Equation (4.149) for UqT, the vector of
expected total salaries received before passage to the closed class of recur-
rent states, given that the employee started as an engineer.
    0 Accept Buyout                     [1     0     0     0   ]
    4 Engineering Product Design        [0.30  0.30  0.16  0.24]   [1 0]
P = 5 Engineering Systems Integration   [0.16  0.26  0.40  0.18] = [D Q],
    6 Engineering Systems Testing       [0.26  0.14  0.32  0.28]

    0 Accept Buyout                     [$50,000]
    4 Engineering Product Design        [ 12,000]   [qA]
q = 5 Engineering Systems Integration   [ 13,000] = [qT].    (4.154)
    6 Engineering Systems Testing       [ 11,000]
Value iteration using Equation (4.29) will be executed over a 3-month plan-
ning horizon to calculate the expected total cost to the company of offer-
ing a buyout to an engineer. To begin the backward recursion, the vector of
expected terminal total costs received at the end of the 3-month planning
horizon is set equal to zero for all states.
n = T = 3
v(3) = v(T) = 0

[v0(3); v4(3); v5(3); v6(3)] = [0; 0; 0; 0]

n = 2
v(2) = q + Pv(3)

[v0(2); v4(2); v5(2); v6(2)] = [50,000; 12,000; 13,000; 11,000]
    + [1 0 0 0; 0.30 0.30 0.16 0.24; 0.16 0.26 0.40 0.18; 0.26 0.14 0.32 0.28][0; 0; 0; 0]
    = [50,000; 12,000; 13,000; 11,000]    (4.155)

n = 1
v(1) = q + Pv(2)

[v0(1); v4(1); v5(1); v6(1)] = [50,000; 12,000; 13,000; 11,000]
    + [1 0 0 0; 0.30 0.30 0.16 0.24; 0.16 0.26 0.40 0.18; 0.26 0.14 0.32 0.28][50,000; 12,000; 13,000; 11,000]
    = [100,000; 35,320; 31,300; 32,920]    (4.156)

n = 0
v(0) = q + Pv(1)

[v0(0); v4(0); v5(0); v6(0)] = [50,000; 12,000; 13,000; 11,000]
    + [1 0 0 0; 0.30 0.30 0.16 0.24; 0.16 0.26 0.40 0.18; 0.26 0.14 0.32 0.28][100,000; 35,320; 31,300; 32,920]
    = [150,000; 65,504.80; 56,628.80; 61,178.40].    (4.157)
Equation (4.157) indicates that after 3 months, the expected total cost to the
company of offering a buyout to an engineer, given that she is currently
engaged in product design, systems integration, or systems testing, will be
$65,504.80, $56,628.80, or $61,178.40, respectively.
As an alternative to executing the value iteration equation, Equation (4.17) with
T = 3 can also be used to calculate v(0) over a 3-month planning horizon.
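A sketch of that alternative calculation in Python (an added illustration
assuming NumPy) follows, using the transition matrix and cost vector of
Equation (4.154).

import numpy as np

# Buyout model of Equation (4.154): state 0 is absorbing; 4, 5, 6 are the engineering states.
P = np.array([[1.00, 0.00, 0.00, 0.00],
              [0.30, 0.30, 0.16, 0.24],
              [0.16, 0.26, 0.40, 0.18],
              [0.26, 0.14, 0.32, 0.28]])
q = np.array([50_000.0, 12_000.0, 13_000.0, 11_000.0])
T = 3
vT = np.zeros(4)                                  # terminal values at the end of month 3

# Equation (4.17): v(0) = sum_{k=0}^{T-1} P^k q + P^T v(T).
v0 = sum(np.linalg.matrix_power(P, k) @ q for k in range(T)) \
     + np.linalg.matrix_power(P, T) @ vT
print(v0.round(2))   # [150000., 65504.8, 56628.8, 61178.4], agreeing with Equation (4.157)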
    [P1  0   ...  0   0]       [q1]
    [0   P2  ...  0   0]       [q2]
P = [ ⋮   ⋮   ⋱    ⋮   ⋮],  q = [ ⋮ ].    (4.160)
    [0   0   ...  PM  0]       [qM]
    [D1  D2  ...  DM  Q]       [qT]
The components of vectors q1, … , qM are the rewards received by the recurrent
states in the recurrent chains governed by the transition matrices P1 ,…,PM,
respectively. The components of vector qT are the rewards received in the
transient states.
TABLE 4.13
Operation Times and Costs

State   Operation             Operation Time (h)   Cost Per Hour    Cost Per Operation
1       Scrap the output      2.6                  $60 (disposal)   $156 = (2.6 h)($60) = q1
2       Sell the output       1.4                  $40              $56 = (1.4 h)($40) = q2
3       Train engineers       3.2                  $50              $160 = (3.2 h)($50) = q3
4       Train technicians     4.1                  $30              $123 = (4.1 h)($30) = q4
5       Train tech. writers   5.3                  $55              $291.50 = (5.3 h)($55) = q5
6       Stage 3               10                   $45              $450 = (10 h)($45) = q6
7       Stage 2               16                   $25              $400 = (16 h)($25) = q7
8       Stage 1               12                   $35              $420 = (12 h)($35) = q8
Table 4.13, is equal to the product of the entries in columns three and four
of row i of Table 4.13. The complete eight-state multichain MCR model of
production is shown below:
    1 [1    0    0    0    0    0    0    0   ]       1 [156  ]
    2 [0    1    0    0    0    0    0    0   ]       2 [ 56  ]
    3 [0    0    0.50 0.30 0.20 0    0    0   ]       3 [160  ]
P = 4 [0    0    0.30 0.45 0.25 0    0    0   ],  q = 4 [123  ]    (4.161)
    5 [0    0    0.10 0.35 0.55 0    0    0   ]       5 [291.5]
    6 [0.20 0.16 0.04 0.03 0.02 0.55 0    0   ]       6 [450  ]
    7 [0.15 0    0    0    0    0.20 0.65 0   ]       7 [400  ]
    8 [0.10 0    0    0    0    0    0.15 0.75]       8 [420  ]
TABLE 4.14
Expected Operation Costs Per Entering Item

State   Operation          Operation Cost   Expected Operation Cost Per Entering Item
1       Scrap the Output   $156             q1 f81 = ($156)(0.8095) = $126.28
2       Sell the Output    $56              q2 f82 = ($56)(0.1219) = $6.83
3       Train Engineers    $160             q3 lim_{n→∞} p83^(n) = ($160)(0.0199) = $3.18
As Equation (3.103) indicates, if the chain occupies a transient state, the matrix
of absorption probabilities is
                     1       2
                6 [0.4444  0.3556]
F = U[D1 D2] =  7 [0.6825  0.2032] = [F1 F2].    (3.103)
                8 [0.8095  0.1219]
The operations “scrap the output,” state 1, and “sell the output,” state 2, are
both represented by absorbing states. The expected cost that will be incurred
by an entering item in an absorbing state j is equal to the operation cost for
the absorbing state multiplied by the probability f8j that an entering item will
be absorbed in absorbing state j. The expected cost that will be incurred by
an entering item in each of the two absorbing states is calculated in rows 1
and 2 in the right-hand column of Table 4.14.
As Section 3.5.5.2 indicates, if the chain occupies a transient state, the
matrix of limiting probabilities of transitions from the set of transient states,
T = {6, 7, 8}, to the three recurrent states, which belong to the recurrent chain,
R = {3, 4, 5}, is given by Equation (3.163). As Equation (3.164) demonstrates, the
expected cost that will be incurred by an entering item in a recurrent state j
is equal to the operation cost for the recurrent state multiplied by
the limiting probability for recurrent state j. The expected cost that will be
incurred by an entering item in each of the three recurrent states associated
TABLE 4.15
Expected Operation Costs Per Item Sold

State   Operation                 Expected Operation Cost per Item Sold
                                  = (Expected Operation Cost per Entering Item)/f82
1       Scrap the Output          ($126.28)/f82 = ($126.28)/(0.1219) = $1,035.95
2       Sell the Output           ($6.83)/f82 = ($6.83)/(0.1219) = $56
3       Train Engineers           ($3.18)/f82 = ($3.18)/(0.1219) = $26.09
4       Train Technicians         ($3.15)/f82 = ($3.15)/(0.1219) = $25.84
5       Train Technical Writers   ($6.73)/f82 = ($6.73)/(0.1219) = $55.21
6       Stage 3                   ($342.86)/f82 = ($342.86)/(0.1219) = $2,812.63
7       Stage 2                   ($685.72)/f82 = ($685.72)/(0.1219) = $5,625.27
8       Stage 1                   ($1,680)/f82 = ($1,680)/(0.1219) = $13,781.79
                                                            Sum = $23,418.78
    1 [1    0    0    0  ]               1 [q1]
    2 [0    1    0    0  ]   [I 0]       2 [q2]   [qA]
P = 3 [p31  p32  p33  p34] = [D Q],  q = 3 [q3] = [qT].    (4.163)
    4 [p41  p42  p43  p44]               4 [q4]
Absorbing states 1 and 2, which belong to the set A, have gains denoted by g1
and g2, respectively. Transient states 3 and 4, which belong to the set T, have
gains denoted by g3 and g4, respectively. The limiting transition probability
matrix for the absorbing multichain was calculated in Equation (3.135). The
gain vector for the four-state absorbing multichain MCR is computed using
Equation (4.43).
                        1 [1    0    0  0] [q1]   [q1             ]   [g1]
                        2 [0    1    0  0] [q2]   [q2             ]   [g2]   [gA]
g = lim_{n→∞} P^(n) q = 3 [f31  f32  0  0] [q3] = [f31 q1 + f32 q2] = [g3] = [gT].    (4.164)
                        4 [f41  f42  0  0] [q4]   [f41 q1 + f42 q2]   [g4]
The gain of the absorbing multichain MCR depends on the state in which it
starts. If the system starts in an absorbing state i, for i = 1 or 2, the gain will
be gi = qi. Thus, in a multichain MCR, the gain of an absorbing state is equal
to the reward received in that state. If the system is initially in transient state
i, for i = 3 or 4, the gain will be
g i = f i1 q1 + f i 2 q2 (4.165)
because the chain will eventually be absorbed in state 1 with probability fi1,
or in state 2 with probability fi2.
    1 [1    0    0    0    0  ]       1 [q1]
    2 [0    p22  p23  0    0  ]       2 [q2]
P = 3 [0    p32  p33  0    0  ],  q = 3 [q3].    (4.166)
    4 [p41  p42  p43  p44  p45]       4 [q4]
    5 [p51  p52  p53  p54  p55]       5 [q5]
State 1, with a gain denoted by g1, is an absorbing state, and constitutes the first
recurrent closed set. States 2 and 3, with gains denoted by g2 and g3, respec-
tively, are members of the second recurrent chain, denoted by R2 = {2, 3}. Since
both recurrent states belong to the same recurrent chain, they have the same
gain, denoted by gR2, so that
g 2 = g 3 = g R2 , (4.167)
where gR2 denotes the gain of all states which belong to the recurrent chain
R 2. States 4 and 5 are transient, with gains denoted by g4 and g5, respectively.
The limiting transition probability matrix for the multichain was calculated
in Equation (3.158). In the limiting transition probability matrix, π2 and π3 are
the steady-state probabilities for states 2 and 3, respectively, within the recur-
rent chain R 2. The probabilities of absorption from transient states 4 and 5
are denoted by f41 and f51, respectively. The probabilities of eventual passage
from transient states 4 and 5 to the recurrent chain R 2 are denoted by f4R2
and f5R2, respectively. All of these quantities are computed for the generic
five-state multichain in Section 3.5.5.1. The gain vector for the five-state mul-
tichain MCR is computed using Equation (4.43).
                        1 [1    0        0        0  0] [q1]   [g1]   [g1 ]
                        2 [0    π2       π3       0  0] [q2]   [g2]   [gR2]
g = lim_{n→∞} P^(n) q = 3 [0    π2       π3       0  0] [q3] = [g3] = [gR2]
                        4 [f41  f4R2 π2  f4R2 π3  0  0] [q4]   [g4]   [g4 ]
                        5 [f51  f5R2 π2  f5R2 π3  0  0] [q5]   [g5]   [g5 ]

    [q1                           ]
    [π2 q2 + π3 q3                ]
  = [π2 q2 + π3 q3                ].    (4.168)
    [f41 q1 + f4R2 (π2 q2 + π3 q3)]
    [f51 q1 + f5R2 (π2 q2 + π3 q3)]
The gain of the multichain MCR depends on the state in which it starts. If
the system starts in state 1, an absorbing state, the gain will be g1 = q1. Thus,
in a multichain MCR, the gain of an absorbing state is equal to the reward
received in that state, confirming the conclusion of Equation (4.114) in Section
4.2.4.1.2. If the system is initially in either of the two communicating recur-
rent states, states 2 or 3, the gain will be
g R2 = π 2 q2 + π 3 q3 . (4.169)
If the chain starts in transient state 4, either it will eventually be absorbed with
probability f41, or it will eventually enter the recurrent chain R2 with probabil-
ity f4R2. Hence, if the chain starts in transient state 4, the gain will be
g 4 = f 41 q1 + f 4 R2 π 2 q2 + f 4 R2 π 3 q3 = f 41 q1 + f 4 R2 (π 2 q2 + π 3 q3 ) = f 41 g1 + f 4 R2 g R2.
(4.170)
Observe that the gain of transient state 4 has been expressed as the weighted
average of the independent gains, g1 and gR2, of the two recurrent chains. The
respective weights are f41 and f4R2. Similarly, if the chain starts in transient
state 5, the gain will be
g 5 = f 51 q1 + f 5 R2 (π 2 q2 + π 3 q3 ) = f 51 g1 + f 5 R2 g R2 . (4.171)
The gain of transient state 5 has also been expressed as the weighted average
of the independent gains, g1 and gR2, of the two recurrent chains. In this case
the respective weights are f51 and f5R2.
Since each closed class of recurrent states in a multichain MCR can be
treated as a separate recurrent chain, all states that belong to the same recur-
rent chain have the same gain. Hence, every recurrent chain has an inde-
pendent gain. One may conclude that the gains of the recurrent chains can
be found separately by finding the steady-state probability vector for each
recurrent chain and multiplying it by the associated reward vector. The gain
of every transient state can be expressed as a weighted average of the inde-
pendent gains of the recurrent chains. The weights are the probabilities of
eventual passage from the transient state to the recurrent chains.
g = lim_{n→∞} P^(n) q

    1 [1      0      0      0      0      0 0 0] [156  ]   [156   ]
    2 [0      1      0      0      0      0 0 0] [ 56  ]   [ 56   ]
    3 [0      0      0.2909 0.3727 0.3364 0 0 0] [160  ]   [190.45]
  = 4 [0      0      0.2909 0.3727 0.3364 0 0 0] [123  ] = [190.45].    (4.172)
    5 [0      0      0.2909 0.3727 0.3364 0 0 0] [291.5]   [190.45]
    6 [0.4444 0.3556 0.0582 0.0745 0.0673 0 0 0] [450  ]   [127.33]
    7 [0.6825 0.2032 0.0332 0.0426 0.0385 0 0 0] [400  ]   [139.62]
    8 [0.8095 0.1219 0.0199 0.0256 0.0231 0 0 0] [420  ]   [146.17]
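The decomposition of the transient-state gains into weighted averages of the
independent chain gains can be verified numerically. The Python sketch below
is an added illustration assuming NumPy; the entry 0.09 in the matrix of
one-step probabilities into the recurrent classes aggregates 0.04 + 0.03 + 0.02,
the probabilities of moving from state 6 into the recurrent chain R3 = {3, 4, 5}.

import numpy as np

# Production multichain of Equation (4.161): recurrent chains {1}, {2}, {3,4,5}; transient {6,7,8}.
S3 = np.array([[0.50, 0.30, 0.20],
               [0.30, 0.45, 0.25],
               [0.10, 0.35, 0.55]])          # recurrent chain R3 = {3, 4, 5}
Q  = np.array([[0.55, 0.00, 0.00],
               [0.20, 0.65, 0.00],
               [0.00, 0.15, 0.75]])          # transient states 6, 7, 8
D  = np.array([[0.20, 0.16, 0.09],          # one-step probabilities from states 6, 7, 8
               [0.15, 0.00, 0.00],          # into state 1, state 2, and the chain R3
               [0.10, 0.00, 0.00]])
q  = np.array([156.0, 56.0, 160.0, 123.0, 291.5, 450.0, 400.0, 420.0])

# Steady-state vector of R3 and the three independent gains.
A = np.vstack([S3.T - np.eye(3), np.ones(3)])
pi3, *_ = np.linalg.lstsq(A, np.array([0.0, 0.0, 0.0, 1.0]), rcond=None)
g_indep = np.array([q[0], q[1], pi3 @ q[2:5]])        # g1, g2, gR3

# Probabilities of eventual passage from each transient state to each recurrent chain.
F = np.linalg.inv(np.eye(3) - Q) @ D
g_transient = F @ g_indep
print(g_indep.round(2), g_transient.round(2))
# [156.  56.  190.45]  and  [127.33 139.62 146.17], as in Equation (4.172)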
vj(1) ≈ (T − 1)gj + vj.    (4.174)

vi(0) = qi + Σ_{j=1}^{N} pij vj(1),   for i = 1, 2, ..., N

Tgi + vi = qi + Σ_{j=1}^{N} pij [(T − 1)gj + vj],   for i = 1, 2, ..., N,    (4.175)

         = qi + T Σ_{j=1}^{N} pij gj − Σ_{j=1}^{N} pij gj + Σ_{j=1}^{N} pij vj.

gi = Σ_{j=1}^{N} pij gj,   for i = 1, 2, ..., N.    (4.176)
Then

Tgi + vi = qi + Tgi − gi + Σ_{j=1}^{N} pij vj.

Since Σ_{j=1}^{N} pij = 1,

gi + vi = qi + Σ_{j=1}^{N} pij vj,   i = 1, 2, ..., N.    (4.177)
have been obtained. These two systems may be solved for the N variables, gi,
and the N variables, vi. The first system of N linear equations (4.176) is called
the GSEs. The second system is the set of VDEs (4.177). These two systems of
equations are together called the set of REEs (4.178).
4.2.5.3.2 The REEs for a Five-State Multichain MCR
Consider once again the generic five-state multichain MCR for which the
transition matrix and reward vector are shown in Equation (4.166). The sys-
tem of ten REEs (4.178) consists of a system of five VDEs (4.177) plus a system
of five GSEs (4.176). The system of five VDEs is
gi + vi = qi + Σ_{j=1}^{5} pij vj = qi + pi1 v1 + pi2 v2 + pi3 v3 + pi4 v4 + pi5 v5,   for i = 1, 2, ..., 5.    (4.179)
The individual value determination equations are
Setting p12 = p13 = p14 = p15 = 0, p21 = p24 = p25 = 0, and p31 = p34 = p35 = 0, and setting
g2 = g3 = gR2, the five VDEs become
g1 + v1 = q1 + p11v1
g R2 + v2 = q2 + p22 v2 + p23 v3
g R2 + v3 = q3 + p32 v2 + p33 v3 (4.181)
g 4 + v4 = q4 + p41v1 + p42 v2 + p43 v3 + p44 v4 + p45 v5
g 5 + v5 = q5 + p51v1 + p52 v2 + p53 v3 + p54 v4 + p55 v5 .
gi = Σ_{j=1}^{5} pij gj = pi1 g1 + pi2 g2 + pi3 g3 + pi4 g4 + pi5 g5,   for i = 1, 2, ..., 5.    (4.182)
Setting p12 = p13 = p14 = p15 = 0, p21 = p24 = p25 = 0, p31 = p34 = p35 = 0, and setting
g 2 = g3 = gR2, the five GSEs become
g1 = g1
g 2 = p22 g 2 + p23 g 3 = p22 g R2 + p23 g R2 = ( p22 + p23 ) g R2 = g R2
g 3 = p32 g 2 + p33 g 3 = p32 g R2 + p33 g R2 = ( p32 + p33 ) g R2 = g R2
g 4 = p41 g1 + p42 g 2 + p43 g 3 + p44 g 4 + p45 g 5 = p41 g1 + ( p42 + p43 ) g R2 + p44 g 4 + p45 g 5
g 5 = p51 g1 + p52 g 2 + p53 g 3 + p54 g 4 + p55 g 5 = p51 g1 + ( p52 + p53 ) g R2 + p54 g 4 + p55 g 5 .
(4.183)
v1 + g1 = q1 + p11v1 (4.184)
Setting the relative value v1 = 0 for the highest numbered state in the first
recurrent chain, and setting p11 = 1 for the absorbing state 1, gives the solu-
tion g1 = q1, confirming the conclusion of Equation (4.114) in Section 4.2.4.1.2
that the gain of an absorbing state is equal to the reward received in that
state. The VDEs for the second recurrent chain, consisting of states 2 and 3,
and denoted by R 2={2, 3}, are
v2 + g 2 = q2 + p22 v2 + p23 v3
(4.185)
v3 + g 3 = q3 + p32 v2 + p33 v3 .
Setting v3 = 0 for the highest numbered state in the second recurrent chain,
and setting g2 = g3=gR2, the VDEs become
v2 + gR2 = q2 + p22 v2
     gR2 = q3 + p32 v2.    (4.186)
The solution is

v2 = (q2 − q3)/(p23 + p32)
                                            (4.187)
gR2 = (p32 q2 + p23 q3)/(p23 + p32).
g 4 = p41 g1 + p42 g 2 + p43 g 3 + p44 g 4 + p45 g 5 = p41 g1 + ( p42 + p43 ) g R2 + p44 g 4 + p45 g 5
.
g 5 = p51 g1 + p52 g 2 + p53 g 3 + p54 g 4 + p55 g 5 = p51 g1 + ( p52 + p53 ) g R2 + p54 g 4 + p55 g 5
(4.189)
Rearranging the terms to place the two unknowns, g4 and g5, on the left-hand
side,
The solution is

g4 = [p41 (1 − p55) + p45 p51]/[(1 − p44)(1 − p55) − p45 p54] g1
     + [(1 − p55)(p42 + p43) + p45 (p52 + p53)]/[(1 − p44)(1 − p55) − p45 p54] gR2
                                                                                      (4.191)
g5 = [p51 (1 − p44) + p54 p41]/[(1 − p44)(1 − p55) − p45 p54] g1
     + [(1 − p44)(p52 + p53) + p54 (p42 + p43)]/[(1 − p44)(1 − p55) − p45 p54] gR2.
These equations may be written more concisely to express the gains of the
two transient states as weighted averages of the independent gains of the
two recurrent chains.
g 4 = f 41 g1 + f 4 R2 g R2
, (4.192)
g 5 = f 51 g1 + f 5 R2 g R2
where the weights f41 and f51 are calculated in Equation (3.120), and the weights
f4R2 and f5R2 are calculated in Equation (3.121). Thus, the gain, gi, of each tran-
sient state, i, has been expressed as a weighted average of the independent
gains of the recurrent chains. It is interesting to note that each weight, fiR, can
be interpreted as the probability of eventual passage from a transient state i
to a recurrent chain R. This result confirms the conclusion reached earlier in
Section 4.2.5.2.2 by solving Equation (4.43).
Step 3. The set of VDEs for the two transient states is given below:
gi + vi = qi + Σ_{j=1}^{5} pij vj = qi + pi1 v1 + pi2 v2 + pi3 v3 + pi4 v4 + pi5 v5,    (4.193)

for transient states i = 4, 5. The two VDEs for the transient states are
Rearranging the terms to place the two unknowns, v4 and v5, on the left hand
side,
g1 = q1,    (4.196)

g2 = g3 = gR2 = (p32 q2 + p23 q3)/(p23 + p32),    (4.197)

g4 = f41 g1 + f4R2 gR2 = f41 q1 + f4R2 (p32 q2 + p23 q3)/(p23 + p32),    (4.198)

g5 = f51 g1 + f5R2 gR2 = f51 q1 + f5R2 (p32 q2 + p23 q3)/(p23 + p32),    (4.199)

(1 − p44)v4 − p45 v5 = −f41 q1 − f4R2 (p32 q2 + p23 q3)/(p23 + p32) + q4 + p41 v1 + p42 v2 + p43 v3

−p54 v4 + (1 − p55)v5 = −f51 q1 − f5R2 (p32 q2 + p23 q3)/(p23 + p32) + q5 + p51 v1 + p52 v2 + p53 v3.
                                                                                                      (4.200)
Setting the relative value v1 = 0 for the highest numbered state in the first
recurrent chain, and also setting v3 = 0 for the highest numbered state in the
second recurrent chain, the VDEs for the transient states are
(1 − p44)v4 − p45 v5 = −f41 q1 − f4R2 (p32 q2 + p23 q3)/(p23 + p32) + q4 + p41 (0) + p42 v2 + p43 (0)

−p54 v4 + (1 − p55)v5 = −f51 q1 − f5R2 (p32 q2 + p23 q3)/(p23 + p32) + q5 + p51 (0) + p52 v2 + p53 (0).
                                                                                                        (4.201)
Substituting for v2 from Equation (4.187), the VDEs for the transient
states are
(1 − p44)v4 − p45 v5 = −f41 q1 − f4R2 (p32 q2 + p23 q3)/(p23 + p32) + q4 + p42 (q2 − q3)/(p23 + p32)

−p54 v4 + (1 − p55)v5 = −f51 q1 − f5R2 (p32 q2 + p23 q3)/(p23 + p32) + q5 + p52 (q2 − q3)/(p23 + p32).
                                                                                                        (4.202)
Step 4. The solution for the relative values of the transient states is

     [−f41 q1 − f4R2 (p32 q2 + p23 q3)/(p23 + p32) + q4 + p42 (q2 − q3)/(p23 + p32)](1 − p55)
     + [−f51 q1 − f5R2 (p32 q2 + p23 q3)/(p23 + p32) + q5 + p52 (q2 − q3)/(p23 + p32)] p45
v4 = -----------------------------------------------------------------------------------------
                              (1 − p44)(1 − p55) − p45 p54

     [−f51 q1 − f5R2 (p32 q2 + p23 q3)/(p23 + p32) + q5 + p52 (q2 − q3)/(p23 + p32)](1 − p44)
     + [−f41 q1 − f4R2 (p32 q2 + p23 q3)/(p23 + p32) + q4 + p42 (q2 − q3)/(p23 + p32)] p54
v5 = -----------------------------------------------------------------------------------------.
                              (1 − p44)(1 − p55) − p45 p54
gi + vi = qi + pi1 v1 + pi2 v2 + pi3 v3 + pi4 v4 + pi5 v5 + pi6 v6 + pi7 v7 + pi8 v8,    (4.204)

for i = 1, 2, ..., 8.
g1 + v1 = 156 + v1
g 2 + v2 = 56 + v2
g 3 + v3 = 160 + 0.5v3 + 0.3v4 + 0.2v5
g 4 + v4 = 123 + 0.3v3 + 0.45v4 + 0.25v5
(4.205)
g 5 + v5 = 291.5 + 0.1v3 + 0.35v4 + 0.55v5
g6 + v6 = 450 + 0.2v1 + 0.16v2 + 0.04v3 + 0.03v4 + 0.02v5 + 0.55v6
g7 + v7 = 400 + 0.15v1 + 0.2v6 + 0.65v7
g8 + v8 = 420 + 0.1v1 + 0.15v7 + 0.75v8.
gi = pi1 g1 + pi2 g2 + pi3 g3 + pi4 g4 + pi5 g5 + pi6 g6 + pi7 g7 + pi8 g8,    (4.206)

for i = 1, 2, ..., 8.

g1 = g1
g2 = g2
g3 = 0.5g3 + 0.3g4 + 0.2g5
g4 = 0.3g3 + 0.45g4 + 0.25g5    (4.207)
g5 = 0.1g3 + 0.35g4 + 0.55g5
g6 = 0.2g1 + 0.16g2 + 0.04g3 + 0.03g4 + 0.02g5 + 0.55g6
g7 = 0.15g1 + 0.2g6 + 0.65g7
g8 = 0.1g1 + 0.15g7 + 0.75g8.
Step 1. Letting R3 = {3, 4, 5} denote the third closed class of three recurrent states, which has an independent gain denoted by gR3, and substituting g3 = g4 = g5 = gR3 into the VDEs (4.205) gives
g1 + v1 = 156 + v1
g 2 + v2 = 56 + v2
gR3 + v3 = 160 + 0.5v3 + 0.3v4 + 0.2v5
gR3 + v4 = 123 + 0.3v3 + 0.45v4 + 0.25v5   (4.208)
gR3 + v5 = 291.5 + 0.1v3 + 0.35v4 + 0.55v5
g 6 + v6 = 450 + 0.2v1 + 0.16v2 + 0.04v3 + 0.03v4 + 0.02v5 + 0.55v6
g7 + v7 = 400 + 0.15v1 + 0.2v6 + 0.65v7
g 8 + v8 = 420 + 0.1v1 + 0.15v7 + 0.75v8.
Setting v1 = 0 for the absorbing state in the first recurrent chain, the first VDE
yields the gain g1 = 156 for absorbing state 1. Similarly, setting v2 = 0 for the
absorbing state in the second recurrent chain, the second VDE produces the
gain g2 = 56 for absorbing state 2. Setting v5 = 0 for the highest numbered state in the third recurrent closed class, the three VDEs for the third recurrent chain appear as

gR3 + v3 = 160 + 0.5v3 + 0.3v4
gR3 + v4 = 123 + 0.3v3 + 0.45v4   (4.209)
gR3 = 291.5 + 0.1v3 + 0.35v4.

These three VDEs are solved simultaneously for gR3, v3, and v4 to obtain the quantities gR3 = 190.44, v3 = −199.86, and v4 = −231.64 for the third recurrent chain.
Step 2. Substituting

g3 = g4 = g5 = gR3   (4.210)

into the GSEs (4.207) for the transient states and simplifying algebraically, the gains of the three transient states are expressed as weighted averages of the independent gains of the two absorbing states and of the recurrent closed class:

g6 = f61g1 + f62g2 + f6R3gR3
g7 = f71g1 + f72g2 + f7R3gR3
g8 = f81g1 + f82g2 + f8R3gR3.

Substituting the independent gains, g1 = 156, g2 = 56, and gR3 = 190.44, computed for the three recurrent chains in step 1, the gains of the three transient states are g6 = 127.33, g7 = 139.62, and g8 = 146.17.
Step 3. Setting v1 = v2 = v5 = 0, and substituting the gains computed in steps 1 and 2 and the relative values v3 = −199.86 and v4 = −231.64 computed in step 1, the three VDEs for the three transient states are

127.33 + v6 = 450 + 0.04v3 + 0.03v4 + 0.55v6
139.62 + v7 = 400 + 0.2v6 + 0.65v7
146.17 + v8 = 420 + 0.15v7 + 0.75v8.

Step 4. The solution of the three VDEs for the transient states produces the relative values v6 = 683.84, v7 = 1,134.71, and v8 = 1,776.14.

To summarize the four-step procedure, in step 1 the independent gains of the three recurrent chains are computed, and the relative values v3 and v4 for the third recurrent chain are obtained by solving the VDEs for the third recurrent chain. In step 2, the GSEs for the transient states are solved to calculate the dependent gains of the transient states, g6, g7, and g8, as weighted averages of the independent gains, g1, g2, and gR3, of the recurrent chains. In step 3, the VDEs for the transient states are written. Finally, in step 4, the VDEs for the transient states are solved to obtain the relative values of the transient states, v6, v7, and v8.
The complete solutions for the gain vector and the vector of relative values
within each class of states are given below:
State   Gain, g    Relative value, v
1       156        0
2       56         0
3       190.44     −199.86
4       190.44     −231.64
5       190.44     0
6       127.33     683.84
7       139.62     1,134.71
8       146.17     1,776.14
(4.216)
Thus, if the process starts in any of the three recurrent states associated with
the training center, the expected cost per item sold will be $190.44. If the pro-
cess starts in recurrent state 3, the expected total cost will be $199.86 lower
than if it starts in recurrent state 5. Similarly, if the process starts in recurrent
state 4, the expected total cost will be $231.64 lower than if it starts in recur-
rent state 5.
Since qT is the vector of operation costs for the transient states, the ith compo-
nent of the vector UqT represents the expected total operation cost for an item
before its eventual passage to an absorbing state or to the recurrent closed
class, given that the item started in transient state i. An item enters the pro-
duction process at production stage 1, which is transient state 8. With refer-
ence to the last three rows in the right-hand column of Table 4.14, note that their sum is the expected total operation cost for an item before its eventual passage to an absorbing state or to the recurrent closed class, given that the item
started in transient state 8. The expected total operation cost of $2,708.58 for
an item entering the production process at stage 1 is the last entry in the vec-
tor UqT calculated in Equation (4.218), and is also the sum of the entries in the
last three rows of the right-hand column of Table 4.14.
This example has demonstrated that the following result holds for any
multichain MCR. Suppose that P is the transition matrix for a multichain
MCR, U is the fundamental matrix, and qT is the vector of rewards received
in the transient states. Then the ith component of the vector UqT represents
the expected total reward earned before eventual passage to any recurrent
closed class, given that the chain started in transient state i.
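As a numerical check of this result, the following minimal sketch, assuming Python with numpy is available, recomputes UqT for the serial production process; the submatrix Q and the operation costs of the transient states 6, 7, and 8 are read from Equation (4.205), and the last entry should reproduce the $2,708.58 reported above.

```python
import numpy as np

# One-step transition probabilities among the transient states 6, 7, 8
# of the production process, taken from Equation (4.205).
Q = np.array([[0.55, 0.00, 0.00],
              [0.20, 0.65, 0.00],
              [0.00, 0.15, 0.75]])

# Operation costs (rewards) received in the transient states 6, 7, 8.
qT = np.array([450.0, 400.0, 420.0])

# Fundamental matrix U = (I - Q)^(-1).
U = np.linalg.inv(np.eye(3) - Q)

# ith component of U @ qT = expected total operation cost before the item
# leaves the transient states, given that it started in transient state i.
print(U @ qT)   # last entry (stage 1, state 8) should be about 2708.58
```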
The following example illustrates the calculation of the expected total reward earned before eventual passage to a recurrent chain, and of the expected reward received upon absorption by a recurrent chain, given that the chain started in a transient state [5]. Suppose that on
March 31, at the end of a 3-month quarter, a woman buys one share of a
certain stock for $40. The share price, rounded to the nearest $5, has been
varying among $30, $35, $40, $45, and $50 from quarter to quarter. She plans
to sell her stock at the end of the first quarter in which the share price rises
to $50 or falls to $30. She believes that the price of the stock can be modeled
as a Markov chain in which the state, Xn, denotes the share price at the end
of quarter n. The state space is E = {$30, $35, $40, $45, $50}. The two states
Xn = $30 and Xn = $50 are absorbing states, reached when the stock is sold.
The three remaining states, which are entered when the stock is held, are
transient. (A model for selling a stock with one target price was constructed
in Section 1.10.1.2.1.) The quarterly dividend is $2 per share. No dividend
is received when the stock is sold. She believes that her investment can be
represented as a multichain MCR with the following transition probability
matrix, expressed in canonical form, and the associated reward vector.
State    30     50     35     40     45
 30      1      0      0      0      0
 50      0      1      0      0      0
 35      0.20   0.10   0.40   0.20   0.10    (4.220a)
 40      0.10   0.06   0.30   0.28   0.26
 45      0.14   0.12   0.24   0.30   0.20

In block canonical form,

P = [ 1   0   0
      0   1   0
      D1  D2  Q ],
State   Reward
 30       30
 50       50
 35        2
 40        2
 45        2

q = [qA1  qA2  qT]T = [30  50  2  2  2]T,   (4.220b)
where D1 = [0.20  0.10  0.14]T and D2 = [0.10  0.06  0.12]T contain the one-step transition probabilities from the transient states 35, 40, and 45 to the absorbing states 30 and 50, respectively,

    [ 0.40  0.20  0.10 ]
Q = [ 0.30  0.28  0.26 ]
    [ 0.24  0.30  0.20 ]

contains the one-step transition probabilities among the transient states 35, 40, and 45, and qT = [2  2  2]T is the vector of dividends received in the transient states.
Note that qA is the vector of selling prices received in the two absorbing
states, and qT is the vector of dividends received in the three transient states.
The fundamental matrix for the submatrix Q is
                [  0.60  −0.20  −0.10 ]−1    [ 2.3486  0.8961  0.5848 ]
U = (I − Q)−1 = [ −0.30   0.72  −0.26 ]    = [ 1.4261  2.1505  0.8772 ],   (4.221)
                [ −0.24  −0.30   0.80 ]      [ 1.2394  1.0753  1.7544 ]

with rows and columns indexed by the transient states 35, 40, and 45.
As Sections 4.2.4.4 and 4.2.5.4.1 indicate, the ith component of the vector UqT
represents the expected total dividend earned before eventual passage to
an absorbing state when the stock is sold, given that the chain started in
transient state i. For example, if she buys the stock for $40, she will earn an
expected total dividend of (UqT)40 = $8.91 before the stock is sold.
When the stock is sold, the chain will be absorbed in state $30 or in state $50.
Using Equation (3.98), the matrix of absorption probabilities for an absorbing
multichain, given that the process starts in a transient state, is
                                  [ f35,30  f35,50 ]
F = (I − Q)−1D = UD = U[D1  D2] = [ f40,30  f40,50 ]
                                  [ f45,30  f45,50 ]

    [ 2.3486  0.8961  0.5848 ] [ 0.20  0.10 ]   [ 0.6412  0.3588 ]
  = [ 1.4261  2.1505  0.8772 ] [ 0.10  0.06 ] = [ 0.6231  0.3769 ],   (4.223)
    [ 1.2394  1.0753  1.7544 ] [ 0.14  0.12 ]   [ 0.6010  0.3990 ]

with rows indexed by the transient states 35, 40, and 45, and columns indexed by the absorbing states 30 and 50.
If the investor buys the stock for $40, she will eventually sell it for either $30
with probability f40,30 = 0.6231 or $50 with probability f40,50 = 0.3769. Hence,
when the stock is sold, she will receive an expected selling price of
      [ f35,30  f35,50 ]           [ f35,30qA1 + f35,50qA2 ]
FqA = [ f40,30  f40,50 ] [ qA1 ] = [ f40,30qA1 + f40,50qA2 ]
      [ f45,30  f45,50 ] [ qA2 ]   [ f45,30qA1 + f45,50qA2 ]

      [ 0.6412  0.3588 ]          [ $37.18 ]
    = [ 0.6231  0.3769 ] [ 30 ] = [ $37.54 ],   (4.225)
      [ 0.6010  0.3990 ] [ 50 ]   [ $37.98 ]

with rows indexed by the transient states 35, 40, and 45.
Thus, the ith component of the vector FqA represents the expected selling
price received when the stock is sold, given that the chain started in transient
state i. For example, if the investor buys the stock for $40, she will receive an
expected selling price of (FqA)40 = $30f40,30 + $50f40,50 = $30(0.6231) + $50(0.3769) = $37.54
when the stock is sold, confirming the earlier result. The investor's expected total reward, given that she bought the stock for $40, is the expected divi-
total reward, given that she bought the stock for $40, is the expected divi-
dends received before selling plus the expected selling price, or
(UqT )40 + ( FqA )40 = (UqT )40 + ($30 f 40,30 + $50 f 40,50 ) = $8.91 + $37.54 = $46.45.
(4.227)
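The following sketch, assuming Python with numpy is available, repeats the stock calculations: it forms the fundamental matrix U of Equation (4.221), the absorption probability matrix F of Equation (4.223), and the expected dividends, selling price, and total reward of Equation (4.227).

```python
import numpy as np

# Data from Equations (4.220a) and (4.220b); transient states ordered 35, 40, 45.
Q  = np.array([[0.40, 0.20, 0.10],
               [0.30, 0.28, 0.26],
               [0.24, 0.30, 0.20]])
D  = np.array([[0.20, 0.10],
               [0.10, 0.06],
               [0.14, 0.12]])     # columns give one-step probabilities to 30 and 50
qT = np.array([2.0, 2.0, 2.0])    # quarterly dividends in the transient states
qA = np.array([30.0, 50.0])       # selling prices received in the absorbing states

U = np.linalg.inv(np.eye(3) - Q)  # fundamental matrix, Equation (4.221)
F = U @ D                         # absorption probabilities, Equation (4.223)

print(U @ qT)                 # expected total dividends; the $40 entry is about 8.91
print(F @ qA)                 # expected selling prices; the $40 entry is about 37.54
print((U @ qT + F @ qA)[1])   # expected total reward starting at $40, about 46.45
```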
To begin the backward recursion, the vector of expected total income received at the end of the three-quarter planning horizon is set equal to zero for all states:

n = T = 3   (4.229a)

v(3) = [v30(3)  v50(3)  v35(3)  v40(3)  v45(3)]T = [0  0  0  0  0]T.   (4.229c)
n = 2
v(2) = q + Pv(3) = q + P(0) = q = [30  50  2  2  2]T   (4.230)

n = 1
v(1) = q + Pv(2) = [60  100  14.4  9.68  13.68]T   (4.231)

n = 0
v(0) = q + Pv(1) = [90  150  33.064  24.5872  31.496]T,   (4.232)

where the components are ordered by the states 30, 50, 35, 40, and 45, and P and q are given in Equations (4.220a) and (4.220b).
If the woman paid $40 for the stock, her expected total income after three
quarters will be v40(0) = $24.59.
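A minimal sketch of this backward recursion, assuming Python with numpy is available; P and q are taken from Equations (4.220a) and (4.220b), and the entry of v(0) for the $40 state should reproduce 24.5872.

```python
import numpy as np

# Transition matrix and reward vector for the stock example,
# with states ordered 30, 50, 35, 40, 45.
P = np.array([[1.00, 0.00, 0.00, 0.00, 0.00],
              [0.00, 1.00, 0.00, 0.00, 0.00],
              [0.20, 0.10, 0.40, 0.20, 0.10],
              [0.10, 0.06, 0.30, 0.28, 0.26],
              [0.14, 0.12, 0.24, 0.30, 0.20]])
q = np.array([30.0, 50.0, 2.0, 2.0, 2.0])

T = 3
v = np.zeros(5)               # v(T) = 0 for all states
for n in range(T - 1, -1, -1):
    v = q + P @ v             # v(n) = q + P v(n+1)
print(v)                      # v(0); the $40 entry should be about 24.5872
```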
α = 1/(1 + i),   (4.233)
Clearly, α is a fraction between zero and one. Thus, q dollars received one
period in the future is equivalent to αq dollars received now, a smaller quan-
tity. The present value, at epoch 0, of q dollars received n periods in the future,
at epoch n, is α nq dollars. Rewards of q dollars received at epochs 0, 1, 2, … , n
have the respective present values of q, αq, α 2 q, … , α n q dollars. Note that α n is the
single-payment present-worth factor used in engineering economic analysis
to compute the present worth of a single payment received n periods in the
future.
Figure 4.12 is a discounted cash flow diagram, which shows the pres-
ent values, with a discount factor, α, of a series of equal reward vectors, q,
received at epochs 0 through T − 1. The present value, at epoch 0, of a reward vector, q, received at epoch n is αnq.
FIGURE 4.12
Present values of a discounted cash flow diagram for reward vectors over a planning horizon of length T periods.
Observe that the term in brackets, obtained by factoring out αP, is equal
to v(1), which represents the vector, at epoch 1, of the expected total dis-
counted income received from epoch 1 until the end of the planning hori-
zon. That is,
Thus, the recursive relationship between the vectors v(0) and v(1) is v(0) = q + αPv(1).
When rewards are discounted, chain structure is not relevant because every row of the matrix αP sums to α < 1, so that αP is not a stochastic matrix and its entries are not
probabilities. For this reason, the value iteration procedure is the same for all
discounted MCRs, irrespective of whether the associated Markov chains are
recurrent, unichain, or multichain.
In expanded algebraic form, the four recursive value iteration equations are

v1(n) = q1 + α Σ_{j=1}^{4} p1jvj(n + 1)
v2(n) = q2 + α Σ_{j=1}^{4} p2jvj(n + 1)
v3(n) = q3 + α Σ_{j=1}^{4} p3jvj(n + 1)   (4.241)
v4(n) = q4 + α Σ_{j=1}^{4} p4jvj(n + 1).

In compact algebraic form, the four recursive value iteration equations are

vi(n) = qi + α Σ_{j=1}^{4} pijvj(n + 1),   (4.242)

for n = 0, 1, ..., T − 1, and i = 1, 2, 3, and 4.
This result can be generalized to apply to any N-state Markov chain with
discounted rewards. Therefore, vi(n) can be computed by using the follow-
ing recursive value iteration equation in algebraic form which relates vi(n)
to vj(n + 1).
vi(n) = qi + α Σ_{j=1}^{N} pijvj(n + 1),   (4.243)

for n = 0, 1, ..., T − 1, and i = 1, 2, ..., N, where vi(T) is specified for all states i.
n = T = 3
v(3) = [v1(3)  v2(3)  v3(3)  v4(3)]T = [0  0  0  0]T   (4.246)

n = 2
v(2) = q + αPv(3) = q + (0.9P)v(3) = q + (0.9P)(0) = q = [−20  5  −5  25]T,   (4.247)

where

       [ 0.540  0.270  0.090  0.000 ]
0.9P = [ 0.225  0.270  0.315  0.090 ]
       [ 0.045  0.225  0.450  0.180 ]
       [ 0.000  0.090  0.270  0.540 ]

n = 1
v(1) = q + αPv(2) = q + (0.9P)v(2) = [−29.9  2.525  −2.525  37.6]T   (4.248)

n = 0
v(0) = q + αPv(1) = q + (0.9P)v(1) = [−35.6915  1.5429  −0.1456  44.8495]T   (4.249)
The vector v(0) indicates that if this discounted MCR operates over a three-
period planning horizon with a discount factor of 0.9, the expected total
discounted reward will be –35.6915 if the system starts in state 1, 1.5429
if it starts in state 2, –0.1456 if it starts in state 3, and 44.8495 if it starts in
state 4.
Using Equation (4.235) with T = 3, the solution by value iteration for v(0) in
Equation (4.249) can be verified by calculating
v(0) = q + αPq + (αP)²q.
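This verification is easy to carry out numerically; the sketch below, assuming Python with numpy is available, uses the matrix 0.9P and the reward vector q from Equation (4.247).

```python
import numpy as np

# 0.9P and q for the discounted MCR of monthly sales, Equation (4.247).
aP = np.array([[0.540, 0.270, 0.090, 0.000],
               [0.225, 0.270, 0.315, 0.090],
               [0.045, 0.225, 0.450, 0.180],
               [0.000, 0.090, 0.270, 0.540]])
q = np.array([-20.0, 5.0, -5.0, 25.0])

v0 = q + aP @ q + aP @ aP @ q   # v(0) = q + (alpha P)q + (alpha P)^2 q for T = 3
print(v0)                       # approximately [-35.6915, 1.5429, -0.1456, 44.8495]
```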
v = (I − αP)−1q   (4.253)
(I − αP)v = q   (4.254)
Iv − αPv = q
v − αPv = q
v = q + αPv.   (4.255)
For an N-state MCR, the matrix equation (4.253) relates the expected total
discounted reward vector, v = [v1, v2, . . . , vN]T, to the reward vector, q =
[q1, q2, . . . , qN]T. The matrix equation (4.255) represents a system of N lin-
ear equations in N unknowns. The system of equations (4.255) is called the
matrix form of VDEs, for a discounted MCR. The N unknowns, v1, v2, ... , vN,
represent the expected total discounted rewards received in every state. An
alternate form of the VDEs is the matrix equation (4.253). However, Gaussian
reduction is a more efficient procedure than matrix inversion for solving a
system of linear equations.
When N = 4 the expanded matrix form of the VDEs (4.255) is
v1 = q1 + α Σ_{j=1}^{4} p1jvj
v2 = q2 + α Σ_{j=1}^{4} p2jvj
v3 = q3 + α Σ_{j=1}^{4} p3jvj   (4.258)
v4 = q4 + α Σ_{j=1}^{4} p4jvj.

vi = qi + α Σ_{j=1}^{4} pijvj, for i = 1, 2, 3, and 4.   (4.259)
This result can be generalized to apply to any N-state Markov chain with
discounted rewards. Therefore, the compact algebraic form of the VDEs is
vi = qi + α Σ_{j=1}^{N} pijvj, for i = 1, 2, ..., N.   (4.260)
Since vi is the ith element of the vector v, and vi(0) is the ith element of the vector v(0), it follows that, for large T,

vi(0) ≈ vi.   (4.261)

Similarly, v(1) is the expected total discounted reward vector earned at epoch 1 over a planning horizon of length T periods. As T grows large, (T − 1) also grows large, so that

vi(1) ≈ vi.   (4.262)
(I − αP)v = q   (4.254)

( [ 1  0 ]     [ p11  p12 ] ) [ v1 ]   [ q1 ]
( [ 0  1 ] − α [ p21  p22 ] ) [ v2 ] = [ q2 ]   (4.263)

[ 1 − αp11    −αp12   ] [ v1 ]   [ q1 ]
[ −αp21     1 − αp22  ] [ v2 ] = [ q2 ].
For example, consider the following two-state recurrent MCR for which the
gain was calculated in Equation (4.51).
When a discount factor of α = 0.9 is used for the MCR, the expected total
discounted rewards are
= 14.706
(4.265)
where vi is the expected total discounted reward received in state i [2, 7]. This
limiting relationship will be demonstrated with respect to the discounted
two-state MCR for which v1 and v2 were calculated in Equation (4.265).
Recall from Equation (4.48) or (4.50) that the gain of an undiscounted recur-
rent two-state MCR is
p21 q1 + p12 q2
g= . (4.50)
p12 + p21
Since the last two terms in the denominator sum to zero, they can be dropped,
and
because over an infinite planning horizon the expected total reward
without discounting is infinite. Hence α can approach one but can never
equal one. Note that when both sides of Equation (4.269) are multiplied
by (1 − α),
lim_{α→1}(1 − α)v2 = (q1p21 + q2p12)/(p12 + p21) = g.   (4.273)
When the discount factor for the two-state MCR represented in Equation
(4.51) of Section 4.3.3.1.1 is increased to α = 0.999, the expected total dis-
counted rewards are
Although these results have been demonstrated to hold for a two-state MCR,
they can be extended to show that Equation (4.266) is true for any N-state
MCR, which has a recurrent chain with a gain g.
     State    0         1         2         3         4
      0       1         0         0         0         0
      1     1 − α    α(0.60)   α(0.30)   α(0.10)   α(0)
Y =   2     1 − α    α(0.25)   α(0.30)   α(0.35)   α(0.10)    (4.277)
      3     1 − α    α(0.05)   α(0.25)   α(0.50)   α(0.20)
      4     1 − α    α(0)      α(0.10)   α(0.30)   α(0.60)

  = [ 1       0  ]   [ 1  0 ]
    [ 1 − α   αP ] = [ D  Q ].
Since the entries in every row of P sum to 1, the entries in all rows of αP sum
to α. Hence, Y is the transition probability matrix for an absorbing unichain.
The state space E = {1, 2, 3, 4} of the undiscounted process has been augmented
by an absorbing state 0. States 1 through 4 are transient. There is a probability
1 − α that the undiscounted process will reach the absorbing state on the next
transition, and stop. There is also a probability α that the process will not reach
the absorbing state on the next transition, and will therefore continue. One may
conclude that if the duration of an undiscounted MCR is indefinite, one minus
a discount factor can be interpreted as the probability that the process will
stop after the next transition. Therefore, a discount factor can be interpreted as
the probability that the process will continue after the next transition.
Matrix αP governs transitions among transient states before the process
is absorbed. The matrix (I − αP)−1 is the fundamental matrix of the absorb-
ing unichain. Hence, the alternate matrix form of the VDEs (4.253) for a dis-
counted MCR,
v = (I − α P)−1 q, (4.253)
indicates that the vector of expected total discounted rewards can be obtained
by multiplying the fundamental matrix by the reward vector.
       [ 0.900  0      0      0      0      0      0      0     ]
       [ 0      0.900  0      0      0      0      0      0     ]
       [ 0      0      0.450  0.270  0.180  0      0      0     ]
αP =   [ 0      0      0.270  0.405  0.225  0      0      0     ]   (4.278)
       [ 0      0      0.090  0.315  0.495  0      0      0     ]
       [ 0.180  0.144  0.036  0.027  0.018  0.495  0      0     ]
       [ 0.135  0      0      0      0      0.180  0.585  0     ]
       [ 0.090  0      0      0      0      0      0.135  0.675 ]

q = [156  56  160  123  291.5  450  400  420]T,

with rows and columns indexed by states 1 through 8.
v = q + αPv, that is,

[v1 v2 v3 v4 v5 v6 v7 v8]T = [156  56  160  123  291.5  450  400  420]T + αP[v1 v2 v3 v4 v5 v6 v7 v8]T,   (4.279)

where αP is the matrix given in Equation (4.278).
The solution of the VDEs (4.279) is

v = [1,560  560  1,852.83  1,819.93  2,042.64  1,909.00  2,299.33  2,679.41]T.   (4.280)
The solution vector shows, for example, that if this discounted multichain
MCR model of a serial production process operates over an infinite planning
horizon, the expected total discounted cost will be $1,560 if it starts in state 1,
and $2,679.41 if it starts in state 8.
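A sketch of this computation, assuming Python with numpy is available; it solves the VDEs (I − αP)v = q by Gaussian elimination, with αP and q taken from Equation (4.278).

```python
import numpy as np

# alpha*P (with alpha = 0.9) and q for the multichain MCR model of the
# serial production process, Equation (4.278).
aP = np.array([
    [0.900, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000],
    [0.000, 0.900, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000],
    [0.000, 0.000, 0.450, 0.270, 0.180, 0.000, 0.000, 0.000],
    [0.000, 0.000, 0.270, 0.405, 0.225, 0.000, 0.000, 0.000],
    [0.000, 0.000, 0.090, 0.315, 0.495, 0.000, 0.000, 0.000],
    [0.180, 0.144, 0.036, 0.027, 0.018, 0.495, 0.000, 0.000],
    [0.135, 0.000, 0.000, 0.000, 0.000, 0.180, 0.585, 0.000],
    [0.090, 0.000, 0.000, 0.000, 0.000, 0.000, 0.135, 0.675]])
q = np.array([156.0, 56.0, 160.0, 123.0, 291.5, 450.0, 400.0, 420.0])

# Solve the VDEs (I - alpha*P)v = q by elimination rather than inversion.
v = np.linalg.solve(np.eye(8) - aP, q)
print(v)   # v[0] should be 1560 and v[7] about 2679.41, as reported in the text
```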
using Equations (4.261) and (4.262). The eventual convergence of the expected total discounted rewards received over a finite horizon to the expected total discounted rewards received over an infinite horizon suggests the following value iteration algorithm for a discounted MCR.
Step 1. Specify the terminal values vi(0), for i = 1, 2, ..., N. For simplicity, set vi(0) = 0. Specify ε > 0. Set n = −1.
Step 2. For each state i, use the value iteration equation to compute

vi(n) = qi + α Σ_{j=1}^{N} pijvj(n + 1), for i = 1, 2, ..., N.

Step 3. If max_i |vi(n) − vi(n + 1)| < ε, stop; the current values approximate the expected total discounted rewards. Otherwise, decrement n by 1 and return to step 2.
TABLE 4.15
Expected Total Discounted Rewards for Monthly Sales Calculated by Value
Iteration During the Last 7 Months of an Infinite Planning Horizon
n
Epoch –7 –6 –5 –4 –3 –2 –1 0
v1(n) –41.2574 –41.1224 –40.4607 –38.8699 –35.6915 –29.9 –20 0
v2(n) 2.5647 2.0488 1.6153 1.3766 1.5429 2.525 5 0
v3(n) 5.3476 4.3948 3.2247 1.7484 –0.1456 –2.525 –5 0
v4(n) 55.6493 54.2191 52.2278 49.3183 44.8495 37.6 25 0
TABLE 4.16
Absolute Values of the Differences Between the Expected Total Discounted
Rewards Earned Over Planning Horizons, Which Differ in Length by One Period

                                    n
Epoch                   −7       −6       −5       −4       −3       −2       −1
|v1(n) − v1(n + 1)|   0.1350   0.6617   1.5908   3.1784   5.7915   9.9000  20.0000
|v2(n) − v2(n + 1)|   0.5159   0.4335   0.2387   0.1663   0.9821   2.4750   5.0000
|v3(n) − v3(n + 1)|   0.9528   1.1701   1.4763   1.8940   2.3794   2.4750   5.0000
|v4(n) − v4(n + 1)|   1.4302U  1.9913U  2.9095U  4.4688U  7.2495U 12.6000U 25.0000U
Maximum               1.4302   1.9913   2.9095   4.4688   7.2495  12.6000  25.0000
Table 4.16 gives the absolute values of the differences between the expected
total discounted rewards earned over planning horizons, which differ in
length by one period. Note that the maximum absolute differences become
progressively smaller for each epoch added to the planning horizon.
In Table 4.16, a suffix U identifies the maximum absolute difference for each
epoch. The maximum absolute differences obtained for all seven epochs are
listed in the bottom row of Table 4.16. The bottom row of Table 4.16 shows
that if an analyst chooses an ε < 1.4302, then more than seven repetitions of
value iteration will be needed before the algorithm can be assumed to have
converged.
Table 4.15 shows that the approximate expected total discounted rewards, v1 = −41.2574, v2 = 2.5647, v3 = 5.3476, and v4 = 55.6493, obtained after seven repetitions of value iteration, differ from the exact expected total discounted rewards
obtained by solving the VDEs in Section 4.3.3.1.3. Thus, value iteration for
the discounted MCR model of monthly sales has not converged after seven
repetitions to the actual expected total discounted rewards.
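The value iteration algorithm above can be sketched as follows, assuming Python with numpy is available; the transition matrix of the monthly sales MCR is recovered by dividing the matrix 0.9P of Equation (4.247) by α = 0.9, and ε is set arbitrarily to 0.01 for illustration.

```python
import numpy as np

alpha = 0.9
# Transition matrix and rewards for the monthly sales MCR, recovered from
# Equation (4.247) by dividing 0.9P by alpha.
P = np.array([[0.60, 0.30, 0.10, 0.00],
              [0.25, 0.30, 0.35, 0.10],
              [0.05, 0.25, 0.50, 0.20],
              [0.00, 0.10, 0.30, 0.60]])
q = np.array([-20.0, 5.0, -5.0, 25.0])

eps = 0.01
v_next = np.zeros(4)          # step 1: v_i(0) = 0 for all states
reps = 0
while True:
    v = q + alpha * P @ v_next            # step 2: value iteration equation
    reps += 1
    if np.max(np.abs(v - v_next)) < eps:  # step 3: convergence test
        break
    v_next = v
# After seven repetitions the values match the epoch -7 column of Table 4.15;
# many more repetitions are needed before the test with eps = 0.01 is passed.
print(reps, v)
```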
PROBLEMS
4.1 This problem adds daily revenue to the Markov chain model
of Problems 1.1 and 2.1. Since a repair takes 1 day to complete,
no revenue is earned on the day during which maintenance is
performed. The daily revenue earned in every state is shown
below:
(a) Given that the machine begins day one with an equal prob-
ability of starting in any state, what is the expected total rev-
enue that will be earned after 3 days?
(b) Find the expected average reward, or gain.
(c) Find the expected total discounted reward vector using a
discount factor of α = 0.9.
4.2 A married couple owns a consulting firm. The firm's weekly
income is a random variable, which may equal $10,000,
$15,000, or $20,000. The income earned next week depends
only on the income earned this week. In any week, the mar-
ried couple may sell the firm to either their adult son or their
adult daughter. If the income this week is $10,000, the couple
may sell the firm next week to their son with probability 0.05
or to their daughter with probability 0.10. If the income this
week is $10,000 and they do not sell, the income next week
will be either $10,000 with probability 0.40, $15,000 with prob-
ability 0.25, or $20,000 with probability 0.20. If the income this
Age i of an IC in years, i 0 1 2 3
Cost to inspect and test an IC $57 $34.29 $23.33 $10
The cost to replace an IC, which has failed, is $10. The cost in
dollars of buying and operating an IC of age i is denoted by qi.
(a) Construct a cost vector for the IC replacement problem.
(b) Model the IC replacement problem as a recurrent MCR by
adding the cost vector to the transition probability matrix,
which was constructed for Problem 2.5.
(c) Calculate the average cost per year of operation and
replacement.
(d) If operation starts with a 1-year-old component, calculate
the expected total cost of operation and replacement rela-
tive to the expected total cost of starting with a 3-year-old
component.
(e) If operation starts with a 2-year-old component, calculate the
expected total cost of operation before the component fails
for the first time.
The dam has the following policy for releasing water at the
beginning of every week. If the volume of water stored in the
dam plus the volume flowing into the dam at the beginning
of the week exceeds 2 units, then 2 units of water are released.
The first unit of water released is used to generate electricity,
which is sold for $5. The second unit released is used for irriga-
tion, which earns $4. If the volume of water stored in the dam
plus the volume flowing into the dam at the beginning of the
week equals 2 units, then only 1 unit of water is released. The
1 unit of water released is used to generate electricity, which
is sold for $5. No water is released for irrigation. The volume
of water stored in the dam is normally never allowed to drop
below 1 unit to provide a reserve in the event of a natural dis-
aster. Hence, if the volume of water stored in the dam plus the
volume flowing into the dam at the beginning of the week is
less than 2 units, no water is released. If the volume of water
stored in the dam plus the volume flowing into the dam at the
beginning of the week exceeds 6 units, then 2 units of water are
released to generate electricity and for irrigation. In addition,
surplus water is released through the spillway and lost, causing
flood damage at a cost of $3 per unit.
References
1. Bhat, N., Elements of Applied Stochastic Processes, 2nd ed., Wiley, New York,
1985.
2. Feldman, R. M. and Valdez-Flores, C., Applied Probability & Stochastic Processes,
PWS, Boston, MA, 1996.
3. Hillier, F. S. and Lieberman G. J., Introduction to Operations Research, 8th ed.,
McGraw-Hill, New York, 2005.
4. Howard, R. A., Dynamic Programming and Markov Processes, M.I.T. Press,
Cambridge, MA, 1960.
5. Kemeny, J. G., Schleifer, Jr., A., Snell, J. L., and Thompson, G. L., Finite Mathematics
with Business Applications, 2nd ed., Prentice-Hall, Englewood Cliffs, NJ, 1972.
6. Jensen, P. A. and Bard, J. F., Operations Research: Models and Methods,Wiley,
New York, 2003.
7. Puterman, M. L., Markov Decision Processes: Discrete Stochastic Dynamic
Programming, Wiley, New York, 1994.
8. Shamblin, J. E. and Stevens, G. T., Operations Research: A Fundamental Approach,
McGraw-Hill, New York, 1974.
TABLE 5.1
States, Decisions, and Rewards for Monthly Sales
Monthly Sales
Quartile State Decision Action Reward
First (Lowest) 1 1 Sell noncore assets –$30,000
2 Take firm private –25,000
3 Offer employee buyouts –20,000
Second 2 1 Reduce executive salaries $5,000
2 Reduce employee benefits 10,000
Third 3 1 Design more appealing products –$10,000
2 Invest in new technology –5,000
Fourth (Highest) 4 1 Invest in new projects $35,000
2 Make strategic acquisitions 25,000
       1 [ d1(n) ]
       2 [ d2(n) ]
d(n) =   [   ⋮   ]  = [d1(n)  d2(n)  ⋯  dN(n)]T,   (5.1, 5.2)
       N [ dN(n) ]
called a decision vector. The elements of vector d(n) indicate which decision
is made in every state at epoch n. Thus, the element di(n) = k indicates that in
state i at epoch n, decision k is made.
Each decision made in a state has an associated reward and probability
distribution for transitions out of that state. A superscript k is used to desig-
nate the decision in a state. Thus, pij^k indicates that the transition probability
from state i to state j is determined by the decision k made in state i at epoch n.
That is,
FIGURE 5.1
Sequence of states, decisions, transitions, and rewards for an MDP.
To see how an MDP can be represented in a tabular format, Table 5.2 specifies a two-state generic MDP with two decisions in each state.

TABLE 5.2
Two-State Generic MDP with Two Decisions in Each State

                          Transition Probability
State i   Decision k      pi1^k     pi2^k     Reward qi^k
1         1               p11^1     p12^1     q1^1
1         2               p11^2     p12^2     q1^2
2         1               p21^1     p22^1     q2^1
2         2               p21^2     p22^2     q2^2
By matching the actions in Table 4.1 with those in Table 5.1, it is appar-
ent that the MCR model of monthly sales constructed in equation (4.6) is
generated by the decision vector d(n) = [3 1 2 2]T . For example, d1(n) = 3
in Table 5.1 means that in state 1 at epoch n, when monthly sales are in
the first quartile, the firm will make decision 3 to offer employee buy-
outs. The decision vector d(n) produces the same MCR that was introduced
without decision alternatives in Section 4.2.2.1. Hence the MCR model in
Section 4.2.2.1 corresponds to the following transition probability matrix
P, reward vector q with entries expressed in thousands of dollars, and
decision vector d(n).
The data collected by the firm on the rewards and the transition probabilities associated with the decisions made in every state of Table 5.1 are summarized in Table 5.3. Table 5.3 represents a recurrent MDP model of monthly sales in which the state denotes the monthly sales quartile.

TABLE 5.3
Data for Recurrent MDP Model of Monthly Sales (rewards in thousands of dollars)

                            Transition Probabilities
State i   Decision k   pi1^k   pi2^k   pi3^k   pi4^k   Reward qi^k
1         1            0.15    0.40    0.35    0.10    −30
1         2            0.45    0.05    0.20    0.30    −25
1         3            0.60    0.30    0.10    0.00    −20
2         1            0.25    0.30    0.35    0.10      5
2         2            0.30    0.40    0.25    0.05     10
3         1            0.05    0.65    0.25    0.05    −10
3         2            0.05    0.25    0.50    0.20     −5
4         1            0.05    0.20    0.40    0.35     35
4         2            0.00    0.10    0.30    0.60     25
FIGURE 5.2
Tree diagram of the value iteration equation for an MDP.
vi(n) = max_k [ qi^k + Σ_{j=1}^{N} pij^k vj(n + 1) ],   (5.6)

for n = 0, 1, ..., T − 1, and i = 1, 2, ..., N.
The salvage values vi(T) at the end of the planning horizon must be speci-
fied for all states i = 1, 2, . . . , N. Figure 5.2 is a tree diagram of the value iteration
equation in algebraic form for an MDP with two decisions in the current state.
To begin, the following salvage values are specified for all states at the end of month 7: v1(7) = v2(7) = v3(7) = v4(7) = 0. Then

vi(6) = max_k [qi^k + pi1^k v1(7) + pi2^k v2(7) + pi3^k v3(7) + pi4^k v4(7)], for i = 1, 2, 3, 4
      = max_k [qi^k + pi1^k(0) + pi2^k(0) + pi3^k(0) + pi4^k(0)], for i = 1, 2, 3, 4
      = max_k [qi^k], for i = 1, 2, 3, 4.   (5.10, 5.11)
At the end of month 6, the optimal decision is to select the alternative which
will maximize the reward in every state. Thus, the decision vector at the end
of month 6 is d(6) = [3 2 2 1]T .
The calculations for month 5, denoted by n=5, are indicated in Table 5.5.
The ← symbol in column 3 of Tables 5.5 through 5.10 identifies the maximum
TABLE 5.4
Value Iteration for n = 6

State i   qi^1 (k = 1)   qi^2 (k = 2)   qi^3 (k = 3)   Expected Total Reward vi(6) = max_k[qi^k]   Decision k
1         −30            −25            −20            v1(6) = max[−30, −25, −20] = −20            3
2           5             10             _             v2(6) = max[5, 10] = 10                     2
3         −10             −5             _             v3(6) = max[−10, −5] = −5                   2
4          35             25             _             v4(6) = max[35, 25] = 35                    1
TABLE 5.5
Value Iteration for n = 5
          Test Quantity qi^k + pi1^k v1(6) + pi2^k v2(6) + pi3^k v3(6) + pi4^k v4(6)    vi(5) = Expected    Decision
i   k     = qi^k + pi1^k(−20) + pi2^k(10) + pi3^k(−5) + pi4^k(35)                       Total Reward        di(5) = k
1 1 −30 + 0.15( −20) + 0.40(10) + 0.35( −5) + 0.10(35)
= −27.25
1 2 −25 + 0.45( −20) + 0.05(10) + 0.20( −5) + 0.30(35) max [ − 27.25, −24, d1(5) = 2
= −24 ← −29.5]
= −24 = v1 (5)
1 3 −20 + 0.60( −20) + 0.30(10) + 0.10( −5) + 0(35)
= −29.5
2 1 5 + 0.25( −20) + 0.30(10) + 0.35( −5) + 0.10(35)
= 4.75
2 2 10 + 0.30( −20) + 0.40(10) + 0.25( −5) + 0.05(35) max [4.75, 8.5] d2(5) = 2
= 8.5 ← = 8.5 = v2 (5)
4 1 35 + 0.05( −20) + 0.20(10) + 0.40( −5) + 0.35(35) max [46.25, 45.5] d4(5) = 1
= 46.25 ← = 46.25 = v4 (5)
expected total reward. Both the maximum expected total reward and the
associated decision are indicated in columns 4 and 5, respectively, of the row
flagged by the ← symbol.
vi (5) = max[qik + pik1v1 (6) + pik2 v2 (6) + pik3 v3 (6) + pik4 v4 (6)], for i = 1, 2, 3, 4
k
. (5.12)
= max[qik + pik1 ( −20) + pik2 (10) + pik3 ( −5) + pik4 (35)], for i = 1, 2, 3, 4
k
At the end of month 5, the optimal decision is to select the second alternative
in states 1, 2, and 3, and the first alternative in state 4. Thus, the decision vec-
tor at the end of month 5 is d( 5) = [2 2 2 1] .
T
TABLE 5.6
Value Iteration for n = 4
          Test Quantity qi^k + pi1^k v1(5) + pi2^k v2(5) + pi3^k v3(5) + pi4^k v4(5)    vi(4) = Expected    Decision
i   k     = qi^k + pi1^k(−24) + pi2^k(8.5) + pi3^k(1) + pi4^k(46.25)                    Total Reward        di(4) = k
1 1 −30 + 0.15( −24) + 0.40(8.5) + 0.35(1) + 0.10(46.25)
= −25.225
1 2 −25 + 0.45( −24) + 0.05(8.5) + 0.20(1) + 0.30(46.25) max [ − 25.225, −21.3, d1(4) = 2
= −21.3 ← −31.75] = −21.3 = v1 (4)
1 3 −20 + 0.60( −24) + 0.30(8.5) + 0.10(1) + 0(46.25)
= −31.75
2 1 5 + 0.25( −24) + 0.30(8.5) + 0.35(1) + 0.10(46.25)
= 6.525
2 2 10 + 0.30( −24) + 0.40(8.5) + 0.25(1) + 0.05(46.25) max [6.525, 8.7625 d2(4) = 2
= 8.7625 ← = 8.7625 = v2 (4)
TABLE 5.7
Value Iteration for n = 3
vi(4) = max_k [qi^k + pi1^k v1(5) + pi2^k v2(5) + pi3^k v3(5) + pi4^k v4(5)], for i = 1, 2, 3, 4
      = max_k [qi^k + pi1^k(−24) + pi2^k(8.5) + pi3^k(1) + pi4^k(46.25)], for i = 1, 2, 3, 4.   (5.13)
k
At the end of month 4, the optimal decision is to select the second alter-
native in every state. Thus, the decision vector at the end of month 4 is
d( 4 ) = [2 2 2 2]T .
The calculations for month 3, denoted by n = 3, are indicated in Table 5.7.
vi (3) = max[qik + pik1v1 (4) + pik2 v2 (4) + pik3 v3 (4) + pik4 v4 (4)], for i = 1, 2, 3, 4
k
= max[qik + pik1 ( −21.3) + pik2 (8.7625) + pik3 (5.675) + pik4 (53.9)], (5.14)
k
for i = 1, 2, 3, 4.
At the end of month 3, the optimal decision is to select the second alter-
native in every state. Thus, the decision vector at the end of month 3 is
d( 3 ) = [2 2 2 2]T .
The calculations for month 2, denoted by n = 2, are indicated in Table 5.8.
vi (2) = max[qik + pik1v1 (3) + pik2 v2 (3) + pik3 v3 (3) + pik4 v4 (3)], for i = 1, 2, 3, 4
k
= max[qik + pik1 ( −16.8419) + pik2 (11.2288) + pik3 (9.7431) + pik4 (59.9188)], (5.15)
k
for i = 1, 2, 3, 4.
TABLE 5.8
Value Iteration for n = 2
          Test Quantity qi^k + pi1^k v1(3) + pi2^k v2(3) + pi3^k v3(3) + pi4^k v4(3)            vi(2) = Expected    Decision
i   k     = qi^k + pi1^k(−16.8419) + pi2^k(11.2288) + pi3^k(9.7431) + pi4^k(59.9188)            Total Reward        di(2) = k
1 1 −30 + 0.15(−16.8419) + 0.40(11.2288)
+ 0.35(9.7431) + 0.10(59.9188) = −18.6328
2 1 5 + 0.25(−16.8419) + 0.30(11.2288)
+ 0.35(9.7431) + 0.10(59.9188) = 13.5601
4 1 35 + 0.05(−16.8419) + 0.20(11.2288)
+ 0.40(9.7431) + 0.35(59.9188) = 61.2725
TABLE 5.9
Value Iteration for n = 1
          Test Quantity qi^k + pi1^k v1(2) + pi2^k v2(2) + pi3^k v3(2) + pi4^k v4(2)            vi(1) = Expected    Decision
i   k     = qi^k + pi1^k(−12.0932) + pi2^k(14.8707) + pi3^k(13.8204) + pi4^k(64.9971)           Total Reward        di(1) = k
1 1 −30 + 0.15(−12.0932) + 0.40(14.8707)
+ 0.35(13.8204) + 0.10(64.9971) = −14.5289
2 1 5 + 0.25(−12.0932) + 0.30(14.8707)
+ 0.35(13.8204) + 0.10(64.9971) = 17.7748
4 1 35 + 0.05(−12.0932) + 0.20(14.8707)
+ 0.40(13.8204) + 0.35(64.9971) = 65.6466
TABLE 5.10
Value Iteration for n = 0
          Test Quantity qi^k + pi1^k v1(1) + pi2^k v2(1) + pi3^k v3(1) + pi4^k v4(1)            vi(0) = Expected    Decision
i   k     = qi^k + pi1^k(−7.4352) + pi2^k(19.0253) + pi3^k(18.0226) + pi4^k(69.6315)            Total Reward        di(0) = k
1 1 −30 + 0.15(−7.4352) + 0.40(19.0253)
+ 0.35(18.0226) + 0.10(69.6315) = −10.2341
2 1 5 + 0.25(−7.4352) + 0.30(19.0253)
+ 0.35(18.0226) + 0.10(69.6315) = 22.1199
4 1 35 + 0.05(−7.4352) + 0.20(19.0253)
+ 0.40(18.0226) + 0.35(69.6315) = 70.0134
At the end of month 2, the optimal decision is to select the second alter-
native in every state. Thus, the decision vector at the end of month 2 is
d(2) = [2 2 2 2]T.
The calculations for month 1, denoted by n = 1, are indicated in Table 5.9.
vi(1) = max_k [qi^k + pi1^k v1(2) + pi2^k v2(2) + pi3^k v3(2) + pi4^k v4(2)], for i = 1, 2, 3, 4
      = max_k [qi^k + pi1^k(−12.0932) + pi2^k(14.8707) + pi3^k(13.8204) + pi4^k(64.9971)],   (5.16)
for i = 1, 2, 3, 4.
At the end of month 1, the optimal decision is to select the second alternative in
every state. Thus, the decision vector at the end of month 1 is d(1) = [2 2 2 2]T.
Finally, the calculations for month 0, denoted by n = 0, are indicated in
Table 5.10.
vi (0) = max[qik + pik1v1 (1) + pik2 v2 (1) + pik3 v3 (1) + pik4 v4 (1)], for i = 1, 2, 3, 4
k
= max[qik + pik1 ( −7.4352) + pik2 (19.0253) + pik3 (18.0226) + pik4 (69.6315)],
k
for i = 1, 2, 3, 4.
(5.17)
At the end of month 0, which is the beginning of month 1, the optimal deci-
sion is to select the second alternative in every state. Thus, the decision vector
at the beginning of month 1 is d(0 ) = [2 2 2 2]T.
TABLE 5.11
Expected Total Rewards and Optimal Decisions for a Planning
Horizon of 7 Months
n
End of Month 0 1 2 3 4 5 6 7
v1 (n) −2.9006 −7.4352 −12.0932 −16.8419 −21.3 −24 −20 0
v2 ( n) 23.3668 19.0253 14.8707 11.2288 8.7625 8.5 10 0
v3 ( n) 22.3222 18.0226 13.8204 9.7431 5.675 1 −5 0
v 4 ( n) 74.0882 69.6315 64.9971 59.9188 53.9 46.25 35 0
d1(n)   2   2   2   2   2   2   3   −
d2(n)   2   2   2   2   2   2   2   −
d3(n)   2   2   2   2   2   2   2   −
d4(n)   2   2   2   2   2   1   1   −
The results of these calculations for the expected total rewards and the
optimal decisions at the end of each month of the 7-month planning horizon
are summarized in Table 5.11.
If the process starts in state 4, the expected total reward is 74.0882, the high-
est for any state. On the other hand, the lowest expected total reward is –2.9006,
which is an expected total cost of 2.9006, if the process starts in state 1.
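A sketch of value iteration for this finite-horizon MDP, assuming Python with numpy is available; the transition probabilities and rewards are those assembled in Table 5.3, and the printed results should match the n = 0 column of Table 5.11.

```python
import numpy as np

# MDP data for monthly sales, from Table 5.3 (rewards in thousands of dollars).
P = {  # P[(i, k)] = transition probability row for decision k in state i
    (1, 1): [0.15, 0.40, 0.35, 0.10], (1, 2): [0.45, 0.05, 0.20, 0.30],
    (1, 3): [0.60, 0.30, 0.10, 0.00],
    (2, 1): [0.25, 0.30, 0.35, 0.10], (2, 2): [0.30, 0.40, 0.25, 0.05],
    (3, 1): [0.05, 0.65, 0.25, 0.05], (3, 2): [0.05, 0.25, 0.50, 0.20],
    (4, 1): [0.05, 0.20, 0.40, 0.35], (4, 2): [0.00, 0.10, 0.30, 0.60]}
q = {(1, 1): -30, (1, 2): -25, (1, 3): -20, (2, 1): 5, (2, 2): 10,
     (3, 1): -10, (3, 2): -5, (4, 1): 35, (4, 2): 25}

T = 7                          # 7-month planning horizon
v = np.zeros(4)                # salvage values v_i(7) = 0
for n in range(T - 1, -1, -1):
    v_new, d = np.zeros(4), [0] * 4
    for i in (1, 2, 3, 4):
        # Equation (5.6): maximize the test quantity over the decisions in state i.
        best = {k: q[(i, k)] + np.dot(P[(i, k)], v) for (s, k) in P if s == i}
        d[i - 1] = max(best, key=best.get)
        v_new[i - 1] = best[d[i - 1]]
    v = v_new
print(v, d)   # v(0) is about [-2.9006, 23.3668, 22.3222, 74.0882], d(0) = [2, 2, 2, 2]
```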
Over an infinite planning horizon, the decision made in a state does not depend on the epoch n. Thus, over an infinite horizon, di = k indicates that in state i the same decision k will always be made, irrespective of the epoch, n.
The set of decisions for all states is called a policy. For an N-state process, a
policy over an infinite horizon can be specified by a decision vector,
d = [3  1  2  2]T (components indexed by states 1 through 4).   (5.19)
TABLE 5.12
Exhaustive Enumeration of Decision Vectors for the 24 Possible Policies for the
MDP Model of Monthly Sales (each vector lists the decisions for states 1 through 4)

1d = [1 1 1 1]T,   2d = [1 1 1 2]T,   3d = [1 1 2 1]T,   4d = [1 1 2 2]T,
5d = [1 2 1 1]T,   6d = [1 2 1 2]T,   7d = [1 2 2 1]T,   8d = [1 2 2 2]T,
9d = [2 1 1 1]T,  10d = [2 1 1 2]T,  11d = [2 1 2 1]T,  12d = [2 1 2 2]T,
13d = [2 2 1 1]T, 14d = [2 2 1 2]T,  15d = [2 2 2 1]T,  16d = [2 2 2 2]T,
17d = [3 1 1 1]T, 18d = [3 1 1 2]T,  19d = [3 1 2 1]T,  20d = [3 1 2 2]T,
21d = [3 2 1 1]T, 22d = [3 2 1 2]T,  23d = [3 2 2 1]T,  24d = [3 2 2 2]T.
Each policy corresponds to a different MCR. Let rP, rq, rπ, and rg denote the
transition probability matrix, the reward vector, the steady-state probability
vector, and the gain, respectively, associated with policy rd. For example, for pol-
icy 3 d = [1 1 2 1]T , the associated transition probability matrix, the reward
vector, the steady-state probability vector, and the gain are indicated below.
3π = [0.1159  0.2715  0.4229  0.1897], and 3g = 2.4055.
Observe that policy 20 d = [3 1 2 2]T, the policy used for the MCR model
of monthly sales, with a gain of 1.4143 calculated in Equation (4.53), is
inferior to policy 3 d = [1 1 2 1]T which has a higher gain of 2.4055. All
24 policies and their associated transition probability matrices, reward
vectors, steady-state probability vectors, and gains are enumerated in
Table 5.13a.
Exhaustive enumeration has identified 16 d = [2 2 2 2]T as the opti-
mal policy with a gain of 4.39. However, as this example has demonstrated,
exhaustive enumeration is not feasible for larger problems because of the
huge computational burden it imposes.
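Each policy can be evaluated numerically as follows; this sketch, assuming Python with numpy is available, solves Equation (2.12) for the steady-state probability vector of a chosen policy and computes its gain g = πq, using policy 16d = [2 2 2 2]T as an example.

```python
import numpy as np

def gain(P, q):
    """Steady-state probability vector and gain g = pi q of a recurrent MCR."""
    N = len(q)
    # Replace one balance equation by the normalization sum(pi) = 1
    # and solve pi(I - P) = 0 together with that normalization.
    A = np.vstack([(np.eye(N) - P).T[:-1], np.ones(N)])
    b = np.zeros(N); b[-1] = 1.0
    pi = np.linalg.solve(A, b)
    return pi, pi @ q

# Policy 16d = [2 2 2 2]T for the monthly sales MDP (rows from Table 5.3).
P16 = np.array([[0.45, 0.05, 0.20, 0.30],
                [0.30, 0.40, 0.25, 0.05],
                [0.05, 0.25, 0.50, 0.20],
                [0.00, 0.10, 0.30, 0.60]])
q16 = np.array([-25.0, 10.0, -5.0, 25.0])
print(gain(P16, q16))   # pi is about [0.1438, 0.2063, 0.3441, 0.3057], g about 4.39
```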
TABLE 5.13a
Exhaustive Enumeration of 24 Policies and Their Parameters for the MDP Model of Monthly Sales

For example, for policy 1d = [1 1 1 1]T,

     [ 0.15  0.40  0.35  0.10 ]        [ −30 ]
1P = [ 0.25  0.30  0.35  0.10 ],  1q = [   5 ],
     [ 0.05  0.65  0.25  0.05 ]        [ −10 ]
     [ 0.05  0.20  0.40  0.35 ]        [  35 ]

1π = [0.1482  0.4168  0.3233  0.1118], and 1g = −1.682.

The steady-state probability vectors and gains of the remaining policies are:

2π = [0.1324  0.3881  0.3105  0.1689],  2g = −0.914
3π = [0.1159  0.2715  0.4229  0.1897],  3g = 2.4055
4π = [0.0920  0.2336  0.3953  0.2791],  4g = 3.409
5π = [0.1815  0.4533  0.2808  0.0844],  5g = −0.766
6π = [0.1676  0.4294  0.2732  0.1297],  6g = −0.2235
7π = [0.1415  0.3094  0.3850  0.1640],  7g = 2.664
8π = [0.1173  0.2714  0.3654  0.2459],  8g = 3.5155
9π = [0.1963  0.3390  0.2989  0.1658],  9g = −0.4006
10π = [0.1669  0.3102  0.2846  0.2383], 10g = 0.49
11π = [0.1561  0.2183  0.3976  0.2280], 11g = 3.181
13π = [0.2308  0.3538  0.2615  0.1538], 13g = 0.536
14π = [0.2006  0.3258  0.2511  0.2225], 14g = 1.2945
15π = [0.1827  0.2385  0.3641  0.2147], 15g = 3.5115
16π = [0.1438  0.2063  0.3441  0.3057], 16g = 4.39
17π = [0.2811  0.3824  0.2579  0.0787], 17g = −3.5345
18π = [0.2594  0.3642  0.2537  0.1228], 18g = −2.834
19π = [0.2299  0.2674  0.3529  0.1497], 19g = 0.214
20π = [0.1908  0.2368  0.3421  0.2303], 20g = 1.4143
21π = [0.3380  0.4083  0.2064  0.0473], 21g = −3.0855
22π = [0.3230  0.3965  0.2053  0.0752], 22g = −2.668
23π = [0.2799  0.3038  0.3005  0.1158], 23g = −0.01
24π = [0.2443  0.2762  0.2967  0.1829], 24g = 0.9655
Step 2. For each state i, use the value iteration equation to compute

vi(n) = max_k [ qi^k + Σ_{j=1}^{N} pij^k vj(n + 1) ], for i = 1, 2, ..., N.

Step 3. If max_{i=1,2,...,N}[vi(n) − vi(n + 1)] − min_{i=1,2,...,N}[vi(n) − vi(n + 1)] < ε, go to step 4.
Otherwise, decrement n by 1 and return to step 2.
Step 4. For each state i, choose the decision di = k which maximizes the value of vi(n), and stop.
TABLE 5.13b
Expected Total Rewards and Optimal Decisions during the Last
7 Months of an Infinite Planning Horizon
n
Epoch    −7    −6    −5    −4    −3    −2    −1    0
TABLE 5.14
Differences between the Expected Total Rewards Earned Over Planning
Horizons Which Differ in Length by One Period

                                     n
                     −7        −6        −5        −4        −3        −2       −1
i                  vi(−7)    vi(−6)    vi(−5)    vi(−4)    vi(−3)    vi(−2)   vi(−1)
                  −vi(−6)   −vi(−5)   −vi(−4)   −vi(−3)   −vi(−2)   −vi(−1)   −vi(0)
1                 4.5346U   4.658U    4.7487    4.4581    2.7       −4L      −20L
2                 4.3415    4.1546L   3.6419L   2.4663L   0.2625L   −1.5      10
3                 4.2996L   4.2022    4.0773    4.0681    4.675      6        −5
4                 4.4567    4.6344    5.0783U   6.0188U   7.65U     11.25U    35U
Max [vi(n) − vi(n + 1)] = gU(T)
                  4.5346    4.6580    5.0783    6.0188    7.65      11.25     35
Min [vi(n) − vi(n + 1)] = gL(T)
                  4.2996    4.1546    3.6419    2.4663    0.2625    −4       −20
the planning horizon, Table 5.14 shows that the bounds on the gain obtained by value iteration are given by 4.2996 ≤ g ≤ 4.5346. The gain is approximately equal to the arithmetic average of its upper and lower bounds, so that g ≈ (4.2996 + 4.5346)/2 = 4.4171.
TABLE 5.15
Expected Relative Rewards, vi(n) − v4(n), Received during the Last Seven
Epochs of the Planning Horizon
Epoch    −7    −6    −5    −4    −3    −2    −1    0
TABLE 5.16
Expected Relative Rewards, vi(−7) − v4(−7), Received after Seven
Repetitions of Value Iteration, Compared with the Relative Values
Seven Period Horizon Infinite Horizon
(Value Iteration) (Policy Iteration, Linear Programming)
v1 ( −7) − v4 ( −7) = −76.9888 v1 = −76.8825
v2 ( −7) − v4 ( −7) = −50.7214 v2 = −50.6777
v3 ( −7) − v4 ( −7) = −51.766 v3 = −51.8072
5.1.2.3.3.1 Test Quantity for Policy Improvement (IM) The IM routine is based
on the value iteration Equation (5.4) for an MDP. Equation (5.4) indicates that
if an optimal policy is known over a planning horizon starting at epoch n + 1
and ending at epoch T, then the best decision in state i at epoch n can be
found by maximizing a test quantity,
N
qik + ∑ pijk v j (n + 1), (5.20)
j =1
over all decisions in state i. Recall from Section 4.2.3.2.2 that when T, the
length of the planning horizon, is very large, (T − 1) is also very large, so that
v j (1) ≈ (T − 1) g + v j . (4.60)
Substituting this expression for vj(1) in the test quantity produces the
result
qi^k + Σ_{j=1}^{N} pij^k vj(1) = qi^k + Σ_{j=1}^{N} pij^k[(T − 1)g + vj]
                               = qi^k + (T − 1)g Σ_{j=1}^{N} pij^k + Σ_{j=1}^{N} pij^k vj   (5.22)
                               = qi^k + (T − 1)g + Σ_{j=1}^{N} pij^k vj,
using the relative values vi of the previous policy. Then k∗ becomes the new
decision in state i, so that di = k∗, qik* becomes qi, and pijk* becomes pij.
Step 4. Stopping rule
When the policies on two successive iterations are identical, the algorithm
stops because an optimal policy has been found. Leave the old di unchanged
if the test quantity for the old di is equal to the test quantity for any other
alternative in the new policy determination. If the new policy is different
from the previous policy in at least one state, go to step 2.
Howard proved that the gain of each policy will be greater than or equal to that
of its predecessor. The algorithm will terminate after a finite number of iterations.
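A compact sketch of the PI algorithm, assuming Python with numpy is available. The value-determination step solves the VDEs with vN = 0, and the improvement step maximizes the test quantity in every state; for simplicity, ties are broken by the smallest decision index rather than by retaining the previous decision, which suffices for this example.

```python
import numpy as np

def policy_iteration(P, q, d):
    """Howard's PI for a recurrent MDP.  P[i][k] is the transition row and
    q[i][k] the reward for decision k in state i (0-based indices); d is the
    initial policy as a list of decision indices."""
    N = len(q)
    while True:
        # VD operation: solve g + v_i = q_i + sum_j p_ij v_j with v_N = 0.
        Pd = np.array([P[i][d[i]] for i in range(N)])
        qd = np.array([q[i][d[i]] for i in range(N)], dtype=float)
        A = np.hstack([np.ones((N, 1)), np.eye(N) - Pd])  # unknowns g, v_1..v_N
        A = np.delete(A, N, axis=1)                       # drop v_N (set to 0)
        sol = np.linalg.solve(A, qd)
        g, v = sol[0], np.append(sol[1:], 0.0)
        # IM routine: maximize the test quantity q_i^k + sum_j p_ij^k v_j.
        d_new = [max(range(len(q[i])), key=lambda k: q[i][k] + np.dot(P[i][k], v))
                 for i in range(N)]
        if d_new == d:
            return d, g, v
        d = d_new

# The monthly sales MDP of Table 5.3, starting from 23d = [3 2 2 1]T.
P = [[[0.15, 0.40, 0.35, 0.10], [0.45, 0.05, 0.20, 0.30], [0.60, 0.30, 0.10, 0.00]],
     [[0.25, 0.30, 0.35, 0.10], [0.30, 0.40, 0.25, 0.05]],
     [[0.05, 0.65, 0.25, 0.05], [0.05, 0.25, 0.50, 0.20]],
     [[0.05, 0.20, 0.40, 0.35], [0.00, 0.10, 0.30, 0.60]]]
q = [[-30, -25, -20], [5, 10], [-10, -5], [35, 25]]
print(policy_iteration(P, q, [2, 1, 1, 0]))
# expected optimal policy [1, 1, 1, 1], that is 16d = [2 2 2 2]T, with g about 4.3901
```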
5.1.2.3.3.3 Solution by PI of MDP Model of Monthly Sales Policy iteration
will be executed to find an optimal policy over an infinite horizon for the
MDP model of monthly sales specified in Table 5.3. An optimal policy was
obtained by exhaustive enumeration in Section 5.1.2.3.1, and by value itera-
tion in Section 5.1.2.3.2.2.
First iteration
Step 1. Initial policy
Arbitrarily choose the initial policy 23 d = [3 2 2 1]T by making decision
3 in state 1, decision 2 in states 2 and 3, and decision 1 in state 4. Thus, d1 = 3,
d2 = d3 = 2, and d4 = 1. The initial decision vector 23d, along with the associated
transition probability matrix 23P and the reward vector 23q, are shown below:
Step 2. VD operation
Use pij and qi for the initial policy, 23
d = [3 2 2 1]T , to solve the VDEs
(4.62)
Step 3. IM routine
For each state i, find the decision k* that maximizes the test quantity
qi^k + Σ_{j=1}^{4} pij^k vj
using the relative values vi of the initial policy. Then k* becomes the new
decision in state i, so that di = k*, qik* becomes qi, and pijk* becomes pij. The first
policy improvement routine is executed in Table 5.17.
Step 4. Stopping rule
The new policy is 12 d = [2 1 2 2]T, which is different from the initial
policy. Therefore, go to step 2. The new decision vector 12d, along with the
associated transition probability matrix 12P and the reward vector 12q, are
shown below:
TABLE 5.17
First IM for Monthly Sales Example
          Test Quantity qi^k + pi1^k v1 + pi2^k v2 + pi3^k v3 + pi4^k v4            Maximum
State     Decision      = qi^k + pi1^k(−102.6731) + pi2^k(−54.4350)                 Value of Test   Decision
i         Alternative k   + pi3^k(−47.4686) + pi4^k(0)                              Quantity        di = k*
2 2 10 + 0.30(−102.6731) + 0.40(−54.4350)
+ 0.25(−47.4686) + 0.05(0) = −54.4431
4 1 35 + 0.05(−102.6731) + 0.20(−54.4350)
+ 0.40(−47.4686) + 0.35(0) = −0.0081
Step 2. VD operation
Use pij and qi for the first new policy, 12 d = [2 1 2 2]T, to solve the VDEs
(4.62) for all relative values vi and the gain g.
Step 3. IM routine
For each state i, find the decision k* that maximizes the test quantity
qi^k + Σ_{j=1}^{4} pij^k vj
using the relative values vi of the previous policy. Then k* becomes the new
decision in state i, so that di = k*, qik* becomes qi, and pijk* becomes pij. The sec-
ond policy improvement routine is executed in Table 5.18.
Step 4. Stopping rule
The new policy is 16 d = [2 2 2 2]T , which is different from the previous
policy. Therefore, go to step 2. The new decision vector 16 d, along with the
associated transition probability matrix 16P, and the reward vector 16q, are
shown below:
Step 2. VD operation
Use pij and qi for the second new policy, 16 d = [2 2 2 2]T, to solve the VDEs
(4.62) for all relative values vi and the gain g.
TABLE 5.18
Second IM for Monthly Sales Example
          Test Quantity qi^k + pi1^k v1 + pi2^k v2 + pi3^k v3 + pi4^k v4            Maximum
State     Decision      = qi^k + pi1^k(−76.6917) + pi2^k(−52.2215)                  Value of Test   Decision
i         Alternative k   + pi3^k(−52.0848) + pi4^k(0)                              Quantity        di = k*
1 1 −30 + 0.15(−76.6917) + 0.40(−52.2215)
+ 0.35(−52.0848) + 0.10(0) = −80.6220
2 1 5 + 0.25(−76.6917) + 0.30(−52.2215)
+ 0.35(−52.0848) + 0.10(0) = −48.0691
Step 3. IM routine
For each state i, find the decision k* that maximizes the test quantity
qi^k + Σ_{j=1}^{4} pij^k vj
TABLE 5.19
Third IM for Monthly Sales Example
          Test Quantity qi^k + pi1^k v1 + pi2^k v2 + pi3^k v3 + pi4^k v4            Maximum
State     Decision      = qi^k + pi1^k(−76.8825) + pi2^k(−50.6777)                  Value of Test   Decision
i         Alternative k   + pi3^k(−51.8072) + pi4^k(0)                              Quantity        di = k*
2 1 5 + 0.25(−76.8825) + 0.30(−50.6777)
+ 0.35(−51.8072) + 0.10(0) = −47.5565
4 1 35 + 0.05(−76.8825) + 0.20(−50.6777)
+ 0.40(−51.8072) + 0.35(0) = 0.2975
using the relative values vi of the previous policy. Then k* becomes the new
decision in state i, so that di = k*, qik* becomes qi, and pijk* becomes pij. The third
policy improvement routine is executed in Table 5.19.
Step 4. Stopping rule
Stop because the new policy, given by the vector 16 d = [2 2 2 2]T , is iden-
tical to the previous policy. Therefore, this policy is optimal. The transition
probability matrix 16P and the reward vector 16q for the optimal policy are
shown in Equation (5.20). The relative values and the gain are given in
Equation (5.21). Equation (5.21) shows that the relative values obtained by PI
are identical to the corresponding dual variables obtained by LP in Table 5.23
of Section 5.1.2.3.4.3.
dD = [2  2]T,   PD = [ 1 − b     b   ],   qD = [ f ],
                     [   d     1 − d ]         [ h ]

gD = (df + bh)/(b + d).
The objective of this insight is to show that gD > gA. Since the IM routine has
chosen policy D over policy A, the test quantity for policy D must be greater
than or equal to the test quantity for policy A in every state. Therefore,
For i = 1,

q1D + p11D v1A + p12D v2A ≥ q1A + p11A v1A + p12A v2A
f + (1 − b)v1A + b(0) ≥ e + (1 − a)v1A + a(0)
f + (1 − b)v1A ≥ e + (1 − a)v1A.

γ1 = f + (1 − b)v1A − e − (1 − a)v1A ≥ 0
   = f − e + (a − b)v1A ≥ 0.
For i = 2,

q2D + p21D v1A + p22D v2A ≥ q2A + p21A v1A + p22A v2A
h + dv1A + (1 − d)(0) ≥ s + cv1A + (1 − c)(0)
h + dv1A ≥ s + cv1A.

γ2 = h + dv1A − s − cv1A ≥ 0
   = h − s + (d − c)v1A ≥ 0.
gD − gA = (df + bh)/(b + d) − (ce + as)/(a + c)
        = [(a + c)(df + bh) − (b + d)(ce + as)] / [(b + d)(a + c)]
        = [d/(b + d)]·[(a − b)(e − s) + (a + c)(f − e)]/(a + c) + [b/(b + d)]·[(d − c)(e − s) + (a + c)(h − s)]/(a + c)
        = [d/(b + d)][(a − b)v1A + f − e] + [b/(b + d)][(d − c)v1A + h − s]
        = [d/(b + d)]γ1 + [b/(b + d)]γ2.
Equation (2.16) indicates that the steady-state probability vector under policy
D is
πD = [d/(b + d)    b/(b + d)].
Hence,

gD − gA = π1Dγ1 + π2Dγ2 ≥ 0,

since the steady-state probabilities π1D and π2D are strictly positive and γ1 and γ2 are nonnegative. Thus the gain of the policy selected by the IM routine is greater than or equal to the gain of its predecessor.
The LP formulation developed in this section applies to a recurrent MDP in which every policy produces a Markov chain whose transition probability matrix is regular [1,3]. As Section 2.1 indicates, the entries of the steady-state
probability vector for a regular Markov chain are strictly positive and sum
to one. To formulate an LP, one must identify decision variables, an objective
function, and a set of constraints.
Decision Variables
The decision variable in the LP formulation is the joint probability of
being in state i and making decision k. This decision variable is denoted
by
yi^k = P(state = i ∩ decision = k).

P(state = i) = Σ_{k=1}^{Ki} P(state = i ∩ decision = k) = Σ_{k=1}^{Ki} yi^k.   (5.26)

πi = P(state = i) = Σ_{k=1}^{Ki} yi^k.   (5.27)

Substituting πi = Σ_{k=1}^{Ki} yi^k, the conditional probability of making decision k in state i is

P(decision = k | state = i) = yi^k / Σ_{k=1}^{Ki} yi^k.   (5.30)
For example, the decision vector 20d = [3 1 2 2]T corresponds to the conditional probabilities P(decision = 3 | state = 1) = 1, P(decision = 1 | state = 2) = 1, P(decision = 2 | state = 3) = 1, and P(decision = 2 | state = 4) = 1.
For each state, only one joint probability has a positive value; the others have
values of zero. The joint probability yik = P(state = i ∩ decision = k) > 0 if the
associated conditional probability, P(decision = k | state = i) = 1. The reason
is that when P(decision = k | state = i) = 1, then P(state = i ∩ decision = k) =
P(state = i)P(decision = k | state = i) = P(state = i) = πi > 0.
For the decision vector 20d = [3 1 2 2]T, the joint probabilities in state 1 are
y1^1 = π1P(decision = 1 | state = 1) = π1(0) = 0, y1^2 = π1P(decision = 2 | state = 1) = π1(0) = 0, and
y1^3 = π1P(decision = 3 | state = 1) = π1(1) = π1 = 0.1908 > 0, from Table 5.12.
Objective Function
The objective of the LP formulation for a recurrent MDP is to find a policy,
which will maximize the gain. The reward received in state i of an MCR is
denoted by qi. The reward received in state i of an MDP when decision k is
made is denoted by qik. The reward received in state i of an MDP is equal to
the reward received for each decision in state i weighted by the conditional
probability of making that decision, P(decision = k | state = i), so that
qi = Σ_{k=1}^{Ki} P(decision = k | state = i) qi^k.   (5.32)

Using Equation (4.46), the equation for the gain of a recurrent MDP is

g = πq = Σ_{i=1}^{N} πi qi = Σ_{i=1}^{N} πi Σ_{k=1}^{Ki} P(decision = k | state = i) qi^k.

Therefore, the objective function is to find the decision variables yi^k that will

Maximize g = Σ_{i=1}^{N} Σ_{k=1}^{Ki} qi^k yi^k.   (5.33)
Constraints
The LP formulation has the following three sets of constraints. They represent Equation (2.12), which must be solved to find the steady-state probability distribution for a regular Markov chain.

1. πj = Σ_{i=1}^{N} πi pij, for j = 1, 2, ..., N   (5.34)
2. Σ_{i=1}^{N} πi = 1   (5.35)
3. πi ≥ 0, for i = 1, 2, ..., N.

The three sets of constraints will be expressed in terms of the decision variables, yi^k. The left-hand side of the set of constraints (5.34) is

πj = Σ_{k=1}^{Kj} yj^k, for j = 1, 2, ..., N.   (5.27)

Similarly, the transition probability pij is

pij = Σ_{k=1}^{Ki} P(decision = k | state = i) pij^k.   (5.37)

Hence

Σ_{i=1}^{N} πi pij = Σ_{i=1}^{N} πi Σ_{k=1}^{Ki} P(decision = k | state = i) pij^k.

Substituting yi^k = πi P(decision = k | state = i), the right-hand side of the set of constraints (5.34) is

Σ_{i=1}^{N} πi pij = Σ_{i=1}^{N} Σ_{k=1}^{Ki} yi^k pij^k.

Equating the two sides, the first set of constraints becomes

Σ_{k=1}^{Kj} yj^k = Σ_{i=1}^{N} Σ_{k=1}^{Ki} yi^k pij^k, for j = 1, 2, ..., N.

The second constraint, Σ_{i=1}^{N} πi = 1, becomes

Σ_{i=1}^{N} Σ_{k=1}^{Ki} yi^k = 1.
Maximize g = Σ_{i=1}^{N} Σ_{k=1}^{Ki} qi^k yi^k   (5.38)

subject to

Σ_{k=1}^{Kj} yj^k − Σ_{i=1}^{N} Σ_{k=1}^{Ki} yi^k pij^k = 0, for j = 1, 2, ..., N,   (5.39)

Σ_{i=1}^{N} Σ_{k=1}^{Ki} yi^k = 1,   (5.40)

yi^k ≥ 0, for i = 1, 2, ..., N, and k = 1, 2, ..., Ki.

One of the N constraints in the set (5.39) is redundant. The constraint associated with state N,

Σ_{k=1}^{KN} yN^k − Σ_{i=1}^{N} Σ_{k=1}^{Ki} yi^k piN^k = 0,   (5.42)

may therefore be omitted. The complete LP formulation for a regular N-state MDP is

Maximize g = Σ_{i=1}^{N} Σ_{k=1}^{Ki} qi^k yi^k   (5.43a)

subject to

Σ_{k=1}^{Kj} yj^k − Σ_{i=1}^{N} Σ_{k=1}^{Ki} yi^k pij^k = 0, for j = 1, 2, ..., N − 1,

Σ_{i=1}^{N} Σ_{k=1}^{Ki} yi^k = 1,   (5.43b)

yi^k ≥ 0, for i = 1, 2, ..., N, and k = 1, 2, ..., Ki.
The LP formulation for a regular N-state MDP with Ki decisions per state
has Σ_{i=1}^{N} Ki decision variables and N constraints, all of which are equations.
In contrast, the N VDEs, which must be solved for each repetition of PI, have
only N unknowns consisting of the N − 1 relative values, v1, v2, … , vN−1, and
the gain, g.
When an optimal solution for the LP decision variables, yik, has been found,
the conditional probabilities,
P(decision = k | state = i) = yi^k / Σ_{k=1}^{Ki} yi^k,
TABLE 5.21
Data for Recurrent MDP Model of Monthly Sales
Transition Probability
Maximize g = Σ_{i=1}^{4} Σ_{k=1}^{Ki} qi^k yi^k = Σ_{k=1}^{3} q1^k y1^k + Σ_{k=1}^{2} q2^k y2^k + Σ_{k=1}^{2} q3^k y3^k + Σ_{k=1}^{2} q4^k y4^k

Constraints
State 1 has K1 = 3 possible decisions. The constraint associated with a transition to state j = 1 is

Σ_{k=1}^{3} y1^k − Σ_{i=1}^{4} Σ_{k=1}^{Ki} yi^k pi1^k = 0

(y1^1 + y1^2 + y1^3) − (Σ_{k=1}^{3} y1^k p11^k + Σ_{k=1}^{2} y2^k p21^k + Σ_{k=1}^{2} y3^k p31^k + Σ_{k=1}^{2} y4^k p41^k) = 0

(y1^1 + y1^2 + y1^3) − (0.15y1^1 + 0.45y1^2 + 0.60y1^3) − (0.25y2^1 + 0.30y2^2) − (0.05y3^1 + 0.05y3^2) − (0.05y4^1 + 0y4^2) = 0.
State 2 has K2 = 2 possible decisions. The constraint associated with a transition to state j = 2 is

Σ_{k=1}^{2} y2^k − Σ_{i=1}^{4} Σ_{k=1}^{Ki} yi^k pi2^k = 0

(y2^1 + y2^2) − (Σ_{k=1}^{3} y1^k p12^k + Σ_{k=1}^{2} y2^k p22^k + Σ_{k=1}^{2} y3^k p32^k + Σ_{k=1}^{2} y4^k p42^k) = 0

(y2^1 + y2^2) − (y1^1 p12^1 + y1^2 p12^2 + y1^3 p12^3) − (y2^1 p22^1 + y2^2 p22^2) − (y3^1 p32^1 + y3^2 p32^2) − (y4^1 p42^1 + y4^2 p42^2) = 0

(y2^1 + y2^2) − (0.40y1^1 + 0.05y1^2 + 0.30y1^3) − (0.30y2^1 + 0.40y2^2) − (0.65y3^1 + 0.25y3^2) − (0.20y4^1 + 0.10y4^2) = 0.
The constraint associated with a transition to state j = 3 is

Σ_{k=1}^{2} y3^k − Σ_{i=1}^{4} Σ_{k=1}^{Ki} yi^k pi3^k = 0

(y3^1 + y3^2) − (Σ_{k=1}^{3} y1^k p13^k + Σ_{k=1}^{2} y2^k p23^k + Σ_{k=1}^{2} y3^k p33^k + Σ_{k=1}^{2} y4^k p43^k) = 0

(y3^1 + y3^2) − (y1^1 p13^1 + y1^2 p13^2 + y1^3 p13^3) − (y2^1 p23^1 + y2^2 p23^2) − (y3^1 p33^1 + y3^2 p33^2) − (y4^1 p43^1 + y4^2 p43^2) = 0

(y3^1 + y3^2) − (0.35y1^1 + 0.20y1^2 + 0.10y1^3) − (0.35y2^1 + 0.25y2^2) − (0.25y3^1 + 0.50y3^2) − (0.40y4^1 + 0.30y4^2) = 0.
The normalization constraint is

Σ_{i=1}^{4} Σ_{k=1}^{Ki} yi^k = (y1^1 + y1^2 + y1^3) + (y2^1 + y2^2) + (y3^1 + y3^2) + (y4^1 + y4^2) = 1.
The complete LP formulation for the recurrent MDP model of monthly sales, with the redundant constraint associated with the state j = 4 omitted, and the four remaining constraints numbered consecutively from 1 to 4, is

Maximize g = −30y1^1 − 25y1^2 − 20y1^3 + 5y2^1 + 10y2^2 − 10y3^1 − 5y3^2 + 35y4^1 + 25y4^2

subject to

(1) (0.85y1^1 + 0.55y1^2 + 0.40y1^3) − (0.25y2^1 + 0.30y2^2) − (0.05y3^1 + 0.05y3^2) − (0.05y4^1 + 0y4^2) = 0
(2) −(0.40y1^1 + 0.05y1^2 + 0.30y1^3) + (0.70y2^1 + 0.60y2^2) − (0.65y3^1 + 0.25y3^2) − (0.20y4^1 + 0.10y4^2) = 0
(3) −(0.35y1^1 + 0.20y1^2 + 0.10y1^3) − (0.35y2^1 + 0.25y2^2) + (0.75y3^1 + 0.50y3^2) − (0.40y4^1 + 0.30y4^2) = 0
(4) (y1^1 + y1^2 + y1^3) + (y2^1 + y2^2) + (y3^1 + y3^2) + (y4^1 + y4^2) = 1,

with all yi^k ≥ 0.
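This LP can be solved with any LP code; the sketch below, assuming Python with scipy is available, passes the objective coefficients and the four constraints directly to scipy.optimize.linprog (negating the objective, since linprog minimizes).

```python
import numpy as np
from scipy.optimize import linprog

# Decision variables ordered y11, y12, y13, y21, y22, y31, y32, y41, y42.
c = np.array([-30, -25, -20, 5, 10, -10, -5, 35, 25], dtype=float)

A_eq = np.array([
    # constraint (1): transitions into state 1
    [ 0.85,  0.55,  0.40, -0.25, -0.30, -0.05, -0.05, -0.05,  0.00],
    # constraint (2): transitions into state 2
    [-0.40, -0.05, -0.30,  0.70,  0.60, -0.65, -0.25, -0.20, -0.10],
    # constraint (3): transitions into state 3
    [-0.35, -0.20, -0.10, -0.35, -0.25,  0.75,  0.50, -0.40, -0.30],
    # constraint (4): normalization
    [ 1,     1,     1,     1,     1,     1,     1,     1,     1   ]], dtype=float)
b_eq = np.array([0.0, 0.0, 0.0, 1.0])

# linprog minimizes, so negate the objective to maximize the gain g.
res = linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 9)
print(-res.fun)   # gain, approximately 4.3901
print(res.x)      # nonzero entries should be y12, y22, y32, y42 (policy 16d)
```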
TABLE 5.22
LP Solution of MDP Model of Monthly Sales

Objective function value: g = 4.3901

State i   Decision k   yi^k          Row            Dual Variable
1         1            0             Constraint 1   −76.8825
1         2            0.1438        Constraint 2   −50.6777
1         3            0             Constraint 3   −51.8072
2         1            0             Constraint 4   4.3901
2         2            0.2063
3         1            0
3         2            0.3441
4         1            0
4         2            0.3057
Table 5.22 shows that the value of the objective function is 4.3901. All
the yik equal zero, except for y12 = 0.1438, y22 = 0.2063, y32 = 0.3441, and y42 =
0.3057. These values are the steady-state probabilities for the optimal policy
given by 16 d = [2 2 2 2 ]T, which was previously found by exhaustive enumer-
ation in Section 5.1.2.3.1, by value iteration in Section 5.1.2.3.2.2, and by PI in
Section 5.1.2.3.3.3. The conditional probabilities,
P(decision = k | state = i) = yi^k / Σ_{k=1}^{Ki} yi^k,
TABLE 5.23
Comparison of Dual Variables Obtained by LP, Relative Values Obtained by PI, and
Expected Relative Rewards, vi(−7) − v4(−7), Obtained by Value Iteration Over a
Seven-Period Planning Horizon
LP Dual
LP Row Variable Relative Value Expected Relative Reward
TABLE 5.24
States for Unichain Model of an Experimental
Production Process
State Operation
1 Training center
2 Stage 3
3 Stage 2
4 Stage 1
TABLE 5.25
Data for Unichain Model of an Experimental Production Process
Transition Probability Reward
FIGURE 5.3
Passage of an item through a three-stage experimental production process.
     1 [ 1 ]        1 [ 1     0     0     0    ]        1 [ 56      ]
1d = 2 [ 1 ],  1P = 2 [ 0.45  0.55  0     0    ],  1q = 2 [ 423.54  ].
     3 [ 1 ]        3 [ 0     0.35  0.65  0    ]        3 [ 847.09  ]
     4 [ 1 ]        4 [ 0     0     0.25  0.75 ]        4 [ 1935.36 ]
g + v1 = 56 + v1
g + v2 = 423.54 + 0.45v1 + 0.55v2
g + v3 = 847.09 + 0.35v2 + 0.65v3
g + v4 = 1935.36 + 0.25v3 + 0.75v4 .
The VDE for the recurrent closed set, which consists of the single absorbing
state 1, is
g + v1 = 56 + v1 .
Setting v1=0, the solution of the VDE for the gain and the relative value for
the absorbing state under the initial policy is
g = 56, v1 = 0.
The VDEs for the three transient states under the initial policy are

g + v2 = 423.54 + 0.45v1 + 0.55v2
g + v3 = 847.09 + 0.35v2 + 0.65v3
g + v4 = 1935.36 + 0.25v3 + 0.75v4.

Substituting g = 56 and v1 = 0, the solution for the relative values of the transient states under the initial policy is v2 = 816.7556, v3 = 3077.0127, and v4 = 10594.453.
Step 3. IM routine
For each state i, find the decision k∗ that maximizes the test quantity
qi^k + Σ_{j=1}^{4} pij^k vj
using the relative values vi of the initial policy. Then k∗ becomes the new
decision in state i, so that di=k∗, qik∗ becomes qi, and pijk∗ becomes pij. The first
policy improvement routine is shown in Table 5.26.
TABLE 5.26
First IM for Unichain Model of an Experimental Production Process

          Test Quantity qi^k + pi1^k v1 + pi2^k v2 + pi3^k v3 + pi4^k v4            Maximum
State     Decision      = qi^k + pi1^k(0) + pi2^k(816.7556)                         Value of Test   Decision
i         Alternative k   + pi3^k(3077.0127) + pi4^k(10594.453)                     Quantity        di = k*

1    1    56 + 1(0) = 56
1    2    62 + 1(10594.453) = 10656.453 ←                                           10656.453       d1 = 2
2    1    423.54 + 0.45(0) + 0.55(816.7556) = 872.7556 ←                            872.7556        d2 = 1
2    2    468.62 + 0.55(0) + 0.45(816.7556) = 836.1600
3    1    847.09 + 0.35(816.7556) + 0.65(3077.0127) = 3133.0127 ←                   3133.0127       d3 = 1
3    2    796.21 + 0.50(816.7556) + 0.50(3077.0127) = 2743.0942
4    1    1935.36 + 0.25(3077.0127) + 0.75(10594.453) = 10650.453
4    2    2042.15 + 0.20(3077.0127) + 0.80(10594.453) = 11133.115 ←                 11133.115       d4 = 2
2d = [2 1 1 2]T,
2P = [0 0 0 1; 0.45 0.55 0 0; 0 0.35 0.65 0; 0 0 0.20 0.80],
2q = [62 423.54 847.09 2042.15]T.
Note that all four states belong to the same recurrent closed set.
Step 2. VD operation
Use pij and qi for the second policy, 2d=[2 1 1 2]T, to solve the VDEs (4.62)
for the relative values vi and the gain g.
g + v1 = 62 + v4
g + v2 = 423.54 + 0.45v1 + 0.55v2
g + v3 = 847.09 + 0.35v2 + 0.65v3
g + v4 = 2042.15 + 0.20v3 + 0.80v4 .
Step 3. IM routine
The second IM routine is shown in Table 5.27.
3d = [2 2 2 2]T,
3P = [0 0 0 1; 0.55 0.45 0 0; 0 0.50 0.50 0; 0 0 0.20 0.80],
3q = [62 468.62 796.21 2042.15]T.   (5.44)
Once again all four states belong to the same recurrent closed set.
TABLE 5.27
Second IM for Unichain Model of an Experimental Production Process
Test quantity: q_i^k + p_i1^k(−1168.5946) + p_i2^k(−2962.0494) + p_i3^k(−4057.7769) + p_i4^k(0)
State i  Decision k  Test Quantity                                                 Maximum Value  Decision d_i = k*
1        1           56 + 1(−1168.5946) = −1112.5946
1        2           62 + 1(0) = 62 ←                                              62             d1 = 2
2        1           423.54 + 0.45(−1168.5946) + 0.55(−2962.0494) = −1731.4547
2        2           468.62 + 0.55(−1168.5946) + 0.45(−2962.0494) = −1507.0293 ←   −1507.0293     d2 = 2
3        1           847.09 + 0.35(−2962.0494) + 0.65(−4057.7769) = −2827.1823
3        2           796.21 + 0.50(−2962.0494) + 0.50(−4057.7769) = −2713.7032 ←   −2713.7032     d3 = 2
4        1           1935.36 + 0.25(−4057.7769) + 0.75(0) = 920.9158
4        2           2042.15 + 0.20(−4057.7769) + 0.80(0) = 1230.5946 ←            1230.5946      d4 = 2
Step 2. VD operation
Use p_ij and q_i for the third policy, 3d = [2 2 2 2]T, to solve the VDEs (4.62),
g + v_i = q_i + Σ_{j=1}^{4} p_ij v_j,  i = 1, 2, 3, 4,
for the relative values v_i and the gain g:
g + v1 = 62 + v4
g + v2 = 468.62 + 0.55v1 + 0.45v2
g + v3 = 796.21 + 0.50v2 + 0.50v3
g + v4 = 2042.15 + 0.20v3 + 0.80v4 .
Step 3. IM routine
The third IM routine is shown in Table 5.28.
TABLE 5.28
Third IM for Unichain Model of an Experimental Production Process
Test quantity: q_i^k + p_i1^k(−1233.2710) + p_i2^k(−2736.2729) + p_i3^k(−3734.3949) + p_i4^k(0)
State i  Decision k  Test Quantity                                                 Maximum Value  Decision d_i = k*
1        1           56 + 1(−1233.2710) = −1177.2710
1        2           62 + 1(0) = 62 ←                                              62             d1 = 2
2        1           423.54 + 0.45(−1233.2710) + 0.55(−2736.2729) = −1636.3821
2        2           468.62 + 0.55(−1233.2710) + 0.45(−2736.2729) = −1440.9998 ←   −1440.9998     d2 = 2
3        1           847.09 + 0.35(−2736.2729) + 0.65(−3734.3949) = −2537.9622
3        2           796.21 + 0.50(−2736.2729) + 0.50(−3734.3949) = −2439.1239 ←   −2439.1239     d3 = 2
4        1           1935.36 + 0.25(−3734.3949) + 0.75(0) = 1001.7613
4        2           2042.15 + 0.20(−3734.3949) + 0.80(0) = 1295.2710 ←            1295.2710      d4 = 2
Step 4. Stopping rule
Stop because the new policy is identical to the previous policy, 3d = [2 2 2 2]T. Therefore, this policy is optimal, with gain g = 1295.2710. The transition probability matrix 3P and the reward vector 3q for the optimal policy are shown in Equation (5.44).
Note that the relative values obtained by PI are identical to the correspond-
ing dual variables obtained by LP in Table 5.30 of Section 5.1.3.2.2. The gain
obtained by PI is also equal to the gain obtained by LP. By solving Equation
(2.12), the recurrent transition probability matrix 3P in Equation (5.44) has the
steady-state probability vector
3π = [0.1019 0.1852 0.2037 0.5093].   (5.46)
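The VD and IM steps just executed by hand can be sketched in a few lines of code. The sketch below, which is not the text's computation, uses the data of Table 5.29 and pins the relative value of state 4 to zero rather than that of a recurrent state, so its intermediate relative values are shifted by a constant; the sequence of policies, the optimal policy, and the gain are unaffected.

```python
# A minimal policy-iteration sketch for the unichain production-process MDP.
import numpy as np

P = {0: [[1, 0, 0, 0],       [0, 0, 0, 1]],
     1: [[0.45, 0.55, 0, 0], [0.55, 0.45, 0, 0]],
     2: [[0, 0.35, 0.65, 0], [0, 0.50, 0.50, 0]],
     3: [[0, 0, 0.25, 0.75], [0, 0, 0.20, 0.80]]}
q = {0: [56, 62], 1: [423.54, 468.62], 2: [847.09, 796.21], 3: [1935.36, 2042.15]}
N = 4

def value_determination(d):
    # Solve g + v_i = q_i + sum_j p_ij v_j with v_N = 0 for (v_1, ..., v_{N-1}, g).
    A, b = np.zeros((N, N)), np.zeros(N)
    for i in range(N):
        row = P[i][d[i]]
        for j in range(N - 1):
            A[i, j] = (1.0 if i == j else 0.0) - row[j]
        A[i, N - 1] = 1.0                      # coefficient of the gain g
        b[i] = q[i][d[i]]
    sol = np.linalg.solve(A, b)
    return sol[N - 1], np.append(sol[:N - 1], 0.0)   # gain, relative values

def improve(d, v):
    new_d = []
    for i in range(N):
        tests = [q[i][k] + np.dot(P[i][k], v) for k in range(len(q[i]))]
        best = int(np.argmax(tests))
        # keep the previous decision on ties to guarantee convergence
        new_d.append(d[i] if np.isclose(tests[d[i]], tests[best]) else best)
    return new_d

d = [0, 0, 0, 0]                               # initial policy 1d = [1 1 1 1]
while True:
    g, v = value_determination(d)
    new_d = improve(d, v)
    if new_d == d:
        break
    d = new_d
print([k + 1 for k in d], round(g, 4))         # [2, 2, 2, 2] and gain 1295.271
```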
Maximize g = Σ_{i=1}^{4} Σ_{k=1}^{2} q_i^k y_i^k = Σ_{k=1}^{2} q_1^k y_1^k + Σ_{k=1}^{2} q_2^k y_2^k + Σ_{k=1}^{2} q_3^k y_3^k + Σ_{k=1}^{2} q_4^k y_4^k
= (q_1^1 y_1^1 + q_1^2 y_1^2) + (q_2^1 y_2^1 + q_2^2 y_2^2) + (q_3^1 y_3^1 + q_3^2 y_3^2) + (q_4^1 y_4^1 + q_4^2 y_4^2)
TABLE 5.29
Data for Unichain Model of an Experimental Production Process
                                     Transition Probability             Reward    LP Variable
State i   Operation   Decision k   p_i1^k   p_i2^k   p_i3^k   p_i4^k    q_i^k     y_i^k
1         Training    1 = I        1        0        0        0         56        y_1^1
                      2 = C        0        0        0        1         62        y_1^2
2         Stage 3     1 = I        0.45     0.55     0        0         423.54    y_2^1
                      2 = C        0.55     0.45     0        0         468.62    y_2^2
3         Stage 2     1 = I        0        0.35     0.65     0         847.09    y_3^1
                      2 = C        0        0.50     0.50     0         796.21    y_3^2
4         Stage 1     1 = I        0        0        0.25     0.75      1935.36   y_4^1
                      2 = C        0        0        0.20     0.80      2042.15   y_4^2
Constraints
State 1 has K_1 = 2 possible decisions. The constraint associated with a transition to state j = 1 is
Σ_{k=1}^{2} y_1^k − Σ_{i=1}^{4} Σ_{k=1}^{2} y_i^k p_i1^k = 0
(y_1^1 + y_1^2) − (Σ_{k=1}^{2} y_1^k p_11^k + Σ_{k=1}^{2} y_2^k p_21^k + Σ_{k=1}^{2} y_3^k p_31^k + Σ_{k=1}^{2} y_4^k p_41^k) = 0
Σ_{k=1}^{2} y_2^k − Σ_{i=1}^{4} Σ_{k=1}^{2} y_i^k p_i2^k = 0
(y_2^1 + y_2^2) − (Σ_{k=1}^{2} y_1^k p_12^k + Σ_{k=1}^{2} y_2^k p_22^k + Σ_{k=1}^{2} y_3^k p_32^k + Σ_{k=1}^{2} y_4^k p_42^k) = 0
(y_2^1 + y_2^2) − (y_1^1 p_12^1 + y_1^2 p_12^2) − (y_2^1 p_22^1 + y_2^2 p_22^2) − (y_3^1 p_32^1 + y_3^2 p_32^2) − (y_4^1 p_42^1 + y_4^2 p_42^2) = 0
Σ_{k=1}^{2} y_3^k − Σ_{i=1}^{4} Σ_{k=1}^{2} y_i^k p_i3^k = 0
(y_3^1 + y_3^2) − (Σ_{k=1}^{2} y_1^k p_13^k + Σ_{k=1}^{2} y_2^k p_23^k + Σ_{k=1}^{2} y_3^k p_33^k + Σ_{k=1}^{2} y_4^k p_43^k) = 0
(y_3^1 + y_3^2) − (y_1^1 p_13^1 + y_1^2 p_13^2) − (y_2^1 p_23^1 + y_2^2 p_23^2) − (y_3^1 p_33^1 + y_3^2 p_33^2) − (y_4^1 p_43^1 + y_4^2 p_43^2) = 0
Σ_{i=1}^{4} Σ_{k=1}^{2} y_i^k = (y_1^1 + y_1^2) + (y_2^1 + y_2^2) + (y_3^1 + y_3^2) + (y_4^1 + y_4^2) = 1
The conditional probabilities,
P(decision = k | state = i) = y_i^k / Σ_{k=1}^{K_i} y_i^k,
are
P(decision = 2 | state = 1) = P(decision = 2 | state = 2) = P(decision = 2 | state = 3) = P(decision = 2 | state = 4) = 1.
TABLE 5.30
LP Solution of MDP Model of an Experimental Production Process
Objective Function Value g = 1295.2710
i   k   y_i^k        Row            Dual Variable
1   1   0            Constraint 1   −1233.2710
    2   0.1019       Constraint 2   −2736.2729
2   1   0            Constraint 3   −3734.3949
    2   0.1852       Constraint 4   1295.2710
3   1   0
    2   0.2037
4   1   0
    2   0.5093
TABLE 5.31
Data for Modified Unichain Model of an Experimental Production Process
Transition Probability Reward LP Variable
State Operation Decision k k k k k
i k p i1 p i2 p i3 p i4 q i y ik
2=C 0 0 0 1 62 y12
All the remaining conditional probabilities are zero. The dual variables
associated with constraints 1, 2, and 3 are the relative values v1, v2, and v3,
respectively, obtained by PI in Equation (5.33) of Section 5.1.3.1.2. The dual
variable associated with constraint 4, the normalizing constraint, is the gain,
g = 1295.2710, which is also the value of the objective function.
TABLE 5.32
LP Solution of a Modified MDP Model of an Experimental Production Process
Objective Function Value g = 1300
i   k   y_i^k
1   1   1
    2   0
2   1   0
    2   0
3   1   0
    2   0
4   1   0
    2   0
TABLE 5.33
Feasible Ordering Decisions Associated with Beginning Inventory Levels
Beginning Inventory X_{n−1} = i   Order Quantity c_{n−1} = k
i = 0 computers                   order k = 0, 1, 2, or 3 computers; order up to k = 3 = (3 − 0) computers
i = 1 computer                    order k = 0, 1, or 2 computers; order up to k = 2 = (3 − 1) computers
i = 2 computers                   order k = 0 or 1 computer; order up to k = 1 = (3 − 2) computers
i = 3 computers                   order k = 0 computers; order up to k = 0 = (3 − 3) computers
Table 5.34 shows the decisions allowed in every state and the associated
transition probabilities expressed as a function of the demand.
Table 5.35 shows the decisions allowed in every state and the associated
numerical transition probabilities.
Observe that in state Xn−1 = 0, when decision cn−1 = 0 is made, p000 = 1, so that
state 0 is absorbing and the other states are transient. In addition, when
decision cn−1 = 1 is made in state Xn−1 = 0, and decision cn−1 = 0 is made in state
Xn−1 = 1, then states 0 and 1 are recurrent and the remaining states are tran-
sient. Therefore, the MDP model of the inventory system is unichain with
transient states.
The expected reward or profit earned by every decision in every state will
be computed next as the expected revenue minus the expected cost. The
number of computers sold during a period equals min[(Xn−1+cn−1), dn]. The
TABLE 5.34
Allowable Decisions in Every State and Associated Transition Probabilities Expressed as a Function of the Demand
State          Decision        p_i0^k        p_i1^k        p_i2^k        p_i3^k
X_{n−1} = 0    c_{n−1} = 0     P(d_n ≥ 0)    0             0             0
               c_{n−1} = 1     P(d_n ≥ 1)    P(d_n = 0)    0             0
               c_{n−1} = 2     P(d_n ≥ 2)    P(d_n = 1)    P(d_n = 0)    0
               c_{n−1} = 3     P(d_n ≥ 3)    P(d_n = 2)    P(d_n = 1)    P(d_n = 0)
X_{n−1} = 1    c_{n−1} = 0     P(d_n ≥ 1)    P(d_n = 0)    0             0
               c_{n−1} = 1     P(d_n ≥ 2)    P(d_n = 1)    P(d_n = 0)    0
               c_{n−1} = 2     P(d_n ≥ 3)    P(d_n = 2)    P(d_n = 1)    P(d_n = 0)
X_{n−1} = 2    c_{n−1} = 0     P(d_n ≥ 2)    P(d_n = 1)    P(d_n = 0)    0
               c_{n−1} = 1     P(d_n ≥ 3)    P(d_n = 2)    P(d_n = 1)    P(d_n = 0)
X_{n−1} = 3    c_{n−1} = 0     P(d_n ≥ 3)    P(d_n = 2)    P(d_n = 1)    P(d_n = 0)
TABLE 5.35
Allowable Decisions in Every State and Associated Numerical Transition
Probabilities
Transition Probability
k
State Xn–1 = i Decision cn–1 = k p i0 p ik1 p ik2 p ik3
0 0 1 0 0 0
1 0.7 0.3 0 0
2 0.3 0.4 0.3 0
3 0.2 0.1 0.4 0.3
1 0 0.7 0.3 0 0
1 0.3 0.4 0.3 0
2 0.2 0.1 0.4 0.3
2 0 0.3 0.4 0.3 0
1 0.2 0.1 0.4 0.3
3 0 0.2 0.1 0.4 0.3
E{min[(X_{n−1} + c_{n−1}), d_n]} = Σ_{d_n=0}^{3} min[(X_{n−1} + c_{n−1}), d_n] p(d_n)
= Σ_{k=0}^{3} min[(X_{n−1} + c_{n−1}), d_n = k] P(d_n = k).
The retailer sells computers for $300 each. The expected revenue equals $300
per computer sold times the expected number sold. The expected revenue is
$300 Σ_{k=0}^{3} min[(X_{n−1} + c_{n−1}), d_n = k] P(d_n = k).
For example, in state X_{n−1} = 0, when decision c_{n−1} = 2 is made, the expected revenue is equal to
$300 Σ_{d_n=0}^{3} min[(X_{n−1} + c_{n−1}), d_n] p(d_n) = $300 Σ_{d_n=0}^{3} min[(0 + 2), d_n] p(d_n) = $300.
The expected holding cost equals $50 times the expected number of computers remaining in inventory at the end of a period, which equals max[(X_{n−1} + c_{n−1} − d_n), 0]. The expected number remaining is given by
E{max[(X_{n−1} + c_{n−1} − d_n), 0]} = Σ_{d_n=0}^{3} max[(X_{n−1} + c_{n−1} − d_n), 0] p(d_n)
= Σ_{k=0}^{3} max[(X_{n−1} + c_{n−1} − k), 0] P(d_n = k).
The expected holding cost is
$50 Σ_{k=0}^{3} max[(X_{n−1} + c_{n−1} − k), 0] P(d_n = k).
For example, in state X_{n−1} = 0, when decision c_{n−1} = 2 is made, the expected holding cost is equal to
$50 Σ_{d_n=0}^{3} max[(X_{n−1} + c_{n−1} − d_n), 0] p(d_n) = $50 Σ_{d_n=0}^{3} max[(0 + 2 − d_n), 0] p(d_n) = $50.
The expected holding costs generated in every state by every decision are
calculated in Table 5.38.
The expected shortage cost equals $40 times the expected number of com-
puters not available to satisfy demand during a period. The number not
available to satisfy demand during a period, that is, the shortage of comput-
ers at the end of the period, equals max[(dn − Xn−1 − cn−1), 0]. The expected
number of computers not available to satisfy demand during a period is
given by
Expected Revenue
State Decision Revenue Expected Revenue
TABLE 5.37
Ordering Costs
State Decision Ordering Cost
E{max[(d_n − X_{n−1} − c_{n−1}), 0]} = Σ_{d_n=0}^{3} max[(d_n − X_{n−1} − c_{n−1}), 0] p(d_n)
= Σ_{k=0}^{3} max[(k − X_{n−1} − c_{n−1}), 0] P(d_n = k).
$40 Σ_{k=0}^{3} max[(k − X_{n−1} − c_{n−1}), 0] P(d_n = k).
For example, in state Xn−1 = 0, when decision cn−1 = 1 is made, the expected
shortage cost is equal to
$40 Σ_{d_n=0}^{3} max[(d_n − X_{n−1} − c_{n−1}), 0] p(d_n) = $40 Σ_{d_n=0}^{3} max[(d_n − 0 − 1), 0] p(d_n) = $20.
The expected shortage costs generated in every state by every decision are
calculated in Table 5.39.
TABLE 5.39
Expected Shortage Costs
State X_{n−1}   Decision c_{n−1}   Shortage Cost $40 max[(d_n − X_{n−1} − c_{n−1}), 0]   Expected Shortage Cost $40 Σ_{d_n=0}^{3} max[(d_n − X_{n−1} − c_{n−1}), 0] p(d_n)
In Table 5.40 the expected rewards corresponding to every state and decision
are calculated.
The complete unichain MDP model of the inventory system is summa-
rized in Table 5.41.
TABLE 5.40
Expected Rewards
State Decision Expected Reward
X n–1 C n–1 E(Revenue) – Ordering Cost – E(Holding Cost) – E (Shortage Cost)
0 0 $0 – $0 – $0 – $48 = –$48
1 $210 – $140 – $15 – $20 = $35
2 $300 – $260 – $50 – $8 = –$18
3 $360 – $380 – $90 – $0 = –$110
1 0 $210 – $0 – $15 – $20 = $175
1 $300 – $140 – $50 – $8 = $102
2 $360 – $260 – $90 – $0 = $10
2 0 $300 – $0 – $50 – $8 = $242
1 $360 – $140 – $90 – $0 = $130
3 0 $360 – $0 – $90 – $0 = $270
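The entries of Table 5.40 can be reproduced directly from the demand distribution. In the sketch below, which is not part of the text, the demand probabilities P(d_n = 0, 1, 2, 3) = 0.3, 0.4, 0.1, 0.2 are those implied by the numerical transition probabilities of Table 5.35, the ordering costs $0, $140, $260, and $380 are read off Table 5.40 itself, and the function name is ours.

```python
# A sketch that reproduces the expected rewards of Table 5.40.
p_demand = [0.3, 0.4, 0.1, 0.2]                # P(dn = 0), ..., P(dn = 3)
price, holding, shortage = 300, 50, 40
order_cost = {0: 0, 1: 140, 2: 260, 3: 380}

def expected_reward(i, k):
    position = i + k                           # computers on hand after ordering
    revenue = sum(price * min(position, d) * p for d, p in enumerate(p_demand))
    hold = sum(holding * max(position - d, 0) * p for d, p in enumerate(p_demand))
    short = sum(shortage * max(d - position, 0) * p for d, p in enumerate(p_demand))
    return revenue - order_cost[k] - hold - short

for i in range(4):
    for k in range(4 - i):
        print(i, k, round(expected_reward(i, k), 2))
# state 0: -48, 35, -18, -110;  state 1: 175, 102, 10;  state 2: 242, 130;  state 3: 270
```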
TABLE 5.41
Unichain MDP Model of Inventory System
Transition Probability Expected
State Decision Reward LP Variable
X n–1 = i c n–1 = k j=0 j=1 j=2 j=3 q ik y ik
0 0 1 0 0 0 –$48 y00
1 0.7 0.3 0 0 $35 y01
2 0.3 0.4 0.3 0 –$18 y02
3 0.2 0.1 0.4 0.3 –$110 y03
1 0 0.7 0.3 0 0 $175 y10
1 0.3 0.4 0.3 0 $102 y11
2 0.2 0.1 0.4 0.3 $10 y12
2 0 0.3 0.4 0.3 0 $242 y20
1 0.2 0.1 0.4 0.3 $130 y21
3 0 0.2 0.1 0.4 0.3 $270 y30
subject to
(1) (0.3 y 01 + 0.7 y 02 + 0.8 y 03 ) − (0.7 y10 + 0.3 y11 + 0.2 y12 ) − (0.3 y 20 + 0.2 y 21 ) − 0.2 y 30 = 0
(2) − (0.3 y 01 + 0.4 y 02 + 0.1y 03 ) + (0.7 y10 + 0.6 y11 + 0.9 y12 ) − (0.4 y 20 + 0.1y 21 ) − 0.1y 30 = 0
(3) − (0.3 y 02 + 0.4 y 03 ) − (0.3 y11 + 0.4 y12 ) + (0.7 y 20 + 0.6 y 21 ) − 0.4 y 30 = 0
P(order = k | state = i) = y_i^k / Σ_{k} y_i^k
TABLE 5.42
LP Solution of Unichain MDP Model of an Inventory System
Objective Function Value g = 115.6364
i   k   y_i^k
0   0   0
    1   0
    2   0
    3   0.2364
1   0   0
    1   0
    2   0.2091
2   0   0.3636
    1   0
3   0   0.1909
All the remaining conditional probabilities are zero. Observe that the
retailer’s expected average profit per period is maximized by the same (2, 3)
policy under which she operated the recurrent MCR inventory model in
Section 4.2.3.4.1.
qiK = $2 + $10 pi 0 .
Since a component has no salvage value at the end of its fourth week of life,
Table 5.43 contains the transition probabilities and cost data for the four-state
unichain MDP model of component replacement.
TABLE 5.43
Data for Unichain MDP Model of Component Replacement
Transition Probability
LP Variable
State i Decision k p k
i0 p ik1 p ik2 p ik3 Cost q ik y ik
2=R 1 0 0 0 8 = 2 + 10 – 4 y02
2=R 1 0 0 0 8 = 2 + 10 – 4 y12
2=R 1 0 0 0 8 = 2 + 10 – 4 y22
3 2=R 1 0 0 0 12 = 2 + 10 y32
TABLE 5.44
Four Alternative Component Replacement Policies
Decision Vector at Age ⇒ Sequence of Actions at Age
Replacement Policy i = {0, 1, 2, 3} ⇒ i + 1 = {1, 2, 3, 4}
Replace every week 1 d = [R R R R]T ⇒ [K R R R]T
Replace every 2 weeks 2 d = [K R R R]T ⇒ [K K R R]T
Replace every 3 weeks 3 d = [K K R R]T ⇒ [K K K R]T
Replace every 4 weeks 4 d = [K K K R]T ⇒ [K K K K]T
The MCR associated with a replacement interval less than the 4-week service life is a unichain with transient states. If a replacement interval of t < 4 weeks is chosen, then states 0, 1, …, t − 1 form a recurrent closed class, and the remaining states are transient. The MCRs associated with each of the four replacement intervals are shown below, accompanied by their respective gains.
Policy 1, Replace a component every week:
Under this policy, state 0 is absorbing, and states 1, 2, and 3 are transient.
Policy 2, Replace a component every 2 weeks:
Under this policy, states {0, 1} form a recurrent closed class, while states 2 and 3 are transient.
Policy 3, Replace a component every 3 weeks:
Under this policy, states {0, 1, 2} form a recurrent closed class, while state 3 is transient.
Policy 4, Replace a component every 4 weeks:
4 d = [K K K R]T ⇒ action = [K K K K]T
Under this policy, all states are recurrent. The minimum cost policy,
3 d = [K K R R]T ⇒ action = [K K K R]T, (5.47)
is to replace a component every 3 weeks at an expected average cost per week
of $5.48.
subject to
(1) (y01 + y02) − (0.2y01 + y02) − (0.375y11 + y12) − (0.8y21 + y22) − y32 = 0
TABLE 5.45
LP Solution of Unichain MDP Model of Component Replacement
Objective Function Value = 5.4783
i   k   y_i^k
0   1   0.4348
    2   0
1   1   0.3478
    2   0
2   1   0
    2   0.2174
3   2   0
TABLE 5.46
Data for Unichain MDP Model of a Secretary Problem Over a Finite Planning
Horizon
Transition Probability
State i   Decision k   p_{i,15}^k   p_{i,20}^k   p_{i,25}^k   p_{i,30}^k   p_{i,Δ}^k   Reward q_i^k
15 1=H 0 0 0 0 1 15
2=R 0.3 0.4 0.2 0.1 0 0
20 1=H 0 0 0 0 1 20
2=R 0.3 0.4 0.2 0.1 0 0
25 1=H 0 0 0 0 1 25
2=R 0.3 0.4 0.2 0.1 0 0
30 1=H 0 0 0 0 1 30
2=R 0.3 0.4 0.2 0.1 0 0
Δ 1=H 0 0 0 0 1 0
2=R 0 0 0 0 1 0
when an applicant is rejected. The augmented state space is E = {15, 20, 25,
30, ∆}. If the nth applicant is hired, the daily reward is Xn, equal to the score
assigned to the nth applicant. When an applicant is hired, the process goes
to the absorbing state ∆, where it remains because no more candidates will
be interviewed. If the nth applicant is rejected, the daily reward is zero.
Since all the transition probability matrices corresponding to every possi-
ble decision contain the single absorbing state ∆, the MDP is unichain. The
unichain MDP model of the secretary problem over a finite planning hori-
zon is shown in Table 5.46.
v_i(n) = max{[x_n if hire], [q_i^2 + Σ_{j=15}^{30} p_ij^2 v_j(n + 1) if reject]},
for n = 1, 2, ..., 5, and i = 15, 20, 25, 30.
Substituting q_i^2 = 0,
v_i(n) = max{[x_n if hire], [p_{i,15}^2 v_15(n + 1) + p_{i,20}^2 v_20(n + 1) + p_{i,25}^2 v_25(n + 1) + p_{i,30}^2 v_30(n + 1) if reject]},   (5.48)
for n = 1, 2, ..., 5, and i = 15, 20, 25, 30
vi (n) = max{[xn if hire], [0.3v15 (n + 1) + 0.4v20 (n + 1)
+ 0.2v25 (n + 1) + 0.1v30 ( n + 1) if reject]},
for n = 1, 2,..., 5.
To begin the backward recursion, the terminal values on day 6 are set
equal to the possible scores of the applicant on day 6 because a sixth or final
applicant must be hired. Therefore, for n = 6,
v15 (6) = 15, v20 (6) = 20, v25 (6) = 25, v30 (6) = 30.
vi (5) = max{[x5 if hire], [0.3v15 (6) + 0.4v20 (6) + 0.2v25 (6) + 0.1v30 (6) if reject]},
for i = 15, 20,25,30
vi (5) = max{[x5 if hire], [0.3(15) + 0.4(20) + 0.2(25) + 0.1(30) if reject]},
for i = 15, 20,25,30
vi (5) = max{[x5 if hire], [20.5 if reject]}, for i = 15, 20,25,30
If X 5 = i = 15, then v15 (5) = max{[15 if hire], [20.5 if reject]} = 20.5, reject
If X 5 = i = 20, then v20 (5) = max{[20 if hire], [20.5 if reject]} = 20.5, reject
If X 5 = i = 25, then v25 (5) = max{[25 if hire], [20.5 if reject]} = 25, hire
If X 5 = i = 30, then v30 (5) = max{[30 if hire], [20.5 if reject]} = 30, hire
vi (4) = max{[x 4 if hire], [0.3v15 (5) + 0.4v20 (5) + 0.2v25 (5) + 0.1v30 (5) if reject]},
for i = 15, 20,25,30
vi (4) = max{[x 4 if hire], [0.3(20.5) + 0.4(20.5) + 0.2(25) + 0.1(30) if reject]},
for i = 15, 20,25,30
vi (4) = max{[x 4 if hire], [22.35 if reject]}, for i = 15, 20,25,30
If X 4 = i = 15, then v15 (4) = max{[15 if hire], [22.35 if reject]} = 22.35, reject
If X 4 = i = 20, then v20 (4) = max{[20 if hire], [22.35 if reject]} = 22.35, reject
If X 4 = i = 25, then v25 (4) = max{[25 if hire], [22.35 if reject]} = 25, hire
If X 4 = i = 30, then v30 (4) = max{[30 if hire], [22.35 if reject]} = 30, hire
vi (3) = max{[x3 if hire], [0.3v15 (4) + 0.4v20 (4) + 0.2v25 (4) + 0.1v30 (4) if reject]},
for i = 15, 20,25,30
vi (3) = max{[x3 if hire], [0.3(22.35) + 0.4(22.35) + 0.2(25) + 0.1(30) if reject]},
for i = 15, 20,25,30
vi (3) = max{[x3 if hire], [23.645 if reject]}, for i = 15, 20,25,30
If X 3 = i = 15, then v15 (3) = max{[15 if hire], [23.645 if reject]} = 23.645, reject
If X 3 = i = 20, then v20 (3) = max{[20 if hire], [23.645 if reject]} = 23.645, reject
If X 3 = i = 25, then v25 (3) = max{[25 if hire], [23.645 if reject]} = 25, hire
If X 3 = i = 30, then v30 (3) = max{[30 if hire], [23.645 if reject]} = 30, hire
vi (2) = max{[x2 if hire], [0.3v15 (3) + 0.4v20 (3) + 0.2v25 (3) + 0.1v30 (3) if reject]},
for i = 15, 20,25,30
vi (2) = max{[x2 if hire], [0.3(23.645) + 0.4(23.645) + 0.2(25) + 0.1(30) if reject]},
for i = 15, 20,25,30
vi (2) = max{[x2 if hire], [24.5515 if reject]}, for i = 15, 20,25,30
If X 2 = i = 15, then v15 (2) = max{[15 if hire], [24.5515 if reject]} = 24.5515, reject
If X 2 = i = 20, then v20 (2) = max{[20 if hire], [24.5515 if reject]} = 24.5515, reject
If X 2 = i = 25, then v25 (2) = max{[25 if hire], [24.5515 if reject]} = 25, hire
If X 2 = i = 30, then v30 (2) = max{[30 if hire], [24.5515 if reject]} = 30, hire.
vi (1) = max{[x1 if hire], [0.3v15 (2) + 0.4v20 (2) + 0.2v25 (2) + 0.1v30 (2) if reject]},
for i = 15, 20,25,30
vi (1) = max{[x1 if hire], [0.3(24.5515) + 0.4(24.5515) + 0.2(25) + 0.1(30) if reject]},
for i = 15, 20,25,30
vi (1) = max{[x1 if hire], [25.18605 if reject]}, for i = 15, 20,25,30
If X1 = i = 15, then v15 (1) = max{[15 if hire], [25.18605 if reject]} = 25.18605, reject
If X1 = i = 20, then v20 (1) = max{[20 if hire], [25.18605 if reject]} = 25.18605, reject
If X1 = i = 25, then v25 (1) = max{[25 if hire], [25.18605 if reject]} = 25.18605, reject
If X1 = i = 30, then v30 (1) = max{[30 if hire], [25.18605 if reject]} = 30, hire.
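The backward recursion just carried out by hand is short enough to check by machine. A minimal sketch, not part of the text:

```python
# Backward recursion for the secretary problem over a six-day horizon.
scores = [15, 20, 25, 30]
prob = [0.3, 0.4, 0.2, 0.1]                    # P(score of an applicant)

v = {s: float(s) for s in scores}              # day 6: the last applicant must be hired
for day in range(5, 0, -1):
    cont = sum(p * v[s] for s, p in zip(scores, prob))   # expected value of rejecting
    print(day, round(cont, 5))                 # 20.5, 22.35, 23.645, 24.5515, 25.18605
    v = {s: max(s, cont) for s in scores}
```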
TABLE 5.47
Decision Rules for Maximizing the Expected Score of the Secretary
Day n   Candidate Score X_n                 Decision   Minimum Rating to Be Hired   Interviews Will
1       If X_1 = 15, 20, or 25, then        Reject                                  Continue
        If X_1 = 30, then                   Hire       X_1 ≥ 25.18605               Stop
2       If X_2 = 15 or 20, then             Reject                                  Continue
        If X_2 = 25 or 30, then             Hire       X_2 ≥ 24.5515                Stop
3       If X_3 = 15 or 20, then             Reject                                  Continue
        If X_3 = 25 or 30, then             Hire       X_3 ≥ 23.645                 Stop
4       If X_4 = 15 or 20, then             Reject                                  Continue
        If X_4 = 25 or 30, then             Hire       X_4 ≥ 22.35                  Stop
5       If X_5 = 15 or 20, then             Reject                                  Continue
        If X_5 = 25 or 30, then             Hire       X_5 ≥ 20.5                   Stop
6       If X_6 = 15, 20, 25, or 30, then    Hire       X_6 ≥ 15                     Stop
Suppose that the optimal policy specified by the decision rules in Table 5.47
is followed. Observe that if the score of the first candidate is greater than or
equal to 25.18605, that is, 30, the first candidate should be hired, and the inter-
views should be stopped. Otherwise, the interviews should be continued. If
X1 equals 30, the maximum expected score is 30 because the first applicant is
hired. If X1 is less than 30, the maximum expected score is 25.18605 because
the first applicant is rejected. Prior to knowing X1, the maximum expected
score is (0.1)(30) + (0.9)(25.18605) = 25.667445 because there is a probability of
0.1 that the first candidate will be excellent and receive a score of 30, and a
probability of 0.9 that the first candidate will not be excellent and the inter-
views will have to be continued. Note that if the fifth candidate receives a
score greater than or equal to 20.5, that is, 25 or 30, the fifth candidate should
be hired, and the interviews should be stopped. Finally, if the fifth candidate
is rejected, the sixth candidate must be hired.
TABLE 5.48
States for Flexible Production Process
State Operation
1 Scrap
2 Training center
3 Stage 3
4 Stage 2
5 Stage 1
TABLE 5.49
Data for Multichain Model of a Flexible Production System
Transition Probability
State i   Operation   Decision k   p_i1^k   p_i2^k   p_i3^k   p_i4^k   p_i5^k   Reward q_i^k
1 Scrap 1 1 0 0 0 0 62
2 Training 1=S 0 1 0 0 0 100
2=H 0 0 0 0 1 22.58
3 Stage 3 1=I 0.25 0.20 0.55 0 0 423.54
2=C 0.20 0.35 0.45 0 0 468.62
4 Stage 2 1= I 0.15 0 0.20 0.65 0 847.09
2=C 0.20 0 0.30 0.50 0 796.21
5 Stage 1 1=I 0.10 0 0 0.15 0.75 1935.36
2=C 0.08 0 0 0.12 0.80 2042.15
FIGURE 5.4
Passage of an item through a three-stage flexible production system.
v_i(n) = max_k {q_i^k + Σ_{j=1}^{N} p_ij^k v_j(n + 1)}, for n = 0, 1, ..., T − 1, and i = 1, 2, ..., N,   (5.6)
where the salvage values v_i(T) are specified for all states i = 1, 2, ..., N. The value iteration equation indicates that if an optimal policy is known over a planning horizon starting at epoch n + 1 and ending at epoch T, then the best decision in state i at epoch n can be found by maximizing a test quantity,
q_i^k + Σ_{j=1}^{N} p_ij^k v_j(n + 1),   (5.49)
over all decisions in state i. For the decision made at the first epoch of a very long planning horizon, the test quantity becomes
q_i^k + Σ_{j=1}^{N} p_ij^k v_j(1),   (5.50)
over all decisions in state i. Let gj denote the gain of state j. Recall from
Section 4.2.3.2.2 that when T, the length of the horizon, is very large, (T − 1) is
also very large, so that
v j (1) ≈ (T − 1) g j + v j . (5.51)
Substituting this expression for vj(1) in the test quantity produces the result
q_i^k + Σ_{j=1}^{N} p_ij^k v_j(1) = q_i^k + Σ_{j=1}^{N} p_ij^k[(T − 1)g_j + v_j] = q_i^k + (T − 1) Σ_{j=1}^{N} p_ij^k g_j + Σ_{j=1}^{N} p_ij^k v_j.   (5.52)
Because (T − 1) is very large, the term (T − 1) Σ_j p_ij^k g_j dominates the test quantity, so the IM routine first attempts to maximize
Σ_{j=1}^{N} p_ij^k g_j,   (5.53)
called the gain test quantity, using the gains of the previous policy. However,
when two or more alternatives have the same maximum value of the gain
test quantity, there is a tie, and the gain test fails. In that case the decision
must be made on the basis of relative values rather than gains. The tie is bro-
ken by choosing the alternative that maximizes
N
qik + ∑ pijk v j , (5.54)
j =1
called the value test quantity, by using the relative values of the previous
policy.
5.1.4.2.2 PI Algorithm
The detailed steps of the PI algorithm [2] for a multichain MDP are given
below:
PI Algorithm
Step 1. Initial policy
Arbitrarily choose an initial policy by selecting for each state i a decision
di = k.
Step 2. PE operation
Use p_ij and q_i for a given policy to solve the set of GSEs (4.176),
g_i = Σ_{j=1}^{N} p_ij g_j, for i = 1, 2, ..., N,   (4.176)
and the set of VDEs (4.177) for all the relative values v_i and gains g_i by executing the four-step procedure given in Section 4.2.5.3.3.
Step 3. IM routine
For each state i, find the decision k∗ that maximizes the gain test quantity
Σ_{j=1}^{N} p_ij^k g_j   (5.53)
by using the gains of the previous policy. Then k∗ becomes the new decision
in state i, so that di = k∗, qik∗ becomes qi, and pijk∗ becomes pij.
If two or more alternatives have the same maximum value of the gain test
quantity, the tie is broken by choosing the decision k∗ that maximizes the
value test quantity
q_i^k + Σ_{j=1}^{N} p_ij^k v_j,   (5.54)
by using the relative values of the previous policy. Then k∗ becomes the new
decision in state i, so that di = k ∗, qik∗ becomes qi, and pijk∗ becomes pij.
Regardless of whether the IM test is based on gains or values, if the previ-
ous decision in state i yields as high a value of the test quantity as any other
alternative, leave the previous decision unchanged to assure convergence in
the case of equivalent policies.
Step 4. Stopping rule
When the IM test has been completed for all states, a new policy has been
determined. A new P matrix and q vector have been obtained. If the new
policy is the same as the previous one, the algorithm has converged, and an
optimal policy has been found. If the new policy is different from the previ-
ous policy in at least one state, go to step 2.
1d = [1 1 1 1 1]T,
1P = [1 0 0 0 0; 0 1 0 0 0; 0.25 0.20 0.55 0 0; 0.15 0 0.20 0.65 0; 0.10 0 0 0.15 0.75],
1q = [62 100 423.54 847.09 1935.36]T.
Step 2. PE operation
Step 1 of PE:
After setting the relative value v1 = 0 for the highest numbered state in the first recurrent chain, the VDE for the first recurrent chain is
g1 + v1 = 62 + v1
g1 = 62.
After setting the relative value v2 = 0 for the highest numbered state in the
second recurrent chain, the VDE for the second recurrent chain is
g 2 + v2 = 100 + v2
g 2 = 100.
g1 = 62
g 2 = 100.
Step 2 of PE:
The GSEs for the transient states are
After algebraic simplification, the gains of the three transient states are
expressed as weighted averages of the independent gains of the two closed
classes of recurrent states
Step 3 of PE:
The VDEs for the transient states are
After substituting the gains of the transient states obtained in step 2 of PE,
and the relative values of the recurrent states obtained in step 1 of PE, the
VDEs for the transient states are
Step 4 of PE:
The VDEs for the transient states are solved to obtain the relative values of
the transient states: v3 = 765.8951, v4 = 2653.2023, and v5 = 9062.2118.
The solutions for the gain vector and the vector of relative values for the
initial policy are summarized below:
1g = [62 100 78.8872 71.6482 67.7874]T,   1v = [0 0 765.8951 2653.2023 9062.2118]T.
Step 3. IM routine
For each state i, find the decision k∗ that maximizes the gain test quantity
Σ_{j=1}^{5} p_ij^k g_j = p_i1^k g_1 + p_i2^k g_2 + (p_i3^k g_3 + p_i4^k g_4 + p_i5^k g_5).
TABLE 5.50
First IM for Model of a Flexible Production System
State i   Decision k   Gain Test Quantity Σ_{j=1}^{5} p_ij^k g_j   Value Test Quantity q_i^k + Σ_{j=1}^{5} p_ij^k v_j
1 1 1(62) = 62 ←
2 1 1(100) = 100 ←
2 1(67.7874) = 67.7874
3 1 0.25(62) + 0.20(100) + 0.55(78.8872)
= 78.8880
2 0.20(62) + 0.35(100) + 0.45(78.8872)
= 82.8992 ←
4 1 0.15(62) + 0.20(78.8872) + 0.65(71.6482) = 71.6488
2 0.20(62) + 0.30(78.8872) + 0.50(71.6482) = 71.8903 ←
5 1 67.7878 1935.36 + (7194.6392) = 9129.9992
2 67.7878 2042.15 + (7568.1537) = 9610.3037 ←
by using the gains of the previous policy. Then k∗ becomes the new decision
in state i, so that di = k∗, qik∗ becomes qi, and pijk∗ becomes pij. The first IM rou-
tine is executed in Table 5.50.
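The tie in state 5 of Table 5.50 shows why the value test is needed. A small sketch, with variable names of our own choosing, of the gain test with a value-test tie-break for state 5, using the gains 1g and relative values 1v of the initial policy and the state-5 data of Table 5.49:

```python
# Gain test with a value-test tie-break for state 5 of the flexible production system.
import numpy as np

g = np.array([62, 100, 78.8872, 71.6482, 67.7874])
v = np.array([0, 0, 765.8951, 2653.2023, 9062.2118])
P5 = {1: [0.10, 0, 0, 0.15, 0.75], 2: [0.08, 0, 0, 0.12, 0.80]}
q5 = {1: 1935.36, 2: 2042.15}

gain_test = {k: float(np.dot(P5[k], g)) for k in P5}
best = max(gain_test.values())
tied = [k for k in P5 if np.isclose(gain_test[k], best, atol=1e-3)]
if len(tied) > 1:                              # both gain tests are 67.79: tie
    choice = max(tied, key=lambda k: q5[k] + float(np.dot(P5[k], v)))
else:
    choice = tied[0]
print(gain_test, choice)                       # the value test selects decision 2
```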
2d = [1 1 2 2 2]T,
2P = [1 0 0 0 0; 0 1 0 0 0; 0.20 0.35 0.45 0 0; 0.20 0 0.30 0.50 0; 0.08 0 0 0.12 0.80],
2q = [62 100 468.62 796.21 2042.15]T.   (5.55)
Step 2. PE operation
Use pij and qi for the second policy, 2d = [1 1 2 2 2]T, to solve the
set of GSEs (4.176) and the set of VDEs (4.177) for all the relative values
vi and gains gi, by setting the value of one vi in each closed class of recur-
rent states equal to zero.
Step 1 of PE:
After setting the relative values v1 = v2 = 0 in the VDEs for the highest num-
bered states in the two recurrent closed classes, the independent gains of the
two recurrent states are once again
g 1 = 62
g 2 = 100.
Step 2 of PE:
The GSEs for the transient states are
After algebraic simplification, the gains of the three transient states expressed
as weighted averages of the independent gains of the two closed classes of
recurrent states are
Step 3 of PE:
The VDEs for the transient states are
After substituting the gains of the transient states obtained in step 2 of PE,
and the relative values of the recurrent states obtained in step 1 of PE, the
VDEs for the transient states are
Step 4 of PE:
The VDEs for the transient states are solved to obtain the relative values of
the transient states: v3 = 695.3422, v4 = 1856.6071, and v5 = 10971.187.
The solutions for the gain vector and the vector of relative values for the
second policy are summarized below:
2g = [62 100 86.1818 76.5091 70.7055]T,   2v = [0 0 695.3422 1856.6071 10971.187]T.   (5.56)
Step 3. IM routine
For each state i, find the decision k∗ that maximizes the gain test quantity
Σ_{j=1}^{5} p_ij^k g_j = p_i1^k g_1 + p_i2^k g_2 + (p_i3^k g_3 + p_i4^k g_4 + p_i5^k g_5)
by using the gains of the previous policy. Then k∗ becomes the new decision
in state i, so that di = k∗, qik∗ becomes qi, and pijk∗ becomes pij. The second IM
routine is executed in Table 5.51.
Step 4. Stopping rule
Stop because the new policy, given by the vector 3d ≡ 2d = [1 1 2 2 2]T, is
identical to the previous policy. Therefore, this policy is optimal. The firm
TABLE 5.51
Second IM for Model of a Flexible Production System
State i   Decision k   Gain Test Quantity Σ_{j=1}^{5} p_ij^k g_j   Value Test Quantity q_i^k + Σ_{j=1}^{5} p_ij^k v_j
1 1 1(62) = 62 ←
2 1 1(100) = 100 ←
2 1(70.7055) = 70.7055
3 1 0.25(62) + 0.20(100) + 0.55(86.1818)
= 82.9000
2 0.20(62) + 0.35(100) + 0.45(86.1818)
= 86.1818 ←
4 1 0.15(62) + 0.20(86.1818) + 0.65(76.5091) = 76.2673
2 0.20(62) + 0.30(86.1818) + 0.50(76.5091) = 76.5091 ←
5 1 70.7056 1935.36 + (8506.8813) = 10442.241
2 70.7056 2042.15 + (8999.7425) = 11041.892 ←
Maximize z = Σ_{i=1}^{N} Σ_{k=1}^{K_i} q_i^k y_i^k
subject to
Σ_{k=1}^{K_j} y_j^k − Σ_{i=1}^{N} Σ_{k=1}^{K_i} y_i^k p_ij^k = 0, for j = 1, 2, ..., N,
Σ_{k=1}^{K_j} y_j^k + Σ_{k=1}^{K_j} x_j^k − Σ_{i=1}^{N} Σ_{k=1}^{K_i} x_i^k p_ij^k = b_j, for j = 1, 2, ..., N,
where each constant b_j > 0 and Σ_{j=1}^{N} b_j = 1, and y_i^k ≥ 0, x_i^k ≥ 0, for i = 1, 2, ..., N, and k = 1, 2, 3, ..., K_i. At least one of the equations in the first set of constraints is redundant.
The optimal stationary policy generated by the solution of the LP for a multichain MDP is deterministic. A deterministic policy is one for which
P(decision = k | state = i) = y_i^k / Σ_{k=1}^{K_i} y_i^k, for states i in which Σ_{k=1}^{K_i} y_i^k > 0,
P(decision = k | state = i) = x_i^k / Σ_{k=1}^{K_i} x_i^k, for states i in which Σ_{k=1}^{K_i} x_i^k > 0.
In the final simplex tableau, under a deterministic policy, the dual variable
associated with each equation in the second set of constraints is equal to
the gain in the corresponding state. That is, for an N-state MDP, under a
deterministic policy, the dual variable associated with constraint equation i,
for i > N, is equal to the gain in state (i − N).
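A sketch of this LP for the flexible production system of Table 5.49, with b_j = 0.2 for every state, is given below; it is not part of the text. The construction of the two constraint sets follows the formulation above, and the redundant equation in the first set is simply left in place, which the HiGHS solver used by scipy tolerates.

```python
# The multichain LP (y and x variables) for the flexible production system.
import numpy as np
from scipy.optimize import linprog

data = {  # (state i, decision k): (transition row p_i.^k, reward q_i^k)
    (1, 1): ([1, 0, 0, 0, 0], 62),
    (2, 1): ([0, 1, 0, 0, 0], 100),              (2, 2): ([0, 0, 0, 0, 1], 22.58),
    (3, 1): ([0.25, 0.20, 0.55, 0, 0], 423.54),  (3, 2): ([0.20, 0.35, 0.45, 0, 0], 468.62),
    (4, 1): ([0.15, 0, 0.20, 0.65, 0], 847.09),  (4, 2): ([0.20, 0, 0.30, 0.50, 0], 796.21),
    (5, 1): ([0.10, 0, 0, 0.15, 0.75], 1935.36), (5, 2): ([0.08, 0, 0, 0.12, 0.80], 2042.15),
}
keys = list(data)                                # 9 (i, k) pairs; y variables first, then x
ny = len(keys)
c = np.zeros(2 * ny)
c[:ny] = [-data[ik][1] for ik in keys]           # maximize sum of q_i^k y_i^k

A, b = np.zeros((10, 2 * ny)), np.zeros(10)
for j in range(1, 6):
    for col, (i, k) in enumerate(keys):
        p_ij = data[(i, k)][0][j - 1]
        A[j - 1, col] = (1.0 if i == j else 0.0) - p_ij        # first set
        A[4 + j, col] = 1.0 if i == j else 0.0                 # second set, y part
        A[4 + j, ny + col] = (1.0 if i == j else 0.0) - p_ij   # second set, x part
    b[4 + j] = 0.2

res = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * (2 * ny))
print(round(-res.fun, 4))                        # 79.0793, as in Table 5.53
print({keys[t]: round(res.x[t], 4) for t in range(ny) if res.x[t] > 1e-6})
```

The optimal objective, about 79.0793, and the positive variables y_1^1 and y_2^1 should agree with Table 5.53.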
TABLE 5.52
Data for Multichain Model of a Flexible Production System
Transition Probability
State Decision Reward LP Variable
i Operation k p ik1 p ik2 p ik3 p ik4 p ik5 q ik y ik x ik
Objective Function
The objective function for the LP is
Maximize z = Σ_{k=1}^{1} q_1^k y_1^k + Σ_{i=2}^{5} Σ_{k=1}^{2} q_i^k y_i^k
= Σ_{k=1}^{1} q_1^k y_1^k + Σ_{k=1}^{2} q_2^k y_2^k + Σ_{k=1}^{2} q_3^k y_3^k + Σ_{k=1}^{2} q_4^k y_4^k + Σ_{k=1}^{2} q_5^k y_5^k
= q_1^1 y_1^1 + (q_2^1 y_2^1 + q_2^2 y_2^2) + (q_3^1 y_3^1 + q_3^2 y_3^2) + (q_4^1 y_4^1 + q_4^2 y_4^2) + (q_5^1 y_5^1 + q_5^2 y_5^2)
State 1 has K_1 = 1 decision. In the first set of constraints, the equation associated with a transition to state j = 1 is
Σ_{k=1}^{1} y_1^k − (Σ_{k=1}^{1} y_1^k p_11^k + Σ_{k=1}^{2} y_2^k p_21^k + Σ_{k=1}^{2} y_3^k p_31^k + Σ_{k=1}^{2} y_4^k p_41^k + Σ_{k=1}^{2} y_5^k p_51^k) = 0
State 2 has K 2 = 2 possible decisions. In the first set of constraints, the equa-
tion associated with a transition to state j = 2 is
Σ_{k=1}^{2} y_2^k − (Σ_{k=1}^{1} y_1^k p_12^k + Σ_{i=2}^{5} Σ_{k=1}^{2} y_i^k p_i2^k) = 0
(y_2^1 + y_2^2) − (Σ_{k=1}^{1} y_1^k p_12^k + Σ_{k=1}^{2} y_2^k p_22^k + Σ_{k=1}^{2} y_3^k p_32^k + Σ_{k=1}^{2} y_4^k p_42^k + Σ_{k=1}^{2} y_5^k p_52^k) = 0
(y_2^1 + y_2^2) − (y_1^1 p_12^1) − (y_2^1 p_22^1 + y_2^2 p_22^2) − (y_3^1 p_32^1 + y_3^2 p_32^2) − (y_4^1 p_42^1 + y_4^2 p_42^2) − (y_5^1 p_52^1 + y_5^2 p_52^2) = 0
State 3 has K3 = 2 possible decisions. In the first set of constraints, the equa-
tion associated with a transition to state j = 3 is
Σ_{k=1}^{2} y_3^k − (Σ_{k=1}^{1} y_1^k p_13^k + Σ_{i=2}^{5} Σ_{k=1}^{2} y_i^k p_i3^k) = 0
(y_3^1 + y_3^2) − (Σ_{k=1}^{1} y_1^k p_13^k + Σ_{k=1}^{2} y_2^k p_23^k + Σ_{k=1}^{2} y_3^k p_33^k + Σ_{k=1}^{2} y_4^k p_43^k + Σ_{k=1}^{2} y_5^k p_53^k) = 0
(y_3^1 + y_3^2) − (y_1^1 p_13^1) − (y_2^1 p_23^1 + y_2^2 p_23^2) − (y_3^1 p_33^1 + y_3^2 p_33^2) − (y_4^1 p_43^1 + y_4^2 p_43^2) − (y_5^1 p_53^1 + y_5^2 p_53^2) = 0
State 4 has K4 = 2 possible decisions. In the first set of constraints, the equa-
tion associated with a transition to state j = 4 is
Σ_{k=1}^{2} y_4^k − (Σ_{k=1}^{1} y_1^k p_14^k + Σ_{i=2}^{5} Σ_{k=1}^{2} y_i^k p_i4^k) = 0
(y_4^1 + y_4^2) − (Σ_{k=1}^{1} y_1^k p_14^k + Σ_{k=1}^{2} y_2^k p_24^k + Σ_{k=1}^{2} y_3^k p_34^k + Σ_{k=1}^{2} y_4^k p_44^k + Σ_{k=1}^{2} y_5^k p_54^k) = 0
(y_4^1 + y_4^2) − (y_1^1 p_14^1) − (y_2^1 p_24^1 + y_2^2 p_24^2) − (y_3^1 p_34^1 + y_3^2 p_34^2) − (y_4^1 p_44^1 + y_4^2 p_44^2) − (y_5^1 p_54^1 + y_5^2 p_54^2) = 0
(y_4^1 + y_4^2) − (0y_1^1) − (0y_2^1 + 0y_2^2) − (0y_3^1 + 0y_3^2) − (0.65y_4^1 + 0.50y_4^2) − (0.15y_5^1 + 0.12y_5^2) = 0.
State 5 has K5 = 2 possible decisions. In the first set of constraints, the equa-
tion associated with a transition to state j = 5 is
Σ_{k=1}^{2} y_5^k − (Σ_{k=1}^{1} y_1^k p_15^k + Σ_{i=2}^{5} Σ_{k=1}^{2} y_i^k p_i5^k) = 0
(y_5^1 + y_5^2) − (Σ_{k=1}^{1} y_1^k p_15^k + Σ_{k=1}^{2} y_2^k p_25^k + Σ_{k=1}^{2} y_3^k p_35^k + Σ_{k=1}^{2} y_4^k p_45^k + Σ_{k=1}^{2} y_5^k p_55^k) = 0
(y_5^1 + y_5^2) − (y_1^1 p_15^1) − (y_2^1 p_25^1 + y_2^2 p_25^2) − (y_3^1 p_35^1 + y_3^2 p_35^2) − (y_4^1 p_45^1 + y_4^2 p_45^2) − (y_5^1 p_55^1 + y_5^2 p_55^2) = 0
(y_5^1 + y_5^2) − (0y_1^1) − (0y_2^1 + 1y_2^2) − (0y_3^1 + 0y_3^2) − (0y_4^1 + 0y_4^2) − (0.75y_5^1 + 0.80y_5^2) = 0.
Form the second set of constraints in terms of the variable xik by setting the
constants b1 = b2 = b3 = b4 = b5 = 0.2.
State 1 has K1 = 1 decision. In the second set of constraints, the equation
associated with a transition to state j = 1 is
Σ_{k=1}^{1} y_1^k + Σ_{k=1}^{1} x_1^k − (Σ_{k=1}^{1} x_1^k p_11^k + Σ_{i=2}^{5} Σ_{k=1}^{2} x_i^k p_i1^k) = b_1 = 0.2
y_1^1 + x_1^1 − (Σ_{k=1}^{1} x_1^k p_11^k + Σ_{k=1}^{2} x_2^k p_21^k + Σ_{k=1}^{2} x_3^k p_31^k + Σ_{k=1}^{2} x_4^k p_41^k + Σ_{k=1}^{2} x_5^k p_51^k) = 0.2
Σ_{k=1}^{2} y_2^k + Σ_{k=1}^{2} x_2^k − (Σ_{k=1}^{1} x_1^k p_12^k + Σ_{i=2}^{5} Σ_{k=1}^{2} x_i^k p_i2^k) = b_2 = 0.2
(y_2^1 + y_2^2) + (x_2^1 + x_2^2) − (Σ_{k=1}^{1} x_1^k p_12^k + Σ_{k=1}^{2} x_2^k p_22^k + Σ_{k=1}^{2} x_3^k p_32^k + Σ_{k=1}^{2} x_4^k p_42^k + Σ_{k=1}^{2} x_5^k p_52^k) = 0.2
Σ_{k=1}^{2} y_3^k + Σ_{k=1}^{2} x_3^k − (Σ_{k=1}^{1} x_1^k p_13^k + Σ_{i=2}^{5} Σ_{k=1}^{2} x_i^k p_i3^k) = b_3 = 0.2
(y_3^1 + y_3^2) + (x_3^1 + x_3^2) − (Σ_{k=1}^{1} x_1^k p_13^k + Σ_{k=1}^{2} x_2^k p_23^k + Σ_{k=1}^{2} x_3^k p_33^k + Σ_{k=1}^{2} x_4^k p_43^k + Σ_{k=1}^{2} x_5^k p_53^k) = 0.2
(y_4^1 + y_4^2) + (x_4^1 + x_4^2) − (Σ_{k=1}^{1} x_1^k p_14^k + Σ_{k=1}^{2} x_2^k p_24^k + Σ_{k=1}^{2} x_3^k p_34^k + Σ_{k=1}^{2} x_4^k p_44^k + Σ_{k=1}^{2} x_5^k p_54^k) = 0.2
(y_5^1 + y_5^2) + (x_5^1 + x_5^2) − (Σ_{k=1}^{1} x_1^k p_15^k + Σ_{k=1}^{2} x_2^k p_25^k + Σ_{k=1}^{2} x_3^k p_35^k + Σ_{k=1}^{2} x_4^k p_45^k + Σ_{k=1}^{2} x_5^k p_55^k) = 0.2
In addition, all of the decision variables must be nonnegative: y_1^1 ≥ 0, y_2^1 ≥ 0, y_2^2 ≥ 0, y_3^1 ≥ 0, y_3^2 ≥ 0, y_4^1 ≥ 0, y_4^2 ≥ 0, y_5^1 ≥ 0, y_5^2 ≥ 0, and x_i^k ≥ 0 for every state i and decision k.
TABLE 5.53
LP Solution of Multichain MDP Model of a Flexible Production System
Objective Function Value = 79.0793
i   k   y_i^k    x_i^k        Row             Dual Variable
1   1   0.5505   0            Constraint 6    62 = g1
2   1   0.4495   0            Constraint 7    100 = g2
    2   0        0            Constraint 8    86.1818 = g3
3   1   0        0            Constraint 9    76.5091 = g4
    2   0        0.7127       Constraint 10   70.7055 = g5
4   1   0        0
    2   0        0.6400
5   1   0        0
    2   0        1
TABLE 5.54
Decisions Feasible in the Different States
State Description Feasible Decision
1 Not Working (NW) Overhaul (OV)
Repair (RP)
Replace (RL)
2 Working, with a Major Defect (WM) Do Nothing (DN)
Repair (RP)
3 Working, with a Minor Defect (Wm) Do Nothing (DN)
Overhaul (OV)
4 Working Properly (WP) Do Nothing (DN)
Overhaul (OV)
TABLE 5.55
Daily Rewards for the Decisions That Are Feasible in the Different States
State\Decision Do Nothing Overhaul Repair Replace
Not Working Not Feasible –$300 –$700 –$1,200
Working, with $200 Not Feasible –$700 Not Feasible
Major Defect
Working, with $500 –$300 Not Feasible Not Feasible
Minor Defect
Working Properly $1,000 –$300 Not Feasible Not Feasible
TABLE 5.56
The 24 Stationary Policies for Machine Maintenance
Policy State 1 State 2 State 3 State 4 Structure
1d OV DN OV DN U
2d OV DN OV OV M
3d OV RP OV DN I
4d OV RP OV OV U
5d RP DN OV DN I
6d RP DN OV OV U
7d RP RP OV DN I
8d RP RP OV OV U
9d RL DN OV DN I
10d RL DN OV OV U
11d RL RP OV DN I
12d RL RP OV OV U
13d OV DN DN DN U
14d OV DN DN OV M
15d OV RP DN DN I
16d OV RP DN OV U
17d RP DN DN DN U
18d RP DN DN OV M
19d RP RP DN DN I
20d RP RP DN OV U
21d RL DN DN DN I
22d RL DN DN OV U
23d RL RP DN DN I
24d RL RP DN OV U
Recall that under a stationary policy the same decision is made each time that
the process returns to a particular state. With three feasible decisions in state 1,
and two decisions in states 2, 3, and 4, the number of stationary policies is
equal to (3)(2)(2)(2) = 24. The 24 stationary policies, labeled 1d through 24d, are
enumerated in Table 5.56. The right most column indicates whether the associ-
ated Markov chain is irreducible (I), unichain (U), or multichain (M).
TABLE 5.57
Data for Multichain Model of Machine Maintenance
Transition Probability
State i   Decision k   p_i1^k   p_i2^k   p_i3^k   p_i4^k   Reward q_i^k   LP Variables y_i^k, x_i^k
1 = NW 2 = OV 0.2 0.8 0 0 –$300 y12 x12
3 = RP 0.3 0 0.7 0 –$700 y13 x13
4 = RL 0 0 0 1 –$1,200 y14 x14
2 = WM 1 = DN 0.6 0.4 0 0 $200 y21 x21
3 = RP 0 0.3 0 0.7 –$700 y23 x23
3 = Wm 1 = DN 0.2 0.3 0.5 0 $500 y31 x31
2 = OV 0 0 0.2 0.8 –$300 y32 x32
4 = WP 1 = DN 0.3 0.2 0.1 0.4 $1,000 y41 x41
2 = OV 0 0 0 1 –$300 y42 x42
Policies 2d, 14d, and 18d produce recurrent multichains, each with two recur-
rent chains and no transient states. Table 5.57 contains the data for the four-
state multichain MDP model of machine maintenance. The variables for an
LP formulation appear in the two right-most columns.
State 1 has K_1 = 3 decisions. In the first set of constraints, the equation associated with a transition to state j = 1 is
(y_1^2 + y_1^3 + y_1^4) − (0.2y_1^2 + 0.3y_1^3 + 0y_1^4) − (0.6y_2^1 + 0y_2^3) − (0.2y_3^1 + 0y_3^2) − (0.3y_4^1 + 0y_4^2) = 0.
State 2 has K 2 = 2 decisions. In the first set of constraints, the equation asso-
ciated with a transition to state j = 2 is
(y_2^1 + y_2^3) − (y_1^2 p_12^2 + y_1^3 p_12^3 + y_1^4 p_12^4) − (y_2^1 p_22^1 + y_2^3 p_22^3) − (y_3^1 p_32^1 + y_3^2 p_32^2) − (y_4^1 p_42^1 + y_4^2 p_42^2) = 0
State 3 has K3 = 2 decisions. In the first set of constraints, the equation asso-
ciated with a transition to state j = 3 is
(y_3^1 + y_3^2) − (y_1^2 p_13^2 + y_1^3 p_13^3 + y_1^4 p_13^4) − (y_2^1 p_23^1 + y_2^3 p_23^3) − (y_3^1 p_33^1 + y_3^2 p_33^2) − (y_4^1 p_43^1 + y_4^2 p_43^2) = 0
State 4 has K4 = 2 decisions. In the first set of constraints, the equation asso-
ciated with a transition to state j = 4 is
(y_4^1 + y_4^2) − (y_1^2 p_14^2 + y_1^3 p_14^3 + y_1^4 p_14^4) − (y_2^1 p_24^1 + y_2^3 p_24^3) − (y_3^1 p_34^1 + y_3^2 p_34^2) − (y_4^1 p_44^1 + y_4^2 p_44^2) = 0
(y_1^2 + y_1^3 + y_1^4) + (x_1^2 + x_1^3 + x_1^4) − (0.2x_1^2 + 0.3x_1^3 + 0x_1^4) − (0.6x_2^1 + 0x_2^3) − (0.2x_3^1 + 0x_3^2) − (0.3x_4^1 + 0x_4^2) = 0.25.
(y_2^1 + y_2^3) + (x_2^1 + x_2^3) − (0.8x_1^2 + 0x_1^3 + 0x_1^4) − (0.4x_2^1 + 0.3x_2^3) − (0.3x_3^1 + 0x_3^2) − (0x_4^1 + 0.2x_4^2) = 0.25.
TABLE 5.58
LP Solution of Multichain Model of Machine Maintenance
Objective Function Value = 45.1613
i   k   y_i^k    x_i^k        Constraint   Dual Variable
1   2   0        0            5            45.1613 = g1 = g
    3   0.3226   0.2285       6            45.1613 = g2 = g
    4   0        0            7            45.1613 = g3 = g
2   1   0.2258   0.1792       8            45.1613 = g4 = g
    3   0        0
3   1   0.4516   0
    2   0        0
4   1   0        0.4167
    2   0        0
TABLE 5.59
Unichain MCR Corresponding to Optimal Policy for Machine Maintenance
v_i(n) = max_k {q_i^k + α Σ_{j=1}^{N} p_ij^k v_j(n + 1)},   (5.57)
for n = 0, 1, ..., T − 1, and i = 1, 2, ..., N, where v_i(T) is specified for all states i.
For the four-state discounted MDP, the value iteration equations are
v_i(6) = max_k {q_i^k + α[p_i1^k v_1(7) + p_i2^k v_2(7) + p_i3^k v_3(7) + p_i4^k v_4(7)]}, for i = 1, 2, 3, 4
= max_k {q_i^k + 0.9[p_i1^k(0) + p_i2^k(0) + p_i3^k(0) + p_i4^k(0)]}, for i = 1, 2, 3, 4
= max_k [q_i^k], for i = 1, 2, 3, 4.   (5.61)
At the end of month 6, the optimal decision is to select the alternative which
will maximize the expected total discounted reward in every state. Thus, the
decision vector at the end of month 6 is d(6) = [3 2 2 1]T.
The calculations for month 5, denoted by n = 5, are indicated in Table 5.61.
v_i(5) = max_k {q_i^k + α[p_i1^k v_1(6) + p_i2^k v_2(6) + p_i3^k v_3(6) + p_i4^k v_4(6)]}, for i = 1, 2, 3, 4
= max_k {q_i^k + 0.9[p_i1^k(−20) + p_i2^k(10) + p_i3^k(−5) + p_i4^k(35)]}, for i = 1, 2, 3, 4.   (5.62)
TABLE 5.60
Value Iteration for n = 6
i   q_i^1 (k = 1)   q_i^2 (k = 2)   q_i^3 (k = 3)   v_i(6) = max_k [q_i^1, q_i^2, q_i^3]        Decision k
1   −30             −25             −20             v_1(6) = max[−30, −25, −20] = −20           3
2   5               10              —               v_2(6) = max[5, 10] = 10                    2
3   −10             −5              —               v_3(6) = max[−10, −5] = −5                  2
4   35              25              —               v_4(6) = max[35, 25] = 35                   1
TABLE 5.61
Value Iteration for n = 5
At the end of month 5, the optimal decision is to select the second alterna-
tive in states 1, 2, and 3, and the first alternative in state 4. Thus, the decision
vector at the end of month 5 is d(5) = [2 2 2 1]T. The calculations for month 4,
denoted by n = 4, are indicated in Table 5.62.
TABLE 5.62
Value Iteration for n = 4
vi (4) = max {qik + α [ pik1v1 (5) + pik2 v2 (5) + pik3 v3 (5) + pik4 v4 (5)]} for i = 1, 2, 3, 4
k
= max {qik + 0.9[ pik1 (−24.1) + pik2 (8.65) + pik3 (0.4) + pik4 (45.125)] }
k
for i = 1, 2, 3, 4
(5.63)
At the end of month 4, the optimal decision is to select the second alternative in
every state. Thus, the decision vector at the end of month 4 is d(4) = [2 2 2 2]T.
The calculations for month 3, denoted by n = 3, are indicated in Table 5.63.
vi (3) = max {qik + α [ pik1v1 (4) + pik2 v2 (4) + pik3 v3 (4) + pik4 v4 (4)]} for i = 1, 2, 3, 4
k
= max {qik + 0.9[ pik1 ( −22.1155) + pik2 (8.7276) + pik3 (4.1643) + pik4 (50.2540)] }
k
for i = 1, 2, 3, 4.
(5.64)
TABLE 5.63
Value Iteration for n = 3
1 1 − 30 + 0.9[0.15(−22.1155) + 0.40(8.7276)
+ 0.35(4.1643) + 0.10(50.2540)] = −24.0090
1 2 −25 + 0.9[0.45(−22.1155) + 0.05(8.7276) max [ − 24.0090, d1(3) = 2
+ 0.20(4.1643) + 0.30(50.2540)] −19.2459, −29.2111]
= −19.2459 ← = −19.2459 = v1 (3)
1 3 −20 + 0.9[0.60(−22.1155) + 0.30(8.7276)
+ 0.10(4.1643) + 0(50.2540)] = −29.2111
2 1 5 + 0.9[0.25(−22.1155) + 0.30(8.7276)
+ 0.35(4.1643) + 0.10(50.2540)] = 8.2151
2 2 10 + 0.9[0.30(−22.1155) + 0.40(8.7276) max [8.2151, d2(3) = 2
+ 0.25(4.1643) + 0.05(50.2540)] = 10.3691 ← 10.3691] = 10.3691 = v2 (3)
4 1 35 + 0.9[0.05(−22.1155) + 0.20(8.7276)
+ 0.40(4.1643) + 0.35(50.2540)] = 52.9049
4 2 25 + 0.9[0(−22.1155) + 0.10(8.7276) max [52.9049, d4(3) = 2
+ 0.30(4.1643) + 0.60(50.2540)] = 54.0470 ← 54.0470] = 54.0470
= v4 (3)
At the end of month 3, the optimal decision is to select the second alternative in
every state. Thus, the decision vector at the end of month 3 is d(3) = [2 2 2 2]T.
The calculations for month 2, denoted by n = 2, are indicated in Table 5.64.
vi (2) = max {qik + α [ pik1v1 (3) + pik2 v2 (3) + pik3 v3 (3) + pik4 v4 (3)]} for i = 1, 2, 3, 4
k
= max {qik + 0.9[ pik1 ( −19.2459) + pik2 (10.3691) + pik3 (6.8882) + pik4 (54.0470)] }
k
for i = 1, 2, 3, 4.
(5.65)
At the end of month 2, the optimal decision is to select the second alternative
in every state. Thus, the policy at the end of month 2 is d(2) = [2 2 2 2]T.
The calculations for month 1, denoted by n = 1, are indicated in Table 5.65.
TABLE 5.64
Value Iteration for n = 2
TABLE 5.65
Value Iteration for n = 1
vi (1) = max {qik + α [ pik1v1 (2) + pik2 v2 (2) + pik3 v3 (2) + pik4 v4 (2)]} for i = 1, 2, 3, 4
k
= max {qik + 0.9[ pik1 (−16.4954) + pik2 (12.5184) + pik3 (9.2951) + pik4 (56.9784]}
k
for i = 1, 2, 3, 4.
(5.66)
At the end of month 1, the optimal decision is to select the second alternative in
every state. Thus, the decision vector at the end of month 1 is d(1) = [2 2 2 2]T.
The calculations for month 0, denoted by n = 0, are indicated in Table 5.66.
vi (0) = max {qik + α [ pik1v1 (1) + pik2 v2 (1) + pik3 v3 (1) + pik4 v4 (1)]} for i = 1, 2, 3, 4
k
max {qik + 0.9[ pik1 ( −14.0600) + pik2 (14.7083) + pik3 (11.5133) + pik4 (59.4047)]}
k
for i = 1, 2, 3, 4.
(5.67)
At the end of month 0, which is the beginning of month 1, the optimal deci-
sion is to select the second alternative in every state. Thus, the decision vector
at the beginning of month 1 is d(0) = [2 2 2 2]T.
The results of these calculations for the expected total discounted rewards
and the optimal decisions at the end of each month of the seven month plan-
ning horizon are summarized in Table 5.67.
If the process starts in state 4, the expected total discounted reward is
61.5109, the highest for any state. On the other hand, the lowest expected
total discounted reward is −11.9208, which is the negative of an expected
total discounted cost of 11.9208, if the process starts in state 1.
TABLE 5.66
Value Iteration for n = 0
TABLE 5.67
Expected Total Discounted Rewards and Optimal Decisions for a Planning Horizon
of 7 Months
n
Epoch 0 1 2 3 4 5 6 7
v1(n) −11.9208 −14.0600 −16.4954 −19.2459 −22.1155 −24.10 −20 0
v2(n) 16.7625 14.7083 12.5184 10.3691 8.7276 8.65 10 0
v3(n) 13.5505 11.5133 9.2951 6.8882 4.1643 0.40 −5 0
v4(n) 61.5109 59.4047 56.9784 54.0470 50.254 45.125 35 0
d1(n) 2 2 2 2 2 2 3 −
d2(n) 2 2 2 2 2 2 2 −
d3(n) 2 2 2 2 2 2 2 −
d4(n) 2 2 2 2 2 1 1 −
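The columns of Table 5.67 can be regenerated with a short backward recursion. A sketch, not part of the text, using the monthly-sales data that appears in the value iteration tables above (α = 0.9, zero terminal values):

```python
# Backward recursion (5.57) over the 7-month horizon for the monthly-sales MDP.
import numpy as np

alpha = 0.9
P = {0: [[0.15, 0.40, 0.35, 0.10], [0.45, 0.05, 0.20, 0.30], [0.60, 0.30, 0.10, 0.0]],
     1: [[0.25, 0.30, 0.35, 0.10], [0.30, 0.40, 0.25, 0.05]],
     2: [[0.05, 0.65, 0.25, 0.05], [0.05, 0.25, 0.50, 0.20]],
     3: [[0.05, 0.20, 0.40, 0.35], [0.00, 0.10, 0.30, 0.60]]}
q = {0: [-30, -25, -20], 1: [5, 10], 2: [-10, -5], 3: [35, 25]}

v = np.zeros(4)                                # v_i(7) = 0 for every state
for n in range(6, -1, -1):                     # n = 6, 5, ..., 0
    tests = [[q[i][k] + alpha * np.dot(P[i][k], v) for k in range(len(q[i]))]
             for i in range(4)]
    v = np.array([max(t) for t in tests])
    d = [int(np.argmax(t)) + 1 for t in tests]
    print(n, np.round(v, 4), d)
# n = 6 gives (-20, 10, -5, 35) with d = [3, 2, 2, 1]; n = 0 gives
# (-11.9208, 16.7625, 13.5505, 61.5109) with d = [2, 2, 2, 2], as in Table 5.67.
```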
vi(0), for i = 1, 2, ..., N. For simplicity, set vi(0) = 0. Specify ε > 0. Set n = −1.
Step 2. For each state i, use the value iteration equation to compute
v_i(n) = max_k {q_i^k + α Σ_{j=1}^{N} p_ij^k v_j(n + 1)}, for i = 1, 2, ..., N.
TABLE 5.68
Expected Total Discounted Rewards and Optimal Decisions during the Last
7 Months of an Infinite Planning Horizon
n
Epoch −7 −6 −5 −4 −3 −2 −1 −0
v1(n) −11.9208 −14.0600 −16.4954 −19.2459 −22.1155 −24.10 −20 0
v2(n) 16.7625 14.7083 12.5184 10.3691 8.7276 8.65 10 0
v3(n) 13.5505 11.5133 9.2951 6.8882 4.1643 0.40 −5 0
v4(n) 61.5109 59.4047 56.9784 54.0470 50.254 45.125 35 0
d1(n) 2 2 2 2 2 2 3 −
d2(n) 2 2 2 2 2 2 2 −
d3(n) 2 2 2 2 2 2 2 −
d4(n) 2 2 2 2 2 1 1 −
TABLE 5.69
Absolute Values of the Differences between the Expected Total Discounted
Rewards Earned Over Planning Horizons, Which Differ in Length by One Period
n
−7 −6 −5 −4 −3 −2 −1
v i ( −7 ) v i ( −6) v i ( −5) v i ( −4 ) v i ( −3) v i ( −2) v i ( −1)
Epoch −v i ( −6) −v i ( −5) −v i ( −4 ) −v i ( −3) −v i ( −2) −v i ( −1) −v i (0)
i
1 2.1392U 2.4354U 2.7505 2.8696 1.9845 4.1 20
2 2.0542 2.1899 2.1493 1.6415 0.0776 1.35 5
3 2.0372 2.2181 2.4069 2.7239 3.7643 5.4 5
4 2.1062 2.4263 2.9314U 3.7930U 5.129U 10.125U 35U
Max |v_i(n) − v_i(n + 1)|   2.1392   2.4354   2.9314   3.7930   5.129   10.125   35
In Table 5.69, a suffix U identifies the maximum absolute difference for each
epoch. The maximum absolute differences obtained for all seven epochs are
listed in the bottom row of Table 5.69.
Table 5.68 shows that under the optimal policy d(−7) = [2 2 2 2]T, the approximate expected total discounted rewards are v_1(−7) = −11.9208, v_2(−7) = 16.7625, v_3(−7) = 13.5505, and v_4(−7) = 61.5109.
For a discounted MDP over an infinite planning horizon, the PI algorithm has two main steps: the VD operation and the IM routine.
The algorithm begins by arbitrarily choosing an initial policy. During the
VD operation, the VDEs corresponding to the current policy are solved for
the expected total discounted rewards received in all states. The IM routine
attempts to find a better policy. If a better policy is found, the VD operation is
repeated using the new policy to identify a new system of associated VDEs.
The algorithm stops when two successive iterations lead to identical policies.
q_i^k + α Σ_{j=1}^{N} p_ij^k v_j(n + 1)   (5.68)
q_i^k + α Σ_{j=1}^{N} p_ij^k v_j(1)   (5.69)
over all decisions in state i. Recall from Section 4.3.3.1 that when T, the length of the planning horizon, is very large, (T − 1) is also very large.
Substituting v_j in Equation (4.62) for v_j(1) in the test quantity produces the result
q_i^k + α Σ_{j=1}^{N} p_ij^k v_j(1) = q_i^k + α Σ_{j=1}^{N} p_ij^k v_j   (5.70)
5.2.2.2.2 PI Algorithm
The detailed steps of the PI algorithm are given below.
Step 1. Initial policy
Arbitrarily choose an initial policy by selecting for each state i a decision
di = k.
Step 2. VD operation
Use p_ij and q_i for a given policy to solve the VDEs (4.260),
v_i = q_i + α Σ_{j=1}^{N} p_ij v_j, i = 1, 2, ..., N,
for all the expected total discounted rewards v_i.
Step 3. IM routine
For each state i, find the decision k* that maximizes the test quantity
q_i^k + α Σ_{j=1}^{N} p_ij^k v_j
using the expected total discounted rewards v_i of the previous policy. Then k* becomes the new decision in state i, so that d_i = k*, q_i^{k*} becomes q_i, and p_ij^{k*} becomes p_ij.
Step 4. Stopping rule
When the policies on two successive iterations are identical, the algorithm
stops because an optimal policy has been found. Leave the old di unchanged
if the test quantity for that di is equal to that of any other alternative in the
new policy determination. If the new policy is different from the previous
policy in at least one state, go to step 2.
Howard proved that, for each policy, the expected total discounted rewards
received in every state will be greater than or equal to their respective values
for the previous policy. The algorithm will terminate after a finite number
of iterations.
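A sketch of the discounted PI algorithm, not part of the text, applied to the monthly-sales example that follows (α = 0.9, initial policy 23d = [3 2 2 1]T). Ties, which do not arise with these data, should be broken in favor of the previous decision, as the stopping rule requires.

```python
# Discounted policy iteration for the monthly-sales MDP with alpha = 0.9.
import numpy as np

alpha = 0.9
P = {0: [[0.15, 0.40, 0.35, 0.10], [0.45, 0.05, 0.20, 0.30], [0.60, 0.30, 0.10, 0.0]],
     1: [[0.25, 0.30, 0.35, 0.10], [0.30, 0.40, 0.25, 0.05]],
     2: [[0.05, 0.65, 0.25, 0.05], [0.05, 0.25, 0.50, 0.20]],
     3: [[0.05, 0.20, 0.40, 0.35], [0.00, 0.10, 0.30, 0.60]]}
q = {0: [-30, -25, -20], 1: [5, 10], 2: [-10, -5], 3: [35, 25]}
N = 4

d = [2, 1, 1, 0]                               # 0-based version of 23d = [3 2 2 1]
while True:
    Pd = np.array([P[i][d[i]] for i in range(N)])
    qd = np.array([q[i][d[i]] for i in range(N)], dtype=float)
    v = np.linalg.solve(np.eye(N) - alpha * Pd, qd)          # VD operation (4.260)
    new_d = [int(np.argmax([q[i][k] + alpha * np.dot(P[i][k], v)
                            for k in range(len(q[i]))])) for i in range(N)]
    if new_d == d:                             # the IM routine produced no change
        break
    d = new_d
print([k + 1 for k in d], np.round(v, 4))
# [2, 2, 2, 2] with v approximately (6.8040, 35.4613, 32.2190, 80.1970)
```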
Step 2. VD operation
Use pij and qi for the initial policy, 23d = [3 2 2 1]T, to solve the VDEs (4.257)
Step 3. IM routine
For each state i, find the decision k∗ that maximizes the test quantity
q_i^k + α Σ_{j=1}^{4} p_ij^k v_j
using the expected total discounted rewards vi of the initial policy. Then k∗
becomes the new decision in state i, so that di = k∗, qik∗ becomes qi, and pijk∗
becomes pij. The first IM routine is executed in Table 5.70.
TABLE 5.70
First IM for Monthly Sales Example
Test Quantity
qik + α ( pik1 v1 + pik2 v2 + pik3 v3 + pik4 v4 )
Decision = qik + 0.9[ pik1 ( − 38.2655) + pik2 (6.1707)
State Alternative Maximum Value Decision
i k
+ pik3 (8.1311) + pik4 (54.4759)] of Test Quantity di = k∗
1 1 −30 + 0.9[0.15(−38.2655)
+ 0.40(6.1707) + 0.35(8.1311)
+ 0.10(54.4759)] = −25.4803
1 2 −25 + 0.9[0.45(−38.2655) max[ −25.4803, d1 = 2
+ 0.05(6.1707) + 0.20(8.1311) −24.0478,
+ 0.30(54.4759)] = −24.0478 ← −38.2655]
= −24.0478
1 3 −20 + 0.9[0.60(−38.2655)
+ 0.30(6.1707) + 0.10(8.1311)
+ 0(54.4759)] = −38.2655
2 1 5 + 0.9[0.25(−38.2655)
+ 0.30(6.1707) + 0.35(8.1311)
+ 0.10(54.4759)] = 5.5205
2 2 10 + 0.9[0.30(−38.2655) max[5.5205, d2 = 2
+ 0.40(6.1707) + 0.25(8.1311) 6.1707]
+ 0.05(54.4759)] = 6.1707 ← = 6.1707
3 1 −10 + 0.9[0.05(−38.2655)
+ 0.65(6.1707) + 0.25(8.1311)
+ 0.05(54.4759)] = −3.8312
3 2 −5 + 0.9[0.05(−38.2655) max[ −3.8312, d3 = 2
+ 0.25(6.1707) + 0.50(8.1311) 8.1311]
+ 0.20(54.4759)] = 8.1311 ← = 8.1311
continued
Step 2. VD operation
Use pij and qi for the new policy, 16 d = [2 2 2 2]T, to solve the VDEs (4.257) for
all the expected total discounted rewards vi .
Step 3. IM routine
For each state i, find the decision k∗ that maximizes the test quantity
4
qik + α ∑ pijk v j
j= 1
using the expected total discounted rewards vi of the previous policy. Then
k∗ becomes the new decision in state i, so that di = k∗, qik∗ becomes qi, and pijk∗
becomes pij. The second IM routine is executed in Table 5.71.
Step 4. Stopping rule
Stop because the new policy, given by the vector 16 d = [2 2 2 2]T, is identi-
cal to the previous policy. Therefore, this policy is optimal. The expected
TABLE 5.71
Second IM for Monthly Sales Example
Test Quantity
qik + α ( pik1 v1 + pik2 v2 + pik3 v3 + pik4 v4 )
Maximum
Decision = qik + 0.9[ pik1 (6.8040) + pik2 (35.4613) Value of
State Alternative Test Decision
i k + pik3 (32.2190) + pik4 (80.1970)] Quantity di = k∗
1 1 −30 + 0.9[0.15(6.8040)
+0.40(35.4613) + 0.35(32.2190)
+0.10(80.1970)] = 1.0513
1 2 −25 + 0.9[0.45(6.8040) max[1.0513, d1 = 2
+0.05(35.4613) + 0.20(32.2190) 6.8040,
+0.30(80.1970)] = 6.8040 ← −38.5158]
= 6.8040
1 3 −20 + 0.9[0.60(6.8040)
+0.30(35.4613) + 0.10(32.2190)
+0(80.1970)] = −38.5158
2 1 5 + 0.9[0.25(6.8040)
+0.30(35.4613) + 0.35(32.2190)
+0.10(80.1970)] = 33.4722
2 2 10 + 0.9[0.30(6.8040) max[33.4722, d2 = 2
+0.40(35.4613) + 0.25(32.2190) 35.4613]
+0.05(80.1970)] = 35.4613 ← = 35.4613
3 1 −10 + 0.9[0.05(6.8040)
+0.65(35.4613) + 0.25(32.2190)
+0.05(80.1970)] = 21.9092
3 2 −5 + 0.9[0.05(6.8040) max[21.9092, d3 = 2
+0.25(35.4613) + 0.50(32.2190) 32.2190]
+0.20(80.1970)] = 32.2190 ← = 32.2190
continued
total discounted rewards for the optimal policy are calculated in Equation
(5.77). The transition probability matrix 16P and the reward vector 16q for
the optimal policy are given in Equation (5.71). Note that the optimal pol-
icy is the same as the one obtained without discounting in Equation (5.24)
of Section 5.1.2.3.3.3. Note also that the expected total discounted rewards
obtained by PI are identical to the corresponding dual variables obtained by
LP in Table 5.73 of Section 5.2.2.3.4.
d^A = [1 1]T,   P^A = [1 − u  u; c  1 − c],   q^A = [e  s]T,   v^A = [v_1^A  v_2^A]T.
The VDEs for states 1 and 2 under policy A are given below:
2
viA = qiA + α ∑ pijA v jA , i = 1, 2
j= 1
For i = 1,
v_1^A = q_1^A + α p_11^A v_1^A + α p_12^A v_2^A
v_1^A = e + α(1 − u)v_1^A + α u v_2^A.
For i = 2,
v_2^A = q_2^A + α p_21^A v_1^A + α p_22^A v_2^A
v_2^A = s + α c v_1^A + α(1 − c)v_2^A.
d^D = [2 2]T,   P^D = [1 − b  b; d  1 − d],   q^D = [f  h]T,   v^D = [v_1^D  v_2^D]T.
The VDEs for states 1 and 2 under policy D are given below.
v_i^D = q_i^D + α Σ_{j=1}^{2} p_ij^D v_j^D, i = 1, 2.
For i = 1,
v_1^D = q_1^D + α p_11^D v_1^D + α p_12^D v_2^D
v_1^D = f + α(1 − b)v_1^D + α b v_2^D.
For i = 2,
v_2^D = q_2^D + α p_21^D v_1^D + α p_22^D v_2^D
v_2^D = h + α d v_1^D + α(1 − d)v_2^D.
Since the IM routine has chosen policy D over A, the test quantity for pol-
icy D must be greater than or equal to the test quantity for A in both states.
Therefore,
For i = 1,
q_1^D + α(p_11^D v_1^A + p_12^D v_2^A) ≥ q_1^A + α(p_11^A v_1^A + p_12^A v_2^A)
f + α(1 − b)v_1^A + α b v_2^A ≥ e + α(1 − u)v_1^A + α u v_2^A.
For i = 2,
q_2^D + α(p_21^D v_1^A + p_22^D v_2^A) ≥ q_2^A + α(p_21^A v_1^A + p_22^A v_2^A)
h + α d v_1^A + α(1 − d)v_2^A ≥ s + α c v_1^A + α(1 − c)v_2^A.
The objective of this insight is to show that the IM routine must increase the expected total discounted rewards of one or both states. Let γ_i ≥ 0 denote the amount by which the test quantity for policy D exceeds the test quantity for policy A in state i.
For i = 1,
v_1^D − v_1^A = γ_1 + α(1 − b)(v_1^D − v_1^A) + α b(v_2^D − v_2^A).
For i = 2,
v_2^D − v_2^A = γ_2 + α d(v_1^D − v_1^A) + α(1 − d)(v_2^D − v_2^A).
Observe that the pair of equations for the increase in the expected total discounted rewards has the same form as the pair of equations for the expected total discounted rewards,
In matrix form, the pair of equations for the expected total discounted
rewards is
vD = qD + α PD vD .
The solution for the vector of the expected total discounted rewards is
v D = (I − α P D )−1 qD .
Similarly, the matrix form of the pair of equations for the increase in the
expected total discounted rewards is
v D − v A = γ + α PD (v D − v A ),
the test quantity in both states. Therefore, the solution for the vector of the
increase in the expected total discounted rewards is
v D − v A = (I − α P D )−1 γ .
αP^D = [α(1 − b)  αb; αd  α(1 − d)]
I − αP^D = [1 − α + αb  −αb; −αd  1 − α + αd]
(I − αP^D)^{−1} = 1/[(1 − α + αb)(1 − α + αd) − α²bd] × [1 − α + αd  αb; αd  1 − α + αb].
The denominator of the entries in (I − αP^D)^{−1} is
(1 − α + αb)(1 − α + αd) − α²bd
= (1 − α + αd − α + α² − α²d + αb − α²b + α²bd) − α²bd
= (1 − α − α + α²) + (αd − α²d) + (αb − α²b)
= [(1 − α) − α(1 − α)] + αd(1 − α) + αb(1 − α)
= (1 − α)(1 − α + αd + αb) ≥ 0.
v^D − v^A = (I − αP^D)^{−1} γ.
Since every entry of (I − αP^D)^{−1} is nonnegative and γ ≥ 0, the increase v^D − v^A is nonnegative, so the IM routine cannot decrease the expected total discounted reward of any state.
Σ_{k=1}^{K_j} y_j^k − Σ_{i=1}^{N} Σ_{k=1}^{K_i} y_i^k p_ij^k = 0, for j = 1, 2, ..., N,   (5.39)
Σ_{k=1}^{K_j} y_j^k − Σ_{i=1}^{N} Σ_{k=1}^{K_i} y_i^k (α p_ij^k) = Σ_{k=1}^{K_j} y_j^k − α Σ_{i=1}^{N} Σ_{k=1}^{K_i} y_i^k p_ij^k > 0, for j = 1, 2, ..., N,   (5.73)
Σ_{k=1}^{K_j} y_j^k > α Σ_{i=1}^{N} Σ_{k=1}^{K_i} y_i^k p_ij^k.   (5.74)
Σ_{k=1}^{K_j} y_j^k − α Σ_{i=1}^{N} Σ_{k=1}^{K_i} y_i^k p_ij^k = b_j,   (5.75)
Maximize Σ_{i=1}^{N} Σ_{k=1}^{K_i} q_i^k y_i^k
subject to
Σ_{k=1}^{K_j} y_j^k − α Σ_{i=1}^{N} Σ_{k=1}^{K_i} y_i^k p_ij^k = b_j,   (5.76)
for j = 1, 2, ..., N, where b_j > 0.
y_i^k ≥ 0, for i = 1, 2, ..., N, and k = 1, 2, ..., K_i.
When LP software has found an optimal solution for the decision variables, the conditional probabilities,
P(decision = k | state = i) = y_i^k / Σ_{k=1}^{K_i} y_i^k,   (5.30)
identify an optimal policy. The constants b_j are chosen so that
Σ_{j=1}^{N} b_j = 1.   (5.77)
Since b_j is the right-hand side constant of the jth constraint in the LP, b_j is also the objective function coefficient of the jth dual variable, v_j. The optimal dual objective function is equal to Σ_{j=1}^{N} b_j v_j, which is also the optimal value of the primal objective function. The constant b_j can be interpreted as the probability that the system starts in state j. That is, b_j = P(X_0 = j).
Objective Function
The objective function for the LP formulation for a discounted MDP is
Maximize Σ_{n=0}^{∞} α^n Σ_{i=1}^{N} Σ_{k=1}^{K_i} q_i^k P(X_n = i, d_i = k)
= Σ_{i=1}^{N} Σ_{k=1}^{K_i} q_i^k Σ_{n=0}^{∞} α^n P(X_n = i, d_i = k)
= Σ_{i=1}^{N} Σ_{k=1}^{K_i} q_i^k y_i^k,
which is the same objective function as the one for the LP formulation for an
undiscounted, recurrent MDP.
Constraints
The constraints can be obtained by starting with the expression Σ_{i=1}^{N} Σ_{k=1}^{K_i} α p_ij^k y_i^k.
Σ_{i=1}^{N} Σ_{k=1}^{K_i} α p_ij^k y_i^k = Σ_{i=1}^{N} Σ_{k=1}^{K_i} α p_ij^k Σ_{n=0}^{∞} α^n P(X_n = i, d_i = k)
= Σ_{k=1}^{K_i} Σ_{n=0}^{∞} α^{n+1} Σ_{i=1}^{N} p_ij^k P(X_n = i, d_i = k)
= Σ_{k=1}^{K_j} Σ_{n=0}^{∞} α^{n+1} P(X_{n+1} = j, d_j = k)
= Σ_{k=1}^{K_j} [α P(X_1 = j, d_j = k) + α² P(X_2 = j, d_j = k) + α³ P(X_3 = j, d_j = k) + ⋯]
= Σ_{k=1}^{K_j} {[α⁰ P(X_0 = j, d_j = k) + α P(X_1 = j, d_j = k) + α² P(X_2 = j, d_j = k) + α³ P(X_3 = j, d_j = k) + ⋯] − α⁰ P(X_0 = j, d_j = k)}
= Σ_{k=1}^{K_j} Σ_{n=0}^{∞} α^n P(X_n = j, d_j = k) − Σ_{k=1}^{K_j} α⁰ P(X_0 = j, d_j = k)
= Σ_{k=1}^{K_j} y_j^k − P(X_0 = j)
= Σ_{k=1}^{K_j} y_j^k − b_j.
Hence the constraints are
Σ_{k=1}^{K_j} y_j^k − α Σ_{i=1}^{N} Σ_{k=1}^{K_i} y_i^k p_ij^k = b_j, for j = 1, 2, ..., N,
where Σ_{j=1}^{N} b_j = 1.
TABLE 5.72
Data for Discounted MDP Model of Monthly Sales, α = 0.9
State i | Decision k | Transition probabilities p_i1^k, p_i2^k, p_i3^k, p_i4^k | Reward q_i^k | LP variable y_i^k
Objective Function
The objective function for the LP is
Maximize ∑_{i=1}^{4} ∑_{k=1}^{K_i} q_i^k y_i^k
= ∑_{k=1}^{3} q_1^k y_1^k + ∑_{k=1}^{2} q_2^k y_2^k + ∑_{k=1}^{2} q_3^k y_3^k + ∑_{k=1}^{2} q_4^k y_4^k
= (q_1^1 y_1^1 + q_1^2 y_1^2 + q_1^3 y_1^3) + (q_2^1 y_2^1 + q_2^2 y_2^2) + (q_3^1 y_3^1 + q_3^2 y_3^2) + (q_4^1 y_4^1 + q_4^2 y_4^2)
∑_{k=1}^{3} y_1^k − α ∑_{i=1}^{4} ∑_{k=1}^{K_i} y_i^k p_i1^k = b_1
(y_1^1 + y_1^2 + y_1^3) − α(∑_{k=1}^{3} y_1^k p_11^k + ∑_{k=1}^{2} y_2^k p_21^k + ∑_{k=1}^{2} y_3^k p_31^k + ∑_{k=1}^{2} y_4^k p_41^k) = b_1
(y_1^1 + y_1^2 + y_1^3) − α(y_1^1 p_11^1 + y_1^2 p_11^2 + y_1^3 p_11^3) − α(y_2^1 p_21^1 + y_2^2 p_21^2) − α(y_3^1 p_31^1 + y_3^2 p_31^2) − α(y_4^1 p_41^1 + y_4^2 p_41^2) = b_1
(y_1^1 + y_1^2 + y_1^3) − 0.9(0.15y_1^1 + 0.45y_1^2 + 0.60y_1^3) − 0.9(0.25y_2^1 + 0.30y_2^2) − 0.9(0.05y_3^1 + 0.05y_3^2) − 0.9(0.05y_4^1 + 0y_4^2) = 0.1.
∑_{k=1}^{2} y_2^k − α ∑_{i=1}^{4} ∑_{k=1}^{K_i} y_i^k p_i2^k = b_2
(y_2^1 + y_2^2) − α(∑_{k=1}^{3} y_1^k p_12^k + ∑_{k=1}^{2} y_2^k p_22^k + ∑_{k=1}^{2} y_3^k p_32^k + ∑_{k=1}^{2} y_4^k p_42^k) = b_2
(y_2^1 + y_2^2) − α(y_1^1 p_12^1 + y_1^2 p_12^2 + y_1^3 p_12^3) − α(y_2^1 p_22^1 + y_2^2 p_22^2) − α(y_3^1 p_32^1 + y_3^2 p_32^2) − α(y_4^1 p_42^1 + y_4^2 p_42^2) = b_2
(y_2^1 + y_2^2) − 0.9(0.40y_1^1 + 0.05y_1^2 + 0.30y_1^3) − 0.9(0.30y_2^1 + 0.40y_2^2) − 0.9(0.65y_3^1 + 0.25y_3^2) − 0.9(0.20y_4^1 + 0.10y_4^2) = 0.2
∑_{k=1}^{2} y_3^k − α ∑_{i=1}^{4} ∑_{k=1}^{K_i} y_i^k p_i3^k = b_3
(y_3^1 + y_3^2) − α(∑_{k=1}^{3} y_1^k p_13^k + ∑_{k=1}^{2} y_2^k p_23^k + ∑_{k=1}^{2} y_3^k p_33^k + ∑_{k=1}^{2} y_4^k p_43^k) = b_3
(y_3^1 + y_3^2) − α(y_1^1 p_13^1 + y_1^2 p_13^2 + y_1^3 p_13^3) − α(y_2^1 p_23^1 + y_2^2 p_23^2) − α(y_3^1 p_33^1 + y_3^2 p_33^2) − α(y_4^1 p_43^1 + y_4^2 p_43^2) = b_3
∑_{k=1}^{2} y_4^k − α ∑_{i=1}^{4} ∑_{k=1}^{K_i} y_i^k p_i4^k = b_4
(y_4^1 + y_4^2) − α(∑_{k=1}^{3} y_1^k p_14^k + ∑_{k=1}^{2} y_2^k p_24^k + ∑_{k=1}^{2} y_3^k p_34^k + ∑_{k=1}^{2} y_4^k p_44^k) = b_4
(y_4^1 + y_4^2) − α(y_1^1 p_14^1 + y_1^2 p_14^2 + y_1^3 p_14^3) − α(y_2^1 p_24^1 + y_2^2 p_24^2) − α(y_3^1 p_34^1 + y_3^2 p_34^2) − α(y_4^1 p_44^1 + y_4^2 p_44^2) = b_4
subject to
(1) (0.865y_1^1 + 0.595y_1^2 + 0.46y_1^3) − (0.225y_2^1 + 0.27y_2^2) − (0.045y_3^1 + 0.045y_3^2) − (0.045y_4^1 + 0y_4^2) = 0.1
(2) −(0.36y_1^1 + 0.045y_1^2 + 0.27y_1^3) + (0.73y_2^1 + 0.64y_2^2) − (0.585y_3^1 + 0.225y_3^2) − (0.18y_4^1 + 0.09y_4^2) = 0.2
(3) −(0.315y_1^1 + 0.18y_1^2 + 0.09y_1^3) − (0.315y_2^1 + 0.225y_2^2) + (0.775y_3^1 + 0.55y_3^2) − (0.36y_4^1 + 0.27y_4^2) = 0.3
TABLE 5.73
LP Solution of Discounted MDP Model of Monthly Sales
Objective function value = 49.5172

i  k  y_i^k        Row           Dual Variable
1  1  0            Constraint 1  6.8040
1  2  1.3559       Constraint 2  35.4613
1  3  0            Constraint 3  32.2190
2  1  0            Constraint 4  80.1970
2  2  2.0515
3  1  0
3  2  3.3971
4  1  0
4  2  3.1954
The conditional probabilities are computed from
P(decision = k|state = i) = y_i^k / ∑_{k=1}^{K_i} y_i^k.
Since ∑_{j=1}^{4} b_j = 1, the optimal value of the LP objective function equals ∑_{j=1}^{4} b_j v_j. That is,
∑_{j=1}^{4} b_j v_j = 49.5172 = 0.1(6.8040) + 0.2(35.4613) + 0.3(32.2190) + 0.4(80.1970).
The objective function for the LP for the discounted model, unchanged from
the objective function for the unichain model, is maximized
subject to
(1) (0.1y_0^0 + 0.37y_0^1 + 0.73y_0^2 + 0.82y_0^3) − (0.63y_1^0 + 0.27y_1^1 + 0.18y_1^2) − (0.27y_2^0 + 0.18y_2^1) − 0.18y_3^0 = 0.1
(2) (y_1^0 + y_1^1 + y_1^2) − 0.9(0.3y_0^1 + 0.4y_0^2 + 0.1y_0^3) − 0.9(0.3y_1^0 + 0.4y_1^1 + 0.1y_1^2) − 0.9(0.4y_2^0 + 0.1y_2^1) − 0.9(0.1)y_3^0 = 0.2
TABLE 5.74
LP Solution of Discounted MDP Model of an Inventory System
Objective function value = 1218.2752

i  k  y_i^k        Row           Dual Variable
0  0  0            Constraint 1  964.7156
0  1  0            Constraint 2  1084.7156
0  2  0            Constraint 3  1223.2477
0  3  2.2220       Constraint 4  1344.7156
1  0  0
1  1  0
1  2  2.0661
2  0  3.5780
2  1  0
3  0  2.1339
The conditional probabilities are computed from
P(order = k|state = i) = y_i^k / ∑_{k=1}^{K_i} y_i^k.
Since the only positive decision variables are y_0^3, y_1^2, y_2^0, and y_3^0, the optimal policy orders 3 units in state 0, 2 units in state 1, and no units in states 2 and 3, so that P(order = 3|state = 0) = P(order = 2|state = 1) = P(order = 0|state = 2) = P(order = 0|state = 3) = 1.
All the remaining conditional probabilities are zero. The retailer’s expected
total discounted profit is maximized by the same (2, 3) inventory policy that
maximized her expected average profit per period without discounting in
Section 5.1.3.3.1.2.
The dual variables associated with constraints 1, 2, 3, and 4 are the expected
total discounted rewards v_0 = 964.7156, v_1 = 1084.7156, v_2 = 1223.2477, and
v_3 = 1344.7156, respectively. Since ∑_{j=1}^{N} b_j = 1, the optimal value of the LP
objective function, 1218.2752, equals ∑_{j=0}^{3} b_j v_j, as is stated at the end of Section 5.2.2.3.1. That is,
∑_{j=0}^{3} b_j v_j = 1218.2752 = 0.1(964.7156) + 0.2(1084.7156) + 0.3(1223.2477) + 0.4(1344.7156).
TABLE 5.75
Data for Discounted MDP Model of a Secretary Problem Over an Infinite Planning Horizon

                       Transition Probability                                  Reward  LP Variable
State i  Decision k    p_{i,15}^k  p_{i,20}^k  p_{i,25}^k  p_{i,30}^k  p_{i,Δ}^k   q_i^k   y_i^k
15       1 = H         0           0           0           0           1           15      y_15^1
15       2 = R         0.3         0.4         0.2         0.1         0           −2      y_15^2
20       1 = H         0           0           0           0           1           20      y_20^1
20       2 = R         0.3         0.4         0.2         0.1         0           −2      y_20^2
25       1 = H         0           0           0           0           1           25      y_25^1
25       2 = R         0.3         0.4         0.2         0.1         0           −2      y_25^2
30       1 = H         0           0           0           0           1           30      y_30^1
30       2 = R         0.3         0.4         0.2         0.1         0           −2      y_30^2
Δ        1 = H         0           0           0           0           1           0       y_Δ^1
Δ        2 = R         0           0           0           0           1           0       y_Δ^2
Objective Function
The objective function for the LP is
Maximize ∑_{i=15}^{30} ∑_{k=1}^{2} q_i^k y_i^k
= ∑_{k=1}^{2} q_15^k y_15^k + ∑_{k=1}^{2} q_20^k y_20^k + ∑_{k=1}^{2} q_25^k y_25^k + ∑_{k=1}^{2} q_30^k y_30^k
= (q_15^1 y_15^1 + q_15^2 y_15^2) + (q_20^1 y_20^1 + q_20^2 y_20^2) + (q_25^1 y_25^1 + q_25^2 y_25^2) + (q_30^1 y_30^1 + q_30^2 y_30^2)
= (15y_15^1 − 2y_15^2) + (20y_20^1 − 2y_20^2) + (25y_25^1 − 2y_25^2) + (30y_30^1 − 2y_30^2).
Constraints
State 15 has K15 = 2 possible decisions. The constraint associated with a tran-
sition to state j = 15 is
∑_{k=1}^{2} y_15^k − α ∑_{i=15}^{30} ∑_{k=1}^{2} y_i^k p_{i,15}^k = b_1
(y_15^1 + y_15^2) − α(∑_{k=1}^{2} y_15^k p_{15,15}^k + ∑_{k=1}^{2} y_20^k p_{20,15}^k + ∑_{k=1}^{2} y_25^k p_{25,15}^k + ∑_{k=1}^{2} y_30^k p_{30,15}^k) = b_1
(y_15^1 + y_15^2) − α(y_15^1 p_{15,15}^1 + y_15^2 p_{15,15}^2) − α(y_20^1 p_{20,15}^1 + y_20^2 p_{20,15}^2) − α(y_25^1 p_{25,15}^1 + y_25^2 p_{25,15}^2) − α(y_30^1 p_{30,15}^1 + y_30^2 p_{30,15}^2) = b_1
(y_15^1 + y_15^2) − 0.9(0y_15^1 + 0.3y_15^2) − 0.9(0y_20^1 + 0.3y_20^2) − 0.9(0y_25^1 + 0.3y_25^2) − 0.9(0y_30^1 + 0.3y_30^2) = 0.25.
The constraint associated with a transition to state j = 20 is
∑_{k=1}^{2} y_20^k − α ∑_{i=15}^{30} ∑_{k=1}^{2} y_i^k p_{i,20}^k = b_2
(y_20^1 + y_20^2) − α(∑_{k=1}^{2} y_15^k p_{15,20}^k + ∑_{k=1}^{2} y_20^k p_{20,20}^k + ∑_{k=1}^{2} y_25^k p_{25,20}^k + ∑_{k=1}^{2} y_30^k p_{30,20}^k) = b_2
(y_20^1 + y_20^2) − α(y_15^1 p_{15,20}^1 + y_15^2 p_{15,20}^2) − α(y_20^1 p_{20,20}^1 + y_20^2 p_{20,20}^2) − α(y_25^1 p_{25,20}^1 + y_25^2 p_{25,20}^2) − α(y_30^1 p_{30,20}^1 + y_30^2 p_{30,20}^2) = b_2
(y_20^1 + y_20^2) − 0.9(0y_15^1 + 0.4y_15^2) − 0.9(0y_20^1 + 0.4y_20^2) − 0.9(0y_25^1 + 0.4y_25^2) − 0.9(0y_30^1 + 0.4y_30^2) = 0.25.
The constraint associated with a transition to state j = 25 is
∑_{k=1}^{2} y_25^k − α ∑_{i=15}^{30} ∑_{k=1}^{2} y_i^k p_{i,25}^k = b_3
(y_25^1 + y_25^2) − α(∑_{k=1}^{2} y_15^k p_{15,25}^k + ∑_{k=1}^{2} y_20^k p_{20,25}^k + ∑_{k=1}^{2} y_25^k p_{25,25}^k + ∑_{k=1}^{2} y_30^k p_{30,25}^k) = b_3
(y_25^1 + y_25^2) − α(y_15^1 p_{15,25}^1 + y_15^2 p_{15,25}^2) − α(y_20^1 p_{20,25}^1 + y_20^2 p_{20,25}^2) − α(y_25^1 p_{25,25}^1 + y_25^2 p_{25,25}^2) − α(y_30^1 p_{30,25}^1 + y_30^2 p_{30,25}^2) = b_3
(y_25^1 + y_25^2) − 0.9(0y_15^1 + 0.2y_15^2) − 0.9(0y_20^1 + 0.2y_20^2) − 0.9(0y_25^1 + 0.2y_25^2) − 0.9(0y_30^1 + 0.2y_30^2) = 0.25.
The constraint associated with a transition to state j = 30 is
∑_{k=1}^{2} y_30^k − α ∑_{i=15}^{30} ∑_{k=1}^{2} y_i^k p_{i,30}^k = b_4
(y_30^1 + y_30^2) − α(∑_{k=1}^{2} y_15^k p_{15,30}^k + ∑_{k=1}^{2} y_20^k p_{20,30}^k + ∑_{k=1}^{2} y_25^k p_{25,30}^k + ∑_{k=1}^{2} y_30^k p_{30,30}^k) = b_4
(y_30^1 + y_30^2) − α(y_15^1 p_{15,30}^1 + y_15^2 p_{15,30}^2) − α(y_20^1 p_{20,30}^1 + y_20^2 p_{20,30}^2) − α(y_25^1 p_{25,30}^1 + y_25^2 p_{25,30}^2) − α(y_30^1 p_{30,30}^1 + y_30^2 p_{30,30}^2) = b_4
(y_30^1 + y_30^2) − 0.9(0y_15^1 + 0.1y_15^2) − 0.9(0y_20^1 + 0.1y_20^2) − 0.9(0y_25^1 + 0.1y_25^2) − 0.9(0y_30^1 + 0.1y_30^2) = 0.25.
The complete LP formulation for the discounted MDP model of the secretary
problem analyzed over an infinite planning horizon is
Maximize (15y_15^1 − 2y_15^2) + (20y_20^1 − 2y_20^2) + (25y_25^1 − 2y_25^2) + (30y_30^1 − 2y_30^2)
subject to
(1) (y_15^1 + 0.73y_15^2) − (0.27y_20^2 + 0.27y_25^2 + 0.27y_30^2) = 0.25
(2) (y_20^1 + 0.64y_20^2) − (0.36y_15^2 + 0.36y_25^2 + 0.36y_30^2) = 0.25
(3) (y_25^1 + 0.82y_25^2) − (0.18y_15^2 + 0.18y_20^2 + 0.18y_30^2) = 0.25
(4) (y_30^1 + 0.91y_30^2) − (0.09y_15^2 + 0.09y_20^2 + 0.09y_25^2) = 0.25
y_15^1 ≥ 0, y_15^2 ≥ 0, y_20^1 ≥ 0, y_20^2 ≥ 0, y_25^1 ≥ 0, y_25^2 ≥ 0, y_30^1 ≥ 0, y_30^2 ≥ 0.
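As a cross-check on the formulation, the complete LP can be passed to scipy.optimize.linprog. The sketch below is not part of the text; it simply encodes the four constraints above and reproduces the solution and dual variables reported in Table 5.76 (res.eqlin.marginals requires the HiGHS backend, SciPy 1.7 or later).

```python
import numpy as np
from scipy.optimize import linprog

# Variables ordered [y15_1, y15_2, y20_1, y20_2, y25_1, y25_2, y30_1, y30_2].
c = -np.array([15, -2, 20, -2, 25, -2, 30, -2], dtype=float)  # negate: linprog minimizes
A_eq = np.array([
    [1, 0.73, 0, -0.27, 0, -0.27, 0, -0.27],   # constraint (1)
    [0, -0.36, 1, 0.64, 0, -0.36, 0, -0.36],   # constraint (2)
    [0, -0.18, 0, -0.18, 1, 0.82, 0, -0.18],   # constraint (3)
    [0, -0.09, 0, -0.09, 0, -0.09, 1, 0.91],   # constraint (4)
])
b_eq = np.array([0.25, 0.25, 0.25, 0.25])
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 8, method="highs")
print(round(-res.fun, 4))                 # 22.9966
print(np.round(res.x, 4))                 # only y15_2, y20_1, y25_1, y30_1 are positive
print(np.round(-res.eqlin.marginals, 4))  # duals ~ 16.9863, 20, 25, 30 (sign flips because
                                          # the maximization was converted to a minimization)
```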
The optimal policy, given by d = [2 1 1 1]^T, is to reject a candidate rated 15, and hire a candidate rated 20 or 25 or 30. The associated conditional probabilities are computed from
P(decision = k|state = i) = y_i^k / ∑_{k=1}^{K_i} y_i^k.
Since ∑_{j=1}^{4} b_j = 1, the optimal value of the LP objective function equals ∑_{j=1}^{4} b_j v_j. That is,
∑_{j=1}^{4} b_j v_j = 22.9966 = 0.25(16.9863) + 0.25(20) + 0.25(25) + 0.25(30).
TABLE 5.76
LP Solution of Discounted MDP Model of the Secretary Problem
Objective function value = 22.9966

i   k  y_i^k       Row           Dual Variable
15  1  0           Constraint 1  16.9863
15  2  0.3425      Constraint 2  20
20  1  0.3733      Constraint 3  25
20  2  0           Constraint 4  30
25  1  0.3116
25  2  0
30  1  0.2808
30  2  0
It is instructive to note that, for the MCR associated with the optimal policy given
by d = [2 1 1 1]^T, with the absorbing state Δ restored, the discounted transition probability matrix is

          15    20    25    30    Δ
     15 [ 0.27  0.36  0.18  0.09  0   ]
     20 [ 0     0     0     0     0.9 ]
αP = 25 [ 0     0     0     0     0.9 ]
     30 [ 0     0     0     0     0.9 ]
     Δ  [ 0     0     0     0     0.9 ]

Recall that the matrix equation (4.253) in Section 4.3.3.1 is an alternate form
of the VDEs. Solving the matrix equation (4.253) for the vector of expected
total discounted rewards gives v = (I − αP)^{-1}q, where q = [−2 20 25 30 0]^T is the reward vector for this policy (Table 5.75); the result, v = [16.9863 20 25 30 0]^T, reproduces the dual variables in Table 5.76.
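A minimal numerical check of this claim (a sketch, not part of the text) solves (I − αP)v = q directly:

```python
import numpy as np

alpha = 0.9
# States ordered 15, 20, 25, 30, Delta; rows follow the policy d = [2 1 1 1].
P = np.array([[0.3, 0.4, 0.2, 0.1, 0.0],
              [0.0, 0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 0.0, 1.0]])
q = np.array([-2.0, 20.0, 25.0, 30.0, 0.0])   # rewards under this policy (Table 5.75)
v = np.linalg.solve(np.eye(5) - alpha * P, q)
print(np.round(v, 4))    # ~[16.9863 20. 25. 30. 0.], matching the duals in Table 5.76
```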
PROBLEMS
5.1 A woman wishes to sell a car within the next 4 weeks. She
expects to receive one bid or offer each week from a prospec-
tive buyer. The weekly offer is a random variable, which has the
following stationary probability distribution:
Once the woman accepts an offer, the bidding stops. Her objec-
tive is to maximize her expected total income over a 4-week
planning horizon.
(a) Formulate this optimal stopping problem as a unichain MDP.
(b) Use value iteration to find an optimal policy which will
specify when to accept or reject an offer during each week of
the planning horizon.
5.2 In Problem 3.10, suppose that the investor receives a monthly
dividend of $1 per share. No dividend is received when the
stock is sold. The investor uses a monthly discount factor of
α = 0.9. She wants to determine when to sell and when to hold
the stock. Her objective is to maximize her expected total dis-
counted reward.
(a) Treat this problem as an optimal stopping problem over an
infinite planning horizon. Formulate this optimal stopping
problem as a discounted, unichain MDP.
(b) Formulate the discounted, unichain MDP as a LP, and solve
it to find an optimal policy.
(c) Use PI to find an optimal policy.
5.3 A consumer electronics retailer can place orders for flat panel
TVs at the beginning of each day. All orders are delivered
immediately. Every time an order is placed for one or more TVs,
the retailer pays a fixed cost of $60. Every TV ordered costs the
retailer $100. The daily holding cost per unsold TV is $10. A
daily shortage cost of $180 is incurred for each TV that is not
available to satisfy demand. The retailer can accommodate a
maximum inventory of two TVs. The daily demand for TVs is
an independent, identically distributed random variable which
has the following stationary probability distribution:
Daily demand, d      0     1     2
Probability, p(d)    0.5   0.4   0.1
              State            E     A     P
              Excellent (E)    0.1   0.7   0.2
P = [p_ij] =  Acceptable (A)   0     0.4   0.6
              Poor (P)         0     0     1
State Condition
NW Not Working
WI Working Intermittently
WP Working Properly
The daily behavior of the machine when it is left alone for 1 day
is modeled as an absorbing unichain with the following transi-
tion probability matrix:
              State                         NW    WI    WP
              Not Working (NW)              1     0     0
P = [p_ij] =  Working Intermittently (WI)   0.8   0.2   0
              Working Properly (WP)         0.2   0.5   0.3
5.6 In Problem 4.1, suppose that the engineer responsible for the
machine has the options of bringing a machine with a major
defect or a minor defect to the repair process. When the engi-
neer elects to bring a machine with a major defect or a minor
defect to the repair process, the machine makes a transition
with probability 1 to state 1 (NW). The daily costs to bring a
machine in state 2 (MD) and state 3 (mD) to the repair process
are $80 and $40, respectively. A repair takes one day to com-
plete. No revenue is earned on days during which a machine
is repaired. When the machine is in state 1 (NW), it is always
under repair. Hence, the only feasible action in state 1 is to
repair (RP) the machine. Since a machine in state 4 (WP) is
never brought to the repair process, the only feasible action in
state 4 (WP) is to do nothing (DN). The following four policies
are feasible:
[Figure: a network of six nodes in which each arc (i, j) in {(1, 2), (1, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 6), (5, 6)} is labeled with a transition probability p_ij^k and a reward r_ij^k.]
State   Condition
F       Virus-free
B       Benign virus
M       Malignant virus
of its own products and services. To prepare her CPD log and
assemble supporting documentation for a possible audit at the
end of the current year, she is considering hiring a prominent
engineering educator as an audit consultant. The audit consul-
tant will charge her a fee of $2,000. Experience indicates that
the Board often imposes a fine for deficiencies in CPD courses
taken in prior years when a registrant is audited. Past records
show that when a CPD log is prepared by an audit consultant,
the average fine is $600 after a log is audited. However, when a
CPD log is prepared by the registrant herself, the average fine
is $8,000. The consulting engineer must decide whether or not
to hire an audit consultant to prepare her CPD log for a possible
audit at the end of the current year. Hiring an audit consultant
appears to reduce by 0.05 the probability that a registrant will
be audited in the following year:
(a) Formulate the consulting engineer’s decision alternatives as
a two-state recurrent MDP. For each state and decision, spec-
ify the associated transition probabilities and calculate the
expected immediate cost.
(b) Execute value iteration to find a policy which minimizes the
vector of expected total costs that will be incurred in both
states after 3 years. Assume zero terminal costs at the end of
year 3.
(c) Use exhaustive enumeration to find a policy that minimizes
the expected average cost, or negative gain, over an infinite
planning horizon.
(d) Use PI to find an optimal policy over an infinite horizon.
References
1. Hillier, F. S. and Lieberman G. J., Introduction to Operations Research, 8th ed.,
McGraw-Hill, New York, 2005.
2. Howard, R. A., Dynamic Programming and Markov Processes, M.I.T. Press,
Cambridge, MA, 1960.
3. Puterman, M. L., Markov Decision Processes: Discrete Stochastic Dynamic
Programming, Wiley, New York, 1994.
4. Wagner, H. M., Principles of Operations Research, 2nd ed., Prentice-Hall,
Englewood Cliffs, NJ, 1975.
Snell (pp. 114–116). Consider a regular Markov chain with N states, indexed
1, 2, . . . , N. For simplicity, assume N = 4. The transition probability matrix is
denoted by P. The states are divided into two subsets. The first subset con-
tains state 1. The second subset contains the remaining states, and is denoted
by S = {2, 3, 4}. The transition matrix P is partitioned into four submatrices
called p11, u, v, and T, as shown below:
         1      S
P  =  1 [ p_11   u ],      (6.1)
      S [ v      T ]
Since roundoff error can be reduced by eliminating subtractions [1], the fac-
tor (1 − p_11)^{-1} is replaced by (p_12 + p_13 + p_14)^{-1}. The modified formula, called the
matrix reduction formula, is
P̄ = T + v(p_12 + p_13 + p_14)^{-1}u.      (6.3)
The foregoing process of applying the matrix reduction formula (6.3) to the
original matrix, P, to form the reduced matrix, P̄, is called matrix reduction.
By repeating this argument for the remaining pairs of states in S, the follow-
ing eight additional transition probabilities for the reduced matrix, P̄, shown
in equation (6.4), can be obtained.
The matrix form of the nine formulas (6.5) through (6.13) is the matrix reduction formula (6.3).
           1          2          3          4
      1 [ p_11^(1)   p_12^(1)   p_13^(1)   p_14^(1) ]
P^(1) = 2 [ p_21^(1)   p_22^(1)   p_23^(1)   p_24^(1) ]  =  [ p_11^(1)   u^(1) ]      (6.14)
      3 [ p_31^(1)   p_32^(1)   p_33^(1)   p_34^(1) ]     [ v^(1)      T^(1) ]
      4 [ p_41^(1)   p_42^(1)   p_43^(1)   p_44^(1) ]

P^(2) = T^(1) + v^(1)[p_12^(1) + p_13^(1) + p_14^(1)]^{-1}u^(1)      (6.15)

      2 [ p_22^(1)   p_23^(1)   p_24^(1) ]     [ p_21^(1) ]
   =  3 [ p_32^(1)   p_33^(1)   p_34^(1) ]  +  [ p_31^(1) ] [p_12^(1) + p_13^(1) + p_14^(1)]^{-1} [ p_12^(1)   p_13^(1)   p_14^(1) ]
      4 [ p_42^(1)   p_43^(1)   p_44^(1) ]     [ p_41^(1) ]

      2 [ p_22^(2)   p_23^(2)   p_24^(2) ]
   =  3 [ p_32^(2)   p_33^(2)   p_34^(2) ]  =  [ p_22^(2)   u^(2) ]
      4 [ p_42^(2)   p_43^(2)   p_44^(2) ]     [ v^(2)      T^(2) ]

P^(3) = T^(2) + v^(2)[p_23^(2) + p_24^(2)]^{-1}u^(2)

      3 [ p_33^(2)   p_34^(2) ]     [ p_32^(2) ]
   =  4 [ p_43^(2)   p_44^(2) ]  +  [ p_42^(2) ] [p_23^(2) + p_24^(2)]^{-1} [ p_23^(2)   p_24^(2) ]

      3 [ p_33^(3)   p_34^(3) ]     [ p_33^(3)   u^(3) ]
   =  4 [ p_43^(3)   p_44^(3) ]  =  [ v^(3)      T^(3) ].      (6.16)
The matrix reduction step has ended with a two-state Markov chain, which
has a transition probability matrix denoted by P(3). Using Equation (2.16), the
steady-state probability vector for P(3), denoted by π(3), is known, and is shown
below with its two components.
π^(3) = [π_3^(3)  π_4^(3)] = [ p_43^(3)/(p_34^(3) + p_43^(3))   p_34^(3)/(p_34^(3) + p_43^(3)) ].      (6.17)
The second step of the MCPA is back substitution, which begins by solving
for π_3^(3) as a constant k_3 times π_4^(3):
k_3 = (p_34^(3))^{-1} p_43^(3).      (6.19)
The first steady-state equation of the system (6.21) is
(1 − p_22^(2))π_2^(2) = π_3^(2)p_32^(2) + π_4^(2)p_42^(2).      (6.22)
π_3^(3) = π_3^(2)/(π_3^(2) + π_4^(2)), and π_4^(3) = π_4^(2)/(π_3^(2) + π_4^(2)).      (6.23)
Observe that π_3^(3)/π_4^(3) = π_3^(2)/π_4^(2). Hence, π_3^(3) = k_3π_4^(3) implies that π_3^(2) = k_3π_4^(2). Substituting π_3^(2) = k_3π_4^(2) in the
first steady-state equation of the system (6.21),
(1 − p_22^(2))π_2^(2) = p_32^(2)k_3π_4^(2) + p_42^(2)π_4^(2) = (p_32^(2)k_3 + p_42^(2))π_4^(2),
so that
π_2^(2) = (p_23^(2) + p_24^(2))^{-1}(p_32^(2)k_3 + p_42^(2))π_4^(2) = k_2π_4^(2),
where (p_23^(2) + p_24^(2))^{-1} is substituted for (1 − p_22^(2))^{-1} to avoid subtractions. The
steady-state probability vector for P^(1) is denoted by
π^(1) = [π_1^(1)  π_2^(1)  π_3^(1)  π_4^(1)].
Observe that
(1 − p_11^(1))π_1^(1) = p_21^(1)k_2π_4^(1) + p_31^(1)k_3π_4^(1) + p_41^(1)π_4^(1) = (p_21^(1)k_2 + p_31^(1)k_3 + p_41^(1))π_4^(1),
so that
π_1^(1) = (1 − p_11^(1))^{-1}(p_21^(1)k_2 + p_31^(1)k_3 + p_41^(1))π_4^(1)      (6.32)
        = (p_12^(1) + p_13^(1) + p_14^(1))^{-1}(p_21^(1)k_2 + p_31^(1)k_3 + p_41^(1))π_4^(1) = k_1π_4^(1).
Solving the normalizing equation (2.6) for the system (6.28) for π_4^(1) as a
function of the constants,
π_4^(1) = (1 + k_1 + k_2 + k_3)^{-1}.
Solving for the steady-state probabilities for the original Markov chain,
π_4 = π_4^(1)
π_3 = π_3^(1) = k_3π_4^(1)
π_2 = π_2^(1) = k_2π_4^(1)
π_1 = π_1^(1) = k_1π_4^(1).      (6.35)
Matrix Reduction
1. Initialize n = 1.
2. Let P^(n) = P = [p_ij] for n ≤ i ≤ N and n ≤ j ≤ N.
3. Partition P^(n) as

               n              n+1                ⋯    N
       n    [ p_nn^(n)        p_{n,n+1}^(n)      ⋯    p_{n,N}^(n)   ]
P^(n) = n+1  [ p_{n+1,n}^(n)   p_{n+1,n+1}^(n)    ⋯    p_{n+1,N}^(n) ]  =  [ p_nn^(n)   u^(n) ]
       ⋮    [ ⋮               ⋮                       ⋮             ]     [ v^(n)      T^(n) ]
       N    [ p_{N,n}^(n)     p_{N,n+1}^(n)      ⋯    p_{N,N}^(n)   ]

4. Store the first row and first column, respectively, of P^(n) by overwriting
   the first row and the first column, respectively, of P.
5. Compute P^(n+1) = T^(n) + v^(n)[p_{n,n+1}^(n) + ⋯ + p_{n,N}^(n)]^{-1}u^(n).
6. Increment n by 1. If n < N − 1, go to step 3. Otherwise, go to back
   substitution.
Back Substitution
1. Initialize i = N − 1.
2. k_{N−1} = (p_{N−1,N}^(N−1))^{-1} p_{N,N−1}^(N−1).
3. Decrement i by 1.
4. k_i = (∑_{j=i+1}^{N} p_ij^(i))^{-1} [p_{Ni}^(i) + ∑_{h=i+1}^{N−1} p_{hi}^(i) k_h], for i = N − 2, N − 3, . . . , 1.
5. If i > 1, go to step 3.
6. i = 1.
7. π_N = (1 + ∑_{h=1}^{N−1} k_h)^{-1}.
8. π_h = k_hπ_N, for h = 1, 2, . . . , N − 1.
9. π = [π_1  π_2  ⋯  π_N].
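The two-phase procedure above translates directly into a short program. The sketch below is an illustration, not part of the text; it uses 0-based indexing, and the function name and the demonstration matrix are assumptions.

```python
import numpy as np

def mcpa_steady_state(P):
    """Steady-state vector of a regular Markov chain by matrix reduction and
    back substitution (subtraction-free).  States are 1..N in the text and
    0..N-1 here; reduced[m] plays the role of P(m+1)."""
    P = np.asarray(P, dtype=float)
    N = P.shape[0]
    reduced = [P.copy()]
    # Matrix reduction: repeatedly remove the current first state until a
    # two-state chain remains.
    for n in range(N - 2):
        A = reduced[-1]
        u = A[0, 1:]                 # first row of the state being removed
        v = A[1:, [0]]               # first column of the state being removed
        T = A[1:, 1:]
        # u.sum() replaces (1 - p_nn) to avoid a subtraction.
        reduced.append(T + v @ u[np.newaxis, :] / u.sum())
    # Back substitution: express each pi_i as k_i * pi_N.
    k = np.zeros(N)
    A2 = reduced[-1]                 # final 2 x 2 reduced matrix
    k[N - 2] = A2[1, 0] / A2[0, 1]
    for i in range(N - 3, -1, -1):
        A = reduced[i]               # reduced matrix whose first state is i
        row = A[0, 1:]               # p_{i,j}^{(i)} for j > i
        col = A[1:, 0]               # p_{j,i}^{(i)} for j > i
        k[i] = (col[-1] + col[:-1] @ k[i + 1:N - 1]) / row.sum()
    pi = np.empty(N)
    pi[N - 1] = 1.0 / (1.0 + k[:N - 1].sum())
    pi[:N - 1] = k[:N - 1] * pi[N - 1]
    return pi

# A quick check on a hypothetical three-state chain (not from the text):
P = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.4, 0.2],
              [0.1, 0.6, 0.3]])
print(np.round(mcpa_steady_state(P), 4))   # pi @ P == pi up to roundoff
```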
P^(3) = T^(2) + v^(2)[p_23^(2) + p_24^(2)]^{-1}u^(2)      (6.38)

      3 [ 19/70   34/70 ]     [ 17/70 ]
   =  4 [ 21/70    7/70 ]  +  [ 42/70 ] (22/70 + 11/70)^{-1} [ 22/70   11/70 ]

      3 [ 13/30   17/30 ]     [ p_33^(3)   u^(3) ]
   =  4 [ 21/30    9/30 ]  =  [ v^(3)      T^(3) ].
Back Substitution
k_3 = (p_34^(3))^{-1} p_43^(3) = (17/30)^{-1}(21/30) = 21/17.
k_2 = (p_23^(2) + p_24^(2))^{-1}(p_32^(2)k_3 + p_42^(2))      (6.40)
    = (22/70 + 11/70)^{-1}[(17/70)(21/17) + 42/70] = 2499/1309.
k_1 = (p_12^(1) + p_13^(1) + p_14^(1))^{-1}(p_21^(1)k_2 + p_31^(1)k_3 + p_41^(1))      (6.41)
    = (0.1 + 0.4 + 0.2)^{-1}[(0.2)(2499/1309) + (0.3)(1617/1309) + 0] = 1407/1309.
π_4 = (1 + k_1 + k_2 + k_3)^{-1} = (1 + 1407/1309 + 2499/1309 + 1617/1309)^{-1} = 1309/6832,      (6.42)
π_3 = k_3π_4 = 1617/6832, π_2 = k_2π_4 = 2499/6832, π_1 = k_1π_4 = 1407/6832.
These are the same steady-state probabilities that were obtained in Equation (2.22).
When Gaussian elimination is applied to the linear system (6.47), the first
step is to solve the first equation for π1 as a function of π2, π3, and π4 in the
following manner.
to obtain
or
Thus, the first step of Gaussian elimination has produced the first reduced
matrix, P^(2), which is calculated in Equation (6.37). This procedure can be
repeated until only two equations remain, producing the final reduced
matrix, P^(3). The results of this example can be generalized to conclude that
each step of Gaussian elimination produces a reduced coefficient matrix
equivalent to the reduced matrix produced by the corresponding step of matrix reduction.
Suppose that MFPTs to target state 0 are desired. When target state 0 is made
an absorbing state, the modified transition probability matrix, PM, is parti-
tioned in the following manner:
          0     1     2
      0 [ 1     0     0   ]
P_M = 1 [ p_10  p_11  p_12 ]  =  [ 1   0 ],      (6.54a)
      2 [ p_20  p_21  p_22 ]     [ D   Q ]

where
G = [Q  D  e],      (6.55)

D = 1 [ p_10 ]  =  1 [ g_13 ].      (6.57)
    2 [ p_20 ]     2 [ g_23 ]

Since the sum of the entries in the first three columns in each row of the
augmented matrix equals one, subtractions can be eliminated by making the
substitution
(1 − p_11) = (p_12 + p_10) = (1 − g_11) = (g_12 + g_13).
1. Initialize n =1
2. Let G(n) = G = [gij] for n ≤ i ≤ N and n ≤ j ≤ N + 2.
3. Partition G^(n) as

               n              n+1                ⋯   N+1                N+2
       n    [ g_nn^(n)        g_{n,n+1}^(n)      ⋯   g_{n,N+1}^(n)      g_{n,N+2}^(n)   ]
G^(n) = n+1  [ g_{n+1,n}^(n)   g_{n+1,n+1}^(n)    ⋯   g_{n+1,N+1}^(n)    g_{n+1,N+2}^(n) ]  =  [ g_nn^(n)   u^(n) ].      (6.59)
       ⋮    [ ⋮               ⋮                      ⋮                  ⋮               ]     [ v^(n)      T^(n) ]
       N    [ g_{N,n}^(n)     g_{N,n+1}^(n)      ⋯   g_{N,N+1}^(n)      g_{N,N+2}^(n)   ]
C. Back Substitution
4. Decrement i by 1.
5. If i > 0, go to step 3. Otherwise, stop.
B. Matrix Reduction
C. Back Substitution
Back substitution begins by computing the entry in row four of the vector of
MFPTs to state 0.
m_30 = (g_36^(3) + g_34^(3)m_40)/(g_34^(3) + g_35^(3))
     = (1.5318 + 0.3425(14.8104))/(0.3425 + 0.1341) = 13.8573.      (6.66)
m_20 = (g_26^(2) + g_23^(2)m_30 + g_24^(2)m_40)/(g_23^(2) + g_24^(2) + g_25^(2))
     = (1.1667 + 0.3333(13.8573) + 0.25(14.8104))/(0.3333 + 0.25 + 0.2) = 12.1128.      (6.67)
m_10 = (g_16^(1) + g_12^(1)m_20 + g_13^(1)m_30 + g_14^(1)m_40)/(g_12^(1) + g_13^(1) + g_14^(1) + g_15^(1))
     = (1 + 0.1(12.1128) + 0.2(13.8573) + 0.3(14.8104))/(0.1 + 0.2 + 0.3 + 0) = 15.7098.      (6.68)
This vector of MFPTs differs slightly from those computed in Equations (2.62)
and (3.22). Discrepancies are due to roundoff error because only the first four
significant decimal digits were stored.
        0      5      1     2     3     4
    0 [ 1      0      0     0     0     0   ]
    5 [ 0      1      0     0     0     0   ]
P = 1 [ 0      0      0.4   0.1   0.2   0.3 ]  =  [ I   0 ],      (1.59)
    2 [ 0.15   0.05   0.1   0.2   0.3   0.2 ]     [ D   Q ]
    3 [ 0.07   0.03   0.2   0.1   0.4   0.2 ]
    4 [ 0      0      0.3   0.4   0.2   0.1 ]

where
G = [Q  D_2  D_1] = [g_ij].      (6.70)

To avoid subtractions, matrix Q has been augmented with vectors D_1 and D_2,
where
D_1 = [p_10  p_20  p_30  p_40]^T = [0  0.15  0.07  0]^T,  D_2 = [p_15  p_25  p_35  p_45]^T = [0  0.05  0.03  0]^T.      (6.72)

Since the row sums of the augmented matrix equal one, subtractions are
avoided by making the substitution
(1 − p_11) = (p_12 + p_13 + p_14 + p_15 + p_10) = (1 − g_11) = (g_12 + g_13 + g_14 + g_15 + g_16).      (6.73)
B. Matrix Reduction
Matrix reduction is applied to the augmented matrix, G.
1. Initialize n = 1.
2. Let G(n) = G = [gij] for n ≤ i ≤ N and n ≤ j ≤ N + 2.
3. Partition G^(n) as

               n              n+1                ⋯   N+1                N+2
       n    [ g_nn^(n)        g_{n,n+1}^(n)      ⋯   g_{n,N+1}^(n)      g_{n,N+2}^(n)   ]
G^(n) = n+1  [ g_{n+1,n}^(n)   g_{n+1,n+1}^(n)    ⋯   g_{n+1,N+1}^(n)    g_{n+1,N+2}^(n) ]  =  [ g_nn^(n)   u^(n) ].
       ⋮    [ ⋮               ⋮                      ⋮                  ⋮               ]     [ v^(n)      T^(n) ]
       N    [ g_{N,n}^(n)     g_{N,n+1}^(n)      ⋯   g_{N,N+1}^(n)      g_{N,N+2}^(n)   ]
C. Back Substitution
2. Let i = N – 1.
3. For 1 ≤ i < N, Compute the entry in row i of the vector of the prob-
abilities of absorption in state 0.
f_i0 = [g_{i,N+2}^(i) + ∑_{h=i+1}^{N} g_ih^(i) f_h0] / ∑_{h=i+1}^{N+2} g_ih^(i).
4. Decrement i by 1.
5. If i > 0, go to step 3. Otherwise, stop.
A. Augmentation
B. Matrix Reduction
G^(3) = T^(2) + v^(2)[g_23^(2) + g_24^(2) + g_25^(2) + g_26^(2)]^{-1}u^(2)
G^(4) = T^(3) + v^(3)[g_34^(3) + g_35^(3) + g_36^(3)]^{-1}u^(3)
C. Back Substitution
Back substitution begins by computing the entry in row four of the vector of
probabilities of absorption in state 0.
f_30 = (g_36^(3) + g_34^(3)f_40)/(g_34^(3) + g_35^(3) + g_36^(3))
     = (0.0955 + 0.3426(0.7298))/(0.3426 + 0.0385 + 0.0955) = 0.7250.      (6.80)
f_20 = (g_26^(2) + g_23^(2)f_30 + g_24^(2)f_40)/(g_23^(2) + g_24^(2) + g_25^(2) + g_26^(2))
     = (0.15 + 0.3333(0.7250) + 0.25(0.7298))/(0.3333 + 0.25 + 0.05 + 0.15) = 0.7329.      (6.81)
f_10 = (g_16^(1) + g_12^(1)f_20 + g_13^(1)f_30 + g_14^(1)f_40)/(g_12^(1) + g_13^(1) + g_14^(1) + g_15^(1) + g_16^(1))
     = (0 + 0.1(0.7329) + 0.2(0.7250) + 0.3(0.7298))/(0.1 + 0.2 + 0.3 + 0 + 0) = 0.7287.      (6.82)
The probability fi0 that a patient in a transient state i will eventually be dis-
charged can be expressed as an entry in the vector f0 of the probabilities of
absorption in absorbing state 0.
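Because the augmented-matrix arithmetic is easy to mistype, a direct solve of (I − Q)f_0 = D_1 provides a useful cross-check. The sketch below is not part of the text; it uses Q and D_1 from (1.59) and (6.72), and its results agree with (6.80) through (6.82), and with the MFPT values in (6.66) through (6.68), up to the roundoff noted above.

```python
import numpy as np

# Transient states 1-4 of the chain in (1.59); absorbing states are 0 and 5.
Q = np.array([[0.4, 0.1, 0.2, 0.3],
              [0.1, 0.2, 0.3, 0.2],
              [0.2, 0.1, 0.4, 0.2],
              [0.3, 0.4, 0.2, 0.1]])
D1 = np.array([0.0, 0.15, 0.07, 0.0])      # one-step probabilities of entering state 0
f0 = np.linalg.solve(np.eye(4) - Q, D1)    # probabilities of absorption in state 0
m = np.linalg.solve(np.eye(4) - Q, np.ones(4))
print(np.round(f0, 4))   # ~[0.7285 0.7328 0.7249 0.7296], cf. (6.80)-(6.82)
print(np.round(m, 4))    # ~[15.7143 12.1164 13.8624 14.8148], cf. (6.66)-(6.68)
```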
TABLE 6.1
Symbolic Probabilities of Weather on Islands
Dry Weather, D Wet Weather, W
Island 1 P(D) = d1 P(W) = 1 − d1
Island 2 P(D) = d2 P(W) = 1 − d2
TABLE 6.2
Numerical Probabilities of Weather on Islands
Dry Weather, D Wet Weather, W
Island 1 P(D) = 0.55 P(W) = 0.45
Island 2 P(D) = 0.25 P(W) = 0.75
Suppose that the initial state probability vector for the two-state hidden
Markov chain is p(0) = [p_1(0)  p_2(0)] = [0.35  0.65].
where each observation, On, is one of the symbols from the alphabet V, and
M + 1 is the number of observations in the sequence. The observation symbols
are generated in accordance with an observation symbol probability distribu-
tion. The observation symbol probability distribution function in state i is
Weather = D Weather = W
B = [bi (k )] = Island 1 0.55 0.45 . (6.89)
Island 2 0.25 0.75
TABLE 6.3
Observation Symbol Probability Distribution when V={v1, v2}
Observation Symbol vk = v1 Observation Symbol vk = v2
B = [bi (k)] = State i = 1 P(On = v1|Xn =1) P(On = v2|Xn =1)
State i = 2 P(On = v1|Xn = 2) P(On = v2|Xn = 2)
TABLE 6.4
Observation Symbol Probability Distribution when V={D,W}
Observation Symbol v1 = D Observation Symbol v2 = W
B = [bi (k)] = State i = 1 P(On = D|Xn =1) P(On = W|Xn =1)
State i = 2 P(On = D|Xn =2) P(On = W|Xn =2)
TABLE 6.5
Observation Symbol Probability Distribution as a Function of di
Weather Symbol v1 = D Weather Symbol v2 = W
B = [bi (k)] = State 1 = Island 1 b1 (D) = d1 b1 (W) = 1−d1
State 2 = Island 2 b2 (D) = d2 b2 (W) = 1−d2
X_0 = i     X_1 = j     X_2 = k     ⋯   X_n = g     X_{n+1} = h     State
O_0         O_1         O_2         ⋯   O_n         O_{n+1}         Observation symbol
b_i(O_0)    b_j(O_1)    b_k(O_2)    ⋯   b_g(O_n)    b_h(O_{n+1})    Symbol probability
p_i(0)      p_ij        p_jk        ⋯   p_gh                        Transition probability
0           1           2           ⋯   n           n+1             Epoch
FIGURE 6.1
State transition and observation symbol generation for a sample path.
p(0) = [p_1(0)  p_2(0)  ⋯  p_N(0)] = [P(X_0 = 1)  P(X_0 = 2)  ⋯  P(X_0 = N)].
2. Set n = 0.
3. Choose On = vk according to the observation symbol probability dis-
tribution in state i,
In Table 6.6 below, all eight possible three-state sequences, X = {X_0, X_1, X_2},
are enumerated, and the joint probabilities P(X, O|λ) are calculated for each three-state sequence.
TABLE 6.6
Enumeration of Joint Probabilities for all Eight Possible Three-State Sequences
The sum, taken over all values
of X, of these eight joint probabilities is equal to the marginal probability,
P(O|λ). That is,
P(O|λ) = ∑_{all X} P(X, O|λ).
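Because the model has only two hidden states and three epochs, the enumeration in Table 6.6 can be reproduced by brute force. The sketch below is an illustration, not part of the text; the parameter values follow this example, with the initial vector p(0) = [0.35, 0.65] given above.

```python
import numpy as np
from itertools import product

p0 = np.array([0.35, 0.65])
P = np.array([[0.6, 0.4], [0.7, 0.3]])
B = np.array([[0.55, 0.45], [0.25, 0.75]])   # rows: islands 1, 2; columns: D, W
obs = [1, 0, 1]                              # O = {W, D, W}

total = 0.0
for X in product([0, 1], repeat=3):          # all eight three-state sequences
    p = p0[X[0]] * B[X[0], obs[0]]           # start in X0 and emit O0
    for n in range(1, 3):
        p *= P[X[n - 1], X[n]] * B[X[n], obs[n]]   # transition and emit
    total += p
print(total)    # ~0.164856, the value P(O|lambda) obtained by the forward and
                # backward procedures below
```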
The forward procedure defines, at epoch n, the forward variable
α_n(i) = P(O_0, O_1, O_2, . . . , O_n, X_n = i|λ),
which is the joint probability that the partial sequence {O_0, O_1, O_2, . . . , O_n} of
n + 1 observations is generated from epoch 0 until epoch n, and the HMM
is in state i at epoch n. The forward procedure has three steps, which are
labeled initialization, induction, and termination. The three steps will be
described in reference to the small HMM of the weather on two hidden
islands.
Step 1. Initialization
At epoch 0, the forward variable is α_0(i) = p_i(0)b_i(O_0) for 1 ≤ i ≤ N.
Step 2. Induction
At epoch 1,
α_1(j) = [∑_{i=1}^{2} α_0(i)p_ij] b_j(O_1) for 1 ≤ j ≤ N = 2.
Similarly, at epoch 2,
α_2(j) = P(O_0, O_1, O_2, X_2 = j|λ)
α_2(j) = p_1(0)b_1(O_0)p_11b_1(O_1)p_1j b_j(O_2) + p_1(0)b_1(O_0)p_12b_2(O_1)p_2j b_j(O_2)
         + p_2(0)b_2(O_0)p_21b_1(O_1)p_1j b_j(O_2) + p_2(0)b_2(O_0)p_22b_2(O_1)p_2j b_j(O_2)
       = p_1(0)b_1(O_0)p_11b_1(O_1)p_1j b_j(O_2) + p_2(0)b_2(O_0)p_21b_1(O_1)p_1j b_j(O_2)
         + p_1(0)b_1(O_0)p_12b_2(O_1)p_2j b_j(O_2) + p_2(0)b_2(O_0)p_22b_2(O_1)p_2j b_j(O_2).
α_2(j) = [∑_{i=1}^{2} α_1(i)p_ij] b_j(O_2) for 1 ≤ j ≤ N = 2.
By induction, one may conclude that in the general case the forward vari-
able is
α_{n+1}(j) = [∑_{i=1}^{N} α_n(i)p_ij] b_j(O_{n+1}) for epochs 0 ≤ n ≤ M − 1 and states 1 ≤ j ≤ N.
Step 3. Termination
At epoch M=2, the forward procedure ends with the desired probability
expressed as the sum of the terminal forward variables.
P(O|λ) = ∑_{all values of X_M} P(O, X_M|λ) = α_2(1) + α_2(2) = ∑_{i=1}^{2} α_2(i).
Step 2. Induction
α_{n+1}(j) = [∑_{i=1}^{N} α_n(i)p_ij] b_j(O_{n+1}) for epochs 0 ≤ n ≤ M − 1 and states 1 ≤ j ≤ N.
Step 3. Termination
P(O|λ) = P(O_0, O_1, O_2, . . . , O_M|λ) = ∑_{i=1}^{N} α_M(i).
The forward procedure will be executed to calculate P(O|λ) = P(W, D, W|λ) for the small HMM of the weather on two hidden
islands. Recall that the set of parameters λ = {p(0), P, B} for this example is
given in Equation (6.90).
Step 1. Initialization
At epoch 0, α_0(i) = p_i(0)b_i(O_0) = p_i(0)b_i(W), so that α_0(1) = (0.35)(0.45) = 0.1575 and α_0(2) = (0.65)(0.75) = 0.4875.
Step 2. Induction
At epoch 1,
α_1(j) = [∑_{i=1}^{2} α_0(i)p_ij] b_j(O_1) = [α_0(1)p_1j + α_0(2)p_2j] b_j(D)
α_1(1) = [α_0(1)p_11 + α_0(2)p_21]b_1(D) = [(0.1575)(0.6) + (0.4875)(0.7)](0.55) = 0.2396625
α_1(2) = [α_0(1)p_12 + α_0(2)p_22]b_2(D) = [(0.1575)(0.4) + (0.4875)(0.3)](0.25) = 0.0523125
At epoch 2,
α_2(j) = [∑_{i=1}^{2} α_1(i)p_ij] b_j(O_2) = [α_1(1)p_1j + α_1(2)p_2j] b_j(W)
Step 3. Termination
At epoch M = 2,
P(O|λ) = P(O_0, O_1, O_2|λ) = P(W, D, W|λ) = ∑_{i=1}^{2} α_2(i) = α_2(1) + α_2(2) ≈ 0.1649.
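The two induction steps above can be written as a short vectorized loop. The sketch below is not part of the text; it runs the forward procedure for O = {W, D, W} with this example's parameters and prints P(O|λ).

```python
import numpy as np

p0 = np.array([0.35, 0.65])
P = np.array([[0.6, 0.4],
              [0.7, 0.3]])
B = np.array([[0.55, 0.45],     # island 1: P(D), P(W)
              [0.25, 0.75]])    # island 2: P(D), P(W)
obs = [1, 0, 1]                 # observation sequence {W, D, W}; column 0 = D, 1 = W

alpha = p0 * B[:, obs[0]]                   # initialization: alpha_0(i)
for o in obs[1:]:
    alpha = (alpha @ P) * B[:, o]           # induction: alpha_{n+1}(j)
print(np.round(alpha, 7))                   # forward variables at epoch M = 2
print(alpha.sum())                          # P(O | lambda) ~ 0.164856
```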
i =1
which is the conditional probability that the partial sequence {On+1, On+2,
On+3, . . . , OM} of M − n observations is generated from epoch n + 1 until epoch
M, given that the HMM is in state i at epoch n. The backward procedure,
when it is applied to solve problem 1, also has three steps called initialization,
induction, and termination. The three-step backward procedure used to
solve problem 1 is given below.
Step 1. Initialization
At epoch M, the backward variable is
β M (i) = 1 for 1 ≤ i ≤ N .
Step 2. Induction
β_n(i) = ∑_{j=1}^{N} p_ij b_j(O_{n+1})β_{n+1}(j) for epochs n = M − 1, M − 2, . . . , 0 and states 1 ≤ i ≤ N.
Step 3. Termination
At epoch 0,
P(O|λ) = P(O_0, O_1, O_2, . . . , O_M|λ) = ∑_{i=1}^{N} P(O_0, X_0 = i|λ)P(O_1, O_2, O_3, . . . , O_M|X_0 = i, λ)
       = ∑_{i=1}^{N} P(X_0 = i)P(O_0|X_0 = i)P(O_1, O_2, O_3, . . . , O_M|X_0 = i, λ)
       = ∑_{i=1}^{N} p_i(0)b_i(O_0)β_0(i) = ∑_{i=1}^{N} α_0(i)β_0(i),
after substituting α 0 (i) = pi(0) bi (O0 ) (the initialization step of the forward
procedure).
The backward procedure will be executed to calculate P(O|λ) = P(O 0, O1,
O2|λ) = P(W, D, W|λ) for the small example involving the weather on two
hidden islands.
Step 1. Initialization.
At epoch M = 2, β_2(1) = β_2(2) = 1.
Step 2. Induction
At epoch 1,
β_1(i) = ∑_{j=1}^{2} p_ij b_j(O_2)β_2(j) = p_i1b_1(W)β_2(1) + p_i2b_2(W)β_2(2)
β_1(1) = p_11b_1(W)β_2(1) + p_12b_2(W)β_2(2) = (0.6)(0.45)(1) + (0.4)(0.75)(1) = 0.57
β_1(2) = p_21b_1(W)β_2(1) + p_22b_2(W)β_2(2) = (0.7)(0.45)(1) + (0.3)(0.75)(1) = 0.54
At epoch 0,
β_0(i) = ∑_{j=1}^{2} p_ij b_j(O_1)β_1(j) = p_i1b_1(D)β_1(1) + p_i2b_2(D)β_1(2)
β 0 (1) = p11b1 (D)β1 (1) + p12b2 (D)β1 (2) = (0.6)(0.55)(0.57) + (0.4)(0.25)(0.54) = 0.2421
β0 (2) = p21b1 (D)β1 (1) + p22b2 (D)β1 (2) = (0.7)(0.55)(0.57) + (0.3)(0.25)(0.54) = 0.25995.
Step 3. Termination
At epoch 0, after substituting α_0(i) = p_i(0)b_i(O_0),
P(O|λ) = ∑_{i=1}^{2} α_0(i)β_0(i) = (0.1575)(0.2421) + (0.4875)(0.25995) ≈ 0.1649,
which agrees with the value obtained by the forward procedure.
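The same probability can be recomputed with a short backward loop. A minimal sketch, not part of the text, using the same example parameters:

```python
import numpy as np

p0 = np.array([0.35, 0.65])
P = np.array([[0.6, 0.4], [0.7, 0.3]])
B = np.array([[0.55, 0.45], [0.25, 0.75]])     # rows: islands; columns: D, W
obs = [1, 0, 1]                                # {W, D, W}

beta = np.ones(2)                              # initialization at epoch M
for o in reversed(obs[1:]):
    beta = P @ (B[:, o] * beta)                # induction for epochs M-1, ..., 0
print(np.round(beta, 5))                       # beta_0 = [0.2421, 0.25995]
print(np.sum(p0 * B[:, obs[0]] * beta))        # P(O | lambda) ~ 0.164856
```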
Since the denominator on the right hand side is not a function of X, max-
imizing the conditional probability P(X|O, λ) with respect to X is equiva-
lent to maximizing the numerator P(X, O|λ) with respect to X. Hence an
equivalent objective of problem 2 is to find the state sequence X, which maxi-
mizes P(X, O|λ), the joint probability of state sequence X and observation
sequence O.
A formal procedure called the Viterbi algorithm, which is based on
dynamic programming, is used to calculate max_X P(X, O|λ) and find
argmax_X P(X, O|λ). The algorithm first calculates max_X P(X, O|λ). Then
the algorithm works backward or backtracks to recover the state sequence
X which maximizes P(X, O|λ). If this state sequence is not unique, the
Viterbi algorithm will find one state sequence which maximizes P(X, O|λ).
To find the state sequence X that is most likely to correspond to the given
observation sequence O, the algorithm defines, at epoch n, the quantity
δ_n(j) = max_{X_0, X_1, . . . , X_{n−1}} P(X_0, X_1, X_2, . . . , X_n = j, O_0, O_1, O_2, . . . , O_n|λ)      (6.99a)
The quantity δn(i) is the highest probability of the state sequence ending in
state Xn at epoch n, which accounts for the first n + 1 observations.
At epoch 0, δ_0(i) is initialized as
As is true for the forward variable, αn(i), and the backward variable, βn(i), the
probability δn(i) can be calculated by induction. The induction equation for
calculating δn(i) is developed informally below.
At epoch 0,
At epoch 1,
δ_1(j) = max_{i=1,...,N} [p_i(0)b_i(O_0)p_ij] b_j(O_1) = max_{i=1,...,N} [δ_0(i)p_ij] b_j(O_1) for j = 1, . . . , N.
At epoch 2,
δ_2(X_2 = k) = max_{X_1} [max_{X_0} p_{X_0}(0)b_{X_0}(O_0)p_{X_0,X_1}b_{X_1}(O_1)] p_{X_1,X_2}b_{X_2}(O_2)
Recall that δM(i) is the highest probability of the state sequence ending in
state X M at epoch M, which accounts for the observation sequence ending in
symbol OM at epoch M. Therefore, the joint probability of this state sequence
and the associated observation sequence is
To recover the state sequence {X0, X1, X2, . . . ,XM}, which has the highest proba-
bility of generating the observation sequence {O0, O1, O2, . . . ,OM}, it is necessary
to keep a record of the argument, which maximizes the induction equation for
δn+1(j) for each n and j. To construct this record, an array ψn(j) is defined as
Step 2. Recursion
Step 3. Termination
P* = max_{1≤i≤N} [δ_M(i)]
X_M* = argmax_{1≤i≤N} [δ_M(i)].
X_n* = ψ_{n+1}(X_{n+1}*), for n = M − 1, M − 2, . . . , 0.
The Viterbi algorithm will be applied to the example HMM of the weather
on two hidden islands to recover the state sequence, X = {X0, X1, X2}, which
has the highest probability of generating the weather observation sequence,
O = {O 0, O1, O2} = {W, D, W}, representing the weather at three consecutive
epochs. Recall that the hidden state indicates which island is the source of
the observed weather symbol. The algorithm calculates max_X P(X, O) and
retrieves argmax_X P(X, O). The set of model parameters λ = {p(0), P, B} for this
example is given in equation (6.90).
Step 1. Initialization
At epoch 0,
Step 2. Recursion
At epoch 1,
At epoch 2,
Step 3. Termination
X_n* = ψ_{n+1}(X_{n+1}*), for n = (2 − 1), . . . , 0 = 1, 0
X_1* = ψ_{1+1}(X_{1+1}*) = ψ_2(X_2*) = ψ_2(2) = 1, for n = 1
X_0* = ψ_{0+1}(X_{0+1}*) = ψ_1(X_1*) = ψ_1(1) = 2, for n = 0.
Note that P* = 0.0563062 is the highest joint probability that a state sequence
{X_0, X_1, X_2} accounts for the given observation sequence {O_0, O_1, O_2} = {W, D, W}.
Backtracking has retrieved the accountable state sequence X = {X_0, X_1, X_2} =
{2, 1, 2}. Hence, the most likely sequence of hidden islands to have generated the observed weather sequence {W, D, W} is {Island 2, Island 1, Island 2}.
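A compact implementation of the recursion and backtracking steps (a sketch, not part of the text) reproduces P* and the state sequence {2, 1, 2} for this example:

```python
import numpy as np

p0 = np.array([0.35, 0.65])
P = np.array([[0.6, 0.4], [0.7, 0.3]])
B = np.array([[0.55, 0.45], [0.25, 0.75]])   # rows: islands; columns: D, W
obs = [1, 0, 1]                              # O = {W, D, W}
M = len(obs)

delta = p0 * B[:, obs[0]]                    # initialization: delta_0(i)
psi = np.zeros((M, 2), dtype=int)
for n in range(1, M):
    scores = delta[:, None] * P              # scores[i, j] = delta_{n-1}(i) * p_ij
    psi[n] = scores.argmax(axis=0)           # best predecessor for each state j
    delta = scores.max(axis=0) * B[:, obs[n]]
path = [int(delta.argmax())]                 # termination: most likely final state
for n in range(M - 1, 0, -1):                # backtracking
    path.append(int(psi[n][path[-1]]))
path.reverse()
print(delta.max())                           # ~0.0563062, the value of P* above
print([s + 1 for s in path])                 # [2, 1, 2]: islands 2, 1, 2
```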
PROBLEMS
6.1 Consider a regular Markov chain that has the following transi-
tion probability matrix:
State 0 1 2 3
0 0.23 0.34 0.26 0.17
P= 1 0.08 0.42 0.18 0.32
2 0.31 0.17 0.23 0.29
3 0.24 0.36 0.34 0.06
State 0 1 2 3
0 0.23 0.34 0.26 0.17
P= 1 0.08 0.42 0.18 0.32
2 0.31 0.17 0.23 0.29
3 0.24 0.36 0.34 0.06
State 0 4 5 1 2 3
0 1 0 0 0 0 0
4 0 1 0 0 0 0
I 0
P= 5 0 0 1 0 0 0 =
D Q , where
1 0.15 0.05 0.20 0.2 0.3 0.1
2 0.09 0.03 0.18 0.1 0.4 0.2
3 0.12 0.02 0.06 0.5 0.2 0.1
1 0 0 1 0.15 0.05 0.20 1 0.2 0.3 0.1
I = 0 1 0 , D = 2 0.09 0.03 0.18 , Q = 2 0.1 0.4 0.2
0 0 1 3 0.12 0.02 0.06 3 0.5 0.2 0.1
State 1 2 3 X n\X n + 1 1 2 3
1 p11 p12 p13 1 0.24 0.56 0.20
P = [ pij ] = =
2 p21 p22 p23 2 0.38 0.22 0.40
3 p31 p32 p33 3 0.28 0.37 0.35
B = [bi (k )] =
Observation Observation Observation Observation
State
Symbol c Symbol d Symbol e Symbol f
i = 1 P(On = c X n = 1) P(On = d X n = 1) P(On = e X n = 1) P(On = f X n = 1)
i = 2 P(On = c X n = 2) P(On = d X n = 2) P(On = e X n = 2) P(On = f X n = 2)
i = 3 P(On = c X n = 3) P(On = d X n = 3) P(On = e X n = 3) P(On = f X n = 3)
References
1. Grassmann, W. K., Taksar, M. I., and Heyman, D. P., Regenerative analysis and
steady state distributions for Markov chains, Op. Res., 33, 1107, 1985.
2. Kemeny, J. G. and Snell, J. L., Finite Markov Chains, Van Nostrand, Princeton, NJ,
1960. Reprinted by Springer-Verlag, New York, 1976.
3. Kohlas, J., Numerical computation of mean passage times and absorption prob-
abilities in Markov and semi-Markov models, Zeitschrift für Op. Res., 30, A197,
1986.
4. Sheskin, T. J., A Markov chain partitioning algorithm for computing steady state
probabilities, Op. Res., 33, 228, 1985.
5. Rabiner, L. R., A tutorial on hidden Markov models and selected applications in
speech recognition. Proc. IEEE. 77, 257, 1989.
6. Koski, T., Hidden Markov Models for Bioinformatics, Kluwer Academic Publishers,
Dordrecht, 2001.
7. Ewens, W. J. and Grant, G. R., Statistical Methods in Bioinformatics: An Introduction,
Springer, New York, 2001.