Markov Chains and Decision Processes For Engineers and Managers


MARKOV CHAINS and

DECISION PROCESSES
for ENGINEERS and
MANAGERS


Theodore J. Sheskin

Boca Raton London New York

CRC Press is an imprint of the
Taylor & Francis Group, an informa business



CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2011 by Taylor and Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed in the United States of America on acid-free paper


10 9 8 7 6 5 4 3 2 1

International Standard Book Number-13: 978-1-4200-5112-4 (Ebook-PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com

and the CRC Press Web site at
http://www.crcpress.com
Contents

Preface
Author

Chapter 1 Markov Chain Structure and Models
1.1 Historical Note
1.2 States and Transitions
1.3 Model of the Weather
1.4 Random Walks
1.4.1 Barriers
1.4.1.1 Absorbing Barriers
1.4.1.2 Gambler’s Ruin
1.4.1.3 Reflecting Barriers
1.4.2 Circular Random Walk
1.5 Estimating Transition Probabilities
1.5.1 Conditioning on the Present State
1.5.2 Conditioning on the Present and Previous States
1.6 Multiple-Step Transition Probabilities
1.7 State Probabilities after Multiple Steps
1.8 Classification of States
1.9 Markov Chain Structure
1.9.1 Unichain
1.9.1.1 Irreducible
1.9.1.2 Reducible Unichain
1.9.2 Multichain
1.9.3 Aggregated Canonical Form of the Transition Matrix
1.10 Markov Chain Models
1.10.1 Unichain
1.10.1.1 Irreducible
1.10.1.2 Reducible Unichain
1.10.2 Reducible Multichain
1.10.2.1 Absorbing Markov Chain
1.10.2.2 Eight-State Multichain Model of a Production Process
Problems
References

Chapter 2 Regular Markov Chains
2.1 Steady-State Probabilities
2.1.1 Calculating Steady-State Probabilities for a Generic Two-State Markov Chain
2.1.2 Calculating Steady-State Probabilities for a Four-State Model of Weather
2.1.3 Steady-State Probabilities for Four-State Model of Inventory System
2.1.4 Steady-State Probabilities for Four-State Model of Component Replacement
2.2 First Passage to a Target State
2.2.1 Probability of First Passage in n Steps
2.2.2 Mean First Passage Times
2.2.2.1 MFPTs for a Five-State Markov Chain
2.2.2.2 MFPTs for a Four-State Model of Component Replacement
2.2.2.3 MFPTs for a Four-State Model of Weather
2.2.3 Mean Recurrence Time
2.2.3.1 Mean Recurrence Time for a Five-State Markov Chain
2.2.3.2 Mean Recurrence Times for a Four-State Model of Component Replacement
2.2.3.3 Optional Insight: Mean Recurrence Time as the Reciprocal of the Steady-State Probability for a Two-State Markov Chain
Problems
References

Chapter 3 Reducible Markov Chains
3.1 Canonical Form of the Transition Matrix
3.1.1 Unichain
3.1.2 Multichain
3.1.3 Aggregation of the Transition Matrix in Canonical Form
3.2 The Fundamental Matrix
3.2.1 Definition of the Fundamental Matrix
3.2.2 Mean Time in a Particular Transient State
3.2.3 Mean Time in All Transient States
3.2.4 Absorbing Multichain Model of Patient Flow in a Hospital
3.3 Passage to a Target State
3.3.1 Mean First Passage Times in a Regular Markov Chain Revisited
3.3.2 Probability of First Passage in n Steps
3.3.2.1 Reducible Unichain
3.3.2.2 Reducible Multichain
3.3.3 Probability of Eventual Passage to a Recurrent State
3.3.3.1 Reducible Unichain
3.3.3.2 Reducible Multichain
3.4 Eventual Passage to a Closed Set within a Reducible Multichain
3.4.1 Method One: Replacing Recurrent Sets with Absorbing States and Using the Fundamental Matrix
3.4.1.1 Five-State Reducible Multichain
3.4.1.2 Multichain Model of an Eight-State Serial Production Process
3.4.2 Method Two: Direct Calculation without Using the Fundamental Matrix
3.5 Limiting Transition Probability Matrix
3.5.1 Recurrent Multichain
3.5.2 Absorbing Markov Chain
3.5.3 Absorbing Markov Chain Model of Patient Flow in a Hospital
3.5.4 Reducible Unichain
3.5.4.1 Reducible Four-State Unichain
3.5.4.2 Reducible Unichain Model of Machine Maintenance
3.5.5 Reducible Multichain
3.5.5.1 Reducible Five-State Multichain
3.5.5.2 Reducible Multichain Model of an Eight-State Serial Production Process
3.5.5.3 Conditional Mean Time to Absorption
Problems
References

Chapter 4 A Markov Chain with Rewards (MCR)
4.1 Rewards
4.1.1 Planning Horizon
4.1.2 Reward Vector
4.2 Undiscounted Rewards
4.2.1 MCR Chain Structure
4.2.2 A Recurrent MCR over a Finite Planning Horizon
4.2.2.1 An MCR Model of Monthly Sales
4.2.2.2 Value Iteration over a Fixed Planning Horizon
4.2.2.3 Lengthening a Finite Planning Horizon
4.2.2.4 Numbering the Time Periods Forward
4.2.3 A Recurrent MCR over an Infinite Planning Horizon
4.2.3.1 Expected Average Reward or Gain
4.2.3.2 Value Determination Equations (VDEs)
4.2.3.3 Value Iteration
4.2.3.4 Examples of Recurrent MCR Models
4.2.4 A Unichain MCR
4.2.4.1 Expected Average Reward or Gain
4.2.4.2 Value Determination Equations
4.2.4.3 Solution by Value Iteration of Unichain MCR Model of Machine Maintenance under Modified Policy of Doing Nothing in State 3
4.2.4.4 Expected Total Reward before Passage to a Closed Set
4.2.4.5 Value Iteration over a Finite Planning Horizon
4.2.5 A Multichain MCR
4.2.5.1 An Eight-State Multichain MCR Model of a Production Process
4.2.5.2 Expected Average Reward or Gain
4.2.5.3 Reward Evaluation Equations
4.2.5.4 Expected Total Reward before Passage to a Closed Set
4.2.5.5 Value Iteration over a Finite Planning Horizon
4.3 Discounted Rewards
4.3.1 Time Value of Money
4.3.2 Value Iteration over a Finite Planning Horizon
4.3.2.1 Value Iteration Equation
4.3.2.2 Value Iteration for Discounted MCR Model of Monthly Sales
4.3.3 An Infinite Planning Horizon
4.3.3.1 VDEs for Expected Total Discounted Rewards
4.3.3.2 Value Iteration for Expected Total Discounted Rewards
Problems
References

Chapter 5 A Markov Decision Process (MDP)
5.1 An Undiscounted MDP
5.1.1 MDP Chain Structure
5.1.2 A Recurrent MDP
5.1.2.1 A Recurrent MDP Model of Monthly Sales
5.1.2.2 Value Iteration over a Finite Planning Horizon
5.1.2.3 An Infinite Planning Horizon
5.1.3 A Unichain MDP
5.1.3.1 Policy Iteration (PI)
5.1.3.2 Linear Programming
5.1.3.3 Examples of Unichain MDP Models
5.1.4 A Multichain MDP
5.1.4.1 Multichain Model of a Flexible Production System
5.1.4.2 PI for a Multichain MDP
5.1.4.3 Linear Programming
5.1.4.4 A Multichain MDP Model of Machine Maintenance
5.2 A Discounted MDP
5.2.1 Value Iteration over a Finite Planning Horizon
5.2.1.1 Value Iteration Equation
5.2.1.2 Value Iteration for Discounted MDP Model of Monthly Sales
5.2.2 An Infinite Planning Horizon
5.2.2.1 Value Iteration
5.2.2.2 Policy Iteration
5.2.2.3 LP for a Discounted MDP
5.2.2.4 Examples of Discounted MDP Models
Problems
References

Chapter 6 Special Topics: State Reduction and Hidden Markov Chains
6.1 State Reduction
6.1.1 Markov Chain Partitioning Algorithm for Computing Steady-State Probabilities
6.1.1.1 Matrix Reduction of a Partitioned Markov Chain
6.1.1.2 Optional Insight: Informal Justification of the Formula for Matrix Reduction
6.1.1.3 Optional Insight: Informal Derivation of the MCPA
6.1.1.4 Markov Chain Partitioning Algorithm
6.1.1.5 Using the MCPA to Compute the Steady-State Probabilities for a Four-State Markov Chain
6.1.1.6 Optional Insight: Matrix Reduction and Gaussian Elimination
6.1.2 Mean First Passage Times
6.1.2.1 Forming the Augmented Matrix
6.1.2.2 State Reduction Algorithm for Computing MFPTs
6.1.2.3 Using State Reduction to Compute MFPTs for a Five-State Markov Chain
6.1.3 Absorption Probabilities
6.1.3.1 Forming the Augmented Matrix
6.1.3.2 State Reduction Algorithm for Computing Absorption Probabilities
6.1.3.3 Using State Reduction to Compute Absorption Probabilities for an Absorbing Multichain Model of Patient Flow in a Hospital
6.2 An Introduction to Hidden Markov Chains
6.2.1 HMM of the Weather
6.2.2 Generating an Observation Sequence
6.2.3 Parameters of an HMM
6.2.4 Three Basic Problems for HMMs
6.2.4.1 Solution to Problem 1
6.2.4.2 Solution to Problem 2
Problems
References

Index


Preface

The goal of this book is to provide engineers and managers with a unified
treatment of Markov chains and Markov decision processes. The unified
treatment of both subjects distinguishes this book from many others. The
prerequisites are matrix algebra and elementary probability. In addition, lin-
ear programming is used in several sections of Chapter 5 as an alternative
procedure for finding an optimal policy for a Markov decision process. These
sections may be omitted without loss of continuity. The book will be of inter-
est to seniors and beginning graduate students in quantitative disciplines
including engineering, science, applied mathematics, operations research,
management, and economics. Although written as a textbook, the book is
also suitable for self-study by engineers, managers, and other quantitatively
educated individuals. People who study this book will be prepared to con-
struct and solve Markov models for a variety of random processes.
Many books on Markov chains or decision processes are either highly the-
oretical, with few examples, or else highly prescriptive, with little justifica-
tion for the steps of the algorithms used to solve Markov models. This book
balances both algorithms and applications. Engineers and quantitatively
trained managers will be reluctant to use a formula or execute an algorithm
without an explanation of the logical relationships on which they are based.
On the other hand, they are not interested in proving theorems. In this book,
formulas and algorithms are derived informally, occasionally in a section
labeled as an optional insight. The validity of a formula is often justified by
applying it to a small generic Markov model for which transition probabili-
ties or rewards are expressed as symbols. The validity of other relationships
is demonstrated by applying them to larger Markov models with numerical
transition probabilities or rewards. Informal derivations and demonstrations
of the validity of formulas are carried out in considerable detail.
Since engineers and managers are interested in applications, considerable
attention is devoted to the construction of Markov models. A large number of
simplified Markov models are constructed for a wide assortment of processes
important to engineers and managers including the weather, gambling, dif-
fusion of gases, a waiting line, inventory, component replacement, machine
maintenance, selling a stock, a charge account, a career path, patient flow in
a hospital, marketing, and a production line. The book is distinguished by
the high level of detail with which the construction and solution of Markov
models are described. Many of these Markov models have numerical transi-
tion probabilities and rewards. Descriptions of the step-by-step calculations
made by the algorithms implemented for solving these models will facilitate
the student’s ability to apply the algorithms.


The text is organized around a Markov chain structure. Chapter 1 describes
Markov chain states, transitions, structure, and models. Chapter 2 discusses
steady-state distributions and passage to a target state in a regular Markov
chain. A regular chain has the simplest structure because all of its states com-
municate. Chapter 3 treats canonical forms and passage to target states or to
classes of target states for reducible Markov chains. Reducible chains have
more complex structures because they contain one or more closed classes
of communicating states plus transient states. Chapter 4 adds an economic
dimension to a Markov chain by associating rewards with states, thereby
linking a Markov chain to a Markov decision process. The measure of effect-
iveness for a Markov chain with rewards is an expected reward. Chapter 5
adds decisions to create a Markov decision process, enabling an analyst to
choose among alternative Markov chains with rewards so as to maximize an
expected reward. Last, Chapter 6 introduces two interesting special topics
that are rarely included in a senior and beginning graduate level text: state
reduction and hidden Markov chains.
Markov chains and Markov decision processes are commonly treated in
separate books, which use different conventions for numbering the time
periods in a planning horizon. When a Markov chain is analyzed, time peri-
ods are numbered forward, starting with epoch 0 at the beginning of the
horizon. However, in books in which value iteration is executed for a Markov
decision process over an infinite planning horizon, time periods are num-
bered backward, starting with epoch 0 at the end of the horizon. In this book,
the time periods are always numbered forward to establish a consistent
convention for both Markov chains and decision processes. When value iteration
is executed for a Markov decision process over an infinite horizon, time
periods are numbered as consecutive negative integers, starting with epoch
0 at the end of the horizon.
I thank Professor David Sheskin of Western Connecticut State University
for helpful suggestions and for encouraging me to write this book. I also
thank Eugene Sheskin for his support. I am grateful to Cindy Carelli, editor
of industrial engineering and operations research at CRC Press, for her
enormous patience. I also thank the reviewers: Professors Isaac Sonin of
University of North Carolina at Charlotte, M. Jeya Chandra of Pennsylvania
State University, and Henry Lewandowski of Baldwin Wallace College.

Theodore J. Sheskin



Author

Theodore J. Sheskin is professor emeritus of industrial engineering at
Cleveland State University. He earned a B.S. in electrical engineering from
the Massachusetts Institute of Technology, an M.S. in electrical engineering
from Syracuse University, and a Ph.D. in industrial engineering and opera-
tions research from Pennsylvania State University. Professor Sheskin is the
sole author of 21 papers published in peer-reviewed journals of engineering
and mathematical methods. He is a registered professional engineer.

Chapter 1 Markov Chain Structure and Models

The goal of this book is to prepare students of engineering and management
science to construct and solve Markov models of physical and social
systems that may produce random outcomes. Chapter 1 is an introduction
to Markov chain models and structure. A Markov chain is an indexed col-
lection of random variables used to model a sequence of dependent events
such that the probability of the next event depends only on the present
event. In other words, a Markov chain assumes that, given the present
event, the probability of the next event is independent of the past events.
A Markov decision process is a sequential decision process for which the
decisions produce a sequence of Markov chains with rewards. In this
book, closed form solutions will be obtained algebraically for small generic
Markov models. Larger models will be solved numerically.

1.1 Historical Note


The Markov chain model was created by A. A. Markov, a professor of mathe-
matics at St. Petersburg University in Russia. He lived from 1856 to 1922, made
significant contributions to the theory of probability, and also participated in
protests against the Czarist government. In the United States, modern treat-
ments of Markov chains were initiated by W. Feller of Princeton University
in his 1950 book, An Introduction to Probability Theory and its Applications, and
by J. G. Kemeny and J. L. Snell of Dartmouth College in their 1960 book,
Finite Markov Chains [5]. This author follows the approach of Kemeny and
Snell.
Many of the basic concepts of a Markov decision process were developed
by mathematicians at the RAND Corporation in California in the late
1940s and early 1950s. The policy iteration algorithm used to determine
an optimal set of decisions was created by R. A. Howard. His pioneer-
ing and eloquent 1960 book, Dynamic Programming and Markov Processes,
based on his doctoral dissertation at M.I.T., stimulated great interest in
Markov decision processes, and motivated this author’s treatment of that
subject.


1.2 States and Transitions


A random process, which is also called a stochastic process, is an indexed
sequence of random variables. Assume that the sequence is observed at equally
spaced points in time called epochs. The time interval between successive
epochs is called a period or step. At each epoch, the system is observed to be
in one of a finite number of mutually exclusive categories or states. A state is
one of the possible values that the random variable can have. A Markov chain
is the simplest kind of random process because it has the Markov property.
The Markov property assumes the simplest kind of dependency, namely, that
given the present state of a random process, the conditional probability of the
next state depends only on the present state, and is independent of the past
history of the process. This simplification has enabled engineers and manag-
ers to develop mathematically tractable models that can be used to analyze
a variety of physical, economic, and social systems. A random process that
lacks the Markov property is one for which knowledge of its past history is
needed in order to probabilistically model its future behavior.
A Markov chain has N states designated 1, 2, … , N. All Markov chains
treated in this book are assumed to have a finite number of states, so that N is
a finite integer. The set of all possible states, called the state space, is denoted
by E = {1, 2, … , N}. The state of the chain is observed at equally spaced points
in time called epochs. An epoch is denoted by n = 0, 1, 2, … . Epoch n des-
ignates the end of time period n, which is also the beginning of period n + 1.
A sequence of n consecutive time periods, each marked by an end-of-period
epoch, is shown in Figure 1.1.
The random variable $X_n$ represents the state of the chain at epoch $n$. A
Markov chain is an indexed sequence of random variables, $\{X_0, X_1, X_2, \ldots\}$,
which has the Markov property. The index, $n$, is the epoch at which the state
of the random variable is observed. Thus, a Markov chain may be viewed as
a sequence of states, which are observed at consecutive epochs, as shown in
Figure 1.2.
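
Stated symbolically, the Markov property of this sequence can be sketched as follows. This is the standard formulation; the symbols $i_0, i_1, \ldots, i_{n-1}$, which stand for the states occupied before the present epoch, are notation assumed here for illustration.

$$
P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = P(X_{n+1} = j \mid X_n = i)
$$

The left-hand side conditions on the entire history of the process, while the right-hand side conditions on the present state alone. The Markov property asserts that the two are equal.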
While the index n most commonly represents an epoch or time of observa-
tion, it can also represent other parameters, such as the order of an observa-
tion. For example, the index n can indicate the nth item inspected, or the nth
customer served, or the nth trial in a contest.
FIGURE 1.1
Sequence of consecutive epochs. [Timeline: Epoch 0, Epoch 1, Epoch 2, …, Epoch n − 1, Epoch n, with Period 1, Period 2, …, Period n lying between successive epochs.]

FIGURE 1.2
Sequence of states. [Timeline: states X0, X1, X2, …, Xn observed at epochs 0, 1, 2, …, n, which are separated by Period 1, Period 2, …, Period n.]

If at epoch $n$ the chain is in state $i$, then $X_n = i$. The probability that the
chain is in state $i$ at epoch $n$, which may represent the present epoch, is
denoted by $P(X_n = i)$. Then the probability that the chain is in state $j$ at
epoch $n + 1$, the next epoch, is denoted by $P(X_{n+1} = j)$. The conditional
probability that the chain will be in state $j$ at epoch $n + 1$, given that it is in
state $i$ at epoch $n$, is denoted by $P(X_{n+1} = j \mid X_n = i)$. This conditional
probability is called a transition probability, and is denoted by $p_{ij}$. Thus, the
transition probability,

$$
p_{ij} = P(X_{n+1} = j \mid X_n = i), \tag{1.1}
$$

represents the conditional probability that if the chain is in state i at the
current epoch, then it will be in state j at the next epoch. The transition
probability, pij, represents a one-step transition because it is the probability of
making a transition in one time period or step. The Markov property ensures
that a transition probability depends only on the present state of the process,
denoted by Xn. In other words, the history of the process, represented by the
sequence of states, {X0, X1, X2, … , Xn−1}, occupied prior to the present epoch,
can be ignored. Transition probabilities are assumed to be stationary over
time, or time homogeneous. That is, they do not change over time. Therefore,
the transition probability,

$$
p_{ij} = P(X_{n+1} = j \mid X_n = i) = P(X_1 = j \mid X_0 = i), \tag{1.2}
$$

is constant, independent of the epoch n.


The transition probabilities for a Markov chain with N states are collected
in an N × N square matrix, called a one-step transition probability matrix, or
more simply, a transition probability matrix, which is denoted by P. To avoid
abstraction, and without loss of generality, consider a Markov chain with
N = 4 states. The transition probability matrix is given by

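$$
P = \begin{array}{c|cccc}
X_n \backslash X_{n+1} & 1 & 2 & 3 & 4 \\ \hline
1 & p_{11} & p_{12} & p_{13} & p_{14} \\
2 & p_{21} & p_{22} & p_{23} & p_{24} \\
3 & p_{31} & p_{32} & p_{33} & p_{34} \\
4 & p_{41} & p_{42} & p_{43} & p_{44}
\end{array}. \tag{1.3}
$$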

Each row of P represents the present state, at epoch n. Each column repre-
sents the next state, at epoch n + 1. Since the probability of a transition is
conditioned on the present state, the entries in every row of P sum to one. In
addition, all entries are nonnegative, and no entry is greater than one. The
matrix P is called a stochastic matrix.
When the number of states in a Markov chain is small, the chain can be
represented by a graph, which consists of nodes connected by directed arcs.
In a transition probability graph, a node i denotes a state. An arc directed
from node i to node j denotes a transition from state i to state j with transition
probability pij. For example, consider a two-state generic Markov chain
with the following transition probability matrix:

$$
P = \begin{array}{c|cc}
\text{State} & 1 & 2 \\ \hline
1 & p_{11} & p_{12} \\
2 & p_{21} & p_{22}
\end{array}. \tag{1.4}
$$

The corresponding transition probability graph is shown in Figure 1.3.


A Markov chain evolves over time as it moves from state to state in accor-
dance with its transition probabilities. Suppose the probability that a Markov
chain is in a particular state after n transitions is of interest. Then, in addi-
tion to the one-step transition probabilities, the initial state probabilities are
also needed. The initial state probability for a state j is the probability that at
epoch 0 the process starts in state j. Let $p_j^{(0)} = P(X_0 = j)$ denote the initial state
probability for state j. The initial state probabilities for all states in a Markov
chain with four states are arranged in a row vector of initial state probabilities,
designated by

$$
\begin{aligned}
p^{(0)} &= [\,p_1^{(0)} \;\; p_2^{(0)} \;\; p_3^{(0)} \;\; p_4^{(0)}\,] \\
&= [\,P(X_0 = 1) \;\; P(X_0 = 2) \;\; P(X_0 = 3) \;\; P(X_0 = 4)\,].
\end{aligned} \tag{1.5}
$$

FIGURE 1.3
Transition probability graph for a two-state Markov chain. [Nodes: State 1 and State 2. Self-loops p11 and p22; arc p12 from State 1 to State 2; arc p21 from State 2 to State 1.]

The initial state probability vector specifies the probability distribution of
the initial state. The elements of every state probability vector, including the
elements of an initial state probability vector, must sum to one. Hence,

$$
p_1^{(0)} + p_2^{(0)} + p_3^{(0)} + p_4^{(0)} = 1. \tag{1.6}
$$

State                    X0 = a   X1 = b   X2 = c   X3 = d   X4 = e
Transition probability   pa(0)    pab      pbc      pcd      pde
Epoch                    0        1        2        3        4

FIGURE 1.4
State transitions for a sample path.

A sequence of states visited by a Markov chain as it evolves over time is
called a sample path. The state sequence {X0, X1, X2, X3, X4} is a sample path
during the first four periods. State transitions for a sample path are shown
in Figure 1.4.
The probability of a sample path is a joint probability, which can be calculated
by invoking the Markov property. For example, the probability of the
sample path shown in Figure 1.4 is

$$
\begin{aligned}
P(X_0 = a, X_1 = b, X_2 = c, X_3 = d, X_4 = e)
&= P(X_0 = a)\, P(X_1 = b \mid X_0 = a)\, P(X_2 = c \mid X_1 = b) \\
&\quad \times P(X_3 = d \mid X_2 = c)\, P(X_4 = e \mid X_3 = d) \\
&= p_a^{(0)}\, p_{ab}\, p_{bc}\, p_{cd}\, p_{de}.
\end{aligned} \tag{1.7}
$$

If the initial state is known, so that $P(X_0 = a) = p_a^{(0)} = 1$, the probability of a
particular sample path, given the initial state, is calculated below by using
the Markov property once again:

$$
\begin{aligned}
P(X_1 = b, X_2 = c, X_3 = d, X_4 = e \mid X_0 = a)
&= P(X_1 = b \mid X_0 = a)\, P(X_2 = c \mid X_1 = b)\, P(X_3 = d \mid X_2 = c)\, P(X_4 = e \mid X_3 = d) \\
&= p_{ab}\, p_{bc}\, p_{cd}\, p_{de}.
\end{aligned} \tag{1.8}
$$

1.3 Model of the Weather


As an example of a Markov chain model, suppose that weather in Cleveland,
Ohio, can be classified as either raining, snowing, cloudy, or sunny [5].
Observations of the weather are made at the same time every day. The daily
weather is assumed to have the Markov property, which means that the
weather tomorrow depends only on the weather today. That is, the weather
yesterday or on prior days will not affect the weather tomorrow. Since the
daily weather is assumed to have the Markov property, the weather will be
modeled as a Markov chain with four states. The state Xn denotes the weather
on day n for n = 0, 1, 2, 3, …. The states are indexed below:

State, X n Description
1 Raining
2 Snowing
3 Cloudy
4 Sunny

The state space is E = {1, 2, 3, 4}.


Transition probabilities are based on the following observations. If it is
raining today, the probabilities that tomorrow will bring rain, snow, clouds,
or sun are 0.3, 0.1, 0.4, and 0.2, respectively. If it is snowing today, the prob-
abilities of rain, snow, clouds, or sun tomorrow are 0.2, 0.5, 0.2, and 0.1,
respectively. If today is cloudy, the probabilities that rain, snow, clouds, or
sun will appear tomorrow are 0.3, 0.2, 0.1, and 0.4, respectively. Finally, if
today is sunny, the probabilities that tomorrow it will be snowing, cloudy,
or sunny are 0.6, 0.3, and 0.1, respectively. (A sunny day is never followed
by a rainy day.) Transition probabilities are obtained in the following man-
ner. If day n designates today, then day n + 1 designates tomorrow. If Xn
designates the state today, then Xn+1 designates the state tomorrow. If it
is raining today, then Xn = 1. If it is cloudy tomorrow, then Xn+1 = 3. The
observation that if it is raining today, then the conditional probability that
tomorrow will be cloudy is 0.4, is expressed as the following transition
probability:

$$
P(\text{Cloudy tomorrow} \mid \text{Raining today}) = P(X_{n+1} = 3 \mid X_n = 1) = p_{13} = 0.4.
$$

The 15 remaining transition probabilities for the four-state Markov chain are
obtained in similar fashion to produce the following transition probability
matrix:

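$$
P = \begin{array}{c|cccc}
X_n \backslash X_{n+1} & 1 & 2 & 3 & 4 \\ \hline
1 & p_{11} & p_{12} & p_{13} & p_{14} \\
2 & p_{21} & p_{22} & p_{23} & p_{24} \\
3 & p_{31} & p_{32} & p_{33} & p_{34} \\
4 & p_{41} & p_{42} & p_{43} & p_{44}
\end{array}
= \begin{array}{c|cccc}
\text{State} & 1 & 2 & 3 & 4 \\ \hline
1 & 0.3 & 0.1 & 0.4 & 0.2 \\
2 & 0.2 & 0.5 & 0.2 & 0.1 \\
3 & 0.3 & 0.2 & 0.1 & 0.4 \\
4 & 0 & 0.6 & 0.3 & 0.1
\end{array}. \tag{1.9}
$$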

Observe that the entries in every row of P sum to one.
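
The row-sum property can be checked mechanically. The following is a minimal sketch in Python, assuming the matrix of Equation (1.9) is stored as a list of rows; the function name is_stochastic and the tolerance tol are illustrative choices, not part of the text.

```python
# Transition probability matrix from Equation (1.9).
# Rows are today's state, columns are tomorrow's state.
# State order: 1 = raining, 2 = snowing, 3 = cloudy, 4 = sunny.
P = [
    [0.3, 0.1, 0.4, 0.2],
    [0.2, 0.5, 0.2, 0.1],
    [0.3, 0.2, 0.1, 0.4],
    [0.0, 0.6, 0.3, 0.1],
]


def is_stochastic(matrix, tol=1e-9):
    """Return True if the matrix is square, every entry lies in [0, 1],
    and every row sums to one (within the tolerance tol)."""
    n = len(matrix)
    for row in matrix:
        if len(row) != n:
            return False
        if any(entry < 0.0 or entry > 1.0 for entry in row):
            return False
        if abs(sum(row) - 1.0) > tol:
            return False
    return True


print(is_stochastic(P))
```

A tolerance is used because decimal probabilities such as 0.1 are not stored exactly in floating point, so a row sum may differ from one by a tiny rounding error.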


Suppose that an initial state probability vector for the four-state Markov
chain model of the weather is

p(0) = 0.2 0.3 0.4 0.1 



=  p1(0) p2(0) p3(0) p(0)
4    (1.10)
= [P(X0 = 1) P(X0 = 2) P(X0 = 3) P(X0 = 4)].

Since p3(0) = P(X0 = 3) = 0.4, the chain has a probability of 0.4 of starting in
state 3, representing a cloudy day. This vector also indicates that the chain
has a probability of 0.2 of starting in state 1, denoting a rainy day, a probabil-
ity of 0.3 of starting in state 2, indicating a snowy day, and a 0.1 probability of
starting in state 4, designating a sunny day. Note that the entries of p(0) sum
to one.
Suppose the chain starts in state 3, indicating a cloudy day, so that
P(X0 = 3) = p3(0) = 1. The probability of a particular sample path during the
next 5 days, given that the initial state is X0 = 3, is calculated below:

$$
\begin{aligned}
P(X_1 = 1, X_2 = 4, X_3 = 2, X_4 = 3, X_5 = 4 \mid X_0 = 3)
&= p_{31}\, p_{14}\, p_{42}\, p_{23}\, p_{34} \\
&= (0.3)(0.2)(0.6)(0.2)(0.4) = 0.00288.
\end{aligned}
$$
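
The same product can be formed programmatically for any sample path. The sketch below assumes states are labeled 1 to 4 as in the weather model; the helper name path_probability and its initial_prob argument are hypothetical names introduced here for illustration.

```python
# Transition probability matrix from Equation (1.9).
P = [
    [0.3, 0.1, 0.4, 0.2],
    [0.2, 0.5, 0.2, 0.1],
    [0.3, 0.2, 0.1, 0.4],
    [0.0, 0.6, 0.3, 0.1],
]


def path_probability(P, path, initial_prob=1.0):
    """Multiply the one-step transition probabilities along a sample path.

    path is a list of 1-based state labels [X0, X1, ...]; initial_prob is
    P(X0 = path[0]), which equals 1 when the initial state is known."""
    prob = initial_prob
    for i, j in zip(path, path[1:]):
        prob *= P[i - 1][j - 1]  # p_ij for the transition i -> j
    return prob


# Cloudy (3) -> rain (1) -> sun (4) -> snow (2) -> clouds (3) -> sun (4)
prob = path_probability(P, [3, 1, 4, 2, 3, 4])
print(f"{prob:.5f}")  # prints 0.00288
```

Passing initial_prob = 0.4, the value of p3(0) in Equation (1.10), would instead give the joint probability of the path including the uncertain starting state, in the manner of Equation (1.7).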

1.4 Random Walks


As a second example of a Markov chain model, suppose that a particle moves
among positions on a line by one position at a time. The positions on the line
are marked by consecutive integers. At each epoch, the particle moves one
position to the right with probability p, or one position to the left with proba-
bility 1 − p. The particle continues to move, either right or left, until it reaches
one of two end positions called barriers. Since the movement of the particle
depends only on its present position and not on any of its previous positions,
the process can be modeled as a Markov chain. This chain is called a ran-
dom walk because it may describe the unsteady movement of an intoxicated
individual [5]. The states, represented by consecutive integers, are the pos-
sible positions of the particle on the line. When the particle is in position i,
the random walk is in state i. Hence, the state Xn = i denotes the position
of the particle after n moves. Consider a five-state random walk shown in
Figure 1.5 with the state space E = {0, 1, 2, 3, 4}.

0 1 2 3 4

FIGURE 1.5
Random walk with five states.

For a state i such that 0 < i < 4, the probability of moving one position to
the right is

$$
p_{i,i+1} = P(X_{n+1} = i + 1 \mid X_n = i) = p.
$$

Similarly, the probability of moving one position to the left is

$$
p_{i,i-1} = P(X_{n+1} = i - 1 \mid X_n = i) = 1 - p.
$$

The remaining transition probabilities are zero. That is,

$$
p_{i,j} = 0 \quad \text{for } j \neq i - 1 \text{ and } j \neq i + 1.
$$

The transition probabilities for i = 0 and i = 4, the terminal states, depend on
the nature of the barriers represented by these terminal states.

1.4.1 Barriers
Random walks may have various types of barriers. Three common types of
barriers are absorbing, reflecting, and partially reflecting.

1.4.1.1 Absorbing Barriers


When a particle reaches an absorbing barrier, it remains there. Thus, an
absorbing barrier is represented by a state, which is called an absorbing
state. The transition probability matrix for a five-state random walk with
two absorbing barriers, represented by absorbing states 0 and 4, is

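$$
P = [p_{ij}] = \begin{array}{c|ccccc}
X_n \backslash X_{n+1} & 0 & 1 & 2 & 3 & 4 \\ \hline
0 & 1 & 0 & 0 & 0 & 0 \\
1 & 1-p & 0 & p & 0 & 0 \\
2 & 0 & 1-p & 0 & p & 0 \\
3 & 0 & 0 & 1-p & 0 & p \\
4 & 0 & 0 & 0 & 0 & 1
\end{array}. \tag{1.11}
$$

As Section 1.8 indicates, when a state i is an absorbing state, pii = 1. A Markov
chain cannot leave an absorbing state.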

1.4.1.2 Gambler’s Ruin


The gambler’s ruin model is an interesting application of a random walk
with two absorbing barriers [1, 3]. Suppose that a game between two gam-
blers, called player A and player B, is a sequence of independent bets or tri-
als. Each player starts with $2. Both players bet $1 at a time. The loser pays
$1 to the winner. They agree to continue betting until one player wins all
the money, a total of $4. The gambler who loses all her money is ruined. To
model this game as a random walk, let the state, Xn, denote the amount of
money player A has after n trials. Since the total money is $4, the state space,
in units of dollars, is E = {0, 1, 2, 3, 4}. Since player A starts with $2, X0 = 2.
If player A wins each $1 bet with probability p, and loses each $1 bet with
probability 1 − p, the five-state gambler’s ruin model has the transition prob-
ability matrix given for the random walk in Section 1.4.1.1. States 0 and 4 are
both absorbing states, which indicate that the game is over. If the game ends
in state 0, then player A has lost all her initial money, $2. If the game ends
in state 4, then player A has won all of her opponent’s money, increasing her
total to $4. The transition probability matrix is shown in Equation (1.11). This
model of a gambler’s ruin is treated in Section 3.3.3.2.1.

1.4.1.3 Reflecting Barriers


When a particle reaches a reflecting barrier, it enters the adjacent state on
the next transition. The transition probability matrix for a five-state random
walk with two reflecting barriers, represented by states 0 and 4, is

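$$
P = [p_{ij}] = \begin{array}{c|ccccc}
X_n \backslash X_{n+1} & 0 & 1 & 2 & 3 & 4 \\ \hline
0 & 0 & 1 & 0 & 0 & 0 \\
1 & 1-p & 0 & p & 0 & 0 \\
2 & 0 & 1-p & 0 & p & 0 \\
3 & 0 & 0 & 1-p & 0 & p \\
4 & 0 & 0 & 0 & 1 & 0
\end{array}. \tag{1.12}
$$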

1.4.2 Circular Random Walk


A circular five-state random walk represents a particle that moves on a
circle through the state space E = {0, 1, 2, 3, 4}. The probability of moving clockwise
to an adjacent state is p, and the probability of moving counterclockwise
to an adjacent state is 1 − p. Since states 0 and 4 are adjacent on the circle, a
circular random walk has no barriers. The transition probability matrix is

$$
P = [p_{ij}] = \begin{array}{c|ccccc}
X_n \backslash X_{n+1} & 0 & 1 & 2 & 3 & 4 \\ \hline
0 & 0 & p & 0 & 0 & 1-p \\
1 & 1-p & 0 & p & 0 & 0 \\
2 & 0 & 1-p & 0 & p & 0 \\
3 & 0 & 0 & 1-p & 0 & p \\
4 & p & 0 & 0 & 1-p & 0
\end{array}. \tag{1.13}
$$
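
The three random walk matrices in Equations (1.11) through (1.13) differ only in the rows for states 0 and 4, so they can be generated by one routine. The following is a sketch under that observation; the function name random_walk_matrix and the string labels "absorbing", "reflecting", and "circular" are illustrative assumptions, not notation from the text.

```python
def random_walk_matrix(p, n_states=5, barrier="absorbing"):
    """Build the transition matrix of a random walk on states 0..n_states-1.

    Interior states move right with probability p and left with 1 - p.
    barrier selects the rows for the two end states:
      "absorbing"  -> Equation (1.11)
      "reflecting" -> Equation (1.12)
      "circular"   -> Equation (1.13), where states 0 and the last are adjacent
    """
    q = 1.0 - p
    last = n_states - 1
    P = [[0.0] * n_states for _ in range(n_states)]

    # Interior states 1, ..., last - 1 are the same in all three models.
    for i in range(1, last):
        P[i][i + 1] = p
        P[i][i - 1] = q

    if barrier == "absorbing":
        P[0][0] = 1.0
        P[last][last] = 1.0
    elif barrier == "reflecting":
        P[0][1] = 1.0
        P[last][last - 1] = 1.0
    elif barrier == "circular":
        P[0][1] = p
        P[0][last] = q
        P[last][0] = p
        P[last][last - 1] = q
    else:
        raise ValueError(f"unknown barrier type: {barrier!r}")
    return P


for kind in ("absorbing", "reflecting", "circular"):
    print(kind)
    for row in random_walk_matrix(0.5, barrier=kind):
        print("  ", row)
```

With barrier="absorbing" and p equal to player A's probability of winning a bet, the same matrix serves as the gambler's ruin model of Section 1.4.1.2.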


1.5 Estimating Transition Probabilities


In developing a Markov chain model, an engineer must assume but cannot
prove that a process possesses the Markov property. Transition probabilities
can be estimated by counting the transitions among states over a long period
of time [1]. A procedure for estimating transition probabilities is demon-
strated by the following example of a sequential inspection process. Suppose
that every hour a quality inspector either accepts a product, sends it back to
be reworked, or rejects it. The inspector’s goal is to construct a Markov chain
model of product quality.

1.5.1 Conditioning on the Present State


As a first approximation, the inspector assumes that a prediction of the out-
come of the next inspection depends solely on the outcome of the present
inspection. By making this assumption, a Markov chain model can be con-
structed. The state, Xn, denotes the outcome of the nth inspection. Three out-
comes are defined:

State, X n Outcome
A Product is accepted
W Product is sent back to be reworked
R Product is rejected

The state space has three entries given by E = {A, W, R}. After 61 inspections,
suppose that the following sequence of states has been observed: {A, A, A, W,
A, R, A, A, A, W, W, A, A, R, A, A, A, W, R, A, A, W, A, W, A, A, R, A, A, A, R,
W, A, R, A, A, A, W, W, A, A, A, A, R, A, A, W, A, A, R, A, A, R, W, A, A, R, A,
A, W, A}.
A transition occurs when one inspection follows another. A transition
probability, pij, is estimated by the proportion of inspections in which state i
is followed by state j. Let nij denote the number of transitions from state i to
state j. Let ni denote the number of transitions made from state i. By counting
transitions, Table 1.1 of transition counts is obtained.
Note that the 61st or last outcome is not counted in the sum of transitions
because it is not followed by a transition to another outcome. The estimated
transition probability is

$$
p_{ij} = n_{ij} / n_i = P(X_{n+1} = j \mid X_n = i). \tag{1.14}
$$

The matrix of estimated transition probabilities appears in Table 1.2.


TABLE 1.1
Transition Counts

State   A          W          R          Row Sum of Transitions
A       nAA = 21   nAW = 8    nAR = 9    nA = 38
W       nWA = 9    nWW = 2    nWR = 1    nW = 12
R       nRA = 8    nRW = 2    nRR = 0    nR = 10
                                         Sum of transitions = 60

TABLE 1.2
Matrix of Estimated Transition Probabilities

      Xn\Xn+1   A             W             R
      A         pAA = 21/38   pAW = 8/38    pAR = 9/38
P =   W         pWA = 9/12    pWW = 2/12    pWR = 1/12
      R         pRA = 8/10    pRW = 2/10    pRR = 0/10
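
The counting procedure behind Tables 1.1 and 1.2 can be sketched in a few lines of Python. The variable names (observed, counts, row_totals, P_hat) are illustrative; the sequence is the 61 outcomes listed above, and exact fractions are used so that the estimates match the tabulated ratios.

```python
from collections import Counter
from fractions import Fraction

# The 61 observed inspection outcomes, in order (ten per line, then the last one).
observed = (
    "A A A W A R A A A W "
    "W A A R A A A W R A "
    "A W A W A A R A A A "
    "R W A R A A A W W A "
    "A A A R A A W A A R "
    "A A R W A A R A A W "
    "A"
).split()

states = ["A", "W", "R"]

# n_ij: number of times outcome i is immediately followed by outcome j.
counts = Counter(zip(observed, observed[1:]))

# n_i: number of transitions made from state i (the last outcome makes none).
row_totals = {i: sum(counts[(i, j)] for j in states) for i in states}

# Estimated transition probabilities p_ij = n_ij / n_i, as in Equation (1.14).
P_hat = {
    i: {j: Fraction(counts[(i, j)], row_totals[i]) for j in states}
    for i in states
}

for i in states:
    print(i, [str(P_hat[i][j]) for j in states], "n_i =", row_totals[i])
```

Fraction reduces each ratio to lowest terms, so 9/12 is reported as 3/4 and 8/10 as 4/5.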

1.5.2 Conditioning on the Present and Previous States


Now suppose that knowledge of the outcome of the present inspection is no
longer considered sufficient to predict the outcome of the next inspection.
Instead, this prediction will be based on the outcomes of both the present
inspection and the previous inspection. If the current definition of a state
as the outcome of the most recent inspection is retained, the process is no
longer a Markov chain. However, the process will become a Markov chain
if a new state is defined as the pair of outcomes of the two most recent
inspections.
To obtain a better approximation, the model has been enlarged, which
makes it less tractable computationally. Since the original model has three
states, and all pairs of consecutive outcomes may be possible, the new model
has 3^2 = 9 states. In the enlarged model, the state, Yn = (Xn−1, Xn), denotes the
pair of outcomes of the (n − 1)th and nth inspections. The state space has nine
entries given by E = {AA, AW, AR, WA, WW, WR, RA, RW, RR}. For example,
states WA and RW for the enlarged process are defined below:

WA: the previous product was sent back to be reworked, and the pre-
sent product was accepted,
RW: the previous product was rejected, and the present product was
sent back to be reworked.


TABLE 1.3
Transition Counts for the Enlarged Process

State   AA   AW   AR   WA   WW   WR   RA   RW   RR   Row Sum
AA       7    7    7    0    0    0    0    0    0      21
AW       0    0    0    5    2    1    0    0    0       8
AR       0    0    0    0    0    0    7    2    0       9
WA       5    1    2    0    0    0    0    0    0       8
WW       0    0    0    2    0    0    0    0    0       2
WR       0    0    0    0    0    0    1    0    0       1
RA       8    0    0    0    0    0    0    0    0       8
RW       0    0    0    2    0    0    0    0    0       2
RR       0    0    0    0    0    0    0    0    0       0
                                        Sum of transitions = 59
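
The counts in Table 1.3 can be reproduced by first converting the 61 outcomes into 60 overlapping pairs and then counting transitions between consecutive pairs. This is a sketch of that idea; the names pairs, pair_counts, and row_sum are illustrative.

```python
from collections import Counter

# The 61 observed inspection outcomes from Section 1.5.1.
observed = (
    "A A A W A R A A A W "
    "W A A R A A A W R A "
    "A W A W A A R A A A "
    "R W A R A A A W W A "
    "A A A R A A W A A R "
    "A A R W A A R A A W "
    "A"
).split()

# Enlarged states: each is the pair (previous outcome, present outcome).
pairs = [a + b for a, b in zip(observed, observed[1:])]  # 60 pair states

# Transitions between consecutive pair states: 59 in total.
pair_counts = Counter(zip(pairs, pairs[1:]))


def row_sum(state):
    """Number of transitions made from the given pair state."""
    return sum(n for (src, _dst), n in pair_counts.items() if src == state)


for src in ["AA", "AW", "AR", "WA", "WW", "WR", "RA", "RW", "RR"]:
    targets = {dst: n for (s, dst), n in pair_counts.items() if s == src}
    print(src, targets, "row sum =", row_sum(src))
```

The row sum for WA is 8 rather than 9 because the final pair in the sequence, WA, is not followed by another transition.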

The 61 outcomes for the sequence of inspections indicate that the process has
moved from state AA to AA to AW to WA to AR to RA to AA to AA to AW to
WW to WA to AA to AR to RA to AA to AA to AW to WR to RA to AA to AW
to WA to AW to WA to AA to AR to RA to AA to AA to AR to RW to WA to
AR to RA to AA to AA to AW to WW to WA to AA to AA to AA to AR to RA
to AA to AW to WA to AA to AR to RA to AA to AR to RW to WA to AA to AR
to RA to AA to AW, and lastly to WA. Observe that the first letter of the state
to which the process moves must agree with the second letter of the state from
which it moves, since they both refer to the outcome of the same inspection.
Since the new model has nine states, the new transition matrix has 9^2 = 81
entries. Table 1.3 of transition counts is obtained for the enlarged process.
Note that unless additional products are inspected, state RR is not a pos-
sible outcome. Hence, the number of states can be reduced by one. A transi-
tion probability, pij, jk, is estimated by the proportion of inspections in which
the pair of outcomes i, j is followed by the pair of outcomes j, k. The estimated
transition probability is

$$
p_{ij,jk} = P(X_n = j, X_{n+1} = k \mid X_{n-1} = i, X_n = j). \tag{1.15}
$$

The matrix of estimated transition probabilities for the enlarged process
appears in Table 1.4.

TABLE 1.4
Matrix of Estimated Transition Probabilities for the Enlarged Process

      (Xn−1 = i, Xn = j) \ (Xn = j, Xn+1 = k)   AA     AW     AR     WA    WW    WR    RA    RW
      AA                                        7/21   7/21   7/21   0     0     0     0     0
      AW                                        0      0      0      5/8   2/8   1/8   0     0
      AR                                        0      0      0      0     0     0     7/9   2/9
P =   WA                                        5/8    1/8    2/8    0     0     0     0     0
      WW                                        0      0      0      2/2   0     0     0     0
      WR                                        0      0      0      0     0     0     1/1   0
      RA                                        8/8    0      0      0     0     0     0     0
      RW                                        0      0      0      2/2   0     0     0     0

1.6 Multiple-Step Transition Probabilities

As Section 1.2 indicates, a one-step transition probability matrix for a Markov
chain is denoted by P. Suppose that transition probabilities after more than
one epoch are of interest. Initially, suppose that transition probabilities at
every second epoch are desired. Such transition probabilities are called two-step
transition probabilities. They are denoted by $p_{ij}^{(2)}$. Since transition
probabilities are stationary in time,

$$
p_{ij}^{(2)} = P(X_{n+2} = j \mid X_n = i) = P(X_2 = j \mid X_0 = i) \tag{1.16}
$$

is the conditional probability of moving from state i to state j in two steps.
The two-step transition probabilities are arranged in a matrix designated by
$P^{(2)} = [p_{ij}^{(2)}]$. To see how to calculate two-step transition probabilities, consider
the one-step transition probability matrix shown in Equation (1.9) for the
four-state Markov chain model of the weather.
The conditional probability of going from state i to state j in two steps is
equal to the probability of going from state i to an intermediate state k on
the first step multiplied by the probability of going from state k to state j on
the second step, summed over all states k. The state k can be any state of the
process, including i and j. Therefore, k = 1, 2, 3, and 4, and

$$
p_{ij}^{(2)} = p_{i1}p_{1j} + p_{i2}p_{2j} + p_{i3}p_{3j} + p_{i4}p_{4j} = \sum_{k=1}^{4} p_{ik}\, p_{kj}. \tag{1.17}
$$

By the rules of matrix multiplication, $p_{ij}^{(2)}$, the (i, j)th element of the two-step
transition matrix, $P^{(2)}$, is also the (i, j)th element of the product matrix, $P^2$,
obtained by multiplying the matrix of one-step transition probabilities by
itself. That is,

$$
P^{(2)} = P \cdot P = P^2. \tag{1.18}
$$


For the model of the weather,

1 0.3 0.1 0.4 0.2 0.3 0.1 0.4 0.2 


2  0.2 0.5 0.2 0.1  0.2 0.5 0.2 0.1 
P(2) = P ⋅ P = P2 =   
3 0.3 0.2 0.1 0.4  0.3 0.2 0.1 0.4  
  
4 0 0.6 0.3 0.1  0 0.6 0.3 0.1 
 (1.19)
1 0.23 0.28 0.24 0.25 
2  0.22 0.37 0.23 0.18  
=  . 
3 0.16 0.39 0.29 0.16  
  
4  0.21 0.42 0.18 0.19 

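Observe that

$$
\begin{aligned}
p_{34}^{(2)} &= p_{31}p_{14} + p_{32}p_{24} + p_{33}p_{34} + p_{34}p_{44} = \sum_{k=1}^{4} p_{3k}\, p_{k4} \\
&= (0.3)(0.2) + (0.2)(0.1) + (0.1)(0.4) + (0.4)(0.1) = 0.16,
\end{aligned}
$$

which is element (3, 4) of $P^2$. Thus,

$$
\begin{aligned}
p_{34}^{(2)} &= P(X_2 = 4 \mid X_0 = 3) = P(X_{n+2} = 4 \mid X_n = 3) = 0.16 \\
&= P(\text{Sunny 2 days from now} \mid \text{Cloudy today}).
\end{aligned}
$$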
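
The two-step matrix can be verified numerically. The sketch below uses a small pure-Python matrix product (the helper name mat_mul is an illustrative choice) rather than a numerical library, so that the summation in Equation (1.17) is visible in the code.

```python
# Transition probability matrix from Equation (1.9).
P = [
    [0.3, 0.1, 0.4, 0.2],
    [0.2, 0.5, 0.2, 0.1],
    [0.3, 0.2, 0.1, 0.4],
    [0.0, 0.6, 0.3, 0.1],
]


def mat_mul(A, B):
    """Multiply two matrices stored as lists of rows.

    Element (i, j) of the product is the sum over k of A[i][k] * B[k][j],
    which is the summation in Equation (1.17) when A = B = P."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


P2 = mat_mul(P, P)
for row in P2:
    print([round(x, 2) for x in row])

# Element (3, 4): probability of a sunny day two days after a cloudy day.
# Python indices are 0-based, so state 3 is index 2 and state 4 is index 3.
print(round(P2[2][3], 2))  # prints 0.16
```

Values are rounded for display only, since floating-point products of decimals such as 0.1 carry small rounding errors.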

An n-step transition probability is denoted by

$$
p_{ij}^{(n)} = P(X_n = j \mid X_0 = i) = P(X_{r+n} = j \mid X_r = i). \tag{1.20}
$$

The result for a two-step transition matrix can be generalized to show that
for an N-state Markov chain, a matrix of n-step transition probabilities is
obtained by raising the matrix of one-step transition probabilities to the nth
power. That is,

$$
P^{(n)} = P^n = P P^{n-1} = P^{n-1} P. \tag{1.21}
$$

The matrix of (n + m)-step transition probabilities is equal to

$$
P^{(n+m)} = P^{n+m} = P^n P^m = P^{(n)} P^{(m)}. \tag{1.22}
$$

By the rules of matrix multiplication, for a Markov chain with N states,
element (i, j) of $P^{(n+m)}$ is given by

$$
p_{ij}^{(n+m)} = \sum_{k=1}^{N} p_{ik}^{(n)}\, p_{kj}^{(m)}, \quad \text{for } n = 0, 1, 2, \ldots;\; m = 0, 1, 2, \ldots;\; i = 1, 2, \ldots, N;\; j = 1, 2, \ldots, N. \tag{1.23}
$$

These expressions are called the Chapman–Kolmogorov equations. They can
be justified informally by the following algebraic argument, in which the sum
is taken over the intermediate state k occupied at epoch n:

$$
\begin{aligned}
p_{ij}^{(n+m)} &= P(X_{n+m} = j \mid X_0 = i) \\
&= \sum_{k=1}^{N} P(X_{n+m} = j, X_n = k \mid X_0 = i) \\
&= \sum_{k=1}^{N} P(X_{n+m} = j, X_n = k, X_0 = i) / P(X_0 = i) \\
&= \sum_{k=1}^{N} P(X_{n+m} = j \mid X_n = k, X_0 = i)\, P(X_n = k, X_0 = i) / P(X_0 = i) \\
&= \sum_{k=1}^{N} P(X_{n+m} = j \mid X_n = k, X_0 = i)\, P(X_n = k \mid X_0 = i)\, P(X_0 = i) / P(X_0 = i) \\
&= \sum_{k=1}^{N} P(X_{n+m} = j \mid X_n = k)\, P(X_n = k \mid X_0 = i) \\
&= \sum_{k=1}^{N} P(X_n = k \mid X_0 = i)\, P(X_{n+m} = j \mid X_n = k) \\
&= \sum_{k=1}^{N} p_{ik}^{(n)}\, p_{kj}^{(m)}.
\end{aligned}
$$
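
As a numerical illustration of the Chapman–Kolmogorov equations, the sketch below compares the five-step matrix of the weather model with the product of its two-step and three-step matrices. The helper names mat_mul, identity, and mat_pow are illustrative, and the choice n = 2, m = 3 is arbitrary.

```python
# Transition probability matrix from Equation (1.9).
P = [
    [0.3, 0.1, 0.4, 0.2],
    [0.2, 0.5, 0.2, 0.1],
    [0.3, 0.2, 0.1, 0.4],
    [0.0, 0.6, 0.3, 0.1],
]


def mat_mul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def identity(n):
    """n x n identity matrix, which plays the role of P^(0)."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def mat_pow(P, n):
    """n-step transition matrix P^(n) = P^n, by repeated multiplication."""
    result = identity(len(P))
    for _ in range(n):
        result = mat_mul(result, P)
    return result


n, m = 2, 3
lhs = mat_pow(P, n + m)                        # P^(n+m)
rhs = mat_mul(mat_pow(P, n), mat_pow(P, m))    # P^(n) P^(m)

# Largest elementwise difference; expected to be at rounding-error level.
max_gap = max(abs(lhs[i][j] - rhs[i][j]) for i in range(4) for j in range(4))
print(max_gap < 1e-12)
```

The two sides are compared with a tolerance because they are computed by different sequences of floating-point operations.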

1.7 State Probabilities after Multiple Steps


An n-step transition probability, $p_{ij}^{(n)} = P(X_n = j \mid X_0 = i)$, is a conditional
probability, conditioned on the process starting in a given state i. In contrast, an
unconditional probability that the process is in state j after n steps, denoted by
P(Xn = j), requires a probability distribution for the starting state. This unconditional
probability is called a state probability, and is also denoted by $p_j^{(n)}$. Thus,

$$
p_j^{(n)} = P(X_n = j). \tag{1.24}
$$

The state probabilities for an N-state Markov chain after n steps are collected
in a 1 × N row vector, p(n), termed a vector of state probabilities. The vector
of state probabilities specifies the probability distribution of the state after n
steps. For a four-state Markov chain, the vector of state probabilities is rep-
resented by

$$
\begin{aligned}
p^{(n)} &= [\,p_1^{(n)} \;\; p_2^{(n)} \;\; p_3^{(n)} \;\; p_4^{(n)}\,] \\
&= [\,P(X_n = 1) \;\; P(X_n = 2) \;\; P(X_n = 3) \;\; P(X_n = 4)\,].
\end{aligned} \tag{1.25}
$$

The vector of state probabilities after n transitions, p(n), can be determined
only if the matrix of one-step transition probabilities, P, is accompanied by
a vector of initial state probabilities. The vector of initial state probabilities,
introduced in Section 1.2, is a special case of the vector of state probabilities,
when n = 0. As Equation (1.5) indicates, the initial state probability vector
specifies the probability distribution of the initial state.
A state probability can be calculated by multiplying the probability of
starting in a particular state by an n-step transition probability, and sum-
ming over all states. For a four-state Markov chain,

$$
P(X_n = j) = \sum_{i=1}^{4} P(X_0 = i)\, P(X_n = j \mid X_0 = i), \quad \text{or} \quad p_j^{(n)} = \sum_{i=1}^{4} p_i^{(0)}\, p_{ij}^{(n)}. \tag{1.26}
$$

In matrix form,

$$
p^{(n)} = p^{(0)} P^{(n)}. \tag{1.27}
$$

Note that

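$$
\begin{aligned}
p^{(1)} &= p^{(0)} P^{(1)} = p^{(0)} P \\
p^{(2)} &= p^{(0)} P^{(2)} = p^{(0)} (PP) = (p^{(0)} P) P = p^{(1)} P \\
p^{(3)} &= p^{(0)} P^{(3)} = p^{(0)} (P^{(2)} P) = (p^{(0)} P^{(2)}) P = p^{(2)} P.
\end{aligned} \tag{1.28}
$$

By induction,

$$
p^{(n)} = p^{(0)} P^{(n)} = p^{(n-1)} P.
$$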

Thus, the vector of state probabilities is equal to the initial state probability
vector postmultiplied by the matrix of n-step transition probabilities, and is
also equal to the vector of state probabilities after n − 1 transitions postmultiplied
by the matrix of one-step transition probabilities.

Consider the four-state Markov chain model of the weather for which the
transition probability matrix is given in Equation (1.9). Suppose that the ini-
tial state probability vector is given by Equation (1.10). After 1 day, the vector
of state probabilities is

0.3 0.1 0.4 0.2 


 0.2 0.5 0.2 0.1
p(1) = p(0) P = [0.2 0.3 0.4 0.1] 
0.3 0.2 0.1 0.4 
 
 0 0.6 0.3 0.1
= [0.24 0.31 0.21 0.24 ]. (1.29)

Thus, after 1 day, the process has a probability of 0.31 of being in state 2, and
a 0.24 probability of being in state 4. The matrix of two-step transition prob-
abilities is given by

1 0.3 0.1 0.4 0.2 0.3 0.1 0.4 0.2


2  0.2 0.5 0.2 0.1  0.2 0.5 0.2 0.1
P(2) = P2 =   
3 0.3 0.2 0.1 0.4  0.3 0.2 0.1 0.4 
  
4 0 0.6 0.3 0.1  0 0.6 0.3 0.1
0.23 0.28 0.24 0.25
 0.22 0.37 0.23 0.18 
= . (1.30)
0.16 0.39 0.29 0.16 
 
 0.21 0.42 0.18 0.19

Thus, the probability of moving from state 1 to state 4 in 2 days is 0.25. After
2 days the vector of state probabilities is equal to

0.23 0.28 0.24 0.25 


 0.22 0.37 0.23 0.18  
p(2) = p(0) P(2) = 0.2 0.3 0.4 0.1  
0.16 0.39 0.29 0.16   (1.31)
 
 0.21 0.42 0.18 0.19 

= 0.197 0.365 0.251 0.187 . 

51113_C001.indd 17 9/23/2010 12:58:21 PM


18 Markov Chains and Decision Processes for Engineers and Managers

Alternatively, the vector of state probabilities after 2 days is equal to


0.3 0.1 0.4 0.2 
 0.2 0.5 0.2 0.1 
p(2) = p(1) P = 0.24 0.31 0.21 0.24   
0.3 0.2 0.1 0.4  
  (1.32)
 0 0.6 0.3 0.1 

= 0.197 0.365 0.251 0.187  . 

After 2 days the probability of being in state 2 is 0.365. Finally, the matrix of
three-step transition probabilities is given by

1 0.23 0.28 0.24 0.25 0.3 0.1 0.4 0.2 


2  0.22 0.37 0.23 0.18   0.2 0.5 0.2 0.1 
P(3) = P 3 = P(2) P =   
3 0.16 0.39 0.29 0.16  0.3 0.2 0.1 0.4  
  
4  0.21 0.42 0.18 0.19  0 0.6 0.3 0.1 
 (1.33)
0.197 0.361 0.247 0.195 
 0.209 0.361 0.239 0.191 
= . 
 0.213 0.365 0.219 0.203  
  
 0.201 0.381 0.243 0.175 

Thus, the probability of moving from state 1 to state 4 in 3 days is 0.195. After
3 days the vector of state probabilities is equal to

0.197 0.361 0.247 0.195 


 0.209 0.361 0.239 0.191 
p(3) = p(0) P(3) = 0.2 0.3 0.4 0.1  
 0.213 0.365 0.219 0.203   (1.34)
 
 0.201 0.381 0.243 0.175 

= 0.2074 0.3646 0.2330 0.1950  . 

Alternatively, the vector of state probabilities after 3 days is equal to

0.3 0.1 0.4 0.2 


 0.2 0.5 0.2 0.1 
p(3) = p(2) P = 0.197 0.365 0.251 0.187   
0.3 0.2 0.1 0.4   (1.35)
 
 0 0.6 0.3 0.1 

= 0.2074 0.3646 0.2330 0.1950  . 

After 3 days the probability of being in state 2 is 0.3646.
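
The recursion p(n) = p(n−1)P lends itself to a short loop. The following sketch propagates the initial vector of Equation (1.10) through three days of the weather model; the names step and history are illustrative.

```python
# Transition probability matrix from Equation (1.9).
P = [
    [0.3, 0.1, 0.4, 0.2],
    [0.2, 0.5, 0.2, 0.1],
    [0.3, 0.2, 0.1, 0.4],
    [0.0, 0.6, 0.3, 0.1],
]

# Initial state probability vector from Equation (1.10).
p0 = [0.2, 0.3, 0.4, 0.1]


def step(p, P):
    """One transition: postmultiply the row vector p by the matrix P.

    Component j of the result is the sum over i of p[i] * P[i][j]."""
    return [sum(p[i] * P[i][j] for i in range(len(p))) for j in range(len(P[0]))]


# history[n] holds the state probability vector p(n) after n days.
history = [p0]
for _ in range(3):
    history.append(step(history[-1], P))

for n, p in enumerate(history):
    print(n, [round(x, 4) for x in p])
```

Iterating the one-step update avoids forming the matrix powers explicitly; by Equation (1.28) it yields the same vectors as p(0)P(n).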


Now suppose that the process starts day 1 in state 2, denoting snow. The
initial state probability vector is

$$
p^{(0)} = [\,0 \;\; 1 \;\; 0 \;\; 0\,]. \tag{1.36}
$$

After 3 days the vector of state probabilities is equal to

$$
\begin{aligned}
p^{(3)} = p^{(0)} P^{(3)}
&= [\,0 \;\; 1 \;\; 0 \;\; 0\,]
\begin{bmatrix} 0.197 & 0.361 & 0.247 & 0.195 \\ 0.209 & 0.361 & 0.239 & 0.191 \\ 0.213 & 0.365 & 0.219 & 0.203 \\ 0.201 & 0.381 & 0.243 & 0.175 \end{bmatrix} \\
&= [\,0.209 \;\; 0.361 \;\; 0.239 \;\; 0.191\,].
\end{aligned} \tag{1.37}
$$

Observe that when p(0) = [0 1 0 0], the entries of vector p(3) are identical to the
entries in row 2 of matrix P(3). One can generalize this observation to conclude
that when a Markov chain starts in state i, then after n steps, the entries
of vector p(n) are identical to those in row i of matrix P(n).

1.8 Classification of States


A state j is said to be accessible from a state i if pij( n ) > 0 for some n ≥ 0. That is,
state j is accessible from state i if starting from state i, it is possible to eventu-
ally reach state j. If state j is accessible from state i, and if state i is also acces-
sible from state j, then both states are said to communicate. A set of states is
called a closed set if no state outside the set is accessible from a state inside the
set. When the process enters a closed set of states, it will remain forever inside
the closed set. If each pair of states inside a closed set communicates, then the
closed set is also called a closed communicating class. If all states in a Markov
chain communicate, then all states belong to a single closed communicating
class. Such a Markov chain is called irreducible. Thus, an irreducible chain is
one in which it is possible to go from every state to every other state, not nec-
essarily in one step. That is, all states in an irreducible chain communicate.
Consider the four-state Markov chain model of the weather for which the
transition probability matrix is given in Equation (1.9). All states belong to
the same closed communicating class denoted by R = {1, 2, 3, 4}. Since p41 = 0,
the chain cannot move from state 4 to state 1 in one step. Instead, the chain
can move from state 4 to state 2 in one step, or from state 4 to state 3 in one
step, and then enter state 1 from either of those states after one or more addi-
tional steps. The chain can also move from state 4 to state 4 in one step, and
then enter state 1 after two or more additional steps. Note that the two-step
transition probability for going from state 4 to state 1 is positive. That is,

$$
\begin{aligned}
p_{41}^{(2)} &= p_{41}p_{11} + p_{42}p_{21} + p_{43}p_{31} + p_{44}p_{41} \\
&= (0)(0.3) + (0.6)(0.2) + (0.3)(0.3) + (0.1)(0) = 0.21 > 0,
\end{aligned}
$$

so that the chain can move from state 4 to state 1 in two or more steps.
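
Accessibility depends only on which transition probabilities are positive, so it can be checked by a graph search over the arcs of the transition probability graph. The sketch below is one possible way to do this, assuming 0-based state indices; the function names reachable_from and is_irreducible are illustrative.

```python
# Weather model, Equation (1.9). Index 0 is state 1, ..., index 3 is state 4.
P_weather = [
    [0.3, 0.1, 0.4, 0.2],
    [0.2, 0.5, 0.2, 0.1],
    [0.3, 0.2, 0.1, 0.4],
    [0.0, 0.6, 0.3, 0.1],
]

# Random walk with two absorbing barriers, Equation (1.11), with p = 0.5.
P_absorbing = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]


def reachable_from(P, start):
    """Set of state indices accessible from start.

    The start state itself is included, corresponding to n = 0 steps."""
    seen = {start}
    stack = [start]
    while stack:
        i = stack.pop()
        for j, prob in enumerate(P[i]):
            if prob > 0 and j not in seen:  # an arc i -> j exists
                seen.add(j)
                stack.append(j)
    return seen


def is_irreducible(P):
    """True if every state is accessible from every state."""
    n = len(P)
    return all(len(reachable_from(P, i)) == n for i in range(n))


print(is_irreducible(P_weather))       # all four states communicate
print(is_irreducible(P_absorbing))     # states 0 and 4 cannot be left
print(sorted(reachable_from(P_absorbing, 2)))
```

For the weather model the search from state 4 (index 3) reaches state 1 (index 0) through state 2 or state 3, in agreement with the two-step calculation above.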

Two different types of states can be distinguished: recurrent states and
transient states. A recurrent state is one to which eventual return is certain.
A transient state is one to which the process may not eventually return. To
quantify this distinction, let fii denote the probability that, starting in state i,
the chain will ever return to state i. State i is recurrent if fii = 1, and transient
if fii < 1. An absorbing state is a special case of a recurrent state i for which
pii = fii = 1. A Markov chain that enters an absorbing state will never leave it
because the chain will always return to it on every transition. A chain that
enters an absorbing state is said to be absorbed in that state.
In a finite-state Markov chain, any state that belongs to a closed communi-
cating class of states is recurrent. Any state that does not belong to a closed
communicating class is transient. When the state space is finite, not all states
can be transient. The reason is that eventually the process will never return to
a transient state. Since the process must always be in one state, at least one state
must be recurrent. All the states in a finite-state, irreducible Markov chain are
recurrent. For that reason an irreducible chain is also called a recurrent chain.
A recurrent state i is called periodic, with period d > 1, if the process can
return to state i only after a multiple of d transitions, where d is the largest
integer with this property. If any
state in a closed communicating class of states is periodic, with period d, then all
states in the closed communicating class are periodic, with the same period d.
For example, the random walk with reflecting barriers for which the transition
matrix is shown in Equation (1.12) is an irreducible Markov chain because it is
possible to go from every state to every other state. Observe that every transi-
tion from an even numbered state (0, 2, 4) is to an odd numbered state (1, 3), and
every transition from an odd numbered state (1, 3) is to an even numbered state
(0, 2, 4). Starting in state 0, the chain will alternately be in odd and even num-
bered states. That is, starting in state 0, the chain can enter an even numbered
state i only at epochs 0, 2, 4, …. The chain is therefore periodic, with period d = 2,
because it is possible to return to a state only in an even number of steps. A state
that is not periodic is called aperiodic. With the exception of Sections 1.4.1.3,
1.9.1.1, and 1.10.1.1.1.2, only aperiodic chains are considered in this book.

1.9 Markov Chain Structure


A Markov chain can be classified as unichain or multichain. A unichain or
multichain that has transient states is called reducible. Examples of reduc-
ible unichains and multichains are given in Sections 1.10.1.2, 1.10.2, and 3.1.
All Markov chains treated in this book have a finite number of states.

1.9.1 Unichain
A Markov chain is termed unichain if it consists of a single closed set of
recurrent states plus a possibly empty set of transient states.


1.9.1.1 Irreducible
A unichain with no transient states is called an irreducible or recurrent
chain. All states in an irreducible chain are recurrent states, which belong
to a single closed communicating class. In an irreducible chain, it is pos-
sible to move from every state to every other state, not necessarily in one
step [4].
An irreducible or recurrent Markov chain is called a regular chain if some
power of the transition matrix has only positive elements. The irreducible
four-state Markov chain model of weather in Section 1.3 is a regular chain.
Another example of a regular Markov chain is the model for diffusion of two
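gases treated in Section 1.10.1.1.1.1 for which

$$
P = \begin{array}{c|cccc}
  & 0 & 1 & 2 & 3 \\ \hline
0 & 0 & 1 & 0 & 0 \\
1 & 1/9 & 4/9 & 4/9 & 0 \\
2 & 0 & 4/9 & 4/9 & 1/9 \\
3 & 0 & 0 & 1 & 0
\end{array}. \tag{1.38}
$$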

The easiest way to check regularity is to keep track of whether the entries in
the powers of P are positive. This can be done without computing numerical
values by putting an x in the entry if it is positive and a 0 otherwise. To check
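regularity, let

$$
P = \begin{array}{c} 0 \\ 1 \\ 2 \\ 3 \end{array}
\begin{bmatrix} 0 & x & 0 & 0 \\ x & x & x & 0 \\ 0 & x & x & x \\ 0 & 0 & x & 0 \end{bmatrix}
$$

$$
P^{2} = PP = \begin{array}{c} 0 \\ 1 \\ 2 \\ 3 \end{array}
\begin{bmatrix} 0 & x & 0 & 0 \\ x & x & x & 0 \\ 0 & x & x & x \\ 0 & 0 & x & 0 \end{bmatrix}
\begin{bmatrix} 0 & x & 0 & 0 \\ x & x & x & 0 \\ 0 & x & x & x \\ 0 & 0 & x & 0 \end{bmatrix}
= \begin{bmatrix} x & x & x & 0 \\ x & x & x & x \\ x & x & x & x \\ 0 & x & x & x \end{bmatrix}
$$

$$
P^{4} = P^{2}P^{2} = \begin{array}{c} 0 \\ 1 \\ 2 \\ 3 \end{array}
\begin{bmatrix} x & x & x & 0 \\ x & x & x & x \\ x & x & x & x \\ 0 & x & x & x \end{bmatrix}
\begin{bmatrix} x & x & x & 0 \\ x & x & x & x \\ x & x & x & x \\ 0 & x & x & x \end{bmatrix}
= \begin{bmatrix} x & x & x & x \\ x & x & x & x \\ x & x & x & x \\ x & x & x & x \end{bmatrix}.
$$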


Since all entries in P4 are positive, the chain is regular. Note that the test for
regularity is made faster by squaring the result each time.
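The sign-pattern test with repeated squaring can be sketched in a few lines of Python. The function names below are illustrative. The stopping rule is an assumption that is not stated in the text: it relies on Wielandt's bound, under which an n-state regular chain already has an all-positive power at exponent (n − 1)² + 1, so squaring may stop once that exponent has been reached.

```python
def sign_pattern(P):
    """Replace each entry by True if it is positive and False otherwise."""
    return [[entry > 0 for entry in row] for row in P]

def bool_product(A, B):
    """Sign pattern of the product of two nonnegative matrices."""
    n = len(A)
    return [[any(A[i][k] and B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def is_regular(P):
    """Square the sign pattern until it is all positive or the bound is passed."""
    n = len(P)
    bound = (n - 1) ** 2 + 1          # Wielandt's bound (assumption added here)
    pattern, power = sign_pattern(P), 1
    while True:
        if all(all(row) for row in pattern):
            return True
        if power >= bound:
            return False
        pattern, power = bool_product(pattern, pattern), 2 * power

# Transition matrix (1.38) for the diffusion of two gases.
two_gas = [[0, 1, 0, 0],
           [1/9, 4/9, 4/9, 0],
           [0, 4/9, 4/9, 1/9],
           [0, 0, 1, 0]]
print(is_regular(two_gas))
```

Working with sign patterns rather than numerical powers mirrors the x-and-0 bookkeeping used above and avoids any rounding concerns.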
The irreducible Markov chain constructed in Section 1.10.1.1.1.2 for the
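Ehrenfest model of diffusion for which

$$
P = \begin{array}{c|cccc}
  & 0 & 1 & 2 & 3 \\ \hline
0 & 0 & 1 & 0 & 0 \\
1 & 1/3 & 0 & 2/3 & 0 \\
2 & 0 & 2/3 & 0 & 1/3 \\
3 & 0 & 0 & 1 & 0
\end{array} \tag{1.39}
$$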

can be interpreted as a four-state random walk with reflecting barriers,
introduced in Section 1.4.1.3, in which p = 2/3. Since the random walk with
reflecting barriers is a periodic chain, the Ehrenfest model is also a periodic chain.
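To check regularity, let

$$
P = \begin{array}{c} 0 \\ 1 \\ 2 \\ 3 \end{array}
\begin{bmatrix} 0 & x & 0 & 0 \\ x & 0 & x & 0 \\ 0 & x & 0 & x \\ 0 & 0 & x & 0 \end{bmatrix}
$$

$$
P^{2} = PP = \begin{array}{c} 0 \\ 1 \\ 2 \\ 3 \end{array}
\begin{bmatrix} 0 & x & 0 & 0 \\ x & 0 & x & 0 \\ 0 & x & 0 & x \\ 0 & 0 & x & 0 \end{bmatrix}
\begin{bmatrix} 0 & x & 0 & 0 \\ x & 0 & x & 0 \\ 0 & x & 0 & x \\ 0 & 0 & x & 0 \end{bmatrix}
= \begin{bmatrix} x & 0 & x & 0 \\ 0 & x & 0 & x \\ x & 0 & x & 0 \\ 0 & x & 0 & x \end{bmatrix}
$$

$$
P^{4} = P^{2}P^{2} = \begin{array}{c} 0 \\ 1 \\ 2 \\ 3 \end{array}
\begin{bmatrix} x & 0 & x & 0 \\ 0 & x & 0 & x \\ x & 0 & x & 0 \\ 0 & x & 0 & x \end{bmatrix}
\begin{bmatrix} x & 0 & x & 0 \\ 0 & x & 0 & x \\ x & 0 & x & 0 \\ 0 & x & 0 & x \end{bmatrix}
= \begin{bmatrix} x & 0 & x & 0 \\ 0 & x & 0 & x \\ x & 0 & x & 0 \\ 0 & x & 0 & x \end{bmatrix}.
$$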

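Observe that even powers of P will have 0s in the odd numbered entries of
row 0. Furthermore,

$$
P^{3} = P^{2}P = \begin{array}{c} 0 \\ 1 \\ 2 \\ 3 \end{array}
\begin{bmatrix} x & 0 & x & 0 \\ 0 & x & 0 & x \\ x & 0 & x & 0 \\ 0 & x & 0 & x \end{bmatrix}
\begin{bmatrix} 0 & x & 0 & 0 \\ x & 0 & x & 0 \\ 0 & x & 0 & x \\ 0 & 0 & x & 0 \end{bmatrix}
= \begin{bmatrix} 0 & x & 0 & x \\ x & 0 & x & 0 \\ 0 & x & 0 & x \\ x & 0 & x & 0 \end{bmatrix}
$$

$$
P^{5} = P^{2}P^{3} = \begin{array}{c} 0 \\ 1 \\ 2 \\ 3 \end{array}
\begin{bmatrix} x & 0 & x & 0 \\ 0 & x & 0 & x \\ x & 0 & x & 0 \\ 0 & x & 0 & x \end{bmatrix}
\begin{bmatrix} 0 & x & 0 & x \\ x & 0 & x & 0 \\ 0 & x & 0 & x \\ x & 0 & x & 0 \end{bmatrix}
= \begin{bmatrix} 0 & x & 0 & x \\ x & 0 & x & 0 \\ 0 & x & 0 & x \\ x & 0 & x & 0 \end{bmatrix}.
$$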

Note that odd powers of P will have 0s in the even numbered entries of
row 0. This chain is not regular because no power of the transition matrix
has only positive elements. This example has demonstrated that a periodic
chain cannot be regular. Hence, a regular Markov chain is irreducible and
aperiodic. Regular Markov chains are the subject of Chapter 2.
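The period of a state can also be estimated directly as the greatest common divisor of the step counts n at which an n-step return is possible. The sketch below is illustrative only: the function names and the finite horizon max_steps are assumptions, and the matrices are those of Equations (1.38) and (1.39).

```python
from math import gcd

def sign_pattern(P):
    """True where an entry is positive, False elsewhere."""
    return [[entry > 0 for entry in row] for row in P]

def bool_product(A, B):
    """Sign pattern of the product of two nonnegative matrices."""
    n = len(A)
    return [[any(A[i][k] and B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def period(P, state, max_steps=50):
    """gcd of all n <= max_steps for which an n-step return to state is possible."""
    step = sign_pattern(P)
    power = step                      # sign pattern of P^1
    d = 0
    for n in range(1, max_steps + 1):
        if power[state][state]:
            d = gcd(d, n)
        power = bool_product(power, step)
    return d

ehrenfest = [[0, 1, 0, 0], [1/3, 0, 2/3, 0], [0, 2/3, 0, 1/3], [0, 0, 1, 0]]
two_gas = [[0, 1, 0, 0], [1/9, 4/9, 4/9, 0], [0, 4/9, 4/9, 1/9], [0, 0, 1, 0]]
print(period(ehrenfest, 0))   # 2: returns to state 0 occur only at even steps
print(period(two_gas, 0))     # 1: returns are possible at steps 2 and 3
```

A result of 1 indicates an aperiodic state. Because both chains are irreducible, every state in each chain shares the period found for state 0.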

1.9.1.2 Reducible Unichain


A unichain with transient states is called a reducible unichain. Thus, a reduc-
ible unichain consists of one recurrent chain plus one or more transient states.
Transient states are those states that do not belong to the recurrent chain. The
state space for a reducible unichain can be partitioned into a recurrent chain
plus one or more transient states. For brevity, a reducible unichain is often
called a unichain.
The transition probability matrix for a generic four-state reducible uni-
chain is shown below. The transition matrix is partitioned to show that states
1 and 2 belong to the recurrent chain denoted by R = {1, 2}, and states 3 and 4
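are transient. The set of transient states is denoted by T = {3, 4}.

$$
P = [p_{ij}] = \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \end{array}
\left[\begin{array}{cc|cc}
p_{11} & p_{12} & 0 & 0 \\
p_{21} & p_{22} & 0 & 0 \\ \hline
p_{31} & p_{32} & p_{33} & p_{34} \\
p_{41} & p_{42} & p_{43} & p_{44}
\end{array}\right]. \tag{1.40}
$$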

If the recurrent chain consists of a single absorbing state, the reducible uni-
chain is called an absorbing unichain or an absorbing Markov chain. The
transition probability matrix for a generic three-state absorbing unichain
is shown below. The transition matrix is partitioned to show that state 1 is
absorbing, and states 2 and 3 are transient. The recurrent chain is denoted by
R = {1}. The transient set of states is denoted by T = {2, 3}.

$$
P = [p_{ij}] = \begin{array}{c} 1 \\ 2 \\ 3 \end{array}
\left[\begin{array}{c|cc}
1 & 0 & 0 \\ \hline
p_{21} & p_{22} & p_{23} \\
p_{31} & p_{32} & p_{33}
\end{array}\right]. \tag{1.41}
$$

1.9.2 Multichain
A Markov chain is termed multichain if it consists of two or more closed
sets of recurrent states plus a possibly empty set of transient states. Transient
states are those states that do not belong to any of the recurrent closed clas-
ses. A multichain with transient states is called a reducible multichain. The
state space for a reducible multichain can be partitioned into two or more
mutually exclusive closed communicating classes of recurrent states plus
one or more transient states. The mutually exclusive closed sets of recur-
rent states are called recurrent chains. There is no interaction among the
recurrent chains. Hence, each recurrent chain, which may consist of only a
single absorbing state, may be analyzed separately by treating it as an irre-
ducible Markov chain. If every recurrent chain consists solely of one absorb-
ing state, then the reducible multichain is called an absorbing multichain
or an absorbing Markov chain. For brevity, a reducible multichain is often
called a multichain. A multichain with no transient states is called a recur-
rent multichain.
The transition probability matrix for a generic five-state reducible multi-
chain is shown below. The transition matrix is partitioned to show that the
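chain has two recurrent chains plus two transient states.

$$
P = \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{array}
\left[\begin{array}{c|cc|cc}
1 & 0 & 0 & 0 & 0 \\ \hline
0 & p_{22} & p_{23} & 0 & 0 \\
0 & p_{32} & p_{33} & 0 & 0 \\ \hline
p_{41} & p_{42} & p_{43} & p_{44} & p_{45} \\
p_{51} & p_{52} & p_{53} & p_{54} & p_{55}
\end{array}\right]. \tag{1.42}
$$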

State 1 is an absorbing state, and constitutes the first recurrent chain, denoted
by R1 = {1}. Recurrent states 2 and 3 belong to the second recurrent chain,
denoted by R 2 = {2, 3}. States 4 and 5 are transient. The set of transient states
is denoted by T = {4, 5}.
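The partition of a finite chain into recurrent chains and transient states can be found mechanically from reachability. The sketch below is one possible approach and is not taken from the text: a state is treated as recurrent when every state reachable from it can also reach it back. The numerical entries are hypothetical values chosen only to match the zero pattern of Equation (1.42), and states are indexed from 0 in the code.

```python
def reachable_sets(P):
    """reach[i] is the set of states reachable from i in zero or more steps."""
    n = len(P)
    reach = [{i} | {j for j in range(n) if P[i][j] > 0} for i in range(n)]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            new = set().union(*(reach[j] for j in reach[i]))
            if not new <= reach[i]:
                reach[i] |= new
                changed = True
    return reach

def classify(P):
    """Return (list of closed communicating classes, set of transient states)."""
    reach = reachable_sets(P)
    recurrent_classes, transient = [], set()
    for i in range(len(P)):
        # i is recurrent if every state reachable from i can return to i.
        if all(i in reach[j] for j in reach[i]):
            cls = frozenset(reach[i])
            if cls not in recurrent_classes:
                recurrent_classes.append(cls)
        else:
            transient.add(i)
    return recurrent_classes, transient

# Hypothetical numbers with the zero pattern of Equation (1.42).
P = [[1.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.4, 0.6, 0.0, 0.0],
     [0.0, 0.7, 0.3, 0.0, 0.0],
     [0.1, 0.2, 0.1, 0.3, 0.3],
     [0.2, 0.1, 0.1, 0.4, 0.2]]
classes, transient = classify(P)
# Shift to the 1-based state labels used in the text.
print([sorted(s + 1 for s in c) for c in classes], sorted(s + 1 for s in transient))
```

Under these assumed values the output corresponds to R1 = {1}, R2 = {2, 3}, and T = {4, 5}.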

1.9.3 Aggregated Canonical Form of the Transition Matrix


Suppose that the states in a reducible Markov chain are reordered so that the
recurrent states come first, followed by the transient states. The states belong-
ing to each recurrent chain in a reducible multichain are grouped together
and numbered consecutively. All the recurrent chains are combined into one
recurrent chain with a transition matrix denoted by S. The transient states
are also numbered consecutively. Then the transition probability matrix,
P, for a reducible Markov chain can be partitioned into four submatrices
labeled S, 0, D, and Q, and represented in the following aggregated canonical
or standard form:

$$
P = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}. \tag{1.43}
$$

For example, the aggregated canonical form of the transition matrix for the
generic five-state reducible multichain shown in Equation (1.42) appears
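below:

$$
P = [p_{ij}] = \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{array}
\left[\begin{array}{ccc|cc}
1 & 0 & 0 & 0 & 0 \\
0 & p_{22} & p_{23} & 0 & 0 \\
0 & p_{32} & p_{33} & 0 & 0 \\ \hline
p_{41} & p_{42} & p_{43} & p_{44} & p_{45} \\
p_{51} & p_{52} & p_{53} & p_{54} & p_{55}
\end{array}\right]
= \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}, \tag{1.44}
$$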

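where

$$
S = \begin{array}{c} 1 \\ 2 \\ 3 \end{array}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & p_{22} & p_{23} \\ 0 & p_{32} & p_{33} \end{bmatrix}, \quad
D = \begin{array}{c} 4 \\ 5 \end{array}
\begin{bmatrix} p_{41} & p_{42} & p_{43} \\ p_{51} & p_{52} & p_{53} \end{bmatrix}, \quad
Q = \begin{array}{c} 4 \\ 5 \end{array}
\begin{bmatrix} p_{44} & p_{45} \\ p_{54} & p_{55} \end{bmatrix}, \quad \text{and} \quad
0 = \begin{array}{c} 1 \\ 2 \\ 3 \end{array}
\begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}.
$$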

Additional examples of transition matrices represented in canonical form,
with and without aggregation, are given in Sections 1.10.1.2, 1.10.2, 3.1, and
in Chapter 3.
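Once the recurrent and transient states are known, the four submatrices of the aggregated canonical form can be extracted by indexing. The following sketch assumes hypothetical numerical entries with the zero pattern of the generic five-state multichain, and the helper names are illustrative. States are indexed from 0 in the code.

```python
def submatrix(P, rows, cols):
    """Entries of P restricted to the given row and column indices."""
    return [[P[i][j] for j in cols] for i in rows]

def canonical_blocks(P, recurrent, transient):
    """Return S, 0, D, Q for states ordered as recurrent first, then transient."""
    S = submatrix(P, recurrent, recurrent)
    zero = submatrix(P, recurrent, transient)
    D = submatrix(P, transient, recurrent)
    Q = submatrix(P, transient, transient)
    if any(x != 0 for row in zero for x in row):
        raise ValueError("a recurrent state leads to a transient state")
    return S, zero, D, Q

# Hypothetical entries; 0-based indices 0..4 stand for states 1..5.
P = [[1.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.4, 0.6, 0.0, 0.0],
     [0.0, 0.7, 0.3, 0.0, 0.0],
     [0.1, 0.2, 0.1, 0.3, 0.3],
     [0.2, 0.1, 0.1, 0.4, 0.2]]
S, zero, D, Q = canonical_blocks(P, recurrent=[0, 1, 2], transient=[3, 4])
print(S, D, Q, sep="\n")
```

The check on the upper right block guards against an ordering in which a state listed as recurrent can in fact reach a transient state.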

1.10 Markov Chain Models


A variety of random processes can be modeled as Markov chains. Examples
of unichain and multichain models are given in this section.


1.10.1 Unichain
Unichain models include irreducible chains, which have no transient states,
and reducible unichains, which have transient states.

1.10.1.1 Irreducible
The model of the weather in Section 1.3, and the circular random walk in
Section 1.4.2 are both irreducible chains. Since these two Markov chains are
aperiodic, they are also regular chains. Several additional models of regular
Markov chains will be constructed, plus one model of an irreducible, peri-
odic chain in Section 1.10.1.1.1.2.

1.10.1.1.1 Diffusion
This section will describe two simplified Markov chain models for diffu-
sion. In the first model, two different gases are diffused. The second model,
which produces a periodic chain, is for the diffusion of one gas between two
containers [4, 5].

1.10.1.1.1.1 Two Gases Consider a collection of 2k molecules of two gases,
of which k are molecules of gas U, and k are molecules of gas V. The
molecules are placed in two containers, labeled A and B, such that there are
k molecules in each container. A single transition consists of choosing a
molecule at random from each container, moving the molecule obtained
from container A to container B, and moving the molecule obtained from
container B to container A. The state of the process, denoted by Xn, is the
number of molecules of gas V in container A after n transitions. Suppose
that Xn = i. Then there are i molecules of gas V in container A, k − i mol-
ecules of gas U in container A, k − i molecules of gas V in container B, and
i molecules of gas U in container B. The number of molecules of each type
of gas in each container when the system is in state i is summarized in
Table 1.5.
TABLE 1.5
Number of Molecules in Each Container When the System Is in State i
Gas/Container    A        B
Gas U            k − i    i
Gas V            i        k − i

To compute transition probabilities, assume the probability that a molecule
changes containers is proportional to the number of molecules in the
container that the molecule leaves. Suppose that the process is in state i. If a
molecule of the same gas is chosen from each container, the system remains
in state i. Hence,

$$
\begin{aligned}
p_{ii} &= P(\text{a gas V molecule moves from A to B})\,P(\text{a gas V molecule moves from B to A}) \\
&\quad + P(\text{a gas U molecule moves from A to B})\,P(\text{a gas U molecule moves from B to A}) \\
&= \left(\frac{i}{k}\right)\left(\frac{k-i}{k}\right) + \left(\frac{k-i}{k}\right)\left(\frac{i}{k}\right)
= 2\left(\frac{i}{k}\right)\left(\frac{k-i}{k}\right) = \frac{2i(k-i)}{k^{2}}.
\end{aligned}
$$

If a molecule of gas V is chosen from container A and a molecule of gas U is
chosen from container B, the system moves from state i to state i − 1. Hence,

$$
\begin{aligned}
p_{i,i-1} &= P(\text{a gas V molecule moves from A to B})\,P(\text{a gas U molecule moves from B to A}) \\
&= \left(\frac{i}{k}\right)\left(\frac{i}{k}\right) = \left(\frac{i}{k}\right)^{2}.
\end{aligned}
$$

Finally, if a molecule of gas U is chosen from container A and a molecule
of gas V is chosen from container B, the system moves from state i to state
i + 1. Hence,

$$
\begin{aligned}
p_{i,i+1} &= P(\text{a gas U molecule moves from A to B})\,P(\text{a gas V molecule moves from B to A}) \\
&= \left(\frac{k-i}{k}\right)\left(\frac{k-i}{k}\right) = \left(\frac{k-i}{k}\right)^{2}.
\end{aligned}
$$

No other transitions from state i are possible.
To gain insight into this model, consider the special case in which k = 3
molecules. The transition probabilities are

$$
p_{ii} = \frac{2i(k-i)}{k^{2}} = \frac{2i(3-i)}{3^{2}} = \frac{2i(3-i)}{9}, \quad \text{for } i = 0, 1, 2, 3
$$

$$
p_{i,i-1} = \left(\frac{i}{k}\right)^{2} = \left(\frac{i}{3}\right)^{2} = \frac{i^{2}}{9}, \quad \text{for } i = 1, 2, 3
$$

$$
p_{i,i+1} = \left(\frac{k-i}{k}\right)^{2} = \left(\frac{3-i}{3}\right)^{2} = \frac{(3-i)^{2}}{9}, \quad \text{for } i = 0, 1, 2.
$$

The transition probability matrix is shown below:

$$
P = \begin{array}{c|cccc}
  & 0 & 1 & 2 & 3 \\ \hline
0 & 0 & 1 & 0 & 0 \\
1 & 1/9 & 4/9 & 4/9 & 0 \\
2 & 0 & 4/9 & 4/9 & 1/9 \\
3 & 0 & 0 & 1 & 0
\end{array}. \tag{1.38}
$$

This model for the diffusion of two gases is a regular Markov chain.
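The general formulas for the two-gas model can be turned into a small routine that builds the transition matrix for any k. This is a sketch: the function name is illustrative, and exact fractions are used so that the k = 3 case can be compared entry by entry with Equation (1.38).

```python
from fractions import Fraction

def two_gas_matrix(k):
    """Transition matrix on states 0..k for the two-gas diffusion model."""
    P = [[Fraction(0)] * (k + 1) for _ in range(k + 1)]
    for i in range(k + 1):
        P[i][i] = Fraction(2 * i * (k - i), k * k)        # same gas drawn from both
        if i >= 1:
            P[i][i - 1] = Fraction(i * i, k * k)          # V leaves A, U leaves B
        if i <= k - 1:
            P[i][i + 1] = Fraction((k - i) ** 2, k * k)   # U leaves A, V leaves B
    return P

P3 = two_gas_matrix(3)
for row in P3:
    print([str(entry) for entry in row])
```

Each row sums to one because 2i(k − i) + i² + (k − i)² = k².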

1.10.1.1.1.2 Ehrenfest Model In the second model of diffusion, developed by
the physicist T. Ehrenfest, k molecules of a single gas are stored in a
container that is divided by a permeable membrane into two compartments,
labeled A and B. At each transition, a molecule is chosen at random from
the set of molecules and moved from the compartment in which it cur-
rently resides to the other compartment. The state of the process, denoted
by Xn, is the number of molecules of gas in compartment A after n transi-
tions. Therefore, if the process is in state i, then compartment A contains
i molecules of gas, and compartment B contains k − i molecules of gas. If
the process is in state i, then with probability i/k a molecule moves from
compartment A to B, and with probability (k − i)/k a molecule moves from
compartment B to A. If Xn = i, then at each transition, Xn either decreases
by one molecule, so that Xn+1 = i − 1, or increases by one molecule, so that
Xn+1 = i + 1. Thus, the transition probabilities are

$$
p_{i,i-1} = \frac{i}{k}, \qquad p_{i,i+1} = \frac{k-i}{k}.
$$

No other transitions from state i are possible.
For the special case in which k = 3 molecules, the transition probabilities are

$$
p_{i,i-1} = \frac{i}{k} = \frac{i}{3}, \quad \text{for } i = 1, 2, 3
$$

$$
p_{i,i+1} = \frac{k-i}{k} = \frac{3-i}{3}, \quad \text{for } i = 0, 1, 2.
$$

The transition probability matrix is shown below:

$$
P = \begin{array}{c|cccc}
  & 0 & 1 & 2 & 3 \\ \hline
0 & 0 & 1 & 0 & 0 \\
1 & 1/3 & 0 & 2/3 & 0 \\
2 & 0 & 2/3 & 0 & 1/3 \\
3 & 0 & 0 & 1 & 0
\end{array}. \tag{1.39}
$$

As Section 1.9.1.1 has indicated, this Ehrenfest model of diffusion can be inter-
preted as a four-state random walk with reflecting barriers in which p = 2/3.
Hence, this Ehrenfest model is a periodic Markov chain with a period of 2.
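The Ehrenfest transition probabilities can likewise be generated for any number of molecules k. In the sketch below the function name is illustrative, and the final check simply confirms that every one-step transition changes the parity of the state, which is the property behind the period of 2.

```python
from fractions import Fraction

def ehrenfest_matrix(k):
    """Transition matrix on states 0..k for the Ehrenfest diffusion model."""
    P = [[Fraction(0)] * (k + 1) for _ in range(k + 1)]
    for i in range(k + 1):
        if i >= 1:
            P[i][i - 1] = Fraction(i, k)        # a molecule moves from A to B
        if i <= k - 1:
            P[i][i + 1] = Fraction(k - i, k)    # a molecule moves from B to A
    return P

P3 = ehrenfest_matrix(3)
for row in P3:
    print([str(entry) for entry in row])

# Every positive entry links an even state to an odd state or vice versa.
parity_flips = all(P3[i][j] == 0 or (i - j) % 2 == 1
                   for i in range(4) for j in range(4))
print(parity_flips)
```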

1.10.1.1.2 Waiting Line Inside an Office


A recruiter of engineers interviews applicants inside her office for positions
in engineering management [1, 2]. The recruiter’s office has a capacity of
three candidates including the one being interviewed. Applicants who arrive
when the office is full are not admitted. All interviews last 30 min. The num-
ber of applicants who arrive during the nth 30 min time period needed to
conduct an interview is an independent and identically distributed random
variable, which is denoted by An. The random variable An has the following
probability distribution, which is stationary in time:

Number An of arrivals in period n, An = k    0      1      2      3      4      5 or more
Probability, pk = P(An = k)                  0.30   0.25   0.20   0.15   0.10   0

Let Xn represent the number of applicants in the recruiter’s office
immediately after the completion of the nth interview. Since the 30-min period
required to conduct an interview is the same length as the interval dur-
ing which new applicants arrive, the following relationships between Xn
and Xn+1 can be established. When Xn = 0, no candidates are in the recruit-
er’s office so that Xn cannot be decreased by the completion of an interview.
Furthermore, when Xn = 0, Xn can be increased by the arrival of An+1 new
applicants during the (n + 1)th 30 min period. Although new candidates
may arrive, Xn cannot exceed three candidates, the capacity of the recruiter’s
office. Hence,

$$
X_{n+1} = \min(3, A_{n+1}), \quad \text{if } X_n = 0.
$$

When Xn > 0, one candidate is being interviewed and Xn − 1 candidates are
in the recruiter’s office waiting to be interviewed. In this case, Xn will be
decreased by one at the completion of an interview at the end of the nth
30 min period. In addition, Xn can be increased by the arrival of An+1 new
applicants during the (n + 1)th 30 min period. Hence,

$$
X_{n+1} = \min(3, X_n - 1 + A_{n+1}), \quad \text{if } X_n > 0.
$$

These relationships show that Xn+1 is a function only of Xn and An+1, an
independent random variable. Therefore, {X0, X1, X2, …} is a Markov chain.
Since the recruiter’s office has a capacity of three candidates, the state space
is E = {0, 1, 2, 3}. The transition probabilities are computed below:

If Xn = i = 0, the next state is Xn+1 = j = min(3, An+1).

The transition probabilities for state 0 are

$$
p_{0j} = P(X_{n+1} = j \mid X_n = 0) = P(X_{n+1} = j = \min(3, A_{n+1}) \mid X_n = 0) = P(j = \min(3, A_{n+1}))
$$

$$
\begin{aligned}
p_{00} &= P(j = \min(3, A_{n+1})) = P(0 = \min(3, A_{n+1})) = P(A_{n+1} = 0) = 0.30 \\
p_{01} &= P(j = \min(3, A_{n+1})) = P(1 = \min(3, A_{n+1})) = P(A_{n+1} = 1) = 0.25 \\
p_{02} &= P(j = \min(3, A_{n+1})) = P(2 = \min(3, A_{n+1})) = P(A_{n+1} = 2) = 0.20 \\
p_{03} &= P(j = \min(3, A_{n+1})) = P(3 = \min(3, A_{n+1})) = P(A_{n+1} \ge 3) \\
&= P(A_{n+1} = 3) + P(A_{n+1} = 4) = 0.15 + 0.10 = 0.25.
\end{aligned}
$$

If Xn = i = 1, the next state is

$$
X_{n+1} = j = \min(3, X_n - 1 + A_{n+1}) = \min(3, 1 - 1 + A_{n+1}) = \min(3, A_{n+1}).
$$

Thus, the transition probabilities for state 1 are the same as those for state 0.
The transition probabilities for state 1 are

$$
\begin{aligned}
p_{1j} &= P(X_{n+1} = j \mid X_n = 1) = P(X_{n+1} = j = \min(3, X_n - 1 + A_{n+1}) \mid X_n = 1) \\
&= P(j = \min(3, X_n - 1 + A_{n+1})) = P(j = \min(3, 1 - 1 + A_{n+1})) = P(j = \min(3, A_{n+1})) \\
p_{10} &= P(j = \min(3, A_{n+1})) = P(0 = \min(3, A_{n+1})) = P(A_{n+1} = 0) = 0.30 \\
p_{11} &= P(j = \min(3, A_{n+1})) = P(1 = \min(3, A_{n+1})) = P(A_{n+1} = 1) = 0.25 \\
p_{12} &= P(j = \min(3, A_{n+1})) = P(2 = \min(3, A_{n+1})) = P(A_{n+1} = 2) = 0.20 \\
p_{13} &= P(j = \min(3, A_{n+1})) = P(3 = \min(3, A_{n+1})) = P(A_{n+1} \ge 3) \\
&= P(A_{n+1} = 3) + P(A_{n+1} = 4) = 0.15 + 0.10 = 0.25.
\end{aligned}
$$


If Xn = i = 2, the next state is

$$
X_{n+1} = j = \min(3, X_n - 1 + A_{n+1}) = \min(3, 2 - 1 + A_{n+1}) = \min(3, 1 + A_{n+1}).
$$

The transition probabilities for state 2 are

$$
\begin{aligned}
p_{2j} &= P(X_{n+1} = j \mid X_n = 2) = P(X_{n+1} = j = \min(3, X_n - 1 + A_{n+1}) \mid X_n = 2) \\
&= P(j = \min(3, X_n - 1 + A_{n+1})) = P(j = \min(3, 2 - 1 + A_{n+1})) = P(j = \min(3, 1 + A_{n+1})) \\
p_{20} &= P(j = \min(3, 1 + A_{n+1})) = P(0 = \min(3, 1 + A_{n+1})) = P(1 + A_{n+1} = 0) \\
&= P(A_{n+1} = -1) = 0 \\
p_{21} &= P(j = \min(3, 1 + A_{n+1})) = P(1 = \min(3, 1 + A_{n+1})) = P(1 + A_{n+1} = 1) \\
&= P(A_{n+1} = 0) = 0.30 \\
p_{22} &= P(j = \min(3, 1 + A_{n+1})) = P(2 = \min(3, 1 + A_{n+1})) = P(1 + A_{n+1} = 2) \\
&= P(A_{n+1} = 1) = 0.25 \\
p_{23} &= P(j = \min(3, 1 + A_{n+1})) = P(3 = \min(3, 1 + A_{n+1})) = P(1 + A_{n+1} \ge 3) \\
&= P(A_{n+1} \ge 2) \\
&= P(A_{n+1} = 2) + P(A_{n+1} = 3) + P(A_{n+1} = 4) = 0.20 + 0.15 + 0.10 = 0.45.
\end{aligned}
$$

If Xn = i = 3, the next state is

$$
X_{n+1} = j = \min(3, X_n - 1 + A_{n+1}) = \min(3, 3 - 1 + A_{n+1}) = \min(3, 2 + A_{n+1}).
$$

The transition probabilities for state 3 are

$$
\begin{aligned}
p_{3j} &= P(X_{n+1} = j \mid X_n = 3) = P(X_{n+1} = j = \min(3, X_n - 1 + A_{n+1}) \mid X_n = 3) \\
&= P(j = \min(3, X_n - 1 + A_{n+1})) = P(j = \min(3, 3 - 1 + A_{n+1})) = P(j = \min(3, 2 + A_{n+1})) \\
p_{30} &= P(j = \min(3, 2 + A_{n+1})) = P(0 = \min(3, 2 + A_{n+1})) = P(2 + A_{n+1} = 0) \\
&= P(A_{n+1} = -2) = 0 \\
p_{31} &= P(j = \min(3, 2 + A_{n+1})) = P(1 = \min(3, 2 + A_{n+1})) = P(2 + A_{n+1} = 1) \\
&= P(A_{n+1} = -1) = 0 \\
p_{32} &= P(j = \min(3, 2 + A_{n+1})) = P(2 = \min(3, 2 + A_{n+1})) = P(2 + A_{n+1} = 2) \\
&= P(A_{n+1} = 0) = 0.30 \\
p_{33} &= P(j = \min(3, 2 + A_{n+1})) = P(3 = \min(3, 2 + A_{n+1})) = P(2 + A_{n+1} \ge 3) \\
&= P(A_{n+1} \ge 1) = P(A_{n+1} = 1) + P(A_{n+1} = 2) + P(A_{n+1} = 3) + P(A_{n+1} = 4) \\
&= 0.25 + 0.20 + 0.15 + 0.10 = 0.70.
\end{aligned}
$$


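The transition probabilities for the four-state Markov chain model of the
waiting line inside the recruiter’s office are collected to construct the
following transition probability matrix:

$$
P = \begin{array}{c|cccc}
\text{State} & 0 & 1 & 2 & 3 \\ \hline
0 & P(A_n = 0) & P(A_n = 1) & P(A_n = 2) & P(A_n \ge 3) \\
1 & P(A_n = 0) & P(A_n = 1) & P(A_n = 2) & P(A_n \ge 3) \\
2 & 0 & P(A_n = 0) & P(A_n = 1) & P(A_n \ge 2) \\
3 & 0 & 0 & P(A_n = 0) & P(A_n \ge 1)
\end{array} \tag{1.45}
$$

$$
= \begin{array}{c|cccc}
\text{State} & 0 & 1 & 2 & 3 \\ \hline
0 & 0.30 & 0.25 & 0.20 & 0.25 \\
1 & 0.30 & 0.25 & 0.20 & 0.25 \\
2 & 0 & 0.30 & 0.25 & 0.45 \\
3 & 0 & 0 & 0.30 & 0.70
\end{array}. \tag{1.46}
$$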
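The row-by-row derivation for the waiting line can be checked by accumulating the arrival probabilities over the next-state rule. The sketch below is illustrative: the names are assumptions, and the update rule is the pair of min relationships stated earlier for Xn = 0 and Xn > 0.

```python
# Arrival distribution from the text: P(A = k) for k = 0..4.
arrival_pmf = {0: 0.30, 1: 0.25, 2: 0.20, 3: 0.15, 4: 0.10}
CAPACITY = 3  # office capacity, including the candidate being interviewed

def next_state(x, arrivals):
    """Applicants in the office after the next interview period."""
    remaining = x - 1 if x > 0 else 0   # one interview completes when x > 0
    return min(CAPACITY, remaining + arrivals)

def waiting_line_matrix():
    """Accumulate arrival probabilities into the 4 x 4 transition matrix."""
    P = [[0.0] * (CAPACITY + 1) for _ in range(CAPACITY + 1)]
    for i in range(CAPACITY + 1):
        for arrivals, prob in arrival_pmf.items():
            P[i][next_state(i, arrivals)] += prob
    return P

P = waiting_line_matrix()
for row in P:
    print([round(entry, 2) for entry in row])
```

The printed rows can be compared with the numerical matrix in Equation (1.46).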

1.10.1.1.3 Inventory System


A retailer who sells personal computers can order the computers from a
manufacturer at the beginning of every period [1, 2]. All computers that are
ordered are delivered immediately. The demand for computers during each
period is an independent, identically distributed random variable with a
known probability distribution. The retailer has a limited storage capacity
sufficient to accommodate a maximum inventory of three computers. The
number of computers in inventory at epoch n, the end of period n, is denoted
by Xn. Hence, the inventory on hand at epoch n − 1, the beginning of period n,
is denoted by Xn–1. The number of computers ordered and delivered immedi-
ately at the beginning of period n is denoted by cn–1. The number of comput-
ers demanded by customers during period n is an independent, identically
distributed random variable denoted by dn. The demand for computers in
every period has the following stationary probability distribution.

Demand dn in period n    0      1      2      3
Probability, p(dn)       0.3    0.4    0.1    0.2

Note that the number of computers demanded in a period cannot exceed three.
The retailer observes the inventory level at the beginning of every period.
Computers are ordered and delivered immediately at the beginning of a
period according to the following inventory ordering policy:

If Xn−1 < 2, the retailer orders cn−1 = 3 − Xn−1 computers, which are delivered
immediately to increase the beginning inventory to three computers.


If Xn−1 ≥ 2, the retailer does not order, so that the beginning inventory
remains Xn−1 computers.

This policy has the form (s, S), and is called an (s, S) inventory ordering policy.
The quantity s = 2 is called the reorder point, and the quantity S = 3 is called
the reorder level. Observe that under this (2, 3) policy,

If Xn−1 = 0, the retailer orders cn−1 = 3 − 0 = 3 computers, which increase
the beginning inventory level to three computers.
If Xn−1 = 1, the retailer orders cn−1 = 3 − 1 = 2 computers, which increase
the beginning inventory level to three computers.
If Xn−1 = 2, the retailer does not order, so that cn−1 = 0, and the beginning
inventory level remains two computers.
If Xn−1 = 3, the retailer does not order, so that cn−1 = 0, and the beginning
inventory level remains three computers.

Observe that the inventory level Xn at the end of period n is a random
variable, which is dependent on the inventory level Xn−1 at the beginning
of the period, on the quantity cn−1 ordered and delivered at the beginning
of the period, and on the demand dn, during the period. The demand, dn, is
governed by a probability distribution, which is known and is independent
of the inventory on hand. Therefore, the sequence {X0, X1, X2, …} forms a
Markov chain, which can be used to model the retailer’s inventory system.
The state Xn−1 is the inventory level or number of computers on hand at the
beginning of period n. The Markov chain has four states. The state space is
E = {0, 1, 2, 3}.
When the inventory on hand at the beginning of a period plus the num-
ber of computers ordered exceeds the number demanded by customers
during the period, the number of unsold computers left over at the end
of the period is Xn−1 + cn−1 − dn > 0. However, when the demand exceeds
the beginning inventory plus the quantity ordered, sales are lost because
customers who are unable to buy computers in the current period will
not wait until the next period to purchase them. When sales are lost, the
ending inventory level will be zero. In this case, the number of lost sales
of computers, or the shortage of computers, at the end of the period is
dn − Xn−1 − cn−1 > 0.
Transition probabilities are calculated by evaluating the relationship
between beginning and ending inventory levels during a period.
In state 0, three computers are ordered. When Xn−1 = i = 0 and cn−1 = 3, the
next state is

$$
\begin{aligned}
X_n &= X_{n-1} + c_{n-1} - d_n = j \\
&= i + c_{n-1} - d_n = 0 + 3 - d_n = 3 - d_n.
\end{aligned}
$$

The transition probabilities for state 0 are

$$
p_{0j} = P(X_n = j \mid X_{n-1} = 0) = P(X_n = 3 - d_n \mid X_{n-1} = 0) = P(j = 3 - d_n).
$$

Hence,

$$
\begin{aligned}
\text{When } j = 0, \quad p_{00} &= P(X_n = 0 \mid X_{n-1} = 0) = P(j = 3 - d_n) = P(0 = 3 - d_n) = P(d_n = 3) = 0.2 \\
\text{When } j = 1, \quad p_{01} &= P(X_n = 1 \mid X_{n-1} = 0) = P(j = 3 - d_n) = P(1 = 3 - d_n) = P(d_n = 2) = 0.1 \\
\text{When } j = 2, \quad p_{02} &= P(X_n = 2 \mid X_{n-1} = 0) = P(j = 3 - d_n) = P(2 = 3 - d_n) = P(d_n = 1) = 0.4 \\
\text{When } j = 3, \quad p_{03} &= P(X_n = 3 \mid X_{n-1} = 0) = P(j = 3 - d_n) = P(3 = 3 - d_n) = P(d_n = 0) = 0.3.
\end{aligned}
$$

In state 1, two computers are ordered. When Xn−1 = i = 1 and cn−1 = 2, the next
state is

$$
\begin{aligned}
X_n &= X_{n-1} + c_{n-1} - d_n = j \\
&= i + c_{n-1} - d_n = 1 + 2 - d_n = 3 - d_n.
\end{aligned}
$$

The transition probabilities for state 1 are

$$
p_{1j} = P(X_n = j \mid X_{n-1} = 1) = P(X_n = 3 - d_n \mid X_{n-1} = 1) = P(j = 3 - d_n).
$$

Hence,

$$
\begin{aligned}
\text{When } j = 0, \quad p_{10} &= P(X_n = 0 \mid X_{n-1} = 1) = P(j = 3 - d_n) = P(0 = 3 - d_n) = P(d_n = 3) = 0.2 \\
\text{When } j = 1, \quad p_{11} &= P(X_n = 1 \mid X_{n-1} = 1) = P(j = 3 - d_n) = P(1 = 3 - d_n) = P(d_n = 2) = 0.1 \\
\text{When } j = 2, \quad p_{12} &= P(X_n = 2 \mid X_{n-1} = 1) = P(j = 3 - d_n) = P(2 = 3 - d_n) = P(d_n = 1) = 0.4 \\
\text{When } j = 3, \quad p_{13} &= P(X_n = 3 \mid X_{n-1} = 1) = P(j = 3 - d_n) = P(3 = 3 - d_n) = P(d_n = 0) = 0.3.
\end{aligned}
$$

Observe that the transition probabilities in state 1 are identical to those in
state 0. In state 2, zero computers are ordered. When Xn–1 = i = 2 and cn–1 = 0,
the next state is Xn = Xn–1 + cn–1 − dn = j, provided that Xn–1 + cn–1 − dn ≥ 0.
That is,

$$
X_n = j = i + c_{n-1} - d_n = 2 + 0 - d_n = 2 - d_n, \quad \text{provided that } 2 - d_n \ge 0 \text{ or } d_n \le 2.
$$

When dn = 3, the demand exceeds the beginning inventory plus the quantity
ordered, and the sale of one computer is lost. To ensure that the ending
inventory is nonnegative, the equation for the next state, which represents
the ending inventory, is expressed in the form

$$
X_n = j = \max(2 - d_n, 0).
$$

For example,

When dn = 0, Xn = max (2 − dn, 0) = max (2 − 0, 0) = max (2, 0) = 2.
When dn = 1, Xn = max (2 − dn, 0) = max (2 − 1, 0) = max (1, 0) = 1.
When dn = 2, Xn = max (2 − dn, 0) = max (2 − 2, 0) = max (0, 0) = 0.
When dn = 3, Xn = max (2 − dn, 0) = max (2 − 3, 0) = max (−1, 0) = 0.

The transition probabilities for state 2 are

$$
p_{2j} = P(X_n = j \mid X_{n-1} = 2) = P(X_n = \max(2 - d_n, 0) \mid X_{n-1} = 2) = P(j = \max(2 - d_n, 0)).
$$

Hence,

$$
\begin{aligned}
\text{When } j = 0, \quad p_{20} &= P(X_n = 0 \mid X_{n-1} = 2) = P(j = \max(2 - d_n, 0)) = P(0 = \max(2 - d_n, 0)) = P(d_n \ge 2) \\
&= P(d_n = 2) + P(d_n = 3) = 0.1 + 0.2 = 0.3 \\
\text{When } j = 1, \quad p_{21} &= P(X_n = 1 \mid X_{n-1} = 2) = P(j = \max(2 - d_n, 0)) = P(1 = \max(2 - d_n, 0)) = P(d_n = 1) = 0.4 \\
\text{When } j = 2, \quad p_{22} &= P(X_n = 2 \mid X_{n-1} = 2) = P(j = \max(2 - d_n, 0)) = P(2 = \max(2 - d_n, 0)) = P(d_n = 0) = 0.3 \\
\text{When } j = 3, \quad p_{23} &= P(X_n = 3 \mid X_{n-1} = 2) = P(j = \max(2 - d_n, 0)) = P(3 = \max(2 - d_n, 0)) = P(d_n = -1) = 0.
\end{aligned}
$$

In state 3, zero computers are ordered. When Xn−1 = i = 3 and cn−1 = 0, the next
state is

$$
\begin{aligned}
X_n &= X_{n-1} + c_{n-1} - d_n = j \\
&= i + c_{n-1} - d_n = 3 + 0 - d_n = 3 - d_n.
\end{aligned}
$$

The transition probabilities for state 3 are

$$
p_{3j} = P(X_n = j \mid X_{n-1} = 3) = P(X_n = 3 - d_n \mid X_{n-1} = 3) = P(j = 3 - d_n).
$$

Hence,

$$
\begin{aligned}
\text{When } j = 0, \quad p_{30} &= P(X_n = 0 \mid X_{n-1} = 3) = P(j = 3 - d_n) = P(0 = 3 - d_n) = P(d_n = 3) = 0.2 \\
\text{When } j = 1, \quad p_{31} &= P(X_n = 1 \mid X_{n-1} = 3) = P(j = 3 - d_n) = P(1 = 3 - d_n) = P(d_n = 2) = 0.1 \\
\text{When } j = 2, \quad p_{32} &= P(X_n = 2 \mid X_{n-1} = 3) = P(j = 3 - d_n) = P(2 = 3 - d_n) = P(d_n = 1) = 0.4 \\
\text{When } j = 3, \quad p_{33} &= P(X_n = 3 \mid X_{n-1} = 3) = P(j = 3 - d_n) = P(3 = 3 - d_n) = P(d_n = 0) = 0.3.
\end{aligned}
$$

The transition probabilities in state 3 are identical to those in states 0 and 1.


The transition probabilities for the four-state Markov chain model of the
retailer’s inventory system under a (2, 3) policy are collected to construct the
following transition probability matrix:

$$
P = \begin{array}{ccc|c|cccc}
\text{Beginning inventory, } X_{n-1} & \text{Order, } c_{n-1} & X_{n-1} + c_{n-1} & \text{State} & 0 & 1 & 2 & 3 \\ \hline
0 & 3 & 3 & 0 & P(d_n = 3) & P(d_n = 2) & P(d_n = 1) & P(d_n = 0) \\
1 & 2 & 3 & 1 & P(d_n = 3) & P(d_n = 2) & P(d_n = 1) & P(d_n = 0) \\
2 & 0 & 2 & 2 & P(d_n \ge 2) & P(d_n = 1) & P(d_n = 0) & 0 \\
3 & 0 & 3 & 3 & P(d_n = 3) & P(d_n = 2) & P(d_n = 1) & P(d_n = 0)
\end{array} \tag{1.47}
$$

$$
P = [p_{ij}] = \begin{array}{c|cccc}
\text{State} & 0 & 1 & 2 & 3 \\ \hline
0 & 0.2 & 0.1 & 0.4 & 0.3 \\
1 & 0.2 & 0.1 & 0.4 & 0.3 \\
2 & 0.3 & 0.4 & 0.3 & 0 \\
3 & 0.2 & 0.1 & 0.4 & 0.3
\end{array}. \tag{1.48}
$$

Observe that the transition probabilities in states 0, 1, and 3 are the same. The
inventory system is enlarged to create a Markov chain with rewards in Section
4.2.3.4.1 and a Markov decision process in Sections 5.1.3.3.1 and 5.2.2.4.1.
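The inventory transition matrix under the (2, 3) policy can be rebuilt from the demand distribution in a few lines. The sketch below is illustrative and its names are assumptions. It applies the ordering rule and the lost-sales rule described above: order up to three when the beginning inventory is below two, and set the ending inventory to zero when demand exceeds stock.

```python
# Demand distribution from the text: P(d = k) for k = 0..3.
demand_pmf = {0: 0.3, 1: 0.4, 2: 0.1, 3: 0.2}
ORDER_UP_TO = 3      # S in the (s, S) policy
REORDER_BELOW = 2    # s: an order is placed when inventory is below this value

def inventory_matrix():
    """Transition matrix on ending-inventory states 0..3 under the (2, 3) policy."""
    P = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        order = ORDER_UP_TO - i if i < REORDER_BELOW else 0
        stock = i + order                      # inventory after immediate delivery
        for demand, prob in demand_pmf.items():
            P[i][max(stock - demand, 0)] += prob   # unmet demand is lost
    return P

P = inventory_matrix()
for row in P:
    print([round(entry, 2) for entry in row])
```

Changing ORDER_UP_TO or REORDER_BELOW gives the matrix for a different (s, S) policy, provided the storage limit of three computers is respected.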

1.10.1.1.4 Machine Breakdown and Repair


A factory has two machines and one repair crew [4]. Only one machine is
used at any given time. A machine breaks down at the end of a day with
probability p. The repair crew can work on only one machine at a time.
When a machine has broken down at the end of the previous day, a repair
can be completed in 1 day with probability r or in 2 days with probability
1 − r. All repairs are completed at the end of the day, and no repair takes
longer than 2 days. All breakdowns and repairs are independent events.
To model this process as a Markov chain, let the state Xn = (u, v) be an
ordered pair. Element u is the number of machines in
operating condition at the end of day n. Element v is the number of days’
work expended on a machine not yet repaired. The state space is given by
E = {(2, 0), (1, 0), (1, 1), (0, 1)}. For example, state (0, 1) indicates that neither
machine is in operating condition, and 1 day’s work was expended on a
machine not yet repaired. If the process is in state (1, 1) or (0, 1), a repair is
certain to be completed in 1 day. To see how to compute transition proba-
bilities, suppose, for example, that the process is in state (1, 0), which indi-
cates that one machine is in operating condition and that 0 days’ work was
expended repairing the other machine. If with probability 1 − p the machine
in operating condition does not fail, and with probability r the repair of the
other machine is completed at the end of the day, then the process moves
to state (2, 0) with transition probability (1 − p)r. If with probability p the

51113_C001.indd 37 9/23/2010 12:58:58 PM


38 Markov Chains and Decision Processes for Engineers and Managers

machine in operating condition breaks down, and with probability r the


repair of the other machine is completed at the end of the day, then the pro-
cess remains in state (1, 0) with transition probability pr. If with probability
1 − p the machine in operating condition does not fail, and with probabil-
ity 1 − r the repair of the other machine is not completed at the end of the
day, then the process moves to state (1, 1) with transition probability (1 − p)
(1 − r). Finally, if with probability p the machine in operating condition
breaks down, and with probability 1 − r the repair of the other machine is
not completed at the end of the day, then the process moves to state (0, 1)
with transition probability p(1 − r). The transition probability matrix is

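$$
P = \begin{array}{c|cccc}
 & (2, 0) & (1, 0) & (1, 1) & (0, 1) \\ \hline
(2, 0) & 1 - p & p & 0 & 0 \\
(1, 0) & (1 - p)r & pr & (1 - p)(1 - r) & p(1 - r) \\
(1, 1) & 1 - p & p & 0 & 0 \\
(0, 1) & 0 & 1 & 0 & 0
\end{array}. \tag{1.49}
$$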
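
As a quick check on Equation (1.49), the matrix can be built for specific
numbers and its rows summed. The sketch below is illustrative only: the helper
name `repair_matrix` and the values p = 0.1 and r = 0.6 are assumptions, since
the text leaves p and r symbolic.

```python
import numpy as np

# States in the row/column order of Equation (1.49).
STATES = [(2, 0), (1, 0), (1, 1), (0, 1)]

def repair_matrix(p, r):
    """Return the 4x4 transition matrix of Equation (1.49) (hypothetical helper)."""
    return np.array([
        [1 - p,       p,     0.0,               0.0],
        [(1 - p) * r, p * r, (1 - p) * (1 - r), p * (1 - r)],
        [1 - p,       p,     0.0,               0.0],
        [0.0,         1.0,   0.0,               0.0],
    ])

# Illustrative values only; they do not come from the text.
P = repair_matrix(p=0.1, r=0.6)
row_sums = P.sum(axis=1)   # every row of a transition matrix must sum to 1
print(P)
print(row_sums)
```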

1.10.1.1.5 Component Replacement


Consider an electronic device that contains a number of interchangeable
components that operate independently [2]. Each component is subject to
random failure, which makes it completely inoperable. All components are
inspected at the end of every week. At the end of a week, a component that is
observed to have failed is immediately replaced with an identical new com-
ponent. When the new component fails, it is again replaced by an identical
one, and so forth. Table 1.6 contains mortality data collected for a group of
100 of these components used over a 4-week period.

TABLE 1.6
Mortality Data for 100 Components Used Over a 4-Week Period
End of Week, n   Survivors, Sn   Failures, Fn   P(Failure) = Fn/100   Cond. P(Failure) = Fn/Sn−1
0                100             0              0                     –
1                80              20             0.20                  0.2 = 20/100
2                50              30             0.30                  0.375 = 30/80
3                10              40             0.40                  0.8 = 40/50
4                0               10             0.10                  1 = 10/10

Observe that at the end of the 4-week period, none of the 100 components
has survived, that is, all have failed. Therefore, the mortality data indicates
that the service life of an individual component is at most 4 weeks. The
replacement policy is to replace a component at the end of the week in which
it has failed. Since no component survives longer than 4 weeks, a component
may be replaced every 1, 2, 3, or 4 weeks.
Note that 0.20 of new components fail during their first week of life, 0.30 of
new components fail during their second week of life, 0.40 fail during their
third week, and the remaining 0.10 fail during their fourth week. In addition,
0.375 of 1-week-old components fail during their second week of life, 0.80 of
2-week-old components fail during their third week of life, and all 3-week-old
components fail during their fourth week of life. Let Sn denote the number of
components that survive until the end of week n. Let Fn = Sn−1 − Sn.
Thus, Fn denotes the number of components that fail during week n of life.
The probability that a component fails during its nth week of life is equal
to Fn/100. The conditional probability that an individual component fails
during week n of its life, given that it has survived for n − 1 weeks, is equal
to Fn /Sn−1 = (Sn−1 − Sn)/Sn−1.
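
The last two columns of Table 1.6 follow mechanically from the survivor
counts. The short sketch below recomputes them; the variable names are
illustrative assumptions, and only the survivor counts come from the table.

```python
# Survivor counts S_0, ..., S_4 from Table 1.6.
survivors = [100, 80, 50, 10, 0]

# F_n = S_{n-1} - S_n for n = 1, ..., 4.
failures = [survivors[n - 1] - survivors[n] for n in range(1, len(survivors))]

# Unconditional probability of failing in week n of life: F_n / 100.
p_fail = [f / survivors[0] for f in failures]

# Conditional probability of failing in week n, given survival for n - 1 weeks:
# F_n / S_{n-1}.
cond_fail = [f / survivors[n - 1] for n, f in enumerate(failures, start=1)]

print(failures)
print(p_fail)
print(cond_fail)
```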
The mortality data indicates that the conditional probability that an indi-
vidual component will fail depends only on its age. Let i represent the age of
a component in weeks. The conditional probability that a component of age
i will fail during week i + 1 of its life, given that it has survived for i weeks,
is indicated in Table 1.7.
The process of component failure and replacement can be modeled as
a four-state recurrent Markov chain. Let the state Xn−1 denote the age of a
component when it is observed at the end of week n − 1. The state space is
E = {0, 1, 2, 3}. If a component of age i fails during week n, then it is replaced
with a new component of age 0 at the end of week n, so that Xn = 0. Thus, if
a component of age i fails during week n, the Markov chain model makes a
transition from the state Xn−1 = i at the end of week n − 1 to the state Xn = 0 at
the end of week n. The associated transition probability is

$$
p_{i0} = P(X_n = 0 \mid X_{n-1} = i) = F_{i+1}/S_i = (S_i - S_{i+1})/S_i.
$$

The mortality data indicates that p00 = 0.2, p10 = 0.375, and p20 = 0.8. Since a
component of age 3 is certain to fail during the current week and be replaced
at the end of the week, the transition probability for a 3-week-old component
is p30 = P (Xn = 0|Xn−1 = 3) = F4/S3 = 1.

TABLE 1.7
Conditional Probability That a Component of Age i Will Fail During Week i + 1
Age of a component in weeks, i        0              1               2             3
Conditional probability of failing    0.2 = 20/100   0.375 = 30/80   0.8 = 40/50   1 = 10/10
during week i + 1 of life             = F1/S0        = F2/S1         = F3/S2       = F4/S3

The age of the component in use during week n − 1 is Xn−1 = i. For
any n − 1, Xn = 0 if the component failed during week n. If the component
survived during week n, then the component is 1 week older, so that
Xn = Xn−1 + 1 = i + 1. The conditional probabilities that a component of age
i survives one additional week are calculated below:

$$
\begin{aligned}
p_{i,i+1} &= P(X_n = i + 1 \mid X_{n-1} = i) \\
&= P(\text{component survives 1 additional week} \mid \text{component has survived } i \text{ weeks}) \\
&= S_{i+1}/S_i \\
&= 1 - p_{i0} = 1 - F_{i+1}/S_i = 1 - (S_i - S_{i+1})/S_i \\
p_{01} &= P(X_n = 1 \mid X_{n-1} = 0) = S_1/S_0 = 80/100 = 0.8 \\
p_{12} &= P(X_n = 2 \mid X_{n-1} = 1) = S_2/S_1 = 50/80 = 0.625 \\
p_{23} &= P(X_n = 3 \mid X_{n-1} = 2) = S_3/S_2 = 10/50 = 0.2 \\
p_{34} &= P(X_n = 4 \mid X_{n-1} = 3) = S_4/S_3 = 0/10 = 0.
\end{aligned}
$$

The four-state transition probability matrix is given below:

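$$
P = \begin{array}{c|cccc}
\text{State} & 0 & 1 & 2 & 3 \\ \hline
0 & 0.2 & 0.8 & 0 & 0 \\
1 & 0.375 & 0 & 0.625 & 0 \\
2 & 0.8 & 0 & 0 & 0.2 \\
3 & 1 & 0 & 0 & 0
\end{array}. \tag{1.50}
$$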

The component replacement problem is enlarged to create a Markov chain with
rewards in Section 4.2.3.4.2 and a Markov decision process in
Section 5.1.3.3.2.
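
Equation (1.50) can be assembled directly from the survivor counts of
Table 1.6, using p(i, 0) = 1 − S(i+1)/S(i) and p(i, i+1) = S(i+1)/S(i). The
code below is a minimal sketch under that reading; the variable names and the
use of NumPy are assumptions.

```python
import numpy as np

# Survivor counts S_0, ..., S_4 from Table 1.6.
survivors = [100, 80, 50, 10, 0]
n_states = len(survivors) - 1          # ages 0, 1, 2, 3

P = np.zeros((n_states, n_states))
for i in range(n_states):
    survive = survivors[i + 1] / survivors[i]   # S_{i+1} / S_i
    P[i, 0] = 1.0 - survive                     # fails, replaced by age-0 part
    if i + 1 < n_states:
        P[i, i + 1] = survive                   # survives, ages by one week

print(P)
```

A component of age 3 never survives (S4/S3 = 0), so no fifth state is needed.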

1.10.1.1.6 Independent Trials Process: Evaluating Candidates for a Secretarial Position

An independent trials process is a sequence of independent experiments or
trials such that the outcome of any one trial does not affect the outcome of
any other trial [2, 4]. Each trial is assumed to have a finite number of out-
comes. The set of possible outcomes and the probability distribution for this
set of outcomes are the same for every trial. To treat this process as a special
case of a Markov chain, let the state Xn denote the outcome of the nth trial.
The Markov property holds because given the outcome of the present trial,
the outcome of the next trial is independent of the outcomes of the previous
trials (and is also independent of the outcome of the present trial). Therefore,
a transition probability is given by

$$
p_{ij} = P(X_{n+1} = j \mid X_n = i) = P(X_{n+1} = j) = P(X_n = j) = P(X_{n-1} = j) = \cdots = P(X_1 = j).
$$

Because the trials are independent, the joint probability of a sequence of n
outcomes is equal to the product of the n marginal probabilities. That is,
after n trials,

$$
P(X_1 = j_1, X_2 = j_2, \ldots, X_n = j_n) = P(X_1 = j_1) P(X_2 = j_2) \cdots P(X_n = j_n).
$$

Consider the following example of an independent trials process involving
the evaluation of candidates or applicants for a secretarial position.
Suppose that an executive must hire a secretary. The executive will interview
one candidate per day. After each interview the executive will assign
one of the following four numerical scores to the current candidate:

Candidate     Poor   Fair   Good   Excellent
Score         15     20     25     30

The scores assigned to the candidates, who arrive independently to be
interviewed, are expected to vary according to the following stationary
probability distribution:

Candidate     Poor   Fair   Good   Excellent
Score         15     20     25     30
Probability   0.3    0.4    0.2    0.1

The possible scores and the probabilities of these scores are the same for
every applicant. The process of interviewing and assigning scores to can-
didates is an independent trials process because the score assigned to one
candidate does not affect the score assigned to any other candidate.
To formulate this independent trials process as an irreducible Markov
chain, let the state Xn denote the score assigned to the nth candidate, for
n = 1, 2, 3, … . The state space is E = {15, 20, 25, 30}. The sequence {X1, X2, X3, …}
is a collection of independent, identically distributed random variables. The
probability distribution of Xn is shown in Table 1.8.

TABLE 1.8
Probability Distribution of Candidate Scores
Candidate      Poor   Fair   Good   Excellent
State Xn = i   15     20     25     30
P(Xn = i)      0.3    0.4    0.2    0.1

The score or state Xn+1 of the next applicant is independent of the state
Xn of the current applicant. Hence, pij = P(Xn+1 = j|Xn = i) = P(Xn+1 = j). The
sequence {X1, X2, X3, …} for this independent trials process forms a Markov
chain. The transition probability matrix is

$$
P = \begin{array}{c|cccc}
\text{State} & 15 & 20 & 25 & 30 \\ \hline
15 & P(X_{n+1} = 15) & P(X_{n+1} = 20) & P(X_{n+1} = 25) & P(X_{n+1} = 30) \\
20 & P(X_{n+1} = 15) & P(X_{n+1} = 20) & P(X_{n+1} = 25) & P(X_{n+1} = 30) \\
25 & P(X_{n+1} = 15) & P(X_{n+1} = 20) & P(X_{n+1} = 25) & P(X_{n+1} = 30) \\
30 & P(X_{n+1} = 15) & P(X_{n+1} = 20) & P(X_{n+1} = 25) & P(X_{n+1} = 30)
\end{array}
= \begin{array}{c|cccc}
\text{State} & 15 & 20 & 25 & 30 \\ \hline
15 & 0.3 & 0.4 & 0.2 & 0.1 \\
20 & 0.3 & 0.4 & 0.2 & 0.1 \\
25 & 0.3 & 0.4 & 0.2 & 0.1 \\
30 & 0.3 & 0.4 & 0.2 & 0.1
\end{array}. \tag{1.51}
$$

Note that all the rows of P are identical. This property holds for every
independent trials process. Two extended forms of this problem, called
the secretary problem, will be formulated as Markov decision processes in
Sections 5.1.3.3.3 and 5.2.2.4.2.
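
The identical-rows property is easy to reproduce numerically. The sketch below
builds Equation (1.51) by repeating the score distribution of Table 1.8 once
per state; the variable names and the use of NumPy are assumptions. It also
checks one consequence of identical rows that sum to 1, namely that
multiplying the matrix by itself returns the same matrix.

```python
import numpy as np

scores = [15, 20, 25, 30]              # state space E
pmf = np.array([0.3, 0.4, 0.2, 0.1])   # P(X_n = i) from Table 1.8

# Every row of P is the same distribution, whatever the current state.
P = np.tile(pmf, (len(scores), 1))

rows_identical = bool(np.all(P == P[0]))
# With identical rows summing to 1, P @ P reproduces P.
two_step_unchanged = bool(np.allclose(P @ P, P))

print(P)
print(rows_identical, two_step_unchanged)
```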

1.10.1.1.7 Birth and Death Process


A birth and death process is a Markov chain in which each transition either
moves the chain to an adjacent state or leaves the present state unchanged [6]. If at
epoch n the chain is in state i, then at epoch n + 1 the chain is in state i + 1,
i − 1, or i. A birth and death process can model a service facility by letting Xn
denote the number of customers inside the facility at epoch n. Suppose the
facility has a capacity of three customers including the one receiving service.
The state space is E = {0, 1, 2, 3}. A birth represents the arrival of a customer.
A customer who arrives when the server is not busy and the facility is not
full goes directly into service. If the server is busy, the customer waits for
service. Any customer who arrives when the facility is full is not admitted.
When Xn = i < 3, the probability of a birth is bi. A death, which occurs with
probability di when Xn = i > 0, represents the departure of a customer who
has completed service. A birth and death cannot occur simultaneously.
A birth and death process is a Markov chain because the number of cus-
tomers at the next epoch depends only on the number at the present epoch.
The transition probabilities are given below:

$$
P(X_{n+1} = i + 1 \mid X_n = i) =
\begin{cases}
b_i, & i = 0, 1, 2 \\
0, & i = 3
\end{cases}
$$

$$
P(X_{n+1} = i - 1 \mid X_n = i) =
\begin{cases}
d_i, & i = 1, 2, 3 \\
0, & i = 0
\end{cases}
$$

$$
P(X_{n+1} = i \mid X_n = i) = 1 - b_i - d_i.
$$

The chain has the following transition probability matrix:

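$$
P = \begin{array}{c|cccc}
\text{State} & 0 & 1 & 2 & 3 \\ \hline
0 & 1 - b_0 & b_0 & 0 & 0 \\
1 & d_1 & 1 - b_1 - d_1 & b_1 & 0 \\
2 & 0 & d_2 & 1 - b_2 - d_2 & b_2 \\
3 & 0 & 0 & d_3 & 1 - d_3
\end{array}. \tag{1.52}
$$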
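
The tridiagonal structure of Equation (1.52) can be generated for any list of
birth and death probabilities. The following is a sketch only: the function
name `birth_death_matrix` and the numerical values of the birth and death
probabilities are illustrative assumptions, since the text keeps them symbolic.

```python
import numpy as np

def birth_death_matrix(births, deaths):
    """Build the matrix of Equation (1.52) (hypothetical helper).

    births[i] is b_i for i = 0, ..., N-1; deaths[i-1] is d_i for i = 1, ..., N.
    """
    n = len(births) + 1
    P = np.zeros((n, n))
    for i in range(n):
        b = births[i] if i < n - 1 else 0.0    # no birth when the facility is full
        d = deaths[i - 1] if i > 0 else 0.0    # no death when the facility is empty
        if i < n - 1:
            P[i, i + 1] = b
        if i > 0:
            P[i, i - 1] = d
        P[i, i] = 1.0 - b - d                  # state unchanged
    return P

# Illustrative probabilities only; they do not come from the text.
P = birth_death_matrix(births=[0.5, 0.4, 0.3], deaths=[0.2, 0.3, 0.4])
print(P)
```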

1.10.1.2 Reducible Unichain


A reducible unichain consists of one closed communicating class of recurrent
states plus one or more transient states. When the recurrent chain is a single
absorbing state, the reducible unichain is called an absorbing unichain.

1.10.1.2.1 Absorbing Markov Chain


An absorbing unichain contains one absorbing state plus a set of transient
states. Generally, an absorbing unichain is called an absorbing Markov chain.

1.10.1.2.1.1 Selling a Stock for a Target Price

Suppose that at the end of a
month a woman buys one share of a certain stock for $10. The share price,
rounded to the nearest $10, has been varying among the prices $0, $10, and
$20 from month to month. She plans to sell the stock at the end of the first month
in which the share price rises to $20. She believes that the price of the stock
can be modeled as a Markov chain in which the state, Xn, denotes the share
price at the end of month n. The state space for the stock price is E = {$0,
$10, $20}. The state Xn = $20 is an absorbing state, reached when the stock is
sold. The two remaining states, which are entered when the stock is held,
are transient. She models her investment as an absorbing unichain with the
following transition probability matrix represented in the canonical form of
Equation (1.43).

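$$
P = \begin{array}{c|ccc}
X_n \backslash X_{n+1} & 20 & 10 & 0 \\ \hline
20 & 1 & 0 & 0 \\
10 & 0.1 & 0.6 & 0.3 \\
0 & 0.4 & 0.2 & 0.4
\end{array}
= \begin{bmatrix} 1 & 0 \\ D & Q \end{bmatrix}. \tag{1.53}
$$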


A model for selling a stock with two target prices will be treated in
Sections 4.2.5.4.2 and 4.2.5.5.
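
The canonical-form partition of Equation (1.53) can be read off by slicing the
matrix. This is a minimal sketch, assuming NumPy and illustrative variable
names; it only separates the blocks D and Q and identifies the absorbing state.

```python
import numpy as np

# States in the row/column order of Equation (1.53): $20, $10, $0.
states = [20, 10, 0]
P = np.array([
    [1.0, 0.0, 0.0],
    [0.1, 0.6, 0.3],
    [0.4, 0.2, 0.4],
])

# A state is absorbing when it returns to itself with probability 1.
absorbing = [s for k, s in enumerate(states) if P[k, k] == 1.0]

# Canonical form: the first row/column belongs to the absorbing state.
D = P[1:, :1]   # transient -> absorbing, shape (2, 1)
Q = P[1:, 1:]   # transient -> transient, shape (2, 2)

print(absorbing)
print(D)
print(Q)
```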

1.10.1.2.1.2 Machine Deterioration

Consider a machine used in a production process [6]. Suppose that the
condition of the machine deteriorates over time. The machine is observed at
the beginning of each day. As Table 1.9 indicates, the condition of the
machine can be represented by one of the four states.

TABLE 1.9
States of a Machine
State Description
1 Not Working (NW), Inoperable
2 Working, with a Major Defect (WM)
3 Working, with a Minor Defect (Wm)
4 Working Properly (WP)

(Note that the states are labeled so that as the index of the state increases,
the condition of the machine improves.) The state of the machine at the start
of tomorrow depends only on its state at the start of today, and is indepen-
dent of its past history. Hence, the condition of the machine can be modeled
as a four-state Markov chain. Let Xn−1 denote the state of the machine when
it is observed at the start of day n. The state space is E = {1, 2, 3, 4}. Assume
that at the start of each day, the engineer in charge of the production pro-
cess does nothing to respond to the deterioration of the machine, that is,
she does not perform maintenance. Therefore, the condition of the machine
will either deteriorate by one or more states or remain unchanged. In other
words, if the engineer responsible for the machine always does nothing, then
at the start of tomorrow, the condition of the machine will be worse than or
equal to the condition today. All state transitions caused by deterioration are
assumed to occur at the end of the day. A transition probability matrix for
the machine when it is left alone for one day is given below in the canonical
form of Equation (1.43):

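$$
P = \begin{array}{cl|cccc}
1 & \text{Not Working} & 1 & 0 & 0 & 0 \\
2 & \text{Major Defect} & 0.6 & 0.4 & 0 & 0 \\
3 & \text{Minor Defect} & 0.2 & 0.3 & 0.5 & 0 \\
4 & \text{Working Properly} & 0.3 & 0.2 & 0.1 & 0.4
\end{array}
= \begin{bmatrix} 1 & 0 \\ D & Q \end{bmatrix}. \tag{1.54}
$$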

The transition probability matrix for a machine that is left alone represents
an absorbing Markov chain. As the machine deteriorates daily, it will
eventually enter state 1, an absorbing state, where it will remain, not working.
For example, if today a machine is in state 3, working, with a minor defect,
then tomorrow, with transition probability p32 = P(Xn = 2|Xn−1 = 3) = 0.3, the
machine will be in state 2, working, with a major defect. One day later,
with transition probability p21 = P(Xn = 1|Xn−1 = 2) = 0.6, the chain will be
absorbed in state 1, where the machine will remain, not working. The model
of machine deterioration will be revisited in Sections 3.3.2.1 and 3.3.3.1.
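
Two features of Equation (1.54) can be verified with a few lines of code: the
matrix is lower triangular because an unmaintained machine never improves, and
the probability of the specific two-day path from state 3 to state 2 to state 1
is the product p32 p21. The sketch assumes NumPy; the variable names are
illustrative.

```python
import numpy as np

# Equation (1.54); row and column k correspond to state k + 1.
P = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.6, 0.4, 0.0, 0.0],
    [0.2, 0.3, 0.5, 0.0],
    [0.3, 0.2, 0.1, 0.4],
])

# No entry above the diagonal: the state index can never increase.
no_improvement = bool(np.allclose(P, np.tril(P)))

# Probability of the path 3 -> 2 -> 1 described in the text: p32 * p21.
path_prob = P[2, 1] * P[1, 0]

print(no_improvement)
print(path_prob)
```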

1.10.1.2.2 Unichain with Recurrent States


A reducible unichain with recurrent states and no absorbing states consists
of one recurrent chain plus transient states.

1.10.1.2.2.1 Machine Maintenance

Consider the absorbing Markov chain model of machine deterioration introduced
in Section 1.10.1.2.1.2 [6]. Suppose that, at the start of each day, the
engineer in charge of the production process can respond to the deterioration
of the machine either by doing nothing (as in Section 1.10.1.2.1.2) or by
choosing among the following four alternative maintenance actions:

Decision, k   Maintenance Action
1             Do Nothing
2             Overhaul
3             Repair
4             Replace

The success or failure of a maintenance action depends only on the present
state of the machine, and does not depend on its past behavior. All
maintenance actions will take exactly 1 day to complete. However, decisions to
overhaul or repair the machine are not certain to succeed. If the engineer
has the machine overhauled, the overhaul will be completed, either suc-
cessfully or unsuccessfully, in 1 day. An overhaul may be successful, with
probability 0.8, or unsuccessful, with probability 0.2. If an overhaul is suc-
cessful, then at the start of the next day, the condition of the machine will
be improved by one state if that is possible. That is, if an overhaul at the
start of today is successful for a machine in state i, where i = 1, 2, or 3, then
with probability 0.8, the machine will be in state i + 1 at the start of tomor-
row. If the overhaul is not successful, the machine will remain in state i
with probability 0.2. If a machine in state 4 is overhauled, the machine
will remain in state 4 with probability 1. If the engineer has the machine
repaired, the repair will also be completed in 1 day, either successfully or
unsuccessfully. A repair may be successful, with probability 0.7, or unsuc-
cessful, with probability 0.3. If a repair is successful, then at the start of
the next day, the condition of the machine will be improved by two states
if that is possible. That is, if a repair at the start of today is successful for a
machine in state i, where i = 1 or 2, then with probability 0.7, the machine
will be in state i + 2 at the start of tomorrow. If the repair is not successful,
the machine will remain in state i with probability 0.3. If a machine in
state 3 is repaired successfully, then with probability 0.7, the condition of
the machine will improve to state 4. If a machine in state 4 is repaired,
the machine will remain in state 4 with probability 1. If a machine in any
state i is replaced with a new machine at the start of today, then tomorrow,
with probability 1, the machine will be in state 4, working properly. The
replacement process takes 1 day to complete. The four possible mainte-
nance actions are summarized in Table 1.10.
The engineer who manages the production process has implemented the
following maintenance policy, called the original maintenance policy, for
the machine. In state 1, when the machine is not working, it is always over-
hauled. In state 2, when the machine is working, with a major defect, the
engineer always does nothing. In state 3, when the machine is working, with
a minor defect, it is always overhauled. Finally, in state 4, when the machine
is working properly, the engineer always does nothing. The original mainte-
nance policy is summarized in Table 1.11.
The transition probability matrix associated with this original mainte-
nance policy appears in Equation (1.55).

TABLE 1.10
Four Possible Maintenance Actions
Decision   Action            Outcome
1          Do nothing (DN)   The condition tomorrow will be worse than or equal to the condition today.
2          Overhaul (OV)     If, with probability 0.8, an overhaul in state 1, 2, or 3 is successful, the condition tomorrow will be superior by one state to the condition today. If unsuccessful, the condition tomorrow will be unchanged.
3          Repair (RP)       If, with probability 0.7, a repair in state 1 or 2 is successful, the condition tomorrow will be superior by two states to the condition today. If unsuccessful, the condition tomorrow will be unchanged.
4          Replace (RL)      The machine will work properly tomorrow.

TABLE 1.11
Original Maintenance Policy
State, i Description Decision, k Maintenance Action
1 Not Working, Inoperable 2 Overhaul
2 Working, with a major defect 1 Do Nothing
3 Working, with a minor defect 2 Overhaul
4 Working Properly 1 Do Nothing


TABLE 1.12
Modified Maintenance Policy
State, i Description Decision, k Maintenance Action
1 Not Working, Inoperable 2 Overhaul
2 Working, with a major defect 1 Do Nothing
3 Working, with a minor defect 1 Do Nothing
4 Working Properly 1 Do Nothing

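$$
P = \begin{array}{cl|c|cccc}
\text{State, } X_{n-1} = i & \text{Decision, } k & \text{State} & 1 & 2 & 3 & 4 \\ \hline
1 & 2,\ \text{Overhaul} & 1 & 0.2 & 0.8 & 0 & 0 \\
2 & 1,\ \text{Do Nothing} & 2 & 0.6 & 0.4 & 0 & 0 \\
3 & 2,\ \text{Overhaul} & 3 & 0 & 0 & 0.2 & 0.8 \\
4 & 1,\ \text{Do Nothing} & 4 & 0.3 & 0.2 & 0.1 & 0.4
\end{array}
= \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}. \tag{1.55}
$$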

Observe that this Markov chain is unichain as states 1 and 2 form a recurrent
closed class while states 3 and 4 are transient.
Now suppose that the engineer who manages the production process
modifies the original maintenance policy by always doing nothing when the
machine is in state 3 instead of overhauling it in state 3. The modified mainte-
nance policy, under which the engineer always overhauls the machine in state
1 and does nothing in the other three states, is summarized in Table 1.12.
The transition probability matrix associated with the modified mainte-
nance policy appears below in Equation (1.56).

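$$
P = \begin{array}{cl|c|cccc}
\text{State, } X_{n-1} = i & \text{Decision, } k & \text{State} & 1 & 2 & 3 & 4 \\ \hline
1 & 2,\ \text{Overhaul} & 1 & 0.2 & 0.8 & 0 & 0 \\
2 & 1,\ \text{Do Nothing} & 2 & 0.6 & 0.4 & 0 & 0 \\
3 & 1,\ \text{Do Nothing} & 3 & 0.2 & 0.3 & 0.5 & 0 \\
4 & 1,\ \text{Do Nothing} & 4 & 0.3 & 0.2 & 0.1 & 0.4
\end{array}
= \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}. \tag{1.56}
$$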

Observe that this Markov chain under the modified maintenance policy
is also unichain as states 1 and 2 form a recurrent closed class while states 3
and 4 are transient. The machine maintenance model will be revisited in
Sections 3.3.2.1, 3.3.3.1, 3.5.4.2, 4.2.4.1, 4.2.4.2, 4.2.4.3, and 5.1.4.4.
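
Both policy matrices can be assembled row by row from the action descriptions
in this section: each state contributes the row that corresponds to the action
chosen in that state. The code below is a sketch under that reading. The helper
names `action_row` and `policy_matrix` are assumptions, the do-nothing rows are
copied from Equation (1.54), and the success probabilities 0.8 (overhaul) and
0.7 (repair) are those stated in the text.

```python
import numpy as np

# Do-nothing transition rows, copied from Equation (1.54).
DO_NOTHING = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.6, 0.4, 0.0, 0.0],
    [0.2, 0.3, 0.5, 0.0],
    [0.3, 0.2, 0.1, 0.4],
])

def action_row(state, action):
    """Row of transition probabilities for a state (1-4) and decision k (1-4)."""
    if action == 1:                      # do nothing
        return DO_NOTHING[state - 1].copy()
    row = np.zeros(4)
    if action == 4:                      # replace: state 4 tomorrow for certain
        row[3] = 1.0
        return row
    # Overhaul improves by one state (prob. 0.8); repair by two (prob. 0.7).
    success, jump = (0.8, 1) if action == 2 else (0.7, 2)
    target = min(state + jump, 4)        # cannot improve beyond state 4
    row[target - 1] += success
    row[state - 1] += 1.0 - success      # unsuccessful: state unchanged
    return row

def policy_matrix(policy):
    """Stack one row per state for a policy given as {state: decision}."""
    return np.vstack([action_row(s, policy[s]) for s in (1, 2, 3, 4)])

original = policy_matrix({1: 2, 2: 1, 3: 2, 4: 1})   # Table 1.11
modified = policy_matrix({1: 2, 2: 1, 3: 1, 4: 1})   # Table 1.12
print(original)
print(modified)
```

Representing a policy as a mapping from state to decision makes it easy to
generate the matrix for any other combination of actions.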

1.10.1.2.2.2 Career Path with Lifetime Employment

Consider a company owned by its employees that offers both lifetime employment
and career flexibility. Employees are so well treated that no one ever leaves
the company or retires. New employees begin their careers in engineering.
Eventually, all engineers will be promoted to management positions, but
managers never return to engineering. Engineers may have their engineering job
assignments changed among the three areas of product design, systems
integration, and systems testing. Managers may have their supervisory
assignments changed among the three areas of hardware, software, and
marketing. All changes in job assignment and all promotions occur monthly.
By assigning a state to represent each job assignment, the career path of an
employee is modeled as a six-state Markov chain. The state Xn denotes an employ-
ee’s job assignment at the end of month n. The states are indexed as follows:

State, X n Description
1 Management of Hardware
2 Management of Software
3 Management of Marketing
4 Engineering Product Design
5 Engineering Systems Integration
6 Engineering Systems Testing

Since people in management never return to engineering, states 1, 2, and
3 form a closed communicating class of recurrent states, denoted by R =
{1, 2, 3}. States 4, 5, and 6 are transient because all engineers will
eventually be promoted to management. The set of transient states is denoted
by T = {4, 5, 6}. Thus, the model is a reducible unichain, which has one
closed class of three recurrent states, and three transient states. After
many years, the company has recorded sufficient data to construct the
following transition probability matrix, displayed in the canonical form
of Equation (1.43):

$$
P = \begin{array}{cl|cccccc}
1 & \text{Management of Hardware} & 0.30 & 0.20 & 0.50 & 0 & 0 & 0 \\
2 & \text{Management of Software} & 0.40 & 0.25 & 0.35 & 0 & 0 & 0 \\
3 & \text{Management of Marketing} & 0.50 & 0.10 & 0.40 & 0 & 0 & 0 \\
4 & \text{Engineering Product Design} & 0.05 & 0.15 & 0.10 & 0.30 & 0.16 & 0.24 \\
5 & \text{Engineering Systems Integration} & 0.04 & 0.07 & 0.05 & 0.26 & 0.40 & 0.18 \\
6 & \text{Engineering Systems Testing} & 0.08 & 0.06 & 0.12 & 0.14 & 0.32 & 0.28
\end{array}
= \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}. \tag{1.57}
$$

The model of a career path is revisited in Section 4.2.4.4.2.
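
The blocks S, D, and Q of Equation (1.57) can be extracted by slicing, which
also makes the unichain structure easy to check: S is itself a transition
matrix for the closed class R, the upper-right block is zero, and every row of
Q sums to less than 1 because each engineering state can lead to management.
The sketch assumes NumPy; the variable names are illustrative.

```python
import numpy as np

# Equation (1.57): states 1-3 are management, states 4-6 are engineering.
P = np.array([
    [0.30, 0.20, 0.50, 0.00, 0.00, 0.00],
    [0.40, 0.25, 0.35, 0.00, 0.00, 0.00],
    [0.50, 0.10, 0.40, 0.00, 0.00, 0.00],
    [0.05, 0.15, 0.10, 0.30, 0.16, 0.24],
    [0.04, 0.07, 0.05, 0.26, 0.40, 0.18],
    [0.08, 0.06, 0.12, 0.14, 0.32, 0.28],
])

n_recurrent = 3
S = P[:n_recurrent, :n_recurrent]       # recurrent -> recurrent
zero_block = P[:n_recurrent, n_recurrent:]
D = P[n_recurrent:, :n_recurrent]       # transient -> recurrent
Q = P[n_recurrent:, n_recurrent:]       # transient -> transient

print(S.sum(axis=1))   # each row sums to 1: the class R is closed
print(Q.sum(axis=1))   # each row sums to less than 1
```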

1.10.2 Reducible Multichain


A reducible multichain can be partitioned into two or more mutually exclu-
sive closed communicating classes of recurrent states plus one or more
transient states.


1.10.2.1 Absorbing Markov Chain


An absorbing multichain has two or more absorbing states plus transient
states. Generally, an absorbing multichain is called an absorbing Markov
chain.

1.10.2.1.1 Charge Account


Suppose that a store classifies a customer charge account into one of the fol-
lowing five states [1]:

State Description
0 0 months (1−30 days) old
1 1 month (31−60 days) old
2 2 months (61−90 days) old
P Paid in full
B Bad debt

A charge account is classified according to the oldest unpaid debt, starting from
the billing date. (Assume that all debt includes interest and finance charges.) For
example, if a customer has one unpaid bill, which is 1 month old, and a second
unpaid bill, which is 2 months old, then the account is classified as 2 months
old. If she makes a payment less than the 2-month-old bill, the account remains
classified as 2 months old. However, if she makes a payment greater than the
2-month-old bill but less than the sum of the 1-month-old and 2-month-old bills,
the account is reclassified as 1 month old. If at any time she pays the entire bal-
ance owed, the account is labeled as paid in full. When an account becomes
3 months old, it is labeled as a bad debt and sent to a collection agency.
Assume that the change of status of a charge account depends only on
its present classification. Then the process can be modeled as a five-state
Markov chain. Let Xn denote the state of an account at the nth month since
the account was opened. The state space is E = {0, 1, 2, P, B}. The states 0, 1,
and 2 indicate the age of an account, in months. An account that is 0 months
old is a new account with only current charges. State P indicates that an
account is paid in full, and state B indicates a bad debt. When the account is
in states 0, 1, or 2, it may stay in its present state. When the account is in state
0, it may move to state 1. When the account is in state 1, it may move to states
0 or 2. When the account is in state 2, it may move to states 0 or 1. Because an
account can be paid in full at any time, transitions are possible from states
0, 1, and 2 to state P. Since an account is reclassified as a bad debt only when
it becomes 3 months old, state B is reached only by a transition from state 2.
States 0, 1, and 2 are transient states because eventually the charge account
will either be paid in full or labeled as a bad debt. States P and B are absorb-
ing states because once one of these states is entered, the account is settled
and no further activity is possible. The process is an absorbing multichain.


After rearranging the five states so that the two absorbing states appear first,
the transition probability matrix is given below in the canonical form of
Equation (1.43):

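$$
P = \begin{array}{c|ccccc}
\text{State} & P & B & 0 & 1 & 2 \\ \hline
P & 1 & 0 & 0 & 0 & 0 \\
B & 0 & 1 & 0 & 0 & 0 \\
0 & p_{0P} & 0 & p_{00} & p_{01} & 0 \\
1 & p_{1P} & 0 & p_{10} & p_{11} & p_{12} \\
2 & p_{2P} & p_{2B} & p_{20} & p_{21} & p_{22}
\end{array}
= \begin{bmatrix} I_1 & 0 & 0 \\ 0 & I_2 & 0 \\ D_1 & D_2 & Q \end{bmatrix}
= \begin{bmatrix} I & 0 \\ D & Q \end{bmatrix}.
$$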

Suppose that observations over a period of time have produced the follow-
ing transition probability matrix, also displayed in the canonical form of
Equation (1.43):

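$$
P = \begin{array}{c|ccccc}
\text{State} & P & B & 0 & 1 & 2 \\ \hline
P & 1 & 0 & 0 & 0 & 0 \\
B & 0 & 1 & 0 & 0 & 0 \\
0 & 0.5 & 0 & 0.2 & 0.3 & 0 \\
1 & 0.2 & 0 & 0.1 & 0.4 & 0.3 \\
2 & 0.1 & 0.2 & 0.4 & 0.2 & 0.1
\end{array}
= \begin{bmatrix} I & 0 \\ D & Q \end{bmatrix}. \tag{1.58}
$$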
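
One way to get a feel for Equation (1.58) is to simulate the month-by-month
history of a single account until it is absorbed in state P or state B. The
sketch below uses only the Python standard library. The function name
`simulate_account`, the seed, and the cap on the number of months are
assumptions made for illustration.

```python
import random

# Row/column order of Equation (1.58).
STATES = ["P", "B", "0", "1", "2"]
TRANSITIONS = {
    "P": [1.0, 0.0, 0.0, 0.0, 0.0],
    "B": [0.0, 1.0, 0.0, 0.0, 0.0],
    "0": [0.5, 0.0, 0.2, 0.3, 0.0],
    "1": [0.2, 0.0, 0.1, 0.4, 0.3],
    "2": [0.1, 0.2, 0.4, 0.2, 0.1],
}

def simulate_account(start="0", seed=0, max_months=1000):
    """Simulate one account until it is paid in full (P) or a bad debt (B)."""
    rng = random.Random(seed)            # fixed seed for a reproducible path
    path = [start]
    while path[-1] not in ("P", "B") and len(path) <= max_months:
        weights = TRANSITIONS[path[-1]]  # row of P for the current state
        path.append(rng.choices(STATES, weights=weights)[0])
    return path

path = simulate_account()
print(path)
```

Repeating the simulation with different seeds gives a rough empirical picture
of how often accounts end as paid in full versus bad debt.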

1.10.2.1.2 Patient Flow in a Hospital


Suppose that the patients in a hospital may be treated in one of six depart-
ments, which are indexed by the following states:

State Department
0 Discharged
1 Diagnostic
2 Outpatient
3 Surgery
4 Physical Therapy
5 Morgue

During a given day, 40% of all diagnostic patients will not be moved, 10%
will become outpatients, 20% will enter surgery, and 30% will begin physical
therapy. Also, 15% of all outpatients will be discharged, 5% will die, 10%
will be moved to the diagnostic department, 20% will remain as outpatients,
30% will undergo surgery, and 20% will start physical therapy. In addition,
7% of all patients in surgery will be discharged, 3% will die, 20% will be
transferred to the diagnostic unit, 10% will become outpatients, 40% will
remain in surgery, and 20% will begin physical therapy. Furthermore, 30% of
all patients in physical therapy will be transferred to the diagnostic depart-
ment, 40% will become outpatients, 20% will enter surgery, and 10% will
remain in physical therapy. Assume that discharged patients will never
reenter the hospital.
Assume that the daily movement of a patient depends only on the
department in which the patient currently resides. Then the daily move-
ment of patients in the hospital can be modeled as an absorbing mul-
tichain in which states 0 and 5 are absorbing states, and the other four
states are transient. Let the state Xn denote the department in which a
patient resides on day n. The state space is E = {0, 1, 2, 3, 4, 5}. After
rearranging the six states so that the two absorbing states appear first, the
transition probability matrix is given below in the canonical form of
Equation (1.43):

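$$
P = \begin{array}{c|cccccc}
\text{State} & 0 & 5 & 1 & 2 & 3 & 4 \\ \hline
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
5 & 0 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0.4 & 0.1 & 0.2 & 0.3 \\
2 & 0.15 & 0.05 & 0.1 & 0.2 & 0.3 & 0.2 \\
3 & 0.07 & 0.03 & 0.2 & 0.1 & 0.4 & 0.2 \\
4 & 0 & 0 & 0.3 & 0.4 & 0.2 & 0.1
\end{array}
= \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}
= \begin{bmatrix} I & 0 \\ D & Q \end{bmatrix}. \tag{1.59}
$$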

The model of patient flow is revisited in Chapter 3.
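
The step from the verbal description to Equation (1.59) is mostly
bookkeeping: record each stated percentage under its departments, then place
the entries in the canonical state order 0, 5, 1, 2, 3, 4. The sketch below
does this with a dictionary; the variable names and the use of NumPy are
assumptions.

```python
import numpy as np

# moves[i][j] = probability of moving from department i to department j
# in one day, transcribed from the description in the text.
moves = {
    0: {0: 1.0},                                   # discharged: absorbing
    5: {5: 1.0},                                   # morgue: absorbing
    1: {1: 0.40, 2: 0.10, 3: 0.20, 4: 0.30},
    2: {0: 0.15, 5: 0.05, 1: 0.10, 2: 0.20, 3: 0.30, 4: 0.20},
    3: {0: 0.07, 5: 0.03, 1: 0.20, 2: 0.10, 3: 0.40, 4: 0.20},
    4: {1: 0.30, 2: 0.40, 3: 0.20, 4: 0.10},
}

# Canonical order: absorbing states first, then transient states.
order = [0, 5, 1, 2, 3, 4]
index = {state: k for k, state in enumerate(order)}

P = np.zeros((len(order), len(order)))
for i, row in moves.items():
    for j, prob in row.items():
        P[index[i], index[j]] = prob

print(P)
print(P.sum(axis=1))   # each row should sum to 1
```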

1.10.2.2 Eight-State Multichain Model of a Production Process


A multichain model may have recurrent states as well as absorbing states.
Consider, for example, a production process that consists of three manufactur-
ing stages in series. When the work at a stage is completed, the output of the
stage is inspected. Output from stage 1 or 2 that is not defective is passed on
to the next stage. Output with a minor defect is reworked at the current stage.
Output with a major defect is scrapped. Nondefective but blemished output
from stage 3 is sent to a training center where it is used to train engineers,
technicians, and technical writers. Output from stage 3 that is neither defec-
tive nor blemished is sold.


By assigning a state to represent each operation of the production process,
the following eight states are identified:

State Operation
1 Scrapped
2 Sold
3 Training Engineers
4 Training Technicians
5 Training Technical Writers
6 Stage 3
7 Stage 2
8 Stage 1

The next operation on an item depends only on the outcome of the current
operation. Therefore, the production process can be modeled as a Markov
chain. An epoch is the instant at which an item passes through a production
stage and is inspected, or is transferred to an employee in the training cen-
ter, or is transferred within the training center. Let the state Xn denote the
operation on an item at epoch n. Since output that is scrapped will not be
reused, and output that is sold will not be returned, states 1 and 2 are absorb-
ing states. Output sent to the training center will remain there permanently,
and will therefore not rejoin the production stages or be scrapped or sold.
Output received by the training center is dedicated exclusively to training
engineers, technicians, and technical writers, and will be shared by these
employees. Hence, states 3, 4, and 5 form a closed communicating class of
recurrent states. States 6, 7, and 8 are transient because all output must even-
tually leave the production stages to be scrapped, sold, or sent to the training
center. Thus, the model is a reducible multichain, which has two absorbing
states, one closed class of three recurrent states, and three transient states.
Observe that production stage i is represented by transient state (9 − i). An
item enters the production process at stage 1, which is transient state 8.
The following transition probabilities for the transient states are expressed
in terms of the probabilities of producing output, which has a major defect, a
minor defect, is blemished, or has no defect:

p88 = 0.75 = P(output from stage 1 has a minor defect and is reworked)

p77 = 0.65 = P(output from stage 2 has a minor defect and is reworked)

p66 = 0.55 = P(output from stage 3 has a minor defect and is reworked)

p87 = 0.15 = P(output from stage 1 has no defect and is passed to stage 2)

p76 = 0.20 = P(output from stage 2 has no defect and is passed to stage 3)

p62 = 0.16 = P(output from stage 3 has no defect or blemish and is sold)

p65 = 0.02 = P(output from stage 3 has no defect but is blemished, and
is sent to train technical writers)

p64 = 0.03 = P(output from stage 3 has no defect but is blemished, and
is sent to train technicians)

p63 = 0.04 = P(output from stage 3 has no defect but is blemished, and
is sent to train engineers)

p81 = 0.10 = P(output from stage 1 has a major defect and is scrapped)

p71 = 0.15 = P(output from stage 2 has a major defect and is scrapped)

p61 = 0.20 = P(output from stage 3 has a major defect and is scrapped).

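The transition probability matrix is shown below in the canonical form of
Equation (1.43):

\[
\begin{aligned}
P &=
\begin{array}{lc|cccccccc}
\text{Scrapped} & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\text{Sold} & 2 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
\text{Training Engineers} & 3 & 0 & 0 & 0.50 & 0.30 & 0.20 & 0 & 0 & 0 \\
\text{Training Technicians} & 4 & 0 & 0 & 0.30 & 0.45 & 0.25 & 0 & 0 & 0 \\
\text{Training Tech. Writers} & 5 & 0 & 0 & 0.10 & 0.35 & 0.55 & 0 & 0 & 0 \\
\text{Stage 3} & 6 & 0.20 & 0.16 & 0.04 & 0.03 & 0.02 & 0.55 & 0 & 0 \\
\text{Stage 2} & 7 & 0.15 & 0 & 0 & 0 & 0 & 0.20 & 0.65 & 0 \\
\text{Stage 1} & 8 & 0.10 & 0 & 0 & 0 & 0 & 0 & 0.15 & 0.75
\end{array} \\
&= \begin{bmatrix} P_1 & 0 & 0 & 0 \\ 0 & P_2 & 0 & 0 \\ 0 & 0 & P_3 & 0 \\ D_1 & D_2 & D_3 & Q \end{bmatrix}
= \begin{bmatrix} I & 0 & 0 & 0 \\ 0 & I & 0 & 0 \\ 0 & 0 & P_3 & 0 \\ D_1 & D_2 & D_3 & Q \end{bmatrix}
= \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}.
\end{aligned}
\tag{1.60}
\]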

The passage of an item through the production process is shown in
Figure 1.6. The multichain model of a production process is revisited in
Chapters 3 and 4.
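
The classification of the eight states can be checked mechanically. The
sketch below stores the nonzero entries of Equation (1.60) and uses a simple
reachability test: in a finite chain, a state is recurrent exactly when every
state reachable from it can reach it back. The function and variable names
are illustrative assumptions, and the reachability search is only one of
several ways to classify states.

```python
# Nonzero one-step transition probabilities of Equation (1.60), keyed by state.
P = {
    1: {1: 1.0},                                                    # Scrapped
    2: {2: 1.0},                                                    # Sold
    3: {3: 0.50, 4: 0.30, 5: 0.20},                                 # Training Engineers
    4: {3: 0.30, 4: 0.45, 5: 0.25},                                 # Training Technicians
    5: {3: 0.10, 4: 0.35, 5: 0.55},                                 # Training Tech. Writers
    6: {1: 0.20, 2: 0.16, 3: 0.04, 4: 0.03, 5: 0.02, 6: 0.55},      # Stage 3
    7: {1: 0.15, 6: 0.20, 7: 0.65},                                 # Stage 2
    8: {1: 0.10, 7: 0.15, 8: 0.75},                                 # Stage 1
}


def reachable(chain, start):
    """Return the set of states reachable from start in zero or more steps."""
    seen = {start}
    stack = [start]
    while stack:
        i = stack.pop()
        for j, prob in chain[i].items():
            if prob > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen


reach = {i: reachable(P, i) for i in P}

# Recurrent: every state reachable from i leads back to i.
recurrent = sorted(i for i in P if all(i in reach[j] for j in reach[i]))
transient = sorted(i for i in P if i not in recurrent)
absorbing = sorted(i for i in P if P[i].get(i, 0.0) == 1.0)

# For a recurrent state, the reachable set is its closed communicating class.
classes = {frozenset(reach[i]) for i in recurrent}
closed_classes = sorted(sorted(c) for c in classes)

print("absorbing:", absorbing)
print("closed classes:", closed_classes)
print("transient:", transient)
```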

[Flow diagram. Stage 1: reworked 0.75, acceptable 0.15 to Stage 2, defective
0.10 to Scrapped. Stage 2: reworked 0.65, acceptable 0.20 to Stage 3,
defective 0.15 to Scrapped. Stage 3: reworked 0.55, acceptable 0.16 to Sold,
blemished 0.09 to Training Center, defective 0.20 to Scrapped.]

FIGURE 1.6
Passage of an item through a three-stage production process.

PROBLEMS
1.1 The condition of a machine, which is observed at the beginning of
each day, can be represented by one of the following four states:

State     Description
1 (NW)    Not Working, Under Repair
2 (MD)    Working, with a major defect
3 (mD)    Working, with a minor defect
4 (WP)    Working Properly

The next state of the machine depends only on the present
state and is independent of its past history. Hence, the condition
of the machine can be modeled as a four-state Markov chain.
Let Xn−1 denote the state of the machine when it is observed at
the start of day n. The state space is E = {1, 2, 3, 4}. Assume that at
the start of each day, the engineer in charge of the machine does
nothing to respond to the condition of the machine when it is
in states 2, 3, or 4. A machine in states 2, 3, or 4 is allowed to fail
and enter state 1, which represents a repair process. Therefore, a
machine in state 1, not working, is assumed to be under repair.

The repair process carried out when the machine is in state 1
(NW) takes 1 day to complete. When the machine is in state 1
(NW), the repair process will be completely successful with prob-
ability 0.5 and transfer the machine to state 4 (WP), or largely suc-
cessful with probability 0.2 and transfer the machine to state 3
(mD), or marginally successful with probability 0.2 and transfer
the machine to state 2 (MD), or unsuccessful with probability 0.1
and leave the machine in state 1 (NW). A machine is not repaired
when it is in states 2, 3, or 4. A machine in state 2 (MD) will remain
in state 2 with probability 0.7, or fail with probability 0.3 and enter
state 1 (NW). A machine in state 3 (mD) will remain in state 3 with
probability 0.3, or acquire a major defect with probability 0.5 and
enter state 2, or fail with probability 0.2 and enter state 1 (NW).
Finally, a machine in state 4 (WP) will remain in state 4 with prob-
ability 0.4, or acquire a minor defect with probability 0.3 and enter
state 3, or acquire a major defect with probability 0.2 and enter
state 2, or fail with probability 0.1 and enter state 1 (NW).
Construct the transition probability matrix for this four-state
Markov chain model under which a machine is left alone in
states 2, 3, and 4, and repaired in state 1.
1.2 Suppose that the machine modeled in Problem 1.1 is modified by
making the repair time last 2 days. Two states are needed to repre-
sent both days of a 2-day repair process. When the machine fails,
it goes to state 1 (NW1), which denotes the first day of the repair
process. When the first day of repair ends, the machine must enter
state 2 (NW2), which denotes the second day of repair. The condi-
tion of the machine, which is observed at the beginning of each
day, can be represented by one of the following five states:

State      Description
1 (NW1)    Not Working, in first day of repair
2 (NW2)    Not Working, in second day of repair
3 (MD)     Working, with a major defect
4 (mD)     Working, with a minor defect
5 (WP)     Working Properly

The use of two states to distinguish the 2 days of the repair pro-
cess allows the next state of the machine to depend only on the
present state and to be independent of its past history. Hence,
the condition of the machine can be modeled as a five-state
Markov chain. Let Xn−1 denote the state of the machine when it is
observed at the start of day n. The state space is E = {1, 2, 3, 4, 5}.
Assume that at the start of each day, the engineer in charge of
the machine does nothing to respond to the condition of the
machine when it is in states 3, 4, or 5.
A machine in states 3, 4, or 5 that fails will enter state 1
(NW1). One day later the machine will move from state 1 to
state 2 (NW2). When the machine is in state 2 (NW2), the repair
process will be completely successful with probability 0.5 and
transfer the machine to state 5 (WP), or largely successful with
probability 0.2 and transfer the machine to state 4 (mD), or mar-
ginally successful with probability 0.2 and transfer the machine
to state 3 (MD), or unsuccessful with probability 0.1 and trans-
fer the machine to state 1 (NW1). A machine in state 3 (MD)
will remain in state 3 with probability 0.7, or fail with proba-
bility 0.3 and enter state 1 (NW1). A machine in state 4 (mD) will
remain in state 4 with probability 0.3, or acquire a major defect
with probability 0.5 and enter state 3, or fail with probability 0.2
and enter state 1 (NW1). Finally, a machine in state 5 (WP) will
remain in state 5 with probability 0.4, or acquire a minor defect
with probability 0.3 and enter state 4, or acquire a major defect
with probability 0.2 and enter state 3, or fail with probability 0.1
and enter state 1 (NW1).
Construct the transition probability matrix for this five-state
Markov chain model under which a machine is left alone in
states 3, 4, and 5.
1.3 Consider a machine that breaks down on a given day with prob-
ability p. A repair can be completed in 1, 2, 3, or 4 days with
respective probabilities q1, q2, q3, and q4. The condition of the
machine can be represented by one of the following five states:

State Description

0 (W) Working
1 (D1) Not Working, in first day of repair
2 (D2) Not Working, in second day of repair
3 (D3) Not Working, in third day of repair
4 (D4) Not Working, in fourth day of repair

Since the next state of the machine depends only on its present
state, the condition of the machine can be modeled as a five-
state Markov chain.
Construct the transition probability matrix.
1.4 Many products are classified as either acceptable or defective.
Such products are often shipped in lots, which may contain a
large number of individual items. A purchaser wants assurance
that the proportion of defective items in a lot is not excessive.
Instead of inspecting each item in a lot, a purchaser may follow
an acceptance sampling plan under which a random sample
selected from the lot is inspected. An acceptance sampling plan
is said to be sequential if, after each item is inspected, one of
the following decisions is made: accept a lot, reject it, or inspect
another item. Suppose that the proportion of defective items in
a lot is denoted by p. Consider the following sequential inspection
plan: accept the lot if four acceptable items are found, reject
the lot if two defective items are found, or inspect another item
if neither four acceptable items nor two defective items have
been found.
The sequential inspection plan represents an independent
trials process in which the condition of the nth item to be
inspected is independent of the condition of its predecessors.
Hence, the sequential inspection plan can be modeled as a
Markov chain. A state is represented by a pair of numbers.
The first number in the pair is the number of acceptable items
inspected, and the second is the number of defective items
inspected. The model is an absorbing Markov chain because
when the lot is accepted or rejected, the inspection process
stops in an absorbing state. The transient states indicate that
the inspection process will continue. The states are indexed
and identified in the table below:

State   Number Pair   Number of Acceptable Items   Number of Defective Items   State Classification
1       (0, 0)        0                            0                           Transient
2       (1, 0)        1                            0                           Transient
3       (2, 0)        2                            0                           Transient
4       (3, 0)        3                            0                           Transient
5       (4, 0)        4, stop, accept lot          0                           Absorbing
6       (0, 1)        0                            1                           Transient
7       (1, 1)        1                            1                           Transient
8       (2, 1)        2                            1                           Transient
9       (3, 1)        3                            1                           Transient
10      (4, 1)        4, stop, accept lot          1                           Absorbing
11      (0, 2)        0                            2, stop, reject lot         Absorbing
12      (1, 2)        1                            2, stop, reject lot         Absorbing
13      (2, 2)        2                            2, stop, reject lot         Absorbing
14      (3, 2)        3                            2, stop, reject lot         Absorbing

Construct the transition probability matrix for this 14-state
absorbing Markov chain.
1.5 A credit counselor accepts customers who arrive at her office
without appointments. Her office consists of a private room
in which she counsels one customer, plus a small waiting room,
which can hold up to two additional customers. Any customer
who arrives when the waiting room is full does not enter. Each
credit counseling session for one customer lasts 45 min. If no
customers are present when she has finished counseling a cus-
tomer, she takes a 45 min break until the beginning of the next
45 min period. The number of customers who arrive during a
45 min counseling period is an independent, identically distributed
random variable. If An is the number of arrivals during
the nth 45 min counseling period, then An has the following
probability distribution, which is stationary in time:

Number An of arrivals in period n, An = k    0    1    2    3    4 or more
Probability, pk = P(An = k)                  p0   p1   p2   p3   p4

To model this problem as a regular Markov chain, let the state
Xn represent the number of customers in the counselor’s office
at the end of the nth 45 min counseling period, immediately
after the departure of the nth customer.
Construct the transition probability matrix for the Markov chain.
1.6 Consider a small clinic staffed by one nurse who injects patients
with an antiflu vaccine. The clinic has space for only four
chairs. One chair is reserved for the patient being vaccinated.
The other three chairs are reserved for patients who wait. The
arrival process is deterministic. One new patient arrives every
20 min. A single patient arrives at the clinic precisely on the
hour, at precisely 20 min past the hour, and at precisely 40 min
past the hour. Any patient who arrives to find all four chairs
occupied does not enter the clinic.
If the clinic is empty, the nurse is idle. If the clinic is not empty,
the number of patients that she vaccinates during a 20 min inter-
val is an independent, identically distributed random variable. If
Vn is the number of vaccinations that the nurse provides during
the nth 20 min interval, then Vn has the following conditional
probability distribution, which is stationary in time.

Number Vn of vaccinations in period n, Vn = k    0    1    2    3    4    5 or more
Probability, pk = P(Vn = k)                      p0   p1   p2   p3   p4   p5

The quantity pk = P(Vn = k) is a conditional probability,
conditioned on the presence of at least k patients inside the clinic
available to be vaccinated.
To model this problem as a regular Markov chain, let the state Xn
represent the number of patients in the clinic at the end of the nth
20 min interval, immediately before the arrival of the nth patient.
Construct the transition probability matrix for the Markov chain.
1.7 Consider a small convenience store that is open 24 h a day,
7 days a week. The store is staffed by one checkout clerk. The
convenience store has space for only three customers including
the one who is being served. Customers may arrive at the store
every 30 min, immediately before the hour and immediately
before the half-hour. Any customer who arrives to find three
customers already inside the store does not enter the store.
The probability of an arrival at the end of a consecutive 30 min
interval is p. If the store is empty, the clerk is idle. If the store is
not empty, the probability of a service completion at the end of
a consecutive 30 min interval is r.

To model this problem as a regular Markov chain, let the state
Xn represent the number of customers in the store at the end of
the nth consecutive 30 min interval.
Construct the transition probability matrix for the Markov
chain.
1.8 Consider the following inventory system controlled by a (1, 3)
ordering policy. An office supply store sells laptops. The state
of the system is the inventory on hand at the beginning of the
day. If the number of laptops on hand at the start of the day is
less than 1 (in other words, equal to 0), then the store places an
order, which is delivered immediately to raise the beginning
inventory level to three laptops. If the store starts the day with
1 or more laptops in stock, no laptops are ordered. The number
dn of laptops demanded by customers during day n is an inde-
pendent, identically distributed random variable which has the
following stationary probability distribution:

Demand dn on day n, dn = k 0 1 2 3
P(dn = k ) P(dn = 0) P(dn = 1) P(dn = 2) P(dn = 3)

Construct the transition probability matrix for a four-state
recurrent Markov chain model of this inventory system under
a (1, 3) policy.
1.9 The following mortality data has been collected for a group of
S0 integrated circuit chips (ICs) over a 4-year period:

Age i of an IC in years, i 0 1 2 3 4
Number surviving to age i S0 S1 S2 S3 S4 = 0

Observe that at the end of the 4-year period, none of the S0 ICs
has survived, that is, all have failed. Every IC that fails is replaced
with a new one at the end of the year in which it has failed. The life
of an IC can be modeled as a four-state recurrent Markov chain.
Let the state Xn denote the age of an IC at the end of year n.
Construct the transition probability matrix for the Markov
chain.
1.10 A small refinery produces one barrel of gasoline per hour. Each
barrel of gasoline has an octane rating of either 120, 110, 100, or
90. The refinery engineer models the octane rating of the gas-
oline as a Markov chain in which the state Xn represents the
octane rating of the nth barrel of gasoline. The states of the
Markov chain, {X0, X1, X2, …}, are shown below:

State   Octane Rating
1       120
2       110
3       100
4       90

Suppose that the Markov chain, {X0, X1, X2, …}, has the follow-
ing transition probability matrix:

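\[
P =
\begin{array}{c|cccc}
\text{State} & 1 & 2 & 3 & 4 \\ \hline
1 & p_{11} & p_{12} & b & k - b \\
2 & p_{21} & p_{22} & k - d & d \\
3 & p_{31} & p_{32} & e & h - e \\
4 & p_{41} & p_{42} & h - u & u
\end{array}
\]

Observe that in columns 3 and 4 of row 1 of the transition
probability matrix,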

p13 + p14 = b + (k − b) = k.

In columns 3 and 4 of row 2,

p23 + p24 = (k − d) + d = k.

Hence,

p11 + p12 = p21 + p22 = 1 − k.

Note that in columns 3 and 4 of row 3 of the transition
probability matrix,

p33 + p34 = e + ( h − e ) = h.

Similarly, in columns 3 and 4 of row 4,

p43 + p44 = ( h − u) + u = h.

Hence,

p31 + p32 = p41 + p42 = 1 − h.

The initial state probability vector for the four-state Markov
chain is

\[
p^{(0)} = \begin{bmatrix} p_1^{(0)} & p_2^{(0)} & p_3^{(0)} & p_4^{(0)} \end{bmatrix}
= \begin{bmatrix} P(X_0 = 1) & P(X_0 = 2) & P(X_0 = 3) & P(X_0 = 4) \end{bmatrix}.
\]

The refinery sells only 2 grades of gasoline: premium and
economy. Premium has an octane rating of 120 or 110, and
economy has an octane rating of 100 or 90. The refinery engineer
measures the octane rating of each barrel of gasoline produced,
and classifies it as either premium (P) or economy (E).
Suppose that Yn denotes the classification given to the nth barrel
of gasoline, where

\[
Y_n =
\begin{cases}
P, & \text{if } X_n = 1 \text{ or } 2 \\
E, & \text{if } X_n = 3 \text{ or } 4.
\end{cases}
\]

Let G be the transition matrix for the two-state process
{Y0, Y1, Y2, …}, where

\[
G =
\begin{array}{c|cc}
\text{State} & P & E \\ \hline
P & g_{PP} & g_{PE} \\
E & g_{EP} & g_{EE}
\end{array}
\]

The two-state process {Y0, Y1, Y2, …} is termed partially observable
because the premium grade (state P) does not distinguish
between the octane ratings of 120 (state 1) and 110 (state 2).
Similarly, the economy grade (state E) does not distinguish
between the octane ratings of 100 (state 3) and 90 (state 4).
(a) Construct the transition probabilities, gij, for the process {Y0,
Y1, Y2,…}.
(b) A four-state Markov chain, with the state space, S = {1, 2, 3, 4},
which has the relationships, p11 + p12 = p21 + p22 and p31 + p32 =
p41 + p42, is said to be lumpable with respect to the partition
P = {1, 2} and E = {3, 4}. The lumped process {Y0, Y1, Y2, …}
is also a Markov chain. Show that, as a consequence of the
lumpability of the Markov chain {X0, X1, X2, …}, the transi-
tion probabilities, gij, for the lumped process {Y0, Y1, Y2, …}
are independent of the choice of the initial state probability
vector, $p^{(0)} = [p_1^{(0)} \; p_2^{(0)} \; p_3^{(0)} \; p_4^{(0)}]$.

Now suppose that the Markov chain {X0, X1, X2, …} has the
following transition probability matrix, such that p11 + p12 ≠ p21 +
p22 and p31 + p32 ≠ p41 + p42.

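\[
P =
\begin{array}{c|cccc}
\text{State} & 1 & 2 & 3 & 4 \\ \hline
1 & p_{11} & p_{12} & p_{13} & p_{14} \\
2 & p_{21} & p_{22} & p_{23} & p_{24} \\
3 & p_{31} & p_{32} & p_{33} & p_{34} \\
4 & p_{41} & p_{42} & p_{43} & p_{44}
\end{array}
\]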

Since p11 + p12 ≠ p21 + p22 and p31 + p32 ≠ p41 + p42, the Markov chain is
not lumpable with respect to the partition P = {1, 2} and E = {3, 4}.
The process {Y0, Y1, Y2, …} is not a Markov chain.

(c) With respect to this partition, calculate the transition
probabilities, gij, for the process {Y0, Y1, Y2, …}.
(d) Show that the transition probabilities, gij, depend
on the choice of the initial state probability vector,
$p^{(0)} = [p_1^{(0)} \; p_2^{(0)} \; p_3^{(0)} \; p_4^{(0)}]$.
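
The lumpability condition stated in part (b) can be tested numerically. The
sketch below checks whether, within each block of a partition, every state
has the same total probability of moving into each block, and if so builds
the lumped matrix. The numerical entries (chosen with k = 0.3 and h = 0.6)
are hypothetical values for illustration only, states 1 to 4 are stored as
indices 0 to 3, and the function names are assumptions rather than notation
from the text.

```python
def block_sums(P, partition):
    """For each state i, total probability of moving into each block."""
    return [[sum(P[i][j] for j in block) for block in partition]
            for i in range(len(P))]


def is_lumpable(P, partition, tol=1e-12):
    """True if all states in a block share the same block-to-block sums."""
    sums = block_sums(P, partition)
    for block in partition:
        first = sums[block[0]]
        for i in block[1:]:
            if any(abs(a - b) > tol for a, b in zip(sums[i], first)):
                return False
    return True


def lumped_matrix(P, partition):
    """Transition matrix of the lumped chain (valid only if lumpable)."""
    sums = block_sums(P, partition)
    return [sums[block[0]] for block in partition]


partition = [[0, 1], [2, 3]]  # block P = states {1, 2}, block E = states {3, 4}

# Hypothetical lumpable chain: rows 1, 2 put 0.3 on block E; rows 3, 4 put 0.6.
P_lumpable = [
    [0.50, 0.20, 0.10, 0.20],
    [0.30, 0.40, 0.25, 0.05],
    [0.10, 0.30, 0.40, 0.20],
    [0.25, 0.15, 0.50, 0.10],
]

# Hypothetical non-lumpable chain: row 2 now puts 0.4 on block E instead of 0.3.
P_not_lumpable = [row[:] for row in P_lumpable]
P_not_lumpable[1] = [0.30, 0.30, 0.30, 0.10]

G = lumped_matrix(P_lumpable, partition)
print("lumpable:", is_lumpable(P_lumpable, partition))
print("G =", [[round(x, 4) for x in row] for row in G])
print("modified chain lumpable:", is_lumpable(P_not_lumpable, partition))
```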
1.11 A dam is used for generating electricity, controlling floods,
and irrigating land. The dam has a capacity of 4 units of water.
Assume that the volume of water stored in the dam is always
an integer. Let Xn be the volume of water in the dam at the end
of week n. During week n, a volume of water denoted by Wn
flows into the dam. The probability distribution of Wn, which
is an independent, identically distributed, and integer random
variable, is given below:

Volume Wn of water flowing into dam in week n, Wn = k    0    1    2    3    4    5 or more
Probability, wk = P(Wn = k)                              w0   w1   w2   w3   w4   w5

At the end of each week, if the dam contains one or more
units of water, exactly one unit of water is released. Whenever
the inflow of water to the dam exceeds its capacity of 4 units, the
surplus water is released over the spillway and lost.
Construct the transition probability matrix.
1.12 A certain engineering college has three academic ranks for its
faculty: assistant professor, associate professor, and professor.
The first two ranks are tenure-track. Only professors have ten-
ure. A tenure-track faculty member may be discharged, remain
at her present rank, or be promoted to the next higher rank. A
tenure-track faculty member who is discharged is never rehired.
Only tenure-track faculty members can be promoted or dis-
charged. A professor may remain at her present rank or retire.
All changes in rank occur at the end of an academic year.
By assigning a state to represent each academic rank, the
career path of a faculty member of this college is modeled as a
five-state absorbing multichain. The state Xn denotes a faculty
member’s rank at the end of academic year n. The states are
indexed as follows:

State, Xn   Academic Rank
1           Assistant professor
2           Associate professor
3           Professor
4           Discharged
5           Retired

A tenure-track faculty member of rank i will be discharged
with probability di, remain at her present rank with probability
ri, or be promoted to the next higher rank with probability pi.

A professor may remain at her present rank with probability
r3 or retire with probability 1 − r3.
(a) Construct the transition probability matrix.
(b) Classify the states.
(c) Represent the transition probability matrix in the canonical
form of Equation (1.43).
1.13 A certain college of management has five academic ranks for
its faculty: instructor, assistant professor, associate professor,
research professor, and applications professor. The first three
ranks are tenure-track. Only research professors and applica-
tions professors have tenure. A tenure-track faculty member
may be discharged, remain at her present rank, or be promoted
to the next higher rank. Only tenure-track faculty members can
be promoted or discharged. A tenure-track faculty member
who is discharged is never rehired. An associate professor may
apply for promotion with tenure to the rank of research profes-
sor or applications professor. A research professor may remain
a research professor or switch to the position of applications
professor. Similarly, an applications professor may remain an
applications professor or switch to the position of research
professor.
By assigning a state to represent each academic rank, the
career path of a faculty member of this college is modeled as
a six-state multichain. The state Xn denotes a faculty member’s
rank at the end of academic year n. The states are indexed as
follows:

State, Xn   Academic Rank
1           Instructor
2           Assistant professor
3           Associate professor
4           Research professor
5           Applications professor
6           Discharged

An instructor or assistant professor of rank i will be discharged
with probability di, remain at her present rank with probability
ri, or be promoted to the next higher rank with probability pi.
An associate professor may be discharged with probability d3,
remain an associate professor with probability r3, be promoted
to the rank of research professor with probability pR, or be pro-
moted to the rank of applications professor with probability
pA. A research professor may remain a research professor with
probability r4 or switch to the position of applications professor
with probability 1 − r4. Similarly, an applications professor may
remain an applications professor with probability r5 or switch to
the position of research professor with probability 1 − r5.

(a) Construct the transition probability matrix.
(b) Classify the states.
(c) Represent the transition probability matrix in canonical form.
1.14 A small optical shop is staffed by an optometrist and an opti-
cian. Customers may arrive at the shop every 30 min, immedi-
ately after the hour and immediately after the half-hour. Two
stages of service are provided in the following sequence. A cus-
tomer is given an eye examination by the optometrist at stage 1
followed by the fitting of glasses by the optician at stage 2. Only
one chair is provided by the optometrist for a customer at stage
1, and only one chair is provided by the optician for a customer
at stage 2. Both the optometrist and the optician may be idle if
they have no customer, or busy if they are working with a cus-
tomer. In addition, the optometrist may be blocked if she has
completed work on a customer at stage 1 before the optician has
completed work on a customer at stage 2.
A person who arrives at the beginning of a 30 min interval
during which the optometrist is busy or blocked does not enter
the shop. A customer enters the shop with probability p at the
beginning of a 30 min interval during which the optometrist is
idle. The optometrist begins her eye examination promptly but
may not complete it during the current 30 min interval. If the
optometrist is giving an eye examination at the beginning of
a 30 min interval, she has a probability r of completing it and
passing the customer to the optician at the beginning of the
next 30 min interval if the optician is idle. Otherwise, the cus-
tomer remains with the optometrist who is blocked until the
optician completes work with a customer at stage 2. If the opti-
cian is working with a customer at the beginning of a 30 min
interval, she has a probability q of completing her work during
the current 30 min interval so that the customer departs.
To model the operation of the optical shop as a Markov chain,
let the state Xn be the pair (u, v). The quantity u is I if the optom-
etrist is idle at the beginning of the nth 30 min interval, W if the
optometrist is working at the beginning of the nth 30 min inter-
val, and B if the optometrist is blocked at the beginning of the
nth 30 min interval. The quantity v is I if the optician is idle at
the beginning of the nth 30 min interval, and W if the optician
is working at the beginning of the nth 30 min interval. The five
states are indexed below:

State, Xn     Optometrist   Optician
1 = (I, I)    Idle          Idle
2 = (I, W)    Idle          Working
3 = (W, I)    Working       Idle
4 = (W, W)    Working       Working
5 = (B, W)    Blocked       Working

Construct the transition probability matrix.

References
1. Bhat, N., Elements of Applied Stochastic Processes, 2nd ed., Wiley, New York,
1985.
2. Cinlar, E., Introduction to Stochastic Processes, Prentice-Hall, Englewood Cliffs,
NJ, 1975.
3. Clarke, A. B. and Disney, R. L., Probability and Random Processes: A First Course
with Applications, 2nd ed., Wiley, New York, 1985.
4. Kemeny, J. G., Mirkil, H., Snell, J. L., and Thompson, G. L., Finite Mathematical
Structures, Prentice-Hall, Englewood Cliffs, NJ, 1959.
5. Kemeny, J. G. and Snell, J. L., Finite Markov Chains, Van Nostrand, Princeton, NJ,
1960. Reprinted by Springer-Verlag, New York, 1976.
6. Heyman, D. P. and Sobel, M. J., Stochastic Models in Operations Research, vol. 1,
McGraw Hill, New York, 1982.

2
Regular Markov Chains

Recall from Section 1.9.1.1 that an irreducible or recurrent Markov chain is
one in which all the states communicate. Therefore, all states in an
irreducible chain are recurrent states. An irreducible chain has no transient states.
A Markov chain is termed a regular chain if some power of the transition
matrix has only positive elements. A regular Markov chain is irreducible
and aperiodic. Regular chains with a finite number of states are the subject
of this chapter.

2.1 Steady-State Probabilities


When a Markov chain has made a small number of transitions, its behavior
is called transient, or short term, or time-dependent. The set of consecutive
time periods over which a Markov chain is analyzed is called the planning
horizon. Under transient conditions, the planning horizon is finite, of length
T periods. As Section 1.7 demonstrates, when the planning horizon is finite,
the state probability vector after n steps is completely determined by the
initial state probability vector and the one-step transition probability matrix.
As the number of steps or transitions increases, the effect of the initial state
probability vector decreases. After a large number of transitions, the behavior
of the chain changes from transient to what is called steady state, or
long term, or time-independent. In the steady state, the planning horizon
approaches infinity, and the state probability vector becomes independent
of the initial state probability vector. In the steady state, the state probability
vector becomes a fixed or stationary probability vector, which does
not change over time. The fixed or stationary probability vector is called a
steady-state probability vector. The entries of a steady-state probability vec-
tor are called steady-state probabilities. In the steady state, a Markov chain
continues to make transitions indefinitely among its various states. Hence, a
steady-state probability can be interpreted as the long run proportion of time
that a particular state is occupied [1–5].
To see how to compute the steady-state probability vector for a regular
Markov chain, it is instructive to begin by examining the rows of the n-step
transition matrix, P(n), as n grows larger. It will be seen that all of the rows
approach the same stationary probability vector, namely, the steady-state
probability vector. Consider the four-state regular Markov chain model of
the weather introduced in Section 1.3. The one-step transition probability
matrix is given in Equation (1.9). The two-step transition probability matrix
is calculated in Equation (1.19). The four-step transition probability matrix is
calculated as

\[
\begin{aligned}
P^{(4)} = P^4 = P^{(2)} P^{(2)}
&= \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \end{array}
\begin{bmatrix}
0.23 & 0.28 & 0.24 & 0.25 \\
0.22 & 0.37 & 0.23 & 0.18 \\
0.16 & 0.39 & 0.29 & 0.16 \\
0.21 & 0.42 & 0.18 & 0.19
\end{bmatrix}
\begin{bmatrix}
0.23 & 0.28 & 0.24 & 0.25 \\
0.22 & 0.37 & 0.23 & 0.18 \\
0.16 & 0.39 & 0.29 & 0.16 \\
0.21 & 0.42 & 0.18 & 0.19
\end{bmatrix} \\
&= \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \end{array}
\begin{bmatrix}
0.2054 & 0.3666 & 0.2342 & 0.1938 \\
0.2066 & 0.3638 & 0.2370 & 0.1926 \\
0.2026 & 0.3694 & 0.2410 & 0.1870 \\
0.2094 & 0.3642 & 0.2334 & 0.1930
\end{bmatrix}.
\end{aligned}
\tag{2.1}
\]

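The eight-step transition probability matrix is computed as follows:

\[
\begin{aligned}
P^{(8)} = P^8 = P^{(4)} P^{(4)}
&= \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \end{array}
\begin{bmatrix}
0.2054 & 0.3666 & 0.2342 & 0.1938 \\
0.2066 & 0.3638 & 0.2370 & 0.1926 \\
0.2026 & 0.3694 & 0.2410 & 0.1870 \\
0.2094 & 0.3642 & 0.2334 & 0.1930
\end{bmatrix}
\begin{bmatrix}
0.2054 & 0.3666 & 0.2342 & 0.1938 \\
0.2066 & 0.3638 & 0.2370 & 0.1926 \\
0.2026 & 0.3694 & 0.2410 & 0.1870 \\
0.2094 & 0.3642 & 0.2334 & 0.1930
\end{bmatrix} \\
&= \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \end{array}
\begin{bmatrix}
0.2060 & 0.3658 & 0.2367 & 0.1916 \\
0.2059 & 0.3658 & 0.2367 & 0.1916 \\
0.2059 & 0.3658 & 0.2367 & 0.1916 \\
0.2060 & 0.3658 & 0.2367 & 0.1916
\end{bmatrix}.
\end{aligned}
\tag{2.2}
\]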

Observe that as the exponent n increases from 1 to 2, from 2 to 4, and from
4 to 8, the entries of P(n) approach limiting values. When n = 8, all the rows of
P(8) are almost identical. One may infer that as n becomes very large, all the
rows of P(n) approach the same stationary probability vector, namely,

$$
p^{(n)} = [\, p_1^{(n)} \;\; p_2^{(n)} \;\; p_3^{(n)} \;\; p_4^{(n)} \,] = [\, 0.2059 \;\; 0.3658 \;\; 0.2367 \;\; 0.1916 \,]. \tag{2.3}
$$

That is, after n transitions, as n becomes very large, the n-step transition
probability $p_{ij}^{(n)}$ approaches a limiting probability, $p_j^{(n)}$, irrespective of the
starting state i. If πj denotes the limiting probability for state j in an N-state
Markov chain, then the limiting probability is defined by the formula

$$
\pi_j = \lim_{n \to \infty} p_{ij}^{(n)}, \quad \text{for } j = 1, \ldots, N. \tag{2.4}
$$

The limiting probability πj is called a steady-state probability. The vector of
steady-state probabilities for an N-state Markov chain is a 1 × N row vector
denoted by

$$
\pi = [\, \pi_1 \;\; \pi_2 \;\; \cdots \;\; \pi_N \,]. \tag{2.5}
$$

Since π is a probability vector, the entries of π must sum to one. Thus,

$$
\sum_{j=1}^{N} \pi_j = 1. \tag{2.6}
$$

Equation (2.6) is called the normalizing equation.


The behavior of P(n) for the four-state regular Markov chain suggests that
as n → ∞, P(n) will converge to a matrix Π with identical rows. Each row of Π
is equal to the steady-state probability vector, π. To see why this is true for all
regular Markov chains, note that the n-step transition probability, $p_{ij}^{(n)}$, is the
(i, j)th element of the n-step transition probability matrix, $P^{(n)} = [p_{ij}^{(n)}]$. Since
$\lim_{n \to \infty} p_{ij}^{(n)} = \pi_j$, the steady-state probability for state j, it follows that the
limiting transition probability matrix is

$$
\lim_{n \to \infty} P^{(n)} = \lim_{n \to \infty} P^n = \Pi =
\begin{matrix} 1 \\ 2 \\ \vdots \\ N \end{matrix}
\begin{bmatrix} \pi \\ \pi \\ \vdots \\ \pi \end{bmatrix}
=
\begin{matrix} 1 \\ 2 \\ \vdots \\ N \end{matrix}
\begin{bmatrix}
\pi_1 & \pi_2 & \cdots & \pi_N \\
\pi_1 & \pi_2 & \cdots & \pi_N \\
\vdots & \vdots & & \vdots \\
\pi_1 & \pi_2 & \cdots & \pi_N
\end{bmatrix}. \tag{2.7}
$$

Thus, Π is a matrix with each row π, the steady-state probability vector.
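The convergence of P(n) to a matrix with identical rows can be checked numerically. The following Python sketch is an illustration added here, not part of the original derivation; it squares the weather matrix of Equation (1.9) repeatedly, and the helper names `matmul` and `column_spread` are hypothetical choices.

```python
# Weather chain of Equation (1.9); rows and columns are states 1 to 4.
P = [
    [0.3, 0.1, 0.4, 0.2],
    [0.2, 0.5, 0.2, 0.1],
    [0.3, 0.2, 0.1, 0.4],
    [0.0, 0.6, 0.3, 0.1],
]


def matmul(A, B):
    """Multiply two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def column_spread(M):
    """Largest gap between two entries of the same column (0 if all rows agree)."""
    return max(max(col) - min(col) for col in zip(*M))


# Repeated squaring produces P^(2), P^(4), P^(8), and P^(16).
powers = {1: P}
n = 1
while n < 16:
    powers[2 * n] = matmul(powers[n], powers[n])
    n *= 2

for n, M in powers.items():
    print(f"n = {n:2d}   spread between rows = {column_spread(M):.2e}")
print("first row of P^(16):", [round(x, 4) for x in powers[16][0]])
```

The spread between rows shrinks with each squaring, which is the numerical counterpart of every row of P(n) approaching π.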


As Section 1.9.1.1 indicates, if a Markov chain is regular, then some power
of the transition matrix has only positive elements. Thus, if P is the transition
probability matrix for a regular Markov chain, then after some number of
steps denoted by a positive integer K, $P^K$ has no zero entries. If P raised to the
power K has only positive entries, then P raised to all powers higher than K
also has only positive entries. Positivity of every power does not by itself
exclude zeros in the limit, but letting n → ∞ in $P^{n+K} = P^n P^K$ gives
$\Pi = \Pi P^K$, so that each row of Π satisfies $\pi = \pi P^K$. Every entry
$\pi_j = \sum_i \pi_i \, p_{ij}^{(K)}$ is therefore a weighted average of strictly positive
numbers, with nonnegative weights πi that sum to one, and must itself be positive.
Hence, the entries of the probability vector π for a regular Markov chain are
strictly positive and sum to one. For an N-state regular Markov chain,

$$
\pi_j > 0 \ \text{ for } j = 1, \ldots, N, \quad \text{and} \quad \sum_{j=1}^{N} \pi_j = 1. \tag{2.8}
$$

For the four-state Markov chain model of the weather, the rows of P(8)
calculated in Equation (2.2) indicate that

$$
\pi \approx [\, 0.20595 \;\; 0.3658 \;\; 0.2367 \;\; 0.1916 \,]. \tag{2.9}
$$

For large n, the state probability $p_j^{(n)}$ approaches the limiting probability πj.
That is,

$$
\pi_j = \lim_{n \to \infty} p_j^{(n)} = \lim_{n \to \infty} p_{ij}^{(n)}, \quad \text{for } j = 1, \ldots, N, \tag{2.10}
$$

and does not depend on the starting state. Thus, the vector π of steady-state
probabilities is equal to the limit, as the number of transitions approaches
infinity, of the vector p(n) of state probabilities. That is,

$$
\pi = \lim_{n \to \infty} p^{(n)}. \tag{2.11}
$$
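The independence of the limit from the starting state can be illustrated by propagating two different initial state probability vectors through the weather chain of Equation (1.9). This is a minimal sketch; the function names and the choice of 50 transitions are assumptions made for the illustration.

```python
# Weather chain of Equation (1.9); rows and columns are states 1 to 4.
P = [
    [0.3, 0.1, 0.4, 0.2],
    [0.2, 0.5, 0.2, 0.1],
    [0.3, 0.2, 0.1, 0.4],
    [0.0, 0.6, 0.3, 0.1],
]


def step(p, P):
    """One transition: return the row vector p P."""
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]


def state_probabilities(p0, P, n_steps):
    """State probability vector p^(n) after n_steps transitions from p0."""
    p = list(p0)
    for _ in range(n_steps):
        p = step(p, P)
    return p


# Start with certainty in state 1, then with certainty in state 4.
from_state_1 = state_probabilities([1.0, 0.0, 0.0, 0.0], P, 50)
from_state_4 = state_probabilities([0.0, 0.0, 0.0, 1.0], P, 50)

print([round(x, 4) for x in from_state_1])
print([round(x, 4) for x in from_state_4])
```

Both runs settle on the same vector, which is what Equations (2.10) and (2.11) assert.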

An informal derivation of the equations used to calculate the steady-state
probability πj begins by conditioning on the state at epoch n.

$$
\begin{aligned}
P(X_{n+1} = j) &= \sum_{i=1}^{N} P(X_n = i)\, P(X_{n+1} = j \mid X_n = i) \\
p_j^{(n+1)} &= \sum_{i=1}^{N} p_i^{(n)} p_{ij}.
\end{aligned}
$$

Letting n → ∞,

$$
\lim_{n \to \infty} p_j^{(n+1)} = \lim_{n \to \infty} \sum_{i=1}^{N} p_i^{(n)} p_{ij}.
$$

Interchanging the limit and the summation,

$$
\begin{aligned}
\lim_{n \to \infty} p_j^{(n+1)} &= \sum_{i=1}^{N} \lim_{n \to \infty} p_i^{(n)} p_{ij} \\
\pi_j &= \sum_{i=1}^{N} \pi_i p_{ij}.
\end{aligned}
$$

In matrix form,

$$
\begin{aligned}
\pi &= \pi P \\
\pi I &= \pi P \\
\pi (I - P) &= 0.
\end{aligned}
$$

Hence, π = πP is a homogeneous system of N linear equations in N unknowns.
Because each row of P sums to one, the columns of I − P sum to the zero vector,
so I − P is singular and the system has infinitely many solutions: every scalar
multiple of a solution is again a solution. For the same reason, one
of the equations is redundant because it can be expressed as a linear
combination of the others. A unique solution for π is obtained by dropping
one of the homogeneous equations and replacing it with the normalizing
equation (2.6). For a regular Markov chain with N states, the quantities πj
are the unique positive solution of the following steady-state equations in
algebraic form:

$$
\left.
\begin{aligned}
\pi_j &= \sum_{i=1}^{N} \pi_i p_{ij}, \quad \text{for } j = 1, 2, 3, \ldots, N \\
\sum_{j=1}^{N} \pi_j &= 1 \\
\pi_i &> 0, \quad \text{for } i = 1, 2, 3, \ldots, N.
\end{aligned}
\right\} \tag{2.12}
$$

The matrix form of the steady-state equations is

$$
\left.
\begin{aligned}
\pi &= \pi P \\
\pi e &= 1 \\
\pi &> 0
\end{aligned}
\right\}, \tag{2.13}
$$

where e is a column vector with all entries one.


In summary, after a large number of transitions, the behavior of a Markov
chain changes from short term, or transient, to long term, or steady state. In
the steady state the process continues to make transitions among the various
states in accordance with the stationary one-step transition probabilities, pij.
The steady-state probability, πj, represents the long run proportion of time that
the process will spend in state j. For an N-state irreducible Markov chain, the
steady-state probabilities, πj, are the components of a steady-state probability
vector, π, which is a 1 × N row vector. For an irreducible Markov chain, the
components of π are nonnegative. For a regular Markov chain, which is both
irreducible and aperiodic, all the components of π are positive.
Two approaches can be followed to solve the homogeneous system of equations
π = πP and the normalizing equation (2.6). The first approach is based on
the observation that the homogeneous system augmented by the normalizing
equation produces a linear system of N + 1 equations in N unknowns.
The homogeneous system contains one redundant equation. To obtain a square
system of N equations in N unknowns with a unique solution, one of the
homogeneous equations is dropped and the normalizing equation (2.6) takes its
place. Including the normalizing equation excludes the trivial solution π = 0
and thereby ensures that π is a probability vector.
The second approach is to first solve the homogeneous system of N equations
in N unknowns for π1, π2, ... , πN−1 in terms of πN. These values are then
substituted into the normalizing equation (2.6) to determine πN. The remaining
quantities πj are equal to constants times πN. This approach for computing
the steady-state probability vector is followed by the Markov chain
partitioning algorithm, which is described in Section 6.1.1 as an example of a
computational procedure called state reduction.

2.1.1 Calculating Steady-State Probabilities for a Generic Two-State Markov Chain
Both approaches for calculating steady-state probabilities will be illustrated
by applying them initially to the simplest kind of generic regular Markov
chain, a two-state chain for which the transition probability matrix is shown
in Equation (1.4). The matrix form of the steady-state equations (2.13) to be
solved is

$$
\left.
\begin{aligned}
[\, \pi_1 \;\; \pi_2 \,] &= [\, \pi_1 \;\; \pi_2 \,]
\begin{bmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{bmatrix} \\
[\, \pi_1 \;\; \pi_2 \,] \, [\, 1 \;\; 1 \,]^{T} &= 1.
\end{aligned}
\right\} \tag{2.14}
$$

In algebraic form, the steady-state equations (2.12) consist of the following
three equations in two unknowns, π1 and π2:

$$
\left.
\begin{aligned}
\pi_1 &= p_{11}\pi_1 + p_{21}\pi_2 \\
\pi_2 &= p_{12}\pi_1 + p_{22}\pi_2 \\
\pi_1 + \pi_2 &= 1.
\end{aligned}
\right\} \tag{2.15}
$$

Note that if the substitutions p12 = 1 − p11 and p22 = 1 − p21 are made in the
second equation, the second equation is transformed into the first equation,
as shown below:

π 2 = (1 − p11 )π 1 + (1 − p21 )π 2
π 2 = π 1 − p11π 1 + π 2 − p21π 2
π 1 = p11π 1 + p21π 2 .

Thus, the two equations of the system π = πP are linearly dependent. Either
one of them is redundant and can be discarded.
When the first approach to solving the steady-state equations is followed,
and the second equation is arbitrarily deleted, the resulting system of two
equations in two unknowns is shown below:

π 1 = p11π 1 + p21π 2 ,

or

(1 − p11 )π 1 − p21π 2 = 0,
π 1 + π 2 = 1.

After multiplying both sides of the second equation by p21 and adding the
result to the first equation, the solution for the vector of steady-state
probabilities is

$$
\pi = [\, \pi_1 \;\; \pi_2 \,] = \left[\, \frac{p_{21}}{p_{12} + p_{21}} \;\;\; \frac{p_{12}}{p_{12} + p_{21}} \,\right]. \tag{2.16}
$$

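For example, the vector of steady-state probabilities for a two-state regular
Markov chain with the transition probability matrix

$$
P = \begin{matrix} 1 \\ 2 \end{matrix}
\begin{bmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{bmatrix}
= \begin{matrix} 1 \\ 2 \end{matrix}
\begin{bmatrix} 0.2 & 0.8 \\ 0.6 & 0.4 \end{bmatrix} \tag{2.17}
$$

is

$$
\pi = [\, \pi_1 \;\; \pi_2 \,] = \left[\, \frac{p_{21}}{p_{12} + p_{21}} \;\;\; \frac{p_{12}}{p_{12} + p_{21}} \,\right] = [\, 3/7 \;\; 4/7 \,]. \tag{2.18}
$$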

In the second approach to solving the steady-state equations, the normalizing
equation is initially ignored. The resulting system, π = πP, of two linearly
dependent equations in two unknowns is shown below:

$$
\begin{aligned}
\pi_1 &= p_{11}\pi_1 + p_{21}\pi_2 \\
\pi_2 &= p_{12}\pi_1 + p_{22}\pi_2.
\end{aligned}
$$

The first equation is solved to express π1 as the following constant times π2:

$$
\pi_1 = \frac{p_{21}}{1 - p_{11}} \, \pi_2 = \frac{p_{21}}{p_{12}} \, \pi_2.
$$

This expression is substituted into the normalizing equation (2.6) to solve for π2.

$$
\frac{p_{21}}{p_{12}} \, \pi_2 + \pi_2 = 1.
$$

Hence π2 = p12/(p12 + p21), and π1 = (p21/p12)π2 = p21/(p12 + p21). Once again,
the solution for the vector of steady-state probabilities is given
by Equation (2.16).
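The closed form of Equation (2.16) is easy to check with exact arithmetic. The sketch below evaluates it for the matrix of Equation (2.17) and confirms that the result satisfies π = πP; the function name `two_state_steady_state` is a hypothetical label introduced for this illustration.

```python
from fractions import Fraction


def two_state_steady_state(p12, p21):
    """Steady-state vector [pi1, pi2] of a two-state chain, per Equation (2.16)."""
    total = p12 + p21
    return [p21 / total, p12 / total]


# Off-diagonal entries of the matrix in Equation (2.17).
p12 = Fraction(8, 10)
p21 = Fraction(6, 10)
pi = two_state_steady_state(p12, p21)

# Check the steady-state equations: pi P should reproduce pi.
P = [[1 - p12, p12],
     [p21, 1 - p21]]
pi_P = [sum(pi[i] * P[i][j] for i in range(2)) for j in range(2)]

print(pi, pi_P)
```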

2.1.2 Calculating Steady-State Probabilities for a Four-State Model of Weather
Both approaches to solving the steady-state equations will be illustrated
numerically by applying them to the four-state regular Markov chain model of the
weather for which the transition probability matrix is shown in Equation (1.9)
of Section 1.3. The matrix form of the steady-state equations (2.13) is

$$
\left.
\begin{aligned}
[\, \pi_1 \;\; \pi_2 \;\; \pi_3 \;\; \pi_4 \,] &= [\, \pi_1 \;\; \pi_2 \;\; \pi_3 \;\; \pi_4 \,]
\begin{bmatrix}
p_{11} & p_{12} & p_{13} & p_{14} \\
p_{21} & p_{22} & p_{23} & p_{24} \\
p_{31} & p_{32} & p_{33} & p_{34} \\
p_{41} & p_{42} & p_{43} & p_{44}
\end{bmatrix} \\
[\, \pi_1 \;\; \pi_2 \;\; \pi_3 \;\; \pi_4 \,] \, [\, 1 \;\; 1 \;\; 1 \;\; 1 \,]^{T} &= 1.
\end{aligned}
\right\} \tag{2.19}
$$

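When the numerical values given in Equation (1.9) are substituted for the
transition probabilities, the following system of five equations in four
unknowns is produced:

$$
\left.
\begin{aligned}
[\, \pi_1 \;\; \pi_2 \;\; \pi_3 \;\; \pi_4 \,] &= [\, \pi_1 \;\; \pi_2 \;\; \pi_3 \;\; \pi_4 \,]
\begin{bmatrix}
0.3 & 0.1 & 0.4 & 0.2 \\
0.2 & 0.5 & 0.2 & 0.1 \\
0.3 & 0.2 & 0.1 & 0.4 \\
0 & 0.6 & 0.3 & 0.1
\end{bmatrix} \\
[\, \pi_1 \;\; \pi_2 \;\; \pi_3 \;\; \pi_4 \,] \, [\, 1 \;\; 1 \;\; 1 \;\; 1 \,]^{T} &= 1.
\end{aligned}
\right\} \tag{2.20}
$$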

In algebraic form the system of steady-state equations (2.20) is

$$
\left.
\begin{aligned}
\pi_1 &= (0.3)\pi_1 + (0.2)\pi_2 + (0.3)\pi_3 + (0)\pi_4 \\
\pi_2 &= (0.1)\pi_1 + (0.5)\pi_2 + (0.2)\pi_3 + (0.6)\pi_4 \\
\pi_3 &= (0.4)\pi_1 + (0.2)\pi_2 + (0.1)\pi_3 + (0.3)\pi_4 \\
\pi_4 &= (0.2)\pi_1 + (0.1)\pi_2 + (0.4)\pi_3 + (0.1)\pi_4 \\
\pi_1 + \pi_2 + \pi_3 + \pi_4 &= 1.
\end{aligned}
\right\} \tag{2.21}
$$

When the first approach is followed, and the fourth equation is arbitrarily
deleted, the resulting system of four equations in four unknowns is shown
below:

$$
\begin{aligned}
\pi_1 &= (0.3)\pi_1 + (0.2)\pi_2 + (0.3)\pi_3 + (0)\pi_4 \\
\pi_2 &= (0.1)\pi_1 + (0.5)\pi_2 + (0.2)\pi_3 + (0.6)\pi_4 \\
\pi_3 &= (0.4)\pi_1 + (0.2)\pi_2 + (0.1)\pi_3 + (0.3)\pi_4 \\
\pi_1 + \pi_2 + \pi_3 + \pi_4 &= 1.
\end{aligned}
$$

The solution of this system is

$$
\begin{aligned}
\pi = [\, \pi_1 \;\; \pi_2 \;\; \pi_3 \;\; \pi_4 \,] &= [\, 1407/6832 \;\;\; 2499/6832 \;\;\; 1617/6832 \;\;\; 1309/6832 \,] \\
&= [\, 0.2059 \;\; 0.3658 \;\; 0.2367 \;\; 0.1916 \,].
\end{aligned} \tag{2.22}
$$

This solution almost matches the approximate one obtained in Equation (2.9)
by calculating $P^{(8)}$.
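The first approach can be sketched in a few lines of Python with exact rational arithmetic. This is an illustrative implementation under stated assumptions, not the book's own procedure: the helpers `solve` (a plain Gauss-Jordan elimination) and `steady_state` are hypothetical names, and the last homogeneous equation is the one replaced by the normalizing equation, as in the text.

```python
from fractions import Fraction

# Weather chain of Equation (1.9), stored exactly as fractions.
ROWS = ["0.3 0.1 0.4 0.2",
        "0.2 0.5 0.2 0.1",
        "0.3 0.2 0.1 0.4",
        "0 0.6 0.3 0.1"]
P = [[Fraction(x) for x in row.split()] for row in ROWS]


def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A square, nonsingular)."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        # Bring a row with a nonzero entry in this column into pivot position.
        pivot_row = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot_row] = M[pivot_row], M[col]
        pivot = M[col][col]
        M[col] = [x / pivot for x in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def steady_state(P):
    """First approach: drop the last equation of pi = pi P, add sum(pi) = 1."""
    n = len(P)
    # Row j holds the coefficients of equation j: sum_i pi_i (p_ij - delta_ij) = 0.
    A = [[P[i][j] - (1 if i == j else 0) for i in range(n)] for j in range(n)]
    b = [Fraction(0)] * n
    A[-1] = [Fraction(1)] * n      # normalizing equation replaces equation N
    b[-1] = Fraction(1)
    return solve(A, b)


pi = steady_state(P)
print([str(x) for x in pi])
print([round(float(x), 4) for x in pi])
```

`Fraction` prints each entry in lowest terms, so 1407/6832 appears with denominator 976 after the common factor 7 is cancelled.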
In the second approach, the normalizing equation (2.6) is initially ignored.
The first three equations contained in the system π = πP are solved to express
π1, π2, and π3 as the following constants times π4:

$$
\pi_1 = (1407/1309)\,\pi_4, \quad \pi_2 = (2499/1309)\,\pi_4, \quad \text{and} \quad \pi_3 = (1617/1309)\,\pi_4. \tag{2.23}
$$

These values for π1, π2, and π3 expressed in terms of π4 are substituted into the
normalizing equation to solve for π4.

$$
\begin{aligned}
1 &= \pi_1 + \pi_2 + \pi_3 + \pi_4 \\
1 &= (1407/1309)\,\pi_4 + (2499/1309)\,\pi_4 + (1617/1309)\,\pi_4 + (1309/1309)\,\pi_4.
\end{aligned}
$$

The result is

$$
\pi_4 = 1309/6832. \tag{2.24}
$$

Substituting the result for π4 to solve for the other steady-state probabilities
gives the values obtained by following the first approach:

$$
\pi_1 = 1407/6832, \quad \pi_2 = 2499/6832, \quad \text{and} \quad \pi_3 = 1617/6832. \tag{2.25}
$$

The steady-state probability πi represents the long run proportion of time
that the weather will be represented by state i. For example, the long
run proportion of cloudy days is equal to π3 = 1617/6832 = 0.2367.
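The second approach can be sketched in the same way: the last probability is temporarily set to one, the first N − 1 homogeneous equations are solved for the ratios πi/πN, and the normalizing equation then fixes πN. The code below is a hedged illustration; `solve` and `steady_state_by_ratios` are hypothetical helper names, and it is not a rendering of the partitioning algorithm of Section 6.1.1.

```python
from fractions import Fraction

# Weather chain of Equation (1.9), stored exactly as fractions.
ROWS = ["0.3 0.1 0.4 0.2",
        "0.2 0.5 0.2 0.1",
        "0.3 0.2 0.1 0.4",
        "0 0.6 0.3 0.1"]
P = [[Fraction(x) for x in row.split()] for row in ROWS]


def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination (A square, nonsingular)."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot_row = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot_row] = M[pivot_row], M[col]
        pivot = M[col][col]
        M[col] = [x / pivot for x in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def steady_state_by_ratios(P):
    """Second approach: ratios pi_i / pi_N first, normalization last."""
    last = len(P) - 1
    # With pi_N = 1, equation j (j < N) of pi = pi P reads
    #   sum_{i<N} c_i (p_ij - delta_ij) = -p_Nj,   where c_i = pi_i / pi_N.
    A = [[P[i][j] - (1 if i == j else 0) for i in range(last)] for j in range(last)]
    b = [-P[last][j] for j in range(last)]
    ratios = solve(A, b)
    pi_last = 1 / (1 + sum(ratios))          # from the normalizing equation
    return ratios, [c * pi_last for c in ratios] + [pi_last]


ratios, pi = steady_state_by_ratios(P)
print([str(c) for c in ratios])
print([round(float(x), 4) for x in pi])
```

The ratios correspond to the constants in Equation (2.23), and the normalized vector agrees with Equation (2.22).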

2.1.3 Steady-State Probabilities for Four-State Model of Inventory System
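The model of the inventory system described in Section 1.10.1.1.3 for which
the transition matrix is given in Equation (1.48) is another example of a
regular four-state Markov chain. The states, indexed 0, 1, 2, and 3, give the
number of computers in stock. The system of matrix equations (2.13) for finding
the steady-state probabilities is

$$
\left.
\begin{aligned}
[\, \pi_0 \;\; \pi_1 \;\; \pi_2 \;\; \pi_3 \,] &= [\, \pi_0 \;\; \pi_1 \;\; \pi_2 \;\; \pi_3 \,]
\begin{bmatrix}
0.2 & 0.1 & 0.4 & 0.3 \\
0.2 & 0.1 & 0.4 & 0.3 \\
0.3 & 0.4 & 0.3 & 0 \\
0.2 & 0.1 & 0.4 & 0.3
\end{bmatrix} \\
[\, \pi_0 \;\; \pi_1 \;\; \pi_2 \;\; \pi_3 \,] \, [\, 1 \;\; 1 \;\; 1 \;\; 1 \,]^{T} &= 1.
\end{aligned}
\right\} \tag{2.26}
$$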

The solution of the system (2.26) is

$$
\pi = [\, \pi_0 \;\; \pi_1 \;\; \pi_2 \;\; \pi_3 \,] = [\, 0.2364 \;\; 0.2091 \;\; 0.3636 \;\; 0.1909 \,]. \tag{2.27}
$$

The steady-state probability πi represents the long run proportion of time
that the retailer will have i computers in stock. For example, the long run
proportion of time that the retailer will have no computers in stock is given
by π0 = 0.2364. Similarly, the long run proportion of time that she will have
three computers in stock is equal to π3 = 0.1909.
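Because three rows of the transition matrix in (2.26) are identical, the steady-state equations for this chain can be solved by hand. The worked derivation below is an added illustration; the symbols r and s for the two distinct rows are introduced here and do not appear in the original.

```latex
% r = [0.2  0.1  0.4  0.3] is the common row for states 0, 1, and 3;
% s = [0.3  0.4  0.3  0]   is the row for state 2.
\begin{align*}
\pi &= \pi P = (\pi_0 + \pi_1 + \pi_3)\, r + \pi_2\, s = (1 - \pi_2)\, r + \pi_2\, s, \\
\pi_2 &= 0.4\,(1 - \pi_2) + 0.3\,\pi_2 \;\Longrightarrow\; \pi_2 = 4/11 \approx 0.3636, \\
\pi_0 &= 0.2\,(7/11) + 0.3\,(4/11) = 13/55 \approx 0.2364, \\
\pi_1 &= 0.1\,(7/11) + 0.4\,(4/11) = 23/110 \approx 0.2091, \\
\pi_3 &= 0.3\,(7/11) + 0\,(4/11) = 21/110 \approx 0.1909.
\end{align*}
```

The values of π0 and π3 agree with the proportions 0.2364 and 0.1909 quoted in the text.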

2.1.4 Steady-State Probabilities for Four-State Model of Component Replacement
The model of component replacement described in Section 1.10.1.1.5 is also a
regular four-state Markov chain. The transition matrix is given in Equation
(1.50). The states, indexed 0, 1, 2, and 3, give the age of a component in
weeks. To obtain the age distribution of the components in an electronic
device, which has been in operation for a long time, the steady-state
probabilities are required. The matrix equations for finding the steady-state
probabilities (2.13) are

$$
\left.
\begin{aligned}
[\, \pi_0 \;\; \pi_1 \;\; \pi_2 \;\; \pi_3 \,] &= [\, \pi_0 \;\; \pi_1 \;\; \pi_2 \;\; \pi_3 \,]
\begin{bmatrix}
0.2 & 0.8 & 0 & 0 \\
0.375 & 0 & 0.625 & 0 \\
0.8 & 0 & 0 & 0.2 \\
1 & 0 & 0 & 0
\end{bmatrix} \\
[\, \pi_0 \;\; \pi_1 \;\; \pi_2 \;\; \pi_3 \,] \, [\, 1 \;\; 1 \;\; 1 \;\; 1 \,]^{T} &= 1.
\end{aligned}
\right\} \tag{2.28}
$$

The solution of the system (2.28) is

π = [π 0 π1 π 2 π 3 ] = [0.4167 0.3333 0.2083 0.0417 ]. (2.29)

The steady-state probability πi represents the long run probability that a ran-
domly selected component is i weeks old. For example, the long run probabil-
ity that a component has just been replaced is given by π0 = 0.4167. Similarly,
the long run probability that a component has been in service 3 weeks is
equal to π3 = 0.0417.
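For this chain the second approach of Section 2.1 is especially short, because columns 1, 2, and 3 of the transition matrix in (2.28) each contain a single nonzero entry. The following worked derivation is an added illustration that expresses every probability in terms of π0 rather than the last state, which is a choice made here for convenience.

```latex
\begin{align*}
\pi_1 &= 0.8\,\pi_0, \\
\pi_2 &= 0.625\,\pi_1 = 0.5\,\pi_0, \\
\pi_3 &= 0.2\,\pi_2 = 0.1\,\pi_0, \\
1 &= \pi_0 + \pi_1 + \pi_2 + \pi_3 = 2.4\,\pi_0
   \;\Longrightarrow\; \pi_0 = 5/12 \approx 0.4167, \\
\pi_1 &= 1/3 \approx 0.3333, \quad
\pi_2 = 5/24 \approx 0.2083, \quad
\pi_3 = 1/24 \approx 0.0417.
\end{align*}
```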

2.2 First Passage to a Target State


The number of time periods, or steps, needed to move from a starting state
i to a destination or target state j for the first time is called a first passage
time.

2.2.1 Probability of First Passage in n Steps


The first passage time from state i to state j is a random variable denoted by Tij.
If i = j, then Tii is called the recurrence time for state i. If Tij = n, then the first
passage time from state i to state j equals n steps. Let $f_{ij}^{(n)}$ denote the probability
that the first passage time from state i to state j equals n steps [1–3]. That is,

$$
f_{ij}^{(n)} = P(T_{ij} = n). \tag{2.30}
$$

Consider a generic four-state regular Markov chain with the transition
probability matrix shown in Equation (1.3). Let

$$
f_j^{(n)} = [\, f_{ij}^{(n)} \,] =
\begin{bmatrix} f_{1j}^{(n)} \\ f_{2j}^{(n)} \\ f_{3j}^{(n)} \\ f_{4j}^{(n)} \end{bmatrix},
\quad \text{where } j = 1, 2, 3, \text{ or } 4, \tag{2.31}
$$

denote the column vector of n-step first passage time probabilities to a target
state j. To obtain a formula for computing the probability distribution of the
n-step first passage times, let j = 1. Suppose that $f_{41}^{(n)}$, the distribution of n-step
first passage time probabilities from state 4 to a target state 1, is desired. To
compute $f_{41}^{(n)}$, one may start with n = 1. The probability of going from state 4
to state 1 for the first time in one step, $f_{41}^{(1)}$, is simply the one-step transition
probability, p41. That is,

$$
f_{41}^{(1)} = p_{41}^{(1)} = p_{41}. \tag{2.32}
$$

To move from state 4 to state 1 for the first time in two steps, the chain must go
from state 4 to any nontarget state k different from state 1 on the first step, and
from that nontarget state k to state 1 on the second step. Therefore, for k = 2, 3, 4,

$$
f_{41}^{(2)} = p_{42} p_{21} + p_{43} p_{31} + p_{44} p_{41} = p_{42} f_{21}^{(1)} + p_{43} f_{31}^{(1)} + p_{44} f_{41}^{(1)}. \tag{2.33}
$$

To move from state 4 to state 1 for the first time in three steps, the chain must
go from state 4 to any nontarget state k different from state 1 on the first step,
and from that nontarget state k to state 1 for the first time after two additional
steps. Hence, for k = 2, 3, 4,

$$
f_{41}^{(3)} = p_{42} f_{21}^{(2)} + p_{43} f_{31}^{(2)} + p_{44} f_{41}^{(2)}. \tag{2.34}
$$

By induction, for n = 2, 3, ..., one can show that

$$
f_{41}^{(n)} = p_{42} f_{21}^{(n-1)} + p_{43} f_{31}^{(n-1)} + p_{44} f_{41}^{(n-1)} = \sum_{k \neq 1} p_{4k} f_{k1}^{(n-1)}.
$$

Thus, the n-step probability distribution of first passage times from a state i to
a target state j can be computed recursively by using the algebraic formula

$$
f_{ij}^{(n)} = \sum_{k \neq j} p_{ik} f_{kj}^{(n-1)}. \tag{2.35}
$$
To express formula (2.35) for recursively calculating $f_{ij}^{(n)}$ in matrix form, let j
denote the target state in a generic four-state regular Markov chain. Suppose
that j = 1. Let column vector $f_1^{(1)}$ denote column 1 of P, so that

$$
f_1^{(1)} =
\begin{bmatrix} p_{11} \\ p_{21} \\ p_{31} \\ p_{41} \end{bmatrix}
=
\begin{bmatrix} f_{11}^{(1)} \\ f_{21}^{(1)} \\ f_{31}^{(1)} \\ f_{41}^{(1)} \end{bmatrix}. \tag{2.36}
$$

Thus $f_1^{(1)}$ is the column vector of one-step transition probabilities from any
state to state 1, the target state. Let the matrix Z be the matrix P with column
j of the target state replaced by a column of zeroes. When j = 1,

$$
Z =
\begin{bmatrix}
0 & p_{12} & p_{13} & p_{14} \\
0 & p_{22} & p_{23} & p_{24} \\
0 & p_{32} & p_{33} & p_{34} \\
0 & p_{42} & p_{43} & p_{44}
\end{bmatrix}. \tag{2.37}
$$

Matrix Z is the matrix of one-step transition probabilities from any state to
any nontarget state, because the column of the target state has been replaced
with zeroes. Hence, the chain has zero probability of entering the target state
in one step. The recursive matrix equations corresponding to the algebraic
equations are given below:

$$
\begin{aligned}
f_j^{(1)} &= f_j^{(1)} = Z^0 f_j^{(1)} \\
f_j^{(2)} &= Z f_j^{(1)} = Z^1 f_j^{(1)} \\
f_j^{(3)} &= Z f_j^{(2)} = Z (Z f_j^{(1)}) = Z^2 f_j^{(1)} \\
f_j^{(4)} &= Z f_j^{(3)} = Z (Z^2 f_j^{(1)}) = Z^3 f_j^{(1)}.
\end{aligned} \tag{2.38}
$$

After n steps

$$
f_j^{(n)} = Z f_j^{(n-1)} = Z^{n-1} f_j^{(1)}.
$$

Matrix Z is the matrix of probabilities of not entering the target state in one
step because all entries in the column of the target state are zero. Therefore,
$Z^{n-1}$ is the matrix of probabilities of not entering the target state in n − 1 steps.
Vector $f_j^{(1)}$ is the vector of probabilities of entering the target state in one step.
Hence, $f_j^{(n)} = Z^{n-1} f_j^{(1)}$ is the vector of probabilities of not entering the target
state during the first n − 1 steps, and then entering the target state for the first
time on the nth step.
To see that these recursive matrix formulas are equivalent to the corresponding
algebraic formulas, they will be applied to a generic regular four-state
Markov chain to calculate the vectors $f_1^{(2)}$ and $f_1^{(3)}$ when state 1 is the
target state. The column vector of probabilities of first passage to state 1 in
two steps is represented by

$$
f_1^{(2)} = Z f_1^{(1)} =
\begin{bmatrix}
0 & p_{12} & p_{13} & p_{14} \\
0 & p_{22} & p_{23} & p_{24} \\
0 & p_{32} & p_{33} & p_{34} \\
0 & p_{42} & p_{43} & p_{44}
\end{bmatrix}
\begin{bmatrix} p_{11} \\ p_{21} \\ p_{31} \\ p_{41} \end{bmatrix}
=
\begin{bmatrix}
p_{12} p_{21} + p_{13} p_{31} + p_{14} p_{41} \\
p_{22} p_{21} + p_{23} p_{31} + p_{24} p_{41} \\
p_{32} p_{21} + p_{33} p_{31} + p_{34} p_{41} \\
p_{42} p_{21} + p_{43} p_{31} + p_{44} p_{41}
\end{bmatrix}
=
\begin{bmatrix} f_{11}^{(2)} \\ f_{21}^{(2)} \\ f_{31}^{(2)} \\ f_{41}^{(2)} \end{bmatrix}. \tag{2.39}
$$

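Observe that $f_{41}^{(2)}$, the fourth entry in column vector $f_1^{(2)}$, is given by

$$
f_{41}^{(2)} = p_{42} f_{21}^{(1)} + p_{43} f_{31}^{(1)} + p_{44} f_{41}^{(1)} = p_{42} p_{21} + p_{43} p_{31} + p_{44} p_{41},
$$

in agreement with the algebraic formula (2.33) obtained earlier. The vector of
probabilities of first passage to state 1 in three steps is given by

$$
\begin{aligned}
f_1^{(3)} &= Z^2 f_1^{(1)} =
\begin{bmatrix}
0 & p_{12} & p_{13} & p_{14} \\
0 & p_{22} & p_{23} & p_{24} \\
0 & p_{32} & p_{33} & p_{34} \\
0 & p_{42} & p_{43} & p_{44}
\end{bmatrix}
\begin{bmatrix}
0 & p_{12} & p_{13} & p_{14} \\
0 & p_{22} & p_{23} & p_{24} \\
0 & p_{32} & p_{33} & p_{34} \\
0 & p_{42} & p_{43} & p_{44}
\end{bmatrix}
\begin{bmatrix} f_{11}^{(1)} \\ f_{21}^{(1)} \\ f_{31}^{(1)} \\ f_{41}^{(1)} \end{bmatrix} \\
&=
\begin{bmatrix}
0 & p_{12} p_{22} + p_{13} p_{32} + p_{14} p_{42} & p_{12} p_{23} + p_{13} p_{33} + p_{14} p_{43} & p_{12} p_{24} + p_{13} p_{34} + p_{14} p_{44} \\
0 & p_{22} p_{22} + p_{23} p_{32} + p_{24} p_{42} & p_{22} p_{23} + p_{23} p_{33} + p_{24} p_{43} & p_{22} p_{24} + p_{23} p_{34} + p_{24} p_{44} \\
0 & p_{32} p_{22} + p_{33} p_{32} + p_{34} p_{42} & p_{32} p_{23} + p_{33} p_{33} + p_{34} p_{43} & p_{32} p_{24} + p_{33} p_{34} + p_{34} p_{44} \\
0 & p_{42} p_{22} + p_{43} p_{32} + p_{44} p_{42} & p_{42} p_{23} + p_{43} p_{33} + p_{44} p_{43} & p_{42} p_{24} + p_{43} p_{34} + p_{44} p_{44}
\end{bmatrix}
\begin{bmatrix} p_{11} \\ p_{21} \\ p_{31} \\ p_{41} \end{bmatrix} \\
&=
\begin{bmatrix}
(p_{12} p_{22} + p_{13} p_{32} + p_{14} p_{42}) p_{21} + (p_{12} p_{23} + p_{13} p_{33} + p_{14} p_{43}) p_{31} + (p_{12} p_{24} + p_{13} p_{34} + p_{14} p_{44}) p_{41} \\
(p_{22} p_{22} + p_{23} p_{32} + p_{24} p_{42}) p_{21} + (p_{22} p_{23} + p_{23} p_{33} + p_{24} p_{43}) p_{31} + (p_{22} p_{24} + p_{23} p_{34} + p_{24} p_{44}) p_{41} \\
(p_{32} p_{22} + p_{33} p_{32} + p_{34} p_{42}) p_{21} + (p_{32} p_{23} + p_{33} p_{33} + p_{34} p_{43}) p_{31} + (p_{32} p_{24} + p_{33} p_{34} + p_{34} p_{44}) p_{41} \\
(p_{42} p_{22} + p_{43} p_{32} + p_{44} p_{42}) p_{21} + (p_{42} p_{23} + p_{43} p_{33} + p_{44} p_{43}) p_{31} + (p_{42} p_{24} + p_{43} p_{34} + p_{44} p_{44}) p_{41}
\end{bmatrix}
=
\begin{bmatrix} f_{11}^{(3)} \\ f_{21}^{(3)} \\ f_{31}^{(3)} \\ f_{41}^{(3)} \end{bmatrix}.
\end{aligned} \tag{2.40}
$$

Observe that $f_{41}^{(3)}$, the fourth entry in column vector $f_1^{(3)}$, is given by

$$
\begin{aligned}
f_{41}^{(3)} &= (p_{42} p_{22} + p_{43} p_{32} + p_{44} p_{42}) p_{21} + (p_{42} p_{23} + p_{43} p_{33} + p_{44} p_{43}) p_{31} \\
&\quad + (p_{42} p_{24} + p_{43} p_{34} + p_{44} p_{44}) p_{41} \\
&= p_{42} (p_{22} p_{21} + p_{23} p_{31} + p_{24} p_{41}) + p_{43} (p_{32} p_{21} + p_{33} p_{31} + p_{34} p_{41}) \\
&\quad + p_{44} (p_{42} p_{21} + p_{43} p_{31} + p_{44} p_{41}) \\
&= p_{42} f_{21}^{(2)} + p_{43} f_{31}^{(2)} + p_{44} f_{41}^{(2)},
\end{aligned}
$$

in agreement with the algebraic formula (2.34) obtained earlier.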


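Alternatively,

$$
f_1^{(3)} = Z f_1^{(2)} =
\begin{bmatrix}
0 & p_{12} & p_{13} & p_{14} \\
0 & p_{22} & p_{23} & p_{24} \\
0 & p_{32} & p_{33} & p_{34} \\
0 & p_{42} & p_{43} & p_{44}
\end{bmatrix}
\begin{bmatrix} f_{11}^{(2)} \\ f_{21}^{(2)} \\ f_{31}^{(2)} \\ f_{41}^{(2)} \end{bmatrix}
=
\begin{bmatrix}
p_{12} f_{21}^{(2)} + p_{13} f_{31}^{(2)} + p_{14} f_{41}^{(2)} \\
p_{22} f_{21}^{(2)} + p_{23} f_{31}^{(2)} + p_{24} f_{41}^{(2)} \\
p_{32} f_{21}^{(2)} + p_{33} f_{31}^{(2)} + p_{34} f_{41}^{(2)} \\
p_{42} f_{21}^{(2)} + p_{43} f_{31}^{(2)} + p_{44} f_{41}^{(2)}
\end{bmatrix}
=
\begin{bmatrix} f_{11}^{(3)} \\ f_{21}^{(3)} \\ f_{31}^{(3)} \\ f_{41}^{(3)} \end{bmatrix}. \tag{2.41}
$$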

Observe that $f_{41}^{(3)}$, the fourth entry in column vector $f_1^{(3)}$, is given by

$$
f_{41}^{(3)} = p_{42} f_{21}^{(2)} + p_{43} f_{31}^{(2)} + p_{44} f_{41}^{(2)},
$$

in agreement with the algebraic formula (2.34) obtained earlier.


To calculate numerical probabilities of first passage in n steps, consider the
four-state regular Markov chain model of the weather introduced in Section
1.3. The transition probability matrix is given in Equation (1.9). If state 1 (rain)
is the target state, then

$$
f_1^{(1)} = \begin{matrix} 1 \\ 2 \\ 3 \\ 4 \end{matrix}
\begin{bmatrix} 0.3 \\ 0.2 \\ 0.3 \\ 0 \end{bmatrix},
\quad \text{and} \quad
Z = \begin{bmatrix}
0 & 0.1 & 0.4 & 0.2 \\
0 & 0.5 & 0.2 & 0.1 \\
0 & 0.2 & 0.1 & 0.4 \\
0 & 0.6 & 0.3 & 0.1
\end{bmatrix}. \tag{2.42}
$$

The vectors of n-step first passage probabilities, for n = 2, 3, and 4, are
calculated below, along with $Z^{n-1}$:

$$
f_1^{(2)} = Z f_1^{(1)} =
\begin{bmatrix}
0 & 0.1 & 0.4 & 0.2 \\
0 & 0.5 & 0.2 & 0.1 \\
0 & 0.2 & 0.1 & 0.4 \\
0 & 0.6 & 0.3 & 0.1
\end{bmatrix}
\begin{bmatrix} 0.3 \\ 0.2 \\ 0.3 \\ 0 \end{bmatrix}
=
\begin{bmatrix} 0.14 \\ 0.16 \\ 0.07 \\ 0.21 \end{bmatrix}
=
\begin{bmatrix} f_{11}^{(2)} \\ f_{21}^{(2)} \\ f_{31}^{(2)} \\ f_{41}^{(2)} \end{bmatrix} \tag{2.43}
$$

$$
f_1^{(3)} = Z f_1^{(2)} =
\begin{bmatrix}
0 & 0.1 & 0.4 & 0.2 \\
0 & 0.5 & 0.2 & 0.1 \\
0 & 0.2 & 0.1 & 0.4 \\
0 & 0.6 & 0.3 & 0.1
\end{bmatrix}
\begin{bmatrix} 0.14 \\ 0.16 \\ 0.07 \\ 0.21 \end{bmatrix}
=
\begin{bmatrix} 0.086 \\ 0.115 \\ 0.123 \\ 0.138 \end{bmatrix}
=
\begin{bmatrix} f_{11}^{(3)} \\ f_{21}^{(3)} \\ f_{31}^{(3)} \\ f_{41}^{(3)} \end{bmatrix}. \tag{2.44}
$$

Alternatively,

$$
f_1^{(3)} = Z^2 f_1^{(1)} =
\begin{bmatrix}
0 & 0.25 & 0.12 & 0.19 \\
0 & 0.35 & 0.15 & 0.14 \\
0 & 0.36 & 0.17 & 0.10 \\
0 & 0.42 & 0.18 & 0.19
\end{bmatrix}
\begin{bmatrix} 0.3 \\ 0.2 \\ 0.3 \\ 0 \end{bmatrix}
=
\begin{bmatrix} 0.086 \\ 0.115 \\ 0.123 \\ 0.138 \end{bmatrix}
=
\begin{bmatrix} f_{11}^{(3)} \\ f_{21}^{(3)} \\ f_{31}^{(3)} \\ f_{41}^{(3)} \end{bmatrix} \tag{2.45}
$$

$$
f_1^{(4)} = Z f_1^{(3)} =
\begin{bmatrix}
0 & 0.1 & 0.4 & 0.2 \\
0 & 0.5 & 0.2 & 0.1 \\
0 & 0.2 & 0.1 & 0.4 \\
0 & 0.6 & 0.3 & 0.1
\end{bmatrix}
\begin{bmatrix} 0.086 \\ 0.115 \\ 0.123 \\ 0.138 \end{bmatrix}
=
\begin{bmatrix} 0.0883 \\ 0.0959 \\ 0.0905 \\ 0.1197 \end{bmatrix}
=
\begin{bmatrix} f_{11}^{(4)} \\ f_{21}^{(4)} \\ f_{31}^{(4)} \\ f_{41}^{(4)} \end{bmatrix}. \tag{2.46}
$$

Alternatively,

$$
f_1^{(4)} = Z^3 f_1^{(1)} =
\begin{bmatrix}
0 & 0.263 & 0.119 & 0.092 \\
0 & 0.289 & 0.127 & 0.109 \\
0 & 0.274 & 0.119 & 0.114 \\
0 & 0.360 & 0.159 & 0.133
\end{bmatrix}
\begin{bmatrix} 0.3 \\ 0.2 \\ 0.3 \\ 0 \end{bmatrix}
=
\begin{bmatrix} 0.0883 \\ 0.0959 \\ 0.0905 \\ 0.1197 \end{bmatrix}
=
\begin{bmatrix} f_{11}^{(4)} \\ f_{21}^{(4)} \\ f_{31}^{(4)} \\ f_{41}^{(4)} \end{bmatrix}. \tag{2.47}
$$

The probability that the chain moves from state 4 to target state 1 for the first
time in four steps is given by $f_{41}^{(4)} = 0.1197$. Therefore, the probability that the
next rainy day (state 1) will appear for the first time 4 days after a sunny day
(state 4) is 0.1197.
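The recursion of Equation (2.38) translates directly into a short loop. The following Python sketch is an added illustration for the weather chain of Equation (1.9); the function name `first_passage_probabilities` is hypothetical, and states are numbered from 0 in the code, so target 0 corresponds to state 1 (rain).

```python
# Weather chain of Equation (1.9); index 0 is state 1 (rain), index 3 is state 4.
P = [
    [0.3, 0.1, 0.4, 0.2],
    [0.2, 0.5, 0.2, 0.1],
    [0.3, 0.2, 0.1, 0.4],
    [0.0, 0.6, 0.3, 0.1],
]


def first_passage_probabilities(P, target, n_max):
    """Return [f^(1), ..., f^(n_max)], the first passage vectors to `target`."""
    n = len(P)
    # Z is P with the target column replaced by zeroes, as in Equation (2.37).
    Z = [[0.0 if k == target else P[i][k] for k in range(n)] for i in range(n)]
    f = [P[i][target] for i in range(n)]       # f^(1) is column `target` of P
    vectors = [f]
    for _ in range(n_max - 1):
        # f^(n) = Z f^(n-1)
        f = [sum(Z[i][k] * f[k] for k in range(n)) for i in range(n)]
        vectors.append(f)
    return vectors


f = first_passage_probabilities(P, target=0, n_max=4)
for n, vec in enumerate(f, start=1):
    print(n, [round(x, 4) for x in vec])
```

The last entry of the fourth vector is the probability of first reaching state 1 from state 4 in exactly four steps.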

2.2.2 Mean First Passage Times


Recall from Section 2.2.1 that the first passage time from a state i to a target
state j is a random variable denoted by Tij. The mean first passage time from
a state i to a state j is denoted by mij and is abbreviated as MFPT [1, 4, 5]. The
MFPT is defined by the formula for mathematical expectation,

$$
m_{ij} = E(T_{ij}) = \sum_{n=1}^{\infty} n \, P(T_{ij} = n) = \sum_{n=1}^{\infty} n \, f_{ij}^{(n)}, \tag{2.48}
$$

where, as Equation (2.30) indicates, $f_{ij}^{(n)} = P(T_{ij} = n)$ is the probability
distribution of Tij. However, calculating $f_{ij}^{(n)}$ for all n is not feasible because the
number of steps in a first passage is unbounded. Hence, Equation (2.48) cannot be
used to compute the MFPTs.

2.2.2.1 MFPTs for a Five-State Markov Chain


As this section will demonstrate, the MFPTs can be obtained by solving a
system of linear equations in which the unknowns are the entries mij to a
target state j. To see how to compute the MFPTs for a regular Markov chain,
consider the following example of a generic Markov chain with five states,
indexed j, 1, 2, 3, and 4. Suppose that the transition probability matrix,
denoted by P, is

$$
P = \begin{matrix} j \\ 1 \\ 2 \\ 3 \\ 4 \end{matrix}
\begin{bmatrix}
p_{jj} & p_{j1} & p_{j2} & p_{j3} & p_{j4} \\
p_{1j} & p_{11} & p_{12} & p_{13} & p_{14} \\
p_{2j} & p_{21} & p_{22} & p_{23} & p_{24} \\
p_{3j} & p_{31} & p_{32} & p_{33} & p_{34} \\
p_{4j} & p_{41} & p_{42} & p_{43} & p_{44}
\end{bmatrix}. \tag{2.49}
$$

The vector of MFPTs can be found by following a procedure, which is also
used in Section 3.3.1. The procedure begins by making the target state j
an absorbing state. (As Sections 1.4.1.1 and 1.8 indicate, when a state j is an
absorbing state, pjj = 1, and the chain never leaves state j.) All the remaining
states become transient states. The modified transition matrix represents an
absorbing Markov chain with a single absorbing state, j. In the terminology
of Section 1.9.1.2, the Markov chain has been transformed into an absorbing
unichain. The chain stops as soon as it enters state j for the first time. When
the chain stops, it is said to have been absorbed in state j. Therefore, the MFPT
from a state i to a target state j in the original chain is also the mean time to
absorption from a transient state i to the absorbing state j in the modified
chain. The positions of the target state and the first state may be interchanged,
if necessary, to make the target state the first state. In this example the target
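state is already the first state, so that no interchange is necessary. The
transition probability matrix, $P_M$, for the modified absorbing Markov chain is
partitioned, as shown below, to put it in the canonical form of Equation (1.43):

$$
P_M = \begin{matrix} j \\ 1 \\ 2 \\ 3 \\ 4 \end{matrix}
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
p_{1j} & p_{11} & p_{12} & p_{13} & p_{14} \\
p_{2j} & p_{21} & p_{22} & p_{23} & p_{24} \\
p_{3j} & p_{31} & p_{32} & p_{33} & p_{34} \\
p_{4j} & p_{41} & p_{42} & p_{43} & p_{44}
\end{bmatrix}
=
\begin{bmatrix} 1 & 0 \\ D & Q \end{bmatrix}, \tag{2.50a}
$$

where

$$
D = \begin{matrix} 1 \\ 2 \\ 3 \\ 4 \end{matrix}
\begin{bmatrix} p_{1j} \\ p_{2j} \\ p_{3j} \\ p_{4j} \end{bmatrix},
\quad
Q = \begin{matrix} 1 \\ 2 \\ 3 \\ 4 \end{matrix}
\begin{bmatrix}
p_{11} & p_{12} & p_{13} & p_{14} \\
p_{21} & p_{22} & p_{23} & p_{24} \\
p_{31} & p_{32} & p_{33} & p_{34} \\
p_{41} & p_{42} & p_{43} & p_{44}
\end{bmatrix}. \tag{2.50b}
$$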

In this partition, the submatrix Q is formed by deleting row j and column j of
the original transition probability matrix P.
Suppose that the MFPT from state 4 to state j, that is, m4j = E(T4j), is of
interest. To obtain a formula for computing the MFPTs, one may condition on
the outcome of the first step, multiply this outcome by the probability that
it occurs, and sum these products over all possible outcomes. Given that the
process is initially in state 4, then either (a) the first step, with probability p4j, is
to target state j, in which case the first passage time is exactly one step, or (b) the
first step, with probability p4k, is to a nontarget state, k ≠ j, in which case the
expected first passage time is (1 + mkj), equal to the one step already taken plus
mkj, the MFPT from nontarget state k to target state j. Weighting each outcome by
the probability that it occurs produces the following formula for the MFPT from
state 4 to state j:

\begin{align}
m_{4j} &= p_{4j}(1) + p_{41}(1 + m_{1j}) + p_{42}(1 + m_{2j}) + p_{43}(1 + m_{3j}) + p_{44}(1 + m_{4j}) \notag \\
&= (p_{4j} + p_{41} + p_{42} + p_{43} + p_{44}) + p_{41} m_{1j} + p_{42} m_{2j} + p_{43} m_{3j} + p_{44} m_{4j} \tag{2.51} \\
&= 1 + p_{41} m_{1j} + p_{42} m_{2j} + p_{43} m_{3j} + p_{44} m_{4j} \notag \\
&= 1 + \sum_{k=1}^{4} p_{4k} m_{kj}. \tag{2.52}
\end{align}

Since m4j in a five-state regular Markov chain is a linear function of the four
unknowns, m1j, m2j, m3j, and m4j, m4j cannot be found by solving a single equation.
Instead, the following linear system of four equations in the four unknowns,
m1j, m2j, m3j, and m4j, must be solved to find the MFPTs to a target state j:

$$
\left.
\begin{aligned}
m_{1j} &= 1 + p_{11} m_{1j} + p_{12} m_{2j} + p_{13} m_{3j} + p_{14} m_{4j} \\
m_{2j} &= 1 + p_{21} m_{1j} + p_{22} m_{2j} + p_{23} m_{3j} + p_{24} m_{4j} \\
m_{3j} &= 1 + p_{31} m_{1j} + p_{32} m_{2j} + p_{33} m_{3j} + p_{34} m_{4j} \\
m_{4j} &= 1 + p_{41} m_{1j} + p_{42} m_{2j} + p_{43} m_{3j} + p_{44} m_{4j}
\end{aligned}
\right\}. \tag{2.53}
$$

In the general case of an (N + 1)-state regular Markov chain, given a starting
state i and a target state j, the algebraic formula for computing the MFPTs is
the system of N linear equations

$$
m_{ij} = 1 + \sum_{k \neq j} p_{ik} m_{kj}. \tag{2.54}
$$

For the five-state regular Markov chain, the matrix form of the system of
algebraic equations (2.53) for calculating the vector of MFPTs to a target
state j is

$$
\begin{bmatrix} m_{1j} \\ m_{2j} \\ m_{3j} \\ m_{4j} \end{bmatrix}
=
\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}
+
\begin{bmatrix}
p_{11} & p_{12} & p_{13} & p_{14} \\
p_{21} & p_{22} & p_{23} & p_{24} \\
p_{31} & p_{32} & p_{33} & p_{34} \\
p_{41} & p_{42} & p_{43} & p_{44}
\end{bmatrix}
\begin{bmatrix} m_{1j} \\ m_{2j} \\ m_{3j} \\ m_{4j} \end{bmatrix}. \tag{2.55}
$$

For an (N + 1)-state regular Markov chain, the concise form of the matrix
equation for computing the vector of MFPTs to a target state j is

$$
m_j = e + Q m_j, \tag{2.56}
$$

where

$$
m_j = [\, m_{1j} \;\; m_{2j} \;\; \cdots \;\; m_{Nj} \,]^{T}
$$

is an N × 1 column vector of MFPTs to a target state j, and e is an N × 1 column
vector with all entries one. Equation (2.56) is the matrix form of Equation (2.54).
To see how to calculate the MFPTs for a regular Markov chain with
numerical transition probabilities, consider the following example of a
Markov chain with five states, indexed j, 1, 2, 3, and 4. The transition
probability matrix is

$$
P = \begin{matrix} j \\ 1 \\ 2 \\ 3 \\ 4 \end{matrix}
\begin{bmatrix}
0.3 & 0.1 & 0.2 & 0.4 & 0 \\
0 & 0.4 & 0.1 & 0.2 & 0.3 \\
0.2 & 0.1 & 0.2 & 0.3 & 0.2 \\
0.1 & 0.2 & 0.1 & 0.4 & 0.2 \\
0 & 0.3 & 0.4 & 0.2 & 0.1
\end{bmatrix}. \tag{2.57}
$$

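Suppose that the vector of MFPTs to target state j is desired. When target
state j is made an absorbing state, the modified transition probability matrix,
$P_M$, is partitioned in the following manner to put it in the canonical form of
Equation (1.43):

$$
P_M =
\begin{array}{c|ccccc}
\text{State} & j & 1 & 2 & 3 & 4 \\ \hline
j & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0.4 & 0.1 & 0.2 & 0.3 \\
2 & 0.2 & 0.1 & 0.2 & 0.3 & 0.2 \\
3 & 0.1 & 0.2 & 0.1 & 0.4 & 0.2 \\
4 & 0 & 0.3 & 0.4 & 0.2 & 0.1
\end{array}
=
\begin{bmatrix} 1 & 0 \\ D & Q \end{bmatrix}, \tag{2.58}
$$

where

$$
D = \begin{matrix} 1 \\ 2 \\ 3 \\ 4 \end{matrix}
\begin{bmatrix} 0 \\ 0.2 \\ 0.1 \\ 0 \end{bmatrix},
\quad
Q = \begin{matrix} 1 \\ 2 \\ 3 \\ 4 \end{matrix}
\begin{bmatrix}
0.4 & 0.1 & 0.2 & 0.3 \\
0.1 & 0.2 & 0.3 & 0.2 \\
0.2 & 0.1 & 0.4 & 0.2 \\
0.3 & 0.4 & 0.2 & 0.1
\end{bmatrix},
\quad
e = \begin{matrix} 1 \\ 2 \\ 3 \\ 4 \end{matrix}
\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}. \tag{2.59}
$$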

In algebraic form, the Equations (2.53) or (2.54) for the MFPTs are

\[
\begin{aligned}
m_{1j} &= 1 + (0.4)m_{1j} + (0.1)m_{2j} + (0.2)m_{3j} + (0.3)m_{4j} \\
m_{2j} &= 1 + (0.1)m_{1j} + (0.2)m_{2j} + (0.3)m_{3j} + (0.2)m_{4j} \\
m_{3j} &= 1 + (0.2)m_{1j} + (0.1)m_{2j} + (0.4)m_{3j} + (0.2)m_{4j} \\
m_{4j} &= 1 + (0.3)m_{1j} + (0.4)m_{2j} + (0.2)m_{3j} + (0.1)m_{4j}.
\end{aligned}
\tag{2.60}
\]

The matrix form of these Equations (2.54) or (2.55) is

\[
\begin{bmatrix} m_{1j} \\ m_{2j} \\ m_{3j} \\ m_{4j} \end{bmatrix}
=
\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}
+
\begin{bmatrix}
0.4 & 0.1 & 0.2 & 0.3 \\
0.1 & 0.2 & 0.3 & 0.2 \\
0.2 & 0.1 & 0.4 & 0.2 \\
0.3 & 0.4 & 0.2 & 0.1
\end{bmatrix}
\begin{bmatrix} m_{1j} \\ m_{2j} \\ m_{3j} \\ m_{4j} \end{bmatrix}.
\tag{2.61}
\]

The solution for the vector mj of MFPTs to target state j, is

\[
m_j = \begin{bmatrix} m_{1j} & m_{2j} & m_{3j} & m_{4j} \end{bmatrix}^{T}
= \begin{bmatrix} 15.7143 & 12.1164 & 13.8624 & 14.8148 \end{bmatrix}^{T}.
\tag{2.62}
\]

For example, the mean number of steps to move from state 2 to state j for the
first time is given by m2j = 12.1164. Observe that matrix Q in Equation (2.59)
is identical to matrix Q in Equation (3.20). However, the entries in the vector
of MFPTs in Equation (2.62) differ slightly from those calculated by different
methods in Equations (3.22) and (6.68). Discrepancies are due to roundoff
error because only the first four significant decimal digits were stored.
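
The system (2.61) can also be solved mechanically. The following Python sketch is one possible way to do this, not the procedure used in the text: it rewrites m_j = e + Qm_j as (I − Q)m_j = e and applies Gauss-Jordan elimination with exact rational arithmetic. The helper name `solve` and the use of the standard-library `Fraction` type are illustrative assumptions.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(A)
    # Augmented matrix [A | b].
    M = [[F(x) for x in row] + [F(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        # Swap a row with a nonzero pivot into position, then normalize it.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [row[n] for row in M]

# Q from Equation (2.59); decimal strings let Fraction parse the entries exactly.
Q = [[F(x) for x in row] for row in (
    ["0.4", "0.1", "0.2", "0.3"],
    ["0.1", "0.2", "0.3", "0.2"],
    ["0.2", "0.1", "0.4", "0.2"],
    ["0.3", "0.4", "0.2", "0.1"],
)]
n = len(Q)
I_minus_Q = [[(1 if i == k else 0) - Q[i][k] for k in range(n)] for i in range(n)]

m = solve(I_minus_Q, [1] * n)          # MFPTs m_1j, m_2j, m_3j, m_4j
print([round(float(x), 4) for x in m])  # agrees with Equation (2.62)
```

In exact arithmetic the solution is 110/7, 2290/189, 2620/189, and 400/27, so small differences between methods can only come from rounding, which is consistent with the remark above.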

2.2.2.2 MFPTs for a Four-State Model of Component Replacement


Consider the Markov chain model of component replacement introduced in
Section 1.10.1.1.5. The transition probability matrix is given in Equation (1.50).
Each time the Markov chain enters state 0, a component is replaced with a
new component of age 0. The expected number of weeks until a component
of age i is replaced is equal to mi0, the MFPT from state i to state 0. To calculate
m0, the vector of MFPTs to target state 0, state 0 is made an absorbing state.
The modified transition probability matrix is partitioned below to put it in
canonical form:

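\[
P_M =
\begin{array}{c} 0 \\ 1 \\ 2 \\ 3 \end{array}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0.375 & 0 & 0.625 & 0 \\
0.8 & 0 & 0 & 0.2 \\
1 & 0 & 0 & 0
\end{bmatrix}
=
\begin{bmatrix} 1 & 0 \\ D & Q \end{bmatrix}.
\tag{2.63}
\]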

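The matrix form of the equations (2.56) to be solved to find the vector m0 of
MFPTs is

\[
\begin{bmatrix} m_{10} \\ m_{20} \\ m_{30} \end{bmatrix}
=
\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}
+
\begin{bmatrix}
0 & 0.625 & 0 \\
0 & 0 & 0.2 \\
0 & 0 & 0
\end{bmatrix}
\begin{bmatrix} m_{10} \\ m_{20} \\ m_{30} \end{bmatrix}.
\tag{2.64}
\]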

The solution for the vector m0 is


\[
m_0 = \begin{bmatrix} m_{10} & m_{20} & m_{30} \end{bmatrix}^{T}
= \begin{bmatrix} 1.75 & 1.2 & 1 \end{bmatrix}^{T}.
\tag{2.65}
\]

Thus, the expected length of time until a 1-week-old component is replaced
is 1.75 weeks. On average, a 2-week-old component is replaced after a further
1.2 weeks, and a 3-week-old component is replaced after 1 more week.

2.2.2.3 MFPTs for a Four-State Model of Weather


Consider the Markov chain model of weather introduced in Section 1.3. The
transition probability matrix is given in Equation (1.9). The vector of MFPTs
to target state 1, rain, will be computed. After making state 1 an absorbing
state, the modified transition probability matrix in canonical form is

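\[
P_M =
\begin{array}{c|c|ccc}
\text{State} & 1 & 2 & 3 & 4 \\ \hline
1 & 1 & 0 & 0 & 0 \\ \hline
2 & 0.2 & 0.5 & 0.2 & 0.1 \\
3 & 0.3 & 0.2 & 0.1 & 0.4 \\
4 & 0 & 0.6 & 0.3 & 0.1
\end{array}
=
\begin{bmatrix} 1 & 0 \\ D & Q \end{bmatrix}.
\tag{2.66}
\]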

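The matrix equation (2.56) for computing m1, the vector of MFPTs to target
state 1, is

\[
\begin{bmatrix} m_{21} \\ m_{31} \\ m_{41} \end{bmatrix}
=
\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}
+
\begin{bmatrix}
0.5 & 0.2 & 0.1 \\
0.2 & 0.1 & 0.4 \\
0.6 & 0.3 & 0.1
\end{bmatrix}
\begin{bmatrix} m_{21} \\ m_{31} \\ m_{41} \end{bmatrix}.
\tag{2.67}
\]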

The solution for m1 is

\[
m_1 = \begin{bmatrix} m_{21} & m_{31} & m_{41} \end{bmatrix}^{T}
= \begin{bmatrix} 5.3234 & 5.1244 & 6.3682 \end{bmatrix}^{T}.
\tag{2.68}
\]

Snow, state 2, is followed by rain after a mean interval of 5.3234 days, and a
sunny day, state 4, is followed by rain after a mean interval of 6.3682 days.
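
Equation (2.67) has the fixed-point form m = e + Qm, so it can also be solved by repeated substitution instead of elimination. The sketch below illustrates that alternative; it is not the method used in the text, and the function name, tolerance, and iteration cap are assumptions. The iteration converges here because Q is the transient block of an absorbing chain.

```python
# Transient block Q of Equation (2.66): rows and columns are states 2, 3, 4.
Q = [[0.5, 0.2, 0.1],
     [0.2, 0.1, 0.4],
     [0.6, 0.3, 0.1]]

def mfpt_by_iteration(Q, tol=1e-12, max_iter=10_000):
    """Iterate m <- e + Q m from m = 0 until successive iterates agree."""
    n = len(Q)
    m = [0.0] * n
    for _ in range(max_iter):
        new = [1.0 + sum(Q[i][k] * m[k] for k in range(n)) for i in range(n)]
        if max(abs(a - b) for a, b in zip(new, m)) < tol:
            return new
        m = new
    raise RuntimeError("iteration did not converge")

m1 = mfpt_by_iteration(Q)              # m_21, m_31, m_41
print([round(x, 4) for x in m1])       # agrees with Equation (2.68)
```

Each pass adds one more step of look-ahead, so the k-th iterate is the expected number of steps taken before reaching state 1 when the count is stopped after k steps.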

2.2.3 Mean Recurrence Time


If i = j, then the starting state is also the target state. When i = j, Tjj is called
the recurrence time for state j, and mjj = E(Tjj) denotes the mean recurrence
time for state j [1, 4, 5]. The mean recurrence time for state j can be calculated
by two different methods. The first method is to compute the MFPTs to
state j by solving the system of equations (2.54) or (2.56) for the case in which
i ≠ j, and then to solve the additional equation

\[
m_{jj} = 1 + \sum_{k \neq j} p_{jk} m_{kj} \tag{2.69}
\]

for the case in which i = j. The second method is to calculate the steady-state
probabilities for the original N-state regular Markov chain. The mean recurrence
times are the reciprocals of the steady-state probabilities [1, 4, 5].

2.2.3.1 Mean Recurrence Time for a Five-State Markov Chain


Consider the numerical example of a five-state regular Markov chain for
which the transition probability matrix is given in Equation (2.57). The
MFPTs to a target state j, m1j, m2j, m3j, and m4j, have already been calculated by
solving the system of equations (2.60) or (2.61). The vector of MFPTs to state j
is shown in Equation (2.62). The following additional equation can be solved
to calculate mjj, the mean recurrence time for state j:

\[
\begin{aligned}
m_{jj} &= 1 + p_{j1} m_{1j} + p_{j2} m_{2j} + p_{j3} m_{3j} + p_{j4} m_{4j} \\
&= 1 + (0.1)m_{1j} + (0.2)m_{2j} + (0.4)m_{3j} + (0)m_{4j} \\
&= 1 + (0.1)(15.7143) + (0.2)(12.1164) + (0.4)(13.8624) + (0)(14.8148) \\
&= 10.5397.
\end{aligned}
\tag{2.70}
\]

Alternatively, if the steady-state probabilities are known, the mean recurrence time
mjj for a target state j is simply the reciprocal of the steady-state probability
πj for the target state j. The steady-state probability vector for the five-state
regular Markov chain for which the transition probability matrix is given in
Equation (2.57) is obtained by solving the system of equations (2.13).

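\[
\begin{gathered}
\begin{bmatrix} \pi_j & \pi_1 & \pi_2 & \pi_3 & \pi_4 \end{bmatrix}
=
\begin{bmatrix} \pi_j & \pi_1 & \pi_2 & \pi_3 & \pi_4 \end{bmatrix}
\begin{array}{c} j \\ 1 \\ 2 \\ 3 \\ 4 \end{array}
\begin{bmatrix}
0.3 & 0.1 & 0.2 & 0.4 & 0 \\
0 & 0.4 & 0.1 & 0.2 & 0.3 \\
0.2 & 0.1 & 0.2 & 0.3 & 0.2 \\
0.1 & 0.2 & 0.1 & 0.4 & 0.2 \\
0 & 0.3 & 0.4 & 0.2 & 0.1
\end{bmatrix}, \\
\begin{bmatrix} \pi_j & \pi_1 & \pi_2 & \pi_3 & \pi_4 \end{bmatrix}
\begin{bmatrix} 1 & 1 & 1 & 1 & 1 \end{bmatrix}^{T} = 1.
\end{gathered}
\tag{2.71}
\]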

The solution is

\[
\pi = \begin{bmatrix} \pi_j & \pi_1 & \pi_2 & \pi_3 & \pi_4 \end{bmatrix}
= \begin{bmatrix} 0.0949 & 0.2385 & 0.1837 & 0.2967 & 0.1862 \end{bmatrix}.
\tag{2.72}
\]

Observe that πj = 0.0949. Hence, the mean recurrence time for state j is,

\[
m_{jj} = 1/\pi_j = 1/0.0949 = 10.5374,
\]

which is close to the result obtained by the first method. Discrepancies are
due to roundoff error.
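
The two methods can be compared numerically. The Python sketch below is illustrative only. It evaluates Equation (2.69) with the rounded MFPTs of Equation (2.62), and, as an assumption of this sketch rather than the solution method of Equation (2.13), it approximates the steady-state vector by repeatedly applying π ← πP. That iteration converges because the chain is regular.

```python
# Transition matrix of Equation (2.57); the state order is j, 1, 2, 3, 4.
P = [[0.3, 0.1, 0.2, 0.4, 0.0],
     [0.0, 0.4, 0.1, 0.2, 0.3],
     [0.2, 0.1, 0.2, 0.3, 0.2],
     [0.1, 0.2, 0.1, 0.4, 0.2],
     [0.0, 0.3, 0.4, 0.2, 0.1]]

# Method 1: Equation (2.69) with the rounded MFPTs of Equation (2.62).
m = [15.7143, 12.1164, 13.8624, 14.8148]
m_jj_first = 1.0 + sum(P[0][k + 1] * m[k] for k in range(4))

# Method 2: reciprocal of the steady-state probability of state j.
pi = [0.2] * 5                      # any starting distribution works
for _ in range(500):                # pi <- pi P
    pi = [sum(pi[i] * P[i][k] for i in range(5)) for k in range(5)]
m_jj_second = 1.0 / pi[0]

print(round(m_jj_first, 4), round(m_jj_second, 4))
```

Both values come out close to 10.54. The second method, carried at full floating-point precision, does not inherit the four-digit rounding of π_j used in the hand calculation above.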

2.2.3.2 Mean Recurrence Times for a Four-State Model of Component Replacement

Consider the Markov chain model of component replacement for which the
vector of MFPTs to target state 0 was computed in Equation (2.65). The mean
recurrence time for target state 0 can be computed by adding the equation

\[
\begin{aligned}
m_{00} &= 1 + p_{01} m_{10} + p_{02} m_{20} + p_{03} m_{30} \\
&= 1 + (0.8)m_{10} + (0)m_{20} + (0)m_{30} \\
&= 1 + (0.8)(1.75) + (0)(1.2) + (0)(1) = 2.4.
\end{aligned}
\tag{2.73}
\]

Alternatively, the steady-state probabilities for the model of component
replacement were calculated in Equation (2.29). The steady-state probability
that a component has just been replaced is given by π0 = 0.4167. Therefore
the mean recurrence time, m00, is again 2.4 weeks, equal to the reciprocal
of π0. This indicates that a new component, of age 0 weeks, will be replaced
about every 2.4 weeks. In other words, the mean life of a new component
is 2.4 weeks.
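
The following Python sketch reproduces both calculations for the component replacement model. Rows 1 to 3 of the matrix are taken from Equation (2.63). Row 0 is an assumption of this sketch: it is inferred from p01 = 0.8 in Equation (2.73) and from the requirement that each row sums to one, because Equation (1.50) is not reproduced in this section. The power iteration used for the steady-state vector is also an illustrative choice.

```python
from fractions import Fraction as F

# Rows 1-3 follow Equation (2.63). Row 0 is inferred (see the lead-in): p01 = 0.8
# from Equation (2.73), and p00 = 0.2 so that the row sums to one.
P = [[F(1, 5), F(4, 5), F(0), F(0)],
     [F(3, 8), F(0), F(5, 8), F(0)],
     [F(4, 5), F(0), F(0), F(1, 5)],
     [F(1), F(0), F(0), F(0)]]

# MFPTs to state 0: iterate m <- e + Q m, where Q is P without row and column 0.
# Q only moves a component to the next age, so Q^3 = 0 and three passes are exact.
m = [F(0)] * 3
for _ in range(3):
    m = [1 + sum(P[i + 1][k + 1] * m[k] for k in range(3)) for i in range(3)]

# First method: Equation (2.69) for state 0.
m00 = 1 + sum(P[0][k + 1] * m[k] for k in range(3))

# Second method: steady-state probability of state 0 by power iteration.
Pf = [[float(x) for x in row] for row in P]
pi = [0.25] * 4
for _ in range(200):
    pi = [sum(pi[i] * Pf[i][k] for i in range(4)) for k in range(4)]

print([float(x) for x in m], float(m00), round(1 / pi[0], 4))
```

Under these assumptions the MFPTs are 7/4, 6/5, and 1, the first method gives m00 = 12/5, and the reciprocal of π0 agrees with it.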

2.2.3.3 Optional Insight: Mean Recurrence Time as the Reciprocal of the Steady-State Probability for a Two-State Markov Chain

The reciprocal relationship between the mean recurrence time for a state and
its associated steady-state probability will be demonstrated in two different
ways for the special case of a generic two-state regular Markov chain. The
first demonstration depends on knowledge of the steady-state probability
vector. The transition probability matrix is given in Equation (1.4). The steady-
state probability vector was computed in Equation (2.16). First, the MFPTs to
target states 1 and 2, respectively, are calculated by solving the system of
equations (2.54).

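\[
\begin{aligned}
m_{21} &= 1 + p_{22} m_{21} \\
(1 - p_{22})\, m_{21} &= 1 = p_{21} m_{21} \\
m_{21} &= 1/p_{21}, \\
\text{and} \quad & \\
m_{12} &= 1 + p_{11} m_{12} \\
(1 - p_{11})\, m_{12} &= 1 = p_{12} m_{12} \\
m_{12} &= 1/p_{12}.
\end{aligned}
\tag{2.74}
\]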

Next, the mean recurrence times for states 1 and 2, respectively, are calculated
by solving Equation (2.69) in terms of the MFPTs previously calculated.

\[
\begin{aligned}
m_{11} &= 1 + p_{12} m_{21} = 1 + p_{12}/p_{21} = (p_{12} + p_{21})/p_{21} = 1/\pi_1 \\
m_{22} &= 1 + p_{21} m_{12} = 1 + p_{21}/p_{12} = (p_{12} + p_{21})/p_{12} = 1/\pi_2.
\end{aligned}
\tag{2.75}
\]

Hence, for a regular two-state Markov chain,

\[
m_{jj} = 1/\pi_j. \tag{2.76}
\]

An alternative demonstration of the reciprocal relationship between the
mean recurrence time for a state and the associated steady-state probability
is given below. This demonstration does not depend on knowledge of
the steady-state probabilities. Consider once again, for simplicity, a two-state
generic Markov chain for which the transition probability matrix is shown in
Equation (1.4). A matrix of MFPTs and mean recurrence times is denoted by

\[
M = \begin{array}{c} 1 \\ 2 \end{array}
\begin{bmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{bmatrix}.
\tag{2.77}
\]

The systems of Equations (2.54) and (2.69) used to compute the MFPTs and
the mean recurrence times can be written in the following form:

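\[
\begin{aligned}
m_{11} &= 1 + p_{12} m_{21} = p_{11}(m_{11} - m_{11}) + p_{12} m_{21} + 1 \\
m_{12} &= 1 + p_{11} m_{12} = p_{12}(m_{22} - m_{22}) + p_{11} m_{12} + 1 \\
m_{21} &= 1 + p_{22} m_{21} = p_{21}(m_{11} - m_{11}) + p_{22} m_{21} + 1 \\
m_{22} &= 1 + p_{21} m_{12} = p_{22}(m_{22} - m_{22}) + p_{21} m_{12} + 1.
\end{aligned}
\tag{2.78}
\]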

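To express the final form of this system as a matrix equation, let

\[
M_D = \begin{bmatrix} m_{11} & 0 \\ 0 & m_{22} \end{bmatrix},
\quad
E = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix},
\quad
\pi = \begin{bmatrix} \pi_1 & \pi_2 \end{bmatrix}.
\]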

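Then the matrix form of the system of equations (2.78) is

\[
\begin{aligned}
\begin{bmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{bmatrix}
&= \begin{bmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{bmatrix}
\begin{bmatrix} m_{11} - m_{11} & m_{12} \\ m_{21} & m_{22} - m_{22} \end{bmatrix}
+ \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} \\
&= \begin{bmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{bmatrix}
\left(
\begin{bmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{bmatrix}
- \begin{bmatrix} m_{11} & 0 \\ 0 & m_{22} \end{bmatrix}
\right)
+ \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}.
\end{aligned}
\]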

In symbolic form the matrix equation is

\[
M = P(M - M_D) + E.
\]

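Premultiplying both sides by π gives

\[
\begin{aligned}
\pi M &= \pi P(M - M_D) + \pi E \\
&= \pi P(M - M_D) + e^{T}, && \text{since } \pi E = \begin{bmatrix} 1 & 1 \end{bmatrix} = e^{T} \text{ because } \pi_1 + \pi_2 = 1, \\
&= \pi (M - M_D) + e^{T}, && \text{as } \pi P = \pi, \\
&= \pi M - \pi M_D + e^{T},
\end{aligned}
\]

where e^T is a row vector of ones. Canceling πM from both sides leaves

\[
\pi M_D = e^{T},
\qquad
\begin{bmatrix} \pi_1 & \pi_2 \end{bmatrix}
\begin{bmatrix} m_{11} & 0 \\ 0 & m_{22} \end{bmatrix}
= \begin{bmatrix} 1 & 1 \end{bmatrix}.
\]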

In algebraic form the equations are

\[
\pi_1 m_{11} = 1, \qquad \pi_2 m_{22} = 1.
\]

Hence, \( m_{ii} = 1/\pi_i \).

The conclusion that the mean recurrence time is the reciprocal of the
steady-state probability can be generalized to apply to any regular Markov
chain with a finite number of states.
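
Both demonstrations can be checked for particular numbers. The Python sketch below is illustrative: the function name and the values p12 = 3/10 and p21 = 6/10 are hypothetical. It builds M from Equations (2.74) and (2.75), evaluates the right-hand side of M = P(M − M_D) + E, and forms the products π_i m_ii using the two-state steady-state vector implied by Equation (2.75).

```python
from fractions import Fraction as F

def recurrence_check(p12, p21):
    """Return M, the right-hand side P(M - M_D) + E, and pi for a two-state chain.

    Assumes 0 < p12 <= 1 and 0 < p21 <= 1.
    """
    P = [[1 - p12, p12], [p21, 1 - p21]]
    # MFPTs from Equation (2.74), mean recurrence times from Equation (2.75).
    m21, m12 = 1 / p21, 1 / p12
    M = [[1 + p12 * m21, m12],
         [m21, 1 + p21 * m12]]
    # M - M_D keeps only the off-diagonal entries of M.
    M_off = [[0, M[0][1]], [M[1][0], 0]]
    rhs = [[sum(P[i][k] * M_off[k][j] for k in range(2)) + 1 for j in range(2)]
           for i in range(2)]
    # Steady-state vector of the two-state chain.
    pi = [p21 / (p12 + p21), p12 / (p12 + p21)]
    return M, rhs, pi

# Hypothetical transition probabilities, chosen only for illustration.
M, rhs, pi = recurrence_check(F(3, 10), F(6, 10))
print(M == rhs, [pi[i] * M[i][i] for i in range(2)])
```

For these values M equals the right-hand side entry by entry, and each product π_i m_ii equals one.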

PROBLEMS
2.1 This problem refers to Problem 1.1.
(a) If the machine is observed today operating in state 4 (WP),
what is the probability that it will enter state 2 (MD) after
4 days?
(b) If the machine is observed today operating in state 4 (WP),
what is the probability that it will enter state 2 (MD) for the
first time after 4 days?

(c) If the machine is observed today operating in state 4 (WP),
what is the expected number of days until the machine
enters state 1 (NW) for the first time?
(d) If the machine is observed today operating in state 3 (mD),
what is the expected number of days until the machine
returns to state 3 (mD) for the first time?
(e) Find the long run proportion of time that the machine will
occupy each state.
2.2 This problem refers to Problem 1.2.
(a) If the machine is observed today operating in state 5 (WP),
what is the probability that it will enter state 3 (MD) after
4 days?
(b) If the machine is observed today operating in state 5 (WP),
what is the probability that it will enter state 3 (MD) for the
first time after 4 days?
(c) If the machine is observed today operating in state 5 (WP),
what is the expected number of days until the machine
enters state 1 (NW1) for the first time?
(d) If the machine is observed today operating in state 4 (mD),
what is the expected number of days until the machine
returns to state 4 (mD) for the first time?
(e) Find the long run proportion of time that the machine will
occupy each state.
2.3 Consider the following modification of Problem 1.3 in which
a machine breaks down on a given day with probability p.
A repair can be completed in 1 day with probability q1, or in
2 days with probability q2, or in 3 days with probability q3. The
condition of the machine can be represented by one of the fol-
lowing four states:

State Description
0 (W) Working
1 (D1) Not working, in first day of repair
2 (D2) Not working, in second day of repair
3 (D3) Not working, in third day of repair

Since the next state of the machine depends only on its present
state, the condition of the machine can be modeled as a four-
state Markov chain. Answer the following questions if p = 0.1,
q1 = 0.5, q2 = 0.3, and q3 = 0.2:
(a) Construct the transition probability matrix for this four state
recurrent Markov chain.
(b) If the machine is observed today in state 3 (D3), what is the
probability that it will enter state 2 (D2) after 3 days?
(c) If the machine is observed today in state 1 (D1), what is the
probability that it will enter state 3 (D3) for the first time
after 3 days?

(d) If the machine is observed today in state 2 (D2), what is the
expected number of days until the machine enters state 0
(W) for the first time?
(e) If the machine is observed today in state 0 (W), what is the
expected number of days until the machine returns to state
0 (W) for the first time?
(f) Find the long run proportion of time that the machine will
occupy each state.
2.4 Consider the following modification of Problem 1.8 in which
an inventory system is controlled by a (1, 3) ordering policy. An
office supply store sells laptops. The state of the system is the
inventory on hand at the beginning of the day. If the number of
laptops on hand at the start of the day is less than one (in other
words, equal to 0), the store places an order, which is delivered
immediately, to raise the beginning inventory level to three lap-
tops. If the store starts the day with one or more laptops in stock,
no laptops are ordered. The number dn of laptops demanded by
customers during the day is an independent, identically distrib-
uted random variable, which has a stationary Poisson probabil-
ity distribution with parameter λ = 0.5 laptops per day.
(a) Construct the transition probability matrix for a four-state
recurrent Markov chain model of this inventory system
under a (1, 3) policy.
(b) If one laptop is on hand at the beginning of today, what is
the probability distribution of the inventory level 3 days
from today?
(c) If two laptops are on hand at the beginning of today, what is
the probability that no orders are placed tomorrow?
(d) If the inventory level is 2 at the beginning of today, what is
the expected number of days before the inventory level is 0
for the first time?
(e) If the inventory level is 1 at the beginning of today, what
is the expected number of days until the inventory level
returns to 1 for the first time?
(f) Find the long run probability distribution of the inventory
level.
2.5 In Problem 1.9, let

S0 = 1,000, S1 = 700, S2 = 300, S3 = 100, and S4 = 0.

(a) Construct the transition probability matrix.


(b) Starting with a new IC in a single location, find the probability
distribution for the age of an IC after 3 years.
(c) Find the steady-state probability distribution for the age of
an IC.
(d) Find the expected length of time until a component of age i
years is replaced, for i = 0, 1, 2, 3.
2.6 In Problem 1.5, let

p0 = 0.30, p1 = 0.10, p2 = 0.20, p3 = 0.25, and p4 = 0.15.

(a) Construct the transition probability matrix.


(b) If the credit counselor starts the day with one customer in
her office, find the probability distribution for the number
of customers in her office at the end of the third 45-min
counseling period.
(c) Find the steady-state probability distribution for the number
of customers in the credit counselor’s office.
(d) If the credit counselor starts the day with two customers in
her office, find the expected length of time until her office is
empty.
2.7 In Problem 1.6, let

p0 = 0.24, p1 = 0.08, p2 = 0.18, p3 = 0.16, p4 = 0.22, and p5 = 0.12.

(a) Construct the transition probability matrix.


(b) If the nurse starts the day with two patients in the clinic, find
the probability distribution for the number of patients in the
clinic at the end of the third 20-min. interval.
(c) Find the steady-state probability distribution for the number
of patients in the clinic.
(d) If the nurse starts the day with four patients in the clinic, find
the expected length of time until the clinic has one patient.
2.8 This problem refers to Problem 1.7.
(a) Verify that the following four-element vector is the steady-state
probability vector:

\[
\frac{r - p}{r(1 - r) - p(1 - p)s^{3}}
\begin{bmatrix} 1 - r & s & s^{2} & (1 - p)s^{3} \end{bmatrix},
\quad \text{where } s = \frac{p(1 - r)}{(1 - p)r}.
\]

(b) What is the long run proportion of time that the checkout
clerk is idle?
(c) What is the long run proportion of time that an arriving cus-
tomer will be denied admission to the store?
In Problem 1.7, let p = 0.24 and r = 0.36.
(d) Construct the transition probability matrix.
(e) If the checkout clerk starts the day with one customer inside
the convenience store, find the probability distribution for
the number of customers inside the store at the end of the
third 30-min interval.
(f) Find the steady-state probability distribution for the number
of customers inside the convenience store.
(g) If the checkout clerk starts the day with two customers inside
the convenience store, find the expected length of time until
the store has three customers.
2.9 In Problem 1.11, let
w0 = 0.14, w1 = 0.10, w2 = 0.20, w3 = 0.18, w4 = 0.24, and w5 = 0.14.
(a) Construct the transition probability matrix.

(b) If the dam starts the week with three units of water, find the
probability distribution for the volume of water stored in the
dam after 3 weeks.
(c) Find the steady-state probability distribution for the volume
of water stored in the dam.
(d) If the dam starts the week with one unit of water, find the
expected length of time until the dam has four units of
water.
2.10 In Problem 1.14, let p = 0.6, r = 0.8, and q = 0.7.
(a) Construct the transition probability matrix.
(b) If the optical shop starts the day with both the optometrist
and the optician idle, find the probability state vector after
three 30-min intervals.
(c) Find the steady-state probability vector for the optical shop.
(d) If the optical shop starts the day with the optometrist blocked
and the optician working, find the expected length of time
until the optometrist is working and the optician is idle.

References
1. Bhat, N., Elements of Applied Stochastic Processes, 2nd ed., Wiley, New York,
1985.
2. Cinlar, E., Introduction to Stochastic Processes, Prentice-Hall, Englewood Cliffs,
NJ, 1975.
3. Clarke, A. B. and Disney, R. L., Probability and Random Processes: A First Course
with Applications, 2nd ed., Wiley, New York, 1985.
4. Kemeny, J. G., Mirkil, H., Snell, J. L., and Thompson, G. L., Finite Mathematical
Structures, Prentice-Hall, Englewood Cliffs, NJ, 1959.
5. Kemeny, J. G. and Snell, J. L., Finite Markov Chains, Van Nostrand, Princeton, NJ,
1960. Reprinted by Springer-Verlag, New York, 1976.

3
Reducible Markov Chains

All Markov chains treated in this book have a finite number of states. Recall
from Section 1.9 that a reducible Markov chain has both recurrent states and
transient states. The state space for a reducible chain can be partitioned into
one or more mutually exclusive closed communicating classes of recurrent
states plus a set of transient states. A closed communicating class of recur-
rent states is often called a recurrent class or a recurrent chain. The state
space for a reducible unichain can be partitioned into one recurrent chain
plus a set of transient states. The state space for a reducible multichain can be
partitioned into two or more mutually exclusive recurrent chains plus a set
of transient states. There is no interaction among different recurrent chains
within a reducible multichain. Hence, each recurrent chain can be analyzed
separately by treating it as an irreducible Markov chain. A reducible chain
that starts in a transient state will eventually leave the set of transient
states to enter a recurrent chain, within which it will continue to make
transitions indefinitely.

3.1 Canonical Form of the Transition Matrix


As Section 1.9.3 indicates, when the state space of a reducible Markov chain is
partitioned into one or more disjoint closed communicating classes of recur-
rent states plus a set of transient states, the transition probability matrix is
said to be represented in a canonical or standard form. The canonical form
of the transition matrix without aggregation is shown for unichain models
in Section 1.10.1.2 and for multichain models in Section 1.10.2. The canoni-
cal form of the transition matrix with aggregation is shown in Sections 1.9.3
and 3.1.3 [1, 4, 5].

3.1.1 Unichain
Suppose that the states in a unichain are rearranged so that the recurrent
states come first, numbered consecutively, followed by the transient states,
which are also numbered consecutively. The closed class of recurrent states
is denoted by R, and the set of transient states is denoted by T. This rearrange-
ment of states can be used to partition the transition probability matrix for
the unichain into a recurrent chain plus a set of transient states. As Equation
(1.43) indicates, the transition probability matrix, P, is represented in the fol-
lowing canonical form:

\[
P = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}. \tag{3.1}
\]

The square submatrix S governs transitions within the closed class of recur-
rent states. Rectangular submatrix 0 consists entirely of zeroes indicating
that no transitions from recurrent states to transient states are possible.
Rectangular submatrix D governs transitions from transient states to recur-
rent states. The square submatrix Q governs transitions among transient
states. Submatrix Q is substochastic, which means that at least one row sum
is less than one.
Consider the generic four-state reducible unichain described in Section
1.9.1.2. The transition probability matrix is partitioned in Equation (1.40) to
show that states 1 and 2 belong to the recurrent chain denoted by R = {1, 2},
and states 3 and 4 are transient. The set of transient states is denoted by
T = {3, 4}. The canonical form of the transition matrix is

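\[
P = [p_{ij}] =
\begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \end{array}
\begin{bmatrix}
p_{11} & p_{12} & 0 & 0 \\
p_{21} & p_{22} & 0 & 0 \\
p_{31} & p_{32} & p_{33} & p_{34} \\
p_{41} & p_{42} & p_{43} & p_{44}
\end{bmatrix}
=
\begin{bmatrix} S & 0 \\ D & Q \end{bmatrix},
\]

where

\[
S = \begin{array}{c} 1 \\ 2 \end{array}
\begin{bmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{bmatrix},
\quad
D = \begin{array}{c} 3 \\ 4 \end{array}
\begin{bmatrix} p_{31} & p_{32} \\ p_{41} & p_{42} \end{bmatrix},
\quad \text{and} \quad
Q = \begin{array}{c} 3 \\ 4 \end{array}
\begin{bmatrix} p_{33} & p_{34} \\ p_{43} & p_{44} \end{bmatrix}.
\tag{3.2}
\]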

A special case of a reducible unichain is an absorbing unichain in which
the recurrent chain, denoted by R = {1}, is a single absorbing state. That is,
S = I = [1]. Thus, in an absorbing unichain, which is often called an absorbing
Markov chain, the only recurrent state is an absorbing state. All the
remaining states are transient. The canonical form of the transition probability
matrix shown in Equation (1.41) for a generic three-state absorbing
unichain is

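\[
P = [p_{ij}] =
\begin{array}{c} 1 \\ 2 \\ 3 \end{array}
\begin{bmatrix}
1 & 0 & 0 \\
p_{21} & p_{22} & p_{23} \\
p_{31} & p_{32} & p_{33}
\end{bmatrix},
\quad
P = \begin{bmatrix} 1 & 0 \\ D & Q \end{bmatrix},
\]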

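where

\[
D = \begin{array}{c} 2 \\ 3 \end{array}
\begin{bmatrix} p_{21} \\ p_{31} \end{bmatrix},
\quad \text{and} \quad
Q = \begin{array}{c} 2 \\ 3 \end{array}
\begin{bmatrix} p_{22} & p_{23} \\ p_{32} & p_{33} \end{bmatrix}.
\tag{3.3}
\]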

If passage from a transient state to the recurrent chain, R, in a reducible
unichain is of interest, rather than passage to the individual states within R,
then the recurrent states within R may be lumped together to be replaced by
a single absorbing state. This procedure, which is used in Section 3.5.1, will
transform a reducible unichain with recurrent states into an absorbing
unichain with the same set of transient states.
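
The lumping step can be sketched in Python as follows. This is a hedged illustration, not the procedure of Section 3.5.1: the function name and the numerical entries of D and Q are hypothetical. It relies on the fact that the one-step probability of moving from a transient state into R is the sum of that state's entries in D.

```python
def lump_recurrent_states(D, Q):
    """Return the absorbing-unichain matrix [[1, 0], [d, Q]].

    d[i] is the row sum of D, that is, the one-step probability that transient
    state i enters the recurrent chain R at all.
    """
    n = len(Q)
    top = [1.0] + [0.0] * n
    bottom = [[sum(D[i])] + list(Q[i]) for i in range(n)]
    return [top] + bottom

# Hypothetical numbers for the generic four-state unichain of Equation (3.2):
# rows are transient states 3 and 4; the columns of D are recurrent states 1 and 2.
D = [[0.25, 0.25],
     [0.125, 0.125]]
Q = [[0.25, 0.25],
     [0.5, 0.25]]

P_absorbing = lump_recurrent_states(D, Q)
for row in P_absorbing:
    print(row)
```

The submatrix Q is copied unchanged, so any quantity that depends only on behavior among the transient states is the same before and after lumping.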

3.1.2 Multichain
Consider a reducible multichain that has M recurrent chains, denoted by
R1, . . . , RM, with the respective transition probability matrices P1, . . . , PM, plus a
set of transient states denoted by T. The states can be rearranged if necessary
so that the transition probability matrix, P, can be represented in the following
canonical form:

\[
P =
\begin{bmatrix}
P_1 & 0 & \cdots & 0 & 0 \\
0 & P_2 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & P_M & 0 \\
D_1 & D_2 & \cdots & D_M & Q
\end{bmatrix}.
\tag{3.4}
\]

The M square submatrices P1, . . . , PM are the transition probability matrices
that specify the transitions after the chain has entered the corresponding
closed class of recurrent states. The M rectangular submatrices D1, . . . , DM
contain the probabilities that govern transitions from transient states to the
corresponding closed classes of recurrent states. The square submatrix Q
contains the probabilities that govern transitions among the transient states.
Each submatrix 0 consists entirely of zeroes.
The transition probability matrix for a generic five-state reducible multi-
chain is shown in Equation (1.42). As Section 1.9.2 indicates, the multichain
has M = 2 recurrent closed sets, one of which is an absorbing state, plus two
transient states. The two recurrent sets of states are R1 = {1} and R2 = {2, 3}. The
set of transient states is denoted by T = {4, 5}.
The canonical form of the transition matrix is
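\[
P =
\begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{array}
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & p_{22} & p_{23} & 0 & 0 \\
0 & p_{32} & p_{33} & 0 & 0 \\
p_{41} & p_{42} & p_{43} & p_{44} & p_{45} \\
p_{51} & p_{52} & p_{53} & p_{54} & p_{55}
\end{bmatrix}
=
\begin{bmatrix}
P_1 & 0 & 0 \\
0 & P_2 & 0 \\
D_1 & D_2 & Q
\end{bmatrix},
\]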

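where

\[
\begin{gathered}
P_1 = [1],
\quad
P_2 = \begin{array}{c} 2 \\ 3 \end{array}
\begin{bmatrix} p_{22} & p_{23} \\ p_{32} & p_{33} \end{bmatrix},
\quad
D_1 = \begin{array}{c} 4 \\ 5 \end{array}
\begin{bmatrix} p_{41} \\ p_{51} \end{bmatrix}, \\
D_2 = \begin{array}{c} 4 \\ 5 \end{array}
\begin{bmatrix} p_{42} & p_{43} \\ p_{52} & p_{53} \end{bmatrix},
\quad \text{and} \quad
Q = \begin{array}{c} 4 \\ 5 \end{array}
\begin{bmatrix} p_{44} & p_{45} \\ p_{54} & p_{55} \end{bmatrix}.
\end{gathered}
\tag{3.5}
\]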

3.1.3 Aggregation of the Transition Matrix in Canonical Form


The canonical form of the transition probability matrix, P, for a reducible
multichain can be represented in an aggregated format. As Section 1.9.3
indicates, this is accomplished by combining all the closed communicating
classes of recurrent states into a single recurrent closed class with a transi-
tion matrix denoted by S.
Since S in a multichain is formed by combining the transition probability
matrices for several recurrent chains, Equation (3.1) represents the transi-
tion matrix of a reducible multichain in aggregated canonical form. Equation
(1.44) in Section 1.9.3 shows the aggregated canonical form of the transition
matrix for a generic five-state reducible multichain. Now consider a reduc-
ible multichain that has M = 3 closed communicating classes of recurrent
states denoted by R1, R2, and R3 with the corresponding transition probability
matrices P1, P2, and P3, plus a set of transient states. The transition
probability matrix, P, in aggregated canonical form is

\[
P =
\begin{bmatrix}
P_1 & 0 & 0 & 0 \\
0 & P_2 & 0 & 0 \\
0 & 0 & P_3 & 0 \\
D_1 & D_2 & D_3 & Q
\end{bmatrix}
=
\begin{bmatrix} S & 0 \\ D & Q \end{bmatrix},
\]

where

\[
S = \begin{bmatrix} P_1 & 0 & 0 \\ 0 & P_2 & 0 \\ 0 & 0 & P_3 \end{bmatrix},
\quad
0 = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix},
\quad
D = \begin{bmatrix} D_1 & D_2 & D_3 \end{bmatrix}.
\tag{3.6}
\]

The Markov chain model of a production process constructed in
Section 1.10.2.2 is an example of an eight-state reducible multichain with
three closed classes of recurrent states. States 1 and 2 are absorbing states,
which represent the first two recurrent chains, respectively. States 3, 4, and
5 are recurrent states, which represent the third recurrent chain. States 6, 7,
and 8 are the transient states. The aggregated transition matrix in canonical
form is shown in Equation (1.60) and is repeated below with the submatrices
identified individually:

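\[
P =
\begin{array}{lc}
\text{Scrapped} & 1 \\
\text{Sold} & 2 \\
\text{Training engineers} & 3 \\
\text{Training technicians} & 4 \\
\text{Training tech. writers} & 5 \\
\text{Stage 1} & 6 \\
\text{Stage 2} & 7 \\
\text{Stage 3} & 8
\end{array}
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0.50 & 0.30 & 0.20 & 0 & 0 & 0 \\
0 & 0 & 0.30 & 0.45 & 0.25 & 0 & 0 & 0 \\
0 & 0 & 0.10 & 0.35 & 0.55 & 0 & 0 & 0 \\
0.20 & 0.16 & 0.04 & 0.03 & 0.02 & 0.55 & 0 & 0 \\
0.15 & 0 & 0 & 0 & 0 & 0.20 & 0.65 & 0 \\
0.10 & 0 & 0 & 0 & 0 & 0 & 0.15 & 0.75
\end{bmatrix}
=
\begin{bmatrix}
P_1 & 0 & 0 & 0 \\
0 & P_2 & 0 & 0 \\
0 & 0 & P_3 & 0 \\
D_1 & D_2 & D_3 & Q
\end{bmatrix}
=
\begin{bmatrix} S & 0 \\ D & Q \end{bmatrix},
\]

where

\[
S = \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{array}
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0.50 & 0.30 & 0.20 \\
0 & 0 & 0.30 & 0.45 & 0.25 \\
0 & 0 & 0.10 & 0.35 & 0.55
\end{bmatrix}
=
\begin{bmatrix} P_1 & 0 & 0 \\ 0 & P_2 & 0 \\ 0 & 0 & P_3 \end{bmatrix},
\quad
Q = \begin{array}{c} 6 \\ 7 \\ 8 \end{array}
\begin{bmatrix}
0.55 & 0 & 0 \\
0.20 & 0.65 & 0 \\
0 & 0.15 & 0.75
\end{bmatrix},
\]

\[
D = \begin{array}{c} 6 \\ 7 \\ 8 \end{array}
\begin{bmatrix}
0.20 & 0.16 & 0.04 & 0.03 & 0.02 \\
0.15 & 0 & 0 & 0 & 0 \\
0.10 & 0 & 0 & 0 & 0
\end{bmatrix}
= \begin{bmatrix} D_1 & D_2 & D_3 \end{bmatrix},
\]

\[
P_1 = P_2 = [1],
\quad
P_3 = \begin{array}{c} 3 \\ 4 \\ 5 \end{array}
\begin{bmatrix}
0.50 & 0.30 & 0.20 \\
0.30 & 0.45 & 0.25 \\
0.10 & 0.35 & 0.55
\end{bmatrix},
\quad
D_1 = \begin{array}{c} 6 \\ 7 \\ 8 \end{array}
\begin{bmatrix} 0.20 \\ 0.15 \\ 0.10 \end{bmatrix},
\quad
D_2 = \begin{array}{c} 6 \\ 7 \\ 8 \end{array}
\begin{bmatrix} 0.16 \\ 0 \\ 0 \end{bmatrix},
\]

\[
D_3 = \begin{array}{c} 6 \\ 7 \\ 8 \end{array}
\begin{bmatrix}
0.04 & 0.03 & 0.02 \\
0 & 0 & 0 \\
0 & 0 & 0
\end{bmatrix}.
\tag{3.7}
\]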

The transition probability matrices, P1, P2, and P3, correspond to the three
closed classes of recurrent states, denoted by R1 = {1}, R2 = {2}, and R3 = {3, 4, 5},
respectively. Note that P1 = [1] and P2 = [1] are each associated with a single
absorbing state. Each absorbing state forms its own closed communicating
class. The set of transient states is denoted by T = {6, 7, 8}.
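
The partition in Equation (3.7) can be reproduced by selecting rows and columns by state label. The Python sketch below is illustrative; the helper name `block` and the 1-based labeling convention are assumptions of the sketch.

```python
# Transition matrix of Equation (3.7); row and column k - 1 correspond to state k.
P = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0.50, 0.30, 0.20, 0, 0, 0],
    [0, 0, 0.30, 0.45, 0.25, 0, 0, 0],
    [0, 0, 0.10, 0.35, 0.55, 0, 0, 0],
    [0.20, 0.16, 0.04, 0.03, 0.02, 0.55, 0, 0],
    [0.15, 0, 0, 0, 0, 0.20, 0.65, 0],
    [0.10, 0, 0, 0, 0, 0, 0.15, 0.75],
]

def block(P, rows, cols):
    """Return the submatrix of P for the given 1-based state labels."""
    return [[P[r - 1][c - 1] for c in cols] for r in rows]

recurrent = [1, 2, 3, 4, 5]   # R1, R2, and R3 combined
transient = [6, 7, 8]         # T

S = block(P, recurrent, recurrent)
zero = block(P, recurrent, transient)
D = block(P, transient, recurrent)
Q = block(P, transient, transient)

# The recurrent classes are closed, so the upper-right block must vanish.
print(all(x == 0 for row in zero for x in row))
# Every row of Q here sums to less than one, so Q is substochastic.
print([round(sum(row), 2) for row in Q])
```

Passing `[3, 4, 5]` for both arguments returns P3, and passing `transient` with `[3, 4, 5]` returns D3, so the same helper recovers every submatrix listed in Equation (3.7).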
As Section 1.10.2.1 indicates, a special case of a reducible multichain is an
absorbing multichain in which the submatrix S is an identity matrix, I. Thus,
in an absorbing multichain, which is generally called an absorbing Markov
chain, all the recurrent states are absorbing states. All the remaining states
are transient states. The canonical form of the transition matrix for an
absorbing multichain is

\[
P = \begin{bmatrix} I & 0 \\ D & Q \end{bmatrix}. \tag{3.8}
\]

If passage from a transient state to a particular recurrent chain, Ri, in a
reducible multichain is of interest, rather than passage to an individual state
within the recurrent chain, Ri, then the states within each recurrent chain
may be lumped together to be replaced by a single absorbing state. This
procedure, which is used in Section 3.4.1, will transform a reducible multichain
into an absorbing multichain with the same set of transient states. Of course,
if passage from a transient state to any recurrent state is of interest, then
submatrix S may be replaced by a single absorbing state to create an absorbing
unichain, for which the transition matrix is

\[
P = \begin{bmatrix} 1 & 0 \\ D & Q \end{bmatrix}. \tag{3.9}
\]

3.2 The Fundamental Matrix


A reducible Markov chain model can be used to answer the following three
questions concerning the eventual passage from a transient state to a recur-
rent state, assuming that the chain starts in a given transient state [1, 4, 5].
1. What is the expected number of times the chain may enter a particu-
lar transient state before it eventually enters a recurrent state?
2. What is the expected total number of times the chain may enter all
transient states before it eventually enters a recurrent state?
3. What is the probability of eventual passage to a recurrent state?

Questions 1, 2, and 3 are answered in Sections 3.2.2, 3.2.3, and 3.3.3,
respectively.

3.2.1 Definition of the Fundamental Matrix


To answer these three questions, a matrix inverse called the fundamental
matrix is defined. Assume that the transition matrix, P, for a reducible
unichain, which may be produced by aggregating all the recurrent closed
classes in a reducible multichain, is represented in the canonical form.

\[
P = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}. \tag{3.1}
\]

The inverse of the matrix (I − Q) exists, and is called the fundamental
matrix, denoted by U. That is, the fundamental matrix, U, is defined to be

\[
U = (I - Q)^{-1}. \tag{3.10}
\]

To see how the fundamental matrix can be used to answer the three ques-
tions of Section 3.2, it is instructive to consider, for simplicity, the canonical
form of the transition matrix for the reducible generic four-state unichain
shown in Equation (3.2). Observe that

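\[
I - Q =
\begin{array}{c} 3 \\ 4 \end{array}
\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
-
\begin{array}{c} 3 \\ 4 \end{array}
\begin{bmatrix} p_{33} & p_{34} \\ p_{43} & p_{44} \end{bmatrix}
=
\begin{array}{c} 3 \\ 4 \end{array}
\begin{bmatrix} 1 - p_{33} & -p_{34} \\ -p_{43} & 1 - p_{44} \end{bmatrix}.
\]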

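The fundamental matrix is

\[
\begin{aligned}
U &= (I - Q)^{-1}
= \begin{bmatrix} 1 - p_{33} & -p_{34} \\ -p_{43} & 1 - p_{44} \end{bmatrix}^{-1} \\
&= \frac{1}{(1 - p_{33})(1 - p_{44}) - p_{34} p_{43}}
\begin{array}{c} 3 \\ 4 \end{array}
\begin{bmatrix} 1 - p_{44} & p_{34} \\ p_{43} & 1 - p_{33} \end{bmatrix}.
\end{aligned}
\tag{3.11}
\]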
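
As a numerical check of Equation (3.11), the Python sketch below evaluates the closed-form inverse for a hypothetical transient block and confirms that (I − Q)U = I. The function name and the entries of Q are illustrative assumptions; exact fractions are used so that the comparison involves no rounding.

```python
from fractions import Fraction as F

def fundamental_2x2(Q):
    """Closed-form U = (I - Q)^(-1) for two transient states, as in Equation (3.11)."""
    (p33, p34), (p43, p44) = Q
    det = (1 - p33) * (1 - p44) - p34 * p43
    return [[(1 - p44) / det, p34 / det],
            [p43 / det, (1 - p33) / det]]

# Hypothetical transient block for states 3 and 4, chosen only for illustration.
Q = [[F(1, 4), F(1, 4)],
     [F(1, 2), F(1, 4)]]
U = fundamental_2x2(Q)

# Check the defining property (I - Q) U = I.
I_minus_Q = [[(1 if i == k else 0) - Q[i][k] for k in range(2)] for i in range(2)]
product = [[sum(I_minus_Q[i][k] * U[k][j] for k in range(2)) for j in range(2)]
           for i in range(2)]

print(U)
print(product)
```

For this Q the determinant is 7/16 and U has entries 12/7, 4/7, 8/7, and 12/7. The closed form applies only to two transient states; a larger transient set needs a general linear solver.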

3.2.2 Mean Time in a Particular Transient State


To see how the fundamental matrix can be used to answer question one for
the reducible four-state unichain for which the transition matrix is shown
in Equation (3.2), let the random variable Nij denote the number of times the
chain enters transient state j before the chain eventually enters a recurrent
state, given that the chain starts in transient state i. Let uij denote the mean
number of times the chain enters transient state j before the chain eventually
enters a recurrent state, given that the chain starts in transient state i. Then

uij = E(Nij). (3.12)

Suppose that i and j are different transient states, so that i ≠ j. If the first
transition from transient state i is to any transient state k, with probability
pik, then state j will be entered a mean number of ukj times. State j will be
entered zero times if the first transition is to a recurrent state. For the case
in which i ≠ j,

$$
u_{ij} = \sum_{k \in T} p_{ik} u_{kj}, \quad \text{if } i \neq j, \tag{3.13}
$$

where the sum is over all transient states, k.


Now suppose that i and j are the same transient state, so that i = j. Then, by
assumption, state i will be occupied once during the first time period. State i
will be entered a mean number of uki additional times if, with probability pik,
the first transition is to any transient state k. Thus,

$$
u_{ii} = 1 + \sum_{k \in T} p_{ik} u_{ki}, \quad \text{if } i = j, \tag{3.14}
$$

where the sum is over all transient states, k.


Since the reducible four-state unichain has two transient states, there are $2^2 = 4$
unknown quantities, uij, to be determined. The system of four algebraic equations
used for calculating the uij for the reducible four-state unichain is given below:

$$
\begin{aligned}
u_{34} &= \sum_{k=3}^{4} p_{3k} u_{k4} = 0 + p_{33} u_{34} + p_{34} u_{44} \\
u_{43} &= \sum_{k=3}^{4} p_{4k} u_{k3} = 0 + p_{43} u_{33} + p_{44} u_{43} \\
u_{33} &= 1 + \sum_{k=3}^{4} p_{3k} u_{k3} = 1 + p_{33} u_{33} + p_{34} u_{43} \\
u_{44} &= 1 + \sum_{k=3}^{4} p_{4k} u_{k4} = 1 + p_{43} u_{34} + p_{44} u_{44}.
\end{aligned}
$$

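The matrix form of these equations is

$$
\begin{bmatrix} u_{33} & u_{34} \\ u_{43} & u_{44} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} + \begin{bmatrix} p_{33} & p_{34} \\ p_{43} & p_{44} \end{bmatrix} \begin{bmatrix} u_{33} & u_{34} \\ u_{43} & u_{44} \end{bmatrix}.
$$

In this matrix equation, let

$$
U = \begin{bmatrix} u_{33} & u_{34} \\ u_{43} & u_{44} \end{bmatrix}.
$$

Recall from Equation (3.2) that

$$
Q = \begin{bmatrix} p_{33} & p_{34} \\ p_{43} & p_{44} \end{bmatrix}.
$$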

The compact form of the matrix equation is

$$
\begin{aligned}
U &= I + QU \\
U - QU &= I \\
IU - QU &= I \\
(I - Q)U &= I \\
U &= (I - Q)^{-1}.
\end{aligned}
$$

The following results, although obtained for a reducible four-state unichain,
can be generalized to hold for any finite-state reducible Markov chain.
The matrix form of Equations (3.13) and (3.14) is

$$
U = I + QU. \tag{3.15}
$$

An alternative matrix form of Equations (3.13) and (3.14) is

$$
U = (I - Q)^{-1}. \tag{3.16}
$$

Observe that the matrix $U = (I - Q)^{-1}$. Hence, U is the fundamental matrix
defined in Equation (3.10). Since U = [uij], uij is the (i, j)th entry of U. To answer
question one of Section 3.2, uij, the (i, j)th entry of the fundamental matrix,
specifies the mean number of times the chain enters transient state j before
the chain eventually enters a recurrent state, given that the chain starts in
transient state i. For example, the fundamental matrix for the reducible four-state
unichain is shown in Equation (3.11). Suppose the unichain starts in
transient state 4. Then

$$
u_{43} = \frac{p_{43}}{(1 - p_{33})(1 - p_{44}) - p_{34} p_{43}}
$$

is the mean number of time periods that a unichain, which is initially in
transient state 4, spends in transient state 3 before the chain eventually enters
recurrent state 1 or recurrent state 2, which together form a recurrent closed
class.
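Equation (3.15) can also be read as a fixed-point relation: substituting U = I + QU into itself repeatedly produces the partial sums I + Q + Q² + ..., which accumulate expected visits one step at a time. The sketch below, which assumes NumPy and uses a hypothetical matrix Q (not taken from the text), iterates that relation and compares the result with the inverse in Equation (3.16).

```python
import numpy as np

# Hypothetical one-step probabilities among two transient states (illustration only).
Q = np.array([[0.5, 0.2],
              [0.1, 0.4]])
I = np.eye(2)

# Iterate U <- I + Q U, starting from U = I. After m iterations,
# U equals the partial sum I + Q + Q^2 + ... + Q^m.
U_iter = I.copy()
for _ in range(200):
    U_iter = I + Q @ U_iter

U_inv = np.linalg.inv(I - Q)

print(np.allclose(U_iter, U_inv))         # compares the iteration with (I - Q)^{-1}
print(np.allclose(U_inv, I + Q @ U_inv))  # checks that U satisfies Equation (3.15)
```

The iteration is shown only to illustrate the meaning of U; in practice the inverse or a linear solve is the usual way to obtain it.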

3.2.3 Mean Time in All Transient States


To see how the fundamental matrix can be used to answer question two of
Section 3.2, let the random variable Ni denote the total number of times the
chain enters all transient states before the chain eventually enters a recurrent
state, given that the chain starts in transient state i. Then

$$
N_i = \sum_{j \in T} N_{ij},
$$

where the sum is over all transient states, j. Let ui denote the mean number
of times the chain enters all transient states before the chain eventually
enters a recurrent state, given that the chain starts in transient state i. It follows
that

$$
u_i = E(N_i) = E\Big(\sum_{j \in T} N_{ij}\Big) = \sum_{j \in T} E(N_{ij}) = \sum_{j \in T} u_{ij}, \tag{3.17}
$$

where the sum is over all transient states, j. Thus, ui is the sum of the elements
in row i of U, the fundamental matrix. To answer question two, if the chain
begins in transient state i, it will make a mean total number of ui transitions
before eventually entering a recurrent state.
This result can be expressed in the form of a matrix equation. Suppose
that a reducible Markov chain has q transient states. Let e be a q-component
column vector with all entries one. Then the ith component of the column
vector

u = Ue (3.18)

represents the mean total number of transitions to all transient states before
the chain eventually enters a recurrent state, given that the chain started in
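transient state i. The column vector Ue is computed for the reducible four-state
unichain for which the fundamental matrix is shown in Equation (3.11).

$$
u = \begin{bmatrix} u_3 \\ u_4 \end{bmatrix} = Ue = \begin{bmatrix} u_{33} & u_{34} \\ u_{43} & u_{44} \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} u_{33} + u_{34} \\ u_{43} + u_{44} \end{bmatrix} = \frac{1}{(1 - p_{33})(1 - p_{44}) - p_{34} p_{43}} \begin{bmatrix} 1 - p_{44} + p_{34} \\ p_{43} + 1 - p_{33} \end{bmatrix}. \tag{3.19}
$$

The sum

$$
u_4 = u_{43} + u_{44} = \frac{p_{43} + 1 - p_{33}}{(1 - p_{33})(1 - p_{44}) - p_{34} p_{43}},
$$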

which is the last entry of column vector Ue, represents the mean number
of time periods that the reducible unichain, starting in transient state 4,
spends in transient states 3 and 4 before the chain eventually enters recurrent
state 1 or 2.
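Equation (3.18) can be tied back to a first-step argument. Multiplying U = I + QU from Equation (3.15) on the right by e gives an equivalent system for the mean times; the short derivation below uses only symbols already defined in this section.

```latex
\begin{aligned}
u = Ue &= (I + QU)e = e + Q(Ue) = e + Qu, \\
u_i &= 1 + \sum_{k \in T} p_{ik} u_k .
\end{aligned}
```

In words, one time period is spent in the starting state i, after which the chain continues from whichever transient state k it enters next.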

3.2.4 Absorbing Multichain Model of Patient Flow in a Hospital


As an illustration of the role of the fundamental matrix in analyzing a reduc-
ible Markov chain model, consider the example of the daily movement of
patients within a hospital, which is modeled as a six-state absorbing multichain
in Section 1.10.2.1.2. In this model, both recurrent states are absorbing
states. The transition probability matrix for the absorbing multichain is given
in canonical form in Equation (1.59). For this model, with rows and columns
indexed by transient states 1, 2, 3, and 4,

$$
Q = \begin{bmatrix} 0.4 & 0.1 & 0.2 & 0.3 \\ 0.1 & 0.2 & 0.3 & 0.2 \\ 0.2 & 0.1 & 0.4 & 0.2 \\ 0.3 & 0.4 & 0.2 & 0.1 \end{bmatrix}, \quad
I - Q = \begin{bmatrix} 0.6 & -0.1 & -0.2 & -0.3 \\ -0.1 & 0.8 & -0.3 & -0.2 \\ -0.2 & -0.1 & 0.6 & -0.2 \\ -0.3 & -0.4 & -0.2 & 0.9 \end{bmatrix}. \tag{3.20}
$$

The fundamental matrix is

$$
U = (I - Q)^{-1} = \begin{bmatrix} 5.2364 & 2.8563 & 4.2847 & 3.3321 \\ 2.9263 & 3.2798 & 3.4383 & 2.4682 \\ 3.5078 & 2.4858 & 5.0253 & 2.8382 \\ 3.8254 & 2.9621 & 4.0731 & 3.9494 \end{bmatrix}. \tag{3.21}
$$
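The inverse in Equation (3.21) can be reproduced with a few lines of Python. This is a minimal sketch that assumes NumPy; the variable names are illustrative. Because the text carries only four decimal places through its calculations, the printed entries may differ from Equation (3.21) in the third or fourth decimal place.

```python
import numpy as np

# Transient-to-transient block Q from Equation (3.20); rows and columns are states 1-4.
Q = np.array([[0.4, 0.1, 0.2, 0.3],
              [0.1, 0.2, 0.3, 0.2],
              [0.2, 0.1, 0.4, 0.2],
              [0.3, 0.4, 0.2, 0.1]])

# Fundamental matrix U = (I - Q)^{-1}, Equation (3.10).
U = np.linalg.inv(np.eye(4) - Q)

np.set_printoptions(precision=4, suppress=True)
print(U)

# Entry u43: mean number of days in state 3 for a patient who starts in state 4
# (zero-based indices 3 and 2).
print(U[3, 2])
```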

To answer question one of Section 3.2, uij, the (i, j)th entry of the funda-
mental matrix, represents the mean number of days a patient spends in tran-
sient state j before the patient eventually enters an absorbing state, given
that the patient starts in transient state i. Therefore, u43 = 4.0731 is the mean
number of days that a patient who is initially in physical therapy (transient
state 4) spends in surgery (transient state 3) before the patient is eventually
discharged (absorbed in state 0) or dies (absorbed in state 5).
To answer question two, the column vector Ue has as components the
mean total number of days a patient spends in each transient state before
the patient eventually enters an absorbing state. In other words, compo-
nent ui of the column vector Ue denotes the mean time to absorption for a
patient who starts in transient state i. The column vector Ue for the hospital
model is

$$
u = Ue = \begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \end{bmatrix} = \begin{bmatrix} 5.2364 & 2.8563 & 4.2847 & 3.3321 \\ 2.9263 & 3.2798 & 3.4383 & 2.4682 \\ 3.5078 & 2.4858 & 5.0253 & 2.8382 \\ 3.8254 & 2.9621 & 4.0731 & 3.9494 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 15.7095 \\ 12.1126 \\ 13.8571 \\ 14.81 \end{bmatrix}. \tag{3.22}
$$

The sum u4 = u41 + u42 + u43 + u44 = 14.81, which is the fourth entry of col-
umn vector Ue, represents the mean number of days that a patient initially
in physical therapy (transient state 4) spends in the diagnostic, outpatient,
surgery, and physical therapy departments before the patient eventually is
discharged or dies. Observe that matrix Q in Equation (3.20) is identical to


matrix Q in Equation (2.58). However, the entries in the vector of mean first
passage times (MFPTs) in Equation (3.22) differ slightly from those calcu-
lated by different methods in Equations (2.62) and (6.68). Discrepancies are
due to roundoff error because only the first four significant decimal digits
were stored.

3.3 Passage to a Target State


A reducible Markov chain contains a set of transient states plus one or more
closed classes of recurrent states, some of which may be absorbing states.
In many reducible chain models, passage from a transient state to a target
recurrent state is of interest [1, 3–5].

3.3.1 Mean First Passage Times in a Regular Markov Chain Revisited


In this section, MFPTs in a regular Markov chain, calculated in Section 2.2.2,
are revisited to show that they are equivalent to the mean times to absorp-
tion in an associated absorbing Markov chain created by changing the target
recurrent state in the regular chain into an absorbing state.
Consider the generic five-state regular Markov chain which was intro-
duced in Section 2.2.2.1. The five states are indexed j, 1, 2, 3, and 4. The tran-
sition probability matrix, P, is shown in Equation (2.49). As Section 2.2.2.1
indicates, the vector of MFPTs can be found by making the target state j an
absorbing state. Equation (2.50) shows the canonical form of the transition
probability matrix, PM, for the modified absorbing Markov chain, obtained
by deleting row j and column j of the original transition probability matrix.
When Equation (2.56) is solved for the vector mj of MFPTs to target state j, the
result is

$$
\begin{aligned}
I m_j &= e + Q m_j \\
I m_j - Q m_j &= e \\
(I - Q) m_j &= e \\
m_j &= (I - Q)^{-1} e = Ue.
\end{aligned} \tag{3.23}
$$

Observe that $(I - Q)^{-1} = U$ is the fundamental matrix of a reducible Markov
chain. Formula (3.23) for the vector mj indicates that the vector of MFPTs for
a regular Markov chain is equal to the fundamental matrix of the associated
absorbing Markov chain postmultiplied by a column vector of ones. By the
rules of matrix multiplication, the MFPT from a state i to a target state j in the
regular chain is equal to the sum of the entries in row i of the fundamental
matrix. Therefore, the vector of MFPTs to the target state j is

$$
\begin{aligned}
m_j &= \begin{bmatrix} m_{1j} \\ m_{2j} \\ m_{3j} \\ m_{4j} \end{bmatrix} = (I - Q)^{-1} e = Ue \\
&= \begin{bmatrix} u_{11} & u_{12} & u_{13} & u_{14} \\ u_{21} & u_{22} & u_{23} & u_{24} \\ u_{31} & u_{32} & u_{33} & u_{34} \\ u_{41} & u_{42} & u_{43} & u_{44} \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} u_{11} + u_{12} + u_{13} + u_{14} \\ u_{21} + u_{22} + u_{23} + u_{24} \\ u_{31} + u_{32} + u_{33} + u_{34} \\ u_{41} + u_{42} + u_{43} + u_{44} \end{bmatrix} = \begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \end{bmatrix}.
\end{aligned} \tag{3.24}
$$

The column vector Ue = mj of mean times to absorption in absorbing state j is
also the vector of MFPTs to the target state j in a regular Markov chain, which
has been transformed into an associated absorbing unichain by changing
the target state j into an absorbing state. For example, the sum u4 = u41 + u42 +
u43 + u44, which is the fourth entry of column vector Ue in Equation (3.24), represents
the mean time to absorption in absorbing state j, starting in transient
state 4. The quantity u4 = m4j is also the MFPT from recurrent state 4 to target
state j in the original regular chain. Hence, an alternative way to compute
MFPTs is to multiply the fundamental matrix by a vector of ones. However,
this alternative procedure is recommended only if the fundamental matrix
is already available because, for an N-state chain, inverting a matrix by
Gaussian elimination involves about three times as many arithmetic operations
as solving a single system of N linear equations.
For a numerical illustration of the alternative procedure, observe that the
submatrix Q shown in Equation (2.58) for the modified absorbing unichain
constructed for the regular five-state Markov chain is identical to the sub-
matrix Q shown in Equation (3.20) for the absorbing multichain model of
patient flow in a hospital. Hence, both models have the same fundamental
matrix. The vector of MFPTs to the target state j was calculated in Equation
(3.22) by summing the entries in the rows of the fundamental matrix. These
values differ slightly from those obtained in Equation (2.62) by solving the
matrix equation for the vector of MFPTs to target state j. These values also
differ slightly from those obtained by state reduction in Equation (6.68).
Discrepancies are caused by roundoff error because only the first four signif-
icant decimal digits were stored.
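The two routes to the vector of MFPTs discussed above can be compared directly. The following Python sketch assumes NumPy and reuses the matrix Q of Equation (3.20); it first solves the linear system (I − Q)mj = e without forming an inverse, then repeats the calculation through the fundamental matrix. Variable names are illustrative.

```python
import numpy as np

# Matrix Q shared by Equations (2.58) and (3.20).
Q = np.array([[0.4, 0.1, 0.2, 0.3],
              [0.1, 0.2, 0.3, 0.2],
              [0.2, 0.1, 0.4, 0.2],
              [0.3, 0.4, 0.2, 0.1]])
A = np.eye(4) - Q
e = np.ones(4)

# Route 1: solve the linear system (I - Q) m = e directly.
m_solve = np.linalg.solve(A, e)

# Route 2: form the fundamental matrix first, then sum its rows (m = Ue).
U = np.linalg.inv(A)
m_rows = U @ e

print(np.round(m_solve, 4))
print(np.allclose(m_solve, m_rows))
```

Both routes give the same vector up to floating-point error; the first avoids computing the inverse when only the MFPTs are needed.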


3.3.2 Probability of First Passage in n Steps


This section is focused on computing probabilities of first passage in n time
periods, or steps, from a transient state to a target recurrent state in a reduc-
ible Markov chain. (The adjective “first” is redundant because a reducible
chain allows only one passage from a transient state to a recurrent state.) For
a reducible Markov chain, the first passage probabilities of interest involve
transitions in one direction only, from transient states to recurrent states or
to absorbing states. Recall that in a regular Markov chain all states are recur-
rent, and they all belong to the same closed communicating class of states.
In Equation (2.35), $f_{ij}^{(n)}$ represents the probability that the first passage time
from a recurrent state i to a target recurrent state j equals n steps, provided
that states i and j belong to the same recurrent closed class. The n-step first
passage probabilities in a recurrent chain can be computed recursively by
using the algebraic formula (2.35) or the matrix formula (2.38).

3.3.2.1 Reducible Unichain


Consider a reducible unichain with a set of transient states plus one closed
set of recurrent states. The canonical form of the transition probability matrix
is shown in Equation (3.1). As in Equation (2.35), let $f_{ij}^{(n)}$ be the probability of
first passage in n steps from a transient state i to a target recurrent state j.
That is, for n ≥ 1, $f_{ij}^{(n)}$ denotes the probability of moving from a transient state
i to a target recurrent state j for the first time on the nth step.
Assume that the chain starts in a transient state i. To obtain a formula for
recursively computing the n-step first passage probabilities from a transient
state i to a recurrent state j in a reducible unichain, an argument similar to
the one used in Section 2.2.1 for a regular Markov chain is followed. The
argument begins by observing that

$$
f_{ij}^{(1)} = p_{ij}^{(1)} = p_{ij}, \tag{3.25}
$$

because the probability of entering state j for the first time after one step is
simply the one-step transition probability. To enter recurrent state j for the
first time after two steps, the chain must enter any transient state k after one
step, and move from k to j after one additional step. Since each transition is
conditioned only on the present state, these two transitions are independent,
and

$$
f_{ij}^{(2)} = \sum_{k \in T} p_{ik} f_{kj}^{(1)} = \sum_{k \in T} p_{ik} f_{kj}^{(2-1)}, \tag{3.26}
$$

where the sum is over all transient states, k. To enter recurrent state j for the
first time after three steps, the chain must enter any transient state k after
one step, and move from k to j for the first time after two additional steps.
Therefore,

$$
f_{ij}^{(3)} = \sum_{k \in T} p_{ik} f_{kj}^{(2)} = \sum_{k \in T} p_{ik} f_{kj}^{(3-1)}, \tag{3.27}
$$

where, once again, the sum is over all transient states, k. By continuing in this
manner, the following algebraic equation is obtained for recursively computing
$f_{ij}^{(n)}$:

$$
f_{ij}^{(n)} = \sum_{k \in T} p_{ik} f_{kj}^{(n-1)}, \tag{3.28}
$$

where the sum is over all transient states, k. Note that in order to reach the
target recurrent state j for the first time on the nth step, the first step, with
probability pik, must be to any transient state k. This probability is multiplied
by the probability, $f_{kj}^{(n-1)}$, of moving from transient state k to the target
recurrent state j for the first time in n − 1 additional steps. These products are
summed over all transient states k.
The recursive equation (3.28) can also be expressed in matrix form. Let
$f_j^{(n)} = [f_{ij}^{(n)}]$ be the column vector of probabilities of first passage in n steps
from transient states to recurrent state j. Then, the algebraic equation (3.25)
has the matrix form

$$
f_j^{(1)} = D_j, \tag{3.29}
$$

where Dj is the vector of one-step transition probabilities from transient
states to recurrent state j, and $f_j^{(1)}$ is the vector of probabilities of first passage
in one step from transient states to recurrent state j. Similarly, the algebraic
equation (3.26) corresponds to the matrix equation

$$
f_j^{(2)} = Q f_j^{(1)} = Q D_j, \tag{3.30}
$$

where Q is the matrix of one-step transition probabilities from transient
states to transient states. The recursive algebraic equation (3.27) is equivalent
to the matrix equation

$$
f_j^{(3)} = Q f_j^{(2)} = Q(Q f_j^{(1)}) = Q^2 f_j^{(1)} = Q^2 D_j. \tag{3.31}
$$

Similarly, for n = 4, the matrix equation is

$$
f_j^{(4)} = Q f_j^{(3)} = Q(Q f_j^{(2)}) = Q(Q(Q f_j^{(1)})) = Q^3 f_j^{(1)} = Q^3 D_j. \tag{3.32}
$$

By continuing in this manner, for successively higher values of n, the matrix
equation for recursively computing $f_j^{(n)}$ is

$$
f_j^{(n)} = Q^{n-1} D_j. \tag{3.33}
$$
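The recursion behind Equations (3.29) through (3.33) translates directly into a loop. The following Python sketch assumes NumPy; the function name `first_passage_vectors` and the numerical values of Q and Dj are hypothetical and serve only to illustrate the calculation.

```python
import numpy as np

def first_passage_vectors(Q, d_j, n_max):
    """Return [f_j^(1), ..., f_j^(n_max)] using f^(1) = D_j and f^(n) = Q f^(n-1)."""
    Q = np.asarray(Q, dtype=float)
    f = np.asarray(d_j, dtype=float)
    vectors = [f]
    for _ in range(n_max - 1):
        f = Q @ f          # one more step spent among the transient states
        vectors.append(f)
    return vectors

# Hypothetical example with two transient states (values chosen for illustration only).
Q = np.array([[0.5, 0.2],
              [0.1, 0.4]])
d_j = np.array([0.3, 0.5])   # one-step probabilities into the target recurrent state j

fs = first_passage_vectors(Q, d_j, n_max=4)

# Compare the recursion with the closed form f^(n) = Q^(n-1) D_j of Equation (3.33).
closed = np.linalg.matrix_power(Q, 3) @ d_j
print(np.allclose(fs[3], closed))
```

The loop needs only one matrix-vector product per step, which is usually cheaper than forming powers of Q when the whole sequence of vectors is wanted.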


3.3.2.1.1 Probability of Absorption in n Steps in Absorbing Unichain Model of Machine Deterioration
As Section 3.1.1 indicates, the simplest kind of reducible unichain is an
absorbing unichain, which has one absorbing state plus transient states. In
an absorbing unichain, the probability of first passage in n steps from a transient
state i to the target absorbing state j, denoted by $f_{ij}^{(n)}$, is called a probability
of absorption in n steps because the chain never leaves the absorbing
state. The column vector of n-step probabilities of absorption in absorbing
state j is denoted by $f_j^{(n)} = [f_{ij}^{(n)}]$.
Consider the four-state absorbing unichain model of machine deterioration
introduced in Section 1.10.1.2.1.2. The canonical form of the transition
probability matrix is

$$
P = \begin{array}{lc|cccc}
\text{State} & & 1 & 2 & 3 & 4 \\ \hline
\text{Not Working} & 1 & 1 & 0 & 0 & 0 \\
\text{Working, with a Major Defect} & 2 & 0.6 & 0.4 & 0 & 0 \\
\text{Working, with a Minor Defect} & 3 & 0.2 & 0.3 & 0.5 & 0 \\
\text{Working Properly} & 4 & 0.3 & 0.2 & 0.1 & 0.4
\end{array}
= \begin{bmatrix} 1 & 0 \\ D_1 & Q \end{bmatrix}. \tag{1.54}
$$

In this model, state 1, Not Working, is the absorbing state, and states 2, 3, and
4 are transient. The vector of probabilities of absorption in 1 day from the
three transient states is

$$
f_1^{(1)} = \begin{bmatrix} p_{21} \\ p_{31} \\ p_{41} \end{bmatrix} = \begin{bmatrix} f_{21}^{(1)} \\ f_{31}^{(1)} \\ f_{41}^{(1)} \end{bmatrix} = D_1 = \begin{bmatrix} 0.6 \\ 0.2 \\ 0.3 \end{bmatrix}. \tag{3.34}
$$

The probability that a machine in state 3, which is working, with a minor
defect, will be absorbed in state 1 and stop working after 1 day is given by
$p_{31} = f_{31}^{(1)} = 0.2$. The vectors of probabilities of absorption in 2, 3, and 4 days,
respectively, are computed below, with rows indexed by transient states 2, 3, and 4:

$$
f_1^{(2)} = \begin{bmatrix} f_{21}^{(2)} \\ f_{31}^{(2)} \\ f_{41}^{(2)} \end{bmatrix} = Q f_1^{(1)} = Q D_1 = \begin{bmatrix} 0.4 & 0 & 0 \\ 0.3 & 0.5 & 0 \\ 0.2 & 0.1 & 0.4 \end{bmatrix} \begin{bmatrix} 0.6 \\ 0.2 \\ 0.3 \end{bmatrix} = \begin{bmatrix} 0.24 \\ 0.28 \\ 0.26 \end{bmatrix} \tag{3.35}
$$

$$
\begin{aligned}
f_1^{(3)} &= \begin{bmatrix} f_{21}^{(3)} \\ f_{31}^{(3)} \\ f_{41}^{(3)} \end{bmatrix} = Q^2 f_1^{(1)} = Q^2 D_1 = \begin{bmatrix} 0.4 & 0 & 0 \\ 0.3 & 0.5 & 0 \\ 0.2 & 0.1 & 0.4 \end{bmatrix} \begin{bmatrix} 0.4 & 0 & 0 \\ 0.3 & 0.5 & 0 \\ 0.2 & 0.1 & 0.4 \end{bmatrix} \begin{bmatrix} 0.6 \\ 0.2 \\ 0.3 \end{bmatrix} \\
&= \begin{bmatrix} 0.16 & 0 & 0 \\ 0.27 & 0.25 & 0 \\ 0.19 & 0.09 & 0.16 \end{bmatrix} \begin{bmatrix} 0.6 \\ 0.2 \\ 0.3 \end{bmatrix} = \begin{bmatrix} 0.096 \\ 0.212 \\ 0.180 \end{bmatrix}
\end{aligned} \tag{3.36}
$$

$$
\begin{aligned}
f_1^{(4)} &= \begin{bmatrix} f_{21}^{(4)} \\ f_{31}^{(4)} \\ f_{41}^{(4)} \end{bmatrix} = Q^3 f_1^{(1)} = Q^3 D_1 = \begin{bmatrix} 0.16 & 0 & 0 \\ 0.27 & 0.25 & 0 \\ 0.19 & 0.09 & 0.16 \end{bmatrix} \begin{bmatrix} 0.4 & 0 & 0 \\ 0.3 & 0.5 & 0 \\ 0.2 & 0.1 & 0.4 \end{bmatrix} \begin{bmatrix} 0.6 \\ 0.2 \\ 0.3 \end{bmatrix} \\
&= \begin{bmatrix} 0.064 & 0 & 0 \\ 0.183 & 0.125 & 0 \\ 0.135 & 0.061 & 0.064 \end{bmatrix} \begin{bmatrix} 0.6 \\ 0.2 \\ 0.3 \end{bmatrix} = \begin{bmatrix} 0.0384 \\ 0.1348 \\ 0.1124 \end{bmatrix}.
\end{aligned} \tag{3.37}
$$

The probability that a machine in state 3, which is working, with a minor
defect, will be absorbed in state 1 and stop working for the first time 4 days
later is given by $f_{31}^{(4)} = 0.1348$ in Equation (3.37).
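The absorption probabilities of Equations (3.34) through (3.37) can be reproduced numerically. The sketch below assumes NumPy and takes Q and D1 from Equation (1.54); the final lines add a check, not carried out in the text at this point, that the n-step probabilities accumulate to the probability of eventual absorption, which here equals one from every transient state because state 1 is the only recurrent state.

```python
import numpy as np

# Blocks of the canonical form in Equation (1.54); transient states are 2, 3, 4.
Q = np.array([[0.4, 0.0, 0.0],
              [0.3, 0.5, 0.0],
              [0.2, 0.1, 0.4]])
D1 = np.array([0.6, 0.2, 0.3])   # one-step probabilities of absorption in state 1

# n-step absorption probabilities f_1^(n) = Q^(n-1) D1 for n = 1, ..., 4.
for n in range(1, 5):
    f_n = np.linalg.matrix_power(Q, n - 1) @ D1
    print(n, np.round(f_n, 4))

# Summing f_1^(n) over all n gives (I - Q)^{-1} D1, the probability of eventual absorption.
eventual = np.linalg.solve(np.eye(3) - Q, D1)
print(np.round(eventual, 4))
```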
3.3.2.1.2 Probability of First Passage in n Steps in a Regular Markov Chain Revisited
In Sections 2.2.2.1 and 3.3.1, MFPTs for a regular Markov chain are computed
by changing a target state in a regular Markov chain into an absorbing state
to produce an absorbing unichain. In Section 2.2.1, probabilities of first pas-
sage in n steps are computed for a regular Markov chain. In this section,
probabilities of first passage in n steps for a regular Markov chain will be
computed by the alternative procedure of changing a target state in a regular
Markov chain into an absorbing state. When a regular chain is converted into
an absorbing unichain, the probability of first passage in n steps for the reg-
ular Markov chain is equal to the probability of absorption in n steps for the
associated absorbing unichain. To demonstrate this alternative procedure for
computing probabilities of first passage in n steps for a regular Markov chain,
consider the four-state regular Markov chain model of the weather described
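in Section 1.3. The transition probability matrix is given by

$$
P = \begin{bmatrix} 0.3 & 0.1 & 0.4 & 0.2 \\ 0.2 & 0.5 & 0.2 & 0.1 \\ 0.3 & 0.2 & 0.1 & 0.4 \\ 0 & 0.6 & 0.3 & 0.1 \end{bmatrix}, \tag{1.9}
$$

where the rows and columns are indexed by states 1, 2, 3, and 4.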


In Section 2.2.1, probabilities of first passage in n steps are computed for
this regular Markov chain model with state 1 (rain) chosen as the target
state. When state 1 is made an absorbing state, the transition probability
matrix of the modified absorbing unichain model is shown below in canonical
form:

$$
P = \begin{array}{lc|cccc}
\text{State} & & 1 & 2 & 3 & 4 \\ \hline
\text{Rain} & 1 & 1 & 0 & 0 & 0 \\
\text{Snow} & 2 & 0.2 & 0.5 & 0.2 & 0.1 \\
\text{Cloudy} & 3 & 0.3 & 0.2 & 0.1 & 0.4 \\
\text{Sunny} & 4 & 0 & 0.6 & 0.3 & 0.1
\end{array}
= \begin{bmatrix} 1 & 0 \\ D_1 & Q \end{bmatrix}. \tag{3.38}
$$

The vector of probabilities of absorption or first passage in 1 day from the
three transient states is

$$
f_1^{(1)} = \begin{bmatrix} p_{21} \\ p_{31} \\ p_{41} \end{bmatrix} = \begin{bmatrix} f_{21}^{(1)} \\ f_{31}^{(1)} \\ f_{41}^{(1)} \end{bmatrix} = D_1 = \begin{bmatrix} 0.2 \\ 0.3 \\ 0 \end{bmatrix}. \tag{3.39}
$$

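The vectors of n-step probabilities of absorption, or first passage probabilities,
for n = 2, 3, and 4, are calculated below, with rows indexed by transient
states 2, 3, and 4:

$$
f_1^{(2)} = \begin{bmatrix} f_{21}^{(2)} \\ f_{31}^{(2)} \\ f_{41}^{(2)} \end{bmatrix} = Q f_1^{(1)} = Q D_1 = \begin{bmatrix} 0.5 & 0.2 & 0.1 \\ 0.2 & 0.1 & 0.4 \\ 0.6 & 0.3 & 0.1 \end{bmatrix} \begin{bmatrix} 0.2 \\ 0.3 \\ 0 \end{bmatrix} = \begin{bmatrix} 0.16 \\ 0.07 \\ 0.21 \end{bmatrix} \tag{3.40}
$$

$$
\begin{aligned}
f_1^{(3)} &= \begin{bmatrix} f_{21}^{(3)} \\ f_{31}^{(3)} \\ f_{41}^{(3)} \end{bmatrix} = Q^2 f_1^{(1)} = Q^2 D_1 = \begin{bmatrix} 0.5 & 0.2 & 0.1 \\ 0.2 & 0.1 & 0.4 \\ 0.6 & 0.3 & 0.1 \end{bmatrix} \begin{bmatrix} 0.5 & 0.2 & 0.1 \\ 0.2 & 0.1 & 0.4 \\ 0.6 & 0.3 & 0.1 \end{bmatrix} \begin{bmatrix} 0.2 \\ 0.3 \\ 0 \end{bmatrix} \\
&= \begin{bmatrix} 0.35 & 0.15 & 0.14 \\ 0.36 & 0.17 & 0.10 \\ 0.42 & 0.18 & 0.19 \end{bmatrix} \begin{bmatrix} 0.2 \\ 0.3 \\ 0 \end{bmatrix} = \begin{bmatrix} 0.115 \\ 0.123 \\ 0.138 \end{bmatrix}
\end{aligned} \tag{3.41}
$$

$$
\begin{aligned}
f_1^{(4)} &= \begin{bmatrix} f_{21}^{(4)} \\ f_{31}^{(4)} \\ f_{41}^{(4)} \end{bmatrix} = Q^3 f_1^{(1)} = Q^3 D_1 = \begin{bmatrix} 0.35 & 0.15 & 0.14 \\ 0.36 & 0.17 & 0.10 \\ 0.42 & 0.18 & 0.19 \end{bmatrix} \begin{bmatrix} 0.5 & 0.2 & 0.1 \\ 0.2 & 0.1 & 0.4 \\ 0.6 & 0.3 & 0.1 \end{bmatrix} \begin{bmatrix} 0.2 \\ 0.3 \\ 0 \end{bmatrix} \\
&= \begin{bmatrix} 0.289 & 0.127 & 0.109 \\ 0.274 & 0.119 & 0.114 \\ 0.360 & 0.159 & 0.133 \end{bmatrix} \begin{bmatrix} 0.2 \\ 0.3 \\ 0 \end{bmatrix} = \begin{bmatrix} 0.0959 \\ 0.0905 \\ 0.1197 \end{bmatrix}.
\end{aligned} \tag{3.42}
$$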

The probability that a cloudy day, state 3, will be absorbed in state 1 and
change to a rainy day for the first time 4 days later is given by $f_{31}^{(4)} = 0.0905$ in
Equation (3.42). The same results are obtained in Equations (2.46) and (2.47).
Observe that the matrix Q in the treatment of the problem as an absorbing
chain in this section is a submatrix of the matrix Z in the treatment as a regular
chain in Section 2.2.1. The matrix Q is obtained by deleting the first row
and first column of Z. As a result, the first entry, $f_{jj}^{(n)}$, of vector $f_j^{(n)}$ in the
treatment as a regular chain is omitted from the vector $f_j^{(n)}$ in the treatment as
an absorbing chain. Hence, the treatment as a regular chain in Section 2.2.1
yields additional information, namely, $f_{jj}^{(n)}$, the n-step first passage probability
or n-step recurrence probability for the target state j in the original,
regular Markov chain.
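The conversion from a regular chain to an absorbing unichain amounts to slicing the transition matrix. The Python sketch below assumes NumPy; the helper name `absorbing_blocks` is hypothetical and is introduced here only for illustration. It extracts Q and D1 from the weather matrix of Equation (1.9) with state 1 as the target and then evaluates Q^(n−1) D1 for n = 1 to 4.

```python
import numpy as np

def absorbing_blocks(P, target):
    """Split a regular chain's matrix P into (Q, d) for a chosen target state.

    Q keeps transitions among the non-target states (row and column `target` deleted);
    d holds the one-step probabilities from those states into the target.
    `target` is a zero-based index.
    """
    P = np.asarray(P, dtype=float)
    keep = [s for s in range(P.shape[0]) if s != target]
    return P[np.ix_(keep, keep)], P[keep, target]

# Weather chain of Equation (1.9); index 0 corresponds to state 1 (rain).
P = np.array([[0.3, 0.1, 0.4, 0.2],
              [0.2, 0.5, 0.2, 0.1],
              [0.3, 0.2, 0.1, 0.4],
              [0.0, 0.6, 0.3, 0.1]])

Q, d1 = absorbing_blocks(P, target=0)

# First passage probabilities to rain in exactly n days, from states 2, 3, 4.
for n in range(1, 5):
    print(n, np.round(np.linalg.matrix_power(Q, n - 1) @ d1, 4))
```

Choosing a different `target` index repeats the calculation for another target state without rewriting the matrix by hand.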
3.3.2.1.3 Unichain with Nonabsorbing Recurrent States
A reducible unichain may consist of a closed class of r recurrent states, none
of which are absorbing states, plus one or more transient states. The recursive
equation (3.33) for computing the vector of probabilities of first passage
in n steps can be extended such that it has the form

$$
F^{(n)} = [f_1^{(n)} \;\; f_2^{(n)} \;\; \cdots \;\; f_r^{(n)}] = Q^{n-1} D = Q^{n-1} [d_1 \;\; d_2 \;\; \cdots \;\; d_r] = [Q^{n-1} d_1 \;\; Q^{n-1} d_2 \;\; \cdots \;\; Q^{n-1} d_r], \tag{3.43}
$$

where $F^{(n)}$ is the matrix of n-step first passage probabilities, vector $f_j^{(n)}$ is the
jth column of matrix $F^{(n)}$, and vector dj is the jth column of matrix D.
The probability of entering a recurrent closed set for the first time in n steps
is the sum of the n-step first passage probabilities to the states that belong
to the recurrent closed set. If the recurrent closed set of states is denoted by
R, and $f_{iR}^{(n)}$ denotes the n-step first passage probability from transient state i
to the recurrent set R, then

$$
f_{iR}^{(n)} = \sum_{j \in R} f_{ij}^{(n)}, \tag{3.44}
$$

where the sum is over all recurrent states j that belong to R.


Consider the generic four-state reducible unichain for which the transition
matrix is shown below:

$$
P = [p_{ij}] = \begin{bmatrix} p_{11} & p_{12} & 0 & 0 \\ p_{21} & p_{22} & 0 & 0 \\ p_{31} & p_{32} & p_{33} & p_{34} \\ p_{41} & p_{42} & p_{43} & p_{44} \end{bmatrix} = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}. \tag{3.2}
$$

The matrix of one-step first passage probabilities is

$$
F^{(1)} = Q^{1-1} D = Q^0 D = D = \begin{bmatrix} p_{31} & p_{32} \\ p_{41} & p_{42} \end{bmatrix} = \begin{bmatrix} f_{31}^{(1)} & f_{32}^{(1)} \\ f_{41}^{(1)} & f_{42}^{(1)} \end{bmatrix} = [f_1^{(1)} \;\; f_2^{(1)}]. \tag{3.45}
$$

The probability of first passage in one step from transient state 4 to recurrent
state 2 is

$$
f_{42}^{(1)} = p_{42}. \tag{3.46}
$$

Using Equation (3.44), the probability of first passage in one step from transient
state 4 to the recurrent set of states R = {1, 2} is

$$
f_{4R}^{(1)} = f_{41}^{(1)} + f_{42}^{(1)} = p_{41} + p_{42}. \tag{3.47}
$$

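The matrix of two-step first passage probabilities is

$$
\begin{aligned}
F^{(2)} &= Q^{2-1} D = Q^1 D = QD = \begin{bmatrix} p_{33} & p_{34} \\ p_{43} & p_{44} \end{bmatrix} \begin{bmatrix} p_{31} & p_{32} \\ p_{41} & p_{42} \end{bmatrix} \\
&= \begin{bmatrix} p_{33} p_{31} + p_{34} p_{41} & p_{33} p_{32} + p_{34} p_{42} \\ p_{43} p_{31} + p_{44} p_{41} & p_{43} p_{32} + p_{44} p_{42} \end{bmatrix} = \begin{bmatrix} f_{31}^{(2)} & f_{32}^{(2)} \\ f_{41}^{(2)} & f_{42}^{(2)} \end{bmatrix} = [f_1^{(2)} \;\; f_2^{(2)}].
\end{aligned} \tag{3.48}
$$

The probability of first passage in two steps from transient state 4 to recurrent
state 2 is

$$
f_{42}^{(2)} = p_{43} p_{32} + p_{44} p_{42}.
$$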

The matrix of three-step first passage probabilities is

$$
\begin{aligned}
F^{(3)} &= Q^{3-1} D = Q^2 D = \begin{bmatrix} p_{33} & p_{34} \\ p_{43} & p_{44} \end{bmatrix}^2 \begin{bmatrix} p_{31} & p_{32} \\ p_{41} & p_{42} \end{bmatrix} \\
&= \begin{bmatrix} p_{33} p_{33} + p_{34} p_{43} & p_{33} p_{34} + p_{34} p_{44} \\ p_{43} p_{33} + p_{44} p_{43} & p_{43} p_{34} + p_{44} p_{44} \end{bmatrix} \begin{bmatrix} p_{31} & p_{32} \\ p_{41} & p_{42} \end{bmatrix} \\
&= \begin{bmatrix} (p_{33} p_{33} + p_{34} p_{43}) p_{31} + (p_{33} p_{34} + p_{34} p_{44}) p_{41} & (p_{33} p_{33} + p_{34} p_{43}) p_{32} + (p_{33} p_{34} + p_{34} p_{44}) p_{42} \\ (p_{43} p_{33} + p_{44} p_{43}) p_{31} + (p_{43} p_{34} + p_{44} p_{44}) p_{41} & (p_{43} p_{33} + p_{44} p_{43}) p_{32} + (p_{43} p_{34} + p_{44} p_{44}) p_{42} \end{bmatrix} \\
&= \begin{bmatrix} f_{31}^{(3)} & f_{32}^{(3)} \\ f_{41}^{(3)} & f_{42}^{(3)} \end{bmatrix} = [f_1^{(3)} \;\; f_2^{(3)}].
\end{aligned}
$$

The probability of first passage in three steps from transient state 4 to recurrent
state 2 is

$$
f_{42}^{(3)} = (p_{43} p_{33} + p_{44} p_{43}) p_{32} + (p_{43} p_{34} + p_{44} p_{44}) p_{42}.
$$

Using Equation (3.44), the probability of first passage in three steps from
transient state 4 to the recurrent set of states R = {1, 2} is

$$
f_{4R}^{(3)} = f_{41}^{(3)} + f_{42}^{(3)}. \tag{3.49}
$$

3.3.2.1.4 Probability of First Passage in n Steps in a Unichain Model of Machine Maintenance Under a Modified Policy of Doing Nothing in State 3
The four-state Markov chain model of machine maintenance introduced
in Section 1.10.1.2.2.1 provides a numerical example of a reducible unichain
with recurrent states. The transition probability matrix associated with the
modified maintenance policy, under which the engineer always overhauls
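the machine in state 1 and does nothing in all other states, is given below in
canonical form.

$$
P = \begin{array}{c|cccc}
\text{State} & 1 & 2 & 3 & 4 \\ \hline
1 & 0.2 & 0.8 & 0 & 0 \\
2 & 0.6 & 0.4 & 0 & 0 \\
3 & 0.2 & 0.3 & 0.5 & 0 \\
4 & 0.3 & 0.2 & 0.1 & 0.4
\end{array}
= \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}. \tag{1.56}
$$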

The matrix of one-step first passage probabilities is

$$
F^{(1)} = Q^{1-1} D = Q^0 D = D = \begin{bmatrix} 0.2 & 0.3 \\ 0.3 & 0.2 \end{bmatrix} = \begin{bmatrix} f_{31}^{(1)} & f_{32}^{(1)} \\ f_{41}^{(1)} & f_{42}^{(1)} \end{bmatrix} = [f_1^{(1)} \;\; f_2^{(1)}]. \tag{3.50}
$$


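The matrix of two-step first passage probabilities is

$$
F^{(2)} = QD = \begin{bmatrix} 0.5 & 0 \\ 0.1 & 0.4 \end{bmatrix} \begin{bmatrix} 0.2 & 0.3 \\ 0.3 & 0.2 \end{bmatrix} = \begin{bmatrix} 0.10 & 0.15 \\ 0.14 & 0.11 \end{bmatrix} = \begin{bmatrix} f_{31}^{(2)} & f_{32}^{(2)} \\ f_{41}^{(2)} & f_{42}^{(2)} \end{bmatrix} = [f_1^{(2)} \;\; f_2^{(2)}]. \tag{3.51}
$$

The matrix of three-step first passage probabilities is

$$
\begin{aligned}
F^{(3)} &= Q^2 D = \begin{bmatrix} 0.5 & 0 \\ 0.1 & 0.4 \end{bmatrix} \begin{bmatrix} 0.5 & 0 \\ 0.1 & 0.4 \end{bmatrix} \begin{bmatrix} 0.2 & 0.3 \\ 0.3 & 0.2 \end{bmatrix} = \begin{bmatrix} 0.25 & 0 \\ 0.09 & 0.16 \end{bmatrix} \begin{bmatrix} 0.2 & 0.3 \\ 0.3 & 0.2 \end{bmatrix} = \begin{bmatrix} 0.05 & 0.075 \\ 0.066 & 0.059 \end{bmatrix} \\
&= \begin{bmatrix} f_{31}^{(3)} & f_{32}^{(3)} \\ f_{41}^{(3)} & f_{42}^{(3)} \end{bmatrix} = [f_1^{(3)} \;\; f_2^{(3)}].
\end{aligned} \tag{3.52}
$$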

Under the modified maintenance policy, the probability of first passage in
3 days from transient state 4, working properly, to recurrent state 2, working,
with a major defect, is $f_{42}^{(3)} = 0.059$. Similarly, under this maintenance policy,
the probability of first passage in 3 days from transient state 4, working properly,
to the recurrent set of states, R = {1, 2}, which denotes either not working
or working, with a major defect, is

$$
f_{4R}^{(3)} = f_{41}^{(3)} + f_{42}^{(3)} = 0.066 + 0.059 = 0.125. \tag{3.53}
$$
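Equations (3.43) and (3.44) can be checked against the numbers of this example with a short Python sketch. NumPy is assumed, Q and D are the blocks of Equation (1.56), and the function name `first_passage_matrix` is a hypothetical helper introduced only for illustration.

```python
import numpy as np

# Blocks of Equation (1.56): transient states 3 and 4, recurrent states 1 and 2.
Q = np.array([[0.5, 0.0],
              [0.1, 0.4]])
D = np.array([[0.2, 0.3],
              [0.3, 0.2]])

def first_passage_matrix(Q, D, n):
    """F^(n) = Q^(n-1) D, Equation (3.43); column j holds f_j^(n)."""
    return np.linalg.matrix_power(Q, n - 1) @ D

F3 = first_passage_matrix(Q, D, 3)

# Equation (3.44): summing across the columns gives first passage into the set R = {1, 2}.
f_R3 = F3.sum(axis=1)

print(np.round(F3, 4))
print(np.round(f_R3, 4))   # the second entry corresponds to starting state 4
```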

3.3.2.2 Reducible Multichain


Consider a reducible multichain with a set of transient states plus two or more
closed sets of recurrent states. As Section 3.1.2 indicates, the canonical form of
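the transition probability matrix for a multichain with two recurrent chains is

$$
P = \begin{bmatrix} P_1 & 0 & 0 \\ 0 & P_2 & 0 \\ D_1 & D_2 & Q \end{bmatrix} = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}, \tag{3.4}
$$

where

$$
S = \begin{bmatrix} P_1 & 0 \\ 0 & P_2 \end{bmatrix}, \quad 0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad \text{and} \quad D = [D_1 \;\; D_2].
$$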

The recursive equation (3.43) for computing the matrix of probabilities of
first passage in n steps, extended to apply to a multichain with r recurrent
chains, is

$$
F^{(n)} = Q^{n-1} D = Q^{n-1} [D_1 \;\; \cdots \;\; D_r] = [Q^{n-1} D_1 \;\; \cdots \;\; Q^{n-1} D_r] = [F_1^{(n)} \;\; \cdots \;\; F_r^{(n)}]. \tag{3.54}
$$

Consider the canonical form of the following transition probability matrix
for a generic five-state reducible multichain displayed in Section 3.1.2. The
multichain has two recurrent closed sets, one of which is an absorbing state,
plus two transient states.

$$
P = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & p_{22} & p_{23} & 0 & 0 \\ 0 & p_{32} & p_{33} & 0 & 0 \\ p_{41} & p_{42} & p_{43} & p_{44} & p_{45} \\ p_{51} & p_{52} & p_{53} & p_{54} & p_{55} \end{bmatrix} = \begin{bmatrix} P_1 & 0 & 0 \\ 0 & P_2 & 0 \\ D_1 & D_2 & Q \end{bmatrix}. \tag{3.5}
$$

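Suppose that two-step first passage probabilities to $R_2 = \{2, 3\}$ are of interest.
The matrix of two-step first passage probabilities is

$$
\begin{aligned}
F_2^{(2)} &= Q^{2-1} D_2 = Q^1 D_2 = Q D_2 = \begin{bmatrix} p_{44} & p_{45} \\ p_{54} & p_{55} \end{bmatrix} \begin{bmatrix} p_{42} & p_{43} \\ p_{52} & p_{53} \end{bmatrix} \\
&= \begin{bmatrix} p_{44} p_{42} + p_{45} p_{52} & p_{44} p_{43} + p_{45} p_{53} \\ p_{54} p_{42} + p_{55} p_{52} & p_{54} p_{43} + p_{55} p_{53} \end{bmatrix} = \begin{bmatrix} f_{42}^{(2)} & f_{43}^{(2)} \\ f_{52}^{(2)} & f_{53}^{(2)} \end{bmatrix} = [f_2^{(2)} \;\; f_3^{(2)}].
\end{aligned} \tag{3.55}
$$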

Observe that the probability of first passage in two steps from transient state
4 to recurrent state 2 is

$$
f_{42}^{(2)} = p_{44} p_{42} + p_{45} p_{52}. \tag{3.56}
$$

Using Equation (3.44), the probability of first passage in two steps from state
4 to $R_2$ is

$$
f_{4,R_2}^{(2)} = f_{42}^{(2)} + f_{43}^{(2)}. \tag{3.57}
$$

If the target recurrent state j is an absorbing state, then the probability of
absorption in two steps from transient state 4 to absorbing state 1 is

$$
f_{41}^{(2)} = p_{44} p_{41} + p_{45} p_{51} = f_{4,R_1}^{(2)}. \tag{3.58}
$$
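To make the block structure of Equation (3.54) concrete, the sketch below assigns hypothetical numerical values to rows 4 and 5 of the generic matrix in Equation (3.5). These numbers are assumptions chosen only so that each row sums to one; they do not appear in the text. NumPy is assumed.

```python
import numpy as np

# Hypothetical values for rows 4 and 5 of Equation (3.5) (illustration only).
D1 = np.array([[0.1],
               [0.2]])              # into absorbing state 1
D2 = np.array([[0.2, 0.1],
               [0.1, 0.3]])         # into recurrent states 2 and 3
Q = np.array([[0.4, 0.2],
              [0.1, 0.3]])          # among transient states 4 and 5

D = np.hstack([D1, D2])             # D = [D1 D2]

# Equation (3.54) with n = 2: F^(2) = Q D = [Q D1, Q D2].
F2 = Q @ D
F2_R1, F2_R2 = F2[:, :1], F2[:, 1:]

# First passage from state 4 (row 0) into the closed set R2 = {2, 3}, Equation (3.57).
f_4_R2 = F2_R2[0].sum()

print(np.round(F2_R1, 4))
print(np.round(F2_R2, 4))
print(round(float(f_4_R2), 4))
```

Multiplying Q by the full matrix D and slicing afterwards gives the same blocks as multiplying Q by D1 and D2 separately, which is the content of Equation (3.54).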


3.3.2.2.1 Probability of Absorption in n Steps in an Absorbing Multichain Model of Hospital Patient Flow
Consider the absorbing multichain model of patient flow in a hospital intro-
duced in Section 1.10.2.1.2, and for which the fundamental matrix is obtained
in Section 3.2.4. The transition probability matrix in canonical form is shown
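in Equation (1.59).

\[
P = \begin{array}{c|cccccc}
\text{State} & 0 & 5 & 1 & 2 & 3 & 4 \\ \hline
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
5 & 0 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0.4 & 0.1 & 0.2 & 0.3 \\
2 & 0.15 & 0.05 & 0.1 & 0.2 & 0.3 & 0.2 \\
3 & 0.07 & 0.03 & 0.2 & 0.1 & 0.4 & 0.2 \\
4 & 0 & 0 & 0.3 & 0.4 & 0.2 & 0.1
\end{array}
= \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}
= \begin{bmatrix} I & 0 \\ D & Q \end{bmatrix}, \tag{1.59a}
\]

where

\[
S = I = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad
D = \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \end{array}
\begin{bmatrix} 0 & 0 \\ 0.15 & 0.05 \\ 0.07 & 0.03 \\ 0 & 0 \end{bmatrix}, \quad
Q = \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \end{array}
\begin{bmatrix}
0.4 & 0.1 & 0.2 & 0.3 \\
0.1 & 0.2 & 0.3 & 0.2 \\
0.2 & 0.1 & 0.4 & 0.2 \\
0.3 & 0.4 & 0.2 & 0.1
\end{bmatrix}. \tag{1.59b}
\]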

The matrix of one-step absorption probabilities is

\[
F^{(1)} = D = \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \end{array}
\begin{bmatrix} 0 & 0 \\ 0.15 & 0.05 \\ 0.07 & 0.03 \\ 0 & 0 \end{bmatrix}
= \begin{bmatrix}
f_{10}^{(1)} & f_{15}^{(1)} \\
f_{20}^{(1)} & f_{25}^{(1)} \\
f_{30}^{(1)} & f_{35}^{(1)} \\
f_{40}^{(1)} & f_{45}^{(1)}
\end{bmatrix}. \tag{3.59}
\]

After 1 day, a patient in state 3, surgery, will be in state 0, discharged, with
probability \(f_{30}^{(1)} = 0.07\), or in state 5, dead, with probability
\(f_{35}^{(1)} = 0.03\).
The matrix of two-step absorption probabilities is

\[
F^{(2)} = QD =
\begin{bmatrix}
0.4 & 0.1 & 0.2 & 0.3 \\
0.1 & 0.2 & 0.3 & 0.2 \\
0.2 & 0.1 & 0.4 & 0.2 \\
0.3 & 0.4 & 0.2 & 0.1
\end{bmatrix}
\begin{bmatrix} 0 & 0 \\ 0.15 & 0.05 \\ 0.07 & 0.03 \\ 0 & 0 \end{bmatrix}
= \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \end{array}
\begin{bmatrix}
0.029 & 0.011 \\ 0.051 & 0.019 \\ 0.043 & 0.017 \\ 0.074 & 0.026
\end{bmatrix}
= \begin{bmatrix}
f_{10}^{(2)} & f_{15}^{(2)} \\
f_{20}^{(2)} & f_{25}^{(2)} \\
f_{30}^{(2)} & f_{35}^{(2)} \\
f_{40}^{(2)} & f_{45}^{(2)}
\end{bmatrix}. \tag{3.60}
\]

A patient who starts in state 3, surgery, will enter state 0, discharged, for
the first time on day 2 with probability \(f_{30}^{(2)} = 0.043\), or enter
state 5, dead, for the first time on day 2 with probability
\(f_{35}^{(2)} = 0.017\).
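The two matrices above can be reproduced numerically. The following is a minimal sketch, assuming Python with NumPy (a tool choice made for this illustration, not something the text prescribes); the helper name `n_step_absorption` is hypothetical.

```python
import numpy as np

# Blocks of Equation (1.59b): rows are transient states 1-4,
# columns of D are absorbing states 0 (discharged) and 5 (dead).
Q = np.array([[0.4, 0.1, 0.2, 0.3],
              [0.1, 0.2, 0.3, 0.2],
              [0.2, 0.1, 0.4, 0.2],
              [0.3, 0.4, 0.2, 0.1]])
D = np.array([[0.00, 0.00],
              [0.15, 0.05],
              [0.07, 0.03],
              [0.00, 0.00]])

def n_step_absorption(Q, D, n):
    """Return F^(n) = Q^(n-1) D (hypothetical helper name)."""
    return np.linalg.matrix_power(Q, n - 1) @ D

F1 = n_step_absorption(Q, D, 1)   # equals D, Equation (3.59)
F2 = n_step_absorption(Q, D, 2)   # equals QD, Equation (3.60)
print(np.round(F2, 3))
```

Calling the helper with larger n gives the probability of absorption on exactly day n; summing the results over n approaches the eventual absorption probabilities.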

3.3.2.2.2 Probability of First Passage in n Steps in a Multichain Model of a Production Process
Consider the multichain model of a production process introduced in
Section 1.10.2.2. The transition probability matrix in canonical form is shown
in Equation (3.7) of Section 3.1.3. Recall that production stage i is represented
by transient state (9 − i).
The matrix \(F^{(n)}\) of n-step first passage probabilities has the form

\[
F^{(n)} = [f_1^{(n)} \;\; f_2^{(n)} \;\; f_3^{(n)} \;\; f_4^{(n)} \;\; f_5^{(n)}], \tag{3.61}
\]

where vector \(f_j^{(n)}\) is the jth column of matrix \(F^{(n)}\).
The matrix of one-step absorption probabilities is

\[
F_A^{(1)} = [f_1^{(1)} \;\; f_2^{(1)}] = [D_1 \;\; D_2]
= \begin{array}{c} 6 \\ 7 \\ 8 \end{array}
\begin{bmatrix} 0.20 & 0.16 \\ 0.15 & 0 \\ 0.10 & 0 \end{bmatrix}
= \begin{bmatrix}
f_{61}^{(1)} & f_{62}^{(1)} \\
f_{71}^{(1)} & f_{72}^{(1)} \\
f_{81}^{(1)} & f_{82}^{(1)}
\end{bmatrix}. \tag{3.62}
\]

After one step, an item in state 6, production stage 3, will be in state 1, scrapped,
with probability \(f_{61}^{(1)} = 0.20\), or in state 2, sold, with probability
\(f_{62}^{(1)} = 0.16\).
The matrix of one-step first passage probabilities is

\[
F_R^{(1)} = [f_3^{(1)} \;\; f_4^{(1)} \;\; f_5^{(1)}] = D_3
= \begin{array}{c} 6 \\ 7 \\ 8 \end{array}
\begin{bmatrix} 0.04 & 0.03 & 0.02 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}
= \begin{bmatrix}
f_{63}^{(1)} & f_{64}^{(1)} & f_{65}^{(1)} \\
f_{73}^{(1)} & f_{74}^{(1)} & f_{75}^{(1)} \\
f_{83}^{(1)} & f_{84}^{(1)} & f_{85}^{(1)}
\end{bmatrix}. \tag{3.63}
\]

After one step, an item in state 6, production stage 3, will be in state 3, used
to train engineers, with probability \(f_{63}^{(1)} = 0.04\), or in state 4, used
to train technicians, with probability \(f_{64}^{(1)} = 0.03\), or in state 5,
used to train technical writers, with probability \(f_{65}^{(1)} = 0.02\).
Therefore, after one step, an item in state 6, production stage 3, will be sent
to the training center with probability
\(f_{63}^{(1)} + f_{64}^{(1)} + f_{65}^{(1)} = 0.04 + 0.03 + 0.02 = 0.09\),
where it will remain.
The matrix of two-step absorption probabilities is

\begin{align*}
F_A^{(2)} &= [f_1^{(2)} \;\; f_2^{(2)}] = Q[D_1 \;\; D_2] \\
&= \begin{bmatrix} 0.55 & 0 & 0 \\ 0.20 & 0.65 & 0 \\ 0 & 0.15 & 0.75 \end{bmatrix}
   \begin{bmatrix} 0.20 & 0.16 \\ 0.15 & 0 \\ 0.10 & 0 \end{bmatrix}
 = \begin{array}{c} 6 \\ 7 \\ 8 \end{array}
   \begin{bmatrix} 0.11 & 0.088 \\ 0.1375 & 0.032 \\ 0.0975 & 0 \end{bmatrix}
 = \begin{bmatrix}
   f_{61}^{(2)} & f_{62}^{(2)} \\
   f_{71}^{(2)} & f_{72}^{(2)} \\
   f_{81}^{(2)} & f_{82}^{(2)}
   \end{bmatrix}. \tag{3.64}
\end{align*}

After two steps, an item in state 6, production stage 3, will, for the first
time, be in state 1, scrapped, with probability \(f_{61}^{(2)} = 0.11\), or in
state 2, sold, with probability \(f_{62}^{(2)} = 0.088\).
The matrix of two-step first passage probabilities is

\begin{align*}
F_R^{(2)} &= [f_3^{(2)} \;\; f_4^{(2)} \;\; f_5^{(2)}] = Q D_3 \\
&= \begin{bmatrix} 0.55 & 0 & 0 \\ 0.20 & 0.65 & 0 \\ 0 & 0.15 & 0.75 \end{bmatrix}
   \begin{bmatrix} 0.04 & 0.03 & 0.02 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}
 = \begin{array}{c} 6 \\ 7 \\ 8 \end{array}
   \begin{bmatrix} 0.022 & 0.0165 & 0.011 \\ 0.008 & 0.006 & 0.004 \\ 0 & 0 & 0 \end{bmatrix} \\
&= \begin{bmatrix}
   f_{63}^{(2)} & f_{64}^{(2)} & f_{65}^{(2)} \\
   f_{73}^{(2)} & f_{74}^{(2)} & f_{75}^{(2)} \\
   f_{83}^{(2)} & f_{84}^{(2)} & f_{85}^{(2)}
   \end{bmatrix}. \tag{3.65}
\end{align*}

After two steps, an item in state 6, production stage 3, will, for the first time,
be in state 3, used to train engineers, with probability \(f_{63}^{(2)} = 0.022\),
or in state 4, used to train technicians, with probability
\(f_{64}^{(2)} = 0.0165\), or in state 5, used to train technical writers, with
probability \(f_{65}^{(2)} = 0.011\). Therefore, after two steps, an item in
state 6, production stage 3, will, for the first time, be sent to the training
center with probability

\[
f_{63}^{(2)} + f_{64}^{(2)} + f_{65}^{(2)} = 0.022 + 0.0165 + 0.011 = 0.0495. \tag{3.66}
\]
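The two-step results for the production process can be checked in a single product. This is a minimal sketch, assuming Python with NumPy (an assumption of this illustration); it stacks the one-step blocks for recurrent states 1 through 5 side by side and then sums the training-center columns.

```python
import numpy as np

# Transient states 6, 7, 8 (production stages 3, 2, 1).
Q = np.array([[0.55, 0.00, 0.00],
              [0.20, 0.65, 0.00],
              [0.00, 0.15, 0.75]])
# Columns: states 1 (scrapped), 2 (sold), 3, 4, 5 (training center).
D = np.array([[0.20, 0.16, 0.04, 0.03, 0.02],
              [0.15, 0.00, 0.00, 0.00, 0.00],
              [0.10, 0.00, 0.00, 0.00, 0.00]])

F2 = Q @ D                                  # F^(2) = Q D
to_training_step2 = F2[:, 2:].sum(axis=1)   # f_i3 + f_i4 + f_i5 at n = 2

print(np.round(F2, 4))
print(np.round(to_training_step2, 4))       # first entry is for state 6
```

The first two columns of `F2` correspond to Equation (3.64), the last three to Equation (3.65), and the first entry of `to_training_step2` to Equation (3.66).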

3.3.3 Probability of Eventual Passage to a Recurrent State


In this section attention is focused on the probability of eventual passage
(without counting the number of steps) from a transient state to a target
recurrent state in a reducible unichain or multichain. Suppose that the tran-
sition probability matrix is represented in the aggregated canonical form of
Equation (3.1).

\[
P = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}. \tag{3.1}
\]

S may have been formed by aggregating the transition probability matrices
for several recurrent closed classes of states belonging to a reducible
multichain. Let fij denote the probability of eventual passage from a transient
state i to a target recurrent state j. If the target state j is an absorbing
state, then fij is called an absorption probability. Let F = [fij] denote the
matrix of probabilities of eventual passage from transient states to recurrent
states. Since the number of steps required for eventual passage to a target
recurrent state j is a random variable,

\[
f_{ij} = f_{ij}^{(1)} + f_{ij}^{(2)} + f_{ij}^{(3)} + \cdots
       = \sum_{n=1}^{\infty} f_{ij}^{(n)}. \tag{3.67}
\]

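In matrix form,

\begin{align*}
F &= \sum_{n=1}^{\infty} F^{(n)} = F^{(1)} + F^{(2)} + F^{(3)} + \cdots
   = D + F^{(2)} + F^{(3)} + \cdots \\
  &= D + \sum_{n=2}^{\infty} F^{(n)} = ID + \sum_{n=2}^{\infty} F^{(n)} \\
  &= ID + \sum_{n=2}^{\infty} Q^{n-1} D = ID + (Q + Q^2 + Q^3 + \cdots) D \\
  &= (I + Q + Q^2 + Q^3 + \cdots) D. \tag{3.68}
\end{align*}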

The entries of \(Q^n\) give the probabilities of being in each of the transient
states after n steps for each possible transient starting state. After zero
steps the chain is in the transient state in which it started, so that
\(Q^0 = I\). As the chain is certain to eventually enter a closed set of
recurrent states within which it will remain forever, the chain will eventually
never return to a transient state. Therefore, the probability of being in the
transient states after n steps approaches zero. In other words, if i and j are
both transient states, then \(\lim_{n \to \infty} p_{ij}^{(n)} = 0\),
irrespective of the transient starting state i. It follows that every entry of
\(Q^n\) must approach zero as n→∞. That is,

\[
\lim_{n \to \infty} Q^n = 0, \tag{3.69}
\]

the null matrix.


Let the matrix Y represent the following sum.

\[
Y = I + Q + Q^2 + Q^3 + \cdots + Q^{n-1}. \tag{3.70}
\]

Premultiplying both sides of Equation (3.70) by Q,

\[
QY = Q + Q^2 + Q^3 + Q^4 + \cdots + Q^{n-1} + Q^n. \tag{3.71}
\]

Subtracting Equation (3.71) from Equation (3.70),

\[
Y - QY = IY - QY = (I - Q)Y = I - Q^n. \tag{3.72}
\]

Now let n→∞. Since \(\lim_{n \to \infty} Q^n = 0\),

\begin{align*}
(I - Q)Y &= I \\
Y &= (I - Q)^{-1} = I + Q + Q^2 + Q^3 + \cdots, \tag{3.73}
\end{align*}

where the sum of the infinite series, \(Y = (I - Q)^{-1}\), is defined in
Equation (3.10) as the fundamental matrix of a reducible Markov chain, denoted
by U. Therefore, the fundamental matrix can be expressed as the sum of an
infinite series of substochastic matrices, \(Q^n\), as shown in Equation (3.74).

\[
U = Y = (I - Q)^{-1} = I + Q + Q^2 + Q^3 + \cdots. \tag{3.74}
\]

This result is analogous to the formula for the sum of an infinite geometric
series, which indicates that

\[
1 + q + q^2 + q^3 + \cdots = (1 - q)^{-1}, \tag{3.75}
\]

where q is a number less than one in absolute value. Using this result in the
calculation of F,

\[
F = (I + Q + Q^2 + Q^3 + \cdots) D = (I - Q)^{-1} D = UD. \tag{3.76}
\]

Therefore, the matrix F of the probabilities of eventual passage from transient
states to recurrent states in a reducible Markov chain is equal to the
fundamental matrix, \(U = (I - Q)^{-1}\), postmultiplied by the matrix D of
one-step transition probabilities from transient states to recurrent states.
The (i, j)th entry, fij, of matrix F represents an answer to question 3 in
Section 3.2 because fij is the probability of eventual passage from a transient
state i to a target recurrent state j.
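The convergence of the series to the inverse can be observed numerically. The sketch below, assuming Python with NumPy (an assumption of this illustration), accumulates the partial sum I + Q + ... + Q^(n−1) for the production-process matrix Q of Equation (3.64) and compares it with (I − Q)^(−1). The cutoff of 200 terms is an arbitrary choice.

```python
import numpy as np

Q = np.array([[0.55, 0.00, 0.00],
              [0.20, 0.65, 0.00],
              [0.00, 0.15, 0.75]])
I = np.eye(3)

U = np.linalg.inv(I - Q)      # closed form, Equation (3.74)

Y = np.zeros_like(Q)          # running partial sum I + Q + ... + Q^(n-1)
term = I.copy()               # current power of Q, starting with Q^0 = I
for n in range(200):
    Y += term
    term = term @ Q

print(np.abs(U - Y).max())    # gap between partial sum and inverse
print(np.abs(term).max())     # largest entry of Q^200, near zero
```

Both printed values should be negligibly small, which mirrors Equations (3.69) and (3.72): the gap between the partial sum and the inverse is driven entirely by the vanishing power of Q.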
An alternative derivation of fij, the probability of eventual passage from a
transient state i to a target recurrent state j, is instructive. Consider a reduc-
ible multichain that has been partitioned into two or more closed communi-
cating classes of recurrent states plus a set of transient states. Suppose that
the chain starts in a transient state i. On the first step, one of the following
four mutually exclusive events may occur:

1. The chain enters target recurrent state j after one step with probabil-
ity pij.
2. With probability pih the chain enters a nontarget recurrent state h,
which communicates with the target recurrent state j. In this case the
first step, with probability pih, does not contribute to fij, the probability
of eventual passage from transient state i to the target state j, but
contributes instead to fih, the probability of eventual passage from
transient state i to the nontarget state h.
3. The chain enters a recurrent state g, which belongs to a different
closed set of recurrent states that does not contain the target state j.
In this case the target state will never be reached because recurrent
state g does not communicate with the target state j.
4. The chain enters a transient state k with probability pik. In this case
the probability of eventual passage from transient state k to the tar-
get state j is f kj.

By combining the two relevant events, (1) and (4), in which the target state j is
reached after starting in a transient state i, the following system of algebraic
equations is produced:

\[
f_{ij} = p_{ij} + \sum_{k \in T} p_{ik} f_{kj}, \tag{3.77}
\]

where the sum is over all transient states k. The formula in Equation (3.77)
represents a system of equations because the unknown fij is expressed
in terms of all the unknowns fkj. In matrix form, the system of equations
(3.77) is

\begin{align*}
F &= D + QF \\
F - QF &= D \\
IF - QF &= D \tag{3.78} \\
(I - Q)F &= D \\
F &= (I - Q)^{-1} D = UD, \tag{3.76}
\end{align*}

which confirms the previous result. Thus, fij, the probability of eventual pas-
sage from a transient state i to a target recurrent state j, can be calculated by
solving either the system (3.77) of algebraic equations, or the matrix equation
(3.76). In Equation (3.76), F = [fij] is the matrix of probabilities of eventual pas-
sage from transient states to recurrent states. Observe that two alternative
matrix forms of the system of equations (3.77) are given in Equations (3.78)
and (3.76).
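The equivalence of the two matrix forms can be illustrated on a small case. The sketch below, assuming Python with NumPy and a purely hypothetical chain with two transient and two recurrent states (the numbers are invented for illustration), solves (I − Q)F = D directly and also iterates F = D + QF from the zero matrix.

```python
import numpy as np

# Hypothetical two-transient-state example (values are illustrative only).
Q = np.array([[0.5, 0.2],
              [0.1, 0.4]])
D = np.array([[0.3, 0.0],
              [0.2, 0.3]])

# Direct solution of (I - Q) F = D without forming the inverse explicitly.
F_direct = np.linalg.solve(np.eye(2) - Q, D)

# Fixed-point iteration of F = D + Q F, starting from the zero matrix.
F_iter = np.zeros_like(D)
for _ in range(500):
    F_iter = D + Q @ F_iter

print(np.round(F_direct, 4))
print(np.allclose(F_direct, F_iter))
print(F_direct.sum(axis=1))   # each row should sum to one
```

After k iterations the iterate equals (I + Q + ... + Q^(k−1))D, so the fixed-point form reproduces the partial sums of Equation (3.68). Solving the linear system is usually preferable to computing the inverse when only F is needed.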

3.3.3.1 Reducible Unichain


A reducible unichain has one closed class of recurrent states plus a set of
transient states.

3.3.3.1.1 Probability of Absorption in Absorbing Unichain Model of Machine Deterioration
As Section 3.3.2.1.1 indicates, the simplest kind of reducible unichain is an
absorbing unichain, which has one absorbing state plus transient states.
In an absorbing unichain, the probability of eventual passage from a transient
state i to the target absorbing state j, denoted by fij, is called a probability
of absorption, or absorption probability, because the chain never leaves the
absorbing state. The column vector of probabilities of absorption in absorb-
ing state j is denoted by F = fj = [fij]. Starting from every transient state in an
absorbing unichain, absorption is certain. Therefore, fij = 1 for all transient
states i, and F = fj is a vector of ones.
The four-state absorbing unichain model of machine deterioration intro-
duced in Section 1.10.1.2.1.2 is an example of an absorbing unichain. The
probability of absorption in n steps is computed in Section 3.3.2.1.1. The
canonical form of the transition probability matrix for the unichain model of
machine deterioration is shown in Equation (1.54).

\[
P = \begin{array}{l|c|cccc}
 & \text{State} & 1 & 2 & 3 & 4 \\ \hline
\text{Not Working} & 1 & 1 & 0 & 0 & 0 \\
\text{Working, with a Major Defect} & 2 & 0.6 & 0.4 & 0 & 0 \\
\text{Working, with a Minor Defect} & 3 & 0.2 & 0.3 & 0.5 & 0 \\
\text{Working Properly} & 4 & 0.3 & 0.2 & 0.1 & 0.4
\end{array}
= \begin{bmatrix} 1 & 0 \\ D & Q \end{bmatrix}. \tag{1.54}
\]

The vector of probabilities of absorption in absorbing state 1, starting from
the three transient states, is calculated by Equation (3.76). For this example,

\[
I - Q = \begin{array}{c|ccc}
\text{State} & 2 & 3 & 4 \\ \hline
2 & 0.6 & 0 & 0 \\
3 & -0.3 & 0.5 & 0 \\
4 & -0.2 & -0.1 & 0.6
\end{array}
\qquad
U = (I - Q)^{-1} = \begin{array}{c|ccc}
\text{State} & 2 & 3 & 4 \\ \hline
2 & 1.6667 & 0 & 0 \\
3 & 1 & 2 & 0 \\
4 & 0.7222 & 0.3333 & 1.6667
\end{array}
\]

\[
F = (I - Q)^{-1} D = UD = \begin{array}{c} 2 \\ 3 \\ 4 \end{array}
\begin{bmatrix} 1.6667 & 0 & 0 \\ 1 & 2 & 0 \\ 0.7222 & 0.3333 & 1.6667 \end{bmatrix}
\begin{bmatrix} 0.6 \\ 0.2 \\ 0.3 \end{bmatrix}
= \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}. \tag{3.79}
\]

As expected, F = f1 = e is a vector of ones. Regardless of the transient starting
state, the machine will eventually end its life in absorbing state 1, not working.
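A short numerical check of this example follows, as a sketch assuming Python with NumPy (an assumption of this illustration). It also shows why the answer must be a vector of ones: each row of P sums to one, so D = (I − Q)e, and therefore UD = e.

```python
import numpy as np

# Transient states 2, 3, 4 of Equation (1.54); state 1 is absorbing.
Q = np.array([[0.4, 0.0, 0.0],
              [0.3, 0.5, 0.0],
              [0.2, 0.1, 0.4]])
D = np.array([[0.6], [0.2], [0.3]])

U = np.linalg.inv(np.eye(3) - Q)   # fundamental matrix
F = U @ D                          # absorption probabilities, Equation (3.79)

# Rows of P sum to one, so D equals (I - Q) times a vector of ones.
e = np.ones((3, 1))
print(np.allclose(D, (np.eye(3) - Q) @ e))
print(np.round(U, 4))
print(np.round(F.ravel(), 4))
```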

3.3.3.1.2 Four-State Reducible Unichain


Consider the generic four-state reducible unichain for which the transition
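matrix is shown in canonical form in Equation (3.2).

\[
P = [p_{ij}] = \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \end{array}
\begin{bmatrix}
p_{11} & p_{12} & 0 & 0 \\
p_{21} & p_{22} & 0 & 0 \\
p_{31} & p_{32} & p_{33} & p_{34} \\
p_{41} & p_{42} & p_{43} & p_{44}
\end{bmatrix}
= \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}. \tag{3.2}
\]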

Suppose that the probability f32 of eventual passage from transient state 3
to recurrent state 2 is of interest. Since Equation (3.77) for calculating f32 is
expressed in terms of both f32 and f42, the following system of two algebraic
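equations must be solved:

\begin{align*}
f_{32} &= p_{32} + \sum_{k \in T} p_{3k} f_{k2} = p_{32} + p_{33} f_{32} + p_{34} f_{42} \\
f_{42} &= p_{42} + \sum_{k \in T} p_{4k} f_{k2} = p_{42} + p_{43} f_{32} + p_{44} f_{42}. \tag{3.80a}
\end{align*}

A matrix form of equations (3.80) based on Equation (3.78) is

\[
\begin{bmatrix} f_{32} \\ f_{42} \end{bmatrix}
= \begin{bmatrix} p_{32} \\ p_{42} \end{bmatrix}
+ \begin{bmatrix} p_{33} & p_{34} \\ p_{43} & p_{44} \end{bmatrix}
  \begin{bmatrix} f_{32} \\ f_{42} \end{bmatrix}. \tag{3.80b}
\]

The solution of equations (3.80) for the probabilities of eventual passage to
recurrent state 2 is

\begin{align*}
f_{32} &= \frac{p_{32}(1 - p_{44}) + p_{34} p_{42}}{(1 - p_{33})(1 - p_{44}) - p_{34} p_{43}} \\
f_{42} &= \frac{p_{42}(1 - p_{33}) + p_{43} p_{32}}{(1 - p_{33})(1 - p_{44}) - p_{34} p_{43}}. \tag{3.81}
\end{align*}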

Alternatively, the matrix F of probabilities of eventual passage from transient
states to recurrent states in a reducible unichain can be calculated by solving
Equation (3.76). The fundamental matrix for the generic four-state reducible
unichain was obtained by Equation (3.11). Thus,

\begin{align*}
F = UD = (I - Q)^{-1} D
&= \frac{1}{(1 - p_{33})(1 - p_{44}) - p_{34} p_{43}}
   \begin{bmatrix} 1 - p_{44} & p_{34} \\ p_{43} & 1 - p_{33} \end{bmatrix}
   \begin{bmatrix} p_{31} & p_{32} \\ p_{41} & p_{42} \end{bmatrix} \\
&= \frac{1}{(1 - p_{33})(1 - p_{44}) - p_{34} p_{43}}
   \begin{bmatrix}
   (1 - p_{44}) p_{31} + p_{34} p_{41} & (1 - p_{44}) p_{32} + p_{34} p_{42} \\
   p_{43} p_{31} + (1 - p_{33}) p_{41} & p_{43} p_{32} + (1 - p_{33}) p_{42}
   \end{bmatrix} \\
&= \begin{bmatrix} f_{31} & f_{32} \\ f_{41} & f_{42} \end{bmatrix}. \tag{3.82}
\end{align*}

These values agree with the results obtained in equations (3.81).
After algebraic simplification, observe that

\[
f_{31} + f_{32}
= \frac{p_{31}(1 - p_{44}) + p_{34} p_{41}}{(1 - p_{33})(1 - p_{44}) - p_{34} p_{43}}
+ \frac{p_{32}(1 - p_{44}) + p_{34} p_{42}}{(1 - p_{33})(1 - p_{44}) - p_{34} p_{43}}
= 1 \tag{3.83}
\]

\[
f_{41} + f_{42} = 1. \tag{3.84}
\]

These two sums both equal one because eventual passage from a tran-
sient state to the single recurrent closed class is certain. This result can be
extended to apply to any reducible unichain. One may conclude that start-
ing from any particular transient state, the sum of the probabilities of even-
tual passage to all of the recurrent states is one. For the generic four-state
reducible unichain, note that the probabilities of eventual passage, f31, f32,
f41, and f42, are independent of the transition probabilities for the recurrent
closed class, p11, p12, p21, and p22. Therefore, if states 1 and 2 were converted
to absorbing states, making the chain a generic four-state absorbing multi-
chain, the probabilities of eventual passage, which would be called absorp-
tion probabilities, would be unchanged. To illustrate this result, consider a
generic four-state absorbing multichain, in which states 1 and 2 are absorb-
ing states. The absorbing multichain has the following transition probability
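matrix in canonical form:

\[
P = [p_{ij}] = \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \end{array}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
p_{31} & p_{32} & p_{33} & p_{34} \\
p_{41} & p_{42} & p_{43} & p_{44}
\end{bmatrix}
= \begin{bmatrix} I & 0 \\ D & Q \end{bmatrix}. \tag{3.85}
\]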

The same result is obtained for F as the one obtained in Equation (3.82) for
the matrix of probabilities of eventual passage to recurrent states for the
generic four-state reducible unichain. However, in this case, as Section 3.5.2
indicates, F is called the matrix of absorption probabilities for the generic
four-state absorbing multichain.
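The claim that F does not depend on the recurrent block can be illustrated numerically. The sketch below assumes Python with NumPy and uses invented transition probabilities; the helper name `eventual_passage` is hypothetical. It computes F once for a unichain in the layout of Equation (3.2) and once after states 1 and 2 are made absorbing, and compares one entry with the closed form of Equation (3.81).

```python
import numpy as np

def eventual_passage(P, transient, recurrent):
    """F = (I - Q)^{-1} D for the given index lists (hypothetical helper)."""
    Q = P[np.ix_(transient, transient)]
    D = P[np.ix_(transient, recurrent)]
    return np.linalg.solve(np.eye(len(transient)) - Q, D)

# Hypothetical four-state reducible unichain in the layout of Equation (3.2).
P_uni = np.array([[0.7, 0.3, 0.0, 0.0],
                  [0.4, 0.6, 0.0, 0.0],
                  [0.1, 0.2, 0.3, 0.4],
                  [0.2, 0.1, 0.5, 0.2]])
# Same chain with states 1 and 2 made absorbing, as in Equation (3.85).
P_abs = P_uni.copy()
P_abs[:2, :2] = np.eye(2)

T, R = [2, 3], [0, 1]          # zero-based indices of states 3, 4 and 1, 2
F_uni = eventual_passage(P_uni, T, R)
F_abs = eventual_passage(P_abs, T, R)

# Closed form for f32 from Equation (3.81), using zero-based indexing.
p = P_uni
det = (1 - p[2, 2]) * (1 - p[3, 3]) - p[2, 3] * p[3, 2]
f32 = (p[2, 1] * (1 - p[3, 3]) + p[2, 3] * p[3, 1]) / det

print(np.round(F_uni, 4))
print(np.allclose(F_uni, F_abs), np.isclose(F_uni[0, 1], f32))
```

Only the rows of P for the transient states enter the calculation, which is why the two results coincide.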

3.3.3.1.3 Reducible Unichain Model of Machine Maintenance Under a Modified Policy of Doing Nothing in State 3
Consider the unichain model of machine maintenance introduced in Section
1.10.1.2.2.1 and analyzed in Section 3.3.2.1.4. The transition probability matrix
associated with the modified maintenance policy, under which the engineer
always does nothing to a machine in state 3, is given in canonical form in
Equation (1.56).

\[
P = \begin{array}{c|cccc}
\text{State} & 1 & 2 & 3 & 4 \\ \hline
1 & 0.2 & 0.8 & 0 & 0 \\
2 & 0.6 & 0.4 & 0 & 0 \\
3 & 0.2 & 0.3 & 0.5 & 0 \\
4 & 0.3 & 0.2 & 0.1 & 0.4
\end{array}
= \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}. \tag{1.56}
\]

The recurrent set of states is R = {1,2}. The set of transient states is T = {3,4}.
Using Equation (3.82), the matrix of probabilities of eventual passage from
transient states to recurrent states is

\begin{align*}
F &= \begin{bmatrix} f_{31} & f_{32} \\ f_{41} & f_{42} \end{bmatrix} \\
&= \frac{1}{(1 - p_{33})(1 - p_{44}) - p_{34} p_{43}}
   \begin{bmatrix}
   (1 - p_{44}) p_{31} + p_{34} p_{41} & (1 - p_{44}) p_{32} + p_{34} p_{42} \\
   p_{43} p_{31} + (1 - p_{33}) p_{41} & p_{43} p_{32} + (1 - p_{33}) p_{42}
   \end{bmatrix} \\
&= \frac{1}{(1 - 0.5)(1 - 0.4) - (0)(0.1)}
   \begin{bmatrix}
   (1 - 0.4)(0.2) + (0)(0.3) & (1 - 0.4)(0.3) + (0)(0.2) \\
   (0.1)(0.2) + (1 - 0.5)(0.3) & (0.1)(0.3) + (1 - 0.5)(0.2)
   \end{bmatrix} \\
&= \begin{bmatrix} 0.4 & 0.6 \\ 0.5667 & 0.4333 \end{bmatrix}. \tag{3.86}
\end{align*}

When the machine is used under the modified maintenance policy pre-
scribed in Section 1.10.1.2.2.1, the probability of eventual passage from tran-
sient state 3, working, with a minor defect, to recurrent state 1, not working,
is f31 = 0.4. Also, under the modified maintenance policy, the probability
of eventual passage from transient state 4, working properly, to recurrent
state 2, working, with a major defect, is f42 = 0.4333.
Alternatively, using Equation (3.76), where

\[
Q = \begin{bmatrix} 0.5 & 0 \\ 0.1 & 0.4 \end{bmatrix}, \quad
D = \begin{bmatrix} 0.2 & 0.3 \\ 0.3 & 0.2 \end{bmatrix}, \quad
I - Q = \begin{bmatrix} 0.5 & 0 \\ -0.1 & 0.6 \end{bmatrix},
\]

\[
U = (I - Q)^{-1} = \frac{1}{0.3} \begin{bmatrix} 0.6 & 0 \\ 0.1 & 0.5 \end{bmatrix},
\]

\[
F = UD = (I - Q)^{-1} D
= \frac{1}{0.3} \begin{bmatrix} 0.6 & 0 \\ 0.1 & 0.5 \end{bmatrix}
  \begin{bmatrix} 0.2 & 0.3 \\ 0.3 & 0.2 \end{bmatrix}
= \begin{bmatrix} 0.4 & 0.6 \\ 0.5667 & 0.4333 \end{bmatrix}
= \begin{bmatrix} f_{31} & f_{32} \\ f_{41} & f_{42} \end{bmatrix}, \tag{3.87}
\]

confirming the results obtained previously.
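Because the transient block is only 2 × 2, the calculation can also be carried out in exact rational arithmetic. The sketch below uses only Python's standard `fractions` module (a tool choice made for this illustration) and the adjugate formula for a 2 × 2 inverse.

```python
from fractions import Fraction as Fr

# Blocks of Equation (1.56): transient states 3, 4; recurrent states 1, 2.
Q = [[Fr(5, 10), Fr(0)], [Fr(1, 10), Fr(4, 10)]]
D = [[Fr(2, 10), Fr(3, 10)], [Fr(3, 10), Fr(2, 10)]]

# 2x2 inverse of I - Q via the adjugate formula.
a, b = 1 - Q[0][0], -Q[0][1]
c, d = -Q[1][0], 1 - Q[1][1]
det = a * d - b * c
U = [[d / det, -b / det], [-c / det, a / det]]

# F = U D, computed entry by entry.
F = [[sum(U[i][k] * D[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]
for row in F:
    print([str(x) for x in row])
```

The exact entries are 2/5, 3/5, 17/30, and 13/30, which round to the decimals 0.4, 0.6, 0.5667, and 0.4333 shown in Equation (3.87).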

3.3.3.2 Reducible Multichain


A reducible multichain has two or more closed classes of recurrent states
plus a set of transient states.

3.3.3.2.1 Absorbing Multichain Model of a Gambler’s Ruin


An absorbing multichain has two or more absorbing states plus one or more
transient states. As Section 3.3.3.1.1 indicates, the probability of eventual pas-
sage from a transient state i to an absorbing state j, denoted by fij, is called an
absorption probability. The matrix of probabilities of eventual passage from
transient states to absorbing states, denoted by F = [fij], is called the matrix of
absorption probabilities.
In Section 1.4.1.2, a gambler’s ruin is modeled as a five-state random walk
with two absorbing barriers. This random walk is an absorbing multichain
in which states 0 and 4 are absorbing states, and the other three states are
transient states. Suppose that player A wants to compute the probability that
she will eventually lose all her money and be ruined. This is the probability
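that the chain will be absorbed in state 0. The transition probability matrix
for the absorbing multichain model of the gambler’s ruin is given in canonical
form in Equation (1.11).

\[
P = [p_{ij}] = \begin{array}{c} 0 \\ 4 \\ 1 \\ 2 \\ 3 \end{array}
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
1 - p & 0 & 0 & p & 0 \\
0 & 0 & 1 - p & 0 & p \\
0 & p & 0 & 1 - p & 0
\end{bmatrix}
= \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}
= \begin{bmatrix} I & 0 \\ D & Q \end{bmatrix}. \tag{1.11}
\]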

Observe that

\[
(I - Q) = \begin{array}{c|ccc}
\text{State} & 1 & 2 & 3 \\ \hline
1 & 1 & -p & 0 \\
2 & -(1 - p) & 1 & -p \\
3 & 0 & -(1 - p) & 1
\end{array} \tag{3.88}
\]

The fundamental matrix is

\[
U = (I - Q)^{-1} = \frac{1}{p^2 + (1 - p)^2}
\begin{array}{c|ccc}
\text{State} & 1 & 2 & 3 \\ \hline
1 & p + (1 - p)^2 & p & p^2 \\
2 & 1 - p & 1 & p \\
3 & (1 - p)^2 & 1 - p & p^2 + (1 - p)
\end{array}. \tag{3.89}
\]
Note that when player A starts with $2, the expected number of times that
she will have $2 is given by

\[
u_{22} = \frac{1}{p^2 + (1 - p)^2}, \tag{3.90}
\]

the entry in row 2 and column 2 of the fundamental matrix. The expected
number of times that she will have $2, given that she starts with $2, varies
between one and two. This expected value is one if p = 0 or p = 1. If player
A starts with $2 and p = 0, she will have $1 remaining after the first bet, and
lose the game with $0 remaining after the second bet. If player A starts with
$2 and p = 1, she will have a total of $3 after the first bet, and win the game
with a total of $4 after the second bet. Therefore, if p = 0 or 1, and player A
starts in state 2, she will have $2 only until she makes the first bet. If player
A starts with $2 and p = 1/2, the expected number of times that she will have
$2 is two.
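The behavior of u22 over the range of p can be tabulated directly. The following is a minimal sketch, assuming Python with NumPy (an assumption of this illustration); the function name `fundamental_matrix` and the sample values of p are hypothetical choices.

```python
import numpy as np

def fundamental_matrix(p):
    """U = (I - Q)^{-1} for the five-state gambler's ruin, win probability p."""
    q = 1.0 - p
    # Rows and columns are transient states 1, 2, 3 of Equation (1.11).
    Q = np.array([[0.0, p,   0.0],
                  [q,   0.0, p  ],
                  [0.0, q,   0.0]])
    return np.linalg.inv(np.eye(3) - Q)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    u22 = fundamental_matrix(p)[1, 1]
    closed_form = 1.0 / (p**2 + (1.0 - p)**2)   # Equation (3.90)
    print(p, round(u22, 4), round(closed_form, 4))
```

The numerical inverse and the closed form agree at every sampled p, with the value rising from one at p = 0 to two at p = 1/2 and falling back to one at p = 1.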
Using Equation (3.76), the matrix of absorption probabilities is equal to

\begin{align*}
F = UD = (I - Q)^{-1} D
&= \frac{1}{p^2 + (1 - p)^2}
   \begin{bmatrix}
   p + (1 - p)^2 & p & p^2 \\
   1 - p & 1 & p \\
   (1 - p)^2 & 1 - p & p^2 + (1 - p)
   \end{bmatrix}
   \begin{bmatrix} 1 - p & 0 \\ 0 & 0 \\ 0 & p \end{bmatrix} \\
&= \frac{1}{p^2 + (1 - p)^2}
   \begin{array}{c|cc}
   \text{State} & 0 & 4 \\ \hline
   1 & p(1 - p) + (1 - p)^3 & p^3 \\
   2 & (1 - p)^2 & p^2 \\
   3 & (1 - p)^3 & p(1 - p) + p^3
   \end{array}. \tag{3.91}
\end{align*}

If player A starts with $2, the probability that she will eventually lose all her
money and be ruined is given by

\[
f_{20} = \frac{(1 - p)^2}{p^2 + (1 - p)^2}, \tag{3.92}
\]

the entry in row 2 and column 1 of F, the matrix of absorption probabilities.
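The ruin probability can also be estimated by simulating the random walk, which gives an independent check on the closed form. The sketch below uses only Python's standard `random` module; the helper name `simulate_ruin`, the value p = 0.4, the number of trials, and the seed are all hypothetical choices for illustration.

```python
import random

def simulate_ruin(p, start=2, goal=4, trials=20_000, seed=0):
    """Estimate the probability of reaching $0 before $goal (hypothetical helper)."""
    rng = random.Random(seed)
    ruined = 0
    for _ in range(trials):
        money = start
        # Play $1 bets until an absorbing barrier is reached.
        while 0 < money < goal:
            money += 1 if rng.random() < p else -1
        ruined += (money == 0)
    return ruined / trials

p = 0.4
exact = (1 - p) ** 2 / (p ** 2 + (1 - p) ** 2)   # Equation (3.92)
estimate = simulate_ruin(p)
print(round(exact, 4), round(estimate, 4))
```

For p = 0.4 the closed form gives 0.36/0.52, about 0.6923; the simulated estimate should fall close to this value, with sampling error that shrinks as the number of trials grows.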

3.3.3.2.2 Absorbing Multichain Model of Patient Flow in a Hospital


In Sections 1.10.2.1.2, 3.2.4, and 3.3.2.2.1, the movement of patients in a hospi-
tal is modeled as an absorbing multichain in which states 0 and 5 are absorb-
ing states, and the other four states are transient. Suppose that the objective
is to compute the probability that a patient will eventually be discharged.
This is the probability that a patient in one of the four transient states will be
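absorbed in state 0. The transition probability matrix for the absorbing
multichain is given in canonical form in Equation (1.59).

\[
P = \begin{array}{c|cccccc}
\text{State} & 0 & 5 & 1 & 2 & 3 & 4 \\ \hline
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
5 & 0 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0.4 & 0.1 & 0.2 & 0.3 \\
2 & 0.15 & 0.05 & 0.1 & 0.2 & 0.3 & 0.2 \\
3 & 0.07 & 0.03 & 0.2 & 0.1 & 0.4 & 0.2 \\
4 & 0 & 0 & 0.3 & 0.4 & 0.2 & 0.1
\end{array}
= \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}
= \begin{bmatrix} I & 0 \\ D & Q \end{bmatrix}. \tag{1.59}
\]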

The fundamental matrix was calculated in Equation (3.21). Using Equation
(3.76), the matrix of absorption probabilities is equal to

\[
F = UD = (I - Q)^{-1} D
= \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \end{array}
\begin{bmatrix}
5.2364 & 2.8563 & 4.2847 & 3.3321 \\
2.9263 & 3.2798 & 3.4383 & 2.4682 \\
3.5078 & 2.4858 & 5.0253 & 2.8382 \\
3.8254 & 2.9621 & 4.0731 & 3.9494
\end{bmatrix}
\begin{bmatrix} 0 & 0 \\ 0.15 & 0.05 \\ 0.07 & 0.03 \\ 0 & 0 \end{bmatrix}
= \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \end{array}
\begin{bmatrix}
0.7284 & 0.2714 \\
0.7327 & 0.2671 \\
0.7246 & 0.2750 \\
0.7294 & 0.2703
\end{bmatrix}. \tag{3.93}
\]

The entry f30 = 0.7246 shows that a surgery patient has a probability of
0.7246 of eventually being discharged, while the entry f35 = 0.2750 shows that
a surgery patient has a probability of 0.2750 of dying. The entries in the first
column of the matrix F indicate that the probabilities that a patient in the
diagnostic, outpatient, surgery, and physical therapy departments will
eventually be discharged are 0.7284, 0.7327, 0.7246, and 0.7294,
respectively. Of course, the absorption probabilities in each row of matrix F
sum to one, to within roundoff error, because eventually a patient must be
discharged or die.
Now assume that the fundamental matrix for the absorbing multichain has
not been calculated. As Section 3.3.3 indicates, fi0, the probability of absorp-
tion in state 0, given that the chain starts in a transient state i, can also be
calculated by solving the system of equations (3.77).

\[
f_{i0} = p_{i0} + \sum_{k \in T} p_{ik} f_{k0}. \tag{3.77}
\]

Equations (3.77) for the probabilities of eventual discharge are

\begin{align*}
f_{10} &= 0 + (0.4) f_{10} + (0.1) f_{20} + (0.2) f_{30} + (0.3) f_{40} \\
f_{20} &= 0.15 + (0.1) f_{10} + (0.2) f_{20} + (0.3) f_{30} + (0.2) f_{40} \\
f_{30} &= 0.07 + (0.2) f_{10} + (0.1) f_{20} + (0.4) f_{30} + (0.2) f_{40} \tag{3.94} \\
f_{40} &= 0 + (0.3) f_{10} + (0.4) f_{20} + (0.2) f_{30} + (0.1) f_{40}.
\end{align*}

The solution of equations (3.94) for the vector of probabilities of absorption in
absorbing state 0 is column one of matrix (3.93),

\[
f_0 = [f_{10} \;\; f_{20} \;\; f_{30} \;\; f_{40}]^T
    = [0.7284 \;\; 0.7327 \;\; 0.7246 \;\; 0.7294]^T. \tag{3.95}
\]

These probabilities of absorption in state 0 differ only slightly from those
computed by state reduction in Equation (6.83). Discrepancies are due to
roundoff error because only four digits after the decimal
point were stored.
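The system (3.94) can be solved in floating point without forming the fundamental matrix. This is a minimal sketch, assuming Python with NumPy (an assumption of this illustration); the variable names `d0` and `d5` are hypothetical labels for the two columns of D.

```python
import numpy as np

# Transient block Q of Equation (1.59); rows are states 1-4.
Q = np.array([[0.4, 0.1, 0.2, 0.3],
              [0.1, 0.2, 0.3, 0.2],
              [0.2, 0.1, 0.4, 0.2],
              [0.3, 0.4, 0.2, 0.1]])
d0 = np.array([0.0, 0.15, 0.07, 0.0])   # one-step probabilities of discharge
d5 = np.array([0.0, 0.05, 0.03, 0.0])   # one-step probabilities of death

A = np.eye(4) - Q
f0 = np.linalg.solve(A, d0)    # system (3.94): (I - Q) f0 = d0
f5 = np.linalg.solve(A, d5)    # same system with the death column

print(np.round(f0, 4))
print(np.round(f0 + f5, 6))    # each sum should be one
```

Full-precision results should match Equation (3.95) to about three decimal places, and each pair f_i0 + f_i5 sums to one. Small differences in the fourth decimal are consistent with the roundoff in the tabulated fundamental matrix noted above.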

3.3.3.2.3 Five-State Reducible Multichain


Consider the canonical form of the transition probability matrix for a generic
five-state reducible multichain, shown in Equation (3.5).

\[
P = \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{array}
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & p_{22} & p_{23} & 0 & 0 \\
0 & p_{32} & p_{33} & 0 & 0 \\
p_{41} & p_{42} & p_{43} & p_{44} & p_{45} \\
p_{51} & p_{52} & p_{53} & p_{54} & p_{55}
\end{bmatrix}
= \begin{bmatrix} P_1 & 0 & 0 \\ 0 & P_2 & 0 \\ D_1 & D_2 & Q \end{bmatrix}. \tag{3.5}
\]

Suppose that the probability of eventual passage from transient state 4 to
recurrent state 2, f42, is of interest. The following system of equations (3.77)
must be solved:

\begin{align*}
f_{42} &= p_{42} + \sum_{k \in T} p_{4k} f_{k2} = p_{42} + p_{44} f_{42} + p_{45} f_{52} \\
f_{52} &= p_{52} + \sum_{k \in T} p_{5k} f_{k2} = p_{52} + p_{54} f_{42} + p_{55} f_{52}. \tag{3.96a}
\end{align*}

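A matrix form of equations (3.96) based on Equation (3.78) is

\[
\begin{bmatrix} f_{42} \\ f_{52} \end{bmatrix}
= \begin{bmatrix} p_{42} \\ p_{52} \end{bmatrix}
+ \begin{bmatrix} p_{44} & p_{45} \\ p_{54} & p_{55} \end{bmatrix}
  \begin{bmatrix} f_{42} \\ f_{52} \end{bmatrix}. \tag{3.96b}
\]

The solution of equations (3.96) for the probabilities of eventual passage to
recurrent state 2 is

\begin{align*}
f_{42} &= \frac{p_{42}(1 - p_{55}) + p_{45} p_{52}}{(1 - p_{44})(1 - p_{55}) - p_{45} p_{54}} \\
f_{52} &= \frac{p_{52}(1 - p_{44}) + p_{54} p_{42}}{(1 - p_{44})(1 - p_{55}) - p_{45} p_{54}}. \tag{3.97}
\end{align*}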

Alternatively, Equation (3.76) for computing the matrix of probabilities of
eventual passage from transient states to recurrent states, extended to apply
to a multichain with r recurrent chains, is

\[
F = UD = U[D_1 \; \cdots \; D_r] = [UD_1 \; \cdots \; UD_r] = [F_1 \; \cdots \; F_r]. \tag{3.98}
\]

Since the five-state reducible multichain has r = 2 recurrent chains, and the
target recurrent state 2 belongs to the second recurrent chain
\(R_2 = \{2, 3\}\), matrix \(F_2\) will be calculated.

\[
F_2 = UD_2 = (I - Q)^{-1} D_2. \tag{3.99}
\]

The fundamental matrix is

\[
U = (I - Q)^{-1}
= \begin{bmatrix} 1 - p_{44} & -p_{45} \\ -p_{54} & 1 - p_{55} \end{bmatrix}^{-1}
= \frac{1}{(1 - p_{44})(1 - p_{55}) - p_{45} p_{54}}
  \begin{bmatrix} 1 - p_{55} & p_{45} \\ p_{54} & 1 - p_{44} \end{bmatrix}. \tag{3.100}
\]

Thus,

\begin{align*}
F_2 &= \frac{1}{(1 - p_{44})(1 - p_{55}) - p_{45} p_{54}}
   \begin{bmatrix} 1 - p_{55} & p_{45} \\ p_{54} & 1 - p_{44} \end{bmatrix}
   \begin{bmatrix} p_{42} & p_{43} \\ p_{52} & p_{53} \end{bmatrix} \\
&= \frac{1}{(1 - p_{44})(1 - p_{55}) - p_{45} p_{54}}
   \begin{bmatrix}
   (1 - p_{55}) p_{42} + p_{45} p_{52} & (1 - p_{55}) p_{43} + p_{45} p_{53} \\
   p_{54} p_{42} + (1 - p_{44}) p_{52} & p_{54} p_{43} + (1 - p_{44}) p_{53}
   \end{bmatrix} \\
&= \begin{bmatrix} f_{42} & f_{43} \\ f_{52} & f_{53} \end{bmatrix}. \tag{3.101}
\end{align*}

These values agree with those obtained previously in Equation (3.97). As
Section 3.3.3.1.2 has noted for the case of a reducible unichain, the
probabilities f42, f43, f52, and f53 of eventual passage are independent of the
transition probabilities, p22, p23, p32, and p33, for the recurrent closed
class. Thus, if states 2 and 3 were absorbing states, making the chain an
absorbing multichain, the probabilities of eventual passage, which would be
called absorption probabilities, would be unchanged.
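The block-by-block structure of Equation (3.98) can be illustrated with numbers. The sketch below assumes Python with NumPy and uses invented values for the transient rows of Equation (3.5); none of these numbers come from the text.

```python
import numpy as np

# Hypothetical numbers for the transient rows (states 4 and 5) of Equation (3.5).
D1 = np.array([[0.1], [0.3]])              # p41, p51
D2 = np.array([[0.2, 0.1], [0.1, 0.2]])    # p42 p43 / p52 p53
Q  = np.array([[0.4, 0.2], [0.1, 0.3]])    # p44 p45 / p54 p55

U = np.linalg.inv(np.eye(2) - Q)
F1, F2 = U @ D1, U @ D2                    # Equation (3.98), block by block
F = U @ np.hstack([D1, D2])                # same result in one product

# Closed form for f42 from Equation (3.97).
det = (1 - Q[0, 0]) * (1 - Q[1, 1]) - Q[0, 1] * Q[1, 0]
f42 = (D2[0, 0] * (1 - Q[1, 1]) + Q[0, 1] * D2[1, 0]) / det

print(np.round(F, 4))
print(round(f42, 4), round(F2[0].sum(), 4))   # f42 and f42 + f43
```

With these numbers f42 = 0.4, and the sum of the first row of `F2` is the probability that the chain, starting in state 4, eventually enters the closed set {2, 3}.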

3.3.3.2.4 Multichain Model of an Eight-State Serial Production Process


For another illustration of how to calculate the probability of eventual pas-
sage from a transient state to a target recurrent state, consider the eight-state
reducible multichain model of a serial production process introduced in
Section 1.10.2.2. The transition probability matrix in canonical form is shown
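in Equation (3.7).
The fundamental matrix is

\[
U = (I - Q)^{-1}
= \begin{array}{c} 6 \\ 7 \\ 8 \end{array}
\begin{bmatrix} 0.45 & 0 & 0 \\ -0.20 & 0.35 & 0 \\ 0 & -0.15 & 0.25 \end{bmatrix}^{-1}
= \begin{array}{c|ccc}
 & 6 & 7 & 8 \\ \hline
6 & 2.2222 & 0 & 0 \\
7 & 1.2698 & 2.8571 & 0 \\
8 & 0.7619 & 1.7143 & 4
\end{array}. \tag{3.102}
\]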

In this model, production stage i is represented by transient state (9 − i). Recall
that uij, the (i, j)th entry of the fundamental matrix, specifies the mean
number of visits to transient state j before the chain eventually enters a
recurrent state, given that the chain starts in transient state i. Hence, the
entries in the bottom row of U specify the expected number of times that an
entering item, starting at production stage 1 in transient state 8, will visit
each of the three production stages prior to being sold, scrapped, or sent to
the training center. On average, an entering item will make u88 = 4 visits to
stage 1, u87 = 1.7143 visits to stage 2, and u86 = 0.7619 visits to stage 3. The
expected total number of visits to all three production stages made by an
entering item is 6.4762, equal to the sum of these three entries.
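The expected visit counts can be reproduced in a few lines. This is a minimal sketch, assuming Python with NumPy (an assumption of this illustration); the variable name `visits_from_8` is hypothetical.

```python
import numpy as np

# Transient states 6, 7, 8 (production stages 3, 2, 1).
Q = np.array([[0.55, 0.00, 0.00],
              [0.20, 0.65, 0.00],
              [0.00, 0.15, 0.75]])

U = np.linalg.inv(np.eye(3) - Q)   # fundamental matrix, Equation (3.102)
visits_from_8 = U[2]               # bottom row: item starts in state 8

print(np.round(U, 4))
print(round(visits_from_8.sum(), 4))   # expected total visits, about 6.4762
```

More generally, each row sum of U gives the expected number of steps spent in transient states before the chain enters a recurrent state, for the corresponding starting state.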
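If the chain occupies a transient state, the matrix of probabilities of
eventual passage to absorbing and recurrent states is calculated by applying
Equation (3.98).

\begin{align*}
F = UD = U[D_1 \;\; D_2 \;\; D_R]
&= \begin{array}{c} 6 \\ 7 \\ 8 \end{array}
   \begin{bmatrix} 2.2222 & 0 & 0 \\ 1.2698 & 2.8571 & 0 \\ 0.7619 & 1.7143 & 4 \end{bmatrix}
   \begin{bmatrix}
   0.20 & 0.16 & 0.04 & 0.03 & 0.02 \\
   0.15 & 0 & 0 & 0 & 0 \\
   0.10 & 0 & 0 & 0 & 0
   \end{bmatrix} \\
&= \begin{array}{c|ccccc}
    & 1 & 2 & 3 & 4 & 5 \\ \hline
   6 & 0.4444 & 0.3556 & 0.0889 & 0.0667 & 0.0444 \\
   7 & 0.6825 & 0.2032 & 0.0508 & 0.0381 & 0.0254 \\
   8 & 0.8095 & 0.1219 & 0.0305 & 0.0229 & 0.0152
   \end{array}
 = [F_1 \;\; F_2 \;\; F_R],
\end{align*}

where

\[
F_1 = UD_1 = \begin{array}{c} 6 \\ 7 \\ 8 \end{array}
\begin{bmatrix} 2.2222 & 0 & 0 \\ 1.2698 & 2.8571 & 0 \\ 0.7619 & 1.7143 & 4 \end{bmatrix}
\begin{bmatrix} 0.20 \\ 0.15 \\ 0.10 \end{bmatrix}
= \begin{bmatrix} 0.4444 \\ 0.6825 \\ 0.8095 \end{bmatrix}
= \begin{bmatrix} f_{61} \\ f_{71} \\ f_{81} \end{bmatrix}
\]

\[
F_2 = UD_2 = \begin{array}{c} 6 \\ 7 \\ 8 \end{array}
\begin{bmatrix} 2.2222 & 0 & 0 \\ 1.2698 & 2.8571 & 0 \\ 0.7619 & 1.7143 & 4 \end{bmatrix}
\begin{bmatrix} 0.16 \\ 0 \\ 0 \end{bmatrix}
= \begin{bmatrix} 0.3556 \\ 0.2032 \\ 0.1219 \end{bmatrix}
= \begin{bmatrix} f_{62} \\ f_{72} \\ f_{82} \end{bmatrix}
\]

\begin{align*}
F_R = UD_R &= \begin{array}{c} 6 \\ 7 \\ 8 \end{array}
\begin{bmatrix} 2.2222 & 0 & 0 \\ 1.2698 & 2.8571 & 0 \\ 0.7619 & 1.7143 & 4 \end{bmatrix}
\begin{bmatrix} 0.04 & 0.03 & 0.02 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \\
&= \begin{bmatrix}
   0.0889 & 0.0667 & 0.0444 \\
   0.0508 & 0.0381 & 0.0254 \\
   0.0305 & 0.0229 & 0.0152
   \end{bmatrix}
 = \begin{bmatrix}
   f_{63} & f_{64} & f_{65} \\
   f_{73} & f_{74} & f_{75} \\
   f_{83} & f_{84} & f_{85}
   \end{bmatrix}. \tag{3.103}
\end{align*}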

For an entering item, which starts in transient state 8, the probability of being
scrapped, or absorbed in state 1, is f81 = 0.8095, while the probability of being
sold, or absorbed in state 2, is f82 = 0.1219. The probability that an entering
item will enter the training center to be used initially for training engineers
is f83 = 0.0305. The probability that an entering item will eventually enter the
training center, to be used for training engineers, technicians, and technical

51113_C003.indd 136 9/23/2010 4:20:27 PM


Reducible Markov Chains 137

writers, is given by

f83 + f84 + f85 = 0.0305 + 0.0229 + 0.0152 = 0.0686. (3.104)

Suppose that an order for 100 items is received from a customer. If exactly 100
items are started, then the firm can expect to sell only

100 f82 = 100 (0.1219) ≈ 12 (3.105)

items; the remainder will be scrapped or sent to the training center.


Therefore, the expected number of entering items required to fill this order
is equal to

100/f82 = 100/0.1219 ≈ 821. (3.106)
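The arithmetic in Equations (3.105) and (3.106) can be reproduced in a few lines. This is only a sketch: the names `order_size`, `expected_sold`, and `required_starts` are illustrative, and rounding the required number of starts up to the next whole item is an assumption chosen to match the text's figure of 821.

```python
import math

f82 = 0.1219        # probability that an entering item is eventually sold (Equation 3.103)
order_size = 100    # items ordered by the customer

# Expected number sold if exactly 100 items are started (Equation 3.105).
expected_sold = order_size * f82

# Expected number of starts needed to fill the order (Equation 3.106),
# rounded up because a fraction of an item cannot be started.
required_starts = math.ceil(order_size / f82)

print(round(expected_sold, 2), required_starts)
```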

Alternatively, if the chain occupies a transient state, the probabilities of
eventual passage to absorbing and recurrent states can also be computed by
using Equation (3.77). Note that only output from production stage 3, represented
by transient state 6, can reach the training center in one step. Suppose
that the probability f84 that an entering item will enter the training center to be
used initially for training technicians is of interest. States 6, 7, and 8 are transient.
Recurrent state 4 communicates with recurrent states 3 and 5. The
following system of three linear equations (3.77) must be solved for the three
unknowns, f64, f74, and f84.

\[
\begin{aligned}
f_{64} &= p_{64} + \sum_{k=6,7,8} p_{6k} f_{k4} = p_{64} + p_{66} f_{64} + p_{67} f_{74} + p_{68} f_{84} \\
f_{74} &= p_{74} + \sum_{k=6,7,8} p_{7k} f_{k4} = p_{74} + p_{76} f_{64} + p_{77} f_{74} + p_{78} f_{84} \\
f_{84} &= p_{84} + \sum_{k=6,7,8} p_{8k} f_{k4} = p_{84} + p_{86} f_{64} + p_{87} f_{74} + p_{88} f_{84}.
\end{aligned}
\tag{3.107a}
\]


A matrix form of equations (3.107) based on Equation (3.78) is

\[
\begin{bmatrix} f_{64} \\ f_{74} \\ f_{84} \end{bmatrix}
=
\begin{bmatrix} p_{64} \\ p_{74} \\ p_{84} \end{bmatrix}
+
\begin{bmatrix} p_{66} & p_{67} & p_{68} \\ p_{76} & p_{77} & p_{78} \\ p_{86} & p_{87} & p_{88} \end{bmatrix}
\begin{bmatrix} f_{64} \\ f_{74} \\ f_{84} \end{bmatrix}.
\tag{3.107b}
\]

When numerical coefficients are inserted, the system of equations (3.107) is

\[
\begin{aligned}
f_{64} &= 0.03 + 0.55 f_{64} + 0 f_{74} + 0 f_{84} = 0.0667 \\
f_{74} &= 0 + 0.20 f_{64} + 0.65 f_{74} + 0 f_{84} = 0.0381 \\
f_{84} &= 0 + 0 f_{64} + 0.15 f_{74} + 0.75 f_{84} = 0.0229.
\end{aligned}
\tag{3.108}
\]


A matrix form and the solution of equations (3.108) is

\[
\begin{bmatrix} f_{64} \\ f_{74} \\ f_{84} \end{bmatrix}
=
\begin{bmatrix} 0.03 \\ 0 \\ 0 \end{bmatrix}
+
\begin{bmatrix} 0.55 & 0 & 0 \\ 0.20 & 0.65 & 0 \\ 0 & 0.15 & 0.75 \end{bmatrix}
\begin{bmatrix} f_{64} \\ f_{74} \\ f_{84} \end{bmatrix}
=
\begin{bmatrix} 0.0667 \\ 0.0381 \\ 0.0229 \end{bmatrix}.
\tag{3.109}
\]

Equation (3.109) produces the same solution for the vector $[f_{64}, f_{74}, f_{84}]^T$ as does
Equation (3.103).
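The system (3.108) has the form f = p + Qf, which rearranges to (I − Q)f = p. The following sketch solves it directly, assuming NumPy; the names `p4` and `f4` are illustrative labels for the column of one-step probabilities into state 4 and the resulting passage probabilities.

```python
import numpy as np

# Transient-to-transient submatrix Q (states 6, 7, 8), as in Equation (3.109).
Q = np.array([[0.55, 0.00, 0.00],
              [0.20, 0.65, 0.00],
              [0.00, 0.15, 0.75]])

# One-step probabilities p64, p74, p84 of entering recurrent state 4.
p4 = np.array([0.03, 0.0, 0.0])

# f = p4 + Q f  is equivalent to  (I - Q) f = p4.
f4 = np.linalg.solve(np.eye(3) - Q, p4)

print(np.round(f4, 4))  # entries are f64, f74, f84
```

Solving the linear system avoids forming the inverse explicitly, which is usually preferable when only one column of F is needed.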

3.4 Eventual Passage to a Closed Set within a Reducible Multichain
A reducible Markov chain can be partitioned into one or more disjoint closed
communicating classes of recurrent states plus a set of transient states. In this
section, passage from a transient state to a recurrent chain, R, in a reducible mul-
tichain is of interest, rather than passage to a particular state within R [1–3]. The
probability of eventual passage from a transient state to the single closed class of
recurrent states within a reducible unichain is one, because eventual passage to
the recurrent closed class is certain. Two methods are presented for calculating
the probability of eventual passage from a transient state to a closed set of recurrent
states within a reducible multichain. Both methods verify that the probability
of eventual passage from a transient state i to a closed class of recurrent states
is equal to the sum of the probabilities of eventual passage from the transient
state i to all of the states that belong to the recurrent closed class.

3.4.1 Method One: Replacing Recurrent Sets with Absorbing States and Using the Fundamental Matrix
Method one, which requires calculation of the fundamental matrix, can be
applied without first calculating the individual probabilities of eventual
passage from transient states to recurrent states. To do this, all of the recur-
rent states, which belong to the same closed class, are lumped into a single
absorbing state. When this is done for every recurrent closed class, the result
is an absorbing multichain with as many absorbing states as there are recur-
rent closed classes in the original chain. The absorption probabilities for the
absorbing multichain are equal to the probabilities of eventual passage to the
corresponding recurrent closed classes in the original multichain. The set of
transient states is unchanged.

3.4.1.1 Five-State Reducible Multichain


To demonstrate method one, consider, once again, the generic five-state
reducible multichain, treated most recently in Section 3.3.3.2.3. The canonical
form of the transition probability matrix is shown in Equation (3.5).

\[
P =
\begin{array}{c|ccccc}
 & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & 1 & 0 & 0 & 0 & 0 \\
2 & 0 & p_{22} & p_{23} & 0 & 0 \\
3 & 0 & p_{32} & p_{33} & 0 & 0 \\
4 & p_{41} & p_{42} & p_{43} & p_{44} & p_{45} \\
5 & p_{51} & p_{52} & p_{53} & p_{54} & p_{55}
\end{array}
=
\begin{bmatrix} P_1 & 0 & 0 \\ 0 & P_2 & 0 \\ D_1 & D_2 & Q \end{bmatrix}.
\tag{3.5}
\]

Recall that the probabilities f42, f43, f52, and f53 of eventual passage were computed
in Section 3.3.3.2.3. To compute the probability of eventual passage
from a transient state to the recurrent closed class denoted by R2 = {2, 3}, R2 is
replaced by an absorbing state. The following associated absorbing Markov
chain is formed with transition probability matrix denoted by PA.

\[
P_A =
\begin{array}{c|cccc}
 & 1 & R_2 & 4 & 5 \\ \hline
1 & 1 & 0 & 0 & 0 \\
R_2 & 0 & 1 & 0 & 0 \\
4 & p_{41} & p_{42} + p_{43} & p_{44} & p_{45} \\
5 & p_{51} & p_{52} + p_{53} & p_{54} & p_{55}
\end{array}
=
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ D_1 & d_2 & Q \end{bmatrix},
\qquad
d_2 = \begin{bmatrix} p_{42} + p_{43} \\ p_{52} + p_{53} \end{bmatrix},
\tag{3.110}
\]

where the transition probability in each row of vector d2 is equal to the sum
of the transition probabilities in the same row of matrix D2. As the matrix
Q for the original multichain is unchanged, the fundamental matrix U for
the absorbing multichain, calculated in Equation (3.100), is also unchanged.
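Applying Equation (3.99) to calculate f2, which denotes the vector of probabilities
of eventual passage from the transient states to the closed class of recurrent
states, R2 = {2, 3},

\[
\begin{aligned}
f_2 = U d_2 = (I - Q)^{-1} d_2
&= \frac{1}{(1 - p_{44})(1 - p_{55}) - p_{45} p_{54}}
\begin{bmatrix} 1 - p_{55} & p_{45} \\ p_{54} & 1 - p_{44} \end{bmatrix}
\begin{bmatrix} p_{42} + p_{43} \\ p_{52} + p_{53} \end{bmatrix} \\
&= \frac{1}{(1 - p_{44})(1 - p_{55}) - p_{45} p_{54}}
\begin{bmatrix} (1 - p_{55})(p_{42} + p_{43}) + p_{45}(p_{52} + p_{53}) \\ p_{54}(p_{42} + p_{43}) + (1 - p_{44})(p_{52} + p_{53}) \end{bmatrix} \\
&= \begin{bmatrix} f_{42} + f_{43} \\ f_{52} + f_{53} \end{bmatrix}
= \begin{bmatrix} f_{4R_2} \\ f_{5R_2} \end{bmatrix},
\end{aligned}
\tag{3.111}
\]

where

\[
\begin{aligned}
f_{4R_2} &= \frac{(1 - p_{55})(p_{42} + p_{43}) + p_{45}(p_{52} + p_{53})}{(1 - p_{44})(1 - p_{55}) - p_{45} p_{54}} \\
f_{5R_2} &= \frac{p_{54}(p_{42} + p_{43}) + (1 - p_{44})(p_{52} + p_{53})}{(1 - p_{44})(1 - p_{55}) - p_{45} p_{54}}.
\end{aligned}
\tag{3.112}
\]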


In this example,

\[
f_{iR_2} = f_{i2} + f_{i3}
\tag{3.113}
\]

is the probability of eventual passage from a transient state i to the recurrent
closed class R2 = {2, 3}.

3.4.1.2 Multichain Model of an Eight-State Serial Production Process


As a demonstration of method one with numerical values given for transi-
tion probabilities, the procedure will be applied to the eight-state reducible
multichain model of a serial production process for which probabilities of
eventual passage to recurrent states are computed in Section 3.3.3.2.4. The
model is a reducible multichain, which has two absorbing states, one closed
set of three recurrent states, and three transient states. The transition prob-
ability matrix in canonical form is shown in Equation (3.7).
Recall that R3 = {3, 4, 5} denotes the closed communicating class of three
recurrent states, and T = {6, 7, 8} denotes the set of transient states. To compute
the probabilities of eventual passage from transient states to the closed class
of recurrent states represented by R3, R3 is replaced by an absorbing state.
States 1 and 2 remain absorbing states. The following associated absorbing
multichain is formed with transition probability matrix denoted by PA.

\[
P_A =
\begin{array}{ll|cccccc}
 & & 1 & 2 & R_3 & 6 & 7 & 8 \\ \hline
\text{Scrapped} & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\
\text{Sold} & 2 & 0 & 1 & 0 & 0 & 0 & 0 \\
\text{Training Center} & R_3 & 0 & 0 & 1 & 0 & 0 & 0 \\
\text{Stage 3} & 6 & 0.20 & 0.16 & 0.09 & 0.55 & 0 & 0 \\
\text{Stage 2} & 7 & 0.15 & 0 & 0 & 0.20 & 0.65 & 0 \\
\text{Stage 1} & 8 & 0.10 & 0 & 0 & 0 & 0.15 & 0.75
\end{array}
\]

\[
=
\begin{array}{ll|cccc}
\text{Scrapped} & 1 & 1 & 0 & 0 & 0 \\
\text{Sold} & 2 & 0 & 1 & 0 & 0 \\
\text{Training Center} & R_3 & 0 & 0 & 1 & 0 \\
\text{Production Stages} & T & D_1 & D_2 & d_3 & Q
\end{array},
\tag{3.114}
\]

where, with rows ordered by transient states 6, 7, and 8,

\[
D_1 = \begin{bmatrix} 0.20 \\ 0.15 \\ 0.10 \end{bmatrix},
\quad
D_2 = \begin{bmatrix} 0.16 \\ 0 \\ 0 \end{bmatrix},
\quad
d_3 = \begin{bmatrix} 0.04 + 0.03 + 0.02 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 0.09 \\ 0 \\ 0 \end{bmatrix},
\quad \text{and} \quad
Q = \begin{bmatrix} 0.55 & 0 & 0 \\ 0.20 & 0.65 & 0 \\ 0 & 0.15 & 0.75 \end{bmatrix},
\]

and the transition probability in each row of vector d3 is equal to the sum of
the transition probabilities in the same row of matrix D3 (written as DR in
Equation (3.103)). As the matrix Q for the original multichain is unchanged,
the fundamental matrix U for the absorbing multichain, calculated in
Equation (3.102), is also unchanged.
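Applying Equation (3.98),

\[
F = UD = U\,[D_1 \;\, D_2 \;\, d_3] =
\begin{array}{c|ccc}
\text{State} & 6 & 7 & 8 \\ \hline
6 & 2.2222 & 0 & 0 \\
7 & 1.2698 & 2.8571 & 0 \\
8 & 0.7619 & 1.7143 & 4
\end{array}
\;\;
\begin{array}{c|ccc}
 & 1 & 2 & R_3 = \{3, 4, 5\} \\ \hline
6 & 0.20 & 0.16 & 0.09 \\
7 & 0.15 & 0 & 0 \\
8 & 0.10 & 0 & 0
\end{array}
\]

\[
=
\begin{array}{c|ccc}
\text{State} & 1 & 2 & R_3 = \{3, 4, 5\} \\ \hline
6 & 0.4444 & 0.3556 & 0.2 \\
7 & 0.6825 & 0.2032 & 0.1143 \\
8 & 0.8095 & 0.1219 & 0.0686
\end{array}
=
\begin{array}{c|ccc}
6 & f_{61} & f_{62} & f_{6R_3} \\
7 & f_{71} & f_{72} & f_{7R_3} \\
8 & f_{81} & f_{82} & f_{8R_3}
\end{array}
\tag{3.115a}
\]

\[
\begin{bmatrix} f_{6R_3} \\ f_{7R_3} \\ f_{8R_3} \end{bmatrix}
=
\begin{bmatrix} 0.2 \\ 0.1143 \\ 0.0686 \end{bmatrix}.
\tag{3.115b}
\]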

Thus, the probability of eventually reaching the closed set of recurrent
states R3 = {3, 4, 5} from transient state 8, or production stage 1, is f8R3 =
0.0686.
As Equation (3.113) indicates, the probability of eventual passage from
a transient state i to a recurrent closed set denoted by R is simply the sum
of the probabilities of eventual passage from the transient state i to all of
the recurrent states that are members of R. Expressed as an equation,
the probability fiR of eventual passage from a transient state i to a recurrent
closed set R is

\[
f_{iR} = \sum_{j \in R} f_{ij},
\tag{3.116}
\]

where the recurrent states j belong to the recurrent closed set R. For this
example, the probability that an entering item will eventually be sent to the
training center is equal to

f8R = f83 + f84 + f85 = 0.0305 + 0.0229 + 0.0152 = 0.0686, (3.117)

which confirms the result obtained in Equation (3.104) without replacing
R3 = {3, 4, 5} with a single absorbing state.
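Method one can be sketched in code by summing the columns of D that belong to each closed class before multiplying by U. The example below assumes NumPy; the dictionary `classes` and the names `D_lumped`, `F_lumped`, and `F_summed` are illustrative and do not come from the text.

```python
import numpy as np

Q = np.array([[0.55, 0.00, 0.00],
              [0.20, 0.65, 0.00],
              [0.00, 0.15, 0.75]])
D = np.array([[0.20, 0.16, 0.04, 0.03, 0.02],   # columns: states 1, 2, 3, 4, 5
              [0.15, 0.00, 0.00, 0.00, 0.00],
              [0.10, 0.00, 0.00, 0.00, 0.00]])

# 0-based column indices of D that belong to each recurrent closed class.
classes = {"1": [0], "2": [1], "R3": [2, 3, 4]}

U = np.linalg.inv(np.eye(3) - Q)

# Method one: lump each class into one absorbing state by summing its columns of D.
D_lumped = np.column_stack([D[:, cols].sum(axis=1) for cols in classes.values()])
F_lumped = U @ D_lumped

# Cross-check against Equation (3.116): sum the individual f_ij over each class.
F = U @ D
F_summed = np.column_stack([F[:, cols].sum(axis=1) for cols in classes.values()])

print(np.round(F_lumped, 4))
```

The two routes agree because matrix multiplication is linear: summing columns of D before or after multiplying by U gives the same result.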


3.4.2 Method Two: Direct Calculation without Using the Fundamental Matrix
In the second method for calculating the probability of eventual passage to a
closed set, the fundamental matrix is not needed. Once again, let R represent
a designated closed set of recurrent states. The probability of eventual passage
from a transient state i to the designated closed set R is denoted by fiR. Starting
in transient state i, the chain may enter the designated recurrent set R in one or
more steps. The probability of entering R on the first step is the sum of the one-
step transition probabilities, pik, from transient state i to every recurrent state k,
which belongs to R. If the chain does not enter R on the first step, the chain may
move either to another closed set different from R, from which R will never be
reached, or to a transient state, h. In the latter case, there is a probability fhR of
eventual passage to R from the transient state, h. Therefore, for a reducible multichain,
the probabilities of eventual passage to a designated recurrent closed
class R are obtained by solving the following system of linear equations:

\[
f_{iR} = \sum_{k \in R} p_{ik} + \sum_{h \in T} p_{ih} f_{hR},
\tag{3.118}
\]

where i is a transient starting state, the first sum is over all recurrent states,
k, which belong to R, and the second sum is over all transient states, h, which
belong to T, the set of all transient states.
Consider again the generic five-state reducible multichain treated by method
one in Section 3.4.1.1. The chain has two recurrent closed sets, one of which
is an absorbing state, plus two transient states. The transition probability
matrix is represented in canonical form in Equation (3.5).
Method 2 will be applied to compute the probabilities of absorption in
state 1 from the transient states 4 and 5. These absorption probabilities are
denoted by f41 and f51, respectively. Applying Equation (3.118),

\[
\begin{aligned}
f_{41} &= p_{41} + p_{44} f_{41} + p_{45} f_{51} \\
f_{51} &= p_{51} + p_{54} f_{41} + p_{55} f_{51}.
\end{aligned}
\tag{3.119}
\]

The solution of system (3.119) for the absorption probabilities is

\[
\begin{aligned}
f_{41} &= \frac{p_{41}(1 - p_{55}) + p_{45} p_{51}}{(1 - p_{44})(1 - p_{55}) - p_{45} p_{54}} \\
f_{51} &= \frac{p_{51}(1 - p_{44}) + p_{54} p_{41}}{(1 - p_{44})(1 - p_{55}) - p_{45} p_{54}}.
\end{aligned}
\tag{3.120}
\]

Method 2 will also be applied to compute the probabilities of eventual
passage from the transient states 4 and 5 to the closed set of recurrent states,
R2 = {2, 3}. The probability of eventual passage from a transient state i to a
designated closed set R2 is denoted by fiR2.

\[
\begin{aligned}
f_{4R_2} &= (p_{42} + p_{43}) + (p_{44} f_{4R_2} + p_{45} f_{5R_2}) \\
f_{5R_2} &= (p_{52} + p_{53}) + (p_{54} f_{4R_2} + p_{55} f_{5R_2}).
\end{aligned}
\tag{3.121}
\]

The solution of system (3.121) for the probabilities of eventual passage to
R2 = {2, 3} is

\[
\begin{aligned}
f_{4R_2} &= \frac{(1 - p_{55})(p_{42} + p_{43}) + p_{45}(p_{52} + p_{53})}{(1 - p_{44})(1 - p_{55}) - p_{45} p_{54}} \\
f_{5R_2} &= \frac{(1 - p_{44})(p_{52} + p_{53}) + p_{54}(p_{42} + p_{43})}{(1 - p_{44})(1 - p_{55}) - p_{45} p_{54}}.
\end{aligned}
\tag{3.122}
\]

These values agree with those obtained by using method 1 in Equation
(3.112) of Section 3.4.1.1. Note that both f41 + f4R2 = 1 and f51 + f5R2 = 1 because
eventual passage from a transient state to one of the recurrent closed classes
in a multichain is certain.
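Method two amounts to solving (I − Q) f = b, where b holds the one-step probabilities of entering the designated closed set. The sketch below assumes NumPy and uses hypothetical numerical values for the generic five-state chain; neither the numbers in `P` nor the helper `passage_to_closed_set` appear in the text.

```python
import numpy as np

# Hypothetical five-state multichain in the canonical form of Equation (3.5).
P = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.4, 0.6, 0.0, 0.0],
              [0.0, 0.7, 0.3, 0.0, 0.0],
              [0.1, 0.2, 0.1, 0.3, 0.3],
              [0.2, 0.1, 0.1, 0.2, 0.4]])
T = [3, 4]                                # 0-based indices of transient states 4 and 5
closed_sets = {"R1": [0], "R2": [1, 2]}   # 0-based indices of each recurrent closed set

Q = P[np.ix_(T, T)]

def passage_to_closed_set(R):
    """Solve Equation (3.118): (I - Q) f = b, with b_i = sum of p_ik over k in R."""
    b = P[np.ix_(T, R)].sum(axis=1)
    return np.linalg.solve(np.eye(len(T)) - Q, b)

f_R1 = passage_to_closed_set(closed_sets["R1"])   # f41, f51
f_R2 = passage_to_closed_set(closed_sets["R2"])   # f4R2, f5R2

# Closed form of f4R2 from Equation (3.122), for comparison.
p = lambda i, j: P[i - 1, j - 1]
den = (1 - p(4, 4)) * (1 - p(5, 5)) - p(4, 5) * p(5, 4)
f4R2_closed = ((1 - p(5, 5)) * (p(4, 2) + p(4, 3)) + p(4, 5) * (p(5, 2) + p(5, 3))) / den

print(np.round(f_R1, 4), np.round(f_R2, 4), round(f4R2_closed, 4))
```

For any valid choice of probabilities, `f_R1 + f_R2` should equal one in every component, mirroring the closing remark that passage to some recurrent closed class is certain.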

3.5 Limiting Transition Probability Matrix


A one-step transition probability from state i to state j of a Markov chain is
denoted by $p_{ij}$. An n-step transition probability is denoted by $p_{ij}^{(n)}$. Limiting
transition probabilities govern the behavior of the chain after a large number
of transitions, as n→∞, and the chain has entered a steady state. Thus, a
limiting transition probability is designated by $\lim_{n\to\infty} p_{ij}^{(n)}$. The matrix of limiting
transition probabilities is denoted by $\lim_{n\to\infty} P^{(n)}$, where P is the matrix
of one-step transition probabilities. As Equation (2.7) indicates, the limiting
transition probability matrix for an irreducible Markov chain is obtained by
solving the equation $\lim_{n\to\infty} P^{(n)} = \Pi$, where $\Pi$ is a matrix with each row π, the
steady-state probability vector.
The matrix of limiting transition probabilities for a reducible Markov chain
is also denoted by $\lim_{n\to\infty} P^{(n)}$. Prior to calculating the limiting transition probabilities
for a reducible Markov chain, the transition probability matrix is
arranged in the canonical form of Equation (3.1) [1, 2].

3.5.1 Recurrent Multichain


As Section 1.9.2 indicates, a multichain with no transient states is called a
recurrent multichain. The state space for a recurrent multichain can be partitioned
into two or more disjoint closed classes of recurrent states. The transition
probability matrix for a recurrent multichain with M recurrent chains
has the following canonical form:

\[
P =
\begin{bmatrix}
P_1 & 0 & \cdots & 0 \\
0 & P_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & P_M
\end{bmatrix},
\tag{3.123}
\]

where P1, . . . , PM are the transition probability matrices of the M separate
recurrent chains denoted by R1, . . . , RM, respectively. Once a chain enters a
closed set of recurrent states, it never leaves that set. Suppose that i and j are
two recurrent states belonging to the same recurrent closed class Rk, which
contains N states. If $p(k)_{ij}^{(n)}$ is an n-step transition probability for Pk, let

\[
\lim_{n \to \infty} p(k)_{ij}^{(n)} = \pi(k)_j,
\tag{3.124}
\]

where π(k)j is the steady-state probability for state j, which belongs to the
recurrent chain Rk. Let

\[
\pi(k) = [\pi(k)_1 \;\; \pi(k)_2 \;\; \cdots \;\; \pi(k)_N]
\tag{3.125}
\]

denote the steady-state probability vector or limiting probability vector for
Pk, so that

\[
\pi(k) = \pi(k) P_k.
\tag{3.126}
\]

This result can be generalized to apply to all recurrent chains belonging to
a recurrent multichain. The limiting transition probability matrix is

\[
\lim_{n \to \infty} P^n =
\begin{bmatrix}
\lim_{n \to \infty} P_1^n & 0 & 0 \\
0 & \ddots & 0 \\
0 & 0 & \lim_{n \to \infty} P_M^n
\end{bmatrix}
=
\begin{bmatrix}
\Pi(1) & 0 & 0 \\
0 & \ddots & 0 \\
0 & 0 & \Pi(M)
\end{bmatrix},
\tag{3.127}
\]

where Π(k) is a matrix with each row π(k), the steady-state probability vector
for the transition probability matrix Pk.
The limiting transition probability matrix is computed for the following
numerical example of a four-state recurrent multichain with M = 2 recurrent
chains:

\[
P =
\begin{array}{c|cccc}
 & 1 & 2 & 3 & 4 \\ \hline
1 & 0.2 & 0.8 & 0 & 0 \\
2 & 0.6 & 0.4 & 0 & 0 \\
3 & 0 & 0 & 0.7 & 0.3 \\
4 & 0 & 0 & 0.5 & 0.5
\end{array}
=
\begin{bmatrix} P_1 & 0 \\ 0 & P_2 \end{bmatrix}
\tag{3.128}
\]

\[
\lim_{n \to \infty} P^n =
\begin{bmatrix} \lim_{n \to \infty} P_1^n & 0 \\ 0 & \lim_{n \to \infty} P_2^n \end{bmatrix}
=
\begin{bmatrix} \Pi(1) & 0 \\ 0 & \Pi(2) \end{bmatrix}
=
\begin{array}{c|cccc}
1 & \pi(1)_1 & \pi(1)_2 & 0 & 0 \\
2 & \pi(1)_1 & \pi(1)_2 & 0 & 0 \\
3 & 0 & 0 & \pi(2)_3 & \pi(2)_4 \\
4 & 0 & 0 & \pi(2)_3 & \pi(2)_4
\end{array}
\]

\[
=
\begin{bmatrix}
0.4286 & 0.5714 & 0 & 0 \\
0.4286 & 0.5714 & 0 & 0 \\
0 & 0 & 0.625 & 0.375 \\
0 & 0 & 0.625 & 0.375
\end{bmatrix}.
\tag{3.129}
\]
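The entries of Equation (3.129) can be reproduced by solving Equation (3.126) for each block separately. The sketch below assumes NumPy; the helper `steady_state` is an illustrative name, and solving the stacked system by least squares is one convenient choice among several.

```python
import numpy as np

def steady_state(Pk):
    """Solve pi = pi Pk with sum(pi) = 1 for an irreducible block Pk."""
    n = Pk.shape[0]
    # Stack the balance equations (Pk^T - I) pi = 0 with the normalization row.
    A = np.vstack([Pk.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

P1 = np.array([[0.2, 0.8], [0.6, 0.4]])   # recurrent chain R1 = {1, 2}
P2 = np.array([[0.7, 0.3], [0.5, 0.5]])   # recurrent chain R2 = {3, 4}
pi1, pi2 = steady_state(P1), steady_state(P2)

# Assemble the block-diagonal limit: every row of block k equals pi(k).
limit = np.zeros((4, 4))
limit[:2, :2] = np.tile(pi1, (2, 1))
limit[2:, 2:] = np.tile(pi2, (2, 1))

print(np.round(limit, 4))
```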

3.5.2 Absorbing Markov Chain


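As Equation (3.8) indicates, the canonical form of the two-step transition
matrix for an absorbing Markov chain, including an absorbing multichain, is

\[
P^2 =
\begin{bmatrix} I & 0 \\ D & Q \end{bmatrix}
\begin{bmatrix} I & 0 \\ D & Q \end{bmatrix}
=
\begin{bmatrix} I^2 & 0 \\ D + QD & Q^2 \end{bmatrix}
=
\begin{bmatrix} I & 0 \\ (I + Q)D & Q^2 \end{bmatrix}.
\tag{3.130}
\]

The canonical form of the three-step transition matrix is

\[
P^3 = PP^2 =
\begin{bmatrix} I & 0 \\ D & Q \end{bmatrix}
\begin{bmatrix} I^2 & 0 \\ (I + Q)D & Q^2 \end{bmatrix}
=
\begin{bmatrix} I^3 & 0 \\ (I + Q + Q^2)D & Q^3 \end{bmatrix}
=
\begin{bmatrix} I & 0 \\ (I + Q + Q^2)D & Q^3 \end{bmatrix}.
\tag{3.131}
\]

By raising P to successively higher powers, one can show that the canonical
form of the n-step transition matrix is

\[
P^n =
\begin{bmatrix} I & 0 \\ (I + Q + Q^2 + \cdots + Q^{n-1})D & Q^n \end{bmatrix}.
\tag{3.132}
\]

Hence,

\[
\begin{aligned}
\lim_{n \to \infty} P^n
&= \begin{bmatrix} I & 0 \\ \lim_{n \to \infty}(I + Q + Q^2 + \cdots + Q^{n-1})D & \lim_{n \to \infty} Q^n \end{bmatrix} \\
&= \begin{bmatrix} I & 0 \\ (I - Q)^{-1} D & \lim_{n \to \infty} Q^n \end{bmatrix}
= \begin{bmatrix} I & 0 \\ UD & \lim_{n \to \infty} Q^n \end{bmatrix}
= \begin{bmatrix} I & 0 \\ F & \lim_{n \to \infty} Q^n \end{bmatrix},
\end{aligned}
\tag{3.133}
\]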
where $U = (I - Q)^{-1}$ is the fundamental matrix defined in Equation (3.10),
and F = UD is the matrix of eventual passage probabilities and absorption
probabilities defined in Equation (3.76). As Equation (3.69) indicates, as n
approaches infinity, $Q^n$ approaches the null matrix. That is,

\[
\lim_{n \to \infty} Q^n = 0.
\tag{3.69}
\]

Hence, the limiting transition probability matrix for an absorbing Markov
chain, including an absorbing multichain, is

\[
\lim_{n \to \infty} P^n =
\begin{bmatrix} I & 0 \\ F & \lim_{n \to \infty} Q^n \end{bmatrix}
=
\begin{bmatrix} I & 0 \\ F & 0 \end{bmatrix}.
\tag{3.134}
\]
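Equation (3.134) can be illustrated by comparing a high power of P with the block matrix built from F. The following sketch assumes NumPy and uses a hypothetical four-state absorbing chain; the numerical entries of `P` are illustrative only and are not taken from the text.

```python
import numpy as np

# Hypothetical absorbing chain in canonical form: states 1 and 2 absorbing, 3 and 4 transient.
P = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.2, 0.1, 0.4, 0.3],
              [0.1, 0.3, 0.2, 0.4]])
D, Q = P[2:, :2], P[2:, 2:]

U = np.linalg.inv(np.eye(2) - Q)   # fundamental matrix
F = U @ D                          # absorption probabilities

# Limiting matrix predicted by Equation (3.134).
limit = np.block([[np.eye(2), np.zeros((2, 2))],
                  [F,         np.zeros((2, 2))]])

# A large power of P should be numerically indistinguishable from the limit.
P_200 = np.linalg.matrix_power(P, 200)
print(np.max(np.abs(P_200 - limit)))
```

The power 200 is an arbitrary choice; any exponent large enough for Q^n to vanish to working precision would serve.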

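Consider the generic four-state absorbing multichain of Section 3.3.3.1.2, in
which states 1 and 2 are absorbing states. The transition probability matrix
in canonical form is shown below:

\[
P = [p_{ij}] =
\begin{array}{c|cccc}
 & 1 & 2 & 3 & 4 \\ \hline
1 & 1 & 0 & 0 & 0 \\
2 & 0 & 1 & 0 & 0 \\
3 & p_{31} & p_{32} & p_{33} & p_{34} \\
4 & p_{41} & p_{42} & p_{43} & p_{44}
\end{array}
=
\begin{bmatrix} I & 0 \\ D & Q \end{bmatrix}.
\tag{3.85}
\]

The matrix of absorption probabilities is calculated as

\[
F = \frac{1}{(1 - p_{33})(1 - p_{44}) - p_{34} p_{43}}
\begin{bmatrix}
(1 - p_{44}) p_{31} + p_{34} p_{41} & (1 - p_{44}) p_{32} + p_{34} p_{42} \\
p_{43} p_{31} + (1 - p_{33}) p_{41} & p_{43} p_{32} + (1 - p_{33}) p_{42}
\end{bmatrix}
=
\begin{bmatrix} f_{31} & f_{32} \\ f_{41} & f_{42} \end{bmatrix}.
\tag{3.82}
\]

The limiting transition probability matrix for this absorbing multichain is

\[
\lim_{n \to \infty} P^n =
\begin{bmatrix} I & 0 \\ F & 0 \end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
f_{31} & f_{32} & 0 & 0 \\
f_{41} & f_{42} & 0 & 0
\end{bmatrix}.
\tag{3.135}
\]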

3.5.3 Absorbing Markov Chain Model of Patient Flow in a Hospital


Consider the absorbing multichain model of patient flow in a hospital
for which the transition probability matrix is given in canonical form in
Equation (1.59). The matrix of absorption probabilities is calculated in
Equation (3.9.3). The matrix of limiting probabilities for this absorbing
multichain is

\[
\lim_{n \to \infty} P^n =
\begin{bmatrix} I & 0 \\ F & 0 \end{bmatrix}
=
\begin{array}{c|cccccc}
\text{State} & 0 & 5 & 1 & 2 & 3 & 4 \\ \hline
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
5 & 0 & 1 & 0 & 0 & 0 & 0 \\
1 & 0.7284 & 0.2714 & 0 & 0 & 0 & 0 \\
2 & 0.7327 & 0.2671 & 0 & 0 & 0 & 0 \\
3 & 0.7246 & 0.2750 & 0 & 0 & 0 & 0 \\
4 & 0.7294 & 0.2703 & 0 & 0 & 0 & 0
\end{array}.
\tag{3.136}
\]

A patient receiving physical therapy, in state 4, has a probability of f40 = 0.7294
of eventually being discharged, and a probability of f45 = 0.2703 of eventually
dying.

3.5.4 Reducible Unichain


As Section 1.9.1.2 indicates, a reducible unichain has a set of transient states
plus one closed set of recurrent states. The canonical form of the transition
probability matrix is given in Equation (3.1). The canonical form of the two-step
transition probability matrix is

\[
P^2 = PP =
\begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}
\begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}
=
\begin{bmatrix} S^2 & 0 \\ DS + QD & Q^2 \end{bmatrix}
=
\begin{bmatrix} S^2 & 0 \\ D_2 & Q^2 \end{bmatrix},
\tag{3.137}
\]

where

\[
D_2 = DS + QD.
\tag{3.138}
\]

The canonical form of the three-step transition matrix is

\[
P^3 = PP^2 =
\begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}
\begin{bmatrix} S^2 & 0 \\ D_2 & Q^2 \end{bmatrix}
=
\begin{bmatrix} S^3 & 0 \\ DS^2 + QD_2 & Q^3 \end{bmatrix}
=
\begin{bmatrix} S^3 & 0 \\ D_3 & Q^3 \end{bmatrix},
\tag{3.139}
\]

where

\[
\begin{aligned}
D_3 &= DS^2 + QD_2 = DS^2 + Q(DS + QD) = DS^2 + QDS + Q^2 D = (DS + QD)S + Q^2 D \\
&= D_2 S + Q^2 D.
\end{aligned}
\tag{3.140}
\]

By raising P to successively higher powers, one can show that the canonical
form of the n-step transition matrix is

\[
P^n =
\begin{bmatrix} S^n & 0 \\ D_n & Q^n \end{bmatrix},
\tag{3.141}
\]


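where

\[
D_1 = D
\tag{3.142}
\]

and

\[
D_n = D_{n-1} S + Q^{n-1} D.
\tag{3.143}
\]

Using Equation (3.69),

\[
\lim_{n \to \infty} P^n =
\begin{bmatrix} \lim_{n \to \infty} S^n & 0 \\ \lim_{n \to \infty} D_n & \lim_{n \to \infty} Q^n \end{bmatrix}
=
\begin{bmatrix} \lim_{n \to \infty} S^n & 0 \\ \lim_{n \to \infty} D_n & 0 \end{bmatrix}.
\tag{3.144}
\]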
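The recursion in Equations (3.142) and (3.143) can be checked against a direct matrix power. The sketch below assumes NumPy and, purely for illustration, borrows the numerical unichain of Equation (1.56); the names `D_n` and `Q_pow` and the choice n = 10 are arbitrary.

```python
import numpy as np

# Unichain in canonical form (values of Equation (1.56)): states 1, 2 recurrent; 3, 4 transient.
P = np.array([[0.2, 0.8, 0.0, 0.0],
              [0.6, 0.4, 0.0, 0.0],
              [0.2, 0.3, 0.5, 0.0],
              [0.3, 0.2, 0.1, 0.4]])
S, D, Q = P[:2, :2], P[2:, :2], P[2:, 2:]

n = 10
D_n = D.copy()       # D_1 = D                               (Equation 3.142)
Q_pow = Q.copy()     # holds Q^(k-1) at the start of step k
for k in range(2, n + 1):
    D_n = D_n @ S + Q_pow @ D   # D_k = D_(k-1) S + Q^(k-1) D  (Equation 3.143)
    Q_pow = Q_pow @ Q

# Direct n-step matrix; its lower-left block should equal D_n (Equation 3.141).
P_n = np.linalg.matrix_power(P, n)
print(np.max(np.abs(P_n[2:, :2] - D_n)))
```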

If states i and j are both recurrent, then they belong to the same recurrent
closed class, which acts as a separate irreducible chain with transition prob-
ability matrix S. The unichain has only one recurrent closed class, which is
therefore certain to be reached from a transient state. Once the unichain has
entered the recurrent closed class, its limiting behavior is governed by the
steady-state probability vector for the recurrent closed class.
If π = [πj] is the steady-state probability vector for the recurrent closed class, then

\[
\begin{aligned}
\pi &= \pi S, \\
\pi e &= 1, \\
\text{and} \quad \pi_j &= \lim_{n \to \infty} p_{ij}^{(n)}.
\end{aligned}
\tag{3.145}
\]

Thus,

\[
\lim_{n \to \infty} S^{(n)} = \Pi,
\tag{3.146}
\]

where Π is a matrix with each row π.


Suppose that state i is transient and state j is recurrent. Then transitions
from transient state i to recurrent state j are governed by submatrix D. Once
again, $\lim_{n\to\infty} p_{ij}^{(n)} = \pi_j$ because the chain is certain to eventually enter the
closed set of recurrent states. Thus,

\[
\lim_{n \to \infty} D_n = \lim_{n \to \infty} S^{(n)} = \Pi.
\tag{3.147}
\]

In other words, if j is a recurrent state, then $\lim_{n\to\infty} p_{ij}^{(n)} = \pi_j$, irrespective of
whether the starting state i is transient or recurrent. Therefore, the limiting
transition probability matrix for a reducible unichain is

\[
\lim_{n \to \infty} P^n =
\begin{bmatrix} \lim_{n \to \infty} S^n & 0 \\ \lim_{n \to \infty} D_n & 0 \end{bmatrix}
=
\begin{bmatrix} \Pi & 0 \\ \Pi & 0 \end{bmatrix}.
\tag{3.148}
\]


Once the unichain has entered the recurrent closed set, which is certain to
be reached, its limiting behavior is governed by the steady-state probability
vector for the closed set.

3.5.4.1 Reducible Four-State Unichain


Consider the generic four-state reducible unichain for which the transition
matrix is shown in canonical form in Equation (3.2).

\[
P = [p_{ij}] =
\begin{array}{c|cccc}
 & 1 & 2 & 3 & 4 \\ \hline
1 & p_{11} & p_{12} & 0 & 0 \\
2 & p_{21} & p_{22} & 0 & 0 \\
3 & p_{31} & p_{32} & p_{33} & p_{34} \\
4 & p_{41} & p_{42} & p_{43} & p_{44}
\end{array}
=
\begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}.
\tag{3.2}
\]

The steady-state probability vector for the generic two-state transition matrix
S was obtained in Equation (2.16). Using Equation (3.148), the limiting transition
probability matrix for P is

\[
\lim_{n \to \infty} P^n =
\begin{bmatrix} \Pi & 0 \\ \Pi & 0 \end{bmatrix}
=
\begin{bmatrix}
\pi_1 & \pi_2 & 0 & 0 \\
\pi_1 & \pi_2 & 0 & 0 \\
\pi_1 & \pi_2 & 0 & 0 \\
\pi_1 & \pi_2 & 0 & 0
\end{bmatrix}
=
\begin{bmatrix}
p_{21}/(p_{12} + p_{21}) & p_{12}/(p_{12} + p_{21}) & 0 & 0 \\
p_{21}/(p_{12} + p_{21}) & p_{12}/(p_{12} + p_{21}) & 0 & 0 \\
p_{21}/(p_{12} + p_{21}) & p_{12}/(p_{12} + p_{21}) & 0 & 0 \\
p_{21}/(p_{12} + p_{21}) & p_{12}/(p_{12} + p_{21}) & 0 & 0
\end{bmatrix}.
\tag{3.149}
\]

3.5.4.2 Reducible Unichain Model of Machine Maintenance


The model of machine maintenance introduced in Section 1.10.1.2.2.1 is a
four-state unichain. For the modified maintenance policy, under which the
engineer who manages the production process always does nothing when the
machine is in state 3, the probability of first passage in n steps was computed
in Section 3.3.2.1.4, and the probability of eventual passage was computed in
Section 3.3.3.1.3. The transition probability matrix associated with the
modified maintenance policy is shown in canonical form in Equation (1.56).

\[
P = [p_{ij}] =
\begin{array}{c|cccc}
 & 1 & 2 & 3 & 4 \\ \hline
1 & 0.2 & 0.8 & 0 & 0 \\
2 & 0.6 & 0.4 & 0 & 0 \\
3 & 0.2 & 0.3 & 0.5 & 0 \\
4 & 0.3 & 0.2 & 0.1 & 0.4
\end{array}
=
\begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}.
\tag{1.56}
\]

The transition probability matrix associated with the original maintenance
policy, under which the engineer overhauls the machine when it is in state 3,
is shown in canonical form in Equation (1.55).

\[
P = [p_{ij}] =
\begin{array}{c|cccc}
 & 1 & 2 & 3 & 4 \\ \hline
1 & 0.2 & 0.8 & 0 & 0 \\
2 & 0.6 & 0.4 & 0 & 0 \\
3 & 0 & 0 & 0.2 & 0.8 \\
4 & 0.3 & 0.2 & 0.1 & 0.4
\end{array}
=
\begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}.
\tag{1.55}
\]

Observe that under both maintenance policies, states 1 and 2 form a recurrent
closed class, while states 3 and 4 are transient. Submatrix S, which governs tran-
sitions for the recurrent chain, is identical under both maintenance policies.
Using Equation (2.16), the vector of steady-state probabilities for submatrix S is

\[
\pi = [\pi_1 \;\; \pi_2] = [0.6/(0.8 + 0.6) \;\;\; 0.8/(0.8 + 0.6)] = [3/7 \;\;\; 4/7].
\tag{3.150}
\]

Under both policies, using Equation (3.149), the limiting transition probability
matrix is

\[
\lim_{n \to \infty} P^n =
\begin{bmatrix} \Pi & 0 \\ \Pi & 0 \end{bmatrix}
=
\begin{bmatrix}
\pi_1 & \pi_2 & 0 & 0 \\
\pi_1 & \pi_2 & 0 & 0 \\
\pi_1 & \pi_2 & 0 & 0 \\
\pi_1 & \pi_2 & 0 & 0
\end{bmatrix}
=
\begin{bmatrix}
3/7 & 4/7 & 0 & 0 \\
3/7 & 4/7 & 0 & 0 \\
3/7 & 4/7 & 0 & 0 \\
3/7 & 4/7 & 0 & 0
\end{bmatrix}.
\tag{3.151}
\]

In the long run, under both maintenance policies, the machine will be in
state 1, not working, 3/7 of the time, and in state 2, working, with a major
defect, the remaining 4/7 of the time.
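The claim that both maintenance policies share the same limiting matrix can be checked by raising each transition matrix to a high power. The sketch below assumes NumPy; the helper `build` and the exponent 500 are illustrative choices, not part of the text.

```python
import numpy as np

S = np.array([[0.2, 0.8],
              [0.6, 0.4]])   # recurrent block, identical under both policies

# Rows for transient states 3 and 4 under each policy.
rows_modified = np.array([[0.2, 0.3, 0.5, 0.0], [0.3, 0.2, 0.1, 0.4]])  # Equation (1.56)
rows_original = np.array([[0.0, 0.0, 0.2, 0.8], [0.3, 0.2, 0.1, 0.4]])  # Equation (1.55)

def build(transient_rows):
    """Assemble the canonical-form matrix [[S, 0], [D, Q]]."""
    top = np.hstack([S, np.zeros((2, 2))])
    return np.vstack([top, transient_rows])

pi = np.array([0.6, 0.8]) / (0.8 + 0.6)   # Equation (3.150): [3/7, 4/7]
expected = np.hstack([np.tile(pi, (4, 1)), np.zeros((4, 2))])

limits = {name: np.linalg.matrix_power(build(rows), 500)
          for name, rows in [("modified", rows_modified), ("original", rows_original)]}

for name, L in limits.items():
    print(name, np.round(L[2:], 4))   # limiting rows for transient states 3 and 4
```

The policies differ only in the transient rows, which affect how quickly the chain reaches the recurrent class but not where it settles.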

3.5.5 Reducible Multichain


Consider a reducible multichain that has M recurrent chains, denoted by
R1, . . . , RM, with the respective transition probability matrices P1, . . . , PM, plus a
set of transient states. The canonical form of the transition probability matrix
is shown in Equation (3.4).

\[
P =
\begin{bmatrix}
P_1 & 0 & 0 & 0 \\
0 & \ddots & 0 & 0 \\
0 & 0 & P_M & 0 \\
D_1 & \cdots & D_M & Q
\end{bmatrix}
=
\begin{bmatrix} S & 0 \\ D & Q \end{bmatrix},
\tag{3.4}
\]

where the transition probability matrix is also expressed in aggregated form,
so that

\[
S =
\begin{bmatrix}
P_1 & 0 & 0 \\
0 & \ddots & 0 \\
0 & 0 & P_M
\end{bmatrix},
\qquad
D = [D_1 \;\; \cdots \;\; D_M].
\tag{3.152}
\]

By combining the results obtained in Equation (3.127) for a recurrent multichain
and in Equation (3.144) for a reducible unichain, the limiting transition
probability matrix for a reducible multichain has the form:

\[
\lim_{n \to \infty} P^n =
\begin{bmatrix}
\lim_{n \to \infty} P_1^n & \cdots & 0 & 0 \\
\vdots & \ddots & \vdots & \vdots \\
0 & \cdots & \lim_{n \to \infty} P_M^n & 0 \\
\lim_{n \to \infty} D_{1,n} & \cdots & \lim_{n \to \infty} D_{M,n} & \lim_{n \to \infty} Q^n
\end{bmatrix}
=
\begin{bmatrix}
\lim_{n \to \infty} P_1^n & \cdots & 0 & 0 \\
\vdots & \ddots & \vdots & \vdots \\
0 & \cdots & \lim_{n \to \infty} P_M^n & 0 \\
\lim_{n \to \infty} D_{1,n} & \cdots & \lim_{n \to \infty} D_{M,n} & 0
\end{bmatrix}
\]

\[
=
\begin{bmatrix}
\Pi(1) & \cdots & 0 & 0 \\
\vdots & \ddots & \vdots & \vdots \\
0 & \cdots & \Pi(M) & 0 \\
\lim_{n \to \infty} D_{1,n} & \cdots & \lim_{n \to \infty} D_{M,n} & 0
\end{bmatrix},
\tag{3.153}
\]

where Π(j) is a matrix with each row π(j), the steady-state probability vector
for the transition probability matrix Pj, and Dj,n is the matrix of n-step transition
probabilities from transient states to the recurrent chain Rj.
As Section 3.4.1 and Equation (3.116) indicate, in a reducible multichain, the
probability of eventual passage from a transient state i to a recurrent closed
class is simply the sum of the probabilities of eventual passage from the transient
state i to all of the recurrent states within the closed class. Suppose
that Rk represents a recurrent chain for which the transition probabilities are
governed by the matrix Pk. The probability fi,Rk of eventual passage from a
transient state i to the recurrent closed class Rk is given by

\[
f_{iR_k} = \sum_{j \in R_k} f_{ij},
\tag{3.116}
\]

where the recurrent states j belong to the recurrent closed class Rk.
The limiting probability of a transition from a transient state i to a state j
belonging to the recurrent closed class Rk is equal to the product of fi,Rk, the
probability of eventual passage from the transient state i to the recurrent
chain Rk, and π(k)j, the steady-state probability for state j, which belongs to the
recurrent closed class Rk. That is,

\[
\lim_{n \to \infty} p_{ij}^{(n)} = f_{i,R_k}\, \pi(k)_j
\tag{3.154}
\]

represents the limiting probability of a transition from a transient state i to a
state j within the recurrent closed class Rk.

3.5.5.1 Reducible Five-State Multichain


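Consider the generic five-state reducible multichain, treated most recently
in Section 3.4.1.1. The canonical form of the transition probability matrix is
shown below:

\[
P =
\begin{array}{c|ccccc}
 & 1 & 2 & 3 & 4 & 5 \\ \hline
1 & 1 & 0 & 0 & 0 & 0 \\
2 & 0 & p_{22} & p_{23} & 0 & 0 \\
3 & 0 & p_{32} & p_{33} & 0 & 0 \\
4 & p_{41} & p_{42} & p_{43} & p_{44} & p_{45} \\
5 & p_{51} & p_{52} & p_{53} & p_{54} & p_{55}
\end{array}
=
\begin{bmatrix} P_1 & 0 & 0 \\ 0 & P_2 & 0 \\ D_1 & D_2 & Q \end{bmatrix}.
\tag{3.5}
\]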

Recall that state 1 is an absorbing state. The two recurrent closed sets are
denoted by R1 = {1} and R2 = {2, 3}. The set of transient states is denoted by
T = {4, 5}. The probabilities of eventual passage to the individual recurrent
states, states 2 and 3, were computed in Section 3.3.3.2.3 in Equations (3.97)
and (3.101). The probabilities of eventual passage to R2 = {2, 3} were computed
in Section 3.4.1.1 in Equations (3.111) through (3.113), and in Section 3.4.2 in
Equation (3.122). The limiting transition probability matrix for the reducible
multichain has the form

\[
\lim_{n \to \infty} P^n = \lim_{n \to \infty}
\begin{array}{c|ccccc}
1 & 1 & 0 & 0 & 0 & 0 \\
2 & 0 & p_{22}^{(n)} & p_{23}^{(n)} & 0 & 0 \\
3 & 0 & p_{32}^{(n)} & p_{33}^{(n)} & 0 & 0 \\
4 & p_{41}^{(n)} & p_{42}^{(n)} & p_{43}^{(n)} & p_{44}^{(n)} & p_{45}^{(n)} \\
5 & p_{51}^{(n)} & p_{52}^{(n)} & p_{53}^{(n)} & p_{54}^{(n)} & p_{55}^{(n)}
\end{array}
\]

\[
=
\begin{array}{c|ccccc}
1 & 1 & 0 & 0 & 0 & 0 \\
2 & 0 & \pi_2 & \pi_3 & 0 & 0 \\
3 & 0 & \pi_2 & \pi_3 & 0 & 0 \\
4 & f_{41} & \lim_{n \to \infty} p_{42}^{(n)} & \lim_{n \to \infty} p_{43}^{(n)} & 0 & 0 \\
5 & f_{51} & \lim_{n \to \infty} p_{52}^{(n)} & \lim_{n \to \infty} p_{53}^{(n)} & 0 & 0
\end{array},
\tag{3.155}
\]


where f41 and f51 are the absorption probabilities for transient states
4 and 5, respectively, and π2 and π3 are the steady-state probabilities for
recurrent states 2 and 3, respectively. Using Equation (3.116), the probability
fiR2 of eventual passage from a transient state i to the recurrent chain
R2 = {2, 3}, is

\[
f_{i,R_2} = f_{i2} + f_{i3}.
\tag{3.156}
\]

Using Equation (3.154), the limiting probability, $\lim_{n\to\infty} p_{ij}^{(n)}$, of a transition from a
transient state i to a state j belonging to the recurrent closed class R2 = {2, 3} is

\[
\lim_{n \to \infty} p_{ij}^{(n)} = (f_{i2} + f_{i3})\, \pi_j = f_{iR_2}\, \pi_j.
\tag{3.157}
\]

Therefore, the limiting transition probability matrix for the five-state reducible
multichain is

\[
\lim_{n \to \infty} P^n =
\begin{array}{c|ccccc}
1 & 1 & 0 & 0 & 0 & 0 \\
2 & 0 & \pi_2 & \pi_3 & 0 & 0 \\
3 & 0 & \pi_2 & \pi_3 & 0 & 0 \\
4 & f_{41} & (f_{42} + f_{43})\pi_2 & (f_{42} + f_{43})\pi_3 & 0 & 0 \\
5 & f_{51} & (f_{52} + f_{53})\pi_2 & (f_{52} + f_{53})\pi_3 & 0 & 0
\end{array}
=
\begin{array}{c|ccccc}
1 & 1 & 0 & 0 & 0 & 0 \\
2 & 0 & \pi_2 & \pi_3 & 0 & 0 \\
3 & 0 & \pi_2 & \pi_3 & 0 & 0 \\
4 & f_{41} & f_{4R_2}\pi_2 & f_{4R_2}\pi_3 & 0 & 0 \\
5 & f_{51} & f_{5R_2}\pi_2 & f_{5R_2}\pi_3 & 0 & 0
\end{array}.
\tag{3.158}
\]
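The structure of Equation (3.158) can be checked on a concrete instance. The sketch below assumes NumPy and uses hypothetical transition probabilities for the generic five-state chain (the entries of `P` are illustrative, not from the text); it builds the limit from F and π and compares it with a high power of P.

```python
import numpy as np

# Hypothetical five-state multichain in the canonical form of Equation (3.5).
P = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.4, 0.6, 0.0, 0.0],
              [0.0, 0.7, 0.3, 0.0, 0.0],
              [0.1, 0.2, 0.1, 0.3, 0.3],
              [0.2, 0.1, 0.1, 0.2, 0.4]])
Q, D = P[3:, 3:], P[3:, :3]

F = np.linalg.inv(np.eye(2) - Q) @ D   # columns: f_i1, f_i2, f_i3 for i = 4, 5
f_R2 = F[:, 1] + F[:, 2]               # Equation (3.156)

# Steady state [pi_2, pi_3] of the two-state block P2, using the two-state formula.
p23, p32 = P[1, 2], P[2, 1]
pi = np.array([p32, p23]) / (p23 + p32)

# Assemble the limiting matrix of Equation (3.158).
limit = np.zeros((5, 5))
limit[0, 0] = 1.0
limit[1:3, 1:3] = pi                   # rows for recurrent states 2 and 3
limit[3:, 0] = F[:, 0]                 # absorption probabilities f41, f51
limit[3:, 1:3] = np.outer(f_R2, pi)    # Equation (3.157): f_iR2 * pi_j

P_500 = np.linalg.matrix_power(P, 500)
print(np.max(np.abs(P_500 - limit)))
```

The comparison relies on the recurrent block being aperiodic, which holds here because states 2 and 3 both have self-transitions.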

3.5.5.2 Reducible Multichain Model of an Eight-State Serial Production Process
Consider the eight-state reducible multichain model of a serial production
process for which probabilities of eventual passage to the recurrent closed
set R3 = {3, 4, 5} are computed in Section 3.4.1.2. The transition probability
matrix in canonical form is shown in Equation (3.7). The limiting transition
probability matrix for the eight-state serial production process has the
following form:

\[
\lim_{n \to \infty} P^n =
\begin{array}{c|cccccccc}
 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\ \hline
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
2 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
3 & 0 & 0 & \pi_3 & \pi_4 & \pi_5 & 0 & 0 & 0 \\
4 & 0 & 0 & \pi_3 & \pi_4 & \pi_5 & 0 & 0 & 0 \\
5 & 0 & 0 & \pi_3 & \pi_4 & \pi_5 & 0 & 0 & 0 \\
6 & f_{61} & f_{62} & (f_{63} + f_{64} + f_{65})\pi_3 & (f_{63} + f_{64} + f_{65})\pi_4 & (f_{63} + f_{64} + f_{65})\pi_5 & 0 & 0 & 0 \\
7 & f_{71} & f_{72} & (f_{73} + f_{74} + f_{75})\pi_3 & (f_{73} + f_{74} + f_{75})\pi_4 & (f_{73} + f_{74} + f_{75})\pi_5 & 0 & 0 & 0 \\
8 & f_{81} & f_{82} & (f_{83} + f_{84} + f_{85})\pi_3 & (f_{83} + f_{84} + f_{85})\pi_4 & (f_{83} + f_{84} + f_{85})\pi_5 & 0 & 0 & 0
\end{array}
\]

\[
=
\begin{array}{c|cccccccc}
 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\ \hline
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
2 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
3 & 0 & 0 & \pi_3 & \pi_4 & \pi_5 & 0 & 0 & 0 \\
4 & 0 & 0 & \pi_3 & \pi_4 & \pi_5 & 0 & 0 & 0 \\
5 & 0 & 0 & \pi_3 & \pi_4 & \pi_5 & 0 & 0 & 0 \\
6 & f_{61} & f_{62} & f_{6R_3}\pi_3 & f_{6R_3}\pi_4 & f_{6R_3}\pi_5 & 0 & 0 & 0 \\
7 & f_{71} & f_{72} & f_{7R_3}\pi_3 & f_{7R_3}\pi_4 & f_{7R_3}\pi_5 & 0 & 0 & 0 \\
8 & f_{81} & f_{82} & f_{8R_3}\pi_3 & f_{8R_3}\pi_4 & f_{8R_3}\pi_5 & 0 & 0 & 0
\end{array},
\tag{3.159}
\]

where, for i = 6, 7, and 8,

\[
f_{iR_3} = f_{i3} + f_{i4} + f_{i5}.
\tag{3.160}
\]

The vectors of probabilities of absorption in absorbing states 1 and 2,
starting from the set of transient states T = {6, 7, 8}, are computed in Equations
(3.103) and (3.115). The matrix of probabilities of eventual passage from
the set of transient states, T = {6, 7, 8}, to the three individual recurrent
states, which belong to the recurrent closed class R3 = {3, 4, 5}, is computed
in Equation (3.103). The matrix of probabilities of eventual passage from
the set of transient states, T = {6, 7, 8}, to the recurrent closed class R3 =
{3, 4, 5} is computed in Equations (3.115).
The steady-state probability vector for the transition probability matrix P3,
which governs transitions among the three recurrent states that belong to
the recurrent chain R3 = {3, 4, 5}, is calculated by solving equations (2.13).

0.50 0.30 0.20  


 
[π 3 π4 π 5 ] = [π 3 π4 π 5 ] 0.30 0.45 0.25 
0.10 0.35 0.55  (3.161)

π3 + π4 + π5 = 1 


The solution for the steady-state probability vector for P3 is

\[
\pi = \begin{bmatrix} \pi_3 & \pi_4 & \pi_5 \end{bmatrix} = \begin{bmatrix} 0.2909 & 0.3727 & 0.3364 \end{bmatrix}. \tag{3.162}
\]
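As a numerical cross-check of Equation (3.162), the following minimal sketch approximates the steady-state vector by repeated multiplication (power iteration) rather than by solving the linear system (3.161) directly. The function name `steady_state` and the iteration count are illustrative assumptions, not part of the text.

```python
# Transition probabilities among recurrent states 3, 4, 5 (matrix P3 in Equation 3.161).
P3 = [
    [0.50, 0.30, 0.20],
    [0.30, 0.45, 0.25],
    [0.10, 0.35, 0.55],
]


def steady_state(P, iterations=200):
    """Approximate pi = pi P by power iteration (assumes a regular chain)."""
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        # One step of pi <- pi P (row vector times matrix).
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


pi = steady_state(P3)
print([round(x, 4) for x in pi])  # [0.2909, 0.3727, 0.3364]
```

The exact solution of (3.161) is [32/110, 41/110, 37/110], which rounds to the values shown in Equation (3.162).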

Using Equation (3.153) with M = 3, the limiting transition probability matrix
for the eight-state serial production process has the form:

\[
\lim_{n\to\infty} P^n =
\begin{bmatrix}
\Pi(1) & 0 & 0 & 0 \\
0 & \Pi(2) & 0 & 0 \\
0 & 0 & \Pi(3) & 0 \\
\lim_{n\to\infty} D_{1,n} & \lim_{n\to\infty} D_{2,n} & \lim_{n\to\infty} D_{3,n} & 0
\end{bmatrix}. \tag{3.153}
\]

Using Equations (3.154) and (3.159), the matrix \(\lim_{n\to\infty} D_{3,n}\) of the limiting
probabilities of transitions from the set of transient states, T = {6, 7, 8}, associated
with the three production stages, to the three recurrent states, which belong
to the recurrent chain R3 = {3, 4, 5} associated with the training center, is
given by

\[
\begin{aligned}
\lim_{n\to\infty} D_{3,n}
&= \begin{bmatrix}
\lim_{n\to\infty} p_{63}^{(n)} & \lim_{n\to\infty} p_{64}^{(n)} & \lim_{n\to\infty} p_{65}^{(n)} \\
\lim_{n\to\infty} p_{73}^{(n)} & \lim_{n\to\infty} p_{74}^{(n)} & \lim_{n\to\infty} p_{75}^{(n)} \\
\lim_{n\to\infty} p_{83}^{(n)} & \lim_{n\to\infty} p_{84}^{(n)} & \lim_{n\to\infty} p_{85}^{(n)}
\end{bmatrix}
= \begin{bmatrix}
f_{6R_3}\pi_3 & f_{6R_3}\pi_4 & f_{6R_3}\pi_5 \\
f_{7R_3}\pi_3 & f_{7R_3}\pi_4 & f_{7R_3}\pi_5 \\
f_{8R_3}\pi_3 & f_{8R_3}\pi_4 & f_{8R_3}\pi_5
\end{bmatrix} \\
&= \begin{bmatrix}
0.2(0.2909) & 0.2(0.3727) & 0.2(0.3364) \\
0.1143(0.2909) & 0.1143(0.3727) & 0.1143(0.3364) \\
0.0686(0.2909) & 0.0686(0.3727) & 0.0686(0.3364)
\end{bmatrix}
= \begin{bmatrix}
0.0582 & 0.0745 & 0.0673 \\
0.0332 & 0.0426 & 0.0385 \\
0.0199 & 0.0256 & 0.0231
\end{bmatrix}.
\end{aligned} \tag{3.163}
\]
For example,

\[
\lim_{n\to\infty} p_{84}^{(n)} = (f_{83} + f_{84} + f_{85})\pi_4 = f_{8R_3}\pi_4 = (0.0686)(0.3727) = 0.0256 \tag{3.164}
\]

represents the limiting probability, or long-run proportion of time, that an
entering item which has been sent to the training center will be used for
training technicians.
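The structure of Equation (3.163) is that each row is the scalar probability of eventual passage to R3 times the steady-state vector of R3. A minimal sketch of that product follows; the dictionary names `f_R3`, `pi`, and `limit_D3` are illustrative assumptions. Because the inputs are the four-decimal values quoted in the text, the last digit of a product can differ from the printed matrix, which appears to rest on unrounded inputs.

```python
# Probabilities of eventual passage from transient states 6, 7, 8 to the class R3
# (four-decimal values quoted in Equation 3.163).
f_R3 = {6: 0.2, 7: 0.1143, 8: 0.0686}

# Steady-state probabilities of recurrent states 3, 4, 5 (Equation 3.162).
pi = {3: 0.2909, 4: 0.3727, 5: 0.3364}

# Entry (i, j) of the limiting block is f_{iR3} * pi_j.
limit_D3 = {i: {j: f_R3[i] * pi[j] for j in pi} for i in f_R3}

for i in sorted(limit_D3):
    print(i, [round(limit_D3[i][j], 4) for j in sorted(pi)])
```

Each row sums to the corresponding f value, since the steady-state probabilities sum to one.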
By combining all of the results in Section 3.5.5.2, the limiting transition
probability matrix for the reducible eight-state multichain model of the serial
production process is

\[
\lim_{n\to\infty} P^{(n)} =
\begin{array}{c|cccccccc}
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
2 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
3 & 0 & 0 & 0.2909 & 0.3727 & 0.3364 & 0 & 0 & 0 \\
4 & 0 & 0 & 0.2909 & 0.3727 & 0.3364 & 0 & 0 & 0 \\
5 & 0 & 0 & 0.2909 & 0.3727 & 0.3364 & 0 & 0 & 0 \\
6 & 0.4444 & 0.3556 & 0.0582 & 0.0745 & 0.0673 & 0 & 0 & 0 \\
7 & 0.6825 & 0.2032 & 0.0332 & 0.0426 & 0.0385 & 0 & 0 & 0 \\
8 & 0.8095 & 0.1219 & 0.0199 & 0.0256 & 0.0231 & 0 & 0 & 0
\end{array}. \tag{3.165}
\]

3.5.5.3 Conditional Mean Time to Absorption


In this section, the conditional mean time to absorption will be calculated
for an absorbing multichain when it is known that the process will end in
a target absorbing state [5]. For example, consider the absorbing multichain
model of a production process for which the transition probability matrix
P is expressed in canonical form in Equation (3.114). Suppose that, for all
items which will eventually be sold, management wishes to compute the
mean number of transitions until an entering item is sold. That is, manage-
ment wants to calculate the mean time to absorption in the target absorbing
state 2 when the process starts in transient state 8. To calculate this number,
the absorbing multichain, which has three absorbing states, is transformed
into an absorbing unichain, which has only state 2 as an absorbing state.
The absorbing unichain will have a modified transition probability matrix
denoted by \(\tilde{P} = [\tilde{p}_{ij}]\). Recall that the matrix of absorption probabilities, \(F = [f_{ij}]\),
for the production process is calculated in Equation (3.115a). When the
conditional process starts in a transient state i, the transition probabilities \(\tilde{p}_{ij}\) are
computed in the following manner:



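\[
\begin{aligned}
\tilde{p}_{ij} &= P(i \to j \text{ in unichain}) \\
&= P(i \to j \text{ in multichain} \mid \text{multichain state } i \text{ absorbed in state 2}) \\
&= \frac{P(i \to j \text{ in multichain} \cap \text{multichain state } j \text{ absorbed in state 2})}{P(\text{multichain state } i \text{ absorbed in state 2})} \\
&= \frac{P(i \to j \text{ in multichain})\, P(\text{multichain state } j \text{ absorbed in state 2})}{P(\text{multichain state } i \text{ absorbed in state 2})} \\
&= \frac{p_{ij} f_{j2}}{f_{i2}}.
\end{aligned} \tag{3.166}
\]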

The transition probability matrix \(\tilde{P}\) for the absorbing unichain is shown
below:

\[
\tilde{P} =
\begin{array}{c|cccc}
  & 2 & 6 & 7 & 8 \\ \hline
2 & 1 & 0 & 0 & 0 \\
6 & 0.45 & 0.55 & 0 & 0 \\
7 & 0 & 0.35 & 0.65 & 0 \\
8 & 0 & 0 & 0.25 & 0.75
\end{array}
= \begin{bmatrix} I & 0 \\ \tilde{D} & \tilde{Q} \end{bmatrix}. \tag{3.167}
\]

The fundamental matrix \(\tilde{U}\) for the absorbing unichain is

\[
\tilde{U} = (I - \tilde{Q})^{-1} =
\begin{array}{c|ccc}
  & 6 & 7 & 8 \\ \hline
6 & 2.2222 & 0 & 0 \\
7 & 2.2222 & 2.8571 & 0 \\
8 & 2.2222 & 2.8571 & 4
\end{array}. \tag{3.168}
\]

Hence, the mean time to absorption in state 2 for an entering item is 9.0793
steps, the sum of the entries in the third row of \(\tilde{U}\).
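The fundamental matrix in Equation (3.168) can be checked numerically from the transient block of Equation (3.167). The sketch below inverts I minus that block with a small Gauss-Jordan routine; the helper name `invert` and the variable names are illustrative assumptions.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I].
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Choose the row with the largest pivot to limit round-off error.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Transient block of the conditioned chain (states 6, 7, 8) from Equation (3.167).
Q_tilde = [
    [0.55, 0.00, 0.00],
    [0.35, 0.65, 0.00],
    [0.00, 0.25, 0.75],
]
size = len(Q_tilde)
I_minus_Q = [[(1.0 if i == j else 0.0) - Q_tilde[i][j] for j in range(size)]
             for i in range(size)]

U_tilde = invert(I_minus_Q)
mean_steps_from_8 = sum(U_tilde[2])  # row for state 8, the entering state

for row in U_tilde:
    print([round(x, 4) for x in row])
print(round(mean_steps_from_8, 4))
```

The unrounded row sum is 20/9 + 20/7 + 4 = 572/63, approximately 9.0794; the figure 9.0793 in the text is the sum of the entries after rounding each to four decimals.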

PROBLEMS
3.1 Three basketball players compete in a basketball foul shoot-
ing contest. The eligible players are allowed one foul shot per
round. After each round, all players who miss their foul shots
are eliminated, and the remaining players participate in the
next round. The contest ends when a single player who has not
missed a foul shot remains, and is declared the winner, or when
all players have been eliminated, and no one wins. In the past,
player A has made an average of 80% of his foul shots, player B
has made an average of 75%, and player C has made an average
of 70%.
This contest can be modeled as an absorbing multichain by
choosing as states all sets of players who have not been elimi-
nated. For example, if two players remain, the corresponding
states are the three pairs (A, B), (A, C), and (B, C).
(a) Construct the transition probability matrix.
(b) What is the probability that the contest will end without a
winner?
(c) What is the probability that player A will win the contest?
(d) If players B and C are the remaining contestants, what is the
expected number of rounds needed before C wins?
3.2 Three hockey players compete in a defensive contest. In each
round, each eligible player is allowed one shot at the goal of the
remaining player who has the highest career scoring average.
After each round, all players who allow their opponent to score
a goal are eliminated, and the remaining players participate
in the next round. The contest ends when a single player who
has not allowed a goal remains, and is declared the winner, or
when all players have been eliminated, and no one wins. The
career scoring averages of players A, B, and C are 40%, 35%, and
20%, respectively.
The contest begins with all three players eligible to compete
in the first round. Player A will shoot his puck toward the goal
of B, A’s competitor who has the highest career scoring aver-
age. Next, player B will shoot his puck toward the goal of A, B’s
competitor who has the highest career scoring average. Finally,
player C will shoot his puck toward the goal of A, C’s competi-
tor who has the highest career scoring average. When the round
ends, any player who has allowed a goal to be scored is elimi-
nated. Note that after the first round, in which all three players
compete, players A and B can both be eliminated if each scores a
goal against the other. Player C, who has the lowest career scor-
ing average, cannot be eliminated after the first round because
no other player will shoot his puck toward the goal of C. The
surviving players enter the next round.
This contest can be modeled as an absorbing multichain
by choosing as states all sets of players who have not
been eliminated. For example, if two players remain,
the corresponding states are the three pairs (A, B),
(A, C), and (B, C).
(a) Construct the transition probability matrix.
(b) What is the probability that the contest will end without a
winner?
(c) What is the probability that player A will win the contest?
(d) If players B and C are the remaining contestants, what is the
expected number of rounds needed before C wins?
3.3 A woman needs $5,000 for a down payment on a condominium.
She will try to raise the money for the down payment by gam-
bling. She will place a sequence of bets until she either accumu-
lates $5,000 or loses all her money. She starts with $2,000, and
will wager $1,000 on each bet. Each time that she bets $1,000,
she will win $1,000 with probability 0.4, or lose $1,000 with
probability 0.6.
(a) Model the woman’s gambling experience as a six-state absorb-
ing multichain. Let the state Xn denote the gambler’s revenue
after the nth bet. Construct the transition probability matrix.
(b) What is the expected number of bets that she will make?
(c) What is the probability that she will obtain her $5,000 down
payment?
3.4 Suppose that the woman seeking a $5,000 down payment in
Problem 3.3 has the option of betting either $1,000 or $2,000. She
chooses the following aggressive strategy. If she has $1,000 or
$4,000, she will bet $1,000, and will win $1,000 with probability
0.4 or lose $1,000 with probability 0.6. If she has $2,000 or $3,000,
she will bet $2,000, and will either win $2,000 with probability 0.05,
or win $1,000 with probability 0.15, or lose $2,000 with probability 0.8.
(a) Model the woman’s gambling experience under the aggres-
sive strategy as a six-state absorbing Markov chain. Let the
state represent the amount of money that she has when she
places a bet. Construct the transition probability matrix.
(b) What is the expected number of bets that she will make?
(c) What is the probability that she will obtain her $5,000 down
payment?
3.5 A consumer electronics dealer sells new flat panel televisions
for $1,000. The dealer offers customers a 4-year warranty. The
warranty provides free replacement of a television that fails
within the 4-year warranty period, but does not cover the cost
of a replaced TV.
The dealer discloses that 5% of new flat panel televisions fail
during their first year of life, 10% of 1-year-old televisions fail
during their second year of life, 15% of 2-year-old televisions
fail during their third year of life, and 20% of 3-year-old tele-
visions fail during their fourth year of life. Suppose that the
dealer sells a consumer a 4-year warranty for $140 along with
the television.
(a) Model the 4-year warranty experience of the dealer as a six-
state absorbing multichain with two absorbing states. Choose
one absorbing state to represent the replacement of a TV,
which has failed during the warranty period, and the other
to represent the survival of the TV until the end of the war-
ranty period. Choose the transient states to represent the age
of the television. Construct the transition probability matrix.
(b) What is the probability that the TV will have to be replaced
during the warranty period?
(c) What is the dealer’s expected revenue from selling a 4-year
warranty?
3.6 A production process contains three stages in series. An enter-
ing item starts in the first manufacturing stage. The output from
each stage is inspected. An item of acceptable quality is passed
on to the next stage, a defective item is scrapped, and an item
of marginal quality is reworked at the current stage. An item at
stage 1 has a 0.08 probability of being defective, a 0.12 probabil-
ity of being of marginal quality, and a 0.80 probability of being
of acceptable quality. An item at stage 2 has a 0.06 probability
of being defective, a 0.09 probability of being of marginal qual-
ity, and a 0.85 probability of being acceptable. The probabilities
that an item at stage 3 will be defective, marginal in quality,
or acceptable, are 0.04, 0.06, and 0.90, respectively. All items of
acceptable quality produced by stage 3 are sold.
(a) Model the production process as a five-state absorbing multichain
with two absorbing states. Choose one absorbing
state to represent scrapping a defective item, and the other to
represent selling an item of acceptable quality produced by
stage 3. Choose the transient states to represent the stages of
the production process. Construct the transition probability
matrix. Represent it in canonical form.
(b) Find the probability that an item will be scrapped, given
that it is in stage 3.
(c) Find the probability that an item will be sold without being
reworked.
(d) Find the probability that an item will be in stage 2 after
three inspections.
(e) Find the probability that an item will be scrapped.
(f) Find the probability that an item will be sold.
(g) Find the mean number of inspections that an item will
receive in stage 3.
(h) Given that an item is in stage 2, find the mean number of
inspections that it will receive in stage 3.
3.7 An unmanned expendable rocket is launched to place a com-
munications satellite in either high earth orbit (HEO) or low
earth orbit (LEO). HEO is the preferred destination. As the
rocket is tracked, a sequence of course correction signals is
sent to it. The system has five states, which are labeled as
follows:

State Description
1 On course to HEO
2 On course to LEO
3 Minor deviation from course to HEO
4 Major deviation from course to HEO
5 Abort mission

Assume that the system can be modeled by an absorbing Markov
chain in which states 1, 2, and 5 are absorbing, and states 3 and
4 are transient. The state of the system is observed after the nth
course correction. Suppose that the five-state absorbing multichain
has the following transition probability matrix:

P =
State                     1      2      3      4      5
1, On course to HEO       1      0      0      0      0
2, On course to LEO       0      1      0      0      0
3, Minor deviation        0.55   0.30   0.10   0.05   0
4, Major deviation        0      0.40   0.30   0.20   0.10
5, Abort                  0      0      0      0      1

(a) Represent the transition matrix in canonical form.
(b) If, upon launch, the rocket is observed to start in state 3, find
the probability that it eventually gets on course to LEO.
(c) If, upon launch, the rocket is observed to start in state 4,
find the probability that it eventually gets on course to
HEO.
(d) If, upon launch, the rocket is observed to start in state 3
with probability 0.8 or in state 4 with probability 0.2, find
the probability that it eventually gets on course to HEO.
(e) If, upon launch, the rocket is observed to start in state 3
with probability 0.8 or in state 4 with probability 0.2, find
the probability that the mission will eventually be aborted.
(f) If, upon launch, the rocket is observed to start in state 4, find
the probability that it will be in state 3 after three course
corrections.
(g) If, upon launch, the rocket is observed to start in state 3, find
the mean number of course corrections before it reaches
HEO or LEO or aborts.
3.8 In Problem 1.12, let d1 = d2 = d, r1 = r2= r3 = r, and p1 = p2 = p.
(a) Show that the following quantity is the fundamental
matrix:

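\[
U = (I - Q)^{-1} =
\begin{array}{l|ccc}
\text{State} & 1 & 2 & 3 \\ \hline
1 = \text{Assistant professor} & \dfrac{1}{d+p} & \dfrac{p}{(d+p)^2} & \dfrac{p^2}{(d+p)^3} \\
2 = \text{Associate professor} & 0 & \dfrac{1}{d+p} & \dfrac{p}{(d+p)^2} \\
3 = \text{Professor} & 0 & 0 & \dfrac{1}{d+p}
\end{array}
\]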

(b) Find the average number of years that a newly hired assis-
tant professor spends working for this college (in any aca-
demic rank).
(c) Find the average number of years that a newly hired assis-
tant professor spends working for this college as an asso-
ciate professor.
(d) Find the average number of years that a professor spends
working for this college (as a professor).
(e) Find the probability that a newly hired assistant professor
will eventually retire as a professor.
(f) Find the probability that an associate professor will eventu-
ally retire as a professor.
(g) Assuming that a faculty member at this college begins her
career as assistant professor, find the probability that she
will be an associate professor after 2 years.
3.9 In Problem 1.13, let d1 = 0.6, d2 = 0.5, d3 = 0.4, r1 = 0.3, p1 = 0.1,
r2 = 0.3, p2 = 0.2, pR = 0.1, pA = 0.2, r3 = 0.3, r4 = 0.7, and r5 = 0.8.
(a) Find the average number of years that a newly hired instructor
will spend working for this college (in any academic
rank).
(b) Find the average number of years that a newly hired
instructor will spend working for this college as an associate
professor.
(c) Find the probability that a newly hired instructor will even-
tually be promoted to the rank of professor.
(d) Find the probability that an associate professor will even-
tually be promoted to the rank of professor, initially as a
research professor.
(e) Assuming that a faculty member at this college begins her
career as assistant professor, find the probability that she
will be an associate professor after 2 years.
(f) In the long run, what proportion of associate professors will
be applications professors?
(g) In the long run, what proportion of faculty members will be
applications professors?
3.10 An investor buys a stock for $10. The monthly share price,
rounded to the nearest $5, has been varying among $0, $5, $10,
$15, and $20. She has been advised that the price will never
exceed $20. She has been informed that the price of the stock
can be modeled as a recurrent Markov chain in which the state,
Xn, denotes the share price at the end of month n. The state space
is E = {$0, $5, $10, $15, $20}. Her financial advisor believes that
the Markov chain will have the following transition probability
matrix:

P =
Xn \ Xn+1     0       5       10      15      20
 0            0.34    0.12    0.26    0.10    0.18
 5            0.12    0.28    0.32    0.20    0.08
10            0.22    0.24    0.30    0.14    0.10
15            0.10    0.20    0.40    0.24    0.06
20            0.08    0.18    0.38    0.22    0.14

The investor intends to sell the stock at the end of the first
month in which the share price rises to $20 or falls to $0. Under
this policy, the share price can be modeled as an absorbing
multichain with two absorbing states, $0 and $20. An absorbing
state is reached when the stock is sold. The remaining states,
which are entered when the stock is held during the present
month, are transient.
(a) Represent the transition probability matrix in canonical
form.
(b) Find the probability that she will eventually sell the stock
for $20.
(c) Find the probability that she will sell the stock after
3 months.
(d) Find the expected number of months until she sells the
stock.

References
1. Bhat, N., Elements of Applied Stochastic Processes, 2nd ed., Wiley, New York,
1985.
2. Cinlar, E., Introduction to Stochastic Processes, Prentice-Hall, Englewood Cliffs,
NJ, 1975.
3. Clarke, A. B. and Disney, R. L., Probability and Random Processes: A First Course
with Applications, 2nd ed., Wiley, New York, 1985.
4. Kemeny, J. G., Mirkil, H., Snell, J. L., and Thompson, G. L., Finite Mathematical
Structures, Prentice-Hall, Englewood Cliffs, NJ, 1959.
5. Kemeny, J. G. and Snell, J. L., Finite Markov Chains, Van Nostrand, Princeton, NJ,
1960. Reprinted by Springer-Verlag, New York, 1976.

4
A Markov Chain with Rewards (MCR)

When income or costs are associated with the states of a Markov chain, the
system is called a Markov chain with rewards, or MCR. This chapter, which
treats an MCR, has two objectives. The first is to show how to calculate the
economic value of an MCR. The second is to use an MCR to link a Markov
chain to a Markov decision process (MDP), thereby unifying the treatment
of both subjects. In Chapter 5, an MDP is constructed by associating decision
alternatives with a set of MCRs. Thus, an MDP can be viewed simply as a set
of Markov chains with rewards plus decisions.

4.1 Rewards
This chapter develops procedures for constructing a reward vector and cal-
culating the expected economic value of an MCR. As Section 1.2 indicates,
all Markov chains with rewards treated in this book are assumed to have a
finite number of states [2–4, 6, 7].

4.1.1 Planning Horizon


As Section 2.1 indicates, the set of consecutive, equally spaced time peri-
ods over which a Markov chain is analyzed is called a planning horizon.
As Section 1.2 indicates, an epoch is a point in time that marks the end of a
time period. That is, epoch n designates the end of period n, which is also
the beginning of period n + 1. A planning horizon can be finite or infinite.
If a planning horizon is finite, of length T periods, then it consists of T peri-
ods numbered consecutively 1, 2, … , T. The periods are separated by T + 1
epochs numbered consecutively 0, 1, 2, … , T. A planning horizon of length T
periods is shown in Figure 4.1.

[Figure: timeline on which epochs 0, 1, 2, ..., T − 1, T mark the boundaries of periods 1, 2, ..., T.]

FIGURE 4.1
Planning horizon of length T periods.


Note that epoch 1 marks the end of period 1, which coincides with the
beginning of period 2. The present time, denoted by epoch 0, marks the
beginning of period 1. A future time is denoted by epoch n, where n > 0.
Epoch T marks the end of period T, which is also the end of the planning
horizon. Often, the following terms will be used interchangeably: epoch n,
period n, time n, step n, and transition n.

4.1.2 Reward Vector


An MCR is a Markov chain that generates a sequence of rewards as it evolves
over time from state to state, in accordance with the transition probabilities, pij,
of the chain. Rewards are assumed to be generated at the epochs at which the
Markov chain moves from state to state. Rewards are expressed as discrete,
end-of-period cash flows, measured in dollars, which is the convention used
in engineering economic analysis. (Dollar signs are often omitted.) A positive
reward represents revenue, while a negative reward signifies cost. Each time
the chain visits a state i at epoch n, a reward, denoted by qi, is earned. Suppose
that at epoch n a chain is observed to be in state i, so that Xn = i. Then at epoch
n, the chain will receive a reward qi. The chain will move with transition prob-
ability pij to state j at epoch n + 1, so that Xn+1 = j. Both the rewards, qi, and the
transition probabilities, pij, are assumed to be stationary over time, that is, they
do not depend on the epoch, n. For an MCR starting in state g, a sequence of
states visited, the rewards earned in those states, and the state transition prob-
abilities is shown in Figure 4.2.
In some cases the reward may depend on the state visited at the next
epoch. In such cases, when the chain makes a transition from state i at epoch
n to state j at epoch n + 1 with transition probability pij, it immediately earns
an associated transition reward, denoted by rij. The transition rewards are
assumed to be stationary over time. The reward received in state i, denoted
by qi, may now be called an expected immediate reward. For an N-state
Markov chain, an expected immediate reward qi received in state i is equal to
the reward rij earned from a transition to state j weighted by the probability
pij of the transition. That is,

\[
q_i = \sum_{j=1}^{N} p_{ij} r_{ij}, \qquad i = 1, 2, \ldots, N. \tag{4.1}
\]
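Equation (4.1) is a row-by-row weighted sum, which the following minimal sketch makes concrete. The two-state matrices `P` and `R` contain hypothetical values chosen only for illustration, and the function name `expected_immediate_rewards` is likewise an assumption.

```python
def expected_immediate_rewards(P, R):
    """Return q with q[i] = sum over j of P[i][j] * R[i][j] (Equation 4.1)."""
    return [sum(p * r for p, r in zip(p_row, r_row)) for p_row, r_row in zip(P, R)]


# Hypothetical two-state example (values are illustrative, not from the text).
P = [[0.7, 0.3],
     [0.4, 0.6]]      # transition probabilities p_ij
R = [[10.0, -5.0],
     [2.0, 8.0]]      # transition rewards r_ij

q = expected_immediate_rewards(P, R)
print([round(x, 2) for x in q])  # [5.5, 5.6]
```

Here q1 = 0.7(10) + 0.3(−5) = 5.5 and q2 = 0.4(2) + 0.6(8) = 5.6.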

Epoch                     0         1         2         ...   n         n + 1
State                     X0 = g    X1 = h    X2 = k    ...   Xn = i    Xn+1 = j
Reward                    qg        qh        qk        ...   qi        qj
Transition probability    pg(0)     pgh       phk       ...   pij

FIGURE 4.2
Sequence of states, transitions, and rewards for an MCR.


In the most general case, the reward received in the present state is the sum
of a constant term and an expected immediate reward earned from a transi-
tion to the next state. In this book, qi will simply be termed a reward received
in state i, regardless of how it is earned.
Figure 1.3 in Section 1.2 demonstrates that a small Markov chain can be
represented by a transition probability graph. Similarly, a small MCR can
be represented by a transition probability and reward graph. For example,
consider a two-state generic MCR with the following transition probability
matrix P and reward vector q:

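\[
P = \begin{array}{c|cc}
\text{State} & 1 & 2 \\ \hline
1 & p_{11} & p_{12} \\
2 & p_{21} & p_{22}
\end{array},
\qquad
q = \begin{array}{c|c}
\text{State} & \text{Reward} \\ \hline
1 & q_1 = p_{11} r_{11} + p_{12} r_{12} \\
2 & q_2 = p_{21} r_{21} + p_{22} r_{22}
\end{array}. \tag{4.2}
\]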

The corresponding transition probability and reward graph is shown in
Figure 4.3.
The set of rewards for all states is collected in an N-component reward
vector,

\[
q = \begin{bmatrix} q_1 & q_2 & \cdots & q_N \end{bmatrix}^T. \tag{4.3}
\]

After each transition, a vector of expected rewards, which is a function of
both P and q, is received. The vector of expected rewards received at epoch 0,
after 0 steps, is equal to q. The vector of expected rewards received at epoch
1, after one step, is dependent on the transitions made after one step. Those
transitions are governed by the one-step transition probability matrix, P.
Therefore, the vector of expected rewards received at epoch 1 is Pq. Similarly,
the vector of expected rewards received at epoch 2, after two steps, is \(P^2 q\),
where \(P^2\) is the two-step transition probability matrix. Continuing to higher
numbered epochs, the vector of expected rewards received at epoch n, after
n steps, is given by Equation (4.4).

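[Figure: states 1 and 2 with rewards q1 and q2; arcs labeled with transition probability and transition reward: p11, r11 (state 1 to itself), p12, r12 (state 1 to state 2), p21, r21 (state 2 to state 1), and p22, r22 (state 2 to itself).]

FIGURE 4.3
Transition probability and reward graph for a two-state Markov chain with rewards.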


\[
\text{Vector of expected rewards received after } n \text{ steps} = P^n q, \tag{4.4}
\]

where \(P^n\) is the n-step transition probability matrix.
If \(p^{(0)}\) is an initial probability state vector, then, as Equation (1.27) has indicated,
\(p^{(n)} = p^{(0)} P^n\) is the vector of state probabilities after n steps. The expected
reward received after n steps is a scalar, given by

\[
\text{Expected reward received after } n \text{ steps} = p^{(0)} P^n q = p^{(n)} q. \tag{4.5}
\]

Vectors of expected rewards, \(P^n q\), and expected reward scalars, \(p^{(n)} q\), will be
calculated for an example of an MCR model in Section 4.2.2.1.

4.2 Undiscounted Rewards


In Section 4.2, no interest is charged for the use of money. When no interest
is charged, one dollar in the present has the same value as one dollar in the
future. In other words, in this section, future cash flows over a planning
horizon are not discounted. When cash flows are not discounted, Markov
chain structure is relevant in calculating the expected economic value of an
MCR. (When cash flows are discounted, as they are in Section 4.3, Markov
chain structure is not relevant.)

4.2.1 MCR Chain Structure


The calculation of the expected value of an undiscounted cash flow gener-
ated by an MCR is dependent on chain structure [7]. The chain structure of
an MCR is determined by the chain structure of the associated Markov chain,
which is discussed in Sections 1.9 and 3.1. An MCR is unichain if the state
space of the Markov chain consists of a single closed class of recurrent states
plus a possibly empty set of transient states. A recurrent MCR is a special
case of a unichain MCR with no transient states. Thus, an MCR is recurrent if
the state space is irreducible, that is, if all states belong to a single closed com-
municating class of recurrent states. A reducible unichain MCR has at least
one transient state. An MCR is multichain if the transition matrix consists
of two or more closed classes of recurrent states plus a possibly empty set
of transient states. A recurrent multichain MCR will consist of two or more
closed classes of recurrent states with no transient states. A reducible mul-
tichain MCR will contain one or more transient states. To simplify the ter-
minology, the adjective “reducible” will often be omitted in describing MCR
chain structure. Thus, a reducible unichain MCR will simply be termed a
unichain MCR, and a reducible multichain MCR will be called a multichain
MCR. To further simplify the terminology, a closed communicating class of
recurrent states will often be termed a recurrent chain.

4.2.2 A Recurrent MCR over a Finite Planning Horizon


In Section 4.2.2, a recurrent MCR is analyzed over a finite planning horizon,
while the planning horizon is assumed to be infinite in Section 4.2.3 [4, 7].

4.2.2.1 An MCR Model of Monthly Sales


Consider the following example of a recurrent MCR. (In Section 5.1.2.1 of
Chapter 5 this model will be transformed into an MDP by introducing a set
of MCRs plus decisions.) Suppose that a firm records its monthly sales at the
end of every month. Monthly sales, which have fluctuated widely over the
years, are ranked with respect to those of the firm’s competitors. The rank-
ings are expressed in quartiles. The fourth quartile is the highest rank, and
indicates that a firm’s monthly sales are greater than those of 75% of its com-
petitors. At the beginning of every month, the firm follows a particular policy
under which a unique action and reward are associated with each of the four
quartile ranks. For example, when monthly sales are in the first quartile, the
firm will always offer employee buyouts and earn a reward of −$20,000 (or
incur a cost of $20,000).
By following the prescribed policy for many years, the firm has found
that sales in any month can be used to forecast sales in the following
month. Hence, the sequence of monthly sales is believed to form a Markov
chain. The state Xn−1 denotes the quartile rank of monthly sales at the
beginning of month n. The chain has four states, which correspond to
the four quartiles. Estimates of the transition probabilities are based on
the firm’s historical record of monthly sales. Table 4.1 identifies the four
states accompanied by the associated actions and rewards prescribed by
the firm’s policy.
The firm constructs a model of a four-state MCR, which has the following
transition probability matrix, P, and reward vector, q, with the entries of q
expressed in thousands of dollars.

TABLE 4.1
States, Actions, and Rewards for Monthly Sales

Monthly Sales Quartile   State   Action                        Reward
First (lowest)           1       Offer employee buyouts        −$20,000
Second                   2       Reduce executive salaries     $5,000
Third                    3       Invest in new technology      −$5,000
Fourth (highest)         4       Make strategic acquisitions   $25,000

\[
P = \begin{array}{c|cccc}
1 & 0.60 & 0.30 & 0.10 & 0 \\
2 & 0.25 & 0.30 & 0.35 & 0.10 \\
3 & 0.05 & 0.25 & 0.50 & 0.20 \\
4 & 0 & 0.10 & 0.30 & 0.60
\end{array},
\qquad
q = \begin{array}{c|c}
1 & -20 \\
2 & 5 \\
3 & -5 \\
4 & 25
\end{array}. \tag{4.6}
\]

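Suppose that the initial probability state vector is

\[
\begin{aligned}
p^{(0)} &= \begin{bmatrix} 0.3 & 0.1 & 0.4 & 0.2 \end{bmatrix}
= \begin{bmatrix} p_1^{(0)} & p_2^{(0)} & p_3^{(0)} & p_4^{(0)} \end{bmatrix} \\
&= \begin{bmatrix} P(X_0 = 1) & P(X_0 = 2) & P(X_0 = 3) & P(X_0 = 4) \end{bmatrix}.
\end{aligned} \tag{4.7}
\]

Using Equation (4.5), the expected reward received at the start of month
one is

\[
p^{(0)} q = \begin{bmatrix} 0.3 & 0.1 & 0.4 & 0.2 \end{bmatrix}
\begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix} = -2.5. \tag{4.8}
\]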

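Using Equation (1.27), after 1 month the state probability vector is

\[
\begin{aligned}
p^{(1)} = p^{(0)} P &= \begin{bmatrix} 0.3 & 0.1 & 0.4 & 0.2 \end{bmatrix}
\begin{bmatrix}
0.60 & 0.30 & 0.10 & 0 \\
0.25 & 0.30 & 0.35 & 0.10 \\
0.05 & 0.25 & 0.50 & 0.20 \\
0 & 0.10 & 0.30 & 0.60
\end{bmatrix} \\
&= \begin{bmatrix} 0.225 & 0.240 & 0.325 & 0.210 \end{bmatrix}.
\end{aligned} \tag{4.9}
\]

Using Equation (4.5), the expected reward received after 1 month is
equal to

\[
p^{(0)} P q = p^{(1)} q = \begin{bmatrix} 0.225 & 0.240 & 0.325 & 0.210 \end{bmatrix}
\begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix} = 0.325. \tag{4.10}
\]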

Using Equation (4.4), the vector of expected rewards after 1 month, which is
independent of the initial probability state vector, is equal to

\[
Pq = \begin{bmatrix}
0.60 & 0.30 & 0.10 & 0 \\
0.25 & 0.30 & 0.35 & 0.10 \\
0.05 & 0.25 & 0.50 & 0.20 \\
0 & 0.10 & 0.30 & 0.60
\end{bmatrix}
\begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}
= \begin{array}{c|c}
1 & -11 \\
2 & -2.75 \\
3 & 2.75 \\
4 & 14
\end{array}. \tag{4.11}
\]

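Using Equation (1.28), after 2 months the state probability vector is

\[
\begin{aligned}
p^{(2)} = p^{(1)} P &= \begin{bmatrix} 0.225 & 0.240 & 0.325 & 0.210 \end{bmatrix}
\begin{bmatrix}
0.60 & 0.30 & 0.10 & 0 \\
0.25 & 0.30 & 0.35 & 0.10 \\
0.05 & 0.25 & 0.50 & 0.20 \\
0 & 0.10 & 0.30 & 0.60
\end{bmatrix} \\
&= \begin{bmatrix} 0.21125 & 0.24175 & 0.332 & 0.215 \end{bmatrix}.
\end{aligned} \tag{4.12}
\]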

Alternatively, using Equation (1.27), the state probability vector after
2 months is

\[
\begin{aligned}
p^{(2)} = p^{(0)} P^2 &= \begin{bmatrix} 0.3 & 0.1 & 0.4 & 0.2 \end{bmatrix}
\begin{bmatrix}
0.440 & 0.295 & 0.215 & 0.05 \\
0.2425 & 0.2625 & 0.335 & 0.16 \\
0.1175 & 0.235 & 0.4025 & 0.245 \\
0.04 & 0.165 & 0.365 & 0.43
\end{bmatrix} \\
&= \begin{bmatrix} 0.21125 & 0.24175 & 0.332 & 0.215 \end{bmatrix},
\end{aligned} \tag{4.13}
\]

where P^2 is the two-step transition probability matrix.


Using Equation (4.5), the expected reward received after 2 months is equal to

\[
p^{(1)} P q = p^{(2)} q = \begin{bmatrix} 0.21125 & 0.24175 & 0.332 & 0.215 \end{bmatrix}
\begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix} = 0.69875. \tag{4.14}
\]

Using Equation (4.4), the vector of expected rewards received after 2 months,
which is independent of the initial probability state vector, is equal to

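\[
P^2 q = \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \end{array}
\begin{bmatrix}
0.440 & 0.295 & 0.215 & 0.05 \\
0.2425 & 0.2625 & 0.335 & 0.16 \\
0.1175 & 0.235 & 0.4025 & 0.245 \\
0.04 & 0.165 & 0.365 & 0.43
\end{bmatrix}
\begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}
= \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \end{array}
\begin{bmatrix} -7.15 \\ -1.2125 \\ 2.9375 \\ 8.95 \end{bmatrix}. \tag{4.15}
\]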
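
The two-step quantities in Equations (4.13) through (4.15) follow the same pattern once P^2 is formed. The sketch below, in plain Python, is an illustration only; mat_mul, mat_vec, and vec_mat are helper names introduced here, not taken from the text.

```python
P = [
    [0.60, 0.30, 0.10, 0.00],
    [0.25, 0.30, 0.35, 0.10],
    [0.05, 0.25, 0.50, 0.20],
    [0.00, 0.10, 0.30, 0.60],
]
q = [-20.0, 5.0, -5.0, 25.0]
p0 = [0.3, 0.1, 0.4, 0.2]


def mat_mul(A, B):
    """Matrix product A B for square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_vec(M, x):
    """Matrix times column vector."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def vec_mat(p, M):
    """Row vector times matrix."""
    return [sum(p[i] * M[i][j] for i in range(len(p))) for j in range(len(M[0]))]


P2 = mat_mul(P, P)                                   # two-step matrix in (4.13)
p2 = vec_mat(p0, P2)                                 # p(2) = p(0) P^2
reward_month_2 = sum(a * b for a, b in zip(p2, q))   # p(2) q, Equation (4.14)
P2q = mat_vec(P2, q)                                 # Equation (4.15)

print([round(x, 5) for x in p2], round(reward_month_2, 5))
print([round(x, 4) for x in P2q])
```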


4.2.2.2 Value Iteration over a Fixed Planning Horizon


Assume that a decision maker chooses a finite planning horizon of fixed
length, equal to T time periods. Suppose that the expected total reward
earned by an MCR from epoch n until epoch T, the end of the planning hori-
zon, is of interest. Let vi(n) denote the expected total reward earned during
the next T − n periods, from epoch n to epoch T, if the system is in state i at
epoch n. Since epoch 0 marks the beginning of period 1, the present time,
the objective is to compute vi(0), which represents the expected total reward
earned during the next T − 0 = T periods, until the end of the planning hori-
zon, if the system starts in state i. Since the planning horizon ends at epoch
T, terminal values must be specified for the rewards earned in all states at
epoch T. That is, terminal values must be specified for vj(T), for j = 1, 2, . . . , N.
The terminal values represent trade-in or salvage values that will be received
at the end of the planning horizon. The set of expected total rewards for all
states at epoch n is collected in an N-component, expected total reward vec-
tor [4, 7],

v(n) = [v_1(n) v_2(n) ... v_N(n)]^T. (4.16)

4.2.2.2.1 Value Iteration Equation


Figure 4.4 shows a cash flow diagram involving a series of equal reward vectors,
q, earned at the beginning of each period for T periods. At the end of period T,
a vector of salvage values denoted by v(T) is earned. Observe that vector q is
earned at epochs 0 through T − 1, and that vector v(T) is earned at epoch T.

Cash flow:  q   q   q   ...   q       v(T)
Epoch:      0   1   2   ...   T − 1   T

FIGURE 4.4
Cash flow diagram for reward vectors over planning horizon of length T periods.

The reward vector received at the present time, epoch 0, is q. As Equation
(4.5) and Section 4.2.2.1 have demonstrated, the expected reward vector
received one period from now, at epoch 1, is Pq, where P is the one-step
transition probability matrix. The expected reward vector received two
periods from now, at epoch 2, is P^2 q, where P^2 is the two-step transition
probability matrix. Similarly, the expected reward vector received T − 1
periods from now, at epoch T − 1, is P^(T−1) q, where P^(T−1) is the
(T − 1)-step transition probability matrix. Finally, at the end of the planning
horizon, the expected salvage value is P^T v(T), where P^T is the T-step
transition probability matrix, and v(T) is the vector of salvage values received
at the end of period T. Rewards are additive. Therefore, the expected total
reward vector, v(0), is equal to the sum of the expected reward vectors, P^k q,
earned at each epoch k, for k = 0, 1, 2, . . . , T − 1, plus the expected salvage
value, P^T v(T), received at epoch T. That is,

\[
\begin{aligned}
v(0) &= P^0 q + P^1 q + P^2 q + P^3 q + \cdots + P^{T-1} q + P^T v(T) \\
&= Iq + Pq + P^2 q + P^3 q + \cdots + P^{T-1} q + P^T v(T) \\
&= q + Pq + P^2 q + P^3 q + \cdots + P^{T-1} q + P^T v(T).
\end{aligned} \tag{4.17}
\]

A backward recursive solution procedure called value iteration will be
developed to relate v(n) to v(n + 1). To simplify this development, consider a
planning horizon of length T = 4 periods. The cash flow diagram is shown
in Figure 4.5.
The backward recursive solution procedure begins by solving for v(3) over
a one-period horizon consisting of period 4 alone. Note that a salvage value,
v(4), is received one period after the end of period 3. The one-period cash
flow diagram is shown in Figure 4.6.
Using Equation (4.17) with T = 1 and the origin moved to n = 3, the solution
for v(3) is

v(3) = q + Pv(4). (4.18)

Next, v(2) is calculated over a two-period horizon consisting of periods 3 and
4. The two-period cash flow diagram is shown in Figure 4.7.
Using Equation (4.17) with T = 2 and the origin moved to n = 2, the solution
for v(2) is given below:

v(2) = q + Pq + P^2 v(4). (4.19)

q q q q v(4) Cash Flow

0 1 2 3 4 Epoch

FIGURE 4.5
Cash flow diagram for reward vectors over four-period planning horizon.

q v(4) Cash Flow

3 4 Epoch

FIGURE 4.6
One-period cash flow diagram.


q q v(4) Cash flow

2 3 4 Epoch

FIGURE 4.7
Two-period cash flow diagram.

q q q v(4) Cash flow

1 2 3 4 Epoch

FIGURE 4.8
Three-period cash flow diagram.

When P is factored out of terms two and three,

v(2) = q + P[q + Pv(4)]. (4.20)

Substituting v(3) from Equation (4.18),

v(2) = q + Pv(3). (4.21)

Next, v(1) is calculated over a three-period horizon consisting of periods 2, 3,
and 4. The three-period cash flow diagram is shown in Figure 4.8.
Using Equation (4.17) with T = 3 and the origin moved to n = 1, the solution
for v(1) is given below:

v(1) = q + Pq + P^2 q + P^3 v(4). (4.22)

When P is factored out of terms two, three, and four,

v(1) = q + P[q + Pq + P^2 v(4)]. (4.23)

Substituting v(2) from Equation (4.19),

v(1) = q + Pv(2). (4.24)

Finally, v(0) is calculated over the entire four-period horizon consisting of
periods 1, 2, 3, and 4. The original four-period cash flow diagram is shown
in Figure 4.5.


Using Equation (4.17) with T = 4, the solution for v(0) is given below:

v(0) = q + Pq + P^2 q + P^3 q + P^4 v(4). (4.25)

When P is factored out of terms two, three, four, and five,

v(0) = q + P[q + Pq + P^2 q + P^3 v(4)]. (4.26)

Substituting v(1) from Equation (4.22),

v(0) = q + Pv(1). (4.27)

Observe that at epochs 0, 1, 2, and 3 in the four-period planning
horizon, the following backward recursive equation in matrix form has been
established:

v(n) = q + Pv(n + 1), for n = 0, 1, 2, and 3, (4.28)

where the salvage value v(4) is specified.


By induction, the recursive equation (4.28) can be extended to any finite
planning horizon. Thus, for a planning horizon of length T periods, v(n) can
be computed in terms of v(n + 1) to produce Equation (4.29), which is called
the value iteration equation in matrix form:

v(n) = q + Pv(n + 1), for n = 0, 1, 2,..., T − 1, (4.29)

where the salvage value v(T) is specified.
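
The recursion in Equation (4.29) translates almost line for line into code. The following is a minimal sketch in plain Python, not an implementation from the text: the function name value_iteration and the small two-state chain used to exercise it are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
def value_iteration(P, q, v_terminal, T):
    """Return [v(0), v(1), ..., v(T)] computed from v(n) = q + P v(n + 1).

    P is the N x N transition matrix (list of rows), q the reward vector,
    and v_terminal the specified salvage vector v(T).
    """
    N = len(q)
    v = [None] * (T + 1)
    v[T] = list(v_terminal)
    # Move backward one epoch at a time, from n = T - 1 down to n = 0.
    for n in range(T - 1, -1, -1):
        v[n] = [q[i] + sum(P[i][j] * v[n + 1][j] for j in range(N))
                for i in range(N)]
    return v


# Hypothetical two-state example with zero salvage values and T = 2.
P_demo = [[0.5, 0.5],
          [0.25, 0.75]]
q_demo = [1.0, 2.0]
v_demo = value_iteration(P_demo, q_demo, [0.0, 0.0], T=2)
print(v_demo[0])  # v(0) for the hypothetical chain
```

With zero salvage values, v(1) equals q, so v(0) here is q + Pq, which is [2.5, 3.75] for this hypothetical chain.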


Consider a generic four-state MCR with transition probability matrix, P,
and reward vector, q. When N = 4, the expanded matrix form of the recursive
equation (4.28) is

\[
\begin{bmatrix} v_1(n) \\ v_2(n) \\ v_3(n) \\ v_4(n) \end{bmatrix}
= \begin{bmatrix} q_1 \\ q_2 \\ q_3 \\ q_4 \end{bmatrix}
+ \begin{bmatrix}
p_{11} & p_{12} & p_{13} & p_{14} \\
p_{21} & p_{22} & p_{23} & p_{24} \\
p_{31} & p_{32} & p_{33} & p_{34} \\
p_{41} & p_{42} & p_{43} & p_{44}
\end{bmatrix}
\begin{bmatrix} v_1(n+1) \\ v_2(n+1) \\ v_3(n+1) \\ v_4(n+1) \end{bmatrix}. \tag{4.30}
\]


In expanded algebraic form the four recursive equations (4.30) are

\[
\begin{aligned}
v_1(n) &= q_1 + p_{11} v_1(n+1) + p_{12} v_2(n+1) + p_{13} v_3(n+1) + p_{14} v_4(n+1) \\
v_2(n) &= q_2 + p_{21} v_1(n+1) + p_{22} v_2(n+1) + p_{23} v_3(n+1) + p_{24} v_4(n+1) \\
v_3(n) &= q_3 + p_{31} v_1(n+1) + p_{32} v_2(n+1) + p_{33} v_3(n+1) + p_{34} v_4(n+1) \\
v_4(n) &= q_4 + p_{41} v_1(n+1) + p_{42} v_2(n+1) + p_{43} v_3(n+1) + p_{44} v_4(n+1).
\end{aligned} \tag{4.31}
\]

In compact algebraic form, using summation signs, the four recursive equa-
tions (4.31) are

\[
\begin{aligned}
v_1(n) &= q_1 + \sum_{j=1}^{4} p_{1j} v_j(n+1) \\
v_2(n) &= q_2 + \sum_{j=1}^{4} p_{2j} v_j(n+1) \\
v_3(n) &= q_3 + \sum_{j=1}^{4} p_{3j} v_j(n+1) \\
v_4(n) &= q_4 + \sum_{j=1}^{4} p_{4j} v_j(n+1).
\end{aligned} \tag{4.32}
\]

The four algebraic equations (4.32) can be replaced by the single algebraic
equation

\[
v_i(n) = q_i + \sum_{j=1}^{4} p_{ij} v_j(n+1),
\quad \text{for } n = 0, 1, \ldots, T-1, \text{ and } i = 1, 2, 3, \text{ and } 4. \tag{4.33}
\]

This result can be generalized to apply to any N-state MCR. Therefore,
vi(n) can be computed in terms of vj(n + 1) by using the following algebraic
recursive equation, which is called the value iteration equation in algebraic
form:

\[
v_i(n) = q_i + \sum_{j=1}^{N} p_{ij} v_j(n+1),
\quad \text{for } n = 0, 1, \ldots, T-1, \text{ and } i = 1, 2, \ldots, N. \tag{4.34}
\]

Recall that vi(n) denotes the expected total reward earned until the end of
the planning horizon if the system is in state i at epoch n. The value
iteration equation (4.34) indicates that the total expected reward, vi(n), can
be expressed as the sum of two terms. The first term is the reward, qi, earned
at epoch n. The second term is the expected total reward, vj(n + 1), that will
be earned if the chain starts in state j at epoch n + 1, weighted by the
probability, pij, that state j can be reached in one step from state i, and
summed over all states j.
In summary, the value iteration equation expresses a backward recursive
relationship because it starts with a known set of salvage values for vector
v(T) at epoch T, the end of the planning horizon. Next, vector v(T − 1) is cal-
culated in terms of v(T). The value iteration procedure moves backward one
epoch at a time by calculating v(n) in terms of v(n + 1). The vectors, v(T − 2),
v(T − 3), … , v(1), and v(0) are computed in succession. The backward recursive
procedure, or backward recursion, stops at epoch 0 after v(0) is calculated in
terms of v(1). The component vi(0) of v(0) is the expected total reward earned
until the end of the planning horizon if the system starts in state i at epoch 0.
Figure 4.9 is a tree diagram of the value iteration equation in algebraic form
for a two-state MCR.

[Tree diagram: state i at epoch n, with reward qi and expected total reward
vi(n) = qi + pij vj(n + 1) + pih vh(n + 1), branches with probability pij to
state j and with probability pih to state h at epoch n + 1, whose expected
total rewards are vj(n + 1) and vh(n + 1).]

FIGURE 4.9
Tree diagram of value iteration for a two-state MCR.


4.2.2.2.2 Value Iteration for MCR Model of Monthly Sales


Consider the four-state MCR model of monthly sales introduced in Section
4.2.2.1 for which the transition probability matrix, P, and reward vector, q,
with the entries of q expressed in thousands of dollars, are shown in Equation
(4.6). Substituting the numerical values for P and q from Equation (4.6) into
Equation (4.30), the matrix equation for the backward recursion is

\[
\begin{bmatrix} v_1(n) \\ v_2(n) \\ v_3(n) \\ v_4(n) \end{bmatrix}
= \begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}
+ \begin{bmatrix}
0.60 & 0.30 & 0.10 & 0 \\
0.25 & 0.30 & 0.35 & 0.10 \\
0.05 & 0.25 & 0.50 & 0.20 \\
0 & 0.10 & 0.30 & 0.60
\end{bmatrix}
\begin{bmatrix} v_1(n+1) \\ v_2(n+1) \\ v_3(n+1) \\ v_4(n+1) \end{bmatrix}. \tag{4.35}
\]

Value iteration will be executed by solving the backward recursive equations
in matrix form (4.35) to calculate the expected total reward vector over
a 3-month planning horizon. To begin the backward recursion, the salvage
values at the end of the planning horizon are set equal to zero for all states.

n = T = 3:

\[
v(3) = v(T) = 0, \qquad
\begin{bmatrix} v_1(n) \\ v_2(n) \\ v_3(n) \\ v_4(n) \end{bmatrix}
= \begin{bmatrix} v_1(T) \\ v_2(T) \\ v_3(T) \\ v_4(T) \end{bmatrix}
= \begin{bmatrix} v_1(3) \\ v_2(3) \\ v_3(3) \\ v_4(3) \end{bmatrix}
= \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}. \tag{4.36}
\]

n = 2:

\[
v(2) = q + Pv(3), \qquad
\begin{bmatrix} v_1(2) \\ v_2(2) \\ v_3(2) \\ v_4(2) \end{bmatrix}
= \begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}
+ \begin{bmatrix}
0.60 & 0.30 & 0.10 & 0 \\
0.25 & 0.30 & 0.35 & 0.10 \\
0.05 & 0.25 & 0.50 & 0.20 \\
0 & 0.10 & 0.30 & 0.60
\end{bmatrix}
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}
= \begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}. \tag{4.37}
\]

n = 1:

\[
v(1) = q + Pv(2), \qquad
\begin{bmatrix} v_1(1) \\ v_2(1) \\ v_3(1) \\ v_4(1) \end{bmatrix}
= \begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}
+ \begin{bmatrix}
0.60 & 0.30 & 0.10 & 0 \\
0.25 & 0.30 & 0.35 & 0.10 \\
0.05 & 0.25 & 0.50 & 0.20 \\
0 & 0.10 & 0.30 & 0.60
\end{bmatrix}
\begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}
= \begin{bmatrix} -31 \\ 2.25 \\ -2.25 \\ 39 \end{bmatrix}. \tag{4.38}
\]


n = 0:

\[
v(0) = q + Pv(1), \qquad
\begin{bmatrix} v_1(0) \\ v_2(0) \\ v_3(0) \\ v_4(0) \end{bmatrix}
= \begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}
+ \begin{bmatrix}
0.60 & 0.30 & 0.10 & 0 \\
0.25 & 0.30 & 0.35 & 0.10 \\
0.05 & 0.25 & 0.50 & 0.20 \\
0 & 0.10 & 0.30 & 0.60
\end{bmatrix}
\begin{bmatrix} -31 \\ 2.25 \\ -2.25 \\ 39 \end{bmatrix}
= \begin{bmatrix} -38.15 \\ 1.0375 \\ 0.6875 \\ 47.95 \end{bmatrix}. \tag{4.39}
\]

The vector v(0) indicates that if this MCR operates over a 3-month planning
horizon, the expected total reward will be –38.15 if the system starts in state 1,
1.0375 if it starts in state 2, 0.6875 if it starts in state 3, and 47.95 if it starts in
state 4. The calculations for the 3-month planning horizon are summarized
in Table 4.2.
The solution by value iteration for v(0) over a 3-month planning horizon
has verified that, for v(3) = 0,

v(0) = q + Pq + P^2 q + P^3 v(3) (4.17)

\[
v(0) = \begin{bmatrix} v_1(0) \\ v_2(0) \\ v_3(0) \\ v_4(0) \end{bmatrix}
= \begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}
+ \begin{bmatrix} -11 \\ -2.75 \\ 2.75 \\ 14 \end{bmatrix}
+ \begin{bmatrix} -7.15 \\ -1.2125 \\ 2.9375 \\ 8.95 \end{bmatrix}
+ \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}
= \begin{bmatrix} -38.15 \\ 1.0375 \\ 0.6875 \\ 47.95 \end{bmatrix}, \tag{4.40}
\]

which is equal to q plus the sum of the results obtained in Equations (4.11)
and (4.15).
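
The agreement between the backward recursion and the direct sum in Equation (4.17) can also be confirmed numerically. The sketch below, in plain Python, computes v(0) both ways for the monthly sales data with T = 3 and zero salvage values; the function names are illustrative choices made here, not names from the text.

```python
P = [
    [0.60, 0.30, 0.10, 0.00],
    [0.25, 0.30, 0.35, 0.10],
    [0.05, 0.25, 0.50, 0.20],
    [0.00, 0.10, 0.30, 0.60],
]
q = [-20.0, 5.0, -5.0, 25.0]


def mat_vec(M, x):
    """Matrix times column vector."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def v0_by_recursion(P, q, v_T, T):
    """Apply v(n) = q + P v(n + 1) T times, starting from v(T)."""
    v = list(v_T)
    for _ in range(T):
        v = [qi + x for qi, x in zip(q, mat_vec(P, v))]
    return v


def v0_by_direct_sum(P, q, v_T, T):
    """Evaluate q + Pq + ... + P^(T-1) q + P^T v(T), as in Equation (4.17)."""
    total = [0.0] * len(q)
    term = list(q)                       # P^0 q
    for _ in range(T):
        total = [t + x for t, x in zip(total, term)]
        term = mat_vec(P, term)          # next power of P applied to q
    salvage = list(v_T)
    for _ in range(T):
        salvage = mat_vec(P, salvage)    # builds P^T v(T)
    return [t + s for t, s in zip(total, salvage)]


zero = [0.0] * 4
v0_rec = v0_by_recursion(P, q, zero, 3)
v0_sum = v0_by_direct_sum(P, q, zero, 3)
print([round(x, 4) for x in v0_rec])
print([round(x, 4) for x in v0_sum])
```

The recursion needs only one matrix-vector product per period, which is why it is preferred over forming the powers of P explicitly.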

4.2.2.3 Lengthening a Finite Planning Horizon


TABLE 4.2
Expected Total Rewards for Monthly Sales Calculated by Value Iteration Over a
3-Month Planning Horizon

End of Month n | 0      | 1     | 2   | 3
v1(n)          | −38.15 | −31   | −20 | 0
v2(n)          | 1.0375 | 2.25  | 5   | 0
v3(n)          | 0.6875 | −2.25 | −5  | 0
v4(n)          | 47.95  | 39    | 25  | 0

The length of a finite planning horizon may be increased by adding additional
periods during value iteration. Suppose the original planning horizon
contains T periods. To continue the backward recursion of value iteration using
the results already obtained for a horizon of length T periods, each additional
period added is placed at the beginning of the horizon, and designated by a
consecutively numbered negative epoch. For example, if one period is added,
epoch (−1) is placed one period ahead of epoch 0, so that epoch (−1) becomes
the present time or origin. If two periods are added, epoch (−2) is placed one
period ahead of epoch (−1), or two periods ahead of epoch 0, so that epoch
(−2) becomes the present time or origin. If ∆ periods are added, epoch (−∆) is
placed one period ahead of epoch (−∆ + 1), or ∆ periods ahead of epoch 0, so
that epoch (−∆) becomes the present time or origin. The lengthened planning
horizon contains T − (−∆) = T + ∆ periods. Value iteration is continued over
the new horizon by starting at epoch 0 using a terminal salvage value equal
to the vector v(0) obtained for the original horizon of length T periods. The
sequence of backward recursive value iteration equations is

\[
\begin{aligned}
v(-1) &= q + Pv(0) \\
v(-2) &= q + Pv(-1) \\
&\;\;\vdots \\
v(-\Delta) &= q + Pv(-\Delta + 1).
\end{aligned}
\]

Note that if period 0 denotes the first period placed ahead of period 1 to lengthen
the horizon, and the additional periods added are labeled with consecutively
increasing negative integers, then epoch n denotes the end of period n, for
n = 0, − 1, − 2, … , (−∆ + 1), as well as for n = 1, 2, … , T. For example, suppose that
T = 2 and ∆ = 3. Then the lengthened planning horizon appears in Figure 4.10.
Lastly, all the epochs may be renumbered, so that the epochs of the length-
ened planning horizon are numbered consecutively with nonnegative inte-
gers from epoch 0 at the origin to epoch (T + ∆) at the end, by adding ∆ to the
numerical index of each epoch, starting at epoch (−∆) and ending at epoch T.
If the epochs are renumbered, then

epoch (−∆) becomes epoch (−∆ + ∆) = epoch 0, the present time,
epoch (−∆ + 1) becomes epoch [(−∆ + 1) + ∆] = epoch 1,
epoch 0 becomes epoch (0 + ∆) = epoch ∆,
epoch (T − 1) becomes epoch (T − 1 + ∆), and
epoch T becomes epoch (T + ∆), the end of the lengthened planning
horizon.

Period – 2 Period – 1 Period 0 Period 1 Period 2


–3 –2 –1 0 1 2 epoch

FIGURE 4.10
Two-period horizon lengthened by three periods.


For the example in which T = 2 and ∆ = 3, the lengthened planning horizon,
with the epochs relabeled, appears in Figure 4.11.
For the example of monthly sales, suppose that four periods are added
to the original three-period planning horizon to form a new 7-month plan-
ning horizon. Thus, T = 3 and ∆ = 4. As shown in Table 4.3, the four negative
epochs, –1, –2, –3, and –4, are added sequentially at the beginning of the
original 3-month horizon.
Value iteration is executed to compute the expected total rewards earned
during the first 4 months of the 7-month horizon.

\[
v(-1) = q + Pv(0), \qquad
\begin{bmatrix} v_1(-1) \\ v_2(-1) \\ v_3(-1) \\ v_4(-1) \end{bmatrix}
= \begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}
+ \begin{bmatrix}
0.60 & 0.30 & 0.10 & 0 \\
0.25 & 0.30 & 0.35 & 0.10 \\
0.05 & 0.25 & 0.50 & 0.20 \\
0 & 0.10 & 0.30 & 0.60
\end{bmatrix}
\begin{bmatrix} -38.15 \\ 1.0375 \\ 0.6875 \\ 47.95 \end{bmatrix}
= \begin{bmatrix} -42.51 \\ 0.8094 \\ 3.2856 \\ 54.08 \end{bmatrix} \tag{4.41}
\]

\[
\begin{aligned}
v(-2) &= q + Pv(-1) \\
v(-3) &= q + Pv(-2) \\
v(-4) &= q + Pv(-3).
\end{aligned} \tag{4.42}
\]

The calculations for the entire 7-month planning horizon are summarized
in Table 4.4.
Finally, ∆ = 4 may be added to each end-of-month index, n, to renumber the
epochs of the seven-period planning horizon consecutively from 0 to 7.

Period 1 Period 2 Period 3 Period 4 Period 5


0 1 2 3 4 5 epoch

FIGURE 4.11
Lengthened planning horizon with epochs renumbered.

TABLE 4.3
Three-Month Horizon for Monthly Sales Lengthened by 4 Months

End of Month n | −4 | −3 | −2 | −1 | 0      | 1     | 2   | 3
v1(n)          |    |    |    |    | −38.15 | −31   | −20 | 0
v2(n)          |    |    |    |    | 1.0375 | 2.25  | 5   | 0
v3(n)          |    |    |    |    | 0.6875 | −2.25 | −5  | 0
v4(n)          |    |    |    |    | 47.95  | 39    | 25  | 0


TABLE 4.4
Expected Total Rewards for Monthly Sales Calculated by Value Iteration Over a
7-Month Planning Horizon

End of Month n | −4       | −3       | −2       | −1     | 0      | 1     | 2   | 3
v1(n)          | −46.3092 | −46.0552 | −44.9346 | −42.51 | −38.15 | −31   | −20 | 0
v2(n)          | 2.8782   | 1.9073   | 1.1733   | 0.8094 | 1.0375 | 2.25  | 5   | 0
v3(n)          | 9.3101   | 7.5174   | 5.5357   | 3.2856 | 0.6875 | −2.25 | −5  | 0
v4(n)          | 64.578   | 61.8868  | 58.5146  | 54.08  | 47.95  | 39    | 25  | 0
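
The first four columns of Table 4.4 can be reproduced by continuing the recursion from v(0), treating v(0) as the salvage vector of the added periods. The following plain-Python sketch is illustrative only; the names step, history, and renumbered are introduced here. Because it carries full floating-point precision, its results may differ from the printed table in the last decimal place.

```python
P = [
    [0.60, 0.30, 0.10, 0.00],
    [0.25, 0.30, 0.35, 0.10],
    [0.05, 0.25, 0.50, 0.20],
    [0.00, 0.10, 0.30, 0.60],
]
q = [-20.0, 5.0, -5.0, 25.0]
v0 = [-38.15, 1.0375, 0.6875, 47.95]   # v(0) from Equation (4.39)


def step(P, q, v_next):
    """One backward step: returns q + P v_next."""
    return [q[i] + sum(P[i][j] * v_next[j] for j in range(len(q)))
            for i in range(len(q))]


delta = 4                      # number of periods added ahead of epoch 0
history = {0: v0}
for n in range(-1, -delta - 1, -1):
    history[n] = step(P, q, history[n + 1])

# Renumber the added epochs by adding delta to each index, so that
# epoch -4 becomes epoch 0 and the old epoch 0 becomes epoch 4.
renumbered = {n + delta: v for n, v in history.items()}

for n in sorted(history):
    print(n, [round(x, 4) for x in history[n]])
```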

4.2.2.4 Numbering the Time Periods Forward


When value iteration is executed, many authors follow the convention of
dynamic programming by numbering the time periods backward, starting
with n = 0 at the end of the planning horizon, and ending with n = T at the
beginning. When time periods in the horizon are numbered backward, n
denotes the number of periods remaining. However, numbering the time
periods backward contradicts the convention used for Markov chains for
which the time periods are numbered forward, starting with n = 0 at the
beginning of the horizon. In this book, the time periods are always num-
bered forward for Markov chains, MCRs, and MDPs. When value iteration
is executed over an infinite planning horizon, epochs are numbered as con-
secutive negative integers, starting with epoch 0 at the end of the horizon,
and ending with epoch (−T) at the beginning.

4.2.3 A Recurrent MCR over an Infinite Planning Horizon


Consider an N-state recurrent MCR with stationary transition probabilities
and stationary rewards. As the planning horizon grows longer, the expected
total return grows correspondingly larger. If the planning horizon becomes
unbounded or infinite, then the expected total reward also becomes infinite.
Two different measures of the earnings of an MCR over an infinite hori-
zon are the expected average reward per period, called the gain, and the
expected present value, which is also called the expected total discounted
reward. Discounted cash flows are treated in Section 4.3.
Three approaches can be taken to calculate the gain, or average reward
per period, of a recurrent MCR operating over an infinite planning horizon.
The first approach is to compute the gain as the product of the steady-state
probability vector and the reward vector. The second approach is to
solve a system of linear equations called the value determination equations
(VDEs) to find the gain and the relative values of the expected total rewards
earned in every state. Since these two approaches both involve the solution
of N simultaneous linear equations, they both require about the same
computational effort. The third approach, which requires less computation,
is to execute value iteration over a large number of periods. However, the
sequence of expected total rewards produced by value iteration may not
converge. When value iteration does converge, it gives only an approximate
solution for the gain and the expected total rewards earned in every state.

4.2.3.1 Expected Average Reward or Gain


Consider an N-state MCR, irrespective of structure, in the steady state.
Suppose that a vector called the gain vector is defined, equal to the limiting
transition probability matrix multiplied by the reward vector. If the gain vec-
tor is denoted by g = [gi], then

\[
\mathbf{g} = \lim_{n \to \infty} P^{(n)} q. \tag{4.43}
\]

Each component gi of the gain vector represents the expected average reward
per transition, or the gain, if the MCR starts in state i. Suppose now that the
MCR is recurrent. Using Equation (2.7),

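\[
\mathbf{g} = \lim_{n \to \infty} P^{(n)} q = \Pi q
= \begin{bmatrix}
\pi_1 & \pi_2 & \cdots & \pi_N \\
\pi_1 & \pi_2 & \cdots & \pi_N \\
\vdots & \vdots & & \vdots \\
\pi_1 & \pi_2 & \cdots & \pi_N
\end{bmatrix}
\begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ q_N \end{bmatrix}
= \begin{bmatrix}
\sum_{i=1}^{N} \pi_i q_i \\ \sum_{i=1}^{N} \pi_i q_i \\ \vdots \\ \sum_{i=1}^{N} \pi_i q_i
\end{bmatrix}
= \begin{bmatrix} \pi q \\ \pi q \\ \vdots \\ \pi q \end{bmatrix}
= \begin{bmatrix} g_1 \\ g_2 \\ \vdots \\ g_N \end{bmatrix}
= \begin{bmatrix} g \\ g \\ \vdots \\ g \end{bmatrix}. \tag{4.44}
\]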

Hence, the gain for starting in every state of a recurrent MCR is the same.
The gain in every state of a recurrent MCR is a scalar constant, denoted by
g, where

g_1 = g_2 = ... = g_N = g. (4.45)

The gain g in every state of a recurrent MCR is equal to the sum, over all
states, of the reward qi received in state i weighted by the steady-state prob-
ability, πi. The algebraic form of the equation to calculate the gain g for a
recurrent MCR is

\[
g = \sum_{i=1}^{N} \pi_i q_i. \tag{4.46}
\]


The matrix form of the equation to calculate the gain g for a recurrent
MCR is

g = π q. (4.47)

4.2.3.1.1 Gain of a Two-State Recurrent MCR


Consider a generic two-state recurrent MCR for which the transition
probability matrix and reward vector are given in Equation (4.2). The transition
probability and reward graph is shown in Figure 4.3. The steady-state
probability vector for the Markov chain is calculated in Equation (2.16). Using
Equation (4.47), the gain of a two-state recurrent MCR is

\[
g = \pi q = \begin{bmatrix} \pi_1 & \pi_2 \end{bmatrix}
\begin{bmatrix} q_1 \\ q_2 \end{bmatrix}
= \begin{bmatrix} \dfrac{p_{21}}{p_{12} + p_{21}} & \dfrac{p_{12}}{p_{12} + p_{21}} \end{bmatrix}
\begin{bmatrix} q_1 \\ q_2 \end{bmatrix}
= \frac{p_{21} q_1 + p_{12} q_2}{p_{12} + p_{21}}. \tag{4.48}
\]

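Alternatively, using Equation (4.44), the gain vector is

\[
\mathbf{g} = \begin{bmatrix} g_1 \\ g_2 \end{bmatrix}
= \lim_{n \to \infty} P^{(n)} q = \Pi q
= \begin{bmatrix} \pi_1 & \pi_2 \\ \pi_1 & \pi_2 \end{bmatrix}
\begin{bmatrix} q_1 \\ q_2 \end{bmatrix}
= \begin{bmatrix} \pi_1 q_1 + \pi_2 q_2 \\ \pi_1 q_1 + \pi_2 q_2 \end{bmatrix}
= \begin{bmatrix} g \\ g \end{bmatrix}
= \begin{bmatrix}
\dfrac{p_{21} q_1 + p_{12} q_2}{p_{12} + p_{21}} \\[2ex]
\dfrac{p_{21} q_1 + p_{12} q_2}{p_{12} + p_{21}}
\end{bmatrix}. \tag{4.49}
\]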

As the components of the gain vector have demonstrated, the gain in every
state is equal to the scalar

\[
g = g_1 = g_2 = \frac{p_{21} q_1 + p_{12} q_2}{p_{12} + p_{21}}. \tag{4.50}
\]

For example, consider the following two-state recurrent MCR for which:

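\[
P = \begin{array}{c} 1 \\ 2 \end{array}
\begin{bmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{bmatrix}
= \begin{array}{c} 1 \\ 2 \end{array}
\begin{bmatrix} 0.2 & 0.8 \\ 0.6 & 0.4 \end{bmatrix}, \quad
q = \begin{array}{c} 1 \\ 2 \end{array}
\begin{bmatrix} q_1 \\ q_2 \end{bmatrix}
= \begin{array}{c} 1 \\ 2 \end{array}
\begin{bmatrix} -300 \\ 200 \end{bmatrix}. \tag{4.51}
\]

Equation (2.18) calculates the vector of steady-state probabilities for the
Markov chain.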


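The gain of the two-state recurrent MCR is

\[
\begin{aligned}
g = \pi q &= \begin{bmatrix} \pi_1 & \pi_2 \end{bmatrix}
\begin{bmatrix} q_1 \\ q_2 \end{bmatrix}
= \begin{bmatrix} \tfrac{3}{7} & \tfrac{4}{7} \end{bmatrix}
\begin{bmatrix} -300 \\ 200 \end{bmatrix}
= \frac{p_{21} q_1 + p_{12} q_2}{p_{12} + p_{21}} \\
&= \frac{(0.6)(-300) + (0.8)(200)}{0.8 + 0.6} = -\frac{100}{7} = -14.29.
\end{aligned} \tag{4.52}
\]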
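
The closed-form result in Equation (4.50) is short enough to wrap in a function. The sketch below uses plain Python; the function names two_state_steady_state and two_state_gain are illustrative and do not come from the text. It evaluates the example of Equation (4.51) and checks that the formula agrees with the product of the steady-state vector and the reward vector.

```python
def two_state_steady_state(p12, p21):
    """Steady-state probabilities (pi1, pi2) of a two-state recurrent chain."""
    total = p12 + p21
    return (p21 / total, p12 / total)


def two_state_gain(p12, p21, q1, q2):
    """Gain of a two-state recurrent MCR, following Equation (4.50)."""
    return (p21 * q1 + p12 * q2) / (p12 + p21)


# Example from Equation (4.51): p12 = 0.8, p21 = 0.6, q = [-300, 200].
pi1, pi2 = two_state_steady_state(0.8, 0.6)
g_formula = two_state_gain(0.8, 0.6, -300.0, 200.0)
g_product = pi1 * (-300.0) + pi2 * 200.0     # g = pi q, Equation (4.47)

print(round(pi1, 4), round(pi2, 4), round(g_formula, 2), round(g_product, 2))
```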

4.2.3.1.2 Gain of an MCR Model of Monthly Sales


Consider the four-state MCR model of monthly sales for which the transition
probability matrix, P, and reward vector, q, are shown in Equation (4.6).
The steady-state vector, π, is calculated by solving equations (2.13). Using
Equation (4.47), the gain is

\[
g = \pi q = \begin{bmatrix} 0.1908 & 0.2368 & 0.3421 & 0.2303 \end{bmatrix}
\begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix} = 1.4143. \tag{4.53}
\]

In the steady state, the process earns an average reward of $1,414.30 per
month.
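
For readers who want to reproduce the steady-state vector rather than take it as given, the sketch below solves πP = π together with the normalizing condition that the probabilities sum to one, using exact rational arithmetic from Python's standard fractions module. It is an illustration under stated assumptions: the helper names solve and steady_state are introduced here, and the approach (replacing one balance equation by the normalizing equation, then Gauss-Jordan elimination) is one common way to set up the system, not necessarily the procedure used in the text.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact Fractions."""
    n = len(b)
    M = [[Fraction(a) for a in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and swap it up.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def steady_state(P):
    """Solve pi P = pi with sum(pi) = 1 for a recurrent chain."""
    N = len(P)
    # Balance equations for states 1..N-1; the last one is replaced by
    # the normalizing equation pi_1 + ... + pi_N = 1.
    A = [[P[j][i] - (1 if i == j else 0) for j in range(N)] for i in range(N - 1)]
    A.append([1] * N)
    b = [0] * (N - 1) + [1]
    return solve(A, b)


rows = ["0.60 0.30 0.10 0", "0.25 0.30 0.35 0.10",
        "0.05 0.25 0.50 0.20", "0 0.10 0.30 0.60"]
P = [[Fraction(x) for x in row.split()] for row in rows]
q = [-20, 5, -5, 25]

pi = steady_state(P)
g = sum(p * r for p, r in zip(pi, q))   # gain g = pi q, Equation (4.47)
print([round(float(p), 4) for p in pi], round(float(g), 4))
```

Carried out exactly, the gain is 215/152, or about 1.4145 thousand dollars per month; small differences from hand calculations arise when π is rounded to four decimal places before multiplying.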

4.2.3.2 Value Determination Equations (VDEs)


The second approach that can be taken to calculate the expected average
reward of a recurrent MCR operating over an infinite planning horizon is to
solve a system of equations called the VDEs. The solution of these equations
will produce the gain and the relative values of the expected total rewards
earned in every state.

4.2.3.2.1 Optional Insight: Limiting Behavior of the Expected Total Reward


The following informal derivation of the limiting behavior of the expected
total reward is adapted from Bhat [1]. As Equation (4.17) indicates, at epoch 0,
the beginning of a planning horizon of length T periods, the expected total
reward vector is

\[
v(0) = \sum_{k=0}^{T-1} P^k q + P^T v(T). \tag{4.17}
\]


As T becomes large, the system enters the steady state. Then

\[
\begin{aligned}
v(0) &= \lim_{T \to \infty} \left[ \sum_{k=0}^{T-1} P^k q + P^T v(T) \right] \\
&= \lim_{T \to \infty} \sum_{k=0}^{T-1} P^k q + \lim_{T \to \infty} P^T v(T).
\end{aligned} \tag{4.54}
\]

Interchanging the limit and the sum, v(0) is expressed as the sum of the
limiting transition probability matrices for P.

\[
v(0) = \sum_{k=0}^{T-1} \lim_{k \to \infty} P^k q + \lim_{T \to \infty} P^T v(T). \tag{4.55}
\]

Using Equation (2.7),

\[
\begin{aligned}
v(0) &= \sum_{k=0}^{T-1} \Pi q + \Pi v(T) \\
&= T \Pi q + \Pi v(T).
\end{aligned} \tag{4.56}
\]

The algebraic form of Equation (4.56) is

\[
v_i(0) = T \sum_{j=1}^{N} \pi_j q_j + \sum_{j=1}^{N} \pi_j v_j(T),
\quad \text{for } i = 1, 2, \ldots, N. \tag{4.57}
\]

Recall from Section 4.2.3.1 that the average reward per period, or gain, for a
recurrent MCR is

\[
g = \sum_{j=1}^{N} \pi_j q_j.
\]

Let

\[
v_i = \sum_{j=1}^{N} \pi_j v_j(T). \tag{4.58}
\]

Substituting quantities from Equations (4.46) and (4.58) into Equation (4.57),
as T becomes large, vi(0) has the form

vi(0) ≈ Tg + vi. (4.59)


4.2.3.2.2 Informal Derivation of VDEs


Suppose the system starts in state i at epoch 0, and the planning horizon
contains T periods. As T becomes large, the system enters the steady state.
Then, as the optional insight in Section 4.2.3.2.1 has demonstrated, vi(0)
becomes approximately equal to the sum of two terms. The first term, Tg,
is independent of the starting state, i. This term can be interpreted as the
number of periods, T, remaining in the planning horizon multiplied by
the average reward per period or gain, g. The second term, vi, can be inter-
preted as the expected total reward earned if the system starts in state i.
Therefore [3, 4],

vi (0) ≈ Tg + vi . (4.59)

At epoch 1, (T − 1) periods remain. As T grows large, (T − 1) also grows large,
so that

vj(1) ≈ (T − 1)g + vj. (4.60)

Using the value iteration equation (4.34) with n = 0,

\[
v_i(0) = q_i + \sum_{j=1}^{N} p_{ij} v_j(1), \quad \text{for } i = 1, 2, \ldots, N. \tag{4.61}
\]

Substituting the linear approximations for vi(0) in Equation (4.59) and vj(1)
in Equation (4.60), assumed to be valid for large T, into Equation (4.61) gives
the result

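\[
\begin{aligned}
Tg + v_i &= q_i + \sum_{j=1}^{N} p_{ij} \left[ (T-1) g + v_j \right],
\quad \text{for } i = 1, 2, \ldots, N, \\
&= q_i + Tg \sum_{j=1}^{N} p_{ij} - g \sum_{j=1}^{N} p_{ij} + \sum_{j=1}^{N} p_{ij} v_j.
\end{aligned}
\]

Substituting \(\sum_{j=1}^{N} p_{ij} = 1\),

\[
\begin{aligned}
Tg + v_i &= q_i + Tg - g + \sum_{j=1}^{N} p_{ij} v_j \\
g + v_i &= q_i + \sum_{j=1}^{N} p_{ij} v_j, \quad i = 1, 2, \ldots, N.
\end{aligned} \tag{4.62}
\]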


This system (4.62) of N linear equations in the N + 1 unknowns, v1, v2, . . . ,
vN, and g, is called the VDEs. Since the VDEs have one more unknown than
equations, they will not have a unique solution. Therefore, the value of one
variable may be chosen arbitrarily. By convention, the variable for the highest
numbered state, vN, is set equal to zero, and the remaining vi are expressed
in terms of vN. Hence, the vi are termed the relative values of the expected
total rewards earned in every state or, more concisely, the relative values. If
vN = 0, then the relative value vi represents the expected total reward earned
by starting in state i relative to the expected total reward earned by starting
in state N.
If the objective is simply to find the gain of the process, then it is not nec-
essary to solve the VDEs (4.62). In that case, the first approach described in
Section 4.2.3.1 is sufficient. However, by solving the VDEs, one can determine
not only the gain but also the relative value in every state. The relative values
are used in Chapter 5 to evaluate alternative policies, which are candidates
to maximize the gain of an MDP. For an N-state recurrent MCR, both proce-
dures for calculating the gain require the solution of N simultaneous linear
equations in N unknowns.

4.2.3.2.3 VDEs for a Two-State Recurrent MCR


To verify that the gain of a recurrent MCR is obtained by solving the VDEs,
consider once again the generic two-state MCR for which the gain was cal-
culated in Equations (4.48) and (4.49). The VDEs are

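\[
\begin{aligned}
v_1 + g &= q_1 + p_{11} v_1 + p_{12} v_2 \\
v_2 + g &= q_2 + p_{21} v_1 + p_{22} v_2.
\end{aligned} \tag{4.63}
\]

Setting v2 = 0, the VDEs become

\[
\begin{aligned}
v_1 + g &= q_1 + p_{11} v_1 \\
g &= q_2 + p_{21} v_1.
\end{aligned} \tag{4.64}
\]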

The solution is

\[
\begin{aligned}
v_1 &= \frac{q_1 - q_2}{1 - p_{11} + p_{21}} = \frac{q_1 - q_2}{p_{12} + p_{21}} \\
g &= q_2 + \frac{p_{21}(q_1 - q_2)}{1 - p_{11} + p_{21}}
= \frac{q_2 (p_{12} + p_{21}) + p_{21}(q_1 - q_2)}{p_{12} + p_{21}}
= \frac{p_{21} q_1 + p_{12} q_2}{p_{12} + p_{21}},
\end{aligned} \tag{4.65}
\]

which agrees with the value of the gain obtained in Equations (4.48) and (4.49).
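
As a numerical check of Equation (4.65), the values from the two-state example in Equation (4.51) can be substituted, namely p12 = 0.8, p21 = 0.6, q1 = −300, and q2 = 200. The relative value v1 computed below is not reported in the text; it follows directly from the formula, with v2 = 0 by convention.

```latex
\[
\begin{aligned}
v_1 &= \frac{q_1 - q_2}{p_{12} + p_{21}} = \frac{-300 - 200}{0.8 + 0.6}
     = -\frac{2500}{7} \approx -357.14, \\
g &= q_2 + p_{21} v_1 = 200 + (0.6)\left(-\frac{2500}{7}\right)
   = -\frac{100}{7} \approx -14.29.
\end{aligned}
\]
```

The gain matches the value obtained in Equation (4.52).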


4.2.3.2.4 VDEs for an MCR Model of Monthly Sales


The VDEs for an N-state MCR can be expressed in matrix form. Let g be a
column vector with all N entries equal to g. Let

v = [v_1 v_2 ... v_N]^T, (4.66)

and

q = [q_1 q_2 ... q_N]^T. (4.67)

Then

g + v = q + Pv. (4.68)

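When N = 4, the matrix form of the VDEs is

\[
\begin{bmatrix} g \\ g \\ g \\ g \end{bmatrix}
+ \begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
= \begin{bmatrix} q_1 \\ q_2 \\ q_3 \\ q_4 \end{bmatrix}
+ \begin{bmatrix}
p_{11} & p_{12} & p_{13} & p_{14} \\
p_{21} & p_{22} & p_{23} & p_{24} \\
p_{31} & p_{32} & p_{33} & p_{34} \\
p_{41} & p_{42} & p_{43} & p_{44}
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}. \tag{4.69}
\]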

Consider the four-state MCR model of monthly sales for which the transition
probability matrix, P, and reward vector, q, are shown in Equation (4.6).
The matrix equation for the VDEs, after setting v4 = 0, is

\[
\begin{bmatrix} g \\ g \\ g \\ g \end{bmatrix}
+ \begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ 0 \end{bmatrix}
= \begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}
+ \begin{bmatrix}
0.60 & 0.30 & 0.10 & 0 \\
0.25 & 0.30 & 0.35 & 0.10 \\
0.05 & 0.25 & 0.50 & 0.20 \\
0 & 0.10 & 0.30 & 0.60
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ 0 \end{bmatrix}.
\]

The solution for the gain and the relative values is

g = 1.4145, v1 = −116.5022, v2 = −64.9671, v3 = −56.9627, and v4 = 0. (4.70)

Thus, if this MCR operates over an infinite planning horizon, the average
return per period or the gain will be 1.4145. This agrees with the value of
the gain calculated in Equation (4.53). (Discrepancies are due to roundoff
error.) Suppose the system starts in state 4 for which v4 = 0. Then it will earn
116.5022 more than it will earn if it starts in state 1. It will also earn 64.9671
more than it will earn if it starts in state 2, and it will earn 56.9627 more than
it will earn if it starts in state 3.
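
The solution reported in Equation (4.70) can be reproduced by treating the VDEs as N linear equations in the N unknowns v1, ..., v(N−1), and g once vN is set to zero. The sketch below uses exact rational arithmetic from Python's standard fractions module. The helper names solve and solve_vdes are introduced here for illustration, and Gauss-Jordan elimination is simply one convenient way to solve the system, not a method prescribed by the text.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact Fractions."""
    n = len(b)
    M = [[Fraction(a) for a in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def solve_vdes(P, q):
    """Solve g + v_i = q_i + sum_j p_ij v_j with v_N = 0; return (g, v)."""
    N = len(q)
    # Unknowns are ordered [v_1, ..., v_{N-1}, g]. Row i reads
    # sum_{j<N} (delta_ij - p_ij) v_j + g = q_i.
    A = [[(1 if i == j else 0) - P[i][j] for j in range(N - 1)] + [1]
         for i in range(N)]
    x = solve(A, q)
    return x[N - 1], x[:N - 1] + [Fraction(0)]


rows = ["0.60 0.30 0.10 0", "0.25 0.30 0.35 0.10",
        "0.05 0.25 0.50 0.20", "0 0.10 0.30 0.60"]
P = [[Fraction(x) for x in row.split()] for row in rows]
q = [-20, 5, -5, 25]

g, v = solve_vdes(P, q)
print(round(float(g), 4), [round(float(x), 4) for x in v])
```

Setting a different state's relative value to zero would shift every vi by the same constant and leave g unchanged, which is why only differences between relative values carry meaning.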


4.2.3.3 Value Iteration


The third approach that can be taken to calculate the gain of an MCR oper-
ating over an infinite planning horizon is to execute value iteration over a
large number of periods. Value iteration requires less computational effort
than the first two approaches. However, the sequence of expected total
rewards produced by value iteration will not always converge. When value
iteration does converge, it will yield upper and lower bounds on the gain
as well as an approximate solution for the expected total rewards earned in
every state [4, 7].
Value iteration is conducted over an infinite planning horizon by repeatedly
adding one additional period to lengthen the horizon. Each period
added is placed at the beginning of the horizon. To obtain convergence criteria,
the epochs are numbered backward from the end of the horizon as consecutive
nonpositive integers, starting with epoch 0 at the end of the horizon, and
ending with epoch (−n) at the beginning.

4.2.3.3.1 Expected Relative Rewards


After a stopping condition has been satisfied, value iteration will produce an
approximate solution for the expected total rewards earned in every state.
An approximate solution for the relative values of the expected total rewards
is obtained by subtracting the expected total reward earned in the highest
numbered state from the expected total reward earned in every state. These
differences are called the expected relative rewards. To see why the expected
relative rewards represent approximate solutions for the relative values, sup-
pose that n successive repetitions of value iteration have satisfied a stopping
condition over a finite horizon consisting of n periods. The epochs in the
planning horizon are numbered as consecutive negative integers from −n to
0. That is, the epochs for a horizon of length n are numbered sequentially as
−n, −n + 1, −n + 2, … , −2, −1, 0. Epoch (−n) denotes the beginning of the hori-
zon, and epoch 0 denotes the end. The absolute value of each epoch repre-
sents the number of periods remaining in the horizon. For example, at epoch
(−3), three periods remain until the end of the horizon at epoch 0. When a
stopping condition is satisfied, let T = |−n| = n, so that T denotes the length
of the planning horizon at which convergence is achieved. Thus, the epochs
for the planning horizon of length T are numbered sequentially as −T, −T +
1, − T + 2, … , −2, −1, 0.
Suppose that all the epochs are renumbered by adding T to the numerical
index of each epoch, so that the epochs for the planning horizon of length
T are numbered consecutively as 0, 1, 2, … , T − 1, T. If epoch 0 denotes the
beginning of a finite horizon of length T periods, then for large T, using
Equation (4.59),

vi (0) = Tg + vi , (4.59)


and

vN (0) = Tg + vN . (4.59b)

Hence, for large T, the difference

vi (0) − vN (0) ≈ vi − vN . (4.71)

The difference vi(0) − vN(0) is termed the expected relative reward earned in
state i because it is equal to the increase in the long run expected total reward
earned if the system starts in state i rather than in state N. This difference is
also the reason why vi is called the relative value of the process when it starts
in state i. If value iteration satisfies a stopping condition after a large number
of iterations, then the expected relative rewards will converge to the relative
values.
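
As an illustration of this convergence, the sketch below applies the value iteration recursion vi(n) = qi + Σj pij vj(n + 1) of Section 4.2.3.3.4 to the monthly sales model, with zero salvage values, and prints the expected relative rewards vi − v4 for a few horizon lengths. The function name value_iteration_step and the choice of horizon lengths (7, 50, and 200 periods) are assumptions made only for this example.

```python
# Monthly sales model (rewards in thousands of dollars).
P = [
    [0.60, 0.30, 0.10, 0.00],
    [0.25, 0.30, 0.35, 0.10],
    [0.05, 0.25, 0.50, 0.20],
    [0.00, 0.10, 0.30, 0.60],
]
q = [-20.0, 5.0, -5.0, 25.0]


def value_iteration_step(P, q, v_next):
    """One backward step of value iteration: v(n) = q + P v(n + 1)."""
    return [q_i + sum(p * x for p, x in zip(row, v_next)) for q_i, row in zip(q, P)]


v = [0.0] * len(q)  # salvage values at the end of the horizon
for horizon in range(1, 201):
    v = value_iteration_step(P, q, v)
    if horizon in (7, 50, 200):
        # Expected relative rewards: subtract the value of the
        # highest numbered state from every state's value.
        relative = [v_i - v[-1] for v_i in v]
        print(horizon, [round(r, 4) for r in relative])
```

For the longest horizon, the printed differences should be close to the relative values v1, v2, and v3 reported in Equation (4.70).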

4.2.3.3.2 Bounds on the Gain


If epoch 0 denotes the beginning of a finite horizon of length T periods, then the
quantity vi(0) − vi(1) can be interpreted as the difference between the expected
total rewards earned in state i over two planning horizons, which differ in
length by one period. For long planning horizons of length T and T − 1, using
Equations (4.59) and (4.60),

vi (0) − vi (1) = (Tg + vi ) − [(T − 1) g + vi ] = (Tg + vi ) − (Tg − g + vi ) = g. (4.72)

Hence, as T grows large, the difference in the expected total rewards earned
in state i over two planning horizons, which differ in length by one period,
approaches the gain, g. The convergence of vi(0) − vi(1) to the gain can be
used to obtain upper and lower bounds on the gain, and also to obtain an
approximate solution for the gain. An upper bound on the gain over a plan-
ning horizon of length T is given by

$$
g_U(T) = \max_{i = 1, \ldots, N} \{ v_i(0) - v_i(1) \}. \tag{4.73}
$$

Similarly, a lower bound on the gain over a planning horizon of length T is
given by

$$
g_L(T) = \min_{i = 1, \ldots, N} \{ v_i(0) - v_i(1) \}. \tag{4.74}
$$

Since upper and lower bounds on the gain have been obtained, the gain is
approximately equal to the arithmetic average of its upper and lower bounds.


That is, an approximate solution for the gain is given by

$$
g = \frac{g_U(T) + g_L(T)}{2}. \tag{4.75}
$$

4.2.3.3.3 Stopping Rule for Value Iteration


The difference between the upper and lower bounds on the gain over plan-
ning horizons which differ in length by one period provides a stopping con-
dition for value iteration. Value iteration can be stopped when the difference
between the upper and lower bounds on the gain for the current planning
horizon and a planning horizon one period shorter is less than a small posi-
tive number, ε, specified by the analyst. That is, value iteration can be stopped
after a planning horizon of length T for which

gU (T ) − g L (T ) < ε , (4.76)

or

$$
\max_{i = 1, \ldots, N} [v_i(0) - v_i(1)] - \min_{i = 1, \ldots, N} [v_i(0) - v_i(1)] < \varepsilon. \tag{4.77}
$$

Of course, the magnitude of ε is a matter of judgment, and the number of
periods in T needed to achieve convergence cannot be predicted.

4.2.3.3.4 Value Iteration Algorithm


The foregoing discussion is the basis for the following value iteration algo-
rithm in its simplest form (adapted from Puterman [7]), which can be applied
to a recurrent MCR over an infinite planning horizon. Epochs are numbered
as consecutive negative integers from −n at the beginning of the horizon to
0 at the end.
Step 1. Select arbitrary salvage values vi(0), for i = 1, 2, ... , N. For simplicity, set vi(0) = 0.
Specify an ε > 0. Set n = −1.

Step 2. For each state i, use the value iteration equation to compute

$$
v_i(n) = q_i + \sum_{j=1}^{N} p_{ij} v_j(n + 1), \quad \text{for } i = 1, 2, \ldots, N.
$$

Step 3. If

$$
\max_{i = 1, 2, \ldots, N} [v_i(n) - v_i(n + 1)] - \min_{i = 1, 2, \ldots, N} [v_i(n) - v_i(n + 1)] < \varepsilon,
$$

go to step 4. Otherwise, decrement n by 1 and return to step 2.

Step 4. Stop.
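
The four steps above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions rather than a definitive one: the function name value_iteration, its return values, and the max_periods safety cap are hypothetical additions, and the model data are those of the monthly sales example.

```python
def value_iteration(P, q, eps, max_periods=10_000):
    """Run steps 1-4 and return (periods, v, g_lower, g_upper).

    periods is |n| at termination and v holds v_i(n) at that epoch.
    max_periods is a safety cap that is not part of the algorithm.
    """
    v_next = [0.0] * len(q)                      # Step 1: v_i(0) = 0
    for periods in range(1, max_periods + 1):    # periods = |n|
        # Step 2: v_i(n) = q_i + sum_j p_ij v_j(n + 1)
        v = [q_i + sum(p * x for p, x in zip(row, v_next))
             for q_i, row in zip(q, P)]
        # Step 3: compare the largest and smallest one-period differences
        diffs = [a - b for a, b in zip(v, v_next)]
        g_lower, g_upper = min(diffs), max(diffs)
        if g_upper - g_lower < eps:
            break                                # Step 4: stop
        v_next = v
    return periods, v, g_lower, g_upper


# Monthly sales model (rewards in thousands of dollars).
P = [
    [0.60, 0.30, 0.10, 0.00],
    [0.25, 0.30, 0.35, 0.10],
    [0.05, 0.25, 0.50, 0.20],
    [0.00, 0.10, 0.30, 0.60],
]
q = [-20.0, 5.0, -5.0, 25.0]

periods, v, g_lower, g_upper = value_iteration(P, q, eps=0.001)
print(periods, round(g_lower, 4), round(g_upper, 4))
print("approximate gain:", round((g_lower + g_upper) / 2, 4))
```

Calling value_iteration(P, q, eps=0.0, max_periods=7) stops after seven periods without satisfying the stopping condition, which makes it possible to inspect the values at epoch −7.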


4.2.3.3.5 Solution by Value Iteration of MCR Model of Monthly Sales


Table 4.4 is repeated as Table 4.5 to show the expected total rewards earned
from monthly sales calculated by value iteration over the last 7 months of
an infinite planning horizon. (Recall that all the rewards are expressed in
units of thousands of dollars.) The last eight epochs of the infinite planning
horizon in Table 4.5 are numbered sequentially as −7, −6, −5, −4, −3, −2, −1,
and 0. Epoch 0 denotes the end of the horizon. As Section 4.2.3.3.1 indicates,
the absolute value of a negative epoch in Table 4.5 represents the number of
months remaining in the infinite horizon.
Table 4.6 gives the differences between the expected total rewards earned
over planning horizons, which differ in length by one period.

TABLE 4.5
Expected Total Rewards for Monthly Sales Calculated by Value Iteration During the
Last 7 Months of an Infinite Planning Horizon

Epoch n      −7         −6         −5         −4        −3        −2       −1      0
v1(n)     −46.3092   −46.0552   −44.9346   −42.51    −38.15    −31      −20      0
v2(n)       2.8782     1.9073     1.1733     0.8094    1.0375    2.25     5      0
v3(n)       9.3101     7.5174     5.5357     3.2856    0.6875   −2.25    −5      0
v4(n)      64.578     61.8868    58.5146    54.08     47.95     39       25      0

TABLE 4.6
Differences Between the Expected Total Rewards Earned Over Planning Horizons
Which Differ in Length by One Period

Epoch n                               −7         −6         −5        −4       −3       −2     −1
vi(n) − vi(n + 1), i = 1           −0.254L   −1.1206L   −2.4246L   −4.36L   −7.15L   −11L   −20L
vi(n) − vi(n + 1), i = 2            0.9709     0.734      0.3639   −0.2263  −1.2143   −2.75    5
vi(n) − vi(n + 1), i = 3            1.7927     1.9817     2.2501    2.5981   2.9375    2.75   −5
vi(n) − vi(n + 1), i = 4            2.6912U    3.3722U    4.4346U   6.13U    8.95U    14U     25U
gU(T) = Max (vi(n) − vi(n + 1))     2.6912     3.3722     4.4346    6.13     8.95     14      25
gL(T) = Min (vi(n) − vi(n + 1))    −0.254     −1.1206    −2.4246   −4.36    −7.15    −11     −20
gU(T) − gL(T)                       2.9452     4.4928     6.8592   10.49    16.10     25      45


In Table 4.6, a suffix U identifies gU(T), the maximum difference for each
epoch. The suffix L identifies gL(T), the minimum difference for each epoch.
The differences, gU(T) − gL(T), obtained for all the epochs are listed in the
bottom row of Table 4.6. In Equations (4.53) and (4.70), the gain was found
to be 1.4145. When seven periods remain in the planning horizon, Table 4.6
shows that the bounds on the gain obtained by value iteration are given by
−0.254 ≤ g ≤ 2.6912. The gain is approximately equal to the arithmetic average
of its upper and lower bounds, so that

g ≈ [2.6912 + (−0.254)]/2 = 1.2186. (4.78)

As the planning horizon is lengthened beyond seven periods, tighter bounds on
the gain will be obtained. The bottom row of Table 4.6 shows that if an analyst
chooses an ε < 2.9452, then more than seven iterations of value iteration will be
needed before the value iteration algorithm can be assumed to have converged.
Table 4.7 gives the expected relative rewards, vi(n) − v4(n), earned during
the last 7 months of the planning horizon.

TABLE 4.7
Expected Relative Rewards, vi(n) − v4(n), Earned During the Last 7 Months of the
Planning Horizon

Epoch n            −7          −6          −5          −4         −3         −2       −1     0
v1(n) − v4(n)   −110.8872   −107.942    −103.4492   −96.59     −86.1      −70      −45     0
v2(n) − v4(n)    −61.6998    −59.9795    −57.3413   −53.2706   −46.9143   −36.75   −20     0
v3(n) − v4(n)    −55.2679    −54.3694    −52.9789   −50.7944   −47.2625   −41.25   −30     0

In Equation (4.70), which assumes that the planning horizon is infinite, the
gain and the relative values obtained by solving the VDEs are

g = 1.4145, v1 = −116.5022, v2 = −64.9671, v3 = −56.9627, and v4 = 0. (4.70)

Table 4.7 demonstrates that as the horizon grows longer, the expected relative
reward earned for starting in each state slowly approaches the corresponding
relative value for that state when the relative value for the highest
numbered state is set equal to zero.

4.2.3.4 Examples of Recurrent MCR Models

In this section, recurrent MCR models are constructed for an inventory system
[3] and for component replacement [6].

4.2.3.4.1 Recurrent MCR Model of an Inventory System

In Section 1.10.1.1.3, a retailer who sells personal computers has modeled
her inventory system as a Markov chain. She follows a (2, 3) inventory
ordering policy, which has produced the transition probability matrix
shown in Equation (1.48). The retailer will construct an expected reward
vector to transform the Markov chain model into an MCR. She wishes to
calculate her gain, that is, her expected average reward or profit per period,
which is equal to the sum, over all states, of the expected reward or profit
in each state multiplied by the steady-state probability for that state. The
retailer has collected the following cost and revenue data. If she places an
order for one or more computers, she must pay a $20 cost for placing the
order, plus a cost of $120 per computer ordered. All orders are delivered
immediately. The retailer sells the computers for $300 each. The holding
cost for each computer not sold during the period is $50. A shortage cost
of $40 is incurred for each computer that is not available to satisfy demand
during the period.
The expected reward or profit equals the expected revenue minus the
expected cost. The expected revenue equals the $300 selling price of a com-
puter times the expected number of computers sold.
The expected cost equals the ordering cost plus the expected holding
cost plus the expected shortage cost. Recall that cn−1 denotes the number
of computers ordered at the beginning of period n. If one or more comput-
ers are ordered, then cn−1 > 0, and the ordering cost for the period is $20 +
$120cn−1. If no computers are ordered, then cn−1 = 0, and the ordering cost
is zero dollars. When the beginning inventory plus the quantity ordered
exceed demand during a period, the number of unsold computers left over
at the end of the period, that is, the surplus of computers at the end of
the period, is Xn−1 + cn−1 − dn > 0. The holding cost for computers not sold
during a period is $50(Xn−1 + cn−1 − dn). When demand exceeds the begin-
ning inventory plus the quantity ordered during a period, the shortage of
computers at the end of the period is dn − Xn−1 − cn−1 > 0. The shortage cost
for computers that are not available to satisfy demand during a period is
$40(dn − Xn−1 − cn−1).
When Xn−1 = 0, the retailer orders cn−1 = 3 computers. The next state
is Xn = 3 − dn, so that the expected number of computers sold equals the
expected demand for computers. The expected demand is given by

$$
\begin{aligned}
E(d_n) &= \sum_{d_n = 0}^{3} d_n\, p(d_n) = \sum_{k = 0}^{3} k\, P(d_n = k) \\
&= (0)P(d_n = 0) + (1)P(d_n = 1) + (2)P(d_n = 2) + (3)P(d_n = 3) \\
&= (0)(0.3) + (1)(0.4) + (2)(0.1) + (3)(0.2) = 1.2.
\end{aligned} \tag{4.79}
$$

The expected revenue is $300(1.2) = $360. The ordering cost is $20 + $120cn−1 =
$20 + $120(3) = $380. The next state, which represents the ending inventory,
is given by Xn = Xn−1 + cn−1 − dn = 0 + 3 − dn = 3 − dn.


If the demand is less than three computers, the retailer will have (3 − dn)
unsold computers remaining in stock at the end of a period. The expected
number of computers not sold during a period is given by

$$
\begin{aligned}
E(3 - d_n) &= \sum_{d_n = 0}^{3} (3 - d_n)\, p(d_n) = \sum_{k = 0}^{3} (3 - k)\, P(d_n = k) \\
&= (3 - 0)P(d_n = 0) + (3 - 1)P(d_n = 1) + (3 - 2)P(d_n = 2) + (3 - 3)P(d_n = 3) \\
&= 3(0.3) + 2(0.4) + 1(0.1) + 0(0.2) = 1.8.
\end{aligned} \tag{4.80}
$$

The expected holding cost is equal to the expected number of computers
not sold multiplied by the holding cost per computer. Thus, the expected
holding cost is equal to

$$
\$50 \sum_{k = 0}^{3} (3 - k)\, P(d_n = k) = \$50[3(0.3) + 2(0.4) + 1(0.1) + 0(0.2)] = \$50(1.8) = \$90. \tag{4.81}
$$

Since the ending inventory is given by Xn = 3−dn, and the demand will never
exceed three computers, the retailer will never have a shortage of computers.
Hence, she will never incur a shortage cost in state 0.
The expected reward or profit vector for the retailer’s MCR model of her
inventory system is denoted by q = [q0 q1 q2 q3]T. The expected reward or
profit in state 0, denoted by q0, equals the expected revenue minus the order-
ing cost minus the expected holding cost minus the expected shortage cost.
The expected reward in state 0 is

q0 = $360 − $380 − $90 − $0 = −$110. (4.82)

When Xn−1 = 1, the retailer orders cn−1 = 2 computers. In state 1, as in state 0,
the expected revenue is $300(1.2) = $360.
The ordering cost is

$20 + $120cn−1 = $20 + $120(2) = $260.

The next state is again equal to

Xn = Xn−1 + cn−1 − dn = 1 + 2 − dn = 3 − dn.


The expected number of computers not sold during a period is unchanged
from its value when Xn−1 = 0, and is given by

$$
\sum_{k = 0}^{3} (3 - k)\, P(d_n = k) = 1.8. \tag{4.83}
$$

The expected holding cost is also equal to its value when Xn−1 = 0, and is
given by

$$
\$50 \sum_{k = 0}^{3} (3 - k)\, P(d_n = k) = \$90. \tag{4.84}
$$

In state 1, as in state 0, the retailer will never have a shortage of computers,
and thus will never incur a shortage cost.
The expected reward or profit in state 1, denoted by q1, equals the expected
revenue minus the ordering cost minus the expected holding cost minus the
expected shortage cost. The expected reward in state 1 is

q1 = $360 − $260 − $90 − $0 = $10. (4.85)

When Xn−1 = 2, the order quantity is cn−1 = 0, so that no computers are
ordered, and no ordering cost is incurred. In state 2, Xn = max(2 − dn, 0), so that
at most two computers can be sold. The number of computers sold during
the period is equal to min(2, dn). The expected number of computers sold is
given by

$$
\begin{aligned}
E[\min(2, d_n)] &= \sum_{d_n = 0}^{3} \min(2, d_n)\, p(d_n) = \sum_{k = 0}^{3} \min(2, k)\, P(d_n = k) \\
&= \min(2, 0)P(d_n = 0) + \min(2, 1)P(d_n = 1) + \min(2, 2)P(d_n = 2) + \min(2, 3)P(d_n = 3) \\
&= (0)(0.3) + (1)(0.4) + (2)(0.1) + (2)(0.2) = 1.
\end{aligned} \tag{4.86}
$$

In state 2, the expected revenue is $300(1) = $300.
The next state is

Xn = Xn−1 + cn−1 − dn = 2 + 0 − dn = 2 − dn (4.87)

provided that 2 − dn ≥ 0. When the beginning inventory is represented by
Xn−1 = 2, the ending inventory is represented by Xn = max(2 − dn, 0). If the
demand is less than two computers, the retailer will have (2 − dn) unsold


computers remaining in stock at the end of a period. The expected number
of computers not sold during a period is given by

$$
\begin{aligned}
E[\max(2 - d_n, 0)] &= \sum_{d_n = 0}^{2} (2 - d_n)\, p(d_n) = \sum_{k = 0}^{2} (2 - k)\, P(d_n = k) \\
&= (2 - 0)P(d_n = 0) + (2 - 1)P(d_n = 1) + (2 - 2)P(d_n = 2) \\
&= 2(0.3) + 1(0.4) + 0(0.1) = 1.
\end{aligned} \tag{4.88}
$$

Thus, the expected holding cost is equal to

$$
\$50 \sum_{k = 0}^{2} (2 - k)\, P(d_n = k) = \$50[2(0.3) + 1(0.4)] = \$50(1) = \$50. \tag{4.89}
$$

When the demand exceeds two computers, the retailer will have a shortage
of (dn − 2) computers, and thus will incur a shortage cost. Since the demand
will never exceed three computers, the retailer can have a maximum shortage
of one computer. The expected number of computers not available to
satisfy the demand during a period is equal to the expected number of shortages.
The expected number of shortages is given by

$$
\begin{aligned}
E[\max(d_n - 2, 0)] &= \sum_{d_n = 2}^{3} (d_n - 2)\, p(d_n) = \sum_{k = 2}^{3} (k - 2)\, P(d_n = k) \\
&= (2 - 2)P(d_n = 2) + (3 - 2)P(d_n = 3) \\
&= (0)P(d_n = 2) + (1)P(d_n = 3) = 0(0.1) + 1(0.2) = 0.2.
\end{aligned} \tag{4.90}
$$

The expected shortage cost is equal to the expected number of shortages
times the shortage cost per computer short. Thus, the expected shortage cost
is equal to

$$
\$40 \sum_{k = 2}^{3} (k - 2)\, P(d_n = k) = \$40[1(0.2)] = \$40(0.2) = \$8. \tag{4.91}
$$

The expected reward in state 2, denoted by q2, equals the expected revenue
minus the ordering cost minus the expected holding cost minus the expected
shortage cost. The expected reward in state 2 is


q2 = $300 − $0 − $50 − $8 = $242. (4.92)

When Xn−1 = 3, the order quantity is cn−1 = 0, so that no computers are ordered,
and no ordering cost is incurred. In state 3, as in states 0 and 1, the expected
revenue is $300(1.2) = $360.
The next state is

X n = X n −1 + cn −1 − dn = 3 + 0 − dn = 3 − dn . (4.93)

The expected number of computers not sold during a period is unchanged
from its value when Xn−1 = 0 and Xn−1 = 1, and is given by

$$
\sum_{k = 0}^{3} (3 - k)\, P(d_n = k) = 1.8. \tag{4.94}
$$

The expected holding cost is equal to its value when Xn−1 = 0 and Xn−1 = 1, and
is given by

$$
\$50 \sum_{k = 0}^{3} (3 - k)\, P(d_n = k) = \$50(1.8) = \$90. \tag{4.95}
$$

In state 3, as in states 0 and 1, the retailer will never have a shortage of com-
puters, and thus will never incur a shortage cost.
The expected reward in state 3, denoted by q3, equals the expected rev-
enue minus the ordering cost minus the expected holding cost minus the
expected shortage cost. The expected reward in state 3 is

q3 = $360 − $0 − $90 − $0 = $270. (4.96)

The retailer’s inventory system, controlled by a (2, 3) ordering policy, has
been modeled as a four-state MCR. The MCR model has the following transition
probability matrix, P, and expected reward vector, q:

Beginning          Order,                            Next state
Inventory, Xn−1    cn−1     Xn−1 + cn−1    State     0      1      2      3      Reward
0                  3        3              0         0.2    0.1    0.4    0.3    −110
1                  2        3              1         0.2    0.1    0.4    0.3      10
2                  0        2              2         0.3    0.4    0.3    0       242
3                  0        3              3         0.2    0.1    0.4    0.3     270
(4.97)

The four columns headed 0 through 3 form the matrix P, and the Reward column
is the vector q.
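
The entries of Equation (4.97) can be reproduced from the cost and revenue data. The sketch below is a hedged illustration: it assumes, as the state-by-state calculations above imply, that the (2, 3) policy orders up to three computers whenever the beginning inventory is below two, and it uses the demand probabilities that appear in Equation (4.79). All constant and function names are hypothetical.

```python
# Demand distribution P(d_n = k), as used in Equation (4.79).
DEMAND_PMF = {0: 0.3, 1: 0.4, 2: 0.1, 3: 0.2}
PRICE, UNIT_COST, ORDER_FEE = 300, 120, 20
HOLDING_COST, SHORTAGE_COST = 50, 40
REORDER_POINT, ORDER_UP_TO = 2, 3  # assumed reading of the (2, 3) policy


def order_quantity(x):
    """Order up to 3 computers when the beginning inventory is below 2."""
    return ORDER_UP_TO - x if x < REORDER_POINT else 0


def expected_reward(x):
    """Expected revenue minus ordering, holding, and shortage costs in state x."""
    c = order_quantity(x)
    stock = x + c
    revenue = PRICE * sum(min(stock, d) * p for d, p in DEMAND_PMF.items())
    ordering = ORDER_FEE + UNIT_COST * c if c > 0 else 0
    holding = HOLDING_COST * sum(max(stock - d, 0) * p for d, p in DEMAND_PMF.items())
    shortage = SHORTAGE_COST * sum(max(d - stock, 0) * p for d, p in DEMAND_PMF.items())
    return revenue - ordering - holding - shortage


def transition_row(x):
    """Distribution of the ending inventory max(stock - d_n, 0) from state x."""
    stock = x + order_quantity(x)
    row = [0.0] * (ORDER_UP_TO + 1)
    for d, p in DEMAND_PMF.items():
        row[max(stock - d, 0)] += p
    return row


q = [expected_reward(x) for x in range(4)]
P = [transition_row(x) for x in range(4)]
print([round(r, 2) for r in q])
for row in P:
    print([round(p, 2) for p in row])
```

The printed reward vector and transition rows can be compared entry by entry with Equation (4.97).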

Her expected average reward per period, or gain, is equal to the sum, over
all states, of the expected reward in each state multiplied by the steady-state


probability for that state. The steady-state probability vector for the inventory
system was calculated in Equation (2.27). Using Equation (4.47), the gain
earned by following a (2, 3) inventory ordering policy is

$$
g = \pi q = \begin{bmatrix} 0.2364 & 0.2091 & 0.3636 & 0.1909 \end{bmatrix}
\begin{bmatrix} -110 \\ 10 \\ 242 \\ 270 \end{bmatrix} = 115.6212. \tag{4.98}
$$

4.2.3.4.2 Recurrent MCR Model of Component Replacement


Consider the four-state recurrent Markov chain model of component replace-
ment introduced in Section 1.10.1.1.5. The transition probability matrix is
shown in Equation (1.50). The following replacement policy is implemented.
All components are inspected every week. All components that have failed
during the week are replaced with new components. No component is replaced
later than age 4 weeks. By adding cost information, component replacement
will be modeled as an MCR. Suppose that the cost to inspect a component is
$2, and the cost to replace a failed component is $10. Therefore, the
expected immediate cost incurred for a component of age i is the $2 inspection
cost plus the $10 replacement cost multiplied by the probability pi0 that an
i-week-old component has failed. Letting qi denote
the expected cost incurred for a component of age i gives the result that

qi = $2 + $10 pi 0 . (4.99)

The vector of costs for this replacement policy is

$$
q = \begin{bmatrix} q_0 \\ q_1 \\ q_2 \\ q_3 \end{bmatrix}
= \begin{bmatrix} \$2 + \$10 p_{00} \\ \$2 + \$10 p_{10} \\ \$2 + \$10 p_{20} \\ \$2 + \$10 p_{30} \end{bmatrix}
= \begin{bmatrix} \$2 + \$10(0.2) \\ \$2 + \$10(0.375) \\ \$2 + \$10(0.8) \\ \$2 + \$10(1) \end{bmatrix}
= \begin{bmatrix} \$4 \\ \$5.75 \\ \$10 \\ \$12 \end{bmatrix}. \tag{4.100}
$$

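The recurrent MCR model of component replacement has the following transition
probability matrix, P, and cost vector, q:

$$
P = \begin{array}{c|cccc}
\text{State} & 0 & 1 & 2 & 3 \\ \hline
0 & 0.2 & 0.8 & 0 & 0 \\
1 & 0.375 & 0 & 0.625 & 0 \\
2 & 0.8 & 0 & 0 & 0.2 \\
3 & 1 & 0 & 0 & 0
\end{array}, \quad
q = \begin{array}{c|c}
\text{State} & \text{Cost} \\ \hline
0 & \$4 \\
1 & \$5.75 \\
2 & \$10 \\
3 & \$12
\end{array} \tag{4.101}
$$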


The expected average cost per week, or negative gain, for this component
replacement policy is equal to the sum, over all states, of the cost in each
state multiplied by the steady-state probability for that state. The steady-state
probability vector for the component replacement model is calculated in
Equation (2.29). Using Equation (4.47), the negative gain incurred by following
this replacement policy is:

$$
g = \pi q = \begin{bmatrix} 0.4167 & 0.3333 & 0.2083 & 0.0417 \end{bmatrix}
\begin{bmatrix} \$4 \\ \$5.75 \\ \$10 \\ \$12 \end{bmatrix} = \$6.17. \tag{4.102}
$$

4.2.4 A Unichain MCR


Enlarging Equation (3.1), the transition matrix P in canonical form and the
reward vector q for a unichain MCR are

$$
P = \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}, \quad
q = \begin{bmatrix} q_R \\ q_T \end{bmatrix}, \tag{4.103}
$$

where the components of the reward vector q are the vectors qR and qT. The
entries of vector qR are the rewards received for visits to the recurrent states.
The entries of vector qT are the rewards received for visits to the transient
states.

4.2.4.1 Expected Average Reward or Gain


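The limiting transition probability matrix for a unichain is given by

$$
\lim_{n \to \infty} P^{(n)} =
\begin{bmatrix} \lim_{n \to \infty} S^n & 0 \\ \lim_{n \to \infty} D_n & 0 \end{bmatrix}
= \begin{bmatrix} \Pi & 0 \\ \Pi & 0 \end{bmatrix}. \tag{3.148}
$$

Using Equation (4.43), the gain vector g for a unichain MCR is

$$
g = \lim_{n \to \infty} P^{(n)} q =
\begin{bmatrix} \lim_{n \to \infty} S^n & 0 \\ \lim_{n \to \infty} D_n & 0 \end{bmatrix}
\begin{bmatrix} q_R \\ q_T \end{bmatrix}
= \begin{bmatrix} \Pi & 0 \\ \Pi & 0 \end{bmatrix}
\begin{bmatrix} q_R \\ q_T \end{bmatrix}
= \begin{bmatrix} \Pi q_R \\ \Pi q_R \end{bmatrix}
= \begin{bmatrix} g_R \\ g_R \end{bmatrix}, \tag{4.104}
$$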


where each row of the matrix Π is the steady-state probability vector, π = [πi],
for the recurrent chain S, and gR is the gain vector for the recurrent states.
Observe that gR = ΠqR. By the rules of matrix multiplication, all the compo-
nents of the gain vector for a unichain MCR are equal. All states, both recur-
rent and transient, have the same gain, which is equal to the gain of the closed
class S of recurrent states. If g denotes the gain in every state of a unichain
MCR, then the scalar gain g in every state of a unichain MCR with N recur-
rent states is

$$
g = \sum_{i = 1}^{N} \pi_i q_i. \tag{4.105}
$$

In vector form,

g = π qR . (4.106)

Thus, the gain of a unichain MCR is equal to the steady-state probability
vector for the recurrent chain multiplied by the reward vector for the recurrent
chain. Since all states in a recurrent MCR have the same gain, and the
gain is the same for every state in a reducible unichain MCR, the same three
approaches given in Section 4.2.3 for calculating the gain of a recurrent MCR
can be used to calculate the gain of a unichain MCR. Therefore, the gain of
a unichain MCR can also be calculated by solving the VDEs or by executing
value iteration over an infinite horizon.

4.2.4.1.1 Gain of a Four-State Unichain MCR


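Consider the following generic four-state unichain MCR in which states 1
and 2 form a recurrent closed class, and states 3 and 4 are transient. The transition
matrix and reward vector are given in Equation (4.107).

$$
P = [p_{ij}] =
\begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \end{array}
\begin{bmatrix}
p_{11} & p_{12} & 0 & 0 \\
p_{21} & p_{22} & 0 & 0 \\
p_{31} & p_{32} & p_{33} & p_{34} \\
p_{41} & p_{42} & p_{43} & p_{44}
\end{bmatrix}
= \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix}, \quad
q = \begin{bmatrix} q_1 \\ q_2 \\ q_3 \\ q_4 \end{bmatrix}
= \begin{bmatrix} q_R \\ q_T \end{bmatrix}, \quad
q_R = \begin{bmatrix} q_1 \\ q_2 \end{bmatrix}, \quad
q_T = \begin{bmatrix} q_3 \\ q_4 \end{bmatrix}. \tag{4.107}
$$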

The limiting transition probability matrix is calculated in Equation (3.149).
The gain vector is

$$
g = \lim_{n \to \infty} P^{(n)} q =
\begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \end{array}
\begin{bmatrix}
\pi_1 & \pi_2 & 0 & 0 \\
\pi_1 & \pi_2 & 0 & 0 \\
\pi_1 & \pi_2 & 0 & 0 \\
\pi_1 & \pi_2 & 0 & 0
\end{bmatrix}
\begin{bmatrix} q_1 \\ q_2 \\ q_3 \\ q_4 \end{bmatrix}
= \begin{bmatrix} g_1 \\ g_2 \\ g_3 \\ g_4 \end{bmatrix}
= \begin{bmatrix} g \\ g \\ g \\ g \end{bmatrix}
= \begin{bmatrix}
\pi_1 q_1 + \pi_2 q_2 \\
\pi_1 q_1 + \pi_2 q_2 \\
\pi_1 q_1 + \pi_2 q_2 \\
\pi_1 q_1 + \pi_2 q_2
\end{bmatrix}. \tag{4.108}
$$

As Section 4.2.4.1 has indicated, the gain for the unichain MCR is the same
in every state. Treating the two-state recurrent chain separately, the gain,
calculated by using Equation (4.47), is

$$
g = \pi q = \pi q_R = \begin{bmatrix} \pi_1 & \pi_2 \end{bmatrix}
\begin{bmatrix} q_1 \\ q_2 \end{bmatrix} = \pi_1 q_1 + \pi_2 q_2. \tag{4.109}
$$

Since all rows of the limiting transition probability matrix for the four-state
unichain are identical, the gain in every state of the unichain MCR can also
be computed by applying Equation (4.47) to the unichain MCR. That is,

$$
g = \pi q = \begin{bmatrix} \pi_1 & \pi_2 & 0 & 0 \end{bmatrix}
\begin{bmatrix} q_1 \\ q_2 \\ q_3 \\ q_4 \end{bmatrix}
= \begin{bmatrix} \pi_1 & \pi_2 \end{bmatrix}
\begin{bmatrix} q_1 \\ q_2 \end{bmatrix}
= \pi_1 q_1 + \pi_2 q_2. \tag{4.110}
$$

The generic two-state recurrent chain has the transition probability matrix

$$
S = \begin{array}{c} 1 \\ 2 \end{array}
\begin{bmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{bmatrix}. \tag{4.111}
$$

The steady-state probability vector for S is calculated in Equation (2.16).
Therefore, the gain in every state of the generic four-state unichain MCR is

$$
g = \pi_1 q_1 + \pi_2 q_2 = \frac{p_{21} q_1 + p_{12} q_2}{p_{12} + p_{21}}. \tag{4.112}
$$

4.2.4.1.2 Gain of an Absorbing Unichain MCR


In the special case of an absorbing unichain MCR, the recurrent chain consists
of a single absorbing state. The canonical form of the transition matrix
for an absorbing unichain MCR is given below:

$$
P = \begin{bmatrix} 1 & 0 \\ D & Q \end{bmatrix}, \quad
q = \begin{bmatrix} q_1 \\ q_T \end{bmatrix}, \tag{4.113}
$$


where the scalar q1 is the reward received when state 1, the absorbing state,
is visited. The steady-state probability for the absorbing state is π1 = 1. Hence
the gain in every state of an absorbing unichain MCR is equal to the reward
received in the absorbing state. That is,

g = π 1 q1 = 1q1 = q1 . (4.114)

4.2.4.1.3 A Unichain MCR Model of Machine Maintenance


Consider the absorbing Markov chain model of machine deterioration
introduced in Section 1.10.1.2.1.2. The engineer responsible for the machine
observes the condition of the machine at the start of a day. The engineer
always does nothing to respond to the deterioration of the machine. This
model will be enlarged by adding revenue data to create an absorbing
unichain MCR [3, 6]. Since all state transitions caused by deterioration
occur at the end of the day, a full day’s production is achieved by a work-
ing machine, which is left alone. If the machine is in state 4, working
properly, revenue of $1,000 is earned for the day. If the machine is in state
3, working, with a minor defect, daily revenue of $500 is earned. If the
machine is in state 2, working, with a major defect, the daily revenue is
$200. Finally, if the machine is in state 1, not working, no daily revenue
is received. The daily revenue earned in every state is summarized in
Table 4.8.
TABLE 4.8
Daily Revenue Earned in Every State

State   Condition                              Daily Revenue
1       Not Working (NW)                       $0
2       Working, with a Major Defect (WM)      $200
3       Working, with a Minor Defect (Wm)      $500
4       Working Properly (WP)                  $1,000

Now consider the unichain model of machine maintenance introduced
in Section 1.10.1.2.2.1. This unichain model will also be enlarged by adding
cost data to the revenue data to create a unichain MCR. Assume that
the engineer follows the original maintenance policy described in Section
1.10.1.2.2.1, under which a machine in state 3 is overhauled. All maintenance
actions, which may include overhauling, repairing, or replacing the
machine, take 1 day to complete. No revenue is earned on days during
which maintenance is performed. For simplicity, the daily maintenance
costs are assumed to be independent of the state of the machine. Suppose
that the daily cost to overhaul the machine is $300, and the daily cost to
repair it is $700. The daily cost to replace the machine with a new machine
is $1,200. The daily maintenance costs in every state are summarized in
Table 4.9.

TABLE 4.9
Daily Maintenance Costs

Decision   Action             Daily Cost
1          Do Nothing (DN)    $0
2          Overhaul (OV)      $300
3          Repair (RP)        $700
4          Replace (RL)       $1,200
The reward vector associated with the original maintenance pol-
icy of Section 1.10.1.2.2.1, under which the decision is to overhaul the
machine in states 1 and 3, and do nothing in states 2 and 4, appears in
Equation (4.115).

State, Xn−1 = i               Decision, k      Reward, qi
1, Not Working                2, Overhaul      −$300 = q1
2, Working, Major Defect      1, Do Nothing    $200 = q2
3, Working, Minor Defect      2, Overhaul      −$300 = q3
4, Working Properly           1, Do Nothing    $1,000 = q4
(4.115)

The addition of a reward vector to the transition probability matrix associated
with the original maintenance policy of Section 1.10.1.2.2.1 has transformed
the Markov chain model into an MCR. The unichain MCR model, with
transition probability matrix P and reward vector q, is shown below:

State, Xn−1 = i               Decision, k      State    1      2      3      4      Reward, qi
1, Not Working                2, Overhaul      1        0.2    0.8    0      0      −$300 = q1
2, Working, Major Defect      1, Do Nothing    2        0.6    0.4    0      0      $200 = q2
3, Working, Minor Defect      2, Overhaul      3        0      0      0.2    0.8    −$300 = q3
4, Working Properly           1, Do Nothing    4        0.3    0.2    0.1    0.4    $1,000 = q4
(4.116)


Under the modified maintenance policy of Section 1.10.1.2.2.1, if the machine
is in state 3, working, with a minor defect, the engineer will do nothing. In
that case the revenue in state 3 will be $500, the daily income from working
with a minor defect. The unichain MCR model under the modified maintenance
policy, with transition probability matrix P and reward vector q, is shown below:

State, Xn−1 = i               Decision, k      State    1      2      3      4      Reward, qi
1, Not Working                2, Overhaul      1        0.2    0.8    0      0      −$300 = q1
2, Working, Major Defect      1, Do Nothing    2        0.6    0.4    0      0      $200 = q2
3, Working, Minor Defect      1, Do Nothing    3        0.2    0.3    0.5    0      $500 = q3
4, Working Properly           1, Do Nothing    4        0.3    0.2    0.1    0.4    $1,000 = q4
(4.117)

4.2.4.1.4 Gain of a Unichain MCR Model of Machine Maintenance


Using the steady-state probability vector calculated in Equation (3.150), the
gain, under both maintenance policies, for the unichain MCR model of
machine maintenance constructed in Section 4.2.4.1.3 is computed below by
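applying Equation (4.47) or Equation (4.110) for a four-state unichain MCR:

\[
\begin{aligned}
g = \pi q &= \begin{bmatrix} \pi_1 & \pi_2 & 0 & 0 \end{bmatrix}
\begin{bmatrix} q_1 \\ q_2 \\ q_3 \\ q_4 \end{bmatrix}
= \begin{bmatrix} \pi_1 & \pi_2 \end{bmatrix}
\begin{bmatrix} q_1 \\ q_2 \end{bmatrix}
= \pi_1 q_1 + \pi_2 q_2 \\
&= \begin{bmatrix} \tfrac{3}{7} & \tfrac{4}{7} \end{bmatrix}
\begin{bmatrix} -300 \\ 200 \end{bmatrix}
= \frac{-100}{7} = -14.29.
\end{aligned}
\tag{4.118}
\]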

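Note that the gain is the same under both maintenance policies, because the
policies differ only in state 3, which is transient under both and therefore
has a steady-state probability of zero.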
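The calculation in Equation (4.118) can be sketched in a few lines of Python. This is an illustrative sketch, not code from the text: the function name `stationary_two_state` is an assumption, and the closed form it uses for the two-state steady-state probabilities is the one that appears in Equation (4.124).

```python
def stationary_two_state(p12, p21):
    """Steady-state probabilities (pi1, pi2) of a two-state recurrent chain.

    Hypothetical helper: uses pi1 = p21/(p12 + p21), pi2 = p12/(p12 + p21).
    """
    total = p12 + p21
    return p21 / total, p12 / total


# Recurrent chain {1, 2}, identical under both maintenance policies.
p12, p21 = 0.8, 0.6
q1, q2 = -300.0, 200.0

pi1, pi2 = stationary_two_state(p12, p21)

# Transient states 3 and 4 have zero steady-state probability,
# so q3 and q4 do not enter the gain.
gain = pi1 * q1 + pi2 * q2
print(round(pi1, 4), round(pi2, 4), round(gain, 2))  # 0.4286 0.5714 -14.29
```

Because q3 never enters the computation, changing the decision in state 3 cannot change the gain.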

4.2.4.2 Value Determination Equations


A second procedure for calculating the gain of a unichain MCR over an infi-
nite planning horizon is to solve the VDEs. As the discussion for a recurrent
MCR in Section 4.2.3.2.2 indicates, if the objective is simply to find the gain
of the process, then it is not necessary to solve the VDEs. In that case, the
approach taken in Section 4.2.4.1 to calculate the gain is sufficient. Of course,
a solution of the VDEs will also produce the relative values. Since all states in
a unichain MCR have the same gain, denoted by g, the VDEs for a unichain
MCR are identical to those for a recurrent MCR. However, it is not necessary
to simultaneously solve the full set of VDEs for a unichain MCR. It is suffi-
cient to first solve the subset of the VDEs associated with the recurrent states
to determine the gain and the relative values of the recurrent states. The rel-
ative values for the transient states may be obtained next by substituting the
relative values of the recurrent states into the VDEs for the transient states.
The relative values for all the states are used to execute policy improvement
for an MDP in Section 5.1.2.3.3 of Chapter 5.

4.2.4.2.1 Solving the VDEs for a Four-State Unichain MCR


To illustrate the procedure briefly outlined in Section 4.2.4.2 for solving the
VDEs for a unichain MCR, consider the generic four-state unichain MCR for
which the transition matrix and reward vector are given in Equation (4.107).
The gain was computed in Section 4.2.4.1.1. The system (4.62) of four VDEs is

\[
g + v_i = q_i + \sum_{j=1}^{4} p_{ij} v_j
= q_i + p_{i1} v_1 + p_{i2} v_2 + p_{i3} v_3 + p_{i4} v_4,
\quad \text{for } i = 1, 2, 3, 4.
\tag{4.119}
\]

The four individual value determination equations are

\[
\begin{aligned}
g + v_1 &= q_1 + p_{11} v_1 + p_{12} v_2 + p_{13} v_3 + p_{14} v_4 \\
g + v_2 &= q_2 + p_{21} v_1 + p_{22} v_2 + p_{23} v_3 + p_{24} v_4 \\
g + v_3 &= q_3 + p_{31} v_1 + p_{32} v_2 + p_{33} v_3 + p_{34} v_4 \\
g + v_4 &= q_4 + p_{41} v_1 + p_{42} v_2 + p_{43} v_3 + p_{44} v_4.
\end{aligned}
\tag{4.120}
\]

Setting p13 = p14 = 0 and p23 = p24 = 0 for the unichain MCR, the four VDEs
become

\[
\begin{aligned}
g + v_1 &= q_1 + p_{11} v_1 + p_{12} v_2 \\
g + v_2 &= q_2 + p_{21} v_1 + p_{22} v_2 \\
g + v_3 &= q_3 + p_{31} v_1 + p_{32} v_2 + p_{33} v_3 + p_{34} v_4 \\
g + v_4 &= q_4 + p_{41} v_1 + p_{42} v_2 + p_{43} v_3 + p_{44} v_4.
\end{aligned}
\tag{4.121}
\]

The gain can be calculated by solving the subset of two VDEs for the recur-
rent chain separately. Since states 1 and 2 are the two recurrent states, the
VDEs for the two-state recurrent chain within the four-state unichain are

\[
\begin{aligned}
g + v_1 &= q_1 + p_{11} v_1 + p_{12} v_2 \\
g + v_2 &= q_2 + p_{21} v_1 + p_{22} v_2.
\end{aligned}
\tag{4.122}
\]


Setting v2 = 0 for the highest numbered state in the recurrent chain, the VDEs
for the recurrent chain become

\[
\begin{aligned}
g + v_1 &= q_1 + p_{11} v_1 \\
g &= q_2 + p_{21} v_1.
\end{aligned}
\tag{4.123}
\]

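The solutions for v1 and g are

\[
\begin{aligned}
v_1 &= \frac{q_1 - q_2}{1 - p_{11} + p_{21}} = \frac{q_1 - q_2}{p_{12} + p_{21}} \\
g &= q_2 + \frac{p_{21}(q_1 - q_2)}{1 - p_{11} + p_{21}}
= \frac{q_2 (p_{12} + p_{21}) + p_{21}(q_1 - q_2)}{p_{12} + p_{21}} \\
&= \frac{p_{21} q_1 + p_{12} q_2}{p_{12} + p_{21}} = \pi_1 q_1 + \pi_2 q_2,
\end{aligned}
\tag{4.124}
\]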

which agree with the gain and the relative values obtained for a generic two-
state recurrent MCR in Equation (4.65). The solution for the gain also agrees
with the solution obtained for the generic four-state MCR in Section 4.2.4.1.1.
By substituting the gain and the relative values of the recurrent states into
the VDEs for the transient states, the latter two VDEs can be solved to find
the relative values of the transient states. The two VDEs for the transient
states are

\[
\begin{aligned}
g + v_3 &= q_3 + p_{31} v_1 + p_{32} v_2 + p_{33} v_3 + p_{34} v_4 \\
g + v_4 &= q_4 + p_{41} v_1 + p_{42} v_2 + p_{43} v_3 + p_{44} v_4.
\end{aligned}
\tag{4.125}
\]

Rewriting the two VDEs with the two unknowns, the relative values v3 and
v4 for the two transient states, on the left hand side,

\[
\begin{aligned}
(1 - p_{33}) v_3 - p_{34} v_4 &= -g + q_3 + p_{31} v_1 + p_{32} v_2 \\
-p_{43} v_3 + (1 - p_{44}) v_4 &= -g + q_4 + p_{41} v_1 + p_{42} v_2.
\end{aligned}
\tag{4.126}
\]

The right-hand side constants are evaluated by setting v2 = 0 and substituting
from Equation (4.124):

\[
v_1 = \frac{q_1 - q_2}{p_{12} + p_{21}}, \qquad
g = \frac{p_{21} q_1 + p_{12} q_2}{p_{12} + p_{21}}.
\tag{4.124}
\]


In matrix form, the two VDEs for the transient states become

\[
\begin{bmatrix} 1 - p_{33} & -p_{34} \\ -p_{43} & 1 - p_{44} \end{bmatrix}
\begin{bmatrix} v_3 \\ v_4 \end{bmatrix}
=
\begin{bmatrix}
\dfrac{q_1 (p_{31} - p_{21}) - q_2 (p_{12} + p_{31}) + q_3 (p_{12} + p_{21})}{p_{12} + p_{21}} \\[2ex]
\dfrac{q_1 (p_{41} - p_{21}) - q_2 (p_{12} + p_{41}) + q_4 (p_{12} + p_{21})}{p_{12} + p_{21}}
\end{bmatrix}.
\tag{4.127}
\]

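The solution, expressed as ratios of determinants, is

\[
\begin{aligned}
v_3 &= \frac{
\begin{vmatrix}
\dfrac{q_1 (p_{31} - p_{21}) - q_2 (p_{12} + p_{31}) + q_3 (p_{12} + p_{21})}{p_{12} + p_{21}} & -p_{34} \\[2ex]
\dfrac{q_1 (p_{41} - p_{21}) - q_2 (p_{12} + p_{41}) + q_4 (p_{12} + p_{21})}{p_{12} + p_{21}} & 1 - p_{44}
\end{vmatrix}}{
\begin{vmatrix} 1 - p_{33} & -p_{34} \\ -p_{43} & 1 - p_{44} \end{vmatrix}} \\[3ex]
v_4 &= \frac{
\begin{vmatrix}
1 - p_{33} & \dfrac{q_1 (p_{31} - p_{21}) - q_2 (p_{12} + p_{31}) + q_3 (p_{12} + p_{21})}{p_{12} + p_{21}} \\[2ex]
-p_{43} & \dfrac{q_1 (p_{41} - p_{21}) - q_2 (p_{12} + p_{41}) + q_4 (p_{12} + p_{21})}{p_{12} + p_{21}}
\end{vmatrix}}{
\begin{vmatrix} 1 - p_{33} & -p_{34} \\ -p_{43} & 1 - p_{44} \end{vmatrix}}.
\end{aligned}
\tag{4.128}
\]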

4.2.4.2.2 Procedure for Solving the VDEs for a Unichain MCR


In summary, to find the gain and relative values for an N-state unichain
MCR, it is not necessary to solve the N VDEs simultaneously. Instead, the
gain and the relative values can be determined by executing the following
three-step procedure:
Step 1. For the recurrent chain, solve the VDEs for the gain and the rela-
tive values of the recurrent states by setting equal to zero the relative
value for the highest numbered recurrent state.
Step 2. Substitute, into the VDEs for the transient states, the gain and the rel-
ative values of the recurrent states obtained in step 1.
Step 3. Solve the VDEs for the transient states to obtain the relative values of
the transient states.
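The three-step procedure above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the function names `solve_linear` and `solve_unichain_vdes` are hypothetical, states are indexed from 0 rather than 1, and the small linear systems are solved by Gauss-Jordan elimination. The example data are the machine maintenance model under the modified policy, Equation (4.117).

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def solve_unichain_vdes(P, q, recurrent, transient):
    """Return (gain, relative values); state indices are 0-based."""
    # Step 1: VDEs of the recurrent chain, with the relative value of the
    # highest numbered recurrent state set to zero. Unknowns: the remaining
    # recurrent relative values, followed by the gain g.
    ref = recurrent[-1]
    unknown = recurrent[:-1]
    A, b = [], []
    for i in recurrent:
        row = [(1.0 if i == j else 0.0) - P[i][j] for j in unknown]
        row.append(1.0)  # coefficient of g
        A.append(row)
        b.append(q[i])
    sol = solve_linear(A, b)
    g = sol[-1]
    v = {ref: 0.0}
    v.update(zip(unknown, sol[:-1]))

    # Step 2: substitute g and the recurrent relative values into the
    # VDEs of the transient states (they move to the right-hand side).
    rhs = [q[i] - g + sum(P[i][j] * v[j] for j in recurrent) for i in transient]

    # Step 3: solve (I - Q) v_T = rhs for the transient relative values.
    I_minus_Q = [[(1.0 if i == j else 0.0) - P[i][j] for j in transient]
                 for i in transient]
    v.update(zip(transient, solve_linear(I_minus_Q, rhs)))
    return g, [v[i] for i in range(len(q))]


# Machine maintenance under the modified policy, Equation (4.117).
P = [[0.2, 0.8, 0.0, 0.0],
     [0.6, 0.4, 0.0, 0.0],
     [0.2, 0.3, 0.5, 0.0],
     [0.3, 0.2, 0.1, 0.4]]
q = [-300.0, 200.0, 500.0, 1000.0]

g, v = solve_unichain_vdes(P, q, recurrent=[0, 1], transient=[2, 3])
print(round(g, 2), [round(x, 2) for x in v])
```

Carried at full precision, the sketch gives g = −14.29 and relative values close to −357.14, 0, 885.71, and 1659.52. These match Equations (4.131) and (4.134) except in the last digit of v3 and v4, where the text's figures reflect rounded intermediate values.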


4.2.4.2.3 Solving the VDEs for a Unichain MCR Model of Machine Maintenance under Modified Policy of Doing Nothing in State 3

The VDEs for the unichain MCR model of machine maintenance under the
modified maintenance policy described in Section 4.2.4.1.3 will be solved
to demonstrate the three-step procedure outlined in Section 4.2.4.2.2. The
unichain MCR model under the modified maintenance policy is shown in
Equation (4.117).
Step 1. The VDEs for the two recurrent states are

\[
\begin{aligned}
g + v_1 &= -300 + 0.2 v_1 + 0.8 v_2 \\
g + v_2 &= 200 + 0.6 v_1 + 0.4 v_2.
\end{aligned}
\tag{4.129}
\]

Setting v2 = 0 for the highest numbered state in the recurrent chain, the VDEs
for the recurrent chain become

\[
\begin{aligned}
g + v_1 &= -300 + 0.2 v_1 \\
g &= 200 + 0.6 v_1.
\end{aligned}
\tag{4.130}
\]

The solutions for v1 and g are

\[
v_1 = -357.14, \qquad g = -14.29.
\tag{4.131}
\]

The same value for the gain was obtained in Equation (4.118).
Step 2. The VDEs for the two transient states are

\[
\begin{aligned}
g + v_3 &= 500 + 0.2 v_1 + 0.3 v_2 + 0.5 v_3 + 0 v_4 \\
g + v_4 &= 1000 + 0.3 v_1 + 0.2 v_2 + 0.1 v_3 + 0.4 v_4.
\end{aligned}
\tag{4.132}
\]

Substituting the gain, g = −14.29, and the relative values, v1 = −357.14, v2 = 0, for
the recurrent states obtained in step 1, the VDEs for the transient states are

\[
\begin{aligned}
-14.29 + v_3 &= 500 + 0.2(-357.14) + 0.3(0) + 0.5 v_3 + 0 v_4 \\
-14.29 + v_4 &= 1000 + 0.3(-357.14) + 0.2(0) + 0.1 v_3 + 0.4 v_4.
\end{aligned}
\tag{4.133}
\]

Step 3. The solutions for v3 and v4, the relative values of the transient states, are

\[
v_3 = 885.72, \qquad v_4 = 1659.53.
\tag{4.134}
\]
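As a check on step 3, the arithmetic can be written out using the rounded values g = −14.29 and v1 = −357.14 from step 1. The first transient equation contains only v3, because the coefficient of v4 is zero, so it can be solved first and its solution substituted into the second:

\[
\begin{aligned}
0.5\,v_3 &= 500 + 0.2(-357.14) + 14.29 = 442.86, & v_3 &= 885.72, \\
0.6\,v_4 &= 1000 + 0.3(-357.14) + 0.1(885.72) + 14.29 = 995.72, & v_4 &= 1659.53.
\end{aligned}
\]

Carrying the unrounded values g = −100/7 and v1 = −2500/7 through the same steps gives 885.71 and 1659.52, so the final digit of each result reflects rounding only.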


4.2.4.3 Solution by Value Iteration of Unichain MCR Model of Machine Maintenance under Modified Policy of Doing Nothing in State 3

An approximate solution for the gain and the relative values of a unichain
MCR can be obtained by executing value iteration over an infinite planning
horizon. The value iteration algorithm for a unichain MCR is identical to the
one given in Section 4.2.3.3.4 for a recurrent MCR. Hence, when value iter-
ation is executed for a unichain MCR, bounds on the gain can be calculated
by using the formula (4.75). In this section, seven iterations of value iteration
will be executed to obtain an approximate solution for the gain and rela-
tive values of the unichain MCR model of machine maintenance under the
modified maintenance policy described in Section 4.2.4.1.3. An exact solu-
tion for the relative values and the gain was obtained in Section 4.2.4.2.3 by
solving the VDEs.
The four-state unichain MCR model under the modified maintenance
policy is shown in Equation (4.117).
The value iteration equation in matrix form is

\[
v(n) = q + P\,v(n + 1)
\tag{4.29}
\]

\[
\begin{bmatrix} v_1(n) \\ v_2(n) \\ v_3(n) \\ v_4(n) \end{bmatrix}
=
\begin{bmatrix} -300 \\ 200 \\ 500 \\ 1000 \end{bmatrix}
+
\begin{bmatrix}
0.2 & 0.8 & 0 & 0 \\
0.6 & 0.4 & 0 & 0 \\
0.2 & 0.3 & 0.5 & 0 \\
0.3 & 0.2 & 0.1 & 0.4
\end{bmatrix}
\begin{bmatrix} v_1(n + 1) \\ v_2(n + 1) \\ v_3(n + 1) \\ v_4(n + 1) \end{bmatrix}.
\tag{4.135}
\]

To begin value iteration, the terminal values at the end of the planning hori-
zon are set equal to zero for all states.

n = 0

\[
\begin{bmatrix} v_1(0) \\ v_2(0) \\ v_3(0) \\ v_4(0) \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}.
\]

Value iteration is executed over the last three periods of an infinite planning
horizon.


n = −1

\[
v(-1) = q + P\,v(0):
\quad
\begin{bmatrix} v_1(-1) \\ v_2(-1) \\ v_3(-1) \\ v_4(-1) \end{bmatrix}
=
\begin{bmatrix} -300 \\ 200 \\ 500 \\ 1000 \end{bmatrix}
+
\begin{bmatrix}
0.2 & 0.8 & 0 & 0 \\
0.6 & 0.4 & 0 & 0 \\
0.2 & 0.3 & 0.5 & 0 \\
0.3 & 0.2 & 0.1 & 0.4
\end{bmatrix}
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}
=
\begin{bmatrix} -300 \\ 200 \\ 500 \\ 1000 \end{bmatrix}
\tag{4.136}
\]

n = −2

\[
v(-2) = q + P\,v(-1):
\quad
\begin{bmatrix} v_1(-2) \\ v_2(-2) \\ v_3(-2) \\ v_4(-2) \end{bmatrix}
=
\begin{bmatrix} -300 \\ 200 \\ 500 \\ 1000 \end{bmatrix}
+
\begin{bmatrix}
0.2 & 0.8 & 0 & 0 \\
0.6 & 0.4 & 0 & 0 \\
0.2 & 0.3 & 0.5 & 0 \\
0.3 & 0.2 & 0.1 & 0.4
\end{bmatrix}
\begin{bmatrix} -300 \\ 200 \\ 500 \\ 1000 \end{bmatrix}
=
\begin{bmatrix} -200 \\ 100 \\ 750 \\ 1400 \end{bmatrix}
\tag{4.137}
\]

n = −3

\[
v(-3) = q + P\,v(-2):
\quad
\begin{bmatrix} v_1(-3) \\ v_2(-3) \\ v_3(-3) \\ v_4(-3) \end{bmatrix}
=
\begin{bmatrix} -300 \\ 200 \\ 500 \\ 1000 \end{bmatrix}
+
\begin{bmatrix}
0.2 & 0.8 & 0 & 0 \\
0.6 & 0.4 & 0 & 0 \\
0.2 & 0.3 & 0.5 & 0 \\
0.3 & 0.2 & 0.1 & 0.4
\end{bmatrix}
\begin{bmatrix} -200 \\ 100 \\ 750 \\ 1400 \end{bmatrix}
=
\begin{bmatrix} -260 \\ 120 \\ 865 \\ 1595 \end{bmatrix}.
\tag{4.138}
\]

After executing value iteration over four additional periods, the results are
summarized in Table 4.10.
Table 4.11 gives the differences between the expected total rewards earned
over planning horizons, which differ in length by one period.

TABLE 4.10
Expected Total Rewards for Machine Maintenance under the Modified Maintenance Policy, Calculated by Value Iteration during the Last Seven Periods of an Infinite Planning Horizon

Epoch n        −7          −6         −5        −4       −3      −2      −1     0
v1(n)      −304.416     −288.96    −277.6     −256     −260    −200    −300     0
v2(n)        53.312       66.72      83.2       92      120     100     200     0
v3(n)       930.6065     936.765    934.65    916.5     865     750     500     0
v4(n)      1703.2945    1707.405   1701.45   1670.5    1595    1400    1000     0


TABLE 4.11
Differences between the Expected Total Rewards Earned over Planning Horizons That Differ in Length by One Period

Each entry in rows i = 1, ..., 4 is vi(n) − vi(n + 1) for the epoch n at the head of its column.

Epoch n                          −7          −6         −5        −4       −3       −2        −1
i = 1                        −15.456 L    −11.36    −21.6 L       4     −60 L     100     −300 L
i = 2                        −13.408     −16.48 L    −8.8      −28 L      20    −100 L     200
i = 3                         −6.1585      2.115     18.15     51.5      115      250       500
i = 4                         −4.1105 U    5.955 U   30.95 U   75.5 U    195 U    400 U    1000 U

Max [vi(n) − vi(n + 1)] = gU(T)   −4.1105      5.955     30.95     75.5      195      400      1000
Min [vi(n) − vi(n + 1)] = gL(T)  −15.456     −16.48     −21.6      −28      −60     −100      −300
gU(T) − gL(T)                     11.3455     22.435     52.55    103.5      255      500      1300

In Table 4.11, a suffix U identifies gU(T), the maximum difference for each
epoch. The suffix L identifies gL(T), the minimum difference for each epoch.
The differences, gU(T) − gL(T), obtained for all the epochs are listed in the
bottom row of Table 4.11. In Equations (4.118) and (4.131), the gain was found
to be −14.29 dollars per period. When seven periods remain in the
planning horizon, Table 4.11 shows that the bounds on the gain are given by
−15.456 ≤ g ≤ −4.1105. The bounds are quite loose. After seven iterations, using
Equation (4.75), the gain is approximately equal to the arithmetic average of
its upper and lower bounds, so that

g ≈ (−4.1105 − 15.456)/2 = −9.7833. (4.139)

Many more iterations will be needed to see if value iteration will generate a
close approximation to the gain.
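The iteration and the bounds can be reproduced, and extended to more periods, with a short Python sketch. The function names `value_iteration` and `gain_bounds` are assumptions made for illustration. The bounds are taken, as in Table 4.11, to be the smallest and largest differences between successive expected total rewards.

```python
def value_iteration(P, q, periods):
    """Iterate v(n) = q + P v(n + 1) backward from v(0) = 0.

    Returns [v(0), v(-1), ..., v(-periods)], so history[k] is v(-k).
    """
    n_states = len(q)
    history = [[0.0] * n_states]
    for _ in range(periods):
        prev = history[-1]
        history.append([q[i] + sum(P[i][j] * prev[j] for j in range(n_states))
                        for i in range(n_states)])
    return history


def gain_bounds(v_now, v_next):
    """Lower and upper bounds on the gain: min and max of v(n) - v(n + 1)."""
    diffs = [a - b for a, b in zip(v_now, v_next)]
    return min(diffs), max(diffs)


# Machine maintenance under the modified policy, Equation (4.117).
P = [[0.2, 0.8, 0.0, 0.0],
     [0.6, 0.4, 0.0, 0.0],
     [0.2, 0.3, 0.5, 0.0],
     [0.3, 0.2, 0.1, 0.4]]
q = [-300.0, 200.0, 500.0, 1000.0]

history = value_iteration(P, q, 50)

lo7, hi7 = gain_bounds(history[7], history[6])      # seven periods remaining
lo50, hi50 = gain_bounds(history[50], history[49])  # fifty periods remaining
print(round(lo7, 4), round(hi7, 4))
print(round(lo50, 4), round(hi50, 4))
```

With seven periods remaining the sketch returns the bounds −15.456 and −4.1105 quoted above. With fifty periods remaining the two bounds agree with each other and with −100/7 ≈ −14.29 to well within 10⁻⁶, which suggests that for this model value iteration does converge to the gain, although far more than seven iterations are required.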
Subtracting v2(n), the expected total reward for the highest numbered state
in the recurrent chain, from all of the other expected total rewards in Table 4.10
produces Table 4.12, which shows the expected relative rewards, vi(n) − v2(n),
during the last seven epochs of the planning horizon.


TABLE 4.12
Expected Relative Rewards, vi(n) − v2(n), during the Last Seven Epochs of the Planning Horizon

Epoch n              −7          −6         −5        −4       −3      −2      −1     0
v1(n) − v2(n)    −357.728     −355.68    −360.8     −348     −380    −300    −500     0
v3(n) − v2(n)     877.2945     870.045    851.45    824.5     745     650     300     0
v4(n) − v2(n)    1649.9825    1640.685   1618.25   1578.5    1475    1300     800     0

In Equations (4.131) and (4.134), the relative values obtained by solving the
VDEs are, by comparison,

v1 = −357.14, v2 = 0, v3 = 885.72, and v4 = 1659.53. (4.140)

4.2.4.4 Expected Total Reward before Passage to a Closed Set


In this section, the expected total reward earned by a unichain MCR before
absorption or passage to a closed class of recurrent states, given that the
chain started in a transient state, will be calculated [5]. As Section 4.2.5.4 will
demonstrate, this calculation can also be made for a multichain MCR.

4.2.4.4.1 Four-State Unichain MCR


To see how to calculate the expected total reward received before passage to
a closed class of recurrent states, given that the chain started in a transient
state, consider the generic four-state unichain MCR, introduced in Section
4.2.4.1.1. The transition matrix and reward vector are given in Equation
(4.107). The VDEs were solved in Section 4.2.4.2.1. By setting v2 = 0 for the
highest numbered state in the recurrent chain, the relative value v1 of the other
recurrent state and the gain g were calculated in Equation (4.124). The
VDEs for the two transient states are shown in Equation (4.125). Referring to
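Equation (4.107), and letting

\[
g_R = \begin{bmatrix} g \\ g \end{bmatrix}, \quad
v_T = \begin{bmatrix} v_3 \\ v_4 \end{bmatrix}, \quad
q_T = \begin{bmatrix} q_3 \\ q_4 \end{bmatrix}, \quad \text{and} \quad
Q = \begin{bmatrix} p_{33} & p_{34} \\ p_{43} & p_{44} \end{bmatrix},
\tag{4.141}
\]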

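in Equation (4.125), and omitting the terms p31v1 + p32v2 and p41v1 + p42v2
that involve the recurrent states (these terms vanish in the case of interest
below, where qR = 0 gives v1 = v2 = 0), the matrix equation for the vector vT
of relative values for the transient states is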

\[
\begin{aligned}
g_R + v_T &= q_T + Q v_T \\
I v_T - Q v_T &= q_T - g_R \\
(I - Q) v_T &= q_T - g_R \\
v_T &= (I - Q)^{-1} (q_T - g_R).
\end{aligned}
\tag{4.142}
\]


Substituting U = (I − Q)^{−1} to represent the fundamental matrix, the vector of
relative values for the transient states is

\[
v_T = U (q_T - g_R) = U q_T - U g_R.
\tag{4.143}
\]

The first term, UqT, represents the vector of expected total rewards received
before passage to the closed class of recurrent states, given that the chain
started in a transient state. The second term, UgR, is the vector of expected
total rewards received after passage to the recurrent closed class. Observe
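that when the components of vector qR are set equal to zero, then the vector
gR = 0, because each of its components is the gain

\[
g = \pi q_R
= \begin{bmatrix} \pi_1 & \pi_2 \end{bmatrix} \begin{bmatrix} q_1 \\ q_2 \end{bmatrix}
= \begin{bmatrix} \pi_1 & \pi_2 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \end{bmatrix}
= 0
\tag{4.144}
\]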

and

vT = UqT − Ug R = UqT − 0 = UqT . (4.145)

Thus, if the rewards received in all the recurrent states are set equal to zero,
then vT, the vector of relative values for the transient states, is equal to UqT, the
vector of expected total rewards received before passage to the closed class
of recurrent states, given that the chain started in a transient state. In other
words, by setting qR = 0, vT = UqT can be found by solving the VDEs.
The following alternative approach to interpreting UqT as the vector of
expected total rewards received before passage to a closed class of recurrent
states, given that the chain started in a transient state, does not involve
solving the VDEs for the relative values of the transient states. Recall from
Section 3.2.2 that uij, the (i, j)th entry of the fundamental matrix, U, specifies the
expected number of times that the chain is in transient state j before eventual
passage to a recurrent state, given that the chain started in transient state i. The
entry in row j of the vector qT, denoted by (qT)j, is the reward received each time
the chain is in transient state j. Therefore, the entry in row i of the vector UqT,
denoted by (UqT)i, is equal to the following sum of products:

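\[
\begin{aligned}
(U q_T)_i &= \sum_{j \in T} \big[ (\text{expected number of times the chain is in transient state } j, \\
&\qquad\quad \text{given that the chain started in transient state } i) \\
&\qquad\quad \times (\text{reward received each time the chain is in transient state } j) \big] \\
&= \sum_{j \in T} u_{ij} \, (q_T)_j .
\end{aligned}
\tag{4.146}
\]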


Both of these approaches have demonstrated that the following result holds for
any unichain MCR. Suppose that P is the transition probability matrix, U is the
fundamental matrix, qT is the vector of rewards received in the transient states,
and T denotes the set of transient states. Then the ith component of the vector
UqT represents the expected total reward earned before eventual passage to
the recurrent closed class, given that the chain started in transient state i.

4.2.4.4.2 Unichain MCR Model of a Career Path with Lifetime Employment


Consider the company owned by its employees described in Section
1.10.1.2.2.2. The company offers both lifetime employment and career flex-
ibility. After many years, the company has collected sufficient data to con-
struct the transition probability matrix which is shown in canonical form in
Equation (1.57). Monthly salary data has also been collected, and is
summarized in the following reward vector, whose rows correspond to states 1
through 6:

\[
q = \begin{bmatrix} 15{,}000 \\ 14{,}000 \\ 16{,}000 \\ 12{,}000 \\ 13{,}000 \\ 11{,}000 \end{bmatrix}
= \begin{bmatrix} q_R \\ q_T \end{bmatrix},
\tag{4.147}
\]

where qR is the vector of monthly salaries earned in recurrent states, which
represent management positions, and qT is the vector of monthly salaries
earned in transient states, which represent engineering positions. By adding
the reward vector q to the transition probability matrix P, the career path of
an employee is modeled as a six-state MCR.
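The fundamental matrix for this model, with rows and columns indexed by
transient states 4, 5, and 6, is

\[
U = (I - Q)^{-1}
= \begin{bmatrix}
0.70 & -0.16 & -0.24 \\
-0.26 & 0.60 & -0.18 \\
-0.14 & -0.32 & 0.72
\end{bmatrix}^{-1}
= \begin{bmatrix}
1.9918 & 1.0215 & 0.9193 \\
1.1300 & 2.5026 & 1.0023 \\
0.8895 & 1.3109 & 2.0131
\end{bmatrix}.
\tag{4.148}
\]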

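The vector UqT is computed below, with rows again indexed by states 4, 5, and 6.

\[
U q_T = (I - Q)^{-1} q_T
= \begin{bmatrix}
1.9918 & 1.0215 & 0.9193 \\
1.1300 & 2.5026 & 1.0023 \\
0.8895 & 1.3109 & 2.0131
\end{bmatrix}
\begin{bmatrix} 12{,}000 \\ 13{,}000 \\ 11{,}000 \end{bmatrix}
= \begin{bmatrix} 47{,}293.40 \\ 57{,}119.10 \\ 49{,}859.80 \end{bmatrix}.
\tag{4.149}
\]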


The entry in row i of the vector UqT represents the expected total salary
earned by an engineer before she is promoted to a management position,
given that she started in a transient state i. For example, an engineer who
started in state 6, systems testing, can expect to earn $49,859.80 prior to being
promoted to management.
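The computation of U and UqT can be checked numerically. The following Python sketch is illustrative only: the helper `invert` is a hypothetical name, and it inverts I − Q by Gauss-Jordan elimination rather than by whatever method was used to obtain Equation (4.148).

```python
def invert(A):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    # Augment A with the identity matrix.
    M = [list(map(float, A[i])) + [1.0 if i == j else 0.0 for j in range(n)]
         for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [x / p for x in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [row[n:] for row in M]


# Transitions among the transient states 4, 5, 6 (engineering positions).
Q = [[0.30, 0.16, 0.24],
     [0.26, 0.40, 0.18],
     [0.14, 0.32, 0.28]]
q_T = [12000.0, 13000.0, 11000.0]  # monthly salaries in states 4, 5, 6

n = len(Q)
I_minus_Q = [[(1.0 if i == j else 0.0) - Q[i][j] for j in range(n)]
             for i in range(n)]
U = invert(I_minus_Q)                                   # fundamental matrix
UqT = [sum(U[i][j] * q_T[j] for j in range(n)) for i in range(n)]

for state, row, total in zip((4, 5, 6), U, UqT):
    print(state, [round(x, 4) for x in row], round(total, 2))
```

The entries of U agree with Equation (4.148) to four decimal places. The components of UqT agree with Equation (4.149) to within about a dollar; the small differences arise because Equation (4.149) multiplies by the rounded entries of U.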
Of course, if the engineer is interested solely in calculating the expected
total salary that she will earn before she is promoted to a management posi-
tion, she can merge the three recurrent states, that is, states 1, 2, and 3, into a
single absorbing state, denoted by 0, which will represent management. The
result will be the following four-state absorbing unichain:

\[
P =
\begin{array}{l}
\text{0 Management} \\
\text{4 Engineering Product Design} \\
\text{5 Engineering Systems Integration} \\
\text{6 Engineering Systems Testing}
\end{array}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0.30 & 0.30 & 0.16 & 0.24 \\
0.16 & 0.26 & 0.40 & 0.18 \\
0.26 & 0.14 & 0.32 & 0.28
\end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ D & Q \end{bmatrix}.
\tag{4.150}
\]

Since the matrix Q and the vector qT are unchanged, the same entries will be
obtained for the vector UqT.
Suppose that the fundamental matrix has not been calculated. If the com-
ponents of qR, the vector of monthly salaries earned in the recurrent states,
are set equal to zero, then, as Section 4.2.4.4.1 indicates, solving the VDEs
for the vector vT of relative values for the transient states is an alternative
way of calculating the expected total reward received before passage to the
recurrent closed class, given that the chain started in a transient state. For
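example, suppose that the modified MCR, with rows and columns indexed by
states 1 through 6, is

\[
P = \begin{bmatrix}
0.30 & 0.20 & 0.50 & 0 & 0 & 0 \\
0.40 & 0.25 & 0.35 & 0 & 0 & 0 \\
0.50 & 0.10 & 0.40 & 0 & 0 & 0 \\
0.05 & 0.15 & 0.10 & 0.30 & 0.16 & 0.24 \\
0.04 & 0.07 & 0.05 & 0.26 & 0.40 & 0.18 \\
0.08 & 0.06 & 0.12 & 0.14 & 0.32 & 0.28
\end{bmatrix}
= \begin{bmatrix} S & 0 \\ D & Q \end{bmatrix},
\qquad
q = \begin{bmatrix} q_R \\ q_T \end{bmatrix}
= \begin{bmatrix} 0 \\ 0 \\ 0 \\ 12{,}000 \\ 13{,}000 \\ 11{,}000 \end{bmatrix}.
\tag{4.151}
\]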


In vector form, the VDEs for the relative values of the transient states are

\[
\begin{aligned}
g_R + v_T &= q_T + Q v_T \\
g_R &= 0 \\
v_T &= q_T + Q v_T.
\end{aligned}
\tag{4.152}
\]

In algebraic form, the VDEs are

\[
\begin{aligned}
v_4 &= 12{,}000 + 0.30 v_4 + 0.16 v_5 + 0.24 v_6 \\
v_5 &= 13{,}000 + 0.26 v_4 + 0.40 v_5 + 0.18 v_6 \\
v_6 &= 11{,}000 + 0.14 v_4 + 0.32 v_5 + 0.28 v_6.
\end{aligned}
\tag{4.153}
\]

The solution is

\[
v_T = \begin{bmatrix} v_4 \\ v_5 \\ v_6 \end{bmatrix}
= \begin{bmatrix} 47{,}293.40 \\ 57{,}119.10 \\ 49{,}859.80 \end{bmatrix}.
\]

These are the same values calculated in Equation (4.149) for UqT, the vector of
expected total salaries received before passage to the closed class of recur-
rent states, given that the employee started as an engineer.
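When the fundamental matrix is not available, one simple way to solve Equation (4.153) numerically is successive substitution: start from vT = 0 and repeatedly evaluate the right-hand side. This technique is an assumption introduced here for illustration, not the method used in the text, which solves the three linear equations directly. It converges for this model because every row sum of Q is less than one.

```python
# Transitions among the transient states 4, 5, 6 and their monthly salaries.
Q = [[0.30, 0.16, 0.24],
     [0.26, 0.40, 0.18],
     [0.14, 0.32, 0.28]]
q_T = [12000.0, 13000.0, 11000.0]

# Successive substitution on v_T = q_T + Q v_T, starting from v_T = 0.
# After k passes, v equals q_T + Q q_T + ... + Q^(k-1) q_T.
v = [0.0, 0.0, 0.0]
for _ in range(200):
    v = [q_T[i] + sum(Q[i][j] * v[j] for j in range(3)) for i in range(3)]

# Residual of the VDEs at the final iterate; it should be essentially zero.
residual = max(abs(v[i] - q_T[i] - sum(Q[i][j] * v[j] for j in range(3)))
               for i in range(3))

print([round(x, 2) for x in v], residual)
```

The iterates approach the solution reported above to within about a dollar. The remaining gap comes from rounding in the reported values rather than from the iteration.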

4.2.4.5 Value Iteration over a Finite Planning Horizon


Value iteration can be executed to calculate the expected total reward
received over a finite planning horizon by a reducible MCR, either unichain
or multichain. In this section, value iteration will be executed over three
periods for the following modified unichain MCR model of a career path
treated in Section 4.2.4.4.2. Suppose that a significant decline in sales has
induced the company to offer an engineer a buyout of $50,000 if she volun-
tarily leaves the company at the end of any month. The probabilities that
an engineer will accept the buyout and leave are 0.30 if she is engaged in
product design, 0.16 if she is engaged in systems integration, and 0.26 if she
is engaged in systems testing. The probabilities that she will make monthly
transitions among the three engineering positions remain unchanged. The
three recurrent states that formerly represented management positions are
merged to form a single absorbing state 0, which now represents an engineer’s
acceptance of the buyout. The resulting absorbing unichain MCR
model is shown below:

\[
P =
\begin{array}{l}
\text{0 Accept Buyout} \\
\text{4 Engineering Product Design} \\
\text{5 Engineering Systems Integration} \\
\text{6 Engineering Systems Testing}
\end{array}
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0.30 & 0.30 & 0.16 & 0.24 \\
0.16 & 0.26 & 0.40 & 0.18 \\
0.26 & 0.14 & 0.32 & 0.28
\end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ D & Q \end{bmatrix},
\qquad
q = \begin{bmatrix} 50{,}000 \\ 12{,}000 \\ 13{,}000 \\ 11{,}000 \end{bmatrix}
= \begin{bmatrix} q_A \\ q_T \end{bmatrix}.
\tag{4.154}
\]
Value iteration using Equation (4.29) will be executed over a 3-month plan-
ning horizon to calculate the expected total cost to the company of offer-
ing a buyout to an engineer. To begin the backward recursion, the vector of
expected terminal total costs received at the end of the 3-month planning
horizon is set equal to zero for all states.
n = T = 3

\[
v(3) = v(T) = 0:
\quad
\begin{bmatrix} v_0(3) \\ v_4(3) \\ v_5(3) \\ v_6(3) \end{bmatrix}
= \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}
\]

n = 2

\[
v(2) = q + P\,v(3):
\quad
\begin{bmatrix} v_0(2) \\ v_4(2) \\ v_5(2) \\ v_6(2) \end{bmatrix}
=
\begin{bmatrix} 50{,}000 \\ 12{,}000 \\ 13{,}000 \\ 11{,}000 \end{bmatrix}
+
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0.30 & 0.30 & 0.16 & 0.24 \\
0.16 & 0.26 & 0.40 & 0.18 \\
0.26 & 0.14 & 0.32 & 0.28
\end{bmatrix}
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}
=
\begin{bmatrix} 50{,}000 \\ 12{,}000 \\ 13{,}000 \\ 11{,}000 \end{bmatrix}
\tag{4.155}
\]

n = 1

\[
v(1) = q + P\,v(2):
\quad
\begin{bmatrix} v_0(1) \\ v_4(1) \\ v_5(1) \\ v_6(1) \end{bmatrix}
=
\begin{bmatrix} 50{,}000 \\ 12{,}000 \\ 13{,}000 \\ 11{,}000 \end{bmatrix}
+
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0.30 & 0.30 & 0.16 & 0.24 \\
0.16 & 0.26 & 0.40 & 0.18 \\
0.26 & 0.14 & 0.32 & 0.28
\end{bmatrix}
\begin{bmatrix} 50{,}000 \\ 12{,}000 \\ 13{,}000 \\ 11{,}000 \end{bmatrix}
=
\begin{bmatrix} 100{,}000 \\ 35{,}320 \\ 31{,}300 \\ 32{,}920 \end{bmatrix}
\tag{4.156}
\]


n = 0

\[
v(0) = q + P\,v(1):
\quad
\begin{bmatrix} v_0(0) \\ v_4(0) \\ v_5(0) \\ v_6(0) \end{bmatrix}
=
\begin{bmatrix} 50{,}000 \\ 12{,}000 \\ 13{,}000 \\ 11{,}000 \end{bmatrix}
+
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0.30 & 0.30 & 0.16 & 0.24 \\
0.16 & 0.26 & 0.40 & 0.18 \\
0.26 & 0.14 & 0.32 & 0.28
\end{bmatrix}
\begin{bmatrix} 100{,}000 \\ 35{,}320 \\ 31{,}300 \\ 32{,}920 \end{bmatrix}
=
\begin{bmatrix} 150{,}000 \\ 65{,}504.80 \\ 56{,}628.80 \\ 61{,}178.40 \end{bmatrix}.
\tag{4.157}
\]

Equation (4.157) indicates that after 3 months, the expected total cost to the
company of offering a buyout to an engineer, given that she is currently
engaged in product design, systems integration, or systems testing, will be
$65,504.80, $56,628.80, or $61,178.40, respectively.
As an alternative to executing value iteration, Equation (4.17) with
T = 3 can also be used to calculate v(0) over a 3-month planning horizon.

\[
v(0) = q + Pq + P^2 q + P^3 v(3)
\tag{4.158}
\]

\[
\begin{aligned}
v(0) = \begin{bmatrix} v_0(0) \\ v_4(0) \\ v_5(0) \\ v_6(0) \end{bmatrix}
&= \begin{bmatrix} 50{,}000 \\ 12{,}000 \\ 13{,}000 \\ 11{,}000 \end{bmatrix}
+ \begin{bmatrix}
1 & 0 & 0 & 0 \\
0.30 & 0.30 & 0.16 & 0.24 \\
0.16 & 0.26 & 0.40 & 0.18 \\
0.26 & 0.14 & 0.32 & 0.28
\end{bmatrix}
\begin{bmatrix} 50{,}000 \\ 12{,}000 \\ 13{,}000 \\ 11{,}000 \end{bmatrix} \\
&\quad + \begin{bmatrix}
1 & 0 & 0 & 0 \\
0.4780 & 0.1652 & 0.1888 & 0.1680 \\
0.3488 & 0.2072 & 0.2592 & 0.1848 \\
0.4260 & 0.1644 & 0.2400 & 0.1696
\end{bmatrix}
\begin{bmatrix} 50{,}000 \\ 12{,}000 \\ 13{,}000 \\ 11{,}000 \end{bmatrix}
+ \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \\
&= \begin{bmatrix} 50{,}000 \\ 12{,}000 \\ 13{,}000 \\ 11{,}000 \end{bmatrix}
+ \begin{bmatrix} 50{,}000 \\ 23{,}320 \\ 18{,}300 \\ 21{,}920 \end{bmatrix}
+ \begin{bmatrix} 50{,}000 \\ 30{,}184.8 \\ 25{,}328.8 \\ 28{,}258.4 \end{bmatrix}
= \begin{bmatrix} 150{,}000 \\ 65{,}504.80 \\ 56{,}628.80 \\ 61{,}178.40 \end{bmatrix},
\end{aligned}
\tag{4.159}
\]

equal to the expected total cost vector calculated by value iteration in
Equation (4.157).
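Both routes to v(0) can be compared with a short Python sketch. The helper name `mat_vec` and the variable names are assumptions made for illustration. The first loop is the backward recursion v(n) = q + Pv(n + 1); the second accumulates q + Pq + P²q term by term, which equals Equation (4.158) because v(3) = 0.

```python
def mat_vec(P, v):
    """Product of matrix P and column vector v."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]


# Buyout model of Equation (4.154); states ordered 0, 4, 5, 6.
P = [[1.00, 0.00, 0.00, 0.00],
     [0.30, 0.30, 0.16, 0.24],
     [0.16, 0.26, 0.40, 0.18],
     [0.26, 0.14, 0.32, 0.28]]
q = [50000.0, 12000.0, 13000.0, 11000.0]
T = 3

# Route 1: backward recursion from the terminal condition v(T) = 0.
v = [0.0] * len(q)
for epoch in range(T - 1, -1, -1):      # epochs 2, 1, 0
    v = [qi + pv for qi, pv in zip(q, mat_vec(P, v))]

# Route 2: sum of q, Pq, P^2 q (the term P^3 v(3) is zero).
term = list(q)
total = [0.0] * len(q)
for _ in range(T):
    total = [t + x for t, x in zip(total, term)]
    term = mat_vec(P, term)

print([round(x, 2) for x in v])
print([round(x, 2) for x in total])
```

Both routes give 150,000, 65,504.80, 56,628.80, and 61,178.40 for states 0, 4, 5, and 6, matching Equations (4.157) and (4.159).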

4.2.5 A Multichain MCR


In this section, the expected average rewards, or gains, earned by the states
in a multichain MCR over an infinite planning horizon are calculated. Value
iteration will not be applied to a multichain MCR over an infinite horizon
because no suitable stopping condition is available. The gain vector can be
calculated either by multiplying the limiting transition probability matrix by
the reward vector using Equation (4.43), or by solving two sets of equations,
which together are called the reward evaluation equations (REEs) [4, 7].
Recall from Section 4.2.1 that a reducible multichain MCR, which is simply
called a multichain MCR, consists of two or more closed classes of recurrent
states plus one or more transient states. By enlarging the representation of
the transition matrix for a multichain in Equation (3.4), the canonical form of
the transition matrix for a multichain MCR with M recurrent chains is given
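below:

\[
P = \begin{bmatrix}
P_1 & 0 & \cdots & 0 & 0 \\
0 & P_2 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & P_M & 0 \\
D_1 & D_2 & \cdots & D_M & Q
\end{bmatrix},
\qquad
q = \begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ q_M \\ q_T \end{bmatrix}.
\tag{4.160}
\]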

The components of vectors q1, … , qM are the rewards received by the recurrent
states in the recurrent chains governed by the transition matrices P1 ,…,PM,
respectively. The components of vector qT are the rewards received in the
transient states.

4.2.5.1 An Eight-State Multichain MCR Model of a Production Process


The eight-state multichain model of a three-stage production process intro-
duced in Section 1.10.2.2 will be enlarged by introducing data for operation
times and costs. (This model is adapted from one in Shamblin and Stevens
[8].) The Markov chain model is revisited in Sections 3.1.3, 3.3.2.2.2, 3.3.3.2.4,
3.4.1.2, and 3.5.5.2. The objective of this section is to construct a cost vector
(the negative of a reward vector) for a multichain MCR model, and calculate
the expected total operation cost of an item sold.
Recall that the output of each manufacturing stage in the sequential pro-
duction process is inspected. Output from stage 1 or 2 that is not defective
is passed on to the next stage. Output with a minor defect is reworked at
the current stage. Output with a major defect is scrapped. Nondefective but
blemished output from stage 3 is sent to a training center where it is used to
train engineers, technicians, and technical writers. Output from stage 3 that
is neither defective nor blemished is sold. The transition probability matrix
in canonical form is shown in Equation (3.7). Table 4.13 provides data for
operation times and costs.
Each element qi of the cost vector is simply the cost of an operation,
which is equal to the operation time multiplied by the cost per hour. Thus,
component qi of the cost vector, which appears in the right-hand column of
Table 4.13, is equal to the product of the entries in columns three and four
of row i of Table 4.13.

TABLE 4.13
Operation Times and Costs

State  Operation            Operation Time (h)  Cost per Hour    Cost per Operation
1      Scrap the output     2.6                 $60 (disposal)   $156 = (2.6 h)($60) = q1
2      Sell the output      1.4                 $40              $56 = (1.4 h)($40) = q2
3      Train engineers      3.2                 $50              $160 = (3.2 h)($50) = q3
4      Train technicians    4.1                 $30              $123 = (4.1 h)($30) = q4
5      Train tech. writers  5.3                 $55              $291.50 = (5.3 h)($55) = q5
6      Stage 3              10                  $45              $450 = (10 h)($45) = q6
7      Stage 2              16                  $25              $400 = (16 h)($25) = q7
8      Stage 1              12                  $35              $420 = (12 h)($35) = q8

The complete eight-state multichain MCR model of production, with rows and
columns indexed by states 1 through 8, is shown below:

\[
P = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0.50 & 0.30 & 0.20 & 0 & 0 & 0 \\
0 & 0 & 0.30 & 0.45 & 0.25 & 0 & 0 & 0 \\
0 & 0 & 0.10 & 0.35 & 0.55 & 0 & 0 & 0 \\
0.20 & 0.16 & 0.04 & 0.03 & 0.02 & 0.55 & 0 & 0 \\
0.15 & 0 & 0 & 0 & 0 & 0.20 & 0.65 & 0 \\
0.10 & 0 & 0 & 0 & 0 & 0 & 0.15 & 0.75
\end{bmatrix},
\qquad
q = \begin{bmatrix} 156 \\ 56 \\ 160 \\ 123 \\ 291.5 \\ 450 \\ 400 \\ 420 \end{bmatrix}
\tag{4.161}
\]

To calculate the expected operation cost of an item sold, it is first necessary to
compute the expected cost that will be incurred by an entering item in each
operation. These calculations are shown in Table 4.14. Recall that the three
manufacturing stages correspond to transient states. Production stage i is rep-
resented by transient state (9 − i). The fundamental matrix, U, is computed
in Equation (3.102). The entries in the bottom row of U specify the expected
number of visits that an entering item, starting at stage 1 in transient state 8,
will make to each of the three manufacturing stages prior to being scrapped,
sold, or sent to the training center. On average, an entering item will make
u88 = 4 visits to stage 1, u87 = 1.7143 visits to stage 2, and u86 = 0.7619 visits
to stage 3. The expected cost that will be incurred by an entering item in a
manufacturing stage is equal to the operation cost for the stage multiplied
by the expected number of visits to the stage. The expected cost that will be
incurred by an entering item in each of the three manufacturing stages is cal-
culated in rows 6, 7, and 8 in the right-hand column of Table 4.14.


TABLE 4.14
Expected Operation Costs per Entering Item

State  Operation                Operation Cost  Expected Operation Cost per Entering Item
1      Scrap the Output         $156            q1 f81 = ($156)(0.8095) = $126.28
2      Sell the Output          $56             q2 f82 = ($56)(0.1219) = $6.83
3      Train Engineers          $160            q3 lim(n→∞) p83(n) = ($160)(0.0199) = $3.18
4      Train Technicians        $123            q4 lim(n→∞) p84(n) = ($123)(0.0256) = $3.15
5      Train Technical Writers  $291.50         q5 lim(n→∞) p85(n) = ($291.50)(0.0231) = $6.73
6      Stage 3                  $450            q6 u86 = ($450)(0.7619) = $342.86
7      Stage 2                  $400            q7 u87 = ($400)(1.7143) = $685.72
8      Stage 1                  $420            q8 u88 = ($420)(4) = $1680

As Equation (3.103) indicates, if the chain occupies a transient state, the matrix
of absorption probabilities is

\[
F = U[D_1 \;\; D_2] =
\begin{array}{c|cc}
  & 1 & 2 \\ \hline
6 & 0.4444 & 0.3556 \\
7 & 0.6825 & 0.2032 \\
8 & 0.8095 & 0.1219
\end{array}
= [F_1 \;\; F_2].
\tag{3.103}
\]

The operations “scrap the output,” state 1, and “sell the output,” state 2, are
both represented by absorbing states. The expected cost that will be incurred
by an entering item in an absorbing state j is equal to the operation cost for
the absorbing state multiplied by the probability f8j that an entering item will
be absorbed in absorbing state j. The expected cost that will be incurred by
an entering item in each of the two absorbing states is calculated in rows 1
and 2 in the right-hand column of Table 4.14.
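The absorption probabilities and the two absorbing-state costs can be reproduced with a few lines of NumPy. This is a sketch under the assumption that `Q` and `D` are read directly from the transient rows of P in Equation (4.161); the variable names are illustrative.

```python
import numpy as np

# Transient block (states 6, 7, 8) of P in Equation (4.161).
Q = np.array([[0.55, 0.00, 0.00],
              [0.20, 0.65, 0.00],
              [0.00, 0.15, 0.75]])
# One-step transitions from states 6, 7, 8 to absorbing states 1 (scrap) and 2 (sell).
D = np.array([[0.20, 0.16],
              [0.15, 0.00],
              [0.10, 0.00]])
# Operation costs of the absorbing states.
q_A = np.array([156.0, 56.0])

# Fundamental matrix and matrix of absorption probabilities, F = U D.
U = np.linalg.inv(np.eye(3) - Q)
F = U @ D

# Expected cost per entering item (start in state 8) in each absorbing state.
absorbing_costs = q_A * F[2]

print(np.round(F, 4))
print(np.round(absorbing_costs, 2))
```

The scrap cost computed this way can differ from the tabulated $126.28 by about a cent, since Table 4.14 uses f81 rounded to 0.8095.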
As Section 3.5.5.2 indicates, if the chain occupies a transient state, the
matrix of limiting probabilities of transitions from the set of transient states,
T = {6, 7, 8}, to the three recurrent states, which belong to the recurrent chain,
R = {3, 4, 5}, is given by Equation (3.163). As Equation (3.164) demonstrates, the
expected cost that will be incurred by an entering item in a recurrent state j
is equal to the operation cost for the recurrent state multiplied by

\[
\lim_{n \to \infty} p_{8j}^{(n)} = (f_{83} + f_{84} + f_{85})\pi_j = f_{8R}\pi_j,
\tag{4.162}
\]

the limiting probability for recurrent state j. The expected cost that will be
incurred by an entering item in each of the three recurrent states associated
with training is calculated in rows 3, 4, and 5 in the right-hand column of
Table 4.14.

TABLE 4.15
Expected Operation Costs Per Item Sold

State | Operation               | Expected Operation Cost per Item Sold = (Expected Operation Cost per Entering Item)/f82
------|-------------------------|------------------------------------------------------------
1     | Scrap the Output        | ($126.28)/f82 = ($126.28)/(0.1219) = $1,035.95
2     | Sell the Output         | ($6.83)/f82 = ($6.83)/(0.1219) = $56
3     | Train Engineers         | ($3.18)/f82 = ($3.18)/(0.1219) = $26.09
4     | Train Technicians       | ($3.15)/f82 = ($3.15)/(0.1219) = $25.84
5     | Train Technical Writers | ($6.73)/f82 = ($6.73)/(0.1219) = $55.21
6     | Stage 3                 | ($342.86)/f82 = ($342.86)/(0.1219) = $2,812.63
7     | Stage 2                 | ($685.72)/f82 = ($685.72)/(0.1219) = $5,625.27
8     | Stage 1                 | ($1,680)/f82 = ($1,680)/(0.1219) = $13,781.79
      |                         | Sum = $23,418.78

Recall from Equation (3.106) that the expected number of entering items
required to enable 100 items to be sold is equal to 100/f82, where f82 = 0.1219
is the probability that an entering item will be sold. Therefore, the expected
operation cost per item sold is equal to the expected operation cost for an
entering item, calculated in the right-hand column of Table 4.14, divided by
the probability that an entering item will be sold. The expected operation
costs per item sold are calculated in the right-hand column of Table 4.15.
The expected total operation cost per item sold is $23,418.78, the sum of the
entries in the right-hand column, shown in the bottom row of Table 4.15.
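The division by f82 can be sketched in plain Python. The dictionary below simply transcribes the right-hand column of Table 4.14; the variable names are illustrative assumptions, not notation from the text.

```python
# Right-hand column of Table 4.14: expected operation cost per entering item.
cost_per_entering_item = {
    "Scrap the Output": 126.28,
    "Sell the Output": 6.83,
    "Train Engineers": 3.18,
    "Train Technicians": 3.15,
    "Train Technical Writers": 6.73,
    "Stage 3": 342.86,
    "Stage 2": 685.72,
    "Stage 1": 1680.00,
}
f82 = 0.1219  # probability that an entering item is eventually sold

# Expected number of entering items needed for 100 items to be sold.
entering_items_per_100_sold = 100 / f82

# Dividing by f82 converts a cost per entering item into a cost per item sold.
cost_per_item_sold = {op: c / f82 for op, c in cost_per_entering_item.items()}
total_per_item_sold = sum(cost_per_item_sold.values())

for op, c in cost_per_item_sold.items():
    print(f"{op:25s} {c:12,.2f}")
print(f"{'Sum':25s} {total_per_item_sold:12,.2f}")
```

Since the inputs are already rounded to cents, individual rows may differ from Table 4.15 by a few cents, while the total agrees to within a cent.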

4.2.5.2 Expected Average Reward or Gain


As Section 4.2.3.1 indicates, every state i in a multichain MCR has its own
gain gi, which is the ith component of a gain vector g. Hence, the gain of a
multichain MCR depends on the state in which it starts. The gain vector is
calculated by Equation (4.43).

4.2.5.2.1 Gain of a Four-State Absorbing Multichain MCR


Consider a generic four-state absorbing multichain MCR with the following
transition probability matrix P, which appears in Equation (3.85), and reward
vector q. The Markov chain has two absorbing states and two transient states.

1 1 0 0 0  1  q1 
2 0  2  q2   qA 
=
1 0 0 I 0
P=    , q =   =  . (4.163)
3  p31 p32 p33 p34   D Q  3  q3   qT 
   
4  p41 p42 p43 p44  4  q4 


Absorbing states 1 and 2, which belong to the set A, have gains denoted by g1
and g2, respectively. Transient states 3 and 4, which belong to the set T, have
gains denoted by g3 and g4, respectively. The limiting transition probability
matrix for the absorbing multichain was calculated in Equation (3.135). The
gain vector for the four-state absorbing multichain MCR is computed using
Equation (4.43).

1 1 0 0 0   q1   q1   g1 
2 0      g 
 =  2  =  A .
1 0 0 q2 q2 g
g = limP(n)q =    = 
n →∞ 3  f 31 f 32 0 0   q3   f 31 q1 + f 32 q2   g 3   gT 

      
4  f 41 f 42 0 0   q4   f 41 q1 + f 42 q2   g 4 
(4.164)

The gain of the absorbing multichain MCR depends on the state in which it
starts. If the system starts in an absorbing state i, for i = 1 or 2, the gain will
be gi = qi. Thus, in a multichain MCR, the gain of an absorbing state is equal
to the reward received in that state. If the system is initially in transient state
i, for i = 3 or 4, the gain will be

\[
g_i = f_{i1} q_1 + f_{i2} q_2
\tag{4.165}
\]

because the chain will eventually be absorbed in state 1 with probability fi1,
or in state 2 with probability fi2.
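A small numerical sketch may make Equation (4.165) concrete. The transition probabilities and rewards below are hypothetical values chosen only for illustration; they do not appear in the text. The absorption probabilities are obtained as F = (I − Q)⁻¹D, and the result is cross-checked against a high power of P, which approximates the limit in Equation (4.164).

```python
import numpy as np

# Hypothetical four-state absorbing chain (illustrative numbers only).
D = np.array([[0.2, 0.1],     # transient states 3, 4 -> absorbing states 1, 2
              [0.1, 0.3]])
Q = np.array([[0.4, 0.3],     # transient states 3, 4 -> transient states 3, 4
              [0.2, 0.4]])
q = np.array([10.0, -5.0, 1.0, 2.0])   # rewards q1..q4
q_A = q[:2]

# Absorption probabilities f_ij from transient state i to absorbing state j.
F = np.linalg.inv(np.eye(2) - Q) @ D

# Gains: absorbing states keep their own reward; transient states
# receive the absorption-probability-weighted average, Equation (4.165).
g_T = F @ q_A
g = np.concatenate([q_A, g_T])

# Cross-check: assemble P in the canonical form of Equation (4.163)
# and apply a high power of P to q.
P = np.block([[np.eye(2), np.zeros((2, 2))],
              [D, Q]])
g_limit = np.linalg.matrix_power(P, 200) @ q

print(np.round(g, 4))
```

Note that the transient rewards q3 and q4 have no influence on the gain, as the zero columns in Equation (4.164) indicate.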

4.2.5.2.2 Gain of a Five-State Multichain MCR


Consider a generic five-state multichain MCR with the transition probability
matrix P, which appears in Equation (3.5), and reward vector q. The Markov
chain, treated in Sections 1.9.2 and 3.1.2, has two recurrent closed sets, one of
which is an absorbing state, plus two transient states.

1 1 0 0 0 0  1  q1 
2 0 p22 p23 0 0  2  q2 
   
P = 3 0 p32 p33 0 0 , q = 3  q3  . (4.166)
   
4  p41 p42 p43 p44 p45  4  q4 
5  p51 p52 p53 p54 p55  5  q5 

State 1, with a gain denoted by g1, is an absorbing state, and constitutes the first
recurrent closed set. States 2 and 3, with gains denoted by g2 and g3, respec-
tively, are members of the second recurrent chain, denoted by R2 = {2, 3}. Since
both recurrent states belong to the same recurrent chain, they have the same
gain, denoted by gR2, so that
\[
g_2 = g_3 = g_{R_2},
\tag{4.167}
\]


where gR2 denotes the gain of all states which belong to the recurrent chain
R2. States 4 and 5 are transient, with gains denoted by g4 and g5, respectively.
The limiting transition probability matrix for the multichain was calculated
in Equation (3.158). In the limiting transition probability matrix, π2 and π3 are
the steady-state probabilities for states 2 and 3, respectively, within the
recurrent chain R2. The probabilities of absorption from transient states 4 and 5
are denoted by f41 and f51, respectively. The probabilities of eventual passage
from transient states 4 and 5 to the recurrent chain R2 are denoted by f4R2
and f5R2, respectively. All of these quantities are computed for the generic
five-state multichain in Section 3.5.5.1. The gain vector for the five-state
multichain MCR is computed using Equation (4.43).

1 1 0 0 0 0   q1   g1   g1 
2 0 π2 π3 0 0   q2   g 2   g R 2 
      
g = limP(n)q = 3  0 π2 π3 0 0   q3  =  g 3  =  g R 2 
n →∞
      
4  f 41 f 4 Rπ 2 f 4 Rπ 3 0 0   q4   g 4   g 4 
5  f 51 f 5 Rπ 2 f 5 Rπ 3 0 0   q5   g 5   g 5 

 q1 
 π 2 q2 + π 3 q3 
 
= π 2 q2 + π 3 q3 . (4.168)
 
 f 41 q1 + f 4 R (π 2 q2 + π 3 q3 )
 f 51 q1 + f 5 R (π 2 q2 + π 3 q3 )

The gain of the multichain MCR depends on the state in which it starts. If
the system starts in state 1, an absorbing state, the gain will be g1 = q1. Thus,
in a multichain MCR, the gain of an absorbing state is equal to the reward
received in that state, confirming the conclusion of Equation (4.114) in Section
4.2.4.1.2. If the system is initially in either of the two communicating recur-
rent states, states 2 or 3, the gain will be

\[
g_{R_2} = \pi_2 q_2 + \pi_3 q_3.
\tag{4.169}
\]

If the chain starts in transient state 4, either it will eventually be absorbed with
probability f41, or it will eventually enter the recurrent chain R2 with probabil-
ity f4R2. Hence, if the chain starts in transient state 4, the gain will be

\[
g_4 = f_{41} q_1 + f_{4R_2}\pi_2 q_2 + f_{4R_2}\pi_3 q_3
    = f_{41} q_1 + f_{4R_2}(\pi_2 q_2 + \pi_3 q_3)
    = f_{41} g_1 + f_{4R_2} g_{R_2}.
\tag{4.170}
\]


Observe that the gain of transient state 4 has been expressed as the weighted
average of the independent gains, g1 and gR2, of the two recurrent chains. The
respective weights are f41 and f4R2. Similarly, if the chain starts in transient
state 5, the gain will be

\[
g_5 = f_{51} q_1 + f_{5R_2}(\pi_2 q_2 + \pi_3 q_3) = f_{51} g_1 + f_{5R_2} g_{R_2}.
\tag{4.171}
\]

The gain of transient state 5 has also been expressed as the weighted average
of the independent gains, g1 and gR2, of the two recurrent chains. In this case
the respective weights are f51 and f5R2.
Since each closed class of recurrent states in a multichain MCR can be
treated as a separate recurrent chain, all states that belong to the same recur-
rent chain have the same gain. Hence, every recurrent chain has an inde-
pendent gain. One may conclude that the gains of the recurrent chains can
be found separately by finding the steady-state probability vector for each
recurrent chain and multiplying it by the associated reward vector. The gain
of every transient state can be expressed as a weighted average of the inde-
pendent gains of the recurrent chains. The weights are the probabilities of
eventual passage from the transient state to the recurrent chains.
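The weighted-average structure can be illustrated numerically. The sketch below uses a hypothetical five-state multichain with the zero pattern of Equation (4.166); every number in `P` and `q` is an assumption made for illustration. It computes the independent gains of the two recurrent chains, then the passage probabilities, and finally compares the weighted averages with a high power of P applied to q.

```python
import numpy as np

# Hypothetical five-state multichain (illustrative numbers only).
P = np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.7, 0.3, 0.0, 0.0],
    [0.0, 0.4, 0.6, 0.0, 0.0],
    [0.1, 0.2, 0.1, 0.4, 0.2],
    [0.2, 0.1, 0.1, 0.3, 0.3],
])
q = np.array([5.0, 10.0, 3.0, 1.0, 2.0])

# Independent gains of the two recurrent chains.
g1 = q[0]
p23, p32 = P[1, 2], P[2, 1]
pi2, pi3 = p32 / (p23 + p32), p23 / (p23 + p32)   # steady state within R2
g_R2 = pi2 * q[1] + pi3 * q[2]

# Probabilities of eventual passage from transient states 4 and 5.
U = np.linalg.inv(np.eye(2) - P[3:, 3:])
f_1 = U @ P[3:, 0]                     # f41, f51
f_R2 = U @ P[3:, 1:3].sum(axis=1)      # f4R2, f5R2

# Gains of the transient states as weighted averages, Equations (4.170)-(4.171).
g_T = f_1 * g1 + f_R2 * g_R2
g = np.array([g1, g_R2, g_R2, *g_T])

# Cross-check against a high power of P, approximating the limit in (4.168).
g_limit = np.linalg.matrix_power(P, 200) @ q

print(np.round(g, 4))
```

For each transient state the two weights sum to one, which is why the gain is a proper weighted average of g1 and gR2.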

4.2.5.2.3 Gain of an Eight-State Multichain MCR Model of a Production Process


Consider the eight-state multichain MCR model of production for which the
transition matrix and reward vector are shown in Equation (4.161). The limit-
ing transition probability matrix was calculated in Equation (3.165). Using
Equation (4.43), the negative gain vector is

\[
g = \lim_{n \to \infty} P^{(n)} q =
\begin{matrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \\ 7 \\ 8 \end{matrix}
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0.2909 & 0.3727 & 0.3364 & 0 & 0 & 0 \\
0 & 0 & 0.2909 & 0.3727 & 0.3364 & 0 & 0 & 0 \\
0 & 0 & 0.2909 & 0.3727 & 0.3364 & 0 & 0 & 0 \\
0.4444 & 0.3556 & 0.0582 & 0.0745 & 0.0673 & 0 & 0 & 0 \\
0.6825 & 0.2032 & 0.0332 & 0.0426 & 0.0385 & 0 & 0 & 0 \\
0.8095 & 0.1219 & 0.0199 & 0.0256 & 0.0231 & 0 & 0 & 0
\end{bmatrix}
\begin{bmatrix} 156 \\ 56 \\ 160 \\ 123 \\ 291.5 \\ 450 \\ 400 \\ 420 \end{bmatrix}
=
\begin{bmatrix} 156 \\ 56 \\ 190.45 \\ 190.45 \\ 190.45 \\ 127.33 \\ 139.62 \\ 146.17 \end{bmatrix}.
\tag{4.172}
\]
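The gain vector can be approximated directly from P and q in Equation (4.161) by raising P to a large power. This is a sketch, assuming NumPy; the exponent 1000 is an arbitrary choice that is large enough here because the slowest-decaying transient factor is 0.75.

```python
import numpy as np

# Transition matrix and cost vector of Equation (4.161).
P = np.array([
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.50, 0.30, 0.20, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.30, 0.45, 0.25, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.10, 0.35, 0.55, 0.00, 0.00, 0.00],
    [0.20, 0.16, 0.04, 0.03, 0.02, 0.55, 0.00, 0.00],
    [0.15, 0.00, 0.00, 0.00, 0.00, 0.20, 0.65, 0.00],
    [0.10, 0.00, 0.00, 0.00, 0.00, 0.00, 0.15, 0.75],
])
q = np.array([156.0, 56.0, 160.0, 123.0, 291.5, 450.0, 400.0, 420.0])

# A high power of P approximates the limiting transition probability matrix.
P_limit = np.linalg.matrix_power(P, 1000)

# Gain vector g = lim P^(n) q.
g = P_limit @ q

print(np.round(P_limit[7], 4))   # limiting row for state 8
print(np.round(g, 2))
```

With unrounded probabilities the gain of the training states evaluates to about 190.44 rather than 190.45; the small difference comes from the four-decimal rounding of the limiting matrix.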

4.2.5.3 Reward Evaluation Equations


As Sections 4.2.3.2.2 and 4.2.4.2 indicate, if the objective is simply to find
the gain attained by starting in every state of a multichain MCR, then it is
sufficient to find the gain vector, g, by using Equation (4.43). Alternatively, by
solving two sets of equations, which together are called the reward evaluation
equations (REEs) [4, 7], one can determine not only the gain in every state but
also the relative value in every state. The advantage of obtaining the relative
values is that they can be used to find an improved policy when decisions are
introduced in Chapter 5 to transform an MCR into an MDP. The first set of REEs
is called the gain state equations (GSEs), and the second set is called the
value determination equations (VDEs). Thus, the REEs consist of the GSEs plus
the VDEs.

4.2.5.3.1 Informal Derivation of the REEs


When an N-state multichain MCR starts in state i, and the length of the plan-
ning horizon, T, grows very large, the system enters the steady state. By mod-
ifying the argument given in Section 4.2.3.2.2 for a recurrent MCR to make it
apply to a multichain MCR, Equation (4.59) for vi(0) becomes

\[
v_i(0) \approx T g_i + v_i \quad \text{for } i = 1, 2, \ldots, N,
\tag{4.173}
\]

where gi is the gain in state i.


Similarly, Equation (4.60) for vj(1) becomes

\[
v_j(1) \approx (T - 1) g_j + v_j.
\tag{4.174}
\]

The recursive equation relating vi(0) to vj(1) is

\[
v_i(0) = q_i + \sum_{j=1}^{N} p_{ij} v_j(1), \quad \text{for } i = 1, 2, \ldots, N.
\]

Substituting the linear approximations in Equations (4.173) and (4.174),
assumed to be valid for large T, into Equation (4.61) gives the result

\[
\begin{aligned}
T g_i + v_i &= q_i + \sum_{j=1}^{N} p_{ij}\,[(T - 1) g_j + v_j], \quad \text{for } i = 1, 2, \ldots, N, \\
            &= q_i + T \sum_{j=1}^{N} p_{ij} g_j - \sum_{j=1}^{N} p_{ij} g_j + \sum_{j=1}^{N} p_{ij} v_j.
\end{aligned}
\tag{4.175}
\]

In order for Equation (4.175) to be satisfied for any large T, let

\[
g_i = \sum_{j=1}^{N} p_{ij} g_j, \quad \text{for } i = 1, 2, \ldots, N.
\tag{4.176}
\]


Then

\[
T g_i + v_i = q_i + T g_i - g_i + \sum_{j=1}^{N} p_{ij} v_j.
\]

Since

\[
\sum_{j=1}^{N} p_{ij} = 1,
\]

\[
g_i + v_i = q_i + \sum_{j=1}^{N} p_{ij} v_j, \quad i = 1, 2, \ldots, N.
\tag{4.177}
\]

Thus, the following two systems of N linear equations each,

\[
\begin{aligned}
g_i &= \sum_{j=1}^{N} p_{ij} g_j, && \text{for } i = 1, 2, \ldots, N, \\
&\text{and} \\
g_i + v_i &= q_i + \sum_{j=1}^{N} p_{ij} v_j, && i = 1, 2, \ldots, N,
\end{aligned}
\tag{4.178}
\]

have been obtained. These two systems may be solved for the N variables, gi,
and the N variables, vi. The first system of N linear equations (4.176) is called
the GSEs. The second system is the set of VDEs (4.177). These two systems of
equations are together called the set of REEs (4.178).
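In matrix notation the REEs take a compact form. The following is only a restatement of (4.178) under the assumption that g, v, and q are treated as N-component column vectors and I is the N × N identity matrix; it is not an additional result from the text.

```latex
\[
g = P g, \qquad g + v = q + P v,
\]
or, equivalently,
\[
(I - P)\, g = 0, \qquad (I - P)\, v = q - g .
\]
```

Because every row of P sums to one, the matrix I − P is singular, so these systems alone do not determine v uniquely. This is why one relative value in each recurrent chain is set equal to zero when the equations are solved.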

4.2.5.3.2 The REEs for a Five-State Multichain MCR

Consider once again the generic five-state multichain MCR for which the
transition matrix and reward vector are shown in Equation (4.166). The
system of ten REEs (4.178) consists of a system of five VDEs (4.177) plus a system
of five GSEs (4.176). The system of five VDEs is

\[
g_i + v_i = q_i + \sum_{j=1}^{5} p_{ij} v_j
          = q_i + p_{i1} v_1 + p_{i2} v_2 + p_{i3} v_3 + p_{i4} v_4 + p_{i5} v_5,
\quad \text{for } i = 1, 2, \ldots, 5.
\tag{4.179}
\]
The individual value determination equations are

\[
\begin{aligned}
g_1 + v_1 &= q_1 + p_{11} v_1 + p_{12} v_2 + p_{13} v_3 + p_{14} v_4 + p_{15} v_5 \\
g_2 + v_2 &= q_2 + p_{21} v_1 + p_{22} v_2 + p_{23} v_3 + p_{24} v_4 + p_{25} v_5 \\
g_3 + v_3 &= q_3 + p_{31} v_1 + p_{32} v_2 + p_{33} v_3 + p_{34} v_4 + p_{35} v_5 \\
g_4 + v_4 &= q_4 + p_{41} v_1 + p_{42} v_2 + p_{43} v_3 + p_{44} v_4 + p_{45} v_5 \\
g_5 + v_5 &= q_5 + p_{51} v_1 + p_{52} v_2 + p_{53} v_3 + p_{54} v_4 + p_{55} v_5.
\end{aligned}
\tag{4.180}
\]


Setting p12 = p13 = p14 = p15 = 0, p21 = p24 = p25 = 0, and p31 = p34 = p35 = 0, and setting
g2 = g3 = gR2, the five VDEs become

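\[
\begin{aligned}
g_1 + v_1 &= q_1 + p_{11} v_1 \\
g_{R_2} + v_2 &= q_2 + p_{22} v_2 + p_{23} v_3 \\
g_{R_2} + v_3 &= q_3 + p_{32} v_2 + p_{33} v_3 \\
g_4 + v_4 &= q_4 + p_{41} v_1 + p_{42} v_2 + p_{43} v_3 + p_{44} v_4 + p_{45} v_5 \\
g_5 + v_5 &= q_5 + p_{51} v_1 + p_{52} v_2 + p_{53} v_3 + p_{54} v_4 + p_{55} v_5.
\end{aligned}
\tag{4.181}
\]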

The system of five GSEs is given below:

\[
g_i = \sum_{j=1}^{5} p_{ij} g_j
    = p_{i1} g_1 + p_{i2} g_2 + p_{i3} g_3 + p_{i4} g_4 + p_{i5} g_5,
\quad \text{for } i = 1, 2, \ldots, 5.
\tag{4.182}
\]

Setting p12 = p13 = p14 = p15 = 0, p21 = p24 = p25 = 0, p31 = p34 = p35 = 0, and setting
g2 = g3 = gR2, the five GSEs become

\[
\begin{aligned}
g_1 &= g_1 \\
g_2 &= p_{22} g_2 + p_{23} g_3 = p_{22} g_{R_2} + p_{23} g_{R_2} = (p_{22} + p_{23}) g_{R_2} = g_{R_2} \\
g_3 &= p_{32} g_2 + p_{33} g_3 = p_{32} g_{R_2} + p_{33} g_{R_2} = (p_{32} + p_{33}) g_{R_2} = g_{R_2} \\
g_4 &= p_{41} g_1 + p_{42} g_2 + p_{43} g_3 + p_{44} g_4 + p_{45} g_5
     = p_{41} g_1 + (p_{42} + p_{43}) g_{R_2} + p_{44} g_4 + p_{45} g_5 \\
g_5 &= p_{51} g_1 + p_{52} g_2 + p_{53} g_3 + p_{54} g_4 + p_{55} g_5
     = p_{51} g_1 + (p_{52} + p_{53}) g_{R_2} + p_{54} g_4 + p_{55} g_5.
\end{aligned}
\tag{4.183}
\]

4.2.5.3.3 Procedure for Solving the REEs for a Multichain MCR


As this section will indicate, both the independent gains and the relative val-
ues for the recurrent chains can be obtained by solving the VDEs for each
recurrent chain separately. Next, the GSEs for the transient states can be solved
to obtain the gains of the transient states. Finally, the VDEs for the transient
states can be solved to obtain the relative values for the transient states [4].
Suppose that an N-state multichain MCR has L recurrent chains, where
L < N. The MCR has L independent gains, each associated with a different
recurrent chain. The unknown quantities are the L independent gains plus
the N relative values, vi. The number of unknowns can be reduced to N by
equating to zero the relative value, vi, for the highest numbered state in each
of the L recurrent chains. The remaining (N − L) relative values for the MCR
are not equated to zero. Hence, the total number of unknown quantities is
N, equal to L independent gains plus (N − L) relative values. The L independent
gains and the remaining (N − L) relative values, vi, can be determined
by following the four-step procedure given below for solving the REEs for a
multichain MCR:
Step 1. Assign an independent gain to each recurrent chain. For each recur-
rent chain, solve the VDEs for the independent gain and the rela-
tive values of the recurrent states by setting equal to zero the relative
value for the highest numbered recurrent state.
Step 2. Solve the GSEs for the transient states to compute the gains of the
transient states as weighted averages of the independent gains of the
recurrent chains obtained in step 1.
Step 3. Substitute, into the VDEs for the transient states, the gains of the tran-
sient states obtained in step 2, and the relative values of the recurrent
states obtained in step 1.
Step 4. Solve the VDEs for the transient states to obtain the relative values of
the transient states.
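The four steps above can be sketched as a short NumPy routine. This is an illustrative implementation under stated assumptions, not the author's code: the function name `solve_rees` is hypothetical, states are indexed from 0, and the membership of each recurrent chain is assumed to be known in advance and passed in as `recurrent_chains`. The example data are hypothetical numbers arranged in the pattern of the generic five-state multichain.

```python
import numpy as np

def solve_rees(P, q, recurrent_chains):
    """Four-step REE solution. recurrent_chains lists the 0-based state
    indices of each closed recurrent class (assumed known in advance)."""
    P, q = np.asarray(P, float), np.asarray(q, float)
    N = len(q)
    g, v = np.zeros(N), np.zeros(N)
    recurrent = sorted(s for chain in recurrent_chains for s in chain)
    transient = [s for s in range(N) if s not in recurrent]

    # Step 1: for each chain, solve its VDEs for the independent gain g_R and
    # the relative values, with v = 0 for the highest numbered state.
    for chain in recurrent_chains:
        chain = sorted(chain)
        free = chain[:-1]
        A = np.ones((len(chain), len(chain)))    # column 0 multiplies g_R
        for r, i in enumerate(chain):
            for c, j in enumerate(free, start=1):
                A[r, c] = (1.0 if i == j else 0.0) - P[i, j]
        x = np.linalg.solve(A, q[chain])
        g[chain] = x[0]
        if free:
            v[free] = x[1:]

    if transient:
        I_Q = np.eye(len(transient)) - P[np.ix_(transient, transient)]
        D = P[np.ix_(transient, recurrent)]
        # Step 2: GSEs for the transient states, (I - Q) g_T = D g_R.
        g[transient] = np.linalg.solve(I_Q, D @ g[recurrent])
        # Steps 3 and 4: VDEs for the transient states,
        # (I - Q) v_T = q_T - g_T + D v_R.
        rhs = q[transient] - g[transient] + D @ v[recurrent]
        v[transient] = np.linalg.solve(I_Q, rhs)
    return g, v

# Hypothetical five-state example: state 1 absorbing, states 2-3 recurrent,
# states 4-5 transient (0-based indices 0, {1, 2}, {3, 4}).
P = np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.7, 0.3, 0.0, 0.0],
    [0.0, 0.4, 0.6, 0.0, 0.0],
    [0.1, 0.2, 0.1, 0.4, 0.2],
    [0.2, 0.1, 0.1, 0.3, 0.3],
])
q = np.array([5.0, 10.0, 3.0, 1.0, 2.0])
g, v = solve_rees(P, q, recurrent_chains=[[0], [1, 2]])
print(np.round(g, 4), np.round(v, 4))
```

Solving each recurrent chain separately keeps every linear system nonsingular, whereas the full system (I − P)v = q − g cannot be solved directly.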

4.2.5.3.4 Solving the REEs for a Five-State Multichain MCR


The four-step procedure outlined in Section 4.2.5.3.3 for solving the REEs
will be demonstrated by applying it to solve the REEs obtained in Section
4.2.5.3.2 for the generic five-state multichain MCR. The MCR has N = 5 states,
and L = 2 recurrent chains, with the relative value v1 equated to zero in the
first recurrent chain, and v3 equated to zero in the second. The two indepen-
dent gains, g1 and gR2, and the remaining (5 − 2) relative values, v2, v4, and v5,
are the five unknowns.
Step 1. Each recurrent chain in the five-state multichain MCR will be consid-
ered separately. The VDE for the first recurrent chain, consisting of
the absorbing state 1, is

\[
v_1 + g_1 = q_1 + p_{11} v_1
\tag{4.184}
\]

Setting the relative value v1 = 0 for the highest numbered state in the first
recurrent chain, and setting p11 = 1 for the absorbing state 1, gives the
solution g1 = q1, confirming the conclusion of Equation (4.114) in Section 4.2.4.1.2
that the gain of an absorbing state is equal to the reward received in that
state. The VDEs for the second recurrent chain, consisting of states 2 and 3,
and denoted by R2 = {2, 3}, are

\[
\begin{aligned}
v_2 + g_2 &= q_2 + p_{22} v_2 + p_{23} v_3 \\
v_3 + g_3 &= q_3 + p_{32} v_2 + p_{33} v_3.
\end{aligned}
\tag{4.185}
\]


Setting v3 = 0 for the highest numbered state in the second recurrent chain,
and setting g2 = g3 = gR2, the VDEs become

\[
\begin{aligned}
v_2 + g_{R_2} &= q_2 + p_{22} v_2 \\
g_{R_2} &= q_3 + p_{32} v_2.
\end{aligned}
\tag{4.186}
\]

The solution is

\[
\begin{aligned}
v_2 &= \frac{q_2 - q_3}{p_{23} + p_{32}} \\
g_{R_2} &= \frac{p_{32} q_2 + p_{23} q_3}{p_{23} + p_{32}},
\end{aligned}
\tag{4.187}
\]

confirming the result obtained for the gain in Equation (4.169).
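The agreement with Equation (4.169) can be made explicit. The short derivation below assumes only the standard steady-state equations for the two-state recurrent chain R2 = {2, 3}; it fills in an intermediate step that the text leaves implicit.

```latex
\[
\pi_2\, p_{23} = \pi_3\, p_{32}, \qquad \pi_2 + \pi_3 = 1
\quad\Longrightarrow\quad
\pi_2 = \frac{p_{32}}{p_{23} + p_{32}}, \qquad
\pi_3 = \frac{p_{23}}{p_{23} + p_{32}},
\]
so that
\[
g_{R_2} = \pi_2 q_2 + \pi_3 q_3 = \frac{p_{32} q_2 + p_{23} q_3}{p_{23} + p_{32}} .
\]
```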


Step 2. The set of GSEs for the transient states that will be solved to express
the gains of the two transient states as weighted averages of the independent
gains of the recurrent chains is given below.

\[
g_i = \sum_{j=1}^{5} p_{ij} g_j
    = p_{i1} g_1 + p_{i2} g_2 + p_{i3} g_3 + p_{i4} g_4 + p_{i5} g_5,
\tag{4.188}
\]

for transient states i = 4, 5.


For the transient states 4 and 5, Equation (4.188) becomes

\[
\begin{aligned}
g_4 &= p_{41} g_1 + p_{42} g_2 + p_{43} g_3 + p_{44} g_4 + p_{45} g_5
     = p_{41} g_1 + (p_{42} + p_{43}) g_{R_2} + p_{44} g_4 + p_{45} g_5 \\
g_5 &= p_{51} g_1 + p_{52} g_2 + p_{53} g_3 + p_{54} g_4 + p_{55} g_5
     = p_{51} g_1 + (p_{52} + p_{53}) g_{R_2} + p_{54} g_4 + p_{55} g_5.
\end{aligned}
\tag{4.189}
\]

Rearranging the terms to place the two unknowns, g4 and g5, on the left-hand
side,

\[
\begin{aligned}
(1 - p_{44}) g_4 - p_{45} g_5 &= p_{41} g_1 + (p_{42} + p_{43}) g_{R_2} \\
-p_{54} g_4 + (1 - p_{55}) g_5 &= p_{51} g_1 + (p_{52} + p_{53}) g_{R_2}.
\end{aligned}
\tag{4.190}
\]

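The solution is

\[
\begin{aligned}
g_4 &= \frac{p_{41}(1 - p_{55}) + p_{45} p_{51}}{(1 - p_{44})(1 - p_{55}) - p_{45} p_{54}}\, g_1
     + \frac{(1 - p_{55})(p_{42} + p_{43}) + p_{45}(p_{52} + p_{53})}{(1 - p_{44})(1 - p_{55}) - p_{45} p_{54}}\, g_{R_2} \\
g_5 &= \frac{p_{51}(1 - p_{44}) + p_{54} p_{41}}{(1 - p_{44})(1 - p_{55}) - p_{45} p_{54}}\, g_1
     + \frac{(1 - p_{44})(p_{52} + p_{53}) + p_{54}(p_{42} + p_{43})}{(1 - p_{44})(1 - p_{55}) - p_{45} p_{54}}\, g_{R_2}.
\end{aligned}
\tag{4.191}
\]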


These equations may be written more concisely to express the gains of the
two transient states as weighted averages of the independent gains of the
two recurrent chains.

\[
\begin{aligned}
g_4 &= f_{41} g_1 + f_{4R_2} g_{R_2} \\
g_5 &= f_{51} g_1 + f_{5R_2} g_{R_2},
\end{aligned}
\tag{4.192}
\]

where the weights f41 and f51 are calculated in Equation (3.120), and the weights
f4R2 and f5R2 are calculated in Equation (3.121). Thus, the gain, gi, of each tran-
sient state, i, has been expressed as a weighted average of the independent
gains of the recurrent chains. It is interesting to note that each weight, fiR, can
be interpreted as the probability of eventual passage from a transient state i
to a recurrent chain R. This result confirms the conclusion reached earlier in
Section 4.2.5.2.2 by solving Equation (4.43).
Step 3. The set of VDEs for the two transient states is given below:

\[
g_i + v_i = q_i + \sum_{j=1}^{5} p_{ij} v_j
          = q_i + p_{i1} v_1 + p_{i2} v_2 + p_{i3} v_3 + p_{i4} v_4 + p_{i5} v_5,
\tag{4.193}
\]

for transient states i = 4, 5. The two VDEs for the transient states are

\[
\begin{aligned}
g_4 + v_4 &= q_4 + p_{41} v_1 + p_{42} v_2 + p_{43} v_3 + p_{44} v_4 + p_{45} v_5 \\
g_5 + v_5 &= q_5 + p_{51} v_1 + p_{52} v_2 + p_{53} v_3 + p_{54} v_4 + p_{55} v_5.
\end{aligned}
\tag{4.194}
\]

Rearranging the terms to place the two unknowns, v4 and v5, on the left-hand
side,

\[
\begin{aligned}
(1 - p_{44}) v_4 - p_{45} v_5 &= -g_4 + q_4 + p_{41} v_1 + p_{42} v_2 + p_{43} v_3 \\
-p_{54} v_4 + (1 - p_{55}) v_5 &= -g_5 + q_5 + p_{51} v_1 + p_{52} v_2 + p_{53} v_3.
\end{aligned}
\tag{4.195}
\]

Substituting the quantities

\[
g_1 = q_1,
\tag{4.196}
\]

\[
g_2 = g_3 = g_{R_2} = \frac{p_{32} q_2 + p_{23} q_3}{p_{23} + p_{32}},
\tag{4.197}
\]

\[
g_4 = f_{41} g_1 + f_{4R_2} g_{R_2} = f_{41} q_1 + f_{4R_2} \frac{p_{32} q_2 + p_{23} q_3}{p_{23} + p_{32}},
\tag{4.198}
\]

\[
g_5 = f_{51} g_1 + f_{5R_2} g_{R_2} = f_{51} q_1 + f_{5R_2} \frac{p_{32} q_2 + p_{23} q_3}{p_{23} + p_{32}},
\tag{4.199}
\]

the VDEs for the two transient states appear as

\[
\begin{aligned}
(1 - p_{44}) v_4 - p_{45} v_5 &= -f_{41} q_1 - f_{4R_2} \frac{p_{32} q_2 + p_{23} q_3}{p_{23} + p_{32}} + q_4 + p_{41} v_1 + p_{42} v_2 + p_{43} v_3 \\
-p_{54} v_4 + (1 - p_{55}) v_5 &= -f_{51} q_1 - f_{5R_2} \frac{p_{32} q_2 + p_{23} q_3}{p_{23} + p_{32}} + q_5 + p_{51} v_1 + p_{52} v_2 + p_{53} v_3.
\end{aligned}
\tag{4.200}
\]

Setting the relative value v1 = 0 for the highest numbered state in the first
recurrent chain, and also setting v3 = 0 for the highest numbered state in the
second recurrent chain, the VDEs for the transient states are

\[
\begin{aligned}
(1 - p_{44}) v_4 - p_{45} v_5 &= -f_{41} q_1 - f_{4R_2} \frac{p_{32} q_2 + p_{23} q_3}{p_{23} + p_{32}} + q_4 + p_{41}(0) + p_{42} v_2 + p_{43}(0) \\
-p_{54} v_4 + (1 - p_{55}) v_5 &= -f_{51} q_1 - f_{5R_2} \frac{p_{32} q_2 + p_{23} q_3}{p_{23} + p_{32}} + q_5 + p_{51}(0) + p_{52} v_2 + p_{53}(0).
\end{aligned}
\tag{4.201}
\]

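Substituting for v2 from Equation (4.187), the VDEs for the transient
states are

\[
\begin{aligned}
(1 - p_{44}) v_4 - p_{45} v_5 &= -f_{41} q_1 - f_{4R_2} \frac{p_{32} q_2 + p_{23} q_3}{p_{23} + p_{32}} + q_4 + p_{42} \frac{q_2 - q_3}{p_{23} + p_{32}} \\
-p_{54} v_4 + (1 - p_{55}) v_5 &= -f_{51} q_1 - f_{5R_2} \frac{p_{32} q_2 + p_{23} q_3}{p_{23} + p_{32}} + q_5 + p_{52} \frac{q_2 - q_3}{p_{23} + p_{32}}.
\end{aligned}
\tag{4.202}
\]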


Step 4. The solution for the relative values of the transient states is

\[
\begin{aligned}
v_4 &= \frac{
  \left( -f_{41} q_1 - f_{4R_2} \dfrac{p_{32} q_2 + p_{23} q_3}{p_{23} + p_{32}} + q_4 + p_{42} \dfrac{q_2 - q_3}{p_{23} + p_{32}} \right)(1 - p_{55})
  + \left( -f_{51} q_1 - f_{5R_2} \dfrac{p_{32} q_2 + p_{23} q_3}{p_{23} + p_{32}} + q_5 + p_{52} \dfrac{q_2 - q_3}{p_{23} + p_{32}} \right) p_{45}
}{(1 - p_{44})(1 - p_{55}) - p_{45} p_{54}} \\[1ex]
v_5 &= \frac{
  \left( -f_{51} q_1 - f_{5R_2} \dfrac{p_{32} q_2 + p_{23} q_3}{p_{23} + p_{32}} + q_5 + p_{52} \dfrac{q_2 - q_3}{p_{23} + p_{32}} \right)(1 - p_{44})
  + \left( -f_{41} q_1 - f_{4R_2} \dfrac{p_{32} q_2 + p_{23} q_3}{p_{23} + p_{32}} + q_4 + p_{42} \dfrac{q_2 - q_3}{p_{23} + p_{32}} \right) p_{54}
}{(1 - p_{44})(1 - p_{55}) - p_{45} p_{54}}.
\end{aligned}
\tag{4.203}
\]

Note that in this generic five-state multichain example, N = 5 states, and
L = 2 recurrent chains with the relative value v1 equated to zero in the first
recurrent chain, and v3 equated to zero in the second. The two independent
gains, g1 and gR2, and the remaining (5 − 2) relative values, v2, v4, and v5, are
the five unknowns.

4.2.5.3.5 Solving the REEs for an Eight-State Multichain MCR Model of a Production Process

The four-step procedure outlined in Section 4.2.5.3.3 for solving the REEs
will be demonstrated on a numerical example by applying it to solve the
REEs for the eight-state multichain MCR model of a production process for
which the transition matrix and reward vector are shown in Equation (4.161).
The system of 16 REEs consists of a system of eight VDEs plus a system of
eight GSEs. The system of eight VDEs is

\[
g_i + v_i = q_i + p_{i1} v_1 + p_{i2} v_2 + p_{i3} v_3 + p_{i4} v_4 + p_{i5} v_5 + p_{i6} v_6 + p_{i7} v_7 + p_{i8} v_8,
\quad \text{for } i = 1, 2, \ldots, 8.
\tag{4.204}
\]


The eight individual value determination equations are

\[
\begin{aligned}
g_1 + v_1 &= 156 + v_1 \\
g_2 + v_2 &= 56 + v_2 \\
g_3 + v_3 &= 160 + 0.5 v_3 + 0.3 v_4 + 0.2 v_5 \\
g_4 + v_4 &= 123 + 0.3 v_3 + 0.45 v_4 + 0.25 v_5 \\
g_5 + v_5 &= 291.5 + 0.1 v_3 + 0.35 v_4 + 0.55 v_5 \\
g_6 + v_6 &= 450 + 0.2 v_1 + 0.16 v_2 + 0.04 v_3 + 0.03 v_4 + 0.02 v_5 + 0.55 v_6 \\
g_7 + v_7 &= 400 + 0.15 v_1 + 0.2 v_6 + 0.65 v_7 \\
g_8 + v_8 &= 420 + 0.1 v_1 + 0.15 v_7 + 0.75 v_8.
\end{aligned}
\tag{4.205}
\]

The set of eight GSEs is

\[
g_i = p_{i1} g_1 + p_{i2} g_2 + p_{i3} g_3 + p_{i4} g_4 + p_{i5} g_5 + p_{i6} g_6 + p_{i7} g_7 + p_{i8} g_8,
\quad \text{for } i = 1, 2, \ldots, 8.
\tag{4.206}
\]

The eight individual GSEs are

\[
\begin{aligned}
g_1 &= g_1 \\
g_2 &= g_2 \\
g_3 &= 0.5 g_3 + 0.3 g_4 + 0.2 g_5 \\
g_4 &= 0.3 g_3 + 0.45 g_4 + 0.25 g_5 \\
g_5 &= 0.1 g_3 + 0.35 g_4 + 0.55 g_5 \\
g_6 &= 0.2 g_1 + 0.16 g_2 + 0.04 g_3 + 0.03 g_4 + 0.02 g_5 + 0.55 g_6 \\
g_7 &= 0.15 g_1 + 0.2 g_6 + 0.65 g_7 \\
g_8 &= 0.1 g_1 + 0.15 g_7 + 0.75 g_8.
\end{aligned}
\tag{4.207}
\]

Step 1. Letting R3 = {3, 4, 5} denote the third closed class of three recurrent
states that has an independent gain denoted by gR3, and substituting

\[
g_3 = g_4 = g_5 = g_{R_3},
\]

the eight VDEs appear as

\[
\begin{aligned}
g_1 + v_1 &= 156 + v_1 \\
g_2 + v_2 &= 56 + v_2 \\
g_{R_3} + v_3 &= 160 + 0.5 v_3 + 0.3 v_4 + 0.2 v_5 \\
g_{R_3} + v_4 &= 123 + 0.3 v_3 + 0.45 v_4 + 0.25 v_5 \\
g_{R_3} + v_5 &= 291.5 + 0.1 v_3 + 0.35 v_4 + 0.55 v_5 \\
g_6 + v_6 &= 450 + 0.2 v_1 + 0.16 v_2 + 0.04 v_3 + 0.03 v_4 + 0.02 v_5 + 0.55 v_6 \\
g_7 + v_7 &= 400 + 0.15 v_1 + 0.2 v_6 + 0.65 v_7 \\
g_8 + v_8 &= 420 + 0.1 v_1 + 0.15 v_7 + 0.75 v_8.
\end{aligned}
\tag{4.208}
\]

Setting v1 = 0 for the absorbing state in the first recurrent chain, the first VDE
yields the gain g1 = 156 for absorbing state 1. Similarly, setting v2 = 0 for the
absorbing state in the second recurrent chain, the second VDE produces the
gain g2 = 56 for absorbing state 2. Setting v5 = 0 for the highest numbered state
in the third recurrent closed class, the three VDEs for the third recurrent
chain appear as

\[
\begin{aligned}
g_{R_3} + v_3 &= 160 + 0.5 v_3 + 0.3 v_4 \\
g_{R_3} + v_4 &= 123 + 0.3 v_3 + 0.45 v_4 \\
g_{R_3} &= 291.5 + 0.1 v_3 + 0.35 v_4.
\end{aligned}
\tag{4.209}
\]

These three VDEs are solved simultaneously for gR3, v3, and v4 to obtain the
quantities gR3 = 190.44, v3 = −199.86, and v4 = −231.64 for the third recurrent
chain.
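The three VDEs in Equation (4.209) form a 3 × 3 linear system in the unknowns gR3, v3, and v4. A minimal sketch of the solution with NumPy follows; the coefficient matrix `A` is obtained by moving all unknowns to the left-hand side, and the variable names are illustrative.

```python
import numpy as np

# Equation (4.209) rearranged with unknowns x = [g_R3, v3, v4]:
#   g_R3 + 0.5 v3  - 0.3  v4 = 160
#   g_R3 - 0.3 v3  + 0.55 v4 = 123
#   g_R3 - 0.1 v3  - 0.35 v4 = 291.5
A = np.array([
    [1.0,  0.5, -0.30],
    [1.0, -0.3,  0.55],
    [1.0, -0.1, -0.35],
])
b = np.array([160.0, 123.0, 291.5])

g_R3, v3, v4 = np.linalg.solve(A, b)
print(round(g_R3, 2), round(v3, 2), round(v4, 2))
```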

Step 2. Substituting

\[
g_3 = g_4 = g_5 = g_{R_3},
\tag{4.210}
\]

the three GSEs for the transient states are

\[
\begin{aligned}
g_6 &= 0.2 g_1 + 0.16 g_2 + (0.04 + 0.03 + 0.02) g_{R_3} + 0.55 g_6 \\
g_7 &= 0.15 g_1 + 0.2 g_6 + 0.65 g_7 \\
g_8 &= 0.1 g_1 + 0.15 g_7 + 0.75 g_8.
\end{aligned}
\tag{4.211}
\]

After algebraic simplification, the following result is obtained for the gains
of the three transient states expressed as weighted averages of the independent
gains of the two absorbing states and of the recurrent closed class.

\[
\begin{aligned}
g_6 &= 0.4444 g_1 + 0.3556 g_2 + 0.2 g_{R_3} \\
g_7 &= 0.6825 g_1 + 0.2032 g_2 + 0.1143 g_{R_3} \\
g_8 &= 0.8095 g_1 + 0.1219 g_2 + 0.0686 g_{R_3}.
\end{aligned}
\tag{4.212}
\]

Substituting the independent gains, g1 = 156, g2 = 56, and gR3 = 190.44,
computed for the three recurrent chains in step 1, the gains of the three transient
states are

\[
\begin{aligned}
g_6 &= 0.4444(156) + 0.3556(56) + 0.2(190.44) = 127.33 \\
g_7 &= 0.6825(156) + 0.2032(56) + 0.1143(190.44) = 139.62 \\
g_8 &= 0.8095(156) + 0.1219(56) + 0.0686(190.44) = 146.17.
\end{aligned}
\tag{4.213}
\]
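The algebraic simplification in step 2 amounts to solving (I − Q)gT = D gR, where the columns of D collect the one-step probabilities of moving from states 6, 7, and 8 into each recurrent chain. The sketch below is one way to reproduce the weights and the gains, assuming NumPy; the names `D`, `g_indep`, and `weights` are illustrative.

```python
import numpy as np

# Transient block (states 6, 7, 8) of P in Equation (4.161).
Q = np.array([[0.55, 0.00, 0.00],
              [0.20, 0.65, 0.00],
              [0.00, 0.15, 0.75]])
# One-step probabilities from states 6, 7, 8 into: absorbing state 1,
# absorbing state 2, and recurrent class R3 (0.04 + 0.03 + 0.02 = 0.09).
D = np.array([[0.20, 0.16, 0.09],
              [0.15, 0.00, 0.00],
              [0.10, 0.00, 0.00]])
# Independent gains g1, g2, g_R3 obtained in step 1.
g_indep = np.array([156.0, 56.0, 190.44])

# Weights of Equation (4.212): solve (I - Q) W = D.
weights = np.linalg.solve(np.eye(3) - Q, D)

# Gains of the transient states, Equation (4.213).
g_T = weights @ g_indep

print(np.round(weights, 4))
print(np.round(g_T, 2))
```

Each row of `weights` sums to one, consistent with the interpretation of the weights as probabilities of eventual passage.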

Step 3. The three VDEs for the three transient states are

\[
\begin{aligned}
g_6 + v_6 &= 450 + 0.2 v_1 + 0.16 v_2 + 0.04 v_3 + 0.03 v_4 + 0.02 v_5 + 0.55 v_6 \\
g_7 + v_7 &= 400 + 0.15 v_1 + 0.2 v_6 + 0.65 v_7 \\
g_8 + v_8 &= 420 + 0.1 v_1 + 0.15 v_7 + 0.75 v_8.
\end{aligned}
\tag{4.214}
\]

Substituting the gains, g6 = 127.33, g7 = 139.62, and g8 = 146.17, of the three
transient states computed in step 2, and substituting the relative values,
v3 = −199.86, v4 = −231.64, and v1 = v2 = v5 = 0, determined in step 1, the three
VDEs for the transient states are

\[
\begin{aligned}
127.33 + v_6 &= 450 + 0.04(-199.86) + 0.03(-231.64) + 0.55 v_6 \\
139.62 + v_7 &= 400 + 0.2 v_6 + 0.65 v_7 \\
146.17 + v_8 &= 420 + 0.15 v_7 + 0.75 v_8.
\end{aligned}
\tag{4.215}
\]

Step 4. The solution of the three VDEs for the transient states produces the
relative values

v6 = 683.84, v7 = 1,134.71, and v8 = 1,776.14.
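Steps 3 and 4 can be carried out together by writing the transient VDEs as (I − Q)vT = qT − gT + D vR. The following sketch assumes NumPy and uses the rounded gains and relative values quoted in the text as inputs; the array names are illustrative.

```python
import numpy as np

# Transient block (states 6, 7, 8) of P in Equation (4.161).
Q = np.array([[0.55, 0.00, 0.00],
              [0.20, 0.65, 0.00],
              [0.00, 0.15, 0.75]])
# One-step probabilities from states 6, 7, 8 to recurrent states 1..5.
D = np.array([[0.20, 0.16, 0.04, 0.03, 0.02],
              [0.15, 0.00, 0.00, 0.00, 0.00],
              [0.10, 0.00, 0.00, 0.00, 0.00]])
q_T = np.array([450.0, 400.0, 420.0])            # costs q6, q7, q8
g_T = np.array([127.33, 139.62, 146.17])          # gains from step 2
v_R = np.array([0.0, 0.0, -199.86, -231.64, 0.0]) # relative values from step 1

# Solve (I - Q) v_T = q_T - g_T + D v_R for v6, v7, v8.
v_T = np.linalg.solve(np.eye(3) - Q, q_T - g_T + D @ v_R)

print(np.round(v_T, 2))
```

Because the inputs are rounded to two decimals, the computed values may differ from the quoted ones in the last digit.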

Note that in this multichain model, N = 8 states, and L = 3 recurrent chains.
The unknown quantities are the L = 3 independent gains and the N = 8 relative
values, vi. In step 1, the relative value v1 is equated to zero in the first
recurrent chain, v2 is equated to zero in the second, and v5 is equated to
zero in the third. Also in step 1, the three independent gains, g1, g2, and gR3,
are determined by solving the VDEs for each recurrent chain. The remaining
(N − L) or (8 − 3) relative values are v3, v4, v6, v7, and v8. Step 1 has further
reduced the number of independent unknowns from 5 to 3 because the
relative values v3 and v4 for the third recurrent chain are obtained by solving
the VDEs for the third recurrent chain. In step 2, the GSEs for the transient
states are solved to calculate the dependent gains of the transient states, g6,
g7, and g8, as weighted averages of the independent gains, g1, g2, and gR3, of
the recurrent chains. Finally, in step 4, the VDEs for the transient states are
solved to obtain the relative values of the transient states, v6, v7, and v8.
The complete solutions for the gain vector and the vector of relative values
within each class of states are given below:

\[
g =
\begin{matrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \\ 7 \\ 8 \end{matrix}
\begin{bmatrix} 156 \\ 56 \\ 190.44 \\ 190.44 \\ 190.44 \\ 127.33 \\ 139.62 \\ 146.17 \end{bmatrix},
\qquad
v =
\begin{matrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \\ 7 \\ 8 \end{matrix}
\begin{bmatrix} 0 \\ 0 \\ -199.86 \\ -231.64 \\ 0 \\ 683.84 \\ 1{,}134.71 \\ 1{,}776.14 \end{bmatrix}.
\tag{4.216}
\]

Thus, if the process starts in any of the three recurrent states associated with
the training center, the expected cost per item sold will be $190.44. If the pro-
cess starts in recurrent state 3, the expected total cost will be $199.86 lower
than if it starts in recurrent state 5. Similarly, if the process starts in recurrent
state 4, the expected total cost will be $231.64 lower than if it starts in recur-
rent state 5.

4.2.5.4 Expected Total Reward before Passage to a Closed Set


In Section 4.2.4.4, the expected total reward earned by a unichain MCR
model before passage to a closed class of recurrent states, given that
the chain started in a transient state, was calculated. This section will
demonstrate that this calculation can also be made for a multichain MCR
model [5].

4.2.5.4.1 Multichain Eight-State MCR Model of a Production Process


Consider the eight-state multichain MCR model of a production process for
which the transition matrix and reward vector are shown in Equation (4.161).
Recall that T = {6, 7, 8} denotes the class of transient states associated with the
three production stages. Transient state i represents production stage (9 − i).
Suppose that the operation costs for the transient states, calculated in the
right-hand column of Table 4.13, are treated as the components of a production
stage operation cost vector denoted by qT. The production operation cost
vector for the three transient states is

\[
q_T = [\$450 \;\; \$400 \;\; \$420]^{\mathsf{T}}.
\tag{4.217}
\]

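The expected total cost vector UqT is computed below.

\[
U q_T = (I - Q)^{-1} q_T =
\begin{matrix} 6 \\ 7 \\ 8 \end{matrix}
\begin{bmatrix}
2.2222 & 0 & 0 \\
1.2698 & 2.8571 & 0 \\
0.7619 & 1.7143 & 4
\end{bmatrix}
\begin{bmatrix} \$450 \\ \$400 \\ \$420 \end{bmatrix}
=
\begin{matrix} 6 \\ 7 \\ 8 \end{matrix}
\begin{bmatrix} \$1{,}000 \\ \$1{,}714.25 \\ \$2{,}708.58 \end{bmatrix}
\tag{4.218}
\]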

Since qT is the vector of operation costs for the transient states, the ith compo-
nent of the vector UqT represents the expected total operation cost for an item
before its eventual passage to an absorbing state or to the recurrent closed
class, given that the item started in transient state i. An item enters the pro-
duction process at production stage 1, which is transient state 8. With refer-
ence to the last three rows in the right-hand column of Table 4.14, note that:

  Expected operation cost that will be incurred by an entering item in state 6
+ Expected operation cost that will be incurred by an entering item in state 7
+ Expected operation cost that will be incurred by an entering item in state 8
= q6 u86 + q7 u87 + q8 u88
= ($450)(0.7619) + ($400)(1.7143) + ($420)(4)
= $342.86 in state 6 + $685.72 in state 7 + $1,680 in state 8
= $2,708.58, (4.219)

which is the expected total operation cost for an item before its eventual pas-
sage to an absorbing state or to the recurrent closed class, given that the item
started in transient state 8. The expected total operation cost of $2,708.58 for
an item entering the production process at stage 1 is the last entry in the vec-
tor UqT calculated in Equation (4.218), and is also the sum of the entries in the
last three rows of the right-hand column of Table 4.14.
This example has demonstrated that the following result holds for any
multichain MCR. Suppose that P is the transition matrix for a multichain
MCR, U is the fundamental matrix, and qT is the vector of rewards received
in the transient states. Then the ith component of the vector UqT represents
the expected total reward earned before eventual passage to any recurrent
closed class, given that the chain started in transient state i.
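
The product UqT can be checked numerically. The following is a minimal sketch in plain Python, assuming the fundamental matrix U printed in Equation (4.218) and the cost vector of Equation (4.217); the helper name mat_vec is illustrative and does not come from the text.

```python
# Fundamental matrix U = (I - Q)^(-1) for transient states 6, 7, 8,
# copied from Equation (4.218); rows and columns are ordered 6, 7, 8.
U = [
    [2.2222, 0.0, 0.0],
    [1.2698, 2.8571, 0.0],
    [0.7619, 1.7143, 4.0],
]
q_T = [450.0, 400.0, 420.0]  # operation costs in states 6, 7, 8 (Equation 4.217)


def mat_vec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


# Component i is the expected total operation cost before the item leaves
# the transient states, given that it starts in transient state i.
expected_cost = mat_vec(U, q_T)
for state, cost in zip((6, 7, 8), expected_cost):
    print(f"start in state {state}: expected total operation cost = {cost:.2f}")
```

Because the entries of U are rounded to four decimal places, the first component comes out a cent below $1,000.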

4.2.5.4.2 Absorbing Multichain MCR Model of Selling a Stock with Two Target Prices

This section gives a second example of the calculation of expected total
rewards earned by a multichain MCR model before passage to a recurrent


chain, given that the chain started in a transient state [5]. Suppose that on
March 31, at the end of a 3-month quarter, a woman buys one share of a
certain stock for $40. The share price, rounded to the nearest $5, has been
varying among $30, $35, $40, $45, and $50 from quarter to quarter. She plans
to sell her stock at the end of the first quarter in which the share price rises
to $50 or falls to $30. She believes that the price of the stock can be modeled
as a Markov chain in which the state, Xn, denotes the share price at the end
of quarter n. The state space is E = {$30, $35, $40, $45, $50}. The two states
Xn = $30 and Xn = $50 are absorbing states, reached when the stock is sold.
The three remaining states, which are entered when the stock is held, are
transient. (A model for selling a stock with one target price was constructed
in Section 1.10.1.2.1.) The quarterly dividend is $2 per share. No dividend
is received when the stock is sold. She believes that her investment can be
represented as a multichain MCR with the following transition probability
matrix, expressed in canonical form, and the associated reward vector.

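\[
P =
\begin{array}{c|ccccc}
\text{State} & 30 & 50 & 35 & 40 & 45 \\ \hline
30 & 1 & 0 & 0 & 0 & 0 \\
50 & 0 & 1 & 0 & 0 & 0 \\
35 & 0.20 & 0.10 & 0.40 & 0.20 & 0.10 \\
40 & 0.10 & 0.06 & 0.30 & 0.28 & 0.26 \\
45 & 0.14 & 0.12 & 0.24 & 0.30 & 0.20
\end{array}
=
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ D_1 & D_2 & Q \end{bmatrix},
\tag{4.220a}
\]

\[
q =
\begin{array}{c|c}
\text{State} & \text{Reward} \\ \hline
30 & 30 \\
50 & 50 \\
35 & 2 \\
40 & 2 \\
45 & 2
\end{array}
=
\begin{bmatrix} q_{A1} \\ q_{A2} \\ q_T \end{bmatrix},
\tag{4.220b}
\]

where, with rows ordered by the transient states 35, 40, and 45,

\[
D_1 = \begin{bmatrix} 0.20 \\ 0.10 \\ 0.14 \end{bmatrix} \text{ (to state 30)},
\quad
D_2 = \begin{bmatrix} 0.10 \\ 0.06 \\ 0.12 \end{bmatrix} \text{ (to state 50)},
\quad
Q = \begin{bmatrix} 0.40 & 0.20 & 0.10 \\ 0.30 & 0.28 & 0.26 \\ 0.24 & 0.30 & 0.20 \end{bmatrix},
\quad
q_T = \begin{bmatrix} 2 \\ 2 \\ 2 \end{bmatrix},
\]

and, with rows ordered by the absorbing states 30 and 50,

\[
q_A = \begin{bmatrix} q_{A1} \\ q_{A2} \end{bmatrix} = \begin{bmatrix} 30 \\ 50 \end{bmatrix}.
\tag{4.220c}
\]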

Note that qA is the vector of selling prices received in the two absorbing
states, and qT is the vector of dividends received in the three transient states.
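The fundamental matrix for the submatrix Q, with rows and columns ordered by
states 35, 40, and 45, is

\[
U = (I - Q)^{-1} =
\begin{bmatrix} 0.60 & -0.20 & -0.10 \\ -0.30 & 0.72 & -0.26 \\ -0.24 & -0.30 & 0.80 \end{bmatrix}^{-1}
=
\begin{bmatrix} 2.3486 & 0.8961 & 0.5848 \\ 1.4261 & 2.1505 & 0.8772 \\ 1.2394 & 1.0753 & 1.7544 \end{bmatrix}.
\tag{4.221}
\]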

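The vector, UqT, is computed below, with rows ordered by states 35, 40, and 45:

\[
Uq_T = (I - Q)^{-1} q_T =
\begin{bmatrix} 2.3486 & 0.8961 & 0.5848 \\ 1.4261 & 2.1505 & 0.8772 \\ 1.2394 & 1.0753 & 1.7544 \end{bmatrix}
\begin{bmatrix} 2 \\ 2 \\ 2 \end{bmatrix}
=
\begin{bmatrix} 7.66 \\ 8.91 \\ 8.14 \end{bmatrix}.
\tag{4.222}
\]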

As Sections 4.2.4.4 and 4.2.5.4.1 indicate, the ith component of the vector UqT
represents the expected total dividend earned before eventual passage to
an absorbing state when the stock is sold, given that the chain started in
transient state i. For example, if she buys the stock for $40, she will earn an
expected total dividend of (UqT)40 = $8.91 before the stock is sold.
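When the stock is sold, the chain will be absorbed in state $30 or in state $50.
Using Equation (3.98), the matrix of absorption probabilities for an absorbing
multichain, given that the process starts in a transient state, is

\[
F = (I - Q)^{-1} D = UD = U \begin{bmatrix} D_1 & D_2 \end{bmatrix}
=
\begin{bmatrix} f_{35,30} & f_{35,50} \\ f_{40,30} & f_{40,50} \\ f_{45,30} & f_{45,50} \end{bmatrix}
\]

\[
=
\begin{bmatrix} 2.3486 & 0.8961 & 0.5848 \\ 1.4261 & 2.1505 & 0.8772 \\ 1.2394 & 1.0753 & 1.7544 \end{bmatrix}
\begin{bmatrix} 0.20 & 0.10 \\ 0.10 & 0.06 \\ 0.14 & 0.12 \end{bmatrix}
=
\begin{bmatrix} 0.6412 & 0.3588 \\ 0.6231 & 0.3769 \\ 0.6010 & 0.3990 \end{bmatrix},
\tag{4.223}
\]

where the rows correspond to transient states 35, 40, and 45, and the columns
of F correspond to absorbing states 30 and 50.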


If the investor buys the stock for $40, she will eventually sell it for either $30
with probability f40,30 = 0.6231 or $50 with probability f40,50 = 0.3769. Hence,
when the stock is sold, she will receive an expected selling price of

$30 f 40,30 + $50 f 40,50 = $30(0.6231) + $50(0.3769) = $37.54. (4.224)

To extend this result, suppose that a reward is received when an absorbing
multichain MCR enters an absorbing state. In this example the reward is the
selling price. The vector, FqA, of expected rewards received when the process
is absorbed, given that the MCR started in a transient state, is computed
below, with rows ordered by transient states 35, 40, and 45:

\[
Fq_A =
\begin{bmatrix} f_{35,30} & f_{35,50} \\ f_{40,30} & f_{40,50} \\ f_{45,30} & f_{45,50} \end{bmatrix}
\begin{bmatrix} q_{A1} \\ q_{A2} \end{bmatrix}
=
\begin{bmatrix} f_{35,30}\, q_{A1} + f_{35,50}\, q_{A2} \\ f_{40,30}\, q_{A1} + f_{40,50}\, q_{A2} \\ f_{45,30}\, q_{A1} + f_{45,50}\, q_{A2} \end{bmatrix}
\]

\[
=
\begin{bmatrix} 0.6412 & 0.3588 \\ 0.6231 & 0.3769 \\ 0.6010 & 0.3990 \end{bmatrix}
\begin{bmatrix} 30 \\ 50 \end{bmatrix}
=
\begin{bmatrix} \$37.18 \\ \$37.54 \\ \$37.98 \end{bmatrix}.
\tag{4.225}
\]

Thus, the ith component of the vector FqA represents the expected selling
price received when the stock is sold, given that the chain started in transient
state i. For example, if the investor buys the stock for $40, she will receive an
expected selling price of

( FqA )40 = f 40,30 qA1 + f 40,50 qA 2 = $37.54 (4.226)

when the stock is sold, confirming the earlier result. The investor’s expected
total reward, given that she bought the stock for $40, is the expected divi-
dends received before selling plus the expected selling price, or

(UqT )40 + ( FqA )40 = (UqT )40 + ($30 f 40,30 + $50 f 40,50 ) = $8.91 + $37.54 = $46.45.
(4.227)
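
The calculations in Equations (4.221) through (4.227) can be reproduced with a short script. This is a sketch in plain Python under the assumption that Q, D, qT, and qA are exactly as given in Equation (4.220); the helper names (inverse, mat_mul, mat_vec) are illustrative and are not taken from the text.

```python
def inverse(m):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    # Augment the matrix with the identity matrix.
    aug = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def mat_vec(a, x):
    """Multiply a matrix by a vector."""
    return [sum(c * y for c, y in zip(row, x)) for row in a]


# Data from Equation (4.220); transient states are ordered 35, 40, 45.
Q = [[0.40, 0.20, 0.10], [0.30, 0.28, 0.26], [0.24, 0.30, 0.20]]
D = [[0.20, 0.10], [0.10, 0.06], [0.14, 0.12]]  # columns: absorbing states 30, 50
q_T = [2.0, 2.0, 2.0]   # quarterly dividends in the transient states
q_A = [30.0, 50.0]      # selling prices in the absorbing states

I_minus_Q = [[(1.0 if i == j else 0.0) - Q[i][j] for j in range(3)] for i in range(3)]
U = inverse(I_minus_Q)          # fundamental matrix, Equation (4.221)
dividends = mat_vec(U, q_T)     # expected total dividends, Equation (4.222)
F = mat_mul(U, D)               # absorption probabilities, Equation (4.223)
selling = mat_vec(F, q_A)       # expected selling prices, Equation (4.225)
total = [d + s for d, s in zip(dividends, selling)]  # Equation (4.227), all states

for state, d, s, t in zip((35, 40, 45), dividends, selling, total):
    print(f"buy at ${state}: dividends {d:.2f} + selling price {s:.2f} = {t:.2f}")
```

The script carries full floating-point precision, so its results may differ from the rounded values in the text in the last printed digit.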

4.2.5.5 Value Iteration over a Finite Planning Horizon


As Section 4.2.4.5 indicates, value iteration can be executed to calculate the
expected total reward received by a reducible MCR over a finite planning
horizon. To demonstrate this calculation for a multichain MCR, consider the
absorbing multichain MCR model of selling a stock with two target prices
described in Section 4.2.5.4.2. The transition matrix P and reward vector q
are shown in Equation (4.220). Value iteration will be executed to calculate
the expected total income that will be earned by the investor’s one share of
stock after three quarters, given that she paid $40 for the stock. She may not
keep her share for three quarters because she will sell her share at the end of
the first quarter in which the share price rises to $50 or falls to $30.
The matrix form of the value iteration equation (4.29) is

v(n) = q + Pv(n + 1) for n = 0,1, 2. (4.228)

To begin the backward recursion, the vector of expected total income received at
the end of the three quarter planning horizon is set equal to zero for all states.

n = T = 3 (4.229a)

v(3) = v(T) = 0 (4.229b)

\[
\begin{bmatrix} v_{30}(3) \\ v_{50}(3) \\ v_{35}(3) \\ v_{40}(3) \\ v_{45}(3) \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}
\tag{4.229c}
\]

n = 2
v(2) = q + Pv(3)

\[
\begin{bmatrix} v_{30}(2) \\ v_{50}(2) \\ v_{35}(2) \\ v_{40}(2) \\ v_{45}(2) \end{bmatrix}
=
\begin{bmatrix} 30 \\ 50 \\ 2 \\ 2 \\ 2 \end{bmatrix}
+
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0.20 & 0.10 & 0.40 & 0.20 & 0.10 \\
0.10 & 0.06 & 0.30 & 0.28 & 0.26 \\
0.14 & 0.12 & 0.24 & 0.30 & 0.20
\end{bmatrix}
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}
=
\begin{bmatrix} 30 \\ 50 \\ 2 \\ 2 \\ 2 \end{bmatrix}
\tag{4.230}
\]

n = 1
v(1) = q + Pv(2)

\[
\begin{bmatrix} v_{30}(1) \\ v_{50}(1) \\ v_{35}(1) \\ v_{40}(1) \\ v_{45}(1) \end{bmatrix}
=
\begin{bmatrix} 30 \\ 50 \\ 2 \\ 2 \\ 2 \end{bmatrix}
+
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0.20 & 0.10 & 0.40 & 0.20 & 0.10 \\
0.10 & 0.06 & 0.30 & 0.28 & 0.26 \\
0.14 & 0.12 & 0.24 & 0.30 & 0.20
\end{bmatrix}
\begin{bmatrix} 30 \\ 50 \\ 2 \\ 2 \\ 2 \end{bmatrix}
=
\begin{bmatrix} 60 \\ 100 \\ 14.4 \\ 9.68 \\ 13.68 \end{bmatrix}
\tag{4.231}
\]

n = 0
v(0) = q + Pv(1)

\[
\begin{bmatrix} v_{30}(0) \\ v_{50}(0) \\ v_{35}(0) \\ v_{40}(0) \\ v_{45}(0) \end{bmatrix}
=
\begin{bmatrix} 30 \\ 50 \\ 2 \\ 2 \\ 2 \end{bmatrix}
+
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0.20 & 0.10 & 0.40 & 0.20 & 0.10 \\
0.10 & 0.06 & 0.30 & 0.28 & 0.26 \\
0.14 & 0.12 & 0.24 & 0.30 & 0.20
\end{bmatrix}
\begin{bmatrix} 60 \\ 100 \\ 14.4 \\ 9.68 \\ 13.68 \end{bmatrix}
=
\begin{bmatrix} 90 \\ 150 \\ 33.064 \\ 24.5872 \\ 31.496 \end{bmatrix}.
\tag{4.232}
\]

If the woman paid $40 for the stock, her expected total income after three
quarters will be v40(0) = $24.59.
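
The backward recursion of Equations (4.228) through (4.232) can be sketched in a few lines of plain Python. The function name value_iteration and its arguments are illustrative assumptions, not notation from the text; the data are those of Equation (4.220), and the absorbing states keep the rewards 30 and 50 at every epoch, exactly as in the computation above.

```python
def value_iteration(P, q, horizon, salvage=None):
    """Backward recursion v(n) = q + P v(n + 1), starting from v(T) = salvage.

    Returns a list `history` with history[k] = v(T - k), so history[-1] = v(0).
    """
    n_states = len(q)
    v = list(salvage) if salvage is not None else [0.0] * n_states
    history = [v]
    for _ in range(horizon):
        v = [q[i] + sum(P[i][j] * v[j] for j in range(n_states))
             for i in range(n_states)]
        history.append(v)
    return history


# States in canonical order: 30, 50 (absorbing), then 35, 40, 45 (transient).
states = [30, 50, 35, 40, 45]
P = [
    [1.00, 0.00, 0.00, 0.00, 0.00],
    [0.00, 1.00, 0.00, 0.00, 0.00],
    [0.20, 0.10, 0.40, 0.20, 0.10],
    [0.10, 0.06, 0.30, 0.28, 0.26],
    [0.14, 0.12, 0.24, 0.30, 0.20],
]
q = [30.0, 50.0, 2.0, 2.0, 2.0]

history = value_iteration(P, q, horizon=3)  # zero salvage values, T = 3
v0 = history[-1]
for state, value in zip(states, v0):
    print(f"v_{state}(0) = {value:.4f}")
```

Here history[1] and history[2] correspond to v(2) and v(1) in Equations (4.230) and (4.231).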

4.3 Discounted Rewards


In Section 4.2, the earnings of an MCR have been measured by the expected
total reward earned over a finite planning horizon, or by the expected aver-
age reward or gain earned over an infinite horizon. An alternative measure
is one that reflects the time value of money. The measure of an MCR’s earn-
ings used in this section is based on the time value of money, and is called
either the present value of the expected total reward, or the expected present
value, or the expected total discounted reward. In this book, all three terms
are used interchangeably. When the earnings of an MCR are measured by
the expected present value, the process is called an MCR with discounted
rewards, or a discounted MCR. Markov chain structure is not relevant when
a discounted MCR is analyzed [3, 4, 7].

4.3.1 Time Value of Money


The time value of money is a consequence of the fact that a dollar received
now is worth more than a dollar received in the future because money
received now can be invested to earn interest. Conversely, one dollar in the
future is worth less than one dollar in the present. Suppose that money earns
compound interest at a positive interest rate of i per period, where i is
expressed as a decimal fraction. A discount factor, denoted by α, is defined as

\[
\alpha = \frac{1}{1 + i},
\tag{4.233}
\]

where 0 < α < 1.


Clearly, α is a fraction between zero and one. Thus, q dollars received one
period in the future is equivalent to αq dollars received now, a smaller
quantity. The present value, at epoch 0, of q dollars received n periods in the
future, at epoch n, is α^n q dollars. Rewards of q dollars received at epochs
0, 1, 2, …, n have the respective present values of q, αq, α^2 q, …, α^n q
dollars. Note that α^n is the single-payment present-worth factor used in
engineering economic analysis to compute the present worth of a single payment
received n periods in the future.
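
As a small numerical illustration of Equation (4.233) and the present-worth factor α^n, the sketch below uses hypothetical figures (an interest rate of 25% per period and a payment of $100) that do not come from the text; the function names are likewise illustrative.

```python
def discount_factor(interest_rate):
    """Return alpha = 1 / (1 + i) for a positive interest rate i given as a fraction."""
    if interest_rate <= 0:
        raise ValueError("the interest rate must be positive")
    return 1.0 / (1.0 + interest_rate)


def present_values(q, alpha, n):
    """Present values at epoch 0 of q dollars received at epochs 0, 1, ..., n."""
    return [alpha ** k * q for k in range(n + 1)]


# Hypothetical inputs chosen only for illustration.
alpha = discount_factor(0.25)           # 1 / 1.25 = 0.8
pv = present_values(100.0, alpha, 3)    # q, alpha*q, alpha^2*q, alpha^3*q
for epoch, value in enumerate(pv):
    print(f"epoch {epoch}: present value = {value:.2f}")
```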

4.3.2 Value Iteration over a Finite Planning Horizon


When the planning horizon is finite, of length T periods, the expected total
discounted reward received in every state can be found by using value itera-
tion with a discount factor α [2–4, 7].

4.3.2.1 Value Iteration Equation


Suppose that a Markov chain with discounted rewards, which is called a
discounted MCR, has N states. Let vi(n) denote the value, at epoch n, of the
expected total discounted reward earned during the T − n periods from
epoch n to epoch T, the end of the planning horizon, if the system is in state
i at epoch n. The value, at epoch 0, of the expected total discounted reward
earned during the T periods from epoch 0 to epoch T, if the system starts in
state i at epoch 0, is denoted by vi(0). For brevity, vi(0) will be termed either an
expected total discounted reward or an expected present value if the system
starts in state i. By following a procedure similar to the one used in Section
4.2.2.2.1 for a Markov chain with undiscounted rewards, a backward recur-
sive value iteration equation relating vi(n) to vj(n + 1) will be developed. The
set of expected total discounted rewards for all states received at epoch n is
collected in an N-component expected total discounted reward vector,

v(n) = [v1 (n) v2 (n) ... vN (n)]T . (4.234)

Figure 4.12 is a discounted cash flow diagram, which shows the present values,
with a discount factor, α, of a series of equal reward vectors, q, received at
epochs 0 through T − 1. The present value of a reward vector, q, received at
epoch n, is α^n q. At epoch T, a vector of salvage values denoted by v(T) is
received.

[Figure 4.12: timeline of epochs 0, 1, 2, …, T − 1, T, with the present values
q, αq, α^2 q, …, α^(T−1) q of the reward vectors and α^T v(T) of the salvage
values marked above the corresponding epochs.]

FIGURE 4.12
Present values of a discounted cash flow diagram for reward vectors over a planning horizon
of length T periods.
The expected total discounted reward vector received at epoch 0 is denoted
by v(0). Note that the reward vector received now, at epoch 0, is q. The expected
present value of the reward vector received one period from now, at epoch
1, is P(αq) = (αP)q. The expected present value of the reward vector received
two periods from now, at epoch 2, is P^2(α^2 q) = (αP)^2 q. Similarly, the
expected present value of the reward vector received T − 1 periods from now, at
epoch T − 1, is (αP)^(T−1) q. Finally, the expected present value of the vector
of salvage values is (αP)^T v(T), where v(T) is the vector of salvage values
received at epoch T. Expected reward vectors are additive. That is, the expected
value of a sum of expected reward vectors is the sum of their expected values.
Therefore,

\[
\begin{aligned}
v(0) &= q + \alpha P q + (\alpha P)^2 q + (\alpha P)^3 q + \cdots + (\alpha P)^{T-2} q + (\alpha P)^{T-1} q + (\alpha P)^{T} v(T) \\
     &= q + \alpha P \left[ q + \alpha P q + (\alpha P)^2 q + (\alpha P)^3 q + \cdots + (\alpha P)^{T-3} q + (\alpha P)^{T-2} q + (\alpha P)^{T-1} v(T) \right].
\end{aligned}
\tag{4.235}
\]

Observe that the term in brackets, obtained by factoring out αP, is equal
to v(1), which represents the vector, at epoch 1, of the expected total dis-
counted income received from epoch 1 until the end of the planning hori-
zon. That is,

\[
v(1) = q + \alpha P q + (\alpha P)^2 q + (\alpha P)^3 q + \cdots + (\alpha P)^{T-2} q + (\alpha P)^{T-1} v(T).
\tag{4.236}
\]

Thus, the recursive relationship between the vectors v(0) and v(1) is

v(0) = q + α Pv(1). (4.237)

By following an argument analogous to the one used in Section 4.2.2.2.1 for a
Markov chain with undiscounted rewards, this result can be generalized to
produce the following recursive value iteration equation in matrix form for
a discounted MCR relating the vector v(n) at epoch n to the vector v(n + 1) at
epoch n + 1.

\[
v(n) = q + \alpha P v(n + 1), \quad \text{for } n = 0, 1, \ldots, T - 1, \text{ where } v(T) \text{ is specified.}
\tag{4.238}
\]

When rewards are discounted, chain structure is not relevant because every
row sum of the matrix αP equals α, which is less than one. Thus, the entries of
αP are not transition probabilities. For this reason, the value iteration
procedure is the same for all discounted MCRs, irrespective of whether the
associated Markov chains are recurrent, unichain, or multichain.


Consider a generic four-state discounted MCR with transition probability
matrix, P, discount factor, α, and reward vector, q. When N = 4 the expanded
matrix form of the recursive value iteration equation (4.238) is

\[
\begin{bmatrix} v_1(n) \\ v_2(n) \\ v_3(n) \\ v_4(n) \end{bmatrix}
=
\begin{bmatrix} q_1 \\ q_2 \\ q_3 \\ q_4 \end{bmatrix}
+
\begin{bmatrix}
\alpha p_{11} & \alpha p_{12} & \alpha p_{13} & \alpha p_{14} \\
\alpha p_{21} & \alpha p_{22} & \alpha p_{23} & \alpha p_{24} \\
\alpha p_{31} & \alpha p_{32} & \alpha p_{33} & \alpha p_{34} \\
\alpha p_{41} & \alpha p_{42} & \alpha p_{43} & \alpha p_{44}
\end{bmatrix}
\begin{bmatrix} v_1(n+1) \\ v_2(n+1) \\ v_3(n+1) \\ v_4(n+1) \end{bmatrix}.
\tag{4.239}
\]

In expanded algebraic form the four recursive value iteration equations are

\[
\begin{aligned}
v_1(n) &= q_1 + \alpha p_{11} v_1(n+1) + \alpha p_{12} v_2(n+1) + \alpha p_{13} v_3(n+1) + \alpha p_{14} v_4(n+1) \\
v_2(n) &= q_2 + \alpha p_{21} v_1(n+1) + \alpha p_{22} v_2(n+1) + \alpha p_{23} v_3(n+1) + \alpha p_{24} v_4(n+1) \\
v_3(n) &= q_3 + \alpha p_{31} v_1(n+1) + \alpha p_{32} v_2(n+1) + \alpha p_{33} v_3(n+1) + \alpha p_{34} v_4(n+1) \\
v_4(n) &= q_4 + \alpha p_{41} v_1(n+1) + \alpha p_{42} v_2(n+1) + \alpha p_{43} v_3(n+1) + \alpha p_{44} v_4(n+1).
\end{aligned}
\tag{4.240}
\]

In compact algebraic form the four recursive value iteration equations are

\[
\begin{aligned}
v_1(n) &= q_1 + \alpha \sum_{j=1}^{4} p_{1j} v_j(n+1) \\
v_2(n) &= q_2 + \alpha \sum_{j=1}^{4} p_{2j} v_j(n+1) \\
v_3(n) &= q_3 + \alpha \sum_{j=1}^{4} p_{3j} v_j(n+1) \\
v_4(n) &= q_4 + \alpha \sum_{j=1}^{4} p_{4j} v_j(n+1).
\end{aligned}
\tag{4.241}
\]

The four algebraic equations can be represented by the single equation

\[
v_i(n) = q_i + \alpha \sum_{j=1}^{4} p_{ij} v_j(n+1),
\quad \text{for } n = 0, 1, \ldots, T - 1, \text{ and } i = 1, 2, 3, \text{ and } 4.
\tag{4.242}
\]

This result can be generalized to apply to any N-state Markov chain with
discounted rewards. Therefore, vi(n) can be computed by using the following
recursive value iteration equation in algebraic form which relates vi(n)
to vj(n + 1).

\[
v_i(n) = q_i + \alpha \sum_{j=1}^{N} p_{ij} v_j(n+1),
\quad \text{for } n = 0, 1, \ldots, T - 1, \; i = 1, 2, \ldots, N,
\text{ where } v_i(T) \text{ is specified for all states } i.
\tag{4.243}
\]

In summary, the value iteration equation expresses a backward recursive
relationship because it starts with a known set of salvage values for vector
v(T) at the end of the planning horizon. Next, vector v(T − 1) is calculated in
terms of v(T). The value iteration procedure moves backward one epoch at a
time by calculating v(n) in terms of v(n + 1). The backward recursive procedure
stops at epoch 0 after v(0) is calculated in terms of v(1). The component
vi(0) of v(0) is the expected total discounted reward received until the end of
the planning horizon if the system starts in state i at epoch 0.

4.3.2.2 Value Iteration for Discounted MCR Model of Monthly Sales


Consider a discounted MCR model of monthly sales. This model was introduced
without discounting in Section 4.2.2.1 and solved by value iteration
over a finite horizon in Section 4.2.2.2.2. The four-state discounted MCR
model has the following transition probability matrix P, discount factor
α = 0.9, matrix αP, and reward vector q, with rows and columns ordered by
states 1 through 4:

\[
P = \begin{bmatrix}
0.60 & 0.30 & 0.10 & 0 \\
0.25 & 0.30 & 0.35 & 0.10 \\
0.05 & 0.25 & 0.50 & 0.20 \\
0 & 0.10 & 0.30 & 0.60
\end{bmatrix},
\quad
\alpha P = \begin{bmatrix}
0.540 & 0.270 & 0.090 & 0 \\
0.225 & 0.270 & 0.315 & 0.090 \\
0.045 & 0.225 & 0.450 & 0.180 \\
0 & 0.090 & 0.270 & 0.540
\end{bmatrix},
\quad
q = \begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}.
\tag{4.244}
\]

(Rewards are in thousands of dollars.) In matrix form, the value iteration
equation (4.239) for the discounted MCR model of monthly sales is

\[
\begin{bmatrix} v_1(n) \\ v_2(n) \\ v_3(n) \\ v_4(n) \end{bmatrix}
=
\begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}
+
\begin{bmatrix}
0.540 & 0.270 & 0.090 & 0 \\
0.225 & 0.270 & 0.315 & 0.090 \\
0.045 & 0.225 & 0.450 & 0.180 \\
0 & 0.090 & 0.270 & 0.540
\end{bmatrix}
\begin{bmatrix} v_1(n+1) \\ v_2(n+1) \\ v_3(n+1) \\ v_4(n+1) \end{bmatrix}.
\tag{4.245}
\]

Value iteration will be executed by solving the backward recursive equations
in matrix form to compute the expected total discounted reward vector over
a planning horizon consisting of three periods. To begin value iteration, in
order to simplify computations, the salvage values at the end of the planning
horizon are set equal to zero for all states.

n = T = 3

\[
\begin{bmatrix} v_1(n) \\ v_2(n) \\ v_3(n) \\ v_4(n) \end{bmatrix}
=
\begin{bmatrix} v_1(T) \\ v_2(T) \\ v_3(T) \\ v_4(T) \end{bmatrix}
=
\begin{bmatrix} v_1(3) \\ v_2(3) \\ v_3(3) \\ v_4(3) \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}
\tag{4.246}
\]

n = 2
v(2) = q + αPv(3) = q + (0.9P)v(3) = q + (0.9P)(0) = q

\[
\begin{bmatrix} v_1(2) \\ v_2(2) \\ v_3(2) \\ v_4(2) \end{bmatrix}
=
\begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}
+
\begin{bmatrix}
0.540 & 0.270 & 0.090 & 0 \\
0.225 & 0.270 & 0.315 & 0.090 \\
0.045 & 0.225 & 0.450 & 0.180 \\
0 & 0.090 & 0.270 & 0.540
\end{bmatrix}
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}
=
\begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}
\tag{4.247}
\]

n = 1
v(1) = q + αPv(2) = q + (0.9P)v(2)

\[
\begin{bmatrix} v_1(1) \\ v_2(1) \\ v_3(1) \\ v_4(1) \end{bmatrix}
=
\begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}
+
\begin{bmatrix}
0.540 & 0.270 & 0.090 & 0 \\
0.225 & 0.270 & 0.315 & 0.090 \\
0.045 & 0.225 & 0.450 & 0.180 \\
0 & 0.090 & 0.270 & 0.540
\end{bmatrix}
\begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}
=
\begin{bmatrix} -29.9 \\ 2.525 \\ -2.525 \\ 37.6 \end{bmatrix}
\tag{4.248}
\]

n = 0
v(0) = q + αPv(1) = q + (0.9P)v(1)

\[
\begin{bmatrix} v_1(0) \\ v_2(0) \\ v_3(0) \\ v_4(0) \end{bmatrix}
=
\begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}
+
\begin{bmatrix}
0.540 & 0.270 & 0.090 & 0 \\
0.225 & 0.270 & 0.315 & 0.090 \\
0.045 & 0.225 & 0.450 & 0.180 \\
0 & 0.090 & 0.270 & 0.540
\end{bmatrix}
\begin{bmatrix} -29.9 \\ 2.525 \\ -2.525 \\ 37.6 \end{bmatrix}
=
\begin{bmatrix} -35.6915 \\ 1.5429 \\ -0.1456 \\ 44.8495 \end{bmatrix}
\tag{4.249}
\]

The vector v(0) indicates that if this discounted MCR operates over a three-
period planning horizon with a discount factor of 0.9, the expected total
discounted reward will be –35.6915 if the system starts in state 1, 1.5429
if it starts in state 2, –0.1456 if it starts in state 3, and 44.8495 if it starts in
state 4.


Using Equation (4.235) with T = 3 and v(3) = 0, the solution by value iteration
for v(0) in Equation (4.249) can be verified by calculating

\[
v(0) = q + \alpha P q + (\alpha P)^2 q
\]

\[
\begin{bmatrix} v_1(0) \\ v_2(0) \\ v_3(0) \\ v_4(0) \end{bmatrix}
=
\begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}
+
\begin{bmatrix}
0.540 & 0.270 & 0.090 & 0 \\
0.225 & 0.270 & 0.315 & 0.090 \\
0.045 & 0.225 & 0.450 & 0.180 \\
0 & 0.090 & 0.270 & 0.540
\end{bmatrix}
\begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}
+
\begin{bmatrix}
0.540 & 0.270 & 0.090 & 0 \\
0.225 & 0.270 & 0.315 & 0.090 \\
0.045 & 0.225 & 0.450 & 0.180 \\
0 & 0.090 & 0.270 & 0.540
\end{bmatrix}^{2}
\begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}
=
\begin{bmatrix} -35.6915 \\ 1.5429 \\ -0.1456 \\ 44.8495 \end{bmatrix}.
\tag{4.250}
\]

Suppose that one period is added to the three-period planning horizon to
create a four-period horizon. The expected total discounted reward vector,
v(−1), for the four-period horizon is calculated below, using the procedure
described in Section 4.2.2.3:

v(−1) = q + αPv(0) = q + (0.9P)v(0)

\[
\begin{bmatrix} v_1(-1) \\ v_2(-1) \\ v_3(-1) \\ v_4(-1) \end{bmatrix}
=
\begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}
+
\begin{bmatrix}
0.540 & 0.270 & 0.090 & 0 \\
0.225 & 0.270 & 0.315 & 0.090 \\
0.045 & 0.225 & 0.450 & 0.180 \\
0 & 0.090 & 0.270 & 0.540
\end{bmatrix}
\begin{bmatrix} -35.6915 \\ 1.5429 \\ -0.1456 \\ 44.8495 \end{bmatrix}
=
\begin{bmatrix} -38.8699 \\ 1.3766 \\ 1.7484 \\ 49.3183 \end{bmatrix}.
\tag{4.251}
\]

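
The three-period recursion, the series check of Equation (4.235), and the one-period extension can all be reproduced with the sketch below. It is written in plain Python and assumes P, q, and α = 0.9 exactly as in Equation (4.244); the function names step and alpha_p_times are illustrative, not from the text.

```python
# Monthly sales model of Equation (4.244); rewards in thousands of dollars.
P = [
    [0.60, 0.30, 0.10, 0.00],
    [0.25, 0.30, 0.35, 0.10],
    [0.05, 0.25, 0.50, 0.20],
    [0.00, 0.10, 0.30, 0.60],
]
q = [-20.0, 5.0, -5.0, 25.0]
alpha = 0.9
n_states = len(q)


def alpha_p_times(x):
    """Return the vector (alpha * P) x."""
    return [alpha * sum(P[i][j] * x[j] for j in range(n_states))
            for i in range(n_states)]


def step(v_next):
    """One backward step of value iteration: v(n) = q + alpha * P * v(n + 1)."""
    return [qi + ti for qi, ti in zip(q, alpha_p_times(v_next))]


# Backward recursion from v(3) = 0 down to v(0).
v = [0.0] * n_states
for epoch in (2, 1, 0):
    v = step(v)
    print(f"v({epoch}) =", [round(x, 4) for x in v])
v0 = v

# Independent check with T = 3 and zero salvage: v(0) = q + (aP)q + (aP)^2 q.
term1 = alpha_p_times(q)
term2 = alpha_p_times(term1)
series = [a + b + c for a, b, c in zip(q, term1, term2)]

# Adding one more period at the front of the horizon gives v(-1).
v_minus_1 = step(v0)
print("v(-1) =", [round(x, 4) for x in v_minus_1])
```

Because the recursion and the series are algebraically identical, the two results agree up to floating-point rounding.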
4.3.3 An Infinite Planning Horizon
Two approaches can be followed to calculate the expected total discounted
rewards received by an N-state MCR operating over an infinite planning
horizon. The first approach is to solve a set of VDEs for a discounted MCR to
find the expected total discounted rewards earned in all states. This approach
involves the solution of N simultaneous linear equations in N unknowns.
The second approach, which requires less computation, is to execute value
iteration over a large number of periods. However, value iteration gives only
an approximate solution.

4.3.3.1 VDEs for Expected Total Discounted Rewards


Recall Equation (4.235) for calculating the vector v(0) over a finite planning
horizon of length T periods,

\[
v(0) = \left[ I + \alpha P + (\alpha P)^2 + (\alpha P)^3 + \cdots + (\alpha P)^{T-2} + (\alpha P)^{T-1} \right] q + (\alpha P)^{T} v(T).
\tag{4.235}
\]


As T approaches infinity, the finite-state Markov chain with discounted
rewards enters the steady state. Since αP is a substochastic matrix for which
all row sums are less than one, the limit of (αP)^T as T → ∞ is the zero matrix.
Using formula (3.73) for the sum of an infinite series of substochastic
matrices,

\[
\lim_{T \to \infty} \left[ I + \alpha P + (\alpha P)^2 + (\alpha P)^3 + \cdots + (\alpha P)^{T-1} \right] = (I - \alpha P)^{-1}.
\tag{4.252}
\]

Using this result,

\[
\lim_{T \to \infty} v(0) = v = (I - \alpha P)^{-1} q.
\tag{4.253}
\]

Thus, the expected total discounted reward vector, v, of income received in
each state over an infinite planning horizon is finite, and is equal to the inverse
of the matrix (I − αP) times the reward vector, q. When both sides of the matrix
equation (4.253) are premultiplied by the matrix (I − αP), the result is

\[
v = (I - \alpha P)^{-1} q
\tag{4.253}
\]

\[
(I - \alpha P) v = q
\tag{4.254}
\]

\[
\begin{aligned}
Iv - \alpha P v &= q \\
v - \alpha P v &= q \\
v &= q + \alpha P v.
\end{aligned}
\tag{4.255}
\]

For an N-state MCR, the matrix equation (4.253) relates the expected total
discounted reward vector, v = [v1, v2, …, vN]^T, to the reward vector,
q = [q1, q2, …, qN]^T. The matrix equation (4.255) represents a system of N
linear equations in N unknowns. The system of equations (4.255) is called the
matrix form of the VDEs for a discounted MCR. The N unknowns, v1, v2, …, vN,
represent the expected total discounted rewards received in every state. An
alternate form of the VDEs is the matrix equation (4.253). However, Gaussian
reduction is a more efficient procedure than matrix inversion for solving a
system of linear equations.
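
The VDEs in the form (I − αP)v = q of Equation (4.254) can be solved by Gaussian reduction without forming an inverse. The sketch below is one possible plain-Python implementation, using elimination with partial pivoting followed by back substitution; the function names are illustrative assumptions, and the data are those of the monthly sales model in Equation (4.244).

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy [a | b] so the inputs are left unchanged.
    aug = [[float(x) for x in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x


def discounted_values(P, q, alpha):
    """Solve the VDEs (I - alpha * P) v = q for the vector v."""
    n = len(q)
    coefficients = [[(1.0 if i == j else 0.0) - alpha * P[i][j] for j in range(n)]
                    for i in range(n)]
    return solve_linear_system(coefficients, q)


# Monthly sales model of Equation (4.244), alpha = 0.9.
P = [
    [0.60, 0.30, 0.10, 0.00],
    [0.25, 0.30, 0.35, 0.10],
    [0.05, 0.25, 0.50, 0.20],
    [0.00, 0.10, 0.30, 0.60],
]
q = [-20.0, 5.0, -5.0, 25.0]
v = discounted_values(P, q, alpha=0.9)

# Residual of v = q + alpha * P * v; it should be close to zero in every state.
residual = [v[i] - q[i] - 0.9 * sum(P[i][j] * v[j] for j in range(4))
            for i in range(4)]
print("v =", [round(x, 4) for x in v])
```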
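When N = 4 the expanded matrix form of the VDEs (4.255) is

\[
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
=
\begin{bmatrix} q_1 \\ q_2 \\ q_3 \\ q_4 \end{bmatrix}
+
\begin{bmatrix}
\alpha p_{11} & \alpha p_{12} & \alpha p_{13} & \alpha p_{14} \\
\alpha p_{21} & \alpha p_{22} & \alpha p_{23} & \alpha p_{24} \\
\alpha p_{31} & \alpha p_{32} & \alpha p_{33} & \alpha p_{34} \\
\alpha p_{41} & \alpha p_{42} & \alpha p_{43} & \alpha p_{44}
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}.
\tag{4.256}
\]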


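In expanded algebraic form the four VDEs are

\[
\begin{aligned}
v_1 &= q_1 + \alpha p_{11} v_1 + \alpha p_{12} v_2 + \alpha p_{13} v_3 + \alpha p_{14} v_4 \\
v_2 &= q_2 + \alpha p_{21} v_1 + \alpha p_{22} v_2 + \alpha p_{23} v_3 + \alpha p_{24} v_4 \\
v_3 &= q_3 + \alpha p_{31} v_1 + \alpha p_{32} v_2 + \alpha p_{33} v_3 + \alpha p_{34} v_4 \\
v_4 &= q_4 + \alpha p_{41} v_1 + \alpha p_{42} v_2 + \alpha p_{43} v_3 + \alpha p_{44} v_4.
\end{aligned}
\tag{4.257}
\]

In compact algebraic form the four VDEs are

\[
\begin{aligned}
v_1 &= q_1 + \alpha \sum_{j=1}^{4} p_{1j} v_j \\
v_2 &= q_2 + \alpha \sum_{j=1}^{4} p_{2j} v_j \\
v_3 &= q_3 + \alpha \sum_{j=1}^{4} p_{3j} v_j \\
v_4 &= q_4 + \alpha \sum_{j=1}^{4} p_{4j} v_j.
\end{aligned}
\tag{4.258}
\]

A more compact algebraic form of the four VDEs is

\[
v_i = q_i + \alpha \sum_{j=1}^{4} p_{ij} v_j, \quad \text{for } i = 1, 2, 3, \text{ and } 4.
\tag{4.259}
\]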

This result can be generalized to apply to any N-state Markov chain with
discounted rewards. Therefore, the compact algebraic form of the VDEs is

\[
v_i = q_i + \alpha \sum_{j=1}^{N} p_{ij} v_j, \quad \text{for } i = 1, 2, \ldots, N.
\tag{4.260}
\]

Consider the matrix equation (4.253).

\[
v = \lim_{T \to \infty} v(0) = (I - \alpha P)^{-1} q.
\tag{4.253}
\]

Since vi is the ith element of the vector v, and vi(0) is the ith element of the
vector v(0), it follows that, for large T,

\[
v_i = \lim_{T \to \infty} v_i(0).
\tag{4.261}
\]


Similarly, v(1) is the expected total discounted reward vector earned at epoch
1 over a planning horizon of length T periods. As T grows large, (T − 1) also
grows large, so that

\[
v_i = \lim_{T \to \infty} v_i(1).
\tag{4.262}
\]

4.3.3.1.1 VDEs for a Two-State Discounted MCR


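Consider a generic two-state discounted MCR with a discount factor, α,
for which the transition probability matrix and reward vector are given in
Equation (4.2). As Equation (4.254) indicates, the VDEs can be expressed in
matrix form as

\[
(I - \alpha P) v = q
\tag{4.254}
\]

\[
\begin{aligned}
\left( \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}
- \begin{bmatrix} \alpha p_{11} & \alpha p_{12} \\ \alpha p_{21} & \alpha p_{22} \end{bmatrix} \right)
\begin{bmatrix} v_1 \\ v_2 \end{bmatrix}
&= \begin{bmatrix} q_1 \\ q_2 \end{bmatrix} \\
\begin{bmatrix} 1 - \alpha p_{11} & -\alpha p_{12} \\ -\alpha p_{21} & 1 - \alpha p_{22} \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \end{bmatrix}
&= \begin{bmatrix} q_1 \\ q_2 \end{bmatrix}.
\end{aligned}
\tag{4.263}
\]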

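The solution for the expected total discounted rewards is

\[
\begin{aligned}
v_1 &= \frac{q_1 (1 - \alpha p_{22}) + q_2 \alpha p_{12}}{(1 - \alpha p_{11})(1 - \alpha p_{22}) - \alpha^2 p_{12} p_{21}} \\
v_2 &= \frac{q_2 (1 - \alpha p_{11}) + q_1 \alpha p_{21}}{(1 - \alpha p_{11})(1 - \alpha p_{22}) - \alpha^2 p_{12} p_{21}}.
\end{aligned}
\tag{4.264}
\]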

For example, consider the following two-state recurrent MCR for which the
gain was calculated in Equation (4.51).

\[
P = \begin{bmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{bmatrix}
  = \begin{bmatrix} 0.2 & 0.8 \\ 0.6 & 0.4 \end{bmatrix},
\quad
q = \begin{bmatrix} q_1 \\ q_2 \end{bmatrix}
  = \begin{bmatrix} -300 \\ 200 \end{bmatrix}.
\tag{4.67}
\]

When a discount factor of α = 0.9 is used for the MCR, the expected total
discounted rewards are

\[
\begin{aligned}
v_1 &= \frac{q_1 (1 - \alpha p_{22}) + q_2 \alpha p_{12}}{(1 - \alpha p_{11})(1 - \alpha p_{22}) - \alpha^2 p_{12} p_{21}}
     = \frac{(-300)[1 - 0.9(0.4)] + 200(0.9)(0.8)}{[1 - 0.9(0.2)][1 - 0.9(0.4)] - (0.9)^2 (0.8)(0.6)}
     = -352.9412 \\
v_2 &= \frac{q_2 (1 - \alpha p_{11}) + q_1 \alpha p_{21}}{(1 - \alpha p_{11})(1 - \alpha p_{22}) - \alpha^2 p_{12} p_{21}}
     = \frac{200[1 - 0.9(0.2)] - 300(0.9)(0.6)}{[1 - 0.9(0.2)][1 - 0.9(0.4)] - (0.9)^2 (0.8)(0.6)}
     = 14.706.
\end{aligned}
\tag{4.265}
\]
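
The closed-form solution of Equation (4.264) is easy to evaluate directly. The sketch below, in plain Python, applies it to the two-state MCR of Equation (4.67) with α = 0.9 and then checks that the result satisfies the VDEs v = q + αPv; the function name two_state_values is an illustrative choice, not notation from the text.

```python
def two_state_values(p11, p12, p21, p22, q1, q2, alpha):
    """Expected total discounted rewards of a two-state MCR, Equation (4.264)."""
    det = (1 - alpha * p11) * (1 - alpha * p22) - alpha ** 2 * p12 * p21
    v1 = (q1 * (1 - alpha * p22) + q2 * alpha * p12) / det
    v2 = (q2 * (1 - alpha * p11) + q1 * alpha * p21) / det
    return v1, v2


# Two-state recurrent MCR of Equation (4.67).
p11, p12, p21, p22 = 0.2, 0.8, 0.6, 0.4
q1, q2 = -300.0, 200.0
alpha = 0.9

v1, v2 = two_state_values(p11, p12, p21, p22, q1, q2, alpha)
print(f"v1 = {v1:.4f}, v2 = {v2:.4f}")

# Check against the VDEs: each residual should be close to zero.
residual_1 = v1 - (q1 + alpha * (p11 * v1 + p12 * v2))
residual_2 = v2 - (q2 + alpha * (p21 * v1 + p22 * v2))
```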

4.3.3.1.2 Optional Insight: Limiting Relationship Between Expected Total Discounted Rewards and the Gain

Consider a recurrent Markov chain with discounted rewards. If α = 1 so
that rewards are not discounted, the gain of the recurrent chain is g. Now
suppose that rewards are discounted. Suppose that the discount factor, α,
approaches the limiting value of one. When a discount factor α → 1 for an
N-state MCR, which has a recurrent chain with a gain of g, then

\[
\lim_{\alpha \to 1} (1 - \alpha) v_i = g,
\tag{4.266}
\]

where vi is the expected total discounted reward received in state i [2, 7]. This
limiting relationship will be demonstrated with respect to the discounted
two-state MCR for which v1 and v2 were calculated in Equation (4.265).
Recall from Equation (4.48) or (4.50) that the gain of an undiscounted
recurrent two-state MCR is

\[
g = \frac{p_{21} q_1 + p_{12} q_2}{p_{12} + p_{21}}.
\tag{4.50}
\]

The expected total discounted rewards, v1 and v2, calculated in Equation
(4.264) for the discounted two-state MCR will be expressed in terms of the
transition probabilities p12 and p21 by making the substitutions p11 = 1 − p12
and p22 = 1 − p21.

\[
v_1 = \frac{q_1 [1 - \alpha (1 - p_{21})] + q_2 \alpha p_{12}}{[1 - \alpha (1 - p_{12})][1 - \alpha (1 - p_{21})] - \alpha^2 p_{12} p_{21}}.
\tag{4.267}
\]

After expanding the denominator,

\[
v_1 = \frac{q_1 [1 - \alpha (1 - p_{21})] + q_2 \alpha p_{12}}{[1 - 2\alpha + \alpha p_{12} + \alpha p_{21} + \alpha^2 - \alpha^2 p_{12} - \alpha^2 p_{21}] + \alpha^2 p_{12} p_{21} - \alpha^2 p_{12} p_{21}}.
\tag{4.268}
\]


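Since the last two terms in the denominator sum to zero, they can be dropped,
and

\[
\begin{aligned}
v_1 &= \frac{q_1 [1 - \alpha (1 - p_{21})] + q_2 \alpha p_{12}}{(1 - 2\alpha + \alpha^2) + \alpha p_{12} (1 - \alpha) + \alpha p_{21} (1 - \alpha)} \\
    &= \frac{q_1 [1 - \alpha (1 - p_{21})] + q_2 \alpha p_{12}}{(1 - \alpha)^2 + \alpha p_{12} (1 - \alpha) + \alpha p_{21} (1 - \alpha)}.
\end{aligned}
\tag{4.269}
\]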

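Observe that when α = 1,

\[
v_1 = (q_1 p_{21} + q_2 p_{12})/0 = \infty
\tag{4.270}
\]

because over an infinite planning horizon the expected total reward
without discounting is unbounded, provided that the gain is nonzero. Hence α
can approach one but can never equal one. Note that when both sides of
Equation (4.269) are multiplied by (1 − α),

\[
(1 - \alpha) v_1
= \frac{(1 - \alpha) \{ q_1 [1 - \alpha (1 - p_{21})] + q_2 \alpha p_{12} \}}{(1 - \alpha)^2 + \alpha p_{12} (1 - \alpha) + \alpha p_{21} (1 - \alpha)}
= \frac{q_1 [1 - \alpha (1 - p_{21})] + q_2 \alpha p_{12}}{(1 - \alpha) + \alpha p_{12} + \alpha p_{21}}
\tag{4.271}
\]

\[
\begin{aligned}
\lim_{\alpha \to 1} (1 - \alpha) v_1
&= \lim_{\alpha \to 1} \frac{q_1 [1 - \alpha (1 - p_{21})] + q_2 \alpha p_{12}}{(1 - \alpha) + \alpha p_{12} + \alpha p_{21}}
 = \frac{q_1 [1 - 1(1 - p_{21})] + q_2 (1) p_{12}}{(1 - 1) + (1) p_{12} + (1) p_{21}} \\
&= \frac{q_1 p_{21} + q_2 p_{12}}{p_{12} + p_{21}} = g.
\end{aligned}
\tag{4.272}
\]

A similar argument will show that

\[
\lim_{\alpha \to 1} (1 - \alpha) v_2 = \frac{q_1 p_{21} + q_2 p_{12}}{p_{12} + p_{21}} = g.
\tag{4.273}
\]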

When the discount factor for the two-state MCR represented in Equation
(4.51) of Section 4.3.3.1.1 is increased to α = 0.999, the expected total
discounted rewards are

\[
\begin{aligned}
v_1 &= \frac{q_1 (1 - \alpha p_{22}) + q_2 \alpha p_{12}}{(1 - \alpha p_{11})(1 - \alpha p_{22}) - \alpha^2 p_{12} p_{21}} \\
    &= \frac{(-300)[1 - 0.999(0.4)] + 200(0.999)(0.8)}{[1 - 0.999(0.2)][1 - 0.999(0.4)] - (0.999)^2 (0.8)(0.6)} = -14{,}489.854 \\
v_2 &= \frac{q_2 (1 - \alpha p_{11}) + q_1 \alpha p_{21}}{(1 - \alpha p_{11})(1 - \alpha p_{22}) - \alpha^2 p_{12} p_{21}} \\
    &= \frac{200[1 - 0.999(0.2)] - 300(0.999)(0.6)}{[1 - 0.999(0.2)][1 - 0.999(0.4)] - (0.999)^2 (0.8)(0.6)} = -14{,}132.609.
\end{aligned}
\tag{4.274}
\]

As Equation (4.52) indicates, the gain without discounting is g = −14.29.
Observe that

$$
\begin{aligned}
(1 - \alpha)v_1 &= (1 - 0.999)(-14{,}489.854) = -14.49 \to -14.29 = g \\
(1 - \alpha)v_2 &= (1 - 0.999)(-14{,}132.609) = -14.13 \to -14.29 = g.
\end{aligned}
\tag{4.275}
$$

Although these results have been demonstrated to hold for a two-state MCR,
they can be extended to show that Equation (4.266) is true for any N-state
MCR, which has a recurrent chain with a gain g.

4.3.3.1.3 VDEs for a Discounted MCR Model of Monthly Sales


Consider the four-state discounted MCR model of monthly sales treated
in Section 4.3.2.2 over a finite planning horizon. The transition probability
matrix P, discount factor α = 0.9, matrix αP, and reward vector q are shown
in Equation (4.244).
The expanded matrix form of the VDEs (4.256) is

$$
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
=
\begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}
+
\begin{bmatrix}
0.540 & 0.270 & 0.090 & 0 \\
0.225 & 0.270 & 0.315 & 0.090 \\
0.045 & 0.225 & 0.450 & 0.180 \\
0 & 0.090 & 0.270 & 0.540
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}.
\tag{4.276}
$$

The solution for the expected total discounted rewards is

v1 = −35.7733, v2 = 8.9917, v3 = 12.4060, and v4 = 63.3889.

Thus, if this Markov chain with discounted rewards operates over an
infinite planning horizon, the expected total discounted reward will be
−$35,773.30, $8,991.70, $12,406.00, or $63,388.90 if it starts in state 1, 2, 3, or 4,
respectively.
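
The VDEs (4.276) form the linear system (I − αP)v = q, which can be solved directly. The following is a minimal sketch in Python, assuming NumPy is available; the variable names are illustrative, and the rewards are entered in thousands of dollars as in Equation (4.6).

```python
import numpy as np

# Transition probability matrix P and reward vector q (thousands of dollars)
# for the four-state monthly sales model.
P = np.array([
    [0.60, 0.30, 0.10, 0.00],
    [0.25, 0.30, 0.35, 0.10],
    [0.05, 0.25, 0.50, 0.20],
    [0.00, 0.10, 0.30, 0.60],
])
q = np.array([-20.0, 5.0, -5.0, 25.0])
alpha = 0.9

# Solve (I - alpha * P) v = q instead of forming the inverse explicitly.
v = np.linalg.solve(np.eye(4) - alpha * P, q)
print(np.round(v, 4))
```

The printed vector should agree, up to rounding, with the expected total discounted rewards quoted above.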


4.3.3.1.4 Optional Insight: Probabilistic Interpretation of Discount Factor


A discount factor α, with 0 < α < 1, has an interesting probabilistic interpreta-
tion with respect to an undiscounted MCR of uncertain duration [4, 5, 7]. To
reveal this probabilistic interpretation, consider the four-state MCR model of
monthly sales introduced without discounting in Section 4.2.2.1. The transi-
tion probability matrix P and reward vector q for the undiscounted MCR
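model are shown in Equation (4.6), in which the rows and columns of P
are indexed by states 1 through 4.

$$
P = \begin{bmatrix}
0.60 & 0.30 & 0.10 & 0 \\
0.25 & 0.30 & 0.35 & 0.10 \\
0.05 & 0.25 & 0.50 & 0.20 \\
0 & 0.10 & 0.30 & 0.60
\end{bmatrix},
\quad
q = \begin{bmatrix} -20 \\ 5 \\ -5 \\ 25 \end{bmatrix}.
\tag{4.6}
$$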

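Assume that a discount factor of α is specified. Suppose that the matrix αP is
embedded in the following transition matrix denoted by Y:

$$
Y =
\begin{array}{c|ccccc}
\text{State} & 0 & 1 & 2 & 3 & 4 \\ \hline
0 & 1 & 0 & 0 & 0 & 0 \\
1 & 1 - \alpha & \alpha(0.60) & \alpha(0.30) & \alpha(0.10) & \alpha(0) \\
2 & 1 - \alpha & \alpha(0.25) & \alpha(0.30) & \alpha(0.35) & \alpha(0.10) \\
3 & 1 - \alpha & \alpha(0.05) & \alpha(0.25) & \alpha(0.50) & \alpha(0.20) \\
4 & 1 - \alpha & \alpha(0) & \alpha(0.10) & \alpha(0.30) & \alpha(0.60)
\end{array}
=
\begin{bmatrix} 1 & 0 \\ 1 - \alpha & \alpha P \end{bmatrix}
=
\begin{bmatrix} 1 & 0 \\ D & Q \end{bmatrix}.
\tag{4.277}
$$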

Since the entries in every row of P sum to 1, the entries in all rows of αP sum
to α. Hence, Y is the transition probability matrix for an absorbing unichain.
The state space E = {1, 2, 3, 4} of the undiscounted process has been augmented
by an absorbing state 0. States 1 through 4 are transient. There is a probability
1 − α that the undiscounted process will reach the absorbing state on the next
transition, and stop. There is also a probability α that the process will not reach
the absorbing state on the next transition, and will therefore continue. One may
conclude that if the duration of an undiscounted MCR is indefinite, one minus
a discount factor can be interpreted as the probability that the process will
stop after the next transition. Therefore, a discount factor can be interpreted as
the probability that the process will continue after the next transition.
Matrix αP governs transitions among transient states before the process
is absorbed. The matrix (I − αP)^(−1) is the fundamental matrix of the absorbing
unichain. Hence, the alternate matrix form of the VDEs (4.253) for a discounted
MCR,

$$
v = (I - \alpha P)^{-1} q, \tag{4.253}
$$


indicates that the vector of expected total discounted rewards can be obtained
by multiplying the fundamental matrix by the reward vector.
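
The absorbing-chain interpretation can be illustrated numerically. The following is a minimal sketch in Python, assuming NumPy and a discount factor of α = 0.9; the names `Y`, `F`, and `expected_periods` are illustrative. It builds the matrix Y of Equation (4.277), confirms that its rows sum to one, and computes the fundamental matrix. Because every row of αP sums to α, each row of the fundamental matrix sums to 1/(1 − α), which is the expected number of periods (counting the starting period) that the process spends in the transient states before it stops.

```python
import numpy as np

P = np.array([
    [0.60, 0.30, 0.10, 0.00],
    [0.25, 0.30, 0.35, 0.10],
    [0.05, 0.25, 0.50, 0.20],
    [0.00, 0.10, 0.30, 0.60],
])
q = np.array([-20.0, 5.0, -5.0, 25.0])
alpha = 0.9
N = P.shape[0]

# Augmented transition matrix Y: state 0 is absorbing, states 1..N are transient.
Y = np.zeros((N + 1, N + 1))
Y[0, 0] = 1.0
Y[1:, 0] = 1.0 - alpha      # probability of stopping on the next transition
Y[1:, 1:] = alpha * P       # probability of continuing, split according to P

# Fundamental matrix of the absorbing chain and the resulting reward vector.
F = np.linalg.inv(np.eye(N) - alpha * P)
v = F @ q

# Expected number of periods spent in transient states, by starting state.
expected_periods = F.sum(axis=1)

print(Y.sum(axis=1))        # every row of Y sums to 1
print(expected_periods)     # each entry equals 1 / (1 - alpha)
print(np.round(v, 4))
```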

4.3.3.1.5 VDEs for an Eight-State Discounted Multichain MCR Model of a Production Process

To illustrate the solution of the VDEs for a multichain MCR model with dis-
counting, consider the eight-state sequential production process for which
the transition matrix and reward vector without discounting are shown in
Equation (4.161). Assume in this case the discount factor is α = 0.9. The matrix
αP and the cost vector q are shown below, with rows and columns indexed by
states 1 through 8:

$$
\alpha P = \begin{bmatrix}
0.9 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0.9 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0.45 & 0.27 & 0.18 & 0 & 0 & 0 \\
0 & 0 & 0.27 & 0.405 & 0.225 & 0 & 0 & 0 \\
0 & 0 & 0.09 & 0.315 & 0.495 & 0 & 0 & 0 \\
0.18 & 0.144 & 0.036 & 0.027 & 0.018 & 0.495 & 0 & 0 \\
0.135 & 0 & 0 & 0 & 0 & 0.18 & 0.585 & 0 \\
0.09 & 0 & 0 & 0 & 0 & 0 & 0.135 & 0.675
\end{bmatrix},
\quad
q = \begin{bmatrix} 156 \\ 56 \\ 160 \\ 123 \\ 291.5 \\ 450 \\ 400 \\ 420 \end{bmatrix}.
\tag{4.278}
$$

In matrix form the VDEs (4.255) are

$$
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \\ v_5 \\ v_6 \\ v_7 \\ v_8 \end{bmatrix}
=
\begin{bmatrix} 156 \\ 56 \\ 160 \\ 123 \\ 291.5 \\ 450 \\ 400 \\ 420 \end{bmatrix}
+
\begin{bmatrix}
0.900 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0.900 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0.450 & 0.270 & 0.180 & 0 & 0 & 0 \\
0 & 0 & 0.270 & 0.405 & 0.225 & 0 & 0 & 0 \\
0 & 0 & 0.090 & 0.315 & 0.495 & 0 & 0 & 0 \\
0.180 & 0.144 & 0.036 & 0.027 & 0.018 & 0.495 & 0 & 0 \\
0.135 & 0 & 0 & 0 & 0 & 0.180 & 0.585 & 0 \\
0.090 & 0 & 0 & 0 & 0 & 0 & 0.135 & 0.675
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \\ v_5 \\ v_6 \\ v_7 \\ v_8 \end{bmatrix}.
\tag{4.279}
$$


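The solution for the vector of expected total discounted costs is

$$
\begin{bmatrix} v_1 & v_2 & v_3 & v_4 & v_5 & v_6 & v_7 & v_8 \end{bmatrix}^T
=
\begin{bmatrix} 1{,}560 & 560 & 1{,}852.83 & 1{,}819.93 & 2{,}042.64 & 1{,}909 & 2{,}299.33 & 2{,}679.41 \end{bmatrix}^T.
\tag{4.280}
$$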

The solution vector shows, for example, that if this discounted multichain
MCR model of a serial production process operates over an infinite planning
horizon, the expected total discounted cost will be $1,560 if it starts in state 1,
and $2,679.41 if it starts in state 8.
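
The eight-state system (4.279) can be solved in the same way as the four-state system. The following is a minimal sketch in Python, assuming NumPy; the array name `alphaP` is illustrative and holds the matrix αP of Equation (4.278) entered directly.

```python
import numpy as np

# Matrix alpha * P (alpha = 0.9) and cost vector q from Equation (4.278).
alphaP = np.array([
    [0.900, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000],
    [0.000, 0.900, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000],
    [0.000, 0.000, 0.450, 0.270, 0.180, 0.000, 0.000, 0.000],
    [0.000, 0.000, 0.270, 0.405, 0.225, 0.000, 0.000, 0.000],
    [0.000, 0.000, 0.090, 0.315, 0.495, 0.000, 0.000, 0.000],
    [0.180, 0.144, 0.036, 0.027, 0.018, 0.495, 0.000, 0.000],
    [0.135, 0.000, 0.000, 0.000, 0.000, 0.180, 0.585, 0.000],
    [0.090, 0.000, 0.000, 0.000, 0.000, 0.000, 0.135, 0.675],
])
q = np.array([156.0, 56.0, 160.0, 123.0, 291.5, 450.0, 400.0, 420.0])

# Every row of alpha * P sums to alpha = 0.9, so I - alpha * P is nonsingular.
v = np.linalg.solve(np.eye(8) - alphaP, q)
print(np.round(v, 2))
```

Because states 1 and 2 are absorbing, their values reduce to q1/(1 − α) and q2/(1 − α), which gives a quick sanity check on the first two entries.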

4.3.3.2 Value Iteration for Expected Total Discounted Rewards


The second approach that can be taken to calculate the expected total dis-
counted rewards received by an MCR over an infinite planning horizon is
to execute value iteration over a large number of periods. Value iteration
requires less computational effort than solving a system of VDEs, but gives
only an approximate solution [7].
Value iteration is conducted over an infinite planning horizon by repeat-
edly adding one additional period at a time to lengthen the horizon, follow-
ing the procedure described in Section 4.2.2.3. The value iteration equation
is solved over each most recently lengthened horizon until a stopping condi-
tion has been satisfied. The beginning of each new period, which is added
at the beginning of the horizon, is designated by a consecutively numbered
negative epoch. In Section 4.2.3.3, this procedure was applied to develop a
value iteration algorithm for an undiscounted recurrent MCR.

4.3.3.2.1 Stopping Rule for Value Iteration


As Section 4.2.3.3.1 indicates, when a value iteration algorithm stops, the plan-
ning horizon contains n periods. The epochs are numbered sequentially as
−n, −n+1, −n+2, … , −2, −1, 0. Epoch (−n) denotes the beginning of the horizon.
Epoch 0 denotes the end. As Section 4.2.3.3.1 indicates, the absolute value of
each epoch represents the number of periods remaining. When a stopping
condition is satisfied, let T = |−n| = n, so that T denotes the length of the hori-
zon at which convergence of a value iteration algorithm is achieved.
Suppose that all the epochs are renumbered by adding T to the numerical
index of each epoch, so that the planning horizon begins at epoch 0 and ends
at epoch T. As T grows large,

$$
\lim_{T \to \infty}[v_i(0) - v_i(1)] = \lim_{T \to \infty} v_i(0) - \lim_{T \to \infty} v_i(1) = v_i - v_i = 0, \tag{4.281}
$$

using Equations (4.261) and (4.262). The eventual convergence of the expected
total discounted rewards received over a finite horizon to the expected total


discounted rewards received over an infinite horizon suggests the following
rule for ending value iteration. Value iteration will stop when the maximum
absolute difference, over all states, between the expected total
discounted rewards for the current planning horizon and a planning horizon
one period shorter is less than a small positive number, ε, specified by
the analyst. That is, value iteration can be stopped after a planning horizon
of length T for which

$$
\max_{i = 1, \ldots, N} \left| v_i(0) - v_i(1) \right| < \varepsilon. \tag{4.282}
$$

Of course, the magnitude of ε is a matter of judgment, and the number of
periods in T cannot be predicted.

4.3.3.2.2 Value Iteration Algorithm


The foregoing discussion is the basis for the following value iteration algo-
rithm in its simplest form (adapted from Puterman [7], p. 161) that can be
applied to a discounted MCR over an infinite horizon. Epochs are numbered
as consecutive negative integers from −n at the beginning of the horizon to
0 at the end.
Step 1. Select arbitrary salvage values for v_i(0), for i = 1, 2, ..., N. For
simplicity, set v_i(0) = 0. Specify ε > 0. Set n = −1.

Step 2. For each state i, use the value iteration equation to compute

$$
v_i(n) = q_i + \alpha \sum_{j=1}^{N} p_{ij} v_j(n + 1), \quad \text{for } i = 1, 2, \ldots, N.
$$

Step 3. If $\max_{i = 1, 2, \ldots, N} |v_i(n) - v_i(n + 1)| < \varepsilon$, go to step 4. Otherwise,
decrement n by 1 and return to step 2.

Step 4. Stop.
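
The four steps above can be sketched in Python as follows, assuming NumPy. The function name `value_iteration` and the safety cap `max_iter` are assumptions added for illustration and are not part of the algorithm as stated; the two-state data are those of Equation (4.274) with α = 0.9.

```python
import numpy as np


def value_iteration(P, q, alpha, eps, max_iter=100_000):
    """Value iteration for a discounted MCR; returns (v, number of periods)."""
    P = np.asarray(P, dtype=float)
    q = np.asarray(q, dtype=float)
    v_next = np.zeros_like(q)                 # Step 1: salvage values v_i(0) = 0
    for periods in range(1, max_iter + 1):
        v = q + alpha * P @ v_next            # Step 2: value iteration equation
        if np.max(np.abs(v - v_next)) < eps:  # Step 3: stopping rule (4.282)
            return v, periods                 # Step 4: stop
        v_next = v                            # lengthen the horizon by one period
    raise RuntimeError("stopping condition not met within max_iter periods")


# Two-state example: p11 = 0.2, p12 = 0.8, p21 = 0.6, p22 = 0.4.
P2 = np.array([[0.2, 0.8], [0.6, 0.4]])
q2 = np.array([-300.0, 200.0])
v_approx, n_periods = value_iteration(P2, q2, alpha=0.9, eps=1e-8)
v_exact = np.linalg.solve(np.eye(2) - 0.9 * P2, q2)
print(n_periods, v_approx, v_exact)
```

Comparing `v_approx` with the solution of the VDEs shows how closely the iterates approach the exact values for a given ε.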

4.3.3.2.3 Solution by Value Iteration of Discounted MCR Model of Monthly Sales

In Section 4.3.2.2, value iteration was applied to a four-state discounted MCR
model of monthly sales over a three-period planning horizon, which was
lengthened to four periods. By applying value iteration to a planning horizon
lengthened to seven periods, Table 4.15 is constructed to show the expected
total discounted rewards for monthly sales during the last 7 months of an
infinite horizon. The last eight epochs of an infinite horizon are numbered
sequentially in Table 4.15 as −7, −6, −5, −4, −3, −2, −1, and 0. Epoch 0 denotes
the end of the horizon.


TABLE 4.15
Expected Total Discounted Rewards for Monthly Sales Calculated by Value
Iteration During the Last 7 Months of an Infinite Planning Horizon

Epoch n      −7         −6         −5         −4         −3        −2       −1     0
v1(n)     −41.2574   −41.1224   −40.4607   −38.8699   −35.6915   −29.9     −20     0
v2(n)       2.5647     2.0488     1.6153     1.3766     1.5429     2.525     5     0
v3(n)       5.3476     4.3948     3.2247     1.7484    −0.1456    −2.525    −5     0
v4(n)      55.6493    54.2191    52.2278    49.3183    44.8495    37.6      25     0

TABLE 4.16
Differences Between the Expected Total Discounted Rewards Earned Over
Planning Horizons, Which Differ in Length by One Period (entries are the
signed differences vi(n) − vi(n + 1); the suffix U marks the maximum absolute
difference at each epoch)

Epoch n                      −7        −6        −5        −4        −3        −2       −1
State i
1                         −0.135    −0.6617   −1.5908   −3.1784   −5.7915   −9.9     −20
2                          0.5159    0.4335    0.2387   −0.1663   −0.9821   −2.475     5
3                          0.9528    1.1701    1.4763    1.894     2.3794    2.475    −5
4                          1.4302U   1.9913U   2.9095U   4.4688U   7.2495U  12.6U     25U
Max |vi(n) − vi(n + 1)|    1.4302    1.9913    2.9095    4.4688    7.2495   12.6      25

Table 4.16 gives the differences between the expected total discounted
rewards earned over planning horizons, which differ in length by one period.
The entries for the individual states are signed; the bottom row lists the
maximum absolute difference at each epoch. Note that the maximum absolute
differences become progressively smaller for each epoch added to the
planning horizon.
In Table 4.16, a suffix U identifies the maximum absolute difference for each
epoch. The bottom row of Table 4.16 shows that if an analyst chooses an
ε < 1.4302, then more than seven repetitions of value iteration will be needed
before the algorithm can be assumed to have converged.
Table 4.15 shows that the approximate expected total discounted rewards,

v1 = −41.2574, v2 = 2.5647, v3 = 5.3476, v4 = 55.6493,

obtained after seven repetitions of value iteration are significantly different
from the actual expected total discounted rewards,


v1 = −35.7733, v2 = 8.9917, v3 = 12.4060, and v4 = 63.3889,

obtained by solving the VDEs in Section 4.3.3.1.3. Thus, value iteration for
the discounted MCR model of monthly sales has not converged after seven
repetitions to the actual expected total discounted rewards.
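
The entries of Table 4.15 and the bottom row of Table 4.16 can be regenerated with a short script. The following is a minimal sketch in Python, assuming NumPy; the names `history`, `max_diffs`, and `exact` are illustrative. `history[k]` holds the vector v(−k), and `exact` is the solution of the VDEs for comparison.

```python
import numpy as np

P = np.array([
    [0.60, 0.30, 0.10, 0.00],
    [0.25, 0.30, 0.35, 0.10],
    [0.05, 0.25, 0.50, 0.20],
    [0.00, 0.10, 0.30, 0.60],
])
q = np.array([-20.0, 5.0, -5.0, 25.0])
alpha = 0.9

# history[k] is v(-k): start from zero salvage values at epoch 0.
history = [np.zeros(4)]
for k in range(1, 8):
    history.append(q + alpha * P @ history[-1])

# Maximum absolute difference between successive horizons, epochs -1 to -7.
max_diffs = [np.max(np.abs(history[k] - history[k - 1])) for k in range(1, 8)]

# Exact expected total discounted rewards from the VDEs, for comparison.
exact = np.linalg.solve(np.eye(4) - alpha * P, q)

for k in range(1, 8):
    print(f"n = -{k}: v = {np.round(history[k], 4)}, max diff = {max_diffs[k - 1]:.4f}")
print("gap after 7 periods:", np.round(np.abs(history[7] - exact), 4))
```

The final line makes the lack of convergence after seven periods explicit by printing the remaining gap in each state.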

PROBLEMS
4.1 This problem adds daily revenue to the Markov chain model
of Problems 1.1 and 2.1. Since a repair takes 1 day to complete,
no revenue is earned on the day during which maintenance is
performed. The daily revenue earned in every state is shown
below:

State   Condition                              Daily Revenue
1       Not Working, under repair (NW)         $0
2       Working, with a Major Defect (MD)      $200
3       Working, with a Minor Defect (mD)      $400
4       Working Properly (WP)                  $600

The transition probability matrix P and the daily reward vector q
for the recurrent MCR are given below:

                                  P (to state)
State    Maintenance Action    1     2     3     4      Reward q
1, NW    Under Repair          0.1   0.2   0.2   0.5    $0 = q1
2, MD    Do Nothing            0.3   0.7   0     0      $200 = q2
3, mD    Do Nothing            0.2   0.5   0.3   0      $400 = q3
4, WP    Do Nothing            0.1   0.2   0.3   0.4    $600 = q4

(a) Given that the machine begins day one with an equal prob-
ability of starting in any state, what is the expected total rev-
enue that will be earned after 3 days?
(b) Find the expected average reward, or gain.
(c) Find the expected total discounted reward vector using a
discount factor of α = 0.9.
4.2 A married couple owns a consulting firm. The firm’s weekly
income is a random variable, which may equal $10,000,
$15,000, or $20,000. The income earned next week depends
only on the income earned this week. In any week, the married
couple may sell the firm to either their adult son or their
adult daughter. If the income this week is $10,000, the couple
may sell the firm next week to their son with probability 0.05
or to their daughter with probability 0.10. If the income this
week is $10,000 and they do not sell, the income next week
will be either $10,000 with probability 0.40, $15,000 with
probability 0.25, or $20,000 with probability 0.20. If the income this
week is $15,000, then next week there is a 0.15 probability that
the couple will sell the firm to their son, and a 0.20 probability
that they will sell the firm to their daughter. If the income this
week is $15,000 and they do not sell, the income next week
will be either $10,000 with probability 0.20, $15,000 with
probability 0.35, or $20,000 with probability 0.10. Finally, if the
income this week is $20,000, the couple may sell the firm next
week to their son with probability 0.15 or to their daughter
with probability 0.10. If the income this week is $20,000 and
they do not sell, the income next week will be either $10,000
with probability 0.25, $15,000 with probability 0.20, or $20,000
with probability 0.30.
(a) Formulate this problem as an absorbing multichain MCR.
Represent the transition probability matrix in canonical
form. Construct the reward vector.
(b) Given that the firm’s income in the first week is $10,000, what
is the probability that the firm will eventually be sold to the
son?
(c) Given that the firm’s income in the first week is $20,000, what
is the probability that the firm will eventually be sold to the
daughter?
(d) Given that the firm’s income in the first week is $15,000, what
is the expected number of weeks before the firm is sold to
the daughter?
(e) Given that the firm’s income in the first week is $15,000, what
is the expected total income that will be earned before the
firm is sold?
4.3 This problem adds cost data to the Markov chain model of
Problems 1.9 and 2.5. Suppose that the cost to inspect and test
an IC varies inversely with its age as shown below:

Age i of an IC in years          0      1        2        3
Cost to inspect and test an IC   $57    $34.29   $23.33   $10

The cost to replace an IC that has failed is $10. The cost in
dollars of buying and operating an IC of age i is denoted by qi.
(a) Construct a cost vector for the IC replacement problem.
(b) Model the IC replacement problem as a recurrent MCR by
adding the cost vector to the transition probability matrix,
which was constructed for Problem 2.5.
(c) Calculate the average cost per year of operation and
replacement.
(d) If operation starts with a 1-year-old component, calculate
the expected total cost of operation and replacement rela-
tive to the expected total cost of starting with a 3-year-old
component.
(e) If operation starts with a 2-year-old component, calculate the
expected total cost of operation before the component fails
for the first time.

(f) If operation starts with a 1-year-old component, calculate
the expected total discounted cost of operation and replacement,
for a discount factor of α = 0.9.
(g) If operation starts with a 1-year-old component, calculate
the expected total cost of operation after 3 years. Do not use
value iteration.
(h) If operation starts with a 1-year-old component, calculate
the expected total cost of operation after 3 years. Use value
iteration.
4.4 This problem refers to Problem 3.3.
(a) Given that the woman starts with $2,000, and will bet $1,000
each time, find the expected amount of money that she will
have after two bets. Do not use value iteration.
(b) Given that the woman starts with $2,000, and will bet $1,000
each time, use value iteration to find the expected amount of
money that she will have after two bets.
(c) Given that the woman starts with $2,000, and will bet $1,000
each time, find the expected amount of money that she will
have when the game ends.
4.5 This problem refers to Problem 3.6. The following revenue and
costs are specified for the states of the production process. A
cost of $60 is incurred to scrap a defective item. A cost of $20 is
incurred to sell an acceptable item. An acceptable item is sold
for $1,000. The cost per item of input material to stage 1 is $100.
Each time an item is processed and inspected at stage 1, the
cost is $300. The corresponding costs per item processed and
inspected at stages 2 and 3 are $400 and $500, respectively.
(a) Model this production process as a five-state absorbing mul-
tichain MCR. Construct the transition probability matrix
and the reward vector.
(b) Find the expected total cost of an item sold.
(c) Find the expected total cost of an item before it is scrapped
or sold, given that an entering item starts in the first manu-
facturing stage.
(d) Find the expected total cost of an item after two inspections,
given that an entering item starts in the first manufacturing
stage.
4.6 In Problem 3.7, the following revenue and cost data are speci-
fied. If the mission is aborted, a cost of $6,000,000 is incurred for
the loss of the satellite payload. When the rocket is launched,
the cost of fuel needed to reach HEO is $2,000,000. The revenue
earned by delivering the satellite to HEO is $4,000,000. If the sat-
ellite is delivered instead to LEO, the revenue will be reduced to
$1,000,000. The cost of fuel for each minor course correction is
$200,000, and is $700,000 for each major course correction.
(a) Model this rocket launch as a five-state absorbing multi-
chain MCR. Construct the transition probability matrix and
the reward vector.
(b) If, upon launch, the rocket is observed to start in state 3, find
the expected total revenue earned by launching the rocket.
(c) If, upon launch, the rocket is observed to start in state 3, find
the expected total revenue earned by launching the rocket
into LEO.
(d) If, upon launch, the rocket is observed to start in state 3, find
the expected total revenue earned by the rocket launch after
two course corrections.
4.7 In Problem 3.10, suppose that the investor receives a monthly
dividend of d dollars per share. No dividend is received when
the stock is sold. The investor wants to compare the follow-
ing two alternative policies for deciding when to sell the stock.
Policy 1 is to sell the stock at the end of the first month in which
the share price rises to $20 or falls to $0. Policy 2 is to sell the
stock at the end of the first month in which the share price
rises to $15 or falls to $5. If the investor bought the stock for
$10, find the monthly dividend d for which both policies will
produce equal expected total rewards when the investor sells
the stock.
4.8 A dam is used for generating electric power and for irrigation.
The dam has a capacity of 4 units of water. Assume that the
volume of water stored in the dam is always an integer. At the
beginning of every week, a volume of water denoted by Wn
flows into the dam. The stationary probability distribution of
Wn, which is an independent, identically distributed, and integer
random variable, is given below.

Volume Wn of water flowing into dam at start of week n, Wn = k    0     1     2     3
Probability, P(Wn = k)                                            0.2   0.3   0.4   0.1

The dam has the following policy for releasing water at the
beginning of every week. If the volume of water stored in the
dam plus the volume flowing into the dam at the beginning
of the week exceeds 2 units, then 2 units of water are released.
The first unit of water released is used to generate electricity,
which is sold for $5. The second unit released is used for irriga-
tion, which earns $4. If the volume of water stored in the dam
plus the volume flowing into the dam at the beginning of the
week equals 2 units, then only 1 unit of water is released. The
1 unit of water released is used to generate electricity, which
is sold for $5. No water is released for irrigation. The volume
of water stored in the dam is normally never allowed to drop
below 1 unit to provide a reserve in the event of a natural dis-
aster. Hence, if the volume of water stored in the dam plus the
volume flowing into the dam at the beginning of the week is
less than 2 units, no water is released. If the volume of water
stored in the dam plus the volume flowing into the dam at the
beginning of the week exceeds 6 units, then 2 units of water are
released to generate electricity and for irrigation. In addition,
surplus water is released through the spillway and lost, causing
flood damage at a cost of $3 per unit.

The volume of water in the dam can be modeled as a Markov
chain. Let the state Xn be the volume of water in the dam at the
beginning of week n.
(a) Assuming that this release policy is followed, model the vol-
ume of water in the dam and the associated rewards as a
four-state recurrent MCR. Construct the transition probabil-
ity matrix and the reward vector.
(b) Execute value iteration to find the vector of expected total
rewards that will be earned in every state after 3 weeks.
Assume zero terminal rewards at the end of week 3.
(c) Find the expected average reward, or gain.
(d) Find the vector of expected total discounted rewards using a
discount factor of α = 0.9.
4.9 At 9:00 AM every day an independent financial advisor will
see a client who is concerned about retirement. A client has one
of the following three types of concerns: creating a new retire-
ment plan, managing investments under an existing retirement
plan, or generating income during retirement. The retirement
concerns of the clients form a Markov chain. The states are the
clients’ three retirement concerns. Let the state Xn denote the
retirement concern of the client who is seen on day n. The finan-
cial advisor’s historical records indicate that if today’s client is
interested in creating a new retirement plan, the probabilities
that tomorrow’s client will be interested in managing invest-
ments under an existing retirement plan, or generating income
during retirement, are 0.2 and 0.5, respectively. If a client today
is interested in managing investments under an existing retire-
ment plan, the probabilities that a client tomorrow will be inter-
ested in creating a new retirement plan, or generating income
during retirement, are 0.4 and 0.1, respectively. Finally, if today’s
client is interested in generating income during retirement, the
probabilities that tomorrow’s client will be interested in cre-
ating a new retirement plan, or managing investments under
an existing retirement plan, are 0.3 and 0.6, respectively. The
financial advisor charges the following fixed fees each time
that a client visits. The fees for creating a new retirement plan,
managing investments under an existing retirement plan, and
generating income during retirement are $500, $300, and $400,
respectively.
(a) Model the financial advisor’s daily 9:00 AM meetings with
clients as a three-state recurrent MCR. Construct the transi-
tion probability matrix and the reward vector.
(b) Execute value iteration to find the vector of expected total
rewards that the financial advisor will earn from her retire-
ment plan clients after 3 days. Assume zero terminal rewards
at the end of day 3.
(c) Find the financial advisor’s expected average reward, or
gain.
(d) Find the financial advisor’s vector of expected total dis-
counted rewards using a discount factor of α = 0.9.

4.10 Suppose that in a certain state, every licensed professional
engineer (P.E.) is required to pass an annual proficiency test in
the practice of engineering in order to renew her license. A P.E.
who passes the test has her license renewed for 1 year. A P.E.
who fails the test has her license suspended. If a person whose
license is suspended passes the test after the first or second
year of suspension, her license is restored. However, if a person
whose license is suspended fails the test after the second year
of suspension, her license is revoked.
Assume that the status of an engineer’s license can be mod-
eled as a four-state absorbing unichain. Let the state Xn be the
status of an engineer’s license after the nth annual proficiency
test. The four states are identified below:

State   Status of an engineer’s license
0       Renewed
1       In first year of suspension
2       In second year of suspension
3       Revoked

Suppose that a licensed P.E. has a 0.85 probability of passing
the test, and earns $200,000 per year as a licensed engineer.
An engineer with a suspended P.E. license is employed as an
associate engineer at an annual salary of $120,000. An engineer
whose license is suspended has a 0.65 probability of passing
the test on her first try, and a 0.45 probability of passing
the test on her second try. An engineer without a P.E. license
is employed as an assistant engineer at an annual salary of
$80,000.
(a) Model the licensure status of an engineer as a four-state
absorbing unichain MCR. Construct the transition probabil-
ity matrix and the reward vector.
(b) Find the expected total reward that a licensed P.E. will earn
after 3 years.
(c) Find the expected total reward that a licensed P.E. will earn
until her license is revoked.
Suppose that an engineer can increase her likelihood of
passing the annual proficiency test by enrolling in courses of
continuing professional development (CPD). The tuition for
such courses is $1,000 per hour. The recommended annual
course loads are 5 h for a licensee, 10 h for an engineer beginning
her first year of suspension, and 15 h for an engineer
beginning her second year of suspension. By following the
recommended annual course loads, an engineer can add 0.10
to each of her probabilities of passing the annual proficiency
test.

(d) If an engineer always follows the recommended annual
course loads, model the licensure status of an engineer as a
four-state absorbing unichain MCR.
(e) Find the expected total reward that a licensed P.E. who
always follows the recommended annual course loads will
earn until her license is revoked.
4.11 Every December 15, a certain company offers its employees a
choice of one of the following three health care plans for the
new year beginning on January 1:

Health Plan   Description
P.P.O.        Preferred provider organization
H.M.O.        Health maintenance organization
H.D.P.        High-deductible plan

An employee may keep her current plan or switch to a different
plan. During the past several years the Human Resources
Department (HRD) has compiled the following data on the
numbers of employees who have switched health care plans:

From Plan\To Plan   P.P.O.   H.M.O.   H.D.P.   Total
P.P.O.              170      24       6        200
H.M.O.              45       225      30       300
H.D.P.              4        16       80       100
Total               219      265      116      600

The HRD believes that the choice of a health care plan by an
employee can be modeled as a three-state recurrent Markov
chain. They let the state Xn be the health care plan selected at
the end of year n. The three states are identified below:

State   Health care plan selected
0       P.P.O.
1       H.M.O.
2       H.D.P.

The HRD has estimated that the company incurs the following
administrative costs for each employee who switches her health
care plan.

From Plan\To Plan   P.P.O.   H.M.O.   H.D.P.
P.P.O.              $0       $30      $25
H.M.O.              $20      $0       $24
H.D.P.              $32      $26      $0

(a) Model the choice of a health care plan by an employee as
a three-state recurrent MCR. Construct the transition probability
matrix and the cost vector.
(b) Execute value iteration to find the vector of expected total
costs that the company will incur for every health care plan
after 3 years. Assume zero terminal rewards at the end of
year 3.
(c) Find the company’s expected average cost, or negative gain.
(d) Find the company’s vector of expected total discounted costs
using a discount factor of α = 0.9.
Suppose that the HRD has proposed that the company try
to reduce its administrative costs by offering all employees
a cash incentive of $D to stay with their current health care
plan rather than switch plans. The goal of the HRD proposal
is to increase by 0.06 the probability that an employee will
stay with her current plan, and to decrease by 0.03 the prob-
ability that she will switch to either of the other two plans.
(e) Assume that the desired modified probabilities of switching
health care plans can be achieved by offering cash incentives
of $D to each employee. Find the value of $D for which the
expected total costs of the original policy and the policy pro-
posed by the HRD are the same.

References
1. Bhat, N., Elements of Applied Stochastic Processes, 2nd ed., Wiley, New York,
1985.
2. Feldman, R. M. and Valdez-Flores, C., Applied Probability & Stochastic Processes,
PWS, Boston, MA, 1996.
3. Hillier, F. S. and Lieberman, G. J., Introduction to Operations Research, 8th ed.,
McGraw-Hill, New York, 2005.
4. Howard, R. A., Dynamic Programming and Markov Processes, M.I.T. Press,
Cambridge, MA, 1960.
5. Kemeny, J. G., Schleifer, Jr., A., Snell, J. L., and Thompson, G. L., Finite Mathematics
with Business Applications, 2nd ed., Prentice-Hall, Englewood Cliffs, NJ, 1972.
6. Jensen, P. A. and Bard, J. F., Operations Research: Models and Methods, Wiley,
New York, 2003.
7. Puterman, M. L., Markov Decision Processes: Discrete Stochastic Dynamic
Programming, Wiley, New York, 1994.
8. Shamblin, J. E. and Stevens, G. T., Operations Research: A Fundamental Approach,
McGraw-Hill, New York, 1974.



5
A Markov Decision Process (MDP)

A Markov decision process (MDP) is a sequential decision process for which
the decisions produce a sequence of Markov chains with rewards or MCRs.
As the introduction to Chapter 4 indicates, when decisions are added to a
set of MCRs, the augmented system is called an MDP. An MDP generates a
sequence of states and an associated sequence of rewards as it evolves over
time from state to state, governed by both its transition probabilities and the
series of decisions made. A rule that prescribes a set of decisions for all states
is called a policy. When the planning horizon is infinite, a policy is assumed
to be independent of time, or stationary. When the planning horizon is infi-
nite, the objective of an MDP model is to determine a stationary policy that is
optimal in the sense that it will either maximize the gain, or expected reward
per period, or maximize the expected total discounted reward received in
every state. As in Chapter 4, both transition probabilities and rewards are
assumed to be stationary.
An MCR can be viewed as a building block for an MDP. Therefore, many
of the definitions, equations, and algorithms developed in Chapter 4 for
an MCR will be modified in Chapter 5 to enable them to be applied, in an
extended form, to an MDP. All MDPs are assumed to have a finite number
of states.

5.1 An Undiscounted MDP


When cash flows for an MDP are not discounted, a Markov decision process
is called an undiscounted MDP.

5.1.1 MDP Chain Structure


In Section 4.2.1, an MCR is classified as unichain, or multichain. Similarly, an
MDP can also be termed unichain, or multichain. An MDP is unichain if the
transition matrix associated with every stationary policy consists of a recur-
rent chain (which is a single closed communicating class of recurrent states)
plus a possibly empty set of transient states. A recurrent MDP is a special
case of a unichain MDP with no transient states. Thus, an MDP is recurrent
if the transition matrix associated with every stationary policy is irreducible
or recurrent. An MDP is multichain if at least one transition matrix associated
with a stationary policy consists of two or more recurrent chains plus a
possibly empty set of transient states.
In this chapter, all unichain MDPs and all multichain MDPs are assumed
to have one or more transient states, so that they are reducible, although the
latter term is omitted. Hence, an MDP will be classified as either recurrent,
unichain, or multichain. Chain structure affects the algorithm used to find
an optimal policy when rewards are not discounted [2, 3].
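
The classification above depends on whether each policy's transition matrix is irreducible. The following is a minimal sketch, not a procedure taken from this book, of how irreducibility of a single transition matrix might be checked by graph reachability. The function names `reachable` and `is_irreducible` are hypothetical, the first test matrix is the monthly sales matrix that appears in Equation (5.4), and the second matrix is an invented two-state example with an absorbing state. To classify an MDP as recurrent, the check would have to be repeated for the matrix of every stationary policy.

```python
from collections import deque

def reachable(P, start):
    """Return the set of states reachable from `start` through positive-probability transitions."""
    seen = {start}
    queue = deque([start])
    while queue:
        i = queue.popleft()
        for j, p in enumerate(P[i]):
            if p > 0 and j not in seen:
                seen.add(j)
                queue.append(j)
    return seen

def is_irreducible(P):
    """A finite chain is irreducible if every state can reach every other state."""
    n = len(P)
    return all(len(reachable(P, s)) == n for s in range(n))

# Monthly sales matrix from Equation (5.4); states are indexed 0..3 here.
P_sales = [[0.60, 0.30, 0.10, 0.00],
           [0.25, 0.30, 0.35, 0.10],
           [0.05, 0.25, 0.50, 0.20],
           [0.00, 0.10, 0.30, 0.60]]

# Invented example: state 0 is absorbing, so state 1 is transient.
P_absorbing = [[1.0, 0.0],
               [0.5, 0.5]]

print(is_irreducible(P_sales))
print(is_irreducible(P_absorbing))
```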

5.1.2 A Recurrent MDP


A recurrent MDP will be analyzed first over a finite planning horizon, and
next over an infinite horizon.

5.1.2.1 A Recurrent MDP Model of Monthly Sales


A recurrent MCR model of monthly sales was introduced in Table 4.1 of
Section 4.2.2.1. To transform this model into an MDP model, the MCR model
will be augmented by adding decisions in every state. Each decision in every
state will be associated with a set of transition probabilities and a reward.
Recall that a firm tracks its monthly sales, which have fluctuated widely over
many months. Monthly sales are ranked with respect to those of the firm’s
competitors. The rankings are expressed in quartiles. The fourth quartile is
the highest ranking, while the first quartile is the lowest. The sequence of
monthly sales is believed to form a Markov chain. The state Xn−1 denotes the
quartile rank of monthly sales at the beginning of month n. The chain has
four states, which correspond to the four quartiles. Estimates of the transi-
tion probabilities are based on a historical record of monthly sales. At the
beginning of each month, in contrast to the MCR model of Section 4.2.2.1,
the firm must now select one of several alternative actions in every state.
The alternative selected is called the decision for that state. Table 5.1 identi-
fies the four states accompanied by their decision alternatives and the asso-
ciated rewards.
Note, for example, that when monthly sales are in the third quartile,
state 3, the firm will either make decision 1 to design more appealing prod-
ucts for a reward of –$10,000 (which is a cost of $10,000), or make decision
2 to invest in new technology for a reward of –$5,000 (which is a cost of
$5,000). Observe that the firm can consider three decision alternatives in
state 1, and can choose between two decision alternatives in each of the
other three states.

TABLE 5.1
States, Decisions, and Rewards for Monthly Sales
Monthly Sales Quartile   State   Decision   Action                           Reward
First (Lowest)           1       1          Sell noncore assets              −$30,000
                                 2          Take firm private                −$25,000
                                 3          Offer employee buyouts           −$20,000
Second                   2       1          Reduce executive salaries        $5,000
                                 2          Reduce employee benefits         $10,000
Third                    3       1          Design more appealing products   −$10,000
                                 2          Invest in new technology         −$5,000
Fourth (Highest)         4       1          Invest in new projects           $35,000
                                 2          Make strategic acquisitions      $25,000

Over a finite horizon, epoch n denotes the end of month n. A policy prescribes
the decision to be made in every state at every epoch. The selection
of a policy identifies which transition probabilities will govern the behavior
of the system as it evolves over time from state to state. For an N-state
process, a policy at epoch n can be specified by the following N × 1 column
vector:

\[
d(n) =
\begin{matrix} 1 \\ 2 \\ \vdots \\ N \end{matrix}
\begin{bmatrix} d_1(n) \\ d_2(n) \\ \vdots \\ d_N(n) \end{bmatrix}
= [d_1(n) \;\; d_2(n) \;\; \cdots \;\; d_N(n)]^T,
\tag{5.1, 5.2}
\]

called a decision vector. The elements of vector d(n) indicate which decision
is made in every state at epoch n. Thus, the element di(n) = k indicates that in
state i at epoch n, decision k is made.
Each decision made in a state has an associated reward and probability
distribution for transitions out of that state. A superscript k is used to designate
the decision in a state. Thus, $p_{ij}^k$ indicates that the transition probability
from state i to state j is determined by the decision k in state i at epoch n.
That is,

\[
p_{ij}^k = P(X_{n+1} = j \mid X_n = i \cap d_i(n) = k). \tag{5.3}
\]

Similarly, $q_i^k$ indicates that the reward earned by a transition in state i is
determined by decision k. Suppose that at epoch n an MDP is in state i, so
that $X_n = i$. At epoch n, decision $d_i(n) = k$ is made, and a reward $q_i^k$ is received.
The MDP will move with transition probability $p_{ij}^k$ to state j at epoch n + 1, so
that $X_{n+1} = j$. For an MDP starting in state g, a sequence of states visited, the
decisions made, the rewards earned, and the state transition probabilities are
shown in Figure 5.1.

Epoch                    0                     1                    2                    ...   n                        n + 1
State                    X_0 = g               X_1 = h              X_2 = u              ...   X_n = i                  X_{n+1} = j
Decision                 d_g(0) = a            d_h(1) = b           d_u(2) = c           ...   d_i(n) = k               d_j(n+1) = f
Reward                   q_g^a                 q_h^b                q_u^c                ...   q_i^k                    q_j^f
Transition probability   p_g^{(0)} (initial)   p_{gh}^a (0 to 1)    p_{hu}^b (1 to 2)    ...   p_{ij}^k (n to n + 1)

FIGURE 5.1
Sequence of states, decisions, transitions, and rewards for an MDP.

TABLE 5.2
Two-State Generic MDP with Two Decisions in Each State
                        Transition Probability
State i   Decision k    p_{i1}^k    p_{i2}^k    Reward q_i^k
1         1             p_{11}^1    p_{12}^1    q_1^1
          2             p_{11}^2    p_{12}^2    q_1^2
2         1             p_{21}^1    p_{22}^1    q_2^1
          2             p_{21}^2    p_{22}^2    q_2^2

To see how an MDP can be represented in a tabular format, Table 5.2 speci-
fies a 2-state generic MDP with two decisions in each state.
By matching the actions in Table 4.1 with those in Table 5.1, it is appar-
ent that the MCR model of monthly sales constructed in equation (4.6) is
generated by the decision vector d(n) = [3 1 2 2]T . For example, d1(n) = 3
in Table 5.1 means that in state 1 at epoch n, when monthly sales are in
the first quartile, the firm will make decision 3 to offer employee buyouts.
The decision vector d(n) produces the same MCR that was introduced
without decision alternatives in Section 4.2.2.1. Hence the MCR model in
Section 4.2.2.1 corresponds to the following transition probability matrix
P, reward vector q with entries expressed in thousands of dollars, and
decision vector d(n).

1 0.60 0.30 0.10 0  1  −20  1 3


2  0.25 0.30 0.35 0.10  2  5 2  1
P=  , q =   , d(n) =   . (5.4)
3  0.05 0.25 0.50 0.20  3  − 5 3  2
     
4 0 0.10 0.30 0.60  4  25  4  2

In state 3 decision 2 is made. Therefore, d3(n) = 2, and

p32 j = [0.05 0.25 0.50 0.20] and q32 = −5 . (5.5)

TABLE 5.3
Data for Recurrent MDP Model of Monthly Sales
                                                  Transition Probabilities                    Reward
State i   Decision k                              p_{i1}^k   p_{i2}^k   p_{i3}^k   p_{i4}^k   q_i^k
1         1  Sell noncore assets                  0.15       0.40       0.35       0.10       −30
          2  Take firm private                    0.45       0.05       0.20       0.30       −25
          3  Offer employee buyouts               0.60       0.30       0.10       0          −20
2         1  Reduce management salaries           0.25       0.30       0.35       0.10       5
          2  Reduce employee benefits             0.30       0.40       0.25       0.05       10
3         1  Design more appealing products       0.05       0.65       0.25       0.05       −10
          2  Invest in new technology             0.05       0.25       0.50       0.20       −5
4         1  Invest in new projects               0.05       0.20       0.40       0.35       35
          2  Make strategic acquisitions          0          0.10       0.30       0.60       25

The data collected by the firm on the rewards and the transition probabilities
associated with the decision made in every state of Table 5.1 are summarized
in Table 5.3. Table 5.3 represents a recurrent MDP model of monthly sales in
which the state denotes the monthly sales quartile.
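
To make the link between a decision vector and the MCR it generates concrete, the data of Table 5.3 can be held in a small lookup structure. The sketch below is one possible encoding and is not notation from this book: the dictionary name `MDP` and the function `policy_to_mcr` are hypothetical. Applied to the decision vector d(n) = [3 1 2 2]T, it should reproduce the matrix P and the reward vector q of Equation (5.4).

```python
# Table 5.3 data: MDP[state][decision] = (transition probabilities, reward in $1000s).
MDP = {
    1: {1: ([0.15, 0.40, 0.35, 0.10], -30),
        2: ([0.45, 0.05, 0.20, 0.30], -25),
        3: ([0.60, 0.30, 0.10, 0.00], -20)},
    2: {1: ([0.25, 0.30, 0.35, 0.10], 5),
        2: ([0.30, 0.40, 0.25, 0.05], 10)},
    3: {1: ([0.05, 0.65, 0.25, 0.05], -10),
        2: ([0.05, 0.25, 0.50, 0.20], -5)},
    4: {1: ([0.05, 0.20, 0.40, 0.35], 35),
        2: ([0.00, 0.10, 0.30, 0.60], 25)},
}

def policy_to_mcr(mdp, d):
    """Return (P, q) for decision vector d, where d[i - 1] is the decision made in state i."""
    states = sorted(mdp)
    P = [list(mdp[i][d[i - 1]][0]) for i in states]
    q = [mdp[i][d[i - 1]][1] for i in states]
    return P, q

P, q = policy_to_mcr(MDP, [3, 1, 2, 2])
for row in P:
    print(row)
print(q)
```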

5.1.2.2 Value Iteration over a Finite Planning Horizon


In this section value iteration is executed over a finite planning horizon con-
taining T periods [2, 3].

5.1.2.2.1 Value Iteration Equation for Expected Total Reward


An optimal policy over a finite planning horizon is defined as one which
maximizes the vector of expected total rewards received until the end of the
horizon. When the planning horizon is finite, an optimal policy can be found
by using value iteration which was applied to an MCR in Section 4.2.2.2.
When value iteration is applied to an MDP, the definition of vi(n) is changed
so that vi(n) now denotes the expected total reward earned if the system is in
state i at the end of period n and an optimal policy is followed until the end
of the planning horizon. By extending the informal derivation of the value
iteration equation (4.34) for an MCR, the algebraic form of the value iteration
equation for an MDP is

\[
v_i(n) = \max_k \Big\{ q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j(n+1) \Big\},
\quad \text{for } n = 0, 1, \ldots, T-1, \text{ and } i = 1, 2, \ldots, N.
\tag{5.6}
\]

FIGURE 5.2
Tree diagram of the value iteration equation for an MDP. [Diagram: from state i at epoch n, decision d_i(n) = k earns reward q_i^k and leads to state j or state h at epoch n + 1 with probabilities p_{ij}^k and p_{ih}^k; decision d_i(n) = b earns reward q_i^b and leads to state j or state h with probabilities p_{ij}^b and p_{ih}^b. The expected total reward is v_i(n) = max{[q_i^k + p_{ij}^k v_j(n+1) + p_{ih}^k v_h(n+1)], [q_i^b + p_{ij}^b v_j(n+1) + p_{ih}^b v_h(n+1)]}.]

The salvage values vi(T) at the end of the planning horizon must be speci-
fied for all states i = 1, 2, . . . , N. Figure 5.2 is a tree diagram of the value iteration
equation in algebraic form for an MDP with two decisions in the current state.

5.1.2.2.2 Value Iteration for MDP Model of Monthly Sales


Value iteration will be applied to the MDP model of monthly sales con-
structed in Section 5.1.2.1 to determine which decision to make in every
state in each month of a seven month planning horizon so that the vec-
tor of expected total rewards is maximized. The data is given in Table 5.3.

To begin, the following salvage values are specified for all states at the end
of month 7:

\[
v_1(7) = v_2(7) = v_3(7) = v_4(7) = 0. \tag{5.7}
\]

For the four-state MDP the value iteration equations are

\[
v_i(7) = 0, \quad \text{for } i = 1, 2, 3, 4 \tag{5.8}
\]

\[
\begin{aligned}
v_i(n) &= \max_k \Big\{ q_i^k + \sum_{j=1}^{4} p_{ij}^k v_j(n+1) \Big\},
\quad \text{for } n = 0, 1, \ldots, 6, \text{ and } i = 1, 2, 3, 4 \\
&= \max_k \big[ q_i^k + p_{i1}^k v_1(n+1) + p_{i2}^k v_2(n+1) + p_{i3}^k v_3(n+1) + p_{i4}^k v_4(n+1) \big],
\quad \text{for } n = 0, 1, \ldots, 6.
\end{aligned}
\tag{5.9}
\]


Since value iteration with decision alternatives is a form of dynamic programming,
the detailed calculations for each epoch will be expressed in a
tabular format.
The calculations for month 6, denoted by n=6, are indicated in Table 5.4.

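\[
\begin{aligned}
v_i(6) &= \max_k \big[ q_i^k + p_{i1}^k v_1(7) + p_{i2}^k v_2(7) + p_{i3}^k v_3(7) + p_{i4}^k v_4(7) \big], \quad \text{for } i = 1, 2, 3, 4 \\
&= \max_k \big[ q_i^k + p_{i1}^k (0) + p_{i2}^k (0) + p_{i3}^k (0) + p_{i4}^k (0) \big], \quad \text{for } i = 1, 2, 3, 4 \\
&= \max_k \big[ q_i^k \big], \quad \text{for } i = 1, 2, 3, 4.
\end{aligned}
\tag{5.10, 5.11}
\]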

At the end of month 6, the optimal decision is to select the alternative which
will maximize the reward in every state. Thus, the decision vector at the end
of month 6 is d(6) = [3 2 2 1]T .
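
The month 6 step can be checked with a short computation. The following is a minimal sketch of one application of Equation (5.6), under the assumption that the Table 5.3 data is stored in a hypothetical dictionary `MDP`; the function name `backup` is also an illustrative choice, not the book's. With zero salvage values it should return the rewards and decisions of Table 5.4, and applying it once more to v(6) gives the values for n = 5.

```python
# Table 5.3 data: MDP[state][decision] = (transition probabilities, reward in $1000s).
MDP = {
    1: {1: ([0.15, 0.40, 0.35, 0.10], -30),
        2: ([0.45, 0.05, 0.20, 0.30], -25),
        3: ([0.60, 0.30, 0.10, 0.00], -20)},
    2: {1: ([0.25, 0.30, 0.35, 0.10], 5),
        2: ([0.30, 0.40, 0.25, 0.05], 10)},
    3: {1: ([0.05, 0.65, 0.25, 0.05], -10),
        2: ([0.05, 0.25, 0.50, 0.20], -5)},
    4: {1: ([0.05, 0.20, 0.40, 0.35], 35),
        2: ([0.00, 0.10, 0.30, 0.60], 25)},
}

def backup(mdp, v_next):
    """Apply Eq. (5.6) once: return (v, d) for epoch n given the values v_next at epoch n + 1."""
    v, d = [], []
    for i in sorted(mdp):
        # For each decision k: q_i^k + sum over j of p_ij^k * v_j(n + 1)
        totals = {k: reward + sum(p * vj for p, vj in zip(row, v_next))
                  for k, (row, reward) in mdp[i].items()}
        best = max(totals, key=totals.get)
        v.append(totals[best])
        d.append(best)
    return v, d

v6, d6 = backup(MDP, [0, 0, 0, 0])   # salvage values v(7) = 0
v5, d5 = backup(MDP, v6)
print(v6, d6)
print(v5, d5)
```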
The calculations for month 5, denoted by n=5, are indicated in Table 5.5.
The ← symbol in column 3 of Tables 5.5 through 5.10 identifies the maximum
expected total reward. Both the maximum expected total reward and the
associated decision are indicated in columns 4 and 5, respectively, of the row
flagged by the ← symbol.

TABLE 5.4
Value Iteration for n = 6
State i   q_i^1 (k = 1)   q_i^2 (k = 2)   q_i^3 (k = 3)   Expected Total Reward v_i(6) = max_k[q_i^1, q_i^2, q_i^3], for i = 1, 2, 3, 4   Decision k
1         −30             −25             −20             v_1(6) = max[−30, −25, −20] = −20                                             3
2         5               10              _               v_2(6) = max[5, 10] = 10                                                      2
3         −10             −5              _               v_3(6) = max[−10, −5] = −5                                                    2
4         35              25              _               v_4(6) = max[35, 25] = 35                                                     1

TABLE 5.5
Value Iteration for n = 5
Column 3: q_i^k + p_{i1}^k v_1(6) + p_{i2}^k v_2(6) + p_{i3}^k v_3(6) + p_{i4}^k v_4(6)
          = q_i^k + p_{i1}^k(−20) + p_{i2}^k(10) + p_{i3}^k(−5) + p_{i4}^k(35)
i   k   Column 3                                                         v_i(5) = Expected Total Reward            Decision d_i(5) = k
1   1   −30 + 0.15(−20) + 0.40(10) + 0.35(−5) + 0.10(35) = −27.25
1   2   −25 + 0.45(−20) + 0.05(10) + 0.20(−5) + 0.30(35) = −24 ←         max[−27.25, −24, −29.5] = −24 = v_1(5)    d_1(5) = 2
1   3   −20 + 0.60(−20) + 0.30(10) + 0.10(−5) + 0(35) = −29.5
2   1   5 + 0.25(−20) + 0.30(10) + 0.35(−5) + 0.10(35) = 4.75
2   2   10 + 0.30(−20) + 0.40(10) + 0.25(−5) + 0.05(35) = 8.5 ←          max[4.75, 8.5] = 8.5 = v_2(5)             d_2(5) = 2
3   1   −10 + 0.05(−20) + 0.65(10) + 0.25(−5) + 0.05(35) = −4
3   2   −5 + 0.05(−20) + 0.25(10) + 0.50(−5) + 0.20(35) = 1 ←            max[−4, 1] = 1 = v_3(5)                   d_3(5) = 2
4   1   35 + 0.05(−20) + 0.20(10) + 0.40(−5) + 0.35(35) = 46.25 ←        max[46.25, 45.5] = 46.25 = v_4(5)         d_4(5) = 1
4   2   25 + 0(−20) + 0.10(10) + 0.30(−5) + 0.60(35) = 45.5

\[
\begin{aligned}
v_i(5) &= \max_k \big[ q_i^k + p_{i1}^k v_1(6) + p_{i2}^k v_2(6) + p_{i3}^k v_3(6) + p_{i4}^k v_4(6) \big], \quad \text{for } i = 1, 2, 3, 4 \\
&= \max_k \big[ q_i^k + p_{i1}^k (-20) + p_{i2}^k (10) + p_{i3}^k (-5) + p_{i4}^k (35) \big], \quad \text{for } i = 1, 2, 3, 4.
\end{aligned}
\tag{5.12}
\]

At the end of month 5, the optimal decision is to select the second alternative
in states 1, 2, and 3, and the first alternative in state 4. Thus, the decision vector
at the end of month 5 is d(5) = [2 2 2 1]T.

The calculations for month 4, denoted by n = 4, are indicated in Table 5.6.

TABLE 5.6
Value Iteration for n = 4
Column 3: q_i^k + p_{i1}^k v_1(5) + p_{i2}^k v_2(5) + p_{i3}^k v_3(5) + p_{i4}^k v_4(5)
          = q_i^k + p_{i1}^k(−24) + p_{i2}^k(8.5) + p_{i3}^k(1) + p_{i4}^k(46.25)
i   k   Column 3                                                              v_i(4) = Expected Total Reward                   Decision d_i(4) = k
1   1   −30 + 0.15(−24) + 0.40(8.5) + 0.35(1) + 0.10(46.25) = −25.225
1   2   −25 + 0.45(−24) + 0.05(8.5) + 0.20(1) + 0.30(46.25) = −21.3 ←         max[−25.225, −21.3, −31.75] = −21.3 = v_1(4)     d_1(4) = 2
1   3   −20 + 0.60(−24) + 0.30(8.5) + 0.10(1) + 0(46.25) = −31.75
2   1   5 + 0.25(−24) + 0.30(8.5) + 0.35(1) + 0.10(46.25) = 6.525
2   2   10 + 0.30(−24) + 0.40(8.5) + 0.25(1) + 0.05(46.25) = 8.7625 ←         max[6.525, 8.7625] = 8.7625 = v_2(4)             d_2(4) = 2
3   1   −10 + 0.05(−24) + 0.65(8.5) + 0.25(1) + 0.05(46.25) = −3.1125
3   2   −5 + 0.05(−24) + 0.25(8.5) + 0.50(1) + 0.20(46.25) = 5.675 ←          max[−3.1125, 5.675] = 5.675 = v_3(4)             d_3(4) = 2
4   1   35 + 0.05(−24) + 0.20(8.5) + 0.40(1) + 0.35(46.25) = 52.0875
4   2   25 + 0(−24) + 0.10(8.5) + 0.30(1) + 0.60(46.25) = 53.9 ←              max[52.0875, 53.9] = 53.9 = v_4(4)               d_4(4) = 2

TABLE 5.7
Value Iteration for n = 3
Column 3: q_i^k + p_{i1}^k v_1(4) + p_{i2}^k v_2(4) + p_{i3}^k v_3(4) + p_{i4}^k v_4(4)
          = q_i^k + p_{i1}^k(−21.3) + p_{i2}^k(8.7625) + p_{i3}^k(5.675) + p_{i4}^k(53.9)
i   k   Column 3                                                                          v_i(3) = Expected Total Reward                          Decision d_i(3) = k
1   1   −30 + 0.15(−21.3) + 0.40(8.7625) + 0.35(5.675) + 0.10(53.9) = −22.3138
1   2   −25 + 0.45(−21.3) + 0.05(8.7625) + 0.20(5.675) + 0.30(53.9) = −16.8419 ←          max[−22.3138, −16.8419, −29.5838] = −16.8419 = v_1(3)   d_1(3) = 2
1   3   −20 + 0.60(−21.3) + 0.30(8.7625) + 0.10(5.675) + 0(53.9) = −29.58375
2   1   5 + 0.25(−21.3) + 0.30(8.7625) + 0.35(5.675) + 0.10(53.9) = 9.68
2   2   10 + 0.30(−21.3) + 0.40(8.7625) + 0.25(5.675) + 0.05(53.9) = 11.22875 ←           max[9.68, 11.2288] = 11.2288 = v_2(3)                   d_2(3) = 2
3   1   −10 + 0.05(−21.3) + 0.65(8.7625) + 0.25(5.675) + 0.05(53.9) = −1.255625
3   2   −5 + 0.05(−21.3) + 0.25(8.7625) + 0.50(5.675) + 0.20(53.9) = 9.743125 ←           max[−1.2556, 9.7431] = 9.7431 = v_3(3)                  d_3(3) = 2
4   1   35 + 0.05(−21.3) + 0.20(8.7625) + 0.40(5.675) + 0.35(53.9) = 56.8225
4   2   25 + 0(−21.3) + 0.10(8.7625) + 0.30(5.675) + 0.60(53.9) = 59.91875 ←              max[56.8225, 59.9188] = 59.9188 = v_4(3)                d_4(3) = 2

\[
\begin{aligned}
v_i(4) &= \max_k \big[ q_i^k + p_{i1}^k v_1(5) + p_{i2}^k v_2(5) + p_{i3}^k v_3(5) + p_{i4}^k v_4(5) \big], \quad \text{for } i = 1, 2, 3, 4 \\
&= \max_k \big[ q_i^k + p_{i1}^k (-24) + p_{i2}^k (8.5) + p_{i3}^k (1) + p_{i4}^k (46.25) \big], \quad \text{for } i = 1, 2, 3, 4.
\end{aligned}
\tag{5.13}
\]

At the end of month 4, the optimal decision is to select the second alternative
in every state. Thus, the decision vector at the end of month 4 is
d(4) = [2 2 2 2]T.
The calculations for month 3, denoted by n = 3, are indicated in Table 5.7.

\[
\begin{aligned}
v_i(3) &= \max_k \big[ q_i^k + p_{i1}^k v_1(4) + p_{i2}^k v_2(4) + p_{i3}^k v_3(4) + p_{i4}^k v_4(4) \big], \quad \text{for } i = 1, 2, 3, 4 \\
&= \max_k \big[ q_i^k + p_{i1}^k (-21.3) + p_{i2}^k (8.7625) + p_{i3}^k (5.675) + p_{i4}^k (53.9) \big], \quad \text{for } i = 1, 2, 3, 4.
\end{aligned}
\tag{5.14}
\]

At the end of month 3, the optimal decision is to select the second alternative
in every state. Thus, the decision vector at the end of month 3 is
d(3) = [2 2 2 2]T.
The calculations for month 2, denoted by n = 2, are indicated in Table 5.8.


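\[
\begin{aligned}
v_i(2) &= \max_k \big[ q_i^k + p_{i1}^k v_1(3) + p_{i2}^k v_2(3) + p_{i3}^k v_3(3) + p_{i4}^k v_4(3) \big], \quad \text{for } i = 1, 2, 3, 4 \\
&= \max_k \big[ q_i^k + p_{i1}^k (-16.8419) + p_{i2}^k (11.2288) + p_{i3}^k (9.7431) + p_{i4}^k (59.9188) \big], \quad \text{for } i = 1, 2, 3, 4.
\end{aligned}
\tag{5.15}
\]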

TABLE 5.8
Value Iteration for n = 2
Column 3: q_i^k + p_{i1}^k v_1(3) + p_{i2}^k v_2(3) + p_{i3}^k v_3(3) + p_{i4}^k v_4(3)
          = q_i^k + p_{i1}^k(−16.8419) + p_{i2}^k(11.2288) + p_{i3}^k(9.7431) + p_{i4}^k(59.9188)
i   k   Column 3                                                                               v_i(2) = Expected Total Reward                          Decision d_i(2) = k
1   1   −30 + 0.15(−16.8419) + 0.40(11.2288) + 0.35(9.7431) + 0.10(59.9188) = −18.6328
1   2   −25 + 0.45(−16.8419) + 0.05(11.2288) + 0.20(9.7431) + 0.30(59.9188) = −12.0932 ←       max[−18.6328, −12.0932, −25.7622] = −12.0932 = v_1(2)   d_1(2) = 2
1   3   −20 + 0.60(−16.8419) + 0.30(11.2288) + 0.10(9.7431) + 0(59.9188) = −25.7622
2   1   5 + 0.25(−16.8419) + 0.30(11.2288) + 0.35(9.7431) + 0.10(59.9188) = 13.5601
2   2   10 + 0.30(−16.8419) + 0.40(11.2288) + 0.25(9.7431) + 0.05(59.9188) = 14.8707 ←         max[13.5601, 14.8707] = 14.8707 = v_2(2)                d_2(2) = 2
3   1   −10 + 0.05(−16.8419) + 0.65(11.2288) + 0.25(9.7431) + 0.05(59.9188) = 1.8883
3   2   −5 + 0.05(−16.8419) + 0.25(11.2288) + 0.50(9.7431) + 0.20(59.9188) = 13.8204 ←         max[1.8883, 13.8204] = 13.8204 = v_3(2)                 d_3(2) = 2
4   1   35 + 0.05(−16.8419) + 0.20(11.2288) + 0.40(9.7431) + 0.35(59.9188) = 61.2725
4   2   25 + 0(−16.8419) + 0.10(11.2288) + 0.30(9.7431) + 0.60(59.9188) = 64.9971 ←            max[61.2725, 64.9971] = 64.9971 = v_4(2)                d_4(2) = 2

TABLE 5.9
Value Iteration for n = 1
Column 3: q_i^k + p_{i1}^k v_1(2) + p_{i2}^k v_2(2) + p_{i3}^k v_3(2) + p_{i4}^k v_4(2)
          = q_i^k + p_{i1}^k(−12.0932) + p_{i2}^k(14.8707) + p_{i3}^k(13.8204) + p_{i4}^k(64.9971)
i   k   Column 3                                                                                v_i(1) = Expected Total Reward                        Decision d_i(1) = k
1   1   −30 + 0.15(−12.0932) + 0.40(14.8707) + 0.35(13.8204) + 0.10(64.9971) = −14.5289
1   2   −25 + 0.45(−12.0932) + 0.05(14.8707) + 0.20(13.8204) + 0.30(64.9971) = −7.4352 ←        max[−14.5289, −7.4352, −21.4127] = −7.4352 = v_1(1)   d_1(1) = 2
1   3   −20 + 0.60(−12.0932) + 0.30(14.8707) + 0.10(13.8204) + 0(64.9971) = −21.4127
2   1   5 + 0.25(−12.0932) + 0.30(14.8707) + 0.35(13.8204) + 0.10(64.9971) = 17.7748
2   2   10 + 0.30(−12.0932) + 0.40(14.8707) + 0.25(13.8204) + 0.05(64.9971) = 19.0253 ←         max[17.7748, 19.0253] = 19.0253 = v_2(1)              d_2(1) = 2
3   1   −10 + 0.05(−12.0932) + 0.65(14.8707) + 0.25(13.8204) + 0.05(64.9971) = 5.7663
3   2   −5 + 0.05(−12.0932) + 0.25(14.8707) + 0.50(13.8204) + 0.20(64.9971) = 18.0226 ←         max[5.7663, 18.0226] = 18.0226 = v_3(1)               d_3(1) = 2
4   1   35 + 0.05(−12.0932) + 0.20(14.8707) + 0.40(13.8204) + 0.35(64.9971) = 65.6466
4   2   25 + 0(−12.0932) + 0.10(14.8707) + 0.30(13.8204) + 0.60(64.9971) = 69.6315 ←            max[65.6466, 69.6315] = 69.6315 = v_4(1)              d_4(1) = 2

TABLE 5.10
Value Iteration for n = 0
Column 3: q_i^k + p_{i1}^k v_1(1) + p_{i2}^k v_2(1) + p_{i3}^k v_3(1) + p_{i4}^k v_4(1)
          = q_i^k + p_{i1}^k(−7.4352) + p_{i2}^k(19.0253) + p_{i3}^k(18.0226) + p_{i4}^k(69.6315)
i   k   Column 3                                                                               v_i(0) = Expected Total Reward                        Decision d_i(0) = k
1   1   −30 + 0.15(−7.4352) + 0.40(19.0253) + 0.35(18.0226) + 0.10(69.6315) = −10.2341
1   2   −25 + 0.45(−7.4352) + 0.05(19.0253) + 0.20(18.0226) + 0.30(69.6315) = −2.9006 ←        max[−10.2341, −2.9006, −16.9513] = −2.9006 = v_1(0)   d_1(0) = 2
1   3   −20 + 0.60(−7.4352) + 0.30(19.0253) + 0.10(18.0226) + 0(69.6315) = −16.9513
2   1   5 + 0.25(−7.4352) + 0.30(19.0253) + 0.35(18.0226) + 0.10(69.6315) = 22.1199
2   2   10 + 0.30(−7.4352) + 0.40(19.0253) + 0.25(18.0226) + 0.05(69.6315) = 23.3668 ←         max[22.1199, 23.3668] = 23.3668 = v_2(0)              d_2(0) = 2
3   1   −10 + 0.05(−7.4352) + 0.65(19.0253) + 0.25(18.0226) + 0.05(69.6315) = 9.9819
3   2   −5 + 0.05(−7.4352) + 0.25(19.0253) + 0.50(18.0226) + 0.20(69.6315) = 22.3222 ←         max[9.9819, 22.3222] = 22.3222 = v_3(0)               d_3(0) = 2
4   1   35 + 0.05(−7.4352) + 0.20(19.0253) + 0.40(18.0226) + 0.35(69.6315) = 70.0134
4   2   25 + 0(−7.4352) + 0.10(19.0253) + 0.30(18.0226) + 0.60(69.6315) = 74.0882 ←            max[70.0134, 74.0882] = 74.0882 = v_4(0)              d_4(0) = 2

At the end of month 2, the optimal decision is to select the second alter-
native in every state. Thus, the decision vector at the end of month 2 is
d(2) = [2 2 2 2]T.
The calculations for month 1, denoted by n = 1, are indicated in Table 5.9.


\[
\begin{aligned}
v_i(1) &= \max_k \big[ q_i^k + p_{i1}^k v_1(2) + p_{i2}^k v_2(2) + p_{i3}^k v_3(2) + p_{i4}^k v_4(2) \big], \quad \text{for } i = 1, 2, 3, 4 \\
&= \max_k \big[ q_i^k + p_{i1}^k (-12.0932) + p_{i2}^k (14.8707) + p_{i3}^k (13.8204) + p_{i4}^k (64.9971) \big], \quad \text{for } i = 1, 2, 3, 4.
\end{aligned}
\tag{5.16}
\]

At the end of month 1, the optimal decision is to select the second alternative in
every state. Thus, the decision vector at the end of month 1 is d(1) = [2 2 2 2]T.
Finally, the calculations for month 0, denoted by n = 0, are indicated in
Table 5.10.

\[
\begin{aligned}
v_i(0) &= \max_k \big[ q_i^k + p_{i1}^k v_1(1) + p_{i2}^k v_2(1) + p_{i3}^k v_3(1) + p_{i4}^k v_4(1) \big], \quad \text{for } i = 1, 2, 3, 4 \\
&= \max_k \big[ q_i^k + p_{i1}^k (-7.4352) + p_{i2}^k (19.0253) + p_{i3}^k (18.0226) + p_{i4}^k (69.6315) \big], \quad \text{for } i = 1, 2, 3, 4.
\end{aligned}
\tag{5.17}
\]

At the end of month 0, which is the beginning of month 1, the optimal decision
is to select the second alternative in every state. Thus, the decision vector
at the beginning of month 1 is d(0) = [2 2 2 2]T.

TABLE 5.11
Expected Total Rewards and Optimal Decisions for a Planning Horizon of 7 Months
End of Month n   0         1         2          3          4        5       6     7
v_1(n)           −2.9006   −7.4352   −12.0932   −16.8419   −21.3    −24     −20   0
v_2(n)           23.3668   19.0253   14.8707    11.2288    8.7625   8.5     10    0
v_3(n)           22.3222   18.0226   13.8204    9.7431     5.675    1       −5    0
v_4(n)           74.0882   69.6315   64.9971    59.9188    53.9     46.25   35    0
d_1(n)           2         2         2          2          2        2       3     −
d_2(n)           2         2         2          2          2        2       2     −
d_3(n)           2         2         2          2          2        2       2     −
d_4(n)           2         2         2          2          2        1       1     −

The results of these calculations for the expected total rewards and the
optimal decisions at the end of each month of the 7-month planning horizon
are summarized in Table 5.11.
If the process starts in state 4, the expected total reward is 74.0882, the high-
est for any state. On the other hand, the lowest expected total reward is –2.9006,
which is an expected total cost of 2.9006, if the process starts in state 1.
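
The whole backward recursion summarized in Table 5.11 can be reproduced with a few lines of code. The sketch below is an illustration under stated assumptions rather than the book's own procedure: the dictionary `MDP` (holding the Table 5.3 data) and the function `value_iteration` are hypothetical names. Because the tables round intermediate values to four decimal places, the computed values may differ from Table 5.11 in the last digit.

```python
# Table 5.3 data: MDP[state][decision] = (transition probabilities, reward in $1000s).
MDP = {
    1: {1: ([0.15, 0.40, 0.35, 0.10], -30),
        2: ([0.45, 0.05, 0.20, 0.30], -25),
        3: ([0.60, 0.30, 0.10, 0.00], -20)},
    2: {1: ([0.25, 0.30, 0.35, 0.10], 5),
        2: ([0.30, 0.40, 0.25, 0.05], 10)},
    3: {1: ([0.05, 0.65, 0.25, 0.05], -10),
        2: ([0.05, 0.25, 0.50, 0.20], -5)},
    4: {1: ([0.05, 0.20, 0.40, 0.35], 35),
        2: ([0.00, 0.10, 0.30, 0.60], 25)},
}

def value_iteration(mdp, salvage, horizon):
    """Backward recursion of Eq. (5.6); returns dicts v[n] and d[n] keyed by epoch n."""
    v = {horizon: list(salvage)}
    d = {}
    for n in range(horizon - 1, -1, -1):
        v[n], d[n] = [], []
        for i in sorted(mdp):
            # q_i^k + sum over j of p_ij^k * v_j(n + 1), for each decision k in state i
            totals = {k: reward + sum(p * vj for p, vj in zip(row, v[n + 1]))
                      for k, (row, reward) in mdp[i].items()}
            best = max(totals, key=totals.get)
            v[n].append(totals[best])
            d[n].append(best)
    return v, d

v, d = value_iteration(MDP, salvage=[0, 0, 0, 0], horizon=7)
for n in range(7):
    print(n, [round(x, 4) for x in v[n]], d[n])
```

Each printed row corresponds to one column of Table 5.11, with the decisions for all four states listed after the values.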

5.1.2.3 An Infinite Planning Horizon


To find an optimal policy for a recurrent MDP over an infinite planning
horizon, the following four computational procedures can, in principle, be
applied: exhaustive enumeration, value iteration, policy iteration (PI), and
linear programming (LP). However, exhaustive enumeration, which involves
calculating the steady-state probability distribution and the associated gain
for every possible policy, is computationally prohibitive unless the problem
is extremely small. Nevertheless, exhaustive enumeration will be applied to
the MDP model of monthly sales to display the scope of the enumeration
process, and to show why a formal optimization procedure is needed. Value
iteration requires fewer arithmetic operations than alternative computa-
tional procedures, but may never satisfy a stopping condition. If value iter-
ation does satisfy a stopping condition, it yields only approximate solutions
for the relative values and the gain of an optimal policy.
To illustrate the four methods for finding an optimal policy for a recurrent
MDP over an infinite planning horizon, consider the MDP model of monthly
sales specified in Table 5.3. The expected total rewards and the associated
decisions were obtained by value iteration executed over a finite horizon of
length seven periods and summarized in Table 5.11. Suppose that the firm
plans to track its monthly sales over a large enough number of months to justify
using an infinite planning horizon. Over an infinite horizon, the system
enters a steady state. In the steady state, a decision is no longer a function of
the epoch n. Thus, over an infinite horizon, di = k indicates that in state i the
same decision k will always be made, irrespective of the epoch, n.
The set of decisions for all states is called a policy. For an N-state process, a
policy over an infinite horizon can be specified by a decision vector,

d = [d1 d2 ⋅⋅⋅ dN ]T . (5.18)

The elements of d indicate which decision is to be made in every state. Over
an infinite planning horizon, the policy specified by the vector d is called
a stationary policy because the policy will always specify the same decision
in a given state, independent of time. For the model of monthly sales,
if the horizon is infinite and the policy given by the decision vector in
Equation (5.19) is chosen, then d1 = 3 in state 1, when monthly sales are in
the first quartile. Therefore, in every month in which monthly sales are in
the first quartile, the firm will always make decision 3, which is to offer
employee buyouts.

\[
d = \begin{matrix} 1 \\ 2 \\ 3 \\ 4 \end{matrix}
\begin{bmatrix} 3 \\ 1 \\ 2 \\ 2 \end{bmatrix}.
\tag{5.19}
\]
Over an infinite horizon, an optimal policy is defined as one which maximizes
the gain, or average reward per period. In Equation (4.53), the gain g
for the policy d = [3 1 2 2]T was calculated by using Equation (4.47).
Under this policy the firm earns an average return of 1.4143 per period.
Suppose that the firm wishes to determine whether this policy is optimal.
The firm may proceed by applying exhaustive enumeration to this four-state
MDP model.

5.1.2.3.1 Exhaustive Enumeration


The simplest but least efficient way to find an optimal stationary policy is
to use exhaustive enumeration [1, 2]. This procedure involves listing every
possible policy, computing the steady-state probability vector for each pol-
icy, and then using Equation (4.47) to compute the gain for each policy. The
policy with the highest gain is optimal. Since the decisions in every state can
be made independently, the number of possible policies is equal to the prod-
uct, over all states, of the number of decisions permitted in every state. For
example, for the MDP model of monthly sales, three decisions are allowed
in state 1 while two decisions are permitted in each of the other three states.
Hence, there are (3)(2)(2)(2) = 24 alternative policies which must be compared.
Letting ${}^r d$ denote the rth policy, for r = 1, 2, … , 24, the 24 possible policies are
enumerated in Table 5.12.
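
The product rule for counting policies can be illustrated with a short enumeration. This is a minimal sketch using Python's standard `itertools.product`; the variable names `decisions_per_state` and `policies` are illustrative choices. The enumeration order, with the decision in the last state varying fastest, is assumed to match the numbering of policies in Table 5.12.

```python
from itertools import product

# Number of decisions allowed in states 1, 2, 3, and 4 of the monthly sales model.
decisions_per_state = [3, 2, 2, 2]

# Each policy is a tuple (d1, d2, d3, d4); decisions are numbered from 1.
policies = list(product(*(range(1, m + 1) for m in decisions_per_state)))

print(len(policies))     # number of stationary policies
print(policies[3 - 1])   # policy 3
print(policies[16 - 1])  # policy 16
print(policies[20 - 1])  # policy 20
```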

TABLE 5.12
Exhaustive Enumeration of Decision Vectors for the 24 Possible Policies for the
MDP Model of Monthly Sales
Policy r      1    2    3    4    5    6    7    8
State 1, d1   1    1    1    1    1    1    1    1
State 2, d2   1    1    1    1    2    2    2    2
State 3, d3   1    1    2    2    1    1    2    2
State 4, d4   1    2    1    2    1    2    1    2

Policy r      9    10   11   12   13   14   15   16
State 1, d1   2    2    2    2    2    2    2    2
State 2, d2   1    1    1    1    2    2    2    2
State 3, d3   1    1    2    2    1    1    2    2
State 4, d4   1    2    1    2    1    2    1    2

Policy r      17   18   19   20   21   22   23   24
State 1, d1   3    3    3    3    3    3    3    3
State 2, d2   1    1    1    1    2    2    2    2
State 3, d3   1    1    2    2    1    1    2    2
State 4, d4   1    2    1    2    1    2    1    2

Each policy corresponds to a different MCR. Let ${}^rP$, ${}^rq$, ${}^r\pi$, and ${}^rg$ denote the
transition probability matrix, the reward vector, the steady-state probability
vector, and the gain, respectively, associated with policy ${}^rd$. For example, for
policy ${}^3d = [1\ 1\ 2\ 1]^T$, the associated transition probability matrix, the reward
vector, the steady-state probability vector, and the gain are indicated below.

\[
{}^3d = \begin{matrix} 1 \\ 2 \\ 3 \\ 4 \end{matrix}
\begin{bmatrix} 1 \\ 1 \\ 2 \\ 1 \end{bmatrix}, \quad
{}^3P = \begin{matrix} 1 \\ 2 \\ 3 \\ 4 \end{matrix}
\begin{bmatrix}
0.15 & 0.40 & 0.35 & 0.10 \\
0.25 & 0.30 & 0.35 & 0.10 \\
0.05 & 0.25 & 0.50 & 0.20 \\
0.05 & 0.20 & 0.40 & 0.35
\end{bmatrix}, \quad
{}^3q = \begin{matrix} 1 \\ 2 \\ 3 \\ 4 \end{matrix}
\begin{bmatrix} -30 \\ 5 \\ -5 \\ 35 \end{bmatrix},
\]

\[
{}^3\pi = [0.1159 \;\; 0.2715 \;\; 0.4229 \;\; 0.1897], \quad \text{and} \quad {}^3g = 2.4055.
\]
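
As a check on the gain for policy 3, and assuming that Equation (4.47) expresses the gain as the steady-state probability vector multiplied by the reward vector (the form is not restated in this chapter, so this is an assumption), the arithmetic is

\[
{}^3g = {}^3\pi \, {}^3q = 0.1159(-30) + 0.2715(5) + 0.4229(-5) + 0.1897(35)
= -3.477 + 1.3575 - 2.1145 + 6.6395 = 2.4055.
\]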

Observe that policy ${}^{20}d = [3\ 1\ 2\ 2]^T$, the policy used for the MCR model
of monthly sales, with a gain of 1.4143 calculated in Equation (4.53), is
inferior to policy ${}^3d = [1\ 1\ 2\ 1]^T$, which has a higher gain of 2.4055. All
24 policies and their associated transition probability matrices, reward
vectors, steady-state probability vectors, and gains are enumerated in
Table 5.13a.
Exhaustive enumeration has identified ${}^{16}d = [2\ 2\ 2\ 2]^T$ as the optimal
policy with a gain of 4.39. However, as this example has demonstrated,
exhaustive enumeration is not feasible for larger problems because of the
huge computational burden it imposes.
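
The enumeration can also be carried out mechanically. The following sketch is an illustration under stated assumptions, not the book's procedure: it approximates each steady-state vector by repeated multiplication (power iteration) instead of solving the steady-state equations, which is adequate here only because every policy of this model yields an irreducible, aperiodic chain. It also assumes the gain is the steady-state vector multiplied by the reward vector. The names `MDP`, `steady_state`, and `gain` are hypothetical.

```python
from itertools import product

# Table 5.3 data: MDP[state][decision] = (transition probabilities, reward in $1000s).
MDP = {
    1: {1: ([0.15, 0.40, 0.35, 0.10], -30),
        2: ([0.45, 0.05, 0.20, 0.30], -25),
        3: ([0.60, 0.30, 0.10, 0.00], -20)},
    2: {1: ([0.25, 0.30, 0.35, 0.10], 5),
        2: ([0.30, 0.40, 0.25, 0.05], 10)},
    3: {1: ([0.05, 0.65, 0.25, 0.05], -10),
        2: ([0.05, 0.25, 0.50, 0.20], -5)},
    4: {1: ([0.05, 0.20, 0.40, 0.35], 35),
        2: ([0.00, 0.10, 0.30, 0.60], 25)},
}

def steady_state(P, iterations=500):
    """Approximate pi satisfying pi = pi P by power iteration from a uniform start."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def gain(mdp, d):
    """Gain of stationary policy d = (d1, ..., dN), computed as pi times q."""
    states = sorted(mdp)
    P = [mdp[i][k][0] for i, k in zip(states, d)]
    q = [mdp[i][k][1] for i, k in zip(states, d)]
    return sum(p * r for p, r in zip(steady_state(P), q))

# Enumerate every stationary policy and keep the one with the largest gain.
policies = list(product(*(sorted(MDP[i]) for i in sorted(MDP))))
gains = {d: gain(MDP, d) for d in policies}
best = max(gains, key=gains.get)
print(best, round(gains[best], 4))
```

Even for this four-state model the loop evaluates 24 separate chains; the count grows as the product of the numbers of decisions over all states, which is the computational burden the text describes.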

TABLE 5.13a
Exhaustive Enumeration of 24 Policies and their Parameters for
MDP Model of Monthly Sales

\[
{}^1d = [1\ 1\ 1\ 1]^T, \quad
{}^1P = \begin{bmatrix}
0.15 & 0.40 & 0.35 & 0.10 \\
0.25 & 0.30 & 0.35 & 0.10 \\
0.05 & 0.65 & 0.25 & 0.05 \\
0.05 & 0.20 & 0.40 & 0.35
\end{bmatrix}, \quad
{}^1q = [-30\ \ 5\ \ -10\ \ 35]^T,
\]
\[
{}^1\pi = [0.1482 \;\; 0.4168 \;\; 0.3233 \;\; 0.1118], \quad \text{and} \quad {}^1g = -1.682
\]

\[
{}^2d = [1\ 1\ 1\ 2]^T, \quad
{}^2P = \begin{bmatrix}
0.15 & 0.40 & 0.35 & 0.10 \\
0.25 & 0.30 & 0.35 & 0.10 \\
0.05 & 0.65 & 0.25 & 0.05 \\
0 & 0.10 & 0.30 & 0.60
\end{bmatrix}, \quad
{}^2q = [-30\ \ 5\ \ -10\ \ 25]^T,
\]
\[
{}^2\pi = [0.1324 \;\; 0.3881 \;\; 0.3105 \;\; 0.1689], \quad \text{and} \quad {}^2g = -0.914
\]

\[
{}^3d = [1\ 1\ 2\ 1]^T, \quad
{}^3P = \begin{bmatrix}
0.15 & 0.40 & 0.35 & 0.10 \\
0.25 & 0.30 & 0.35 & 0.10 \\
0.05 & 0.25 & 0.50 & 0.20 \\
0.05 & 0.20 & 0.40 & 0.35
\end{bmatrix}, \quad
{}^3q = [-30\ \ 5\ \ -5\ \ 35]^T,
\]
\[
{}^3\pi = [0.1159 \;\; 0.2715 \;\; 0.4229 \;\; 0.1897], \quad \text{and} \quad {}^3g = 2.4055
\]

\[
{}^4d = [1\ 1\ 2\ 2]^T, \quad
{}^4P = \begin{bmatrix}
0.15 & 0.40 & 0.35 & 0.10 \\
0.25 & 0.30 & 0.35 & 0.10 \\
0.05 & 0.25 & 0.50 & 0.20 \\
0 & 0.10 & 0.30 & 0.60
\end{bmatrix}, \quad
{}^4q = [-30\ \ 5\ \ -5\ \ 25]^T,
\]
\[
{}^4\pi = [0.0920 \;\; 0.2336 \;\; 0.3953 \;\; 0.2791], \quad \text{and} \quad {}^4g = 3.409
\]

\[
{}^5d = [1\ 2\ 1\ 1]^T, \quad
{}^5P = \begin{bmatrix}
0.15 & 0.40 & 0.35 & 0.10 \\
0.30 & 0.40 & 0.25 & 0.05 \\
0.05 & 0.65 & 0.25 & 0.05 \\
0.05 & 0.20 & 0.40 & 0.35
\end{bmatrix}, \quad
{}^5q = [-30\ \ 10\ \ -10\ \ 35]^T,
\]
\[
{}^5\pi = [0.1815 \;\; 0.4533 \;\; 0.2808 \;\; 0.0844], \quad \text{and} \quad {}^5g = -0.766
\]

\[
{}^6d = [1\ 2\ 1\ 2]^T, \quad
{}^6P = \begin{bmatrix}
0.15 & 0.40 & 0.35 & 0.10 \\
0.30 & 0.40 & 0.25 & 0.05 \\
0.05 & 0.65 & 0.25 & 0.05 \\
0 & 0.10 & 0.30 & 0.60
\end{bmatrix}, \quad
{}^6q = [-30\ \ 10\ \ -10\ \ 25]^T,
\]
\[
{}^6\pi = [0.1676 \;\; 0.4294 \;\; 0.2732 \;\; 0.1297], \quad \text{and} \quad {}^6g = -0.2235
\]

\[
{}^7d = [1\ 2\ 2\ 1]^T, \quad
{}^7P = \begin{bmatrix}
0.15 & 0.40 & 0.35 & 0.10 \\
0.30 & 0.40 & 0.25 & 0.05 \\
0.05 & 0.25 & 0.50 & 0.20 \\
0.05 & 0.20 & 0.40 & 0.35
\end{bmatrix}, \quad
{}^7q = [-30\ \ 10\ \ -5\ \ 35]^T,
\]
\[
{}^7\pi = [0.1415 \;\; 0.3094 \;\; 0.3850 \;\; 0.1640], \quad \text{and} \quad {}^7g = 2.664
\]

 1 1  0.15 0.40 0.35 0.10  1  −30 


 2 2 0.30 0.40 0.25 0.05 2  10 
8
d =  , 8
P=  , 8
q=  ,
 2 3  0.05 0.25 0.50 0.20  3  −5 
     
 2 4 0 0.10 0.30 0.60  4  25 

8
␲ = [0.1173 0.2714 0.3654 0.2459], and 8
g = 3.5155

 2 1 0.45 0.05 0.20 0.30  1  −25


 1 2 0.25 0.30 0.35 0.10  2 5 
9
d =  , 9
P=  , 9
q=  ,
 1 3 0.05 0.65 0.25 0.05 3  −10 
     
 1 4 0.05 0.20 0.40 0.35 4  35 

9
␲ = [0.1963 0.3390 0.2989 0.1658], and 9
g = −0.4006

 2 1 0.45 0.05 0.20 0.30  1  −25


 1 2 0.25 0.30 0.35 0.10  2 5 
10
d =  , 10
P=  , 10
q=  ,
 1 3 0.05 0.65 0.25 0.05 3  −10 
     
 2 4 0 0.10 0.30 0.60  4  25 

10
␲ = [0.1669 0.3102 0.2846 0.2383], and 10
g = 0.49

 2 1 0.45 0.05 0.20 0.30  1  −25


 1 2 0.25 0.30 0.35 0.10  2 5 
11
d =  , 11
P=  , 11
q=  ,
 2 3 0.05 0.25 0.50 0.20  3  −5 
     
 1 4 0.05 0.20 0.40 0.35 4  35 

11
␲ = [0.1561 0.2183 0.3976 0.2280], and 11
g = −6.9415

 2 1 0.45 0.05 0.20 0.30  1  −25


 1 2 0.25 0.30 0.35 0.10  2 5 
12
d =  , 12
P=  , 12
q=  ,
 2 3 0.05 0.25 0.50 0.20  3  −5 
     
 2 4 0 0.10 0.30 0.60  4  25 

51113_C005.indd 288 9/23/2010 5:43:34 PM


A Markov Decision Process (MDP) 289

TABLE 5.13a (continued)


Exhaustive Enumeration of 24 Policies and their Parameters for
MDP Model of Monthly Sales
12
␲ = [0.1189 0.1873 0.3718 0.3219], and 12
g = 4.1524

 2 1  0.45 0.05 0.20 0.30  1  −25


 2 2 0.30 0.40 0.25 0.05  2  10 
13
d =  , 13
P=  , 13
q=  ,
 1 3  0.05 0.65 0.25 0.05  3  −10 
     
 1 4  0.05 0.20 0.40 0.35  4  35 

13
␲ = [0.2308 0.3538 0.2615 0.1538], and 13
g = 0.536

 2 1  0.45 0.05 0.20 0.30  1  −25


 2 2 0.30 0.40 0.25 0.05 2  10 
14
d =  , 14
P=  , 14
q=  ,
 1 3  0.05 0.65 0.25 0.05 3  −10 
     
 2 4 0 0.10 0.30 0.60  4  25 

14
␲ = [0.2006 0.3258 0.2511 0.2225], and 14
g = 1.2945

 2 1  0.45 0.05 0.20 0.30  1  −25


 2 2 0.30 0.40 0.25 0.05 2  10 
15
d =  , 15
P=  , 15
q=  ,
 2 3  0.05 0.25 0.50 0.20  3  −5 
     
 1 4  0.05 0.20 0.40 0.35 4  35 

15
␲ = [0.1827 0.2385 0.3641 0.2147], and 15
g = 3.5115

 2 1  0.45 0.05 0.20 0.30  1  −25


 2 2 0.30 0.40 0.25 0.05 2  10 
16
d =  , 16
P=  , 16
q=  ,
 2 3  0.05 0.25 0.50 0.20  3  −5 
     
 2 4 0 0.10 0.30 0.60  4  25 

16
␲ = [0.1438 0.2063 0.3441 0.3057], and 16
g = 4.39

3 1 0.60 0.30 0.10 0  1  −20 


 1 2  0.25 0.30 0.35 0.10  2 5 
17
d =  , 17
P=  , 17
q=  ,
 1 3  0.05 0.65 0.25 0.05 3  −10 
     
 1 4  0.05 0.20 0.40 0.35 4  35 

17
␲ = [0.2811 0.3824 0.2579 0.0787], and 17
g = −3.5345

continued

51113_C005.indd 289 9/23/2010 5:43:39 PM


290 Markov Chains and Decision Processes for Engineers and Managers

TABLE 5.13a (continued)


Exhaustive Enumeration of 24 Policies and their Parameters for
MDP Model of Monthly Sales
3 1 0.60 0.30 0.10 0  1  −20 
 1 2  0.25 0.30 0.35 0.10  2 5 
18
d =  , 18
P=  , 18
q=  ,
 1 3  0.05 0.65 0.25 0.05 3  −10 
     
 2 4 0 0.10 0.30 0.60  4  25 

18
␲ = [0.2594 0.3642 0.2537 0.1228], and 18
g = −2.834

3 1 0.60 0.30 0.10 0  1  −20 


 1 2  0.25 0.30 0.35 0.10  2 5 
19
d =  , 19
P=  , 19
q=  ,
 2 3  0.05 0.25 0.50 0.20  3  −5 
     
 1 4  0.05 0.20 0.40 0.35 4  35 

19
␲ = [0.2299 0.2674 0.3529 0.1497], and 19
g = 0.214

3 1 0.60 0.30 0.10 0  1  −20 


 1 2  0.25 0.30 0.35 0.10  2 5 
20
d =  , 20
P=  , 20
q=  ,
 2 3  0.05 0.25 0.50 0.20  3  −5 
     
 2 4 0 0.10 0.30 0.60  4  25 

20
␲ = [0.1908 0.2368 0.3421 0.2303], and 20
g = 1.4143

3 1 0.60 0.30 0.10 0  1  −20 


 2 2 0.30 0.40 0.25 0.05 2  10 
21
d =  , 21
P=  , 21
q=  ,
 1 3  0.05 0.65 0.25 0.05 3  −10 
     
 1 4  0.05 0.20 0.40 0.35 4  35 

21
␲ = [0.3380 0.4083 0.2064 0.0473], and 21
g = −3.0855

3 1 0.60 0.30 0.10 0  1  −20 


 2 2 0.30 0.40 0.25 0.05 2  10 
22
d =  , 22
P=  , 22
q=  ,
 1 3  0.05 0.65 0.25 0.05 3  −10 
     
 2 4 0 0.10 0.30 0.60  4  25 

22
␲ = [0.3230 0.3965 0.2053 0.0752], and 22
g = −2.668

3 1 0.60 0.30 0.10 0  1  −20 


 2 2 0.30 0.40 0.25 0.05 2  10 
23
d =  , 23
P=  , 23
q=  ,
 2 3  0.05 0.25 0.50 0.20  3  −5 
     
 1 4  0.05 0.20 0.40 0.35 4  35 

23
␲ = [0.2799 0.3038 0.3005 0.1158], and 23
g = −0.01

51113_C005.indd 290 9/23/2010 5:43:42 PM


A Markov Decision Process (MDP) 291

TABLE 5.13a (continued)


Exhaustive Enumeration of 24 Policies and their Parameters for
MDP Model of Monthly Sales
3 1 0.60 0.30 0.10 0  1  −20 
 2 2 0.30 0.40 0.25 0.05 2  10 
24
d =  , 24
P=  , 24
q=  ,
 2 3  0.05 0.25 0.50 0.20  3  −5 
     
 2 4 0 0.10 0.30 0.60  4  25 

24
␲ = [0.2443 0.2762 0.2967 0.1829], and 24
g = 0.9655

5.1.2.3.2 Value Iteration


An optimal policy for a recurrent MDP over an infinite planning horizon
can be found with a relatively small computational effort by executing value
iteration over a large number of periods. Value iteration over an infinite hori-
zon for a recurrent MDP is similar to value iteration for a recurrent MCR,
which is treated in Section 4.2.3.3. As Section 4.2.3.3 indicates, the sequence
of expected total rewards produced by value iteration will not always con-
verge. When value iteration does converge, it will identify an optimal policy
for an MDP, and will also produce approximate solutions for the gain and
the expected relative rewards earned in every state.

5.1.2.3.2.1 Value Iteration Algorithm By extending the value iteration
algorithm for an MCR in Section 4.2.3.3.4, the following value iteration
algorithm in its most basic form (adapted from Puterman [3]) can be applied to
a recurrent MDP over an infinite horizon. This value iteration algorithm
assumes that epochs are numbered as consecutive negative integers from
−n at the beginning of the horizon to 0 at the end.
Step 1. Select arbitrary salvage values for
vi(0), for i = 1, 2, . . . , N. For simplicity, set vi(0) = 0. Specify ε > 0. Set n = −1.
Step 2. For each state i, use the value iteration equation to compute

$$
v_i(n) = \max_{k} \left[ q_i^k + \sum_{j=1}^{N} p_{ij}^k \, v_j(n+1) \right], \quad \text{for } i = 1, 2, \ldots, N.
$$

Step 3. If

$$
\max_{i = 1, 2, \ldots, N} \left[ v_i(n) - v_i(n+1) \right] - \min_{i = 1, 2, \ldots, N} \left[ v_i(n) - v_i(n+1) \right] < \varepsilon,
$$

go to step 4. Otherwise, decrement n by 1 and return to step 2.
Step 4. For each state i, choose the decision di = k which maximizes the value
of vi(n), and stop.
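The four steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than a reference implementation: it assumes NumPy, uses zero salvage values, and introduces the names `DATA`, `value_iteration`, `eps`, and `max_iter` purely for illustration. The data are the transition probabilities and rewards of the monthly sales model as they appear in Table 5.13a.

```python
import numpy as np

# DATA[i][k-1] = (reward q_i^k, transition row p_i^k) for state i+1, decision k.
DATA = [
    [(-30, [0.15, 0.40, 0.35, 0.10]), (-25, [0.45, 0.05, 0.20, 0.30]),
     (-20, [0.60, 0.30, 0.10, 0.00])],
    [(5, [0.25, 0.30, 0.35, 0.10]), (10, [0.30, 0.40, 0.25, 0.05])],
    [(-10, [0.05, 0.65, 0.25, 0.05]), (-5, [0.05, 0.25, 0.50, 0.20])],
    [(35, [0.05, 0.20, 0.40, 0.35]), (25, [0.00, 0.10, 0.30, 0.60])],
]

def value_iteration(data, eps=1e-3, max_iter=10_000):
    """Return (v, policy, (min diff, max diff), iterations performed)."""
    v = np.zeros(len(data))                      # Step 1: v_i(0) = 0
    for it in range(1, max_iter + 1):
        # Step 2: test quantities q_i^k + sum_j p_ij^k v_j(n + 1).
        tq = [[q + np.dot(p, v) for q, p in alts] for alts in data]
        v_new = np.array([max(row) for row in tq])
        # Step 4 (tracked each pass): maximizing decision, 1-based.
        policy = [int(np.argmax(row)) + 1 for row in tq]
        diff = v_new - v                         # v_i(n) - v_i(n + 1)
        v = v_new
        # Step 3: stop when the span of the differences falls below eps.
        if diff.max() - diff.min() < eps:
            break
    return v, policy, (float(diff.min()), float(diff.max())), it

v, policy, (g_lo, g_hi), n_iter = value_iteration(DATA)
print(policy, round(g_lo, 4), round(g_hi, 4), n_iter)
```

Calling `value_iteration(DATA, eps=0.0, max_iter=7)` forces exactly seven passes, which corresponds to a seven-period horizon.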


TABLE 5.13b
Expected Total Rewards and Optimal Decisions during the Last
7 Months of an Infinite Planning Horizon

Epoch n   −7        −6        −5         −4         −3       −2      −1    0
v1(n)     −2.9006   −7.4352   −12.0932   −16.8419   −21.3    −24     −20   0
v2(n)     23.3668   19.0253   14.8707    11.2288    8.7625   8.5     10    0
v3(n)     22.3222   18.0226   13.8204    9.7431     5.675    1       −5    0
v4(n)     74.0882   69.6315   64.9971    59.9188    53.9     46.25   35    0
d1(n)     2         2         2          2          2        2       3     −
d2(n)     2         2         2          2          2        2       2     −
d3(n)     2         2         2          2          2        2       2     −
d4(n)     2         2         2          2          2        1       1     −

5.1.2.3.2.2 Solution by Value Iteration of MDP Model of Monthly Sales In


Section 5.1.2.2.2, value iteration was executed to find an optimal policy for
a four-state MDP model of monthly sales over a 7-month planning horizon.
The data are given in Table 5.3. The solution is summarized in Table 5.11 of
Section 5.1.2.2.2, and is repeated in Table 5.13b to show the expected total
rewards received and the optimal decisions made during the last 7 months
of an infinite planning horizon. In accordance with the treatment of an MCR
model of monthly sales in Section 4.2.3.3.5, the last eight epochs of an infinite
horizon are numbered sequentially in Table 5.13b as −7, −6, −5, −4, −3, −2,
−1, and 0. Epoch 0 denotes the end of the horizon. As Sections 4.2.3.3.1 and
4.2.3.3.5 indicate, the absolute value of a negative epoch in Table 5.13b repre-
sents the number of months remaining in the horizon.
Table 5.13b shows that when 1 month remains in an infinite planning hori-
zon, the optimal policy is to maximize the expected reward in every state.
When 2 months remain, the expected total rewards are maximized by mak-
ing decision 2 in states 1, 2, and 3, and decision 1 in state 4. When 3 or more
months remain, the optimal policy is to make decision 2 in every state. Thus,
as value iteration proceeds backward from the end of the horizon, conver-
gence to an optimal policy given by the decision vector d(n) = [2 2 2 2]T
appears to have occurred at n = −3. This optimal policy was identified by
exhaustive enumeration in Section 5.1.2.3.1. It can be shown that as n becomes
very large, value iteration will converge to an optimal policy.
Table 5.14 gives the differences between the expected total rewards earned
over planning horizons which differ in length by one period.
In Table 5.14, a suffix U identifies gU(T), the maximum difference for each
epoch. The suffix L identifies gL(T), the minimum difference for each epoch.
The differences, gU(T) − gL(T), obtained for all the epochs are listed in the bottom
row of Table 5.14.

TABLE 5.14
Differences between the Expected Total Rewards Earned Over Planning
Horizons Which Differ in Length by One Period

Epoch n                            −7        −6        −5        −4        −3        −2        −1
Difference                         vi(−7)    vi(−6)    vi(−5)    vi(−4)    vi(−3)    vi(−2)    vi(−1)
vi(n) − vi(n + 1)                  −vi(−6)   −vi(−5)   −vi(−4)   −vi(−3)   −vi(−2)   −vi(−1)   −vi(0)
i = 1                              4.5346U   4.658U    4.7487    4.4581    2.7       −4L       −20L
i = 2                              4.3415    4.1546L   3.6419L   2.4663L   0.2625L   −1.5      10
i = 3                              4.2996L   4.2022    4.0773    4.0681    4.675     6         −5
i = 4                              4.4567    4.6344    5.0783U   6.0188U   7.65U     11.25U    35U
gU(T) = max [vi(n) − vi(n + 1)]    4.5346    4.6580    5.0783    6.0188    7.65      11.25     35
gL(T) = min [vi(n) − vi(n + 1)]    4.2996    4.1546    3.6419    2.4663    0.2625    −4        −20
gU(T) − gL(T)                      0.235     0.5034    1.4364    3.5525    7.3875    15.25     55

In Section 5.1.2.3.1, the optimal policy, $^{16}d = [2\ 2\ 2\ 2]^T$, found
by exhaustive enumeration, has a gain of 4.39. When seven periods remain in
the planning horizon, Table 5.14 shows that the bounds on the gain obtained
by value iteration are given by 4.2996 ≤ g ≤ 4.5346. The gain is approximately
equal to the arithmetic average of its upper and lower bounds, so that

g ≈ (4.5346 + 4.2996)/2 = 4.4171.

As the planning horizon is lengthened beyond seven periods, tighter bounds
on the gain will be obtained. The bottom row of Table 5.14 shows that if an
analyst chooses an ε < 0.235, then more than seven iterations of value
iteration will be needed before the value iteration algorithm can be assumed to
have converged to an optimal policy.
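As a small worked check of the bounds on the gain, the following Python sketch recomputes them from the epoch −7 and epoch −6 columns of Table 5.13b. The variable names are illustrative only.

```python
# Expected total rewards copied from Table 5.13b (states 1 to 4).
v_minus7 = [-2.9006, 23.3668, 22.3222, 74.0882]
v_minus6 = [-7.4352, 19.0253, 18.0226, 69.6315]

# Differences v_i(-7) - v_i(-6) for each state i.
diffs = [a - b for a, b in zip(v_minus7, v_minus6)]

g_lower = min(diffs)                  # lower bound on the gain
g_upper = max(diffs)                  # upper bound on the gain
g_estimate = (g_lower + g_upper) / 2  # midpoint used as the estimate
span = g_upper - g_lower              # compared with epsilon in step 3

print(round(g_lower, 4), round(g_upper, 4), round(g_estimate, 4), round(span, 4))
```

The span of 0.235 is the quantity compared with ε in step 3 of the algorithm, so any ε below 0.235 requires further iterations.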
Table 5.15 gives the expected relative rewards, vi(n) − v4(n), received during
the last seven epochs of the planning horizon.
Table 5.16 compares the expected relative rewards, vi(−7) − v4(−7), received after
seven repetitions of value iteration, with the relative values, vi, for i = 1, 2, and 3.
The relative values are obtained by policy iteration (PI) in Section 5.1.2.3.3.3, and
by linear programming (LP) as dual variables in Section 5.1.2.3.4.3.
Tables 5.15 and 5.16 have demonstrated that as the horizon grows longer
and n → ∞, the expected relative reward vi(n) − v4(n) received for starting in
each state i approaches the relative value vi when the relative value for the
highest numbered state, v4, is set equal to 0.


TABLE 5.15
Expected Relative Rewards, vi(n) − v4(n), Received during the Last Seven
Epochs of the Planning Horizon

Epoch n         −7         −6         −5         −4         −3         −2       −1    0
v1(n) − v4(n)   −76.9888   −77.0667   −77.0903   −76.7607   −75.2      −70.25   −55   0
v2(n) − v4(n)   −50.7214   −50.6062   −50.1264   −48.69     −45.1375   −37.75   −25   0
v3(n) − v4(n)   −51.766    −51.6089   −51.1767   −50.1757   −48.225    −45.25   −40   0

TABLE 5.16
Expected Relative Rewards, vi(−7) − v4(−7), Received after Seven
Repetitions of Value Iteration, Compared with the Relative Values

Seven Period Horizon             Infinite Horizon
(Value Iteration)                (Policy Iteration, Linear Programming)
v1(−7) − v4(−7) = −76.9888       v1 = −76.8825
v2(−7) − v4(−7) = −50.7214       v2 = −50.6777
v3(−7) − v4(−7) = −51.766        v3 = −51.8072

5.1.2.3.3 Policy Iteration (PI)


In 1960, Ronald A. Howard [2] published an efficient algorithm, which he
called PI, for finding an optimal stationary policy for a unichain MDP over
an infinite planning horizon. He extended PI to a multichain MDP, which
is treated in Section 5.1.4.2, and to a discounted MDP, which is treated in
Section 5.2.2.2. In this section, PI is applied to a recurrent MDP. Recall that an
MDP is termed recurrent when the transition probability matrix correspond-
ing to every stationary policy is irreducible. A recurrent MDP is a special
case of a unichain MDP without transient states. As Section 5.1.3 indicates,
Howard’s PI algorithm for a recurrent MDP also applies to a unichain MDP.
He proved that the algorithm for a recurrent MDP will converge to an opti-
mal policy which will maximize the gain. The PI algorithm has two main
steps: the value determination (VD) operation and the policy improvement
(IM) routine. The algorithm begins by arbitrarily choosing an initial policy.
During the VD operation, the VD equations (VDEs) (4.62) corresponding to
the current policy are solved for the gain and the relative values. (Recall that
the VDEs for a recurrent MCR were informally derived in Section 4.2.3.2.2.)
The IM routine attempts to find a better policy. If a better policy is found, the
VD operation is repeated using the new policy to identify the appropriate
transition probabilities, rewards, and VDEs. The algorithm stops when two
successive iterations lead to identical policies.


5.1.2.3.3.1 Test Quantity for Policy Improvement (IM) The IM routine is based
on the value iteration Equation (5.4) for an MDP. Equation (5.4) indicates that
if an optimal policy is known over a planning horizon starting at epoch n + 1
and ending at epoch T, then the best decision in state i at epoch n can be
found by maximizing a test quantity,

$$
q_i^k + \sum_{j=1}^{N} p_{ij}^k \, v_j(n+1), \tag{5.20}
$$

over all decisions in state i. Therefore, if an optimal policy is known over a
planning horizon starting at epoch 1 and ending at epoch T, then at epoch 0,
the beginning of the planning horizon, the best decision in state i can be
found by maximizing a test quantity,

$$
q_i^k + \sum_{j=1}^{N} p_{ij}^k \, v_j(1), \tag{5.21}
$$

over all decisions in state i. Recall from Section 4.2.3.2.2 that when T, the
length of the planning horizon, is very large, (T − 1) is also very large, so that

$$
v_j(1) \approx (T-1)\,g + v_j. \tag{4.60}
$$

Substituting this expression for vj(1) in the test quantity produces the
result

$$
\begin{aligned}
q_i^k + \sum_{j=1}^{N} p_{ij}^k \, v_j(1)
&= q_i^k + \sum_{j=1}^{N} p_{ij}^k \left[ (T-1)\,g + v_j \right] \\
&= q_i^k + (T-1)\,g \sum_{j=1}^{N} p_{ij}^k + \sum_{j=1}^{N} p_{ij}^k \, v_j \\
&= q_i^k + (T-1)\,g + \sum_{j=1}^{N} p_{ij}^k \, v_j
\end{aligned}
\tag{5.22}
$$

as the test quantity to be maximized with respect to all alternatives in every
state. Since the term (T − 1)g is independent of k, the test quantity to be
maximized when making decisions in state i is

$$
q_i^k + \sum_{j=1}^{N} p_{ij}^k \, v_j, \tag{5.23}
$$

for i = 1, 2, . . . , N. The relative values produced by solving the VDEs
associated with the most recent policy can be used.


5.1.2.3.3.2 Policy Iteration Algorithm The detailed steps of Howard’s PI
algorithm are given below:
Step 1. Initial policy
Arbitrarily choose an initial policy by selecting for each state i a decision di = k.
Step 2. VD operation
Use pij and qi for a given policy to solve the VDEs (4.62),

$$
g + v_i = q_i + \sum_{j=1}^{N} p_{ij} \, v_j, \quad i = 1, 2, \ldots, N,
$$

for all relative values vi and the gain g by setting vN = 0.
Step 3. IM routine
For each state i, find the decision k∗ that maximizes the test quantity,

$$
q_i^k + \sum_{j=1}^{N} p_{ij}^k \, v_j
$$

using the relative values vi of the previous policy. Then k∗ becomes the new
decision in state i, so that $d_i = k^*$, $q_i^{k^*}$ becomes $q_i$, and $p_{ij}^{k^*}$ becomes $p_{ij}$.
Step 4. Stopping rule
When the policies on two successive iterations are identical, the algorithm
stops because an optimal policy has been found. Leave the old di unchanged
if the test quantity for the old di is equal to the test quantity for any other
alternative in the new policy determination. If the new policy is different
from the previous policy in at least one state, go to step 2.
Howard proved that the gain of each policy will be greater than or equal to that
of its predecessor. The algorithm will terminate after a finite number of iterations.
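The steps of the PI algorithm can be sketched in Python as shown below. This is an illustrative sketch, not Howard's own formulation: it assumes NumPy, and the names `DATA`, `evaluate`, `improve`, `policy_iteration`, and the tie tolerance `tol` are introduced here for illustration. The data are the transition probabilities and rewards of the monthly sales model as they appear in Table 5.13a.

```python
import numpy as np

# DATA[i][k-1] = (reward q_i^k, transition row p_i^k) for state i+1, decision k.
DATA = [
    [(-30, [0.15, 0.40, 0.35, 0.10]), (-25, [0.45, 0.05, 0.20, 0.30]),
     (-20, [0.60, 0.30, 0.10, 0.00])],
    [(5, [0.25, 0.30, 0.35, 0.10]), (10, [0.30, 0.40, 0.25, 0.05])],
    [(-10, [0.05, 0.65, 0.25, 0.05]), (-5, [0.05, 0.25, 0.50, 0.20])],
    [(35, [0.05, 0.20, 0.40, 0.35]), (25, [0.00, 0.10, 0.30, 0.60])],
]

def evaluate(data, policy):
    """Step 2: solve g + v_i = q_i + sum_j p_ij v_j with v_N = 0."""
    n = len(data)
    P = np.array([data[i][k - 1][1] for i, k in enumerate(policy)])
    q = np.array([data[i][k - 1][0] for i, k in enumerate(policy)], dtype=float)
    A = np.eye(n) - P
    A[:, -1] = 1.0            # v_N is fixed at 0, so its column carries g
    x = np.linalg.solve(A, q)
    return float(x[-1]), np.append(x[:-1], 0.0)

def improve(data, policy, v, tol=1e-9):
    """Step 3: maximize the test quantity, keeping the old decision on ties."""
    new_policy = []
    for i, alts in enumerate(data):
        tq = [q + np.dot(p, v) for q, p in alts]
        best = policy[i]
        for k, t in enumerate(tq, start=1):
            if t > tq[best - 1] + tol:
                best = k
        new_policy.append(best)
    return new_policy

def policy_iteration(data, policy):
    """Alternate steps 2 and 3 until the policy repeats (step 4)."""
    policy = list(policy)
    history = [policy]
    while True:
        g, v = evaluate(data, policy)
        new_policy = improve(data, policy, v)
        if new_policy == policy:
            return policy, g, v, history
        policy = new_policy
        history.append(policy)

final_policy, g, v, history = policy_iteration(DATA, [3, 2, 2, 1])
print(history, round(g, 2))
```

Keeping the old decision when test quantities tie mirrors the stopping rule in step 4 and prevents the routine from cycling between equally good policies.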
5.1.2.3.3.3 Solution by PI of MDP Model of Monthly Sales Policy iteration
will be executed to find an optimal policy over an infinite horizon for the
MDP model of monthly sales specified in Table 5.3. An optimal policy was
obtained by exhaustive enumeration in Section 5.1.2.3.1, and by value itera-
tion in Section 5.1.2.3.2.2.
First iteration
Step 1. Initial policy
Arbitrarily choose the initial policy $^{23}d = [3\ 2\ 2\ 1]^T$ by making decision
3 in state 1, decision 2 in states 2 and 3, and decision 1 in state 4. Thus, d1 = 3,
d2 = d3 = 2, and d4 = 1. The initial decision vector $^{23}d$, along with the associated
transition probability matrix $^{23}P$ and the reward vector $^{23}q$, are shown below
(rows are indexed by states 1 to 4):

$$
{}^{23}d = \begin{bmatrix} 3 \\ 2 \\ 2 \\ 1 \end{bmatrix}, \quad
{}^{23}P = \begin{bmatrix}
0.60 & 0.30 & 0.10 & 0 \\
0.30 & 0.40 & 0.25 & 0.05 \\
0.05 & 0.25 & 0.50 & 0.20 \\
0.05 & 0.20 & 0.40 & 0.35
\end{bmatrix}, \quad
{}^{23}q = \begin{bmatrix} -20 \\ 10 \\ -5 \\ 35 \end{bmatrix}.
$$


Step 2. VD operation
Use pij and qi for the initial policy, $^{23}d = [3\ 2\ 2\ 1]^T$, to solve the VDEs
(4.62)

g + v1 = q1 + p11v1 + p12 v2 + p13 v3 + p14 v4


g + v2 = q2 + p21v1 + p22 v2 + p23 v3 + p24 v4
g + v3 = q3 + p31v1 + p32 v2 + p33 v3 + p34 v4
g + v4 = q4 + p41v1 + p42 v2 + p43 v3 + p44 v4

for all relative values vi and the gain g.

g + v1 = −20 + 0.60v1 + 0.30v2 + 0.10v3 + 0v4


g + v2 = 10 + 0.30v1 + 0.40v2 + 0.25v3 + 0.05v4
g + v3 = −5 + 0.05v1 + 0.25v2 + 0.50v3 + 0.20v4
g + v4 = 35 + 0.05v1 + 0.20v2 + 0.40v3 + 0.35v4 .

Setting v4 = 0, the solution of the VDEs is

g = −0.01, v1 = −102.6731, v2 = −54.4350, v3 = −47.4686, v4 = 0.
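The solution of these four VDEs can be reproduced numerically. The sketch below assumes NumPy and treats (v1, v2, v3, g) as the unknowns after setting v4 = 0. The variable names are illustrative.

```python
import numpy as np

# Transition matrix and reward vector of the initial policy [3 2 2 1]^T.
P = np.array([
    [0.60, 0.30, 0.10, 0.00],
    [0.30, 0.40, 0.25, 0.05],
    [0.05, 0.25, 0.50, 0.20],
    [0.05, 0.20, 0.40, 0.35],
])
q = np.array([-20.0, 10.0, -5.0, 35.0])

# Rewrite g + v_i = q_i + sum_j p_ij v_j as (I - P) v + g * 1 = q.
# With v4 = 0, the fourth column is free and is used for the unknown g.
A = np.eye(4) - P
A[:, 3] = 1.0

v1, v2, v3, g = np.linalg.solve(A, q)
print(round(g, 4), round(v1, 4), round(v2, 4), round(v3, 4))
```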

Step 3. IM routine
For each state i, find the decision k* that maximizes the test quantity

$$
q_i^k + \sum_{j=1}^{4} p_{ij}^k \, v_j
$$

using the relative values vi of the initial policy. Then k* becomes the new
decision in state i, so that $d_i = k^*$, $q_i^{k^*}$ becomes $q_i$, and $p_{ij}^{k^*}$ becomes $p_{ij}$. The first
policy improvement routine is executed in Table 5.17.
Step 4. Stopping rule
The new policy is $^{12}d = [2\ 1\ 2\ 2]^T$, which is different from the initial
policy. Therefore, go to step 2. The new decision vector $^{12}d$, along with the
associated transition probability matrix $^{12}P$ and the reward vector $^{12}q$, are
shown below:

$$
{}^{12}d = \begin{bmatrix} 2 \\ 1 \\ 2 \\ 2 \end{bmatrix}, \quad
{}^{12}P = \begin{bmatrix}
0.45 & 0.05 & 0.20 & 0.30 \\
0.25 & 0.30 & 0.35 & 0.10 \\
0.05 & 0.25 & 0.50 & 0.20 \\
0 & 0.10 & 0.30 & 0.60
\end{bmatrix}, \quad
{}^{12}q = \begin{bmatrix} -25 \\ 5 \\ -5 \\ 25 \end{bmatrix}.
$$


TABLE 5.17
First IM for Monthly Sales Example

Test quantity: $q_i^k + p_{i1}^k v_1 + p_{i2}^k v_2 + p_{i3}^k v_3 + p_{i4}^k v_4 = q_i^k + p_{i1}^k(-102.6731) + p_{i2}^k(-54.4350) + p_{i3}^k(-47.4686) + p_{i4}^k(0)$

State i   Decision k   Test Quantity
1         1            −30 + 0.15(−102.6731) + 0.40(−54.4350) + 0.35(−47.4686) + 0.10(0) = −83.7890
1         2            −25 + 0.45(−102.6731) + 0.05(−54.4350) + 0.20(−47.4686) + 0.30(0) = −83.4184
1         3            −20 + 0.60(−102.6731) + 0.30(−54.4350) + 0.10(−47.4686) + 0(0) = −102.6812
2         1            5 + 0.25(−102.6731) + 0.30(−54.4350) + 0.35(−47.4686) + 0.10(0) = −53.6128
2         2            10 + 0.30(−102.6731) + 0.40(−54.4350) + 0.25(−47.4686) + 0.05(0) = −54.4431
3         1            −10 + 0.05(−102.6731) + 0.65(−54.4350) + 0.25(−47.4686) + 0.05(0) = −62.3836
3         2            −5 + 0.05(−102.6731) + 0.25(−54.4350) + 0.50(−47.4686) + 0.20(0) = −47.4767
4         1            35 + 0.05(−102.6731) + 0.20(−54.4350) + 0.40(−47.4686) + 0.35(0) = −0.0081
4         2            25 + 0(−102.6731) + 0.10(−54.4350) + 0.30(−47.4686) + 0.60(0) = 5.3159

State i   Maximum Value of Test Quantity                      Decision di = k*
1         max [−83.7890, −83.4184, −102.6812] = −83.4184      d1 = 2
2         max [−53.6128, −54.4431] = −53.6128                 d2 = 1
3         max [−62.3836, −47.4767] = −47.4767                 d3 = 2
4         max [−0.0081, 5.3159] = 5.3159                      d4 = 2

Step 2. VD operation
Use pij and qi for the first new policy, 12 d = [2 1 2 2]T, to solve the VDEs
(4.62) for all relative values vi and the gain g.


g + v1 = −25 + 0.45v1 + 0.05v2 + 0.20v3 + 0.30v4


g + v2 = 5 + 0.25v1 + 0.30v2 + 0.35v3 + 0.10v4
g + v3 = −5 + 0.05v1 + 0.25v2 + 0.50v3 + 0.20v4
g + v4 = 25 + 0v1 + 0.10v2 + 0.30v3 + 0.60v4 .

Setting v4 = 0, the solution of the VDEs is

g = 4.15, v1 = −76.6917, v2 = −52.2215, v3 = −52.0848, v4 = 0.

Step 3. IM routine
For each state i, find the decision k* that maximizes the test quantity

$$
q_i^k + \sum_{j=1}^{4} p_{ij}^k \, v_j
$$

using the relative values vi of the previous policy. Then k* becomes the new
decision in state i, so that $d_i = k^*$, $q_i^{k^*}$ becomes $q_i$, and $p_{ij}^{k^*}$ becomes $p_{ij}$. The
second policy improvement routine is executed in Table 5.18.
Step 4. Stopping rule
The new policy is $^{16}d = [2\ 2\ 2\ 2]^T$, which is different from the previous
policy. Therefore, go to step 2. The new decision vector $^{16}d$, along with the
associated transition probability matrix $^{16}P$, and the reward vector $^{16}q$, are
shown below:

$$
{}^{16}d = \begin{bmatrix} 2 \\ 2 \\ 2 \\ 2 \end{bmatrix}, \quad
{}^{16}P = \begin{bmatrix}
0.45 & 0.05 & 0.20 & 0.30 \\
0.30 & 0.40 & 0.25 & 0.05 \\
0.05 & 0.25 & 0.50 & 0.20 \\
0 & 0.10 & 0.30 & 0.60
\end{bmatrix}, \quad
{}^{16}q = \begin{bmatrix} -25 \\ 10 \\ -5 \\ 25 \end{bmatrix}
\tag{5.24}
$$

Step 2. VD operation
Use pij and qi for the second new policy, 16 d = [2 2 2 2]T, to solve the VDEs
(4.62) for all relative values vi and the gain g.

g + v1 = −25 + 0.45v1 + 0.05v2 + 0.20v3 + 0.30v4


g + v2 = 10 + 0.30v1 + 0.40v2 + 0.25v3 + 0.05v4
g + v3 = −5 + 0.05v1 + 0.25v2 + 0.50v3 + 0.20v4
g + v4 = 25 + 0v1 + 0.10v2 + 0.30v3 + 0.60v4 .

Setting v4 = 0, the solution of the VDEs is


g = 4.39, v1 = −76.8825, v2 = −50.6777, v3 = −51.8072, v4 = 0. (5.25)


TABLE 5.18
Second IM for Monthly Sales Example

Test quantity: $q_i^k + p_{i1}^k v_1 + p_{i2}^k v_2 + p_{i3}^k v_3 + p_{i4}^k v_4 = q_i^k + p_{i1}^k(-76.6917) + p_{i2}^k(-52.2215) + p_{i3}^k(-52.0848) + p_{i4}^k(0)$

State i   Decision k   Test Quantity
1         1            −30 + 0.15(−76.6917) + 0.40(−52.2215) + 0.35(−52.0848) + 0.10(0) = −80.6220
1         2            −25 + 0.45(−76.6917) + 0.05(−52.2215) + 0.20(−52.0848) + 0.30(0) = −72.5393
1         3            −20 + 0.60(−76.6917) + 0.30(−52.2215) + 0.10(−52.0848) + 0(0) = −86.8900
2         1            5 + 0.25(−76.6917) + 0.30(−52.2215) + 0.35(−52.0848) + 0.10(0) = −48.0691
2         2            10 + 0.30(−76.6917) + 0.40(−52.2215) + 0.25(−52.0848) + 0.05(0) = −46.9173
3         1            −10 + 0.05(−76.6917) + 0.65(−52.2215) + 0.25(−52.0848) + 0.05(0) = −60.7998
3         2            −5 + 0.05(−76.6917) + 0.25(−52.2215) + 0.50(−52.0848) + 0.20(0) = −47.9324
4         1            35 + 0.05(−76.6917) + 0.20(−52.2215) + 0.40(−52.0848) + 0.35(0) = −0.1128
4         2            25 + 0(−76.6917) + 0.10(−52.2215) + 0.30(−52.0848) + 0.60(0) = 4.1524

State i   Maximum Value of Test Quantity                     Decision di = k*
1         max [−80.6220, −72.5393, −86.8900] = −72.5393      d1 = 2
2         max [−48.0691, −46.9173] = −46.9173                d2 = 2
3         max [−60.7998, −47.9324] = −47.9324                d3 = 2
4         max [−0.1128, 4.1524] = 4.1524                     d4 = 2

Step 3. IM routine
For each state i, find the decision k* that maximizes the test quantity

$$
q_i^k + \sum_{j=1}^{4} p_{ij}^k \, v_j
$$

using the relative values vi of the previous policy. Then k* becomes the new
decision in state i, so that $d_i = k^*$, $q_i^{k^*}$ becomes $q_i$, and $p_{ij}^{k^*}$ becomes $p_{ij}$. The third
policy improvement routine is executed in Table 5.19.

TABLE 5.19
Third IM for Monthly Sales Example

Test quantity: $q_i^k + p_{i1}^k v_1 + p_{i2}^k v_2 + p_{i3}^k v_3 + p_{i4}^k v_4 = q_i^k + p_{i1}^k(-76.8825) + p_{i2}^k(-50.6777) + p_{i3}^k(-51.8072) + p_{i4}^k(0)$

State i   Decision k   Test Quantity
1         1            −30 + 0.15(−76.8825) + 0.40(−50.6777) + 0.35(−51.8072) + 0.10(0) = −79.9360
1         2            −25 + 0.45(−76.8825) + 0.05(−50.6777) + 0.20(−51.8072) + 0.30(0) = −72.4925
1         3            −20 + 0.60(−76.8825) + 0.30(−50.6777) + 0.10(−51.8072) + 0(0) = −86.5135
2         1            5 + 0.25(−76.8825) + 0.30(−50.6777) + 0.35(−51.8072) + 0.10(0) = −47.5565
2         2            10 + 0.30(−76.8825) + 0.40(−50.6777) + 0.25(−51.8072) + 0.05(0) = −46.2876
3         1            −10 + 0.05(−76.8825) + 0.65(−50.6777) + 0.25(−51.8072) + 0.05(0) = −59.7364
3         2            −5 + 0.05(−76.8825) + 0.25(−50.6777) + 0.50(−51.8072) + 0.20(0) = −47.4172
4         1            35 + 0.05(−76.8825) + 0.20(−50.6777) + 0.40(−51.8072) + 0.35(0) = 0.2975
4         2            25 + 0(−76.8825) + 0.10(−50.6777) + 0.30(−51.8072) + 0.60(0) = 4.3901

State i   Maximum Value of Test Quantity                     Decision di = k*
1         max [−79.9360, −72.4925, −86.5135] = −72.4925      d1 = 2
2         max [−47.5565, −46.2876] = −46.2876                d2 = 2
3         max [−59.7364, −47.4172] = −47.4172                d3 = 2
4         max [0.2975, 4.3901] = 4.3901                      d4 = 2
Step 4. Stopping rule
Stop because the new policy, given by the vector $^{16}d = [2\ 2\ 2\ 2]^T$, is
identical to the previous policy. Therefore, this policy is optimal. The transition
probability matrix $^{16}P$ and the reward vector $^{16}q$ for the optimal policy are
shown in Equation (5.24). The relative values and the gain are given in
Equation (5.25). Equation (5.25) shows that the relative values obtained by PI
are identical to the corresponding dual variables obtained by LP in Table 5.23
of Section 5.1.2.3.4.3.

5.1.2.3.3.4 Optional Insight: IM for a Two-State MDP Step 3 of the PI
algorithm in Section 5.1.2.3.3.2 is the IM routine. To gain insight (adapted from
Howard [2]) into why the IM routine works, it is instructive to examine the
special case of a generic two-state recurrent MDP with two decisions per
state, which is shown in Table 5.2. To simplify the notation, all symbols for
transition probabilities and rewards in Table 5.2 are replaced by letters
without subscripts or superscripts in Table 5.20.
Since the process has two states with two decisions per state, there are $2^2 = 4$
possible policies, identified by the letters A, B, C, and D. The decision vectors
for the four policies are

$$
d^A = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad
d^B = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \quad
d^C = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \quad
d^D = \begin{bmatrix} 2 \\ 2 \end{bmatrix}.
$$

Suppose that policy A is selected. Policy A is characterized by the following
decision vector, transition probability matrix, reward vector, and relative
value vector:

$$
d^A = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad
P^A = \begin{bmatrix} 1-a & a \\ c & 1-c \end{bmatrix}, \quad
q^A = \begin{bmatrix} e \\ s \end{bmatrix}, \quad
v^A = \begin{bmatrix} v_1^A \\ v_2^A \end{bmatrix} = \begin{bmatrix} v_1^A \\ 0 \end{bmatrix},
$$

after setting $v_2^A = 0$.
Under policy A, using Equation (4.65), the gain is

$$
g^A = \frac{ce + as}{a + c}
$$

and the relative value for state 1 is

$$
v_1^A = \frac{e - s}{a + c}.
$$
TABLE 5.20
Two-State Generic MDP with Two Decisions Per State

                         Transition Probability
State i   Decision k     p_i1^k      p_i2^k      Reward q_i^k
1         1              1 − a       a           e
1         2              1 − b       b           f
2         1              c           1 − c       s
2         2              d           1 − d       h


Suppose that the evaluation of policy A by the IM routine has produced a
new policy, policy D. Policy D is characterized by the following decision
vector, transition probability matrix, and reward vector:

$$
d^D = \begin{bmatrix} 2 \\ 2 \end{bmatrix}, \quad
P^D = \begin{bmatrix} 1-b & b \\ d & 1-d \end{bmatrix}, \quad
q^D = \begin{bmatrix} f \\ h \end{bmatrix}.
$$

Using Equation (4.65), the gain under policy D is

$$
g^D = \frac{df + bh}{b + d}.
$$

The objective of this insight is to show that $g^D \ge g^A$. Since the IM routine has
chosen policy D over policy A, the test quantity for policy D must be greater
than or equal to the test quantity for policy A in every state. Therefore,
For i = 1,

$$
\begin{aligned}
q_1^D + p_{11}^D v_1^A + p_{12}^D v_2^A &\ge q_1^A + p_{11}^A v_1^A + p_{12}^A v_2^A \\
f + (1-b)\,v_1^A + b(0) &\ge e + (1-a)\,v_1^A + a(0) \\
f + (1-b)\,v_1^A &\ge e + (1-a)\,v_1^A.
\end{aligned}
$$

Let γ1 denote the improvement in the test quantity achieved in state 1.

$$
\gamma_1 = f + (1-b)\,v_1^A - e - (1-a)\,v_1^A = f - e + (a-b)\,v_1^A \ge 0.
$$

For i = 2,

$$
\begin{aligned}
q_2^D + p_{21}^D v_1^A + p_{22}^D v_2^A &\ge q_2^A + p_{21}^A v_1^A + p_{22}^A v_2^A \\
h + d\,v_1^A + (1-d)(0) &\ge s + c\,v_1^A + (1-c)(0) \\
h + d\,v_1^A &\ge s + c\,v_1^A.
\end{aligned}
$$

Let γ2 denote the improvement in the test quantity achieved in state 2.

$$
\gamma_2 = h + d\,v_1^A - s - c\,v_1^A = h - s + (d-c)\,v_1^A \ge 0.
$$


To show that $g^D \ge g^A$, calculate

$$
\begin{aligned}
g^D - g^A &= \frac{df + bh}{b + d} - \frac{ce + as}{a + c} \\
&= \frac{(a+c)(df + bh) - (b+d)(ce + as)}{(b+d)(a+c)} \\
&= \frac{d}{b+d}\left[\frac{(a-b)(e-s) + (a+c)(f-e)}{a+c}\right]
 + \frac{b}{b+d}\left[\frac{(d-c)(e-s) + (a+c)(h-s)}{a+c}\right] \\
&= \frac{d}{b+d}\left[(a-b)\,v_1^A + f - e\right]
 + \frac{b}{b+d}\left[(d-c)\,v_1^A + h - s\right] \\
&= \frac{d}{b+d}\,\gamma_1 + \frac{b}{b+d}\,\gamma_2.
\end{aligned}
$$

Equation (2.16) indicates that the steady-state probability vector under policy
D is

$$
\pi^D = \begin{bmatrix} \dfrac{d}{b+d} & \dfrac{b}{b+d} \end{bmatrix}.
$$

Hence,

$$
g^D - g^A = \pi_1^D \gamma_1 + \pi_2^D \gamma_2.
$$

Since $\pi_1^D > 0$, $\pi_2^D > 0$, $\gamma_1 \ge 0$, and $\gamma_2 \ge 0$, it follows that $g^D - g^A \ge 0$ for the
two-state process. Howard proves that for an N-state MDP, $g^D$ will be greater than $g^A$
if an improvement in the test quantity can be made in any state that will be
recurrent under policy D.
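The identity for the two-state process can be spot-checked numerically. The Python sketch below uses arbitrary illustrative values for the eight letters of Table 5.20. These values are assumptions chosen only so that both improvements are nonnegative; they do not come from the text.

```python
# Arbitrary illustrative parameters for Table 5.20 (not from the text).
a, b, c, d = 0.3, 0.6, 0.4, 0.2      # transition probabilities
e, f, s, h = 1.0, 2.0, 3.0, 5.0      # rewards

# Policy A: decision 1 in both states.
g_A = (c * e + a * s) / (a + c)
v1_A = (e - s) / (a + c)

# Policy D: decision 2 in both states.
g_D = (d * f + b * h) / (b + d)
pi_D = (d / (b + d), b / (b + d))    # steady-state vector under D

# Improvements in the test quantity, evaluated with policy A's relative value.
gamma1 = f - e + (a - b) * v1_A
gamma2 = h - s + (d - c) * v1_A

lhs = g_D - g_A
rhs = pi_D[0] * gamma1 + pi_D[1] * gamma2
print(round(lhs, 6), round(rhs, 6))
```

With these values both improvements are positive, so the IM routine would move from policy A to policy D, and the gain increases by exactly the steady-state weighted sum of the improvements.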

5.1.2.3.4 Linear Programming


An MDP can be formulated as a linear program (LP) when the planning hori-
zon is infinite. (In this section, and in Sections 5.1.3.2, 5.1.4.3, and 5.2.2.3, a basic
knowledge of LP is assumed.) In this section, MDPs will be formulated as
linear programs which will be solved by using a computer software package.
The advantage of an LP formulation is the availability of various computer
software packages for solving LPs. For example, the Excel spreadsheet has an
add-in called Solver which will solve both linear and nonlinear programs. The
disadvantage of an LP formulation is that it produces a much larger model of
an MDP than the corresponding model solved by PI. (In this book, both “lin-
ear programming” and “linear program” are abbreviated as “LP.”)

5.1.2.3.4.1 Formulation of an LP Model

The LP formulation for a recurrent N-state MDP assumes that the Markov chain associated with every transition matrix is regular [1,3]. As Section 2.1 indicates, the entries of the steady-state probability vector for a regular Markov chain are strictly positive and sum to one. To formulate an LP, one must identify decision variables, an objective function, and a set of constraints.
Decision Variables
The decision variable in the LP formulation is the joint probability of
being in state i and making decision k. This decision variable is denoted
by

\[ y_i^k = P(\text{state} = i \cap \text{decision} = k). \]

The marginal probability of being in state i is obtained by summing the joint probabilities over all values of k. Letting K_i denote the number of possible decisions in state i,

\[ P(\text{state} = i) = \sum_{k=1}^{K_i} P(\text{state} = i \cap \text{decision} = k) = \sum_{k=1}^{K_i} y_i^k. \tag{5.26} \]

In the steady state,

\[ \pi_i = P(\text{state} = i) = \sum_{k=1}^{K_i} y_i^k. \tag{5.27} \]

By the multiplication rule of probability,

P(state = i ∩ decision = k ) = P(state = i) P(decision = k |state = i) , or

y ik = π i P(decision = k |state = i) . (5.28)

Dividing both sides by πi = P(state = i),

P(decision = k |state = i) = y ik/π i . (5.29)

Substituting \pi_i = \sum_{k=1}^{K_i} y_i^k,

\[ P(\text{decision} = k \mid \text{state} = i) = \frac{y_i^k}{\sum_{k=1}^{K_i} y_i^k}. \tag{5.30} \]

To interpret these probabilities, consider the recurrent MDP model of monthly sales specified in Table 5.3. Consider the decision vector ^{20}d = [3 1 2 2]^T in Table 5.12. Expressed in terms of the conditional probabilities, the decision vector is

\[
{}^{20}d = \begin{bmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \end{bmatrix}
= \begin{bmatrix} 3 \\ 1 \\ 2 \\ 2 \end{bmatrix}
= \begin{bmatrix}
P(\text{decision} = 3 \mid \text{state} = 1) \\
P(\text{decision} = 1 \mid \text{state} = 2) \\
P(\text{decision} = 2 \mid \text{state} = 3) \\
P(\text{decision} = 2 \mid \text{state} = 4)
\end{bmatrix}. \tag{5.31}
\]

The conditional probabilities in a decision vector must assume values of zero


or one. For each state, only one conditional probability has a value of one; the
others have values of zero. The reason is that in each state i exactly one deci-
sion alternative is selected, and the remaining alternatives are rejected. The
decision alternative which is selected in state i has a conditional probability of
one. The remaining alternatives which are not selected in state i have condi-
tional probabilities of zero. Thus, if decision alternative h is selected in state i,
P(decision = h | state = i) = 1, and P(decision ≠ h | state = i) = 0, for all other deci-
sion alternatives in state i. For example, the model of monthly sales has three
decision alternatives in state 1. For the decision vector 20d = [3 1 2 2]T, alternative
3 is selected in state 1. Therefore, the conditional probabilities in state 1 are

P(decision = 1|state = 1) = 0, P(decision = 2|state = 1) = 0,

P(decision = 3|state = 1) = 1.

For each state, only one joint probability has a positive value; the others have values of zero. The joint probability y_i^k = P(state = i ∩ decision = k) > 0 if the associated conditional probability P(decision = k | state = i) = 1. The reason is that when P(decision = k | state = i) = 1, then P(state = i ∩ decision = k) = P(state = i) P(decision = k | state = i), so that

\[ y_i^k = \pi_i P(\text{decision} = k \mid \text{state} = i) = \pi_i (1) = \pi_i > 0. \]

Similarly, the joint probability y_i^k = P(state = i ∩ decision = k) = 0 if the associated conditional probability P(decision = k | state = i) = 0, because then

\[ y_i^k = \pi_i P(\text{decision} = k \mid \text{state} = i) = \pi_i (0) = 0. \]

For the decision vector ^{20}d = [3 1 2 2]^T, the joint probabilities in state 1 are

\[
\begin{aligned}
y_1^1 &= \pi_1 P(\text{decision} = 1 \mid \text{state} = 1) = \pi_1 (0) = 0, \\
y_1^2 &= \pi_1 P(\text{decision} = 2 \mid \text{state} = 1) = \pi_1 (0) = 0, \\
y_1^3 &= \pi_1 P(\text{decision} = 3 \mid \text{state} = 1) = \pi_1 (1) = \pi_1 = 0.1908 > 0,
\end{aligned}
\]

from Table 5.12.


Objective Function
The objective of the LP formulation for a recurrent MDP is to find a policy,
which will maximize the gain. The reward received in state i of an MCR is
denoted by qi. The reward received in state i of an MDP when decision k is
made is denoted by qik. The reward received in state i of an MDP is equal to
the reward received for each decision in state i weighted by the conditional
probability of making that decision, P(decision = k | state = i), so that

\[ q_i = \sum_{k=1}^{K_i} P(\text{decision} = k \mid \text{state} = i)\, q_i^k. \tag{5.32} \]

Using Equation (4.46), the equation for the gain of a recurrent MDP is

\[ g = \pi q = \sum_{i=1}^{N} \pi_i q_i = \sum_{i=1}^{N} \pi_i \sum_{k=1}^{K_i} P(\text{decision} = k \mid \text{state} = i)\, q_i^k. \]

Bringing \pi_i inside the second summation,

\[ g = \sum_{i=1}^{N} \sum_{k=1}^{K_i} \pi_i P(\text{decision} = k \mid \text{state} = i)\, q_i^k. \]

Using Equation (5.28),

\[ g = \sum_{i=1}^{N} \sum_{k=1}^{K_i} y_i^k q_i^k. \]

Therefore, the objective function is to find the decision variables y_i^k that will

\[ \text{Maximize } g = \sum_{i=1}^{N} \sum_{k=1}^{K_i} q_i^k y_i^k. \tag{5.33} \]

Constraints
The LP formulation has the following three sets of constraints. They represent Equation (2.12), which must be solved to find the steady-state probability distribution for a regular Markov chain.

\[ 1.\quad \pi_j = \sum_{i=1}^{N} \pi_i p_{ij}, \quad \text{for } j = 1, 2, \ldots, N \tag{5.34} \]

\[ 2.\quad \sum_{i=1}^{N} \pi_i = 1 \tag{5.35} \]

\[ 3.\quad \pi_i > 0, \quad \text{for } i = 1, 2, \ldots, N. \tag{5.36} \]


The three sets of constraints will be expressed in terms of the decision variables, y_i^k. The left-hand side of the set of constraints (5.34) is

\[ \pi_j = \sum_{k=1}^{K_j} y_j^k, \quad \text{for } j = 1, 2, \ldots, N. \tag{5.27} \]

The right-hand side of the set of constraints (5.34) is \sum_{i=1}^{N} \pi_i p_{ij}.

The probability of a transition from state i to state j is denoted by p_ij. The probability of a transition from state i to state j when decision k is made is denoted by p_ij^k. The probability of a transition from state i to state j is equal to the probability of a transition from state i to state j for each decision k in state i weighted by P(decision = k | state = i), the conditional probability of making such a decision, so that

\[ p_{ij} = \sum_{k=1}^{K_i} P(\text{decision} = k \mid \text{state} = i)\, p_{ij}^k. \tag{5.37} \]

Therefore, the right-hand side of the set of constraints (5.34) is

\[ \sum_{i=1}^{N} \pi_i p_{ij} = \sum_{i=1}^{N} \pi_i \sum_{k=1}^{K_i} P(\text{decision} = k \mid \text{state} = i)\, p_{ij}^k. \]

Bringing \pi_i inside the second summation,

\[ \sum_{i=1}^{N} \pi_i p_{ij} = \sum_{i=1}^{N} \sum_{k=1}^{K_i} \pi_i P(\text{decision} = k \mid \text{state} = i)\, p_{ij}^k. \]

Substituting y_i^k = \pi_i P(decision = k | state = i), the right-hand side of the set of constraints (5.34) is

\[ \sum_{i=1}^{N} \pi_i p_{ij} = \sum_{i=1}^{N} \sum_{k=1}^{K_i} y_i^k p_{ij}^k. \]

Hence, the set of constraints (5.34) expressed in terms of y_i^k is

\[ \sum_{k=1}^{K_j} y_j^k = \sum_{i=1}^{N} \sum_{k=1}^{K_i} y_i^k p_{ij}^k, \quad \text{for } j = 1, 2, \ldots, N. \]

The set of constraints (5.35) is the normalizing equation,

\[ \sum_{i=1}^{N} \pi_i = 1. \]

Substituting \pi_i = \sum_{k=1}^{K_i} y_i^k, the normalizing equation expressed in terms of y_i^k is

\[ \sum_{i=1}^{N} \sum_{k=1}^{K_i} y_i^k = 1. \]

The set of constraints (5.36) is the nonnegativity condition imposed by LP on the decision variables, y_i^k,

\[ y_i^k \ge 0, \quad \text{for } i = 1, 2, \ldots, N, \text{ and } k = 1, 2, \ldots, K_i. \]

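The complete LP formulation is

\[ \text{Maximize } g = \sum_{i=1}^{N} \sum_{k=1}^{K_i} q_i^k y_i^k \tag{5.38} \]

subject to

\[ \sum_{k=1}^{K_j} y_j^k - \sum_{i=1}^{N} \sum_{k=1}^{K_i} y_i^k p_{ij}^k = 0, \quad \text{for } j = 1, 2, \ldots, N \tag{5.39} \]

\[ \sum_{i=1}^{N} \sum_{k=1}^{K_i} y_i^k = 1 \tag{5.40} \]

\[ y_i^k \ge 0, \quad \text{for } i = 1, 2, \ldots, N, \text{ and } k = 1, 2, \ldots, K_i. \tag{5.41} \]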

Although this LP formulation has N + 1 constraints, only N of them are independent. One of the first N constraints (5.39) is redundant. The reason is that these constraints are based on the steady-state equations (5.34), one of which is redundant. The Nth constraint in this set will be deleted. That is, the constraint associated with a transition to state j = N,

\[ \sum_{k=1}^{K_N} y_N^k - \sum_{i=1}^{N} \sum_{k=1}^{K_i} y_i^k p_{iN}^k = 0, \tag{5.42} \]

will be deleted. The complete LP formulation, with the redundant constraint for j = N omitted, is

\[ \text{Maximize } g = \sum_{i=1}^{N} \sum_{k=1}^{K_i} q_i^k y_i^k \tag{5.43a} \]

subject to

\[
\left.
\begin{aligned}
&\sum_{k=1}^{K_j} y_j^k - \sum_{i=1}^{N} \sum_{k=1}^{K_i} y_i^k p_{ij}^k = 0, \quad \text{for } j = 1, 2, \ldots, N - 1 \\
&\sum_{i=1}^{N} \sum_{k=1}^{K_i} y_i^k = 1 \\
&y_i^k \ge 0, \quad \text{for } i = 1, 2, \ldots, N, \text{ and } k = 1, 2, \ldots, K_i.
\end{aligned}
\right\} \tag{5.43b}
\]

The LP formulation for a regular N-state MDP with K_i decisions per state has \sum_{i=1}^{N} K_i decision variables and N constraints, all of which are equations. In contrast, the N VDEs, which must be solved for each repetition of PI, have only N unknowns consisting of the N − 1 relative values, v_1, v_2, … , v_{N−1}, and the gain, g.
When an optimal solution for the LP decision variables, y_i^k, has been found, the conditional probabilities,

\[ P(\text{decision} = k \mid \text{state} = i) = \frac{y_i^k}{\sum_{k=1}^{K_i} y_i^k}, \]

can be calculated to determine the optimal decision in every state, thereby specifying an optimal policy. In the final tableau of the LP solution, with the redundant constraint (5.42) deleted so that the first N − 1 constraints remain, the dual variable associated with constraint i is equal to the relative value, v_i, earned for starting in state i, for i = 1, 2, . . . , N − 1. The gain, which is the maximum value of the objective function, is also equal to the dual variable associated with the constraint for the normalizing equation, constraint (5.40). As the constraints are equations, the dual variables, and hence the relative values, are unrestricted in sign.

5.1.2.3.4.2 Formulation of MDP Model of Monthly Sales as an LP

The recurrent MDP model of monthly sales introduced in Section 5.1.2.1 will be formulated in this section as an LP, and solved in Section 5.1.2.3.4.3 to find an optimal policy. Data for the recurrent MDP model of monthly sales appear in Table 5.21, which is the same as Table 5.3, augmented by a right-hand side column of LP decision variables.


TABLE 5.21
Data for Recurrent MDP Model of Monthly Sales

                                          Transition Probability
State i  Decision k                       p_i1^k  p_i2^k  p_i3^k  p_i4^k  Reward q_i^k  LP Variable
1        1  Sell noncore assets           0.15    0.40    0.35    0.10    −30           y_1^1
1        2  Take firm private             0.45    0.05    0.20    0.30    −25           y_1^2
1        3  Offer employee buyouts        0.60    0.30    0.10    0       −20           y_1^3
2        1  Reduce management salaries    0.25    0.30    0.35    0.10    5             y_2^1
2        2  Reduce employee benefits      0.30    0.40    0.25    0.05    10            y_2^2
3        1  Design more appealing products 0.05   0.65    0.25    0.05    −10           y_3^1
3        2  Invest in new technology      0.05    0.25    0.50    0.20    −5            y_3^2
4        1  Invest in new projects        0.05    0.20    0.40    0.35    35            y_4^1
4        2  Make strategic acquisitions   0       0.10    0.30    0.60    25            y_4^2

In this example N = 4 states, K_1 = 3 decisions in state 1, and K_2 = K_3 = K_4 = 2 decisions in states 2, 3, and 4.
Objective Function
The objective function for the LP is

\[
\begin{aligned}
\text{Maximize } g &= \sum_{i=1}^{4} \sum_{k=1}^{K_i} q_i^k y_i^k
= \sum_{k=1}^{3} q_1^k y_1^k + \sum_{k=1}^{2} q_2^k y_2^k + \sum_{k=1}^{2} q_3^k y_3^k + \sum_{k=1}^{2} q_4^k y_4^k \\
&= (q_1^1 y_1^1 + q_1^2 y_1^2 + q_1^3 y_1^3) + (q_2^1 y_2^1 + q_2^2 y_2^2) + (q_3^1 y_3^1 + q_3^2 y_3^2) + (q_4^1 y_4^1 + q_4^2 y_4^2) \\
&= (-30 y_1^1 - 25 y_1^2 - 20 y_1^3) + (5 y_2^1 + 10 y_2^2) + (-10 y_3^1 - 5 y_3^2) + (35 y_4^1 + 25 y_4^2).
\end{aligned}
\]


Constraints
State 1 has K_1 = 3 possible decisions. The constraint associated with a transition to state j = 1 is

\[
\begin{aligned}
&\sum_{k=1}^{3} y_1^k - \sum_{i=1}^{4} \sum_{k=1}^{K_i} y_i^k p_{i1}^k = 0 \\
&(y_1^1 + y_1^2 + y_1^3) - \left( \sum_{k=1}^{3} y_1^k p_{11}^k + \sum_{k=1}^{2} y_2^k p_{21}^k + \sum_{k=1}^{2} y_3^k p_{31}^k + \sum_{k=1}^{2} y_4^k p_{41}^k \right) = 0 \\
&(y_1^1 + y_1^2 + y_1^3) - (y_1^1 p_{11}^1 + y_1^2 p_{11}^2 + y_1^3 p_{11}^3) - (y_2^1 p_{21}^1 + y_2^2 p_{21}^2) - (y_3^1 p_{31}^1 + y_3^2 p_{31}^2) - (y_4^1 p_{41}^1 + y_4^2 p_{41}^2) = 0 \\
&(y_1^1 + y_1^2 + y_1^3) - (0.15 y_1^1 + 0.45 y_1^2 + 0.60 y_1^3) - (0.25 y_2^1 + 0.30 y_2^2) - (0.05 y_3^1 + 0.05 y_3^2) - (0.05 y_4^1 + 0 y_4^2) = 0.
\end{aligned}
\]

State 2 has K_2 = 2 possible decisions. The constraint associated with a transition to state j = 2 is

\[
\begin{aligned}
&\sum_{k=1}^{2} y_2^k - \sum_{i=1}^{4} \sum_{k=1}^{K_i} y_i^k p_{i2}^k = 0 \\
&(y_2^1 + y_2^2) - \left( \sum_{k=1}^{3} y_1^k p_{12}^k + \sum_{k=1}^{2} y_2^k p_{22}^k + \sum_{k=1}^{2} y_3^k p_{32}^k + \sum_{k=1}^{2} y_4^k p_{42}^k \right) = 0 \\
&(y_2^1 + y_2^2) - (y_1^1 p_{12}^1 + y_1^2 p_{12}^2 + y_1^3 p_{12}^3) - (y_2^1 p_{22}^1 + y_2^2 p_{22}^2) - (y_3^1 p_{32}^1 + y_3^2 p_{32}^2) - (y_4^1 p_{42}^1 + y_4^2 p_{42}^2) = 0 \\
&(y_2^1 + y_2^2) - (0.40 y_1^1 + 0.05 y_1^2 + 0.30 y_1^3) - (0.30 y_2^1 + 0.40 y_2^2) - (0.65 y_3^1 + 0.25 y_3^2) - (0.20 y_4^1 + 0.10 y_4^2) = 0.
\end{aligned}
\]

State 3 has K_3 = 2 possible decisions. The constraint associated with a transition to state j = 3 is

\[
\begin{aligned}
&\sum_{k=1}^{2} y_3^k - \sum_{i=1}^{4} \sum_{k=1}^{K_i} y_i^k p_{i3}^k = 0 \\
&(y_3^1 + y_3^2) - \left( \sum_{k=1}^{3} y_1^k p_{13}^k + \sum_{k=1}^{2} y_2^k p_{23}^k + \sum_{k=1}^{2} y_3^k p_{33}^k + \sum_{k=1}^{2} y_4^k p_{43}^k \right) = 0 \\
&(y_3^1 + y_3^2) - (y_1^1 p_{13}^1 + y_1^2 p_{13}^2 + y_1^3 p_{13}^3) - (y_2^1 p_{23}^1 + y_2^2 p_{23}^2) - (y_3^1 p_{33}^1 + y_3^2 p_{33}^2) - (y_4^1 p_{43}^1 + y_4^2 p_{43}^2) = 0 \\
&(y_3^1 + y_3^2) - (0.35 y_1^1 + 0.20 y_1^2 + 0.10 y_1^3) - (0.35 y_2^1 + 0.25 y_2^2) - (0.25 y_3^1 + 0.50 y_3^2) - (0.40 y_4^1 + 0.30 y_4^2) = 0.
\end{aligned}
\]

The redundant constraint associated with the state j = 4 will be omitted. The normalizing equation for the four-state MDP is

\[ \sum_{i=1}^{4} \sum_{k=1}^{K_i} y_i^k = (y_1^1 + y_1^2 + y_1^3) + (y_2^1 + y_2^2) + (y_3^1 + y_3^2) + (y_4^1 + y_4^2) = 1. \]

The complete LP formulation for the recurrent MDP model of monthly sales, with the redundant constraint associated with the state j = 4 omitted, and the four remaining constraints numbered consecutively from 1 to 4, is

\[ \text{Maximize } g = (-30 y_1^1 - 25 y_1^2 - 20 y_1^3) + (5 y_2^1 + 10 y_2^2) + (-10 y_3^1 - 5 y_3^2) + (35 y_4^1 + 25 y_4^2) \]

subject to

\[
\begin{aligned}
(1)\quad & (0.85 y_1^1 + 0.55 y_1^2 + 0.40 y_1^3) - (0.25 y_2^1 + 0.30 y_2^2) - (0.05 y_3^1 + 0.05 y_3^2) - (0.05 y_4^1 + 0 y_4^2) = 0 \\
(2)\quad & -(0.40 y_1^1 + 0.05 y_1^2 + 0.30 y_1^3) + (0.70 y_2^1 + 0.60 y_2^2) - (0.65 y_3^1 + 0.25 y_3^2) - (0.20 y_4^1 + 0.10 y_4^2) = 0 \\
(3)\quad & -(0.35 y_1^1 + 0.20 y_1^2 + 0.10 y_1^3) - (0.35 y_2^1 + 0.25 y_2^2) + (0.75 y_3^1 + 0.50 y_3^2) - (0.40 y_4^1 + 0.30 y_4^2) = 0 \\
(4)\quad & (y_1^1 + y_1^2 + y_1^3) + (y_2^1 + y_2^2) + (y_3^1 + y_3^2) + (y_4^1 + y_4^2) = 1
\end{aligned}
\]

\[ y_1^1 \ge 0,\; y_1^2 \ge 0,\; y_1^3 \ge 0,\; y_2^1 \ge 0,\; y_2^2 \ge 0,\; y_3^1 \ge 0,\; y_3^2 \ge 0,\; y_4^1 \ge 0,\; y_4^2 \ge 0. \]
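The coefficients of constraints (1) to (4) follow mechanically from Table 5.21. The sketch below, which assumes only the Python standard library, assembles them from the table; the names `DATA` and `build_lp` are hypothetical and chosen for illustration. The resulting objective, rows, and right-hand sides could then be passed to any LP package.

```python
# Table 5.21: (state i, decision k) -> (transition row p_i1..p_i4, reward q_i^k).
DATA = {
    (1, 1): ([0.15, 0.40, 0.35, 0.10], -30),
    (1, 2): ([0.45, 0.05, 0.20, 0.30], -25),
    (1, 3): ([0.60, 0.30, 0.10, 0.00], -20),
    (2, 1): ([0.25, 0.30, 0.35, 0.10], 5),
    (2, 2): ([0.30, 0.40, 0.25, 0.05], 10),
    (3, 1): ([0.05, 0.65, 0.25, 0.05], -10),
    (3, 2): ([0.05, 0.25, 0.50, 0.20], -5),
    (4, 1): ([0.05, 0.20, 0.40, 0.35], 35),
    (4, 2): ([0.00, 0.10, 0.30, 0.60], 25),
}

def build_lp(data, n_states):
    """Return (variables, objective, rows, rhs) for the LP (5.43a)-(5.43b)."""
    variables = sorted(data)                      # column order of the y_i^k
    objective = [data[v][1] for v in variables]   # rewards q_i^k
    rows, rhs = [], []
    for j in range(1, n_states):                  # balance rows for j = 1..N-1
        row = [(1.0 if i == j else 0.0) - data[(i, k)][0][j - 1]
               for (i, k) in variables]
        rows.append(row)
        rhs.append(0.0)
    rows.append([1.0] * len(variables))           # normalizing equation
    rhs.append(1.0)
    return variables, objective, rows, rhs

variables, objective, rows, rhs = build_lp(DATA, 4)
for row, b in zip(rows, rhs):
    print([round(c, 2) for c in row], "=", b)
```

Each printed row lists the coefficients of y_1^1, y_1^2, y_1^3, y_2^1, y_2^2, y_3^1, y_3^2, y_4^1, y_4^2 in that order, so the first three rows can be compared directly with constraints (1) to (3).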

5.1.2.3.4.3 Solution by LP of MDP Model of Monthly Sales

The LP formulation of the recurrent MDP model of monthly sales is solved on a personal computer using LP software to find an optimal policy. The output of the LP software is summarized in Table 5.22.

TABLE 5.22
LP Solution of MDP Model of Monthly Sales
Objective Function Value g = 4.3901

i   k   y_i^k
1   1   0
1   2   0.1438
1   3   0
2   1   0
2   2   0.2063
3   1   0
3   2   0.3441
4   1   0
4   2   0.3057

Row            Dual Variable
Constraint 1   −76.8825
Constraint 2   −50.6777
Constraint 3   −51.8072
Constraint 4   4.3901

Table 5.22 shows that the value of the objective function is 4.3901. All the y_i^k equal zero, except for y_1^2 = 0.1438, y_2^2 = 0.2063, y_3^2 = 0.3441, and y_4^2 = 0.3057. These values are the steady-state probabilities for the optimal policy given by ^{16}d = [2 2 2 2]^T, which was previously found by exhaustive enumeration in Section 5.1.2.3.1, by value iteration in Section 5.1.2.3.2.2, and by PI in Section 5.1.2.3.3.3. The conditional probabilities,

\[ P(\text{decision} = k \mid \text{state} = i) = \frac{y_i^k}{\sum_{k=1}^{K_i} y_i^k}, \]

are calculated below:

\[ P(\text{decision} = 2 \mid \text{state} = 1) = P(\text{decision} = 2 \mid \text{state} = 2) = P(\text{decision} = 2 \mid \text{state} = 3) = P(\text{decision} = 2 \mid \text{state} = 4) = 1. \]

All the remaining conditional probabilities are zero.
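The recovery of a policy from the LP variables can be sketched in a few lines. The example below uses the rounded y_i^k values of Table 5.22 and the rewards of Table 5.21; the function name `policy_from_y` is an assumption made for illustration, not part of any LP package.

```python
# y_i^k from Table 5.22 and rewards q_i^k from Table 5.21, keyed by (state i, decision k).
y = {(1, 1): 0.0, (1, 2): 0.1438, (1, 3): 0.0,
     (2, 1): 0.0, (2, 2): 0.2063,
     (3, 1): 0.0, (3, 2): 0.3441,
     (4, 1): 0.0, (4, 2): 0.3057}
q = {(1, 1): -30, (1, 2): -25, (1, 3): -20,
     (2, 1): 5, (2, 2): 10,
     (3, 1): -10, (3, 2): -5,
     (4, 1): 35, (4, 2): 25}

def policy_from_y(y):
    """Return steady-state probabilities, conditional probabilities, and the policy."""
    pi = {}
    for (i, k), value in y.items():
        pi[i] = pi.get(i, 0.0) + value            # Equation (5.27)
    # Equation (5.30); states with pi_i = 0 would have no defined conditional.
    cond = {(i, k): value / pi[i] for (i, k), value in y.items() if pi[i] > 0}
    policy = {}
    for (i, k), p in cond.items():
        if p == max(cond[(s, d)] for (s, d) in cond if s == i):
            policy.setdefault(i, k)
    return pi, cond, policy

pi, cond, policy = policy_from_y(y)
gain = sum(q[key] * y[key] for key in y)          # objective (5.33)
print(policy)
print(round(gain, 3))
```

Because the tabulated y_i^k are rounded to four decimals, the gain computed this way agrees with the reported 4.3901 only to about three decimals.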


Table 5.23 shows that the dual variables associated with constraints 1, 2,
and 3 are identical to the relative values, v1, v2, and v3, respectively, obtained
by PI in Equation (5.25) of Section 5.1.2.3.3.3. The dual variable associated
with constraint 4, the normalizing constraint, is the gain, g=4.3901, which
is also the value of the objective function. Table 5.23 also indicates how
closely the expected relative rewards, vi(−7) − v4(−7), obtained by value iter-
ation over a seven-period planning horizon in Section 5.1.2.3.2.2, approach
the relative values, vi, for i = 1, 2, and 3, confirming the results shown in
Table 5.16.
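The agreement between the dual variables and the relative values can be reproduced by solving the VDEs for the policy [2 2 2 2]^T directly. The sketch below is one possible way to do so with the standard library only; `solve` is a plain Gaussian elimination routine and `evaluate_policy` is a hypothetical helper that sets v_4 = 0, as in the text.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def evaluate_policy(P, q):
    """Solve g + v_i = q_i + sum_j p_ij v_j with v_N = 0; return (g, v)."""
    n = len(q)
    # Unknowns are ordered as [g, v_1, ..., v_{N-1}].
    A = [[1.0] + [(1.0 if i == j else 0.0) - P[i][j] for j in range(n - 1)]
         for i in range(n)]
    x = solve(A, [float(r) for r in q])
    return x[0], x[1:] + [0.0]

# Decision 2 in every state (Table 5.21).
P = [[0.45, 0.05, 0.20, 0.30],
     [0.30, 0.40, 0.25, 0.05],
     [0.05, 0.25, 0.50, 0.20],
     [0.00, 0.10, 0.30, 0.60]]
q = [-25, 10, -5, 25]

g, v = evaluate_policy(P, q)
print(round(g, 4), [round(x, 4) for x in v])
```

The printed gain and relative values can be compared with the dual variables in Table 5.22.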

TABLE 5.23
Comparison of Dual Variables Obtained by LP, Relative Values Obtained by PI, and Expected Relative Rewards, v_i(−7) − v_4(−7), Obtained by Value Iteration Over a Seven-Period Planning Horizon

LP Row         LP Dual Variable   Relative Value     Expected Relative Reward
Constraint 1   −76.8825           v_1 = −76.8825     v_1(−7) − v_4(−7) = −76.9888
Constraint 2   −50.6777           v_2 = −50.6777     v_2(−7) − v_4(−7) = −50.7214
Constraint 3   −51.8072           v_3 = −51.8072     v_3(−7) − v_4(−7) = −51.766
Constraint 4   4.3901             g = 4.3901

5.1.3 A Unichain MDP


Recall that an MDP is unichain if the transition matrix associated with every
stationary policy contains one closed communicating class of recurrent states
plus a possibly empty set of transient states. Section 5.1.2 treats a recurrent
MDP, which is a unichain MDP without any transient states. Section 5.1.3 will
treat a unichain MDP which has one or more transient states. An optimal pol-
icy for a unichain MDP over an infinite planning horizon will be found by PI
in Section 5.1.3.1, and by LP in Section 5.1.3.2. Value iteration for a unichain
MDP is identical to value iteration for a recurrent MDP. Section 5.1.3.3.3 will
formulate a problem of when to stop a sequence of independent trials over a
finite planning horizon, and solve this optimal stopping problem by execut-
ing value iteration.

5.1.3.1 Policy Iteration (PI)


Recall from Section 4.2.4.1 that all states in a unichain MCR, both recurrent
and transient, have the same gain, which is equal to the gain of the closed
set of recurrent states. Since a policy for an MDP specifies a particular MCR,
this property holds also for a unichain MDP. Hence, in a unichain MDP, the
gain of every state is equal to g, the independent gain of the closed class of
recurrent states. Since all states in a unichain MDP have the same gain, the
PI algorithm for a unichain MDP with transient states is unchanged from the
algorithm given in Section 5.1.2.3.3.2 for a recurrent MDP, which is a unichain
MDP without transient states. The gain and the relative values for an MCR
associated with a particular policy for a unichain MDP can be determined
by executing the same three-step procedure given in Section 4.2.4.2.2 for a
unichain MCR [2, 3].

5.1.3.1.1 A Unichain Four-State MDP Model of an Experimental Production Process

To see how PI can be applied to find an optimal policy for a unichain MDP in which a state may be recurrent under one stationary policy and transient under another, the following simplified unichain model of an experimental production line will be constructed. (An enlarged multichain MDP model of a flexible production system is constructed in Section 5.1.4.1.) Suppose that an experimental production line consists of three manufacturing stages in series. The three-stage line can be used to manufacture two types of products: industrial (I) or consumer (C). Each stage can be programmed to make either type of product independently of the other stages. Because this sequential production process is experimental, no output will be sold or scrapped. The output of each manufacturing stage is inspected. Output with a defect is reworked at the current stage. Output from stage 1 or stage 2 that is not defective is passed on to the next stage. Output from stage 3 that is not defective is sent to a training center. An industrial product sent from stage 3 to the training center will remain there. A consumer product sent from stage 3 to the training center will be disassembled, and returned as input to stage 1 in the following period.
By assigning a state to represent each operation, and adding a reward vec-
tor, the production process is modeled as a four-state unichain MDP. The
four states are indexed in Table 5.24.
Observe that production stage i is represented by transient state (5 − i). In every state, two decisions are possible. The decision is the type of customer, either industrial (I) or consumer (C), for whom the product is designed. The probability of defective or nondefective output from a stage is affected by whether the product will be industrial or consumer. When the training center decides that a product is industrial, the product will never leave the training center, so that state 1 is absorbing and the remaining states are transient. When the training center decides that a product is designed for consumers, all four states are members of one closed class of recurrent states. With four states and two decisions per state, the number of possible policies is equal to 2^4 = 16. Table 5.25 represents the four-state unichain MDP model of the experimental production process with numerical data furnished for the transition probabilities and the rewards.

TABLE 5.24
States for Unichain Model of an Experimental
Production Process
State Operation
1 Training center
2 Stage 3
3 Stage 2
4 Stage 1


The passage of an item through the experimental production process is shown in Figure 5.3.

5.1.3.1.2 Solution by PI of a Unichain Four-State MDP Model of an Experimental Production Process

Policy iteration will be executed to find an optimal policy for the unichain MDP model of an experimental production process.
First iteration

TABLE 5.25
Data for Unichain Model of an Experimental Production Process

                                    Transition Probability
State i  Operation  Decision k      p_i1^k  p_i2^k  p_i3^k  p_i4^k  Reward q_i^k
1        Training   1 = I           1       0       0       0       56
1        Training   2 = C           0       0       0       1       62
2        Stage 3    1 = I           0.45    0.55    0       0       423.54
2        Stage 3    2 = C           0.55    0.45    0       0       468.62
3        Stage 2    1 = I           0       0.35    0.65    0       847.09
3        Stage 2    2 = C           0       0.50    0.50    0       796.21
4        Stage 1    1 = I           0       0       0.25    0.75    1935.36
4        Stage 1    2 = C           0       0       0.20    0.80    2042.15

FIGURE 5.3
Passage of an item through a three-stage experimental production process. (In the figure, the number on an arc indicates a transition probability under decision 1 (decision 2), and the number next to a node indicates a reward under decision 1 (decision 2).)


Step 1. Initial policy

Arbitrarily choose an initial policy which consists of making decision 1 in every state. Thus, at every manufacturing stage and in the training center, the product is designed for an industrial customer. The initial decision vector ^1d, along with the associated transition probability matrix ^1P and the reward vector ^1q, are shown below (rows and columns are indexed by states 1 to 4):

\[
{}^{1}d = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}, \quad
{}^{1}P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0.45 & 0.55 & 0 & 0 \\ 0 & 0.35 & 0.65 & 0 \\ 0 & 0 & 0.25 & 0.75 \end{bmatrix}, \quad
{}^{1}q = \begin{bmatrix} 56 \\ 423.54 \\ 847.09 \\ 1935.36 \end{bmatrix}.
\]

Note that state 1 is absorbing and states 2, 3, and 4 are transient.


Step 2. VD operation
Use pij and qi for the initial policy, 1d=[1 1 1 1]T, to solve the VDEs (4.62) for the
relative values vi and the gain g. The VDEs (4.62) for the initial policy are

g + v1 = 56 + v1
g + v2 = 423.54 + 0.45v1 + 0.55v2
g + v3 = 847.09 + 0.35v2 + 0.65v3
g + v4 = 1935.36 + 0.25v3 + 0.75v4 .

The VDE for the recurrent closed set, which consists of the single absorbing
state 1, is

g + v1 = 56 + v1 .

Setting v1=0, the solution of the VDE for the gain and the relative value for
the absorbing state under the initial policy is

g = 56, v1 = 0.

The VDEs for the three transient states under the initial policy are

g + v2 = 423.54 + 0.45v1 + 0.55v2


g + v3 = 847.09 + 0.35v2 + 0.65v3
g + v4 = 1935.36 + 0.25v3 + 0.75v4 .

Substituting g = 56, v1 = 0, the solution for the relative values of the transient
states under the initial policy is


v2 = 816.7556, v3 = 3077.0127, v4 = 10594.453.

Step 3. IM routine
For each state i, find the decision k* that maximizes the test quantity

\[ q_i^k + \sum_{j=1}^{4} p_{ij}^k v_j \]

using the relative values v_i of the initial policy. Then k* becomes the new decision in state i, so that d_i = k*, q_i^{k*} becomes q_i, and p_ij^{k*} becomes p_ij. The first policy improvement routine is shown in Table 5.26.

TABLE 5.26
First IM for a Unichain Model of an Experimental Production Process
Test quantity: q_i^k + p_i1^k v_1 + p_i2^k v_2 + p_i3^k v_3 + p_i4^k v_4 = q_i^k + p_i1^k (0) + p_i2^k (816.7556) + p_i3^k (3077.0127) + p_i4^k (10594.453)

State i  Decision k  Test Quantity                                                      Maximum Value  Decision d_i = k*
1        1           56 + 1(0) = 56
1        2           62 + 1(10594.453) = 10656.453 ←                                    10656.453      d_1 = 2
2        1           423.54 + 0.45(0) + 0.55(816.7556) = 872.7556 ←                     872.7556       d_2 = 1
2        2           468.62 + 0.55(0) + 0.45(816.7556) = 836.1600
3        1           847.09 + 0.35(816.7556) + 0.65(3077.0127) = 3133.0127 ←            3133.0127      d_3 = 1
3        2           796.21 + 0.50(816.7556) + 0.50(3077.0127) = 2743.0942
4        1           1935.36 + 0.25(3077.0127) + 0.75(10594.453) = 10650.453
4        2           2042.15 + 0.20(3077.0127) + 0.80(10594.453) = 11133.115 ←          11133.115      d_4 = 2


Step 4. Stopping rule

The new policy is ^2d = [2 1 1 2]^T, which is different from the initial policy. Therefore, go to step 2. The new decision vector ^2d, along with the associated transition probability matrix ^2P and the reward vector ^2q, are shown below:

\[
{}^{2}d = \begin{bmatrix} 2 \\ 1 \\ 1 \\ 2 \end{bmatrix}, \quad
{}^{2}P = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0.45 & 0.55 & 0 & 0 \\ 0 & 0.35 & 0.65 & 0 \\ 0 & 0 & 0.20 & 0.80 \end{bmatrix}, \quad
{}^{2}q = \begin{bmatrix} 62 \\ 423.54 \\ 847.09 \\ 2042.15 \end{bmatrix}.
\]

Note that all four states belong to the same recurrent closed set.
Step 2. VD operation
Use pij and qi for the second policy, 2d=[2 1 1 2]T, to solve the VDEs (4.62)
for the relative values vi and the gain g.

g + v1 = 62 + v4
g + v2 = 423.54 + 0.45v1 + 0.55v2
g + v3 = 847.09 + 0.35v2 + 0.65v3
g + v4 = 2042.15 + 0.20v3 + 0.80v4 .

Setting v4=0, the solution of the VDEs is

g = 1230.5946, v1 = −1168.5946, v2 = −2962.0494, v3 = −4057.7769, v4 = 0

Step 3. IM routine
The second IM routine is shown in Table 5.27.

Step 4. Stopping rule

The new policy is ^3d = [2 2 2 2]^T, which is different from the previous policy. Therefore, go to step 2. The new decision vector ^3d, along with the associated transition probability matrix ^3P and the reward vector ^3q, are shown below:

\[
{}^{3}d = \begin{bmatrix} 2 \\ 2 \\ 2 \\ 2 \end{bmatrix}, \quad
{}^{3}P = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0.55 & 0.45 & 0 & 0 \\ 0 & 0.50 & 0.50 & 0 \\ 0 & 0 & 0.20 & 0.80 \end{bmatrix}, \quad
{}^{3}q = \begin{bmatrix} 62 \\ 468.62 \\ 796.21 \\ 2042.15 \end{bmatrix}. \tag{5.44}
\]

Once again all four states belong to the same recurrent closed set.

TABLE 5.27
Second IM for Unichain Model of an Experimental Production Process
Test quantity: q_i^k + p_i1^k v_1 + p_i2^k v_2 + p_i3^k v_3 + p_i4^k v_4 = q_i^k + p_i1^k (−1168.5946) + p_i2^k (−2962.0494) + p_i3^k (−4057.7769) + p_i4^k (0)

State i  Decision k  Test Quantity                                                        Maximum Value  Decision d_i = k*
1        1           56 + 1(−1168.5946) = −1112.5946
1        2           62 + 1(0) = 62 ←                                                     62             d_1 = 2
2        1           423.54 + 0.45(−1168.5946) + 0.55(−2962.0494) = −1731.4547
2        2           468.62 + 0.55(−1168.5946) + 0.45(−2962.0494) = −1507.0293 ←          −1507.029      d_2 = 2
3        1           847.09 + 0.35(−2962.0494) + 0.65(−4057.7769) = −2827.1823
3        2           796.21 + 0.50(−2962.0494) + 0.50(−4057.7769) = −2713.7032 ←          −2713.703      d_3 = 2
4        1           1935.36 + 0.25(−4057.7769) + 0.75(0) = 920.9158
4        2           2042.15 + 0.20(−4057.7769) + 0.80(0) = 1230.5946 ←                   1230.5946      d_4 = 2

Step 2. VD operation
Use p_ij and q_i for the third policy, ^3d = [2 2 2 2]^T, to solve the VDEs (4.62)

\[ g + v_i = q_i + \sum_{j=1}^{4} p_{ij} v_j, \quad i = 1, 2, 3, 4 \]

for the relative values v_i and the gain g.

g + v1 = 62 + v4
g + v2 = 468.62 + 0.55v1 + 0.45v2
g + v3 = 796.21 + 0.50v2 + 0.50v3
g + v4 = 2042.15 + 0.20v3 + 0.80v4 .

Setting v4 = 0, the solution of the VDEs is

g = 1295.2710, v1 = −1233.2710, v2 = −2736.2729, v3 = −3734.3949, v4 = 0. (5.45)

Step 3. IM routine
The third IM routine is shown in Table 5.28.

Step 4. Stopping rule

Stop because the new policy, given by the vector ^4d ≡ ^3d = [2 2 2 2]^T, is identical to the previous policy. Therefore, this policy is optimal. Under this policy, all states belong to the same recurrent closed class. The relative values and the gain for the optimal policy are shown in Equation (5.45). The recurrent transition probability matrix ^3P and the reward vector ^3q for the optimal policy are shown in Equation (5.44).

TABLE 5.28
Third IM for Unichain Model of an Experimental Production Process
Test quantity: q_i^k + p_i1^k v_1 + p_i2^k v_2 + p_i3^k v_3 + p_i4^k v_4 = q_i^k + p_i1^k (−1233.2710) + p_i2^k (−2736.2729) + p_i3^k (−3734.3949) + p_i4^k (0)

State i  Decision k  Test Quantity                                                        Maximum Value  Decision d_i = k*
1        1           56 + 1(−1233.2710) = −1177.271
1        2           62 + 1(0) = 62 ←                                                     62             d_1 = 2
2        1           423.54 + 0.45(−1233.2710) + 0.55(−2736.2729) = −1636.382
2        2           468.62 + 0.55(−1233.2710) + 0.45(−2736.2729) = −1441.0019 ←          −1441.002      d_2 = 2
3        1           847.09 + 0.35(−2736.2729) + 0.65(−3734.3949) = −2537.9622
3        2           796.21 + 0.50(−2736.2729) + 0.50(−3734.3949) = −2439.1239 ←          −2439.124      d_3 = 2
4        1           1935.36 + 0.25(−3734.3949) + 0.75(0) = 1001.7613
4        2           2042.15 + 0.20(−3734.3949) + 0.80(0) = 1295.271 ←                    1295.271       d_4 = 2
Note that the relative values obtained by PI are identical to the corresponding dual variables obtained by LP in Table 5.30 of Section 5.1.3.2.2. The gain obtained by PI is also equal to the gain obtained by LP. By solving Equation (2.12), the recurrent transition probability matrix ^3P in Equation (5.44) has the steady-state probability vector

\[ {}^{3}\pi = \begin{bmatrix} 0.1019 & 0.1852 & 0.2037 & 0.5093 \end{bmatrix}. \tag{5.46} \]
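As a rough numerical check of Equation (5.46), the steady-state vector can also be approximated by repeatedly multiplying a starting distribution by ^3P. This power-iteration sketch is a substitute for solving Equation (2.12) directly; the function name `steady_state` and the iteration count are assumptions chosen for illustration.

```python
# Transition matrix 3P and reward vector 3q from Equation (5.44).
P = [[0.0, 0.0, 0.0, 1.0],
     [0.55, 0.45, 0.0, 0.0],
     [0.0, 0.50, 0.50, 0.0],
     [0.0, 0.0, 0.20, 0.80]]
q = [62.0, 468.62, 796.21, 2042.15]

def steady_state(P, iterations=500):
    """Approximate pi = pi P by power iteration from a uniform start."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

pi = steady_state(P)
gain = sum(p * r for p, r in zip(pi, q))   # g = pi q
print([round(p, 4) for p in pi], round(gain, 3))
```

The gain computed as the product of the steady-state vector and the reward vector gives an independent check on the value of g in Equation (5.45).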

5.1.3.2 Linear Programming


Recall from Section 5.1.3.1 that all states in a unichain MDP have the same
gain, irrespective of whether or not the set of transient states is empty. Since
the objective of an LP formulation for a unichain MDP is to maximize the
gain, the LP formulation is not affected by the presence of transient states.
Therefore, the LP formulation in this section for a unichain MDP with tran-
sient states is the same as the one for a recurrent MDP. However, there is a
distinction [3] in the properties of the feasible solutions to the LP formula-
tions for both models. The distinction is that exactly one stationary policy
corresponds to one feasible solution to the LP formulation for a recurrent
MDP, but that more than one stationary policy may correspond to a particu-
lar feasible solution of the LP formulation for a unichain MDP with transient
states. The optimal solution to the LP formulation for a unichain MDP does
not specify decisions in transient states. The value of the gain of an optimal
policy is not affected by decisions made in transient states. This distinction
will be illustrated in Section 5.1.3.2.4 by the solution of a modified unichain
MDP model of an experimental production line with transient states.

5.1.3.2.1 LP Formulation of MDP Model of an Experimental Production Process


The unichain MDP model of an experimental production process with tran-
sient states specified in Table 5.25 will be formulated as an LP. Table 5.29
contains Table 5.25 augmented by a right-hand side column of LP decision
variables.
In this example N = 4 states, and K1 = K 2 = K3 = K4 = K = 2 decisions in every
state.
Objective Function
The objective function for the LP is

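\[
\begin{aligned}
\text{Maximize } g &= \sum_{i=1}^{4}\sum_{k=1}^{2} q_i^k y_i^k
 = \sum_{k=1}^{2} q_1^k y_1^k + \sum_{k=1}^{2} q_2^k y_2^k + \sum_{k=1}^{2} q_3^k y_3^k + \sum_{k=1}^{2} q_4^k y_4^k \\
&= (q_1^1 y_1^1 + q_1^2 y_1^2) + (q_2^1 y_2^1 + q_2^2 y_2^2) + (q_3^1 y_3^1 + q_3^2 y_3^2) + (q_4^1 y_4^1 + q_4^2 y_4^2)
\end{aligned}
\]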


TABLE 5.29
Data for Unichain Model of an Experimental Production Process

                                 Transition Probability
State i  Operation  Decision k   p_i1^k  p_i2^k  p_i3^k  p_i4^k   Reward q_i^k   LP Variable y_i^k
1        Training   1 = I        1       0       0       0        56             y_1^1
                    2 = C        0       0       0       1        62             y_1^2
2        Stage 3    1 = I        0.45    0.55    0       0        423.54         y_2^1
                    2 = C        0.55    0.45    0       0        468.62         y_2^2
3        Stage 2    1 = I        0       0.35    0.65    0        847.09         y_3^1
                    2 = C        0       0.50    0.50    0        796.21         y_3^2
4        Stage 1    1 = I        0       0       0.25    0.75     1935.36        y_4^1
                    2 = C        0       0       0.20    0.80     2042.15        y_4^2

\[
\begin{aligned}
g &= (56 y_1^1 + 62 y_1^2) + (423.54 y_2^1 + 468.62 y_2^2) + (847.09 y_3^1 + 796.21 y_3^2) \\
&\quad + (1935.36 y_4^1 + 2042.15 y_4^2).
\end{aligned}
\]

Constraints
State 1 has K1 = 2 possible decisions. The constraint associated with a transi-
tion to state j = 1 is

\[
\begin{aligned}
&\sum_{k=1}^{2} y_1^k - \sum_{i=1}^{4}\sum_{k=1}^{2} y_i^k p_{i1}^k = 0 \\
&(y_1^1 + y_1^2) - \left( \sum_{k=1}^{2} y_1^k p_{11}^k + \sum_{k=1}^{2} y_2^k p_{21}^k + \sum_{k=1}^{2} y_3^k p_{31}^k + \sum_{k=1}^{2} y_4^k p_{41}^k \right) = 0 \\
&(y_1^1 + y_1^2) - (y_1^1 p_{11}^1 + y_1^2 p_{11}^2) - (y_2^1 p_{21}^1 + y_2^2 p_{21}^2) - (y_3^1 p_{31}^1 + y_3^2 p_{31}^2) - (y_4^1 p_{41}^1 + y_4^2 p_{41}^2) = 0 \\
&(0 y_1^1 + y_1^2) - (0.45 y_2^1 + 0.55 y_2^2) = 0.
\end{aligned}
\]


State 2 has K_2 = 2 possible decisions. The constraint associated with a
transition to state j = 2 is

\[
\begin{aligned}
&\sum_{k=1}^{2} y_2^k - \sum_{i=1}^{4}\sum_{k=1}^{2} y_i^k p_{i2}^k = 0 \\
&(y_2^1 + y_2^2) - \left( \sum_{k=1}^{2} y_1^k p_{12}^k + \sum_{k=1}^{2} y_2^k p_{22}^k + \sum_{k=1}^{2} y_3^k p_{32}^k + \sum_{k=1}^{2} y_4^k p_{42}^k \right) = 0 \\
&(y_2^1 + y_2^2) - (y_1^1 p_{12}^1 + y_1^2 p_{12}^2) - (y_2^1 p_{22}^1 + y_2^2 p_{22}^2) - (y_3^1 p_{32}^1 + y_3^2 p_{32}^2) - (y_4^1 p_{42}^1 + y_4^2 p_{42}^2) = 0 \\
&(0.45 y_2^1 + 0.55 y_2^2) - (0.35 y_3^1 + 0.50 y_3^2) = 0.
\end{aligned}
\]

State 3 has K_3 = 2 possible decisions. The constraint associated with a
transition to state j = 3 is

\[
\begin{aligned}
&\sum_{k=1}^{2} y_3^k - \sum_{i=1}^{4}\sum_{k=1}^{2} y_i^k p_{i3}^k = 0 \\
&(y_3^1 + y_3^2) - \left( \sum_{k=1}^{2} y_1^k p_{13}^k + \sum_{k=1}^{2} y_2^k p_{23}^k + \sum_{k=1}^{2} y_3^k p_{33}^k + \sum_{k=1}^{2} y_4^k p_{43}^k \right) = 0 \\
&(y_3^1 + y_3^2) - (y_1^1 p_{13}^1 + y_1^2 p_{13}^2) - (y_2^1 p_{23}^1 + y_2^2 p_{23}^2) - (y_3^1 p_{33}^1 + y_3^2 p_{33}^2) - (y_4^1 p_{43}^1 + y_4^2 p_{43}^2) = 0 \\
&(0.35 y_3^1 + 0.50 y_3^2) - (0.25 y_4^1 + 0.20 y_4^2) = 0.
\end{aligned}
\]

The redundant constraint associated with the state j = 4 will be omitted.
The normalizing equation for the four-state MDP is

\[
\sum_{i=1}^{4}\sum_{k=1}^{2} y_i^k = (y_1^1 + y_1^2) + (y_2^1 + y_2^2) + (y_3^1 + y_3^2) + (y_4^1 + y_4^2) = 1.
\]

The complete LP formulation for the unichain MDP model of an experimental
production process with transient states, with the redundant constraint
associated with the state j = 4 omitted, is

\[
\begin{aligned}
\text{Maximize } g &= (56 y_1^1 + 62 y_1^2) + (423.54 y_2^1 + 468.62 y_2^2) \\
&\quad + (847.09 y_3^1 + 796.21 y_3^2) + (1935.36 y_4^1 + 2042.15 y_4^2)
\end{aligned}
\]

subject to

\[
\begin{aligned}
&(1)\quad (0 y_1^1 + y_1^2) - (0.45 y_2^1 + 0.55 y_2^2) = 0 \\
&(2)\quad (0.45 y_2^1 + 0.55 y_2^2) - (0.35 y_3^1 + 0.50 y_3^2) = 0 \\
&(3)\quad (0.35 y_3^1 + 0.50 y_3^2) - (0.25 y_4^1 + 0.20 y_4^2) = 0 \\
&(4)\quad (y_1^1 + y_1^2) + (y_2^1 + y_2^2) + (y_3^1 + y_3^2) + (y_4^1 + y_4^2) = 1 \\
&y_1^1 \ge 0,\ y_1^2 \ge 0,\ y_2^1 \ge 0,\ y_2^2 \ge 0,\ y_3^1 \ge 0,\ y_3^2 \ge 0,\ y_4^1 \ge 0,\ y_4^2 \ge 0.
\end{aligned}
\]
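The balance constraints above all follow one pattern: the coefficient of y_i^k in the constraint for state j is 1 − p_ij^k when i = j, and −p_ij^k otherwise. The following is a minimal sketch of how the rows could be generated from the Table 5.29 data. The names `DATA`, `balance_row`, and `build_lp` are illustrative assumptions, not part of the text, and the output is only a coefficient table that could be handed to any LP solver.

```python
# Table 5.29 data: (state, decision) -> (transition row, reward).
# States and decisions are numbered from 1, as in the text.
DATA = {
    (1, 1): ([1.00, 0.00, 0.00, 0.00], 56.00),
    (1, 2): ([0.00, 0.00, 0.00, 1.00], 62.00),
    (2, 1): ([0.45, 0.55, 0.00, 0.00], 423.54),
    (2, 2): ([0.55, 0.45, 0.00, 0.00], 468.62),
    (3, 1): ([0.00, 0.35, 0.65, 0.00], 847.09),
    (3, 2): ([0.00, 0.50, 0.50, 0.00], 796.21),
    (4, 1): ([0.00, 0.00, 0.25, 0.75], 1935.36),
    (4, 2): ([0.00, 0.00, 0.20, 0.80], 2042.15),
}
VARIABLES = sorted(DATA)  # column order: y_1^1, y_1^2, y_2^1, ..., y_4^2


def balance_row(j):
    """Coefficients of sum_k y_j^k - sum_i sum_k y_i^k p_ij^k = 0."""
    return [round((1.0 if i == j else 0.0) - DATA[(i, k)][0][j - 1], 10)
            for (i, k) in VARIABLES]


def build_lp():
    """Return (objective coefficients, equality rows, right-hand sides)."""
    objective = [DATA[v][1] for v in VARIABLES]
    rows = [balance_row(j) for j in (1, 2, 3)]  # the j = 4 row is redundant
    rows.append([1.0] * len(VARIABLES))         # normalizing equation
    rhs = [0.0, 0.0, 0.0, 1.0]
    return objective, rows, rhs


objective, rows, rhs = build_lp()
for number, row in enumerate(rows, start=1):
    print(number, row)
```

Each printed row lists the coefficients in the order y_1^1, y_1^2, y_2^1, y_2^2, y_3^1, y_3^2, y_4^1, y_4^2, so the first three rows can be compared directly with constraints (1) to (3).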

5.1.3.2.2 Solution by LP of MDP Model of an Experimental Production Process

The LP formulation in Section 5.1.3.2.1 of the unichain MDP model
of an experimental production process with transient states is solved to find
an optimal policy by using LP software on a personal computer. The output
of the LP software is summarized in Table 5.30.
Table 5.30 shows that y_1^2 = 0.1019, y_2^2 = 0.1852, y_3^2 = 0.2037, y_4^2 = 0.5093, and the
remaining y_i^k equal zero. These nonzero y_i^k values are the steady-state
probabilities shown in Equation (5.46) for the recurrent transition probability
matrix 3P associated with the optimal policy. The optimal policy is 3d = [2 2 2 2]^T,
which was previously found by PI in Equation (5.45) of Section 5.1.3.1.2.
Under this policy, all states are recurrent so that the LP solution prescribes a
decision in every state. The associated conditional probabilities,

\[
P(\text{decision} = k \mid \text{state} = i) = y_i^k \Big/ \sum_{k=1}^{K_i} y_i^k ,
\]

are

\[
P(\text{decision} = 2 \mid \text{state} = 1) = P(\text{decision} = 2 \mid \text{state} = 2)
= P(\text{decision} = 2 \mid \text{state} = 3) = P(\text{decision} = 2 \mid \text{state} = 4) = 1.
\]

TABLE 5.30
LP Solution of MDP Model of an Experimental Production Process
Objective Function Value g = 1295.2710

i   k   y_i^k       Row            Dual Variable
1   1   0           Constraint 1   −1233.2710
    2   0.1019      Constraint 2   −2726.2739
2   1   0           Constraint 3   −3734.3949
    2   0.1852      Constraint 4    1295.2710
3   1   0
    2   0.2037
4   1   0
    2   0.5093


TABLE 5.31
Data for Modified Unichain Model of an Experimental Production Process

                                 Transition Probability
State i  Operation  Decision k   p_i1^k  p_i2^k  p_i3^k  p_i4^k   Reward q_i^k   LP Variable y_i^k
1        Training   1 = I        1       0       0       0        1300           y_1^1
                    2 = C        0       0       0       1        62             y_1^2
2        Stage 3    1 = I        0.45    0.55    0       0        423.54         y_2^1
                    2 = C        0.55    0.45    0       0        468.62         y_2^2
3        Stage 2    1 = I        0       0.35    0.65    0        847.09         y_3^1
                    2 = C        0       0.50    0.50    0        796.21         y_3^2
4        Stage 1    1 = I        0       0       0.25    0.75     1935.36        y_4^1
                    2 = C        0       0       0.20    0.80     2042.15        y_4^2

All the remaining conditional probabilities are zero. The dual variables
associated with constraints 1, 2, and 3 are the relative values v1, v2, and v3,
respectively, obtained by PI in Equation (5.33) of Section 5.1.3.1.2. The dual
variable associated with constraint 4, the normalizing constraint, is the gain,
g = 1295.2710, which is also the value of the objective function.
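Because the model has only 2^4 = 16 stationary policies, the LP result can be cross-checked by brute force. The sketch below is not the LP method of this section: it enumerates every policy, solves the steady-state equations πP = π, Σπ = 1 with a small Gauss-Jordan routine, and computes the gain g = πq. The helper names `steady_state` and `gain` are assumptions made for illustration, and the data are copied from Table 5.29.

```python
from itertools import product

# Table 5.29 data: (state, decision) -> (transition row, reward); states 1..4.
DATA = {
    (1, 1): ([1.00, 0.00, 0.00, 0.00], 56.00),
    (1, 2): ([0.00, 0.00, 0.00, 1.00], 62.00),
    (2, 1): ([0.45, 0.55, 0.00, 0.00], 423.54),
    (2, 2): ([0.55, 0.45, 0.00, 0.00], 468.62),
    (3, 1): ([0.00, 0.35, 0.65, 0.00], 847.09),
    (3, 2): ([0.00, 0.50, 0.50, 0.00], 796.21),
    (4, 1): ([0.00, 0.00, 0.25, 0.75], 1935.36),
    (4, 2): ([0.00, 0.00, 0.20, 0.80], 2042.15),
}
STATES = (1, 2, 3, 4)


def steady_state(P):
    """Solve pi P = pi with sum(pi) = 1 by Gauss-Jordan elimination."""
    n = len(P)
    # n - 1 balance equations (one is redundant) plus the normalizing equation.
    A = [[P[i][j] - (1.0 if i == j else 0.0) for i in range(n)] + [0.0]
         for j in range(n - 1)]
    A.append([1.0] * n + [1.0])
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def policy_matrix(policy):
    """Transition matrix selected by one decision per state."""
    return [DATA[(i, k)][0] for i, k in zip(STATES, policy)]


def gain(policy):
    """Long-run expected reward per period of a stationary policy."""
    q = [DATA[(i, k)][1] for i, k in zip(STATES, policy)]
    return sum(p * r for p, r in zip(steady_state(policy_matrix(policy)), q))


best = max(product((1, 2), repeat=4), key=gain)
print(best, round(gain(best), 4))
print([round(p, 4) for p in steady_state(policy_matrix(best))])
```

Enumeration is practical only for very small models; it is used here purely as an independent check on the gain and the steady-state probabilities reported in Table 5.30.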

5.1.3.2.3 LP Formulation of a Modified MDP Model of an Experimental Production Process

Consider the unichain MDP model of an experimental production process
with transient states given in Table 5.29 of Section 5.1.3.2.1. The model is
modified so that the reward received by the training center per unit of
industrial product is increased from 56 to 1300. Thus, the only difference between
the original model and the modified model is that q_1^1 = 56 in the first model is
replaced by q_1^1 = 1300 in the modified model. The data for the modified model
are given in Table 5.31.
The modified model will be formulated as an LP. The only difference in the
LP formulation is that the term 56y_1^1 in the objective function of the original
model is replaced by 1300y_1^1 in the objective function of the modified model.
The LP formulation for the modified unichain MDP model of an experimental
production process with transient states, with the redundant constraint
associated with the state j = 4 omitted, is

\[
\begin{aligned}
\text{Maximize } g &= (1300 y_1^1 + 62 y_1^2) + (423.54 y_2^1 + 468.62 y_2^2) \\
&\quad + (847.09 y_3^1 + 796.21 y_3^2) + (1935.36 y_4^1 + 2042.15 y_4^2)
\end{aligned}
\]

subject to

\[
\begin{aligned}
&(1)\quad (0 y_1^1 + y_1^2) - (0.45 y_2^1 + 0.55 y_2^2) = 0 \\
&(2)\quad (0.45 y_2^1 + 0.55 y_2^2) - (0.35 y_3^1 + 0.50 y_3^2) = 0 \\
&(3)\quad (0.35 y_3^1 + 0.50 y_3^2) - (0.25 y_4^1 + 0.20 y_4^2) = 0 \\
&(4)\quad (y_1^1 + y_1^2) + (y_2^1 + y_2^2) + (y_3^1 + y_3^2) + (y_4^1 + y_4^2) = 1 \\
&y_1^1 \ge 0,\ y_1^2 \ge 0,\ y_2^1 \ge 0,\ y_2^2 \ge 0,\ y_3^1 \ge 0,\ y_3^2 \ge 0,\ y_4^1 \ge 0,\ y_4^2 \ge 0.
\end{aligned}
\]

5.1.3.2.4 Solution by LP of a Modified MDP Model of an Experimental Production Process

The LP formulation of the modified unichain MDP model of an experimental
production process with transient states is solved by using LP software to find
an optimal policy. The output of the LP software is summarized in Table 5.32.
Table 5.32 shows that y_1^1 = 1, and all the other y_i^k equal zero. State 1 is
absorbing (and also recurrent), and the other three states are transient. Thus, the
optimal solution to the LP prescribes a decision in the recurrent state only, in
this case state 1, but does not prescribe decisions in the transient states. The
gain, which is not affected by decisions in the transient states, equals 1300.
In this example, with three transient states and two possible decisions per
transient state, 2^3 = 8 stationary policies correspond to the feasible solution
obtained for the LP formulation.
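The way a policy is read off an LP solution can be sketched in a few lines. In a state where Σ_k y_i^k = 0 the conditional probability y_i^k / Σ_k y_i^k is undefined, so no decision is prescribed there. The function names `prescribed_decisions` and `consistent_policies` below are hypothetical helpers, and the y values are those of Table 5.32.

```python
from itertools import product

STATES = (1, 2, 3, 4)
DECISIONS = (1, 2)

# LP solution of the modified model (Table 5.32): y_1^1 = 1, all others 0.
y = {(i, k): 0.0 for i in STATES for k in DECISIONS}
y[(1, 1)] = 1.0


def prescribed_decisions(y, states, decisions):
    """Map each state to its prescribed decision, or None if it is transient."""
    policy = {}
    for i in states:
        total = sum(y[(i, k)] for k in decisions)
        if total == 0:
            policy[i] = None  # sum_k y_i^k = 0: the LP prescribes nothing here
        else:
            conditional = {k: y[(i, k)] / total for k in decisions}
            policy[i] = max(conditional, key=conditional.get)
    return policy


def consistent_policies(policy, states, decisions):
    """All stationary policies that agree with the prescribed decisions."""
    options = [decisions if policy[i] is None else (policy[i],) for i in states]
    return list(product(*options))


policy = prescribed_decisions(y, STATES, DECISIONS)
matches = consistent_policies(policy, STATES, DECISIONS)
print(policy)
print(len(matches))
```

Every one of the listed policies keeps decision 1 in state 1 and differs only in the transient states, which is why they all share the same gain.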

TABLE 5.32
LP Solution of a Modified MDP Model of an Experimental Production Process
Objective Function Value g = 1300

i   k   y_i^k
1   1   1
    2   0
2   1   0
    2   0
3   1   0
    2   0
4   1   0
    2   0


5.1.3.3 Examples of Unichain MDP Models


Examples of unichain MDP models over an infinite planning horizon will
be constructed for an inventory system and for component replacement. The
secretary problem, which is concerned with optimal stopping, will be mod-
eled over a finite horizon.

5.1.3.3.1 Unichain MDP Model of an Inventory System


Consider the inventory system of a retail operation modeled as a four-state
regular Markov chain in Section 1.10.1.1.3. In Section 4.2.3.4.1, this model
was enlarged by adding a reward vector to produce a four-state recurrent
MCR. The retailer followed a (2, 3) inventory ordering policy for which an
expected profit per period, or gain, of 115.6212 was calculated in Equation
(4.98). However, she does not know whether a (2, 3) inventory ordering pol-
icy will maximize her expected profit per period. The retailer will model her
inventory system as a four-state MDP that will allow her to choose among all
the ordering decisions which are feasible in every state [1].
Since the retailer can store at most three computers in her shop, she can
order from zero computers up to three computers minus the number of
computers currently in stock. That is, if her inventory at the beginning of a
period is Xn−1 = i≤ 3 computers, she can order from 0 computers up to (3 − i)
computers. Her feasible ordering decisions corresponding to every possible
level of entering inventory are shown in Table 5.33.

TABLE 5.33
Feasible Ordering Decisions Associated with Beginning Inventory Levels

Beginning Inventory   Order Quantity
X_{n−1} = i           c_{n−1} = k
If i = 0 computers,   order k = 0 computers, or k = 1 computer, or k = 2 computers, or k = 3 computers;
                      order up to k = 3 = (3 − 0) computers
If i = 1 computers,   order k = 0 computers, or k = 1 computer, or k = 2 computers;
                      order up to k = 2 = (3 − 1) computers
If i = 2 computers,   order k = 0 computers, or k = 1 computer;
                      order up to k = 1 = (3 − 2) computer
If i = 3 computers,   order k = 0 computers;
                      order up to k = 0 = (3 − 3) computers


Table 5.34 shows the decisions allowed in every state and the associated
transition probabilities expressed as a function of the demand.
Table 5.35 shows the decisions allowed in every state and the associated
numerical transition probabilities.
Observe that in state Xn−1 = 0, when decision cn−1 = 0 is made, p_00^0 = 1, so that
state 0 is absorbing and the other states are transient. In addition, when
decision cn−1 = 1 is made in state Xn−1 = 0, and decision cn−1 = 0 is made in state
Xn−1 = 1, then states 0 and 1 are recurrent and the remaining states are tran-
sient. Therefore, the MDP model of the inventory system is unichain with
transient states.
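The transition probabilities can be generated mechanically from the demand distribution, since the next state is max(i + k − d_n, 0). The sketch below assumes the demand probabilities p(0) = 0.3, p(1) = 0.4, p(2) = 0.1, p(3) = 0.2 that appear in the worked revenue and cost examples of this section. The names `feasible_orders` and `transition_row` are illustrative.

```python
# Demand distribution used in the worked examples: p(d) for d = 0, 1, 2, 3.
DEMAND = {0: 0.3, 1: 0.4, 2: 0.1, 3: 0.2}
CAPACITY = 3  # the shop holds at most three computers


def feasible_orders(i):
    """Order quantities allowed with i computers in stock: 0, 1, ..., 3 - i."""
    return range(CAPACITY - i + 1)


def transition_row(i, k):
    """Row p_ij^k for j = 0..3 when k computers are ordered in state i."""
    row = [0.0] * (CAPACITY + 1)
    for d, prob in DEMAND.items():
        row[max(i + k - d, 0)] += prob  # unmet demand is lost, stock stays at 0
    return [round(p, 10) for p in row]


for i in range(CAPACITY + 1):
    for k in feasible_orders(i):
        print(i, k, transition_row(i, k))
```

The printed rows can be compared line by line with the numerical transition probabilities of Table 5.35.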
The expected reward or profit earned by every decision in every state will
be computed next as the expected revenue minus the expected cost. The
number of computers sold during a period equals min[(Xn−1+cn−1), dn]. The

TABLE 5.34
Allowable Decisions in Every State and Associated Transition Probabilities
Expressed as a Function of the Demand

State         Decision       Transition Probability
X_{n−1} = 0   c_{n−1} = k    p_00^k        p_01^k        p_02^k        p_03^k
              0              P(d_n ≥ 0)    0             0             0
              1              P(d_n ≥ 1)    P(d_n = 0)    0             0
              2              P(d_n ≥ 2)    P(d_n = 1)    P(d_n = 0)    0
              3              P(d_n = 3)    P(d_n = 2)    P(d_n = 1)    P(d_n = 0)

X_{n−1} = 1   c_{n−1} = k    p_10^k        p_11^k        p_12^k        p_13^k
              0              P(d_n ≥ 1)    P(d_n = 0)    0             0
              1              P(d_n ≥ 2)    P(d_n = 1)    P(d_n = 0)    0
              2              P(d_n = 3)    P(d_n = 2)    P(d_n = 1)    P(d_n = 0)

X_{n−1} = 2   c_{n−1} = k    p_20^k        p_21^k        p_22^k        p_23^k
              0              P(d_n ≥ 2)    P(d_n = 1)    P(d_n = 0)    0
              1              P(d_n = 3)    P(d_n = 2)    P(d_n = 1)    P(d_n = 0)

X_{n−1} = 3   c_{n−1} = k    p_30^k        p_31^k        p_32^k        p_33^k
              0              P(d_n = 3)    P(d_n = 2)    P(d_n = 1)    P(d_n = 0)


TABLE 5.35
Allowable Decisions in Every State and Associated Numerical Transition
Probabilities

                                         Transition Probability
State X_{n−1} = i   Decision c_{n−1} = k   p_i0^k   p_i1^k   p_i2^k   p_i3^k
0                   0                      1        0        0        0
                    1                      0.7      0.3      0        0
                    2                      0.3      0.4      0.3      0
                    3                      0.2      0.1      0.4      0.3
1                   0                      0.7      0.3      0        0
                    1                      0.3      0.4      0.3      0
                    2                      0.2      0.1      0.4      0.3
2                   0                      0.3      0.4      0.3      0
                    1                      0.2      0.1      0.4      0.3
3                   0                      0.2      0.1      0.4      0.3

expected number sold is given by

\[
\begin{aligned}
E\{\min[(X_{n-1} + c_{n-1}), d_n]\} &= \sum_{d_n=0}^{3} \min[(X_{n-1} + c_{n-1}), d_n]\, p(d_n) \\
&= \sum_{k=0}^{3} \min[(X_{n-1} + c_{n-1}), k]\, P(d_n = k).
\end{aligned}
\]

The retailer sells computers for $300 each. The expected revenue equals $300
per computer sold times the expected number sold. The expected revenue is

\[
\$300 \sum_{k=0}^{3} \min[(X_{n-1} + c_{n-1}), k]\, P(d_n = k).
\]

For example, in state Xn−1 = 0, when decision cn−1 = 2 is made, the expected
revenue is equal to

\[
\begin{aligned}
\$300 \sum_{d_n=0}^{3} \min[(X_{n-1} + c_{n-1}), d_n]\, p(d_n)
&= \$300 \sum_{d_n=0}^{3} \min[(0 + 2), d_n]\, p(d_n) \\
&= \$300[\min(2, 0)p(0) + \min(2, 1)p(1) + \min(2, 2)p(2) + \min(2, 3)p(3)] \\
&= \$300[0p(0) + 1p(1) + 2p(2) + 2p(3)] \\
&= \$300\{0p(0) + 1p(1) + 2[p(2) + p(3)]\} \\
&= \$300[0(0.3) + 1(0.4) + 2(0.1 + 0.2)] = \$300.
\end{aligned}
\]


The expected revenue generated in every state by every decision is
calculated in Table 5.36.
The expected cost is equal to the ordering cost plus the expected holding
cost plus the expected shortage cost. If cn−1 > 0 computers are ordered at the
beginning of a period, the ordering cost for the period is $20 + $120cn−1. If no
computers are ordered, the ordering cost is $0. The ordering costs associated
with every decision in every state are calculated in Table 5.37.
The expected holding cost equals $50 times the expected number of com-
puters not sold. The number not sold during a period, that is, the surplus
of computers at the end of the period, equals max[(Xn−1 + cn−1 − dn), 0]. The
expected number of computers not sold is given by

\[
\begin{aligned}
E\{\max[(X_{n-1} + c_{n-1} - d_n), 0]\} &= \sum_{d_n=0}^{3} \max[(X_{n-1} + c_{n-1} - d_n), 0]\, p(d_n) \\
&= \sum_{k=0}^{3} \max[(X_{n-1} + c_{n-1} - k), 0]\, P(d_n = k).
\end{aligned}
\]

The expected holding cost is

\[
\$50 \sum_{k=0}^{3} \max[(X_{n-1} + c_{n-1} - k), 0]\, P(d_n = k).
\]

For example, in state Xn−1= 0, when decision cn−1= 2 is made, the expected
holding cost is equal to

\[
\begin{aligned}
\$50 \sum_{d_n=0}^{3} \max[(X_{n-1} + c_{n-1} - d_n), 0]\, p(d_n)
&= \$50 \sum_{d_n=0}^{3} \max[(0 + 2 - d_n), 0]\, p(d_n) \\
&= \$50[\max(2 - 0, 0)p(0) + \max(2 - 1, 0)p(1) + \max(2 - 2, 0)p(2) + \max(2 - 3, 0)p(3)] \\
&= \$50\{(2 - 0)p(0) + (2 - 1)p(1) + 0p(2) + 0p(3)\} \\
&= \$50\{2p(0) + 1p(1) + 0[p(2) + p(3)]\} \\
&= \$50[2(0.3) + 1(0.4) + 0(0.1 + 0.2)] = \$50.
\end{aligned}
\]

The expected holding costs generated in every state by every decision are
calculated in Table 5.38.
The expected shortage cost equals $40 times the expected number of com-
puters not available to satisfy demand during a period. The number not
available to satisfy demand during a period, that is, the shortage of comput-
ers at the end of the period, equals max[(dn − Xn−1 − cn−1), 0]. The expected
number of computers not available to satisfy demand during a period is
given by

TABLE 5.36
Expected Revenue

State X_{n−1} = i, Decision c_{n−1} = k
Revenue: $300 min[(X_{n−1} + c_{n−1}), d_n]
Expected Revenue: $300 Σ_{d_n=0}^{3} min[(X_{n−1} + c_{n−1}), d_n] p(d_n)

i   k   Revenue             Expected Revenue
0   0   $300 min(0, d_n)    $300(0) = $0
    1   $300 min(1, d_n)    $300{0p(0) + 1[p(1) + p(2) + p(3)]} = $300[0(0.3) + 1(0.4 + 0.1 + 0.2)] = $210
    2   $300 min(2, d_n)    $300{0p(0) + 1p(1) + 2[p(2) + p(3)]} = $300[0(0.3) + 1(0.4) + 2(0.1 + 0.2)] = $300
    3   $300 min(3, d_n)    $300[0p(0) + 1p(1) + 2p(2) + 3p(3)] = $300[0(0.3) + 1(0.4) + 2(0.1) + 3(0.2)] = $360
1   0   $300 min(1, d_n)    $300{0p(0) + 1[p(1) + p(2) + p(3)]} = $210
    1   $300 min(2, d_n)    $300{0p(0) + 1p(1) + 2[p(2) + p(3)]} = $300
    2   $300 min(3, d_n)    $300[0p(0) + 1p(1) + 2p(2) + 3p(3)] = $360
2   0   $300 min(2, d_n)    $300{0p(0) + 1p(1) + 2[p(2) + p(3)]} = $300
    1   $300 min(3, d_n)    $300[0p(0) + 1p(1) + 2p(2) + 3p(3)] = $360
3   0   $300 min(3, d_n)    $300[0p(0) + 1p(1) + 2p(2) + 3p(3)] = $360

TABLE 5.37
Ordering Costs

State X_{n−1}   Decision c_{n−1}   Ordering Cost ($20 + $120 c_{n−1} if c_{n−1} > 0, otherwise $0)
0               0                  $0
                1                  $20 + $120(1) = $140
                2                  $20 + $120(2) = $260
                3                  $20 + $120(3) = $380
1               0                  $0
                1                  $20 + $120(1) = $140
                2                  $20 + $120(2) = $260
2               0                  $0
                1                  $20 + $120(1) = $140
3               0                  $0

\[
\begin{aligned}
E\{\max[(d_n - X_{n-1} - c_{n-1}), 0]\} &= \sum_{d_n=0}^{3} \max[(d_n - X_{n-1} - c_{n-1}), 0]\, p(d_n) \\
&= \sum_{k=0}^{3} \max[(k - X_{n-1} - c_{n-1}), 0]\, P(d_n = k).
\end{aligned}
\]

The expected shortage cost is

\[
\$40 \sum_{k=0}^{3} \max[(k - X_{n-1} - c_{n-1}), 0]\, P(d_n = k).
\]

For example, in state Xn−1 = 0, when decision cn−1 = 1 is made, the expected
shortage cost is equal to

\[
\begin{aligned}
\$40 \sum_{d_n=0}^{3} \max[(d_n - X_{n-1} - c_{n-1}), 0]\, p(d_n)
&= \$40 \sum_{d_n=0}^{3} \max[(d_n - 0 - 1), 0]\, p(d_n) \\
&= \$40[\max(0 - 1, 0)p(0) + \max(1 - 1, 0)p(1) + \max(2 - 1, 0)p(2) + \max(3 - 1, 0)p(3)] \\
&= \$40[0p(0) + (1 - 1)p(1) + (2 - 1)p(2) + (3 - 1)p(3)] \\
&= \$40[0p(0) + 0p(1) + 1p(2) + 2p(3)] \\
&= \$40[0(0.3) + 0(0.4) + 1(0.1) + 2(0.2)] = \$20.
\end{aligned}
\]

The expected shortage costs generated in every state by every decision are
calculated in Table 5.39.

TABLE 5.38
Expected Holding Costs

State X_{n−1}, Decision c_{n−1}
Holding Cost: $50 max[(X_{n−1} + c_{n−1} − d_n), 0]
Expected Holding Cost: $50 Σ_{d_n=0}^{3} max[(X_{n−1} + c_{n−1} − d_n), 0] p(d_n)

i   k   Holding Cost            Expected Holding Cost
0   0   $50 max(0 − d_n, 0)     $50(0) = $0
    1   $50 max(1 − d_n, 0)     $50{1p(0) + 0[p(1) + p(2) + p(3)]} = $50[1(0.3) + 0(0.4 + 0.1 + 0.2)] = $15
    2   $50 max(2 − d_n, 0)     $50{2p(0) + 1p(1) + 0[p(2) + p(3)]} = $50[2(0.3) + 1(0.4) + 0(0.1 + 0.2)] = $50
    3   $50 max(3 − d_n, 0)     $50[3p(0) + 2p(1) + 1p(2) + 0p(3)] = $50[3(0.3) + 2(0.4) + 1(0.1) + 0(0.2)] = $90
1   0   $50 max(1 − d_n, 0)     $50{1p(0) + 0[p(1) + p(2) + p(3)]} = $15
    1   $50 max(2 − d_n, 0)     $50{2p(0) + 1p(1) + 0[p(2) + p(3)]} = $50
    2   $50 max(3 − d_n, 0)     $50[3p(0) + 2p(1) + 1p(2) + 0p(3)] = $90
2   0   $50 max(2 − d_n, 0)     $50{2p(0) + 1p(1) + 0[p(2) + p(3)]} = $50
    1   $50 max(3 − d_n, 0)     $50[3p(0) + 2p(1) + 1p(2) + 0p(3)] = $90
3   0   $50 max(3 − d_n, 0)     $50[3p(0) + 2p(1) + 1p(2) + 0p(3)] = $90

TABLE 5.39
Expected Shortage Costs

State X_{n−1}, Decision c_{n−1}
Shortage Cost: $40 max[(d_n − X_{n−1} − c_{n−1}), 0]
Expected Shortage Cost: $40 Σ_{d_n=0}^{3} max[(d_n − X_{n−1} − c_{n−1}), 0] p(d_n)

i   k   Shortage Cost           Expected Shortage Cost
0   0   $40 max(d_n − 0, 0)     $40[0p(0) + 1p(1) + 2p(2) + 3p(3)] = $40[0(0.3) + 1(0.4) + 2(0.1) + 3(0.2)] = $48
    1   $40 max(d_n − 1, 0)     $40{0[p(0) + p(1)] + 1p(2) + 2p(3)} = $40[0(0.3 + 0.4) + 1(0.1) + 2(0.2)] = $20
    2   $40 max(d_n − 2, 0)     $40{0[p(0) + p(1) + p(2)] + 1p(3)} = $40[0(0.3 + 0.4 + 0.1) + 1(0.2)] = $8
    3   $40 max(d_n − 3, 0)     $40(0) = $0
1   0   $40 max(d_n − 1, 0)     $40{0[p(0) + p(1)] + 1p(2) + 2p(3)} = $20
    1   $40 max(d_n − 2, 0)     $40{0[p(0) + p(1) + p(2)] + 1p(3)} = $8
    2   $40 max(d_n − 3, 0)     $40(0) = $0
2   0   $40 max(d_n − 2, 0)     $40{0[p(0) + p(1) + p(2)] + 1p(3)} = $8
    1   $40 max(d_n − 3, 0)     $40(0) = $0
3   0   $40 max(d_n − 3, 0)     $40(0) = $0


In Table 5.40 the expected rewards corresponding to every state and decision
are calculated.
The complete unichain MDP model of the inventory system is summa-
rized in Table 5.41.
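The four components of the expected reward can be combined in a single routine. The sketch below assumes the demand distribution and cost parameters stated in this section ($300 selling price, $20 + $120 per unit ordering cost, $50 holding cost, $40 shortage cost); the constant and function names are illustrative only.

```python
# Demand distribution p(d) and cost parameters quoted in this section.
DEMAND = {0: 0.3, 1: 0.4, 2: 0.1, 3: 0.2}
PRICE, FIXED_ORDER, UNIT_ORDER, HOLDING, SHORTAGE = 300, 20, 120, 50, 40
CAPACITY = 3


def expected_reward(i, k):
    """Expected profit when k computers are ordered with i in stock."""
    stock = i + k
    sold = sum(min(stock, d) * p for d, p in DEMAND.items())
    unsold = sum(max(stock - d, 0) * p for d, p in DEMAND.items())
    short = sum(max(d - stock, 0) * p for d, p in DEMAND.items())
    ordering = FIXED_ORDER + UNIT_ORDER * k if k > 0 else 0
    profit = PRICE * sold - ordering - HOLDING * unsold - SHORTAGE * short
    return round(profit, 6)


rewards = {(i, k): expected_reward(i, k)
           for i in range(CAPACITY + 1)
           for k in range(CAPACITY - i + 1)}
for (i, k), q in sorted(rewards.items()):
    print(i, k, q)
```

The resulting values are the rewards q_i^k that enter the objective function of the LP, so negative entries (for example, ordering nothing with an empty shelf) indicate an expected loss.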

5.1.3.3.1.1 Formulation of a Unichain MDP Model of an Inventory System as an LP

The unichain MDP model of an inventory system will be formulated as an LP.
In this example N = 4 states. The number of decisions in each
state is K_0 = 4, K_1 = 3, K_2 = 2, and K_3 = 1. The objective function for the LP is

\[
\begin{aligned}
\text{Maximize } g &= (-48 y_0^0 + 35 y_0^1 - 18 y_0^2 - 110 y_0^3) + (175 y_1^0 + 102 y_1^1 + 10 y_1^2) \\
&\quad + (242 y_2^0 + 130 y_2^1) + 270 y_3^0 .
\end{aligned}
\]

The constraints are given below:

\[
\begin{aligned}
&(1)\quad (y_0^0 + y_0^1 + y_0^2 + y_0^3) - (y_0^0 + 0.7 y_0^1 + 0.3 y_0^2 + 0.2 y_0^3) \\
&\qquad - (0.7 y_1^0 + 0.3 y_1^1 + 0.2 y_1^2) - (0.3 y_2^0 + 0.2 y_2^1) - 0.2 y_3^0 = 0 \\
&(2)\quad (y_1^0 + y_1^1 + y_1^2) - (0.3 y_0^1 + 0.4 y_0^2 + 0.1 y_0^3) - (0.3 y_1^0 + 0.4 y_1^1 + 0.1 y_1^2) \\
&\qquad - (0.4 y_2^0 + 0.1 y_2^1) - 0.1 y_3^0 = 0 \\
&(3)\quad (y_2^0 + y_2^1) - (0.3 y_0^2 + 0.4 y_0^3) - (0.3 y_1^1 + 0.4 y_1^2) - (0.3 y_2^0 + 0.4 y_2^1) - 0.4 y_3^0 = 0 \\
&(4)\quad y_3^0 - (0.3 y_0^3 + 0.3 y_1^2 + 0.3 y_2^1 + 0.3 y_3^0) = 0.
\end{aligned}
\]

TABLE 5.40
Expected Rewards

State X_{n−1}   Decision c_{n−1}   Expected Reward = E(Revenue) − Ordering Cost − E(Holding Cost) − E(Shortage Cost)
0               0                  $0 − $0 − $0 − $48 = −$48
                1                  $210 − $140 − $15 − $20 = $35
                2                  $300 − $260 − $50 − $8 = −$18
                3                  $360 − $380 − $90 − $0 = −$110
1               0                  $210 − $0 − $15 − $20 = $175
                1                  $300 − $140 − $50 − $8 = $102
                2                  $360 − $260 − $90 − $0 = $10
2               0                  $300 − $0 − $50 − $8 = $242
                1                  $360 − $140 − $90 − $0 = $130
3               0                  $360 − $0 − $90 − $0 = $270


TABLE 5.41
Unichain MDP Model of Inventory System

                                    Transition Probability           Expected
State X_{n−1} = i  Decision c_{n−1} = k  j = 0   j = 1   j = 2   j = 3   Reward q_i^k   LP Variable y_i^k
0                  0                     1       0       0       0       −$48           y_0^0
                   1                     0.7     0.3     0       0       $35            y_0^1
                   2                     0.3     0.4     0.3     0       −$18           y_0^2
                   3                     0.2     0.1     0.4     0.3     −$110          y_0^3
1                  0                     0.7     0.3     0       0       $175           y_1^0
                   1                     0.3     0.4     0.3     0       $102           y_1^1
                   2                     0.2     0.1     0.4     0.3     $10            y_1^2
2                  0                     0.3     0.4     0.3     0       $242           y_2^0
                   1                     0.2     0.1     0.4     0.3     $130           y_2^1
3                  0                     0.2     0.1     0.4     0.3     $270           y_3^0

\[
\begin{aligned}
&(5)\quad (y_0^0 + y_0^1 + y_0^2 + y_0^3) + (y_1^0 + y_1^1 + y_1^2) + (y_2^0 + y_2^1) + y_3^0 = 1 \\
&y_0^0 \ge 0,\ y_0^1 \ge 0,\ y_0^2 \ge 0,\ y_0^3 \ge 0,\ y_1^0 \ge 0,\ y_1^1 \ge 0,\ y_1^2 \ge 0,\ y_2^0 \ge 0,\ y_2^1 \ge 0,\ y_3^0 \ge 0.
\end{aligned}
\]

The complete LP appears below:

\[
\begin{aligned}
\text{Maximize } g &= (-48 y_0^0 + 35 y_0^1 - 18 y_0^2 - 110 y_0^3) + (175 y_1^0 + 102 y_1^1 + 10 y_1^2) \\
&\quad + (242 y_2^0 + 130 y_2^1) + 270 y_3^0
\end{aligned}
\]

subject to

\[
\begin{aligned}
&(1)\quad (0.3 y_0^1 + 0.7 y_0^2 + 0.8 y_0^3) - (0.7 y_1^0 + 0.3 y_1^1 + 0.2 y_1^2) - (0.3 y_2^0 + 0.2 y_2^1) - 0.2 y_3^0 = 0 \\
&(2)\quad -(0.3 y_0^1 + 0.4 y_0^2 + 0.1 y_0^3) + (0.7 y_1^0 + 0.6 y_1^1 + 0.9 y_1^2) - (0.4 y_2^0 + 0.1 y_2^1) - 0.1 y_3^0 = 0 \\
&(3)\quad -(0.3 y_0^2 + 0.4 y_0^3) - (0.3 y_1^1 + 0.4 y_1^2) + (0.7 y_2^0 + 0.6 y_2^1) - 0.4 y_3^0 = 0 \\
&(4)\quad -(0.3 y_0^3 + 0.3 y_1^2 + 0.3 y_2^1) + 0.7 y_3^0 = 0 \\
&(5)\quad (y_0^0 + y_0^1 + y_0^2 + y_0^3) + (y_1^0 + y_1^1 + y_1^2) + (y_2^0 + y_2^1) + y_3^0 = 1 \\
&y_0^0 \ge 0,\ y_0^1 \ge 0,\ y_0^2 \ge 0,\ y_0^3 \ge 0,\ y_1^0 \ge 0,\ y_1^1 \ge 0,\ y_1^2 \ge 0,\ y_2^0 \ge 0,\ y_2^1 \ge 0,\ y_3^0 \ge 0.
\end{aligned}
\]

5.1.3.3.1.2 Solution by LP of Unichain MDP Model of an Inventory System

After omitting one redundant constraint, the LP formulation of the unichain
MDP model of an inventory system is solved to find an optimal policy
by using LP software on a personal computer. The output of the LP software
is summarized in Table 5.42.
Table 5.42 shows that the value of the objective function is 115.6364,
which differs slightly from the gain calculated in Equation (4.98) because
of roundoff error. The output also shows that all the y_i^k equal zero, except
for y_0^3 = 0.2364, y_1^2 = 0.2091, y_2^0 = 0.3636, and y_3^0 = 0.1909. These nonzero values
are the steady-state probabilities, calculated in Equation (2.27), for the policy
given by d = [3 2 0 0]^T, which is therefore an optimal policy. The conditional
probabilities,

\[
P(\text{order} = k \mid \text{state} = i) = \frac{y_i^k}{\sum_{k=0}^{K_i - 1} y_i^k},
\]

are calculated below:

\[
P(\text{order} = 3 \mid \text{state} = 0) = P(\text{order} = 2 \mid \text{state} = 1) = P(\text{order} = 0 \mid \text{state} = 2)
= P(\text{order} = 0 \mid \text{state} = 3) = 1.
\]

TABLE 5.42
LP Solution of Unichain MDP Model of an Inventory System
Objective Function Value g = 115.6364

i   k   y_i^k
0   0   0
    1   0
    2   0
    3   0.2364
1   0   0
    1   0
    2   0.2091
2   0   0.3636
    1   0
3   0   0.1909


All the remaining conditional probabilities are zero. Observe that the
retailer’s expected average profit per period is maximized by the same (2, 3)
policy under which she operated the recurrent MCR inventory model in
Section 4.2.3.4.1.
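With 4 × 3 × 2 × 1 = 24 stationary policies, the LP answer can again be checked by enumeration. The sketch below swaps in a different numerical technique from the LP: it approximates each steady-state vector by repeated multiplication π ← πP, which converges here because every policy gives state 0 a positive self-transition probability. The data are copied from Table 5.41; `MODEL`, `steady_state`, `policy_matrix`, and `gain` are hypothetical names.

```python
from itertools import product

# Table 5.41: state -> {order quantity: (transition row, expected reward)}.
MODEL = {
    0: {0: ([1.0, 0.0, 0.0, 0.0], -48),
        1: ([0.7, 0.3, 0.0, 0.0], 35),
        2: ([0.3, 0.4, 0.3, 0.0], -18),
        3: ([0.2, 0.1, 0.4, 0.3], -110)},
    1: {0: ([0.7, 0.3, 0.0, 0.0], 175),
        1: ([0.3, 0.4, 0.3, 0.0], 102),
        2: ([0.2, 0.1, 0.4, 0.3], 10)},
    2: {0: ([0.3, 0.4, 0.3, 0.0], 242),
        1: ([0.2, 0.1, 0.4, 0.3], 130)},
    3: {0: ([0.2, 0.1, 0.4, 0.3], 270)},
}


def steady_state(P, sweeps=500):
    """Approximate the steady-state vector by iterating pi <- pi P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(sweeps):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def policy_matrix(policy):
    """Transition matrix selected by one order quantity per state."""
    return [MODEL[i][k][0] for i, k in enumerate(policy)]


def gain(policy):
    """Expected profit per period under a stationary ordering policy."""
    q = [MODEL[i][k][1] for i, k in enumerate(policy)]
    return sum(p * r for p, r in zip(steady_state(policy_matrix(policy)), q))


policies = list(product(*(sorted(MODEL[i]) for i in range(4))))
best = max(policies, key=gain)
print(best, round(gain(best), 4))
```

The best policy orders 3 computers in state 0, 2 in state 1, and none otherwise, which is the (2, 3) policy expressed as one order quantity per state.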

5.1.3.3.2 Unichain MDP Model of Component Replacement


The MCR model of component replacement constructed in Equation (4.101)
of Section 4.2.3.4.2 is associated with the policy of replacing a surviving
component every 4 weeks, its maximum service life. This recurrent MCR
model can be transformed into a unichain MDP model by adding two deci-
sion alternatives per state in states 0, 1, and 2. The decision alternatives are
whether to keep (decision K) or replace (decision R) a surviving component.
More precisely, the decision for a component, which has survived i weeks, is
whether to keep it for another week (unless it breaks down during the cur-
rent week), or replace the surviving component with a new component at the
end of the current week. If a decision is made to keep a surviving component
of age i for another week, then either it will fail during the current week with
probability p_i0 and be replaced, or it will survive to age i + 1 with probability
p_{i,i+1} = 1 − p_i0. A component of age i, which is replaced, becomes a component
of age 0 at the end of the current week. A component of age 3 weeks must be
replaced at the end of its fourth week of life [4].
As Section 4.2.3.4.2 indicates, the cost incurred by the decision to keep a
component of age i is

\[
q_i^K = \$2 + \$10\, p_{i0}.
\]

A surviving component that is replaced at age 0, 1, or 2 weeks is assumed
to have a trade-in or salvage value of $4. The cost incurred by the decision
to replace a surviving component is $2 for inspecting the component, minus
$4 for trading it in, plus $10 for replacing it with a new component of age 0.
Letting q_i^R denote the cost incurred by the decision to replace a surviving
component of age i gives the result that

\[
\begin{aligned}
q_i^R &= (\text{Cost of inspection}) + (\text{Cost of replacement}) - (\text{Trade-in value}) \\
q_0^R &= q_1^R = q_2^R = \$2 + \$10 - \$4 = \$8.
\end{aligned}
\]

Since a component has no salvage value at the end of its fourth week of life,

\[
q_3^R = \$2 + \$10 = \$12.
\]


Table 5.43 contains the transition probabilities and cost data for the four-state
unichain MDP model of component replacement.

5.1.3.3.2.1 Solution by Enumeration of MDP Model of Component Replacement

A component that fails will be replaced at the end of the week in which it has
failed. A component that has not failed can be replaced after 1 week of
service, or after 2 weeks, 3 weeks, or 4 weeks. Hence there are four replacement
policies, which correspond to the four alternative replacement intervals.
Enumeration of the MCRs associated with each of these four alternative
replacement intervals can be used to determine a least cost replacement
interval by finding which MCR has the smallest expected average cost per
week, or negative gain. Equation (4.47) is used to calculate the negative gain,
which will simply be called the gain, g, associated with each replacement
policy. A decision to replace a component of age i is effective 1 week later
when the component reaches age i + 1. For example, a policy of replacing a
component every 2 weeks means that you decide at age 0 to keep it until it
reaches age 1, and decide at age 1 to replace it when it reaches age 2. The
four replacement policies, the corresponding decision vectors, and the
corresponding sequence of actions produced by each decision vector are
enumerated in Table 5.44.

TABLE 5.43
Data for Unichain MDP Model of Component Replacement

                         Transition Probability
State i   Decision k   p_i0^k   p_i1^k   p_i2^k   p_i3^k   Cost q_i^k                LP Variable y_i^k
0         1 = K        0.2      0.8      0        0        4 = 2 + 0.2(10)           y_0^1
          2 = R        1        0        0        0        8 = 2 + 10 − 4            y_0^2
1         1 = K        0.375    0        0.625    0        5.75 = 2 + 0.375(10)      y_1^1
          2 = R        1        0        0        0        8 = 2 + 10 − 4            y_1^2
2         1 = K        0.8      0        0        0.2      10 = 2 + 0.8(10)          y_2^1
          2 = R        1        0        0        0        8 = 2 + 10 − 4            y_2^2
3         2 = R        1        0        0        0        12 = 2 + 10               y_3^2


TABLE 5.44
Four Alternative Component Replacement Policies

                         Decision Vector at Age        Sequence of Actions at Age
Replacement Policy       i = {0, 1, 2, 3}          ⇒   i + 1 = {1, 2, 3, 4}
Replace every week       1d = [R R R R]^T          ⇒   [K R R R]^T
Replace every 2 weeks    2d = [K R R R]^T          ⇒   [K K R R]^T
Replace every 3 weeks    3d = [K K R R]^T          ⇒   [K K K R]^T
Replace every 4 weeks    4d = [K K K R]^T          ⇒   [K K K K]^T

The MCR associated with a replacement interval less than the 4-week service
life is a unichain with transient states. If a replacement interval of t < 4 weeks
is chosen, then states 0, 1, … , t − 1 form a recurrent closed class, and the
remaining states are transient. The MCRs associated with each of the four
replacement intervals are shown below, accompanied by their respective gains.
Policy 1, Replace a component every week:

1d = [R R R R]^T ⇒ action = [K R R R]^T

State   Decision   p_i0   p_i1   p_i2   p_i3   Cost q_i
0       2 = R      1      0      0      0      $8
1       2 = R      1      0      0      0      8
2       2 = R      1      0      0      0      8
3       2 = R      1      0      0      0      8

π = [1 0 0 0], g = πq = $8.

Under this policy, state 0 is absorbing, and states 1, 2, and 3 are transient.
Policy 2, Replace a component every 2 weeks:

2d = [K R R R]^T ⇒ action = [K K R R]^T

State   Decision   p_i0   p_i1   p_i2   p_i3   Cost q_i
0       1 = K      0.2    0.8    0      0      $4
1       2 = R      1      0      0      0      8
2       2 = R      1      0      0      0      8
3       2 = R      1      0      0      0      8

π = [0.5556 0.4444 0 0], g = πq = $5.78.

Under this policy, states {0, 1} form a recurrent closed class, while states 2 and
3 are transient.


Policy 3, Replace a component every 3 weeks:

3d = [K K R R]^T ⇒ action = [K K K R]^T

State   Decision   p_i0    p_i1   p_i2    p_i3   Cost q_i
0       1 = K      0.2     0.8    0       0      $4
1       1 = K      0.375   0      0.625   0      5.75
2       2 = R      1       0      0       0      8
3       2 = R      1       0      0       0      8

π = [0.4348 0.3478 0.2174 0], g = πq = $5.48.

Under this policy, states {0, 1, 2} form a recurrent closed class, while state 3
is transient.
Policy 4, Replace a component every 4 weeks:

4d = [K K K R]^T ⇒ action = [K K K K]^T

State   Decision   p_i0    p_i1   p_i2    p_i3   Cost q_i
0       1 = K      0.2     0.8    0       0      $4
1       1 = K      0.375   0      0.625   0      5.75
2       1 = K      0.8     0      0       0.2    10
3       2 = R      1       0      0       0      12

π = [0.4167 0.3333 0.2083 0.0417], g = πq = $6.17.

Under this policy, all states are recurrent. The minimum cost policy,

3d = [K K R R]^T ⇒ action = [K K K R]^T, (5.47)

is to replace a component every 3 weeks at an expected average cost per week
of $5.48.
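The four gains can be reproduced exactly with rational arithmetic. The sketch below relies on an observation that is not stated in the text but follows from the balance equations: while a component is kept, π_{i+1} = π_i p_{i,i+1}, so the steady-state probabilities of the recurrent ages are proportional to the probabilities of surviving to each age. The names `SURVIVE`, `KEEP_COST`, `REPLACE_COST`, and `gain` are illustrative.

```python
from fractions import Fraction as F

SURVIVE = [F(4, 5), F(5, 8), F(1, 5)]            # p_{i,i+1} for ages 0, 1, 2
KEEP_COST = [2 + 10 * (1 - s) for s in SURVIVE]  # q_i^K = 2 + 10 p_i0
REPLACE_COST = [F(8), F(8), F(8), F(12)]         # q_i^R for ages 0..3


def gain(t):
    """Expected average cost per week when survivors are replaced every t weeks."""
    reach, prob = [], F(1)
    for age in range(t):
        reach.append(prob)       # unnormalized steady-state weight of this age
        if age < t - 1:
            prob *= SURVIVE[age]
    # Keep at ages 0..t-2, replace at age t-1.
    costs = KEEP_COST[:t - 1] + [REPLACE_COST[t - 1]]
    return sum(r * c for r, c in zip(reach, costs)) / sum(reach)


gains = {t: gain(t) for t in (1, 2, 3, 4)}
for t, g in gains.items():
    print(t, g, round(float(g), 2))
best_interval = min(gains, key=gains.get)
```

Using `Fraction` avoids rounding, so the ranking of the policies does not depend on how many decimals are carried.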

5.1.3.3.2.2 Solution by LP of MDP Model of Component Replacement

The unichain MDP model of component replacement specified in Table 5.43 is
formulated as the following LP:

\[
\text{Minimize } g = (4 y_0^1 + 8 y_0^2) + (5.75 y_1^1 + 8 y_1^2) + (10 y_2^1 + 8 y_2^2) + 12 y_3^2
\]

subject to

\[
\begin{aligned}
&(1)\quad (y_0^1 + y_0^2) - (0.2 y_0^1 + y_0^2) - (0.375 y_1^1 + y_1^2) - (0.8 y_2^1 + y_2^2) - y_3^2 = 0 \\
&(2)\quad (y_1^1 + y_1^2) - 0.8 y_0^1 = 0 \\
&(3)\quad (y_2^1 + y_2^2) - 0.625 y_1^1 = 0 \\
&(4)\quad y_3^2 - 0.2 y_2^1 = 0 \\
&(5)\quad (y_0^1 + y_0^2) + (y_1^1 + y_1^2) + (y_2^1 + y_2^2) + y_3^2 = 1 \\
&y_0^1 \ge 0,\ y_0^2 \ge 0,\ y_1^1 \ge 0,\ y_1^2 \ge 0,\ y_2^1 \ge 0,\ y_2^2 \ge 0,\ y_3^2 \ge 0.
\end{aligned}
\]
After omitting one redundant constraint, the LP is solved on a
personal computer using LP software to find an optimal policy. The output
of the LP software is summarized in Table 5.45.
The output of the LP software shows that all the y_i^k equal zero, except for
y_0^1 = 0.4348, y_1^1 = 0.3478, and y_2^2 = 0.2174. These values are the steady-state
probabilities for the optimal policy 3d = [1 1 2 2]^T = [K K R R]^T, which
was previously found by enumeration in Equation (5.47) of Section 5.1.3.3.2.1.
The LP solution has confirmed that the minimum cost policy is to replace a
component every 3 weeks at an expected average cost per week of $5.48.
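A reported LP solution can be checked for feasibility by substituting it into the constraints. The sketch below assumes that the rounded values 0.4348, 0.3478, and 0.2174 stand for the exact fractions 10/23, 8/23, and 5/23; that identification is an assumption made here, as are the names `p`, `cost`, and `residual`. The transition and cost data come from Table 5.43.

```python
from fractions import Fraction as F

# (state, decision) -> transition row, with decision 1 = K and 2 = R.
p = {
    (0, 1): [F(1, 5), F(4, 5), 0, 0], (0, 2): [1, 0, 0, 0],
    (1, 1): [F(3, 8), 0, F(5, 8), 0], (1, 2): [1, 0, 0, 0],
    (2, 1): [F(4, 5), 0, 0, F(1, 5)], (2, 2): [1, 0, 0, 0],
    (3, 2): [1, 0, 0, 0],
}
cost = {(0, 1): 4, (0, 2): 8, (1, 1): F(23, 4), (1, 2): 8,
        (2, 1): 10, (2, 2): 8, (3, 2): 12}

# Assumed exact form of the LP solution; every other variable is zero.
y = {key: F(0) for key in p}
y[(0, 1)], y[(1, 1)], y[(2, 2)] = F(10, 23), F(8, 23), F(5, 23)


def residual(j):
    """Left-hand side of the balance constraint for state j."""
    inflow = sum(y[key] * p[key][j] for key in p)
    outflow = sum(value for (i, _), value in y.items() if i == j)
    return outflow - inflow


residuals = [residual(j) for j in range(4)]
total = sum(y.values())
objective = sum(cost[key] * y[key] for key in p)
print(residuals, total, objective, round(float(objective), 4))
```

All balance residuals vanish and the variables sum to one, so the solution is feasible, and its objective value of 126/23 rounds to the 5.4783 reported by the LP software.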

TABLE 5.45
LP Solution of Unichain MDP Model of Component Replacement
Objective Function = 5.4783

i    k    y_i^k
0    1    0.4348
     2    0
1    1    0.3478
     2    0
2    1    0
     2    0.2174
3    2    0

5.1.3.3.3 Optimal Stopping Over a Finite Planning Horizon

An optimal stopping problem is concerned with determining when to stop
a sequence of trials with random outcomes. The objective is to maximize
an expected reward. This book treats the simplest kind of optimal stopping
problem, which is to decide when to stop a sequence of independent trials.
The planning horizon can be finite or infinite. In Section 5.1.3.3.3.1, the
secretary problem, an example of an optimal stopping problem, is formulated
as a unichain MDP over a finite planning horizon, and solved by executing
value iteration. In Section 5.2.2.4.2, LP will be used to find an optimal policy
for a discounted MDP model of a modified secretary problem over an
infinite horizon [1, 3, 4].
In the simplest kind of optimal stopping problem, a system evolves as
an independent trials process. As Section 1.10.1.1.6 indicates, a sequence
of independent trials can be modeled as a Markov chain. At each epoch,
there are two decisions in every state: to stop or to continue. If the decision
maker stops in state i at epoch n, a reward $q_i(n)$ is received. If the decision
maker continues in state i at epoch n, no reward is received, and with
transition probability $p_{ij}$ the system reaches state j at epoch n + 1. If the
planning horizon is finite, of length T periods, and the decision maker
reaches a state h at epoch T, then a reward $q_h(T)$ is received, and the
process stops. The objective is to choose a policy to maximize the expected
reward.

5.1.3.3.3.1 Formulation of the Secretary Problem as a Unichain MDP

In this section, a colorful example of an optimal stopping problem, called the
secretary problem, will be modeled as a unichain MDP. The secretary problem
is a continuation of the example of an independent trials process involving
the arrival of candidates for a secretarial position, which was described in
Section 1.10.1.1.6. To transform that example into the secretary problem, sup-
pose that the executive who is interviewing candidates must hire a secretary
within the next 6 days. The executive will interview up to six consecutive
candidates at a rate of one applicant per day. After interviewing each candi-
date or applicant, the executive will assign the current candidate one of the
four numerical scores listed in Table 1.8. At the end of an interview, the exec-
utive must decide immediately whether to hire or reject the current appli-
cant. A rejected candidate is excluded from further consideration. When a
candidate is hired as a secretary, the interviews stop. When an applicant is
rejected, the interviews continue. However, if the first five applicants have
been rejected, the sixth applicant must be hired. The objective is to maximize
the expected score of the secretary.
As Equation (1.51) in Section 1.10.1.1.6 indicates, an independent trials
process can be modeled as a Markov chain. The secretary problem will be
formulated as a unichain MDP and solved by value iteration over a 6-day
planning horizon. To formulate the problem as an MDP, note that there are
two decisions: to hire (decision H) or reject (decision R) the nth candidate.
There are six alternative policies that correspond to hiring the first, second,
third, fourth, fifth, or sixth candidate, and ending the interviews. The state,
denoted by Xn, is the numerical score assigned to the nth candidate, for n = 1,
2, 3, 4, 5, 6. The sequence {X1, X2, X3, X4, X5, X6} is a collection of six indepen-
dent, identically distributed random variables. The probability distribution
of Xn is shown in Table 1.8.
When the nth applicant is rejected, the state $X_{n+1}$ of the next applicant
is independent of the state $X_n$ of the current applicant. Hence, when the
nth candidate is rejected, the transition probability is
$p_{ij} = P(X_{n+1} = j \mid X_n = i) = P(X_{n+1} = j)$. The state space is augmented
by an absorbing state ∆ that is reached with probability 1 when an applicant
is hired, and with probability 0 when an applicant is rejected. The augmented
state space is E = {15, 20, 25, 30, ∆}. If the nth applicant is hired, the daily
reward is $X_n$, equal to the score assigned to the nth applicant. When an
applicant is hired, the process goes to the absorbing state ∆, where it remains
because no more candidates will be interviewed. If the nth applicant is
rejected, the daily reward is zero. Since all the transition probability matrices
corresponding to every possible decision contain the single absorbing state ∆,
the MDP is unichain. The unichain MDP model of the secretary problem over
a finite planning horizon is shown in Table 5.46.

TABLE 5.46
Data for Unichain MDP Model of a Secretary Problem Over a Finite Planning Horizon

State i   Decision k   p_{i,15}^k   p_{i,20}^k   p_{i,25}^k   p_{i,30}^k   p_{i,∆}^k   Reward q_i^k
15        1 = H        0            0            0            0            1           15
          2 = R        0.3          0.4          0.2          0.1          0           0
20        1 = H        0            0            0            0            1           20
          2 = R        0.3          0.4          0.2          0.1          0           0
25        1 = H        0            0            0            0            1           25
          2 = R        0.3          0.4          0.2          0.1          0           0
30        1 = H        0            0            0            0            1           30
          2 = R        0.3          0.4          0.2          0.1          0           0
∆         1 = H        0            0            0            0            1           0
          2 = R        0            0            0            0            1           0

5.1.3.3.3.2 Solution of the Secretary Problem by Value Iteration

Value iteration will be executed to determine whether to hire or reject the nth applicant on
the basis of the applicant’s score, Xn. The length of the planning horizon is
6 days, numbered 1 through 6, corresponding to six candidates who may be
interviewed. The calculations will produce a decision rule for each day of
the 6-day horizon. To conduct value iteration, let vi(n) denote the maximum
expected score if the nth candidate interviewed is assigned a score of Xn = i,
and an optimal policy is followed from day n until the end of the planning
horizon.
To solve the secretary problem, the backward recursive equations of value
iteration (5.6) are modified to have the following form.


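\[
v_i(n) = \max\Big\{ [x_n \text{ if hire}],\ \Big[ q_i^2 + \sum_{j=15}^{30} p_{ij}^2\, v_j(n+1) \text{ if reject} \Big] \Big\},
\quad \text{for } n = 1, 2, \ldots, 5, \text{ and } i = 15, 20, 25, 30.
\]

Substituting $q_i^2 = 0$,

\[
v_i(n) = \max\Big\{ [x_n \text{ if hire}],\ \big[ p_{i,15}^2 v_{15}(n+1) + p_{i,20}^2 v_{20}(n+1) + p_{i,25}^2 v_{25}(n+1) + p_{i,30}^2 v_{30}(n+1) \text{ if reject} \big] \Big\},
\quad \text{for } n = 1, 2, \ldots, 5, \text{ and } i = 15, 20, 25, 30.
\tag{5.48}
\]

\[
v_i(n) = \max\Big\{ [x_n \text{ if hire}],\ \big[ 0.3v_{15}(n+1) + 0.4v_{20}(n+1) + 0.2v_{25}(n+1) + 0.1v_{30}(n+1) \text{ if reject} \big] \Big\},
\quad \text{for } n = 1, 2, \ldots, 5.
\]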
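Because the reject branch of the recursion does not depend on the current score i, the whole recursion can be tracked with one number per day. The following restatement introduces the symbol $c(n)$ for that number; it is a notational convenience assumed here, not notation used in the book.

```latex
% c(n): expected score if the nth candidate is rejected (symbol introduced here)
\begin{aligned}
c(5) &= 0.3(15) + 0.4(20) + 0.2(25) + 0.1(30), \\
c(n) &= 0.3\max\{15, c(n+1)\} + 0.4\max\{20, c(n+1)\}
      + 0.2\max\{25, c(n+1)\} + 0.1\max\{30, c(n+1)\}, \quad n = 1, \ldots, 4, \\
v_i(n) &= \max\{\, i,\ c(n) \,\}, \quad n = 1, \ldots, 5 .
\end{aligned}
```

In this form the decision rule on day n reduces to a threshold: hire the nth candidate when the score i is at least $c(n)$, and reject otherwise.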


To begin the backward recursion, the terminal values on day 6 are set
equal to the possible scores of the applicant on day 6 because a sixth or final
applicant must be hired. Therefore, for n = 6,

vi (6) = X6 = i , for i = 15, 20, 25, 30

v15 (6) = 15, v20 (6) = 20, v25 (6) = 25, v30 (6) = 30.

The calculations for day 5, denoted by n = 5, are indicated below:

vi (5) = max{[x5 if hire], [0.3v15 (6) + 0.4v20 (6) + 0.2v25 (6) + 0.1v30 (6) if reject]},
for i = 15, 20,25,30
vi (5) = max{[x5 if hire], [0.3(15) + 0.4(20) + 0.2(25) + 0.1(30) if reject]},
for i = 15, 20,25,30
vi (5) = max{[x5 if hire], [20.5 if reject]}, for i = 15, 20,25,30

If X 5 = i = 15, then v15 (5) = max{[15 if hire], [20.5 if reject]} = 20.5, reject
If X 5 = i = 20, then v20 (5) = max{[20 if hire], [20.5 if reject]} = 20.5, reject
If X 5 = i = 25, then v25 (5) = max{[25 if hire], [20.5 if reject]} = 25, hire
If X 5 = i = 30, then v30 (5) = max{[30 if hire], [20.5 if reject]} = 30, hire


The calculations for day 4, denoted by n = 4, are indicated below:

vi (4) = max{[x 4 if hire], [0.3v15 (5) + 0.4v20 (5) + 0.2v25 (5) + 0.1v30 (5) if reject]},
for i = 15, 20,25,30
vi (4) = max{[x 4 if hire], [0.3(20.5) + 0.4(20.5) + 0.2(25) + 0.1(30) if reject]},
for i = 15, 20,25,30
vi (4) = max{[x 4 if hire], [22.35 if reject]}, for i = 15, 20,25,30

If X 4 = i = 15, then v15 (4) = max{[15 if hire], [22.35 if reject]} = 22.35, reject
If X 4 = i = 20, then v20 (4) = max{[20 if hire], [22.35 if reject]} = 22.35, reject
If X 4 = i = 25, then v25 (4) = max{[25 if hire], [22.35 if reject]} = 25, hire
If X 4 = i = 30, then v30 (4) = max{[30 if hire], [22.35 if reject]} = 30, hire

The calculations for day 3, denoted by n = 3, are indicated below:

vi (3) = max{[x3 if hire], [0.3v15 (4) + 0.4v20 (4) + 0.2v25 (4) + 0.1v30 (4) if reject]},
for i = 15, 20,25,30
vi (3) = max{[x3 if hire], [0.3(22.35) + 0.4(22.35) + 0.2(25) + 0.1(30) if reject]},
for i = 15, 20,25,30
vi (3) = max{[x3 if hire], [23.645 if reject]}, for i = 15, 20,25,30

If X 3 = i = 15, then v15 (3) = max{[15 if hire], [23.645 if reject]} = 23.645, reject
If X 3 = i = 20, then v20 (3) = max{[20 if hire], [23.645 if reject]} = 23.645, reject
If X 3 = i = 25, then v25 (3) = max{[25 if hire], [23.645 if reject]} = 25, hire
If X 3 = i = 30, then v30 (3) = max{[30 if hire], [23.645 if reject]} = 30, hire

The calculations for day 2, denoted by n = 2, are indicated below.

vi (2) = max{[x2 if hire], [0.3v15 (3) + 0.4v20 (3) + 0.2v25 (3) + 0.1v30 (3) if reject]},
for i = 15, 20,25,30
vi (2) = max{[x2 if hire], [0.3(23.645) + 0.4(23.645) + 0.2(25) + 0.1(30) if reject]},
for i = 15, 20,25,30
vi (2) = max{[x2 if hire], [24.5515 if reject]}, for i = 15, 20,25,30


If X 2 = i = 15, then v15 (2) = max{[15 if hire], [24.5515 if reject]} = 24.5515, reject
If X 2 = i = 20, then v20 (2) = max{[20 if hire], [24.5515 if reject]} = 24.5515, reject
If X 2 = i = 25, then v25 (2) = max{[25 if hire], [24.5515 if reject]} = 25, hire
If X 2 = i = 30, then v30 (2) = max{[30 if hire], [24.5515 if reject]} = 30, hire.

Finally, the calculations for day 1, denoted by n = 1, are indicated below:

vi (1) = max{[x1 if hire], [0.3v15 (2) + 0.4v20 (2) + 0.2v25 (2) + 0.1v30 (2) if reject]},
for i = 15, 20,25,30
vi (1) = max{[x1 if hire], [0.3(24.5515) + 0.4(24.5515) + 0.2(25) + 0.1(30) if reject]},
for i = 15, 20,25,30
vi (1) = max{[x1 if hire], [25.18605 if reject]}, for i = 15, 20,25,30

If X1 = i = 15, then v15 (1) = max{[15 if hire], [25.18605 if reject]} = 25.18605, reject
If X1 = i = 20, then v20 (1) = max{[20 if hire], [25.18605 if reject]} = 25.18605, reject
If X1 = i = 25, then v25 (1) = max{[25 if hire], [25.18605 if reject]} = 25.18605, reject
If X1 = i = 30, then v30 (1) = max{[30 if hire], [25.18605 if reject]} = 30, hire.

The results of these calculations are summarized in the form of decision
rules given in Table 5.47. By following these decision rules, the executive will
maximize the expected score of the secretary.

TABLE 5.47
Decision Rules for Maximizing the Expected Score of the Secretary

Day, n   Candidate Score, Xn               Decision   Minimum Rating of Candidate to Be Hired   Interviews Will
1        If X1 = 15, 20, or 25, then       Reject                                               Continue
         If X1 = 30, then                  Hire       X1 ≥ 25.18605                             Stop
2        If X2 = 15 or 20, then            Reject                                               Continue
         If X2 = 25 or 30, then            Hire       X2 ≥ 24.5515                              Stop
3        If X3 = 15 or 20, then            Reject                                               Continue
         If X3 = 25 or 30, then            Hire       X3 ≥ 23.645                               Stop
4        If X4 = 15 or 20, then            Reject                                               Continue
         If X4 = 25 or 30, then            Hire       X4 ≥ 22.35                                Stop
5        If X5 = 15 or 20, then            Reject                                               Continue
         If X5 = 25 or 30, then            Hire       X5 ≥ 20.5                                 Stop
6        If X6 = 15, 20, 25, or 30, then   Hire       X6 ≥ 15                                   Stop

Suppose that the optimal policy specified by the decision rules in Table 5.47
is followed. Observe that if the score of the first candidate is greater than or
equal to 25.18605, that is, 30, the first candidate should be hired, and the inter-
views should be stopped. Otherwise, the interviews should be continued. If
X1 equals 30, the maximum expected score is 30 because the first applicant is
hired. If X1 is less than 30, the maximum expected score is 25.18605 because
the first applicant is rejected. Prior to knowing X1, the maximum expected
score is (0.1)(30) + (0.9)(25.18605) = 25.667445 because there is a probability of
0.1 that the first candidate will be excellent and receive a score of 30, and a
probability of 0.9 that the first candidate will not be excellent and the inter-
views will have to be continued. Note that if the fifth candidate receives a
score greater than or equal to 20.5, that is, 25 or 30, the fifth candidate should
be hired, and the interviews should be stopped. Finally, if the fifth candidate
is rejected, the sixth candidate must be hired.
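The backward recursion above is short enough to run directly. The following sketch uses the score distribution that appears in the recursion (0.3, 0.4, 0.2, 0.1 for scores 15, 20, 25, 30); the function name `solve_secretary` and the dictionary layout are illustrative choices, not the book's.

```python
# Score distribution used in the recursion above.
SCORES = [15, 20, 25, 30]
PROB = {15: 0.3, 20: 0.4, 25: 0.2, 30: 0.1}
HORIZON = 6  # at most six candidates, one per day


def solve_secretary(scores=SCORES, prob=PROB, horizon=HORIZON):
    """Backward value iteration; returns (v, reject_value, rule) keyed by day."""
    # Terminal values: the last candidate must be hired.
    v = {horizon: {i: float(i) for i in scores}}
    reject_value = {}
    rule = {horizon: {i: "hire" for i in scores}}
    for n in range(horizon - 1, 0, -1):
        # Expected score if candidate n is rejected; it does not depend on i.
        cont = sum(prob[j] * v[n + 1][j] for j in scores)
        reject_value[n] = cont
        v[n] = {i: max(float(i), cont) for i in scores}
        rule[n] = {i: "hire" if i >= cont else "reject" for i in scores}
    return v, reject_value, rule


v, reject_value, rule = solve_secretary()
for n in range(1, HORIZON):
    hired = [i for i in SCORES if rule[n][i] == "hire"]
    print(f"day {n}: reject value {reject_value[n]:.5f}, hire if score in {hired}")

# Expected score before the first interview, as computed in the text.
prior = sum(PROB[i] * v[1][i] for i in SCORES)
print(f"expected score before day 1: {prior:.6f}")
```

The reject values printed for days 1 through 5 are the thresholds in the "Minimum Rating" column of Table 5.47.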

5.1.4 A Multichain MDP


Recall that an MDP is said to be multichain if the transition matrix associ-
ated with at least one stationary policy contains two or more closed commu-
nicating sets of recurrent states plus a possibly empty set of transient states.
In Section 5.1.4.1, a multichain MDP model of a flexible production process
will be constructed. In Section 5.1.4.2, the PI algorithm of Section 5.1.2.3.3.2
will be extended so that it can be used to find an optimal policy for a mul-
tichain MDP. An LP formulation for a multichain MDP will be given in
Section 5.1.4.3.

5.1.4.1 Multichain Model of a Flexible Production System


To see how PI can be applied to determine an optimal policy for a multichain
MDP model, consider the following simplified multichain MDP model of a
flexible production system. (A smaller unichain MDP model of an experi-
mental production process was constructed in Section 5.1.3.1.1.) Suppose that
a flexible production system consists of three manufacturing stages in series.
The three-stage line can manufacture two types of products: industrial (I) or
consumer (C). Each stage can be programmed to make either type of prod-
uct independently of the other stages. The output of each manufacturing
stage is inspected. Output with a minor defect is reworked at the current
stage. Output from stage 1 or stage 2 that is not defective is passed on to
the next stage. Output from any stage with a major defect is discarded as
scrap. Nondefective output from stage 3 is sent to a training center. At
the training center, technicians are trained to maintain either the product
software (S) or the product hardware (H). The probability of defective output
from a stage is different for industrial and consumer output. Output
sent from stage 3 to the training center to provide training on software will
remain in the training center. Output sent from stage 3 to the training center
to provide training on hardware will be disassembled, and returned as input
to stage 1 in the following period.
By assigning a state to represent each operation, and adding a reward vec-
tor, the flexible production system is modeled as a five-state multichain MDP.
The states are indexed in Table 5.48.
Observe that production stage i is represented by transient state (6 − i). In
states 2, 3, 4, and 5, two decisions are possible. The decision at a manufactur-
ing stage is the type of customer, either industrial (I) or consumer (C), for
whom the product is intended. The decision at the training center is whether
to focus on hardware (H) or software (S). Since both industrial and consumer
output with a major defect at any stage is always discarded as scrap, state 1 is
always absorbing. States 3, 4, and 5 are always transient. In contrast, state 2 is
absorbing when the training center is dedicated to software, and is transient
when the training center is dedicated to hardware. With one decision in state
1 and two decisions in each of the other four states, the number of possible
policies is equal to $(1)(2^4) = 16$. Table 5.49 contains the data for the five-state
multichain MDP model of the flexible production system.
The passage of an item through the flexible production system is shown in
Figure 5.4.

TABLE 5.48
States for Flexible Production Process
State Operation
1 Scrap
2 Training center
3 Stage 3
4 Stage 2
5 Stage 1

TABLE 5.49
Data for Multichain Model of a Flexible Production System

State i   Operation   Decision k   p_i1^k   p_i2^k   p_i3^k   p_i4^k   p_i5^k   Reward q_i^k
1         Scrap       1            1        0        0        0        0        62
2         Training    1 = S        0        1        0        0        0        100
                      2 = H        0        0        0        0        1        22.58
3         Stage 3     1 = I        0.25     0.20     0.55     0        0        423.54
                      2 = C        0.20     0.35     0.45     0        0        468.62
4         Stage 2     1 = I        0.15     0        0.20     0.65     0        847.09
                      2 = C        0.20     0        0.30     0.50     0        796.21
5         Stage 1     1 = I        0.10     0        0        0.15     0.75     1935.36
                      2 = C        0.08     0        0        0.12     0.80     2042.15
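For readers who want to experiment with the model, Table 5.49 can be held in a small data structure. The sketch below is one possible encoding (the name `MODEL` and the helper `absorbing_states` are hypothetical); it checks that every row is a probability distribution, counts the stationary policies, and reports which states are absorbing under a given policy.

```python
from itertools import product

# Table 5.49 as nested dicts: MODEL[state][decision] = (transition row, reward).
# Each transition row lists the probabilities of moving to states 1..5 in order.
MODEL = {
    1: {1: ([1, 0, 0, 0, 0], 62.0)},
    2: {1: ([0, 1, 0, 0, 0], 100.0), 2: ([0, 0, 0, 0, 1], 22.58)},
    3: {1: ([0.25, 0.20, 0.55, 0, 0], 423.54), 2: ([0.20, 0.35, 0.45, 0, 0], 468.62)},
    4: {1: ([0.15, 0, 0.20, 0.65, 0], 847.09), 2: ([0.20, 0, 0.30, 0.50, 0], 796.21)},
    5: {1: ([0.10, 0, 0, 0.15, 0.75], 1935.36), 2: ([0.08, 0, 0, 0.12, 0.80], 2042.15)},
}

# Every row must sum to one.
for state, decisions in MODEL.items():
    for k, (row, _) in decisions.items():
        assert abs(sum(row) - 1.0) < 1e-9, (state, k)

# A stationary policy picks one decision per state, listed for states 1..5.
policies = list(product(*(sorted(MODEL[i]) for i in sorted(MODEL))))
print(len(policies))


def absorbing_states(policy):
    """States that return to themselves with probability 1 under the policy."""
    return [i for i, k in zip(sorted(MODEL), policy) if MODEL[i][k][0][i - 1] == 1]


print(absorbing_states((1, 1, 1, 1, 1)))  # training center dedicated to software
print(absorbing_states((1, 2, 1, 1, 1)))  # training center dedicated to hardware
```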

[Figure 5.4: transition diagram with nodes Stage 1, Stage 2, Stage 3, Training Center, and Scrap. The number on an arc indicates a transition probability under decision 1 (decision 2). The number next to a node indicates a reward under decision 1 (decision 2). The values are those listed in Table 5.49.]

FIGURE 5.4
Passage of an item through a three-stage flexible production system.

5.1.4.2 PI for a Multichain MDP


The PI algorithm [2] for finding an optimal stationary policy for a multi-
chain MDP is an extension of the one (described in Section 5.1.2.3.3.2) cre-
ated for a unichain MDP. The PI algorithm for a multichain MDP has two
main steps: the policy evaluation (PE) operation and the IM routine. The
algorithm begins by arbitrarily choosing an initial policy. The PE operation
for a multichain MDP is analogous to the reward evaluation operation for a
multichain MCR described in Section 4.2.5.3. During the PE operation, two
sets of N simultaneous linear equations are solved. The first system of equa-
tions is called the gain state equations (GSEs). The second system is called
the set of VDEs. The two systems of equations, the GSEs plus the VDEs, for a
multichain MDP are together called the Policy Evaluation Equations (PEEs).
(In Section 4.2.5.3 the GSEs plus the VDEs for a multichain MCR are together
called the Reward Evaluation Equations, or REEs.) The PEEs associated
with the current policy for a multichain MDP can be solved by following
the four-step procedure given in Section 4.2.5.3.3 for solving the REEs for a
multichain MCR. Following the completion of the PE operation to obtain the
gains and the relative values associated with the current policy, the IM rou-
tine attempts to find a better policy. If a better policy is found, the PE oper-
ation is repeated using the new policy to identify the appropriate transition
probabilities, rewards, and VDEs. The algorithm stops when two successive
iterations lead to identical policies.

5.1.4.2.1 Policy Improvement


The IM routine is motivated by the value iteration equation (5.6) of
Section 5.1.2.2.1 for a recurrent MDP over a finite planning horizon,

\[
v_i(n) = \max_k \Big\{ q_i^k + \sum_{j=1}^{N} p_{ij}^k\, v_j(n+1) \Big\},
\quad \text{for } n = 0, 1, \ldots, T-1, \text{ and } i = 1, 2, \ldots, N,
\tag{5.6}
\]

where the salvage values vi(T) are specified for all states i = 1, 2, ... , N. The
value iteration equation indicates that if an optimal policy is known over a
planning horizon starting at epoch n + 1 and ending at epoch T, then the best
decision in state i at epoch n can be found by maximizing a test quantity,

\[
q_i^k + \sum_{j=1}^{N} p_{ij}^k\, v_j(n+1),
\tag{5.49}
\]

over all decisions in state i. Therefore, if an optimal policy is known over a
planning horizon starting at epoch 1 and ending at epoch T, then at epoch
0, the beginning of the planning horizon, the best decision in state i can be
found by maximizing a test quantity,

\[
q_i^k + \sum_{j=1}^{N} p_{ij}^k\, v_j(1),
\tag{5.50}
\]

over all decisions in state i. Let $g_j$ denote the gain of state j. Recall from
Section 4.2.3.2.2 that when T, the length of the horizon, is very large, (T − 1) is
also very large, so that

\[
v_j(1) \approx (T-1)\, g_j + v_j .
\tag{5.51}
\]

Substituting this expression for $v_j(1)$ in the test quantity produces the result

\[
q_i^k + \sum_{j=1}^{N} p_{ij}^k\, v_j(1)
= q_i^k + \sum_{j=1}^{N} p_{ij}^k \big[ (T-1)\, g_j + v_j \big]
= q_i^k + (T-1) \sum_{j=1}^{N} p_{ij}^k\, g_j + \sum_{j=1}^{N} p_{ij}^k\, v_j
\tag{5.52}
\]

as the test quantity to be maximized with respect to all alternatives in
every state. When (T − 1) is very large, the term $(T-1) \sum_{j=1}^{N} p_{ij}^k g_j$ dominates the
test quantity, so that the test quantity is maximized by the alternative that
maximizes

\[
\sum_{j=1}^{N} p_{ij}^k\, g_j ,
\tag{5.53}
\]

called the gain test quantity, using the gains of the previous policy. However,
when two or more alternatives have the same maximum value of the gain
test quantity, there is a tie, and the gain test fails. In that case the decision
must be made on the basis of relative values rather than gains. The tie is
broken by choosing the alternative that maximizes

\[
q_i^k + \sum_{j=1}^{N} p_{ij}^k\, v_j ,
\tag{5.54}
\]

called the value test quantity, by using the relative values of the previous
policy.

5.1.4.2.2 PI Algorithm
The detailed steps of the PI algorithm [2] for a multichain MDP are given
below:
PI Algorithm
Step 1. Initial policy
Arbitrarily choose an initial policy by selecting for each state i a decision
di = k.
Step 2. PE operation
Use $p_{ij}$ and $q_i$ for a given policy to solve the set of GSEs

\[
g_i = \sum_{j=1}^{N} p_{ij}\, g_j , \quad \text{for } i = 1, 2, \ldots, N,
\tag{4.176}
\]

and the set of VDEs

\[
g_i + v_i = q_i + \sum_{j=1}^{N} p_{ij}\, v_j , \quad i = 1, 2, \ldots, N,
\tag{4.177}
\]

for all the relative values vi and gains gi by executing the four-step procedure
given in Section 4.2.5.3.3.

Step 3. IM routine
For each state i, find the decision $k^*$ that maximizes the gain test quantity

\[
\sum_{j=1}^{N} p_{ij}^k\, g_j
\tag{5.53}
\]

by using the gains of the previous policy. Then $k^*$ becomes the new decision
in state i, so that $d_i = k^*$, $q_i^{k^*}$ becomes $q_i$, and $p_{ij}^{k^*}$ becomes $p_{ij}$.
If two or more alternatives have the same maximum value of the gain test
quantity, the tie is broken by choosing the decision $k^*$ that maximizes the
value test quantity

\[
q_i^k + \sum_{j=1}^{N} p_{ij}^k\, v_j ,
\tag{5.54}
\]

by using the relative values of the previous policy. Then $k^*$ becomes the new
decision in state i, so that $d_i = k^*$, $q_i^{k^*}$ becomes $q_i$, and $p_{ij}^{k^*}$ becomes $p_{ij}$.
Regardless of whether the IM test is based on gains or values, if the previ-
ous decision in state i yields as high a value of the test quantity as any other
alternative, leave the previous decision unchanged to assure convergence in
the case of equivalent policies.
Step 4. Stopping rule
When the IM test has been completed for all states, a new policy has been
determined. A new P matrix and q vector have been obtained. If the new
policy is the same as the previous one, the algorithm has converged, and an
optimal policy has been found. If the new policy is different from the previ-
ous policy in at least one state, go to step 2.

5.1.4.2.3 Solution by PI of Multichain Model of a Flexible Production System
To see how PI can be applied to find an optimal policy for a multichain MDP,
PI will be executed to find an optimal policy for the multichain MDP model
of a flexible production system specified in Table 5.49 of Section 5.1.4.1.
First iteration
Step 1. Initial policy
Arbitrarily choose an initial policy which consists of making decision 1 in
every state. Thus, in state 2 the training center focuses on software. In states
3, 4, and 5, the product is designed for an industrial customer. The initial
decision vector ${}^{1}d$, along with the associated transition probability matrix ${}^{1}P$
and the reward vector ${}^{1}q$, are shown below:

\[
{}^{1}d = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}, \quad
{}^{1}P = \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{array}
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0.25 & 0.20 & 0.55 & 0 & 0 \\
0.15 & 0 & 0.20 & 0.65 & 0 \\
0.10 & 0 & 0 & 0.15 & 0.75
\end{bmatrix}, \quad
{}^{1}q = \begin{bmatrix} 62 \\ 100 \\ 423.54 \\ 847.09 \\ 1935.36 \end{bmatrix}.
\]

Observe that the initial policy generates an absorbing multichain in which
states 1 and 2 are absorbing states and the other three states are transient.
Step 2. PE operation
Use pij and qi for the initial policy, 1d = [1 1 1 1 1]T, to solve the
set of GSEs (4.176) and the set of VDEs (4.177) for all the relative values
vi and gains gi, by setting the value of one vi in each closed class of recurrent
states equal to zero.
Step 1 of PE:
The two recurrent chains are the absorbing states 1 and 2. After setting the
relative value v1 = 0 for the highest numbered state in the first recurrent
chain, the VDE for the first recurrent chain is

g1 + v1 = 62 + v1
g1 = 62.

After setting the relative value v2 = 0 for the highest numbered state in the
second recurrent chain, the VDE for the second recurrent chain is

g 2 + v2 = 100 + v2
g 2 = 100.

Hence, the independent gains of the two recurrent states are

g1 = 62
g 2 = 100.

Step 2 of PE:
The GSEs for the transient states are

g 3 = 0.25 g1 + 0.20 g 2 + 0.55 g 3


g 4 = 0.15 g1 + 0.20 g 3 + 0.65 g 4
g 5 = 0.10 g1 + 0.15 g 4 + 0.75 g 5


After algebraic simplification, the gains of the three transient states are
expressed as weighted averages of the independent gains of the two closed
classes of recurrent states

g 3 = 0.5556 g1 + 0.4444 g 2 = 0.5556(62) + 0.4444(100) = 78.8872


g 4 = 0.7461g1 + 0.2539 g 2 = 0.7461(62) + 0.2539(100) = 71.6482
g 5 = 0.8477 g1 + 0.1523 g 2 = 0.8477(62) + 0.1523(100) = 67.7874.

Step 3 of PE:
The VDEs for the transient states are

g 3 + v3 = 423.54 + 0.25v1 + 0.20v2 + 0.55v3


g 4 + v4 = 847.09 + 0.15v1 + 0.20v3 + 0.65v4
g 5 + v5 = 1935.36 + 0.10v1 + 0.15v4 + 0.75v5 .

After substituting the gains of the transient states obtained in step 2 of PE,
and the relative values of the recurrent states obtained in step 1 of PE, the
VDEs for the transient states are

78.8872 + v3 = 423.54 + 0.25(0) + 0.20(0) + 0.55v3


71.6482 + v4 = 847.09 + 0.15(0) + 0.20(v3 ) + 0.65v4
67.7874 + v5 = 1935.36 + 0.10(0) + 0.15v4 + 0.75v5 .

Step 4 of PE:
The VDEs for the transient states are solved to obtain the relative values of
the transient states: v3 = 765.8951, v4 = 2653.2023, and v5 = 9062.2118.
The solutions for the gain vector and the vector of relative values for the
initial policy are summarized below:

\[
{}^{1}g = \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{array}
\begin{bmatrix} 62 \\ 100 \\ 78.8872 \\ 71.6482 \\ 67.7874 \end{bmatrix}, \quad
{}^{1}v = \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{array}
\begin{bmatrix} 0 \\ 0 \\ 765.8951 \\ 2653.2023 \\ 9062.2118 \end{bmatrix}.
\]
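The four PE steps for the initial policy can be reproduced with a few lines of forward substitution. This sketch is not a general multichain solver: it assumes, as holds for this policy, that states 1 and 2 are absorbing and that each transient state moves only to itself or to lower-numbered states. The function name `evaluate_policy` is hypothetical. Because the code carries full precision while the text rounds the weights to four decimals, the last digits differ slightly from the values printed above.

```python
# Nonzero entries of 1P and the entries of 1q for the initial policy.
P = {
    1: {1: 1.0},
    2: {2: 1.0},
    3: {1: 0.25, 2: 0.20, 3: 0.55},
    4: {1: 0.15, 3: 0.20, 4: 0.65},
    5: {1: 0.10, 4: 0.15, 5: 0.75},
}
q = {1: 62.0, 2: 100.0, 3: 423.54, 4: 847.09, 5: 1935.36}


def evaluate_policy(P, q, absorbing=(1, 2), transient=(3, 4, 5)):
    """Gains and relative values under the triangular structure assumed above."""
    g, v = {}, {}
    for i in absorbing:
        # Step 1 of PE: with v_i = 0 the VDE reduces to g_i = q_i.
        g[i], v[i] = q[i], 0.0
    for i in transient:
        stay = P[i].get(i, 0.0)
        # Step 2 of PE: solve the GSE for g_i in terms of already known gains.
        g[i] = sum(p * g[j] for j, p in P[i].items() if j != i) / (1.0 - stay)
        # Steps 3 and 4 of PE: solve the VDE for v_i.
        rest = sum(p * v[j] for j, p in P[i].items() if j != i)
        v[i] = (q[i] - g[i] + rest) / (1.0 - stay)
    return g, v


g, v = evaluate_policy(P, q)
for i in sorted(g):
    print(i, round(g[i], 4), round(v[i], 4))
```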

Step 3. IM routine
For each state i, find the decision $k^*$ that maximizes the gain test quantity

\[
\sum_{j=1}^{5} p_{ij}^k\, g_j = p_{i1}^k g_1 + p_{i2}^k g_2 + ( p_{i3}^k g_3 + p_{i4}^k g_4 + p_{i5}^k g_5 )
\]

by using the gains of the previous policy. Then $k^*$ becomes the new decision
in state i, so that $d_i = k^*$, $q_i^{k^*}$ becomes $q_i$, and $p_{ij}^{k^*}$ becomes $p_{ij}$. The first IM
routine is executed in Table 5.50.

TABLE 5.50
First IM for Model of a Flexible Production System

State i   Decision k   Gain Test Quantity, Σ_j p_ij^k g_j                        Value Test Quantity, q_i^k + Σ_j p_ij^k v_j
1         1            1(62) = 62 ←
2         1            1(100) = 100 ←
          2            1(67.7874) = 67.7874
3         1            0.25(62) + 0.20(100) + 0.55(78.8872) = 78.8880
          2            0.20(62) + 0.35(100) + 0.45(78.8872) = 82.8992 ←
4         1            0.15(62) + 0.20(78.8872) + 0.65(71.6482) = 71.6488
          2            0.20(62) + 0.30(78.8872) + 0.50(71.6482) = 71.8903 ←
5         1            67.7878                                                   1935.36 + (7194.6392) = 9129.9992
          2            67.7878                                                   2042.15 + (7568.1537) = 9610.3037 ←
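The IM routine can be sketched as a function that applies the gain test first and falls back on the value test only for tied decisions. The inputs below are the gains and relative values of the initial policy as reported in the text. The tolerance `TOL` for declaring a tie is an assumption introduced here, since rounded inputs rarely tie exactly in floating-point arithmetic; the names `DATA` and `improve` are also illustrative.

```python
# Decision data from Table 5.49: DATA[i][k] = (row of p_ij^k for j = 1..5, q_i^k).
DATA = {
    1: {1: ([1, 0, 0, 0, 0], 62.0)},
    2: {1: ([0, 1, 0, 0, 0], 100.0), 2: ([0, 0, 0, 0, 1], 22.58)},
    3: {1: ([0.25, 0.20, 0.55, 0, 0], 423.54), 2: ([0.20, 0.35, 0.45, 0, 0], 468.62)},
    4: {1: ([0.15, 0, 0.20, 0.65, 0], 847.09), 2: ([0.20, 0, 0.30, 0.50, 0], 796.21)},
    5: {1: ([0.10, 0, 0, 0.15, 0.75], 1935.36), 2: ([0.08, 0, 0, 0.12, 0.80], 2042.15)},
}
# Gains and relative values of the initial policy, as reported in the text.
g = [62.0, 100.0, 78.8872, 71.6482, 67.7874]
v = [0.0, 0.0, 765.8951, 2653.2023, 9062.2118]
previous = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1}
TOL = 1e-3  # assumed tolerance for treating two test quantities as tied


def improve(data, g, v, previous, tol=TOL):
    """One pass of the IM routine: gain test, then value test on ties."""
    new_policy = {}
    for i, decisions in data.items():
        gain_test = {k: sum(p * gj for p, gj in zip(row, g))
                     for k, (row, _) in decisions.items()}
        best_gain = max(gain_test.values())
        tied = [k for k, t in gain_test.items() if best_gain - t <= tol]
        if len(tied) > 1:
            # Gain test fails: break the tie with the value test quantity.
            value_test = {k: decisions[k][1]
                          + sum(p * vj for p, vj in zip(decisions[k][0], v))
                          for k in tied}
            best_value = max(value_test.values())
            tied = [k for k in tied if best_value - value_test[k] <= tol]
        # Keep the previous decision when it is among the maximizers.
        new_policy[i] = previous[i] if previous[i] in tied else tied[0]
    return new_policy


new_policy = improve(DATA, g, v, previous)
print(new_policy)
```

With these inputs the gain test decides states 2, 3, and 4, and the value test decides state 5, matching the arrows in Table 5.50.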

Step 4. Stopping rule


The new policy is given by the vector ${}^{2}d = [1\ 1\ 2\ 2\ 2]^T$, which is different
from the initial policy. Therefore, go to step 2. The new decision vector
${}^{2}d$, along with the associated transition probability matrix ${}^{2}P$ and the reward
vector ${}^{2}q$, are shown below.

\[
{}^{2}d = \begin{bmatrix} 1 \\ 1 \\ 2 \\ 2 \\ 2 \end{bmatrix}, \quad
{}^{2}P = \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{array}
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0.20 & 0.35 & 0.45 & 0 & 0 \\
0.20 & 0 & 0.30 & 0.50 & 0 \\
0.08 & 0 & 0 & 0.12 & 0.80
\end{bmatrix}, \quad
{}^{2}q = \begin{bmatrix} 62 \\ 100 \\ 468.62 \\ 796.21 \\ 2042.15 \end{bmatrix}.
\tag{5.55}
\]

Second iteration

Step 2. PE operation
Use pij and qi for the second policy, 2d = [1 1 2 2 2]T, to solve the
set of GSEs (4.176) and the set of VDEs (4.177) for all the relative values
vi and gains gi, by setting the value of one vi in each closed class of recur-
rent states equal to zero.


Step 1 of PE:
After setting the relative values v1 = v2 = 0 in the VDEs for the highest num-
bered states in the two recurrent closed classes, the independent gains of the
two recurrent states are once again

g 1 = 62
g 2 = 100.

Step 2 of PE:
The GSEs for the transient states are

g 3 = 0.20 g1 + 0.35 g 2 + 0.45 g 3


g 4 = 0.20 g1 + 0.30 g 3 + 0.50 g 4
g 5 = 0.08 g1 + 0.12 g 4 + 0.80 g 5 .

After algebraic simplification, the gains of the three transient states expressed
as weighted averages of the independent gains of the two closed classes of
recurrent states are

g 3 = 0.3636 g1 + 0.6364 g 2 = 0.3636(62) + 0.6364(100) = 86.1818


g 4 = 0.6182 g1 + 0.3818 g 2 = 0.6182(62) + 0.3818(100) = 76.5091
g 5 = 0.7709 g1 + 0.2291g 2 = 0.7709(62) + 0.2291(100) = 70.7055.

Step 3 of PE:
The VDEs for the transient states are

g 3 + v3 = 468.62 + 0.20v1 + 0.35v2 + 0.45v3


g 4 + v4 = 796.21 + 0.20v1 + 0.30v3 + 0.50v4
g 5 + v5 = 2042.15 + 0.08v1 + 0.12v4 + 0.80v5 .

After substituting the gains of the transient states obtained in step 2 of PE,
and the relative values of the recurrent states obtained in step 1 of PE, the
VDEs for the transient states are

86.1818 + v3 = 468.62 + 0.20(0) + 0.35(0) + 0.45v3


76.5091 + v4 = 796.21 + 0.20(0) + 0.30(v3 ) + 0.50v4
70.7055 + v5 = 2042.15 + 0.08(0) + 0.12v4 + 0.80v5 .

Step 4 of PE:
The VDEs for the transient states are solved to obtain the relative values of
the transient states: v3 = 695.3422, v4 = 1856.6071, and v5 = 10971.187.

The solutions for the gain vector and the vector of relative values for the
second policy are summarized below:

\[
{}^{2}g = \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{array}
\begin{bmatrix} 62 \\ 100 \\ 86.1818 \\ 76.5091 \\ 70.7055 \end{bmatrix}, \quad
{}^{2}v = \begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{array}
\begin{bmatrix} 0 \\ 0 \\ 695.3422 \\ 1856.6071 \\ 10971.187 \end{bmatrix}.
\tag{5.56}
\]

Step 3. IM routine
For each state i, find the decision $k^*$ that maximizes the gain test quantity

\[
\sum_{j=1}^{5} p_{ij}^k\, g_j = p_{i1}^k g_1 + p_{i2}^k g_2 + ( p_{i3}^k g_3 + p_{i4}^k g_4 + p_{i5}^k g_5 )
\]

by using the gains of the previous policy. Then $k^*$ becomes the new decision
in state i, so that $d_i = k^*$, $q_i^{k^*}$ becomes $q_i$, and $p_{ij}^{k^*}$ becomes $p_{ij}$. The second IM
routine is executed in Table 5.51.

TABLE 5.51
Second IM for Model of a Flexible Production System

State i   Decision k   Gain Test Quantity, Σ_j p_ij^k g_j                        Value Test Quantity, q_i^k + Σ_j p_ij^k v_j
1         1            1(62) = 62 ←
2         1            1(100) = 100 ←
          2            1(70.7055) = 70.7055
3         1            0.25(62) + 0.20(100) + 0.55(86.1818) = 82.9000
          2            0.20(62) + 0.35(100) + 0.45(86.1818) = 86.1818 ←
4         1            0.15(62) + 0.20(86.1818) + 0.65(76.5091) = 76.2673
          2            0.20(62) + 0.30(86.1818) + 0.50(76.5091) = 76.5091 ←
5         1            70.7056                                                   1935.36 + (8506.8813) = 10442.241
          2            70.7056                                                   2042.15 + (8999.7425) = 11041.892 ←

Step 4. Stopping rule
Stop because the new policy, given by the vector ${}^{3}d \equiv {}^{2}d = [1\ 1\ 2\ 2\ 2]^T$, is
identical to the previous policy. Therefore, this policy is optimal. The firm
will manufacture a consumer product at stages 1, 2, and 3, and train
employees on software in state 2. The optimal policy generates an absorbing
multichain MCR with two recurrent closed classes, absorbing states 1 and 2,
respectively, and a set of transient states {3, 4, 5}. The gain vector ${}^{2}g$, and the
vector of relative values ${}^{2}v$, for the optimal policy are shown in Equation
(5.56). The optimal decision vector ${}^{2}d$, the associated transition probability
matrix ${}^{2}P$, and the reward vector ${}^{2}q$, are shown in Equation (5.55).
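The gains of the optimal policy can be checked independently of the GSEs. For a chain whose only recurrent states are absorbing, the powers of the transition matrix converge, and the gain vector equals the limiting matrix multiplied by the reward vector. The sketch below applies that check to the matrix and reward vector of Equation (5.55) by repeated squaring in plain Python; the helper `matmul` and the choice of 12 squarings are illustrative assumptions, not part of the PI algorithm.

```python
# Transition matrix and reward vector of the optimal policy, Equation (5.55).
P = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.20, 0.35, 0.45, 0.0, 0.0],
    [0.20, 0.0, 0.30, 0.50, 0.0],
    [0.08, 0.0, 0.0, 0.12, 0.80],
]
q = [62.0, 100.0, 468.62, 796.21, 2042.15]


def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Repeated squaring: after 12 squarings this holds P raised to the power 2**12,
# by which point the probability of still being in a transient state is negligible.
limit = P
for _ in range(12):
    limit = matmul(limit, limit)

# Columns 1 and 2 of the limiting matrix are the absorption probabilities.
gains = [sum(p * qj for p, qj in zip(row, q)) for row in limit]
print([round(x, 4) for x in gains])
```

The first two columns of `limit` should reproduce the weights 0.3636, 0.6364 and so on that appear in step 2 of PE for the second policy.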

5.1.4.3 Linear Programming


Puterman [3] constructs the LP formulation for a multichain MDP from the
LP formulation in Section 5.1.2.3.4.1 for a unichain MDP by adding an addi-
tional set of variables, xik, and an additional set of constraints which general-
izes the single normalizing constraint for a unichain MDP. Since a multichain
MDP may not have a unique gain g, the value of the objective function for
the LP is denoted by z instead of g. The LP formulation for a multichain MDP
with N states and Ki decisions in each state i is

$$\text{Maximize } z = \sum_{i=1}^{N} \sum_{k=1}^{K_i} q_i^k y_i^k$$

subject to

$$\sum_{k=1}^{K_j} y_j^k - \sum_{i=1}^{N} \sum_{k=1}^{K_i} y_i^k p_{ij}^k = 0, \quad \text{for } j = 1, 2, \ldots, N,$$

$$\sum_{k=1}^{K_j} y_j^k + \sum_{k=1}^{K_j} x_j^k - \sum_{i=1}^{N} \sum_{k=1}^{K_i} x_i^k p_{ij}^k = b_j, \quad \text{for } j = 1, 2, \ldots, N,$$

where each constant $b_j > 0$ and $\sum_{j=1}^{N} b_j = 1$, and $y_i^k \ge 0$, $x_i^k \ge 0$, for i = 1, 2, ..., N, and k = 1, 2, ..., K_i. At least one of the equations in the first set of constraints is redundant.
The optimal stationary policy generated by the solution of the LP for a unichain MDP is deterministic. A deterministic policy is one for which

$$P(\text{decision} = k \mid \text{state} = i) = \begin{cases} 1 & \text{for a single decision } k \text{ in state } i \\ 0 & \text{for all other decisions in state } i. \end{cases}$$

The optimal LP solution for a multichain MDP may generate a deterministic stationary policy or a randomized stationary policy. In a randomized policy, P(decision = k|state = i) is a conditional probability distribution for the decision to be made in state i, so that 0 ≤ P(decision = k|state = i) ≤ 1. A randomized stationary policy generated by a feasible solution to the LP formulation for a multichain MDP is characterized by

$$P(\text{decision} = k \mid \text{state} = i) = \begin{cases} y_i^k \Big/ \sum_{k=1}^{K_i} y_i^k & \text{for states } i \text{ in which } \sum_{k=1}^{K_i} y_i^k > 0 \\ x_i^k \Big/ \sum_{k=1}^{K_i} x_i^k & \text{for states } i \text{ in which } \sum_{k=1}^{K_i} x_i^k > 0. \end{cases}$$
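The characterization above can be sketched in a few lines of Python. This is only an illustration: the function name `extract_policy`, the dictionary layout, and the choice to test the y variables before the x variables in each state are assumptions of the sketch, and the numbers are the LP values reported in Table 5.53 for the flexible production system.

```python
def extract_policy(y, x, tol=1e-9):
    """Return {state: {decision: probability}} from LP variable values.

    y and x map (state, decision) -> value. Testing y before x in each
    state is an assumption made for this sketch.
    """
    policy = {}
    for state in sorted({s for s, _ in y}):
        for values in (y, x):
            weights = {k: v for (s, k), v in values.items() if s == state}
            total = sum(weights.values())
            if total > tol:
                # Normalize the weights into a conditional distribution.
                policy[state] = {k: v / total for k, v in weights.items()}
                break
    return policy


# LP variable values as reported in Table 5.53.
y = {(1, 1): 0.5505, (2, 1): 0.4495, (2, 2): 0.0, (3, 1): 0.0, (3, 2): 0.0,
     (4, 1): 0.0, (4, 2): 0.0, (5, 1): 0.0, (5, 2): 0.0}
x = {(1, 1): 0.0, (2, 1): 0.0, (2, 2): 0.0, (3, 1): 0.0, (3, 2): 0.7127,
     (4, 1): 0.0, (4, 2): 0.6400, (5, 1): 0.0, (5, 2): 1.0}

policy = extract_policy(y, x)
# Pick the most probable decision in each state.
decisions = [max(policy[s], key=policy[s].get) for s in sorted(policy)]
print(decisions)
```

For these values every state puts probability 1 on a single decision, so the extracted policy is deterministic and agrees with d = [1 1 2 2 2]^T.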

In the final simplex tableau, under a deterministic policy, the dual variable
associated with each equation in the second set of constraints is equal to
the gain in the corresponding state. That is, for an N-state MDP, under a
deterministic policy, the dual variable associated with constraint equation i,
for i > N, is equal to the gain in state (i − N).

5.1.4.3.1 LP Formulation of Multichain Model of a Flexible Production System


The multichain MDP model of a flexible production system constructed in
Section 5.1.4.1 will be formulated as an LP. Table 5.49 from Section 5.1.4.1 is
repeated in Table 5.52, augmented by two right-hand side columns of LP
decision variables.
In this example N = 5 states, K1 = 1 decision in state 1, and K 2 = K3 =
K4 = K5 = 2 decisions in every other state.

TABLE 5.52
Data for Multichain Model of a Flexible Production System

                                 Transition Probability                  Reward    LP Variable
State i  Operation  Decision k   p_i1^k  p_i2^k  p_i3^k  p_i4^k  p_i5^k  q_i^k     y_i^k   x_i^k
1        Scrap      1            1       0       0       0       0       62        y_1^1   x_1^1
2        Training   1 = S        0       1       0       0       0       100       y_2^1   x_2^1
2        Training   2 = H        0       0       0       0       1       22.58     y_2^2   x_2^2
3        Stage 3    1 = I        0.25    0.20    0.55    0       0       423.54    y_3^1   x_3^1
3        Stage 3    2 = C        0.20    0.35    0.45    0       0       468.62    y_3^2   x_3^2
4        Stage 2    1 = I        0.15    0       0.20    0.65    0       847.09    y_4^1   x_4^1
4        Stage 2    2 = C        0.20    0       0.30    0.50    0       796.21    y_4^2   x_4^2
5        Stage 1    1 = I        0.10    0       0       0.15    0.75    1935.36   y_5^1   x_5^1
5        Stage 1    2 = C        0.08    0       0       0.12    0.80    2042.15   y_5^2   x_5^2

Objective Function
The objective function for the LP is

$$\begin{aligned}
\text{Maximize } z &= \sum_{k=1}^{1} q_1^k y_1^k + \sum_{i=2}^{5} \sum_{k=1}^{2} q_i^k y_i^k \\
&= \sum_{k=1}^{1} q_1^k y_1^k + \sum_{k=1}^{2} q_2^k y_2^k + \sum_{k=1}^{2} q_3^k y_3^k + \sum_{k=1}^{2} q_4^k y_4^k + \sum_{k=1}^{2} q_5^k y_5^k \\
&= q_1^1 y_1^1 + (q_2^1 y_2^1 + q_2^2 y_2^2) + (q_3^1 y_3^1 + q_3^2 y_3^2) + (q_4^1 y_4^1 + q_4^2 y_4^2) + (q_5^1 y_5^1 + q_5^2 y_5^2) \\
&= 62 y_1^1 + (100 y_2^1 + 22.58 y_2^2) + (423.54 y_3^1 + 468.62 y_3^2) + (847.09 y_4^1 + 796.21 y_4^2) + (1935.36 y_5^1 + 2042.15 y_5^2).
\end{aligned}$$

Form the first set of constraints in terms of the variable yik.


State 1 has K1 = 1 decision. In the first set of constraints, the equation
associated with a transition to state j = 1 is
$$\begin{aligned}
&\sum_{k=1}^{1} y_1^k - \left( \sum_{k=1}^{1} y_1^k p_{11}^k + \sum_{i=2}^{5} \sum_{k=1}^{2} y_i^k p_{i1}^k \right) = 0 \\
&\sum_{k=1}^{1} y_1^k - \left( \sum_{k=1}^{1} y_1^k p_{11}^k + \sum_{k=1}^{2} y_2^k p_{21}^k + \sum_{k=1}^{2} y_3^k p_{31}^k + \sum_{k=1}^{2} y_4^k p_{41}^k + \sum_{k=1}^{2} y_5^k p_{51}^k \right) = 0 \\
&y_1^1 - (y_1^1 p_{11}^1) - (y_2^1 p_{21}^1 + y_2^2 p_{21}^2) - (y_3^1 p_{31}^1 + y_3^2 p_{31}^2) - (y_4^1 p_{41}^1 + y_4^2 p_{41}^2) - (y_5^1 p_{51}^1 + y_5^2 p_{51}^2) = 0 \\
&y_1^1 - (1 y_1^1) - (0 y_2^1 + 0 y_2^2) - (0.25 y_3^1 + 0.20 y_3^2) - (0.15 y_4^1 + 0.20 y_4^2) - (0.10 y_5^1 + 0.08 y_5^2) = 0.
\end{aligned}$$

State 2 has K 2 = 2 possible decisions. In the first set of constraints, the equa-
tion associated with a transition to state j = 2 is
$$\begin{aligned}
&\sum_{k=1}^{2} y_2^k - \left( \sum_{k=1}^{1} y_1^k p_{12}^k + \sum_{i=2}^{5} \sum_{k=1}^{2} y_i^k p_{i2}^k \right) = 0 \\
&\sum_{k=1}^{2} y_2^k - \left( \sum_{k=1}^{1} y_1^k p_{12}^k + \sum_{k=1}^{2} y_2^k p_{22}^k + \sum_{k=1}^{2} y_3^k p_{32}^k + \sum_{k=1}^{2} y_4^k p_{42}^k + \sum_{k=1}^{2} y_5^k p_{52}^k \right) = 0 \\
&(y_2^1 + y_2^2) - (y_1^1 p_{12}^1) - (y_2^1 p_{22}^1 + y_2^2 p_{22}^2) - (y_3^1 p_{32}^1 + y_3^2 p_{32}^2) - (y_4^1 p_{42}^1 + y_4^2 p_{42}^2) - (y_5^1 p_{52}^1 + y_5^2 p_{52}^2) = 0 \\
&(y_2^1 + y_2^2) - (0 y_1^1) - (1 y_2^1 + 0 y_2^2) - (0.20 y_3^1 + 0.35 y_3^2) - (0 y_4^1 + 0 y_4^2) - (0 y_5^1 + 0 y_5^2) = 0.
\end{aligned}$$

State 3 has K3 = 2 possible decisions. In the first set of constraints, the equa-
tion associated with a transition to state j = 3 is
$$\begin{aligned}
&\sum_{k=1}^{2} y_3^k - \left( \sum_{k=1}^{1} y_1^k p_{13}^k + \sum_{i=2}^{5} \sum_{k=1}^{2} y_i^k p_{i3}^k \right) = 0 \\
&\sum_{k=1}^{2} y_3^k - \left( \sum_{k=1}^{1} y_1^k p_{13}^k + \sum_{k=1}^{2} y_2^k p_{23}^k + \sum_{k=1}^{2} y_3^k p_{33}^k + \sum_{k=1}^{2} y_4^k p_{43}^k + \sum_{k=1}^{2} y_5^k p_{53}^k \right) = 0 \\
&(y_3^1 + y_3^2) - (y_1^1 p_{13}^1) - (y_2^1 p_{23}^1 + y_2^2 p_{23}^2) - (y_3^1 p_{33}^1 + y_3^2 p_{33}^2) - (y_4^1 p_{43}^1 + y_4^2 p_{43}^2) - (y_5^1 p_{53}^1 + y_5^2 p_{53}^2) = 0 \\
&(y_3^1 + y_3^2) - (0 y_1^1) - (0 y_2^1 + 0 y_2^2) - (0.55 y_3^1 + 0.45 y_3^2) - (0.20 y_4^1 + 0.30 y_4^2) - (0 y_5^1 + 0 y_5^2) = 0.
\end{aligned}$$

State 4 has K4 = 2 possible decisions. In the first set of constraints, the equa-
tion associated with a transition to state j = 4 is

$$\begin{aligned}
&\sum_{k=1}^{2} y_4^k - \left( \sum_{k=1}^{1} y_1^k p_{14}^k + \sum_{i=2}^{5} \sum_{k=1}^{2} y_i^k p_{i4}^k \right) = 0 \\
&\sum_{k=1}^{2} y_4^k - \left( \sum_{k=1}^{1} y_1^k p_{14}^k + \sum_{k=1}^{2} y_2^k p_{24}^k + \sum_{k=1}^{2} y_3^k p_{34}^k + \sum_{k=1}^{2} y_4^k p_{44}^k + \sum_{k=1}^{2} y_5^k p_{54}^k \right) = 0 \\
&(y_4^1 + y_4^2) - (y_1^1 p_{14}^1) - (y_2^1 p_{24}^1 + y_2^2 p_{24}^2) - (y_3^1 p_{34}^1 + y_3^2 p_{34}^2) - (y_4^1 p_{44}^1 + y_4^2 p_{44}^2) - (y_5^1 p_{54}^1 + y_5^2 p_{54}^2) = 0 \\
&(y_4^1 + y_4^2) - (0 y_1^1) - (0 y_2^1 + 0 y_2^2) - (0 y_3^1 + 0 y_3^2) - (0.65 y_4^1 + 0.50 y_4^2) - (0.15 y_5^1 + 0.12 y_5^2) = 0.
\end{aligned}$$

State 5 has K5 = 2 possible decisions. In the first set of constraints, the equa-
tion associated with a transition to state j = 5 is
$$\begin{aligned}
&\sum_{k=1}^{2} y_5^k - \left( \sum_{k=1}^{1} y_1^k p_{15}^k + \sum_{i=2}^{5} \sum_{k=1}^{2} y_i^k p_{i5}^k \right) = 0 \\
&\sum_{k=1}^{2} y_5^k - \left( \sum_{k=1}^{1} y_1^k p_{15}^k + \sum_{k=1}^{2} y_2^k p_{25}^k + \sum_{k=1}^{2} y_3^k p_{35}^k + \sum_{k=1}^{2} y_4^k p_{45}^k + \sum_{k=1}^{2} y_5^k p_{55}^k \right) = 0 \\
&(y_5^1 + y_5^2) - (y_1^1 p_{15}^1) - (y_2^1 p_{25}^1 + y_2^2 p_{25}^2) - (y_3^1 p_{35}^1 + y_3^2 p_{35}^2) - (y_4^1 p_{45}^1 + y_4^2 p_{45}^2) - (y_5^1 p_{55}^1 + y_5^2 p_{55}^2) = 0 \\
&(y_5^1 + y_5^2) - (0 y_1^1) - (0 y_2^1 + 1 y_2^2) - (0 y_3^1 + 0 y_3^2) - (0 y_4^1 + 0 y_4^2) - (0.75 y_5^1 + 0.80 y_5^2) = 0.
\end{aligned}$$

Form the second set of constraints in terms of the variable xik by setting the
constants b1 = b2 = b3 = b4 = b5 = 0.2.
State 1 has K1 = 1 decision. In the second set of constraints, the equation
associated with a transition to state j = 1 is

$$\begin{aligned}
&\sum_{k=1}^{1} y_1^k + \sum_{k=1}^{1} x_1^k - \left( \sum_{k=1}^{1} x_1^k p_{11}^k + \sum_{i=2}^{5} \sum_{k=1}^{2} x_i^k p_{i1}^k \right) = b_1 = 0.2 \\
&y_1^1 + x_1^1 - \left( \sum_{k=1}^{1} x_1^k p_{11}^k + \sum_{k=1}^{2} x_2^k p_{21}^k + \sum_{k=1}^{2} x_3^k p_{31}^k + \sum_{k=1}^{2} x_4^k p_{41}^k + \sum_{k=1}^{2} x_5^k p_{51}^k \right) = 0.2 \\
&y_1^1 + x_1^1 - (x_1^1 p_{11}^1) - (x_2^1 p_{21}^1 + x_2^2 p_{21}^2) - (x_3^1 p_{31}^1 + x_3^2 p_{31}^2) - (x_4^1 p_{41}^1 + x_4^2 p_{41}^2) - (x_5^1 p_{51}^1 + x_5^2 p_{51}^2) = 0.2 \\
&y_1^1 + x_1^1 - (1 x_1^1) - (0 x_2^1 + 0 x_2^2) - (0.25 x_3^1 + 0.20 x_3^2) - (0.15 x_4^1 + 0.20 x_4^2) - (0.10 x_5^1 + 0.08 x_5^2) = 0.2.
\end{aligned}$$

State 2 has K2 = 2 possible decisions. In the second set of constraints, the equation associated with a transition to state j = 2 is

$$\begin{aligned}
&\sum_{k=1}^{2} y_2^k + \sum_{k=1}^{2} x_2^k - \left( \sum_{k=1}^{1} x_1^k p_{12}^k + \sum_{i=2}^{5} \sum_{k=1}^{2} x_i^k p_{i2}^k \right) = b_2 = 0.2 \\
&(y_2^1 + y_2^2) + (x_2^1 + x_2^2) - \left( \sum_{k=1}^{1} x_1^k p_{12}^k + \sum_{k=1}^{2} x_2^k p_{22}^k + \sum_{k=1}^{2} x_3^k p_{32}^k + \sum_{k=1}^{2} x_4^k p_{42}^k + \sum_{k=1}^{2} x_5^k p_{52}^k \right) = 0.2 \\
&(y_2^1 + y_2^2) + (x_2^1 + x_2^2) - (x_1^1 p_{12}^1) - (x_2^1 p_{22}^1 + x_2^2 p_{22}^2) - (x_3^1 p_{32}^1 + x_3^2 p_{32}^2) - (x_4^1 p_{42}^1 + x_4^2 p_{42}^2) - (x_5^1 p_{52}^1 + x_5^2 p_{52}^2) = 0.2 \\
&(y_2^1 + y_2^2) + (x_2^1 + x_2^2) - (0 x_1^1) - (1 x_2^1 + 0 x_2^2) - (0.20 x_3^1 + 0.35 x_3^2) - (0 x_4^1 + 0 x_4^2) - (0 x_5^1 + 0 x_5^2) = 0.2.
\end{aligned}$$

State 3 has K3 = 2 possible decisions. In the second set of constraints, the equation associated with a transition to state j = 3 is

$$\begin{aligned}
&\sum_{k=1}^{2} y_3^k + \sum_{k=1}^{2} x_3^k - \left( \sum_{k=1}^{1} x_1^k p_{13}^k + \sum_{i=2}^{5} \sum_{k=1}^{2} x_i^k p_{i3}^k \right) = b_3 = 0.2 \\
&(y_3^1 + y_3^2) + (x_3^1 + x_3^2) - \left( \sum_{k=1}^{1} x_1^k p_{13}^k + \sum_{k=1}^{2} x_2^k p_{23}^k + \sum_{k=1}^{2} x_3^k p_{33}^k + \sum_{k=1}^{2} x_4^k p_{43}^k + \sum_{k=1}^{2} x_5^k p_{53}^k \right) = 0.2 \\
&(y_3^1 + y_3^2) + (x_3^1 + x_3^2) - (x_1^1 p_{13}^1) - (x_2^1 p_{23}^1 + x_2^2 p_{23}^2) - (x_3^1 p_{33}^1 + x_3^2 p_{33}^2) - (x_4^1 p_{43}^1 + x_4^2 p_{43}^2) - (x_5^1 p_{53}^1 + x_5^2 p_{53}^2) = 0.2 \\
&(y_3^1 + y_3^2) + (x_3^1 + x_3^2) - (0 x_1^1) - (0 x_2^1 + 0 x_2^2) - (0.55 x_3^1 + 0.45 x_3^2) - (0.20 x_4^1 + 0.30 x_4^2) - (0 x_5^1 + 0 x_5^2) = 0.2.
\end{aligned}$$

State 4 has K4 = 2 possible decisions. In the second set of constraints, the equation associated with a transition to state j = 4 is

$$\begin{aligned}
&\sum_{k=1}^{2} y_4^k + \sum_{k=1}^{2} x_4^k - \left( \sum_{k=1}^{1} x_1^k p_{14}^k + \sum_{i=2}^{5} \sum_{k=1}^{2} x_i^k p_{i4}^k \right) = b_4 = 0.2 \\
&(y_4^1 + y_4^2) + (x_4^1 + x_4^2) - \left( \sum_{k=1}^{1} x_1^k p_{14}^k + \sum_{k=1}^{2} x_2^k p_{24}^k + \sum_{k=1}^{2} x_3^k p_{34}^k + \sum_{k=1}^{2} x_4^k p_{44}^k + \sum_{k=1}^{2} x_5^k p_{54}^k \right) = 0.2 \\
&(y_4^1 + y_4^2) + (x_4^1 + x_4^2) - (x_1^1 p_{14}^1) - (x_2^1 p_{24}^1 + x_2^2 p_{24}^2) - (x_3^1 p_{34}^1 + x_3^2 p_{34}^2) - (x_4^1 p_{44}^1 + x_4^2 p_{44}^2) - (x_5^1 p_{54}^1 + x_5^2 p_{54}^2) = 0.2 \\
&(y_4^1 + y_4^2) + (x_4^1 + x_4^2) - (0 x_1^1) - (0 x_2^1 + 0 x_2^2) - (0 x_3^1 + 0 x_3^2) - (0.65 x_4^1 + 0.50 x_4^2) - (0.15 x_5^1 + 0.12 x_5^2) = 0.2.
\end{aligned}$$

State 5 has K5 = 2 possible decisions. In the second set of constraints, the equation associated with a transition to state j = 5 is

$$\begin{aligned}
&\sum_{k=1}^{2} y_5^k + \sum_{k=1}^{2} x_5^k - \left( \sum_{k=1}^{1} x_1^k p_{15}^k + \sum_{i=2}^{5} \sum_{k=1}^{2} x_i^k p_{i5}^k \right) = b_5 = 0.2 \\
&(y_5^1 + y_5^2) + (x_5^1 + x_5^2) - \left( \sum_{k=1}^{1} x_1^k p_{15}^k + \sum_{k=1}^{2} x_2^k p_{25}^k + \sum_{k=1}^{2} x_3^k p_{35}^k + \sum_{k=1}^{2} x_4^k p_{45}^k + \sum_{k=1}^{2} x_5^k p_{55}^k \right) = 0.2 \\
&(y_5^1 + y_5^2) + (x_5^1 + x_5^2) - (x_1^1 p_{15}^1) - (x_2^1 p_{25}^1 + x_2^2 p_{25}^2) - (x_3^1 p_{35}^1 + x_3^2 p_{35}^2) - (x_4^1 p_{45}^1 + x_4^2 p_{45}^2) - (x_5^1 p_{55}^1 + x_5^2 p_{55}^2) = 0.2 \\
&(y_5^1 + y_5^2) + (x_5^1 + x_5^2) - (0 x_1^1) - (0 x_2^1 + 1 x_2^2) - (0 x_3^1 + 0 x_3^2) - (0 x_4^1 + 0 x_4^2) - (0.75 x_5^1 + 0.80 x_5^2) = 0.2.
\end{aligned}$$

The complete LP formulation for the multichain MDP model of a flexible production system is

$$\text{Maximize } z = 62 y_1^1 + (100 y_2^1 + 22.58 y_2^2) + (423.54 y_3^1 + 468.62 y_3^2) + (847.09 y_4^1 + 796.21 y_4^2) + (1935.36 y_5^1 + 2042.15 y_5^2)$$

subject to

$$\begin{aligned}
(1)\quad & -(0.25 y_3^1 + 0.20 y_3^2) - (0.15 y_4^1 + 0.20 y_4^2) - (0.10 y_5^1 + 0.08 y_5^2) = 0 \\
(2)\quad & (y_2^2) - (0.20 y_3^1 + 0.35 y_3^2) = 0 \\
(3)\quad & (0.45 y_3^1 + 0.55 y_3^2) - (0.20 y_4^1 + 0.30 y_4^2) = 0 \\
(4)\quad & (0.35 y_4^1 + 0.50 y_4^2) - (0.15 y_5^1 + 0.12 y_5^2) = 0 \\
(5)\quad & -(y_2^2) + (0.25 y_5^1 + 0.20 y_5^2) = 0 \\
(6)\quad & (y_1^1) - (0.25 x_3^1 + 0.20 x_3^2) - (0.15 x_4^1 + 0.20 x_4^2) - (0.10 x_5^1 + 0.08 x_5^2) = 0.2 \\
(7)\quad & (y_2^1 + y_2^2) + (x_2^2) - (0.20 x_3^1 + 0.35 x_3^2) = 0.2 \\
(8)\quad & (y_3^1 + y_3^2) + (0.45 x_3^1 + 0.55 x_3^2) - (0.20 x_4^1 + 0.30 x_4^2) = 0.2 \\
(9)\quad & (y_4^1 + y_4^2) + (0.35 x_4^1 + 0.50 x_4^2) - (0.15 x_5^1 + 0.12 x_5^2) = 0.2 \\
(10)\quad & (y_5^1 + y_5^2) - (x_2^2) + (0.25 x_5^1 + 0.20 x_5^2) = 0.2
\end{aligned}$$

$$y_1^1 \ge 0,\ y_2^1 \ge 0,\ y_2^2 \ge 0,\ y_3^1 \ge 0,\ y_3^2 \ge 0,\ y_4^1 \ge 0,\ y_4^2 \ge 0,\ y_5^1 \ge 0,\ y_5^2 \ge 0,$$

$$x_1^1 \ge 0,\ x_2^1 \ge 0,\ x_2^2 \ge 0,\ x_3^1 \ge 0,\ x_3^2 \ge 0,\ x_4^1 \ge 0,\ x_4^2 \ge 0,\ x_5^1 \ge 0,\ x_5^2 \ge 0.$$
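The reduction from the state-by-state equations to the compact constraints can be checked mechanically. The following Python sketch, which is an illustration and not the LP software used in the text, rebuilds the coefficient of each variable in the first set of constraints as (1 if i = j, else 0) minus p_ij^k, using the transition probabilities of Table 5.52. The names `P` and `balance_row` are hypothetical. The same coefficients multiply the x variables in the second set of constraints.

```python
# Transition probability rows from Table 5.52, keyed by (state, decision).
P = {
    (1, 1): [1.00, 0.00, 0.00, 0.00, 0.00],
    (2, 1): [0.00, 1.00, 0.00, 0.00, 0.00],
    (2, 2): [0.00, 0.00, 0.00, 0.00, 1.00],
    (3, 1): [0.25, 0.20, 0.55, 0.00, 0.00],
    (3, 2): [0.20, 0.35, 0.45, 0.00, 0.00],
    (4, 1): [0.15, 0.00, 0.20, 0.65, 0.00],
    (4, 2): [0.20, 0.00, 0.30, 0.50, 0.00],
    (5, 1): [0.10, 0.00, 0.00, 0.15, 0.75],
    (5, 2): [0.08, 0.00, 0.00, 0.12, 0.80],
}


def balance_row(j):
    """Coefficient of each y_i^k in constraint j of the first set."""
    return {
        (i, k): round((1.0 if i == j else 0.0) - row[j - 1], 10)
        for (i, k), row in P.items()
    }


for j in range(1, 6):
    nonzero = {var: c for var, c in balance_row(j).items() if c != 0}
    print(j, nonzero)

# Each variable's coefficients sum to zero over the five equations,
# because every row of transition probabilities sums to one.
column_sums = {var: sum(balance_row(j)[var] for j in range(1, 6)) for var in P}
```

The zero column sums show why at least one equation in the first set is redundant: adding all five equations gives 0 = 0.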

5.1.4.3.2 Solution by LP of the Multichain MDP Model of a Flexible Production System

The LP formulation of the multichain MDP model of a flexible production
system is solved on a personal computer using LP software to find an opti-
mal policy. The output of the LP software is summarized in Table 5.53.
The objective function value is 79.0793. The optimal policy, given by the
vector d = [1 1 2 2 2]T, is deterministic, and was obtained by PI in
Equation (5.55) of Section 5.1.4.2.3. Observe that the gains in states 1, 2, 3, 4,
and 5, respectively, obtained by PI and shown in Equation (5.56), are equal to
the dual variables associated with constraints 6, 7, 8, 9, and 10, respectively,
shown in Table 5.53 of the LP solution. For the five-state MDP, the dual var-
iable associated with constraint equation i, for i > 5, is equal to the gain in
state (i − 5). For example, the dual variable associated with constraint equa-
tion 9 is equal to g4 = 76.5091, the gain in state (9 − 5) = 4.
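Two consistency checks on the reported solution can be sketched in Python. Both are illustrations under stated assumptions rather than part of the original LP output. First, by LP duality the optimal objective value should equal the sum of b_j times the dual variable of constraint 5 + j, since the first set of constraints has zero right-hand sides. Second, each gain should satisfy g_i = Σ_j p_ij g_j under the optimal policy d = [1 1 2 2 2]^T. The variable names are hypothetical, and the data come from Tables 5.52 and 5.53.

```python
# Right-hand sides b_j and dual variables (gains) from Table 5.53.
b = [0.2, 0.2, 0.2, 0.2, 0.2]
g = [62.0, 100.0, 86.1818, 76.5091, 70.7055]

# Duality check: z should match the reported objective value 79.0793.
z = sum(bj * gj for bj, gj in zip(b, g))
print(round(z, 4))

# Transition rows selected by the optimal policy (Table 5.52).
P_opt = [
    [1.00, 0.00, 0.00, 0.00, 0.00],  # state 1, decision 1
    [0.00, 1.00, 0.00, 0.00, 0.00],  # state 2, decision 1
    [0.20, 0.35, 0.45, 0.00, 0.00],  # state 3, decision 2
    [0.20, 0.00, 0.30, 0.50, 0.00],  # state 4, decision 2
    [0.08, 0.00, 0.00, 0.12, 0.80],  # state 5, decision 2
]

# Residual of g_i = sum_j p_ij g_j for every state.
residuals = [
    g[i] - sum(P_opt[i][j] * g[j] for j in range(5)) for i in range(5)
]
print([round(r, 4) for r in residuals])
```

The residuals are zero up to the four-decimal rounding of the tabulated gains.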


5.1.4.4 A Multichain MDP Model of Machine Maintenance


Consider the unichain MCR model of machine maintenance introduced
as a Markov chain in Section 1.10.1.2.2.1, and enlarged to produce an MCR
in Section 4.2.4.1.3. The four states of the machine are listed in Table 1.9 of
Section 1.10.1.2.1.2. The unichain MCR model will be expanded to create
a multichain MDP model by allowing choices among the four alternative
maintenance actions described in Table 1.10 of Section 1.10.1.2.2.1. The main-
tenance policy followed for the unichain MCR model in Section 4.2.4.2.3 was
d = [2 1 1 1]T, which calls for overhauling the machine in state 1, and
doing nothing in states 2, 3, and 4. (This is called the modified maintenance
policy shown in Equation (4.117) of Sections 1.10.1.2.2.1 and 4.2.4.1.3.)
Suppose that the engineer responsible for maintaining the machine has deter-
mined that the decisions listed in Table 5.54 are feasible in the different states.
The daily revenue earned from production in every state is given in Table
4.8 of Section 4.2.4.1.3. The daily maintenance costs in every state are sum-
marized in Table 4.9 of Section 4.2.4.1.3. The daily rewards for the decisions
that are feasible in the different states are indicated in Table 5.55.

TABLE 5.53
LP Solution of Multichain MDP Model of a Flexible Production System

Objective Function Value = 79.0793

State i  Decision k  y_i^k    x_i^k
1        1           0.5505   0
2        1           0.4495   0
2        2           0        0
3        1           0        0
3        2           0        0.7127
4        1           0        0
4        2           0        0.6400
5        1           0        0
5        2           0        1

Row            Dual Variable
Constraint 6   62 = g1
Constraint 7   100 = g2
Constraint 8   86.1818 = g3
Constraint 9   76.5091 = g4
Constraint 10  70.7055 = g5

TABLE 5.54
Decisions Feasible in the Different States

State  Description                          Feasible Decisions
1      Not Working (NW)                     Overhaul (OV), Repair (RP), Replace (RL)
2      Working, with a Major Defect (WM)    Do Nothing (DN), Repair (RP)
3      Working, with a Minor Defect (Wm)    Do Nothing (DN), Overhaul (OV)
4      Working Properly (WP)                Do Nothing (DN), Overhaul (OV)


TABLE 5.55
Daily Rewards for the Decisions That Are Feasible in the Different States

State \ Decision             Do Nothing     Overhaul       Repair         Replace
Not Working                  Not Feasible   –$300          –$700          –$1,200
Working, with Major Defect   $200           Not Feasible   –$700          Not Feasible
Working, with Minor Defect   $500           –$300          Not Feasible   Not Feasible
Working Properly             $1,000         –$300          Not Feasible   Not Feasible

TABLE 5.56
The 24 Stationary Policies for Machine Maintenance
Policy State 1 State 2 State 3 State 4 Structure
1d OV DN OV DN U
2d OV DN OV OV M
3d OV RP OV DN I
4d OV RP OV OV U
5d RP DN OV DN I
6d RP DN OV OV U
7d RP RP OV DN I
8d RP RP OV OV U
9d RL DN OV DN I
10d RL DN OV OV U
11d RL RP OV DN I
12d RL RP OV OV U
13d OV DN DN DN U
14d OV DN DN OV M
15d OV RP DN DN I
16d OV RP DN OV U
17d RP DN DN DN U
18d RP DN DN OV M
19d RP RP DN DN I
20d RP RP DN OV U
21d RL DN DN DN I
22d RL DN DN OV U
23d RL RP DN DN I
24d RL RP DN OV U

Recall that under a stationary policy the same decision is made each time that
the process returns to a particular state. With three feasible decisions in state 1,
and two decisions in states 2, 3, and 4, the number of stationary policies is
equal to (3)(2)(2)(2) = 24. The 24 stationary policies, labeled 1d through 24d, are
enumerated in Table 5.56. The rightmost column indicates whether the associated Markov chain is irreducible (I), unichain (U), or multichain (M).
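The enumeration and the chain-structure labels can be reproduced with a short Python sketch. It is an illustration only: the dictionary `P` holds the transition probabilities listed in Table 5.57, and the helper names `reachable` and `structure` are hypothetical. A state is treated as recurrent when every state reachable from it can reach it back, and a policy is labeled multichain when more than one closed class exists.

```python
from collections import Counter
from itertools import product

# Transition rows from Table 5.57, keyed by state, then decision label.
P = {
    1: {"OV": [0.2, 0.8, 0.0, 0.0], "RP": [0.3, 0.0, 0.7, 0.0],
        "RL": [0.0, 0.0, 0.0, 1.0]},
    2: {"DN": [0.6, 0.4, 0.0, 0.0], "RP": [0.0, 0.3, 0.0, 0.7]},
    3: {"DN": [0.2, 0.3, 0.5, 0.0], "OV": [0.0, 0.0, 0.2, 0.8]},
    4: {"DN": [0.3, 0.2, 0.1, 0.4], "OV": [0.0, 0.0, 0.0, 1.0]},
}


def reachable(rows):
    """Return, for each state index, the set of states reachable from it."""
    n = len(rows)
    reach = [{j for j in range(n) if rows[i][j] > 0} | {i} for i in range(n)]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            new = set().union(*(reach[j] for j in reach[i]))
            if not new <= reach[i]:
                reach[i] |= new
                changed = True
    return reach


def structure(policy):
    """Classify a policy (one decision label per state) as I, U, or M."""
    rows = [P[state][decision] for state, decision in zip((1, 2, 3, 4), policy)]
    reach = reachable(rows)
    n = len(rows)
    recurrent = [i for i in range(n) if all(i in reach[j] for j in reach[i])]
    closed_classes = {frozenset(reach[i]) for i in recurrent}
    if len(closed_classes) > 1:
        return "M"
    return "I" if len(recurrent) == n else "U"


policies = list(product(P[1], P[2], P[3], P[4]))
counts = Counter(structure(policy) for policy in policies)
print(len(policies), dict(counts))
```

For the data of Table 5.57 the sketch finds 24 policies and reproduces the Structure column of Table 5.56; for example, `structure(("RP", "DN", "DN", "DN"))`, which is policy 17d, returns "U".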


TABLE 5.57
Data for Multichain Model of Machine Maintenance

                        Transition Probability           Reward     LP Variable
State i   Decision k    p_i1^k  p_i2^k  p_i3^k  p_i4^k   q_i^k      y_i^k   x_i^k
1 = NW    2 = OV        0.2     0.8     0       0        –$300      y_1^2   x_1^2
1 = NW    3 = RP        0.3     0       0.7     0        –$700      y_1^3   x_1^3
1 = NW    4 = RL        0       0       0       1        –$1,200    y_1^4   x_1^4
2 = WM    1 = DN        0.6     0.4     0       0        $200       y_2^1   x_2^1
2 = WM    3 = RP        0       0.3     0       0.7      –$700      y_2^3   x_2^3
3 = Wm    1 = DN        0.2     0.3     0.5     0        $500       y_3^1   x_3^1
3 = Wm    2 = OV        0       0       0.2     0.8      –$300      y_3^2   x_3^2
4 = WP    1 = DN        0.3     0.2     0.1     0.4      $1,000     y_4^1   x_4^1
4 = WP    2 = OV        0       0       0       1        –$300      y_4^2   x_4^2

Policies 2d, 14d, and 18d produce multichains, each with two recurrent closed classes. Under 18d the closed classes are {1, 2, 3} and {4}, and there are no transient states; under 2d and 14d the closed classes are {1, 2} and {4}, and state 3 is transient. Table 5.57 contains the data for the four-state multichain MDP model of machine maintenance. The variables for an LP formulation appear in the two rightmost columns.

5.1.4.4.1 LP Formulation of Multichain Model of Machine Maintenance


The multichain MDP model of machine maintenance will be formulated as
an LP. In this example N = 4 states, K1 = 3 decisions in state 1, and K 2 = K3 =
K4 = 2 decisions in states 2, 3, and 4.
Objective Function
The objective function for the LP is
$$\begin{aligned}
\text{Maximize } z &= (q_1^2 y_1^2 + q_1^3 y_1^3 + q_1^4 y_1^4) + (q_2^1 y_2^1 + q_2^3 y_2^3) + (q_3^1 y_3^1 + q_3^2 y_3^2) + (q_4^1 y_4^1 + q_4^2 y_4^2) \\
&= (-300 y_1^2 - 700 y_1^3 - 1200 y_1^4) + (200 y_2^1 - 700 y_2^3) + (500 y_3^1 - 300 y_3^2) + (1000 y_4^1 - 300 y_4^2).
\end{aligned}$$

Form the first set of constraints.

State 1 has K1 = 3 decisions. In the first set of constraints, the equation associated with a transition to state j = 1 is

$$\begin{aligned}
&(y_1^2 + y_1^3 + y_1^4) - (y_1^2 p_{11}^2 + y_1^3 p_{11}^3 + y_1^4 p_{11}^4) - (y_2^1 p_{21}^1 + y_2^3 p_{21}^3) - (y_3^1 p_{31}^1 + y_3^2 p_{31}^2) - (y_4^1 p_{41}^1 + y_4^2 p_{41}^2) = 0 \\
&(y_1^2 + y_1^3 + y_1^4) - (0.2 y_1^2 + 0.3 y_1^3 + 0 y_1^4) - (0.6 y_2^1 + 0 y_2^3) - (0.2 y_3^1 + 0 y_3^2) - (0.3 y_4^1 + 0 y_4^2) = 0.
\end{aligned}$$

State 2 has K2 = 2 decisions. In the first set of constraints, the equation associated with a transition to state j = 2 is

$$\begin{aligned}
&(y_2^1 + y_2^3) - (y_1^2 p_{12}^2 + y_1^3 p_{12}^3 + y_1^4 p_{12}^4) - (y_2^1 p_{22}^1 + y_2^3 p_{22}^3) - (y_3^1 p_{32}^1 + y_3^2 p_{32}^2) - (y_4^1 p_{42}^1 + y_4^2 p_{42}^2) = 0 \\
&(y_2^1 + y_2^3) - (0.8 y_1^2 + 0 y_1^3 + 0 y_1^4) - (0.4 y_2^1 + 0.3 y_2^3) - (0.3 y_3^1 + 0 y_3^2) - (0.2 y_4^1 + 0 y_4^2) = 0.
\end{aligned}$$

State 3 has K3 = 2 decisions. In the first set of constraints, the equation associated with a transition to state j = 3 is

$$\begin{aligned}
&(y_3^1 + y_3^2) - (y_1^2 p_{13}^2 + y_1^3 p_{13}^3 + y_1^4 p_{13}^4) - (y_2^1 p_{23}^1 + y_2^3 p_{23}^3) - (y_3^1 p_{33}^1 + y_3^2 p_{33}^2) - (y_4^1 p_{43}^1 + y_4^2 p_{43}^2) = 0 \\
&(y_3^1 + y_3^2) - (0 y_1^2 + 0.7 y_1^3 + 0 y_1^4) - (0 y_2^1 + 0 y_2^3) - (0.5 y_3^1 + 0.2 y_3^2) - (0.1 y_4^1 + 0 y_4^2) = 0.
\end{aligned}$$

State 4 has K4 = 2 decisions. In the first set of constraints, the equation associated with a transition to state j = 4 is

$$\begin{aligned}
&(y_4^1 + y_4^2) - (y_1^2 p_{14}^2 + y_1^3 p_{14}^3 + y_1^4 p_{14}^4) - (y_2^1 p_{24}^1 + y_2^3 p_{24}^3) - (y_3^1 p_{34}^1 + y_3^2 p_{34}^2) - (y_4^1 p_{44}^1 + y_4^2 p_{44}^2) = 0 \\
&(y_4^1 + y_4^2) - (0 y_1^2 + 0 y_1^3 + y_1^4) - (0 y_2^1 + 0.7 y_2^3) - (0 y_3^1 + 0.8 y_3^2) - (0.4 y_4^1 + y_4^2) = 0.
\end{aligned}$$

Form the second set of constraints by setting b1 = b2 = b3 = b4 = 0.25.

State 1 has K1 = 3 decisions. In the second set of constraints, the equation associated with a transition to state j = 1 is

$$\begin{aligned}
&(y_1^2 + y_1^3 + y_1^4) + (x_1^2 + x_1^3 + x_1^4) - (x_1^2 p_{11}^2 + x_1^3 p_{11}^3 + x_1^4 p_{11}^4) - (x_2^1 p_{21}^1 + x_2^3 p_{21}^3) - (x_3^1 p_{31}^1 + x_3^2 p_{31}^2) - (x_4^1 p_{41}^1 + x_4^2 p_{41}^2) = 0.25 \\
&(y_1^2 + y_1^3 + y_1^4) + (x_1^2 + x_1^3 + x_1^4) - (0.2 x_1^2 + 0.3 x_1^3 + 0 x_1^4) - (0.6 x_2^1 + 0 x_2^3) - (0.2 x_3^1 + 0 x_3^2) - (0.3 x_4^1 + 0 x_4^2) = 0.25.
\end{aligned}$$

State 2 has K2 = 2 decisions. In the second set of constraints, the equation associated with a transition to state j = 2 is

$$\begin{aligned}
&(y_2^1 + y_2^3) + (x_2^1 + x_2^3) - (x_1^2 p_{12}^2 + x_1^3 p_{12}^3 + x_1^4 p_{12}^4) - (x_2^1 p_{22}^1 + x_2^3 p_{22}^3) - (x_3^1 p_{32}^1 + x_3^2 p_{32}^2) - (x_4^1 p_{42}^1 + x_4^2 p_{42}^2) = 0.25 \\
&(y_2^1 + y_2^3) + (x_2^1 + x_2^3) - (0.8 x_1^2 + 0 x_1^3 + 0 x_1^4) - (0.4 x_2^1 + 0.3 x_2^3) - (0.3 x_3^1 + 0 x_3^2) - (0.2 x_4^1 + 0 x_4^2) = 0.25.
\end{aligned}$$

State 3 has K3 = 2 decisions. In the second set of constraints, the equation associated with a transition to state j = 3 is

$$\begin{aligned}
&(y_3^1 + y_3^2) + (x_3^1 + x_3^2) - (x_1^2 p_{13}^2 + x_1^3 p_{13}^3 + x_1^4 p_{13}^4) - (x_2^1 p_{23}^1 + x_2^3 p_{23}^3) - (x_3^1 p_{33}^1 + x_3^2 p_{33}^2) - (x_4^1 p_{43}^1 + x_4^2 p_{43}^2) = 0.25 \\
&(y_3^1 + y_3^2) + (x_3^1 + x_3^2) - (0 x_1^2 + 0.7 x_1^3 + 0 x_1^4) - (0 x_2^1 + 0 x_2^3) - (0.5 x_3^1 + 0.2 x_3^2) - (0.1 x_4^1 + 0 x_4^2) = 0.25.
\end{aligned}$$

State 4 has K4 = 2 decisions. In the second set of constraints, the equation associated with a transition to state j = 4 is

$$\begin{aligned}
&(y_4^1 + y_4^2) + (x_4^1 + x_4^2) - (x_1^2 p_{14}^2 + x_1^3 p_{14}^3 + x_1^4 p_{14}^4) - (x_2^1 p_{24}^1 + x_2^3 p_{24}^3) - (x_3^1 p_{34}^1 + x_3^2 p_{34}^2) - (x_4^1 p_{44}^1 + x_4^2 p_{44}^2) = 0.25 \\
&(y_4^1 + y_4^2) + (x_4^1 + x_4^2) - (0 x_1^2 + 0 x_1^3 + x_1^4) - (0 x_2^1 + 0.7 x_2^3) - (0 x_3^1 + 0.8 x_3^2) - (0.4 x_4^1 + x_4^2) = 0.25.
\end{aligned}$$

The complete LP formulation for the multichain MDP model of machine maintenance is

$$\text{Maximize } z = (-300 y_1^2 - 700 y_1^3 - 1200 y_1^4) + (200 y_2^1 - 700 y_2^3) + (500 y_3^1 - 300 y_3^2) + (1000 y_4^1 - 300 y_4^2)$$

subject to

$$\begin{aligned}
(1)\quad & (0.8 y_1^2 + 0.7 y_1^3 + y_1^4) - 0.6 y_2^1 - 0.2 y_3^1 - 0.3 y_4^1 = 0 \\
(2)\quad & -(0.8 y_1^2) + (0.6 y_2^1 + 0.7 y_2^3) - 0.3 y_3^1 - (0.2 y_4^1) = 0 \\
(3)\quad & -(0.7 y_1^3) + (0.5 y_3^1) + (0.8 y_3^2) - (0.1 y_4^1) = 0 \\
(4)\quad & -(y_1^4) - (0.7 y_2^3) - (0.8 y_3^2) + (0.6 y_4^1) = 0 \\
(5)\quad & (y_1^2 + y_1^3 + y_1^4) + (0.8 x_1^2 + 0.7 x_1^3 + x_1^4) - 0.6 x_2^1 - 0.2 x_3^1 - 0.3 x_4^1 = 0.25 \\
(6)\quad & (y_2^1 + y_2^3) - (0.8 x_1^2) + (0.6 x_2^1 + 0.7 x_2^3) - (0.3 x_3^1) - (0.2 x_4^1) = 0.25 \\
(7)\quad & (y_3^1 + y_3^2) - (0.7 x_1^3) + (0.5 x_3^1) + (0.8 x_3^2) - (0.1 x_4^1) = 0.25 \\
(8)\quad & (y_4^1 + y_4^2) - (x_1^4) - (0.7 x_2^3) - (0.8 x_3^2) + (0.6 x_4^1) = 0.25
\end{aligned}$$

$$y_1^2 \ge 0,\ y_1^3 \ge 0,\ y_1^4 \ge 0,\ y_2^1 \ge 0,\ y_2^3 \ge 0,\ y_3^1 \ge 0,\ y_3^2 \ge 0,\ y_4^1 \ge 0,\ y_4^2 \ge 0,$$

$$x_1^2 \ge 0,\ x_1^3 \ge 0,\ x_1^4 \ge 0,\ x_2^1 \ge 0,\ x_2^3 \ge 0,\ x_3^1 \ge 0,\ x_3^2 \ge 0,\ x_4^1 \ge 0,\ x_4^2 \ge 0.$$


5.1.4.4.2 Solution by LP of the Multichain Model of Machine Maintenance


The LP formulation of the multichain MDP model of machine maintenance
is solved on a personal computer using LP software to find an optimal pol-
icy. The output of the LP software is summarized in Table 5.58.
The optimal policy, given by the decision vector 17d = [3 1 1 1]T in Table
5.56, is deterministic. The engineer will repair the machine when it is in state
1, not working, and will do nothing when the machine is in states 2, 3, and
4. The optimal policy generates a unichain MCR with one recurrent closed
class consisting of states 1, 2, and 3. State 4 is transient. The unichain transi-
tion probability matrix 17P, and the reward vector 17q, generated by the opti-
mal decision vector 17d are shown in Table 5.59.
Since the optimal policy has produced a unichain MCR, all states have
the same gain. The gain in every state is g1 = g 2 = g3 = g4 = g = 45.1613, equal
to the optimal value of the LP objective function. The gain is also equal to
the optimal value of the dual variables associated with constraints 5, 6, 7,
and 8.
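The reported gain can be cross-checked without an LP solver. The sketch below is an illustration, not the method used in the text: it applies simple power iteration to the transition matrix selected by policy 17d (rows taken from Table 5.57) to approximate the steady-state distribution, and then computes the gain as the steady-state expected reward. The names `P17`, `q17`, `pi`, and `gain` are hypothetical.

```python
# Rows of Table 5.57 selected by the policy 17d = [3 1 1 1]^T.
P17 = [
    [0.3, 0.0, 0.7, 0.0],  # state 1, Repair
    [0.6, 0.4, 0.0, 0.0],  # state 2, Do Nothing
    [0.2, 0.3, 0.5, 0.0],  # state 3, Do Nothing
    [0.3, 0.2, 0.1, 0.4],  # state 4, Do Nothing
]
q17 = [-700.0, 200.0, 500.0, 1000.0]

# Power iteration: repeatedly apply pi <- pi P, starting from a uniform vector.
pi = [0.25, 0.25, 0.25, 0.25]
for _ in range(2000):
    pi = [sum(pi[i] * P17[i][j] for i in range(4)) for j in range(4)]

# Gain = steady-state expected reward per period.
gain = sum(p * q for p, q in zip(pi, q17))
print([round(p, 4) for p in pi], round(gain, 4))
```

The steady-state probabilities of the recurrent states agree with the values of y_1^3, y_2^1, and y_3^1 in the LP solution, and the probability of the transient state 4 vanishes.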

TABLE 5.58
LP Solution of Multichain Model of Machine Maintenance

Objective Function Value = 45.1613

State i  Decision k  y_i^k    x_i^k
1        2           0        0
1        3           0.3226   0.2285
1        4           0        0
2        1           0.2258   0.1792
2        3           0        0
3        1           0.4516   0
3        2           0        0
4        1           0        0.4167
4        2           0        0

Constraint  Dual Variable
5           45.1613 = g1 = g
6           45.1613 = g2 = g
7           45.1613 = g3 = g
8           45.1613 = g4 = g

TABLE 5.59
Unichain MCR Corresponding to Optimal Policy for Machine Maintenance

                                             Transition probability matrix 17P    Reward vector 17q
State, X_{n−1} = i          Decision, k      State   1     2     3     4          Reward
1, Not Working              3, Repair        1       0.3   0     0.7   0          −$700 = q1
2, Working, Major Defect    1, Do Nothing    2       0.6   0.4   0     0          $200 = q2
3, Working, Minor Defect    1, Do Nothing    3       0.2   0.3   0.5   0          $500 = q3
4, Working Properly         1, Do Nothing    4       0.3   0.2   0.1   0.4        $1,000 = q4


5.2 A Discounted MDP


A discounted MDP is one for which a dollar earned one year from now has
a present value of α dollars now, where 0 < α < 1. As Section 4.3 indicates,
α is called a discount factor. As is true for a discounted MCR treated in
Section 4.3, chain structure is not relevant when analyzing an MDP with
discounting.

5.2.1 Value Iteration over a Finite Planning Horizon


An optimal policy for a discounted MDP is defined as one which maximizes
the expected total discounted reward earned in every state. When the plan-
ning horizon is finite, an optimal policy can be found by using value itera-
tion, which was applied to a discounted MCR in Section 4.3.2 [2, 3].

5.2.1.1 Value Iteration Equation


When value iteration is applied to a discounted MDP, the definition of vi(n)
given in Section 4.3.2.1 for a discounted MCR is changed by adding the
requirement that an optimal policy is followed from epoch n to the end
of the planning horizon. Hence, vi(n) now denotes the value, at epoch n, of
the expected total discounted reward earned during the T−n periods from
epoch n to epoch T, the end of the planning horizon, if the system is in state
i at epoch n, and an optimal policy is followed. The value iteration equation
(4.243) for a discounted MCR is modified to obtain the following value iteration equation in algebraic form for a discounted MDP:

$$v_i(n) = \max_k \left\{ q_i^k + \alpha \sum_{j=1}^{N} p_{ij}^k v_j(n+1) \right\}, \quad \text{for } n = 0, 1, \ldots, T-1, \text{ and } i = 1, 2, \ldots, N, \tag{5.57}$$

where v_i(T) is specified for all states i.

5.2.1.2 Value Iteration for Discounted MDP Model of Monthly Sales

Consider the MDP model of monthly sales specified in Table 5.3 of
Section 5.1.2.1, and solved by value iteration without discounting over a
seven month horizon in Section 5.1.2.2.2. In this section, value iteration will
be applied to a discounted MDP model of monthly sales over a seven month
horizon. A discount factor of α = 0.9 is chosen. To begin, the following sal-
vage values are specified at the end of month 7.

$$v_1(7) = v_2(7) = v_3(7) = v_4(7) = 0. \tag{5.58}$$

For the four-state discounted MDP, the value iteration equations are

$$v_i(7) = 0, \quad \text{for } i = 1, 2, 3, 4 \tag{5.59}$$

$$\begin{aligned}
v_i(n) &= \max_k \left\{ q_i^k + \alpha \sum_{j=1}^{4} p_{ij}^k v_j(n+1) \right\}, \quad \text{for } n = 0, 1, \ldots, 6, \text{ and } i = 1, 2, 3, 4 \\
&= \max_k \{ q_i^k + 0.9 [ p_{i1}^k v_1(n+1) + p_{i2}^k v_2(n+1) + p_{i3}^k v_3(n+1) + p_{i4}^k v_4(n+1) ] \}, \quad \text{for } n = 0, 1, \ldots, 6.
\end{aligned} \tag{5.60}$$

The calculations for month 6, denoted by n = 6, are indicated in Table 5.60.

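$$\begin{aligned}
v_i(6) &= \max_k \{ q_i^k + \alpha [ p_{i1}^k v_1(7) + p_{i2}^k v_2(7) + p_{i3}^k v_3(7) + p_{i4}^k v_4(7) ] \}, \quad \text{for } i = 1, 2, 3, 4 \\
&= \max_k \{ q_i^k + 0.9 [ p_{i1}^k (0) + p_{i2}^k (0) + p_{i3}^k (0) + p_{i4}^k (0) ] \}, \quad \text{for } i = 1, 2, 3, 4 \\
&= \max_k [ q_i^k ], \quad \text{for } i = 1, 2, 3, 4.
\end{aligned} \tag{5.61}$$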

At the end of month 6, the optimal decision is to select the alternative which
will maximize the expected total discounted reward in every state. Thus, the
decision vector at the end of month 6 is d(6) = [3 2 2 1]T.
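Because the salvage values are zero, the step for n = 6 reduces to picking the largest immediate reward in each state. A minimal Python sketch of that step follows; it uses the rewards listed in Table 5.60, and the names `q`, `v6`, and `d6` are hypothetical.

```python
# Immediate rewards q_i^k from Table 5.60, keyed by state, then decision.
q = {
    1: {1: -30, 2: -25, 3: -20},
    2: {1: 5, 2: 10},
    3: {1: -10, 2: -5},
    4: {1: 35, 2: 25},
}

# With v_j(7) = 0, v_i(6) is the largest immediate reward in state i.
v6 = {i: max(rewards.values()) for i, rewards in q.items()}
# The maximizing decision in each state.
d6 = {i: max(rewards, key=rewards.get) for i, rewards in q.items()}
print(v6, d6)
```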
The calculations for month 5, denoted by n = 5, are indicated in Table 5.61.

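$$\begin{aligned}
v_i(5) &= \max_k \{ q_i^k + \alpha [ p_{i1}^k v_1(6) + p_{i2}^k v_2(6) + p_{i3}^k v_3(6) + p_{i4}^k v_4(6) ] \}, \quad \text{for } i = 1, 2, 3, 4 \\
&= \max_k \{ q_i^k + 0.9 [ p_{i1}^k (-20) + p_{i2}^k (10) + p_{i3}^k (-5) + p_{i4}^k (35) ] \}, \quad \text{for } i = 1, 2, 3, 4.
\end{aligned} \tag{5.62}$$
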
TABLE 5.60
Value Iteration for n = 6

State i  q_i^1 (k = 1)  q_i^2 (k = 2)  q_i^3 (k = 3)  v_i(6) = max[q_i^1, q_i^2, q_i^3]       Decision k
1        –30            –25            –20            v1(6) = max[–30, –25, –20] = –20        3
2        5              10             –              v2(6) = max[5, 10] = 10                 2
3        –10            –5             –              v3(6) = max[–10, –5] = –5               2
4        35             25             –              v4(6) = max[35, 25] = 35                1


TABLE 5.61
Value Iteration for n = 5

Test quantity: q_i^k + α[p_i1^k v_1(6) + p_i2^k v_2(6) + p_i3^k v_3(6) + p_i4^k v_4(6)] = q_i^k + 0.9[p_i1^k(−20) + p_i2^k(10) + p_i3^k(−5) + p_i4^k(35)]

i  k  Test quantity                                                            v_i(5) = Expected Total Discounted Reward        Decision d_i(5) = k
1  1  −30 + 0.9[0.15(−20) + 0.40(10) + 0.35(−5) + 0.10(35)] = −27.525
1  2  −25 + 0.9[0.45(−20) + 0.05(10) + 0.20(−5) + 0.30(35)] = −24.1 ←         max[−27.525, −24.1, −28.55] = −24.1 = v1(5)      d1(5) = 2
1  3  −20 + 0.9[0.60(−20) + 0.30(10) + 0.10(−5) + 0(35)] = −28.55
2  1  5 + 0.9[0.25(−20) + 0.30(10) + 0.35(−5) + 0.10(35)] = 4.775
2  2  10 + 0.9[0.30(−20) + 0.40(10) + 0.25(−5) + 0.05(35)] = 8.65 ←           max[4.775, 8.65] = 8.65 = v2(5)                  d2(5) = 2
3  1  −10 + 0.9[0.05(−20) + 0.65(10) + 0.25(−5) + 0.05(35)] = −4.6
3  2  −5 + 0.9[0.05(−20) + 0.25(10) + 0.50(−5) + 0.20(35)] = 0.4 ←            max[−4.6, 0.4] = 0.4 = v3(5)                     d3(5) = 2
4  1  35 + 0.9[0.05(−20) + 0.20(10) + 0.40(−5) + 0.35(35)] = 45.125 ←         max[45.125, 43.45] = 45.125 = v4(5)              d4(5) = 1
4  2  25 + 0.9[0(−20) + 0.10(10) + 0.30(−5) + 0.60(35)] = 43.45

At the end of month 5, the optimal decision is to select the second alterna-
tive in states 1, 2, and 3, and the first alternative in state 4. Thus, the decision
vector at the end of month 5 is d(5) = [2 2 2 1]T. The calculations for month 4,
denoted by n = 4, are indicated in Table 5.62.


TABLE 5.62
Value Iteration for n = 4

Test quantity: q_i^k + α[p_i1^k v_1(5) + p_i2^k v_2(5) + p_i3^k v_3(5) + p_i4^k v_4(5)] = q_i^k + 0.9[p_i1^k(−24.1) + p_i2^k(8.65) + p_i3^k(0.4) + p_i4^k(45.125)]

i  k  Test quantity                                                                     v_i(4) = Expected Total Discounted Reward                 Decision d_i(4) = k
1  1  −30 + 0.9[0.15(−24.1) + 0.40(8.65) + 0.35(0.4) + 0.10(45.125)] = −25.95225
1  2  −25 + 0.9[0.45(−24.1) + 0.05(8.65) + 0.20(0.4) + 0.30(45.125)] = −22.1155 ←      max[−25.9523, −22.1155, −30.6425] = −22.1155 = v1(4)      d1(4) = 2
1  3  −20 + 0.9[0.60(−24.1) + 0.30(8.65) + 0.10(0.4) + 0(45.125)] = −30.6425
2  1  5 + 0.9[0.25(−24.1) + 0.30(8.65) + 0.35(0.4) + 0.10(45.125)] = 6.1003
2  2  10 + 0.9[0.30(−24.1) + 0.40(8.65) + 0.25(0.4) + 0.05(45.125)] = 8.7276 ←         max[6.1003, 8.7276] = 8.7276 = v2(4)                      d2(4) = 2
3  1  −10 + 0.9[0.05(−24.1) + 0.65(8.65) + 0.25(0.4) + 0.05(45.125)] = −3.9036
3  2  −5 + 0.9[0.05(−24.1) + 0.25(8.65) + 0.50(0.4) + 0.20(45.125)] = 4.1643 ←         max[−3.9036, 4.1643] = 4.1643 = v3(4)                     d3(4) = 2
4  1  35 + 0.9[0.05(−24.1) + 0.20(8.65) + 0.40(0.4) + 0.35(45.125)] = 49.8309
4  2  25 + 0.9[0(−24.1) + 0.10(8.65) + 0.30(0.4) + 0.60(45.125)] = 50.2540 ←           max[49.8309, 50.2540] = 50.2540 = v4(4)                   d4(4) = 2

$$\begin{aligned}
v_i(4) &= \max_k \{ q_i^k + \alpha [ p_{i1}^k v_1(5) + p_{i2}^k v_2(5) + p_{i3}^k v_3(5) + p_{i4}^k v_4(5) ] \}, \quad \text{for } i = 1, 2, 3, 4 \\
&= \max_k \{ q_i^k + 0.9 [ p_{i1}^k (-24.1) + p_{i2}^k (8.65) + p_{i3}^k (0.4) + p_{i4}^k (45.125) ] \}, \quad \text{for } i = 1, 2, 3, 4.
\end{aligned} \tag{5.63}$$


At the end of month 4, the optimal decision is to select the second alternative in
every state. Thus, the decision vector at the end of month 4 is d(4) = [2 2 2 2]T.
The calculations for month 3, denoted by n = 3, are indicated in Table 5.63.


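$$\begin{aligned}
v_i(3) &= \max_k \{ q_i^k + \alpha [ p_{i1}^k v_1(4) + p_{i2}^k v_2(4) + p_{i3}^k v_3(4) + p_{i4}^k v_4(4) ] \}, \quad \text{for } i = 1, 2, 3, 4 \\
&= \max_k \{ q_i^k + 0.9 [ p_{i1}^k (-22.1155) + p_{i2}^k (8.7276) + p_{i3}^k (4.1643) + p_{i4}^k (50.2540) ] \}, \quad \text{for } i = 1, 2, 3, 4.
\end{aligned} \tag{5.64}$$
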
TABLE 5.63
Value Iteration for n = 3

Test quantity: q_i^k + α[p_i1^k v_1(4) + p_i2^k v_2(4) + p_i3^k v_3(4) + p_i4^k v_4(4)] = q_i^k + 0.9[p_i1^k(−22.1155) + p_i2^k(8.7276) + p_i3^k(4.1643) + p_i4^k(50.2540)]

i  k  Test quantity                                                                                 v_i(3) = Expected Total Discounted Reward                 Decision d_i(3) = k
1  1  −30 + 0.9[0.15(−22.1155) + 0.40(8.7276) + 0.35(4.1643) + 0.10(50.2540)] = −24.0090
1  2  −25 + 0.9[0.45(−22.1155) + 0.05(8.7276) + 0.20(4.1643) + 0.30(50.2540)] = −19.2459 ←        max[−24.0090, −19.2459, −29.2111] = −19.2459 = v1(3)      d1(3) = 2
1  3  −20 + 0.9[0.60(−22.1155) + 0.30(8.7276) + 0.10(4.1643) + 0(50.2540)] = −29.2111
2  1  5 + 0.9[0.25(−22.1155) + 0.30(8.7276) + 0.35(4.1643) + 0.10(50.2540)] = 8.2151
2  2  10 + 0.9[0.30(−22.1155) + 0.40(8.7276) + 0.25(4.1643) + 0.05(50.2540)] = 10.3691 ←          max[8.2151, 10.3691] = 10.3691 = v2(3)                    d2(3) = 2
3  1  −10 + 0.9[0.05(−22.1155) + 0.65(8.7276) + 0.25(4.1643) + 0.05(50.2540)] = −2.6912
3  2  −5 + 0.9[0.05(−22.1155) + 0.25(8.7276) + 0.50(4.1643) + 0.20(50.2540)] = 6.8882 ←           max[−2.6912, 6.8882] = 6.8882 = v3(3)                     d3(3) = 2
4  1  35 + 0.9[0.05(−22.1155) + 0.20(8.7276) + 0.40(4.1643) + 0.35(50.2540)] = 52.9049
4  2  25 + 0.9[0(−22.1155) + 0.10(8.7276) + 0.30(4.1643) + 0.60(50.2540)] = 54.0470 ←             max[52.9049, 54.0470] = 54.0470 = v4(3)                   d4(3) = 2


At the end of month 3, the optimal decision is to select the second alternative in
every state. Thus, the decision vector at the end of month 3 is d(3) = [2 2 2 2]^T.
The calculations for month 2, denoted by n = 2, are indicated in Table 5.64.


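$$
\begin{aligned}
v_i(2) &= \max_k \left\{ q_i^k + \alpha \left[ p_{i1}^k v_1(3) + p_{i2}^k v_2(3) + p_{i3}^k v_3(3) + p_{i4}^k v_4(3) \right] \right\} \\
&= \max_k \left\{ q_i^k + 0.9 \left[ p_{i1}^k(-19.2459) + p_{i2}^k(10.3691) + p_{i3}^k(6.8882) + p_{i4}^k(54.0470) \right] \right\}, \quad i = 1, 2, 3, 4. \qquad (5.65)
\end{aligned}
$$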
At the end of month 2, the optimal decision is to select the second alternative
in every state. Thus, the decision vector at the end of month 2 is d(2) = [2 2 2 2]^T.
The calculations for month 1, denoted by n = 1, are indicated in Table 5.65.
TABLE 5.64
Value Iteration for n = 2

| i | k | q_i^k + α[p_i1^k v1(3) + p_i2^k v2(3) + p_i3^k v3(3) + p_i4^k v4(3)] = q_i^k + 0.9[p_i1^k(−19.2459) + p_i2^k(10.3691) + p_i3^k(6.8882) + p_i4^k(54.0470)] | vi(2) = Expected Total Discounted Reward | Decision di(2) = k |
|---|---|---|---|---|
| 1 | 1 | −30 + 0.9[0.15(−19.2459) + 0.40(10.3691) + 0.35(6.8882) + 0.10(54.0470)] = −21.8313 | | |
| 1 | 2 | −25 + 0.9[0.45(−19.2459) + 0.05(10.3691) + 0.20(6.8882) + 0.30(54.0470)] = −16.4954 ← | max[−21.8313, −16.4954, −26.9732] = −16.4954 = v1(2) | d1(2) = 2 |
| 1 | 3 | −20 + 0.9[0.60(−19.2459) + 0.30(10.3691) + 0.10(6.8882) + 0(54.0470)] = −26.9732 | | |
| 2 | 1 | 5 + 0.9[0.25(−19.2459) + 0.30(10.3691) + 0.35(6.8882) + 0.10(54.0470)] = 10.5033 | | |
| 2 | 2 | 10 + 0.9[0.30(−19.2459) + 0.40(10.3691) + 0.25(6.8882) + 0.05(54.0470)] = 12.5184 ← | max[10.5033, 12.5184] = 12.5184 = v2(2) | d2(2) = 2 |
| 3 | 1 | −10 + 0.9[0.05(−19.2459) + 0.65(10.3691) + 0.25(6.8882) + 0.05(54.0470)] = −0.8182 | | |
| 3 | 2 | −5 + 0.9[0.05(−19.2459) + 0.25(10.3691) + 0.50(6.8882) + 0.20(54.0470)] = 9.2951 ← | max[−0.8182, 9.2951] = 9.2951 = v3(2) | d3(2) = 2 |
| 4 | 1 | 35 + 0.9[0.05(−19.2459) + 0.20(10.3691) + 0.40(6.8882) + 0.35(54.0470)] = 55.5049 | | |
| 4 | 2 | 25 + 0.9[0(−19.2459) + 0.10(10.3691) + 0.30(6.8882) + 0.60(54.0470)] = 56.9784 ← | max[55.5049, 56.9784] = 56.9784 = v4(2) | d4(2) = 2 |

TABLE 5.65
Value Iteration for n = 1

| i | k | q_i^k + α[p_i1^k v1(2) + p_i2^k v2(2) + p_i3^k v3(2) + p_i4^k v4(2)] = q_i^k + 0.9[p_i1^k(−16.4954) + p_i2^k(12.5184) + p_i3^k(9.2951) + p_i4^k(56.9784)] | vi(1) = Expected Total Discounted Reward | Decision di(1) = k |
|---|---|---|---|---|
| 1 | 1 | −30 + 0.9[0.15(−16.4954) + 0.40(12.5184) + 0.35(9.2951) + 0.10(56.9784)] = −19.6642 | | |
| 1 | 2 | −25 + 0.9[0.45(−16.4954) + 0.05(12.5184) + 0.20(9.2951) + 0.30(56.9784)] = −14.0600 ← | max[−19.6642, −14.0600, −24.6910] = −14.0600 = v1(1) | d1(1) = 2 |
| 1 | 3 | −20 + 0.9[0.60(−16.4954) + 0.30(12.5184) + 0.10(9.2951) + 0(56.9784)] = −24.6910 | | |
| 2 | 1 | 5 + 0.9[0.25(−16.4954) + 0.30(12.5184) + 0.35(9.2951) + 0.10(56.9784)] = 12.7245 | | |
| 2 | 2 | 10 + 0.9[0.30(−16.4954) + 0.40(12.5184) + 0.25(9.2951) + 0.05(56.9784)] = 14.7083 ← | max[12.7245, 14.7083] = 14.7083 = v2(1) | d2(1) = 2 |
| 3 | 1 | −10 + 0.9[0.05(−16.4954) + 0.65(12.5184) + 0.25(9.2951) + 0.05(56.9784)] = 1.2364 | | |
| 3 | 2 | −5 + 0.9[0.05(−16.4954) + 0.25(12.5184) + 0.50(9.2951) + 0.20(56.9784)] = 11.5133 ← | max[1.2364, 11.5133] = 11.5133 = v3(1) | d3(1) = 2 |
| 4 | 1 | 35 + 0.9[0.05(−16.4954) + 0.20(12.5184) + 0.40(9.2951) + 0.35(56.9784)] = 57.8055 | | |
| 4 | 2 | 25 + 0.9[0(−16.4954) + 0.10(12.5184) + 0.30(9.2951) + 0.60(56.9784)] = 59.4047 ← | max[57.8055, 59.4047] = 59.4047 = v4(1) | d4(1) = 2 |


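$$
\begin{aligned}
v_i(1) &= \max_k \left\{ q_i^k + \alpha \left[ p_{i1}^k v_1(2) + p_{i2}^k v_2(2) + p_{i3}^k v_3(2) + p_{i4}^k v_4(2) \right] \right\} \\
&= \max_k \left\{ q_i^k + 0.9 \left[ p_{i1}^k(-16.4954) + p_{i2}^k(12.5184) + p_{i3}^k(9.2951) + p_{i4}^k(56.9784) \right] \right\}, \quad i = 1, 2, 3, 4. \qquad (5.66)
\end{aligned}
$$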

At the end of month 1, the optimal decision is to select the second alternative in
every state. Thus, the decision vector at the end of month 1 is d(1) = [2 2 2 2]^T.
The calculations for month 0, denoted by n = 0, are indicated in Table 5.66.


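$$
\begin{aligned}
v_i(0) &= \max_k \left\{ q_i^k + \alpha \left[ p_{i1}^k v_1(1) + p_{i2}^k v_2(1) + p_{i3}^k v_3(1) + p_{i4}^k v_4(1) \right] \right\} \\
&= \max_k \left\{ q_i^k + 0.9 \left[ p_{i1}^k(-14.0600) + p_{i2}^k(14.7083) + p_{i3}^k(11.5133) + p_{i4}^k(59.4047) \right] \right\}, \quad i = 1, 2, 3, 4. \qquad (5.67)
\end{aligned}
$$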

At the end of month 0, which is the beginning of month 1, the optimal decision
is to select the second alternative in every state. Thus, the decision vector
at the beginning of month 1 is d(0) = [2 2 2 2]^T.
The results of these calculations for the expected total discounted rewards
and the optimal decisions at the end of each month of the seven-month planning
horizon are summarized in Table 5.67.
If the process starts in state 4, the expected total discounted reward is
61.5109, the highest for any state. On the other hand, the lowest expected
total discounted reward is −11.9208, which is the negative of an expected
total discounted cost of 11.9208, if the process starts in state 1.
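
The backward recursion used for months 6 down to 0 can be sketched in a few lines of Python. This is only an illustrative sketch: the rewards and transition probabilities are read off the value iteration tables of this section, and the names MDP, ALPHA, HORIZON, and backward_value_iteration are hypothetical labels chosen here, not notation from the text.

```python
# Monthly sales MDP. MDP[i] lists the alternatives available in state i+1 as
# (reward q_i^k, transition row p_i^k), read off the value iteration tables.
MDP = [
    [(-30, [0.15, 0.40, 0.35, 0.10]),
     (-25, [0.45, 0.05, 0.20, 0.30]),
     (-20, [0.60, 0.30, 0.10, 0.00])],
    [(5, [0.25, 0.30, 0.35, 0.10]),
     (10, [0.30, 0.40, 0.25, 0.05])],
    [(-10, [0.05, 0.65, 0.25, 0.05]),
     (-5, [0.05, 0.25, 0.50, 0.20])],
    [(35, [0.05, 0.20, 0.40, 0.35]),
     (25, [0.00, 0.10, 0.30, 0.60])],
]
ALPHA = 0.9   # discount factor
HORIZON = 7   # months; salvage values v_i(7) = 0


def backward_value_iteration(mdp, alpha, horizon):
    """Return (values, decisions): values[n][i] = v_{i+1}(n), decisions[n][i] = d_{i+1}(n)."""
    values = {horizon: [0.0] * len(mdp)}
    decisions = {}
    for n in range(horizon - 1, -1, -1):
        v_next = values[n + 1]
        v_now, d_now = [], []
        for alternatives in mdp:
            # Test quantity q_i^k + alpha * sum_j p_ij^k v_j(n+1) for each alternative k.
            tests = [q + alpha * sum(p * v for p, v in zip(row, v_next))
                     for q, row in alternatives]
            best = max(range(len(tests)), key=tests.__getitem__)
            v_now.append(tests[best])
            d_now.append(best + 1)  # decisions are numbered from 1
        values[n], decisions[n] = v_now, d_now
    return values, decisions


values, decisions = backward_value_iteration(MDP, ALPHA, HORIZON)
for n in range(HORIZON - 1, -1, -1):
    print(n, [round(v, 4) for v in values[n]], decisions[n])
```

Because the sketch carries full floating-point precision from epoch to epoch, its values may differ from the tabulated ones in the last decimal place, since the tables round to four decimals at every step.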


TABLE 5.66
Value Iteration for n = 0

| i | k | q_i^k + α[p_i1^k v1(1) + p_i2^k v2(1) + p_i3^k v3(1) + p_i4^k v4(1)] = q_i^k + 0.9[p_i1^k(−14.0600) + p_i2^k(14.7083) + p_i3^k(11.5133) + p_i4^k(59.4047)] | vi(0) = Expected Total Discounted Reward | Decision di(0) = k |
|---|---|---|---|---|
| 1 | 1 | −30 + 0.9[0.15(−14.0600) + 0.40(14.7083) + 0.35(11.5133) + 0.10(59.4047)] = −17.6300 | | |
| 1 | 2 | −25 + 0.9[0.45(−14.0600) + 0.05(14.7083) + 0.20(11.5133) + 0.30(59.4047)] = −11.9208 ← | max[−17.6300, −11.9208, −22.5850] = −11.9208 = v1(0) | d1(0) = 2 |
| 1 | 3 | −20 + 0.9[0.60(−14.0600) + 0.30(14.7083) + 0.10(11.5133) + 0(59.4047)] = −22.5850 | | |
| 2 | 1 | 5 + 0.9[0.25(−14.0600) + 0.30(14.7083) + 0.35(11.5133) + 0.10(59.4047)] = 14.7809 | | |
| 2 | 2 | 10 + 0.9[0.30(−14.0600) + 0.40(14.7083) + 0.25(11.5133) + 0.05(59.4047)] = 16.7625 ← | max[14.7809, 16.7625] = 16.7625 = v2(0) | d2(0) = 2 |
| 3 | 1 | −10 + 0.9[0.05(−14.0600) + 0.65(14.7083) + 0.25(11.5133) + 0.05(59.4047)] = 3.2354 | | |
| 3 | 2 | −5 + 0.9[0.05(−14.0600) + 0.25(14.7083) + 0.50(11.5133) + 0.20(59.4047)] = 13.5505 ← | max[3.2354, 13.5505] = 13.5505 = v3(0) | d3(0) = 2 |
| 4 | 1 | 35 + 0.9[0.05(−14.0600) + 0.20(14.7083) + 0.40(11.5133) + 0.35(59.4047)] = 59.8721 | | |
| 4 | 2 | 25 + 0.9[0(−14.0600) + 0.10(14.7083) + 0.30(11.5133) + 0.60(59.4047)] = 61.5109 ← | max[59.8721, 61.5109] = 61.5109 = v4(0) | d4(0) = 2 |

TABLE 5.67
Expected Total Discounted Rewards and Optimal Decisions for a Planning Horizon of 7 Months

| Epoch n | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| v1(n) | −11.9208 | −14.0600 | −16.4954 | −19.2459 | −22.1155 | −24.10 | −20 | 0 |
| v2(n) | 16.7625 | 14.7083 | 12.5184 | 10.3691 | 8.7276 | 8.65 | 10 | 0 |
| v3(n) | 13.5505 | 11.5133 | 9.2951 | 6.8882 | 4.1643 | 0.40 | −5 | 0 |
| v4(n) | 61.5109 | 59.4047 | 56.9784 | 54.0470 | 50.2540 | 45.125 | 35 | 0 |
| d1(n) | 2 | 2 | 2 | 2 | 2 | 2 | 3 | − |
| d2(n) | 2 | 2 | 2 | 2 | 2 | 2 | 2 | − |
| d3(n) | 2 | 2 | 2 | 2 | 2 | 2 | 2 | − |
| d4(n) | 2 | 2 | 2 | 2 | 2 | 1 | 1 | − |

5.2.2 An Infinite Planning Horizon

When an optimal stationary policy is to be determined for a discounted
MDP over an infinite planning horizon, either value iteration, PI, or LP can
be applied. Value iteration imposes the smallest computational burden,
but yields only an approximate solution for the expected total discounted
rewards. Both PI and LP produce exact solutions.

5.2.2.1 Value Iteration


In Section 4.3.3.2, value iteration was used to find the expected total
discounted rewards received by an MCR over an infinite planning horizon. In
Section 5.1.2.3.2, value iteration was used to find an optimal stationary
policy plus the expected relative rewards received by an undiscounted MDP
over an infinite horizon. In this section, value iteration will be applied to
find an optimal stationary policy plus the expected total discounted rewards
received by an MDP over an infinite horizon.

5.2.2.1.1 Value Iteration Algorithm


By extending the value iteration algorithm for an MCR in Section 4.3.3.2.2,
the following value iteration algorithm in its most basic form (adapted from
Puterman [3]) can be applied to a discounted MDP over an infinite horizon.
This value iteration algorithm assumes that epochs are numbered as
consecutive negative integers from −n at the beginning of the horizon to 0 at
the end.
Step 1. Select arbitrary salvage values for vi(0), for i = 1, 2, ..., N. For simplicity, set vi(0) = 0. Specify ε > 0. Set n = −1.

Step 2. For each state i, use the value iteration equation to compute

$$
v_i(n) = \max_k \left\{ q_i^k + \alpha \sum_{j=1}^{N} p_{ij}^k v_j(n+1) \right\}, \quad \text{for } i = 1, 2, \ldots, N.
$$

Step 3. If max_{i = 1, 2, ..., N} |vi(n) − vi(n + 1)| < ε, go to step 4. Otherwise, decrement n by 1 and return to step 2.

Step 4. For each state i, choose the decision di = k that yields the maximum value of vi(n), and stop.
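
Steps 1 through 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than a definitive implementation: the function name value_iteration, the max_sweeps safeguard, and the tolerance eps = 1e-6 are choices made here for illustration, and the data are those of the four-state monthly sales model with α = 0.9.

```python
# Monthly sales MDP: for each state, a list of (reward q_i^k, transition row p_i^k).
MDP = [
    [(-30, [0.15, 0.40, 0.35, 0.10]),
     (-25, [0.45, 0.05, 0.20, 0.30]),
     (-20, [0.60, 0.30, 0.10, 0.00])],
    [(5, [0.25, 0.30, 0.35, 0.10]),
     (10, [0.30, 0.40, 0.25, 0.05])],
    [(-10, [0.05, 0.65, 0.25, 0.05]),
     (-5, [0.05, 0.25, 0.50, 0.20])],
    [(35, [0.05, 0.20, 0.40, 0.35]),
     (25, [0.00, 0.10, 0.30, 0.60])],
]


def value_iteration(mdp, alpha, eps, max_sweeps=100_000):
    """Basic value iteration; returns (values, policy, number of sweeps used)."""
    v_next = [0.0] * len(mdp)  # Step 1: salvage values v_i(0) = 0
    for sweep in range(1, max_sweeps + 1):
        # Step 2: test quantities for every state and every alternative.
        tests = [[q + alpha * sum(p * v for p, v in zip(row, v_next))
                  for q, row in alternatives]
                 for alternatives in mdp]
        v_now = [max(t) for t in tests]
        # Step 3: stop when the largest change over all states is below eps.
        gap = max(abs(a - b) for a, b in zip(v_now, v_next))
        v_next = v_now
        if gap < eps:
            # Step 4: decisions attaining the maximum (1-based, first on ties).
            policy = [t.index(max(t)) + 1 for t in tests]
            return v_now, policy, sweep
    raise RuntimeError("stopping rule not met within max_sweeps")


values, policy, sweeps = value_iteration(MDP, alpha=0.9, eps=1e-6)
print(sweeps, policy, [round(v, 4) for v in values])
```

With a small ε the sketch needs far more than seven sweeps before the stopping rule is met, which illustrates why value iteration yields only approximate rewards after a handful of repetitions.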

5.2.2.1.2 Solution by Value Iteration of Discounted MDP Model of Monthly Sales


In Section 5.2.1.2, value iteration was executed to find an optimal policy for the four-state discounted MDP model of monthly sales over a 7-month planning horizon. The solution is summarized in Table 5.67, and is repeated in Table 5.68. In accordance with the treatment of a discounted MCR model in Section 4.3.3.2.3, the last eight epochs of an infinite horizon are numbered sequentially in Table 5.68 as −7, −6, −5, −4, −3, −2, −1, and 0. The absolute value of a negative epoch represents the number of months remaining in the horizon.
Table 5.68 shows that when 1 month remains in an infinite horizon, the optimal policy is to maximize the reward received in every state. When 2 months remain, the expected total discounted rewards are maximized by making decision 2 in states 1, 2, and 3, and decision 1 in state 4. When 3 or more months remain, the optimal policy is to make decision 2 in every state. Thus, as value iteration proceeds backward from the end of the horizon, convergence to an optimal policy given by the decision vector d(−3) = [2 2 2 2]^T appears to have occurred at n = −3.
Table 5.69 gives the absolute values of the differences between the expected total discounted rewards earned over planning horizons, which differ in length by one period. Note that the maximum absolute differences become progressively smaller for each epoch added to the planning horizon.

TABLE 5.68
Expected Total Discounted Rewards and Optimal Decisions during the Last 7 Months of an Infinite Planning Horizon

| Epoch n | −7 | −6 | −5 | −4 | −3 | −2 | −1 | 0 |
|---|---|---|---|---|---|---|---|---|
| v1(n) | −11.9208 | −14.0600 | −16.4954 | −19.2459 | −22.1155 | −24.10 | −20 | 0 |
| v2(n) | 16.7625 | 14.7083 | 12.5184 | 10.3691 | 8.7276 | 8.65 | 10 | 0 |
| v3(n) | 13.5505 | 11.5133 | 9.2951 | 6.8882 | 4.1643 | 0.40 | −5 | 0 |
| v4(n) | 61.5109 | 59.4047 | 56.9784 | 54.0470 | 50.2540 | 45.125 | 35 | 0 |
| d1(n) | 2 | 2 | 2 | 2 | 2 | 2 | 3 | − |
| d2(n) | 2 | 2 | 2 | 2 | 2 | 2 | 2 | − |
| d3(n) | 2 | 2 | 2 | 2 | 2 | 2 | 2 | − |
| d4(n) | 2 | 2 | 2 | 2 | 2 | 1 | 1 | − |


TABLE 5.69
Absolute Values of the Differences between the Expected Total Discounted Rewards Earned Over Planning Horizons, Which Differ in Length by One Period

Each entry in the column for epoch n is the absolute difference between vi(n) and vi(n + 1); the last row is the maximum over i.

| i | n = −7 | n = −6 | n = −5 | n = −4 | n = −3 | n = −2 | n = −1 |
|---|---|---|---|---|---|---|---|
| 1 | 2.1392U | 2.4354U | 2.7505 | 2.8696 | 1.9845 | 4.1 | 20 |
| 2 | 2.0542 | 2.1899 | 2.1493 | 1.6415 | 0.0776 | 1.35 | 10 |
| 3 | 2.0372 | 2.2181 | 2.4069 | 2.7239 | 3.7643 | 5.4 | 5 |
| 4 | 2.1062 | 2.4263 | 2.9314U | 3.7930U | 5.129U | 10.125U | 35U |
| Max | 2.1392 | 2.4354 | 2.9314 | 3.7930 | 5.129 | 10.125 | 35 |

In Table 5.69, a suffix U identifies the maximum absolute difference for each
epoch. The maximum absolute differences obtained for all seven epochs are
listed in the bottom row of Table 5.69.
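
As a hedged illustration, the bottom row of Table 5.69 can be recomputed directly from the values of Table 5.68. The dictionary name V and the variable max_diff are hypothetical labels chosen for this sketch.

```python
# v_i(n) for states 1..4, keyed by epoch n, copied from Table 5.68.
V = {
    -7: [-11.9208, 16.7625, 13.5505, 61.5109],
    -6: [-14.0600, 14.7083, 11.5133, 59.4047],
    -5: [-16.4954, 12.5184, 9.2951, 56.9784],
    -4: [-19.2459, 10.3691, 6.8882, 54.0470],
    -3: [-22.1155, 8.7276, 4.1643, 50.2540],
    -2: [-24.10, 8.65, 0.40, 45.125],
    -1: [-20.0, 10.0, -5.0, 35.0],
    0: [0.0, 0.0, 0.0, 0.0],
}

# Largest absolute change over all states between epochs n and n + 1.
max_diff = {
    n: max(abs(a - b) for a, b in zip(V[n], V[n + 1]))
    for n in range(-7, 0)
}

for n in range(-7, 0):
    print(n, round(max_diff[n], 4))
```

The same quantity is the one compared with ε in step 3 of the value iteration algorithm, so the printed sequence shows how far the stopping rule is from being met for any small ε after seven repetitions.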
Table 5.68 shows that under the optimal policy d(−7) = [2 2 2 2]^T, the approximate expected total discounted rewards,

v1 = −11.9208, v2 = 16.7625, v3 = 13.5505, v4 = 61.5109,

obtained after seven repetitions of value iteration are significantly different from the actual expected total discounted rewards,

v1 = 6.8040, v2 = 35.4613, v3 = 32.2190, v4 = 80.1970,

obtained by PI in Equation (5.72) of Section 5.2.2.2.3, and by LP as dual variables in Table 5.73 of Section 5.2.2.3.4. Thus, while value iteration appears to have converged quickly to an optimal policy, it has not converged after seven repetitions to the actual expected total discounted rewards.

5.2.2.2 Policy Iteration


The PI algorithm created by Ronald A. Howard [2] for finding an optimal stationary policy for a discounted MDP over an infinite planning horizon is similar to the one (described in Section 5.1.2.3.3.2) that he created for an undiscounted, recurrent MDP. Since the gain is not meaningful for a discounted process, an optimal policy is one which maximizes the expected total discounted rewards received in all states. Howard proved that after a finite number of iterations, an optimal policy will be determined. Once again, the PI algorithm has two main steps: the VD operation and the IM routine. The algorithm begins by arbitrarily choosing an initial policy. During the VD operation, the VDEs corresponding to the current policy are solved for the expected total discounted rewards received in all states. The IM routine attempts to find a better policy. If a better policy is found, the VD operation is repeated using the new policy to identify a new system of associated VDEs. The algorithm stops when two successive iterations lead to identical policies.

5.2.2.2.1 Policy Improvement


The IM routine was motivated by the value iteration equation (5.57) for a discounted MDP in Section 5.2.1.1. The value iteration equation indicates that if an optimal policy is known over a planning horizon starting at epoch n + 1 and ending at epoch T, then the best decision in state i at epoch n can be found by maximizing a test quantity

$$
q_i^k + \alpha \sum_{j=1}^{N} p_{ij}^k v_j(n+1) \qquad (5.68)
$$

over all decisions in state i. Therefore, if an optimal policy is known over a planning horizon starting at epoch 1 and ending at epoch T, then at epoch 0, which marks the beginning of period 1, the best decision in state i can be found by maximizing a test quantity

$$
q_i^k + \alpha \sum_{j=1}^{N} p_{ij}^k v_j(1) \qquad (5.69)
$$

over all decisions in state i. Recall from Section 4.3.3.1 that when T, the length of the planning horizon, is very large, (T − 1) is also very large. Substituting vj in Equation (4.62) for vj(1) in the test quantity produces the result

$$
q_i^k + \alpha \sum_{j=1}^{N} p_{ij}^k v_j(1) = q_i^k + \alpha \sum_{j=1}^{N} p_{ij}^k v_j \qquad (5.70)
$$

as the test quantity to be maximized with respect to all alternatives when making decisions in state i, for i = 1, 2, ..., N. The expected total discounted rewards produced by solving the VDEs associated with the most recent policy can be used.

5.2.2.2.2 PI Algorithm
The detailed steps of the PI algorithm are given below.
Step 1. Initial policy
Arbitrarily choose an initial policy by selecting for each state i a decision di = k.

Step 2. VD operation
Use pij and qi for a given policy to solve the VDEs (4.260)

$$
v_i = q_i + \alpha \sum_{j=1}^{N} p_{ij} v_j, \quad i = 1, 2, \ldots, N,
$$

for all expected total discounted rewards vi.

Step 3. IM routine
For each state i, find the decision k∗ that maximizes the test quantity

$$
q_i^k + \alpha \sum_{j=1}^{N} p_{ij}^k v_j
$$

using the expected total discounted rewards vi of the previous policy. Then k∗ becomes the new decision in state i, so that di = k∗, q_i^{k∗} becomes qi, and p_ij^{k∗} becomes pij.

Step 4. Stopping rule
When the policies on two successive iterations are identical, the algorithm stops because an optimal policy has been found. Leave the old di unchanged if the test quantity for that di is equal to that of any other alternative in the new policy determination. If the new policy is different from the previous policy in at least one state, go to step 2.
Howard proved that, for each policy, the expected total discounted rewards
received in every state will be greater than or equal to their respective values
for the previous policy. The algorithm will terminate after a finite number
of iterations.
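
The four steps can be sketched in Python as shown below. This is an illustrative sketch, not Howard's own formulation: the helper solve (a small Gauss-Jordan routine), the function name policy_iteration, and the tie tolerance 1e-12 are assumptions introduced here, and the data are those of the monthly sales model with α = 0.9.

```python
# Monthly sales MDP: for each state, a list of (reward q_i^k, transition row p_i^k).
MDP = [
    [(-30, [0.15, 0.40, 0.35, 0.10]),
     (-25, [0.45, 0.05, 0.20, 0.30]),
     (-20, [0.60, 0.30, 0.10, 0.00])],
    [(5, [0.25, 0.30, 0.35, 0.10]),
     (10, [0.30, 0.40, 0.25, 0.05])],
    [(-10, [0.05, 0.65, 0.25, 0.05]),
     (-5, [0.05, 0.25, 0.50, 0.20])],
    [(35, [0.05, 0.20, 0.40, 0.35]),
     (25, [0.00, 0.10, 0.30, 0.60])],
]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def policy_iteration(mdp, alpha, initial_policy):
    """Return (optimal policy, its values, history of (policy, values) pairs)."""
    n = len(mdp)
    policy = list(initial_policy)  # Step 1: decisions numbered from 1
    history = []
    while True:
        # Step 2 (VD): solve (I - alpha P) v = q for the current policy.
        a = [[(1.0 if i == j else 0.0) - alpha * mdp[i][policy[i] - 1][1][j]
              for j in range(n)] for i in range(n)]
        q = [mdp[i][policy[i] - 1][0] for i in range(n)]
        v = solve(a, q)
        history.append((list(policy), v))
        # Step 3 (IM): maximize the test quantity in every state.
        new_policy = []
        for i, alternatives in enumerate(mdp):
            tests = [qk + alpha * sum(p * vj for p, vj in zip(row, v))
                     for qk, row in alternatives]
            keep = policy[i] - 1
            best = max(range(len(tests)), key=tests.__getitem__)
            if tests[best] <= tests[keep] + 1e-12:
                best = keep  # leave the old decision unchanged on ties
            new_policy.append(best + 1)
        # Step 4: stop when two successive policies are identical.
        if new_policy == policy:
            return policy, v, history
        policy = new_policy


policy, values, history = policy_iteration(MDP, 0.9, [3, 2, 2, 1])
print(policy, [round(v, 4) for v in values], len(history))
```

Solving the VDEs exactly at each iteration is what distinguishes PI from value iteration: the cost per iteration is higher, but the number of iterations is typically small.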

5.2.2.2.3 Solution by PI of MDP Model of Monthly Sales


Consider a discounted MDP model of monthly sales. This model was specified
in Table 5.3 of Section 5.1.2.1. In Table 5.68 of Section 5.2.2.1.2, using
a discount factor of α = 0.9, value iteration was executed to find an optimal
policy over an infinite horizon. In this section, PI will be executed to find an
optimal policy using the same discount factor.
First iteration
Step 1. Initial policy
Arbitrarily choose the initial policy ${}^{23}d = [3\ 2\ 2\ 1]^T$ by making decision 3 in state 1, decision 2 in states 2 and 3, and decision 1 in state 4. Thus, d1 = 3, d2 = d3 = 2, and d4 = 1. The initial decision vector ${}^{23}d$, along with the associated transition probability matrix ${}^{23}P$, and the reward vector ${}^{23}q$, are shown below. Rows correspond to states 1 through 4.

$$
{}^{23}d = \begin{bmatrix} 3 \\ 2 \\ 2 \\ 1 \end{bmatrix}, \quad
{}^{23}P = \begin{bmatrix} 0.60 & 0.30 & 0.10 & 0 \\ 0.30 & 0.40 & 0.25 & 0.05 \\ 0.05 & 0.25 & 0.50 & 0.20 \\ 0.05 & 0.20 & 0.40 & 0.35 \end{bmatrix}, \quad
{}^{23}q = \begin{bmatrix} -20 \\ 10 \\ -5 \\ 35 \end{bmatrix}.
$$

Step 2. VD operation
Use pij and qi for the initial policy, ${}^{23}d = [3\ 2\ 2\ 1]^T$, to solve the VDEs (4.257)

$$
\begin{aligned}
v_1 &= q_1 + \alpha (p_{11} v_1 + p_{12} v_2 + p_{13} v_3 + p_{14} v_4) \\
v_2 &= q_2 + \alpha (p_{21} v_1 + p_{22} v_2 + p_{23} v_3 + p_{24} v_4) \\
v_3 &= q_3 + \alpha (p_{31} v_1 + p_{32} v_2 + p_{33} v_3 + p_{34} v_4) \\
v_4 &= q_4 + \alpha (p_{41} v_1 + p_{42} v_2 + p_{43} v_3 + p_{44} v_4)
\end{aligned}
$$

for all the expected total discounted rewards vi

$$
\begin{aligned}
v_1 &= -20 + 0.9(0.60 v_1 + 0.30 v_2 + 0.10 v_3 + 0 v_4) \\
v_2 &= 10 + 0.9(0.30 v_1 + 0.40 v_2 + 0.25 v_3 + 0.05 v_4) \\
v_3 &= -5 + 0.9(0.05 v_1 + 0.25 v_2 + 0.50 v_3 + 0.20 v_4) \\
v_4 &= 35 + 0.9(0.05 v_1 + 0.20 v_2 + 0.40 v_3 + 0.35 v_4).
\end{aligned}
$$

The solution of the VDEs is

v1 = −38.2655, v2 = 6.1707, v3 = 8.1311, v4 = 54.4759.
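
One quick way to check a stated solution of the VDEs is to substitute it back into the four equations and inspect the residuals. The sketch below does this for the initial policy; the variable names P, q, v, and residuals are illustrative choices, and the residuals are expected to be small but not exactly zero because the stated solution is rounded to four decimals.

```python
ALPHA = 0.9  # discount factor

# Transition matrix and reward vector of the initial policy [3 2 2 1]^T.
P = [
    [0.60, 0.30, 0.10, 0.00],
    [0.30, 0.40, 0.25, 0.05],
    [0.05, 0.25, 0.50, 0.20],
    [0.05, 0.20, 0.40, 0.35],
]
q = [-20, 10, -5, 35]

# Stated solution of the VDEs, rounded to four decimals.
v = [-38.2655, 6.1707, 8.1311, 54.4759]

# Residual of each VDE: v_i - (q_i + alpha * sum_j p_ij v_j).
residuals = [
    v[i] - (q[i] + ALPHA * sum(P[i][j] * v[j] for j in range(4)))
    for i in range(4)
]
print([round(r, 6) for r in residuals])
```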

Step 3. IM routine
For each state i, find the decision k∗ that maximizes the test quantity

$$
q_i^k + \alpha \sum_{j=1}^{4} p_{ij}^k v_j
$$

using the expected total discounted rewards vi of the initial policy. Then k∗ becomes the new decision in state i, so that di = k∗, q_i^{k∗} becomes qi, and p_ij^{k∗} becomes pij. The first IM routine is executed in Table 5.70.
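
A single pass of the IM routine can be sketched as follows, using the rounded rewards of the initial policy. The names MDP, tests, and new_policy are hypothetical labels for this illustration.

```python
ALPHA = 0.9  # discount factor

# Monthly sales MDP: for each state, a list of (reward q_i^k, transition row p_i^k).
MDP = [
    [(-30, [0.15, 0.40, 0.35, 0.10]),
     (-25, [0.45, 0.05, 0.20, 0.30]),
     (-20, [0.60, 0.30, 0.10, 0.00])],
    [(5, [0.25, 0.30, 0.35, 0.10]),
     (10, [0.30, 0.40, 0.25, 0.05])],
    [(-10, [0.05, 0.65, 0.25, 0.05]),
     (-5, [0.05, 0.25, 0.50, 0.20])],
    [(35, [0.05, 0.20, 0.40, 0.35]),
     (25, [0.00, 0.10, 0.30, 0.60])],
]

# Expected total discounted rewards of the initial policy [3 2 2 1]^T.
v = [-38.2655, 6.1707, 8.1311, 54.4759]

# Test quantity q_i^k + alpha * sum_j p_ij^k v_j for every state and alternative.
tests = [
    [q + ALPHA * sum(p * vj for p, vj in zip(row, v)) for q, row in alternatives]
    for alternatives in MDP
]

# New decision in each state: the alternative with the largest test quantity.
new_policy = [t.index(max(t)) + 1 for t in tests]

for state, t in enumerate(tests, start=1):
    print(state, [round(x, 4) for x in t])
print(new_policy)
```

For the alternative already used by the current policy, the test quantity reproduces that state's own reward vi, which is a convenient consistency check on the VD operation.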

Step 4. Stopping rule


The new policy is ${}^{16}d = [2\ 2\ 2\ 2]^T$, which is different from the initial policy.
Therefore, go to step 2. The new decision vector, ${}^{16}d$, along with the associated
transition probability matrix, ${}^{16}P$, and the reward vector, ${}^{16}q$, are shown in Equation (5.71).


TABLE 5.70
First IM for Monthly Sales Example

| State i | Decision Alternative k | Test Quantity q_i^k + α(p_i1^k v1 + p_i2^k v2 + p_i3^k v3 + p_i4^k v4) = q_i^k + 0.9[p_i1^k(−38.2655) + p_i2^k(6.1707) + p_i3^k(8.1311) + p_i4^k(54.4759)] | Maximum Value of Test Quantity | Decision di = k∗ |
|---|---|---|---|---|
| 1 | 1 | −30 + 0.9[0.15(−38.2655) + 0.40(6.1707) + 0.35(8.1311) + 0.10(54.4759)] = −25.4803 | | |
| 1 | 2 | −25 + 0.9[0.45(−38.2655) + 0.05(6.1707) + 0.20(8.1311) + 0.30(54.4759)] = −24.0478 ← | max[−25.4803, −24.0478, −38.2655] = −24.0478 | d1 = 2 |
| 1 | 3 | −20 + 0.9[0.60(−38.2655) + 0.30(6.1707) + 0.10(8.1311) + 0(54.4759)] = −38.2655 | | |
| 2 | 1 | 5 + 0.9[0.25(−38.2655) + 0.30(6.1707) + 0.35(8.1311) + 0.10(54.4759)] = 5.5205 | | |
| 2 | 2 | 10 + 0.9[0.30(−38.2655) + 0.40(6.1707) + 0.25(8.1311) + 0.05(54.4759)] = 6.1707 ← | max[5.5205, 6.1707] = 6.1707 | d2 = 2 |
| 3 | 1 | −10 + 0.9[0.05(−38.2655) + 0.65(6.1707) + 0.25(8.1311) + 0.05(54.4759)] = −3.8312 | | |
| 3 | 2 | −5 + 0.9[0.05(−38.2655) + 0.25(6.1707) + 0.50(8.1311) + 0.20(54.4759)] = 8.1311 ← | max[−3.8312, 8.1311] = 8.1311 | d3 = 2 |
| 4 | 1 | 35 + 0.9[0.05(−38.2655) + 0.20(6.1707) + 0.40(8.1311) + 0.35(54.4759)] = 54.4759 | | |
| 4 | 2 | 25 + 0.9[0(−38.2655) + 0.10(6.1707) + 0.30(8.1311) + 0.60(54.4759)] = 57.1677 ← | max[54.4759, 57.1677] = 57.1677 | d4 = 2 |

2 1  0.45 0.05 0.20 0.30  1  −25 


2 2 0.30 0.40 0.25 0.05  2  10 
16
d =  , 16
P=  , 16
q=  .
(5.71)
2 3  0.05 0.25 0.50 0.20  3  −5 
     
 2  4  0 0.10 0.30 0.60  4  25 

Step 2. VD operation
Use pij and qi for the new policy, ${}^{16}d = [2\ 2\ 2\ 2]^T$, to solve the VDEs (4.257) for all the expected total discounted rewards vi.

$$
\begin{aligned}
v_1 &= -25 + 0.9[0.45 v_1 + 0.05 v_2 + 0.20 v_3 + 0.30 v_4] \\
v_2 &= 10 + 0.9[0.30 v_1 + 0.40 v_2 + 0.25 v_3 + 0.05 v_4] \\
v_3 &= -5 + 0.9[0.05 v_1 + 0.25 v_2 + 0.50 v_3 + 0.20 v_4] \\
v_4 &= 25 + 0.9[0 v_1 + 0.10 v_2 + 0.30 v_3 + 0.60 v_4].
\end{aligned}
$$

The solution of the VDEs is

v1 = 6.8040, v2 = 35.4613, v3 = 32.2190, v4 = 80.1970. (5.72)

Step 3. IM routine
For each state i, find the decision k∗ that maximizes the test quantity

$$
q_i^k + \alpha \sum_{j=1}^{4} p_{ij}^k v_j
$$

using the expected total discounted rewards vi of the previous policy. Then k∗ becomes the new decision in state i, so that di = k∗, q_i^{k∗} becomes qi, and p_ij^{k∗} becomes pij. The second IM routine is executed in Table 5.71.
TABLE 5.71
Second IM for Monthly Sales Example

| State i | Decision Alternative k | Test Quantity q_i^k + α(p_i1^k v1 + p_i2^k v2 + p_i3^k v3 + p_i4^k v4) = q_i^k + 0.9[p_i1^k(6.8040) + p_i2^k(35.4613) + p_i3^k(32.2190) + p_i4^k(80.1970)] | Maximum Value of Test Quantity | Decision di = k∗ |
|---|---|---|---|---|
| 1 | 1 | −30 + 0.9[0.15(6.8040) + 0.40(35.4613) + 0.35(32.2190) + 0.10(80.1970)] = 1.0513 | | |
| 1 | 2 | −25 + 0.9[0.45(6.8040) + 0.05(35.4613) + 0.20(32.2190) + 0.30(80.1970)] = 6.8040 ← | max[1.0513, 6.8040, −3.8516] = 6.8040 | d1 = 2 |
| 1 | 3 | −20 + 0.9[0.60(6.8040) + 0.30(35.4613) + 0.10(32.2190) + 0(80.1970)] = −3.8516 | | |
| 2 | 1 | 5 + 0.9[0.25(6.8040) + 0.30(35.4613) + 0.35(32.2190) + 0.10(80.1970)] = 33.4722 | | |
| 2 | 2 | 10 + 0.9[0.30(6.8040) + 0.40(35.4613) + 0.25(32.2190) + 0.05(80.1970)] = 35.4613 ← | max[33.4722, 35.4613] = 35.4613 | d2 = 2 |
| 3 | 1 | −10 + 0.9[0.05(6.8040) + 0.65(35.4613) + 0.25(32.2190) + 0.05(80.1970)] = 21.9092 | | |
| 3 | 2 | −5 + 0.9[0.05(6.8040) + 0.25(35.4613) + 0.50(32.2190) + 0.20(80.1970)] = 32.2190 ← | max[21.9092, 32.2190] = 32.2190 | d3 = 2 |
| 4 | 1 | 35 + 0.9[0.05(6.8040) + 0.20(35.4613) + 0.40(32.2190) + 0.35(80.1970)] = 78.5501 | | |
| 4 | 2 | 25 + 0.9[0(6.8040) + 0.10(35.4613) + 0.30(32.2190) + 0.60(80.1970)] = 80.1970 ← | max[78.5501, 80.1970] = 80.1970 | d4 = 2 |

Step 4. Stopping rule
Stop because the new policy, given by the vector ${}^{16}d = [2\ 2\ 2\ 2]^T$, is identical to the previous policy. Therefore, this policy is optimal. The expected total discounted rewards for the optimal policy are calculated in Equation (5.72). The transition probability matrix ${}^{16}P$ and the reward vector ${}^{16}q$ for the optimal policy are given in Equation (5.71). Note that the optimal policy is the same as the one obtained without discounting in Equation (5.24) of Section 5.1.2.3.3.3. Note also that the expected total discounted rewards obtained by PI are identical to the corresponding dual variables obtained by LP in Table 5.73 of Section 5.2.2.3.4.

5.2.2.2.4 Optional Insight: IM for a Two-State Discounted MDP


To gain insight (adapted from Howard [2]) into why the IM routine works for a discounted MDP, it is instructive to examine the special case of a generic two-state process with two decisions per state. (The special case of a two-state recurrent MDP without discounting was addressed in Section 5.1.2.3.3.4.) The two-state discounted MDP is shown in Table 5.20 in which u replaces a. The discount factor is denoted by α, where 0 < α < 1.
Since the process has two states with two decisions per state, there are 2^2 = 4 possible policies, identified by the letters A, B, C, and D. The decision vectors for the four policies are

$$
d^A = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad
d^B = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \quad
d^C = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \quad
d^D = \begin{bmatrix} 2 \\ 2 \end{bmatrix}.
$$

Suppose that policy A is the current policy. Policy A is characterized by the following decision vector, transition probability matrix, reward vector, and expected total discounted reward vector:

$$
d^A = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad
P^A = \begin{bmatrix} 1-u & u \\ c & 1-c \end{bmatrix}, \quad
q^A = \begin{bmatrix} e \\ s \end{bmatrix}, \quad
v^A = \begin{bmatrix} v_1^A \\ v_2^A \end{bmatrix}.
$$


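The VDEs for states 1 and 2 under policy A are given below:

$$
v_i^A = q_i^A + \alpha \sum_{j=1}^{2} p_{ij}^A v_j^A, \quad i = 1, 2.
$$

For i = 1,

$$
\begin{aligned}
v_1^A &= q_1^A + \alpha p_{11}^A v_1^A + \alpha p_{12}^A v_2^A \\
v_1^A &= e + \alpha (1-u) v_1^A + \alpha u v_2^A.
\end{aligned}
$$

For i = 2,

$$
\begin{aligned}
v_2^A &= q_2^A + \alpha p_{21}^A v_1^A + \alpha p_{22}^A v_2^A \\
v_2^A &= s + \alpha c v_1^A + \alpha (1-c) v_2^A.
\end{aligned}
$$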

Suppose that the evaluation of policy A by the IM routine has produced a new policy, policy D. Policy D is characterized by the following decision vector, transition probability matrix, reward vector, and expected total discounted reward vector:

$$
d^D = \begin{bmatrix} 2 \\ 2 \end{bmatrix}, \quad
P^D = \begin{bmatrix} 1-b & b \\ d & 1-d \end{bmatrix}, \quad
q^D = \begin{bmatrix} f \\ h \end{bmatrix}, \quad
v^D = \begin{bmatrix} v_1^D \\ v_2^D \end{bmatrix}.
$$

The VDEs for states 1 and 2 under policy D are given below.

$$
v_i^D = q_i^D + \alpha \sum_{j=1}^{2} p_{ij}^D v_j^D, \quad i = 1, 2.
$$

For i = 1,

$$
\begin{aligned}
v_1^D &= q_1^D + \alpha p_{11}^D v_1^D + \alpha p_{12}^D v_2^D \\
v_1^D &= f + \alpha (1-b) v_1^D + \alpha b v_2^D.
\end{aligned}
$$

For i = 2,

$$
\begin{aligned}
v_2^D &= q_2^D + \alpha p_{21}^D v_1^D + \alpha p_{22}^D v_2^D \\
v_2^D &= h + \alpha d v_1^D + \alpha (1-d) v_2^D.
\end{aligned}
$$

Since the IM routine has chosen policy D over A, the test quantity for policy D must be greater than or equal to the test quantity for A in both states. Therefore,

For i = 1,

$$
\begin{aligned}
q_1^D + \alpha (p_{11}^D v_1^A + p_{12}^D v_2^A) &\ge q_1^A + \alpha (p_{11}^A v_1^A + p_{12}^A v_2^A) \\
f + \alpha (1-b) v_1^A + \alpha b v_2^A &\ge e + \alpha (1-u) v_1^A + \alpha u v_2^A.
\end{aligned}
$$

Let γ1 denote the improvement in the test quantity achieved in state 1.

$$
\gamma_1 = f + \alpha (1-b) v_1^A + \alpha b v_2^A - e - \alpha (1-u) v_1^A - \alpha u v_2^A \ge 0.
$$

For i = 2,

$$
\begin{aligned}
q_2^D + \alpha (p_{21}^D v_1^A + p_{22}^D v_2^A) &\ge q_2^A + \alpha (p_{21}^A v_1^A + p_{22}^A v_2^A) \\
h + \alpha d v_1^A + \alpha (1-d) v_2^A &\ge s + \alpha c v_1^A + \alpha (1-c) v_2^A.
\end{aligned}
$$

Let γ2 denote the improvement in the test quantity achieved in state 2.

$$
\gamma_2 = h + \alpha d v_1^A + \alpha (1-d) v_2^A - s - \alpha c v_1^A - \alpha (1-c) v_2^A \ge 0.
$$

The objective of this insight is to show that the IM routine must increase
the expected total discounted rewards of one or both states.

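For i = 1,

$$
\begin{aligned}
v_1^D - v_1^A &= [f + \alpha (1-b) v_1^D + \alpha b v_2^D] - [e + \alpha (1-u) v_1^A + \alpha u v_2^A] \\
&= [(f - e) + \alpha (1-b) v_1^A + \alpha b v_2^A - \alpha (1-u) v_1^A - \alpha u v_2^A] \\
&\quad - \alpha (1-b) v_1^A - \alpha b v_2^A + \alpha (1-b) v_1^D + \alpha b v_2^D \\
&= \gamma_1 - \alpha (1-b) v_1^A - \alpha b v_2^A + \alpha (1-b) v_1^D + \alpha b v_2^D \\
&= \gamma_1 + \alpha p_{11}^D (v_1^D - v_1^A) + \alpha p_{12}^D (v_2^D - v_2^A).
\end{aligned}
$$

For i = 2,

$$
\begin{aligned}
v_2^D - v_2^A &= [h + \alpha d v_1^D + \alpha (1-d) v_2^D] - [s + \alpha c v_1^A + \alpha (1-c) v_2^A] \\
&= [(h - s) + \alpha d v_1^A + \alpha (1-d) v_2^A - \alpha c v_1^A - \alpha (1-c) v_2^A] \\
&\quad - \alpha d v_1^A - \alpha (1-d) v_2^A + \alpha d v_1^D + \alpha (1-d) v_2^D \\
&= \gamma_2 - \alpha d v_1^A - \alpha (1-d) v_2^A + \alpha d v_1^D + \alpha (1-d) v_2^D \\
&= \gamma_2 + \alpha p_{21}^D (v_1^D - v_1^A) + \alpha p_{22}^D (v_2^D - v_2^A).
\end{aligned}
$$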

Observe that the pair of equations for the increase in the expected total discounted rewards has the same form as the pair of equations for the expected total discounted rewards,

$$
\begin{aligned}
v_1^D &= q_1^D + \alpha p_{11}^D v_1^D + \alpha p_{12}^D v_2^D \\
v_2^D &= q_2^D + \alpha p_{21}^D v_1^D + \alpha p_{22}^D v_2^D.
\end{aligned}
$$

In matrix form, the pair of equations for the expected total discounted rewards is

$$
v^D = q^D + \alpha P^D v^D.
$$

The solution for the vector of the expected total discounted rewards is

$$
v^D = (I - \alpha P^D)^{-1} q^D.
$$

Similarly, the matrix form of the pair of equations for the increase in the expected total discounted rewards is

$$
v^D - v^A = \gamma + \alpha P^D (v^D - v^A),
$$

where v^D − v^A is the vector of the increase in the expected total discounted rewards in both states, and γ = [γ1 γ2]^T is the vector of improvement of the test quantity in both states. Therefore, the solution for the vector of the increase in the expected total discounted rewards is

$$
v^D - v^A = (I - \alpha P^D)^{-1} \gamma.
$$

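For the two-state process under policy D,

$$
\alpha P^D = \begin{bmatrix} \alpha (1-b) & \alpha b \\ \alpha d & \alpha (1-d) \end{bmatrix}
$$

$$
I - \alpha P^D = \begin{bmatrix} 1 - \alpha + \alpha b & -\alpha b \\ -\alpha d & 1 - \alpha + \alpha d \end{bmatrix}
$$

$$
(I - \alpha P^D)^{-1} = \frac{1}{(1 - \alpha + \alpha b)(1 - \alpha + \alpha d) - \alpha^2 b d}
\begin{bmatrix} 1 - \alpha + \alpha d & \alpha b \\ \alpha d & 1 - \alpha + \alpha b \end{bmatrix}.
$$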

Note that α ≥ 0, b ≥ 0, and d ≥ 0. Hence, αb ≥ 0 and αd ≥ 0. Also, 1 − α ≥ 0. Hence, 1 − α + αb ≥ 0 and 1 − α + αd ≥ 0. Thus, all entries in the numerator of the matrix (I − αP^D)^{−1} are nonnegative. The denominator of all entries in (I − αP^D)^{−1} is

$$
\begin{aligned}
(1 - \alpha + \alpha b)(1 - \alpha + \alpha d) - \alpha^2 b d
&= (1 - \alpha + \alpha d - \alpha + \alpha^2 - \alpha^2 d + \alpha b - \alpha^2 b + \alpha^2 b d) - \alpha^2 b d \\
&= (1 - \alpha - \alpha + \alpha^2) + (\alpha d - \alpha^2 d) + (\alpha b - \alpha^2 b) \\
&= [(1 - \alpha) - \alpha (1 - \alpha)] + \alpha d (1 - \alpha) + \alpha b (1 - \alpha) \\
&= (1 - \alpha)(1 - \alpha + \alpha d + \alpha b) \ge 0.
\end{aligned}
$$

The denominator of the elements of (I − αP^D)^{−1} is also nonnegative because it is the product of two factors which are both greater than or equal to zero. (Since 0 < α < 1, both factors are in fact strictly positive, so the inverse exists.) Therefore, all entries in the matrix (I − αP^D)^{−1} are greater than or equal to zero. Since both elements of the vector γ are also greater than or equal to zero, the elements of the vector

$$
v^D - v^A = (I - \alpha P^D)^{-1} \gamma
$$

must be nonnegative. Therefore, the IM routine cannot decrease the expected total discounted rewards of either state. Howard proves that for a discounted MDP with N states, the IM routine must increase the expected total discounted rewards of at least one state.
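
The identity v^D − v^A = (I − αP^D)^{−1} γ can be checked numerically. The sketch below is an illustration only: the numerical values assigned to u, c, e, s, b, d, f, h, and α are hypothetical and are not taken from Table 5.20, and the function names discounted_inverse and apply are labels chosen here.

```python
ALPHA = 0.9  # discount factor (hypothetical value for illustration)

# Hypothetical parameters for the generic two-state process.
u, c, e, s = 0.5, 0.4, 6.0, -3.0   # policy A
b, d, f, h = 0.2, 0.7, 4.0, -5.0   # policy D


def discounted_inverse(off1, off2, alpha):
    """Closed-form (I - alpha P)^(-1) for P = [[1 - off1, off1], [off2, 1 - off2]]."""
    det = (1 - alpha) * (1 - alpha + alpha * off1 + alpha * off2)
    return [[(1 - alpha + alpha * off2) / det, alpha * off1 / det],
            [alpha * off2 / det, (1 - alpha + alpha * off1) / det]]


def apply(m, x):
    """Multiply a 2 x 2 matrix by a 2-vector."""
    return [m[0][0] * x[0] + m[0][1] * x[1], m[1][0] * x[0] + m[1][1] * x[1]]


inv_A = discounted_inverse(u, c, ALPHA)
inv_D = discounted_inverse(b, d, ALPHA)
v_A = apply(inv_A, [e, s])  # expected total discounted rewards under policy A
v_D = apply(inv_D, [f, h])  # expected total discounted rewards under policy D

# Improvements in the test quantity, evaluated with the rewards of policy A.
gamma = [
    f + ALPHA * (1 - b) * v_A[0] + ALPHA * b * v_A[1]
    - e - ALPHA * (1 - u) * v_A[0] - ALPHA * u * v_A[1],
    h + ALPHA * d * v_A[0] + ALPHA * (1 - d) * v_A[1]
    - s - ALPHA * c * v_A[0] - ALPHA * (1 - c) * v_A[1],
]

increase = apply(inv_D, gamma)  # should equal v_D - v_A
print([round(x, 4) for x in gamma])
print([round(x, 4) for x in increase], [round(v_D[i] - v_A[i], 4) for i in range(2)])
```

With these hypothetical numbers both improvements are nonnegative, every entry of the inverse is nonnegative, and the computed increase matches v^D − v^A, in line with the argument above.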

5.2.2.3 LP for a Discounted MDP


When a linear program is formulated for a discounted MDP, Markov chain
structure is not relevant.

5.2.2.3.1 Formulation of a LP for a Discounted MDP


Informally, an LP formulation for a discounted MDP can be obtained by modifying the LP formulation given in Equations (5.43) of Section 5.1.2.3.4.1 for an undiscounted, recurrent MDP in the following manner. The transition probability pij in the undiscounted MDP is replaced with αpij in the discounted process. Note that αpij is not a transition probability because the row sums of the discounted transition matrix, αP, equal α, which is less than one. With discounting, no constraint is redundant, so that no constraint (5.42) is discarded. In the absence of transition probabilities, steady-state probabilities are not meaningful, so that there is no normalizing constraint equation (5.40) for steady-state probabilities. Thus, the LP formulation for an N-state discounted process will have N independent constraints. The LP objective function is unchanged from the one for an undiscounted MDP because its coefficients do not contain pij.
When pij is replaced by αpij, the constraint equations (5.39),

$$
\sum_{k=1}^{K_j} y_j^k - \sum_{i=1}^{N} \sum_{k=1}^{K_i} y_i^k p_{ij}^k = 0, \quad \text{for } j = 1, 2, \ldots, N, \qquad (5.39)
$$

which do contain pij, are transformed into the inequality constraints

$$
\sum_{k=1}^{K_j} y_j^k - \sum_{i=1}^{N} \sum_{k=1}^{K_i} y_i^k (\alpha p_{ij}^k)
= \sum_{k=1}^{K_j} y_j^k - \alpha \sum_{i=1}^{N} \sum_{k=1}^{K_i} y_i^k p_{ij}^k > 0, \quad \text{for } j = 1, 2, \ldots, N, \qquad (5.73)
$$

because, for 0 < α < 1,

$$
\sum_{k=1}^{K_j} y_j^k > \alpha \sum_{i=1}^{N} \sum_{k=1}^{K_i} y_i^k p_{ij}^k. \qquad (5.74)
$$

The inequality constraints are expressed as equations by transposing all terms to the left-hand sides, and making the right-hand sides arbitrary positive constants, denoted by bj. The equality constraints are

$$
\sum_{k=1}^{K_j} y_j^k - \alpha \sum_{i=1}^{N} \sum_{k=1}^{K_i} y_i^k p_{ij}^k = b_j, \qquad (5.75)
$$

for j = 1, 2, ..., N, where bj > 0.


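The complete LP formulation for a discounted MDP is [1, 3]

$$
\begin{aligned}
\text{Maximize} \quad & \sum_{i=1}^{N} \sum_{k=1}^{K_i} q_i^k y_i^k \\
\text{subject to} \quad & \sum_{k=1}^{K_j} y_j^k - \alpha \sum_{i=1}^{N} \sum_{k=1}^{K_i} y_i^k p_{ij}^k = b_j, \quad \text{for } j = 1, 2, \ldots, N, \text{ where } b_j > 0, \\
& y_i^k \ge 0, \quad \text{for } i = 1, 2, \ldots, N, \text{ and } k = 1, 2, \ldots, K_i.
\end{aligned}
\qquad (5.76)
$$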

When LP software has found an optimal solution for the decision variables,
the conditional probabilities,

$$P(\text{decision} = k \mid \text{state} = i) = \frac{y_i^k}{\sum_{k=1}^{K_i} y_i^k}, \tag{5.30}$$

can be calculated to determine the optimal decision in every state, thereby


specifying an optimal policy. The expected total discounted reward,
vj, earned by starting in state j, is equal to the dual variable associated with
row j in the final LP tableau. Since the LP constraints are equations, the dual
variables, and hence the expected total discounted rewards, vj, are unre-
stricted in sign. Both an optimal policy and the expected total discounted


rewards are independent of the values assigned to the arbitrary positive


constants, bj.
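The recovery of a policy from the LP variables through Equation (5.30) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `policy_from_lp_solution` and the dictionary layout are assumptions made here, and the numerical values are the ones reported for the monthly sales example in Table 5.73.

```python
def policy_from_lp_solution(y, tol=1e-9):
    """Apply Equation (5.30) to LP variables.

    y maps each state i to the list [y_i^1, ..., y_i^{K_i}].
    Returns the conditional probabilities and the most likely decision
    in each state (decisions are numbered from 1, as in the text).
    """
    cond_prob = {}
    policy = {}
    for state, values in y.items():
        total = sum(values)
        if total <= tol:
            raise ValueError(f"state {state} carries no discounted weight")
        probs = [value / total for value in values]
        cond_prob[state] = probs
        policy[state] = 1 + max(range(len(probs)), key=probs.__getitem__)
    return cond_prob, policy


# LP variables reported in Table 5.73 (monthly sales example).
y_sales = {
    1: [0.0, 1.3559, 0.0],
    2: [0.0, 2.0515],
    3: [0.0, 3.3971],
    4: [0.0, 3.1954],
}
cond_prob, policy = policy_from_lp_solution(y_sales)
print(policy)
```

Because each b_j is positive, every state receives positive weight, so the division in Equation (5.30) is always defined; the guard in the sketch only protects against malformed input.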
Suppose that

$$\sum_{j=1}^{N} b_j = 1. \tag{5.77}$$

Since bj is the right-hand side constant of the jth constraint in the LP, bj is also
the objective function coefficient of the jth dual variable, vj. The optimal dual
objective function is equal to $\sum_{j=1}^{N} b_j v_j$, which is also the optimal value of the
primal LP objective function.

5.2.2.3.2 Optional Insight: Informal Derivation of the LP Formulation for a Discounted MDP

The following informal derivation of the LP formulation for a discounted
MDP is adapted from Puterman [3]. This informal derivation begins by
defining the nonnegative decision variable yik as the following infinite
series:

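$$\begin{aligned}
y_i^k &= \alpha^0 P(\text{at epoch 0, state} = i \text{ and decision} = k) + \alpha^1 P(\text{at epoch 1, state} = i \text{ and decision} = k) \\
&\quad + \alpha^2 P(\text{at epoch 2, state} = i \text{ and decision} = k) + \alpha^3 P(\text{at epoch 3, state} = i \text{ and decision} = k) + \cdots \\
&= \alpha^0 P(X_0 = i, d_i = k) + \alpha^1 P(X_1 = i, d_i = k) + \alpha^2 P(X_2 = i, d_i = k) + \alpha^3 P(X_3 = i, d_i = k) + \cdots \\
&= \sum_{n=0}^{\infty} \alpha^n P(X_n = i, d_i = k).
\end{aligned}$$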

If the positive constants denoted by bj are chosen such that $\sum_{j=1}^{N} b_j = 1$, then
bj can be interpreted as the probability that the system starts in state j. That
is,

$$b_j = P(X_0 = j). \tag{5.78}$$

If $\sum_{j=1}^{N} b_j = 1$, then the variable $y_i^k$ can be interpreted as the joint discounted
probability of being in state i and making decision k.
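The series interpretation can be checked numerically for a fixed policy. The sketch below is an illustration under stated assumptions, not part of the text: the two-state transition matrix `P`, the starting distribution `b`, and the function name `discounted_occupancy` are hypothetical. It truncates the series for a policy that has already fixed one decision per state, so that the total over decisions in state i reduces to the sum of α^n P(X_n = i).

```python
def discounted_occupancy(P, b, alpha, n_terms=2000):
    """Approximate sum_{n>=0} alpha^n P(X_n = i) by truncating the series."""
    n_states = len(b)
    dist = list(b)        # distribution of X_0, i.e. the constants b_j
    weight = 1.0          # current value of alpha^n
    y = [0.0] * n_states
    for _ in range(n_terms):
        for i in range(n_states):
            y[i] += weight * dist[i]
        # One-step update: P(X_{n+1} = j) = sum_i P(X_n = i) p_ij
        dist = [sum(dist[i] * P[i][j] for i in range(n_states))
                for j in range(n_states)]
        weight *= alpha
    return y


# Hypothetical two-state chain under a fixed policy (not from the text).
P = [[0.7, 0.3],
     [0.4, 0.6]]
b = [0.5, 0.5]
alpha = 0.9

y = discounted_occupancy(P, b, alpha)
# Residual of the LP constraint y_j - alpha * sum_i y_i p_ij - b_j for each j.
residuals = [y[j] - alpha * sum(y[i] * P[i][j] for i in range(2)) - b[j]
             for j in range(2)]
print(y, sum(y), residuals)
```

The truncated series should satisfy the equality constraints of the LP up to rounding, and its components should add up to approximately 1/(1 − α), which is 10 for α = 0.9.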


Objective Function
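The objective function for the LP formulation for a discounted MDP is

$$\begin{aligned}
\text{Maximize } \sum_{n=0}^{\infty} \alpha^n \sum_{i=1}^{N}\sum_{k=1}^{K_i} q_i^k\, P(X_n = i, d_i = k)
&= \sum_{i=1}^{N}\sum_{k=1}^{K_i} q_i^k \sum_{n=0}^{\infty} \alpha^n P(X_n = i, d_i = k) \\
&= \sum_{i=1}^{N}\sum_{k=1}^{K_i} q_i^k\, y_i^k,
\end{aligned}$$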

which is the same objective function as the one for the LP formulation for an
undiscounted, recurrent MDP.

Constraints
The constraints can be obtained by starting with the expression $\sum_{i=1}^{N}\sum_{k=1}^{K_i} \alpha p_{ij}^k\, y_i^k$.

$$\begin{aligned}
\sum_{i=1}^{N}\sum_{k=1}^{K_i} \alpha p_{ij}^k\, y_i^k
&= \sum_{i=1}^{N}\sum_{k=1}^{K_i} \alpha p_{ij}^k \sum_{n=0}^{\infty} \alpha^n P(X_n = i, d_i = k) \\
&= \sum_{n=0}^{\infty} \alpha^{n+1} \sum_{i=1}^{N}\sum_{k=1}^{K_i} p_{ij}^k\, P(X_n = i, d_i = k) \\
&= \sum_{k=1}^{K_j}\sum_{n=0}^{\infty} \alpha^{n+1} P(X_{n+1} = j, d_j = k) \\
&= \sum_{k=1}^{K_j} \left[\alpha^1 P(X_1 = j, d_j = k) + \alpha^2 P(X_2 = j, d_j = k) + \alpha^3 P(X_3 = j, d_j = k) + \cdots\right] \\
&= \sum_{k=1}^{K_j} \left\{\alpha^0 P(X_0 = j, d_j = k) + \left[\alpha^1 P(X_1 = j, d_j = k) + \alpha^2 P(X_2 = j, d_j = k) + \alpha^3 P(X_3 = j, d_j = k) + \cdots\right] - \alpha^0 P(X_0 = j, d_j = k)\right\} \\
&= \sum_{k=1}^{K_j} \left[\alpha^0 P(X_0 = j, d_j = k) + \alpha^1 P(X_1 = j, d_j = k) + \alpha^2 P(X_2 = j, d_j = k) + \alpha^3 P(X_3 = j, d_j = k) + \cdots\right] - \sum_{k=1}^{K_j} \alpha^0 P(X_0 = j, d_j = k) \\
&= \sum_{k=1}^{K_j}\sum_{n=0}^{\infty} \alpha^n P(X_n = j, d_j = k) - \sum_{k=1}^{K_j} \alpha^0 P(X_0 = j, d_j = k) \\
&= \sum_{k=1}^{K_j}\sum_{n=0}^{\infty} \alpha^n P(X_n = j, d_j = k) - P(X_0 = j) \\
&= \sum_{k=1}^{K_j} y_j^k - b_j.
\end{aligned}$$


Hence, the constraints are

$$\sum_{k=1}^{K_j} y_j^k - \alpha\sum_{i=1}^{N}\sum_{k=1}^{K_i} y_i^k\, p_{ij}^k = b_j,$$

for j = 1, 2, ..., N.
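A quick consistency check follows from adding the N constraints together. This is a supplementary observation rather than a step in the source derivation; it uses only the fact that each row of transition probabilities sums to one:

$$\sum_{j=1}^{N}\sum_{k=1}^{K_j} y_j^k - \alpha\sum_{i=1}^{N}\sum_{k=1}^{K_i} y_i^k \sum_{j=1}^{N} p_{ij}^k = \sum_{j=1}^{N} b_j
\quad\Longrightarrow\quad
(1 - \alpha)\sum_{i=1}^{N}\sum_{k=1}^{K_i} y_i^k = \sum_{j=1}^{N} b_j.$$

When the constants bj sum to one, every feasible solution therefore has total weight 1/(1 − α). With α = 0.9 this total is 10, which agrees, up to rounding, with the nonzero variables reported in Table 5.73: 1.3559 + 2.0515 + 3.3971 + 3.1954 = 9.9999.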

5.2.2.3.3 Formulation of a Discounted MDP Model of Monthly Sales as an LP


The discounted MDP model of monthly sales, specified in Table 5.3 of
Section 5.1.2.1, was solved over an infinite horizon by value iteration in Table
5.68 of Section 5.2.2.1.2, and by PI in Section 5.2.2.2.3. Using a discount factor
of α = 0.9, the discounted MDP model will be formulated in this section as
an LP, and solved in Section 5.2.2.3.4 to find an optimal policy. Table 5.3,
augmented by a right-hand column of LP variables, is repeated as Table 5.72.
In this example N = 4 states, K1 = 3 decisions in state 1, and K2 = K3 = K4 = 2
decisions in states 2, 3, and 4, respectively. The positive constants are
arbitrarily chosen to be b1 = 0.1, b2 = 0.2, b3 = 0.3, and b4 = 0.4. Note that
$\sum_{j=1}^{N} b_j = 1$.

TABLE 5.72
Data for Discounted MDP Model of Monthly Sales, α = 0.9

State  Decision                          Transition Probability                           Reward    LP
i      k                                 $p_{i1}^k$  $p_{i2}^k$  $p_{i3}^k$  $p_{i4}^k$   $q_i^k$   Variable
1      1  Sell noncore assets            0.15        0.40        0.35        0.10         −30       $y_1^1$
1      2  Take firm private              0.45        0.05        0.20        0.30         −25       $y_1^2$
1      3  Offer employee buyouts         0.60        0.30        0.10        0            −20       $y_1^3$
2      1  Reduce management salaries     0.25        0.30        0.35        0.10         5         $y_2^1$
2      2  Reduce employee benefits       0.30        0.40        0.25        0.05         10        $y_2^2$
3      1  Design more appealing products 0.05        0.65        0.25        0.05         −10       $y_3^1$
3      2  Invest in new technology       0.05        0.25        0.50        0.20         −5        $y_3^2$
4      1  Invest in new projects         0.05        0.20        0.40        0.35         35        $y_4^1$
4      2  Make strategic acquisitions    0           0.10        0.30        0.60         25        $y_4^2$


Objective Function
The objective function for the LP is

$$\begin{aligned}
\text{Maximize } \sum_{i=1}^{4}\sum_{k=1}^{K_i} q_i^k\, y_i^k
&= \sum_{k=1}^{3} q_1^k y_1^k + \sum_{k=1}^{2} q_2^k y_2^k + \sum_{k=1}^{2} q_3^k y_3^k + \sum_{k=1}^{2} q_4^k y_4^k \\
&= (q_1^1 y_1^1 + q_1^2 y_1^2 + q_1^3 y_1^3) + (q_2^1 y_2^1 + q_2^2 y_2^2) + (q_3^1 y_3^1 + q_3^2 y_3^2) + (q_4^1 y_4^1 + q_4^2 y_4^2) \\
&= (-30 y_1^1 - 25 y_1^2 - 20 y_1^3) + (5 y_2^1 + 10 y_2^2) + (-10 y_3^1 - 5 y_3^2) + (35 y_4^1 + 25 y_4^2).
\end{aligned}$$


Constraints
State 1 has K1 = 3 possible decisions. The constraint associated with a transition
to state j = 1 is

$$\sum_{k=1}^{3} y_1^k - \alpha\sum_{i=1}^{4}\sum_{k=1}^{K_i} y_i^k\, p_{i1}^k = b_1$$

$$(y_1^1 + y_1^2 + y_1^3) - \alpha\left(\sum_{k=1}^{3} y_1^k p_{11}^k + \sum_{k=1}^{2} y_2^k p_{21}^k + \sum_{k=1}^{2} y_3^k p_{31}^k + \sum_{k=1}^{2} y_4^k p_{41}^k\right) = b_1$$

$$(y_1^1 + y_1^2 + y_1^3) - \alpha(y_1^1 p_{11}^1 + y_1^2 p_{11}^2 + y_1^3 p_{11}^3) - \alpha(y_2^1 p_{21}^1 + y_2^2 p_{21}^2) - \alpha(y_3^1 p_{31}^1 + y_3^2 p_{31}^2) - \alpha(y_4^1 p_{41}^1 + y_4^2 p_{41}^2) = b_1$$

$$(y_1^1 + y_1^2 + y_1^3) - 0.9(0.15 y_1^1 + 0.45 y_1^2 + 0.60 y_1^3) - 0.9(0.25 y_2^1 + 0.30 y_2^2) - 0.9(0.05 y_3^1 + 0.05 y_3^2) - 0.9(0.05 y_4^1 + 0 y_4^2) = 0.1.$$

State 2 has K2 = 2 possible decisions. The constraint associated with a transition
to state j = 2 is

$$\sum_{k=1}^{2} y_2^k - \alpha\sum_{i=1}^{4}\sum_{k=1}^{K_i} y_i^k\, p_{i2}^k = b_2$$

$$(y_2^1 + y_2^2) - \alpha\left(\sum_{k=1}^{3} y_1^k p_{12}^k + \sum_{k=1}^{2} y_2^k p_{22}^k + \sum_{k=1}^{2} y_3^k p_{32}^k + \sum_{k=1}^{2} y_4^k p_{42}^k\right) = b_2$$

$$(y_2^1 + y_2^2) - \alpha(y_1^1 p_{12}^1 + y_1^2 p_{12}^2 + y_1^3 p_{12}^3) - \alpha(y_2^1 p_{22}^1 + y_2^2 p_{22}^2) - \alpha(y_3^1 p_{32}^1 + y_3^2 p_{32}^2) - \alpha(y_4^1 p_{42}^1 + y_4^2 p_{42}^2) = b_2$$

$$(y_2^1 + y_2^2) - 0.9(0.40 y_1^1 + 0.05 y_1^2 + 0.30 y_1^3) - 0.9(0.30 y_2^1 + 0.40 y_2^2) - 0.9(0.65 y_3^1 + 0.25 y_3^2) - 0.9(0.20 y_4^1 + 0.10 y_4^2) = 0.2.$$


State 3 has K3 = 2 possible decisions. The constraint associated with a transition
to state j = 3 is

$$\sum_{k=1}^{2} y_3^k - \alpha\sum_{i=1}^{4}\sum_{k=1}^{K_i} y_i^k\, p_{i3}^k = b_3$$

$$(y_3^1 + y_3^2) - \alpha\left(\sum_{k=1}^{3} y_1^k p_{13}^k + \sum_{k=1}^{2} y_2^k p_{23}^k + \sum_{k=1}^{2} y_3^k p_{33}^k + \sum_{k=1}^{2} y_4^k p_{43}^k\right) = b_3$$

$$(y_3^1 + y_3^2) - \alpha(y_1^1 p_{13}^1 + y_1^2 p_{13}^2 + y_1^3 p_{13}^3) - \alpha(y_2^1 p_{23}^1 + y_2^2 p_{23}^2) - \alpha(y_3^1 p_{33}^1 + y_3^2 p_{33}^2) - \alpha(y_4^1 p_{43}^1 + y_4^2 p_{43}^2) = b_3$$

$$(y_3^1 + y_3^2) - 0.9(0.35 y_1^1 + 0.20 y_1^2 + 0.10 y_1^3) - 0.9(0.35 y_2^1 + 0.25 y_2^2) - 0.9(0.25 y_3^1 + 0.50 y_3^2) - 0.9(0.40 y_4^1 + 0.30 y_4^2) = 0.3.$$

State 4 has K4 = 2 possible decisions. The constraint associated with a transition
to state j = 4 is

$$\sum_{k=1}^{2} y_4^k - \alpha\sum_{i=1}^{4}\sum_{k=1}^{K_i} y_i^k\, p_{i4}^k = b_4$$

$$(y_4^1 + y_4^2) - \alpha\left(\sum_{k=1}^{3} y_1^k p_{14}^k + \sum_{k=1}^{2} y_2^k p_{24}^k + \sum_{k=1}^{2} y_3^k p_{34}^k + \sum_{k=1}^{2} y_4^k p_{44}^k\right) = b_4$$

$$(y_4^1 + y_4^2) - \alpha(y_1^1 p_{14}^1 + y_1^2 p_{14}^2 + y_1^3 p_{14}^3) - \alpha(y_2^1 p_{24}^1 + y_2^2 p_{24}^2) - \alpha(y_3^1 p_{34}^1 + y_3^2 p_{34}^2) - \alpha(y_4^1 p_{44}^1 + y_4^2 p_{44}^2) = b_4$$

$$(y_4^1 + y_4^2) - 0.9(0.10 y_1^1 + 0.30 y_1^2 + 0 y_1^3) - 0.9(0.10 y_2^1 + 0.05 y_2^2) - 0.9(0.05 y_3^1 + 0.20 y_3^2) - 0.9(0.35 y_4^1 + 0.60 y_4^2) = 0.4.$$

The complete LP formulation for the discounted MDP model of monthly
sales is

Maximize $(-30 y_1^1 - 25 y_1^2 - 20 y_1^3) + (5 y_2^1 + 10 y_2^2) + (-10 y_3^1 - 5 y_3^2) + (35 y_4^1 + 25 y_4^2)$

subject to

(1) $(0.865 y_1^1 + 0.595 y_1^2 + 0.46 y_1^3) - (0.225 y_2^1 + 0.27 y_2^2) - (0.045 y_3^1 + 0.045 y_3^2) - (0.045 y_4^1 + 0 y_4^2) = 0.1$

(2) $-(0.36 y_1^1 + 0.045 y_1^2 + 0.27 y_1^3) + (0.73 y_2^1 + 0.64 y_2^2) - (0.585 y_3^1 + 0.225 y_3^2) - (0.18 y_4^1 + 0.09 y_4^2) = 0.2$

(3) $-(0.315 y_1^1 + 0.18 y_1^2 + 0.09 y_1^3) - (0.315 y_2^1 + 0.225 y_2^2) + (0.775 y_3^1 + 0.55 y_3^2) - (0.36 y_4^1 + 0.27 y_4^2) = 0.3$

(4) $-(0.09 y_1^1 + 0.27 y_1^2 + 0 y_1^3) - (0.09 y_2^1 + 0.045 y_2^2) - (0.045 y_3^1 + 0.18 y_3^2) + (0.685 y_4^1 + 0.46 y_4^2) = 0.4$

$y_1^1 \ge 0,\ y_1^2 \ge 0,\ y_1^3 \ge 0,\ y_2^1 \ge 0,\ y_2^2 \ge 0,\ y_3^1 \ge 0,\ y_3^2 \ge 0,\ y_4^1 \ge 0,\ y_4^2 \ge 0.$
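The constraint coefficients above can be generated mechanically from Table 5.72. The following Python sketch is illustrative only: the names `DATA` and `build_lp` are assumptions introduced here, and the coefficient of $y_i^k$ in constraint j is taken to be 1 − αp when i = j and −αp otherwise, with p the transition probability from state i to state j under decision k, as in Equation (5.76).

```python
ALPHA = 0.9

# (state, decision) -> (transition probabilities to states 1..4, reward),
# copied from Table 5.72.
DATA = {
    (1, 1): ([0.15, 0.40, 0.35, 0.10], -30),
    (1, 2): ([0.45, 0.05, 0.20, 0.30], -25),
    (1, 3): ([0.60, 0.30, 0.10, 0.00], -20),
    (2, 1): ([0.25, 0.30, 0.35, 0.10], 5),
    (2, 2): ([0.30, 0.40, 0.25, 0.05], 10),
    (3, 1): ([0.05, 0.65, 0.25, 0.05], -10),
    (3, 2): ([0.05, 0.25, 0.50, 0.20], -5),
    (4, 1): ([0.05, 0.20, 0.40, 0.35], 35),
    (4, 2): ([0.00, 0.10, 0.30, 0.60], 25),
}


def build_lp(data, alpha, n_states):
    """Return variable order, objective coefficients, and constraint rows."""
    columns = sorted(data)                       # order of the y variables
    objective = [data[col][1] for col in columns]
    rows = []
    for j in range(1, n_states + 1):
        row = []
        for (i, k) in columns:
            p_ij = data[(i, k)][0][j - 1]
            indicator = 1.0 if i == j else 0.0
            row.append(round(indicator - alpha * p_ij, 6))
        rows.append(row)
    return columns, objective, rows


columns, objective, rows = build_lp(DATA, ALPHA, 4)
for j, row in enumerate(rows, start=1):
    print(j, row)
```

Each printed row lists the coefficients of constraint j in the variable order given by `columns`, so the output can be compared term by term with constraints (1) through (4).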

5.2.2.3.4 Solution by LP of a Discounted MDP Model of Monthly Sales


The LP formulation of the discounted MDP model of monthly sales is solved
on a personal computer using LP software to find an optimal policy. The
output of the LP software is summarized in Table 5.73.

TABLE 5.73
LP Solution of Discounted MDP Model of Monthly Sales
Objective function value = 49.5172

i    k    $y_i^k$
1    1    0
1    2    1.3559
1    3    0
2    1    0
2    2    2.0515
3    1    0
3    2    3.3971
4    1    0
4    2    3.1954

Row             Dual Variable
Constraint 1    6.8040
Constraint 2    35.4613
Constraint 3    32.2190
Constraint 4    80.1970

The output of the LP software shows that all the $y_i^k$ equal zero, except for
$y_1^2 = 1.3559$, $y_2^2 = 2.0515$, $y_3^2 = 3.3971$, and $y_4^2 = 3.1954$. The optimal policy
is given by $^{16}d = [2\ 2\ 2\ 2]^T$. This optimal policy was previously found by
value iteration in Table 5.68 of Section 5.2.2.1.2, and by PI in Equation (5.71)
of Section 5.2.2.2.3. This is also the same optimal policy obtained for the
undiscounted, recurrent MDP model of monthly sales in Sections 5.1.2.3.2.2,
5.1.2.3.3.3, and 5.1.2.3.4.3. The conditional probabilities,

$$P(\text{decision} = k \mid \text{state} = i) = \frac{y_i^k}{\sum_{k=1}^{K_i} y_i^k},$$

are calculated below:

P(decision = 2|state = 1) = P(decision = 2|state = 2) = P(decision = 2|state = 3) = P(decision = 2|state = 4) = 1.

All the remaining conditional probabilities are zero.


The dual variables associated with constraints 1, 2, 3, and 4, respectively,
are the expected total discounted rewards, v1 = 6.8040,
v2 = 35.4613, v3 = 32.2190, and v4 = 80.1970. The expected
total discounted rewards obtained as dual variables by LP are identical to
those obtained by PI in Equation (5.72) of Section 5.2.2.2.3. Since $\sum_{j=1}^{N} b_j = 1$, the
optimal value of the LP objective function, 49.5172, equals $\sum_{j=1}^{N} b_j v_j$, as is stated
at the end of Section 5.2.2.3.1.

That is,

$$\sum_{j=1}^{4} b_j v_j = 49.5172 = 0.1(6.8040) + (0.2)(35.4613) + (0.3)(32.2190) + (0.4)(80.1970).$$
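The reported figures can be cross-checked without LP software once the policy is fixed. The sketch below is an independent check under stated assumptions, not the method used in the text. It assumes decision 2 in every state, solves (I − αP)v = q for the expected total discounted rewards and (I − αP)ᵀy = b for the nonzero LP variables, and uses a small hand-written Gaussian elimination routine named `solve` so that no external library is needed.

```python
def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [list(A[r]) + [rhs[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - known) / M[r][r]
    return x


alpha = 0.9
b = [0.1, 0.2, 0.3, 0.4]
# Transition rows and rewards under decision 2 in every state (Table 5.72).
P = [[0.45, 0.05, 0.20, 0.30],
     [0.30, 0.40, 0.25, 0.05],
     [0.05, 0.25, 0.50, 0.20],
     [0.00, 0.10, 0.30, 0.60]]
q = [-25, 10, -5, 25]
n = 4

I_minus_aP = [[(1.0 if i == j else 0.0) - alpha * P[i][j] for j in range(n)]
              for i in range(n)]
transpose = [[I_minus_aP[i][j] for i in range(n)] for j in range(n)]

v = solve(I_minus_aP, q)   # expected total discounted rewards
y = solve(transpose, b)    # LP variables of the chosen decisions
primal = sum(q[i] * y[i] for i in range(n))
dual = sum(b[j] * v[j] for j in range(n))
print([round(x, 4) for x in v])
print([round(x, 4) for x in y])
print(round(primal, 4), round(dual, 4))
```

The two objective values agree because both equal bᵀ(I − αP)⁻¹q, which is the equality of primal and dual objectives noted at the end of Section 5.2.2.3.1.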

5.2.2.4 Examples of Discounted MDP Models


Discounted MDP models over an infinite horizon will be constructed for an
inventory system and for a modified form of the secretary problem.

5.2.2.4.1 Inventory System


The inventory system modeled as a unichain MDP in Section 5.1.3.3.1 will
be treated in this section as a discounted MDP [1]. Using a discount fac-
tor of α = 0.9, the discounted MDP model will be formulated as an LP in
Section 5.2.2.4.1.1, and solved in Section 5.2.2.4.1.2 to find an optimal policy.
Data for the discounted MDP model of an inventory system are given in
Table 5.41 of Section 5.1.3.3.1.


5.2.2.4.1.1 Formulation of a Discounted MDP Model of an Inventory System as an LP

As Section 5.1.3.3.1 indicates, the LP formulations for both the unichain
and discounted models have N = 4 states, K0 = 4 decisions in state 0,
K1 = 3 decisions in state 1, K2 = 2 decisions in state 2, and K3 = 1 decision in
state 3. The positive constants for the right-hand sides of the constraints in
the discounted model are arbitrarily chosen to be b1 = 0.1, b2 = 0.2, b3 = 0.3,
and b4 = 0.4, for constraints 1 through 4 (states 0 through 3), respectively.
Note that $\sum_{j=1}^{N} b_j = 1$.

Objective Function
The objective function for the LP for the discounted model, unchanged from
the objective function for the unichain model, is

Maximize $(-48 y_0^0 + 35 y_0^1 - 18 y_0^2 - 110 y_0^3) + (175 y_1^0 + 102 y_1^1 + 10 y_1^2) + (242 y_2^0 + 130 y_2^1) + 270 y_3^0.$

The constraints are given below:

(1) $(y_0^0 + y_0^1 + y_0^2 + y_0^3) - 0.9(y_0^0 + 0.7 y_0^1 + 0.3 y_0^2 + 0.2 y_0^3) - 0.9(0.7 y_1^0 + 0.3 y_1^1 + 0.2 y_1^2) - 0.9(0.3 y_2^0 + 0.2 y_2^1) - 0.9(0.2 y_3^0) = 0.1$

(2) $(y_1^0 + y_1^1 + y_1^2) - 0.9(0.3 y_0^1 + 0.4 y_0^2 + 0.1 y_0^3) - 0.9(0.3 y_1^0 + 0.4 y_1^1 + 0.1 y_1^2) - 0.9(0.4 y_2^0 + 0.1 y_2^1) - 0.9(0.1) y_3^0 = 0.2$

(3) $(y_2^0 + y_2^1) - 0.9(0.3 y_0^2 + 0.4 y_0^3) - 0.9(0.3 y_1^1 + 0.4 y_1^2) - 0.9(0.3 y_2^0 + 0.4 y_2^1) - 0.9(0.4) y_3^0 = 0.3$

(4) $y_3^0 - 0.9(0.3 y_0^3 + 0.3 y_1^2 + 0.3 y_2^1 + 0.3 y_3^0) = 0.4$

$y_0^0 \ge 0,\ y_0^1 \ge 0,\ y_0^2 \ge 0,\ y_0^3 \ge 0,\ y_1^0 \ge 0,\ y_1^1 \ge 0,\ y_1^2 \ge 0,\ y_2^0 \ge 0,\ y_2^1 \ge 0,\ y_3^0 \ge 0.$

The complete LP appears below:

Maximize $(-48 y_0^0 + 35 y_0^1 - 18 y_0^2 - 110 y_0^3) + (175 y_1^0 + 102 y_1^1 + 10 y_1^2) + (242 y_2^0 + 130 y_2^1) + 270 y_3^0$

subject to

(1) $(0.1 y_0^0 + 0.37 y_0^1 + 0.73 y_0^2 + 0.82 y_0^3) - (0.63 y_1^0 + 0.27 y_1^1 + 0.18 y_1^2) - (0.27 y_2^0 + 0.18 y_2^1) - 0.18 y_3^0 = 0.1$

(2) $-(0.27 y_0^1 + 0.36 y_0^2 + 0.09 y_0^3) + (0.73 y_1^0 + 0.64 y_1^1 + 0.91 y_1^2) - (0.36 y_2^0 + 0.09 y_2^1) - 0.09 y_3^0 = 0.2$

(3) $-(0.27 y_0^2 + 0.36 y_0^3) - (0.27 y_1^1 + 0.36 y_1^2) + (0.73 y_2^0 + 0.64 y_2^1) - 0.36 y_3^0 = 0.3$

(4) $-(0.27 y_0^3 + 0.27 y_1^2 + 0.27 y_2^1) + 0.73 y_3^0 = 0.4$

$y_0^0 \ge 0,\ y_0^1 \ge 0,\ y_0^2 \ge 0,\ y_0^3 \ge 0,\ y_1^0 \ge 0,\ y_1^1 \ge 0,\ y_1^2 \ge 0,\ y_2^0 \ge 0,\ y_2^1 \ge 0,\ y_3^0 \ge 0.$

5.2.2.4.1.2 Solution by LP of Discounted MDP Model of an Inventory System

The LP formulation of the discounted MDP model of an inventory system is
solved on a personal computer using LP software to find an optimal policy.
The output of the LP software is summarized in Table 5.74.

TABLE 5.74
LP Solution of Discounted MDP Model of an Inventory System
Objective function value = 1218.2752

i    k    $y_i^k$
0    0    0
0    1    0
0    2    0
0    3    2.2220
1    0    0
1    1    0
1    2    2.0661
2    0    3.5780
2    1    0
3    0    2.1339

Row             Dual Variable
Constraint 1    964.7156
Constraint 2    1084.7156
Constraint 3    1223.2477
Constraint 4    1344.7156

Table 5.74 shows that the value of the objective function is 1218.2752. The
output also shows that all the $y_i^k$ equal zero, except for $y_0^3 = 2.2220$, $y_1^2 = 2.0661$,
$y_2^0 = 3.5780$, and $y_3^0 = 2.1339$. The optimal policy is given by $d = [3\ 2\ 0\ 0]^T$,
which is the same optimal policy obtained for the undiscounted, unichain
MDP model of an inventory system in Table 5.42 of Section 5.1.3.3.1.2. The
conditional probabilities,

$$P(\text{order} = k \mid \text{state} = i) = \frac{y_i^k}{\sum_{k} y_i^k},$$

where the sum in the denominator runs over the order quantities available in state i, are calculated below:

P(order = 3|state = 0) = P(order = 2|state = 1) = P(order = 0|state = 2) = P(order = 0|state = 3) = 1.

All the remaining conditional probabilities are zero. The retailer’s expected
total discounted profit is maximized by the same (2, 3) inventory policy that
maximized her expected average profit per period without discounting in
Section 5.1.3.3.1.2.
The dual variables associated with constraints 1, 2, 3, and 4, respectively,
are the expected total discounted rewards, v0 = 964.7156,
v1 = 1084.7156, v2 = 1223.2477, and v3 = 1344.7156. Since
$\sum_{j=1}^{N} b_j = 1$, the optimal value of the LP objective function, 1218.2752, equals
$\sum_{j=0}^{3} b_{j+1} v_j$, as is stated at the end of Section 5.2.2.3.1. That is,

$$\sum_{j=0}^{3} b_{j+1} v_j = 1218.2752 = 0.1(964.7156) + (0.2)(1084.7156) + (0.3)(1223.2477) + (0.4)(1344.7156).$$
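The same identity can be confirmed from both sides with a few lines of arithmetic. The sketch below is only a check on the reported figures: it takes the dual variables and the nonzero LP variables from Table 5.74, pairs each nonzero variable with the reward of the chosen order quantity in the objective function, and compares the two objective values. The variable names are illustrative.

```python
b = [0.1, 0.2, 0.3, 0.4]
v = [964.7156, 1084.7156, 1223.2477, 1344.7156]   # dual variables, Table 5.74

# Nonzero LP variables under d = [3 2 0 0]^T and the matching objective
# coefficients: y_0^3, y_1^2, y_2^0, y_3^0.
y = [2.2220, 2.0661, 3.5780, 2.1339]
q = [-110, 10, 242, 270]

dual_value = sum(bj * vj for bj, vj in zip(b, v))
primal_value = sum(qi * yi for qi, yi in zip(q, y))
print(round(dual_value, 4), round(primal_value, 2))
```

The small gap between the two printed values comes from the four-decimal rounding of the tabulated LP variables.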

5.2.2.4.2 Optimal Stopping Over an Infinite Planning Horizon: The Secretary Problem

In Section 5.1.3.3.3.1, the secretary problem was treated as an optimal stopping
problem over a finite planning horizon by formulating it as an undiscounted,
unichain MDP. Recall that the state space is augmented with an absorbing state
∆ that is reached with probability 1 when a candidate is hired and with prob-
ability 0 when an applicant is rejected. Value iteration was executed to find
an optimal policy. A modified form of an optimal stopping problem can be
analyzed over an infinite planning horizon by formulating it as a discounted
MDP. When rewards are discounted, the secretary problem can be analyzed
over an infinite horizon by making the following two modifications in the
undiscounted model analyzed over a finite horizon. (1) There is no limit to
the number of candidates who may be interviewed. (2) If the nth applicant is
rejected, the executive must pay a continuation cost to interview the next can-
didate. Recall that the state, denoted by Xn, is the numerical score assigned to


the nth candidate. The continuation cost is equivalent to a numerical score of
−2. Hence, if the nth applicant is hired, the daily reward is Xn, the applicant’s
numerical score. If the nth applicant is rejected, the daily reward is −2, the
continuation cost. A daily reward of zero is associated with the absorbing state
∆ because when state ∆ is reached, no more interviews will be conducted [1, 3, 4].

5.2.2.4.2.1 Formulation of the Secretary Problem as a Discounted MDP

The discounted MDP model of the secretary problem analyzed over an infinite
planning horizon is shown in Table 5.75. LP decision variables $y_i^k$ are shown in
the right-hand column.
There are four alternative policies, which correspond to hiring a candidate
rated poor (15), fair (20), good (25), or excellent (30).

5.2.2.4.2.2 Solution of the Secretary Problem by LP

The discounted MDP model of the secretary problem over an infinite planning
horizon is formulated as the following LP. In this example there are N = 5 states,
and K15 = K20 = K25 = K30 = K∆ = K = 2 decisions in every state. The LP decision
variables associated with the absorbing state ∆, and the constraint associated
with a transition to the absorbing state ∆, can be omitted from the LP formulation.
A daily discount factor of α = 0.9 is specified. The positive constants are
arbitrarily chosen to be b1 = b2 = b3 = b4 = 0.25. Note that they sum to one.

TABLE 5.75
Data for Discounted MDP Model of a Secretary Problem Over an Infinite Planning
Horizon

State  Decision  Transition Probability                                                  Reward    LP
i      k         $p_{i,15}^k$  $p_{i,20}^k$  $p_{i,25}^k$  $p_{i,30}^k$  $p_{i,\Delta}^k$   $q_i^k$   Variable
15     1 = H     0             0             0             0             1                  15        $y_{15}^1$
15     2 = R     0.3           0.4           0.2           0.1           0                  −2        $y_{15}^2$
20     1 = H     0             0             0             0             1                  20        $y_{20}^1$
20     2 = R     0.3           0.4           0.2           0.1           0                  −2        $y_{20}^2$
25     1 = H     0             0             0             0             1                  25        $y_{25}^1$
25     2 = R     0.3           0.4           0.2           0.1           0                  −2        $y_{25}^2$
30     1 = H     0             0             0             0             1                  30        $y_{30}^1$
30     2 = R     0.3           0.4           0.2           0.1           0                  −2        $y_{30}^2$
∆      1 = H     0             0             0             0             1                  0         $y_{\Delta}^1$
∆      2 = R     0             0             0             0             1                  0         $y_{\Delta}^2$


Objective Function
The objective function for the LP is

$$\begin{aligned}
\text{Maximize } \sum_{i \in \{15, 20, 25, 30\}}\sum_{k=1}^{2} q_i^k\, y_i^k
&= \sum_{k=1}^{2} q_{15}^k y_{15}^k + \sum_{k=1}^{2} q_{20}^k y_{20}^k + \sum_{k=1}^{2} q_{25}^k y_{25}^k + \sum_{k=1}^{2} q_{30}^k y_{30}^k \\
&= (q_{15}^1 y_{15}^1 + q_{15}^2 y_{15}^2) + (q_{20}^1 y_{20}^1 + q_{20}^2 y_{20}^2) + (q_{25}^1 y_{25}^1 + q_{25}^2 y_{25}^2) + (q_{30}^1 y_{30}^1 + q_{30}^2 y_{30}^2) \\
&= (15 y_{15}^1 - 2 y_{15}^2) + (20 y_{20}^1 - 2 y_{20}^2) + (25 y_{25}^1 - 2 y_{25}^2) + (30 y_{30}^1 - 2 y_{30}^2).
\end{aligned}$$

Constraints
State 15 has K15 = 2 possible decisions. The constraint associated with a transition
to state j = 15 is

$$\sum_{k=1}^{2} y_{15}^k - \alpha\sum_{i \in \{15, 20, 25, 30\}}\sum_{k=1}^{2} y_i^k\, p_{i,15}^k = b_1$$

$$(y_{15}^1 + y_{15}^2) - \alpha\left(\sum_{k=1}^{2} y_{15}^k p_{15,15}^k + \sum_{k=1}^{2} y_{20}^k p_{20,15}^k + \sum_{k=1}^{2} y_{25}^k p_{25,15}^k + \sum_{k=1}^{2} y_{30}^k p_{30,15}^k\right) = b_1$$

$$(y_{15}^1 + y_{15}^2) - \alpha(y_{15}^1 p_{15,15}^1 + y_{15}^2 p_{15,15}^2) - \alpha(y_{20}^1 p_{20,15}^1 + y_{20}^2 p_{20,15}^2) - \alpha(y_{25}^1 p_{25,15}^1 + y_{25}^2 p_{25,15}^2) - \alpha(y_{30}^1 p_{30,15}^1 + y_{30}^2 p_{30,15}^2) = b_1$$

$$(y_{15}^1 + y_{15}^2) - 0.9(0 y_{15}^1 + 0.3 y_{15}^2) - 0.9(0 y_{20}^1 + 0.3 y_{20}^2) - 0.9(0 y_{25}^1 + 0.3 y_{25}^2) - 0.9(0 y_{30}^1 + 0.3 y_{30}^2) = 0.25.$$

State 20 has K20 = 2 possible decisions. The constraint associated with a transition
to state j = 20 is

$$\sum_{k=1}^{2} y_{20}^k - \alpha\sum_{i \in \{15, 20, 25, 30\}}\sum_{k=1}^{2} y_i^k\, p_{i,20}^k = b_2$$

$$(y_{20}^1 + y_{20}^2) - \alpha\left(\sum_{k=1}^{2} y_{15}^k p_{15,20}^k + \sum_{k=1}^{2} y_{20}^k p_{20,20}^k + \sum_{k=1}^{2} y_{25}^k p_{25,20}^k + \sum_{k=1}^{2} y_{30}^k p_{30,20}^k\right) = b_2$$

$$(y_{20}^1 + y_{20}^2) - \alpha(y_{15}^1 p_{15,20}^1 + y_{15}^2 p_{15,20}^2) - \alpha(y_{20}^1 p_{20,20}^1 + y_{20}^2 p_{20,20}^2) - \alpha(y_{25}^1 p_{25,20}^1 + y_{25}^2 p_{25,20}^2) - \alpha(y_{30}^1 p_{30,20}^1 + y_{30}^2 p_{30,20}^2) = b_2$$

$$(y_{20}^1 + y_{20}^2) - 0.9(0 y_{15}^1 + 0.4 y_{15}^2) - 0.9(0 y_{20}^1 + 0.4 y_{20}^2) - 0.9(0 y_{25}^1 + 0.4 y_{25}^2) - 0.9(0 y_{30}^1 + 0.4 y_{30}^2) = 0.25.$$


State 25 has K25 = 2 possible decisions. The constraint associated with a transition
to state j = 25 is

$$\sum_{k=1}^{2} y_{25}^k - \alpha\sum_{i \in \{15, 20, 25, 30\}}\sum_{k=1}^{2} y_i^k\, p_{i,25}^k = b_3$$

$$(y_{25}^1 + y_{25}^2) - \alpha\left(\sum_{k=1}^{2} y_{15}^k p_{15,25}^k + \sum_{k=1}^{2} y_{20}^k p_{20,25}^k + \sum_{k=1}^{2} y_{25}^k p_{25,25}^k + \sum_{k=1}^{2} y_{30}^k p_{30,25}^k\right) = b_3$$

$$(y_{25}^1 + y_{25}^2) - \alpha(y_{15}^1 p_{15,25}^1 + y_{15}^2 p_{15,25}^2) - \alpha(y_{20}^1 p_{20,25}^1 + y_{20}^2 p_{20,25}^2) - \alpha(y_{25}^1 p_{25,25}^1 + y_{25}^2 p_{25,25}^2) - \alpha(y_{30}^1 p_{30,25}^1 + y_{30}^2 p_{30,25}^2) = b_3$$

$$(y_{25}^1 + y_{25}^2) - 0.9(0 y_{15}^1 + 0.2 y_{15}^2) - 0.9(0 y_{20}^1 + 0.2 y_{20}^2) - 0.9(0 y_{25}^1 + 0.2 y_{25}^2) - 0.9(0 y_{30}^1 + 0.2 y_{30}^2) = 0.25.$$

State 30 has K30 = 2 possible decisions. The constraint associated with a
transition to state j = 30 is

$$\sum_{k=1}^{2} y_{30}^k - \alpha\sum_{i \in \{15, 20, 25, 30\}}\sum_{k=1}^{2} y_i^k\, p_{i,30}^k = b_4$$

$$(y_{30}^1 + y_{30}^2) - \alpha\left(\sum_{k=1}^{2} y_{15}^k p_{15,30}^k + \sum_{k=1}^{2} y_{20}^k p_{20,30}^k + \sum_{k=1}^{2} y_{25}^k p_{25,30}^k + \sum_{k=1}^{2} y_{30}^k p_{30,30}^k\right) = b_4$$

$$(y_{30}^1 + y_{30}^2) - \alpha(y_{15}^1 p_{15,30}^1 + y_{15}^2 p_{15,30}^2) - \alpha(y_{20}^1 p_{20,30}^1 + y_{20}^2 p_{20,30}^2) - \alpha(y_{25}^1 p_{25,30}^1 + y_{25}^2 p_{25,30}^2) - \alpha(y_{30}^1 p_{30,30}^1 + y_{30}^2 p_{30,30}^2) = b_4$$

$$(y_{30}^1 + y_{30}^2) - 0.9(0 y_{15}^1 + 0.1 y_{15}^2) - 0.9(0 y_{20}^1 + 0.1 y_{20}^2) - 0.9(0 y_{25}^1 + 0.1 y_{25}^2) - 0.9(0 y_{30}^1 + 0.1 y_{30}^2) = 0.25.$$

The complete LP formulation for the discounted MDP model of the secretary
problem analyzed over an infinite planning horizon is

Maximize $(15 y_{15}^1 - 2 y_{15}^2) + (20 y_{20}^1 - 2 y_{20}^2) + (25 y_{25}^1 - 2 y_{25}^2) + (30 y_{30}^1 - 2 y_{30}^2)$

subject to

(1) $(y_{15}^1 + 0.73 y_{15}^2) - (0.27 y_{20}^2 + 0.27 y_{25}^2 + 0.27 y_{30}^2) = 0.25$

(2) $(y_{20}^1 + 0.64 y_{20}^2) - (0.36 y_{15}^2 + 0.36 y_{25}^2 + 0.36 y_{30}^2) = 0.25$

(3) $(y_{25}^1 + 0.82 y_{25}^2) - (0.18 y_{15}^2 + 0.18 y_{20}^2 + 0.18 y_{30}^2) = 0.25$

(4) $(y_{30}^1 + 0.91 y_{30}^2) - (0.09 y_{15}^2 + 0.09 y_{20}^2 + 0.09 y_{25}^2) = 0.25$

$y_{15}^1 \ge 0,\ y_{15}^2 \ge 0,\ y_{20}^1 \ge 0,\ y_{20}^2 \ge 0,\ y_{25}^1 \ge 0,\ y_{25}^2 \ge 0,\ y_{30}^1 \ge 0,\ y_{30}^2 \ge 0.$


The LP formulation of the discounted MDP model of the secretary problem
is solved on a personal computer using LP software to find an optimal policy.
The output of the LP software is summarized in Table 5.76.

TABLE 5.76
LP Solution of Discounted MDP Model of the Secretary Problem
Objective function value = 22.9966

i     k    $y_i^k$
15    1    0
15    2    0.3425
20    1    0.3733
20    2    0
25    1    0.3116
25    2    0
30    1    0.2808
30    2    0

Row             Dual Variable
Constraint 1    16.9863
Constraint 2    20
Constraint 3    25
Constraint 4    30

The output of the LP software shows that $y_{15}^2 = 0.3425$, $y_{20}^1 = 0.3733$,
$y_{25}^1 = 0.3116$, $y_{30}^1 = 0.2808$, and the remaining $y_i^k$ equal zero. The optimal
policy, given by $d = [2\ 1\ 1\ 1]^T$, is to reject a candidate rated 15, and hire a
candidate rated 20 or 25 or 30. The associated conditional probabilities,

$$P(\text{decision} = k \mid \text{state} = i) = \frac{y_i^k}{\sum_{k=1}^{K_i} y_i^k},$$

are calculated below:

P(decision = 2|state = 15) = P(decision = 1|state = 20) = P(decision = 1|state = 25) = P(decision = 1|state = 30) = 1.

All the remaining conditional probabilities are zero.

The dual variables associated with constraints 1, 2, 3, and 4, respectively,
are the expected total discounted rewards, v15 = 16.9863, v20 = 20, v25 = 25, and
v30 = 30. Since $\sum_{j=1}^{N} b_j = 1$, the optimal value of the LP objective
function, 22.9966, equals $\sum_{j=1}^{4} b_j v_j$, as is stated at the end of Section 5.2.2.3.1.
That is,

$$\sum_{j=1}^{4} b_j v_j = 22.9966 = 0.25(16.9863) + (0.25)(20) + (0.25)(25) + (0.25)(30).$$


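It is instructive to note that the MCR associated with the optimal policy given
by $d = [2\ 1\ 1\ 1]^T$, with the absorbing state ∆ restored, is

$$P = \begin{array}{c} 15 \\ 20 \\ 25 \\ 30 \\ \Delta \end{array}
\begin{bmatrix}
0.3 & 0.4 & 0.2 & 0.1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1
\end{bmatrix},
\qquad
q = \begin{array}{c} 15 \\ 20 \\ 25 \\ 30 \\ \Delta \end{array}
\begin{bmatrix} -2 \\ 20 \\ 25 \\ 30 \\ 0 \end{bmatrix}.$$

Also,

$$\alpha P = \begin{array}{c} 15 \\ 20 \\ 25 \\ 30 \\ \Delta \end{array}
\begin{bmatrix}
0.27 & 0.36 & 0.18 & 0.09 & 0 \\
0 & 0 & 0 & 0 & 0.9 \\
0 & 0 & 0 & 0 & 0.9 \\
0 & 0 & 0 & 0 & 0.9 \\
0 & 0 & 0 & 0 & 0.9
\end{bmatrix},
\qquad
(I - \alpha P) = \begin{array}{c} 15 \\ 20 \\ 25 \\ 30 \\ \Delta \end{array}
\begin{bmatrix}
0.73 & -0.36 & -0.18 & -0.09 & 0 \\
0 & 1 & 0 & 0 & -0.9 \\
0 & 0 & 1 & 0 & -0.9 \\
0 & 0 & 0 & 1 & -0.9 \\
0 & 0 & 0 & 0 & 0.10
\end{bmatrix}.$$

Recall that the matrix equation (4.253) in Section 4.3.3.1 is an alternate form
of the VDEs. Solving the matrix equation (4.253) for the vector of expected
total discounted rewards,

$$v = (I - \alpha P)^{-1} q =
\begin{array}{c} 15 \\ 20 \\ 25 \\ 30 \\ \Delta \end{array}
\begin{bmatrix}
1.369863 & 0.493151 & 0.246575 & 0.123288 & 7.767123 \\
0 & 1 & 0 & 0 & 9 \\
0 & 0 & 1 & 0 & 9 \\
0 & 0 & 0 & 1 & 9 \\
0 & 0 & 0 & 0 & 10
\end{bmatrix}
\begin{bmatrix} -2 \\ 20 \\ 25 \\ 30 \\ 0 \end{bmatrix}
=
\begin{array}{c} 15 \\ 20 \\ 25 \\ 30 \\ \Delta \end{array}
\begin{bmatrix} 16.9863 \\ 20 \\ 25 \\ 30 \\ 0 \end{bmatrix}. \tag{5.79}$$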
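The first component of v can also be obtained by hand, which may be a useful check on the matrix inverse. The lines below are a supplementary worked calculation, not taken from the source; they write out the equation for state 15 under the decision to reject, using v20 = 20, v25 = 25, and v30 = 30:

$$v_{15} = -2 + 0.9\,(0.3\, v_{15} + 0.4\,(20) + 0.2\,(25) + 0.1\,(30)) = -2 + 0.27\, v_{15} + 0.9\,(16),$$

$$0.73\, v_{15} = 12.4, \qquad v_{15} = \frac{12.4}{0.73} \approx 16.9863.$$

The same value follows from the first row of the inverse: 1.369863(−2) + 0.493151(20) + 0.246575(25) + 0.123288(30) + 7.767123(0) ≈ 16.9863.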


Substituting the components of vector v into the equations of the IM routine


will confirm that d = [2 1 1 1]T is an optimal policy. Table 5.76 shows, once
again, that under an optimal policy, the expected total discounted reward in
every state is equal to the corresponding dual variable in the optimal tableau
of the LP formulation.
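The substitution into the IM routine can be sketched as follows. This is an illustrative check under stated assumptions rather than the routine as given earlier in the book: it assumes the test quantity for a decision is the immediate reward plus α times the expected value of the next state, takes the values of v from Equation (5.79), and uses the value 0 for the absorbing state ∆. The variable names are hypothetical.

```python
alpha = 0.9
scores = [15, 20, 25, 30]
# Probabilities of the next candidate's score after a rejection (Table 5.75).
p_next = {15: 0.3, 20: 0.4, 25: 0.2, 30: 0.1}
# Expected total discounted rewards from Equation (5.79); v for Delta is 0.
v = {15: 16.9863, 20: 20.0, 25: 25.0, 30: 30.0}
continuation_cost = -2

# Test quantity for rejecting: identical in every state, because the next
# candidate's score does not depend on the current one.
reject_value = continuation_cost + alpha * sum(p_next[s] * v[s] for s in scores)

best = {}
for s in scores:
    hire_value = s + alpha * 0.0   # score now, then absorbing state Delta
    best[s] = 1 if hire_value > reject_value else 2   # 1 = hire, 2 = reject
print(round(reject_value, 4), best)
```

Rejecting is worth about 16.9863 in every state, so hiring is preferred exactly when the candidate's score exceeds that amount, which reproduces d = [2 1 1 1]^T.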

PROBLEMS
5.1 A woman wishes to sell a car within the next 4 weeks. She
expects to receive one bid or offer each week from a prospec-
tive buyer. The weekly offer is a random variable, which has the
following stationary probability distribution:

Offer per week    $16,000    $18,000    $22,000    $24,000
Probability       0.40       0.10       0.30       0.20

Once the woman accepts an offer, the bidding stops. Her objec-
tive is to maximize her expected total income over a 4-week
planning horizon.
(a) Formulate this optimal stopping problem as a unichain MDP.
(b) Use value iteration to find an optimal policy which will
specify when to accept or reject an offer during each week of
the planning horizon.
5.2 In Problem 3.10, suppose that the investor receives a monthly
dividend of $1 per share. No dividend is received when the
stock is sold. The investor uses a monthly discount factor of
α = 0.9. She wants to determine when to sell and when to hold
the stock. Her objective is to maximize her expected total dis-
counted reward.
(a) Treat this problem as an optimal stopping problem over an
infinite planning horizon. Formulate this optimal stopping
problem as a discounted, unichain MDP.
(b) Formulate the discounted, unichain MDP as a LP, and solve
it to find an optimal policy.
(c) Use PI to find an optimal policy.
5.3 A consumer electronics retailer can place orders for flat panel
TVs at the beginning of each day. All orders are delivered
immediately. Every time an order is placed for one or more TVs,
the retailer pays a fixed cost of $60. Every TV ordered costs the
retailer $100. The daily holding cost per unsold TV is $10. A
daily shortage cost of $180 is incurred for each TV that is not
available to satisfy demand. The retailer can accommodate a
maximum inventory of two TVs. The daily demand for TVs is
an independent, identically distributed random variable which
has the following stationary probability distribution:

Daily demand d 0 1 2
Probability, p( d) 0.5 0.4 0.1


(a) Formulate this model as an MDP.


(b) Determine an optimal ordering policy that will minimize
the expected total cost over the next 3 days.
(c) Use LP to determine an optimal ordering policy that will
minimize the expected average cost per day over an infinite
planning horizon.
(d) Use PI to determine an optimal ordering policy over an infi-
nite planning horizon.
(e) Use LP with a discount factor of α = 0.9 to find an optimal
ordering policy over an infinite planning horizon.
(f) Use PI with a discount factor of α = 0.9 to find an optimal
ordering policy and the associated expected total discounted
cost vector over an infinite planning horizon.
5.4 At the start of each day, the condition of a machine is classified
as either excellent, acceptable, or poor. The daily behavior of the
machine is modeled as a three-state absorbing unichain with
the following transition probability matrix:

              State             E      A      P
              Excellent (E)     0.1    0.7    0.2
P = [pij] =   Acceptable (A)    0      0.4    0.6
              Poor (P)          0      0      1

A machine in excellent condition earns revenue of $600 per day.


A machine in acceptable condition earns daily revenue of $300,
and a machine in poor condition earns daily revenue of $100. At
the start of each day, the engineer responsible for the machine
can make one of the following three maintenance decisions: do
nothing, repair the machine, or replace it with a new machine.
One day is needed to repair the machine at a cost of $500, or to
replace it at a cost of $1,000. A machine which is repaired starts
the following day in excellent condition with probability 0.6, in
acceptable condition with probability 0.3, or in poor condition
with probability 0.1. A machine that is replaced always starts
the following day in excellent condition.
(a) Formulate this model as an MDP.
(b) Determine an optimal maintenance policy that will maxi-
mize the expected total profit over the next 3 days.
(c) Use LP to determine an optimal maintenance policy that
will maximize the expected average profit per day over an
infinite planning horizon.
(d) Use PI to determine an optimal maintenance policy over an
infinite planning horizon.
(e) Use LP with a discount factor of α = 0.9 to find an optimal
maintenance policy over an infinite planning horizon.
(f) Use PI with a discount factor of α = 0.9 to find an optimal
maintenance policy and the associated expected total dis-
counted reward vector over an infinite planning horizon.


5.5 Suppose that the condition of a machine can be described by


one of the following three states:

State Condition
NW Not Working
WI Working Intermittently
WP Working Properly

The daily behavior of the machine when it is left alone for 1 day
is modeled as an absorbing unichain with the following transi-
tion probability matrix:

              State                          NW     WI     WP
              Not Working (NW)               1      0      0
P = [pij] =   Working Intermittently (WI)    0.8    0.2    0
              Working Properly (WP)          0.2    0.5    0.3

The machine is observed at the start of each day. Suppose ini-


tially that the machine is in state WP at the beginning of the day.
If, with probability 0.3, it works properly throughout the day, it
earns $1,000 in revenue. If, with probability 0.5, the machine dete-
riorates during the day, it earns $500 in revenue and enters state
WI, working intermittently. If, with probability 0.2, the machine
fails during the day, it earns zero revenue and enters state NW,
not working. Suppose next that the machine starts the day in
state WI. If, with probability 0.2, it works intermittently through-
out the day, it earns $500 in revenue. If, with probability 0.8, the
machine fails during the day, it earns zero revenue and enters
state NW. Daily revenue is summarized in the table below:

State   Condition                Daily Revenue
NW      Not Working              $0
WI      Working Intermittently   $500
WP      Working Properly         $1,000

At the beginning of the day, the engineer responsible for the
machine can make one of the following four maintenance decisions:
do nothing (DN), perform preventive maintenance (PM),
repair the machine (RP), or replace the machine (RL). The costs
of the maintenance actions are summarized in the table below:

Maintenance Action             Cost
Do Nothing (DN)                $0
Preventive Maintenance (PM)    $100
Repair (RP)                    $400
Replace (RL)                   $1,600


All maintenance actions are completed instantly. However,
neither preventive maintenance nor a repair is always successful.
The following maintenance actions are feasible in each state:

State                         Maintenance Action
Not Working (NW)              Repair (RP), Replace (RL)
Working Intermittently (WI)   Repair (RP), Preventive Maintenance (PM)
Working Properly (WP)         Preventive Maintenance (PM), Do Nothing (DN)

Replacement, and a repair which is completely successful,
always produce a machine that works properly throughout the
day. A partly successful repair when the machine is in state NW
enables a machine to work intermittently throughout the day.
An unsuccessful repair when the machine is in state NW leaves
the machine not working throughout the day. An unsuccess-
ful repair when the machine is in state WI leaves the machine
working intermittently throughout the day. Preventive main-
tenance which is completely successful enables a machine to
remain in its current state throughout the day.
When the machine begins the day in state NW, a repair is
completely successful with probability 0.5, partly successful
with probability 0.4, and unsuccessful with probability 0.1.
When the machine starts the day in state WI, a repair is success-
ful with probability 0.6, and unsuccessful with probability 0.4.
When the machine begins the day in state WP, preventive main-
tenance is completely successful with probability 0.7, partly suc-
cessful with probability 0.2, and unsuccessful with probability
0.1. When the machine starts the day in state WI, preventive
maintenance is successful with probability 0.8, and unsuccess-
ful with probability 0.2.
(a) Construct the vector of expected immediate rewards associ-
ated with each state and decision.
(b) Formulate this model as an MDP.
(c) Determine an optimal maintenance policy that will maxi-
mize the expected total profit over the next 3 days.
(d) Use LP to determine an optimal maintenance policy that
will maximize the expected average profit per day over an
infinite planning horizon.
(e) Use PI to determine an optimal maintenance policy over an
infinite planning horizon.
(f) Use LP with a discount factor of α = 0.9 to find an optimal
maintenance policy over an infinite planning horizon.
(g) Use PI with a discount factor of α = 0.9 to find an optimal
maintenance policy and the associated expected total dis-
counted reward vector over an infinite planning horizon.


5.6 In Problem 4.1, suppose that the engineer responsible for the
machine has the options of bringing a machine with a major
defect or a minor defect to the repair process. When the engi-
neer elects to bring a machine with a major defect or a minor
defect to the repair process, the machine makes a transition
with probability 1 to state 1 (NW). The daily costs to bring a
machine in state 2 (MD) and state 3 (mD) to the repair process
are $80 and $40, respectively. A repair takes one day to com-
plete. No revenue is earned on days during which a machine
is repaired. When the machine is in state 1 (NW), it is always
under repair. Hence, the only feasible action in state 1 is to
repair (RP) the machine. Since a machine in state 4 (WP) is
never brought to the repair process, the only feasible action in
state 4 (WP) is to do nothing (DN). The following four policies
are feasible:

State    Policy 1   Policy 2   Policy 3   Policy 4
1 (NW)   RP         RP         RP         RP
2 (MD)   DN         DN         RP         RP
3 (mD)   DN         RP         DN         RP
4 (WP)   DN         DN         DN         DN

The daily revenue earned in every state under policies 1 and
4 is calculated in the table below:

         Daily     Repair              Daily Reward              Daily Reward
State    Revenue   Cost     Policy 1   of Policy 1    Policy 4   of Policy 4
1 (NW)   $0        $0       Repair     $0             Repair     $0
2 (MD)   $200      $80      DN         $200           Repair     −$80
3 (mD)   $400      $40      DN         $400           Repair     −$40
4 (WP)   $600               DN         $600           DN         $600

(a) Model this machine repair problem as a recurrent MDP.


(b) Find an optimal repair policy and its expected total reward
over a three day planning horizon.
(c) Use LP to find an optimal repair policy and its expected aver-
age reward, or gain, over an infinite planning horizon.
(d) Use PI to find an optimal repair policy and its expected aver-
age reward, or gain, over an infinite planning horizon.
(e) Use LP with a discount factor of α = 0.9 to find an optimal
repair policy over an infinite planning horizon.
(f) Use PI with a discount factor of α = 0.9 to find an optimal
repair policy and the associated expected total discounted
reward vector over an infinite planning horizon.


5.7 Consider the following stochastic shortest route problem over a
three-period planning horizon for the network shown below:

[Figure: Network through which a stochastic shortest route is to be found.
Node 1 is at epoch 0, nodes 2 and 3 are at epoch 1, nodes 4 and 5 are at
epoch 2, and node 6 is at epoch 3. Each arc (i, j) is labeled with its
transition probability and distance, p_ij^k, r_ij^k.]

The network has six nodes numbered 1 through 6. One time
period is needed to traverse an arc between two adjacent nodes.
The origin, node 1, is entered at epoch 0. Nodes 2 and 3 can be
reached at epoch 1 in one step from node 1 by traversing arcs
(1,2) and (1, 3), respectively. Nodes 4 and 5 can be reached at
epoch 2 in one step from node 2 by traversing arcs (2, 4) and
(2, 5), respectively. Nodes 4 and 5 can also be reached at epoch 2
in one step from node 3 by traversing arcs (3, 4) and (3, 5), respec-
tively. The final destination, node 6, can be reached at epoch 3 in
one step from nodes 4 and 5 by traversing arcs (4, 6) and (5, 6),
respectively. The network has no cycles. That is, once a node has
been reached, it will not be visited again. A decision k
at a node i identifies the node k chosen to be reached at the next
epoch. The network is stochastic because a decision k made at a
node i can result in the process moving to another node instead
of the one chosen. Also, the distance, or arc length, between
adjacent nodes is a function of the decision made at the cur-
rent node. The objective is to find a path of minimum expected
length from node 1, the origin, to node 6, the final destination.


To model this network problem as an MDP, the following
variables are defined.

$d_i(n) = k$ is the decision k made at node i at epoch n.

$p_{ij}^{k} = P(X_{n+1} = j \mid X_n = i \cap d_i(n) = k)$ = the probability of moving
from node i to node j when decision k is made at node i at epoch n.

$r_{ij}^{k}$ = the actual distance from node i to node j when decision
k is made at node i at epoch n and node j is reached at epoch n + 1.

\[
q_i^{k} = \sum_{j=i+1}^{6} p_{ij}^{k} r_{ij}^{k}, \quad i = 1, 2, \ldots, 5,
\]

is the expected distance from node i to node j when decision k
is made at node i at epoch n.
The network is Markovian because the node j reached at
epoch n+1 depends only on the present node i and the decision
k made at epoch n. The transition probabilities and the actual
distances corresponding to the decisions made at every node
are shown in the table below.
Transition probabilities and actual distances for stochastic
shortest route problem

State  Decision  Transition Probability                          Actual Distance
i      k         p_i1^k  p_i2^k  p_i3^k  p_i4^k  p_i5^k  p_i6^k  r_i1^k  r_i2^k  r_i3^k  r_i4^k  r_i5^k  r_i6^k  q_i^k
1      2         0       0.7     0.3     0       0       0       0       23      25      0       0       0
1      3         0       0.2     0.8     0       0       0       0       26      22      0       0       0
2      4         0       0       0       0.4     0.6     0       0       0       0       37      33      0
2      5         0       0       0       0.65    0.35    0       0       0       0       35      36      0
3      4         0       0       0       0.75    0.25    0       0       0       0       31      34      0
3      5         0       0       0       0.15    0.85    0       0       0       0       28      32      0
4      6         0       0       0       0       0       1       0       0       0       0       0       46
5      6         0       0       0       0       0       1       0       0       0       0       0       42

(a) Calculate the expected distances q_i^k from node i to node j
when decision k is made at every node i at epoch n.
(b) Model this stochastic shortest route problem as a unichain
MDP.
(c) Use value iteration to find a path of minimum expected
length from node 1, the origin, to node 6, the final destina-
tion, over a 3 day planning horizon.
5.8 At the beginning of every month, a woman executes diagnostic
software on her personal computer to determine whether it is
virus-free or is infected with a malignant or benign virus. She
models the condition of her computer as a Markov chain with
the following three states:

State   Condition
F       Virus-free
B       Benign virus
M       Malignant virus

If her computer is virus-free at the start of the current month,
the probability that it will be virus-free at the start of the next
month is 0.45, the probability that it will be infected with a
benign virus is 0.40, and the probability that it will be infected
with a malignant virus is 0.15. If her computer is infected with
a virus, she can hire either virus-removal consultant A or virus-
removal consultant B. Both consultants require a full month
in which to attempt to remove a virus, but are not always suc-
cessful. Consultant A charges higher monthly fees than con-
sultant B because she has higher probabilities of success. To
remove a benign virus, consultant A charges $160 and consul-
tant B charges $110. To remove a malignant virus, consultant A
charges $280 and consultant B charges $200. Consultant A has a
0.64 probability of removing a malignant virus and a 0.28 prob-
ability of reducing it to a benign virus. She has a 0.92 probability
of removing a benign virus and a 0.08 probability of leaving it
unchanged. Consultant B has a 0.45 probability of removing a
malignant virus and a 0.38 probability of reducing it to a benign
virus. She has a 0.62 probability of removing a benign virus, and
a 0.04 probability of transforming it into a malignant virus.
(a) Model this virus-removal problem as a recurrent MDP.
(b) Find an optimal virus-removal policy and its expected total
cost over a 3 month planning horizon.
(c) Use LP to find an optimal virus-removal policy and its
expected average cost over an infinite planning horizon.
(d) Use PI to find an optimal virus-removal policy and its
expected average cost over an infinite planning horizon.
(e) Use LP with a discount factor of α = 0.9 to find an optimal
virus-removal policy over an infinite planning horizon.
(f) Use PI with a discount factor of α = 0.9 to find an optimal
virus-removal policy and the associated expected total dis-
counted cost vector over an infinite planning horizon.
5.9 In Problem 4.8, suppose that the manager of the dam is con-
sidering the following two policies for releasing water at the
beginning of every week: she can release at most 2 units of
water, or she can release at most 3 units. The implementation of
the decision to release at most 2 units of water was described in
Problem 4.8.
Suppose that the decision is made to release at most 3 units
of water at the beginning of every week. If the volume of water
stored in the dam plus the volume flowing into the dam at the
beginning of the week exceeds 3 units, then 3 units of water
are released. The first unit of water released is used to gener-
ate electricity, which is sold for $5. The second and third units
released are used for irrigation which earns $4 per unit. If the
volume of water stored in the dam plus the volume flowing
into the dam at the beginning of the week equals 3 units, then


2 units of water are released. The first unit of water released is
used to generate electricity, which is sold for $5. The second unit
released is used for irrigation which earns $4. If the volume of
water stored in the dam plus the volume flowing into the dam
at the beginning of the week equals 2 units, then only 1 unit of
water is released which is used to generate electricity that is
sold for $5. If the volume of water stored in the dam plus the
volume flowing into the dam at the beginning of the week is
less than 2 units, no water is released.
(a) Model the dam problem as a four-state unichain MDP. For
each state and decision, calculate the associated transition
probabilities and the reward.
(b) Use LP to find a release policy which maximizes the expected
average reward, or gain, over an infinite planning horizon.
(c) Use PI to find a release policy which maximizes the expected
average reward, or gain, over an infinite planning horizon.
(d) Use LP with a discount factor of α = 0.9 to find a release pol-
icy which maximizes the vector of expected total discounted
rewards.
(e) Use PI with a discount factor of α = 0.9 to find an optimal
release policy and the associated expected total discounted
reward vector over an infinite planning horizon.
5.10 Suppose that in a certain state, every registered professional
engineer (P.E.) is required to complete 15 h of continuing pro-
fessional development (CPD) every year. CPD hours may
be earned by successfully completing continuing education
courses offered by a professional or trade organization, or
offered in-house by a corporation. All registrants need to keep
a yearly log showing the type of courses attended. Each year
the State Board of Registration for Professional Engineers will
audit randomly selected registrants. If selected for an audit,
a registrant must submit a CPD activity log and supporting
documentation.
A consulting engineer plans to construct a two-state Markov
chain model of the possibility that the Board will audit a reg-
istrant’s record of continuing engineering education. The state
variable is
Xn = 1 if a registrant was audited at the end of year n,
Xn = 0 if a registrant was not audited at the end of year n.
Historical data indicates that an engineer who was not audited
last year has a 0.3 probability of being audited this year. An
engineer who was audited last year has a 0.2 probability of
being audited this year.
The consulting engineer is concerned that a portion of her
CPD education consists of in-house courses offered by a corpo-
ration which provides engineering products and services that
she uses and recommends to her clients. Critics have said that
such courses are often little more than company marketing in
the guise of education. If she is audited, she must demonstrate
that the course provider did not bias the course material in favor


of its own products and services. To prepare her CPD log and
assemble supporting documentation for a possible audit at the
end of the current year, she is considering hiring a prominent
engineering educator as an audit consultant. The audit consul-
tant will charge her a fee of $2,000. Experience indicates that
the Board often imposes a fine for deficiencies in CPD courses
taken in prior years when a registrant is audited. Past records
show that when a CPD log is prepared by an audit consultant,
the average fine is $600 after a log is audited. However, when a
CPD log is prepared by the registrant herself, the average fine
is $8,000. The consulting engineer must decide whether or not
to hire an audit consultant to prepare her CPD log for a possible
audit at the end of the current year. Hiring an audit consultant
appears to reduce by 0.05 the probability that a registrant will
be audited in the following year.
(a) Formulate the consulting engineer’s decision alternatives as
a two-state recurrent MDP. For each state and decision, spec-
ify the associated transition probabilities and calculate the
expected immediate cost.
(b) Execute value iteration to find a policy which minimizes the
vector of expected total costs that will be incurred in both
states after 3 years. Assume zero terminal costs at the end of
year 3.
(c) Use exhaustive enumeration to find a policy that minimizes
the expected average cost, or negative gain, over an infinite
planning horizon.
(d) Use PI to find an optimal policy over an infinite horizon.

References
1. Hillier, F. S. and Lieberman, G. J., Introduction to Operations Research, 8th ed.,
McGraw-Hill, New York, 2005.
2. Howard, R. A., Dynamic Programming and Markov Processes, M.I.T. Press,
Cambridge, MA, 1960.
3. Puterman, M. L., Markov Decision Processes: Discrete Stochastic Dynamic
Programming, Wiley, New York, 1994.
4. Wagner, H. M., Principles of Operations Research, 2nd ed., Prentice-Hall,
Englewood Cliffs, NJ, 1975.



6
Special Topics: State Reduction
and Hidden Markov Chains

This chapter introduces two interesting special topics: state reduction in
Section 6.1, and hidden Markov chains in Section 6.2. State reduction is an
iterative procedure which reduces a Markov chain to a smaller chain from
which a solution to the original chain can be found. Roundoff error can be
reduced when subtractions are avoided. A hidden Markov chain is one for
which the states cannot be observed. When a hidden Markov chain enters
a state, only an observation symbol, not the state itself, can be detected.
Hidden Markov models have been constructed for various applications such
as speech recognition and bioinformatics.

6.1 State Reduction


State reduction is an alternative procedure for finding various quantities for
a Markov chain, such as steady-state probabilities, mean first passage times
(MFPTs), and absorption probabilities. State reduction has two steps: matrix
reduction and back substitution. Each iteration of matrix reduction pro-
duces a reduced matrix one state smaller than its predecessor, resulting in a
final reduced matrix from which the solution to the original problem can be
obtained by back substitution.
State reduction is based on a theorem in Kemeny and Snell [2], and is
equivalent to Gaussian elimination. Kemeny and Snell showed that a par-
titioned Markov chain can be reduced in size by performing matrix opera-
tions to produce a smaller Markov chain. They showed that if the steady-state
probabilities for the original Markov chain are known, then the steady-state
probabilities for the smaller chain are proportional to those for the corre-
sponding states of the original chain. Three state reduction algorithms will
be described. The first two compute steady-state probabilities and MFPTs,
respectively, for a regular Markov chain. The third computes absorption
probabilities for a reducible chain.


6.1.1 Markov Chain Partitioning Algorithm for
Computing Steady-State Probabilities
The first state reduction algorithm, called the Markov chain partitioning
algorithm [4, 1], or MCPA, computes the vector of steady-state probabilities
for a regular Markov chain. The MCPA contains two steps: matrix reduction
and back substitution. The matrix reduction step successively partitions the
transition probability matrix. Each partition of a transition matrix produces
four submatrices: a square matrix, a column vector, a single element in the
upper left hand corner, and a row vector. When combined in accordance
with the theorem in Kemeny and Snell, the four submatrices create a reduced
transition matrix that also represents a regular Markov chain. The number
of states in the reduced matrix is one less than the number in the original
matrix. The reduced matrix equals the sum of the square matrix and the
product of the column vector, the reciprocal of one minus the single elem-
ent, and the row vector. To improve numerical accuracy, subtractions may
be eliminated by replacing one minus the single element by the sum of the
remaining elements in the top row. Each reduced matrix is partitioned in
the same manner as the original matrix, creating a sequence of successively
smaller reduced matrices. Each reduced matrix overwrites its predecessor,
beginning with the diagonal element in the upper left hand corner. Matrix
reduction ends when the last reduced matrix has two states. Recall that the
steady-state probability vector for a generic two-state regular Markov chain
is known, as it was computed in Equation (2.16).
Matrix reduction is followed by back substitution, which begins by using
the ratio of the steady-state probabilities for the final two-state reduced matrix
to express the steady-state probability for the next to last state as a constant
times the steady-state probability for the last state. Then the first equation
for steady-state probabilities from the next larger reduced matrix is used to
express the steady-state probability for the third from last state as a function
of the steady-state probabilities for the last two states. The steady-state prob-
ability for the next to last state is substituted in this equation, expressed in
terms of the steady-state probability for the last state. This recursive proced-
ure is repeated, using the first equation for steady-state probabilities from
each reduced matrix next larger in size, until the steady-state probabilities
for all states have been expressed as constants times the steady-state prob-
ability for the last state. Then the normalizing equation (2.6) for the original
Markov chain is solved to find the steady-state probability for the last state.
All other steady-state probabilities are obtained by multiplying the constants
previously found by the steady-state probability for the last state.

6.1.1.1 Matrix Reduction of a Partitioned Markov Chain


The partitioning algorithm for recursively computing steady-state probabil-
ities is motivated by the following theorem, in modified form, of Kemeny and


Snell (pp. 114–116). Consider a regular Markov chain with N states, indexed
1, 2, . . . , N. For simplicity, assume N = 4. The transition probability matrix is
denoted by P. The states are divided into two subsets. The first subset con-
tains state 1. The second subset contains the remaining states, and is denoted
by S = {2, 3, 4}. The transition matrix P is partitioned into four submatrices
called p11, u, v, and T, as shown below:

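\[
P = [p_{ij}] =
\begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \end{array}
\begin{bmatrix}
p_{11} & p_{12} & p_{13} & p_{14} \\
p_{21} & p_{22} & p_{23} & p_{24} \\
p_{31} & p_{32} & p_{33} & p_{34} \\
p_{41} & p_{42} & p_{43} & p_{44}
\end{bmatrix}
=
\begin{array}{c} 1 \\ S \end{array}
\begin{bmatrix}
p_{11} & u \\
v & T
\end{bmatrix},
\tag{6.1}
\]

where $p_{11}$ is the single element in the upper left-hand corner of P,

$u = [\, p_{12} \;\; p_{13} \;\; p_{14} \,]$ is a 1 × 3 row vector,

$v = [\, p_{21} \;\; p_{31} \;\; p_{41} \,]^{\top}$ is a 3 × 1 column vector, and

\[
T = [p_{ij}] =
\begin{array}{c} 2 \\ 3 \\ 4 \end{array}
\begin{bmatrix}
p_{22} & p_{23} & p_{24} \\
p_{32} & p_{33} & p_{34} \\
p_{42} & p_{43} & p_{44}
\end{bmatrix}
\]

is a 3 × 3 square matrix.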

A reduced matrix, $\bar{P}$, which represents a regular three-state Markov chain,
is computed according to the formula

\[
\bar{P} = T + v (1 - p_{11})^{-1} u. \tag{6.2}
\]

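Since roundoff error can be reduced by eliminating subtractions [1], the factor
$(1 - p_{11})^{-1}$ is replaced by $(p_{12} + p_{13} + p_{14})^{-1}$, which is the same quantity
because the entries in the first row of P sum to one. The modified formula,
called the matrix reduction formula, is

\[
\bar{P} = T + v (p_{12} + p_{13} + p_{14})^{-1} u. \tag{6.3}
\]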


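The reduced transition matrix, denoted by $\bar{P}$, is calculated below using the
matrix reduction formula (6.3). The rows and columns of each 3 × 3 matrix
correspond to states 2, 3, and 4.

\[
\begin{aligned}
\bar{P} &=
\begin{bmatrix}
p_{22} & p_{23} & p_{24} \\
p_{32} & p_{33} & p_{34} \\
p_{42} & p_{43} & p_{44}
\end{bmatrix}
+
\begin{bmatrix} p_{21} \\ p_{31} \\ p_{41} \end{bmatrix}
[\, p_{12} + p_{13} + p_{14} \,]^{-1}
\begin{bmatrix} p_{12} & p_{13} & p_{14} \end{bmatrix}
\\[4pt]
&=
\begin{bmatrix}
p_{22} & p_{23} & p_{24} \\
p_{32} & p_{33} & p_{34} \\
p_{42} & p_{43} & p_{44}
\end{bmatrix}
+
[\, p_{12} + p_{13} + p_{14} \,]^{-1}
\begin{bmatrix}
p_{21} p_{12} & p_{21} p_{13} & p_{21} p_{14} \\
p_{31} p_{12} & p_{31} p_{13} & p_{31} p_{14} \\
p_{41} p_{12} & p_{41} p_{13} & p_{41} p_{14}
\end{bmatrix}
\\[4pt]
&=
\begin{bmatrix}
p_{22} + \dfrac{p_{21} p_{12}}{p_{12} + p_{13} + p_{14}} &
p_{23} + \dfrac{p_{21} p_{13}}{p_{12} + p_{13} + p_{14}} &
p_{24} + \dfrac{p_{21} p_{14}}{p_{12} + p_{13} + p_{14}} \\[8pt]
p_{32} + \dfrac{p_{31} p_{12}}{p_{12} + p_{13} + p_{14}} &
p_{33} + \dfrac{p_{31} p_{13}}{p_{12} + p_{13} + p_{14}} &
p_{34} + \dfrac{p_{31} p_{14}}{p_{12} + p_{13} + p_{14}} \\[8pt]
p_{42} + \dfrac{p_{41} p_{12}}{p_{12} + p_{13} + p_{14}} &
p_{43} + \dfrac{p_{41} p_{13}}{p_{12} + p_{13} + p_{14}} &
p_{44} + \dfrac{p_{41} p_{14}}{p_{12} + p_{13} + p_{14}}
\end{bmatrix}
\\[4pt]
&=
\begin{bmatrix}
\bar{p}_{22} & \bar{p}_{23} & \bar{p}_{24} \\
\bar{p}_{32} & \bar{p}_{33} & \bar{p}_{34} \\
\bar{p}_{42} & \bar{p}_{43} & \bar{p}_{44}
\end{bmatrix}.
\end{aligned}
\tag{6.4}
\]

The foregoing process of applying the matrix reduction formula (6.3) to the
original matrix, P, to form the reduced matrix, $\bar{P}$, is called matrix reduction.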
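
The matrix reduction formula (6.3) can be sketched in a few lines of Python. This is only an illustration: the function name `reduce_matrix` and the four-state matrix `P` are hypothetical and do not come from the text. The sketch partitions a transition matrix into u, v, and T, and divides by the sum of the remaining entries of the first row, so no subtraction is performed.

```python
def reduce_matrix(P):
    """Apply the matrix reduction formula (6.3) once, eliminating the first state."""
    u = P[0][1:]                      # row vector u: first row without p11
    v = [row[0] for row in P[1:]]     # column vector v: first column without p11
    T = [row[1:] for row in P[1:]]    # square matrix T: transitions within S
    denom = sum(u)                    # p12 + ... + p1N, used in place of 1 - p11
    return [[T[i][j] + v[i] * u[j] / denom for j in range(len(u))]
            for i in range(len(v))]

# Hypothetical regular four-state chain (illustrative values only).
P = [
    [0.2, 0.3, 0.4, 0.1],
    [0.1, 0.5, 0.2, 0.2],
    [0.3, 0.3, 0.1, 0.3],
    [0.4, 0.1, 0.2, 0.3],
]

P_bar = reduce_matrix(P)
for row in P_bar:
    print(row, sum(row))              # each row of the reduced matrix sums to 1
```

Because every row of P sums to one, every row of the reduced matrix does as well, which is a convenient sanity check when carrying out the reduction by hand.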

6.1.1.2 Optional Insight: Informal Justification of
the Formula for Matrix Reduction
An informal justification of the formula for matrix reduction is instructive.
Assume that the original four-state Markov chain is observed only
when it is in the subset S = {2, 3, 4} of the last three states. A new three-state
Markov chain, called the reduced chain, is produced, with a transition
probability matrix denoted by $\bar{P}$. A single step in the reduced chain corresponds,
in the original chain, to the transition, not necessarily in one step,
from a state in S to another state in S. Consider two states of S, say states
2 and 4. The transition probability, $\bar{p}_{24}$, for the reduced matrix is equal to
the probability that the original chain, starting in state 2, enters S for the
first time at state 4. This is the probability that the original chain moves
from state 2 to state 4 in one step, plus the probability that it moves from
state 2 to state 1 in one step, and enters S from state 1 for the first time at
state 4. Therefore,


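\[
\begin{aligned}
\bar{p}_{24} &= p_{24} + p_{21} p_{14} + p_{21} p_{11} p_{14} + p_{21} p_{11}^{2} p_{14} + p_{21} p_{11}^{3} p_{14} + \cdots \\
&= p_{24} + p_{21} (1 + p_{11} + p_{11}^{2} + p_{11}^{3} + \cdots) p_{14} \\
&= p_{24} + p_{21} (1 - p_{11})^{-1} p_{14} \\
&= p_{24} + p_{21} (p_{12} + p_{13} + p_{14})^{-1} p_{14}.
\end{aligned}
\tag{6.5}
\]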

By repeating this argument for the remaining pairs of states in S, the following
eight additional transition probabilities for the reduced matrix, $\bar{P}$, shown
in equation (6.4), can be obtained.

\begin{align}
\bar{p}_{22} &= p_{22} + p_{21} (p_{12} + p_{13} + p_{14})^{-1} p_{12} \tag{6.6} \\
\bar{p}_{23} &= p_{23} + p_{21} (p_{12} + p_{13} + p_{14})^{-1} p_{13} \tag{6.7} \\
\bar{p}_{32} &= p_{32} + p_{31} (p_{12} + p_{13} + p_{14})^{-1} p_{12} \tag{6.8} \\
\bar{p}_{33} &= p_{33} + p_{31} (p_{12} + p_{13} + p_{14})^{-1} p_{13} \tag{6.9} \\
\bar{p}_{34} &= p_{34} + p_{31} (p_{12} + p_{13} + p_{14})^{-1} p_{14} \tag{6.10} \\
\bar{p}_{42} &= p_{42} + p_{41} (p_{12} + p_{13} + p_{14})^{-1} p_{12} \tag{6.11} \\
\bar{p}_{43} &= p_{43} + p_{41} (p_{12} + p_{13} + p_{14})^{-1} p_{13} \tag{6.12} \\
\bar{p}_{44} &= p_{44} + p_{41} (p_{12} + p_{13} + p_{14})^{-1} p_{14}. \tag{6.13}
\end{align}

The matrix form of the nine formulas (6.5) through (6.13) is the matrix reduction
formula (6.3).
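
The geometric-series argument in (6.5) can be checked numerically. The short sketch below uses hypothetical values for the first row of P and for p21 and p24 (none of them taken from the text). It sums the series term by term, where each extra factor of p11 represents one more step spent in state 1, and compares the result with the subtraction-free closed form.

```python
# Hypothetical entries of a four-state transition matrix (illustrative values only).
p11, p12, p13, p14 = 0.2, 0.3, 0.4, 0.1
p21, p24 = 0.1, 0.2

# Partial sum of the series in (6.5): move 2 -> 1, linger in state 1 for m steps,
# then enter S at state 4.
series = p24 + sum(p21 * p11**m * p14 for m in range(200))

# Closed form from the last line of (6.5), with no subtraction.
closed_form = p24 + p21 * p14 / (p12 + p13 + p14)

print(series, closed_form)
```

With 200 terms the truncated series and the closed form agree to within floating-point roundoff, since the neglected tail is of order p11 raised to the 200th power.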

6.1.1.3 Optional Insight: Informal Derivation of the MCPA


In this optional section, an informal derivation of the MCPA will be given.
For simplicity, consider the transition probability matrix for a regular four-
state Markov chain partitioned as shown in Equation (6.1). The first step of
the MCPA is matrix reduction using Equation (6.3). The original transition
probability matrix is identified by superscript one. The superscript is incre-
mented by one for each reduced matrix.

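\[
P^{(1)} =
\begin{array}{c} 1 \\ 2 \\ 3 \\ 4 \end{array}
\begin{bmatrix}
p_{11}^{(1)} & p_{12}^{(1)} & p_{13}^{(1)} & p_{14}^{(1)} \\
p_{21}^{(1)} & p_{22}^{(1)} & p_{23}^{(1)} & p_{24}^{(1)} \\
p_{31}^{(1)} & p_{32}^{(1)} & p_{33}^{(1)} & p_{34}^{(1)} \\
p_{41}^{(1)} & p_{42}^{(1)} & p_{43}^{(1)} & p_{44}^{(1)}
\end{bmatrix}
=
\begin{array}{c} 1 \\ S \end{array}
\begin{bmatrix}
p_{11}^{(1)} & u^{(1)} \\
v^{(1)} & T^{(1)}
\end{bmatrix}
\tag{6.14}
\]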

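\[
\begin{aligned}
P^{(2)} &= T^{(1)} + v^{(1)} \left[ p_{12}^{(1)} + p_{13}^{(1)} + p_{14}^{(1)} \right]^{-1} u^{(1)} \\[4pt]
&=
\begin{bmatrix}
p_{22}^{(1)} & p_{23}^{(1)} & p_{24}^{(1)} \\
p_{32}^{(1)} & p_{33}^{(1)} & p_{34}^{(1)} \\
p_{42}^{(1)} & p_{43}^{(1)} & p_{44}^{(1)}
\end{bmatrix}
+
\begin{bmatrix} p_{21}^{(1)} \\ p_{31}^{(1)} \\ p_{41}^{(1)} \end{bmatrix}
\left[ p_{12}^{(1)} + p_{13}^{(1)} + p_{14}^{(1)} \right]^{-1}
\begin{bmatrix} p_{12}^{(1)} & p_{13}^{(1)} & p_{14}^{(1)} \end{bmatrix} \\[4pt]
&=
\begin{bmatrix}
p_{22}^{(2)} & p_{23}^{(2)} & p_{24}^{(2)} \\
p_{32}^{(2)} & p_{33}^{(2)} & p_{34}^{(2)} \\
p_{42}^{(2)} & p_{43}^{(2)} & p_{44}^{(2)}
\end{bmatrix}
=
\begin{bmatrix}
p_{22}^{(2)} & u^{(2)} \\
v^{(2)} & T^{(2)}
\end{bmatrix}
\end{aligned}
\tag{6.15}
\]

\[
\begin{aligned}
P^{(3)} &= T^{(2)} + v^{(2)} \left[ p_{23}^{(2)} + p_{24}^{(2)} \right]^{-1} u^{(2)} \\[4pt]
&=
\begin{bmatrix}
p_{33}^{(2)} & p_{34}^{(2)} \\
p_{43}^{(2)} & p_{44}^{(2)}
\end{bmatrix}
+
\begin{bmatrix} p_{32}^{(2)} \\ p_{42}^{(2)} \end{bmatrix}
\left( p_{23}^{(2)} + p_{24}^{(2)} \right)^{-1}
\begin{bmatrix} p_{23}^{(2)} & p_{24}^{(2)} \end{bmatrix} \\[4pt]
&=
\begin{bmatrix}
p_{33}^{(3)} & p_{34}^{(3)} \\
p_{43}^{(3)} & p_{44}^{(3)}
\end{bmatrix}
=
\begin{bmatrix}
p_{33}^{(3)} & u^{(3)} \\
v^{(3)} & T^{(3)}
\end{bmatrix}.
\end{aligned}
\tag{6.16}
\]

The rows and columns of $P^{(2)}$ correspond to states 2, 3, and 4, and those of
$P^{(3)}$ correspond to states 3 and 4.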

The matrix reduction step has ended with a two-state Markov chain, which
has a transition probability matrix denoted by $P^{(3)}$. Using Equation (2.16), the
steady-state probability vector for $P^{(3)}$, denoted by $\pi^{(3)}$, is known, and is shown
below with its two components.

\[
\pi^{(3)} = \begin{bmatrix} \pi_3^{(3)} & \pi_4^{(3)} \end{bmatrix}
= \begin{bmatrix}
\dfrac{p_{43}^{(3)}}{p_{34}^{(3)} + p_{43}^{(3)}} &
\dfrac{p_{34}^{(3)}}{p_{34}^{(3)} + p_{43}^{(3)}}
\end{bmatrix}.
\tag{6.17}
\]
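
As a quick check, added here for illustration and relying only on the fact that each row of the two-state matrix $P^{(3)}$ sums to one (so that $p_{33}^{(3)} = 1 - p_{34}^{(3)}$), the vector in (6.17) can be verified against the first steady-state equation for $P^{(3)}$:

```latex
\[
\pi_3^{(3)} p_{33}^{(3)} + \pi_4^{(3)} p_{43}^{(3)}
= \frac{p_{43}^{(3)} \bigl(1 - p_{34}^{(3)}\bigr) + p_{34}^{(3)} p_{43}^{(3)}}
       {p_{34}^{(3)} + p_{43}^{(3)}}
= \frac{p_{43}^{(3)}}{p_{34}^{(3)} + p_{43}^{(3)}}
= \pi_3^{(3)} .
\]
```

The two components also sum to one, so the vector is a valid steady-state probability vector for the two-state chain.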

The second step of the MCPA is back substitution, which begins by solving
for $\pi_3^{(3)}$ as a constant $k_3$ times $\pi_4^{(3)}$.

\[
\pi_3^{(3)} = \frac{p_{43}^{(3)}}{p_{34}^{(3)}} \, \pi_4^{(3)}
= \bigl(p_{34}^{(3)}\bigr)^{-1} p_{43}^{(3)} \pi_4^{(3)} = k_3 \pi_4^{(3)},
\tag{6.18}
\]

where the constant

\[
k_3 = \bigl(p_{34}^{(3)}\bigr)^{-1} p_{43}^{(3)}. \tag{6.19}
\]

The steady-state probability vector for $P^{(2)}$ is denoted by

\[
\pi^{(2)} = \begin{bmatrix} \pi_2^{(2)} & \pi_3^{(2)} & \pi_4^{(2)} \end{bmatrix}. \tag{6.20}
\]


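The first steady-state equation of the system

\[
\pi^{(2)} = \pi^{(2)} P^{(2)} \tag{6.21}
\]

is

\[
\begin{aligned}
\pi_2^{(2)} &= \pi_2^{(2)} p_{22}^{(2)} + \pi_3^{(2)} p_{32}^{(2)} + \pi_4^{(2)} p_{42}^{(2)} \\
\bigl(1 - p_{22}^{(2)}\bigr) \pi_2^{(2)} &= \pi_3^{(2)} p_{32}^{(2)} + \pi_4^{(2)} p_{42}^{(2)}.
\end{aligned}
\tag{6.22}
\]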

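By the theorem of Kemeny and Snell,

\[
\pi_3^{(3)} = \frac{\pi_3^{(2)}}{\pi_3^{(2)} + \pi_4^{(2)}}, \quad \text{and} \quad
\pi_4^{(3)} = \frac{\pi_4^{(2)}}{\pi_3^{(2)} + \pi_4^{(2)}}.
\tag{6.23}
\]

Observe that

\[
\frac{\pi_3^{(2)}}{\pi_4^{(2)}} = \frac{\pi_3^{(3)}}{\pi_4^{(3)}}
= \frac{k_3 \pi_4^{(3)}}{\pi_4^{(3)}} = k_3.
\tag{6.24}
\]

Hence, $\pi_3^{(3)} = k_3 \pi_4^{(3)}$ implies that $\pi_3^{(2)} = k_3 \pi_4^{(2)}$. Substituting $\pi_3^{(2)} = k_3 \pi_4^{(2)}$ in the
first steady-state equation of the system (6.21),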

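\[
\begin{aligned}
\bigl(1 - p_{22}^{(2)}\bigr) \pi_2^{(2)} &= p_{32}^{(2)} k_3 \pi_4^{(2)} + p_{42}^{(2)} \pi_4^{(2)}
= \bigl(p_{32}^{(2)} k_3 + p_{42}^{(2)}\bigr) \pi_4^{(2)} \\
\pi_2^{(2)} &= \bigl(1 - p_{22}^{(2)}\bigr)^{-1} \bigl(p_{32}^{(2)} k_3 + p_{42}^{(2)}\bigr) \pi_4^{(2)}
= \bigl(p_{23}^{(2)} + p_{24}^{(2)}\bigr)^{-1} \bigl(p_{32}^{(2)} k_3 + p_{42}^{(2)}\bigr) \pi_4^{(2)} \\
&= k_2 \pi_4^{(2)},
\end{aligned}
\tag{6.25}
\]

where the constant

\[
k_2 = \bigl(p_{23}^{(2)} + p_{24}^{(2)}\bigr)^{-1} \bigl(p_{32}^{(2)} k_3 + p_{42}^{(2)}\bigr), \tag{6.26}
\]

and where $\bigl(p_{23}^{(2)} + p_{24}^{(2)}\bigr)^{-1}$ is substituted for $\bigl(1 - p_{22}^{(2)}\bigr)^{-1}$ to avoid subtractions. The
steady-state probability vector for $P^{(1)}$ is denoted by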

π (1) = π 1(1) π 2(1) π 3(1) π 4(1)  . (6.27)

The first steady-state equation of the system

π (1) = π (1) P(1) (6.28)


is

π 1(1) = π 1(1) p11 + π 2(1) p21 + π 3(1) p31 41 


+ π 4(1) p(1)
(1) (1) (1)

. (6.29)
(1 − p11 )π 1 = π 2 p21 + π 3 p31 + π 4 p41
(1) (1) (1) (1) (1) (1) (1) (1)



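By the theorem of Kemeny and Snell,

\[
\pi_2^{(2)} = \frac{\pi_2^{(1)}}{\pi_2^{(1)} + \pi_3^{(1)} + \pi_4^{(1)}}, \quad
\pi_3^{(2)} = \frac{\pi_3^{(1)}}{\pi_2^{(1)} + \pi_3^{(1)} + \pi_4^{(1)}}, \quad \text{and} \quad
\pi_4^{(2)} = \frac{\pi_4^{(1)}}{\pi_2^{(1)} + \pi_3^{(1)} + \pi_4^{(1)}}.
\tag{6.30}
\]

Observe that

\[
\frac{\pi_2^{(1)}}{\pi_4^{(1)}} = \frac{\pi_2^{(2)}}{\pi_4^{(2)}}
= \frac{k_2 \pi_4^{(2)}}{\pi_4^{(2)}} = k_2.
\tag{6.31}
\]

Hence, $\pi_2^{(2)} = k_2 \pi_4^{(2)}$ implies that $\pi_2^{(1)} = k_2 \pi_4^{(1)}$.
Similarly, $\pi_3^{(2)} = k_3 \pi_4^{(2)}$ implies that $\pi_3^{(1)} = k_3 \pi_4^{(1)}$.
Substituting $\pi_2^{(1)} = k_2 \pi_4^{(1)}$ and $\pi_3^{(1)} = k_3 \pi_4^{(1)}$ in the first steady-state equation of
the system (6.28),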

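\[
\begin{aligned}
\bigl(1 - p_{11}^{(1)}\bigr) \pi_1^{(1)} &= p_{21}^{(1)} k_2 \pi_4^{(1)} + p_{31}^{(1)} k_3 \pi_4^{(1)} + p_{41}^{(1)} \pi_4^{(1)}
= \bigl(p_{21}^{(1)} k_2 + p_{31}^{(1)} k_3 + p_{41}^{(1)}\bigr) \pi_4^{(1)} \\
\pi_1^{(1)} &= \bigl(1 - p_{11}^{(1)}\bigr)^{-1} \bigl(p_{21}^{(1)} k_2 + p_{31}^{(1)} k_3 + p_{41}^{(1)}\bigr) \pi_4^{(1)} \\
&= \bigl(p_{12}^{(1)} + p_{13}^{(1)} + p_{14}^{(1)}\bigr)^{-1} \bigl(p_{21}^{(1)} k_2 + p_{31}^{(1)} k_3 + p_{41}^{(1)}\bigr) \pi_4^{(1)} = k_1 \pi_4^{(1)},
\end{aligned}
\tag{6.32}
\]

where the constant

\[
k_1 = \bigl(p_{12}^{(1)} + p_{13}^{(1)} + p_{14}^{(1)}\bigr)^{-1} \bigl(p_{21}^{(1)} k_2 + p_{31}^{(1)} k_3 + p_{41}^{(1)}\bigr).
\]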

Solving the normalizing equation (2.6) for the system (6.28) for $\pi_4^{(1)}$ as a
function of the constants,

\[
\begin{aligned}
\pi_1^{(1)} + \pi_2^{(1)} + \pi_3^{(1)} + \pi_4^{(1)} &= 1 \\
k_1 \pi_4^{(1)} + k_2 \pi_4^{(1)} + k_3 \pi_4^{(1)} + \pi_4^{(1)} &= 1 \\
(k_1 + k_2 + k_3 + 1) \pi_4^{(1)} &= 1
\end{aligned}
\tag{6.33}
\]

yields

\[
\pi_4^{(1)} = (1 + k_1 + k_2 + k_3)^{-1}. \tag{6.34}
\]

Solving for the steady-state probabilities for the original Markov chain,

\[
\begin{aligned}
\pi_4 &= \pi_4^{(1)} \\
\pi_3 &= \pi_3^{(1)} = k_3 \pi_4^{(1)} \\
\pi_2 &= \pi_2^{(1)} = k_2 \pi_4^{(1)} \\
\pi_1 &= \pi_1^{(1)} = k_1 \pi_4^{(1)}.
\end{aligned}
\tag{6.35}
\]


6.1.1.4 Markov Chain Partitioning Algorithm


The MCPA will calculate the steady-state probability vector for a regular
Markov chain with N states indexed 1, 2, . . . , N, and a transition probability
matrix, P. The detailed steps of the MCPA are given below:

Matrix Reduction

1. Initialize n = 1.
2. Let P(n) = P = [pij] for n ≤ i ≤ N and n ≤ j ≤ N.
3. Partition P(n), whose rows and columns are indexed n, n + 1, . . . , N, as

$$P^{(n)} = \begin{bmatrix}
p_{n,n}^{(n)} & p_{n,n+1}^{(n)} & \cdots & p_{n,N}^{(n)} \\
p_{n+1,n}^{(n)} & p_{n+1,n+1}^{(n)} & \cdots & p_{n+1,N}^{(n)} \\
\vdots & \vdots & \ddots & \vdots \\
p_{N,n}^{(n)} & p_{N,n+1}^{(n)} & \cdots & p_{N,N}^{(n)}
\end{bmatrix}
= \begin{bmatrix} p_{nn}^{(n)} & u^{(n)} \\ v^{(n)} & T^{(n)} \end{bmatrix}.$$

4. Store the first row and first column, respectively, of P(n) by overwriting
the first row and the first column, respectively, of P.
5. Compute $P^{(n+1)} = T^{(n)} + v^{(n)}\left[p_{n,n+1}^{(n)} + \cdots + p_{n,N}^{(n)}\right]^{-1} u^{(n)}$.
6. Increment n by 1. If n < N – 1, go to step 3. Otherwise, go to back
substitution.

Back Substitution

1. Initialize i = N – 1.
2. $k_{N-1} = \left(p_{N-1,N}^{(N-1)}\right)^{-1} p_{N,N-1}^{(N-1)}$.
3. Decrement i by 1.
4. $k_i = \left(\sum_{j=i+1}^{N} p_{ij}^{(i)}\right)^{-1}\left(p_{Ni}^{(i)} + \sum_{h=i+1}^{N-1} p_{hi}^{(i)} k_h\right)$, i = N − 2, N − 3, ... , 1.
5. If i > 1, go to step 3.
6. i = 1
7. $\pi_N = \left(1 + \sum_{h=1}^{N-1} k_h\right)^{-1}$.
8. πh = khπN, h = 1, 2, … , N − 1.
9. π = [π1 π2 … πN]
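
The matrix reduction and back substitution steps above can be sketched in Python as follows. This is only an illustrative sketch, not the author's implementation: the function name `mcpa_steady_state` and the in-place storage layout are assumptions, indices are 0-based, and exact rational arithmetic from the standard `fractions` module is used so that the result can be compared with the fractions in the worked example that follows.

```python
from fractions import Fraction


def mcpa_steady_state(P):
    """Steady-state vector of a regular chain via matrix reduction and back substitution."""
    N = len(P)
    A = [[Fraction(x) for x in row] for row in P]  # exact working copy of P
    # Matrix reduction: remove the first N - 2 states one at a time (0-based n).
    for n in range(N - 2):
        s = sum(A[n][n + 1:])  # p_{n,n+1} + ... + p_{n,N}, used in place of 1 - p_nn
        for i in range(n + 1, N):
            v = A[i][n]  # entry of v(n); column n is left untouched (stored)
            for j in range(n + 1, N):
                A[i][j] += v * A[n][j] / s  # T(n) + v(n) s^{-1} u(n)
    # Back substitution: pi_i = k_i * pi_N, with k_N = 1 for the last state.
    k = [Fraction(0)] * N
    k[N - 1] = Fraction(1)
    for i in range(N - 2, -1, -1):
        s = sum(A[i][i + 1:])
        k[i] = sum(A[h][i] * k[h] for h in range(i + 1, N)) / s
    total = sum(k)  # equals 1 + k_1 + ... + k_{N-1}
    return [x / total for x in k]


# Four-state weather chain; decimal strings keep the entries exact.
weather = [
    ["0.3", "0.1", "0.4", "0.2"],
    ["0.2", "0.5", "0.2", "0.1"],
    ["0.3", "0.2", "0.1", "0.4"],
    ["0", "0.6", "0.3", "0.1"],
]
pi = mcpa_steady_state(weather)
print([str(x) for x in pi])
```

Because row n and column n are never modified after state n is eliminated, the working array plays the role of the overwritten matrix P in step 4. Note that `Fraction` prints each probability in lowest terms.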


6.1.1.5 Using the MCPA to Compute the Steady-State


Probabilities for a Four-State Markov Chain
In this section, the MCPA will be executed to calculate the steady-state prob-
ability vector for the regular four-state Markov chain model of the weather
for which the transition matrix is shown in Equation (1.9). The steady-state
probability vector was previously computed in Equation (2.22). The transi-
tion probability matrix is partitioned as shown in Equation (6.36).
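Matrix Reduction

$$P = P^{(1)} = \begin{array}{c|cccc}
1 & 0.3 & 0.1 & 0.4 & 0.2 \\
2 & 0.2 & 0.5 & 0.2 & 0.1 \\
3 & 0.3 & 0.2 & 0.1 & 0.4 \\
4 & 0 & 0.6 & 0.3 & 0.1
\end{array}
= \begin{bmatrix} p_{11}^{(1)} & u^{(1)} \\ v^{(1)} & T^{(1)} \end{bmatrix} \tag{6.36}$$

$$\begin{aligned}
P^{(2)} &= T^{(1)} + v^{(1)}\left[p_{12}^{(1)} + p_{13}^{(1)} + p_{14}^{(1)}\right]^{-1} u^{(1)} \\
&= \begin{array}{c|ccc}
2 & 0.5 & 0.2 & 0.1 \\
3 & 0.2 & 0.1 & 0.4 \\
4 & 0.6 & 0.3 & 0.1
\end{array}
+ \begin{bmatrix} 0.2 \\ 0.3 \\ 0 \end{bmatrix}(0.1 + 0.4 + 0.2)^{-1}\begin{bmatrix} 0.1 & 0.4 & 0.2 \end{bmatrix} \\
&= \begin{array}{c|ccc}
2 & 37/70 & 22/70 & 11/70 \\
3 & 17/70 & 19/70 & 34/70 \\
4 & 42/70 & 21/70 & 7/70
\end{array}
= \begin{bmatrix} p_{22}^{(2)} & u^{(2)} \\ v^{(2)} & T^{(2)} \end{bmatrix}
\end{aligned} \tag{6.37}$$

$$\begin{aligned}
P^{(3)} &= T^{(2)} + v^{(2)}\left[p_{23}^{(2)} + p_{24}^{(2)}\right]^{-1} u^{(2)} \\
&= \begin{array}{c|cc}
3 & 19/70 & 34/70 \\
4 & 21/70 & 7/70
\end{array}
+ \begin{bmatrix} 17/70 \\ 42/70 \end{bmatrix}(22/70 + 11/70)^{-1}\begin{bmatrix} 22/70 & 11/70 \end{bmatrix} \\
&= \begin{array}{c|cc}
3 & 13/30 & 17/30 \\
4 & 21/30 & 9/30
\end{array}
= \begin{bmatrix} p_{33}^{(3)} & u^{(3)} \\ v^{(3)} & T^{(3)} \end{bmatrix}.
\end{aligned} \tag{6.38}$$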

Back Substitution

$$k_3 = \left(p_{34}^{(3)}\right)^{-1} p_{43}^{(3)} = (17/30)^{-1}(21/30) = 1617/1309. \tag{6.39}$$


$$\begin{aligned}
k_2 &= \left[p_{23}^{(2)} + p_{24}^{(2)}\right]^{-1}\left[p_{32}^{(2)} k_3 + p_{42}^{(2)}\right] \\
&= (22/70 + 11/70)^{-1}\left[(17/70)(21/17) + 42/70\right] = 2499/1309
\end{aligned} \tag{6.40}$$

$$\begin{aligned}
k_1 &= \left[p_{12}^{(1)} + p_{13}^{(1)} + p_{14}^{(1)}\right]^{-1}\left[p_{21}^{(1)} k_2 + p_{31}^{(1)} k_3 + p_{41}^{(1)}\right] \\
&= (0.1 + 0.4 + 0.2)^{-1}\left[(0.2)(2499/1309) + (0.3)(1617/1309)\right] = 1407/1309
\end{aligned} \tag{6.41}$$

$$\pi_4 = (1 + k_1 + k_2 + k_3)^{-1} = (1 + 1407/1309 + 2499/1309 + 1617/1309)^{-1} = 1309/6832 \tag{6.42}$$

π 3 = k 3π 4 = (1617 /1309) (1309/6832) = 1617 /6832 (6.43)

π 2 = k 2π 4 = (2499/1309) (1309/6832) = 2499/6832 (6.44)

π 1 = k1π 4 = (1407 /1309) (1309/6832) = 1407 /6832. (6.45)

These are the same steady-state probabilities that were obtained in Equation
(2.22).
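
As a quick independent check, the vector in Equations (6.42) through (6.45) can be substituted back into the steady-state equations. The short sketch below is an illustration only; the variable names are assumptions, and exact fractions are used so that the comparison does not depend on rounding.

```python
from fractions import Fraction as F

# Transition matrix of the four-state weather chain, Equation (6.36).
P = [
    [F(3, 10), F(1, 10), F(4, 10), F(2, 10)],
    [F(2, 10), F(5, 10), F(2, 10), F(1, 10)],
    [F(3, 10), F(2, 10), F(1, 10), F(4, 10)],
    [F(0), F(6, 10), F(3, 10), F(1, 10)],
]
# Steady-state probabilities from Equations (6.42)-(6.45).
pi = [F(1407, 6832), F(2499, 6832), F(1617, 6832), F(1309, 6832)]

# Row vector times matrix: (pi P)_j = sum_i pi_i p_ij.
pi_P = [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]
balanced = pi_P == pi        # True when pi = pi P holds exactly
normalized = sum(pi) == 1    # True when the normalizing equation holds
print(balanced, normalized)
```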

6.1.1.6 Optional Insight: Matrix Reduction and Gaussian Elimination


To see that matrix reduction is equivalent to Gaussian elimination, consider
again the transition probability matrix shown in Equation (6.36) for the four-
state Markov chain model of the weather. Equations (2.13), with the normal-
izing equation omitted, can be expressed as

π = π P ⇒ π (1) = π (1) P(1) . (6.46)

Equations (2.12), with the normalizing equation omitted, can be expressed as

π1 = (0.3)π 1 + (0.2)π 2 + (0.3)π 3 + (0)π 4 


π2 = (0.1)π 1 + (0.5)π 2 + (0.2)π 3 + (0.6)π 4 
.
π3 = (0.4)π 1 + (0.2)π 2 + (0.1)π 3 + (0.3)π 4  (6.47)
π4 = (0.2)π 1 + (0.1)π 2 + (0.4)π 3 + (0.1)π 4 


When Gaussian elimination is applied to the linear system (6.47), the first
step is to solve the first equation for π1 as a function of π2, π3, and π4 in the
following manner.

π 1 = (0.3) π 1 + (0.2) π 2 + (0.3) π 3 + (0) π 4 


 (6.48)
(1 − 0.3) π 1 = (0.2) π 2 + (0.3) π 3 + (0) π 4 , 

to obtain

π 1 = (20/70 ) π 2 + (30/70 ) π 3 + (0) π 4 . (6.49)

To reduce roundoff error by avoiding subtractions, both matrix reduction


and Gaussian elimination can be modified by replacing the coefficient (1−0.3)
of π1 in Equation (6.48) with the coefficient (0.1 + 0.4 + 0.2) , equal to the
sum of the remaining coefficients in the first row of the transition matrix
P(1). The expression for π1 is substituted into the remaining three equations
of (6.47) to produce the following reduced system of three equations in three
unknowns.

π 2 = (37 /70 ) π 2 + (17 /70 ) π 3 + (42/70 ) π 4 



π 3 = (22/70 ) π 2 + (19/70 ) π 3 + (21/70 ) π 4  . (6.50)
π 4 = (11/70 ) π 2 + (34/70 ) π 3 + (7 /70 ) π 4 

Expressed in matrix form equations (6.50) are

π = π P ⇒ π (2) = π (2) P(2) , (6.51)

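or

$$\begin{bmatrix} \pi_2^{(2)} & \pi_3^{(2)} & \pi_4^{(2)} \end{bmatrix}
= \begin{bmatrix} \pi_2^{(2)} & \pi_3^{(2)} & \pi_4^{(2)} \end{bmatrix}
\begin{array}{c|ccc}
2 & 37/70 & 22/70 & 11/70 \\
3 & 17/70 & 19/70 & 34/70 \\
4 & 42/70 & 21/70 & 7/70
\end{array}. \tag{6.52}$$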

Thus, the first step of Gaussian elimination has produced the first reduced
matrix, P(2), which is calculated in Equation (6.37). This procedure can be
repeated until only two equations remain, producing the final reduced
matrix, P(3). The results of this example can be generalized to conclude that
each step of Gaussian elimination produces a reduced coefficient matrix
equivalent to the reduced matrix produced by the corresponding step of


matrix reduction. To conform to the order in which the equations of a linear


system are removed during Gaussian elimination, which begins with the
first equation, matrix reduction starts in the upper left-hand corner of a tran-
sition probability matrix. Back substitution is the same for both Gaussian
elimination and state reduction.
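
The first elimination step described above can be reproduced numerically. The sketch below is illustrative only, with assumed variable names: it solves the first equation of (6.47) for π1, using the sum 0.1 + 0.4 + 0.2 in place of 1 − 0.3, substitutes the result into the remaining three equations, and collects the coefficients so they can be compared with the reduced matrix of Equation (6.37).

```python
from fractions import Fraction as F

# Transition matrix of Equation (6.36); state s is stored at index s - 1.
P = [
    [F(3, 10), F(1, 10), F(4, 10), F(2, 10)],
    [F(2, 10), F(5, 10), F(2, 10), F(1, 10)],
    [F(3, 10), F(2, 10), F(1, 10), F(4, 10)],
    [F(0), F(6, 10), F(3, 10), F(1, 10)],
]

# Coefficient of pi_1 after moving it to the left side, written without subtraction.
s = P[0][1] + P[0][2] + P[0][3]

# Equation (6.49): pi_1 = c[0] pi_2 + c[1] pi_3 + c[2] pi_4.
c = [P[i][0] / s for i in range(1, 4)]

# Substitute pi_1 into the equations for pi_2, pi_3, pi_4.
# coeff[j][i] multiplies pi_{i+2} in the equation for pi_{j+2}, as in (6.50).
coeff = [[P[i][j] + c[i - 1] * P[0][j] for i in range(1, 4)] for j in range(1, 4)]

# Transposing the coefficient array gives the matrix P(2) in pi = pi P(2).
reduced = [[coeff[j][i] for j in range(3)] for i in range(3)]
for row in reduced:
    print([str(x) for x in row])
```

The printed fractions appear in lowest terms, so an entry such as 42/70 is shown as 3/5.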

6.1.2 Mean First Passage Times


Recall from Sections 2.2.2.1 and 3.3.1 that when a vector of MFPTs to a target
state 0 (or j) in a regular Markov chain is to be calculated, the process begins
by making the target state 0 an absorbing state. A state reduction algorithm
for computing MFPTs also makes the target state 0 an absorbing state [3]. A
state reduction algorithm for computing MFPTs has three steps: augmenta-
tion, matrix reduction, and back substitution. This algorithm differs from the
MCPA of Section 6.1.1 in two major respects. The first difference is that an
augmentation step has been added to avoid subtractions. The second differ-
ence is that matrix reduction stops when the final reduced matrix has only
one row.

6.1.2.1 Forming the Augmented Matrix


To understand the augmentation step, consider the following transition
matrix for a generic three-state regular Markov chain:

$$P = \begin{array}{c|ccc}
0 & p_{00} & p_{01} & p_{02} \\
1 & p_{10} & p_{11} & p_{12} \\
2 & p_{20} & p_{21} & p_{22}
\end{array}. \tag{6.53}$$

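Suppose that MFPTs to target state 0 are desired. When target state 0 is made
an absorbing state, the modified transition probability matrix, PM, is
partitioned in the following manner:

$$P_M = \begin{array}{c|ccc}
0 & 1 & 0 & 0 \\
1 & p_{10} & p_{11} & p_{12} \\
2 & p_{20} & p_{21} & p_{22}
\end{array}
= \begin{bmatrix} 1 & 0 \\ D & Q \end{bmatrix}, \tag{6.54a}$$

where

$$D = \begin{array}{c|c} 1 & p_{10} \\ 2 & p_{20} \end{array}, \quad
Q = \begin{array}{c|cc} 1 & p_{11} & p_{12} \\ 2 & p_{21} & p_{22} \end{array}. \tag{6.54b}$$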


In the augmentation step, a rectangular augmented matrix, denoted by G, is
constructed, and is shown below in partitioned form.

$$G = [Q \;\; D \;\; e], \tag{6.55}$$

where e is a two-component column vector with all entries one.
The augmented matrix, G, is shown below. Note that the columns in the
augmented matrix are numbered consecutively from 1 to 4, so that state 0 is
placed in column 3, and the vector e is placed in column 4.

$$G = \begin{array}{c|cccc}
1 & p_{11} & p_{12} & p_{10} & e_1 \\
2 & p_{21} & p_{22} & p_{20} & e_2
\end{array}
= [Q \;\; D \;\; e]
= \begin{array}{c|cccc}
1 & g_{11} & g_{12} & g_{13} & 1 \\
2 & g_{21} & g_{22} & g_{23} & 1
\end{array}
= \begin{array}{c|cccc}
1 & g_{11} & g_{12} & g_{13} & g_{14} \\
2 & g_{21} & g_{22} & g_{23} & g_{24}
\end{array}. \tag{6.56}$$

Matrix Q is augmented with vector D to avoid subtractions. Observe that

$$D = \begin{array}{c|c} 1 & p_{10} \\ 2 & p_{20} \end{array}
= \begin{array}{c|c} 1 & g_{13} \\ 2 & g_{23} \end{array}. \tag{6.57}$$

Since the sum of the entries in the first three columns in each row of the
augmented matrix equals one, subtractions can be eliminated by making the
substitution

(1 − p11 ) = ( p12 + p10 ) = (1 − g11 ) = ( g12 + g13 ). (6.58)

6.1.2.2 State Reduction Algorithm for Computing MFPTs


The detailed steps of a state reduction algorithm for computing MFPTs to
target state 0 for a regular Markov chain with N + 1 states indexed
0, 1, 2, . . . , N, and a transition probability matrix, P, are given below.
As Section 6.1.2 indicates, the modified transition probability matrix, PM,
is assumed to be partitioned such that target state 0 is made an absorbing state.
A. Augmentation
An N × (N + 2) rectangular augmented matrix, G, is formed such that G =
[Q, D, e] = [gij], where matrices Q and D are defined as in Equation (2.50) and
vector e is an N-component column vector with all entries one.
B. Matrix Reduction
Matrix reduction is applied to the augmented matrix, G.

1. Initialize n =1
2. Let G(n) = G = [gij] for n ≤ i ≤ N and n ≤ j ≤ N + 2.


3. Partition G(n), whose rows are indexed n, . . . , N and whose columns are indexed n, . . . , N + 2, as

$$G^{(n)} = \begin{bmatrix}
g_{n,n}^{(n)} & g_{n,n+1}^{(n)} & \cdots & g_{n,N+1}^{(n)} & g_{n,N+2}^{(n)} \\
g_{n+1,n}^{(n)} & g_{n+1,n+1}^{(n)} & \cdots & g_{n+1,N+1}^{(n)} & g_{n+1,N+2}^{(n)} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
g_{N,n}^{(n)} & g_{N,n+1}^{(n)} & \cdots & g_{N,N+1}^{(n)} & g_{N,N+2}^{(n)}
\end{bmatrix}
= \begin{bmatrix} g_{nn}^{(n)} & u^{(n)} \\ v^{(n)} & T^{(n)} \end{bmatrix}. \tag{6.59}$$

4. Store the first row of G(n).
5. Compute $G^{(n+1)} = T^{(n)} + v^{(n)}\left[g_{n,n+1}^{(n)} + \cdots + g_{n,N+1}^{(n)}\right]^{-1} u^{(n)}$.
6. Increment n by 1. If n < N, go to step 3. Otherwise, go to back
substitution.

C. Back Substitution

1. The entry in row N of the vector of MFPTs to state 0 is computed:
$m_{N0} = g_{N,N+2}^{(N)} \big/ g_{N,N+1}^{(N)}$.
2. Let i = N − 1.
3. For 1 ≤ i < N, compute the entry in row i of the vector of mean first
passage times to state 0:
$m_{i0} = \left(g_{i,N+2}^{(i)} + \sum_{h=i+1}^{N} g_{ih}^{(i)} m_{h0}\right) \Big/ \sum_{h=i+1}^{N+1} g_{ih}^{(i)}$.

4. Decrement i by 1.
5. If i > 0, go to step 3. Otherwise, stop.
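
The three stages (augmentation, matrix reduction, back substitution) can be sketched in Python as below. This is a hedged illustration rather than the author's code: the function name `mfpt_by_state_reduction` and its argument layout are assumptions, indices are 0-based, and ordinary floating-point arithmetic is used without any rounding to four decimals. The sample data are the five-state chain of the next subsection, with Q and D read from the augmented matrix in Equation (6.60).

```python
def mfpt_by_state_reduction(Q, D):
    """MFPTs to the target state.

    Q is the N x N block of transitions among non-target states;
    D[i] is the one-step probability of moving from row i to the target.
    """
    N = len(Q)
    # A. Augmentation: G = [Q D e], an N x (N + 2) matrix.
    G = [list(Q[i]) + [D[i], 1.0] for i in range(N)]
    # B. Matrix reduction: eliminate rows 0 .. N-2, keeping each eliminated row.
    for n in range(N - 1):
        s = sum(G[n][n + 1:N + 1])  # g_{n,n+1} + ... + g_{n,N+1}; column e excluded
        for i in range(n + 1, N):
            v = G[i][n]
            for j in range(n + 1, N + 2):
                G[i][j] += v * G[n][j] / s
    # C. Back substitution, from the last row up to the first.
    m = [0.0] * N
    for i in range(N - 1, -1, -1):
        numerator = G[i][N + 1] + sum(G[i][h] * m[h] for h in range(i + 1, N))
        denominator = sum(G[i][i + 1:N + 1])
        m[i] = numerator / denominator
    return m


Q = [
    [0.4, 0.1, 0.2, 0.3],
    [0.1, 0.2, 0.3, 0.2],
    [0.2, 0.1, 0.4, 0.2],
    [0.3, 0.4, 0.2, 0.1],
]
D = [0.0, 0.2, 0.1, 0.0]
m = mfpt_by_state_reduction(Q, D)
print([round(x, 4) for x in m])
```

Every division in the sketch uses a sum of nonnegative entries as the denominator, which mirrors the stated purpose of the augmentation step.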

6.1.2.3 Using State Reduction to Compute MFPTs


for a Five-State Markov Chain
To demonstrate how the state reduction algorithm is used to compute the
MFPTs for a regular Markov chain, consider the following numerical example,
treated in Section 2.2.2.1, of a Markov chain with five states, indexed 0, 1, 2, 3,
and 4. The vector of MFPTs was calculated in Equations (2.62) and (3.22). The
transition probability matrix is shown in Equation (2.57). The vector of MFPTs
to target state 0 can be found by making state 0 an absorbing state. The modified
transition probability matrix, PM, is partitioned as shown in Equation (2.58).
A. Augmentation

$$G = [Q \;\; D \;\; e] = \begin{array}{c|cccccc}
1 & 0.4 & 0.1 & 0.2 & 0.3 & 0 & 1 \\
2 & 0.1 & 0.2 & 0.3 & 0.2 & 0.2 & 1 \\
3 & 0.2 & 0.1 & 0.4 & 0.2 & 0.1 & 1 \\
4 & 0.3 & 0.4 & 0.2 & 0.1 & 0 & 1
\end{array}. \tag{6.60}$$


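B. Matrix Reduction

$$G^{(1)} = \begin{array}{c|cccccc}
1 & 0.4 & 0.1 & 0.2 & 0.3 & 0 & 1 \\
2 & 0.1 & 0.2 & 0.3 & 0.2 & 0.2 & 1 \\
3 & 0.2 & 0.1 & 0.4 & 0.2 & 0.1 & 1 \\
4 & 0.3 & 0.4 & 0.2 & 0.1 & 0 & 1
\end{array}
= \begin{bmatrix} g_{11}^{(1)} & u^{(1)} \\ v^{(1)} & T^{(1)} \end{bmatrix} \tag{6.61}$$

$$G^{(2)} = T^{(1)} + v^{(1)}\left[g_{12}^{(1)} + g_{13}^{(1)} + g_{14}^{(1)} + g_{15}^{(1)}\right]^{-1} u^{(1)}$$

$$G^{(2)} = \begin{array}{c|ccccc}
2 & 0.2167 & 0.3333 & 0.25 & 0.2 & 1.1667 \\
3 & 0.1333 & 0.4667 & 0.30 & 0.1 & 1.3333 \\
4 & 0.4500 & 0.3000 & 0.25 & 0 & 1.5000
\end{array}
= \begin{bmatrix} g_{22}^{(2)} & u^{(2)} \\ v^{(2)} & T^{(2)} \end{bmatrix} \tag{6.62}$$

$$G^{(3)} = T^{(2)} + v^{(2)}\left[g_{23}^{(2)} + g_{24}^{(2)} + g_{25}^{(2)}\right]^{-1} u^{(2)}$$

$$G^{(3)} = \begin{array}{c|cccc}
3 & 0.5234 & 0.3425 & 0.1341 & 1.5318 \\
4 & 0.4915 & 0.3936 & 0.1149 & 2.1703
\end{array}
= \begin{bmatrix} g_{33}^{(3)} & u^{(3)} \\ v^{(3)} & T^{(3)} \end{bmatrix} \tag{6.63}$$

$$G^{(4)} = T^{(3)} + v^{(3)}\left[g_{34}^{(3)} + g_{35}^{(3)}\right]^{-1} u^{(3)}$$

$$G^{(4)} = \begin{bmatrix} 0.7468 & 0.2532 & 3.7500 \end{bmatrix}
= \begin{bmatrix} g_{44}^{(4)} & g_{45}^{(4)} & g_{46}^{(4)} \end{bmatrix} \tag{6.64}$$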

C. Back Substitution
Back substitution begins by computing the entry in row four of the vector of
MFPTs to state 0.

$$m_{40} = g_{46}^{(4)} \big/ g_{45}^{(4)} = 3.7500/0.2532 = 14.8104. \tag{6.65}$$

Next, the entry in row three is computed.

$$\begin{aligned}
m_{30} &= \left(g_{36}^{(3)} + g_{34}^{(3)} m_{40}\right) \big/ \left(g_{34}^{(3)} + g_{35}^{(3)}\right) \\
&= (1.5318 + 0.3425(14.8104))/(0.3425 + 0.1341) = 13.8573.
\end{aligned} \tag{6.66}$$

Then the entry in row two is computed.

$$\begin{aligned}
m_{20} &= \left(g_{26}^{(2)} + g_{23}^{(2)} m_{30} + g_{24}^{(2)} m_{40}\right) \big/ \left(g_{23}^{(2)} + g_{24}^{(2)} + g_{25}^{(2)}\right) \\
&= (1.1667 + 0.3333(13.8573) + 0.25(14.8104))/(0.3333 + 0.25 + 0.2) = 12.1128.
\end{aligned} \tag{6.67}$$

Finally, the entry in row one is computed.

$$\begin{aligned}
m_{10} &= \left(g_{16}^{(1)} + g_{12}^{(1)} m_{20} + g_{13}^{(1)} m_{30} + g_{14}^{(1)} m_{40}\right) \big/ \left(g_{12}^{(1)} + g_{13}^{(1)} + g_{14}^{(1)} + g_{15}^{(1)}\right) \\
&= (1 + 0.1(12.1128) + 0.2(13.8573) + 0.3(14.8104))/(0.1 + 0.2 + 0.3 + 0) = 15.7098.
\end{aligned} \tag{6.68}$$


This vector of MFPTs differs slightly from those computed in Equations (2.62)
and (3.22). Discrepancies are due to roundoff error because only the first four
significant decimal digits were stored.
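
One way to gauge the size of the roundoff is to substitute the computed values back into the first-step relation for MFPTs, assumed here in its usual form m_i0 = 1 + Σ_j q_ij m_j0 with the sum taken over the non-target states. The sketch below is only an illustration with assumed variable names; it reports how far each four-digit value is from satisfying that relation.

```python
# Q block of the five-state chain, read from the augmented matrix (6.60).
Q = [
    [0.4, 0.1, 0.2, 0.3],
    [0.1, 0.2, 0.3, 0.2],
    [0.2, 0.1, 0.4, 0.2],
    [0.3, 0.4, 0.2, 0.1],
]
# MFPTs to state 0 from Equations (6.68), (6.67), (6.66), and (6.65).
m = [15.7098, 12.1128, 13.8573, 14.8104]

# Residual of each equation m_i0 = 1 + sum_j q_ij m_j0.
residuals = [m[i] - (1.0 + sum(Q[i][j] * m[j] for j in range(4))) for i in range(4)]
max_residual = max(abs(r) for r in residuals)
print(residuals, max_residual)
```

Small but nonzero residuals are consistent with the explanation that only four decimal digits were carried through the reduction.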

6.1.3 Absorption Probabilities


In Sections 6.1.1 and 6.1.2, state reduction was applied to calculate quantities
for regular Markov chains. In this section, a state reduction algorithm will
be developed to compute absorption probabilities for a reducible chain [3].
To demonstrate how state reduction can be used to compute a vector of
absorption probabilities, consider the six-state absorbing multichain model
of patient flow in a hospital introduced in Section 1.10.2.1.2. States 0 and 5 are
absorbing, and the other four states are transient. The transition probability
matrix is given in canonical form in Equation (1.59).

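$$P = \begin{array}{c|cccccc}
\text{State} & 0 & 5 & 1 & 2 & 3 & 4 \\ \hline
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
5 & 0 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0.4 & 0.1 & 0.2 & 0.3 \\
2 & 0.15 & 0.05 & 0.1 & 0.2 & 0.3 & 0.2 \\
3 & 0.07 & 0.03 & 0.2 & 0.1 & 0.4 & 0.2 \\
4 & 0 & 0 & 0.3 & 0.4 & 0.2 & 0.1
\end{array}
= \begin{bmatrix} I & 0 \\ D & Q \end{bmatrix}, \tag{1.59}$$

where

$$D = \begin{array}{c|cc}
1 & 0 & 0 \\
2 & 0.15 & 0.05 \\
3 & 0.07 & 0.03 \\
4 & 0 & 0
\end{array}
= [D_1 \;\; D_2], \quad
Q = \begin{array}{c|cccc}
1 & 0.4 & 0.1 & 0.2 & 0.3 \\
2 & 0.1 & 0.2 & 0.3 & 0.2 \\
3 & 0.2 & 0.1 & 0.4 & 0.2 \\
4 & 0.3 & 0.4 & 0.2 & 0.1
\end{array}. \tag{6.69}$$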

A state reduction algorithm will be constructed to compute the vector of


the probabilities of absorption in state 0. This vector of absorption probabili-
ties was previously computed in Equations (3.93) and (3.95). The state reduc-
tion algorithm is similar in structure to the one constructed for computing
MFPTs in Section 6.1.2. The major difference is in the augmentation step.

6.1.3.1 Forming the Augmented Matrix


To compute a vector of absorption probabilities, an N × (N + 2) rectangular
augmented matrix, G, is formed such that

$$G = [Q \;\; D_2 \;\; D_1] = [g_{ij}], \tag{6.70}$$


where column vector D1 of matrix D governs one-step transitions from tran-


sient states to the target absorbing state, state 0, and the entry in each row
of column vector D2 is the sum of the entries in the corresponding row of
matrix D, excluding the entry in column D1. For the absorbing multichain
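model of patient flow in a hospital, N = 4. The augmented matrix is

$$G = [Q \;\; D_2 \;\; D_1] = \begin{array}{c|cccccc}
1 & g_{11} & g_{12} & g_{13} & g_{14} & g_{15} & g_{16} \\
2 & g_{21} & g_{22} & g_{23} & g_{24} & g_{25} & g_{26} \\
3 & g_{31} & g_{32} & g_{33} & g_{34} & g_{35} & g_{36} \\
4 & g_{41} & g_{42} & g_{43} & g_{44} & g_{45} & g_{46}
\end{array}
= \begin{array}{c|cccccc}
1 & p_{11} & p_{12} & p_{13} & p_{14} & p_{15} & p_{10} \\
2 & p_{21} & p_{22} & p_{23} & p_{24} & p_{25} & p_{20} \\
3 & p_{31} & p_{32} & p_{33} & p_{34} & p_{35} & p_{30} \\
4 & p_{41} & p_{42} & p_{43} & p_{44} & p_{45} & p_{40}
\end{array}. \tag{6.71}$$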

To avoid subtractions, matrix Q has been augmented with vectors D1 and D2,
where

$$D_1 = \begin{bmatrix} p_{10} \\ p_{20} \\ p_{30} \\ p_{40} \end{bmatrix}
= \begin{array}{c|c} 1 & 0 \\ 2 & 0.15 \\ 3 & 0.07 \\ 4 & 0 \end{array}, \quad
D_2 = \begin{bmatrix} p_{15} \\ p_{25} \\ p_{35} \\ p_{45} \end{bmatrix}
= \begin{array}{c|c} 1 & 0 \\ 2 & 0.05 \\ 3 & 0.03 \\ 4 & 0 \end{array}. \tag{6.72}$$

Since the row sums of the augmented matrix equal one, subtractions are
avoided by making the substitution

(1 − p11 ) = ( p12 + p13 + p14 + p15 + p10 ) = (1 − g11 ) = ( g12 + g13 + g14 + g15 + g16 ). (6.73)

6.1.3.2 State Reduction Algorithm for Computing Absorption Probabilities


The detailed steps of a state reduction algorithm for computing absorption
probabilities for an absorbing multichain with N transient states indexed
1, 2, . . . , N, and a transition probability matrix, P, are given below. The
transition probability matrix, P, is assumed to be partitioned as in
Equations (1.59) and (6.69).
A. Augmentation
An N × (N + 2) augmented matrix, G, is formed such that

$$G = [Q \;\; D_2 \;\; D_1] = [g_{ij}]. \tag{6.70}$$


B. Matrix Reduction
Matrix reduction is applied to the augmented matrix, G.

1. Initialize n = 1.
2. Let G(n) = G = [gij] for n ≤ i ≤ N and n ≤ j ≤ N + 2.
3. Partition G(n), whose rows are indexed n, . . . , N and whose columns are indexed n, . . . , N + 2, as

$$G^{(n)} = \begin{bmatrix}
g_{n,n}^{(n)} & g_{n,n+1}^{(n)} & \cdots & g_{n,N+1}^{(n)} & g_{n,N+2}^{(n)} \\
g_{n+1,n}^{(n)} & g_{n+1,n+1}^{(n)} & \cdots & g_{n+1,N+1}^{(n)} & g_{n+1,N+2}^{(n)} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
g_{N,n}^{(n)} & g_{N,n+1}^{(n)} & \cdots & g_{N,N+1}^{(n)} & g_{N,N+2}^{(n)}
\end{bmatrix}
= \begin{bmatrix} g_{nn}^{(n)} & u^{(n)} \\ v^{(n)} & T^{(n)} \end{bmatrix}.$$

4. Store the first row of G(n).
5. Compute $G^{(n+1)} = T^{(n)} + v^{(n)}\left[g_{n,n+1}^{(n)} + \cdots + g_{n,N+2}^{(n)}\right]^{-1} u^{(n)}$.
6. Increment n by 1. If n < N, go to step 3. Otherwise, go to back
substitution.

C. Back Substitution

1. Compute the entry in row N of the vector of the probabilities of
absorption in state 0:
$f_{N0} = g_{N,N+2}^{(N)} \big/ \left(g_{N,N+1}^{(N)} + g_{N,N+2}^{(N)}\right)$.
2. Let i = N – 1.
3. For 1 ≤ i < N, compute the entry in row i of the vector of the
probabilities of absorption in state 0:
$f_{i0} = \left(g_{i,N+2}^{(i)} + \sum_{h=i+1}^{N} g_{ih}^{(i)} f_{h0}\right) \Big/ \sum_{h=i+1}^{N+2} g_{ih}^{(i)}$.

4. Decrement i by 1.
5. If i > 0, go to step 3. Otherwise, stop.
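
A compact Python sketch of these steps is given below. It is an illustration under stated assumptions, not the author's implementation: the function name `absorption_by_state_reduction` and its argument order are hypothetical, indices are 0-based, and plain floating-point arithmetic is used. The sample data are Q, D1, and D2 from Equations (6.69) and (6.72).

```python
def absorption_by_state_reduction(Q, D_target, D_other):
    """Probabilities of absorption in the target absorbing state.

    Q is the N x N transient block; D_target[i] is the one-step probability
    of absorption in the target state from row i; D_other[i] is the summed
    one-step probability of absorption in all other absorbing states.
    """
    N = len(Q)
    # A. Augmentation: G = [Q D2 D1], with the target column placed last.
    G = [list(Q[i]) + [D_other[i], D_target[i]] for i in range(N)]
    # B. Matrix reduction.
    for n in range(N - 1):
        s = sum(G[n][n + 1:])  # g_{n,n+1} + ... + g_{n,N+2}
        for i in range(n + 1, N):
            v = G[i][n]
            for j in range(n + 1, N + 2):
                G[i][j] += v * G[n][j] / s
    # C. Back substitution, from the last row up to the first.
    f = [0.0] * N
    for i in range(N - 1, -1, -1):
        numerator = G[i][N + 1] + sum(G[i][h] * f[h] for h in range(i + 1, N))
        f[i] = numerator / sum(G[i][i + 1:])
    return f


Q = [
    [0.4, 0.1, 0.2, 0.3],
    [0.1, 0.2, 0.3, 0.2],
    [0.2, 0.1, 0.4, 0.2],
    [0.3, 0.4, 0.2, 0.1],
]
D1 = [0.0, 0.15, 0.07, 0.0]  # one-step transitions into state 0
D2 = [0.0, 0.05, 0.03, 0.0]  # one-step transitions into state 5
f0 = absorption_by_state_reduction(Q, D1, D2)
f5 = absorption_by_state_reduction(Q, D2, D1)
print([round(x, 4) for x in f0], [round(x, 4) for x in f5])
```

Swapping the roles of D1 and D2 gives the probabilities of absorption in state 5, and the two vectors should sum to one entry by entry because absorption is certain from every transient state.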

6.1.3.3 Using State Reduction to Compute Absorption Probabilities for


an Absorbing Multichain Model of Patient Flow in a Hospital
The state reduction algorithm will be executed to compute the vector of the
probabilities of absorption in state 0, which indicates that a hospital patient
has been discharged.


A. Augmentation

$$G = [Q \;\; D_2 \;\; D_1] = \begin{array}{c|cccccc}
1 & 0.4 & 0.1 & 0.2 & 0.3 & 0 & 0 \\
2 & 0.1 & 0.2 & 0.3 & 0.2 & 0.05 & 0.15 \\
3 & 0.2 & 0.1 & 0.4 & 0.2 & 0.03 & 0.07 \\
4 & 0.3 & 0.4 & 0.2 & 0.1 & 0 & 0
\end{array}. \tag{6.74}$$

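B. Matrix Reduction

$$G^{(1)} = \begin{array}{c|cccccc}
1 & 0.4 & 0.1 & 0.2 & 0.3 & 0 & 0 \\
2 & 0.1 & 0.2 & 0.3 & 0.2 & 0.05 & 0.15 \\
3 & 0.2 & 0.1 & 0.4 & 0.2 & 0.03 & 0.07 \\
4 & 0.3 & 0.4 & 0.2 & 0.1 & 0 & 0
\end{array}
= \begin{bmatrix} g_{11}^{(1)} & u^{(1)} \\ v^{(1)} & T^{(1)} \end{bmatrix} \tag{6.75}$$

$$G^{(2)} = T^{(1)} + v^{(1)}\left[g_{12}^{(1)} + g_{13}^{(1)} + g_{14}^{(1)} + g_{15}^{(1)} + g_{16}^{(1)}\right]^{-1} u^{(1)}$$

$$G^{(2)} = \begin{array}{c|ccccc}
2 & 0.2167 & 0.3333 & 0.25 & 0.05 & 0.15 \\
3 & 0.1333 & 0.4667 & 0.30 & 0.03 & 0.07 \\
4 & 0.4500 & 0.3000 & 0.25 & 0 & 0
\end{array}
= \begin{bmatrix} g_{22}^{(2)} & u^{(2)} \\ v^{(2)} & T^{(2)} \end{bmatrix} \tag{6.76}$$

$$G^{(3)} = T^{(2)} + v^{(2)}\left[g_{23}^{(2)} + g_{24}^{(2)} + g_{25}^{(2)} + g_{26}^{(2)}\right]^{-1} u^{(2)}$$

$$G^{(3)} = \begin{array}{c|cccc}
3 & 0.5234 & 0.3426 & 0.0385 & 0.0955 \\
4 & 0.4915 & 0.3936 & 0.0287 & 0.0862
\end{array}
= \begin{bmatrix} g_{33}^{(3)} & u^{(3)} \\ v^{(3)} & T^{(3)} \end{bmatrix} \tag{6.77}$$

$$G^{(4)} = T^{(3)} + v^{(3)}\left[g_{34}^{(3)} + g_{35}^{(3)} + g_{36}^{(3)}\right]^{-1} u^{(3)}$$

$$G^{(4)} = \begin{bmatrix} 0.7469 & 0.0684 & 0.1847 \end{bmatrix}
= \begin{bmatrix} g_{44}^{(4)} & g_{45}^{(4)} & g_{46}^{(4)} \end{bmatrix}. \tag{6.78}$$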

C. Back Substitution
Back substitution begins by computing the entry in row four of the vector of
probabilities of absorption in state 0.

$$f_{40} = g_{46}^{(4)} \big/ \left(g_{45}^{(4)} + g_{46}^{(4)}\right) = (0.1847)/(0.0684 + 0.1847) = 0.7298. \tag{6.79}$$

Next the entry in row three is computed.


$$\begin{aligned}
f_{30} &= \left(g_{36}^{(3)} + g_{34}^{(3)} f_{40}\right) \big/ \left(g_{34}^{(3)} + g_{35}^{(3)} + g_{36}^{(3)}\right) \\
&= (0.0955 + 0.3426(0.7298))/(0.3426 + 0.0385 + 0.0955) = 0.7250.
\end{aligned} \tag{6.80}$$

Then the entry in row two is computed.

$$\begin{aligned}
f_{20} &= \left(g_{26}^{(2)} + g_{23}^{(2)} f_{30} + g_{24}^{(2)} f_{40}\right) \big/ \left(g_{23}^{(2)} + g_{24}^{(2)} + g_{25}^{(2)} + g_{26}^{(2)}\right) \\
&= (0.15 + 0.3333(0.7250) + 0.25(0.7298))/(0.3333 + 0.25 + 0.05 + 0.15) \\
&= 0.7329.
\end{aligned} \tag{6.81}$$

Finally, the entry in row one is computed.

$$\begin{aligned}
f_{10} &= \left(g_{16}^{(1)} + g_{12}^{(1)} f_{20} + g_{13}^{(1)} f_{30} + g_{14}^{(1)} f_{40}\right) \big/ \left(g_{12}^{(1)} + g_{13}^{(1)} + g_{14}^{(1)} + g_{15}^{(1)} + g_{16}^{(1)}\right) \\
&= (0 + 0.1(0.7329) + 0.2(0.7250) + 0.3(0.7298))/(0.1 + 0.2 + 0.3 + 0 + 0) = 0.7287.
\end{aligned} \tag{6.82}$$

The probability fi0 that a patient in a transient state i will eventually be dis-
charged can be expressed as an entry in the vector f0 of the probabilities of
absorption in absorbing state 0.

f 0 = [ f10 f 20 f 30 f 40 ]T = [0.7287 0.7329 0.7250 0.7298]T . (6.83)

These probabilities of absorption in state 0 differ only slightly from those


computed in Equation (3.95). Discrepancies are due to roundoff error because
only the first four significant digits after the decimal point were stored.
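
The size of the roundoff can be gauged by substituting the vector of Equation (6.83) into the first-step relation for absorption probabilities, assumed here in its usual form f_i0 = p_i0 + Σ_j q_ij f_j0 with the sum taken over the transient states. The sketch below is illustrative only, with assumed variable names.

```python
# Transient block Q and column D1 from Equations (6.69) and (6.72).
Q = [
    [0.4, 0.1, 0.2, 0.3],
    [0.1, 0.2, 0.3, 0.2],
    [0.2, 0.1, 0.4, 0.2],
    [0.3, 0.4, 0.2, 0.1],
]
D1 = [0.0, 0.15, 0.07, 0.0]
# Probabilities of absorption in state 0 from Equation (6.83).
f = [0.7287, 0.7329, 0.7250, 0.7298]

# Residual of each equation f_i0 = p_i0 + sum_j q_ij f_j0.
residuals = [f[i] - (D1[i] + sum(Q[i][j] * f[j] for j in range(4))) for i in range(4)]
max_residual = max(abs(r) for r in residuals)
print(residuals, max_residual)
```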

6.2 An Introduction to Hidden Markov Chains


The following introduction to hidden Markov chains is based primarily on
the first three sections of an excellent tutorial by Rabiner [4]. In certain appli-
cations such as speech recognition, bioinformatics, and musicology, the states
of a Markov chain model may be hidden from an observer. To treat such
cases, an extension of a Markov chain model, called a hidden Markov model,
or HMM, has been developed. When a hidden or underlying Markov chain
in an HMM enters a state, only an observation symbol, not the state itself,
can be detected. In this section, the underlying Markov chain is assumed to
be a regular chain.


6.2.1 HMM of the Weather


Consider the following small example of an HMM for the hourly weather
on two remote hidden islands. The weather on the hidden islands cannot be
observed by scientists at a distant meteorological station. Suppose that every
hour a weather satellite randomly selects one of the two hidden islands and
scans the weather on that island. Every hour, a signal indicating dry weather
or wet weather on the randomly selected hidden island is relayed by the
weather satellite to the distant meteorological station. The two islands are
distinguished by the integers 1 and 2.
The hourly weather is treated as an independent trials process by assum-
ing that the weather in one hour does not affect the weather in any other
hour. Suppose that dry weather is denoted by D and wet weather by W. The
probability that island i will have dry weather is denoted by di. Hence, for the
ith island, P(D) = di and P(W) = 1 − di. The probabilities of wet and dry weather
on each hidden island are indicated symbolically in Table 6.1.
Letting d1 = 0.55, and d2 = 0.25, the probabilities of wet and dry weather on
each hidden island are indicated numerically in Table 6.2.
Scientists at the meteorological station can see only the hourly report of
the weather on a hidden island, either D or W, but cannot identify the island
for which the weather is reported. The random choice of which island is to
be scanned each hour by the satellite is governed by a two-state, regular,
hidden Markov chain. The state, denoted by Xn, represents the identity of
the hidden island scanned in hour n. The state space is E = {1, 2}. The num-
ber of states is denoted by N. In this example N = 2. Suppose that the hidden
two-state Markov chain has the following transition probability matrix:

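$$P = [p_{ij}] = \begin{array}{c|cc}
\text{Island = State} & 1 & 2 \\ \hline
1 & p_{11} & p_{12} \\
2 & p_{21} & p_{22}
\end{array}
= \begin{array}{c|cc}
X_n \backslash X_{n+1} & 1 & 2 \\ \hline
1 & 0.6 & 0.4 \\
2 & 0.7 & 0.3
\end{array} \tag{6.84}$$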

TABLE 6.1
Symbolic Probabilities of Weather on Islands
            Dry Weather, D    Wet Weather, W
Island 1    P(D) = d1         P(W) = 1 − d1
Island 2    P(D) = d2         P(W) = 1 − d2

TABLE 6.2
Numerical Probabilities of Weather on Islands
            Dry Weather, D    Wet Weather, W
Island 1    P(D) = 0.55       P(W) = 0.45
Island 2    P(D) = 0.25       P(W) = 0.75


Suppose that the initial state probability vector for the two-state hidden
Markov chain is

$$p^{(0)} = \begin{bmatrix} p_1^{(0)} & p_2^{(0)} \end{bmatrix} = [P(X_0 = 1) \;\; P(X_0 = 2)] = [0.35 \;\; 0.65]. \tag{6.85}$$

Each island or state has two mutually exclusive observation symbols,


namely, D for dry weather and W for wet weather. The two distinct symbols
are members of an alphabet denoted by
V = {v1 , v2 } = {D, W }. (6.86)

The number of distinct observation symbols per state, or alphabet size, is


denoted by K. In this example K = 2. The HMM can generate a sequence of
observations denoted by
O = {O0 , O1 ,..., OM } , (6.87)

where each observation, On, is one of the symbols from the alphabet V, and
M + 1 is the number of observations in the sequence. The observation symbols
are generated in accordance with an observation symbol probability distribu-
tion. The observation symbol probability distribution function in state i is

bi (k ) = P(On = vk|X n = i), for i = 1, … , N and k = 1, … , K . (6.88)

In the hidden island weather example, N = 2 states and K = 2 observation


symbols per state. In matrix form the observation symbol probability distri-
bution is shown in Table 6.3.
Substituting {v1, v2} = {D, W}, the observation symbol probability distribu-
tion matrix is shown in Table 6.4.
Letting di = P(On = D|Xn = i) and 1 − di = P(On = W|Xn = i) for island i, the
observation symbol probability distribution matrix is shown in Table 6.5.
Substituting d1 = 0.55, and d2 = 0.25, the observation symbol probability dis-
tribution matrix is shown in Equation (6.89).

Weather = D Weather = W
B = [bi (k )] = Island 1 0.55 0.45 . (6.89)
Island 2 0.25 0.75

TABLE 6.3
Observation Symbol Probability Distribution when V = {v1, v2}
                             Observation Symbol vk = v1    Observation Symbol vk = v2
B = [bi(k)] =  State i = 1   P(On = v1|Xn = 1)             P(On = v2|Xn = 1)
               State i = 2   P(On = v1|Xn = 2)             P(On = v2|Xn = 2)


TABLE 6.4
Observation Symbol Probability Distribution when V = {D, W}
                             Observation Symbol v1 = D     Observation Symbol v2 = W
B = [bi(k)] =  State i = 1   P(On = D|Xn = 1)              P(On = W|Xn = 1)
               State i = 2   P(On = D|Xn = 2)              P(On = W|Xn = 2)

TABLE 6.5
Observation Symbol Probability Distribution as a Function of di
                                    Weather Symbol v1 = D    Weather Symbol v2 = W
B = [bi(k)] =  State 1 = Island 1   b1(D) = d1               b1(W) = 1 − d1
               State 2 = Island 2   b2(D) = d2               b2(W) = 1 − d2

State                     X0 = i    X1 = j    X2 = k    ...   Xn = g    Xn+1 = h
Observation Symbol        O0        O1        O2        ...   On        On+1
Symbol Probability        bi(O0)    bj(O1)    bk(O2)    ...   bg(On)    bh(On+1)
Transition Probability    pi(0)     pij       pjk       ...             pgh
Epoch                     0         1         2         ...   n         n + 1

FIGURE 6.1
State transition and observation symbol generation for a sample path.

In summary, a hidden Markov chain starts at epoch 0 in state X0 = i with an
initial state probability pi(0) = P(X0 = i), and generates an observation symbol
O0 with a symbol probability bi(O0) = P(O0|X0 = i). At epoch 1, with transition
probability pij = P(X1 = j|X0 = i), the hidden chain moves to state X1 = j, and
generates an observation symbol O1 with symbol probability bj(O1) = P(O1|X1 = j).
Continuing in this manner, a sequence X = {X0, X1, … , XM} of hidden states
(a sample path) generates a corresponding sequence O = {O0, O1, … , OM} of
observation symbols. The process of state transition and observation symbol
generation for a sample path is illustrated in Figure 6.1.

6.2.2 Generating an Observation Sequence


The following procedure will generate a sequence of M + 1 observations,
O = {O 0, O1, … , OM}.

1. Choose a starting state, X0 = i, according to the initial state probability
vector,

$$p^{(0)} = \begin{bmatrix} p_1^{(0)} & p_2^{(0)} & \cdots & p_N^{(0)} \end{bmatrix} = [P(X_0 = 1) \;\; P(X_0 = 2) \;\; \cdots \;\; P(X_0 = N)].$$


2. Set n = 0.
3. Choose On = vk according to the observation symbol probability dis-
tribution in state i,

bi (k ) = P(On = vk|X n = i), for i = 1, … , N and k = 1, … , K .

4. At epoch n + 1 the hidden Markov chain moves to a state Xn+1 = j
according to the state transition probability pij = P(Xn+1 = j|Xn = i).
5. Set n = n + 1, and return to step 3 if n ≤ M, with the new state j taking the role of i. Otherwise, stop.
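
The generation procedure can be sketched with Python's standard `random` module as shown below. This is a minimal illustration, not part of the original text: the function name `generate_observations`, its argument layout, and the use of a seeded `random.Random` instance are all assumptions. The loop runs over epochs 0 through M so that M + 1 symbols are produced, and the sample parameters are those of the two-island weather example.

```python
import random


def generate_observations(p0, P, B, symbols, M, rng):
    """Return (states, observations) for epochs 0..M; reported states are 1-based."""
    # Step 1: draw the starting state from the initial state probability vector.
    state = rng.choices(range(len(p0)), weights=p0)[0]
    states, observations = [], []
    # Steps 2 and 5: n runs over the epochs 0, 1, ..., M.
    for n in range(M + 1):
        # Step 3: draw a symbol from the distribution attached to the current state.
        k = rng.choices(range(len(symbols)), weights=B[state])[0]
        states.append(state + 1)
        observations.append(symbols[k])
        # Step 4: move to the next hidden state using row `state` of P.
        state = rng.choices(range(len(P)), weights=P[state])[0]
    return states, observations


p0 = [0.35, 0.65]
P = [[0.6, 0.4], [0.7, 0.3]]
B = [[0.55, 0.45], [0.25, 0.75]]  # columns correspond to symbols D and W
rng = random.Random(2011)         # arbitrary seed, chosen only for repeatability
states, observations = generate_observations(p0, P, B, ["D", "W"], 9, rng)
print(states)
print(observations)
```

Only the list of observations would be visible to the scientists at the meteorological station; the list of states is returned here purely for inspection.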

6.2.3 Parameters of an HMM


An HMM can be completely specified in terms of five parameters: N, K, p(0),
P, and B. The first two parameters are numbers, the third is a vector, and
the last two are matrices. Parameter N is the number of states in the hidden
Markov chain, while K is the number of distinct observation symbols per
state, or alphabet size. For the HMM example of the weather on two hidden
islands, N = 2 and K = 2. Parameter p(0) is the initial state probability vec-
tor for the hidden Markov chain. Parameter P = [pij] is the transition prob-
ability matrix for the hidden Markov chain. Finally, parameter B = [bi(k)] is
the observation symbol probability distribution matrix. A compact notation
for the set of the three probability parameters used to specify an HMM is
λ = {p(0), P, B}. As Equations (6.85), (6.84), and (6.89) indicate, the set of three
probability parameters for the HMM example of the weather on two hidden
islands is given by

$$\lambda = \left\{ [0.35 \;\; 0.65], \; \begin{bmatrix} 0.6 & 0.4 \\ 0.7 & 0.3 \end{bmatrix}, \; \begin{bmatrix} 0.55 & 0.45 \\ 0.25 & 0.75 \end{bmatrix} \right\}. \tag{6.90}$$

6.2.4 Three Basic Problems for HMMs


In order for an HMM to be useful in applications, three basic problems must
be addressed. The three problems can be linked together under a proba-
bilistic framework. While an exact solution can be obtained for problem 1,
problem 2 can be solved in several ways. No analytical solution can be found
for problem 3, which is not treated in this book.
Problem 1 is how to calculate the probability that a particular sequence
of observations, represented by O={O0, O1, … , OM}, is generated by a given
model specified by the set of parameters λ ={p(0), P, B}. This probability is
denoted by P(O|λ). Problem 1 is useful in evaluating how well a given model
matches a particular observation sequence.


In problem 2 a sequence of observations, represented by O = {O0, O1, … , OM},


is given, and a model is specified by the set of parameters λ = {p(0), P, B}.
The objective is to choose a corresponding hidden sequence of states, X = {X0,
X1, … , X M}, which best fits or explains the given observation sequence. Among
several possible optimality criteria for determining a best fit, the one selected
is to choose the state sequence that is most likely to have generated the given
observation sequence.
Problem 3 is concerned with how to adjust the model parameters, λ = {p(0),
P, B}, so as to maximize the probability of a particular observation sequence,
P(O|λ). Rabiner [4] and references [5, 6] describe iterative procedures for
choosing model parameters such that P(O|λ) is locally maximized.

6.2.4.1 Solution to Problem 1


If an HMM is extremely small, problem 1 can be solved by exhaustive enu-
meration. Otherwise, problem 1 is solved more efficiently by a forward pro-
cedure described in Section 6.2.4.1.2, or by a backward procedure described
in Section 6.2.4.1.3.

6.2.4.1.1 Exhaustive Enumeration


Given a very small model and a particular observation sequence, the sim-
plest way to solve problem 1 is by exhaustive enumeration. This would
involve enumerating every possible state sequence capable of producing the
particular observation sequence, calculating the probability of each of these
enumerated state sequences, and adding these probabilities together. Recall
that the set of parameters λ = {p(0), P, B} for the HMM of the weather on two
hidden islands is given in Equation (6.90).
Consider the particular sequence, O = {O0, O1, O2} = {W, D, W}, of M + 1 = 3
observations. This particular observation sequence can be generated by a
corresponding sequence of hidden states denoted by X = {X0, X1, X2}. The state
space for distinguishing the two islands is E = {1, 2}. Since the observation
sequence consists of three observation symbols, and each observation symbol
can be generated by one of two states, the number of possible three-state
sequences that must be enumerated is (2)(2)(2) = 2^3 = 8. Given the hidden
states, the observations are assumed to be independent random variables. Then
the probability that the observation sequence O = {O0, O1, O2} = {W, D, W} is
generated by the state sequence X = {X0, X1, X2} = {i, j, k} is equal to

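$$
\begin{aligned}
&P(X_0 = i)\,P(O_0 = W \mid X_0 = i)\,P(X_1 = j \mid X_0 = i)\,P(O_1 = D \mid X_1 = j)\,P(X_2 = k \mid X_1 = j)\,P(O_2 = W \mid X_2 = k) \\
&\quad = p_i^{(0)}\, b_i(W)\, p_{ij}\, b_j(D)\, p_{jk}\, b_k(W).
\end{aligned}
\tag{6.91}
$$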

In Table 6.6 below, all eight possible three-state sequences, X = {X0, X1, X2},
are enumerated. The joint probabilities,

$$
P(X, O \mid \lambda) = P(X_0, X_1, X_2; O_0, O_1, O_2 \mid \lambda) = P(X_0, X_1, X_2; W, D, W \mid \lambda) \tag{6.92}
$$

are calculated for each three-state sequence.

TABLE 6.6
Enumeration of Joint Probabilities for all Eight Possible Three-State Sequences

X0  X1  X2   Joint Probability P(X0, X1, X2; W, D, W | λ)
1   1   1    p1(0) b1(W) p11 b1(D) p11 b1(W) = 0.35(0.45)0.6(0.55)0.6(0.45) = 0.0140332
1   1   2    p1(0) b1(W) p11 b1(D) p12 b2(W) = 0.35(0.45)0.6(0.55)0.4(0.75) = 0.0155925
1   2   1    p1(0) b1(W) p12 b2(D) p21 b1(W) = 0.35(0.45)0.4(0.25)0.7(0.45) = 0.00496125
1   2   2    p1(0) b1(W) p12 b2(D) p22 b2(W) = 0.35(0.45)0.4(0.25)0.3(0.75) = 0.00354375
2   1   1    p2(0) b2(W) p21 b1(D) p11 b1(W) = 0.65(0.75)0.7(0.55)0.6(0.45) = 0.0506756
2   1   2    p2(0) b2(W) p21 b1(D) p12 b2(W) = 0.65(0.75)0.7(0.55)0.4(0.75) = 0.0563062  ← highest
2   2   1    p2(0) b2(W) p22 b2(D) p21 b1(W) = 0.65(0.75)0.3(0.25)0.7(0.45) = 0.0115171
2   2   2    p2(0) b2(W) p22 b2(D) p22 b2(W) = 0.65(0.75)0.3(0.25)0.3(0.75) = 0.0082265625
Sum over X0, X1, X2:  P(W, D, W | λ) = 0.164856

The sum, taken over all values of X, of these eight joint probabilities is
equal to the marginal probability, P(O|λ). That is,

$$
\begin{aligned}
P(O \mid \lambda) &= \sum_{\text{all } X} P(X, O \mid \lambda) = \sum_{\text{all } X} P(O \mid X, \lambda)\, P(X \mid \lambda) \\
&= P(O_0, O_1, O_2 \mid \lambda) = P(W, D, W \mid \lambda) = \sum_{X_0, X_1, X_2} P(X_0, X_1, X_2; W, D, W \mid \lambda).
\end{aligned}
\tag{6.93}
$$


In this small example involving a sequence of M + 1 = 3 observation symbols
and N = 2 states, approximately $[2(M + 1) - 1]N^{M+1} = [2(3) - 1]2^3 = 40$
multiplications were needed to calculate

$$
P(O \mid \lambda) = P(O_0, O_1, O_2 \mid \lambda) = P(W, D, W \mid \lambda) = 0.164856 \tag{6.94}
$$

by using exhaustive enumeration in Table 6.6. However, in larger problems,
exhaustive enumeration is not a practical procedure for calculating P(O|λ) to
solve problem 1, because the number of state sequences, $N^{M+1}$, grows
exponentially with the length of the observation sequence. Either one of two
alternative procedures, called the forward procedure and the backward
procedure, can be used to solve problem 1 far more efficiently than
exhaustive enumeration. Calculations in the forward procedure move forward
in time, while those in the backward procedure move backward.
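The enumeration in Table 6.6 can be sketched in a few lines of Python. This is only an illustrative sketch: the parameter values are read off the numerical factors in Table 6.6 (Equation (6.90) itself is not repeated here), states 1 and 2 are stored at list indices 0 and 1, and the names `joint_probability` and `enumerate_likelihood` are hypothetical.

```python
from itertools import product

# Parameters of the two-island example, read off the factors in Table 6.6.
# States 1 and 2 are stored at indices 0 and 1.
p0 = [0.35, 0.65]                 # initial state probabilities p_i(0)
P = [[0.6, 0.4],                  # transition probabilities p_ij
     [0.7, 0.3]]
B = [{"W": 0.45, "D": 0.55},      # b_1(W), b_1(D)
     {"W": 0.75, "D": 0.25}]      # b_2(W), b_2(D)


def joint_probability(states, obs):
    """P(X, O | lambda) for one state sequence, as in Equation (6.91)."""
    prob = p0[states[0]] * B[states[0]][obs[0]]
    for n in range(1, len(obs)):
        prob *= P[states[n - 1]][states[n]] * B[states[n]][obs[n]]
    return prob


def enumerate_likelihood(obs):
    """Sum the joint probability over every possible state sequence."""
    table = {}
    for states in product(range(len(p0)), repeat=len(obs)):
        label = tuple(s + 1 for s in states)   # report states as 1 and 2
        table[label] = joint_probability(states, obs)
    return sum(table.values()), table


likelihood, table = enumerate_likelihood(["W", "D", "W"])
for seq, prob in table.items():
    print(seq, round(prob, 7))
print("P(O | lambda) =", round(likelihood, 6))
```

The loop visits $N^{M+1}$ sequences, so the sketch is useful only for checking very small models such as this one.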

6.2.4.1.2 Forward Procedure


The forward procedure defines a forward variable,

$$
\alpha_n(i) = P(O_0, O_1, O_2, \ldots, O_n, X_n = i \mid \lambda) \tag{6.95}
$$

which is the joint probability that the partial sequence {O0, O1, O2, . . . , On} of
n + 1 observations is generated from epoch 0 until epoch n, and the HMM
is in state i at epoch n. The forward procedure has three steps, which are
labeled initialization, induction, and termination. The three steps will be
described in reference to the small HMM of the weather on two hidden
islands.
Step 1. Initialization
At epoch 0, the forward variable is

$$
\alpha_0(i) = P(O_0, X_0 = i \mid \lambda) = P(X_0 = i)\,P(O_0 \mid X_0 = i) = p_i^{(0)} b_i(O_0) \quad \text{for } 1 \le i \le N = 2,
$$

so that

$$
\alpha_0(1) = p_1^{(0)} b_1(O_0), \qquad \alpha_0(2) = p_2^{(0)} b_2(O_0).
$$

Step 2. Induction
At epoch 1,

$$
\begin{aligned}
\alpha_1(j) &= P(O_0, O_1, X_1 = j \mid \lambda) \\
&= P(X_0 = 1)\,P(O_0 \mid X_0 = 1)\,P(X_1 = j \mid X_0 = 1)\,P(O_1 \mid X_1 = j) \\
&\quad + P(X_0 = 2)\,P(O_0 \mid X_0 = 2)\,P(X_1 = j \mid X_0 = 2)\,P(O_1 \mid X_1 = j) \\
&= p_1^{(0)} b_1(O_0)\, p_{1j}\, b_j(O_1) + p_2^{(0)} b_2(O_0)\, p_{2j}\, b_j(O_1) \\
&= [\,p_1^{(0)} b_1(O_0)\, p_{1j} + p_2^{(0)} b_2(O_0)\, p_{2j}\,]\, b_j(O_1) \\
&= [\,\alpha_0(1)\, p_{1j} + \alpha_0(2)\, p_{2j}\,]\, b_j(O_1) \\
&= \Big[\sum_{i=1}^{2} \alpha_0(i)\, p_{ij}\Big] b_j(O_1) \quad \text{for } 1 \le j \le N = 2.
\end{aligned}
$$

Hence, at epoch 1, the forward variable is

$$
\alpha_1(j) = \Big[\sum_{i=1}^{2} \alpha_0(i)\, p_{ij}\Big] b_j(O_1) \quad \text{for } 1 \le j \le N = 2.
$$

Similarly, at epoch 2,

$$
\begin{aligned}
\alpha_2(j) &= P(O_0, O_1, O_2, X_2 = j \mid \lambda) \\
&= p_1^{(0)} b_1(O_0)\, p_{11} b_1(O_1)\, p_{1j} b_j(O_2) + p_1^{(0)} b_1(O_0)\, p_{12} b_2(O_1)\, p_{2j} b_j(O_2) \\
&\quad + p_2^{(0)} b_2(O_0)\, p_{21} b_1(O_1)\, p_{1j} b_j(O_2) + p_2^{(0)} b_2(O_0)\, p_{22} b_2(O_1)\, p_{2j} b_j(O_2) \\
&= p_1^{(0)} b_1(O_0)\, p_{11} b_1(O_1)\, p_{1j} b_j(O_2) + p_2^{(0)} b_2(O_0)\, p_{21} b_1(O_1)\, p_{1j} b_j(O_2) \\
&\quad + p_1^{(0)} b_1(O_0)\, p_{12} b_2(O_1)\, p_{2j} b_j(O_2) + p_2^{(0)} b_2(O_0)\, p_{22} b_2(O_1)\, p_{2j} b_j(O_2).
\end{aligned}
$$

Substituting $\alpha_0(1) = p_1^{(0)} b_1(O_0)$ and $\alpha_0(2) = p_2^{(0)} b_2(O_0)$,

$$
\begin{aligned}
\alpha_2(j) &= \alpha_0(1)\, p_{11} b_1(O_1)\, p_{1j} b_j(O_2) + \alpha_0(2)\, p_{21} b_1(O_1)\, p_{1j} b_j(O_2) \\
&\quad + \alpha_0(1)\, p_{12} b_2(O_1)\, p_{2j} b_j(O_2) + \alpha_0(2)\, p_{22} b_2(O_1)\, p_{2j} b_j(O_2) \\
&= [\,\alpha_0(1) p_{11} + \alpha_0(2) p_{21}\,]\, b_1(O_1)\, p_{1j} b_j(O_2) \\
&\quad + [\,\alpha_0(1) p_{12} + \alpha_0(2) p_{22}\,]\, b_2(O_1)\, p_{2j} b_j(O_2).
\end{aligned}
$$

Substituting $\alpha_1(1) = [\alpha_0(1) p_{11} + \alpha_0(2) p_{21}]\, b_1(O_1)$ and
$\alpha_1(2) = [\alpha_0(1) p_{12} + \alpha_0(2) p_{22}]\, b_2(O_1)$,

$$
\begin{aligned}
\alpha_2(j) &= \alpha_1(1)\, p_{1j} b_j(O_2) + \alpha_1(2)\, p_{2j} b_j(O_2) \\
&= [\,\alpha_1(1)\, p_{1j} + \alpha_1(2)\, p_{2j}\,]\, b_j(O_2) \\
&= \Big[\sum_{i=1}^{2} \alpha_1(i)\, p_{ij}\Big] b_j(O_2).
\end{aligned}
$$

Thus, at epoch 2 the forward variable is

$$
\alpha_2(j) = \Big[\sum_{i=1}^{2} \alpha_1(i)\, p_{ij}\Big] b_j(O_2) \quad \text{for } 1 \le j \le N = 2.
$$

By induction, one may conclude that in the general case the forward
variable is

$$
\alpha_{n+1}(j) = \Big[\sum_{i=1}^{N} \alpha_n(i)\, p_{ij}\Big] b_j(O_{n+1}) \quad \text{for epochs } 0 \le n \le M - 1 \text{ and states } 1 \le j \le N.
$$

Step 3. Termination
At epoch M = 2, the forward procedure ends with the desired probability
expressed as the sum of the terminal forward variables.

$$
\begin{aligned}
P(O \mid \lambda) &= \sum_{\text{all values of } X_M} P(O, X_M \mid \lambda) \\
&= P(O_0, O_1, O_2 \mid \lambda) = P(O_0, O_1, O_2, X_2 = 1 \mid \lambda) + P(O_0, O_1, O_2, X_2 = 2 \mid \lambda) \\
&= \alpha_2(1) + \alpha_2(2) = \sum_{i=1}^{2} \alpha_2(i).
\end{aligned}
$$

The complete forward procedure is given below:

Step 1. Initialization
At epoch 0, the forward variable is

$$
\alpha_0(i) = p_i^{(0)} b_i(O_0) \quad \text{for } 1 \le i \le N.
$$

Step 2. Induction

$$
\alpha_{n+1}(j) = \Big[\sum_{i=1}^{N} \alpha_n(i)\, p_{ij}\Big] b_j(O_{n+1}) \quad \text{for epochs } 0 \le n \le M - 1 \text{ and states } 1 \le j \le N.
$$

Step 3. Termination

$$
P(O \mid \lambda) = P(O_0, O_1, O_2, \ldots, O_M \mid \lambda) = \sum_{i=1}^{N} \alpha_M(i).
$$
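The three steps above translate almost line for line into code. The following Python sketch is one possible rendering, not a definitive implementation: the function name `forward`, the list-of-dictionaries layout for B, and the zero-based state indices are assumptions made for illustration, and the parameter values are those that appear as factors in Table 6.6.

```python
def forward(p0, P, B, obs):
    """Return (P(O | lambda), alpha), where alpha[n][i] is the forward variable."""
    n_states = len(p0)
    # Step 1. Initialization: alpha_0(i) = p_i(0) * b_i(O_0)
    alpha = [[p0[i] * B[i][obs[0]] for i in range(n_states)]]
    # Step 2. Induction: alpha_{n+1}(j) = [sum_i alpha_n(i) p_ij] * b_j(O_{n+1})
    for n in range(len(obs) - 1):
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * P[i][j] for i in range(n_states)) * B[j][obs[n + 1]]
            for j in range(n_states)
        ])
    # Step 3. Termination: sum the terminal forward variables
    return sum(alpha[-1]), alpha


# Two-island example; states 1 and 2 are stored at indices 0 and 1.
p0 = [0.35, 0.65]
P = [[0.6, 0.4],
     [0.7, 0.3]]
B = [{"W": 0.45, "D": 0.55},
     {"W": 0.75, "D": 0.25}]

likelihood, alpha = forward(p0, P, B, ["W", "D", "W"])
print(alpha)
print(likelihood)
```

Keeping the whole list `alpha`, rather than only the latest row, costs a little memory but makes the intermediate forward variables available for inspection at every epoch.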

The forward procedure will be executed to calculate P(O|λ) = P(O0, O1, O2|λ) =
P(W, D, W|λ) for the small example concerning the weather on two hidden
islands. Recall that the set of parameters λ = {p(0), P, B} for this example is
given in Equation (6.90).
Step 1. Initialization
At epoch 0,

$$
\begin{aligned}
\alpha_0(i) &= p_i^{(0)} b_i(O_0) = p_i^{(0)} b_i(W) \\
\alpha_0(1) &= p_1^{(0)} b_1(W) = (0.35)(0.45) = 0.1575 \\
\alpha_0(2) &= p_2^{(0)} b_2(W) = (0.65)(0.75) = 0.4875
\end{aligned}
$$

Step 2. Induction
At epoch 1,

$$
\begin{aligned}
\alpha_1(j) &= \Big[\sum_{i=1}^{2} \alpha_0(i)\, p_{ij}\Big] b_j(O_1) = [\,\alpha_0(1)\, p_{1j} + \alpha_0(2)\, p_{2j}\,]\, b_j(D) \\
\alpha_1(1) &= [\,\alpha_0(1)\, p_{11} + \alpha_0(2)\, p_{21}\,]\, b_1(D) = [(0.1575)(0.6) + (0.4875)(0.7)](0.55) = 0.2396625 \\
\alpha_1(2) &= [\,\alpha_0(1)\, p_{12} + \alpha_0(2)\, p_{22}\,]\, b_2(D) = [(0.1575)(0.4) + (0.4875)(0.3)](0.25) = 0.0523125
\end{aligned}
$$

At epoch 2,

$$
\begin{aligned}
\alpha_2(j) &= \Big[\sum_{i=1}^{2} \alpha_1(i)\, p_{ij}\Big] b_j(O_2) = [\,\alpha_1(1)\, p_{1j} + \alpha_1(2)\, p_{2j}\,]\, b_j(W) \\
\alpha_2(1) &= [\,\alpha_1(1)\, p_{11} + \alpha_1(2)\, p_{21}\,]\, b_1(W) \\
&= [(0.2396625)(0.6) + (0.0523125)(0.7)](0.45) = 0.0811873 \\
\alpha_2(2) &= [\,\alpha_1(1)\, p_{12} + \alpha_1(2)\, p_{22}\,]\, b_2(W) \\
&= [(0.2396625)(0.4) + (0.0523125)(0.3)](0.75) = 0.083669.
\end{aligned}
$$

Step 3. Termination
At epoch M = 2,

$$
\begin{aligned}
P(O \mid \lambda) &= P(O_0, O_1, O_2 \mid \lambda) = P(W, D, W \mid \lambda) = \sum_{i=1}^{2} \alpha_2(i) = \alpha_2(1) + \alpha_2(2) \\
&= 0.0811873 + 0.083669 = 0.1648563
\end{aligned}
\tag{6.96}
$$

in agreement with the result obtained by exhaustive enumeration in
Table 6.6.
In contrast to exhaustive enumeration, which required about 40
multiplications to calculate P(O|λ) = P(O0, O1, O2|λ) for this small HMM
involving a sequence of M + 1 = 3 observation symbols and N = 2 states, the
forward procedure required only N(N + 1)M + N = 2(3)(2) + 2 = 14
multiplications.
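To see how quickly the two multiplication counts diverge, the two expressions quoted in the text can simply be evaluated for longer observation sequences. The sketch below does only that; the function names are hypothetical, and the formulas are taken directly from the text, $[2(M + 1) - 1]N^{M+1}$ for exhaustive enumeration and $N(N + 1)M + N$ for the forward procedure.

```python
def enumeration_multiplications(n_states, n_obs):
    """Approximate count for exhaustive enumeration: [2(M+1) - 1] * N**(M+1)."""
    return (2 * n_obs - 1) * n_states ** n_obs


def forward_multiplications(n_states, n_obs):
    """Count for the forward procedure: N(N+1)M + N, where M = n_obs - 1."""
    m = n_obs - 1
    return n_states * (n_states + 1) * m + n_states


# Compare the two counts for N = 2 states and increasing sequence lengths.
for n_obs in (3, 5, 10, 20):
    print(n_obs,
          enumeration_multiplications(2, n_obs),
          forward_multiplications(2, n_obs))
```

For the three-observation example the two functions return 40 and 14, matching the counts stated above; the enumeration count grows exponentially in the sequence length while the forward count grows linearly.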


6.2.4.1.3 Backward Procedure


Although the forward procedure alone is sufficient for solving problem 1, the
backward procedure is introduced as an alternative procedure for solving
problem 1 because both forward and backward calculations are needed for
solving problem 3. Calculations in the backward procedure move backward
in time. The backward procedure defines a backward variable,

$$
\beta_n(i) = P(O_{n+1}, O_{n+2}, O_{n+3}, \ldots, O_M \mid X_n = i, \lambda) \tag{6.97}
$$

which is the conditional probability that the partial sequence {On+1, On+2,
On+3, . . . , OM} of M − n observations is generated from epoch n + 1 until epoch
M, given that the HMM is in state i at epoch n. The backward procedure,
when it is applied to solve problem 1, also has three steps called initialization,
induction, and termination. The three-step backward procedure used to
solve problem 1 is given below.
Step 1. Initialization
At epoch M, the backward variable is

$$
\beta_M(i) = 1 \quad \text{for } 1 \le i \le N.
$$

Step 2. Induction

$$
\beta_n(i) = \sum_{j=1}^{N} p_{ij}\, b_j(O_{n+1})\, \beta_{n+1}(j) \quad \text{for epochs } n = M - 1, M - 2, \ldots, 0 \text{ and states } 1 \le i \le N.
$$

Step 3. Termination
At epoch 0,

$$
\begin{aligned}
P(O \mid \lambda) &= P(O_0, O_1, O_2, \ldots, O_M \mid \lambda) = \sum_{i=1}^{N} P(O_0, X_0 = i \mid \lambda)\, P(O_1, O_2, O_3, \ldots, O_M \mid X_0 = i, \lambda) \\
&= \sum_{i=1}^{N} P(X_0 = i)\, P(O_0 \mid X_0 = i)\, P(O_1, O_2, O_3, \ldots, O_M \mid X_0 = i, \lambda) \\
&= \sum_{i=1}^{N} p_i^{(0)} b_i(O_0)\, \beta_0(i) = \sum_{i=1}^{N} \alpha_0(i)\, \beta_0(i),
\end{aligned}
$$

after substituting $\alpha_0(i) = p_i^{(0)} b_i(O_0)$ (the initialization step of the forward
procedure).
The backward procedure will be executed to calculate P(O|λ) = P(O0, O1,
O2|λ) = P(W, D, W|λ) for the small example involving the weather on two
hidden islands.


Step 1. Initialization
At epoch M = 2,

$$
\beta_2(1) = 1 \quad \text{and} \quad \beta_2(2) = 1.
$$

Step 2. Induction
At epoch 1,

$$
\begin{aligned}
\beta_1(i) &= \sum_{j=1}^{2} p_{ij}\, b_j(O_2)\, \beta_2(j) = p_{i1} b_1(W)\, \beta_2(1) + p_{i2} b_2(W)\, \beta_2(2) \\
\beta_1(1) &= p_{11} b_1(W)\, \beta_2(1) + p_{12} b_2(W)\, \beta_2(2) = p_{11} b_1(W)(1) + p_{12} b_2(W)(1) \\
&= (0.6)(0.45)(1) + (0.4)(0.75)(1) = 0.57 \\
\beta_1(2) &= p_{21} b_1(W)\, \beta_2(1) + p_{22} b_2(W)\, \beta_2(2) = p_{21} b_1(W)(1) + p_{22} b_2(W)(1) \\
&= (0.7)(0.45)(1) + (0.3)(0.75)(1) = 0.54.
\end{aligned}
$$

At epoch 0,

$$
\begin{aligned}
\beta_0(i) &= \sum_{j=1}^{2} p_{ij}\, b_j(O_1)\, \beta_1(j) = p_{i1} b_1(D)\, \beta_1(1) + p_{i2} b_2(D)\, \beta_1(2) \\
\beta_0(1) &= p_{11} b_1(D)\, \beta_1(1) + p_{12} b_2(D)\, \beta_1(2) = (0.6)(0.55)(0.57) + (0.4)(0.25)(0.54) = 0.2421 \\
\beta_0(2) &= p_{21} b_1(D)\, \beta_1(1) + p_{22} b_2(D)\, \beta_1(2) = (0.7)(0.55)(0.57) + (0.3)(0.25)(0.54) = 0.25995.
\end{aligned}
$$

Step 3. Termination
At epoch 0, after substituting $\alpha_0(i) = p_i^{(0)} b_i(O_0)$,

$$
\begin{aligned}
\alpha_0(1) &= p_1^{(0)} b_1(W) = (0.35)(0.45) = 0.1575 \\
\alpha_0(2) &= p_2^{(0)} b_2(W) = (0.65)(0.75) = 0.4875, \\
P(O \mid \lambda) &= P(O_0, O_1, O_2 \mid \lambda) = P(W, D, W \mid \lambda) = \sum_{i=1}^{2} \alpha_0(i)\, \beta_0(i) = \alpha_0(1)\, \beta_0(1) + \alpha_0(2)\, \beta_0(2) \\
&= (0.1575)(0.2421) + (0.4875)(0.25995) = 0.1648563,
\end{aligned}
$$

which is the same result obtained by exhaustive enumeration in Table 6.6
and Equation (6.94), and by the forward procedure in Equation (6.96).
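The backward procedure can be sketched in Python in the same way. As before, this is an illustrative sketch under stated assumptions: the function name `backward` and the data layout are hypothetical, states 1 and 2 are stored at indices 0 and 1, and the parameter values are those appearing as factors in Table 6.6.

```python
def backward(p0, P, B, obs):
    """Return (P(O | lambda), beta), where beta[n][i] is the backward variable."""
    n_states = len(p0)
    last = len(obs) - 1                      # epoch M
    # Step 1. Initialization: beta_M(i) = 1
    beta = [[1.0] * n_states for _ in obs]
    # Step 2. Induction, moving backward from epoch M - 1 to epoch 0
    for n in range(last - 1, -1, -1):
        for i in range(n_states):
            beta[n][i] = sum(
                P[i][j] * B[j][obs[n + 1]] * beta[n + 1][j]
                for j in range(n_states)
            )
    # Step 3. Termination: sum_i p_i(0) * b_i(O_0) * beta_0(i)
    likelihood = sum(p0[i] * B[i][obs[0]] * beta[0][i] for i in range(n_states))
    return likelihood, beta


# Two-island example; states 1 and 2 are stored at indices 0 and 1.
p0 = [0.35, 0.65]
P = [[0.6, 0.4],
     [0.7, 0.3]]
B = [{"W": 0.45, "D": 0.55},
     {"W": 0.75, "D": 0.25}]

likelihood, beta = backward(p0, P, B, ["W", "D", "W"])
print(beta)
print(likelihood)
```

Up to floating-point rounding, the sketch reproduces the hand calculation above, with the backward variables 0.57 and 0.54 at epoch 1, and 0.2421 and 0.25995 at epoch 0.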

6.2.4.2 Solution to Problem 2


Problem 2 assumes that a sequence of symbols has been observed, and the
parameter set λ of the model has been specified. Given the sequence of
observation symbols, O = {O0, O1, . . . , OM}, the objective of problem 2 is to
identify a corresponding sequence of states, X = {X0, X1, . . . , XM}, which is
most likely to have generated the given observation sequence. In other
words, given the observation sequence, O = {O0, O1, . . . , OM}, the objective of
problem 2 is to find a corresponding state sequence, X = {X0, X1, . . . , XM},
which maximizes the probability of having generated the given observation
sequence. Mathematically, this objective is to find the state sequence X,
which maximizes the conditional probability, P(X|O, λ). This objective is
expressed by the notation argmax_X P(X|O, λ). The term argmax refers to the
argument X which yields the maximum value of P(X|O, λ). By the definition
of conditional probability,

$$
\max_{X} P(X \mid O, \lambda) = \max_{X} \frac{P(X, O \mid \lambda)}{P(O \mid \lambda)}. \tag{6.98}
$$

Since the denominator on the right hand side is not a function of X,
maximizing the conditional probability P(X|O, λ) with respect to X is
equivalent to maximizing the numerator P(X, O|λ) with respect to X. Hence an
equivalent objective of problem 2 is to find the state sequence X, which
maximizes P(X, O|λ), the joint probability of state sequence X and
observation sequence O.
A formal procedure called the Viterbi algorithm, which is based on
dynamic programming, is used to calculate max_X P(X, O|λ) and find
argmax_X P(X, O|λ). The algorithm first calculates max_X P(X, O|λ). Then
the algorithm works backward or backtracks to recover the state sequence
X which maximizes P(X, O|λ). If this state sequence is not unique, the
Viterbi algorithm will find one state sequence, which maximizes P(X, O|λ).
To find the state sequence X that is most likely to correspond to the given
observation sequence O, the algorithm defines, at epoch n, the quantity

$$
\delta_n(j) = \max_{X_0, X_1, \ldots, X_{n-1}} P(X_0, X_1, X_2, \ldots, X_n = j, O_0, O_1, O_2, \ldots, O_n \mid \lambda). \tag{6.99a}
$$

The quantity δn(j) is the highest probability, taken over all choices of the
earlier states X0, X1, . . . , Xn−1, of a state sequence that ends in state Xn = j
at epoch n and accounts for the first n + 1 observations.
At epoch 0, δ0(i) is initialized as

$$
\delta_0(i) = p_i^{(0)} b_i(O_0). \tag{6.99b}
$$

As is true for the forward variable, αn(i), and the backward variable, βn(i), the
probability δn(i) can be calculated by induction. The induction equation for
calculating δn(i) is developed informally below.
At epoch 0,

$$
\delta_0(i) = p_i^{(0)} b_i(O_0) \quad \text{for } i = 1, 2, \ldots, N.
$$


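At epoch 1,

$$
\begin{aligned}
\delta_1(X_1 = j) &= \max_{X_0}\; p_{X_0}^{(0)}\, b_{X_0}(O_0)\, p_{X_0, X_1}\, b_{X_1}(O_1) \\
\delta_1(j) &= \max_{i = 1, \ldots, N} [\,p_i^{(0)} b_i(O_0)\,]\, p_{ij}\, b_j(O_1) = \max_{i = 1, \ldots, N} [\,\delta_0(i)\, p_{ij}\,]\, b_j(O_1) \quad \text{for } j = 1, \ldots, N.
\end{aligned}
$$

At epoch 2,

$$
\begin{aligned}
\delta_2(X_2 = k) &= \max_{X_1} \Big[ \max_{X_0}\; p_{X_0}^{(0)}\, b_{X_0}(O_0)\, p_{X_0, X_1}\, b_{X_1}(O_1) \Big]\, p_{X_1, X_2}\, b_{X_2}(O_2) \\
&= \max_{j = 1, \ldots, N} \Big[ \max_{i = 1, \ldots, N}\; p_i^{(0)} b_i(O_0)\, p_{ij}\, b_j(O_1) \Big]\, p_{jk}\, b_k(O_2) \\
&= \max_{j = 1, \ldots, N} [\,\delta_1(j)\, p_{jk}\,]\, b_k(O_2).
\end{aligned}
$$

Renaming the index of maximization from j to i,

$$
\delta_2(k) = \max_{i = 1, \ldots, N} [\,\delta_1(i)\, p_{ik}\,]\, b_k(O_2) \quad \text{for } k = 1, \ldots, N.
$$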

At epoch n + 1, for n = 0, . . . , M − 1, by induction,

$$
\delta_{n+1}(j) = \max_{i} [\,\delta_n(i)\, p_{ij}\,]\, b_j(O_{n+1}) \quad \text{for } j = 1, \ldots, N.
$$

Observe that at epoch M,

$$
\delta_M(j) = \max_{i} [\,\delta_{M-1}(i)\, p_{ij}\,]\, b_j(O_M) \quad \text{for } j = 1, \ldots, N.
$$

Recall that δM(j) is the highest probability of a state sequence ending in
state XM = j at epoch M, which accounts for the observation sequence ending
in symbol OM at epoch M. Therefore, the joint probability of this state
sequence and the associated observation sequence is

$$
\delta_M(j) = \max_{X_0, X_1, \ldots, X_{M-1}} P(X_0, X_1, X_2, \ldots, X_M = j, O_0, O_1, O_2, \ldots, O_M \mid \lambda). \tag{6.100}
$$

To recover the state sequence {X0, X1, X2, . . . , XM}, which has the highest
probability of generating the observation sequence {O0, O1, O2, . . . , OM}, it is
necessary to keep a record of the argument, which maximizes the induction
equation for δn+1(j) for each n and j. To construct this record, an array ψn(j)
is defined as

$$
\psi_n(j) = \operatorname*{argmax}_{i = 1, \ldots, N} [\,\delta_{n-1}(i)\, p_{ij}\,] \quad \text{for } n = 1, \ldots, M \text{ and } j = 1, \ldots, N. \tag{6.101}
$$

The complete Viterbi algorithm is given below:

Step 1. Initialization

$$
\begin{aligned}
\delta_0(i) &= p_i^{(0)} b_i(O_0) \quad \text{for } 1 \le i \le N \\
\psi_0(i) &= 0
\end{aligned}
$$

Step 2. Recursion

$$
\begin{aligned}
\delta_n(j) &= \max_{1 \le i \le N} [\,\delta_{n-1}(i)\, p_{ij}\,]\, b_j(O_n), \quad \text{for } 1 \le n \le M \text{ and } 1 \le j \le N \\
\psi_n(j) &= \operatorname*{argmax}_{1 \le i \le N} [\,\delta_{n-1}(i)\, p_{ij}\,], \quad \text{for } 1 \le n \le M \text{ and } 1 \le j \le N.
\end{aligned}
$$

Step 3. Termination

$$
\begin{aligned}
P^{*} &= \max_{1 \le i \le N} [\,\delta_M(i)\,] \\
X_M^{*} &= \operatorname*{argmax}_{1 \le i \le N} [\,\delta_M(i)\,].
\end{aligned}
$$

Step 4. Backtracking to recover the state sequence

$$
X_n^{*} = \psi_{n+1}(X_{n+1}^{*}), \quad \text{for } n = M - 1, M - 2, \ldots, 0.
$$
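The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `viterbi`, the list-of-dictionaries layout for B, and the zero-based internal indices are hypothetical choices, and the parameter values are those appearing as factors in Table 6.6. When two predecessors tie, this sketch keeps the lowest-numbered state, so it returns one maximizing sequence.

```python
def viterbi(p0, P, B, obs):
    """Return (P*, most likely state sequence with states numbered from 1)."""
    n_states = len(p0)
    # Step 1. Initialization: delta_0(i) = p_i(0) * b_i(O_0), psi_0(i) = 0
    delta = [[p0[i] * B[i][obs[0]] for i in range(n_states)]]
    psi = [[0] * n_states]
    # Step 2. Recursion
    for n in range(1, len(obs)):
        delta_row, psi_row = [], []
        for j in range(n_states):
            scores = [delta[n - 1][i] * P[i][j] for i in range(n_states)]
            best_i = max(range(n_states), key=scores.__getitem__)
            delta_row.append(scores[best_i] * B[j][obs[n]])
            psi_row.append(best_i)
        delta.append(delta_row)
        psi.append(psi_row)
    # Step 3. Termination
    last_state = max(range(n_states), key=delta[-1].__getitem__)
    best_prob = delta[-1][last_state]
    # Step 4. Backtracking: X*_n = psi_{n+1}(X*_{n+1})
    path = [last_state]
    for n in range(len(obs) - 1, 0, -1):
        path.append(psi[n][path[-1]])
    path.reverse()
    return best_prob, [s + 1 for s in path]


# Two-island example; states 1 and 2 are stored at indices 0 and 1.
p0 = [0.35, 0.65]
P = [[0.6, 0.4],
     [0.7, 0.3]]
B = [{"W": 0.45, "D": 0.55},
     {"W": 0.75, "D": 0.25}]

best_prob, best_path = viterbi(p0, P, B, ["W", "D", "W"])
print(best_path, best_prob)
```

For the observation sequence {W, D, W} the sketch selects the state sequence {2, 1, 2}, the row flagged as highest in Table 6.6.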

The Viterbi algorithm will be applied to the example HMM of the weather
on two hidden islands to recover the state sequence, X = {X0, X1, X2}, which
has the highest probability of generating the weather observation sequence,
O = {O0, O1, O2} = {W, D, W}, representing the weather at three consecutive
epochs. Recall that the hidden state indicates which island is the source of
the observed weather symbol. The algorithm calculates max_X P(X, O|λ) and
retrieves argmax_X P(X, O|λ). The set of model parameters λ = {p(0), P, B} for
this example is given in Equation (6.90).
Step 1. Initialization
At epoch 0,

$$
\begin{aligned}
\delta_0(i) &= p_i^{(0)} b_i(O_0) = p_i^{(0)} b_i(W), \quad \text{for } 1 \le i \le 2 \\
\delta_0(1) &= p_1^{(0)} b_1(W) = (0.35)(0.45) = 0.1575 \\
\delta_0(2) &= p_2^{(0)} b_2(W) = (0.65)(0.75) = 0.4875 \\
\psi_0(1) &= \psi_0(2) = 0.
\end{aligned}
$$

Step 2. Recursion
At epoch 1,

$$
\begin{aligned}
\delta_1(j) &= \max_{1 \le i \le 2} [\,\delta_0(i)\, p_{ij}\,]\, b_j(O_1) = \max [\,\delta_0(1)\, p_{1j},\; \delta_0(2)\, p_{2j}\,]\, b_j(D), \quad \text{for } 1 \le j \le 2 \\
\delta_1(1) &= \max [\,\delta_0(1)\, p_{11},\; \delta_0(2)\, p_{21}\,]\, b_1(D) \\
&= \max [(0.1575)(0.6),\; (0.4875)(0.7)](0.55) \\
&= [(0.4875)(0.7)](0.55) = 0.1876875 \\
\delta_1(2) &= \max [\,\delta_0(1)\, p_{12},\; \delta_0(2)\, p_{22}\,]\, b_2(D) \\
&= \max [(0.1575)(0.4),\; (0.4875)(0.3)](0.25) \\
&= [(0.4875)(0.3)](0.25) = 0.0365625 \\
\psi_1(j) &= \operatorname*{argmax}_{1 \le i \le 2} [\,\delta_0(i)\, p_{ij}\,] = \operatorname*{argmax} [\,\delta_0(1)\, p_{1j},\; \delta_0(2)\, p_{2j}\,], \quad \text{for } 1 \le j \le 2 \\
\psi_1(1) &= \operatorname*{argmax} [(0.1575)(0.6),\; (0.4875)(0.7)] = \operatorname*{argmax} [0.0945,\; 0.34125] = 2 \\
\psi_1(2) &= \operatorname*{argmax} [(0.1575)(0.4),\; (0.4875)(0.3)] = \operatorname*{argmax} [0.063,\; 0.14625] = 2.
\end{aligned}
$$

At epoch 2,

$$
\begin{aligned}
\delta_2(j) &= \max_{1 \le i \le 2} [\,\delta_1(i)\, p_{ij}\,]\, b_j(O_2) = \max [\,\delta_1(1)\, p_{1j},\; \delta_1(2)\, p_{2j}\,]\, b_j(W), \quad \text{for } 1 \le j \le 2 \\
\delta_2(1) &= \max [\,\delta_1(1)\, p_{11},\; \delta_1(2)\, p_{21}\,]\, b_1(W) \\
&= \max [(0.1876875)(0.6),\; (0.0365625)(0.7)](0.45) \\
&= [(0.1876875)(0.6)](0.45) = 0.0506756 \\
\delta_2(2) &= \max [\,\delta_1(1)\, p_{12},\; \delta_1(2)\, p_{22}\,]\, b_2(W) \\
&= \max [(0.1876875)(0.4),\; (0.0365625)(0.3)](0.75) \\
&= [(0.1876875)(0.4)](0.75) = 0.0563062 \\
\psi_2(j) &= \operatorname*{argmax}_{1 \le i \le 2} [\,\delta_1(i)\, p_{ij}\,] = \operatorname*{argmax} [\,\delta_1(1)\, p_{1j},\; \delta_1(2)\, p_{2j}\,], \quad \text{for } 1 \le j \le 2 \\
\psi_2(1) &= \operatorname*{argmax} [(0.1876875)(0.6),\; (0.0365625)(0.7)] = \operatorname*{argmax} [0.1126125,\; 0.0255937] = 1 \\
\psi_2(2) &= \operatorname*{argmax} [(0.1876875)(0.4),\; (0.0365625)(0.3)] = \operatorname*{argmax} [0.075075,\; 0.0109687] = 1.
\end{aligned}
$$


Step 3. Termination

$$
\begin{aligned}
P^{*} &= \max_{1 \le i \le 2} [\,\delta_2(i)\,] = \max [\,\delta_2(1),\; \delta_2(2)\,] = \max [0.0506756,\; 0.0563062] = 0.0563062 = \delta_2(2) \\
X_2^{*} &= \operatorname*{argmax}_{1 \le i \le 2} [\,\delta_2(i)\,] = \operatorname*{argmax} [\,\delta_2(1),\; \delta_2(2)\,] = \operatorname*{argmax} [0.0506756,\; 0.0563062] = 2
\end{aligned}
$$

Step 4. Backtracking to recover the state sequence

$$
\begin{aligned}
X_n^{*} &= \psi_{n+1}(X_{n+1}^{*}), \quad \text{for } n = 1, 0 \\
X_1^{*} &= \psi_2(X_2^{*}) = \psi_2(2) = 1, \quad \text{for } n = 1 \\
X_0^{*} &= \psi_1(X_1^{*}) = \psi_1(1) = 2, \quad \text{for } n = 0.
\end{aligned}
$$

Note that P* = 0.0563062 is the highest joint probability that a state sequence
{X0, X1, X2} accounts for the given observation sequence {O0, O1, O2} = {W, D, W}.
Backtracking has retrieved the accountable state sequence X = {X0, X1, X2} =
{2, 1, 2}. Hence,

$$
P^{*} = P(X_0, X_1, X_2; O_0, O_1, O_2 \mid \lambda) = P(2, 1, 2; W, D, W \mid \lambda) = 0.0563062. \tag{6.102}
$$

This result was obtained previously by exhaustive enumeration in Table 6.6,
where it is flagged by an arrow. The solution to problem 2 has identified the
state sequence that is most likely to have occurred, given a particular
observation sequence. Other criteria for identifying a state sequence can be
imposed.
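As one illustration of an alternative criterion, which this section does not develop and which is offered here only as an assumption about what such a criterion might look like, the state at each epoch can be chosen separately as the one that is individually most likely given the whole observation sequence. This uses the forward and backward variables through γn(i) = αn(i)βn(i)/P(O|λ). The names `gamma`, `forward_backward`, and `posterior_decode` below are hypothetical, and the parameter values are those appearing as factors in Table 6.6.

```python
def forward_backward(p0, P, B, obs):
    """Return the forward variables alpha[n][i] and backward variables beta[n][i]."""
    n_states, n_obs = len(p0), len(obs)
    alpha = [[p0[i] * B[i][obs[0]] for i in range(n_states)]]
    for n in range(1, n_obs):
        alpha.append([
            sum(alpha[n - 1][i] * P[i][j] for i in range(n_states)) * B[j][obs[n]]
            for j in range(n_states)
        ])
    beta = [[1.0] * n_states for _ in range(n_obs)]
    for n in range(n_obs - 2, -1, -1):
        for i in range(n_states):
            beta[n][i] = sum(
                P[i][j] * B[j][obs[n + 1]] * beta[n + 1][j]
                for j in range(n_states)
            )
    return alpha, beta


def posterior_decode(p0, P, B, obs):
    """Pick, at each epoch, the state with the largest gamma_n(i)."""
    alpha, beta = forward_backward(p0, P, B, obs)
    likelihood = sum(alpha[-1])
    # gamma_n(i) = alpha_n(i) * beta_n(i) / P(O | lambda)
    gamma = [
        [a * b / likelihood for a, b in zip(a_row, b_row)]
        for a_row, b_row in zip(alpha, beta)
    ]
    states = [max(range(len(p0)), key=row.__getitem__) + 1 for row in gamma]
    return states, gamma


# Two-island example; states 1 and 2 are stored at indices 0 and 1.
p0 = [0.35, 0.65]
P = [[0.6, 0.4],
     [0.7, 0.3]]
B = [{"W": 0.45, "D": 0.55},
     {"W": 0.75, "D": 0.25}]

states, gamma = posterior_decode(p0, P, B, ["W", "D", "W"])
print(states)
```

For this example the epoch-by-epoch criterion happens to select the same sequence {2, 1, 2} as the Viterbi algorithm. In general the two criteria can disagree, because choosing each state separately ignores whether the transitions between the chosen states are likely, or even possible.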

PROBLEMS
6.1 Consider a regular Markov chain that has the following transi-
tion probability matrix:
State 0 1 2 3
0 0.23 0.34 0.26 0.17
P= 1 0.08 0.42 0.18 0.32
2 0.31 0.17 0.23 0.29
3 0.24 0.36 0.34 0.06

Use the MCPA to find the vector of steady-state probabilities.


6.2 Consider a regular Markov chain which has the following tran-
sition probability matrix:

State 0 1 2 3
0 0.23 0.34 0.26 0.17
P= 1 0.08 0.42 0.18 0.32
2 0.31 0.17 0.23 0.29
3 0.24 0.36 0.34 0.06

Use state reduction to find the vector of MFPTs to state 0.


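6.3 Consider an absorbing multichain which has the following
transition probability matrix expressed in canonical form:

$$
P =
\begin{array}{c|cccccc}
\text{State} & 0 & 4 & 5 & 1 & 2 & 3 \\ \hline
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
4 & 0 & 1 & 0 & 0 & 0 & 0 \\
5 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 0.15 & 0.05 & 0.20 & 0.2 & 0.3 & 0.1 \\
2 & 0.09 & 0.03 & 0.18 & 0.1 & 0.4 & 0.2 \\
3 & 0.12 & 0.02 & 0.06 & 0.5 & 0.2 & 0.1
\end{array}
=
\begin{bmatrix} I & 0 \\ D & Q \end{bmatrix},
$$

where

$$
I = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad
D = \begin{array}{c|ccc}
1 & 0.15 & 0.05 & 0.20 \\
2 & 0.09 & 0.03 & 0.18 \\
3 & 0.12 & 0.02 & 0.06
\end{array}, \quad
Q = \begin{array}{c|ccc}
1 & 0.2 & 0.3 & 0.1 \\
2 & 0.1 & 0.4 & 0.2 \\
3 & 0.5 & 0.2 & 0.1
\end{array}.
$$

Use state reduction to find the vector of probabilities of absorption
in state 0.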
6.4 Consider an HMM specified by the set of parameters λ = {p(0), P, B}.
The following parameters are given.
N = 3 states in a hidden Markov chain; E = {1, 2, 3} is the state
space.
K = 4 distinct observation symbols; V = {v1, v2, v3, v4} = {c, d, e, f}
is the alphabet.
The initial state probability vector for the three-state hidden
Markov chain is

$$
p^{(0)} = \begin{bmatrix} p_1^{(0)} & p_2^{(0)} & p_3^{(0)} \end{bmatrix}
= \begin{bmatrix} P(X_0 = 1) & P(X_0 = 2) & P(X_0 = 3) \end{bmatrix}
= \begin{bmatrix} 0.25 & 0.40 & 0.35 \end{bmatrix}.
$$

The transition probability matrix for the hidden Markov chain is

$$
P = [p_{ij}] =
\begin{array}{c|ccc}
\text{State} & 1 & 2 & 3 \\ \hline
1 & p_{11} & p_{12} & p_{13} \\
2 & p_{21} & p_{22} & p_{23} \\
3 & p_{31} & p_{32} & p_{33}
\end{array}
=
\begin{array}{c|ccc}
X_n \backslash X_{n+1} & 1 & 2 & 3 \\ \hline
1 & 0.24 & 0.56 & 0.20 \\
2 & 0.38 & 0.22 & 0.40 \\
3 & 0.28 & 0.37 & 0.35
\end{array}
$$

The observation symbol probability distribution matrix is
B = [bi(k)], where bi(k) = P(On = k | Xn = i) for each state i = 1, 2, 3
and each observation symbol k = c, d, e, f:

$$
B = [b_i(k)] =
\begin{array}{c|cccc}
\text{State} & \text{Symbol } c & \text{Symbol } d & \text{Symbol } e & \text{Symbol } f \\ \hline
i = 1 & b_1(c) & b_1(d) & b_1(e) & b_1(f) \\
i = 2 & b_2(c) & b_2(d) & b_2(e) & b_2(f) \\
i = 3 & b_3(c) & b_3(d) & b_3(e) & b_3(f)
\end{array}
=
\begin{array}{c|cccc}
\text{State} & \text{Symbol } c & \text{Symbol } d & \text{Symbol } e & \text{Symbol } f \\ \hline
i = 1 & 0.30 & 0.18 & 0.20 & 0.32 \\
i = 2 & 0.26 & 0.34 & 0.12 & 0.28 \\
i = 3 & 0.16 & 0.40 & 0.34 & 0.10
\end{array}
$$

(a) Consider the particular sequence, O = {O0, O1} = {f, d}, of M + 1 = 2
observations. Use exhaustive enumeration to solve HMM
problem 1 of Section 6.2.4 by calculating P(f, d|λ).
(b) Use the forward procedure to solve HMM problem 1 of
Section 6.2.4 by calculating P(f, d|λ).
(c) Use the Viterbi algorithm to solve HMM problem 2 of Section
6.2.4 by calculating P(X0, X1; f, d|λ), where X = {X0, X1} is the
state sequence which maximizes the probability of having
generated the given observation sequence, O = {O0, O1} = {f, d}.

References
1. Grassmann, W. K., Taksar, M. I., and Heyman, D. P., Regenerative analysis and
steady state distributions for Markov chains, Op. Res., 33, 1107, 1985.
2. Kemeny, J. G. and Snell, J. L., Finite Markov Chains, Van Nostrand, Princeton, NJ,
1960. Reprinted by Springer-Verlag, New York, 1976.
3. Kohlas, J., Numerical computation of mean passage times and absorption prob-
abilities in Markov and semi-Markov models, Zeitschrift für Op. Res., 30, A197,
1986.
4. Sheskin, T. J., A Markov chain partitioning algorithm for computing steady state
probabilities, Op. Res., 33, 228, 1985.
5. Rabiner, L. R., A tutorial on hidden Markov models and selected applications in
speech recognition, Proc. IEEE, 77, 257, 1989.
6. Koski, T., Hidden Markov Models for Bioinformatics, Kluwer Academic Publishers,
Dordrecht, 2001.
7. Ewens, W. J. and Grant, G. R., Statistical Methods in Bioinformatics: An Introduction,
Springer, New York, 2001.
