Lecture Notes

School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel.
Definition 1.1 (Markov decision process) A Markov decision process (MDP) is a tuple
M = (S, A, s, c, p), where
• S is a set of states,
• A is a set of actions,
• s : A → S associates each action a ∈ A with the state s(a) from which it can be used,
• c : A → R associates each action a ∈ A with a cost c(a),
• and p : A → ∆(S) associates each action a ∈ A with a probability distribution p(a) over states,
according to which the next state is chosen when a is used.
For every i ∈ S, we let Ai = {a ∈ A | s(a) = i} be the set of actions that can be used from i.
We assume that Ai ≠ ∅, for every i ∈ S. We use n = |S| and m = |A| to denote the number
of states and actions, respectively.
• We let P ∈ Rm×n , where Pa,i = (p(a))i is the probability of ending up in state i after
taking action a, for every a ∈ A = [m] and i ∈ S = [n], be the probability matrix of
M.
• We let c ∈ Rm , where ca = c(a) is the cost of action a ∈ A = [m], be the cost vector
of M .
• We let J ∈ Rm×n , where Ja,i = 1 if a ∈ Ai and Ja,i = 0 otherwise, be the source matrix
of M .
Figure 1 shows an example of an MDP and its matrix representation. The states are pre-
sented as circles numbered from 1 to 3. I.e., the set of states is S = {1, 2, 3}. The actions
are represented as arrows leaving the states. The set of actions is A = {a1 , . . . , a6 }. We,
for instance, have s(a4 ) = 2. The costs of the actions are shown inside the diamond-shaped
vertices. For instance, c(a4 ) = 2. Finally, the probability distribution associated with an
action is shown as numbers labelling the edges leaving the corresponding diamond-shaped
vertex. For instance, the probability of moving to state 1 when using action a4 is 1/2.
A policy for the controller of an MDP is a rule that specifies which action should be taken in
each situation. The decision may in general depend on the current state of the process and
possibly on the history, i.e., the sequence of states that have been visited and the actions
that were used in those states. We also allow decisions to be randomized. We denote the set
of histories by H. The set of policies can be restricted in a natural way as follows.
[Figure 1: An example MDP with states S = {1, 2, 3} and actions A = {a1, . . . , a6}, together
with its matrix representation and a positional policy π = {a1, a4, a5} (bold gray arrows).
The matrices and vectors shown in the figure are:

        1 0 0            0    1/2  1/2            7
        1 0 0            1    0    0              3
    J = 0 1 0       P =  1    0    0        c =  −4
        0 1 0            1/2  1/4  1/4            2
        0 0 1            0    1    0              5
        0 0 1            0    1/3  2/3          −10

         0    1/2  1/2            7
    Pπ = 1/2  1/4  1/4      cπ =  2
         0    1    0              5

    k    e1^T Pπ^k
    0    1       0       0
    1    0       1/2     1/2
    2    2/8     5/8     1/8
    3    10/32   13/32   9/32   ]
First, we may require that every decision only depends on the current state and the time (the
number of steps performed). We call such policies time-dependent. Second, we may require
that every decision only depends on the current state. We call such policies positional. We
also require that positional policies do not make use of randomization.
If an MDP starts in some state i ∈ S and the controller uses a policy π, then the state
reached after t steps is a random variable Xπ (i, t), and the action used from the t-th state is
a random variable Yπ (i, t).
The random variable Xπ (i, t) can be nicely described for positional policies. We identify
a positional policy π with the set π(S) ⊆ A of actions used in π. We let Pπ ∈ Rn×n be
the matrix obtained by selecting the rows of P whose indices belong to π. We will assume
that the actions are ordered according to states, such that for every policy π, by the above
convention we have Jπ = I, the identity matrix. Note that (Pπ)i = Pπ(i). Similarly, we let
cπ ∈ Rn be the vector containing the costs of the actions that belong to π. Note that Pπ is
a (row) stochastic matrix; its elements are non-negative and the elements in each row sum
to 1. Thus, Pπ defines a Markov chain, and for every i, j ∈ S and t ≥ 0 we get that
Pr[Xπ(i, t) = j] = (Pπ^t)i,j.
In particular, if an MDP starts in some state i and the controller uses a positional policy
π, then the probabilities of being in the different states after t steps are given by the vector
e_i^T Pπ^t, where e_i is the i-th unit vector.
Figure 1 shows a positional policy π for a simple MDP. The policy is represented by bold
gray arrows. The corresponding matrix Pπ and vector cπ are also shown. Furthermore, the
table at the lower right corner shows the probabilities of being in different states after k
steps when starting in state 1.
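To make the matrix view concrete, the following is a minimal numpy sketch that encodes the example MDP from Figure 1 (the array values are transcribed from the figure; the variable names are ours) and recomputes the table of probabilities e1^T Pπ^k.

    import numpy as np

    # Source matrix J, probability matrix P (one row per action a1..a6), and cost vector c,
    # transcribed from Figure 1.
    J = np.array([[1, 0, 0],
                  [1, 0, 0],
                  [0, 1, 0],
                  [0, 1, 0],
                  [0, 0, 1],
                  [0, 0, 1]])
    P = np.array([[0.0, 0.5,  0.5 ],
                  [1.0, 0.0,  0.0 ],
                  [1.0, 0.0,  0.0 ],
                  [0.5, 0.25, 0.25],
                  [0.0, 1.0,  0.0 ],
                  [0.0, 1/3,  2/3 ]])
    c = np.array([7.0, 3.0, -4.0, 2.0, 5.0, -10.0])

    # The positional policy pi = {a1, a4, a5}: select the corresponding rows of P and c.
    pi = [0, 3, 4]                 # indices of a1, a4, a5
    P_pi, c_pi = P[pi], c[pi]

    # Probabilities e_1^T P_pi^k of being in states 1, 2, 3 after k steps,
    # starting in state 1 (index 0).
    dist = np.array([1.0, 0.0, 0.0])
    for k in range(4):
        print(k, dist)             # reproduces the table in Figure 1
        dist = dist @ P_pi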
Running an MDP with a (possibly history-dependent) policy π generates a random, infinite
sequence of costs c(Yπ (i, 0)), c(Yπ (i, 1)), c(Yπ (i, 2)), . . . . This sequence of costs is evaluated
according to some optimality criterion C, and the goal of the controller is to minimize the
expected resulting value, val^C_π(i). We will use val^C(c0, c1, c2, . . .) to denote the value of the
infinite sequence c0, c1, c2, . . . according to the criterion C. We then have:

    val^C_π(i) = E[ val^C( c(Yπ(i, 0)), c(Yπ(i, 1)), c(Yπ(i, 2)), . . . ) ] .
Let us again note that things work out nicely for positional policies. In this case the pair
Pπ , cπ is a Markov chain with costs assigned to its states, and the expected cost observed at
time t when starting from state i and using the positional policy π is then:
    E[c(Yπ(i, t))] = e_i^T Pπ^t cπ .    (1)
Definition 1.4 (Discounted cost, average cost, and total cost) Let 0 < γ < 1 be a
discount factor. The discounted cost criterion D(γ), the average cost criterion A, and the
total cost criterion T are defined by:
    val^{D(γ)}(c0, c1, c2, . . .) = Σ_{t=0}^{∞} γ^t c_t

    val^A(c0, c1, c2, . . .) = limsup_{N→∞} (1/N) Σ_{t=0}^{N−1} c_t

    val^T(c0, c1, c2, . . .) = Σ_{t=0}^{∞} c_t
Note that for the total cost criterion the series may diverge. We will later introduce a
stopping condition that ensures convergence in this case.
Definition 1.5 (Optimal policy) Let M = (S, A, s, c, p) be an MDP and let π be a policy.
We say that π is optimal for M with respect to a criterion C if and only if for all π′ ∈ Π(H)
and all i ∈ S we have val^C_π(i) ≤ val^C_{π′}(i).
Note that an optimal policy must minimize the expected value for all states simultaneously.
The existence of an optimal policy is non-trivial. We are, however, going to prove the
following theorem that shows that, in fact, there always exists an optimal positional policy.
The theorem was first proved by Shapley [12] in 1953 for discounted costs.
Theorem 1.6 Every MDP has an optimal positional policy w.r.t. the discounted cost crite-
rion, the average cost criterion, and the total cost criterion.
Theorem 1.6 shows that when looking for an optimal policy we may restrict our search to
positional policies. We will prove the theorem in the later sections. The following lemma
makes the first step towards proving Theorem 1.6. It shows that we may restrict our search to
time-dependent policies. More precisely, the lemma shows that for every history-dependent
policy there exists a time-dependent policy that visits the same states and uses the same
actions with the same probabilities.
Lemma 1.7 Let π ∈ Π(H) be any history-dependent policy, and let i0 ∈ S be any starting
state. Then there exists a time-dependent policy π′ ∈ Π(T) such that for every non-negative
integer t ≥ 0, every state i ∈ S, and every action a ∈ A we have Pr[Xπ′(i0, t) = i] =
Pr[Xπ(i0, t) = i] and Pr[Yπ′(i0, t) = a] = Pr[Yπ(i0, t) = a].
Proof: Since we are given π we can construct π′ such that for all t ≥ 0, i ∈ S, and a ∈ A:

    Pr[π′(i, t) = a] = Pr[Yπ(i0, t) = a | Xπ(i0, t) = i] ,

whenever Pr[Xπ(i0, t) = i] > 0 (and arbitrarily otherwise).
It follows that if Pr[Xπ′(i0, t) = i] = Pr[Xπ(i0, t) = i], i.e., the two policies reach the same
states in t steps with the same probabilities, then Pr[Yπ′(i0, t) = a] = Pr[Yπ(i0, t) = a], i.e.,
the same actions are chosen from the t-th state with the same probabilities.
We next prove by induction on t that Pr[Xπ′(i0, t) = i] = Pr[Xπ(i0, t) = i] for all t ≥ 0 and
i ∈ S. For t = 0 we have Pr[Xπ(i0, 0) = i0] = Pr[Xπ′(i0, 0) = i0] = 1 as desired. Recall
that Pa,i is the probability of moving to state i when using action a. For t > 0 we have by
induction:

    Pr[Xπ′(i0, t) = i] = Σ_{j∈S} Σ_{a∈Aj} Pr[Xπ′(i0, t−1) = j] · Pr[π′(j, t−1) = a] · Pa,i
                       = Σ_{j∈S} Σ_{a∈Aj} Pr[Xπ(i0, t−1) = j] · Pr[Yπ(i0, t−1) = a | Xπ(i0, t−1) = j] · Pa,i
                       = Pr[Xπ(i0, t) = i] .
2 The discounted cost criterion

We leave it as an exercise to show that val^{D(γ)}_π(i) exists for every policy π ∈ Π(H). In the
important special case of positional policies we get, using (1):

    val^{D(γ)}_π(i) = Σ_{t=0}^{∞} e_i^T (γPπ)^t cπ .
The following lemma is helpful for understanding this infinite sum. We leave proving the
lemma as an exercise.
Lemma 2.1 For any policy π, the matrix (I − γPπ) is nonsingular and

    (I − γPπ)^{−1} = Σ_{t=0}^{∞} (γPπ)^t ≥ I .
Definition 2.2 (Value vector) For every policy π, we define the value vector vπ ∈ Rn by:

    ∀i ∈ S : (vπ)i = val^{D(γ)}_π(i) .

Lemma 2.3 For every positional policy π, we have

    vπ = (I − γPπ)^{−1} cπ .
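As a sketch of how Lemma 2.3 is used computationally, the value vector of a positional policy can be obtained by solving the linear system (I − γPπ)vπ = cπ. Continuing the numpy sketch from above, with an arbitrarily chosen discount factor:

    gamma = 0.9                                      # an arbitrary discount factor
    n = P_pi.shape[0]
    # Solve (I - gamma * P_pi) v = c_pi rather than forming the inverse explicitly.
    v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, c_pi)
    print(v_pi)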
2.1 The value iteration algorithm

Definition 2.4 (Value iteration operator) Let v ∈ Rn be an arbitrary vector. Define
the value iteration operator T : Rn → Rn by:

    ∀i ∈ S : (T v)i = min_{a∈Ai} ( ca + γPa v ) .

Definition 2.5 (Policy extraction operator) Define the policy extraction operator P :
Rn → Π(P) to map any vector v ∈ Rn to a positional policy π = Pv such that:

    ∀i ∈ S : π(i) ∈ argmin_{a∈Ai} ( ca + γPa v ) .
The following relation between the value iteration operator and the policy extraction operator
is immediate.

Lemma 2.6 For every vector v ∈ Rn, if π = Pv then T v = cπ + γPπ v.
Following the above discussion, the minimum values that can be obtained using any (time-
dependent) policy for a discounted, finite-horizon MDP that terminates after t steps with
termination costs specified by v ∈ Rn are T^t v. Furthermore, the actions used after k steps,
for k < t, are those of the positional policy P T^{t−k−1} v. The idea of the value iteration
algorithm is to repeat the process for t going to infinity, or until t is sufficiently large.
For every time-dependent policy π ∈ Π(T) and every integer t ≥ 0, we let π>t be the
policy obtained from π by cutting away the first t rounds, that is, π>t(i, k) = π(i, k + t) for
all i ∈ S and k ≥ 0. Moreover, we let π∆t be the policy obtained from π by changing the
decisions of the first t rounds such that for the k-th round, for k < t, we use the actions
from the positional policy P T^{t−k−1} vπ>t. Note that the first t rounds of π∆t can be viewed as
an optimal policy for the finite-horizon MDP that terminates after t steps with termination
costs vπ>t. In particular, the decisions made by π∆t during the first t rounds are at least as
good as the corresponding decisions made by π. We get the following lemma.
Lemma 2.7 Let π ∈ Π(T ) be any time-dependent policy, and let t ≥ 0 be an integer. Then
    vπ∆t = T^t vπ>t ≤ vπ .
We next show that vπ∆t always converges to a unique vector v∗, regardless of the policy π, and
that Pv∗ is an optimal positional policy. First we show that the operator T is a contraction
with Lipschitz constant γ.

Lemma 2.8 For all vectors u, v ∈ Rn,

    ‖T u − T v‖∞ ≤ γ ‖u − v‖∞ .
Proof: Let i ∈ S, and assume that (T u)i ≥ (T v)i. Let a ∈ argmin_{a′∈Ai}(ca′ + γPa′ u) and
b ∈ argmin_{a′∈Ai}(ca′ + γPa′ v). Then,

    (T u − T v)i = (ca + γPa u) − (cb + γPb v)
                 ≤ (cb + γPb u) − (cb + γPb v)
                 = γPb (u − v)
                 ≤ γ ‖u − v‖∞ .
The last inequality follows from the fact that the elements in Pb are non-negative and sum
up to 1. The case when (T u)i ≤ (T v)i is analogous.
Since the inequality holds for all states i, it holds for the state for which the absolute
difference of T u and T v is largest, and, hence, the result follows.
The Banach fixed point theorem now implies that the operator T has a unique fixed point.

Corollary 2.9 The operator T has a unique fixed point v∗ ∈ Rn, and for every vector
u ∈ Rn, T^t u → v∗ for t → ∞.

The following lemma shows that the policy π∗ = Pv∗ is optimal, which proves Theorem 1.6
for the discounted cost criterion. Note that all optimal policies must, by definition, have
the same value vectors. We therefore refer to v∗ as the optimal value vector.
Lemma 2.10 Let v∗ ∈ Rn be the unique fixed point of T , and let π∗ = Pv∗. Then, π∗ is
an optimal policy, and vπ∗ = v∗.

Proof: We first show that vπ∗ = v∗. Since v∗ is a fixed point for T and π∗ = Pv∗ we get
from Lemma 2.6 that:

    v∗ = T v∗ = cπ∗ + γPπ∗ v∗ .

We know from Lemma 2.1 that the matrix (I − γPπ∗) is non-singular, which implies that
v∗ = (I − γPπ∗)^{−1} cπ∗. Lemma 2.3 then shows that vπ∗ = v∗.
We next show that π∗ is an optimal policy. We know from Lemma 2.7 that for every time-
dependent policy π ∈ Π(T) and every integer t ≥ 0, we have vπ∆t = T^t vπ>t ≤ vπ. We will
show that T^t vπ>t → v∗ for t → ∞. It then follows that v∗ ≤ vπ, which shows that π∗ is
optimal.
Let cmax = max_{a∈A} |ca| be the largest absolute value of the cost of any action. Observe first that
for every policy π and every state i ∈ S, we have |(vπ)i| ≤ cmax/(1 − γ). Indeed, cmax is the largest
absolute value of any cost incurred by a single step of the MDP, and if the cost is cmax at
every step the total discounted cost is Σ_{t=0}^{∞} γ^t cmax = cmax/(1 − γ). From repeated use of Lemma 2.8
we then get:

    ‖T^t vπ>t − v∗‖∞ = ‖T^t vπ>t − T^t v∗‖∞ ≤ γ^t ‖vπ>t − v∗‖∞ ≤ γ^t · 2cmax/(1 − γ) ,

which tends to 0 for t → ∞.
Function ValueIteration(u, ε)
    while ‖u − T u‖∞ > (ε/2)(1 − γ) do
        u ← T u;
    return u;

Figure 2: The value iteration algorithm.
The ValueIteration algorithm, given in Figure 2, repeatedly applies the value iteration
operator T to an initial vector u ∈ Rn until the difference between two successive vectors is
sufficiently small, i.e., ‖u − T u‖∞ ≤ (ε/2)(1 − γ) for some given ε ≥ 0.
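The following is a minimal numpy sketch of the value iteration operator T, the policy extraction operator P, and the ValueIteration loop of Figure 2, continuing the earlier sketch (P, c, J, gamma as before; the function names are ours).

    n = J.shape[1]

    def T(v):
        # (T v)_i = min over a in A_i of (c_a + gamma * P_a v); column i of J marks A_i.
        q = c + gamma * (P @ v)                    # one-step value of every action
        return np.array([q[J[:, i] == 1].min() for i in range(n)])

    def extract_policy(v):
        # Policy extraction operator: a minimizing action for every state.
        q = c + gamma * (P @ v)
        pol = []
        for i in range(n):
            acts = np.where(J[:, i] == 1)[0]
            pol.append(int(acts[np.argmin(q[acts])]))
        return pol

    def value_iteration(u, eps):
        # Repeat u <- T u until ||u - T u||_inf <= (eps/2)(1 - gamma), as in Figure 2.
        while True:
            Tu = T(u)
            if np.max(np.abs(u - Tu)) <= (eps / 2) * (1 - gamma):
                return u
            u = Tu

    u = value_iteration(np.zeros(n), eps=1e-6)
    print(u, extract_policy(u))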
The value iteration algorithm was introduced for MDPs by Bellman [1] in 1957. It can be
viewed as a special case of an algorithm by Shapley [12] from 1953 for solving the more
general class of stochastic games. Note that the vector T t u can be viewed as the optimal
value vector for the finite-horizon MDP that terminates after t steps with termination costs
specified by u. As noted in the beginning of the section, the value iteration algorithm can
therefore be viewed as a dynamic programming algorithm.
We say that a policy π is ε-optimal if ‖vπ − v∗‖∞ ≤ ε, for some ε ≥ 0. The following lemma
shows that when the value iteration algorithm terminates we can extract an -optimal policy
from the resulting vector.
Lemma 2.11 Let u ∈ Rn be a vector, and let π = Pu be the policy extracted from u. If
‖u − T u‖∞ ≤ (ε/2)(1 − γ), then ‖u − v∗‖∞ ≤ ε/2 and ‖u − vπ‖∞ ≤ ε/2. In particular, π is
ε-optimal.
Lemma 2.12 For a given vector u ∈ Rn, let N ≥ (1/(1−γ)) log( 4‖u − v∗‖∞ / (ε(1−γ)) ) be an
integer. Then

    ‖T^N u − T^{N+1} u‖∞ ≤ (ε/2)(1 − γ) .

Proof: We will show that ‖T^N u − v∗‖∞ ≤ (ε/4)(1−γ). Indeed, repeated use of Lemma 2.8
gives ‖T^N u − v∗‖∞ ≤ γ^N ‖u − v∗‖∞ ≤ e^{−N(1−γ)} ‖u − v∗‖∞ ≤ (ε/4)(1−γ), by the choice
of N. Since, by Lemma 2.8, ‖T^{N+1} u − v∗‖∞ ≤ γ‖T^N u − v∗‖∞, it then follows from the
triangle inequality that:

    ‖T^N u − T^{N+1} u‖∞ ≤ ‖T^N u − v∗‖∞ + ‖T^{N+1} u − v∗‖∞ ≤ (ε/2)(1 − γ) .
Lemmas 2.11 and 2.12 show that ValueIteration(u, ε) computes a vector from which
we can extract an ε-optimal policy. Furthermore, Lemma 2.12 shows that the number of
iterations needed to find this vector is at most O( (1/(1−γ)) log( ‖u − v∗‖∞ / (ε(1−γ)) ) ). We next
estimate the ε for which the resulting policy is guaranteed to be optimal.
For every MDP M , let L(M, γ) be the number of bits needed to describe the matrix (J −γP )
and the vector c. More precisely, for every number a = p/q, where p ∈ Z and q ∈ N are
relatively prime, we use 1 + dlog2 (|p| + 1)e + dlog2 (q + 1)e bits. Using Cramer’s rule and
simple bounds on the size of determinants, one can prove that the number of bits needed to
describe a component of the value vector vπ = (I − γPπ )−1 cπ , for some policy π, is at most
4L(M, γ). We leave this as an exercise.
Lemma 2.13 Let π and π′ be two policies such that vπ ≠ vπ′. Then ‖vπ − vπ′‖∞ ≥
2^{−4L(M,γ)}.

Corollary 2.14 If a policy π is ε-optimal for some ε < 2^{−4L(M,γ)}, then π is optimal.
Let cmax = max_{a∈A} |ca|. As we saw in the proof of Lemma 2.10, it is not difficult to show
that for every policy π and every state i, we have |(vπ)i| ≤ cmax/(1 − γ). We also clearly have
cmax ≤ 2^{L(M,γ)}. Hence, ‖v∗ − 0‖∞ ≤ 2^{L(M,γ)}/(1 − γ), where 0 is the all-zero vector.
Combining Lemma 2.12 and Corollary 2.14 we then get the following bound on the number
of iterations needed for the value iteration algorithm to produce an optimal policy:
Theorem 2.15 Let π = P T^N 0, where N = (1/(1−γ)) log( 8 · 2^{5L(M,γ)} / (1−γ)^2 ) ≤
O( (L(M,γ)/(1−γ)) log(1/(1−γ)) ). Then π is optimal.
2.2 The policy iteration algorithm

We next study the policy iteration algorithm. It is the most widely used algorithm for solving
MDPs in practice. As we will see in Section 2.3 below, it can be viewed as a generalization
of the simplex method for solving linear programs in the context of MDPs.
Throughout this section we restrict our attention to positional policies. Hence, whenever we
use the term policy we mean a positional policy.
For every policy π, define the operator Tπ : Rn → Rn by:

    Tπ v = cπ + γPπ v .
Recall that Lemma 2.3 says that vπ = (I − γPπ )−1 cπ . Hence, for every policy π, the value
vector vπ is a fixed point for Tπ . The following lemma can be proved in the same way as
Lemma 2.8 and Corollary 2.9.
Lemma 2.17 For every policy π, vπ is the unique fixed point of Tπ, and for every vector
u ∈ Rn, Tπ^t u → vπ for t → ∞.
Lemma 2.18 Let π and π′ be two policies. If Tπ′ vπ < vπ, then vπ′ ≤ Tπ′ vπ < vπ. The
same holds for ≤, >, and ≥.
Proof: We only prove the lemma for <. The other cases are proved analogously.
First observe that for any vector u ∈ Rn, if Tπ′ u ≤ u then, since all entries of Pπ′ are
non-negative, we get:

    Tπ′^2 u = cπ′ + γPπ′ (Tπ′ u) ≤ cπ′ + γPπ′ u = Tπ′ u .

If Tπ′ vπ ≤ vπ, then repeated use of the above inequality shows that vπ ≥ Tπ′ vπ ≥ Tπ′^2 vπ ≥
Tπ′^3 vπ ≥ . . . . Since, by Lemma 2.17, Tπ′^t vπ → vπ′ for t → ∞, it follows that vπ ≥ Tπ′ vπ ≥ vπ′.
In particular, if Tπ′ vπ < vπ then vπ′ < vπ.
Lemma 2.19 Let π be a policy. If for every policy π′ we have Tπ′ vπ ≮ vπ, then π is optimal.
Proof: We show that if for every policy π′ we have Tπ′ vπ ≮ vπ, then for every policy π′
we have Tπ′ vπ ≥ vπ. It then follows from Lemma 2.18 that vπ′ ≥ vπ for all policies π′.

Assume for the sake of contradiction that there exists a policy π′ such that (Tπ′ vπ)i <
(vπ)i for some state i. We will construct a policy π″ such that Tπ″ vπ < vπ, which gives a
contradiction. Let π″ be defined by π″(i) = π′(i) and π″(j) = π(j) for all j ≠ i. Then,
(Tπ″ vπ)i < (vπ)i and (Tπ″ vπ)j = (vπ)j for all j ≠ i.
Note that Tπ′ vπ < vπ can be equivalently stated as cπ′ − (I − γPπ′)vπ < 0. I.e., if an action
is better for one step with respect to the values of the current policy, then it is also better
to keep using this action. This motivates the following definition.
Function PolicyIteration(π)
    while ∃π′ : Tπ′ vπ < vπ do
        π ← π′;
    return π;

Function PolicyIteration(π)
    while ∃ an improving switch w.r.t. π do
        Update π by performing improving switches;
    return π;

Function PolicyIteration(π)
    while π ≠ Pvπ do
        π ← Pvπ;
    return π;

Figure 3: Two equivalent formulations of the policy iteration algorithm (top and middle),
and the policy iteration algorithm with Howard's improvement rule (bottom).
Definition 2.20 (Reduced costs, improving switches) The reduced cost vector c̄π ∈
Rm corresponding to a policy π is defined to be
c̄π = c − (J − γP )vπ .
We say that an action a ∈ A is an improving switch with respect to a policy π if and only if
c̄πa < 0.
Note that actions a ∈ π have reduced cost (c̄π)a = 0. We say that a policy π′ is obtained from π
by performing improving switches if every new action a ∈ π′ \ π is an improving switch with
respect to π, i.e., if (c̄π)a < 0. We will use the notation π′ = π[B], where B = π′ \ π. Lemma
2.18 can then be interpreted as saying that if a policy π′ is obtained from π by performing
improving switches then vπ′ < vπ. On the other hand, Lemma 2.19 says that if there are no
improving switches with respect to a policy π, then π is optimal. Hence, a policy is optimal
if and only if there are no improving switches.
The PolicyIteration algorithm is given in Figure 3. It starts with some initial policy π^0
and generates an improving sequence π^0, π^1, . . . , π^N of policies, ending with an optimal policy
π^N. In each iteration the algorithm evaluates the current policy π^k and computes the value
vector v^{π^k} by solving a system of linear equations. The next policy π^{k+1} is obtained from π^k
by performing a non-empty set of improving switches B ⊆ A with respect to π^k, such that
π^{k+1} = π^k[B].
It follows from Lemma 2.18 that the value vectors strictly improve with each iteration;
vk+1 < vk . Hence, the number of iterations is bounded by the number of policies. Moreover,
since there are no improving switches with respect to the final policy π N , we know from
Lemma 2.19 that π N is optimal. We get:
Theorem 2.21 For every initial policy π, PolicyIteration(π) terminates after a finite
number of iterations, returning an optimal policy.
The set of improving switches that is performed is decided by an improvement rule. There
is, in fact, a whole family of policy iteration algorithms using different improvement rules.
The most natural variant is, perhaps, the one in which the algorithm selects the improving
switch with most negative reduced cost from every state and performs all these switches
simultaneously, i.e., π^{k+1} = Pv^{π^k}. This was the original improvement rule suggested by
Howard [7], and we will refer to the algorithm obtained by using this improvement rule as
Howard's policy iteration algorithm.
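A minimal numpy sketch of Howard's policy iteration, continuing the earlier example: each iteration evaluates the current policy by solving (I − γPπ)vπ = cπ, computes the reduced costs c̄π = c − (J − γP)vπ, and switches every state to its most negative reduced-cost action. The function names and the small tolerance are ours.

    def evaluate(pol):
        # Value vector of a positional policy: solve (I - gamma * P_pol) v = c_pol.
        P_pol, c_pol = P[pol], c[pol]
        return np.linalg.solve(np.eye(n) - gamma * P_pol, c_pol)

    def howard_policy_iteration(pol):
        while True:
            v = evaluate(pol)
            rc = c - (J - gamma * P) @ v            # reduced costs of all actions
            new_pol = []
            for i in range(n):
                acts = np.where(J[:, i] == 1)[0]
                new_pol.append(int(acts[np.argmin(rc[acts])]))
            if not any(rc[a] < -1e-12 for a in new_pol):   # no improving switch left
                return pol, v
            pol = new_pol

    pol, v = howard_policy_iteration([1, 2, 4])     # start from an arbitrary policy
    print(pol, v)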
For the remainder of this section we focus on Howard's policy iteration algorithm. Let π^0
be some initial policy. We next relate the sequences of value vectors obtained by running
Howard’s policy iteration algorithm and the value iteration algorithm. The following lemma
appears, e.g., in Meister and Holzbaur [8]. We use v^k as shorthand notation for v^{π^k}.

Lemma 2.22 For every k ≥ 0, v^k ≤ T^k v^0.
Proof: We prove the lemma by induction. Clearly the statement is true for k = 0. Suppose
now that v^k ≤ T^k v^0. Since π^{k+1} is obtained from π^k by performing improving switches we
know that T_{π^{k+1}} v^k < v^k. Furthermore, since π^{k+1} = Pv^k, we have T v^k = T_{π^{k+1}} v^k. Using
Lemma 2.18, it then follows that v^{k+1} ≤ T_{π^{k+1}} v^k = T v^k. Finally, from the induction
hypothesis and the monotonicity of the value iteration operator we get v^{k+1} ≤ T^{k+1} v^0.
Lemma 2.22 shows that Howard's policy iteration algorithm converges to the optimal value
vector at least as fast as the value iteration algorithm.
Combining Lemma 2.22 with Theorem 2.15 we get the following theorem. In Theorem 2.15
the value iteration algorithm is initialized with the all-zero vector. It is not difficult to
see, however, that the value vector of any policy π′ works as well, i.e., ‖vπ′ − v∗‖∞ ≤
2^{L(M,γ)+1}/(1 − γ), for any policy π′.
Theorem 2.23 Starting with any policy π, the number of iterations performed by Howard's
policy iteration algorithm is at most O( (L(M,γ)/(1−γ)) log(1/(1−γ)) ).
A recent series of papers by Ye [13]; Hansen, Miltersen, and Zwick [6]; and Scherrer [11] has
shown the following improved bound. Note that the bound is strongly polynomial when γ
is a fixed constant, i.e., the bound does not depend on the bit complexity L(M, γ). In fact,
the bound is linear in m, the number of actions, when γ is fixed.
Theorem 2.24 Starting with any policy π, the number of iterations performed by Howard's
policy iteration algorithm is at most O( (m/(1−γ)) log(1/(1−γ)) ).
2.3 Linear programming

Definition 2.25 (Flux vector) For every policy π, define the flux vector xπ ∈ Rm by:

    ∀a ∈ A : (xπ)a = Σ_{i∈S} Σ_{t=0}^{∞} γ^t Pr[Yπ(i, t) = a] .
We let e = (1, 1, . . . , 1)^T ∈ Rn be the all-ones vector. When π is a positional policy we use x̄π =
(xπ)π ∈ Rn to denote the vector obtained from xπ by selecting the entries corresponding to
actions used in π. Note that for positional policies we have (xπ)a = 0 for a ∉ π. Also note that
for every positional policy π, every i, j ∈ S, and every t ≥ 0, we have Pr[Yπ(i, t) = π(j)] =
Pr[Xπ(i, t) = j] = (Pπ^t)i,j. Hence,

    ∀j ∈ S : (xπ)π(j) = (x̄π)j = Σ_{i∈S} Σ_{t=0}^{∞} γ^t (Pπ^t)i,j = e^T Σ_{t=0}^{∞} (γPπ)^t ej .

Lemma 2.26 For every positional policy π, (x̄π)^T = e^T (I − γPπ)^{−1}.
Note that since an optimal policy simultaneously minimizes the values of all states, it also
minimizes the sum of the values of the states. Flux vectors provide an alternative way of
summing the values of all the states.

Lemma 2.27 For every policy π,

    e^T vπ = c^T xπ .
Proof: Recall that:

    (vπ)i = Σ_{t=0}^{∞} γ^t E[c(Yπ(i, t))] = Σ_{t=0}^{∞} γ^t Σ_{a∈A} ca Pr[Yπ(i, t) = a] .

Hence we have:

    e^T vπ = Σ_{i∈S} Σ_{t=0}^{∞} γ^t Σ_{a∈A} ca Pr[Yπ(i, t) = a]
           = Σ_{a∈A} ca Σ_{i∈S} Σ_{t=0}^{∞} γ^t Pr[Yπ(i, t) = a]
           = Σ_{a∈A} ca (xπ)a = c^T xπ .
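As a quick numerical check of this identity on the running example: for a positional policy the nonzero part of the flux vector is x̄π = (I − γPπ)^{−T} e, by the formula for (x̄π)^T derived above, so e^T vπ = c^T xπ can be verified directly (continuing the earlier numpy sketches).

    # Flux of the positional policy pi: xbar = (I - gamma * P_pi)^{-T} e.
    e = np.ones(n)
    xbar = np.linalg.solve((np.eye(n) - gamma * P_pi).T, e)
    x = np.zeros(len(c))
    x[pi] = xbar                                    # (x^pi)_a = 0 for actions not in pi
    print(np.isclose(v_pi.sum(), c @ x))            # e^T v^pi == c^T x^pi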
Note that for every policy π and every state i, the expected number of times we use actions
leaving i must be equal to the expected number of times we reach i. Also, no action can be
used a negative number of times. Hence, the flux vector must satisfy xπ ≥ 0 and

    ∀i ∈ S : Σ_{a∈Ai} (xπ)a = 1 + Σ_{a∈A} (xπ)a γPa,i .    (2)
Using matrix notation, this equality can be stated as (J − γP )T xπ = e. This leads us to the
following definition of a linear program (P ) and its dual (D). The variables of the primal
linear program (P ) correspond to flux vectors, and the variables of the dual linear program
(D) correspond to value vectors.
    (P)   min  c^T x                      (D)   max  e^T y
          s.t. (J − γP)^T x = e                 s.t. (J − γP) y ≤ c
               x ≥ 0
The above discussion proves the following lemma.
Lemma 2.28 For every policy π, the flux vector xπ is a feasible solution to the linear pro-
gram (P ).
Lemma 2.29 For every feasible solution x to the linear program (P ) there exists a policy π
with flux vector xπ = x.
Proof: Let x be a feasible solution to (P). For every i ∈ S, let x̄i = Σ_{a∈Ai} xa, and let π be the
policy that, in every round and from every state i, uses action a ∈ Ai with probability xa/x̄i.
Note that since the right-hand side of (2) is at least 1, x̄i ≥ 1 for all i, and π is well-defined.
Observe also that since π uses the same (randomized) decisions in every round it defines a
Markov chain given by the stochastic matrix Q ∈ Rn×n where Qj = (1/x̄j) Σ_{a∈Aj} xa Pa, for all
states j. Define x̄π ∈ Rn by (x̄π)i = Σ_{a∈Ai} (xπ)a for all i, i.e., (x̄π)i is the expected number
of times we visit state i. In the same way as we proved Lemma 2.26 we then get that:

    (x̄π)^T = e^T Σ_{t=0}^{∞} (γQ)^t = e^T (I − γQ)^{−1} .
Next we show that xπ = x. It suffices to show that x̄π = x̄, since the definition of π then
ensures that the flux is distributed correctly on the different actions. Since x is a feasible
solution to (P) it must satisfy:

    ∀i ∈ S : x̄i = Σ_{a∈Ai} xa = 1 + Σ_{j∈S} Σ_{a∈Aj} xa γPa,i = 1 + Σ_{j∈S} x̄j γQj,i = 1 + x̄^T γQ ei .

Using matrix notation this equality says that x̄ = e + γQ^T x̄, or x̄^T = e^T (I − γQ)^{−1}. Since
(x̄π)^T = e^T (I − γQ)^{−1}, we have shown that x̄π = x̄ as desired.
Theorem 2.30 The linear program (P) has an optimal solution x, and from x we get an
optimal time-dependent policy π defined by

    ∀t ≥ 0, ∀i ∈ S, ∀a ∈ Ai : Pr[π(i, t) = a] = xa / Σ_{b∈Ai} xb .
Proof: Since, by Lemma 2.28, every policy gives a feasible solution for (P) we know that
(P) is feasible. Furthermore, since, by Lemmas 2.29 and 2.27, every feasible solution of (P)
gives a policy with the same sum of values we know that (P) is bounded. Hence, (P) has an
optimal solution, and the corresponding policy π, constructed in the proof of Lemma 2.29,
is optimal.
Note that basic feasible solutions of (P ) correspond to positional policies. The existence of
an optimal basic feasible solution for (P ) therefore gives an alternative proof for the existence
of an optimal positional policy.
Furthermore, it is not difficult to see that the reduced cost vector from Definition 2.20
corresponds exactly to the reduced costs that the simplex method operates with. In fact,
this shows that the policy iteration algorithm can be viewed as a generalization of the simplex
method where multiple pivots are performed in parallel.
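As a sketch of the linear programming connection, the primal (P) can be handed directly to an off-the-shelf LP solver. Below, scipy's linprog solves (P) for the running example, and a positional policy is read off from the optimal flux by picking, for each state, the action carrying the largest flux (a basic optimal solution puts all of a state's flux on a single action). This assumes scipy is available; the rest continues the earlier numpy sketch.

    from scipy.optimize import linprog

    # Primal (P): minimize c^T x subject to (J - gamma * P)^T x = e, x >= 0.
    A_eq = (J - gamma * P).T
    b_eq = np.ones(n)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq)          # default bounds already enforce x >= 0
    x_opt = res.x

    # Read off a positional policy from the flux carried by each state's actions.
    opt_policy = [int(np.where(J[:, i] == 1)[0][np.argmax(x_opt[J[:, i] == 1])])
                  for i in range(n)]
    print(opt_policy, res.fun)                      # optimal policy and sum of optimal values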
3 Turn-based stochastic games
We next consider a generalization of MDPs known as turn-based stochastic games (TBSGs).
TBSGs are a special case of Shapley’s stochastic games [12], which were introduced in 1953.
A TBSG is an MDP where each state has been assigned to one of two players: player
1, the minimizer, and player 2, the maximizer. Whenever the game is in some state the
player controlling that state decides which action to use. The goal of the minimizer is to
minimize the incurred costs according to some criterion, while the goal of the maximizer is
to maximize the incurred costs. Hence, the game is a zero-sum game. In this section we
restrict our attention to the discounted cost criterion.
We use the same notation to describe TBSGs as we did for MDPs. In particular, probability
matrices, cost vectors, and source matrices are defined analogously. Policies are also defined
in the same way, except that a policy belongs to a player and only describes the decision
made by that player. Also, in order to be consistent with terminology from game theory we
use the term strategy instead of policy. We restrict our attention to positional strategies. As
for MDPs it can be shown that the players are not worse off by being restricted to positional
strategies. The proof is essentially the same for TBSGs.
Definition 3.2 (Strategies, strategy profiles) A positional strategy πj for player j, where
j ∈ {1, 2}, is a mapping πj : Sj → A such that πj (i) ∈ Ai , for every i ∈ Sj . A strategy
profile π = (π1 , π2 ) is a pair of strategies, one for each player. We denote the set of positional
strategies for player j by Πj (P), and the set of strategy profiles by Π(P) = Π1 (P) × Π2 (P).
Note that a strategy profile can be viewed as a policy for the underlying MDP. We again
let Pπ and cπ be obtained from P and c by selecting rows corresponding to actions used in
π. Hence, the value for a state i when the players play according to a strategy profile π is
again val^{D(γ)}_π(i) = Σ_{t=0}^{∞} γ^t E[c(Yπ(i, t))], and Lemma 2.3 shows that the value vector satisfies
vπ = (I − γPπ)^{−1} cπ.
Since there are two players with opposite objectives in a TBSG we need a different definition
of optimal strategies compared to the definition of optimal policies for MDPs. Note, however,
that if the strategy of one of the players is fixed then the game can be viewed as an MDP.
Indeed, if we remove the unused actions of the player whose strategy is fixed, we may transfer
control of his states to the other player. In this case the optimal counter-strategy of the player
whose strategy is not fixed should be the optimal policy in the corresponding MDP. We get
the following definition.
Definition 3.3 (Optimal counter-strategies) Let G be a TBSG and let π1 be a strategy
for player 1. We say that a strategy π2 for player 2 is an optimal counter-strategy against
π1 if and only if for all states i ∈ S:

    val^{D(γ)}_{π1,π2}(i) = max_{π2′ ∈ Π2(P)} val^{D(γ)}_{π1,π2′}(i) .

Optimal counter-strategies are defined analogously for player 1 with max exchanged with min.
It follows from Theorem 1.6 that optimal counter-strategies always exist, since we get an
MDP when the strategy of one player is fixed and MDPs always have optimal policies.
Let π1 be some strategy. If π2 is an optimal counter-strategy against π1, then val^{D(γ)}_{π1,π2}(i) is
the best possible value that player 2 can obtain for state i when player 1 uses the strategy π1.
Hence, we may view val^{D(γ)}_{π1,π2}(i) as an upper bound for the value player 1 can guarantee for
state i. It is natural to ask what the best guarantee a player can get from a single strategy
is. This leads to the following definition.
Definition 3.4 (Upper and lower values) For every state i ∈ S define the upper value,
\overline{val}(i), and the lower value, \underline{val}(i), by:

    \overline{val}(i) = min_{π1 ∈ Π1(P)} max_{π2 ∈ Π2(P)} val^{D(γ)}_{π1,π2}(i) ,
    \underline{val}(i) = max_{π2 ∈ Π2(P)} min_{π1 ∈ Π1(P)} val^{D(γ)}_{π1,π2}(i) .
Note that we must have \underline{val}(i) ≤ \overline{val}(i) for all states i. We say that a strategy is optimal
when it achieves the best possible guarantee against the other player.
Definition 3.5 (Optimal strategies) We say that a strategy π1 for player 1 is optimal if
and only if val^{D(γ)}_{π1,π2}(i) ≤ \overline{val}(i) for all states i and strategies π2 ∈ Π2(P). Similarly, a strategy
π2 for player 2 is optimal if and only if val^{D(γ)}_{π1,π2}(i) ≥ \underline{val}(i) for all i and π1 ∈ Π1(P).
The following theorem shows the existence of optimal strategies. The theorem was first
established by Shapley [12]. It can be proved in essentially the same way as Theorem 1.6 for
MDPs by using Theorem 3.8 below.
Theorem 3.6 For any TBSG G there exists an optimal positional strategy profile (π1 , π2 ).
Moreover, for all states i we have:

    val^{D(γ)}_{π1,π2}(i) = \underline{val}(i) = \overline{val}(i) .
By solving a TBSG we mean computing an optimal strategy profile. The following defini-
tion and theorem give an alternative interpretation of optimal strategies. The theorem is
standard for finite two-player zero-sum games. It was also established by Shapley [12]. We
leave it as an exercise to prove the theorem.
Definition 3.7 (Nash equilibrium) A strategy profile (π1 , π2 ) ∈ Π is a Nash equilibrium
if and only if π1 is an optimal counter-strategy against π2 , and π2 is an optimal counter-
strategy against π1 .
Theorem 3.8 A strategy profile π = (π1, π2) ∈ Π(P) is optimal if and only if it is a Nash
equilibrium.
3.1 The value iteration algorithm

The value iteration operator T : Rn → Rn extends to TBSGs by letting the player controlling
each state choose the action: (T v)i = min_{a∈Ai}(ca + γPa v) for i ∈ S1, and (T v)i =
max_{a∈Ai}(ca + γPa v) for i ∈ S2. Note that T^t v is the optimal value vector for the finite-horizon
TBSG that terminates after t steps with termination costs specified by v ∈ Rn.
The value iteration algorithm can also be defined for TBSGs. In fact, the algorithm remains
the same except that we now use the value iteration operator for two players. We refer to
Figure 2 for a description of the value iteration algorithm. The lemmas and theorems in
Section 2.1 can be extended to the two-player setting without much modification. We leave
this as an exercise.
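As a small illustration of the two-player operator (continuing the earlier numpy sketch, and assuming a hypothetical boolean array is_max that marks the states controlled by the maximizer), the only change compared to the MDP case is that maximizer states take a max instead of a min.

    # Hypothetical assignment of states to players: here player 2 controls state 2 only.
    is_max = np.array([False, True, False])

    def T_game(v):
        # Two-player value iteration operator: min at player-1 states, max at player-2 states.
        q = c + gamma * (P @ v)
        out = np.empty(n)
        for i in range(n):
            qi = q[J[:, i] == 1]
            out[i] = qi.max() if is_max[i] else qi.min()
        return out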
3.2 The strategy iteration algorithm
We will next see how Howard’s policy iteration algorithm [7] for solving MDPs can be
extended to TBSGs in a natural way. We refer to the resulting algorithm as the strategy
iteration algorithm. The strategy iteration algorithm was described for TBSGs by Rao et
al. [10].
For every strategy profile π = (π1, π2), we define the vector of reduced costs c̄π as we did for
MDPs: c̄π = c − (J − γP)vπ. We again use the reduced costs to define improving switches;
since player 2 is a maximizer, an action a ∈ A1 is an improving switch for player 1 with
respect to π if (c̄π)a < 0, while an action a ∈ A2 is an improving switch for player 2 if
(c̄π)a > 0. The following lemma is a direct consequence of Lemmas 2.18 and 2.19, i.e., since
a policy is optimal for an MDP if and only if there are no improving switches, we get a
similar statement about optimal counter-strategies.

Lemma 3.12 Let π = (π1, π2) be a strategy profile. Then π2 is an optimal counter-strategy
against π1 if and only if there is no improving switch for player 2 with respect to π, i.e.,
(c̄π)a ≤ 0 for all a ∈ A2. The analogous statement holds for player 1.
From the definition of Nash equilibria and Theorem 3.8 we get the following important
corollary.
Corollary 3.13 A strategy profile π = (π1 , π2 ) is a Nash equilibrium, and optimal, if and
only if neither player has an improving switch with respect to π.
Lemma 3.14 Let π = (π1, π2) and π′ = (π1′, π2′) be two strategy profiles such that π2 is an
optimal counter-strategy against π1, π1′ is obtained from π1 by performing improving switches
w.r.t. π, and π2′ is an optimal counter-strategy against π1′. Then vπ′ < vπ.
Proof: Observe first that since π1′ is obtained from π1 by performing improving switches
w.r.t. π we have (c̄π)π1′ < 0. Furthermore, since π2 is an optimal counter-strategy against
π1, we get from Lemma 3.12 that player 2 has no improving switches with respect to π. That
is, (c̄π)a ≤ 0 for all a ∈ A2, and in particular (c̄π)π2′ ≤ 0. Hence, (c̄π)π′ < 0, which means
that Tπ′ vπ < vπ. It follows from Lemma 2.18 that vπ′ < vπ.
The strategy iteration algorithm is given in Figure 4. It starts with some initial strategy
profile π^0 = (π1^0, π2^0) and generates a sequence (π^k)_{k=0}^{N} of strategy profiles, where
π^k = (π1^k, π2^k), ending with an optimal strategy profile π^N.
Function StrategyIteration(π1, π2)
    while ∃π2′ : Tπ1,π2′ vπ1,π2 > vπ1,π2 do
        π2 ← π2′;
    while ∃π1′ : Tπ1′,π2 vπ1,π2 < vπ1,π2 do
        π1 ← π1′;
        while ∃π2′ : Tπ1,π2′ vπ1,π2 > vπ1,π2 do
            π2 ← π2′;
    return (π1, π2);

Function StrategyIteration(π1, π2)
    while ∃ an improving switch for player 2 w.r.t. (π1, π2) do
        Update π2 by performing improving switches for player 2;
    while ∃ an improving switch for player 1 w.r.t. (π1, π2) do
        Update π1 by performing improving switches for player 1;
        while ∃ an improving switch for player 2 w.r.t. (π1, π2) do
            Update π2 by performing improving switches for player 2;
    return (π1, π2);

Figure 4: Two equivalent formulations of the strategy iteration algorithm.
The algorithm repeatedly updates the strategy π2^k for player 2 by performing improving
switches for player 2. Note that this can be viewed as running the policy iteration algorithm
on the MDP obtained by fixing π1^k, and therefore this inner loop always terminates. When
π2^k is an optimal counter-strategy against π1^k, the algorithm updates π1^k once by performing
improving switches for player 1, and the process is restarted. If π1^k can not be updated then
neither of the players has an improving switch, and we know from Corollary 3.13 that
π^N = (π1^N, π2^N) is optimal. Furthermore, since π2^k is an optimal counter-strategy against
π1^k, and π1^{k+1} is obtained from π1^k by performing improving switches with respect to
(π1^k, π2^k), we get from Lemma 3.14 that v^{π^{k+1}} < v^{π^k} for all 0 ≤ k < N. It follows that
the same strategy profile does not appear twice, and since there are only a finite number of
strategy profiles the algorithm terminates. We get the following theorem. By an iteration
we mean an iteration of the outer loop.
Theorem 3.15 For every initial strategy profile (π1 , π2 ), StrategyIteration(π1 , π2 ) re-
turns an optimal strategy profile after a finite number of iterations.
It can be helpful to think of the strategy iteration algorithm as only operating with the strat-
egy for player 1, and view the inner loop as a subroutine. Indeed, optimal counter-strategies
for player 2 can also be found by other means, for instance by using linear programming.
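Following this view, here is a minimal sketch of the strategy iteration algorithm, continuing the numpy sketches above (it reuses evaluate and the hypothetical is_max): the inner routine computes an optimal counter-strategy for player 2 by policy iteration restricted to the maximizer's states, and the outer loop performs Howard-style improving switches for player 1.

    def counter_strategy(pol):
        # Make player 2's part of pol an optimal counter-strategy against player 1's part.
        while True:
            v = evaluate(pol)
            rc = c - (J - gamma * P) @ v
            changed = False
            for i in range(n):
                if not is_max[i]:
                    continue
                acts = np.where(J[:, i] == 1)[0]
                best = int(acts[np.argmax(rc[acts])])
                if rc[best] > 1e-12:                # improving switch for the maximizer
                    pol[i], changed = best, True
            if not changed:
                return pol, v

    def strategy_iteration(pol):
        pol, v = counter_strategy(pol)
        while True:
            rc = c - (J - gamma * P) @ v
            changed = False
            for i in range(n):                      # improving switches for player 1
                if is_max[i]:
                    continue
                acts = np.where(J[:, i] == 1)[0]
                best = int(acts[np.argmin(rc[acts])])
                if rc[best] < -1e-12:
                    pol[i], changed = best, True
            if not changed:
                return pol, v                       # neither player has an improving switch
            pol, v = counter_strategy(pol)

    print(strategy_iteration([1, 2, 4]))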
During each iteration of the strategy iteration algorithm the current values and reduced costs
are computed, and an improvement rule decides which improving switches to perform. This
is completely analogous to the policy iteration algorithm for MDPs. We will again say that
Howard’s improvement rule [7] picks the action from each state with most negative reduced
cost, i.e., π1^{k+1} satisfies:

    ∀i ∈ S1 : π1^{k+1}(i) ∈ argmin_{a∈Ai} (c̄^{π^k})a .
Theorem 3.16 Starting with any strategy profile π, the number of iterations performed by
Howard's strategy iteration algorithm is at most O( (m/(1−γ)) log(1/(1−γ)) ).
There is no known way of formulating the solution of a TBSG as a linear program. In fact,
it is a major open problem to find a polynomial time algorithm for solving TBSGs when γ
is given as part of the input. When γ is fixed, Theorem 3.16 provides a strongly polynomial
bound. On the other hand, it is easy to see that the problem of solving TBSGs is in both
NP and coNP: an optimal strategy profile serves as a witness of the optimal value of a state
being both above and below a given threshold.
4 The average cost criterion
In this section we give a very brief introduction to the average cost criterion. Proofs in this
section have been omitted, and we refer to Puterman [9] for additional details. As was the
case for the discounted cost criterion, one can show that every MDP has an optimal
positional policy for the average cost criterion. We restrict our attention to the use of
positional policies. Throughout the section we let M = (S, A, s, c, p) be some given MDP.
For the average cost criterion the value of a state i for a positional policy π is:

    val^A_π(i) = E[ liminf_{N→∞} (1/N) Σ_{t=0}^{N−1} c(Yπ(i, t)) ] = lim_{N→∞} (1/N) Σ_{t=0}^{N−1} e_i^T Pπ^t cπ .
Definition 4.1 (Values, potentials) Let π be a positional policy, and let gπ ∈ Rn and
hπ ∈ Rn be defined as:

    gπ = lim_{N→∞} (1/N) Σ_{t=0}^{N−1} Pπ^t cπ ,

    hπ = lim_{N→∞} (1/N) Σ_{t=0}^{N−1} Σ_{k=0}^{t} Pπ^k (cπ − gπ) .
We say that gπ and hπ are the vectors of values and potentials, respectively.
Theorem 4.2 Let π be a positional policy and let Rπ ⊆ 2^S be the set of recurrent classes of
Pπ. Then gπ and hπ are the unique solution to the following system of equations:

    gπ = Pπ gπ ,
    hπ = cπ − gπ + Pπ hπ ,
    ∀R ∈ Rπ : Σ_{i∈R} (hπ)i = 0 .
We next establish a connection between values and potentials, and values for the correspond-
ing discounted MDP. We let vπ,γ be the value vector for policy π for the discounted cost
criterion when using discount factor γ.

Lemma 4.3 For every positional policy π, vπ,γ = (1 − γ)^{−1} gπ + hπ + fπ(γ), where
fπ(γ) → 0 for γ → 1.
By multiplying both sides of the equation in Lemma 4.3 by (1 − γ), and taking the limit for
γ going to 1, we get the following important corollary.

Corollary 4.4 For every positional policy π,

    gπ = lim_{γ↑1} (1 − γ) vπ,γ .
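As a small numerical illustration of Corollary 4.4 on the running example, (1 − γ)vπ,γ approaches gπ as γ approaches 1; gπ can also be approximated directly by a long Cesàro average (both computations below are approximations, continuing the earlier numpy sketch).

    # (1 - gamma) * v^{pi,gamma} for discount factors approaching 1.
    for g in [0.9, 0.99, 0.999]:
        v_g = np.linalg.solve(np.eye(n) - g * P_pi, c_pi)
        print(g, (1 - g) * v_g)

    # Cesaro average (1/N) sum_{t<N} P_pi^t c_pi, approximating g^pi.
    N = 10000
    avg, power = np.zeros(n), np.eye(n)
    for _ in range(N):
        avg += power @ c_pi
        power = power @ P_pi
    print(avg / N)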
Let (γk)_{k=0}^{∞} be a sequence of discount factors converging to 1. Since for every discount factor
there is an optimal policy for the discounted MDP, and since there are only finitely many
policies, there must be a policy that is optimal for discount factors arbitrarily close to 1. In
fact, the following lemma shows something stronger.
Lemma 4.5 There exists a discount factor γ(M ) < 1 and a policy π ∗ , such that for all
γ ∈ [γ(M ), 1), π ∗ is optimal for the discounted MDP with discount factor γ.
By combining Lemma 4.5 and Corollary 4.4 it follows that π∗ is optimal under the average
cost criterion, which proves Theorem 1.6 for the average cost criterion. It should be noted
that not every policy that is optimal under the average cost criterion is optimal under the
discounted cost criterion for all discount factors sufficiently close to 1.
Lemmas 4.3 and 4.5 also show that we can use the strategy iteration algorithm for solving
MDPs with the average cost criterion. Indeed, we can pick a discount factor γ sufficiently
close to 1 and run the strategy iteration algorithm for the corresponding discounted MDP.
To simplify the algorithm we can use Lemma 4.3 to compare reduced costs lexicographically
in terms of values and potentials.
5 The total cost criterion

For the total cost criterion the value of a state i for a positional policy π is defined as:

    val^T_π(i) = Σ_{k=0}^{∞} e_i^T Pπ^k cπ .
This series may not converge, and we therefore need to make assumptions about the MDP
to ensure convergence. For this purpose we introduce the notion of a terminal state:
Definition 5.1 (Terminal state) We say that a state t ∈ S is a terminal state if for all
actions a ∈ At , Pa = et and ca = 0. I.e., t is an absorbing state for every policy π, and
once t is reached no additional cost is accumulated.
For the remainder of this section we assume that the MDP under consideration has a terminal
state t. Note that reaching the terminal state corresponds to terminating the MDP. Also,
since the decision at t is fixed we will let t and its actions be implicit in the description
of the MDP M . We therefore do not consider t to be an element of S, and we let P , J,
and c be the matrices and vectors obtained by removing the column corresponding to t
and the rows corresponding to At . I.e., rows of P may sum to less than one, and 1 − Pa e
is the probability of moving to t when using action a. Note that this provides a natural
interpretation of the discounted cost criterion. The rows of the matrix γP sum to γ < 1,
and for each action we move to the terminal state with probability (1 − γ).
Definition 5.2 (Stopping condition) A policy π is said to satisfy the stopping condition
if from each state i there is positive probability of eventually reaching the terminal state.
If every policy satisfies the stopping condition we say that the MDP satisfies the stopping
condition.
For the remainder of this section we assume that the MDP under consideration satisfies the
stopping condition.
The following lemma is analogous to Lemma 2.1 for the discounted case.
Lemma 5.3 For any policy π satisfying the stopping condition, the matrix (I − Pπ) is non-
singular and

    (I − Pπ)^{−1} = Σ_{k=0}^{∞} Pπ^k ≥ I .
From Lemma 5.3 it follows that the values exist, so that the following definition is well-defined.
Definition 5.4 (Value vector) For every positional policy π, we define the value vector
vπ ∈ Rn by:
vπ = (I − Pπ )−1 cπ .
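A minimal sketch for the total cost criterion, using a hypothetical 2-state stopping MDP with substochastic rows (the terminal state is implicit, as described above): the stopping condition for a policy can be checked via the spectral radius of Pπ, and the values are then obtained by solving (I − Pπ)v = cπ.

    # Hypothetical policy of a stopping MDP: from each state the chosen action leads to
    # the implicit terminal state with probability 0.1 (rows sum to 0.9).
    P_pi_stop = np.array([[0.0,  0.9 ],
                          [0.45, 0.45]])
    c_pi_stop = np.array([1.0, 2.0])

    # For a finite MDP the stopping condition for pi is equivalent to the spectral radius
    # of P_pi being strictly below 1 (equivalently, P_pi^t -> 0).
    assert np.max(np.abs(np.linalg.eigvals(P_pi_stop))) < 1
    v_total = np.linalg.solve(np.eye(2) - P_pi_stop, c_pi_stop)
    print(v_total)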
Note that if the MDP is deterministic, i.e., all actions move to single states with probability
1, then the problem becomes a shortest path problem, where from every state we want to
find the shortest path to the terminal state. Solving MDPs for the total cost criterion may
in general be viewed as a stochastic shortest path problem.
We next consider the relationship between the total cost criterion and the average cost
criterion. For this purpose we assume that the terminal state t is represented explicitly. Note
that, assuming that the MDP satisfies the stopping condition, the terminal state is reached
from every state for every policy. Hence, the limiting average of the observed costs is always
zero, such that gπ = 0 for every policy π. Furthermore, since the limit lim_{t→∞} Σ_{k=0}^{t} Pπ^k cπ
always exists, we have:

    hπ = lim_{N→∞} (1/N) Σ_{t=0}^{N−1} Σ_{k=0}^{t} Pπ^k cπ = lim_{t→∞} Σ_{k=0}^{t} Pπ^k cπ = vπ .
Hence, the values for the total cost criterion may be viewed as the potentials for the average
cost criterion.
The existence of optimal positional policies can either be proved directly or through the
connection to the average cost criterion. The strategy iteration algorithm can also be defined
for the total cost criterion in the same way as it was done for the discounted cost criterion
and the average cost criterion.
Through the series of papers by Friedmann [4] and Fearnley [3], from 2009 and 2010, re-
spectively, it was shown that Howard’s policy iteration algorithm may require exponentially
many iterations in the number of states to solve MDPs with the total cost criterion. Hence,
there is no hope of getting rid of the dependence on γ in Theorem 2.24.
Exercises
Exercise 1 Prove that for every MDP M, and every policy π ∈ Π(H), the value
val^{D(γ)}_π(i) = E[ Σ_{t=0}^{∞} γ^t c(Yπ(i, t)) ] always exists.
Exercise 2 Prove Lemma 2.1, i.e., for an MDP represented by (P, c, J) and any policy π,
show that the matrix (I − γPπ) is nonsingular and that

    (I − γPπ)^{−1} = Σ_{k=0}^{∞} (γPπ)^k ≥ I .

Hint: Consider the telescoping series (I − γPπ) Σ_{k=0}^{∞} (γPπ)^k.
Exercise 3 Use Cramer’s rule to prove Lemma 2.13. Cramer’s rule says the following:
Let Ax = b be a system of linear equations where A ∈ Rn×n is a non-singular matrix and
b ∈ Rn is a vector. Then a solution x satisfies for all i, xi = det(Ai )/ det(A), where Ai is
the matrix obtained from A by replacing the i-th column with b.
Exercise 4 Give an alternative proof, using complementary slackness, of the fact that the
linear programs (P) and (D) can be used to solve an MDP. The complementary slackness
theorem says the following: feasible solutions x and (y, z) of the primal and dual linear
programs

    min  c^T x                  max  b^T y
    s.t. Ax = b                 s.t. A^T y + z = c
         x ≥ 0                       z ≥ 0

are both optimal if and only if x^T z = 0, i.e., xi zi = 0 for all i.
Exercise 6 Extend the lemmas and theorems in Section 2.1 to TBSGs and update the
proofs. Hint: Lemma 2.7 should be split into two cases where different players use an optimal
counter-strategy.
References
[1] R. Bellman. Dynamic programming. Princeton University Press, 1957.
[3] J. Fearnley. Exponential lower bounds for policy iteration. In Proc. of 37th ICALP,
pages 551–562, 2010.
[4] O. Friedmann. An exponential lower bound for the parity game strategy improvement
algorithm as we know it. In Proc. of 24th LICS, pages 145–156, 2009.
[5] T. D. Hansen. Worst-case Analysis of Strategy Iteration and the Simplex Method. PhD
thesis, Aarhus University, 2012.
[6] T. D. Hansen, P. B. Miltersen, and U. Zwick. Strategy iteration is strongly polynomial
for 2-player turn-based stochastic games with a constant discount factor. Journal of the
ACM, 60(1), 2013.

[7] R. Howard. Dynamic programming and Markov processes. MIT Press, 1960.
[8] U. Meister and U. Holzbaur. A polynomial time bound for Howard’s policy improvement
algorithm. OR Spektrum, 8:37–40, 1986.
[9] M. L. Puterman. Markov decision processes: Discrete stochastic dynamic programming.
John Wiley & Sons, 1994.
[10] S. Rao, R. Chandrasekaran, and K. Nair. Algorithms for discounted games. Journal of
Optimization Theory and Applications, pages 627–637, 1973.
[11] B. Scherrer. Improved and generalized upper bounds on the complexity of policy itera-
tion. CoRR, abs/1306.0386, 2013.
[12] L. Shapley. Stochastic games. Proc. Nat. Acad. Sci. U.S.A., 39:1095–1100, 1953.
[13] Y. Ye. The simplex and policy-iteration methods are strongly polynomial for the markov
decision problem with a fixed discount rate. Math. Oper. Res., 36(4):593–603, 2011.