Lecture Notes

School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel.
Definition 1.1 (Markov decision process) A Markov decision process (MDP) is a tuple
M = (S, A, s, c, p), where
• S is a set of states,
• A is a set of actions,
• s : A → S associates each action a ∈ A with the state s(a) from which it can be used,
• c : A → R associates each action a ∈ A with a cost c(a),
• and p : A → ∆(S) associates each action a ∈ A with a probability distribution p(a) over states,
according to which the next state is chosen when a is used.
For every i ∈ S, we let Ai = {a ∈ A | s(a) = i} be the set of actions that can be used from i.
We assume that Ai ≠ ∅, for every i ∈ S. We use n = |S| and m = |A| to denote the number
of states and actions, respectively.
• We let P ∈ Rm×n , where Pa,i = (p(a))i is the probability of ending up in state i after
taking action a, for every a ∈ A = [m] and i ∈ S = [n], be the probability matrix of
M.
• We let c ∈ Rm , where ca = c(a) is the cost of action a ∈ A = [m], be the cost vector
of M .
• We let J ∈ Rm×n , where Ja,i = 1 if a ∈ Ai and Ja,i = 0 otherwise, be the source matrix
of M .
Figure 1 shows an example of an MDP and its matrix representation. The states are pre-
sented as circles numbered from 1 to 3. I.e., the set of states is S = {1, 2, 3}. The actions
are represented as arrows leaving the states. The set of actions is A = {a1 , . . . , a6 }. We,
for instance, have s(a4 ) = 2. The costs of the actions are shown inside the diamond-shaped
vertices. For instance, c(a4 ) = 2. Finally, the probability distribution associated with an
action is shown as numbers labelling the edges leaving the corresponding diamond-shaped
vertex. For instance, the probability of moving to state 1 when using action a4 is 1/2.
A policy for the controller of an MDP is a rule that specifies which action should be taken in
each situation. The decision may in general depend on the current state of the process and
possibly on the history, i.e., the sequence of states that have been visited and the actions
that were used in those states. We also allow decisions to be randomized. We denote the set
of histories by H. The set of policies can be restricted in a natural way as follows.
[Figure 1: An example MDP with states S = {1, 2, 3} and actions A = {a1, . . . , a6}, together
with its matrix representation and a positional policy π = {a1, a4, a5} (bold gray arrows).
The matrices and vectors shown in the figure are:

        1 0 0            0    1/2  1/2            7
        1 0 0            1    0    0              3
    J = 0 1 0       P =  1    0    0        c =  −4
        0 1 0            1/2  1/4  1/4            2
        0 0 1            0    1    0              5
        0 0 1            0    1/3  2/3          −10

         0    1/2  1/2            7
    Pπ = 1/2  1/4  1/4      cπ =  2
         0    1    0              5

    k    e1^T Pπ^k
    0    1       0       0
    1    0       1/2     1/2
    2    2/8     5/8     1/8
    3    10/32   13/32   9/32   ]
First, we may require that every decision only depends on the current state and the time (the
number of steps performed). We call such policies time-dependent. Second, we may require
that every decision only depends on the current state. We call such policies positional. We
also require that positional policies do not make use of randomization.
If an MDP starts in some state i ∈ S and the controller uses a policy π, then the state
reached after t steps is a random variable Xπ (i, t), and the action used from the t-th state is
a random variable Yπ (i, t).
The random variable Xπ (i, t) can be nicely described for positional policies. We identify
a positional policy π with the set π(S) ⊆ A of actions used in π. We let Pπ ∈ Rn×n be
the matrix obtained by selecting the rows of P whose indices belong to π. We will assume
that the actions are ordered according to states, such that for every policy π, by the above
convention we have Jπ = I, the identity matrix. Note that (Pπ)i = Pπ(i). Similarly, we let
cπ ∈ Rn be the vector containing the costs of the actions that belong to π. Note that Pπ is
a (row) stochastic matrix; its elements are non-negative and the elements in each row sum
to 1. Thus, Pπ defines a Markov chain, and for every i, j ∈ S and t ≥ 0 we get that
Pr[Xπ(i, t) = j] = (Pπ^t)i,j.
In particular, if an MDP starts in some state i and the controller uses a positional policy
π, then the probabilities of being in the different states after t steps are given by the vector
e_i^T Pπ^t, where e_i is the i-th unit vector.
Figure 1 shows a positional policy π for a simple MDP. The policy is represented by bold
gray arrows. The corresponding matrix Pπ and vector cπ are also shown. Furthermore, the
table at the lower right corner shows the probabilities of being in different states after k
steps when starting in state 1.
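To make the matrix view concrete, the following is a minimal numpy sketch that encodes the example MDP from Figure 1 (the array values are transcribed from the figure; the variable names are ours) and recomputes the table of probabilities e1^T Pπ^k.

    import numpy as np

    # Source matrix J, probability matrix P (one row per action a1..a6), and cost vector c,
    # transcribed from Figure 1.
    J = np.array([[1, 0, 0],
                  [1, 0, 0],
                  [0, 1, 0],
                  [0, 1, 0],
                  [0, 0, 1],
                  [0, 0, 1]])
    P = np.array([[0.0, 0.5,  0.5 ],
                  [1.0, 0.0,  0.0 ],
                  [1.0, 0.0,  0.0 ],
                  [0.5, 0.25, 0.25],
                  [0.0, 1.0,  0.0 ],
                  [0.0, 1/3,  2/3 ]])
    c = np.array([7.0, 3.0, -4.0, 2.0, 5.0, -10.0])

    # The positional policy pi = {a1, a4, a5}: select the corresponding rows of P and c.
    pi = [0, 3, 4]                 # indices of a1, a4, a5
    P_pi, c_pi = P[pi], c[pi]

    # Probabilities e_1^T P_pi^k of being in states 1, 2, 3 after k steps,
    # starting in state 1 (index 0).
    dist = np.array([1.0, 0.0, 0.0])
    for k in range(4):
        print(k, dist)             # reproduces the table in Figure 1
        dist = dist @ P_pi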
Running an MDP with a (possibly history-dependent) policy π generates a random, infinite
sequence of costs c(Yπ (i, 0)), c(Yπ (i, 1)), c(Yπ (i, 2)), . . . . This sequence of costs is evaluated
according to some optimality criterion C, and the goal of the controller is to minimize the
expected resulting value, val^C_π(i). We will use val^C(c0, c1, c2, . . .) to denote the value of the
infinite sequence c0, c1, c2, . . . according to the criterion C. We then have:

    val^C_π(i) = E[ val^C( c(Yπ(i, 0)), c(Yπ(i, 1)), c(Yπ(i, 2)), . . . ) ] .
Let us again note that things work out nicely for positional policies. In this case the pair
Pπ , cπ is a Markov chain with costs assigned to its states, and the expected cost observed at
time t when starting from state i and using the positional policy π is then:
    E[c(Yπ(i, t))] = e_i^T Pπ^t cπ .    (1)
Definition 1.4 (Discounted cost, average cost, and total cost) Let 0 < γ < 1 be a
discount factor. The discounted cost criterion D(γ), the average cost criterion A, and the
total cost criterion T are defined by:
    val^{D(γ)}(c0, c1, c2, . . .) = Σ_{t=0}^{∞} γ^t c_t

    val^A(c0, c1, c2, . . .) = limsup_{N→∞} (1/N) Σ_{t=0}^{N−1} c_t

    val^T(c0, c1, c2, . . .) = Σ_{t=0}^{∞} c_t
Note that for the total cost criterion the series may diverge. We will later introduce a
stopping condition that ensures convergence in this case.
Definition 1.5 (Optimal policy) Let M = (S, A, s, c, p) be an MDP and let π be a policy.
We say that π is optimal for M with respect to a criterion C if and only if for all π′ ∈ Π(H)
and all i ∈ S we have val^C_π(i) ≤ val^C_{π′}(i).
Note that an optimal policy must minimize the expected value for all states simultaneously.
The existence of an optimal policy is non-trivial. We are, however, going to prove the
following theorem that shows that, in fact, there always exists an optimal positional policy.
The theorem was first proved by Shapley [12] in 1953 for discounted costs.
Theorem 1.6 Every MDP has an optimal positional policy w.r.t. the discounted cost crite-
rion, the average cost criterion, and the total cost criterion.
Theorem 1.6 shows that when looking for an optimal policy we may restrict our search to
positional policies. We will prove the theorem in the later sections. The following lemma
makes the first step towards proving Theorem 1.6. It shows that we may restrict our search to
time-dependent policies. More precisely, the lemma shows that for every history-dependent
policy there exists a time-dependent policy that visits the same states and uses the same
actions with the same probabilities.
Lemma 1.7 Let π ∈ Π(H) be any history-dependent policy, and let i0 ∈ S be any starting
state. Then there exists a time-dependent policy π′ ∈ Π(T) such that for every non-negative
integer t ≥ 0, every state i ∈ S, and every action a ∈ A we have Pr[Xπ′(i0, t) = i] =
Pr[Xπ(i0, t) = i] and Pr[Yπ′(i0, t) = a] = Pr[Yπ(i0, t) = a].
Proof: Since we are given π we can construct π′ such that for all t ≥ 0, i ∈ S, and a ∈ A:

    Pr[π′(i, t) = a] = Pr[Yπ(i0, t) = a | Xπ(i0, t) = i] ,

whenever Pr[Xπ(i0, t) = i] > 0 (and arbitrarily otherwise).
It follows that if Pr[Xπ′(i0, t) = i] = Pr[Xπ(i0, t) = i], i.e., the two policies reach the same
states in t steps with the same probabilities, then Pr[Yπ′(i0, t) = a] = Pr[Yπ(i0, t) = a], i.e.,
the same actions are chosen from the t-th state with the same probabilities.
We next prove by induction on t that Pr[Xπ′(i0, t) = i] = Pr[Xπ(i0, t) = i] for all t ≥ 0 and
i ∈ S. For t = 0 we have Pr[Xπ(i0, 0) = i0] = Pr[Xπ′(i0, 0) = i0] = 1 as desired. Recall
that Pa,i is the probability of moving to state i when using action a. For t > 0 we have by
induction:

    Pr[Xπ′(i0, t) = i] = Σ_{j∈S} Σ_{a∈Aj} Pr[Xπ′(i0, t−1) = j] · Pr[π′(j, t−1) = a] · Pa,i
                       = Σ_{j∈S} Σ_{a∈Aj} Pr[Xπ(i0, t−1) = j] · Pr[Yπ(i0, t−1) = a | Xπ(i0, t−1) = j] · Pa,i
                       = Pr[Xπ(i0, t) = i] .
2 The discounted cost criterion

We leave it as an exercise to show that val^{D(γ)}_π(i) exists for every policy π ∈ Π(H). In the
important special case of positional policies we get, using (1):

    val^{D(γ)}_π(i) = Σ_{t=0}^{∞} e_i^T (γPπ)^t cπ .
The following lemma is helpful for understanding this infinite sum. We leave proving the
lemma as an exercise.
Lemma 2.1 For any policy π, the matrix (I − γPπ) is nonsingular and

    (I − γPπ)^{−1} = Σ_{t=0}^{∞} (γPπ)^t ≥ I .
Definition 2.2 (Value vector) For every policy π, we define the value vector vπ ∈ Rn by:

    ∀i ∈ S : (vπ)i = val^{D(γ)}_π(i) .

Lemma 2.3 For every positional policy π, we have

    vπ = (I − γPπ)^{−1} cπ .
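As a sketch of how Lemma 2.3 is used computationally, the value vector of a positional policy can be obtained by solving the linear system (I − γPπ)vπ = cπ. Continuing the numpy sketch from above, with an arbitrarily chosen discount factor:

    gamma = 0.9                                      # an arbitrary discount factor
    n = P_pi.shape[0]
    # Solve (I - gamma * P_pi) v = c_pi rather than forming the inverse explicitly.
    v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, c_pi)
    print(v_pi)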
2.1 The value iteration algorithm

Definition 2.4 (Value iteration operator) Let v ∈ Rn be an arbitrary vector. Define
the value iteration operator T : Rn → Rn by:

    ∀i ∈ S : (T v)i = min_{a∈Ai} ( ca + γPa v ) .

Definition 2.5 (Policy extraction operator) Define the policy extraction operator P :
Rn → Π(P) to map any vector v ∈ Rn to a positional policy π = Pv such that:

    ∀i ∈ S : π(i) ∈ argmin_{a∈Ai} ( ca + γPa v ) .
The following relation between the value iteration operator and the policy extraction operator
is immediate.

Lemma 2.6 For every vector v ∈ Rn, if π = Pv then T v = cπ + γPπ v.
Following the above discussion, the minimum values that can be obtained using any (time-
dependent) policy for a discounted, finite-horizon MDP that terminates after t steps with
termination costs specified by v ∈ Rn are T^t v. Furthermore, the actions used after k steps,
for k < t, are those of the positional policy P T^{t−k−1} v. The idea of the value iteration
algorithm is to repeat the process for t going to infinity, or until t is sufficiently large.
For every time-dependent policy π ∈ Π(T) and every integer t ≥ 0, we let π>t be the
policy obtained from π by cutting away the first t rounds, that is, π>t(i, k) = π(i, k + t) for
all i ∈ S and k ≥ 0. Moreover, we let π∆t be the policy obtained from π by changing the
decisions of the first t rounds such that for the k-th round, for k < t, we use the actions
from the positional policy P T^{t−k−1} vπ>t. Note that the first t rounds of π∆t can be viewed as
an optimal policy for the finite-horizon MDP that terminates after t steps with termination
costs vπ>t. In particular, the decisions made by π∆t during the first t rounds are at least as
good as the corresponding decisions made by π. We get the following lemma.
Lemma 2.7 Let π ∈ Π(T ) be any time-dependent policy, and let t ≥ 0 be an integer. Then
    vπ∆t = T^t vπ>t ≤ vπ .
We next show that vπ∆t always converges to a unique vector v∗, regardless of the policy π, and
that Pv∗ is an optimal positional policy. First we show that the operator T is a contraction
with Lipschitz constant γ.

Lemma 2.8 For all vectors u, v ∈ Rn,

    ‖T u − T v‖∞ ≤ γ ‖u − v‖∞ .
Proof: Let i ∈ S, and assume that (T u)i ≥ (T v)i. Let a ∈ argmin_{a′∈Ai}(ca′ + γPa′ u) and
b ∈ argmin_{a′∈Ai}(ca′ + γPa′ v). Then,

    (T u − T v)i = (ca + γPa u) − (cb + γPb v)
                 ≤ (cb + γPb u) − (cb + γPb v)
                 = γPb (u − v)
                 ≤ γ ‖u − v‖∞ .
The last inequality follows from the fact that the elements in Pb are non-negative and sum
up to 1. The case when (T u)i ≤ (T v)i is analogous.
Since the inequality holds for all states i, it holds for the state for which the absolute
difference of T u and T v is largest, and, hence, the result follows.
The Banach fixed point theorem now implies that the operator T has a unique fixed point.

Corollary 2.9 The operator T has a unique fixed point v∗ ∈ Rn, and for every vector
u ∈ Rn, T^t u → v∗ for t → ∞.

The following lemma shows that the policy π∗ = Pv∗ is optimal, which proves Theorem 1.6
for the discounted cost criterion. Note that all optimal policies must, by definition, have
the same value vectors. We therefore refer to v∗ as the optimal value vector.
Lemma 2.10 Let v∗ ∈ Rn be the unique fixed point of T , and let π∗ = Pv∗. Then, π∗ is
an optimal policy, and vπ∗ = v∗.

Proof: We first show that vπ∗ = v∗. Since v∗ is a fixed point for T and π∗ = Pv∗ we get
from Lemma 2.6 that:

    v∗ = T v∗ = cπ∗ + γPπ∗ v∗ .

We know from Lemma 2.1 that the matrix (I − γPπ∗) is non-singular, which implies that
v∗ = (I − γPπ∗)^{−1} cπ∗. Lemma 2.3 then shows that vπ∗ = v∗.
We next show that π∗ is an optimal policy. We know from Lemma 2.7 that for every time-
dependent policy π ∈ Π(T) and every integer t ≥ 0, we have vπ∆t = T^t vπ>t ≤ vπ. We will
show that T^t vπ>t → v∗ for t → ∞. It then follows that v∗ ≤ vπ, which shows that π∗ is
optimal.
Let cmax = max_{a∈A} |ca| be the largest absolute value of the cost of any action. Observe first that
for every policy π and every state i ∈ S, we have |(vπ)i| ≤ cmax/(1 − γ). Indeed, cmax is the largest
absolute value of any cost incurred by a single step of the MDP, and if the cost is cmax at
every step the total discounted cost is Σ_{t=0}^{∞} γ^t cmax = cmax/(1 − γ). From repeated use of Lemma 2.8
we then get:

    ‖T^t vπ>t − v∗‖∞ = ‖T^t vπ>t − T^t v∗‖∞ ≤ γ^t ‖vπ>t − v∗‖∞ ≤ γ^t · 2cmax/(1 − γ) ,

which tends to 0 for t → ∞.
Function ValueIteration(u, ε)
    while ‖u − T u‖∞ > (ε/2)(1 − γ) do
        u ← T u;
    return u;

Figure 2: The value iteration algorithm.
The ValueIteration algorithm, given in Figure 2, repeatedly applies the value iteration
operator T to an initial vector u ∈ Rn until the difference between two successive vectors is
sufficiently small, i.e., ‖u − T u‖∞ ≤ (ε/2)(1 − γ) for some given ε ≥ 0.
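The following is a minimal numpy sketch of the value iteration operator T, the policy extraction operator P, and the ValueIteration loop of Figure 2, continuing the earlier sketch (P, c, J, gamma as before; the function names are ours).

    n = J.shape[1]

    def T(v):
        # (T v)_i = min over a in A_i of (c_a + gamma * P_a v); column i of J marks A_i.
        q = c + gamma * (P @ v)                    # one-step value of every action
        return np.array([q[J[:, i] == 1].min() for i in range(n)])

    def extract_policy(v):
        # Policy extraction operator: a minimizing action for every state.
        q = c + gamma * (P @ v)
        pol = []
        for i in range(n):
            acts = np.where(J[:, i] == 1)[0]
            pol.append(int(acts[np.argmin(q[acts])]))
        return pol

    def value_iteration(u, eps):
        # Repeat u <- T u until ||u - T u||_inf <= (eps/2)(1 - gamma), as in Figure 2.
        while True:
            Tu = T(u)
            if np.max(np.abs(u - Tu)) <= (eps / 2) * (1 - gamma):
                return u
            u = Tu

    u = value_iteration(np.zeros(n), eps=1e-6)
    print(u, extract_policy(u))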
The value iteration algorithm was introduced for MDPs by Bellman [1] in 1957. It can be
viewed as a special case of an algorithm by Shapley [12] from 1953 for solving the more
general class of stochastic games. Note that the vector T t u can be viewed as the optimal
value vector for the finite-horizon MDP that terminates after t steps with termination costs
specified by u. As noted in the beginning of the section, the value iteration algorithm can
therefore be viewed as a dynamic programming algorithm.
We say that a policy π is ε-optimal if ‖vπ − v∗‖∞ ≤ ε, for some ε ≥ 0. The following lemma
shows that when the value iteration algorithm terminates we can extract an -optimal policy
from the resulting vector.
Lemma 2.11 Let u ∈ Rn be a vector, and let π = Pu be the policy extracted from u. If
‖u − T u‖∞ ≤ (ε/2)(1 − γ), then ‖u − v∗‖∞ ≤ ε/2 and ‖u − vπ‖∞ ≤ ε/2. In particular, π is
ε-optimal.
Lemma 2.12 For a given vector u ∈ Rn, let N ≥ (1/(1−γ)) log( 4‖u − v∗‖∞ / (ε(1−γ)) ) be an
integer. Then

    ‖T^N u − T^{N+1} u‖∞ ≤ (ε/2)(1 − γ) .

Proof: We will show that ‖T^N u − v∗‖∞ ≤ (ε/4)(1−γ). Indeed, repeated use of Lemma 2.8
gives ‖T^N u − v∗‖∞ ≤ γ^N ‖u − v∗‖∞ ≤ e^{−N(1−γ)} ‖u − v∗‖∞ ≤ (ε/4)(1−γ), by the choice
of N. Since, by Lemma 2.8, ‖T^{N+1} u − v∗‖∞ ≤ γ‖T^N u − v∗‖∞, it then follows from the
triangle inequality that:

    ‖T^N u − T^{N+1} u‖∞ ≤ ‖T^N u − v∗‖∞ + ‖T^{N+1} u − v∗‖∞ ≤ (ε/2)(1 − γ) .
Lemmas 2.11 and 2.12 show that ValueIteration(u, ε) computes a vector from which
we can extract an ε-optimal policy. Furthermore, Lemma 2.12 shows that the number of
iterations needed to find this vector is at most O( (1/(1−γ)) log( ‖u − v∗‖∞ / (ε(1−γ)) ) ). We next
estimate the ε for which the resulting policy is guaranteed to be optimal.
For every MDP M , let L(M, γ) be the number of bits needed to describe the matrix (J −γP )
and the vector c. More precisely, for every number a = p/q, where p ∈ Z and q ∈ N are
relatively prime, we use 1 + dlog2 (|p| + 1)e + dlog2 (q + 1)e bits. Using Cramer’s rule and
simple bounds on the size of determinants, one can prove that the number of bits needed to
describe a component of the value vector vπ = (I − γPπ )−1 cπ , for some policy π, is at most
4L(M, γ). We leave this as an exercise.
Lemma 2.13 Let π and π′ be two policies such that vπ ≠ vπ′. Then ‖vπ − vπ′‖∞ ≥
2^{−4L(M,γ)}.

Corollary 2.14 If a policy π is ε-optimal for some ε < 2^{−4L(M,γ)}, then π is optimal.
Let cmax = max_{a∈A} |ca|. As we saw in the proof of Lemma 2.10, it is not difficult to show
that for every policy π and every state i, we have |(vπ)i| ≤ cmax/(1 − γ). We also clearly have
cmax ≤ 2^{L(M,γ)}. Hence, ‖v∗ − 0‖∞ ≤ 2^{L(M,γ)}/(1 − γ), where 0 is the all-zero vector.
Combining Lemma 2.12 and Corollary 2.14 we then get the following bound on the number
of iterations needed for the value iteration algorithm to produce an optimal policy:
Theorem 2.15 Let π = P T^N 0, where N = (1/(1−γ)) log( 8 · 2^{5L(M,γ)} / (1−γ)^2 ) ≤
O( (L(M,γ)/(1−γ)) log(1/(1−γ)) ). Then π is optimal.
2.2 The policy iteration algorithm

We next study the policy iteration algorithm. It is the most widely used algorithm for solving
MDPs in practice. As we will see in Section 2.3 below, it can be viewed as a generalization
of the simplex method for solving linear programs in the context of MDPs.
Throughout this section we restrict our attention to positional policies. Hence, whenever we
use the term policy we mean a positional policy.
For every policy π, define the operator Tπ : Rn → Rn by:

    Tπ v = cπ + γPπ v .
Recall that Lemma 2.3 says that vπ = (I − γPπ )−1 cπ . Hence, for every policy π, the value
vector vπ is a fixed point for Tπ . The following lemma can be proved in the same way as
Lemma 2.8 and Corollary 2.9.
Lemma 2.17 For every policy π, vπ is the unique fixed point of Tπ, and for every vector
u ∈ Rn, Tπ^t u → vπ for t → ∞.
Lemma 2.18 Let π and π′ be two policies. If Tπ′ vπ < vπ, then vπ′ ≤ Tπ′ vπ < vπ. The
same holds for ≤, >, and ≥.
Proof: We only prove the lemma for <. The other cases are proved analogously.
First observe that for any vector u ∈ Rn, if Tπ′ u ≤ u then, since all entries of Pπ′ are
non-negative, we get:

    Tπ′^2 u = cπ′ + γPπ′ (Tπ′ u) ≤ cπ′ + γPπ′ u = Tπ′ u .

If Tπ′ vπ ≤ vπ, then repeated use of the above inequality shows that vπ ≥ Tπ′ vπ ≥ Tπ′^2 vπ ≥
Tπ′^3 vπ ≥ . . . . Since, by Lemma 2.17, Tπ′^t vπ → vπ′ for t → ∞, it follows that vπ ≥ Tπ′ vπ ≥ vπ′.
In particular, if Tπ′ vπ < vπ then vπ′ < vπ.
Lemma 2.19 Let π be a policy. If for every policy π′ we have Tπ′ vπ ≮ vπ, then π is optimal.
Proof: We show that if for every policy π′ we have Tπ′ vπ ≮ vπ, then for every policy π′
we have Tπ′ vπ ≥ vπ. It then follows from Lemma 2.18 that vπ′ ≥ vπ for all policies π′.

Assume for the sake of contradiction that there exists a policy π′ such that (Tπ′ vπ)i <
(vπ)i for some state i. We will construct a policy π″ such that Tπ″ vπ < vπ, which gives a
contradiction. Let π″ be defined by π″(i) = π′(i) and π″(j) = π(j) for all j ≠ i. Then,
(Tπ″ vπ)i < (vπ)i and (Tπ″ vπ)j = (vπ)j for all j ≠ i.
Note that Tπ′ vπ < vπ can be equivalently stated as cπ′ − (I − γPπ′)vπ < 0. I.e., if an action
is better for one step with respect to the values of the current policy, then it is also better
to keep using this action. This motivates the following definition.
Function PolicyIteration(π)
    while ∃π′ : Tπ′ vπ < vπ do
        π ← π′;
    return π;

Function PolicyIteration(π)
    while ∃ an improving switch w.r.t. π do
        Update π by performing improving switches;
    return π;

Function PolicyIteration(π)
    while π ≠ Pvπ do
        π ← Pvπ;
    return π;

Figure 3: Two equivalent formulations of the policy iteration algorithm (top and middle),
and the policy iteration algorithm with Howard's improvement rule (bottom).
Definition 2.20 (Reduced costs, improving switches) The reduced cost vector c̄π ∈
Rm corresponding to a policy π is defined to be
c̄π = c − (J − γP )vπ .
We say that an action a ∈ A is an improving switch with respect to a policy π if and only if
c̄πa < 0.
Note that actions a ∈ π have reduced cost (c̄π)a = 0. We say that a policy π′ is obtained from π
by performing improving switches if every new action a ∈ π′ \ π is an improving switch with
respect to π, i.e., if (c̄π)a < 0. We will use the notation π′ = π[B], where B = π′ \ π. Lemma
2.18 can then be interpreted as saying that if a policy π′ is obtained from π by performing
improving switches then vπ′ < vπ. On the other hand, Lemma 2.19 says that if there are no
improving switches with respect to a policy π, then π is optimal. Hence, a policy is optimal
if and only if there are no improving switches.
The PolicyIteration algorithm is given in Figure 3. It starts with some initial policy π^0
and generates an improving sequence π^0, π^1, . . . , π^N of policies, ending with an optimal policy
π^N. In each iteration the algorithm evaluates the current policy π^k and computes the value
vector v^{π^k} by solving a system of linear equations. The next policy π^{k+1} is obtained from π^k
by performing a non-empty set of improving switches B ⊆ A with respect to π^k, such that
π^{k+1} = π^k[B].
It follows from Lemma 2.18 that the value vectors strictly improve with each iteration;
vk+1 < vk . Hence, the number of iterations is bounded by the number of policies. Moreover,
since there are no improving switches with respect to the final policy π N , we know from
Lemma 2.19 that π N is optimal. We get:
Theorem 2.21 For every initial policy π, PolicyIteration(π) terminates after a finite
number of iterations, returning an optimal policy.
The set of improving switches that is performed is decided by an improvement rule. There
is, in fact, a whole family of policy iteration algorithms using different improvement rules.
The most natural variant is, perhaps, the one in which the algorithm selects the improving
switch with most negative reduced cost from every state and performs all these switches
simultaneously, i.e., π^{k+1} = Pv^{π^k}. This was the original improvement rule suggested by
Howard [7], and we will refer to the algorithm obtained by using this improvement rule as
Howard's policy iteration algorithm.
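A minimal numpy sketch of Howard's policy iteration, continuing the earlier example: each iteration evaluates the current policy by solving (I − γPπ)vπ = cπ, computes the reduced costs c̄π = c − (J − γP)vπ, and switches every state to its most negative reduced-cost action. The function names and the small tolerance are ours.

    def evaluate(pol):
        # Value vector of a positional policy: solve (I - gamma * P_pol) v = c_pol.
        P_pol, c_pol = P[pol], c[pol]
        return np.linalg.solve(np.eye(n) - gamma * P_pol, c_pol)

    def howard_policy_iteration(pol):
        while True:
            v = evaluate(pol)
            rc = c - (J - gamma * P) @ v            # reduced costs of all actions
            new_pol = []
            for i in range(n):
                acts = np.where(J[:, i] == 1)[0]
                new_pol.append(int(acts[np.argmin(rc[acts])]))
            if not any(rc[a] < -1e-12 for a in new_pol):   # no improving switch left
                return pol, v
            pol = new_pol

    pol, v = howard_policy_iteration([1, 2, 4])     # start from an arbitrary policy
    print(pol, v)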
For the remainder of this section we focus on Howard's policy iteration algorithm. Let π^0
be some initial policy. We next relate the sequences of value vectors obtained by running
Howard’s policy iteration algorithm and the value iteration algorithm. The following lemma
appears, e.g., in Meister and Holzbaur [8]. We use v^k as shorthand notation for v^{π^k}.

Lemma 2.22 For every k ≥ 0, v^k ≤ T^k v^0.
Proof: We prove the lemma by induction. Clearly the statement is true for k = 0. Suppose
now that v^k ≤ T^k v^0. Since π^{k+1} is obtained from π^k by performing improving switches we
know that T_{π^{k+1}} v^k < v^k. Furthermore, since π^{k+1} = Pv^k, we have T v^k = T_{π^{k+1}} v^k. Using
Lemma 2.18, it then follows that v^{k+1} ≤ T_{π^{k+1}} v^k = T v^k. Finally, from the induction
hypothesis and the monotonicity of the value iteration operator we get v^{k+1} ≤ T^{k+1} v^0.
Lemma 2.22 shows that Howard's policy iteration algorithm converges to the optimal value
vector at least as fast as the value iteration algorithm.
Combining Lemma 2.22 with Theorem 2.15 we get the following theorem. In Theorem 2.15
the value iteration algorithm is initialized with the all-zero vector. It is not difficult to
see, however, that the value vector of any policy π′ works as well, i.e., ‖vπ′ − v∗‖∞ ≤
2^{L(M,γ)+1}/(1 − γ), for any policy π′.
Theorem 2.23 Starting with any policy π, the number of iterations performed by Howard's
policy iteration algorithm is at most O( (L(M,γ)/(1−γ)) log(1/(1−γ)) ).
A recent series of papers by Ye [13]; Hansen, Miltersen, and Zwick [6]; and Scherrer [11] has
shown the following improved bound. Note that the bound is strongly polynomial when γ
is a fixed constant, i.e., the bound does not depend on the bit complexity L(M, γ). In fact,
the bound is linear in m, the number of actions, when γ is fixed.
Theorem 2.24 Starting with any policy π, the number of iterations performed by Howard's
policy iteration algorithm is at most O( (m/(1−γ)) log(1/(1−γ)) ).
2.3 Linear programming

Definition 2.25 (Flux vector) For every policy π, define the flux vector xπ ∈ Rm by:

    ∀a ∈ A : (xπ)a = Σ_{i∈S} Σ_{t=0}^{∞} γ^t Pr[Yπ(i, t) = a] .
We let e = (1, 1, . . . , 1)^T ∈ Rn be the all-ones vector. When π is a positional policy we use x̄π =
(xπ)π ∈ Rn to denote the vector obtained from xπ by selecting the entries corresponding to
actions used in π. Note that for positional policies we have (xπ)a = 0 for a ∉ π. Also note that
for every positional policy π, every i, j ∈ S, and every t ≥ 0, we have Pr[Yπ(i, t) = π(j)] =
Pr[Xπ(i, t) = j] = (Pπ^t)i,j. Hence,

    ∀j ∈ S : (xπ)π(j) = (x̄π)j = Σ_{i∈S} Σ_{t=0}^{∞} γ^t (Pπ^t)i,j = e^T Σ_{t=0}^{∞} (γPπ)^t ej .

Lemma 2.26 For every positional policy π, (x̄π)^T = e^T (I − γPπ)^{−1}.
Note that since an optimal policy simultaneously minimizes the values of all states, it also
minimizes the sum of the values of the states. Flux vectors provide an alternative way of
summing the values of all the states.

Lemma 2.27 For every policy π,

    e^T vπ = c^T xπ .
Proof: Recall that:

    (vπ)i = Σ_{t=0}^{∞} γ^t E[c(Yπ(i, t))] = Σ_{t=0}^{∞} γ^t Σ_{a∈A} ca Pr[Yπ(i, t) = a] .

Hence we have:

    e^T vπ = Σ_{i∈S} Σ_{t=0}^{∞} γ^t Σ_{a∈A} ca Pr[Yπ(i, t) = a]
           = Σ_{a∈A} ca Σ_{i∈S} Σ_{t=0}^{∞} γ^t Pr[Yπ(i, t) = a]
           = Σ_{a∈A} ca (xπ)a = c^T xπ .
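As a quick numerical check of this identity on the running example: for a positional policy the nonzero part of the flux vector is x̄π = (I − γPπ)^{−T} e, by the formula for (x̄π)^T derived above, so e^T vπ = c^T xπ can be verified directly (continuing the earlier numpy sketches).

    # Flux of the positional policy pi: xbar = (I - gamma * P_pi)^{-T} e.
    e = np.ones(n)
    xbar = np.linalg.solve((np.eye(n) - gamma * P_pi).T, e)
    x = np.zeros(len(c))
    x[pi] = xbar                                    # (x^pi)_a = 0 for actions not in pi
    print(np.isclose(v_pi.sum(), c @ x))            # e^T v^pi == c^T x^pi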
Note that for every policy π and every state i, the expected number of times we use actions
leaving i must be equal to the expected number of times we reach i. Also, no action can be
used a negative number of times. Hence, the flux vector must satisfy xπ ≥ 0 and

    ∀i ∈ S : Σ_{a∈Ai} (xπ)a = 1 + Σ_{a∈A} (xπ)a γPa,i .    (2)
Using matrix notation, this equality can be stated as (J − γP )T xπ = e. This leads us to the
following definition of a linear program (P ) and its dual (D). The variables of the primal
linear program (P ) correspond to flux vectors, and the variables of the dual linear program
(D) correspond to value vectors.
    (P)   min  c^T x                      (D)   max  e^T y
          s.t. (J − γP)^T x = e                 s.t. (J − γP) y ≤ c
               x ≥ 0
The above discussion proves the following lemma.
Lemma 2.28 For every policy π, the flux vector xπ is a feasible solution to the linear pro-
gram (P ).
Lemma 2.29 For every feasible solution x to the linear program (P ) there exists a policy π
with flux vector xπ = x.
Proof: Let x be a feasible solution to (P). For every i ∈ S, let x̄i = Σ_{a∈Ai} xa, and let π be the
policy that, in every round and from every state i, uses action a ∈ Ai with probability xa/x̄i.
Note that since the right-hand side of (2) is at least 1, x̄i ≥ 1 for all i, and π is well-defined.
Observe also that since π uses the same (randomized) decisions in every round it defines a
Markov chain given by the stochastic matrix Q ∈ Rn×n where Qj = (1/x̄j) Σ_{a∈Aj} xa Pa, for all
states j. Define x̄π ∈ Rn by (x̄π)i = Σ_{a∈Ai} (xπ)a for all i, i.e., (x̄π)i is the expected number
of times we visit state i. In the same way as we proved Lemma 2.26 we then get that:

    (x̄π)^T = e^T Σ_{t=0}^{∞} (γQ)^t = e^T (I − γQ)^{−1} .
Next we show that xπ = x. It suffices to show that x̄π = x̄, since the definition of π then
ensures that the flux is distributed correctly on the different actions. Since x is a feasible
solution to (P) it must satisfy:

    ∀i ∈ S : x̄i = Σ_{a∈Ai} xa = 1 + Σ_{j∈S} Σ_{a∈Aj} xa γPa,i = 1 + Σ_{j∈S} x̄j γQj,i = 1 + x̄^T γQ ei .

Using matrix notation this equality says that x̄ = e + γQ^T x̄, or x̄^T = e^T (I − γQ)^{−1}. Since
(x̄π)^T = e^T (I − γQ)^{−1}, we have shown that x̄π = x̄ as desired.
Theorem 2.30 The linear program (P) has an optimal solution x, and from x we get an
optimal time-dependent policy π defined by

    ∀t ≥ 0, ∀i ∈ S, ∀a ∈ Ai : Pr[π(i, t) = a] = xa / Σ_{b∈Ai} xb .
Proof: Since, by Lemma 2.28, every policy gives a feasible solution for (P) we know that
(P) is feasible. Furthermore, since, by Lemmas 2.29 and 2.27, every feasible solution of (P)
gives a policy with the same sum of values we know that (P) is bounded. Hence, (P) has an
optimal solution, and the corresponding policy π, constructed in the proof of Lemma 2.29,
is optimal.
Note that basic feasible solutions of (P ) correspond to positional policies. The existence of
an optimal basic feasible solution for (P ) therefore gives an alternative proof for the existence
of an optimal positional policy.
Furthermore, it is not difficult to see that the reduced cost vector from Definition 2.20
corresponds exactly to the reduced costs that the simplex method operates with. In fact,
this shows that the policy iteration algorithm can be viewed as a generalization of the simplex
method where multiple pivots are performed in parallel.
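As a sketch of the linear programming connection, the primal (P) can be handed directly to an off-the-shelf LP solver. Below, scipy's linprog solves (P) for the running example, and a positional policy is read off from the optimal flux by picking, for each state, the action carrying the largest flux (a basic optimal solution puts all of a state's flux on a single action). This assumes scipy is available; the rest continues the earlier numpy sketch.

    from scipy.optimize import linprog

    # Primal (P): minimize c^T x subject to (J - gamma * P)^T x = e, x >= 0.
    A_eq = (J - gamma * P).T
    b_eq = np.ones(n)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq)          # default bounds already enforce x >= 0
    x_opt = res.x

    # Read off a positional policy from the flux carried by each state's actions.
    opt_policy = [int(np.where(J[:, i] == 1)[0][np.argmax(x_opt[J[:, i] == 1])])
                  for i in range(n)]
    print(opt_policy, res.fun)                      # optimal policy and sum of optimal values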
3 Turn-based stochastic games
We next consider a generalization of MDPs known as turn-based stochastic games (TBSGs).
TBSGs are a special case of Shapley’s stochastic games [12], which were introduced in 1953.
A TBSG is an MDP where each state has been assigned to one of two players: player
1, the minimizer, and player 2, the maximizer. Whenever the game is in some state the
player controlling that state decides which action to use. The goal of the minimizer is to
minimize the incurred costs according to some criterion, while the goal of the maximizer is
to maximize the incurred costs. Hence, the game is a zero-sum game. In this section we
restrict our attention to the discounted cost criterion.
We use the same notation to describe TBSGs as we did for MDPs. In particular, probability
matrices, cost vectors, and source matrices are defined analogously. Policies are also defined
in the same way, except that a policy belongs to a player and only describes the decision
made by that player. Also, in order to be consistent with terminology from game theory we
use the term strategy instead of policy. We restrict our attention to positional strategies. As
for MDPs it can be shown that the players are not worse off by being restricted to positional
strategies. The proof is essentially the same for TBSGs.
Definition 3.2 (Strategies, strategy profiles) A positional strategy πj for player j, where
j ∈ {1, 2}, is a mapping πj : Sj → A such that πj (i) ∈ Ai , for every i ∈ Sj . A strategy
profile π = (π1 , π2 ) is a pair of strategies, one for each player. We denote the set of positional
strategies for player j by Πj (P), and the set of strategy profiles by Π(P) = Π1 (P) × Π2 (P).
Note that a strategy profile can be viewed as a policy for the underlying MDP. We again
let Pπ and cπ be obtained from P and c by selecting rows corresponding to actions used in
π. Hence, the value for a state i when the players play according to a strategy profile π is
again val^{D(γ)}_π(i) = Σ_{t=0}^{∞} γ^t E[c(Yπ(i, t))], and Lemma 2.3 shows that the value vector satisfies
vπ = (I − γPπ)^{−1} cπ.
Since there are two players with opposite objectives in a TBSG we need a different definition
of optimal strategies compared to the definition of optimal policies for MDPs. Note, however,
that if the strategy of one of the players is fixed then the game can be viewed as an MDP.
Indeed, if we remove the unused actions of the player whose strategy is fixed, we may transfer
control of his states to the other player. In this case the optimal counter-strategy of the player
whose strategy is not fixed should be the optimal policy in the corresponding MDP. We get
the following definition.
Definition 3.3 (Optimal counter-strategies) Let G be a TBSG and let π1 be a strategy
for player 1. We say that a strategy π2 for player 2 is an optimal counter-strategy against
π1 if and only if for all states i ∈ S:

    val^{D(γ)}_{π1,π2}(i) = max_{π2′ ∈ Π2(P)} val^{D(γ)}_{π1,π2′}(i) .

Optimal counter-strategies are defined analogously for player 1 with max exchanged with min.
It follows from Theorem 1.6 that optimal counter-strategies always exist, since we get an
MDP when the strategy of one player is fixed and MDPs always have optimal policies.
Let π1 be some strategy. If π2 is an optimal counter-strategy against π1, then val^{D(γ)}_{π1,π2}(i) is
the best possible value that player 2 can obtain for state i when player 1 uses the strategy π1.
Hence, we may view val^{D(γ)}_{π1,π2}(i) as an upper bound for the value player 1 can guarantee for
state i. It is natural to ask what the best guarantee a player can get from a single strategy
is. This leads to the following definition.
Definition 3.4 (Upper and lower values) For every state i ∈ S define the upper value,
\overline{val}(i), and the lower value, \underline{val}(i), by:

    \overline{val}(i) = min_{π1 ∈ Π1(P)} max_{π2 ∈ Π2(P)} val^{D(γ)}_{π1,π2}(i) ,
    \underline{val}(i) = max_{π2 ∈ Π2(P)} min_{π1 ∈ Π1(P)} val^{D(γ)}_{π1,π2}(i) .
Note that we must have \underline{val}(i) ≤ \overline{val}(i) for all states i. We say that a strategy is optimal
when it achieves the best possible guarantee against the other player.
Definition 3.5 (Optimal strategies) We say that a strategy π1 for player 1 is optimal if
and only if val^{D(γ)}_{π1,π2}(i) ≤ \overline{val}(i) for all states i and strategies π2 ∈ Π2(P). Similarly, a strategy
π2 for player 2 is optimal if and only if val^{D(γ)}_{π1,π2}(i) ≥ \underline{val}(i) for all i and π1 ∈ Π1(P).
The following theorem shows the existence of optimal strategies. The theorem was first
established by Shapley [12]. It can be proved in essentially the same way as Theorem 1.6 for
MDPs by using Theorem 3.8 below.
Theorem 3.6 For any TBSG G there exists an optimal positional strategy profile (π1 , π2 ).
Moreover, for all states i we have:

    val^{D(γ)}_{π1,π2}(i) = \underline{val}(i) = \overline{val}(i) .
By solving a TBSG we mean computing an optimal strategy profile. The following defini-
tion and theorem give an alternative interpretation of optimal strategies. The theorem is
standard for finite two-player zero-sum games. It was also established by Shapley [12]. We
leave it as an exercise to prove the theorem.
Definition 3.7 (Nash equilibrium) A strategy profile (π1 , π2 ) ∈ Π is a Nash equilibrium
if and only if π1 is an optimal counter-strategy against π2 , and π2 is an optimal counter-
strategy against π1 .
Theorem 3.8 A strategy profile π = (π1, π2) ∈ Π(P) is optimal if and only if it is a Nash
equilibrium.
3.1 The value iteration algorithm

The value iteration operator T : Rn → Rn extends to TBSGs by letting the player controlling
each state choose the action: (T v)i = min_{a∈Ai}(ca + γPa v) for i ∈ S1, and (T v)i =
max_{a∈Ai}(ca + γPa v) for i ∈ S2. Note that T^t v is the optimal value vector for the finite-horizon
TBSG that terminates after t steps with termination costs specified by v ∈ Rn.
The value iteration algorithm can also be defined for TBSGs. In fact, the algorithm remains
the same except that we now use the value iteration operator for two players. We refer to
Figure 2 for a description of the value iteration algorithm. The lemmas and theorems in
Section 2.1 can be extended to the two-player setting without much modification. We leave
this as an exercise.
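As a small illustration of the two-player operator (continuing the earlier numpy sketch, and assuming a hypothetical boolean array is_max that marks the states controlled by the maximizer), the only change compared to the MDP case is that maximizer states take a max instead of a min.

    # Hypothetical assignment of states to players: here player 2 controls state 2 only.
    is_max = np.array([False, True, False])

    def T_game(v):
        # Two-player value iteration operator: min at player-1 states, max at player-2 states.
        q = c + gamma * (P @ v)
        out = np.empty(n)
        for i in range(n):
            qi = q[J[:, i] == 1]
            out[i] = qi.max() if is_max[i] else qi.min()
        return out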
3.2 The strategy iteration algorithm
We will next see how Howard’s policy iteration algorithm [7] for solving MDPs can be
extended to TBSGs in a natural way. We refer to the resulting algorithm as the strategy
iteration algorithm. The strategy iteration algorithm was described for TBSGs by Rao et
al. [10].
For every strategy profile π = (π1, π2), we define the vector of reduced costs c̄π as we did for
MDPs: c̄π = c − (J − γP)vπ. We again use the reduced costs to define improving switches;
since player 2 is a maximizer, an action a ∈ A1 is an improving switch for player 1 with
respect to π if (c̄π)a < 0, while an action a ∈ A2 is an improving switch for player 2 if
(c̄π)a > 0. The following lemma is a direct consequence of Lemmas 2.18 and 2.19, i.e., since
a policy is optimal for an MDP if and only if there are no improving switches, we get a
similar statement about optimal counter-strategies.

Lemma 3.12 Let π = (π1, π2) be a strategy profile. Then π2 is an optimal counter-strategy
against π1 if and only if there is no improving switch for player 2 with respect to π, i.e.,
(c̄π)a ≤ 0 for all a ∈ A2. The analogous statement holds for player 1.
From the definition of Nash equilibria and Theorem 3.8 we get the following important
corollary.
Corollary 3.13 A strategy profile π = (π1 , π2 ) is a Nash equilibrium, and optimal, if and
only if neither player has an improving switch with respect to π.
Lemma 3.14 Let π = (π1, π2) and π′ = (π1′, π2′) be two strategy profiles such that π2 is an
optimal counter-strategy against π1, π1′ is obtained from π1 by performing improving switches
w.r.t. π, and π2′ is an optimal counter-strategy against π1′. Then vπ′ < vπ.
Proof: Observe first that since π1′ is obtained from π1 by performing improving switches
w.r.t. π we have (c̄π)π1′ < 0. Furthermore, since π2 is an optimal counter-strategy against
π1, we get from Lemma 3.12 that player 2 has no improving switches with respect to π. That
is, (c̄π)a ≤ 0 for all a ∈ A2, and in particular (c̄π)π2′ ≤ 0. Hence, (c̄π)π′ < 0, which means
that Tπ′ vπ < vπ. It follows from Lemma 2.18 that vπ′ < vπ.
The strategy iteration algorithm is given in Figure 4. It starts with some initial strategy
profile π^0 = (π1^0, π2^0) and generates a sequence (π^k)_{k=0}^{N} of strategy profiles, where
π^k = (π1^k, π2^k), ending with an optimal strategy profile π^N.
Function StrategyIteration(π1, π2)
    while ∃π2′ : Tπ1,π2′ vπ1,π2 > vπ1,π2 do
        π2 ← π2′;
    while ∃π1′ : Tπ1′,π2 vπ1,π2 < vπ1,π2 do
        π1 ← π1′;
        while ∃π2′ : Tπ1,π2′ vπ1,π2 > vπ1,π2 do
            π2 ← π2′;
    return (π1, π2);

Function StrategyIteration(π1, π2)
    while ∃ an improving switch for player 2 w.r.t. (π1, π2) do
        Update π2 by performing improving switches for player 2;
    while ∃ an improving switch for player 1 w.r.t. (π1, π2) do
        Update π1 by performing improving switches for player 1;
        while ∃ an improving switch for player 2 w.r.t. (π1, π2) do
            Update π2 by performing improving switches for player 2;
    return (π1, π2);

Figure 4: Two equivalent formulations of the strategy iteration algorithm.
The algorithm repeatedly updates the strategy π2^k for player 2 by performing improving
switches for player 2. Note that this can be viewed as running the policy iteration algorithm
on the MDP obtained by fixing π1^k, and therefore this inner loop always terminates. When
π2^k is an optimal counter-strategy against π1^k, the algorithm updates π1^k once by performing
improving switches for player 1, and the process is restarted. If π1^k can not be updated then
neither of the players has an improving switch, and we know from Corollary 3.13 that
π^N = (π1^N, π2^N) is optimal. Furthermore, since π2^k is an optimal counter-strategy against
π1^k, and π1^{k+1} is obtained from π1^k by performing improving switches with respect to
(π1^k, π2^k), we get from Lemma 3.14 that v^{π^{k+1}} < v^{π^k} for all 0 ≤ k < N. It follows that
the same strategy profile does not appear twice, and since there are only a finite number of
strategy profiles the algorithm terminates. We get the following theorem. By an iteration
we mean an iteration of the outer loop.
Theorem 3.15 For every initial strategy profile (π1 , π2 ), StrategyIteration(π1 , π2 ) re-
turns an optimal strategy profile after a finite number of iterations.
It can be helpful to think of the strategy iteration algorithm as only operating with the strat-
egy for player 1, and view the inner loop as a subroutine. Indeed, optimal counter-strategies
for player 2 can also be found by other means, for instance by using linear programming.
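Following this view, here is a minimal sketch of the strategy iteration algorithm, continuing the numpy sketches above (it reuses evaluate and the hypothetical is_max): the inner routine computes an optimal counter-strategy for player 2 by policy iteration restricted to the maximizer's states, and the outer loop performs Howard-style improving switches for player 1.

    def counter_strategy(pol):
        # Make player 2's part of pol an optimal counter-strategy against player 1's part.
        while True:
            v = evaluate(pol)
            rc = c - (J - gamma * P) @ v
            changed = False
            for i in range(n):
                if not is_max[i]:
                    continue
                acts = np.where(J[:, i] == 1)[0]
                best = int(acts[np.argmax(rc[acts])])
                if rc[best] > 1e-12:                # improving switch for the maximizer
                    pol[i], changed = best, True
            if not changed:
                return pol, v

    def strategy_iteration(pol):
        pol, v = counter_strategy(pol)
        while True:
            rc = c - (J - gamma * P) @ v
            changed = False
            for i in range(n):                      # improving switches for player 1
                if is_max[i]:
                    continue
                acts = np.where(J[:, i] == 1)[0]
                best = int(acts[np.argmin(rc[acts])])
                if rc[best] < -1e-12:
                    pol[i], changed = best, True
            if not changed:
                return pol, v                       # neither player has an improving switch
            pol, v = counter_strategy(pol)

    print(strategy_iteration([1, 2, 4]))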
During each iteration of the strategy iteration algorithm the current values and reduced costs
are computed, and an improvement rule decides which improving switches to perform. This
is completely analogous to the policy iteration algorithm for MDPs. We will again say that
Howard’s improvement rule [7] picks the action from each state with most negative reduced
cost, i.e., π1^{k+1} satisfies:

    ∀i ∈ S1 : π1^{k+1}(i) ∈ argmin_{a∈Ai} (c̄^{π^k})a .
Theorem 3.16 Starting with any strategy profile π, the number of iterations performed by
Howard's strategy iteration algorithm is at most O( (m/(1−γ)) log(1/(1−γ)) ).
There is no known way of formulating the solution of a TBSG as a linear program. In fact,
it is a major open problem to find a polynomial time algorithm for solving TBSGs when γ
is given as part of the input. When γ is fixed, Theorem 3.16 provides a strongly polynomial
bound. On the other hand, it is easy to see that the problem of solving TBSGs is in both
NP and coNP: an optimal strategy profile serves as a witness of the optimal value of a state
being both above and below a given threshold.
4 The average cost criterion
In this section we give a very brief introduction to the average cost criterion. Proofs in this
section have been omitted, and we refer to Puterman [9] for additional details. As was the
case for the discounted cost criterion, one can show that every MDP has an optimal
positional policy for the average cost criterion. We restrict our attention to the use of
positional policies. Throughout the section we let M = (S, A, s, c, p) be some given MDP.
For the average cost criterion the value of a state i for a positional policy π is:

    val^A_π(i) = E[ liminf_{N→∞} (1/N) Σ_{t=0}^{N−1} c(Yπ(i, t)) ] = lim_{N→∞} (1/N) Σ_{t=0}^{N−1} e_i^T Pπ^t cπ .
Definition 4.1 (Values, potentials) Let π be a positional policy, and let gπ ∈ Rn and
hπ ∈ Rn be defined as:

    gπ = lim_{N→∞} (1/N) Σ_{t=0}^{N−1} Pπ^t cπ ,

    hπ = lim_{N→∞} (1/N) Σ_{t=0}^{N−1} Σ_{k=0}^{t} Pπ^k (cπ − gπ) .
We say that gπ and hπ are the vectors of values and potentials, respectively.
Theorem 4.2 Let π be a positional policy and let Rπ ⊆ 2^S be the set of recurrent classes of
Pπ. Then gπ and hπ are the unique solution to the following system of equations:

    gπ = Pπ gπ ,
    hπ = cπ − gπ + Pπ hπ ,
    ∀R ∈ Rπ : Σ_{i∈R} (hπ)i = 0 .
We next establish a connection between values and potentials, and values for the correspond-
ing discounted MDP. We let vπ,γ be the value vector for policy π for the discounted cost
criterion when using discount factor γ.

Lemma 4.3 For every positional policy π, vπ,γ = (1 − γ)^{−1} gπ + hπ + fπ(γ), where
fπ(γ) → 0 for γ → 1.
By multiplying both sides of the equation in Lemma 4.3 by (1 − γ), and taking the limit for
γ going to 1, we get the following important corollary.

Corollary 4.4 For every positional policy π,

    gπ = lim_{γ↑1} (1 − γ) vπ,γ .
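As a small numerical illustration of Corollary 4.4 on the running example, (1 − γ)vπ,γ approaches gπ as γ approaches 1; gπ can also be approximated directly by a long Cesàro average (both computations below are approximations, continuing the earlier numpy sketch).

    # (1 - gamma) * v^{pi,gamma} for discount factors approaching 1.
    for g in [0.9, 0.99, 0.999]:
        v_g = np.linalg.solve(np.eye(n) - g * P_pi, c_pi)
        print(g, (1 - g) * v_g)

    # Cesaro average (1/N) sum_{t<N} P_pi^t c_pi, approximating g^pi.
    N = 10000
    avg, power = np.zeros(n), np.eye(n)
    for _ in range(N):
        avg += power @ c_pi
        power = power @ P_pi
    print(avg / N)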
Let (γk)_{k=0}^{∞} be a sequence of discount factors converging to 1. Since for every discount factor
there is an optimal policy for the discounted MDP, and since there are only finitely many
policies, there must be a policy that is optimal for discount factors arbitrarily close to 1. In
fact, the following lemma shows something stronger.
Lemma 4.5 There exists a discount factor γ(M ) < 1 and a policy π ∗ , such that for all
γ ∈ [γ(M ), 1), π ∗ is optimal for the discounted MDP with discount factor γ.
By combining Lemma 4.5 and Corollary 4.4 it follows that π∗ is optimal under the average
cost criterion, which proves Theorem 1.6 for the average cost criterion. It should be noted
that not every policy that is optimal under the average cost criterion is optimal under the
discounted cost criterion for all discount factors sufficiently close to 1.
Lemmas 4.3 and 4.5 also show that we can use the strategy iteration algorithm for solving
MDPs with the average cost criterion. Indeed, we can pick a discount factor γ sufficiently
close to 1 and run the strategy iteration algorithm for the corresponding discounted MDP.
To simplify the algorithm we can use Lemma 4.3 to compare reduced costs lexicographically
in terms of values and potentials.
5 The total cost criterion

For the total cost criterion the value of a state i for a positional policy π is defined as:

    val^T_π(i) = Σ_{k=0}^{∞} e_i^T Pπ^k cπ .
This series may not converge, and we therefore need to make assumptions about the MDP
to ensure convergence. For this purpose we introduce the notion of a terminal state:
Definition 5.1 (Terminal state) We say that a state t ∈ S is a terminal state if for all
actions a ∈ At , Pa = et and ca = 0. I.e., t is an absorbing state for every policy π, and
once t is reached no additional cost is accumulated.
For the remainder of this section we assume that the MDP under consideration has a terminal
state t. Note that reaching the terminal state corresponds to terminating the MDP. Also,
since the decision at t is fixed we will let t and its actions be implicit in the description
of the MDP M . We therefore do not consider t to be an element of S, and we let P , J,
and c be the matrices and vectors obtained by removing the column corresponding to t
and the rows corresponding to At . I.e., rows of P may sum to less than one, and 1 − Pa e
is the probability of moving to t when using action a. Note that this provides a natural
interpretation of the discounted cost criterion. The rows of the matrix γP sum to γ < 1,
and for each action we move to the terminal state with probability (1 − γ).
Definition 5.2 (Stopping condition) A policy π is said to satisfy the stopping condition
if from each state i there is positive probability of eventually reaching the terminal state.
If every policy satisfies the stopping condition we say that the MDP satisfies the stopping
condition.
For the remainder of this section we assume that the MDP under consideration satisfies the
stopping condition.
The following lemma is analogous to Lemma 2.1 for the discounted case.
Lemma 5.3 For any policy π satisfying the stopping condition, the matrix (I − Pπ) is non-
singular and

    (I − Pπ)^{−1} = Σ_{k=0}^{∞} Pπ^k ≥ I .
From Lemma 5.3 it follows that the values exist, so that the following definition is well-defined.
Definition 5.4 (Value vector) For every positional policy π, we define the value vector
vπ ∈ Rn by:
vπ = (I − Pπ )−1 cπ .
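A minimal sketch for the total cost criterion, using a hypothetical 2-state stopping MDP with substochastic rows (the terminal state is implicit, as described above): the stopping condition for a policy can be checked via the spectral radius of Pπ, and the values are then obtained by solving (I − Pπ)v = cπ.

    # Hypothetical policy of a stopping MDP: from each state the chosen action leads to
    # the implicit terminal state with probability 0.1 (rows sum to 0.9).
    P_pi_stop = np.array([[0.0,  0.9 ],
                          [0.45, 0.45]])
    c_pi_stop = np.array([1.0, 2.0])

    # For a finite MDP the stopping condition for pi is equivalent to the spectral radius
    # of P_pi being strictly below 1 (equivalently, P_pi^t -> 0).
    assert np.max(np.abs(np.linalg.eigvals(P_pi_stop))) < 1
    v_total = np.linalg.solve(np.eye(2) - P_pi_stop, c_pi_stop)
    print(v_total)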
Note that if the MDP is deterministic, i.e., all actions move to single states with probability
1, then the problem becomes a shortest path problem, where from every state we want to
find the shortest path to the terminal state. Solving MDPs for the total cost criterion may
in general be viewed as a stochastic shortest path problem.
We next consider the relationship between the total cost criterion and the average cost
criterion. For this purpose we assume that the terminal state t is represented explicitly. Note
that, assuming that the MDP satisfies the stopping condition, the terminal state is reached
from every state for every policy. Hence, the limiting average of the observed costs is always
zero, such that gπ = 0 for every policy π. Furthermore, since the limit lim_{t→∞} Σ_{k=0}^{t} Pπ^k cπ
always exists, we have:

    hπ = lim_{N→∞} (1/N) Σ_{t=0}^{N−1} Σ_{k=0}^{t} Pπ^k cπ = lim_{t→∞} Σ_{k=0}^{t} Pπ^k cπ = vπ .
Hence, the values for the total cost criterion may be viewed as the potentials for the average
cost criterion.
The existence of optimal positional policies can either be proved directly or through the
connection to the average cost criterion. The strategy iteration algorithm can also be defined
for the total cost criterion in the same way as it was done for the discounted cost criterion
and the average cost criterion.
Through the series of papers by Friedmann [4] and Fearnley [3], from 2009 and 2010, re-
spectively, it was shown that Howard’s policy iteration algorithm may require exponentially
many iterations in the number of states to solve MDPs with the total cost criterion. Hence,
there is no hope of getting rid of the dependence on γ in Theorem 2.24.
Exercises
Exercise 1 Prove that for every MDP M, and every policy π ∈ Π(H), the value
val^{D(γ)}_π(i) = E[ Σ_{t=0}^{∞} γ^t c(Yπ(i, t)) ] always exists.
Exercise 2 Prove Lemma 2.1, i.e., for an MDP represented by (P, c, J) and any policy π,
show that the matrix (I − γPπ) is nonsingular and that

    (I − γPπ)^{−1} = Σ_{k=0}^{∞} (γPπ)^k ≥ I .

Hint: Consider the telescoping series (I − γPπ) Σ_{k=0}^{∞} (γPπ)^k.
Exercise 3 Use Cramer’s rule to prove Lemma 2.13. Cramer’s rule says the following:
Let Ax = b be a system of linear equations where A ∈ Rn×n is a non-singular matrix and
b ∈ Rn is a vector. Then a solution x satisfies for all i, xi = det(Ai )/ det(A), where Ai is
the matrix obtained from A by replacing the i-th column with b.
Exercise 4 Give an alternative proof, using complementary slackness, of the fact that the
linear programs (P) and (D) can be used to solve an MDP. The complementary slackness
theorem says the following: feasible solutions x and (y, z) of the primal and dual linear
programs

    min  c^T x                  max  b^T y
    s.t. Ax = b                 s.t. A^T y + z = c
         x ≥ 0                       z ≥ 0

are both optimal if and only if x^T z = 0, i.e., xi zi = 0 for all i.
Exercise 6 Extend the lemmas and theorems in Section 2.1 to TBSGs and update the
proofs. Hint: Lemma 2.7 should be split into two cases where different players use an optimal
counter-strategy.
References
[1] R. Bellman. Dynamic programming. Princeton University Press, 1957.
[3] J. Fearnley. Exponential lower bounds for policy iteration. In Proc. of 37th ICALP,
pages 551–562, 2010.
[4] O. Friedmann. An exponential lower bound for the parity game strategy improvement
algorithm as we know it. In Proc. of 24th LICS, pages 145–156, 2009.
[5] T. D. Hansen. Worst-case Analysis of Strategy Iteration and the Simplex Method. PhD
thesis, Aarhus University, 2012.
[6] T. D. Hansen, P. B. Miltersen, and U. Zwick. Strategy iteration is strongly polynomial
for 2-player turn-based stochastic games with a constant discount factor. Journal of the
ACM, 60(1), 2013.

[7] R. Howard. Dynamic programming and Markov processes. MIT Press, 1960.
[8] U. Meister and U. Holzbaur. A polynomial time bound for Howard’s policy improvement
algorithm. OR Spektrum, 8:37–40, 1986.
[9] M. L. Puterman. Markov decision processes: Discrete stochastic dynamic programming.
John Wiley & Sons, 1994.
[10] S. Rao, R. Chandrasekaran, and K. Nair. Algorithms for discounted games. Journal of
Optimization Theory and Applications, pages 627–637, 1973.
[11] B. Scherrer. Improved and generalized upper bounds on the complexity of policy itera-
tion. CoRR, abs/1306.0386, 2013.
[12] L. Shapley. Stochastic games. Proc. Nat. Acad. Sci. U.S.A., 39:1095–1100, 1953.
[13] Y. Ye. The simplex and policy-iteration methods are strongly polynomial for the markov
decision problem with a fixed discount rate. Math. Oper. Res., 36(4):593–603, 2011.