Lecture: Markov Decision Process
1 Introduction
1.1 Introductory examples
Example 1: snakes and ladders
We give an initial example to build intuition. We highlight some of the key
properties of a Markov chain: how to calculate transition probabilities, how the past affects the current
movement of the process, how to construct a chain, and what the long-run behavior of the
process might look like.
Figure 1: Example 1
We let Xt be the position of the counter on the board after the dice has been
thrown t times. The process X = (Xt : t ∈ Z+) is a discrete time Markov chain. The
following exercises should be fairly straightforward and should build your intuition about
Markov chains.
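To make the construction concrete, here is a minimal Python sketch that simulates such a chain by repeatedly throwing a die and following any snake or ladder. The board layout below (the `JUMPS` dictionary, the board size, and the overshoot rule) is a made-up illustration and is not the board of Figure 1.

```python
import random

# Hypothetical board layout (NOT the board of Figure 1): squares 0..9, with a
# ladder from 2 to 6 and a snake from 8 to 3.
JUMPS = {2: 6, 8: 3}
LAST = 9  # final square

def throw_die():
    return random.randint(1, 6)

def step(x):
    """One transition: throw the die, move, then follow any snake or ladder."""
    y = x + throw_die()
    if y > LAST:       # assumed rule: overshooting the end leaves the counter in place
        y = x
    return JUMPS.get(y, y)

def simulate(t_max, x0=0):
    """Return the trajectory X_0, X_1, ..., X_{t_max}."""
    traj = [x0]
    for _ in range(t_max):
        traj.append(step(traj[-1]))
    return traj

print(simulate(20))
```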
Example 2: Transition probabilities
Given that the counter is in square x, we let Pxy be the probability that you next go to square
y. That is,
Pxy = P(X1 = y | X0 = x)   (1)
Calculate:
a) P14 b) P17 c) P34
Answer:
a) 1/6 + 1/6 = 1/3
b) 1/6 + 1/6 = 1/3
c) 1/6 + 1/6 = 1/3
Example 3: (Markov property)
Show that P(Xt+1 = y | X0 = x0 , · · · , Xt−1 = xt−1 , Xt = x) = P(Xt+1 = y | Xt = x) = Pxy .
2 Markov chain
Let X be a countable set.
Definition: (Initial distribution/Transition matrix)
An initial distribution
λ = (λx : x ∈ X )   (4)
is a non-negative vector whose components sum to one. A transition matrix P = (Pxy : x, y ∈ X )
is a non-negative matrix whose rows sum to one, that is, for each x ∈ X , Σ_{y∈X} Pxy = 1.
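As a quick sanity check of these two definitions, the following sketch tests whether a candidate vector is an initial distribution and whether a candidate matrix is a transition matrix; the numerical values are made-up examples.

```python
lam = [0.5, 0.25, 0.25]          # candidate initial distribution
P = [[0.0, 0.5, 0.5],
     [0.2, 0.3, 0.5],
     [1.0, 0.0, 0.0]]            # candidate transition matrix

def is_distribution(v, tol=1e-12):
    """Non-negative entries that sum to one."""
    return all(p >= 0 for p in v) and abs(sum(v) - 1.0) < tol

def is_transition_matrix(M, tol=1e-12):
    """Every row of a transition matrix must itself be a distribution."""
    return all(is_distribution(row, tol) for row in M)

print(is_distribution(lam), is_transition_matrix(P))
```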
Definition: (Discrete Time Markov Chain)
We say that a sequence of random variables X = (Xt : t ∈ Z+ ) is a discrete time Markov chain,
with initial distribution λ and transition matrix P , if for x0 , · · · , xt+1 ∈ X ,
P(X0 = x0) = λx0   (5)
and
P(Xt+1 = xt+1 | Xt = xt , · · · , X0 = x0 ) = P(Xt+1 = xt+1 | Xt = xt ) = Pxt xt+1 .   (6)
Condition (6) is often called the Markov property.
It states that the past (X0 , · · · , Xt−1 ) and the future Xt+1 are conditionally independent given the
present Xt .
Otherwise stated, it says that, when we know the past and present states (X0 , · · · , Xt ) =
(x0 , · · · , xt ), the distribution of the future states Xt+1 , Xt+2 , · · · is determined only by the
present state Xt = xt .
Think of a board game like snakes and ladders: where you go next is determined only
by where you are now, not by how you got there. This is the Markov property.
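The pair (λ, P) is all that is needed to simulate such a chain: draw X0 from λ, then repeatedly draw Xt+1 from the row of P indexed by the current state. Below is a minimal sketch with a hypothetical two-state chain; the numbers are illustrative only.

```python
import random

def sample_from(dist):
    """Sample an index from a probability vector `dist` (inverse-CDF sampling)."""
    u, cum = random.random(), 0.0
    for i, p in enumerate(dist):
        cum += p
        if u < cum:
            return i
    return len(dist) - 1

def sample_chain(lam, P, T):
    """Sample X_0, ..., X_T: draw X_0 from lam, then X_{t+1} from row P[X_t]."""
    x = sample_from(lam)
    traj = [x]
    for _ in range(T):
        x = sample_from(P[x])   # the next state depends only on the current state
        traj.append(x)
    return traj

# Hypothetical two-state chain:
lam = [1.0, 0.0]
P = [[0.7, 0.3],
     [0.4, 0.6]]
print(sample_chain(lam, P, 10))
```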
Definition: (Plant equation)
The state evolves according to functions ft : X × At × [0, 1] → X as
Xt+1 = ft (Xt , at , Ut ),   (7)
where (Ut )t≥0 are i.i.d. random variables, uniformly distributed on [0, 1]. This is called the plant equation. As noted in
the equivalence above, we will often suppress the dependence on Ut .
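The plant equation says that a (controlled) transition kernel can be realised as a deterministic function of the current state, the action, and an independent uniform random variable, for instance by inverting the cumulative distribution of the next-state probabilities. Here is a sketch under that reading, using a hypothetical two-state, two-action kernel `P_a` chosen only for illustration.

```python
import random

# Hypothetical controlled transition probabilities P_a[a][x][y] for two states
# and two actions; any such kernel can be written in plant-equation form.
P_a = {
    0: [[0.9, 0.1], [0.2, 0.8]],   # action 0
    1: [[0.5, 0.5], [0.6, 0.4]],   # action 1
}

def f(x, a, u):
    """Plant equation X_{t+1} = f(X_t, a_t, U_t): invert the CDF of P_a[a][x] at u."""
    cum = 0.0
    for y, p in enumerate(P_a[a][x]):
        cum += p
        if u < cum:
            return y
    return len(P_a[a][x]) - 1

x = 0
for t in range(5):
    u = random.random()   # U_t ~ Uniform[0, 1], independent over time
    x = f(x, a=0, u=u)    # always play action 0 in this tiny illustration
    print(t, x)
```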
Definition: (Policy):
A policy π chooses an action πt at each time t as a function of past states x0 , · · · , xt and past
actions π0 , · · · , πt−1 . We let P be the set of policies.
A policy, a plant equation, and the resulting sequence of states and rewards describe a Markov
Decision Process. The objective is to find a policy that optimizes the following objective
function.
Definition: (Markov decision problem):
Given initial state x0 , a Markov Decision Problem is the following optimization:
W(x0) = Maximize RT(x0, Π) := E[ Σ_{t=0}^{T−1} rt(Xt, πt) + rT(XT) ]   over Π ∈ P.   (8)
Further, let Rτ (xτ , Π) (respectively, Wτ (xτ )) be the objective (respectively, the optimal objec-
tive) for (MDP) when the summation is started from time t = τ and state Xτ = xτ , rather
than t = 0 and X0 = x0 .
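For a fixed policy, RT(x0, Π) is simply an expectation over trajectories, so it can be estimated by Monte Carlo simulation. The sketch below does this for a hypothetical two-state problem; the dynamics `P_a`, the rewards `r` and `r_T`, and the (Markov) policy are all assumptions made purely for illustration.

```python
import random

# Monte Carlo estimate of R_T(x0, Pi) for a fixed policy on a toy two-state problem.
P_a = {0: [[0.9, 0.1], [0.2, 0.8]],
       1: [[0.5, 0.5], [0.6, 0.4]]}

def r(t, x, a):          # per-stage reward r_t(x, a)
    return 1.0 if x == 0 else 0.0

def r_T(x):              # terminal reward r_T(x)
    return 0.0

def policy(t, x):        # a simple Markov policy pi_t(x)
    return 0 if x == 0 else 1

def step(x, a):
    return random.choices(range(2), weights=P_a[a][x])[0]

def estimate_R(x0, T, n_runs=10_000):
    total = 0.0
    for _ in range(n_runs):
        x, payoff = x0, 0.0
        for t in range(T):
            a = policy(t, x)
            payoff += r(t, x, a)
            x = step(x, a)
        total += payoff + r_T(x)
    return total / n_runs

print(estimate_R(x0=0, T=10))
```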
Definition: (Bellman equation):
Setting WT (x) = rT (x), for t = T − 1, T − 2, · · · , 0,
Wt(xt) = sup_{at ∈ At} { rt(xt, at) + E_{xt,at}[ Wt+1(Xt+1) ] }.   (9)
The above is Bellman’s equation for a Markov Decision Process.
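Equation (9) gives a backward-in-time recursion: starting from WT = rT, each Wt is computed from Wt+1. For finite state and action sets the expectation is a sum over next states, which leads directly to the following dynamic-programming sketch; the transition probabilities, rewards, and horizon are illustrative assumptions, not values from the lecture.

```python
# Finite-horizon backward induction (a direct reading of equation (9)) for a
# toy problem with states {0, 1} and actions {0, 1}.
P_a = {0: [[0.9, 0.1], [0.2, 0.8]],
       1: [[0.5, 0.5], [0.6, 0.4]]}
T = 10

def r(t, x, a):
    return 1.0 if x == 0 else 0.0

def r_T(x):
    return 0.0

states, actions = [0, 1], [0, 1]
W = [r_T(x) for x in states]            # W_T(x) = r_T(x)
policy = {}

for t in range(T - 1, -1, -1):          # t = T-1, T-2, ..., 0
    W_new = []
    for x in states:
        # q[a] = r_t(x, a) + E_{x,a}[ W_{t+1}(X_{t+1}) ]
        q = {a: r(t, x, a) + sum(P_a[a][x][y] * W[y] for y in states)
             for a in actions}
        a_star = max(q, key=q.get)      # maximising action
        policy[(t, x)] = a_star
        W_new.append(q[a_star])
    W = W_new

print("W_0 =", W, "optimal first action from state 0:", policy[(0, 0)])
```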
Example:
You need to sell a car. At every time t = 0, · · · , T − 1, you set a price pt , and a customer then
views the car. The probability that the customer buys the car at price p is D(p). If the car is
not sold by time T , then it is sold for a fixed price WT , WT < 1. Maximize the reward from
selling the car and find the recursion for the optimal reward, when D(p) = (1 − p)+ .
Figure 3: Markov Decision Process problem
Answer:
Let xt = I[Car is not sold by time t]
xt = 0 ⇒ xt+1 = 0, and
xt = 1 ⇒ xt+1 = 0 w.p. D(pt), xt+1 = 1 w.p. 1 − D(pt).   (10)
Let Rt (xt ) be the reward from t to T given xt and Wt (xt ) be the optimal reward from t to T given
xt .
Note: Rt (0) = Wt (0) = 0.
Choosing pT−s optimally, and writing Ct := Wt (1),
CT−s = max over pT−s of { pT−s D(pT−s) + (1 − D(pT−s)) CT−s+1 },
where D(pT−s) = (1 − pT−s)+ . Differentiating over pT−s and setting the derivative to zero, we obtain
pT−s = (CT−s+1 + 1) / 2   (14)
and, substituting back,
CT−s = ( (1 + CT−s+1) / 2 )² .
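The two formulas above give a simple backward recursion, starting from CT = WT. A short numerical sketch, assuming a hypothetical scrap value WT = 0.2 and horizon T = 5 (neither value is specified in the lecture):

```python
# Backward recursion for the car-selling example with assumed parameters.
T, W_T = 5, 0.2

C = W_T                      # C_T = W_T(1) = W_T
prices = []
for s in range(1, T + 1):    # compute C_{T-1}, ..., C_0 backwards
    p = (1 + C) / 2          # optimal price p_{T-s}, from equation (14)
    prices.append(p)
    C = ((1 + C) / 2) ** 2   # C_{T-s} = ((1 + C_{T-s+1}) / 2)^2

print("optimal prices p_{T-1}, ..., p_0:", prices)
print("optimal expected revenue C_0 = W_0(1):", C)
```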