
Markov chains and Markov Decision Processes

Jae Yun JUN KIM∗

Reference: Neil Walton’s lecture notes

1 Introduction
1.1 Introductory examples
Example 1: snakes and ladders
We give an initial example to build intuition. We highlight some of the key
properties of a Markov chain: how to calculate transitions, how the past affects the current
movement of the process, how to construct a chain, and what the long-run behaviour of the
process might look like.

Figure 1: Example 1

We let Xt be the position of the counter on the board after the die has been
thrown t times. The process X = {Xt : t ∈ Z+} is a discrete-time Markov chain. The
following exercises should be fairly straightforward and should build your intuition about
Markov chains.
Example 2: Transition probabilities
Given that the counter is in square x, we let Pxy be the probability that you next go to square
y. That is,
Pxy = P(X1 = y | X0 = x).    (1)

Calculate:
a) P14   b) P17   c) P34
Answer:
a) 1/6 + 1/6 = 1/3
b) 1/6 + 1/6 = 1/3
c) 1/6 + 1/6 = 1/3
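Below is a minimal sketch of how transition probabilities like these can be computed mechanically. The board of Figure 1 is not reproduced in these notes, so the board size and the `jumps` dictionary below are hypothetical placeholders; with the true layout the same construction recovers P14 = P17 = P34 = 1/3.

```python
import numpy as np

N_SQUARES = 25                 # board size (assumption; Figure 1 is not reproduced here)
jumps = {3: 11, 17: 6}         # snake/ladder endpoints: landing square -> destination (assumption)

def transition_matrix(n, jumps, die=6):
    """Row-stochastic matrix with P[x, y] = P(X_{t+1} = y | X_t = x)."""
    P = np.zeros((n + 1, n + 1))              # squares indexed from 1 for readability
    for x in range(1, n + 1):
        for face in range(1, die + 1):
            y = min(x + face, n)              # overshoots stay on the last square (assumption)
            y = jumps.get(y, y)               # follow a snake or ladder if one starts at y
            P[x, y] += 1.0 / die
    return P

P = transition_matrix(N_SQUARES, jumps)
print(P[1])                                   # transition distribution out of square 1
assert np.allclose(P[1:].sum(axis=1), 1.0)    # every row sums to one
```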

ECE Paris Graduate School of Engineering, 37 quai de Grenelle 75015 Paris, France; [email protected]

Example 3: (Markov property)
Show that

P(X3 = 7 | X2 = 6, X1 = 3, X0 = 3) = P(X3 = 7 | X2 = 6, X1 = 5, X0 = 1) = P(X3 = 7 | X2 = 6).    (2)

This illustrates that, given we are on square 6, the probability of reaching square 7 is not affected
by the path by which we reached square 6.
Answer:
P(X3 = 7 | X2 = 6, X1 = 5, X0 = 1) = P(X3 = 7, X2 = 6, X1 = 5 | X0 = 1) / P(X2 = 6, X1 = 5 | X0 = 1) = P(X3 = 7 | X2 = 6).    (3)
In general, given that the counter is on a square, the square reached by the counter on the
next turn is not affected by the path that was used to reach the current square. This
is called the Markov property.
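This path-independence is easy to check empirically. The sketch below simulates a small, hypothetical three-state chain (not the board of Figure 1) and compares the conditional frequency of the next state given a full history with the frequency given only the current state; both should be close to the corresponding entry of P.

```python
import numpy as np

rng = np.random.default_rng(3)
P = np.array([[0.2, 0.5, 0.3],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])               # arbitrary transition matrix (assumption)

def run(T=3):
    """Sample (X0, X1, X2, X3) starting from state 0."""
    x = [0]
    for _ in range(T):
        x.append(int(rng.choice(3, p=P[x[-1]])))
    return tuple(x)

paths = [run() for _ in range(100_000)]
# Estimate P(X3 = 2 | X2 = 1, X1 = 0, X0 = 0) versus P(X3 = 2 | X2 = 1).
with_history = sum(p == (0, 0, 1, 2) for p in paths) / sum(p[:3] == (0, 0, 1) for p in paths)
current_only = sum(p[2:] == (1, 2) for p in paths) / sum(p[2] == 1 for p in paths)
print(with_history, current_only, P[1, 2])    # all three should be close
```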

2 Markov chain
Let X be a countable set.
Definition: (Initial distribution/Transition matrix)
An initial distribution
λ = (λx : x ∈ X ) (4)
is a positive vector whose components sum to one. A transition matrix P = (Pxy : x, y ∈ X )
is a positive matrix whose rows sum to one, that is, for each x ∈ X , ∑y∈X Pxy = 1.
Definition: (Discrete Time Markov Chain)
We say that a sequence of random variables X = (Xt : t ∈ Z+ ) is a discrete-time Markov chain,
with initial distribution λ and transition matrix P , if for x0 , · · · , xt+1 ∈ X ,

P(X0 = x0 ) = λx0 (5)

and
P(Xt+1 = xt+1 | Xt = xt , · · · , X0 = x0 ) = P(Xt+1 = xt+1 | Xt = xt ) = Pxt xt+1 .    (6)
Condition (6) is often called the Markov property.
It states that the past (X0 , · · · , Xt−1 ) and the future Xt+1 are conditionally independent given the
present Xt .
Otherwise stated, when we know the past and present states (X0 , · · · , Xt ) =
(x0 , · · · , xt ), the distribution of the future states Xt+1 , Xt+2 , · · · is determined only by the
present state Xt = xt .
Think of a board game like snakes and ladders: where you go next is determined only by where
you are now, not by how you got there. This is the Markov property.
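Below is a minimal sketch of the definition: sampling a trajectory of a discrete-time Markov chain from an initial distribution λ and a transition matrix P. The two-state chain used here is an arbitrary illustration, not one taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

lam = np.array([0.5, 0.5])                # initial distribution λ (sums to one)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])                # transition matrix, each row sums to one

def sample_chain(lam, P, T):
    """Return one trajectory (X0, ..., XT) of the Markov chain (λ, P)."""
    x = rng.choice(len(lam), p=lam)       # X0 ~ λ
    path = [int(x)]
    for _ in range(T):
        x = rng.choice(len(lam), p=P[x])  # X_{t+1} depends only on X_t (Markov property)
        path.append(int(x))
    return path

print(sample_chain(lam, P, 10))
```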

3 Markov decision processes


As in the section on Dynamic Programming, we consider discrete times t = 0, 1, · · · , T , states
x ∈ X , actions a ∈ At and rewards rt (x, a). However, the plant equation and the definition of a
policy are slightly different.

Definition: (Plant equation):
The state evolves according to functions Ft : X × At × [0, 1] → X as

Xt+1 = Ft (Xt , at ; Ut ) ≡ Ft (Xt , at ),    (7)

where (Ut )t≥0 are i.i.d. random variables, uniform on [0, 1]. This is called the plant equation. As
indicated by the equivalence above, we will often suppress the dependence on Ut .

Figure 2: Illustration for the plant equation
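The following is a minimal sketch of a plant equation in code: the next state is a deterministic function of the current state, the chosen action and an independent Uniform[0, 1] noise Ut. The particular function F below is a made-up example, not one from the notes.

```python
import numpy as np

rng = np.random.default_rng(1)

def F(x, a, u):
    """Hypothetical plant: the action a moves the state only if u < 0.7."""
    return x + a if u < 0.7 else x

x, T = 0, 5
for t in range(T):
    u = rng.uniform()          # U_t ~ Uniform[0, 1], i.i.d. across t
    a = 1                      # a fixed action, purely for illustration
    x = F(x, a, u)             # X_{t+1} = F_t(X_t, a_t; U_t)
    print(t, x)
```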

Definition: (Policy):
A policy π chooses an action πt at each time t as a function of past states x0 , · · · , xt and past
actions π0 , · · · , πt−1 . We let P be the set of policies.
A policy, a plant equation, and the resulting sequence of states and rewards describe a Markov
Decision Process. The objective is to find a policy that optimizes the following objective
function.
Definition: (Markov decision problem):
Given initial state x0 , a Markov Decision Problem is the following optimization:

W (x0 ) = maximize  RT (x0 , Π) := E[ ∑_{t=0}^{T−1} rt (Xt , πt ) + rT (XT ) ]  over Π ∈ P.    (8)

Further, let Rτ (xτ , Π) (respectively, Wτ (xτ )) be the objective (respectively, the optimal
objective) for (MDP) when the summation is started from time t = τ and state Xτ = xτ , rather
than from t = 0 and X0 = x0 .
Definition: (Bellman equation):
Setting WT (x) = rT (x), for t = T − 1, T − 2, · · · , 0,

Wt (xt ) = sup_{at ∈ At} { rt (xt , at ) + E_{xt ,at}[ Wt+1 (Xt+1 ) ] }.    (9)

The above equation is the Bellman equation for a Markov Decision Process.
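A minimal sketch of how the Bellman equation is solved by backward induction on a small finite MDP follows. The transition probabilities and rewards are random placeholders (not from the notes), and the expectation E_{x,a}[Wt+1(Xt+1)] is computed as a sum over next states.

```python
import numpy as np

n_states, n_actions, T = 3, 2, 5
rng = np.random.default_rng(2)

r = rng.random((T, n_states, n_actions))            # r_t(x, a), placeholder rewards
r_T = rng.random(n_states)                          # terminal reward r_T(x)
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)                   # P[a, x, y] = P(X_{t+1} = y | X_t = x, a_t = a)

W = np.zeros((T + 1, n_states))
policy = np.zeros((T, n_states), dtype=int)
W[T] = r_T                                          # W_T(x) = r_T(x)
for t in range(T - 1, -1, -1):                      # t = T-1, ..., 0
    Q = r[t] + np.einsum('axy,y->xa', P, W[t + 1])  # r_t(x, a) + E_{x,a}[W_{t+1}(X_{t+1})]
    W[t] = Q.max(axis=1)                            # Bellman equation (max over actions)
    policy[t] = Q.argmax(axis=1)                    # an optimal action for each (t, x)

print(W[0])                                         # optimal value from each initial state
```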
Example:
You need to sell a car. At every time t = 0, · · · , T − 1, you set a price pt , and a customer then
views the car. The probability that the customer buys the car at price p is D(p). If the car is
not sold by time T , then it is sold for a fixed price WT , with WT < 1. Maximize the reward from
selling the car, and find the recursion for the optimal reward when D(p) = (1 − p)+ .

Figure 3: Markov Decision Process problem

Answer:
Let xt = I[the car is not sold by time t]. Then

xt = 0 ⇒ xt+1 = 0,
xt = 1 ⇒ xt+1 = 0 with probability D(pt ), and xt+1 = 1 with probability 1 − D(pt ).    (10)

Let Rt (xt ) be the reward from t to T given xt , and let Wt (xt ) be the optimal reward from t to T
given xt .
Note: Rt (0) = Wt (0) = 0.

RT−s (1) = D(pT−s )[pT−s + RT−s+1 (0)] + (1 − D(pT−s ))[0 + RT−s+1 (1)]
         = D(pT−s )pT−s + (1 − D(pT−s ))WT−s+1 (1)    (if the subsequent prices are chosen optimally).    (11)

Choosing pT−s optimally,

WT−s (1) = max_{pT−s} { D(pT−s )pT−s + (1 − D(pT−s ))WT−s+1 (1) }.    (12)

Writing CT−s for WT−s (1) and substituting D(pT−s ) = (1 − pT−s )+ , the above equation becomes

CT−s = max_{pT−s} { pT−s (1 − pT−s )+ + (1 − (1 − pT−s )+ ) CT−s+1 }.    (13)
Differentiating the right-hand side with respect to pT−s and setting the derivative to zero, we obtain

pT−s = (CT−s+1 + 1) / 2,
CT−s = ((1 + CT−s+1 ) / 2)².    (14)
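A minimal sketch of this recursion in code is given below: starting from the terminal value CT = WT (the fixed salvage price, taken to be 0.2 purely for illustration), iterate CT−s = ((1 + CT−s+1)/2)² backwards and record the corresponding optimal prices pT−s = (1 + CT−s+1)/2.

```python
T = 10                         # horizon (assumption)
W_T = 0.2                      # fixed price if the car is unsold at time T (assumption, W_T < 1)

C = [0.0] * (T + 1)            # C[t] = optimal expected reward from t on, given the car is unsold
p = [0.0] * T                  # p[t] = optimal price to post at time t
C[T] = W_T
for t in range(T - 1, -1, -1):
    p[t] = (1 + C[t + 1]) / 2  # maximiser of p(1 - p) + (1 - (1 - p)) * C_{t+1}
    C[t] = p[t] ** 2           # C_t = ((1 + C_{t+1}) / 2)^2

print(p[0], C[0])              # the longer the remaining horizon, the higher the price and the value
```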
