AI512/EE633: Reinforcement Learning: Lecture 2 - Markov Decision Process
Seungyul Han
UNIST
[email protected]
Spring 2024
Table of Contents
1 Setup
2 Markov Chains
Sufficient Statistic
The state variable s_t is sufficient to describe the status of the environment at time step t, containing all information relevant for making decisions or inference.
Markovianness
The sequence of states over time {s_t, t = 0, 1, 2, · · · } is a Markov process or Markov chain. That is,
$$\Pr[s_{t+1} \mid s_t, s_{t-1}, \cdots, s_0] = \Pr[s_{t+1} \mid s_t].$$
Sufficient Statistic
Remark
Sufficiency is with respect to the target optimal inference or decision.
Example 1:
[Figure: an RLC circuit with input voltage v_in(t) = x(t), resistor R, inductor l carrying current i, capacitor C, and output voltage v_out(t) = y(t).]
In this 2nd-order circuit system, we can set the state variables as the capacitor voltage and the inductor current.
Example 2: X_i ∼ N(θ, 1), i.i.d. We observe data X_1, · · · , X_n and want to estimate the mean parameter θ. Suppose that we consider the maximum likelihood estimator (MLE) for θ as the optimal inference. Then, Σ_{i=1}^n X_i is a sufficient statistic:
$$p(x_1, \cdots, x_n; \theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x_i-\theta)^2}
= \frac{1}{(2\pi)^{n/2}}\, e^{-\frac{1}{2}\left(\sum_i x_i^2 \,-\, 2\theta\sum_i x_i \,+\, n\theta^2\right)},$$
which depends on θ and the data jointly only through the statistic Σ_i x_i.
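As a quick numerical check (a minimal sketch; the sample and variable names are illustrative only, not from the lecture), the MLE computed from the raw data and the MLE computed from the sufficient statistic Σ_i X_i alone coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 1.5
x = rng.normal(theta_true, 1.0, size=1000)   # X_i ~ N(theta, 1), i.i.d.

# MLE from the raw data: the sample mean.
theta_mle_raw = x.mean()

# MLE from the sufficient statistic sum(x) alone (n is known).
t = x.sum()
theta_mle_suff = t / len(x)

print(theta_mle_raw, theta_mle_suff)   # the two agree: the MLE depends on the data only through sum(x)
```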
Remark
A Markov state contains all relevant information from the entire history for system
evolution.
The history in the past and the evolution in the future are independent given s_t, i.e.,
$$P(\tau_{1:t}, \tau_{t+1:\infty} \mid s_t) = P(\tau_{1:t} \mid s_t)\, P(\tau_{t+1:\infty} \mid s_t),$$
where τ_{1:t} ≜ (s_1, s_2, · · · , s_t).
A Markov process is also called a Markov chain (MC).
[Figure: a Markov chain S_0 → S_1 → S_2 → S_3 → S_4.]
Remark
For a finite MC, the collection of state transition probabilities P_{ss'} = Pr[s_{t+1} = s' | s_t = s] can be arranged into a state transition matrix P = [P_{ss'}], which fully describes the chain.
[Figure and table: an example finite MC over the states {I, L, C, A, F, H}, shown as a state transition diagram together with its transition probability matrix (each row lists the transition probabilities out of that state and sums to 1).]
[Figure: a two-state MC with states H and L; H stays at H with probability 1 − α and moves to L with probability α, while L stays at L with probability 1 − β and moves to H with probability β.]
τ = H, H, L, H, L, H, · · ·
If the agent travels on this state transition diagram, what fraction of time does it spend in each state on average?
We assign a probability distribution on the set of states, i.e., the state space with cardinality N, at time t:
$$p = [p_0, p_1, \cdots, p_N]^T, \quad p_s = \Pr[s_t = s].$$
After one transition, the new distribution on the state space is given by
$$p'_{s'} = \sum_{s \in S} p_s P_{ss'}.$$
In matrix form, we have
$$\underbrace{\begin{bmatrix} p'_0 \\ p'_1 \\ \vdots \\ p'_N \end{bmatrix}}_{\text{new}}
= \underbrace{\begin{bmatrix} P_{00} & P_{10} & \cdots & P_{N0} \\ P_{01} & P_{11} & \cdots & P_{N1} \\ \vdots & \vdots & \ddots & \vdots \\ P_{0N} & P_{1N} & \cdots & P_{NN} \end{bmatrix}}_{=\,P^T\ (=[P_{ss'}]^T)}
\underbrace{\begin{bmatrix} p_0 \\ p_1 \\ \vdots \\ p_N \end{bmatrix}}_{\text{old}},
\quad\text{i.e.,}\quad p' = P^T p.$$
A stationary distribution is one that is unchanged by the transition, i.e., p = P^T p.
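A minimal sketch of this update for the earlier two-state chain with states (H, L), assuming the diagram labels mean P(H→L) = α and P(L→H) = β (the numeric values below are made up): iterating p ← P^T p converges to the stationary distribution p = P^T p, which answers the time-fraction question.

```python
import numpy as np

alpha, beta = 0.3, 0.1               # assumed transition probabilities H->L and L->H
P = np.array([[1 - alpha, alpha],    # row = current state (H, L), column = next state
              [beta, 1 - beta]])

p = np.array([1.0, 0.0])             # start deterministically in H
for _ in range(1000):                # p' = P^T p, iterated
    p = P.T @ p

print(p)                                         # limiting (stationary) distribution
print(np.array([beta, alpha]) / (alpha + beta))  # closed form for a two-state chain: (beta, alpha)/(alpha+beta)
```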
[Figure: the example state transition diagram from before (states C, A, L, F, · · · ).]
With an additional uniform restart (with probability 1 − α the chain jumps to a uniformly chosen state, and with probability α it follows P), the stationary distribution satisfies
$$\begin{bmatrix} p_0 \\ p_1 \\ \vdots \\ p_N \end{bmatrix}
= \frac{1-\alpha}{1+N} \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}
+ \alpha \begin{bmatrix} P_{00} & P_{10} & \cdots & P_{N0} \\ P_{01} & P_{11} & \cdots & P_{N1} \\ \vdots & \vdots & \ddots & \vdots \\ P_{0N} & P_{1N} & \cdots & P_{NN} \end{bmatrix}
\begin{bmatrix} p_0 \\ p_1 \\ \vdots \\ p_N \end{bmatrix}.$$
From an MC to an MDP
Definition (Policy)
We call this mapping from st ∈ S to at ∈ A a policy π.
π : S → A : s_t ↦ a_t.
Remark
For generality, we consider probabilistic functions in (1).
In the case of a probabilistic function with finite state and action
spaces, the function (1) is fully described by the set of probabilities:
$$P^a_{ss'} = \Pr[s_{t+1} = s' \mid s_t = s, a_t = a]$$
Transitions from (s_t, a_t):
(s_t, a^1) ⇒ s^1 with probability P^{a^1}_{s s^1}
(s_t, a^1) ⇒ s^2 with probability P^{a^1}_{s s^2}
(s_t, a^2) ⇒ s^2 with probability P^{a^2}_{s s^2}
(s_t, a^2) ⇒ s^3 with probability P^{a^2}_{s s^3}
[Figure: tree diagram from state s_t through actions a^1, a^2 to next states s^1, s^2, s^3, with branch labels P^{a^1}_{s s^1}, P^{a^1}_{s s^2}, P^{a^2}_{s s^2}, P^{a^2}_{s s^3}.]
[P^a_{ss'}] is a rank-3 tensor, indexed by (state, action, next state).
Reward
Reward Function:
$$R^a_{ss'} = R(s, a, s')$$
Return
Definition (Return)
The return G_t at time t is the sum of discounted rewards, i.e.,
$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{i=0}^{\infty} \gamma^i r_{t+1+i}.$$
Discount Factor γ:
The discount factor stabilizes the problem and makes it mathematically simple. Consider the infinite-horizon case: even if r_t is bounded for all t, the undiscounted sum Σ_{k=t}^∞ r_k can grow without bound, i.e., go to infinity.
The discount factor γ < 1 guarantees the existence of an optimal solution.
The discount factor prioritizes the rewards in the near future.
γ = 0: greedy case. Gt = rt+1 , γ = 1: undiscounted sum
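A small numeric sketch of the return for an illustrative reward sequence (the rewards and γ below are made up):

```python
# Discounted return G_t = sum_i gamma^i * r_{t+1+i}, for an illustrative reward sequence.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0, 3.0]          # r_{t+1}, r_{t+2}, ...

G_t = sum(gamma**i * r for i, r in enumerate(rewards))
print(G_t)   # 1 + 0.9*0 + 0.81*2 + 0.729*1 + 0.6561*3 = 5.3173
```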
Markov Decision Processes
Definition (A Task)
The environment, the agent, and the dynamics model, together with the state space S and the action space A, define a specific instance of the reinforcement learning problem; this instance is called a task.
Remark
Basically, we consider an MDP task.
s | a        | s'           | P^a_{ss'} | R^a_{ss'}
H | search   | H            | α         | r_search
H | search   | L            | 1 − α     | r_search
L | search   | L            | β         | r_search
L | search   | H (rescued)  | 1 − β     | −C
H | wait     | H            | 1         | r_wait
H | wait     | L            | 0         | r_wait
L | wait     | H            | 0         | r_wait
L | wait     | L            | 1         | r_wait
L | recharge | H            | 1         | 0
L | recharge | L            | 0         | 0
Table: An MDP setup for the recycling robot (from the textbook by Sutton and Barto), with r_search > r_wait.
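The table translates directly into a data structure; a minimal sketch in Python, where the numeric values chosen for α, β, r_search, r_wait, and C are placeholders for illustration and the zero-probability rows are omitted:

```python
# Recycling-robot MDP from the table: (s, a) -> list of (s', P^a_{ss'}, R^a_{ss'}).
alpha, beta = 0.8, 0.6               # assumed values, not from the lecture
r_search, r_wait, C = 2.0, 1.0, 3.0  # assumed values (r_search > r_wait)

mdp = {
    ('H', 'search'):   [('H', alpha, r_search), ('L', 1 - alpha, r_search)],
    ('L', 'search'):   [('L', beta, r_search), ('H', 1 - beta, -C)],   # battery depleted, robot rescued
    ('H', 'wait'):     [('H', 1.0, r_wait)],
    ('L', 'wait'):     [('L', 1.0, r_wait)],
    ('L', 'recharge'): [('H', 1.0, 0.0)],
}

# Sanity check: transition probabilities sum to 1 for every (s, a).
assert all(abs(sum(p for _, p, _ in outs) - 1.0) < 1e-12 for outs in mdp.values())
```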
Markov Decision Processes
[Figure: state transition diagram of the recycling robot MDP. Nodes H (high battery) and L (low battery); edges labeled with (probability, reward): from H, search → (α, r_search) to H and (1 − α, r_search) to L, wait → (1, r_wait); from L, search → (β, r_search) to L and (1 − β, −C) to H (battery depleted, robot rescued), wait → (1, r_wait), recharge → (1, 0) to H.]
Value Functions
Questions:
How much expected return can the agent receive if it is in a state s and follows policy π?
How much expected return can the agent receive if it takes action a in a state s and then follows policy π?
[Figure: the recycling robot transition graph again, with edges labeled (probability, reward), e.g., (α, r_search), (1 − α, r_search), (1, r_wait), (β, r_search), (1 − β, −C), and (1, 0) for recharge.]
Value Functions
Definition (State-Value Function)
The value of a state s under a policy π is the expected return when the
agent starts from s and follows π, i.e.,
$$V^\pi(s) = E_\pi[G_t \mid s_t = s] = E_\pi\!\left[\sum_{i=0}^{\infty} \gamma^i r_{t+1+i} \,\middle|\, s_t = s\right] \qquad (2)$$
Note that the value of an action is not the immediate reward but the expected return following the action.
Value Functions and Bellman Equations
Value Functions
[Figure: backup diagrams for V^π(s) and Q^π(s, a): from state s, action a ∼ π, reward r, next state s'.]
Value Functions
Remark:
Bellman equation
X
V π (s) = π(a|s)Pssa ′ R(s, a, s ′ ) + γV π (s ′ )
a,s ′
X
= R π (s) + γ Pssπ′ V π (s ′ )
|{z}
s′ P a
a π(a|s)Pss ′
s V π (s)
X X
V π (s) = π(a|s)Pssa ′ R(s, a, s ′ ) + γ π(a|s)Pssa ′ V π (s ′ )
a,s ′ a,s ′
X
= R π (s) + γ Pssπ′ V π (s ′ )
s′
In matrix form,
$$\begin{bmatrix} V^\pi(s_1) \\ \vdots \\ V^\pi(s_N) \end{bmatrix}
= \begin{bmatrix} R^\pi(s_1) \\ \vdots \\ R^\pi(s_N) \end{bmatrix}
+ \gamma \begin{bmatrix} P^\pi_{11} & \cdots & P^\pi_{1N} \\ \vdots & \ddots & \vdots \\ P^\pi_{N1} & \cdots & P^\pi_{NN} \end{bmatrix}
\begin{bmatrix} V^\pi(s_1) \\ \vdots \\ V^\pi(s_N) \end{bmatrix},$$
i.e.,
$$v^\pi = r^\pi + \gamma P^\pi v^\pi \quad\Longrightarrow\quad v^\pi = (I - \gamma P^\pi)^{-1} r^\pi.$$
$$\begin{aligned}
Q^\pi(s, a) &= E_\pi[G_t \mid s_t = s, a_t = a]
= E_\pi\!\left[\sum_{i=0}^{\infty} \gamma^i r_{t+1+i} \,\middle|\, s_t = s, a_t = a\right] \\
&= E_\pi\!\left[ r_{t+1} + \gamma \sum_{i=0}^{\infty} \gamma^i r_{t+2+i} \,\middle|\, s_t = s, a_t = a\right] \\
&= \sum_{s'} P^a_{ss'} R(s, a, s') + \gamma \sum_{s'} P^a_{ss'} \sum_{a'} \pi(a'|s')\, \underbrace{E_\pi\!\left[\sum_{i=0}^{\infty} \gamma^i r_{t+2+i} \,\middle|\, s_{t+1} = s', a_{t+1} = a'\right]}_{=\,Q^\pi(s', a')} \\
&= R(s, a) + \gamma \sum_{s', a'} \underbrace{P^a_{ss'}\, \pi(a'|s')}_{=\,p(s', a'\,|\,s, a)} Q^\pi(s', a'),
\end{aligned}$$
where R(s, a) ≜ Σ_{s'} P^a_{ss'} R(s, a, s').
[Figure: backup diagram for Q^π(s, a): from (s, a), reward r, next state s', next action a' ∼ π, leading to Q^π(s', a').]
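Putting the last two results together, here is a minimal sketch that evaluates a made-up 3-state, 2-action MDP under a fixed policy by solving v^π = (I − γP^π)^{-1} r^π and then reading off Q^π(s, a) = Σ_{s'} P^a_{ss'}[R(s, a, s') + γ V^π(s')]; all numbers and the array layout P[a, s, s'], R[a, s, s'] are illustrative assumptions, not the lecture's notation.

```python
import numpy as np

gamma = 0.9
# Made-up 3-state, 2-action MDP: P[a, s, s1] = transition probs, R[a, s, s1] = rewards.
P = np.array([[[0.5, 0.5, 0.0], [0.1, 0.7, 0.2], [0.0, 0.3, 0.7]],
              [[0.9, 0.1, 0.0], [0.0, 0.5, 0.5], [0.2, 0.0, 0.8]]])
R = np.ones_like(P)                                   # illustrative: reward 1 on every transition
pi = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]])   # pi[s, a]: a fixed stochastic policy

# Induced chain: P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'},  R^pi(s) = sum_{a,s'} pi(a|s) P^a_{ss'} R(s,a,s').
P_pi = np.einsum('sa,asx->sx', pi, P)
r_pi = np.einsum('sa,asx,asx->s', pi, P, R)

# Policy evaluation in closed form: v^pi = (I - gamma P^pi)^{-1} r^pi.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Action values from v^pi: Q^pi(s, a) = sum_{s'} P^a_{ss'} [R(s,a,s') + gamma v^pi(s')].
Q_pi = np.sum(P * (R + gamma * v_pi), axis=2).T       # shape (|S|, |A|)

print(v_pi)
print(Q_pi)
```

Since every reward here is 1, all entries come out to 1/(1 − γ) = 10, which is a quick sanity check of the computation.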
Comparison of Policies
Remark
In a partially ordered set P, we do not require either a ≤ b or b ≤ a for all pairs a, b ∈ P. If we have neither a ≤ b nor b ≤ a, we say a and b are incomparable.
Comparison of Policies
Policies are ordered by their value functions: π ≤ π' iff V^π(s) ≤ V^{π'}(s) for all s ∈ S.
[Figure: V^π(s) plotted over s for three policies: π_1 ≤ π_3 and π_2 ≤ π_3, but neither π_1 ≤ π_2 nor π_2 ≤ π_1, since the value curves of π_1 and π_2 cross.]
Optimal Policy
A policy π^* is optimal if
$$\pi^* \ge \pi, \quad \forall \pi \in \Pi,$$
equivalently,
$$V^{\pi^*}(s) \ge V^\pi(s), \quad \forall s \in S,\ \forall \pi \in \Pi,$$
where Π is the set of all feasible policies for the given MDP.
Questions:
Does such a π^* exist?
What is the value function of π ∗ and how can we compute it?
Recall V^π(s):
$$V^\pi(s) = E_{\{a_t\}\sim\pi}\!\left[\sum_{i=0}^{\infty} \gamma^i R(s_{t+i}, a_{t+i}, s_{t+1+i}) \,\middle|\, s_t = s\right]
= \sum_{\tau=(s_0, a_0, s_1, a_1, \cdots)} (r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots)\,\rho(s_0)\,\pi(a_0|s_0)\,p(s_1|s_0, a_0)\,\pi(a_1|s_1)\,p(s_2|s_1, a_1)\cdots$$
Definition (Optimal State Value Function)
The optimal value of state s is defined as
$$V^*(s) \triangleq \max_{\pi} V^\pi(s) \quad \text{for each } s.$$
[Figure: V^π(s) curves for several policies and their pointwise maximum V^*(s).]
Definition (Optimal Action Value Function)
$$Q^*(s, a) \triangleq \max_{\pi} Q^\pi(s, a).$$
$$Q^*(s, a^1) = \max_{\pi:\{a_\tau,\,\tau\ge t+1\}\sim\pi} E\!\left[\sum_{i=0}^{\infty} \gamma^i R(s_{t+i}, a_{t+i}, s_{t+1+i}) \,\middle|\, s_t = s,\ a_t = a^1\right]$$
$$Q^*(s, a^2) = \max_{\pi:\{a_\tau,\,\tau\ge t+1\}\sim\pi} E\!\left[\sum_{i=0}^{\infty} \gamma^i R(s_{t+i}, a_{t+i}, s_{t+1+i}) \,\middle|\, s_t = s,\ a_t = a^2\right]$$
[Figure: trajectory trees rooted at (s, a^1) and (s, a^2), branching over rewards r, next states s', actions a', and so on.]
[Figure: backup diagram relating V^*(s), the action values Q^*(s, a^1), Q^*(s, a^2), Q^*(s, a^3) for a ∼ π^o(s), and the successor values V^*(s^1), V^*(s^2), V^*(s^3).]
Preliminary Definitions:
F: the set of all functions f : S → A.
π: a policy = a sequence (f_1, f_2, f_3, · · · ) of functions f_t ∈ F, s.t. a_t = f_t(s_t).
Stationary function case: f_t = f, ∀t. We write f^{(N)} = (f, f, · · · , f) (N times) and f^{(∞)} = (f, f, f, · · · ).
R(f): the reward vector of size |S| with elements R_s^{f(s)}.
P(f): the transition probability matrix with elements P_{s,s'}^{f(s)}.
V(π): the state value vector of size |S|.
$$R(f) = \begin{bmatrix} R_{s_1}^{f(s_1)} \\ \vdots \\ R_{s_N}^{f(s_N)} \end{bmatrix}, \qquad
P(f) = \begin{bmatrix} P_{s_1 s_1}^{f(s_1)} & \cdots & P_{s_1 s_N}^{f(s_1)} \\ \vdots & \ddots & \vdots \\ P_{s_N s_1}^{f(s_N)} & \cdots & P_{s_N s_N}^{f(s_N)} \end{bmatrix}, \qquad
V(\pi) = \begin{bmatrix} V^{\pi}(s_1) \\ \vdots \\ V^{\pi}(s_N) \end{bmatrix}.$$
Define the operator L(f) on value vectors by L(f)v ≜ R(f) + γ P(f) v. Consider the concatenated policy (f, π), where only the first action is taken from f and then all the following actions are taken by π. Then, L(f)V(π) is the value vector of the concatenated policy (f, π).
Monotonicity: If v1 ≥ v2 , i.e., v1 (s) ≥ v2 (s), ∀s ∈ S, then L(f )v1 ≥ L(f )v2 .
Lemma 1 (Beating 1st-step change): If a policy π satisfies π ≥ (f, π) for all f ∈ F, then π is optimal. Here, (f, π) means that the first action is taken by the function f and then all the following actions follow π.
Proof: Due to the assumption, L(f )V (π), which is the value vector of (f , π), satisfies
L(f )V (π) ≤ V (π).
Let any policy π' be written as π' = (f_1, f_2, f_3, · · · ). By the assumption, (f_N, π) ≤ π, i.e., L(f_N)V(π) ≤ V(π). Now, apply the monotonicity to this:
$$L(f_{N-1})L(f_N)V(\pi) \overset{(a)}{\le} L(f_{N-1})V(\pi) \overset{(b)}{\le} V(\pi),$$
where (a) is by monotonicity and (b) is by the assumption. Repeatedly applying this procedure, we have
$$L(f_1)L(f_2)\cdots L(f_N)V(\pi) \le V(\pi).$$
Letting N → ∞ (the left-hand side tends to V(π'), since the contribution of rewards beyond step N is weighted by γ^N), we have
$$V(\pi') \le V(\pi), \quad \forall \pi'.$$
Hence, π is optimal.
Lemma 2: If L(f)V(π) > V(π) for some f ∈ F, i.e., (f, π) > π, then f^{(∞)} > π.
Proof: By assumption,
$$L(f)V(\pi) > V(\pi).$$
Applying the monotonicity,
$$L(f)L(f)V(\pi) \overset{(a)}{>} L(f)V(\pi) \overset{(b)}{>} V(\pi),$$
where (a) is by the monotonicity and (b) is by the above equation. Repeating this yields
$$L(f)^N V(\pi) > V(\pi).$$
Letting N → ∞ (the left-hand side tends to V(f^{(∞)})), we have
$$f^{(\infty)} > \pi.$$
For a stationary policy f^{(∞)}, define G(s, f) as the set of actions that strictly improve the first step at s, i.e., actions a with
$$Q^{f^{(\infty)}}(s, a) > V^{f^{(\infty)}}(s).$$
If G(s, f) is empty for all s ∈ S, then Q^{f^{(∞)}}(s, a) ≤ V^{f^{(∞)}}(s) for all a, equivalently for all h : a = h(s). That is, (h, f^{(∞)}) ≤ f^{(∞)}. Then, by Lemma 1, f^{(∞)} is optimal.
If G(s, f) is not empty for some s', construct a new policy g as follows:
For s' s.t. G(s', f) ≠ ∅, set g(s') = a ∈ G(s', f).
For all other s ≠ s', set g(s) = f(s).
Then, (g, f^{(∞)}) > f^{(∞)}, so by Lemma 2, g^{(∞)} > f^{(∞)}.
Suppose that we computed the optimal action value function Q ∗ (s, a) over S × A.
(Q ∗ (s, a) exists for finite MDP cases, as we will see later.) Then, construct a policy
π^o : S → A as follows:
$$\pi^o(a|s) = \begin{cases} 1, & a = \arg\max_{\alpha\in A} Q^*(s, \alpha) \\ 0, & \text{otherwise} \end{cases}$$
Pseudocode:
Initialization: S = {s(1), s(2), · · · , s(|S|)}, A = {a(1), a(2), · · · , a(|A|)}, π^o(a|s) = 0, ∀(s, a) ∈ S × A
for i = 1 : |S|
    s = s(i)
    for j = 1 : |A|
        a = a(j)
        if a = arg max_{α∈A} Q^*(s, α)
            π^o(a|s) = 1
        end if
    end for
end for
[Figure: trajectory trees from (s, a^1) and (s, a^2) illustrating Q^*(s, a^1) and Q^*(s, a^2).]
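The pseudocode above translates directly into a few lines; a minimal sketch assuming Q* is given as a |S| × |A| array (the numbers below are placeholders):

```python
import numpy as np

# Assumed optimal action values Q*(s, a) for |S| = 3 states and |A| = 2 actions (illustrative numbers).
Q_star = np.array([[1.0, 2.0],
                   [0.5, 0.4],
                   [3.0, 3.5]])

# Greedy (deterministic) policy: pi_o(a|s) = 1 if a = argmax_alpha Q*(s, alpha), else 0.
pi_o = np.zeros_like(Q_star)
pi_o[np.arange(Q_star.shape[0]), Q_star.argmax(axis=1)] = 1.0
print(pi_o)
```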
Proof of Theorem 1
Recall Theorem 1: The following is true:
i) There exists an optimal policy π^* such that V^{π^*}(s) ≥ V^π(s), ∀s, ∀π.
ii) V^{π^*}(s) = V^*(s).
iii) Q^{π^*}(s, a) = Q^*(s, a).
Proof) ii) By the definition of the optimal policy π^*, we have
$$V^*(s) \triangleq \max_{\pi} V^\pi(s) \le V^{\pi^*}(s).$$
Conversely, V^{π^*}(s) ≤ max_π V^π(s) = V^*(s). Hence V^{π^*}(s) = V^*(s).
Proof of Theorem 1
Proof continued)
iii) Suppose that Q^{π^*}(s, a) < max_π Q^π(s, a) = Q^*(s, a). Then,
$$R_s^a + \gamma \sum_{s'} P^a_{s,s'} V^{\pi^*}(s') = Q^{\pi^*}(s, a) < \max_{\pi} Q^\pi(s, a) = R_s^a + \gamma \sum_{s'} P^a_{s,s'} \max_{\pi} V^\pi(s'),$$
so V^{π^*}(s') < max_π V^π(s') for some s'. This contradicts the optimality of π^* (part ii)). Hence, Q^{π^*}(s, a) = Q^*(s, a).
[Figure: backup diagram relating V^*(s), the action values Q^*(s, a^1), Q^*(s, a^2), Q^*(s, a^3) for a ∼ π^o(s), and the successor values V^*(s^1), V^*(s^2), V^*(s^3), illustrating the Bellman optimality relation.]
Remark: The Bellman equation for optimal value functions is called the Bellman
optimality equation.
Optimal Policy and Bellman Optimality Equation
$$V^*(s) = \max_{a\in A}\Big\{ R(s, a) + \gamma \sum_{s'} p(s'|s, a)\,V^*(s') \Big\}$$
Written out componentwise for S = {s^1, · · · , s^N} and A = {a^1, · · · , a^M}, this is a system of N coupled equations,
$$V^*(s^k) = \max\Big\{ R(s^k, a^1) + \gamma\big[p(s^1|s^k, a^1)V^*(s^1) + \cdots + p(s^N|s^k, a^1)V^*(s^N)\big],\ \cdots,\ R(s^k, a^M) + \gamma\big[p(s^1|s^k, a^M)V^*(s^1) + \cdots + p(s^N|s^k, a^M)V^*(s^N)\big] \Big\}, \quad k = 1, \cdots, N.$$
A system of nonlinear equations (each equation takes a max over |A| terms, N equations in total) ⇒ difficult to solve in closed form.
Optimal Policy and Bellman Optimality Equation
[Figure: the recycling robot MDP transition graph again, as a running example for writing out the Bellman optimality equation.]
Remarks:
The Bellman optimality equation is a system of nonlinear equations.
So, it is difficult to obtain closed-form solutions in general.
Thus, we approach the MDP problem based on iterative methods
such as generalized policy iteration.
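As a concrete sketch of such an iterative approach, the following is a minimal policy-iteration loop (exact policy evaluation by solving the linear Bellman system, then greedy improvement). The array layout P[a, s, s'], R[a, s, s'] and all names are assumptions chosen for illustration, not the lecture's notation:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, iters=50):
    """P[a, s, s1]: transition probs, R[a, s, s1]: rewards. Returns (state values, greedy policy)."""
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)                     # start from an arbitrary deterministic policy
    v = np.zeros(S)
    for _ in range(iters):
        # Policy evaluation: v = (I - gamma P^pi)^{-1} r^pi
        P_pi = P[pi, np.arange(S)]                  # S x S matrix induced by pi
        r_pi = np.sum(P_pi * R[pi, np.arange(S)], axis=1)
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: greedy w.r.t. Q(s, a) = sum_s' P^a_{ss'} [R(s,a,s') + gamma v(s')]
        Q = np.sum(P * (R + gamma * v), axis=2)     # A x S
        new_pi = Q.argmax(axis=0)
        if np.array_equal(new_pi, pi):              # policy stable: stop
            break
        pi = new_pi
    return v, pi
```

Calling this with arrays built from the recycling-robot table above (with numeric values substituted for α, β, r_search, r_wait, C) would return its optimal state values and an optimal deterministic policy.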
[Figure: a contraction mapping T on a set X; the images T(x_1), T(x_2) are closer to each other than x_1 and x_2.]
$$\lim_{n\to\infty} T^n(x) = x^*, \quad\text{where}\quad T^n(x) = \underbrace{(T \circ T \circ \cdots \circ T)}_{n\ \text{times}}(x).$$
$$\mathcal{V} = \{V(s),\ s \in S\} = \mathbb{R}^{|S|}, \qquad d(V_1, V_2) = \|V_1 - V_2\|_\infty = \max_{s\in S} |V_1(s) - V_2(s)|.$$
Then, this (V, d) is a complete metric space since the set of real numbers is complete w.r.t. the L^∞-norm. Now, define a mapping T^* : V → V as
$$T^*: V(s) \to T^*(V(s)) = \max_{a\in A}\Big\{ R(s, a) + \gamma \sum_{s'} p(s'|s, a)\,V(s') \Big\}.$$
Proof Continued
Now consider any two value functions V1 (s) and V2 (s). Then, we have
$$\begin{aligned}
\|T^*V_1(s) - T^*V_2(s)\|_\infty
&= \left\| \max_a\Big\{ R(s,a) + \gamma\sum_{s'} p(s'|s,a)V_1(s') \Big\} - \max_{\tilde a}\Big\{ R(s,\tilde a) + \gamma\sum_{s'} p(s'|s,\tilde a)V_2(s') \Big\} \right\|_\infty \\
&\le \left\| \max_a\Big\{ R(s,a) + \gamma\sum_{s'} p(s'|s,a)V_1(s') - R(s,a) - \gamma\sum_{s'} p(s'|s,a)V_2(s') \Big\} \right\|_\infty \\
&\le \left\| \max_a\Big\{ \gamma\sum_{s'} p(s'|s,a)\,[V_1(s') - V_2(s')] \Big\} \right\|_\infty \\
&\le \gamma \max_a\Big\{ \sum_{s'} p(s'|s,a)\, \big\|V_1(s') - V_2(s')\big\|_\infty \Big\}
   \qquad (\because\ \|\max(a,b)\| \le \max(\|a\|,\|b\|)) \\
&\le \gamma\, \big\|V_1(s') - V_2(s')\big\|_\infty \max_a\Big\{ \sum_{s'} p(s'|s,a) \Big\}
   \qquad (\text{due to the definition of } \|\cdot\|_\infty) \\
&\le \gamma\, \big\|V_1(s') - V_2(s')\big\|_\infty
   \qquad \Big(\because\ \sum_{s'} p(s'|s,a) = 1,\ \forall s, a\Big).
\end{aligned}$$
Proof Continued
Hence, the Bellman operator T ∗ is a contraction mapping. By the Banach fixed point
theorem, there exists a unique fixed point V ∗ (s) such that
$$V^*(s) = T^*(V^*(s)) = \max_{a}\Big\{ R(s, a) + \gamma \sum_{s'} p(s'|s, a)\,V^*(s') \Big\},$$
but this is nothing but the Bellman optimality equation. Hence, we have the claim.
Furthermore, by the Banach fixed point theorem, V^*(s) can be obtained by iteratively applying the Bellman operator to any initial V^{(0)}(s), i.e.,
$$V^*(s) = \lim_{n\to\infty} (T^*)^n V^{(0)}(s).$$
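A minimal value-iteration sketch of this statement: repeatedly apply T* to an arbitrary initial V^(0) until the sup-norm change is small. The array layout P[a, s, s'], R[a, s, s'] is again an assumption for illustration:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Iterate V <- T* V = max_a sum_s' P[a,s,s'] (R[a,s,s'] + gamma V[s'])."""
    S = P.shape[1]
    V = np.zeros(S)                                   # any initial V^(0) works
    for _ in range(max_iter):
        V_new = np.max(np.sum(P * (R + gamma * V), axis=2), axis=0)
        if np.max(np.abs(V_new - V)) < tol:           # sup-norm stopping rule, matching the contraction argument
            return V_new
        V = V_new
    return V
```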
Appendix
Cauchy Sequences
Goal
We want to define the convergence of a sequence of numbers in some
number system without knowing whether the limit is contained in that
number system or not.
Cauchy Sequences
Definition (Cauchy Sequence)
A sequence {x_n} is a Cauchy sequence if, for every ε > 0, there exists N such that |x_i − x_j| < ε for all i, j > N.
[Figure: a sequence x_n whose terms beyond index N stay within ε of each other.]
Definition (Completeness)
A set X is complete if every Cauchy sequence in X has a limit in the set X.
Definition (Limit)
A number x is called the limit of the sequence x_n if, for every ε > 0, there exists N such that |x_n − x| < ε for all n > N.
Remark
It is desirable that any element that can be approached arbitrarily closely
by elements in a set should be in the same set.
Remark
We can generalize the definitions to any metric space (X , d) with metric d.
Optimal Policy and Bellman Optimality Equation
Contraction
Definition (A Contraction Mapping)
Let (X, d) be a complete metric space. A mapping T : X → X is a contraction mapping (or simply contraction) if there exists some constant γ ∈ [0, 1) such that
$$d(T(x_1), T(x_2)) \le \gamma\, d(x_1, x_2), \quad \forall x_1, x_2 \in X.$$
[Figure: a contraction T maps x_1, x_2 ∈ X to points T(x_1), T(x_2) that are closer together.]
Remark
Note that any contraction mapping is a continuous mapping by the definition of
continuity.
Optimal Policy and Bellman Optimality Equation
Contraction: An Example
[Figure: plot of y = x and y = Tx; starting from x_0, the iterates x_1, x_2, · · · move toward the intersection point of the two curves, i.e., the fixed point of T.]
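A tiny numeric version of this picture, using the made-up contraction T(x) = 0.5x + 1 on the real line (modulus 0.5, fixed point x* = 2): iterating T from any starting point approaches the intersection with y = x.

```python
def T(x):
    return 0.5 * x + 1.0     # a contraction with modulus 0.5; fixed point x* = 2

x = 10.0                     # arbitrary starting point x_0
for _ in range(50):          # x_{n+1} = T(x_n)
    x = T(x)
print(x)                     # converges to 2.0
```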
Fixed Point
Theorem (Banach Fixed Point Theorem)
Let (X, d) be a complete metric space and T : X → X a contraction. Then T has a unique fixed point x^* = T(x^*), and for any x_0 ∈ X,
$$\lim_{n\to\infty} T^n(x_0) = x^*, \quad\text{where}\quad T^n(x) \triangleq \underbrace{(T \circ T \circ \cdots \circ T)}_{n\ \text{times}}(x).$$
Proof
For any given x_0 ∈ X, let x_n ≜ T^n(x_0) and c ≜ d(x_0, x_1). Then, we have
$$d(x_n, x_{n+1}) \le \gamma^n c$$
for every n. Then, for any n and m such that n < m, we have
$$d(x_n, x_m) \le \sum_{k=n}^{m-1} d(x_k, x_{k+1}) \le \sum_{k=n}^{m-1} \gamma^k c \le \frac{\gamma^n}{1-\gamma}\, c \ \to\ 0 \quad \text{as } n \to \infty,$$
so {x_n} is a Cauchy sequence and, by completeness, x_n → x^* for some x^* ∈ X. Since the contraction T is continuous, T(x^*) = lim_n T(x_n) = lim_n x_{n+1} = x^*. Hence, x^* is a fixed point of T.
Proof
Now, we show that the fixed point is unique. Let x and y be fixed points of T, i.e., x = Tx and y = Ty. Then,
$$d(x, y) = d(Tx, Ty) \le \gamma\, d(x, y).$$
Since γ < 1 and d(x, y) ≥ 0, the only possibility is d(x, y) = 0. That is, x = y. This concludes the proof.