
AI512/EE633: Reinforcement Learning

Lecture 2 - Markov Decision Process

Seungyul Han

UNIST
[email protected]

Spring 2024



Contents

1 Setup

2 Markov Chains

3 Markov Decision Processes

4 Value Functions and Bellman Equations

5 Optimal Policy and Bellman Optimality Equation

Setup

The Setup: Interaction between Agent and Environment

[Figure: the agent–environment interaction loop. At each time step t = 0, 1, 2, 3, · · · , the agent receives state s_t and reward r_t and emits action a_t; the environment returns the next state s_{t+1} and reward r_{t+1}.]

The main ingredients of the setup:
The agent
The environment
The dynamics model (state transition and reward function)
A policy (the function to be designed)

The Setup: Interaction between Agent and Environment

At each time step t, the agent
observes the observation o_t = s_t,
executes action a_t,
and receives reward r_t.
The environment
receives action a_t,
transitions to the next state s_{t+1},
and produces reward r_{t+1};
then the interaction continues at time step t + 1.

What is the background for this setup?

The State Space and The Action Space

Definition (The State Space)


The set of all possible states of the environment is called the state space,
typically denoted by S.

Definition (The Action Space)


The set of all possible actions of the agent is called the action space,
typically denoted by A.

Assumption (Finite State and Action Spaces)


Throughout this course, we will assume the following:
The state space S is finite, i.e., |S| < ∞.
The action space A is finite, i.e., |A| < ∞.


Sufficiency and Markovianness

The State Variable st :

Sufficient Statistic
The state variable s_t is sufficient to describe the status of the environment at time step t; it contains all information relevant to making decisions or inference.

Markovianness
The sequence of states over time {st , t = 0, 1, 2, · · · } is a Markov process
or Markov chain. That is,

P[st+1 |s1 , s2 , · · · , st ] = P[st+1 |st ], ∀t


Sufficient Statistic
Remark
Sufficiency is with respect to the target optimal inference or decision.

Example 1: [Figure: an RLC circuit with input voltage v_in(t) = x(t) and output voltage v_out(t) = y(t) taken across the capacitor C.]
In this 2nd-order circuit system, we can set the state variables to be the capacitor voltage and the inductor current.

Example 2: X_i ∼ N(θ, 1), i.i.d. We observe data X_1, · · · , X_n and want to estimate the mean parameter θ. Suppose that we consider the maximum likelihood estimator (MLE) for θ as optimal and use it. Then, Σ_{i=1}^n X_i is a sufficient statistic:

p(x_1, · · · , x_n; θ) = Π_{i=1}^n (1/√(2π)) e^{−(x_i − θ)²/2} = (2π)^{−n/2} e^{−(Σ_i x_i² − 2θ Σ_i x_i + nθ²)/2}

Markov Chains

Markov Processes or Markov Chains


[Figure: a chain of states S_0 → S_1 → S_2 → S_3 → S_4.]

Definition (A Markov process)


A random process or sequence {St } with St ∈ S, ∀t is Markov if and only if

Pr[St+1 |S1 , S2 , · · · , St ] = Pr[St+1 |St ]

Remark
A Markov state contains all relevant information from the entire history for system
evolution.
The history in the past and the evolution in the future are independent given St ,
i.e.,
P(τ_{1:t}, τ_{t+1:∞} | s_t) = P(τ_{1:t} | s_t) P(τ_{t+1:∞} | s_t),

where τ1:t = (s1 , s2 , · · · , st ).
A Markov process is also called a Markov chain (MC).


Finite Markov Processes


Definition (A finite Markov process)


A Markov process is called finite if the cardinality of its state space S is
finite. (Recall that we will assume that the state space is finite through
this course.)

Remark
For a finite MC, the collection of state transition probabilities

Pss ′ = Pr[st+1 = s ′ |st = s], ∀s ∈ S, ∀s ′ ∈ S

together with the state space S fully defines a Markov chain.


This probability is called the state-transition probability. Let us denote the
collection by P = {Pss ′ |s, s ′ ∈ S}.


Markov Chain Example: State-Transition Diagram

[Figure: state-transition diagram over the states listed below, with the transition probabilities given in the matrix on the next slide.]

States:
I: Initial State
L: Library
C: Cafe
A: Arbeit
F: Fatigued
H: Home (Terminal State)

t = 0, 1, 2, · · ·

Markov Chain Example: Transition Probability Matrix

Transition Probability Matrix P = [Pss ′ ]

Rows are the current state s, columns are the next state s'; blank entries are 0.

        I     L     C     A     F     H
  I           1
  L          0.5   0.1   0.3   0.1
  C          0.4   0.3   0.3
  A                      0.4   0.6
  F                                  1
  H                                  1
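As a quick sanity check, the matrix can be written down in Python (a small illustrative sketch, not part of the slides), using the row entries as reconstructed above; every row must sum to 1:

import numpy as np

# State order: I, L, C, A, F, H (H is the terminal/absorbing state).
states = ["I", "L", "C", "A", "F", "H"]
P = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],   # I
    [0.0, 0.5, 0.1, 0.3, 0.1, 0.0],   # L
    [0.0, 0.4, 0.3, 0.3, 0.0, 0.0],   # C
    [0.0, 0.0, 0.0, 0.4, 0.6, 0.0],   # A
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # F
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # H
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability distribution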


Markov Chain Example: Realizations and Episodes

Definition (An episode)


An episode is a terminated realization of a Markov chain: (S_0, S_1, · · · , S_T), where S_T is the terminal state and T is the episode length.

[Figure: the state-transition diagram, with example episodes:]
I, L, F, H
I, L, L, L, F, H
I, L, L, C, A, A, F, H
I, L, C, L, L, F, H
· · ·

Note that we can compute the probability of each episode. For example, Pr[(I, L, L, L, F, H)] = 1 × 0.5² × 0.1 × 1. Of course, Pr[∪_i E_i] = 1 over all possible episodes E_i.
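A small illustrative Python sketch (not from the slides) that computes an episode's probability and samples episodes from this chain; it reuses the transition matrix reconstructed above, redefined here so the snippet is self-contained.

import numpy as np

states = ["I", "L", "C", "A", "F", "H"]
idx = {s: i for i, s in enumerate(states)}
P = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],   # I
    [0.0, 0.5, 0.1, 0.3, 0.1, 0.0],   # L
    [0.0, 0.4, 0.3, 0.3, 0.0, 0.0],   # C
    [0.0, 0.0, 0.0, 0.4, 0.6, 0.0],   # A
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # F
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # H
])

def episode_probability(episode):
    # Product of the one-step transition probabilities along the episode.
    p = 1.0
    for s, s_next in zip(episode[:-1], episode[1:]):
        p *= P[idx[s], idx[s_next]]
    return p

def sample_episode(rng, start="I", terminal="H"):
    # Roll the chain forward until the terminal state is reached.
    episode, s = [start], start
    while s != terminal:
        s = rng.choice(states, p=P[idx[s]])
        episode.append(s)
    return episode

rng = np.random.default_rng(0)
print(episode_probability(["I", "L", "L", "L", "F", "H"]))  # 1 * 0.5^2 * 0.1 * 1 = 0.025
print(sample_episode(rng))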


The Unending Continuing Case


The Gilbert-Elliott Markov Chain:

[Figure: two states H and L; H stays in H with probability 1 − α and moves to L with probability α; L moves to H with probability β and stays in L with probability 1 − β.]

τ = H, H, L, H, L, H, · · ·

The sequence never ends!


This process is not episodic.
We call this a continuing case (not an episodic case).

Stationary Distribution of a Finite MC

If the agent travels on the state-transition diagram, what is the average fraction of time that it spends in each state?
We assign a probability distribution on the set of states, i.e., the state space
with cardinality N at time t:

p0 = Pr(S0 ), p1 = Pr(S1 ), · · · , pN = Pr(SN )

After one transition, the new distribution on the state space is given by

p_{s'} = Σ_{s∈S} p_s P_{ss'}

Stationary Distribution of a Finite MC


p_{s'} = Σ_{s∈S} p_s P_{ss'}

In matrix form, with p = (p_0, p_1, · · · , p_N)^T, this reads

p' = P^T p,   where P^T = [P_{ss'}]^T   (new distribution = P^T × old distribution).

When stationarity is achieved, there is no more change in the state distribution. This limiting distribution is called the stationary distribution, and it can be obtained by solving the linear equation

p = P^T p.
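For a concrete illustration (a sketch, not part of the slides), the stationary distribution can be computed as the eigenvector of P^T with eigenvalue 1; here for the Gilbert-Elliott chain with example values α = 0.1 and β = 0.3.

import numpy as np

alpha, beta = 0.1, 0.3              # example parameters for the Gilbert-Elliott chain
P = np.array([[1 - alpha, alpha],   # H -> {H, L}
              [beta, 1 - beta]])    # L -> {H, L}

# Solve p = P^T p with the constraint that p sums to 1:
# take the eigenvector of P^T associated with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(eigvals - 1.0))
p = np.real(eigvecs[:, k])
p = p / p.sum()
print(p)   # [Pr(H), Pr(L)] = [beta, alpha] / (alpha + beta) = [0.75, 0.25]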


Google’s Page Rank Algorithm

[Figure: the state-transition diagram restricted to the non-terminal part of the example chain (L, C, A, F).]

p = ((1 − α)/(1 + N)) · 1 + α P^T p,

where 1 is the all-ones vector, N + 1 is the number of states (pages), and α is the damping factor.
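A brief Python sketch of this damped update (illustrative only; the link matrix and the damping factor α = 0.85 are example values, not from the slides). Here n plays the role of the number of pages, so the uniform term is (1 − α)/n.

import numpy as np

# Row-stochastic link/transition matrix P over n "pages" (example values).
P = np.array([[0.0, 0.5, 0.5],
              [0.3, 0.3, 0.4],
              [0.5, 0.5, 0.0]])
alpha = 0.85                       # damping factor
n = P.shape[0]
p = np.ones(n) / n                 # start from the uniform distribution

for _ in range(100):               # fixed-point iteration of the damped update
    p = (1 - alpha) / n * np.ones(n) + alpha * P.T @ p

print(p, p.sum())                  # PageRank scores; they sum to 1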

Markov Decision Processes

From an MC to an MDP

Control of Environment by Agent

[Figure: the agent–environment interaction loop, as in the Setup section.]

In a Markov chain, we do not care about why the state transition from s_t to s_{t+1} happens; we simply assume that the transition occurs at each time step according to the state-transition probability [P_{ss'}].
To define a Markov decision process (MDP), we need two more ingredients:
Action a_t
Reward r_{t+1}

Action and Policy

We assume that the state variable st of the environment is a sufficient and


Markovian statistic, as in MCs.
Hence, to control the environment at time t, the agent needs to know st only.
We assume that the agent observes st , i.e., the fully observable case.
Since st is sufficient and Markovian, it is sufficient that the control input or action
at of the agent is a function of st only.

Definition (Policy)
We call this mapping from st ∈ S to at ∈ A a policy π.

π : S → A : s_t ↦ a_t.

Deterministic function case: at = π(st )


Stochastic function case: at ∼ π(at |st ), where π(a|s) = Pr[at = a|st = s]


Markovian State Transition with Action Involvement

Fundamental Modeling of System Dynamics


Under sufficiency and Markovianness, the next state of a system is
(probabilistically) determined by the current state and the system input,
i.e.,
st+1 = func(st , at ). (1)

Remark
For generality, we consider probabilistic functions in (1).
In the case of a probabilistic function with finite state and action
spaces, the function (1) is fully described by the set of probabilities:
P^a_{ss'} = Pr[s_{t+1} = s' | s_t = s, a_t = a]

Markovian State Transition with Action Involvement

Transitions:
(s_t, a^1) ⇒ s^1 with probability P^{a^1}_{s s^1}
(s_t, a^1) ⇒ s^2 with probability P^{a^1}_{s s^2}
(s_t, a^2) ⇒ s^2 with probability P^{a^2}_{s s^2}
(s_t, a^2) ⇒ s^3 with probability P^{a^2}_{s s^3}

[Figure: from state s = s_t, action a^1 can lead to next state s^1 or s^2, and action a^2 can lead to s^2 or s^3, each with the corresponding probability.]

[P^a_{ss'}] is a rank-3 tensor.

Markovian State Transition with Action Involvement: Action Marginalization

[P^a_{ss'}], a rank-3 tensor ⇒ [P_{ss'}], a rank-2 tensor, i.e., a matrix:

P_{ss'} = Pr[s_{t+1} = s' | s_t = s]
        = Σ_{a∈A} Pr[s_{t+1} = s', a_t = a | s_t = s]      (marginalization)
        = Σ_{a∈A} Pr[a_t = a | s_t = s] Pr[s_{t+1} = s' | s_t = s, a_t = a]
        = Σ_{a∈A} π(a|s) P^a_{ss'}
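A small numerical sketch of this marginalization (illustrative values only): given a rank-3 tensor P_a[a, s, s'] and a policy π(a|s), the induced chain is P^π_{ss'} = Σ_a π(a|s) P^a_{ss'}.

import numpy as np

# Example MDP with 2 actions and 3 states: P_a[a, s, s'] = Pr[s' | s, a].
P_a = np.array([
    [[0.7, 0.3, 0.0],       # action a0
     [0.0, 0.6, 0.4],
     [0.0, 0.0, 1.0]],
    [[0.2, 0.8, 0.0],       # action a1
     [0.5, 0.0, 0.5],
     [0.0, 0.0, 1.0]],
])
pi = np.array([[0.5, 0.5],  # pi(a|s) for each state s (rows sum to 1)
               [0.9, 0.1],
               [1.0, 0.0]])

# Marginalize the action: P_pi[s, s'] = sum_a pi(a|s) * P_a[a, s, s'].
P_pi = np.einsum("sa,asp->sp", pi, P_a)
assert np.allclose(P_pi.sum(axis=1), 1.0)   # still a valid transition matrix
print(P_pi)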


Reward

Reward Function:

R^a_{ss'} = R(s, a, s') = E[ R_{t+1} | s_t = s, a_t = a, s_{t+1} = s' ]

[Figure: the reward R^{a^1}_{s s^1} is attached to the transition from (s, a^1) to the next state s^1.]

Remark: In full generality, the reward R_{t+1} can be a random variable even given the event {s_t = s, a_t = a, s_{t+1} = s'}. In this course, we are interested in the mean value r_{t+1} = E[R_{t+1}]. In distributional RL, however, the distribution itself is considered.
In full generality, [R^a_{ss'}] is a rank-3 tensor in the case of finite S and A.

Return
Definition (Return)
The return Gt at time t is the sum of discounted rewards, i.e.,

G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + · · · = Σ_{k=0}^{T or ∞} γ^k r_{t+1+k},

where the discount factor γ ∈ [0, 1].

Discount Factor γ:
The discount factor stabilizes the problem and makes it mathematically simple. Consider the infinite-horizon case: even if r_t is bounded for all t, the undiscounted sum Σ_{k=t}^∞ r_k can grow without bound, i.e., go to infinity.
The discount factor γ < 1 guarantees the existence of an optimal solution.
The discount factor prioritizes rewards in the near future.
γ = 0: greedy case, G_t = r_{t+1}; γ = 1: undiscounted sum.
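A tiny numerical sketch (illustrative only) of computing a discounted return from a finite reward sequence by backward accumulation:

# Discounted return G_t = sum_k gamma^k * r_{t+1+k} for a finite reward list.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):   # backward accumulation: G = r + gamma * G_next
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))   # 1 + 0.9*0 + 0.81*2 = 2.62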

Markov Decision Processes (MDPs)


Definition (A Markov Decision Process)
A Markov decision process is a Markov process with actions and corresponding
rewards and is defined by a tuple < S, A, P, R, γ >, where
S is the state space,
A is the action space,
P = [P^a_{ss'}] is the collection of state-transition probabilities, given by

P^a_{ss'} = P(s'|s, a) = Pr[s_{t+1} = s' | s_t = s, a_t = a]

R = [R^a_{ss'}] is the reward function, given by

R^a_{ss'} = R(s, a, s') = E[ R_{t+1} | s_t = s, a_t = a, s_{t+1} = s' ]

the discount factor γ ∈ [0, 1]

Goal: We want to maximize the expected discounted return:

E[G_0] = E[ r_1 + γ r_2 + γ² r_3 + · · · ] = E[ Σ_{t=0}^∞ γ^t r_{t+1} ]

A Reinforcement Learning Task

Definition (A Task)
The environment, the agent, and the dynamics model, together with the state space S and the action space A, define a specific instance of the reinforcement learning problem; this instance is called a task.

Remark
Basically, we consider an MDP task.


MDP Example: Recycling Robot

State: battery level
1 high (H)
2 low (L)
Action:
search for a can for duration D
wait for someone to drop a can into the onboard bin
go to the recharging station and recharge
A decision should be made every L seconds.
Reward and goal: collect as many empty soda cans as possible.

[Figure: the recycling robot, its charging station, and a soda can.]

Recycling Robot - Transition Probabilities and Reward Function

s    a          s'              P^a_{ss'}    R^a_{ss'}
H    search     H               α            r_search
H    search     L               1 − α        r_search
L    search     L               β            r_search
L    search     H (rescued)     1 − β        −C
H    wait       H               1            r_wait
H    wait       L               0            r_wait
L    wait       H               0            r_wait
L    wait       L               1            r_wait
L    recharge   H               1            0
L    recharge   L               0            0

Table: An MDP setup for the recycling robot (from the textbook by Sutton and Barto); r_search > r_wait.
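As an illustrative sketch (not from the textbook), the table can be encoded as nested dictionaries; the numeric values chosen for α, β and the rewards are arbitrary placeholders.

# Recycling-robot dynamics P[(s, a)] = {s': prob} and rewards R[(s, a, s')],
# following the table above; alpha, beta, r_search, r_wait, C are placeholder values.
alpha, beta = 0.9, 0.6
r_search, r_wait, C = 2.0, 1.0, 3.0

P = {
    ("H", "search"):   {"H": alpha,     "L": 1 - alpha},
    ("L", "search"):   {"L": beta,      "H": 1 - beta},   # 1 - beta: battery depletes, robot is rescued
    ("H", "wait"):     {"H": 1.0},
    ("L", "wait"):     {"L": 1.0},
    ("L", "recharge"): {"H": 1.0},
}
R = {
    ("H", "search", "H"): r_search, ("H", "search", "L"): r_search,
    ("L", "search", "L"): r_search, ("L", "search", "H"): -C,
    ("H", "wait", "H"): r_wait,     ("L", "wait", "L"): r_wait,
    ("L", "recharge", "H"): 0.0,
}

# Every (s, a) row of P is a probability distribution over next states.
assert all(abs(sum(d.values()) - 1.0) < 1e-12 for d in P.values())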

Recycling Robot - State Transition Diagram


[Figure: state-transition diagram of the recycling robot; each edge is labeled with its transition probability and reward, as in the table on the previous slide.]

A policy π : π(s|H), π(w |H), π(s|L), π(w |L), π(r |L)

(from the textbook by Sutton and Barto)


Where Is Randomness in an MDP?

[Figure: the recycling-robot state-transition diagram again.]

Randomness in evolution comes from


Initial state distribution p(s0 )
Action choice π(a|s)
State transition by the environment P(s ′ |s, a)

Value Functions and Bellman Equations

Value Functions

Questions:
How much expected return can the agent receive if it is in a state s and follows policy π?
How much expected return can the agent receive if it takes action a in a state s and then follows policy π?

[Figure: the recycling-robot state-transition diagram.]

Value Functions
Definition (State-Value Function)
The value of a state s under a policy π is the expected return when the agent starts from s and follows π, i.e.,

V^π(s) = E_π[ G_t | s_t = s ] = E_π[ Σ_{i=0}^∞ γ^i r_{t+1+i} | s_t = s ]     (2)

Definition (Action-Value Function)


The value of an action a in a state s under a policy π is the expected return when the agent takes action a in state s and then follows π, i.e.,

Q^π(s, a) = E_π[ G_t | s_t = s, a_t = a ] = E_π[ Σ_{i=0}^∞ γ^i r_{t+1+i} | s_t = s, a_t = a ]     (3)

Note that the value of an action is not the immediate reward but the expected return following the action.
Value Functions and Bellman Equations

Value Functions

V π (s) versus Q π (s, a)

V^π(s) = E_π[ G_t | s_t = s ]
Q^π(s, a) = E_π[ G_t | s_t = s, a_t = a ]

[Figure: backup diagrams. For V^π(s), the rollout from state s first averages over the action a ∼ π and then over the reward r and next state s'; for Q^π(s, a), the first action a is fixed, and the averaging starts with the reward r and next state s', after which π is followed.]

Value Functions

Remark:

Note that V^π : S → R is a function from the state space to the set of real numbers.
Q^π : S × A → R is a function from the product space S × A to the set of real numbers.
This is why we call them value functions.


The Bellman Equation

The Bellman equation:

V^π(s) = E_π[ G_t | s_t = s ] = E_π[ Σ_{i=0}^∞ γ^i r_{t+1+i} | s_t = s ]
       = E_π[ r_{t+1} + γ Σ_{i=0}^∞ γ^i r_{t+2+i} | s_t = s ]
       = Σ_{a,s'} p(a, s'|s) { R(s, a, s') + γ E_π[ Σ_{i=0}^∞ γ^i r_{t+2+i} | s_{t+1} = s' ] },   where p(a, s'|s) = π(a|s) P^a_{ss'}
       = Σ_{a,s'} π(a|s) P^a_{ss'} [ R(s, a, s') + γ V^π(s') ]

The Bellman Equation: Backup Relationship

Bellman equation:

V^π(s) = Σ_{a,s'} π(a|s) P^a_{ss'} [ R(s, a, s') + γ V^π(s') ]
       = R^π(s) + γ Σ_{s'} P^π_{ss'} V^π(s'),   where P^π_{ss'} = Σ_a π(a|s) P^a_{ss'}

[Figure: backup diagram. A forward transition runs from s through a and r to s'; the Bellman equation backs the successor value V^π(s') up into V^π(s).]

Remark: The Bellman equation above is a matrix version of the Fredholm equation of the second kind: φ(t) = f(t) + λ ∫ K(t, s) φ(s) ds.

Theoretical Solution to The Bellman Equation

V^π(s) = Σ_{a,s'} π(a|s) P^a_{ss'} R(s, a, s') + γ Σ_{a,s'} π(a|s) P^a_{ss'} V^π(s')
        = R^π(s) + γ Σ_{s'} P^π_{ss'} V^π(s')

Stacking these equations over the states s_1, · · · , s_N gives, in matrix-vector form,

v^π = r + γ P v^π,   i.e.,   v^π = (I − γP)^{−1} r,

where v^π = (V^π(s_1), · · · , V^π(s_N))^T, r = (R^π(s_1), · · · , R^π(s_N))^T, and P = [P^π_{ij}].

For small MDPs, the linear system can be solved.


For large MDPs, iterative methods are applied rather than directly solving the
linear system.
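A short illustrative sketch (not from the slides) of both routes for policy evaluation, using an arbitrary 3-state chain P^π and reward vector r: the direct solve v = (I − γP)⁻¹ r and the iterative backup v ← r + γPv.

import numpy as np

gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],      # P^pi: transition matrix under a fixed policy (example values)
              [0.2, 0.3, 0.5],
              [0.0, 0.0, 1.0]])
r = np.array([1.0, 2.0, 0.0])       # R^pi(s): expected one-step reward under the policy

# Direct solution of the Bellman equation: v = (I - gamma P)^(-1) r.
v_direct = np.linalg.solve(np.eye(3) - gamma * P, r)

# Iterative policy evaluation: repeatedly apply the backup v <- r + gamma P v.
v = np.zeros(3)
for _ in range(1000):
    v = r + gamma * P @ v

print(v_direct)
print(v)            # converges to the same values as the direct solve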


The Bellman Equation for The Action-Value Function

Q^π(s, a) = E_π[ G_t | s_t = s, a_t = a ] = E_π[ Σ_{i=0}^∞ γ^i r_{t+1+i} | s_t = s, a_t = a ]
          = E_π[ r_{t+1} + γ Σ_{i=0}^∞ γ^i r_{t+2+i} | s_t = s, a_t = a ]
          = Σ_{s'} P^a_{ss'} R(s, a, s') + γ Σ_{s'} Σ_{a'} P^a_{ss'} π(a'|s') E_π[ Σ_{i=0}^∞ γ^i r_{t+2+i} | s_{t+1} = s', a_{t+1} = a' ]
          = R(s, a) + γ Σ_{s',a'} P^a_{ss'} π(a'|s') Q^π(s', a'),   where p(s', a'|s, a) = P^a_{ss'} π(a'|s')

Backup Diagram for The Bellman Equation for the Action-Value Function

Bellman equation for the action-value function:

Q^π(s, a) = R(s, a) + γ Σ_{s',a'} P^a_{ss'} π(a'|s') Q^π(s', a'),   where p(s', a'|s, a) = P^a_{ss'} π(a'|s')

[Figure: backup diagram. Starting from the pair (s, a), a reward r and next state s' are generated, then a' ∼ π; the Bellman equation backs Q^π(s', a') up into Q^π(s, a).]

Optimal Policy and Bellman Optimality Equation

Comparison of Policies

Definition (A Partially-Ordered Set)


A partially-ordered set is a set with a partial order, where a partial order is a
binary relation ≤ over the set satisfying the following three axioms:
1 (Reflexive) a ≤ a
2 (Antisymmetric) If a ≤ b and b ≤ a, then a = b.
3 (Transitive) If a ≤ b and b ≤ c, then a ≤ c.

Remark
In a partially-ordered set P, we do not require that either a ≤ b or b ≤ a hold for every pair a, b ∈ P. If neither a ≤ b nor b ≤ a holds, we say a and b are incomparable.

Definition (Total Order or Linear Order)


A linearly-ordered set is a partially-ordered set in which all pairs are comparable.


Comparison of Policies

Theorem (Partial ordering of policies)


Value functions provide a partial ordering on the set of all policies for the given finite MDP, where the partial order is defined as

V^π(s) ≤ V^{π'}(s), ∀s ∈ S  ⇒  π ≤ π'.

[Figure: value functions V^π(s) of three policies π_1, π_2, π_3 plotted over states s: π_1 ≤ π_3 and π_2 ≤ π_3, but neither π_1 ≤ π_2 nor π_2 ≤ π_1.]

Optimal Policy

Definition (Optimal Policy)


A policy π* is an optimal policy if

π* ≥ π, ∀ π ∈ Π,

equivalently,

V^{π*}(s) ≥ V^π(s), ∀ s ∈ S, ∀ π ∈ Π,

where Π is the set of all feasible policies for the given MDP.

Questions:
Does such a π* exist?
What is the value function of π* and how can we compute it?

Optimal Value Functions

Recall V^π(s):

V^π(s) = E_{{a_t}∼π}[ Σ_{i=0}^∞ γ^i R(s_{t+i}, a_{t+i}, s_{t+1+i}) | s_t = s ]
       = Σ_{trajectories} (r_0 + γ r_1 + γ² r_2 + · · · ) ρ(s_0) π(a_0|s_0) p(s_1|s_0, a_0) π(a_1|s_1) p(s_2|s_1, a_1) · · ·

Definition (Optimal State Value Function): The optimal value of state s is defined as

V*(s) = max_π V^π(s)   for each s.

Optimal Value Functions

Definition (Optimal State Value Function)

The optimal state value function V* : S → R is constructed as follows: for each s ∈ S, we assign the value of V*(s) as

V*(s) = max_π V^π(s)

[Figure: V*(s) is the state-by-state (pointwise) maximum of V^π(s) over the policies π_1, π_2, π_3, · · · .]

Definition (Optimal Action Value Function)


The optimal action value function Q* : S × A → R is defined as follows: for each (s, a) ∈ S × A, we assign the value of Q*(s, a) as

Q*(s, a) = max_π Q^π(s, a)

Optimal Value Functions

Q*(s, a) = max_π Q^π(s, a)

For each action, the maximization is over the policy that governs all subsequent actions:

Q*(s, a^1) = max_{π: {a_τ, τ ≥ t+1} ∼ π} E[ Σ_{i=0}^∞ γ^i R(s_{t+i}, a_{t+i}, s_{t+1+i}) | s_t = s, a_t = a^1 ]
Q*(s, a^2) = max_{π: {a_τ, τ ≥ t+1} ∼ π} E[ Σ_{i=0}^∞ γ^i R(s_{t+i}, a_{t+i}, s_{t+1+i}) | s_t = s, a_t = a^2 ]

[Figure: two rollout trees rooted at (s, a^1) and (s, a^2), each unrolling through rewards and successor states s', s'', s''' under the environment dynamics P and a policy.]

Relationship between Optimal Value Functions V ∗ and Q ∗

Optimal State and Action Value Functions: Relationship

V*(s) = max_{a∈A} Q*(s, a)

Q*(s, a) = E_P[ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a ],   where r_{t+1} = R(s, a, s_{t+1})

[Figure: backup diagrams. V*(s) takes the maximum of Q*(s, a) over actions a ∼ π°(s); Q*(s, a) averages R(s, a, s') + γ V*(s') over the next states s¹, s², s³ drawn from P.]

Existence of Optimal Policy

Theorem 1: The following is true:

i) There exists an optimal policy π* such that V^{π*}(s) ≥ V^π(s), ∀s, ∀π.

ii) V^{π*}(s) = V*(s).

iii) Q^{π*}(s, a) = Q*(s, a).

Proof: Existence of Optimal Policy

Preliminary Definitions:
F: the set of all functions f : S → A
π: a policy = a sequence (f_1, f_2, f_3, · · · ) of functions f_t ∈ F such that a_t = f_t(s_t)
Stationary function case: f_t = f, ∀t; write f^(N) = (f, f, · · · , f) (N times) and f^(∞) = (f, f, f, · · · )
R(f): the reward vector of size |S| with elements R_s^{f(s)}
P(f): the transition probability matrix with elements P_{s,s'}^{f(s)}
V(π): the state value vector of size |S|, i.e.,

R(f) = ( R^{f(s_1)}(s_1), · · · , R^{f(s_N)}(s_N) )^T,   P(f) = [ P_{s_i, s_j}^{f(s_i)} ],   V(π) = ( V^π(s_1), · · · , V^π(s_N) )^T

Proof: Existence of Optimal Policy


Preliminary Definitions:
Note that V(π) of π = (f_1, f_2, f_3, · · · ) can be written as

V(π) = R(f_1) + γ P(f_1) V(T π),

where the shift operator T is defined as T π = (f_2, f_3, f_4, · · · ).

L(f): the affine transform from a value vector v of size |S| to L(f)v of size |S|:

L(f)v = R(f) + γ P(f) v

Consider the concatenated policy (f, π), where only the first action is taken from f and then all the following actions are taken by π. Then, L(f)V(π) is the value vector of the concatenated policy (f, π).
Monotonicity: If v_1 ≥ v_2, i.e., v_1(s) ≥ v_2(s), ∀s ∈ S, then L(f)v_1 ≥ L(f)v_2.

Lemmas for Proof of Existence of Optimal Policy

Lemma 1 (Beating 1st step change): If π ≥ (f , π) for all f ∈ F for some π, then π is
optimal. Here, (f , π) means that the first action is taken by the function f and then all
the following actions follow π.

Proof: Due to the assumption, L(f )V (π), which is the value vector of (f , π), satisfies
L(f )V (π) ≤ V (π).
Let any policy π ′ be π ′ = (f1 , f2 , f3 , · · · ). Then, by the assumption, (fN , π) ≤ π, i.e.,
L(fN )V (π) ≤ V (π). Now, apply the monotonicity on this
L(f_{N−1}) L(f_N) V(π) ≤_{(a)} L(f_{N−1}) V(π) ≤_{(b)} V(π)

where (a) is by monotonicity and (b) is by the assumption. Repeatedly applying this
procedure, we have
L(f1 )L(f2 ) · · · L(fN )V (π) ≤ V (π).
Letting N → ∞, we have
V (π ′ ) ≤ V (π), ∀π ′ .
Hence, π is optimal.


Lemmas for Proof of Existence of Optimal Policy

Lemma 2 (Winning by 1st step change): If (f , π) > π, then f (∞) > π.

Proof: By assumption,
L(f )V (π) > V (π).
Applying the monotonicity,
L(f) L(f) V(π) >_{(a)} L(f) V(π) >_{(b)} V(π)

where (a) is by the monotonicity and (b) is by the above eq. Repeating this yields

L^N(f) V(π) > V(π) ⇒ (f^{(N)}, π) > π, ∀N.

Letting N → ∞, we have
f (∞) > π


Lemmas for Proof of Existence of Optimal Policy

Lemma 3 (Policy improvement): Consider any f ∈ F. For each s ∈ S, define G(s, f) as the set of actions a satisfying

R_s^a + γ P_s^a V(f^{(∞)}) > V_s(f^{(∞)}),

where R_s^a is a scalar, P_s^a is a row vector, and V(f^{(∞)}) is a column vector; i.e., Q^{f^{(∞)}}(s, a) > V^{f^{(∞)}}(s). If G(s, f) is empty for all s ∈ S, then f^{(∞)} is optimal. If G(s, f) is not empty for some s', construct a new policy g as follows:
For s' such that G(s', f) ≠ ∅, set g(s') = a ∈ G(s', f).
For all other s ≠ s', set g(s) = f(s).
Then, we have g^{(∞)} > f^{(∞)}.

Lemmas for Proof of Existence of Optimal Policy

Proof of Lemma 3: If G (s, f ) is empty, then for all s ∈ S

R_s^a + γ P_s^a V(f^{(∞)}) ≤ V_s(f^{(∞)}),

for all a, equivalently for all h : a = h(s). That is, (h, f (∞) ) ≤ f (∞) . Then, by Lemma
1, f (∞) is optimal.

Otherwise, for the constructed new policy g we have

(g , f (∞) ) > f (∞) .

By Lemma 2,
g (∞) > f (∞) .


A Policy Based on Optimal Action Value Function

Suppose that we have computed the optimal action value function Q*(s, a) over S × A. (Q*(s, a) exists for finite MDP cases, as we will see later.) Then, construct a policy π° : S → A as follows:

π°(a|s) = 1 if a = arg max_{α∈A} Q*(s, α), and 0 otherwise.

In pseudocode:

Initialization:
S = {s(1), s(2), · · · , s(|S|)}, A = {a(1), a(2), · · · , a(|A|)}
π°(a|s) = 0, ∀ (s, a) ∈ S × A
for i = 1 : |S|
    s = s(i)
    for j = 1 : |A|
        a = a(j)
        if a = arg max_{α∈A} Q*(s, α)
            π°(a|s) = 1
        end if
    end for
end for
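The same construction in a short numpy sketch (illustrative only; the Q* table holds arbitrary example values):

import numpy as np

Q_star = np.array([[1.0, 2.5, 0.3],    # Q*(s, a) for |S| = 2 states, |A| = 3 actions (example values)
                   [0.7, 0.1, 0.9]])

# Greedy (deterministic) policy: pi(a|s) = 1 for a = argmax_a Q*(s, a), else 0.
pi = np.zeros_like(Q_star)
pi[np.arange(Q_star.shape[0]), Q_star.argmax(axis=1)] = 1.0
print(pi)            # [[0, 1, 0], [0, 0, 1]]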


Proof of Theorem 1
Recall Theorem 1: The following is true:
i) There exists an optimal policy π* such that V^{π*}(s) ≥ V^π(s), ∀s, ∀π.
ii) V^{π*}(s) = V*(s).
iii) Q^{π*}(s, a) = Q*(s, a).

Proof) ii) By the definition of an optimal policy π*, we have

V*(s) := max_π V^π(s) ≤ V^{π*}(s).

i) Existence: π° is an optimal policy π*.
Uniqueness (from the perspective of the value function): Let π* be optimal for each s, i.e., π* ≥ π, ∀π. Then π* ≥ π°. However, for all s ∈ S,

V_s(π°, π*) = R_s^{π°(s)} + γ P_s^{π°(s)} V(π*) = R_s^{π°(s)} + γ P_s^{π°(s)} V* = Q*(s, π°(s))
            = max_{a∈A} [ R_s^a + γ P_s^a V* ] = max_{a∈A} [ R_s^a + γ P_s^a V(π*) ]
            ≥ R_s^{π*(s)} + γ P_s^{π*(s)} V(π*) = V_s(π*).

That is, (π°, π*) ≥ π*. By Lemma 2, π° ≥ π*. Hence, π° = π* in the sense that they share the same value function.

Proof of Theorem 1

Proof continued)

iii) Suppose that Q^{π*}(s, a) < max_π Q^π(s, a) = Q*(s, a). Then,

R_s^a + γ Σ_{s'} P^a_{s,s'} V^{π*}(s') = Q^{π*}(s, a) < max_π Q^π(s, a) = R_s^a + γ Σ_{s'} P^a_{s,s'} max_π V^π(s').

This contradicts the optimality of π*. Hence, Q^{π*}(s, a) = Q*(s, a).

Existence of Optimal Value Functions V ∗ and Q ∗

Optimal State and Action Value Functions: Relationship

V*(s) = max_{a∈A} Q*(s, a)

Q*(s, a) = E_P[ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a ],   where r_{t+1} = R(s, a, s_{t+1})

[Figure: the same backup diagrams as before, relating V* and Q*.]

Bellman Equation for Optimal Value Functions


Bellman Optimality Equation for the State Value Function:

V*(s) = max_{a∈A} Q*(s, a)
      = max_{a∈A} E[ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a ]
      = max_{a∈A} Σ_{s'} p(s'|s, a) [ R(s, a, s') + γ V*(s') ]
      = max_{a∈A} { R(s, a) + γ Σ_{s'} p(s'|s, a) V*(s') }

Bellman Optimality Equation for the Action Value Function:

Q*(s, a) = E_P[ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a ]
         = Σ_{s'} p(s'|s, a) [ R(s, a, s') + γ max_{a'∈A} Q*(s', a') ]
         = R(s, a) + γ Σ_{s'} p(s'|s, a) max_{a'∈A} Q*(s', a')

Remark: The Bellman equation for optimal value functions is called the Bellman optimality equation.

Bellman Optimality Equation


Bellman optimality equation (BOE):

V*(s) = max_{a∈A} { R(s, a) + γ Σ_{s'} p(s'|s, a) V*(s') }

For an MDP with S = {s^1, · · · , s^N} and A = {a^1, · · · , a^M}, writing the BOE out state by state gives

V*(s^1) = max{ R(s^1, a^1) + γ[ p(s^1|s^1, a^1)V*(s^1) + p(s^2|s^1, a^1)V*(s^2) + · · · + p(s^N|s^1, a^1)V*(s^N) ],
               · · · ,
               R(s^1, a^M) + γ[ p(s^1|s^1, a^M)V*(s^1) + p(s^2|s^1, a^M)V*(s^2) + · · · + p(s^N|s^1, a^M)V*(s^N) ] }
   ⋮
V*(s^N) = max{ R(s^N, a^1) + γ[ p(s^1|s^N, a^1)V*(s^1) + p(s^2|s^N, a^1)V*(s^2) + · · · + p(s^N|s^N, a^1)V*(s^N) ],
               · · · ,
               R(s^N, a^M) + γ[ p(s^1|s^N, a^M)V*(s^1) + p(s^2|s^N, a^M)V*(s^2) + · · · + p(s^N|s^N, a^M)V*(s^N) ] }

This is a system of nonlinear equations; the max operations range over |A|^{|S|} combinations in total, so it is difficult to solve directly.

Bellman Optimality Equation: An Example

[Figure: the recycling-robot state-transition diagram.]

V*(H) = max_{a = s, w, r} { R(H, a) + γ[ p(H|H, a)V*(H) + p(L|H, a)V*(L) ] }
V*(L) = max_{a = s, w, r} { R(L, a) + γ[ p(H|L, a)V*(H) + p(L|L, a)V*(L) ] }

Bellman Optimality Equation

Bellman optimality equation (BOE)


V*(s) = max_{a∈A} { R(s, a) + γ Σ_{s'} p(s'|s, a) V*(s') }     (4)

Theorem: The Bellman optimality equation has a unique solution.

Remarks:
The Bellman optimality equation is a system of nonlinear equations.
So, it is difficult to obtain closed-form solutions in general.
Thus, we approach the MDP problem based on iterative methods
such as generalized policy iteration.
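One such iterative scheme simply applies the max-backup of (4) repeatedly, i.e., iterates the Bellman operator introduced below (usually called value iteration). A small sketch (illustrative only; the transition tensor and rewards are made-up example numbers):

import numpy as np

# Bellman-operator iteration on a tiny MDP: P[a, s, s'] = p(s'|s, a), R[s, a] = expected one-step reward.
gamma = 0.9
P = np.array([[[0.8, 0.2],      # action 0
               [0.1, 0.9]],
              [[0.5, 0.5],      # action 1
               [0.6, 0.4]]])
R = np.array([[1.0, 0.0],       # R[s, a]
              [2.0, 0.5]])

V = np.zeros(2)
for _ in range(500):
    # Q[s, a] = R[s, a] + gamma * sum_{s'} p(s'|s, a) * V[s']
    Q = R + gamma * np.einsum("asp,p->sa", P, V)
    V = Q.max(axis=1)           # V*(s) = max_a Q(s, a)

pi_star = Q.argmax(axis=1)      # greedy policy w.r.t. the converged Q
print(V, pi_star)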


Existence and Uniqueness of Solution to The Bellman Optimality Equation

Proof of the existence and uniqueness of solution to the BOE is based on


the Banach fixed point theorem.
The Banach fixed point theorem plays a crucial role in value-based methods,
together with the policy improvement theorem that we will learn later.


The Banach Fixed Point Theorem

Definition (A Contraction Mapping)


Let (X , d) be a complete metric space. A mapping T : X → X is a
contraction mapping (or simply contraction) if there exists some constant
γ ∈ [0, 1) such that

d(T (x1 ), T (x2 )) ≤ γd(x1 , x2 ), ∀x1 , x2 ∈ X .

[Figure: the mapping T sends x_1, x_2 ∈ X to T(x_1), T(x_2) ∈ X, shrinking the distance between them.]

The Banach Fixed Point Theorem

Theorem (The Banach Fixed Point Theorem)


Let (X , d) be a complete metric space and T : X → X be a contraction
mapping on X . Then, the mapping T has a unique fixed point x ∗ ∈ X ,
i.e., x* = T(x*), such that

lim_{n→∞} T^n(x) = x*,

where T^n(x) = (T ∘ T ∘ · · · ∘ T)(x)   (n times).

Pf) See Appendix.


Existence and Uniqueness of Solution to The Bellman Optimality Equation

Theorem (Existence and Uniqueness of Solution to BOE)


A solution to the Bellman optimality equation (4) exists and is unique.
Pf) For a finite MDP, we set a metric space (V, d)

V = {V(s), s ∈ S} = R^{|S|},
d(V_1, V_2) = ||V_1 − V_2||_∞ = max_{s∈S} |V_1(s) − V_2(s)|.

Then, this (V, d) is a complete metric space since the set of real numbers is complete w.r.t. the L_∞-norm. Now, define a mapping T* : V → V as

T* : V(s) → T*(V(s)) = max_{a∈A} { R(s, a) + γ Σ_{s'} p(s'|s, a) V(s') }.

This mapping is called the Bellman operator.


Proof Continued
Now consider any two value functions V_1(s) and V_2(s). Then, we have

||T*V_1 − T*V_2||_∞
 = || max_a { R(s, a) + γ Σ_{s'} p(s'|s, a)V_1(s') } − max_{ã} { R(s, ã) + γ Σ_{s'} p(s'|s, ã)V_2(s') } ||_∞
 ≤ || max_a { R(s, a) + γ Σ_{s'} p(s'|s, a)V_1(s') − R(s, a) − γ Σ_{s'} p(s'|s, a)V_2(s') } ||_∞
 ≤ || max_a { γ Σ_{s'} p(s'|s, a)[ V_1(s') − V_2(s') ] } ||_∞
 ≤ γ max_a || Σ_{s'} p(s'|s, a)[ V_1(s') − V_2(s') ] ||_∞        (∵ ||max(a, b)|| ≤ max(||a||, ||b||))
 ≤ γ || V_1 − V_2 ||_∞ max_a Σ_{s'} p(s'|s, a)                   (by the definition of || · ||_∞)
 ≤ γ || V_1 − V_2 ||_∞                                           (∵ Σ_{s'} p(s'|s, a) = 1, ∀s, a)

Proof Continued

Hence, the Bellman operator T ∗ is a contraction mapping. By the Banach fixed point
theorem, there exists a unique fixed point V ∗ (s) such that
V*(s) = T*(V*(s)) = max_a { R(s, a) + γ Σ_{s'} p(s'|s, a) V*(s') }

but this is nothing but the Bellman optimality equation. Hence, we have the claim.

Furthermore, by the Banach fixed point theorem, V ∗ (s) can be obtained by iteratively
applying the Bellman operator to any initial V (0) (s), i.e.,

V*(s) = lim_{n→∞} T*^n (V^(0)(s)).

This is the fundamental background of many iterative methods to solve MDPs.





References

Textbook: Sutton and Barto, Reinforcement Learning: An Introduction, The MIT Press, Cambridge, MA, 2018.
Dr. David Silver's course material.

Appendix

1 Cauchy sequence and completeness


2 Contraction mapping
3 The Banach fixed point theorem


Motivation for Completeness of a Set

Consider the following sequence of rationals x_1, x_2, x_3, · · · :

x_{n+1} = (1/2)(x_n + 2/x_n),   x_1 = 1.     (5)

Then x_2 = 1.5, x_3 = 1.4167, x_4 = 1.4142, · · · . Eventually the sequence converges to √2.
√2 is not a rational, but it can be approached by rationals arbitrarily closely!
In fact, √2 is the limit of the sequence x_n defined above. The limit of a sequence of rationals need not be a rational!
We say that the set of rationals is not complete.
What if we define a bigger number system that includes the rational numbers and all limits of converging sequences of rationals?
Let us call this number system the reals, R.
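A tiny numeric sketch of iteration (5) (illustrative only), which also previews the fixed-point iterations used later for the Bellman operator:

# Iterate x_{n+1} = (x_n + 2/x_n) / 2 starting from x_1 = 1; the fixed point is sqrt(2).
x = 1.0
for n in range(6):
    x = 0.5 * (x + 2.0 / x)
    print(n + 2, x)        # prints x_2 = 1.5, x_3 = 1.41667, x_4 = 1.41421, ...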


Cauchy Sequences

Goal
We want to define the convergence of a sequence of numbers in some
number system without knowing whether the limit is contained in that
number system or not

Definition (Cauchy Sequence)


A sequence x_n of numbers is called a Cauchy sequence if it satisfies

∀ ε > 0, ∃ N such that (s.t.) ∀ i, j ≥ N, |x_i − x_j| ≤ ε

Cauchy Sequences

Definition (Cauchy Sequence)


A sequence x_n of numbers is called a Cauchy sequence if it satisfies

∀ ε > 0, ∃ N such that (s.t.) ∀ i, j ≥ N, |x_i − x_j| ≤ ε

[Figure: the terms x_n plotted against n; beyond the index N, any two terms x_i and x_j with i, j ≥ N lie within ε of each other.]

Cauchy Sequences and Completeness

Definition (Completeness)
A set X is complete if every Cauchy sequence in X has a limit in the set X.

Definition (Limit)
A number x is called the limit of sequence xn if

∀ ε > 0, ∃ N such that ∀ i ≥ N, |x_i − x| ≤ ε

Remark
It is desirable that any element that can be approached arbitrarily closely
by elements in a set should be in the same set.

Remark
We can generalize the definitions to any metric space (X , d) with metric d.

Contraction
Definition (A Contraction Mapping)
Let (X , d) be a complete metric space. A mapping T : X → X is a contraction
mapping (or simply contraction) if there exists some constant γ ∈ [0, 1) such that

d(T (x1 ), T (x2 )) ≤ γd(x1 , x2 ), ∀x1 , x2 ∈ X .

[Figure: the mapping T sends x_1, x_2 ∈ X to T(x_1), T(x_2) ∈ X, shrinking the distance between them.]

Remark
Note that any contraction mapping is a continuous mapping by the definition of
continuity.

Contraction: An Example

[Figure: graph of y = Tx together with the line y = x; starting from x_0, the iterates x_1 = Tx_0, x_2 = Tx_1, · · · move toward the intersection point, i.e., the fixed point of T.]

Fixed Point

Definition (A fixed point)


Let T be a transformation from X to X . A point x in X is called a fixed
point of T if
x = Tx.


The Banach Fixed Point Theorem

Theorem (The Banach Fixed Point Theorem)


Let (X , d) be a complete metric space and T : X → X be a contraction
mapping on X . Then, the mapping T has a unique fixed point x ∗ ∈ X ,
i.e., x ∗ = T (x ∗ ). Furthermore, for any x0 , we have

lim_{n→∞} T^n(x_0) = x*,

where T^n(x) = (T ∘ T ∘ · · · ∘ T)(x)   (n times).

Proof

For any given x_0 ∈ X, let x_n = T^n(x_0) and c = d(x_0, x_1). Then, we have

d(xn , xn+1 ) = d(T n x0 , T n x1 ) ≤ γd(T n−1 x0 , T n−1 x1 ) ≤ · · · ≤ γ n d(x0 , x1 ) = cγ n

for every n. Then, for any n and m such that n < m, we have

d(x_n, x_m) ≤ d(x_n, x_{n+1}) + d(x_{n+1}, x_{n+2}) + · · · + d(x_{m−1}, x_m)
           ≤ cγ^n + cγ^{n+1} + cγ^{n+2} + · · · + cγ^{m−1}
           ≤ cγ^n (1 + γ + γ² + · · · )
           ≤ cγ^n / (1 − γ).
Because γ < 1, limn→∞ d(xn , xm ) = 0. So, {xn } is a Cauchy sequence. Since X is
complete, the limit of {xn } is inside X . Let this limit be denoted by x ∗ , i.e.,
limn→∞ xn = x ∗ . By the continuity of the mapping T , we have

T(x*) = T( lim_{n→∞} x_n ) = lim_{n→∞} T(x_n) = lim_{n→∞} x_{n+1} = x*.


Hence, x* is a fixed point of T.


Proof

Now, we show that the fixed point is unique. Let x and y be fixed points of T , i.e.,
x = Tx and y = Ty . Then,

d(x, y) = d(Tx, Ty) ≤ γ d(x, y).

Since γ < 1 and d(x, y) ≥ 0, the only possibility is d(x, y) = 0. That is, x = y. This concludes the proof.