AI512/EE633: Reinforcement Learning: Lecture 2 - Markov Decision Process
Seungyul Han
UNIST
[email protected]
Spring 2024
Table of Contents
1 Setup
2 Markov Chains
Sufficient Statistic
The state variable s_t is sufficient to describe the status of the environment at time step t, containing all information relevant for making decisions or inference.
Markovianness
The sequence of states over time {s_t, t = 0, 1, 2, · · · } is a Markov process or Markov chain. That is,
$$\Pr[s_{t+1} \mid s_t, s_{t-1}, \cdots, s_0] = \Pr[s_{t+1} \mid s_t].$$
Sufficient Statistic
Remark
Sufficiency is with respect to the target optimal inference or decision.
Example 1:
[Figure: an RLC circuit with input voltage v_in(t) = x(t), resistor R, inductor l carrying current i, capacitor C, and output voltage v_out(t) = y(t).]
In this 2nd-order circuit system, we can set the state variables as the capacitor voltage and the inductor current.
Example 2: X_i ∼ N(θ, 1), i.i.d. We observe data X_1, · · · , X_n and want to estimate the mean parameter θ. Suppose that we consider the maximum likelihood estimator (MLE) for θ as the optimal inference. Then, Σ_{i=1}^n X_i is a sufficient statistic:
$$p(x_1, \cdots, x_n; \theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x_i-\theta)^2}
= \frac{1}{(2\pi)^{n/2}}\, e^{-\frac{1}{2}\left(\sum_i x_i^2 \,-\, 2\theta\sum_i x_i \,+\, n\theta^2\right)},$$
which depends on θ and the data jointly only through the statistic Σ_i x_i.
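As a quick numerical check (a minimal sketch; the sample and variable names are illustrative only, not from the lecture), the MLE computed from the raw data and the MLE computed from the sufficient statistic Σ_i X_i alone coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 1.5
x = rng.normal(theta_true, 1.0, size=1000)   # X_i ~ N(theta, 1), i.i.d.

# MLE from the raw data: the sample mean.
theta_mle_raw = x.mean()

# MLE from the sufficient statistic sum(x) alone (n is known).
t = x.sum()
theta_mle_suff = t / len(x)

print(theta_mle_raw, theta_mle_suff)   # the two agree: the MLE depends on the data only through sum(x)
```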
Remark
A Markov state contains all relevant information from the entire history for system
evolution.
The history in the past and the evolution in the future are independent given s_t, i.e.,
$$P(\tau_{1:t}, \tau_{t+1:\infty} \mid s_t) = P(\tau_{1:t} \mid s_t)\, P(\tau_{t+1:\infty} \mid s_t),$$
where τ_{1:t} ≜ (s_1, s_2, · · · , s_t).
A Markov process is also called a Markov chain (MC).
[Figure: a Markov chain S_0 → S_1 → S_2 → S_3 → S_4.]
Remark
For a finite MC, the collection of state transition probabilities P_{ss'} = Pr[s_{t+1} = s' | s_t = s] can be arranged into a state transition matrix P = [P_{ss'}], which fully describes the chain.
[Figure and table: an example finite MC over the states {I, L, C, A, F, H}, shown as a state transition diagram together with its transition probability matrix (each row lists the transition probabilities out of that state and sums to 1).]
[Figure: a two-state MC with states H and L; H stays at H with probability 1 − α and moves to L with probability α, while L stays at L with probability 1 − β and moves to H with probability β.]
τ = H, H, L, H, L, H, · · ·
If the agent travels on this state transition diagram, what fraction of time does it spend in each state on average?
We assign a probability distribution on the set of states, i.e., the state space with cardinality N, at time t:
$$p = [p_0, p_1, \cdots, p_N]^T, \quad p_s = \Pr[s_t = s].$$
After one transition, the new distribution on the state space is given by
$$p'_{s'} = \sum_{s \in S} p_s P_{ss'}.$$
In matrix form, we have
$$\underbrace{\begin{bmatrix} p'_0 \\ p'_1 \\ \vdots \\ p'_N \end{bmatrix}}_{\text{new}}
= \underbrace{\begin{bmatrix} P_{00} & P_{10} & \cdots & P_{N0} \\ P_{01} & P_{11} & \cdots & P_{N1} \\ \vdots & \vdots & \ddots & \vdots \\ P_{0N} & P_{1N} & \cdots & P_{NN} \end{bmatrix}}_{=\,P^T\ (=[P_{ss'}]^T)}
\underbrace{\begin{bmatrix} p_0 \\ p_1 \\ \vdots \\ p_N \end{bmatrix}}_{\text{old}},
\quad\text{i.e.,}\quad p' = P^T p.$$
A stationary distribution is one that is unchanged by the transition, i.e., p = P^T p.
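A minimal sketch of this update for the earlier two-state chain with states (H, L), assuming the diagram labels mean P(H→L) = α and P(L→H) = β (the numeric values below are made up): iterating p ← P^T p converges to the stationary distribution p = P^T p, which answers the time-fraction question.

```python
import numpy as np

alpha, beta = 0.3, 0.1               # assumed transition probabilities H->L and L->H
P = np.array([[1 - alpha, alpha],    # row = current state (H, L), column = next state
              [beta, 1 - beta]])

p = np.array([1.0, 0.0])             # start deterministically in H
for _ in range(1000):                # p' = P^T p, iterated
    p = P.T @ p

print(p)                                         # limiting (stationary) distribution
print(np.array([beta, alpha]) / (alpha + beta))  # closed form for a two-state chain: (beta, alpha)/(alpha+beta)
```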
[Figure: the example state transition diagram from before (states C, A, L, F, · · · ).]
With an additional uniform restart (with probability 1 − α the chain jumps to a uniformly chosen state, and with probability α it follows P), the stationary distribution satisfies
$$\begin{bmatrix} p_0 \\ p_1 \\ \vdots \\ p_N \end{bmatrix}
= \frac{1-\alpha}{1+N} \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}
+ \alpha \begin{bmatrix} P_{00} & P_{10} & \cdots & P_{N0} \\ P_{01} & P_{11} & \cdots & P_{N1} \\ \vdots & \vdots & \ddots & \vdots \\ P_{0N} & P_{1N} & \cdots & P_{NN} \end{bmatrix}
\begin{bmatrix} p_0 \\ p_1 \\ \vdots \\ p_N \end{bmatrix}.$$
From an MC to an MDP
Definition (Policy)
We call this mapping from st ∈ S to at ∈ A a policy π.
π : S → A : s_t ↦ a_t.
Remark
For generality, we consider probabilistic functions in (1).
In the case of a probabilistic function with finite state and action
spaces, the function (1) is fully described by the set of probabilities:
$$P^a_{ss'} = \Pr[s_{t+1} = s' \mid s_t = s, a_t = a]$$
Transitions from (s_t, a_t):
(s_t, a^1) ⇒ s^1 with probability P^{a^1}_{s s^1}
(s_t, a^1) ⇒ s^2 with probability P^{a^1}_{s s^2}
(s_t, a^2) ⇒ s^2 with probability P^{a^2}_{s s^2}
(s_t, a^2) ⇒ s^3 with probability P^{a^2}_{s s^3}
[Figure: tree diagram from state s_t through actions a^1, a^2 to next states s^1, s^2, s^3, with branch labels P^{a^1}_{s s^1}, P^{a^1}_{s s^2}, P^{a^2}_{s s^2}, P^{a^2}_{s s^3}.]
[P^a_{ss'}] is a rank-3 tensor, indexed by (state, action, next state).
Reward
Reward Function:
$$R^a_{ss'} = R(s, a, s')$$
Return
Definition (Return)
The return G_t at time t is the sum of discounted rewards, i.e.,
$$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{i=0}^{\infty} \gamma^i r_{t+1+i}.$$
Discount Factor γ:
The discount factor stabilizes the problem and makes it mathematically simple. Consider the infinite-horizon case: even if r_t is bounded for all t, the undiscounted sum Σ_{k=t}^∞ r_k can grow without bound, i.e., go to infinity.
The discount factor γ < 1 guarantees the existence of an optimal solution.
The discount factor prioritizes the rewards in the near future.
γ = 0: greedy case. Gt = rt+1 , γ = 1: undiscounted sum
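A small numeric sketch of the return for an illustrative reward sequence (the rewards and γ below are made up):

```python
# Discounted return G_t = sum_i gamma^i * r_{t+1+i}, for an illustrative reward sequence.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0, 3.0]          # r_{t+1}, r_{t+2}, ...

G_t = sum(gamma**i * r for i, r in enumerate(rewards))
print(G_t)   # 1 + 0.9*0 + 0.81*2 + 0.729*1 + 0.6561*3 = 5.3173
```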
Markov Decision Processes
Definition (A Task)
The environment, the agent, and the dynamics model, together with the state space S and the action space A, define a specific instance of the reinforcement learning problem; this instance is called a task.
Remark
Basically, we consider an MDP task.
s | a        | s'           | P^a_{ss'} | R^a_{ss'}
H | search   | H            | α         | r_search
H | search   | L            | 1 − α     | r_search
L | search   | L            | β         | r_search
L | search   | H (rescued)  | 1 − β     | −C
H | wait     | H            | 1         | r_wait
H | wait     | L            | 0         | r_wait
L | wait     | H            | 0         | r_wait
L | wait     | L            | 1         | r_wait
L | recharge | H            | 1         | 0
L | recharge | L            | 0         | 0
Table: An MDP setup for the recycling robot (from the textbook by Sutton and Barto), with r_search > r_wait.
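The table translates directly into a data structure; a minimal sketch in Python, where the numeric values chosen for α, β, r_search, r_wait, and C are placeholders for illustration and the zero-probability rows are omitted:

```python
# Recycling-robot MDP from the table: (s, a) -> list of (s', P^a_{ss'}, R^a_{ss'}).
alpha, beta = 0.8, 0.6               # assumed values, not from the lecture
r_search, r_wait, C = 2.0, 1.0, 3.0  # assumed values (r_search > r_wait)

mdp = {
    ('H', 'search'):   [('H', alpha, r_search), ('L', 1 - alpha, r_search)],
    ('L', 'search'):   [('L', beta, r_search), ('H', 1 - beta, -C)],   # battery depleted, robot rescued
    ('H', 'wait'):     [('H', 1.0, r_wait)],
    ('L', 'wait'):     [('L', 1.0, r_wait)],
    ('L', 'recharge'): [('H', 1.0, 0.0)],
}

# Sanity check: transition probabilities sum to 1 for every (s, a).
assert all(abs(sum(p for _, p, _ in outs) - 1.0) < 1e-12 for outs in mdp.values())
```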
Markov Decision Processes
[Figure: state transition diagram of the recycling robot MDP. Nodes H (high battery) and L (low battery); edges labeled with (probability, reward): from H, search → (α, r_search) to H and (1 − α, r_search) to L, wait → (1, r_wait); from L, search → (β, r_search) to L and (1 − β, −C) to H (battery depleted, robot rescued), wait → (1, r_wait), recharge → (1, 0) to H.]
Value Functions
Questions:
How much expected return can the agent receive if it is in a state s and follows policy π?
How much expected return can the agent receive if it takes action a in a state s and then follows policy π?
[Figure: the recycling robot transition graph again, with edges labeled (probability, reward), e.g., (α, r_search), (1 − α, r_search), (1, r_wait), (β, r_search), (1 − β, −C), and (1, 0) for recharge.]
Value Functions
Definition (State-Value Function)
The value of a state s under a policy π is the expected return when the
agent starts from s and follows π, i.e.,
$$V^\pi(s) = E_\pi[G_t \mid s_t = s] = E_\pi\!\left[\sum_{i=0}^{\infty} \gamma^i r_{t+1+i} \,\middle|\, s_t = s\right] \qquad (2)$$
Note that the value of an action is not the immediate reward but the expected return following the action.
Value Functions and Bellman Equations
Value Functions
[Figure: backup diagrams for V^π(s) and Q^π(s, a): from state s, action a ∼ π, reward r, next state s'.]
Value Functions
Remark:
Bellman equation
X
V π (s) = π(a|s)Pssa ′ R(s, a, s ′ ) + γV π (s ′ )
a,s ′
X
= R π (s) + γ Pssπ′ V π (s ′ )
|{z}
s′ P a
a π(a|s)Pss ′
s V π (s)
X X
V π (s) = π(a|s)Pssa ′ R(s, a, s ′ ) + γ π(a|s)Pssa ′ V π (s ′ )
a,s ′ a,s ′
X
= R π (s) + γ Pssπ′ V π (s ′ )
s′
In matrix form,
$$\begin{bmatrix} V^\pi(s_1) \\ \vdots \\ V^\pi(s_N) \end{bmatrix}
= \begin{bmatrix} R^\pi(s_1) \\ \vdots \\ R^\pi(s_N) \end{bmatrix}
+ \gamma \begin{bmatrix} P^\pi_{11} & \cdots & P^\pi_{1N} \\ \vdots & \ddots & \vdots \\ P^\pi_{N1} & \cdots & P^\pi_{NN} \end{bmatrix}
\begin{bmatrix} V^\pi(s_1) \\ \vdots \\ V^\pi(s_N) \end{bmatrix},$$
i.e.,
$$v^\pi = r^\pi + \gamma P^\pi v^\pi \quad\Longrightarrow\quad v^\pi = (I - \gamma P^\pi)^{-1} r^\pi.$$
$$\begin{aligned}
Q^\pi(s, a) &= E_\pi[G_t \mid s_t = s, a_t = a]
= E_\pi\!\left[\sum_{i=0}^{\infty} \gamma^i r_{t+1+i} \,\middle|\, s_t = s, a_t = a\right] \\
&= E_\pi\!\left[ r_{t+1} + \gamma \sum_{i=0}^{\infty} \gamma^i r_{t+2+i} \,\middle|\, s_t = s, a_t = a\right] \\
&= \sum_{s'} P^a_{ss'} R(s, a, s') + \gamma \sum_{s'} P^a_{ss'} \sum_{a'} \pi(a'|s')\, \underbrace{E_\pi\!\left[\sum_{i=0}^{\infty} \gamma^i r_{t+2+i} \,\middle|\, s_{t+1} = s', a_{t+1} = a'\right]}_{=\,Q^\pi(s', a')} \\
&= R(s, a) + \gamma \sum_{s', a'} \underbrace{P^a_{ss'}\, \pi(a'|s')}_{=\,p(s', a'\,|\,s, a)} Q^\pi(s', a'),
\end{aligned}$$
where R(s, a) ≜ Σ_{s'} P^a_{ss'} R(s, a, s').
[Figure: backup diagram for Q^π(s, a): from (s, a), reward r, next state s', next action a' ∼ π, leading to Q^π(s', a').]
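Putting the last two results together, here is a minimal sketch that evaluates a made-up 3-state, 2-action MDP under a fixed policy by solving v^π = (I − γP^π)^{-1} r^π and then reading off Q^π(s, a) = Σ_{s'} P^a_{ss'}[R(s, a, s') + γ V^π(s')]; all numbers and the array layout P[a, s, s'], R[a, s, s'] are illustrative assumptions, not the lecture's notation.

```python
import numpy as np

gamma = 0.9
# Made-up 3-state, 2-action MDP: P[a, s, s1] = transition probs, R[a, s, s1] = rewards.
P = np.array([[[0.5, 0.5, 0.0], [0.1, 0.7, 0.2], [0.0, 0.3, 0.7]],
              [[0.9, 0.1, 0.0], [0.0, 0.5, 0.5], [0.2, 0.0, 0.8]]])
R = np.ones_like(P)                                   # illustrative: reward 1 on every transition
pi = np.array([[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]])   # pi[s, a]: a fixed stochastic policy

# Induced chain: P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'},  R^pi(s) = sum_{a,s'} pi(a|s) P^a_{ss'} R(s,a,s').
P_pi = np.einsum('sa,asx->sx', pi, P)
r_pi = np.einsum('sa,asx,asx->s', pi, P, R)

# Policy evaluation in closed form: v^pi = (I - gamma P^pi)^{-1} r^pi.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Action values from v^pi: Q^pi(s, a) = sum_{s'} P^a_{ss'} [R(s,a,s') + gamma v^pi(s')].
Q_pi = np.sum(P * (R + gamma * v_pi), axis=2).T       # shape (|S|, |A|)

print(v_pi)
print(Q_pi)
```

Since every reward here is 1, all entries come out to 1/(1 − γ) = 10, which is a quick sanity check of the computation.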
Comparison of Policies
Remark
In a partially ordered set P, we do not require either a ≤ b or b ≤ a for all pairs a, b ∈ P. If we have neither a ≤ b nor b ≤ a, we say a and b are incomparable.
Comparison of Policies
Policies are ordered by their value functions: π ≤ π' iff V^π(s) ≤ V^{π'}(s) for all s ∈ S.
[Figure: V^π(s) plotted over s for three policies: π_1 ≤ π_3 and π_2 ≤ π_3, but neither π_1 ≤ π_2 nor π_2 ≤ π_1, since the value curves of π_1 and π_2 cross.]
Optimal Policy
A policy π^* is optimal if
$$\pi^* \ge \pi, \quad \forall \pi \in \Pi,$$
equivalently,
$$V^{\pi^*}(s) \ge V^\pi(s), \quad \forall s \in S,\ \forall \pi \in \Pi,$$
where Π is the set of all feasible policies for the given MDP.
Questions:
Does such a π^* exist?
What is the value function of π ∗ and how can we compute it?
Recall V^π(s):
$$V^\pi(s) = E_{\{a_t\}\sim\pi}\!\left[\sum_{i=0}^{\infty} \gamma^i R(s_{t+i}, a_{t+i}, s_{t+1+i}) \,\middle|\, s_t = s\right]
= \sum_{\tau=(s_0, a_0, s_1, a_1, \cdots)} (r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots)\,\rho(s_0)\,\pi(a_0|s_0)\,p(s_1|s_0, a_0)\,\pi(a_1|s_1)\,p(s_2|s_1, a_1)\cdots$$
Definition (Optimal State Value Function)
The optimal value of state s is defined as
$$V^*(s) \triangleq \max_{\pi} V^\pi(s) \quad \text{for each } s.$$
[Figure: V^π(s) curves for several policies and their pointwise maximum V^*(s).]
Definition (Optimal Action Value Function)
$$Q^*(s, a) \triangleq \max_{\pi} Q^\pi(s, a).$$
$$Q^*(s, a^1) = \max_{\pi:\{a_\tau,\,\tau\ge t+1\}\sim\pi} E\!\left[\sum_{i=0}^{\infty} \gamma^i R(s_{t+i}, a_{t+i}, s_{t+1+i}) \,\middle|\, s_t = s,\ a_t = a^1\right]$$
$$Q^*(s, a^2) = \max_{\pi:\{a_\tau,\,\tau\ge t+1\}\sim\pi} E\!\left[\sum_{i=0}^{\infty} \gamma^i R(s_{t+i}, a_{t+i}, s_{t+1+i}) \,\middle|\, s_t = s,\ a_t = a^2\right]$$
[Figure: trajectory trees rooted at (s, a^1) and (s, a^2), branching over rewards r, next states s', actions a', and so on.]
[Figure: backup diagram relating V^*(s), the action values Q^*(s, a^1), Q^*(s, a^2), Q^*(s, a^3) for a ∼ π^o(s), and the successor values V^*(s^1), V^*(s^2), V^*(s^3).]
Preliminary Definitions:
F: the set of all functions f : S → A.
π: a policy = a sequence (f_1, f_2, f_3, · · · ) of functions f_t ∈ F, s.t. a_t = f_t(s_t).
Stationary function case: f_t = f, ∀t. We write f^{(N)} = (f, f, · · · , f) (N times) and f^{(∞)} = (f, f, f, · · · ).
R(f): the reward vector of size |S| with elements R_s^{f(s)}.
P(f): the transition probability matrix with elements P_{s,s'}^{f(s)}.
V(π): the state value vector of size |S|.
$$R(f) = \begin{bmatrix} R_{s_1}^{f(s_1)} \\ \vdots \\ R_{s_N}^{f(s_N)} \end{bmatrix}, \qquad
P(f) = \begin{bmatrix} P_{s_1 s_1}^{f(s_1)} & \cdots & P_{s_1 s_N}^{f(s_1)} \\ \vdots & \ddots & \vdots \\ P_{s_N s_1}^{f(s_N)} & \cdots & P_{s_N s_N}^{f(s_N)} \end{bmatrix}, \qquad
V(\pi) = \begin{bmatrix} V^{\pi}(s_1) \\ \vdots \\ V^{\pi}(s_N) \end{bmatrix}.$$
Define the operator L(f) on value vectors by L(f)v ≜ R(f) + γ P(f) v. Consider the concatenated policy (f, π), where only the first action is taken from f and then all the following actions are taken by π. Then, L(f)V(π) is the value vector of the concatenated policy (f, π).
Monotonicity: If v1 ≥ v2 , i.e., v1 (s) ≥ v2 (s), ∀s ∈ S, then L(f )v1 ≥ L(f )v2 .
Lemma 1 (Beating 1st-step change): If a policy π satisfies π ≥ (f, π) for all f ∈ F, then π is optimal. Here, (f, π) means that the first action is taken by the function f and then all the following actions follow π.
Proof: Due to the assumption, L(f )V (π), which is the value vector of (f , π), satisfies
L(f )V (π) ≤ V (π).
Let any policy π' be written as π' = (f_1, f_2, f_3, · · · ). By the assumption, (f_N, π) ≤ π, i.e., L(f_N)V(π) ≤ V(π). Now, apply the monotonicity to this:
$$L(f_{N-1})L(f_N)V(\pi) \overset{(a)}{\le} L(f_{N-1})V(\pi) \overset{(b)}{\le} V(\pi),$$
where (a) is by monotonicity and (b) is by the assumption. Repeatedly applying this procedure, we have
$$L(f_1)L(f_2)\cdots L(f_N)V(\pi) \le V(\pi).$$
Letting N → ∞ (the left-hand side tends to V(π'), since the contribution of rewards beyond step N is weighted by γ^N), we have
$$V(\pi') \le V(\pi), \quad \forall \pi'.$$
Hence, π is optimal.
Lemma 2: If L(f)V(π) > V(π) for some f ∈ F, i.e., (f, π) > π, then f^{(∞)} > π.
Proof: By assumption,
$$L(f)V(\pi) > V(\pi).$$
Applying the monotonicity,
$$L(f)L(f)V(\pi) \overset{(a)}{>} L(f)V(\pi) \overset{(b)}{>} V(\pi),$$
where (a) is by the monotonicity and (b) is by the above equation. Repeating this yields
$$L(f)^N V(\pi) > V(\pi).$$
Letting N → ∞ (the left-hand side tends to V(f^{(∞)})), we have
$$f^{(\infty)} > \pi.$$
For a stationary policy f^{(∞)}, define G(s, f) as the set of actions that strictly improve the first step at s, i.e., actions a with
$$Q^{f^{(\infty)}}(s, a) > V^{f^{(\infty)}}(s).$$
If G(s, f) is empty for all s ∈ S, then Q^{f^{(∞)}}(s, a) ≤ V^{f^{(∞)}}(s) for all a, equivalently for all h : a = h(s). That is, (h, f^{(∞)}) ≤ f^{(∞)}. Then, by Lemma 1, f^{(∞)} is optimal.
If G(s, f) is not empty for some s', construct a new policy g as follows:
For s' s.t. G(s', f) ≠ ∅, set g(s') = a ∈ G(s', f).
For all other s ≠ s', set g(s) = f(s).
Then, (g, f^{(∞)}) > f^{(∞)}, so by Lemma 2, g^{(∞)} > f^{(∞)}.
Suppose that we computed the optimal action value function Q ∗ (s, a) over S × A.
(Q ∗ (s, a) exists for finite MDP cases, as we will see later.) Then, construct a policy
π^o : S → A as follows:
$$\pi^o(a|s) = \begin{cases} 1, & a = \arg\max_{\alpha\in A} Q^*(s, \alpha) \\ 0, & \text{otherwise} \end{cases}$$
Pseudocode:
Initialization: S = {s(1), s(2), · · · , s(|S|)}, A = {a(1), a(2), · · · , a(|A|)}, π^o(a|s) = 0, ∀(s, a) ∈ S × A
for i = 1 : |S|
    s = s(i)
    for j = 1 : |A|
        a = a(j)
        if a = arg max_{α∈A} Q^*(s, α)
            π^o(a|s) = 1
        end if
    end for
end for
[Figure: trajectory trees from (s, a^1) and (s, a^2) illustrating Q^*(s, a^1) and Q^*(s, a^2).]
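The pseudocode above translates directly into a few lines; a minimal sketch assuming Q* is given as a |S| × |A| array (the numbers below are placeholders):

```python
import numpy as np

# Assumed optimal action values Q*(s, a) for |S| = 3 states and |A| = 2 actions (illustrative numbers).
Q_star = np.array([[1.0, 2.0],
                   [0.5, 0.4],
                   [3.0, 3.5]])

# Greedy (deterministic) policy: pi_o(a|s) = 1 if a = argmax_alpha Q*(s, alpha), else 0.
pi_o = np.zeros_like(Q_star)
pi_o[np.arange(Q_star.shape[0]), Q_star.argmax(axis=1)] = 1.0
print(pi_o)
```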
Proof of Theorem 1
Recall Theorem 1: The following is true:
i) There exists an optimal policy π^* such that V^{π^*}(s) ≥ V^π(s), ∀s, ∀π.
ii) V^{π^*}(s) = V^*(s).
iii) Q^{π^*}(s, a) = Q^*(s, a).
Proof) ii) By the definition of the optimal policy π^*, we have
$$V^*(s) \triangleq \max_{\pi} V^\pi(s) \le V^{\pi^*}(s).$$
Conversely, V^{π^*}(s) ≤ max_π V^π(s) = V^*(s). Hence V^{π^*}(s) = V^*(s).
Proof of Theorem 1
Proof continued)
iii) Suppose that Q^{π^*}(s, a) < max_π Q^π(s, a) = Q^*(s, a). Then,
$$R_s^a + \gamma \sum_{s'} P^a_{s,s'} V^{\pi^*}(s') = Q^{\pi^*}(s, a) < \max_{\pi} Q^\pi(s, a) = R_s^a + \gamma \sum_{s'} P^a_{s,s'} \max_{\pi} V^\pi(s'),$$
so V^{π^*}(s') < max_π V^π(s') for some s'. This contradicts the optimality of π^* (part ii)). Hence, Q^{π^*}(s, a) = Q^*(s, a).
[Figure: backup diagram relating V^*(s), the action values Q^*(s, a^1), Q^*(s, a^2), Q^*(s, a^3) for a ∼ π^o(s), and the successor values V^*(s^1), V^*(s^2), V^*(s^3), illustrating the Bellman optimality relation.]
Remark: The Bellman equation for optimal value functions is called the Bellman
optimality equation.
Optimal Policy and Bellman Optimality Equation
$$V^*(s) = \max_{a\in A}\Big\{ R(s, a) + \gamma \sum_{s'} p(s'|s, a)\,V^*(s') \Big\}$$
Written out componentwise for S = {s^1, · · · , s^N} and A = {a^1, · · · , a^M}, this is a system of N coupled equations,
$$V^*(s^k) = \max\Big\{ R(s^k, a^1) + \gamma\big[p(s^1|s^k, a^1)V^*(s^1) + \cdots + p(s^N|s^k, a^1)V^*(s^N)\big],\ \cdots,\ R(s^k, a^M) + \gamma\big[p(s^1|s^k, a^M)V^*(s^1) + \cdots + p(s^N|s^k, a^M)V^*(s^N)\big] \Big\}, \quad k = 1, \cdots, N.$$
A system of nonlinear equations (each equation takes a max over |A| terms, N equations in total) ⇒ difficult to solve in closed form.
Optimal Policy and Bellman Optimality Equation
[Figure: the recycling robot MDP transition graph again, as a running example for writing out the Bellman optimality equation.]
Remarks:
The Bellman optimality equation is a system of nonlinear equations.
So, it is difficult to obtain closed-form solutions in general.
Thus, we approach the MDP problem based on iterative methods
such as generalized policy iteration.
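As a concrete sketch of such an iterative approach, the following is a minimal policy-iteration loop (exact policy evaluation by solving the linear Bellman system, then greedy improvement). The array layout P[a, s, s'], R[a, s, s'] and all names are assumptions chosen for illustration, not the lecture's notation:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, iters=50):
    """P[a, s, s1]: transition probs, R[a, s, s1]: rewards. Returns (state values, greedy policy)."""
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)                     # start from an arbitrary deterministic policy
    v = np.zeros(S)
    for _ in range(iters):
        # Policy evaluation: v = (I - gamma P^pi)^{-1} r^pi
        P_pi = P[pi, np.arange(S)]                  # S x S matrix induced by pi
        r_pi = np.sum(P_pi * R[pi, np.arange(S)], axis=1)
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: greedy w.r.t. Q(s, a) = sum_s' P^a_{ss'} [R(s,a,s') + gamma v(s')]
        Q = np.sum(P * (R + gamma * v), axis=2)     # A x S
        new_pi = Q.argmax(axis=0)
        if np.array_equal(new_pi, pi):              # policy stable: stop
            break
        pi = new_pi
    return v, pi
```

Calling this with arrays built from the recycling-robot table above (with numeric values substituted for α, β, r_search, r_wait, C) would return its optimal state values and an optimal deterministic policy.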
[Figure: a contraction mapping T on a set X; the images T(x_1), T(x_2) are closer to each other than x_1 and x_2.]
$$\lim_{n\to\infty} T^n(x) = x^*, \quad\text{where}\quad T^n(x) = \underbrace{(T \circ T \circ \cdots \circ T)}_{n\ \text{times}}(x).$$
$$\mathcal{V} = \{V(s),\ s \in S\} = \mathbb{R}^{|S|}, \qquad d(V_1, V_2) = \|V_1 - V_2\|_\infty = \max_{s\in S} |V_1(s) - V_2(s)|.$$
Then, this (V, d) is a complete metric space since the set of real numbers is complete w.r.t. the L^∞-norm. Now, define a mapping T^* : V → V as
$$T^*: V(s) \to T^*(V(s)) = \max_{a\in A}\Big\{ R(s, a) + \gamma \sum_{s'} p(s'|s, a)\,V(s') \Big\}.$$
Proof Continued
Now consider any two value functions V1 (s) and V2 (s). Then, we have
$$\begin{aligned}
\|T^*V_1(s) - T^*V_2(s)\|_\infty
&= \left\| \max_a\Big\{ R(s,a) + \gamma\sum_{s'} p(s'|s,a)V_1(s') \Big\} - \max_{\tilde a}\Big\{ R(s,\tilde a) + \gamma\sum_{s'} p(s'|s,\tilde a)V_2(s') \Big\} \right\|_\infty \\
&\le \left\| \max_a\Big\{ R(s,a) + \gamma\sum_{s'} p(s'|s,a)V_1(s') - R(s,a) - \gamma\sum_{s'} p(s'|s,a)V_2(s') \Big\} \right\|_\infty \\
&\le \left\| \max_a\Big\{ \gamma\sum_{s'} p(s'|s,a)\,[V_1(s') - V_2(s')] \Big\} \right\|_\infty \\
&\le \gamma \max_a\Big\{ \sum_{s'} p(s'|s,a)\, \big\|V_1(s') - V_2(s')\big\|_\infty \Big\}
   \qquad (\because\ \|\max(a,b)\| \le \max(\|a\|,\|b\|)) \\
&\le \gamma\, \big\|V_1(s') - V_2(s')\big\|_\infty \max_a\Big\{ \sum_{s'} p(s'|s,a) \Big\}
   \qquad (\text{due to the definition of } \|\cdot\|_\infty) \\
&\le \gamma\, \big\|V_1(s') - V_2(s')\big\|_\infty
   \qquad \Big(\because\ \sum_{s'} p(s'|s,a) = 1,\ \forall s, a\Big).
\end{aligned}$$
Proof Continued
Hence, the Bellman operator T ∗ is a contraction mapping. By the Banach fixed point
theorem, there exists a unique fixed point V ∗ (s) such that
$$V^*(s) = T^*(V^*(s)) = \max_{a}\Big\{ R(s, a) + \gamma \sum_{s'} p(s'|s, a)\,V^*(s') \Big\},$$
but this is nothing but the Bellman optimality equation. Hence, we have the claim.
Furthermore, by the Banach fixed point theorem, V^*(s) can be obtained by iteratively applying the Bellman operator to any initial V^{(0)}(s), i.e.,
$$V^*(s) = \lim_{n\to\infty} (T^*)^n V^{(0)}(s).$$
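A minimal value-iteration sketch of this statement: repeatedly apply T* to an arbitrary initial V^(0) until the sup-norm change is small. The array layout P[a, s, s'], R[a, s, s'] is again an assumption for illustration:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Iterate V <- T* V = max_a sum_s' P[a,s,s'] (R[a,s,s'] + gamma V[s'])."""
    S = P.shape[1]
    V = np.zeros(S)                                   # any initial V^(0) works
    for _ in range(max_iter):
        V_new = np.max(np.sum(P * (R + gamma * V), axis=2), axis=0)
        if np.max(np.abs(V_new - V)) < tol:           # sup-norm stopping rule, matching the contraction argument
            return V_new
        V = V_new
    return V
```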
Appendix
Cauchy Sequences
Goal
We want to define the convergence of a sequence of numbers in some
number system without knowing whether the limit is contained in that
number system or not.
Cauchy Sequences
Definition (Cauchy Sequence)
A sequence {x_n} is a Cauchy sequence if, for every ε > 0, there exists N such that |x_i − x_j| < ε for all i, j > N.
[Figure: a sequence x_n whose terms beyond index N stay within ε of each other.]
Definition (Completeness)
A set X is complete if every Cauchy sequence in X has a limit in the set X.
Definition (Limit)
A number x is called the limit of the sequence x_n if, for every ε > 0, there exists N such that |x_n − x| < ε for all n > N.
Remark
It is desirable that any element that can be approached arbitrarily closely
by elements in a set should be in the same set.
Remark
We can generalize the definitions to any metric space (X , d) with metric d.
Optimal Policy and Bellman Optimality Equation
Contraction
Definition (A Contraction Mapping)
Let (X, d) be a complete metric space. A mapping T : X → X is a contraction mapping (or simply contraction) if there exists some constant γ ∈ [0, 1) such that
$$d(T(x_1), T(x_2)) \le \gamma\, d(x_1, x_2), \quad \forall x_1, x_2 \in X.$$
[Figure: a contraction T maps x_1, x_2 ∈ X to points T(x_1), T(x_2) that are closer together.]
Remark
Note that any contraction mapping is a continuous mapping by the definition of
continuity.
Optimal Policy and Bellman Optimality Equation
Contraction: An Example
[Figure: plot of y = x and y = Tx; starting from x_0, the iterates x_1, x_2, · · · move toward the intersection point of the two curves, i.e., the fixed point of T.]
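A tiny numeric version of this picture, using the made-up contraction T(x) = 0.5x + 1 on the real line (modulus 0.5, fixed point x* = 2): iterating T from any starting point approaches the intersection with y = x.

```python
def T(x):
    return 0.5 * x + 1.0     # a contraction with modulus 0.5; fixed point x* = 2

x = 10.0                     # arbitrary starting point x_0
for _ in range(50):          # x_{n+1} = T(x_n)
    x = T(x)
print(x)                     # converges to 2.0
```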
Fixed Point
Theorem (Banach Fixed Point Theorem)
Let (X, d) be a complete metric space and T : X → X a contraction. Then T has a unique fixed point x^* = T(x^*), and for any x_0 ∈ X,
$$\lim_{n\to\infty} T^n(x_0) = x^*, \quad\text{where}\quad T^n(x) \triangleq \underbrace{(T \circ T \circ \cdots \circ T)}_{n\ \text{times}}(x).$$
Proof
For any given x_0 ∈ X, let x_n ≜ T^n(x_0) and c ≜ d(x_0, x_1). Then, we have
$$d(x_n, x_{n+1}) \le \gamma^n c$$
for every n. Then, for any n and m such that n < m, we have
$$d(x_n, x_m) \le \sum_{k=n}^{m-1} d(x_k, x_{k+1}) \le \sum_{k=n}^{m-1} \gamma^k c \le \frac{\gamma^n}{1-\gamma}\, c \ \to\ 0 \quad \text{as } n \to \infty,$$
so {x_n} is a Cauchy sequence and, by completeness, x_n → x^* for some x^* ∈ X. Since the contraction T is continuous, T(x^*) = lim_n T(x_n) = lim_n x_{n+1} = x^*. Hence, x^* is a fixed point of T.
Proof
Now, we show that the fixed point is unique. Let x and y be fixed points of T, i.e., x = Tx and y = Ty. Then,
$$d(x, y) = d(Tx, Ty) \le \gamma\, d(x, y).$$
Since γ < 1 and d(x, y) ≥ 0, the only possibility is d(x, y) = 0. That is, x = y. This concludes the proof.