Infinite Horizon Problems

The document discusses infinite horizon Markov Decision Processes (MDPs) and the application of dynamic programming to solve them, highlighting concepts such as value iteration and policy iteration. It uses the example of a warehouse managing inventory to illustrate the principles of MDPs, including state and action spaces, transition probabilities, and reward functions. The document emphasizes the importance of optimal policies and value functions in decision-making processes over an infinite time horizon.


2022-09-15

Infinite horizon problems


tl;dr: Dynamic programming still works

Cathy Wu
6.7950 Reinforcement Learning: Foundations and Methods

References

1. With many slides adapted from Alessandro Lazaric and Matteo Pirotta.
2. Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. Volume 2. 4th Edition. (2012). Chapters 1-2: Discounted Problems.
3. R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, N.J., 1957.

Outline

1. Infinite horizon Markov Decision Processes

2. Value iteration

3. Policy iteration

Outline

1. Infinite horizon Markov Decision Processes

a. Discounted, stochastic shortest path, average cost
b. Policies
c. Dynamic programming algorithm?

2. Value iteration

3. Policy iteration

Example: The Amazing Goods Company

§ Description. At each month t, a warehouse contains s_t items of a specific good, and the demand for that good is D (stochastic). At the end of each month the manager of the warehouse can order a_t more items from the supplier.

§ The cost of maintaining an inventory of s items is h(s).
§ The cost to order a items is C(a).
§ The income for selling q items is f(q).
§ If the demand d ~ D is bigger than the available inventory s, customers that cannot be served leave.
§ The value of the remaining inventory at the end of the year is g(s).
§ Constraint: the store has a maximum capacity C.
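
As a concrete illustration, here is a minimal sketch of this inventory problem written as a simulator; the capacity, demand distribution, and the cost/income functions below are placeholder assumptions for illustration, not values from the slides.

```python
import numpy as np

# Illustrative inventory MDP sketch (all numbers are placeholder assumptions).
CAPACITY = 20                                     # maximum number of items in the store
DEMAND_PROBS = {0: 0.2, 1: 0.3, 2: 0.3, 3: 0.2}   # stochastic demand D

def holding_cost(s):      # h(s): cost of maintaining an inventory of s items
    return 0.1 * s

def order_cost(a):        # C(a): cost of ordering a items
    return 2.0 + 0.5 * a if a > 0 else 0.0

def income(q):            # f(q): income for selling q items
    return 1.5 * q

def step(s, a, rng):
    """One month: order a items, observe demand d ~ D, return (next state, reward)."""
    a = min(a, CAPACITY - s)              # respect the maximum capacity
    stock = s + a
    d = rng.choice(list(DEMAND_PROBS), p=list(DEMAND_PROBS.values()))
    sold = min(d, stock)                  # unserved customers leave (demand is lost)
    reward = income(sold) - order_cost(a) - holding_cost(s)
    return stock - sold, reward

rng = np.random.default_rng(0)
print(step(5, 3, rng))                    # sample one transition from state s = 5
```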

Markov Decision Process

Definition (Markov decision process)
A Markov decision process (MDP) is defined as a tuple M = (S, A, P, r, γ) where
§ S is the state space (often simplified to finite),
§ A is the action space,
§ P(s'|s, a) is the transition probability, with
  P(s'|s, a) = ℙ(s_{t+1} = s' | s_t = s, a_t = a),
§ r(s, a, s') is the immediate reward at state s upon taking action a (sometimes simply r(s)),
§ γ ∈ [0, 1) is the discount factor.

Example: The Amazing Goods Company


§ Discount: 𝛾 = 0.95. A dollar today is worth more than a dollar tomorrow.

Markov Decision Process

Definition (Markov decision process)
A Markov decision process (MDP) is defined as a tuple M = (S, A, P, r, γ) where
§ S is the state space (often simplified to finite),
§ A is the action space,
§ P(s'|s, a) is the transition probability, with
  P(s'|s, a) = ℙ(s_{t+1} = s' | s_t = s, a_t = a),
§ r(s, a, s') is the immediate reward at state s upon taking action a (sometimes simply r(s)),
§ γ ∈ [0, 1) is the discount factor.

Example: The Amazing Goods Company


§ Objective: V(s_0; a_0, ...) = Σ_{t=0}^{∞} γ^t r_t. This corresponds to the cumulative reward, including the value of the remaining inventory at "the end."
§ The "horizon" of the problem is 12 (12 months in 1 year), i.e. r_12 = g(s_12) and r_t = 0 for t > 12.

Markov Decision Process

Definition (Markov decision process)
A Markov decision process (MDP) is defined as a tuple M = (S, A, P, r, γ) where
§ S is the state space,
§ A is the action space,
§ P(s'|s, a) is the transition probability, with
  P(s'|s, a) = ℙ(s_{t+1} = s' | s_t = s, a_t = a),
§ r(s, a, s') is the immediate reward at state s upon taking action a,
§ γ ∈ [0, 1) is the discount factor.

☞ Two missing ingredients:

§ How actions are selected → policy.
§ What determines which actions (and states) are good → value function.

Policy

Definition (Policy)
A decision rule d can be
§ Deterministic: d: S → A,
§ Stochastic: d: S → Δ(A),
§ History-dependent: d: H_t → A,
§ Markov: d: S → Δ(A).
A policy (strategy, plan) can be
§ Stationary: π = (d, d, d, ...),
§ (More generally) Non-stationary: π = (d_0, d_1, d_2, ...).

☞ For simplicity, we will typically write π instead of d for stationary policies, and π_t instead of d_t for non-stationary policies.

Recall: The Amazing Goods Company Example

§ Description. At each month t, a warehouse contains s_t items of a specific good, and the demand for that good is D (stochastic). At the end of each month the manager of the warehouse can order a_t more items from the supplier.
§ The cost of maintaining an inventory of s items is h(s).
§ The cost to order a items is C(a).
§ The income for selling q items is f(q).
§ If the demand d ~ D is bigger than the available inventory s, customers that cannot be served leave.
§ The value of the remaining inventory at the end of the year is g(s).
§ Constraint: the store has a maximum capacity C.

Recall: The Amazing Goods Company Example

§ Description. At each month t, a warehouse contains s_t items of a specific good, and the demand for that good is D (stochastic). At the end of each month the manager of the warehouse can order a_t more items from the supplier.
§ The cost of maintaining an inventory of s items is h(s).
§ The cost to order a items is C(a).
§ The income for selling q items is f(q).
§ If the demand d ~ D is bigger than the available inventory s, customers that cannot be served leave.
§ The value of the remaining inventory at the end of the year is g(s).
§ Constraint: the store has a maximum capacity C.

Stationary policy composed of deterministic Markov decision rules:
  π(s) = C − s   if s < M/4
  π(s) = 0       otherwise

Recall: The Amazing Goods Company Example

§ Description. At each month t, a warehouse contains s_t items of a specific good, and the demand for that good is D (stochastic). At the end of each month the manager of the warehouse can order a_t more items from the supplier.
§ The cost of maintaining an inventory of s items is h(s).
§ The cost to order a items is C(a).
§ The income for selling q items is f(q).
§ If the demand d ~ D is bigger than the available inventory s, customers that cannot be served leave.
§ The value of the remaining inventory at the end of the year is g(s).
§ Constraint: the store has a maximum capacity C.

Stationary policy composed of stochastic history-dependent decision rules:
  π(s_t) = U(C − s_{t−1}, C − s_{t−1} + 10)   if s_t < s_{t−1}/2
  π(s_t) = 0                                  otherwise

Optimization Problem
§ Our goal: solve the MDP

Definition (Optimal policy and optimal value function)
The solution to an MDP is an optimal policy π* satisfying

  π* ∈ arg max_{π ∈ Π} V^π

where Π is some policy set of interest.

The corresponding value function is the optimal value function

  V* = V^{π*}

State Value Function
§ Given a policy π = (d_0, d_1, ...) (deterministic to simplify notation)
§ Infinite time horizon with discount: the problem never terminates, but rewards that are closer in time receive a higher importance.

  V^π(s) = E[ Σ_{t=0}^{∞} γ^t r(s_t, π_t(s_t)) | s_0 = s; π ]

with discount factor 0 ≤ γ < 1:
§ Small γ emphasizes short-term rewards; γ close to 1 emphasizes long-term rewards.
§ For any γ ∈ [0, 1) the series always converges (for bounded rewards).

§ Used when: there is uncertainty about the deadline and/or an intrinsic definition of discount.
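
To make the objective concrete, here is a minimal sketch (with an illustrative reward stream and discount factors, not values from the slides) that computes a truncated discounted return Σ_t γ^t r_t:

```python
def discounted_return(rewards, gamma=0.95):
    """Compute sum_t gamma^t * r_t for a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# The same reward stream is worth less the more heavily we discount.
rewards = [1.0] * 50                      # placeholder rewards
print(discounted_return(rewards, 0.95))   # ~ (1 - 0.95**50) / 0.05 ≈ 18.4
print(discounted_return(rewards, 0.5))    # ≈ 2.0
```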

State Value Function
§ Given a policy π = (d_0, d_1, ...) (deterministic to simplify notation)
§ Finite time horizon T: deadline at time T, the agent focuses on the sum of the rewards up to T.

  V^π(t, s) = E[ Σ_{τ=t}^{T−1} r(s_τ, π_τ(s_τ)) + R(s_T) | s_t = s; π = (π_t, ..., π_T) ]

where R is a value function for the final state.

§ Used when: there is an intrinsic deadline to meet.

State Value Function
§ Given a policy π = (d_0, d_1, ...) (deterministic to simplify notation)
§ Stochastic shortest path: the problem never terminates, but the agent will eventually reach a termination state.

  V^π(s) = E[ Σ_{t=0}^{T} r(s_t, π_t(s_t)) | s_0 = s; π ]

where T is the first (random) time when the termination state is reached.

§ Used when: there is a specific goal condition.


State Value Function
§ Given a policy π = (d_0, d_1, ...) (deterministic to simplify notation)
§ Infinite time horizon with average reward: the problem never terminates, but the agent only focuses on the (expected) average of the rewards.

  V^π(s) = lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} r(s_t, π_t(s_t)) | s_0 = s; π ]

§ Used when: the system should be constantly controlled over time.

Notice
From now on we mostly work in the
discounted infinite horizon setting (except Lecture 5).

Most results (not always so smoothly) extend to other settings.

DP applies to infinite horizon problems, too!
§ Finite horizon stochastic and Markov problems (e.g. driving, robotics, games)

  V*_T(s_T) = r_T(s_T)   for all s_T   [terminal reward]

  V*_t(s_t) = max_{a_t ∈ A} [ r_t(s_t, a_t) + E_{s_{t+1} ~ P(·|s_t, a_t)} V*_{t+1}(s_{t+1}) ]
  for all s_t, and t = T − 1, ..., 0

§ From finite to (discounted) infinite horizon problems?

§ Infinite horizon stochastic problems (e.g. package delivery over months or years, long-term customer satisfaction, control of autonomous vehicles)

  V*(s) = max_{a ∈ A} [ r(s, a) + γ E_{s' ~ P(·|s,a)} V*(s') ]   for all s

Really?
§ Infinite horizon stochastic problems (e.g. package delivery over months or years, long-term customer satisfaction, control of autonomous vehicles)

  V*(s) = max_{a ∈ A} [ r(s, a) + γ E_{s' ~ P(·|s,a)} V*(s') ]   for all s

☞ This is called the optimal Bellman equation.

§ An optimal policy is such that:

  π*(s) = arg max_{a ∈ A} [ r(s, a) + γ E_{s' ~ P(·|s,a)} V*(s') ]   for all s

§ Discuss: Any difficulties with this new algorithm?

Outline

1. Infinite horizon Markov Decision Processes

2. Value iteration
a. Bellman operators, Optimal Bellman equation, and properties
b. Convergence
c. Numerical example

3. Policy iteration

Value iteration algorithm
1. Let V_0(s) be any function V_0: S → ℝ. [Note: not stage 0, but iteration 0.]
2. Apply the principle of optimality so that, given V_i at iteration i, we compute
   V_{i+1}(s) = 𝒯V_i(s) = max_{a ∈ A} [ r(s, a) + γ E_{s' ~ P(·|s,a)} V_i(s') ]   for all s
3. Terminate when V_i stops improving, e.g. when max_s |V_{i+1}(s) − V_i(s)| is small.
4. Return the greedy policy: π_K(s) = arg max_{a ∈ A} [ r(s, a) + γ E_{s' ~ P(·|s,a)} V_K(s') ]
   (A code sketch of these four steps follows below.)

☞ A key result: V_i → V* as i → ∞.

☞ Helpful properties
• Markov process
• Contraction in max-norm
• Cauchy sequences
• Fixed point

Adapted from Morales, Grokking Deep Reinforcement Learning, 2020.
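
The following is a minimal tabular sketch of the four steps above for a finite MDP given as arrays; the names and the array encoding (P of shape [S, A, S], expected rewards r of shape [S, A]) are assumptions for illustration, not notation from the slides.

```python
import numpy as np

def value_iteration(P, r, gamma=0.95, tol=1e-6, max_iter=10_000):
    """Tabular value iteration.

    P: transition probabilities, shape (S, A, S), P[s, a, s'] = p(s'|s, a)
    r: expected immediate rewards, shape (S, A)
    Returns the final value estimate V_K and the greedy policy.
    """
    S, A = r.shape
    V = np.zeros(S)                            # step 1: any initial function V_0
    for _ in range(max_iter):
        Q = r + gamma * P @ V                  # Q[s, a] = r(s,a) + γ Σ_s' p(s'|s,a) V(s')
        V_next = Q.max(axis=1)                 # step 2: apply the optimal Bellman operator
        if np.max(np.abs(V_next - V)) < tol:   # step 3: stop when V stops improving
            V = V_next
            break
        V = V_next
    policy = (r + gamma * P @ V).argmax(axis=1)   # step 4: greedy policy
    return V, policy
```

With exact P and r, the iterates converge geometrically at rate γ, which is exactly the contraction property discussed in the following slides.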

Value iteration algorithm
1. Let V_0(s) be any function V_0: S → ℝ. [Note: not stage 0, but iteration 0.]
2. Apply the principle of optimality so that, given V_i at iteration i, we compute
   V_{i+1}(s) = 𝒯V_i(s) = max_{a ∈ A} [ r(s, a) + γ E_{s' ~ P(·|s,a)} V_i(s') ]   for all s
3. Terminate when V_i stops improving, e.g. when max_s |V_{i+1}(s) − V_i(s)| is small.
4. Return the greedy policy: π_K(s) = arg max_{a ∈ A} [ r(s, a) + γ E_{s' ~ P(·|s,a)} V_K(s') ]

Definition (Optimal Bellman operator)
For any W ∈ ℝ^S, the optimal Bellman operator is defined as

  𝒯W(s) = max_{a ∈ A} [ r(s, a) + γ E_{s' ~ P(·|s,a)} W(s') ]   for all s

☞ Then we can write step 2 of the algorithm concisely:

  V_{i+1}(s) = 𝒯V_i(s)   for all s

Key question: Does V_i → V*?

The student dilemma
§ Model: all the transitions are Markov; states s5, s6, s7 are terminal.
§ Setting: infinite horizon with terminal states.
§ Objective: find the policy that maximizes the expected sum of rewards before reaching a terminal state.
§ Notice: Not a discounted infinite horizon setting. But the Bellman equations hold unchanged.
§ Discuss: What kind of problem setting is this? (Hint: value function.)

[Figure: the student dilemma MDP with seven states (s5, s6, s7 terminal). Each non-terminal state offers a Rest and a Work action; transition probabilities and rewards (r = 0, 1, −1, −10, 100, −1000) are shown on the graph.]

The Optimal Bellman Equation
Bellman's Principle of Optimality (Bellman, 1957):
"An optimal policy has the property that, whatever the initial state and the initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision."

The Optimal Bellman Equation
Theorem (Optimal Bellman Equation)
The optimal value function V* (i.e. V* = max_π V^π) is the solution to the optimal Bellman equation:

  V*(s) = max_{a ∈ A} [ r(s, a) + γ Σ_{s'} p(s'|s, a) V*(s') ]

And any optimal policy is such that:

  π*(a|s) > 0  ⟺  a ∈ arg max_{a' ∈ A} [ r(s, a') + γ Σ_{s'} p(s'|s, a') V*(s') ]

Or, for short: V* = 𝒯V*

☞ There is always a deterministic optimal policy (see: Puterman, 2005, Chapter 7)

Proof: The Optimal Bellman Equation
For any policy π = (a, π') (possibly non-stationary),

  V*(s) = max_π E[ Σ_{t=0}^{∞} γ^t r(s_t, π(s_t)) | s_0 = s; π ]               [value function]

        = max_{(a, π')} [ r(s, a) + γ Σ_{s'} p(s'|s, a) V^{π'}(s') ]            [Markov property & change of "time"]

        = max_a [ r(s, a) + γ Σ_{s'} p(s'|s, a) max_{π'} V^{π'}(s') ]

        = max_a [ r(s, a) + γ Σ_{s'} p(s'|s, a) V*(s') ]                        [value function]

Proof: Line 2 (also, the Bellman Equation)
For simplicity, consider any stationary policy π = (π, π, ...):

  V^π(s) = E[ Σ_{t=0}^{∞} γ^t r(s_t, π(s_t)) | s_0 = s; π ]                                        [value function]

         = r(s, π(s)) + E[ Σ_{t=1}^{∞} γ^t r(s_t, π(s_t)) | s_0 = s; π ]                            [Markov property]

         = r(s, π(s)) + γ Σ_{s'} ℙ(s_1 = s' | s_0 = s; π(s_0)) E[ Σ_{t=1}^{∞} γ^{t−1} r(s_t, π(s_t)) | s_1 = s'; π ]
                                                                                                    [MDP and change of "time"]

         = r(s, π(s)) + γ Σ_{s'} p(s'|s, π(s)) E[ Σ_{t'=0}^{∞} γ^{t'} r(s_{t'}, π(s_{t'})) | s_0 = s'; π ]

         = r(s, π(s)) + γ Σ_{s'} p(s'|s, π(s)) V^π(s')                                              [value function]

Proof: Line 3
For the equality (=), we have

  max_{π'} Σ_{s'} p(s'|s, a) V^{π'}(s')  ≤  Σ_{s'} p(s'|s, a) max_{π'} V^{π'}(s')

But, let π̃(s') = arg max_{π'} V^{π'}(s'). Then

  Σ_{s'} p(s'|s, a) max_{π'} V^{π'}(s')  ≤  Σ_{s'} p(s'|s, a) V^{π̃}(s')  ≤  max_{π'} Σ_{s'} p(s'|s, a) V^{π'}(s')

The student dilemma

[Figure: the student dilemma MDP, as above.]

  V*(x) = max_{a ∈ A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ]

System of equations:

  V1 = max{ 0 + 0.5 V1 + 0.5 V2 ;  0 + 0.5 V1 + 0.5 V3 }
  V2 = max{ 0 + 0.4 V5 + 0.6 V2 ;  0 + 0.3 V1 + 0.7 V3 }
  V3 = max{ −1 + 0.4 V2 + 0.6 V3 ;  −1 + 0.5 V4 + 0.5 V3 }
  V4 = max{ −10 + 0.9 V6 + 0.1 V4 ;  −10 + V7 }
  V5 = −10
  V6 = 100
  V7 = −1000

Discuss: How to solve this system of equations?
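
As a sketch of one possible answer, the fixed-point iteration below repeatedly applies the max equations above (coefficients exactly as printed; the terminal values are held fixed) until the values stop changing; this is value iteration specialized to this system.

```python
# Fixed-point iteration on the student-dilemma optimal Bellman equations
# (coefficients as printed above; V5, V6, V7 are terminal and held fixed).
V = [0.0] * 8                 # V[1..7]; V[0] unused
V[5], V[6], V[7] = -10.0, 100.0, -1000.0
for _ in range(10_000):
    V1 = max(0 + 0.5 * V[1] + 0.5 * V[2], 0 + 0.5 * V[1] + 0.5 * V[3])
    V2 = max(0 + 0.4 * V[5] + 0.6 * V[2], 0 + 0.3 * V[1] + 0.7 * V[3])
    V3 = max(-1 + 0.4 * V[2] + 0.6 * V[3], -1 + 0.5 * V[4] + 0.5 * V[3])
    V4 = max(-10 + 0.9 * V[6] + 0.1 * V[4], -10 + V[7])
    if max(abs(V1 - V[1]), abs(V2 - V[2]), abs(V3 - V[3]), abs(V4 - V[4])) < 1e-9:
        break
    V[1], V[2], V[3], V[4] = V1, V2, V3, V4
print(V[1:8])                 # fixed point of the max equations above
```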

System of Equations
The optimal Bellman equation

  V*(s) = max_{a ∈ A} [ r(s, a) + γ Σ_{s'} p(s'|s, a) V*(s') ]

is a non-linear system of equations with N unknowns and N non-linear constraints (i.e. the max operator).

Value iteration algorithm
1. Let V_0(s) be any function V_0: S → ℝ. [Note: not stage 0, but iteration 0.]
2. Apply the principle of optimality so that, given V_i at iteration i, we compute
   V_{i+1}(s) = 𝒯V_i(s) = max_{a ∈ A} [ r(s, a) + γ E_{s' ~ P(·|s,a)} V_i(s') ]   for all s
3. Terminate when V_i stops improving, e.g. when max_s |V_{i+1}(s) − V_i(s)| is small.
4. Return the greedy policy: π_K(s) = arg max_{a ∈ A} [ r(s, a) + γ E_{s' ~ P(·|s,a)} V_K(s') ]

☞ A key result: V_i → V* as i → ∞.

☞ Helpful properties
• Markov process
• Contraction in max-norm
• Cauchy sequences
• Fixed point

Adapted from Morales, Grokking Deep Reinforcement Learning, 2020.

Properties of Bellman Operators
Proposition
1. Contraction in L∞-norm: for any W1, W2 ∈ ℝ^N,
   ||𝒯W1 − 𝒯W2||∞ ≤ γ ||W1 − W2||∞
2. Fixed point: V* is the unique fixed point of 𝒯, i.e. V* = 𝒯V*.

Proof: value iteration
§ From the contraction property of 𝒯, V_k = 𝒯V_{k−1}, and the optimal value function V* = 𝒯V*:
  ||V* − V_{k+1}||∞ = ||𝒯V* − 𝒯V_k||∞       [optimal Bellman eq. and value iteration]
                    ≤ γ ||V* − V_k||∞          [contraction]
                    ≤ γ^{k+1} ||V* − V_0||∞    [recursion]
                    → 0
  Hence V_k → V*.                              [fixed point]

Properties of Bellman Operators
Proposition
1. Contraction in L∞-norm: for any W1, W2 ∈ ℝ^N,
   ||𝒯W1 − 𝒯W2||∞ ≤ γ ||W1 − W2||∞
2. Fixed point: V* is the unique fixed point of 𝒯, i.e. V* = 𝒯V*.

Proof: value iteration
§ Convergence rate. Let ε > 0 and ||r||∞ ≤ r_max. Then ||V* − V_K||∞ ≤ γ^K ||V* − V_0||∞ < ε after at most

  K > log( r_max / ((1 − γ) ε) ) / log(1/γ)

iterations.
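
For intuition, here is a quick computation of this bound with illustrative values (r_max = 1, ε = 0.01; these numbers are assumptions, not from the slides):

```python
import math

gamma, r_max, eps = 0.95, 1.0, 0.01            # illustrative values
K = math.log(r_max / ((1 - gamma) * eps)) / math.log(1 / gamma)
print(math.ceil(K))                            # ≈ 149 iterations suffice here
```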

Proof: Contraction of the Bellman Operator
For any s ∈ S,

  𝒯W1(s) − 𝒯W2(s)
  = max_a [ r(s, a) + γ Σ_{s'} p(s'|s, a) W1(s') ] − max_{a'} [ r(s, a') + γ Σ_{s'} p(s'|s, a') W2(s') ]
  ≤ max_a [ r(s, a) + γ Σ_{s'} p(s'|s, a) W1(s') − r(s, a) − γ Σ_{s'} p(s'|s, a) W2(s') ]
  = γ max_a Σ_{s'} p(s'|s, a) [ W1(s') − W2(s') ]
  ≤ γ ||W1 − W2||∞ max_a Σ_{s'} p(s'|s, a) = γ ||W1 − W2||∞

using the fact that max_x f(x) − max_{x'} g(x') ≤ max_x ( f(x) − g(x) ).

Value Iteration: the Guarantees
Corollary
Let V_K be the function computed after K iterations by value iteration; then the greedy policy

  π_K(s) ∈ arg max_{a ∈ A} [ r(s, a) + γ Σ_{s'} p(s'|s, a) V_K(s') ]

is such that

  ||V* − V^{π_K}||∞ ≤ (2γ / (1 − γ)) ||V* − V_K||∞

(performance loss on the left, approximation error on the right).

Furthermore, there exists ε > 0 such that if ||V* − V_K||∞ ≤ ε, then π_K is optimal.

Proof: Performance Loss
§ Note 1: We drop the K everywhere.
§ Note 2: π is the greedy policy corresponding to V, and V^π is the value function of π.

  ||V* − V^π||∞ ≤ ||𝒯V* − 𝒯^π V||∞ + ||𝒯^π V − 𝒯^π V^π||∞
               ≤ ||𝒯V* − 𝒯V||∞ + γ ||V − V^π||∞
               ≤ γ ||V* − V||∞ + γ ( ||V − V*||∞ + ||V* − V^π||∞ )
  ⟹ ||V* − V^π||∞ ≤ (2γ / (1 − γ)) ||V* − V||∞

Value Iteration: the Complexity
Time complexity
§ Each iteration takes on the order of S²A operations:
  V_{k+1}(s) = 𝒯V_k(s) = max_{a ∈ A} [ r(s, a) + γ Σ_{s'} p(s'|s, a) V_k(s') ]
§ The computation of the greedy policy takes on the order of S²A operations:
  π_K(s) ∈ arg max_{a ∈ A} [ r(s, a) + γ Σ_{s'} p(s'|s, a) V_K(s') ]
§ Total time complexity on the order of KS²A.

Space complexity
§ Storing the MDP: dynamics on the order of S²A and rewards on the order of SA.
§ Storing the value function and the optimal policy on the order of S.

Value Iteration: Extensions and Implementations
Asynchronous VI:
1. Let V_0 be any vector in ℝ^S
2. At each iteration k = 1, 2, ..., K
   • Choose a state s_k
   • Compute V_{k+1}(s_k) = 𝒯V_k(s_k)
3. Return the greedy policy
   π_K(s) ∈ arg max_{a ∈ A} [ r(s, a) + γ Σ_{s'} p(s'|s, a) V_K(s') ]

Comparison
§ Reduced time complexity per update to O(SA)
§ Using round-robin, the number of iterations increases by at most O(KS), but is much smaller in practice if states are properly prioritized
§ Convergence guarantees hold if no state is starved (a round-robin sketch follows below)
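
Here is a minimal round-robin sketch of asynchronous (in-place) value iteration, assuming the same illustrative array encoding of P and r as in the earlier value-iteration sketch:

```python
import numpy as np

def async_value_iteration(P, r, gamma=0.95, sweeps=200):
    """In-place (Gauss-Seidel) value iteration with a round-robin state schedule."""
    S, A = r.shape
    V = np.zeros(S)
    for _ in range(sweeps):
        for s in range(S):                       # round-robin over states
            # Update V[s] immediately, reusing already-updated entries of V.
            V[s] = np.max(r[s] + gamma * P[s] @ V)
    policy = (r + gamma * P @ V).argmax(axis=1)  # greedy policy
    return V, policy
```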

The Grid-World Problem

Example: Winter parking (with ice and potholes)
§ Simple grid world with a goal state (green, desired parking spot) with reward (+1), a "bad state" (red, pothole) with reward (−100), and all other states neutral (+0).
§ The omnidirectional vehicle (agent) can head in any direction. Actions move in the desired direction with probability 0.8, and in one of the two perpendicular directions with probability 0.1 each.
§ Taking an action that would bump into a wall leaves the agent where it is.

[Source: adapted from Kolter, 2016]

Example: value iteration

[Figure: grid world, panel (a)]

Recall the value iteration algorithm:
  V_{i+1}(s) = max_{a ∈ A} [ r(s, a) + γ E_{s' ~ P(·|s,a)} V_i(s') ]   for all s
Let's arbitrarily initialize V_0 as the reward function, since it can be any function.

Example update (red state):

  V_1(red) = −100 + γ max{ 0.8 V_0(green) + 0.1 V_0(red) + 0,    [up]
                           0 + 0.1 V_0(red) + 0,                  [down]
                           0 + 0.1 V_0(green) + 0,                [left]
                           0.8 V_0(red) + 0.1 V_0(green) + 1 }    [right]
           = −100 + 0.9 (0.1 · 1) = −99.91                         [best: go left]

Example: value iteration

[Figure: grid world, panel (a)]

Recall the value iteration algorithm:
  V_{i+1}(s) = max_{a ∈ A} [ r(s, a) + γ E_{s' ~ P(·|s,a)} V_i(s') ]   for all s
Let's arbitrarily initialize V_0 as the reward function, since it can be any function.

Example update (green state):

  V_1(green) = 1 + γ max{ 0.8 V_0(green) + 0.1 V_0(green),           [up]
                          0.8 V_0(red) + 0.1 V_0(green),             [down]
                          0 + 0.1 V_0(green) + 0.1 V_0(red),         [left]
                          0.8 V_0(red) + 0.1 V_0(green) + 0 }        [right]
             = 1 + 0.9 (0.9 · 1) = 1.81                               [best: go up]
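
A quick check of these two hand updates (branch values exactly as printed above; γ = 0.9):

```python
gamma = 0.9
V0_green, V0_red = 1.0, -100.0     # V_0 initialized to the reward function

# Red state: best action is "left" in the update above.
V1_red = -100 + gamma * max(0.8 * V0_green + 0.1 * V0_red,
                            0.1 * V0_red,
                            0.1 * V0_green,
                            0.8 * V0_red + 0.1 * V0_green + 1)
print(V1_red)        # ≈ -99.91

# Green state: best action is "up".
V1_green = 1 + gamma * max(0.8 * V0_green + 0.1 * V0_green,
                           0.8 * V0_red + 0.1 * V0_green,
                           0.1 * V0_green + 0.1 * V0_red,
                           0.8 * V0_red + 0.1 * V0_green)
print(V1_green)      # ≈ 1.81
```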

Example: value iteration

[Figure: grid world, panels (a) and (b)]

Recall the value iteration algorithm:
  V_{i+1}(s) = max_{a ∈ A} [ r(s, a) + γ E_{s' ~ P(·|s,a)} V_i(s') ]   for all s

Let's arbitrarily initialize V_0 as the reward function, since it can be any function.

We also need to do this update for all the "unnamed" states.

Example: value iteration

[Figure: grid world value estimates over successive iterations, panels (a)-(f)]


Outline

1. Infinite horizon Markov Decision Processes

2. Value iteration

3. Policy iteration
a. Bellman equation, and properties
b. Convergence
c. Geometric interpretations
d. Generalized policy iteration

More generally...
Value iteration:
1. V_{k+1}(s) = max_{a ∈ A} [ r(s, a) + γ E_{s' ~ P(·|s,a)} V_k(s') ]   for all s
2. π_K(s) = arg max_{a ∈ A} [ r(s, a) + γ E_{s' ~ P(·|s,a)} V_K(s') ]

Related operations:
§ Policy evaluation: V_{k+1}(s) = r(s, π_k(s)) + γ E_{s' ~ P(·|s, π_k(s))} V_k(s')   for all s
§ Policy improvement: π_k(s) = arg max_{a ∈ A} [ r(s, a) + γ E_{s' ~ P(·|s,a)} V_k(s') ]

☞ Generalized Policy Iteration:
§ Repeat:
  1. Policy evaluation for N steps
  2. Policy improvement
§ Value iteration: N = 1; Policy iteration: N = ∞


In pictures

[Figure: adapted from Morales, Grokking Deep Reinforcement Learning, 2020.]

Policy Iteration: the Idea
1. Let π_0 be any stationary policy
2. At each iteration k = 1, 2, ..., K
   • Policy evaluation: given π_k, compute V^{π_k}
   • Policy improvement: compute the greedy policy
     π_{k+1}(s) ∈ arg max_{a ∈ A} [ r(s, a) + γ Σ_{s'} p(s'|s, a) V^{π_k}(s') ]
3. Stop if V^{π_k} = V^{π_{k−1}}
4. Return the last policy π_K
   (A code sketch follows below.)
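
The following is a minimal tabular sketch of this loop under the same illustrative array encoding as before (P of shape [S, A, S], r of shape [S, A]); policy evaluation is done exactly with a linear solve.

```python
import numpy as np

def policy_iteration(P, r, gamma=0.95, max_iter=1000):
    """Tabular policy iteration with exact policy evaluation.

    P: transition probabilities, shape (S, A, S); r: rewards, shape (S, A).
    """
    S, A = r.shape
    policy = np.zeros(S, dtype=int)                 # step 1: any stationary policy
    for _ in range(max_iter):
        # Policy evaluation: V = r_pi + gamma * P_pi V  =>  (I - gamma P_pi) V = r_pi
        P_pi = P[np.arange(S), policy]              # shape (S, S)
        r_pi = r[np.arange(S), policy]              # shape (S,)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: greedy policy with respect to V
        new_policy = (r + gamma * P @ V).argmax(axis=1)
        if np.array_equal(new_policy, policy):      # step 3: stop when the policy is stable
            break
        policy = new_policy
    return V, policy
```

Each pass costs O(S³) for the solve plus O(S²A) for the improvement; the later complexity slides discuss when iterative or Monte-Carlo evaluation is preferable.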

Policy Iteration: the Guarantees
Proposition
The policy iteration algorithm generates a sequence of policies with non-decreasing performance,

  V^{π_{k+1}} ≥ V^{π_k},

and it converges to π* in a finite number of iterations.

The Bellman Equation
Theorem (Bellman equation)
For any stationary policy π = (π, π, ...), at any state s ∈ S, the state value function satisfies the Bellman equation:

  V^π(s) = r(s, π(s)) + γ Σ_{s' ∈ S} p(s'|s, π(s)) V^π(s')

The student dilemma

[Figure: the student dilemma MDP with the value of the evaluated policy at each state: V1 = 88.3, V2 = 88.3, V3 = 86.9, V4 = 88.9, V5 = −10, V6 = 100, V7 = −1000.]

§ Discuss: How to solve this system of equations?

  V^π(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y)

System of equations:

  V1 = 0 + 0.5 V1 + 0.5 V2
  V2 = 1 + 0.3 V1 + 0.7 V3
  V3 = −1 + 0.5 V4 + 0.5 V3
  V4 = −10 + 0.9 V6 + 0.1 V4
  V5 = −10
  V6 = 100
  V7 = −1000

In matrix form, with V, R ∈ ℝ^7 and P^π ∈ ℝ^{7×7}:

  V = R + P V   ⟹   V = (I − P)^{−1} R
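
This linear system can be solved directly; a sketch using the coefficients above, with only the four non-terminal states as unknowns (the rearrangement into (I − P)V = R is ours for illustration):

```python
import numpy as np

# Unknowns V1..V4; terminal values are constants (V6 = 100 enters the last equation).
V6 = 100.0
A = np.array([[ 0.5, -0.5,  0.0,  0.0],    # V1 - 0.5 V1 - 0.5 V2 = 0
              [-0.3,  1.0, -0.7,  0.0],    # V2 - 0.3 V1 - 0.7 V3 = 1
              [ 0.0,  0.0,  0.5, -0.5],    # V3 - 0.5 V3 - 0.5 V4 = -1
              [ 0.0,  0.0,  0.0,  0.9]])   # V4 - 0.1 V4 = -10 + 0.9 * V6
b = np.array([0.0, 1.0, -1.0, -10.0 + 0.9 * V6])
print(np.linalg.solve(A, b))               # ≈ [88.3, 88.3, 86.9, 88.9]
```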

Recap: The Bellman Operators
Notation. W.l.o.g. consider a discrete state space with |S| = N and V^π ∈ ℝ^N (the analysis extends to N → ∞).

Definition
For any W ∈ ℝ^N, the Bellman operator 𝒯^π: ℝ^N → ℝ^N is

  𝒯^π W(s) = r(s, π(s)) + γ Σ_{s'} p(s'|s, π(s)) W(s')

and the optimal Bellman operator (or dynamic programming operator) is

  𝒯W(s) = max_{a ∈ A} [ r(s, a) + γ Σ_{s'} p(s'|s, a) W(s') ]
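
As a sketch, both operators can be written as short functions on value vectors (same illustrative array encoding as before), which also makes the properties on the next slides easy to check numerically on a random MDP:

```python
import numpy as np

def bellman_pi(W, P, r, policy, gamma=0.95):
    """Bellman operator T^pi applied to a value vector W."""
    idx = np.arange(len(W))
    return r[idx, policy] + gamma * P[idx, policy] @ W

def bellman_opt(W, P, r, gamma=0.95):
    """Optimal Bellman operator T applied to a value vector W."""
    return np.max(r + gamma * P @ W, axis=1)

# Numerical sanity checks on a random MDP (illustrative).
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
W1, W2 = rng.random(S), rng.random(S)

# Contraction in the max-norm: ||T W1 - T W2|| <= gamma ||W1 - W2||.
lhs = np.max(np.abs(bellman_opt(W1, P, r, gamma) - bellman_opt(W2, P, r, gamma)))
print(lhs <= gamma * np.max(np.abs(W1 - W2)))          # True

# The greedy policy w.r.t. W1 attains the max, so T^{greedy} W1 equals T W1.
greedy = (r + gamma * P @ W1).argmax(axis=1)
print(np.allclose(bellman_pi(W1, P, r, greedy, gamma), bellman_opt(W1, P, r, gamma)))  # True
```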

The Bellman Operators
Proposition
Properties of the Bellman operators
1. Monotonicity: for any W1, W2 ∈ ℝ^N, if W1 ≤ W2 component-wise, then
   𝒯^π W1 ≤ 𝒯^π W2
   𝒯W1 ≤ 𝒯W2
2. Offset: for any scalar c ∈ ℝ,
   𝒯^π (W + c I_N) = 𝒯^π W + γ c I_N
   𝒯(W + c I_N) = 𝒯W + γ c I_N

The Bellman Operators
Proposition
3. Contraction in L∞-norm: for any W1, W2 ∈ ℝ^N,
   ||𝒯^π W1 − 𝒯^π W2||∞ ≤ γ ||W1 − W2||∞
   ||𝒯W1 − 𝒯W2||∞ ≤ γ ||W1 − W2||∞
4. Fixed point: for any policy π,
   V^π is the unique fixed point of 𝒯^π,
   V* is the unique fixed point of 𝒯.

§ For any W ∈ ℝ^N and any stationary policy π,
   lim_{k→∞} (𝒯^π)^k W = V^π
   lim_{k→∞} 𝒯^k W = V*

Policy Iteration: the Idea
1. Let π_0 be any stationary policy
2. At each iteration k = 1, 2, ..., K
   • Policy evaluation: given π_k, compute V^{π_k}
   • Policy improvement: compute the greedy policy
     π_{k+1}(s) ∈ arg max_{a ∈ A} [ r(s, a) + γ Σ_{s'} p(s'|s, a) V^{π_k}(s') ]
3. Stop if V^{π_k} = V^{π_{k−1}}
4. Return the last policy π_K

Policy Iteration: the Guarantees
Proposition
The policy iteration algorithm generates a sequence of policies with non-decreasing performance,

  V^{π_{k+1}} ≥ V^{π_k},

and it converges to π* in a finite number of iterations.

Proof: Policy Iteration
§ From the definition of the Bellman operators and the greedy policy π_{k+1},
  V^{π_k} = 𝒯^{π_k} V^{π_k} ≤ 𝒯V^{π_k} = 𝒯^{π_{k+1}} V^{π_k}          (eq. 1)
§ and from the monotonicity property of 𝒯^{π_{k+1}}, it follows that
  V^{π_k} ≤ 𝒯^{π_{k+1}} V^{π_k}
  𝒯^{π_{k+1}} V^{π_k} ≤ (𝒯^{π_{k+1}})² V^{π_k}
  ...
  (𝒯^{π_{k+1}})^{n−1} V^{π_k} ≤ (𝒯^{π_{k+1}})^n V^{π_k}
§ Joining all the inequalities in the chain, we obtain
  V^{π_k} ≤ lim_{n→∞} (𝒯^{π_{k+1}})^n V^{π_k} = V^{π_{k+1}}
§ Then (V^{π_k})_k is a non-decreasing sequence.


Policy Iteration: the Guarantees

Since a finite MDP admits a finite number of policies, the termination condition is eventually met for a specific k.

Thus eq. 1 holds with equality, and we obtain

  V^{π_k} = 𝒯V^{π_k}

and V^{π_k} = V*, which implies that π_k is an optimal policy.

Policy Iteration: Complexity

Notation. For any policy π, the reward vector is r^π(x) = r(x, π(x)) and the transition matrix is [P^π]_{x,y} = p(y|x, π(x)).

§ Policy Evaluation Step
  § Direct computation: for any policy π, compute
      V^π = (I − γ P^π)^{−1} r^π
    Complexity: O(S³).
  § Iterative policy evaluation: for any policy π,
      lim_{k→∞} (𝒯^π)^k V_0 = V^π
    Complexity: an ε-approximation of V^π requires on the order of S² log(1/ε) / log(1/γ) operations.
  § Monte-Carlo simulation: in each state s, simulate n trajectories (s_t^i)_{t≥0, 1≤i≤n} following policy π and compute
      V̂^π(s) ≃ (1/n) Σ_{i=1}^{n} Σ_{t≥0} γ^t r(s_t^i, π(s_t^i))
    Complexity: in each state, the approximation error is O( r_max / ((1 − γ) √n) ).
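
Here is a sketch contrasting the first two evaluation methods under the earlier illustrative array encoding (the Monte-Carlo variant would instead sample trajectories from a simulator):

```python
import numpy as np

def evaluate_direct(P, r, policy, gamma=0.95):
    """Direct policy evaluation: V = (I - gamma P_pi)^(-1) r_pi, cost O(S^3)."""
    S = r.shape[0]
    idx = np.arange(S)
    P_pi, r_pi = P[idx, policy], r[idx, policy]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def evaluate_iterative(P, r, policy, gamma=0.95, eps=1e-8):
    """Iterative policy evaluation: apply T^pi repeatedly until an eps-accurate estimate."""
    S = r.shape[0]
    idx = np.arange(S)
    P_pi, r_pi = P[idx, policy], r[idx, policy]
    V = np.zeros(S)
    while True:
        V_next = r_pi + gamma * P_pi @ V
        if np.max(np.abs(V_next - V)) < eps * (1 - gamma):   # stopping rule for eps accuracy
            return V_next
        V = V_next
```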

Policy Iteration: Complexity
§ Policy Improvement Step
  • Complexity: O(S²A)

§ Number of Iterations
  • At most O( (SA / (1 − γ)) log(1 / (1 − γ)) )
  • Other results exist that do not depend on γ

Comparison between Value and Policy Iteration
§ Value Iteration
  • Pros: each iteration is computationally efficient.
  • Cons: convergence is only asymptotic.

§ Policy Iteration
  • Pros: converges in a finite number of iterations (often small in practice).
  • Cons: each iteration requires a full policy evaluation, which might be expensive.

Example: Winter parking (with ice and potholes)
§ Simple grid world with a goal state (green, desired parking spot) with reward (+1), a "bad state" (red, pothole) with reward (−100), and all other states neutral (+0).
§ The omnidirectional vehicle (agent) can head in any direction. Actions move in the desired direction with probability 0.8, and in one of the two perpendicular directions with probability 0.1 each.
§ Taking an action that would bump into a wall leaves the agent where it is.

[Source: adapted from Kolter, 2016]

Example: value iteration

[Figure: grid world value estimates over successive iterations, panels (a)-(f)]


Example: policy iteration

[Figure: grid world over successive policy iteration steps, panels (a)-(d)]

Value iteration: geometric interpretation

[Figure: geometric interpretation of value iteration (plot of 𝒯V).]

Policy iteration: geometric interpretation

[Figure: geometric interpretation of policy iteration (plot of 𝒯V).]

More variations

[Figure: adapted from Morales, Grokking Deep Reinforcement Learning, 2020.]

Summary & Takeaways
§ When specifying a sequential problem, care should be taken to
select an appropriate type of policy and value function,
depending on the use case.
§ The ideas from dynamic programming, namely the principle of
optimality, carry over to infinite horizon problems.
§ The value iteration algorithm solves discounted infinite horizon MDP
problems by leveraging results of Bellman operators, namely the
optimal Bellman equation, contractions, and fixed points.
§ Generalized policy iteration methods include policy iteration and
value iteration.
§ The policy iteration algorithm additionally leverages monotonicity and the Bellman equation.
§ The update mechanisms for VI and PI differ, and thus their convergence in practice depends on the geometric structure of the optimal value function.
