Infinite Horizon Problems
Cathy Wu
6.7950 Reinforcement Learning: Foundations and Methods
Outline
2. Value iteration
3. Policy iteration
Policy
Definition (Policy)
A decision rule 𝑑 can be
§ Deterministic: $d: S \to A$,
§ Stochastic: $d: S \to \Delta(A)$,
§ History-dependent: $d: H_t \to A$,
§ Markov: $d: S \to \Delta(A)$.
A policy (strategy, plan) can be
§ Stationary: $\pi = (d, d, d, \dots)$,
§ (More generally) Non-stationary: $\pi = (d_0, d_1, d_2, \dots)$.
For simplicity, we will typically write $\pi$ instead of $d$ for stationary policies, and $\pi_t$ instead of $d_t$ for non-stationary policies.
Recall: The Amazing Goods Company Example
§ Description. At each month $t$, a warehouse contains $s_t$ items of a specific good and the demand for that good is $D$ (stochastic). At the end of each month the manager of the warehouse can order $a_t$ more items from the supplier.
Optimization Problem
§ Our goal: solve the MDP
Definition (Optimal policy and optimal value function)
The solution to an MDP is an optimal policy $\pi^*$ satisfying
$$\pi^* \in \arg\max_{\pi \in \Pi} V^\pi$$
State Value Function
§ Given a policy $\pi = (d_0, d_1, \dots)$ (deterministic to simplify notation)
§ Infinite time horizon with discount: the problem never terminates, but rewards that are closer in time receive higher importance.
$$V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, \pi_t(s_t)) \,\middle|\, s_0 = s;\ \pi\right]$$
with discount factor 0 ≤ 𝛾 < 1:
§ Small $\gamma$ emphasizes short-term rewards; large $\gamma$ emphasizes long-term rewards
§ For any 𝛾 ∈ [0, 1) the series always converges (for bounded rewards)
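To see the convergence claim concretely, here is a small numerical check (not from the slides): for bounded rewards the tail of the discounted sum beyond a horizon $T$ is at most $\gamma^T r_{\max}/(1-\gamma)$, so truncating the series changes its value by at most that amount. The reward stream below is arbitrary and purely illustrative.

```python
import numpy as np

# Arbitrary bounded reward stream, just for illustration.
gamma, r_max, T = 0.9, 1.0, 200
rng = np.random.default_rng(0)
rewards = rng.uniform(-r_max, r_max, size=10_000)

full = np.sum(gamma ** np.arange(rewards.size) * rewards)   # (long) discounted sum
truncated = np.sum(gamma ** np.arange(T) * rewards[:T])     # sum truncated at horizon T
tail_bound = gamma ** T * r_max / (1 - gamma)               # bound on the discarded tail
assert abs(full - truncated) <= tail_bound
print(f"|full - truncated| = {abs(full - truncated):.2e} <= {tail_bound:.2e}")
```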
State Value Function
§ Given a policy $\pi = (d_0, d_1, \dots)$ (deterministic to simplify notation)
§ Finite time horizon 𝑇: deadline at time 𝑇, the agent
focuses on the sum of the rewards up to 𝑇.
$$V^\pi(t, s) = \mathbb{E}\left[\sum_{\tau=t}^{T-1} r(s_\tau, \pi_\tau(s_\tau)) + R(s_T) \,\middle|\, s_t = s;\ \pi = (\pi_t, \dots, \pi_T)\right]$$
where $R$ is a value function for the final state.
State Value Function
§ Given a policy $\pi = (d_0, d_1, \dots)$ (deterministic to simplify notation)
§ Stochastic shortest path: the problem never terminates but
the agent will eventually reach a termination state.
$$V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{T} r(s_t, \pi_t(s_t)) \,\middle|\, s_0 = s;\ \pi\right]$$
where 𝑇 is the first (random) time when the termination
state is achieved.
DP applies to infinite horizon problems, too!
§ Finite horizon stochastic and Markov problems (e.g. driving, robotics,
games)
Outline
2. Value iteration
a. Bellman operators, Optimal Bellman equation, and properties
b. Convergence
c. Numerical example
3. Policy iteration
Value iteration algorithm
1. Let $V_0(s)$ be any function $V_0: S \to \mathbb{R}$. [Note: not stage 0, but iteration 0.]
2. Apply the principle of optimality so that given $V_i$ at iteration $i$, we compute, for all $s$,
$$V_{i+1}(s) = \mathcal{T}V_i(s) = \max_{a \in A}\left[ r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, a)} V_i(s') \right]$$
3. Terminate when $V_i$ stops improving, e.g. when $\max_s |V_{i+1}(s) - V_i(s)|$ is small.
4. Return the greedy policy:
$$\pi_K(s) = \arg\max_{a \in A}\left[ r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, a)} V_K(s') \right]$$
§ A key result: $V_i \to V^*$ as $i \to \infty$.
§ Helpful properties
• Markov process
• Contraction in max-norm
• Cauchy sequences
• Fixed point
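As a concrete reference, here is a minimal tabular value-iteration sketch of the loop above (not code from the course; the `(S, A, S)` transition array `P`, the `(S, A)` reward array `r`, the tolerance, and the random demo MDP are illustrative assumptions).

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8, max_iter=10_000):
    """P: (S, A, S) transition probabilities, r: (S, A) rewards, 0 <= gamma < 1."""
    S, A, _ = P.shape
    V = np.zeros(S)                          # V_0 can be any function; zeros here
    for _ in range(max_iter):
        # (T V_i)(s) = max_a [ r(s, a) + gamma * E_{s' ~ p(.|s,a)} V_i(s') ]
        V_next = np.max(r + gamma * P @ V, axis=1)
        if np.max(np.abs(V_next - V)) < tol: # stop when V_i stops improving
            V = V_next
            break
        V = V_next
    policy = np.argmax(r + gamma * P @ V, axis=1)   # greedy policy w.r.t. the final V
    return V, policy

# Demo on a small random MDP (purely illustrative).
rng = np.random.default_rng(0)
S, A = 5, 2
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
V_star, pi = value_iteration(P, r, gamma=0.9)
print(V_star.round(3), pi)
```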
The student dilemma
[Figure: student dilemma MDP with states 1–7, actions Rest and Work, transition probabilities between 0.1 and 0.9, and rewards r = 0, 1, −1, −10, 100, −1000.]
§ Model: all the transitions are Markov; states $s_5$, $s_6$, $s_7$ are terminal.
§ Setting: infinite horizon with terminal states.
§ Objective: find the policy that maximizes the expected sum of rewards before reaching a terminal state.
§ Notice: Not a discounted infinite horizon setting. But the Bellman equations hold unchanged.
§ Discuss: What kind of …
The Optimal Bellman Equation
Bellman's Principle of Optimality (Bellman (1957)):
“An optimal policy has the property that, whatever the initial state and the
initial decision are, the remaining decisions must constitute an
optimal policy with regard to the state resulting from the first
decision.”
The Optimal Bellman Equation
Theorem (Optimal Bellman Equation)
The optimal value function $V^*$ (i.e. $V^* = \max_\pi V^\pi$) is the solution to the optimal Bellman equation:
$$V^*(s) = \max_{a \in A}\left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^*(s') \right]$$
Moreover, an optimal policy $\pi^*$ is greedy with respect to $V^*$:
$$\pi^*(a \mid s) > 0 \ \text{ only if }\ a \in \arg\max_{a' \in A}\left[ r(s, a') + \gamma \sum_{s'} p(s' \mid s, a')\, V^*(s') \right]$$
Proof: The Optimal Bellman Equation
For any policy $\pi = (a, \pi')$ (possibly non-stationary),
$$V^*(s) = \max_{\pi} \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, \pi(s_t)) \,\middle|\, s_0 = s;\ \pi\right] \qquad \text{[value function]}$$
$$= \max_{(a, \pi')} \left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^{\pi'}(s') \right] \qquad \text{[Markov property \& change of ``time'']}$$
$$= \max_{a} \left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, \max_{\pi'} V^{\pi'}(s') \right]$$
$$= \max_{a} \left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^*(s') \right] \qquad \text{[value function]}$$
Proof: Line 2 (also, the Bellman Equation)
For simplicity, consider any stationary policy $\pi = (\pi, \pi, \dots)$:
$$V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, \pi(s_t)) \,\middle|\, s_0 = s;\ \pi\right] \qquad \text{[value function]}$$
$$= r(s, \pi(s)) + \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^t\, r(s_t, \pi(s_t)) \,\middle|\, s_0 = s;\ \pi\right] \qquad \text{[Markov property]}$$
$$= r(s, \pi(s)) + \gamma \sum_{s'} p(s' \mid s, \pi(s))\, V^\pi(s') \qquad \text{[value function]}$$
Proof: Line 3
For the equality, we have, in one direction,
$$\max_{\pi'} \sum_{s'} p(s' \mid s, a)\, V^{\pi'}(s') \le \sum_{s'} p(s' \mid s, a)\, \max_{\pi'} V^{\pi'}(s')$$
But, let $\tilde{\pi}(s') = \arg\max_{\pi'} V^{\pi'}(s')$. Then
$$\sum_{s'} p(s' \mid s, a)\, \max_{\pi'} V^{\pi'}(s') \le \sum_{s'} p(s' \mid s, a)\, V^{\tilde{\pi}}(s') \le \max_{\pi'} \sum_{s'} p(s' \mid s, a)\, V^{\pi'}(s')$$
Combining the two chains yields the equality.
The student dilemma
[Figure: student dilemma MDP with states 1–7, actions Rest and Work, transition probabilities, and rewards r = 0, 1, −1, −10, 100, −1000.]
The optimal Bellman equation:
$$V^*(s) = \max_{a \in A}\left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^*(s') \right]$$
System of equations:
$$V_1 = \max\{\, 0 + 0.5\, V_1 + 0.5\, V_2;\ \ 0 + 0.5\, V_1 + 0.5\, V_3 \,\}$$
$$V_2 = \max\{\, 1 + 0.4\, V_5 + 0.6\, V_2;\ \ 1 + 0.3\, V_1 + 0.7\, V_3 \,\}$$
$$V_3 = \max\{\, -1 + 0.4\, V_2 + 0.6\, V_3;\ \ -1 + 0.5\, V_4 + 0.5\, V_3 \,\}$$
$$V_4 = \max\{\, -10 + 0.9\, V_6 + 0.1\, V_4;\ \ -10 + V_7 \,\}$$
$$V_5 = -10, \qquad V_6 = 100, \qquad V_7 = -1000$$
System of Equations
The optimal Bellman equation:
$$V^*(s) = \max_{a \in A}\left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^*(s') \right]$$
Properties of Bellman Operators
Proposition
1. Contraction in $L_\infty$-norm: for any $W_1, W_2 \in \mathbb{R}^N$,
$$\|\mathcal{T}W_1 - \mathcal{T}W_2\|_\infty \le \gamma\, \|W_1 - W_2\|_\infty$$
Proof: Contraction of the Bellman Operator
For any $s \in S$, using the fact that $|\max_x f(x) - \max_x g(x)| \le \max_x |f(x) - g(x)|$,
$$|\mathcal{T}W_1(s) - \mathcal{T}W_2(s)| = \left| \max_{a}\left[ r(s,a) + \gamma \sum_{s'} p(s' \mid s, a)\, W_1(s') \right] - \max_{a}\left[ r(s,a) + \gamma \sum_{s'} p(s' \mid s, a)\, W_2(s') \right] \right|$$
$$\le \max_{a}\ \gamma \sum_{s'} p(s' \mid s, a)\, |W_1(s') - W_2(s')| \ \le\ \gamma\, \|W_1 - W_2\|_\infty$$
Taking the maximum over $s$ gives $\|\mathcal{T}W_1 - \mathcal{T}W_2\|_\infty \le \gamma\, \|W_1 - W_2\|_\infty$.
Value Iteration: the Guarantees
Corollary
Let $V_K$ be the function computed after $K$ iterations of value iteration. Then the greedy policy
$$\pi_K(s) \in \arg\max_{a \in A}\left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V_K(s') \right]$$
is such that
$$\|V^* - V^{\pi_K}\|_\infty \le \frac{2\gamma}{1 - \gamma}\, \|V^* - V_K\|_\infty$$
Proof: Performance Loss
§ Note 1: We drop the $K$ everywhere, writing $V$ for $V_K$ and $\pi$ for the greedy policy $\pi_K$ (so that $\mathcal{T}^{\pi} V = \mathcal{T}V$).
$$\|V^* - V^{\pi}\|_\infty \le \|\mathcal{T}V^* - \mathcal{T}^{\pi} V\|_\infty + \|\mathcal{T}^{\pi} V - \mathcal{T}^{\pi} V^{\pi}\|_\infty$$
$$\le \|\mathcal{T}V^* - \mathcal{T}V\|_\infty + \gamma\, \|V - V^{\pi}\|_\infty$$
$$\le \gamma\, \|V^* - V\|_\infty + \gamma\,\big(\|V - V^*\|_\infty + \|V^* - V^{\pi}\|_\infty\big)$$
Rearranging gives
$$\|V^* - V^{\pi}\|_\infty \le \frac{2\gamma}{1 - \gamma}\, \|V^* - V\|_\infty$$
Value Iteration: the Complexity
Time complexity
§ Each iteration takes on the order of $S^2 A$ operations:
$$V_{k+1}(s) = \mathcal{T}V_k(s) = \max_{a \in A}\left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V_k(s') \right]$$
Space complexity
§ Storing the MDP: dynamics on the order of $S^2 A$ and reward on the order of $SA$.
§ Storing the value function and the optimal policy: on the order of $S$.
Value Iteration: Extensions and Implementations
Asynchronous VI:
1. Let $V_0$ be any vector in $\mathbb{R}^S$
2. At each iteration $k = 1, 2, \dots, K$
• Choose a state $s_k$
• Compute $V_{k+1}(s_k) = \mathcal{T}V_k(s_k)$, leaving $V_{k+1}(s) = V_k(s)$ for all $s \ne s_k$
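A minimal sketch of this asynchronous (in-place, Gauss–Seidel style) update, assuming the same array layout as the earlier value-iteration sketch and a simple cyclic choice of $s_k$ (the slide leaves the state-selection rule open):

```python
import numpy as np

def async_value_iteration(P, r, gamma, num_updates=50_000):
    """In-place asynchronous VI: update a single state per iteration."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for k in range(num_updates):
        s = k % S                                  # choose a state s_k (here: cyclic sweep)
        V[s] = np.max(r[s] + gamma * P[s] @ V)     # V_{k+1}(s_k) = (T V_k)(s_k); other entries unchanged
    return V
```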
Example: Winter parking (with ice and potholes)
§ Simple grid world with a goal state (green, desired parking spot) with reward (+1), a "bad state" (red, pothole) with reward (−100), and all other states neutral (+0).
§ Omnidirectional vehicle (agent) can head in any direction. Actions move in the desired direction with probability 0.8, and in one of the perpendicular directions with the remaining probability.
§ Taking an action that would bump into a wall leaves the agent where it is.
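For illustration only, the sketch below builds such a grid world and runs value iteration on it. The grid size, the goal and pothole locations, the equal split of the remaining 0.2 probability over the two perpendicular directions, the discount factor, and the choice to treat the goal and pothole as ordinary (non-terminal) states are all assumptions, not taken from the slides; only the 0.8 success probability and the +1 / −100 / 0 rewards are.

```python
import numpy as np

ROWS, COLS = 3, 4                              # assumed grid size (hypothetical)
GOAL, POTHOLE = (0, 3), (1, 3)                 # assumed cell locations (hypothetical)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
S, A = ROWS * COLS, len(ACTIONS)

def idx(row, col):
    return row * COLS + col

def move(row, col, dr, dc):
    nr, nc = row + dr, col + dc
    # bumping into a wall leaves the agent where it is
    return (row, col) if not (0 <= nr < ROWS and 0 <= nc < COLS) else (nr, nc)

# Transitions: 0.8 in the intended direction (from the slide); the remaining
# 0.2 is assumed to split evenly over the two perpendicular directions.
P = np.zeros((S, A, S))
R = np.zeros(S)                                # state rewards: +1 goal, -100 pothole, 0 else
R[idx(*GOAL)], R[idx(*POTHOLE)] = 1.0, -100.0
for row in range(ROWS):
    for col in range(COLS):
        for a, (dr, dc) in enumerate(ACTIONS):
            outcomes = [((dr, dc), 0.8), ((dc, dr), 0.1), ((-dc, -dr), 0.1)]
            for (ddr, ddc), prob in outcomes:
                P[idx(row, col), a, idx(*move(row, col, ddr, ddc))] += prob

gamma = 0.95                                   # assumed discount factor
V = np.zeros(S)
for _ in range(1000):                          # plain (synchronous) value iteration
    V = np.max(R[:, None] + gamma * P @ V, axis=1)
print(V.reshape(ROWS, COLS).round(1))
```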
Example: value iteration
[Figure: successive value-iteration sweeps on the winter-parking grid, panels (a)–(b).]
Recall the value iteration algorithm:
$$V_{i+1}(s) = \max_{a \in A}\left[ r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, a)} V_i(s') \right] \quad \text{for all } s$$
Let's arbitrarily initialize $V_0$ as the reward function, since it can be any function.
Outline
2. Value iteration
3. Policy iteration
a. Bellman equation, and properties
b. Convergence
c. Geometric interpretations
d. Generalized policy iteration
More generally…
Value iteration:
1. $V_{k+1}(s) = \max_{a \in A}\left[ r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, a)} V_k(s') \right]$ for all $s$
2. $\pi_K(s) = \arg\max_{a \in A}\left[ r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, a)} V_K(s') \right]$
Related operations:
§ Policy evaluation: $V_{k+1}(s) = r(s, \pi(s)) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, \pi(s))} V_k(s')$ for all $s$
§ Policy improvement: $\pi_k(s) = \arg\max_{a \in A}\left[ r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, a)} V_k(s') \right]$
In pictures
Policy Iteration: the Idea
1. Let $\pi_0$ be any stationary policy
2. At each iteration $k = 1, 2, \dots, K$
• Policy evaluation: given $\pi_k$, compute $V^{\pi_k}$
• Policy improvement: compute the greedy policy
$$\pi_{k+1}(s) \in \arg\max_{a \in A}\left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^{\pi_k}(s') \right]$$
3. Stop if $V^{\pi_k} = V^{\pi_{k-1}}$
4. Return the last policy $\pi_K$
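A minimal policy-iteration sketch of these steps, assuming the same `(S, A, S)` transition array and `(S, A)` reward array as in the earlier value-iteration sketch; as a stopping test it compares successive policies, which implies the value-based condition in step 3:

```python
import numpy as np

def policy_evaluation(P, r, gamma, pi):
    """Solve V^pi = (I - gamma * P^pi)^{-1} r^pi exactly (direct computation)."""
    S = P.shape[0]
    P_pi = P[np.arange(S), pi]           # (S, S): rows p(. | s, pi(s))
    r_pi = r[np.arange(S), pi]           # (S,):  r(s, pi(s))
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

def policy_iteration(P, r, gamma, max_iter=1_000):
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)          # pi_0: any stationary policy
    for _ in range(max_iter):
        V = policy_evaluation(P, r, gamma, pi)            # policy evaluation
        pi_new = np.argmax(r + gamma * P @ V, axis=1)     # greedy policy improvement
        if np.array_equal(pi_new, pi):                    # policy stable -> values stable -> stop
            break
        pi = pi_new
    return V, pi
```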
Policy Iteration: the Guarantees
Proposition
The policy iteration algorithm generates a sequence of policies with non-decreasing performance
$$V^{\pi_{k+1}} \ge V^{\pi_k}$$
The Bellman Equation
Theorem (Bellman equation)
For any stationary policy $\pi = (\pi, \pi, \dots)$, at any state $s \in S$, the state value function satisfies the Bellman equation:
$$V^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} p(s' \mid s, \pi(s))\, V^\pi(s')$$
The student dilemma
[Figure: student dilemma MDP under a fixed policy, with computed values $V_1 = 88.3$, $V_2 = 88.3$, $V_5 = -10$, $V_6 = 100$.]
§ Discuss: How to solve this system of equations?
$$V_1 = 0 + 0.5\, V_1 + 0.5\, V_2$$
$$V_2 = 1 + 0.3\, V_1 + 0.7\, V_3$$
$$V_3 = -1 + 0.5\, V_4 + 0.5\, V_3$$
$$V_4 = -10 + 0.9\, V_6 + 0.1\, V_4$$
$$V_5 = -10, \qquad V_6 = 100, \qquad V_7 = -1000$$
In vector form, with $V, R \in \mathbb{R}^7$ and $P^\pi \in \mathbb{R}^{7 \times 7}$:
$$V = R + P^\pi V \ \Longrightarrow\ V = (I - P^\pi)^{-1} R$$
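The sketch below solves exactly this linear system with NumPy. Encoding the terminal states as zero rows of $P^\pi$ with their terminal values placed in $R$ is a modeling choice, not spelled out on the slide; with it, the solve reproduces $V_1 = V_2 \approx 88.3$ as reported in the figure.

```python
import numpy as np

# Rows of P^pi for the transient states; terminal states get zero rows.
P = np.zeros((7, 7))
P[0, [0, 1]] = [0.5, 0.5]        # V1 = 0   + 0.5 V1 + 0.5 V2
P[1, [0, 2]] = [0.3, 0.7]        # V2 = 1   + 0.3 V1 + 0.7 V3
P[2, [3, 2]] = [0.5, 0.5]        # V3 = -1  + 0.5 V4 + 0.5 V3
P[3, [5, 3]] = [0.9, 0.1]        # V4 = -10 + 0.9 V6 + 0.1 V4
R = np.array([0.0, 1.0, -1.0, -10.0, -10.0, 100.0, -1000.0])

V = np.linalg.solve(np.eye(7) - P, R)   # V = (I - P^pi)^{-1} R
print(V.round(1))                       # V1 = V2 = 88.3, consistent with the figure
```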
Recap: The Bellman Operators
Notation. W.l.o.g. consider a discrete state space with $|S| = N$, so that $V^\pi \in \mathbb{R}^N$ (the analysis extends to $N \to \infty$).
Definition
For any $W \in \mathbb{R}^N$, the Bellman operator $\mathcal{T}^\pi: \mathbb{R}^N \to \mathbb{R}^N$ is
$$\mathcal{T}^\pi W(s) = r(s, \pi(s)) + \gamma \sum_{s'} p(s' \mid s, \pi(s))\, W(s')$$
and the optimal Bellman operator $\mathcal{T}: \mathbb{R}^N \to \mathbb{R}^N$ is
$$\mathcal{T}W(s) = \max_{a \in A}\left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, W(s') \right]$$
The Bellman Operators
Proposition
Properties of the Bellman operators
1. Monotonicity: For any $W_1, W_2 \in \mathbb{R}^N$, if $W_1 \le W_2$ component-wise, then
$$\mathcal{T}^\pi W_1 \le \mathcal{T}^\pi W_2, \qquad \mathcal{T}W_1 \le \mathcal{T}W_2$$
2. Offset: For any scalar $c \in \mathbb{R}$,
$$\mathcal{T}^\pi (W + c\, I_N) = \mathcal{T}^\pi W + \gamma c\, I_N, \qquad \mathcal{T}(W + c\, I_N) = \mathcal{T}W + \gamma c\, I_N$$
where $I_N$ denotes the all-ones vector in $\mathbb{R}^N$.
The Bellman Operators
Proposition
3. Contraction in $L_\infty$-norm: For any $W_1, W_2 \in \mathbb{R}^N$,
$$\|\mathcal{T}^\pi W_1 - \mathcal{T}^\pi W_2\|_\infty \le \gamma\, \|W_1 - W_2\|_\infty, \qquad \|\mathcal{T}W_1 - \mathcal{T}W_2\|_\infty \le \gamma\, \|W_1 - W_2\|_\infty$$
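A quick numerical illustration of the $L_\infty$ contraction of the optimal Bellman operator $\mathcal{T}$ on a randomly generated MDP (the MDP and the vectors $W_1, W_2$ below are arbitrary and for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 6, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)   # random transition kernel
r = rng.random((S, A))                                          # random rewards

def T(W):                                    # optimal Bellman operator
    return np.max(r + gamma * P @ W, axis=1)

W1, W2 = rng.normal(size=S), rng.normal(size=S)
lhs = np.max(np.abs(T(W1) - T(W2)))          # ||T W1 - T W2||_inf
rhs = gamma * np.max(np.abs(W1 - W2))        # gamma * ||W1 - W2||_inf
assert lhs <= rhs + 1e-12
print(f"{lhs:.3f} <= {rhs:.3f}")
```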
Proof: Policy Iteration
§ From the definition of the Bellman operators and the greedy policy $\pi_{k+1}$,
$$V^{\pi_k} = \mathcal{T}^{\pi_k} V^{\pi_k} \le \mathcal{T}V^{\pi_k} = \mathcal{T}^{\pi_{k+1}} V^{\pi_k}$$
§ and from the monotonicity property of $\mathcal{T}^{\pi_{k+1}}$, it follows that
$$V^{\pi_k} \le \mathcal{T}^{\pi_{k+1}} V^{\pi_k}$$
$$\mathcal{T}^{\pi_{k+1}} V^{\pi_k} \le (\mathcal{T}^{\pi_{k+1}})^2\, V^{\pi_k}$$
$$\vdots$$
$$(\mathcal{T}^{\pi_{k+1}})^{n-1}\, V^{\pi_k} \le (\mathcal{T}^{\pi_{k+1}})^{n}\, V^{\pi_k}$$
$$\vdots$$
§ Joining all inequalities in the chain, we obtain
$$V^{\pi_k} \le \lim_{n \to \infty} (\mathcal{T}^{\pi_{k+1}})^{n}\, V^{\pi_k} = V^{\pi_{k+1}}$$
Since a finite MDP admits a finite number of policies, the termination condition is eventually met for a specific $k$.
Policy Iteration: Complexity
Notation. For any policy $\pi$, the reward vector is $r^\pi(x) = r(x, \pi(x))$ and the transition matrix is $P^\pi_{x,y} = p(y \mid x, \pi(x))$.
§ Policy Evaluation Step
• Direct computation: For any policy $\pi$, compute
$$V^\pi = (I - \gamma P^\pi)^{-1} r^\pi$$
Complexity: $O(S^3)$.
• Iterative policy evaluation: For any policy $\pi$,
$$\lim_{k \to \infty} (\mathcal{T}^\pi)^k V_0 = V^\pi$$
Complexity: an $\epsilon$-approximation of $V^\pi$ requires $O\!\left(S^2\, \frac{\log(1/\epsilon)}{\log(1/\gamma)}\right)$ steps.
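The two evaluation routes can be compared directly in code. The sketch below, on an arbitrary random MDP and policy, computes $V^\pi$ by the $O(S^3)$ linear solve and then counts how many applications of $\mathcal{T}^\pi$ are needed to match it to a fixed accuracy; all sizes and tolerances are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 20, 4, 0.95
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
pi = rng.integers(A, size=S)                 # an arbitrary stationary policy

P_pi = P[np.arange(S), pi]                   # P^pi
r_pi = r[np.arange(S), pi]                   # r^pi

V_direct = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)   # (I - gamma P^pi)^{-1} r^pi

V = np.zeros(S)                              # iterative evaluation: V_{k+1} = T^pi V_k
for k in range(2000):
    V = r_pi + gamma * P_pi @ V
    if np.max(np.abs(V - V_direct)) < 1e-6:  # epsilon-approximation reached
        break
print(f"{k + 1} iterations of T^pi to reach 1e-6 accuracy")
```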
Policy Iteration: Complexity
§ Policy Improvement Step
• Complexity: $O(S^2 A)$.
§ Number of Iterations
• At most $O\!\left(\frac{SA}{1 - \gamma} \log \frac{1}{1 - \gamma}\right)$
• Other results exist that do not depend on $\gamma$
Comparison between Value and Policy Iteration
§ Value Iteration
• Pros: each iteration is computationally efficient.
• Cons: convergence is only asymptotic.
§ Policy Iteration
• Pros: converges in a finite number of iterations (often small in practice).
• Cons: each iteration requires a full policy evaluation, which might be expensive.
Example: value iteration
[Figure: value-iteration sweeps on the winter-parking grid, panels (a)–(d).]
Value Iteration: Geometric Interpretation
[Figure: geometric interpretation of value iteration in terms of the Bellman operator $\mathcal{T}V$.]
Policy Iteration: Geometric Interpretation
[Figure: geometric interpretation of policy iteration in terms of the Bellman operator $\mathcal{T}V$.]
More variations