04 RL DP
Abir Das
IIT Kharagpur
Agenda
§ DP
§ Policy Evaluation
§ Policy Iteration
§ Value Iteration
§ Mathematical Tools
§ DP Extensions
Resources
Dynamic Programming
§ Dynamic Programming addresses a bigger problem by breaking it down into subproblems and then
- Solving the subproblems
- Combining the solutions to the subproblems
§ Dynamic Programming is based on the principle of optimality.
[Figure: a timeline from $0$ to $N$ showing the optimal action sequence $a_0^*, \cdots, a_k^*, \cdots, a_{N-1}^*$ and the tail subproblem starting at state $s_k^*$ at time $k$]
Principle of Optimality
Let $\{a_0^*, a_1^*, \cdots, a_{N-1}^*\}$ be an optimal action sequence with a corresponding state sequence $\{s_1^*, s_2^*, \cdots, s_N^*\}$. Consider the tail subproblem that starts at $s_k^*$ at time $k$ and maximizes the 'reward to go' from $k$ to $N$ over $\{a_k, \cdots, a_{N-1}\}$. Then the tail action sequence $\{a_k^*, \cdots, a_{N-1}^*\}$ is optimal for the tail subproblem.
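A minimal backward-induction sketch of this idea, assuming a small hypothetical finite-horizon problem with made-up tabular rewards `r[k, s, a]` and deterministic transitions `nxt[k, s, a]`: the optimal 'reward to go' at time $k$ is built from the optimal tails.

```python
import numpy as np

# Hypothetical finite-horizon problem: N steps, S states, A actions.
N, S, A = 3, 4, 2
rng = np.random.default_rng(0)
r = rng.uniform(0, 1, size=(N, S, A))        # r[k, s, a]: reward at time k
nxt = rng.integers(0, S, size=(N, S, A))     # nxt[k, s, a]: deterministic next state

# J[k, s] = optimal reward-to-go from state s at time k (J[N] = 0).
J = np.zeros((N + 1, S))
pi = np.zeros((N, S), dtype=int)
for k in reversed(range(N)):                 # solve the tail subproblems first
    for s in range(S):
        q = r[k, s] + J[k + 1, nxt[k, s]]    # one-step reward + optimal tail value
        pi[k, s] = np.argmax(q)
        J[k, s] = q[pi[k, s]]
```

By the principle of optimality, the actions `pi[k], ..., pi[N-1]` taken along an optimal trajectory are also optimal for the tail subproblem that starts at time $k$.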
Policy Evaluation
§ Iterative policy evaluation repeatedly applies the Bellman expectation backup:
$$v^{(k+1)}(s) \leftarrow \sum_{a \in \mathcal{A}} \pi(a|s) \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, v^{(k)}(s') \Big\}$$
§ In code, this can be implemented using two arrays - one for the old values $v^{(k)}(s)$ and the other for the new values $v^{(k+1)}(s)$. The new values $v^{(k+1)}(s)$ are computed one by one from the old values $v^{(k)}(s)$ without changing the old values.
§ Another way is to use one array and update the values 'in place', i.e., each new value immediately overwrites the old one.
§ Both of these converge to the true value $v_\pi$, and the 'in place' algorithm usually converges faster (a sketch of both variants follows).
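A minimal sketch of the two variants, assuming a tabular MDP given by hypothetical arrays `P[s, a, s']` (transition probabilities), `R[s, a]` (expected rewards) and a stochastic policy `pi[s, a]`:

```python
import numpy as np

def evaluate_two_arrays(P, R, pi, gamma, iters=200):
    """Synchronous backup: new values are computed from a frozen copy of the old ones."""
    S = P.shape[0]
    v = np.zeros(S)
    for _ in range(iters):
        v_new = np.empty(S)
        for s in range(S):
            q = R[s] + gamma * P[s] @ v    # q[a] = r(s,a) + gamma * sum_s' p(s'|s,a) v(s')
            v_new[s] = pi[s] @ q           # average over pi(a|s)
        v = v_new
    return v

def evaluate_in_place(P, R, pi, gamma, iters=200):
    """In-place backup: each new value immediately overwrites the old one."""
    S = P.shape[0]
    v = np.zeros(S)
    for _ in range(iters):
        for s in range(S):
            v[s] = pi[s] @ (R[s] + gamma * P[s] @ v)
    return v
```

Both functions converge to the same $v_\pi$; the in-place version typically needs fewer sweeps because later states already see the updated values of earlier states.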
§ Being greedy means choosing the action that will land the agent in the best state, i.e.,
$$\pi'(s) = \arg\max_{a \in \mathcal{A}} q_\pi(s,a) = \arg\max_{a \in \mathcal{A}} \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, v_\pi(s') \Big\}$$
§ In the Small Gridworld the improved policy was already optimal, $\pi' = \pi^*$
§ In general, more iterations of improvement/evaluation are needed
§ But this process of policy iteration always converges to $\pi^*$ (a greedy-improvement sketch follows)
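A sketch of one greedy improvement step, assuming the same hypothetical `P` and `R` arrays as in the policy evaluation sketch; it returns a deterministic policy that is greedy with respect to a given $v_\pi$:

```python
import numpy as np

def greedy_policy(P, R, v, gamma):
    """pi'(s) = argmax_a { r(s,a) + gamma * sum_s' p(s'|s,a) v(s') }."""
    S = P.shape[0]
    pi_new = np.zeros(S, dtype=int)
    for s in range(S):
        q = R[s] + gamma * P[s] @ v      # action values under v
        pi_new[s] = np.argmax(q)
    return pi_new
```

Alternating an `evaluate_*` call with `greedy_policy` is exactly the policy iteration loop discussed next.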
Policy Iteration
§ At each step of policy iteration the policy improves, i.e., the value function of the policy at a later iteration is greater than or equal to the value function of the policy at an earlier step.
§ This comes from the policy improvement theorem, which (informally) is: let $\pi_n$ be some stationary policy and let $\pi_{n+1}$ be greedy w.r.t. $v_{\pi_n}$; then $v_{\pi_{n+1}} \geq v_{\pi_n}$, i.e., $\pi_{n+1}$ is an improvement upon $\pi_n$.
$$\begin{aligned}
r_{\pi_{n+1}} + \gamma P_{\pi_{n+1}} v_{\pi_n} &\geq r_{\pi_n} + \gamma P_{\pi_n} v_{\pi_n} = v_{\pi_n} \quad \text{[Bellman eqn.]}\\
\implies r_{\pi_{n+1}} &\geq (I - \gamma P_{\pi_{n+1}})\, v_{\pi_n}\\
\implies (I - \gamma P_{\pi_{n+1}})^{-1} r_{\pi_{n+1}} &\geq v_{\pi_n}\\
\implies v_{\pi_{n+1}} &\geq v_{\pi_n} \qquad (2)
\end{aligned}$$
§ The first step: $\pi_{n+1}$ is obtained by maximizing $r_\pi + \gamma P_\pi v_{\pi_n}$ over all $\pi$'s. So $r_{\pi_{n+1}} + \gamma P_{\pi_{n+1}} v_{\pi_n}$ is at least as large as $r_\pi + \gamma P_\pi v_{\pi_n}$ for any other $\pi$; that 'any other $\pi$' happens to be $\pi_n$.
§ The third step holds because $(I - \gamma P_{\pi_{n+1}})^{-1} = \sum_{k \geq 0} \gamma^k P_{\pi_{n+1}}^k$ has only non-negative entries, so multiplying both sides by it preserves the inequality; the last step uses $v_{\pi_{n+1}} = (I - \gamma P_{\pi_{n+1}})^{-1} r_{\pi_{n+1}}$.
§ Policy iteration involves the policy evaluation step first, and this step itself requires iterations that converge to the exact value $v_\pi$ only in the limit.
§ The question is - must we wait for exact convergence to $v_\pi$? Or can we stop short of that?
§ The Small Gridworld example showed that the greedy policy does not change after the first three iterations.
§ So the question is - is there a number of iterations after which the greedy policy stops changing?
Value Iteration
§ What policy iteration does: iterate over
$$v_\pi \doteq v^{(k+1)}(s) \leftarrow \sum_{a \in \mathcal{A}} \pi(a|s) \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, v^{(k)}(s') \Big\}$$
§ And then
$$\pi'(s) = \arg\max_{a \in \mathcal{A}} \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, v_\pi(s') \Big\}$$
§ Value iteration, instead, iterates directly over
$$v^{(k+1)}(s) = \max_{a \in \mathcal{A}} \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, v^{(k)}(s') \Big\} \qquad \text{Where have we seen it?}$$
Algorithm 2: Value iteration
initialization: $v \leftarrow v^0 \in V$, pick an $\epsilon > 0$, $n \leftarrow 0$;
while $||v^{n+1} - v^n|| > \epsilon\frac{1-\gamma}{2\gamma}$ do
    foreach $s \in \mathcal{S}$ do
        $v^{n+1}(s) \leftarrow \max_a \big\{ r(s,a) + \gamma \sum_{s'} p(s'|s,a)\, v^n(s') \big\}$
    end
    $n \leftarrow n + 1$;
end
foreach $s \in \mathcal{S}$ do
    /* Note the use of $\pi(s)$. It means a deterministic policy */
    $\pi(s) \leftarrow \arg\max_a \big\{ r(s,a) + \gamma \sum_{s'} p(s'|s,a)\, v^n(s') \big\}$;  // $n$ has already been incremented by 1
end
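A sketch of Algorithm 2 in Python, assuming the same hypothetical `P[s, a, s']` and `R[s, a]` arrays as in the earlier sketches; the stopping threshold $\epsilon\frac{1-\gamma}{2\gamma}$ is the one that appears in statement II of the theorem proved later.

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-6):
    S = P.shape[0]
    v = np.zeros(S)                               # v^0: arbitrary starting point
    while True:
        # One Bellman optimality backup for every state (the operator L).
        v_new = np.max(R + gamma * P @ v, axis=1)
        if np.max(np.abs(v_new - v)) <= eps * (1 - gamma) / (2 * gamma):
            v = v_new
            break
        v = v_new
    # Greedy deterministic policy w.r.t. the final value estimate.
    policy = np.argmax(R + gamma * P @ v, axis=1)
    return v, policy
```

Here `P @ v` contracts over the next-state axis, so `R + gamma * P @ v` is the $|\mathcal{S}| \times |\mathcal{A}|$ table of one-step look-ahead values.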
Norms
Definition
Given a vector space $V \subseteq \mathbb{R}^d$, a function $f : V \to \mathbb{R}^+$ is a norm (denoted $||\cdot||$) if and only if
§ $||v|| \geq 0 \;\; \forall v \in V$
§ $||v|| = 0$ if and only if $v = 0$
§ $||\alpha v|| = |\alpha|\, ||v|| \;\; \forall \alpha \in \mathbb{R}$ and $\forall v \in V$
§ Triangle inequality: $||u + v|| \leq ||u|| + ||v|| \;\; \forall u, v \in V$
§ $L_p$ norm:
$$||v||_p = \left( \sum_{i=1}^{d} |v_i|^p \right)^{\frac{1}{p}}$$
§ $L_0$ norm: $||v||_0$ = the number of non-zero components of $v$
§ $L_\infty$ norm:
$$||v||_\infty = \max_{1 \leq i \leq d} |v_i|$$
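A quick numerical check of these norms on a small made-up vector, using numpy:

```python
import numpy as np

v = np.array([3.0, -4.0, 0.0])
p = 3
lp   = np.sum(np.abs(v) ** p) ** (1.0 / p)   # L_p norm, here p = 3
l0   = np.count_nonzero(v)                   # L_0 "norm": number of non-zero entries
linf = np.max(np.abs(v))                     # L_infinity (max) norm
print(lp, l0, linf)                          # ~4.498, 2, 4.0
```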
Definition
An operator $T : V \to V$ is L-Lipschitz if for any $u, v \in V$, $||Tu - Tv|| \leq L\, ||u - v||$. If $L < 1$, the operator $T$ is called a contraction (contraction mapping).
Theorem
Suppose $V$ is a Banach space and $T : V \to V$ is a contraction mapping. Then,
- $\exists$ a unique $v^*$ in $V$ s.t. $T v^* = v^*$, and
- for an arbitrary $v^0$ in $V$, the sequence $\{v^n\}$ defined by $v^{n+1} = T v^n = T^{n+1} v^0$ converges to $v^*$.
The above theorem tells us that
§ $T$ has a fixed point, and that fixed point is unique.
§ Starting from an arbitrary point and repeatedly applying $T$, we converge to $v^*$.
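A tiny numerical illustration of the theorem, using the made-up scalar contraction $T(v) = \gamma v + 1$ with $\gamma = 0.9$ (Lipschitz constant $0.9 < 1$), whose unique fixed point is $v^* = 1/(1-\gamma) = 10$:

```python
gamma = 0.9
T = lambda v: gamma * v + 1.0    # a contraction with modulus gamma

v = -50.0                        # arbitrary starting point v^0
for n in range(200):             # v^{n+1} = T v^n
    v = T(v)
print(v)                         # ~10.0, the unique fixed point v*
```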
§ Now we will start talking about the existence and uniqueness of the solution to the Bellman expectation equations and the Bellman optimality equations.
§ In case of a finite MDP, the value function $v$ can be thought of as a vector in a $|\mathcal{S}|$ dimensional vector space $V$.
§ Whenever we use the norm $||\cdot||$ in this space we will mean the max norm, unless otherwise specified.
§ $v_\pi = r_\pi + \gamma P_\pi v_\pi$
§ $r_\pi$ is a $|\mathcal{S}|$ dimensional vector while $P_\pi$ is a $|\mathcal{S}| \times |\mathcal{S}|$ dimensional matrix.
§ For all $s'$, the $p_\pi(s'|s)$ values form one row (the $s^{\text{th}}$ row) of the $P_\pi$ matrix. Similarly, the $v_\pi(s')$'s are the values of all states, i.e., in vectorized notation, the vector $v_\pi$.
§ $v_\pi = r_\pi + \gamma P_\pi v_\pi$
§ We are now going to define a linear operator $L_\pi : V \to V$ such that
$$L_\pi v \equiv r_\pi + \gamma P_\pi v \quad \forall v \in V, \;\text{[$V$ as defined earlier]} \qquad (5)$$
§ So far we have proved the Banach Fixed Point Theorem. Now we will try to show that $L_\pi$ is a contraction.
§ We will hold the proof of $V$ being a Banach space for later.
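A sketch of $L_\pi$ in matrix form on a made-up 3-state example (`P_pi` a row-stochastic $|\mathcal{S}| \times |\mathcal{S}|$ matrix, `r_pi` a $|\mathcal{S}|$ vector), comparing repeated application of $L_\pi$ with the direct linear solve $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$:

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.2, 0.6, 0.2],
                 [0.0, 0.3, 0.7]])          # row-stochastic transition matrix under pi
r_pi = np.array([1.0, 0.0, 2.0])            # expected one-step reward under pi

L_pi = lambda v: r_pi + gamma * P_pi @ v    # the operator L_pi

# Fixed-point iteration v^{n+1} = L_pi v^n converges because L_pi is a gamma-contraction.
v = np.zeros(3)
for _ in range(500):
    v = L_pi(v)

# Direct solve of (I - gamma P_pi) v_pi = r_pi gives the same unique fixed point.
v_direct = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(np.allclose(v, v_direct))             # True
```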
$$\begin{aligned}
0 \leq L_\pi v(s) - L_\pi u(s) &= \gamma \sum_{s'} p_\pi(s'|s)\, \{v(s') - u(s')\}\\
&\leq \gamma\, ||v - u|| \sum_{s'} p_\pi(s'|s) \quad \text{[Why is this?]}\\
&= \gamma\, ||v - u|| \quad \Big[\text{Since } \sum_{s'} p_\pi(s'|s) = 1\Big] \qquad (8)
\end{aligned}$$
$$\begin{aligned}
0 \leq L v(s) - L u(s) &= \Big\{ r(s, a_s^*) + \gamma \sum_{s'} p(s'|s, a_s^*)\, v(s') \Big\} - \Big\{ r(s, (a')_s^*) + \gamma \sum_{s'} p(s'|s, (a')_s^*)\, u(s') \Big\}\\
&\leq \Big\{ r(s, a_s^*) + \gamma \sum_{s'} p(s'|s, a_s^*)\, v(s') \Big\} - \Big\{ r(s, a_s^*) + \gamma \sum_{s'} p(s'|s, a_s^*)\, u(s') \Big\}\\
&\qquad \text{[why?? Note what has changed!]} \qquad (15)
\end{aligned}$$
§ The two actions $a_s^*$ and $(a')_s^*$ are the maximizing actions at state $s$ for the backups under $v$ and $u$ respectively. So replacing $(a')_s^*$ with $a_s^*$ in the second bracket can only reduce its value, which makes the difference at least as large.
Combining eqns. (16) and (17), $|Lv(s) - Lu(s)| \leq \gamma\, ||v - u|| \;\; \forall s \in \mathcal{S}$, which, again from the definition of the max norm, leads to $||Lv - Lu|| \leq \gamma\, ||v - u||$.
Proof
§ Proof: Suppose, for some $n$, II is met, i.e., $||v^{n+1} - v^n|| < \epsilon\frac{1-\gamma}{2\gamma}$, and $\pi(s)$ is obtained by III. Now, by the triangle inequality,
$$||v_\pi - v^*|| \leq ||v_\pi - v^{n+1}|| + ||v^{n+1} - v^*|| \qquad (18)$$
§ $$L v^{n+1}(s) = \max_{a \in \mathcal{A}} \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, v^{n+1}(s') \Big\} \qquad (24)$$
§ Now let us take the first term in eqn. (18) and proceed. An analogous argument, using the fact that $L_\pi$ is a contraction, bounds this term by $\epsilon/2$.
§ Now let us take the second term in eqn. (18) and proceed.
$$\begin{aligned}
||v^{n+1} - v^*|| &\leq \sum_{k=0}^{\infty} ||v^{n+k+2} - v^{n+k+1}|| \quad \text{[Triangle inequality repeatedly]}\\
&= \sum_{k=0}^{\infty} ||L^{k+1} v^{n+1} - L^{k+1} v^n|| \quad \text{[From iterative application of } L]\\
&\leq \sum_{k=0}^{\infty} \gamma^{k+1}\, ||v^{n+1} - v^n|| \quad [L \text{ is a contraction mapping}]\\
&= \frac{\gamma}{1-\gamma}\, ||v^{n+1} - v^n|| \quad \text{[G.P. sum]}\\
&\leq \frac{\gamma}{1-\gamma} \cdot \epsilon\,\frac{1-\gamma}{2\gamma} \quad \text{[By statement II of the theorem]}\\
&= \frac{\epsilon}{2} \qquad (26)
\end{aligned}$$
§ This is also the proof of statement IV of the theorem.
DP Extensions
§ For convergence, the order of the updates does not matter as long as every state continues to be selected, i.e., no state is permanently left out of the updates (a sketch of such an asynchronous sweep follows).
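A sketch of an asynchronous, in-place variant of the value iteration backup, assuming the same hypothetical `P[s, a, s']` and `R[s, a]` arrays as before; states are swept in a random order and each new value immediately overwrites the old one:

```python
import numpy as np

def async_value_iteration(P, R, gamma, sweeps=1000, seed=0):
    rng = np.random.default_rng(seed)
    S = P.shape[0]
    v = np.zeros(S)
    for _ in range(sweeps):
        for s in rng.permutation(S):                  # arbitrary order each sweep
            v[s] = np.max(R[s] + gamma * P[s] @ v)    # in-place Bellman optimality backup
    return v
```

Because every state keeps being updated, this converges to the same $v^*$ as the synchronous Algorithm 2.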