
Prediction and Control by Dynamic Programming

CS60077: Reinforcement Learning

Abir Das

IIT Kharagpur

Sep 09, 10, 11 and 16, 2021



Agenda

§ Understand how to evaluate policies using dynamic programming based methods
§ Understand policy iteration and value iteration algorithms for control of MDPs
§ Existence and convergence of solutions obtained by the above methods


Resources

§ Reinforcement Learning by David Silver [Link]


§ Reinforcement Learning by Balaraman Ravindran [Link]
§ SB: Chapter 4


Dynamic Programming

"Life can only be understood going backwards, but it must be lived going forwards."
- S. Kierkegaard, Danish philosopher.

The first line of the famous book by Dimitri P. Bertsekas.

[Book cover image taken from amazon.com]


Dynamic Programming

§ Dynamic Programming [DP], in this course, refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as an MDP.
§ DP methods have limited utility due to the 'perfect model' assumption and their computational expense.
§ They are still important, as they provide an essential foundation for many of the subsequent methods.
§ Many of those methods can be viewed as attempts to achieve much the same effect as DP with less computation and without the perfect-model assumption.
§ The key idea in DP is to use value functions and Bellman equations to organize and structure the search for good policies.



Dynamic Programming
§ Dynamic Programming addresses a bigger problem by breaking it down into subproblems and then
  I Solving the subproblems
  I Combining solutions to the subproblems
§ Dynamic Programming is based on the principle of optimality.

[Figure: an optimal action sequence a_0^*, ..., a_k^*, ..., a_(N-1)^* over time steps 0 to N; the tail subproblem starts at state s_k^* at time k.]

Principle of Optimality
Let {a_0^*, a_1^*, ..., a_(N-1)^*} be an optimal action sequence with a corresponding state sequence {s_1^*, s_2^*, ..., s_N^*}. Consider the tail subproblem that starts at s_k^* at time k and maximizes the 'reward to go' from k to N over {a_k, ..., a_(N-1)}; then the tail action sequence {a_k^*, ..., a_(N-1)^*} is optimal for the tail subproblem.

Requirements for Dynamic Programming

§ Optimal substructure, i.e., the principle of optimality applies.
§ Overlapping subproblems, i.e., subproblems recur many times and solutions to these subproblems can be cached and reused.
§ MDPs satisfy both, through Bellman equations and value functions.
§ Dynamic programming is used to solve many other problems, e.g., scheduling algorithms, graph algorithms (such as shortest-path algorithms), bioinformatics, etc.



Planning by Dynamic Programming

§ Planning by dynamic programming assumes full knowledge of the MDP
§ For prediction/evaluation
  I Input: MDP ⟨S, A, P, R, γ⟩ and policy π
  I Output: value function v_π
§ For control
  I Input: MDP ⟨S, A, P, R, γ⟩
  I Output: optimal value function v_* and optimal policy π_*


Iterative Policy Evaluation


§ Problem: Policy evaluation: compute the state-value function v_π for an arbitrary policy π.
§ Solution strategy: iterative application of the Bellman expectation equation.
§ Recall the Bellman expectation equation:

    v_π(s) = Σ_{a∈A} π(a|s) { r(s,a) + γ Σ_{s'∈S} p(s'|s,a) v_π(s') }    (1)

§ Consider a sequence of approximate value functions v^(0), v^(1), v^(2), ..., each mapping S^+ to R. Each successive approximation is obtained by using eqn. (1) as an update rule:

    v^(k+1)(s) ← Σ_{a∈A} π(a|s) { r(s,a) + γ Σ_{s'∈S} p(s'|s,a) v^(k)(s') }


Iterative Policy Evaluation

    v^(k+1)(s) ← Σ_{a∈A} π(a|s) { r(s,a) + γ Σ_{s'∈S} p(s'|s,a) v^(k)(s') }

§ In code, this can be implemented using two arrays - one for the old values v^(k)(s) and one for the new values v^(k+1)(s). The new values v^(k+1)(s) are computed one by one from the old values v^(k)(s) without changing the old values.
§ Another way is to use a single array and update the values 'in place', i.e., each new value immediately overwrites the old one.
§ Both of these converge to the true value v_π, and the 'in place' algorithm usually converges faster.


Iterative Policy Evaluation

Iterative Policy Evaluation, for estimating V ≈ v_π

Input: π, the policy to be evaluated
Algorithm parameter: a small threshold θ > 0 determining the accuracy of estimation
Initialize V(s), for all s ∈ S^+, arbitrarily except that V(terminal) = 0
Loop:
    ∆ ← 0
    Loop for each s ∈ S:
        v ← V(s)
        V(s) ← Σ_{a∈A} π(a|s) { r(s,a) + γ Σ_{s'∈S} p(s'|s,a) V(s') }
        ∆ ← max(∆, |v − V(s)|)
until ∆ < θ
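
To make the backup concrete, the following is a minimal Python sketch of this algorithm (in-place variant) for the small gridworld discussed on the next slides; the state numbering, helper names and threshold are illustrative choices, not part of the slides.

```python
import numpy as np

# Small gridworld from [SB] chapter 4: 4x4 grid, states 0..15,
# states 0 and 15 terminal, reward -1 on every transition,
# moves that would leave the grid keep the agent in place.
N = 4
TERMINALS = {0, 15}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def next_state(s, a):
    """Deterministic transition: successor state index for action a in state s."""
    row, col = divmod(s, N)
    nr, nc = row + a[0], col + a[1]
    if 0 <= nr < N and 0 <= nc < N:
        return nr * N + nc
    return s  # bumping into the wall leaves the state unchanged

def iterative_policy_evaluation(gamma=1.0, theta=1e-6):
    """In-place iterative policy evaluation for the uniform random policy."""
    V = np.zeros(N * N)
    while True:
        delta = 0.0
        for s in range(N * N):
            if s in TERMINALS:
                continue
            v_old = V[s]
            # Bellman expectation backup: average over the 4 equiprobable actions
            V[s] = sum(0.25 * (-1.0 + gamma * V[next_state(s, a)]) for a in ACTIONS)
            delta = max(delta, abs(v_old - V[s]))
        if delta < theta:
            return V

print(iterative_policy_evaluation().reshape(N, N).round(1))
```

The printed grid reproduces the converged v_π values shown for this example in [SB] chapter 4.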


Evaluating a Random Policy in the Small Gridworld

Figure credit: [SB] chapter 4

§ Undiscounted episodic MDP (γ = 1)
§ Non-terminal states are S = {1, 2, ..., 14}
§ Two terminal states (shown as shaded squares)
§ 4 possible actions in each state, A = {up, down, right, left}
§ Deterministic state transitions
§ Actions leading out of the grid leave the state unchanged
§ Reward is -1 until the terminal state is reached
§ Agent follows the uniform random policy
  π(n|·) = π(s|·) = π(e|·) = π(w|·) = 0.25

Evaluating a Random Policy in the Small Gridworld

[Figures from [SB] chapter 4: the sequence of value-function estimates produced by iterative policy evaluation of the random policy, together with the corresponding greedy policies.]


Improving a Policy: Policy Iteration


§ Given a policy π
  I Evaluate the policy

      v_π ≐ v^(k+1)(s) ← Σ_{a∈A} π(a|s) { r(s,a) + γ Σ_{s'∈S} p(s'|s,a) v^(k)(s') }

  I Improve the policy by acting greedily with respect to v_π

      π' = greedy(v_π)

    Being greedy means choosing the action that will land the agent in the best state, i.e.,

      π'(s) ≐ arg max_{a∈A} q_π(s,a) = arg max_{a∈A} { r(s,a) + γ Σ_{s'∈S} p(s'|s,a) v_π(s') }

§ In the Small Gridworld the improved policy was optimal, π' = π_*
§ In general, more iterations of improvement/evaluation are needed
§ But this process of policy iteration always converges to π_*

Improving a Policy: Policy Iteration


Given a policy π
§ Evaluate the policy

    v_π ≐ v^(k+1)(s) ← Σ_{a∈A} π(a|s) { r(s,a) + γ Σ_{s'∈S} p(s'|s,a) v^(k)(s') }
                     = Σ_{a∈A} π(a|s) r(s,a) + γ Σ_{s'∈S} [ Σ_{a∈A} π(a|s) p(s'|s,a) ] v^(k)(s')
                     = r_π(s) + γ Σ_{s'∈S} p_π(s'|s) v^(k)(s')

  I r_π(s) = Σ_{a∈A} π(a|s) r(s,a), the one-step expected reward for following policy π at state s.
  I p_π(s'|s) = Σ_{a∈A} π(a|s) p(s'|s,a), the one-step transition probability under policy π.
§ Improve the policy by acting greedily with respect to v_π

    π'(s) = arg max_{a∈A} { r(s,a) + γ Σ_{s'∈S} p(s'|s,a) v_π(s') }

Policy Iteration

Figure credit: [David Silver: DeepMind]

§ Policy Evaluation: Estimate vπ by iterative policy evaluation.


§ Policy Improvement: Generate π 0 ≥ π by greedy policy improvement.


Policy Iteration

Algorithm 1: Policy iteration

1  initialization: Select π^0, n ← 0;
2  do
3      (Policy Evaluation) v_(π^n) ← r_(π^n) + γ P_(π^n) v_(π^n) ;   // componentwise (solve this fixed-point equation for v_(π^n))
4      (Policy Improvement) π^(n+1)(s) ∈ arg max_{a∈A} [ r(s,a) + γ Σ_{s'∈S} p(s'|s,a) v_(π^n)(s') ]   ∀s ∈ S;
5      n ← n + 1;
6  while π^(n+1) ≠ π^n;
7  Declare π^* = π^n

§ Why is ∈ used in step 4? (There may be more than one maximizing action.)
§ Note the terminating condition.
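
For reference, here is a compact Python sketch of this algorithm for a generic finite MDP; the array layout (P[s, a, s'] for transitions, r[s, a] for expected rewards) and all names are assumptions of the sketch, and the evaluation step is done by solving the linear Bellman system directly.

```python
import numpy as np

def policy_iteration(P, r, gamma=0.9):
    """Policy iteration: P[s, a, s'] transition probabilities, r[s, a] expected rewards.
    Returns a deterministic policy and its value function."""
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)            # arbitrary initial policy
    while True:
        # Policy evaluation: solve v = r_pi + gamma * P_pi v exactly
        P_pi = P[np.arange(n_states), pi]          # shape (|S|, |S|)
        r_pi = r[np.arange(n_states), pi]          # shape (|S|,)
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to v
        pi_new = (r + gamma * P @ v).argmax(axis=1)
        if np.array_equal(pi_new, pi):             # terminating condition: policy unchanged
            return pi, v
        pi = pi_new
```

The stopping test is exactly the terminating condition noted above: the loop ends when the improved policy no longer changes.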

Policy Iteration
§ At each step of policy iteration, the policy improves, i.e., the value function of the policy at a later iteration is greater than or equal to the value function of the policy at an earlier step.
§ This comes from the policy improvement theorem, which (informally) is: let π^n be a stationary policy and let π^(n+1) be greedy w.r.t. v_(π^n); then v_(π^(n+1)) ≥ v_(π^n), i.e., π^(n+1) is an improvement upon π^n.

    r_(π^(n+1)) + γ P_(π^(n+1)) v_(π^n) ≥ r_(π^n) + γ P_(π^n) v_(π^n) = v_(π^n)    [Bellman eqn.]
    ⟹ r_(π^(n+1)) ≥ (I − γ P_(π^(n+1))) v_(π^n)
    ⟹ (I − γ P_(π^(n+1)))^(−1) r_(π^(n+1)) ≥ v_(π^n)
    ⟹ v_(π^(n+1)) ≥ v_(π^n)    (2)

§ The first step: π^(n+1) is obtained by maximizing r_π + γ P_π v_(π^n) over all π's, so r_(π^(n+1)) + γ P_(π^(n+1)) v_(π^n) is at least as large as r_π + γ P_π v_(π^n) for any other π; that 'any other π' happens to be π^n.
§ The third step is valid because (I − γ P_(π^(n+1)))^(−1) = Σ_{k≥0} γ^k P_(π^(n+1))^k has only non-negative entries, so multiplying both sides by it preserves the inequality; the last step then uses v_(π^(n+1)) = (I − γ P_(π^(n+1)))^(−1) r_(π^(n+1)).
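
As a quick sanity check (not part of the slides), the sketch below builds a small random MDP with NumPy and verifies numerically that one greedy improvement step never decreases the value function; all names and the random construction are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random finite MDP: each row P[s, a, :] is a probability distribution
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))

def value_of(pi):
    """Exact value of a deterministic policy via v = (I - gamma P_pi)^(-1) r_pi."""
    P_pi = P[np.arange(n_states), pi]
    r_pi = r[np.arange(n_states), pi]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

pi = rng.integers(n_actions, size=n_states)        # arbitrary starting policy
v_pi = value_of(pi)
pi_greedy = (r + gamma * P @ v_pi).argmax(axis=1)  # greedy w.r.t. v_pi
v_greedy = value_of(pi_greedy)

assert np.all(v_greedy >= v_pi - 1e-12)            # policy improvement theorem
print(np.round(v_greedy - v_pi, 4))                # componentwise improvement
```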

Policy Iteration: Example ([SB])

§ Jack manages two locations of a car rental company. At either location, if a car is available he rents it out and gets $10. To ensure that cars are available, Jack can move cars between the two locations overnight, at a cost of $2 per car.
§ Cars are returned and requested randomly according to Poisson distributions: the probability that n cars are requested or returned is (λ^n / n!) e^(−λ).
  I 1st location - λ: average requests = 3, average returns = 3
  I 2nd location - λ: average requests = 4, average returns = 2
§ There can be no more than 20 cars at each location, and a maximum of 5 cars can be moved from one location to the other.


Policy Iteration: Example - MDP Formulation

§ State: number of cars at each location at the end of the day (between 0 and 20).
§ Actions: number of cars moved overnight from one location to the other (max 5).
§ Reward: $10 per car rented (if available) and -$2 per car moved.
§ Transition probability: given by the Poisson distributions defined on the last slide (a small helper computing them is sketched below).
§ Discount factor: γ is assumed to be 0.9.
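
To make the transition model concrete, here is a small Python helper (my own illustrative sketch, not from the slides) computing the Poisson probabilities used for requests and returns, truncated at the 20-car capacity.

```python
from math import exp, factorial

def poisson_pmf(n, lam):
    """P(N = n) for a Poisson random variable with mean lam."""
    return (lam ** n) * exp(-lam) / factorial(n)

def truncated_poisson(lam, max_n=20):
    """Probabilities for 0..max_n events, lumping the tail mass into max_n,
    so that demand/returns never exceed the 20-car capacity."""
    probs = [poisson_pmf(n, lam) for n in range(max_n)]
    probs.append(1.0 - sum(probs))  # remaining tail mass assigned to max_n
    return probs

# Example: request distribution at the 1st location (average requests = 3)
requests_loc1 = truncated_poisson(lam=3)
print(sum(requests_loc1))  # 1.0
```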


Policy Iteration: Example

Figure credit: [SB - Chapter 4]


The sequence of policies found by policy iteration on Jack’s car rental problem,
and the final state-value function

Policy Iteration: Disadvantages

§ Policy iteration involves the policy evaluation step first, and this itself is an iterative computation that reaches the exact value of v_π only in the limit.
§ The question is - must we wait for exact convergence to v_π? Or can we stop short of that?
§ The small gridworld example showed that there is no change in the greedy policy after the first three iterations.
§ So the question is - is there a number of iterations after which the greedy policy does not change?


Value Iteration

§ A related question is - what about the extreme case of one iteration of policy evaluation followed by greedy policy improvement? If we repeat this cycle, does it find the optimal policy, at least in the limit?
§ The good news is that yes, this guarantee exists, and we will prove it soon. However, first let us modify the policy iteration algorithm to this extreme case. This is known as the 'value iteration' strategy.


Value Iteration
§ What policy iteration does: iterate over

    v_π ≐ v^(k+1)(s) ← Σ_{a∈A} π(a|s) { r(s,a) + γ Σ_{s'∈S} p(s'|s,a) v^(k)(s') }

  and then

    π'(s) = arg max_{a∈A} { r(s,a) + γ Σ_{s'∈S} p(s'|s,a) v_π(s') }

§ What value iteration does: evaluate, ∀a ∈ A,

    r(s,a) + γ Σ_{s'∈S} p(s'|s,a) v^(k)(s')

  and then take the max over it:

    v^(k+1)(s) = max_{a∈A} { r(s,a) + γ Σ_{s'∈S} p(s'|s,a) v^(k)(s') }    Where have we seen it?


Value Iteration
Algorithm 2: Value iteration

initialization: v ← v^0 ∈ V, pick an ε > 0, n ← 0;
while ||v^(n+1) − v^n|| > ε(1−γ)/(2γ) do
    foreach s ∈ S do
        v^(n+1)(s) ← max_{a} { r(s,a) + γ Σ_{s'} p(s'|s,a) v^n(s') }
    end
    n ← n + 1;
end
foreach s ∈ S do
    /* Note the use of π(s). It means a deterministic policy. */
    π(s) ← arg max_{a} { r(s,a) + γ Σ_{s'} p(s'|s,a) v^n(s') } ;    // n has already been incremented by 1
end

§ Take note of the stopping criterion.
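
A minimal NumPy sketch of Algorithm 2 follows, using the same ε(1−γ)/(2γ) stopping rule; the array layout (P[s, a, s'], r[s, a]) and names are assumptions of the sketch.

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, eps=1e-3):
    """Value iteration with the eps*(1-gamma)/(2*gamma) stopping criterion."""
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)
    threshold = eps * (1 - gamma) / (2 * gamma)
    while True:
        # One Bellman optimality backup for every state (two-array version)
        v_new = (r + gamma * P @ v).max(axis=1)
        done = np.max(np.abs(v_new - v)) <= threshold
        v = v_new
        if done:
            break
    # Extract a deterministic greedy policy from the final value estimate
    pi = (r + gamma * P @ v).argmax(axis=1)
    return v, pi
```

By statement IV of the value iteration theorem proved later, stopping at this threshold leaves the returned v within ε/2 of v_* in the max norm.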



Summary of Exact DP Algorithms for Planning

Problem     | Bellman Equation                                          | Algorithm
------------|-----------------------------------------------------------|-----------------------------
Prediction  | Bellman Expectation Equation                              | Iterative Policy Evaluation
Control     | Bellman Expectation Equation + Greedy Policy Improvement  | Policy Iteration
Control     | Bellman Optimality Equation                               | Value Iteration


Norms

Definition
Given a vector space V ⊆ R^d, a function f : V → R^+ is a norm (denoted as ||.||) if and only if
§ ||v|| ≥ 0 ∀v ∈ V
§ ||v|| = 0 if and only if v = 0
§ ||αv|| = |α| ||v||, ∀α ∈ R and ∀v ∈ V
§ Triangle inequality: ||u + v|| ≤ ||u|| + ||v||, ∀ u, v ∈ V


Different types of Norms

§ L_p norm:

    ||v||_p = ( Σ_{i=1}^{d} |v_i|^p )^(1/p)

§ L_0 norm:

    ||v||_0 = number of non-zero elements in v

§ L_∞ norm:

    ||v||_∞ = max_{1≤i≤d} |v_i|
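
For a quick numeric illustration (my own sketch, not from the slides), these norms can be computed with NumPy as follows.

```python
import numpy as np

v = np.array([3.0, -4.0, 0.0])

l1   = np.sum(np.abs(v))               # L1 norm: 7.0
l2   = np.sum(np.abs(v) ** 2) ** 0.5   # L2 norm: 5.0
l0   = np.count_nonzero(v)             # "L0 norm": number of non-zero entries, 2
linf = np.max(np.abs(v))               # L-infinity (max) norm: 4.0
print(l1, l2, l0, linf)
```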


Cauchy Sequence, Completeness


Definition
A sequence of vectors v_1, v_2, v_3, ... ∈ V (with subscripts n ∈ N) is called a Cauchy sequence if for any positive real ε > 0, ∃ N ∈ Z^+ such that ∀ m, n > N, ||v_m − v_n|| < ε.

§ Basically, for any positive real ε, an element can be found in the sequence beyond which any two elements of the sequence are within ε of each other.
§ In other words, the elements of the sequence come closer and closer to each other - i.e., the sequence converges.

Definition
A vector space V equipped with a norm ||.|| is complete if every Cauchy sequence converges in that norm to a point in the space. To pay tribute to Stefan Banach, the great Polish mathematician, such a space is also called a Banach space.

Contraction Mapping, Fixed Point

Definition
An operator T : V → V is L-Lipschitz if for any u, v ∈ V

    ||T u − T v|| ≤ L ||u − v||

§ If L ≤ 1, then T is called a non-expansion, while if 0 ≤ L < 1, then T is called a contraction.

Definition
Let v be a vector in the vector space V and T : V → V an operator. Then v is called a fixed point of the operator T if T v = v.


Banach Fixed Point Theorem

Theorem
Suppose V is a Banach space and T : V → V is a contraction mapping. Then,
- ∃ a unique v^* in V s.t. T v^* = v^*, and
- for an arbitrary v^0 in V, the sequence {v^n} defined by v^(n+1) = T v^n = T^(n+1) v^0 converges to v^*.

The above theorem tells us that
§ T has a fixed point, and it is a unique fixed point.
§ For an arbitrary starting point, if we keep repeatedly applying T to that starting point, then we will converge to v^*.
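
As a tiny illustration of the theorem (my own example, not from the slides), the affine map T(v) = r + γ P v on R^|S| is a γ-contraction in the max norm when P is row-stochastic, so iterating it from any starting point converges to its unique fixed point:

```python
import numpy as np

gamma = 0.9
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])           # a row-stochastic matrix
r = np.array([1.0, 2.0])

T = lambda v: r + gamma * P @ v       # a gamma-contraction in the max norm

v = np.zeros(2)                       # arbitrary starting point v^0
for _ in range(200):                  # repeatedly apply T
    v = T(v)

v_star = np.linalg.solve(np.eye(2) - gamma * P, r)  # exact fixed point
print(np.max(np.abs(v - v_star)))     # ~0: the iterates converge to v*
```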


Banach Fixed Point Theorem - Proof (1)


§ Let v^n and v^(m+n) be the two iterates of v obtained after the n-th and the (n+m)-th application of T. Then, with λ < 1 the contraction modulus of T,

    ||v^(m+n) − v^n|| ≤ Σ_{k=0}^{m−1} ||v^(n+k+1) − v^(n+k)||                  [Triangle inequality]
                      = Σ_{k=0}^{m−1} ||T^(n+k) v^1 − T^(n+k) v^0||
                      ≤ Σ_{k=0}^{m−1} λ ||T^(n+k−1) v^1 − T^(n+k−1) v^0||
                      ≤ Σ_{k=0}^{m−1} λ^(n+k) ||v^1 − v^0||                    [Repeated use of contraction]
                      = ||v^1 − v^0|| Σ_{k=0}^{m−1} λ^(n+k)
                      = ( λ^n (1 − λ^m) / (1 − λ) ) ||v^1 − v^0||              (3)


Banach Fixed Point Theorem - Proof (2)

§ As m and n → ∞, and since λ < 1, the norm of the difference between v^(m+n) and v^n becomes smaller and smaller.
§ That means the sequence {v^n} is Cauchy.
§ And since V is a Banach space, in which every Cauchy sequence converges to a point of the space, the Cauchy sequence {v^n} also converges to a point in V.
§ What we have proved so far is that the sequence {v^n} reaches a convergence point in the same space.
§ Let us say that the convergence point is v^*.
§ What we will prove next is that v^* is a fixed point, and then that v^* is the unique fixed point.


Banach Fixed Point Theorem - Proof (3)


§ Let us try to see what we get as the norm of the difference between
v ∗ and T v ∗ .
§ In the first line below we apply triangle inequality where v n is the
value of v at the nth iteration.

||T v ∗ − v ∗ || ≤ ||T v ∗ − v n || + ||v n − v ∗ ||


= ||T v ∗ − T v n−1 || + ||v n − v ∗ ||
≤ λ||v ∗ − v n−1 || + ||v n − v ∗ || [Contraction property]
(4)

§ Since {v n } is Cauchy and v ∗ is the convergence point, both the terms


in the above equation will tend to 0 as n → ∞.
§ So, as n → ∞, ||T v ∗ − v ∗ || → 0. That means in limit T v ∗ = v ∗ . So,
it is proved that v ∗ is a fixed point.

Banach Fixed Point Theorem - Proof (4)

§ Now we will show the uniqueness, i.e., that v^* is unique.
§ Let u^* and v^* be two fixed points in the space. From the contraction property, we can write ||T u^* − T v^*|| ≤ λ ||u^* − v^*||.
§ But, since u^* and v^* are fixed points, T u^* = u^* and T v^* = v^*.
§ That means ||u^* − v^*|| ≤ λ ||u^* − v^*||, which cannot be true for λ < 1 unless v^* = u^*.
§ So, it is proved that v^* is the unique fixed point.


Existence and Uniqueness of Bellman Equations

§ Now we will start talking about the existence and uniqueness of the solution to the Bellman expectation equation and the Bellman optimality equation.
§ In the case of a finite MDP, the value function v can be thought of as a vector in a |S|-dimensional vector space V.
§ Whenever we use the norm ||.|| in this space, we will mean the max norm, unless otherwise specified.


Existence and Uniqueness of Bellman Equations

§ Previously, we have seen
  I r_π(s) = Σ_{a∈A} π(a|s) r(s,a), the one-step expected reward for following policy π at state s.
  I p_π(s'|s) = Σ_{a∈A} π(a|s) p(s'|s,a), the one-step transition probability under policy π.
§ Using these notations, the Bellman expectation equation becomes

    v_π(s) = Σ_{a∈A} π(a|s) { r(s,a) + γ Σ_{s'∈S} p(s'|s,a) v_π(s') }
           = r_π(s) + γ Σ_{s'∈S} p_π(s'|s) v_π(s')


Existence and Uniqueness of Bellman Equations

§ v_π(s) = r_π(s) + γ Σ_{s'∈S} p_π(s'|s) v_π(s')
§ Refresher from earlier lectures, in matrix form:

    [ v(s_1) ]   [ r(s_1) ]       [ P_11  P_12  ...  P_1n ] [ v(s_1) ]
    [ v(s_2) ] = [ r(s_2) ]  + γ  [ P_21  P_22  ...  P_2n ] [ v(s_2) ]
    [  ...   ]   [  ...   ]       [  ...   ...  ...   ... ] [  ...   ]
    [ v(s_n) ]   [ r(s_n) ]       [ P_n1  P_n2  ...  P_nn ] [ v(s_n) ]

§ v_π = r_π + γ P_π v_π
§ r_π is a |S|-dimensional vector, while P_π is a |S| × |S| matrix.
§ For a fixed s, the probabilities p_π(s'|s) over all s' form one row (the s-th row) of the P_π matrix. Similarly, the v_π(s') are the value functions of all states, i.e., in vectorized notation they form the vector v_π.
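
Since v_π = r_π + γ P_π v_π is a linear system, it can be solved directly or by repeatedly applying the Bellman expectation backup; the short NumPy sketch below (illustrative names, a made-up two-state example) shows that both give the same answer.

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])        # P_pi: state-to-state transitions under pi
r_pi = np.array([1.0, 0.5])           # r_pi: one-step expected rewards under pi

# Direct solution of v = r_pi + gamma * P_pi v, i.e., (I - gamma P_pi) v = r_pi
v_direct = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Iterative policy evaluation: repeatedly apply the Bellman expectation backup
v_iter = np.zeros(2)
for _ in range(500):
    v_iter = r_pi + gamma * P_pi @ v_iter

print(v_direct, v_iter)               # the two agree up to numerical precision
```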


Existence and Uniqueness of Bellman Equations

§ v_π = r_π + γ P_π v_π
§ We now define a linear operator L_π : V → V such that

    L_π v ≡ r_π + γ P_π v,   ∀v ∈ V    [V as defined earlier]    (5)

§ Using this operator notation, we can write the Bellman expectation equation as

    L_π v_π = v_π    (6)

§ So far we have proved the Banach Fixed Point Theorem. Now we will try to show that L_π is a contraction.
§ We will hold the proof of V being a Banach space for later.


Existence and Uniqueness of Bellman Equations

§ Let u and v be in V. So,

    L_π u(s) = r_π(s) + γ Σ_{s'} p_π(s'|s) u(s')
    L_π v(s) = r_π(s) + γ Σ_{s'} p_π(s'|s) v(s')    (7)

§ One important note: L_π u(s) or L_π v(s) does not mean L_π applied to u(s) or v(s). It means the s-th component of the vector L_π u or L_π v.


Existence and Uniqueness of Bellman Equations


§ Let us consider the case L_π v(s) > L_π u(s). Then,

    0 ≤ L_π v(s) − L_π u(s)
      = γ Σ_{s'} p_π(s'|s) { v(s') − u(s') }
      ≤ γ ||v − u|| Σ_{s'} p_π(s'|s)          [Why is this?]
      = γ ||v − u||                            [Since Σ_{s'} p_π(s'|s) = 1]    (8)

§ Similarly, when L_π u(s) > L_π v(s), we can show that

    0 ≤ L_π u(s) − L_π v(s) ≤ γ ||u − v|| = γ ||v − u||    [Since ||u − v|| = ||v − u||]    (9)

Existence and Uniqueness of Bellman Equations

§ Putting the two equations (8) and (9) together, we get

    |L_π v(s) − L_π u(s)| ≤ γ ||v − u||    ∀s ∈ S    (10)

§ Pointwise, or componentwise, the difference is drawn closer by a factor of γ, so the maximum of the difference will also have come down:

    ||L_π v − L_π u|| ≤ γ ||v − u||    (11)

§ So, that means that L_π is a contraction.


Existence and Uniqueness of Bellman Equations


§ Another proof of the contraction property of the Bellman expectation operator:

    ||L_π v − L_π u||_∞ = max_{s∈S} | L_π v(s) − L_π u(s) |
                        = max_{s∈S} | r_π(s) + γ Σ_{s'∈S} p_π(s'|s) v(s') − r_π(s) − γ Σ_{s'∈S} p_π(s'|s) u(s') |
                        = γ max_{s∈S} | Σ_{s'∈S} p_π(s'|s) { v(s') − u(s') } |
                        ≤ γ max_{s∈S} Σ_{s'∈S} p_π(s'|s) | v(s') − u(s') |
                        ≤ γ max_{s∈S} Σ_{s'∈S} p_π(s'|s) ||v − u||_∞
                          [the absolute value of each element ≤ the max norm of the vector]
                        = γ ||v − u||_∞ max_{s∈S} Σ_{s'∈S} p_π(s'|s)  =  γ ||v − u||_∞    [the inner sum equals 1]
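
A quick numeric check of this contraction bound on random inputs (an illustrative sketch, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 6, 0.9

P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)     # row-stochastic transition matrix
r_pi = rng.random(n)

L_pi = lambda v: r_pi + gamma * P_pi @ v    # Bellman expectation operator

u, v = rng.random(n), rng.random(n)
lhs = np.max(np.abs(L_pi(v) - L_pi(u)))     # ||L_pi v - L_pi u||_inf
rhs = gamma * np.max(np.abs(v - u))         # gamma * ||v - u||_inf
print(lhs <= rhs + 1e-12)                   # True: L_pi is a gamma-contraction
```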


Existence and Uniqueness of Bellman Equations


§ Next we move on to the convergence proof for the Bellman optimality equation.
§ The Bellman optimality equation is given by

    v_*(s) = max_{a∈A} { r(s,a) + γ Σ_{s'∈S} p(s'|s,a) v_*(s') }    (12)

§ Let us define the Bellman optimality operator L : V → V such that

    (Lv)(s) ≡ max_{a∈A} { r(s,a) + γ Σ_{s'∈S} p(s'|s,a) v(s') }    ∀v ∈ V    (13)

§ To declutter notation, we will use Lv(s) to denote (Lv)(s).
§ Then the Bellman optimality equation becomes

    v_* = L v_*    // componentwise    (14)


Existence and Uniqueness of Bellman Equations


§ Now we will prove that L is contraction by taking the same route as
we took for Lπ .
§ Let u and v be in V. Let us also assume, first, that Lv(s) ≥ Lu(s).
Then we can write,

0 ≤ Lv(s) − Lu(s)
n X o n X o
= r(s, a∗s )+γ p(s0 /s, a∗s )v(s0 ) − r(s, (a0 )∗s )+γ p(s0 /s, (a0 )∗s )u(s0 )
s0 s0
n X o n X o
≤ r(s, a∗s )+γ p(s0 /s, a∗s )v(s0 ) − r(s, a∗s )+γ p(s0 /s, a∗s )u(s0 )
s0 s0
[why?? Note what has changed!] (15)

§ The two actions a∗s and (a0 )∗s maximize the value functions v and u
respectively at state s. So replacing (a0 )∗s with a∗s , in the second part
reduces the value of the second part.

Existence and Uniqueness of Bellman Equations


    0 ≤ Lv(s) − Lu(s)
      ≤ { r(s, a_s^*) + γ Σ_{s'} p(s'|s, a_s^*) v(s') } − { r(s, a_s^*) + γ Σ_{s'} p(s'|s, a_s^*) u(s') }
      = γ Σ_{s'} p(s'|s, a_s^*) [ v(s') − u(s') ]
      ≤ γ Σ_{s'} p(s'|s, a_s^*) ||v − u||          [Use of the max norm, similar to L_π]
      = γ ||v − u||                                 [Since Σ_{s'} p(s'|s, a_s^*) = 1]    (16)

Similarly, for the second case Lu(s) ≥ Lv(s), we can write

    0 ≤ Lu(s) − Lv(s) ≤ γ ||v − u||    (17)

Combining eqns. (16) and (17), |Lv(s) − Lu(s)| ≤ γ ||v − u|| ∀s ∈ S, which again, from the definition of the max norm, leads to ||Lv − Lu|| ≤ γ ||v − u||.
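
The same kind of numeric check works for the optimality operator (again an illustrative sketch with made-up arrays):

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 6, 3, 0.9

P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))

L = lambda v: (r + gamma * P @ v).max(axis=1)    # Bellman optimality operator

u, v = rng.random(n_states), rng.random(n_states)
lhs = np.max(np.abs(L(v) - L(u)))
rhs = gamma * np.max(np.abs(v - u))
print(lhs <= rhs + 1e-12)                         # True: L is a gamma-contraction
```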


Value Iteration Theorem

Theorem (Value Iteration Theorem (ref. S P. Singh and R C. Yee, 1993))
Let v^0 ∈ V and ε > 0, and let the sequence {v^n} be obtained from v^(n+1) = L v^n. Then
  I. v^n converges in norm to v^*.
  II. ∃ a finite N at which the condition ||v^(n+1) − v^n|| < ε(1−γ)/(2γ) is met ∀n > N.
  III. π(s) (obtained by arg max_{a} { r(s,a) + γ Σ_{s'} p(s'|s,a) v^(n+1)(s') } ∀s ∈ S) is ε-optimal.
  IV. ||v^(n+1) − v^*|| ≤ ε/2 when the condition ||v^(n+1) − v^n|| < ε(1−γ)/(2γ) holds.

§ Statement III means ||v_π − v^*|| ≤ ε, and statement IV tells us that ||v^(n+1) − v^*|| ≤ ε/2. Are they redundant?
§ No! Think about what v_π is and what v^(n+1) is.


Value Iteration Theorem


§ Though the figure is related to policy iteration, remember the figure
in slide (17).

§ Figure credit: [Singh and Yee, 1993]

§ Equality occurs if and only if value function given by the value


iteration algorithm is equal to the optimal policy.
§ What III is telling is that vπ is  optimal and what IV is telling is that
v n+1 is 2 optimal given condition in II.

Proof
§ Proof: Suppose, for some n, II is met, i.e., ||v^(n+1) − v^n|| < ε(1−γ)/(2γ), and π(s) is obtained as in III. Now, by the triangle inequality,

    ||v_π − v^*|| ≤ ||v_π − v^(n+1)|| + ||v^(n+1) − v^*||    (18)

§ Now, we have seen L_π to be such that

    L_π v(s) = r_π(s) + γ Σ_{s'} p_π(s'|s) v(s')
             = Σ_{a∈A} π(a|s) { r(s,a) + γ Σ_{s'∈S} p(s'|s,a) v(s') }    (19)

§ Let us apply L_π to v^(n+1), remembering that π is a deterministic policy. So,

    L_π v^(n+1)(s) = r(s, π(s)) + γ Σ_{s'} p(s'|s, π(s)) v^(n+1)(s')    (20)

Proof

§ Now, we have seen L to be such that

    Lv(s) ≡ max_{a∈A} { r(s,a) + γ Σ_{s'∈S} p(s'|s,a) v(s') }    ∀v ∈ V    (21)

§ So, similarly, let us apply L to v^(n+1):

    Lv^(n+1)(s) = max_{a∈A} { r(s,a) + γ Σ_{s'∈S} p(s'|s,a) v^(n+1)(s') }    (22)


Proof

§ Repeating eqns. (20) and (22):

    L_π v^(n+1)(s) = r(s, π(s)) + γ Σ_{s'} p(s'|s, π(s)) v^(n+1)(s')    (23)

    Lv^(n+1)(s) = max_{a∈A} { r(s,a) + γ Σ_{s'∈S} p(s'|s,a) v^(n+1)(s') }    (24)

§ Now, because π was chosen such that π(s) maximizes the argument inside the max{.} operator, it does not matter whether we apply L_π or L to v^(n+1): they are the same, i.e., Lv^(n+1) = L_π v^(n+1).


Proof

§ Now let us take the first term in eqn. (18) and proceed.

    ||v_π − v^(n+1)|| = ||L_π v_π − v^(n+1)||                                  [By eqn. (6) - fixed point]
                      ≤ ||L_π v_π − Lv^(n+1)|| + ||Lv^(n+1) − v^(n+1)||        [Triangle inequality]
                      = ||L_π v_π − L_π v^(n+1)|| + ||Lv^(n+1) − Lv^n||        [1. Using the previous slide  2. v^(n+1) = Lv^n]
                      ≤ γ ||v_π − v^(n+1)|| + γ ||v^(n+1) − v^n||              [Contraction mappings]

    ⟹ ||v_π − v^(n+1)|| ≤ ( γ / (1 − γ) ) ||v^(n+1) − v^n||
                         ≤ ( γ / (1 − γ) ) · ε (1 − γ) / (2γ)                  [By statement II of the theorem]
                         = ε / 2                                                (25)


Proof
§ Now let us take the second term in eqn. (18) and proceed.

    ||v^(n+1) − v^*|| ≤ Σ_{k=0}^{∞} ||v^(n+k+2) − v^(n+k+1)||           [Triangle inequality, applied repeatedly]
                      = Σ_{k=0}^{∞} ||L^(k+1) v^(n+1) − L^(k+1) v^n||   [From iterative application of L]
                      ≤ Σ_{k=0}^{∞} γ^(k+1) ||v^(n+1) − v^n||           [L is a contraction mapping]
                      = ( γ / (1 − γ) ) ||v^(n+1) − v^n||                [G.P. sum]
                      ≤ ( γ / (1 − γ) ) · ε (1 − γ) / (2γ)               [By statement II of the theorem]
                      = ε / 2                                             (26)

§ This is also the proof of statement IV of the theorem.

Proof

Now, putting eqn. (25) and eqn. (26) into eqn. (18), we get

    ||v_π − v^*|| ≤ ε/2 + ε/2 = ε    (27)

So, statement III is proved.


Asynchronous Dynamic Programming

§ A major drawback of DP methods is that they involve operations over the entire state set.
§ The game of backgammon has over 10^20 states. Even if we could perform the value iteration update on a million states per second, it would take over a thousand years to complete a single sweep.

    foreach s ∈ S do
        v^(n+1)(s) ← max_{a} { r(s,a) + γ Σ_{s'} p(s'|s,a) v^n(s') }
    end

§ In-place dynamic programming uses one single array to do the update:

    foreach s ∈ S do
        v(s) ← max_{a} { r(s,a) + γ Σ_{s'} p(s'|s,a) v(s') }
    end

§ For convergence, the order of updates does not matter, as long as every state continues to be selected for update (no state is ignored after some point); a minimal in-place sweep in this spirit is sketched below.
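
A minimal sketch of such an in-place, arbitrarily ordered sweep (same illustrative array layout as in the earlier sketches):

```python
import numpy as np

def async_sweep(v, P, r, gamma, rng):
    """One in-place sweep of Bellman optimality backups in a random state order."""
    for s in rng.permutation(len(v)):
        # v is overwritten immediately, so later backups within this sweep
        # already see the new value of state s
        v[s] = np.max(r[s] + gamma * P[s] @ v)
    return v
```

Repeating such sweeps, in any order that keeps updating every state, converges to v_* just like the synchronous version.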


Asynchronous Dynamic Programming

§ Real Time Dynamic Programming (RTDP): the main idea is again to reduce computation, but by not choosing the states to update at random.
§ In an MDP there may be many states which occur very rarely, i.e., they are seldom visited. So there is no point in putting much effort into discovering the true values of these states; the agent might not visit them at all.
§ Pick an initial state and run a policy/agent from that state. Then apply the DP update only to the states encountered along the way.
§ This changes the value function estimate. Get the policy from it, sample a trajectory again, and do updates along the trajectory.
§ Why is it called Real Time?
§ Many ideas from RTDP will be used in the full RL problem.
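
A rough RTDP-style sketch under the same array assumptions; the greedy action choice, sampling scheme, episode count and horizon are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def rtdp(P, r, gamma, start_state, terminals, episodes=100, horizon=50, seed=0):
    """Real Time Dynamic Programming: back up only the states visited along
    trajectories generated by the current greedy policy."""
    rng = np.random.default_rng(seed)
    n_states = P.shape[0]
    v = np.zeros(n_states)
    for _ in range(episodes):
        s = start_state
        for _ in range(horizon):
            q = r[s] + gamma * P[s] @ v          # one-step lookahead at s only
            a = int(np.argmax(q))                # greedy action w.r.t. current v
            v[s] = q[a]                          # in-place backup of the visited state
            s = rng.choice(n_states, p=P[s, a])  # sample the next state
            if s in terminals:
                break
    return v
```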
