04 RL DP
Abir Das
IIT Kharagpur
Agenda
§ DP
§ Policy Evaluation
§ Policy Iteration
§ Value Iteration
§ Mathematical Tools
§ DP Extensions
Resources
Dynamic Programming
§ Dynamic Programming addresses a bigger problem by breaking it down into subproblems and then
- Solving the subproblems
- Combining the solutions to the subproblems
§ Dynamic Programming is based on the principle of optimality.
[Figure: a timeline from $0$ to $N$ showing the optimal action sequence $a_0^*, \cdots, a_k^*, \cdots, a_{N-1}^*$ and the tail subproblem starting at state $s_k^*$ at time $k$]
Principle of Optimality
Let $\{a_0^*, a_1^*, \cdots, a_{N-1}^*\}$ be an optimal action sequence with a corresponding state sequence $\{s_1^*, s_2^*, \cdots, s_N^*\}$. Consider the tail subproblem that starts at $s_k^*$ at time $k$ and maximizes the 'reward to go' from $k$ to $N$ over $\{a_k, \cdots, a_{N-1}\}$. Then the tail action sequence $\{a_k^*, \cdots, a_{N-1}^*\}$ is optimal for the tail subproblem.
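A minimal backward-induction sketch of this idea, assuming a small hypothetical finite-horizon problem with made-up tabular rewards `r[k, s, a]` and deterministic transitions `nxt[k, s, a]`: the optimal 'reward to go' at time $k$ is built from the optimal tails.

```python
import numpy as np

# Hypothetical finite-horizon problem: N steps, S states, A actions.
N, S, A = 3, 4, 2
rng = np.random.default_rng(0)
r = rng.uniform(0, 1, size=(N, S, A))        # r[k, s, a]: reward at time k
nxt = rng.integers(0, S, size=(N, S, A))     # nxt[k, s, a]: deterministic next state

# J[k, s] = optimal reward-to-go from state s at time k (J[N] = 0).
J = np.zeros((N + 1, S))
pi = np.zeros((N, S), dtype=int)
for k in reversed(range(N)):                 # solve the tail subproblems first
    for s in range(S):
        q = r[k, s] + J[k + 1, nxt[k, s]]    # one-step reward + optimal tail value
        pi[k, s] = np.argmax(q)
        J[k, s] = q[pi[k, s]]
```

By the principle of optimality, the actions `pi[k], ..., pi[N-1]` taken along an optimal trajectory are also optimal for the tail subproblem that starts at time $k$.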
Policy Evaluation
§ Iterative policy evaluation repeatedly applies the Bellman expectation backup:
$$v^{(k+1)}(s) \leftarrow \sum_{a \in \mathcal{A}} \pi(a|s) \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, v^{(k)}(s') \Big\}$$
§ In code, this can be implemented using two arrays - one for the old values $v^{(k)}(s)$ and the other for the new values $v^{(k+1)}(s)$. The new values $v^{(k+1)}(s)$ are computed one by one from the old values $v^{(k)}(s)$ without changing the old values.
§ Another way is to use one array and update the values 'in place', i.e., each new value immediately overwrites the old one.
§ Both of these converge to the true value $v_\pi$, and the 'in place' algorithm usually converges faster (a sketch of both variants follows).
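A minimal sketch of the two variants, assuming a tabular MDP given by hypothetical arrays `P[s, a, s']` (transition probabilities), `R[s, a]` (expected rewards) and a stochastic policy `pi[s, a]`:

```python
import numpy as np

def evaluate_two_arrays(P, R, pi, gamma, iters=200):
    """Synchronous backup: new values are computed from a frozen copy of the old ones."""
    S = P.shape[0]
    v = np.zeros(S)
    for _ in range(iters):
        v_new = np.empty(S)
        for s in range(S):
            q = R[s] + gamma * P[s] @ v    # q[a] = r(s,a) + gamma * sum_s' p(s'|s,a) v(s')
            v_new[s] = pi[s] @ q           # average over pi(a|s)
        v = v_new
    return v

def evaluate_in_place(P, R, pi, gamma, iters=200):
    """In-place backup: each new value immediately overwrites the old one."""
    S = P.shape[0]
    v = np.zeros(S)
    for _ in range(iters):
        for s in range(S):
            v[s] = pi[s] @ (R[s] + gamma * P[s] @ v)
    return v
```

Both functions converge to the same $v_\pi$; the in-place version typically needs fewer sweeps because later states already see the updated values of earlier states.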
§ Being greedy means choosing the action that will land the agent in the best state, i.e.,
$$\pi'(s) = \arg\max_{a \in \mathcal{A}} q_\pi(s,a) = \arg\max_{a \in \mathcal{A}} \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, v_\pi(s') \Big\}$$
§ In the Small Gridworld the improved policy was already optimal, $\pi' = \pi^*$
§ In general, more iterations of improvement/evaluation are needed
§ But this process of policy iteration always converges to $\pi^*$ (a greedy-improvement sketch follows)
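A sketch of one greedy improvement step, assuming the same hypothetical `P` and `R` arrays as in the policy evaluation sketch; it returns a deterministic policy that is greedy with respect to a given $v_\pi$:

```python
import numpy as np

def greedy_policy(P, R, v, gamma):
    """pi'(s) = argmax_a { r(s,a) + gamma * sum_s' p(s'|s,a) v(s') }."""
    S = P.shape[0]
    pi_new = np.zeros(S, dtype=int)
    for s in range(S):
        q = R[s] + gamma * P[s] @ v      # action values under v
        pi_new[s] = np.argmax(q)
    return pi_new
```

Alternating an `evaluate_*` call with `greedy_policy` is exactly the policy iteration loop discussed next.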
Policy Iteration
§ At each step of policy iteration the policy improves, i.e., the value function of the policy at a later iteration is greater than or equal to the value function of the policy at an earlier step.
§ This comes from the policy improvement theorem, which (informally) is: let $\pi_n$ be some stationary policy and let $\pi_{n+1}$ be greedy w.r.t. $v_{\pi_n}$; then $v_{\pi_{n+1}} \geq v_{\pi_n}$, i.e., $\pi_{n+1}$ is an improvement upon $\pi_n$.
$$\begin{aligned}
r_{\pi_{n+1}} + \gamma P_{\pi_{n+1}} v_{\pi_n} &\geq r_{\pi_n} + \gamma P_{\pi_n} v_{\pi_n} = v_{\pi_n} \quad \text{[Bellman eqn.]}\\
\implies r_{\pi_{n+1}} &\geq (I - \gamma P_{\pi_{n+1}})\, v_{\pi_n}\\
\implies (I - \gamma P_{\pi_{n+1}})^{-1} r_{\pi_{n+1}} &\geq v_{\pi_n}\\
\implies v_{\pi_{n+1}} &\geq v_{\pi_n} \qquad (2)
\end{aligned}$$
§ The first step: $\pi_{n+1}$ is obtained by maximizing $r_\pi + \gamma P_\pi v_{\pi_n}$ over all $\pi$'s. So $r_{\pi_{n+1}} + \gamma P_{\pi_{n+1}} v_{\pi_n}$ is at least as large as $r_\pi + \gamma P_\pi v_{\pi_n}$ for any other $\pi$; that 'any other $\pi$' happens to be $\pi_n$.
§ The third step holds because $(I - \gamma P_{\pi_{n+1}})^{-1} = \sum_{k \geq 0} \gamma^k P_{\pi_{n+1}}^k$ has only non-negative entries, so multiplying both sides by it preserves the inequality; the last step uses $v_{\pi_{n+1}} = (I - \gamma P_{\pi_{n+1}})^{-1} r_{\pi_{n+1}}$.
§ Policy iteration involves the policy evaluation step first, and this step itself requires iterations that converge to the exact value $v_\pi$ only in the limit.
§ The question is - must we wait for exact convergence to $v_\pi$? Or can we stop short of that?
§ The Small Gridworld example showed that the greedy policy does not change after the first three iterations.
§ So the question is - is there a number of iterations after which the greedy policy stops changing?
Value Iteration
§ What policy iteration does: iterate over
$$v_\pi \doteq v^{(k+1)}(s) \leftarrow \sum_{a \in \mathcal{A}} \pi(a|s) \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, v^{(k)}(s') \Big\}$$
§ And then
$$\pi'(s) = \arg\max_{a \in \mathcal{A}} \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, v_\pi(s') \Big\}$$
§ Value iteration, instead, iterates directly over
$$v^{(k+1)}(s) = \max_{a \in \mathcal{A}} \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, v^{(k)}(s') \Big\} \qquad \text{Where have we seen it?}$$
Algorithm 2: Value iteration
initialization: $v \leftarrow v^0 \in V$, pick an $\epsilon > 0$, $n \leftarrow 0$;
while $||v^{n+1} - v^n|| > \epsilon\frac{1-\gamma}{2\gamma}$ do
    foreach $s \in \mathcal{S}$ do
        $v^{n+1}(s) \leftarrow \max_a \big\{ r(s,a) + \gamma \sum_{s'} p(s'|s,a)\, v^n(s') \big\}$
    end
    $n \leftarrow n + 1$;
end
foreach $s \in \mathcal{S}$ do
    /* Note the use of $\pi(s)$. It means a deterministic policy */
    $\pi(s) \leftarrow \arg\max_a \big\{ r(s,a) + \gamma \sum_{s'} p(s'|s,a)\, v^n(s') \big\}$;  // $n$ has already been incremented by 1
end
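A sketch of Algorithm 2 in Python, assuming the same hypothetical `P[s, a, s']` and `R[s, a]` arrays as in the earlier sketches; the stopping threshold $\epsilon\frac{1-\gamma}{2\gamma}$ is the one that appears in statement II of the theorem proved later.

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-6):
    S = P.shape[0]
    v = np.zeros(S)                               # v^0: arbitrary starting point
    while True:
        # One Bellman optimality backup for every state (the operator L).
        v_new = np.max(R + gamma * P @ v, axis=1)
        if np.max(np.abs(v_new - v)) <= eps * (1 - gamma) / (2 * gamma):
            v = v_new
            break
        v = v_new
    # Greedy deterministic policy w.r.t. the final value estimate.
    policy = np.argmax(R + gamma * P @ v, axis=1)
    return v, policy
```

Here `P @ v` contracts over the next-state axis, so `R + gamma * P @ v` is the $|\mathcal{S}| \times |\mathcal{A}|$ table of one-step look-ahead values.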
Norms
Definition
Given a vector space $V \subseteq \mathbb{R}^d$, a function $f : V \to \mathbb{R}^+$ is a norm (denoted $||\cdot||$) if and only if
§ $||v|| \geq 0 \;\; \forall v \in V$
§ $||v|| = 0$ if and only if $v = 0$
§ $||\alpha v|| = |\alpha|\, ||v|| \;\; \forall \alpha \in \mathbb{R}$ and $\forall v \in V$
§ Triangle inequality: $||u + v|| \leq ||u|| + ||v|| \;\; \forall u, v \in V$
§ $L_p$ norm:
$$||v||_p = \left( \sum_{i=1}^{d} |v_i|^p \right)^{\frac{1}{p}}$$
§ $L_0$ norm: $||v||_0$ = the number of non-zero components of $v$
§ $L_\infty$ norm:
$$||v||_\infty = \max_{1 \leq i \leq d} |v_i|$$
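A quick numerical check of these norms on a small made-up vector, using numpy:

```python
import numpy as np

v = np.array([3.0, -4.0, 0.0])
p = 3
lp   = np.sum(np.abs(v) ** p) ** (1.0 / p)   # L_p norm, here p = 3
l0   = np.count_nonzero(v)                   # L_0 "norm": number of non-zero entries
linf = np.max(np.abs(v))                     # L_infinity (max) norm
print(lp, l0, linf)                          # ~4.498, 2, 4.0
```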
Definition
An operator $T : V \to V$ is L-Lipschitz if for any $u, v \in V$, $||Tu - Tv|| \leq L\, ||u - v||$. If $L < 1$, the operator $T$ is called a contraction (contraction mapping).
Theorem
Suppose $V$ is a Banach space and $T : V \to V$ is a contraction mapping. Then,
- $\exists$ a unique $v^*$ in $V$ s.t. $T v^* = v^*$, and
- for an arbitrary $v^0$ in $V$, the sequence $\{v^n\}$ defined by $v^{n+1} = T v^n = T^{n+1} v^0$ converges to $v^*$.
The above theorem tells us that
§ $T$ has a fixed point, and that fixed point is unique.
§ Starting from an arbitrary point and repeatedly applying $T$, we converge to $v^*$.
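A tiny numerical illustration of the theorem, using the made-up scalar contraction $T(v) = \gamma v + 1$ with $\gamma = 0.9$ (Lipschitz constant $0.9 < 1$), whose unique fixed point is $v^* = 1/(1-\gamma) = 10$:

```python
gamma = 0.9
T = lambda v: gamma * v + 1.0    # a contraction with modulus gamma

v = -50.0                        # arbitrary starting point v^0
for n in range(200):             # v^{n+1} = T v^n
    v = T(v)
print(v)                         # ~10.0, the unique fixed point v*
```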
§ Now we will start talking about the existence and uniqueness of the solution to the Bellman expectation equations and the Bellman optimality equations.
§ In case of a finite MDP, the value function $v$ can be thought of as a vector in a $|\mathcal{S}|$ dimensional vector space $V$.
§ Whenever we use the norm $||\cdot||$ in this space we will mean the max norm, unless otherwise specified.
§ $v_\pi = r_\pi + \gamma P_\pi v_\pi$
§ $r_\pi$ is a $|\mathcal{S}|$ dimensional vector while $P_\pi$ is a $|\mathcal{S}| \times |\mathcal{S}|$ dimensional matrix.
§ For all $s'$, the $p_\pi(s'|s)$ values form one row (the $s^{\text{th}}$ row) of the $P_\pi$ matrix. Similarly, the $v_\pi(s')$'s are the values of all states, i.e., in vectorized notation, the vector $v_\pi$.
§ $v_\pi = r_\pi + \gamma P_\pi v_\pi$
§ We are now going to define a linear operator $L_\pi : V \to V$ such that
$$L_\pi v \equiv r_\pi + \gamma P_\pi v \quad \forall v \in V, \;\text{[$V$ as defined earlier]} \qquad (5)$$
§ So far we have proved the Banach Fixed Point Theorem. Now we will try to show that $L_\pi$ is a contraction.
§ We will hold the proof of $V$ being a Banach space for later.
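A sketch of $L_\pi$ in matrix form on a made-up 3-state example (`P_pi` a row-stochastic $|\mathcal{S}| \times |\mathcal{S}|$ matrix, `r_pi` a $|\mathcal{S}|$ vector), comparing repeated application of $L_\pi$ with the direct linear solve $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$:

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.2, 0.6, 0.2],
                 [0.0, 0.3, 0.7]])          # row-stochastic transition matrix under pi
r_pi = np.array([1.0, 0.0, 2.0])            # expected one-step reward under pi

L_pi = lambda v: r_pi + gamma * P_pi @ v    # the operator L_pi

# Fixed-point iteration v^{n+1} = L_pi v^n converges because L_pi is a gamma-contraction.
v = np.zeros(3)
for _ in range(500):
    v = L_pi(v)

# Direct solve of (I - gamma P_pi) v_pi = r_pi gives the same unique fixed point.
v_direct = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(np.allclose(v, v_direct))             # True
```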
$$\begin{aligned}
0 \leq L_\pi v(s) - L_\pi u(s) &= \gamma \sum_{s'} p_\pi(s'|s)\, \{v(s') - u(s')\}\\
&\leq \gamma\, ||v - u|| \sum_{s'} p_\pi(s'|s) \quad \text{[Why is this?]}\\
&= \gamma\, ||v - u|| \quad \Big[\text{Since } \sum_{s'} p_\pi(s'|s) = 1\Big] \qquad (8)
\end{aligned}$$
$$\begin{aligned}
0 \leq L v(s) - L u(s) &= \Big\{ r(s, a_s^*) + \gamma \sum_{s'} p(s'|s, a_s^*)\, v(s') \Big\} - \Big\{ r(s, (a')_s^*) + \gamma \sum_{s'} p(s'|s, (a')_s^*)\, u(s') \Big\}\\
&\leq \Big\{ r(s, a_s^*) + \gamma \sum_{s'} p(s'|s, a_s^*)\, v(s') \Big\} - \Big\{ r(s, a_s^*) + \gamma \sum_{s'} p(s'|s, a_s^*)\, u(s') \Big\}\\
&\qquad \text{[why?? Note what has changed!]} \qquad (15)
\end{aligned}$$
§ The two actions $a_s^*$ and $(a')_s^*$ are the maximizing actions at state $s$ for the backups under $v$ and $u$ respectively. So replacing $(a')_s^*$ with $a_s^*$ in the second bracket can only reduce its value, which makes the difference at least as large.
Combining eqns. (16) and (17), $|Lv(s) - Lu(s)| \leq \gamma\, ||v - u|| \;\; \forall s \in \mathcal{S}$, which, again from the definition of the max norm, leads to $||Lv - Lu|| \leq \gamma\, ||v - u||$.
Proof
§ Proof: Suppose, for some $n$, II is met, i.e., $||v^{n+1} - v^n|| < \epsilon\frac{1-\gamma}{2\gamma}$, and $\pi(s)$ is obtained by III. Now, by the triangle inequality,
$$||v_\pi - v^*|| \leq ||v_\pi - v^{n+1}|| + ||v^{n+1} - v^*|| \qquad (18)$$
§ $$L v^{n+1}(s) = \max_{a \in \mathcal{A}} \Big\{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, v^{n+1}(s') \Big\} \qquad (24)$$
§ Now let us take the first term in eqn. (18) and proceed. An analogous argument, using the fact that $L_\pi$ is a contraction, bounds this term by $\epsilon/2$.
§ Now let us take the second term in eqn. (18) and proceed.
$$\begin{aligned}
||v^{n+1} - v^*|| &\leq \sum_{k=0}^{\infty} ||v^{n+k+2} - v^{n+k+1}|| \quad \text{[Triangle inequality repeatedly]}\\
&= \sum_{k=0}^{\infty} ||L^{k+1} v^{n+1} - L^{k+1} v^n|| \quad \text{[From iterative application of } L]\\
&\leq \sum_{k=0}^{\infty} \gamma^{k+1}\, ||v^{n+1} - v^n|| \quad [L \text{ is a contraction mapping}]\\
&= \frac{\gamma}{1-\gamma}\, ||v^{n+1} - v^n|| \quad \text{[G.P. sum]}\\
&\leq \frac{\gamma}{1-\gamma} \cdot \epsilon\,\frac{1-\gamma}{2\gamma} \quad \text{[By statement II of the theorem]}\\
&= \frac{\epsilon}{2} \qquad (26)
\end{aligned}$$
§ This is also the proof of statement IV of the theorem.
DP Extensions
§ For convergence, the order of the updates does not matter as long as every state continues to be selected, i.e., no state is permanently left out of the updates (a sketch of such an asynchronous sweep follows).
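A sketch of an asynchronous, in-place variant of the value iteration backup, assuming the same hypothetical `P[s, a, s']` and `R[s, a]` arrays as before; states are swept in a random order and each new value immediately overwrites the old one:

```python
import numpy as np

def async_value_iteration(P, R, gamma, sweeps=1000, seed=0):
    rng = np.random.default_rng(seed)
    S = P.shape[0]
    v = np.zeros(S)
    for _ in range(sweeps):
        for s in rng.permutation(S):                  # arbitrary order each sweep
            v[s] = np.max(R[s] + gamma * P[s] @ v)    # in-place Bellman optimality backup
    return v
```

Because every state keeps being updated, this converges to the same $v^*$ as the synchronous Algorithm 2.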