DRL_Homework_1
Assignment 1
1. Fill out the table with the results of value iteration with a discount factor γ = 0.9 [9 pts]:
2. At k = 2 with γ = 0.9, what policy would you select? Is it necessarily true that this is
the optimal policy? At k = 3, what policy would you select? Is it necessarily true that
this is the optimal policy? [6 pts]
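For reference, each entry of such a table comes from the Bellman optimality backup V_{k+1}(s) = max_a [R(s, a) + γ Σ_{s'} p(s'|s, a) V_k(s')]. Below is a minimal sketch of that computation, assuming the MDP is encoded as hypothetical numpy arrays R of shape (S, A) and P of shape (S, A, S); the actual rewards and transitions are those given in the problem.

import numpy as np

def value_iteration_table(R, P, gamma=0.9, num_iters=3):
    """Return [V_0, V_1, ..., V_num_iters] for a tabular MDP.

    R: rewards, shape (S, A).  P: transition probabilities p(s'|s, a), shape (S, A, S).
    """
    V = np.zeros(R.shape[0])
    history = [V.copy()]
    for _ in range(num_iters):
        # Bellman optimality backup: V_{k+1}(s) = max_a [ R(s, a) + gamma * sum_s' p(s'|s, a) V_k(s') ]
        Q = R + gamma * P @ V        # shape (S, A)
        V = Q.max(axis=1)
        history.append(V.copy())
    return history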
We also saw the contraction operator B^π, which is the Bellman backup operator for a
particular policy, given below:

(B^π V)(s) = E_{a∼π}[ R(s, a) + γ Σ_{s'∈S} p(s'|s, a) V(s') ]    (2)
(a) Recall that ||BV − BV'|| ≤ γ||V − V'|| for any two value functions V and V'. Prove
that B^π is also a contraction mapping: ||B^π V − B^π V'|| ≤ γ||V − V'||. [5 pts]
(b) Prove that the fixed point of B^π is unique. What is the fixed point of B^π? [5 pts]
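A quick numerical sanity check can build intuition for parts (a) and (b), though it is not a substitute for the proofs. The sketch below, with illustrative array names, applies B^π once and compares ||B^π V − B^π V'||_∞ against γ||V − V'||_∞ on a random MDP.

import numpy as np

def bellman_backup_pi(V, R, P, pi, gamma=0.9):
    """One application of B^pi: (B^pi V)(s) = E_{a~pi(.|s)}[ R(s, a) + gamma * sum_s' p(s'|s, a) V(s') ].

    R: (S, A) rewards, P: (S, A, S) transitions, pi: (S, A) action probabilities per state.
    """
    Q = R + gamma * P @ V          # (S, A)
    return (pi * Q).sum(axis=1)    # expectation over a ~ pi(.|s)

# Compare ||B^pi V - B^pi V'||_inf with gamma * ||V - V'||_inf on a random MDP.
rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
R = rng.uniform(size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))    # each row is a distribution over s'
pi = rng.dirichlet(np.ones(A), size=S)        # a distribution over actions for each state
V1, V2 = rng.normal(size=S), rng.normal(size=S)
lhs = np.max(np.abs(bellman_backup_pi(V1, R, P, pi, gamma) - bellman_backup_pi(V2, R, P, pi, gamma)))
rhs = gamma * np.max(np.abs(V1 - V2))
print(lhs <= rhs)    # True here, consistent with the contraction bound (not a proof)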
In value iteration, we repeatedly apply the Bellman backup operator B to improve our value
function. At the end of value iteration, we can recover a greedy policy π from the value function
using the equation below:
π(s) = arg max_a [ r(s, a) + γ Σ_{s'∈S} p(s'|s, a) V(s') ]    (3)
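For reference, equation (3) is a one-line argmax over the same action values used in the earlier sketch; R and P are again the hypothetical reward and transition arrays.

def greedy_policy(V, R, P, gamma=0.9):
    """Equation (3): pi(s) = argmax_a [ r(s, a) + gamma * sum_s' p(s'|s, a) V(s') ]."""
    Q = R + gamma * P @ V      # (S, A) action values under V
    return Q.argmax(axis=1)    # one greedy action index per state (ties broken arbitrarily)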
Suppose we run value iteration for a finite number of steps to obtain a value function V (V
has not necessarily converged to V*). Say now that we evaluate our policy π obtained using
the formula above to get V^π. Note that here, and for the rest of Q2, π refers to the
greedy policy.
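Evaluating the greedy policy to obtain V^π can be done exactly by solving the linear system V^π = r^π + γ P^π V^π. A minimal sketch under the same assumptions as above (deterministic π given as an array of action indices; R and P are the hypothetical arrays from the earlier snippets):

import numpy as np

def evaluate_policy(pi, R, P, gamma=0.9):
    """Exact V^pi for a deterministic policy pi (array of action indices):
    solve the linear system (I - gamma * P^pi) V^pi = r^pi."""
    S = R.shape[0]
    r_pi = R[np.arange(S), pi]    # r(s, pi(s)), shape (S,)
    P_pi = P[np.arange(S), pi]    # p(s'|s, pi(s)), shape (S, S)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)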
In lecture, we learned that running value iteration until a certain tolerance can bring us close
to recovering the optimal value function. Let V_n and V_{n+1} be the outputs of value iteration at
the nth and (n+1)th iterations, respectively. Let ε > 0 and consider the point in value iteration
at which ||V_{n+1} − V_n|| < ε(1 − γ)/(2γ). Let π be the greedy policy given the value function V_{n+1}.
You will now prove that this policy π is ε-optimal. This result justifies why halting value
iteration once the difference between successive iterations is sufficiently small ensures that the
policy obtained by acting greedily with respect to the resulting value function is near-optimal.
Precisely, if

||V_{n+1} − V_n|| < ε(1 − γ)/(2γ)    (4)

then

||V^π − V^*|| ≤ ε    (5)
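In practice, this stopping rule plugs directly into the value iteration loop. A sketch under the same hypothetical R, P arrays as before: stop once successive iterates differ by less than ε(1 − γ)/(2γ) in the max norm, then act greedily with respect to V_{n+1}.

import numpy as np

def value_iteration_to_tolerance(R, P, gamma, eps):
    """Run value iteration until ||V_{n+1} - V_n||_inf < eps * (1 - gamma) / (2 * gamma),
    then return the greedy policy with respect to V_{n+1}."""
    V = np.zeros(R.shape[0])
    tol = eps * (1 - gamma) / (2 * gamma)
    while True:
        V_next = (R + gamma * P @ V).max(axis=1)    # one Bellman optimality backup
        if np.max(np.abs(V_next - V)) < tol:
            break
        V = V_next
    pi = (R + gamma * P @ V_next).argmax(axis=1)    # greedy policy from equation (3)
    return pi, V_next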
(d) When π is the greedy policy, what is the relationship between B and B^π? [2 pts]
(h) Use the results from parts (e) and (g) to show that ||V^π − V^*|| ≤ ε. [2 pts]
Here Ṽ^π_{h+1}(s') is the expected total reward of π under M̃ from time h + 1. [10 Points]
(b) Imagine the following situation: we have M as the true MDP, and we have M̃ as
some learned model that is supposed to approximate M. Given M̃, the
natural thing to do is to compute the optimal policy under M̃, i.e.,

π̃* = arg max_{π∈Π} Ṽ^π,
where Π ⊂ {π : S → ∆(A)} is a pre-defined policy class. Let us also denote the true
optimal policy π* under the real model as:

π* = arg max_{π∈Π} V^π.
A natural question is: what is the performance of π̃* under the real model M,
compared to that of π* under M? To answer this question, let's assume r(s, a) ∈ [0, 1] for all
s, a and prove the following inequality:
V^{π*} − V^{π̃*} ≤ H Σ_{h=0}^{H−1} [ E_{s,a∼P_h^{π*}} ||P̃_h(·|s, a) − P_h(·|s, a)||_1 + E_{s,a∼P_h^{π̃*}} ||P̃_h(·|s, a) − P_h(·|s, a)||_1 ]
(Hint: ∫ |f(x)g(x)| dx ≤ ||f(·)||_1 ||g(·)||_∞.) [15 Points]
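To make the right-hand side concrete, each expectation is over the state-action distribution that the corresponding policy induces at step h, and the inner quantity is the L1 distance between the learned and true next-state distributions. Below is a small numeric sketch of those two ingredients, assuming for simplicity time-homogeneous transitions and a stationary stochastic policy; all array names are illustrative.

import numpy as np

def state_action_dists(P, pi, mu0, H):
    """State-action distributions d_h(s, a), h = 0, ..., H-1, induced by a stationary
    stochastic policy pi (S, A) under transitions P (S, A, S), starting from mu0 (S,)."""
    d = np.zeros((H,) + pi.shape)
    s_dist = mu0.copy()
    for h in range(H):
        d[h] = s_dist[:, None] * pi                # d_h(s, a) = Pr(s_h = s) * pi(a|s)
        s_dist = np.einsum('sa,sap->p', d[h], P)   # next-state distribution
    return d

def model_error_term(d, P_true, P_model):
    """sum_h E_{s,a ~ d_h} || P_model(.|s, a) - P_true(.|s, a) ||_1."""
    l1 = np.abs(P_model - P_true).sum(axis=-1)     # (S, A) table of L1 distances
    return sum((d_h * l1).sum() for d_h in d)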