HW 2
• Questions 1 to 4 are practice problems that should help you get familiar with the content in the course.
• Only Questions 5 and 6 (coding problems) will be graded. You will be required to submit the .ipynb file (IPython Notebook) or .py file(s) via Moodle. The submission deadline is 11:59 PM on 25th September.
1. Consider the MDP M = ⟨S, A, r, p, γ⟩, where S, A denote the state and action spaces, respectively, r : S × A → R is the reward function, p(s′|s, a) is the transition probability, and γ ∈ (0, 1) is the discount factor. The objective is to determine a policy π : S → A that maximizes the value function V_π(s), i.e.,

   max_π V_π(s) = E_{p_π} [ Σ_{t=0}^{∞} γ^t r(s_t, a_t) | s_0 = s ].

Write a linear programming formulation to determine the optimal value function V∗(s) for all s that solves the above optimization problem. Prove that this linear program indeed determines V∗(s) ∀ s ∈ S.
2. Consider the MDP M = ⟨S, A, c, p⟩, where S, A denote the state and action spaces, respectively, c : S × A → R is the cost function, and p(s′|s, a) is the transition probability. Consider the case where the state space contains a zero-cost termination state δ, i.e., c(δ, ·) = 0 and p(δ|δ, ·) = 1. The corresponding optimal value function V∗(s) is given as

   V∗(s) = min_π E_{p_π} [ Σ_{t=0}^{∞} c(s_t, π(s_t)) | s_0 = s ],

where the associated recursive Bellman equation that V∗(s) satisfies is given by

   V∗(s) = min_a [ c(s, a) + Σ_{s′} p(s′|s, a) V∗(s′) ],

where the bracketed quantity is defined as Q∗(s, a).
From the above definition, it is clear that Q∗(s, a) satisfies the following recursive Bellman equation

   Q∗(s, a) = c(s, a) + Σ_{s′} p(s′|s, a) min_{a′} Q∗(s′, a′) =: (H(Q∗))(s, a),

i.e., Q∗ is a fixed point of the map H defined by (H(Q))(s, a) := c(s, a) + Σ_{s′} p(s′|s, a) min_{a′} Q(s′, a′).
Show that the map H is a contraction map in a ξ-weighted norm ∥·∥_ξ, where ∥Q∥_ξ := max_{s,a} |Q(s, a)| / ξ_s for some ξ ∈ R^{|S|} with ξ > 0 componentwise. Recall that we did such a proof in class to show that the map corresponding to the value function is a contraction.
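(As a reminder of what needs to be established: H is a contraction in ∥·∥_ξ if there exists a modulus β ∈ [0, 1) such that ∥H(Q_1) − H(Q_2)∥_ξ ≤ β ∥Q_1 − Q_2∥_ξ for all Q_1, Q_2.)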
3. What is the difference between the following?
1. On-policy and off-policy algorithms.
2. Model-based and model-free RL algorithms.
3. TD learning and Q-learning.
Give examples of each type of algorithm above.
4. Consider the scenario where the action a_{t+1} is sampled from an ϵ-greedy policy. Are SARSA and Q-learning exactly the same in this case? If not, for which choice of policy for a_{t+1} are SARSA and Q-learning identical?
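For reference, the standard tabular updates (with step size α) are

   SARSA:       Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ],
   Q-learning:  Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_{a′} Q(s_{t+1}, a′) − Q(s_t, a_t) ],

where SARSA bootstraps with the action a_{t+1} actually selected by the behaviour policy, while Q-learning always bootstraps with the greedy action.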
5. (10 points) Coding Problem: Consider the Gridworld problem from Homework 1. Write code that uses the linear programming formulation to solve for the optimal value function. Print the optimal value function that you obtain and compare it with the solution given by the value iteration method from Homework 1.
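A minimal sketch of one way to set this up with scipy.optimize.linprog is given below. The arrays P (shape |S| × |A| × |S|), R (shape |S| × |A|), and the discount gamma are placeholders that you would build from your own Homework 1 Gridworld; the constraints encode the standard LP relaxation of the Bellman optimality equations.

    # Sketch only: P, R, and gamma are placeholders for the transition tensor,
    # reward table, and discount factor of your Homework 1 Gridworld.
    import numpy as np
    from scipy.optimize import linprog

    def lp_optimal_values(P, R, gamma):
        """P: (S, A, S) transition probabilities, R: (S, A) rewards."""
        S, A = R.shape
        # LP: minimize sum_s V(s) subject to
        #     V(s) >= R[s, a] + gamma * sum_{s'} P[s, a, s'] * V(s')  for all s, a.
        c = np.ones(S)
        A_ub = np.zeros((S * A, S))
        b_ub = np.zeros(S * A)
        for s in range(S):
            for a in range(A):
                row = gamma * P[s, a, :].copy()   # coefficients of V(s')
                row[s] -= 1.0                     # rewrite constraint as A_ub @ V <= b_ub
                A_ub[s * A + a] = row
                b_ub[s * A + a] = -R[s, a]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * S, method="highs")
        return res.x                              # V*(s) for every state s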
6. (30 points) Coding Problem: Consider an 8 × 8 Gridworld discussed in the paper https://auai.org/uai2016/proceedings/papers/219.pdf. The corresponding environment has been coded into a Python file GridWorldFox2016.py and provided to you for better understanding. You can go over the code to understand the cost functions and transition probabilities (as you would need these to determine V∗(s) using value (or policy) iteration).
Write a Q-learning algorithm to determine the optimal policy and value function for the above Gridworld. Use the metric presented in equation (31) of the paper to describe the evolution of the learning algorithm.
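A minimal tabular Q-learning sketch is given below, written for a cost-minimization setting (so the greedy action is the one with the lowest estimated cost-to-go). The environment interface assumed here (reset() and step(a) returning (next_state, cost, done), together with n_states and n_actions) is an assumption; adapt it to whatever GridWorldFox2016.py actually exposes, set gamma = 1 if the problem is an undiscounted total-cost problem as in Question 2, and add your own logging of the metric from equation (31).

    # Sketch only: the env interface below is assumed, not taken from
    # GridWorldFox2016.py -- adapt reset()/step() to the provided file.
    import numpy as np

    def q_learning(env, n_states, n_actions, gamma=0.95, alpha=0.1,
                   epsilon=0.1, n_episodes=5000, max_steps=200):
        Q = np.zeros((n_states, n_actions))
        rng = np.random.default_rng(0)
        for _ in range(n_episodes):
            s = env.reset()
            for _ in range(max_steps):
                # epsilon-greedy behaviour policy; greedy = lowest estimated cost-to-go
                if rng.random() < epsilon:
                    a = int(rng.integers(n_actions))
                else:
                    a = int(np.argmin(Q[s]))
                s_next, cost, done = env.step(a)
                # Q-learning target: bootstrap with the greedy (min-cost) action
                target = cost + (0.0 if done else gamma * np.min(Q[s_next]))
                Q[s, a] += alpha * (target - Q[s, a])
                s = s_next
                if done:
                    break
        V = Q.min(axis=1)           # estimated optimal value function
        policy = Q.argmin(axis=1)   # greedy (cost-minimizing) policy
        return Q, V, policy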