Assignment 02
1. Consider the MDP M ≡ (S, A, P, r, γ) with |S| = S and |A| = A. Suppose µ is a stochastic
policy and Φ ∈ R^(S×d) a feature matrix for some d ≥ 1. Let Pµ be the S × S matrix given by
Pµ(s′|s) = Σ_a µ(a|s) P(s′|s, a).
This matrix represents the transition matrix of the Markov chain (S, Pµ) induced by µ.
Suppose this Markov chain is ergodic so that it has a unique stationary distribution, which
we denote by dµ. Let Dµ be the S × S diagonal matrix whose diagonal is dµ, and let
A = Φ⊤ Dµ (I − γPµ)Φ and b = Φ⊤ Dµ rµ,
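(As a sanity check on these definitions, under the standard convention, assumed here, that rµ ∈ R^S is the expected one-step reward vector with rµ(s) = Σ_a µ(a|s) r(s, a): in the tabular case Φ = I, both Dµ and I − γPµ are invertible, since ergodicity gives dµ > 0 and γ < 1, so
A⁻¹b = (Dµ(I − γPµ))⁻¹ Dµ rµ = (I − γPµ)⁻¹ rµ,
which is exactly the value function of µ.)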
2. Let Tµ be the Bellman operator defined in the above question. Show that Tµ is a γ-contraction
with respect to ∥·∥Dµ. [05]
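Here ∥·∥Dµ is assumed to be the usual dµ-weighted Euclidean norm,
∥x∥Dµ := √(x⊤ Dµ x) = ( Σ_s dµ(s) x(s)² )^(1/2),  x ∈ R^S.
A standard intermediate fact you may wish to establish first is that stationarity of dµ (i.e., dµ⊤ Pµ = dµ⊤), combined with Jensen's inequality, gives ∥Pµ x∥Dµ ≤ ∥x∥Dµ.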
where
dn = 1 if n = 0, and dn = σγ^n if n ≥ 1;
further, an ∼ µ(·|sn) and sn+1 ∼ P(·|sn, an) for n ≥ 0.
Answer the following questions.
(c) Suppose the transition matrix P is unknown. Can you design a model-free algorithm to
estimate Jµ? You can presume that you can sample from the invariant distribution of
the Markov chain induced by µ. Discuss the almost sure convergence of this algorithm. You
may directly use the results that were discussed in class. [08]
Show that f̄ is quasi-monotone increasing. Additionally, show that f(y) ≤ f̄(y) for all y.
Finally, show that the origin is the globally asymptotically stable equilibrium for the ODE
ż(t) = f(z(t)).
Use this to conclude that the solution trajectory of the noiseless Q-learning ODE is asymptotically lower bounded by the zero vector. [05]
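(Throughout, quasi-monotonicity is taken in the standard Kamke sense, which is assumed to coincide with the definition given in class: f̄ is quasi-monotone increasing if, for every index i and all x ≤ y componentwise with xi = yi, one has f̄i(x) ≤ f̄i(y). This is the condition under which the vector comparison lemma can be applied to the two ODEs above.)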