
Reinforcement Learning (E1 277)

- Assignment 02 -

1. Consider the MDP M ≡ (S, A, P, r, γ) with |S| = S and |A| = A. Suppose µ is a stochastic
policy and Φ ∈ R^{S×d} a feature matrix for some d ≥ 1. Let Pµ be the S × S matrix given by

        Pµ(s′|s) = Σ_a µ(a|s) P(s′|s, a).

This is the transition matrix of the Markov chain (S, Pµ) induced by µ.
Suppose this Markov chain is ergodic, so that it has a unique stationary distribution, which
we denote by dµ. Let Dµ be the S × S diagonal matrix whose diagonal is dµ, and let

        A = Φ⊤ Dµ (I − γPµ) Φ   and   b = Φ⊤ Dµ rµ,

where rµ(s) = Σ_a µ(a|s) r(s, a). Let Π : R^S → R^S be given by ΠJ = Φ(Φ⊤ Dµ Φ)⁻¹ Φ⊤ Dµ J.
Show that θ∗ := A⁻¹ b is the fixed point of the projected Bellman operator, i.e.,
Π Tµ Φθ∗ = Φθ∗, where Tµ : R^S → R^S is the Bellman operator satisfying Tµ J = rµ + γPµ J. [05]
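
For intuition, the fixed-point identity can be checked numerically. The sketch below (a minimal Python illustration, using a randomly generated MDP, policy, and feature matrix, none of which are part of the assignment) forms A, b, Π, and Tµ as defined above and verifies that Π Tµ Φθ∗ and Φθ∗ agree up to floating-point error.

```python
# Numerical sanity check for Question 1 (a sketch, not a proof): for a small,
# hypothetical randomly generated MDP, policy, and feature matrix, verify that
# theta* = A^{-1} b satisfies Pi T_mu (Phi theta*) = Phi theta*.
import numpy as np

rng = np.random.default_rng(0)
S, A, d, gamma = 6, 3, 2, 0.9

P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)    # P(s'|s,a)
r = rng.random((S, A))                                          # r(s,a)
mu = rng.random((S, A)); mu /= mu.sum(axis=1, keepdims=True)    # mu(a|s)
Phi = rng.random((S, d))                                        # feature matrix

# Induced chain, induced reward, and stationary distribution of (S, P_mu).
P_mu = np.einsum('sa,sat->st', mu, P)
r_mu = (mu * r).sum(axis=1)
evals, evecs = np.linalg.eig(P_mu.T)
d_mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
d_mu = d_mu / d_mu.sum()
D_mu = np.diag(d_mu)

A_mat = Phi.T @ D_mu @ (np.eye(S) - gamma * P_mu) @ Phi
b_vec = Phi.T @ D_mu @ r_mu
theta_star = np.linalg.solve(A_mat, b_vec)

Proj = Phi @ np.linalg.solve(Phi.T @ D_mu @ Phi, Phi.T @ D_mu)  # projection Pi

def T_mu(J):
    """Bellman operator T_mu J = r_mu + gamma * P_mu J."""
    return r_mu + gamma * P_mu @ J

lhs = Proj @ T_mu(Phi @ theta_star)
rhs = Phi @ theta_star
print("max |Pi T_mu Phi theta* - Phi theta*| =", np.max(np.abs(lhs - rhs)))
```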

2. Let Tµ be the Bellman operator defined in the above question. Show that Tµ is a γ-contraction
with respect to the weighted norm ∥ · ∥_{Dµ}. [05]
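
As a quick sanity check (not a proof), the sketch below draws random vectors J1, J2 for an assumed random MDP and policy and compares ∥Tµ J1 − Tµ J2∥_{Dµ} with γ∥J1 − J2∥_{Dµ}, where ∥x∥_{Dµ} = (x⊤ Dµ x)^{1/2}.

```python
# Minimal numerical illustration of the gamma-contraction in Question 2,
# under an assumed random MDP and policy (not specified in the assignment).
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 6, 3, 0.9

P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)    # P(s'|s,a)
r = rng.random((S, A))                                          # r(s,a)
mu = rng.random((S, A)); mu /= mu.sum(axis=1, keepdims=True)    # mu(a|s)

P_mu = np.einsum('sa,sat->st', mu, P)
r_mu = (mu * r).sum(axis=1)
evals, evecs = np.linalg.eig(P_mu.T)
d_mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
d_mu = d_mu / d_mu.sum()                                        # stationary distribution

def weighted_norm(x):
    """||x||_{D_mu} = sqrt(x^T D_mu x)."""
    return np.sqrt(np.sum(d_mu * x ** 2))

def T_mu(J):
    return r_mu + gamma * P_mu @ J

for _ in range(5):
    J1, J2 = rng.normal(size=S), rng.normal(size=S)
    lhs = weighted_norm(T_mu(J1) - T_mu(J2))
    rhs = gamma * weighted_norm(J1 - J2)
    print(f"{lhs:.6f} <= {rhs:.6f} : {lhs <= rhs + 1e-12}")
```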

3. We have so far discussed infinite-horizon reinforcement learning with exponential discounting.


In practice, one is also interested in quasi-hyperbolic discounting. In this case, we have two
parameters σ, γ ∈ [0, 1), and the value function Jµ ∈ R^{|S|} of a stationary (possibly stochastic)
policy µ is given by
        Jµ(s) = E[ Σ_{n=0}^{∞} dn g(sn, an, sn+1) | s0 = s ],

where
        dn = 1       if n = 0,
        dn = σγ^n    if n ≥ 1;
further, an ∼ µ(·|sn ) and sn+1 ∼ P(·|sn , an ) for n ≥ 0.
Answer the following questions.

(a) Does there exist a Bellman-type relation for Jµ ? [05]


(b) From the above relation, can you identify the Bellman operator Tµ ? Is this operator a
contraction? [02]

(c) Suppose the transition matrix P is unknown. Can you design a model-free algorithm to
estimate Jµ? You can presume that you can sample from the invariant distribution of
the Markov chain induced by µ. Discuss the almost sure convergence of this algorithm. You
may directly use the results that were discussed in class. [08]
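
To make the quasi-hyperbolic return concrete, the sketch below estimates Jµ(s) for an assumed small random MDP, policy, and reward g by averaging truncated sample returns Σ_n dn g(sn, an, sn+1). It merely illustrates the definition in Question 3; it is not the model-free algorithm or convergence analysis asked for in part (c).

```python
# Sketch: Monte Carlo evaluation of the quasi-hyperbolic return, with an
# assumed random MDP, policy mu, and reward g (none specified in the
# assignment), using d_0 = 1 and d_n = sigma * gamma**n for n >= 1.
import numpy as np

rng = np.random.default_rng(2)
S, A = 4, 2
sigma, gamma = 0.7, 0.9
horizon, episodes = 100, 2000          # truncation length and sample size

P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)   # P(.|s,a)
mu = rng.random((S, A)); mu /= mu.sum(axis=1, keepdims=True)   # mu(.|s)
g = rng.random((S, A, S))                                      # g(s,a,s')

def sample_return(s0):
    """One truncated sample of sum_n d_n g(s_n, a_n, s_{n+1}) from s_0 = s0."""
    s, total = s0, 0.0
    for n in range(horizon):
        a = rng.choice(A, p=mu[s])
        s_next = rng.choice(S, p=P[s, a])
        d_n = 1.0 if n == 0 else sigma * gamma ** n
        total += d_n * g[s, a, s_next]
        s = s_next
    return total

s0 = 0
estimate = np.mean([sample_return(s0) for _ in range(episodes)])
print(f"Monte Carlo estimate of J_mu({s0}) ~= {estimate:.4f}")
```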

4. As in the proof of [1, Theorem 2], let

        f̄(y) = (γ D P Π_{π_y} − D) y + D r,

        f(z) = (γ D P Π_{π_{Q∗}} − D) z + D r.

Show that f̄ is quasi-monotone increasing. Additionally, show that f(y) ≤ f̄(y) for all y.
Finally, show that the origin is the globally asymptotically stable equilibrium for the ODE

        ż(t) = f(z(t)).

Use this to conclude that the solution trajectory of the noiseless Q-learning ODE is
asymptotically lower bounded by the zero vector. [05]
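
The sketch below builds the matrices D, P, and Π_{π_{Q∗}} for a small randomly generated MDP, following the construction in [1] (with an assumed positive state-action visit distribution on the diagonal of D), and checks numerically that γ D P Π_{π_{Q∗}} − D is Hurwitz, which is the linear-algebra fact behind the asymptotic stability argument for ż = f(z). It is an illustration under these assumptions, not the requested proof.

```python
# Sketch related to Question 4: for an assumed random MDP, build D, P, and
# Pi_{pi_Q*} as in [1] and check that gamma * D P Pi_{pi_Q*} - D is Hurwitz
# (all eigenvalues have negative real part).
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma = 4, 3, 0.9

P3 = rng.random((nS, nA, nS)); P3 /= P3.sum(axis=2, keepdims=True)  # P(s'|s,a)
r = rng.random((nS, nA))                                            # r(s,a)

# Q* via value iteration, then the greedy policy pi_{Q*}.
Q = np.zeros((nS, nA))
for _ in range(2000):
    Q = r + gamma * P3 @ Q.max(axis=1)
pi_star = Q.argmax(axis=1)

# Stack state-action pairs with index (s, a) -> s * nA + a.
P = P3.reshape(nS * nA, nS)                     # |S||A| x |S| transition matrix
d = rng.random(nS * nA); d /= d.sum()           # assumed positive visit distribution
D = np.diag(d)
Pi = np.zeros((nS, nS * nA))                    # (Pi q)(s) = q(s, pi_Q*(s))
Pi[np.arange(nS), np.arange(nS) * nA + pi_star] = 1.0

M = gamma * D @ P @ Pi - D
print("max real part of eigenvalues:", np.max(np.real(np.linalg.eigvals(M))))
```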

References

[1] Donghwan Lee and Niao He. A unified switching system perspective and ODE analysis of
    Q-learning algorithms. arXiv preprint arXiv:1912.02270, 2019.
