
Convergence of Q-learning: a simple proof

Francisco S. Melo
Institute for Systems and Robotics,
Instituto Superior Técnico,
Lisboa, PORTUGAL
[email protected]

1 Preliminaries
We denote a Markov decision process as a tuple (X , A, P, r), where
• X is the (finite) state-space;
• A is the (finite) action-space;
• P represents the transition probabilities;
• r represents the reward function.
We denote elements of X as x and y, and elements of A as a and b. We consider
the general situation where the reward is defined over triplets (x, a, y), i.e., r is
a function

r : X × A × X → R

assigning a reward r(x, a, y) every time a transition from x to y occurs due to
action a. We assume r to be a bounded, deterministic function.
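
To fix ideas, here is a minimal Python sketch of these objects for a two-state, two-action MDP; the transition probabilities, rewards and discount factor are invented purely for illustration and are reused in the later sketches.

```python
import numpy as np

# A tiny MDP (X, A, P, r) with invented numbers, purely for illustration.
n_states, n_actions = 2, 2           # X = {0, 1}, A = {0, 1}
gamma = 0.9                          # discount factor, 0 <= gamma < 1

# P[a, x, y] = P_a(x, y): probability of moving from x to y under action a.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],        # action 0
    [[0.5, 0.5], [0.6, 0.4]],        # action 1
])

# r[x, a, y]: bounded, deterministic reward for the transition (x, a, y).
r = np.zeros((n_states, n_actions, n_states))
r[0, 0, 1] = 1.0                     # e.g. reaching state 1 from state 0 under action 0
r[1, 1, 0] = 0.5
```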
The value of a state x is defined, for a sequence of controls \{A_t\}, as

J(x, \{A_t\}) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t R(X_t, A_t) \mid X_0 = x \right],

where γ is a discount factor with 0 ≤ γ < 1.

The optimal value function is defined, for each x ∈ X, as

V^*(x) = \max_{\{A_t\}} J(x, \{A_t\})

and verifies

V^*(x) = \max_{a \in A} \sum_{y \in X} P_a(x, y) \left[ r(x, a, y) + \gamma V^*(y) \right].

From here we define the optimal Q-function, Q^*, as

Q^*(x, a) = \sum_{y \in X} P_a(x, y) \left[ r(x, a, y) + \gamma V^*(y) \right].
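
Since Q^* is defined through V^*, one way to approximate both on the toy MDP above is plain value iteration on the optimality equation; this is a standard method, not one prescribed by the text, and it reuses the arrays P, r and gamma from the previous sketch.

```python
import numpy as np

def optimal_values(P, r, gamma, n_iter=1000):
    """Approximate V* by iterating the Bellman optimality equation, then
    read off Q*(x, a) = sum_y P_a(x, y) [ r(x, a, y) + gamma V*(y) ]."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(n_iter):
        # Q[x, a] = sum_y P[a, x, y] * (r[x, a, y] + gamma * V[y])
        Q = np.einsum('axy,xay->xa', P, r + gamma * V[None, None, :])
        V = Q.max(axis=1)              # V*(x) = max_a Q*(x, a)
    return V, Q

V_star, Q_star = optimal_values(P, r, gamma)   # P, r, gamma from the sketch above
```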

The optimal Q-function is a fixed point of a contraction operator H, defined
for a generic function q : X × A → R as

(Hq)(x, a) = \sum_{y \in X} P_a(x, y) \left[ r(x, a, y) + \gamma \max_{b \in A} q(y, b) \right].

This operator is a contraction in the sup-norm, i.e.,

\|Hq_1 - Hq_2\|_\infty \le \gamma \|q_1 - q_2\|_\infty.    (1)

To see this, we write

\|Hq_1 - Hq_2\|_\infty
  = \max_{x,a} \Big| \sum_{y \in X} P_a(x, y) \Big[ r(x, a, y) + \gamma \max_{b \in A} q_1(y, b) - r(x, a, y) - \gamma \max_{b \in A} q_2(y, b) \Big] \Big|
  = \max_{x,a} \gamma \Big| \sum_{y \in X} P_a(x, y) \Big[ \max_{b \in A} q_1(y, b) - \max_{b \in A} q_2(y, b) \Big] \Big|
  \le \max_{x,a} \gamma \sum_{y \in X} P_a(x, y) \Big| \max_{b \in A} q_1(y, b) - \max_{b \in A} q_2(y, b) \Big|
  \le \max_{x,a} \gamma \sum_{y \in X} P_a(x, y) \max_{z, b} | q_1(z, b) - q_2(z, b) |
  = \max_{x,a} \gamma \sum_{y \in X} P_a(x, y) \|q_1 - q_2\|_\infty
  = \gamma \|q_1 - q_2\|_\infty.
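
The operator H is also easy to implement directly; the sketch below (again using the toy P, r and gamma, together with two arbitrary Q-functions) checks inequality (1) numerically. It is an illustration of the contraction property, not part of the original proof.

```python
import numpy as np

def H(q, P, r, gamma):
    """(Hq)(x, a) = sum_y P_a(x, y) [ r(x, a, y) + gamma * max_b q(y, b) ]."""
    return np.einsum('axy,xay->xa', P, r + gamma * q.max(axis=1)[None, None, :])

rng = np.random.default_rng(0)
q1 = rng.normal(size=(n_states, n_actions))    # two arbitrary Q-functions
q2 = rng.normal(size=(n_states, n_actions))

lhs = np.abs(H(q1, P, r, gamma) - H(q2, P, r, gamma)).max()   # ||Hq1 - Hq2||_inf
rhs = gamma * np.abs(q1 - q2).max()                           # gamma * ||q1 - q2||_inf
assert lhs <= rhs + 1e-12                                     # inequality (1)
```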

The Q-learning algorithm determines the optimal Q-function using point
samples. Let π be some random policy such that

P_\pi [A_t = a \mid X_t = x] > 0

for all state-action pairs (x, a). Let \{x_t\} be a sequence of states obtained following
policy π, \{a_t\} the sequence of corresponding actions and \{r_t\} the sequence
of obtained rewards. Then, given any initial estimate Q_0, Q-learning uses the
following update rule:

Q_{t+1}(x_t, a_t) = Q_t(x_t, a_t) + \alpha_t(x_t, a_t) \left[ r_t + \gamma \max_{b \in A} Q_t(x_{t+1}, b) - Q_t(x_t, a_t) \right],

where the step-sizes α_t(x, a) verify 0 ≤ α_t(x, a) ≤ 1. This means that, at the
(t+1)-th update, only the component (x_t, a_t) is updated.¹
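
As a concrete illustration of this update rule (not part of the original text), the following Python sketch runs tabular Q-learning on the toy MDP defined earlier. The uniformly random behaviour policy and the 1/visit-count step sizes are assumptions chosen so that the positivity condition on π and the step-size conditions stated below hold.

```python
import numpy as np

def q_learning(P, r, gamma, n_steps=100_000, seed=0):
    """Tabular Q-learning on an MDP given by P[a, x, y], r[x, a, y] and
    discount gamma.  The behaviour policy is uniformly random (so every
    (x, a) is visited infinitely often) and alpha_t(x, a) is the reciprocal
    of the number of visits to (x, a) -- illustrative choices only."""
    rng = np.random.default_rng(seed)
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))      # initial estimate Q_0
    visits = np.zeros((n_states, n_actions))
    x = 0                                    # arbitrary start state
    for _ in range(n_steps):
        a = rng.integers(n_actions)          # pi: uniform, so P_pi[a | x] > 0
        y = rng.choice(n_states, p=P[a, x])  # sample next state from P_a(x, .)
        visits[x, a] += 1
        alpha = 1.0 / visits[x, a]
        # Q_{t+1}(x_t, a_t) = Q_t(x_t, a_t)
        #   + alpha_t(x_t, a_t) [ r_t + gamma * max_b Q_t(x_{t+1}, b) - Q_t(x_t, a_t) ]
        Q[x, a] += alpha * (r[x, a, y] + gamma * Q[y].max() - Q[x, a])
        x = y
    return Q

Q_hat = q_learning(P, r, gamma)              # P, r, gamma from the toy MDP above
```

On the toy problem, the returned table should approach the Q_star computed by the value-iteration sketch above, up to stochastic-approximation noise.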
¹ There are variations of Q-learning that use a single transition tuple (x, a, y, r) to perform updates in multiple states to speed up convergence, as seen for example in [2].

This leads to the following result.

Theorem 1. Given a finite MDP (X, A, P, r), the Q-learning algorithm, given
by the update rule

Q_{t+1}(x_t, a_t) = Q_t(x_t, a_t) + \alpha_t(x_t, a_t) \left[ r_t + \gamma \max_{b \in A} Q_t(x_{t+1}, b) - Q_t(x_t, a_t) \right],    (2)

converges w.p.1 to the optimal Q-function as long as

\sum_t \alpha_t(x, a) = \infty, \qquad \sum_t \alpha_t^2(x, a) < \infty    (3)

for all (x, a) ∈ X × A.


Notice that, since 0 ≤ αt (x, a) < 1, (3) requires that all state-action pairs
be visited infinitely often.
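
One schedule that satisfies (3), not prescribed by the text but standard in practice, is α_t(x, a) = 1/n_t(x, a), where n_t(x, a) counts the visits to (x, a) up to time t; along the subsequence of visits to each pair this reduces to the harmonic-type sums

\sum_{n=1}^{\infty} \frac{1}{n} = \infty, \qquad \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty,

so both conditions in (3) hold for every pair that is visited infinitely often.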
To establish Theorem 1 we need an auxiliary result from stochastic approximation, which we now state.
Theorem 2. The random process \{\Delta_t\} taking values in \mathbb{R}^n and defined as

\Delta_{t+1}(x) = (1 - \alpha_t(x)) \Delta_t(x) + \alpha_t(x) F_t(x)

converges to zero w.p.1 under the following assumptions:

• 0 ≤ α_t ≤ 1, \sum_t \alpha_t(x) = \infty and \sum_t \alpha_t^2(x) < \infty;
• \|\mathbb{E}[F_t(x) \mid \mathcal{F}_t]\|_W \le \gamma \|\Delta_t\|_W, with γ < 1;
• \mathrm{var}[F_t(x) \mid \mathcal{F}_t] \le C(1 + \|\Delta_t\|_W^2), for C > 0.

Here \mathcal{F}_t denotes the history of the process up to step t, and \|\cdot\|_W denotes a weighted maximum norm.

Proof. See [1]. □

We are now in a position to prove Theorem 1.

Proof of Theorem 1. We start by rewriting (2) as

Q_{t+1}(x_t, a_t) = (1 - \alpha_t(x_t, a_t)) Q_t(x_t, a_t) + \alpha_t(x_t, a_t) \left[ r_t + \gamma \max_{b \in A} Q_t(x_{t+1}, b) \right].

Subtracting from both sides the quantity Q^*(x_t, a_t) and letting

\Delta_t(x, a) = Q_t(x, a) - Q^*(x, a)

yields

\Delta_{t+1}(x_t, a_t) = (1 - \alpha_t(x_t, a_t)) \Delta_t(x_t, a_t) + \alpha_t(x_t, a_t) \left[ r_t + \gamma \max_{b \in A} Q_t(x_{t+1}, b) - Q^*(x_t, a_t) \right].

If we write

F_t(x, a) = r(x, a, X(x, a)) + \gamma \max_{b \in A} Q_t(X(x, a), b) - Q^*(x, a),

where X(x, a) is a random sample state obtained from the Markov chain (X, P_a),
we have

\mathbb{E}[F_t(x, a) \mid \mathcal{F}_t] = \sum_{y \in X} P_a(x, y) \left[ r(x, a, y) + \gamma \max_{b \in A} Q_t(y, b) - Q^*(x, a) \right] = (HQ_t)(x, a) - Q^*(x, a).

Using the fact that Q^* = HQ^*,

\mathbb{E}[F_t(x, a) \mid \mathcal{F}_t] = (HQ_t)(x, a) - (HQ^*)(x, a).

It is now immediate from (1) that

\|\mathbb{E}[F_t(x, a) \mid \mathcal{F}_t]\|_\infty \le \gamma \|Q_t - Q^*\|_\infty = \gamma \|\Delta_t\|_\infty.

Finally,

\mathrm{var}[F_t(x) \mid \mathcal{F}_t]
  = \mathbb{E}\Big[ \big( r(x, a, X(x, a)) + \gamma \max_{b \in A} Q_t(X(x, a), b) - Q^*(x, a) - (HQ_t)(x, a) + Q^*(x, a) \big)^2 \,\Big|\, \mathcal{F}_t \Big]
  = \mathbb{E}\Big[ \big( r(x, a, X(x, a)) + \gamma \max_{b \in A} Q_t(X(x, a), b) - (HQ_t)(x, a) \big)^2 \,\Big|\, \mathcal{F}_t \Big]
  = \mathrm{var}\big[ r(x, a, X(x, a)) + \gamma \max_{b \in A} Q_t(X(x, a), b) \mid \mathcal{F}_t \big],

which, due to the fact that r is bounded, clearly verifies

\mathrm{var}[F_t(x) \mid \mathcal{F}_t] \le C(1 + \|\Delta_t\|_W^2)

for some constant C.


Then, by Theorem 2, \Delta_t converges to zero w.p.1, i.e., Q_t converges to Q^*
w.p.1. □

References
[1] Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the con-
vergence of stochastic iterative dynamic programming algorithms. Neural
Computation, 6(6):1185–1201, 1994.
[2] Carlos Ribeiro and Csaba Szepesvári. Q-learning combined with spreading:
Convergence and results. In Proceedings of the ISRF-IEE International Con-
ference: Intelligent and Cognitive Systems (Neural Networks Symposium),
pages 32–36, 1996.
