
Notes on Fitted Q-iteration

Nan Jiang

November 8, 2023

1 Analysis of FQI
Let M = (S, A, P, R, γ, d_0) be an MDP, where d_0 is the initial distribution over states. Given a dataset
{(s, a, r, s′)} generated from M and a Q-function class F ⊂ R^{S×A}, we want to analyze the guarantee
of Fitted Q-Iteration (FQI). This note is inspired by, and scrutinizes, the results in the Approximate Value/Policy
Iteration literature [e.g., 1, 2, 3] under simplifying assumptions.

Setup and Assumptions

1. F is finite but can be exponentially large.

2. Realizability: Q* ∈ F.

3. Bellman completeness: ∀f ∈ F, T f ∈ F. (For finite F, this implies realizability.)

4. The dataset D = {(s, a, r, s′)} is generated as follows: (s, a) ∼ µ, r ∼ R(s, a), s′ ∼ P(s, a). Define
the empirical update T̂_F f′ via

$$L_D(f; f') := \frac{1}{|D|} \sum_{(s,a,r,s') \in D} \big(f(s,a) - r - \gamma V_{f'}(s')\big)^2, \qquad \hat{\mathcal{T}}_{\mathcal{F}} f' := \operatorname*{arg\,min}_{f \in \mathcal{F}} L_D(f; f'),$$

where V_{f′}(s′) := max_{a′} f′(s′, a′). Note that by completeness, T f′ ∈ F is the Bayes optimal
regressor for the regression problem defined by L_D(f; f′). It will also be useful to define

$$L_\mu(f; f') := \mathbb{E}_D[L_D(f; f')].$$

5. For any function g : X → R, any distribution ν ∈ ∆(X), and p ≥ 1, define ‖g‖_{p,ν} := (E_{x∼ν}[|g(x)|^p])^{1/p},
and let ‖g‖_ν be a shorthand for ‖g‖_{2,ν}. These norms will be used for functions over S × A and over S.

6. Let d^π_h be the distribution of (s_h, a_h) under π, that is, d^π_h(s, a) := Pr[s_h = s, a_h = a | s_1 ∼ d_0, π],
and d^π is the usual discounted occupancy. The same notations are sometimes abused to refer to the
corresponding state marginals; which is meant will be clarified if not obvious from the context.

7. We call a state-action distribution admissible if it can be generated at some time step from d_0
in the MDP; that is, it takes the form of d^π_h for some h and some (possibly non-stationary) policy π.
Then, assume that the data is exploratory: for any admissible ν,

$$\frac{\nu(s,a)}{\mu(s,a)} \le C \qquad \forall (s,a) \in \mathcal{S} \times \mathcal{A}.$$

As a consequence, ‖·‖_ν ≤ √C ‖·‖_µ. See the slides for example scenarios where C is naturally
bounded.

8. Algorithm (simplified for analysis): let f_0 ≡ 0 (assuming 0 ∈ F), and for k ≥ 1, f_k := T̂_F f_{k−1}; see the code sketch after this list.

9. Uniform deviation bound (can be obtained by concentration inequalities and a union bound): for some ε > 0,

$$\forall f, f' \in \mathcal{F}, \quad |L_D(f; f') - L_\mu(f; f')| \le \varepsilon.$$

(Note: at the end we will show how to obtain fast rates.)
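For concreteness, here is a minimal sketch of the simplified algorithm in bullets 4 and 8, assuming a finite class F of Q-tables over integer states/actions (the function and variable names are hypothetical, not from the note):

```python
import numpy as np

def fitted_q_iteration(dataset, F, gamma, num_iters):
    """Sketch of simplified FQI: dataset is a list of (s, a, r, s_next) tuples,
    F is a finite list of candidate Q-tables of shape (n_states, n_actions),
    and f_0 = 0 is assumed to be a member of F (bullet 8)."""

    def V(f, s_next):
        # V_f(s') = max_{a'} f(s', a')
        return f[s_next].max()

    def L_D(f, f_prev):
        # Empirical squared loss L_D(f; f_prev) from bullet 4.
        return np.mean([(f[s, a] - r - gamma * V(f_prev, sp)) ** 2
                        for (s, a, r, sp) in dataset])

    f = next(g for g in F if not g.any())  # f_0 = 0
    for _ in range(num_iters):
        # f_k = argmin_{f in F} L_D(f; f_{k-1})
        f = min(F, key=lambda g: L_D(g, f))
    return f  # the output policy is greedy: pi_hat(s) = argmax_a f[s, a]
```

Enumerating F is only viable here because F is finite and small; the analysis below does not depend on how the arg min is computed.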

Goal Let π̂ := π_{f_k}. Derive an upper bound on J(π*) − J(π̂).

Analysis

$$\begin{aligned}
J(\pi^\star) - J(\hat\pi) &= \sum_{h=1}^{\infty} \gamma^{h-1}\, \mathbb{E}_{s \sim d^{\hat\pi}_h}\big[V^\star(s) - Q^\star(s, \hat\pi)\big] \\
&\le \sum_{h=1}^{\infty} \gamma^{h-1}\, \mathbb{E}_{s \sim d^{\hat\pi}_h}\big[Q^\star(s, \pi^\star) - f_k(s, \pi^\star) + f_k(s, \hat\pi) - Q^\star(s, \hat\pi)\big] \\
&\le \sum_{h=1}^{\infty} \gamma^{h-1} \Big( \|Q^\star - f_k\|_{1,\, d^{\hat\pi}_h \times \pi^\star} + \|Q^\star - f_k\|_{1,\, d^{\hat\pi}_h \times \hat\pi} \Big) \\
&\le \sum_{h=1}^{\infty} \gamma^{h-1} \Big( \|Q^\star - f_k\|_{d^{\hat\pi}_h \times \pi^\star} + \|Q^\star - f_k\|_{d^{\hat\pi}_h \times \hat\pi} \Big). \qquad (1)
\end{aligned}$$

The first inequality uses V*(s) = Q*(s, π*) together with f_k(s, π̂) ≥ f_k(s, π*) (π̂ is greedy w.r.t. f_k), and
the last inequality uses ‖·‖_{1,ν} ≤ ‖·‖_{2,ν} (Jensen). In the above equations all terms of the form d^π_h
should be treated as state distributions, and d^π_h × π′ refers to the state-action distribution generated as
s ∼ d^π_h, a ∼ π′(·|s). The last line contains two terms, both of the form ‖Q* − f_k‖_ν with some admissible
ν ∈ ∆(S × A). So it remains to bound ‖Q* − f_k‖_ν for any ν ∈ ∆(S × A) that satisfies bullet 7.
First, a helper lemma:

Lemma 1. Define π_{f,f_k}(s) := argmax_{a∈A} max{f(s, a), f_k(s, a)}. Then for all ν̃ ∈ ∆(S),

$$\|V_f - V_{f_k}\|_{\tilde\nu} \le \|f - f_k\|_{\tilde\nu \times \pi_{f, f_k}}.$$

Proof.

$$\|V_f - V_{f_k}\|_{\tilde\nu}^2 = \sum_{s \in \mathcal{S}} \tilde\nu(s) \Big(\max_{a \in \mathcal{A}} f(s, a) - \max_{a' \in \mathcal{A}} f_k(s, a')\Big)^2 \le \sum_{s \in \mathcal{S}} \tilde\nu(s) \big(f(s, \pi_{f,f_k}) - f_k(s, \pi_{f,f_k})\big)^2 = \|f - f_k\|_{\tilde\nu \times \pi_{f,f_k}}^2.$$

(The inequality holds because π_{f,f_k}(s) attains the larger of the two maxima, so the difference of the maxima is bounded in absolute value by the difference of f and f_k evaluated at π_{f,f_k}(s).)

Now we can bound ‖Q* − f_k‖_ν using Lemma 1. Define P(ν) as the distribution over S generated as
s′ ∼ P(ν) ⇔ (s, a) ∼ ν, s′ ∼ P(s, a), and

$$\begin{aligned}
\|f_k - Q^\star\|_\nu &= \|f_k - \mathcal{T} f_{k-1} + \mathcal{T} f_{k-1} - Q^\star\|_\nu \\
&\le \|f_k - \mathcal{T} f_{k-1}\|_\nu + \|\mathcal{T} f_{k-1} - \mathcal{T} Q^\star\|_\nu && (Q^\star = \mathcal{T} Q^\star) \\
&\le \sqrt{C}\, \|f_k - \mathcal{T} f_{k-1}\|_\mu + \gamma \|V_{f_{k-1}} - V^\star\|_{P(\nu)} && (*) \\
&\le \sqrt{C}\, \|f_k - \mathcal{T} f_{k-1}\|_\mu + \gamma \|f_{k-1} - Q^\star\|_{P(\nu) \times \pi_{f_{k-1}, Q^\star}}. && \text{(Lemma 1)}
\end{aligned}$$

Step (*) holds because ‖f_k − T f_{k−1}‖_ν ≤ √C ‖f_k − T f_{k−1}‖_µ by bullet 7, and

$$\begin{aligned}
\|\mathcal{T} f_{k-1} - \mathcal{T} Q^\star\|_\nu^2 &= \mathbb{E}_{(s,a)\sim\nu}\Big[\big((\mathcal{T} f_{k-1})(s,a) - (\mathcal{T} Q^\star)(s,a)\big)^2\Big] \\
&= \mathbb{E}_{(s,a)\sim\nu}\Big[\big(\gamma\, \mathbb{E}_{s'\sim P(s,a)}[V_{f_{k-1}}(s') - V^\star(s')]\big)^2\Big] \\
&\le \gamma^2\, \mathbb{E}_{(s,a)\sim\nu,\, s'\sim P(s,a)}\Big[\big(V_{f_{k-1}}(s') - V^\star(s')\big)^2\Big] && \text{(Jensen)} \\
&= \gamma^2\, \mathbb{E}_{s'\sim P(\nu)}\Big[\big(V_{f_{k-1}}(s') - V^\star(s')\big)^2\Big] = \gamma^2 \|V_{f_{k-1}} - V^\star\|_{P(\nu)}^2.
\end{aligned}$$

Note that we can apply the same analysis to P(ν) × π_{f_{k−1}, Q*} since it is also admissible, and expand
the inequality k times. It then suffices to upper bound ‖f_k − T f_{k−1}‖_µ.

$$\begin{aligned}
\|f_k - \mathcal{T} f_{k-1}\|_\mu^2 &= L_\mu(f_k; f_{k-1}) - L_\mu(\mathcal{T} f_{k-1}; f_{k-1}) && \text{(squared loss; } \mathcal{T} f_{k-1} \text{ is the Bayes optimal regressor)} \\
&\le L_D(f_k; f_{k-1}) - L_D(\mathcal{T} f_{k-1}; f_{k-1}) + 2\varepsilon && \text{(bullet 9 applied twice, using } \mathcal{T} f_{k-1} \in \mathcal{F}) \\
&\le 2\varepsilon. && (f_k \text{ minimizes } L_D(\,\cdot\,; f_{k-1}))
\end{aligned}$$

Note that the RHS does not depend on k. Expanding the recursion k times (each P(ν) × π_{f_t, Q*} is again admissible) and using ‖f_0 − Q*‖ ≤ V_max (since f_0 ≡ 0 and 0 ≤ Q* ≤ V_max), we conclude that for any admissible ν,

$$\|f_k - Q^\star\|_\nu \le \frac{1-\gamma^k}{1-\gamma}\sqrt{2C\varepsilon} + \gamma^k V_{\max}.$$
Applying this to Equation (1), we get

$$J(\pi^\star) - J(\pi_{f_k}) \le \frac{2}{1-\gamma}\left(\frac{1-\gamma^k}{1-\gamma}\sqrt{2C\varepsilon} + \gamma^k V_{\max}\right).$$
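To get a feel for how the two terms trade off against the number of iterations k, here is a small numeric sketch; the values of C, ε, γ, and R_max below are made up for illustration only.

```python
import numpy as np

def fqi_suboptimality_bound(C, eps, gamma, r_max, k):
    """Evaluate 2/(1-gamma) * ((1-gamma^k)/(1-gamma) * sqrt(2*C*eps) + gamma^k * V_max)
    with V_max = r_max / (1 - gamma)."""
    v_max = r_max / (1 - gamma)
    return 2 / (1 - gamma) * ((1 - gamma**k) / (1 - gamma) * np.sqrt(2 * C * eps)
                              + gamma**k * v_max)

# The gamma^k * V_max term vanishes geometrically; for large k the bound is
# dominated by the statistical term, roughly 2 * sqrt(2*C*eps) / (1 - gamma)^2.
for k in (1, 10, 50, 200):
    print(k, fqi_suboptimality_bound(C=4.0, eps=1e-4, gamma=0.9, r_max=1.0, k=k))
```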

Extension: fast rate The previous bound has an O(n^{−1/4}) dependence on the sample size n := |D|, because ε in bullet 9 is O(n^{−1/2}) (up to log factors) by Hoeffding's inequality and a union bound, and the final bound depends on √ε. Here we exploit realizability of the Bayes regressor (T f′ ∈ F, implied by completeness) and the structure of the squared loss to achieve a fast rate, so that the final bound is O(n^{−1/2}).
Define

$$Y(f; f') := \big(f(s,a) - r - \gamma V_{f'}(s')\big)^2 - \big((\mathcal{T} f')(s,a) - r - \gamma V_{f'}(s')\big)^2.$$

Plugging each (s, a, r, s′) ∈ D into Y(f; f′), we get i.i.d. variables Y_1(f; f′), Y_2(f; f′), ..., Y_n(f; f′)
where n = |D|. It is easy to see that

$$\frac{1}{n}\sum_{i=1}^n Y_i(f; f') = L_D(f; f') - L_D(\mathcal{T} f'; f'),$$

so we have only shifted our objective L_D by an f-independent constant. Our goal is to show that

$$\|\hat{\mathcal{T}}_{\mathcal{F}} f' - \mathcal{T} f'\|_\mu^2 = \mathbb{E}[Y(\hat{\mathcal{T}}_{\mathcal{F}} f'; f')] = O(1/n).$$

Note that this result can be directly plugged into the previous analysis by letting f′ = f_{k−1} (hence
T̂_F f′ = f_k), and we immediately obtain a final bound of O(n^{−1/2}).
To prove the result, first notice that for all f ∈ F,

$$\mathbb{E}[Y(f; f')] = L_\mu(f; f') - L_\mu(\mathcal{T} f'; f') = \|f - \mathcal{T} f'\|_\mu^2,$$

thanks to realizability of the Bayes regressor and the squared loss. Next we bound the variance of Y:

$$\begin{aligned}
\mathbb{V}[Y(f; f')] &\le \mathbb{E}[Y(f; f')^2] \\
&= \mathbb{E}\Big[\Big(\big(f(s,a) - r - \gamma V_{f'}(s')\big)^2 - \big((\mathcal{T} f')(s,a) - r - \gamma V_{f'}(s')\big)^2\Big)^2\Big] \\
&= \mathbb{E}\Big[\big(f(s,a) - (\mathcal{T} f')(s,a)\big)^2 \big(f(s,a) + (\mathcal{T} f')(s,a) - 2r - 2\gamma V_{f'}(s')\big)^2\Big] \\
&\le 4 V_{\max}^2\, \mathbb{E}\Big[\big(f(s,a) - (\mathcal{T} f')(s,a)\big)^2\Big] \\
&= 4 V_{\max}^2 \|f - \mathcal{T} f'\|_\mu^2 = 4 V_{\max}^2\, \mathbb{E}[Y(f; f')],
\end{aligned}$$

where V_max = R_max/(1 − γ) is a constant (the second inequality uses that f, T f′, and r + γV_{f′}(s′) all take values in [0, V_max]).


Next we apply the (one-sided) Bernstein inequality (see [4]) and a union bound over all f ∈ F. Let
N := |F|. For any fixed f′, with probability at least 1 − δ, for all f ∈ F,

$$\begin{aligned}
\mathbb{E}[Y(f; f')] - \frac{1}{n}\sum_{i=1}^n Y_i(f; f') &\le \sqrt{\frac{2\,\mathbb{V}[Y(f; f')] \log\frac{N}{\delta}}{n}} + \frac{4 V_{\max}^2 \log\frac{N}{\delta}}{3n} && (Y_i \in [-V_{\max}^2, V_{\max}^2]) \\
&\le \sqrt{\frac{8 V_{\max}^2\, \mathbb{E}[Y(f; f')] \log\frac{N}{\delta}}{n}} + \frac{4 V_{\max}^2 \log\frac{N}{\delta}}{3n}.
\end{aligned}$$

Since T̂_F f′ minimizes L_D(·; f′), it also minimizes (1/n) Σ_{i=1}^n Y_i(·; f′), because the two objectives only
differ by the constant L_D(T f′; f′). Hence,

$$\frac{1}{n}\sum_{i=1}^n Y_i(\hat{\mathcal{T}}_{\mathcal{F}} f'; f') \le \frac{1}{n}\sum_{i=1}^n Y_i(\mathcal{T} f'; f') = 0.$$

Then,

$$\mathbb{E}[Y(\hat{\mathcal{T}}_{\mathcal{F}} f'; f')] \le \sqrt{\frac{8 V_{\max}^2\, \mathbb{E}[Y(\hat{\mathcal{T}}_{\mathcal{F}} f'; f')] \log\frac{N}{\delta}}{n}} + \frac{4 V_{\max}^2 \log\frac{N}{\delta}}{3n}.$$

Solving this quadratic inequality (in the square root of the left-hand side),

$$\mathbb{E}[Y(\hat{\mathcal{T}}_{\mathcal{F}} f'; f')] \le \left(\sqrt{2} + \sqrt{\tfrac{10}{3}}\right)^2 \frac{V_{\max}^2 \log\frac{N}{\delta}}{n}.$$
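For completeness, the quadratic step can be spelled out as follows (a routine manipulation, written in my own shorthand): with x := E[Y(T̂_F f′; f′)], a := 8V²_max log(N/δ)/n, and b := 4V²_max log(N/δ)/(3n),

$$x \le \sqrt{ax} + b \;\Longrightarrow\; \sqrt{x} \le \frac{\sqrt{a} + \sqrt{a + 4b}}{2} \;\Longrightarrow\; x \le \left(\frac{\sqrt{a} + \sqrt{a+4b}}{2}\right)^2 = \left(\sqrt{2} + \sqrt{\tfrac{10}{3}}\right)^2 \frac{V_{\max}^2 \log\frac{N}{\delta}}{n}.$$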

Relaxing the definition of C The assumption ‖ν/µ‖_∞ ≤ C for all admissible ν can be relaxed. In
particular, note that we only ever use this assumption in the form

$$\|f - \mathcal{T} f'\|_\nu \le \sqrt{C}\, \|f - \mathcal{T} f'\|_\mu$$

for some f, f′ ∈ F. We can therefore redefine C as an upper bound on

$$\max_{f, f' \in \mathcal{F}} \frac{\|f - \mathcal{T} f'\|_\nu^2}{\|f - \mathcal{T} f'\|_\mu^2}$$

for all admissible ν. Despite being a straightforward relaxation, when F has some nice structural properties
this new definition can be significantly tighter than the old one based on raw density ratios.
For example, when F is induced from a bisimulation state abstraction (which satisfies completeness),
the new definition measures the density ratio between the distributions over abstract state-action pairs,
which can be much smaller than that between the raw state-action pairs. More generally, when F is
linear and Bellman completeness is satisfied, f − T f′ is also a linear function, and the new definition
measures coverage in the linear feature space. See further discussion on this in Akshay's note.
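To make the linear case concrete, here is a small numpy sketch (my own construction with hypothetical inputs): assuming f − T f′ = φ(s,a)ᵀθ for some features φ, the ratio above is at most the largest eigenvalue of Σ_µ^{−1/2} Σ_ν Σ_µ^{−1/2}, which the function below estimates from feature samples drawn under µ and under one admissible ν.

```python
import numpy as np

def relaxed_concentrability(phi_mu, phi_nu):
    """phi_mu, phi_nu: arrays of shape (n, d) whose rows are feature vectors
    phi(s, a) sampled from mu and from an admissible nu. Returns an estimate of
    max_theta ||phi^T theta||_nu^2 / ||phi^T theta||_mu^2, an upper bound on the
    relaxed C for this nu (assumes the mu second-moment matrix is invertible)."""
    sigma_mu = phi_mu.T @ phi_mu / len(phi_mu)  # second moments under mu
    sigma_nu = phi_nu.T @ phi_nu / len(phi_nu)  # second moments under nu
    w, U = np.linalg.eigh(sigma_mu)
    inv_sqrt = U @ np.diag(1.0 / np.sqrt(w)) @ U.T  # Sigma_mu^{-1/2}
    return float(np.linalg.eigvalsh(inv_sqrt @ sigma_nu @ inv_sqrt).max())
```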

2 Alternative Analysis
Below we sketch an alternative proof of the FQI guarantee. There are two motivations:

Error propagation along “simple” policies The error propagation in the above analysis of FQI was
along a somewhat “ugly” set of policies of the form π_{f_k, Q*}, which in each state takes the action
that “witnesses” the inequality |max_a f(s, a) − max_a f′(s, a)| ≤ max_a |f(s, a) − f′(s, a)| for f = f_k
and f′ = Q*. However, the error propagation in the ADP literature (e.g., [3]) only involves “simple”
policies, such as π_f for some f ∈ F (and the concatenation of such policies at different time steps to
form a non-stationary policy).

“Modern” error-propagation analysis Error propagation in RL theory was often done by recursive
expansion in the “old” literature, and the above analysis also follows this style. However, we have
also seen alternative proofs based on cleaner and more elegant tools. For example, it is easy to analyze
the error propagation of the “minimax algorithm” argmin_{f∈F} max_{g∈F} L_D(f; f) − L_D(g; f) [5, 6] using
the following lemma: for all π and f,

$$J(\pi) - J(\pi_f) \le \frac{1}{1-\gamma}\Big(\mathbb{E}_{d^\pi}[\mathcal{T} f - f] + \mathbb{E}_{d^{\pi_f}}[f - \mathcal{T} f]\Big). \qquad (2)$$

Using this lemma is also well aligned with the first motivation, as it often produces simple policies
on the RHS (which the data distribution needs to cover).
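For concreteness, a minimal sketch of the stated minimax objective over a finite class, reusing the hypothetical Q-table conventions of the FQI sketch in Section 1 (this illustrates the objective only, not the exact algorithms analyzed in [5, 6]):

```python
import numpy as np

def minimax_estimator(dataset, F, gamma):
    """Sketch of argmin_{f in F} max_{g in F} L_D(f; f) - L_D(g; f), where
    dataset is a list of (s, a, r, s_next) tuples and F is a list of Q-tables."""

    def V(f, s_next):
        return f[s_next].max()

    def L_D(f, f_prev):
        return np.mean([(f[s, a] - r - gamma * V(f_prev, sp)) ** 2
                        for (s, a, r, sp) in dataset])

    def estimated_bellman_error(f):
        # max_g [L_D(f; f) - L_D(g; f)]: how much better the best regressor in F
        # fits the targets r + gamma * V_f(s') than f itself does.
        return L_D(f, f) - min(L_D(g, f) for g in F)

    return min(F, key=estimated_bellman_error)
```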

2.1 Performance guarantee for non-stationary FQI


A major difficulty in applying Eq.(2) to FQI is that it requires control over the Bellman error ‖f − T f‖,
i.e., the learned function should be “self-consistent”. However, in FQI we only have control over
‖f_t − T f_{t−1}‖: the output function f_k is not necessarily consistent with itself, but rather with its
previous iterate f_{k−1}, which is in turn consistent with its previous iterate f_{k−2}, and so on.
To overcome this difficulty, we first consider a different output policy: π_{f_k:0} := π_{f_k} ∘ π_{f_{k−1}} ∘ · · · ∘ π_{f_0}.
This is a non-stationary policy, and after π_{f_0} we take arbitrary actions.¹ We call FQI with this output
policy (instead of π_{f_k}) non-stationary FQI. Similar to the situation in value iteration (see note1), such
a non-stationary policy actually has a better guarantee than the usual FQI policy, saving a factor of
horizon,² and is also easier to analyze.
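As a quick illustration of how π_{f_k:0} acts (a hypothetical helper using the Q-table convention from the Section 1 sketch; after step k + 1 it falls back to an arbitrary choice, here staying greedy w.r.t. f_0):

```python
def nonstationary_action(fs, t, s):
    """fs = [f_0, ..., f_k] are Q-tables; pi_{f_k:0} is greedy w.r.t. f_k at
    (1-indexed) step t = 1, f_{k-1} at step 2, ..., f_0 at step k + 1, and may
    take arbitrary actions afterwards (here: keep using f_0)."""
    k = len(fs) - 1
    f = fs[max(k - (t - 1), 0)]
    return int(f[s].argmax())
```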
In particular, we can now use (a variant of) Eq.(2), because π_{f_k:0} is greedy w.r.t. a self-consistent
function, namely f_{k:0} := f_k ∘ f_{k−1} ∘ · · · ∘ f_0! The unusual aspect here is that we typically consider
value functions as functions of states and actions only, i.e., they are stationary, whereas the function
f_k ∘ f_{k−1} ∘ · · · ∘ f_0 is itself a non-stationary object. Also, compared to Eq.(2), we will need to include
a truncation error here. The lemma is given below, with the proof left as a homework exercise:
¹ The result might be improved (though it is unclear whether the improvement is significant) if we produce a periodic policy that simply repeats π_{f_k:0} forever [7].

² Another way to save this factor of horizon is to run the minimax algorithm [6].

Lemma 2 (Non-stationary variant of Eq.(2)). Given an arbitrary sequence of functions f_0, ..., f_k ∈ R^{S×A}
and any (non-stationary) comparator policy π, let π̂ := π_{f_k:0} (followed by arbitrary actions after k + 1 steps). Then

$$J(\pi) - J(\hat\pi) \le \sum_{t=1}^{k} \gamma^{t-1}\Big(\mathbb{E}_{d^\pi_t}[\mathcal{T} f_{k-t} - f_{k-t+1}] + \mathbb{E}_{d^{\hat\pi}_t}[f_{k-t+1} - \mathcal{T} f_{k-t}]\Big) + \gamma^k V_{\max}. \qquad (3)$$

According to the RHS of the bound, when we choose the optimal policy π* as the comparator
policy π, we need the data distribution µ to cover the state distributions induced from d_0 by two types
of policies: π* run for t steps, for all t ≤ k, and π_{f_k:k′} for all 0 ≤ k′ ≤ k. Caution: when we analyze the
minimax algorithm using Eq.(2), we only need µ to cover the discounted occupancy as a whole, instead
of covering the per-step distributions that contribute to the occupancy separately. Here we do not enjoy
such a property, because our algorithm controls ‖f_t − T f_{t−1}‖_{2,µ} for each t separately, so the change of
measure must happen at each step instead of over the entire occupancy as a whole.

2.2 Performance guarantee for FQI


The previous section sketched an analysis of non-stationary FQI. To relate FQI to the analysis of its
non-stationary variant, we use the performance difference lemma:

$$\begin{aligned}
J(\pi^\star) - J(\pi_{f_k}) &= \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_{f_k}}}\big[V^\star(s) - Q^\star(s, \pi_{f_k})\big] \\
&\le \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_{f_k}}}\big[V^\star(s) - V^{\pi_{f_k:0}}(s)\big].
\end{aligned}$$

Here the inequality is due to Q*(s, π_{f_k}) ≥ V^{π_{f_k:0}}(s): Q*(s, π_{f_k}) is the expected return of
starting in s, taking π_{f_k} immediately (which coincides with the first step of π_{f_k:0}, since π_{f_k}
is π_{f_k:0}'s first-step policy), and acting optimally thereafter, whereas V^{π_{f_k:0}}(s) continues to follow π_{f_k:0} afterwards.
Now the RHS looks like the suboptimality of π_{f_k:0}, to which we can directly apply the
analysis of the previous section! We can also see that, compared to the guarantee of non-stationary
FQI, here we pay an extra 1/(1 − γ) factor. The only unusual aspect is that d^{π_{f_k}} is treated as
the initial distribution for the non-stationary FQI analysis, so in the end the distributions that need to be
covered are those induced by the policies mentioned at the end of Section 2.1, but with d^{π_{f_k}} as the
initial distribution.
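Spelling this out (a sketch of the combined bound: apply Lemma 2 with comparator π* and with d^{π_{f_k}} in place of d_0, writing d̃_t for the resulting step-t distributions):

$$J(\pi^\star) - J(\pi_{f_k}) \le \frac{1}{1-\gamma}\left(\sum_{t=1}^{k} \gamma^{t-1}\Big(\mathbb{E}_{\tilde d^{\pi^\star}_t}\big[\mathcal{T} f_{k-t} - f_{k-t+1}\big] + \mathbb{E}_{\tilde d^{\pi_{f_k:0}}_t}\big[f_{k-t+1} - \mathcal{T} f_{k-t}\big]\Big) + \gamma^k V_{\max}\right).$$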

References
[1] Rémi Munos. Error bounds for approximate policy iteration. In ICML, volume 3, pages 560–567,
2003.

[2] András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-
residual minimization based fitted policy iteration and a single sample path. Machine Learning,
2008.

[3] Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Ma-
chine Learning Research, 9(May):815–857, 2008.

[4] Sham Kakade. Hoeffding, Chernoff, Bennet, and Bernstein Bounds, 2011. http://stat.wharton.upenn.edu/~skakade/courses/stat928/lectures/lecture06.pdf.

[5] Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning.
In Proceedings of the 36th International Conference on Machine Learning, pages 1042–1051, 2019.

[6] Tengyang Xie and Nan Jiang. Q* approximation schemes for batch reinforcement learning: A
theoretical comparison. In Conference on Uncertainty in Artificial Intelligence, pages 550–559. PMLR,
2020.

[7] Bruno Scherrer and Boris Lesner. On the use of non-stationary policies for stationary infinite-
horizon Markov decision processes. In Advances in Neural Information Processing Systems, pages
1826–1834, 2012.
