note5
Nan Jiang
November 8, 2023
1 Analysis of FQI
Let $M = (S, A, P, R, \gamma, d_0)$ be an MDP, where $d_0$ is the initial distribution over states. Given a dataset $\{(s, a, r, s')\}$ generated from $M$ and a Q-function class $\mathcal{F} \subset \mathbb{R}^{S \times A}$, we want to analyze the guarantee of Fitted Q-Iteration (FQI). This note is inspired by and scrutinizes the results in the Approximate Value/Policy Iteration literature [e.g., 1, 2, 3] under simplifying assumptions. The setup and assumptions are as follows.
1. Boundedness: rewards lie in $[0, R_{\max}]$, and every $f \in \mathcal{F}$ takes values in $[0, V_{\max}]$ with $V_{\max} := R_{\max}/(1-\gamma)$.
2. Realizability: $Q^\star \in \mathcal{F}$.
3. Completeness: $\forall f \in \mathcal{F}$, $\mathcal{T} f \in \mathcal{F}$, where $(\mathcal{T} f)(s,a) := R(s,a) + \gamma\, \mathbb{E}_{s' \sim P(s,a)}[\max_{a'} f(s',a')]$ is the Bellman optimality operator.
4. The dataset $D = \{(s, a, r, s')\}$ is generated as follows: $(s, a) \sim \mu$, $r \sim R(s, a)$, $s' \sim P(s, a)$. Define the empirical update $\widehat{\mathcal{T}}_{\mathcal{F}} f' := \arg\min_{f \in \mathcal{F}} L_D(f; f')$, where
\[
L_D(f; f') := \frac{1}{|D|} \sum_{(s,a,r,s') \in D} \big( f(s,a) - r - \gamma V_{f'}(s') \big)^2
\]
and $V_{f'}(s') := \max_{a'} f'(s', a')$. Note that by completeness, $\mathcal{T} f' \in \mathcal{F}$ is the Bayes optimal regressor for the regression problem defined in $L_D(f; f')$. It will also be useful to define $L_\mu(f; f') := \mathbb{E}_D[L_D(f; f')]$.
5. For any function $g : \mathcal{X} \to \mathbb{R}$, any distribution $\nu \in \Delta(\mathcal{X})$, and $p \ge 1$, define $\|g\|_{p,\nu} := (\mathbb{E}_{x \sim \nu}[|g(x)|^p])^{1/p}$, and let $\|g\|_\nu$ be a shorthand for $\|g\|_{2,\nu}$. We will use such norms for functions over $S \times A$ and over $S$.
6. Let $d^\pi_h$ be the distribution of $(s_h, a_h)$ under $\pi$, that is, $d^\pi_h(s, a) := \Pr[s_h = s, a_h = a \mid s_1 \sim d_0, \pi]$. $d^\pi$ is the usual discounted occupancy. The same notations are sometimes abused to refer to the corresponding state marginals, which will be clarified if not clear from the context.
7. We call a state-action distribution admissible if it can be generated at some time step from $d_0$ in the MDP; that is, it takes the form of $d^\pi_h$ for some $h$ and some (possibly non-stationary) policy $\pi$. Then, assume that the data is exploratory: for any admissible $\nu$,
\[
\frac{\nu(s,a)}{\mu(s,a)} \le C \qquad \forall (s,a) \in S \times A.
\]
As a consequence, $\|\cdot\|_\nu \le \sqrt{C}\, \|\cdot\|_\mu$. See slides for example scenarios where $C$ is naturally bounded.
8. Algorithm (simplified for analysis): let $f_0 \equiv 0$ (assuming $0 \in \mathcal{F}$), and for $k \ge 1$, $f_k := \widehat{\mathcal{T}}_{\mathcal{F}} f_{k-1}$. (A code sketch of this procedure is given right after this list.)
9. Uniform deviation bound (can be obtained by concentration inequalities and a union bound): $\forall f, f' \in \mathcal{F}$, $|L_D(f; f') - L_\mu(f; f')| \le \varepsilon$.
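To make the procedure in bullet 8 concrete, here is a minimal sketch in Python. It is illustrative only and not from the note: it assumes a finite state-action space encoded by integers, a tabular function class (so the least-squares regression defining $\widehat{\mathcal{T}}_{\mathcal{F}}$ reduces to averaging the regression targets per $(s,a)$ pair), and hypothetical names such as `fqi`, `num_states`, and `num_actions`.

```python
import numpy as np

def fqi(dataset, num_states, num_actions, gamma, num_iters):
    """Fitted Q-Iteration with a tabular function class (illustrative sketch).

    dataset: iterable of (s, a, r, s_next) tuples, with s, a, s_next integer-coded.
    Returns the final iterate f_k as a (num_states, num_actions) array.
    """
    f = np.zeros((num_states, num_actions))  # f_0 = 0, assumed to lie in F
    for _ in range(num_iters):
        targets_sum = np.zeros((num_states, num_actions))
        counts = np.zeros((num_states, num_actions))
        for (s, a, r, s_next) in dataset:
            # Regression target of the empirical update: r + gamma * max_a' f(s', a').
            targets_sum[s, a] += r + gamma * f[s_next].max()
            counts[s, a] += 1
        # With a tabular class, argmin_f L_D(f; f_prev) is the per-(s,a) average target;
        # (s, a) pairs that never appear in the data keep their previous value here.
        f = np.where(counts > 0, targets_sum / np.maximum(counts, 1), f)
    return f

# Usage (hypothetical): the greedy policy of the final iterate is
#   pi = fqi(dataset, S, A, gamma=0.99, num_iters=K).argmax(axis=1)
```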
Analysis
Let $\hat\pi := \pi_{f_k}$ denote the greedy policy with respect to $f_k$. Then
\[
\begin{aligned}
J(\pi^\star) - J(\hat\pi) &= \sum_{h=1}^{\infty} \gamma^{h-1}\, \mathbb{E}_{s \sim d^{\hat\pi}_h}\big[ V^\star(s) - Q^\star(s, \hat\pi) \big] \\
&\le \sum_{h=1}^{\infty} \gamma^{h-1}\, \mathbb{E}_{s \sim d^{\hat\pi}_h}\big[ Q^\star(s, \pi^\star) - f_k(s, \pi^\star) + f_k(s, \hat\pi) - Q^\star(s, \hat\pi) \big] \\
&\le \sum_{h=1}^{\infty} \gamma^{h-1} \Big( \| Q^\star - f_k \|_{1,\, d^{\hat\pi}_h \times \pi^\star} + \| Q^\star - f_k \|_{1,\, d^{\hat\pi}_h \times \hat\pi} \Big) \\
&\le \sum_{h=1}^{\infty} \gamma^{h-1} \Big( \| Q^\star - f_k \|_{d^{\hat\pi}_h \times \pi^\star} + \| Q^\star - f_k \|_{d^{\hat\pi}_h \times \hat\pi} \Big). \qquad (1)
\end{aligned}
\]
In the above equations all the terms of the form $d^\pi_h$ should be treated as state distributions, and $d^\pi_h \times \pi'$ refers to the state-action distribution generated as $s \sim d^\pi_h$, $a \sim \pi'(\cdot \mid s)$. (The first equality is the performance difference decomposition; the first inequality uses $V^\star(s) = Q^\star(s, \pi^\star)$ and $f_k(s, \hat\pi) \ge f_k(s, \pi^\star)$ since $\hat\pi$ is greedy w.r.t. $f_k$; the last inequality is $\|\cdot\|_{1,\nu} \le \|\cdot\|_{2,\nu}$ by Jensen.) The last line contains two terms, both of the form $\|Q^\star - f_k\|_\nu$ with some admissible $\nu \in \Delta(S \times A)$. So it remains to bound $\|Q^\star - f_k\|_\nu$ for any $\nu \in \Delta(S \times A)$ that satisfies bullet 7.
First, a helper lemma:
Lemma 1. Define $\pi_{f, f_k}(s) := \arg\max_{a \in A} \max\{f(s,a), f_k(s,a)\}$. Then we have $\forall \tilde\nu \in \Delta(S)$,
\[
\| V_f - V_{f_k} \|_{\tilde\nu} \le \| f - f_k \|_{\tilde\nu \times \pi_{f, f_k}}.
\]
Proof.
\[
\begin{aligned}
\| V_f - V_{f_k} \|_{\tilde\nu}^2 &= \sum_{s \in S} \tilde\nu(s) \Big( \max_{a \in A} f(s,a) - \max_{a' \in A} f_k(s,a') \Big)^2 \\
&\le \sum_{s \in S} \tilde\nu(s) \big( f(s, \pi_{f,f_k}) - f_k(s, \pi_{f,f_k}) \big)^2 = \| f - f_k \|_{\tilde\nu \times \pi_{f,f_k}}^2.
\end{aligned}
\]
The inequality holds because $\pi_{f,f_k}(s)$ is the action that witnesses the larger of the two maxima, so $|\max_a f(s,a) - \max_{a'} f_k(s,a')| \le |f(s, \pi_{f,f_k}(s)) - f_k(s, \pi_{f,f_k}(s))|$.
Now we can bound $\|Q^\star - f_k\|_\nu$ using Lemma 1. Define $P(\nu)$ as the distribution over $S$ generated as $s' \sim P(\nu) \Leftrightarrow (s,a) \sim \nu,\ s' \sim P(s,a)$, and write
\[
\begin{aligned}
\| Q^\star - f_k \|_\nu &\le \| f_k - \mathcal{T} f_{k-1} \|_\nu + \| \mathcal{T} f_{k-1} - \mathcal{T} Q^\star \|_\nu \\
&\le \sqrt{C}\, \| f_k - \mathcal{T} f_{k-1} \|_\mu + \gamma\, \| V_{f_{k-1}} - V_{Q^\star} \|_{P(\nu)} \\
&\le \sqrt{C}\, \| f_k - \mathcal{T} f_{k-1} \|_\mu + \gamma\, \| f_{k-1} - Q^\star \|_{P(\nu) \times \pi_{f_{k-1}, Q^\star}},
\end{aligned}
\]
where the first step is the triangle inequality and $\mathcal{T} Q^\star = Q^\star$, the second step uses bullet 7 and Jensen's inequality, and the third step is Lemma 1. Note that we can apply the same analysis to $P(\nu) \times \pi_{f_{k-1}, Q^\star}$ since it is also admissible, and expand the inequality $k$ times. It then suffices to upper bound $\| f_k - \mathcal{T} f_{k-1} \|_\mu$.
\[
\begin{aligned}
\| f_k - \mathcal{T} f_{k-1} \|_\mu^2 &= L_\mu(f_k; f_{k-1}) - L_\mu(\mathcal{T} f_{k-1}; f_{k-1}) && \text{(squared loss; $\mathcal{T} f_{k-1}$ is the Bayes optimal regressor)} \\
&\le L_D(f_k; f_{k-1}) - L_D(\mathcal{T} f_{k-1}; f_{k-1}) + 2\varepsilon && \text{(bullet 9; $\mathcal{T} f_{k-1} \in \mathcal{F}$)} \\
&\le 2\varepsilon. && \text{($f_k$ minimizes $L_D(\cdot\,; f_{k-1})$)}
\end{aligned}
\]
Note that the RHS does not depend on $k$, so we conclude that for any admissible $\nu$,
\[
\| f_k - Q^\star \|_\nu \le \frac{1 - \gamma^k}{1 - \gamma} \sqrt{2 C \varepsilon} + \gamma^k V_{\max}.
\]
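For completeness, the expansion behind the last display can be written out as follows (a sketch; $\nu_0 := \nu$ and $\nu_{i+1} := P(\nu_i) \times \pi_{f_{k-i-1}, Q^\star}$ are our labels for the admissible distributions generated at each level of the recursion):
\[
\begin{aligned}
\|Q^\star - f_k\|_{\nu}
&\le \sqrt{2C\varepsilon} + \gamma\, \|Q^\star - f_{k-1}\|_{\nu_1}
\le \sqrt{2C\varepsilon} + \gamma\big( \sqrt{2C\varepsilon} + \gamma\, \|Q^\star - f_{k-2}\|_{\nu_2} \big) \le \cdots \\
&\le \Big( \sum_{i=0}^{k-1} \gamma^i \Big) \sqrt{2C\varepsilon} + \gamma^k\, \|Q^\star - f_0\|_{\nu_k}
\le \frac{1-\gamma^k}{1-\gamma} \sqrt{2C\varepsilon} + \gamma^k V_{\max},
\end{aligned}
\]
where the last step uses $f_0 \equiv 0$ and $0 \le Q^\star \le V_{\max}$.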
Apply this to Equation (1) and we get
\[
J(\pi^\star) - J(\pi_{f_k}) \le \frac{2}{1-\gamma} \left( \frac{1 - \gamma^k}{1 - \gamma} \sqrt{2 C \varepsilon} + \gamma^k V_{\max} \right).
\]
Extension: fast rate The previous bound should have $O(n^{-1/4})$ dependence on the sample size $n := |D|$, because $\varepsilon$ in bullet 9 should be $O(n^{-1/2})$ using Hoeffding's, and the final bound depends on $\sqrt{\varepsilon}$. Here we exploit realizability to achieve a fast rate so that the final bound is $O(n^{-1/2})$.
Define
\[
Y(f; f') := \big( f(s,a) - r - \gamma V_{f'}(s') \big)^2 - \big( (\mathcal{T} f')(s,a) - r - \gamma V_{f'}(s') \big)^2.
\]
Plug each $(s,a,r,s') \in D$ into $Y(f; f')$ and we get i.i.d. variables $Y_1(f; f'), Y_2(f; f'), \ldots, Y_n(f; f')$, where $n = |D|$. It is easy to see that
\[
\frac{1}{n} \sum_{i=1}^{n} Y_i(f; f') = L_D(f; f') - L_D(\mathcal{T} f'; f'),
\]
so we only shift our objective $L_D$ by an $f$-independent constant. Our goal is to show that, with probability at least $1 - \delta$,
\[
\big\| \widehat{\mathcal{T}}_{\mathcal{F}} f' - \mathcal{T} f' \big\|_\mu^2 \le O\!\left( \frac{V_{\max}^2 \log(N/\delta)}{n} \right),
\]
where $N := |\mathcal{F}|$ (assuming a finite class, consistent with the union bound in bullet 9).
Note that this result can be directly plugged into the previous analysis by letting $f' = f_{k-1}$ (hence $\widehat{\mathcal{T}}_{\mathcal{F}} f' = f_k$), and we immediately obtain a final bound of $O(n^{-1/2})$.
To prove the result, first notice that $\forall f \in \mathcal{F}$,
\[
\mathbb{E}[Y(f; f')] = L_\mu(f; f') - L_\mu(\mathcal{T} f'; f') = \| f - \mathcal{T} f' \|_\mu^2,
\]
\[
\begin{aligned}
\mathbb{V}[Y(f; f')] &\le \mathbb{E}[Y(f; f')^2] \\
&= \mathbb{E}\Big[ \Big( \big( f(s,a) - r - \gamma V_{f'}(s') \big)^2 - \big( (\mathcal{T} f')(s,a) - r - \gamma V_{f'}(s') \big)^2 \Big)^2 \Big] \\
&= \mathbb{E}\Big[ \big( f(s,a) - (\mathcal{T} f')(s,a) \big)^2 \big( f(s,a) + (\mathcal{T} f')(s,a) - 2r - 2\gamma V_{f'}(s') \big)^2 \Big] \\
&\le 4 V_{\max}^2\, \mathbb{E}\Big[ \big( f(s,a) - (\mathcal{T} f')(s,a) \big)^2 \Big] \\
&= 4 V_{\max}^2\, \| f - \mathcal{T} f' \|_\mu^2 = 4 V_{\max}^2\, \mathbb{E}[Y(f; f')].
\end{aligned}
\]
Then, applying (one-sided) Bernstein's inequality together with a union bound over $\mathcal{F}$, and using that $\frac{1}{n}\sum_{i=1}^n Y_i(\widehat{\mathcal{T}}_{\mathcal{F}} f'; f') = L_D(\widehat{\mathcal{T}}_{\mathcal{F}} f'; f') - L_D(\mathcal{T} f'; f') \le 0$ (because $\widehat{\mathcal{T}}_{\mathcal{F}} f'$ minimizes $L_D(\cdot\,; f')$ and $\mathcal{T} f' \in \mathcal{F}$), we get with probability at least $1-\delta$:
\[
\mathbb{E}\big[ Y(\widehat{\mathcal{T}}_{\mathcal{F}} f'; f') \big] \le \sqrt{ \frac{ 8 V_{\max}^2\, \mathbb{E}\big[ Y(\widehat{\mathcal{T}}_{\mathcal{F}} f'; f') \big] \log \frac{N}{\delta} }{ n } } + \frac{ 4 V_{\max}^2 \log \frac{N}{\delta} }{ 3n }.
\]
Solving this self-bounding inequality for $\mathbb{E}[Y(\widehat{\mathcal{T}}_{\mathcal{F}} f'; f')]$ gives the desired $O\!\big( V_{\max}^2 \log(N/\delta) / n \big)$ bound; see the worked calculation below.
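The last step is elementary; here is one way to carry it out (our own calculation, writing $x := \mathbb{E}[Y(\widehat{\mathcal{T}}_{\mathcal{F}} f'; f')]$, $a := 8V_{\max}^2 \log\frac{N}{\delta} / n$, and $b := 4V_{\max}^2 \log\frac{N}{\delta} / (3n)$, and using AM-GM in the form $\sqrt{ax} \le x/2 + a/2$):
\[
x \le \sqrt{ax} + b \le \frac{x}{2} + \frac{a}{2} + b
\quad \Longrightarrow \quad
x \le a + 2b = \frac{8V_{\max}^2 \log\frac{N}{\delta}}{n} + \frac{8V_{\max}^2 \log\frac{N}{\delta}}{3n} = \frac{32\, V_{\max}^2 \log\frac{N}{\delta}}{3n}.
\]
Hence $\|\widehat{\mathcal{T}}_{\mathcal{F}} f' - \mathcal{T} f'\|_\mu^2 = O(V_{\max}^2 \log(N/\delta)/n)$ with probability at least $1-\delta$, so the role of $2\varepsilon$ in the earlier analysis is played by a quantity of order $1/n$, yielding the final $O(n^{-1/2})$ bound.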
Relaxing the definition of C The assumption $\|\nu/\mu\|_\infty \le C$ for all admissible $\nu$ can be relaxed. In particular, note that we always use this assumption in the form of
\[
\| f - \mathcal{T} f' \|_\nu \le \sqrt{C}\, \| f - \mathcal{T} f' \|_\mu
\]
for some $f, f' \in \mathcal{F}$. We can therefore literally redefine $C$ as an upper bound of
\[
\max_{f, f' \in \mathcal{F}} \frac{ \| f - \mathcal{T} f' \|_\nu^2 }{ \| f - \mathcal{T} f' \|_\mu^2 }
\]
for all admissible $\nu$. Despite the straightforward relaxation, when $\mathcal{F}$ has some nice structural properties, this new definition can be significantly tighter than the old definition based on raw density ratios. For example, when $\mathcal{F}$ is induced from a bisimulation state abstraction (which satisfies completeness), the new definition measures the density ratio between the distributions over abstract state-action pairs, which can be much smaller than that between the raw state-action pairs. More generally, when $\mathcal{F}$ is linear and Bellman completeness is satisfied, $f - \mathcal{T} f'$ is also a linear function, and the new definition measures coverage in the linear feature space; see the example below. See further discussion on this in Akshay's note.
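To spell out the linear case (our illustration, under the assumption that $\mathcal{F} = \{(s,a) \mapsto \theta^\top \phi(s,a) : \theta \in \mathbb{R}^d\}$ for a known feature map $\phi$, with Bellman completeness so that $f - \mathcal{T} f' = w^\top \phi$ for some $w \in \mathbb{R}^d$, and with $\Sigma_\rho := \mathbb{E}_{(s,a) \sim \rho}[\phi(s,a)\phi(s,a)^\top]$ positive definite for $\rho = \mu$):
\[
\frac{\|f - \mathcal{T} f'\|_\nu^2}{\|f - \mathcal{T} f'\|_\mu^2}
= \frac{w^\top \Sigma_\nu w}{w^\top \Sigma_\mu w}
\le \sup_{w \ne 0} \frac{w^\top \Sigma_\nu w}{w^\top \Sigma_\mu w}
= \lambda_{\max}\big( \Sigma_\mu^{-1/2} \Sigma_\nu \Sigma_\mu^{-1/2} \big),
\]
so the relaxed $C$ only needs to bound this relative condition number over admissible $\nu$, which can be finite even when the raw density ratio $\nu(s,a)/\mu(s,a)$ is unbounded.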
2 Alternative Analysis
Below we sketch an alternative proof of the FQI guarantee. There are two motivations:
Error propagation along “simple” policies The error propagation in the above analysis of FQI was along a somewhat “ugly” set of policies of the form $\pi_{f_k, Q^\star}$, which in each state takes the action that “witnesses” the inequality $|\max_a f(s,a) - \max_a f'(s,a)| \le \max_a |f(s,a) - f'(s,a)|$ for $f = f_k$ and $f' = Q^\star$. However, the error propagation in the ADP literature (e.g., [3]) only involved “simple” policies, such as $\pi_f$ for some $f \in \mathcal{F}$ (and the concatenation of such policies at different time steps to form a non-stationary policy).
“Modern” error-propagation analysis Error propagation in RL theory was often done by recursive expansion in the “old” literature, and the above analysis also follows this style. However, we have also seen alternative proofs based on cleaner and more elegant tools. For example, it is easy to analyze the error propagation of the “minimax algorithm” $\arg\min_{f \in \mathcal{F}} \max_{g \in \mathcal{F}} L_D(f; f) - L_D(g; f)$ [5, 6] using the following lemma: $\forall \pi, f$,
\[
J(\pi) - J(\pi_f) \le \frac{1}{1-\gamma} \Big( \mathbb{E}_{d^\pi}[\mathcal{T} f - f] + \mathbb{E}_{d^{\pi_f}}[f - \mathcal{T} f] \Big). \qquad (2)
\]
Using this lemma is also well aligned with the first motivation, as it often produces simple policies
on the RHS (which the data distribution needs to cover).
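For the reader's convenience, here is one way to verify Eq.(2) (a sketch we supply, not reproduced from [5, 6]; it uses the telescoping identity $J(\pi') - \mathbb{E}_{s \sim d_0}[f(s,\pi')] = \frac{1}{1-\gamma} \mathbb{E}_{d^{\pi'}}[\mathcal{T}^{\pi'} f - f]$, where $(\mathcal{T}^{\pi'} f)(s,a) := R(s,a) + \gamma\, \mathbb{E}_{s' \sim P(s,a)}[f(s', \pi')]$):
\[
\begin{aligned}
J(\pi) - J(\pi_f)
&= \Big( J(\pi) - \mathbb{E}_{d_0}[f(s,\pi)] \Big)
+ \Big( \mathbb{E}_{d_0}[f(s,\pi)] - \mathbb{E}_{d_0}[f(s,\pi_f)] \Big)
+ \Big( \mathbb{E}_{d_0}[f(s,\pi_f)] - J(\pi_f) \Big) \\
&\le \frac{1}{1-\gamma} \mathbb{E}_{d^{\pi}}[\mathcal{T}^{\pi} f - f] + \frac{1}{1-\gamma} \mathbb{E}_{d^{\pi_f}}[f - \mathcal{T}^{\pi_f} f]
\le \frac{1}{1-\gamma} \Big( \mathbb{E}_{d^{\pi}}[\mathcal{T} f - f] + \mathbb{E}_{d^{\pi_f}}[f - \mathcal{T} f] \Big),
\end{aligned}
\]
where the middle term in the first line is $\le 0$ because $\pi_f$ is greedy w.r.t. $f$, and the last step uses $\mathcal{T}^{\pi} f \le \mathcal{T} f$ pointwise and $\mathcal{T}^{\pi_f} f = \mathcal{T} f$.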
Lemma 2 (Non-stationary variant of Eq.(2)). Given an arbitrary sequence of functions $f_0, \ldots, f_k \in \mathbb{R}^{S \times A}$ and any (non-stationary) comparator policy $\pi$, let $\hat\pi := \pi_{f_{k:0}}$ (the non-stationary policy that follows $\pi_{f_k}, \pi_{f_{k-1}}, \ldots, \pi_{f_0}$ for the first $k+1$ steps, followed by arbitrary actions after $k+1$ steps). Then
\[
J(\pi) - J(\hat\pi) \le \sum_{t=1}^{k} \gamma^{t-1} \Big( \mathbb{E}_{d^\pi_t}[\mathcal{T} f_{k-t} - f_{k-t+1}] + \mathbb{E}_{d^{\hat\pi}_t}[f_{k-t+1} - \mathcal{T} f_{k-t}] \Big) + \gamma^k V_{\max}. \qquad (3)
\]
According to the RHS of the bound, when we choose the optimal policy $\pi^\star$ as the comparator policy $\pi$, we need the data $\mu$ to cover the state distributions induced from $d_0$ by two types of policies: $\pi^\star$ (at every time step $t \le k$), and $\pi_{f_{k:k'}}$ for all $0 \le k' \le k$. Caution: when we analyze the minimax algorithm using Eq.(2), we only need the data $\mu$ to cover the discounted occupancy as a whole, instead of covering the per-step distributions that contribute to the occupancy separately. Here we do not enjoy such a property, because our algorithm controls $\|f_t - \mathcal{T} f_{t-1}\|_{2,\mu}$ for each $t$ separately, so the change of measure must happen at each step instead of over the entire occupancy as a whole.
To further obtain a guarantee for the stationary policy $\pi_{f_k}$, write
\[
\begin{aligned}
J(\pi^\star) - J(\pi_{f_k}) &= \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_{f_k}}}\big[ V^\star(s) - Q^\star(s, \pi_{f_k}) \big] \\
&\le \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_{f_k}}}\big[ V^\star(s) - V^{\pi_{f_{k:0}}}(s) \big].
\end{aligned}
\]
Here the inequality is due to $Q^\star(s, \pi_{f_k}) \ge V^{\pi_{f_{k:0}}}(s)$: $Q^\star(s, \pi_{f_k})$ is the expected return of starting in $s$, taking $\pi_{f_k}$ immediately (which coincides with what $\pi_{f_{k:0}}$ does in the first time step, because $\pi_{f_k}$ is $\pi_{f_{k:0}}$'s first-step policy), and acting optimally thereafter, whereas $V^{\pi_{f_{k:0}}}(s)$ continues with $\pi_{f_{k:0}}$ instead of acting optimally.
Now, the RHS looks like the suboptimality of $\pi_{f_{k:0}}$ measured from the initial distribution $d^{\pi_{f_k}}$, to which we can directly apply the analysis in the previous section! We can also see that, compared to the guarantee of non-stationary FQI, here we are paying an extra $1/(1-\gamma)$ factor. The only unusual aspect is that here $d^{\pi_{f_k}}$ is treated as the initial distribution for the non-stationary FQI analysis, so finally, the distributions that need to be covered are those induced by the policies mentioned below Eq.(3), but from $d^{\pi_{f_k}}$ as the initial distribution.
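Putting the last display together with Lemma 2 spells out the final guarantee (our own combination, not written out in the note; $\tilde d^{\pi}_t$ denotes the step-$t$ distribution when the initial state is drawn from $d^{\pi_{f_k}}$ instead of $d_0$):
\[
J(\pi^\star) - J(\pi_{f_k}) \le \frac{1}{1-\gamma} \left[ \sum_{t=1}^{k} \gamma^{t-1} \Big( \mathbb{E}_{\tilde d^{\pi^\star}_t}\big[ \mathcal{T} f_{k-t} - f_{k-t+1} \big] + \mathbb{E}_{\tilde d^{\pi_{f_{k:0}}}_t}\big[ f_{k-t+1} - \mathcal{T} f_{k-t} \big] \Big) + \gamma^k V_{\max} \right],
\]
which makes the extra $1/(1-\gamma)$ factor (relative to the non-stationary guarantee) explicit.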
References
[1] Rémi Munos. Error bounds for approximate policy iteration. In ICML, volume 3, pages 560–567,
2003.
[2] András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-
residual minimization based fitted policy iteration and a single sample path. Machine Learning,
2008.
[3] Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Ma-
chine Learning Research, 9(May):815–857, 2008.
[4] Sham Kakade. Hoeffding, Chernoff, Bennet, and Bernstein Bounds, 2011. http://stat.wharton.upenn.edu/~skakade/courses/stat928/lectures/lecture06.pdf.
[5] Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning.
In Proceedings of the 36th International Conference on Machine Learning, pages 1042–1051, 2019.
[6] Tengyang Xie and Nan Jiang. Q* approximation schemes for batch reinforcement learning: A
theoretical comparison. In Conference on Uncertainty in Artificial Intelligence, pages 550–559. PMLR,
2020.
[7] Bruno Scherrer and Boris Lesner. On the use of non-stationary policies for stationary infinite-horizon Markov decision processes. In Advances in Neural Information Processing Systems, pages 1826–1834, 2012.