
Notes on Fitted Q-iteration

Nan Jiang

November 8, 2023

1 Analysis of FQI
Let M = (S, A, P, R, γ, d_0) be an MDP, where d_0 is the initial distribution over states. Given a dataset
{(s, a, r, s′)} generated from M and a Q-function class F ⊂ R^{S×A}, we want to analyze the guarantee
of Fitted Q-Iteration (FQI). This note is inspired by, and scrutinizes, the results in the Approximate Value/Policy
Iteration literature [e.g., 1, 2, 3] under simplifying assumptions.

Setup and Assumptions

1. F is finite but can be exponentially large.

2. Realizability: Q* ∈ F.

3. Bellman completeness: ∀f ∈ F, T f ∈ F. (For finite F, this implies realizability.)

4. The dataset D = {(s, a, r, s′)} is generated as follows: (s, a) ∼ µ, r ∼ R(s, a), s′ ∼ P(s, a). Define
the empirical update T̂_F f′ via

$$L_D(f; f') := \frac{1}{|D|} \sum_{(s,a,r,s') \in D} \big(f(s,a) - r - \gamma V_{f'}(s')\big)^2, \qquad \hat{\mathcal{T}}_{\mathcal{F}} f' := \operatorname*{arg\,min}_{f \in \mathcal{F}} L_D(f; f'),$$

where V_{f′}(s′) := max_{a′} f′(s′, a′). Note that by completeness, T f′ ∈ F is the Bayes optimal
regressor for the regression problem defined by L_D(f; f′). It will also be useful to define

$$L_\mu(f; f') := \mathbb{E}_D[L_D(f; f')].$$

5. For any function g : X → R, any distribution ν ∈ ∆(X), and p ≥ 1, define ‖g‖_{p,ν} := (E_{x∼ν}[|g(x)|^p])^{1/p},
and let ‖g‖_ν be a shorthand for ‖g‖_{2,ν}. These norms will be used for functions over S × A and over S.

6. Let d^π_h be the distribution of (s_h, a_h) under π, that is, d^π_h(s, a) := Pr[s_h = s, a_h = a | s_1 ∼ d_0, π],
and d^π is the usual discounted occupancy. The same notations are sometimes abused to refer to the
corresponding state marginals; which is meant will be clarified if not obvious from the context.

7. We call a state-action distribution admissible if it can be generated at some time step from d_0
in the MDP; that is, it takes the form of d^π_h for some h and some (possibly non-stationary) policy π.
Then, assume that the data is exploratory: for any admissible ν,

$$\frac{\nu(s,a)}{\mu(s,a)} \le C \qquad \forall (s,a) \in \mathcal{S} \times \mathcal{A}.$$

As a consequence, ‖·‖_ν ≤ √C ‖·‖_µ. See the slides for example scenarios where C is naturally
bounded.

8. Algorithm (simplified for analysis): let f_0 ≡ 0 (assuming 0 ∈ F), and for k ≥ 1, f_k := T̂_F f_{k−1}; see the code sketch after this list.

9. Uniform deviation bound (can be obtained by concentration inequalities and a union bound): for some ε > 0,

$$\forall f, f' \in \mathcal{F}, \quad |L_D(f; f') - L_\mu(f; f')| \le \varepsilon.$$

(Note: at the end we will show how to obtain fast rates.)
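For concreteness, here is a minimal sketch of the simplified algorithm in bullets 4 and 8, assuming a finite class F of Q-tables over integer states/actions (the function and variable names are hypothetical, not from the note):

```python
import numpy as np

def fitted_q_iteration(dataset, F, gamma, num_iters):
    """Sketch of simplified FQI: dataset is a list of (s, a, r, s_next) tuples,
    F is a finite list of candidate Q-tables of shape (n_states, n_actions),
    and f_0 = 0 is assumed to be a member of F (bullet 8)."""

    def V(f, s_next):
        # V_f(s') = max_{a'} f(s', a')
        return f[s_next].max()

    def L_D(f, f_prev):
        # Empirical squared loss L_D(f; f_prev) from bullet 4.
        return np.mean([(f[s, a] - r - gamma * V(f_prev, sp)) ** 2
                        for (s, a, r, sp) in dataset])

    f = next(g for g in F if not g.any())  # f_0 = 0
    for _ in range(num_iters):
        # f_k = argmin_{f in F} L_D(f; f_{k-1})
        f = min(F, key=lambda g: L_D(g, f))
    return f  # the output policy is greedy: pi_hat(s) = argmax_a f[s, a]
```

Enumerating F is only viable here because F is finite and small; the analysis below does not depend on how the arg min is computed.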

Goal Let π̂ := π_{f_k}. Derive an upper bound on J(π*) − J(π̂).

Analysis

$$\begin{aligned}
J(\pi^\star) - J(\hat\pi) &= \sum_{h=1}^{\infty} \gamma^{h-1}\, \mathbb{E}_{s \sim d^{\hat\pi}_h}\big[V^\star(s) - Q^\star(s, \hat\pi)\big] \\
&\le \sum_{h=1}^{\infty} \gamma^{h-1}\, \mathbb{E}_{s \sim d^{\hat\pi}_h}\big[Q^\star(s, \pi^\star) - f_k(s, \pi^\star) + f_k(s, \hat\pi) - Q^\star(s, \hat\pi)\big] \\
&\le \sum_{h=1}^{\infty} \gamma^{h-1} \Big( \|Q^\star - f_k\|_{1,\, d^{\hat\pi}_h \times \pi^\star} + \|Q^\star - f_k\|_{1,\, d^{\hat\pi}_h \times \hat\pi} \Big) \\
&\le \sum_{h=1}^{\infty} \gamma^{h-1} \Big( \|Q^\star - f_k\|_{d^{\hat\pi}_h \times \pi^\star} + \|Q^\star - f_k\|_{d^{\hat\pi}_h \times \hat\pi} \Big). \qquad (1)
\end{aligned}$$

The first inequality uses V*(s) = Q*(s, π*) together with f_k(s, π̂) ≥ f_k(s, π*) (π̂ is greedy w.r.t. f_k), and
the last inequality uses ‖·‖_{1,ν} ≤ ‖·‖_{2,ν} (Jensen). In the above equations all terms of the form d^π_h
should be treated as state distributions, and d^π_h × π′ refers to the state-action distribution generated as
s ∼ d^π_h, a ∼ π′(·|s). The last line contains two terms, both of the form ‖Q* − f_k‖_ν with some admissible
ν ∈ ∆(S × A). So it remains to bound ‖Q* − f_k‖_ν for any ν ∈ ∆(S × A) that satisfies bullet 7.
First, a helper lemma:

Lemma 1. Define π_{f,f_k}(s) := argmax_{a∈A} max{f(s, a), f_k(s, a)}. Then for all ν̃ ∈ ∆(S),

$$\|V_f - V_{f_k}\|_{\tilde\nu} \le \|f - f_k\|_{\tilde\nu \times \pi_{f, f_k}}.$$

Proof.

$$\|V_f - V_{f_k}\|_{\tilde\nu}^2 = \sum_{s \in \mathcal{S}} \tilde\nu(s) \Big(\max_{a \in \mathcal{A}} f(s, a) - \max_{a' \in \mathcal{A}} f_k(s, a')\Big)^2 \le \sum_{s \in \mathcal{S}} \tilde\nu(s) \big(f(s, \pi_{f,f_k}) - f_k(s, \pi_{f,f_k})\big)^2 = \|f - f_k\|_{\tilde\nu \times \pi_{f,f_k}}^2.$$

(The inequality holds because π_{f,f_k}(s) attains the larger of the two maxima, so the difference of the maxima is bounded in absolute value by the difference of f and f_k evaluated at π_{f,f_k}(s).)

Now we can bound ‖Q* − f_k‖_ν using Lemma 1. Define P(ν) as the distribution over S generated as
s′ ∼ P(ν) ⇔ (s, a) ∼ ν, s′ ∼ P(s, a), and

$$\begin{aligned}
\|f_k - Q^\star\|_\nu &= \|f_k - \mathcal{T} f_{k-1} + \mathcal{T} f_{k-1} - Q^\star\|_\nu \\
&\le \|f_k - \mathcal{T} f_{k-1}\|_\nu + \|\mathcal{T} f_{k-1} - \mathcal{T} Q^\star\|_\nu && (Q^\star = \mathcal{T} Q^\star) \\
&\le \sqrt{C}\, \|f_k - \mathcal{T} f_{k-1}\|_\mu + \gamma \|V_{f_{k-1}} - V^\star\|_{P(\nu)} && (*) \\
&\le \sqrt{C}\, \|f_k - \mathcal{T} f_{k-1}\|_\mu + \gamma \|f_{k-1} - Q^\star\|_{P(\nu) \times \pi_{f_{k-1}, Q^\star}}. && \text{(Lemma 1)}
\end{aligned}$$

Step (*) holds because ‖f_k − T f_{k−1}‖_ν ≤ √C ‖f_k − T f_{k−1}‖_µ by bullet 7, and

$$\begin{aligned}
\|\mathcal{T} f_{k-1} - \mathcal{T} Q^\star\|_\nu^2 &= \mathbb{E}_{(s,a)\sim\nu}\Big[\big((\mathcal{T} f_{k-1})(s,a) - (\mathcal{T} Q^\star)(s,a)\big)^2\Big] \\
&= \mathbb{E}_{(s,a)\sim\nu}\Big[\big(\gamma\, \mathbb{E}_{s'\sim P(s,a)}[V_{f_{k-1}}(s') - V^\star(s')]\big)^2\Big] \\
&\le \gamma^2\, \mathbb{E}_{(s,a)\sim\nu,\, s'\sim P(s,a)}\Big[\big(V_{f_{k-1}}(s') - V^\star(s')\big)^2\Big] && \text{(Jensen)} \\
&= \gamma^2\, \mathbb{E}_{s'\sim P(\nu)}\Big[\big(V_{f_{k-1}}(s') - V^\star(s')\big)^2\Big] = \gamma^2 \|V_{f_{k-1}} - V^\star\|_{P(\nu)}^2.
\end{aligned}$$

Note that we can apply the same analysis to P(ν) × π_{f_{k−1}, Q*} since it is also admissible, and expand
the inequality k times. It then suffices to upper bound ‖f_k − T f_{k−1}‖_µ.

$$\begin{aligned}
\|f_k - \mathcal{T} f_{k-1}\|_\mu^2 &= L_\mu(f_k; f_{k-1}) - L_\mu(\mathcal{T} f_{k-1}; f_{k-1}) && \text{(squared loss; } \mathcal{T} f_{k-1} \text{ is the Bayes optimal regressor)} \\
&\le L_D(f_k; f_{k-1}) - L_D(\mathcal{T} f_{k-1}; f_{k-1}) + 2\varepsilon && \text{(bullet 9 applied twice, using } \mathcal{T} f_{k-1} \in \mathcal{F}) \\
&\le 2\varepsilon. && (f_k \text{ minimizes } L_D(\,\cdot\,; f_{k-1}))
\end{aligned}$$

Note that the RHS does not depend on k. Expanding the recursion k times (each P(ν) × π_{f_t, Q*} is again admissible) and using ‖f_0 − Q*‖ ≤ V_max (since f_0 ≡ 0 and 0 ≤ Q* ≤ V_max), we conclude that for any admissible ν,

$$\|f_k - Q^\star\|_\nu \le \frac{1-\gamma^k}{1-\gamma}\sqrt{2C\varepsilon} + \gamma^k V_{\max}.$$
Applying this to Equation (1), we get

$$J(\pi^\star) - J(\pi_{f_k}) \le \frac{2}{1-\gamma}\left(\frac{1-\gamma^k}{1-\gamma}\sqrt{2C\varepsilon} + \gamma^k V_{\max}\right).$$
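To get a feel for how the two terms trade off against the number of iterations k, here is a small numeric sketch; the values of C, ε, γ, and R_max below are made up for illustration only.

```python
import numpy as np

def fqi_suboptimality_bound(C, eps, gamma, r_max, k):
    """Evaluate 2/(1-gamma) * ((1-gamma^k)/(1-gamma) * sqrt(2*C*eps) + gamma^k * V_max)
    with V_max = r_max / (1 - gamma)."""
    v_max = r_max / (1 - gamma)
    return 2 / (1 - gamma) * ((1 - gamma**k) / (1 - gamma) * np.sqrt(2 * C * eps)
                              + gamma**k * v_max)

# The gamma^k * V_max term vanishes geometrically; for large k the bound is
# dominated by the statistical term, roughly 2 * sqrt(2*C*eps) / (1 - gamma)^2.
for k in (1, 10, 50, 200):
    print(k, fqi_suboptimality_bound(C=4.0, eps=1e-4, gamma=0.9, r_max=1.0, k=k))
```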

Extension: fast rate The previous bound has an O(n^{−1/4}) dependence on the sample size n := |D|, because ε in bullet 9 is O(n^{−1/2}) (up to log factors) by Hoeffding's inequality and a union bound, and the final bound depends on √ε. Here we exploit realizability of the Bayes regressor (T f′ ∈ F, implied by completeness) and the structure of the squared loss to achieve a fast rate, so that the final bound is O(n^{−1/2}).
Define

$$Y(f; f') := \big(f(s,a) - r - \gamma V_{f'}(s')\big)^2 - \big((\mathcal{T} f')(s,a) - r - \gamma V_{f'}(s')\big)^2.$$

Plugging each (s, a, r, s′) ∈ D into Y(f; f′), we get i.i.d. variables Y_1(f; f′), Y_2(f; f′), ..., Y_n(f; f′)
where n = |D|. It is easy to see that

$$\frac{1}{n}\sum_{i=1}^n Y_i(f; f') = L_D(f; f') - L_D(\mathcal{T} f'; f'),$$

so we have only shifted our objective L_D by an f-independent constant. Our goal is to show that

$$\|\hat{\mathcal{T}}_{\mathcal{F}} f' - \mathcal{T} f'\|_\mu^2 = \mathbb{E}[Y(\hat{\mathcal{T}}_{\mathcal{F}} f'; f')] = O(1/n).$$

Note that this result can be directly plugged into the previous analysis by letting f′ = f_{k−1} (hence
T̂_F f′ = f_k), and we immediately obtain a final bound of O(n^{−1/2}).
To prove the result, first notice that for all f ∈ F,

$$\mathbb{E}[Y(f; f')] = L_\mu(f; f') - L_\mu(\mathcal{T} f'; f') = \|f - \mathcal{T} f'\|_\mu^2,$$

thanks to realizability of the Bayes regressor and the squared loss. Next we bound the variance of Y:

$$\begin{aligned}
\mathbb{V}[Y(f; f')] &\le \mathbb{E}[Y(f; f')^2] \\
&= \mathbb{E}\Big[\Big(\big(f(s,a) - r - \gamma V_{f'}(s')\big)^2 - \big((\mathcal{T} f')(s,a) - r - \gamma V_{f'}(s')\big)^2\Big)^2\Big] \\
&= \mathbb{E}\Big[\big(f(s,a) - (\mathcal{T} f')(s,a)\big)^2 \big(f(s,a) + (\mathcal{T} f')(s,a) - 2r - 2\gamma V_{f'}(s')\big)^2\Big] \\
&\le 4 V_{\max}^2\, \mathbb{E}\Big[\big(f(s,a) - (\mathcal{T} f')(s,a)\big)^2\Big] \\
&= 4 V_{\max}^2 \|f - \mathcal{T} f'\|_\mu^2 = 4 V_{\max}^2\, \mathbb{E}[Y(f; f')],
\end{aligned}$$

where V_max = R_max/(1 − γ) is a constant (the second inequality uses that f, T f′, and r + γV_{f′}(s′) all take values in [0, V_max]).


Next we apply the (one-sided) Bernstein inequality (see [4]) and a union bound over all f ∈ F. Let
N := |F|. For any fixed f′, with probability at least 1 − δ, for all f ∈ F,

$$\begin{aligned}
\mathbb{E}[Y(f; f')] - \frac{1}{n}\sum_{i=1}^n Y_i(f; f') &\le \sqrt{\frac{2\,\mathbb{V}[Y(f; f')] \log\frac{N}{\delta}}{n}} + \frac{4 V_{\max}^2 \log\frac{N}{\delta}}{3n} && (Y_i \in [-V_{\max}^2, V_{\max}^2]) \\
&\le \sqrt{\frac{8 V_{\max}^2\, \mathbb{E}[Y(f; f')] \log\frac{N}{\delta}}{n}} + \frac{4 V_{\max}^2 \log\frac{N}{\delta}}{3n}.
\end{aligned}$$

Since T̂_F f′ minimizes L_D(·; f′), it also minimizes (1/n) Σ_{i=1}^n Y_i(·; f′), because the two objectives only
differ by the constant L_D(T f′; f′). Hence,

$$\frac{1}{n}\sum_{i=1}^n Y_i(\hat{\mathcal{T}}_{\mathcal{F}} f'; f') \le \frac{1}{n}\sum_{i=1}^n Y_i(\mathcal{T} f'; f') = 0.$$

Then,

$$\mathbb{E}[Y(\hat{\mathcal{T}}_{\mathcal{F}} f'; f')] \le \sqrt{\frac{8 V_{\max}^2\, \mathbb{E}[Y(\hat{\mathcal{T}}_{\mathcal{F}} f'; f')] \log\frac{N}{\delta}}{n}} + \frac{4 V_{\max}^2 \log\frac{N}{\delta}}{3n}.$$

Solving this quadratic inequality (in the square root of the left-hand side),

$$\mathbb{E}[Y(\hat{\mathcal{T}}_{\mathcal{F}} f'; f')] \le \left(\sqrt{2} + \sqrt{\tfrac{10}{3}}\right)^2 \frac{V_{\max}^2 \log\frac{N}{\delta}}{n}.$$
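For completeness, the quadratic step can be spelled out as follows (a routine manipulation, written in my own shorthand): with x := E[Y(T̂_F f′; f′)], a := 8V²_max log(N/δ)/n, and b := 4V²_max log(N/δ)/(3n),

$$x \le \sqrt{ax} + b \;\Longrightarrow\; \sqrt{x} \le \frac{\sqrt{a} + \sqrt{a + 4b}}{2} \;\Longrightarrow\; x \le \left(\frac{\sqrt{a} + \sqrt{a+4b}}{2}\right)^2 = \left(\sqrt{2} + \sqrt{\tfrac{10}{3}}\right)^2 \frac{V_{\max}^2 \log\frac{N}{\delta}}{n}.$$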

Relaxing the definition of C The assumption ‖ν/µ‖_∞ ≤ C for all admissible ν can be relaxed. In
particular, note that we only ever use this assumption in the form

$$\|f - \mathcal{T} f'\|_\nu \le \sqrt{C}\, \|f - \mathcal{T} f'\|_\mu$$

for some f, f′ ∈ F. We can therefore redefine C as an upper bound on

$$\max_{f, f' \in \mathcal{F}} \frac{\|f - \mathcal{T} f'\|_\nu^2}{\|f - \mathcal{T} f'\|_\mu^2}$$

for all admissible ν. Despite being a straightforward relaxation, when F has some nice structural properties
this new definition can be significantly tighter than the old one based on raw density ratios.
For example, when F is induced from a bisimulation state abstraction (which satisfies completeness),
the new definition measures the density ratio between the distributions over abstract state-action pairs,
which can be much smaller than that between the raw state-action pairs. More generally, when F is
linear and Bellman completeness is satisfied, f − T f′ is also a linear function, and the new definition
measures coverage in the linear feature space. See further discussion on this in Akshay's note.
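To make the linear case concrete, here is a small numpy sketch (my own construction with hypothetical inputs): assuming f − T f′ = φ(s,a)ᵀθ for some features φ, the ratio above is at most the largest eigenvalue of Σ_µ^{−1/2} Σ_ν Σ_µ^{−1/2}, which the function below estimates from feature samples drawn under µ and under one admissible ν.

```python
import numpy as np

def relaxed_concentrability(phi_mu, phi_nu):
    """phi_mu, phi_nu: arrays of shape (n, d) whose rows are feature vectors
    phi(s, a) sampled from mu and from an admissible nu. Returns an estimate of
    max_theta ||phi^T theta||_nu^2 / ||phi^T theta||_mu^2, an upper bound on the
    relaxed C for this nu (assumes the mu second-moment matrix is invertible)."""
    sigma_mu = phi_mu.T @ phi_mu / len(phi_mu)  # second moments under mu
    sigma_nu = phi_nu.T @ phi_nu / len(phi_nu)  # second moments under nu
    w, U = np.linalg.eigh(sigma_mu)
    inv_sqrt = U @ np.diag(1.0 / np.sqrt(w)) @ U.T  # Sigma_mu^{-1/2}
    return float(np.linalg.eigvalsh(inv_sqrt @ sigma_nu @ inv_sqrt).max())
```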

2 Alternative Analysis
Below we sketch an alternative proof of the FQI guarantee. There are two motivations:

Error propagation along “simple” policies The error propagation in the above analysis of FQI was
along a somewhat “ugly” set of policies of the form π_{f_k, Q*}, which in each state takes the action
that “witnesses” the inequality |max_a f(s, a) − max_a f′(s, a)| ≤ max_a |f(s, a) − f′(s, a)| for f = f_k
and f′ = Q*. However, the error propagation in the ADP literature (e.g., [3]) only involves “simple”
policies, such as π_f for some f ∈ F (and the concatenation of such policies at different time steps to
form a non-stationary policy).

“Modern” error-propagation analysis Error propagation in RL theory was often done by recursive
expansion in the “old” literature, and the above analysis also follows this style. However, we have
also seen alternative proofs based on cleaner and more elegant tools. For example, it is easy to analyze
the error propagation of the “minimax algorithm” argmin_{f∈F} max_{g∈F} L_D(f; f) − L_D(g; f) [5, 6] using
the following lemma: for all π and f,

$$J(\pi) - J(\pi_f) \le \frac{1}{1-\gamma}\Big(\mathbb{E}_{d^\pi}[\mathcal{T} f - f] + \mathbb{E}_{d^{\pi_f}}[f - \mathcal{T} f]\Big). \qquad (2)$$

Using this lemma is also well aligned with the first motivation, as it often produces simple policies
on the RHS (which the data distribution needs to cover).
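For concreteness, a minimal sketch of the stated minimax objective over a finite class, reusing the hypothetical Q-table conventions of the FQI sketch in Section 1 (this illustrates the objective only, not the exact algorithms analyzed in [5, 6]):

```python
import numpy as np

def minimax_estimator(dataset, F, gamma):
    """Sketch of argmin_{f in F} max_{g in F} L_D(f; f) - L_D(g; f), where
    dataset is a list of (s, a, r, s_next) tuples and F is a list of Q-tables."""

    def V(f, s_next):
        return f[s_next].max()

    def L_D(f, f_prev):
        return np.mean([(f[s, a] - r - gamma * V(f_prev, sp)) ** 2
                        for (s, a, r, sp) in dataset])

    def estimated_bellman_error(f):
        # max_g [L_D(f; f) - L_D(g; f)]: how much better the best regressor in F
        # fits the targets r + gamma * V_f(s') than f itself does.
        return L_D(f, f) - min(L_D(g, f) for g in F)

    return min(F, key=estimated_bellman_error)
```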

2.1 Performance guarantee for non-stationary FQI


A major difficulty in applying Eq.(2) to FQI is that it requires control over the Bellman error ‖f − T f‖,
i.e., the learned function should be “self-consistent”. However, in FQI we only have control over
‖f_t − T f_{t−1}‖: the output function f_k is not necessarily consistent with itself, but rather with its
previous iterate f_{k−1}, which is in turn consistent with its previous iterate f_{k−2}, and so on.
To overcome this difficulty, we first consider a different output policy: π_{f_k:0} := π_{f_k} ∘ π_{f_{k−1}} ∘ · · · ∘ π_{f_0}.
This is a non-stationary policy, and after π_{f_0} we take arbitrary actions.¹ We call FQI with this output
policy (instead of π_{f_k}) non-stationary FQI. Similar to the situation in value iteration (see note1), such
a non-stationary policy actually has a better guarantee than the usual FQI policy, saving a factor of
horizon,² and is also easier to analyze.
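As a quick illustration of how π_{f_k:0} acts (a hypothetical helper using the Q-table convention from the Section 1 sketch; after step k + 1 it falls back to an arbitrary choice, here staying greedy w.r.t. f_0):

```python
def nonstationary_action(fs, t, s):
    """fs = [f_0, ..., f_k] are Q-tables; pi_{f_k:0} is greedy w.r.t. f_k at
    (1-indexed) step t = 1, f_{k-1} at step 2, ..., f_0 at step k + 1, and may
    take arbitrary actions afterwards (here: keep using f_0)."""
    k = len(fs) - 1
    f = fs[max(k - (t - 1), 0)]
    return int(f[s].argmax())
```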
In particular, we can now use (a variant of) Eq.(2), because π_{f_k:0} is greedy w.r.t. a self-consistent
function, namely f_{k:0} := f_k ∘ f_{k−1} ∘ · · · ∘ f_0! The unusual aspect here is that we typically consider
value functions as functions of states and actions only, i.e., they are stationary, whereas the function
f_k ∘ f_{k−1} ∘ · · · ∘ f_0 is itself a non-stationary object. Also, compared to Eq.(2), we will need to include
a truncation error here. The lemma is given below, with the proof left as a homework exercise:
¹ The result might be improved (though it is unclear whether the improvement is significant) if we produce a periodic policy that simply repeats π_{f_k:0} forever [7].

² Another way to save this factor of horizon is to run the minimax algorithm [6].

Lemma 2 (Non-stationary variant of Eq.(2)). Given an arbitrary sequence of functions f_0, ..., f_k ∈ R^{S×A}
and any (non-stationary) comparator policy π, let π̂ := π_{f_k:0} (followed by arbitrary actions after k + 1 steps). Then

$$J(\pi) - J(\hat\pi) \le \sum_{t=1}^{k} \gamma^{t-1}\Big(\mathbb{E}_{d^\pi_t}[\mathcal{T} f_{k-t} - f_{k-t+1}] + \mathbb{E}_{d^{\hat\pi}_t}[f_{k-t+1} - \mathcal{T} f_{k-t}]\Big) + \gamma^k V_{\max}. \qquad (3)$$

According to the RHS of the bound, when we choose the optimal policy π* as the comparator
policy π, we need the data distribution µ to cover the state distributions induced from d_0 by two types
of policies: π* run for t steps, for all t ≤ k, and π_{f_k:k′} for all 0 ≤ k′ ≤ k. Caution: when we analyze the
minimax algorithm using Eq.(2), we only need µ to cover the discounted occupancy as a whole, instead
of covering the per-step distributions that contribute to the occupancy separately. Here we do not enjoy
such a property, because our algorithm controls ‖f_t − T f_{t−1}‖_{2,µ} for each t separately, so the change of
measure must happen at each step instead of over the entire occupancy as a whole.

2.2 Performance guarantee for FQI


The previous section sketched an analysis of non-stationary FQI. To relate FQI to the analysis of its
non-stationary variant, we use the performance difference lemma:

$$\begin{aligned}
J(\pi^\star) - J(\pi_{f_k}) &= \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_{f_k}}}\big[V^\star(s) - Q^\star(s, \pi_{f_k})\big] \\
&\le \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_{f_k}}}\big[V^\star(s) - V^{\pi_{f_k:0}}(s)\big].
\end{aligned}$$

Here the inequality is due to Q*(s, π_{f_k}) ≥ V^{π_{f_k:0}}(s): Q*(s, π_{f_k}) is the expected return of
starting in s, taking π_{f_k} immediately (which coincides with the first step of π_{f_k:0}, since π_{f_k}
is π_{f_k:0}'s first-step policy), and acting optimally thereafter, whereas V^{π_{f_k:0}}(s) continues to follow π_{f_k:0} afterwards.
Now the RHS looks like the suboptimality of π_{f_k:0}, to which we can directly apply the
analysis of the previous section! We can also see that, compared to the guarantee of non-stationary
FQI, here we pay an extra 1/(1 − γ) factor. The only unusual aspect is that d^{π_{f_k}} is treated as
the initial distribution for the non-stationary FQI analysis, so in the end the distributions that need to be
covered are those induced by the policies mentioned at the end of Section 2.1, but with d^{π_{f_k}} as the
initial distribution.
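Spelling this out (a sketch of the combined bound: apply Lemma 2 with comparator π* and with d^{π_{f_k}} in place of d_0, writing d̃_t for the resulting step-t distributions):

$$J(\pi^\star) - J(\pi_{f_k}) \le \frac{1}{1-\gamma}\left(\sum_{t=1}^{k} \gamma^{t-1}\Big(\mathbb{E}_{\tilde d^{\pi^\star}_t}\big[\mathcal{T} f_{k-t} - f_{k-t+1}\big] + \mathbb{E}_{\tilde d^{\pi_{f_k:0}}_t}\big[f_{k-t+1} - \mathcal{T} f_{k-t}\big]\Big) + \gamma^k V_{\max}\right).$$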

References
[1] Rémi Munos. Error bounds for approximate policy iteration. In ICML, volume 3, pages 560–567,
2003.

[2] András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-
residual minimization based fitted policy iteration and a single sample path. Machine Learning,
2008.

[3] Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Ma-
chine Learning Research, 9(May):815–857, 2008.

[4] Sham Kakade. Hoeffding, Chernoff, Bennet, and Bernstein Bounds, 2011. http://stat.wharton.upenn.edu/~skakade/courses/stat928/lectures/lecture06.pdf.

[5] Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning.
In Proceedings of the 36th International Conference on Machine Learning, pages 1042–1051, 2019.

[6] Tengyang Xie and Nan Jiang. Q* approximation schemes for batch reinforcement learning: A
theoretical comparison. In Conference on Uncertainty in Artificial Intelligence, pages 550–559. PMLR,
2020.

[7] Bruno Scherrer and Boris Lesner. On the use of non-stationary policies for stationary infinite-
horizon Markov decision processes. In Advances in Neural Information Processing Systems, pages
1826–1834, 2012.
