
Notes on Tabular Methods

Nan Jiang

September 27, 2022

1 Overview of the methods


1.1 Tabular certainty-equivalence
Certainty-equivalence is a model-based RL algorithm, that is, it first estimates an MDP model from
data, and then performs policy evaluation or optimization in the estimated model as if it were true.
To specify the algorithm it suffices to specify the model estimation step.
Given a dataset $D$ of trajectories, $D = \{(s_1, a_1, r_1, s_2, \ldots, s_{H+1})\}$, we first convert it into a bag of
$(s, a, r, s')$ tuples, where each trajectory is broken into $H$ tuples: $(s_1, a_1, r_1, s_2), (s_2, a_2, r_2, s_3), \ldots,
(s_H, a_H, r_H, s_{H+1})$. For every $s \in S, a \in A$, define $D_{s,a}$ as the subset of tuples where the first element
of the tuple is $s$ and the second is $a$, and we write $(r, s') \in D_{s,a}$ since all tuples in $D_{s,a}$ share the
same state-action pair. The tabular certainty-equivalence model uses the following estimate of the
transition function $\hat{P}$: letting $e_{s'}$ be the unit vector whose $s'$-th entry is 1 and all other entries are 0,

$\hat{P}(s, a) = \frac{1}{|D_{s,a}|} \sum_{(r, s') \in D_{s,a}} e_{s'}.$    (1)

Here $\mathbb{I}(\cdot)$ denotes the indicator function, i.e., $\hat{P}(s'|s,a) = \frac{1}{|D_{s,a}|}\sum_{(r,\tilde{s}')\in D_{s,a}} \mathbb{I}(\tilde{s}' = s')$; in words, $\hat{P}(s'|s,a)$ is simply the empirical frequency of observing
$s'$ after taking $a$ in state $s$. Similarly, when the reward function also needs to be learned, the estimate is
$\hat{R}(s, a) = \frac{1}{|D_{s,a}|} \sum_{(r, s') \in D_{s,a}} r.$    (2)

$\hat{P}$ and $\hat{R}$ are the maximum likelihood estimates of the transition and the reward functions, respectively. Note that for the transition estimate to be well-defined we need $n(s,a) := |D_{s,a}| > 0$ for every $(s,a) \in S \times A$.
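
To make the estimation step concrete, here is a minimal sketch in Python (assuming numpy and integer-encoded states and actions; the function name is illustrative, not from the note):

```python
import numpy as np

def estimate_mdp(tuples, num_states, num_actions):
    """Tabular certainty-equivalence estimation from (s, a, r, s') tuples.

    Returns the maximum-likelihood estimates P_hat[s, a] (a distribution over
    next states) and R_hat[s, a] (a mean reward), as in Eqs. (1) and (2).
    Assumes every (s, a) pair appears at least once in the data.
    """
    counts = np.zeros((num_states, num_actions, num_states))
    reward_sums = np.zeros((num_states, num_actions))
    visits = np.zeros((num_states, num_actions))
    for s, a, r, s_next in tuples:
        counts[s, a, s_next] += 1.0
        reward_sums[s, a] += r
        visits[s, a] += 1.0
    P_hat = counts / visits[:, :, None]   # empirical next-state frequencies
    R_hat = reward_sums / visits          # empirical mean rewards
    return P_hat, R_hat
```

Policy evaluation or optimization then proceeds in $(S, A, \hat{P}, \hat{R}, \gamma)$ as if it were the true model.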

1.2 Value-based tabular methods


Certainty-equivalence explicitly stores an estimated MDP model, which has $O(|S|^2|A|)$ space complexity, and the algorithm has a batch nature, i.e., it is invoked after all the data are collected. In
contrast, there is another popular family of RL algorithms that (1) only model the Q-value function,
hence have $O(|S||A|)$ space complexity, and (2) can be applied in an online manner, i.e., the algorithm
runs as more and more data are collected. Well-known examples include Q-learning [1] and Sarsa [2].
Another very appealing property of these methods is that it is relatively easy to incorporate sophisticated generalization schemes, such as deep neural networks, which has recently led to many
empirical successes [3]. On the other hand, such methods are typically less sample-efficient than
model-based methods and will not be discussed in more detail in this note.1
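
For concreteness, here is a minimal sketch of tabular Q-learning with $\epsilon$-greedy exploration (a hedged illustration rather than the exact algorithms of [1, 2]; the environment interface and the hyperparameters are assumptions):

```python
import numpy as np

def q_learning(env, num_states, num_actions, gamma=0.99,
               alpha=0.1, epsilon=0.1, num_steps=100_000, seed=0):
    """Online tabular Q-learning: only Q (size |S| x |A|) is stored,
    and updates happen as each transition is observed."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    s = env.reset()                     # assumed: returns an integer state
    for _ in range(num_steps):
        # epsilon-greedy action selection
        if rng.random() < epsilon:
            a = int(rng.integers(num_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)   # assumed: (next state, reward, terminal flag)
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])   # incremental temporal-difference update
        s = env.reset() if done else s_next
    return Q
```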

2 Analysis of certainty-equivalence RL
Here we analyze the method introduced in Section 1.1. For simplicity we further assume that the data
are generated by sampling each $(s,a)$ a fixed number of times. We are interested in deriving high-probability guarantees for the optimal policy of $\hat{M} = (S, A, \hat{P}, \hat{R}, \gamma)$ as a function of $n \equiv |D_{s,a}|$.
We provide three different analyses of the algorithm, and we will see some interesting trade-offs
between the state-space and horizon dependence.

2.1 Naive analysis


The basic idea is: when $n$ is sufficiently large, we expect $\hat{R} \approx R$ and $\hat{P} \approx P$. In particular, by Hoeffding's inequality and a union bound, the following inequalities hold with probability at least $1 - \delta$:

$\max_{s,a} |\hat{R}(s,a) - R(s,a)| \le R_{\max}\sqrt{\frac{1}{2n}\ln\frac{4|S\times A|}{\delta}}$    (3)

and

$\max_{s,a,s'} |\hat{P}(s'|s,a) - P(s'|s,a)| \le \sqrt{\frac{1}{2n}\ln\frac{4|S\times A\times S|}{\delta}}.$    (4)

Note that we first split the failure probability $\delta$ evenly between the reward estimation events and the
transition estimation events. Then for the reward, we split $\delta/2$ evenly among all $(s,a)$; for the transition, we
split $\delta/2$ evenly among all $(s,a,s')$. From Eq.(4) we further have

$\max_{s,a} \|\hat{P}(s,a) - P(s,a)\|_1 \le \max_{s,a} |S| \cdot \|\hat{P}(s,a) - P(s,a)\|_\infty \le |S|\sqrt{\frac{1}{2n}\ln\frac{4|S\times A\times S|}{\delta}}.$    (5)

To bound the suboptimality of $\pi_{\hat{M}}$, we first introduce the simulation lemma [6].

Lemma 1 (Simulation Lemma). If $\max_{s,a} |\hat{R}(s,a) - R(s,a)| \le \epsilon_R$ and $\max_{s,a} \|\hat{P}(s,a) - P(s,a)\|_1 \le \epsilon_P$,
then for any policy $\pi: S \to A$, we have

$\|V^\pi_{\hat{M}} - V^\pi_M\|_\infty \le \frac{\epsilon_R}{1-\gamma} + \frac{\gamma \epsilon_P V_{\max}}{2(1-\gamma)},$

where $V_{\max} := R_{\max}/(1-\gamma)$.


1 Techniques such as experience replay can be used to improve the sample efficiency of many online algorithms [4], but this
also blurs the boundary between value-based and model-based methods [5].

Proof. For any $s \in S$,

$|V^\pi_{\hat{M}}(s) - V^\pi_M(s)|$
$= |\hat{R}(s,\pi) + \gamma\langle \hat{P}(s,\pi), V^\pi_{\hat{M}}\rangle - R(s,\pi) - \gamma\langle P(s,\pi), V^\pi_M\rangle|$
$\le \epsilon_R + \gamma|\langle \hat{P}(s,\pi), V^\pi_{\hat{M}}\rangle - \langle P(s,\pi), V^\pi_{\hat{M}}\rangle + \langle P(s,\pi), V^\pi_{\hat{M}}\rangle - \langle P(s,\pi), V^\pi_M\rangle|$
$\le \epsilon_R + \gamma|\langle \hat{P}(s,\pi) - P(s,\pi), V^\pi_{\hat{M}}\rangle| + \gamma\|V^\pi_{\hat{M}} - V^\pi_M\|_\infty$
$= \epsilon_R + \gamma|\langle \hat{P}(s,\pi) - P(s,\pi), V^\pi_{\hat{M}} - \tfrac{V_{\max}}{2}\cdot\mathbf{1}\rangle| + \gamma\|V^\pi_{\hat{M}} - V^\pi_M\|_\infty$
$\le \epsilon_R + \gamma\|\hat{P}(s,\pi) - P(s,\pi)\|_1\,\|V^\pi_{\hat{M}} - \tfrac{V_{\max}}{2}\cdot\mathbf{1}\|_\infty + \gamma\|V^\pi_{\hat{M}} - V^\pi_M\|_\infty$
$\le \epsilon_R + \frac{\gamma\epsilon_P V_{\max}}{2} + \gamma\|V^\pi_{\hat{M}} - V^\pi_M\|_\infty.$

Since this holds for all $s \in S$, we can also take the infinity norm on the LHS; rearranging and solving for $\|V^\pi_{\hat{M}} - V^\pi_M\|_\infty$ yields the desired
result. Note that we subtract $\frac{V_{\max}}{2}\cdot\mathbf{1}$ ($\mathbf{1}$ is the all-one vector) to center the range of $V^\pi_{\hat{M}}$ around the
origin, which exploits the fact that both $\hat{P}(s,\pi)$ and $P(s,\pi)$ are valid probability distributions and
sum up to 1.

Alternative proof of Simulation Lemma Here we sketch an alternative and more "modern" proof
of the simulation lemma. The proof relies on the following identity: $\forall f \in \mathbb{R}^S, s_0 \in S$,2

$f(s_0) - V^\pi_M(s_0) = \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi,s_0},\, a\sim\pi(\cdot|s),\, r\sim R(s,a),\, s'\sim P(\cdot|s,a)}[f(s) - r - \gamma f(s')].$    (6)

To see why, first note that $V^\pi_M(s_0) = \frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi,s_0},\, a\sim\pi(\cdot|s),\, r\sim R(s,a)}[r]$, so the corresponding terms can
be dropped on the two sides of the equation. For the remaining terms, the RHS is

$\frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi,s_0},\, a\sim\pi(\cdot|s),\, s'\sim P(\cdot|s,a)}[f(s) - \gamma f(s')] = \sum_{t=1}^{\infty} \gamma^{t-1}\,\mathbb{E}_{s\sim d^{\pi,s_0}_t,\, a\sim\pi(\cdot|s),\, s'\sim P(\cdot|s,a)}[f(s) - \gamma f(s')].$

Recall that $d^{\pi,s_0}_t$ is the distribution of $s_t$ under $\pi$ starting from $s_0$. In this summation, the $\gamma f(s')$ term
for $t$ cancels out exactly with the $f(s)$ term for $t+1$, because both $s'$ (in the term for $t$) and $s$ (in the term for $t+1$) are distributed according to
$d^{\pi,s_0}_{t+1}$, and the difference in discount factors ($\gamma^{t-1}$ vs. $\gamma^t$) accounts for the $\gamma$ in $\gamma f(s')$. As a result, only
the first term $f(s_0)$ is left, which is the same as the remaining term on the LHS. In fact, this identity is
effectively the Bellman flow equation for $d^\pi$, written in a form where $f$ serves as a "test function" or
discriminator for the Bellman flow equation.

To prove the simulation lemma, we simply let $f = V^\pi_{\hat{M}}$. On the RHS, we marginalize out $r$ and $s'$, and
obtain

$\frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi,s_0},\, a\sim\pi(\cdot|s)}[V^\pi_{\hat{M}}(s) - R(s,a) - \gamma\langle P(\cdot|s,a), V^\pi_{\hat{M}}\rangle]$
$= \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi,s_0},\, a\sim\pi(\cdot|s)}[\hat{R}(s,a) + \gamma\langle \hat{P}(\cdot|s,a), V^\pi_{\hat{M}}\rangle - R(s,a) - \gamma\langle P(\cdot|s,a), V^\pi_{\hat{M}}\rangle].$

The rest of the proof follows similarly to the original proof.


2 The RHS can also be conveniently written as $\frac{1}{1-\gamma}\mathbb{E}_{s\sim d^\pi}[f - T^\pi f]$.

Turning back to the analysis of certainty-equivalence, the following lemma translates the policy
evaluation error to the suboptimality of $\pi_{\hat{M}}$:

Lemma 2 (Evaluation error to decision loss). $\forall s \in S$, $V^\star_M(s) - V^{\pi_{\hat{M}}}_M(s) \le 2\sup_{\pi:S\to A}\|V^\pi_{\hat{M}} - V^\pi_M\|_\infty$.

Proof. For any $s \in S$,

$V^\star_M(s) - V^{\pi_{\hat{M}}}_M(s) = V^{\pi^\star_M}_M(s) - V^{\pi^\star_M}_{\hat{M}}(s) + V^{\pi^\star_M}_{\hat{M}}(s) - V^{\pi_{\hat{M}}}_M(s)$
$\le V^{\pi^\star_M}_M(s) - V^{\pi^\star_M}_{\hat{M}}(s) + V^{\pi_{\hat{M}}}_{\hat{M}}(s) - V^{\pi_{\hat{M}}}_M(s)$    ($\pi_{\hat{M}}$ maximizes $V^\pi_{\hat{M}}$)
$\le \|V^{\pi^\star_M}_M - V^{\pi^\star_M}_{\hat{M}}\|_\infty + \|V^{\pi_{\hat{M}}}_{\hat{M}} - V^{\pi_{\hat{M}}}_M\|_\infty.$

Putting Lemmas 1 and 2 together with the concentration inequalities, we can see that the suboptimality we incur is

$V^\star_M(s) - V^{\pi_{\hat{M}}}_M(s) = \tilde{O}\!\left(\frac{|S|\, V_{\max}}{\sqrt{n}\,(1-\gamma)}\right), \quad \forall s \in S.$

Here $\tilde{O}(\cdot)$ suppresses poly-logarithmic dependences on $|S|$ and $|A|$; in this note we also omit the dependence on $R_{\max}$ and $1/\delta$, and only highlight the dependence on $|S|$, $n$, and $1/(1-\gamma)$.
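
To make the certainty-equivalence pipeline concrete end to end, here is a hedged sketch of how one could compute $\pi_{\hat{M}}$ by value iteration on $\hat{M}$ and measure its suboptimality in the true MDP $M$ (the helper names are illustrative, and `P_hat`, `R_hat` are assumed to come from an estimation step such as the earlier sketch):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Compute Q* of the MDP (P, R, gamma) by iterating the Bellman optimality operator.
    P has shape (S, A, S); R has shape (S, A)."""
    Q = np.zeros_like(R)
    while True:
        V = Q.max(axis=1)
        Q_new = R + gamma * P @ V        # (T Q)(s,a) = R(s,a) + gamma <P(s,a), max_a' Q(., a')>
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new

def policy_evaluation(P, R, gamma, pi):
    """Exact V^pi in the MDP (P, R, gamma) by solving the linear Bellman system."""
    S = R.shape[0]
    P_pi = P[np.arange(S), pi]           # (S, S): transition matrix under pi
    R_pi = R[np.arange(S), pi]           # (S,): reward under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

# Certainty-equivalence: plan in the estimated model, evaluate in the true one.
# pi_hat = value_iteration(P_hat, R_hat, gamma).argmax(axis=1)
# pi_star = value_iteration(P, R, gamma).argmax(axis=1)
# suboptimality = policy_evaluation(P, R, gamma, pi_star) - policy_evaluation(P, R, gamma, pi_hat)
```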

2.2 Improving $|S|$ to $\sqrt{|S|}$
The previous analysis proves concentration for each individual $P(s'|s,a)$ and adds up the errors to
give an $\ell_1$ error bound, which is loose. We can obtain a tighter analysis by proving an $\ell_1$ concentration
bound for the multinomial distribution directly.
Note that for any vector $v \in \mathbb{R}^{|S|}$,

$\|v\|_1 = \sup_{u\in\{-1,1\}^{|S|}} u^\top v.$

Each $u \in \{-1,1\}^{|S|}$ projects the vector $v$ to some scalar value. If $v$ can be written as the sum of
zero-mean i.i.d. vectors, we can prove concentration for $u^\top v$ first, and then union bound over all $u$
to obtain the $\ell_1$ error bound. Concretely, for any fixed $(s,a)$ pair and any fixed $u\in\{-1,1\}^{|S|}$, with
probability at least $1 - \delta/(2|S\times A|\cdot 2^{|S|})$, we have

$u^\top(\hat{P}(s,a) - P(s,a)) \le 2\sqrt{\frac{1}{2n}\ln\frac{2|S\times A|\cdot 2^{|S|}}{\delta}},$    (7)

because $u^\top \hat{P}(s,a)$ is the average of i.i.d. random variables $u^\top e_{s'}$ with bounded range $[-1,1]$.3 This
leads to the following improvement over Eq.(5): w.p. at least $1-\delta/2$,

$\max_{s,a}\|\hat{P}(s,a) - P(s,a)\|_1 = \max_{s,a}\max_{u\in\{-1,1\}^{|S|}} u^\top(\hat{P}(s,a) - P(s,a)) \le 2\sqrt{\frac{1}{2n}\ln\frac{2|S\times A|\cdot 2^{|S|}}{\delta}}.$    (8)

Roughly speaking, the $\tilde{O}(|S|\sqrt{\tfrac{1}{n}})$ bound in Eq.(5) is improved to $\tilde{O}(\sqrt{\tfrac{|S|}{n}})$, and propagating the improvement through the remainder of the analysis yields

$V^\star_M(s) - V^{\pi_{\hat{M}}}_M(s) = \tilde{O}\!\left(\frac{\sqrt{|S|}\,V_{\max}}{\sqrt{n}\,(1-\gamma)}\right), \quad \forall s\in S.$
n(1 − γ)
3 Also note that we only bound the deviation from one side, so we save a factor of 2 inside the ln compared to bounding the absolute
deviation. Another tiny improvement: for $u\in\{-1,1\}^{|S|}$, one can ignore $u = \pm\mathbf{1}$ since $\pm\mathbf{1}^\top(\hat{P}(s,a) - P(s,a)) \equiv 0$.
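
The $\sqrt{|S|/n}$ rate in Eq.(8) is easy to see empirically; below is a small Monte Carlo sketch (the Dirichlet choice of the true distribution, the sample sizes, and the number of repetitions are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
S = 50                                   # number of next states
p = rng.dirichlet(np.ones(S))            # an arbitrary "true" distribution P(.|s,a)

for n in [100, 400, 1600, 6400]:
    errs = []
    for _ in range(200):                 # repeat to estimate the typical error
        counts = rng.multinomial(n, p)
        p_hat = counts / n               # empirical frequencies, as in Eq. (1)
        errs.append(np.abs(p_hat - p).sum())   # l1 error of the estimated distribution
    # The mean l1 error should shrink roughly like sqrt(S / n).
    print(n, np.mean(errs), np.sqrt(S / n))
```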

2.3 No dependence on |S|
The last analysis removes the dependence of $n$ on $|S|$, at the cost of an additional dependence on $\frac{1}{1-\gamma}$.
Note that the total number of samples still scales with $|S|$, as we require $n$ samples per $(s,a)$.
The core idea is to show $Q^\star_{\hat{M}} \approx Q^\star_M$, and then upper bound the loss by Lemma 4 from Note 1. First,
by contraction we have

$\|Q^\star_{\hat{M}} - Q^\star_M\|_\infty \le \frac{1}{1-\gamma}\|Q^\star_M - T_{\hat{M}} Q^\star_M\|_\infty.$    (9)

This is because

$\|Q^\star_{\hat{M}} - Q^\star_M\|_\infty = \|T_{\hat{M}} Q^\star_{\hat{M}} - T_{\hat{M}} Q^\star_M + T_{\hat{M}} Q^\star_M - Q^\star_M\|_\infty \le \gamma\|Q^\star_{\hat{M}} - Q^\star_M\|_\infty + \|T_{\hat{M}} Q^\star_M - Q^\star_M\|_\infty.$    ($T_{\hat{M}}$ is a $\gamma$-contraction)

Then we bound the RHS in the following lemma.

Lemma 3. For any fixed $s \in S, a \in A$, with probability at least $1-\delta$,

$\left|Q^\star_M(s,a) - \left(\hat{R}(s,a) + \gamma\langle \hat{P}(s,a), V^\star_M\rangle\right)\right| \le \frac{R_{\max}}{1-\gamma}\sqrt{\frac{1}{2n}\log\frac{2}{\delta}}.$

Proof. The bound follows directly from Hoeffding's inequality upon the following observation:

$\hat{R}(s,a) + \gamma\langle\hat{P}(s,a), V^\star_M\rangle = \frac{1}{n}\sum_{(r,s')\in D_{s,a}}\left(r + \gamma V^\star_M(s')\right).$

Note that the RHS is the average of i.i.d. random variables $(r + \gamma V^\star_M(s'))$ in the interval $[0, \frac{R_{\max}}{1-\gamma}]$,
whose expectation is exactly $Q^\star_M(s,a)$. Therefore, the LHS of the lemma statement is the deviation of an
average of i.i.d. variables from its expectation, where Hoeffding's inequality applies.

Note that the LHS of the lemma statement is simply the $(s,a)$-th entry of $(Q^\star_M - T_{\hat{M}} Q^\star_M)$. The final
result we can get is

$V^\star_M(s) - V^{\pi_{\hat{M}}}_M(s) = \tilde{O}\!\left(\frac{V_{\max}}{\sqrt{n}\,(1-\gamma)^2}\right), \quad \forall s\in S.$

The cubic dependence on horizon comes from 3 different sources: (1) the range of the value function, (2) translating the Bellman error into the difference between optimal Q-value functions, and (3) error accumulation over
time when taking actions greedily w.r.t. $\hat{Q}$ (i.e., $Q^\star_{\hat{M}}$). The previous analyses only paid quadratic dependence on
horizon because (3) was not present.
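
Eq.(9) itself is easy to check numerically on small MDPs; here is an illustrative sketch (the random MDPs and the iteration count are arbitrary choices, not part of the note's analysis):

```python
import numpy as np

def q_star(P, R, gamma, iters=2000):
    """Fixed point of the Bellman optimality operator, obtained by iterating T."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

rng = np.random.default_rng(0)
gamma, S, A = 0.9, 20, 4
# Two arbitrary tabular MDPs standing in for M and M_hat.
P, R = rng.dirichlet(np.ones(S), size=(S, A)), rng.uniform(size=(S, A))
P_hat, R_hat = rng.dirichlet(np.ones(S), size=(S, A)), rng.uniform(size=(S, A))

Q_M, Q_Mhat = q_star(P, R, gamma), q_star(P_hat, R_hat, gamma)
# RHS of Eq.(9): the Bellman error of Q*_M under T_{M_hat}, scaled by 1/(1-gamma).
bellman_err = np.max(np.abs(Q_M - (R_hat + gamma * P_hat @ Q_M.max(axis=1))))
lhs, rhs = np.max(np.abs(Q_Mhat - Q_M)), bellman_err / (1 - gamma)
print(lhs <= rhs + 1e-8, lhs, rhs)   # the inequality in Eq.(9) should hold
```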

Some notes on Eq.(9) One can also obtain the following inequality by swapping the roles of $M$ and
$\hat{M}$ in Eq.(9):

$\|Q^\star_{\hat{M}} - Q^\star_M\|_\infty \le \frac{1}{1-\gamma}\|Q^\star_{\hat{M}} - T_M Q^\star_{\hat{M}}\|_\infty.$

In fact, the RHS of the above inequality is the more standard notion of Bellman error (or Bellman
residual): it measures how much an approximate Q-value function (here $Q^\star_{\hat{M}}$) deviates from itself when
updated using the true Bellman update operator. In fact we can attempt to complete the analysis based
on this inequality instead of Eq.(9), by noticing that the RHS is (ignoring $\frac{1}{1-\gamma}$ and the max-norm)

$T_{\hat{M}} Q^\star_{\hat{M}} - T_M Q^\star_{\hat{M}}.$

This way we also introduce $T_{\hat{M}}$ into the expression and compare it with $T_M$, which should allow us
to use concentration inequalities to bound the difference between $T_{\hat{M}}$ and $T_M$.
Now the $(s,a)$-th entry of the above expression is

$\left(\hat{R}(s,a) + \gamma\langle\hat{P}(s,a), V^\star_{\hat{M}}\rangle\right) - \left(R(s,a) + \gamma\langle P(s,a), V^\star_{\hat{M}}\rangle\right).$

It is tempting to use the techniques in the proof of Lemma 3, by claiming that $(r + \gamma V^\star_{\hat{M}}(s'))$ are
i.i.d. random variables for $(r,s')\in D_{s,a}$, with expected value $R(s,a) + \gamma\langle P(s,a), V^\star_{\hat{M}}\rangle$. This is not
true in general, because the function $V^\star_{\hat{M}}(s')$ itself is random and depends on the data in $D_{s,a}$! Hence
Hoeffding does not apply. One workaround is to consider a deterministic function class that always
contains $V^\star_{\hat{M}}$ and do a union bound over that class; in fact, if we choose all tabular functions in the
range $[0, V_{\max}]$, the analysis is basically identical to Section 2.2.
Now you should see why we use $Q^\star_M$ and $T_{\hat{M}}$ in Eq.(9): this way we compare $T_M$ and $T_{\hat{M}}$
against $V^\star_M$, which is a deterministic function.

In cases where $M$'s state space forms a directed acyclic graph (DAG), the argument with $V^\star_{\hat{M}}$ can
still work, since $V^\star_{\hat{M}}(s')$ only depends on the datasets for later state-action pairs, which do not include
the current $(s,a)$ under consideration. This argument is straightforward here because we have a
very simple and clean data collection procedure. One has to be extremely careful when using this
argument in more realistic settings: for example, in the exploration setting, even if $V^\star_{\hat{M}}(s')$ is estimated
from datasets not including $D_{s,a}$, the outcomes in $D_{s,a}$ might have determined for which later states we
have sufficient samples and for which we do not, which introduces a very subtle interdependence with $V^\star_{\hat{M}}$.

Connection to MCTS Interestingly, the independence of $n$ from $|S|$ in the last analysis is the core idea
behind Sparse Sampling [7], a prototype algorithm for the family of Monte-Carlo tree
search (MCTS) algorithms that played a crucial role in the success of AlphaGo.
One way to view Sparse Sampling is the following: conceptually, we run the tabular method
with $n$ set according to the last analysis (no dependence on $|S|$). Of course, when $|S|$ is large this is
impractical, but if we only need to know $\pi^\star(s_0)$ for some particular state $s_0$ (which is the setting of
online planning with MCTS), we can perform "lazy evaluation": only generate the datasets for state-action pairs that contribute to the calculation of $V^\star_{\hat{M}}(s_0)$, and truncate at the effective horizon. Roughly
speaking, this requires a total of $(n|A|)^{O(\frac{1}{1-\gamma})}$ samples to compute $\pi^\star(s_0)$, which has no dependence
on $|S|$.
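
Here is a recursive sketch of this lazy-evaluation idea under a generative-model interface (the `sample(s, a)` signature, the fixed depth cutoff, and the absence of any further refinements are simplifying assumptions):

```python
import numpy as np

def sparse_sampling_q(sample, s, num_actions, gamma, n, depth):
    """Estimate Q*(s, a) for all a by recursively sampling n next states per action.

    `sample(s, a)` is assumed to return one (reward, next_state) pair from the
    generative model. The total work is on the order of (n * num_actions) ** depth
    and does not depend on the size of the state space.
    """
    if depth == 0:
        return np.zeros(num_actions)
    q = np.zeros(num_actions)
    for a in range(num_actions):
        total = 0.0
        for _ in range(n):
            r, s_next = sample(s, a)
            total += r + gamma * np.max(
                sparse_sampling_q(sample, s_next, num_actions, gamma, n, depth - 1))
        q[a] = total / n
    return q

# Online planning at a single state s0: act greedily w.r.t. the estimated Q-values,
# with depth set to the effective horizon O(1/(1-gamma)).
# a0 = int(np.argmax(sparse_sampling_q(sample, s0, num_actions, gamma, n, depth)))
```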

2.4 “Best of both worlds” in the large-sample regime


We have seen two different analyses, where one pays an extra factor in $|S|$ and the other pays an extra
factor in the horizon (i.e., $1/(1-\gamma)$). Can we get the best of both worlds and pay neither factor?
The answer turns out to be positive, as long as we allow a slow "burn-in" term. To start, recall
that in the comments below Eq.(9), we showed that $\|Q^\star_{\hat{M}} - Q^\star_M\|_\infty$ can also be bounded if we swap the
roles of $M$ and $\hat{M}$ on the RHS, i.e., the RHS becomes $\frac{1}{1-\gamma}\|Q^\star_{\hat{M}} - T_M Q^\star_{\hat{M}}\|_\infty$. The trouble is that
we can no longer directly apply Hoeffding's inequality, due to the data dependence of $Q^\star_{\hat{M}}$.
However, if we manage to control this term, it will save us a horizon factor (which is a good reason
to consider this idea more carefully)! This is shown in the following lemma (the lemma only involves
the true MDP $M$, which is therefore omitted in the subscripts of value functions and occupancies):

Lemma 4 ([8]). For any $f \in \mathbb{R}^{S\times A}$, any initial state $s_0$, and any policy $\pi$,

$V^\pi(s_0) - V^{\pi_f}(s_0) \le \frac{1}{1-\gamma}\left(\mathbb{E}_{d^{\pi,s_0}}[Tf - f] + \mathbb{E}_{d^{\pi_f,s_0}}[f - Tf]\right),$

where terms of the form $\mathbb{E}_\mu[f]$ are shorthand for $\mathbb{E}_{(s,a)\sim\mu}[f(s,a)]$.

Proof. We use the following identity, which is the Q-function variant of Eq.(6) (what we used to give
the alternative proof of the simulation lemma) and can be proved in very similar ways: for any $f \in \mathbb{R}^{S\times A}$,

$f(s_0, \pi) - V^\pi(s_0) = \frac{1}{1-\gamma}\,\mathbb{E}_{d^{\pi,s_0}}[f - T^\pi f].$    (10)

Then,

$V^\pi(s_0) - V^{\pi_f}(s_0) \le V^\pi(s_0) - f(s_0,\pi) + f(s_0,\pi_f) - V^{\pi_f}(s_0),$

where we used $f(s_0, \pi_f) \ge f(s_0, \pi)$ since $\pi_f$ is greedy w.r.t. $f$. For the two pairs of differences on the RHS, we invoke Eq.(10) twice: once with $\pi$, and once with
$\pi$ rebound to $\pi_f$. This gives:

$V^\pi(s_0) - f(s_0,\pi) + f(s_0,\pi_f) - V^{\pi_f}(s_0) = \frac{1}{1-\gamma}\,\mathbb{E}_{d^{\pi,s_0}}[T^\pi f - f] + \frac{1}{1-\gamma}\,\mathbb{E}_{d^{\pi_f,s_0}}[f - T^{\pi_f} f].$

The proof is completed by recognizing that $T^\pi f \le Tf$ and $T^{\pi_f} f = Tf$.

Using Lemma 4, we immediately have that

$\|V^\star_M - V^{\pi_{\hat{M}}}_M\|_\infty \le \frac{2}{1-\gamma}\|Q^\star_{\hat{M}} - T_M Q^\star_{\hat{M}}\|_\infty,$

which contrasts with the $2/(1-\gamma)^2$ factor in Section 2.3. However, this brings back the earlier problem:
how do we handle the data dependence of $Q^\star_{\hat{M}}$ in concentration?

Bounding $\|Q^\star_{\hat{M}} - T_M Q^\star_{\hat{M}}\|_\infty$ We now show how to bound this term. $\forall (s,a)$,

$|Q^\star_{\hat{M}}(s,a) - (T_M Q^\star_{\hat{M}})(s,a)| = |(T_{\hat{M}} Q^\star_{\hat{M}})(s,a) - (T_M Q^\star_{\hat{M}})(s,a)|$
$= |(T_{\hat{M}} Q^\star_M)(s,a) - (T_M Q^\star_M)(s,a) + (T_{\hat{M}} Q^\star_{\hat{M}})(s,a) - (T_{\hat{M}} Q^\star_M)(s,a) + (T_M Q^\star_M)(s,a) - (T_M Q^\star_{\hat{M}})(s,a)|.$

So the idea is to replace $Q^\star_{\hat{M}}$ with $Q^\star_M$; then the first difference is the same as in Section 2.3, which we know
how to handle (without paying the extra $|S|$). The consequence of doing so is that we pick up some extra
terms (the last 4 terms above).

To handle those 4 terms, we plug in the definitions of $T_M$ and $T_{\hat{M}}$:

$|(T_{\hat{M}} Q^\star_{\hat{M}})(s,a) - (T_{\hat{M}} Q^\star_M)(s,a) + (T_M Q^\star_M)(s,a) - (T_M Q^\star_{\hat{M}})(s,a)|$
$= |\hat{R}(s,a) + \gamma\langle\hat{P}(s,a), V^\star_{\hat{M}}\rangle - \hat{R}(s,a) - \gamma\langle\hat{P}(s,a), V^\star_M\rangle + R(s,a) + \gamma\langle P(s,a), V^\star_M\rangle - R(s,a) - \gamma\langle P(s,a), V^\star_{\hat{M}}\rangle|$
$= \gamma|\langle\hat{P}(s,a), V^\star_{\hat{M}} - V^\star_M\rangle - \langle P(s,a), V^\star_{\hat{M}} - V^\star_M\rangle|$
$= \gamma|\langle\hat{P}(s,a) - P(s,a), V^\star_{\hat{M}} - V^\star_M\rangle|$
$\le \gamma\|\hat{P}(s,a) - P(s,a)\|_1\,\|V^\star_{\hat{M}} - V^\star_M\|_\infty.$

Now we can control $\|\hat{P}(s,a) - P(s,a)\|_1$ using the total-variation concentration bound in Section 2.2
(while paying the extra $|S|$ dependence). We can separately control $\|V^\star_{\hat{M}} - V^\star_M\|_\infty$ using the analysis in Section 2.3.
Each term scales as $O(1/\sqrt{n})$ (we are only considering the scaling with $n$ here and ignoring the
other variables such as $|S|$ and $1/(1-\gamma)$), so their product scales as $O(1/n)$. When $n$ is sufficiently
large, this term is dominated by the $1/\sqrt{n}$ error coming from $|(T_{\hat{M}} Q^\star_M)(s,a) - (T_M Q^\star_M)(s,a)|$
and can be omitted. (This is why it is sometimes called a "burn-in" term, as it only has a significant
effect in the small sample-size regime.) Note that this $O(1/n)$ term will have worse dependence
on $|S|$ and $1/(1-\gamma)$ than the $O(1/\sqrt{n})$ term, so "sufficiently large $n$" means that $n$ has to be
larger than some function of $|S|$ and $1/(1-\gamma)$ for the gap between $\sqrt{n}$ and $n$ to compensate for
the worse factors in $|S|$ and $1/(1-\gamma)$. In such a large-sample regime, we obtain the nice error bound
of $\tilde{O}(\frac{V_{\max}}{\sqrt{n}(1-\gamma)})$, i.e., there is neither the extra dependence on $|S|$ as in Section 2.2 nor the extra $1/(1-\gamma)$ factor
as in Section 2.3.

Further improvement The bound can be further improved by replacing the Hoeffding inequalities
with Bernstein's inequality, which provides sharper concentration when the variance of the random
variables is substantially smaller than their range (squared). In our setting, the range of the
random variables in the concentration of $\langle\hat{P}(s,a) - P(s,a), V\rangle$ is $V_{\max}$, so the worst-case variance is
$O(V_{\max}^2)$. However, it turns out that for certain $V$ (e.g., $V = V^\pi_M$), such variance cannot be large for
all $(s,a)$ simultaneously, as it adds up to $O(V_{\max}^2)$ along the occupancy of $\pi$, and leveraging this
property leads to improved sample complexities; see [9] and [10, Section 2.3].

References
[1] Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, University
of Cambridge England, 1989.

[2] Satinder P Singh and Richard S Sutton. Reinforcement learning with replacing eligibility traces.
Machine learning, 22(1-3):123–158, 1996.

[3] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Belle-
mare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level
control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[4] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and
teaching. Machine learning, 8(3-4):293–321, 1992.

[5] Harm van Seijen and Rich Sutton. A deeper look at planning as learning from replay. In Proceed-
ings of the 32nd International Conference on Machine Learning, pages 2314–2322, 2015.

[6] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time.
Machine Learning, 49(2-3):209–232, 2002.

[7] Michael Kearns, Yishay Mansour, and Andrew Y Ng. A sparse sampling algorithm for near-
optimal planning in large Markov decision processes. Machine Learning, 49(2-3):193–208, 2002.

[8] Tengyang Xie and Nan Jiang. Q* approximation schemes for batch reinforcement learning:
A theoretical comparison. In Conference on Uncertainty in Artificial Intelligence, pages 550–559.
PMLR, 2020.

[9] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax PAC bounds on
the sample complexity of reinforcement learning with a generative model. Machine Learning,
91(3):325–349, 2013.

[10] Alekh Agarwal, Nan Jiang, Sham Kakade, and Wen Sun. Reinforcement Learning: Theory and
Algorithms. https://rltheorybook.github.io/.
