Note 3
Nan Jiang
Here I(·) is the indicator function. In words, P̂(s′|s,a) is simply the empirical frequency of observing s′ after taking a in state s. Similarly, when the reward function also needs to be learned, the estimate is

    R̂(s,a) = (1/|D_{s,a}|) Σ_{(r,s′)∈D_{s,a}} r.    (2)

P̂ and R̂ are the maximum likelihood estimates of the transition and the reward functions, respectively. Note that for the transition function to be well-defined we need n(s,a) > 0 for every (s,a) ∈ S × A.
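As a concrete illustration, here is a minimal Python sketch (not from the note; the function name and the data format, a list of (s, a, r, s′) tuples, are assumptions made for illustration) of how the certainty-equivalence estimates P̂ and R̂ can be formed from counts.

```python
import numpy as np

def certainty_equivalence_model(data, num_states, num_actions):
    """Form the empirical (maximum-likelihood) model from (s, a, r, s_next) tuples.
    Assumes every (s, a) pair appears at least once, i.e., n(s, a) > 0."""
    counts = np.zeros((num_states, num_actions, num_states))
    reward_sums = np.zeros((num_states, num_actions))
    for s, a, r, s_next in data:
        counts[s, a, s_next] += 1.0
        reward_sums[s, a] += r
    n_sa = counts.sum(axis=2)             # n(s, a)
    P_hat = counts / n_sa[:, :, None]     # empirical next-state frequencies
    R_hat = reward_sums / n_sa            # empirical mean rewards, Eq. (2)
    return P_hat, R_hat

# toy usage with a hypothetical 2-state, 1-action dataset
data = [(0, 0, 1.0, 1), (0, 0, 0.0, 0), (1, 0, 0.5, 1)]
P_hat, R_hat = certainty_equivalence_model(data, num_states=2, num_actions=1)
```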
empirical successes [3]. On the other hand, such methods are typically less sample-efficient than model-based methods and will not be discussed in more detail in this note.¹
2 Analysis of certainty-equivalence RL
Here we analyze the method introduced in Section 1.1. For simplicity we further assume that the data are generated by sampling each (s,a) a fixed number of times. We are interested in deriving high-probability guarantees for the optimal policy of M̂ = (S, A, P̂, R̂, γ) as a function of n ≡ |D_{s,a}|.
We provide three different analyses of the algorithm, which reveal some interesting trade-offs between the dependence on the size of the state space and the dependence on the horizon.
Note that we first split the failure probability δ evenly between the reward estimation events and the
transition estimation events. Then for reward, we split δ/2 evenly among all (s, a); for transition, we
split δ/2 evenly among all (s, a, s′ ). From Eq.(4) we further have
    max_{s,a} ∥P̂(s,a) − P(s,a)∥₁ ≤ max_{s,a} |S| · ∥P̂(s,a) − P(s,a)∥∞ ≤ |S| · √( (1/(2n)) ln( 4|S×A×S| / δ ) ).    (5)
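For a sense of scale, the snippet below plugs example values (the particular numbers are arbitrary and chosen only for illustration) into the right-hand side of Eq.(5).

```python
import numpy as np

S, A, n, delta = 50, 4, 10_000, 0.1
# Per-entry Hoeffding bound after splitting delta/2 over all (s, a, s') triples (Eq. (4)),
# then multiplied by |S| to get an l1 bound (Eq. (5)).
per_entry = np.sqrt(np.log(4 * S * A * S / delta) / (2 * n))
l1_bound = S * per_entry
print(f"per-entry bound: {per_entry:.4f}, l1 bound from Eq.(5): {l1_bound:.4f}")
```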
To bound the suboptimality of π⋆_M̂, we first introduce the simulation lemma [6].

Lemma 1 (Simulation Lemma). For any policy π: S → A,

    ∥V^π_M̂ − V^π_M∥∞ ≤ ε_R/(1−γ) + γ ε_P V_max / (2(1−γ)),

where ε_R := max_{s,a} |R̂(s,a) − R(s,a)| and ε_P := max_{s,a} ∥P̂(s,a) − P(s,a)∥₁.
¹… also blurs the boundary between value-based and model-based methods [5].
Proof. For any s ∈ S,

    |V^π_M̂(s) − V^π_M(s)|
    = |R̂(s,π) + γ⟨P̂(s,π), V^π_M̂⟩ − R(s,π) − γ⟨P(s,π), V^π_M⟩|
    ≤ ε_R + γ |⟨P̂(s,π), V^π_M̂⟩ − ⟨P(s,π), V^π_M̂⟩ + ⟨P(s,π), V^π_M̂⟩ − ⟨P(s,π), V^π_M⟩|
    ≤ ε_R + γ |⟨P̂(s,π) − P(s,π), V^π_M̂⟩| + γ ∥V^π_M̂ − V^π_M∥∞
    = ε_R + γ |⟨P̂(s,π) − P(s,π), V^π_M̂ − (V_max/2)·1⟩| + γ ∥V^π_M̂ − V^π_M∥∞
    ≤ ε_R + γ ∥P̂(s,π) − P(s,π)∥₁ ∥V^π_M̂ − (V_max/2)·1∥∞ + γ ∥V^π_M̂ − V^π_M∥∞
    ≤ ε_R + γ ε_P V_max/2 + γ ∥V^π_M̂ − V^π_M∥∞.

Since this holds for all s ∈ S, we can take the infinity-norm on the LHS as well; solving the resulting inequality for ∥V^π_M̂ − V^π_M∥∞ yields the desired result. Note that we subtract (V_max/2)·1 (1 is the all-one vector) to center the range of V^π_M̂ around the origin, which exploits the fact that both P̂(s,π) and P(s,π) are valid probability distributions and sum up to 1, so the inner product with their difference is unchanged by a constant shift.
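Lemma 1 is easy to sanity-check numerically. The following sketch (not from the note; the random MDP, the perturbation scheme, and all names are illustrative) builds a small true model, perturbs it into an approximate model, evaluates a fixed policy exactly in both, and compares the gap to ε_R/(1−γ) + γ ε_P V_max/(2(1−γ)).

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, Rmax = 6, 3, 0.9, 1.0
Vmax = Rmax / (1 - gamma)

P = rng.dirichlet(np.ones(S), size=(S, A))        # true transitions, shape (S, A, S)
R = rng.uniform(0, Rmax, size=(S, A))             # true rewards

# Perturbed model; the mixture keeps each row a valid distribution.
R_hat = np.clip(R + rng.uniform(-0.01, 0.01, size=(S, A)), 0, Rmax)
P_hat = P + 0.02 * (rng.dirichlet(np.ones(S), size=(S, A)) - P)

eps_R = np.abs(R_hat - R).max()
eps_P = np.abs(P_hat - P).sum(axis=2).max()       # max_{s,a} l1 distance

def policy_eval(P, R, pi, gamma):
    """Exact evaluation of a deterministic policy pi (array of actions) via a linear solve."""
    S = R.shape[0]
    P_pi = P[np.arange(S), pi]
    R_pi = R[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

pi = rng.integers(A, size=S)                      # an arbitrary deterministic policy
gap = np.abs(policy_eval(P_hat, R_hat, pi, gamma) - policy_eval(P, R, pi, gamma)).max()
bound = eps_R / (1 - gamma) + gamma * eps_P * Vmax / (2 * (1 - gamma))
print(f"|V_hat - V|_inf = {gap:.4f} <= bound = {bound:.4f}")
```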
Alternative proof of Simulation Lemma  Here we sketch an alternative and more “modern” proof of the simulation lemma. The proof relies on the following identity: for all f ∈ R^S and s0 ∈ S,

    f(s0) − V^π_M(s0) = (1/(1−γ)) E_{s∼d^{π,s0}, a∼π(·|s), r∼R(s,a), s′∼P(·|s,a)} [f(s) − r − γf(s′)].    (6)

To see why, first note that V^π_M(s0) = (1/(1−γ)) E_{s∼d^{π,s0}, a∼π(·|s), r∼R(s,a)}[r], so the corresponding terms can be dropped on the two sides of the equation. For the remaining terms, the RHS is

    (1/(1−γ)) E_{s∼d^{π,s0}, a∼π(·|s), s′∼P(·|s,a)} [f(s) − γf(s′)]
    = Σ_{t=1}^∞ γ^{t−1} E_{s∼d_t^{π,s0}, a∼π(·|s), s′∼P(·|s,a)} [f(s) − γf(s′)].

The γf(s′) term for t cancels out exactly with the f(s) term for t+1, because s′ at step t and s at step t+1 are both distributed according to d_{t+1}^{π,s0}, and the difference in discount factors (γ^{t−1} vs. γ^t) accounts for the γ in γf(s′). As a result, only the first term f(s0) is left, which is the same as the remaining term on the LHS. In fact, this identity is effectively the Bellman flow equation for d^π, written in a form where f serves as a “test function” (or discriminator) for the Bellman flow equation.

To prove the simulation lemma, we simply let f = V^π_M̂. On the RHS, we marginalize out r and s′, and obtain

    (1/(1−γ)) E_{s∼d^{π,s0}, a∼π(·|s)} [V^π_M̂(s) − R(s,a) − γ⟨P(·|s,a), V^π_M̂⟩]
    = (1/(1−γ)) E_{s∼d^{π,s0}, a∼π(·|s)} [R̂(s,a) + γ⟨P̂(·|s,a), V^π_M̂⟩ − R(s,a) − γ⟨P(·|s,a), V^π_M̂⟩].

Each term inside the expectation can then be bounded by ε_R + γ ε_P V_max/2 exactly as in the first proof (after centering V^π_M̂), which recovers the lemma.
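Identity (6) can be verified numerically on a small MDP; in the sketch below (illustrative setup and names), d^{π,s0} is computed directly from its definition as the normalized discounted state occupancy, and the two sides of Eq.(6) are compared for an arbitrary test function f.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, s0 = 5, 3, 0.9, 0
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(0, 1, size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)            # a stochastic policy pi(a|s)

P_pi = np.einsum('sa,sap->sp', pi, P)             # state-to-state kernel under pi
R_pi = (pi * R).sum(axis=1)
V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

# Normalized discounted occupancy: d(s) = (1-gamma) * sum_t gamma^{t-1} Pr(s_t = s | s_1 = s0).
e0 = np.zeros(S); e0[s0] = 1.0
d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, e0)

f = rng.uniform(0, 10, size=S)                    # an arbitrary test function f in R^S
# RHS of Eq.(6): expectation of f(s) - r - gamma f(s') under d, pi, R, P.
exp_r = (d * R_pi).sum()
exp_f = (d * f).sum()
exp_f_next = d @ np.einsum('sa,sap,p->s', pi, P, f)
rhs = (exp_f - exp_r - gamma * exp_f_next) / (1 - gamma)
lhs = f[s0] - V_pi[s0]
print(f"LHS = {lhs:.6f}, RHS = {rhs:.6f}")        # the two should match
```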
Turning back to the analysis of certainty-equivalence, the following lemma translates the policy evaluation error into the suboptimality of π⋆_M̂:

Lemma 2 (Evaluation error to decision loss). ∀s ∈ S,

    V⋆_M(s) − V_M^{π⋆_M̂}(s) ≤ 2 sup_{π:S→A} ∥V^π_M̂ − V^π_M∥∞.

Putting Lemmas 1 and 2 together with the concentration inequalities, we can see that the suboptimality we incur is

    V⋆_M(s) − V_M^{π⋆_M̂}(s) = Õ( |S| V_max / (√n (1−γ)) ),  ∀s ∈ S.

Here Õ(·) suppresses poly-logarithmic dependences on |S| and |A|; in this note we also omit the dependence on R_max and 1/δ, and only highlight the dependence on |S|, n, and 1/(1−γ).
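To connect the guarantee to the algorithm it describes, here is a self-contained simulation sketch (all names, the random MDP, and the choice to treat the reward as known are assumptions made for brevity): it draws n next-state samples per (s,a), plans in M̂ with value iteration, and reports the worst-case suboptimality of π⋆_M̂ in the true MDP.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma, n = 8, 3, 0.9, 2000
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(0, 1, size=(S, A))                # known rewards, for simplicity

def value_iteration(P, R, gamma, iters=2000):
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

def policy_eval(P, R, pi, gamma):
    S = R.shape[0]
    P_pi, R_pi = P[np.arange(S), pi], R[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

# Certainty equivalence: draw n next-states per (s, a) and use empirical frequencies.
P_hat = np.zeros_like(P)
for s in range(S):
    for a in range(A):
        next_states = rng.choice(S, size=n, p=P[s, a])
        P_hat[s, a] = np.bincount(next_states, minlength=S) / n

pi_hat = value_iteration(P_hat, R, gamma).argmax(axis=1)   # optimal policy of M_hat
V_star = value_iteration(P, R, gamma).max(axis=1)          # V* of the true MDP
V_pi_hat = policy_eval(P, R, pi_hat, gamma)                # value of pi_hat in the true MDP
print("max suboptimality:", (V_star - V_pi_hat).max())
```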
2.2 Improving |S| to √|S|
The previous analysis proves concentration for each individual P(s′|s,a) and adds up the errors to give an ℓ1 error bound, which is loose. We can obtain a tighter analysis by proving an ℓ1 concentration bound for the multinomial distribution directly.

Note that for any vector v ∈ R^{|S|},

    ∥v∥₁ = sup_{u∈{−1,1}^{|S|}} u⊤v.
Each u ∈ {−1,1}^{|S|} projects the vector v to some scalar value. If v can be written as the sum of zero-mean i.i.d. vectors, we can prove concentration for u⊤v first, and then union bound over all u to obtain the ℓ1 error bound. Concretely, for any fixed (s,a) pair and any fixed u ∈ {−1,1}^{|S|}, with probability at least 1 − δ/(2|S×A|·2^{|S|}), we have

    u⊤(P̂(s,a) − P(s,a)) ≤ 2 √( (1/(2n)) ln( 2|S×A|·2^{|S|} / δ ) ),    (7)

because u⊤P̂(s,a) is the average of i.i.d. random variables u⊤e_{s′} with bounded range [−1,1].³ This leads to the following improvement over Eq.(5): w.p. at least 1 − δ/2,

    max_{s,a} ∥P̂(s,a) − P(s,a)∥₁ = max_{s,a} max_{u∈{−1,1}^{|S|}} u⊤(P̂(s,a) − P(s,a)) ≤ 2 √( (1/(2n)) ln( 2|S×A|·2^{|S|} / δ ) ).    (8)
Roughly speaking, the Õ(|S|·√(1/n)) bound in Eq.(5) is improved to Õ(√(|S|/n)), and propagating the improvement through the remainder of the analysis yields

    V⋆_M(s) − V_M^{π⋆_M̂}(s) = Õ( √|S| V_max / (√n (1−γ)) ),  ∀s ∈ S.
³Also note that we only bound the deviation from one side, so we save a factor of 2 inside the ln compared to bounding the absolute deviation. Another tiny improvement: for u ∈ {−1,1}^{|S|}, one can ignore u = ±1 (the all-ones and all-minus-ones vectors), as ±1⊤(P̂(s,a) − P(s,a)) ≡ 0.
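The gap between the two rates above is easy to see in simulation. The sketch below (illustrative; it fixes a single (s,a), so the |S×A| union-bound factors are dropped from both bounds) estimates the typical ℓ1 error of P̂(s,a) by Monte Carlo and prints it next to the Eq.(5)- and Eq.(8)-style quantities.

```python
import numpy as np

rng = np.random.default_rng(3)
S, n, delta, trials = 100, 500, 0.1, 200
p = rng.dirichlet(np.ones(S))                     # a single true distribution P(.|s,a)

errs = []
for _ in range(trials):
    counts = np.bincount(rng.choice(S, size=n, p=p), minlength=S)
    errs.append(np.abs(counts / n - p).sum())     # l1 deviation of the empirical estimate
print("mean empirical l1 error          :", np.mean(errs))
# Eq.(5)-style rate: |S| * per-entry Hoeffding bound (union over the S entries).
print("Eq.(5)-style bound ~ |S|/sqrt(n) :", S * np.sqrt(np.log(4 * S / delta) / (2 * n)))
# Eq.(8)-style rate: union over the 2^|S| sign vectors u.
print("Eq.(8)-style bound ~ sqrt(|S|/n) :",
      2 * np.sqrt((S * np.log(2) + np.log(2 / delta)) / (2 * n)))
```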
2.3 No dependence on |S|
The last analysis removes the dependence of n on |S|, at the cost of an additional dependence on 1/(1−γ).
Note that the total number of samples still scales with |S| as we require n samples per (s, a).
The core idea is to show Q⋆_M̂ ≈ Q⋆_M, and then upper bound the loss by Lemma 4 from Note 1. First, by contraction we have

    ∥Q⋆_M̂ − Q⋆_M∥∞ ≤ (1/(1−γ)) ∥Q⋆_M − T_M̂ Q⋆_M∥∞.    (9)

This is because

    ∥Q⋆_M̂ − Q⋆_M∥∞ = ∥T_M̂ Q⋆_M̂ − T_M̂ Q⋆_M + T_M̂ Q⋆_M − Q⋆_M∥∞
    ≤ γ ∥Q⋆_M̂ − Q⋆_M∥∞ + ∥T_M̂ Q⋆_M − Q⋆_M∥∞,    (T_M̂ is a γ-contraction)

and rearranging yields Eq.(9).
Lemma 3. With probability at least 1 − δ, for all (s,a) ∈ S × A,

    |Q⋆_M(s,a) − (T_M̂ Q⋆_M)(s,a)| ≤ V_max √( (1/(2n)) ln( 2|S×A| / δ ) ).

Proof. The bound follows directly from Hoeffding's inequality upon the following observation:

    R̂(s,a) + γ⟨P̂(s,a), V⋆_M⟩ = (1/n) Σ_{(r,s′)∈D_{s,a}} ( r + γ V⋆_M(s′) ).

Note that the RHS is the average of the i.i.d. random variables r + γV⋆_M(s′), which lie in the interval [0, R_max/(1−γ)] and whose expectation is exactly Q⋆_M(s,a). Therefore, the LHS of the lemma statement is the deviation of an average of i.i.d. variables from its expectation, to which Hoeffding's inequality applies.
Note that the LHS of the lemma statement is simply the (s,a)-th entry of Q⋆_M − T_M̂ Q⋆_M. The final result we can get is

    V⋆_M(s) − V_M^{π⋆_M̂}(s) = Õ( V_max / (√n (1−γ)²) ),  ∀s ∈ S.

The cubic dependence on the horizon comes from 3 different sources: (1) the range of the value, (2) translating the Bellman error into the difference between the optimal Q-value functions, and (3) the accumulation of errors over time when taking actions greedily with respect to Q̂. The previous analyses only paid a quadratic dependence on the horizon because (3) was not present.
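Inequality (9) can be checked numerically as well; the sketch below (illustrative setup, not from the note) computes Q⋆_M, Q⋆_M̂, and the Bellman-error term ∥Q⋆_M − T_M̂ Q⋆_M∥∞ exactly for a random M and a perturbed M̂, and compares the two sides.

```python
import numpy as np

rng = np.random.default_rng(4)
S, A, gamma = 6, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(0, 1, size=(S, A))
# A perturbed model M_hat (mixing keeps each row a valid distribution).
P_hat = 0.9 * P + 0.1 * rng.dirichlet(np.ones(S), size=(S, A))
R_hat = np.clip(R + rng.uniform(-0.05, 0.05, size=(S, A)), 0, 1)

def q_star(P, R, gamma, iters=3000):
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

Q_M, Q_Mhat = q_star(P, R, gamma), q_star(P_hat, R_hat, gamma)
# ||Q*_M - T_Mhat Q*_M||_inf
bellman_err = np.abs(Q_M - (R_hat + gamma * P_hat @ Q_M.max(axis=1))).max()
lhs = np.abs(Q_Mhat - Q_M).max()
print(f"||Q*_Mhat - Q*_M||_inf = {lhs:.4f} <= {bellman_err / (1 - gamma):.4f}")
```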
Some notes on Eq.(9)  One can also obtain the following inequality by swapping the roles of M and M̂ in Eq.(9):

    ∥Q⋆_M̂ − Q⋆_M∥∞ ≤ (1/(1−γ)) ∥Q⋆_M̂ − T_M Q⋆_M̂∥∞.
In fact, the RHS of the above inequality is the more standard notion of Bellman error (or Bellman residual): it measures how much an approximate Q-value function (here Q⋆_M̂) deviates from itself when updated using the true Bellman update operator. We can attempt to complete the analysis based on this inequality instead of Eq.(9), by noticing that the RHS is (ignoring 1/(1−γ) and the max-norm)

    T_M̂ Q⋆_M̂ − T_M Q⋆_M̂.

This way we also introduce T_M̂ into the expression and compare it against T_M, which should allow us to use concentration inequalities to bound the difference between T_M̂ and T_M.
Now the (s,a)-th entry of the above expression is

    R̂(s,a) + γ⟨P̂(s,a), V⋆_M̂⟩ − R(s,a) − γ⟨P(s,a), V⋆_M̂⟩.

It is tempting to use the techniques in the proof of Lemma 3, by claiming that the (r + γV⋆_M̂(s′)) are i.i.d. random variables for (r,s′) ∈ D_{s,a}, with expected value R(s,a) + γ⟨P(s,a), V⋆_M̂⟩. This is not true in general, because the function V⋆_M̂ itself is random and depends on the data in D_{s,a}! Hence Hoeffding does not apply. One workaround is to consider a deterministic function class that always contains V⋆_M̂ and do a union bound over that class; in fact, if we choose the class of all tabular functions with range [0, V_max], the analysis is basically identical to that of Section 2.2.
Now you should see why we use Q⋆_M and T_M̂ in Eq.(9): this way we compare T_M and T_M̂ against V⋆_M, which is a deterministic function.
⋆
In cases where M ’s state space forms a directed acyclic graph (DAG), the argument with VM c can
⋆ ′
still work as VMc(s ) only depends on the datasets for later state-action pairs, which do not include
the current (s, a) under consideration. This argument is straightforward here because we have a
very simple and clean data collection procedure. One has to be extremely careful when using this
⋆ ′
argument in more realistic settings: for example, in the exploration setting, even if VM
c(s ) is estimated
from datasets not including Ds,a , the outcomes in Ds,a might have determined which later states we
⋆
have sufficient samples and which not, which introduces very subtle interdependence with VM c.
Connection to MCTS  Interestingly, the fact that n does not depend on |S| in the last analysis is the core idea behind Sparse Sampling [7], a prototype for the family of Monte-Carlo tree search (MCTS) algorithms that played a crucial role in the success of AlphaGo.

One way to view Sparse Sampling is the following: conceptually we run the tabular method with n set according to the last analysis (no dependence on |S|). Of course, when |S| is large this is impractical, but if we only need to know π⋆(s0) for some particular state s0 (which is the setting of online planning with MCTS), we can perform “lazy evaluation”: only generate the datasets for the state-action pairs that contribute to the calculation of V⋆_M̂(s0), and truncate at the effective horizon. Roughly speaking, this requires a total of (n|A|)^{O(1/(1−γ))} samples to compute π⋆(s0), which has no dependence on |S|.
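To make the lazy-evaluation idea concrete, here is a minimal sketch of a Sparse Sampling-style recursion (a simplification for illustration rather than the exact algorithm of [7]; the generative-model interface sample(s, a) and all other names are assumptions): it draws n samples per action at each visited state and recurses to a fixed depth standing in for the effective horizon.

```python
import numpy as np

def sparse_sampling_q(sample, actions, s, gamma, n, depth):
    """Estimate Q(s, a) for each action by recursive lookahead with n samples per action.
    `sample(s, a)` is an assumed generative-model interface returning (reward, next_state)."""
    if depth == 0:
        return np.zeros(len(actions))
    q = np.zeros(len(actions))
    for i, a in enumerate(actions):
        total = 0.0
        for _ in range(n):
            r, s_next = sample(s, a)
            v_next = sparse_sampling_q(sample, actions, s_next, gamma, n, depth - 1).max()
            total += r + gamma * v_next
        q[i] = total / n
    return q

# Example usage on a tiny random MDP (illustrative only):
rng = np.random.default_rng(5)
S, A, gamma = 4, 2, 0.8
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(0, 1, size=(S, A))
sample = lambda s, a: (R[s, a], rng.choice(S, p=P[s, a]))
q0 = sparse_sampling_q(sample, range(A), s=0, gamma=gamma, n=5, depth=4)
print("greedy action at s0:", int(q0.argmax()))
```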
Recall that one can swap the roles of M and M̂ on the RHS of Eq.(9), i.e., the RHS becomes (1/(1−γ)) ∥Q⋆_M̂ − T_M Q⋆_M̂∥∞. The trouble is that we can no longer directly apply Hoeffding's inequality due to the data-dependence of Q⋆_M̂.

However, if we manage to control this term, it will save us a horizon factor (which is a good reason to consider this idea more carefully)! This is shown in the following lemma (the lemma only involves the true MDP M, which is therefore omitted from the subscripts of value functions and occupancies):
Lemma 4 ([8]). For any f ∈ R^{S×A}, any initial state s0, and any policy π,

    V^π(s0) − V^{π_f}(s0) ≤ (1/(1−γ)) ( E_{d^{π,s0}}[T f − f] + E_{d^{π_f,s0}}[f − T f] ),

where terms of the form E_µ[f] are shorthand for E_{(s,a)∼µ}[f(s,a)].

Proof. We use the following identity, which is the Q-function variant of Eq.(6) (what we used to give the alternative proof of the simulation lemma) and can be proved in a very similar way: for any f ∈ R^{S×A},

    f(s0, π) − V^π(s0) = (1/(1−γ)) E_{d^{π,s0}}[f − T^π f].    (10)
Then,

    V^π(s0) − V^{π_f}(s0)
    = (V^π(s0) − f(s0, π)) + (f(s0, π) − f(s0, π_f)) + (f(s0, π_f) − V^{π_f}(s0))
    ≤ (V^π(s0) − f(s0, π)) + (f(s0, π_f) − V^{π_f}(s0)),

since the middle term is non-positive (π_f is greedy with respect to f). For the two pairs of differences on the RHS, we invoke Eq.(10) twice: once with π, and once with π rebound to π_f. This gives

    V^π(s0) − V^{π_f}(s0) ≤ (1/(1−γ)) ( E_{d^{π,s0}}[T^π f − f] + E_{d^{π_f,s0}}[f − T^{π_f} f] )
    ≤ (1/(1−γ)) ( E_{d^{π,s0}}[T f − f] + E_{d^{π_f,s0}}[f − T f] ),

where the last step uses T f ≥ T^π f pointwise and T^{π_f} f = T f (again because π_f is greedy with respect to f).

Applying Lemma 4 with π = π⋆_M and f = Q⋆_M̂ (so that π_f = π⋆_M̂), and bounding both expectations by the max-norm, we obtain

    ∥V⋆_M − V_M^{π⋆_M̂}∥∞ ≤ (2/(1−γ)) ∥Q⋆_M̂ − T_M Q⋆_M̂∥∞,

which contrasts with the 2/(1−γ)² factor in Section 2.3. However, this brings back the earlier problem: how to handle the data dependence of Q⋆_M̂ in concentration?
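As with Eq.(9), this bound is easy to verify on a small example. The sketch below (illustrative setup; the reward function is shared between M and M̂ for simplicity) compares the true suboptimality of π⋆_M̂ with (2/(1−γ)) ∥Q⋆_M̂ − T_M Q⋆_M̂∥∞.

```python
import numpy as np

rng = np.random.default_rng(6)
S, A, gamma = 6, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(0, 1, size=(S, A))
P_hat = 0.9 * P + 0.1 * rng.dirichlet(np.ones(S), size=(S, A))   # a perturbed model

def q_star(P, R, gamma, iters=3000):
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

Q_M, Q_Mhat = q_star(P, R, gamma), q_star(P_hat, R, gamma)
pi_hat = Q_Mhat.argmax(axis=1)                                   # greedy policy of M_hat
P_pi = P[np.arange(S), pi_hat]
V_pi_hat = np.linalg.solve(np.eye(S) - gamma * P_pi, R[np.arange(S), pi_hat])
# ||Q*_Mhat - T_M Q*_Mhat||_inf, the Bellman residual under the true model
residual = np.abs(Q_Mhat - (R + gamma * P @ Q_Mhat.max(axis=1))).max()
print(f"suboptimality = {(Q_M.max(axis=1) - V_pi_hat).max():.4f} "
      f"<= {2 * residual / (1 - gamma):.4f}")
```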
Bounding ∥Q⋆_M̂ − T_M Q⋆_M̂∥∞  We now show how to bound this term. ∀(s,a),

    |Q⋆_M̂(s,a) − (T_M Q⋆_M̂)(s,a)| = |(T_M̂ Q⋆_M̂)(s,a) − (T_M Q⋆_M̂)(s,a)|
    = |(T_M̂ Q⋆_M̂)(s,a) − (T_M̂ Q⋆_M)(s,a) + (T_M̂ Q⋆_M)(s,a) − (T_M Q⋆_M)(s,a) + (T_M Q⋆_M)(s,a) − (T_M Q⋆_M̂)(s,a)|.

The middle difference, (T_M̂ Q⋆_M)(s,a) − (T_M Q⋆_M)(s,a) = (T_M̂ Q⋆_M)(s,a) − Q⋆_M(s,a), is exactly the quantity bounded in Lemma 3. To handle the remaining 4 terms, we plug in the definitions of T_M and T_M̂:

    |(T_M̂ Q⋆_M̂)(s,a) − (T_M̂ Q⋆_M)(s,a) + (T_M Q⋆_M)(s,a) − (T_M Q⋆_M̂)(s,a)|
    = |R̂(s,a) + γ⟨P̂(s,a), V⋆_M̂⟩ − R̂(s,a) − γ⟨P̂(s,a), V⋆_M⟩ + R(s,a) + γ⟨P(s,a), V⋆_M⟩ − R(s,a) − γ⟨P(s,a), V⋆_M̂⟩|
    = γ |⟨P̂(s,a), V⋆_M̂ − V⋆_M⟩ − ⟨P(s,a), V⋆_M̂ − V⋆_M⟩|
    = γ |⟨P̂(s,a) − P(s,a), V⋆_M̂ − V⋆_M⟩|
    ≤ γ ∥P̂(s,a) − P(s,a)∥₁ ∥V⋆_M̂ − V⋆_M∥∞.
Now we can control ∥P̂(s,a) − P(s,a)∥₁ using the total-variation concentration bound from Section 2.2 (while paying the extra dependence on |S|), and we can separately control ∥V⋆_M̂ − V⋆_M∥∞ using the analysis of Section 2.3. Each term scales as O(1/√n) (we are only considering the scaling with n here, ignoring the other variables such as |S| and 1/(1−γ)), so their product scales as O(1/n). When n is sufficiently large, this term is dominated by the O(1/√n) error coming from |(T_M̂ Q⋆_M)(s,a) − (T_M Q⋆_M)(s,a)| and can be omitted. (This is why it is sometimes called a “burn-in” term: it only has a significant effect in the small-sample regime.) Note that this O(1/n) term has worse dependence on |S| and 1/(1−γ) than the O(1/√n) term, so “sufficiently large n” means that n has to be larger than some function of |S| and 1/(1−γ) for the gap between √n and n to compensate for the worse factors in |S| and 1/(1−γ). In such a large-sample regime, we obtain the nice error bound of Õ( V_max / (√n (1−γ)) ), i.e., there is neither the extra √|S| factor of Section 2.2 nor the extra 1/(1−γ) factor of Section 2.3.
Further improvement  The bound can be further improved by replacing Hoeffding's inequality with Bernstein's, which provides sharper concentration when the variance of the random variables is substantially smaller than their range (squared). In our setting, the range of the random variables in the concentration of ⟨P̂(s,a) − P(s,a), V⟩ is V_max, so the worst-case variance is O(V²_max). However, it turns out that for certain V (e.g., V = V^π_M), the variance cannot be large for all (s,a) simultaneously, as it adds up to O(V²_max) along the occupancy of π; leveraging this property leads to improved sample complexities; see [9] and [10, Section 2.3].
References
[1] Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge, England, 1989.
[2] Satinder P Singh and Richard S Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1-3):123–158, 1996.
[3] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Belle-
mare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level
control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[4] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.
[5] Harm van Seijen and Rich Sutton. A deeper look at planning as learning from replay. In Proceed-
ings of the 32nd International Conference on Machine Learning, pages 2314–2322, 2015.
[6] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time.
Machine Learning, 49(2-3):209–232, 2002.
[7] Michael Kearns, Yishay Mansour, and Andrew Y Ng. A sparse sampling algorithm for near-
optimal planning in large Markov decision processes. Machine Learning, 49(2-3):193–208, 2002.
[8] Tengyang Xie and Nan Jiang. Q* approximation schemes for batch reinforcement learning:
A theoretical comparison. In Conference on Uncertainty in Artificial Intelligence, pages 550–559.
PMLR, 2020.
[9] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325–349, 2013.
[10] Alekh Agarwal, Nan Jiang, Sham Kakade, and Wen Sun. Reinforcement Learning: Theory and
Algorithms. https://rltheorybook.github.io/.