Note 3
Nan Jiang
Here I(·) is the indicator function. In words, P̂(s′|s,a) is simply the empirical frequency of observing s′ after taking a in state s. Similarly, when the reward function also needs to be learned, the estimate is

    R̂(s,a) = (1/|D_{s,a}|) Σ_{(r,s′)∈D_{s,a}} r.    (2)

P̂ and R̂ are the maximum likelihood estimates of the transition and the reward functions, respectively. Note that for the transition function to be well-defined we need n(s,a) > 0 for every (s,a) ∈ S × A.
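As a concrete illustration, here is a minimal Python sketch (not from the note; the function name and the data format, a list of (s, a, r, s′) tuples, are assumptions made for illustration) of how the certainty-equivalence estimates P̂ and R̂ can be formed from counts.

```python
import numpy as np

def certainty_equivalence_model(data, num_states, num_actions):
    """Form the empirical (maximum-likelihood) model from (s, a, r, s_next) tuples.
    Assumes every (s, a) pair appears at least once, i.e., n(s, a) > 0."""
    counts = np.zeros((num_states, num_actions, num_states))
    reward_sums = np.zeros((num_states, num_actions))
    for s, a, r, s_next in data:
        counts[s, a, s_next] += 1.0
        reward_sums[s, a] += r
    n_sa = counts.sum(axis=2)             # n(s, a)
    P_hat = counts / n_sa[:, :, None]     # empirical next-state frequencies
    R_hat = reward_sums / n_sa            # empirical mean rewards, Eq. (2)
    return P_hat, R_hat

# toy usage with a hypothetical 2-state, 1-action dataset
data = [(0, 0, 1.0, 1), (0, 0, 0.0, 0), (1, 0, 0.5, 1)]
P_hat, R_hat = certainty_equivalence_model(data, num_states=2, num_actions=1)
```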
empirical successes [3]. On the other hand, such methods are typically less sample-efficient than model-based methods and will not be discussed in more detail in this note.¹
2 Analysis of certainty-equivalence RL
Here we analyze the method introduced in Section 1.1. For simplicity we further assume that the data are generated by sampling each (s,a) a fixed number of times. We are interested in deriving high-probability guarantees for the optimal policy of M̂ = (S, A, P̂, R̂, γ) as a function of n ≡ |D_{s,a}|.
We provide three different analyses of the algorithm, which reveal some interesting trade-offs between the dependence on the size of the state space and the dependence on the horizon.
Note that we first split the failure probability δ evenly between the reward estimation events and the
transition estimation events. Then for reward, we split δ/2 evenly among all (s, a); for transition, we
split δ/2 evenly among all (s, a, s′ ). From Eq.(4) we further have
    max_{s,a} ∥P̂(s,a) − P(s,a)∥₁ ≤ max_{s,a} |S| · ∥P̂(s,a) − P(s,a)∥∞ ≤ |S| · √( (1/(2n)) ln( 4|S×A×S| / δ ) ).    (5)
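For a sense of scale, the snippet below plugs example values (the particular numbers are arbitrary and chosen only for illustration) into the right-hand side of Eq.(5).

```python
import numpy as np

S, A, n, delta = 50, 4, 10_000, 0.1
# Per-entry Hoeffding bound after splitting delta/2 over all (s, a, s') triples (Eq. (4)),
# then multiplied by |S| to get an l1 bound (Eq. (5)).
per_entry = np.sqrt(np.log(4 * S * A * S / delta) / (2 * n))
l1_bound = S * per_entry
print(f"per-entry bound: {per_entry:.4f}, l1 bound from Eq.(5): {l1_bound:.4f}")
```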
To bound the suboptimality of π⋆_M̂, we first introduce the simulation lemma [6].

Lemma 1 (Simulation Lemma). For any policy π: S → A,

    ∥V^π_M̂ − V^π_M∥∞ ≤ ε_R/(1−γ) + γ ε_P V_max / (2(1−γ)),

where ε_R := max_{s,a} |R̂(s,a) − R(s,a)| and ε_P := max_{s,a} ∥P̂(s,a) − P(s,a)∥₁.
¹… also blurs the boundary between value-based and model-based methods [5].
Proof. For any s ∈ S,

    |V^π_M̂(s) − V^π_M(s)|
    = |R̂(s,π) + γ⟨P̂(s,π), V^π_M̂⟩ − R(s,π) − γ⟨P(s,π), V^π_M⟩|
    ≤ ε_R + γ |⟨P̂(s,π), V^π_M̂⟩ − ⟨P(s,π), V^π_M̂⟩ + ⟨P(s,π), V^π_M̂⟩ − ⟨P(s,π), V^π_M⟩|
    ≤ ε_R + γ |⟨P̂(s,π) − P(s,π), V^π_M̂⟩| + γ ∥V^π_M̂ − V^π_M∥∞
    = ε_R + γ |⟨P̂(s,π) − P(s,π), V^π_M̂ − (V_max/2)·1⟩| + γ ∥V^π_M̂ − V^π_M∥∞
    ≤ ε_R + γ ∥P̂(s,π) − P(s,π)∥₁ ∥V^π_M̂ − (V_max/2)·1∥∞ + γ ∥V^π_M̂ − V^π_M∥∞
    ≤ ε_R + γ ε_P V_max/2 + γ ∥V^π_M̂ − V^π_M∥∞.

Since this holds for all s ∈ S, we can take the infinity-norm on the LHS as well; solving the resulting inequality for ∥V^π_M̂ − V^π_M∥∞ yields the desired result. Note that we subtract (V_max/2)·1 (1 is the all-one vector) to center the range of V^π_M̂ around the origin, which exploits the fact that both P̂(s,π) and P(s,π) are valid probability distributions and sum up to 1, so the inner product with their difference is unchanged by a constant shift.
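Lemma 1 is easy to sanity-check numerically. The following sketch (not from the note; the random MDP, the perturbation scheme, and all names are illustrative) builds a small true model, perturbs it into an approximate model, evaluates a fixed policy exactly in both, and compares the gap to ε_R/(1−γ) + γ ε_P V_max/(2(1−γ)).

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, Rmax = 6, 3, 0.9, 1.0
Vmax = Rmax / (1 - gamma)

P = rng.dirichlet(np.ones(S), size=(S, A))        # true transitions, shape (S, A, S)
R = rng.uniform(0, Rmax, size=(S, A))             # true rewards

# Perturbed model; the mixture keeps each row a valid distribution.
R_hat = np.clip(R + rng.uniform(-0.01, 0.01, size=(S, A)), 0, Rmax)
P_hat = P + 0.02 * (rng.dirichlet(np.ones(S), size=(S, A)) - P)

eps_R = np.abs(R_hat - R).max()
eps_P = np.abs(P_hat - P).sum(axis=2).max()       # max_{s,a} l1 distance

def policy_eval(P, R, pi, gamma):
    """Exact evaluation of a deterministic policy pi (array of actions) via a linear solve."""
    S = R.shape[0]
    P_pi = P[np.arange(S), pi]
    R_pi = R[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

pi = rng.integers(A, size=S)                      # an arbitrary deterministic policy
gap = np.abs(policy_eval(P_hat, R_hat, pi, gamma) - policy_eval(P, R, pi, gamma)).max()
bound = eps_R / (1 - gamma) + gamma * eps_P * Vmax / (2 * (1 - gamma))
print(f"|V_hat - V|_inf = {gap:.4f} <= bound = {bound:.4f}")
```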
Alternative proof of Simulation Lemma  Here we sketch an alternative and more “modern” proof of the simulation lemma. The proof relies on the following identity: for all f ∈ R^S and s0 ∈ S,

    f(s0) − V^π_M(s0) = (1/(1−γ)) E_{s∼d^{π,s0}, a∼π(·|s), r∼R(s,a), s′∼P(·|s,a)} [f(s) − r − γf(s′)].    (6)

To see why, first note that V^π_M(s0) = (1/(1−γ)) E_{s∼d^{π,s0}, a∼π(·|s), r∼R(s,a)}[r], so the corresponding terms can be dropped on the two sides of the equation. For the remaining terms, the RHS is

    (1/(1−γ)) E_{s∼d^{π,s0}, a∼π(·|s), s′∼P(·|s,a)} [f(s) − γf(s′)]
    = Σ_{t=1}^∞ γ^{t−1} E_{s∼d_t^{π,s0}, a∼π(·|s), s′∼P(·|s,a)} [f(s) − γf(s′)].

The γf(s′) term for t cancels out exactly with the f(s) term for t+1, because s′ at step t and s at step t+1 are both distributed according to d_{t+1}^{π,s0}, and the difference in discount factors (γ^{t−1} vs. γ^t) accounts for the γ in γf(s′). As a result, only the first term f(s0) is left, which is the same as the remaining term on the LHS. In fact, this identity is effectively the Bellman flow equation for d^π, written in a form where f serves as a “test function” (or discriminator) for the Bellman flow equation.

To prove the simulation lemma, we simply let f = V^π_M̂. On the RHS, we marginalize out r and s′, and obtain

    (1/(1−γ)) E_{s∼d^{π,s0}, a∼π(·|s)} [V^π_M̂(s) − R(s,a) − γ⟨P(·|s,a), V^π_M̂⟩]
    = (1/(1−γ)) E_{s∼d^{π,s0}, a∼π(·|s)} [R̂(s,a) + γ⟨P̂(·|s,a), V^π_M̂⟩ − R(s,a) − γ⟨P(·|s,a), V^π_M̂⟩].

Each term inside the expectation can then be bounded by ε_R + γ ε_P V_max/2 exactly as in the first proof (after centering V^π_M̂), which recovers the lemma.
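Identity (6) can be verified numerically on a small MDP; in the sketch below (illustrative setup and names), d^{π,s0} is computed directly from its definition as the normalized discounted state occupancy, and the two sides of Eq.(6) are compared for an arbitrary test function f.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, s0 = 5, 3, 0.9, 0
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(0, 1, size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)            # a stochastic policy pi(a|s)

P_pi = np.einsum('sa,sap->sp', pi, P)             # state-to-state kernel under pi
R_pi = (pi * R).sum(axis=1)
V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

# Normalized discounted occupancy: d(s) = (1-gamma) * sum_t gamma^{t-1} Pr(s_t = s | s_1 = s0).
e0 = np.zeros(S); e0[s0] = 1.0
d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, e0)

f = rng.uniform(0, 10, size=S)                    # an arbitrary test function f in R^S
# RHS of Eq.(6): expectation of f(s) - r - gamma f(s') under d, pi, R, P.
exp_r = (d * R_pi).sum()
exp_f = (d * f).sum()
exp_f_next = d @ np.einsum('sa,sap,p->s', pi, P, f)
rhs = (exp_f - exp_r - gamma * exp_f_next) / (1 - gamma)
lhs = f[s0] - V_pi[s0]
print(f"LHS = {lhs:.6f}, RHS = {rhs:.6f}")        # the two should match
```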
Turning back to the analysis of certainty-equivalence, the following lemma translates the policy evaluation error into the suboptimality of π⋆_M̂:

Lemma 2 (Evaluation error to decision loss). ∀s ∈ S,

    V⋆_M(s) − V_M^{π⋆_M̂}(s) ≤ 2 sup_{π:S→A} ∥V^π_M̂ − V^π_M∥∞.

Putting Lemmas 1 and 2 together with the concentration inequalities, we can see that the suboptimality we incur is

    V⋆_M(s) − V_M^{π⋆_M̂}(s) = Õ( |S| V_max / (√n (1−γ)) ),  ∀s ∈ S.

Here Õ(·) suppresses poly-logarithmic dependences on |S| and |A|; in this note we also omit the dependence on R_max and 1/δ, and only highlight the dependence on |S|, n, and 1/(1−γ).
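To connect the guarantee to the algorithm it describes, here is a self-contained simulation sketch (all names, the random MDP, and the choice to treat the reward as known are assumptions made for brevity): it draws n next-state samples per (s,a), plans in M̂ with value iteration, and reports the worst-case suboptimality of π⋆_M̂ in the true MDP.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma, n = 8, 3, 0.9, 2000
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(0, 1, size=(S, A))                # known rewards, for simplicity

def value_iteration(P, R, gamma, iters=2000):
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

def policy_eval(P, R, pi, gamma):
    S = R.shape[0]
    P_pi, R_pi = P[np.arange(S), pi], R[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

# Certainty equivalence: draw n next-states per (s, a) and use empirical frequencies.
P_hat = np.zeros_like(P)
for s in range(S):
    for a in range(A):
        next_states = rng.choice(S, size=n, p=P[s, a])
        P_hat[s, a] = np.bincount(next_states, minlength=S) / n

pi_hat = value_iteration(P_hat, R, gamma).argmax(axis=1)   # optimal policy of M_hat
V_star = value_iteration(P, R, gamma).max(axis=1)          # V* of the true MDP
V_pi_hat = policy_eval(P, R, pi_hat, gamma)                # value of pi_hat in the true MDP
print("max suboptimality:", (V_star - V_pi_hat).max())
```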
2.2 Improving |S| to √|S|
The previous analysis proves concentration for each individual P(s′|s,a) and adds up the errors to give an ℓ1 error bound, which is loose. We can obtain a tighter analysis by proving an ℓ1 concentration bound for the multinomial distribution directly.

Note that for any vector v ∈ R^{|S|},

    ∥v∥₁ = sup_{u∈{−1,1}^{|S|}} u⊤v.
Each u ∈ {−1,1}^{|S|} projects the vector v to some scalar value. If v can be written as the sum of zero-mean i.i.d. vectors, we can prove concentration for u⊤v first, and then union bound over all u to obtain the ℓ1 error bound. Concretely, for any fixed (s,a) pair and any fixed u ∈ {−1,1}^{|S|}, with probability at least 1 − δ/(2|S×A|·2^{|S|}), we have

    u⊤(P̂(s,a) − P(s,a)) ≤ 2 √( (1/(2n)) ln( 2|S×A|·2^{|S|} / δ ) ),    (7)

because u⊤P̂(s,a) is the average of i.i.d. random variables u⊤e_{s′} with bounded range [−1,1].³ This leads to the following improvement over Eq.(5): w.p. at least 1 − δ/2,

    max_{s,a} ∥P̂(s,a) − P(s,a)∥₁ = max_{s,a} max_{u∈{−1,1}^{|S|}} u⊤(P̂(s,a) − P(s,a)) ≤ 2 √( (1/(2n)) ln( 2|S×A|·2^{|S|} / δ ) ).    (8)
Roughly speaking, the Õ(|S|·√(1/n)) bound in Eq.(5) is improved to Õ(√(|S|/n)), and propagating the improvement through the remainder of the analysis yields

    V⋆_M(s) − V_M^{π⋆_M̂}(s) = Õ( √|S| V_max / (√n (1−γ)) ),  ∀s ∈ S.
³Also note that we only bound the deviation from one side, so we save a factor of 2 inside the ln compared to bounding the absolute deviation. Another tiny improvement: for u ∈ {−1,1}^{|S|}, one can ignore u = ±1 (the all-ones and all-minus-ones vectors), as ±1⊤(P̂(s,a) − P(s,a)) ≡ 0.
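The gap between the two rates above is easy to see in simulation. The sketch below (illustrative; it fixes a single (s,a), so the |S×A| union-bound factors are dropped from both bounds) estimates the typical ℓ1 error of P̂(s,a) by Monte Carlo and prints it next to the Eq.(5)- and Eq.(8)-style quantities.

```python
import numpy as np

rng = np.random.default_rng(3)
S, n, delta, trials = 100, 500, 0.1, 200
p = rng.dirichlet(np.ones(S))                     # a single true distribution P(.|s,a)

errs = []
for _ in range(trials):
    counts = np.bincount(rng.choice(S, size=n, p=p), minlength=S)
    errs.append(np.abs(counts / n - p).sum())     # l1 deviation of the empirical estimate
print("mean empirical l1 error          :", np.mean(errs))
# Eq.(5)-style rate: |S| * per-entry Hoeffding bound (union over the S entries).
print("Eq.(5)-style bound ~ |S|/sqrt(n) :", S * np.sqrt(np.log(4 * S / delta) / (2 * n)))
# Eq.(8)-style rate: union over the 2^|S| sign vectors u.
print("Eq.(8)-style bound ~ sqrt(|S|/n) :",
      2 * np.sqrt((S * np.log(2) + np.log(2 / delta)) / (2 * n)))
```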
2.3 No dependence on |S|
The last analysis removes the dependence of n on |S|, at the cost of an additional dependence on 1/(1−γ).
Note that the total number of samples still scales with |S| as we require n samples per (s, a).
The core idea is to show Q⋆_M̂ ≈ Q⋆_M, and then upper bound the loss by Lemma 4 from Note 1. First, by contraction we have

    ∥Q⋆_M̂ − Q⋆_M∥∞ ≤ (1/(1−γ)) ∥Q⋆_M − T_M̂ Q⋆_M∥∞.    (9)

This is because

    ∥Q⋆_M̂ − Q⋆_M∥∞ = ∥T_M̂ Q⋆_M̂ − T_M̂ Q⋆_M + T_M̂ Q⋆_M − Q⋆_M∥∞
    ≤ γ ∥Q⋆_M̂ − Q⋆_M∥∞ + ∥T_M̂ Q⋆_M − Q⋆_M∥∞,    (T_M̂ is a γ-contraction)

and rearranging yields Eq.(9).
Lemma 3. With probability at least 1 − δ, for all (s,a) ∈ S × A,

    |Q⋆_M(s,a) − (T_M̂ Q⋆_M)(s,a)| ≤ V_max √( (1/(2n)) ln( 2|S×A| / δ ) ).

Proof. The bound follows directly from Hoeffding's inequality upon the following observation:

    R̂(s,a) + γ⟨P̂(s,a), V⋆_M⟩ = (1/n) Σ_{(r,s′)∈D_{s,a}} ( r + γ V⋆_M(s′) ).

Note that the RHS is the average of the i.i.d. random variables r + γV⋆_M(s′), which lie in the interval [0, R_max/(1−γ)] and whose expectation is exactly Q⋆_M(s,a). Therefore, the LHS of the lemma statement is the deviation of an average of i.i.d. variables from its expectation, to which Hoeffding's inequality applies.
Note that the LHS of the lemma statement is simply the (s,a)-th entry of Q⋆_M − T_M̂ Q⋆_M. The final result we can get is

    V⋆_M(s) − V_M^{π⋆_M̂}(s) = Õ( V_max / (√n (1−γ)²) ),  ∀s ∈ S.

The cubic dependence on the horizon comes from 3 different sources: (1) the range of the value, (2) translating the Bellman error into the difference between the optimal Q-value functions, and (3) the accumulation of errors over time when taking actions greedily with respect to Q̂. The previous analyses only paid a quadratic dependence on the horizon because (3) was not present.
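Inequality (9) can be checked numerically as well; the sketch below (illustrative setup, not from the note) computes Q⋆_M, Q⋆_M̂, and the Bellman-error term ∥Q⋆_M − T_M̂ Q⋆_M∥∞ exactly for a random M and a perturbed M̂, and compares the two sides.

```python
import numpy as np

rng = np.random.default_rng(4)
S, A, gamma = 6, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(0, 1, size=(S, A))
# A perturbed model M_hat (mixing keeps each row a valid distribution).
P_hat = 0.9 * P + 0.1 * rng.dirichlet(np.ones(S), size=(S, A))
R_hat = np.clip(R + rng.uniform(-0.05, 0.05, size=(S, A)), 0, 1)

def q_star(P, R, gamma, iters=3000):
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

Q_M, Q_Mhat = q_star(P, R, gamma), q_star(P_hat, R_hat, gamma)
# ||Q*_M - T_Mhat Q*_M||_inf
bellman_err = np.abs(Q_M - (R_hat + gamma * P_hat @ Q_M.max(axis=1))).max()
lhs = np.abs(Q_Mhat - Q_M).max()
print(f"||Q*_Mhat - Q*_M||_inf = {lhs:.4f} <= {bellman_err / (1 - gamma):.4f}")
```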
Some notes on Eq.(9)  One can also obtain the following inequality by swapping the roles of M and M̂ in Eq.(9):

    ∥Q⋆_M̂ − Q⋆_M∥∞ ≤ (1/(1−γ)) ∥Q⋆_M̂ − T_M Q⋆_M̂∥∞.
In fact, the RHS of the above inequality is the more standard notion of Bellman error (or Bellman residual): it measures how much an approximate Q-value function (here Q⋆_M̂) deviates from itself when updated using the true Bellman update operator. We can attempt to complete the analysis based on this inequality instead of Eq.(9), by noticing that the RHS is (ignoring 1/(1−γ) and the max-norm)

    T_M̂ Q⋆_M̂ − T_M Q⋆_M̂.

This way we also introduce T_M̂ into the expression and compare it against T_M, which should allow us to use concentration inequalities to bound the difference between T_M̂ and T_M.
Now the (s,a)-th entry of the above expression is

    R̂(s,a) + γ⟨P̂(s,a), V⋆_M̂⟩ − R(s,a) − γ⟨P(s,a), V⋆_M̂⟩.

It is tempting to use the techniques in the proof of Lemma 3, by claiming that the (r + γV⋆_M̂(s′)) are i.i.d. random variables for (r,s′) ∈ D_{s,a}, with expected value R(s,a) + γ⟨P(s,a), V⋆_M̂⟩. This is not true in general, because the function V⋆_M̂ itself is random and depends on the data in D_{s,a}! Hence Hoeffding does not apply. One workaround is to consider a deterministic function class that always contains V⋆_M̂ and do a union bound over that class; in fact, if we choose the class of all tabular functions with range [0, V_max], the analysis is basically identical to that of Section 2.2.
Now you should see why we use Q⋆_M and T_M̂ in Eq.(9): this way we compare T_M and T_M̂ against V⋆_M, which is a deterministic function.
⋆
In cases where M ’s state space forms a directed acyclic graph (DAG), the argument with VM c can
⋆ ′
still work as VMc(s ) only depends on the datasets for later state-action pairs, which do not include
the current (s, a) under consideration. This argument is straightforward here because we have a
very simple and clean data collection procedure. One has to be extremely careful when using this
⋆ ′
argument in more realistic settings: for example, in the exploration setting, even if VM
c(s ) is estimated
from datasets not including Ds,a , the outcomes in Ds,a might have determined which later states we
⋆
have sufficient samples and which not, which introduces very subtle interdependence with VM c.
Connection to MCTS  Interestingly, the fact that n does not depend on |S| in the last analysis is the core idea behind Sparse Sampling [7], a prototype for the family of Monte-Carlo tree search (MCTS) algorithms that played a crucial role in the success of AlphaGo.

One way to view Sparse Sampling is the following: conceptually we run the tabular method with n set according to the last analysis (no dependence on |S|). Of course, when |S| is large this is impractical, but if we only need to know π⋆(s0) for some particular state s0 (which is the setting of online planning with MCTS), we can perform “lazy evaluation”: only generate the datasets for the state-action pairs that contribute to the calculation of V⋆_M̂(s0), and truncate at the effective horizon. Roughly speaking, this requires a total of (n|A|)^{O(1/(1−γ))} samples to compute π⋆(s0), which has no dependence on |S|.
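To make the lazy-evaluation idea concrete, here is a minimal sketch of a Sparse Sampling-style recursion (a simplification for illustration rather than the exact algorithm of [7]; the generative-model interface sample(s, a) and all other names are assumptions): it draws n samples per action at each visited state and recurses to a fixed depth standing in for the effective horizon.

```python
import numpy as np

def sparse_sampling_q(sample, actions, s, gamma, n, depth):
    """Estimate Q(s, a) for each action by recursive lookahead with n samples per action.
    `sample(s, a)` is an assumed generative-model interface returning (reward, next_state)."""
    if depth == 0:
        return np.zeros(len(actions))
    q = np.zeros(len(actions))
    for i, a in enumerate(actions):
        total = 0.0
        for _ in range(n):
            r, s_next = sample(s, a)
            v_next = sparse_sampling_q(sample, actions, s_next, gamma, n, depth - 1).max()
            total += r + gamma * v_next
        q[i] = total / n
    return q

# Example usage on a tiny random MDP (illustrative only):
rng = np.random.default_rng(5)
S, A, gamma = 4, 2, 0.8
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(0, 1, size=(S, A))
sample = lambda s, a: (R[s, a], rng.choice(S, p=P[s, a]))
q0 = sparse_sampling_q(sample, range(A), s=0, gamma=gamma, n=5, depth=4)
print("greedy action at s0:", int(q0.argmax()))
```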
Recall that one can swap the roles of M and M̂ on the RHS of Eq.(9), i.e., the RHS becomes (1/(1−γ)) ∥Q⋆_M̂ − T_M Q⋆_M̂∥∞. The trouble is that we can no longer directly apply Hoeffding's inequality due to the data-dependence of Q⋆_M̂.

However, if we manage to control this term, it will save us a horizon factor (which is a good reason to consider this idea more carefully)! This is shown in the following lemma (the lemma only involves the true MDP M, which is therefore omitted from the subscripts of value functions and occupancies):
Lemma 4 ([8]). For any f ∈ R^{S×A}, any initial state s0, and any policy π,

    V^π(s0) − V^{π_f}(s0) ≤ (1/(1−γ)) ( E_{d^{π,s0}}[T f − f] + E_{d^{π_f,s0}}[f − T f] ),

where terms of the form E_µ[f] are shorthand for E_{(s,a)∼µ}[f(s,a)].

Proof. We use the following identity, which is the Q-function variant of Eq.(6) (what we used to give the alternative proof of the simulation lemma) and can be proved in a very similar way: for any f ∈ R^{S×A},

    f(s0, π) − V^π(s0) = (1/(1−γ)) E_{d^{π,s0}}[f − T^π f].    (10)
Then,

    V^π(s0) − V^{π_f}(s0)
    = (V^π(s0) − f(s0, π)) + (f(s0, π) − f(s0, π_f)) + (f(s0, π_f) − V^{π_f}(s0))
    ≤ (V^π(s0) − f(s0, π)) + (f(s0, π_f) − V^{π_f}(s0)),

since the middle term is non-positive (π_f is greedy with respect to f). For the two pairs of differences on the RHS, we invoke Eq.(10) twice: once with π, and once with π rebound to π_f. This gives

    V^π(s0) − V^{π_f}(s0) ≤ (1/(1−γ)) ( E_{d^{π,s0}}[T^π f − f] + E_{d^{π_f,s0}}[f − T^{π_f} f] )
    ≤ (1/(1−γ)) ( E_{d^{π,s0}}[T f − f] + E_{d^{π_f,s0}}[f − T f] ),

where the last step uses T f ≥ T^π f pointwise and T^{π_f} f = T f (again because π_f is greedy with respect to f).

Applying Lemma 4 with π = π⋆_M and f = Q⋆_M̂ (so that π_f = π⋆_M̂), and bounding both expectations by the max-norm, we obtain

    ∥V⋆_M − V_M^{π⋆_M̂}∥∞ ≤ (2/(1−γ)) ∥Q⋆_M̂ − T_M Q⋆_M̂∥∞,

which contrasts with the 2/(1−γ)² factor in Section 2.3. However, this brings back the earlier problem: how to handle the data dependence of Q⋆_M̂ in concentration?
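As with Eq.(9), this bound is easy to verify on a small example. The sketch below (illustrative setup; the reward function is shared between M and M̂ for simplicity) compares the true suboptimality of π⋆_M̂ with (2/(1−γ)) ∥Q⋆_M̂ − T_M Q⋆_M̂∥∞.

```python
import numpy as np

rng = np.random.default_rng(6)
S, A, gamma = 6, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(0, 1, size=(S, A))
P_hat = 0.9 * P + 0.1 * rng.dirichlet(np.ones(S), size=(S, A))   # a perturbed model

def q_star(P, R, gamma, iters=3000):
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

Q_M, Q_Mhat = q_star(P, R, gamma), q_star(P_hat, R, gamma)
pi_hat = Q_Mhat.argmax(axis=1)                                   # greedy policy of M_hat
P_pi = P[np.arange(S), pi_hat]
V_pi_hat = np.linalg.solve(np.eye(S) - gamma * P_pi, R[np.arange(S), pi_hat])
# ||Q*_Mhat - T_M Q*_Mhat||_inf, the Bellman residual under the true model
residual = np.abs(Q_Mhat - (R + gamma * P @ Q_Mhat.max(axis=1))).max()
print(f"suboptimality = {(Q_M.max(axis=1) - V_pi_hat).max():.4f} "
      f"<= {2 * residual / (1 - gamma):.4f}")
```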
Bounding ∥Q⋆_M̂ − T_M Q⋆_M̂∥∞  We now show how to bound this term. ∀(s,a),

    |Q⋆_M̂(s,a) − (T_M Q⋆_M̂)(s,a)| = |(T_M̂ Q⋆_M̂)(s,a) − (T_M Q⋆_M̂)(s,a)|
    = |(T_M̂ Q⋆_M̂)(s,a) − (T_M̂ Q⋆_M)(s,a) + (T_M̂ Q⋆_M)(s,a) − (T_M Q⋆_M)(s,a) + (T_M Q⋆_M)(s,a) − (T_M Q⋆_M̂)(s,a)|.

The middle difference, (T_M̂ Q⋆_M)(s,a) − (T_M Q⋆_M)(s,a) = (T_M̂ Q⋆_M)(s,a) − Q⋆_M(s,a), is exactly the quantity bounded in Lemma 3. To handle the remaining 4 terms, we plug in the definitions of T_M and T_M̂:

    |(T_M̂ Q⋆_M̂)(s,a) − (T_M̂ Q⋆_M)(s,a) + (T_M Q⋆_M)(s,a) − (T_M Q⋆_M̂)(s,a)|
    = |R̂(s,a) + γ⟨P̂(s,a), V⋆_M̂⟩ − R̂(s,a) − γ⟨P̂(s,a), V⋆_M⟩ + R(s,a) + γ⟨P(s,a), V⋆_M⟩ − R(s,a) − γ⟨P(s,a), V⋆_M̂⟩|
    = γ |⟨P̂(s,a), V⋆_M̂ − V⋆_M⟩ − ⟨P(s,a), V⋆_M̂ − V⋆_M⟩|
    = γ |⟨P̂(s,a) − P(s,a), V⋆_M̂ − V⋆_M⟩|
    ≤ γ ∥P̂(s,a) − P(s,a)∥₁ ∥V⋆_M̂ − V⋆_M∥∞.
Now we can control ∥P̂(s,a) − P(s,a)∥₁ using the total-variation concentration bound from Section 2.2 (while paying the extra dependence on |S|), and we can separately control ∥V⋆_M̂ − V⋆_M∥∞ using the analysis of Section 2.3. Each term scales as O(1/√n) (we are only considering the scaling with n here, ignoring the other variables such as |S| and 1/(1−γ)), so their product scales as O(1/n). When n is sufficiently large, this term is dominated by the O(1/√n) error coming from |(T_M̂ Q⋆_M)(s,a) − (T_M Q⋆_M)(s,a)| and can be omitted. (This is why it is sometimes called a “burn-in” term: it only has a significant effect in the small-sample regime.) Note that this O(1/n) term has worse dependence on |S| and 1/(1−γ) than the O(1/√n) term, so “sufficiently large n” means that n has to be larger than some function of |S| and 1/(1−γ) for the gap between √n and n to compensate for the worse factors in |S| and 1/(1−γ). In such a large-sample regime, we obtain the nice error bound of Õ( V_max / (√n (1−γ)) ), i.e., there is neither the extra √|S| factor of Section 2.2 nor the extra 1/(1−γ) factor of Section 2.3.
Further improvement  The bound can be further improved by replacing Hoeffding's inequality with Bernstein's, which provides sharper concentration when the variance of the random variables is substantially smaller than their range (squared). In our setting, the range of the random variables in the concentration of ⟨P̂(s,a) − P(s,a), V⟩ is V_max, so the worst-case variance is O(V²_max). However, it turns out that for certain V (e.g., V = V^π_M), the variance cannot be large for all (s,a) simultaneously, as it adds up to O(V²_max) along the occupancy of π; leveraging this property leads to improved sample complexities; see [9] and [10, Section 2.3].
References
[1] Christopher John Cornish Hellaby Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge, England, 1989.
[2] Satinder P Singh and Richard S Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1-3):123–158, 1996.
[3] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Belle-
mare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level
control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[4] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.
[5] Harm van Seijen and Rich Sutton. A deeper look at planning as learning from replay. In Proceed-
ings of the 32nd International Conference on Machine Learning, pages 2314–2322, 2015.
[6] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time.
Machine Learning, 49(2-3):209–232, 2002.
[7] Michael Kearns, Yishay Mansour, and Andrew Y Ng. A sparse sampling algorithm for near-
optimal planning in large Markov decision processes. Machine Learning, 49(2-3):193–208, 2002.
[8] Tengyang Xie and Nan Jiang. Q* approximation schemes for batch reinforcement learning:
A theoretical comparison. In Conference on Uncertainty in Artificial Intelligence, pages 550–559.
PMLR, 2020.
[9] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325–349, 2013.
[10] Alekh Agarwal, Nan Jiang, Sham Kakade, and Wen Sun. Reinforcement Learning: Theory and
Algorithms. https://rltheorybook.github.io/.