Note 8
Nan Jiang
November 9, 2022
In this note we introduce LSVI-UCB [1] and provide its regret analysis. The note is heavily inspired by that of Wen Sun [2]. For the purpose of clearly conveying the main ideas behind the analysis, we will be sketchy in several aspects: (1) we will omit the proofs of boundedness of the relevant parameters, especially when they only appear (e.g., through the covering analysis) in a poly-logarithmic manner; (2) we will not explicitly mention it when we need to split the failure probability over a constant number of estimation events of distinct nature. We will briefly mention it when we union bound over a variable number of events (e.g., over T and H), but such factors will not appear in the bounds because we use Õ(·) to suppress poly-logarithmic factors.
1 Setup
We consider episodic RL problems where the environment is specified by a finite-horizon MDP M =
(S, A, {Ph }, {Rh }, H, d0 ). d0 is the initial state distribution from which a trajectory is sampled and
w.r.t. which the expected return J(π) is defined, and Ph and Rh are time-dependent transition and
reward functions, respectively. We adopt the convention that r_h ∈ [0, 1], so the total reward in an episode lies in [0, H]. In finite-horizon problems, policies and value functions are also time-
dependent, and we often write π = {πh } and V π = {Vhπ }, where πh and Vhπ are meant to apply to
states that appear at the h-th time step.
In the low-rank MDP model, we have P_h = Φ_h Ψ_h ∈ R^{|S×A|×|S|}, where Φ_h ∈ R^{|S×A|×d}, Ψ_h ∈ R^{d×|S|}, and the decomposition implies that P_h has rank at most d. The linear MDP setting is when the MDP has low rank and the matrix Φ_h is known to the learner: we denote each row of Φ_h as φ(s, a) ∈ R^d, which will be used as features by the learner. Typically, it is also assumed that the expected reward is linear in φ(s, a), i.e., R_h(s, a) = φ(s, a)^⊤ θ_R, and it is easy to show (this was left as homework) that $(\mathcal{T} f_{h+1})(s, a) := R_h(s, a) + E_{s'\sim P_h(\cdot\mid s,a)}[\max_{a'} f_{h+1}(s', a')]$ is linear in φ for any f_{h+1}; the same statement holds for T^π for any π. As a result, Q* and Q^π (for all π) are both linear in φ, making φ a powerful representation for linear MDPs.
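For concreteness, here is a minimal sketch of the linearity claim (using the factorization P_h = Φ_h Ψ_h introduced above): writing $v_f(s') := \max_{a'} f_{h+1}(s', a')$,
$$(\mathcal{T} f_{h+1})(s, a) = \phi(s, a)^\top\theta_R + \sum_{s'} P_h(s'\mid s, a)\, v_f(s') = \phi(s, a)^\top\big(\theta_R + \Psi_h v_f\big),$$
since the row of P_h indexed by (s, a) equals φ(s, a)^⊤ Ψ_h. The same calculation goes through for T^π, with max_{a'} replaced by the action chosen by π_{h+1}.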
Boundedness assumptions We assume ‖φ(s, a)‖ ≤ 1 ∀(s, a), where ‖·‖ without any subscript is the standard ℓ2 norm. See [1] for other needed boundedness assumptions.
2 Technical Preparation
Before we introduce the algorithm and analyze it, we need to introduce a number of useful tools that
find important applications across machine learning theory. Given the purpose of this note, we only provide a very minimal introduction to some of the concepts, and will provide references for further reading on these topics.
2.1 Azuma's inequality
In a typical scenario, X_0 = 0 and X_k is the sum of k zero-mean random variables, e.g., deviations between random variables and their conditional expectations (conditioned on all previous information), and c_k is the range of the k-th r.v. Therefore, X_N − X_0 is the sum of deviations across all variables, which is what we often care to bound. Compared to Hoeffding's inequality, Azuma's is more general and allows non-independent r.v.'s. This typically arises when the distribution of the random variable at time k (i.e., X_k − X_{k−1}) is determined by all the information up to time k − 1, including the realization of all previous random variables.
As an example, consider a simple multi-armed bandit. An algorithm adaptively pulls arms a_1, a_2, ..., a_T ∈ A and observes random rewards r_1, r_2, ..., r_T for the corresponding arms, where a_t may be chosen based on a_{1:t−1} and r_{1:t−1}. In such a case, we can still apply martingale concentration to bound $\sum_{t=1}^T (r_t - \mu_{a_t})$, where μ_a is the mean reward for arm a, because μ_{a_t} is the expectation of r_t given all the information so far, which includes the algorithm's choice of a_t.¹
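To make the protocol concrete, here is a minimal simulation sketch in Python (the three-armed instance, the Bernoulli rewards, and the crude greedy rule are all made up purely for illustration) that forms the martingale sum $\sum_{t=1}^T (r_t - \mu_{a_t})$ under an adaptive arm-selection rule and compares it against an Azuma-style √T-scale bound:

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.8])     # true mean rewards (hypothetical instance)
T = 10_000

counts, sums = np.zeros(3), np.zeros(3)
deviation = 0.0                    # running value of sum_t (r_t - mu_{a_t})
for t in range(T):
    # adaptive, data-dependent choice: greedy w.r.t. empirical means,
    # after one forced pull of each arm
    a = t if t < 3 else int(np.argmax(sums / counts))
    r = rng.binomial(1, mu[a])     # reward in [0, 1] with mean mu[a], given the choice a
    counts[a] += 1
    sums[a] += r
    deviation += r - mu[a]

# Azuma: |deviation| = O(sqrt(T log(1/delta))) w.h.p., despite the adaptivity of a_t
print(abs(deviation), np.sqrt(2 * T * np.log(2 / 0.05)))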
2.2 Ridge regression with non-stochastic inputs and the Elliptical Potential Lemma
The algorithm and analysis for linear MDPs will heavily rely on ridge regression, which we review
here. Consider the following protocol: for t = 1, 2, . . . , T , nature chooses xt ∈ Rd in each round, and
then reveals a noisy label $y_t = x_t^\top\theta^\star + \epsilon_t \in [0, V_{\max}]$, where ε_t is zero-mean noise. θ* ∈ R^d is the unknown parameter, and our goal is to learn it to make accurate predictions given any x ∈ R^d, i.e., to predict y(x) = x^⊤θ*. We also assume that ‖x_t‖ ≤ 1 and that ‖θ*‖ is a constant.
¹ In contrast, one can easily construct examples where $\sum_{t=1}^T (r_t - E[r_t])$ does not enjoy concentration, because E[r_t] considers the unconditional expectation over all randomness, and the partial sums do not satisfy the definition of a martingale.
When {xt } is sampled i.i.d. and comes with a well-conditioned design matrix x1:T ∈ Rd×T , we can
simply apply ordinary linear regression. However, in this setup, we allow the xt to be chosen in an
arbitrary (or even adversarial) manner, possibly depending on x1:t−1 and y1:t−1 .
Ridge regression considers the following estimator and prediction: given x1:t−1 and y1:t−1 ,
$$\hat\theta_t := \arg\min_{\theta\in\mathbb{R}^d}\ \sum_{i=1}^{t-1}(x_i^\top\theta - y_i)^2 + \|\theta\|^2, \qquad \hat y_t(x) := x^\top\hat\theta_t.$$
So it is simply least-squares regression with ℓ2 regularization. The estimator and the prediction can also be put in matrix form: defining $\Lambda_t := I + \sum_{i=1}^{t-1} x_i x_i^\top$, we have
$$\hat y_t(x) = x^\top\Lambda_t^{-1}\sum_{i=1}^{t-1} x_i y_i.$$
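The following is a minimal Python sketch of this estimator in the sequential protocol (the data-generating process and the deliberately repetitive choice of x_t are made up for illustration):

import numpy as np

rng = np.random.default_rng(1)
d, T = 5, 2000
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)        # ||theta*|| = 1, a constant

Lambda = np.eye(d)           # Lambda_t = I + sum_{i<t} x_i x_i^T
b = np.zeros(d)              # sum_{i<t} x_i y_i
for t in range(T):
    # nature may pick x_t adaptively; here it mostly repeats the direction e_1
    x = np.eye(d)[0] if t % 10 else rng.normal(size=d)
    x /= max(np.linalg.norm(x), 1.0)            # enforce ||x_t|| <= 1
    y = x @ theta_star + 0.1 * rng.normal()     # noisy label
    Lambda += np.outer(x, x)
    b += x * y

theta_hat = np.linalg.solve(Lambda, b)          # ridge estimate after T rounds
e1 = np.eye(d)[0]
# Lemma 2 intuition: the prediction error at e_1 scales with ||e_1||_{Lambda^{-1}},
# which is tiny here because e_1 was observed many times
print(abs(e1 @ theta_hat - e1 @ theta_star),
      np.sqrt(e1 @ np.linalg.solve(Lambda, e1)))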
The first question we can ask is a "sample complexity" one: given data x_{1:t−1} and y_{1:t−1}, how accurately can we make predictions about an arbitrary input x? Note that we cannot expect the error to shrink to 0 as t → ∞ in general, as nature may pick a very poor design of x_{1:t−1}, leading to the unidentifiability of θ*. However, even in such a case, we can still hope to make accurate predictions
for x that is well within the span of x1:t−1 . This intuition is formalized by the following result:
Here we introduced an intermediate term $x^\top\Lambda_t^{-1}\sum_{i=1}^{t-1} x_i (x_i^\top\theta^\star)$, which can be interpreted as our prediction if we had observed the noise-free labels. By injecting this term, we bound |ŷ_t(x) − y(x)| by the sum of two terms, where the first term can be bounded by Lemma 3. The second term is small, because the prediction based on noise-free labels would be perfectly accurate if (1) x is within the span of x_{1:t−1}, and (2) there were no regularization. As a result, bounding the second term is mainly about characterizing the bias due to regularization.
We now handle the first term:
$$\begin{aligned}
\Big|x^\top\Lambda_t^{-1}\sum_{i=1}^{t-1} x_i(y_i - x_i^\top\theta^\star)\Big| &= \Big|x^\top\Lambda_t^{-1/2}\,\Lambda_t^{-1/2}\sum_{i=1}^{t-1} x_i\epsilon_i\Big|\\
&\le \big\|x^\top\Lambda_t^{-1/2}\big\|\cdot\Big\|\Lambda_t^{-1/2}\sum_{i=1}^{t-1} x_i\epsilon_i\Big\| = \|x\|_{\Lambda_t^{-1}}\cdot\Big\|\sum_{i=1}^{t-1} x_i\epsilon_i\Big\|_{\Lambda_t^{-1}}\\
&\le \|x\|_{\Lambda_t^{-1}}\cdot O\Big(V_{\max}\sqrt{d + \log\tfrac1\delta}\Big). && \text{(Lemma 3)}
\end{aligned}$$
For the second term,
$$\begin{aligned}
\Big|x^\top\Lambda_t^{-1}\sum_{i=1}^{t-1} x_i(x_i^\top\theta^\star) - x^\top\theta^\star\Big| &= \Big|x^\top\Lambda_t^{-1/2}\,\Lambda_t^{-1/2}\Big(\sum_{i=1}^{t-1} x_i x_i^\top - \Lambda_t\Big)\theta^\star\Big| = \big|x^\top\Lambda_t^{-1/2}\,\Lambda_t^{-1/2}(-I)\,\theta^\star\big|\\
&\le \|x\|_{\Lambda_t^{-1}}\cdot\big\|\Lambda_t^{-1/2}\cdot I\cdot\theta^\star\big\| \le \|x\|_{\Lambda_t^{-1}}\,\|\theta^\star\|. && (\Lambda_t^{-1/2}\preceq I)
\end{aligned}$$
Given that we assume ‖θ*‖ is a constant, this term is dominated by the first term, which completes the proof.
Lemma 2 shows that our accuracy of prediction relies on ‖x‖_{Λ_t^{-1}}, which can be very large and may not shrink as t → ∞ when the design matrix is poorly chosen. Is there a sense in which we have nice learning guarantees (i.e., some kind of error guaranteed to go to 0) for the ridge algorithm, even under arbitrary inputs?
The answer is yes, if we consider an online-learning-style protocol: in each round, after nature
chooses xt , the algorithm makes a prediction ŷt := ŷt (xt ) by running ridge regression on the data
observed so far, after which the true label yt is revealed. Define the cumulative regret as
$$\mathrm{Regret}_T := \sum_{t=1}^T |\hat y_t - y(x_t)|.$$
We can show that this regret is Õ(√T) (the dependence on other variables is tentatively omitted), thus the average regret Regret_T / T → 0 (which can be interpreted as an average error rate) as T → ∞.
The intuition is that the nature can repeatedly choose the same x to form a poor design matrix, but
then we will quickly learn how to predict y(x) and start to incur low regret. To force a high regret,
nature must choose a different direction, but we will learn as we make mistakes. Ultimately, nature
will run out of directions since there are only d dimensions, which allows us to bound the total regret.
To formally bound the regret, we use Lemma 2 and union bound over T steps (which incurs an
O(log T ) dependence suppressed by Õ(·) notation): w.p. ≥ 1 − δ,
$$\begin{aligned}
\sum_{t=1}^T |\hat y_t - y(x_t)| &\le \sum_{t=1}^T \|x_t\|_{\Lambda_t^{-1}}\cdot\tilde O\Big(V_{\max}\sqrt{d + \log\tfrac1\delta}\Big) \qquad (1)\\
&\le \tilde O\Big(V_{\max}\sqrt{d + \log\tfrac1\delta}\Big)\cdot\sqrt{T}\,\sqrt{\sum_{t=1}^T x_t^\top\Lambda_t^{-1} x_t}. && \text{(Jensen's)}
\end{aligned}$$
Since we already have the √T factor, it remains to bound $\sum_{t=1}^T x_t^\top\Lambda_t^{-1} x_t$ in a way that does not incur further polynomial dependence on T, which is given by the famous elliptical potential lemma. Note that $\sum_{t=1}^T x_t^\top\Lambda_t^{-1} x_t$ only depends on $\{x_t\}_{t=1}^T$, which is chosen arbitrarily, so the elliptical potential lemma is a general result that does not require any assumptions other than ‖x_t‖ ≤ 1 ∀t.
For the last step, note that for A ∈ R^{m×n}, A^⊤A and AA^⊤ share the same non-zero eigenvalues, so det(I_n + A^⊤A) = det(I_m + AA^⊤). The step follows from letting A = Λ_t^{-1/2} x_t ∈ R^{d×1}.
Now, taking log on both sides, we have $\log(\det(\Lambda_{t+1})/\det(\Lambda_t)) = \log(1 + x_t^\top\Lambda_t^{-1}x_t) \ge \tfrac12 x_t^\top\Lambda_t^{-1}x_t$ (using log(1 + z) ≥ z/2 for z ∈ [0, 1], which applies since $x_t^\top\Lambda_t^{-1}x_t \le \|x_t\|^2 \le 1$), so
$$\sum_{t=1}^T x_t^\top\Lambda_t^{-1}x_t \le 2\sum_{t=1}^T \log\frac{\det(\Lambda_{t+1})}{\det(\Lambda_t)} = 2\log\frac{\det(\Lambda_{T+1})}{\det(\Lambda_1)}.$$
This proves the first inequality in the lemma, given that Λ_1 = I.
To further bound det(Λ_{T+1}), we have det(A) ≤ σ_max(A)^d, where σ_max(·) is the largest eigenvalue, and
$$\sigma_{\max}(\Lambda_{T+1}) = \max_{\|u\|=1} u^\top\Big(I + \sum_{t=1}^T x_t x_t^\top\Big)u \le T + 1. \qquad (\|x_t\| \le 1)$$
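A quick numerical sanity check of the lemma (a minimal sketch; the unit-norm x_t sequence here is random, but any sequence with ‖x_t‖ ≤ 1 would do):

import numpy as np

rng = np.random.default_rng(2)
d, T = 8, 5000
Lambda = np.eye(d)                        # Lambda_1 = I
potential = 0.0                           # sum_t x_t^T Lambda_t^{-1} x_t
for t in range(T):
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)                # ||x_t|| <= 1
    potential += x @ np.linalg.solve(Lambda, x)
    Lambda += np.outer(x, x)

# elliptical potential lemma: potential <= 2 log det(Lambda_{T+1}) <= 2 d log(T+1)
print(potential, 2 * np.linalg.slogdet(Lambda)[1], 2 * d * np.log(T + 1))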
2.3 Covering
In previous lectures we have analyzed function-approximation algorithms, assuming a finite function
class F. We simply union bound over it in generalization analyses and pay log |F|. However, almost
all function classes in practice (including linear classes) are continuous, in which case the cardinality
of the function class is infinite. The covering argument provides a way to derive generalization error
bounds for infinite classes. Given a (possibly infinite) function class F ⊂ (X → R), its ℓ∞-covering number at scale ε is defined as
$$N_\varepsilon(\mathcal{F}) := \min\Big\{|\mathcal{F}_\varepsilon| :\ \forall f\in\mathcal{F},\ \exists f_C\in\mathcal{F}_\varepsilon \text{ s.t. } \|f - f_C\|_\infty \le \varepsilon\Big\}.$$
We first show how to prove generalization bounds for F when it admits a finite covering number. Consider a simple setting: we have x_1, ..., x_n ∈ X sampled i.i.d. from some distribution, and want to bound $\sup_{f\in\mathcal{F}}\big|E[f(X)] - \frac1n\sum_{i=1}^n f(x_i)\big|$. Let's assume that f ∈ [0, 1].
Let F_ε be the (smallest) ε-cover of F, where the value of ε will be decided later. Then, by Hoeffding + union bound, w.p. ≥ 1 − δ, ∀f_C ∈ F_ε,
$$\Big|E[f_C(X)] - \frac1n\sum_{i=1}^n f_C(x_i)\Big| \le \sqrt{\frac{1}{2n}\log\frac{2N_\varepsilon}{\delta}}.$$
Note that this will be the only high-probability statement we make.
We now try to use the above to bound $\big|E[f(X)] - \frac1n\sum_{i=1}^n f(x_i)\big|$ for any f ∈ F. Fixing any f ∈ F, find its closest cover center f_C ∈ F_ε with ‖f − f_C‖_∞ ≤ ε; then
$$\Big|E[f(X)] - \frac1n\sum_{i=1}^n f(x_i)\Big| \le \Big|E[f_C(X)] - \frac1n\sum_{i=1}^n f_C(x_i)\Big| + \big|E[f(X)] - E[f_C(X)]\big| + \Big|\frac1n\sum_{i=1}^n f(x_i) - \frac1n\sum_{i=1}^n f_C(x_i)\Big|.$$
Here the first term is bounded by union bounding over F_ε, and the second and the third terms are each deterministically bounded by ε due to ‖f − f_C‖_∞ ≤ ε. So overall the generalization error bound is $2\varepsilon + \sqrt{\frac{1}{2n}\log\frac{2N_\varepsilon}{\delta}}$, and one can optimize ε to obtain the best bound.
As we will see below, typically log N_ε grows with ε very slowly, often as log(1/ε). Therefore, as long as we do not care about poly-logarithmic terms, ε can be set arbitrarily small as long as it is inversely polynomial, and as long as ε is sufficiently small, $\sqrt{\frac{1}{2n}\log\frac{2N_\varepsilon}{\delta}}$ can be roughly viewed as not changing with ε (since the change is only poly-logarithmic). Typically, one can set $\varepsilon = \sqrt{\frac{1}{2n}\log\frac{2N_\varepsilon}{\delta}}$, since further decreasing ε brings no meaningful improvement to the overall bound.
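To make this insensitivity concrete, the following minimal sketch plugs the linear-class covering number N_ε ≤ (B√d/ε)^d derived in the next paragraph into the bound 2ε + √((1/2n) log(2N_ε/δ)) and varies ε (the values of d, B, n, δ are arbitrary placeholders):

import numpy as np

d, B, n, delta = 10, 1.0, 100_000, 0.05

def gen_bound(eps):
    # 2*eps + sqrt(log(2 * N_eps / delta) / (2n)), with N_eps <= (B * sqrt(d) / eps)^d
    log_N = d * np.log(B * np.sqrt(d) / eps)
    return 2 * eps + np.sqrt((log_N + np.log(2 / delta)) / (2 * n))

for eps in [1e-2, 1e-4, 1e-6, 1e-8]:
    print(eps, gen_bound(eps))   # the sqrt term barely moves as eps shrinks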
The other half of the puzzle is how to bound the covering number for common classes. Here we show a simple example of a linear class: consider F = {x ↦ φ(x)^⊤θ : ‖θ‖ ≤ B}, where the feature map φ satisfies ‖φ(x)‖ ≤ 1 ∀x. We want to bound N_ε(F).
We construct F_ε as follows: first, ‖θ‖ ≤ B implies that ‖θ‖_∞ ≤ B, so we can relax the ℓ2 ball of θ to a larger ℓ∞ ball for easy construction. Then, we simply create a regular grid of resolution ε_0 over {θ : ‖θ‖_∞ ≤ B}, which results in (B/ε_0)^d grid cells². Let θ_C be the center of a cell; for any other θ in the same cell we have ‖θ − θ_C‖ ≤ √d‖θ − θ_C‖_∞ ≤ √d ε_0. Choosing the functions corresponding to the cell centers immediately yields a cover of F, with ‖f − f_C‖_∞ = max_x |φ(x)^⊤(θ − θ_C)| ≤ max_x ‖φ(x)‖‖θ − θ_C‖ ≤ √d ε_0. So to obtain an ε-cover, we can set ε_0 = ε/√d, and N_ε ≤ (B√d/ε)^d.
² We ignore the issue that B/ε_0 may not be an integer; this can be easily handled by relaxing the cover size to (2B/ε_0)^d.
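A minimal sketch of this construction for a tiny d, checking the ‖f − f_C‖_∞ ≤ ε property on random features (all concrete values are placeholders):

import numpy as np
from itertools import product

rng = np.random.default_rng(3)
d, B, eps = 2, 1.0, 0.1
eps0 = eps / np.sqrt(d)                            # grid resolution

# grid centers covering the l_inf ball {theta : ||theta||_inf <= B}
grid_1d = np.arange(-B, B + eps0, eps0)
centers = np.array(list(product(grid_1d, repeat=d)))

# a random theta in the l2 ball and random features with ||phi(x)|| <= 1
theta = rng.normal(size=d)
theta *= B * rng.uniform() / np.linalg.norm(theta)
phis = rng.normal(size=(1000, d))
phis /= np.maximum(np.linalg.norm(phis, axis=1, keepdims=True), 1.0)

# nearest grid center in l_inf; the cover guarantee is max_x |phi(x)^T (theta - theta_C)| <= eps
theta_C = centers[np.argmin(np.max(np.abs(centers - theta), axis=1))]
print(np.max(np.abs(phis @ (theta - theta_C))), eps)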
Recall that we pay the log-covering number in the generalization error bounds, which is log N_ε ≤ d log(B√d/ε). So, as long as B and 1/ε are polynomial, they only incur a poly-logarithmic dependence in log N_ε, and the main part of log N_ε is simply d.
Remark When we construct the cover, we created grid cells which can be viewed as ℓ∞ "balls" in the parameter space R^d. This ℓ∞ should not be confused with the "ℓ∞" in ℓ∞ covering, as the latter refers to the fact that we measure the closeness between two functions as ‖f − f′‖_∞ = max_x |f(x) − f′(x)|, where the ℓ∞ corresponds to taking a max over all possible function inputs. In fact, since we later relaxed ‖θ − θ_C‖_∞ to ‖θ − θ_C‖, it is also valid to say that we essentially created an ℓ2 cover over the parameter space, which nevertheless led to an ℓ∞ cover over the function space.
Besides ℓ∞ covering (which is the simplest), there are alternatives such as ℓ1 covering, which measures the distance between two functions in a sample-dependent manner, e.g., $\frac1n\sum_{i=1}^n |f(x_i) - f'(x_i)|$. When using such a covering number in a generalization analysis, one cannot reduce the analysis to the case of finite classes due to the sample dependence, and the analysis requires techniques such as symmetrization, similar to what is used in the analysis of VC dimension; see [3, 4].
The same covering argument extends to the common situation where F enters the analysis through a loss function l(·; f) (e.g., a squared loss), i.e., we want to bound $\sup_{f\in\mathcal{F}}\big|E[l(X; f)] - \frac1n\sum_{i=1}^n l(x_i; f)\big|$ by decomposing via a cover center f_C. As before, the first term can be controlled by union bounding over F_ε, and the remaining terms by ‖f − f_C‖_∞. However, there are two notable changes:
• For the first term, it is the range of l that matters in the concentration bounds, not that of f.
• When controlling the remaining terms, we typically need l to be Lipschitz in f, in the sense that |l(x; f) − l(x; f_C)| ≤ L‖f − f_C‖_∞ for all x.
To make sure the two terms are controlled by ε (which is often set to be equal to the concentration bound for the first term), we will need an (ε/L)-cover of F, so the Lipschitz constant L enters the covering number. Fortunately, it will typically only appear poly-logarithmically in the log-covering number, so as long as L is polynomial we are fine.
Another way to look at this is that we are essentially trying to translate the cover of F to a cover of another function class {l(· ; f) : f ∈ F}, and in doing so we incur a dependence on the Lipschitzness of l in f.
3 Algorithm
We are now ready to state the algorithm. Let (s_h^i, a_h^i, r_h^i) denote the state-action-reward tuple encountered at step h of the i-th episode. For each episode t = 1, 2, ..., T, the algorithm performs an optimistic version of LSVI (least-squares value iteration): for h = H, H − 1, ..., 1:
1. Define $\Lambda_h^t := I + \sum_{i=1}^{t-1}\phi_h^i(\phi_h^i)^\top$, where $\phi_h^i := \phi(s_h^i, a_h^i)$.
2. Let $\tilde Q_h^t$ be the point estimator of ridge regression:
$$\tilde Q_h^t(s, a) := \phi(s, a)^\top(\Lambda_h^t)^{-1}\sum_{i=1}^{t-1}\phi_h^i\big(r_h^i + \hat V_{h+1}^t(s_{h+1}^i)\big). \qquad (2)$$
3. Add a bonus to ensure optimism: $\hat Q_h^t(s, a) := \tilde Q_h^t(s, a) + \beta\sqrt{\phi(s, a)^\top(\Lambda_h^t)^{-1}\phi(s, a)}$.
We adopt the convention that all notions of value functions evaluate to 0 at h = H + 1, and define $\hat V_h^t(s) := \min\{H, \max_a \hat Q_h^t(s, a)\}$, which provides the regression targets above. β > 0 is a hyperparameter which will be set in the analysis. Once the computation is completed at h = 1, the algorithm collects a new episode of data $(s_1^t, a_1^t, r_1^t, \ldots, s_H^t, a_H^t, r_H^t)$ using π^t, the greedy policy w.r.t. $\{\hat Q_h^t\}_h$.
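For concreteness, below is a minimal, self-contained Python sketch of the episode loop on a toy tabular MDP encoded as a linear MDP via one-hot features φ(s, a) = e_{(s,a)} (so d = |S||A|); the MDP instance and the choice β = 1 are hypothetical placeholders rather than the values prescribed by the analysis:

import numpy as np

rng = np.random.default_rng(4)
S, A, H, T, beta = 4, 2, 5, 200, 1.0
d = S * A                                           # one-hot features: a valid linear MDP
phi = np.eye(d)                                     # phi[idx(s, a)] = e_{(s,a)}
idx = lambda s, a: s * A + a

# a random toy MDP (time-homogeneous for simplicity; rewards in [0, 1])
P = rng.dirichlet(np.ones(S), size=(S, A))          # P[s, a] is a distribution over next states
R = rng.uniform(size=(S, A))
d0 = np.ones(S) / S

# data[h] holds (feature index, reward, next state) tuples from past episodes
data = [[] for _ in range(H)]

for t in range(T):
    # --- optimistic LSVI backward pass ---
    Qhat = np.zeros((H + 1, S, A))                  # Qhat[H] stays 0 by convention
    Vhat = np.zeros((H + 1, S))
    for h in reversed(range(H)):
        Lam = np.eye(d)                             # Lambda_h^t = I + sum_i phi phi^T
        b = np.zeros(d)
        for (i, r, s_next) in data[h]:
            Lam += np.outer(phi[i], phi[i])
            b += phi[i] * (r + Vhat[h + 1, s_next])
        w = np.linalg.solve(Lam, b)                 # ridge point estimate
        Lam_inv = np.linalg.inv(Lam)
        for s in range(S):
            for a in range(A):
                x = phi[idx(s, a)]
                bonus = beta * np.sqrt(x @ Lam_inv @ x)
                Qhat[h, s, a] = x @ w + bonus
        Vhat[h] = np.minimum(H, Qhat[h].max(axis=1))    # clipped greedy value

    # --- roll out the greedy policy and store the episode ---
    s = rng.choice(S, p=d0)
    for h in range(H):
        a = int(np.argmax(Qhat[h, s]))
        r = float(rng.random() < R[s, a])           # Bernoulli reward with mean R[s, a]
        s_next = rng.choice(S, p=P[s, a])
        data[h].append((idx(s, a), r, s_next))
        s = s_next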
4 Regret analysis
4.1 Concentration bounds
We establish concentration bounds first, and then use them to bound the regret of the algorithm. A key step of the algorithm is to run ridge regression at each level h, where x_i corresponds to φ(s_h^i, a_h^i) and y_i corresponds to r_h^i + V(s_{h+1}^i) for V = V̂_{h+1}^t. While we would be able to directly apply the ridge regression guarantee from Section 2.2.1 if V were fixed a priori, the data-dependence of V̂_{h+1}^t violates this independence and invalidates the guarantee.
Remark: Bellman-completeness is insufficient for LSVI-UCB Before diving into how to handle the data-dependence of V̂_{h+1}^t, we emphasize that the above application of ridge regression crucially relies on the fact that the expected label, R_h(s, a) + P_h(s, a)^⊤V, is linear in φ regardless of the choice of V. Recall that in the FQI section we have seen the completeness assumption T f ∈ F ∀f ∈ F (or the T^π version), and in linear MDPs the linear class (in φ) satisfies both assumptions. However, the algorithm would not work if we merely assumed F is linear and Bellman-complete (instead of the MDP being a linear MDP), because it needs to back up functions outside the linear class: while Q̃ is linear, we add a (square root of a) quadratic bonus to it, followed by clipping, both of which make the Q-function we back up nonlinear. On the other hand, when only linear completeness is assumed, there are information-theoretic algorithms that can handle it, e.g., by showing that the problem has low "Q-type" Bellman rank [5, Sec 9.3.1].
To handle the data-dependence of V̂_{h+1}^t, we identify a function class V that is defined independently of the data and is guaranteed to contain V̂_{h+1}^t, and try to union bound over it. The class is
$$\mathcal{V} := \Big\{s \mapsto \min\Big\{H,\ \max_a\ \phi(s,a)^\top w + \beta\sqrt{\phi(s,a)^\top\Lambda^{-1}\phi(s,a)}\Big\} :\ \|w\| \le B,\ \sigma_{\min}(\Lambda) \ge 1\Big\},$$
where B is a bound on the relevant parameters whose proof we omit (cf. the boundedness assumptions in [1]). Apparently we cannot directly union bound over V since it is a continuous class, and instead we use a covering argument. The result is that we can obtain an α-cover of V of log-size Õ(d²) for any polynomially small α. We will only give a proof sketch here: the main idea is to create covers for "simple" classes (i.e., linear and quadratic), and then analyze the effects of the transformations applied on top of them.
1. First, create an α/2-cover of {φ^⊤w : ‖w‖ ≤ B} and an α²/4-cover of {φ^⊤Λ^{-1}φ : σ_min(Λ) ≥ 1}, with log cardinalities Õ(d) and Õ(d²), respectively. Note that the latter is effectively a linear class with features {φ_i φ_j}_{i,j}, and the linear coefficients are the entries of Λ^{-1}, which are bounded due to σ_min(Λ) ≥ 1.
2. Let {Λ_C} be the parameters of the cover centers of the quadratic class. We show that these parameters also provide an α/2-cover for the square-root class {√(φ^⊤Λ^{-1}φ) : σ_min(Λ) ≥ 1}, because for any Λ and its closest Λ_C,
$$\Big|\sqrt{\phi^\top\Lambda^{-1}\phi} - \sqrt{\phi^\top\Lambda_C^{-1}\phi}\Big| \le \sqrt{\big|\phi^\top\Lambda^{-1}\phi - \phi^\top\Lambda_C^{-1}\phi\big|} \le \alpha/2.$$
The first inequality follows from the fact that |√x − √y| ≤ √|x − y|, ∀x, y ≥ 0 (see the short derivation after this list).
3. The additive composition of the linear and the square-root class yields an α-cover.
4. min{H, ·} and max_a(·) are non-expansions in ‖·‖_∞ (recall |max_a f(s, a) − max_a f′(s, a)| ≤ max_a |f(s, a) − f′(s, a)|), so we have an α-cover of the final class, whose size is the product of the covering numbers for the linear and the quadratic classes. The log covering number is then dominated by that of the quadratic class, which is Õ(d²).
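For completeness, the square-root fact used in step 2 has a one-line derivation: assuming w.l.o.g. x ≥ y ≥ 0 (the case x = y = 0 being trivial),
$$\sqrt{x} - \sqrt{y} = \frac{x - y}{\sqrt{x} + \sqrt{y}} \le \frac{x - y}{\sqrt{x - y}} = \sqrt{x - y},$$
where the inequality uses $\sqrt{x} + \sqrt{y} \ge \sqrt{x} \ge \sqrt{x - y}$.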
Let V_α be the resulting α-cover of V. For any fixed t and h, we now apply Lemma 2 to any fixed V, with the following correspondence between the variables: x_i ↔ φ(s_h^i, a_h^i), y_i ↔ r_h^i + V(s_{h+1}^i), θ* ↔ the parameter θ_h^V satisfying R_h(s, a) + P_h(s, a)^⊤V = φ(s, a)^⊤θ_h^V, and V_max ↔ O(H).
Union bounding over all V = V_C ∈ V_α, we obtain that w.p. ≥ 1 − δ, ∀V_C ∈ V_α, ∀(s, a),
$$\big|\tilde Q_h^t(s, a; V_C) - R_h(s, a) - P_h(s, a)^\top V_C\big| \le \|\phi(s,a)\|_{(\Lambda_h^t)^{-1}}\cdot\tilde O\Big(H\sqrt{d + \log\tfrac1\delta}\Big).
3 See [1, Lemma B.2] for a bound that is tighter in t but incurs dependence on d.
Recall our goal is to bound the LHS of the above for all V ∈ V, and we use an argument similar to Section 2.3.3: for any V ∈ V with closest cover center V_C ∈ V_α, we decompose the error as
$$\big|\tilde Q_h^t(s,a; V) - R_h(s,a) - P_h(s,a)^\top V\big| \le \big|\tilde Q_h^t(s,a; V_C) - R_h(s,a) - P_h(s,a)^\top V_C\big| + \big|\tilde Q_h^t(s,a; V) - \tilde Q_h^t(s,a; V_C)\big| + \big|P_h(s,a)^\top(V - V_C)\big|.$$
Here the first term is handled by union bounding over V_α, and recall from Section 2.3.3 that it suffices to show that we can use α to control the remaining terms up to polynomial blow-up factors. The third term is easily bounded by α, so the key is to bound the second term: let ∆_i := V(s_{h+1}^i) − V_C(s_{h+1}^i); then
$$\big|\tilde Q_h^t(s,a; V) - \tilde Q_h^t(s,a; V_C)\big| = \Big|\phi(s,a)^\top(\Lambda_h^t)^{-1}\sum_{i=1}^{t-1}\phi_h^i\Delta_i\Big| \le \|\phi(s,a)\|\cdot\Big\|(\Lambda_h^t)^{-1}\sum_{i=1}^{t-1}\phi_h^i\Delta_i\Big\| \le \sum_{i=1}^{t-1}\|\phi_h^i\Delta_i\| \le (t-1)\alpha.$$
Having verified that the blow-up factors are polynomial, we have w.p. ≥ 1 − δ, for all t ∈ [T] and h ∈ [H] (the additional log(HT) dependence is suppressed by Õ(·)): defining the residual of ridge regression as
$$b_h^t(s, a) := \big|\tilde Q_h^t(s, a) - R_h(s, a) - P_h(s, a)^\top\hat V_{h+1}^t\big|, \qquad (3)$$
we have
$$b_h^t(s, a) \le \|\phi(s,a)\|_{(\Lambda_h^t)^{-1}}\cdot\tilde O\Big(H\sqrt{d + \log\tfrac1\delta}\Big) =: \beta\,\|\phi(s,a)\|_{(\Lambda_h^t)^{-1}}. \qquad (4)$$
Here we choose the hyperparameter β of the algorithm to be the coefficient $\tilde O\big(H\sqrt{d + \log\frac1\delta}\big)$ in the bound. Since Q̂_h^t is defined as Q̃_h^t + β‖φ‖_{(Λ_h^t)^{-1}}, a direct consequence is
$$\hat Q_h^t \ge R_h + P_h^\top\hat V_{h+1}^t. \qquad (5)$$
This gives us optimism: Q̂_h^t ≥ Q*_h and V̂_h^t ≥ V*_h for all t and h.
Proof. The base case h = H + 1 is trivial as all value functions are 0. Inductively, suppose V̂_{h+1}^t ≥ V*_{h+1} pointwise. Then for all (s, a),
$$\hat Q_h^t(s,a) \ge R_h(s,a) + P_h(s,a)^\top\hat V_{h+1}^t \ge R_h(s,a) + P_h(s,a)^\top V^\star_{h+1} = Q^\star_h(s,a)$$
by Eq.(5) and the induction hypothesis, and since Q*_h ≤ H we also have V̂_h^t(s) = min{H, max_a Q̂_h^t(s, a)} ≥ max_a Q*_h(s, a) = V*_h(s).
With this, we are finally ready to analyze the regret of the algorithm:
$$\begin{aligned}
\mathrm{Regret}_T &:= \sum_{t=1}^T J(\pi^\star) - J(\pi^t)\\
&\le \sum_{t=1}^T E_{s_1\sim d_0}\big[\hat Q_1^t(s_1, \pi^t)\big] - J(\pi^t) && \text{(Optimism)}\\
&= \sum_{t=1}^T\sum_{h=1}^H E_{d_h^{\pi^t}}\Big[\hat Q_h^t(s_h, a_h) - R_h(s_h, a_h) - P_h(s_h, a_h)^\top\big[\max_a \hat Q_{h+1}^t(\cdot, a)\big]\Big]\\
&\le \sum_{t=1}^T\sum_{h=1}^H E_{d_h^{\pi^t}}\Big[\hat Q_h^t(s_h, a_h) - R_h(s_h, a_h) - P_h(s_h, a_h)^\top\big[\min\{H, \max_a \hat Q_{h+1}^t(\cdot, a)\}\big]\Big]\\
&= \sum_{t=1}^T\sum_{h=1}^H E_{d_h^{\pi^t}}\Big[\tilde Q_h^t(s_h, a_h) + \beta\|\phi(s_h, a_h)\|_{(\Lambda_h^t)^{-1}} - R_h(s_h, a_h) - P_h(s_h, a_h)^\top\hat V_{h+1}^t\Big]\\
&\le \sum_{t=1}^T\sum_{h=1}^H E_{d_h^{\pi^t}}\Big[b_h^t(s_h, a_h) + \beta\|\phi(s_h, a_h)\|_{(\Lambda_h^t)^{-1}}\Big] && \text{(see Eq.(3) for the def.\ of } b_h^t\text{)}\\
&\le \sum_{h=1}^H\sum_{t=1}^T E_{d_h^{\pi^t}}\Big[2\beta\|\phi(s_h, a_h)\|_{(\Lambda_h^t)^{-1}}\Big]. && \text{(Eq.(4))}
\end{aligned}$$
The third line is essentially the finite-horizon variant of a telescoping lemma we have used multiple
times in previous lectures, where we translate the error of using an arbitrary Q to evaluate J(π) into
the Bellman error of Q along the occupancy of π.
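For completeness, a minimal sketch of that telescoping step (recall that π^t is greedy w.r.t. Q̂^t, so $\hat Q_{h+1}^t(s_{h+1}, a_{h+1}) = \max_a \hat Q_{h+1}^t(s_{h+1}, a)$ along its own trajectories, and $\hat Q_{H+1}^t \equiv 0$): adding and subtracting $E_{d_{h+1}^{\pi^t}}[\hat Q_{h+1}^t(s_{h+1}, a_{h+1})] = E_{d_h^{\pi^t}}[P_h(s_h, a_h)^\top \max_a \hat Q_{h+1}^t(\cdot, a)]$ at every step h gives
$$E_{s_1\sim d_0}\big[\hat Q_1^t(s_1, \pi^t)\big] - J(\pi^t) = \sum_{h=1}^H E_{d_h^{\pi^t}}\Big[\hat Q_h^t(s_h, a_h) - R_h(s_h, a_h) - P_h(s_h, a_h)^\top \max_a \hat Q_{h+1}^t(\cdot, a)\Big],$$
since the reward terms sum to J(π^t) and the Q̂ terms telescope.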
We are almost there: what we obtained above looks very similar to (an upper bound on) the cumulative regret of ridge regression (Eq.(1)). However, there is still a crucial gap: in Eq.(1), we can upper-bound the sum of ‖x_t‖_{Λ_t^{-1}} for those x_t that go into the definition of Λ_t. If we want to apply this bound, we need ‖φ(s_h^t, a_h^t)‖_{(Λ_h^t)^{-1}} to show up, where (s_h^t, a_h^t) is the actual state-action pair we run into in the t-th episode. Instead, what we have is an expectation over (s_h, a_h) ∼ d_h^{π^t}.
Fortunately, this mismatch can be addressed by noting that (1) (s_h^t, a_h^t) is actually sampled from d_h^{π^t}, (2) 2β‖φ(s_h, a_h)‖_{(Λ_h^t)^{-1}}, as a function S × A → R, is defined deterministically based on all the information before episode t, and (3) this function is bounded in [0, 2β]. So, we can directly apply Azuma's inequality to bridge the gap between (s_h, a_h) ∼ d_h^{π^t} and (s_h^t, a_h^t), and the remaining steps use Jensen's inequality and the elliptical potential lemma in the same way as Section 2.2.2:
$$\begin{aligned}
\mathrm{Regret}_T &\le \sum_{h=1}^H\sum_{t=1}^T E_{d_h^{\pi^t}}\Big[2\beta\|\phi(s_h, a_h)\|_{(\Lambda_h^t)^{-1}}\Big]\\
&\le \sum_{h=1}^H\sum_{t=1}^T 2\beta\|\phi(s_h^t, a_h^t)\|_{(\Lambda_h^t)^{-1}} + \tilde O\Big(\beta\sqrt{T\log\tfrac1\delta}\Big)\\
&\le \sum_{h=1}^H 2\beta\sqrt{T}\sqrt{\sum_{t=1}^T (\phi_h^t)^\top(\Lambda_h^t)^{-1}\phi_h^t} + \tilde O\Big(\beta\sqrt{T\log\tfrac1\delta}\Big) && \text{(Jensen's)}\\
&\le 2\beta H\sqrt{T}\sqrt{2d\log(T+1)} + \tilde O\Big(\beta\sqrt{T\log\tfrac1\delta}\Big) && \text{(elliptical potential lemma)}\\
&= \tilde O\Big(H^2\sqrt{d\big(d + \log\tfrac1\delta\big)}\cdot\sqrt{T}\Big). && \text{(plug in the def.\ of β from Eq.(4))}
\end{aligned}$$
References
[1] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement
learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143.
PMLR, 2020.
[2] Wen Sun. Learning in Linear MDPs: Upper Confidence Bound Value Iteration. https://wensun.
github.io/CS6789_data/linearMDP_note_complete.pdf.
[3] Sham Kakade and Ambuj Tewari. CMSC 35900 (Spring 2008) Learning Theory: Covering Numbers.
https://home.ttic.edu/~tewari/lectures/lecture14.pdf.
[4] Shivani Agarwal. E0 370 Statistical Learning Theory: Covering Numbers, Pseudo-Dimension, and Fat-
Shattering Dimension. http://www.shivani-agarwal.net/Teaching/E0370/Aug-2011/
Lectures/5.pdf.
[5] Alekh Agarwal, Nan Jiang, Sham Kakade, and Wen Sun. Reinforcement Learning: Theory and Algo-
rithms. https://rltheorybook.github.io/.