A Reduction From Reinforcement Learning To No-Regr
A Reduction From Reinforcement Learning To No-Regr
net/publication/337273591
CITATIONS READS
0 17
4 authors, including:
Ching-An Cheng
Georgia Institute of Technology
38 PUBLICATIONS 203 CITATIONS
SEE PROFILE
All content following this page was uploaded by Ching-An Cheng on 06 December 2019.
Ching-An Cheng Remi Tachet des Combes Byron Boots Geoff Gordon
Georgia Tech Microsoft Research Montreal UW Microsoft Research Montreal
arXiv:1911.05873v1 [cs.LG] 14 Nov 2019
literature, and the second part can be quantified in- initial state distribution,
dependently of the learning process. For example, one P∞
can accelerate learning by adopting optimistic online V π (s) := (1 − γ)Eξ∼ρπ (s) [ t=0 γ t r(st , at )] (1)
algorithms [17, 18] that account for the predictability
in COL, without worrying about function approxima- is the value function of π at state s, r : S × A → [0, 1]
tors. Because of these problem-agnostic features, the is the reward function, and ρπ (s) is the distribution
proposed reduction can be used to systematically de- of trajectory ξ = s0 , a0 , s1 , . . . generated by running
sign efficient RL algorithms with performance guaran- π from s0 = s in an MDP. We assume that the initial
tees. distribution p(s0 ), the transition P(s′ |s, a), and the
reward function r(s, a) in the MDP are unknown but
As a demonstration, we design an RL algorithm based
can be queried through a generative model, i.e. we can
on arguably the simplest online learning algorithm:
sample s0 from p, s′ from P, and r(s, a) for any s ∈ S
mirror descent. Assuming a generative model1 , we
and a ∈ A. We remark that the definition of V π in (1)
prove that, for any tabular Markov decision process
contains a (1 − γ) factor. We adopt this setup to make
(MDP), with probability at least 1 − δ, this algorithm
writing more compact. We denote the optimal policy
learns an ǫ-optimal policy for theγ-discounted accumu-
|S||A| log( δ1 ) as π ∗ and its value function as V ∗ for short.
lated reward, using at most Õ (1−γ)4 ǫ2 samples,
where |S|,|A| are the sizes of state and action spaces,
2.1 Duality in RL
and γ is the discount rate. Furthermore, thanks to
the separation property above, our algorithm admits a Our reduction is based on the linear-program (LP) for-
natural extension with linearly parameterized function mulation of RL. We provide a short recap here (please
approximators, whose sample and per-round computa- see Appendix A and [21] for details).
tion complexities are linear in the number of parame-
ters and independent of |S|,|A|, though at the cost of To show how maxπ V π (p) can be framed as a LP, let us
policy performance bias due to approximation error. define the
P∞average state distribution under π, dπ (s) :=
(1−γ) t=0 γ dt (s), where dπt is the state distribution
t π
This sample complexity improves the current best at time t visited by running π from p (e.g. dπ0 = p).
provable rate of the saddle-point RL setup [3–6] by a By construction, dπ satisfies the stationarity property,
|S|2
large factor of (1−γ) 2 , without making any assumption
on the MDP.2 This improvement is attributed to our dπ (s′ ) = (1 − γ)p(s′ ) + γEs∼dπ Ea∼π|s [P(s′ |s, a)]. (2)
new online-learning-style analysis that uses a cleverly
selected comparator in the regret definition. While it is With dπ , we can write V π (p) = Es∼dπ Ea∼π|s [r(s, a)]
possible to devise a minor modification of the previous and our objective maxπ V π (p) equivalently as:
stochastic mirror descent algorithms, e.g. [5], achieving
the same rate with our new analysis, we remark that maxµ∈R|S||A| :µ≥0 r⊤ µ
our algorithm is considerably simpler and removes a (3)
s.t. (1 − γ)p + γP⊤ µ = E⊤ µ
projection required in previous work [3–6].
Finally, we do note that the same sample complex- where r ∈ R|S||A| , p ∈ R|S| , and P ∈ R|S||A|×|S|
ity can also be achieved, e.g., by model-based RL and are vector forms of r, p, and P, respectively, and
(phased) Q-learning [19, 20]. However, these methods E = I ⊗ 1 ∈ R|S||A|×|S| (we use | · | to denote the cardi-
either have super-linear runtime, with no obvious route nality of a set, ⊗ the Kronecker product, I ∈ R|S|×|S|
for improvement, or could become unstable when using is the identity, and 1 ∈ R|A| the vector of ones). In (3),
function approximators without further assumption. S and A may seem to have finite cardinalities, but the
same formulation extends to countable or even con-
2 SETUP & PRELIMINARIES tinuous spaces (under proper regularity assumptions;
see [22]). We adopt this abuse of notation (empha-
Let S and A be state and action spaces, which can sized by bold-faced symbols) for compactness.
be discrete or continuous. We consider γ-discounted The variable µ of the LP in (3) resembles a joint dis-
infinite-horizon problems for γ ∈ [0, 1). Our goal is tribution dπ (s)π(a|s). To see this, notice that the
to find a policy π(a|s) that maximizes the discounted constraint in (3) is reminiscent of (2), and implies
average return V π (p) := Es∼p [V π (s)], where p is the kµk1 = 1, i.e. µ is a probability distribution. Then
1 one can show µ(s, a) = dπ (s)π(a|s) when the con-
In practice, it can be approximated by running a be-
havior policy with sufficient exploration [19]. straint is satisfied, which implies that (3) is the same
2
[5] has the same sample complexity but requires the as maxπ V π (p) and its solution µ∗ corresponds to
∗
MDP to be ergodic under any policy. µ∗ (s, a) = dπ (s)π ∗ (a|s) of the optimal policy π ∗ .
Ching-An Cheng, Remi Tachet des Combes, Byron Boots, Geoff Gordon
As (3) is a LP, it suggests looking at its dual, which and mirror-prox [24, 25], to efficiently solve the prob-
turns out to be the classic LP formulation of RL3 , lem. These methods only require using the generative
model to compute unbiased estimates of the gradients
minv∈R|S| p⊤ v ∇v L = bµ and ∇µ L = av , where we define
(4)
s.t. (1 − γ)r + γPv ≤ Ev.
1
bµ := p + 1−γ (γP − E)⊤ µ (11)
It can be verified that for all p > 0, the solution to (4)
satisfies the Bellman equation [23] and therefore is the as the balance function with respect to µ. bµ mea-
optimal value function v∗ (the vector form of V ∗ ). We sures whether µ violates the stationarity constraint
note that, for any policy π, V π by definition satisfies in (3) and can be viewed as the dual of av . When
a stationarity property the state or action space is too large, one can resort to
function approximators to represent v and µ, which
V π (s) = Ea∼π|s (1 − γ)r(s, a) + γEs′ ∼P|s,a [V π (s′ )]
are often realized by linear basis functions for the sake
(5) of analysis [10].
continuous, F (x, ·) is convex, and F (x, x) ≥ 0.4 The Methods to achieve these two steps individually are not
problem EP(X , F ) aims to find x⋆ ∈ X s.t. new. The reduction from convex-concave problems to
no-regret online learning is well known [30]. Likewise,
F (x⋆ , x) ≥ 0, ∀x ∈ X . (13) the relationship between the approximate solution of
(8) and policy performance is also available; this is how
By its definition, a natural residual function to quan- the saddle-point formulation [5] works in the first place.
tify the quality of an approximation solution x to EP So couldn’t we just use these existing approaches? We
is rep (x) := − minx′ ∈X F (x, x′ ) which describes the de- argue that purely combining these two techniques fails
gree to which (13) is violated at x. We say a bifunction to fully capture important structure that resides in RL.
F is monotone if, ∀x, x′ ∈ X , F (x, x′ ) + F (x′ , x) ≤ 0, While this will be made precise in the later analyses,
and skew-symmetric if the equality holds. we highlight the main insights here.
EPs with monotone bifunctions represent general con- Instead of treating (8) as an adversarial two-player on-
vex problems, including convex optimization prob- line learning problem [30], we adopt the recent reduc-
lems, saddle-point problems, variational inequali- tion to COL [14] reviewed in Section 2.3. The main dif-
ties, etc. For instance, a convex-concave problem ference is that the COL approach takes a single-player
miny∈Y maxz∈Z φ(y, z) can be cast as EP(X , F ) with setup and retains the Lipschitz continuity in the source
X = Y × Z and the skew-symmetric bifunction [29] saddle-point problem. This single-player perspective
is in some sense cleaner and, as we will show in Sec-
F (x, x′ ) := φ(y ′ , z) − φ(y, z ′ ), (14) tion 4.2, provides a simple setup to analyze effects of
function approximators. Additionally, due to continu-
where x = (y, z) and x′ = (y ′ , z ′ ). In this case, ity, the losses in COL are predictable and therefore
rep (x) = maxz′ ∈Z φ(y, z ′ ) − miny′ ∈Y φ(y ′ , z) is the du- make designing fast algorithms possible.
ality gap.
With the help of the COL reformulation, we study the
Cheng et al. [14] show that a learner achieves sublinear relationship between the approximate solution to (8)
dynamic regret in COL if and only if the same algo- and the performance of the associated policy in RL.
rithm can solve EP(X , F ) with F (x, x′ ) = fx (x′ ) − We are able to establish a tight bound between the
fx (x). Concretely, they show that, given a mono- residual and the performance gap, resulting in a large
tone EP(X , F ) with F (x, x) = 0 (which is satisfied |S|2
improvement of (1−γ) 2 in sample complexity compared
by (14)), one can construct a COL problem by setting with the best bounds in the literature of the saddle-
fx′ (x) := F (x′ , x), i.e. ln (x) = F (xn , x), such that point setup, without adding extra constraints on X
any no-regret algorithm can generate an approximate and assumptions on the MDP. Overall, this means that
solution to the EP. stronger sample complexity guarantees can be attained
Proposition 1. [14] If F is skew-symmetric and by simpler algorithms, as we demonstrate in Section 5.
ln (x) = F (xn , x), then rep (x̂N ) ≤ N1 RegretN ,
The missing proofs of this section are in Appendix B.
where RegretN = maxx∈X RegretN (x), and x̂N =
1
PN
N n=1 xn ; the same guarantee holds also for the best 3.1 The COL Formulation of RL
decision in {xn }N n=1 .
First, let us exercise the above COL idea with the
3 AN ONLINE LEARNING VIEW saddle-point formulation of RL in (8). To construct
the EP, we can let X = {x = (v, µ) : v ∈ V, µ ∈ M},
We present an alternate online-learning perspective which is compact. According to (14), the bifunction F
on the saddle-point formulation in (8). This analy- of the associated EP(X , F ) is naturally given as
sis paves a way for of our reduction in the next section. F (x, x′ ) := L(v′ , µ) − L(v, µ′ )
By reduction, we mean realizing the two steps below:
= p⊤ v′ + µ⊤ av′ − p⊤ v − µ′⊤ av (15)
∗ ∗ ∗
1. Define a sequence of online losses such that any which is skew-symmetric, and x := (v , µ ) is a solu-
algorithm with sublinear regret can produce an tion to EP(X , F ). This identification gives us a COL
approximate solution to the saddle-point problem. problem with the loss in the nth round defined as
ln (x) := p⊤ v + µ⊤ ⊤ ⊤
n av − p vn − µ avn (16)
2. Convert the approximate solution in the first step
to an approximately optimal policy in RL. where xn = (vn , µn ). We see ln is a linear loss. More-
over, because of the continuity in L, it is predictable,
4
We restrict ourselves to this convex and continuous i.e. ln can be (partially) inferred from past feedback
case as it is sufficient for our problem setup. as the MDP involved in each round is the same.
Ching-An Cheng, Remi Tachet des Combes, Byron Boots, Geoff Gordon
3.2 Policy Performance and Residual realize, e.g., when p is unknown. Without it, extra as-
sumptions (like ergodicity [5]) on the MDP are needed.
By Proposition 1, any no-regret algorithm, when ap-
plied to (16), provides guarantees in terms of the resid- However, Proposition 2 is undesirable for a number of
ual function rep (x) of the EP. But this is not the end of reasons. First, the bound is quite conservative, as it
the story. We also need to relate the learner decision concerns the uniform error kv∗ − vπµ k∞ whereas the
x ∈ X to a policy π in RL and then convert bounds on objective in RL is about the gap V ∗ (p) − V πµ (p) =
rep (x) back to the policy performance V π (p). Here we p⊤ (v∗ − vπµ ) with respect to the initial distribution
follow the common rule in the literature and associate p (i.e. a weighted error). Second, the constant term
each x = (v, µ) ∈ X with a policy πµ defined as (1 − γ) mins p(s) can be quite small (e.g. when p is
uniform, it is 1−γ
|S| ) which can significantly amplify the
πµ (a|s) ∝ µ(s, a). (17) error in the residual. Because a no-regret algorithm
typically decreases the residual in O(N −1/2 ) after see-
In the following, we relate the residual rep (x) to the ing N samples, the factor of 1−γ |S| earlier would turn
performance gap V ∗ (p) − V πµ (p) through a relative |S|2
′
particular, it implies V π (p) − V π (p) = (µπ )⊤ avπ′ . both the average policy and the best policy in XN con-
verge to the optimal policy in performance with a rate
From Lemmas 1 and 2, we see that the difference be- O(RegretN (yN ∗
)/N ). Compared with existing results
tween the residual rep (x; x∗ ) = −µ⊤ av∗ and the per- obtained through Proposition 2, the above result re-
∗
formance gap V πµ (p) − V π (p) = (µπµ )⊤ av∗ is due to moves the factor (1 − γ) mins p(s) and impose no as-
the mismatch between µ and µπµ , or more specifically, sumption on XN or the MDP. Indeed Theorem 1 holds
the mismatch between the two marginals d = E⊤ µ for any sequence. For example, when XN is generated
and dπµ = E⊤ µπµ . Indeed, when d = dπµ , the by stochastic feedback of ln , Theorem 1 continues to
residual is equal to the performance gap. However, hold, as the regret is defined in terms of ln , not of the
in general, we do not have control over that difference sampled loss. Stochasticity only affects the regret rate.
for the sequence of variables {xn = (vn , µn ) ∈ X }
an algorithm generates. The sufficient condition in In other words, we have shown that when µ and v
Proposition 2 attempts to mitigate the difference, us- can be directly parameterized, an approximately op-
ing the fact dπµ = (1 − γ)p + γP⊤ πµ
from (2), timal policy for the RL problem can be obtained by
πµ d
where Pπµ is the transition matrix under πµ . But running any no-regret online learning algorithm, and
the missing half γP⊤ πµ
(due to the long-term ef- that the policy quality is simply dictated by the re-
πµ d
fects in the MDP) introduces the unavoidable, weak gret rate. To illustrate, in Section 5 we will prove
constant (1 − γ) mins p(s), if we want to have an uni- that simply running mirror descent in this COL pro-
form bound on kv∗ − vπµ k∞ . The counterexample in duces an RL algorithm that is as sample efficient as
Proposition 3 was designed to maximize the effect of other common RL techniques. One can further foresee
covariate shift, so that µ fails to captures state-action that algorithms leveraging the continuity in COL—e.g.
pairs with high advantage. To break the curse, we mirror-prox [25] or PicCoLO [18]—and variance reduc-
must properly weight the gap between v∗ and vπµ in- tion can lead to more sample efficient RL algorithms.
stead of relying on the uniform bound on kv∗ − vπµ k∞ Below we will also demonstrate how to use the fact
as before. that COL is single-player (see Section 2.3) to cleanly
incorporate the effects of using function approxima-
4 THE REDUCTION tors to model µ and v. We will present a corollary
of Theorem 1, which separates the problem of learn-
The analyses above reveal both good and bad proper- ing µ and v, and that of approximating M and V
ties of the saddle-point setup in (8). On the one hand, with function approximators. The first part is con-
we showed that approximate solutions to the saddle- trolled by the rate of regret in online learning, and
point problem in (8) can be obtained by running any the second part depends on only the chosen class of
no-regret algorithm in the single-player COL problem function approximators, independently of the learning
defined in (16); many efficient algorithms are available process. As these properties are agnostic to problem
from the online learning literature. On the other hand, setups and algorithms, our reduction leads to a frame-
we also discovered a root difficulty in converting an work for systematic synthesis of new RL algorithms
approximate solution of (8) to an approximately opti- with performance guarantees. The missing proofs of
mal policy in RL (Proposition 2), even after imposing this section are in Appendix C.
strong conditions like (19). At this point, one may won-
der if the formulation based on (8) is fundamentally 4.1 Proof of Theorem 1
sample inefficient compared with other approaches to
RL, but this is actually not true. The main insight of our reduction is to adopt, in defin-
Our main contribution shows that learning a policy ing rep (x; x′ ), a comparator x′ ∈ X based on the out-
through running a no-regret algorithm in the COL put of the algorithm (represented by x), instead of the
problem in (16) is, in fact, as sample efficient in pol- fixed comparator x∗ (the optimal pair of value func-
icy performance as other RL techniques, even without tion and state-action distribution) that has been used
the common constraint in (19) or extra assumptions conventionally, e.g. in Proposition 2. While this idea
on the MDP like ergodicity imposed in the literature. seems unnatural from the standard saddle-point or EP
perspective, it is possible, because the regret in online
learning is measured against the worst-case choice in
Theorem 1. Let XN = {xn ∈ X }N n=1 be any sequence. X , which is allowed to be selected in hindsight. Specif-
Let π̂N be the policy given by x̂N via (17), which is
ically, we propose to select the following comparator
either the average or the best decision in XN . Define
∗ :=
∗
RegretN (yN ) to directly bound V ∗ (p) − V π̂N (p) instead of the con-
yN (vπ̂N , µ∗ ). Then V π̂N (p) ≥ V ∗ (p) − N . servative measure kV ∗ − V π̂N k∞ used before.
Theorem 1 shows that if XN has sublinear regret, then Proposition 4. For x = (v, µ) ∈ X , define yx∗ :=
Ching-An Cheng, Remi Tachet des Combes, Byron Boots, Geoff Gordon
(vπµ , µ∗ ) ∈ X . It holds rep (x; yx∗ ) = V ∗ (p) − V πµ (p). Algorithm 1 Mirror descent for RL
PN Input: ǫ optimality of the γ-average return
To finish the proof, let x̂N be either N1 n=1 xn or δ maximal failure probability
arg minx∈XN rep (x; yx∗ ), and let π̂N denote the policy generative model of an MDP
∗
given by (17). First, V ∗ (p) − V π̂N (p) = rep (x̂N ; yN ) Output: π̂N = π µ̂N
1: x1 = (v1 , µ1 ) where µ1 is uniform and v1 ∈ V
by Proposition 4. Next we follow the proof idea of 1)
|S||A| log( δ
Proposition 1 in [14]: because F is skew-symmetric 2: Set N = Ω̃( (1−γ)2 ǫ2
) and η = (1 − γ)(|S||A|N )−1/2
∗
and F (yN , ·) is convex, we have by (18) 3: Set the Bregman divergence as (22)
∗ π̂N ∗ ∗ 4: for n = 1 . . . N − 1 do
V (p) − V (p) = rep (x̂N ; yN ) = −F (x̂N , yN ) 5: Sample gn according to (24)
N 6: Update to xn+1 according to (21)
∗
, x̂N ) ≤ N1 n=1 F (yN∗
P
= F (yN , xn ) 7: end for PN
1 N ∗ 1 ∗ 8: Set (v̂N , µ̂N ) = x̂N = N1 n=1 xn
P
= N n=1 −F (xn , yN ) = N RegretN (yN ).
where η > 0 is the step size, gn is the feedback direc- Theorem 2. With probability 1−δ, Algorithm 1 learns
|S||A| log( 1δ )
tion, and BR (x||x′ ) = R(x) − R(x′ ) − h∇R(x′ ), x − x′ i an ǫ-optimal policy with Õ samples.
(1−γ)2 ǫ2
is the Bregman divergence generated by a strictly con-
vex function R. Based on the geometry of X = V × M, Note that the above statement makes no assumption
we consider a natural Bregman divergence of the form on the MDP (except the tabular setup for simplifying
1
analysis). Also, because the definition of value func-
BR (x′ ||x) = 2|S| kv
′
− vk22 + KL(µ′ ||µ) (22) tion in (1) is scaled by a factor (1−γ), the above result
|S||A| log( 1 )
translates into a sample complexity in Õ (1−γ)4 ǫ2
δ
This choice mitigates the effects of dimension (e.g. if
we set x1 = (v1 , µ1 ) with µ1 being the uniform distri- for the conventional discounted accumulated rewards.
bution, it holds BR (x′ ||x1 ) = Õ(1) for any x′ ∈ X ).
5.1 Proof Sketch of Theorem 2
To define the feedback direction gn , we slightly modify
the per-round loss ln in (16) and consider a new loss The proof is based on the basic property of mirror
descent and martingale concentration. We provide a
hn (x) := b⊤ ⊤ 1 sketch here; please refer to Appendix D for details. Let
µn v + µ ( 1−γ 1 − avn ) (23) ∗
yN = (vπ̂N , µ∗ ). We bound the regret in Theorem 1 by
the following rearrangement, where the first equality
that shifts ln by a constant, where 1 is the vector of below is because hn is a constant shift from ln .
ones. One can verify that ln (x) − ln (x′ ) = hn (x) −
hn (x′ ), for all x, x′ ∈ X when µ, µ′ in x and x′ satisfy ∗
N
X N
X ∗
kµk1 = kµ′ k1 (which holds for Algorithm 1). There- RegretN (yN )= hn (xn ) − hn (yN )
n=1 n=1
fore, using hn does not change regret. The reason for
N N
! !
using hn instead of ln is to make ∇µ hn ((v, µ)) (and ≤
X
(∇hn (xn ) − gn ) xn ⊤
+ max
X
gn⊤ (xn − x)
its unbiased approximation) a positive vector, so the n=1
x∈X
n=1
regret bound can have a better dimension dependency. N
X
!
This is a common trick used in online learning (e.g. + (gn − ∇hn (xn ))⊤ yN
∗
(µ here).
We recognize the first term is a martingale, because xn
We set the first-order feedback gn as an unbiased sam- does not depend on gn . Therefore, we can appeal to a
pled estimate of ∇hn (xn ). In round n, this is realized Bernstein-type
√ martingale concentration and prove it
by two independent calls of the generative model: N |S||A| log( 1 )
is in Õ( 1−γ
δ
). For the second term, by treat-
⊤
"
1
# ing gn x as the per-round loss, we can use standard
p̃n + 1−γ (γ P̃n − En )⊤ µ̃n
gn = 1 1 (24) regret√analysis of mirror descent and show a bound
|S||A|( 1−γ 1̂n − r̂n − 1−γ (γ P̂n − Ên )vn ) N |S||A|
in Õ( 1−γ ). For the third term, because vπ̂N in
∗
Let gn = [gn,v ; gn,µ ]. For gn,v , we sample p, sam- yN = (vπ̂N , µ∗ ) depends on {gn }N
n=1 , it is not a mar-
ple µn to get a state-action pair, and query the tran- tingale. Nonetheless, we are able to handle it through
sition P at the state-action pair sampled from µn . a union
√ bound and show it is again no more than
N |S||A| log( 1 )
(p̃n , P̃n , and µ̃n denote the single-sample estimate Õ( 1−γ
δ
). Despite the union bound, it does
of these probabilities.) For gn,µ , we first sample uni- not increase the rate because we only need to handle
formly a state-action pair (which explains the factor vπ̂N , not µ∗ which induces a martingale. To finish the
|S||A|), and then query the reward r and the tran- proof, we substitute this high-probability regret bound
sition P. (1̂n , r̂n , P̂n , and Ên denote the single- into Theorem 1 to obtain the desired claim.
sample estimates.) To emphasize, we use ˜· and ˆ· to
distinguish the empirical quantities obtained by these 5.2 Extension to Function Approximators
two independent queries. By construction, we have
gn,µ ≥ 0. It is clear that this direction gn is unbi- The above algorithm assumes the tabular setup for il-
ased, i.e. E[gn ] = ∇hn (xn ). Moreover, it is extremely lustration purposes. In Appendix E, we describe a
sparse and can be computed using O(1) sample, com- direct extension of Algorithm 1 that uses linearly pa-
putational, and memory complexities. rameterized function approximators of the form xθ =
(Φθv , Ψθµ ), where columns of bases Φ, Ψ belong to V
Below we show this algorithm, despite being extremely
and M, respectively, and (θv , θµ ) ∈ Θ.
simple, has strong theoretical guarantees. In other
words, we obtain simpler versions of the algorithms Overall the algorithm stays the same, except the gra-
proposed in [3, 5, 10] but with improved performance. dient is computed by chain-rule, which can be done in
O(dim(Θ)) time and space. While this seems worse,
Ching-An Cheng, Remi Tachet des Combes, Byron Boots, Geoff Gordon
the computational complexity per update actually im- Markov renewal programs. SIAM Journal on Ap-
proves to O(dim(Θ)) from the slow O(|S||A|) (re- plied Mathematics, 16(3):468–487, 1968.
quired before for the projection in (21)), as now we [3] Mengdi Wang and Yichen Chen. An online primal-
only optimize in Θ. Moreover, we prove that its sam- dual method for discounted Markov decision pro-
ple complexity is also better, though at the cost of bias cesses. In Conference on Decision and Control,
ǫΘ,N in Corollary 1. Therefore, the algorithm becomes pages 4516–4521. IEEE, 2016.
applicable to large-scale or continuous problems.
[4] Yichen Chen and Mengdi Wang. Stochas-
Theorem 3. Under a proper choice of Θ and BR , with tic primal-dual methods and sample complex-
probability 1 − δ, Algorithm 1 learns
an (ǫ + ǫΘ,N )- ity of reinforcement learning. arXiv preprint
dim(Θ) log( 1δ )
optimal policy with Õ (1−γ)2 ǫ2 samples. arXiv:1612.02516, 2016.
[5] Mengdi Wang. Randomized linear programming
The proof is in Appendix E, and mainly follows Sec- solves the discounted Markov decision problem in
tion 5.1. First, we choose some Θ to satisfy (20) so nearly-linear running time. ArXiv e-prints, 2017.
we can use Corollary 1 to reduce the problem into re- [6] Donghwan Lee and Niao He. Stochas-
gret minimization. To make the sample complexity tic primal-dual Q-learning. arXiv preprint
independent of |S|,|A|, the key is to uniformly sample arXiv:1810.08298, 2018.
over the columns of Ψ (instead of over all states and
actions like (24)) when computing unbiased estimates [7] Mengdi Wang. Primal-dual π learning: Sam-
of ∇θµ hn ((θv , θµ )). The intuition is that we should ple complexity and sublinear run time for er-
only focus on the places our basis functions care about godic Markov decision problems. arXiv preprint
(of size dim(Θ)), instead of wasting efforts to visit all arXiv:1710.06100, 2017.
possible combinations (of size |S||A|). [8] Qihang Lin, Selvaprabu Nadarajah, and Negar So-
heili. Revisiting approximate linear programming
using a saddle point based reformulation and root
6 CONCLUSION finding solution approach. Technical report, work-
ing paper, U. of Il. at Chicago and U. of Iowa,
We propose a reduction from RL to no-regret online 2017.
learning that provides a systematic way to design new
[9] Bo Dai, Albert Shaw, Niao He, Lihong Li, and
RL algorithms with performance guarantees. Com-
Le Song. Boosting the actor with dual critic. In
pared with existing approaches, our framework makes
International Conference on Learning Represen-
no assumption on the MDP and naturally works with
tation, 2018.
function approximators. To illustrate, we design a sim-
ple RL algorithm based on mirror descent; it achieves [10] Yichen Chen, Lihong Li, and Mengdi Wang. Scal-
similar sample complexity as other RL techniques, but able bilinear π learning using state and action fea-
uses minimal assumptions on the MDP and is scal- tures. arXiv preprint arXiv:1804.10328, 2018.
able to large or continuous problems. This encourag- [11] Chandrashekar Lakshminarayanan, Shalabh
ing result evidences the strength of the online learning Bhatnagar, and Csaba Szepesvári. A linearly
perspective. As a future work, we believe even faster relaxed approximate linear program for Markov
learning in RL is possible by leveraging control variate decision processes. IEEE Transactions on
for variance reduction and by applying more advanced Automatic Control, 63(4):1185–1191, 2018.
online techniques [17, 18] that exploit the continuity [12] Geoffrey J Gordon. Regret bounds for prediction
in COL to predict the future gradients. problems. In Conference on Learning Theory, vol-
ume 99, pages 29–40, 1999.
Acknowledgements
[13] Martin Zinkevich. Online convex programming
This research is partially supported by NVIDIA Grad- and generalized infinitesimal gradient ascent. In
uate Fellowship. International Conference on Machine Learning,
pages 928–936, 2003.
References [14] Ching-An Cheng, Jonathan Lee, Ken Goldberg,
and Byron Boots. Online learning with contin-
[1] Alan S Manne et al. Linear programming and uous variations: Dynamic regret and reductions.
sequential decision models. Technical report, arXiv preprint arXiv:1902.07286, 2019.
Cowles Foundation for Research in Economics,
[15] Eugen Blum. From optimization and variational
Yale University, 1959.
inequalities to equilibrium problems. Math. stu-
[2] Eric V Denardo and Bennett L Fox. Multichain dent, 63:123–145, 1994.
A Reduction from Reinforcement Learning to Online Learning
[16] M Bianchi and S Schaible. Generalized monotone [30] Jacob Abernethy, Peter L Bartlett, and Elad
bifunctions and equilibrium problems. Journal of Hazan. Blackwell approachability and no-regret
Optimization Theory and Applications, 90(1):31– learning are equivalent. In Annual Conference on
43, 1996. Learning Theory, pages 27–46, 2011.
[17] Alexander Rakhlin and Karthik Sridharan. On- [31] Andrew Y Ng, Daishi Harada, and Stuart Rus-
line learning with predictable sequences. arXiv sell. Policy invariance under reward transforma-
preprint arXiv:1208.3728, 2012. tions: Theory and application to reward shaping.
[18] Ching-An Cheng, Xinyan Yan, Nathan Ratliff, In International Conference on Machine Learning,
volume 99, pages 278–287, 1999.
and Byron Boots. Predictor-corrector policy op-
timization. In International Conference on Ma- [32] Sham Kakade and John Langford. Approximately
chine Learning, 2019. optimal approximate reinforcement learning. In
International Conference on Machine Learning,
[19] Michael J Kearns and Satinder P Singh. Finite-
volume 2, pages 267–274, 2002.
sample convergence rates for q-learning and indi-
rect algorithms. In Advances in neural informa- [33] Stéphane Ross, Geoffrey Gordon, and Drew Bag-
tion processing systems, pages 996–1002, 1999. nell. A reduction of imitation learning and struc-
tured prediction to no-regret online learning. In
[20] Sham Machandranath Kakade et al. On the sam-
International Conference on Artificial Intelligence
ple complexity of reinforcement learning. PhD the-
and Statistics, pages 627–635, 2011.
sis, University of London London, England, 2003.
[34] Ching-An Cheng and Byron Boots. Convergence
[21] Martin L Puterman. Markov decision processes:
of value aggregation for imitation learning. Inter-
discrete stochastic dynamic programming. John
national Conference on Artificial Intelligence and
Wiley & Sons, 2014.
Statistics, 2018.
[22] Onésimo Hernández-Lerma and Jean B Lasserre.
[35] Ching-An Cheng, Xinyan Yan, Evangelos A
Discrete-time Markov control processes: basic op- Theodorou, and Byron Boots. Accelerating imi-
timality criteria, volume 30. Springer Science & tation learning with predictive models. In Inter-
Business Media, 2012.
national Conference onArtificial Intelligence and
[23] Richard Bellman. The theory of dynamic pro- Statistics, 2019.
gramming. Bulletin of the American Mathemati- [36] Elad Hazan. Introduction to online convex
cal Society, 60(6):503–515, 1954. optimization. Foundations and Trends in
[24] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Optimization, 2(3-4):157–325, 2016. URL
Lan, and Alexander Shapiro. Robust stochastic http://dblp.uni-trier.de/db/journals/ftopt/ftopt2.h
approximation approach to stochastic program- [37] Colin McDiarmid. Concentration. In Probabilis-
ming. SIAM Journal on optimization, 19(4):1574– tic methods for algorithmic discrete mathematics,
1609, 2009. pages 195–248. Springer, 1998.
[25] Anatoli Juditsky, Arkadi Nemirovski, and Claire
Tauvel. Solving variational inequalities with
stochastic mirror-prox algorithm. Stochastic Sys-
tems, 1(1):17–58, 2011.
[26] Nicolo Cesa-Bianchi and Gabor Lugosi. Predic-
tion, learning, and games. Cambridge university
press, 2006.
[27] Shai Shalev-Shwartz et al. Online learning
and online convex optimization. Foundations
and Trends R in Machine Learning, 4(2):107–194,
2012.
[28] Elad Hazan et al. Introduction to online convex
optimization. Foundations and Trends R in Opti-
mization, 2(3-4):157–325, 2016.
[29] Alejandro Jofré and Roger J-B Wets. Variational
convergence of bifunctions: motivating applica-
tions. SIAM Journal on Optimization, 24(4):
1952–1979, 2014.
Ching-An Cheng, Remi Tachet des Combes, Byron Boots, Geoff Gordon
Appendix
A Review of RL Setups
We provide an extended review of different formulations of RL for interested readers. First, let us recall the
problem setup. Let S and A be state and action spaces, and let π(a|s) denote a policy. For γ ∈ [0, 1), we are
interested in solving a γ-discounted infinite-horizon RL problem:
P∞
maxπ V π (p), s.t. V π (p) := (1 − γ)Es0 ∼p Eξ∼ρπ (s0 ) [ t=0 γ t r(st , at )] (25)
where V π (p) is the discounted average return, r : S × A → [0, 1] is the reward function, ρπ (s0 ) denotes the
distribution of trajectory ξ = s0 , a0 , s1 , . . . generated by running π from state s0 in a Markov decision process
(MDP), and p is a fixed but unknown initial state distribution.
RL in terms of stationary state distribution Let dπt (s) denote the state distribution at time t given by
running π starting from p. We define its γ-weighted mixture as
dπ (s) := (1 − γ) ∞ t π
P
t=0 γ dt (s) (26)
We can view dπ in (26) as a form of stationary state distribution of π, because it is a valid probability distribution
of state and satisfies the stationarity property below,
where P(s′ |s, a) is the transition probability of the MDP. The definition in (26) generalizes the concept of
stationary distribution of MDP; as γ → 1, dπ is known as the limiting average state distribution, which is the
same as the stationary distribution of the MDP under π, if one exists. Moreover, with the property in (2), dπ
summarizes the Markov structure of RL, and allows us to write (25) simply as
after commuting the order of expectation and summation. That is, an RL problem aims to maximize the expected
reward under the stationary state-action distribution generated by the policy π.
RL in terms of value function We can also write (25) in terms of value function. Recall
which can be viewed as a dual equivalent of (2). Because r is in [0, 1], (5) implies V π lies in [0, 1].
The value function V ∗ (a shorthand of Vπ∗ ) of the optimal policy π ∗ of the RL problem satisfies the so-called
Bellman equation [23]: V ∗ (s) = maxa∈A (1 − γ)r(s, a) + γEs′ ∼P|s,a [V ∗ (s′ )], where the optimal policy π ∗ can be
recovered as the arg max. Equivalently, by the definition of max, the Bellman equation amounts to finding the
smallest V such that V (s) ≥ (1 − γ)r(s, a) + γEs′ ∼P|s,a [V (s′ )], ∀s ∈ S, a ∈ A. In other words, the RL problem
in (25) can be written as
min Es∼p [V (s)] s.t. V (s) ≥ (1 − γ)r(s, a) + γEs′ ∼P|s,a [V (s′ )] , ∀s ∈ S, a ∈ A (28)
V
We now connect the above two alternate expressions through the classical LP setup of RL [1, 2].
A Reduction from Reinforcement Learning to Online Learning
where p ∈ R|S| , v ∈ R|S| , and r ∈ R|S||A| are the vector forms of p, V , r, respectively, P ∈ R|S||A|×|S| is the
transition probability6 , and E = I ⊗ 1 ∈ R|S||A|×|S| (we use | · | to denote the cardinality of a set, ⊗ the Kronecker
product, I ∈ R|S|×|S| is the identity, and 1 ∈ R|A| a vector of ones). It is easy to verify that for all p > 0, the
solution to (4) is the same and equal to v∗ (the vector form of V ∗ ).
where f ≥ 0 ∈ R|S||A| is the Lagrangian multiplier. By Lagrangian duality, the dual problem of (4) is given as
maxf ≥0 minv L(v, f ). Or after substituting the optimality condition of v and define µ := (1 − γ)f , we can write
the dual problem as another LP problem
Note that this problem like (4) is normalized: we have kµk1 = 1 because kpk1 = 1, and
where we use the facts that µ ≥ 0 and P is a stochastic transition matrix. This means that µ is a valid state-
action distribution, from which we see that the equality constraint in (3) is simply a vector form (2). Therefore,
(3) is the same as (27) if we define the policy π as the conditional distribution based on µ.
Lemma 1. For any x = (v, µ), if x′ ∈ X satisfies (2) and (5) (i.e. v′ and µ′ are the value function and
state-action distribution of policy πµ′ ), rep (x; x′ ) = −µ⊤ av′ .
Proof. First note that F (x, x) = 0. Then as x′ satisfies stationarity, we can use Lemma 2 below and write
Lemma 2. Let vπ and µπ denote the value and state-action distribution of some policy π. Then for any function
′
v′ , it holds that p⊤ (vπ − v′ ) = (µπ )⊤ av′ . In particular, it implies V π (p) − V π (p) = (µπ )⊤ avπ′ .
Proof. This is the well-known performance difference lemma. The proof is based on the stationary properties in
(2) and (5), which can be stated in vector form as
Proposition 2. For any x = (v, µ) ∈ X , if E⊤ µ ≥ (1 − γ)p, rep (x; x∗ ) ≥ (1 − γ) mins p(s)kv∗ − vπµ k∞ .
Proof. This proof mainly follows the steps in [5] but written in our notation. First Lemma 1 shows rep (x; x∗ ) =
−µ⊤ av∗ . We then lower bound −µ⊤ av∗ by reversing the proof of the performance difference lemma (Lemma 2).
1
µ⊤ av∗ = µ⊤ ((1 − γ)r − (E − γP)v∗ ) (∵ Definition of av∗ )
1−γ
1
= µ⊤ ((E − γP)vπµ − (E − γP)v∗ ) (∵ Stationarity of vπµ )
1−γ
1
= µ⊤ (E − γP)(vπµ − v∗ )
1−γ
1
= d⊤ (I − γPπµ )(vπµ − v∗ )
1−γ
which together imply that (I − γPπµ )(vπµ − v∗ ) ≤ 0. We will also use the stationarity of dπµ (the average state
distribution of πµ ): dπµ = (1 − γ)p + γP⊤ πµ
πµ d .
1
µ⊤ av∗ = d⊤ (I − γPπµ )(vπµ − v∗ )
1−γ
≤ p⊤ (I − γPπµ )(vπµ − v∗ )
≤ − min p(s)k(I − γPπµ )(vπµ − v∗ )k∞
s
≤ − min p(s)(1 − γ)kvπµ − v∗ k∞ .
s
Proposition 3. There is a class of MDPs such that, for some x ∈ X , Proposition 2 is an equality.
Proof. We show this equality holds for a class of MDPs. For simplicity, let us first consider an MDP with
three states 1, 2, 3 and for each state there are three actions (lef t, right, stay). They correspond to intuitive,
deterministic transition dynamics
We set the reward as r(s, right) = 1 for s = 1, 2, 3 and zero otherwise. It is easy to see that the optimal policy
is π ∗ (right|s) = 1, which has value function v∗ = [1, 1, 1]⊤.
Now consider x = (v, µ) ∈ X . To define µ, let µ(s, a) = d(s)πµ (a|s). We set
πµ (right|1) = 1, πµ (stay|2) = 1, πµ (right|3) = 1
∗
That is, πµ is equal to π except when s = 2. One can verify the value function of this policy is vπµ =
[(1 − γ), 0, 1]⊤ .
As far as d is concerned (d = E⊤ µ), suppose the initial distribution is uniform, i.e. p = [1/3, 1/3, 1/3]⊤, we
choose d as d = (1 − γ)p + γ[1, 0, 0]⊤ , which satisfies the assumption in Proposition 2. Therefore, we have
µ ∈ M′ and we will let v be some arbitrary point in V.
Now we show for this choice x = (v, µ) ∈ V × M′ , the equality in Proposition 2 holds. By Lemma 1, we know
1
rep (x; x′ ) = −µ⊤ av∗ . Recall the advantage is defined as av∗ = r + 1−γ (γP − E)v∗ . Let AV ∗ (s, a) denote the
functional form of av∗ and define the expected advantage:
AV ∗ (s, πµ ) := Ea∼πµ [AV ∗ (s, a)].
We can verify it has the following values:
AV ∗ (1, πµ ) = 0, AV ∗ (2, πµ ) = −1, AV ∗ (3, πµ ) = 0.
One can easily generalize this 3-state MDP to an |S|-state MDP where states are partitioned into three groups.
Proposition 4. For x = (v, µ) ∈ X , define yx∗ := (vπµ , µ∗ ) ∈ X . It holds rep (x; yx∗ ) = V ∗ (p) − V πµ (p).
To proceed, we write yx∗ = (v∗ +(vπµ −v∗ ), µ∗ ) and use Lemma 3, which gives rep (x; yx∗ ) = −µ⊤ av∗ −b⊤
µ (v
πµ
−v∗ ).
To relate this equality to the policy performance gap, we also need the following equality.
Lemma 4. For µ ∈ M, it holds that −µ⊤ av∗ = V ∗ (p) − V πµ (p) + b⊤
µ (v
πµ
− v∗ ).
Together they imply the desired equality rep (x; yx∗ ) = V ∗ (p) − V πµ (p).
Lemma 3. Let x = (v, µ) be arbitrary. Consider x̃′ = (v′ + u′ , µ′ ), where v′ and µ′ are the value function and
state-action distribution of policy πµ′ , and u′ is arbitrary. It holds that rep (x; x̃′ ) = −µ⊤ av′ − b⊤ ′
µu .
1
Proof. Let x′ = (v′ , µ′ ). As shorthand, define f ′ := v′ + u′ , and L := 1−γ (γP− E) (i.e. we can write af = r+ Lf ).
′ ′ ⊤ ′ ⊤ ⊤ ′⊤
Because rep (x; x ) = −F (x, x ) = −(p v + µ av′ − p v − µ av ), we can write
rep (x; x̃′ ) = −p⊤ f ′ − µ⊤ af ′ + p⊤ v + µ′⊤ av
= −p⊤ v′ − µ⊤ av′ + p⊤ v + µ′⊤ av − p⊤ u′ − µ⊤ Lu′
Finally, by Lemma 1, we have also rep (x; x′ ) = −µ⊤ av′ and therefore the final equality.
Proof. Following the setup in Lemma 3, we prove the statement by the rearrangement below:
where the second equality is due to the performance difference lemma, i.e. Lemma 2, and the last equality uses
the definition av′ = r + Lv′ . For the second term above, let rπµ and Pπµ denote the expected reward and
transition under πµ . Because µ ∈ M, we can rewrite it as
= b⊤
µv
πµ
where the second equality uses the stationarity of µπµ given by (2). For the third term, it can be written
where the first equality uses stationarity, i.e. bµπµ = p + L⊤ µπµ = 0. Finally combining the three steps, we
have
′
−µ⊤ av′ = V π (p) − V πµ (p) + bµ (vπµ − v′ )
Corollary 1. Let XN = {xn ∈ Xθ }N n=1 be any sequence. Let π̂N be the policy given either by the average or the
best decision in XN . It holds that
RegretN (Θ)
V π̂N (p) ≥ V ∗ (p) − N − ǫΘ,N
∗
where ǫΘ,N = minxθ ∈Xθ rep (x̂N ; yN ) − rep (x̂N ; xθ ) measures the expressiveness of Xθ , and RegretN (Θ) :=
PN PN
l (x
n=1 n n ) − min x∈XΘ n=1 nl (x).
RegretN (Θ)
∗
V ∗ (p) − V π̂N (p) = rep (x̂N ; yN ) = ǫΘ,N + max rep (x̂N ; xθ ) ≤ ǫΘ,N +
xθ ∈Xθ N
where the first equality is Proposition 4 and the last inequality is due to the skew-symmetry of F , similar to the
proof of Theorem 1.
A Reduction from Reinforcement Learning to Online Learning
kµθ − µ∗ k1
min + min kbµ̂N k1,w kvθ − vπ̂N k∞,1/w
(vθ ,µθ )∈XΘ 1−γ w:w≥1
1
≤ min kµθ − µ∗ k1 + 2kvθ − vπ̂N k∞ .
(vθ ,µθ )∈XΘ 1 − γ
Proof. For shorthand, let us set x = (v, µ) = x̂N and write also πµ = π̂N as the associated policy. Let
yx∗ = (vπµ , µ∗ ) and similarly let xθ = (vθ , µθ ) ∈ XΘ . With rep (x; x′ ) = −F (x, x′ ) and (15), we can write
rep (x; yx∗ ) − rep (x; xθ ) = −p⊤ vπµ − µ⊤ avπµ + p⊤ v + µ∗ ⊤ av − −p⊤ vθ − µ⊤ avθ + p⊤ v + µ⊤
θ av
However, the second inequality above can be very conservative, especially when bµ ≈ 0 which can be likely
when it is close to the end of policy optimization. To this end, we introduce a free vector w ≥ 1. Define norms
kvk∞,1/w = maxi |vwii| and kδk1,w = i wi |δi |. Then we can instead have an upper bound
P
b⊤
µ (vθ − v
πµ
) ≤ min kbµ k1,w kvθ − vπµ k∞,1/w
w:w≥1
1
ǫΘ,N = rep (x; yx∗ ) − rep (x; xθ ) ≤ kµθ − µ∗ k1 + min kbµ k1,w kvθ − vπµ k∞,1/w
1−γ w:w≥1
1
≤ (kµθ − µ∗ k1 + 2kvθ − vπµ k∞ ) .
1−γ
Since it holds for any θ ∈ Θ, we can minimize the right-hand side over all possible choices.
Ching-An Cheng, Remi Tachet des Combes, Byron Boots, Geoff Gordon
The proof is a combination of the basic property of mirror descent (Lemma 9) and the martingale concentration.
1
Define K = |S||A| and κ = 1−γ as shorthands. We first slightly modify the per-round loss used to compute the
gradient. Recall ln (x) := p⊤ v + µ⊤ ⊤ ⊤
n av − p vn − µ avn and let us consider instead a loss function
hn (x) := b⊤ ⊤
µn v + µ (κ1 − avn )
which shifts ln by a constant in each round. (Note for all the decisions (vn , µn ) produced by Algorithm 1 µn
always satisfies kµn k1 = 1). One can verify that ln (x) − ln (x′ ) = hn (x) − hn (x′ ), for all x, x′ ∈ X , when µ, µ′
in x and x′ satisfy kµk1 = kµ′ k1 (which holds for Algorithm 1). As the definition of regret is relative, we may
work with hn in online learning and use it to define the feedback.
The reason for using hn instead of ln is to make ∇µ hn ((v, µ)) (and its unbiased approximation) a positive vector
(because κ ≥ kav k∞ for any v ∈ V), so that the regret bound can have a better dependency on the dimension
for learning µ that lives in the simplex M. This is a common trick used in the online learning, e.g. in EXP3.
To run mirror descent, we set the first-order feedback gn received by the learner as an unbiased estimate of
∇hn (xn ). For round n, we construct gn based on two calls of the generative model:
" 1
#
p̃n + 1−γ (γ P̃n − En )⊤ µ̃n
gn,v
gn = = 1
gn,µ K(κ1̂n − r̂n − 1−γ (γ P̂n − Ên )vn )
For gn,v , we sample p, then sample µn to get a state-action pair, and finally query the transition dynamics P
at the state-action pair sampled from µn . (p̃n , P̃n , and µ̃n denote the single-sample empirical approximation
of these probabilities.) For gn,µ , we first sample uniformly a state-action pair (which explains the factor K),
and then query the reward r and the transition dynamics P. (1̂n , r̂n , P̂n , and Ên denote the single-sample
empirical estimates.) To emphasize, we use ˜ and ˆ to distinguish the empirical quantities obtained by these
two independent queries. By construction, we have gn,µ ≥ 0. It is clear that this direction gn is unbiased, i.e.
E[gn ] = ∇hn (xn ). Moreover, it is extremely sparse and can be computed using O(1) sample, computational, and
memory complexities.
∗
Let yN = (vπ̂N , µ∗ ). We bound the regret by the following rearrangement.
N
X N
X
∗ ∗
RegretN (yN )= ln (xn ) − ln (yN )
n=1 n=1
N
X N
X
∗
= hn (xn ) − hn (yN )
n=1 n=1
N
X
= ∇hn (xn )⊤ (xn − yN
∗
)
n=1
N N N
! ! !
X X X
⊤
= (∇hn (xn ) − gn ) xn + gn⊤ (xn − ∗
yN ) + (gn − ∇hn (xn ))⊤ yN
∗
where
PN the third equality comes from hn being linear. We recognize the first term is a martingale MN =
⊤
n=1 (∇h n (xn ) − g n ) xn , because xn does not depend on gn . Therefore, we can appeal to standard martingale
concentration property. For the second term, it can be upper bounded by standard regret analysis of mirror
descent, by treating gn⊤ x as the per-round loss. For the third term, because yN ∗
= (vπ̂N , µ∗ ) depends on {gn }N
n=1 ,
it is not a martingale. Nonetheless, we will be able to handle it through a union bound. Below, we give details
for bounding these three terms.
A Reduction from Reinforcement Learning to Online Learning
!
−ǫ2
P (MN − M0 ≥ ǫ) ≤ exp bǫ
.
2N σ 2 (1 + 3N σ2 )
For the first term (κ1 − avn − gn,µ )⊤ µn , we use the lemma below:
Lemma 7. Let µ ∈ M be arbitrary, chosen independently from the randomness of gn,µ when Fn−1 is given.
Then it holds |(κ1 − avn − gn,µ )⊤ µ| ≤ 2(1+K)
1−γ
4K
and V|Fn−1 [(κ1 − avn − gn,µ )⊤ µ] ≤ (1−γ) 2.
2
|(κ1 − avn )⊤ µ| ≤ κ + kavn k∞ kµk1 ≤ .
1−γ
For the stochastic part, let in be index of the sampled state-action pair and jn be the index of the transited state
sampled at the pair given by in . With abuse of notation, we will use in to index both S × A and S. With this
notation, we may derive
⊤ 1
|gn,µ µ| = |Kµ⊤ (κ1̂n − r̂n − (γ P̂n − Ên )vn )|
1−γ
γvn,jn − vn,in
= Kµin |κ − rin − |
1−γ
2Kµin 2K
≤ ≤
1−γ 1−γ
where we use the facts that rin , vn,jn , vn,in ∈ [0, 1] and µin ≤ 1.
Ching-An Cheng, Remi Tachet des Combes, Byron Boots, Geoff Gordon
4K X 2
≤ µin
(1 − γ)2 i
n
!2
4K X 4K
≤ µin ≤
(1 − γ)2 in
(1 − γ)2
γvn,jn −vn,in 2
where in the second inequality we use the fact that |κ − rin − 1−γ |≤ 1−γ .
For the second term (bµn − gn,v )⊤ vn , we use the following lemma.
Lemma 8. Let v ∈ V be arbitrary, chosen independently from the randomness of gn,v when Fn−1 is given..
4 4
Then it holds that |(bµn − gn,v )⊤ v| ≤ 1−γ and V|Fn−1 [(bµn − gn,v )⊤ v] ≤ (1−γ) 2.
2
Proof. We appeal to Lemma 5, which shows kbµn k1 , kgn,v k1 ≤ 1−γ , and derive
4
|(bµn − gn,v )⊤ v| ≤ (kbµn k1 + kgn,v k1 )kvk∞ ≤ .
1−γ
Similarly, for the variance, we can write
4
⊤
V|Fn−1 [(bµn − gn,v )⊤ v] = V|Fn−1 [gn,v ⊤
v] ≤ E|Fn−1 [(gn,v v)2 ] ≤ .
(1 − γ)2
Thus, with helps from the two lemmas above, we are able to show
4 + 2(1 + K)
Mn − Mn−1 ≤ |(κ1 − avn − gn,µ )⊤ µn | + |(bµn − gn,v )⊤ vn | ≤
1−γ
as well as (because gn,µ and gn,b are computed using independent samples)
4(1 + K)
V|Fn−1 [Mn − Mn−1 ] ≤ E|Fn−1 [|(κ1 − avn − gn,µ )⊤ µn |2 ] + E|Fn−1 [|(bµn − gn,v )⊤ vn |2 ] ≤
(1 − γ)2
Now, since M0 = 0, by martingale concentration in Lemma 6, we have
N
! !
X −ǫ2
P (∇hn (xn ) − gn )⊤ xn > ǫ ≤ exp bǫ
n=1
2N σ 2 (1 + 3N σ2 )
6+2K 4(1+K)
with b = 1−γ and σ 2 = (1−γ)2 .This implies that, with probability at least 1 − δ, it holds
q
N K log( 1δ )
s
N
X 8(1 + K) 1
(∇hn (xn ) − gn )⊤ xn ≤ N 2
(1 + o(1)) log = Õ
n=1
(1 − γ) δ 1 − γ
Next we move onto deriving the regret bound of mirror descent with respect to the online loss sequence:
N
X
max gn⊤ (xn − x)
x∈X
n=1
Lemma 9. Let X be a convex set. Suppose R is 1-strongly convex with respect to some norm k · k. Let g be an
arbitrary vector and define, for x ∈ X ,
Proof. Recall the definition BR (x′ ||x) = R(x′ ) − R(x) − h∇R(x), x′ − xi. The optimality of the proximal map
can be written as
hg + ∇R(y) − ∇R(x), y − zi ≤ 0, ∀z ∈ X .
By rearranging the terms, we can rewrite the above inequality in terms of Bregman divergences as follows and
derive the first inequality (31):
Let x′ ∈ X be arbitrary. Applying this lemma to the nth iteration of mirror descent in (21), we get
1
hgn , xn+1 − x′ i ≤ (BR (x′ ||xn ) − BR (x′ ||xn+1 ) − BR (xn+1 ||xn ))
η
By a telescoping sum, we then have
N N
X
′ 1 ′
X 1
hgn , xn − x i ≤ BR (x ||x1 ) + hgn , xn+1 − xn i − BR (xn+1 ||xn ).
n=1
η n=1
η
We bound the right-hand side as follows. Recall that based on the geometry of X = V × M, we considered a
natural Bregman divergence of the form:
1
BR (x′ ||x) = kv′ − vk22 + KL(µ′ ||µ)
2|S|
and we upper bound them using the two lemmas below (recall gn,µ ≥ 0 due to the added κ1 term).
1 ηkgk22
Lemma 10. For any vector x, y, g and scalar η > 0, it holds hg, x − yi − 2η kx − yk22 ≤ 2 .
1 1 ηkgk22
Proof. By Cauchy-Swartz inequality, hg, x − yi − 2η kx − yk22 ≤ kgk2kx − yk2 − 2η kx − yk22 ≤ 2 .
Ching-An Cheng, Remi Tachet des Combes, Byron Boots, Geoff Gordon
Lemma 11. Suppose BR (x||y) = KL(x||y) and x, y are probability distributions, and g ≥ 0 element-wise. Then,
for η > 0,
1 ηX η
− BR (y||x) + hg, x − yi ≤ xi (gi )2 = kgk2x.
η 2 i 2
PN
We can expect, with high probability, n=1 |S|kgn,v k22 + kgn,µ k2µn concentrates toward its expectation, i.e.
N
X N
X
|S|kgn,v k22 + kgn,µ k2µn ≤ E[|S|kgn,v k22 + kgn,µ k2µn ] + o(N ).
n=1 n=1
Below we quantify this relationship using martingale concentration. First we bound the expectation.
4 4K
Lemma 12. E[kgn,v k22 ] ≤ (1−γ) 2
2 and E[kgn,µ kµn ] ≤ (1−γ)2 .
Proof. For the first statement, using the fact that k · k2 ≤ k · k1 and Lemma 5, we can write
1 4
E[kgn,v k22 ] ≤ E[kgn,v k21 ] = E[kp̃n + (γ P̃n − En )⊤ µ̃n k21 ] ≤ .
1−γ (1 − γ)2
For the second statement, let in be the index of the sampled state-action pair and jn be the index of the
transited-to state sampled at the pair given by in . With abuse of notation, we will use in to index both S × A
and S.
" " 2 ##
2
X 1
2 γvn,jn − vn,in
E[kgn,µ kµn ] = E Ej |i K µin κ − rin −
in
K n n 1−γ
" #
4K X 4K
≤ 2
E µin ≤ .
(1 − γ) i
(1 − γ)2
n
A Reduction from Reinforcement Learning to Online Learning
To bound the tail, we resort to the Höffding-Azuma inequality of martingale [37, Theorem 3.14].
Lemma 13 (Azuma-Hoeffding). Let M0 , . . . , MN be a martingale and let F0 ⊆ F1 ⊆ · · · ⊆ Fn be the filtration
such that Mn = E|Fn [Mn+1 ]. Suppose there exists b < ∞ such that for all n, given Fn−1 , |Mn − Mn−1 | ≤ b.
Then for any ǫ ≥ 0,
−2ǫ2
P (MN − M0 ≥ ǫ) ≤ exp
N b2
To bound the change of the size of martingale difference |Mn − Mn−1 |, we follow similar steps to Lemma 12.
4 4K 2
Lemma 14. kgn,v k22 ≤ (1−γ)2 and kgn,µ k2µn ≤ (1−γ)2 .
Note kgn,µ k2µ is K-factor larger than E[kgn,µ k2µ ]) and K ≥ 1. Therefore, we have
8(|S| + K 2 )
|Mn − Mn−1 | ≤ |S|kgn,v k22 + kgn,µ k2µn + |S|E[kgn,v k22 ] + E[kgn,µ k2µn ] ≤
(1 − γ)2
Combining these results, we have, with probability as least 1 − δ,
N N √ 2
s
X X 4 2(|S| + K ) 1
|S|kgn,v k22 + kgn,µ k2µn ≤ 2 2
E[|S|kgn,v k2 + kgn,µ kµn ] + 2
N log
n=1 n=1
(1 − γ) δ
√ s
4 2(|S| + K 2 )
4(K + |S|) 1
≤ N+ N log
(1 − γ)2 (1 − γ)2 δ
1−γ
Now we suppose we set η = √
KN
. From (38), we then have
N N
X
′ 1 1 ηX
hgn , xn − x i ≤ + log(K) + |S|kgn,v k22 + kgn,µ k2µn
n=1
η 2 2 n=1
√ √ s !
2 2(|S| + K 2 )
KN 1 1 − γ 2(K + |S|) 1
≤ + log(K) + √ 2
N+ 2
N log
1−γ 2 KN (1 − γ) (1 − γ) δ
√ q
KN K 3 log 1δ
≤ Õ + .
1−γ 1−γ
Lemma 15. Let f, g be two random L-Lipschitz functions. Suppose for some a > 0 and some fixed z ∈ Z
selected independently of f, g, it holds
ǫ
Proof. Let C denote a set of covers of size N (Z, 4L ) Then, for any z ∈ Z which could depend on f, g,
Thus, supz∈Z |f (z) − g(z)| > ǫ =⇒ maxz′ ∈C |f (z ′ ) − g(z ′ )| > 2ǫ . Therefore, we have the union bound.
−aǫ2
ǫ
P sup |f (z) − E[f (z)]| > ǫ ≤ N Z, exp .
z∈Z 4L 4
PN ⊤ ∗
We now use Lemma 15 to bound the component n=1 (gn −∇hn (xn )) yN . We recall by definition, for x = (v, µ),
For the first term, because µ∗ is set beforehand by the MDP definition and does not depend on the randomness
during learning, it is a martingale and we can apply the steps in Appendix D.1 to show,
q
XN N K log( 1δ )
(gn,µ − κ1 + avn )⊤ µ∗ = Õ
n=1
1 − γ
For the second term, because vπ̂N depends on the randomness in the learning process, we need to use a union
bound. Following the steps in Appendix D.1, we see that for some fixed v ∈ V, it holds
N
!
(1 − γ)2 2
X
⊤
P (gn,v − bµn ) v > ǫ ≤ exp − ǫ
n=1
N
√
where some constants were ignored for the sake of conciseness. Note also that it does not have the K factor
because of Lemma 8. To apply Lemma 15, we need to know the order of covering number of V. Since V is
an |S|-dimensional unit cube in the positive orthant, it is straightforward to show N (V, ǫ) ≤ max(1, (1/ǫ)|S|)
PN ⊤
PN ⊤
(by simply discretizing evenly in each dimension). Moreover, the functions n=1 gn,v v and n=1 bµn v are
N
1−γ -Lipschitz in k · k∞ .
(1 − γ)2 2
ǫ(1 − γ)
δ ≥ N V, exp − ǫ .
4N 4N
That is:
1 (1 − γ)2 2 ǫ(1 − γ)
log( ) ≤ ǫ + |S| min(0, log( )).
δ 4N 4N
√ √
N log( 1 ) N log( δ1 )
Picking ǫ = O log(N ) 1−γ δ = Õ 1−γ guarantees that the inequality is verified asymptotically.
Combining these two steps, we have shown overall, with probability at least 1 − δ,
q
XN N K log( 1δ )
(gn − ∇hn (xn ))⊤ yN
∗
= Õ .
n=1
1 − γ
D.4 Summary
In the previous subsections, we have provided high probability upper bounds for each term in the decomposition
N N N
! ! !
X X X
∗
RegretN (yN )≤ (∇hn (xn ) − gn )⊤ xn + max gn⊤ (xn − x) + (gn − ∇hn (xn ))⊤ yN
∗
x∈X
n=1 n=1 n=1
In other words, the sample complexity of mirror descent to obtain an ǫ approximately optimal policy (i.e.
|S||A| log( 1δ )
V ∗ (p) − V π̂N (p) ≤ ǫ) is at most Õ
(1−γ)2 ǫ2 .
Here we provide further discussions on the sample complexity of running Algorithm 1 with linearly parameterized
function approximators and the proof of Theorem 3.
Theorem 3. Under a proper choice of Θ and BR , with probability 1−δ, Algorithm 1 learns an (ǫ+ǫΘ,N )-optimal
dim(Θ) log( 1δ )
policy with Õ (1−γ)2 ǫ2 samples.
E.1 Setup
We suppose that the decision variable is parameterized in the form xθ = (Φθv , Ψθµ ), where Φ, Ψ are given
nonlinear basis functions and (θv , θµ ) ∈ Θ are the parameters to learn. For modeling the value function, we
suppose each column in Φ is a vector (i.e. function) such that its k · k∞ is less than one. For modeling the
state-action distribution, we suppose each column in Ψ is a state-action distribution from which we can draw
samples. This choice implies that every column of Φ belongs to V, and every column of Ψ belongs to M.
Ching-An Cheng, Remi Tachet des Combes, Byron Boots, Geoff Gordon
Considering the geometry of Φ and Ψ, we consider a compact and convex parameter set
Cv
Θ = {θ = (θv , θµ ) : kθv k2 ≤ p , θµ ≥ 0, kθµ k1 ≤ 1}
dim(θv )
where Cv < ∞. The constant Cv acts as a regularization in learning: if it is too small, the bias (captured as
ǫΘ,N in Corollary 1 restated below) becomes larger; if it is too large, the learning becomes slower.
This choice of Θ makes sure, for θ = (θv , θµ ) ∈ Θ, Ψθµ ∈ M and kΦθv k∞ ≤ kθv k1 ≤ Cv . Therefore, by the
above construction, we can verify that the requirement in Corollary 1 is satisfied, i.e. for θ = (θv , θµ ) ∈ Θ, we
have (Φθv , Ψµ θµ ) ∈ XΘ .
Corollary 1. Let XN = {xn ∈ Xθ }N n=1 be any sequence. Let π̂N be the policy given either by the average or the
best decision in XN . It holds that
RegretN (Θ)
V π̂N (p) ≥ V ∗ (p) − N − ǫΘ,N
∗
where ǫΘ,N = minxθ ∈Xθ rep (x̂N ; yN ) − rep (x̂N ; xθ ) measures the expressiveness of Xθ , and RegretN (Θ) :=
PN PN
n=1 ln (xn ) − minx∈XΘ n=1 ln (x).
Let θ = (θv , θµ ) ∈ Θ. In view of the parameterization above, we can identify the online loss in (23) in the
parameter space as
1
hn (θ) := b⊤ ⊤ ⊤
µn Φθv + θµ Ψ ( 1−γ 1 − avn ) (33)
where we have the natural identification xn = (vn , µn ) = (Φθv,n , Ψθµ,n ) and θn = (θv,n , θµ,n ) ∈ Θ is the
decision made by the online learner in the nth round. Note that because this extension of Algorithm 1 makes
sure kθµ,n k1 = 1 for every iteration, we can still use hn . For writing convenience, we will continue to overload
hn as a function of parameter θ in the following analyses.
Mirror descent requires gradient estimates of ∇hn (θn ). Here we construct an unbiased stochastic estimate of
∇hn (θn ) as
" 1
#
Φ⊤ (p̃n + 1−γ (γ P̃n − En )⊤ µ̃n )
gn,v
gn = = 1 1 (34)
gn,µ dim(θµ )Ψ̂⊤
n ( 1−γ 1̂n − r̂n − 1−γ (γ P̂n − Ên )vn )
using two calls of the generative model (again we overload the symbol gn for the analyses in this section):
• The upper part $g_{n,v}$ is constructed similarly to before in (24): first we sample the initial state from the initial distribution, then a state-action pair from the learned state-action distribution, and finally the next state reached from the sampled state-action pair. We evaluate $\Phi$'s values at those samples to construct $g_{n,v}$. Thus, $g_{n,v}$ is generally a dense vector of size $\dim(\theta_v)$ (unless the columns of $\Phi$ are sparse to begin with).
• The lower part $g_{n,\mu}$ is constructed slightly differently. Recall that, for the tabular version in (24), we sample uniformly over the state and action spaces. Here, instead, we first sample uniformly a column (i.e. a basis function) of $\Psi$ and then sample a state-action pair according to the sampled column, which is a distribution by design. Therefore, the multiplier due to uniform sampling in the second row of (34) is now $\dim(\theta_\mu)$ rather than $|S||A|$ as in (24). The matrix $\hat{\Psi}_n$ is extremely sparse: only the single sampled entry (at the sampled column and state-action pair) is one, and all others are zero. In fact, one can think of the tabular version as simply using the basis functions $\Psi = I$, i.e. each column is a delta distribution. Under this identification, the expression in (34) matches the one in (24).
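For illustration, here is a minimal Python (NumPy) sketch of one gradient estimate (34). It is a schematic rendering of the two-call scheme just described, not a prescribed implementation: the generative-model interface (sample_init_state, sample_next_state, reward, num_actions) and the flattening of a state-action pair into a single index sa = s * num_actions + a are hypothetical conventions introduced only for this sketch. Phi stores one feature row per state, and each column of Psi is a probability vector over state-action pairs.

import numpy as np

def sample_state_action(Psi, theta_mu, rng):
    # Draw (s, a) ~ Psi @ theta_mu: pick a column k with probability theta_mu[k]
    # (the algorithm keeps ||theta_mu||_1 = 1), then draw a state-action index
    # from that column, which is a probability vector by construction.
    k = rng.choice(Psi.shape[1], p=theta_mu)
    return rng.choice(Psi.shape[0], p=Psi[:, k])

def gradient_estimate(Phi, Psi, theta_v, theta_mu, mdp, gamma, rng):
    # Stochastic estimate of grad h_n(theta_n) as in (34), using two
    # generative-model transition queries per round.
    d_mu = Psi.shape[1]
    A = mdp.num_actions                                # hypothetical attribute
    kappa = 1.0 / (1.0 - gamma)

    # Upper block g_{n,v}: Phi^T ( p~_n + (gamma P~_n - E_n)^T mu~_n / (1-gamma) ).
    s0 = mdp.sample_init_state()                       # s_0 ~ p
    sa = sample_state_action(Psi, theta_mu, rng)       # (s, a) ~ mu_n = Psi theta_{mu,n}
    s, a = divmod(sa, A)
    s_next = mdp.sample_next_state(s, a)               # first generative-model call
    g_v = Phi[s0] + (gamma * Phi[s_next] - Phi[s]) / (1.0 - gamma)

    # Lower block g_{n,mu}: dim(theta_mu) * Psi_hat_n^T ( kappa 1 - r - (gamma P_hat_n - E_hat_n) v_n / (1-gamma) ).
    k = rng.integers(d_mu)                             # uniform column (basis function) of Psi
    sa2 = rng.choice(Psi.shape[0], p=Psi[:, k])        # (s, a) ~ psi^{(k)}
    s2, a2 = divmod(sa2, A)
    r = mdp.reward(s2, a2)
    s2_next = mdp.sample_next_state(s2, a2)            # second generative-model call
    v_s2, v_s2_next = Phi[s2] @ theta_v, Phi[s2_next] @ theta_v   # entries of v_n = Phi theta_{v,n}
    g_mu = np.zeros(d_mu)
    g_mu[k] = d_mu * (kappa - r - (gamma * v_s2_next - v_s2) / (1.0 - gamma))
    return g_v, g_mu

As noted above, g_v is dense in the features while g_mu has a single nonzero coordinate (at the sampled column); since only the rows of Phi at the sampled states are evaluated, the per-round work stays linear in the number of parameters, provided one can draw samples from the columns of Psi directly as assumed in the setup (the explicit arrays here are only for readability).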
We follow the same steps as in the analysis of the tabular version, highlighting the differences and improvements due to using function approximation.
First, we use Corollary 1 in place of Theorem 1. To properly handle the randomness, we revisit its derivation to slightly tighten the statement, which was simplified for the sake of cleaner exposition. Define
\[
y_{N,\theta}^* = (v_{N,\theta}^*, \mu_\theta^*) := \arg\max_{x_\theta \in \mathcal{X}_\Theta} \mathrm{rep}(\hat{x}_N; x_\theta).
\]
For writing convenience, let us also denote by $\theta_N^* = (\theta_{v,N}^*, \theta_\mu^*) \in \Theta$ the corresponding parameter of $y_{N,\theta}^*$. We remark that $\mu_\theta^*$ (i.e. $\theta_\mu^*$), which tries to approximate $\mu^*$, is fixed before the learning process, whereas $v_{N,\theta}^*$ (i.e. $\theta_{v,N}^*$) could depend on the stochasticity of the learning process. Using this new notation and the steps in the proof of Corollary 1, we can write
\[
V^*(p) - V^{\hat\pi_N}(p) = \mathrm{rep}(\hat{x}_N; y_N^*)
= \epsilon_{\Theta,N} + \mathrm{rep}(\hat{x}_N; y_{N,\theta}^*)
\le \epsilon_{\Theta,N} + \frac{\mathrm{Regret}_N(y_{N,\theta}^*)}{N}
\]
where the first equality is Proposition 4, the last inequality follows the proof of Theorem 1, and we recall the definition $\epsilon_{\Theta,N} = \mathrm{rep}(\hat{x}_N; y_N^*) - \mathrm{rep}(\hat{x}_N; y_{N,\theta}^*)$.
The rest of the proof is very similar to that of Theorem 1, because the linear parameterization does not change the convexity of the loss sequence. Let $y_N^* = (v_{\hat\pi_N}, \mu^*)$. We bound the regret by the following rearrangement:
\[
\begin{aligned}
\mathrm{Regret}_N(y_{N,\theta}^*) &= \sum_{n=1}^N l_n(x_n) - \sum_{n=1}^N l_n(y_{N,\theta}^*) \\
&= \sum_{n=1}^N h_n(\theta_n) - \sum_{n=1}^N h_n(\theta_N^*) \\
&= \sum_{n=1}^N \nabla h_n(\theta_n)^\top (\theta_n - \theta_N^*) \\
&= \left(\sum_{n=1}^N (\nabla h_n(\theta_n) - g_n)^\top \theta_n\right)
+ \left(\sum_{n=1}^N g_n^\top (\theta_n - \theta_N^*)\right)
+ \left(\sum_{n=1}^N (g_n - \nabla h_n(\theta_n))^\top \theta_N^*\right)
\end{aligned}
\]
where the third equality uses the linearity of $h_n$ in $\theta$.
The first term is a martingale. We will use this part to highlight the different properties due to using basis functions. The proof follows the steps in Appendix D.1, but now the martingale difference of interest is instead
\[
M_n - M_{n-1} = (\nabla h_n(\theta_n) - g_n)^\top \theta_n
= \big(\Psi^\top(\kappa\mathbf{1} - a_{v_n}) - g_{n,\mu}\big)^\top \theta_{\mu,n}
+ \big(\Phi^\top b_{\mu_n} - g_{n,v}\big)^\top \theta_{v,n}
\]
(recall $\kappa = \tfrac{1}{1-\gamma}$).
These terms now have nicer properties due to the way $g_{n,\mu}$ is sampled.

For the first term $(\Psi^\top(\kappa\mathbf{1} - a_{v_n}) - g_{n,\mu})^\top \theta_{\mu,n}$, we use the lemma below, where we recall that the filtration $\mathcal{F}_n$ is naturally defined as $\{\theta_1, \ldots, \theta_n\}$.

Lemma 16. Let $\theta = (\theta_v, \theta_\mu) \in \Theta$ be arbitrary but chosen independently of the randomness of $g_{n,\mu}$ given $\mathcal{F}_{n-1}$. Then it holds that
\[
\big|\big(\Psi^\top(\kappa\mathbf{1} - a_{v_n}) - g_{n,\mu}\big)^\top \theta_\mu\big| \le \frac{2(1+\dim(\theta_\mu))}{1-\gamma}
\quad\text{and}\quad
\mathbb{V}_{|\mathcal{F}_{n-1}}\big[\big(\Psi^\top(\kappa\mathbf{1} - a_{v_n}) - g_{n,\mu}\big)^\top \theta_\mu\big] \le \frac{4\dim(\theta_\mu)}{(1-\gamma)^2}.
\]
Proof. For the first bound, note that $|\theta_\mu^\top \Psi^\top(\kappa\mathbf{1} - a_{v_n})| \le \|\Psi\theta_\mu\|_1\,\|\kappa\mathbf{1} - a_{v_n}\|_\infty \le \frac{2}{1-\gamma}$ (since $\|\Psi\theta_\mu\|_1 \le \|\theta_\mu\|_1 \le 1$), while
\[
\begin{aligned}
|g_{n,\mu}^\top \theta_\mu|
&= \Big|\dim(\theta_\mu)\,\theta_\mu^\top \hat{\Psi}_n^\top\Big(\kappa\hat{\mathbf{1}}_n - \hat{r}_n - \tfrac{1}{1-\gamma}(\gamma \hat{P}_n - \hat{E}_n)v_n\Big)\Big| \\
&= \dim(\theta_\mu)\,\theta_{\mu,k_n} \Big|\kappa - r_{i_n} - \frac{\gamma v_{n,j_n} - v_{n,i_n}}{1-\gamma}\Big| \\
&\le \frac{2\dim(\theta_\mu)\,\theta_{\mu,k_n}}{1-\gamma} \le \frac{2\dim(\theta_\mu)}{1-\gamma}
\end{aligned}
\]
where we use the facts that $r_{i_n}, v_{n,j_n}, v_{n,i_n} \in [0,1]$ and $\theta_{\mu,k_n} \le 1$. The triangle inequality then gives the first claim.
Let $\psi_\mu^{(k)}$ denote the $k$th column of $\Psi$. For $\mathbb{V}_{|\mathcal{F}_{n-1}}[(\Psi^\top(\kappa\mathbf{1} - a_{v_n}) - g_{n,\mu})^\top \theta_\mu]$, we can write
\[
\begin{aligned}
\mathbb{V}_{|\mathcal{F}_{n-1}}\big[(\Psi^\top(\kappa\mathbf{1} - a_{v_n}) - g_{n,\mu})^\top \theta_\mu\big]
&= \mathbb{V}_{|\mathcal{F}_{n-1}}\big[g_{n,\mu}^\top \theta_\mu\big]
\;\le\; \mathbb{E}_{|\mathcal{F}_{n-1}}\big[|g_{n,\mu}^\top \theta_\mu|^2\big] \\
&= \sum_{k_n} \frac{1}{\dim(\theta_\mu)} \sum_{i_n} \psi_{\mu,i_n}^{(k_n)}\,
\mathbb{E}_{j_n|i_n}\left[\dim(\theta_\mu)^2\, \theta_{\mu,k_n}^2 \Big(\kappa - r_{i_n} - \frac{\gamma v_{n,j_n} - v_{n,i_n}}{1-\gamma}\Big)^{\!2}\right] \\
&\le \frac{4\dim(\theta_\mu)}{(1-\gamma)^2} \sum_{k_n} \theta_{\mu,k_n}^2 \sum_{i_n} \psi_{\mu,i_n}^{(k_n)} \\
&\le \frac{4\dim(\theta_\mu)}{(1-\gamma)^2} \Big(\sum_{k_n} \theta_{\mu,k_n}\Big)^{\!2} \le \frac{4\dim(\theta_\mu)}{(1-\gamma)^2}
\end{aligned}
\]
where in the second inequality we use the fact that $\big|\kappa - r_{i_n} - \frac{\gamma v_{n,j_n} - v_{n,i_n}}{1-\gamma}\big| \le \frac{2}{1-\gamma}$.
For the second term $(\Phi^\top b_{\mu_n} - g_{n,v})^\top \theta_{v,n}$, we have the analogous result.

Lemma 17. Let $\theta = (\theta_v, \theta_\mu) \in \Theta$ be chosen as in Lemma 16. Then it holds that $|(\Phi^\top b_{\mu_n} - g_{n,v})^\top \theta| \le \frac{4C_v}{1-\gamma}$ and $\mathbb{V}_{|\mathcal{F}_{n-1}}[(\Phi^\top b_{\mu_n} - g_{n,v})^\top \theta] \le \frac{4C_v^2}{(1-\gamma)^2}$.

Proof. We appeal to Lemma 5, which shows $\|b_{\mu_n}\|_1 \le \frac{2}{1-\gamma}$ and
\[
\Big\|\tilde{p}_n + \frac{1}{1-\gamma}(\gamma \tilde{P}_n - E_n)^\top \tilde{\mu}_n\Big\|_1 \le \frac{2}{1-\gamma}.
\]
Therefore, overall we can derive
\[
|(\Phi^\top b_{\mu_n} - g_{n,v})^\top \theta|
\le \Big(\|b_{\mu_n}\|_1 + \Big\|\tilde{p}_n + \tfrac{1}{1-\gamma}(\gamma \tilde{P}_n - E_n)^\top \tilde{\mu}_n\Big\|_1\Big)\,\|\Phi\theta_v\|_\infty
\le \frac{4C_v}{1-\gamma}
\]
where we use again that each column of $\Phi$ has $\|\cdot\|_\infty$ at most one, so that $\|\Phi\theta_v\|_\infty \le \|\theta_v\|_1 \le C_v$ (cf. the norm chain in the setup above). Similarly, for the variance, we can write
\[
\mathbb{V}_{|\mathcal{F}_{n-1}}\big[(\Phi^\top b_{\mu_n} - g_{n,v})^\top \theta\big]
= \mathbb{V}_{|\mathcal{F}_{n-1}}\big[g_{n,v}^\top \theta\big]
\le \mathbb{E}_{|\mathcal{F}_{n-1}}\big[(g_{n,v}^\top \theta)^2\big]
\le \frac{4C_v^2}{(1-\gamma)^2},
\]
since $|g_{n,v}^\top \theta| \le \big\|\tilde{p}_n + \tfrac{1}{1-\gamma}(\gamma\tilde{P}_n - E_n)^\top\tilde{\mu}_n\big\|_1\,\|\Phi\theta_v\|_\infty \le \tfrac{2C_v}{1-\gamma}$ almost surely.
From the above two lemmas, we see that the main difference from what we had in Appendix D.1 for the tabular case is that the martingale difference now scales as $O\big(\tfrac{C_v + \dim(\theta_\mu)}{1-\gamma}\big)$ instead of $O\big(\tfrac{|S||A|}{1-\gamma}\big)$, and its variance scales as $O\big(\tfrac{C_v^2 + \dim(\theta_\mu)}{(1-\gamma)^2}\big)$ instead of $O\big(\tfrac{|S||A|}{(1-\gamma)^2}\big)$. We note that the constant $C_v$ is universal, independent of the problem size.
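As a sanity check, under the tabular identification $\Psi = I$ mentioned above we have $\dim(\theta_\mu) = |S||A|$, and since $C_v$ is a universal constant these rates reduce to
\[
O\Big(\frac{C_v + |S||A|}{1-\gamma}\Big) = O\Big(\frac{|S||A|}{1-\gamma}\Big)
\qquad\text{and}\qquad
O\Big(\frac{C_v^2 + |S||A|}{(1-\gamma)^2}\Big) = O\Big(\frac{|S||A|}{(1-\gamma)^2}\Big),
\]
recovering the tabular scalings of Appendix D.1.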
Following similar steps to Appendix D.1, these new results imply that
\[
P\left(\sum_{n=1}^{N} (\nabla h_n(\theta_n) - g_n)^\top \theta_n > \epsilon\right)
\le \exp\left(\frac{-\epsilon^2}{2N\sigma^2\big(1 + \frac{b\epsilon}{3N\sigma^2}\big)}\right)
\]
with $b = O\big(\tfrac{C_v + \dim(\theta_\mu)}{1-\gamma}\big)$ and $\sigma^2 = O\big(\tfrac{C_v^2 + \dim(\theta_\mu)}{(1-\gamma)^2}\big)$. This implies that, with probability at least $1-\delta$, it holds that
\[
\sum_{n=1}^{N} (\nabla h_n(\theta_n) - g_n)^\top \theta_n = \tilde{O}\!\left(\frac{\sqrt{N(C_v^2 + \dim(\theta_\mu))\log(\frac{1}{\delta})}}{1-\gamma}\right).
\]
Again, the steps here are very similar to those in Appendix D.2. We are concerned with bounding the static regret
\[
\max_{\theta\in\Theta} \sum_{n=1}^{N} g_n^\top (\theta_n - \theta).
\]
From Appendix D.2, we recall that this can be achieved by mirror descent's optimality condition. The inequality below is true for any $\theta' \in \Theta$:
\[
\sum_{n=1}^{N} \langle g_n, \theta_n - \theta' \rangle
\le \frac{1}{\eta} B_R(\theta'||\theta_1) + \sum_{n=1}^{N} \Big(\langle g_n, \theta_{n+1} - \theta_n \rangle - \frac{1}{\eta} B_R(\theta_{n+1}||\theta_n)\Big).
\]
Together with the upper bound on $\frac{1}{\eta} B_R(\theta'||\theta_1)$, it implies that
\[
\begin{aligned}
\sum_{n=1}^{N} \langle g_n, \theta_n - \theta' \rangle
&\le \frac{1}{\eta} B_R(\theta'||\theta_1) + \sum_{n=1}^{N} \Big(\langle g_n, \theta_{n+1} - \theta_n \rangle - \frac{1}{\eta} B_R(\theta_{n+1}||\theta_n)\Big) \\
&\le \frac{\tilde{O}(1)}{\eta} + \frac{\eta}{2} \sum_{n=1}^{N} \Big(\frac{C_v^2}{\dim(\theta_v)}\|g_{n,v}\|_2^2 + \|g_{n,\mu}\|_{\theta_{\mu,n}}^2\Big). \tag{38}
\end{aligned}
\]
We can expect that, with high probability, $\sum_{n=1}^{N} \big(\frac{C_v^2}{\dim(\theta_v)}\|g_{n,v}\|_2^2 + \|g_{n,\mu}\|_{\theta_{\mu,n}}^2\big)$ concentrates toward its expectation, i.e.
\[
\sum_{n=1}^{N} \Big(\frac{C_v^2}{\dim(\theta_v)}\|g_{n,v}\|_2^2 + \|g_{n,\mu}\|_{\theta_{\mu,n}}^2\Big)
\le \mathbb{E}\left[\sum_{n=1}^{N} \Big(\frac{C_v^2}{\dim(\theta_v)}\|g_{n,v}\|_2^2 + \|g_{n,\mu}\|_{\theta_{\mu,n}}^2\Big)\right] + o(N).
\]
To bound the right-hand side, we will use the upper bounds below, which largely follow the proofs of Lemma 16 and Lemma 17.

Lemma 18. $\mathbb{E}[\|g_{n,v}\|_2^2] \le \frac{4\dim(\theta_v)}{(1-\gamma)^2}$ and $\mathbb{E}[\|g_{n,\mu}\|_{\theta_{\mu,n}}^2] \le \frac{4\dim(\theta_\mu)}{(1-\gamma)^2}$.

Lemma 19. $\|g_{n,v}\|_2^2 \le \frac{4\dim(\theta_v)}{(1-\gamma)^2}$ and $\|g_{n,\mu}\|_{\theta_{\mu,n}}^2 \le \frac{4\dim(\theta_\mu)^2}{(1-\gamma)^2}$.
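To see where the step size comes from, plugging Lemma 18 (together with the concentration step above) into (38) bounds its right-hand side, up to lower-order terms, by
\[
\frac{\tilde{O}(1)}{\eta} + 2\eta N\,\frac{C_v^2 + \dim(\theta_\mu)}{(1-\gamma)^2},
\]
and balancing the two terms in $\eta$ (up to constants and logarithmic factors) motivates the choice made next.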
Now we suppose we set $\eta = O\Big(\frac{1-\gamma}{\sqrt{N(C_v^2 + \dim(\theta_\mu))}}\Big)$. We have
\[
\sum_{n=1}^{N} \langle g_n, \theta_n - \theta' \rangle
\le \frac{\tilde{O}(1)}{\eta} + \frac{\eta}{2}\sum_{n=1}^{N} \Big(\frac{C_v^2}{\dim(\theta_v)}\|g_{n,v}\|_2^2 + \|g_{n,\mu}\|_{\theta_{\mu,n}}^2\Big)
\le \tilde{O}\!\left(\frac{\sqrt{(C_v^2 + \dim(\theta_\mu))N}}{1-\gamma}\right).
\]
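For concreteness, one round of the resulting mirror-descent update can be sketched as follows. The Bregman divergence used here — squared Euclidean distance on $\theta_v$ (gradient step with projection onto the ball $\|\theta_v\|_2 \le C_v/\sqrt{\dim(\theta_v)}$) and negative entropy on $\theta_\mu$ (an exponentiated-gradient step renormalized so that $\|\theta_\mu\|_1 = 1$) — is an illustrative instantiation of our own; the analysis only requires a proper choice of $B_R$ and a step size $\eta$ of the order just stated.

import numpy as np

def mirror_descent_step(theta_v, theta_mu, g_v, g_mu, eta, C_v):
    # One online round on the linear loss h_n over Theta (a sketch under the
    # assumed regularizers above, not a prescribed choice of B_R).

    # theta_v block: gradient step + Euclidean projection onto
    # { ||theta_v||_2 <= C_v / sqrt(dim(theta_v)) }.
    radius = C_v / np.sqrt(theta_v.size)
    theta_v_new = theta_v - eta * g_v
    norm = np.linalg.norm(theta_v_new)
    if norm > radius:
        theta_v_new *= radius / norm

    # theta_mu block: exponentiated-gradient step, renormalized so that
    # theta_mu >= 0 and ||theta_mu||_1 = 1 are preserved (initialize theta_mu
    # strictly positive, e.g. uniform).
    theta_mu_new = theta_mu * np.exp(-eta * g_mu)
    theta_mu_new /= theta_mu_new.sum()

    return theta_v_new, theta_mu_new

# Example step-size choice matching the analysis (constants omitted):
# eta = (1 - gamma) / np.sqrt(N * (C_v**2 + d_mu))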
We follow the steps in Appendix D.3: we will use again the fact that $\theta_N^* = (\theta_{v,N}^*, \theta_\mu^*) \in \Theta$, so we can handle the part with $\theta_\mu^*$ using the standard martingale concentration, and the part with $\theta_{v,N}^*$ using the union bound.

Using the previous analyses, we can first show that the martingale due to the part $\theta_\mu^*$ concentrates in $\tilde{O}\Big(\frac{\sqrt{N\dim(\theta_\mu)\log(\frac{1}{\delta})}}{1-\gamma}\Big)$. Likewise, using the union bound, we can show that the martingale due to the part $\theta_{v,N}^*$ concentrates in $\tilde{O}\Big(\frac{\sqrt{N C_v^2 \log(\frac{\mathcal{N}}{\delta})}}{1-\gamma}\Big)$, where $\mathcal{N}$ is a proper covering number of the set $\{\theta_v : \|\theta_v\|_2 \le \frac{C_v}{\sqrt{\dim(\theta_v)}}\}$. Because $\log \mathcal{N} = O(\dim(\theta_v))$ for a Euclidean ball, we can combine the two bounds and show that, together,
\[
\sum_{n=1}^{N} (g_n - \nabla h_n(\theta_n))^\top \theta_N^* = \tilde{O}\!\left(\frac{\sqrt{N(C_v^2 \dim(\theta_v) + \dim(\theta_\mu))\log(\frac{1}{\delta})}}{1-\gamma}\right).
\]
E.5.2 Summary

Combining the results of the three parts above, we have, with probability $1-\delta$,
\[
\begin{aligned}
\mathrm{Regret}_N(y_{N,\theta}^*)
&\le \left(\sum_{n=1}^{N} (\nabla h_n(\theta_n) - g_n)^\top \theta_n\right)
+ \left(\max_{\theta\in\Theta} \sum_{n=1}^{N} g_n^\top (\theta_n - \theta)\right)
+ \left(\sum_{n=1}^{N} (g_n - \nabla h_n(\theta_n))^\top \theta_N^*\right) \\
&= \tilde{O}\!\left(\frac{\sqrt{N(\dim(\theta_\mu) + C_v^2)\log(\frac{1}{\delta})}}{1-\gamma}\right)
+ \tilde{O}\!\left(\frac{\sqrt{(C_v^2 + \dim(\theta_\mu))N}}{1-\gamma}\right)
+ \tilde{O}\!\left(\frac{\sqrt{N(C_v^2 \dim(\theta_v) + \dim(\theta_\mu))\log(\frac{1}{\delta})}}{1-\gamma}\right) \\
&= \tilde{O}\!\left(\frac{\sqrt{N\dim(\Theta)\log(\frac{1}{\delta})}}{1-\gamma}\right).
\end{aligned}
\]