

A Reduction from Reinforcement Learning to
No-Regret Online Learning

Ching-An Cheng (Georgia Tech), Remi Tachet des Combes (Microsoft Research Montreal), Byron Boots (UW), Geoff Gordon (Microsoft Research Montreal)
arXiv:1911.05873v1 [cs.LG] 14 Nov 2019

Abstract

We present a reduction from reinforcement learning (RL) to no-regret online learning based on the saddle-point formulation of RL, by which any online algorithm with sublinear regret can generate policies with provable performance guarantees. This new perspective decouples the RL problem into two parts: regret minimization and function approximation. The first part admits a standard online-learning analysis, and the second part can be quantified independently of the learning algorithm. Therefore, the proposed reduction can be used as a tool to systematically design new RL algorithms. We demonstrate this idea by devising a simple RL algorithm based on mirror descent and the generative-model oracle. For any γ-discounted tabular RL problem, with probability at least 1 − δ, it learns an ε-optimal policy using at most Õ(|S||A| log(1/δ) / ((1 − γ)^4 ε^2)) samples. Furthermore, this algorithm admits a direct extension to linearly parameterized function approximators for large-scale applications, with computation and sample complexities independent of |S|, |A|, though at the cost of potential approximation bias.

1 INTRODUCTION

Reinforcement learning (RL) is a fundamental problem for sequential decision making in unknown environments. One of its core difficulties, however, is the need for algorithms to infer long-term consequences based on limited, noisy, short-term feedback. As a result, designing RL algorithms that are both scalable and provably sample efficient has been challenging.

In this work, we revisit the classic linear-program (LP) formulation of RL [1, 2] in an attempt to tackle this long-standing question. We focus on the associated saddle-point problem of the LP (given by Lagrange duality), which has recently gained traction due to its potential for computationally efficient algorithms with theoretical guarantees [3–11]. But in contrast to these previous works based on stochastic approximation, here we consider a reformulation through the lens of online learning, i.e. regret minimization. Since the pioneering work of Gordon [12] and Zinkevich [13], online learning has evolved into a ubiquitous tool for systematic design and analysis of iterative algorithms. Therefore, if we can identify a reduction from RL to online learning, we can potentially leverage it to build efficient RL algorithms.

We will show this idea is indeed feasible. We present a reduction by which any no-regret online algorithm, after observing N samples, can find a policy π̂_N in a policy class Π satisfying V^{π̂_N}(p) ≥ V^{π*}(p) − o(1) − ε_Π, where V^π(p) is the accumulated reward of policy π with respect to some unknown initial state distribution p, π* is the optimal policy, and ε_Π ≥ 0 is a measure of the expressivity of Π (see Section 4.2 for the definition).

Our reduction is built on a refinement of online learning, called Continuous Online Learning (COL), which was proposed to model problems where loss gradients across rounds change continuously with the learner's decisions [14]. COL has a strong connection to equilibrium problems (EPs) [15, 16], and any monotone EP (including our saddle-point problem of interest) can be framed as no-regret learning in a properly constructed COL problem [14]. Using this idea, our reduction follows naturally by first converting an RL problem to an EP and then the EP to a COL problem.

Framing RL as COL reveals new insights into the relationship between approximate solutions to the saddle-point problem and approximately optimal policies. Importantly, this new perspective shows that the RL problem can be separated into two parts: regret minimization and function approximation. The first part admits standard treatments from the online learning literature, and the second part can be quantified independently of the learning process. For example, one can accelerate learning by adopting optimistic online algorithms [17, 18] that account for the predictability in COL, without worrying about function approximators. Because of these problem-agnostic features, the proposed reduction can be used to systematically design efficient RL algorithms with performance guarantees.

As a demonstration, we design an RL algorithm based on arguably the simplest online learning algorithm: mirror descent. Assuming a generative model (in practice, it can be approximated by running a behavior policy with sufficient exploration [19]), we prove that, for any tabular Markov decision process (MDP), with probability at least 1 − δ, this algorithm learns an ε-optimal policy for the γ-discounted accumulated reward, using at most Õ(|S||A| log(1/δ) / ((1 − γ)^4 ε^2)) samples, where |S|, |A| are the sizes of the state and action spaces, and γ is the discount rate. Furthermore, thanks to the separation property above, our algorithm admits a natural extension with linearly parameterized function approximators, whose sample and per-round computation complexities are linear in the number of parameters and independent of |S|, |A|, though at the cost of policy performance bias due to approximation error.

This sample complexity improves the current best provable rate of the saddle-point RL setup [3–6] by a large factor of |S|^2 / (1 − γ)^2, without making any assumption on the MDP ([5] has the same sample complexity but requires the MDP to be ergodic under any policy). This improvement is attributed to our new online-learning-style analysis that uses a cleverly selected comparator in the regret definition. While it is possible to devise a minor modification of the previous stochastic mirror descent algorithms, e.g. [5], achieving the same rate with our new analysis, we remark that our algorithm is considerably simpler and removes a projection required in previous work [3–6].

Finally, we do note that the same sample complexity can also be achieved, e.g., by model-based RL and (phased) Q-learning [19, 20]. However, these methods either have super-linear runtime, with no obvious route for improvement, or could become unstable when using function approximators without further assumption.

2 SETUP & PRELIMINARIES

Let S and A be state and action spaces, which can be discrete or continuous. We consider γ-discounted infinite-horizon problems for γ ∈ [0, 1). Our goal is to find a policy π(a|s) that maximizes the discounted average return V^π(p) := E_{s∼p}[V^π(s)], where p is the initial state distribution,

    V^π(s) := (1 − γ) E_{ξ∼ρ^π(s)} [ Σ_{t=0}^∞ γ^t r(s_t, a_t) ]    (1)

is the value function of π at state s, r : S × A → [0, 1] is the reward function, and ρ^π(s) is the distribution of trajectories ξ = s_0, a_0, s_1, ... generated by running π from s_0 = s in an MDP. We assume that the initial distribution p(s_0), the transition P(s′|s, a), and the reward function r(s, a) in the MDP are unknown but can be queried through a generative model, i.e. we can sample s_0 from p, s′ from P, and r(s, a) for any s ∈ S and a ∈ A. We remark that the definition of V^π in (1) contains a (1 − γ) factor. We adopt this setup to make writing more compact. We denote the optimal policy as π* and its value function as V* for short.

2.1 Duality in RL

Our reduction is based on the linear-program (LP) formulation of RL. We provide a short recap here (please see Appendix A and [21] for details).

To show how max_π V^π(p) can be framed as an LP, let us define the average state distribution under π, d^π(s) := (1 − γ) Σ_{t=0}^∞ γ^t d_t^π(s), where d_t^π is the state distribution at time t visited by running π from p (e.g. d_0^π = p). By construction, d^π satisfies the stationarity property

    d^π(s′) = (1 − γ) p(s′) + γ E_{s∼d^π} E_{a∼π|s} [P(s′|s, a)].    (2)

With d^π, we can write V^π(p) = E_{s∼d^π} E_{a∼π|s} [r(s, a)] and our objective max_π V^π(p) equivalently as:

    max_{µ ∈ R^{|S||A|} : µ ≥ 0}  r^⊤ µ    s.t.  (1 − γ) p + γ P^⊤ µ = E^⊤ µ    (3)

where r ∈ R^{|S||A|}, p ∈ R^{|S|}, and P ∈ R^{|S||A|×|S|} are vector forms of r, p, and P, respectively, and E = I ⊗ 1 ∈ R^{|S||A|×|S|} (we use |·| to denote the cardinality of a set, ⊗ the Kronecker product, I ∈ R^{|S|×|S|} is the identity, and 1 ∈ R^{|A|} the vector of ones). In (3), S and A may seem to have finite cardinalities, but the same formulation extends to countable or even continuous spaces (under proper regularity assumptions; see [22]). We adopt this abuse of notation (emphasized by bold-faced symbols) for compactness.

The variable µ of the LP in (3) resembles a joint distribution d^π(s)π(a|s). To see this, notice that the constraint in (3) is reminiscent of (2), and implies ||µ||_1 = 1, i.e. µ is a probability distribution. One can then show µ(s, a) = d^π(s)π(a|s) when the constraint is satisfied, which implies that (3) is the same as max_π V^π(p) and its solution µ* corresponds to µ*(s, a) = d^{π*}(s)π*(a|s) of the optimal policy π*.

As (3) is an LP, it suggests looking at its dual, which turns out to be the classic LP formulation of RL (our setup in (4) differs from the classic one in the (1 − γ) factor in the constraint, due to the average setup):

    min_{v ∈ R^{|S|}}  p^⊤ v    s.t.  (1 − γ) r + γ P v ≤ E v.    (4)

It can be verified that for all p > 0, the solution to (4) satisfies the Bellman equation [23] and therefore is the optimal value function v* (the vector form of V*). We note that, for any policy π, V^π by definition satisfies a stationarity property

    V^π(s) = E_{a∼π|s} [ (1 − γ) r(s, a) + γ E_{s′∼P|s,a} [V^π(s′)] ]    (5)

which can be viewed as a dual equivalent of (2) for d^π. Because, for any s ∈ S and a ∈ A, r(s, a) is in [0, 1], (5) implies V^π(s) lies in [0, 1] too.

2.2 Toward RL: the Saddle-Point Setup

The LP formulations above require knowing the probabilities p and P and are computationally inefficient. When only generative models are available (as in our setup), one can alternatively exploit the duality relationship between the two LPs in (3) and (4), and frame RL as a saddle-point problem [3]. Let us define

    a_v := r + (1/(1 − γ)) (γP − E) v    (6)

as the advantage function with respect to v (where v is not necessarily a value function). Then the Lagrangian connecting the two LPs can be written as

    L(v, µ) := p^⊤ v + µ^⊤ a_v,    (7)

which leads to the saddle-point formulation,

    min_{v ∈ V} max_{µ ∈ M} L(v, µ),    (8)

where the constraints are

    V = {v ∈ R^{|S|} : v ≥ 0, ||v||_∞ ≤ 1}    (9)
    M = {µ ∈ R^{|S||A|} : µ ≥ 0, ||µ||_1 ≤ 1}.    (10)

The solution to (8) is exactly (v*, µ*), but notice that extra constraints on the norm of µ and v are introduced in V, M, compared with (3) and (4). This is a common practice, which uses known bounds on the solutions of (3) and (4) (discussed above) to make the search spaces V and M in (8) compact and as small as possible so that optimization converges faster.

Having compact variable sets allows using first-order stochastic methods, such as stochastic mirror descent and mirror-prox [24, 25], to efficiently solve the problem. These methods only require using the generative model to compute unbiased estimates of the gradients ∇_v L = b_µ and ∇_µ L = a_v, where we define

    b_µ := p + (1/(1 − γ)) (γP − E)^⊤ µ    (11)

as the balance function with respect to µ. b_µ measures whether µ violates the stationarity constraint in (3) and can be viewed as the dual of a_v. When the state or action space is too large, one can resort to function approximators to represent v and µ, which are often realized by linear basis functions for the sake of analysis [10].
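
For a tabular problem, the Lagrangian (7) and the two gradients in (6) and (11) are one-liners; the sketch below (our own helper functions, reusing the vector-form conventions of the previous snippet) is only meant to make the roles of a_v and b_µ explicit.

    import numpy as np

    def advantage(v, r, P, E, gamma):
        """a_v = r + (gamma*P - E) v / (1 - gamma), as in (6); equals grad_mu L(v, mu)."""
        return r + (gamma * P - E) @ v / (1 - gamma)

    def balance(mu, p, P, E, gamma):
        """b_mu = p + (gamma*P - E)^T mu / (1 - gamma), as in (11); equals grad_v L(v, mu).
        It is zero exactly when mu satisfies the stationarity constraint of (3)."""
        return p + (gamma * P - E).T @ mu / (1 - gamma)

    def lagrangian(v, mu, r, p, P, E, gamma):
        """L(v, mu) = p^T v + mu^T a_v, as in (7)."""
        return p @ v + mu @ advantage(v, r, P, E, gamma)

For instance, with the µ constructed in the previous snippet, balance(mu, p, P, E, gamma) is numerically zero, reflecting that a µ built from (d^π, π) is feasible for (3).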

2.3 COL and EPs

Finally, we review the COL setup in [14], which we will use to design the reduction from the saddle-point problem in (8) to online learning in the next section.

Recall that an online learning problem describes the iterative interactions between a learner and an opponent. In round n, the learner chooses a decision x_n from a decision set X, the opponent chooses a per-round loss function l_n : X → R based on the learner's decisions, and then information about l_n (e.g. its gradient ∇l_n(x_n)) is revealed to the learner. The performance of the learner is usually measured in terms of regret with respect to some x′ ∈ X,

    Regret_N(x′) := Σ_{n=1}^N l_n(x_n) − Σ_{n=1}^N l_n(x′).

When l_n is convex and X is compact and convex, many no-regret (i.e. Regret_N(x′) = o(N)) algorithms are available, such as mirror descent and follow-the-regularized-leader [26–28].

COL is a subclass of online learning problems where the loss sequence changes continuously with respect to the played decisions of the learner [14]. In COL, the opponent is equipped with a bifunction f : (x, x′) ↦ f_x(x′), where, for any fixed x′ ∈ X, ∇f_x(x′) is continuous in x ∈ X. The opponent selects per-round losses based on f, but the learner does not know f: in round n, if the learner chooses x_n, the opponent sets

    l_n(x) = f_{x_n}(x),    (12)

and returns, e.g., a stochastic estimate of ∇l_n(x_n) (the regret is still measured in terms of the noise-free l_n).

In [14], a natural connection is shown between COL and equilibrium problems (EPs). As EPs include the saddle-point problem of interest, we can use this idea to turn (8) into a COL problem. Recall an EP is defined as follows: Let X be compact and F : (x, x′) ↦ F(x, x′) be a bifunction such that, for all x, x′ ∈ X, F(·, x′) is continuous, F(x, ·) is convex, and F(x, x) ≥ 0 (we restrict ourselves to this convex and continuous case as it is sufficient for our problem setup). The problem EP(X, F) aims to find x⋆ ∈ X such that

    F(x⋆, x) ≥ 0,  ∀x ∈ X.    (13)

By its definition, a natural residual function to quantify the quality of an approximate solution x to the EP is r_ep(x) := − min_{x′∈X} F(x, x′), which describes the degree to which (13) is violated at x. We say a bifunction F is monotone if, ∀x, x′ ∈ X, F(x, x′) + F(x′, x) ≤ 0, and skew-symmetric if the equality holds.

EPs with monotone bifunctions represent general convex problems, including convex optimization problems, saddle-point problems, variational inequalities, etc. For instance, a convex-concave problem min_{y∈Y} max_{z∈Z} φ(y, z) can be cast as EP(X, F) with X = Y × Z and the skew-symmetric bifunction [29]

    F(x, x′) := φ(y′, z) − φ(y, z′),    (14)

where x = (y, z) and x′ = (y′, z′). In this case, r_ep(x) = max_{z′∈Z} φ(y, z′) − min_{y′∈Y} φ(y′, z) is the duality gap.

Cheng et al. [14] show that a learner achieves sublinear dynamic regret in COL if and only if the same algorithm can solve EP(X, F) with F(x, x′) = f_x(x′) − f_x(x). Concretely, they show that, given a monotone EP(X, F) with F(x, x) = 0 (which is satisfied by (14)), one can construct a COL problem by setting f_{x′}(x) := F(x′, x), i.e. l_n(x) = F(x_n, x), such that any no-regret algorithm can generate an approximate solution to the EP.

Proposition 1. [14] If F is skew-symmetric and l_n(x) = F(x_n, x), then r_ep(x̂_N) ≤ (1/N) Regret_N, where Regret_N = max_{x∈X} Regret_N(x), and x̂_N = (1/N) Σ_{n=1}^N x_n; the same guarantee holds also for the best decision in {x_n}_{n=1}^N.

3 AN ONLINE LEARNING VIEW

We present an alternate online-learning perspective on the saddle-point formulation in (8). This analysis paves the way for our reduction in the next section. By reduction, we mean realizing the two steps below:

1. Define a sequence of online losses such that any algorithm with sublinear regret can produce an approximate solution to the saddle-point problem.

2. Convert the approximate solution in the first step to an approximately optimal policy in RL.

Methods to achieve these two steps individually are not new. The reduction from convex-concave problems to no-regret online learning is well known [30]. Likewise, the relationship between the approximate solution of (8) and policy performance is also available; this is how the saddle-point formulation [5] works in the first place. So couldn't we just use these existing approaches? We argue that purely combining these two techniques fails to fully capture important structure that resides in RL. While this will be made precise in the later analyses, we highlight the main insights here.

Instead of treating (8) as an adversarial two-player online learning problem [30], we adopt the recent reduction to COL [14] reviewed in Section 2.3. The main difference is that the COL approach takes a single-player setup and retains the Lipschitz continuity in the source saddle-point problem. This single-player perspective is in some sense cleaner and, as we will show in Section 4.2, provides a simple setup to analyze the effects of function approximators. Additionally, due to continuity, the losses in COL are predictable and therefore make designing fast algorithms possible.

With the help of the COL reformulation, we study the relationship between the approximate solution to (8) and the performance of the associated policy in RL. We are able to establish a tight bound between the residual and the performance gap, resulting in a large improvement of |S|^2 / (1 − γ)^2 in sample complexity compared with the best bounds in the literature of the saddle-point setup, without adding extra constraints on X and assumptions on the MDP. Overall, this means that stronger sample complexity guarantees can be attained by simpler algorithms, as we demonstrate in Section 5.

The missing proofs of this section are in Appendix B.

3.1 The COL Formulation of RL

First, let us exercise the above COL idea with the saddle-point formulation of RL in (8). To construct the EP, we can let X = {x = (v, µ) : v ∈ V, µ ∈ M}, which is compact. According to (14), the bifunction F of the associated EP(X, F) is naturally given as

    F(x, x′) := L(v′, µ) − L(v, µ′) = p^⊤ v′ + µ^⊤ a_{v′} − p^⊤ v − µ′^⊤ a_v    (15)

which is skew-symmetric, and x* := (v*, µ*) is a solution to EP(X, F). This identification gives us a COL problem with the loss in the nth round defined as

    l_n(x) := p^⊤ v + µ_n^⊤ a_v − p^⊤ v_n − µ^⊤ a_{v_n}    (16)

where x_n = (v_n, µ_n). We see l_n is a linear loss. Moreover, because of the continuity in L, it is predictable, i.e. l_n can be (partially) inferred from past feedback as the MDP involved in each round is the same.
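
Spelled out in code (a sketch under the same conventions as the earlier snippets, with x = (v, µ) stored as a pair and using the lagrangian helper defined above), the bifunction (15) and the per-round COL loss (16) are direct transcriptions:

    def F_bifunction(x, x_prime, r, p, P, E, gamma):
        """F(x, x') = L(v', mu) - L(v, mu'), the skew-symmetric bifunction (15)."""
        (v, mu), (v_p, mu_p) = x, x_prime
        return (lagrangian(v_p, mu, r, p, P, E, gamma)
                - lagrangian(v, mu_p, r, p, P, E, gamma))

    def col_loss(x_n, r, p, P, E, gamma):
        """The linear per-round loss l_n(x) = F(x_n, x) of (16), returned as a function."""
        return lambda x: F_bifunction(x_n, x, r, p, P, E, gamma)

One can check numerically that F(x, x′) + F(x′, x) = 0 and F(x, x) = 0 for any pair of feasible points, which is the skew-symmetry used repeatedly in the proofs below.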

3.2 Policy Performance and Residual

By Proposition 1, any no-regret algorithm, when applied to (16), provides guarantees in terms of the residual function r_ep(x) of the EP. But this is not the end of the story. We also need to relate the learner decision x ∈ X to a policy π in RL and then convert bounds on r_ep(x) back to the policy performance V^π(p). Here we follow the common rule in the literature and associate each x = (v, µ) ∈ X with a policy π_µ defined as

    π_µ(a|s) ∝ µ(s, a).    (17)

In the following, we relate the residual r_ep(x) to the performance gap V*(p) − V^{π_µ}(p) through a relative performance measure defined as

    r_ep(x; x′) := F(x, x) − F(x, x′) = −F(x, x′)    (18)

for x, x′ ∈ X, where the last equality follows from the skew-symmetry of F in (15). Intuitively, we can view r_ep(x; x′) as comparing the performance of x with respect to the comparator x′ under an optimization problem proposed by x, e.g. we have l_n(x_n) − l_n(x′) = r_ep(x_n; x′). And by the definition in (18), it holds that r_ep(x; x′) ≤ max_{x′∈X} −F(x, x′) = r_ep(x).

We are looking for inequalities of the form V*(p) − V^{π_µ}(p) ≤ κ(r_ep(x; x′)) that hold for all x ∈ X with some strictly increasing function κ and some x′ ∈ X, so we can get non-asymptotic performance guarantees once we combine the two steps described at the beginning of this section. For example, by directly applying results of [14] to the COL in (16), we get V*(p) − V^{π̂_N}(p) ≤ κ(Regret_N / N), where π̂_N is the policy associated with the average/best decision in {x_n}_{n=1}^N.

3.2.1 The Classic Result

Existing approaches (e.g. [4–6]) to the saddle-point formulation in (8) rely on the relative residual r_ep(x; x*) with respect to the optimal solution to the problem x*, which we restate in our notation.

Proposition 2. For any x = (v, µ) ∈ X, if E^⊤µ ≥ (1 − γ)p, then r_ep(x; x*) ≥ (1 − γ) min_s p(s) ||v* − v^{π_µ}||_∞.

Therefore, although the original saddle-point problem in (8) is framed using V and M, in practice an extra constraint, such as E^⊤µ ≥ (1 − γ)p, is added into M, i.e. these algorithms consider instead

    M′ = {µ ∈ R^{|S||A|} : µ ∈ M, E^⊤µ ≥ (1 − γ)p},    (19)

so that the marginal of the estimate µ can have the sufficient coverage required in Proposition 2. This condition is needed to establish non-asymptotic guarantees on the performance of the policy generated by µ [3, 5, 6], but it can sometimes be impractical to realize, e.g., when p is unknown. Without it, extra assumptions (like ergodicity [5]) on the MDP are needed.

However, Proposition 2 is undesirable for a number of reasons. First, the bound is quite conservative, as it concerns the uniform error ||v* − v^{π_µ}||_∞ whereas the objective in RL is about the gap V*(p) − V^{π_µ}(p) = p^⊤(v* − v^{π_µ}) with respect to the initial distribution p (i.e. a weighted error). Second, the constant term (1 − γ) min_s p(s) can be quite small (e.g. when p is uniform, it is (1 − γ)/|S|), which can significantly amplify the error in the residual. Because a no-regret algorithm typically decreases the residual in O(N^{−1/2}) after seeing N samples, the factor of (1 − γ)/|S| earlier would turn into a multiplier of |S|^2 / (1 − γ)^2 in sample complexity. This makes existing saddle-point approaches sample inefficient in comparison with other RL methods like Q-learning [20]. Lastly, enforcing E^⊤µ ≥ (1 − γ)p requires knowing p (which is unavailable in our setup) and adds extra projection steps during optimization. When p is unknown, while it is possible to modify this constraint to use a uniform distribution, this might worsen the constant factor and could introduce bias.

One may conjecture that the bound in Proposition 2 could perhaps be tightened by better analyses. However, we prove this is impossible in general.

Proposition 3. There is a class of MDPs such that, for some x ∈ X, Proposition 2 is an equality.

We note that Proposition 3 does not hold for all MDPs. Indeed, if one makes stronger assumptions on the MDP, such as that the Markov chain induced by every policy is ergodic [5], then it is possible to show, for all x ∈ X, r_ep(x; x*) = c ||v* − v^{π_µ}||_∞ for some constant c independent of γ and |S|, when one constrains E^⊤µ ≥ (1 − γ + γ√c) p. Nonetheless, this construct still requires adding an undesirable constraint to X.

3.2.2 Curse of Covariate Shift

Why does this happen? We can view this issue as a form of covariate shift, i.e. a mismatch between distributions. To better understand it, we notice a simple equality, which has often been used implicitly, e.g. in the technical proofs of [5].

Lemma 1. For any x = (v, µ), if x′ ∈ X satisfies (2) and (5) (i.e. v′ and µ′ are the value function and state-action distribution of policy π_{µ′}), then r_ep(x; x′) = −µ^⊤ a_{v′}.

Lemma 1 implies r_ep(x; x*) = −µ^⊤ a_{v*}, which is non-negative. This term is similar to an equality called the performance difference lemma [31, 32].

Lemma 2. Let v^π and µ^π denote the value and state-action distribution of some policy π. Then for any function v′, it holds that p^⊤(v^π − v′) = (µ^π)^⊤ a_{v′}. In particular, it implies V^π(p) − V^{π′}(p) = (µ^π)^⊤ a_{v^{π′}}.
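
Lemma 2 is easy to sanity-check numerically. The sketch below (ours) builds a small random tabular MDP, computes v^π and µ^π by solving the stationarity equations (5) and (2) as linear systems, and compares both sides of the identity for an arbitrary v′.

    import numpy as np

    rng = np.random.default_rng(1)
    S, A, gamma = 5, 3, 0.9
    P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
    r = rng.random((S, A))
    p = rng.random(S); p /= p.sum()
    pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)

    P_pi = np.einsum('sa,sat->st', pi, P)       # state transitions under pi
    r_pi = (pi * r).sum(axis=1)                 # expected one-step reward under pi
    v_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, (1 - gamma) * r_pi)    # (5)
    d_pi = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * p)     # (2)
    mu_pi = (d_pi[:, None] * pi).reshape(S * A)

    v_prime = rng.random(S)                     # any function of states
    a_vprime = (r + (gamma * P @ v_prime - v_prime[:, None]) / (1 - gamma)).reshape(S * A)

    print(np.isclose(p @ (v_pi - v_prime), mu_pi @ a_vprime))   # True: Lemma 2

Setting v′ = v* and π = π_µ recovers the expression for the performance gap used in the discussion that follows.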

From Lemmas 1 and 2, we see that the difference between the residual r_ep(x; x*) = −µ^⊤ a_{v*} and the performance gap V^{π_µ}(p) − V^{π*}(p) = (µ^{π_µ})^⊤ a_{v*} is due to the mismatch between µ and µ^{π_µ}, or more specifically, the mismatch between the two marginals d = E^⊤µ and d^{π_µ} = E^⊤µ^{π_µ}. Indeed, when d = d^{π_µ}, the residual is equal to the performance gap. However, in general, we do not have control over that difference for the sequence of variables {x_n = (v_n, µ_n) ∈ X} an algorithm generates. The sufficient condition in Proposition 2 attempts to mitigate the difference, using the fact d^{π_µ} = (1 − γ)p + γ P_{π_µ}^⊤ d^{π_µ} from (2), where P_{π_µ} is the transition matrix under π_µ. But the missing half γ P_{π_µ}^⊤ d^{π_µ} (due to the long-term effects in the MDP) introduces the unavoidable, weak constant (1 − γ) min_s p(s), if we want to have a uniform bound on ||v* − v^{π_µ}||_∞. The counterexample in Proposition 3 was designed to maximize the effect of covariate shift, so that µ fails to capture state-action pairs with high advantage. To break the curse, we must properly weight the gap between v* and v^{π_µ} instead of relying on the uniform bound on ||v* − v^{π_µ}||_∞ as before.

4 THE REDUCTION

The analyses above reveal both good and bad properties of the saddle-point setup in (8). On the one hand, we showed that approximate solutions to the saddle-point problem in (8) can be obtained by running any no-regret algorithm in the single-player COL problem defined in (16); many efficient algorithms are available from the online learning literature. On the other hand, we also discovered a root difficulty in converting an approximate solution of (8) to an approximately optimal policy in RL (Proposition 2), even after imposing strong conditions like (19). At this point, one may wonder if the formulation based on (8) is fundamentally sample inefficient compared with other approaches to RL, but this is actually not true.

Our main contribution shows that learning a policy through running a no-regret algorithm in the COL problem in (16) is, in fact, as sample efficient in policy performance as other RL techniques, even without the common constraint in (19) or extra assumptions on the MDP like ergodicity imposed in the literature.

Theorem 1. Let X_N = {x_n ∈ X}_{n=1}^N be any sequence. Let π̂_N be the policy given by x̂_N via (17), which is either the average or the best decision in X_N. Define y*_N := (v^{π̂_N}, µ*). Then V^{π̂_N}(p) ≥ V*(p) − Regret_N(y*_N) / N.

Theorem 1 shows that if X_N has sublinear regret, then both the average policy and the best policy in X_N converge to the optimal policy in performance with a rate O(Regret_N(y*_N)/N). Compared with existing results obtained through Proposition 2, the above result removes the factor (1 − γ) min_s p(s) and imposes no assumption on X_N or the MDP. Indeed, Theorem 1 holds for any sequence. For example, when X_N is generated by stochastic feedback of l_n, Theorem 1 continues to hold, as the regret is defined in terms of l_n, not of the sampled loss. Stochasticity only affects the regret rate.

In other words, we have shown that when µ and v can be directly parameterized, an approximately optimal policy for the RL problem can be obtained by running any no-regret online learning algorithm, and that the policy quality is simply dictated by the regret rate. To illustrate, in Section 5 we will prove that simply running mirror descent in this COL produces an RL algorithm that is as sample efficient as other common RL techniques. One can further foresee that algorithms leveraging the continuity in COL (e.g. mirror-prox [25] or PicCoLO [18]) and variance reduction can lead to more sample efficient RL algorithms.

Below we will also demonstrate how to use the fact that COL is single-player (see Section 2.3) to cleanly incorporate the effects of using function approximators to model µ and v. We will present a corollary of Theorem 1, which separates the problem of learning µ and v, and that of approximating M and V with function approximators. The first part is controlled by the rate of regret in online learning, and the second part depends only on the chosen class of function approximators, independently of the learning process. As these properties are agnostic to problem setups and algorithms, our reduction leads to a framework for systematic synthesis of new RL algorithms with performance guarantees. The missing proofs of this section are in Appendix C.
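
Theorem 1 can be summarized as a meta-algorithm: run any no-regret online learner on the losses l_n of (16), average (or take the best of) its iterates, and read off the policy with (17). The schematic sketch below is ours; OnlineLearner stands in for any no-regret method and is not an interface defined in the paper.

    import numpy as np

    def run_reduction(online_learner, loss_grad, N):
        """Schematic RL-to-online-learning reduction of Theorem 1.

        online_learner.update(g) consumes a (possibly stochastic) gradient of the
        current per-round loss l_n at the current iterate and returns the next
        iterate x_n = (v_n, mu_n); any no-regret method can be plugged in.
        """
        vs, mus = [], []
        x = online_learner.current()
        for _ in range(N):
            g = loss_grad(x)              # feedback on l_n (sampled as in Section 5)
            x = online_learner.update(g)
            vs.append(x[0]); mus.append(x[1])
        return np.mean(vs, axis=0), np.mean(mus, axis=0)    # the average decision x_hat_N

    def policy_from_mu(mu_hat, S, A):
        """Extract pi_mu(a|s) proportional to mu(s, a), as in (17)."""
        mu = np.maximum(mu_hat.reshape(S, A), 1e-12)        # guard states with zero mass
        return mu / mu.sum(axis=1, keepdims=True)

Whatever regret bound the plugged-in learner enjoys, together with Theorem 1, directly becomes a performance guarantee for the returned policy.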

4.1 Proof of Theorem 1

The main insight of our reduction is to adopt, in defining r_ep(x; x′), a comparator x′ ∈ X based on the output of the algorithm (represented by x), instead of the fixed comparator x* (the optimal pair of value function and state-action distribution) that has been used conventionally, e.g. in Proposition 2. While this idea seems unnatural from the standard saddle-point or EP perspective, it is possible, because the regret in online learning is measured against the worst-case choice in X, which is allowed to be selected in hindsight. Specifically, we propose to select the following comparator to directly bound V*(p) − V^{π̂_N}(p) instead of the conservative measure ||V* − V^{π̂_N}||_∞ used before.

Proposition 4. For x = (v, µ) ∈ X, define y*_x := (v^{π_µ}, µ*) ∈ X. It holds that r_ep(x; y*_x) = V*(p) − V^{π_µ}(p).

To finish the proof, let x̂_N be either (1/N) Σ_{n=1}^N x_n or argmin_{x∈X_N} r_ep(x; y*_x), and let π̂_N denote the policy given by (17). First, V*(p) − V^{π̂_N}(p) = r_ep(x̂_N; y*_N) by Proposition 4. Next we follow the proof idea of Proposition 1 in [14]: because F is skew-symmetric and F(y*_N, ·) is convex, we have by (18)

    V*(p) − V^{π̂_N}(p) = r_ep(x̂_N; y*_N) = −F(x̂_N, y*_N)
                       = F(y*_N, x̂_N) ≤ (1/N) Σ_{n=1}^N F(y*_N, x_n)
                       = (1/N) Σ_{n=1}^N −F(x_n, y*_N) = (1/N) Regret_N(y*_N).

4.2 Function Approximators

When the state and action spaces are large or continuous, directly optimizing v and µ can be impractical. Instead we can consider optimizing over a subset of feasible choices parameterized by function approximators

    X_Θ = {x_θ = (φ_θ, ψ_θ) : ψ_θ ∈ M, θ ∈ Θ},    (20)

where φ_θ and ψ_θ are functions parameterized by θ ∈ Θ, and Θ is a parameter set. Because COL is a single-player setup, we can extend the previous idea and Theorem 1 to provide performance bounds in this case by a simple rearrangement (see Appendix C), which is a common trick used in the online imitation learning literature [33–35]. Notice that, in (20), we require only ψ_θ ∈ M, but not φ_θ ∈ V, because for the performance bound in our reduction to hold, we only need the constraint M (see Lemma 4 in the proof of Proposition 4).

Corollary 1. Let X_N = {x_n ∈ X_Θ}_{n=1}^N be any sequence. Let π̂_N be the policy given either by the average or the best decision in X_N. It holds that

    V^{π̂_N}(p) ≥ V*(p) − Regret_N(Θ)/N − ε_{Θ,N}

where ε_{Θ,N} = min_{x_θ∈X_Θ} r_ep(x̂_N; y*_N) − r_ep(x̂_N; x_θ) measures the expressiveness of X_Θ, and Regret_N(Θ) := Σ_{n=1}^N l_n(x_n) − min_{x∈X_Θ} Σ_{n=1}^N l_n(x).

We can quantify ε_{Θ,N} with the basic Hölder's inequality.

Proposition 5. Let x̂_N = (v̂_N, µ̂_N). Under the setup in Corollary 1, regardless of the parameterization, it is true that ε_{Θ,N} is no larger than

    min_{(v_θ,µ_θ)∈X_Θ}  ||µ_θ − µ*||_1 / (1 − γ) + min_{w : w ≥ 1} ||b_{µ̂_N}||_{1,w} ||v_θ − v^{π̂_N}||_{∞,1/w}
    ≤ min_{(v_θ,µ_θ)∈X_Θ}  (1/(1 − γ)) ( ||µ_θ − µ*||_1 + 2 ||v_θ − v^{π̂_N}||_∞ ),

where the norms are defined as ||x||_{1,w} = Σ_i w_i |x_i| and ||x||_{∞,1/w} = max_i w_i^{−1} |x_i|.

Proposition 5 says ε_{Θ,N} depends on how well X_Θ captures the value function of the output policy v^{π̂_N} and the optimal state-action distribution µ*. We remark that this result is independent of how v^{π̂_N} is generated. Furthermore, Proposition 5 makes no assumption whatsoever on the structure of the function approximators. It even allows sharing parameters θ between v = φ_θ and µ = ψ_θ, e.g., they can be a bi-headed neural network, which is common for learning shared feature representations. More precisely, the structure of the function approximator would only affect whether l_n((φ_θ, ψ_θ)) remains a convex function in θ, which determines the difficulty of designing algorithms with sublinear regret.

In other words, the proposed COL formulation provides a reduction which dictates the policy performance with two separate factors: 1) the rate of regret Regret_N(Θ), which is controlled by the choice of online learning algorithm; 2) the approximation error ε_{Θ,N}, which is determined by the choice of function approximators. These two factors can almost be treated independently, except that the choice of function approximators would determine the properties of l_n((φ_θ, ψ_θ)) as a function of θ, and the choice of Θ needs to ensure (20) is admissible.
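
For concreteness, one admissible choice of X_Θ in (20) for a linear architecture is sketched below; this is our own illustration (not the construction analyzed in Appendix E), assuming the columns of Φ lie in V and the columns of Ψ are state-action distributions in M, so that any convex combination Ψθ_µ automatically stays in M.

    import numpy as np

    def make_x_theta(Phi, Psi):
        """Linear parameterization x_theta = (Phi theta_v, Psi theta_mu) for (20).

        Phi: (S, k_v) matrix of value-function bases (each column in V).
        Psi: (S*A, k_mu) matrix whose columns are state-action distributions in M.
        Keeping theta_mu on the probability simplex (e.g. with a KL mirror step)
        makes Psi @ theta_mu a convex combination of distributions, hence in M.
        """
        def x_theta(theta_v, theta_mu):
            return Phi @ theta_v, Psi @ theta_mu
        return x_theta

With this choice the losses l_n((φ_θ, ψ_θ)) remain linear in θ, and the approximation error ε_{Θ,N} of Corollary 1 is governed by how well the spans of Φ and Ψ cover v^{π̂_N} and µ*, as quantified by Proposition 5.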

5 SAMPLE COMPLEXITY OF MIRROR DESCENT

We demonstrate the power of our reduction by applying perhaps the simplest online learning algorithm, mirror descent, to the proposed COL problem in (16) with stochastic feedback (Algorithm 1). For transparency, we discuss the tabular setup. We will show a natural extension to basis functions at the end.

Algorithm 1  Mirror descent for RL
  Input: ε (target optimality of the γ-average return), δ (maximal failure probability), a generative model of the MDP
  Output: π̂_N = π_{µ̂_N}
  1: x_1 = (v_1, µ_1) where µ_1 is uniform and v_1 ∈ V
  2: Set N = Ω̃(|S||A| log(1/δ) / ((1 − γ)^2 ε^2)) and η = (1 − γ)(|S||A|N)^{−1/2}
  3: Set the Bregman divergence as (22)
  4: for n = 1 ... N − 1 do
  5:    Sample g_n according to (24)
  6:    Update to x_{n+1} according to (21)
  7: end for
  8: Set (v̂_N, µ̂_N) = x̂_N = (1/N) Σ_{n=1}^N x_n

Recall that mirror descent is a first-order algorithm, whose update rule can be written as

    x_{n+1} = argmin_{x∈X} ⟨g_n, x⟩ + (1/η) B_R(x||x_n)    (21)

where η > 0 is the step size, g_n is the feedback direction, and B_R(x||x′) = R(x) − R(x′) − ⟨∇R(x′), x − x′⟩ is the Bregman divergence generated by a strictly convex function R. Based on the geometry of X = V × M, we consider a natural Bregman divergence of the form

    B_R(x′||x) = (1/(2|S|)) ||v′ − v||_2^2 + KL(µ′||µ).    (22)

This choice mitigates the effects of dimension (e.g. if we set x_1 = (v_1, µ_1) with µ_1 being the uniform distribution, it holds that B_R(x′||x_1) = Õ(1) for any x′ ∈ X).

To define the feedback direction g_n, we slightly modify the per-round loss l_n in (16) and consider a new loss

    h_n(x) := b_{µ_n}^⊤ v + µ^⊤ ( (1/(1 − γ)) 1 − a_{v_n} )    (23)

that shifts l_n by a constant, where 1 is the vector of ones. One can verify that l_n(x) − l_n(x′) = h_n(x) − h_n(x′), for all x, x′ ∈ X when µ, µ′ in x and x′ satisfy ||µ||_1 = ||µ′||_1 (which holds for Algorithm 1). Therefore, using h_n does not change the regret. The reason for using h_n instead of l_n is to make ∇_µ h_n((v, µ)) (and its unbiased approximation) a positive vector, so the regret bound can have a better dimension dependency. This is a common trick used in online learning (e.g. EXP3 [36]) for optimizing variables living in a simplex (µ here).

We set the first-order feedback g_n as an unbiased sampled estimate of ∇h_n(x_n). In round n, this is realized by two independent calls of the generative model:

    g_n = [ p̃_n + (1/(1 − γ)) (γ P̃_n − E_n)^⊤ µ̃_n ;
            |S||A| ( (1/(1 − γ)) 1̂_n − r̂_n − (1/(1 − γ)) (γ P̂_n − Ê_n) v_n ) ]    (24)

Let g_n = [g_{n,v}; g_{n,µ}]. For g_{n,v}, we sample p, sample µ_n to get a state-action pair, and query the transition P at the state-action pair sampled from µ_n. (p̃_n, P̃_n, and µ̃_n denote the single-sample estimates of these probabilities.) For g_{n,µ}, we first sample uniformly a state-action pair (which explains the factor |S||A|), and then query the reward r and the transition P. (1̂_n, r̂_n, P̂_n, and Ê_n denote the single-sample estimates.) To emphasize, we use ˜ and ˆ to distinguish the empirical quantities obtained by these two independent queries. By construction, we have g_{n,µ} ≥ 0. It is clear that this direction g_n is unbiased, i.e. E[g_n] = ∇h_n(x_n). Moreover, it is extremely sparse and can be computed using O(1) sample, computational, and memory complexities.

Below we show this algorithm, despite being extremely simple, has strong theoretical guarantees. In other words, we obtain simpler versions of the algorithms proposed in [3, 5, 10] but with improved performance.

Theorem 2. With probability 1 − δ, Algorithm 1 learns an ε-optimal policy with Õ(|S||A| log(1/δ) / ((1 − γ)^2 ε^2)) samples.

Note that the above statement makes no assumption on the MDP (except the tabular setup for simplifying the analysis). Also, because the definition of the value function in (1) is scaled by a factor (1 − γ), the above result translates into a sample complexity in Õ(|S||A| log(1/δ) / ((1 − γ)^4 ε^2)) for the conventional discounted accumulated rewards.

5.1 Proof Sketch of Theorem 2

The proof is based on the basic property of mirror descent and martingale concentration. We provide a sketch here; please refer to Appendix D for details. Let y*_N = (v^{π̂_N}, µ*). We bound the regret in Theorem 1 by the following rearrangement, where the first equality below is because h_n is a constant shift of l_n:

    Regret_N(y*_N) = Σ_{n=1}^N h_n(x_n) − Σ_{n=1}^N h_n(y*_N)
                   ≤ ( Σ_{n=1}^N (∇h_n(x_n) − g_n)^⊤ x_n ) + max_{x∈X} ( Σ_{n=1}^N g_n^⊤ (x_n − x) )
                     + ( Σ_{n=1}^N (g_n − ∇h_n(x_n))^⊤ y*_N ).

We recognize the first term is a martingale, because x_n does not depend on g_n. Therefore, we can appeal to a Bernstein-type martingale concentration and prove it is in Õ(√(N|S||A| log(1/δ)) / (1 − γ)). For the second term, by treating g_n^⊤ x as the per-round loss, we can use standard regret analysis of mirror descent and show a bound in Õ(√(N|S||A|) / (1 − γ)). For the third term, because v^{π̂_N} in y*_N = (v^{π̂_N}, µ*) depends on {g_n}_{n=1}^N, it is not a martingale. Nonetheless, we are able to handle it through a union bound and show it is again no more than Õ(√(N|S||A| log(1/δ)) / (1 − γ)). Despite the union bound, it does not increase the rate because we only need to handle v^{π̂_N}, not µ*, which induces a martingale. To finish the proof, we substitute this high-probability regret bound into Theorem 1 to obtain the desired claim.
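
To make the update in Algorithm 1 concrete, here is a dense, readable sketch of one round in the tabular case (ours); with the Bregman divergence (22), the v-block of (21) is a Euclidean gradient step clipped to V and the µ-block is an entropic (multiplicative-weights) step renormalized onto the simplex. The actual algorithm exploits the sparsity of g_n from (24) to make each update O(1); that bookkeeping is omitted here.

    import numpy as np

    def mirror_descent_step(v, mu, g_v, g_mu, eta, S):
        """One update of (21) under the Bregman divergence (22).

        v-block:  (1/(2|S|))||.||_2^2 geometry  ->  v <- clip(v - eta*|S|*g_v, 0, 1)
        mu-block: KL geometry on the simplex    ->  mu <- mu*exp(-eta*g_mu), renormalized
        """
        v_next = np.clip(v - eta * S * g_v, 0.0, 1.0)   # projection onto V = {0 <= v <= 1}
        w = mu * np.exp(-eta * g_mu)
        mu_next = w / w.sum()                           # Algorithm 1 keeps ||mu||_1 = 1
        return v_next, mu_next

Sampling g_v and g_µ exactly as described around (24) keeps E[g_n] = ∇h_n(x_n), which is all the regret analysis above needs.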

5.2 Extension to Function Approximators

The above algorithm assumes the tabular setup for illustration purposes. In Appendix E, we describe a direct extension of Algorithm 1 that uses linearly parameterized function approximators of the form x_θ = (Φθ_v, Ψθ_µ), where the columns of the bases Φ, Ψ belong to V and M, respectively, and (θ_v, θ_µ) ∈ Θ.

Overall the algorithm stays the same, except the gradient is computed by the chain rule, which can be done in O(dim(Θ)) time and space. While this seems worse, the computational complexity per update actually improves to O(dim(Θ)) from the slow O(|S||A|) (required before for the projection in (21)), as now we only optimize in Θ. Moreover, we prove that its sample complexity is also better, though at the cost of the bias ε_{Θ,N} in Corollary 1. Therefore, the algorithm becomes applicable to large-scale or continuous problems.

Theorem 3. Under a proper choice of Θ and B_R, with probability 1 − δ, Algorithm 1 learns an (ε + ε_{Θ,N})-optimal policy with Õ(dim(Θ) log(1/δ) / ((1 − γ)^2 ε^2)) samples.

The proof is in Appendix E, and mainly follows Section 5.1. First, we choose some Θ to satisfy (20) so we can use Corollary 1 to reduce the problem into regret minimization. To make the sample complexity independent of |S|, |A|, the key is to uniformly sample over the columns of Ψ (instead of over all states and actions as in (24)) when computing unbiased estimates of ∇_{θ_µ} h_n((θ_v, θ_µ)). The intuition is that we should only focus on the places our basis functions care about (of size dim(Θ)), instead of wasting efforts to visit all possible combinations (of size |S||A|).

6 CONCLUSION

We propose a reduction from RL to no-regret online learning that provides a systematic way to design new RL algorithms with performance guarantees. Compared with existing approaches, our framework makes no assumption on the MDP and naturally works with function approximators. To illustrate, we design a simple RL algorithm based on mirror descent; it achieves similar sample complexity as other RL techniques, but uses minimal assumptions on the MDP and is scalable to large or continuous problems. This encouraging result evidences the strength of the online learning perspective. As future work, we believe even faster learning in RL is possible by leveraging control variates for variance reduction and by applying more advanced online techniques [17, 18] that exploit the continuity in COL to predict future gradients.

Acknowledgements

This research is partially supported by an NVIDIA Graduate Fellowship.

References

[1] Alan S Manne et al. Linear programming and sequential decision models. Technical report, Cowles Foundation for Research in Economics, Yale University, 1959.

[2] Eric V Denardo and Bennett L Fox. Multichain Markov renewal programs. SIAM Journal on Applied Mathematics, 16(3):468–487, 1968.

[3] Mengdi Wang and Yichen Chen. An online primal-dual method for discounted Markov decision processes. In Conference on Decision and Control, pages 4516–4521. IEEE, 2016.

[4] Yichen Chen and Mengdi Wang. Stochastic primal-dual methods and sample complexity of reinforcement learning. arXiv preprint arXiv:1612.02516, 2016.

[5] Mengdi Wang. Randomized linear programming solves the discounted Markov decision problem in nearly-linear running time. arXiv e-prints, 2017.

[6] Donghwan Lee and Niao He. Stochastic primal-dual Q-learning. arXiv preprint arXiv:1810.08298, 2018.

[7] Mengdi Wang. Primal-dual π learning: Sample complexity and sublinear run time for ergodic Markov decision problems. arXiv preprint arXiv:1710.06100, 2017.

[8] Qihang Lin, Selvaprabu Nadarajah, and Negar Soheili. Revisiting approximate linear programming using a saddle point based reformulation and root finding solution approach. Technical report, working paper, U. of Il. at Chicago and U. of Iowa, 2017.

[9] Bo Dai, Albert Shaw, Niao He, Lihong Li, and Le Song. Boosting the actor with dual critic. In International Conference on Learning Representations, 2018.

[10] Yichen Chen, Lihong Li, and Mengdi Wang. Scalable bilinear π learning using state and action features. arXiv preprint arXiv:1804.10328, 2018.

[11] Chandrashekar Lakshminarayanan, Shalabh Bhatnagar, and Csaba Szepesvári. A linearly relaxed approximate linear program for Markov decision processes. IEEE Transactions on Automatic Control, 63(4):1185–1191, 2018.

[12] Geoffrey J Gordon. Regret bounds for prediction problems. In Conference on Learning Theory, volume 99, pages 29–40, 1999.

[13] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In International Conference on Machine Learning, pages 928–936, 2003.

[14] Ching-An Cheng, Jonathan Lee, Ken Goldberg, and Byron Boots. Online learning with continuous variations: Dynamic regret and reductions. arXiv preprint arXiv:1902.07286, 2019.

[15] Eugen Blum. From optimization and variational inequalities to equilibrium problems. Math. Student, 63:123–145, 1994.

[16] M Bianchi and S Schaible. Generalized monotone bifunctions and equilibrium problems. Journal of Optimization Theory and Applications, 90(1):31–43, 1996.

[17] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. arXiv preprint arXiv:1208.3728, 2012.

[18] Ching-An Cheng, Xinyan Yan, Nathan Ratliff, and Byron Boots. Predictor-corrector policy optimization. In International Conference on Machine Learning, 2019.

[19] Michael J Kearns and Satinder P Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems, pages 996–1002, 1999.

[20] Sham Machandranath Kakade. On the sample complexity of reinforcement learning. PhD thesis, University of London, London, England, 2003.

[21] Martin L Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014.

[22] Onésimo Hernández-Lerma and Jean B Lasserre. Discrete-time Markov control processes: Basic optimality criteria, volume 30. Springer Science & Business Media, 2012.

[23] Richard Bellman. The theory of dynamic programming. Bulletin of the American Mathematical Society, 60(6):503–515, 1954.

[24] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[25] Anatoli Juditsky, Arkadi Nemirovski, and Claire Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17–58, 2011.

[26] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.

[27] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

[28] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

[29] Alejandro Jofré and Roger J-B Wets. Variational convergence of bifunctions: motivating applications. SIAM Journal on Optimization, 24(4):1952–1979, 2014.

[30] Jacob Abernethy, Peter L Bartlett, and Elad Hazan. Blackwell approachability and no-regret learning are equivalent. In Annual Conference on Learning Theory, pages 27–46, 2011.

[31] Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning, volume 99, pages 278–287, 1999.

[32] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, volume 2, pages 267–274, 2002.

[33] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.

[34] Ching-An Cheng and Byron Boots. Convergence of value aggregation for imitation learning. In International Conference on Artificial Intelligence and Statistics, 2018.

[35] Ching-An Cheng, Xinyan Yan, Evangelos A Theodorou, and Byron Boots. Accelerating imitation learning with predictive models. In International Conference on Artificial Intelligence and Statistics, 2019.

[36] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016. URL http://dblp.uni-trier.de/db/journals/ftopt/ftopt2.h

[37] Colin McDiarmid. Concentration. In Probabilistic Methods for Algorithmic Discrete Mathematics, pages 195–248. Springer, 1998.

Appendix

A Review of RL Setups

We provide an extended review of different formulations of RL for interested readers. First, let us recall the
problem setup. Let S and A be state and action spaces, and let π(a|s) denote a policy. For γ ∈ [0, 1), we are
interested in solving a γ-discounted infinite-horizon RL problem:
    max_π V^π(p),   s.t.   V^π(p) := (1 − γ) E_{s_0∼p} E_{ξ∼ρ^π(s_0)} [ Σ_{t=0}^∞ γ^t r(s_t, a_t) ]    (25)

where V π (p) is the discounted average return, r : S × A → [0, 1] is the reward function, ρπ (s0 ) denotes the
distribution of trajectory ξ = s0 , a0 , s1 , . . . generated by running π from state s0 in a Markov decision process
(MDP), and p is a fixed but unknown initial state distribution.

A.1 Coordinate-wise Formulations

RL in terms of stationary state distribution Let dπt (s) denote the state distribution at time t given by
running π starting from p. We define its γ-weighted mixture as

    d^π(s) := (1 − γ) Σ_{t=0}^∞ γ^t d_t^π(s)    (26)

We can view dπ in (26) as a form of stationary state distribution of π, because it is a valid probability distribution
of state and satisfies the stationarity property below,

dπ (s′ ) = (1 − γ)p(s′ ) + γEs∼dπ Ea∼π|s [P(s′ |s, a)] (2)

where P(s′ |s, a) is the transition probability of the MDP. The definition in (26) generalizes the concept of
stationary distribution of MDP; as γ → 1, dπ is known as the limiting average state distribution, which is the
same as the stationary distribution of the MDP under π, if one exists. Moreover, with the property in (2), dπ
summarizes the Markov structure of RL, and allows us to write (25) simply as

    max_π V^π(p),   s.t.   V^π(p) = E_{s∼d^π} E_{a∼π|s} [r(s, a)]    (27)

after commuting the order of expectation and summation. That is, an RL problem aims to maximize the expected
reward under the stationary state-action distribution generated by the policy π.

RL in terms of value function We can also write (25) in terms of value function. Recall

    V^π(s) := (1 − γ) E_{ξ∼ρ^π(s_0) | s_0 = s} [ Σ_{t=0}^∞ γ^t r(s_t, a_t) ]    (1)

is the value function of π. By definition, V π (like dπ ) satisfies a stationarity property

    V^π(s) = E_{a∼π|s} [ (1 − γ) r(s, a) + γ E_{s′∼P|s,a} [V^π(s′)] ]    (5)

which can be viewed as a dual equivalent of (2). Because r is in [0, 1], (5) implies V π lies in [0, 1].
The value function V ∗ (a shorthand of Vπ∗ ) of the optimal policy π ∗ of the RL problem satisfies the so-called
Bellman equation [23]: V ∗ (s) = maxa∈A (1 − γ)r(s, a) + γEs′ ∼P|s,a [V ∗ (s′ )], where the optimal policy π ∗ can be
recovered as the arg max. Equivalently, by the definition of max, the Bellman equation amounts to finding the
smallest V such that V (s) ≥ (1 − γ)r(s, a) + γEs′ ∼P|s,a [V (s′ )], ∀s ∈ S, a ∈ A. In other words, the RL problem
in (25) can be written as

    min_V E_{s∼p}[V(s)]   s.t.   V(s) ≥ (1 − γ) r(s, a) + γ E_{s′∼P|s,a}[V(s′)],   ∀s ∈ S, a ∈ A    (28)

A.2 Linear Programming Formulations

We now connect the above two alternate expressions through the classical LP setup of RL [1, 2].

LP in terms of value function The classic LP formulation5 is simply a restatement of (28):

    min_v p^⊤ v   s.t.   (1 − γ) r + γ P v ≤ E v    (4)

where p ∈ R|S| , v ∈ R|S| , and r ∈ R|S||A| are the vector forms of p, V , r, respectively, P ∈ R|S||A|×|S| is the
transition probability6 , and E = I ⊗ 1 ∈ R|S||A|×|S| (we use | · | to denote the cardinality of a set, ⊗ the Kronecker
product, I ∈ R|S|×|S| is the identity, and 1 ∈ R|A| a vector of ones). It is easy to verify that for all p > 0, the
solution to (4) is the same and equal to v∗ (the vector form of V ∗ ).
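
As a quick numerical check that (4) indeed recovers V*, one can hand the LP to an off-the-shelf solver and compare against value iteration on the (1 − γ)-scaled Bellman equation. The sketch below is ours and assumes the vector-form conventions above (P of shape (|S||A|, |S|), r of shape (|S||A|,), actions contiguous within a state); scipy.optimize.linprog is used only for illustration.

    import numpy as np
    from scipy.optimize import linprog

    def lp_value(p, r, P, E, gamma):
        """Solve (4): min_v p^T v  s.t.  (1 - gamma) r + gamma P v <= E v."""
        res = linprog(c=p, A_ub=gamma * P - E, b_ub=-(1 - gamma) * r,
                      bounds=[(None, None)] * len(p), method="highs")
        return res.x

    def vi_value(r, P, gamma, n_states, iters=2000):
        """Value iteration on V(s) = max_a (1 - gamma) r(s, a) + gamma E[V(s')]."""
        n_actions = P.shape[0] // n_states
        v = np.zeros(n_states)
        for _ in range(iters):
            q = (1 - gamma) * r + gamma * P @ v
            v = q.reshape(n_states, n_actions).max(axis=1)
        return v

    # For any p > 0, lp_value(...) and vi_value(...) agree up to numerical tolerance.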

LP in terms of stationary state-action distribution Define the Lagrangian function

L(v, f ) := p⊤ v + f ⊤ ((1 − γ)r + γPv − Ev) (29)

where f ≥ 0 ∈ R|S||A| is the Lagrangian multiplier. By Lagrangian duality, the dual problem of (4) is given as
max_{f≥0} min_v L(v, f). Or, after substituting the optimality condition of v and defining µ := (1 − γ)f, we can write
the dual problem as another LP problem

    max_{µ≥0} r^⊤ µ   s.t.   (1 − γ) p + γ P^⊤ µ = E^⊤ µ    (3)

Note that this problem, like (4), is normalized: we have ||µ||_1 = 1 because ||p||_1 = 1, and

    ||µ||_1 = 1^⊤ E^⊤ µ = (1 − γ) 1^⊤ p + γ 1^⊤ P^⊤ µ = (1 − γ) ||p||_1 + γ ||µ||_1

where we use the facts that µ ≥ 0 and P is a stochastic transition matrix. This means that µ is a valid state-
action distribution, from which we see that the equality constraint in (3) is simply a vector form of (2). Therefore,
(3) is the same as (27) if we define the policy π as the conditional distribution based on µ.

B Missing Proofs of Section 3

B.1 Proof of Lemma 1

Lemma 1. For any x = (v, µ), if x′ ∈ X satisfies (2) and (5) (i.e. v′ and µ′ are the value function and
state-action distribution of policy πµ′ ), rep (x; x′ ) = −µ⊤ av′ .

Proof. First note that F (x, x) = 0. Then as x′ satisfies stationarity, we can use Lemma 2 below and write

    r_ep(x; x′) = F(x, x) − F(x, x′)
                = −F(x, x′)
                = −(p^⊤ v′ − p^⊤ v) − µ^⊤ a_{v′} + µ′^⊤ a_v     (∵ definition of F in (15))
                = −µ′^⊤ a_v − µ^⊤ a_{v′} + µ′^⊤ a_v             (∵ Lemma 2)
                = −µ^⊤ a_{v′}

B.2 Proof of Lemma 2

Lemma 2. Let vπ and µπ denote the value and state-action distribution of some policy π. Then for any function

v′, it holds that p^⊤(v^π − v′) = (µ^π)^⊤ a_{v′}. In particular, it implies V^π(p) − V^{π′}(p) = (µ^π)^⊤ a_{v^{π′}}.

Proof. This is the well-known performance difference lemma. The proof is based on the stationary properties in
(2) and (5), which can be stated in vector form as

(µπ )⊤ Evπ = (µπ )⊤ ((1 − γ)r + γPvπ ) and (1 − γ)p + γP⊤ µπ = E⊤ µπ


Footnote 5: Our setup in (4) differs from the classic one in the (1 − γ) factor in the constraint to normalize the problem.
Footnote 6: We arrange the coordinates so that, along the |S||A| indices, entries are contiguous in actions.

The proof is a simple application of these two properties.


    p^⊤ (v^π − v′) = (1/(1 − γ)) (E^⊤ µ^π − γ P^⊤ µ^π)^⊤ (v^π − v′)
                   = (1/(1 − γ)) (µ^π)^⊤ ( (E − γP) v^π − (E − γP) v′ )
                   = (1/(1 − γ)) (µ^π)^⊤ ( (1 − γ) r − (E − γP) v′ ) = (µ^π)^⊤ a_{v′}

where we use the stationarity property of µ^π in the first equality and that of v^π in the third equality.

B.3 Proof of Proposition 2

Proposition 2. For any x = (v, µ) ∈ X, if E^⊤µ ≥ (1 − γ)p, then r_ep(x; x*) ≥ (1 − γ) min_s p(s) ||v* − v^{π_µ}||_∞.

Proof. This proof mainly follows the steps in [5] but written in our notation. First Lemma 1 shows rep (x; x∗ ) =
−µ⊤ av∗ . We then lower bound −µ⊤ av∗ by reversing the proof of the performance difference lemma (Lemma 2).

    µ^⊤ a_{v*} = (1/(1 − γ)) µ^⊤ ( (1 − γ) r − (E − γP) v* )          (∵ definition of a_{v*})
               = (1/(1 − γ)) µ^⊤ ( (E − γP) v^{π_µ} − (E − γP) v* )   (∵ stationarity of v^{π_µ})
               = (1/(1 − γ)) µ^⊤ (E − γP) (v^{π_µ} − v*)
               = (1/(1 − γ)) d^⊤ (I − γ P_{π_µ}) (v^{π_µ} − v*)

where we define d := E⊤ µ and Pπµ as the state-transition of running policy πµ .


We wish to further upper bound this quantity. To proceed, we appeal to the Bellman equation of the optimal
value function v∗ and the stationarity of vπµ :

v∗ ≥ (1 − γ)rπµ + γPπµ v∗ and vπµ = (1 − γ)rπµ + γPπµ vπµ ,

which together imply that (I − γPπµ )(vπµ − v∗ ) ≤ 0. We will also use the stationarity of dπµ (the average state
distribution of π_µ): d^{π_µ} = (1 − γ) p + γ P_{π_µ}^⊤ d^{π_µ}.

Since d ≥ (1 − γ)p in the assumption, we can then write

    µ^⊤ a_{v*} = (1/(1 − γ)) d^⊤ (I − γ P_{π_µ}) (v^{π_µ} − v*)
               ≤ p^⊤ (I − γ P_{π_µ}) (v^{π_µ} − v*)
               ≤ − min_s p(s) || (I − γ P_{π_µ}) (v^{π_µ} − v*) ||_∞
               ≤ − min_s p(s) (1 − γ) || v^{π_µ} − v* ||_∞.

Finally, flipping the sign of the inequality concludes the proof.

B.4 Proof of Proposition 3

Proposition 3. There is a class of MDPs such that, for some x ∈ X , Proposition 2 is an equality.

Proof. We show this equality holds for a class of MDPs. For simplicity, let us first consider an MDP with
three states 1, 2, 3 and for each state there are three actions (lef t, right, stay). They correspond to intuitive,
deterministic transition dynamics

P(max{s − 1, 1}|s, lef t) = 1, P(min{s + 1, 3}|s, right) = 1, P(s|s, stay) = 1.


A Reduction from Reinforcement Learning to Online Learning

We set the reward as r(s, right) = 1 for s = 1, 2, 3 and zero otherwise. It is easy to see that the optimal policy
is π ∗ (right|s) = 1, which has value function v∗ = [1, 1, 1]⊤.
Now consider x = (v, µ) ∈ X . To define µ, let µ(s, a) = d(s)πµ (a|s). We set
πµ (right|1) = 1, πµ (stay|2) = 1, πµ (right|3) = 1

That is, πµ is equal to π except when s = 2. One can verify the value function of this policy is vπµ =
[(1 − γ), 0, 1]⊤ .
As far as d is concerned (d = E⊤ µ), suppose the initial distribution is uniform, i.e. p = [1/3, 1/3, 1/3]⊤, we
choose d as d = (1 − γ)p + γ[1, 0, 0]⊤ , which satisfies the assumption in Proposition 2. Therefore, we have
µ ∈ M′ and we will let v be some arbitrary point in V.
Now we show for this choice x = (v, µ) ∈ V × M′ , the equality in Proposition 2 holds. By Lemma 1, we know
1
rep (x; x′ ) = −µ⊤ av∗ . Recall the advantage is defined as av∗ = r + 1−γ (γP − E)v∗ . Let AV ∗ (s, a) denote the
functional form of av∗ and define the expected advantage:
AV ∗ (s, πµ ) := Ea∼πµ [AV ∗ (s, a)].
We can verify it has the following values:
AV ∗ (1, πµ ) = 0, AV ∗ (2, πµ ) = −1, AV ∗ (3, πµ ) = 0.

Thus, the above construction yields


(1 − γ)
rep (x; x∗ ) = −µ⊤ av∗ = = (1 − γ) min p(s)kv∗ − vπµ k∞
3 s

One can easily generalize this 3-state MDP to an |S|-state MDP where states are partitioned into three groups.

C Missing Proofs of Section 4


C.1 Proof of Proposition 4

Proposition 4. For x = (v, µ) ∈ X , define yx∗ := (vπµ , µ∗ ) ∈ X . It holds rep (x; yx∗ ) = V ∗ (p) − V πµ (p).

Proof. First we generalize Lemma 1.


Lemma 3. Let x = (v, µ) be arbitrary. Consider x̃′ = (v′ + u′ , µ′ ), where v′ and µ′ are the value function and
state-action distribution of policy πµ′ , and u′ is arbitrary. It holds that rep (x; x̃′ ) = −µ⊤ av′ − b⊤ ′
µu .

To proceed, we write yx∗ = (v∗ +(vπµ −v∗ ), µ∗ ) and use Lemma 3, which gives rep (x; yx∗ ) = −µ⊤ av∗ −b⊤
µ (v
πµ
−v∗ ).
To relate this equality to the policy performance gap, we also need the following equality.
Lemma 4. For µ ∈ M, it holds that −µ⊤ av∗ = V ∗ (p) − V πµ (p) + b⊤
µ (v
πµ
− v∗ ).

Together they imply the desired equality rep (x; yx∗ ) = V ∗ (p) − V πµ (p).

C.1.1 Proof of Lemma 3

Lemma 3. Let x = (v, µ) be arbitrary. Consider x̃′ = (v′ + u′ , µ′ ), where v′ and µ′ are the value function and
state-action distribution of policy πµ′ , and u′ is arbitrary. It holds that rep (x; x̃′ ) = −µ⊤ av′ − b⊤ ′
µu .

1
Proof. Let x′ = (v′ , µ′ ). As shorthand, define f ′ := v′ + u′ , and L := 1−γ (γP− E) (i.e. we can write af = r+ Lf ).
′ ′ ⊤ ′ ⊤ ⊤ ′⊤
Because rep (x; x ) = −F (x, x ) = −(p v + µ av′ − p v − µ av ), we can write
rep (x; x̃′ ) = −p⊤ f ′ − µ⊤ af ′ + p⊤ v + µ′⊤ av
= −p⊤ v′ − µ⊤ av′ + p⊤ v + µ′⊤ av − p⊤ u′ − µ⊤ Lu′


= rep (x; x′ ) − p⊤ u′ − µ⊤ Lu′


= rep (x; x′ ) − b⊤
µu

Ching-An Cheng, Remi Tachet des Combes, Byron Boots, Geoff Gordon

Finally, by Lemma 1, we have also rep (x; x′ ) = −µ⊤ av′ and therefore the final equality.

C.1.2 Proof of Lemma 4

Lemma 4. For µ ∈ M, it holds that −µ⊤ av∗ = V ∗ (p) − V πµ (p) + b⊤


µ (v
πµ
− v∗ ).

Proof. Following the setup in Lemma 3, we prove the statement by the rearrangement below:

−µ⊤ av′ = −(µπµ )⊤ av′ + (µπµ )⊤ av′ − µ⊤ av′



= V π (p) − V πµ (p) + (µπµ − µ)⊤ av′
 ′ 
= V π (p) − V πµ (p) + (µπµ − µ)⊤ r + (µπµ − µ)⊤ Lv′

where the second equality is due to the performance difference lemma, i.e. Lemma 2, and the last equality uses
the definition av′ = r + Lv′ . For the second term above, let rπµ and Pπµ denote the expected reward and
transition under πµ . Because µ ∈ M, we can rewrite it as

(µπµ − µ)⊤ r = (E⊤ µπµ − E⊤ µ)rπµ


= ((1 − γ)p + γP⊤ µπµ − E⊤ µ)rπµ
= (1 − γ)b⊤
µ rπµ + γ(µ
πµ
− µ)⊤ Prπµ
 
= (1 − γ)b⊤ 2 2
µ rπµ + γPπµ rπµ + γ Pπµ rπµ + . . .

= b⊤
µv
πµ

where the second equality uses the stationarity of µπµ given by (2). For the third term, it can be written

(µπµ − µ)⊤ Lv′ = (−p − L⊤ µ)⊤ v′ = −b⊤


µv

where the first equality uses stationarity, i.e. bµπµ = p + L⊤ µπµ = 0. Finally combining the three steps, we
have

−µ⊤ av′ = V π (p) − V πµ (p) + bµ (vπµ − v′ )

C.2 Proof of Corollary 1

Corollary 1. Let XN = {xn ∈ Xθ }N n=1 be any sequence. Let π̂N be the policy given either by the average or the
best decision in XN . It holds that
RegretN (Θ)
V π̂N (p) ≥ V ∗ (p) − N − ǫΘ,N


where ǫΘ,N = minxθ ∈Xθ rep (x̂N ; yN ) − rep (x̂N ; xθ ) measures the expressiveness of Xθ , and RegretN (Θ) :=
PN PN
l (x
n=1 n n ) − min x∈XΘ n=1 nl (x).

Proof. This can be proved by a simple rearrangement

RegretN (Θ)

V ∗ (p) − V π̂N (p) = rep (x̂N ; yN ) = ǫΘ,N + max rep (x̂N ; xθ ) ≤ ǫΘ,N +
xθ ∈Xθ N

where the first equality is Proposition 4 and the last inequality is due to the skew-symmetry of F , similar to the
proof of Theorem 1.
A Reduction from Reinforcement Learning to Online Learning

C.3 Proof of Proposition 5


Proposition 5. Let x̂N = (v̂N , µ̂N ). Under the setup in Corollary 1, regardless of the parameterization, it is
true that ǫΘ,N is no larger than

kµθ − µ∗ k1
min + min kbµ̂N k1,w kvθ − vπ̂N k∞,1/w
(vθ ,µθ )∈XΘ 1−γ w:w≥1

1  
≤ min kµθ − µ∗ k1 + 2kvθ − vπ̂N k∞ .
(vθ ,µθ )∈XΘ 1 − γ

wi |xi | and kxk∞,1/w = maxi wi−1 |xi |.


P
where the norms are defined as kxk1,w = i

Proof. For shorthand, let us set x = (v, µ) = x̂N and write also πµ = π̂N as the associated policy. Let
yx∗ = (vπµ , µ∗ ) and similarly let xθ = (vθ , µθ ) ∈ XΘ . With rep (x; x′ ) = −F (x, x′ ) and (15), we can write
 
rep (x; yx∗ ) − rep (x; xθ ) = −p⊤ vπµ − µ⊤ avπµ + p⊤ v + µ∗ ⊤ av − −p⊤ vθ − µ⊤ avθ + p⊤ v + µ⊤

θ av

= p⊤ (vθ − vπµ ) + (µ∗ − µθ )⊤ av + µ⊤ (avθ − avπµ )


= b⊤
µ (vθ − v
πµ
) + (µ∗ − µθ )⊤ av

Next we quantize the size of av and bµ .


1 2
Lemma 5. For (v, µ) ∈ X , kav k∞ ≤ 1−γ and kbµ k1 ≤ 1−γ .

Proof of Lemma 5. Let ∆ denote the set of distributions


1 1 1
kav k∞ = k(1 − γ)r + γPv − Evk∞ ≤ max |a − b| ≤
1−γ 1 − γ a,b∈[0,1] 1−γ
1 1 2
kbµ k1 = k(1 − γ)p + γP⊤ µ − E⊤ µk1 ≤ max kq − q′ k1 ≤
1−γ 1 − γ q,q′ ∈∆ 1−γ

Therefore, we have preliminary upper bounds:


1
(µ∗ − µθ )⊤ av ≤ kav k∞ kµ∗ − µθ k1 ≤ kµ∗ − µθ k1
1−γ
2
b⊤
µ (vθ − v
πµ
) ≤ kbµ k1 kvθ − vπµ k∞ ≤ kvθ − vπµ k∞
1−γ

However, the second inequality above can be very conservative, especially when bµ ≈ 0 which can be likely
when it is close to the end of policy optimization. To this end, we introduce a free vector w ≥ 1. Define norms
kvk∞,1/w = maxi |vwii| and kδk1,w = i wi |δi |. Then we can instead have an upper bound
P

b⊤
µ (vθ − v
πµ
) ≤ min kbµ k1,w kvθ − vπµ k∞,1/w
w:w≥1

Notice that when w = 1 the above inequality reduces to b⊤ πµ πµ


µ (vθ − v ) ≤ kbµ k1 kvθ − v k∞ , which as we showed
2
has an upper bound 1−γ kvθ − vπµ k∞ .
Combining the above upper bounds, we have an upper bound on ǫΘ,N :

1
ǫΘ,N = rep (x; yx∗ ) − rep (x; xθ ) ≤ kµθ − µ∗ k1 + min kbµ k1,w kvθ − vπµ k∞,1/w
1−γ w:w≥1
1
≤ (kµθ − µ∗ k1 + 2kvθ − vπµ k∞ ) .
1−γ

Since it holds for any θ ∈ Θ, we can minimize the right-hand side over all possible choices.
Ching-An Cheng, Remi Tachet des Combes, Byron Boots, Geoff Gordon

D Proof of Sample Complexity of Mirror Descent


|S||A| log( 1δ )
 
Theorem 2. With probability 1 − δ, Algorithm 1 learns an ǫ-optimal policy with Õ (1−γ)2 ǫ2 samples.

The proof is a combination of the basic property of mirror descent (Lemma 9) and the martingale concentration.
1
Define K = |S||A| and κ = 1−γ as shorthands. We first slightly modify the per-round loss used to compute the
gradient. Recall ln (x) := p⊤ v + µ⊤ ⊤ ⊤
n av − p vn − µ avn and let us consider instead a loss function

hn (x) := b⊤ ⊤
µn v + µ (κ1 − avn )

which shifts ln by a constant in each round. (Note for all the decisions (vn , µn ) produced by Algorithm 1 µn
always satisfies kµn k1 = 1). One can verify that ln (x) − ln (x′ ) = hn (x) − hn (x′ ), for all x, x′ ∈ X , when µ, µ′
in x and x′ satisfy kµk1 = kµ′ k1 (which holds for Algorithm 1). As the definition of regret is relative, we may
work with hn in online learning and use it to define the feedback.
The reason for using hn instead of ln is to make ∇µ hn ((v, µ)) (and its unbiased approximation) a positive vector
(because κ ≥ kav k∞ for any v ∈ V), so that the regret bound can have a better dependency on the dimension
for learning µ that lives in the simplex M. This is a common trick used in the online learning, e.g. in EXP3.
To run mirror descent, we set the first-order feedback gn received by the learner as an unbiased estimate of
∇hn (xn ). For round n, we construct gn based on two calls of the generative model:
 " 1
#
p̃n + 1−γ (γ P̃n − En )⊤ µ̃n

gn,v
gn = = 1
gn,µ K(κ1̂n − r̂n − 1−γ (γ P̂n − Ên )vn )

For gn,v , we sample p, then sample µn to get a state-action pair, and finally query the transition dynamics P
at the state-action pair sampled from µn . (p̃n , P̃n , and µ̃n denote the single-sample empirical approximation
of these probabilities.) For gn,µ , we first sample uniformly a state-action pair (which explains the factor K),
and then query the reward r and the transition dynamics P. (1̂n , r̂n , P̂n , and Ên denote the single-sample
empirical estimates.) To emphasize, we use ˜ and ˆ to distinguish the empirical quantities obtained by these
two independent queries. By construction, we have gn,µ ≥ 0. It is clear that this direction gn is unbiased, i.e.
E[gn ] = ∇hn (xn ). Moreover, it is extremely sparse and can be computed using O(1) sample, computational, and
memory complexities.

Let yN = (vπ̂N , µ∗ ). We bound the regret by the following rearrangement.
N
X N
X
∗ ∗
RegretN (yN )= ln (xn ) − ln (yN )
n=1 n=1
N
X N
X

= hn (xn ) − hn (yN )
n=1 n=1
N
X
= ∇hn (xn )⊤ (xn − yN

)
n=1
N N N
! ! !
X X X

= (∇hn (xn ) − gn ) xn + gn⊤ (xn − ∗
yN ) + (gn − ∇hn (xn ))⊤ yN

n=1 n=1 n=1


N N N
! ! !
X X X

≤ (∇hn (xn ) − gn ) xn + max gn⊤ (xn − x) + (gn − ∇hn (xn ))⊤ yN

, (30)
x∈X
n=1 n=1 n=1

where
PN the third equality comes from hn being linear. We recognize the first term is a martingale MN =

n=1 (∇h n (xn ) − g n ) xn , because xn does not depend on gn . Therefore, we can appeal to standard martingale
concentration property. For the second term, it can be upper bounded by standard regret analysis of mirror
descent, by treating gn⊤ x as the per-round loss. For the third term, because yN ∗
= (vπ̂N , µ∗ ) depends on {gn }N
n=1 ,
it is not a martingale. Nonetheless, we will be able to handle it through a union bound. Below, we give details
for bounding these three terms.
A Reduction from Reinforcement Learning to Online Learning

D.1 The First Term: Martingale Concentration


PN
For the first term, n=1 (∇hn (xn ) − gn )⊤ xn , we use a martingale concentration property. Specifically, we adopt
a Bernstein-type inequality [37, Theorem 3.15]:
Lemma 6. [37, Theorem 3.15] Let M0 , . . . , MN be a martingale and let F0 ⊆ F1 ⊆ · · · ⊆ Fn be the filtration
such that Mn = E|Fn [Mn+1 ]. Suppose there are b, σ < ∞ such that for all n, given Fn−1 , Mn − Mn−1 ≤ b, and
V|Fn−1 [Mn − Mn−1 ] ≤ σ 2 almost surely. Then for any ǫ ≥ 0,

!
−ǫ2
P (MN − M0 ≥ ǫ) ≤ exp bǫ
.
2N σ 2 (1 + 3N σ2 )

Lemma 6 implies, with probability at least 1 − δ,


s  
1
MN − M0 ≤ 2N σ 2 (1 + o(1)) log ,
δ

where o(1) means convergence to 0 as N → ∞.


To apply Lemma 6, we need to provide bounds on the properties of the martingale difference:

Mn − Mn−1 = (∇hn (xn ) − gn )⊤ xn


= (κ1 − avn − gn,µ )⊤ µn + (bµn − gn,v )⊤ vn .

For the first term (κ1 − avn − gn,µ )⊤ µn , we use the lemma below:
Lemma 7. Let µ ∈ M be arbitrary, chosen independently from the randomness of gn,µ when Fn−1 is given.
Then it holds |(κ1 − avn − gn,µ )⊤ µ| ≤ 2(1+K)
1−γ
4K
and V|Fn−1 [(κ1 − avn − gn,µ )⊤ µ] ≤ (1−γ) 2.

Proof. By triangular inequality,

|(κ1 − avn − gn,µ )⊤ µ| ≤ |(κ1 − avn )⊤ µ| + |gn,µ



µ|.

For the deterministic part, using Lemma 5 and Hölder’s inequality,

2
|(κ1 − avn )⊤ µ| ≤ κ + kavn k∞ kµk1 ≤ .
1−γ

For the stochastic part, let in be index of the sampled state-action pair and jn be the index of the transited state
sampled at the pair given by in . With abuse of notation, we will use in to index both S × A and S. With this
notation, we may derive

⊤ 1
|gn,µ µ| = |Kµ⊤ (κ1̂n − r̂n − (γ P̂n − Ên )vn )|
1−γ
γvn,jn − vn,in
= Kµin |κ − rin − |
1−γ
2Kµin 2K
≤ ≤
1−γ 1−γ

where we use the facts that rin , vn,jn , vn,in ∈ [0, 1] and µin ≤ 1.
Ching-An Cheng, Remi Tachet des Combes, Byron Boots, Geoff Gordon

For V|Fn−1 [(κ1 − avn − gn,µ )⊤ µn ], we can write it as



V|Fn−1 [(κ1 − avn − gn,µ )⊤ µ] = V|Fn−1 [gn,µ µ]

≤ E|Fn−1 [|gn,µ µ n |2 ]
"  2 #
X 1
2 2 γvn,jn − vn,in
= Ej |i K µin κ − rin −
i
K n n 1−γ
n

4K X 2
≤ µin
(1 − γ)2 i
n
!2
4K X 4K
≤ µin ≤
(1 − γ)2 in
(1 − γ)2
γvn,jn −vn,in 2
where in the second inequality we use the fact that |κ − rin − 1−γ |≤ 1−γ .

For the second term (bµn − gn,v )⊤ vn , we use the following lemma.
Lemma 8. Let v ∈ V be arbitrary, chosen independently from the randomness of gn,v when Fn−1 is given..
4 4
Then it holds that |(bµn − gn,v )⊤ v| ≤ 1−γ and V|Fn−1 [(bµn − gn,v )⊤ v] ≤ (1−γ) 2.

2
Proof. We appeal to Lemma 5, which shows kbµn k1 , kgn,v k1 ≤ 1−γ , and derive
4
|(bµn − gn,v )⊤ v| ≤ (kbµn k1 + kgn,v k1 )kvk∞ ≤ .
1−γ
Similarly, for the variance, we can write
4

V|Fn−1 [(bµn − gn,v )⊤ v] = V|Fn−1 [gn,v ⊤
v] ≤ E|Fn−1 [(gn,v v)2 ] ≤ .
(1 − γ)2

Thus, with helps from the two lemmas above, we are able to show
4 + 2(1 + K)
Mn − Mn−1 ≤ |(κ1 − avn − gn,µ )⊤ µn | + |(bµn − gn,v )⊤ vn | ≤
1−γ
as well as (because gn,µ and gn,b are computed using independent samples)
4(1 + K)
V|Fn−1 [Mn − Mn−1 ] ≤ E|Fn−1 [|(κ1 − avn − gn,µ )⊤ µn |2 ] + E|Fn−1 [|(bµn − gn,v )⊤ vn |2 ] ≤
(1 − γ)2
Now, since M0 = 0, by martingale concentration in Lemma 6, we have
N
! !
X −ǫ2
P (∇hn (xn ) − gn )⊤ xn > ǫ ≤ exp bǫ
n=1
2N σ 2 (1 + 3N σ2 )

6+2K 4(1+K)
with b = 1−γ and σ 2 = (1−γ)2 .This implies that, with probability at least 1 − δ, it holds
q 
N K log( 1δ )
s
N  
X 8(1 + K) 1
(∇hn (xn ) − gn )⊤ xn ≤ N 2
(1 + o(1)) log = Õ  
n=1
(1 − γ) δ 1 − γ

D.2 Static Regret of Mirror Descent

Next we move onto deriving the regret bound of mirror descent with respect to the online loss sequence:
N
X
max gn⊤ (xn − x)
x∈X
n=1

This part is quite standard; nonetheless, we provide complete derivations below.


We first recall a basic property of mirror descent
A Reduction from Reinforcement Learning to Online Learning

Lemma 9. Let X be a convex set. Suppose R is 1-strongly convex with respect to some norm k · k. Let g be an
arbitrary vector and define, for x ∈ X ,

y = arg min hg, x′ i + BR (x′ ||x)


x′ ∈X

Then for all z ∈ X ,

hg, y − zi ≤ BR (z||x) − BR (z||y) − BR (y||x) (31)

Proof. Recall the definition BR (x′ ||x) = R(x′ ) − R(x) − h∇R(x), x′ − xi. The optimality of the proximal map
can be written as

hg + ∇R(y) − ∇R(x), y − zi ≤ 0, ∀z ∈ X .

By rearranging the terms, we can rewrite the above inequality in terms of Bregman divergences as follows and
derive the first inequality (31):

hg, y − zi ≤ h∇R(x) − ∇R(y), y − zi


= BR (z||x) − BR (z||y) + h∇R(x) − ∇R(y), yi − h∇R(x), xi + h∇R(y), yi + R(x) − R(y)
= BR (z||x) − BR (z||y) + h∇R(x), y − xi + R(x) − R(y)
= BR (z||x) − BR (z||y) − BR (y||x),

which concludes the lemma.

Let x′ ∈ X be arbitrary. Applying this lemma to the nth iteration of mirror descent in (21), we get
1
hgn , xn+1 − x′ i ≤ (BR (x′ ||xn ) − BR (x′ ||xn+1 ) − BR (xn+1 ||xn ))
η
By a telescoping sum, we then have
N N
X
′ 1 ′
X 1
hgn , xn − x i ≤ BR (x ||x1 ) + hgn , xn+1 − xn i − BR (xn+1 ||xn ).
n=1
η n=1
η

We bound the right-hand side as follows. Recall that based on the geometry of X = V × M, we considered a
natural Bregman divergence of the form:
1
BR (x′ ||x) = kv′ − vk22 + KL(µ′ ||µ)
2|S|

Let x1 = (v1 , µ1 ) where µ1 is uniform. By this choice, we have:


 
1 1 1 1
BR (x′ ||x1 ) ≤ max BR (x||x1 ) ≤ + log(K) .
η η x∈X η 2

We now decompose each item in the above sum as:


 
1 ⊤ 1
hgn , xn+1 − xn i − BR (xn+1 ||xn ) = gn,v (vn+1 − vn ) − kvn − vn+1 k22
η 2η|S|
 
⊤ 1
+ gn,µ (µn+1 − µn ) − KL(µn+1 ||µn )
η

and we upper bound them using the two lemmas below (recall gn,µ ≥ 0 due to the added κ1 term).
1 ηkgk22
Lemma 10. For any vector x, y, g and scalar η > 0, it holds hg, x − yi − 2η kx − yk22 ≤ 2 .

1 1 ηkgk22
Proof. By Cauchy-Swartz inequality, hg, x − yi − 2η kx − yk22 ≤ kgk2kx − yk2 − 2η kx − yk22 ≤ 2 .
Ching-An Cheng, Remi Tachet des Combes, Byron Boots, Geoff Gordon

Lemma 11. Suppose BR (x||y) = KL(x||y) and x, y are probability distributions, and g ≥ 0 element-wise. Then,
for η > 0,
1 ηX η
− BR (y||x) + hg, x − yi ≤ xi (gi )2 = kgk2x.
η 2 i 2

Proof. Let ∆ denotes the unit simplex.


−BR (y||x) + hηg, x − yi ≤ hηg, xi + max h−ηg, yi − BR (y ′ ||x)
y ′ ∈∆
!
X
= hηg, xi + log xi exp(−ηgi ) (∵ convex conjugate of KL divergence)
i
 !
X 1 2 1
≤ hηg, xi + log xi 1 − ηgi + (ηgi ) (∵ ex ≤ 1 + x + x2 for x ≤ 0)
i
2 2
 !
X 1 2
= hηg, xi + log 1 + xi −ηgi + (ηgi )
i
2
X  1

≤ hηg, xi + xi −ηgi + (ηgi )2 (∵ log(x) ≤ x − 1)
i
2
1X η2
= xi (ηgi )2 = kgk2x .
2 i 2
Finally, dividing both sides by η, we get the desired result.

η|S|kgn,v k22 ηkgn,µ k2µn


Thus, we have the upper bound hgn , xn+1 − xn i − η1 BR (xn+1 ||xn ) = 2 + 2 . Together with the
upper bound on η1 BR (x′ ||x1 ), it implies that
N N
X 1 X 1
hgn , xn − x′ i ≤ BR (x′ ||x1 ) + hgn , xn+1 − xn i − BR (xn+1 ||xn )
n=1
η n=1
η
  N
1 1 ηX
≤ + log(K) + |S|kgn,v k22 + kgn,µ k2µn . (32)
η 2 2 n=1

PN
We can expect, with high probability, n=1 |S|kgn,v k22 + kgn,µ k2µn concentrates toward its expectation, i.e.
N
X N
X
|S|kgn,v k22 + kgn,µ k2µn ≤ E[|S|kgn,v k22 + kgn,µ k2µn ] + o(N ).
n=1 n=1

Below we quantify this relationship using martingale concentration. First we bound the expectation.
4 4K
Lemma 12. E[kgn,v k22 ] ≤ (1−γ) 2
2 and E[kgn,µ kµn ] ≤ (1−γ)2 .

Proof. For the first statement, using the fact that k · k2 ≤ k · k1 and Lemma 5, we can write
1 4
E[kgn,v k22 ] ≤ E[kgn,v k21 ] = E[kp̃n + (γ P̃n − En )⊤ µ̃n k21 ] ≤ .
1−γ (1 − γ)2

For the second statement, let in be the index of the sampled state-action pair and jn be the index of the
transited-to state sampled at the pair given by in . With abuse of notation, we will use in to index both S × A
and S.
" "  2 ##
2
X 1
2 γvn,jn − vn,in
E[kgn,µ kµn ] = E Ej |i K µin κ − rin −
in
K n n 1−γ
" #
4K X 4K
≤ 2
E µin ≤ .
(1 − γ) i
(1 − γ)2
n
A Reduction from Reinforcement Learning to Online Learning

To bound the tail, we resort to the Höffding-Azuma inequality of martingale [37, Theorem 3.14].
Lemma 13 (Azuma-Hoeffding). Let M0 , . . . , MN be a martingale and let F0 ⊆ F1 ⊆ · · · ⊆ Fn be the filtration
such that Mn = E|Fn [Mn+1 ]. Suppose there exists b < ∞ such that for all n, given Fn−1 , |Mn − Mn−1 | ≤ b.
Then for any ǫ ≥ 0,
−2ǫ2
 
P (MN − M0 ≥ ǫ) ≤ exp
N b2

To apply Lemma 13, we consider the martingale


N N
!
X X
MN = |S|kgn,v k22 + kgn,µ k2µn − E[|S|kgn,v k22 + kgn,µ k2µn ]
n=1 n=1

To bound the change of the size of martingale difference |Mn − Mn−1 |, we follow similar steps to Lemma 12.
4 4K 2
Lemma 14. kgn,v k22 ≤ (1−γ)2 and kgn,µ k2µn ≤ (1−γ)2 .

Note kgn,µ k2µ is K-factor larger than E[kgn,µ k2µ ]) and K ≥ 1. Therefore, we have
8(|S| + K 2 )
|Mn − Mn−1 | ≤ |S|kgn,v k22 + kgn,µ k2µn + |S|E[kgn,v k22 ] + E[kgn,µ k2µn ] ≤
(1 − γ)2
Combining these results, we have, with probability as least 1 − δ,
N N √ 2
s  
X X 4 2(|S| + K ) 1
|S|kgn,v k22 + kgn,µ k2µn ≤ 2 2
E[|S|kgn,v k2 + kgn,µ kµn ] + 2
N log
n=1 n=1
(1 − γ) δ
√ s
4 2(|S| + K 2 )
 
4(K + |S|) 1
≤ N+ N log
(1 − γ)2 (1 − γ)2 δ

1−γ
Now we suppose we set η = √
KN
. From (38), we then have
N   N
X
′ 1 1 ηX
hgn , xn − x i ≤ + log(K) + |S|kgn,v k22 + kgn,µ k2µn
n=1
η 2 2 n=1
√ √ s  !
2 2(|S| + K 2 )
 
KN 1 1 − γ 2(K + |S|) 1
≤ + log(K) + √ 2
N+ 2
N log
1−γ 2 KN (1 − γ) (1 − γ) δ
√ q 
KN K 3 log 1δ
≤ Õ  + .
1−γ 1−γ

D.3 Union Bound

Lastly, we provide an upper bound on the last component:


N
X
(gn − ∇hn (xn ))⊤ yN

.
n=1

Because yN depends on gn , this term does not obey martingale concentration like the first component
PN ⊤
n=1 (∇hn (xn ) − gn ) xn which we analyzed in Appendix D.1 To resolve this issue, we utilize the concept
of covering number and derive a union bound.
Recall for a compact set Z in a norm space, the covering number N (Z, ǫ) with ǫ > 0 is the minimal number of
N (Z,ǫ)
ǫ-balls that covers Z. That is, there is a set {zi ∈ Z}i=1 such that maxz∈Z minz′ ∈B(Z,ǫ) kz − z ′ k ≤ ǫ. Usually
the covering number N (Z, ǫ) is polynomial in ǫ and perhaps exponential in the ambient dimension of Z.
The idea of covering number can be used to provide a union bound of concentration over compact sets, which
we summarize as a lemma below.
Ching-An Cheng, Remi Tachet des Combes, Byron Boots, Geoff Gordon

Lemma 15. Let f, g be two random L-Lipschitz functions. Suppose for some a > 0 and some fixed z ∈ Z
selected independently of f, g, it holds

P (|f (z) − g(z)| > ǫ) ≤ exp −aǫ2




Then it holds that


−aǫ2
   
 ǫ 
P sup |f (z) − g(z)| > ǫ ≤ N Z, exp
z∈Z 4L 4

ǫ
Proof. Let C denote a set of covers of size N (Z, 4L ) Then, for any z ∈ Z which could depend on f, g,

|f (z) − g(z)| ≤ min



|f (z) − f (z ′ )| + |f (z ′ ) − g(z ′ )| + |g(z ′ ) − g(z)|
z ∈C
≤ min

2Lkz − z ′ k + |f (z ′ ) − g(z ′ )]|
z ∈C
ǫ
≤ + max |f (z ′ ) − g(z ′ )|
2 z′ ∈C

Thus, supz∈Z |f (z) − g(z)| > ǫ =⇒ maxz′ ∈C |f (z ′ ) − g(z ′ )| > 2ǫ . Therefore, we have the union bound.

−aǫ2
   
 ǫ 
P sup |f (z) − E[f (z)]| > ǫ ≤ N Z, exp .
z∈Z 4L 4

PN ⊤ ∗
We now use Lemma 15 to bound the component n=1 (gn −∇hn (xn )) yN . We recall by definition, for x = (v, µ),

(∇hn (xn ) − gn )⊤ x = (κ1 − avn − gn,µ )⊤ µ + (bµn − gn,v )⊤ v



Because yN = (vπ̂N , µ∗ ), we can write the sum of interest as
N
X N
X N
X
(gn − ∇hn (xn ))⊤ yN

= (gn,µ − κ1 + avn )⊤ µ∗ + (gn,v − bµn )⊤ vπ̂N
n=1 n=1 n=1

For the first term, because µ∗ is set beforehand by the MDP definition and does not depend on the randomness
during learning, it is a martingale and we can apply the steps in Appendix D.1 to show,
q 
XN N K log( 1δ )
(gn,µ − κ1 + avn )⊤ µ∗ = Õ  
n=1
1 − γ

For the second term, because vπ̂N depends on the randomness in the learning process, we need to use a union
bound. Following the steps in Appendix D.1, we see that for some fixed v ∈ V, it holds
N
!
(1 − γ)2 2
X  

P (gn,v − bµn ) v > ǫ ≤ exp − ǫ
n=1
N

where some constants were ignored for the sake of conciseness. Note also that it does not have the K factor
because of Lemma 8. To apply Lemma 15, we need to know the order of covering number of V. Since V is
an |S|-dimensional unit cube in the positive orthant, it is straightforward to show N (V, ǫ) ≤ max(1, (1/ǫ)|S|)
PN ⊤
PN ⊤
(by simply discretizing evenly in each dimension). Moreover, the functions n=1 gn,v v and n=1 bµn v are
N
1−γ -Lipschitz in k · k∞ .

Applying Lemma 15 then gives us:


N
!
(1 − γ)2 2
   
X
⊤ ǫ(1 − γ)
P sup (gn,v − bµn ) v > ǫ ≤ N V, exp − ǫ .
v∈V n=1 4N 4N
A Reduction from Reinforcement Learning to Online Learning

For a given δ, we thus want to find the smallest ǫ such that:

(1 − γ)2 2
   
ǫ(1 − γ)
δ ≥ N V, exp − ǫ .
4N 4N

That is:
1 (1 − γ)2 2 ǫ(1 − γ)
log( ) ≤ ǫ + |S| min(0, log( )).
δ 4N 4N
 √  √ 
N log( 1 ) N log( δ1 )
Picking ǫ = O log(N ) 1−γ δ = Õ 1−γ guarantees that the inequality is verified asymptotically.

Combining these two steps, we have shown overall, with probability at least 1 − δ,
q 
XN N K log( 1δ )
(gn − ∇hn (xn ))⊤ yN

= Õ  .
n=1
1 − γ

D.4 Summary

In the previous subsections, we have provided high probability upper bounds for each term in the decomposition
N N N
! ! !
X X X

RegretN (yN )≤ (∇hn (xn ) − gn )⊤ xn + max gn⊤ (xn − x) + (gn − ∇hn (xn ))⊤ yN

x∈X
n=1 n=1 n=1

implying with probability at least 1 − δ,


q  √ q  q 
N K log( 1δ ) K 3 log 1 N |S||A| log( 1
)

RegretN (yN ) ≤ Õ   + Õ  KN + δ
 = Õ  δ

1−γ 1−γ 1−γ 1−γ

By Theorem 1, this would imply with probability at least 1 − δ,


q 
Regret (y ∗
) |S||A| log( δ1 )
V π̂N (p) ≥ V ∗ (p) − N N
≥ V ∗ (p) − Õ  √ 
N (1 − γ) N

In other words, the sample complexity of mirror descent to obtain an ǫ approximately optimal policy (i.e.
|S||A| log( 1δ )
 
V ∗ (p) − V π̂N (p) ≤ ǫ) is at most Õ
(1−γ)2 ǫ2 .

E Sample Complexity of Mirror Descent with Basis Functions

Here we provide further discussions on the sample complexity of running Algorithm 1 with linearly parameterized
function approximators and the proof of Theorem 3.
Theorem 3. Under a proper  choice of Θ and BR , with probability 1−δ, Algorithm 1 learns an (ǫ+ǫΘ,N )-optimal
dim(Θ) log( 1δ )
policy with Õ (1−γ)2 ǫ2 samples.

E.1 Setup

We suppose that the decision variable is parameterized in the form xθ = (Φθv , Ψθµ ), where Φ, Ψ are given
nonlinear basis functions and (θv , θµ ) ∈ Θ are the parameters to learn. For modeling the value function, we
suppose each column in Φ is a vector (i.e. function) such that its k · k∞ is less than one. For modeling the
state-action distribution, we suppose each column in Ψ is a state-action distribution from which we can draw
samples. This choice implies that every column of Φ belongs to V, and every column of Ψ belongs to M.
Ching-An Cheng, Remi Tachet des Combes, Byron Boots, Geoff Gordon

Considering the geometry of Φ and Ψ, we consider a compact and convex parameter set

Cv
Θ = {θ = (θv , θµ ) : kθv k2 ≤ p , θµ ≥ 0, kθµ k1 ≤ 1}
dim(θv )

where Cv < ∞. The constant Cv acts as a regularization in learning: if it is too small, the bias (captured as
ǫΘ,N in Corollary 1 restated below) becomes larger; if it is too large, the learning becomes slower.
This choice of Θ makes sure, for θ = (θv , θµ ) ∈ Θ, Ψθµ ∈ M and kΦθv k∞ ≤ kθv k1 ≤ Cv . Therefore, by the
above construction, we can verify that the requirement in Corollary 1 is satisfied, i.e. for θ = (θv , θµ ) ∈ Θ, we
have (Φθv , Ψµ θµ ) ∈ XΘ .
Corollary 1. Let XN = {xn ∈ Xθ }N n=1 be any sequence. Let π̂N be the policy given either by the average or the
best decision in XN . It holds that
RegretN (Θ)
V π̂N (p) ≥ V ∗ (p) − N − ǫΘ,N


where ǫΘ,N = minxθ ∈Xθ rep (x̂N ; yN ) − rep (x̂N ; xθ ) measures the expressiveness of Xθ , and RegretN (Θ) :=
PN PN
n=1 ln (xn ) − minx∈XΘ n=1 ln (x).

E.2 Online Loss and Sampled Gradient

Let θ = (θv , θµ ) ∈ Θ. In view of the parameterization above, we can identify the online loss in (23) in the
parameter space as

1
hn (θ) := b⊤ ⊤ ⊤
µn Φθv + θµ Ψ ( 1−γ 1 − avn ) (33)

where we have the natural identification xn = (vn , µn ) = (Φθv,n , Ψθµ,n ) and θn = (θv,n , θµ,n ) ∈ Θ is the
decision made by the online learner in the nth round. Note that because this extension of Algorithm 1 makes
sure kθµ,n k1 = 1 for every iteration, we can still use hn . For writing convenience, we will continue to overload
hn as a function of parameter θ in the following analyses.
Mirror descent requires gradient estimates of ∇hn (θn ). Here we construct an unbiased stochastic estimate of
∇hn (θn ) as
 " 1
#
Φ⊤ (p̃n + 1−γ (γ P̃n − En )⊤ µ̃n )

gn,v
gn = = 1 1 (34)
gn,µ dim(θµ )Ψ̂⊤
n ( 1−γ 1̂n − r̂n − 1−γ (γ P̂n − Ên )vn )

using two calls of the generative model (again we overload the symbol gn for the analyses in this section):

• The upper part gn,v is constructed similarly as before in (24): First we sample the initial state from the
initial distribution, the state-action pair using the learned state-action distribution, and then the transited-
to state at the sampled state-action pair. We evaluate Φ’s values at those samples to construct gn,v . Thus,
gn,v would generally be a dense vector of size dim(θv ) (unless the columns of Φ are sparse to begin with).

• The lower part gn,µ is constructed slightly differently. Recall for the tabular version in (24), we uniformly
sample over the state and action spaces. Here instead we first sample uniformly a column (i.e. a basis
function) in Ψ and then sample a state-action pair according to the sampled column, which is a distribution
by design. Therefore, the multiplier due to uniform sampling in the second row of (34) is now dim(θµ )
rather than |S||A| in (24). The matrix Ψ̂n is extremely sparse, where only the single sampled entry (the
column and the state-action pair) is one and the others are zero. In fact, one can think of the tabular version
as simply using basis functions Ψ = I, i.e. each column is a delta distribution. Under this identification,
the expression in (34) matches the one in (24).

It is straightforward to verify that E[gn ] = ∇hn (θn ) for gn in (34).


A Reduction from Reinforcement Learning to Online Learning

E.3 Proof of Theorem 3

We follow the same steps of the analysis of the tabular version. We will highlight the differences/improvement
due to using function approximations.
First, we use Corollary 1 in place of Theorem 1. To properly handle the randomness, we revisit its derivation to
slightly tighten the statement, which was simplified for the sake of cleaner exposition. Define
∗ ∗
yN,θ = (vN,θ , µ∗θ ) := arg max rep (x̂N ; xθ ).
xθ ∈Xθ
∗ ∗
For writing convenience, let us also denote θN = (θv,N , θµ∗ ) ∈ Θ as the corresponding parameter of yN,θ

. We
remark that µ∗θ (i.e. θµ∗ ), which tries to approximate µ∗ , is fixed before the learning process, whereas vN,θ

(i.e.

θv,N ) could depend on the stochasticity in the learning. Using this new notation and the steps in the proof of
Corollary 1, we can write

V ∗ (p) − V π̂N (p) = rep (x̂N ; yN )


RegretN (yN,θ )
= ǫΘ,N + rep (x̂N ; yN,θ ) ≤ ǫΘ,N +
N
where the first equality is Proposition 4, the last inequality follows the proof of Theorem 1, and we recall the
∗ ∗
definition ǫΘ,N = rep (x̂N ; yN ) − rep (x̂N ; yN,θ ).
The rest of the proof is very similar to that of Theorem 1, because linear parameterization does not change the

convexity of the loss sequence. Let yN = (vπ̂N , µ∗ ). We bound the regret by the following rearrangement.
N
X N
X
∗ ∗
RegretN (yN,θ )= ln (xn ) − ln (yN,θ )
n=1 n=1
N
X N
X

= hn (θn ) − hn (θN )
n=1 n=1
N
X
= ∇hn (θn )⊤ (θn − θN

)
n=1
N N N
! ! !
X X X

= (∇hn (θn ) − gn ) θn + gn⊤ (θn − ∗
θN ) + (gn − ∇hn (θn ))⊤ θN

n=1 n=1 n=1


N N N
! ! !
X X X

≤ (∇hn (θn ) − gn ) θn + max gn⊤ (θn − θ) + (gn − ∇hn (θn ))⊤ θN

(35)
θ∈Θ
n=1 n=1 n=1

where the second equality is due to the identifcation in (33).


We will solve this online learning problem with mirror descent
1
θn+1 = arg min hgn , θi + BR (θ||θn ) (36)
θ∈Θ η
with step size η > 0 and a Bregman divergence that is a straightforward extension of (22)
1 dim(θv )
BR (θ′ ||θ) = 2 Cv2 kθv′ − θv k22 + KL(θµ′ ||θµ ) (37)

where the constant dim(θ


Cv2
v)
is chosen to make the size of Bregman divergence dimension-free (at least up to log
factors). Below we analyze the size of the three terms in (35) like what we did for Theorem 2.

E.4 The First Term: Martingale Concentration

The first term is a martingale. We will use this part to highlight the different properties due to using basis
functions. The proof follows the steps in Appendix D.1, but now the martingale difference of interest is instead
Mn − Mn−1 = (∇hn (θn ) − gn )⊤ θn
= (Ψ⊤ (κ1 − avn ) − gn,µ )⊤ θµ,n + (Φ⊤ ⊤
n bµn − gn,v ) θv,n
Ching-An Cheng, Remi Tachet des Combes, Byron Boots, Geoff Gordon

They now have nicer properties due to the way gn,µ is sampled.
For the first term (Ψ⊤ (κ1 − avn ) − gn,µ )⊤ θµ,n , we use the lemma below, where we recall the filtration Fn is
naturally defined as {θ1 , . . . , θn }.
Lemma 16. Let θ = (θv , θµ ) ∈ Θ be arbitrary that is chosen independent of the randomness of gn,µ when Fn−1
2(1+dim(θµ )) 4dim(θ )
is given. Then it holds |(κ1 − avn − gn,µ )⊤ θ| ≤ 1−γ and V|Fn−1 [(κ1 − avn − gn,µ )⊤ θn ] ≤ (1−γ)µ2 .

Proof. By triangular inequality,


|(Ψ⊤ (κ1 − avn ) − gn,µ )⊤ θµ | ≤ |(κ1 − avn )⊤ Ψθµ | + |gn,µ

θµ |
For the deterministic part, using Lemma 5 and Hölder’s inequality,
2
|(κ1 − avn )⊤ Ψθµ | ≤ κ + kavn k∞ kΨθµ k1 ≤
1−γ
For the stochastic part, let kn denote the sampled column index, in be index of the sampled state-action pair
using the column of kn , and jn be the index of the transited state sampled at the pair given by in . With abuse
of notation, we will use in to index both S × A and S. Let µ = Ψµ θµ . With this notation, we may derive

⊤ ⊤ ⊤ 1
|gn,µ θµ | = |dim(θµ )θµ Ψ̂n (κ1̂n − r̂n − (γ P̂n − Ên )vn )|
1−γ
γvn,jn − vn,in
= dim(θµ )θµ,kn |κ − rin − |
1−γ
2dim(θµ )θµ,kn 2dim(θµ )
≤ ≤
1−γ 1−γ
where we use the facts rin , vn,jn , vn,in ∈ [0, 1] and θµ,kn ≤ 1.
(k)
Let ψµ denote the kth column of Ψ. For V|Fn−1 [(κ1 − avn − gn,µ )⊤ θn ], we can write it as

V|Fn−1 [(Ψ⊤ (κ1 − avn ) − gn,µ )⊤ θµ ] = V|Fn−1 [gn,µ θn ]

≤ E|Fn−1 [|gn,µ θ n |2 ]
"  2 #
X 1 X (k )
n 2 2 γvn,jn − vn,in
= ψµ,in Ejn |in dim(θµ ) θµ,kn κ − rin −
dim(θµ ) i 1−γ
kn n

4dim(θµ ) X 2 X (kn )
≤ θµ,kn ψµ,in
(1 − γ)2 in
kn
!2
4dim(θµ ) X 4dim(θµ )
≤ θµ,kn ≤
(1 − γ)2 (1 − γ)2
kn

γvn,jn −vn,in 2
where in the second inequality we use the fact that |κ − rin − 1−γ |≤ 1−γ .

For the second term (Φ⊤ ⊤


n bµn − gn,v ) θv,n , we use this lemma.
Lemma 17. Let θ ∈ V be arbitrary that is chosen independent of the randomness of gn,v when Fn−1 is given.
4Cv 4Cv2
Then it holds |(Φ⊤ ⊤
n bµn − gn,v ) θ| ≤ 1−γ and V|Fn−1 [(Φ⊤ ⊤
n bµn − gn,v ) θ] ≤ (1−γ)2 .

2
Proof. We appeal to Lemma 5, which shows kbµn k1 ≤ 1−γ and

1 2
kp̃n + (γ P̃n − En )⊤ µ̃n k1 ≤
1−γ 1−γ
Therefore, overall we can derive
 
⊤ ⊤ 1 ⊤ 4Cv
|(Φ bµn − gn,v ) θ| ≤ kbµn k1 + kp̃n + (γ P̃n − En ) µ̃n k1 kΦθv k∞ ≤
1−γ 1−γ
A Reduction from Reinforcement Learning to Online Learning

where we use again each column in Φ has k · k∞ less than one, and k · k∞ ≤ k · k2 . Similarly, for the variance, we
can write
4Cv2
V|Fn−1 [(Φ⊤ ⊤ ⊤ ⊤ 2
n bµn − gn,v ) θ] = V|Fn−1 [gn,v θ] ≤ E|Fn−1 [(gn,v θ) ] ≤
(1 − γ)2

From the above two lemmas, we see the main difference  from the  what we had in Appendix
 D.1 for the tabular
C +dim(θ )
case is that, the martingale difference now scales in O v 1−γ µ instead of O |S||A| 1−γ , and its variance scales
 2   
C +dim(θ ) |S||A|
in O v (1−γ)2 µ instead of O (1−γ) 2 . We note the constant Cv is universal, independent of the problem
size.
Following the similar steps in Appendix D.1, these new results imply that
N
! !
X
⊤ −ǫ2
P (∇hn (θn ) − gn ) θn > ǫ ≤ exp 2 bǫ
n=1
2N σ (1 + 3N σ2 )

Cv2 +dim(θµ )
   
Cv +dim(θµ )
with b = O 1−γ and O (1−γ)2 . This implies that, with probability at least 1 − δ, it hold
q 
N
X N (Cv2 + dim(θµ )) log( 1δ )
(∇hn (θn ) − gn )⊤ θn = Õ  
n=1
1 − γ

E.5 Static Regret of Mirror Descent

Again the steps here are very similar to those in Appendix D.2. We concern bounding the static regret.
N
X
max gn⊤ (θn − θ)
θ∈Θ
n=1

From Appendix D.2, we recall this can be achieved by the mirror descent’s optimality condition. The below
inequality is true, for any θ′ ∈ Θ:
N N
X 1 ′ ′
X 1
hgn , θn − θ i ≤ BR (θ ||θ1 ) + hgn , θn+1 − θn i − BR (θn+1 ||θn )
n=1
η n=1
η

Based on our choice of Bregman divergence given in (37), i.e.


1 dim(θv )
BR (θ′ ||θ) = 2 Cv2 kθv′ − θv k22 + KL(θµ′ ||θµ ), (37)

we have η1 BR (θ′ ||θ1 ) ≤ Õ(1) 1


η . For each hgn , θn+1 − θn i − η BR (θn+1 ||θn ), we will use again the two basic lemmas
we proved in Appendix D.2.
1 ηkgk22
Lemma 10. For any vector x, y, g and scalar η > 0, it holds hg, x − yi − 2η kx − yk22 ≤ 2 .
Lemma 11. Suppose BR (x||y) = KL(x||y) and x, y are probability distributions, and g ≥ 0 element-wise. Then,
for η > 0,
1 ηX η
− BR (y||x) + hg, x − yi ≤ xi (gi )2 = kgk2x.
η 2 i 2

Thus, we have the upper bound

1 Cv2 ηkgn,v k22 ηkgn,µ k2θµ,n


hgn , θn+1 − θn i − BR (θn+1 ||θn ) = +
η dim(θv ) 2 2
Ching-An Cheng, Remi Tachet des Combes, Byron Boots, Geoff Gordon

1 ′
Together with the upper bound on η BR (x ||x1 ), it implies that

N N
X
′1 ′
X 1
hgn , xn − x i ≤ BR (x ||x1 ) + hgn , xn+1 − xn i − BR (xn+1 ||xn )
n=1
η n=1
η
N
Õ(1) η X Cv2
≤ + kgn,v k22 + kgn,µ k2θµ,n (38)
η 2 n=1 dim(θv )

PN Cv2 2
We can expect, with high probability, n=1 dim(θv ) kgn,v k2 + kgn,µ k2θµ,n concentrates toward its expectation, i.e.

N N
Cv2 Cv2
X X  
kgn,v k22 + kgn,µ k2θµ,n ≤ E kgn,v k22 + kgn,µ k2θµ,n + o(N )
n=1
dim(θv ) n=1
dim(θv )

To bound the right-hand side, we will use the upper bounds below, which largely follow the proof of Lemma 16
and Lemma 17.
4dim(θv ) 4dim(θµ )
Lemma 18. E[kgn,v k22 ] ≤ (1−γ)2 and E[kgn,µ k2θµ,n ] ≤ (1−γ)2 .
4dim(θv ) 4dim(θµ )2
Lemma 19. kgn,v k22 ≤ (1−γ)2 and kgn,µ k2θµ,n ≤ (1−γ)2 .

By Azuma-Hoeffding’s inequality in Lemma 13,


s
N N  !
Cv2 2 2 2
 
X
2 2
X Cv 2 2 Cv + dim(θ µ ) 1
kgn,v k2 + kgn,µ kθµ,n ≤ E kgn,v k2 + kgn,µ kθµ,n + O 2
N log
n=1
dim(θv ) n=1
dim(θv ) (1 − γ) δ
 2 s !
Cv2 + dim(θµ )2
  
Cv + dim(θµ ) 1
≤O 2
N + O 2
N log
(1 − γ) (1 − γ) δ

 
√ 1−γ
Now we suppose we set η = O . We have
N (Cv2 +dim(θµ ))

N N p !
X
′ Õ(1) η X Cv2 (Cv2 + dim(θµ ))N
hgn , θn − θ i ≤ + kgn,v k22 + kgn,µ k2θµ,n ≤ Õ
n=1
η 2 n=1 dim(θv ) 1−γ

E.5.1 Union Bound

Lastly we use an union bound to handle the term


N
X
(gn − ∇hn (θn ))⊤ θN

n=1

∗ ∗
We follow the steps in Appendix D.3: we will use again the fact that θN = (θv,N , θµ∗ ) ∈ Θ, so we can handle the
part with θµ∗ using the standard martingale concentration, and the part with θv,N

using the union bound.
Using the previous analyses, we see can first show that the martingale due to the part θµ∗ concentrates in
√ 
N dim(θµ ) log( 1δ ) ∗
Õ 1−γ . Likewise, using the union bound, we can show the martingale due to the part θv,N
√   
N Cv2 log( N
δ ) C
concentrates in Õ 1−γ where N some proper the covering number of the set θv : kθv k2 ≤ √ v
.
dim(θv )
Because log N = O(dim(θv )) for an Euclidean ball. We can combine the two bounds and show together
q 
XN N (Cv2 dim(θv ) + dim(θµ )) log( 1δ )
(gn − ∇hn (θn ))⊤ θN

= Õ  
n=1
1−γ
A Reduction from Reinforcement Learning to Online Learning

E.5.2 Summary

Combining the results of the three parts above, we have, with probability 1 − δ,

RegretN (yN,θ )
N N N
! ! !
X X X

≤ (∇hn (θn ) − gn ) θn + max gn⊤ (θn − θ) + (gn − ∇hn (θn ))⊤ θN

θ∈Θ
n=1 n=1 n=1
q   q 
N (dim(θµ ) + Cv2 ) log( 1δ ) N (Cv2 dim(θv ) + dim(θµ )) log( 1δ )
p !
(Cv2 + dim(θµ ))N
= Õ   + Õ + Õ  
1−γ 1−γ 1−γ
q 
N dim(Θ) log( 1δ )
= Õ  
1−γ

where the last step


 is due to Cv is a universal constant. Or equivalently, the above bounds means a sample
dim(Θ) log( 1δ )
complexity in Õ (1−γ)2 ǫ2 . Finally, we recall the policy performance has a bias ǫΘ,N in Corollary 1 due to
using function approximators. Considering this effect, we have the final statement.

View publication stats

You might also like