
Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes

Chen-Yu Wei Mehdi Jafarnia-Jahromi Haipeng Luo Hiteshi Sharma Rahul Jain
{chenyu.wei, mjafarni, haipengl, hiteshis, rahul.jain}@usc.edu

Abstract

Model-free reinforcement learning is known to be memory and computation efficient and more amenable to large scale problems. In this paper, two model-free algorithms are introduced for learning infinite-horizon average-reward Markov Decision Processes (MDPs). The first algorithm reduces the problem to the discounted-reward version and achieves Õ(T^{2/3}) regret after T steps, under the minimal assumption of weakly communicating MDPs. To our knowledge, this is the first model-free algorithm for general MDPs in this setting. The second algorithm makes use of recent advances in adaptive algorithms for adversarial multi-armed bandits and improves the regret to Õ(√T), albeit with a stronger ergodic assumption. This result significantly improves over the Õ(T^{3/4}) regret achieved by the only existing model-free algorithm by Abbasi-Yadkori et al. (2019a) for ergodic MDPs in the infinite-horizon average-reward setting.

1. Introduction

Reinforcement learning (RL) refers to the problem of an agent interacting with an unknown environment with the goal of maximizing its cumulative reward over time. The environment is usually modeled as a Markov Decision Process (MDP) with an unknown transition kernel and/or an unknown reward function. The fundamental trade-off between exploration and exploitation is the key challenge for RL: should the agent exploit the available information to optimize immediate performance, or should it explore poorly understood states and actions to gather more information and improve future performance?

There are two broad classes of RL algorithms: model-based and model-free. Model-based algorithms maintain an estimate of the underlying MDP and use it to determine a policy during the learning process. Examples include UCRL2 (Jaksch et al., 2010), REGAL (Bartlett & Tewari, 2009), PSRL (Ouyang et al., 2017b), SCAL (Fruit et al., 2018b), UCBVI (Azar et al., 2017), EBF (Zhang & Ji, 2019) and EULER (Zanette & Brunskill, 2019). Model-based algorithms are well-known for their sample efficiency. However, they have two general disadvantages: first, they require large memory to store the estimate of the model parameters; second, it is hard to extend model-based approaches to non-parametric settings, e.g., continuous-state MDPs.

Model-free algorithms, on the other hand, try to resolve these issues by directly maintaining an estimate of the optimal Q-value function or the optimal policy. Examples include Q-learning (Watkins, 1989), Delayed Q-learning (Strehl et al., 2006), TRPO (Schulman et al., 2015), DQN (Mnih et al., 2013), A3C (Mnih et al., 2016), and more. Model-free algorithms are not only computation and memory efficient, but also easier to extend to large scale problems by incorporating function approximation.

It was long believed that model-free algorithms are less sample-efficient than model-based algorithms. However, Jin et al. (2018) recently showed that a (model-free) Q-learning algorithm with UCB exploration achieves a nearly-optimal regret bound, implying the possibility of designing algorithms with the advantages of both model-free and model-based methods. Jin et al. (2018) addressed the problem for episodic finite-horizon MDPs. Following this work, Dong et al. (2019) extended the result to the infinite-horizon discounted-reward setting.

However, whether Q-learning-based model-free algorithms can achieve low regret for infinite-horizon average-reward MDPs, an equally heavily-studied setting in the RL literature, remains unknown. Designing such algorithms has proven to be rather challenging, since the Q-value function estimate may grow unbounded over time and it is hard to control its magnitude in a way that guarantees efficient learning. Moreover, techniques such as backward induction in the finite-horizon setting or contraction mappings in the infinite-horizon discounted setting cannot be applied to the infinite-horizon average-reward setting.

Table 1. Regret comparisons for RL algorithms in infinite-horizon average-reward MDPs with S states, A actions, and T steps. D is the diameter of the MDP, sp(v*) ≤ D is the span of the optimal value function, V*_{s,a} := Var_{s'∼p(·|s,a)}[v*(s')] ≤ sp(v*)² is the variance of the optimal value function, t_mix is the mixing time (Def. 5.1), t_hit is the hitting time (Def. 5.2), and ρ ≤ t_hit is a distribution mismatch coefficient (Eq. (4)). For more concrete definitions of these parameters, see Sections 3-5.

  Type        | Algorithm                              | Regret                        | Comment
  Model-based | REGAL (Bartlett & Tewari, 2009)        | Õ(sp(v*) √(SAT))              | no efficient implementation
  Model-based | UCRL2 (Jaksch et al., 2010)            | Õ(DS √(AT))                   | -
  Model-based | PSRL (Ouyang et al., 2017b)            | Õ(sp(v*) S √(AT))             | Bayesian regret
  Model-based | OSP (Ortner, 2018)                     | Õ(√(t_mix SAT))               | ergodic assumption and no efficient implementation
  Model-based | SCAL (Fruit et al., 2018b)             | Õ(sp(v*) S √(AT))             | -
  Model-based | KL-UCRL (Talebi & Maillard, 2018)      | Õ(√(S Σ_{s,a} V*_{s,a} T))    | -
  Model-based | UCRL2B (Fruit et al., 2019)            | Õ(S √(DAT))                   | -
  Model-based | EBF (Zhang & Ji, 2019)                 | Õ(√(DSAT))                    | no efficient implementation
  Model-free  | POLITEX (Abbasi-Yadkori et al., 2019a) | t_mix³ t_hit √(SA) T^{3/4}    | ergodic assumption
  Model-free  | Optimistic Q-learning (this work)      | Õ(sp(v*) (SA)^{1/3} T^{2/3})  | -
  Model-free  | MDP-OOMD (this work)                   | Õ(√(t_mix³ ρ A T))            | ergodic assumption
  Lower bound | (Jaksch et al., 2010)                  | Ω(√(DSAT))                    | -

In this paper, we make significant progress in this direction and propose two model-free algorithms for learning infinite-horizon average-reward MDPs. The first algorithm, Optimistic Q-learning (Section 4), achieves a regret bound of Õ(T^{2/3}) with high probability for the broad class of weakly communicating MDPs.¹ This is the first model-free algorithm in this setting under only the minimal weakly communicating assumption. The key idea of this algorithm is to artificially introduce a discount factor for the reward, to avoid the aforementioned unbounded Q-value estimate issue, and to trade off this effect against the approximation introduced by the discount factor. We remark that this is very different from the R-learning algorithm of (Schwartz, 1993), which is a variant of Q-learning with no discount factor for the infinite-horizon average-reward setting.

¹ Throughout the paper, we use the notation Õ(·) to suppress logarithmic terms.

The second algorithm, MDP-OOMD (Section 5), attains an improved regret bound of Õ(√T) for the more restricted class of ergodic MDPs. This algorithm maintains an instance of a multi-armed bandit algorithm at each state to learn the best action. Importantly, the multi-armed bandit algorithm needs to ensure several key properties to achieve our claimed regret bound, and to this end we make use of recent advances in adaptive adversarial bandit algorithms from (Wei & Luo, 2018) in a novel way.

To the best of our knowledge, the only existing model-free algorithm for this setting is the POLITEX algorithm (Abbasi-Yadkori et al., 2019a;b), which achieves Õ(T^{3/4}) regret for ergodic MDPs only. Both of our algorithms enjoy a better bound compared to POLITEX, and the first algorithm even removes the ergodic assumption completely.²

² POLITEX is studied in a more general setup with function approximation though. See the end of Section 5.1 for more comparisons.

For comparisons with other existing model-based approaches for this problem, see Table 1. We also conduct experiments comparing our two algorithms; details are deferred to Appendix D due to space constraints.

2. Related Work

We review the related literature with regret guarantees for learning MDPs with finite state and action spaces (there are many other works on asymptotic convergence or sample complexity, a different focus compared to ours). Three common settings have been studied: 1) the finite-horizon episodic setting, 2) the infinite-horizon discounted setting, and 3) the infinite-horizon average-reward setting. For the first two settings, previous works have designed efficient algorithms with regret bounds or sample complexity that are (almost) information-theoretically optimal, using either model-based approaches such as (Azar et al., 2017), or model-free approaches such as (Jin et al., 2018; Dong et al., 2019).

For the infinite-horizon average-reward setting, many model-based algorithms have been proposed, such as (Auer & Ortner, 2007; Jaksch et al., 2010; Ouyang et al., 2017b; Agrawal & Jia, 2017; Talebi & Maillard, 2018; Fruit et al., 2018a;b). These algorithms either conduct posterior sampling or follow the optimism in the face of uncertainty principle to build an MDP model estimate and then plan according to the estimate (hence model-based). They all achieve Õ(√T) regret, but the dependence on other parameters is suboptimal. Recent works have made progress toward obtaining the optimal bound (Ortner, 2018; Zhang & Ji, 2019); however, their algorithms are not computationally efficient: the time complexity scales exponentially with the number of states. On the other hand, except for the naive approach of combining Q-learning with ε-greedy exploration (which is known to suffer regret exponential in some parameters (Osband et al., 2014)), the only existing model-free algorithm for this setting is POLITEX, which only works for ergodic MDPs.

Two additional works are closely related to our second algorithm MDP-OOMD: (Neu et al., 2013) and (Wang, 2017). Both belong to the policy optimization family, where the learner tries to learn the parameters of the optimal policy directly. Their settings are quite different from ours and the results are not comparable. We defer more detailed comparisons with these two works to the end of Section 5.1.

3. Preliminaries

An infinite-horizon average-reward Markov Decision Process (MDP) is described by (S, A, r, p), where S is the state space, A is the action space, r : S × A → [0, 1] is the reward function, and p : S² × A → [0, 1] is the transition probability such that p(s'|s, a) := P(s_{t+1} = s' | s_t = s, a_t = a) for s_t ∈ S, a_t ∈ A and t = 1, 2, 3, .... We assume that S and A are finite sets with cardinalities S and A, respectively. The average reward per stage of a deterministic/stationary policy π : S → A starting from state s is defined as

    J^π(s) := liminf_{T→∞} (1/T) E[ Σ_{t=1}^{T} r(s_t, π(s_t)) | s_1 = s ],

where s_{t+1} is drawn from p(·|s_t, π(s_t)). Let J*(s) := max_{π∈A^S} J^π(s). A policy π* is said to be optimal if it satisfies J^{π*}(s) = J*(s) for all s ∈ S.

We consider two standard classes of MDPs in this paper: (1) weakly communicating MDPs, defined in Section 4, and (2) ergodic MDPs, defined in Section 5. The weakly communicating assumption is weaker than the ergodic assumption, and is in fact known to be necessary for learning infinite-horizon MDPs with low regret (Bartlett & Tewari, 2009).

Standard MDP theory (Puterman, 2014) shows that for these two classes, there exist q* : S × A → R (unique up to an additive constant) and a unique J* ∈ [0, 1] such that J*(s) = J* for all s ∈ S and the following Bellman equation holds:

    J* + q*(s, a) = r(s, a) + E_{s'∼p(·|s,a)}[v*(s')],    (1)

where v*(s) := max_{a∈A} q*(s, a). The optimal policy is then obtained by π*(s) = argmax_a q*(s, a).

We consider a learning problem where S, A and the reward function r are known to the agent, but the transition probability p is not (so one cannot directly solve the Bellman equation). Knowledge of the reward function is a typical assumption, as in (Bartlett & Tewari, 2009; Gopalan & Mannor, 2015; Ouyang et al., 2017b), and can be removed at the expense of a constant factor in the regret bound.

Specifically, the learning protocol is as follows. An agent starts at an arbitrary state s_1 ∈ S. At each time step t = 1, 2, 3, ..., the agent observes state s_t ∈ S and takes action a_t ∈ A, which is a function of the history s_1, a_1, s_2, a_2, ..., s_{t−1}, a_{t−1}, s_t. The environment then determines the next state by drawing s_{t+1} according to p(·|s_t, a_t). The performance of a learning algorithm is evaluated through the notion of cumulative regret, defined as the difference between the total reward of the optimal policy and that of the algorithm:

    R_T := Σ_{t=1}^{T} ( J* − r(s_t, a_t) ).

Since r ∈ [0, 1] (and consequently J* ∈ [0, 1]), the regret can at worst grow linearly with T. If a learning algorithm achieves sub-linear regret, then R_T/T goes to zero, i.e., the average reward of the algorithm converges to the optimal per-stage reward J*. The best existing regret bound is Õ(√(DSAT)), achieved by a model-based algorithm (Zhang & Ji, 2019) (where D is the diameter of the MDP), and it matches the lower bound of (Jaksch et al., 2010).

4. Optimistic Q-Learning

In this section, we introduce our first algorithm, OPTIMISTIC Q-LEARNING (see Algorithm 1 for pseudocode). The algorithm works for any weakly communicating MDP. An MDP is weakly communicating if its state space S can be partitioned into two subsets: in the first subset, all states are transient under any stationary policy; in the second subset, every two states are accessible from each other under some stationary policy.
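To make the quantities above concrete, the following sketch solves the Bellman equation (1) numerically for a small MDP with a known transition kernel via relative value iteration. This is only an illustration of the definitions of J*, q*, and v*, not part of the learning algorithms in this paper; the reference state, the stopping rule, and the toy MDP are arbitrary choices, and convergence of this iteration is guaranteed only under standard conditions (e.g., aperiodic unichain MDPs).

```python
import numpy as np

def relative_value_iteration(p, r, tol=1e-10, max_iter=100_000):
    """Illustrative sketch (not from the paper): approximately solve
    J* + q*(s,a) = r(s,a) + E_{s'~p(.|s,a)}[v*(s')] with v*(s) = max_a q*(s,a)
    for a known MDP.  p has shape (S, A, S); r has shape (S, A) with values in [0,1].
    State 0 is used as the reference state."""
    S, A = r.shape
    v = np.zeros(S)
    for _ in range(max_iter):
        q = r + p @ v                 # q[s,a] = r(s,a) + sum_s' p(s'|s,a) v(s')
        v_new = q.max(axis=1)
        v_new -= v_new[0]             # subtract the reference value to keep v bounded
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    q = r + p @ v
    J = (q.max(axis=1) - v)[0]        # the gain J*; identical for all states at convergence
    return J, q - J, v                # shift q so that J + q = r + p v holds exactly

# Tiny 2-state, 2-action example (hypothetical numbers, purely for illustration).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.1, 0.6]])
J, q, v = relative_value_iteration(p, r)
print(J, v)
```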

Algorithm 1 OPTIMISTIC Q-LEARNING

  Parameters: H ≥ 2, confidence level δ ∈ (0, 1)
  Initialization: γ = 1 − 1/H; ∀s: V̂_1(s) = H; ∀s, a: Q_1(s, a) = Q̂_1(s, a) = H, n_1(s, a) = 0
  Define: ∀τ, α_τ = (H + 1)/(H + τ), b_τ = 4 sp(v*) √((H/τ) ln(2T/δ))
  for t = 1, ..., T do
    1. Take action a_t = argmax_{a∈A} Q̂_t(s_t, a).
    2. Observe s_{t+1}.
    3. Update:
         n_{t+1}(s_t, a_t) ← n_t(s_t, a_t) + 1
         τ ← n_{t+1}(s_t, a_t)
         Q_{t+1}(s_t, a_t) ← (1 − α_τ) Q_t(s_t, a_t) + α_τ [ r(s_t, a_t) + γ V̂_t(s_{t+1}) + b_τ ]    (2)
         Q̂_{t+1}(s_t, a_t) ← min{ Q̂_t(s_t, a_t), Q_{t+1}(s_t, a_t) }
         V̂_{t+1}(s_t) ← max_{a∈A} Q̂_{t+1}(s_t, a)
  (All other entries of n_{t+1}, Q_{t+1}, Q̂_{t+1}, V̂_{t+1} remain the same as those in n_t, Q_t, Q̂_t, V̂_t.)

It is well-known that the weakly communicating condition is necessary for ensuring low regret in this setting (Bartlett & Tewari, 2009). Define sp(v*) = max_s v*(s) − min_s v*(s) to be the span of the optimal value function, which is known to be bounded for weakly communicating MDPs; in particular, it is bounded by the diameter of the MDP (see (Lattimore & Szepesvári, 2018, Lemma 38.1)). We assume that sp(v*) is known and use it to set the parameters. In the case when it is unknown, we can replace sp(v*) with any upper bound on it (e.g., the diameter) in both the algorithm and the analysis.

The key idea of Algorithm 1 is to solve the undiscounted problem by learning a discounted MDP (with the same states, actions, reward function, and transitions) for some discount factor γ (defined in terms of a parameter H). Define V* and Q* to be the optimal value function and Q-function of the discounted MDP, satisfying the Bellman equation:

    ∀(s, a): Q*(s, a) = r(s, a) + γ E_{s'∼p(·|s,a)}[V*(s')],
    ∀s: V*(s) = max_{a∈A} Q*(s, a).

The way we learn this discounted MDP is essentially the same as the algorithm of Dong et al. (2019), which itself is based on the idea from (Jin et al., 2018). Specifically, the algorithm maintains an estimate V̂_t of the optimal value function V* and an estimate Q̂_t of the optimal Q-function Q*, where Q̂_t is a clipped version of another estimate Q_t. Each time, the algorithm takes a greedy action with the maximum estimated Q-value (Line 1). After seeing the next state, the algorithm makes a stochastic update of Q_t based on the Bellman equation, importantly with an extra bonus term b_τ and a carefully chosen step size α_τ (Eq. (2)). Here, τ is the number of times the current state-action pair has been visited, and the bonus term b_τ scales as O(√(H/τ)), which encourages exploration since it shrinks every time a state-action pair is executed. The choice of the step size α_τ is also crucial, as pointed out in (Jin et al., 2018), and determines a certain effective period of the history used for the current update.

While the algorithmic idea is similar to (Dong et al., 2019), we emphasize that our analysis is different and novel:

• First, Dong et al. (2019) analyze the sample complexity of their algorithm while we analyze the regret.

• Second, we need to deal with the approximation effect due to the difference between the discounted MDP and the original undiscounted one (Lemma 2).

• Finally, part of our analysis improves over that of (Dong et al., 2019) (specifically our Lemma 3). Following the original analysis of (Dong et al., 2019) would lead to a worse bound here.

We now state the main regret guarantee of Algorithm 1.

Theorem 1. If the MDP is weakly communicating, Algorithm 1 with H = min{ √(sp(v*)T/(SA)), (T/(SA ln(4T/δ)))^{1/3} } ensures that, with probability at least 1 − δ, R_T is of order

    O( √(sp(v*) SAT) + sp(v*) T^{2/3} (SA ln(T/δ))^{1/3} + √(T ln(1/δ)) ).

Our regret bound scales as Õ(T^{2/3}) and is suboptimal compared to model-based approaches with Õ(√T) regret (such as UCRL2) that match the information-theoretic lower bound (Jaksch et al., 2010). However, this is the first model-free algorithm with sub-linear regret (under only the weakly communicating condition), and how to achieve Õ(√T) regret via model-free algorithms remains unknown. Also note that our bound depends on sp(v*) instead of the potentially much larger diameter of the MDP. To our knowledge, existing approaches that achieve sp(v*) dependence are all model-based (Bartlett & Tewari, 2009; Ouyang et al., 2017b; Fruit et al., 2018b) and use very different arguments.
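As a concrete illustration of the update rule in Eq. (2), here is a minimal NumPy sketch of Algorithm 1. It assumes the known reward table `r`, an upper bound `sp_v` on sp(v*), and a sampler `step(s, a)` for the unknown transitions are provided; it is our own illustrative rendering of the pseudocode above, not the authors' implementation.

```python
import numpy as np

def optimistic_q_learning(step, r, S, A, T, H, sp_v, delta):
    """Illustrative sketch of Algorithm 1.  `step(s, a)` returns the next state
    sampled from p(.|s, a); `r` is the known reward table of shape (S, A)."""
    gamma = 1.0 - 1.0 / H
    V_hat = np.full(S, float(H))
    Q = np.full((S, A), float(H))        # running estimate Q_t
    Q_hat = np.full((S, A), float(H))    # clipped (non-increasing) estimate used for acting
    n = np.zeros((S, A), dtype=int)

    s = 0                                 # arbitrary starting state
    total_reward = 0.0
    for t in range(T):
        a = int(np.argmax(Q_hat[s]))      # greedy action w.r.t. the optimistic estimate
        s_next = step(s, a)
        total_reward += r[s, a]

        n[s, a] += 1
        tau = n[s, a]
        alpha = (H + 1) / (H + tau)                                   # step size alpha_tau
        bonus = 4 * sp_v * np.sqrt(H / tau * np.log(2 * T / delta))   # exploration bonus b_tau
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r[s, a] + gamma * V_hat[s_next] + bonus)
        Q_hat[s, a] = min(Q_hat[s, a], Q[s, a])
        V_hat[s] = Q_hat[s].max()
        s = s_next
    return total_reward
```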

4.1. Proof sketch of Theorem 1

The proof starts by decomposing the regret as

    R_T = Σ_{t=1}^{T} (J* − r(s_t, a_t))
        = Σ_{t=1}^{T} (J* − (1 − γ)V*(s_t))
        + Σ_{t=1}^{T} (V*(s_t) − Q*(s_t, a_t))
        + Σ_{t=1}^{T} (Q*(s_t, a_t) − γV*(s_t) − r(s_t, a_t)).

Each of these three terms is handled through Lemmas 2, 3 and 4, whose proofs are deferred to the appendix. Plugging in γ = 1 − 1/H and picking the optimal H finishes the proof. One can see that the Õ(T^{2/3}) regret comes from the bound T/H for the first term and the bound √(HT) for the second.

Lemma 2. The optimal value function V* of the discounted MDP satisfies

1. |J* − (1 − γ)V*(s)| ≤ (1 − γ) sp(v*), ∀s ∈ S,
2. sp(V*) ≤ 2 sp(v*).

This lemma shows that the difference between the optimal value in the discounted setting (scaled by 1 − γ) and that of the undiscounted setting is small as long as γ is close to 1. The proof combines the Bellman equations of the two settings with direct calculations.

Lemma 3. With probability at least 1 − δ, we have

    Σ_{t=1}^{T} (V*(s_t) − Q*(s_t, a_t)) ≤ 4HSA + 24 sp(v*) √(HSAT ln(2T/δ)).

This lemma is one of our key technical contributions. To prove this lemma one can write

    Σ_{t=1}^{T} (V*(s_t) − Q*(s_t, a_t)) = Σ_{t=1}^{T} (V*(s_t) − V̂_t(s_t)) + Σ_{t=1}^{T} (Q̂_t(s_t, a_t) − Q*(s_t, a_t)),

using the fact that V̂_t(s_t) = Q̂_t(s_t, a_t) by the greedy policy. The main part of the proof is to show that the second summation can in fact be bounded by Σ_{t=2}^{T+1} (V̂_t(s_t) − V*(s_t)) plus a small sub-linear term, which cancels with the first summation.

Lemma 4. With probability at least 1 − δ,

    Σ_{t=1}^{T} (Q*(s_t, a_t) − γV*(s_t) − r(s_t, a_t)) ≤ 2 sp(v*) √(2T ln(1/δ)) + 2 sp(v*).

This lemma is proven via the Bellman equation for the discounted setting and Azuma's inequality.

5. Õ(√T) Regret for Ergodic MDPs

In this section, we propose another model-free algorithm that achieves an Õ(√T) regret bound for ergodic MDPs, a sub-class of weakly communicating MDPs. An MDP is ergodic if for any stationary policy π, the induced Markov chain is irreducible and aperiodic. Learning ergodic MDPs is arguably easier than the general case because the MDP is explorative by itself. However, achieving an Õ(√T) regret bound in this case with model-free methods is still highly non-trivial, and we are not aware of any such result in the literature. Below, we first introduce a few useful properties of ergodic MDPs, all of which can be found in (Puterman, 2014).

We use randomized policies in this approach. A randomized policy π maps every state s to a distribution over actions π(·|s) ∈ Δ_A, where Δ_A = {x ∈ R^A_+ : Σ_a x(a) = 1}. In an ergodic MDP, any policy π induces a Markov chain with a unique stationary distribution μ^π ∈ Δ_S satisfying (μ^π)^⊤ P^π = (μ^π)^⊤, where P^π ∈ R^{S×S} is the induced transition matrix defined as P^π(s, s') = Σ_a π(a|s) p(s'|s, a). We denote the stationary distribution of the optimal policy π* by μ*.

For ergodic MDPs, the long-term average reward J^π of any fixed policy π is independent of the starting state and can be written as J^π = (μ^π)^⊤ r^π, where r^π ∈ [0, 1]^S is such that r^π(s) := Σ_a π(a|s) r(s, a). For any policy π, the following Bellman equation has a solution q^π : S × A → R that is unique up to an additive constant:

    J^π + q^π(s, a) = r(s, a) + E_{s'∼p(·|s,a)}[v^π(s')],

where v^π(s) = Σ_a π(a|s) q^π(s, a). In this section, we impose an extra constraint, Σ_s μ^π(s) v^π(s) = 0, so that q^π is indeed unique. In this case, it can be shown that v^π has the following form:

    v^π(s) = Σ_{t=0}^{∞} ( e_s^⊤ (P^π)^t − (μ^π)^⊤ ) r^π,    (3)

where e_s is the basis vector with 1 in coordinate s.

Furthermore, ergodic MDPs have finite mixing time and hitting time, defined as follows.

Definition 5.1 ((Levin & Peres, 2017; Wang, 2017)). The mixing time of an ergodic MDP is defined as

    t_mix := max_π min{ t ≥ 1 : ‖(P^π)^t(s, ·) − μ^π‖_1 ≤ 1/4, ∀s },

that is, the maximum time required for any policy starting at any initial state to make the state distribution 1/4-close (in ℓ_1 norm) to the stationary distribution.

Definition 5.2. The hitting time of an ergodic MDP is defined as

    t_hit := max_π max_s 1/μ^π(s),

that is, the maximum inverse stationary probability of visiting any state under any policy.

Our regret bound also depends on the following distribution mismatch coefficient:

    ρ := max_π Σ_s μ*(s)/μ^π(s),    (4)

which has been used in previous work (Kakade & Langford, 2002; Agarwal et al., 2019). Clearly, one has ρ ≤ t_hit Σ_s μ*(s) = t_hit. Note that these quantities are all parameters of the MDP only and are considered as finite constants compared to the horizon T. We thus assume that T is large enough so that t_mix and t_hit are both smaller than T/4. Also, we assume that these quantities are known to the algorithm.

5.1. Policy Optimization via Optimistic OMD

The key to getting an Õ(√T) bound is to learn the optimal policy π* directly, by reducing the problem to solving an adversarial multi-armed bandit (MAB) (Auer et al., 2002) instance at each individual state.

The details of our algorithm MDP-OOMD are shown in Algorithm 2. It proceeds in episodes, and maintains an independent copy of a specific MAB algorithm for each state.

Algorithm 2 MDP-OOMD

  Define: episode length B = 16 t_mix t_hit (log_2 T)² and number of episodes K = T/B
  Initialize: π'_1(a|s) = π_1(a|s) = 1/A, ∀s, a.
  for k = 1, 2, ..., K do
    for t = (k − 1)B + 1, ..., kB do
      Draw a_t ∼ π_k(·|s_t) and observe s_{t+1}.
    Define trajectory T_k = (s_{(k−1)B+1}, a_{(k−1)B+1}, ..., s_{kB}, a_{kB}).
    for all s ∈ S do
      β̂_k(s, ·) = ESTIMATEQ(T_k, π_k, s).
      (π_{k+1}(·|s), π'_{k+1}(·|s)) = OOMDUPDATE(π'_k(·|s), β̂_k(s, ·)).

Algorithm 3 ESTIMATEQ

  Input:
    T : a state-action trajectory (s_{t_1}, a_{t_1}, ..., s_{t_2}, a_{t_2}) from t_1 to t_2
    π : the policy used to sample the trajectory T
    s : target state
  Define: N = 4 t_mix log_2 T (window length minus 1)
  Initialize: τ ← t_1, i ← 0
  1  while τ ≤ t_2 − N do
  2    if s_τ = s then
  3      i ← i + 1
  4      Let R = Σ_{t=τ}^{τ+N} r(s_t, a_t).
  5      Let y_i(a) = (R/π(a|s)) 1[a_τ = a], ∀a.  (y_i ∈ R^A)
  6      τ ← τ + 2N
  7    else
         τ ← τ + 1
  8  if i ≠ 0 then
       return (1/i) Σ_{j=1}^{i} y_j.
  9  else
       return 0.

Algorithm 4 OOMDUPDATE

  Input: π' ∈ Δ_A, β̂ ∈ R^A
  Define:
    Regularizer ψ(x) = (1/η) Σ_{a=1}^{A} log(1/x(a)), for x ∈ R^A_+
    Bregman divergence associated with ψ: D_ψ(x, x') = ψ(x) − ψ(x') − ⟨∇ψ(x'), x − x'⟩
  Update:
    π'_next = argmax_{π∈Δ_A} { ⟨π, β̂⟩ − D_ψ(π, π') }    (5)
    π_next = argmax_{π∈Δ_A} { ⟨π, β̂⟩ − D_ψ(π, π'_next) }    (6)
  return (π_next, π'_next).

At the beginning of episode k, each MAB algorithm outputs an action distribution π_k(·|s) for the corresponding state s, which together induce a policy π_k. The learner then executes policy π_k throughout episode k. At the end of the episode, for every state s we feed a reward estimator β̂_k(s, ·) ∈ R^A to the corresponding MAB algorithm, where β̂_k is constructed using the samples collected in episode k (see Algorithm 3). Finally, all MAB algorithms update their distributions and output π_{k+1} for the next episode (Algorithm 4).
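The two argmax steps in Algorithm 4 have no closed form, but with the log-barrier regularizer they reduce to a one-dimensional normalization search: setting the gradient of the objective to zero gives π(a) = 1/(1/π'(a) − η(β̂(a) − λ)) for a scalar λ chosen so that π sums to one. The sketch below implements this with a simple bisection; it is our own illustrative rendering based on that derivation, not the authors' code, and the bracketing constants and iteration count are arbitrary choices.

```python
import numpy as np

def log_barrier_omd_step(pi_prev, beta_hat, eta, iters=100):
    """Illustrative sketch: solve argmax_{pi in simplex} <pi, beta_hat> - D_psi(pi, pi_prev)
    for psi(x) = (1/eta) * sum_a log(1/x(a)).  Stationarity gives
    pi(a) = 1 / (1/pi_prev(a) - eta*(beta_hat(a) - lam)), with lam fixed by bisection
    so that the entries sum to one."""
    A = len(pi_prev)
    lo = np.max(beta_hat - 1.0 / (eta * pi_prev))   # below this some denominator is <= 0
    hi = np.max(beta_hat) + A / eta                  # large enough that the sum is <= 1
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        pi = 1.0 / (1.0 / pi_prev - eta * (beta_hat - lam))
        if pi.sum() > 1.0:
            lo = lam
        else:
            hi = lam
    pi = 1.0 / (1.0 / pi_prev - eta * (beta_hat - hi))
    return pi / pi.sum()                             # tiny renormalization for numerical safety

def oomd_update(pi_aux, beta_hat, eta):
    """Algorithm 4: a standard OMD step on the auxiliary iterate (Eq. (5)), followed by an
    optimistic step that reuses beta_hat as the prediction of the next reward (Eq. (6))."""
    pi_aux_next = log_barrier_omd_step(pi_aux, beta_hat, eta)
    pi_next = log_barrier_omd_step(pi_aux_next, beta_hat, eta)
    return pi_next, pi_aux_next
```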

The reward estimator β̂_k(s, ·) is an almost unbiased estimator of

    β^{π_k}(s, ·) := q^{π_k}(s, ·) + N J^{π_k}    (7)

with negligible bias (N is defined in Algorithm 3). The term N J^{π_k} is the same for all actions, and thus the corresponding MAB algorithm is trying to learn the best action at state s in terms of the average of the Q-value functions q^{π_1}(s, ·), ..., q^{π_K}(s, ·). To construct the reward estimator for state s, the sub-routine ESTIMATEQ collects non-overlapping intervals of length N + 1 = Õ(t_mix) that start from state s, and uses standard inverse-propensity scoring to construct an estimator y_i for interval i (Line 5). In fact, to reduce the correlation among the non-overlapping intervals, we also make sure that these intervals are at least N steps apart from each other (Line 6). The final estimator β̂_k(s, ·) is simply the average of all estimators y_i over these disjoint intervals. This averaging is important for reducing the variance, as explained later (see also Lemma 6).

The MAB algorithm we use is optimistic online mirror descent (OOMD) (Rakhlin & Sridharan, 2013) with the log-barrier as the regularizer, analyzed in depth in (Wei & Luo, 2018). Here, optimism refers to something different from the optimistic exploration in Section 4. It corresponds to the fact that after a standard mirror descent update (Eq. (5)), the algorithm further makes a similar update using an optimistic prediction of the next reward vector, which in our case is simply the previous reward estimator (Eq. (6)). We refer the reader to (Wei & Luo, 2018) for more details, but point out that the optimistic prediction we use here is new.

It is clear that each MAB algorithm faces a non-stochastic problem (since π_k is changing over time), and thus it is important to deploy an adversarial MAB algorithm. The standard algorithm for adversarial MAB is EXP3 (Auer et al., 2002), which was also used for solving adversarial MDPs (Neu et al., 2013) (more comparisons with this to follow). However, there are several important reasons for our choice of the recently developed OOMD with log-barrier:

• First, the log-barrier regularizer produces a more exploratory distribution compared to EXP3 (as noticed in e.g. (Agarwal et al., 2017)), so we do not need an explicit exploration over the actions, which significantly simplifies the analysis compared to (Neu et al., 2013).

• Second, the log-barrier regularizer provides more stable updates compared to EXP3, in the sense that π_k(a|s) and π_{k−1}(a|s) are within a multiplicative factor of each other (see Lemma 7). This implies that the corresponding policies and their Q-value functions are also stable, which is critical for our analysis.

• Finally, the optimistic prediction of OOMD, together with our particular reward estimator from ESTIMATEQ, provides a variance reduction effect that leads to a better regret bound in terms of ρ instead of t_hit. See Lemma 8 and Lemma 9.

The regret guarantee of our algorithm is shown below.

Theorem 5. For ergodic MDPs, with an appropriately chosen learning rate η for Algorithm 4, MDP-OOMD achieves

    E[R_T] = Õ( √(t_mix³ ρ A T) ).

Note that in this bound, the dependence on the number of states S is hidden in ρ, since ρ ≥ Σ_s μ*(s)/μ*(s) = S. Compared to the bound of Algorithm 1 or some other model-based algorithms such as UCRL2, this bound has an extra dependence on t_mix, a potentially large constant. As far as we know, all existing mirror-descent-based algorithms for the average-reward setting have the same issue (such as (Neu et al., 2013; Wang, 2017; Abbasi-Yadkori et al., 2019a)). The role of t_mix in our analysis is almost the same as that of 1/(1 − γ) in the discounted setting (γ is the discount factor). Specifically, a small t_mix ensures 1) that a short trajectory suffices to approximate the Q-function with the expected trajectory reward (in view of Eq. (11)), and 2) an upper bound on the magnitude of q(s, a) and v(s) (Lemma 14). In the discounted setting, these are ensured by the discount factor already.

Comparisons. Neu et al. (2013) considered learning ergodic MDPs with a known transition kernel and adversarial rewards, a setting incomparable to ours. Their algorithm maintains a copy of EXP3 for each state, but the reward estimators fed to these algorithms are constructed using knowledge of the transition kernel and are very different from ours. They proved a regret bound of order Õ(√(t_mix³ t_hit A T)), which is worse than ours since ρ ≤ t_hit.

In another recent work, (Wang, 2017) considered learning ergodic MDPs under the assumption that the learner is provided with a generative model (an oracle that takes in a state-action pair and outputs a sample of the next state). They derived a sample-complexity bound of order Õ(t_mix² τ² SA / ε²) for finding an ε-optimal policy, where τ = max{ max_s (μ*(s)/(1/S))², max_{s',π} ((1/S)/μ^π(s'))² }, which is at least max_π max_{s,s'} μ*(s)/μ^π(s') by the AM-GM inequality. This result is again incomparable to ours, but we point out that our distribution mismatch coefficient ρ is always bounded by τS, while τ can be much larger than ρ on the other hand.
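The estimator β̂_k described above is simple to state in code. The sketch below mirrors Algorithm 3 on a recorded trajectory: it collects windows of length N + 1 that start at the target state, skips an extra N steps after each window to reduce correlation, applies inverse-propensity weighting to the action taken at the start of the window, and averages. It is a plain re-implementation of the pseudocode for illustration; variable names are ours.

```python
import numpy as np

def estimate_q(states, actions, rewards, pi_s, s_target, N):
    """Illustrative sketch of Algorithm 3 (ESTIMATEQ) on one episode's trajectory.
    states/actions/rewards: arrays of equal length; pi_s: the action distribution
    pi(.|s_target) used during the episode; N = 4 * t_mix * log2(T)."""
    A = len(pi_s)
    T_len = len(states)
    estimates = []                                   # one inverse-propensity estimate y_i per window
    tau = 0
    while tau <= T_len - 1 - N:
        if states[tau] == s_target:
            R = rewards[tau:tau + N + 1].sum()       # total reward of the length-(N+1) window
            y = np.zeros(A)
            y[actions[tau]] = R / pi_s[actions[tau]]  # importance weighting of the taken action
            estimates.append(y)
            tau += 2 * N                              # rest for N extra steps before the next window
        else:
            tau += 1
    if estimates:
        return np.mean(estimates, axis=0)
    return np.zeros(A)
```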

Finally, Abbasi-Yadkori et al. (2019a) consider a more general setting with function approximation, and their algorithm POLITEX maintains a copy of the standard exponential weights algorithm for each state, very similar to (Neu et al., 2013). When specialized to our tabular setting, one can verify (according to their Theorem 5.2) that POLITEX achieves t_mix³ t_hit √(SA) T^{3/4} regret, which is significantly worse than ours in terms of all parameters.

5.2. Proof sketch of Theorem 5

We first decompose the regret as follows:

    R_T = Σ_{t=1}^{T} (J* − r(s_t, a_t))
        = B Σ_{k=1}^{K} (J* − J^{π_k}) + Σ_{k=1}^{K} Σ_{t∈I_k} (J^{π_k} − r(s_t, a_t)),    (8)

where I_k := {(k − 1)B + 1, ..., kB} is the set of time steps of episode k. Using the reward difference lemma (Lemma 15 in the appendix), the first term of Eq. (8) can be written as

    B Σ_s μ*(s) [ Σ_{k=1}^{K} Σ_a (π*(a|s) − π_k(a|s)) q^{π_k}(s, a) ],

where the term in the square brackets can be recognized as exactly the regret of the MAB algorithm for state s and is analyzed in Lemma 8 of Section 5.3. Combining the regret of all MAB algorithms, Lemma 9 then shows that in expectation the first term of Eq. (8) is at most

    Õ( BA/η + η T N³ ρ / B + η³ T N⁶ ).    (9)

On the other hand, the expectation of the second term in Eq. (8) can be further written as

    E[ Σ_{k=1}^{K} Σ_{t∈I_k} (J^{π_k} − r(s_t, a_t)) ]
    = E[ Σ_{k=1}^{K} Σ_{t∈I_k} ( E_{s'∼p(·|s_t,a_t)}[v^{π_k}(s')] − q^{π_k}(s_t, a_t) ) ]    (Bellman equation)
    = E[ Σ_{k=1}^{K} Σ_{t∈I_k} ( E_{s'∼p(·|s_t,a_t)}[v^{π_k}(s')] − v^{π_k}(s_{t+1}) ) ]
      + E[ Σ_{k=1}^{K} Σ_{t∈I_k} ( v^{π_k}(s_t) − q^{π_k}(s_t, a_t) ) ]
      + E[ Σ_{k=1}^{K} Σ_{t∈I_k} ( v^{π_k}(s_{t+1}) − v^{π_k}(s_t) ) ]
    = E[ Σ_{k=1}^{K} ( v^{π_k}(s_{kB+1}) − v^{π_k}(s_{(k−1)B+1}) ) ]    (the first two terms above are zero)
    = E[ Σ_{k=1}^{K−1} ( v^{π_k}(s_{kB+1}) − v^{π_{k+1}}(s_{kB+1}) ) ] + E[ v^{π_K}(s_{KB+1}) − v^{π_1}(s_1) ].    (10)

The first term in the last expression can be bounded by O(ηN³K) = O(ηN³T/B) due to the stability of OOMDUPDATE (Lemma 7), and the second term is at most O(t_mix) according to Lemma 14 in the appendix.

Combining these facts with N = Õ(t_mix), B = Õ(t_mix t_hit), Eq. (8) and Eq. (9), and choosing the optimal η, we arrive at

    E[R_T] = Õ( BA/η + η t_mix³ ρ T / B + η³ t_mix⁶ T )
           = Õ( √(t_mix³ ρ A T) + t_mix³ t_hit A^{3/4} T^{1/4} + t_mix² t_hit A ).

5.3. Auxiliary Lemmas

To analyze the regret, we establish several useful lemmas, whose proofs can be found in the Appendix. First, we show that β̂_k(s, a) is an almost unbiased estimator of β^{π_k}(s, a).

Lemma 6. Let E_k[x] denote the expectation of a random variable x conditioned on all history before episode k. Then for any k, s, a (recall β defined in Eq. (7)),

    | E_k[ β̂_k(s, a) ] − β^{π_k}(s, a) | ≤ O(1/T),    (11)
    E_k[ ( β̂_k(s, a) − β^{π_k}(s, a) )² ] ≤ O( N³ log T / (B π_k(a|s) μ^{π_k}(s)) ).    (12)
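The bounds in this section are driven by the MDP parameters t_mix, t_hit and ρ (Definitions 5.1, 5.2 and Eq. (4)), which also set the window length N and episode length B above. For a small MDP whose transition kernel is known, the first two can be checked numerically; the following sketch enumerates deterministic policies only, so it yields a lower bound on the exact maxima in the definitions. It is purely an illustration of the definitions, not a procedure used by the algorithms in this paper.

```python
import itertools
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of an irreducible row-stochastic matrix P."""
    evals, evecs = np.linalg.eig(P.T)
    v = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    v = np.abs(v)
    return v / v.sum()

def mixing_and_hitting_times(p, t_max=10_000):
    """Illustrative sketch: estimate t_mix (Def. 5.1) and t_hit (Def. 5.2) for a known
    kernel p of shape (S, A, S), maximizing over deterministic policies only
    (a lower bound on the maxima over all policies in the definitions)."""
    S, A, _ = p.shape
    t_mix, t_hit = 0, 0.0
    for policy in itertools.product(range(A), repeat=S):
        P = np.stack([p[s, policy[s]] for s in range(S)])   # induced S x S Markov chain
        mu = stationary_distribution(P)
        t_hit = max(t_hit, 1.0 / mu.min())
        Pt = np.eye(S)
        for t in range(1, t_max + 1):
            Pt = Pt @ P
            if np.abs(Pt - mu).sum(axis=1).max() <= 0.25:   # worst-start L1 distance <= 1/4
                t_mix = max(t_mix, t)
                break
    return t_mix, t_hit
```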

The next lemma shows that in OOMD, π_k and π_{k−1} are close in a strong sense, which further implies the stability of several other related quantities.

Lemma 7. For any k, s, a,

    |π_k(a|s) − π_{k−1}(a|s)| ≤ O(ηN π_{k−1}(a|s)),    (13)
    |J^{π_k} − J^{π_{k−1}}| ≤ O(ηN²),
    |v^{π_k}(s) − v^{π_{k−1}}(s)| ≤ O(ηN³),
    |q^{π_k}(s, a) − q^{π_{k−1}}(s, a)| ≤ O(ηN³),
    |β^{π_k}(s, a) − β^{π_{k−1}}(s, a)| ≤ O(ηN³).

The next lemma shows the regret bound of OOMD, based on an analysis similar to (Wei & Luo, 2018).

Lemma 8. For a specific state s, we have

    E[ Σ_{k=1}^{K} Σ_a (π*(a|s) − π_k(a|s)) β̂_k(s, a) ]
    ≤ O( (A ln T)/η + η E[ Σ_{k=1}^{K} Σ_a π_k(a|s)² ( β̂_k(s, a) − β̂_{k−1}(s, a) )² ] ),

where we define β̂_0(s, a) = 0 for all s and a.

Finally, we state a key lemma for proving Theorem 5.

Lemma 9. MDP-OOMD ensures

    E[ B Σ_{k=1}^{K} Σ_s Σ_a μ*(s) (π*(a|s) − π_k(a|s)) q^{π_k}(s, a) ] = O( (BA ln T)/η + η T N³ ρ / B + η³ T N⁶ ).

6. Conclusions

In this work we propose two model-free algorithms for learning infinite-horizon average-reward MDPs. They are based on different ideas: one reduces the problem to the discounted version, while the other optimizes the policy directly via a novel application of adaptive adversarial multi-armed bandit algorithms. The main open question is how to achieve the information-theoretically optimal regret bound via a model-free algorithm, if it is possible at all. We believe that the techniques we develop in this work will be useful in answering this question.

Acknowledgements

The authors would like to thank Csaba Szepesvari for pointing out the related works (Abbasi-Yadkori et al., 2019a;b), Mengxiao Zhang for helping us prove Lemma 6, Gergely Neu for clarifying the analysis in (Neu et al., 2013), and Ronan Fruit for discussions on a related open problem presented at ALT 2019. Support from NSF for MJ (award ECCS-1810447), HL (award IIS-1755781), HS (award CCF-1817212) and RJ (awards ECCS-1810447 and CCF-1817212) is gratefully acknowledged.

References

Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N., Szepesvari, C., and Weisz, G. POLITEX: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pp. 3692–3702, 2019a.

Abbasi-Yadkori, Y., Lazic, N., Szepesvari, C., and Weisz, G. Exploration-enhanced POLITEX. arXiv preprint arXiv:1908.10479, 2019b.

Agarwal, A., Luo, H., Neyshabur, B., and Schapire, R. E. Corralling a band of bandit algorithms. In Conference on Learning Theory, pp. 12–38, 2017.

Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261, 2019.

Agrawal, S. and Jia, R. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pp. 1184–1194, 2017.

Auer, P. and Ortner, R. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems, pp. 49–56, 2007.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 263–272, 2017.

Bartlett, P. L. and Tewari, A. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 35–42. AUAI Press, 2009.

Bubeck, S., Li, Y., Luo, H., and Wei, C.-Y. Improved path-length regret bounds for bandits. In Conference on Learning Theory, 2019.

Chiang, C.-K., Yang, T., Lee, C.-J., Mahdavi, M., Lu, C.-J., Jin, R., and Zhu, S. Online optimization with gradual variations. In Conference on Learning Theory, 2012.

Dong, K., Wang, Y., Chen, X., and Wang, L. Q-learning with UCB exploration is sample efficient for infinite-horizon MDP. arXiv preprint arXiv:1901.09311, 2019.

Fruit, R., Pirotta, M., and Lazaric, A. Near optimal exploration-exploitation in non-communicating Markov decision processes. In Advances in Neural Information Processing Systems, pp. 2994–3004, 2018a.

Fruit, R., Pirotta, M., Lazaric, A., and Ortner, R. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In International Conference on Machine Learning, pp. 1573–1581, 2018b.

Fruit, R., Pirotta, M., and Lazaric, A. Improved analysis of UCRL2B, 2019. Available at rlgammazero.github.io/docs/ucrl2b_improved.pdf.

Gopalan, A. and Mannor, S. Thompson sampling for learning parameterized Markov decision processes. In Conference on Learning Theory, pp. 861–898, 2015.

Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.

Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pp. 4863–4873, 2018.

Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, 2002.

Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2018.

Levin, D. A. and Peres, Y. Markov Chains and Mixing Times, volume 107. American Mathematical Society, 2017.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Neu, G., György, A., Szepesvári, C., and Antos, A. Online Markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59:676–691, 2013.

Ortner, R. Regret bounds for reinforcement learning via Markov chain concentration. arXiv preprint arXiv:1808.01813, 2018.

Osband, I., Van Roy, B., and Wen, Z. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.

Ouyang, Y., Gagrani, M., and Jain, R. Learning-based control of unknown linear systems with Thompson sampling. arXiv preprint arXiv:1709.04047, 2017a.

Ouyang, Y., Gagrani, M., Nayyar, A., and Jain, R. Learning unknown Markov decision processes: A Thompson sampling approach. In Advances in Neural Information Processing Systems, pp. 1333–1342, 2017b.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. In Conference on Learning Theory, pp. 993–1019, 2013.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.

Schwartz, A. A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the Tenth International Conference on Machine Learning, pp. 298–305, 1993.

Strehl, A. L. and Littman, M. L. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.

Strehl, A. L., Li, L., Wiewiora, E., Langford, J., and Littman, M. L. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 881–888. ACM, 2006.

Talebi, M. S. and Maillard, O.-A. Variance-aware regret bounds for undiscounted reinforcement learning in MDPs. In Algorithmic Learning Theory, pp. 770–805, 2018.

Wang, M. Primal-dual π learning: Sample complexity and sublinear run time for ergodic Markov decision problems. arXiv preprint arXiv:1710.06100, 2017.

Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, 1989.

Wei, C.-Y. and Luo, H. More adaptive algorithms for adversarial bandits. In Conference on Learning Theory, pp. 1263–1291, 2018.

Zanette, A. and Brunskill, E. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, 2019.

Zhang, Z. and Ji, X. Regret minimization for reinforcement learning by evaluating the optimal bias function. In Advances in Neural Information Processing Systems, 2019.

A. Omitted Proofs in Section 4


In this section, we provide detailed proofs for the lemmas used in Section 4. Recall that the learning rate α_τ = (H + 1)/(H + τ) is similar to the one used by (Jin et al., 2018). For notational convenience, let

    α_τ^0 := Π_{j=1}^{τ} (1 − α_j),    α_τ^i := α_i Π_{j=i+1}^{τ} (1 − α_j).    (14)

It can be verified that α_τ^0 = 0 for τ ≥ 1, and we define α_0^0 = 1. These quantities are used in the proof of Lemma 3 and have some nice properties, summarized in the following lemma.

Lemma 10 ((Jin et al., 2018)). The following properties hold for α_τ^i:

1. 1/√τ ≤ Σ_{i=1}^{τ} α_τ^i/√i ≤ 2/√τ for every τ ≥ 1.
2. Σ_{i=1}^{τ} (α_τ^i)² ≤ 2H/τ for every τ ≥ 1.
3. Σ_{i=1}^{τ} α_τ^i = 1 for every τ ≥ 1, and Σ_{τ=i}^{∞} α_τ^i = 1 + 1/H for every i ≥ 1.

Also recall the well-known Azuma's inequality:

Lemma 11 (Azuma's inequality). Let X_1, X_2, ... be a martingale difference sequence with |X_i| ≤ c_i for all i. Then, for any 0 < δ < 1,

    P( Σ_{i=1}^{T} X_i ≥ √(2 c̄_T² ln(1/δ)) ) ≤ δ,

where c̄_T² := Σ_{i=1}^{T} c_i².

A.1. Proof of Lemma 2

Lemma 2 (Restated). Let V* be the optimal value function in the discounted MDP with discount factor γ and v* be the optimal value function in the undiscounted MDP. Then,

1. |J* − (1 − γ)V*(s)| ≤ (1 − γ) sp(v*), ∀s ∈ S,
2. sp(V*) ≤ 2 sp(v*).

Proof. 1. Let π* and π_γ be the optimal policies in the undiscounted and discounted settings, respectively. By the Bellman equation, we have

    v*(s) = r(s, π*(s)) − J* + E_{s'∼p(·|s,π*(s))} v*(s').

Consider a state sequence s_1, s_2, ... generated by π*. Then, by sub-optimality of π* for the discounted setting, we have

    V*(s_1) ≥ E[ Σ_{t=1}^{∞} γ^{t−1} r(s_t, π*(s_t)) | s_1 ]
            = E[ Σ_{t=1}^{∞} γ^{t−1} ( J* + v*(s_t) − v*(s_{t+1}) ) | s_1 ]
            = J*/(1 − γ) + v*(s_1) − E[ Σ_{t=2}^{∞} (γ^{t−2} − γ^{t−1}) v*(s_t) | s_1 ]
            ≥ J*/(1 − γ) + min_s v*(s) − max_s v*(s) Σ_{t=2}^{∞} (γ^{t−2} − γ^{t−1})
            = J*/(1 − γ) − sp(v*),

where the first equality is by the Bellman equation for the undiscounted setting.

Similarly, for the other direction, let s_1, s_2, ... be generated by π_γ. We have

    V*(s_1) = E[ Σ_{t=1}^{∞} γ^{t−1} r(s_t, π_γ(s_t)) | s_1 ]
            ≤ E[ Σ_{t=1}^{∞} γ^{t−1} ( J* + v*(s_t) − v*(s_{t+1}) ) | s_1 ]
            = J*/(1 − γ) + v*(s_1) − E[ Σ_{t=2}^{∞} (γ^{t−2} − γ^{t−1}) v*(s_t) | s_1 ]
            ≤ J*/(1 − γ) + max_s v*(s) − min_s v*(s) Σ_{t=2}^{∞} (γ^{t−2} − γ^{t−1})
            = J*/(1 − γ) + sp(v*),

where the first inequality is by sub-optimality of π_γ for the undiscounted setting.

2. Using the previous part, for any s_1, s_2 ∈ S, we have

    |V*(s_1) − V*(s_2)| ≤ |V*(s_1) − J*/(1 − γ)| + |V*(s_2) − J*/(1 − γ)| ≤ 2 sp(v*).

Thus, sp(V*) ≤ 2 sp(v*).

A.2. Proof of Lemma 3

Lemma 3. With probability at least 1 − δ,

    Σ_{t=1}^{T} (V*(s_t) − Q*(s_t, a_t)) ≤ 4HSA + 24 sp(v*) √(HSAT ln(2T/δ)).

Proof. We condition on the statement of Lemma 12, which happens with probability at least 1 − δ. Let nt ≥ 1 denote
nt+1 (st , at ), that is, the total number of visits to the state-action pair (st , at ) for the first t rounds (including round t). Also
let ti (s, a) denote the timestep at which (s, a) is visited the i-th time. Recalling the definition of αint in Eq. (14), we have

T 
X  XT
V̂t (st ) − V ∗ (st ) + (V ∗ (st ) − Q∗ (st , at )) (15)
t=1 t=1
T 
X 
= Q̂t (st , at ) − Q∗ (st , at ) (because at = argmaxa Q̂t (st , a))
t=1
T 
X  XT  
= Q̂t+1 (st , at ) − Q∗ (st , at ) + Q̂t (st , at ) − Q̂t+1 (st , at ) (16)
t=1 t=1
T r T X nt h i
X H 2T X

≤ 12 sp(v ) ln +γ αint V̂ti (st ,at ) (sti (st ,at )+1 ) − V ∗ (sti (st ,at )+1 ) + SAH. (17)
t=1
nt δ t=1 i=1

Here, we apply Lemma 12 to bound the first term of Eq .(16) (note α0nt = 0 by definition since nt ≥ 1), and also bound
the second term of Eq .(16) by SAH since for each fixed (s, a), Q̂t (s, a) is non-increasing in t and overall cannot decrease
by more than H (the initial value).

To bound the third term of Eq. (17) we write:


nt
T X
X h i
γ αint V̂ti (st ,at ) (sti (st ,at )+1 ) − V ∗ (sti (st ,at )+1 )
t=1 i=1
T X nt+1 (s,a) h i
X X
=γ 1[st =s,at =a] αint+1 (s,a) V̂ti (s,a) (sti (s,a)+1 ) − V ∗ (sti (s,a)+1 )
t=1 s,a i=1
nT +1 (s,a) j h i
X X X
=γ αij V̂ti (s,a) (sti (s,a)+1 ) − V ∗ (sti (s,a)+1 ) .
s,a j=1 i=1

By changing the order of summation on i and j, the latter is equal to

X nT +1
X (s,a) nT +1 (s,a)
X h i
γ αij V̂ti (s,a) (sti (s,a)+1 ) − V ∗ (sti (s,a)+1 )
s,a i=1 j=i
nT +1 (s,a) h i nT +1 (s,a)
X X X
=γ V̂ti (s,a) (sti (s,a)+1 ) − V ∗ (sti (s,a)+1 ) αij
s,a i=1 j=i
PnT +1 (s,a) i P∞ i 1
Now, we can upper bound j=i αj by j=i αj where the latter is equal to 1 + H by Lemma 10. Since

V̂ti (s,a) (sti (s,a)+1 ) − V (sti (s,a)+1 ) ≥ 0 (by Lemma 12), we can write:

X nT +1
X (s,a) h i nT +1
X (s,a)
γ V̂ti (s,a) (sti (s,a)+1 ) − V ∗ (sti (s,a)+1 ) αij
s,a i=1 j=i
nT +1 (s,a) h iX

X X
≤γ V̂ti (s,a) (sti (s,a)+1 ) − V ∗ (sti (s,a)+1 ) αij
s,a i=1 j=i

X nT +1 (s,a) h
X i 1


=γ V̂ti (s,a) (sti (s,a)+1 ) − V (sti (s,a)+1 ) 1 +
s,a i=1
H
  X T h i
1
= 1+ γ V̂t (st+1 ) − V ∗ (st+1 )
H t=1
  X T h i   T
1 1 Xh i
= 1+ γ V̂t+1 (st+1 ) − V ∗ (st+1 ) + 1 + V̂t (st+1 ) − V̂t+1 (st+1 )
H t=1
H t=1
T
X +1 h i  1

≤ V̂t (st ) − V ∗ (st ) + 1 + SH.
t=2
H

The last inequality is because 1 + H1 γ ≤ 1 and that for any state s, V̂t (s) ≥ V̂t+1 (s) and the value can decrease by at
most H (the initial value). Substituting in Eq. (17) and telescoping with the left hand side, we have
T r   
H 2T 
X T X
∗ ∗ ∗ ∗ 1
(V (st ) − Q (st , at )) ≤ 12 sp(v ) ln + V̂T +1 (sT +1 ) − V (sT +1 ) + 1 + SH + SAH
t=1 t=1
nt δ H
XT r
H 2T
≤ 12 sp(v ∗ ) ln + 4SAH.
t=1
n t δ
PT √
Moreover, √1 ≤ 2 SAT because
t=1 nt

T
X X X 1[st =s,at =a]
T X nT +1 (s,a)
X X p s X √
1 1
p = p = √ ≤ 2 nT +1 (s, a) ≤ 2 SA nT +1 (s, a) = 2 SAT ,
t=1 nt+1 (st , at ) t=1 s,a nt+1 (s, a) s,a j=1
j s,a s,a

where the last inequality is by Cauchy-Schwarz inequality. This finishes the proof.

Lemma 12. With probability at least 1 − δ, for any t = 1, ..., T and state-action pair (s, a), the following holds:

    0 ≤ Q̂_{t+1}(s, a) − Q*(s, a) ≤ H α_τ^0 + γ Σ_{i=1}^{τ} α_τ^i [ V̂_{t_i}(s_{t_i+1}) − V*(s_{t_i+1}) ] + 12 sp(v*) √((H/τ) ln(2T/δ)),

where τ = n_{t+1}(s, a) (i.e., the total number of visits to (s, a) within the first t timesteps), α_τ^i is defined by (14), and t_1, ..., t_τ ≤ t are the timesteps at which (s, a) is taken.

Proof. Recursively substituting Qt (s, a) in Eq. (2) of the algorithm, we have


τ
X h i Xτ
Qt+1 (s, a) = Hα0τ + αiτ r(s, a) + γ V̂ti (sti +1 ) + αiτ bi .
i=1 i=1


Moreover, since i=1 αiτ = 1 (Lemma 10), By Bellman equation we have
τ
X  
Q∗ (s, a) = α0τ Q∗ (s, a) + αiτ r(s, a) + γEs′ ∼p(·|s,a) V ∗ (s′ ) .
i=1


Taking their difference and adding and subtracting a term γ i=1 αiτ V ∗ (sti +1 ) lead to:
τ
X h i
Qt+1 (s, a) − Q∗ (s, a) = α0τ (H − Q∗ (s, a)) + γ αiτ V̂ti (sti +1 ) − V ∗ (sti +1 )
i=1
τ
X τ
  X
+γ αiτ V ∗ (sti +1 ) − Es′ ∼p(·|s,a) V ∗ (s′ ) + αiτ bi .
i=1 i=1

P∞ 1
The first term is upper bounded by α0τ H clearly and lower bounded by 0 since Q∗ (s, a) ≤ i=0 γi = 1−γ = H.
The third term is a martingale difference sequence with each term bounded in [−γαiτ sp(Vq ∗
), γαiτ sp(V ∗ )]. There-

fore, by Azuma’s inequality (Lemma 11), its absolute value is bounded by γ sp(V ∗ ) 2 i=1 (αiτ )2 ln 2T δ ≤
q q
2γ sp(V ∗ ) H 2T
τ ln δ ≤ 4γ sp(v )
∗ H 2T δ
τ ln δ with probability at least 1 − T , where the first inequality is by Lemma
10 and the last inequality is by Lemma 2. Note that when t varies from 1 to T and (s, a) varies over all possible state-action
pairs, the third term only takes T different forms. Therefore,
q by taking a union bound over these T events, we have: with
probability 1 − δ, the third term is bounded by 4γ sp(v ∗ ) H 2T
τ ln δ in absolute value for all t and (s, a).
q q
The forth term is lower bounded by 4 sp(v ∗ ) H 2T
τ ln δ and upper bounded by 8 sp(V )
∗ H 2T
τ ln δ , by Lemma 10.
n o
Combining all aforementioned upper bounds and the fact Q̂t+1 (s, a) = min Q̂t (s, a), Qt+1 (s, a) ≤ Qt+1 (s, a) we
prove the upperh bound in the lemma statement. To prove thei lower bound, further note that the second term can be written
Pτ i ∗
as γ i=1 ατ maxa Q̂ti (sti +1 , a) − maxa Q (sti +1 , a) . Using a direct induction with all aforementioned lower bounds
n o
and the fact Q̂t+1 (s, a) = min Q̂t (s, a), Qt+1 (s, a) we prove the lower bound in the lemma statement as well.

A.3. Proof of Lemma 4

Lemma 4. With probability at least 1 − δ,

    Σ_{t=1}^{T} (Q*(s_t, a_t) − γV*(s_t) − r(s_t, a_t)) ≤ 2 sp(v*) √(2T ln(1/δ)) + 2 sp(v*).

Proof. By the Bellman equation for the discounted problem, we have Q*(s_t, a_t) − γV*(s_t) − r(s_t, a_t) = γ( E_{s'∼p(·|s_t,a_t)}[V*(s')] − V*(s_t) ). Adding and subtracting V*(s_{t+1}) and summing over t, we get

    Σ_{t=1}^{T} (Q*(s_t, a_t) − γV*(s_t) − r(s_t, a_t))
    = γ Σ_{t=1}^{T} ( E_{s'∼p(·|s_t,a_t)}[V*(s')] − V*(s_{t+1}) ) + γ Σ_{t=1}^{T} ( V*(s_{t+1}) − V*(s_t) ).

The summands of the first term on the right-hand side form a martingale difference sequence. Thus, by Azuma's inequality (Lemma 11) and the fact that sp(V*) ≤ 2 sp(v*) (Lemma 2), this term is upper bounded by 2γ sp(v*) √(2T ln(1/δ)) with probability at least 1 − δ. The second term equals γ(V*(s_{T+1}) − V*(s_1)), which is upper bounded by 2γ sp(v*). Recalling γ < 1 completes the proof.

B. Omitted Proofs in Section 5 — Proofs for Lemma 6 and Lemma 7


B.1. Auxiliary Lemmas
In this subsection, we state several lemmas that will be helpful in the analysis.
Lemma 13 ((Levin & Peres, 2017, Section 4.5)). Define

    t_mix(ε) := max_π min{ t ≥ 1 : ‖(P^π)^t(s, ·) − μ^π‖_1 ≤ ε, ∀s },

so that t_mix = t_mix(1/4). We have

    t_mix(ε) ≤ ⌈log_2(1/ε)⌉ t_mix

for any ε ∈ (0, 1/2].

Corollary 13.1. For an ergodic MDP with mixing time t_mix, we have

    ‖(P^π)^t(s, ·) − μ^π‖_1 ≤ 2 · 2^{−t/t_mix}, ∀π, s,

for all π and all t ≥ 2 t_mix.

Proof. Lemma 13 implies that for any ε ∈ (0, 1/2], as long as t ≥ ⌈log_2(1/ε)⌉ t_mix, we have

    ‖(P^π)^t(s, ·) − μ^π‖_1 ≤ ε.

This condition can be satisfied by picking log_2(1/ε) = t/t_mix − 1, which leads to ε = 2 · 2^{−t/t_mix}.

Corollary 13.2. Let N = 4 t_mix log_2 T. For an ergodic MDP with mixing time t_mix < T/4, we have for all π:

    Σ_{t=N}^{∞} ‖(P^π)^t(s, ·) − μ^π‖_1 ≤ 1/T³.

Proof. By Corollary 13.1,

    Σ_{t=N}^{∞} ‖(P^π)^t(s, ·) − μ^π‖_1 ≤ Σ_{t=N}^{∞} 2 · 2^{−t/t_mix} = (2 · 2^{−N/t_mix})/(1 − 2^{−1/t_mix})
    ≤ (2 t_mix/ln 2) · 2 · 2^{−N/t_mix} = (2 t_mix/ln 2) · (2/T⁴) ≤ 1/T³.

Lemma 14 (Stated in (Wang, 2017) without proof). For an ergodic MDP with mixing time t_mix, and any π, s, a,

    |v^π(s)| ≤ 5 t_mix,
    |q^π(s, a)| ≤ 6 t_mix.

Proof. Using the identity of Eq. (3) we have

    |v^π(s)| = | Σ_{t=0}^{∞} ( (P^π)^t(s, ·) − μ^π )^⊤ r^π |
             ≤ Σ_{t=0}^{∞} ‖(P^π)^t(s, ·) − μ^π‖_1 ‖r^π‖_∞
             ≤ Σ_{t=0}^{2t_mix−1} ‖(P^π)^t(s, ·) − μ^π‖_1 + Σ_{i=2}^{∞} Σ_{t=i t_mix}^{(i+1)t_mix−1} ‖(P^π)^t(s, ·) − μ^π‖_1
             ≤ 4 t_mix + Σ_{i=2}^{∞} 2 · 2^{−i} t_mix    (by ‖(P^π)^t(s, ·) − μ^π‖_1 ≤ 2 and Corollary 13.1)
             ≤ 5 t_mix,

and thus

    |q^π(s, a)| = | r(s, a) − J^π + E_{s'∼p(·|s,a)}[v^π(s')] | ≤ 1 + 5 t_mix ≤ 6 t_mix.

Lemma 15 ((Neu et al., 2013, Lemma 2)). For any two policies π, π̃,

    J^{π̃} − J^π = Σ_s Σ_a μ^{π̃}(s) ( π̃(a|s) − π(a|s) ) q^π(s, a).

Proof. Using the Bellman equation we have

    Σ_s Σ_a μ^{π̃}(s) π̃(a|s) q^π(s, a)
    = Σ_s Σ_a μ^{π̃}(s) π̃(a|s) ( r(s, a) − J^π + Σ_{s'} p(s'|s, a) v^π(s') )
    = J^{π̃} − J^π + Σ_{s'} μ^{π̃}(s') v^π(s')
    = J^{π̃} − J^π + Σ_s μ^{π̃}(s) v^π(s)
    = J^{π̃} − J^π + Σ_s Σ_a μ^{π̃}(s) π(a|s) q^π(s, a),

where the second equality uses the facts J^{π̃} = Σ_s Σ_a μ^{π̃}(s) π̃(a|s) r(s, a) and Σ_{s,a} μ^{π̃}(s) π̃(a|s) p(s'|s, a) = μ^{π̃}(s'). Rearranging gives the desired equality.
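As a quick numerical sanity check of this identity, the sketch below evaluates both sides for two random policies on a small random MDP, where q^π is obtained by solving the policy's Bellman equation with the normalization Σ_s μ^π(s) v^π(s) = 0 used in Section 5. The random instance and the linear-algebra shortcuts are our own arbitrary illustrative choices, not part of the paper's analysis.

```python
import numpy as np

def policy_quantities(p, r, pi):
    """Illustrative sketch: return (J, q, v, mu) for a fixed randomized policy pi on a
    known ergodic MDP, using J + q(s,a) = r(s,a) + E[v(s')] and the constraint mu^T v = 0."""
    S, A, _ = p.shape
    P = np.einsum('sa,sat->st', pi, p)           # induced transition matrix P^pi
    r_pi = (pi * r).sum(axis=1)                  # r^pi(s)
    evals, evecs = np.linalg.eig(P.T)
    mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    mu = np.abs(mu) / np.abs(mu).sum()           # stationary distribution mu^pi
    J = mu @ r_pi
    v = np.linalg.lstsq(np.eye(S) - P, r_pi - J, rcond=None)[0]
    v = v - mu @ v                               # enforce sum_s mu(s) v(s) = 0
    q = r - J + p @ v                            # q(s,a) = r(s,a) - J + sum_s' p(s'|s,a) v(s')
    return J, q, v, mu

rng = np.random.default_rng(0)
S, A = 4, 3
p = rng.random((S, A, S)); p /= p.sum(axis=2, keepdims=True)
r = rng.random((S, A))
pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)
pi_tilde = rng.random((S, A)); pi_tilde /= pi_tilde.sum(axis=1, keepdims=True)

J_pi, q_pi, _, _ = policy_quantities(p, r, pi)
J_tl, _, _, mu_tl = policy_quantities(p, r, pi_tilde)
rhs = (mu_tl[:, None] * (pi_tilde - pi) * q_pi).sum()
print(J_tl - J_pi, rhs)   # the two numbers agree up to numerical error
```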
Lemma 16. Let I = {t_1 + 1, t_1 + 2, ..., t_2} be a certain period of an episode k of Algorithm 2 with |I| ≥ N = 4 t_mix log_2 T. Then for any s, the probability that the algorithm never visits s in I is upper bounded by

    ( 1 − 3μ^{π_k}(s)/4 )^{⌊|I|/N⌋}.

Proof. Consider a subset of I, {t_1 + N, t_1 + 2N, ...}, which consists of at least ⌊(t_2 − t_1)/N⌋ rounds that are at least N steps away from each other. By Corollary 13.1, we have for any i,

    | Pr[s_{t_1+iN} = s | s_{t_1+(i−1)N}] − μ^{π_k}(s) | ≤ 2 · 2^{−N/t_mix} ≤ 2 · 2^{−4 log_2 T} ≤ 2/T⁴,

that is, conditioned on the state at time t_1 + (i − 1)N, the state distribution at time t_1 + iN is close to the stationary distribution induced by π_k. Therefore we further have Pr[s_{t_1+iN} = s | s_{t_1+(i−1)N}] ≥ μ^{π_k}(s) − 2/T⁴ ≥ (3/4) μ^{π_k}(s), where the last step uses the fact μ^{π_k}(s) ≥ 1/t_hit ≥ 4/T. The probability that the algorithm does not visit s in any of the rounds {t_1 + N, t_1 + 2N, ...} is then at most

    ( 1 − 3μ^{π_k}(s)/4 )^{⌊(t_2 − t_1)/N⌋} = ( 1 − 3μ^{π_k}(s)/4 )^{⌊|I|/N⌋},

finishing the proof.

B.2. Proof for Lemma 6


Proof for Eq.(11). In this proof, we consider a specific episode k and a specific state s. For notation simplicity, we use
π for πk throughout this proof, and all the expectations or probabilities are conditioned on the history before episode k.
Suppose that when Algorithm 2 calls E STIMATE Q in episode k for state s, it finds M disjoint intervals that starts from s.
Denote the reward estimators corresponding to the i-th interval as βbk,i (s, ·) (i.e., the yi (·) in Algorithm 3), and the time
when the i-th interval starts as τi (thus sτi = s). Then by the algorithm, we have
( PM b
i=1 βk,i (s,a)
if M > 0,
βbk (s, a) = M (18)
0 if M = 0.

Since each $\hat\beta_{k,i}(s,a)$ is constructed from a length-$(N+1)$ trajectory starting from $s$ at time $\tau_i \le kB - N$, we can calculate its conditional expectation as follows:
$$\begin{aligned}
\mathbb{E}\left[\hat\beta_{k,i}(s,a)\,\middle|\, s_{\tau_i} = s\right]
&= \Pr[a_{\tau_i} = a \mid s_{\tau_i} = s]\times\frac{r(s,a) + \mathbb{E}\left[\sum_{t=\tau_i+1}^{\tau_i+N} r(s_t,a_t)\,\middle|\,(s_{\tau_i},a_{\tau_i}) = (s,a)\right]}{\pi(a|s)} \\
&= r(s,a) + \sum_{s'} p(s'|s,a)\,\mathbb{E}\left[\sum_{t=\tau_i+1}^{\tau_i+N} r(s_t,a_t)\,\middle|\, s_{\tau_i+1} = s'\right] \\
&= r(s,a) + \sum_{s'} p(s'|s,a)\sum_{j=0}^{N-1} e_{s'}^\top (P^\pi)^j r^\pi \\
&= r(s,a) + \sum_{s'} p(s'|s,a)\sum_{j=0}^{N-1}\left(e_{s'}^\top (P^\pi)^j - (\mu^\pi)^\top\right)r^\pi + NJ^\pi && \text{(because $(\mu^\pi)^\top r^\pi = J^\pi$)} \\
&= r(s,a) + \sum_{s'} p(s'|s,a)\,v^\pi(s') + NJ^\pi - \sum_{s'} p(s'|s,a)\sum_{j=N}^{\infty}\left(e_{s'}^\top (P^\pi)^j - (\mu^\pi)^\top\right)r^\pi && \text{(By Eq. (3))} \\
&= q^\pi(s,a) + NJ^\pi - \delta(s,a) \\
&= \beta^\pi(s,a) - \delta(s,a),
\end{aligned}\tag{19}$$
where $\delta(s,a) \triangleq \sum_{s'} p(s'|s,a)\sum_{j=N}^{\infty}\left(e_{s'}^\top (P^\pi)^j - (\mu^\pi)^\top\right)r^\pi$. By Corollary 13.2,
$$|\delta(s,a)| \le \frac{1}{T^3}.\tag{20}$$
Thus,
$$\left|\mathbb{E}\left[\hat\beta_{k,i}(s,a)\,\middle|\, s_{\tau_i} = s\right] - \beta^\pi(s,a)\right| \le \frac{1}{T^3}.$$

This shows that βbk,i (s, a) is an almost unbiased estimator for β π conditioned on all history before τi . Also, by our selection
of the episode length, M > 0 will happen with very high probability according to Lemma 16. These facts seem to indicate
that βbk (s, a) – an average of several βbk,i (s, a) – will also be an almost unbiased estimator for β π (s, a) with small error.

However, a caveat here is that the quantity $M$ in Eq.(18) is random, and it is not independent of the reward estimators $\hat\beta_{k,1}(s,a),\ldots,\hat\beta_{k,M}(s,a)$. Therefore, to argue that $\mathbb{E}[\hat\beta_k(s,a)]$ is close to $\beta^\pi(s,a)$, more technical work is needed. Specifically, we use the following two steps.
Step 1. Construct an imaginary world where $\hat\beta_k(s,a)$ is an almost unbiased estimator of $\beta^\pi(s,a)$.
Step 2. Argue that the expectation of $\hat\beta_k(s,a)$ in the real world and the expectation of $\hat\beta_k(s,a)$ in the imaginary world are close.

        

Figure 1. An illustration for the sub-algorithm ESTIMATEQ with target state $s$ (best viewed in color). The red round points indicate that the algorithm "starts to wait" for a visit to $s$. When the algorithm reaches $s$ (the blue stars) at time $\tau_i$, it starts to record the sum of rewards in the following $N+1$ steps, i.e., $\sum_{t=\tau_i}^{\tau_i+N} r(s_t,a_t)$. This is used to construct $\hat\beta_{k,i}(s,\cdot)$. The next point at which the algorithm "starts to wait for $s$" is $\tau_i + 2N$, provided this is still no later than $kB - N$.

Step 1. We first examine what the ESTIMATEQ sub-algorithm does in an episode $k$ for a state $s$. The goal of this sub-algorithm is to collect disjoint intervals of length $N+1$ that start from $s$, calculate a reward estimator from each of them, and finally average the estimators over all intervals to get a good estimator for $\beta^\pi(s,\cdot)$. However, after our algorithm collects an interval $[\tau, \tau+N]$, it rests for another $N$ steps before starting to look for the next visit to $s$; i.e., it restarts from $\tau + 2N$ (see Line 6 in ESTIMATEQ (Algorithm 3), and also the illustration in Figure 1).
The goal of doing this is to de-correlate the observed reward and the number of collected intervals: as shown in Eq.(18),
these two quantities affect the numerator and the denominator of βbk (s, ·) respectively, and if they are highly correlated,
then βbk (s, ·) may be heavily biased from β π (s, ·). On the other hand, if we introduce the “rest time” after we collect each
interval (i.e., the dashed segments in Figure 1), then since the length of the rest time (N ) is longer than the mixing time,
the process will almost totally “forget” about the reward estimators collected before. In Figure 1, this means that the state
distributions at the red round points (except for the left most one) will be close to µπ when conditioned on all history that
happened N rounds ago.
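For concreteness, the following sketch illustrates the interval-collection scheme just described. Since Algorithm 3 is not reproduced in this appendix, the interface below (a single episode given as lists of states, actions, and rewards, with rewards assumed to lie in $[0,1]$) is an assumed simplification rather than the authors' implementation.

import numpy as np

def estimate_q_for_state(states, actions, rewards, s, pi_s, N):
    """Sketch of the ESTIMATEQ scheme for one target state s.
    Inputs: the trajectory of one episode (lists of equal length B) and the current
    policy at s, pi_s[a] = pi_k(a|s).  Collect disjoint length-(N+1) intervals that
    start at visits to s, resting for N extra steps after each interval, then average
    the importance-weighted interval rewards as in Eq. (18)."""
    B, A = len(states), len(pi_s)
    y = []                                   # one reward estimator per interval
    t = 0
    while t <= B - (N + 1):                  # interval must fit inside the episode
        if states[t] == s:
            R = sum(rewards[t:t + N + 1])    # total reward of the length-(N+1) interval
            a = actions[t]
            y_i = np.zeros(A)
            y_i[a] = R / pi_s[a]             # importance weighting by 1/pi_k(a|s)
            y.append(y_i)
            t += 2 * N                       # record N+1 steps, rest, resume waiting at tau_i + 2N
        else:
            t += 1
    if not y:
        return np.zeros(A)                   # beta_hat_k(s, .) = 0 when M = 0
    beta_hat = np.mean(y, axis=0)            # Eq. (18)
    # With rewards in [0, 1] this is the condition in Line 1 of Algorithm 5:
    assert pi_s @ beta_hat <= N + 1 + 1e-9
    return beta_hat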
We first argue that if the process can indeed "reset its memory" at those red round points in Figure 1 (except for the leftmost one), then we get almost unbiased estimators for $\beta^\pi(s,\cdot)$. That is, consider a process like in Figure 2 where everything remains the same as in ESTIMATEQ except that after every rest interval, the state distribution is directly reset to the stationary distribution $\mu^\pi$.

Figure 2. The imaginary world (best viewed in color): the same process as in Figure 1, except that after every rest interval the state distribution is reset to the stationary distribution $\mu^\pi$.

Below we calculate the expectation of $\hat\beta_k(s,a)$ in this imaginary world. As specified in Figure 2, we use $\tau_i$ to denote the $i$-th time ESTIMATEQ starts to record an interval (therefore $s_{\tau_i} = s$), and let $w_i = \tau_i - (\tau_{i-1} + 2N)$ for $i > 1$ and $w_1 = \tau_1 - ((k-1)B + 1)$ be the "wait time" before starting the $i$-th interval. Note the following facts in the imaginary world:

1. M is determined by the sequence w1 , w2 , . . . because all other segments in the figures have fixed length.

2. w1 only depends on s(k−1)B+1 and P π , and wi only depends on the stationary distribution µπ and P π because of the
reset.

The above facts imply that in the imaginary world, $w_1, w_2, \ldots$, as well as $M$, are all independent of $\hat\beta_{k,1}(s,a), \hat\beta_{k,2}(s,a), \ldots$. Let $\mathbb{E}'$ denote the expectation in the imaginary world. Then
$$\begin{aligned}
\mathbb{E}'\left[\hat\beta_k(s,a)\right]
&= \Pr[w_1 \le B-N]\times\mathbb{E}'_{\{w_i\}}\left[\frac{1}{M}\sum_{i=1}^{M}\mathbb{E}'\left[\hat\beta_{k,i}(s,a)\,\middle|\,\{w_i\}\right]\,\middle|\, w_1 \le B-N\right] + \Pr[w_1 > B-N]\times 0 \\
&= \Pr[w_1 \le B-N]\times\mathbb{E}'_{\{w_i\}}\left[\frac{1}{M}\sum_{i=1}^{M}\left(\beta^\pi(s,a) - \delta(s,a)\right)\right] && \text{(by the same calculation as in (19))} \\
&= \Pr[w_1 \le B-N]\times\left(\beta^\pi(s,a) - \delta(s,a)\right) \\
&= \beta^\pi(s,a) - \delta'(s,a),
\end{aligned}\tag{21}$$
where $\mathbb{E}'_{\{w_i\}}$ denotes the expectation over the randomness of $w_1, w_2, \ldots$, and $\delta'(s,a) = \left(1 - \Pr[w_1 \le B-N]\right)\left(\beta^\pi(s,a) - \delta(s,a)\right) + \delta(s,a)$. By Lemma 16, we have $\Pr[w_1 \le B-N] \ge 1 - \left(1 - \frac{3}{4t_{hit}}\right)^{\frac{B-N}{N}} = 1 - \left(1 - \frac{3}{4t_{hit}}\right)^{4t_{hit}\log_2 T - 1} \ge 1 - \frac{1}{T^3}$. Together with Eq. (20) and Lemma 14, we have
$$|\delta'(s,a)| \le \frac{1}{T^3}\left(|\beta^\pi(s,a)| + |\delta(s,a)|\right) + |\delta(s,a)| \le \frac{1}{T^3}\left(6t_{mix} + N + \frac{1}{T^3}\right) + \frac{1}{T^3} = O\left(\frac{1}{T^2}\right),$$
and thus
$$\left|\mathbb{E}'\left[\hat\beta_k(s,a)\right] - \beta^\pi(s,a)\right| = O\left(\frac{1}{T^2}\right).\tag{22}$$

Step 2. Note that $\hat\beta_k(s,a)$ is a deterministic function of $X = (M, \tau_1, T_1, \tau_2, T_2, \ldots, \tau_M, T_M)$, where $T_i = (a_{\tau_i}, s_{\tau_i+1}, a_{\tau_i+1}, \ldots, s_{\tau_i+N}, a_{\tau_i+N})$. We use $\hat\beta_k(s,a) = f(X)$ to denote this mapping. To show that $\mathbb{E}[\hat\beta_k(s,a)]$ and $\mathbb{E}'[\hat\beta_k(s,a)]$ are close, we bound their ratio:
$$\frac{\mathbb{E}[\hat\beta_k(s,a)]}{\mathbb{E}'[\hat\beta_k(s,a)]} = \frac{\sum_X f(X)\,\mathbb{P}(X)}{\sum_X f(X)\,\mathbb{P}'(X)} \le \max_X\frac{\mathbb{P}(X)}{\mathbb{P}'(X)},\tag{23}$$
where we use $\mathbb{P}$ and $\mathbb{P}'$ to denote the probability mass functions in the real world and the imaginary world respectively, and in the last inequality we use the non-negativity of $f(X)$.
For a fixed sequence $X$, the probability of generating $X$ in the real world is
$$\mathbb{P}(X) = \mathbb{P}(\tau_1)\times\mathbb{P}(T_1|\tau_1)\times\mathbb{P}(\tau_2|\tau_1,T_1)\times\mathbb{P}(T_2|\tau_2)\times\cdots\times\mathbb{P}(\tau_M|\tau_{M-1},T_{M-1})\times\mathbb{P}(T_M|\tau_M)\times\Pr\!\left[s_t \ne s,\ \forall t\in[\tau_M+2N,\,kB-N]\,\middle|\,\tau_M, T_M\right].\tag{24}$$
In the imaginary world, it is
$$\mathbb{P}'(X) = \mathbb{P}(\tau_1)\times\mathbb{P}(T_1|\tau_1)\times\mathbb{P}'(\tau_2|\tau_1,T_1)\times\mathbb{P}(T_2|\tau_2)\times\cdots\times\mathbb{P}'(\tau_M|\tau_{M-1},T_{M-1})\times\mathbb{P}(T_M|\tau_M)\times\Pr\!\left[s_t \ne s,\ \forall t\in[\tau_M+2N,\,kB-N]\,\middle|\,\tau_M, T_M\right].\tag{25}$$

Their difference only comes from $\mathbb{P}(\tau_{i+1}|\tau_i,T_i) \ne \mathbb{P}'(\tau_{i+1}|\tau_i,T_i)$ because of the reset. Note that
$$\mathbb{P}(\tau_{i+1}|\tau_i,T_i) = \sum_{s'\ne s}\mathbb{P}(s_{\tau_i+2N} = s'|\tau_i,T_i)\times\Pr\!\left[s_t\ne s,\ \forall t\in[\tau_i+2N,\,\tau_{i+1}-1],\ s_{\tau_{i+1}} = s\,\middle|\, s_{\tau_i+2N} = s'\right],\tag{26}$$
$$\mathbb{P}'(\tau_{i+1}|\tau_i,T_i) = \sum_{s'\ne s}\mathbb{P}'(s_{\tau_i+2N} = s'|\tau_i,T_i)\times\Pr\!\left[s_t\ne s,\ \forall t\in[\tau_i+2N,\,\tau_{i+1}-1],\ s_{\tau_{i+1}} = s\,\middle|\, s_{\tau_i+2N} = s'\right].\tag{27}$$
Because of the reset in the imaginary world, $\mathbb{P}'(s_{\tau_i+2N} = s'|\tau_i,T_i) = \mu^\pi(s')$ for all $s'$; in the real world, since at time $\tau_i+2N$ the process has proceeded $N$ steps from $\tau_i+N$ (the last step of $T_i$), by Corollary 13.1 we have
$$\frac{\mathbb{P}(s_{\tau_i+2N} = s'|\tau_i,T_i)}{\mathbb{P}'(s_{\tau_i+2N} = s'|\tau_i,T_i)} = 1 + \frac{\mathbb{P}(s_{\tau_i+2N} = s'|\tau_i,T_i) - \mu^\pi(s')}{\mu^\pi(s')} \le 1 + \frac{2}{T^4\mu^\pi(s')} \le 1 + \frac{1}{T^3}\quad\text{for all } s',$$
which implies $\frac{\mathbb{P}(\tau_{i+1}|\tau_i,T_i)}{\mathbb{P}'(\tau_{i+1}|\tau_i,T_i)} \le 1 + \frac{1}{T^3}$ by (26) and (27). This further implies $\frac{\mathbb{P}(X)}{\mathbb{P}'(X)} \le \left(1+\frac{1}{T^3}\right)^M \le e^{\frac{M}{T^3}} \le e^{\frac{1}{T^2}} \le 1 + \frac{2}{T^2}$ by (24) and (25). From (23), we then have
$$\frac{\mathbb{E}[\hat\beta_k(s,a)]}{\mathbb{E}'[\hat\beta_k(s,a)]} \le 1 + \frac{2}{T^2}.$$
Thus, using the bound from Eq. (22) we have
$$\mathbb{E}[\hat\beta_k(s,a)] \le \left(1+\frac{2}{T^2}\right)\mathbb{E}'[\hat\beta_k(s,a)] \le \left(1+\frac{2}{T^2}\right)\left(\beta^\pi(s,a) + O\left(\frac{1}{T^2}\right)\right) \le \beta^\pi(s,a) + O\left(\frac{1}{T}\right).$$
Similarly we can prove the other direction: $\beta^\pi(s,a) \le \mathbb{E}[\hat\beta_k(s,a)] + O\left(\frac{1}{T}\right)$, finishing the proof.

Proof for Eq.(12). We use the same notation and a similar approach as in the previous proof for Eq. (11). That is, we first bound the expectation of the desired quantity in the imaginary world, and then argue that the expectation in the imaginary world and that in the real world are close.

Step 1. Define $\Delta_i = \hat\beta_{k,i}(s,a) - \beta^\pi(s,a) + \delta(s,a)$. Then $\mathbb{E}'[\Delta_i \mid \{w_i\}] = 0$ by Eq.(19). Thus in the imaginary world,
$$\begin{aligned}
&\mathbb{E}'\left[\left(\hat\beta_k(s,a) - \beta^\pi(s,a)\right)^2\right] \\
&= \mathbb{E}'\left[\left(\frac{1}{M}\sum_{i=1}^{M}\hat\beta_{k,i}(s,a) - \beta^\pi(s,a)\right)^2\mathbb{1}[M>0] + \beta^\pi(s,a)^2\,\mathbb{1}[M=0]\right] \\
&= \mathbb{E}'\left[\left(\frac{1}{M}\sum_{i=1}^{M}\Delta_i - \delta(s,a)\right)^2\mathbb{1}[M>0] + \beta^\pi(s,a)^2\,\mathbb{1}[M=0]\right] \\
&\le \mathbb{E}'\left[\left(2\left(\frac{1}{M}\sum_{i=1}^{M}\Delta_i\right)^2 + 2\delta(s,a)^2\right)\mathbb{1}[M>0] + \beta^\pi(s,a)^2\,\mathbb{1}[M=0]\right] && \text{(using $(a-b)^2\le 2a^2+2b^2$)} \\
&\le \Pr[w_1\le B-N]\times\mathbb{E}'_{\{w_i\}}\left[\mathbb{E}'\left[2\left(\frac{1}{M}\sum_{i=1}^{M}\Delta_i\right)^2 + 2\delta(s,a)^2\,\middle|\,\{w_i\}\right]\,\middle|\, w_1\le B-N\right] + \Pr[w_1 > B-N]\times(N+6t_{mix})^2 \\
&\hspace{5cm}\text{($\beta^\pi(s,a)\le N+6t_{mix}$ by Lemma 14)} \\
&\le \mathbb{E}'_{\{w_i\}}\left[\mathbb{E}'\left[2\left(\frac{1}{M}\sum_{i=1}^{M}\Delta_i\right)^2\,\middle|\,\{w_i\}\right]\,\middle|\, w_1\le B-N\right] + O\left(\frac{1}{T}\right) && \text{(using Lemma 16: $\Pr[w_1>B-N]\le\left(1-\frac{3}{4t_{hit}}\right)^{\frac{B-N}{N}}\le\frac{1}{T^3}$)} \\
&\le \mathbb{E}'_{\{w_i\}}\left[\frac{2}{M^2}\sum_{i=1}^{M}\mathbb{E}'\left[\Delta_i^2\,\middle|\,\{w_i\}\right]\,\middle|\, w_1\le B-N\right] + O\left(\frac{1}{T}\right) && \text{($\Delta_i$'s are zero-mean and independent of each other conditioned on $\{w_i\}$)} \\
&\le \mathbb{E}'_{\{w_i\}}\left[\frac{2}{M^2}\cdot M\cdot\frac{O(N^2)}{\pi(a|s)}\,\middle|\, w_1\le B-N\right] + O\left(\frac{1}{T}\right) && \text{($\mathbb{E}'[\Delta_i^2]\le\pi(a|s)\cdot\frac{O(N)^2}{\pi(a|s)^2}=\frac{O(N^2)}{\pi(a|s)}$ by definition of $\hat\beta_k(s,a)$, Lemma 14, and Eq. (20))} \\
&\le \frac{O(N^2)}{\pi(a|s)}\,\mathbb{E}'\left[\frac{1}{M}\,\middle|\, w_1\le B-N\right] + O\left(\frac{1}{T}\right).
\end{aligned}\tag{28}$$
Since $\Pr'[M=0]\le\frac{1}{T^3}$ by Lemma 16, we have $\Pr'[w_1\le B-N] = \Pr'[M>0]\ge 1-\frac{1}{T^3}$. Also note that if
$$M < M_0 := \frac{B-N}{2N + \frac{4N\log T}{\mu^\pi(s)}},$$
then there exists at least one waiting interval (i.e., some $w_i$) longer than $\frac{4N\log T}{\mu^\pi(s)}$ (see Figure 1 or 2). By Lemma 16, this happens with probability smaller than $\left(1-\frac{3\mu^\pi(s)}{4}\right)^{\frac{4\log T}{\mu^\pi(s)}}\le\frac{1}{T^3}$.
Therefore,
$$\mathbb{E}'\left[\frac{1}{M}\,\middle|\, M>0\right] = \frac{\sum_{m=1}^{\infty}\frac{1}{m}\Pr'[M=m]}{\Pr'[M>0]} \le \frac{1\times\Pr'[M<M_0] + \frac{1}{M_0}\times\Pr'[M\ge M_0]}{\Pr'[M>0]} \le \frac{1\times\frac{1}{T^3} + \frac{2N+\frac{4N\log T}{\mu^\pi(s)}}{B-N}}{1-\frac{1}{T^3}} \le O\left(\frac{N\log T}{B\,\mu^\pi(s)}\right).$$
Combining with (28), we get
$$\mathbb{E}'\left[\left(\hat\beta_k(s,a) - \beta^\pi(s,a)\right)^2\right] \le O\left(\frac{N^3\log T}{B\,\pi(a|s)\,\mu^\pi(s)}\right).$$

Step 2. By the same argument as in the "Step 2" of the previous proof for Eq. (11), we have
$$\mathbb{E}\left[\left(\hat\beta_k(s,a) - \beta^\pi(s,a)\right)^2\right] \le \left(1+\frac{2}{T^2}\right)\mathbb{E}'\left[\left(\hat\beta_k(s,a) - \beta^\pi(s,a)\right)^2\right] \le O\left(\frac{N^3\log T}{B\,\pi(a|s)\,\mu^\pi(s)}\right),$$
which finishes the proof.

B.3. Proof for Lemma 7


Proof. We defer the proof of Eq. (13) to Lemma 17 and prove the rest of the statements assuming Eq. (13). First, we have
$$\begin{aligned}
|J^{\pi_k} - J^{\pi_{k-1}}| &= \left|\sum_s\sum_a\mu^{\pi_k}(s)\left(\pi_k(a|s) - \pi_{k-1}(a|s)\right)q^{\pi_{k-1}}(s,a)\right| && \text{(By Lemma 15)} \\
&\le \sum_s\sum_a\mu^{\pi_k}(s)\left|\pi_k(a|s) - \pi_{k-1}(a|s)\right|\left|q^{\pi_{k-1}}(s,a)\right| \\
&= O\left(\sum_s\sum_a\mu^{\pi_k}(s)\,N\eta\,\pi_{k-1}(a|s)\,t_{mix}\right) && \text{(By Eq. (13) and Lemma 14)} \\
&= O\left(\eta t_{mix} N\right) = O(\eta N^2).
\end{aligned}\tag{29}$$
Next, to prove a bound on $|v^{\pi_k}(s) - v^{\pi_{k-1}}(s)|$, first note that for any policy $\pi$,
$$\begin{aligned}
v^\pi(s) &= \sum_{n=0}^{\infty}\left(e_s^\top (P^\pi)^n - (\mu^\pi)^\top\right)r^\pi && \text{(By Eq. (3))} \\
&= \sum_{n=0}^{N-1}\left(e_s^\top (P^\pi)^n - (\mu^\pi)^\top\right)r^\pi + \sum_{n=N}^{\infty}\left(e_s^\top (P^\pi)^n - (\mu^\pi)^\top\right)r^\pi \\
&= \sum_{n=0}^{N-1} e_s^\top (P^\pi)^n r^\pi - NJ^\pi + \mathrm{error}^\pi(s), && (J^\pi = (\mu^\pi)^\top r^\pi)
\end{aligned}$$
where $\mathrm{error}^\pi(s) := \sum_{n=N}^{\infty}\left(e_s^\top (P^\pi)^n - (\mu^\pi)^\top\right)r^\pi$. By Corollary 13.2, $|\mathrm{error}^\pi(s)| \le \frac{1}{T^2}$. Thus
$$\begin{aligned}
|v^{\pi_k}(s) - v^{\pi_{k-1}}(s)| &\le \left|\sum_{n=0}^{N-1} e_s^\top\left((P^{\pi_k})^n - (P^{\pi_{k-1}})^n\right)r^{\pi_k} + \sum_{n=0}^{N-1} e_s^\top (P^{\pi_{k-1}})^n\left(r^{\pi_k} - r^{\pi_{k-1}}\right) - NJ^{\pi_k} + NJ^{\pi_{k-1}}\right| + \frac{2}{T^2} \\
&\le \sum_{n=0}^{N-1}\left\|\left((P^{\pi_k})^n - (P^{\pi_{k-1}})^n\right)r^{\pi_k}\right\|_\infty + \sum_{n=0}^{N-1}\left\|r^{\pi_k} - r^{\pi_{k-1}}\right\|_\infty + N\left|J^{\pi_k} - J^{\pi_{k-1}}\right| + \frac{2}{T^2}.
\end{aligned}\tag{30}$$

Below we bound each individual term above (using the notation $\pi' := \pi_k$, $\pi := \pi_{k-1}$, $P' := P^{\pi_k}$, $P := P^{\pi_{k-1}}$, $r' := r^{\pi_k}$, $r := r^{\pi_{k-1}}$, $\mu := \mu^{\pi_{k-1}}$ for simplicity). The first term can be bounded as
$$\begin{aligned}
\left\|(P'^{\,n} - P^n)r'\right\|_\infty &= \left\|\left(P'(P'^{\,n-1} - P^{n-1}) + (P' - P)P^{n-1}\right)r'\right\|_\infty \\
&\le \left\|P'(P'^{\,n-1} - P^{n-1})r'\right\|_\infty + \left\|(P' - P)P^{n-1}r'\right\|_\infty \\
&\le \left\|(P'^{\,n-1} - P^{n-1})r'\right\|_\infty + \left\|(P' - P)P^{n-1}r'\right\|_\infty && \text{(because every row of $P'$ sums to 1)} \\
&= \left\|(P'^{\,n-1} - P^{n-1})r'\right\|_\infty + \max_s\left|e_s^\top(P' - P)P^{n-1}r'\right| \\
&\le \left\|(P'^{\,n-1} - P^{n-1})r'\right\|_\infty + \max_s\left\|e_s^\top(P' - P)P^{n-1}\right\|_1,
\end{aligned}$$

where the last term can be further bounded by
$$\begin{aligned}
\max_s\left\|e_s^\top(P' - P)P^{n-1}\right\|_1 &\le \max_s\left\|e_s^\top(P' - P)\right\|_1 \\
&= \max_s\left(\sum_{s'}\left|\sum_a\left(\pi'(a|s) - \pi(a|s)\right)p(s'|s,a)\right|\right) \\
&\le O\left(\max_s\left(\eta N\sum_{s'}\sum_a\pi(a|s)p(s'|s,a)\right)\right) && \text{(By Eq. (13))} \\
&= O(\eta N).
\end{aligned}$$

Repeatedly applying this bound we arrive at $\left\|(P'^{\,n} - P^n)r'\right\|_\infty \le O(\eta N^2)$, and therefore,
$$\sum_{n=0}^{N-1}\left\|\left((P^{\pi_k})^n - (P^{\pi_{k-1}})^n\right)r^{\pi_k}\right\|_\infty \le O(\eta N^3).$$
The second term in Eq. (30) can be bounded as (by Eq. (13) again)
$$\sum_{n=0}^{N-1}\left\|r' - r\right\|_\infty = \sum_{n=0}^{N-1}\max_s\left|\sum_a\left(\pi'(a|s) - \pi(a|s)\right)r(s,a)\right| \le O\left(\sum_{n=0}^{N-1}\max_s\,\eta N\sum_a\pi(a|s)\right) = O(\eta N^2),$$
and the third term in Eq. (30) is bounded via the earlier proof (for bounding $|J^{\pi_k} - J^{\pi_{k-1}}|$):
$$N\left|J^{\pi_k} - J^{\pi_{k-1}}\right| = O(\eta N^3). \qquad \text{(Eq. (29))}$$
Plugging everything into Eq.(30), we prove $|v^{\pi_k}(s) - v^{\pi_{k-1}}(s)| = O(\eta N^3)$.
Finally, it is straightforward to prove the remaining two statements:
$$\left|q^{\pi_k}(s,a) - q^{\pi_{k-1}}(s,a)\right| = \left|r(s,a) + \mathbb{E}_{s'\sim p(\cdot|s,a)}[v^{\pi_k}(s')] - r(s,a) - \mathbb{E}_{s'\sim p(\cdot|s,a)}[v^{\pi_{k-1}}(s')]\right| = \left|\mathbb{E}_{s'\sim p(\cdot|s,a)}\left[v^{\pi_k}(s') - v^{\pi_{k-1}}(s')\right]\right| = O(\eta N^3),$$
$$\left|\beta^{\pi_k}(s,a) - \beta^{\pi_{k-1}}(s,a)\right| \le \left|q^{\pi_k}(s,a) - q^{\pi_{k-1}}(s,a)\right| + N\left|J^{\pi_k} - J^{\pi_{k-1}}\right| = O(\eta N^3).$$
This completes the proof.

C. Analyzing Optimistic Online Mirror Descent with Log-barrier Regularizer — Proofs for
Eq.(13), Lemma 8, and Lemma 9
In this section, we derive the stability property (Eq.(13)) and the regret bound (Lemma 8 and Lemma 9) for optimistic online
mirror descent with the log-barrier regularizer. Most of the analysis is similar to that in (Wei & Luo, 2018; Bubeck et al.,
2019). Since in our MDP-OOMD algorithm, we run optimistic online mirror descent independently on each state, the
analysis in this section only focuses on a specific state s. We simplify our notations using πk (·) := πk (·|s), πk′ (·) :=
πk′ (·|s), βbk (·) := βbk (s, ·) throughout the whole section.
Our MDP-OOMD algorithm is effectively running Algorithm 5 on each state. We first verify that the condition in Line 1 of Algorithm 5 indeed holds in our MDP-OOMD algorithm. Recall that in ESTIMATEQ (Algorithm 3) we collect trajectories in every episode for every state. Suppose that for episode $k$ and state $s$ it collects $M$ trajectories that start from times $\tau_1,\ldots,\tau_M$ and have total rewards $R_1,\ldots,R_M$ respectively. Let $m_a = \sum_{i=1}^{M}\mathbb{1}[a_{\tau_i}=a]$; then we have $\sum_a m_a = M$. By our way of constructing $\hat\beta_k(s,\cdot)$, we have
$$\hat\beta_k(s,a) = \sum_{i=1}^{M}\frac{R_i\,\mathbb{1}[a_{\tau_i}=a]}{M\,\pi_k(a|s)}$$
when $M > 0$. Thus we have $\sum_a\pi_k(a|s)\hat\beta_k(s,a) = \sum_a\sum_{i=1}^{M}\frac{R_i\,\mathbb{1}[a_{\tau_i}=a]}{M} = \sum_{i=1}^{M}\frac{R_i}{M} \le N+1$ because every $R_i$ is the total reward of an interval of length $N+1$. This verifies the condition in Line 1 for the case $M > 0$. When $M = 0$, ESTIMATEQ sets $\hat\beta_k(s,\cdot)$ to zero so the condition clearly still holds.

Algorithm 5 Optimistic Online Mirror Descent (OOMD) with log-barrier regularizer

Define:
    $C := N + 1$
    Regularizer $\psi(x) = \frac{1}{\eta}\sum_{a=1}^{A}\log\frac{1}{x(a)}$, for $x\in\mathbb{R}_+^A$
    Bregman divergence associated with $\psi$: $D_\psi(x, x') = \psi(x) - \psi(x') - \langle\nabla\psi(x'), x - x'\rangle$
Initialization: $\pi_1' = \pi_1 = \frac{1}{A}\mathbf{1}$
for $k = 1, \ldots, K$ do
    1  Receive $\hat\beta_k\in\mathbb{R}_+^A$ for which $\sum_a\pi_k(a)\hat\beta_k(a) \le C$.
    2  Update
        $\pi_{k+1}' = \operatorname{argmax}_{\pi\in\Delta_A}\left\{\langle\pi, \hat\beta_k\rangle - D_\psi(\pi, \pi_k')\right\}$
        $\pi_{k+1} = \operatorname{argmax}_{\pi\in\Delta_A}\left\{\langle\pi, \hat\beta_k\rangle - D_\psi(\pi, \pi_{k+1}')\right\}$
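For concreteness, here is a minimal numerical sketch of the update in Line 2 of Algorithm 5. Because the log-barrier regularizer keeps iterates strictly inside the simplex, each argmax can be solved by a one-dimensional bisection over the Lagrange multiplier of the constraint $\sum_a\pi(a)=1$; this solver and the example numbers are assumptions made for illustration, not the implementation used in the paper's experiments.

import numpy as np

def log_barrier_omd_step(pi_prev, beta_hat, eta):
    """Solve argmax_{pi in simplex} <pi, beta_hat> - D_psi(pi, pi_prev) for the
    log-barrier regularizer psi(x) = (1/eta) * sum_a log(1/x_a).
    The first-order condition gives pi(a) = 1 / (eta*(lam - beta_hat(a)) + 1/pi_prev(a))
    for a multiplier lam chosen so that pi sums to one (found here by bisection)."""
    A = len(pi_prev)

    def candidate(lam):
        return 1.0 / (eta * (lam - beta_hat) + 1.0 / pi_prev)

    lo = np.max(beta_hat - 1.0 / (eta * pi_prev))   # the candidate blows up at this lam
    hi = lo + A / eta + 1.0                          # here the candidate sums to < 1
    for _ in range(100):                             # bisection on the simplex constraint
        mid = 0.5 * (lo + hi)
        if candidate(mid).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    pi = candidate(hi)
    return pi / pi.sum()

def oomd_update(pi_k_prime, beta_hat_k, eta):
    """One round of Algorithm 5: returns (pi_{k+1}', pi_{k+1})."""
    pi_next_prime = log_barrier_omd_step(pi_k_prime, beta_hat_k, eta)
    pi_next = log_barrier_omd_step(pi_next_prime, beta_hat_k, eta)
    return pi_next_prime, pi_next

# Tiny usage example with hypothetical numbers (A = 3 actions, N = 10, so C = 11):
A, N = 3, 10
eta = 1.0 / (270 * (N + 1))
pi_prime = np.full(A, 1.0 / A)
beta_hat = np.array([2.0, 8.0, 0.5])   # satisfies sum_a pi_k(a) beta_hat(a) <= C for uniform pi_k
pi_prime, pi_next = oomd_update(pi_prime, beta_hat, eta)
print(pi_prime, pi_next)

With $\eta\le\frac{1}{270(N+1)}$, the returned iterates change only by a small multiplicative factor per round, which is exactly the stability property established in Lemma 17 below.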

C.1. The stability property of Algorithm 5 — Proof of Eq.(13)


The statement and the proofs of Lemmas 17 and 18 are almost identical to those of Lemma 9 and 10 in (Bubeck et al.,
2019).
Lemma 17. In Algorithm 5, if $\eta \le \frac{1}{270C} = \frac{1}{270(N+1)}$, then
$$|\pi_{k+1}(a) - \pi_k(a)| \le 120\eta C\,\pi_k(a).$$


To prove this lemma we make use of the following auxiliary result, where we use the notation $\|a\|_M = \sqrt{a^\top M a}$ for a vector $a\in\mathbb{R}^A$ and a positive semi-definite matrix $M\in\mathbb{R}^{A\times A}$.
Lemma 18. For some arbitrary $b_1, b_2\in\mathbb{R}^A$, $a_0\in\Delta_A$ with $\eta\le\frac{1}{270C}$, define
$$a_1 = \operatorname{argmin}_{a\in\Delta_A} F_1(a), \quad\text{where } F_1(a)\triangleq\langle a, b_1\rangle + D_\psi(a, a_0),$$
$$a_2 = \operatorname{argmin}_{a\in\Delta_A} F_2(a), \quad\text{where } F_2(a)\triangleq\langle a, b_2\rangle + D_\psi(a, a_0)$$
($\psi$ and $D_\psi$ are defined in Algorithm 5). Then as long as $\|b_1 - b_2\|_{\nabla^{-2}\psi(a_1)} \le 12\sqrt{\eta}\,C$, we have for all $i\in[A]$, $|a_{2,i} - a_{1,i}| \le 60\eta C\,a_{1,i}$.

Proof of Lemma 18. First, we prove $\|a_1 - a_2\|_{\nabla^2\psi(a_1)} \le 60\sqrt{\eta}\,C$ by contradiction. Assume $\|a_1 - a_2\|_{\nabla^2\psi(a_1)} > 60\sqrt{\eta}\,C$. Then there exists some $a_2'$ lying in the line segment between $a_1$ and $a_2$ such that $\|a_1 - a_2'\|_{\nabla^2\psi(a_1)} = 60\sqrt{\eta}\,C$. By Taylor's theorem, there exists $a$ that lies in the line segment between $a_1$ and $a_2'$ such that
$$\begin{aligned}
F_2(a_2') &= F_2(a_1) + \langle\nabla F_2(a_1), a_2' - a_1\rangle + \frac{1}{2}\|a_2' - a_1\|^2_{\nabla^2 F_2(a)} \\
&= F_2(a_1) + \langle b_2 - b_1, a_2' - a_1\rangle + \langle\nabla F_1(a_1), a_2' - a_1\rangle + \frac{1}{2}\|a_2' - a_1\|^2_{\nabla^2\psi(a)} \\
&\ge F_2(a_1) - \|b_2 - b_1\|_{\nabla^{-2}\psi(a_1)}\|a_2' - a_1\|_{\nabla^2\psi(a_1)} + \frac{1}{2}\|a_2' - a_1\|^2_{\nabla^2\psi(a)} \\
&\ge F_2(a_1) - 12\sqrt{\eta}\,C\times 60\sqrt{\eta}\,C + \frac{1}{2}\|a_2' - a_1\|^2_{\nabla^2\psi(a)},
\end{aligned}\tag{31}$$
where in the first inequality we use Hölder's inequality and the first-order optimality condition $\langle\nabla F_1(a_1), a_2' - a_1\rangle\ge 0$, and in the last inequality we use the conditions $\|b_1 - b_2\|_{\nabla^{-2}\psi(a_1)}\le 12\sqrt{\eta}\,C$ and $\|a_1 - a_2'\|_{\nabla^2\psi(a_1)} = 60\sqrt{\eta}\,C$. Note that $\nabla^2\psi(x)$ is a diagonal matrix with $\nabla^2\psi(x)_{ii} = \frac{1}{\eta}\frac{1}{x_i^2}$. Therefore for any $i\in[A]$,
$$60\sqrt{\eta}\,C = \|a_2' - a_1\|_{\nabla^2\psi(a_1)} = \sqrt{\sum_{j=1}^{A}\frac{(a_{2,j}' - a_{1,j})^2}{\eta\,a_{1,j}^2}} \ge \frac{|a_{2,i}' - a_{1,i}|}{\sqrt{\eta}\,a_{1,i}},$$
and thus $\frac{|a_{2,i}' - a_{1,i}|}{a_{1,i}} \le 60\eta C \le \frac{2}{9}$, which implies $\max\left\{\frac{a_{2,i}'}{a_{1,i}}, \frac{a_{1,i}}{a_{2,i}'}\right\} \le \frac{9}{7}$. Thus the squared norm appearing in the last term of (31) can be lower bounded by
$$\|a_2' - a_1\|^2_{\nabla^2\psi(a)} = \frac{1}{\eta}\sum_{i=1}^{A}\frac{1}{a_i^2}(a_{2,i}' - a_{1,i})^2 \ge \frac{1}{\eta}\left(\frac{7}{9}\right)^2\sum_{i=1}^{A}\frac{1}{a_{1,i}^2}(a_{2,i}' - a_{1,i})^2 \ge 0.6\,\|a_2' - a_1\|^2_{\nabla^2\psi(a_1)} = 0.6\times(60\sqrt{\eta}\,C)^2 = 2160\eta C^2.$$
Combining with (31) gives
$$F_2(a_2') \ge F_2(a_1) - 720\eta C^2 + \frac{1}{2}\times 2160\eta C^2 > F_2(a_1).$$
Recall that $a_2'$ is a point in the line segment between $a_1$ and $a_2$. By the convexity of $F_2$, the above inequality implies $F_2(a_1) < F_2(a_2)$, contradicting the optimality of $a_2$.
Thus we conclude $\|a_1 - a_2\|_{\nabla^2\psi(a_1)} \le 60\sqrt{\eta}\,C$. Since $\|a_1 - a_2\|_{\nabla^2\psi(a_1)} = \sqrt{\sum_{j=1}^{A}\frac{(a_{1,j} - a_{2,j})^2}{\eta\,a_{1,j}^2}} \ge \frac{|a_{2,i} - a_{1,i}|}{\sqrt{\eta}\,a_{1,i}}$ for all $i$, we get $\frac{|a_{2,i} - a_{1,i}|}{\sqrt{\eta}\,a_{1,i}} \le 60\sqrt{\eta}\,C$, which implies $|a_{2,i} - a_{1,i}| \le 60\eta C\,a_{1,i}$.

Proof of Lemma 17. We prove the following stability inequalities:
$$|\pi_k(a) - \pi_{k+1}'(a)| \le 60\eta C\,\pi_k(a), \tag{32}$$
$$|\pi_{k+1}'(a) - \pi_{k+1}(a)| \le 60\eta C\,\pi_k(a). \tag{33}$$
Note that (32) and (33) imply
$$|\pi_k(a) - \pi_{k+1}(a)| \le 120\eta C\,\pi_k(a), \tag{34}$$
which is the inequality we want to prove.
We use induction on $k$ to prove (32) and (33). Note that (32) implies
$$\pi_{k+1}'(a) \le \pi_k(a) + 60\eta C\,\pi_k(a) \le \pi_k(a) + \frac{60}{270}\pi_k(a) \le 2\pi_k(a), \tag{35}$$
and (34) implies
$$\pi_{k+1}(a) \le \pi_k(a) + 120\eta C\,\pi_k(a) \le \pi_k(a) + \frac{120}{270}\pi_k(a) \le 2\pi_k(a). \tag{36}$$
Thus, (35) and (36) are also inequalities we may use in the induction process.

Base case. For the case $k = 1$, note that
$$\pi_1 = \operatorname{argmin}_{\pi\in\Delta_A} D_\psi(\pi, \pi_1')\quad(\text{because } \pi_1 = \pi_1'), \qquad \pi_2' = \operatorname{argmin}_{\pi\in\Delta_A}\langle\pi, -\hat\beta_1\rangle + D_\psi(\pi, \pi_1').$$
To apply Lemma 18 and obtain (32), we only need to show $\|\hat\beta_1\|_{\nabla^{-2}\psi(\pi_1)} \le 12\sqrt{\eta}\,C$. Recall $\nabla^2\psi(u)_{ii} = \frac{1}{\eta}\frac{1}{u_i^2}$ and $\nabla^{-2}\psi(u)_{ii} = \eta u_i^2$. Thus,
$$\|\hat\beta_1\|^2_{\nabla^{-2}\psi(\pi_1)} \le \sum_{a=1}^{A}\eta\,\pi_1(a)^2\hat\beta_1(a)^2 \le \eta C^2$$
because $\sum_a\pi_1(a)^2\hat\beta_1(a)^2 \le \left(\sum_a\pi_1(a)\hat\beta_1(a)\right)^2 \le C^2$ by the condition in Line 1 of Algorithm 5. This proves (32) for the base case.
Now we prove (33) of the base case. Note that
$$\pi_2' = \operatorname{argmin}_{\pi\in\Delta_A} D_\psi(\pi, \pi_2'), \qquad \pi_2 = \operatorname{argmin}_{\pi\in\Delta_A}\langle\pi, -\hat\beta_1\rangle + D_\psi(\pi, \pi_2'). \tag{37}$$
Similarly, with the help of Lemma 18, we only need to show $\|\hat\beta_1\|_{\nabla^{-2}\psi(\pi_2')} \le 12\sqrt{\eta}\,C$. This can be verified by
$$\|\hat\beta_1\|^2_{\nabla^{-2}\psi(\pi_2')} \le \sum_{a=1}^{A}\eta\,\pi_2'(a)^2\hat\beta_1(a)^2 \le 4\sum_{a=1}^{A}\eta\,\pi_1(a)^2\hat\beta_1(a)^2 \le 4\eta C^2,$$
where the second inequality uses (35) for the base case (implied by (32) for the base case, which we just proved).

Induction. Assume (32) and (33) hold before $k$. To prove (32), observe that
$$\pi_k = \operatorname{argmin}_{\pi\in\Delta_A}\langle\pi, -\hat\beta_{k-1}\rangle + D_\psi(\pi, \pi_k'), \qquad \pi_{k+1}' = \operatorname{argmin}_{\pi\in\Delta_A}\langle\pi, -\hat\beta_k\rangle + D_\psi(\pi, \pi_k'). \tag{38}$$
To apply Lemma 18 and obtain (32), we only need to show $\|\hat\beta_k - \hat\beta_{k-1}\|_{\nabla^{-2}\psi(\pi_k)} \le 12\sqrt{\eta}\,C$. This can be verified by
$$\begin{aligned}
\|\hat\beta_k - \hat\beta_{k-1}\|^2_{\nabla^{-2}\psi(\pi_k)} &\le \sum_{a=1}^{A}\eta\,\pi_k(a)^2\left(\hat\beta_k(a) - \hat\beta_{k-1}(a)\right)^2 \\
&\le 2\eta\sum_{a=1}^{A}\pi_k(a)^2\left(\hat\beta_k(a)^2 + \hat\beta_{k-1}(a)^2\right) \\
&\le 2\eta\sum_{a=1}^{A}\pi_k(a)^2\hat\beta_k(a)^2 + 2\eta\sum_{a=1}^{A}4\,\pi_{k-1}(a)^2\hat\beta_{k-1}(a)^2 \\
&\le 10\eta C^2,
\end{aligned}$$
where the third inequality uses (36) for $k-1$.
To prove (33), we observe:
$$\pi_{k+1}' = \operatorname{argmin}_{\pi\in\Delta_A} D_\psi(\pi, \pi_{k+1}'), \qquad \pi_{k+1} = \operatorname{argmin}_{\pi\in\Delta_A}\langle\pi, -\hat\beta_k\rangle + D_\psi(\pi, \pi_{k+1}'). \tag{39}$$
Similarly, with the help of Lemma 18, we only need to show $\|\hat\beta_k\|_{\nabla^{-2}\psi(\pi_{k+1}')} \le 12\sqrt{\eta}\,C$. This can be verified by
$$\|\hat\beta_k\|^2_{\nabla^{-2}\psi(\pi_{k+1}')} \le \sum_{a=1}^{A}\eta\,\pi_{k+1}'(a)^2\hat\beta_k(a)^2 \le 4\sum_{a=1}^{A}\eta\,\pi_k(a)^2\hat\beta_k(a)^2 \le 4\eta C^2,$$
where in the second inequality we use (35) (implied by (32), which we just proved). This finishes the proof.

C.2. The regret bound of Algorithm 5 — Proof of Lemma 8


Proof of Lemma 8. By standard analysis for optimistic online mirror descent (e.g., (Wei & Luo, 2018, Lemma 6), (Chiang et al., 2012, Lemma 5)), we have (recall $\hat\beta_0$ is the all-zero vector)
$$\langle\tilde\pi - \pi_k, \hat\beta_k\rangle \le D_\psi(\tilde\pi, \pi_k') - D_\psi(\tilde\pi, \pi_{k+1}') + \langle\pi_k - \pi_{k+1}', \hat\beta_{k-1} - \hat\beta_k\rangle \tag{40}$$
for any $\tilde\pi\in\Delta_A$. Summing over $k$ and telescoping give
$$\sum_{k=1}^{K}\langle\tilde\pi - \pi_k, \hat\beta_k\rangle \le D_\psi(\tilde\pi, \pi_1') - D_\psi(\tilde\pi, \pi_{K+1}') + \sum_{k=1}^{K}\langle\pi_k - \pi_{k+1}', \hat\beta_{k-1} - \hat\beta_k\rangle \le D_\psi(\tilde\pi, \pi_1') + \sum_{k=1}^{K}\langle\pi_k - \pi_{k+1}', \hat\beta_{k-1} - \hat\beta_k\rangle.$$
As in (Wei & Luo, 2018), we pick $\tilde\pi = \left(1 - \frac{1}{T}\right)\pi^* + \frac{1}{TA}\mathbf{1}_A$, and thus
$$\begin{aligned}
D_\psi(\tilde\pi, \pi_1') &= \psi(\tilde\pi) - \psi(\pi_1') - \langle\nabla\psi(\pi_1'), \tilde\pi - \pi_1'\rangle \\
&= \psi(\tilde\pi) - \psi(\pi_1') && \text{($\nabla\psi(\pi_1') = -\frac{A}{\eta}\mathbf{1}$ and $\langle\mathbf{1}, \tilde\pi - \pi_1'\rangle = 0$)} \\
&= \frac{1}{\eta}\sum_{a=1}^{A}\log\frac{1}{\tilde\pi(a)} - \frac{1}{\eta}\sum_{a=1}^{A}\log\frac{1}{\pi_1'(a)} \\
&\le \frac{A\log(AT)}{\eta} - \frac{A\log A}{\eta} = \frac{A\ln T}{\eta}.
\end{aligned}$$
On the other hand, to bound $\langle\pi_k - \pi_{k+1}', \hat\beta_{k-1} - \hat\beta_k\rangle$, we follow the same approach as in (Wei & Luo, 2018, Lemma 14): define $F_k(\pi) = \langle\pi, -\hat\beta_{k-1}\rangle + D_\psi(\pi, \pi_k')$ and $F_{k+1}'(\pi) = \langle\pi, -\hat\beta_k\rangle + D_\psi(\pi, \pi_k')$. Then by definition we have $\pi_k = \operatorname{argmin}_{\pi\in\Delta_A}F_k(\pi)$ and $\pi_{k+1}' = \operatorname{argmin}_{\pi\in\Delta_A}F_{k+1}'(\pi)$.
Observe that
$$\begin{aligned}
F_{k+1}'(\pi_k) - F_{k+1}'(\pi_{k+1}') &= (\pi_k - \pi_{k+1}')^\top(\hat\beta_{k-1} - \hat\beta_k) + F_k(\pi_k) - F_k(\pi_{k+1}') \\
&\le (\pi_k - \pi_{k+1}')^\top(\hat\beta_{k-1} - \hat\beta_k) && \text{(by the optimality of $\pi_k$)} \\
&\le \left\|\pi_k - \pi_{k+1}'\right\|_{\nabla^2\psi(\pi_k)}\left\|\hat\beta_{k-1} - \hat\beta_k\right\|_{\nabla^{-2}\psi(\pi_k)}.
\end{aligned}\tag{41}$$
On the other hand, for some $\xi$ that lies on the line segment between $\pi_k$ and $\pi_{k+1}'$, we have by Taylor's theorem and the optimality of $\pi_{k+1}'$,
$$F_{k+1}'(\pi_k) - F_{k+1}'(\pi_{k+1}') = \nabla F_{k+1}'(\pi_{k+1}')^\top(\pi_k - \pi_{k+1}') + \frac{1}{2}\left\|\pi_k - \pi_{k+1}'\right\|^2_{\nabla^2 F_{k+1}'(\xi)} \ge \frac{1}{2}\left\|\pi_k - \pi_{k+1}'\right\|^2_{\nabla^2\psi(\xi)} \tag{42}$$
(by the optimality of $\pi_{k+1}'$ and the fact that $\nabla^2 F_{k+1}' = \nabla^2\psi$).
By Eq.(32) we know $\pi_{k+1}'(a)\in\left[\frac{1}{2}\pi_k(a),\,2\pi_k(a)\right]$, and hence $\xi(a)\in\left[\frac{1}{2}\pi_k(a),\,2\pi_k(a)\right]$ holds as well, because $\xi$ is on the line segment between $\pi_k$ and $\pi_{k+1}'$. This implies for any $x$,
$$\|x\|_{\nabla^2\psi(\xi)} = \sqrt{\sum_{a=1}^{A}\frac{x(a)^2}{\eta\,\xi(a)^2}} \ge \frac{1}{2}\sqrt{\sum_{a=1}^{A}\frac{x(a)^2}{\eta\,\pi_k(a)^2}} = \frac{1}{2}\|x\|_{\nabla^2\psi(\pi_k)}.$$
Combining this with (41) and (42), we get
$$\left\|\pi_k - \pi_{k+1}'\right\|_{\nabla^2\psi(\pi_k)}\left\|\hat\beta_{k-1} - \hat\beta_k\right\|_{\nabla^{-2}\psi(\pi_k)} \ge \frac{1}{8}\left\|\pi_k - \pi_{k+1}'\right\|^2_{\nabla^2\psi(\pi_k)},$$
which implies $\left\|\pi_k - \pi_{k+1}'\right\|_{\nabla^2\psi(\pi_k)} \le 8\left\|\hat\beta_{k-1} - \hat\beta_k\right\|_{\nabla^{-2}\psi(\pi_k)}$. Hence we can bound the third term in (40) by
$$\left\|\pi_k - \pi_{k+1}'\right\|_{\nabla^2\psi(\pi_k)}\left\|\hat\beta_{k-1} - \hat\beta_k\right\|_{\nabla^{-2}\psi(\pi_k)} \le 8\left\|\hat\beta_{k-1} - \hat\beta_k\right\|^2_{\nabla^{-2}\psi(\pi_k)} = 8\eta\sum_a\pi_k(a)^2\left(\hat\beta_{k-1}(a) - \hat\beta_k(a)\right)^2.$$
Finally, combining everything we have
$$\begin{aligned}
\mathbb{E}\left[\sum_{k=1}^{K}\langle\pi^* - \pi_k, \hat\beta_k\rangle\right] &= \mathbb{E}\left[\sum_{k=1}^{K}\langle\pi^* - \tilde\pi, \hat\beta_k\rangle + \sum_{k=1}^{K}\langle\tilde\pi - \pi_k, \hat\beta_k\rangle\right] \\
&\le \mathbb{E}\left[\frac{1}{T}\sum_{k=1}^{K}\left\langle\pi^* - \frac{1}{A}\mathbf{1}, \hat\beta_k\right\rangle\right] + O\left(\frac{A\log T}{\eta} + \eta\,\mathbb{E}\left[\sum_{k=1}^{K}\sum_a\pi_k(a)^2\left(\hat\beta_{k-1}(a) - \hat\beta_k(a)\right)^2\right]\right),
\end{aligned}$$
where the expectation of the first term is bounded by $O\left(\frac{KN}{T}\right) = O(1)$ by the fact that $\mathbb{E}[\hat\beta_k(s,a)] = O(N)$ (implied by Lemma 6 and Lemma 14). This completes the proof.

C.3. Proof for Lemma 9


Lemma 19 (Restatement of Lemma 9).
$$\mathbb{E}\left[B\sum_{k=1}^{K}\sum_s\sum_a\mu^*(s)\left(\pi^*(a|s) - \pi_k(a|s)\right)q^{\pi_k}(s,a)\right] = \tilde O\left(\frac{BA\ln T}{\eta} + \eta\,\frac{TN^3\rho}{B} + \eta^3 TN^6\right).$$
With the choice of $\eta = \min\left\{\frac{1}{270(N+1)},\ \frac{B\sqrt{A}}{\sqrt{\rho TN^3}},\ \frac{(BA)^{1/4}}{(TN^6)^{1/4}}\right\}$, the bound becomes
$$\tilde O\left(\sqrt{N^3\rho AT} + (BAN^2)^{\frac{3}{4}}T^{\frac{1}{4}} + BNA\right) = \tilde O\left(\sqrt{t_{mix}^3\rho AT} + (t_{mix}^3 t_{hit} A)^{\frac{3}{4}}T^{\frac{1}{4}} + t_{mix}^2 t_{hit} A\right).$$
Proof. For any s,


"K #
XX
∗ πk
E (π (a|s) − πk (a|s))q (s, a)
k=1 a
" K X
#
X P
∗ πk
=E (π (a|s) − πk (a|s))β (s, a) (by the definition of β πk and that a (π

(a|s) − πk (a|s))J πk = 0)
k=1 a
" #  
K X
X h i K
≤E ∗
(π (a|s) − πk (a|s))Ek b
βk (s, a) + O (by Eq. (11))
T
k=1 a
  " K X
#!
A ln T X
=O + O ηE πk (a|s)2 (βbk (s, a) − βbk−1 (s, a))2 (by Lemma 8)
η
k=1 a
  "K #!
A ln T XX
2 2 b πk 2
≤O + ηN + O ηE πk (a|s) (βk (s, a) − β (s, a))
η
k=2 a
"K #!
XX
2 πk πk−1 2
+ O ηE πk (a|s) (β (s, a) − β (s, a))
k=2 a
" K X
#!
X
+ O ηE 2
πk (a|s) (β πk−1
(s, a) − βbk−1 (s, a))2 , (43)
k=2 a

where the last line uses the fact $(z_1 + z_2 + z_3)^2 \le 3z_1^2 + 3z_2^2 + 3z_3^2$. The second term in (43) can be bounded using Eq. (12):
$$O\left(\eta\,\mathbb{E}\left[\sum_{k=2}^{K}\sum_a\pi_k(a|s)^2\big(\hat\beta_k(s,a) - \beta^{\pi_k}(s,a)\big)^2\right]\right) = O\left(\eta\,\mathbb{E}\left[\sum_{k=2}^{K}\sum_a\pi_k(a|s)^2\,\frac{N^3\log T}{B\,\pi_k(a|s)\,\mu^{\pi_k}(s)}\right]\right) = O\left(\eta\,\mathbb{E}\left[\sum_{k=2}^{K}\frac{N^3\log T}{B\,\mu^{\pi_k}(s)}\right]\right).$$
The fourth term in (43) can be bounded similarly, except that we first use Lemma 17 to upper bound $\pi_k(a|s)$ by $2\pi_{k-1}(a|s)$. Eventually this term is upper bounded by $O\left(\eta\,\mathbb{E}\left[\sum_{k=2}^{K}\frac{N^3\log T}{B\,\mu^{\pi_{k-1}}(s)}\right]\right) = O\left(\eta\,\mathbb{E}\left[\sum_{k=1}^{K}\frac{N^3\log T}{B\,\mu^{\pi_k}(s)}\right]\right)$.
The third term in (43) can be bounded using Lemma 7:
$$O\left(\eta\,\mathbb{E}\left[\sum_{k=2}^{K}\sum_a\pi_k(a|s)^2\big(\beta^{\pi_k}(s,a) - \beta^{\pi_{k-1}}(s,a)\big)^2\right]\right) = O\left(\eta\,\mathbb{E}\left[\sum_{k=2}^{K}\sum_a\pi_k(a|s)^2(\eta N^3)^2\right]\right) = O\left(\eta^3 K N^6\right).$$
Combining all these bounds in (43), we get
$$\mathbb{E}\left[\sum_{k=1}^{K}\sum_a(\pi^*(a|s) - \pi_k(a|s))\,q^{\pi_k}(s,a)\right] = O\left(\frac{A\ln T}{\eta} + \eta\,\mathbb{E}\left[\sum_{k=1}^{K}\frac{N^3\log T}{B\,\mu^{\pi_k}(s)}\right] + \eta^3 K N^6\right).$$
Now multiplying both sides by $B\mu^*(s)$ and summing over $s$, we get
$$\begin{aligned}
\mathbb{E}\left[B\sum_{k=1}^{K}\sum_s\sum_a\mu^*(s)(\pi^*(a|s) - \pi_k(a|s))\,q^{\pi_k}(s,a)\right] &= O\left(\frac{BA\ln T}{\eta} + \eta\,\mathbb{E}\left[\sum_{k=1}^{K}\sum_s\frac{N^3(\log T)\,\mu^*(s)}{\mu^{\pi_k}(s)}\right] + \eta^3 BKN^6\right) \\
&\le O\left(\frac{BA\ln T}{\eta} + \eta\rho K N^3\log T + \eta^3 BKN^6\right) \\
&= \tilde O\left(\frac{BA}{\eta} + \eta\rho\,\frac{TN^3}{B} + \eta^3 TN^6\right). && (T = BK)
\end{aligned}$$
Choosing $\eta = \min\left\{\frac{1}{270(N+1)},\ \frac{B\sqrt{A}}{\sqrt{\rho TN^3}},\ \frac{(BA)^{1/4}}{(TN^6)^{1/4}}\right\}$ (where $\eta\le\frac{1}{270(N+1)}$ is required by Lemma 17), we finally obtain
$$\mathbb{E}\left[B\sum_{k=1}^{K}\sum_s\sum_a\mu^*(s)(\pi^*(a|s) - \pi_k(a|s))\,q^{\pi_k}(s,a)\right] = \tilde O\left(\sqrt{N^3\rho AT} + (BAN^2)^{\frac{3}{4}}T^{\frac{1}{4}} + BNA\right) = \tilde O\left(\sqrt{t_{mix}^3\rho AT} + (t_{mix}^3 t_{hit} A)^{\frac{3}{4}}T^{\frac{1}{4}} + t_{mix}^2 t_{hit} A\right).$$

D. Experiments
In this section, we compare the performance of our proposed algorithms and previous model-free algorithms. We note
that model-based algorithms (UCRL2, PSRL, . . . ) typically have better performance in terms of regret but require more
memory. For a fair comparison, we restrict our attention to model-free algorithms.

Figure 3. Performance of model-free algorithms on the random MDP (left) and JumpRiverSwim (right); each panel plots cumulative regret against the number of steps (up to 5 × 10^6). The standard Q-learning algorithm with ε-greedy exploration suffers from linear regret. The OPTIMISTIC Q-LEARNING and MDP-OOMD algorithms achieve sub-linear regret. The shaded area denotes the standard deviation of regret over multiple runs.

Two environments are considered: a randomly generated MDP and JumpRiverSwim. Both of the environments consist of
6 states and 2 actions. The reward function and the transition kernel of the random MDP are chosen uniformly at random.
The JumpRiverSwim environment is a modification of the RiverSwim environment (Strehl & Littman, 2008; Ouyang et al.,
2017a) with a small probability of jumping to an arbitrary state at each time step.
The standard RiverSwim models a swimmer who can choose to swim either left or right in a river. The states are arranged
in a chain and the swimmer starts from the leftmost state (s = 1). If the swimmer chooses to swim left, i.e., the direction
of the river current, he is always successful. If he chooses to swim right, he may fail with a certain probability. The reward
function is: r(1, left) = 0.2, r(6, right) = 1 and r(s, a) = 0 for all other states and actions. The optimal policy is to always
swim right to gain the maximum reward of state s = 6. The standard RiverSwim is not an ergodic MDP and does not
satisfy the assumption of the MDP-OOMD algorithm. To handle this issue, we consider the JumpRiverSwim environment
which has a small probability 0.01 of moving to an arbitrary state at each time step. This small modification provides an
ergodic environment.
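A sketch of how such an environment can be constructed is given below. The reward function follows the description above; the exact success/failure probabilities of the "right" action and the assumption that the 0.01-probability jump lands on a uniformly random state are illustrative guesses, since those details are not specified in this section.

import numpy as np

def jump_river_swim(n_states=6, p_jump=0.01, p_right_success=0.35, p_right_back=0.05):
    """Build (P, r) for a JumpRiverSwim-style MDP with actions 0 = left, 1 = right.
    The reward function matches the text (r(1,left)=0.2, r(6,right)=1, 0 otherwise);
    the 'right'-action probabilities and the uniform jump are assumptions."""
    S = n_states
    P = np.zeros((S, 2, S))
    r = np.zeros((S, 2))
    for s in range(S):
        # Action 0 (left): always succeeds, i.e. move one step left (or stay at state 0).
        P[s, 0, max(s - 1, 0)] = 1.0
        # Action 1 (right): may succeed, stay, or be pushed back by the current.
        if s == S - 1:
            P[s, 1, s] += p_right_success + (1.0 - p_right_success - p_right_back)
            P[s, 1, s - 1] += p_right_back
        else:
            P[s, 1, s + 1] += p_right_success
            P[s, 1, s] += 1.0 - p_right_success - p_right_back
            P[s, 1, max(s - 1, 0)] += p_right_back
    r[0, 0] = 0.2          # r(1, left) in the paper's 1-indexed notation
    r[S - 1, 1] = 1.0      # r(6, right)
    # With probability p_jump the next state is instead drawn uniformly at random,
    # which makes every policy's chain ergodic.
    P = (1.0 - p_jump) * P + p_jump / S
    return P, r

P, r = jump_river_swim()
assert np.allclose(P.sum(axis=2), 1.0)   # every (s, a) row is a probability distribution
print(P.shape, r[0, 0], r[5, 1])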
We compare our algorithms with two benchmark model-free algorithms. The first benchmark is standard Q-learning with ε-greedy exploration. Figure 3 shows that this algorithm suffers from linear regret, indicating that naive ε-greedy exploration is not efficient. The second benchmark is the POLITEX algorithm by Abbasi-Yadkori et al. (2019a). The implementation of POLITEX is based on the variant designed for the tabular case, which is presented in their Appendix F and Figure 3. POLITEX usually requires a longer episode length than MDP-OOMD (see Table 2) because in each episode it needs to accurately estimate the Q-function, rather than merely obtain an unbiased estimator of it as in MDP-OOMD.
Figure 3 shows that the proposed OPTIMISTIC Q-LEARNING and MDP-OOMD algorithms, and the POLITEX algorithm by Abbasi-Yadkori et al. (2019a), all achieve similar performance in the RandomMDP environment. In the JumpRiverSwim environment, the OPTIMISTIC Q-LEARNING algorithm outperforms the other three algorithms. Although the regret upper bound for OPTIMISTIC Q-LEARNING scales as $\tilde O(T^{2/3})$ (Theorem 1), which is worse than that of MDP-OOMD (Theorem 5), Figure 3 suggests that in environments that lack good mixing properties, the OPTIMISTIC Q-LEARNING algorithm may perform better. The details of the experiments are listed in Table 2.

Table 2. Hyper-parameters used in the experiments. These hyper-parameters are tuned to obtain the best possible result for each algorithm. All the experiments are averaged over 10 independent runs for a horizon of 5 × 10^6. For the POLITEX algorithm, τ and τ′ are the lengths of the two stages defined in Figure 3 of (Abbasi-Yadkori et al., 2019a).

Environment    | Algorithm                  | Parameters
Random MDP     | Q-learning with ε-greedy   | ε = 0.05
Random MDP     | Optimistic Q-learning      | H = 100, c = 1, b_τ = c·sqrt(H/τ)
Random MDP     | MDP-OOMD                   | N = 2, B = 4, η = 0.01
Random MDP     | POLITEX                    | τ = 1000, τ′ = 1000, η = 0.2
JumpRiverSwim  | Q-learning with ε-greedy   | ε = 0.03
JumpRiverSwim  | Optimistic Q-learning      | H = 100, c = 1, b_τ = c·sqrt(H/τ)
JumpRiverSwim  | MDP-OOMD                   | N = 10, B = 30, η = 0.01
JumpRiverSwim  | POLITEX                    | τ = 3000, τ′ = 3000, η = 0.2
