Model-Free Reinforcement Learning in Infinite-Horizon Average-Reward Markov Decision Processes
Chen-Yu Wei Mehdi Jafarnia-Jahromi Haipeng Luo Hiteshi Sharma Rahul Jain
{chenyu.wei, mjafarni, haipengl, hiteshis, rahul.jain}@usc.edu
Table 1. Regret comparisons for RL algorithms in infinite-horizon average-reward MDPs with S states, A actions, and T steps. D is the diameter of the MDP, sp(v*) ≤ D is the span of the optimal value function, V*_{s,a} := Var_{s′∼p(·|s,a)}[v*(s′)] ≤ sp(v*)² is the variance of the optimal value function, t_mix is the mixing time (Definition 5.1), t_hit is the hitting time (Definition 5.2), and ρ ≤ t_hit is a distribution mismatch coefficient (Eq. (4)). For more concrete definitions of these parameters, see Sections 3-5.
In this paper, we make significant progress in this direction and propose two model-free algorithms for learning infinite-horizon average-reward MDPs. The first algorithm, Optimistic Q-learning (Section 4), achieves a regret bound of Õ(T^{2/3}) with high probability for the broad class of weakly communicating MDPs.¹ This is the first model-free algorithm in this setting under only the minimal weakly communicating assumption. The key idea of this algorithm is to artificially introduce a discount factor for the reward, to avoid the aforementioned unbounded Q-value estimate issue, and to trade off this effect against the approximation introduced by the discount factor. We remark that this is very different from the R-learning algorithm of (Schwartz, 1993), which is a variant of Q-learning with no discount factor for the infinite-horizon average-reward setting.

The second algorithm, MDP-OOMD (Section 5), attains an improved regret bound of Õ(√T) for the more restricted class of ergodic MDPs. This algorithm maintains an instance of a multi-armed bandit algorithm at each state to learn the best action. Importantly, the multi-armed bandit algorithm needs to ensure several key properties to achieve our claimed regret bound, and to this end we make use of recent advances in adaptive adversarial bandit algorithms from (Wei & Luo, 2018) in a novel way.

To the best of our knowledge, the only existing model-free algorithm for this setting is the POLITEX algorithm (Abbasi-Yadkori et al., 2019a;b), which achieves Õ(T^{3/4}) regret for ergodic MDPs only. Both of our algorithms enjoy a better bound compared to POLITEX, and the first algorithm even removes the ergodic assumption completely.²

For comparisons with other existing model-based approaches for this problem, see Table 1. We also conduct experiments comparing our two algorithms. Details are deferred to Appendix D due to space constraints.

2. Related Work

We review the related literature with regret guarantees for learning MDPs with finite state and action spaces (there are many other works on asymptotic convergence or sample complexity, a different focus compared to our work). Three common settings have been studied: 1) the finite-horizon episodic setting, 2) the infinite-horizon discounted setting, and 3) the infinite-horizon average-reward setting. For the first two settings, previous works have designed efficient algorithms with regret bounds or sample complexity that are (almost) information-theoretically optimal, using either model-based approaches such as (Azar et al., 2017), or model-free approaches such as (Jin et al., 2018; Dong et al., 2019).

¹Throughout the paper, we use the notation Õ(·) to suppress log terms.
²POLITEX is studied in a more general setup with function approximation though. See the end of Section 5.1 for more comparisons.
For the infinite-horizon average-reward setting, many model-based algorithms have been proposed, such as (Auer & Ortner, 2007; Jaksch et al., 2010; Ouyang et al., 2017b; Agrawal & Jia, 2017; Talebi & Maillard, 2018; Fruit et al., 2018a;b). These algorithms either conduct posterior sampling or follow the optimism in the face of uncertainty principle to build an MDP model estimate and then plan according to the estimate (hence model-based). They all achieve Õ(√T) regret, but the dependence on other parameters is suboptimal. Recent works made progress toward obtaining the optimal bound (Ortner, 2018; Zhang & Ji, 2019); however, their algorithms are not computationally efficient: the time complexity scales exponentially with the number of states. On the other hand, except for the naive approach of combining Q-learning with ǫ-greedy exploration (which is known to suffer regret exponential in some parameters (Osband et al., 2014)), the only existing model-free algorithm for this setting is POLITEX, which only works for ergodic MDPs.

Two additional works are closely related to our second algorithm MDP-OOMD: (Neu et al., 2013) and (Wang, 2017). Both belong to the family of policy optimization methods, where the learner tries to learn the parameters of the optimal policy directly. Their settings are quite different from ours and the results are not comparable. We defer more detailed comparisons with these two works to the end of Section 5.1.

3. Preliminaries

An infinite-horizon average-reward MDP is specified by a state space S, an action space A, a reward function r : S × A → [0, 1], and a transition function p : S² × A → [0, 1] such that p(s′|s, a) := P(s_{t+1} = s′ | s_t = s, a_t = a) for s_t ∈ S, a_t ∈ A and t = 1, 2, 3, · · · . We assume that S and A are finite sets with cardinalities S and A, respectively. The average reward per stage of a deterministic/stationary policy π : S → A starting from state s is defined as

    J^π(s) := liminf_{T→∞} (1/T) E[ Σ_{t=1}^T r(s_t, π(s_t)) | s_1 = s ],

where s_{t+1} is drawn from p(·|s_t, π(s_t)). Let J*(s) := max_{π∈A^S} J^π(s). A policy π* is said to be optimal if it satisfies J^{π*}(s) = J*(s) for all s ∈ S.

We consider two standard classes of MDPs in this paper: (1) weakly communicating MDPs, defined in Section 4, and (2) ergodic MDPs, defined in Section 5. The weakly communicating assumption is weaker than the ergodic assumption, and is in fact known to be necessary for learning infinite-horizon MDPs with low regret (Bartlett & Tewari, 2009).

Standard MDP theory (Puterman, 2014) shows that for these two classes, there exist q* : S × A → R (unique up to an additive constant) and a unique J* ∈ [0, 1] such that J*(s) = J* for all s ∈ S and the following Bellman equation holds:

    J* + q*(s, a) = r(s, a) + E_{s′∼p(·|s,a)}[v*(s′)],    (1)

where v*(s) := max_{a∈A} q*(s, a). The optimal policy is then obtained by π*(s) = argmax_a q*(s, a).

We consider a learning problem where S, A, and the reward function r are known to the agent, but not the transition probability p (so one cannot directly solve the Bellman equation). The knowledge of the reward function is a typical assumption, as in (Bartlett & Tewari, 2009; Gopalan & Mannor, 2015; Ouyang et al., 2017b), and can be removed at the expense of a constant factor in the regret bound.

Specifically, the learning protocol is as follows. An agent starts at an arbitrary state s_1 ∈ S. At each time step t = 1, 2, 3, · · · , the agent observes state s_t ∈ S and takes action a_t ∈ A, which is a function of the history s_1, a_1, s_2, a_2, · · · , s_{t−1}, a_{t−1}, s_t. The environment then determines the next state by drawing s_{t+1} according to p(·|s_t, a_t). The performance of a learning algorithm is evaluated through the notion of cumulative regret, defined as the difference between the total reward of the optimal policy and that of the algorithm:

    R_T := Σ_{t=1}^T (J* − r(s_t, a_t)).

Since r ∈ [0, 1] (and consequently J* ∈ [0, 1]), the regret can at worst grow linearly with T. If a learning algorithm achieves sub-linear regret, then R_T/T goes to zero, i.e., the average reward of the algorithm converges to the optimal per-stage reward J*. The best existing regret bound is Õ(√(DSAT)), achieved by a model-based algorithm (Zhang & Ji, 2019) (where D is the diameter of the MDP), and it matches the lower bound of (Jaksch et al., 2010).
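As a concrete illustration of Eq. (1): when p is actually known, the gain J* and a solution (q*, v*) can be computed by relative value iteration. The following is a minimal sketch (not one of the paper's algorithms), assuming the MDP is given as dense arrays P of shape (S, A, S) and r of shape (S, A); the function name and stopping rule are illustrative, and convergence requires standard aperiodicity conditions.

```python
import numpy as np

def relative_value_iteration(P, r, iters=100_000, tol=1e-9):
    """Solve J* + q*(s,a) = r(s,a) + E_{s'~p(.|s,a)}[v*(s')] when p is known.

    P[s, a, s'] = p(s'|s, a); r[s, a] in [0, 1].  Returns (J, q, v) with
    v(s) = max_a q(s, a) normalized so that v(0) = 0 (reference state 0).
    """
    S, A = r.shape
    v = np.zeros(S)
    J = 0.0
    for _ in range(iters):
        q = r + P @ v          # q[s, a] = r(s, a) + E_{s'~p(.|s,a)}[v(s')]
        w = q.max(axis=1)      # one Bellman backup of v
        J = w[0] - v[0]        # gain estimate read off at the reference state
        w = w - w[0]           # "relative" normalization keeps v bounded
        if np.max(np.abs(w - v)) < tol:
            v = w
            break
        v = w
    q = r + P @ v - J          # Bellman equation (1): J + q(s,a) = r(s,a) + E[v(s')]
    return J, q, v
```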
4. Optimistic Q-Learning

In this section, we introduce our first algorithm, OPTIMISTIC Q-LEARNING (see Algorithm 1 for the pseudocode). The algorithm works for any weakly communicating MDP. An MDP is weakly communicating if its state space S can be partitioned into two subsets: in the first subset, all states are transient under any stationary policy; in the second subset, every two states are accessible from each other under some stationary policy. It is well-known that [...]

Algorithm 1 OPTIMISTIC Q-LEARNING
Parameters: H ≥ 2, confidence level δ ∈ (0, 1)
Initialization: γ = 1 − 1/H, ∀s: V̂_1(s) = H, ∀s, a: Q_1(s, a) = Q̂_1(s, a) = H, n_1(s, a) = 0
Define: ∀τ, α_τ = (H + 1)/(H + τ), b_τ = 4 sp(v*) √((H/τ) ln(2T/δ))
for t = 1, . . . , T do
  1. Take action a_t = argmax_{a∈A} Q̂_t(s_t, a).
  2. Observe s_{t+1}.
  3. Update:
       n_{t+1}(s_t, a_t) ← n_t(s_t, a_t) + 1
       τ ← n_{t+1}(s_t, a_t)
       Q_{t+1}(s_t, a_t) ← (1 − α_τ) Q_t(s_t, a_t) + α_τ [ r(s_t, a_t) + γ V̂_t(s_{t+1}) + b_τ ]    (2)
       Q̂_{t+1}(s_t, a_t) ← min{ Q̂_t(s_t, a_t), Q_{t+1}(s_t, a_t) }
       V̂_{t+1}(s_t) ← max_{a∈A} Q̂_{t+1}(s_t, a)
(All other entries of n_{t+1}, Q_{t+1}, Q̂_{t+1}, V̂_{t+1} remain the same as those in n_t, Q_t, Q̂_t, V̂_t.)

At each step, the algorithm takes the action that maximizes the current estimated Q value (Line 1). After seeing the next state, the algorithm makes a stochastic update of Q_t based on the Bellman equation, importantly with an extra bonus term b_τ and a carefully chosen step size α_τ (Eq. (2)). Here, τ is the number of times the current state-action pair has been visited, and the bonus term b_τ scales as O(√(H/τ)), which encourages exploration since it shrinks every time a state-action pair is executed. The choice of the step size α_τ is also crucial, as pointed out in (Jin et al., 2018), and determines a certain effective period of the history for the current update.

While the algorithmic idea is similar to (Dong et al., 2019), we emphasize that our analysis is different and novel:

• First, Dong et al. (2019) analyze the sample complexity of their algorithm, while we analyze the regret.

• Second, we need to deal with the approximation effect due to the difference between the discounted MDP and the original undiscounted one (Lemma 2).
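For concreteness, the following is a minimal sketch of the update in Eq. (2), assuming a hypothetical simulator interface env.reset()/env.step(a) and a known upper bound sp_v on sp(v*); it is an illustration of Algorithm 1, not the authors' implementation.

```python
import numpy as np

def optimistic_q_learning(env, S, A, T, H, sp_v, delta):
    """Sketch of Algorithm 1 (Optimistic Q-learning) for a tabular simulator.

    `env.reset() -> s` and `env.step(a) -> (s_next, r)` are assumed; `sp_v`
    upper-bounds sp(v*).  Returns the total collected reward.
    """
    gamma = 1.0 - 1.0 / H
    Q = np.full((S, A), float(H))       # Q_t
    Q_hat = np.full((S, A), float(H))   # monotone estimate \hat{Q}_t
    V_hat = np.full(S, float(H))        # \hat{V}_t
    n = np.zeros((S, A), dtype=int)     # visit counts

    s = env.reset()
    total_reward = 0.0
    for t in range(T):
        a = int(np.argmax(Q_hat[s]))             # Line 1: act greedily w.r.t. Q_hat
        s_next, r = env.step(a)                   # Line 2
        n[s, a] += 1
        tau = n[s, a]
        alpha = (H + 1) / (H + tau)               # step size as in (Jin et al., 2018)
        bonus = 4 * sp_v * np.sqrt(H / tau * np.log(2 * T / delta))
        # Eq. (2): stochastic Bellman update with discount gamma and bonus b_tau
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * V_hat[s_next] + bonus)
        Q_hat[s, a] = min(Q_hat[s, a], Q[s, a])   # keep the optimistic estimate monotone
        V_hat[s] = Q_hat[s].max()
        total_reward += r
        s = s_next
    return total_reward
```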
...distributions and output π_{k+1} for the next episode (Algorithm 4).

The reward estimator β̂_k(s, ·) is an almost unbiased estimator for

    β^{π_k}(s, ·) := q^{π_k}(s, ·) + N J^{π_k}    (7)

with negligible bias (N is defined in Algorithm 3). The term N J^{π_k} is the same for all actions, and thus the corresponding MAB algorithm is trying to learn the best action at state s in terms of the average of the Q-value functions q^{π_1}(s, ·), . . . , q^{π_K}(s, ·). To construct the reward estimator for state s, the sub-routine ESTIMATEQ collects non-overlapping intervals of length N + 1 = Õ(t_mix) that start from state s, and uses standard inverse-propensity scoring to construct an estimator y_i for interval i (Line 5). In fact, to reduce the correlation among the non-overlapping intervals, we also make sure that these intervals are at least N steps apart from each other (Line 6). The final estimator β̂_k(s, ·) is simply the average of all estimators y_i over these disjoint intervals. This averaging is important for reducing variance, as explained later (see also Lemma 6).

The MAB algorithm we use is optimistic online mirror descent (OOMD) (Rakhlin & Sridharan, 2013) with the log-barrier as the regularizer, analyzed in depth in (Wei & Luo, 2018). Here, optimism refers to something different from the optimistic exploration in Section 4. It corresponds to the fact that after a standard mirror descent update (Eq. (5)), the algorithm further makes a similar update using an optimistic prediction of the next reward vector, which in our case is simply the previous reward estimator (Eq. (6)). We refer the reader to (Wei & Luo, 2018) for more details, but point out that the optimistic prediction we use here is new.

It is clear that each MAB algorithm faces a non-stochastic problem (since π_k is changing over time), and thus it is important to deploy an adversarial MAB algorithm. The standard algorithm for adversarial MAB is EXP3 (Auer et al., 2002), which was also used for solving adversarial MDPs (Neu et al., 2013) (more comparisons with this to follow). However, there are several important reasons for our choice of the recently developed OOMD with log-barrier:

• First, the log-barrier regularizer produces a more exploratory distribution compared to EXP3 (as noticed in e.g. (Agarwal et al., 2017)), so we do not need an explicit exploration over the actions, which significantly simplifies the analysis compared to (Neu et al., 2013).

• Second, the log-barrier regularizer provides more stable updates compared to EXP3, in the sense that π_k(a|s) and π_{k−1}(a|s) are within a multiplicative factor of each other (see Lemma 7). This implies that the corresponding policies and their Q-value functions are also stable, which is critical for our analysis.

• Finally, the optimistic prediction of OOMD, together with our particular reward estimator from ESTIMATEQ, provides a variance reduction effect that leads to a better regret bound in terms of ρ instead of t_hit. See Lemma 8 and Lemma 9.

The regret guarantee of our algorithm is shown below.

Theorem 5. For ergodic MDPs, with an appropriately chosen learning rate η for Algorithm 4, MDP-OOMD achieves

    E[R_T] = Õ( √(t_mix³ ρ A T) ).

Note that in this bound, the dependence on the number of states S is hidden in ρ, since ρ ≥ Σ_s μ*(s)/μ*(s) = S. Compared to the bound of Algorithm 1 or some other model-based algorithms such as UCRL2, this bound has an extra dependence on t_mix, a potentially large constant. As far as we know, all existing mirror-descent-based algorithms for the average-reward setting have the same issue (such as (Neu et al., 2013; Wang, 2017; Abbasi-Yadkori et al., 2019a)). The role of t_mix in our analysis is almost the same as that of 1/(1 − γ) in the discounted setting (γ is the discount factor). Specifically, a small t_mix ensures 1) that a short trajectory suffices to approximate the Q-function with the expected trajectory reward (in view of Eq. (11)), and 2) an upper bound on the magnitude of q(s, a) and v(s) (Lemma 14). For the discounted setting these are ensured by the discount factor already.

Comparisons. Neu et al. (2013) considered learning ergodic MDPs with a known transition kernel and adversarial rewards, a setting incomparable to ours. Their algorithm maintains a copy of EXP3 for each state, but the reward estimators fed to these algorithms are constructed using the knowledge of the transition kernel and are very different from ours. They proved a regret bound of order Õ(√(t_mix³ t_hit A T)), which is worse than ours since ρ ≤ t_hit.

In another recent work, (Wang, 2017) considered learning ergodic MDPs under the assumption that the learner is provided with a generative model (an oracle that takes in a state-action pair and outputs a sample of the next state). They derived a sample-complexity bound of order Õ(t_mix² τ² SA / ǫ²) for finding an ǫ-optimal policy, where τ = max{ max_s μ*(s)/(1/S), max_{s′,π} (1/S)/μ^π(s′) }, so that τ² is at least max_π max_{s,s′} μ*(s)/μ^π(s′) by the AM-GM inequality. This result is again incomparable to ours, but we point out that our distribution mismatch coefficient ρ is always bounded by τS, while τ can be much larger than ρ on the other hand.
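As a quick sanity check on the comparison above: since ρ ≤ t_hit (Table 1), the bound of Theorem 5 is at most Õ(√(t_mix³ t_hit A T)), i.e., never worse than the bound of Neu et al. (2013) just discussed, and strictly better whenever ρ is much smaller than t_hit.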
Finally, Abbasi-Yadkori et al. (2019a) consider a more general setting with function approximation, and their algorithm POLITEX maintains a copy of the standard exponential weight algorithm for each state, very similar to (Neu et al., 2013). When specialized to our tabular setting, one can verify (according to their Theorem 5.2) that POLITEX achieves Õ(t_mix³ √(t_hit SA) T^{3/4}) regret, which is significantly worse than ours in terms of all parameters.

5.2. Proof sketch of Theorem 5

We first decompose the regret as follows:

    R_T = Σ_{t=1}^T (J* − r(s_t, a_t))
        = B Σ_{k=1}^K (J* − J^{π_k}) + Σ_{k=1}^K Σ_{t∈I_k} (J^{π_k} − r(s_t, a_t)),    (8)

where I_k := {(k − 1)B + 1, . . . , kB} is the set of time steps for episode k. Using the reward difference lemma (Lemma 15 in the appendix), the first term of Eq. (8) can be written as

    B E[ Σ_s μ*(s) Σ_{k=1}^K Σ_a (π*(a|s) − π_k(a|s)) q^{π_k}(s, a) ],    (9)

The second term of Eq. (8) can be further written as

    E[ Σ_{k=1}^K Σ_{t∈I_k} (J^{π_k} − r(s_t, a_t)) ]
    = E[ Σ_{k=1}^K Σ_{t∈I_k} (E_{s′∼p(·|s_t,a_t)}[v^{π_k}(s′)] − q^{π_k}(s_t, a_t)) ]    (Bellman equation)
    = E[ Σ_{k=1}^K Σ_{t∈I_k} (E_{s′∼p(·|s_t,a_t)}[v^{π_k}(s′)] − v^{π_k}(s_{t+1})) ]
      + E[ Σ_{k=1}^K Σ_{t∈I_k} (v^{π_k}(s_t) − q^{π_k}(s_t, a_t)) ]
      + E[ Σ_{k=1}^K Σ_{t∈I_k} (v^{π_k}(s_{t+1}) − v^{π_k}(s_t)) ]
    = E[ Σ_{k=1}^K (v^{π_k}(s_{kB+1}) − v^{π_k}(s_{(k−1)B+1})) ]    (the first two terms above are zero)
    = E[ Σ_{k=1}^{K−1} (v^{π_k}(s_{kB+1}) − v^{π_{k+1}}(s_{kB+1})) ] + E[ v^{π_K}(s_{KB+1}) − v^{π_1}(s_1) ].    (10)

The first term in the last expression can be bounded by O(ηN³K) = O(ηN³T/B) due to the stability of OOMDUPDATE (Lemma 7), and the second term is at most O(t_mix) according to Lemma 14 in the appendix.

Combining these facts with N = Õ(t_mix), B = Õ(t_mix t_hit), Eq. (8) and Eq. (9), and choosing the optimal η, we arrive at

    E[R_T] = Õ( BA/η + η t_mix³ ρ T / B + η³ t_mix⁶ T )
           = Õ( √(t_mix³ ρ A T) + (t_mix³ t_hit A)^{3/4} T^{1/4} + t_mix² t_hit A ).
The next lemma shows that in OOMD, π_k and π_{k−1} are close in a strong sense, which further implies the stability of several other related quantities.

Lemma 7. For any k, s, a,

    |π_k(a|s) − π_{k−1}(a|s)| ≤ O(ηN π_{k−1}(a|s)),    (13)
    |J^{π_k} − J^{π_{k−1}}| ≤ O(ηN²),
    |v^{π_k}(s) − v^{π_{k−1}}(s)| ≤ O(ηN³),
    |q^{π_k}(s, a) − q^{π_{k−1}}(s, a)| ≤ O(ηN³),
    |β^{π_k}(s, a) − β^{π_{k−1}}(s, a)| ≤ O(ηN³).

The next lemma shows the regret bound of OOMD, based on an analysis similar to (Wei & Luo, 2018).

Lemma 8. For a specific state s, we have

    E[ Σ_{k=1}^K Σ_a (π*(a|s) − π_k(a|s)) β̂_k(s, a) ]
    ≤ O( (A ln T)/η ) + η E[ Σ_{k=1}^K Σ_a π_k(a|s)² (β̂_k(s, a) − β̂_{k−1}(s, a))² ],

where we define β̂_0(s, a) = 0 for all s and a.

Finally, we state a key lemma for proving Theorem 5.

Lemma 9. MDP-OOMD ensures

    E[ B Σ_{k=1}^K Σ_s Σ_a μ*(s) (π*(a|s) − π_k(a|s)) q^{π_k}(s, a) ]
    = O( (BA ln T)/η + η T N³ ρ / B + η³ T N⁶ ).

6. Conclusions

In this work we propose two model-free algorithms for learning infinite-horizon average-reward MDPs. They are based on different ideas: one reduces the problem to the discounted version, while the other optimizes the policy directly via a novel application of adaptive adversarial multi-armed bandit algorithms. The main open question is how to achieve the information-theoretically optimal regret bound via a model-free algorithm, if it is possible at all. We believe that the techniques we develop in this work would be useful in answering this question.

Acknowledgements

The authors would like to thank Csaba Szepesvari for pointing out the related works (Abbasi-Yadkori et al., 2019a;b), Mengxiao Zhang for helping us prove Lemma 6, Gergely Neu for clarifying the analysis in (Neu et al., 2013), and Ronan Fruit for discussions on a related open problem presented at ALT 2019. Support from NSF for MJ (award ECCS-1810447), HL (award IIS-1755781), HS (award CCF-1817212) and RJ (awards ECCS-1810447 and CCF-1817212) is gratefully acknowledged.

References

Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N., Szepesvari, C., and Weisz, G. Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pp. 3692–3702, 2019a.

Abbasi-Yadkori, Y., Lazic, N., Szepesvari, C., and Weisz, G. Exploration-enhanced politex. arXiv preprint arXiv:1908.10479, 2019b.

Agarwal, A., Luo, H., Neyshabur, B., and Schapire, R. E. Corralling a band of bandit algorithms. In Conference on Learning Theory, pp. 12–38, 2017.

Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261, 2019.

Agrawal, S. and Jia, R. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pp. 1184–1194, 2017.

Auer, P. and Ortner, R. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems, pp. 49–56, 2007.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 263–272. JMLR.org, 2017.

Bartlett, P. L. and Tewari, A. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 35–42. AUAI Press, 2009.

Bubeck, S., Li, Y., Luo, H., and Wei, C.-Y. Improved path-length regret bounds for bandits. In Conference On Learning Theory, 2019.

Chiang, C.-K., Yang, T., Lee, C.-J., Mahdavi, M., Lu, C.-J., Jin, R., and Zhu, S. Online optimization with gradual variations. In Conference on Learning Theory, pp. 6–1, 2012.
Dong, K., Wang, Y., Chen, X., and Wang, L. Q-learning with UCB exploration is sample efficient for infinite-horizon MDP. arXiv preprint arXiv:1901.09311, 2019.

Fruit, R., Pirotta, M., and Lazaric, A. Near optimal exploration-exploitation in non-communicating Markov decision processes. In Advances in Neural Information Processing Systems, pp. 2994–3004, 2018a.

Fruit, R., Pirotta, M., Lazaric, A., and Ortner, R. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In International Conference on Machine Learning, pp. 1573–1581, 2018b.

Fruit, R., Pirotta, M., and Lazaric, A. Improved analysis of UCRL2B, 2019. Available at rlgammazero.github.io/docs/ucrl2b improved.pdf.

Gopalan, A. and Mannor, S. Thompson sampling for learning parameterized Markov decision processes. In Conference on Learning Theory, pp. 861–898, 2015.

Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.

Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pp. 4863–4873, 2018.

Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, 2002.

Lattimore, T. and Szepesvári, C. Bandit algorithms. Cambridge University Press, 2018.

Levin, D. A. and Peres, Y. Markov chains and mixing times, volume 107. American Mathematical Society, 2017.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Neu, G., György, A., Szepesvári, C., and Antos, A. Online Markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59:676–691, 2013.

Ortner, R. Regret bounds for reinforcement learning via Markov chain concentration. arXiv preprint arXiv:1808.01813, 2018.

Osband, I., Van Roy, B., and Wen, Z. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.

Ouyang, Y., Gagrani, M., and Jain, R. Learning-based control of unknown linear systems with Thompson sampling. arXiv preprint arXiv:1709.04047, 2017a.

Ouyang, Y., Gagrani, M., Nayyar, A., and Jain, R. Learning unknown Markov decision processes: A Thompson sampling approach. In Advances in Neural Information Processing Systems, pp. 1333–1342, 2017b.

Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.

Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. In Conference on Learning Theory, pp. 993–1019, 2013.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.

Schwartz, A. A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the Tenth International Conference on Machine Learning, pp. 298–305, 1993.

Strehl, A. L. and Littman, M. L. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.

Strehl, A. L., Li, L., Wiewiora, E., Langford, J., and Littman, M. L. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pp. 881–888. ACM, 2006.

Talebi, M. S. and Maillard, O.-A. Variance-aware regret bounds for undiscounted reinforcement learning in MDPs. In Algorithmic Learning Theory, pp. 770–805, 2018.

Wang, M. Primal-dual π learning: Sample complexity and sublinear run time for ergodic Markov decision problems. arXiv preprint arXiv:1710.06100, 2017.

Watkins, C. J. C. H. Learning from delayed rewards. PhD thesis, King's College, Cambridge, 1989.

Wei, C.-Y. and Luo, H. More adaptive algorithms for adversarial bandits. In Conference On Learning Theory, pp. 1263–1291, 2018.

Zanette, A. and Brunskill, E. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, 2019.
It can be verified that α⁰_τ = 0 for τ ≥ 1, and we define α⁰_0 = 1. These quantities are used in the proof of Lemma 3 and have some nice properties, summarized in the following lemma.

Lemma 10 ((Jin et al., 2018)). The following properties hold for αⁱ_τ:

1. 1/√τ ≤ Σ_{i=1}^τ αⁱ_τ/√i ≤ 2/√τ for every τ ≥ 1.

2. Σ_{i=1}^τ (αⁱ_τ)² ≤ 2H/τ for every τ ≥ 1.

3. Σ_{i=1}^τ αⁱ_τ = 1 for every τ ≥ 1, and Σ_{τ=i}^∞ αⁱ_τ = 1 + 1/H for every i ≥ 1.
Proof. 1. Let π* and π_γ be the optimal policies under the undiscounted and discounted settings, respectively. By the Bellman equation, we have

    v*(s) = r(s, π*(s)) − J* + E_{s′∼p(·|s,π*(s))}[v*(s′)].

Consider a state sequence s_1, s_2, · · · generated by π*. Then, by the sub-optimality of π* for the discounted setting, we have

    V*(s_1) ≥ E[ Σ_{t=1}^∞ γ^{t−1} r(s_t, π*(s_t)) | s_1 ]
            = E[ Σ_{t=1}^∞ γ^{t−1} (J* + v*(s_t) − v*(s_{t+1})) | s_1 ]
            = J*/(1 − γ) + v*(s_1) − E[ Σ_{t=2}^∞ (γ^{t−2} − γ^{t−1}) v*(s_t) | s_1 ]
            ≥ J*/(1 − γ) + min_s v*(s) − max_s v*(s) Σ_{t=2}^∞ (γ^{t−2} − γ^{t−1})
            = J*/(1 − γ) − sp(v*),

where the first equality is by the Bellman equation for the undiscounted setting.

Similarly, for the other direction, let s_1, s_2, · · · be generated by π_γ. We have

    V*(s_1) = E[ Σ_{t=1}^∞ γ^{t−1} r(s_t, π_γ(s_t)) | s_1 ]
            ≤ E[ Σ_{t=1}^∞ γ^{t−1} (J* + v*(s_t) − v*(s_{t+1})) | s_1 ]
            = J*/(1 − γ) + v*(s_1) − E[ Σ_{t=2}^∞ (γ^{t−2} − γ^{t−1}) v*(s_t) | s_1 ]
            ≤ J*/(1 − γ) + max_s v*(s) − min_s v*(s) Σ_{t=2}^∞ (γ^{t−2} − γ^{t−1})
            = J*/(1 − γ) + sp(v*).

Combining the two directions, for any two states s_1 and s_2,

    |V*(s_1) − V*(s_2)| ≤ |V*(s_1) − J*/(1 − γ)| + |V*(s_2) − J*/(1 − γ)| ≤ 2 sp(v*).
Proof. We condition on the statement of Lemma 12, which happens with probability at least 1 − δ. Let nt ≥ 1 denote
nt+1 (st , at ), that is, the total number of visits to the state-action pair (st , at ) for the first t rounds (including round t). Also
let ti (s, a) denote the timestep at which (s, a) is visited the i-th time. Recalling the definition of αint in Eq. (14), we have
T
X XT
V̂t (st ) − V ∗ (st ) + (V ∗ (st ) − Q∗ (st , at )) (15)
t=1 t=1
T
X
= Q̂t (st , at ) − Q∗ (st , at ) (because at = argmaxa Q̂t (st , a))
t=1
T
X XT
= Q̂t+1 (st , at ) − Q∗ (st , at ) + Q̂t (st , at ) − Q̂t+1 (st , at ) (16)
t=1 t=1
T r T X nt h i
X H 2T X
∗
≤ 12 sp(v ) ln +γ αint V̂ti (st ,at ) (sti (st ,at )+1 ) − V ∗ (sti (st ,at )+1 ) + SAH. (17)
t=1
nt δ t=1 i=1
Here, we apply Lemma 12 to bound the first term of Eq .(16) (note α0nt = 0 by definition since nt ≥ 1), and also bound
the second term of Eq .(16) by SAH since for each fixed (s, a), Q̂t (s, a) is non-increasing in t and overall cannot decrease
by more than H (the initial value).
X nT +1
X (s,a) nT +1 (s,a)
X h i
γ αij V̂ti (s,a) (sti (s,a)+1 ) − V ∗ (sti (s,a)+1 )
s,a i=1 j=i
nT +1 (s,a) h i nT +1 (s,a)
X X X
=γ V̂ti (s,a) (sti (s,a)+1 ) − V ∗ (sti (s,a)+1 ) αij
s,a i=1 j=i
PnT +1 (s,a) i P∞ i 1
Now, we can upper bound j=i αj by j=i αj where the latter is equal to 1 + H by Lemma 10. Since
∗
V̂ti (s,a) (sti (s,a)+1 ) − V (sti (s,a)+1 ) ≥ 0 (by Lemma 12), we can write:
X nT +1
X (s,a) h i nT +1
X (s,a)
γ V̂ti (s,a) (sti (s,a)+1 ) − V ∗ (sti (s,a)+1 ) αij
s,a i=1 j=i
nT +1 (s,a) h iX
∞
X X
≤γ V̂ti (s,a) (sti (s,a)+1 ) − V ∗ (sti (s,a)+1 ) αij
s,a i=1 j=i
X nT +1 (s,a) h
X i 1
∗
=γ V̂ti (s,a) (sti (s,a)+1 ) − V (sti (s,a)+1 ) 1 +
s,a i=1
H
X T h i
1
= 1+ γ V̂t (st+1 ) − V ∗ (st+1 )
H t=1
X T h i T
1 1 Xh i
= 1+ γ V̂t+1 (st+1 ) − V ∗ (st+1 ) + 1 + V̂t (st+1 ) − V̂t+1 (st+1 )
H t=1
H t=1
T
X +1 h i 1
≤ V̂t (st ) − V ∗ (st ) + 1 + SH.
t=2
H
The last inequality is because 1 + H1 γ ≤ 1 and that for any state s, V̂t (s) ≥ V̂t+1 (s) and the value can decrease by at
most H (the initial value). Substituting in Eq. (17) and telescoping with the left hand side, we have
T r
H 2T
X T X
∗ ∗ ∗ ∗ 1
(V (st ) − Q (st , at )) ≤ 12 sp(v ) ln + V̂T +1 (sT +1 ) − V (sT +1 ) + 1 + SH + SAH
t=1 t=1
nt δ H
XT r
H 2T
≤ 12 sp(v ∗ ) ln + 4SAH.
t=1
n t δ
PT √
Moreover, √1 ≤ 2 SAT because
t=1 nt
T
X X X 1[st =s,at =a]
T X nT +1 (s,a)
X X p s X √
1 1
p = p = √ ≤ 2 nT +1 (s, a) ≤ 2 SA nT +1 (s, a) = 2 SAT ,
t=1 nt+1 (st , at ) t=1 s,a nt+1 (s, a) s,a j=1
j s,a s,a
where the last inequality is by Cauchy-Schwarz inequality. This finishes the proof.
Lemma 12. With probability at least 1 − δ, for any t = 1, . . . , T and state-action pair (s, a), the following holds
τ h i r
X H 2T
∗
0 ≤ Q̂t+1 (s, a) − Q (s, a) ≤ Hα0τ +γ αiτ ∗ ∗
V̂ti (sti +1 ) − V (sti +1 ) + 12 sp(v ) ln ,
i=1
τ δ
where τ = nt+1 (s, a) (i.e., the total number of visits to (s, a) for the first t timesteps), αiτ is defined by (14), and
t1 , . . . , tτ ≤ t are the timesteps on which (s, a) is taken.
Pτ
Moreover, since i=1 αiτ = 1 (Lemma 10), By Bellman equation we have
τ
X
Q∗ (s, a) = α0τ Q∗ (s, a) + αiτ r(s, a) + γEs′ ∼p(·|s,a) V ∗ (s′ ) .
i=1
Pτ
Taking their difference and adding and subtracting a term γ i=1 αiτ V ∗ (sti +1 ) lead to:
τ
X h i
Qt+1 (s, a) − Q∗ (s, a) = α0τ (H − Q∗ (s, a)) + γ αiτ V̂ti (sti +1 ) − V ∗ (sti +1 )
i=1
τ
X τ
X
+γ αiτ V ∗ (sti +1 ) − Es′ ∼p(·|s,a) V ∗ (s′ ) + αiτ bi .
i=1 i=1
P∞ 1
The first term is upper bounded by α0τ H clearly and lower bounded by 0 since Q∗ (s, a) ≤ i=0 γi = 1−γ = H.
The third term is a martingale difference sequence with each term bounded in [−γαiτ sp(Vq ∗
), γαiτ sp(V ∗ )]. There-
Pτ
fore, by Azuma’s inequality (Lemma 11), its absolute value is bounded by γ sp(V ∗ ) 2 i=1 (αiτ )2 ln 2T δ ≤
q q
2γ sp(V ∗ ) H 2T
τ ln δ ≤ 4γ sp(v )
∗ H 2T δ
τ ln δ with probability at least 1 − T , where the first inequality is by Lemma
10 and the last inequality is by Lemma 2. Note that when t varies from 1 to T and (s, a) varies over all possible state-action
pairs, the third term only takes T different forms. Therefore,
q by taking a union bound over these T events, we have: with
probability 1 − δ, the third term is bounded by 4γ sp(v ∗ ) H 2T
τ ln δ in absolute value for all t and (s, a).
q q
The forth term is lower bounded by 4 sp(v ∗ ) H 2T
τ ln δ and upper bounded by 8 sp(V )
∗ H 2T
τ ln δ , by Lemma 10.
n o
Combining all aforementioned upper bounds and the fact Q̂t+1 (s, a) = min Q̂t (s, a), Qt+1 (s, a) ≤ Qt+1 (s, a) we
prove the upperh bound in the lemma statement. To prove thei lower bound, further note that the second term can be written
Pτ i ∗
as γ i=1 ατ maxa Q̂ti (sti +1 , a) − maxa Q (sti +1 , a) . Using a direct induction with all aforementioned lower bounds
n o
and the fact Q̂t+1 (s, a) = min Q̂t (s, a), Qt+1 (s, a) we prove the lower bound in the lemma statement as well.
T r
X 1
∗ ∗ ∗
(Q (st , at ) − γV (st ) − r(st , at )) ≤ 2 sp(v ) 2T ln + 2 sp(v ∗ ).
t=1
δ
Proof. By Bellman equation for the discounted problem, we have Q∗ (st , at ) − γV ∗ (st ) − r(st , at ) =
∗ ′ ∗ ∗
γ Es′ ∼p(·|st ,at ) [V (s )] − V (st ) . Adding and subtracting V (st+1 ) and summing over t we will get
T
X T
X T
X
(Q∗ (st , at ) − γV ∗ (st ) − r(st , at )) = γ Es′ ∼p(·|st ,at ) [V ∗ (s′ )] − V ∗ (st+1 ) + γ (V ∗ (st+1 ) − V ∗ (st ))
t=1 t=1 t=1
The summands of the first term on the right hand side constitute a martingale difference sequence. Thus, byqAzuma’s
inequality (Lemma 11) and the fact that sp(V ∗ ) ≤ 2 sp(v ∗ ) (Lemma 2), this term is upper bounded by 2γ sp(v ∗ ) 2T ln δ1 ,
with probability at least 1 − δ. The second term is equal to γ(V ∗ (sT +1 ) − V ∗ (s1 )) which is upper bounded by 2γ sp(v ∗ ).
Recalling γ < 1 completes the proof.
Proof. Lemma 13 implies that for any ǫ ∈ (0, 1/2], as long as t ≥ ⌈log₂(1/ǫ)⌉ t_mix, we have

    ‖(P^π)^t(s, ·) − μ^π‖₁ ≤ ǫ.

This condition can be satisfied by picking log₂(1/ǫ) = t/t_mix − 1, which leads to ǫ = 2 · 2^{−t/t_mix}.

Corollary 13.2. Let N = 4 t_mix log₂ T. For an ergodic MDP with mixing time t_mix < T/4, we have for all π:

    Σ_{t=N}^∞ ‖(P^π)^t(s, ·) − μ^π‖₁ ≤ 1/T³.

Lemma 14 (Stated in (Wang, 2017) without proof). For an ergodic MDP with mixing time t_mix, and any π, s, a,

    |v^π(s)| ≤ 5 t_mix,
    |q^π(s, a)| ≤ 6 t_mix.
Proof. By Eq. (3),

    |v^π(s)| = | Σ_{t=0}^∞ (e_s^⊤(P^π)^t − (μ^π)^⊤) r^π |
             ≤ Σ_{t=0}^{2t_mix−1} ‖(P^π)^t(s, ·) − μ^π‖₁ + Σ_{i=2}^∞ Σ_{t=i·t_mix}^{(i+1)t_mix−1} ‖(P^π)^t(s, ·) − μ^π‖₁
             ≤ 4 t_mix + Σ_{i=2}^∞ 2 · 2^{−i} t_mix    (by ‖(P^π)^t(s, ·) − μ^π‖₁ ≤ 2 and Corollary 13.1)
             ≤ 5 t_mix,

and thus |q^π(s, a)| = |r(s, a) + E_{s′∼p(·|s,a)}[v^π(s′)]| ≤ 1 + 5 t_mix ≤ 6 t_mix.
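The mixing behavior used above can also be checked numerically for a fixed policy. The sketch below uses the common 1/4 threshold on the ℓ₁ distance to the stationary distribution, which may differ from the exact convention of Definition 5.1; the function name and inputs are illustrative.

```python
import numpy as np

def empirical_mixing_time(P_pi, mu_pi, s, horizon=10_000):
    """Smallest t with ||(P^pi)^t(s, .) - mu_pi||_1 <= 1/4, starting from state s.

    P_pi: (S, S) transition matrix of a fixed policy pi;
    mu_pi: its stationary distribution (a length-S probability vector).
    """
    dist = np.zeros(P_pi.shape[0])
    dist[s] = 1.0                      # start deterministically at state s
    for t in range(1, horizon + 1):
        dist = dist @ P_pi             # one-step evolution: row of (P^pi)^t
        if np.abs(dist - mu_pi).sum() <= 0.25:
            return t
    return None                        # did not mix within `horizon` steps
```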
Lemma 15 ((Neu et al., 2013, Lemma 2)). For any two policies π, π̃,

    J^{π̃} − J^π = Σ_s Σ_a μ^{π̃}(s) (π̃(a|s) − π(a|s)) q^π(s, a).
Since each βbk,i (s, a) is constructed by a length-(N + 1) trajectory starting from s at time τi ≤ kB − N , we can calculate
its conditional expectation as follows:
h i
E βbk,i (s, a) sτi = s
hP i
τi +N
r(s, a) + E t=τi +1 r(s ,
t ta ) (s τi , a τi ) = (s, a)
= Pr[aτi = a | sτi = s] ×
π(a|s)
" τ +N #
X X
i
X N
X −1
= r(s, a) + p(s′ |s, a) e⊤ π j π
s′ (P ) r
s′ j=0
X N
X −1
= r(s, a) + p(s′ |s, a) (e⊤ π j π ⊤ π
s′ (P ) − (µ ) )r + N J
π
(because µπ⊤ rπ = J π )
s′ j=0
X X ∞
X
= r(s, a) + p(s′ |s, a)v π (s′ ) + N J π − p(s′ |s, a) (e⊤ π j π ⊤ π
s′ (P ) − (µ ) )r (By Eq. (3))
s′ s′ j=N
= q π (s, a) + N J π − δ(s, a)
= β π (s, a) − δ(s, a), (19)
P P∞
where δ(s, a) , s′ p(s′ |s, a) ⊤ π j
j=N (es′ (P ) − (µπ )⊤ )rπ . By Corollary 13.2,
1
|δ(s, a)| ≤ . (20)
T3
Thus,
h i 1
E βbk,i (s, a) sτi = s − β π (s, a) ≤ 3 .
T
This shows that βbk,i (s, a) is an almost unbiased estimator for β π conditioned on all history before τi . Also, by our selection
of the episode length, M > 0 will happen with very high probability according to Lemma 16. These facts seem to indicate
that βbk (s, a) – an average of several βbk,i (s, a) – will also be an almost unbiased estimator for β π (s, a) with small error.
However, a caveat here is that the quantity M in Eq.(18) is random, and it is not independent from the reward estimators
PM b b π
i=1 βk,i (s, a). Therefore, to argue that the expectation of E[βk (s, a)] is close to β (s, a), more technical work is needed.
Specifically, we use the following two steps to argue that E[βbk (s, a)] is close to β π (s, a).
Step 1. Construct an imaginary world where βbk (s, a) is an almost unbiased estimator of β π (s, a).
Step 2. Argue that the expectation of βbk (s, a) in the real world and the expectation of βbk (s, a) in the imaginary world are
close.
Figure 1. An illustration for the sub-algorithm ESTIMATEQ with target state s (best viewed in color). The red round points indicate that the algorithm "starts to wait" for a visit to s. When the algorithm reaches s (the blue stars) at time τ_i, it starts to record the sum of rewards in the following N + 1 steps, i.e., Σ_{t=τ_i}^{τ_i+N} r(s_t, a_t). This is used to construct β̂_{k,i}(s, ·). The next point at which the algorithm "starts to wait for s" would be τ_i + 2N, if this is still no later than kB − N.
Step 1. We first examine what E STIMATE Q sub-algorithm does in an episode k for a state s. The goal of this sub-
algorithm is to collect disjoint intervals of length N + 1 that start from s, calculate a reward estimator from each of them,
and finally average the estimators over all intervals to get a good estimator for β π (s, ·). However, after our algorithm
collects an interval [τ, τ + N ], it rests for another N steps before starting to find the next visit to s – i.e., it restarts from
τ + 2N (see Line 6 in E STIMATE Q (Algorithm 3), and also the illustration in Figure 1).
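To make the interval collection concrete, here is a minimal offline sketch of the idea (the actual sub-routine runs online within the episode; the trajectory representation, the helper name, and the requirement N ≥ 1 are assumptions of this sketch):

```python
import numpy as np

def estimate_q(trajectory, policy_probs, s, N):
    """Offline sketch of ESTIMATEQ for one target state s within one episode.

    `trajectory`: list of (state, action, reward) tuples for the episode (length B).
    `policy_probs[a]` = pi_k(a|s).  Assumes N >= 1.
    Returns an estimate of beta_k(s, .) = q^{pi_k}(s, .) + N * J^{pi_k}.
    """
    A = len(policy_probs)
    est = np.zeros(A)
    B = len(trajectory)
    t, M = 0, 0
    while t <= B - (N + 1):
        state, action, _ = trajectory[t]
        if state == s:
            # total reward of the N+1 steps starting at the visit to s
            R = sum(rew for (_, _, rew) in trajectory[t:t + N + 1])
            est[action] += R / policy_probs[action]   # inverse-propensity scoring
            M += 1
            t += 2 * N   # rest N extra steps before waiting for the next visit to s
        else:
            t += 1
    return est / M if M > 0 else est   # average over the M disjoint intervals
```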
The goal of doing this is to de-correlate the observed reward and the number of collected intervals: as shown in Eq.(18),
these two quantities affect the numerator and the denominator of βbk (s, ·) respectively, and if they are highly correlated,
then βbk (s, ·) may be heavily biased from β π (s, ·). On the other hand, if we introduce the “rest time” after we collect each
interval (i.e., the dashed segments in Figure 1), then since the length of the rest time (N ) is longer than the mixing time,
the process will almost totally “forget” about the reward estimators collected before. In Figure 1, this means that the state
distributions at the red round points (except for the left most one) will be close to µπ when conditioned on all history that
happened N rounds ago.
We first argue that if the process can indeed “reset its memory” at those red round points in Figure 1 (except for the left most
one), then we get almost unbiased estimators for β π (s, ·). That is, consider a process like in Figure 2 where everything
remains same as in E STIMATE Q except that after every rest interval, the state distribution is directly reset to the stationary
distribution µπ .
Figure 2. An illustration of the imaginary process: identical to Figure 1, except that after every rest interval the state distribution is reset to the stationary distribution μ^π.
Below we calculate the expectation of βbk (s, a) in this imaginary world. As specified in Figure 2, we use τi to denote
the i-th time E STIMATE Q starts to record an interval (therefore sτi = s), and let wi = τi − (τi−1 + 2N ) for i > 1 and
w1 = τ1 − ((k − 1)B + 1) be the “wait time” before starting the i-th interval. Note the following facts in the imaginary
world:
1. M is determined by the sequence w1 , w2 , . . . because all other segments in the figures have fixed length.
2. w1 only depends on s(k−1)B+1 and P π , and wi only depends on the stationary distribution µπ and P π because of the
reset.
The above facts imply that in the imaginary world, w1 , w2 , . . ., as well as M , are all independent from
βbk,1 (s, a), βbk,2 (s, a), . . .. Let E′ denote the expectation in the imaginary world. Then
" #
h i 1 X ′ hb
M i
E′ βbk (s, a) = Pr[w1 ≤ B − N ] × E′{wi } E βk,i (s, a) {wi } w1 ≤ B − N + Pr[w1 > B − N ] × 0
M i=1
" M
!#
′ 1 X π
= Pr[w1 ≤ B − N ] × E{wi } β (s, a) − δ(s, a) (by the same calculation as in (19))
M i=1
= Pr[w1 ≤ B − N ] × (β π (s, a) − δ(s, a))
= β π (s, a) − δ ′ (s, a), (21)
where E′{wi } denotes the expectation over the randomness of w1 , w2 , . . ., and δ ′ (s, a) = (1 − Pr[w1 ≤ B −
B−N
N
N ]) (β π (s, a) − δ(s, a)) + δ(s, a). By Lemma 16, we have Pr[w1 ≤ B − N ] ≥ 1 − 1 − 4t3hit = 1−
4thit log2 T −1
1 − 4t3hit ≥ 1 − T13 . Together with Eq. (20) and Lemma 14, we have
1 1 1 1 1
|δ (s, a)| ≤ 3 (|β π (s, a)| + |δ(s, a)|) + |δ(s, a)| ≤ 3 (6tmix + N + 3 ) + 3 = O
′
,
T T T T T2
and thus
h i 1
′ b π
E βk (s, a) − β (s, a) = O . (22)
T2
where we use P and P′ to denote the probability mass function in the real world and the imaginary world respectively, and
in the last inequality we use the non-negativeness of f (X).
For a fixed sequence of X, the probability of generating X in the real world is
P(X) = P(τ1 ) × P(T1 |τ1 ) × P(τ2 |τ1 , T1 ) × P(T2 |τ2 ) × · · · × P(τM |τM−1 , TM−1 )
h i
× P(TM |τM ) × Pr st 6= s, ∀t ∈ [τM + 2N, kB − N ] τM , TM . (24)
P′ (X) = P(τ1 ) × P(T1 |τ1 ) × P′ (τ2 |τ1 , T1 ) × P(T2 |τ2 ) × · · · × P′ (τM |τM−1 , TM−1 )
h i
× P(TM |τM ) × Pr st 6= s, ∀t ∈ [τM + 2N, kB − N ] τM , TM . (25)
Their difference only comes from P(τi+1 |τi , Ti ) 6= P′ (τi+1 |τi , Ti ) because of the reset. Note that
X h i
P(τi+1 |τi , Ti ) = P(sτi +2N = s′ |τi , Ti ) × Pr st 6= s, ∀t ∈ [τi + 2N, τi+1 − 1], sτi+1 = s sτi +2N = s′ , (26)
s′ 6=s
X h i
P′ (τi+1 |τi , Ti ) = P′ (sτi +2N = s′ |τi , Ti ) × Pr st 6= s, ∀t ∈ [τi + 2N, τi+1 − 1], sτi+1 = s sτi +2N = s′ . (27)
s′ 6=s
Because of the reset in the imaginary world, P′ (sτi +2N = s′ |τi , Ti ) = µπ (s′ ) for all s′ ; in the real world, since at time
τi + 2N , the process has proceeded N steps from τi + N (the last step of Ti ), by Corollary 13.1 we have
2 2 1 1
E[βbk (s, a)] ≤ ′ b
1 + 2 E [βk (s, a)] ≤ 1 + 2 βk (s, a) + O 2
≤ βk (s, a) + O .
T T T T
Similarly we can prove the other direction: βk (s, a) ≤ E[βbk (s, a)] + O 1
T , finishing the proof.
Proof for Eq.(12). We use the same notations, and the similar approach as in the previous proof for Eq. (11). That is,
we first bound the expectation of the desired quantity in the imaginary world, and then argue that the expectation in the
imaginary world and that in the real world are close.
Step 1. Define ∆i = βbk,i (s, a) − β π (s, a) + δ(s, a). Then E′ [∆i | {wi }] = 0 by Eq.(19). Thus in the imaginary world,
2
E′ βbk (s, a) − β π (s, a))
!
1 XM 2
= E′ βbk,i (s, a) − β π (s, a) 1[M > 0] + β π (s, a)2 1[M = 0]
M i=1
!2
XM
1
= E′ ∆i − δ(s, a) 1[M > 0] + β π (s, a)2 1[M = 0]
M i=1
!2
M
X
1
≤ E′ 2 ∆i + 2δ(s, a)2 1[M > 0] + β π (s, a)2 1[M = 0] (using (a − b)2 ≤ 2a2 + 2b2 )
M i=1
!2
XM
1
≤ Pr[w1 ≤ B − N ] × E′{wi } E′ 2 ∆i + 2δ(s, a)2 {wi } w1 ≤ B − N + Pr[w1 > B − N ] × (N + 6tmix )2
M i=1
(β π (s, a) ≤ N + 6tmix by Lemma 14)
!2
M
E′ 2 1 X 1
≤ E′{wi } ∆i {wi } w1 ≤ B − N + O
M i=1 T
B−N
N
3 1
(using Lemma 16: Pr[w1 > B − N ] ≤ 1 − 4thit ≤ T 3 .)
" M
#
2 X ′ 2 1
≤ E′{wi } E ∆i {wi } w1 ≤ B − N + O
M 2 i=1 T
(∆i is zero-mean and independent of each other conditioned on {wi })
" #
2
2 O(N ) 1
≤ E′{wi } 2
·M × w1 ≤ B − N + O
M π(a|s) T
2
O(N 2 )
O(N )
(E′ [∆2i ] ≤ π(a|s) π(a|s) 2 = π(a|s) by definition of βbk (s, a), Lemma 14, and Eq. (20))
O(N 2 ) ′ 1 1
≤ E w1 ≤ B − N + O . (28)
π(a|s) M T
Since Pr′ [M = 0] ≤ 1
T3 by Lemma 16, we have Pr′ [w1 ≤ B − N ] = Pr′ [M > 0] ≥ 1 − 1
T3 . Also note that if
B−N
M < M0 := 4N log T
,
2N + µπ (s)
4N log T
then there exists at least one waiting interval (i.e., wi ) longer than µπ (s) (see Figure 1 or 2) . By Lemma 16, this happens
π
4µlog T
π (s)
with probability smaller than 1 − 3µ 4(s) ≤ T13 .
Therefore,
P∞
′ 1 1
m=1 m Pr′ [M = m] 1 × Pr′ [M < M0 ] + M10 × Pr′ [M ≥ M0 ]
E M >0 = ′ ≤
M Pr [M > 0] Pr′ [M > 0]
4N log T
2N + µπ (s)
1 × T13 + B−N N log T
≤ ≤O .
1 − T13 Bµπ (s)
Step 2. By the same argument as in the “Step 2” of the previous proof for Eq. (11), we have
2 2
b π 2 ′ b π N 3 log T
E βk (s, a) − β (s, a)) ≤ 1+ 2 E βk (s, a) − β (s, a)) ≤O ,
T Bπ(a|s)µπ (s)
XX
|J πk − J πk−1 | = µπk (s) (πk (a|s) − πk−1 (a|s)) q πk−1 (s, a) (By Lemma 15)
s a
XX
≤ µπk (s) |(πk (a|s) − πk−1 (a|s))| |q πk−1 (s, a)|
s a
!
XX
πk
=O µ (s)N ηπk−1 (a|s)tmix (By Eq. (13) and Lemma 14)
s a
= O (ηtmix N ) = O(ηN 2 ). (29)
Next, to prove a bound on |v πk (s) − v πk−1 (s)|, first note that for any policy π,
∞
X π
v π (s) = e⊤ π n π ⊤
s (P ) − (µ ) r (By Eq. (3))
n=0
N
X −1 ∞
π X π
= e⊤
s (P π n
) − (µ π ⊤
) r + e⊤ π n π ⊤
s (P ) − (µ ) r
n=0 n=N
N
X −1
= e⊤ π n π π π
s (P ) r − N J + error (s), (J π = (µπ )⊤ rπ )
n=0
P∞ ⊤ 1
where errorπ (s) := n=N e⊤ π n
s (P ) − µ
π
rπ . By Corollary 13.2, |errorπ (s)| ≤ T2 . Thus
N
X −1 N
X −1
2
|v πk (s) − v πk−1 (s)| = e⊤
s ((P
πk n
) − (P πk−1 )n ) rπk + e⊤
s (P
πk−1 n πk
) (r − rπk−1 ) − N J πk + N J πk−1 +
n=0 n=0
T2
N
X −1 N
X −1
2
≤ k((P πk )n − (P πk−1 )n ) rπk k∞ + krπk − rπk−1 k∞ + N |J πk − J πk−1 | + . (30)
n=0 n=0
T2
Below we bound each individual term above (using notation π ′ := πk , π := πk−1 , P ′ := P πk , P := P πk−1 , r′ :=
rπk , r := rπk−1 , µ := µπk−1 for simplicity). The first term can be bounded as
k(P ′n − P n )r′ k∞
= k P ′ (P ′n−1 − P n−1 ) + (P ′ − P )P n−1 r′ k∞
≤ kP ′ (P ′n−1 − P n−1 )r′ k∞ + k(P ′ − P )P n−1 r′ k∞
≤ k(P ′n−1 − P n−1 )r′ k∞ + k(P ′ − P )P n−1 r′ k∞ (because every row of P ′ sums to 1)
= k(P ′n−1 − P n−1 )r′ k∞ + max e⊤ ′
s (P − P )P
n−1 ′
r
s
′n−1 n−1
≤ k(P −P )r k∞ + max ke⊤
′ ′
s (P − P )P
n−1
k1 ,
s
The second term in Eq. (30) can be bounded as (by Eq. (13) again)
N −1 N −1 N −1
!
X X X X X
′ ′
kr − rk∞ = max (π (a|s) − π(a|s))r(s, a) ≤ O max ηN π(a|s) = O ηN 2 ,
s s
n=0 n=0 a n=0 a
πk πk−1
and the third term in Eq. (30) is bounded via the earlier proof (for bounding |J −J |):
N |J πk − J πk−1 | = O ηN 3 . (Eq.(29))
Plugging everything into Eq.(30), we prove |v πk (s) − v πk−1 (s)| = O ηN 3 .
Finally, it is straightforward to prove the rest of the two statements:
|q πk (s, a) − q πk−1 (s, a)| = r(s, a) + Es′ ∼p(·|s,a) [v πk (s′ )] − r(s, a) − Es′ ∼p(·|s,a) [v πk−1 (s′ )]
= Es′ ∼p(·|s,a) [v πk (s′ ) − v πk−1 (s′ )] = O ηN 3 .
|β πk (s, a) − β πk−1 (s, a)| ≤ |q πk (s, a) − q πk−1 (s, a)| + N |J πk − J πk−1 | = O ηN 3 .
This completes the proof.
C. Analyzing Optimistic Online Mirror Descent with Log-barrier Regularizer — Proofs for
Eq.(13), Lemma 8, and Lemma 9
In this section, we derive the stability property (Eq.(13)) and the regret bound (Lemma 8 and Lemma 9) for optimistic online
mirror descent with the log-barrier regularizer. Most of the analysis is similar to that in (Wei & Luo, 2018; Bubeck et al.,
2019). Since in our MDP-OOMD algorithm, we run optimistic online mirror descent independently on each state, the
analysis in this section only focuses on a specific state s. We simplify our notations using πk (·) := πk (·|s), πk′ (·) :=
πk′ (·|s), βbk (·) := βbk (s, ·) throughout the whole section.
Our MDP-OOMD algorithm is effectively running Algorithm 5 on each state. We first verify that the condition in Line 1 of
Algorithm 5 indeed holds in our MDP-OOMD algorithm. Recall that in ESTIMATEQ (Algorithm 3) we collect trajectories in every episode for every state. Suppose for episode k and state s it collects M trajectories that start from times τ_1, . . . , τ_M and have total rewards R_1, . . . , R_M, respectively. Let m_a = Σ_{i=1}^M 1[a_{τ_i} = a]; then we have Σ_a m_a = M. By our way of constructing β̂_k(s, ·), we have

    β̂_k(s, a) = Σ_{i=1}^M R_i 1[a_{τ_i} = a] / (M π_k(a|s))

when M > 0. Thus we have Σ_a π_k(a|s) β̂_k(s, a) = Σ_a Σ_{i=1}^M R_i 1[a_{τ_i} = a]/M = Σ_{i=1}^M R_i/M ≤ N + 1, because every R_i is the total reward of an interval of length N + 1. This verifies the condition in Line 1 for the case M > 0. When M = 0, ESTIMATEQ sets β̂(s, ·) to zero, so the condition clearly still holds.
Algorithm 5 (Optimistic Online Mirror Descent with log-barrier, run on each state)
Initialization: π′_1 = π_1 = (1/A)·1
for k = 1, . . . , K do
  1. Receive β̂_k ∈ R^A_+ for which Σ_a π_k(a) β̂_k(a) ≤ C.
  2. Update
       π′_{k+1} = argmax_{π∈∆_A} { ⟨π, β̂_k⟩ − D_ψ(π, π′_k) }
       π_{k+1} = argmax_{π∈∆_A} { ⟨π, β̂_k⟩ − D_ψ(π, π′_{k+1}) }
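For illustration, each of the two updates above can be computed by solving the first-order optimality condition of the mirror-descent step, assuming ψ is the log-barrier regularizer ψ(π) = (1/η) Σ_a ln(1/π(a)) as in (Wei & Luo, 2018); the normalization multiplier is found by binary search. This is a sketch with hypothetical function names, not the authors' implementation.

```python
import numpy as np

def log_barrier_omd_step(pi_old, beta, eta, tol=1e-12, max_iter=200):
    """One step argmax_{pi in simplex} <pi, beta> - D_psi(pi, pi_old) for the
    log-barrier psi(pi) = (1/eta) * sum_a log(1/pi_a).

    First-order condition: pi_a = 1 / (1/pi_old_a - eta*(beta_a - lam)),
    with the multiplier lam chosen so that pi sums to 1.
    """
    A = len(pi_old)
    inv_old = 1.0 / pi_old
    lo = np.max(beta - inv_old / eta)   # lam must exceed this to keep pi positive
    hi = lo + A / eta                   # at this lam the resulting pi sums to <= 1
    lam = hi
    for _ in range(max_iter):
        lam = 0.5 * (lo + hi)
        pi = 1.0 / (inv_old - eta * (beta - lam))
        s = pi.sum()
        if abs(s - 1.0) < tol:
            break
        if s > 1.0:
            lo = lam                    # larger multiplier shrinks every coordinate
        else:
            hi = lam
    pi = 1.0 / (inv_old - eta * (beta - lam))
    return pi / pi.sum()                # final renormalization for numerical safety


def oomd_update(pi_prime, beta_hat, eta):
    """The two updates of Algorithm 5: advance the base iterate pi'_{k+1}, then
    compute the played (optimistic) iterate pi_{k+1} from it, both using the same
    estimator beta_hat (which serves as the optimistic prediction)."""
    pi_prime_next = log_barrier_omd_step(pi_prime, beta_hat, eta)
    pi_next = log_barrier_omd_step(pi_prime_next, beta_hat, eta)
    return pi_prime_next, pi_next
```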
To prove this lemma we make use of the following auxiliary result, where we use the notation ‖a‖_M = √(a^⊤ M a) for a vector a ∈ R^A and a positive semi-definite matrix M ∈ R^{A×A}.

Lemma 18. For arbitrary b_1, b_2 ∈ R^A and a_0 ∈ ∆_A, with η ≤ 1/(270C), define

    a_1 = argmin_{a∈∆_A} F_1(a),  where F_1(a) := ⟨a, b_1⟩ + D_ψ(a, a_0),
    a_2 = argmin_{a∈∆_A} F_2(a),  where F_2(a) := ⟨a, b_2⟩ + D_ψ(a, a_0)

(ψ and D_ψ are defined in Algorithm 5). Then as long as ‖b_1 − b_2‖_{∇^{−2}ψ(a_1)} ≤ 12√η C, we have |a_{2,i} − a_{1,i}| ≤ 60ηC a_{1,i} for all i ∈ [A].
√ √
Proof of Lemma 18. First, we prove ka1 − a2 k∇2 ψ(a1 ) ≤ 60 ηC by contradiction. Assume ka1 − a2 k∇2 ψ(a1 ) > 60 ηC.
√
Then there exists some a′2 lying in the line segment between a1 and a2 such that ka1 − a′2 k∇2 ψ(a1 ) = 60 ηC. By Taylor’s
theorem, there exists a that lies in the line segment between a1 and a′2 such that
1
F2 (a′2 ) = F2 (a1 ) + h∇F2 (a1 ), a′2 − a1 i + ka′2 − a1 k2∇2 F2 (a)
2
1
= F2 (a1 ) + hb2 − b1 , a′2 − a1 i + h∇F1 (a1 ), a′2 − a1 i + ka′2 − a1 k2∇2 ψ(a)
2
1
≥ F2 (a1 ) − kb2 − b1 k∇−2 ψ(a1 ) ka′2 − a1 k∇2 ψ(a1 ) + ka′2 − a1 k2∇2 ψ(a)
2
√ √ 1 ′
≥ F2 (a1 ) − 12 ηC × 60 ηC + ka2 − a1 k2∇2 ψ(a) (31)
2
where in the first inequality we use Hölder inequality and the first-order optimality condition h∇F1 (a1 ), a′2 − a1 i ≥ 0, and
√ √
in the last inequality we use the conditions kb1 − b2 k∇−2 ψ(a1 ) ≤ 12 ηC and ka1 − a′2 k∇2 ψ(a1 ) = 60 ηC. Note that
∇2 ψ(x) is a diagonal matrix and ∇2 ψ(x)ii = η1 x12 . Therefore for any i ∈ [A],
i
v
u A
√ uX (a′2,j − a1,j )2 |a′2,i − a1,i |
60 ηC = ka2 − a1 k∇2 ψ(a1 ) = t
′
≥ √
j=1
ηa21,j ηa1,i
|a′2,i −a1,i |
n a′ o
2,i a1,i
and thus a1,i ≤ 60ηC ≤ 29 , which implies max a1,i , a′2,i ≤ 97 . Thus the last term in (31) can be lower bounded
by
A 2 A
1X 1 ′ 1 7 X 1
ka′2 − a1 k2∇2 ψ(a) = (a − a 1,i )2
≥ ′
2 (a2,i − a1,i )
2
η i=1 a2i 2,i η 9 a
i=1 1,i
√ 2
′ 2
≥ 0.6ka2 − a1 k∇2 ψ(a1 ) = 0.6 × (60 ηC) = 2160ηC 2 .
1
F2 (a′2 ) ≥ F2 (a1 ) − 720ηC 2 + × 2160ηC 2 > F2 (a1 ).
2
Recall that a′2 is a point in the line segment between a1 and a2 . By the convexity of F2 , the above inequality implies
F2 (a1 ) < F2 (a2 ), contradicting the optimality of a2 .
r
√ PA (a1,j −a2,j )2 |a2,i −a1,i |
Thus we conclude ka1 − a2 k∇2 ψ(a1 ) ≤ 60 ηC. Since ka1 − a2 k∇2 ψ(a1 ) = j=1 ηa21,j
≥ √ ηa1,i for all i,
|a2,i −a1,i | √
we get √ ηa1,i ≤ 60 ηC, which implies |a2,i − a1,i | ≤ 60ηCa1,i .
′ 60
πk+1 (a) ≤ πk (a) + 60ηCπk (a) ≤ πk (a) + πk (a) ≤ 2πk (a), (35)
270
and (34) implies
120
πk+1 (a) ≤ πk (a) + 120ηCπk (a) ≤ πk (a) + πk (a) ≤ 2πk (a). (36)
270
Thus, (35) and (36) are also inequalities we may use in the induction process.
P P 2
because a π1 (a)2 βb1 (a)2 ≤ a π1 (a)βb1 (a) ≤ C 2 by the condition in Line 1 of Algorithm 5. This proves (32) for
the base case.
Now we prove (33) of the base case. Note that
( ′ ′
Dψ (π, π2E
π2 = argminπ∈∆A D ),
b (37)
π2 = argmin π, −β1 + Dψ (π, π ′ ).
π∈∆A 2
√
Similarly, with the help of Lemma 18, we only need to show kβb1 k∇−2 ψ(π2′ ) ≤ 12 ηC. This can be verified by
A
X A
X
kβb1 k2∇−2 ψ(π2′ ) ≤ ηπ2′ (a)2 βb1 (a)2 ≤ 4 ηπ1 (a)2 βb1 (a)2 ≤ 4ηC 2 ,
a=1 a=1
where the second inequality uses (35) for the base case (implied by (32) for the base case, which we just proved).
Induction. Assume (32) and (33) hold before k. To prove (32), observe that
( D E
πk = argminπ∈∆A π, −βbk−1 + Dψ (π, πk′ ),
(38)
π′ = argmin
k+1 hπ, −βbk i + Dψ (π, π ′ ).
π∈∆A k
√
To apply Lemma 18 and obtain (32), we only need to show kβbk − βbk−1 k∇−2 ψ(πk ) ≤ 12 ηC. This can be verified by
A
X 2
kβbk − βbk−1 k2∇−2 ψ(πk ) ≤ ηπk (a)2 βbk (a) − βbk−1 (a)
a=1
A
X
≤ 2η πk (a)2 βbk (a)2 + βbk−1 (a)2
a=1
A
X A
X
≤ 2η πk (a)2 βbk (a)2 + 2η 4πk−1 (a)2 βbk−1 (a)2
a=1 a=1
≤ 10ηC 2 ,
√
Similarly, with the help of Lemma 18, we only need to show kβbk k∇−2 ψ(πk+1
′ ) ≤ 12 ηC. This can be verified by
A
X A
X
kβbk k2∇−2 ψ(π′ ≤ ′
ηπk+1 (a)2 βbk (a)2 ≤ 4 ηπk (a)2 βbk (a)2 ≤ 4ηC 2 ,
k+1 )
a=1 a=1
where in the second inequality we use (35) (implied by (32), which we just proved). This finishes the proof.
1
1
As in (Wei & Luo, 2018), we pick π̃ = 1 − T π∗ + T A 1A , and thus
′
On the other hand, to bound hπk − πk+1 , βbk−1 − βbk i, we follow the same approach as in (Wei & Luo, 2018, Lemma
14): define Fk (π) = hπ, −βbk−1 i + Dψ (π, πk′ ) and Fk+1
′
(π) = hπ, −βbk i + Dψ (π, πk′ ). Then by definition we have
′ ′
πk = argminπ∈∆A Fk (π) and πk+1 = argminπ∈∆A Ft+1 (π).
Observe that
′
Fk+1 ′
(πk ) − Fk+1 ′
(πk+1 ′
) = (πk − πk+1 )⊤ (βbk−1 − βbk ) + Fk (πk ) − Fk (πk+1
′
)
≤ (πk − π ′ )⊤ (βbk−1 − βbk )
k+1 (by the optimality of πk )
′
≤ πk − πk+1 ∇2 ψ(πk )
βbk−1 − βbk . (41)
∇−2 ψ(πk )
′
On the other hand, for some ξ that lies on the line segment between πk and πk+1 , we have by Taylor’s theorem and the
′
optimality of πk+1 ,
′ ′ ′ ′ ′ 1 2
Fk+1 (πk ) − Fk+1 (πk+1 ) = ∇Fk+1 (πk+1 )⊤ (πk − πk+1
′
)+ ′
πk − πk+1 ′
∇2 Fk+1 (ξ)
2
1 ′ 2 ′
≥ πk − πk+1 ∇2 ψ(ξ)
(by the optimality of πk+1 and that ∇2 Fk+1
′
= ∇2 ψ)
2
(42)
′
By Eq.(32) we know πk+1 (a) ∈ 21 πk (a), 2πk (a) , and hence ξ(a) ∈ 12 πk (a), 2πk (a) holds as well, because ξ is in the
′
line segment between πk and πk+1 . This implies for any x,
v v
u A u A
uX x(a)2 1u X x(a)2 1
kxk∇2 ψ(ξ) = t ≥ t = kxk∇2 ψ(πk ) .
ηξ(a)2 2 ηπ (a)2 2
a=1 a=1 k
1 2
′
πk − πk+1 ∇2 ψ(πk )
βbk−1 − βbk ≥ ′
πk − πk+1 ∇2 ψ(πk )
,
∇−2 ψ(π k) 8
′
which implies πk − πk+1 ∇2 ψ(πk )
≤ 8 βbk−1 − βbk . Hence we can bound the third term in (40) by
∇−2 ψ(πk )
2 X 2
′
πk − πk+1 ∇2 ψ(πk )
βbk−1 − βbk ≤ 8 βbk−1 − βbk = 8η πk (a)2 βbk−1 (a) − βbk (a) .
∇−2 ψ(πk ) ∇−2 ψ(πk )
a
k=1
" K
#
X
=E hπ − π̃, βbk i + hπ̃ − πk , βbk i
∗
k=1
"
K # !
1 X 1 b A log T XK X 2
∗ 2 b b
≤ π − 1, βk +O +η πk (a) βk−1 (a) − βk (a) ,
T A η
k=1 k=1 a
where the expectation of the first term is bounded by O(KN/T) = O(1) by the fact E[β̂_k(s)] = O(N) (implied by Lemma 6 and Lemma 14). This completes the proof.

    Õ( √(N³ρAT) + (BAN²)^{3/4} T^{1/4} + BNA ) = Õ( √(t_mix³ ρAT) + (t_mix³ t_hit A)^{3/4} T^{1/4} + t_mix² t_hit A ).
where the last line uses the fact (z1 + z2 + z3 )2 ≤ 3z12 + 3z22 + 3z32 . The second term in (43) can be bounded using Eq. (12):
"K #!
XX
O ηE πk (a|s)2 (βbk (s, a) − β πk (s, a))2
k=2 a
" K X
#!
X N 3 log T
2
= O ηE πk (a|s)
Bπk (a|s)µπk (s)
k=2 a
"K #!
X N 3 log T
= O ηE .
Bµπk (s)
k=2
Now multiplying both sides by Bµ∗ (s) and summing over s we get
" K # "K # !
XXX BA ln T X X N 3 (log T )µ∗ (s)
E B µ∗ (s)(π ∗ (a|s) − πk (a|s))q πk (s, a) = O + ηE + η 3 BKN 6
s a
η s
µπk (s)
k=1 k=1
BA ln T
≤O + ηρKN 3 (log T ) + η 3 BKN 6
η
3
=Oe BA + ηρ T N + η 3 T N 6 (T = BK)
η B
√ √
4
1 √B A BA 1
Choosing η = min 270(N +1) , , √
4 6
(η ≤ 270(N +1) is required by Lemma 17), we finally obtain
3
ρT N TN
" #
K XX
X p 3 1
E B e
µ∗ (s)(π ∗ (a|s) − πk (a|s))q πk (s, a) = O N 3 ρAT + (BAN 2 ) 4 T 4 + BN A
k=1 s a
q
3 1
e
=O 3 3 2
tmix ρAT + (tmix thit A) T + tmix thit A .
4 4
D. Experiments
In this section, we compare the performance of our proposed algorithms and previous model-free algorithms. We note
that model-based algorithms (UCRL2, PSRL, . . . ) typically have better performance in terms of regret but require more
memory. For a fair comparison, we restrict our attention to model-free algorithms.
Figure 3. Performance of model-free algorithms on the random MDP (left) and JumpRiverSwim (right). The standard Q-learning algorithm with ǫ-greedy exploration suffers linear regret. The OPTIMISTIC Q-LEARNING and MDP-OOMD algorithms achieve sub-linear regret. The shaded area denotes the standard deviation of the regret over multiple runs.
Two environments are considered: a randomly generated MDP and JumpRiverSwim. Both of the environments consist of
6 states and 2 actions. The reward function and the transition kernel of the random MDP are chosen uniformly at random.
The JumpRiverSwim environment is a modification of the RiverSwim environment (Strehl & Littman, 2008; Ouyang et al.,
2017a) with a small probability of jumping to an arbitrary state at each time step.
The standard RiverSwim models a swimmer who can choose to swim either left or right in a river. The states are arranged
in a chain and the swimmer starts from the leftmost state (s = 1). If the swimmer chooses to swim left, i.e., the direction
of the river current, he is always successful. If he chooses to swim right, he may fail with a certain probability. The reward
function is: r(1, left) = 0.2, r(6, right) = 1 and r(s, a) = 0 for all other states and actions. The optimal policy is to always
swim right to gain the maximum reward of state s = 6. The standard RiverSwim is not an ergodic MDP and does not
satisfy the assumption of the MDP-OOMD algorithm. To handle this issue, we consider the JumpRiverSwim environment
which has a small probability 0.01 of moving to an arbitrary state at each time step. This small modification provides an
ergodic environment.
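For reference, a minimal sketch of such an environment's kernel is given below; the failure probability of swimming right and the simplified right-action dynamics are assumptions of this sketch, since only the state/action counts, the rewards, and the 0.01 jump probability are specified above.

```python
import numpy as np

def jump_river_swim(S=6, jump_prob=0.01, p_fail=0.4):
    """Sketch of a JumpRiverSwim kernel: RiverSwim with a uniform random jump.

    Returns P of shape (S, A, S) and r of shape (S, A); action 0 = left, 1 = right.
    `p_fail` (chance that swimming right pushes the swimmer back) is an assumed value.
    """
    A = 2
    P = np.zeros((S, A, S))
    r = np.zeros((S, A))
    r[0, 0] = 0.2          # r(1, left) = 0.2 in the paper's 1-indexed notation
    r[S - 1, 1] = 1.0      # r(6, right) = 1
    for s in range(S):
        P[s, 0, max(s - 1, 0)] = 1.0                  # swimming left always succeeds
        P[s, 1, min(s + 1, S - 1)] += 1.0 - p_fail    # swimming right may fail
        P[s, 1, max(s - 1, 0)] += p_fail
    # mix with a uniform jump to every state, which makes the MDP ergodic
    P = (1.0 - jump_prob) * P + jump_prob / S
    return P, r
```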
We compare our algorithms with two benchmark model-free algorithms. The first benchmark is standard Q-learning with ǫ-greedy exploration. Figure 3 shows that this algorithm suffers linear regret, indicating that the naive ǫ-greedy exploration is not efficient. The second benchmark is the POLITEX algorithm of Abbasi-Yadkori et al. (2019a). The implementation of POLITEX is based on the variant designed for the tabular case, which is presented in their Appendix F and Figure 3. POLITEX usually requires a longer episode length than MDP-OOMD (see Table 2) because in each episode it needs to estimate the Q-function accurately, rather than merely obtaining an unbiased estimator of it as MDP-OOMD does.

Figure 3 shows that the proposed OPTIMISTIC Q-LEARNING and MDP-OOMD algorithms and the POLITEX algorithm of Abbasi-Yadkori et al. (2019a) all achieve similar performance in the RandomMDP environment. In the JumpRiverSwim environment, the OPTIMISTIC Q-LEARNING algorithm outperforms the other three algorithms. Although the regret upper bound for OPTIMISTIC Q-LEARNING scales as Õ(T^{2/3}) (Theorem 1), which is worse than that of MDP-OOMD (Theorem 5), Figure 3 suggests that in environments that lack good mixing properties, OPTIMISTIC Q-LEARNING may perform better. The details of the experiments are listed in Table 2.
Table 2. Hyperparameters used in the experiments. These hyperparameters are optimized to give the best possible result for each algorithm. All experiments are averaged over 10 independent runs for a horizon of 5 × 10⁶. For the POLITEX algorithm, τ and τ′ are the lengths of the two stages defined in Figure 3 of (Abbasi-Yadkori et al., 2019a).

Random MDP:
  Q-learning with ǫ-greedy:  ǫ = 0.05
  Optimistic Q-learning:     H = 100, c = 1, b_τ = c √(H/τ)
  MDP-OOMD:                  N = 2, B = 4, η = 0.01
  POLITEX:                   τ = 1000, τ′ = 1000, η = 0.2

JumpRiverSwim:
  Q-learning with ǫ-greedy:  ǫ = 0.03
  Optimistic Q-learning:     H = 100, c = 1, b_τ = c √(H/τ)
  MDP-OOMD:                  N = 10, B = 30, η = 0.01
  POLITEX:                   τ = 3000, τ′ = 3000, η = 0.2