An Analysis of Quantile Temporal-Difference Learning
Abstract
© 2024 Mark Rowland, Rémi Munos, Mohammad Gheshlaghi Azar, Yunhao Tang, Georg Ostrovski, Anna
Harutyunyan, Karl Tuyls, Marc G. Bellemare, Will Dabney.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided
at http://jmlr.org/papers/v25/23-0154.html.
1. Introduction
In distributional reinforcement learning, an agent aims to predict the full probability dis-
tribution over future returns it will encounter (Morimura et al., 2010b,a; Bellemare et al.,
2017, 2023), in contrast to predicting just the mean return, as in classical reinforcement
learning (Sutton and Barto, 2018). A widely-used family of algorithms for distributional
reinforcement learning is based on the notion of learning quantiles of the return distribu-
tion, an approach that originated with Dabney et al. (2018b), who introduced the quantile
temporal-difference (QTD) learning algorithm. This approach has been particularly suc-
cessful in combination with deep reinforcement learning, and has been a central component
in several recent real-world applications, including sim-to-real stratospheric balloon navi-
gation (Bellemare et al., 2020), robotic manipulation (Bodnar et al., 2020), and algorithm
discovery (Fawzi et al., 2022), as well as on benchmark simulated domains such as the Ar-
cade Learning Environment (Bellemare et al., 2013; Machado et al., 2018; Dabney et al.,
2018b,a; Yang et al., 2019) and racing simulation (Wurman et al., 2022).
Despite these empirical successes of QTD, little is known about its behaviour from a theoret-
ical viewpoint. In particular, questions regarding the asymptotic behaviour of the algorithm
(Do its predictions converge? Under what conditions? What is the qualitative character of
the predictions when they do converge?) were left open. A core reason for this is that unlike
classical TD, and other distributional reinforcement learning algorithms such as categorical
temporal-difference learning (Rowland et al., 2018; Bellemare et al., 2023), the updates of
QTD rely on asymmetric L1 losses. As a result, these updates do not approximate the
application of a contraction mapping, are highly non-linear (even in the tabular setting),
and also may have multiple fixed points (depending on the exact structure of the reward
distributions of the environment), and their analysis requires a distinct set of tools to those
typically used to analyse temporal-difference learning algorithms.
In this paper, we prove the convergence of QTD—notably under weaker assumptions than
are required in typical proofs of convergence for classical TD learning—establishing it as a
sound algorithm with theoretical convergence guarantees, and paving the way for further
analysis and investigation. The more general conditions stem from the structure of the
QTD updates (namely, their boundedness), and the proof is obtained through the use of
stochastic approximation theory with differential inclusions.
We begin by providing background on Markov decision processes, classical TD learning,
and quantile regression in Section 2. After motivating the QTD algorithm in Section 3,
we describe the related family of quantile dynamic programming (QDP) algorithms, and
provide a convergence analysis of these algorithms in Section 4. We then present the main
result, a convergence analysis of QTD, in Section 5. The proof relies on the stochastic
approximation framework set out by Benaı̈m et al. (2005), arguing that the QTD algorithm
approximates a continuous-time differential inclusion, and then constructing a Lyapunov
function to demonstrate that the limiting behaviour of trajectories of the differential in-
clusion matches that of the QDP algorithms introduced earlier. Finally, in Section 6, we
analyse the limit points of QTD, bounding their approximation error to the true return
distributions of interest, and investigating the kinds of approximation artefacts that arise
empirically.
2. Background
We first introduce background concepts and notation.
The distribution of the trajectory is thus parametrised by the initial state x_0 and the policy
π. To illustrate this dependency, we use the notation P^π_{x_0} and E^π_{x_0} to denote the probability
distribution and expectation operator corresponding to this distribution, and will write
P^π(· | x) for the joint distribution over a reward–next-state pair when the current state is x.
The return is a random variable, whose sources of randomness are the random selections of
actions made according to π, the randomness in state transitions, and the randomness in
rewards observed. Typically in reinforcement learning, a single scalar summary of perfor-
mance is given by the expectation of this return over all these sources of randomness. For
a given policy, this is summarised across each possible starting state via the value function
V^π : X → R, defined by
$$V^\pi(x) = \mathbb{E}^\pi_x\Big[\sum_{t=0}^{\infty} \gamma^t R_t\Big]\,. \qquad (2)$$
Learning the value function of a policy π from sampled trajectories generated through
interaction with the environment is a central problem in reinforcement learning, referred to
as the policy evaluation task.
Each expected return is a scalar summary of a much richer, more complex object: the probability distribution of the random return in Equation (1) itself. Distributional reinforcement
learning (Bellemare et al., 2023) is concerned with the problem of learning to predict the
probability distribution over returns, in contrast to just their expected value. Mathemati-
cally, the goal is to learn the return-distribution function η π : X → P(R); for each state
x ∈ X , η π (x) is the probability distribution of the random return in Expression (1) when
the trajectory begins at state x, and the agent acts using policy π. Mathematically, we
have
$$\eta^\pi(x) = \mathcal{D}^\pi_x\Big(\sum_{t \ge 0} \gamma^t R_t\Big)\,,$$
where D^π_x extracts the probability distribution of a random variable under P^π_x.
There are several distinct motivations for aiming to learn these more complex objects. First,
the richness of the distribution provides an abundance of signal for an agent to learn from,
in contrast to a single scalar expectation. The strong performance of deep reinforcement
learning agents that incorporate distributional predictions is hypothesised to be related to
this fact (Dabney et al., 2018b; Barth-Maron et al., 2018; Dabney et al., 2018a; Yang et al.,
2019). Second, learning about the full probability distribution of returns makes possible the
use of risk-sensitive performance criteria; one may be interested in not only the expected
return under a policy, but also the variance of the return, or the probability of the return
being under a certain threshold.
Unlike the value function V π , which is an element of RX , and can therefore be straightfor-
wardly represented on a computer (up to floating-point precision), the return-distribution
function η π is not representable. Each object η π (x) is a probability distribution over the real
numbers, and, informally speaking, probability distributions have infinitely many degrees
of freedom. Distributional reinforcement learning algorithms therefore typically work with
a subset of distributions that are amenable to parametrisation on a computer (Bellemare
et al., 2023). Common choices of subsets include categorical distributions (Bellemare et al.,
2017), exponential families (Morimura et al., 2010b), and mixtures of Gaussian distribu-
tions (Barth-Maron et al., 2018). Quantile temporal-difference learning, the core algorithm
of study in this paper, aims to learn a particular set of quantiles of the return distribution,
as described in Section 3.
and update V (x) by taking a step in the direction of this negative gradient, with some step
size α:
$$V(x) \leftarrow V(x) + \alpha\Big(\sum_{t=0}^{\infty} \gamma^t R_t - V(x)\Big)\,. \qquad (3)$$
This is a Monte Carlo algorithm, so called because it uses Monte Carlo samples of the
random return to update the estimate V .
A popular alternative to this Monte Carlo algorithm is temporal-difference learning, which
replaces samples from the random return with a bootstrapped approximation to the return,
obtained from a transition (x, A, R, X′) by combining the immediate reward R with the
current estimate of the expected return obtained at X′, resulting in the return estimate
$$R + \gamma V(X')\,. \qquad (4)$$
While the mean-return estimator in Expression (4) is generally biased, since V(X′) is not
generally equal to the true expected return V^π(X′), it is often a lower-variance estimate,
since we are replacing the random return from X′ with an estimate of its expectation
(Sutton, 1988; Sutton and Barto, 2018; Kearns and Singh, 2000).
This motivates the TD learning rule in Expression (5) based on the Monte Carlo update
rule in Expression (3), with the understanding that this algorithm can be applied more
generally, with access only to sampled transitions (rather than full trajectories), and may
result in more accurate estimates of the value function, due to lower-variance updates, and
the propensity of TD algorithms to “share information” across states. Note however that
this does not prove anything about the behaviour of temporal-difference learning, and a
fully rigorous theory of the asymptotic behaviour emerged several years after TD methods
were formally introduced (Sutton, 1984, 1988; Watkins, 1989; Watkins and Dayan, 1992;
Dayan, 1992; Dayan and Sejnowski, 1994; Jaakkola et al., 1994; Tsitsiklis, 1994).
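To make this contrast concrete, the following minimal sketch (our own illustration with hypothetical variable names, not code from the paper) implements the tabular Monte Carlo update of Expression (3) alongside a TD(0)-style update driven by a single sampled transition.

```python
import numpy as np

def monte_carlo_update(V, x, rewards, gamma, alpha):
    """Move V(x) towards an observed Monte Carlo return (cf. Expression (3))."""
    G = sum(gamma**t * r for t, r in enumerate(rewards))  # sampled return from state x
    V[x] += alpha * (G - V[x])
    return V

def td_update(V, x, r, x_next, gamma, alpha):
    """Move V(x) towards the bootstrapped return estimate r + gamma * V(x_next)."""
    V[x] += alpha * (r + gamma * V[x_next] - V[x])
    return V

# Toy usage on a three-state problem.
V = np.zeros(3)
V = monte_carlo_update(V, x=0, rewards=[1.0, 0.5, 0.25], gamma=0.9, alpha=0.1)
V = td_update(V, x=0, r=1.0, x_next=1, gamma=0.9, alpha=0.1)
```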
an equally-weighted mixture of Dirac deltas, for each state x ∈ X. The quantile-based approach to distributional reinforcement learning aims to have the particle locations (θ(x, i))_{i=1}^m
approximate certain quantiles of η^π(x).
Definition 1 For a probability distribution ν ∈ P(R) and parameter τ ∈ (0, 1), the set of
τ-quantiles of ν is given by the set
$$\{z \in \mathbb{R} : F_\nu(z) \ge \tau \text{ and } F_\nu(z^-) \le \tau\}\,.$$
The generalised inverse CDF value F^{-1}_ν(τ) = inf{z ∈ R : F_ν(z) ≥ τ} is always an element of
this set, and provides a way of uniquely specifying a quantile for each level τ. In cases where there
is not a unique τ-quantile (see Figure 1), F^{-1}_ν(τ) corresponds to the left-most or least valid
τ-quantile. We also introduce the notation
$$\bar{F}^{-1}_\nu(\tau) = \inf\{z \in \mathbb{R} : F_\nu(z) > \tau\}\,,$$
which corresponds to the right-most or greatest τ-quantile; notice the strict inequality that
appears in the definition, in contrast to that of F^{-1}_ν(τ). If F^{-1}_ν is continuous at τ, then
these two quantities coincide.
Figure 1: The three distinct scenarios that arise in defining quantiles. Firstly, there is a
value z1 for which Fν (z1 ) = τ1 and at which Fν is strictly increasing. Therefore
z1 is the unique τ1-quantile of ν. Next, there is an interval [z_2, z_2′] on which F_ν
equals τ2, therefore all elements in this interval are τ2-quantiles of ν. Finally,
there is no value z such that Fν (z) = τ3 , and the unique τ3 -quantile is therefore
defined by the infimum part of the definition.
Such an approach is available by using the quantile regression loss. We define the quantile
regression loss associated with distribution ν ∈ P(R) and quantile level τ ∈ (0, 1) as a
function of v by
$$\mathbb{E}_{Z \sim \nu}\big[(\tau\,\mathbb{1}\{Z - v \ge 0\} + (1-\tau)\,\mathbb{1}\{Z - v < 0\})\,|Z - v|\big]\,.$$
This loss is the expectation of an asymmetric absolute value loss, in which positive and
negative errors are weighted according to the parameters τ and 1 − τ respectively. Just
as the expected squared loss encountered above encodes the mean as its unique minimiser,
the quantile regression loss encodes the τ -quantiles of ν as the unique minimisers; see, for
example, Koenker (2005) for further background. Thus, applying the quantile regression
loss to the problem of estimating τ -quantiles of the return distribution, we arrive at the loss
$$L^{\tau,\pi}_x(v) = \mathbb{E}^\pi_x\big[(\tau\,\mathbb{1}\{\Delta \ge 0\} + (1-\tau)\,\mathbb{1}\{\Delta < 0\})\,|\Delta|\big]\,, \qquad \text{where } \Delta = \sum_{t=0}^{\infty}\gamma^t R_t - v\,.$$
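As a numerical sanity check of this minimising property (a sketch on assumed synthetic data, not an implementation from the paper), one can minimise an empirical version of the quantile regression loss over a grid and compare the minimiser with the empirical τ-quantile.

```python
import numpy as np

def quantile_regression_loss(v, samples, tau):
    """Empirical quantile regression loss: asymmetric absolute error weighted by tau / (1 - tau)."""
    delta = samples - v
    weights = np.where(delta >= 0, tau, 1.0 - tau)
    return np.mean(weights * np.abs(delta))

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=2.0, size=100_000)
tau = 0.3

grid = np.linspace(samples.min(), samples.max(), 10_000)
losses = [quantile_regression_loss(v, samples, tau) for v in grid]
v_star = grid[int(np.argmin(losses))]

# The grid minimiser should closely match the empirical tau-quantile.
print(v_star, np.quantile(samples, tau))
```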
Given an observed return $\sum_{t \ge 0} \gamma^t R_t$ from the state x, we therefore have that an unbiased
estimate of the negative gradient of this loss at v is given by $\tau - \mathbb{1}\{\sum_{t \ge 0} \gamma^t R_t < v\}$.
This is essentially the application of the stochastic gradient descent method for quantile
regression to learning quantiles of the return distribution.
based on a full trajectory, with an approximate sample from the return distribution derived
from an observed transition (x, R, X′), and the estimate η(X′) of the return distribution at
state X′. If the return distribution estimate η(X′) takes the form given in Equation (6),
as is the case for the probability distribution representation considered here, then such a
sample return is obtained as
$$R + \gamma\theta(X', J)\,,$$
with J sampled uniformly from {1, . . . , m}. This yields the update rule
$$\theta(x, i) \leftarrow \theta(x, i) + \alpha\big(\tau_i - \mathbb{1}\{R + \gamma\theta(X', J) < \theta(x, i)\}\big)\,.$$
We can consider also a variance-reduced version of this update, in which we average over
updates performed under different realisations of J, leading to the update
$$\theta(x, i) \leftarrow \theta(x, i) + \frac{\alpha}{m}\sum_{j=1}^{m}\big(\tau_i - \mathbb{1}\{R + \gamma\theta(X', j) < \theta(x, i)\}\big)\,. \qquad (10)$$
1. Technically speaking, we are assuming that differentiation and expectation can be interchanged here.
Further, under certain circumstances the loss is only sub-differentiable. As our principal goal in this
section is to provide intuition for QTD, we do not comment further on these technical details here. The
convergence results later in the paper deal with these issues carefully.
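The update in Equation (10) is straightforward to implement; the sketch below (hypothetical array shapes and names, ignoring the sub-differentiability caveats discussed in the footnote) applies it to a table θ of shape (number of states, m) given a single observed transition.

```python
import numpy as np

def qtd_update(theta, x, r, x_next, gamma, alpha):
    """Variance-reduced QTD update of Equation (10) for one transition (x, r, x_next).

    theta: array of shape (num_states, m) holding the quantile estimates.
    """
    m = theta.shape[1]
    taus = (2 * np.arange(1, m + 1) - 1) / (2 * m)   # tau_i = (2i - 1) / (2m)
    targets = r + gamma * theta[x_next]              # backed-up particles, shape (m,)
    for i in range(m):
        indicator = np.mean(targets < theta[x, i])   # average over j = 1, ..., m
        theta[x, i] += alpha * (taus[i] - indicator)
    return theta

theta = np.zeros((4, 5))                             # four states, m = 5 quantiles
theta = qtd_update(theta, x=0, r=1.0, x_next=1, gamma=0.9, alpha=0.05)
```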
4: end for
5: for i = 1, . . . , m do
6:     Set θ(x, i) ← θ′(x, i)
7: end for
8: return ((θ′(x, i))_{i=1}^m : x ∈ X)
Whilst the QTD update makes use of temporal-difference errors r + γθ(x′, j) − θ(x, i), there
are two key differences to the use of analogous quantities in classical TD learning. First, the
TD errors influence the update only through their sign, not their magnitude. Second, the
predictions at each state (θ(x, i))_{i=1}^m are indexed by i, and each update includes a distinct
term τ_i (equal to (2i − 1)/(2m)). The presence of these terms causes the learnt parameters to
make distinct predictions, as described in Section 3.1. Practical implementations of QTD
use these precise values for τi , equally spaced out on [0, 1], as proposed by Dabney et al.
(2018b). Much of the analysis in this paper goes through straightforwardly for other values
of τi , though we will see in Section 6 that this choice is well motivated in that it provides
the best bounds on distribution approximation. The tabular QTD algorithm as described
in Algorithm 1 uses a factor O(m) times more memory than an analogous classical TD
algorithm, owing to the need to store multiple predictions at each state, though the scaling
with the size of the state space is the same as for classical TD. For further discussion of the
computational complexity of QTD, see Rowland et al. (2023, Appendix A.3).
The discussion above provides motivation for the form of the QTD update given in Al-
gorithm 1, and intuition as to why this algorithm might perform reasonably, and learn a
sensible approximation to the return distribution. However, it stops short of providing an
explanation of how the algorithm should be expected to behave, or providing any theoretical
guarantees as to what the algorithm will in fact converge to. A core goal of the sections
that follow is to answer these questions, and put QTD on firm theoretical footing.
Figure 2: Top: A chain MDP with four states. Each transition yields a normally-distributed
reward; from x3 , the episode ends. The discount factor is γ = 0.9. Centre-top:
The progress of QTD, run with m = 5 quantiles, over the course of 10,000 updates.
The vertical axis corresponds to the predicted quantile values. Centre-bottom:
The true CDF of the return distribution (blue) at each state, along with the
final estimate produced by QTD (black), and the approximation produced by
the quantiles of the return distribution (grey). Bottom: The PDF of the return
distribution (blue) at each state, along with the final quantile approximation
produced by QTD (black).
Figure 3: Top left: The example Markov decision process described in Example 3. Top
right: Example dynamics of QTD with m = 1 in this environment, when reward
distributions are Gaussian. Also included are the directions of expected update,
in blue. Bottom left: Example dynamics and expected update directions when
reward distributions are Dirac deltas. Bottom right: Example dynamics and
expected updates with modified environment transition probabilities.
Our interest in QDP stems from the fact that QTD can be viewed as approximating the
behaviour of the QDP algorithms, without requiring access to the transition structure and
reward distributions of the environment. In particular, we will show that under appropriate
conditions, the asymptotic behaviour of QTD and QDP are equivalent: they both converge
to the same limiting points. Figure 4 illustrates the behaviour of the QDP algorithm in the
environment described in Example 3; since the reward distributions in this example have
strictly increasing CDFs, QDP behaves identically for all choices of interpolation parameters
λ. QTD and QDP appear to have the same asymptotic behaviour, converging to the
same limiting point. In cases where QTD appears to converge to a set, such as in the
bottom-right plot of Figure 3, the relationship is slightly more complicated, and there is
a correspondence between the asymptotic behaviour of QTD and the family of dynamic
programming algorithms parametrised by λ, as illustrated at the bottom of Figure 4. Thus,
to understand the asymptotic behaviour of QTD, we begin by analysing the asymptotic
behaviour of QDP.
Figure 4: Top left: Illustration of QDP (dashed purple) and QTD (solid red) on the first
MDP from Example 3, with Gaussian rewards. Top right: Illustration of QDP
and QTD on the second MDP from Example 3, with deterministic rewards. Bot-
tom: Values of λ and corresponding fixed points of QDP in the final MDP from
Example 3.
and reason about the transformations undertaken by Algorithm 2 directly in terms of distributions. To this end, if we write η(x) ∈ P(R) for the probability distribution associated
with the quantile estimates (θ(x, i))_{i=1}^m, we can interpret the transformation performed by
Algorithm 2 as comprising two parts, which we now describe in turn.
First, the variable η(x) is assigned the distribution of R + γG(X′), where R, X′ are the
random reward and next-state encountered from the initial state x with policy π, and
(G(y) : y ∈ X) is an independent collection of random variables, with each G(y) distributed
according to η(y).
We write T π : P(R)X → P(R)X for this transformation. The function T π is known as
the distributional Bellman operator (Bellemare et al., 2017; Rowland et al., 2018; Bellemare
et al., 2023). In terms of the above definition via distributions of random variables, T π can
be written
$$(T^\pi \eta)(x) = \mathcal{D}^\pi_x\big(R + \gamma G(X')\big)\,,$$
to return an element of RX ×[m] . We will also write T π θ for the element of P(R)X obtained
by applying T π to the distributions (η(x) : x ∈ X ) defined by
$$\eta(x) = \frac{1}{m}\sum_{i=1}^{m} \delta_{\theta(x,i)}\,.$$
Remark 4 This convention highlights that there are two complementary views of distri-
butional reinforcement learning algorithms, through finite-dimensional sets of parameters,
and through probability distributions. The view in terms of probability distributions is of-
ten useful in contraction analysis, and in measuring approximation error, while we will see
that the parameter view is key to the stochastic approximation analysis that follows, and is
ultimately the way in which these algorithms are implemented.
With this convention, Π_λ T^π θ is precisely the table θ′ output by Algorithm 2 on input θ, and
so the QDP algorithm is mathematically equivalent to repeated application of the operator
Πλ T π to an initial collection of quantile estimates. To understand the long-term behaviour
of QDP, we can therefore seek to understand this projected operator Πλ T π .
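For intuition, the sketch below implements one round of this repeated application in the special case λ ≡ 0 (so that the left-most quantile is always selected), under the simplifying assumptions that reward distributions are finitely supported and independent of the next state; it illustrates the idea rather than reproducing the paper's Algorithm 2.

```python
import numpy as np

def qdp_iteration(theta, P, rewards, gamma):
    """Apply the distributional Bellman operator to the quantile table theta and
    project each resulting distribution back onto m quantiles (lambda = 0)."""
    n, m = theta.shape
    taus = (2 * np.arange(1, m + 1) - 1) / (2 * m)
    new_theta = np.zeros_like(theta)
    for x in range(n):
        r_vals, r_probs = rewards[x]                 # finite reward support at state x
        atoms, probs = [], []
        for r, pr in zip(r_vals, r_probs):
            for y in range(n):
                for j in range(m):
                    atoms.append(r + gamma * theta[y, j])
                    probs.append(pr * P[x, y] / m)
        order = np.argsort(atoms)
        atoms, probs = np.array(atoms)[order], np.array(probs)[order]
        cdf = np.cumsum(probs)
        for i, tau in enumerate(taus):
            new_theta[x, i] = atoms[np.searchsorted(cdf, tau)]  # left-most tau-quantile
    return new_theta

# Example: two states, deterministic rewards, uniform transitions; iterate towards a fixed point.
P = np.array([[0.5, 0.5], [0.5, 0.5]])
rewards = [([1.0], [1.0]), ([-1.0], [1.0])]
theta = np.zeros((2, 3))
for _ in range(200):
    theta = qdp_iteration(theta, P, rewards, gamma=0.9)
```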
and its extension to return-distribution functions, w̄_∞ : P(R)^X × P(R)^X → [0, ∞], given
by
$$\bar{w}_\infty(\eta, \eta') = \max_{x \in \mathcal{X}}\ \sup_{t \in (0,1)} \big|F^{-1}_{\eta(x)}(t) - F^{-1}_{\eta'(x)}(t)\big|\,.$$
Both w∞ and w̄∞ fulfil all the requirements of a metric, except that they may assign infinite
distances (Villani, 2009; see also Bellemare et al., 2023 for a detailed discussion specifically
in the context of reinforcement learning). We must therefore take some care as to when
distances are finite. The following is established by Bellemare et al. (2023, Proposition 4.15).
Proposition 5 The distributional Bellman operator T^π : P(R)^X → P(R)^X is a γ-
contraction with respect to w̄_∞. That is,
$$\bar{w}_\infty(T^\pi \eta, T^\pi \eta') \le \gamma\, \bar{w}_\infty(\eta, \eta')\,,$$
for all η, η′ ∈ P(R)^X.
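For the equally weighted mixtures of m Dirac deltas used throughout this paper, these distances are simple to evaluate: the inverse CDFs are piecewise constant, so the w∞ distance between two such distributions (with the same number of atoms) is the largest absolute difference between their sorted particle locations. A minimal sketch under this assumption:

```python
import numpy as np

def w_inf_quantile(locs1, locs2):
    """w_infinity distance between two equally weighted mixtures of m Dirac deltas."""
    return np.max(np.abs(np.sort(locs1) - np.sort(locs2)))

def w_bar_inf(theta1, theta2):
    """Extension to return-distribution functions stored as (num_states, m) tables."""
    return max(w_inf_quantile(theta1[x], theta2[x]) for x in range(theta1.shape[0]))

print(w_inf_quantile(np.array([0.0, 1.0, 2.0]), np.array([0.5, 1.0, 4.0])))  # 2.0
```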
Next, we show that the projection operator Πλ cannot expand distances as measured by
w̄∞ , generalising the proof given by Bellemare et al. (2023) in the case λ ≡ 0; the proof is
given in Appendix A.1.
Proposition 6 The projection operator Π_λ : P(R)^X → P(R)^X is a non-expansion with
respect to w̄_∞. That is, for any η, η′ ∈ P(R)^X, we have
$$\bar{w}_\infty(\Pi_\lambda \eta, \Pi_\lambda \eta') \le \bar{w}_\infty(\eta, \eta')\,.$$
Finally, we put these two results together to obtain our desired conclusion. In stating this
result, it is useful here to introduce the notation
$$\mathcal{F}_{Q,m} = \Big\{\frac{1}{m}\sum_{i=1}^{m} \delta_{z_i} : z_i \in \mathbb{R} \text{ for } i = 1,\ldots,m\Big\}\,,$$
the set of distributions expressible as equally weighted mixtures of m Dirac deltas.
Next, observe that w̄_∞ assigns finite distance to all pairs of return-distribution functions in
F_{Q,m}^X, and further, this set is complete with respect to w̄_∞. Hence, we may apply Banach's
fixed point theorem to obtain the existence of the unique fixed point η̂^π_λ in F_{Q,m}^X. The final
Note that the fixed point η̂λπ depends on λ, and therefore implicitly on m. We also introduce
the notation θ̂λπ ∈ RX ×[m] for the parameters of this collection of distributions, which is what
the QDP algorithm really operates over, so that we have
$$\hat\eta^\pi_\lambda(x) = \frac{1}{m}\sum_{i=1}^{m} \delta_{\hat\theta^\pi_\lambda(x,i)}\,.$$
Note that the convergence result of Proposition 7 also implies convergence of the estimated
quantile locations to θ̂λπ . In Section 6, we will analyse the fixed point η̂λπ , and understand
how closely it approximates the true return-distribution function η π . For now, having
established convergence of QDP through contraction mapping theory, we can return to
QTD and demonstrate its own convergence to the same fixed points.
where given x and k, we have (R_k(x), X′_k(x)) ∼ P^π(· | x), independently of the transitions
used at all other states/time steps, and (α_k)_{k≥0} is a sequence of step sizes. The assumption
of synchronous updates makes the analysis easier to present, and means that our results
follow classical approaches to stochastic approximation with differential inclusions (Benaı̈m
et al., 2005). It is also possible to extend the analysis to the asynchronous case, where a
single state is updated at each algorithm time step (as would be the case in fully online
QTD, or an implementation using a replay buffer); see Section 5.7. We now state the main
convergence result of the paper.
Theorem 8 Consider the sequence (θ_k)_{k≥0} defined by an initial point θ_0 ∈ R^{X×[m]}, the
iterative update in Equation (13), and non-negative step sizes satisfying the condition
$$\sum_{k=0}^{\infty} \alpha_k = \infty\,, \qquad \alpha_k = o(1/\log k)\,. \qquad (14)$$
Then (θ_k)_{k≥0} converges almost surely to the set of fixed points of the projected distributional
Bellman operators {Π_λ T^π : λ ∈ [0, 1]^{X×[m]}}; that is,
$$\theta_k \to \{\hat\theta^\pi_\lambda : \lambda \in [0, 1]^{\mathcal{X}\times[m]}\}$$
with probability 1.
Of particular note is the generality of this result. It does not require finite-variance con-
ditions on rewards (as is typically the case with convergence results for classical TD); it
holds for any collection of reward distributions with the finite mean property set out at the
beginning of the paper. Some intuition as to why this is the case is that the finite-variance
conditions typically encountered are to ensure that the updates performed in classical TD
learning cannot grow in magnitude too rapidly. Since the updates performed in QTD are
bounded, this is not a concern, meaning that the proof does not rely on such conditions. We
note also that the step size conditions are weaker than the typical Robbins-Monro condi-
tions used in classical TD analyses (see, for example, Bertsekas and Tsitsiklis, 1996), which
enforce square-summability, also to avoid the possibility of divergence due to unbounded
noise in classical TD learning.
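For instance (an illustrative choice of ours, not an example given in the paper), the step sizes
$$\alpha_k = \frac{1}{\log(k+2)\,\log\log(k+16)}$$
satisfy $\sum_k \alpha_k = \infty$ and $\alpha_k = o(1/\log k)$, yet $\sum_k \alpha_k^2 = \infty$, so they are covered by Theorem 8 but not by a square-summability requirement.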
The proof is based on the ODE method for stochastic approximation; in particular we use
the framework set out by Benaı̈m (1999) and Benaı̈m et al. (2005). This involves interpreting
the QTD update as a noisy Euler discretisation of a differential equation (or more generally,
a differential inclusion). The broad steps are then to argue that the trajectories of the
differential equation/inclusion converge to some set of fixed points in a suitable way (that
is, in such a way that is robust to small perturbations), and that the asymptotic behaviour
of QTD, forming a noisy Euler discretisation, matches the asymptotic behaviour of the
true trajectories. This then allows us to deduce that the QTD iterates converge to the
same set of fixed points as the true trajectories. We begin by elucidating the connection to
differential equations and differential inclusions.
We now briefly introduce an assumption on the MDP reward structure that simplifies the
analysis that follows. This assumption guarantees that the two “difficult” cases of flat and
vertical regions of CDFs (see Figure 1) do not arise; note that this assumption removes
the possibility of multiple fixed points or discontinuous expected dynamics, as described in
Example 3. We will lift this assumption later.
Assumption 9 For each state x ∈ X , the reward distribution at x has a CDF which is
strictly increasing, and Lipschitz continuous.
As described in Section 4, the distribution of R + γθ_k(X′, J) given the initial state x is
in fact equal to (T^π η_k)(x), where η_k ∈ P(R)^X is the return-distribution function given by
$$\eta_k(x) = \frac{1}{m}\sum_{i=1}^{m} \delta_{\theta_k(x,i)}\,.$$
Under Assumption 9, and in particular the assumption of continuous reward CDFs, this
yields a concise rewriting of the increment as
$$\alpha_k\big(\tau_i - F_{(T^\pi\theta_k)(x)}(\theta_k(x,i))\big)\,.$$
We may therefore intuitively interpret Equation (13) as a noisy discretisation of the differential equation
$$\partial_t \vartheta_t(x,i) = \tau_i - F_{(T^\pi\vartheta_t)(x)}(\vartheta_t(x,i)) \qquad \text{for all } (x, i)\,, \qquad (16)$$
which we refer to as the QTD differential equation (or QTD ODE). Note also that Assumption 9 guarantees the global existence and uniqueness of solutions to this differential equation, by the Cauchy-Lipschitz theorem.
Remark 10 Calling back to Figure 3, the trajectories of the QTD ODE are obtained pre-
cisely by integrating the vector fields that appear in these plots. In contrast to the ODE that
emerges when analysing classical TD learning (both in tabular and linear function approx-
imation settings) (Tsitsiklis and Van Roy, 1997), the right-hand side of Equation (16) is
non-linear in the parameters ϑt , meaning that we are outside the domain of linear stochastic
approximation methods.
Note that Definition 11 does not require that zt is differentiable with derivative gt , but only
the weaker integration condition in Equation (18). We then have the following existence
result (see, for example, Smirnov, 2002 for a proof).
Proposition 12 Consider a set-valued map H : R^n ⇒ R^n, and suppose that H is a Marchaud map; that is,
• the set {(z, h) : z ∈ R^n, h ∈ H(z)} is closed;
• for all z ∈ R^n, H(z) is non-empty, compact, and convex;
• there exists a constant C > 0 such that for all z ∈ R^n,
$$\sup_{h \in H(z)} \|h\| \le C(1 + \|z\|)\,.$$
Then the differential inclusion ∂_t z_t ∈ H(z_t) has a global solution, for any initial condition.
It is readily verified that the QTD DI satisfies the requirements of this result, and we
are therefore guaranteed global solutions to this differential inclusion, under any initial
conditions.
θk+1 = θk + αk (gk + wk ) ,
where:
• (α_k)_{k≥0} satisfy the conditions $\sum_{k=0}^{\infty} \alpha_k = \infty$ and α_k = o(1/log k);
• g_k ∈ H(θ_k) for all k ≥ 0;
• (w_k)_{k≥0} is a bounded martingale difference sequence with respect to the natural filtration
generated by (θ_k)_{k≥0}; that is, there is an absolute constant C such that ‖w_k‖_∞ < C
almost surely, and E[w_k | θ_0, . . . , θ_k] = 0.
If further (θ_k)_{k≥0} is bounded almost surely (that is, sup_{k≥0} ‖θ_k‖_∞ < ∞ almost surely), then
θ_k → Λ almost surely.
The intuition behind the conditions of the theorem is as follows. The Marchaud map
condition ensures the differential inclusion of interest has global solutions. The existence of
the Lyapunov function guarantees that trajectories of the differential inclusion converge in
a suitably stable sense to Λ. The step size conditions, martingale difference condition, and
boundedness conditions mean that the iterates (θ_k)_{k≥0} will closely track the differential
inclusion trajectories, and hence exhibit the same asymptotic behaviour. We can now
give the proof of Theorem 8, first requiring the following proposition, which is proven in
Appendix A.3.
Proposition 15 Under the conditions of Theorem 8, the iterates (θ_k)_{k≥0} are bounded
almost surely.
Proof (Proof of Theorem 8) We see that for the QTD sequence (θ_k)_{k≥0} and the QTD
DI and QDP invariant set Λ = {θ̂^π_λ : λ ∈ [0, 1]^{X×[m]}}, the conditions of Theorem 14 are
satisfied, except perhaps for the boundedness of (θ_k)_{k≥0}, and the existence of the Lyapunov
function. The fact that the sequence (θ_k)_{k≥0} is bounded almost surely is Proposition 15;
its proof is somewhat technical, and given in the appendix. The construction of a valid
Lyapunov function is given in Proposition 18 below, which completes the proof.
Remark 16 What makes the relaxation to the differential inclusion work in this analysis?
We have already seen that some kind of relaxation of the dynamics is required in order to
define a valid continuous-time dynamical system; the original ODE may not have solutions
in general. If we relax the dynamics too much (an extreme example would be the differential
inclusion ϑt (x, i) ∈ R), what goes wrong? The answer is that there are too many resulting
solutions, which do not exhibit the desired asymptotic behaviour. Thus, the differential
inclusion in Equation (17) is in some sense just the right level of relaxation of the differential
equation we started with, since trajectories of the QTD DI are still guaranteed to converge
to the QDP fixed points.
resulting CDFs are strictly increasing. We therefore introduce the notation Π to refer to
any such projection in this case, and the notation θ̂^π_m to refer to the unique fixed point of
ΠT^π.
Proposition 17 Consider the ODE in Equation (16), and suppose Assumption 9 holds. A
Lyapunov function for the equilibrium point θ̂^π_m is given by
$$L(\theta) = \max_{x \in \mathcal{X}}\ \max_{i=1,\ldots,m} \big|\theta(x,i) - \hat\theta^\pi_m(x,i)\big|\,.$$
Proof We immediately observe that L is continuous, non-negative, and takes on the value
0 only at θ̂^π_m. To show that L(ϑ_t) is decreasing, where (ϑ_t)_{t≥0} is an ODE trajectory, suppose
(x, i) is a state-index pair attaining the maximum in L(ϑ_t). It is sufficient to show that
ϑ_t(x, i) is moving towards θ̂^π_m(x, i), or expressed mathematically,
$$\partial_t \vartheta_t(x,i) \overset{S}{=} \hat\theta^\pi_m(x,i) - \vartheta_t(x,i)\,,$$
where we use $a \overset{S}{=} b$ as shorthand for equality of signs sign(a) = sign(b), where
$$\mathrm{sign}(z) = \begin{cases} 1 & \text{if } z > 0\,, \\ 0 & \text{if } z = 0\,, \\ -1 & \text{if } z < 0\,. \end{cases}$$
We have
$$\partial_t \vartheta_t(x,i) = \tau_i - F_{(T^\pi\vartheta_t)(x)}(\vartheta_t(x,i)) = F_{(T^\pi\vartheta_t)(x)}\big((\Pi T^\pi\vartheta_t)(x,i)\big) - F_{(T^\pi\vartheta_t)(x)}(\vartheta_t(x,i)) \overset{S}{=} (\Pi T^\pi\vartheta_t)(x,i) - \vartheta_t(x,i)\,,$$
where the sign equality follows from Assumption 9; since all reward CDFs are strictly
increasing, so too is F_{(T^π ϑ_t)(x)}, and so F^{-1}_{(T^π ϑ_t)(x)} is strictly monotonic. Additionally, from
the contractivity of ΠT^π with respect to w_∞ (see Proposition 7), we have
$$\big|(\Pi T^\pi\vartheta_t)(x,i) - \hat\theta^\pi_m(x,i)\big| \le \gamma\,\max_{y \in \mathcal{X}}\max_{j=1,\ldots,m}\big|\vartheta_t(y,j) - \hat\theta^\pi_m(y,j)\big| = \gamma\,\big|\vartheta_t(x,i) - \hat\theta^\pi_m(x,i)\big|\,;$$
the equality follows since we selected (x, i) to attain the maximum in the definition of L(ϑ_t).
From this, we deduce
$$(\Pi T^\pi\vartheta_t)(x,i) - \vartheta_t(x,i) \overset{S}{=} \hat\theta^\pi_m(x,i) - \vartheta_t(x,i)\,;$$
ϑ_t = θ̂^π_m, and hence ΠT^π ϑ_t = θ̂^π_m, and the claim follows; both sides are equal to 0. For the
case θ̂^π_m(x, i) − ϑ_t(x, i) < 0, then note we have
The proof of Proposition 17 also sheds further light on the mechanisms underlying the QTD
algorithm. A key step in the argument is to show that for the state-index pairs (x, i) such
that ϑ_t(x, i) is maximally distant from the fixed point θ̂^π_m(x, i), the expected update under
QTD moves this coordinate of the estimate in the same direction as gradient descent on a
squared loss from the fixed point. However, the fact that it is only the sign of the update
that has this property, and not its magnitude, means that the empirical rate of convergence
and stability of QTD can be expected to be somewhat different from methods based on an
L2 loss, such as classical TD.
We now state the Lyapunov result in the general case; the proof is somewhat more involved,
and is given in Appendix A.4.
Proposition 18 The function
$$L(\theta) = \inf_{\lambda \in [0,1]^{\mathcal{X}\times[m]}} \big\|\theta - \hat\theta^\pi_\lambda\big\|_\infty \qquad (20)$$
is a Lyapunov function for the differential inclusion in Equation (17) and the set of fixed
points {θ̂^π_λ : λ ∈ [0, 1]^{X×[m]}}.
for x = Xk , and θk+1 (x, i) = θk (x, i) otherwise. Here, the step size βx,k depends on both x
and k, and is typically selected so that each state individually makes use of a fixed step size
sequence (α_k)_{k≥0}, by taking $\beta_{x,k} = \alpha_{\sum_{l=0}^{k} \mathbb{1}\{X_l = x\}}$. This models the online situation where
a stream of experience (X_k, R_k)_{k≥0} is generated by interacting with the environment using
the policy π, and updates are performed setting X′_k = X_{k+1}, and also the setting in which
the tuples (X_k, R_k, X′_k)_{k≥0} are sampled i.i.d. from a replay buffer, among others.
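As an illustration of this asynchronous regime (a sketch with assumed data structures and a hypothetical environment interface, not the paper's implementation), the loop below runs fully online QTD along a single stream of experience, maintaining a per-state visit counter so that each state consumes its own copy of the step-size sequence.

```python
import numpy as np

def async_qtd(env_step, num_states, m, gamma, alpha_schedule, num_updates, x0=0, seed=0):
    """Online (asynchronous) QTD: only the currently visited state is updated.

    env_step(x, rng) -> (r, x_next) samples a reward and next state from P^pi(. | x).
    alpha_schedule(k) -> step size used on a state's k-th visit.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros((num_states, m))
    taus = (2 * np.arange(1, m + 1) - 1) / (2 * m)
    visits = np.zeros(num_states, dtype=int)
    x = x0
    for _ in range(num_updates):
        r, x_next = env_step(x, rng)
        beta = alpha_schedule(visits[x])                       # per-state step size
        targets = r + gamma * theta[x_next]
        indicators = (targets[None, :] < theta[x][:, None]).mean(axis=1)
        theta[x] += beta * (taus - indicators)
        visits[x] += 1
        x = x_next                                             # X'_k = X_{k+1}: online setting
    return theta

# Hypothetical two-state chain with Gaussian rewards.
def env_step(x, rng):
    return rng.normal(loc=float(x)), int(rng.integers(2))

theta = async_qtd(env_step, num_states=2, m=5, gamma=0.9,
                  alpha_schedule=lambda k: 0.5 / (1 + 0.01 * k), num_updates=5000)
```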
Convergence of QTD in such asynchronous settings can also be proven; Perkins and Leslie
(2013) extend the analysis of Benaı̈m (1999) and Benaı̈m et al. (2005), incorporating the
approach of Borkar (1998), to obtain convergence guarantees for asynchronous stochastic
approximation algorithms approximating differential inclusions. In the interest of space, we
do not provide the full details of the proof here, but instead sketch the key differences that
arise in the analysis in Appendix C.
for all ν, ν′ ∈ P(R), and η, η′ ∈ P(R)^X. The following result improves on the analysis given
by Bellemare et al. (2023) for the case of λ ≡ 0, establishing an upper bound on the w1
distance between η̂λπ and η π for any λ, essentially by showing that the errors accumulated
in dynamic programming can be made arbitrarily small by increasing m, which controls the
richness of the distribution representation.
Proposition 19 For any λ ∈ [0, 1]^{X×[m]}, if all reward distributions are supported on
[R_min, R_max], then we have
$$\bar{w}_1(\hat\eta^\pi_\lambda, \eta^\pi) \le \frac{V_{\max} - V_{\min}}{2m(1-\gamma)}\,,$$
where V_max = R_max/(1 − γ), and similarly V_min = R_min/(1 − γ).
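As a concrete instantiation of this bound (our own numerical example): with rewards supported on [0, 1] and γ = 0.9, we have V_max − V_min = 10, so that
$$\bar{w}_1(\hat\eta^\pi_\lambda, \eta^\pi) \le \frac{10}{2m(1 - 0.9)} = \frac{50}{m}\,,$$
and guaranteeing an error of at most 0.5 in this metric requires on the order of m = 100 quantiles.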
Remark 20 This bound also provides motivation for the specific values of (τ_i)_{i=1}^m that
QTD uses. A similar convergence analysis and fixed-point analysis can be straightfor-
wardly carried out for a version of the QTD algorithm with other values for (τ_i)_{i=1}^m; by
tracing through the proof of Proposition 19, it can be seen that the bound is proportional to
max(τ_1, max((τ_{i+1} − τ_i)/2 : i = 1, . . . , m − 1), 1 − τ_m), which is minimised precisely by the
choice of (τ_i)_{i=1}^m used by QTD.
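To see this concretely, for the levels τ_i = (2i − 1)/(2m) we have
$$\tau_1 = \frac{1}{2m}\,, \qquad \frac{\tau_{i+1} - \tau_i}{2} = \frac{1}{2m}\,, \qquad 1 - \tau_m = \frac{1}{2m}\,,$$
so all terms entering the maximum are balanced at 1/(2m); since τ_1 + \sum_{i=1}^{m-1}(τ_{i+1} − τ_i) + (1 − τ_m) = 1 for any choice of levels, any other choice makes at least one of these terms strictly larger than 1/(2m).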
Proposition 21 Consider an MDP such that for any trajectory, after k time steps all
encountered transition distributions and reward distributions are Dirac deltas. If all reward
distributions in the MDP are supported on [Rmin , Rmax ], then for any λ ∈ [0, 1]X ×[m] , we
have
Remark 22 One particular upshot of this bound for practitioners is that for agents in
near-deterministic environments using near-deterministic policies, it may be possible to use
m = o((1−γ)−1 ) quantiles and still obtain accurate approximations to the return-distribution
function via QTD and/or QDP. It is interesting to contrast this result for quantile-based
distributional reinforcement learning against the case when using categorical distribution
representations (Bellemare et al., 2017; Rowland et al., 2018; Bellemare et al., 2023). In
this latter case, fixed point error continues to be accumulated even when the environment
has solely deterministic transitions and rewards, due to the well-documented phenomenon
of the approximate distribution ‘spreading its mass out’ under the Cramér projection (Belle-
mare et al., 2017; Rowland et al., 2018; Bellemare et al., 2023). Our observation here leads
to immediate practical advice for practitioners (in environments with mostly deterministic
transitions, a quantile representation may be preferred to a categorical representation, lead-
ing to less approximation error), and raises a general question that warrants further study:
how can we use prior knowledge about the structure of the environment to select a good
distribution representation?
We conclude this section by noting that many variants of Proposition 21 are possible; one
can for example modify the assumption that rewards are deterministic to an assumption
that reward distributions are supported on a 'small' interval, and still obtain a fixed-point
bound that improves over the instance-independent bound of Proposition 19. There are a
wide variety of such modifications that could be imagined, and we believe this to be an
interesting direction for future research and applications.
Example 23 Consider the two-state Markov decision process (with a single action) whose
transition probabilities are specified by the left-hand side of Figure 5, such that a determin-
istic reward of 2 is obtained in state x1 , and −1 in state x2 ; further, let us take a discount
factor γ = 0.9. The centre panel of this figure shows various estimates of the CDF for the
return distribution at state x1 . The ground truth estimate in black is obtained from Monte
Carlo sampling. The CDFs in purple, blue, green, and orange are the points of convergence
for QDP with m = 2, 5, 10, 100, respectively. For m = 100, a very close fit to the true re-
turn distribution is obtained. However, for small m in particular, the distribution is heavily
skewed to the right. In the case of m = 2, half of the probability mass is placed on the
greatest possible return in this MDP—namely 20—even though with probability 1 the true
return is less than this value. What is the cause of this behaviour from QDP? This question
is answered by investigating the dynamic programming update itself in more detail.
In this MDP, the result of the QDP operator applied to the fixed point θ is to update each
particle location with a ‘backed-up’ particle appearing in the distributions T π θ. When such
settings arise, tracking which backed-up particles are allocated to which other particles helps
us to understand the behaviour of QDP, and the nature of the approximation incurred. We
also gain intuition about the situation, since the QDP operator is behaving like an affine
policy evaluation operator on X × [m] locally around the fixed point. We can visualise which
particles are assigned to one another by a QDP operator application through local quantile
back-up diagrams; the right-hand side of Figure 5 shows the local quantile back-up diagram
for this particular MDP. We observe that θ(x1, 2) backs up from itself, and hence learns a value
that corresponds to observing a self-transition at every state, with a reward of 2; under
the discount factor of 0.9, this corresponds to a return of 20. This is the source of the
drastic over-estimation of returns in the approximation obtained with m = 2, and the fact
that all other state-quantile pairs implicitly bootstrap from this estimate leads to the over-
estimation leaking out into all quantiles estimated in this case. As m increases, we observe
from the CDF plot that there is always one particle that learns this maximal return of 20,
but that this has less effect on the other quantiles; indeed in the orange curve, we obtain a
very good approximation (in w1 ) to the true return distribution despite this particle with a
maximal value of 20 remaining present. We can interpret the increase in m as preventing
pathological self-loops/small cycles in the quantile backup diagram from “leaking out” and
degrading the quality of other quantile estimates; this provides a complementary perspective
on the approximation artefacts that occur in QDP/QTD fixed points to the quantitative
upper bounds in the previous section.
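As a quick arithmetic check of the self-looping particle value described in this example: a particle that repeatedly bootstraps from itself with reward 2 and discount factor 0.9 learns the value
$$\sum_{t=0}^{\infty} (0.9)^t \times 2 = \frac{2}{1 - 0.9} = 20\,,$$
matching the maximal return placed on half of the probability mass in the m = 2 approximation.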
We expect the local quantile back-up diagram introduced in Example 23 to be a useful tool
for developing intuition, as well as further analysis, of QDP and QTD. As described in the
example itself, being able to define the local back-up diagram depends on the structure
Figure 5: Left: An example MDP. Centre: The fixed point return distribution estimates for
state x1 obtained by QDP for m = 2, 5, 20, 100 (solid purple, dotted blue, dashed
green, and dash-dotted orange, respectively) compared to ground truth in solid
black. Right: The corresponding local quantile backup diagram at the fixed point
for m = 2, illustrating potential approximation artefacts in QDP fixed points.
of the MDP being such that the QDP operator obtains each new coordinate value from a
single backed-up particle location. It is an interesting question as to how the definition of
such local back-up diagrams could be generalised to apply in situations where this does not
hold, such as with absolutely continuous reward distributions.
7. Related Work
Stochastic approximation theory with differential inclusions. The ODE method was intro-
duced by Ljung (1977) as a means of analysing stochastic approximation algorithms, and
was subsequently extended and refined by Kushner and Clark (1978); standard references
on the subject include Kushner and Yin (2003); Borkar (2008); Benveniste et al. (2012); see
also Meyn (2022) for an overview in the context of reinforcement learning. The framework
we follow in this paper is set out by Benaı̈m (1999), and was extended by Benaı̈m et al.
(2005) to allow for differential inclusions. Perkins and Leslie (2013) later extended this
analysis further to allow for asynchronous algorithms, building on the approach introduced
by Borkar (1998), and extended, with particular application to reinforcement learning, by
Borkar and Meyn (2000).
Differential inclusion theory. Differential inclusions have found application across a wide
variety of fields, including control theory (Wazewski, 1961), economics (Aubin, 1991), dif-
ferential game theory (Krasovskii and Subbotin, 1988), and mechanics (Monteiro Marques,
2013). The approach to modelling differential equations with discontinuous right-hand sides
via differential inclusions was introduced by Filippov (1960). Standard references on the
theory of differential inclusions include Aubin and Cellina (1984); Clarke et al. (1998);
Smirnov (2002); see also Bernardo et al. (2008) on the related field of piecewise-smooth
dynamical systems. Joseph and Bhatnagar (2019) also use tools combining stochastic ap-
proximation and differential inclusions from Benaı̈m et al. (2005) to analyse (sub-)gradient
descent as a means of estimating quantiles of fixed distributions. Within reinforcement
learning and related fields more specifically, differential inclusions have played a key role in
the analysis of game-theoretic algorithms based on fictitious play (Brown, 1951; Robinson,
1951); see Benaı̈m et al. (2006); Leslie and Collins (2006); Benaı̈m and Faure (2013) for ex-
amples. More recently, Gopalan and Thoppe (2023) used differential inclusions to analyse
TD algorithms for control with linear function approximation.
8. Conclusion
We have provided the first convergence analysis for QTD, a popular and effective distri-
butional reinforcement learning algorithm. In contrast to the analysis of many classical
temporal-difference learning algorithms, this has required the use of tools from the field
of differential inclusions and branches of stochastic approximation theory that deal with
the associated dynamical systems. Due to the structure of the QTD algorithm, such as
its bounded-magnitude updates, these convergence guarantees hold under weaker condi-
tions than are generally used in the analysis of TD algorithms. These results establish the
soundness of QTD, representing an important step towards understanding its efficacy and
practical successes, and we expect the theoretical tools used here to be useful in further
analyses of (distributional) reinforcement learning algorithms.
There are several natural directions for further work building on this analysis. One such
direction is to establish finite-sample bounds for the convergence of QTD predictions to
the set of QDP fixed points. This is a central theoretical question for developing our un-
derstanding of QTD, and may also shed further light on the recently observed empirical
phenomenon in which tabular QTD can outperform TD in stochastic environments as a
means of value estimation (Rowland et al., 2023). Related to this point, the Lyapunov
analysis conducted in this paper provides further intuition for why QTD works in general,
and we expect this to inform the design of further variants of QTD, for example incorpo-
rating multi-step bootstrapping (Watkins, 1989), or Ruppert-Polyak averaging (Ruppert,
1988; Polyak and Juditsky, 1992). Another important direction is to analyse more complex
variants of the QTD algorithm, incorporating more aspects of the large-scale systems in
which it has found application. Examples include incorporating function approximation, or
control variants of the algorithm based on Q-learning. We believe further research into the
theory, practice, and applications of QTD, at a variety of scales, is an important direction
for foundational reinforcement learning research.
Acknowledgments
We thank the anonymous reviewers and action editor for helpful suggestions and feedback.
We also thank Tor Lattimore for detailed comments on an earlier draft, and David Abel,
Bernardo Avila Pires, Diana Borsa, Yash Chandak, Daniel Guo, Clare Lyle, and Shan-
tanu Thakoor for helpful discussions. Marc G. Bellemare was supported by Canada CIFAR
AI Chair funding. The simulations in this paper were generated using the Python 3 lan-
guage, and made use of the NumPy (Harris et al., 2020), SciPy (Virtanen et al., 2020), and
Matplotlib (Hunter, 2007) libraries.
Appendix A. Proofs
In this section, we provide proofs for results which are not proven in the main text.
Clearly
$$\big|F^{-1}_{\eta(x)}(\tau_i) - F^{-1}_{\eta'(x)}(\tau_i)\big| \le \bar{w}_\infty(\eta, \eta')\,.$$
Additionally, we have
$$\big|\bar{F}^{-1}_{\eta(x)}(\tau_i) - \bar{F}^{-1}_{\eta'(x)}(\tau_i)\big| = \Big|\lim_{s\downarrow\tau_i} F^{-1}_{\eta(x)}(s) - \lim_{s\downarrow\tau_i} F^{-1}_{\eta'(x)}(s)\Big| = \lim_{s\downarrow\tau_i}\big|F^{-1}_{\eta(x)}(s) - F^{-1}_{\eta'(x)}(s)\big| \le \bar{w}_\infty(\eta, \eta')\,,$$
as required.
Next, since we assume θ̄ is bounded, Theorem 4.2 of Benaı̈m et al. (2005) applies so that we
deduce that it is an asymptotic pseudotrajectory of the differential inclusion (w.p.1). We
then have that (θ̄(t))t≥0 is a bounded asymptotic pseudotrajectory (w.p.1), so Theorem 4.3
of Benaı̈m et al. (2005) applies, and we deduce that the set of limit points of (θ̄(t))t≥0 is
internally chain transitive (w.p.1). But now by Proposition 3.27 of Benaı̈m et al. (2005)
applied to the Lyapunov function L and the set Λ, all internally chain transitive sets are
contained within Λ. Since (θ̄(t))t≥0 is bounded, we deduce that it converges to Λ (w.p.1).
It therefore follows that the discrete sequence (θ_k)_{k≥0} converges to Λ with probability 1, as
required.
Differential inclusion update direction. To begin with the analysis of the differential inclusion,
fix δ > 0 such that 1 − δ > γ, and let M > 0 be such that for all (x, a) ∈ X × A, we have
$$F_{P_R(x,a)}\big((1-\delta-\gamma)M\big) > 1 - \tfrac{1}{4m}\,, \qquad F_{P_R(x,a)}\big(-(1-\delta-\gamma)M\big) < \tfrac{1}{4m}\,.$$
which, roughly speaking, hold when θk has at least one large coordinate (in absolute value),
and θk (x, i) is a positive (respectively, negative) coordinate close to the maximum value.
$$\le \big(1 - \tfrac{1}{2m}\big) - \big(1 - \tfrac{1}{4m}\big) \le -\tfrac{1}{4m}\,, \qquad (22)$$
and hence the differential inclusion moves θ_k(x, i) towards the origin. Inequality (a) follows
since on I^+_k(x, i), we have
Chaining updates and reasoning about noise. To describe the relationship between successive
iterates in the sequence (θk )k≥0 , we introduce the notation θk+1 = θk + αk gk + αk wk , where
wj is martingale difference noise, and hence gj is an expected update direction, from the
right-hand side of the QTD differential inclusion. By boundedness of the update noise and
the step size assumptions, we have from Proposition 1.4 of Benaı̈m et al. (2005) (see also
Theorem 5.3.3 of Kushner and Yin (2003)) that
$$\lim_{k\to\infty}\ \sup\Big\{\Big\|\sum_{j=k}^{k+l} \alpha_j w_j\Big\|_\infty : l \ge 0 \text{ and } \sum_{j=k}^{k+l}\alpha_j \le 8m + 1\Big\} = 0\,,$$
almost surely. In particular, letting ε ∈ (0, 1), there almost-surely exists K (which depends
on the realisation of the martingale noise) such that
$$\sup\Big\{\Big\|\sum_{j=k}^{k+l} \alpha_j w_j\Big\|_\infty : l \ge 0 \text{ and } \sum_{j=k}^{k+l}\alpha_j \le 8m + 1\Big\} < \varepsilon$$
for all k ≥ K. Let us additionally take M̄ ≥ M such that δM̄ ≥ 4(8m + 1). Suppose that for some
k ≥ K, ‖θ_k‖_∞ ≥ M̄ + (8m + 1). Let l be minimal such that $\sum_{j=k}^{k+l} \alpha_j > 8m$. Then we have
‖θ_{k+j} − θ_k‖_∞ ≤ 8m + 1 for all 0 ≤ j ≤ l, and so ‖θ_{k+j}‖_∞ ≥ M̄ for all 0 ≤ j ≤ l. Further, if
θ_k(x, i) > ‖θ_k‖_∞(1 − δ) + 2(8m + 1), then
$$\theta_{k+j}(x, i) \ge \theta_k(x, i) - (8m + 1) \ge (1 - \delta)\|\theta_k\|_\infty + (8m + 1) \ge (1 - \delta)\big(\|\theta_{k+j}\|_\infty - (8m + 1)\big) + (8m + 1) = (1 - \delta)\|\theta_{k+j}\|_\infty\,,$$
so I^+_{k+j}(x, i) holds for all 0 ≤ j ≤ l, and hence
$$|\theta_{k+l+1}(x, i)| \le \|\theta_k\|_\infty - \sum_{j=k}^{k+l}\alpha_j \times \tfrac{1}{4m} + \Big\|\sum_{j=k}^{k+l}\alpha_j w_j\Big\|_\infty < \|\theta_k\|_\infty - 2 + \varepsilon < \|\theta_k\|_\infty - 1\,. \qquad (23)$$
Similarly, if θ_k(x, i) < −‖θ_k‖_∞(1 − δ) − 2(8m + 1), I^-_{k+j}(x, i) holds for all 0 ≤ j ≤ l, and
we reach the same conclusion as in Equation (23). Finally, if |θ_k(x, i)| ≤ ‖θ_k‖_∞(1 − δ) +
2(8m + 1), then since δ‖θ_k‖_∞ > δM̄, we have |θ_k(x, i)| ≤ ‖θ_k‖_∞ − 2(8m + 1), and hence
|θ_{k+l+1}(x, i)| ≤ ‖θ_k‖_∞ − (8m + 1). Putting these components together, we have
where C is a constant depending only on the reward distributions of the MDP and γ.
Proof By the triangle inequality, we have
$$\begin{aligned}
\|\theta_\lambda - \theta_{\lambda'}\|_\infty &\le \|\theta_\lambda - \Pi_{\lambda'} T^\pi \theta_\lambda\|_\infty + \|\Pi_{\lambda'} T^\pi \theta_\lambda - \theta_{\lambda'}\|_\infty \\
&= \|\Pi_\lambda T^\pi \theta_\lambda - \Pi_{\lambda'} T^\pi \theta_\lambda\|_\infty + \|\Pi_{\lambda'} T^\pi \theta_\lambda - \Pi_{\lambda'} T^\pi \theta_{\lambda'}\|_\infty \\
&\le \|(\Pi_\lambda - \Pi_{\lambda'}) T^\pi \theta_\lambda\|_\infty + \gamma\,\|\theta_\lambda - \theta_{\lambda'}\|_\infty \\
\implies \quad \|\theta_\lambda - \theta_{\lambda'}\|_\infty &\le \frac{1}{1-\gamma}\,\|(\Pi_\lambda - \Pi_{\lambda'}) T^\pi \theta_\lambda\|_\infty\,.
\end{aligned}$$
1, . . . , n}, since
$$P_{Z\sim\nu}\big(Z \le \min\{F^{-1}_{\nu_i}(\tau) : i = 1,\ldots,n\}\big) = \sum_{i=1}^{n} p_i\, P_{Z_i\sim\nu_i}\big(Z_i \le \min\{F^{-1}_{\nu_i}(\tau) : i = 1,\ldots,n\}\big) \le \sum_{i=1}^{n} p_i\, P_{Z_i\sim\nu_i}\big(Z_i \le F^{-1}_{\nu_i}(\tau)\big) \le \tau\,.$$
By analogous reasoning, we obtain that $\bar{F}^{-1}_{(T^\pi\theta_\lambda)(x)}\big(\tfrac{2m-1}{2m}\big)$ is no greater than
$$\max_{x}\ \bar{F}^{-1}_{R^\pi(x)}\big(\tfrac{2m-1}{2m}\big) + \gamma\,\|\theta_\lambda\|_\infty\,,$$
and hence
$$\|(\Pi_\lambda - \Pi_{\lambda'})T^\pi\theta_\lambda\|_\infty \le C\,\|\lambda - \lambda'\|_\infty\,,$$
We now turn to the proof of Proposition 18. First, we observe that the infimum over λ in
Equation (20) is attained, since Lemma 24 establishes that λ ↦ θ_λ is continuous (in fact
Lipschitz), and [0, 1]^{X×[m]} is compact. We therefore have that L is continuous, non-negative,
and takes on the value 0 only on the set of fixed points {θ_λ : λ ∈ [0, 1]^{X×[m]}}.
For the decreasing property, let (ϑ_t)_{t≥0} be a solution to the differential inclusion in Equa-
tion (17), and as in Definition 11, let g : [0, ∞) → R^{X×[m]} satisfy
$$\vartheta_t = \int_0^t g_s\,\mathrm{d}s\,, \qquad (24)$$
with g_t(x, i) ∈ H^π_{x,i}(ϑ_t) for all (x, i), and for almost all t ≥ 0, where we have introduced the
notation
$$H^\pi_{x,i}(\theta) = \big[\tau_i - F_{(T^\pi\theta)(x)}(\theta(x,i))\,,\ \tau_i - F_{(T^\pi\theta)(x)}(\theta(x,i)-)\big]\,.$$
As in the proof of Proposition 17, we will show that L(ϑt ) is locally decreasing outside of the
fixed point set, which is enough for the global decreasing property. Further, by continuity
of L(ϑt ), it is enough to show this property for almost all t ≥ 0. We will therefore consider
a value of t ≥ 0 at which the above inclusion for gt holds.
Let λ attain the minimum in the definition of L(ϑ_t). Write θ_λ for the corresponding fixed
point for conciseness, and let (x, i) be a λ-argmax index with respect to ϑ_t; a state-particle
pair achieving the maximum in the definition of the norm ‖ϑ_t − θ_λ‖_∞. First, we consider the
cases where H^π_{x,i}(ϑ_t) is not a singleton. Now, if 0 ∈ H^π_{x,i}(ϑ_t), then we have (Π_λ T^π ϑ_t)(x, i) =
ϑ_t(x, i), and with the same logic as above, we have ϑ_t = θ_λ, and hence ϑ_t is in the fixed
point set, and L(ϑ_t) is constant. If 0 ∉ H^π_{x,i}(ϑ_t), then as in the proof of Proposition 17, it
can be shown that any element of H^π_{x,i}(ϑ_t) has the same sign as
$$\theta_\lambda(x,i) - \vartheta_t(x,i)\,. \qquad (25)$$
In the case of Proposition 17, continuity of the derivative then allowed us to deduce that
|ϑ_t(x, i) − θ_λ(x, i)| is locally decreasing. Here, we require a related concept of continuity for
the set-valued map θ ↦ H^π_{x,i}(θ), namely that it is upper semicontinuous (see, for example,
Smirnov, 2002); for a given θ ∈ R^{X×[m]} and any given ε > 0, there exists δ > 0 such that if
‖θ′ − θ‖_∞ < δ, then H^π_{x,i}(θ′) ⊆ {h + v : h ∈ H^π_{x,i}(θ), |v| < ε}. From this, it follows that any
element of H^π_{x,i}(ϑ_{t+s}), for sufficiently small positive s, has the same sign as the expression
in Equation (25), and so from Equation (24), we have that |ϑ_t(x, i) − θ_λ(x, i)| is locally
decreasing, as required.
Now, when H^π_{x,i}(ϑ_t) is a singleton, if it is non-zero, then by the same argument as in the
proof of Proposition 17, the corresponding element has the same sign as the expression in
Equation (25), and so as above, we conclude that |ϑ_t(x, i) − θ_λ(x, i)| is locally decreasing.
Finally, the case where there exists an argmax index (x, i) with H^π_{x,i}(ϑ_t) = {0} requires more
care, and we will need to reason about the effects of perturbing λ to show that the Lyapunov
function is decreasing. For some intuition as to what the problem is, if H^π_{x,i}(ϑ_{t+s}) = {0}
for small positive s, then the coordinate ϑ_{t+s}(x, i) is static, as it lies on the flat region of
the CDF F_{(T^π ϑ_{t+s})(x)} at level τ_i, and so the distance |ϑ_{t+s}(x, i) − θ_λ(x, i)| is not decreasing.
We explain how to deal with this case below.
We will now demonstrate the existence of a parameter λ′ ∈ [0, 1]^{X×[m]} such that ‖ϑ_{t+s} −
θ_{λ'}‖_∞ < ‖ϑ_t − θ_λ‖_∞, which establishes the locally decreasing property of the Lyapunov
function, as required. To do so, we introduce a modification of the fixed point map λ ↦ θ_λ.
Letting µ ∈ R^J, and defining λ[µ] ∈ R^{X×[m]} to be the replacement of the J coordinates of
λ with the corresponding coordinates of µ, we consider the map
$$h_\lambda : \mu \mapsto P_J\,\theta_{\lambda[\mu]}\,,$$
where PJ : RX ×[m] → RJ extracts the J coordinates. At an intuitive level, this map allows
us to study the effect of perturbing the J coordinates of λ on the corresponding coordinates
of the fixed point.
where the inequality follows from the fact that, since θ_{λ[µ]} is continuous in µ, for µ sufficiently
close to λ_J there is a flat region of F_{(T^π θ_{λ[µ]})(x)} at level τ_i, for any (x, i) ∈ J. To complete the
injectivity argument, we cannot have P_J θ_{λ[µ']} = P_J θ_{λ[µ]} if θ_{λ[µ']} ≠ θ_{λ[µ]}, as the contraction
maps Π_{λ[µ]} T^π and Π_{λ[µ']} T^π are equal on coordinates not in J, and these two maps would
therefore have the same fixed point, a contradiction.
therefore have the same fixed point, a contradiction.
We may now appeal to the invariance of domain theorem (Brouwer, 1912) to deduce that, since h_λ is a continuous injective map between an open subset of [0, 1]^J containing λ_J (here we are using the assumption that λ_J lies in the interior of [0, 1]^J) and the Euclidean space R^J of equal dimension, it is an open map on this domain; that is, it maps open sets to open sets. Hence, we can perturb θ^λ in the J coordinates in any direction we want by locally modifying the J coordinates of λ. In particular, we can move all J coordinates of θ^λ closer to those of (ϑ_{t+s}(x, i) : (x, i) ∈ J). Let λ′ ∈ (0, 1)^{X×[m]} be such a modification of λ, taken close enough to λ that all coordinates outside J are perturbed so little that they cannot be λ′-argmax indices with respect to ϑ_{t+s}. We then have that ‖ϑ_{t+s} − θ^{λ′}‖_∞ < ‖ϑ_t − θ^λ‖_∞, as required.
When λ_J does not lie in the interior of [0, 1]^J, more care is needed: we must perturb λ_J to obtain µ in such a way that we obtain the desired perturbation of θ^λ, without the parameters µ leaving the set [0, 1]^J. To do this, we first rule out λ_J lying on certain parts of the boundary.
Lemma 25 If (x, i) ∈ J and ϑ_{t+s}(x, i) < θ^λ(x, i), then λ(x, i) > 0. Similarly, if ϑ_{t+s}(x, i) > θ^λ(x, i), then λ(x, i) < 1.
Proof We prove the claim when ϑ_{t+s}(x, i) < θ^λ(x, i); the other case follows analogously. If λ(x, i) = 0, then the quantile chosen at level τ_i by the projection Π_λ is the left-most point of the flat region of the CDF F_{(T^π ϑ_{t+s})(x)} at level τ_i; since ϑ_{t+s}(x, i) lies in this flat region, we must have (Π_λ T^π ϑ_{t+s})(x, i) ≤ ϑ_{t+s}(x, i). We therefore have
w_∞(Π_λ T^π ϑ_{t+s}, θ^λ) ≥ |(Π_λ T^π ϑ_{t+s})(x, i) − θ^λ(x, i)| ≥ |ϑ_{t+s}(x, i) − θ^λ(x, i)| = w_∞(ϑ_{t+s}, θ^λ) ,

contradicting the fact that Π_λ T^π is a contraction with respect to w_∞ with fixed point θ^λ.
We write v = sign((ϑ_{t+s})_J − θ^λ_J) ∈ R^J, where the sign mapping is applied elementwise, and introduce the notation N(v) = {α ⊙ v : α ∈ R^J_{>0}} for the (open) orthant containing the vector v, where ⊙ denotes the elementwise product. We are therefore seeking a perturbation µ of λ_J such that θ^{λ[µ]}_J lies in a direction in N(v) from θ^λ_J, and further such that the perturbation to θ^λ is sufficiently small that no index that was not an argmax in ‖ϑ_{t+s} − θ^λ‖_∞ can become one in ‖ϑ_{t+s} − θ^{λ[µ]}‖_∞; under these conditions, we have ‖ϑ_{t+s} − θ^{λ[µ]}‖_∞ < ‖ϑ_{t+s} − θ^λ‖_∞, as required. Lemma 25 then guarantees that a (sufficiently small) perturbation of λ_J in any direction in N(v) remains within [0, 1]^J, so it is sufficient to show that a perturbation of λ_J in such a direction achieves the desired perturbation of θ^λ.
Differentiability. Now, if the extended map λ ↦ θ^λ is differentiable at λ, then differentiating through the fixed-point equation θ^λ = G(λ, θ^λ) (where we write G(λ, θ) = Π_λ T^π θ for conciseness) yields

∇_λ θ^λ = ∇_λ G(λ, θ^λ) + ∇_θ G(λ, θ^λ) ∇_λ θ^λ ,   and hence   ∇_λ θ^λ = (I − ∇_θ G(λ, θ^λ))^{−1} ∇_λ G(λ, θ^λ) .
From Lemma 26, we therefore obtain that ∇h_λ(λ_J) has the form

∇h_λ(λ_J) = (I − Q)^{−1} D ,

with D ∈ R^{J×J} diagonal, with positive elements on the diagonal (from monotonicity of λ ↦ G(λ, θ)), and with Q ∈ R^{J×J} strictly substochastic. The derivative is therefore invertible, and we obtain the derivative of the inverse of the form

∇h_λ^{−1}(θ^λ_J) = D^{−1}(I − Q) .
From strict substochasticity of Q, and since v ∈ {±1}^J, it follows that for the desired perturbation direction v, we have

sign(∇h_λ^{−1}(θ^λ_J) v) = v ,

and so ∇h_λ^{−1}(θ^λ_J) v ∈ N(v), where the equality of signs applies elementwise. Therefore, a perturbation of θ^λ_J in a direction in N(v) is achieved by a sufficiently small perturbation of λ_J in a direction in N(v), as required.
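To spell out the elementwise sign computation behind this display (a brief check using the stated structure of D and Q, not a step taken from the original argument): writing d_i > 0 for the i-th diagonal entry of D,

\[
\big(D^{-1}(I - Q)v\big)_i \;=\; d_i^{-1}\Big(v_i - \sum_{j \in J} Q_{ij} v_j\Big) ,
\qquad
\Big|\sum_{j \in J} Q_{ij} v_j\Big| \;\le\; \sum_{j \in J} Q_{ij} \;<\; 1 ,
\]

so each coordinate of D^{−1}(I − Q)v has the same sign as the corresponding coordinate of v ∈ {±1}^J.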
Non-differentiability. If λ ↦ θ^λ is not differentiable at λ, we instead use techniques from non-smooth analysis to complete the argument. First, since λ ↦ θ^λ is Lipschitz (by Lemma 24), it is differentiable almost everywhere by Rademacher's theorem (Rademacher, 1919). Adapting the argument of Clarke (1976, Lemma 3) via Fubini's theorem, for almost all λ\J ∈ R^{(X×[m])\J}, the map λ ↦ θ^λ is differentiable at (λ\J, µ) for almost all µ with (λ\J, µ) in the extended domain. The map (λ\J, µ) ↦ (λ\J, h_{λ\J}(µ)) is Lipschitz and locally injective around λ, and hence maps sufficiently small open neighbourhoods of
λ to open neighbourhoods of (λ\J, h_{λ\J}(λ_J)). Further, since each h_{λ\J} is Lipschitz, the directional derivative of h^{−1}_{λ\J} in the direction v is defined
for almost all λ\J in a ball B around λ\J, and (for each such λ\J) for almost all θ in the L_∞ ball B′ with centre h_{λ\J}(λ_J) and radius ρ, for some radius ρ > 0. We further take B and B′ to be of small enough radii so that this directional derivative is bounded on this set, so that h^{−1}_{λ\J} is locally Lipschitz on B′ for each λ\J ∈ B (and hence absolutely continuous), so that for any θ ∈ B′, we have sign(θ − (ϑ_{t+s})_J) = sign(θ^λ_J − (ϑ_{t+s})_J), and so that no µ in the preimage of B′ under h_{λ\J} is such that ‖θ^{(λ\J, µ)} − ϑ_{t+s}‖_∞ has new argmax coordinates outside of J.
Let us consider λ̃ ∈ B at which the almost-everywhere differentiability condition holds. By applying the same argument with Fubini's theorem, for almost all θ̄ in B(h_{λ̃}(λ_J), ρ/4), the inverse h^{−1}_{λ̃} is differentiable almost everywhere on {θ̄ + uv : u ∈ [0, ρ/2]}.
Here, (a) follows from the triangle inequality, (b) follows as η̂^π_λ and η^π are fixed points of Π_λ T^π and T^π, respectively, and (c) follows from the application of the inequality at the beginning of the proof and contractivity of T^π. Rearranging then gives the desired result.
w̄_1((Π_λ T^π)^l η^π, η^π) ≤ γ w̄_1((Π_λ T^π)^{l−1} η^π, η^π) + (V_max − V_min) / (2m) .
Chaining these inequalities yields the required statement.
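Written out (a sketch of the chaining step, using the fact that w̄_1(η^π, η^π) = 0 and bounding the geometric series; the uniform-in-l constant below is implied by the recursion above rather than quoted from the statement being proved):

\[
\bar{w}_1\big((\Pi_\lambda T^\pi)^l \eta^\pi, \eta^\pi\big)
\;\le\; \gamma^l\, \bar{w}_1(\eta^\pi, \eta^\pi)
\;+\; \frac{V_{\max} - V_{\min}}{2m} \sum_{k=0}^{l-1} \gamma^k
\;\le\; \frac{V_{\max} - V_{\min}}{2m(1-\gamma)} .
\]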
function, and therefore strong guarantees can be given on the approximate solution returned when the reward CDFs of the MDP are continuous. Nevertheless, note that Algorithm 4 does not exactly implement the operator Π_λ T^π due to this root-finding approximation error. For simplicity, we present the algorithm in the case where the reward and next state in a transition are conditionally independent given the current state, though the algorithm can be straightforwardly extended to the general case, by working with CDFs of reward distributions conditioned on the next state.
3:   for i = 1, . . . , m do
4:     Use a scalar root-finding subroutine to find θ′(x, i) approximately satisfying
5:   end for
6: end for
7: return ((θ′(x, i))^m_{i=1} : x ∈ X)
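As a concrete illustration of the root-finding step, the sketch below assumes that the condition in step 4 (whose display is not reproduced above) is the natural CDF-inversion condition F_{(T^π θ)(x)}(θ′(x, i)) ≈ τ_i, with the conditionally independent reward/next-state structure described above; the names P, reward_cdf, lo and hi, and the quantile levels τ_i = (2i − 1)/(2m), are illustrative assumptions rather than quantities taken from Algorithm 4. Simple bisection suffices because the target CDF is non-decreasing, and it is well-behaved when the reward CDFs are continuous, as noted above.

def target_cdf(z, x, theta, P, reward_cdf, gamma):
    """CDF of (T^pi theta)(x) evaluated at z, assuming the reward and next state
    are conditionally independent given x: a uniform mixture, over next states
    x_next and particle indices j, of the reward CDF shifted by gamma * theta[x_next, j].
    Here theta is an array of shape (num_states, m), P is the transition matrix
    under the policy, and reward_cdf(x, r) is the CDF of the reward at state x."""
    num_states, m = theta.shape
    total = 0.0
    for x_next in range(num_states):
        for j in range(m):
            total += P[x, x_next] * reward_cdf(x, z - gamma * theta[x_next, j]) / m
    return total

def approx_quantile(x, i, theta, P, reward_cdf, gamma, lo, hi, tol=1e-8):
    """Bisection search for theta'(x, i) approximately satisfying
    F_{(T^pi theta)(x)}(theta') = tau_i, with tau_i = (2i+1)/(2m) for 0-indexed i.
    Assumes continuous reward CDFs, so the target CDF is continuous and
    non-decreasing on the bracketing interval [lo, hi]."""
    m = theta.shape[1]
    tau_i = (2 * i + 1) / (2 * m)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if target_cdf(mid, x, theta, P, reward_cdf, gamma) < tau_i:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

In place of the hand-rolled bisection, a bracketing routine such as scipy.optimize.brentq could equally be applied to z ↦ F_{(T^π θ)(x)}(z) − τ_i, since this function changes sign between lo and hi; Algorithm 4 leaves the choice of scalar root-finding subroutine open.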
Step size restrictions. Typically, more restrictive assumptions on step sizes, beyond the Robbins-Monro conditions, are required for asynchronous convergence guarantees. See, for example, Assumption A2 of Perkins and Leslie (2013); note that the typical Robbins-Monro step size schedule of α_k ∝ 1/k^ρ for ρ ∈ (1/2, 1] satisfies these requirements.
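For concreteness, a minimal sketch of such a schedule (the constants c and ρ below are illustrative choices, not values prescribed by Perkins and Leslie (2013)):

def step_size(k, rho=0.75, c=1.0):
    """Step size alpha_k = c / k**rho for k >= 1, with rho in (1/2, 1], so that the
    alpha_k sum to infinity while their squares are summable (Robbins-Monro)."""
    return c / k ** rho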
Conditions on the sequence of states (X_k)_{k≥0} to be updated. Additionally, different states are required to be updated “comparably often”; assuming that (X_k)_{k≥0} forms an aperiodic irreducible time-homogeneous Markov chain is sufficient, and this condition holds when either (i) π generates such a Markov chain over the state space of the MDP of interest, or (ii) the states to be updated are sampled i.i.d. from a fixed distribution supported on the entirety of the state space, amongst other settings. See Assumption A4 of Perkins and Leslie (2013) for further details.
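To illustrate how the step-size and state-sampling conditions fit together, the following is a minimal sketch of asynchronous tabular QTD under setting (ii) above, with states sampled i.i.d. from a distribution supported on the whole state space. The helper sample_transition, the distribution state_dist and the exponent rho are hypothetical placeholders, and the update rule is the standard tabular QTD update with indicator (sub)gradients; tie-breaking at points of non-differentiability is glossed over.

import numpy as np

def async_qtd(theta, sample_transition, state_dist, gamma, num_steps, rho=0.75, rng=None):
    """Asynchronous tabular QTD sketch: states are sampled i.i.d., and each state
    keeps its own visit count, so per-state step sizes follow a Robbins-Monro
    schedule and all states are updated comparably often."""
    rng = np.random.default_rng() if rng is None else rng
    num_states, m = theta.shape
    tau = (2 * np.arange(m) + 1) / (2 * m)   # assumed quantile levels (2i-1)/(2m)
    counts = np.zeros(num_states, dtype=int)
    for _ in range(num_steps):
        x = rng.choice(num_states, p=state_dist)
        counts[x] += 1
        alpha = 1.0 / counts[x] ** rho       # per-state Robbins-Monro step size
        r, x_next = sample_transition(x)     # sampled reward and next state under pi
        targets = r + gamma * theta[x_next]  # bootstrap targets, one per particle
        for i in range(m):
            # Quantile-regression (sub)gradient, averaged over the m target particles.
            theta[x, i] += alpha * np.mean(tau[i] - (targets < theta[x, i]))
    return theta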
Modified differential inclusion. The QTD differential inclusion in Equation (17) must be broadened to account for the possibility of different states being updated with different frequencies, leading to a differential inclusion of the form

∂_t ϑ_t(x, i) ∈ {ω h : ω ∈ (δ, 1], h ∈ H^π_{x,i}(ϑ_t)} ,

where δ represents a minimum relative update frequency for the state x, derived from the conditions on (X_k)_{k≥0} described above. Because of the structure of the Lyapunov function for the QTD DI in Equation (20), it is readily verified that this remains a valid Lyapunov function for this broader differential inclusion, for the same invariant set of QDP fixed points.
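One way to see why the positive scaling does not affect the argument (a sketch of the reasoning, assuming the coordinatewise decrease argument used for the original inclusion): at points of differentiability,

\[
\partial_t \big|\vartheta_t(x,i) - \theta^\lambda(x,i)\big|
\;=\; \omega_t \,\operatorname{sign}\!\big(\vartheta_t(x,i) - \theta^\lambda(x,i)\big)\, h_t ,
\qquad \omega_t \in (\delta, 1] ,\ h_t \in H^\pi_{x,i}(\vartheta_t) ,
\]

and since ω_t > 0, this quantity is non-positive precisely when the corresponding expression for the original inclusion (with ω_t ≡ 1) is non-positive.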
References
Jean-Pierre Aubin. Viability theory. Springer Birkhäuser, 1991.
Jean-Pierre Aubin and Arrigo Cellina. Differential inclusions: Set-valued maps and viability
theory. Springer Science & Business Media, 1984.
Gabriel Barth-Maron, Matthew W. Hoffman, David Budden, Will Dabney, Dan Horgan,
Dhruva TB, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distribu-
tional deterministic policy gradients. In Proceedings of the International Conference on
Learning Representations, 2018.
Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning
Environment: An evaluation platform for general agents. Journal of Artificial Intelligence
Research, 47:253–279, June 2013.
Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on rein-
forcement learning. In Proceedings of the International Conference on Machine Learning,
2017.
Marc G. Bellemare, Salvatore Candido, Pablo Samuel Castro, Jun Gong, Marlos C.
Machado, Subhodeep Moitra, Sameera S. Ponda, and Ziyu Wang. Autonomous navi-
gation of stratospheric balloons using reinforcement learning. Nature, 588(7836):77–82,
2020.
Marc G. Bellemare, Will Dabney, and Mark Rowland. Distributional reinforcement learning.
MIT Press, 2023.
Michel Benaïm and Mathieu Faure. Consistency of vanishingly smooth fictitious play. Math-
ematics of Operations Research, 38(3):437–450, 2013.
Michel Benaïm, Josef Hofbauer, and Sylvain Sorin. Stochastic approximations and differ-
ential inclusions. SIAM Journal on Control and Optimization, 44(1):328–348, 2005.
Michel Benaïm, Josef Hofbauer, and Sylvain Sorin. Stochastic approximations and differen-
tial inclusions, Part II: Applications. Mathematics of Operations Research, 31(4):673–695,
2006.
Albert Benveniste, Michel Métivier, and Pierre Priouret. Adaptive algorithms and stochastic
approximations. Springer Science & Business Media, 2012.
Mario Bernardo, Chris Budd, Alan Richard Champneys, and Piotr Kowalczyk. Piecewise-
smooth dynamical systems: Theory and applications. Springer Science & Business Media,
2008.
Cristian Bodnar, Adrian Li, Karol Hausman, Peter Pastor, and Mrinal Kalakrishnan. Quan-
tile QT-Opt for risk-aware vision-based robotic grasping. In Robotics: Science and Sys-
tems, 2020.
Vivek S. Borkar and Sean P. Meyn. The ODE method for convergence of stochastic ap-
proximation and reinforcement learning. SIAM Journal on Control and Optimization, 38
(2):447–469, 2000.
George W. Brown. Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 13(1):374, 1951.
Francis H. Clarke, Yuri S. Ledyaev, Ronald J. Stern, and Peter R. Wolenski. Nonsmooth
analysis and control theory. Springer Science & Business Media, 1998.
Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks
for distributional reinforcement learning. In Proceedings of the International Conference
on Machine Learning, 2018a.
Will Dabney, Mark Rowland, Marc G. Bellemare, and Rémi Munos. Distributional rein-
forcement learning with quantile regression. In Proceedings of the AAAI Conference on
Artificial Intelligence, 2018b.
Peter Dayan. The convergence of TD(λ) for general λ. Machine Learning, 8(3-4):341–362,
1992.
Peter Dayan and Terrence J. Sejnowski. TD(λ) converges with probability 1. Machine
Learning, 14(3):295–301, 1994.
Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes,
Mohammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrit-
twieser, Grzegorz Swirszcz, David Silver, Demis Hassabis, and Pushmeet Kohli. Discov-
ering faster matrix multiplication algorithms with reinforcement learning. Nature, 610
(7930):47–53, 2022.
A. F. Filippov. Differential equations with discontinuous right-hand side. Mat. Sb. (N.S.),
51(93):99–128, 1960.
Hugo Gilbert and Paul Weng. Quantile reinforcement learning. In Proceedings of the Asian
Workshop on Reinforcement Learning, 2016.
Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virta-
nen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith,
Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett,
Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-
Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph
Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):
357–362, September 2020.
Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of stochastic
iterative dynamic programming algorithms. Neural Computation, 6(6):1185–1201, 1994.
Ajin George Joseph and Shalabh Bhatnagar. An adaptive and incremental approach to
quantile estimation. In IEEE Conference on Decision and Control, 2019.
Michael J. Kearns and Satinder Singh. Bias-variance error bounds for temporal difference
updates. In Proceedings of the Conference on Learning Theory, 2000.
Roger Koenker and Gilbert Bassett. Regression quantiles. Econometrica: Journal of the
Econometric Society, pages 33–50, 1978.
Roger Koenker, Victor Chernozhukov, Xuming He, and Limin Peng. Handbook of quantile
regression. CRC press, 2017.
Harold Kushner and Dean Clark. Stochastic approximation methods for constrained and
unconstrained systems. Springer, 1978.
Harold J. Kushner and G. George Yin. Stochastic approximation and recursive algorithms
and applications. Springer Science & Business Media, 2003.
David S. Leslie and Edmund J. Collins. Generalised weakened fictitious play. Games and
Economic Behavior, 56(2):285–298, 2006.
Alix Lhéritier and Nicolas Bondoux. A Cramér distance perspective on quantile regression
based distributional reinforcement learning. In Proceedings of the International Confer-
ence on Artificial Intelligence and Statistics, 2022.
Xiaocheng Li, Huaiyang Zhong, and Margaret L. Brandeau. Quantile Markov decision
processes. Operations Research, 70(3):1428–1447, 2022.
Yudong Luo, Guiliang Liu, Haonan Duan, Oliver Schulte, and Pascal Poupart. Distribu-
tional reinforcement learning with monotonic splines. In Proceedings of the International
Conference on Learning Representations, 2021.
Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht,
and Michael Bowling. Revisiting the Arcade Learning Environment: Evaluation protocols
and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–
562, 2018.
Sean Meyn. Control systems and reinforcement learning. Cambridge University Press, 2022.
Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki
Tanaka. Nonparametric return density estimation for reinforcement learning. In Proceed-
ings of the International Conference on Machine Learning, 2010a.
Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki
Tanaka. Parametric return density estimation for reinforcement learning. In Proceedings
of the Conference on Uncertainty in Artificial Intelligence, 2010b.
Steven Perkins and David S. Leslie. Asynchronous stochastic approximation with differential
inclusions. Stochastic Systems, 2(2):409–446, 2013.
Hans Rademacher. Über partielle und totale Differenzierbarkeit von Funktionen mehrerer
Variabeln und über die Transformation der Doppelintegrale. Mathematische Annalen,
79(4):340–359, 1919.
Mark Rowland, Marc G. Bellemare, Will Dabney, Rémi Munos, and Yee Whye Teh. An
analysis of categorical distributional reinforcement learning. In Proceedings of the Inter-
national Conference on Artificial Intelligence and Statistics, 2018.
Mark Rowland, Robert Dadashi, Saurabh Kumar, Rémi Munos, Marc G. Bellemare, and
Will Dabney. Statistics and samples in distributional reinforcement learning. In Proceed-
ings of the International Conference on Machine Learning, 2019.
Mark Rowland, Yunhao Tang, Clare Lyle, Rémi Munos, Marc G. Bellemare, and Will Dab-
ney. The statistical benefits of quantile temporal-difference learning for value estimation.
In Proceedings of the International Conference on Machine Learning, 2023.
John N. Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with
function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.
Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David
Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright,
Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay May-
orov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat,
Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimr-
man, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H.
Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0:
Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–
272, 2020.
Peter R. Wurman, Samuel Barrett, Kenta Kawamoto, James MacGlashan, Kaushik Sub-
ramanian, Thomas J. Walsh, Roberto Capobianco, Alisa Devlic, Franziska Eckert, Flo-
rian Fuchs, Leilani Gilpin, Piyush Khandelwal, Varun Kompella, HaoChih Lin, Patrick
MacAlpine, Declan Oller, Takuma Seno, Craig Sherstan, Michael D. Thomure, Houmehr
Aghabozorgi, Leon Barrett, Rory Douglas, Dion Whitehead, Peter Dürr, Peter Stone,
Michael Spranger, and Hiroaki Kitano. Outracing champion Gran Turismo drivers with
deep reinforcement learning. Nature, 602(7896):223–228, 2022.
Derek Yang, Li Zhao, Zichuan Lin, Tao Qin, Jiang Bian, and Tie-Yan Liu. Fully parame-
terized quantile function for distributional reinforcement learning. In Advances in Neural
Information Processing Systems, 2019.
Fan Zhou, Jianing Wang, and Xingdong Feng. Non-crossing quantile regression for distri-
butional reinforcement learning. In Advances in Neural Information Processing Systems,
2020.