0% found this document useful (0 votes)

10 views61 pages

Paper RL

This paper presents a novel policy-gradient method for model-based reinforcement learning that utilizes score-aware gradient estimators (SAGEs) to improve average-reward optimization in stochastic systems. By leveraging the stationary distributions of Markov decision processes, the authors demonstrate local convergence and performance guarantees for SAGEs, showing that they outperform traditional actor-critic methods in terms of speed and variance. The findings suggest that incorporating model-specific information can significantly enhance reinforcement learning algorithms in complex environments.

Uploaded by

matthieu.jonckheere

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

10 views61 pages

Paper RL

Uploaded by

matthieu.jonckheere

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 61

Journal of Machine Learning Research XX (2024) 1-??

Submitted XX/XX; Published XX/XX

Score-Aware Policy-Gradient and Performance Guarantees

using Local Lyapunov Stability
Céline Comte [email protected]
Matthieu Jonckheere [email protected]
CNRS and LAAS-CNRS, Toulouse, France
Jaron Sanders [email protected]
Eindhoven University of Technology, Eindhoven, The Netherlands
Albert Senen–Cerda [email protected]
LAAS-CNRS, IRIT, and Université de Toulouse, Toulouse, France

Editor:

Abstract
In this paper, we introduce a policy-gradient method for model-based reinforcement learning (RL)
that exploits a type of stationary distributions commonly obtained from Markov decision processes
(MDPs) in stochastic networks, queueing systems, and statistical mechanics. Specifically, when the
stationary distribution of the MDP belongs to an exponential family that is parametrized by policy
parameters, we can improve existing policy gradient methods for average-reward RL. Our key
identification is a family of gradient estimators, called score-aware gradient estimators (SAGEs),
that enable policy gradient estimation without relying on value-function approximation in the
aforementioned setting. We show that policy-gradient with SAGE locally converges and obtain its
regret. This includes cases when the state space of the MDP is countable and unstable policies
can exist. Under appropriate assumptions such as starting sufficiently close to a maximizer and
the existence of a local Lyapunov function, the policy under stochastic gradient ascent with SAGE
has an overwhelming probability of converging to the associated optimal policy. Furthermore,
we conduct a numerical comparison between a SAGE-based policy-gradient method and an actor–
critic method on several examples inspired from stochastic networks, queueing systems, and models
derived from statistical physics. Our results demonstrate that a SAGE-based method finds close–
to–optimal policies faster than an actor–critic method.
Keywords: reinforcement learning, policy-gradient method, exponential families, product-form
stationary distribution, stochastic approximation

1 Introduction
Reinforcement learning (RL) has become the primary tool for optimizing controls in uncertain
environments. Model-free RL, in particular, can be used to solve generic Markov decision processes
(MDPs) with unknown dynamics with an agent that learns to maximize a reward incurred upon
acting on the environment. In stochastic systems, examples of possible applications of RL can
be found in stochastic networks, queueing systems, and particle systems, where an optimal policy
is desirable. For example, a policy yielding a good routing policy, an efficient scheduling, or an
annealing schedule to reach a desired state.
As stochastic systems expand in size and complexity, however, the RL agent must deal with large
state and action spaces. This leads to several computational concerns, namely, the combinatorial
explosion of action choices, the computationally intensive exploration and evaluation of policies
(Qian et al., 2019), and a larger complexity of the optimization landscape.
One way to circumvent issues pertaining to large state spaces and/or nonconvex objective func-
tions is to include features of the underlying MDP in the RL algorithm. If the model class of the

©2024 Comte, Jonckheere, Sanders and Senen–Cerda.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at
http://jmlr.org/papers/vXX/XX-XXXX.html.
Comte, Jonckheere, Sanders and Senen-Cerda

environment is known, a model-based RL approach estimates first an approximate model of the en-
vironment in the class that can later be used to solve an MDP describing its approximate dynamics.
This approach is common in queueing networks (Liu et al., 2022; Anselmi et al., 2023). Nevertheless,
solving an approximate MDP adds a computational burden if the number of states is large.
Policy-gradient methods are learning algorithms that instead directly optimize policy parameters
through stochastic gradient ascent (SGA) (Sutton and Barto, 2018). These methods have gained
attention and popularity due to their perceived ability to handle large state and action spaces in
model-free settings (Daneshmand et al., 2018; Khadka and Tumer, 2018). Policy-gradient methods
rely on the estimation of value functions, which encode reward-weighted representations of the
underlying model dynamics. Computing such functions, however, is challenging in high-dimensional
settings and, differently to a model-based approach, key model features are initially unknown.
In this paper, we improve policy-gradient methods for some stochastic systems by incorporat-
ing model-specific information of the MDP to the gradient estimator. Specifically, we exploit the
fact that long-term average behavior of such systems are described using exponential families of
distributions. In the context of stochastic networks and queueing systems, this typically means that
the Markov chains associated to fixed policies have a product-form stationary distribution. This
structural assumption holds in various relevant scenarios, including Jackson and Whittle networks
(Serfozo, 1999, Chapter 1), BCMP networks (Baskett et al., 1975), and more recent models aris-
ing in datacenter scheduling and online matching (Gardner and Righter, 2020). By encoding this
key model feature into policy-gradient methods, we aim to expand the current model-based RL
techniques for control policies of stochastic systems.
Our primary contributions are the following:
• We present a new gradient estimator for policy-gradient methods that incorporates informa-
tion from the stationary measure of the MDP. Under an average-reward and infinite-horizon
learning setting, we namely consider policy parametrizations such that there is a known rela-
tionship between the policy on the one hand, and the MDP’s stationary distribution on the
other hand. In practice, this translates to assuming that the stationary distribution forms an
exponential family explicitly depending on the policy parameters. Using this structure, we
define score-aware gradient estimators (SAGEs), a class of estimators that exploit the afore-
mentioned assumption to estimate the policy gradient without relying on value or action–value
functions.
• We show the local convergence and regret of a SAGE-based policy-gradient under broad as-
sumptions, such as a countable state space, nonconvex objective functions, and unbounded
rewards. To do so, we first generalize the approach of Fehrman et al. (2020) to a general RL
setting that includes Markovian updates, a countable state space and does not require the
stationary distribution to be exponential. We show local convergence by first, using a local
Lyapunov function that guarantees stability of the Markov chain as long as the iterates are
close to the optimum, and second, by using that nondegeneracy of the Hessian at the optimum
holds, which allows to keep track of the updates locally. These two key elements allow to
show convergence if the bias of the gradient estimator present non-episodic settings, as well
as the variance due to gradient estimation error can be controlled. Remarkably, the local
assumptions also allow for unstable policies to exist. For policy gradient with SAGE in par-
ticular, we can then crucially estimate its bias, and variance explicitly due to the exponential
family assumption, and show convergence with large probability by using the aforementioned
approach, whenever the trajectory of the iterates gets close enough to an optimum. The con-
vergence proof approach is of independent interest, and can also be adapted to other policy
gradient-based methods as long as the bias and variance of the policy-gradient estimator can
be controlled.
• We numerically evaluate the performance of SAGE-based policy-gradient on several models
from stochastic networks, queueing systems, and statistical physics. Compared to an actor–

2
Score-Aware Policy-Gradient Methods and Performance Guarantees

critic algorithm, we observe that SAGE-based policy-gradient methods exhibit faster conver-
gence and lower variance.
Our results suggest that exploiting model-specific information is a promising approach to improve
RL algorithms, especially for stochastic networks and queueing systems. Sections 1.1 and 1.2 below
describe our contributions in more details.

1.1 Score-aware gradient estimators (SAGEs)

We introduce SAGEs for MDPs following the exponential-family assumption in Section 4. These
estimators leverage the structure of the stationary distribution, with the goal of reducing variance
and favoring stable learning. Notably, their usage requires neither knowledge nor explicit estima-
tion of model parameters, ensuring practical applicability. The key step of the derivation exploits
information on the form of the score of exponential families—that is, the gradient of the logarithm
of the probability mass function.
We can illustrate the working principle using a toy example on a countable state space S: given
a sufficient statistic x : S → Rn , the associated exponential family in canonical form is the family of
distributions with probability mass functions p(·|θ) ∝ exp(θ| x(·)) parametrized by θ ∈ Rn . Observe
now that these distributions satisfy the relation

d log(p(s|θ))
= x(s) − ES∼p( · |θ) [x(S)], s∈S (1)
dθ
and that (1) gives an exact expression for the gradient of the score.
Now, a more general version of (1) that is also applicable beyond this toy example—see Theorem 1
below—allows us to bypass the commonly used policy-gradient theorem (Sutton and Barto, 2018,
§13.2), which ties the estimation of the gradient with that of first estimating value or action–value
functions. A key aspect that SAGE practically exploits is that, in the models from queueing and
statistical physics that we will study, we fully or partially know the sufficient statistic x. Furthermore,
such models commonly possess an ‘effective dimension’ that is much lower than the size of state space
and is reflected by the sufficient statistic. For example, in a load-balancing model we consider in
the numerical section, an agnostic model-free RL algorithm would assume the number of states in
S to grow exponentially in the number n of servers. However, the state space latent representation
is actually the job count at each server, which is efficiently encoded by the sufficient statistic with
an n-dimensional vector.

1.2 Convergence of policy-gradient methods

We examine the convergence properties of the SAGE-based policy-gradient method theoretically in
Section 5. Specifically, we consider the setting of policy-gradient RL with average rewards, which
consists of finding a parameter θ such that the parametric policy π(θ) = π( · | · , θ) maximizes
T
1 hX i
J(θ) = lim E Rt . (2)
T →∞ T
t=1

Here, Rt+1 denotes the reward that is given after choosing action At while being in state St , which
happens with probability π(At |St , θ). As is common in episodic RL, we consider epochs, that is,
time intervals where the parameter θ is fixed and a trajectory of the Markov chain is observed. For
each epoch m, and under the exponential-family assumption for the stationary distribution, SAGE
yields a gradient estimator Hm from a trajectory of state-action-reward tuples (St , At , Rt+1 ) sampled
from a policy with Θm as an epoch-dependent parameter. Convergence analysis of the SAGE-based
policy-gradient method aligns with ascent algorithms like SGA by considering updates at the end

3
Comte, Jonckheere, Sanders and Senen-Cerda

of epoch m with step-size αm > 0,

Θm+1 = Θm + αm Hm . (3)

Convergence analyses for policy-gradient RL and SGA are quite standard; see Section 2. Our
work specifically aligns with the framework of (Fehrman et al., 2020) that studies local convergence of
unbiased stochastic gradient descent (SGD), that is, when the conditional estimator Hm of ∇J(Θm )
on the past F is unbiased, which is typical in a supervised learning setting. An important part of
our work consists in expanding the results of (Fehrman et al., 2020) to the case of Markovian data,
leading to biased estimators (i.e., E[Hm |F] 6= ∇J(Θm )). In our RL setting, we handle potentially
unbounded rewards and unbounded state spaces as well as the existence of unstable policies. We
also assume an online application of the policy-gradient method, where restarts are impractical or
costly: the last state of the prior epoch is used as the initial state for the next, distinguishing our
work from typical episodic RL setups where an initial state S0 is sampled from a predetermined
distribution.
Our main result in Section 5 shows convergence of iterates in (3) to the set M that attain the
maximum J ? of (2), assuming nondegeneracy of J on M and existence of a local Lyapunov function.
If the trajectory of SGA ends up within a sufficiently small neighborhood V of a maximizer θ? ∈ M,
with appropriate epoch length and step-sizes, convergence to M occurs with large probability: for
any epoch m > 0 and > 0, if Θ0 is the first iterate in V ,
α2
P[J ? − J(Θm ) > |Θ0 ∈ V ] ≤ O −2 m−σ−κ + m1−σ/2−κ/2 + m−κ/2 + , (4)
`
where the parameters σ ∈ (2/3, 1), κ > 0, α ∈ (0, α0 ], and ` ∈ [`0 , ∞) depend on the step and
batch sizes and can be tuned to make the bound in (4) arbitrarily small. While focusing on the
global optimum, the bound (4) also holds under the same assumptions in case J ? is a local optimum
instead.
Our key assumption relies on the existence of a local Lyapunov function in the neighborhood V .
Hence, we need only to assume stability of policies that are close to the optimum. This sets our work
further apart from others in the RL literature, which typically require existence of a global Lyapunov
function and/or finite state space. In fact, our numerical results in Section 6 show an instance where
local stability suffices, highlighting the benefits of SAGE. The set M of global maxima is also not
required to be finite or convex, thanks to the local nondegeneracy assumption.
For large m, the bound in (4) can be made arbitrarily small by setting the initial step size α
and batch size ` small and large, respectively. In (4), the chance that the policy escapes the set V ,
outside which stability cannot be guaranteed, does not vanish when m → ∞; it remains as α2 /`. We
show that this term is inherent to the function approximation of the gradient and cannot be avoided.
Specifically, for any β > 0, there are functions f such that P[f (θ? ) − f (Θm ) > |Θ0 ∈ V ] > cα2+β /`
for some c > 0. Hence, a lower bound shows that the proof method cannot be improved without
using additional structure of Hm or J. Furthermore, our proof can be adapted to other generic
policy-gradients that have similar bounds on the gradient estimator Hm , thus showing that such
phenomenon not just happens with SAGE but with any other policy-gradient algorithm with similar
gradient approximation properties.
Denoting the total number of samples drawn by the algorithm by T , from (4) we obtain a regret
bound of our algorithm when reaching the set V . In the case of bounded rewards, we namely show
that for any 1 > > 0, if Θ0 is the first iterate in V , we have for any T ≥ 1,
T
h X i 1 2 α2
E T J? − r(St , At ) Θ0 ∈ V = O (L? ) 3 T 3 + + T . (5)
t=1
`

The linear term in (5) depends on the previous approximation error of the policy gradient and
has been seen in other recent works such as (Abbasi-Yadkori et al., 2019), and also recently for

4
Score-Aware Policy-Gradient Methods and Performance Guarantees

countable state spaces (Murthy et al., 2024). The other term is sublinear, and its coefficient L?
characterizes an ‘effective’ size of the state space when the algorithm is stable, and directly depends
on the local Lyapunov function. For a given T , we can find `—a term related to the batchsize and
thus approximation error—such that the regret becomes O(T 3/4+ ). Remarkably, while we start
from a stable policy, the expectation in (5) includes trajectories where policies may be unstable.
When the reward is unbounded, we similarly obtain a bound without a linear term if we restrict to
trajectories in V .
For cases where the optimum is reached only as Θm → ∞, as with deterministic policies, we
additionally show that adding a small entropy regularization term to J(θ) allows us to ensure not
just that maxima are bounded but also that M satisfies the nondegeneracy assumption required to
show local convergence.

1.3 Numerical experiments

After the introduction and theoretical analysis of the SAGE-based policy-gradient algorithm, we fi-
nally assess its applicability in Section 6 by comparing its performance with the actor–critic algorithm
on three models from queueing systems, stochastic networks, and statistical physics. Specifically,
we consider an admission control problem on the M/M/1 queue, a load balancing system, and the
Ising model with Glauber dynamics.
The numerical results on these examples suggest that, when applicable, SAGEs can expedite
convergence towards an optimal policy (compared to actor–critic) by leveraging the structure of the
stationary distribution. Furthermore, the lower variance of SAGE becomes decisive when stability
is not guaranteed for all policies. Namely, we observe in an example that the SAGE-based policy-
gradient method converges to a close–to–optimal policy even if some policies are unstable, provided
that a stable policy is used as initialization. This behavior contrasts with actor–critic, whose out-
put policies are not always stable. SAGE also reproduces a well-known phenomenon in annealing
schedules for Ising models. Specifically, the agent momentarily increases the temperature in order
to escape stable states that do not correspond to the global optimum.

2 Related works
The work in the present manuscript resides at the intersection of distinct lines of research. We
therefore broadly review, relate, and position our work to other research in this section.

2.1 Gradient estimation, exponential families, and product forms

Operations on high-dimensional probability distributions, such as marginalization and inference, are
numerically intractable in general. Exponential families—see Section 4.1 for a definition—are para-
metric sets of distributions that lead to more tractable operations and approximations while also
capturing well-known probability distributions, such as probabilistic graphical models (Wainwright
and Jordan, 2018), popular in machine learning. In the context of stochastic networks and queue-
ing systems, the stationary distribution of many product-form systems can be seen as forming an
exponential family.
Our first contribution is related to several works on exponential families, product-form distri-
butions, and probabilistic graphical models. Key parameters in these distributions are numerically
intractable a priori, but can be expressed as expectations of random vectors that can be sampled
by simulation. The most basic and well-known result, which appears in Section 1.1 and will be
exploited in Section 4.2, rewrites the gradient of the logarithm of the normalizing constant (a.k.a.
the log-partition function) as the expectation of the model’s sufficient statistics. In probabilistic
graphical models, this relation has been mainly used to learn a distribution that best describes a
dataset via SGD (Wainwright and Jordan, 2018; Koller and Friedman, 2009). In stochastic networks,

5
Comte, Jonckheere, Sanders and Senen-Cerda

this relation has been applied to analyze systems with known parameters, for instance to predict
their performance (de Souza e Silva and Muntz, 1988; Zachary and Ziedins, 1999; Bonald and Vir-
tamo, 2004; Shah, 2011; Shah and de Veciana, 2015), to characterize their asymptotic behavior in
scaling regimes (Shah, 2011; Shah and de Veciana, 2015), for sensitivity analysis (de Souza e Silva
and Muntz, 1988; Liu and Nain, 1991), and occasionally to optimize control parameters via gradient
ascent (Liu and Nain, 1991; de Souza e Silva and Gerla, 1991; Shah, 2011).
To the best of our knowledge, an approach similar to ours is found only in Sanders et al. (2016).
This work derives a gradient estimator and performs SGA in a class of product-form reversible
networks. However, the procedure requires first estimating the stationary distribution, convergence
is proven only for convex objective functions, and the focus is more on developing a distributed
algorithm than on canonical RL. The algorithm in Jiang and Walrand (2009) is similarly noteworthy,
although the focus there is on developing a distributed control algorithm specifically for wireless
networks and not general product-form networks.

2.2 Stochastic gradient ascent (SGA) and policy-gradient methods

When a gradient is estimated using samples from a Markov chain, methods from Markov Chain
Monte Carlo (MCMC) are commonly used (Mohamed et al., 2020). In our case, we have moreover
bias from being unable to restart the chain at each epoch. Convergence of biased SGD to approximate
stationary points of smooth nonconvex functions—points θ such that |∇J(θ)| < for some > 0—
has been addressed in the literature (Tadic and Doucet, 2017; Atchadé et al., 2017; Karimi et al.,
2019; Doan et al., 2020). The asymptotic conditions for local convergence to a stationary point
were first investigated in Tadic and Doucet (2017), where conditions for the asymptotic stochastic
variance of the gradient estimator and bias were assumed (see Assumptions 2.1–2.3 in Tadic and
Doucet (2017)). In Karimi et al. (2019), a nonasymptotic analysis of biased SGD is shown. Under
Lipschitz assumptions on the transition probabilities and bounded variance of the gradient estimator
?
Hm , in Karimi et al. (2019) √ it is shown that under appropriate step-sizes, for some m ≤ T ,
E[|∇J(Θm∗ )|2 ] = O(log(T )/ T ), where T is a time horizon. In Tadic and Doucet (2017); Karimi
et al. (2019), these results are applied in a RL context. While these works demonstrate convergence
to stationary points, our contribution lies in proving convergence to a maximum, albeit locally. This
approach is essential for addressing scenarios with only local assumptions and potentially unstable
(not positive recurrent) policies.
Finally, several recent works build on gradient domination for policy-gradient methods, address-
ing convexity limitations and ensuring global convergence (Fazel et al., 2018; Agarwal et al., 2021;
Xiao, 2022; Kumar et al., 2024). Notable differences to our work are their use of a finite state
space and that we do not consider an episodic setting with restarts, as well as distinct structural
assumptions on policy parametrization like natural gradients. In (Murthy et al., 2024), a convergent
natural policy-gradient (NPG) is considered for countable state spaces where the cost is the norm
of the state, i.e., the queue length in the queueing setting. In this paper a (stable) max-weight
policy is applied with a probability that tends to 1 as the queue size increases, therefore avoiding
instability issues. We instead do not use a stabilizing policy, but require to start sufficiently close to
the optimal policy. Another unique aspect of our contribution lies in specialized gradient estimation
schemes based on the exponential family assumption on the stationary distribution, which crucially
avoids estimating or learning value functions, common in all previous works.
A succint non-exhaustive table is provided to guide interested readers to key references within
this vast body of literature.

6
Score-Aware Policy-Gradient Methods and Performance Guarantees

Method Context/State Space Main Assump- Convergence Key Assumptions and

tions Links
SGD Markovian data with convex Lipschitz transi- Local convergence Tadic and Doucet (2017);
objectives tion probabilities, Daneshmand et al. (2018);
bounded variance Karimi et al. (2019)
estimators
SGD Iid data with non convex ob- Local convexity Local convergence Fehrman et al. (2020)
jectives
Policy Gradient RL, Finite state space Gradient Domina- Global convergence Agarwal et al. (2021), Ku-
tion mar et al. (2024).
Policy Gradient RL, Finite state space ABC condition Global convergence Yuan et al. (2022).
Natural Policy Gradient RL, Finite state space Gradient domina- Global convergence Liu et al. (2020); Agarwal
tion / entropy reg- et al. (2021), Cen et al.
ularization, (2022)
Natural Policy Gradient RL, Countable state space Uniform Lyapunov Global convergence Murthy et al. (2024)
control
Policy gradient for product- Countable state space Convex objec- Global convergence for con- Sanders et al. (2016)
form Networks tives, requires vex objectives
pre-computed
stationary dis-
tribution for the
Markov chain.
SAGE Countable Lyapunov as- Local convergence This paper
sumptions, Non-
degenerate Hes-
sian, Application
to exponential
family stationary
distributions

3 Problem formulation
3.1 Basic notation
The sets of nonnegative integers, positive integers, reals, and nonnegative reals are denoted by N,
N+ , R, and R≥0 , respectively. For a differentiable function f : θ ∈ Rn 7→ f (θ) ∈ R, ∇f (θ) denotes
the gradient of f taken at θ ∈ Rn , that is, the n-dimensional column vector whose j-th component
is the partial derivative of f with respect to θj , for j ∈ {1, 2, . . . , n}. If f is twice differentiable,
Hessθ f denotes the Hessian of f at θ, that is, the n × n matrix of second partial derivatives. For
a differentiable vector function f : θ ∈ Rn 7→ f (θ) = (f1 (θ), . . . , fd (θ)) ∈ Rd , Df (θ) is the Jacobian
|
matrix of f taken at θ, that is, the d × n matrix whose i-th row isp∇fi (θ) , for i ∈ {1, 2, . . . , d}.
For a vector x = (x1 , . . . , xn ) ∈ Rn , we denote its l2 -norm by |x| = x21 + · · · + x2n . We define the
operator norm of a matrix A ∈ Ra×b as |A|op = supx∈Rb :|x|=1 |Ax|. We use uppercase to denote
random variables and vectors, and a calligraphic font for their sets of outcomes.

3.2 Markov decision process (MDP)

We consider a Markov decision process (MDP) with countable state, action, and reward spaces S, A,
and R, respectively, and transition probability kernel P : (s, a, r, s0 ) ∈ S ×A×R×S 7→ P (r, s0 |s, a) ∈
[0, 1], where P (r, s0 |s, a) gives the conditional probability that the next reward–state pair is (r, s0 )
given that the current state-action pair is (s, a). With a slight abuse of notation, we introduce, for
s, s0 ∈ S, a ∈ A, and r ∈ R,
X X
P (r|s, a) = P (r, s0 |s, a) and P (s0 |s, a) = P (r, s0 |s, a).
s0 ∈S r∈R

All our results also generalize to absolutely continuous rewards; an example will appear in Section 6.1.
Following the framework of policy-gradient algorithms (Sutton and Barto, 2018, Chapter 13),
we assume that the agent is given a random policy parametrization π : (s, θ, a) ∈ S × Rn × A →
π(a|s, θ) ∈ (0, 1), such that π(a|s, θ) is the conditional probability that the next action is a ∈ A
given that the current state is s ∈ S and the parameter vector θ ∈ Rn . We assume that the function
θ 7→ π(a|s, θ) is differentiable for each (s, a) ∈ S × A. The goal of the learning algorithm will be to
find a parameter (vector) that maximizes the long-run average reward, as will be defined formally
in Section 3.3.

7
Comte, Jonckheere, Sanders and Senen-Cerda

As a concrete example, we will often consider a class of softmax policies that depend on a feature
extraction map ξ : S × A → Rn as follows:
|
eθ ξ(s,a)
π(a|s, θ) = P , s ∈ S, a ∈ A, (6)
a0 ∈A eθ| ξ(s,a0 )
The feature extraction map ξ may leverage prior information on the system dynamics. In queueing
systems for instance, we may decide to make similar decisions in large states, as these states are
typically visited rarely, and it may be beneficial to aggregate the information collected about them.

3.3 Stationary analysis and optimality criterion

Given θ ∈ Rn , if the agent applies the policy π(θ) : (s, a) ∈ S × A 7→ π(a|s, θ) at every time step,
the random state-action-reward sequence ((St , At , Rt+1 ), t ∈ N) obtained by running this policy is a
Markov chain such that, for each s, s0 ∈ S, a ∈ A, and r ∈ R, we have P[At = a|St = s] = π(a|s, θ)
and P[Rt+1 = r, St+1 = s0 |St = s, At = a] = P (r, s0 |s, a). The dependency of the random variables
on the parameter is left implicit to avoid cluttering notation. Leaving aside actions and rewards,
the state sequence (St , t ∈ N) also defines a Markov chain, with transition probability kernel P (θ) :
(s, s0 ) ∈ S × S 7→ P (s0 |s, θ) given by
X
P (s0 |s, θ) = π(a|s, θ)P (s0 |s, a), s, s0 ∈ S.
a∈A

In the remainder, we will assume that Assumptions 1 and 2 below are satisfied.
Assumption 1 There exists an open set Ω ⊆ Rn such that, for each θ ∈ Ω, the Markov chain
(St , t ∈ N) with transition probability kernel P (θ) is irreducible and positive recurrent.
Thanks to Assumption 1, for each θ ∈ Ω, the corresponding Markov chain (St , t ∈ N) has a unique
stationary distribution p(·|θ). We say that a triplet (S, A, R) of random variables is a stationary
state-action-reward triplet, and we write (S, A, R) ∼ stat(θ), if (S, A, R) follows the stationary
distribution of the Markov chain ((St , At , Rt+1 ), t ∈ N), given by
P[S = s, A = a, R = r] = p(s|θ)π(a|s, θ)P (r|s, a), s ∈ S, a ∈ A, r ∈ R. (stat(θ))
Assumption 2 For each θ ∈ Ω, the stationary state-action-reward triplet (S, A, R) ∼ stat(θ) is
such that the random variables |R|, |R ∇ log p(S|θ)|, and |R ∇ log π(A|S, θ)| have a finite expectation.
PT
By ergodicity (Brémaud, 1999, Theorem 4.1), the running average reward T1 t=1 Rt tends to
J(θ) in (2) almost surely as T tends to infinity. J(θ) is called the long-run average reward and is
also given by
XX X
J(θ) = E[R] = p(s|θ)π(a|s, θ)P (r|s, a)r, θ ∈ Ω. (7)
s∈S a∈A r∈R

Our end goal, further developed in Section 3.4, is to find a learning algorithm that maximizes the
objective function J. For now, we only observe that the objective function J : θ ∈ Ω 7→ J(θ) is
differentiable thanks to Assumption 2, and that its gradient is given by
XX X
∇J(θ) = p(s|θ)π(a|s, θ)P (r|s, a)r(∇ log p(s|θ) + ∇ log π(a|s, θ)), θ ∈ Ω. (8)
s∈S a∈A r∈R

In general, computing ∇J(θ) using (8) is challenging: (i) computing ∇ log p(s|θ) is in itself challeng-
ing because p(s|θ) depends in a complex way on the unknown transition kernel P (r, s0 |s, a) and the
parameter θ via the policy π(θ), and (ii) enumerating and thus summing over the state space S is
often practically infeasible (for instance, when the state space S is infinite and/or high-dimensional).
Our first contribution, in Section 4, is precisely a new family of estimators for the gradient (8).

8
Score-Aware Policy-Gradient Methods and Performance Guarantees

3.4 Learning algorithm

In Section 3.3, we defined the objective function J by considering trajectories where the agent applied
a policy π(θ) parametrized by a constant vector θ. Going back to a learning setting, we consider a
state-action-reward sequence ((St , At , Rt+1 ), t ∈ N) and a parameter sequence (Θm , m ∈ N) obtained
by updating the parameter periodically according to the gradient-ascent step Θm+1 = Θm + αm Hm
introduced in (3), where Hm is provided by a family of learning algorithms, called policy gradient.
The pseudocode of a generic policy-gradient algorithm, shown in Algorithm 1, is parametrized by a
sequence 0 , t0 < t1 < t2 < . . . of observation times and a sequence α0 , α1 , α2 , . . . > 0 of step sizes.
For each m ∈ N, Dm denotes batch m, obtained by applying policy π(Θm ) at epoch m, given by

Dm = (St , At , Rt+1 ), t ∈ {tm , . . . , tm+1 − 1} . (9)

Given some initialization Θ0 , Algorithm 1 calls a procedure Gradient that computes an estimate
Hm of ∇J(Θm ) from Dm , and it updates the parameter according to (3).
As discussed at the end of Section 3.3, finding an estimator Hm for ∇J(Θm ) directly from (7) is
difficult in general. A common way to obtain Hm follows from the policy-gradient theorem (Sutton
and Barto, 2018, Chapter 13), which instead writes the gradient ∇J(θ) using the action-value
function q:

∇J(θ) = E[q(S, A) ∇ log π(A|S, θ)],

where (S, A, R) ∼ stat(θ), for each θ ∈ Ω. Consistently, in a model-free setting, policy-gradient

methods like the actor–critic algorithm recalled in Appendix A.1 estimate ∇J(Θm ) by first estimat-
ing a value function. However, this approach can suffer from high-variance of the estimator, which
slows down convergence, as described in Section 1. Some of these problems can be circumvented by
exploiting the problem structure, as we will see now.

4 Score-aware gradient estimator (SAGE)

We now define the key structural assumption in our paper. Namely, that we have information on
the impact of the policy parameter θ on the stationary distribution p. In Section 4.2, we will use
this assumption to build a new family of estimators for the gradient ∇J that do not involve the
state-value function, contrary to actor–critic. In Section 4.3, we will further explain how to use this
insight to design a SAGE-based policy-gradient method.

Algorithm 1 Generic policy-gradient algorithm. Examples of Gradient procedure, based on

different estimators for the gradient ∇J, are given in Algorithms 2 and 3. All variables of Algorithm 1
are accessible within the Gradient procedure.
1: Input: • Observation times 0 , t0 < t1 < t2 < . . .
• Step size sequence α0 , α1 , α2 , . . . > 0
• Positive and differentiable policy parametrization (s, θ, a) 7→ π(a|s, θ)
2: Initialization: Policy parameter Θ0 ∈ Ω and initial state S0 ∈ S
3: Main loop:
4: for m = 0, 1, 2, . . . do
5: for t = tm , . . . , tm+1 − 1 do
6: Sample At ∼ π(·|St , Θm )
7: Take action At and observe Rt+1 , St+1
8: end for
9: Update Θm+1 ← Θm + αm Gradient(m)
10: end for

9
Comte, Jonckheere, Sanders and Senen-Cerda

4.1 Product-form and exponential family

As announced in the introduction, our end goal is to design a gradient estimator capable of exploiting
information on the stationary distribution p(·|θ) of the MDP when such information is available.
Assumption 3 below formalizes this idea by assuming that the stationary distribution forms an
exponential family parametrized by the policy parameter θ.

Assumption 3 (Stationary distribution) There exist a scalar function Φ : S → R>0 , an integer

d ∈ N+ , a differentiable vector function ρ : Ω → Rd>0 , and a vector function x : S → Rd such that
the following two equivalent equations are satisfied:
d
1 Y x (s)
p(s|θ) = Φ(s) ρi (θ) i , s ∈ S, θ ∈ Ω, (10–PF)
Z(θ) i=1
log p(s|θ) = log Φ(s) + log ρ(θ)| x(s) − log Z(θ), s ∈ S, θ ∈ Ω, (10–EF)

where the partition function Z : Ω → R>0 follows by normalization:

d
X Y xi (s)
X |
Z(θ) = Φ(s) ρi (θ) = elog Φ(s)+log ρ(θ) x(s)
, θ ∈ Ω. (11)
s∈S i=1 s∈S

We will call Φ the balance function, ρ the load function, and x the sufficient statistics.

(10–PF) is the product-form variant of the stationary distribution, classical in queueing theory.
(10–EF) is the exponential-family description of the distribution. This latter representation is more
classical in machine learning (Wainwright and Jordan, 2018) and will simplify our derivations. Let
us briefly discuss the implications of this assumption as well as examples where this assumption is
satisfied.
Assumption 3 implies that the stationary distribution p depends on the policy parameter θ only
via the load function ρ. Yet, this assumption may not seem very restrictive a priori. Assuming
for instance that the state space S is finite, with S = {s1 , s2 , . . . , sN }, we can write the stationary
distribution in the form (10) with d = N , ρi (θ) = p(si |θ), xi (s) = 1[s = si ], and Φ(s) = Z(θ) = 1,
for each θ ∈ Rn , s ∈ S, and i ∈ {1, 2, . . . , N }. However, writing the stationary distribution in this
form is not helpful, in the sense that in general the function ρ will be prohibitively intricate. As
we will see in Section 4.2, what will prove important in Assumption 3 is that the load function ρ is
simple enough so that we can evaluate its Jacobian matrix function D log ρ numerically.
There is much literature on stochastic networks and queueing systems with a stationary distri-
bution of the form (10–PF). Most works focus on performance evaluation, that is, evaluating J(θ)
for some parameter θ ∈ Ω, assuming that the MDP’s transition probability kernel is known. In this
context, the product-form (10–PF) arises in Jackson and Whittle networks (Serfozo, 1999, Chap-
ter 1), BCMP networks (Baskett et al., 1975), as well as more recent models arising in datacenter
scheduling and online matching (Gardner and Righter, 2020)1 . Building on this literature, in Sec-
tion 6, we will consider policy parametrizations for control problems that also lead to a stationary
distribution of the form (10).
In the next section, we exploit Assumption 3 to construct a gradient estimator that requires
knowing the functions D log ρ and x but not the functions ρ, Φ, and Z.
1. Although the distributions recalled in (Gardner and Righter, 2020, Theorems 3.9, 3.10, 3.13) do not seem to fit
the framework of (10) a priori because the number of factors in the product can be arbitrarily large, some of
these distributions can be rewritten in the form (10) by using an expanded state descriptor, as in (Adan et al.,
2017, Equation (4), Corollary 2, and Theorem 6) and (Moyal et al., 2021, Equation (7) and Proposition 3.1).

10
Score-Aware Policy-Gradient Methods and Performance Guarantees

4.2 Score-aware gradient estimator (SAGE)

As our first contribution, Theorem 1 below gives simple expressions for ∇ log p(s|θ) and ∇J(θ) under
Assumptions 1 to 3. Gradient estimators that will be formed using (13) will be called score-aware
gradient estimators (SAGEs), to emphasize that the estimators rely on the simple expression (12)
for the score ∇ log p(s|θ). Particular cases of this result have been obtained in de Souza e Silva
and Muntz (1988); de Souza e Silva and Gerla (1991); Liu and Nain (1991) for specific stochastic
networks; our proof is shorter and more general thanks to the exponential form (10–EF).

Theorem 1 Suppose that Assumptions 1 to 3 hold. For each θ ∈ Ω, we have

∇ log p(s|θ) = D log ρ(θ)| (x(s) − E[x(S)]), (12)

|
∇J(θ) = D log ρ(θ) Cov[R, x(S)] + E[R∇ log π(A|S, θ)], (13)

where (S, A, R) ∼ stat(θ), Cov[R, x(S)] = (Cov[R, x1 (S)], . . . , Cov[R, xd (S)])| , and the gradient
and Jacobian operators, ∇ and D respectively, are taken with respect to θ.

Proof Applying the gradient operator to the logarithm of (11) and simplifying yields

∇ log Z(θ) = D log ρ(θ)| E[x(S)]. (14)

This equation is well-known and was already discussed in Section 2.1. Equation (12) follows by ap-
plying the gradient operator to (10–EF) and injecting (14). Equation (13) follows by injecting (12)
into (8) and simplifying.

Assuming that the functions D log ρ and x are known in closed-form, Theorem 1 allows us to con-
struct an estimator of ∇J(θ) from a state-action-reward sequence ((St , At , Rt+1 ), t ∈ {0, 1, . . . , T })
obtained by applying policy π(θ) at every time step as follows:

H = D log ρ(θ)| C + E, (15)

where C and E are estimators of Cov[R, x(S)] and E[R ∇ log π(A|S, θ)], respectively, obtained for
instance by taking the sample mean and sample covariance. An estimator of the form (15) will be
called a score-aware gradient estimator (SAGE). This idea will form the basis of the SAGE-based
policy-gradient method that will be introduced in Section 4.3. Observe that such an estimator will
typically be biased since the initial state S0 is not stationary. Nonetheless, we will show in the proof
of the convergence result in Section 5 that this bias does not prevent convergence.
The advantage of using a SAGE is twofold. First, the challenging task of estimating ∇J(θ)
is reduced to the simpler task of estimating the d-dimensional covariance Cov[R, x(S)] and the
n-dimensional expectation E[R ∇ log π(A|S, θ)], for which leveraging estimation techniques in the
literature is possible. Also recall that the gradient estimator used in the actor–critic algorithm
(Appendix A.1) relies on the state-value function, so that it requires estimating |S| values; we
therefore anticipate SAGEs to yield better performance when max(n, d) |S|; see examples from
Sections 6.2 and 6.3. Second, as we will also observe in Section 6, SAGEs can “by-design” exploit
information on the structure of the policy and stationary distribution. Actor–critic exploits this
information only indirectly due to its dependency on the state-value function.

4.3 SAGE-based policy-gradient algorithm

Algorithm 2 introduces a SAGE-based policy-gradient method based on Theorem 1. For each m ∈ N,
the Gradient(m) procedure is called in the gradient-update step (Line 9) of Algorithm 1, at the end
of epoch m, and returns an estimate of ∇J(Θm ) based on batch Dm , defined in (9). Algorithm 2 can
be understood as follows. According to Theorem 1, we have ∇J(Θm ) = D log ρ(Θm )| Cov[R, x(S)] +

11
Comte, Jonckheere, Sanders and Senen-Cerda

E[R∇ log π(A|S, Θm )] with (S, A, R) ∼ stat(Θm ). Lines 4 to 6 estimate Cov[R, x(S)] using the usual
sample covariance estimator. Line 7 estimates E[R∇ log π(A|S, θ)] using the usual sample mean
estimator. To simplify the signature of Gradient(m), we assume all variables from Algorithm 1,
in particular batch Dm , are accessible within Algorithm 2. The variable Nm computed on Line 3 is
the batch size, i.e., the number of samples used to estimate the gradient ∇J(Θm ), and we assume
it is greater than or equal to 2. An alternate implementation of the SAGE-based policy-gradient
method that allows for batch sizes equal to 1 is given in Appendix A.2.

Algorithm 2 SAGE-based policy-gradient method, to be called on Line 9 of Algorithm 1.

1: Input: • Positive and differentiable policy parametrization (s, θ, a) 7→ π(a|s, θ)
• Jacobian matrix function θ 7→ D log ρ(θ)
• Feature function s 7→ x(s)
2: procedure Gradient(m)
3: Nm ← tm+1P − tm
tm+1 −1
4: X m ← N1m t=t m
x(St )
1 t m+1 −1
P
5: Rm ← Nm t=tm Rt+1
Ptm+1 −1
6: C m ← Nm1−1 t=t m
(x(St ) − X m )(Rt+1 − Rm )
1
Ptm+1 −1
7: E m ← Nm t=tm Rt+1 ∇ log π(At |St , Θm )
8: return D log ρ(Θm )C m + E m
9: end procedure

Recall that our initial goal was to exploit information on the stationary distribution, when such
information is available. Consistently, compared to actor–critic (Appendix A.1), the SAGE-based
method of Algorithm 2 requires as input the Jacobian matrix function D log ρ and the sufficient
statistics x. In return, as we will see in Sections 5 and 6, the SAGEs-based method relies on a
lower-dimensional estimator whenever max(n, d) |S|, which can lead to an improved convergence.

5 A local convergence result

Our goal in this section is to study the limiting behavior of Algorithm 2. To do so, we will consider
this algorithm as an SGA algorithm that uses biased gradient estimates. The gradient estimates are
biased because they arise from the MCMC estimations from Lines 4–7 in Algorithm 2. Throughout
the proof, we will assume for simplicity that the reward is a deterministic function r : S × A → R.
Under this assumption, for each m ∈ N, Algorithm 2 follows the gradient ascent step (3), with
 Ptm+1 −1 Ptm+1 −1
t=tm x(St ) r(St , At )
, Rm = t=tm

Xm = ,


tm+1 − tm tm+1 − tm







 Ptm+1 −1
x(St ) − X m r(St , At ) − Rm

| t=t
Hm = D log ρ(Θm ) C m + E m , where C m = m
, (16)


 tm+1 − tm − 1


 Ptm+1 −1
r(St , At )∇ log π(At |St , Θm )


t=tm

 Em = ,


tm+1 − tm

The estimates X m , Rm , and C m are functions of Dm , while Hm and E m are functions of Dm and
Θm . We will additionally apply decreasing step sizes and increasing batch sizes of the form
α σ
αm = and tm+1 = tm + `m 2 +κ for each m ∈ N, (17)
(m + 1)σ

12
Score-Aware Policy-Gradient Methods and Performance Guarantees

for some parameters α ∈ (0, ∞), ` ∈ (1, ∞), σ ∈ (2/3, 1), and κ ∈ [0, ∞).
Our goal—to study the limiting algorithmic behavior of Algorithm 2—is equivalent to studying
the limiting algorithmic behavior of the stochastic recursion (3). In particular, we will focus on the
local convergence of the iterates of (3) and (16) to the following set of global maximizers:

M = {θ ∈ Ω : J(θ) = J ? }, where J ? = sup J(θ). (18)

θ∈Ω

We will assume M to be nonempty, that is, M = 6 ∅ and at least one θ ∈ Ω satisfies J(θ) =
J ? . The assumptions that we consider (Assumption 7 below) allow us to assume that M is only
locally a manifold. Consequently, J can be nonconvex with noncompact level-subsets, and J is even
allowed not to exist outside the local neighborhood; namely, the policies may be unstable. In this
latter case θ ∈/ Ω, and we will use the convention that if θ yields an unstable policy, then J(θ) =
inf s∈S,a∈A r(s, a) ≥ −∞. While the previous assumptions allow for general objective functions, the
convergence will be guaranteed close to the set of maxima M, or to a set of local maxima that
satisfy equivalent assumptions.

5.1 Assumptions pertaining to algorithmic convergence

We use the Markov chain of state-action pairs. Specifically, consider the pairs {(St , At )}t≥0 ⊂ S ×A,
where At is generated according to policy π( · |St , θ). For a given θ ∈ Ω, the one-step transition
probability and the stationary distribution of this Markov chain are

P ((s0 , a0 )|(s, a), θ) = π(a0 |s0 , θ)P (s0 |s, a), (19)
p̃((s, a)|θ) = p(s|θ)π(a|s, θ) for (s, a) ∈ S × A. (20)

The following are assumed:

Assumption 4 There exists a function L : S × A → [1, ∞) such that, for any θ? ∈ M, there exist
a neighborhood U of θ? in Ω and four constants λ ∈ (0, 1), C > 0, b ∈ R+ , and v ≥ 16 such that,
for each θ ∈ U , the policy π( · | · , θ) is such that
X
P ((s0 , a0 )|(s, a), θ)(L(s0 , a0 ))v ≤ λ(L(s, a))v + b, for each (s, a) ∈ S × A,
(s0 ,a0 )∈S×A

and, for each ` ∈ N+ and (s, a), (s0 , a0 ) ∈ S × A,

P ` ((s0 , a0 )|(s, a), θ) − p̃((s0 , a0 )|θ) ≤ Cλ` L(s, a),

where P ` (θ) is the `-step transition probability kernel of the Markov chain with transition probability
kernel (19) .

Assumption 5 There exists a constant C > 0 such that |D log ρ(θ)|op < C for each θ ∈ Ω.

Assumption 6 Let L be the Lyapunov function from Assumption 4. For any θ? ∈ M, if U is a

local neighborhood satisfying the conditions of Assumption 4, then there exists a constant C > 0 such
that for any θ ∈ U and (s, a) ∈ S × A,

|x(s)| < CL(s, a), |r(s, a)| < CL(s, a), |r(s, a)∇ log π(a|s, θ)| < CL(s, a). (21)

Assumption 7 There exist an integer n ∈ {0, 1, . . . , n − 1} and an open subset U ⊆ Ω such that
(i) M ∩ U is a nonempty n-dimensional C 2 -submanifold of Rn , and (ii) the Hessian of J at θ? has
rank n − n, for each θ? ∈ M ∩ U .

13
Comte, Jonckheere, Sanders and Senen-Cerda

These assumptions have the following interpretation. Assumption 4 formalizes that the Markov
chain is stable for policies close to the maximum. Remarkably, it does not assume that the chain
is geometrically ergodic for all policies, only for those close to an optimal policy when θ ∈ Ω. This
stability is guaranteed by a local Lyapunov function L uniformly over some neighborhood close to
a maximizer. In the notation for b, and λ ∈ (0, 1) of Assumption 4, if S0 is the initial state, a term
that will later bound the size of an ‘effective’ state space in the regret of the algorithm is

b
L? = max , max L(S 0 , a)v
. (22)
(1 − λ)2 a∈A

Assumptions 5 and 6 together guarantee that the estimator Hm concentrates around ∇J(Θm )
at an appropriate rate. Assumption 5 is easy to verify in our examples since ρ is always positive
and bounded. Assumption 6 guarantees that the reward r(s, a) and sufficient statistics x(s) cannot
grow fast enough in s to perturb the stability of the MDP. In many applications from queueing
Assumption 6 holds. Namely, S is usually a normed space and the order of the Lyapunov function
L(s, a) is exponential in the norm of the state s ∈ S, compared to the sufficient statistic x which has
an order linear in the norm of s. We remark that, in a setting with a bounded reward function r
and a bounded map x or with a finite state space, Assumption 6 becomes trivial.
Assumption 7 is a geometric condition. It guarantees that, locally around the set of maxima
M or set of local maxima satisfying the same assumptions, in directions perpendicular to M, J
behaves approximately in a convex manner. Concretely, this means that Hessθ J has strictly negative
eigenvalues in the directions normal to M—also referred to as the Hessian being nondegenerate.
Thus, there is one-to-one correspondence between local directions around θ ∈ M that decrease J
and directions that do not belong to the tangent space of M. Strictly concave functions satisfy
that n = 0 and Assumption 7 is thus automatically satisfied in such cases. If M ∩ U = {θ? } is a
singleton, Assumption 7 reduces to assuming that Hessθ? J is negative definite. Assumption 7 in a
general setting can be difficult to verify, but by adding a regularization term, it can be guaranteed
to hold in a broad sense (see Section 5.5).

5.2 Local convergence results

This is our main convergence result for the case that the set of maxima is not necessarily bounded.

Theorem 2 (Noncompact case) Suppose that Assumptions 1 to 7 hold. For every maximizer
θ? ∈ M ∩ U , there exist constants c > 0 and α0 > 0 such that, for each α ∈ (0, α0 ], there exists a
nonempty neighborhood V of θ? and `0 ≥ 1 such that, for each ` ∈ [`0 , ∞), σ ∈ (2/3, 1), κ ∈ [0, ∞)
with σ + κ > 1, we have, for each m ∈ N+ ,
m1−σ−κ α2 αm1−(σ+κ)/2
P[J(Θm ) < J ? − |Θ0 ∈ V ] ≤ c −2 L? m−σ−κ + + + αm−κ/2 + √ , (23)
` ` `
where (Θm , m ∈ N) is a random sequence with P[Θ0 ∈ V ] > 0, and built by recursively applying the
gradient ascent step (3) with the gradient update (16) and the step and batch sizes (17) parameterized
by these values of α, `, σ, and κ.

In Theorem 2, by setting the parameters α, `, σ, and κ in (17) appropriately, we can make the
probability of Θm being -suboptimal arbitrarily small. Specifically, the step and batch sizes for each
epoch allow us to control the variance of the estimators in (16). This shows that the SAGE-based
policy-gradient method converges with large probability. The bound can be understood as follows.
The term in (23) on the bound depending on characterizes the convergence rate assuming that all
iterates up to time m remain in V . The remaining terms in (23) estimate the probability that the
iterates escape the set V , which can be made small by tuning parameters that diminish the variance
of the estimator Hm , such as setting κ or ` large—the batch size becomes larger.

14
Score-Aware Policy-Gradient Methods and Performance Guarantees

Theorem 2 extends the result of (Fehrman et al., 2020, Thm. 25) to a Markovian setting with
inability to restart. In our case, the bias can be controlled by using a longer batch size with exponent
at least σ/2. Furthermore, we also use the Lyapunov function to keep track of the state of the MDP
as we update the parameter in V and ensure stability. The proof sketch of Theorem 2 can be found
in Section 5.6. In Appendix D, we also consider the case that M ∩ U is compact, which can be used
to improve Theorem 2. Note that the sequence (Θm , m ∈ N) from Theorem 2 is well defined even if
unstable policies occur, since the update Hm from (16) is finite. In this case, recall that we have the
convention that if θ yields an unstable policy then J(θ) = inf s∈S,a∈A r(s, a) ≥ −∞. In Theorem 2,
we can thus assume instead of initializing Θ0 ∈ V that we restrict to trajectories of SGA that end
up in the neighborhood V —the first iterate satisfying this being Θ0 . In this alternative description,
we can assume that the trajectory is {Θ̃t }t∈[0,T +t0 ] for some t0 ∈ N, and Θ0 = Θ̃t0 reaches V .
Theorem 2 also holds for any estimator H̃m of the gradient J(Θm ) provided that this estima-
tor satisfies appropiate bias and variance bounds typical for estimators using Markov chains (see
Lemma 8 and Proposition 9 in Section 5.6 below). Thus, Theorem 9 and its consequences in the
following sections hold for a wide range of policy-gradient methods. Similarly, Theorem 2 also holds
when M is a manifold of local maxima instead of global maxima. Indeed, the assumptions are all
local and the proof is equivalent.
From Theorem 2, we immediately obtain a typical sample complexity bound.

Corollary 3 (Sample Complexity) Under the same assumptions and notation as in Theorem 2,
there exists a constant c > 0 such that for any 1 > > 0 and δ > 0, if we fix ` ≥ α2 /(5δc) and
σ + κ > 2 then for any m ∈ N satisfying
1 1 2 1

m ≥ m(, δ) = c max (2 δ)− σ+κ , δ − σ+κ−1 , δ − κ , δ − (σ+κ)/2−1 (24)

we have
P[J(Θm ) < J ? − |Θ0 ∈ V ] < δ. (25)

5.3 Lower bound

As noted in Theorem 2, the rate in (23) includes the probability that the iterates escape V , outside
which convergence cannot be guaranteed. Indeed, there is a term O(α2 /`) that characterizes the
probability that the iterates escape the basin of attraction. For general settings, this term cannot be
avoided, even in the unbiased case. In fact, the proposition below shows that for any β > 0 there are
cases where there is a positive lower bound depending on α2+β /`. In Theorem 4 below, we consider
an SGA setting with i.i.d. data, where the target is to maximize a function f using estimators Hm
for the gradient ∇f (Θm ) at epoch m. In a non-RL setting, we usually have Hm = Hm (Θm , Zm ),
where Zm is a collection of i.i.d. random variables and Fm denotes the sigma algebra of the random
variables Θ0 , . . . , Θm as well as Z0 , . . . , Zm−1 . For our result, we assume the iterates Θm satisfy (3),
and ηm = Hm − ∇f (Θm ) satisfies the following unbiased conditional concentration bounds for some
C > 0:
C
E[ηm |Fm ] = 0 and E[|ηm |2 |Fm ]| ≤ . (26)
tm+1 − tm
Proposition 4 below shows that Theorem 2 is almost sharp and characterizes the limitations of using
gradient approximation, which can lead to instability. As we will see in Section 6.1, however, there
are examples where only local convergence can be expected. The proof of Proposition 4 can be found
in Appendix E.

Proposition 4 For any β > 0, there are functions f ∈ C ∞ (Rn ) with a maximum f ? = f (θ? )
satisfying Assumption 7, such that if the iterates Θm satisfy (3) and the gradient estimator Hm =

15
Comte, Jonckheere, Sanders and Senen-Cerda

∇f (Θm ) + ηm satisfies (26), there exists a constant c > 0 depending on f and independent of m
such that for any ∈ (0, 1), 1 > α > 0, δ > 0, ` ≥ 1 and any σ ≥ 0, κ ≥ 0, in (17) we have that

α2+β
P[f (Θm ) < f ? − |Θ0 ∈ V ] ≥ c . (27)
`

5.4 Performance gap and regret

A performance gap bound that we can obtain from Theorem 2 is not fully satisfactory for the epoch
number m. Indeed, we can set the batch size very large (κ large) since the cost of exploration is
not factored in. We will therefore obtain a performance gap depending on the number of samples
drawn. In particular, for a time-step T ≥ 1, we will define the parameter at this time-step as Θm(T )
where
Xn
m(T ) = min{n ∈ N : `iσ/2+κ ≥ T }, (28)
i=1

that is, the corresponding epoch of the sample drawn at time T . We show the bounds on the
performance gap in terms of the total number of samples T . The proof of Proposition 5 can be
found in Appendix F.
Proposition 5 (Performance Gap) Under the same assumptions and notation as in Theorem 2,
we fix α and δ. Then for any 1 > ζ > 0 there is κ(ζ) ≥ 0, c > 0 and `0 > 0 such that for any ` ≥ `0
and T ≥ 1 we have the following.
(i) If sup(s,a) |r(s, a)| < ∞, then
h i 1 1 1 1 α2
E J ? − J(Θm(T ) ) Θ0 ∈ V ≤ c (L? ) 3 ` 3 +ζ T − 3 +ζ + `1/2+ζ T − 2 +ζ + `2/3+ζ T −1+ζ + . (29)
`
(ii) If sup(s,a) |r(s, a)| is unbounded, let Bm(T ) = {Θn ∈ V, n ∈ [m(T )]} be the event that all iterates
up to the epoch of sample T stay in V . Then, we have that
α2 1

P[Bm(T ) ] ≥ 1 − c + `1/2+ζ T − 2 −ζ + `2/3+ζ T −1−ζ , (30)
`
and h i 1 1 1
E J ? − J(Θm(T ) ) Bm(T ) ≤ c(L? ) 3 ` 3 +ζ T − 3 +ζ . (31)

From Proposition 5 we obtain the regret of SAGE and, in general, of other policy-gradient
algorithms that satisfy typical bounds on bias and variance bounds (see Lemma 8 and Proposition 9
in Section 5.6). The proof of Corollary 6 can be found in Appendix G.
Corollary 6 (Regret) Suppose the assumptions and notation as in Theorem 2 hold. Then for any
1 > ζ > 0 there exist κ(ζ) ≥ 0, c > 0, `0 such that if ` ≥ `0 , when Θ0 is the first iterate of (3) in V
(i) If sup(s,a) |r(s, a)| < ∞, then for any T > 1,
T
h X i 1 1 2 α2
E T J? − r(St , At ) Θ0 ∈ V ≤ c (L? ) 3 ` 3 +ζ T 3 +ζ + T . (32)
t=1
`

(ii) If sup(s,a) |r(s, a)| is unbounded, then for any T ≥ 1

h T i 1 2
X
E T J? − r(St , At ) Bm(T ) ≤ c(L? ) 3 T 3 +ζ . (33)
t=1

16
Score-Aware Policy-Gradient Methods and Performance Guarantees

Besides the term α2 /` in Corollary 6 that captures the instability due to the approximation of
the gradient, the term T 2/3+ζ in (32) cannot be easily compared with other common regret bounds
1
that assume global features for J(θ) or the gradient estimator Hm . However, the coefficient (L? ) 3
plays an analogous role to the size of the state space in regret bounds for finite state space MDPs,
and directly depends on the Lyapunov function. In Appendix B.1, we find explicitly the value of L
for the single-server queue example of Section 6 and see that it behaves as L? ∼ O(Vol(V )), that is,
it encodes the volume of the parameter space where the iterates are confined.
Remarkably, in the expectation of Corollary 6(i) policies that are unstable are not avoided. In-
deed, note that we only condition on initializing in a stable policy in V but afterwards the trajectory
may escape the set V and encounter unstable policies.
In (32), by setting ` = T 1/4 we would obtain a horizon-dependent regret of O(T 3/4+ζ ), which
is sublinear but far from the optimum T 1/2 . This is most likely due to the decreasing step and
increasing batch-sizes. While they allow for asymptotic convergence, they are slower to reach a fixed
suboptimality gap compared to, e.g., using a constant step and batchsize algorithm. It may then
be possible to use horizon-dependent step, and batch sizes together with a ‘doubling trick’ (Besson
and Kaufmann, 2018) argument to achieve an anytime optimal suboptimality in the sublinear term.
In this case, however, it is unclear if we would still obtain an equivalent factor α2 /` in (32) that is
optimal as shown with Proposition 4.

5.5 Local convergence with entropy regularization

A well-known phenomenon that can occur when using the softmax policy (6) is that, if the optimal
policy is deterministic, the iterates converge to this optimal policy only when Θm → ∞. Problems
where this occurs will thus not satisfy Assumption 7: the set of maxima will be empty. This
phenomenon is illustrated in the example of Section 6.1. One prevalent method to mitigate the
occurrence of maxima at the boundary involves incorporating a regularization term, often linked to
relative entropy KL[π̃ k π] of the policy π compared to a given π̃, defined below in (34).
Let π̃ be a policy of the same type as those defined in (6) and let ζ be a distribution on S such
that ζ(h−1 (i)) > 0 for any i ∈ I, where h is the index map defined for the class of policies that we
use in (6). We define the regularization term as

X π̃(A|s)
Rπ̃ (θ) = ES∼ζ [KL[π̃( · |S) k π( · |S, θ)]] = ζ(s)EA∼π̃( · |s) log . (34)
π(A|s, θ)
s∈S

For some b > 0 we define

Jπ̃ (θ) = J(θ) − bRπ̃ (θ). (35)
We can show that adding (34) to J(θ) defined in (7) not only prevents maxima from being at the
boundary, but also allows us to avoid using Assumption 7 altogether. The next proposition is proved
in Appendix H.

Proposition 7 Assume that we use the softmax policy from (6) and let J(θ) be defined as in (7).
Then for almost every policy π̃ in the class of (6) with respect to its Lebesgue measure,
1. the function Jπ̃ (θ) in (35) satisfies Assumption 7 and the set of maximizers is bounded, and

2. Theorem 2 for Jπ̃ (θ) holds without Assumption 7.

5.6 Proof outline for Theorem 2

We extend the local approach presented in (Fehrman et al., 2020, §5), that deals with convergence
of SGD where the samples used to estimate the gradient are i.i.d. We consider instead an RL setting
where data is Markovian and thus presents a bias. Fortunately, we can overcome its presence by

17
Comte, Jonckheere, Sanders and Senen-Cerda

adding an increasing batch size while tracking the states of the Markov chain via the local Lyapunov
function from Assumption 4, which guarantees a stable MPD trajectory as long as the parameter is
in a neighborhood close to the maximum. Below we give an outline of the technique employed. For
the full proof we refer to Appendix C.

Structure of the proof

The proof of Theorems 2 consists of several parts. To show a bound on the probability that Θm
is -suboptimal, we consider the event Bm that all previous iterates Θ0 , . . . , Θm belong to a local
neighborhood V , and the complementary event B m . We bound these separately. Firstly, on the
event Bm , we show in Lemma 10 that the iterates converge to M, and we obtain a bound on the
-suboptimal probability for this case. Secondly, the probability of the complement B m is written as
the sum of the probabilities of two disjoint events, namely, (i) Θm+1 ∈/ V and the distance of Θm+1
to M is larger than δ, and (ii) Θm+1 ∈ / V and the distance of Θm+1 to M is less than δ. Intuitively,
these events group the cases when Θm+1 escapes V in ‘normal directions’ to M and in ‘tangent
directions’ to M, respectively. We can bound the former by using concentration inequalities, but
for the latter we need a maximal excursion bound (Lemma 11 below). Combining all bounds results
in an upper bound on P[Bm ] (Lemma 12). The local properties of J are then be used to finish the
proof. Crucially, we use throughout the proof that the local Lyapunov function guarantees stability
of the Markov chain and the gradient estimator within V , as well as keeps track of the initial state
for each epoch.

Preliminary step: Definition of the local neighborhood and bound strategy

For θ? ∈ M ∩ U , we now define a neighborhood Vr,δ (θ? ) of θ? where the algorithm will operate.
Let B̄r (θ? ) := {θ ∈ Ω : |θ − θ? | ≤ r} denote a closed ball around θ? with radius r and dist(θ, L) =
supθ0 ∈L |θ − θ0 | for an open set L. Let U be the neighborhood of θ? described in Assumptions 4 and
7. We define a tubular neighborhood of θ? as follows:

Vr,δ (θ? ) := θ ∈ Ω ∩ U : dist(θ, M ∩ U ) = dist(θ, B̄r (θ? ) ∩ M ∩ U ) < δ .

(36)

Crucially, Assumption 7 implies that there exists δ0 , r0 > such that for any δ ∈ (0, δ0 ] and r ∈ (0, r0 ]
an equivalent definition of the set is then
⊥
Vr,δ (θ? ) = y + v : y ∈ B̄r (θ? ) ∩ M ∩ U and v ∈ Ty (M ∩ U ) with |v| < δ, p(y + v) = y . (37)

Here, p is the unique local projection onto M ∩ U , and Ty (M ∩ U )⊥ denotes the cotangent space
of M ∩ U at y. For further details on this geometric statement, we refer to (Fehrman et al., 2020,
Prop. 13) or (Lee, 2013, Thm. 6.24).
In the following, we let U denote the intersection of the neighborhoods from Assumptions 4 and
7, and L the Lyapunov function from Assumption 4. For any m ∈ N+ define the event and filtration
m
\
Θl ∈ Vr,δ (θ? ) ,

Bm := (38)
l=1

Fm := σ D1 ∪ . . . ∪ Dm−1 ∪ Θ0 , . . . , Θm . (39)

Due to the local properties of J, Theorem 2 can be shown by bounding P[dist(Θm , M∩U ) ≤ |B0 ].
By separating into the event Bm and its complement, we can show that

P[dist(Θm , M ∩ U ) ≤ |B0 ] ≤ P[dist(Θm , M ∩ U )1[Bm−1 ] ≥ ] + P[Bm ]. (40)

The remaining steps of the proof consist of bounding both terms in the right-hand side of (40).

18
Score-Aware Policy-Gradient Methods and Performance Guarantees

Step 1: The variance of the gradient estimator decreases, in spite of the bias
For each m ∈ N+ , let
ηm := Hm − ∇J(Θm ), (41)
denote the difference between the gradient estimator Hm in (16) and the true gradient ∇J(Θm ).
Lemma 8 below implies that the difference in (41) is, ultimately, small. From Assumption 4, since
the state-action chain {(St , At )}t≥0 has a Lyapunov function L, so does the chain {St }t>0 with
X
Lv (s) = L(s, a)v π(a|s, θ), (42)
a∈A

where v ≥ 16 is the exponent from Assumption 4. We can define L4 (s) similarly. The following
lemma bounds the variance of ηm on the event Bm , which can be controlled with the local Lyapunov
function. The proof of Lemma 8 is deferred to Appendix C.3.
Lemma 8 Suppose that Assumptions 1–7 hold. There exists a constant C > 0 that depends on θ? ,
U , and J such that for every m ∈ N+ ,
C
|E[ηm 1[Bm ]|Fm ]| ≤ L4 (Stm )1/2 , (43)
tm+1 − tm
C
E[|ηm |l 1[Bm ]|Fm ] ≤ L4 (Stm )l/2 , for every l ∈ {1, 2}. (44)
(tm+1 − tm )p/2
Lemma 8 helps to determine the bias incurred when starting at a different state than that of
stationarity, and is used to bound the term dist(Θm , M ∩ U ) from (40) in Lemma 10 below. Note
that the definition of SAGE and Assumptions 5 and 6 are used.
As a matted of fact, any other estimator H̃m of ∇J satisfying (43) and (44) from Lemma 8
will yield similar guarantees. In particular, instead of using the aforementioned assumptions for the
proof of Theorem 2 that involve the structure of SAGE or the stationary distribution, we may repeat
the proof using instead Lemma 8.
Proposition 9 Suppose Assumptions 1, 2, 4 and 7 hold. Suppose moreover that the estimator of
the gradient Hm satisfies Lemma 8 for each m ≥ 1. Then, the results of Theorem 2, Section 5.4 and
5.5 also hold for the step-size from (17) and policy-gradient update in (3).

Step 2: Convergence on the event Bm−1 .

We turn to the first term on the right-hand side of (40) and examine, on the event Bm−1 , if the
iterates converge. Using a similar proof strategy as that of (Fehrman et al., 2020, Proposition 20)
for the unbiased non-Markovian case, we prove in Lemma 10 that the variance of the distance to
the set of minima decreases under the appropriate step and batch sizes. Moreover, the rate depends
on an effective size of the state space, characterized by the Lyapunov function, and (22).
The proof of Lemma 10 is in Appendix C.4.
Lemma 10 Suppose that Assumptions 1–7 hold. There then exist r0 , α0 , `0 > 0 and c > 0 such that
for any r ∈ (0, r0 ], α ∈ (0, α0 ] and ` ∈ [`0 , ∞) there also exists δ0 > 0 such that for any δ ∈ (0, δ0 ]
and m ∈ N+ , h i
E (dist(Θm , M ∩ U ) ∧ δ)2 1[Bm−1 ] ≤ cL∗ m−σ−κ . (45)

Compared to the unbiased case in Fehrman et al. (2020), Lemma 10 needs to use a larger batch
size to deal with the bias of Lemma 8. A key result required is that on the event Bm−1 , the Lyapunov
function is bounded in expectation by L∗ , which captures the size of the ‘effective’ state space for
the policies around an optimum. With Lemma 10 together with Markov’s inequality, a bound of
order −2 m−σ−κ for the first term in (40) follows.

19
Comte, Jonckheere, Sanders and Senen-Cerda

Step 3: Excursion and the probability of staying in Vr,δ (θ? )

We next focus on P[B m ]. Since

/ Vr,δ (θ? ), Bm−1 ],

P[Bm ] ≥ P[Bm−1 ] − P[Θm ∈ (46)

we can use a recursive argument to obtain a lower bound, if we can bound first the probability

/ Vr,δ (θ? ), Bm−1 ] = P[dist(Θm , M ∩ U ) > δ, Bm−1 ]

P[Θm ∈
/ Vr,δ (θ? ), Bm−1 ].
+ P[dist(Θm , M ∩ U ) ≤ δ, Θm ∈ (47)

The first term in (47) represents the event that the iterand Θm escapes the set Vr,δ (θ? ) in directions
‘normal’ to M, while the second term represents the escape in directions ‘tangent’ to M—intuition
derived from the fact that, in that latter event, we still have dist(Θm , M ∩ U ) ≤ δ.
The first term in (47) can be bounded by using the local geometric properties around minima
in the set U and associating the escape probability with the probability that on the event Bm−1
escape can only occur if |ηm | is large enough. The probability of this last event happening can then
be controlled with the variance estimates from Lemma 8.
After a recursive argument, we have to consider the second term in (47) for all l ≤ m. Fortunately,
this term can be bounded by first looking at the maximal excursion event for the iterates {Θl }m l=1 .
The proof can be found in Appendix C.5. Here, the Lyapunov function again plays a crucial role to
control the variance of the gradient estimator on the events Bl for l ≤ m, compared to an unbiased
and non-Markovian case.

Lemma 11 Suppose that Assumptions 1–7 hold. Then there exist r0 , α0 , `0 > 0, and c > 0 such
that for any r ∈ (0, r0 ], α ∈ (0, α0 ] and ` ∈ [`0 , ∞), there exist δ0 > 0 such that for any δ ∈ (0, δ0 ]
and m ≥ 1,
r
h i 1 1−5σ/8−κ/2
E max Θl − Θ0 1[Bl−1 ] < cα m1−3σ/2−κ/2 + m . (48)
1≤l≤m `

Finally, with the previous steps we obtain a bound on P[Bm ] in Lemma 12 below2 . The proof of
Lemma 12 can be found in Appendix C.6.

Lemma 12 Suppose that Assumptions 1–7 hold and σ + κ > 1. There exist r0 , α0 , such that for
any r ∈ (0, r0 ], α ∈ (0, α0 ], there also exists a constant c > 0, δ0 > 0 such that for any δ ∈ (0, δ0 ], if
Θ0 ∈ Vr/2,δ (θ? ), there exists `0 > 0 such that for any ` ∈ [`0 , ∞) and m ∈ N+ ,
cα2 c (m1−3σ/2−κ/2 + `−1/2 m1−5σ/8−κ/2 )
P[Bm ] ≥ exp − 2 − 4 m1−σ−κ − cα . (49)
δ ` δ ` (r/2 − 2δ)+

Step 4: Combining the bounds in (40).

The proof of Theorem 2 follows the same steps as are used to prove (Fehrman et al., 2020, Theorem
25) by substituting the modified bounds that we have obtained from Lemmas 10 and 12 in (40).
The details can be found in Appendix C.2.
2. In (48), if the Lyapunov function has only smaller moments than order ν, then condition on κ ≥ 0 will become
stricter. In particular, κ tunes the batch size required to sample from the tails of the stationary distribution and
may be required to be positive depending the moments of the Lyapunov function. The terms σ and κ can be
tuned to control the bias coming from variance and nonstationarity, and finite batch size, respectively.

20
Score-Aware Policy-Gradient Methods and Performance Guarantees

6 Examples and numerical results

In Section 5, we have shown convergence of a SAGE-based policy-gradient method under the as-
sumptions in Section 5.1. We now numerically assess its performance in examples from stochastic
networks and statistical physics that go beyond these assumptions. Specifically, we examine a single-
server queue with admission control in Section 6.1, a load-balancing system in Section 6.2, and an
example of the Ising model with Glauber dynamics in Section 6.3. These examples satisfy the as-
sumptions of Section 4, required to implement the SAGE-based policy-gradient method, but not
necessarily those of Section 5.1 used to prove convergence. More discussion about these assumptions
is provided in Appendix B.
Simulation setup. Plots are obtained by averaging 10 independent simulations, each lasting
Tmax = 106 time steps. The initial parameter vector Θ0 is taken to be the zero vector, yielding in
each example a uniform policy over the action space. The SAGE-based algorithm (Algorithm 2)
is run with batch size 100 and step size αm = 10−1 . The actor–critic algorithm (Appendix A.1)
is run with batch size 1 and step sizes αm = 10−3 and αv = αR = 10−2 . It uses a tabular value
function which we initially treat as containing all-zeros, and which in practice is expanded as states
are visited.

6.1 Admission control in a single-server queue

Consider a queueing system where jobs arrive according to a Poisson process with rate λ > 0,
service times are independent and exponentially distributed with rate µ > 0, and the server applies
an arbitrary nonidling nonanticipating scheduling policy such as first-come-first-served or processor-
sharing. This model is also commonly known as an M/M/1 queue in the literature. When a job
arrives, the agent decides to either admit or reject it: in the former case, the job is added to the
queue, otherwise it is lost permanently. The agent receives a one-time reward γ > 0 for each admitted
job (incentive to accept jobs) and incurs a holding cost η > 0 per job per time unit (incentive to
reject jobs). The goal is to find an admission-control policy that achieves a trade-off between these
two conflicting objectives.
The problem can be related to the framework of Section 3 as follows. For t ∈ N, let St denote
the number of jobs in the system right before the arrival of the (t + 1)th job, and let At denote the
decision of either admitting or rejecting this job. We have S = N and A = {admit, reject}. Also, let
(Στ , τ ∈ R≥0 ) denote the continuous-time process that describes the evolution of the number of jobs
over time and (Tt , t ∈ N) the sequence of job arrival times, so that S0 = Σ0 and St = limσ↑Tt Στ for
t ∈ N+ . Rewards are given by
Z Tt+1
Rt+1 = rdisc (St , At ) + rcont (Στ )dτ,
Tt

where rdisc (s, a) = γ1[a = admit] represents the one-time admission reward and rcont (s) = −ηs the
holding cost incurred continuously over time. We use this common reward structure in this example,
but we remark that arbitrary reward functions rdisc and rcont are possible.
For each k ∈ N, we define a random policy parametrization πk with threshold k and parameter
vector3 θ = (θ0 , θ1 , . . . , θk ) ∈ Rk+1 as follows. Under policy πk , an incoming job finding s jobs in
the system is accepted with probability
1
πk (admit|s, θ) = , s ∈ N. (50)
1 + e−θmin(s,k)
Taking k = 0 yields a static (i.e., state-independent) random policy, while letting k tend to infinity
yields a fully state-dependent random policy. We believe this parametrization makes intuitive sense
3. In this example, vectors and matrices are indexed starting at 0 (instead of 1) for notational convenience.

21
Comte, Jonckheere, Sanders and Senen-Cerda

because, in a stable queueing system, small states tend to be visited more frequently than large
states.
Under policy parametrization πk , Assumptions 1 to 3 are satisfied with n = d = k + 1, Ω =
{θ ∈ Rk+1 : πk (admit|s, θ) < µλ }, Φ(s) = ( µλ )s for each s ∈ S, xi (s) = 1[s ≥ i + 1] for each
i ∈ {0, 1, . . . , k − 1} and xk (s) = max(s − k, 0), and ρi (θ) = πk (admit|i, θ) for each i ∈ {0, 1, . . . , k}.
It follows that ∇ log ρi = ∇ log πk (admit|i, ·) for each i ∈ {0, 1, . . . , k}. Assumption 7 can be satisfied
by adding a small relative entropy regularization term as shown in Proposition 7. We refer to
Appendix B.1 for further details.
Numerical results in a stable queue. We study the impact of the policy threshold k ∈ N on
the performance of SAGE and actor–critic. The parameters are λ = 0.7, µ = 1, γ = 5, and η = 1,
and we consider random policies πk with various thresholds. We have Ω = Rk+1 because λ < µ,
i.e., the queue is always stable. As we can verify using Appendix B.1, if k ≤ 2 the best policy is
random, while if k ≥ 3, the best policy (deterministically) admits incoming jobs if and only if there
are at most 2 jobs in the system. Thus, if k ≥ 3, the best policy is approximated when θi → +∞
if i ∈ {0, 1, 2} and θi → −∞ if i ∈ {3, 4, . . . , k}. This deterministic policy is optimal among all
Markovian policies. The initial policy is πk (admit|s, Θ0 ) = 21 for each s ∈ N, and the system is
initially empty, i.e., S0 = 0 with probability 1.

(a) Performance under π0 . (b) Performance under π1 .

(c) Performance under π3 . (d) Performance under π100 .

Figure 1: Long-run average reward J(Θt ) in the admission-control problem with λ = 0.7, µ = 1,
γ = 5, and η = 1. Using Appendix B.1, we can verify that the long-run average reward under the
best policy is approximately 2.183 if k = 0, 2.566 if k = 1, and 2.795 if k ≥ 3.

Figure 1 depicts the impact of the threshold k on the evolution of the long-run average re-
ward J(Θt ) (defined in (7) and computed using the formulas of Appendix B.1) under SAGE and
actor–critic. Figure 2 shows the admission probabilities π3 (admit|i, Θt ) for each i ∈ {0, 1, 2, 3} (i.e.,
the admission probabilities under the policy with threshold k = 3). In both plots, the x-axis has a

22
Score-Aware Policy-Gradient Methods and Performance Guarantees

(a) SAGE (b) Actor–critic

Figure 2: Admission probabilities under policy parametrization π3 .

logarithmic scale starting at time t = 102 , lines are obtained by averaging the results over 10 indepen-
dent simulations, and transparent areas show the standard deviation. Both SAGE and actor–critic
eventually converge to the maximal attainable long-run average reward, and under both algorithms
the convergence is initially faster under policy π0 than under π1 , π3 , and π100 . For a particular
threshold k, the convergence is initially faster under actor–critic than under SAGE. However, the
long-run average reward under SAGE increases monotonically from its initial value to its maximal
value while, under actor–critic, there is a time period (comprised between 103 and 105 time steps)
where the long-run average reward stagnates or even decreases. Pt Similar qualitative remarks can
be made when looking at the running average reward 1t t0 =1 Rt0 instead of the long-run average
reward J(Θt ). Figure 2b suggests that, under π3 , this is because actor–critic first “overshoots” by
increasing π3 (admit|3, Θt ) too much and then decreasing π3 (admit|2, Θt ) too much before eventually
converging to the best admission probabilities. This overshooting is more pronounced with a small
threshold k, but it is still visible with k = 100.
Figures 1 and 2 suggest actor–critic has more difficulty to correctly estimate the policy update
compared to SAGE, especially under parametrizations πk with small thresholds k. We conjecture
this is due to the combination of two phenomena which reaches a peak when k is small. First, a close
examination of the evolution of the value function under π3 and π10 (not shown here) reveals that
there is a transitory bias in the estimate of the value function. For instance, right after increasing
the admission probability in state 0, the estimate of the value function at states 2 and 3 becomes
negative, even if the optimal value function at these states is positive. Second, due to the policy
parametrization, parameter θk is updated whenever a state s ∈ {k, k + 1, k + 2, . . .} is visited (while,
for each i ∈ {0, 1, . . . , k − 1}, parameter θi is updated only when state i is visited). As a result, the
correlated biases in the estimates of the value function at states k, k + 1, k + 2, . . . add up and lead
actor–critic to overshoot the update of θk , which has a knock-on effect on other states.
Numerical results in a possibly-unstable queue. Figure 3 is the counterpart of Figure 1 when
the arrival rate is λ = 1.4 > 1 = µ. Now the set of policy parameters for which the system is stable
is Ω = {θ ∈ Rk+1 : πk (admit|k, θ) < µλ } ( Rk+1 , with µλ ' 0.714. For simplicity, we will say that a
policy is stable if the Markov chain defined by the system state under this policy is positive recurrent
(i.e., if πk (admit|k, θ) < µλ ), and unstable otherwise. This is an example where convergence can only
be guaranteed locally, as not all policies are stable. Again using Appendix B.1, we can verify that if
k ≤ 1, the best policy is random, while if k ≥ 2, the best policy (deterministically) admits incoming
jobs if and only if there are fewer than 2 jobs in the system. This deterministic policy is optimal
among all Markovian policies. The initial policy is again the (stable) uniform policy, and the system
is initially empty, i.e., S0 = 0 with probability 1.

23
Comte, Jonckheere, Sanders and Senen-Cerda

(a) Performance under π0 . (b) Performance under π2 .

(c) Performance under π4 . (d) Performance under π100 .

Figure 3: Long-run average reward in the admission-control problem with parameters λ = 1.4,
µ = 1, γ = 5, and η = 1. Using Appendix B.1, we can verify that the maximal value of the long-run
average reward is approximately 1.091 if k = 0 and 1.880 if k ≥ 2.

The first take-away of Figure 3 is that SAGE converges to a close–to–optimal policy despite
the fact that some policies are unstable. The convergence of SAGE is actually faster under λ = 1.4
compared to λ = 0.7 (Figure 1). By looking at the evolution of the admission probability (not shown
here), we conjecture this is due to the fact that the admission probability in states larger than or
equal to 2 decreases much faster when λ = 1.4 compared to λ = 0.7, and that this probability has
a significant impact on the long-run average reward. In none of the simulations does SAGE reach
an unstable policy. This suggests that the updates of SAGE have lower chance of reaching unstable
regions of the policy space per observed sample.
The second take-away of Figure 3 is that, on the contrary, actor–critic has difficulties coping
with instability in this example. In all simulation runs used to plot this figure, the long-run average
reward J(Θt ) first decreases before possibly increasing again and converging to the best achievable
long-run average reward. Under parametrizations π0 , π2 , and π4 , unstable policies are visited for
thousands of steps in all simulation runs, and a stable policy is eventually reached in only 7 out of
10 runs. Under parametrization π0 , the long-run average reward under the last policy is close to the
best only in 2 out of 10 runs. Under π100 , the policy remains stable throughout all runs, but the
long-run average reward transitorily decreases before increasing again.

6.2 Load-balancing system

Consider a cluster of n servers. Jobs arrive according to a Poisson process with rate λ > 0, and a
new job is admitted if and only if there are fewer than c ∈ N+ jobs in the system. Each server i ∈
{1, 2, . . . , n} processes jobs in its queue according to a nonidling, nonanticipating policy. The service
time of each job at server i is exponentially distributed with rate µi > 0, independently of all other

24
Score-Aware Policy-Gradient Methods and Performance Guarantees

random variables. The agent aims to maximize the admission probability by adequately distributing
load across servers.
For each t ∈ N, let St = (St,1 , St,2 , . . . , St,n ) denote the vector containing the number of jobs at
each server right before the arrival of the (t + 1)th job, and let At ∈ {1, 2, . . . , n} denote the server
to which this (t + 1)th job is assigned. (This decision is void if St,1 + . . . + St,n = c because the job
is rejected anyway.) We have S = {s ∈ Nn : s1 + s2 + . . . + sn ≤ c} and A = {1, 2, . . . , n}. The agent
obtains a reward of 1 if the job is accepted and 0 otherwise, that is, Rt+1 = 1[St,1 +. . .+St,n ≤ c−1]
for each t ∈ N.
We consider the following static policy parametrization, with parameter vector θ ∈ Rn : irrespec-
tive of the system state s ∈ S, an incoming job is assigned to server i with probability
eθi
π(i|s, θ) = π(i|θ) = Pn , i ∈ {1, 2, . . . , n}. (51)
j=1 eθ j
Qn
Assumptions 1 to 3 are satisfied with n = d, Ω = Rn , Φ(s) = i=1 ( µλi )si for each s ∈ S, xi (s) = si
for i ∈ {1, 2, . . . , n} and s ∈ S, and ρi (θ) = π(i|θ) for i ∈ {1, 2, . . . , n} and θ ∈ Rn . Also note
that ∇ log ρi (θ) = ∇ log π(i|θ) for i ∈ {1, 2, . . . , d} and θ ∈ Rn . Except for Assumption 7, the
remaining assumptions outlined in Section 5 are also satisfied. We refer to Appendix B.2 for more
details. Lastly observe that, in spite of the policy being static and the state space being finite,
the function J is still nonconvex for typical system parameters. In fact, our numerical experiments
are done in nonconvex scenarios. Furthermore, note that this system can become challenging to
optimize if c and n are large.
Numerical results We study the performance of SAGE and actor–critic under varying numbers
of servers and service speed imbalance. Given an integer n ∈ N>0 multiple of 4 and δ > 1, we
consider the following cluster of n servers divided into 4 pools. For each k ∈ {1, 2, 3, 4}, pool k
consists of the n4 servers indexed from (k − 1) n4 + P 1 to k n4 , and each server i in this pool has service
n
rate µi = δ k−1 . The total arrival rate is λ = 0.7( i=1 µi ) and the upper bound on the number of
n
jobs in the system is c = 10 4 . Letting δ = 1 gives a system where all servers have the same service
speed, while increasing δ makes the server speeds more and more imbalanced. The initial policy is
uniform, i.e., π(i|Θ0 ) = n1 for each i ∈ {1, 2, . . . , n}, and the initial state is empty, i.e., S0 = 0 with
probability 1.
Figure 4 shows performance of SAGE and actor–critic in clusters of n ∈ {4, 20, 100} servers. Solid
lines show the evolution
Pt of the long-run average reward J(Θt ), and dashed lines show the running
average reward 1t t0 =1 Rt0 . (Recall J(Θt ) is the limit of the running average we would see if we ran
the system under policy π(Θt ). It is defined in (7) and can be computed as shown in Appendix B.2.)
As before, transparent areas show the standard deviation around the average. The results under
actor–critic are reported only for n = 4 servers, as this method already suffers from a combinatorial
explosion in the state–action space for n ∈ {20, 100}. Indeed, while the memory complexity increases
linearly with the number n of servers under SAGE, it increases with the cardinality n+c

c of the
state space under actor–critic4 , which is already prohibitively large for n ∈ {20, 100}.

1
PAll
t
four subfigures in Figure 4 show a consistent 2-phase pattern: first the running average reward
t =1 Rt converges to the initial long-run average reward J(Θ0 ), and then the long-run average
0 0
t
reward increases to reach the best value, with the running average reward catching up at a slower
pace. This suggests that the gradient estimates under both algorithms remain close to zero until the
system reaches approximate stationarity. A similar reasoning explains why the algorithms converge
at a slower pace when we increase the imbalance factor δ (as the stationary distribution under the
initial uniform policy π(Θ0 ) puts mass on states that are further away from the initial empty state)
or the number n of servers (as the mixing time increases).
Focusing on the system with n = 4 servers, Figures 4a and 4b show convergence occurs on
the order of 105 time steps sooner under SAGE than under actor–critic. We conjecture that this
4. As shown by applying the stars and bars method in combinatorics.

25
Comte, Jonckheere, Sanders and Senen-Cerda

(a) SAGE, n = 4 servers. (b) Actor–critic, n = 4 servers.

(c) SAGE, n = 20 servers. (d) SAGE, n = 100 servers.

Figure 4: Impact of the number of servers and service-rate imbalance on the performance of SAGE
and actor–critic in a load-balancing system. Solid lines
Pt show the long-run average reward J(Θt ),
while dashed lines show the running average reward, 1t t0 =1 Rt0 . Simulations for n = 100 and δ = 4
are omitted because numerical instability of Buzen’s algorithm (see Appendix B.2) prevents us from
computing J(Θt ) in this case.

is again due to the fact

that actor–critic relies on estimating the state-value function, so that it
needs to estimate n+cc values. SAGE, on the other hand, exploits the structure of the stationary
distribution and only needs to estimate a number of values that grows linearly with n (and is
independent of c). We also note that actor–critic shows nonmonotonic convergence (i.e., J(Θt )
decreases before increasing again between 105 and 106 time steps). We conjecture this is due to a
similar phenomenon as described in Section 6.1.

6.3 Ising model and Glauber dynamics

Consider a system of spin particles spread over a two-dimensional lattice of shape d1 × d2 , for some
d1 , d2 ∈ {2, 3, 4, . . .}. Let V = {1, 2, . . . , d1 } × {1, 2, . . . , d2 } denote the set of lattice coordinates. For
any two coordinates v = (v1 , v2 ) ∈ V and w = (w1 , w2 ) ∈ V, we write v ∼ w if and only v and w are
neighbors in the lattice, that is, if and only if |v1 − w1 | + |v2 − w2 | = 1.
A map σ : V → {−1, +1} is called a spin configuration, and the set of all 2d1 d2 configurations is
denoted by Σ. Given a configuration σ ∈ Σ, we refer to σ(v) ∈ {−1, +1} as the spin (of the particle
located) at v. If the system is in some configuration σ ∈ Σ, we say that the spin at v ∈ V is flipped
if the system jumps to the configuration σ−v ∈ Σ such that σ−v (v) = −σ(v) and σ−v (w) = σ(w) for
each w ∈ V \ {v}. As we will formalize below, the agent’s goal is to reach a configuration σ ∈ Σ so
that the magnetization on the left (resp. right) half of the lattice is close to ξleft ∈ (−1, +1) (resp.

26
Score-Aware Policy-Gradient Methods and Performance Guarantees

ξright ∈ (−1, +1)), i.e.,

2 2
Mleft (σ) ' ξleft , Mright (σ) ' ξright ,
d1 d2 d1 d2
where
X X
Mleft (σ) = σ(v), Mright (σ) = σ(v).
v∈V: v2 ≤d2 /2 v∈V: v2 >d2 /2

To each configuration σ ∈ Σ is associated an energy E(σ) , −JI(σ) − µF (σ), where I and F

are called the interaction and external field terms, respectively, given by
X X
I(σ) = σ(v)σ(w), F (σ) = h(v)σ(v),
v,w∈V: v∼w v∈V

where the first sum runs over all pairs of neighboring coordinates (so that each pair appears once).
Here, J ∈ R is the coupling constant, µ ∈ R≥0 the magnetic moment, and h : V → R the external
magnetic field. Under the dynamics defined below, the probability of a configuration σ ∈ Σ will
be proportional to e−βE(σ) , where β ∈ R>0 is the inverse temperature. If J > 0 (resp. J < 0),
the interaction term I contributes to increasing the probability of configurations where neighboring
spins have the same (resp. opposite) sign. Concurrently, due to the external-field term F , the spin
at each v ∈ V is attracted in the direction pointed by the sign of h(v). The coupling constant J and
magnetic moment µ are fixed and known by the agent (as they depend on the particles), and the
agent will fine-tune the inverse temperature β and coarse-tune the external magnetic field h.
Glauber dynamics Given a starting configuration, at every time step, the spin at a coordinate
chosen uniformly at random is flipped (or not) with some probability that depends on the current
configuration and the parameters set by the agent. This is cast as a Markov decision process as
follows. The state and action spaces are given by S = Σ × V and A = {flip, not flip}, respectively.
For each s = (σ, v) ∈ S and a ∈ A, the state reached by taking action a in state s is given by
S 0 = (σ 0 , V 0 ), where σ 0 = σ−v if a = flip and σ 0 = σ if a = not flip, and V 0 is chosen uniformly
at random in V, independently of the past states, actions, and rewards. The next reward r is the
opposite of the sum of the absolute difference between the next magnetizations and the desired
magnetizations, that is,
2 2
r = − ξleft − Mleft (σ 0 ) − ξright − Mright (σ 0 ) .
d1 d2 d2 d2
The agent controls a vector θ ∈ R3 that determines the inverse temperature and the left and right
external magnetic fields as follows:
β(θ) = 1 + tanh(θ1 ), hleft (θ) = tanh(θ2 ), hright (θ) = tanh(θ3 ),
so that in particular β(θ) ∈ (0, 2), hleft (θ) ∈ (−1, 1), and hright (θ) ∈ (−1, 1). The corresponding
external magnetic field and external field term are
h(v|θ) = hleft (θ)1[v2 ≤ d2 /2] + hright (θ)1[v2 > d2 /2], v ∈ V,
X
F (σ|θ) = h(v|θ)σ(v) = hleft (θ)Mleft (σ) + hright (θ)Mright (σ), σ ∈ Σ.
v∈V

Given θ ∈ R3 , for each s = (σ, v) ∈ Σ, the probability that the spin at the randomly-chosen
coordinate v is flipped when the current configuration is σ is given by
!
1 X
π(flip|s, θ) = , with δ(s|θ) = 2β(θ)σ(v) J σ(w) + µh(v|θ) . (52)
1 + eδ(s|θ) w∈V: w∼v

27
Comte, Jonckheere, Sanders and Senen-Cerda

When θ ∈ R3 is fixed, the dynamics defined by this system are called the Glauber dynamics (Levin
and Peres, Section 3.3). Note that, although we use the word action to match the terminology of
MDPs, here an action should be seen as a random event in the environment, of which only the
distribution π can be controlled by the agent via the parameter vector θ.
Product-form distribution We verify in Appendix B.3 that the stationary distribution of the
system state under a particular choice of θ ∈ R3 satisfies

p(s|θ) ∝ eβ(θ)(JI(σ)+µF (σ|θ)) , s = (σ, v) ∈ S, θ ∈ R3 . (53)

Assumptions 1 to 3 are satisfied with n = d = 3, Ω = R3 , Φ(s) = 1 for each s ∈ S, log ρ1 (θ) = β(θ)J,
log ρ2 (θ) = β(θ)µhleft (θ), and log ρ3 (θ) = β(θ)µhright (θ) for each θ ∈ R3 , and x1 (s) = I(σ), x2 (s) =
Mleft (σ), and x3 (s) = Mright (σ) for each s = (σ, v) ∈ S. All derivations are given in Appendix B.3.

(a) Average over 10 simulation runs. (b) A particular simulation run.

Figure 5: Performance of SAGE in the Ising model.

Numerical results Figure 5 shows the performance of SAGE in a system with parameters d1 =
10, d2 = 20, J = µ = 1, ξleft = −1, and ξright = 1. We do not run simulations under actor–
critic, as again the state space has size 2d1 d2 = 2200 , which is out of reach for this method. The
initial parameter vector is Θ0 = 0, yielding inverse temperature β(Θ0 ) = 1 and external fields
hleft (Θ0 ) = hright (Θ0 ) = 0. The initial configuration has spins 1 on the left-hand side and −1 on the
right-hand side, so that reaching the target configuration requires flipping every spin. In Figure 5a,
the reward Rt seems to increase on average monotonically from −4 to 0, which is consistent with
the observation that the left (resp. right) magnetization decreases from 1 to −1 (resp. increases from
−1 to 1). The increase of the reward is stepwise, with stages where it remains roughly constant for
several thousand time steps. Lastly, the standard deviation increases significantly from about 104
to 3 · 105 time steps, and it becomes negligible afterwards.
To help us understand these observations, Figure 5b shows the evolution of the system param-
eters and of the magnetizations over a particular simulation run. The left magnetization Mleft

28
Score-Aware Policy-Gradient Methods and Performance Guarantees

starts decreasing around 104 time steps (bottom plot), approximately when hleft (Θt ) and hright (Θt )
become nonzero (top plot), to become roughly −1 around 3 · 104 time steps. At that moment,
the system configuration is close to the all–spin–down configuration σ−1 such that σ−1 (v) = −1
for each v ∈ V. The right magnetization starts increasing only when the inverse temperature
β(Θt ) has a sudden decrease (top plot). To make sense of this observation, consider π(flip|s, θ) as
given by (52), where
P s = (σ−1 , v) for some v ∈ {2, 3, . . . , d1 − 1} × {2, 3, . . . , d2 − 1}. In δ(s|θ),
the first term J w∈V:w∼v σ−1 (v)σ−1 (w) is equal to 4, while the absolute value of the second term
1
µσ−1 (v)h(v|θ) is at most 1; hence, if β(θ) ' 1 as initially, π(flip|s, θ) is between 1+e2(4+1) ' 4.5 · 10−5
1 −3
and 1+e2(4−1) ' 2.5 · 10 . The brief decrease of β(θ) is an efficient way of increasing the flipping
probability in all states, which allows the system to escape from σ−1 . Other simulation runs are
qualitatively similar, but the times at which the qualitative changes occur and the side (left or right)
that flips magnetization first vary, which explains the large standard deviation observed earlier.

7 Conclusion
In this paper, we incorporated model-specific information about MDPs into the gradient estimator
in policy-gradient methods. Specifically, assuming that the stationary distribution is an exponential
family, we derived score-aware gradient estimators (SAGEs) that do not require the computation
of value functions (Theorem 1). As showcased in Section 6, this assumption is satisfied by models
from stochastic networks, where the stationary distribution possesses a product-form structure, and
by models from statistical mechanics, such as the Ising models with Glauber dynamics.
The numerical results in Section 6 show that in these systems, policy-gradient algorithms equipped
with a SAGE outperform actor–critic. In these examples, the Jacobian of the load function D log ρ(θ)
can be computed explicitly in terms of the policy parameter θ. However, SAGE estimators can be
harder to compute in more complex cases, for example when D log ρ(θ) depends on some model
parameters. Nevertheless, our examples showcase how it is possible to improve the current policy
gradient methods by levering information on the MDP, and we expect extensions of SAGEs to cover
more challenging cases, for example by combining SAGE with model selection by first estimating
the model parameters appearing in D log ρ(θ). We leave such extensions of SAGE for future work.
We have also shown with Theorem 2 that policy gradient with SAGE converges to the optimal
policy under light assumptions, namely, the existence of a local Lyapunov function close to the
optimum, which allows for unstable policies to exist, and a nondegeneracy property of the Hessian
at maxima. The convergence occurs with a probability arbitrarily close to one provided that the
iterates start close enough. Notably, our method proof also works with other policy-gradients with
similar policy-gradient approximation properties. In Corollary 6, the regret of the algorithm is
shown to be O(T 2/3+ + T α2 /`), where T is the number of samples drawn. Unlike most common
convergence results, we have gradient approximation on a countable state space and there is a
nonzero probability that the algorithm will end in an unstable policy. This fact is namely captured
by the term α2 /`. Remarkably, such instabilities are observed in one of the examples of Section 6.
If we had made stronger assumptions such as the existence of a global Lyapunov function, then such
phenomena would not have been captured by the analysis.

References
Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and Gellért
Weisz. Politex: Regret bounds for policy iteration using expert prediction. In International
Conference on Machine Learning, pages 3692–3702. PMLR, 2019.
Ivo Adan, Ana Bušić, Jean Mairesse, and Gideon Weiss. Reversibility and further properties of
FCFS infinite bipartite matching. Mathematics of Operations Research, 43(2):598–621, dec 2017.
Publisher: INFORMS.

29
Comte, Jonckheere, Sanders and Senen-Cerda

Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. On the theory of policy
gradient methods: Optimality, approximation, and distribution shift. The Journal of Machine
Learning Research, 22(1):4431–4506, 2021.
Jonatha Anselmi, Bruno Gaujal, and Louis-Sébastien Rebuffi. Learning optimal admission control
in partially observable queueing networks. arXiv preprint arXiv:2308.02391, 2023.
Yves F. Atchadé, Gersende Fort, and Eric Moulines. On perturbed proximal gradient algorithms.
The Journal of Machine Learning Research, 18(1):310–342, 2017.
Forest Baskett, K. Mani Chandy, Richard R. Muntz, and Fernando G. Palacios. Open, closed, and
mixed networks of queues with different classes of customers. Journal of the ACM, 22(2):248–260,
apr 1975.
Lilian Besson and Emilie Kaufmann. What doubling tricks can and can’t do for multi-armed bandits.
arXiv preprint arXiv:1803.06971, 2018.
Thomas Bonald and Jorma Virtamo. Calculating the flow level performance of balanced fairness in
tree networks. Performance Evaluation, 58(1):1–14, oct 2004.
Pierre Brémaud. Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Texts in
Applied Mathematics. Springer-Verlag, 1999.
Jeffrey P. Buzen. Computational algorithms for closed queueing networks with exponential servers.
Communications of the ACM, 16(9):527–531, sep 1973.
Shicong Cen, Chen Cheng, Yuxin Chen, Yuting Wei, and Yuejie Chi. Fast global convergence of
natural policy gradient methods with entropy regularization. Operations Research, 70(4):2563–
2578, 2022.
Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with
stochastic gradients. In International Conference on Machine Learning, pages 1155–1164. PMLR,
2018.
Edmundo de Souza e Silva and Mario Gerla. Queueing network models for load balancing in dis-
tributed systems. Journal of Parallel and Distributed Computing, 12(1):24–38, may 1991.
Edmundo de Souza e Silva and Richard R. Muntz. Simple relationships among moments of queue
lengths in product form queueing networks. IEEE Transactions on Computers, 37(9):1125–1129,
sep 1988.
Thinh T. Doan, Lam M. Nguyen, Nhan H. Pham, and Justin Romberg. Finite-time analysis of
stochastic gradient descent under Markov randomness. arXiv preprint arXiv:2003.10973, 2020.
Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient
methods for the linear quadratic regulator. In International Conference on Machine Learning,
pages 1467–1476. PMLR, 2018.
Benjamin Fehrman, Benjamin Gess, and Arnulf Jentzen. Convergence rates for the stochastic gra-
dient descent method for non-convex objective functions. Journal of Machine Learning Research,
21:136, 2020.
Gersende Fort and Eric Moulines. Convergence of the Monte Carlo expectation maximization for
curved exponential families. The Annals of Statistics, 31(4):1220–1259, 2003.
Kristen Gardner and Rhonda Righter. Product forms for FCFS queueing models with arbitrary
server-job compatibilities: an overview. Queueing Systems, 96(1):3–51, oct 2020.

30
Score-Aware Policy-Gradient Methods and Performance Guarantees

Victor Guillemin and Alan Pollack. Differential topology, volume 370. American Mathematical Soc.,
2010.
Libin Jiang and Jean Walrand. A distributed CSMA algorithm for throughput and utility maxi-
mization in wireless networks. IEEE/ACM Transactions on Networking, 18(3):960–972, 2009.
Belhal Karimi, Blazej Miasojedow, Eric Moulines, and Hoi-To Wai. Non-asymptotic analysis of
biased stochastic approximation scheme. In Conference on Learning Theory, pages 1944–1974.
PMLR, 2019.
Shauharda Khadka and Kagan Tumer. Evolution-guided policy gradient in reinforcement learning.
In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,
Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.

Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. The
MIT Press, Cambridge, MA, 2009.
Navdeep Kumar, Yashaswini Murthy, Itai Shufaro, Kfir Y Levy, R Srikant, and Shie Mannor. On
the global convergence of policy gradient in average reward Markov decision processes. arXiv
preprint arXiv:2403.06806, 2024.

John M Lee. Introduction to Smooth Manifolds, volume 218. Springer Science & Business Media,
2013.
David A. Levin and Yuval Peres. Markov Chains and Mixing Times. American Mathematical
Society, 2e édition edition.

Bai Liu, Qiaomin Xie, and Eytan Modiano. RL–QN: A reinforcement learning framework for optimal
control of queueing systems. ACM Transactions on Modeling and Performance Evaluation of
Computing Systems, 7(1):1–35, 2022.
Yanli Liu, Kaiqing Zhang, Tamer Basar, and Wotao Yin. An improved analysis of (variance-reduced)
policy gradient and natural policy gradient methods. In H. Larochelle, M. Ranzato, R. Hadsell,
M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33,
pages 7624–7636. Curran Associates, Inc., 2020.
Zhen Liu and Philippe Nain. Sensitivity results in open, closed and mixed product form queueing
networks. Performance Evaluation, 13(4):237–251, 1991.
Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. Monte Carlo gradient
estimation in machine learning. The Journal of Machine Learning Research, 21(1):5183–5244,
2020.
Pascal Moyal, Ana Bušić, and Jean Mairesse. A product form for the general stochastic matching
model. Journal of Applied Probability, 58(2):449–468, jun 2021. Publisher: Cambridge University
Press.

Yashaswini Murthy, Isaac Grosof, Siva Theja Maguluri, and R Srikant. Performance of npg in
countable state-space average-cost rl. arXiv preprint arXiv:2405.20467, 2024.
Liviu I. Nicolaescu et al. An invitation to Morse theory. Springer, 2011.
Yichen Qian, Jun Wu, Rui Wang, Fusheng Zhu, and Wei Zhang. Survey on reinforcement learning
applications in communication networks. Journal of Communications and Information Networks,
4(2):30–39, 2019.

31
Comte, Jonckheere, Sanders and Senen-Cerda

Jaron Sanders, Sem C. Borst, and Johan S. H. van Leeuwaarden. Online network optimization using
product-form Markov processes. IEEE Transactions on Automatic Control, 61(7):1838–1853, 2016.
Richard Serfozo. Introduction to Stochastic Networks. Stochastic Modelling and Applied Probability.
Springer-Verlag, 1999.
Devavrat Shah. Message-passing in stochastic processing networks. Surveys in Operations Research
and Management Science, 16(2):83–104, jul 2011.

Virag Shah and Gustavo de Veciana. High-performance centralized content delivery infrastructure:
Models and asymptotics. IEEE/ACM Transactions on Networking, 23(5):1674–1687, oct 2015.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT press,
Cambridge, MA, USA, 2 edition, 2018.

Vladislav B. Tadic and Arnaud Doucet. Asymptotic bias of stochastic gradient search. Annals of
Applied Probability, 27(6):3255–3304, 2017.
Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational
inference. Found. Trends Mach. Learn., 1(1):1–305, 2018.

Ronald W. Wolff. Poisson arrivals see time averages. Operations Research, 30(2):223–231, apr 1982.
Publisher: INFORMS.
Lin Xiao. On the convergence rates of policy gradient methods. The Journal of Machine Learning
Research, 23(1):12887–12922, 2022.
Rui Yuan, Robert M. Gower, and Alessandro Lazaric. A general sample complexity analysis of vanilla
policy gradient. In Proceedings of The 25th International Conference on Artificial Intelligence and
Statistics, pages 3332–3380. PMLR, 2022.
Stan Zachary and Ilze Ziedins. Loss networks and Markov random fields. Journal of Applied
Probability, 36(2):403–414, jun 1999. Publisher: Cambridge University Press.

Appendix A. Policy-gradient algorithms

A.1 Actor–critic algorithm
The actor–critic algorithm is first mentioned in Section 3.4 and compared to our SAGE-based policy-
gradient algorithm in Section 6. We focus on the version of actor–critic described in (Sutton and
Barto, 2018, Section 13.6) for the average-reward criterion in infinite horizon. The algorithm relies
on the following expression for ∇J(θ), which is a variant of the policy-gradient theorem (Sutton and
Barto, 2018, Chapter 13):

∇J(θ) ∝ E[(R − J(θ) + v(S 0 ) − v(S))∇ log π(A|S, θ)],

where (S, A, R, S 0 ) is a quadruplet of random variables such that S ∼ p( · |θ), A|S ∼ π( · |S, θ), and
(R, S 0 )|(S, A) ∼ P ( · , · |S, A) (so that in particular (S, A, R) ∼ stat(θ)), and v is the state-value
function.
The pseudocode of the procedure Gradient used in the actor–critic algorithm is given in Al-
gorithm 3. This procedure is to be implemented within Algorithm 1 with batch sizes equal to one,
meaning that tm+1 = tm + 1 for each m ∈ N. We assume for simplicity that all variables from
Algorithm 1 are accessible inside Algorithm 3. The variable R updated on Line 6 is a biased esti-
mate of J(Θm ), while the table V updated on Line 7 is a biased estimate of the state-value function

32
Score-Aware Policy-Gradient Methods and Performance Guarantees

Algorithm 3 Actor–critic algorithm (Sutton and Barto, 2018, Section 13.6) to be called on Line 9
of Algorithm 1, with batch sizes equal to one.
1: Input: Positive and differentiable policy parametrization (s, θ, a) 7→ π(a|s, θ)
2: Parameters: Step sizes αR > 0 and αv > 0
3: Initialization: • R ← 0
• V [s] ← 0 for each s ∈ S
4: procedure Gradient(t)
5: δ ← Rt+1 − R + V [St+1 ] − V [St ]
6: Update R ← R + αR δ
7: Update V [St ] ← V [St ] + αv δ
8: return δ ∇ log π(At |St , Θt )
9: end procedure

under policy π(Θm ). Compared to (Sutton and Barto, 2018, Section 13.6), the value function is
encoded by a table V and there are no eligibility traces. If the state space S is infinite, the table V
is initialized at zero over a finite subset of S containing the initial state S0 and expanded with zero
padding whenever necessary.

A.2 SAGE-based policy-gradient method

Algorithm 4 SAGE-updated policy-gradient method, to be called on Line 9 of Algorithm 1.

1: Input: • Positive and differentiable policy parametrization (s, θ, a) 7→ π(a|s, θ)
• Jacobian matrix function θ 7→ D log ρ(θ)
• Feature function s 7→ x(s)
2: Parameters: Memory factor ν ∈ [0, 1]
3: Initialization: Global variables N−1 , M−1 , X −1 , R−1 , C −1 , E −1 ← 0
4: procedure Gradient(m)
5: Nm ← νNm−1 + (tm+1 − tm )
6: Mm ← ν 2 Mm−1 + (tm+1 − tm )
7: return D log ρ(Θm )Covariance(m) + Expectation(m)
8: end procedure
9: procedure Covariance(m)
Ptm+1 −1
10: Update X m ← νX m−1 + t=t x(St )
Ptm+1m−1
11: Update Rm ← νRm−1 + t=tm Rt+1
Ptm+1 −1
12: Update C m ← νC m−1 + t=t m
(x(St ) − N1m X m )(Rt+1 − N1m Rm )
Nm 2
13: return Nm 2 −Mm C m if Nm > Mm else N1m C m
14: end procedure
15: procedure Expectation(m)
Ptm+1 −1
16: Update E m ← νE m−1 + t=t m
Rt+1 ∇ log π(At |St , Θm )
1
17: return Nm E m
18: end procedure

Algorithm 4 is an extension of Algorithm 2 that allows for batches of size 1. The main advantage
of Algorithm 4 over Algorithm 2 is that it estimates ∇J(Θm ) based not only on batch Dm , but
also on previous batches, depending on the memory factor ν initialized on Line 2. To simplify the
signature of procedures in Algorithm 4, we assume variables Nm , Mm , X m , Rm , C m , and E m are

33
Comte, Jonckheere, Sanders and Senen-Cerda

global, and that all variables from Algorithm 1 are accessible within Algorithm 4, in particular
batch Dm . Line 5 in Algorithm 4 is the counterpart of Line 3 in Algorithm 2. Line 6 in Algorithm 4
is used on Line 13 to compute the counterpart of the quotient 1/(Nm − 1) in Line 6 in Algorithm 2.
The Covariance(m) procedure in Algorithm 4 is the counterpart of Lines 4–6 in Algorithm 2. The
Expectation(m) procedure in Algorithm 4 is the counterpart of Line 7 in Algorithm 2. Algorithm 2
can be seen as a special case of Algorithm 4 with memory factor ν = 0. Note that terminology in
Algorithm 4 differs slightly compared to Algorithm 2: bar notation refers to cumulative sums instead
of averages.
The subroutines Covariance and Expectation compute biased covariance and mean estimates
for Cov[R, x(S)] and E[R ∇ log π(A|S, θ)], where (S, A, R) ∼ stat(Θm ), consistently with Theorem 1.
If the memory factor ν is zero, these procedures return the usual sample mean and covariance
estimates taken over the last batch Dm (as in Algorithm 2), and bias only comes from the fact that
the system is not stationary. If ν is positive, estimates from previous batches are also taken into
account, so that the bias is increased in exchange for a (hopefully) lower variance. In this case, the
updates on Lines 10–12 and 16 calculate iteratively the weighted sample mean and covariance over
the whole history, where observations from epoch m − m have weight ν m , for each m ∈ {0, 1, . . . , m}.
When m is large, the mean returned by Expectation is approximately equal to the sample mean
over batches Dm−M through Dm , where M is a truncated geometric random variable, independent
of all other random variables, such that P[M = m] ∝ ν m for each m ∈ {0, 1, . . . , m}; if batches have
c
constant size c, then we take into account approximately the last c(E[M ] + 1) = 1−ν steps.

Appendix B. Examples
This appendix provides detailed derivations for the examples of Section 6. We consider the single-
server queue with admission control of Section 6.1 in Appendix B.1, the load-balancing example of
Section 6.2 in Appendix B.2, and the Ising model of Section 6.3 in Appendix B.3.

B.1 Single-server queue with admission control

Consider the example of Section 6.1, where jobs arrive according to a Poisson process with rate
λ > 0, and service times are exponentially distributed with rate µ > 0. Recall the long-run average
reward is the difference between an admission reward proportional to the admission probability and
a holding cost proportional to the mean queue size. We first verify that Assumptions 1 to 3 are
satisfied, then we give a closed-form expression for the objective function, and lastly we discuss the
assumptions of Section 5. We consider a random threshold-based policy πk of the form (50) for some
k ∈ N and some parameter θ ∈ Ω, where Ω = {θ ∈ Rk+1 : πk (admit|k, θ) < µλ }.
Product-form stationary distribution. The evolution of the number of jobs in the system
(either waiting or in service) defines a birth-and-death process with birth rate λπk (admit|s, θ) and
death rate µ1[s ≥ 1] in state s, for each s ∈ {0, 1, 2, . . .}. This process is irreducible because its
birth and death rates are positive, and it is positive recurrent because λπk (admit|s, θ) < µ for each
s ∈ S by definition of Ω. This verifies Assumption 1. The stationary distribution is given by
s−1
1 Y λ
p(s|θ) = π(admit|q, θ) ,
Z(θ) q=0 µ
s "k−1 #
1 λ Y
= πk (admit|i, θ)1[s≥i+1] πk (admit|k, θ)max(s−k,0) , s ∈ N, (54)
Z(θ) µ i=0

where the second equality follows by injecting (50), and the value of Z(θ) follows by normalization.
We recognize (10–PF) from Assumption 3, with n = d = k + 1, Φ(s) = ( µλ )s for each s ∈ S,
xi (s) = 1[s ≥ i + 1] for each i ∈ {0, 1, . . . , k − 1} and xk (s) = max(s − k, 0) for each s ∈ S, and

34
Score-Aware Policy-Gradient Methods and Performance Guarantees

ρi (θ) = πk (admit|i, θ) for each i ∈ {0, 1, . . . , k}. The function ρ defined in this way is differen-
tiable. Assumption 3 is therefore satisfied, as the distribution of the system seen at arrival times is
also (54) according to the PASTA property (Wolff, 1982). For each s ∈ N and a ∈ {admit, reject},
∇ log πk (a|s, θ) is the (k + 1)-dimensional column vector with value 1[a = admit] − πk (admit|i, θ)
in component i = min(s, k) and zero elsewhere, and D log ρ(θ) is the (k + 1)-dimensional diagonal
matrix with diagonal coefficient 1 − πk (admit|i, θ) in position i, for each i ∈ {0, 1, . . . , k}. This can
be used to verify that Assumption 5 is satisfied.
Objective function. The objective function is J(θ) = γP[A = admit] − λη E[S], where

with the convention that empty sums are equal to zero and empty products are equal to one.
All calculations remain valid in the limit as πk (admit|i, θ) → 1 for some i ∈ {0, 1, . . . , k} (cor-
responding to θi → +∞). In the limit as πk (admit|i, θ) → 0 for some i ∈ {0, 1, . . . , k}, we
can study the restriction of the birth-and-death process to the state space {0, 1, . . . , c}, where
c = min{i ∈ {0, 1, . . . , k} : πi (θ) = 0}.
Assumptions of Section 5. For any closed set U ⊂ Ω, it can be shown that there exists a
Lyapunov function L uniformly over θ ∈ U such that L(s, a) = L(s) = exp(cs) for some c > 0,
depending on U and the model parameters. We look at the equivalent Lyapunov stability condition
for continious Markov chains. If θ ∈ U we have µ − λπk (θ) > δ(U ) > 0. Then, for s > k + 1, the
generator of the Markov process Qθ satisfies

Qθ L(s) = qθ (s − 1|s)L(s − 1) + qθ (s + 1|s)L(s + 1) + qθ (s|s)L(s)

= µ exp(c(s − 1)) + λπk (θ) exp(c(s + 1)) − (µ + λπk (θ)) exp(cs)
= exp(cs)(µ exp(−c) + λπk (θ)) exp(c) + (µ + λπk (θ)) (55)

= −c µ − λπk (θ) + O(c) L(s). (56)

For c small enough, from (56) we have that Qθ L(s) ≤ −cδ(U )/2L(s), so that for any θ ∈ U the
Markov chain corresponding to the policy of θ is geometrically ergodic. Hence, Assumptions 4, 5 and
6 are satisfied. In general, Assumption 7 does not hold for this example because maxima occur only
as |θ| → ∞. As suggested by Proposition 7, by adding a small regularization term, we can guarantee
Assumption 7 while simultaneously ensuring that the maximizer is bounded. In practice, using a
regularization term can additionally present some benefits such as avoiding vanishing gradients and
saddle points.
Effective State Space. The effective state space if captured in the term (22). Similarly to the
continious time Markov chain example from (56), we have that the Lyapunov function is L(s, a) =
exp(cs) for some c > 0 small enough. For a policy πk (θ) with θ ∈ V , it is easy to show that if s ≥ k,
setting 1 > ρ > λπk (θ)/µ we have that choosing c small enough such that
ρ 1 1−ρ
exp(c) + exp(−c) ≤ 1 − c , (57)
1+ρ 1+ρ 2(1 + ρ)

35
Comte, Jonckheere, Sanders and Senen-Cerda

we have
Pθ L(s) ≤ λL(s) + b, (58)
where
1−ρ
λ=1−c (59)
2(1 + ρ)
b = exp(c1 k) for some c1 > 0. (60)
If we let s0 ∈ [k], then
L? = O exp(ck) . (61)
?
We note that in this case L ∼ Volume(U ) ∼ exp(dim(θ)). Hence, it encodes the volume of the
optimization space where the algorithm operates. We remark that the Lyapunov function encodes
geometric ergodicity and allows to tackle any type of rewards, as long as they satisfy |r|L < ∞.
Thus, for a specific reward r better bounds could be attained.
We remark that geometric ergodicity is not equal to Foster stability with condition Pθ L(s) ≤
L(s) − δ for some δ > 0. Foster stability implies positive recurrence of the Markov chain, and the
existence of E[L(s)]. In this case, a Lyapunov function of the type L(s) ' s2 would suffice to show
positive recurrence.

B.2 Load-balancing system

We now consider the load-balancing example of Section 6.2. Recall that jobs arrive according
to a Poisson process with rate λ > 0, there are n servers at which service times are distributed
exponentially with rates µ1 , µ2 , . . . , µn , respectively, and the system can contain at most c jobs, for
some c ∈ N+ . The goal is to choose a static random policy that maximizes the admission probability.
We first verify that the system satisfies Assumptions 1 to 3, then we provide an algorithm to evaluate
the objective function when the parameters are known; this is used in particular for performance
comparison with the optimal policy in the numerical results. Lastly, we discuss the assumptions of
Section 5. Throughout this section, we assume that we apply the policy π(θ) defined by (51) for
some parameter θ ∈ Rn .
Product-form stationary distribution. That Assumption 1 is satisfied follows from the facts
that the rates and probabilities λ, µ1 , µ2 , . . . , µn , π1 (θ), π2 (θ), . . . , πn (θ) are positive and that the
state space S is finite. Assumption 2 is satisfied because the state space is finite. This system can
be modeled either as a loss Jackson network with n queues (one queue for each server in the load-
balancing system) or as a closed Jackson network with n + 1 queues (one queue for each server in the
system, plus another queue signaling available positions in the system, with service rate λ). Either
way, we can verify (for instance by writing the balance equations) that the stationary distribution
of the continuous-time Markov chain that describes the evolution of the system state is given by:
n si
1 Y λ
p(s|θ) = πi (θ) , s = (s1 , s2 , . . . , sn ) ∈ S, (62)
Z(θ) i=1 µi
where Z(θ) followsQby normalization. This is exactly (10–PF) from Assumption 3, with n = d,
n
Ω = Rn , Φ(s) = i=1 ( µλi )si for each s ∈ S, xi (s) = si for each i ∈ {1, 2, . . . , n} and s ∈ S,
and ρi (θ) = πi (θ) for each i ∈ {1, 2, . . . , n}. The function ρ defined in this way is differentiable.
Assumption 3 is therefore satisfied, as the distribution of the system seen at arrival times is also (62)
according to the PASTA property. Besides the sufficient statistics x, the inputs of Algorithm 2 are
∇ log π(a|s, θ) = 1a − π(θ), where 1a is the n-dimensional vector with one in component a and zero
elsewhere and π(θ) is the policy seen as a (column) vector, and D log ρ(θ) = Id − 1π(θ)| , where Id
is the n-dimensional identity matrix, 1 is the n-dimensional vector with all-one components, and
π(θ)| is the (row) vector obtained by transposing π(θ). This latter equation can be used to verify
Assumption 5.

36
Score-Aware Policy-Gradient Methods and Performance Guarantees

Objective function. When all parameters are known and the number of servers is not too
large, the normalizing constant Z(θ) and admission probability J(θ) can be computed efficiently
using a variant of Buzen’s algorithm (Buzen, 1973) for loss networks. Define the array G =
(Gc,n )c∈{0,1,...,c},n∈{1,2,...,n} by
n si
X Y λ
Gc,n = ρi (θ) , c ∈ {0, 1, . . . , c}, n ∈ {1, 2, . . . , n}.
µi
s∈Nn : i=1
|s|≤c

The dependency of G on θ is left implicit to alleviate notation. The normalizing constant and
admission probability are given by Z(θ) = Gc,n and J(θ) = Gc−1,n /Gc,n , respectively. Defining the
array G allows us to calculate these metrics more efficiently than by direct calculation, as we have
G0,n = 1 for each n ∈ {1, 2, . . . , n}, and
λ
Gc,1 = 1 + µ1 ρ1 (θ)Gc−1,1 , c ∈ {1, 2, . . . , c},
Gc,n = Gc,n−1 + µλn ρn (θ)Gc−1,n , c ∈ {1, 2, . . . , c}, n ∈ {2, 3, . . . n}.

Assumptions of Section 5. Assumptions 4, 5, and 6 are automatically satisfied because the

state space is finite (with |S| = n+c

c ). Verifying Assumption 7 is challenging since it requires
computing Hessθ? J at the maximizer θ? , which depends in an implicit manner on the parameters
of the system such as the arrival rate λ, service rates µ1 , µ2 , . . . , µn , and policy π(θ? ). However,
the nondegeneracy property of the Hessian for smooth functions is a property that is commonly
stable in the following sense: if a function satisfies this property, then it will still be satisfied after
any small-enough smooth perturbation. In particular, smooth functions with isolated nondegenerate
critical points—also known as Morse functions—are dense and form an open subset in the space of
smooth functions; see (Nicolaescu et al., 2011, Section 1.2). Thus, unless the example is adversarial
or presents symmetries, we can expect Assumption 7 to hold.

B.3 Ising model and Glauber dynamics

Lastly, we focus on the example of Section 6.3. We consider the Markov chain defined by applying
the policy (52) parameterized by some vector θ ∈ R3 : starting from an arbitrary initial configuration,
at every time step, a coordinate is chosen uniformly at random, and the agent flips or not the spin
at this coordinate according to the policy.
Product-form stationary distribution The Markov chain is irreducible because it has a positive
probability of transitioning from any configuration to any other as follows: at every step, choose a
coordinate at which the two configurations differ and flip the spin at this coordinate. The Markov
chain is positive recurrent because its state space is finite. Hence, Assumptions 1 and 2 are satisfied.
We now focus on proving Assumption 3.
Our goal is to verify that the Markov chain that describes the random evolution of the state
admits the stationary distribution (53), which we recall here:
1 β(θ)JI(θ)+β(θ)µF (σ|θ)
p(σ, v|θ) = e , s = (σ, v) ∈ S, θ ∈ R3 ,
Z(θ)

Observe that p(σ, v|θ) is independent of v, hence we can let q(σ|θ) , p(σ, ·|θ) for each σ ∈ Σ. The
key argument to prove that this is indeed the stationary distribution consists of observing that the
policy (52) satisfies, for each s = (σ, v) ∈ S,

q(σ−v |θ) q(σ|θ)

37
Comte, Jonckheere, Sanders and Senen-Cerda

where σ−v ∈ Σ is the configuration obtained by flipping the spin at v compared to σ, that is,
σ−v (w) = σ(w) for each w ∈ V \ {v} and σ−v (v) = −σ(v).
The balance equation for a particular state s = (σ, v) ∈ S writes
X 1 X 1
p(σ, v|θ) = p(σ, w|θ)π(not flip|σ, w, θ) + p(σ−w , w|θ)π(flip|σ−w , w, θ) .
d1 d2 d1 d2
w∈V w∈V

Dropping the dependency on θ to simplify notation, and injecting (63) into the right-hand side of
this balance equation, we obtain successively
X q(σ) 1 X q((σ−w )−w ) 1
p(σ, w) + p(σ−w , w)
q(σ) + q(σ−w ) d1 d2 q(σ−w ) + q((σ−w )−w ) d1 d2
w∈V w∈V
(1) X q(σ) 1 (2) X 1 (2)
= (p(σ, w) + p(σ−w , w)) = q(σ) = q(σ) = p(σ, v),
q(σ) + q(σ−w ) d1 d2 d1 d2
w∈V w∈V

where (1) follows by observing that (σ−w )−w = σ and (2) by recalling that q(σ) = p(σ, w) for each
(σ, v) ∈ S. This proves that the distribution (53) is indeed the stationary distribution of the Markov
chain that describes the evolution of the state under the policy (52).
Besides the sufficient statistics x, the inputs of Algorithm 2 are given, for each θ ∈ R3 and
s = (σ, v) ∈ S, by

β 0 (θ)J
 
0 0
D log ρ(θ) =  β 0 (θ)µθleft (θ) β(θ)µhleft 0 (θ) 0 ,
β 0 (θ)µθright (θ) 0 β(θ)µhright 0 (θ)
∇ log π(a|σ, v, θ) = (1[a = not flip] − π(not flip|σ, v, θ))∇δ(σ, v|θ),
 0 P 
β (θ)σ(v)(J w∈V: w∼v σ(w) + µh(v|θ))
∇δ(σ, v|θ) = 2 β(θ)σ(v)µhleft 0 (θ)1[v2 ≤ d2 /2] ,
β(θ)σ(v)µhright 0 (θ)1[v2 > d2 /2]

where β 0 (θ) (resp. h0left (θ), h0right (θ)) is to be understood as the partial derivative of β (resp. hleft ,
hright ) with respect to θ1 (resp. θ2 , θ3 ).

Appendix C. Proof of Theorem 2

C.1 Preliminaries
We are going to use concentration inequalities for Markov chains. Such results are common in the
literature (for example, see Karimi et al. (2019)), and will be required to get a concentration bound
of the plug-in estimators from (16).
Denote by B (θ) the open ball of radius centered at θ ∈ Ω ⊆ Rn and Y = S × A. Given a
function q : Y → R and the Lyapunov function L : Y → [1, ∞) from Assumption 4, define

|q(y)|
|q|L = sup . (64)
y∈Y L(y)

Given a signed measure ν, we also define the seminorm

Z
|ν|L = sup |ν[q]| = sup q(y)ν(dy) . (65)
|q|L ≤1 |q|L ≤1

Equations (64) to (65) imply that

|ν[q]| ≤ |ν|L |q|L . (66)

38
Score-Aware Policy-Gradient Methods and Performance Guarantees

Note that we defined | · |L for a unidimensional function. Given instead m functions qi : Y → R,

for the higher-dimensional
qP function q : Y → Rm that satisfies for all y ∈ Y, q(y) = (q1 (y), . . . , ql (y)),
l 2
we define |q|L = i=1 |qi |L .
The following lemma yields the concentration inequalities required:

Lemma 13 Let {Yn }n≥1 be a geometrically ergodic Markov chain with invariant distribution p and
transition matrix P ( · , · ). Let the Lyapunov function be L : Y → R. From geometric ergodicity,
there exists C > 0 and λ ∈ (0, 1) such that for any y ∈ Y,

P m ( · |y) − p(·) L
≤ Cλm . (67)

Let F = σ(Y1 ) be the σ-algebra of Y1 . Let q : Y → Rm be a measurable function such that |q|L < ∞.
For a finite trajectory Y1 , . . . , YM of the Markov chain, we define the empirical estimator for p[q] as
M
1 X
p̂M [q] = q(Yi ). (68)
M i=1

With these assumptions, there exists C 0 depending on C and λ such that

h i C 0 |q|L
E p[q] − p̂M [q] F ≤ L(Y1 ), (69)
M
and for l ∈ {1, 2, 4},
h i C 0 |q|l
l L l
E p[q] − p̂M [q] F ≤ L (Y1 ). (70)
M l/2
Proof We refer to (Fort and Moulines, 2003, Prop. 12) for a proof of (70). What remains is to
prove (69).
Observe that for y ∈ Y, P (y) = P ( · |y) is a distribution over Y. Conditional on F, there exists
C > 0 such that
M
h 1 X M M
i 1 X i 1 X
P i (Y1 ) − p [q]

E q(Yi ) − p[q] F ≤ P (Y1 )[q] − p[q] =
M i=1 M i=1 M i=1
M M
1 X i |q|L X
≤ P ( · |Y1 ) − p(·) L |q|L L(Y1 ) ≤ Cλi L(Y1 )
M i=1 M i=1
C|q|L
≤ L(Y1 ). (71)
M (1 − λ)

This concludes the proof.

In epoch m, the Markov chain {St }t∈[tm ,tm+1 ] with control parameter Θm has a Lyapunov function
Lv . Intuitively, as a consequence of Assumption 4, we can show that the process does not drift to
infinity on the event Bm (despite the changing control parameter Θm ).
Specifically, for m > 0, let {St }i∈[tm ,tm+1 ] be the Markov chain trajectory with transition prob-
abilities P (Θm ), where Θm is given by the updates in (3) and (16) and initial state S0 ∈ S. Recall
that Bm is defined in (38). We can then prove the following:

Lemma 14 Suppose Assumption 4 holds. There exists D < ∞ such that for m > 0, E? Lv (Stm+1 )
1[Bm ] < D. In particular, using the notation of Assumption 4 we may choose D = L as defined
in (22)

39
Comte, Jonckheere, Sanders and Senen-Cerda

Proof We will give an inductive argument. A similar argument can be found in Atchadé et al.
(2017).
First, observe that for m = 0, S0 is fixed. Thus, there exists a D such that Lv (S0 ) ≤ D.
Next, assume that E[Lv (Stm )1[Bm−1 ]] ≤ D. On the event Bm , Assumption 4 holds since
Θ1 , . . . , Θm−1 , Θm ∈ Vr,δ (θ? ) ⊂ U . Thus, on the event Bm , and when additionally conditioning
on Stm+1 −1 and Θm , the following holds true:

E Lv (Stm+1 )1[Bm ] ≤ E E[Lv (Stm+1 )1[Bm ]|Stm+1 −1 ]

= E 1[Bm ]PΘm Lv (Stm+1 −1 ) (72)

≤ E 1[Bm ][λLv (Stm+1 −1 ) + b] .

The last step followed from Assumption 4.

Observe finally that the bound in (72) can be iterated by conditioning on Stm+1 −2 ; so on and so
forth. After tm+1 − tm iterations, one obtains
b
E Lv (Stm+1 )1[Bm ] ≤ λE Lv (Stm )1[Bm ] + . (73)
1−λ
Noting that 1[Bm ] ≤ 1[Bm−1 ], the claim follows by induction if we choose D large enough such that
λD + b/(1 − λ) ≤ D, that is D ≥ L? .

C.2 Proof of Theorem 2

To prove Theorem 2, we more–or–less follow the arguments of (Fehrman et al., 2020, Thm. 25).
Modifications are however required because we consider a Markovian setting instead. Specifically,
we rely on the bounds in Lemmas 10 to 12 instead of the bounds in (Fehrman et al., 2020, Prop. 20,
Prop. 21, Prop. 24), respectively.
Let us begin by bounding
P[J ? − J(Θm ) > |B0 ]. (74)
Here, B0 = {Θ0 ∈ Vr,δ (θ? )}—recall (38). Theorem 2 assumes that we initialize in a set V which we
will specify later but satisfies V ⊂ Vr,δ (θ? ). Since we can initialize Θ0 with positive probability in
V , we have that P[B0 ] ≥ P[Θ0 ∈ V ] > 1/c > 0 for some c > 0. Thus, we will focus on finding an
upper bound of
P[J ? − J(Θm ) > |B0 ] ≤ cP[{J ? − J(Θm ) > } ∩ B0 ]. (75)
Denote the orthogonal projection of Θm onto M∩U by Θ̃m = p(Θm ). We can relate the objective
gap J ? − J(Θm ) to the distance Dm := dist(Θm , M ∩ U ) as follows. Since J is twice continuously
differentiable with maximum J ? attained at M ∩ U , the function J(θ) with θ ∈ Vr,δ (θ? ) is locally
Lipschitz with constant lr,δ (Θ? ) > 0. On the event Bm , we have Θm ∈ Vr,δ (θ? ) and therefore we
have the inequality

J ? − J(Θm ) = J(Θ̃m ) − J(Θm ) ≤ lr,δ (θ? ) Θ̃m − Θm = lr,δ (θ? )Dm . (76)

Consequently, we have the bound

hn o i
P[{J ? − J(Θm ) > } ∩ Bm ] ≤ P Dm ≥ ?
∩ Bm (77)
lr,δ (θ )

If we define 0 = /lr,δ (θ? ), the right-hand side of (77) can also be written as

P[{Dm ≥ 0 } ∩ Bm ] = E[1[Dm ≥ 0 ]1[Bm ]]

= E[1[Dm 1[Bm ] ≥ 0 ]] = P[Dm 1[Bm ] ≥ 0 ] (78)

40
Score-Aware Policy-Gradient Methods and Performance Guarantees

by the positivity of Dm .
Next, we use (i) the law of total probability noting that Bm ⊂ B0 , (ii) the bound (77) and the
inequality P[A ∩ B] ≤ P[A] for any two events A, B, and finally, (iii) the equality (78). We obtain
(i)
P[{J ? − J(Θm ) > } ∩ B0 ] ≤ P[{J ? − J(Θm )) > } ∩ Bm ] + P[{J ? − J(Θm )) > } ∩ Bm ]
(ii)
≤ P[{Dm ≥ 0 } ∩ Bm ] + P[Bm ]
(iii)
≤ P[Dm 1[Bm ] ≥ 0 ] + P[Bm ]
≤ P[Dm 1[Bm−1 ] ≥ 0 ] + P[Bm ] = Term I + Term II. (79)

Term I can be bounded by using Markov’‘s inequality and Lemma 10. This shows that

Term I ≤ c0−2 m−σ−κ . (80)

Term II can be bounded by Lemma 12. Specifically, one finds that there exists a constant c > 0
such that, if Θ0 ∈ Vr/2,δ (Θ? ),
cα2 (m1−3/2σ−κ/2 + `−1/2 m1−5σ/8−κ/2 )
Term II ≤ 1 − exp − 2 + cδ −2 `−1 m1−σ−κ + cα . (81)
δ ` (r/2 − 2δ)+
Note next that for any α ∈ (0, α0 ] and c > 0 there exists δ0 such that for any δ ∈ (0, δ0 ] there
exists `0 such that if ` ∈ [`0 , ∞) there exists a constant c0 > 0 such that we have the inequality
1 − exp (−cα2 /δ 2 `) ≤ c0 α2 /δ 2 `. We can substitute this bound in (81) to yield

α2
Term II ≤ c0 + idem. (82)
δ2 `
Bounding (79) by the sum of (80) and (82), and substituting the bound in (75) reveals that there
exists a constant c00 > 0 such that if Θ0 ∈ Vr/2,δ (θ? ) then

P[J ? − J(Θm ) > |B0 ] ≤ c00 (0 )−2 m−σ−κ + c00 α2 δ −2 `−1 + c00 δ −2 `−1 m1−σ−κ
(m1−3/2σ−κ/2 + `−1/2 m1−5σ/8−κ/2 )
+ c00 α . (83)
(r/2 − 2δ)+
Note that the exponents of m in (83) satisfy that since σ ∈ (2/3, 1), 1 − 3/2σ − κ/2 ≤ −κ/2 as well
as 1 − 5σ/8 − κ/2 < 1 − σ/2 − κ/2. Finally, let the initialization set be V = Vr/2,δ (θ? ). Note that
since {Θ0 ∈ V } ⊂ B0 there exists a constant c000 > 0 such that

P[J ? − J(Θm ) > |Θ0 ∈ V ] ≤ c000 P[J ? − J(Θm ) > |B0 ]. (84)

Substituting the upper bound (83) in (84) concludes the proof.

C.3 Proof of Lemma 8

For simplicity, we will denote tm+1 − tm = Tm , Xt = x(St ) throughout this proof. We also tem-
porarily omit the summation indices for the epoch. We note that the policies defined in (6) satisfy
that for (s, a) ∈ S × A,
(
1[a = a0 ] − π(a0 |s, θ) if i = h(s),
∇ log π(a|s, θ) i,a0 =
0 otherwise.

In particular, there exists c1 > 0 such that for any (s, a) ∈ S × A, |∇ log π(a|s, θ)| < c1 . The proof
below, however, can also be extended to other policy classes.

41
Comte, Jonckheere, Sanders and Senen-Cerda

C.3.1 Proof of (43)

Observe that if the event Bm holds, that then the definitions in (16) also imply that

ηm = ∇J(Θm ) − Hm = ∇J(Θm ) − (D log ρ(Θm )| C m + E m )

tm+1 −1
1 X
= ∇J(Θm ) − (D log ρ(Θm )|

Xt − X m r(St , At )
Tm+1 t=tm
tm+1 −1
1 X
+ r(St , At )∇ log π(At |St , Θm )
Tm t=tm
tm+1 −1
1 X
= D log ρ(Θm )| Cov[R, S] −

Xt − X m r(St , At )
Tm t=tm
tm+1 −1
1 X
+ E[R∇ log π(A|S, Θm )] − r(St , At )∇ log π(At |St , Θm )
Tm t=tm
|
= D log ρ(Θm ) η̃m + ζ̃m . (85)

We will deal with the terms η̃m in and ζ̃m in (85) one–by–one.
Dealing with the 1st term, η̃m . Define
1 X
A = E[(X − E[X])R] − (Xt − E[X])r(St , At ),
Tm t
1 X
B= r(St , At ) (E[X] − X̄m ), (86)
Tm t

and observe that

η̃m = A + B. (87)

We look first at A in (86). Recall that {Yt }t>0 = {(St , At )}t>0 is the chain of state-action pairs
(see Section 5.1). Define the function g : S × A → Rn as

g(y) = g((s, a)) = x(s) − E[x(s)] r(y). (88)

Then, we can rewrite

1 X
A = E[g(Y )] − g(Yt ). (89)
Tm t
We are now almost in position to apply Lemma 13 to A. Observe next that the law of total
expectation implies that
X
E[ηm 1[Bm ]|Fm ] = E[ηm 1[Bm ]|Fm , Atm = a]π(a|Sm , Θm ), (90)
a∈A

Without loss of generality, it therefore suffices to consider the case that we have one action Atm =
a ∈ A. For the first term we have that there exists a constant c2 > 0 such that
1 X
E[A1[Bm ]|Fm , Atm = a] = E E[g(Y )] − g(Yt )1[Bm ]|Y0 = (Stm , Atm )
Tm t
(Lemma 13) c2 |g|L
= L((Stm , a)), (91)
Tm

42
Score-Aware Policy-Gradient Methods and Performance Guarantees

where we can use that |g|L < ∞ due to Assumption 6.

For the term B in (86). We can add and subtract again the following terms and obtain
1 X
B= r(St , At ) (E[X] − X̄m ) − E[R](E[X] − X̄m )
Tm t
+ E[R](E[X] − X̄m )
= C + D, (92)
where
1 X
C = (E[X] − X̄m ) r(St , At ) − E[R] ,
Tm t
D = E[R](E[X] − X̄m ). (93)
For the term D in (93) we can readily use the concentration of Lemma 13 to obtain
|x(S)|L
E E[R](E[X] − X̄m )1[Bm ]|Fm , Atm = a ≤ E[R] L(Stm , a), (94)
Tm
where we have |x(S)|L < ∞ from Assumption 6 and E[R] < J ? .
For the term C, we use Cauchy–Schwartz together with Lemma 8. In particular, we have
1 X
E (E[X] − X̄m ) r(St , At ) − E[R] 1[Bm ] Fm , Atm = a ≤
Tm t
1/2
E |E[X] − X̄m |2 1[Bm ] Fm , Atm = a

×
h 1 X i 1/2
2
E r(St , At ) − E[R] 1[Bm ] Fm , Atm = a (95)
Tm t
For both terms we can repeat the same argument to that in (90) together with Lemma 13 to show
that
1/2
1/2 |X|L
E |E[X] − X̄m |2 1[Bm ] Fm , Atm = a

≤ c3 1/2
L(Stm , a)
Tm
1/2
h 1 X
2
i 1/2 |R|L
E r(St , At ) − E[R] 1[Bm ] Fm , Atm = a ≤ c4 1/2
L(Stm , a) (96)
Tm t Tm
Therefore multiplying both bounds in (96) and using Assumption 6 to bound the L-norms, we obtain
that there exists c5 > 0 such that
c5
|E[C|Fm , Atm = a]| ≤ L(Stm , a)2 . (97)
Tm
Adding the bounds (91), (97), and (94) together we have now
c6 2
|E[η̃m 1[Bm ]|Fm , Atm = a]| ≤ L (Stm , a). (98)
Tm
Finally, averaging this bound over all actions in (90), we obtain
c7 X c7
L(Stm , a)2 π(a|Stm , Θm ) ≤ L4 (Stm )1/2 .

|E[η̃m 1[Bm ]|Fm ]| ≤ (99)
Tm a Tm
Now we use Assumption 5. We can write
|E[∇ log(Θm )η̃m 1[Bm ]|Fm ]| = |∇ log(Θm )E[η̃m 1[Bm ]|Fm ]|
≤ C|E[η̃m 1[Bm ]|Fm ]|
c8
≤ L(Stm ) (100)
Tm

43
Comte, Jonckheere, Sanders and Senen-Cerda

Dealing with the 2nd term, ζ̃m . Define a function of Y = (S, A) as

g(Y ) = r(Y )∇ log π(A|S, θ), (101)

This concludes the proof of (43).

C.3.2 Proof of (44)

Note that by using the fact that for a vector-valued random variable Z we have that E[|Z|2 ] ≥ E[|Z|]2 ,
the case for p = 1 follows from the case p = 2.
We focus on the case p = 2. By using the identity (a + b) ≤ 2a2 + b2 , we estimate

E[|D log ρ(Θm )| η̃m + ζ̃m |2 1[Bm ]|Fm ]

≤ 2(E[|D log ρ(Θm )| η̃m |2 1[Bm ]|Fm ] + E[|ζ̃m |2 1[Bm ]|Fm ])
(5)
≤ 2c21 E[|η̃m |2 1[Bm ]|Fm ] + 2E[|ζ̃m |2 1[Bm ]|Fm ] (105)

say. We again use the law of total expectation with the action set in (90) and condition on the
action Am = a.
For the term involving ζ̃m in (105) we can again use the definition of g in (101). We bound
1 X
E[|ζ̃m |2 1[Bm ]|Fm , Atm = a] = E |E[g(Y ) − g(Y )|2 |Y0 = (Stm , a)

Tm t
(Lemma 13) c2
≤ L(Stm , a)2 . (106)
Tm
For the term involving η̃m in (105), we use the same definition for the terms A, C and D from
(91), (97) and (94) as in the proof of (43). We have the bound

E[|η̃m |2 1[Bm ]|Fm , Atm = a] ≤ 3(E[|A|2 1[Bm ]|Fm , Atm = a] + E[|C|2 1[Bm ]|Fm , Atm = a]
+ E[|D|2 1[Bm ]|Fm , Atm = a]) (107)

For the terms pertaining to A and D in (107) the same argument as those used for ζ̃m in (101) and
(106) can be used to show that
c3
E[|A|2 1[Bm ]|Fm , Atm = a] ≤ L(Stm , a)2
Tm
c4
E[|D|2 1[Bm ]|Fm , Atm = a] ≤ L(Stm , a)2 . (108)
Tm

44
Score-Aware Policy-Gradient Methods and Performance Guarantees

The only remaining term to bound in (107) is C. We use again Cauchy–Schwartz’s inequality
h 1 X 4 i
E (E[X] − X̄m ) r(St , At ) − E[R] 1[Bm ] Fm , Atm = a ≤
Tm t
1/2
E |E[X] − X̄m |2 1[Bm ] Fm , Atm = a

×
h 1 X
4
i 1/2
E r(St , At ) − E[R] 1[Bm ] Fm , Atm = a m (109)
Tm t

and by Lemma 13 the following hold

1/2
1/2 |X|L
E |E[X] − X̄m |4 1[Bm ] Fm , Atm = a L(Stm , a)2

≤ c5
Tm
1/2
h 1 X
4
i 1/2 |R|L
E r(St , At ) − E[R] 1[Bm ] Fm , Atm = a ≤ c6 L(Stm , a)2 . (110)
Tm t Tm

The bound for C thus becomes

c7
E[|C|2 |Fm , Atm = a] ≤ 2
L(Stm , a)4 . (111)
Tm
Upper bounding all terms by the largest exponents and adding over the different actions, we finally
obtain
c8 X c9
E[|ηm |2 1[Bm ]|Fm ] ≤ L(Stm , a)4 π(a|Stm ), Θm ≤ L4 (Stm ). (112)
Tm a Tm
That is it.

C.4 Proof of Lemma 10

We will again use the notation tm+1 − tm = Tm and without loss of generality we will assume that
Tm = `mσ/2+κ instead of b`mσ/2+κ c. This can be assumed since for m ≥ 1 there exist constants
cl , cu > 0 such that cl `mσ/2+κ ≤ tm+1 − tm ≥ cu `mσ/2+κ . The proof of Lemma 10 follows the same
steps as in (Fehrman et al., 2020, Proposition 20). However, we have to quickly diverge and adapt
the estimates to the case that there the variance of Hm depends on the states of a Markov chain.
From the assumptions, it can be shown that there is a unique differentiable orthogonal projection
map p : Vr,δ (θ? ) → M ∩ U from Vr,δ (θ? ) ∩ U onto Vr,δ (θ? ) ∩ M ∩ U . The distance of Θm to the set
of minima can then be upper bounded by the distance to the projection p : Vr,δ (θ? ) → M ∩ U of
Θm−1 by

dist(Θm , M ∩ U )2 ≤ |Θm − p(Θm−1 )|2

≤ |Θm−1 − p(Θm−1 ) − αm−1 ∇J(Θm−1 )
+ (αm−1 ∇J(Θm−1 ) − αm−1 Hm−1 )|2 (113)

After expanding (113) and taking expectations, however, the effect of bias already appears, and we
must diverge from the analysis from (Fehrman et al., 2020, (44)) thereafter. In particular, the effect
of the bias of Hm−1 needs to be handled in the terms
h D E i
E 2 Θm−1 − p(Θm−1 ) − αm−1 ∇J(Θm−1 ), αm−1 ∇J(Θm−1 ) − αm−1 Hm−1 1[Bm−1 ] , (114)

and
h 2 i h i
E αm−1 ∇J(Θm−1 ) − αm−1 Hm−1 1[Bm−1 ] = (αm−1 )2 E |ηm−1 |2 1[Bm−1 ] . (115)

45
Comte, Jonckheere, Sanders and Senen-Cerda

We specifically require bounds of these terms without relying on independence of the iterands.
We focus on (115) first. Recall for m > 0, that Fm is the sigma algebra defined in (39). By
using the tower property of the conditional expectation and conditioning on Fm−1 , from Lemma 8
together with the fact that Tm < cTm−1 for some c > 0, we obtain directly
h h ii ( Lemma 8 ) c1
(115) = (αm−1 )2 E E |ηm−1 |2 1[Bm−1 ] Fm−1 ≤ (αm−1 )2 E[L4 (Stm−1 )2 1[Bm−1 ]].
Tm
(116)

Let us next bound (114). Note that this term does not vanish due to dependence of the samples
conditional on Fm−1 . In our case, however, we have a Markov chain trajectory whose kernel will
depend on Θm−1 . Let

Zm−1 = Θm−1 − p(Θm−1 ) − αm−1 ∇J(Θm−1 ). (117)

We use the law of total expectation again on (114). Note that Zm−1 and Bm−1 are Fm−1 -measurable.
h i
(114) ≤ 2αm−1 E 1[Bm−1 ]Zm−1 , E[ηm−1 |Fm−1 ]
(i) h i1/2 h i1/2
≤ 2αm−1 E |Zm−1 |2 1[Bm−1 ] E |E[ηm−1 1[Bm−1 ]|Fm−1 ]|2
(ii) h i1/2 h i1/2 c
2
≤ 2αm−1 E |Zm−1 |2 1[Bm−1 ] E 1[Bm−1 ]L4 (Stm−1 )2 (118)
Tm

where (i) have used Cauchy–Schwartz and (ii) Lemma 8 and the fact that for some c > 0, Tm <
cTm−1 .
The terms in (116) and (118) containing L4 (Stm ) can be upper bounded as follows. From the
definition of (42) and since v ≥ 16, by a generalized mean inequality and the fact that L(s, a) ≥ 1
for any (s, a) ∈ S × A we have

L4 (s) ≤ Lv (s)4/v ≤ Lv (s)1/4 . (119)

Now, by Lemma 14, L? is such that for all m ∈ N

h i h i ( 119) h i
E 1[Bm−1 ]L4 (Stm−1 )2 ≤ E 1[Bm−2 ]L4 (Stm−1 )2 ≤ E 1[Bm−2 ]Lv (Stm−1 ) ≤ L? . (120)

For the other term in (118), we can use the same bound used in (Fehrman et al., 2020, (41)):
There exists constants y, c > 0 depending on J, θ? and r0 such that on the event Bm−1 we have
2
|Zm−1 |2 ≤ 1 − αm−1 y dist(Θm−1 , M ∩ U )2 + c 1 − αm−1 y αm−1 dist(Θm−1 , M ∩ U )3

+ c(αm−1 )2 dist(Θm−1 , M ∩ U )4 . (121)

The bound in (121) characterizes the fact that, close to the manifold of maximizers, the projection
is differentiable and can be approximated by an orthogonal expansion of J around the manifold
of maximizers. The error terms of this expansion can be bounded depending on the Hessian at
p(Θm−1 ) ∈ M ∩ U , Hessp(Θm−1 ) J. We refer to (Fehrman et al., 2020, Proposition 17) for a proof
of this fact.
We will now use an induction argument to show the claim of the lemma. Namely, we will assume
for the time being that for m − 1 we have
h i
E (dist(Θm−1 , M ∩ U ) ∧ δ)2 1[Bm−1 ] ≤ δ 2 c(α)(m − 1)−σ−κ , (122)

46
Score-Aware Policy-Gradient Methods and Performance Guarantees

where c(α) > 0 is a function of a to be determined. We want to show (122) for m. To do so we will
use (121) to bound Zm−1 . Suppose that there exists a sequence {bl }l>0 ⊂ R+ such that we have
h i
E |Zm−1 |2 1[Bm−1 ] ≤ bm−1 . (123)

Using (123) in (118) yields that for some c3 > 0 we have:

c3
(114) ≤ 2(bm−1 )1/2 αm−1 (L? )1/2 . (124)
Tm
From the expansion of (113) and combining the bounds of (121) and (124) together we obtain
h i c3
E dist(Θm , M ∩ U )2 1[Bm−1 ] ≤ bm−1 + 2(bm−1 )1/2 αm−1 (L? )1/2
Tm
2 c4 ?
+ (αm−1 ) L . (125)
Tm
We show now that from the induction hypothesis, if (122) holds, then we also have the bound
αy
bm−1 ≤ c(α)δ 2 m−σ−κ − δ 2 c(α)(m − 1)−σ−κ m−σ . (126)
2
Indeed, taking expectations in (121) and using the bound (122) yields
2
bm−1 ≤ 1 − αm−1 y c(α)(m − 1)−σ−κ + c(α) 1 − αm−1 y αm−1 δc(α)(m − 1)−σ−κ

+ c(α)(αm−1 )2 δ 2 c(α)(m − 1)−σ−κ . (127)

Recall that αm−1 = αm−σ/2−κ . Adding and subtracting c(α)m−σ−κ in (127), we obtain that

bm−1 ≤ c(α)m−σ−κ
α2 y α2 y 2

αy
+ c(α)m−σ (m − 1)−σ−κ mσ − (m − 1)σ+κ m−κ − 2αy + σ + 1 − σ αδ + δ 2 σ
m m m

Note now that there exists m0 (a) > 0 such that if m ≥ m0 (a), we have

α2 y αy
mσ − (m − 1)σ+κ m−κ − αy + σ
<− . (128)
m 2
Indeed, note that the latter equation can be satisfied for m ≥ m0 (a) since there exists a constant
c > 0 depending on σ and κ such that

mσ − (m − 1)σ+κ m−κ ≤ m−κ (mσ+κ − (m − 1)σ+κ )

≤ m−κ (σ + κ) max[(m − 1)σ+κ−1 , mσ+κ−1 ]
≤ c5 (σ + κ)mσ−1 . (129)

In this case we have that 2c (σ + κ) 1−σ

5 c0
m0 (α) = > 1−σ . (130)
yα α
Then for m > m0 (α), we will have

α2 y 2

αy αy
bm ≤c(α)m−σ−κ + c(α)m−σ (m − 1)−σ−κ − + 1 − σ αδ + δ 2 σ .
2 m m

47
Comte, Jonckheere, Sanders and Senen-Cerda

Choose δ ∈ (0, δ1 (α)], where δ1 (α) is a bound that we will choose appropriately, such that for any
m ≥ m0 (α) we have
αy α2 y 2
1 − σ αδ + δ 2 σ ≤ αy. (131)
m m
Thus, from (122) we obtain (126). With (126) with an appropriate choice of c(α), we can now show
(122) for m. We will namely choose c(α) as follows
c0 4C 2 L? + 4yCL? α`
c(α) = max (1−σ)(σ+κ) , , (132)
α δ 2 `2 y 2
where recall that δ ∈ (0, δ1 (α)] and δ1 (α) were chosen so that (131) holds. Let L = `−1 . Substituting
the bound of (126) into (125) and recalling that Tm = mκ+σ/2 ` yields
h i αy
E (dist(Θm , M ∩ U ))2 1[Bm−1 ] ≤ c(α)δ 2 m−σ−κ − c(α)δ 2 (m − 1)−σ−κ m−σ
2
αλ c3 α2 c3
+ 2(c(α)δ 2 m−σ−κ − c(α)δ 2 (m − 1)−σ−κ m−σ )1/2 αm−σ (L? )1/2 + L? m−2σ
2 Tm Tm
2 −σ−κ −σ
p ? 1/2 −σ−3κ/2
≤ c(α)δ m + m (2 c(α)δc3 a(L ) Lm
+ c3 L? α2 Lm−3σ/2−κ − c(α)δ 2 αy(m − 1)−σ−κ )
p
≤ c(α)δ 2 m−σ−κ + m−σ (m − 1)−σ−κ (2 c(α)δc3 a(L? )1/2 L + c3 L? α2 L − c(α)δ 2 αy). (133)
By the choice of c(α) in (132), for any κ ≥ 0 we have the following inequality
p
2 c(α)δc5 (L? )1/2 L + c5 L? aL − c(α)δ 2 y < 0. (134)
Hence, with this choice of c(α), in (133) the latter term in the right-hand side is negative for any
m ≥ 2 and the induction step follows if m > m0 (α). That is, we have for some c > 0 that and when
m > m0 (α) that
h i δ2 L? (1 + α`) −σ−κ
E dist(Θm , M ∩ U )2 1[Bm−1 ] ≤ c max (1−σ)(σ+κ) , m . (135)
a `2
We have left to show that the induction hypothesis holds in (122) for some m. Recall that
m > m0 (α) is the only restriction we needed on the starting point for the induction argument to
work—δ was already chosen depending on α in (131). From the choice
c0
m0 (α) ≥ , (136)
α1−σ
if m ≤ m0 (α), the following slightly changed version of (122) will hold; namely
h i
E (dist(Θm , M ∩ U )2 ∧ δ 2 )1[Bm−1 ] ≤ δ 2 c(α)m−σ−κ . (137)

Hence, by same arguments conducted with (137) instead of(122), we have shown by induction that
(137) holds for m > 0.
For convenience, we will further show that there exists a constant c6 > 0 such that for all m > 0
we have h i
E (dist(Θm , M ∩ U )2 ∧ δ 2 )1[Bm−1 ] ≤ c6 L? m−σ−κ . (138)
Fix c6 > 0. Choose δ0 ≤ δ1 (α) depending on α small enough and `0 > 0 large enough such that for
δ ∈ (0, δ0 ] and ` ∈ [`0 , ∞) we have that
c0 δ 2
< c6 ≥ c6 L?
α(1−σ)(σ+κ)
cD(1 + α`)
< c6 L? , (139)
`2
With the conditions in (139), the proof of the lemma follows noting that δ 2 c(α) = δ 2 c(α, `) < c6 L? .

48
Score-Aware Policy-Gradient Methods and Performance Guarantees

C.5 Proof of Lemma 11

We will again use the notation that tm+1 − tm = Tm and without loss of generality assume that
Tm = `mσ/2+κ as in Appendix C.4. The proof of Lemma 11 also mainly follows the steps of Fehrman
et al. (2020). However, we again need to take care of the terms that the bias and lack of independence
generate in the analysis.
The bounding starts noting the inequality
h i Xm
E max Θl − Θ0 1[Bl−1 ] ≤ E[|Θl − Θl−1 |2 1[Bl−1 ]]1/2 . (140)
1≤l≤m
l=1

We will show that there exists a constant c > 0 such that for l ∈ [m] we have
r
2 1/2

−3/2σ−κ/2 1 −5σ/8−κ/2
E[|Θl+1 − Θl | 1[Bl ]] ≤ cα l + l , (141)
`
where the exponents of σ and κ already differ from the result in Fehrman et al. (2020), and are
required to account for the lack of independence and bias. Following the steps from Fehrman et al.
(2020), in the neighborhood Vr,δ (θ? ), for each l ≤ m there is a random variable l : Bl → Rn and
there exists a constant c > 0 such that

|l | < cdist(Θl , M ∩ U )2 (142)

and such that on the event Bl we have

∇J(Θl ) = Hessp(Θl ) (Θl − p(Θl )) + l . (143)

Recalling the definition of ηl in (41), we have then the equality

Θl+1 = Θl − αl Hessp(Θl ) (Θl − p(Θl )) − αl l + αl ηl (144)

Define
Θ̃l = Θl − αl Hessp(Θl ) (Θl − p(Θl )). (145)
We use the triangle inequality with in (144) separating Θl+1 − Θl as the summands of Θl+1 − Θ̃l
and Θ̃l − Θl .
We estimate first |Θl+1 − Θ̃l |2 . In our case, after expanding E[|Θl+1 − Θ̃l |2 1[Bl ]], we diverge from
(Fehrman et al., 2020, (58)) and we need to bound
h i
αl2 E 1[Bl ]hl , ηl i . (146)

Similar to the proof of Lemma 10, we can condition on Fl and using that l and Bl are Fl -measurable
together with the Cauchy–Schwartz inequality, we have
h i h i
αl2 E 1[Bl ] l , ηl ≤ αl2 E 1[Bl ]l , E ηl 1[Bl ]|Fl

h i1/2 h i1/2
≤ αl2 E 1[Bl ]|l |2 E |E ηl 1[Bl ]|Fl |2 (147)

Since 1[Bm ] ≤ 1[Bm−1 ], we can bound

h
2 1/2
i (Lemma 8) h c2 i1/2 (Lemma 14) c2
E E[ηl 1[Bl ]|Fl ] ≤ E 1[Bl ] 12 Lv (Stl ) ≤ . (148)
Tl Tl

49
Comte, Jonckheere, Sanders and Senen-Cerda

For the remaining term in (147), recall that on the event Bl , since Θl ∈ Vr,δ (θ? ), we have that
dist(Θl , M ∩ U ) ≤ δ. Hence, we can bound for any l > 0 that
h i1/2 (142) c3
E 1[Bl ]|l |2 ≤ (αl )2 E[dist(Θl , M ∩ U )4 1[Bl ]]1/2
Tl+1
c3
≤ (αl )2 δ 2 E[dist(Θl , M ∩ U )2 1[Bl ]]1/2
Tl+1
c3
≤ (αl )2 δ 2 E[dist(Θl , M ∩ U )2 1[Bl−1 ]]1/2
Tl+1
(Lemma 10) c4
≤ (αl )2 δ 2 l−σ/2−κ/2 . (149)
Tl

The estimation of the remaining terms in the expansion of E[|Θl − Θ̃l−1 |2 1[Bl−1 ]] can be conducted
in the same way as that in Fehrman et al. (2020), to which we refer for the details to the interested
reader. Together with the estimate of (149) that accounts for the biases we have that
h i
E[|Θl − Θ̃l−1 |2 1[Bl ]] ≤ c5 (αl )2 δ 2 E dist(Θl , M ∩ U )2 1[Bl ]
h i1/2 c c7
6
+ 2δE dist(Θl , M ∩ U )2 1[Bl ] + (αl )2
Tl Tl
h 1 1i
≤ c8 (αl )2 δ 2 l−σ−κ + 2δl−σ/2−κ/2 + . (150)
Tl Tl

Substituting Tl = tl+1 − tl = lκ+σ/2 ` and using αl < αl−1 = αl−σ into (150) yields the bound

α2 1 1 1
E |Θl − Θ̃l−1 |2 1[Bl−1 ] ≤ c9 2σ δ 2 σ+κ + 2δ σ+3κ/2` + κ+σ/2

l l l l
α2
≤ c10 5σ/4+κ , (151)
l `
where in the last inequality we have taken the term with the highest order. Using the previous
bounds from Lemma 10 we can show that
a2
E |Θl − Θ̃l |2 1[Bl ] ≤ αl2 E dist(Θl , M ∩ U )1[Bl ] ≤ c11 3σ+κ ,

(152)
l
so that using the triangle inequality and combining the bounds of (151) and (152) we obtain
1/2 √ −1
E |Θl+1 − Θl |2 1[Bl ] ≤ c12 α l−3/2σ−κ/2 + ` l−5σ/8−κ/2 .

(153)

Hence, since σ ∈ (2/3, 1) adding the bound (153) in (140) yields

h i Xm √ −1
E max Θl − Θ0 1[Bl−1 ] ≤ c12 α(l−3/2σ−κ/2 + ` l−σ−κ/2 )
1≤l≤m
l=1
√ −1
≤ c13 α(m1−3/2σ−κ/2 + ` m1−5σ/8−κ/2 ).

C.6 Proof of Lemma 12

The proof mimicks the proof strategy of (Fehrman et al., 2020, Prop. 24), but modifications are
required due to our Markovian assumptions and appearances of biases. Specifically, we must carefully
consider the adverse effects that these biases could have on the probability that the iterates exit the
basin of attraction. Concretely, our effort will go into firstly proving the following sufficiently strong
analogue of (Fehrman et al., 2020, (75)) that is applicable to our problem:

50
Score-Aware Policy-Gradient Methods and Performance Guarantees

Lemma 15 There exist constants c1 , c2 > 0 such that

c1 α 2 c2
P[dist(Θm , M ∩ U ) > δ, Bm−1 ] ≤ P[Bm−1 ] + 4 σ+κ . (154)
δ 2 `m2σ δ `m
The proof of Lemma 15 can be found in Appendix C.6.1.
Once Lemma 15 has been established, we secondly estimate the combined probability that any of
the iterates escape in directions tangential to the manifold. The proof of this fact, which is analogous
to (Fehrman et al., 2020, (78)–(79)), can be found in Appendix C.6.2.

Lemma 16 If Θ0 ∈ Vr/2,δ (θ? ), then

m
X h i
/ Vr,δ (θ? ), Bl−1 ] ≤ P max Θl − Θ0 1[Bl−1 ] > R/2 − 2δ, .
P[dist(Θl , M ∩ U ) < δ, Θl ∈ (155)
1≤l≤m
l=1

Proof that Lemmas 15 and 16 imply Lemma 12. First, note that the recursion

P[Bm ] = P[Θm ∈ Vr,δ (θ? ), Bm−1 ] = P[Bm−1 ] − P[Θm ∈

/ Vr,δ (θ? ), Bm−1 ] (156)

can be iterated whenever we can control and bound the following probabilities

/ Vr,δ (θ? ), Bm−1 ] = P[dist(Θm , M ∩ U ) > δ, Bm−1 ]

P[Θm ∈
/ Vr,δ (θ? ), Bm−1 ].
+ P[dist(Θm , M ∩ U ) ≤ δ, Θm ∈ (157)

Using Lemma 15 and induction on (156) and (157), it follows that for some c > 0,
m m m
Y cα2 X c X
P[Bm ] ≥ 1− − − / Vr,δ (θ? ), Bl−1 ]. (158)
P[dist(Θl , M ∩ U ) < δ, Θl ∈
δ 2 `l2σ + `δ 4 lσ+κ
l=1 l=1 l=1

We use Lemma 16 together with Lemma 11 and Markov’s inequality to obtain the bound
m
X (m1−3/2σ−κ/2 + `−1/2 m1−5σ/8−κ/2 )
/ Vr,δ (θ? ), Bl−1 ] ≤ cα
P[dist(Θl , M ∩ U ) < δ, Θl ∈ (159)
(r/2 − 2δ)+
l=1

Thus, substituting (159) in (158), for some c > 0 we have

m m
Y cα2 X c (m1−3/2σ−κ/2 + `−1/2 m1−5σ/8−κ/2 )
P[Bm ] ≥ 1 − 2 2σ − 4 σ+κ
− cα . (160)
δ `l + `δ l (r/2 − 2δ)+
l=1 l=1

Note first that since σ ∈ (2/3, 1) and κ ≥ 0, if σ + κ 6= 1, then there exists a constant c1 > 0 such
that
m
X c
≤ c1 m1−σ−κ (161)
`δ 4 lσ+κ
l=1

Lastly, there also exists a constant c > 0, α0 > 0, δ0 such that if α ∈ (0, α0 ] and δ ∈ (0, δ0 ] then
there exists `0 > 0 such that if ` ∈ [`0 , ∞) then
m
Y cα2 cα2
1− ≥ exp − 2 (162)
δ 2 `l2σ + δ `
l=1

Lower bounding (160) using (161) and (162) yields Lemma 12.

51
Comte, Jonckheere, Sanders and Senen-Cerda

C.6.1 Proof of Lemma 15

We follow first (Fehrman et al., 2020, (69)), by fixing δ1 small enough such that δ ∈ (0, δ1 ], on the
event Bm−1 it is shown in Fehrman et al. (2020) that we have the inequality
λαm−1
dist(Θm , M ∩ U ) ≤ 1 − dist(Θm−1 , M ∩ U ) + αm−1 |ηm−1 |. (163)
2
We consider now the event {dist(Θm , M ∩ U ) > δ} ∩ Bm−1 . This event occurs when in (163),
either Θm−1 ∈ Vr,δ/2 (θ? ) and |ηm−1 | ≥ αm−1 δ/2, or Θm−1 ∈ Vr,δ (θ? )\Vr,δ/2 (θ? ) and the gradient
term can have smaller size. Mathematically, this translates into the inequality
h δ i
P[dist(Θm , M ∩ U ) > δ, Bm−1 ] ≤ P |ηm−1 | ≥ , Θm−1 ∈ Vr,δ/2 (θ? ), Bm−2 (164)
2αm−1
h δλ i
+ P |ηm−1 | ≥ , Θm−1 ∈ Vr,δ (θ? )\Vr,δ/2 (θ? ), Bm−2 =: P1 + P2 .
2
Contrary to what is done in the proof of (Fehrman et al., 2020, Prop. 24), we cannot use an
independence property to estimate the probabilities P1 and P2 in (164). After all, the Markov chain’s
behavior at epoch m − 1 depends on Θm−1 .
In order to overcome this issue we will use the characterization of ηm−1 in Lemma 8. Recall
Lemma 8, and note that it implies
h h δ i i h δ i
E 1[Bm−1 ]1 |ηm−1 | ≥ Fm−1 = P |ηm−1 | ≥ , Bm−1 Fm−1
2αm−1 2αm−1
2 2
E[|ηm−1 | 1[Bm−1 ] |Fm−1 ] 4c2 (αm−1 ) L4 (Stm−1 )
≤ δ2
≤ (165)
4(α )2
δ 2 Tm
m−1

since there exist a constant c > 0 such that Tm < cTm−1 .

Bounding P1 in (164). We can write
(i)
h δ i
P1 = E 1 |ηm−1 | ≥ 1[Θm−1 ∈ Vr,δ/2 (θ? )]1[Bm−2 ]1[Bm−1 ]
2αm−1
h δ i
= E 1[Θm−1 ∈ Vr,δ/2 (θ? )]1[Bm−2 ]E 1[Bm−1 ]1[|ηm−1 | ≥

]|Fm−1
2αm−1
(165) 4c (α 2 h
m−1 )
i
2
≤ 2
E 1[Θm−1 ∈ Vr,δ/2 (θ? )]1[Bm−2 ]L4 (Stm−1 ) (166)
Tm δ
where for (i) we have used the fact that {Θm−1 ∈ Vr,δ/2 (θ? )} ∩ Bm−2 ⊂ Bm−1 .
We deal now with the remaining term in (166). Differently to the independent and unbiased case
we need to control the bias and use the tail probability that the Lyapunov function is larger than a
certain bound in order to estimate the deviation probability. This step is the crucial different step
compared to Fehrman et al. (2020), where we have to explicitly use Assumption 4 and 6. Note that
a Cauchy–Schwartz inequality in (166) will not yields an inequality strong enough. See the remark
after the proof for further details.
Before bounding the remaining term in (166), we obtain the necessary inequalities. Recall from
Lemma 14 that since E[L4 (Stm−1 )4 1[Bm−2 ]] < E[Lv (Stm−1 )1[Bm−2 ]] < D < ∞, then by Markov’s
inequality we have that there exists D > 0 such that for any m > 0,
P[L(Stm−1 ) > ms , Bm−2 ] ≤ D4 m−4s . (167)
Note also that under the moment assumptions the following holds
h i Z ∞
E L(Stm−1 )1[Bm−2 ]1[L(Stm−1 ) > ms ] = P[L(Stm−1 ) > t, Bm−2 ] dt
ms

52
Score-Aware Policy-Gradient Methods and Performance Guarantees

∞
D4
Z
= dt ≤ D4 m−3s+1 . (168)
ms t4
We use the (168) to bound (166) as follows
h i
E 1[Θm−1 ∈ Vr,δ/2 (θ? )]L4 (Stm−1 )1[Bm−2 ]
h i
≤ E 1[Θm−1 ∈ Vr,δ/2 (θ? )]L4 (Stm−1 )1[Bm−2 ] 1[L4 (Stm−1 ) > ms ] + 1[L4 (Stm−1 ) ≤ ms ]
h i h i
≤ E 1[Θm−1 ∈ Vr,δ/2 (θ? )]ms 1[Bm−2 ] + E L(Stm−1 )1[Bm−2 ]1[L(Stm−1 ) > ms ]
(168)
≤ ms P[Θm−1 ∈ Vr,δ/2 (θ? ), Bm−2 ] + c3 Dm−3s+1 ≤ ms P[Bm−1 ] + c3 Dm−3s+1 . (169)
Thus, using (169), we can bound P1 in (164). Specifically,
4c4 (αm−1 )2 s
P1 ≤ (m P[Bm−1 ] + m−3s+1 ). (170)
Tm δ 2
This completes our bound for P1 .
Bounding P2 in (164). Repeating the argumentation behind (170), we can show that
4c5 s ? ?
−3s+1

P2 ≤ m P Θ m−1 ∈ Vr,δ (θ )\V r,δ/2 (θ ), B m−2 + m . (171)
Tm λ2 δ 2
Using the facts (i) {Θm−1 ∈ Vr,δ (θ? )\Vr,δ/2 (θ? )} ⊆ {dist(Θm−1 , M ∩ U ) ≥ δ/2}, with (ii) an
application of Lemma 10 and Markov’s inequality, reveals that
(i) h δ i (ii) 4
P[Θm−1 ∈ Vr,δ (θ? )\Vr,δ/2 (θ? ), Bm−2 ] ≤ P dist(Θm−1 , M ∩ U ) ≥ , Bm−2 ≤ 2 c6 m−σ−κ . (172)
2 δ
Applying the bound in (171) to (172) yields
4c7 s −σ−κ −3s+1

P2 ≤ m m + m . (173)
Tm λ2 δ 4
This completes the bound for P2 in (164).
A return to (164), and parameter selection. Let us now combine (169) and (173) and return
to bounding the left-hand side of (164). Specifically, observe that we proved that
4c8 (αm−1 )2
ms P[Bm−1 ] + m−3s+1

P[dist(Θm , M ∩ U ) > δ, Bm−1 ] ≤ 2
Tm δ
4c9
ms−σ−κ + m−3s+1 .

+ (174)
Tm δ 4
We now specify s = κ + σ/2 in (174). Without loss of generality we will again assume that
Tm = `mσ/2+κ instead of b`mσ/2+κ c—there is namely only a constant changed. By choosing the
smallest exponents in m in (174) for all m > 0 we have
a2 c10
m−3σ−4κ+1 + m−σ−κ .

P[dist(Θm , M ∩ U ) > δ, Bm−1 ] ≤ c10 P[Bm−1 ] + (175)
δ 2 `m2σ 4
δ `
Since σ ∈ (2/3, 1), then −3σ − 4κ + 1 < −σ − κ for any κ ≥ 0. Upper bounding the leading orders
in m completes the proof of Lemma 15.
Remark 17 A Cauchy–Schwartz inequality in (166) would only yield a factor P[Bm−1 ]1/2 > P[Bm−1 ],
which would not be sufficient. Similarly, we could have used Lemma 14 directly and obtain a bound
on E[1[Bm−2 ]L4 (Stm−1 )]. However, this would not give an inequality that can be iterated inductively
and is sharp enough. We can directly simplify this term to obtain P(Bm−1 ) in the inequality only
when L4 (Stm−1 ) is bounded.

53
Comte, Jonckheere, Sanders and Senen-Cerda

C.6.2 Proof of Lemma 16

In Fehrman et al. (2020), it is (Fehrman et al., 2020, Lem. 23) that establishes (Fehrman et al.,
2020, (78)–(79)) directly. Since (Fehrman et al., 2020, Lem. 23) is solely a geometric argument, and
does not concern the stochastic process, it also applies in our Markovian setting.

Appendix D. The compact case

In the case that the set of maxima M is compact, we can improve the convergence rate of Theorem 2.
We will namely assume the following
Assumption 8 (Compactness, Optional) The open subset U defined in Assumption 7 is such
that M ∩ U is compact.

Under this additional assumption we have the following

Theorem 18 (Compact case) Suppose that Assumptions 1 to 8 hold, except that (17) is now
relaxed to allow for σ ∈ (0, 1) and κ ∈ [0, ∞). For every maximizer θ? ∈ M, there exist constants
c > 0 and α0 > 0 such that, for every α ∈ (0, α0 ], there exists a neighborhood V of θ? such that
there exists `0 > 0 such that for any ` ∈ [`0 , ∞), m ∈ N+ , and ∈ (0, 1),

m1−σ−κ α2

? −2 −σ−κ
P[J(Θm ) < J − |Θ0 ∈ V ] ≤ c m + + . (176)
` `
1
The term proportional to αm−κ/2 + αm1−σ/2−κ/2 `− 2 is not in Theorem 18 compared to The-
orem 2. This term estimates the probability that the iterates escape V along directions almost
parallel to those of M. As it turns out, in the compact case such event cannot occur. The bound
in (176) thus holds when the set of maxima is, for example, a singleton M ∩ U = {x0 }.

D.1 Proof of Theorem 18

The proof is the same as with Theorem 2, but we can omit the last term in (82) by showing that we
can choose r arbitrarily large. The argument is as follows. If the manifold M ∩ U is compact, it can
be covered by a finite number k of local tubular neighborhoods Vi = Vri ,δi (θi ) where θi ∈ M ∩ U
and M ∩ U ⊂ ∪i∈[k] Vi . Choose δ = mini∈[k] δi . Then, any θ ∈ U such that dist(θ, M ∩ U ) < δ will
satisfy that p(θ) ∈ M ∩ U , where p is the unique local orthogonal projection on M ∩ U from (37).
Now, from compactness, for any θ? ∈ M ∩ U there exists r̃ > 0 such that M ∩ U ⊂ Br̃ (θ? ). For any
r ≥ r̃ we thus have that Vr,δ (θ? ) = Vr̃,δ (θ? ) is a tubular neighborhood containing M ∩ U . Then, we
can choose r arbitrarily large and conclude that the last term in the bound for the probability in
Theorem 2 vanishes if M ∩ U is a compact manifold. More details on tubular neighborhoods and
their existence for embedded manifolds can be found in Lee (2013).

Appendix E. Proof of Proposition 4

We consider the following setting. Let D < 1. We consider θ ∈ R and a function f such that in
R\[−D, D] satisfies f (θ) = 0 and in [−D/2, D/2] satisfies

f (θ) = 1 − θ2 . (177)

In [−D, −D/2] ∪ [D/2, D], we define f such that it is smoothly and monotonically interpolated
between [−D/2, D/2] and R\[−D, D].
We let Hm be such that Hm = 0 in R\[−D, D]. Hence, the set R\[−D, D] is an absorbing set
that is 1-suboptimal. In [−D/2, D/2], we will consider ηm = ∇f (Θm ) − Hm to be a random variable

54
Score-Aware Policy-Gradient Methods and Performance Guarantees

that, conditional on Fm , is unbiased and has a second moment for all m but approximates a heavy
tailed random variable. In particular, for β > 0, we define ηm such that there exists c > 0 such that
for any m, we have
c
P[|ηm | > s|Fm ] ≥ 2+β for s > D. (178)
s Tm
Note that this constraint
√ on ηm is compatible with the finite second moment condition from (26).
If moreover α ≤ 1 and < 2D, then we can bound under the previous conditions
(i)
P[f (Θm ) < f ? − |Θ0 ∈ V ] ≥ P f (Θm ) < f ? − |Θ0 = θmin

√
= P |Θm | > |Θ0 = θmin

(ii)

min
≥ P sup |Θl | > 2D|Θ0 = θ
l≤m

≥ P |Θ1 | > 2D|Θ0 = θmin

= P |θmin + α1 η1 | > 2D|Θ0 = θ0

≥ P f (Θm ) < f ? − |Θ0 = θmin

(180)

for some θmin ∈ V . In (ii), we have used the fact that from the definition of f , we have the inclusion
of events {supl≤m |Θl | > 2D} ∈ {|Θm | > 2D}, since the set R\[−D, D] is absorbent for the process
{Θt }t≥0 . In (iii), we have used that θmin belongs at least to [−D, D], since otherwise it cannot be
the minimum as defined in (180). To guarantee that ∈ (0, 1) we may choose D = 1/2, for example.

Appendix F. Proof of Proposition 5

We define the total number of samples T up to epoch m as
m
σ
X
T = tm+1 = `k κ+σ/2 = `Θ mκ+ 2 +1 . (181)
k=1

σ
and so for a given T ≥ 1, define m = d(T /`)1/(κ+ 2 +1) e. Note that according to definition in (28),
we have m(T ) ≤ m ≤ m(T ) + 1.
We show first an intermediate result in Lemma 19. Recall the definiton of the set V in (36).
Since the closure of V is compact we have that supθ∈V |J(θ)| exists. From Theorem 2 we directly
obtain:
Lemma 19 Under the same assumptions and setting as in Theorem 2, assume either (i) there exists
some b > 0 such that |r(s, a)| < b for any (s, a) ∈ S × A or (ii) the event Bm = {Θt ∈ V : t ∈ [m]}

55
Comte, Jonckheere, Sanders and Senen-Cerda

holds. Under condition (i) we have that

1
cb 13 (σ+κ) bc 1−(σ+κ) α2
E[J ? − J(Θm )|Θ0 ∈ V ] ≤ 3(L? ) 3 m− 3 +2 m + 2bc
2 ` `
bcα
+ 2bcαm−κ/2 + 2 m1−(σ+κ)/2 .
`
Under condition (ii), we have that if b = supθ∈V |J(θ)| and P[Bm ] > 1/2, then
1 1 (σ+κ)
E[J ? − J(Θm )|Bm ] ≤ 3(L? ) 3 (cb) 3 m− 3 . (182)

Proof Under condition (i), optimizing the following bound over > 0,

E[J ? − J(Θm )|Θ0 ∈ V ] ≤ P[J(Θm ) < J ? − |Θ0 ∈ V ]2b + , (183)

immediately yields the result by using the bound from (23). For condition (ii), using (80), we have
directly that
1
P[{J ? − J(Θm )) > } Bm ] ≤ P[{J ? − J(Θm )) > } Bm ]P[Bm ]
2
= P[{J ? − J(Θm )) > } ∩ Bm ]
≤ c−2 L? m−(σ+κ) . (184)

Finally, we repeat the same argument as in part (i) using the new bound b.

Proof of Proposition 5(i) Recall that both V and α are fixed. Let Θ̃t for t ∈ [T ] be defined as
in Section 5.4. Then using Lemma 19 and the definition of m in terms of T in (181), we have that
there exists a constant c > 0 independent of ` ≥ `0 such that for any T we have
(σ+κ)
T − 3(κ+ 1−(σ+κ)/2
σ +1) 1 T (κ+ σ2 +1)
E[J ? − J(Θm(T ) )|Θ0 ∈ V ] ≤ c 2
+ 1/2
` ` `
1−σ+κ
1 T (κ+ σ2 +1) −κ
σ α2
+ + T 2(κ+ 2 +1) + . (185)
` ` `
Note that by looking at the orders in (185) we can make κ large to obtain an approximation for the
exponents. I particular, for any ζ > 0 there exists κ0 (ζ) > 0 such that if κ ≥ κ0 (), then
1 1 1 1 α2
E[J ? − J(Θm(T ) )|Θ0 ∈ V ] ≤ c (L? ) 3 ` 3 +ζ T − 3 +ζ + `1/2+ζ T − 2 +ζ + `2/3+ζ T −1+ζ + . (186)
`
Proof of Proposition 5(ii) Repeating the same argument as in (i) we obtain that
1 1 1
E[J ? − J(Θm(T ) )|Bm(T ) ] ≤ c(L? ) 3 ` 3 +ζ T − 3 +ζ . (187)

The bound on the probability P[Bm(T ) ] is given in (82) together with the remark on the exponents
thereafter. In terms of T /`, this observation yields
α2 1

P[Bm(T ) ] ≥ 1 − c + `1/2+ζ T − 2 +ζ + `2/3+ζ T −1+ζ . (188)
`
Finally, we make `0 large enough to guarantee that if T ≥ T0 for some T0 > 0, we have P[Bm(T ) ] ≥
1/2. Then note that P[Bm(k) ] ≥ P[Bm(T0 ) ] for any k ≤ T0 .

56
Score-Aware Policy-Gradient Methods and Performance Guarantees

Appendix G. Proof of Corollary 6

We will use the same notation as in the proof of Proposition 5 in Appendix F. Moreover, for an
epoch l corresponding samples t ∈ [tl , tl+1 ], we will consider that Θl is fixed for any sample in the
epoch and define the parameter corresponding to sample t as Θ̃t = Θm(l) , where m(l) was defined
in (28). We need first the following inequalities.

Lemma 20 Under the same assumptions and notation as in Theorem 2, we fix α and δ satisfying
such assumptions. Then for any 1 > ζ > 0 there exists κ(ζ) ≥ 0, c > 0 and `0 > 0 such that for any
` ≥ `0 , κ ≥ κ(ζ) and T ≥ 1, (i) if r(s, a) is bounded,

h T i h T i 1
X X
E T J ?− r(St , At ) Θ0 ∈ V ≤ E T J ? − J(Θ̃t ) Θ0 ∈ V +c α2 T `−1 +`1/2+ζ T 2 +ζ +`2/3+ζ T ζ +T ζ ,

t=1 t=1

and (ii) if r(s, a) is unbounded,

h XT i h XT i 1
T σ/2+κ
E T J? − r(St , At ) Bm(T ) ≤ E T J ? − J(Θ̃t ) Bm(T ) + c .
t=1 t=1
`

Proof We show (ii) first. We use that

h T
X i h T
X i T
hX i
E T J?− r(St , At ) Bm(T ) = E T J ? − J(Θ̃t ) Bm(T ) +E J(Θ̃t )−r(St , At ) Bm(T ) . (189)
t=1 t=1 t=1

From the same argument as that of (188), we may pick `0 such that P[Bm(T ) ] > 1/2 and for some
c>0
T
hX i h T
X i
E J(Θ̃t ) − r(St , At ) Bm(T ) ≤ c E 1[Bm(T ) ] J(Θ̃t ) − r(St , At ) Θ0 ∈ V (190)
t=1 t=1

We need to bound only the last term in (190). From Assumption 6, we have that |r(s, a)| ≤ cL(s, a).
From (69) we obtain for epoch m that if Θm ∈ V there exists a constant C > 0 such that

h tX
m+1 i
E J(Θ̃t ) − r(St , At ) Fm ≤ CL4 (Stm , Atm ). (191)
t=tm

Recall from (28) that m(T ) = min{m ∈ N : `mσ/2+κ ≥ T }. From Lemma 14, we know that there
exists a constant c > 0 such that for any n ≥ 1 E[L4 (Stn , Atn )1[Bn ]] ≤ c. Let Fn be defined as
in (39). Recall that 1[Bm ] ≤ 1[Bn ] if n < m. By using in the following the tower property of the
conditional expectation in (i) we have

h T i m(T ) tn+1
X X h X i
E 1[Bm(T ) ] J(Θ̃t ) − r(St , At ) Θ0 ∈ V ≤ E 1[Bn ] J(Θ̃t ) − r(St , At ) Θ0 ∈ V
t=1 n=1 t=tn
m(T ) tn+1
(i) X h h X i i
≤ E E 1[Bn ] J(Θ̃t ) − r(St , At ) Fn Θ0 ∈ V
n=1 t=tn
m(T )
(191) X h i
≤ c E 1[Bn ]L4 (Stn , Atn ) Θ0 ∈ V
n=1

57
Comte, Jonckheere, Sanders and Senen-Cerda

m(T ) 1
(Lemma 14) X T σ/2+κ
≤ c C≤c . (192)
m=1
`

Substituting (192) in (190) yields the result. We now show (i) using a similar argument. First note
that for n ∈ [m(T )] we have
h tX
n+1 i h tn+1
X i
E J(Θ̃t ) − r(St , At ) Θ0 ∈ V ≤ E 1[Bn ] J(Θ̃t ) − r(St , At ) Θ0 ∈ V
t=tn t=tn
h tn+1 i
X
+ E 1[Bn ] J(Θ̃t ) − r(St , At ) Θ0 ∈ V
t=tn
(a)
≤ C + cP[Bn ] (193)
α2 n 1−(σ+κ)
n 1−(σ+κ)/2
≤ C + c`nσ/2+κ + + n−κ/2 + ,
` ` `
where in (a) we have used the same argument as in (192) for the first term, and for the second term,
we have used that since the reward is bounded, |J(Θ̃t ) − r(St , At )| is also bounded, regardless of the
stability of Θ̃t . We are left with a constant times P[Bn ] for the second term. We add the remaining
Ph
terms in (193) for n ∈ [m(T )] and use the inequality i=1 iη ≤ Chη+1 for η ≥ 0. Setting m(T ) in
terms of T according to (181), we are left with
h tX
n+1 i 1 1 α2 1

E J(Θ̃t ) − r(St , At ) Θ0 ∈ V ≤ c T σ/2+κ `− σ/2+κ + T + `1/2+ζ T 2 +ζ + `2/3+ζ T ζ + T ζ
t=tn
`
1
≤ c α2 T `−1 + `1/2+ζ T 2 +ζ + `2/3+ζ T ζ + T ζ ,

(194)

where we have used that T −1/(σ/2+κ) `−1/(σ/2+κ) ≤ T /` for any T ≥ `.

Proof of Corollary 6(i): Let ` ≥ `0 be fixed, where `0 is given by the conditions of Theorem 2.
We add for each t ∈ [T ] the performance gap of Proposition 5 which yields
h T i XT
1 1 1 1
X
E T J? − c (L? ) 3 ` 3 +ζ t− 3 +ζ + α2 `−1 + `1/2+ζ t− 2 +ζ + `2/3+ζ t−1+ζ + t−1+ζ

J(Θ̃t ) Θ0 ∈ V ≤
t=1 t=1
1 1 2 1
≤ c (L? ) 3 ` 3 +ζ T 3 +ζ + α2 `−1 + `1/2+ζ T 2 +ζ + `2/3+ζ T ζ + T ζ

(195)
Use now Lemma 20 together with (195). In this manner we obtain the bound:
h T i 1 1 2 1
X
E T J? − r(St , At ) Θ0 ∈ V ≤ c (L? ) 3 ` 3 +ζ T 3 +ζ + α2 T `−1 + `1/2+ζ T 2 +ζ + `2/3+ζ T ζ + T ζ .

t=1

Then, for ` ≥ `0 fixed and for any T > 0 the following holds
T
h X i 1 2 α2
E T J? − r(St , At ) Θ0 ∈ V ≤ c (L? ) 3 T 3 +ζ + T . (196)
t=1
`

Similarly, if we choose ` depending on a given T fixed, then setting ` = T 1/4 we obtain

h T i 3
X
E T J? − r(St , At ) Θ0 ∈ V ≤ cT 4 +ζ . (197)
t=1

58
Score-Aware Policy-Gradient Methods and Performance Guarantees

Proof of Corollary 6(ii): Let ` ≥ `0 be fixed, where `0 is given by the conditions of Theorem 2
and satisfies the same conditions as (188). We repeat the argument used in (i) by using Proposition 5
and Lemma 20. We obtain that if σ/2 + κ ≥ 3/2 then

h T i 1
T σ/2+κ
1 1 2 1 2
X
?
E TJ − r(St , At ) Bm(T ) ≤ c(L? ) 3 ` 3 +ζ T 3 +ζ + c ≤ c(L? ) 3 T 3 +ζ . (198)
t=1
`

Appendix H. Proof of Proposition 7

To prove the proposition we will show that for almost all π̃ in the Lebesgue measure of the class of
policies defined in (6), the function Jπ̃ (θ) is Morse. Morse functions are smooth functions f such
that every critical point of f is nondegenerate, that is, for any x such that ∇x f = 0 we have that
Hessx f is nonsingular. Hence, all critical points are isolated. If the function Jπ̃ (θ) is Morse and
furthermore satisfies that Jπ̃ (θ) → −∞ as |θ| → ∞, it will then have bounded isolated maxima.
We show first that for almost all π̃, the function Jπ̃ (θ) is a Morse function. To do so, we will
implicitly use the fact that Morse functions are dense and form an open subset in the space of
smooth functions (see Nicolaescu et al. (2011)).
We introduce first notation. For a finite dimensional smooth manifold M , we denote by Tx M
and Tx∗ M the tangent and cotangent spaces at x ∈ M , respectively. When M = Ru , for f : Ru → R
we will denote the (covariant) derivative and gradient of f at x by dx f ∈ Tx∗ M and ∇x f ∈ Tx M ,
respectively. In local coordinates (w1 , . . . , wu ), we have namely
u
X ∂f (x)
dx f = dwi
i=1
∂wi
u
X ∂f (x) d
∇x f = , (199)
i=1
∂wi dwi

d
where dwi ( dw i
) = 1[i = j]. In this notation and since M = Ru , we have then
u u X u
X ∂f (x) X ∂ 2 f (x)
dx (df ) = dx dwi = dwj ⊗ dwi = Hessx f ∈ Tx∗ M ⊗ Tx∗ M. (200)
i=1
∂wi i=1 j=1
∂w j ∂w i

We require the following lemmas and definitions.

Definition 21 Let M and N be two manifolds and let B be a submanifold of N . We say a smooth
map f : M → N is transversal to B if for every point x ∈ M such that f (x) ∈ B we have

dx f (Tx M ) + Tf (x) B = Tf (x) N. (201)

We will use the following result that has is its core an application of Sard’s theorem that states
that in a map between smooth manifolds, the set of critical points has measure zero in the image.
Lemma 22 (Parametric transversality theorem (Guillemin and Pollack, 2010)) Let Z, M
and N be smooth manifolds and let B be a smooth submanifold of N . Let F : Z × M → N be a
smooth submersion, that is, the differential map is surjective everywhere. If F is transversal to B,
then for almost every z ∈ Z, the map

Fz (m) = F (z, m) (202)

is transversal to B.

59
Comte, Jonckheere, Sanders and Senen-Cerda

When appropriate, we will make explicit the dependence of v ∈ Tx∗ M on x by writing (x, v) ∈
Tx∗ M .
We can now show the following,
Lemma 23 Let M = Ru and let f : M → R be a smooth map. Consider the map f˜ : M → T ∗ M
given for x ∈ M by
f˜(x) = (x, dx f ) ∈ Tx∗ M. (203)
Let B ⊂ T ∗ M be the zero section submanifold, that is, B(x) = (x, 0) ∈ Tx∗ M for every x. Then x is
a nondegenerate critical point of f if and only if f˜ is transversal to B at x and ∇x f = 0.
Proof x is a critical nondegenerate point if and only if ∇x f = 0 and Hessx f ∈ Tx∗ M ⊗ Tx∗ M is
nonsingular. For any ν ∈ Tx M , we have then that

dx f˜(ν) = (ν, Hessx f (ν)) (204)

By definition, f˜ is transversal to B if and only if for every x ∈ M ,

dx f˜(Tx M ) + Tx M ⊕ 0 = (Id ⊕ Hessx (f ))(Tx M ) + Tx M ⊕ 0

= Tx M ⊕ Hessx f (Tx M )
= Tx M ⊕ Tx∗ M, (205)

which is true if and only if Hessx f is nonsingular.

From the last two lemmas it follows that by adding an appropriate perturbation to a function,
the perturbed function is nondegenerate. This result is well-known in the literature in the context
of genericity of Morse functions and can be generalized to general smooth manifolds; see Guillemin
and Pollack (2010).

Lemma 24 Let M = Ru . Let f : M → R and gi : M → R for i ∈ [l] be smooth functions such that
for every x ∈ M , span({dx gi }li=1 ) = Tx∗ M . Then for almost every z = (z1 , . . . , zl ) ∈ Ru we have
that
l
X
fz (·) = f (·) + zi gi (·) (206)
i=1

is a Morse function.

Proof Define the smooth function F : Rl × M → T ∗ M given by

l
X
F (z, x) = (x, dx f + zi dx gi ) = (x, dx fz ). (207)
i=1

The derivative of this map at (z, x) evaluated at (η, χ) ∈ Tz Rl × Tx M is then

l
X
d(z,x) F (η, χ) = (χ, Hessx fz (χ) + ηi dx gi ) ∈ TF (z,x) (T ∗ M ) ' Tx M ⊕ Tx∗ M. (208)
i=1

For every x, we have span({dx gi }li=1 ) = Tx∗ M , then d(z,x) F (Tz Rl , Tx M ) = TF (z,x) (T ∗ M ) and d(z,x) F
is surjective. Thus, F is a submersion and is therefore transversal to the zero section of T ∗ M and
by Lemma 22 for almost every z ∈ Z the map Fz (x) = F (z, x) is transversal to the zero section of
T ∗ M . Finally, by Lemma 23 we can conclude that for almost every z ∈ Z, the critical points of fz
are nondegenerate, that is, fz is a Morse function.

60
Score-Aware Policy-Gradient Methods and Performance Guarantees

We are now in position to show the proposition. Recall from the definition of the policy in (6)
that there is an index set I and a function h : S → I that determines the P parameter dependence
of {θi,a : (i, a) ∈ I × A}. For s ∈ I, let z(a,i) = π̃(a|i) and denote ζ̃(i) = s∈S:h(s)=i ζ(s). We can
write
X X
dθ Rπ̃ (θ) = b ζ(s) π̃(a|s)dθ log(π(a|s, θ))
s∈S a∈A
X X X
=b ζ(s) π̃(a|s) (1[a = a0 ] − π(a0 |s, θ))dθh(s),a0
s∈S a∈A a0 ∈A
X X
=b ζ(s) (π̃(a|s) − π(a0 |s, θ))dθh(s),a0
s∈S a0 ∈A
XX
=b ζ̃(i)(π̃(a|i) − π(a|i, θ))dθi,a
i∈I a∈A
X
=b ζ̃(i)(z(i,a) − π(a|i, θ))dθi,a (209)
(i,a)∈I×A

If ζ̃(i) > 0 for all i ∈ I, it is clear from (209) that the terms {dθi,a }(i,a)∈I×A span Tθ∗ R|A|×|I| for
each θ, since π(a|s, θ) 6= 0 for any finite θ. By Lemma 24 and the assumption on ζ, we immediately
obtain that for almost all policies π̃, the function

Jπ̃ (θ) = J(θ) − bRπ̄ (θ). (210)

is Morse and has nondegenerate critical points—including the maximum. Finally, the set of maxima
of (210) will be nonempty. Indeed, the function −bRπ̄ (θ) → −∞ whenever for any s ∈ S, π( · |s) →
∂∆(S). Thus, by continuity, the set of maxima belongs to a compact set.

Shiyu Zhao - Mathematical Foundation of Reinforcement Learning (2024, Tsinghua University Press, Springer) - Libgen - Li
No ratings yet
Shiyu Zhao - Mathematical Foundation of Reinforcement Learning (2024, Tsinghua University Press, Springer) - Libgen - Li
283 pages
5SC28 Machine Learning For Systems and Control
No ratings yet
5SC28 Machine Learning For Systems and Control
68 pages
Policy Gradient 2020
No ratings yet
Policy Gradient 2020
76 pages
13 RL 3
No ratings yet
13 RL 3
48 pages
Approximate Dynamic Programming and Reinforcement Learning - Algorithms, Analysis and An Application
No ratings yet
Approximate Dynamic Programming and Reinforcement Learning - Algorithms, Analysis and An Application
139 pages
L4DC PolicyOptTutorial2023
No ratings yet
L4DC PolicyOptTutorial2023
160 pages
Book All in One
No ratings yet
Book All in One
288 pages
Book All-In-One 2
No ratings yet
Book All-In-One 2
281 pages
RL 5
No ratings yet
RL 5
26 pages
Trust Region Policy Optimization: John Schulman Sergey Levine Philipp Moritz Michael Jordan Pieter Abbeel
No ratings yet
Trust Region Policy Optimization: John Schulman Sergey Levine Philipp Moritz Michael Jordan Pieter Abbeel
16 pages
2017 - On The Sample Complexity of The Linear Quadratic Regulator
No ratings yet
2017 - On The Sample Complexity of The Linear Quadratic Regulator
43 pages
Abdolmaleki Et Al. - 2018 - Maximum A Posteriori Policy Optimisation
No ratings yet
Abdolmaleki Et Al. - 2018 - Maximum A Posteriori Policy Optimisation
23 pages
Unit 5 - Policy Based
No ratings yet
Unit 5 - Policy Based
30 pages
RL Lecture4
No ratings yet
RL Lecture4
7 pages
Average-Reward Model-Free Reinforcement Learning - A Systematic Review and Literature Mapping
No ratings yet
Average-Reward Model-Free Reinforcement Learning - A Systematic Review and Literature Mapping
36 pages
Optimistic Linear Support and Successor Features As A Basis For Optimal Policy Transfer (Alegre, 2022)
No ratings yet
Optimistic Linear Support and Successor Features As A Basis For Optimal Policy Transfer (Alegre, 2022)
20 pages
2023 Week5 Policy
No ratings yet
2023 Week5 Policy
62 pages
SRE Report Merged
No ratings yet
SRE Report Merged
16 pages
Policy Gradient Methods-BR
No ratings yet
Policy Gradient Methods-BR
14 pages
Linear Quadratic Control Using Model-Free Reinforcement Learning
No ratings yet
Linear Quadratic Control Using Model-Free Reinforcement Learning
16 pages
Unit7 RL
No ratings yet
Unit7 RL
7 pages
Lecture 12 Slides - After
No ratings yet
Lecture 12 Slides - After
50 pages
An Introduction To Reinforcement Learning From Theory To Algorithms (December 19, 2024) - Joon Kwon
No ratings yet
An Introduction To Reinforcement Learning From Theory To Algorithms (December 19, 2024) - Joon Kwon
66 pages
Reinforcement Learning With Deep Energy-Based Policies
No ratings yet
Reinforcement Learning With Deep Energy-Based Policies
16 pages
5 - Policy Gradient Methods
No ratings yet
5 - Policy Gradient Methods
57 pages
Actor-Critic Policy Optimization in Partially Observable Multiagent Environments 1810.09026
No ratings yet
Actor-Critic Policy Optimization in Partially Observable Multiagent Environments 1810.09026
28 pages
Policy Gradient Method For Robust Reinforcement Learning
No ratings yet
Policy Gradient Method For Robust Reinforcement Learning
44 pages
A3C-GS Adaptive Moment Gradient Sharing With Locks For Asynchronous ActorCritic Agents
No ratings yet
A3C-GS Adaptive Moment Gradient Sharing With Locks For Asynchronous ActorCritic Agents
15 pages
Policy Gradient Methods For Reinforcement Learning
No ratings yet
Policy Gradient Methods For Reinforcement Learning
5 pages
Deep Reinforcement Learning in Large Discrete Action Spaces
No ratings yet
Deep Reinforcement Learning in Large Discrete Action Spaces
11 pages
Policy Gradient Methods
No ratings yet
Policy Gradient Methods
70 pages
10 - Reinforcement Learning
No ratings yet
10 - Reinforcement Learning
24 pages
Policy Approximation Document
No ratings yet
Policy Approximation Document
2 pages
Lecture 10
No ratings yet
Lecture 10
25 pages
Deep Reinforcement Learning
No ratings yet
Deep Reinforcement Learning
93 pages
RL Exam Tutti
No ratings yet
RL Exam Tutti
47 pages
Chapter 13: Policy Gradient Methods: by Richard Sutton and Andrew Barto
No ratings yet
Chapter 13: Policy Gradient Methods: by Richard Sutton and Andrew Barto
35 pages
1、Bayesian Policy Gradient Algorithms（2006）
No ratings yet
1、Bayesian Policy Gradient Algorithms（2006）
9 pages
Assignment 2 - Policy Gradients
No ratings yet
Assignment 2 - Policy Gradients
7 pages
ACC23 Tutorial Paulson
No ratings yet
ACC23 Tutorial Paulson
12 pages
GDD Nonlinear NIPS 2009 Convergent Temporal Difference Learning With Arbitrary Smooth Function Approximation
No ratings yet
GDD Nonlinear NIPS 2009 Convergent Temporal Difference Learning With Arbitrary Smooth Function Approximation
9 pages
Policy-Based Reinforcement Learning: Shusen Wang
No ratings yet
Policy-Based Reinforcement Learning: Shusen Wang
46 pages
Dynamic Programming and Optimal Control
No ratings yet
Dynamic Programming and Optimal Control
199 pages
Experiment 3
No ratings yet
Experiment 3
6 pages
High-Dimensional Continuous Control Using Generalized Advantage Estimation-1506.02438v5
No ratings yet
High-Dimensional Continuous Control Using Generalized Advantage Estimation-1506.02438v5
14 pages
Natural Actor-Critic: Abstract. This Paper Investigates A Novel Model-Free Reinforcement
No ratings yet
Natural Actor-Critic: Abstract. This Paper Investigates A Novel Model-Free Reinforcement
12 pages
Policy Gradient Methods For Reinforcement Learning PDF
No ratings yet
Policy Gradient Methods For Reinforcement Learning PDF
5 pages
Powell UnifiedFrameworkStochasticOptimization Jan292018
No ratings yet
Powell UnifiedFrameworkStochasticOptimization Jan292018
69 pages
New CZ3005 Module 4 - Markov Decision Process
No ratings yet
New CZ3005 Module 4 - Markov Decision Process
38 pages
An Introduction To Policy Search Methods: Thomas Furmston
No ratings yet
An Introduction To Policy Search Methods: Thomas Furmston
33 pages
Reinforcement Learning in A Nutshell
No ratings yet
Reinforcement Learning in A Nutshell
12 pages
Bridging The Gap Between Value and Policy Based Reinforcement Learning
No ratings yet
Bridging The Gap Between Value and Policy Based Reinforcement Learning
21 pages
Hypothesis Testing - II: S. Devi Yamini
No ratings yet
Hypothesis Testing - II: S. Devi Yamini
145 pages
EE290 Lecture 16
No ratings yet
EE290 Lecture 16
4 pages
07 Deep Reinforcement Learning (John)
No ratings yet
07 Deep Reinforcement Learning (John)
52 pages
Approximate Dynamic Programming - II: Algorithms: Warren B. Powell
No ratings yet
Approximate Dynamic Programming - II: Algorithms: Warren B. Powell
22 pages
Policy Gradient Methods
No ratings yet
Policy Gradient Methods
28 pages
1 - Table of Contents
No ratings yet
1 - Table of Contents
6 pages
4 Reinforcement Learning - Basic Algorithms: - S, A) ) and The Immediate Reward Function R (R (S, A, S
No ratings yet
4 Reinforcement Learning - Basic Algorithms: - S, A) ) and The Immediate Reward Function R (R (S, A, S
16 pages
Heading Hints A Guide To Cold Forming Specialty Alloys
No ratings yet
Heading Hints A Guide To Cold Forming Specialty Alloys
63 pages
Grid Audit Report Format
100% (1)
Grid Audit Report Format
7 pages
Cobra C1 FastScanManual
No ratings yet
Cobra C1 FastScanManual
64 pages
SC MCQ
0% (1)
SC MCQ
10 pages
? Gallery Walk Scoring Rubric
No ratings yet
? Gallery Walk Scoring Rubric
2 pages
Chap6 Stair Design MDM
No ratings yet
Chap6 Stair Design MDM
33 pages
Ucc2817, Ucc2818, Ucc3817 and Ucc3818 Bicmos Power Factor Pregulator
No ratings yet
Ucc2817, Ucc2818, Ucc3817 and Ucc3818 Bicmos Power Factor Pregulator
45 pages
Lecture 3: Role of Academic Librarian: Prof. Dana P. Tugade
100% (1)
Lecture 3: Role of Academic Librarian: Prof. Dana P. Tugade
34 pages
Asset Holiday Home Work 2
No ratings yet
Asset Holiday Home Work 2
13 pages
Psychology and Other Disciplines
No ratings yet
Psychology and Other Disciplines
5 pages
CSC403 - Software Engineering BOSU
No ratings yet
CSC403 - Software Engineering BOSU
13 pages
Magnetic Flow E&H
No ratings yet
Magnetic Flow E&H
20 pages
Log
No ratings yet
Log
8 pages
Alemite Oil Mist Application Manual
100% (1)
Alemite Oil Mist Application Manual
34 pages
Internship - Ii Report Front Pages
No ratings yet
Internship - Ii Report Front Pages
9 pages
B. Stage 1 and 2
No ratings yet
B. Stage 1 and 2
20 pages
Izar Net 2 14
No ratings yet
Izar Net 2 14
3 pages
Offline Schedule-Siioc2023 Version2
No ratings yet
Offline Schedule-Siioc2023 Version2
5 pages
Invoice 10
No ratings yet
Invoice 10
1 page
Skymionic Beams PDF
No ratings yet
Skymionic Beams PDF
6 pages
EPP Lessonplan
No ratings yet
EPP Lessonplan
6 pages
Daftar Referensi Jurnal Enzim1
No ratings yet
Daftar Referensi Jurnal Enzim1
7 pages
Paper 4 PDF
No ratings yet
Paper 4 PDF
5 pages
L Matching Reflection Coefficient Using Matlab
100% (1)
L Matching Reflection Coefficient Using Matlab
7 pages
Python Programs by Narayana
100% (1)
Python Programs by Narayana
18 pages
Planning A Lesson Using PRIMM: The Five Stages of PRIMM
No ratings yet
Planning A Lesson Using PRIMM: The Five Stages of PRIMM
2 pages
60. Đề Thi Thử TN THPT 2021 - Môn Tiếng Anh - Sở GD & ĐT Hưng Yên - File Word Có Lời Giải
No ratings yet
60. Đề Thi Thử TN THPT 2021 - Môn Tiếng Anh - Sở GD & ĐT Hưng Yên - File Word Có Lời Giải
6 pages
Alter Table: Table - Name ADD Column - Name Datatype
No ratings yet
Alter Table: Table - Name ADD Column - Name Datatype
5 pages
Sales Prediction Using Regression Analysis: Problem Statement
No ratings yet
Sales Prediction Using Regression Analysis: Problem Statement
3 pages
Process Performance Models: Statistical, Probabilistic & Simulation
From Everand
Process Performance Models: Statistical, Probabilistic & Simulation
Vishnuvarthanan Moorthy
No ratings yet

Paper RL

Uploaded by

Paper RL

Uploaded by

Journal of Machine Learning Research XX (2024) 1-??

Submitted XX/XX; Published XX/XX

Score-Aware Policy-Gradient and Performance Guarantees

©2024 Comte, Jonckheere, Sanders and Senen–Cerda.

1.1 Score-aware gradient estimators (SAGEs)

1.2 Convergence of policy-gradient methods

of epoch m with step-size αm > 0,

1.3 Numerical experiments

2.1 Gradient estimation, exponential families, and product forms

2.2 Stochastic gradient ascent (SGA) and policy-gradient methods

Method Context/State Space Main Assump- Convergence Key Assumptions and

3.2 Markov decision process (MDP)

3.3 Stationary analysis and optimality criterion

3.4 Learning algorithm

∇J(θ) = E[q(S, A) ∇ log π(A|S, θ)],

where (S, A, R) ∼ stat(θ), for each θ ∈ Ω. Consistently, in a model-free setting, policy-gradient

4 Score-aware gradient estimator (SAGE)

Algorithm 1 Generic policy-gradient algorithm. Examples of Gradient procedure, based on

4.1 Product-form and exponential family

Assumption 3 (Stationary distribution) There exist a scalar function Φ : S → R>0 , an integer

where the partition function Z : Ω → R>0 follows by normalization:

4.2 Score-aware gradient estimator (SAGE)

Theorem 1 Suppose that Assumptions 1 to 3 hold. For each θ ∈ Ω, we have

∇ log p(s|θ) = D log ρ(θ)| (x(s) − E[x(S)]), (12)

∇ log Z(θ) = D log ρ(θ)| E[x(S)]. (14)

H = D log ρ(θ)| C + E, (15)

4.3 SAGE-based policy-gradient algorithm

Algorithm 2 SAGE-based policy-gradient method, to be called on Line 9 of Algorithm 1.

5 A local convergence result

M = {θ ∈ Ω : J(θ) = J ? }, where J ? = sup J(θ). (18)

5.1 Assumptions pertaining to algorithmic convergence

The following are assumed:

and, for each ` ∈ N+ and (s, a), (s0 , a0 ) ∈ S × A,

P ` ((s0 , a0 )|(s, a), θ) − p̃((s0 , a0 )|θ) ≤ Cλ` L(s, a),

Assumption 6 Let L be the Lyapunov function from Assumption 4. For any θ? ∈ M, if U is a

5.2 Local convergence results

5.3 Lower bound

5.4 Performance gap and regret

(ii) If sup(s,a) |r(s, a)| is unbounded, then for any T ≥ 1

5.5 Local convergence with entropy regularization

For some b > 0 we define

2. Theorem 2 for Jπ̃ (θ) holds without Assumption 7.

5.6 Proof outline for Theorem 2

Structure of the proof

Preliminary step: Definition of the local neighborhood and bound strategy

Vr,δ (θ? ) := θ ∈ Ω ∩ U : dist(θ, M ∩ U ) = dist(θ, B̄r (θ? ) ∩ M ∩ U ) < δ .

P[dist(Θm , M ∩ U ) ≤ |B0 ] ≤ P[dist(Θm , M ∩ U )1[Bm−1 ] ≥ ] + P[Bm ]. (40)

Step 2: Convergence on the event Bm−1 .

Step 3: Excursion and the probability of staying in Vr,δ (θ? )

/ Vr,δ (θ? ), Bm−1 ],

/ Vr,δ (θ? ), Bm−1 ] = P[dist(Θm , M ∩ U ) > δ, Bm−1 ]

Step 4: Combining the bounds in (40).

6 Examples and numerical results

6.1 Admission control in a single-server queue

(a) Performance under π0 . (b) Performance under π1 .

(c) Performance under π3 . (d) Performance under π100 .

(a) SAGE (b) Actor–critic

Figure 2: Admission probabilities under policy parametrization π3 .

(a) Performance under π0 . (b) Performance under π2 .

(c) Performance under π4 . (d) Performance under π100 .

6.2 Load-balancing system

(a) SAGE, n = 4 servers. (b) Actor–critic, n = 4 servers.

(c) SAGE, n = 20 servers. (d) SAGE, n = 100 servers.

is again due to the fact

6.3 Ising model and Glauber dynamics

ξright ∈ (−1, +1)), i.e.,

To each configuration σ ∈ Σ is associated an energy E(σ) , −JI(σ) − µF (σ), where I and F

p(s|θ) ∝ eβ(θ)(JI(σ)+µF (σ|θ)) , s = (σ, v) ∈ S, θ ∈ R3 . (53)

(a) Average over 10 simulation runs. (b) A particular simulation run.

Figure 5: Performance of SAGE in the Ising model.

Appendix A. Policy-gradient algorithms

∇J(θ) ∝ E[(R − J(θ) + v(S 0 ) − v(S))∇ log π(A|S, θ)],

A.2 SAGE-based policy-gradient method

Algorithm 4 SAGE-updated policy-gradient method, to be called on Line 9 of Algorithm 1.

B.1 Single-server queue with admission control

Qθ L(s) = qθ (s − 1|s)L(s − 1) + qθ (s + 1|s)L(s + 1) + qθ (s|s)L(s)

P[dist(Θm , M ∩ U ) ≤ |B0 ] ≤ P[dist(Θm , M ∩ U )1[Bm−1 ] ≥ ] + P[Bm ]. (40)

P[{Dm ≥ 0 } ∩ Bm ] = E[1[Dm ≥ 0 ]1[Bm ]]

Term I ≤ c0−2 m−σ−κ . (80)

In this case we have that 2c (σ + κ) 1−σ

|l | < cdist(Θl , M ∩ U )2 (142)

∇J(Θl ) = Hessp(Θl ) (Θl − p(Θl )) + l . (143)

Θl+1 = Θl − αl Hessp(Θl ) (Θl − p(Θl )) − αl l + αl ηl (144)

≥ P f (Θm ) < f ? − |Θ0 = θmin

E[J ? − J(Θm )|Θ0 ∈ V ] ≤ P[J(Θm ) < J ? − |Θ0 ∈ V ]2b + , (183)