Robust Markov Decision Processes With Average and Blackwell Optimality
Marek Petrik
Department of Computer Science, University of New Hampshire, [email protected]
Nicolas Vieille
Economics and Decision Sciences Department, HEC Paris, [email protected]
Robust Markov Decision Processes (RMDPs) are a widely used framework for sequential decision-making
under parameter uncertainty. RMDPs have been extensively studied when the objective is to maximize the
discounted return, but little is known for average optimality (optimizing the long-run average of the rewards
obtained over time) and Blackwell optimality (remaining discount optimal for all discount factors sufficiently
close to 1). In this paper, we prove several foundational results for RMDPs beyond the discounted return. We
show that average optimal policies can be chosen stationary and deterministic for sa-rectangular RMDPs but,
perhaps surprisingly, that history-dependent (Markovian) policies strictly outperform stationary policies for
average optimality in s-rectangular RMDPs. We also study Blackwell optimality for sa-rectangular RMDPs,
where we show that approximate Blackwell optimal policies always exist, although Blackwell optimal policies
may not exist. We also provide a sufficient condition for their existence, which encompasses virtually all
examples from the literature. We then discuss the connection between average and Blackwell optimality, and
we describe several algorithms to compute the optimal average return. Interestingly, our approach leverages
the connections between RMDPs and stochastic games.
Key words : Robust Markov decision process, average optimality, Blackwell optimality, value iteration
1. Introduction
The Markov decision process (MDP) [Puterman, 2014] is one of the most popular frameworks to model
sequential decision-making by a single agent, with recent applications to game solving and reinforcement
learning [Mnih et al., 2013, Sutton and Barto, 2018], healthcare decision-making [Bennett and Hauser,
2013, Steimle and Denton, 2017] and finance [Bäuerle and Rieder, 2011]. Robust Markov decision
processes (RMDPs) are a generalization of MDPs, dating as far back as Satia and Lave [1973] and
extensively studied more recently after the seminal papers of Iyengar [2005] and Nilim and Ghaoui
[2005]. Robust MDPs consider a single agent repeatedly interacting with an environment with unknown
instantaneous rewards and/or transition probabilities. The unknown parameters are assumed to be
chosen adversarially from an uncertainty set, modeling the set of all plausible values of the parameters.
Most of the robust MDP literature has focused on computational and algorithmic considerations
for the discounted return, where the future instantaneous rewards are discounted with a discount fac-
tor γ P p0, 1q. In this case, RMDPs become tractable under certain rectangularity assumptions, with
models such as sa-rectangularity [Iyengar, 2005, Nilim and Ghaoui, 2005], s-rectangularity [Le Tallec,
2007, Wiesemann et al., 2013], k-rectangularity [Mannor et al., 2016], and r-rectangularity [Goh et al.,
2018, Goyal and Grand-Clément, 2022]. For rectangular RMDPs, iterative algorithms can be efficiently
implemented for various distance- and divergence-based uncertainty sets [Behzadian et al., 2021, Givan
et al., 1997, Grand-Clement and Kroer, 2021, Ho et al., 2021, Iyengar, 2005], as well as gradient-based
algorithms [Kumar et al., 2023, Li et al., 2023, 2022, Wang et al., 2022].
Robust MDPs have found some applications in healthcare [Goh et al., 2018, Grand-Clément et al.,
2023, Zhang et al., 2017], where the set of states represents the potential health conditions of the
patients, actions represent the medical interventions, and it is critical to account for the potential
errors in the estimated transition probabilities representing the evolution of the patient’s health, and
in inverse reinforcement learning [Chae et al., 2022, Viano et al., 2021] for imitations that are robust
to shifts between the learner’s and the experts’ dynamics. In some applications, introducing a discount
factor can be seen as a modeling choice, e.g. in finance [Deng et al., 2016]. In other applications like
game-solving [Brockman et al., 2016, Mnih et al., 2013], the discount factor is merely introduced for
algorithmic purposes, typically to ensure the convergence of iterative algorithms or to reduce vari-
ance [Baxter and Bartlett, 2001], and it may not have any natural interpretation. Additionally, large
discount factors may slow the convergence rate of the algorithms. Average optimality, i.e., optimizing
the limit of the average of the instantaneous rewards obtained over an infinite horizon, and Blackwell
optimality, i.e., finding policies that remain discount optimal for all discount factors sufficiently close to
1, provide useful objective criteria for environments with no natural discount factor, unknown discount
factor, or large discount factor. To the best of our knowledge, the literature on robust MDPs that
addresses these optimality notions is scarce. The authors in Tewari and Bartlett [2007], Wang et al. [2023]
study RMDPs with the average return criterion but only focus on computing the optimal worst-case
average returns among stationary policies, without any guarantee that average optimal policies may be
chosen in this class of policies. Similarly, some previous papers show the existence of Blackwell optimal
policies for sa-rectangular RMDPs, but they require some restrictive assumptions, such as polyhedral
uncertainty sets [Goyal and Grand-Clément, 2022, Tewari and Bartlett, 2007] or unichain RMDPs with
a unique average optimal policy [Wang et al., 2023].
Our goal in this paper is to study RMDPs with average and Blackwell optimality in the general case.
Our main contributions can be summarized as follows.
1. Average optimality. For sa-rectangular robust MDPs with compact convex uncertainty sets, we
show the existence of a stationary deterministic policy that is average optimal, thus closing an important
gap in the literature. Additionally, we describe several strong duality results and we highlight that the
worst-case transition probabilities need not exist, a fact that has been overlooked by previous work.
We also discuss the case of robustness against history-dependent worst-case transition probabilities. In
addition, we show for s-rectangular RMDPs that surprisingly, stationary policies may be suboptimal
for the average return criterion. That is, history-dependent (Markovian) policies may be necessary to
achieve the optimal average return.
2. Blackwell optimality. We provide an extensive treatment of Blackwell optimality for sa-
rectangular RMDPs. In this case, we provide a counterexample where Blackwell optimal policies fail to
exist, and we identify the pathological oscillating behaviors of the robust value functions as the key fac-
tor for this non-existence. We show however that approximate Blackwell optimal policies always exist.
Finally, we introduce the notion of definable uncertainty sets, which are sets built on simple enough
functions to ensure that the discounted value functions are “well-behaved”. The definability assumption
captures virtually all uncertainty sets used in practice for sa-rectangular RMDPs, and we show that for
sa-rectangular RMDPs with definable uncertainty sets, there always exists a stationary deterministic
policy that is Blackwell optimal. We also highlight the connections between Blackwell optimal policies
and average optimal policies. For s-rectangular RMDPs, we provide a simple example where Blackwell
optimal policies do not exist, even in the simple case of polyhedral uncertainty.
3. Algorithms and numerical experiments. We introduce three algorithms that converge to the
optimal average return, in the case of definable sa-rectangular uncertainty sets. We note that efficiently
computing an average optimal policy for sa-rectangular RMDPs is a long-standing open problem in
algorithmic game theory, and we present numerical experiments on three RMDP instances.
Overall, we provide a complete picture of average optimality and Blackwell optimality in sa-rectangular
RMDPs, and we show surprising new results for s-rectangular RMDPs. Table 1 summarizes our main
results and compares them with prior results for robust MDPs and nominal MDPs. Our results are
highlighted in bold.
Outline of the paper. The rest of the paper is organized as follows. We introduce robust MDPs
in Section 2 and provide our literature review there. We study average optimality in Section 3 and
Blackwell optimality in Section 4. We study algorithms for computing average optimal policies for
sa-rectangular RMDPs in Section 5 and we provide some numerical experiments in Section 6.
Table 1   Properties of optimal policies for different objective criteria. The set of states and the set of actions are finite. Our results are in bold.

Uncertainty set U             | Discount optimality        | Average optimality             | Blackwell optimality
Singleton (MDPs)              | stationary, deterministic  | stationary, deterministic      | stationary, deterministic
s-rectangular, compact convex | stationary, randomized     | history-dependent, randomized  | may not exist
1.1. Notations.
Given a finite set $\Omega$, we write $\Delta(\Omega)$ for the simplex of probability distributions over $\Omega$, and $\mathcal{P}(\Omega)$ for the power set of $\Omega$. We allow ourselves to conflate vectors in $\mathbb{R}^{|\Omega|}$ and functions in $\mathbb{R}^{\Omega}$. For any two vectors $u, v\in\mathbb{R}^{\Omega}$, the inequality $u\le v$ is understood componentwise: $u_s\le v_s,\ \forall s\in\Omega$, and for $\epsilon\ge 0$, we overload the notation $u\le v+\epsilon$ to signify $u_s\le v_s+\epsilon,\ \forall s\in\Omega$. We write $e = (1,\dots,1)$; its dimension depends on the context.
A robust Markov decision process is defined by a tuple $(\mathcal{S}, \mathcal{A}, r, \mathcal{U}, p_0)$ where $\mathcal{S}$ is the finite set of states, $\mathcal{A}$ is the finite set of actions available to the decision-maker, and $r\in\mathbb{R}^{\mathcal{S}\times\mathcal{A}\times\mathcal{S}}$ is the instantaneous reward function. The set $\mathcal{U}\subset(\Delta(\mathcal{S}))^{\mathcal{S}\times\mathcal{A}}$, usually called the uncertainty set, models the set of all possible transition probabilities $P = (p_{sa})_{sa}\in\Delta(\mathcal{S})^{\mathcal{S}\times\mathcal{A}}$, and $p_0\in\Delta(\mathcal{S})$ is the initial distribution over the set of states $\mathcal{S}$. We emphasize that we assume throughout that $\mathcal{S}$ and $\mathcal{A}$ are finite. We also assume that $\mathcal{U}$ is a non-empty, convex compact set. This is well-motivated from a practical standpoint, where $\mathcal{U}$ is typically built from data using statistical distances; see the end of this section for some examples.
We write ΠH for the set of history-dependent, randomized policies that possibly depend on the entire
past history ps0 , a0 , ..., st q to choose possibly randomly an action at time t. We write ΠM for the set of
Markovian policies, i.e., for policies π P ΠH that only depend on t and the current state st for the choice
of at . The set ΠS Ă ΠM of stationary policies consists of the Markovian policies that do not depend
on time, and ΠSD Ă ΠS are those policies that are furthermore deterministic [Puterman, 2014]. Given
a policy π P ΠH and transition probabilities P P U , we denote by Eπ,P the expectation with respect to
the distribution of the sequence pst , at qtě0 of states and actions induced by π and P (as a function of
the initial state). Given γ ă 1, the value function induced by pπ, P q P ΠH ˆ U is
« ff
`8
ÿ
π,P t
vγ,s :“ Eπ,P γ rst at st`1 | s0 “ s , @ s P S ,
t“0
π,P
and the discounted return is Rγ pπ, P q :“ pJ
0 vγ . A discount optimal policy is an optimal solution to
the optimization problem:
sup min Rγ pπ, P q. (2.1)
πPΠH P PU
Rectangularity, Bellman operator and adversarial MDP. The discounted robust MDP prob-
lem (2.1) becomes tractable when the uncertainty set $\mathcal{U}$ satisfies the following s-rectangularity condition:
$$\mathcal{U} = \underset{s\in\mathcal{S}}{\times}\ \mathcal{U}_s, \qquad \mathcal{U}_s \subseteq \Delta(\mathcal{S})^{\mathcal{A}},\ \forall s\in\mathcal{S}, \quad (2.2)$$
with the interpretation that $\mathcal{U}_s$ is the set of probability transitions from state $s\in\mathcal{S}$, as a function of the
action a P A. The authors in Wiesemann et al. [2013] show that if the uncertainty set U is s-rectangular
and compact convex, a discount optimal policy π ‹ can be chosen stationary (but it may be randomized).
Additionally, $\pi^\star$ can be computed by first solving for $v\in\mathbb{R}^{\mathcal{S}}$ the equation $v^\star = T(v^\star)$, where $T:\mathbb{R}^{\mathcal{S}}\to\mathbb{R}^{\mathcal{S}}$ is the Bellman operator, defined as
$$T_s(v) = \max_{\pi\in\Pi_S}\ \min_{P_s\in\mathcal{U}_s}\ \sum_{a\in\mathcal{A}} \pi_{sa}\, p_{sa}^\top\left(r_{sa} + \gamma v\right), \quad \forall v\in\mathbb{R}^{\mathcal{S}},\ \forall s\in\mathcal{S}, \quad (2.3)$$
then returning a policy $\pi^\star$ that attains the maximum in each component of $T(v^\star)$.
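To make this fixed-point computation concrete, the following Python sketch implements value iteration with the robust Bellman operator, specialized to the sa-rectangular case (so the outer maximum in (2.3) reduces to a greedy deterministic action per state). It assumes access to an inner-minimization oracle `worst_case(p_bar_sa, z, alpha_sa)` returning a minimizer of $p\mapsto p^\top z$ over $\mathcal{U}_{sa}$; the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def robust_value_iteration(r, p_bar, alpha, gamma, worst_case, n_iter=1_000, tol=1e-8):
    """Value iteration with the sa-rectangular robust Bellman operator.
    r, p_bar: (S, A, S) arrays of rewards and nominal kernels; alpha: (S, A) radii;
    worst_case(p_bar_sa, z, alpha_sa) is an assumed oracle for argmin_{p in U_sa} p @ z."""
    S, A, _ = r.shape
    v = np.zeros(S)
    for _ in range(n_iter):
        q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                p = worst_case(p_bar[s, a], r[s, a] + gamma * v, alpha[s, a])
                q[s, a] = p @ (r[s, a] + gamma * v)
        v_new = q.max(axis=1)                  # greedy deterministic action per state
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    return v, q.argmax(axis=1)                 # robust value function and a greedy policy
```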
Let us now fix a stationary policy π P ΠS , and a s-rectangular RMDP instance. The problem of
computing the worst-case return of $\pi$, defined as
$$\min_{P\in\mathcal{U}} R_\gamma(\pi,P), \quad (2.4)$$
can be reformulated as an MDP instance, called the adversarial MDP. In the adversarial MDP, the
state set is the finite set S , the set of actions available at s P S is the compact set Us , and the rewards
and transitions are continuous, see Goyal and Grand-Clément [2022], Ho et al. [2021] for more details.
In particular, the robust value function $v^{\pi,\mathcal{U}}_\gamma\in\mathbb{R}^{\mathcal{S}}$ of $\pi$, defined as $v^{\pi,\mathcal{U}}_{\gamma,s} := \min_{P\in\mathcal{U}} v^{\pi,P}_{\gamma,s}$ for all $s\in\mathcal{S}$, satisfies the fixed-point equation:
$$v^{\pi,\mathcal{U}}_{\gamma,s} = \min_{P_s\in\mathcal{U}_s}\ \sum_{a\in\mathcal{A}} \pi_{sa}\, p_{sa}^\top\left(r_{sa} + \gamma v^{\pi,\mathcal{U}}_\gamma\right), \quad \forall s\in\mathcal{S}.$$
Crucially, the worst-case transition probabilities for a given π P ΠS can be chosen in the set of extreme
points of U .
Under the sa-rectangularity assumption:
$$\mathcal{U} = \underset{(s,a)\in\mathcal{S}\times\mathcal{A}}{\times}\ \mathcal{U}_{sa}, \qquad \mathcal{U}_{sa} \subseteq \Delta(\mathcal{S}),\ \forall (s,a)\in\mathcal{S}\times\mathcal{A}, \quad (2.5)$$
a discount optimal policy may be chosen stationary and deterministic when U is compact [Iyengar, 2005,
Nilim and Ghaoui, 2005]. Typical examples of sa-rectangular uncertainty sets are based on the $\ell_p$-distance with $p\in\{1,2,+\infty\}$, with $\mathcal{U}_{sa}$ defined as $\mathcal{U}_{sa} = \{p\in\Delta(\mathcal{S}) \mid \|p-\hat{p}_{sa}\|_p \le \alpha_{sa}\}$, or based on the Kullback-Leibler divergence: $\mathcal{U}_{sa} = \{p\in\Delta(\mathcal{S}) \mid \mathrm{KL}(p,\hat{p}_{sa}) \le \alpha_{sa}\}$, for $(s,a)\in\mathcal{S}\times\mathcal{A}$ and some radius $\alpha_{sa}\ge 0$ [Iyengar, 2005, Nilim and Ghaoui, 2005, Panaganti and Kalathil, 2022]. Typical s-rectangular uncertainty sets are
constructed analogously to sa-rectangular ambiguity sets [Grand-Clement and Kroer, 2021, Wiesemann
et al., 2013].
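As an illustration of such an inner problem, the following sketch solves $\min\{p^\top z \mid p\in\Delta(\mathcal{S}),\ \|p-\hat{p}_{sa}\|_1\le\alpha_{sa}\}$ with a standard greedy budget-shifting argument (move up to $\alpha/2$ of probability mass from the states with the largest $z$ onto the state with the smallest $z$). It can be plugged into the value-iteration sketch above as the `worst_case` oracle; it is one possible implementation, not the one used in the cited references.

```python
import numpy as np

def worst_case_l1(p_bar, z, alpha):
    """Greedy solution of min_{p in simplex, ||p - p_bar||_1 <= alpha} p @ z."""
    p = np.asarray(p_bar, dtype=float).copy()
    s_min = int(np.argmin(z))
    # mass that can be shifted onto the lowest-value state
    budget = min(alpha / 2.0, 1.0 - p[s_min])
    p[s_min] += budget
    # take the same amount of mass from the highest-value states first
    for s in np.argsort(z)[::-1]:
        if budget <= 1e-12:
            break
        if s == s_min:
            continue
        removed = min(p[s], budget)
        p[s] -= removed
        budget -= removed
    return p
```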
Two-player zero-sum stochastic games (abbreviated SGs in the rest of this paper) were introduced in
the seminal work of Shapley [Shapley, 1953] and model repeated interactions between two agents with
opposite interests. In each period, the players choose each an action, which influences the evolution
of the (publicly observed) state. Agents aim at optimizing their gain, which depends on states and
actions. The literature on stochastic games is extensive, we refer the reader to Mertens et al. [2015],
Neyman et al. [2003] for classical textbooks and to Laraki and Sorin [2015], Renault [2019] for recent
short surveys. Stochastic games and robust MDPs share many similarities despite historically distinct
communities, motivations, terminology, and lines of research. In this paper, we leverage some existing
results in SGs to prove novel results for RMDPs. This section briefly describes the similarities and
differences between the two fields.
Similarities between SGs and RMDPs. It has been noted in the RMDP literature that rectangular
RMDPs as in (2.1) can be reformulated as stochastic games. At a given state s P S , the first player in
the game is the decision-maker in the RMDP, who chooses actions in A, while the adversary chooses the
transition probabilities Ps in Us . The uncertainty set thus corresponds to the action set of the adversary.
This connection has been alluded to multiple times in the robust MDP literature, e.g., section 5 of
Iyengar [2005], the last paragraph of page 7 of Xu and Mannor [2010], the last paragraph of section 2
in Wiesemann et al. [2013], and the introduction of Grand-Clément and Petrik [2022].
It is crucial to note that polyhedral uncertainty sets in s-rectangular RMDPs can be modeled by a
finite action set for the second player in the associated SG. This is because the worst-case transition
probabilities may be chosen in the set of extreme points of U , and polyhedra have finitely many
extreme points. With this in mind, extreme points of the s-rectangular convex set U correspond to
stationary deterministic strategies for the second player, while arbitrary elements of U , which are convex
combinations of extreme points, correspond to stationary randomized strategies, also called stationary
(behavior) strategies in game theory.
Interestingly, sa-rectangular RMDPs can be reformulated as a special case of perfect information
stochastic games (see section 5 in Iyengar [2005]). Perfect information SGs are SGs with the property
that each state is controlled (in terms of rewards and transitions) by only one player, see Chapter 4 in
Neyman et al. [2003] for an introduction. In the perfect information SG reformulation of sa-rectangular
RMDPs, the set of states is the union of S and of S ˆ A, the first player controls states s P S , and the
second player chooses the transition probabilities psa in states ps, aq P S ˆ A. Instantaneous rewards are
only obtained at the states of the form ps, aq. For completeness, a detailed construction is provided in
Appendix A.
A fundamental distinction between SGs and RMDPs. We now describe a crucial distinction
between RMDPs and SGs, which has received limited attention in the RMDP literature. In the classical
stochastic game framework, the first player and the second player can choose history-dependent strate-
gies. In contrast, in robust MDPs as in (2.1), it is common to assume that the decision-maker chooses
a history-dependent policy, but that the adversary is restricted to stationary policies, i.e., to choose
$P\in\mathcal{U}$ instead of $P\in\mathcal{U}_H$, where $\mathcal{U}_H$ is the set of (randomized) history-dependent policies for the adver-
sary [Goyal and Grand-Clément, 2022, Wiesemann et al., 2013]. This fundamental difference between
SGs and RMDPs is one of the main reasons why existing results for SGs do not readily extend to
results for RMDPs. The focus of this paper is on stationary adversaries, as classical for RMDPs, and
for clarity we make it explicit in the statements of all our results.
Remark 2.1 This distinction is irrelevant in the discounted return case, as we point out in Proposition
2.2 below. This proposition follows from Shapley [1953] and we provide a concise proof in Appendix B.
Equality (2.6) shows that facing a stationary adversary or a history-dependent adversary is equivalent
for RMDPs with discounted return (and a decision-maker that can choose history-dependent policies).
This provides an answer to some of the discussions about stationary vs. non-stationary adversaries
from the seminal paper on discounted sa-rectangular RMDPs [Iyengar, 2005].
Average optimality is a fundamental objective criterion extensively studied in nominal MDPs and in
reinforcement learning, see Chapters 8 and 9 in Puterman [2014] and the survey Dewanto et al. [2020].
Average optimality alleviates the need for introducing a discount factor by directly optimizing a long-
run average Ravg pπ, P q of the instantaneous rewards received over time. Intuitively, we would want
$R_{avg}(\pi,P)$ to capture the limit behaviour of the average payoff $\mathbb{E}_{\pi,P}\big[\frac{1}{T+1}\sum_{t=0}^{T} r_{s_t a_t s_{t+1}} \mid s_0\sim p_0\big]$ over the first $T$ periods, for $T\in\mathbb{N}$. A well-known issue, however (see for instance Example 8.1.1 in Puterman [2014]), is that these average payoffs need not have a limit as $T\to+\infty$.
Example 3.1 Consider a MDP with a single state and two actions a0 and a1 , with payoffs 0 and 1
respectively. In this setup, a (deterministic) Markovian strategy is a sequence pat q in t0, 1uN , and the
average payoff over the first T periods is the frequency of action a1 in these periods. Therefore, if the
sequence pat qtPN is such that these frequencies do not converge, the average payoff up to T does not
converge either as T Ñ `8. We provide a detailed example in Appendix C.
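A quick numerical illustration of Example 3.1 (the detailed construction is in Appendix C; the doubling-block sequence below is only one standard way to obtain non-convergent frequencies):

```python
import numpy as np

# Play a1 (reward 1) and a0 (reward 0) in alternating blocks of doubling length.
rewards, play_one, block = [], True, 1
while len(rewards) < 200_000:
    rewards.extend([1 if play_one else 0] * block)
    play_one, block = not play_one, 2 * block

running_avg = np.cumsum(rewards) / np.arange(1, len(rewards) + 1)
# The running average keeps oscillating (roughly between 1/3 and 2/3) and has no limit.
print(running_avg[[2**10, 2**12, 2**14, 2**16]])
```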
In this paper, we focus on the following definition of the average return $R_{avg}(\pi,P)$ of a pair $(\pi,P)\in\Pi_H\times\mathcal{U}$:
$$R_{avg}(\pi,P) = \mathbb{E}_{\pi,P}\left[\limsup_{T\to+\infty}\ \frac{1}{T+1}\sum_{t=0}^{T} r_{s_t a_t s_{t+1}} \,\Big|\, s_0\sim p_0\right]. \quad (3.1)$$
Other natural definitions of the average return are possible, such as using the lim inf instead of the
lim sup, or taking the expectation before the lim sup. We will show in the next section that our main
theorems still hold for these other definitions (see Corollary 3.7). Given a stationary policy $\pi\in\Pi_S$ and for $P\in\mathcal{U}$, we have
$$R_{avg}(\pi,P) = \lim_{T\to+\infty}\ \mathbb{E}_{\pi,P}\left[\frac{1}{T+1}\sum_{t=0}^{T} r_{s_t a_t s_{t+1}} \,\Big|\, s_0\sim p_0\right], \quad \forall (\pi,P)\in\Pi_S\times\mathcal{U},$$
and the average return coincides with the limit (normalized) discounted return as the discount factor increases to 1:
$$\lim_{\gamma\to 1}\ (1-\gamma) R_\gamma(\pi,P) = R_{avg}(\pi,P), \quad \forall (\pi,P)\in\Pi_S\times\mathcal{U}.$$
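For a fixed stationary pair $(\pi,P)$, both identities above can be checked numerically on a small Markov chain; the two-state chain below is an arbitrary illustration, not an instance from the paper.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])           # transition matrix induced by (pi, P)
r = np.array([1.0, 0.0])             # expected one-step rewards per state
p0 = np.array([0.5, 0.5])

def normalized_discounted_return(gamma):
    v = np.linalg.solve(np.eye(2) - gamma * P, r)   # v = (I - gamma*P)^{-1} r
    return (1 - gamma) * p0 @ v

# Average return: the chain is ergodic, so it is the stationary average of r.
w, V = np.linalg.eig(P.T)
stat = np.real(V[:, np.argmax(np.real(w))])
stat /= stat.sum()
print("average return:", stat @ r)
for gamma in (0.9, 0.99, 0.999):
    print(gamma, normalized_discounted_return(gamma))   # -> average return as gamma -> 1
```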
One of our main goals in this section is to study the robust MDP problem with the average return
criterion:
$$\sup_{\pi\in\Pi_H}\ \inf_{P\in\mathcal{U}}\ R_{avg}(\pi,P) \quad (3.2)$$
A solution π P ΠH to (3.2), when it exists, is called an average optimal policy. The definition of the
optimization problem (3.2) warrants two important comments.
Properties of average optimal policies. First, we stress that (3.2) optimizes the worst-case aver-
age return over history-dependent policies. In contrast, prior work on average optimal RMDPs [Le Tallec,
2007, Tewari and Bartlett, 2007, Wang et al., 2023] have solely considered the optimization problem
supπPΠS inf P PU Ravg pπ, P q, thus restricting the decision-maker to stationary policies. At this point, it is
not clear if stationary policies are optimal in (3.2), and this issue has been overlooked in the existing
literature. One of our main contributions is to answer this question by the positive for sa-rectangular
RMDPs and by the negative for s-rectangular RMDPs.
Non-existence of the worst-case transition probabilities. Second, we note that (3.2) considers
the infimum over P P U of the average return Ravg pπ, P q. The reason is that this infimum may not
be attained, even for π P ΠS . This issue has been overlooked in prior work on robust MDPs with
average return [Tewari and Bartlett, 2007, Wang et al., 2023], even though it was recognized in other
communities repeatedly, see e.g. Section 1.4.4 in Sorin [2002] or Example 4.10 in Leizarowitz [2003]. In
the next proposition, we provide a simple counterexample.
Proposition 3.2 There exists a robust MDP instance with an sa-rectangular compact convex uncer-
tainty set, for which $\inf_{P\in\mathcal{U}} R_{avg}(\pi,P)$ is not attained for any $\pi\in\Pi_S$.
The proof builds upon a relatively simple RMDP instance with three states and one action. We also
reuse this RMDP instance for later important results in the next sections. For this reason, we describe
it in detail below.
Proof of Proposition 3.2. We consider the robust MDP instance from Figure 1. There are three
[Figure 1: A simple robust MDP instance where $\inf_{P\in\mathcal{U}} R_{avg}(\pi,P)$ is not attained. (a) The point $(1-\alpha,\ \beta,\ \alpha-\beta)$ in $\Delta(\{s_0,s_1,s_2\})$. (b) The uncertainty set $\mathcal{U}_{s_0a_0}$ (in blue).]
states: S “ ts0 , s1 , s2 u, and only one action a0 , so that the RMDP reduces to the adversarial MDP
with the adversary aiming at minimizing the average return. States s1 and s2 are absorbing states with
reward ´1 and 0 respectively. The initial state is s0 , and the instantaneous reward in s0 is 0. Elements
of $\Delta(\mathcal{S})$ are written $(p_0, p_1, p_2)$, where $p_i$ is the probability of transitioning to state $s_i$. The uncertainty set is
$$\mathcal{U}_{s_0 a_0} = \left\{(1-\alpha,\ \beta,\ \alpha-\beta) \mid \alpha\in[0,1],\ 0\le\beta\le\alpha(1-\alpha)\right\},$$
see Figure 1a. Given $(1-\alpha,\beta,\alpha-\beta)\in\mathcal{U}_{s_0a_0}$, $\alpha$ is the probability to leave state $s_0$ and $\beta$ is the probability to go to state $s_1$.
The decision-maker has a single policy, which we denote $\pi = a_0$. We claim that $\inf_{P\in\mathcal{U}_{s_0a_0}} R_{avg}(\pi,P) = -1$, with $R_{avg}(\pi,P) > -1$ for each $P\in\mathcal{U}_{s_0a_0}$. We first compute the average return associated with a pair $(a_0,P)$ with $P\in\mathcal{U}_{s_0a_0}$, $P = (1-\alpha,\beta,\alpha-\beta)$ with $\alpha,\beta\in[0,1]$. Given the instantaneous rewards in this instance, note that we always have $R_{avg}(\pi,P)\in[-1,0]$. If $\alpha = 0$, then the decision-maker never leaves $s_0$ and $R_{avg}(\pi,P) = 0$. Otherwise, for every discount factor $\gamma\in(0,1)$, the discounted return $R_\gamma(a_0,P)$ satisfies
$$(1-\gamma)R_\gamma(a_0,P) = -\gamma\,\frac{\beta}{1-\gamma+\gamma\alpha}.$$
Therefore, taking the limit as $\gamma\to 1$, we obtain that if $\alpha>0$ we have $R_{avg}(a_0,P) = -\beta/\alpha$. Now consider $\alpha_n = 1/n$, $\beta_n = \alpha_n(1-\alpha_n)$, $P_n = (1-\alpha_n,\ \beta_n,\ \alpha_n-\beta_n)$ for $n\ge 1$. Then $R_{avg}(a_0,P_n) = -1 + 1/n \to -1$ as $n\to+\infty$. However, the value $-1$ is never attained. Indeed, $R_{avg}(a_0,P) = -1 \iff -\beta/\alpha = -1 \iff \alpha = \beta$. But $\beta\le\alpha(1-\alpha)$ by construction of $\mathcal{U}_{s_0a_0}$, so that $\beta = \alpha \Rightarrow \alpha = 0$ for $P\in\mathcal{U}_{s_0a_0}$. In this case, we have $R_{avg}(a_0,P) = 0$, which is a contradiction. $\square$
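The closed form derived in the proof can be checked numerically; the small script below simply re-implements the formulas above for the sequence $P_n$.

```python
def normalized_return(alpha, beta, gamma):
    # (1 - gamma) * R_gamma(a0, P) = -gamma * beta / (1 - gamma + gamma * alpha)
    return -gamma * beta / (1 - gamma + gamma * alpha)

for n in (2, 10, 100, 1000):
    alpha = 1.0 / n
    beta = alpha * (1 - alpha)                   # largest beta allowed for this alpha
    print(n, normalized_return(alpha, beta, gamma=1 - 1e-9), -1 + 1 / n)
# The average returns approach -1, but no P in U_{s0 a0} attains -1.
```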
Prior works on RMDPs have only focused on sa-rectangular RMDPs, either with polyhedral uncer-
tainty set [Le Tallec, 2007, Tewari and Bartlett, 2007] or under the assumption that the Markov chains
induced by any pair of policy and transition probabilities are unichain [Wang et al., 2023]. In these
special cases, we prove in Appendix D that inf P PU Ravg pπ, P q “ minP PU Ravg pπ, P q, i.e., worst-case tran-
sition probabilities indeed exist for any stationary policy π P ΠS . We note that this question is not
mentioned in prior work [Tewari and Bartlett, 2007, Wang et al., 2023].
We conclude this section with the following technical lemma. We adapt it from the literature on
nominal MDP with a finite set of states and compact set of actions [Bierth, 1987]. We briefly discuss
this lemma in Appendix E.
Lemma 3.3 (Adapted from theorem 2.5, Bierth [1987]) Let U be compact convex and s-
rectangular. Let $\pi\in\Pi_S$. Then
$$\inf_{P\in\mathcal{U}_H} R_{avg}(\pi,P) = \inf_{P\in\mathcal{U}} R_{avg}(\pi,P).$$
We now focus on the case of sa-rectangular RMDPs. Our main result in this section is to show that
there always exist average optimal policies that are stationary and deterministic, an attractive feature
for practical implementation in real-world applications. In particular, we have the following theorem.
Theorem 3.4 Consider a sa-rectangular robust MDP with a compact convex uncertainty set U . There
exists an average optimal policy that is stationary and deterministic:
$$\sup_{\pi\in\Pi_H}\inf_{P\in\mathcal{U}} R_{avg}(\pi,P) = \max_{\pi\in\Pi_{SD}}\inf_{P\in\mathcal{U}} R_{avg}(\pi,P).$$
The proof proceeds in two steps and leverages existing results from the literature on perfect information
SGs [Gimbert and Kelmendi, 2023].
Proof of Theorem 3.4. In the first step of the proof, we show that Theorem 3.4 is true in the special
case where U is polyhedral. In this case, U only has a finite number of extreme points, and the robust
MDP problem is equivalent to a perfect information stochastic game with finitely many actions for
both players, as described in Section 2. We can then rely on the following crucial equality, which is a
reformulation of proposition 3.2 in Gimbert and Kelmendi [2023]:
$$\sup_{\pi\in\Pi_H}\inf_{P\in\mathcal{U}_H} R_{avg}(\pi,P) = \max_{\pi\in\Pi_{SD}}\inf_{P\in\mathcal{U}_H} R_{avg}(\pi,P) = \inf_{P\in\mathcal{U}}\sup_{\pi\in\Pi_H} R_{avg}(\pi,P). \quad (3.3)$$
From weak duality we always have $\sup_{\pi\in\Pi_H}\inf_{P\in\mathcal{U}} R_{avg}(\pi,P) \le \inf_{P\in\mathcal{U}}\sup_{\pi\in\Pi_H} R_{avg}(\pi,P)$, and
$$\inf_{P\in\mathcal{U}}\sup_{\pi\in\Pi_H} R_{avg}(\pi,P) = \max_{\pi\in\Pi_{SD}}\inf_{P\in\mathcal{U}_H} R_{avg}(\pi,P) \le \max_{\pi\in\Pi_{SD}}\inf_{P\in\mathcal{U}} R_{avg}(\pi,P) \le \sup_{\pi\in\Pi_H}\inf_{P\in\mathcal{U}} R_{avg}(\pi,P),$$
where the equality follows from (3.3), the second inequality follows from U Ă UH , and the last inequality
follows from $\Pi_{SD}\subset\Pi_H$. Therefore, all terms above are equal, and we have
$$\sup_{\pi\in\Pi_H}\inf_{P\in\mathcal{U}} R_{avg}(\pi,P) = \max_{\pi\in\Pi_{SD}}\inf_{P\in\mathcal{U}} R_{avg}(\pi,P).$$
In the second step of the proof, we show that Theorem 3.4 holds for general compact uncertainty sets,
without the assumption that U is polyhedral as in the first step of the proof. Let ϵ ą 0. For each π P ΠSD ,
consider $P^{\pi,\epsilon}\in\mathcal{U}$ such that
$$R_{avg}(\pi, P^{\pi,\epsilon}) \le \inf_{P\in\mathcal{U}} R_{avg}(\pi,P) + \epsilon.$$
Note that $P^{\pi,\epsilon}$ always exists by definition of the infimum, and note that $\{P^{\pi,\epsilon} \mid \pi\in\Pi_{SD}\}$ is a finite set
since $\Pi_{SD}$ is a finite set. Let us now consider the sa-rectangular uncertainty set $\mathcal{U}_f$ defined as the convex hull of the finite set $\{P \mid \forall (s,a)\in\mathcal{S}\times\mathcal{A},\ \exists\pi\in\Pi_{SD},\ p_{sa} = p^{\pi,\epsilon}_{sa}\}$. By construction, $\mathcal{U}_f$ is polyhedral and sa-rectangular. Additionally, $\mathcal{U}_f\subset\mathcal{U}$, and for any $\pi\in\Pi_{SD}$, we have
$$\inf_{P\in\mathcal{U}} R_{avg}(\pi,P)\ \le\ \inf_{P\in\mathcal{U}_f} R_{avg}(\pi,P)\ \le\ R_{avg}(\pi, P^{\pi,\epsilon})\ \le\ \inf_{P\in\mathcal{U}} R_{avg}(\pi,P) + \epsilon.$$
Therefore,
$$\sup_{\pi\in\Pi_H}\inf_{P\in\mathcal{U}} R_{avg}(\pi,P) \le \sup_{\pi\in\Pi_H}\inf_{P\in\mathcal{U}_f} R_{avg}(\pi,P) = \max_{\pi\in\Pi_{SD}}\inf_{P\in\mathcal{U}_f} R_{avg}(\pi,P) \le \max_{\pi\in\Pi_{SD}}\inf_{P\in\mathcal{U}} R_{avg}(\pi,P) + \epsilon,$$
where the first inequality uses $\mathcal{U}_f\subset\mathcal{U}$, the equality follows from the first step of the proof and $\mathcal{U}_f$ being polyhedral, and the last inequality holds by construction of $\mathcal{U}_f$. Therefore, for all $\epsilon>0$, we have $\sup_{\pi\in\Pi_H}\inf_{P\in\mathcal{U}} R_{avg}(\pi,P) \le \max_{\pi\in\Pi_{SD}}\inf_{P\in\mathcal{U}} R_{avg}(\pi,P) + \epsilon$, and we can conclude that
$$\sup_{\pi\in\Pi_H}\inf_{P\in\mathcal{U}} R_{avg}(\pi,P) = \max_{\pi\in\Pi_{SD}}\inf_{P\in\mathcal{U}} R_{avg}(\pi,P). \qquad \square$$
Closing an important gap in prior works. First, to the best of our knowledge, we are the first to
study average return RMDPs in all generality without constraining the problem to stationary policies
and to show that average optimal policies for sa-rectangular RMDPs may be chosen stationary and
deterministic. Therefore, Theorem 3.4 addresses an important gap that has been entirely overlooked in
the existing literature on RMDPs with average return [Tewari and Bartlett, 2007, Wang et al., 2023].
Strong duality and history-dependent adversary. Another consequence of Theorem 3.4 is the
following strong duality theorem. We provide the detailed proof in Appendix F.
Theorem 3.5 Consider a sa-rectangular robust MDP with a compact convex uncertainty set U . Then
the following strong duality results hold:
$$\sup_{\pi\in\Pi_H}\inf_{P\in\mathcal{U}} R_{avg}(\pi,P) = \inf_{P\in\mathcal{U}}\sup_{\pi\in\Pi_H} R_{avg}(\pi,P), \quad (3.4)$$
$$\max_{\pi\in\Pi_{SD}}\inf_{P\in\mathcal{U}} R_{avg}(\pi,P) = \inf_{P\in\mathcal{U}}\max_{\pi\in\Pi_{SD}} R_{avg}(\pi,P). \quad (3.5)$$
Theorem 3.5 is akin to the strong duality results for discounted RMDPs [Goyal and Grand-Clément,
2022, Wiesemann et al., 2013]. It is interesting to note that strong duality still holds for sa-rectangular
RMDPs with average optimality, especially as in Equality (3.5), since the maximum over π P ΠSD is
always attained (it is the maximum over a finite set) while the infimum over U may not be attained,
even in very simple settings, as we illustrate in Proposition 3.2. Strong duality is crucial to study the
case of history-dependent adversaries, as we now show. The case of non-stationary adversary has also
gathered interest and is discussed in Iyengar [2005], Nilim and Ghaoui [2005]. Interestingly, we show
that this model is equivalent to (3.2), i.e., to the case of stationary adversaries, in the following theorem.
The proof is relatively concise and relies on our fundamental results from Theorem 3.4, Theorem 3.5
and on the properties of MDPs with compact action sets. We also note that it is straightforward to
prove that the same results hold for Markovian adversaries.
Theorem 3.6 Consider a sa-rectangular robust MDP with a compact convex uncertainty set $\mathcal{U}$. Then
$$\sup_{\pi\in\Pi_H}\inf_{P\in\mathcal{U}_H} R_{avg}(\pi,P) = \sup_{\pi\in\Pi_H}\inf_{P\in\mathcal{U}} R_{avg}(\pi,P).$$
Proof. We have
where the first inequality uses U Ă UH , where the first equality uses Theorem 3.5, the second equality
uses Theorem 3.4, the third equality is from Lemma 3.3, and the last equality uses ΠSD Ă ΠH . ˝
Other natural definitions of average optimality. Other possible definitions of the average return
criterion exist, such as, for $(\pi,P)\in\Pi_H\times\mathcal{U}$, the following natural definitions:
$$\liminf_{T\to+\infty}\ \mathbb{E}_{\pi,P}\left[\frac{1}{T+1}\sum_{t=0}^{T} r_{s_t a_t s_{t+1}} \,\Big|\, s_0\sim p_0\right], \quad (3.6)$$
$$\limsup_{T\to+\infty}\ \mathbb{E}_{\pi,P}\left[\frac{1}{T+1}\sum_{t=0}^{T} r_{s_t a_t s_{t+1}} \,\Big|\, s_0\sim p_0\right], \quad (3.7)$$
$$\mathbb{E}_{\pi,P}\left[\liminf_{T\to+\infty}\ \frac{1}{T+1}\sum_{t=0}^{T} r_{s_t a_t s_{t+1}} \,\Big|\, s_0\sim p_0\right]. \quad (3.8)$$
Note that when $(\pi,P)\in\Pi_S\times\mathcal{U}$, the three quantities in the above equations are equal to $R_{avg}$ and are equal to
$$\lim_{T\to+\infty}\ \mathbb{E}_{\pi,P}\left[\frac{1}{T+1}\sum_{t=0}^{T} r_{s_t a_t s_{t+1}} \,\Big|\, s_0\sim p_0\right], \quad (3.9)$$
but in all generality, we have the following inequalities (see lemma 2.1 in Feinberg and Shwartz [2012]):
$$(3.8)\ \le\ (3.6)\ \le\ (3.7)\ \le\ (3.1). \quad (3.10)$$
At this point, the reader may wonder if the results that we proved in this section also hold for these other
definitions of the average return and if these returns each need to be analyzed separately. Fortunately,
based on Inequality (3.10), we can show in a straightforward manner that Theorem 3.4, Theorem 3.5
and Theorem 3.6 still hold for the average return as in (3.8), (3.6) and (3.7). This shows that all these
definitions of the average return are equivalent in the sense that there exists a stationary deterministic
policy that is average optimal simultaneously for all these objective functions. In particular, we obtain
the following important corollary. For the sake of conciseness we provide the proof in Appendix G.
Corollary 3.7 Consider a sa-rectangular robust MDP with a compact convex uncertainty set U . Let
$\hat{R}_{avg}$ be defined as in (3.6), (3.7), or (3.8).
1. There exists an average optimal policy that is stationary and deterministic, and which coincides
with an average optimal policy for Ravg as in (3.1):
$$\sup_{\pi\in\Pi_H}\inf_{P\in\mathcal{U}} \hat{R}_{avg}(\pi,P) = \max_{\pi\in\Pi_{SD}}\inf_{P\in\mathcal{U}} \hat{R}_{avg}(\pi,P) = \max_{\pi\in\Pi_{SD}}\inf_{P\in\mathcal{U}} R_{avg}(\pi,P).$$
2. The strong duality results from Theorem 3.5 still hold if we replace Ravg by R̂avg .
3. The equivalence between stationary adversaries and history-dependent adversaries from Theorem
3.6 still holds if we replace $R_{avg}$ by $\hat{R}_{avg}$.
We now show that for s-rectangular RMDPs, Markovian policies may strictly outperform stationary
policies for the average return criterion. This is a surprising result since stationary policies are sufficient
for solving average sa-rectangular RMDPs and discount s-rectangular RMDPs. We rely on a stochastic
game instance called The Big Match [Blackwell and Ferguson, 1968]. In this instance, there are two
absorbing states $A_0$ (with a reward of 0) and $A_1$ (with a reward of 1) and two non-absorbing states $s_0$
and s1 . The decision-maker starts in state s0 and chooses an action in tT, B u. The adversary chooses
the scalar p P r0, 1s, whose impact on the transitions is represented in Figure 2. Each arrow corresponds
to a possible transition, and is labelled with its probability, and with the payoff obtained along this
transition. If action T is chosen, the game reaches an absorbing state, A0 or A1 , depending on the
choice of the adversary. If action B is chosen, the decision-maker obtains a reward of 0 or 1 and the
game continues. We can think of this instance as a game where the adversary tries to “guess” if the
decision-maker will choose to stop the game (action T ). If the adversary guesses correctly and the
decision-maker chooses action T , the adversary can choose p “ 1 so that the decision-maker reaches the
“bad” absorbing state A0 (with reward 0). Otherwise, if the decision-maker chooses action B, the game
continues. The Big Match is represented in Figure 2. We will show that in The Big Match, all stationary
policies have a worst-case average return of 0, while a Markovian policy can achieve a worst-case average
return of 1{2. In particular, we have the following proposition.
Proposition 3.8 Consider the robust MDP instance from Figure 2 (The Big Match).
1. $\sup_{\pi\in\Pi_H}\inf_{P\in\mathcal{U}} R_{avg}(\pi,P) \le 1/2$.
2. There exists a Markovian policy $\pi\in\Pi_M$ such that $R_{avg}(\pi,P) = 1/2$ for all $P\in\mathcal{U}$; in particular, $\pi$ is average optimal.
3. $\min_{P\in\mathcal{U}} R_{avg}(\pi,P) = 0$ for every stationary policy $\pi\in\Pi_S$, while $\min_{P\in\mathcal{U}}\max_{\pi\in\Pi_S} R_{avg}(\pi,P) = 1/2$; in particular, stationary policies are not average optimal and strong duality fails over stationary policies.
[Figure 2: The Big Match. Transitions from the non-absorbing states $s_0$ and $s_1$ under the two actions: each arrow is labelled with its probability and the payoff obtained along the transition, as a function of the adversary's parameter $p\in[0,1]$; $A_0$ and $A_1$ are the absorbing states.]
Proof. We note that point 1 and point 3 are known in the SG literature, and only point 2 is new here. In particular, point 1 follows from Blackwell's original argument, see the beginning of theorem 1 in Blackwell and Ferguson [1968], and point 3 is exactly Lemma 1, chapter 12, Neyman et al. [2003]. We detail the proofs here for completeness.
1. This follows from the adversary choosing the transition probabilities P corresponding to p “ 1{2.
In this case, we have Ravg pπ, P q “ 1{2 for any policy π P ΠH . Indeed, if π always selects action B, then
clearly Ravg pπ, P q “ 1{2. Otherwise, there exists a period at which π selects action T with a positive
probability, and in this case we also have Ravg pπ, P q “ 1{2 since the decision-maker is equally likely to
reach the terminal state A0 (with a reward of 0) or the terminal state A1 (with a reward of 1).
2. We will construct a Markovian policy π such that Ravg pπ, P q “ 1{2 for any P P U . In particular,
let π the Markovian policy that chooses the same action in the states s0 and s1 and such that: (a) at
time t “ 0, π chooses action T or action B with probability 1{2; (b) for any time t ě 1, π always chooses
action B. Let p P r0, 1s and let P P U be the corresponding transition probabilities. At time t “ 0, the
game stops with probability 1{2 (if action T is chosen), in which case the decision-maker reaches A0
with probability p, obtaining a reward of 0 forever, and reaches A1 with probability 1 ´ p, obtaining a
reward of 1 forever. If action B is selected at time t “ 0, which happens with probability 1{2, the game
starts in s0 or s1 at time t “ 1. Then, action B is selected forever, so that the average payoff is p1 ´ pq.
Overall, we obtain that the average payoff of $\pi$ is $\frac{1}{2}\cdot p + \frac{1}{2}\cdot(1-p) = \frac{1}{2}$.
3. Let π P ΠS . If π always chooses action B, then the adversary choosing p “ 0 yields an average
return of 0. Otherwise, π chooses action T with a positive probability, and the adversary choosing
p “ 1 yields an average return of 0. Therefore, minP PU Ravg pπ, P q “ 0 for any π P ΠS . The fact that
minP PU maxπPΠS Ravg pπ, P q “ 1{2 follows from point 1 in this proof: when p “ 1{2, we have Ravg pπ, P q “
1{2, @ π P ΠH .
˝
Proposition 3.8 shows the stark contrast between average optimal policies for sa-rectangular RMDPs,
which can always be chosen stationary and deterministic (Theorem 3.4), and average optimal policies
for s-rectangular RMDPs, which may have to be Markovian and randomized. Additionally, the last
point of Proposition 3.8 shows that strong duality does not hold for average return s-rectangular RMDPs
with stationary strategies for the decision-maker, again in contrast to the case of sa-rectangular RMDPs
(Theorem 3.5). It is an interesting open question to understand if, for any s-rectangular RMDP, an
average optimal policy may be chosen Markovian and if strong duality holds in the case of Markovian
decisions for the decision-maker. We conclude this section with a comparison of The Big Match in the
case of a history-dependent adversary. In this case, The Big Match is exactly the instance studied in the
stochastic game literature. In particular, Blackwell and Ferguson [1968] show that in the classical SG
setting where both the decision-maker and the adversary can use history-dependent policies, we have
$$\sup_{\pi\in\Pi_H}\inf_{P\in\mathcal{U}_H} R_{avg}(\pi,P) = 1/2.$$
The supremum is not attained in the left-hand side above: any Markovian policy achieves a worst-case average return of 0, i.e., $\inf_{P\in\mathcal{U}_H} R_{avg}(\pi,P) = 0$ for $\pi\in\Pi_M$, and for any choice of $\epsilon\in(0,1/2)$, only history-dependent policies can guarantee a worst-case average return of $1/2-\epsilon$. We refer to chapter 12 in
Neyman et al. [2003] for a modern exposition of these results.
An important limitation of the average return is that it ignores any rewards obtained in finite time,
which may be problematic. For instance, when optimizing patient trajectories over time, practitioners
are concerned with long-term goals (typically, survival at discharge) but also with the current patient
condition at any point in time. Blackwell optimality is a criterion that balances both long-term and
short-term goals by accounting for an entire range of discount factors. In particular, a policy is Blackwell
optimal if it is discount optimal for all discount factors sufficiently close to 1 [Puterman, 2014].
Definition 4.1 Let U be s-rectangular and convex compact. A policy π P ΠH is Blackwell optimal if
there exists $\gamma_0\in(0,1)$ such that $\pi$ is discount optimal, i.e., an optimal solution to (2.1), for every discount factor $\gamma\in(\gamma_0,1)$.
For nominal MDPs, i.e. for the case where U is a singleton and the sets S , A are finite, there always
exists a Blackwell optimal policy [Puterman, 2014]. We will be interested in the existence (or not) of
Blackwell optimal policies for rectangular robust MDPs. We also introduce the following definition of
approximate Blackwell optimality, where a policy remains ϵ-optimal for all discount factors sufficiently
large.
Definition 4.2 (ϵ-Blackwell optimality.) Let U be s-rectangular and convex compact. A policy π P
$\Pi_H$ is $\epsilon$-Blackwell optimal for $\epsilon>0$ if there exists a discount factor $\gamma_\epsilon\in(0,1)$ such that
$$(1-\gamma)\min_{P\in\mathcal{U}} R_\gamma(\pi,P)\ \ge\ \sup_{\pi'\in\Pi_H}\,(1-\gamma)\min_{P\in\mathcal{U}} R_\gamma(\pi',P) - \epsilon, \quad \forall\gamma\in(\gamma_\epsilon,1).$$
Note that we renormalize the discounted returns in the robust MDPs with the multiplicative term
p1 ´ γ q since they may be unbounded otherwise. To the best of our knowledge, we are the first to
introduce and study the notion of ϵ-Blackwell optimality for robust MDPs. In the rest of this section,
we will repeatedly use the following result pertaining to the existence of approximate Blackwell optimal
policy in the adversarial MDP, which we adapt from the literature on MDPs with compact action sets.
Theorem 4.3 (Corollary 5.26, Sorin [2002]) Let U be a compact convex s-rectangular uncertainty
set. Let π P ΠS and ϵ ą 0. Then there exist γϵ P p0, 1q and Pϵ P U such that
$$\min_{P\in\mathcal{U}}(1-\gamma)R_\gamma(\pi,P)\ \le\ (1-\gamma)R_\gamma(\pi,P_\epsilon)\ \le\ \min_{P\in\mathcal{U}}(1-\gamma)R_\gamma(\pi,P) + \epsilon, \quad \forall\gamma\in(\gamma_\epsilon,1).$$
In this section, we provide a complete analysis of Blackwell optimality for sa-rectangular RMDPs. In
Section 4.2.1, we first show that, surprisingly, a Blackwell optimal policy may not exist, although an
approximate Blackwell optimal policy may always be chosen stationary and deterministic. Additionally,
this approximate Blackwell optimal policy is average optimal, as we show in Section 4.2.2. Finally, in
Section 4.2.3 we introduce the notion of definable uncertainty sets, a very general class of uncertainty
for which a Blackwell optimal policy exists.
4.2.1. Existence and non-existence results We first contrast the existence properties of Black-
well optimal policies and ϵ-Blackwell optimal policies (as introduced in Definition 4.2). Surprisingly,
there are some examples of sa-rectangular robust MDPs where there are no Blackwell optimal policies,
as stated formally in the next theorem.
Theorem 4.4 There exists a sa-rectangular robust MDP instance, with a compact convex uncertainty
set $\mathcal{U}$, such that there is no Blackwell optimal policy.
To the best of our knowledge, we are the first to show that Blackwell optimal policies may not exist for
sa-rectangular RMDPs. Indeed, Theorem 4.4 is surprising because the existence of Blackwell optimal
policies has been proved in various other related frameworks, including nominal MDPs [Puterman,
2014], or sa-rectangular RMDPs under some additional assumptions [Goyal and Grand-Clément, 2022,
Wang et al., 2023]. For the sake of conciseness, we defer the proof of Theorem 4.4 to Appendix H,
and we only provide here some intuition on the main reasons behind the potential non-existence of
Blackwell optimal policies.
Our proof of Theorem 4.4 is based on the same simple counterexample as for Proposition 3.2. We con-
sider the same instance but where there are two available actions a1 and a2 at the non-absorbing state
s0 . We can identify stationary policies with actions a1 and a2 . The main property of our counterexam-
ple is that the robust value functions $\gamma\mapsto v^{a_1,\mathcal{U}}_{s_0,\gamma}$ and $\gamma\mapsto v^{a_2,\mathcal{U}}_{s_0,\gamma}$ have an oscillatory behaviour when $\gamma$
approaches 1. As these robust value functions oscillate more and more often as γ Ñ 1, they intersect
infinitely often on any interval close to 1, so that there are no discount factors close enough to 1 after
which a1 (or a2 ) always remains a discount optimal policy. To obtain the oscillatory behaviours of the
robust value functions, we construct two convex compact uncertainty sets Us0 a1 and Us0 a2 . At a high
level, we construct these uncertainty sets so that their boundaries overlap and intersect infinitely often,
while remaining distinct sets (as subsets of $\Delta(\mathcal{S})$). Let us define the worst-case transition probabilities $\gamma\mapsto P^{a_1,\star}_\gamma$ and $\gamma\mapsto P^{a_2,\star}_\gamma$ for actions $a_1$ and $a_2$. Then these worst-case transition probabilities vary with $\gamma$ and always belong to the boundaries of $\mathcal{U}_{s_0a_1}$ and $\mathcal{U}_{s_0a_2}$. The infinite intersections of these boundaries induce a pathological, oscillating behaviour of $\gamma\mapsto v^{a_1,\mathcal{U}}_{s_0,\gamma} = v^{a_1,P^{a_1,\star}_\gamma}_{s_0,\gamma}$ (and similarly for $a_2$). Our analysis also explains why this pathological behaviour is impossible in nominal MDPs: the transition probabilities are fixed, and in this case $\gamma\mapsto v^{\pi,P}_{s,\gamma}$ is always a well-behaved (rational) function. This pathological behaviour is also impossible for polyhedral uncertainty: the worst-case transition probabilities remain the same for $\gamma$ large enough, as shown in Goyal and Grand-Clément [2022].
Since Theorem 4.4 shows that Blackwell optimal policies may not always exist, it is natural to ask if
approximate Blackwell optimal policies always exist. We answer this question by the positive in the
next theorem. In fact, we show the following stronger existence result.
Theorem 4.5 Let U be a sa-rectangular compact uncertainty set. Then there exists a stationary deter-
ministic policy that is $\epsilon$-Blackwell optimal for all $\epsilon>0$, i.e., $\exists\,\pi\in\Pi_{SD}$, $\forall\epsilon>0$, $\exists\,\gamma_\epsilon\in(0,1)$ such that
$$(1-\gamma)\min_{P\in\mathcal{U}} R_\gamma(\pi,P)\ \ge\ \sup_{\pi'\in\Pi_H}\,(1-\gamma)\min_{P\in\mathcal{U}} R_\gamma(\pi',P) - \epsilon, \quad \forall\gamma\in(\gamma_\epsilon,1).$$
Theorem 4.5 shows another surprising result: not only does there exist an ϵ-Blackwell optimal policy for
any choice of ϵ ą 0, but in fact, we can choose the same stationary deterministic policy π P ΠSD to
be ϵ-Blackwell optimal for all ϵ. Theorem 4.4 shows that we can not choose ϵ “ 0 in the statement of
Theorem 4.5. We present the proof in Appendix I. The main idea is to exploit Theorem 4.3, i.e., the
existence of approximate Blackwell optimal policies in the adversarial MDP.
4.2.2. Limit behaviours and connection with average optimality We now highlight the
connection between Blackwell optimality and average optimality. Intuitively, approximate Blackwell
optimal policies remain ϵ-optimal for all γ close to 1. Since the discount factor captures the willingness of
the decision-maker to wait for future rewards, we expect that the discounted return resembles more and
more the average return as γ Ñ 1. We show that this intuition is correct for sa-rectangular uncertainty
sets in the following theorem.
Theorem 4.6 Consider a sa-rectangular robust MDP with a compact convex uncertainty set U . Let
π P ΠSD be ϵ-Blackwell optimal, for all ϵ ą 0. Then π is also average optimal.
The proof of Theorem 4.6 is presented in Appendix J. The main lines of the proof are instructive and
rely on carefully inspecting the limit behavior of the discounted return as γ Ñ 1. We provide an outline
here. Recall that we always have
$$\lim_{\gamma\to 1}(1-\gamma)R_\gamma(\pi,P) = R_{avg}(\pi,P), \quad \forall (\pi,P)\in\Pi_S\times\mathcal{U}.$$
The first step is to show that the equality above is still true when taking the worst-case over the
transition probabilities. In particular, we show the following lemma.
Lemma 4.7 In the setting of Theorem 4.6, let $\pi\in\Pi_S$. Then $\min_{P\in\mathcal{U}}(1-\gamma)R_\gamma(\pi,P)$ admits a limit as $\gamma\to 1$ and
$$\lim_{\gamma\to 1}\min_{P\in\mathcal{U}}(1-\gamma)R_\gamma(\pi,P) = \inf_{P\in\mathcal{U}} R_{avg}(\pi,P).$$
We then show that the same conclusion holds for the limit of the optimal discounted return.
Lemma 4.8 In the setting of Theorem 4.6, $\max_{\pi\in\Pi_{SD}}\min_{P\in\mathcal{U}}(1-\gamma)R_\gamma(\pi,P)$ admits a limit as $\gamma\to 1$ and
$$\lim_{\gamma\to 1}\max_{\pi\in\Pi_{SD}}\min_{P\in\mathcal{U}}(1-\gamma)R_\gamma(\pi,P) = \max_{\pi\in\Pi_{SD}}\inf_{P\in\mathcal{U}} R_{avg}(\pi,P).$$
The last part of the proof relates average optimality and approximate-Blackwell optimality.
Lemma 4.9 In the setting of Theorem 4.6, let $\pi\in\Pi_{SD}$ be $\epsilon$-Blackwell optimal for any $\epsilon>0$. Then
$$\lim_{\gamma\to 1}\min_{P\in\mathcal{U}}(1-\gamma)R_\gamma(\pi,P) = \lim_{\gamma\to 1}\max_{\pi'\in\Pi_{SD}}\min_{P\in\mathcal{U}}(1-\gamma)R_\gamma(\pi',P).$$
Combining Lemma 4.9 with Lemma 4.8 concludes the proof of Theorem 4.6. Note that all the conclusions
from Lemma 4.7, Lemma 4.8 and Lemma 4.9 are true when we replace Ravg as defined in (3.1) by the
other natural definitions (3.6), (3.7), or (3.8), since the lemmas above only involve maximization and
minimization over ΠS and U , over which Ravg and the other definitions of the average return coincide.
Remark 4.10 Lemma 4.8 and Lemma 4.9 are present as corollary 4 in Tewari and Bartlett [2007],
under the assumption that $\mathcal{U}$ is based on the $\ell_\infty$-distance. This considerably simplifies the proof, as in this
setting, the number of extreme points of U is finite, so the infimum over U can be reduced to a minimum
over a finite number of elements, and we can exchange the minimization with the limit. Lemma 4.7 is
also present in Wang et al. [2023], under some additional assumptions (unichain compact sa-rectangular
RMDPs). Our results show that the unichain assumption is unnecessary.
4.2.3. Definable robust Markov decision processes In this section, we introduce a very gen-
eral class of uncertainty sets that encompasses virtually all the practical examples existing in the RMDP
literature, and for which stationary deterministic Blackwell optimal policies exist. Indeed, it is classical
to construct uncertainty sets based on simple functions, like affine maps, ℓp -balls, or Kullback-Leibler
divergence, see Section 2. For such simple functions, we intuitively do not expect the robust value functions to oscillate and intersect infinitely often. Our main contribution in this section is to for-
malize this intuition with the notion of definability [Bolte et al., 2015, Coste, 2000, Van Den Dries,
1998] and to prove Theorem 4.20, which states that for definable sa-rectangular RMDPs there always
exists a stationary deterministic Blackwell optimal policy.
A concise introduction to definability. We start with the following definition. Intuitively, a set is
definable if it is simple enough to be “constructed” based on polynomials, the exponential function, and
canonical projections (elimination of variables).
Definition 4.11 (Definable set and definable function) A subset of $\mathbb{R}^n$ is definable if it is of the form
$$\left\{x\in\mathbb{R}^n \;\middle|\; \exists k\in\mathbb{N},\ \exists y\in\mathbb{R}^k,\ P\big(x_1,\dots,x_n,\,y_1,\dots,y_k,\,\exp(x_1),\dots,\exp(x_n),\,\exp(y_1),\dots,\exp(y_k)\big) = 0\right\} \quad (4.3)$$
for some polynomial $P$ with real coefficients. A function is definable if its graph is a definable set.
We refer to Bolte et al. [2015] and Akian et al. [2019] for concise introductions to definability and to
Coste [2000] for a more in-depth treatment. It is instructive to start by studying a few simple examples.
Example 4.13 ($\ell_p$-norms.) Consider an $\ell_p$-norm for $p\in\mathbb{N}$. Then its graph is $\{(x,y)\in\mathbb{R}^{\mathcal{S}}\times\mathbb{R} \mid \sum_{s\in\mathcal{S}}|x_s|^p - y^p = 0,\ y\ge 0\}$, which can be written as (4.3). Therefore, $\ell_p$-norms are definable.
Definable functions and definable sets are well-behaved under many useful operations, as shown in the
following lemma. The proof follows directly from the definition and some properties shown in Bolte
et al. [2015], and we present it for completeness in Appendix K.
Lemma 4.15 1. If $A, B\subset\mathbb{R}^n$ are definable sets, then $A\cup B$, $A\cap B$ and $\mathbb{R}^n\setminus A$ are definable sets.
2. Let $f, g$ be definable functions. Then $f\circ g$, $-f$, $f+g$, $f\times g$ are definable.
3. For $A, B$ two definable sets and $g: A\times B\to\mathbb{R}$ a definable function, the functions $A\to\mathbb{R},\ x\mapsto \inf_{y\in B} g(x,y)$ and $A\to\mathbb{R},\ x\mapsto \sup_{y\in B} g(x,y)$ are definable.
4. If $A$, $B$, and $g: A\to\mathbb{R}$ are definable, then $g^{-1}(B)$ is definable.
Example 4.16 (Functions based on entropy.) Let $\hat{p}\in\Delta(\mathcal{S})$ with $\hat{p}_s>0$, $\forall s\in\mathcal{S}$. Consider the Kullback-Leibler divergence $f: p\mapsto \sum_{s\in\mathcal{S}} p_s\log(p_s/\hat{p}_s)$ defined over $p\in\Delta(\mathcal{S})$. Recall that $x\mapsto\log(x)$ is definable, so that $x\mapsto x\log(x)$ is also definable. By summation, the Kullback-Leibler divergence is a definable function. The case of the Burg entropy $p\mapsto\sum_{s\in\mathcal{S}}\hat{p}_s\log(\hat{p}_s/p_s)$ for $p\in\Delta(\mathcal{S})$ is similar to the case of the Kullback-Leibler divergence.
The most important result for our use of definability is the following monotonicity theorem concerning
definable functions of a real variable.
Theorem 4.17 (Theorem 2.1, Coste [2000]) Let $f:(a,b)\to\mathbb{R}$ be a definable function. Then there exists a finite subdivision $a = a_1 < a_2 < \cdots < a_k = b$ of the interval $(a,b)$ such that on each $(a_i,a_{i+1})$, for $i = 1,\dots,k-1$, $f$ is continuous and either constant or strictly monotone.
The monotonicity theorem shows that definable functions over R cannot oscillate infinitely often on an
interval. Recall in the previous section, we have identified oscillations of the robust value functions as
the main issue potentially precluding the existence of Blackwell optimal policies, see discussion after
Theorem 4.4. Therefore, the monotonicity theorem will play an important role in our proof of the
existence of Blackwell optimality for definable RMDPs as in Theorem 4.20.
Remark 4.18 The notion of definability is usually introduced in much more generality, e.g. [Bolte
et al., 2015, Coste, 2000, Van Den Dries, 1998]. We only introduce the notions necessary for this paper.
For exactness, we simply note that in the literature, the notion of definability introduced in Definition
4.11 is usually referred to as definable in the real exponential field.
Definable robust MDPs. We now study sa-rectangular robust MDPs with a definable compact
uncertainty set. We first note that this encompasses the vast majority of the uncertainty sets studied in
the robust MDP literature. Indeed, from Lemma 4.15, we know that the set $\mathcal{U}_{sa} = \{p\in\Delta(\mathcal{S}) \mid d_{sa}(p)\le\alpha_{sa}\}$ is definable as soon as $d_{sa}:\mathbb{R}^{\mathcal{S}}\to\mathbb{R}$ is definable. From the various examples introduced above, we obtain that sa-rectangular uncertainty sets based on the $\ell_\infty$-norm [Givan et al., 1997], the $\ell_2$-norm [Iyengar, 2005], the $\ell_1$-norm [Ho et al., 2021], the Kullback-Leibler divergence and the Burg entropy [Ho et al., 2022, Iyengar, 2005] are definable uncertainty sets.
We start by describing an appealing property of definable RMDPs: the robust value functions are
themselves definable functions. In particular, we have the following proposition.
Proposition 4.19 Assume that U is sa-rectangular and definable. Then for any policy π P ΠS , the
function γ ÞÑ vγπ,U is a definable function.
Proof of Proposition 4.19. Recall that the robust value function $v^{\pi,\mathcal{U}}_\gamma$ is the unique fixed point of the operator $v\mapsto T^{\pi,\mathcal{U}}_\gamma(v)$, defined as
$$T^{\pi,\mathcal{U}}_{\gamma,s}(v) = \min_{P_s\in\mathcal{U}_s}\ \sum_{a\in\mathcal{A}} \pi_{sa}\, p_{sa}^\top\left(r_{sa} + \gamma v\right), \quad \forall s\in\mathcal{S},\ \forall v\in\mathbb{R}^{\mathcal{S}}.$$
The proof proceeds in two steps. The first step shows that Tγπ,U is definable when U is definable. The
second step shows that this is sufficient to conclude that the robust value functions are definable.
First step. We first prove that $(v,\gamma)\mapsto T^{\pi,\mathcal{U}}_\gamma(v)$ is definable. Note that a function is definable if and only if each of its components is definable (exercise 1.10, Coste [2000]). We want to show that $(v,\gamma)\mapsto \big(\sum_{a\in\mathcal{A}}\pi_{sa}\min_{p_{sa}\in\mathcal{U}_{sa}} p_{sa}^\top(r_{sa}+\gamma v)\big)_{s\in\mathcal{S}}$ is definable. Let $v\in\mathbb{R}^{\mathcal{S}}$. The map $\gamma\mapsto \big(\sum_{a\in\mathcal{A}}\pi_{sa}\min_{p_{sa}\in\mathcal{U}_{sa}} p_{sa}^\top(r_{sa}+\gamma v)\big)_{s\in\mathcal{S}}$ is affine and therefore it is definable. We now fix $\gamma\in(0,1)$. From Lemma 4.15, if $\mathcal{U}_{sa}$ is definable for each pair $(s,a)\in\mathcal{S}\times\mathcal{A}$, then $v\mapsto\min_{p_{sa}\in\mathcal{U}_{sa}} p_{sa}^\top v$ is definable, and we conclude that $v\mapsto\sum_{a\in\mathcal{A}}\pi_{sa}\min_{p_{sa}\in\mathcal{U}_{sa}} p_{sa}^\top(r_{sa}+\gamma v)$ is definable for any $s\in\mathcal{S}$. This shows that $v\mapsto\big(\sum_{a\in\mathcal{A}}\pi_{sa}\min_{p_{sa}\in\mathcal{U}_{sa}} p_{sa}^\top(r_{sa}+\gamma v)\big)_{s\in\mathcal{S}}$ is definable.
Theorem 4.20 Consider a sa-rectangular robust MDP with a definable compact convex uncertainty
set $\mathcal{U}$. Then there exists a stationary deterministic Blackwell optimal policy.
Proof of Theorem 4.20. From Proposition 4.19 and Lemma 4.15, we know that for each pair of stationary deterministic policies $\pi,\pi'$ the function $[0,1)\to\mathbb{R},\ \gamma\mapsto v^{\pi,\mathcal{U}}_{\gamma,s} - v^{\pi',\mathcal{U}}_{\gamma,s}$ is definable. From the monotonicity theorem, we can conclude that $\gamma\mapsto v^{\pi,\mathcal{U}}_{\gamma,s} - v^{\pi',\mathcal{U}}_{\gamma,s}$ does not change sign on a neighborhood $(1-\eta(\pi,\pi',s),\,1)$ with $\eta(\pi,\pi',s)>0$. We then define $\bar{\gamma} = \max_{\pi,\pi'\in\Pi_{SD},\,s\in\mathcal{S}}\ 1-\eta(\pi,\pi',s)$, with $\bar{\gamma}<1$ since there are finitely many stationary deterministic policies and finitely many states. Any policy that is discount optimal for $\gamma\in(\bar{\gamma},1)$ is Blackwell optimal, which shows the existence of a stationary deterministic Blackwell optimal policy. $\square$
The proof of Theorem 4.20 is relatively concise. It is also quite instructive, as it shows the existence
of a Blackwell discount factor γbw P p0, 1q, above which any discount optimal policy is also Blackwell
optimal. We note that we actually proved a stronger result than the existence of a Blackwell optimal
policy. In particular, we have shown the following theorem.
Theorem 4.21 Consider a sa-rectangular robust MDP with a definable compact convex uncertainty
set U . Then there exists a Blackwell discount factor γbw P p0, 1q, such that any policy that is discount
optimal for γ P pγbw , 1q is also Blackwell optimal.
The existence of the Blackwell discount factor is shown in Grand-Clément and Petrik [2023] for nominal
MDPs and for sa-rectangular RMDPs with polyhedral uncertainty sets. Theorem 4.21 extends this
result to the larger class of definable uncertainty sets. If we know an upper bound γ̂ ă 1 on γbw , then
we can compute a Blackwell optimal policy by solving a discounted robust MDP with discount factor
γ “ γ̂. We leave computing such an upper bound on γbw as an interesting future direction.
We now provide an instance of an s-rectangular RMDP where there are no Blackwell optimal policies,
even though the uncertainty set is definable. This may happen when all optimal policies are randomized,
and the randomized optimal policies vary with the discount factor.
Proposition 4.22 There exists a s-rectangular robust MDP instance, with a compact, polyhedral uncer-
tainty set $\mathcal{U}$, such that there is no Blackwell optimal policy.
Proof. In Figure 3, we adapt the example from Figure 3 in Wiesemann et al. [2013]. There are
three states and two actions. We can parametrize the return by the probability $x\in[0,1]$ to play action $a_1$ and by the parameter $p\in[0,1]$ chosen by the adversary.

[Figure 3: The s-rectangular instance adapted from Figure 3 in Wiesemann et al. [2013]; each transition is labelled with its probability and reward, as a function of the adversary's parameter $p\in[0,1]$.]

We have $R_\gamma(x,p) = xp\left(1+\frac{\gamma}{1-\gamma}\right) + (1-x)\left(p + (1-p)\frac{\gamma}{1-\gamma}\right)$. This can be reformulated as $R_\gamma(x,p) = p\left(1+x\frac{\gamma}{1-\gamma}\right) + (1-p)(1-x)\frac{\gamma}{1-\gamma}$. Let us compute the worst-case return for policy $x\in[0,1]$. We have $\min_{p\in[0,1]} R_\gamma(x,p) = \min\left\{1+x\frac{\gamma}{1-\gamma},\ (1-x)\frac{\gamma}{1-\gamma}\right\}$. The optimal policy then maximizes $x\mapsto\min_{p\in[0,1]} R_\gamma(x,p)$, i.e., it maximizes $x\mapsto\min\left\{1+x\frac{\gamma}{1-\gamma},\ (1-x)\frac{\gamma}{1-\gamma}\right\}$ over $[0,1]$. Assume that $\gamma\ge 1/2$ (we are interested in Blackwell optimal policies, i.e. in the case $\gamma\to 1$). Then the maximum of $x\mapsto\min\left\{1+x\frac{\gamma}{1-\gamma},\ (1-x)\frac{\gamma}{1-\gamma}\right\}$ is attained at $x^\star(\gamma)$, the solution of the equation $1+x\frac{\gamma}{1-\gamma} = (1-x)\frac{\gamma}{1-\gamma}$. Therefore, there is a unique discount optimal policy and it depends on $\gamma$: $x^\star(\gamma) = 1-\frac{1}{2\gamma}$ for $\gamma\ge 1/2$. Overall, no policy remains discount optimal as $\gamma$ varies, and there are no Blackwell optimal policies. $\square$
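A direct numerical check of this computation (a brute-force grid over $x$, written only to illustrate that the unique maximizer moves with $\gamma$):

```python
import numpy as np

def worst_case_return(x, gamma):
    g = gamma / (1 - gamma)
    return min(1 + x * g, (1 - x) * g)   # min over p in [0,1] of R_gamma(x, p)

for gamma in (0.6, 0.9, 0.99):
    xs = np.linspace(0.0, 1.0, 100_001)
    best_x = xs[np.argmax([worst_case_return(x, gamma) for x in xs])]
    print(gamma, best_x, 1 - 1 / (2 * gamma))   # grid argmax vs. x*(gamma) = 1 - 1/(2*gamma)
```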
We note that the counterexample in the proof of Proposition 4.22 is much simpler than the coun-
terexample from Section 4.2.1 for sa-rectangular RMDPs. In particular, Proposition 4.22 shows that
Blackwell optimal policies may fail to exist for s-rectangular RMDPs, even in the simple setting where
the uncertainty set is a polytope parametrized by p P r0, 1s. A promising next step is to study the existence and tractability of ϵ-Blackwell optimal policies for s-rectangular RMDPs.
5. Algorithms
In this section, we discuss various iterative algorithms to compute average optimal and Blackwell
optimal policies. Our results for s-rectangular RMDPs from Section 3.3 and Section 4.3 suggest that
it may be difficult to compute average and Blackwell optimal policies in this case. Therefore, we focus
on sa-rectangular robust MDPs. Since for sa-rectangular RMDPs an average optimal policy may be
chosen stationary and deterministic, we define the optimal gain g ‹ P RS as
g_s^‹ “ max_{π P ΠSD} inf_{P P U} lim_{T Ñ `8} E_{π,P} [ p1{pT ` 1qq ř_{t“0}^{T} r_{s_t a_t s_{t`1}} | s_0 “ s ] , @ s P S . (5.1)
We now introduce three algorithms to compute the optimal gain g ‹ when U is sa-rectangular and
definable.
As discussed in Section 4.2.3, when U is definable, we have shown that there exists a discount factor
γbw P p0, 1q, such that any policy that is γ-discount optimal for γ P pγbw , 1q is also Blackwell optimal.
Therefore, we can compute Blackwell optimal and average optimal policies by solving a sequence of
discounted RMDPs with discount factors increasing to 1. An immediate consequence is the following
theorem.
Theorem 5.1 Let U be a definable compact convex uncertainty set and let pv t qtě1 and pπ t qtPN be the
iterates of Algorithm 1. Then pv t qtě1 converges to g ‹ P RS , and π t is both Blackwell optimal and average
optimal for t P N large enough.
Unfortunately, computing the Blackwell discount factor γbw appears challenging, so that we are not able
to provide a convergence rate for Algorithm 1. When U is polyhedral, Grand-Clément and Petrik [2023]
obtains an upper bound on γbw , but this upper bound is too close to 1 to be of practical use. We also
note that Algorithm 1 requires solving a robust MDP at every iteration, which may be computationally
intensive. We now describe two algorithms with a lower per-iteration complexity.
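To fix ideas, here is a minimal Python sketch of the structure of Algorithm 1 (whose pseudocode appears as a separate float). The routine solve_discounted_rmdp and the schedule γt “ pt ` 1q{pt ` 2q are assumptions made for the illustration: any solver for discounted sa-rectangular RMDPs and any sequence of discount factors increasing to 1 can be plugged in.

    def algorithm_1(solve_discounted_rmdp, n_iterations=50):
        """Sketch of Algorithm 1: solve discounted RMDPs for discount factors increasing to 1.

        `solve_discounted_rmdp(gamma)` is an assumed oracle returning the optimal robust value
        function (an array over states) and a discount optimal policy for discount factor gamma.
        """
        values, policies = [], []
        for t in range(n_iterations):
            gamma_t = (t + 1) / (t + 2)               # discount factors increasing to 1
            v_gamma, pi_gamma = solve_discounted_rmdp(gamma_t)
            values.append((1.0 - gamma_t) * v_gamma)  # normalized values approximate the optimal gain
            policies.append(pi_gamma)
        return values, policies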
We now introduce two value iteration algorithms to compute the optimal worst-case average return
when the uncertainty set U is definable.
Algorithm based on increasing horizon. We start with Algorithm 2, which computes the optimal
value functions of a finite-horizon RMDP as the horizon increases to `8. The following theorem shows
that Algorithm 2 is correct.
Theorem 5.2 Let U be a definable compact convex uncertainty set and let pv t qtě1 be the iterates of Algorithm 2. Then pp1{tq v t qtě1 converges to g ‹ P RS .
Proof. When U is a definable uncertainty set, the same steps as in Proposition 4.19 show that the robust Bellman operator (2.3) is definable. We also recall that we can cast a sa-rectangular RMDP as a perfect information SG. The authors in Bolte et al. [2015] show that when the operator of an SG is definable, then pp1{tq v t qtě1 admits a limit and that this limit coincides with lim_{γÑ1} p1 ´ γ q v_γ^‹ (see corollary 4, point (i), in Bolte et al. [2015]). We have shown in the third step of the proof of Theorem 4.6 that lim_{γÑ1} p1 ´ γ q p_0^J v_γ^‹ is equal to max_{πPΠSD} min_{P PU} Ravg pπ, P q. By definition of g ‹ , this shows that the limit of pp1{tq v t qtě1 is g ‹ , which concludes the proof. ˝
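As an illustration, here is a minimal Python sketch of Algorithm 2 (whose pseudocode appears as a separate float). The oracle robust_bellman is an assumption: it evaluates the robust Bellman operator (2.3) without discounting, i.e., it maps v to the vector with components max_{aPA} min_{psa PUsa} p_{sa}^J prsa ` v q.

    import numpy as np

    def algorithm_2(robust_bellman, n_states, n_iterations=50):
        """Sketch of Algorithm 2: finite-horizon robust value iteration.

        `robust_bellman(v)` is an assumed oracle applying the undiscounted robust Bellman
        operator to a value vector v of shape (n_states,).
        """
        v = np.zeros(n_states)
        averages = []
        for t in range(1, n_iterations + 1):
            v = robust_bellman(v)        # optimal value of the horizon-t robust problem
            averages.append(v / t)       # (1/t) v^t converges to the optimal gain g*
        return averages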
Algorithm based on increasing discount factor. We now consider Algorithm 3, which builds
upon value iteration for RMDPs with an increasing sequence of discount factors. We have the following
theorem.
Theorem 5.3 Let U be a definable compact convex uncertainty set and let pv t qtě1 be the iterates of
Algorithm 3. Then pv t qtě1 converges to g ‹ P RS .
The proof of Theorem 5.3 is deferred to Appendix M. We only provide the main steps here. The first step is to show the following lemma, which decomposes the suboptimality gap of the current iterates of Algorithm 3.
Lemma 5.4 There exist a period t0 P N and two sequences of non-negative scalars pϵt qtět0 and pδt qtět0 such that
}v t ´ g ‹ }8 ď ϵt ` δt , @ t ě t0 ,
δt`1 “ γt δt ` et , (5.2)
et “ }p1 ´ γt`1 q v_{γ_{t`1}}^{π^B ,U} ´ p1 ´ γt q v_{γ_t}^{π^B ,U} }8 . (5.3)
The difficult part of the proof is to prove that the sequence pδt qtět0 converges to 0. To show lim_{tÑ`8} δt “ 0, we use the inductive step (5.2) and γt “ ωt {ωt`1 to show
δt “ p1{ωt`1 q ř_{j“t0}^{t} ωj`1 ej , @ t ě t0 . (5.4)
Equation (5.4) reveals that the expression of δt resembles a weighted Cesaro average of the sequence
pet qtět0 . It is known that the Cesaro average of a converging sequence also converges to the same limit;
we prove Theorem 5.3 by extending this result to weighted Cesaro averages and by invoking existing
results from SGs to obtain that limtÑ`8 et “ 0.
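For completeness, here is a minimal Python sketch of Algorithm 3 with the weights ωt “ t ` 1 used in our experiments, so that γt “ ωt {ωt`1 “ pt ` 1q{pt ` 2q. The oracle robust_bellman_discounted is an assumption: it applies the robust Bellman operator to the normalized iterate, i.e., it maps pv, γ q to the vector with components max_{aPA} min_{psa PUsa} p_{sa}^J pp1 ´ γ q rsa ` γ v q.

    import numpy as np

    def algorithm_3(robust_bellman_discounted, n_states, n_iterations=50):
        """Sketch of Algorithm 3: robust value iteration with increasing discount factors.

        `robust_bellman_discounted(v, gamma)` is an assumed oracle applying the robust Bellman
        operator with discount gamma to the normalized value vector v of shape (n_states,).
        """
        omega = lambda t: t + 1                   # weights omega_t, so gamma_t = omega_t / omega_{t+1}
        v = np.zeros(n_states)
        iterates = []
        for t in range(n_iterations):
            gamma_t = omega(t) / omega(t + 1)     # = (t + 1) / (t + 2), increasing to 1
            v = robust_bellman_discounted(v, gamma_t)
            iterates.append(v)                    # v^t converges to the optimal gain g*
        return iterates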
Comparison with previous work. We now compare Theorem 5.3 with related results from the
RMDP literature. The authors in Tewari and Bartlett [2007] prove Theorem 5.3 with ωt “ t ` 1, i.e.,
with γt “ pt ` 1q{pt ` 2q and for the case of sa-rectangular uncertainty sets with ℓ8 -balls. Tewari and
Bartlett [2007] show the Lipschitz continuity of the robust value functions to simplify the term et from Lemma 5.4. In particular, for sa-rectangular polyhedral U , for t large enough, there exists a constant L ą 0 such that
et “ }p1 ´ γt`1 q v_{γ_{t`1}}^{π^B ,U} ´ p1 ´ γt q v_{γ_t}^{π^B ,U} }8 ď L ¨ pγt`1 ´ γt q , (5.5)
and lemma 8 in Tewari and Bartlett [2007] shows that a sequence pδt qtě0 satisfying the recursion (5.2) with et bounded as in (5.5) converges to 0.
The authors in Wang et al. [2023] consider Algorithm 3 with γt “ pt ` 1q{pt ` 2q for the case of general
sa-rectangular uncertainty set (potentially non-polyhedral), with the assumption that the Markov chain
associated with any transition probabilities and any policy is unichain. The analysis of Algorithm 3 in
Wang et al. [2023] follows the same lines as the proof in Tewari and Bartlett [2007], and the authors
conclude by proving that the Lipschitzness property (5.5) holds for unichain sa-rectangular RMDPs.
However, we do not assume that the RMDP is unichain, and in all generality the normalized robust
value function γ ÞÑ p1 ´ γ qvγπ,U may not be Lipschitz continuous for a fixed π P ΠS , as we prove in
Appendix L. We sidestep this issue by leveraging existing results from SGs [Bolte et al., 2015] and
elementary results from real analysis.
Stopping criterion and complexity. We have proved that Algorithm 1, Algorithm 2, and Algorithm
3 converge to an optimal gain. However, we would like to emphasize that we do not introduce a stopping
criterion for these algorithms. Even for the simpler case of nominal MDPs, in all generality, there is no
known stopping criterion for value iteration for the average return criterion, e.g. Ashok et al. [2017] and
section 9.4 in Puterman [2014]. We leave finding a stopping criterion as an open question in this paper,
noting that this is not addressed in previous papers on average-reward RMDPs [Tewari and Bartlett,
2007, Wang et al., 2023].
We also comment on the complexity of computing an average-optimal policy (and not just computing
an optimal gain). For sa-rectangular RMDPs, this may be a difficult task. Indeed, we described the
connections between sa-rectangular RMDPs with average return and mean-payoff perfect information
stochastic games in Section 2.2. There is no known polynomial-time algorithm for solving this class of
SGs [Andersson and Miltersen, 2009, Condon, 1992, Zwick and Paterson, 1996] even after more than six
decades since their introduction in the game theory literature [Gillette, 1957], and this is regarded as one
of the major open questions in algorithmic game theory. For this reason, in this paper, we have focused
on the properties of average and Blackwell optimal policies and we leave designing polynomial-time
algorithms for solving average return sa-rectangular RMDPs as an open question.
6. Numerical experiments
In this section, we compare the empirical performances of the algorithms introduced in Section 5.
Our goal is to test the practical convergence of Algorithm 1, Algorithm 2 and Algorithm 3 on various
sa-rectangular RMDP instances.
Test instances. We consider three different nominal MDP instances: a machine replacement prob-
lem [Delage and Mannor, 2010, Wiesemann et al., 2013], a forest management instance [Cordwell et al.,
2015, Possingham and Tuck, 1997], and an instance inspired by healthcare [Goyal and Grand-Clément,
2022]. In the machine replacement problem, the goal is to compute a replacement and repair schedule
for a line of machines. In the forest management instance, a forest grows at every period, and the goal
is to balance the revenue from wood cutting and the risk of wildfire. In the healthcare instance, the
goal is to plan the treatment of a patient, avoiding the mortality state while reducing the invasiveness
of the treatment. We refer to the nominal transition probabilities in these instances as Pnom and we
represent the RMDP instances in Appendix N.1.
Construction of the uncertainty sets. For each of these three RMDP instances, we consider two types of uncertainty sets. The first uncertainty set is based on box inequalities:
U_{sa}^{box} “ tp P ∆pSq | p_{sa}^{low} ď p ď p_{sa}^{up} u, @ ps, aq P S ˆ A,
with p_{sa}^{low} , p_{sa}^{up} P r0, 1s^S two vectors such that p_{sa}^{low} ď pnom,sa ď p_{sa}^{up} for ps, aq P S ˆ A. The second uncertainty set is based on the ℓ2 -distance to the nominal transition probabilities:
U_{sa}^{ℓ2} “ tp P ∆pSq | }p ´ pnom,sa }2 ď αu, @ ps, aq P S ˆ A,
where α ą 0 is a scalar. Box inequalities have been used in applications of RMDPs in healthcare [Goh
et al., 2018], while uncertainty based on ℓ2 -distance can be seen as conservative approximations of sets
based on relative entropy [Iyengar, 2005]. Note that both U ℓ2 and U box are definable sets, with U box
being polyhedral. Additionally, there exist efficient algorithms to evaluate the robust Bellman operator
as in (2.3) for U box (e.g. proposition 3 in Goh et al. [2018]). For U ℓ2 , evaluating the robust Bellman
operator requires solving a convex program. Still, we can obtain a closed-form expression assuming that
α ą 0 is sufficiently small, as we detail in Appendix N.2. We choose a uniform initial distribution for
all instances.
Empirical setup. We run Algorithm 1, Algorithm 2 and Algorithm 3 for T “ 50 iterations, on the
Machine, Forest, and Healthcare RMDP instances with S “ 100 states. For U ℓ2 we choose a radius of
α “ 0.05 and for U box we choose the upper and lower bound on each coefficient to allow for 5% deviations
32
from the nominal distributions. For Algorithm 3, we choose ωt “ t ` 1. We provide more details in
Appendix N.2. To implement Algorithm 1, we need to solve a robust MDP at every iteration. To do so,
we implement the two-player strategy iteration [Hansen et al., 2013] with warm-starts, see more details
in Appendix N.3. We implement all algorithms in Python 3.8.8.
Numerical results. We first present the convergence rate as a function of the number of iterations of
all our algorithms in Figure 4 for U ℓ2 and in Figure 5 for U box . Algorithm 1, Algorithm 2 and Algorithm
3 appear to have comparable convergence speed on our three RMDP instances, despite Algorithm 1
solving a discounted RMDP at each iteration, while Algorithm 2 and Algorithm 3 only require evaluating the robust Bellman operator. We also present the total computation time as a
function of the number of iterations in Figure 6-7. We note that Algorithm 1 is much slower than
Algorithm 2 and Algorithm 3, even with warm-start. Indeed, at every iteration, Algorithm 1 requires
running two-player strategy iteration for solving a discounted RMDP, whereas the other two algorithms
only require evaluating the Bellman operator. We also note that running our algorithms for U ℓ2 is faster
than for U box , since the Bellman update for U box requires sorting a vector of size |S| and then computing a maximum, whereas for U ℓ2 we only need to compute a maximum; see the implementation details in
Appendix N.2.
[Figure 4: value as a function of the number of iterations for the Machine, Healthcare, and Forest instances, for U ℓ2 .]
7. Conclusion
Our paper addresses several important issues in the existing literature and derives the fundamental
properties of average optimal and Blackwell optimal policies. In particular, our work highlights impor-
tant distinctions between the widely-studied framework of discounted RMDPs and the less-studied
frameworks of RMDPs with average optimality and Blackwell optimality. We view the non-existence of
[Figure 5: value as a function of the number of iterations for the three instances, for U box ; Figure 6: total computation time as a function of the number of iterations for U ℓ2 .]
Figure 7 Total computation time as a function of the number of iterations for various RMDP instances ((a) Machine, (b) Healthcare, (c) Forest), for U box based on box inequalities.
stationary average optimal policies for s-rectangular RMDPs and the non-existence of Blackwell opti-
mal policies for sa-rectangular RMDPs (in all generality) as surprising results. The notion of definable
uncertainty sets is also of independent interest, characterizing the cases where the value functions are
well-behaved and iterative algorithms asymptotically converge to the optimal gain in the sa-rectangular
34
case. Finally, our work opens new research avenues for RMDPs. Among them, deriving an efficient
algorithm for computing an average optimal policy for sa-rectangular RMDPs appears crucial, but
this may be difficult since a similar question remains open after several decades in the literature on
SGs. Important other research questions include studying in more detail the particular case of irre-
ducible instances, which considerably simplifies the main technical challenges, and studying the case
of distributionally robust MDPs. Understanding whether Markovian policies are sufficient for average-return s-rectangular RMDPs is also a promising next direction.
References
Marianne Akian, Stéphane Gaubert, Julien Grand-Clément, and Jérémie Guillaud. The operator
approach to entropy games. Theory of Computing Systems, 63(5):1089–1130, 2019.
Daniel Andersson and Peter Bro Miltersen. The complexity of solving stochastic games on graphs. In
International Symposium on Algorithms and Computation, pages 112–121. Springer, 2009.
Pranav Ashok, Krishnendu Chatterjee, Przemyslaw Daca, Jan Křetı́nskỳ, and Tobias Meggendorfer.
Value iteration for long-run average reward in Markov decision processes. In Computer Aided Verifica-
tion: 29th International Conference, CAV 2017, Heidelberg, Germany, July 24-28, 2017, Proceedings,
Part I, pages 201–221. Springer, 2017.
Nicole Bäuerle and Ulrich Rieder. Markov decision processes with applications to finance. Springer
Science & Business Media, 2011.
Jonathan Baxter and Peter L Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
Bahram Behzadian, Marek Petrik, and Chin Pang Ho. Fast algorithms for ℓ8 -constrained s-rectangular robust MDPs. Advances in Neural Information Processing Systems, 34:25982–25992, 2021.
Casey C Bennett and Kris Hauser. Artificial intelligence framework for simulating clinical decision-
making: A Markov decision process approach. Artificial intelligence in medicine, 57(1):9–19, 2013.
K-J Bierth. An expected average reward criterion. Stochastic processes and their applications, 26:
123–140, 1987.
David Blackwell and Tom S Ferguson. The big match. The Annals of Mathematical Statistics, 39(1):
159–163, 1968.
Jérôme Bolte, Stéphane Gaubert, and Guillaume Vigeral. Definable zero-sum stochastic games. Math-
ematics of Operations Research, 40(1):171–191, 2015.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and
Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
35
Jongseong Chae, Seungyul Han, Whiyoung Jung, Myungsik Cho, Sungho Choi, and Youngchul Sung.
Robust imitation learning against variations in environment dynamics. In International Conference
on Machine Learning, pages 2828–2852. PMLR, 2022.
Alla Dita Raza Choudary and Constantin P Niculescu. Real analysis on intervals. Springer, 2014.
Anne Condon. The complexity of stochastic games. Information and Computation, 96(2):203–224,
1992.
Steven Cordwell, Yasser Gonzalez, and Theja Tulabandhula. Markov Decision Process (MDP) toolbox
for python. https://github.com/sawcordwell/pymdptoolbox, 2015.
Michel Coste. An introduction to o-minimal geometry. Istituti editoriali e poligrafici internazionali
Pisa, 2000.
Erick Delage and Shie Mannor. Percentile optimization for Markov decision processes with parameter
uncertainty. Operations research, 58(1):203–213, 2010.
Yue Deng, Feng Bao, Youyong Kong, Zhiquan Ren, and Qionghai Dai. Deep direct reinforcement
learning for financial signal representation and trading. IEEE transactions on neural networks and
learning systems, 28(3):653–664, 2016.
Vektor Dewanto, George Dunn, Ali Eshragh, Marcus Gallagher, and Fred Roosta. Average-reward
model-free reinforcement learning: a systematic review and literature mapping. arXiv preprint
arXiv:2010.08920, 2020.
Juan José Egozcue, Vera Pawlowsky-Glahn, Glòria Mateu-Figueras, and Carles Barcelo-Vidal. Isometric
logratio transformations for compositional data analysis. Mathematical geology, 35(3):279–300, 2003.
Eugene A Feinberg and Adam Shwartz. Handbook of Markov decision processes: methods and applica-
tions, volume 40. Springer Science & Business Media, 2012.
Jerzy Filar and Koos Vrieze. Competitive Markov decision processes. Springer Science & Business
Media, 2012.
Dean Gillette. Stochastic games with zero stop probabilities. Contributions to the Theory of Games,
3:179–187, 1957.
Hugo Gimbert and Edon Kelmendi. Submixing and shift-invariant stochastic games. International
Journal of Game Theory, pages 1–36, 2023.
Robert Givan, Sonia Leach, and Thomas Dean. Bounded parameter Markov decision processes. In
European Conference on Planning, pages 234–246. Springer, 1997.
Joel Goh, Mohsen Bayati, Stefanos A Zenios, Sundeep Singh, and David Moore. Data uncertainty in
Markov chains: Application to cost-effectiveness analyses of medical innovations. Operations Research,
66(3):697–715, 2018.
36
Vineet Goyal and Julien Grand-Clément. Robust Markov decision processes: Beyond rectangularity.
Mathematics of Operations Research, 2022.
Julien Grand-Clément and Christian Kroer. Conic blackwell algorithm: Parameter-free convex-concave
saddle-point solving. Advances in Neural Information Processing Systems, 34:9587–9599, 2021.
Julien Grand-Clement and Christian Kroer. First-order methods for wasserstein distributionally robust
MDP. In International Conference on Machine Learning, pages 2010–2019. PMLR, 2021.
Julien Grand-Clément and Christian Kroer. Solving optimization problems with blackwell approacha-
bility. Mathematics of Operations Research, 2023.
Julien Grand-Clément and Marek Petrik. On the convex formulations of robust Markov decision pro-
cesses. arXiv preprint arXiv:2209.10187, 2022.
Julien Grand-Clément and Marek Petrik. Reducing blackwell and average optimality to discounted
MDPs via the blackwell discount factor. arXiv preprint arXiv:2302.00036, 2023.
Julien Grand-Clément, Carri W Chan, Vineet Goyal, and Gabriel Escobar. Robustness of proactive
intensive care unit transfer policies. Operations Research, 71(5):1653–1688, 2023.
Thomas Dueholm Hansen, Peter Bro Miltersen, and Uri Zwick. Strategy iteration is strongly polyno-
mial for 2-player turn-based stochastic games with a constant discount factor. Journal of the ACM
(JACM), 60(1):1–16, 2013.
Chin Pang Ho, Marek Petrik, and Wolfram Wiesemann. Partial policy iteration for l1-robust Markov
decision processes. The Journal of Machine Learning Research, 22(1):12612–12657, 2021.
Chin Pang Ho, Marek Petrik, and Wolfram Wiesemann. Robust phi-divergence MDPs. arXiv preprint
arXiv:2205.14202, 2022.
G. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.
Navdeep Kumar, Esther Derman, Matthieu Geist, Kfir Levy, and Shie Mannor. Policy gradient for
s-rectangular robust Markov decision processes. arXiv preprint arXiv:2301.13589, 2023.
Rida Laraki and Sylvain Sorin. Advances in zero-sum dynamic games. In Handbook of game theory
with economic applications, volume 4, pages 27–93. Elsevier, 2015.
Yann Le Tallec. Robust, risk-sensitive, and data-driven control of Markov decision processes. PhD
thesis, Massachusetts Institute of Technology, 2007.
Arie Leizarowitz. An algorithm to identify and compute average optimal policies in multichain markov
decision processes. Mathematics of Operations Research, 28(3):553–586, 2003.
Mengmeng Li, Tobias Sutter, and Daniel Kuhn. Policy gradient algorithms for robust MDPs with
non-rectangular uncertainty sets. arXiv preprint arXiv:2305.19004, 2023.
37
Yan Li, Tuo Zhao, and Guanghui Lan. First-order policy optimization for robust Markov decision
process. arXiv preprint arXiv:2209.10579, 2022.
S. Mannor, O. Mebel, and H. Xu. Robust MDPs with k-rectangular uncertainty. Mathematics of
Operations Research, 41(4):1484–1509, 2016.
Jean-François Mertens, Sylvain Sorin, and Shmuel Zamir. Repeated games, volume 55. Cambridge
University Press, 2015.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wier-
stra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint
arXiv:1312.5602, 2013.
Abraham Neyman and Sylvain Sorin. Stochastic games and applications, volume 570. Springer Science & Business Media, 2003.
A. Nilim and L. El Ghaoui. Robust control of Markov decision processes with uncertain transition
probabilities. Operations Research, 53(5):780–798, 2005.
Kishan Panaganti and Dileep Kalathil. Sample complexity of robust reinforcement learning with a
generative model. In International Conference on Artificial Intelligence and Statistics, pages 9582–
9602. PMLR, 2022.
Hugh Possingham and G Tuck. Application of stochastic dynamic programming to optimal fire manage-
ment of a spatially structured threatened species. In Proceedings International Congress on Modelling
and Simulation, MODSIM, pages 813–817, 1997.
Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley
& Sons, 2014.
Jérôme Renault. A tutorial on zero-sum stochastic games. arXiv preprint arXiv:1905.06577, 2019.
J.K. Satia and R.L. Lave. Markov decision processes with uncertain transition probabilities. Operations
Research, 21(3):728–740, 1973.
Lloyd S Shapley. Stochastic games. Proceedings of the national academy of sciences, 39(10):1095–1100,
1953.
Sylvain Sorin. A first course on zero-sum repeated games, volume 37. Springer Science & Business
Media, 2002.
Lauren N Steimle and Brian T Denton. Markov decision processes for screening and treatment of
chronic diseases. Markov Decision Processes in Practice, pages 189–222, 2017.
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
Ambuj Tewari and Peter L Bartlett. Bounded parameter Markov decision processes with average reward
criterion. In International Conference on Computational Learning Theory, pages 263–277. Springer,
2007.
38
Lou Van Den Dries. O-minimal structures and real analytic geometry. Current developments in math-
ematics, 1998(1):105–152, 1998.
Lou Van den Dries and Chris Miller. Geometric categories and o-minimal structures. 1996.
Luca Viano, Yu-Ting Huang, Parameswaran Kamalaruban, Adrian Weller, and Volkan Cevher. Robust
inverse reinforcement learning under transition dynamics mismatch. Advances in Neural Information
Processing Systems, 34:25917–25931, 2021.
Qiuhao Wang, Chin Pang Ho, and Marek Petrik. On the convergence of policy gradient in robust
MDPs. arXiv preprint arXiv:2212.10439, 2022.
Yue Wang, Alvaro Velasquez, George Atia, Ashley Prater-Bennette, and Shaofeng Zou. Robust average-
reward Markov decision processes. arXiv preprint arXiv:2301.00858, 2023.
Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust Markov decision processes. Mathematics
of Operations Research, 38(1):153–183, 2013.
Huan Xu and Shie Mannor. Distributionally robust Markov decision processes. Advances in Neural
Information Processing Systems, 23, 2010.
Yuanhui Zhang, L Steimle, and BT Denton. Robust Markov decision processes for medical treatment
decisions. Optimization online, 2017.
Uri Zwick and Mike Paterson. The complexity of mean payoff games on graphs. Theoretical Computer
Science, 158(1-2):343–359, 1996.
second player correspond to convex combinations of extreme points of the convex set U , i.e., to elements
of U . There is no instantaneous payoff when visiting a state in Ω1 , and there is an instantaneous payoff
of p_{sa}^J r_{sa} in a state ps, aq where the second player chooses psa P Usa . Given a state ω P Ω1 , the transition function attributes a probability of 1 to the next state ps, aq P Ω2 if the first player chooses action a P A and ω “ s for some s P S . Given a state ω P Ω2 , the transition function attributes a probability of p_{sas1} to the next state s1 P Ω1 if ω “ ps, aq and the second player chooses action psa . We represent this
construction in Figure 8.
Figure 8 A sa-rectangular robust MDP as a perfect information game. The decision-maker controls the states s P S and
chooses an action a P A. The adversary controls the states ps, aq P S ˆ A and chooses transition probabilities
psa P Usa . The pairs above the arcs represent the transition probabilities and instantaneous payoffs.
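To make the construction concrete, the following Python sketch enumerates the two state spaces and the deterministic dynamics of the associated perfect information game; the container names and the representation of the adversary's action sets Usa are hypothetical choices made for the illustration.

    from dataclasses import dataclass, field

    @dataclass
    class PerfectInformationSG:
        """Sketch of the perfect information stochastic game built from an sa-rectangular RMDP.

        States in `omega_1` (copies of S) are controlled by the decision-maker; states in
        `omega_2` (pairs (s, a)) are controlled by the adversary, whose actions at (s, a) are
        the distributions p_sa in U_sa.
        """
        states: list    # S
        actions: list   # A
        rewards: dict   # rewards[(s, a)] = list of instantaneous rewards (r_{s a s'})_{s'}
        omega_1: list = field(default_factory=list)
        omega_2: list = field(default_factory=list)

        def __post_init__(self):
            self.omega_1 = list(self.states)
            self.omega_2 = [(s, a) for s in self.states for a in self.actions]

        def decision_maker_move(self, s, a):
            # from s in Omega_1, playing a leads to (s, a) in Omega_2 with probability 1 and payoff 0
            return (s, a)

        def adversary_move(self, s, a, p_sa):
            # playing p_sa in U_sa at (s, a) yields payoff p_sa^T r_sa and moves to s' with probability p_sa[s']
            payoff = sum(p * r for p, r in zip(p_sa, self.rewards[(s, a)]))
            transition = {s_next: p for s_next, p in zip(self.states, p_sa)}
            return payoff, transition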
Proof of Proposition 2.2. Since U Ă UH , we always have supπPΠH inf P PUH Rγ pπ, P q ď
supπPΠH minP PU Rγ pπ, P q. We now prove the converse inequality. We have
where (B.1) follows from weak duality, (B.2) follows from the inner maximization problem being an
MDP, for which stationary optimal policies always exist [Puterman, 2014], (B.3) follows from strong
duality for s-rectangular RMDPs, (B.4) follows from the inner minimization being an MDP (the adversarial MDP; see Goyal and Grand-Clément [2022], Ho et al. [2021]), and the last inequality follows from ΠS Ă ΠH . ˝
Appendix D: Some cases where the average worst-case transition probabilities exist
We have the following proposition.
Proposition D.1 Let U be an s-rectangular compact convex robust MDP instance. Let π P ΠS . Then
the infimum in inf P PU Ravg pπ, P q is attained under any of the following assumptions:
1. For any pπ, P q P ΠS ˆ U , the Markov chain induced by pπ, P q over S is unichain.
2. For any pπ, P q P ΠS ˆ U , the Markov chain induced by pπ, P q over S is irreducible.
3. The uncertainty set U is polyhedral.
Proof. Under the first two assumptions, the average return pπ, P q ÞÑ Ravg pπ, P q is continuous on
ΠS ˆ U (exercise 3, section 5.6, Filar and Vrieze [2012] and lemma 5.1.4, Filar and Vrieze [2012]). From
Weierstrass theorem, P ÞÑ Ravg pπ, P q attains its minimum over the compact set U .
Under the assumption that U is polyhedral, we know that the worst-case transition probabilities (for
a given π P ΠS ) can be chosen in the set of extreme points of U . Therefore, we can replace U by a finite
set Uf (its set of extreme points). Recall that the optimization problem inf P PU Ravg pπ, P q can be seen as
the adversarial MDP [Ho et al., 2021]. In this MDP, the set of actions corresponds to U . If we replace
U by Uf , we obtain an MDP with finitely many actions. In this case, it is a classical result (see for
instance chapter 8 and chapter 9 in Puterman [2014]) that a stationary deterministic policy is optimal
for the average return, so that the infimum is attained in inf P PU Ravg pπ, P q. ˝
Remark E.1 The authors in Wang et al. [2023] consider a statement equivalent to Lemma 3.3 for
the case of Markovian adversary in unichain sa-rectangular RMDPs, see theorem 9 in Wang et al.
[2023]. In particular, Wang et al. [2023] consider a stationary policy π P ΠS and study the worst-case
of the average return lim_{T Ñ`8} E_{π,P} [ p1{pT ` 1qq ř_{t“0}^{T} r_{s_t a_t s_{t`1}} | s_0 „ p_0 ] over the set of Markovian adversaries
UM :“ U N . However, we have shown in Example 3.1 that this limit may not exist when P P UM , an issue
that is overlooked in Wang et al. [2023]. We conclude by noting that Lemma 3.3 holds without the
unichain assumption and that s-rectangularity is sufficient (instead of sa-rectangularity).
inf_{P PU} max_{πPΠSD} Ravg pπ, P q ď inf_{P PU} sup_{πPΠH} Ravg pπ, P q “ sup_{πPΠH} inf_{P PU} Ravg pπ, P q “ max_{πPΠSD} inf_{P PU} Ravg pπ, P q
where the first inequality follows from ΠSD Ă ΠH and the last equality follows from Theorem 3.4. We
now dedicate our efforts to proving that (3.4) holds.
First, when U is polyhedral, we have
inf_{P PU} sup_{πPΠH} Ravg pπ, P q “ sup_{πPΠH} inf_{P PUH} Ravg pπ, P q ď sup_{πPΠH} inf_{P PU} Ravg pπ, P q
where the equality follows from (3.3) and the inequality follows from U Ă UH . This shows that (3.4)
holds when U is polyhedral.
We now turn to showing (3.4) when U is a convex compact sa-rectangular uncertainty set. Let ϵ ą 0
and consider the set Uf defined similarly as in the second step of the proof of Theorem 3.4. We have
where (F.1) follows from Uf Ă U , (F.2) follows from the first two steps of this proof and Uf having only
finitely many extreme points, (F.3) follows from Theorem 3.4, (F.4) follows from the construction of
Uf , and (F.5) follows from Theorem 3.4 again. Since the inequalities above are true for any choice of
ϵ ą 0, we conclude that supπPΠH inf P PU Ravg pπ, P q “ inf P PU supπPΠH Ravg pπ, P q. ˝
and R̂avg pπ, P q “ Ravg pπ, P q when pπ, P q P ΠS ˆ U . From this, we have
max_{πPΠSD} inf_{P PU} R̂avg pπ, P q “ max_{πPΠSD} inf_{P PU} Ravg pπ, P q “ sup_{πPΠH} inf_{P PU} Ravg pπ, P q ě sup_{πPΠH} inf_{P PU} R̂avg pπ, P q
where the first equality holds from the optimization variables pπ, P q being in ΠSD ˆ U , the second equal-
ity holds from the first step of the proof, and the inequality follows from (G.1). This shows that when
R̂avg is defined as in (3.6), (3.7), or (3.8), we have supπPΠH inf P PU R̂avg pπ, P q “ maxπPΠSD inf P PU R̂avg pπ, P q.
Now since R̂avg and Ravg coincide over ΠS ˆ U , we can conclude that maxπPΠSD inf P PU R̂avg pπ, P q “ maxπPΠSD inf P PU Ravg pπ, P q.
Part 2. We now turn to proving that the following strong duality results hold:
where (G.4) follows from R̂avg pπ, P q ď Ravg pπ, P q for pπ, P q P ΠH ˆ U , (G.5) follows from Theorem 3.5,
(G.6) follows from Theorem 3.4, (G.7) follows from R̂avg pπ, P q “ Ravg pπ, P q for pπ, P q P ΠS ˆ U , and
(G.8) follows from USD Ă UH . Therefore, supπPΠH inf P PU R̂avg pπ, P q “ inf P PU supπPΠH R̂avg pπ, P q.
Part 3. Finally, we prove that Theorem 3.6 still holds for R̂avg , i.e. we show that
Note that the proof of Theorem 3.6 only relies on Theorem 3.4, Theorem 3.5 and Lemma 3.3. We have proved above that Theorem 3.4 and Theorem 3.5 still hold when we replace Ravg by R̂avg . The authors in Bierth [1987] also prove Lemma 3.3 for R̂avg as in (3.6), (3.7), or (3.8). Therefore, the proof of Theorem
3.6 follows verbatim when replacing Ravg by R̂avg . ˝
for D1 and D2 two concave functions such that D1 pαq ď α, D2 pαq ď α. We will specify D1 and D2
later. For a vector p1 ´ α, β, α ´ β q in the simplex, we can interpret α as the probability to leave state
s0 and β as the probability to go to state s1 . This is represented in Figure 1a. Since state s0 is the
only non-absorbing state, we can identify policies and the action (a1 or a2 ) chosen at s0 . Since we are
considering a sa-rectangular robust MDP, for every discount factor γ P p0, 1q there is a deterministic
stationary optimal policy in ta1 , a2 u. The worst-case returns of a1 and a2 are defined as, for i P t1, 2u,
Rγ pai q :“ min_{P PU} Rγ pai , P q “ min t 1{p1 ´ γ q ´ γ β{p1 ´ γ ` γαq | α P r0, 1s, β P r0, αs, p1 ´ α, β, α ´ β q P U_{sa_i} u .
From the definition of U_{sa_i} for i P t1, 2u, we can write Rγ pai q as
Rγ pai q “ 1{p1 ´ γ q ´ max_{αPr0,1s} γ Di pαq{p1 ´ γ ` γαq .
We will construct two sequences pγk qkPN and pγk1 qkPN such that for any k P N, a1 is the unique optimal
policy for γ “ γk and a2 is the unique optimal policy for γ “ γk1 , and γk Ñ 1, γk1 Ñ 1 as k Ñ `8.
First step. We first consider the following function D : r0, 1s Ñ r0, 1s defined as Dpαq “ αp1 ´ αq, and
consider the following optimization program:
max_{αPr0,1s} γ Dpαq{p1 ´ γ ` γαq “ max_{αPr0,1s} γ αp1 ´ αq{p1 ´ γ ` γαq .
The derivative of α ÞÑ αp1 ´ αq{p1 ´ γ ` γαq is p p1 ´ 2αqp1 ´ γ ` γαq ´ αp1 ´ αqγ q { p1 ´ γ ` γαq^2 “ p 1 ´ γ ´ 2αp1 ´ γ q ´ γα^2 q { p1 ´ γ ` γαq^2 , which shows that the maximum of α ÞÑ αp1 ´ αq{p1 ´ γ ` γαq is attained at α‹ pγ q “ p p1 ´ γ q^{1{2} ´ p1 ´ γ q q { γ . Note that γ ÞÑ α‹ pγ q is a decreasing function on p0, 1s that we can continuously extend to r0, 1s, with α‹ p1q “ 0 and α‹ p0q “ 1{2.
Second step. We now construct two piece-wise affine functions D1 , D2 , that are (pointwise) smaller
than D : α ÞÑ αp1 ´ αq on r0, 1s , and that intersect infinitely many times on any interval p0, α0 q for
any α0 P p0, 1{2q. In particular, we have the following lemma. We call breakpoints of a piecewise affine
function the points where it changes slopes.
Proof of Lemma H.1. We first show that D1 is a concave, strictly increasing function. For k ě 1 and x P p1{p2k ` 2q, 1{p2kqq, there exist a_{1,k} , b_{1,k} P R such that D1 pxq “ a_{1,k} x ` b_{1,k} . In particular, we have
a_{1,k} “ p D1 p1{p2kqq ´ D1 p1{p2k ` 2qq q { p 1{p2kq ´ 1{p2k ` 2q q “ p p1{p2kqqp1 ´ 1{p2kqq ´ p1{p2k ` 2qqp1 ´ 1{p2k ` 2qq q { p 1{p2kq ´ 1{p2k ` 2q q “ p2k^2 ´ 1q{p2k^2 ` 2kq .
We also have a_{1,k} p1{p2kqq ` b_{1,k} “ D1 p1{p2kqq, so that b_{1,k} “ 1{p4kpk ` 1qq. Note that k ÞÑ p2k^2 ´ 1q{p2k^2 ` 2kq is an increasing function, with lim_{kÑ`8} p2k^2 ´ 1q{p2k^2 ` 2kq “ 1. Therefore, we have shown that k ÞÑ a_{1,k} is an increasing function. Since D1 is continuous by construction, we conclude that D1 is a concave, increasing function. The same approach shows that D2 is concave and strictly increasing.
We now show that the two functions intersect infinitely many times on any interval p0, α0 q with α0 ą 0. Let k ě 1. Then there is a zero of x ÞÑ D1 pxq ´ D2 pxq on any interval p1{pk ` 1q, 1{kq. From the concavity of x ÞÑ x^{1{2} , we have that D2 p1{p2k ` 1qq ´ D1 p1{p2k ` 1qq ě 0 and D2 p1{p2k ` 2qq ´ D1 p1{p2k ` 2qq ď 0. Since D1 , D2 are continuous, there exists a zero of x ÞÑ D1 pxq ´ D2 pxq on any interval p1{p2k ` 2q, 1{p2k ` 1qq for k ě 1. The same approach shows that there exists a zero of x ÞÑ D1 pxq ´ D2 pxq on any interval p1{p2k ` 1q, 1{p2kqq for k ě 1. ˝
Third step. We now construct a sequence of discount factors pγk qkPN such that γk Ñ 1 and a1 is the unique optimal policy for γ “ γk for any k P N. Let k P N and γk P p0, 1q such that α‹ pγk q “ 1{p2k ` 1q. From the strict monotonicity of γ ÞÑ α‹ pγ q we know that γk Ñ 1 as k Ñ `8. By construction,
Rγk pa2 q “ 1{p1 ´ γk q ´ max_{αPr0,1s} γk D2 pαq{p1 ´ γk ` γk αq
ě 1{p1 ´ γk q ´ max_{αPr0,1s} γk Dpαq{p1 ´ γk ` γk αq (H.1)
“ 1{p1 ´ γk q ´ γk Dp1{p2k ` 1qq{p1 ´ γk ` γk {p2k ` 1qq (H.2)
“ 1{p1 ´ γk q ´ γk D2 p1{p2k ` 1qq{p1 ´ γk ` γk {p2k ` 1qq (H.3)
where (H.1) follows from D2 pαq ď Dpαq for any α P p0, 1{2q, (H.2) follows from the definition of γk such that α‹ pγk q “ 1{p2k ` 1q, and (H.3) follows from D2 p1{p2k ` 1qq “ Dp1{p2k ` 1qq by construction of D2 as in Lemma H.1. This shows that the maximum in Rγk pa2 q is attained at α‹ pγk q “ 1{p2k ` 1q.
We now show that Rγk pa1 q ą Rγk pa2 q. The function α ÞÑ D1 pαq{p1 ´ γk ` γk αq is continuous on the compact set r0, 1{2s, hence it attains its maximum. We distinguish two cases.
Case 1. Suppose that this maximum is attained at α1 P t1{p2k 1 q | k 1 P Nu. Then D1 pα1 q “ Dpα1 q, and
D1 pα1 q{p1 ´ γk ` γk α1 q “ Dpα1 q{p1 ´ γk ` γk α1 q ă max_{αPr0,1{2s} Dpαq{p1 ´ γk ` γk αq ,
where the strict inequality follows from α‹ pγk q “ 1{p2k ` 1q ‰ α1 .
Case 2. Otherwise, the maximum of α ÞÑ D1 pαq{p1 ´ γk ` γk αq is attained at α1 P r0, 1{2s but α1 R t1{p2k 1 q | k 1 P Nu. In this case we have by construction that D1 pα1 q ă Dpα1 q so that
D1 pα1 q{p1 ´ γk ` γk α1 q ă Dpα1 q{p1 ´ γk ` γk α1 q ď max_{αPr0,1{2s} Dpαq{p1 ´ γk ` γk αq .
Overall, we conclude that Rγk pa1 q ą Rγk pa2 q in both case 1 and case 2. Therefore, we have constructed a sequence pγk qkPN such that Rγk pa1 q ą Rγk pa2 q, @ k ě 1. The construction of a sequence pγk1 qkPN such that Rγk1 pa2 q ą Rγk1 pa1 q, @ k ě 1 is analogous, by choosing γk1 such that α‹ pγk1 q “ 1{p2kq. This shows that there does not exist a stationary deterministic Blackwell optimal policy. Since for any k P N, at γ “ γk there is a unique discount optimal policy and it is stationary deterministic, this shows that there does not exist a Blackwell optimal policy that is history-dependent. This concludes the construction of our counterexample.
Remark H.2 Wang et al. [2023] show the existence of a stationary deterministic Blackwell optimal
policy for sa-rectangular RMDPs, under two conditions: (a) the Markov chains induced by any pair of
policy and transition probabilities are unichain, (b) there is a unique average optimal policy. The second
assumption precludes two value functions from intersecting infinitely often as γ Ñ 1, which is exactly
the problematic behavior in our counterexample. In fact, our counterexample violates both assumptions:
there are two absorbing states (which violates the unichain assumption), and actions a1 and a2 are
average optimal.
Assume a fixed ϵ ą 0. For any discount factor γ P p0, 1q, consider the set tRγ pπ, P_{ϵ{2}^{π} q | π P ΠSD u. This set is finite since ΠSD is finite. We then define π̂γ P arg max_{πPΠSD} Rγ pπ, P_{ϵ{2}}^{π} q. Since π̂γ P ΠSD for all γ P p0, 1q and ΠSD is finite, we can choose a sequence of discount factors pγn qně1 such that for all n P N, we have π̂γn “ π for some fixed π P ΠSD and γn Ñ 1 as n Ñ `8 (see for instance lemma F.1 in Goyal and Grand-Clément [2022]). Next, we show that the policy π is ϵ-Blackwell optimal.
Indeed, suppose that π is not ϵ-Blackwell optimal: we can construct a sequence of discount factors
´ ¯
π,U
pγn1 qnPN P p0, 1qN and a sequence of states psn qnPN P S N such that p1 ´ γn1 q vγ‹,U
1 ,s ´ vγ 1 ,s
n n n n
ą ϵ, @ n ě 1.
and γn1 Ñ 1 as n Ñ `8. Since for each γn1 the optimal discount policy can be chosen stationary and
´ 1 ¯
deterministic, i.e., in ΠSD , we can find a policy π 1 and a state s P S such that p1 ´ γn1 q vγπn1 ,U
,s ´ v π,U
1 ,s
γn ą
1
π π
ϵ, @ n ě 1. By definition of the transition probabilities Pϵ{2 and Pϵ{2 we have
ˆ 1 π1 ˙
π ,P π 1 ,U π1
p1 ´ γ q vγ,s ϵ{2 ´ vγ,s ď ϵ{2, @ γ ą γϵ{2 ,
´ π,P π ¯
π,U
p1 ´ γ q vγ,s ϵ{2 ´ vγ,s π
ď ϵ{2, @ γ ą γϵ{2 .
But by construction of π, we have π P arg max_{π̂PΠSD} v_{γn}^{π̂, P_{ϵ{2}}^{π̂}} , @ n ě 1. From this we conclude that
v_{γn , s}^{π 1 , P_{ϵ{2}}^{π 1}} ´ v_{γn , s}^{π, P_{ϵ{2}}^{π}} ď 0, @ n ě n0 . (I.2)
Combining (I.1) and (I.2), we obtain that γ ÞÑ v_{γ, s}^{π 1 , P_{ϵ{2}}^{π 1}} ´ v_{γ, s}^{π, P_{ϵ{2}}^{π}} must vanish infinitely many times in
the interval p0, 1q, which is a contradiction since this is a rational function (lemma 10.1.3, Puterman
[2014]). Therefore, for any ϵ ą 0 we can choose πϵ P ΠSD that is ϵ-Blackwell optimal. Since ΠSD is finite,
the same stationary deterministic policy can be chosen ϵ-Blackwell optimal for all ϵ ą 0. ˝
Since π P ΠS and Pϵ P U , we know that limγÑ1 p1 ´ γ qRγ pπ, Pϵ q exists and that
Therefore,
lim sup_{γÑ1} min_{P PU} p1 ´ γ qRγ pπ, P q “ lim inf_{γÑ1} min_{P PU} p1 ´ γ qRγ pπ, P q,
i.e., limγÑ1 minP PU p1 ´ γ qRγ pπ, P q exists. For conciseness, let us write Rπ P R for this limit. Then by
taking the limit as γ Ñ 1 in (J.1), we have
Since the inequality above is true for any ϵ ą 0 and since Ravg pπ, Pϵ q ě inf P PU Ravg pπ, P q, we proved
that inf P PU Ravg pπ, P q ď Rπ . We now want to prove the converse inequality. To do this, let us consider a fixed P P U . We always have p1 ´ γ qRγ pπ, P q ě min_{P 1 PU} p1 ´ γ qRγ pπ, P 1 q. Taking the limit on both sides, we have Ravg pπ, P q ě Rπ . Taking the infimum over P P U on the left-hand side, we obtain that inf P PU Ravg pπ, P q ě Rπ . This shows that inf P PU Ravg pπ, P q “ Rπ , which concludes the first step.
Second step. The second and third steps correspond to Lemma 4.8. We now study the limit behavior
of the optimal discounted return as γ Ñ 1. In particular, we show that v exists, with
v :“ lim_{γÑ1} max_{π 1 PΠSD} inf_{P 1 PU} p1 ´ γ qRγ pπ 1 , P 1 q.
To show that v exists, we consider π a stationary deterministic policy that is ϵ-Blackwell optimal for
all ϵ ą 0. Let ϵ ą 0. We have, by definition of π, that
From the first step of our proof, limγÑ1 minP PU p1 ´ γ qRγ pπ, P q exists. Hence, for all ϵ ą 0, we have
This shows that v :“ limγÑ1 maxπ1 PΠSD minP PU p1 ´ γ qRγ pπ 1 , P q exists, and also that v “ limγÑ1 minP PU p1 ´ γ qRγ pπ, P q for π a policy that is ϵ-Blackwell optimal for all ϵ ą 0.
Third step. We now show that v “ maxπPΠSD inf P PU Ravg pπ, P q. We start by showing that v ď
maxπPΠSD minP PU Ravg pπ, P q. Indeed, let us consider the policy π from Theorem 4.5, that is stationary
deterministic and ϵ-Blackwell optimal, for all ϵ ą 0:
Since π P ΠSD , for P P U we have limγÑ1 p1 ´ γ qRγ pπ, P q “ Ravg pπ, P q. Hence for any P P U and ϵ ą
0, we have Ravg pπ, P q ě v ´ ϵ. Overall we have shown that inf P PU Ravg pπ, P q ě v, which implies that
maxπ1 PΠSD inf P PU Ravg pπ 1 , P q ě v.
We now show that maxπ1 PΠSD inf P PU Ravg pπ 1 , P q ď v. We proceed by contradiction. Assume that there
exists ϵ ą 0 such that
max_{π 1 PΠSD} inf_{P PU} Ravg pπ 1 , P q ą v ` ϵ (J.2)
and let π be a stationary deterministic average optimal policy (which exists from Theorem 3.4). Then
Ravg pπ, P q ą v ` ϵ, @ P P U . Now let ϵ1 ă ϵ. From Theorem 4.3, we know that there exists Pϵ1 P U such
that for any γ ą γ0 , we have
This is in contradiction with (J.2). Hence, we have shown that v “ maxπ1 PΠSD inf P PU Ravg pπ 1 , P q.
Fourth step. The fourth step of the proof corresponds to Lemma 4.9. Let π be a ϵ-Blackwell optimal
policy for any ϵ ą 0. From the third step, we know that inf P PU Ravg pπ, P q ě v. From the third step, we
also know that v “ maxπ1 PΠSD inf P PU Ravg pπ 1 , P q. Hence, we can conclude that
i.e., we can conclude that π is optimal for the average return criterion. ˝
Surprisingly, we show that robust value functions may not be Lipschitz continuous.
Proposition L.1 There exists a sa-rectangular definable uncertainty set U and a stationary determin-
istic policy π such that the robust value function γ ÞÑ vγπ,U is not Lipschitz continuous and the normalized
robust value function γ ÞÑ p1 ´ γ qvγπ,U is not Lipschitz continuous.
The proof builds upon our previous counterexample for the non-existence of Blackwell optimal policies
from Proposition 3.2, which we adapt to show the non-Lipschitzness of the value functions as in Proposition L.1. Intuitively, in our counterexample the oscillations of the robust value function become
more and more rapid as γ approaches 1, which makes them non-Lipschitz.
Proof of Proposition L.1 Consider the RMDP instance described in Section 3.1 for the proof of
Proposition 3.2, where
continuous.
• The case of γ ÞÑ Rγ pa3 q. Since Rγ pa3 q “ p1{p1 ´ γ qq f pγ q, we have Rγ1 pa3 q “ p f 1 pγ qp1 ´ γ q ` f pγ q q { p1 ´ γ q^2 „ ´1{p1 ´ γ q^{3{2} as γ Ñ 1. Therefore, Rγ1 pa3 q Ñ ´8 as γ Ñ 1 and γ ÞÑ Rγ pa3 q is not Lipschitz continuous.
˝
Despite their potential lack of Lipschitz continuity, robust value functions are always differentiable
under some mild assumptions, as we show in the next theorem. Note that prior to this work, virtually
nothing is known about the regularity of the robust value functions γ ÞÑ vγπ,U and of the optimal robust
value function γ ÞÑ vγ‹,U . We recall that a function f is of class C p for p P N if f is differentiable p-times
and the p-th differential of f is continuous.
Theorem L.2 Consider a sa-rectangular robust MDP with a definable uncertainty set U .
1. Let π P ΠS be a policy and p P N. Then there exists a finite subdivision of the interval p0, 1q as
0 “ a1 ă a2 ă ¨ ¨ ¨ ă ak “ 1 such that on each pai , ai`1 q for i “ 1, ..., k ´ 1, the robust value function
γ ÞÑ vγπ,U is of class C p .
2. Let p P N. Then there exists a finite subdivision of the interval p0, 1q as 0 “ a1 ă a2 ă ¨ ¨ ¨ ă ak “ 1
such that on each pai , ai`1 q for i “ 1, ..., k ´ 1, the optimal robust value function γ ÞÑ vγ‹,U is of class C p .
Theorem L.2 is a straightforward consequence of a more general version of the monotonicity theorem
(theorem 4.1 in Van den Dries and Miller [1996]) and we omit the proof for conciseness. Interesting
properties related to Hölder continuity and Lojasiewicz inequality can also be extended from Theorem
4.14 in Van den Dries and Miller [1996].
}v t ´ g ‹ }8 ď }v t ´ p1 ´ γt q v_{γt}^{π^B ,U} }8 ` }g ‹ ´ p1 ´ γt q v_{γt}^{π^B ,U} }8 .
The second term converges to 0 as γt Ñ 1, as we proved in Theorem 4.6. In the rest of the proof, we
show that the first term converges to 0 as t Ñ `8. We start with the following lemma, which is similar
to lemma 7 in Tewari and Bartlett [2007].
Lemma M.1 There exists t0 P N such that for any t ě t0 , we have }v t ´ p1 ´ γt q v_{γt}^{π^B ,U} }8 ď δt , with pδt qtět0 a sequence of scalars such that δt`1 “ γt δt ` et and et “ }p1 ´ γt`1 q v_{γ_{t`1}}^{π^B ,U} ´ p1 ´ γt q v_{γt}^{π^B ,U} }8 .
Proof of Lemma M.1 Note that γt Ñ 1 as t Ñ `8. Therefore, we can choose t0 large enough such that π^B is an optimal policy; see Theorem 4.21. We prove Lemma M.1 by induction. Assume that }v t ´ p1 ´ γt q v_{γt}^{π^B ,U} }8 ď δt . We have
}v t`1 ´ p1 ´ γt`1 q v_{γ_{t`1}}^{π^B ,U} }8 ď }v t`1 ´ p1 ´ γt q v_{γt}^{π^B ,U} }8 ` }p1 ´ γt q v_{γt}^{π^B ,U} ´ p1 ´ γt`1 q v_{γ_{t`1}}^{π^B ,U} }8 .
By definition, et “ }p1 ´ γt q v_{γt}^{π^B ,U} ´ p1 ´ γt`1 q v_{γ_{t`1}}^{π^B ,U} }8 . We now turn to bounding }v t`1 ´ p1 ´ γt q v_{γt}^{π^B ,U} }8 . We have, for any s P S , p1 ´ γt q v_{γt ,s}^{π^B ,U} “ p1 ´ γt q T_{s,γt} p v_{γt}^{π^B ,U} q because v_{γt}^{π^B ,U} is the fixed-point of the operator Tγt since π^B is Blackwell optimal. This shows that
p1 ´ γt q v_{γt ,s}^{π^B ,U} “ max_{aPA} min_{psa PUsa} p_{sa}^J p p1 ´ γt q rsa ` γt p1 ´ γt q v_{γt}^{π^B ,U} q ď max_{aPA} min_{psa PUsa} p_{sa}^J p p1 ´ γt q rsa ` γt v t q ` γt δt ,
where the first equality is by definition of T , and the inequality is because of the induction hypothesis and ř_{s1 PS} p_{sas1} “ 1, @ psa P Usa . Overall, we have proved that p1 ´ γt q v_{γt ,s}^{π^B ,U} ď v_s^{t`1} ` γt δt . We can prove the converse inequality similarly. From this, we conclude that }v t ´ p1 ´ γt q v_{γt}^{π^B ,U} }8 ď δt for all t ě t0 , with δt`1 “ γt δt ` et .
˝
We now turn to proving that δt Ñ 0 as t Ñ `8. From the induction δt`1 “ γt δt ` et we obtain that for any t ě t0 we have
δt “ ř_{j“t0}^{t} p ś_{i“j`1}^{t} γi q ej .
Additionally, γi “ ωi {ωi`1 , therefore
ś_{i“j`1}^{t} γi “ pωt {ωt`1 q pωt´1 {ωt q ¨ ¨ ¨ pωj`2 {ωj`3 q pωj`1 {ωj`2 q “ ωj`1 {ωt`1 ,
so that
δt “ p1{ωt`1 q ř_{j“t0}^{t} ωj`1 ej .
We now prove that lim_{tÑ`8} p1{ωt`1 q ř_{j“t0}^{t} ωj`1 ej “ 0. The most fundamental point is to note that from Bolte et al. [2015] (Theorem 3 and Corollary 4), we have ř_{t“t0}^{`8} et ă `8 for any increasing sequence pγt qtPN ; see also Renault [2019], corollary 1.6, point (2), for an equivalent result for SGs with finitely many actions.
Let us now write Ej “ ř_{i“j}^{`8} ei . Since ř_{j“t0}^{`8} ej ă `8 and ej ě 0, we know that Ej Ñ 0 as j Ñ `8. Now we have
ř_{j“t0}^{t} ωj`1 ej “ ř_{j“t0}^{t} ωj`1 pEj ´ Ej`1 q “ ω_{t0`1} E_{t0} ´ ωt`1 Et`1 ` ř_{j“t0`1}^{t} pωj`1 ´ ωj q Ej .
First, since ω_{t0`1} E_{t0} is constant and lim_{tÑ`8} ωt “ `8, it is clear that ω_{t0`1} E_{t0} {ωt`1 Ñ 0. Second, ωt`1 Et`1 {ωt`1 “ Et`1 Ñ 0. It remains to show that p1{ωt`1 q ř_{j“t0`1}^{t} pωj`1 ´ ωj q Ej Ñ 0 as t Ñ `8. This can be done using the following simple lemma from real analysis (see theorem 2.7.2 in Choudary and Niculescu [2014]).
Lemma M.2 (Stolz-Cesàro theorem) Let pAt qtPN and pBt qtPN be two sequences of real numbers, with pBt qtPN increasing and limtÑ`8 Bt “ `8. Let ℓ :“ limtÑ`8 pAt`1 ´ At q{pBt`1 ´ Bt q, ℓ P R Y t`8u. Then the limit limtÑ`8 At {Bt exists and limtÑ`8 At {Bt “ ℓ.
Applying Lemma M.2 to At :“ ř_{j“t0`1}^{t} pωj`1 ´ ωj q Ej and Bt “ ř_{j“t0`1}^{t} pωj`1 ´ ωj q “ ωt`1 ´ ω_{t0`1} , we obtain that
pAt`1 ´ At q{pBt`1 ´ Bt q “ pωt`2 ´ ωt`1 q Et`1 {pωt`2 ´ ωt`1 q “ Et`1 Ñ 0,
which shows that At {Bt Ñ 0, and hence p1{ωt`1 q ř_{j“t0`1}^{t} pωj`1 ´ ωj q Ej Ñ 0 since Bt {ωt`1 Ñ 1. Overall, we have proved that δt Ñ 0 as t Ñ `8, which concludes the proof of Theorem 5.3. ˝
N.1.1. Forest management instance. The states represent the growth of the forest. An optimal
policy finds the right balance between maintaining the forest, earning revenue by selling cut wood, and
the risk of wildfires. A complete description may be found in Cordwell et al. [2015]. This instance is inspired by the application of dynamic programming to optimal fire management of Possingham and Tuck [1997].
States. There are S states: State 1 is the youngest state for the forest, State S is the oldest state.
Actions. The two actions are wait and cut & sell.
Transitions. If the forest is in State s and the action is wait, the next state is State s ` 1 with
probability 1 ´ p (the forest grows) and 1 with probability p (a wildfire burns the forest down). If
the forest is in State s and the action is cut & sell, the next state is State 1 with probability 1. The
probability of wildfire p is chosen at p “ 0.1.
Rewards. There is a reward of 4 when the forest reaches the oldest state (S) and the chosen action is
wait. There is a reward of 0 at every other state if the chosen action is wait. When the action is cut & sell, the reward at the youngest state s “ 1 is 0, there is a reward of 1 in any other state s P t2, ..., S ´ 1u, and a reward of 2 in s “ S.
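The following Python sketch builds the nominal transition and reward matrices for this instance. It assumes, as in the pymdptoolbox implementation of Cordwell et al. [2015], that under wait the oldest state is absorbing up to wildfire; this convention is an assumption of the sketch.

    import numpy as np

    def forest_nominal_mdp(S=100, p=0.1):
        """Sketch of the nominal forest management MDP (actions: 0 = wait, 1 = cut & sell)."""
        P = np.zeros((2, S, S))   # transitions, indexed by (action, state, next state)
        R = np.zeros((S, 2))      # rewards, indexed by (state, action)
        for s in range(S):
            # wait: the forest grows with probability 1 - p, burns down with probability p
            P[0, s, 0] = p
            P[0, s, min(s + 1, S - 1)] += 1.0 - p
        # cut & sell: go back to the youngest state with probability 1
        P[1, :, 0] = 1.0
        R[S - 1, 0] = 4.0         # waiting in the oldest state
        R[1:S - 1, 1] = 1.0       # cutting in states 2, ..., S - 1
        R[S - 1, 1] = 2.0         # cutting in the oldest state
        return P, R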
N.1.2. Machine replacement instance This instance is represented in Figure 9, in the case of 8
states representing the operative conditions of the machine. There are two additional repair states. To
build larger instances, we add new operative states for the machine, with the same transition. We now
describe in detail the states, actions, transitions, and rewards in this instance.
States. The machine replacement problem involves a machine whose set of possible conditions are
described by S states. The first S ´ 2 states are operative states. The states 1 to S ´ 2 model the
condition of the machine, with 1 being the perfect condition and S ´ 2 being the worst condition. The
last two states S ´ 1 and S are states representing when the machine is being repaired.
Actions. There are two actions: repair and wait.
Transitions. The transitions are detailed in Figures 9a-9b. When the action is wait, the machine is likely to deteriorate toward the state S ´ 2, or may stay in the same condition. When the action is repair, the decision-maker brings the machine to the repair states S ´ 1 and S.
Rewards. There is a cost of 0 for states 1, ..., S ´ 3; letting the machine reach the worst operative
state S ´ 2 is penalized with a cost of 20. The state S ´ 1 is a standard repair state and has a cost of 2,
while the last state S is a longer and more costly repair state and has a cost of 10. We turn costs into
rewards by flipping the signs.
[Figure 9: nominal transition probabilities of the machine replacement instance with 8 operative states and two repair states R1 and R2: (a) action wait, (b) action repair.]
N.1.3. Instance inspired by healthcare We present in Figure 10 the nominal transitions for this instance, in the case of 8 health condition states. The goal is to minimize the mortality rate of the patient while reducing the invasiveness of the drug dosage (low, medium or high) prescribed at each state. Similarly to the machine replacement instance, the instances with a larger number of states are constructed by adding health condition states for the patient.
States. There are S states. The first S ´ 1 states represent the health conditions of the patient, with
S ´ 1 being the worst condition before the mortality absorbing state m.
Actions. The action set is A “ tlow, medium, highu, representing the drug dosage at each state.
Transitions. The transitions are represented in Figure 10. They capture the fact that the patient is more likely to recover (i.e., to transition to earlier states) under the high-intensity treatment (Figure 10c) than under the low-intensity treatment (Figure 10a).
Rewards. The rewards penalize using an intense treatment plan for the patient. In a health condition
state s P t1, ..., S ´ 1u, the reward is 10 for choosing action low, 8 for choosing action medium, and 6 for
choosing action high.
[Figure 10: nominal transition probabilities of the healthcare instance with 8 health condition states and the mortality state m: (a) action low, (b) action medium, (c) action high.]
Assumption N.1 The radius α ą 0 is such that, for any ps, aq P S ˆ A, we have
Assumption N.1 has the simple interpretation that the orthogonal projection of the ℓ2 -ball tp P R^S | }p ´ pnom,sa }2 ď αu onto the hyperplane tp P R^S | p^J e “ 1u is entirely contained in the simplex. Note that from Section N.2.1, we only consider transition probabilities in p0, 1q to be uncertain, so that Assumption N.1 holds for a radius α small enough. Under this assumption, we can write U_{sa}^{ℓ2} as U_{sa}^{ℓ2} “ pnom,sa ` α B̃ with B̃ “ tz P R^S | z^J e “ 0, }z }2 ď 1u. Let us now write u1 , ..., u_{|S|´1} P R^{|S|} an orthonormal basis of tp P R^S | p^J e “ 0u, e.g. the one given in closed form in Egozcue et al. [2003]. Then writing U “ pu1 , ..., u_{|S|´1} q P R^{|S|ˆp|S|´1q} , and noting that by definition U^J U “ I_{|S|´1} , we have
B̃ “ tz P R^S | z^J e “ 0, }z }2 ď 1u “ tU y | y P R^{|S|´1} , }U y }2 ď 1u “ tU y | y P R^{|S|´1} , }y }2 ď 1u,
where the last equality follows from U being an orthonormal basis of tp P R^S | p^J e “ 0u. This shows that under Assumption N.1, we have the following closed-form update: for v P R^S ,
min_{pPUsa} p^J v “ p_{nom,sa}^J v ` α min_{yPR^{|S|´1} , }y}2 ď1} pU y q^J v “ p_{nom,sa}^J v ´ α }U^J v }2 .
Overall, our analysis shows that in the case of ellipsoidal uncertainty as in U ℓ2 and under Assumption N.1, we can efficiently evaluate Ts pv q as Ts pv q “ max_{aPA} p_{nom,sa}^J prsa ` γv q ´ α }U^J prsa ` γv q}2 , @ s P S .
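Under Assumption N.1, this update takes only a few lines of code. The sketch below exploits the fact that }U^J w }2 is the Euclidean norm of w after centering (the orthogonal projection of w onto tz | z^J e “ 0u), so the basis U never needs to be formed explicitly; the array shapes are assumptions of the sketch.

    import numpy as np

    def bellman_l2(v, P_nom, R, gamma, alpha):
        """Sketch of the robust Bellman update T_s(v) for the ell_2 uncertainty set under Assumption N.1.

        P_nom: nominal transitions of shape (S, A, S); R: rewards r_{s a s'} of shape (S, A, S);
        v: value vector of shape (S,). For each (s, a), the worst case of p @ w over the ell_2 set
        equals p_nom_sa @ w - alpha * ||U^T w||_2, and ||U^T w||_2 = ||w - mean(w)||_2.
        """
        S, A, _ = P_nom.shape
        Tv = np.empty(S)
        for s in range(S):
            q = np.empty(A)
            for a in range(A):
                w = R[s, a] + gamma * v               # vector indexed by the next state
                q[a] = P_nom[s, a] @ w - alpha * np.linalg.norm(w - w.mean())
            Tv[s] = q.max()
        return Tv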
N.2.3. Uncertainty based on box inequalities. We also consider the following uncertainty set: U_{sa}^{box} “ tp P ∆pSq | p_{sa}^{low} ď p ď p_{sa}^{up} u, @ ps, aq P S ˆ A, with p_{sa}^{low} , p_{sa}^{up} P r0, 1s^S two vectors such that p_{sa}^{low} ď pnom,sa ď p_{sa}^{up} . In our experiments we use θup “ θlow “ 0.05, i.e., the lower and upper bounds allow for 5% deviations from the nominal distributions. Note that to evaluate the Bellman operator T at a vector v P R^S , we only need to sort the components of v in increasing order and then use the closed-form expression from the method described in proposition 3 in Goh et al. [2018].
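The inner minimization min_{pPU_{sa}^{box}} p^J v is a small linear program over a box intersected with the probability simplex. One standard greedy implementation, sketched below, starts from the lower bounds and allocates the remaining probability mass to the next states with the smallest values of v first; this is a sketch of the type of closed-form computation referred to above, not a verbatim transcription of the procedure of Goh et al. [2018].

    import numpy as np

    def worst_case_box(v, p_low, p_up):
        """Sketch: solve min_p p @ v subject to p_low <= p <= p_up and sum(p) = 1.

        Assumes the box intersects the simplex (sum(p_low) <= 1 <= sum(p_up)).
        """
        p = p_low.astype(float).copy()
        mass = 1.0 - p.sum()                  # remaining probability mass to allocate
        for i in np.argsort(v):               # cheapest next states first
            add = min(p_up[i] - p_low[i], mass)
            p[i] += add
            mass -= add
            if mass <= 0:
                break
        return p @ v, p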
N.3. Implementation of Algorithm 1
In Algorithm 1, at every iteration t we compute v_{γt}^{‹,U} by implementing two-player strategy iteration [Hansen et al., 2013], as described in Algorithm 4.
Recall that in Algorithm 1, we compute the optimal discounted policies for an increasing sequence of
discount factors. Therefore, we warm-start Algorithm 4 at the next iteration with the policy computed
at the previous iteration. This considerably reduces the computation time, for two reasons:
1. The discount factors γt at iteration t and γt`1 at iteration t ` 1 are close so that we can expect
the optimal policies at iteration t and iteration t ` 1 to be close;
2. Since U box and U ℓ2 are definable sets, Blackwell optimal policies exist. A Blackwell discount factor
also exists, and for γ large enough, the set of optimal stationary deterministic policies does not change.
In Algorithm 4, we compute the robust value function of π t at every iteration t. We do so by implementing the one-player version of policy iteration for the adversarial MDP [Goh et al., 2018, Ho et al., 2021], as in Algorithm 5. We also use warm-starting to accelerate the practical performance of Algorithm 5.
[Algorithm 5 (policy iteration for the adversarial MDP, with warm start): the adversarial transition probabilities are updated with P_{sa}^{t`1} “ P_{sa}^{t} if possible, and the algorithm stops as soon as P^{t`1} “ P^{t} .]
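The sketch below illustrates the warm-started structure of Algorithm 5 for a fixed policy π: starting from the adversarial transition probabilities obtained at the previous outer iteration, it alternates policy evaluation of the current adversary with a greedy adversarial improvement step. The oracle worst_case(s, v) (for instance a wrapper around the box or ℓ2 routines sketched above) and the use of expected rewards r_pi are assumptions made for the illustration.

    import numpy as np

    def adversarial_policy_iteration(P_init, r_pi, gamma, worst_case, max_iter=100):
        """Sketch of warm-started policy iteration for the adversarial MDP of a fixed policy pi.

        P_init: initial adversarial transitions of shape (S, S) (e.g., from the previous outer
        iteration); r_pi: expected rewards of pi, shape (S,); worst_case(s, v) returns the
        minimizing distribution in U_{s, pi(s)} for the continuation values gamma * v.
        """
        S = r_pi.shape[0]
        P = P_init.copy()
        for _ in range(max_iter):
            # policy evaluation of the current adversary: v = (I - gamma P)^{-1} r_pi
            v = np.linalg.solve(np.eye(S) - gamma * P, r_pi)
            # greedy adversarial improvement, state by state
            P_new = np.vstack([worst_case(s, v) for s in range(S)])
            if np.allclose(P_new, P):
                break
            P = P_new
        return v, P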