Major-Minor Mean Field Multi-Agent Reinforcement Learning

Kai Cui    Christian Fabian    Anam Tahir    Heinz Koeppl
Abstract

Multi-agent reinforcement learning (MARL) remains difficult to scale to many agents. Recent MARL using Mean Field Control (MFC) provides a tractable and rigorous approach to otherwise difficult cooperative MARL. However, the strict MFC assumption of many independent, weakly-interacting agents is too inflexible in practice. We generalize MFC to instead simultaneously model many similar and few complex agents – as Major-Minor Mean Field Control (M3FC). Theoretically, we give approximation results for finite agent control, and verify the sufficiency of stationary policies for optimality together with a dynamic programming principle. Algorithmically, we propose Major-Minor Mean Field MARL (M3FMARL) for finite agent systems instead of the limiting system. The algorithm is shown to approximate the policy gradient of the underlying M3FC MDP. Finally, we demonstrate its capabilities experimentally in various scenarios. We observe a strong performance in comparison to state-of-the-art policy gradient MARL methods.

Multi-Agent Reinforcement Learning, Mean Field Control, Large-Scale Multi-Agent Systems

1 Introduction

Recent successes of reinforcement learning (RL) (Vinyals et al., 2019; Schrittwieser et al., 2020; Ouyang et al., 2022) motivate the search for techniques for the multi-agent case, referred to as multi-agent reinforcement learning (MARL). Due to the high complexity of multi-agent control (Bernstein et al., 2002; Daskalakis et al., 2009), exploiting problem structure is important for scalable MARL. In this work, we consider systems with many agents interacting through aggregated information of all agents – the mean field (MF).

Mean field control for MARL.

Dynamical control and behavior in systems with many agents is the subject of studies in mean field games (MFG) (Huang et al., 2006; Lasry and Lions, 2007) and mean field control (MFC) (Nourian et al., 2012; Bensoussan et al., 2013; Carmona et al., 2023b). Such aggregated interaction models simplify MARL in the limit of infinite agents, whenever agents interact only through their empirical distribution. The simplification provides a problem complexity that is independent of the exact number of agents. The result is tractability, by avoiding otherwise exponentially large joint state-action spaces (Zhang et al., 2021). This has led to scalable control based on MFC (Gu et al., 2023; Carmona et al., 2023b). And indeed, in applications such aggregation is commonly found on some level, e.g., in chemical reaction networks for aggregate molecule mass (Anderson and Kurtz, 2011), related mass-action epidemics models (Kiss et al., 2017), or traffic where congestion depends on the number of travelling cars (Cabannes et al., 2022), to name just a few. See also epidemics control (Dunyak and Caines, 2021), drone swarms (Shiri et al., 2019), self organization (Carmona et al., 2023a), and many more financial (Carmona, 2020) or engineering scenarios (Djehiche et al., 2017).

Table 1: A comparison of recent related works and a subset of their results on discrete-time MFC.
prop. chaos: propagation of chaos; opt. policy: existence of optimal (stationary) policies; common noise: presence thereof; non-finite: non-finite state-actions, e.g. compact; major agent: presence thereof; RL: RL algorithm (+: learns / is analyzed on finite MARL problems).
Ref. prop. chaos opt. policy common noise non-finite major agent RL
Carmona et al. (2023b)
Gu et al. (2021, 2023)
Bäuerle (2023)
Mondal et al. (2022, 2023)
Motte and Pham (2022, 2023)
our work +

Limitations of standard MFC.

However, the strict assumption of only minor agents – i.e. independent, homogeneous agents that can be summarized by their distribution (MF) – limits applicability. In practice, systems often consist of more than homogeneous agents, and hence one must extend standard MFC towards major agents or environment states that are not aggregated. For instance, in modelling car traffic on road networks (Cabannes et al., 2022; Wu et al., 2023), when considering only the distribution of cars (minor agents) on the network, one cannot model major agents or environment states, such as traffic lights or the road conditions respectively. Another example is given by the logistics scenario in Figure 1 and in the experiments, where many drones on a moving truck collect many packages.

Figure 1: Logistics example: Many drones are modelled as minor agent MF, while truck and package destinations are modelled by a major agent. (See Foraging problem in Section 4.1)

For this purpose, a first step in the continuous-time MFG literature is to consider common noise (Carmona et al., 2016; Perrin et al., 2020), in order to relax the unconditional independence of minor agents. Some more recent works consider such common noise also in discrete-time MFC (Carmona et al., 2023b; Bäuerle, 2023; Motte and Pham, 2022, 2023), or equivalently, global environment states (Mondal et al., 2023). Essentially, this extension allows MFC to also model random environment effects such as the arrival of new packages in the logistics example (Figure 1). Carmona et al. (2023b) provide a reformulation of MARL into single-agent RL and consider algorithms for the resulting Markov decision process (MDP). Bäuerle (2023) gives approximation theorems and approximate optimality in the finite system by the limiting MFC solution with common noise, and Motte and Pham (2022, 2023) quantify the rates of convergence explicitly. See also Table 1 for a brief comparison between existing works. In comparison, for the common noise setting, we contribute a new approximation analysis of MFC-based MARL algorithms, where in contrast to prior work, we learn directly with finite agents.

More importantly however, a second contribution is to consider major agents. Major agents generalize common noise or environmental states, and take actions that have a non-negligible effect on the system. So far, major agents have only been considered in continuous-time, non-cooperative MFGs (Nourian and Caines, 2013; Şen and Caines, 2014; Caines and Kizilkale, 2016; Şen and Caines, 2016). To the best of our knowledge, no such discrete-time, cooperative framework has been formulated yet. In this work, we investigate such a framework and associated MARL algorithms.

Figure 2: Our M3FC-based MARL generalizes MFC-based MARL and standard single-agent RL in the solution space of general MARL solutions, reducing the otherwise combinatorial nature of MARL (Zhang et al., 2021) to a tractable but still general setting.

Contribution.

Existing MFC cannot model general agents and many aggregated agents simultaneously. In essence, we generalize the solution spaces of single-agent RL and MFC-based MARL – frameworks for cooperative MARL as depicted in Figure 2. This provides both tractability for many aggregated agents and generality for arbitrary general agents. Our contribution is briefly summarized as (i) formulating the first discrete-time MFC model with major agents, together with establishing its theoretical properties; (ii) providing an MFC-based MARL algorithm, which in contrast to prior work learns on the finite problem of interest; and (iii) performing a significant empirical evaluation, also obtaining positive comparisons of MFC-based MARL against the state of the art, whereas prior works on MFC were limited to verifying algorithms on one or two examples.

2 Major-Minor Mean Field Control

To begin, in this section we extend standard MFC by modelling the presence of a major agent. The generalization to more than one major agent is straightforward. This leads to our discrete-time major-minor MFC (M3FC) model. Overall, we obtain a formulation that allows standard MARL handling of major agents, while tractably handling many minor agents via MFC-based techniques.

Notation: By $\operatorname{\mathbb{E}}_{X}$ we denote conditional expectations given $X$. The space of probability measures $\mathcal{P}(\mathcal{X})$ on compact metric spaces $\mathcal{X}$ is equipped with the $1$-Wasserstein distance, unless noted otherwise (Villani, 2009). Note compactness of $\mathcal{P}(\mathcal{X})$ on compact $\mathcal{X}$ by Prokhorov's theorem (Billingsley, 2013). Hence, we sometimes use the uniformly (not Lipschitz) equivalent metric $d_{\Sigma}(\mu,\mu^{\prime}) \coloneqq \sum_{m=1}^{\infty} 2^{-m} \left| \int f_{m} \, \mathrm{d}(\mu-\mu^{\prime}) \right|$, for some sequence of continuous $f_{m} \colon \mathcal{X} \to [-1,1]$ (Parthasarathy, 2005, Theorem 6.6).

2.1 Finite-Agent System

Consider $N$ (minor) agents $i \in [N] \coloneqq \{1, \ldots, N\}$ with compact metric state and action spaces $\mathcal{X}$, $\mathcal{U}$, equipped with random states and actions $x^{i,N}_{t}$ and $u^{i,N}_{t}$ at times $t \in \mathbb{N}$, where initial states $x^{i,N}_{0} \sim \mu_{0}$ are independently sampled from some initial distribution $\mu_{0} \in \mathcal{P}(\mathcal{X})$. In addition to standard MFC, we also consider a single major agent, though the framework can be extended to multiple. Consider major agent state and action spaces $\mathcal{X}^{0}$, $\mathcal{U}^{0}$ and state-actions $x^{0,N}_{t}$, $u^{0,N}_{t}$, with the major agent formally indexed by $i=0$. Given all actions, the agent states evolve according to kernels $p$, $p^{0}$ depending on (i) the agent's own state-actions, (ii) the major state-actions, and (iii) the empirical MF, i.e. the $\mathcal{P}(\mathcal{X})$-valued empirical state distribution $\mu^{N}_{t} \coloneqq \frac{1}{N} \sum_{i=1}^{N} \delta_{x^{i,N}_{t}}$. This means that minor agents affect other agents only at rate $\frac{1}{N}$. In practice, we identify minor agents as all agents that matter through their MF $\mu^{N}_{t}$. Any remaining agents are major, such that the problem-specific stratification into major and minor agents is always possible.

By symmetry, the system state at any time $t$ is therefore entirely given by $(x^{0,N}_{t}, \mu^{N}_{t})$. Accordingly, in MFC we share policies between all minor agents. We consider time-variant policies $\pi \in \Pi$, $\pi^{0} \in \Pi^{0}$ from some classes of minor and major policies $\Pi$, $\Pi^{0}$ that depend on an agent's own state and $(x^{0,N}_{t}, \mu^{N}_{t})$ at all times $t$. Overall, for all $i \in [N]$ and $t \in \mathbb{N}$, the finite MFC system follows

\begin{align}
u^{i,N}_{t} &\sim \pi_{t}(u^{i,N}_{t} \mid x^{i,N}_{t}, x^{0,N}_{t}, \mu_{t}^{N}), \tag{1a} \\
u^{0,N}_{t} &\sim \pi^{0}_{t}(u^{0,N}_{t} \mid x^{0,N}_{t}, \mu_{t}^{N}), \tag{1b} \\
x^{i,N}_{t+1} &\sim p(x^{i,N}_{t+1} \mid x^{i,N}_{t}, u^{i,N}_{t}, x^{0,N}_{t}, u^{0,N}_{t}, \mu_{t}^{N}), \tag{1c} \\
x^{0,N}_{t+1} &\sim p^{0}(x^{0,N}_{t+1} \mid x^{0,N}_{t}, u^{0,N}_{t}, \mu_{t}^{N}). \tag{1d}
\end{align}
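To make the dynamics (1a)-(1d) concrete, the following is a minimal sketch of one transition of the finite-agent system, assuming finite state and action spaces encoded as integer indices. The function names `empirical_mf`, `sample`, and `step`, as well as the toy policies and kernels at the bottom, are illustrative assumptions and not part of the model above.

```python
import random
from collections import Counter

def empirical_mf(states, num_states):
    """Empirical mean field mu^N_t: fraction of minor agents in each state."""
    counts = Counter(states)
    return [counts.get(x, 0) / len(states) for x in range(num_states)]

def sample(probs, rng):
    """Draw an index from a finite probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def step(minor_states, major_state, pi, pi0, p, p0, num_states, rng):
    """One transition of (1a)-(1d); pi, pi0, p, p0 return probability vectors."""
    mu = empirical_mf(minor_states, num_states)
    # (1a): every minor agent samples from the shared policy pi.
    minor_actions = [sample(pi(x, major_state, mu), rng) for x in minor_states]
    # (1b): the major agent samples from its own policy pi0.
    major_action = sample(pi0(major_state, mu), rng)
    # (1c): minor states evolve given own and major state-actions and the MF.
    next_minor = [sample(p(x, u, major_state, major_action, mu), rng)
                  for x, u in zip(minor_states, minor_actions)]
    # (1d): the major state evolves given its own state-action and the MF.
    next_major = sample(p0(major_state, major_action, mu), rng)
    return next_minor, next_major, major_action, mu

# Hypothetical toy model: 2 minor states/actions, 2 major states/actions.
def pi(x, x0, mu):
    return [0.5, 0.5]

def pi0(x0, mu):
    return [1.0, 0.0] if mu[0] >= 0.5 else [0.0, 1.0]

def p(x, u, x0, u0, mu):
    return [0.9, 0.1] if u == u0 else [0.1, 0.9]

def p0(x0, u0, mu):
    return [1.0, 0.0] if u0 == 0 else [0.0, 1.0]

rng = random.Random(0)
minor = [rng.randrange(2) for _ in range(100)]
major = 0
minor, major, u0, mu = step(minor, major, pi, pi0, p, p0, 2, rng)
```

Note that the policies and kernels only receive the pair of major state and empirical MF, never the individual states of other minor agents, which is the symmetry the MFC approach relies on.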

The goal is then to maximize the infinite-horizon discounted objective $J^{N}(\pi,\pi^{0}) \coloneqq \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r(x^{0,N}_{t}, u^{0,N}_{t}, \mu^{N}_{t})\right]$ over minor and major policies $(\pi,\pi^{0})$, with discount $\gamma \in (0,1)$ and reward function $r \colon \mathcal{X}^{0} \times \mathcal{U}^{0} \times \mathcal{P}(\mathcal{X}) \to \mathbb{R}$. While an optimal behavior could be learned using standard MARL policy gradient methods, for improved tractability we introduce the following M3FC model in the case of many minor agents.
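In simulation, the infinite sum in the objective has to be truncated. The sketch below computes a truncated discounted return and the standard geometric tail bound, under the added assumption that rewards are bounded by some constant `r_max` (an assumption introduced here for illustration). The helper names are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a finite reward sequence r_0, ..., r_{T-1}."""
    total, weight = 0.0, 1.0
    for r in rewards:
        total += weight * r
        weight *= gamma
    return total

def truncation_bound(horizon, gamma, r_max):
    """Upper bound on |sum_{t >= horizon} gamma^t r_t| when |r_t| <= r_max."""
    return gamma ** horizon * r_max / (1.0 - gamma)

# Example: constant reward 1 and gamma = 0.5, truncated after 3 steps.
value = discounted_return([1.0, 1.0, 1.0], 0.5)   # 1 + 0.5 + 0.25 = 1.75
bound = truncation_bound(3, 0.5, 1.0)             # 0.125 / 0.5 = 0.25
```

In this example the untruncated value is $1/(1-0.5) = 2$, so the truncation error of $0.25$ attains the bound exactly.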

Remark 1.

The model is as expressive as in existing MFC (Mondal et al., 2022; Gu et al., 2023), as it also includes (i) joint state-action MFs $\nu_{t} \in \mathcal{P}(\mathcal{X} \times \mathcal{U})$, by splitting time steps in two and defining new states in $\mathcal{X} \cup (\mathcal{X} \times \mathcal{U})$, (ii) average rewards over all agents, and (iii) random rewards $r_{t}^{i}$ by $r(\mu^{N}_{t}) \equiv \frac{1}{N} \sum_{i=1}^{N} \operatorname{\mathbb{E}}[r_{t}^{i} \mid x^{i,N}_{t}, \mu^{N}_{t}]$. A finite horizon is handled analogously (without optimal stationary policies).

2.2 Mean Field Control Limit

By the introduction of the MF limit, we obtain a large, more tractable subclass of cooperative multi-agent control problems, which may otherwise suffer from the curse of many agents (combinatorial joint state-action space; Zhang et al., 2021). We introduce the MF limit by formally taking $N \to \infty$: The finite-agent control problem is replaced by a higher-dimensional single-agent MDP, the M3FC MDP. By symmetry, we summarize minor agents into their probability law, the MF $\mu_{t} \equiv \mathcal{L}(x^{i,N}_{t}) \in \mathcal{P}(\mathcal{X})$. It replaces its empirical analogue $\mu^{N}_{t}$ by a law of large numbers (LLN). Thus, by definition, the MF $\mu_{t}$ evolves forward as

\begin{equation}
\begin{split}
\mu_{t+1} &= T(x^{0}_{t}, u^{0}_{t}, \mu_{t}, \mu_{t} \otimes \pi_{t}(\mu_{t})) \\
&= \iint p(\cdot \mid x, u, x^{0}_{t}, u^{0}_{t}, \mu_{t}) \, \pi_{t}(\mathrm{d}u \mid x, \mu_{t}) \, \mu_{t}(\mathrm{d}x),
\end{split}
\tag{2}
\end{equation}

with $\pi_{t}(\mu_{t}) \coloneqq \pi_{t}(\cdot \mid \cdot, \mu_{t})$, product measures $\mu_{t} \otimes \pi_{t}(\mu_{t})$ of measure $\mu_{t}$ and kernel $\pi_{t}(\mu_{t})$ on $\mathcal{X} \times \mathcal{U}$, and deterministic dynamics for the MF, $T(x^{0}, u^{0}, \mu, h) \coloneqq \iint p(\cdot \mid x, u, x^{0}, u^{0}, \mu) \, h(\mathrm{d}x, \mathrm{d}u)$.
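For finite state and action spaces, the integrals in the MF dynamics reduce to sums, which the following sketch illustrates. It assumes that $\mu$ is a list of state probabilities, that a decision rule is a table of per-state action probabilities, and that the kernel `p` returns a probability vector over next states. The names `product_measure`, `mf_transition`, and the toy kernel are hypothetical.

```python
def product_measure(mu, decision_rule):
    """h = mu (x) pi as a table: h[x][u] = mu[x] * pi(u | x)."""
    return [[mu[x] * prob for prob in decision_rule[x]] for x in range(len(mu))]

def mf_transition(h, p, x0, u0, mu):
    """T(x0, u0, mu, h) = sum over (x, u) of h[x][u] * p(. | x, u, x0, u0, mu)."""
    next_mu = [0.0] * len(mu)
    for x, row in enumerate(h):
        for u, mass in enumerate(row):
            kernel = p(x, u, x0, u0, mu)
            for y in range(len(next_mu)):
                next_mu[y] += mass * kernel[y]
    return next_mu

# Hypothetical toy kernel: a minor agent moves to state 0 with probability 0.9
# if its action matches the major action, and with probability 0.1 otherwise.
def p(x, u, x0, u0, mu):
    return [0.9, 0.1] if u == u0 else [0.1, 0.9]

mu = [0.5, 0.5]
rule = [[1.0, 0.0], [0.0, 1.0]]   # state 0 plays action 0, state 1 plays action 1
h = product_measure(mu, rule)
next_mu = mf_transition(h, p, 0, 0, mu)
```

In contrast to the finite-agent system, no sampling is involved: given the major state-action and $h$, the next MF is a deterministic function of the current one.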

Therefore, the state of the limiting system consists only of the MF $\mu_{t}$ and major state $x^{0}_{t}$. As a result, we obtain the limiting M3FC MDP

\begin{align}
h_{t} &\sim \hat{\pi}_{t}(h_{t} \mid x^{0}_{t}, \mu_{t}), \tag{3a} \\
u^{0}_{t} &\sim \pi^{0}_{t}(u^{0}_{t} \mid x^{0}_{t}, \mu_{t}), \tag{3b} \\
\mu_{t+1} &= T(x^{0}_{t}, u^{0}_{t}, \mu_{t}, h_{t}), \tag{3c} \\
x^{0}_{t+1} &\sim p^{0}(x^{0}_{t+1} \mid x^{0}_{t}, u^{0}_{t}, \mu_{t}) \tag{3d}
\end{align}

with objective $J(\hat{\pi}, \pi^{0}) = \operatorname{\mathbb{E}}\left[\sum_{t=0}^{\infty} \gamma^{t} r(x^{0}_{t}, u^{0}_{t}, \mu_{t})\right]$ and transition dynamics for the MF $T(x^{0}, u^{0}, \mu, h) \coloneqq \iint p(\cdot \mid x, u, x^{0}, u^{0}, \mu) \, h(\mathrm{d}x, \mathrm{d}u)$. Here, we identify $\mu_{t} \otimes \pi_{t}(\mu_{t}) \equiv h_{t} \in \mathcal{H}(\mu_{t})$ in the compact set $\mathcal{H}(\mu) \subseteq \mathcal{P}(\mathcal{X} \times \mathcal{U})$ of desired joint state-action distributions with first marginal $\mu$ as part of the action of the M3FC MDP.

In other words, the action of the M3FC MDP is $(h_{t}, u^{0}_{t})$ where $h_{t}$ replaces all the minor agent actions by a LLN. Accordingly, minor agent policies are replaced by MFC policies $\hat{\pi}$ mapping from current $\mu_{t}$ to desired state-action distribution $h_{t}$. The limiting M3FC model abstracts away all the minor agents in the finite system, and considers only the MF and the major agents, as visualized in Figure 3. The reason for writing joint $h_{t}$ is mostly technical, as for deterministic $\hat{\pi}$, we write $\pi_{t} = \Phi(\hat{\pi}_{t})$ to reobtain agent policies $\mu_{t}$-a.e. uniquely by disintegration (Kallenberg, 2017) of $h_{t} = \hat{\pi}_{t}(\mu_{t})$ into $\mu_{t} \otimes \pi^{\prime}_{t}$ with decision rule $\pi^{\prime}_{t} \in \mathcal{P}(\mathcal{U})^{\mathcal{X}}$ and using $\pi_{t}(\mu_{t}) \equiv \pi^{\prime}_{t}$. Inversely, any $\pi \in \Pi$ is represented in the MFC MDP by deterministic $\hat{\pi}_{t} = \Phi^{-1}(\pi)_{t} = \mu_{t} \otimes \pi_{t}$.
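In the finite case, the disintegration step is simply a row-wise normalization of the joint table $h_{t}$, as the following sketch shows. The function name `disintegrate` and the choice of a uniform rule on states with zero mass are assumptions made for illustration; any choice works there, which reflects that the decision rule is only unique $\mu_{t}$-a.e.

```python
def disintegrate(h):
    """Recover a decision rule pi'(u | x) from a joint table h[x][u].

    On states with zero mass the rule is arbitrary; a uniform rule is used
    here as a placeholder.
    """
    rule = []
    for row in h:
        mass = sum(row)  # first marginal mu(x)
        if mass > 0.0:
            rule.append([value / mass for value in row])
        else:
            rule.append([1.0 / len(row)] * len(row))
    return rule

# Joint state-action distribution over 3 states and 2 actions.
h = [[0.2, 0.2], [0.6, 0.0], [0.0, 0.0]]
mu = [sum(row) for row in h]
rule = disintegrate(h)

# Round trip: mu (x) pi' reproduces h, including on the zero-mass state.
rebuilt = [[mu[x] * prob for prob in rule[x]] for x in range(len(h))]
```

The round trip at the end mirrors the inverse direction in the text, where a decision rule together with the current MF is mapped back to the joint distribution.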

Remark 2.

Strictly speaking, in finite-agent control one jointly selects actions $(u^{0,N}_t, u^{1,N}_t, \ldots, u^{N,N}_t)$ given joint states $(x^{0,N}_t, x^{1,N}_t, \ldots, x^{N,N}_t)$. But intuitively, (i) joint states reduce to $(x^{0,N}_t, \mu^N_t)$, while (ii) joint actions are replaced by the LLN and sampling actions. Optimality of MFC solutions over larger classes of heterogeneous or joint policies is plausible, but to the best of our knowledge, general results are still limited. See also Appendix Q.

For the unfamiliar reader, in Appendix B we recap basic deterministic MFC without major agents or common noise. There, we recap Lipschitz approximation theorems and dynamic programming principles in compact spaces.

Figure 3: The dynamics (1) as a probabilistic graphical model, with actions in grey (inputs omitted for readability). Diamonds denote deterministic functions. M3FC abstracts minor agents $i \in [N]$ by a LLN, considering only their MF as variables in the dotted box.

Common noise and global states.

In the classical sense (Perrin et al., 2020; Motte and Pham, 2022), common noise is given by random noise $\epsilon^0_t \sim p_\epsilon(\epsilon^0_t)$ sampled from a fixed distribution $p_\epsilon$, and affects all minor agents at once, $x^{i,N}_{t+1} \sim p(x^{i,N}_{t+1} \mid x^{i,N}_t, u^{i,N}_t, \epsilon^0_t, \mu^N_t)$. This allows modeling systems with stochastic MFs and inter-agent correlation, and has added difficulty to the theoretical analysis (Carmona et al., 2016). Of similar interest are also “major” global states $x^{0,N}_t$, which need not be sampled from fixed distributions but evolve dynamically (for MFC with finite global states, see e.g. Mondal et al. (2023)).

Both common noise and global states are contained in the M3FC model by using a trivial major agent without actions. We also note that, in general, common noise is equivalent to global states, as global states can be integrated into the minor state conditioned on the common noise. However, for computational purposes the separation of global states and minor agent states can be helpful, as the simplex $\mathcal{P}(\mathcal{X})$ over minor states can be kept smaller for methods based on discretization of the simplex.

2.3 Dynamic Programming

As a first step, it is well known that stationary (time-independent) policies suffice for optimality in infinite-horizon discounted MDPs. In the following, this property is also verified for the M3FC MDP. For the following technical results, we assume standard Lipschitz conditions (Gu et al., 2021; Mondal et al., 2022; Pásztor et al., 2023).

Assumption 1.

The transition kernels $p$, $p^0$ and rewards $r$ are Lipschitz with constants $L_p$, $L_{p^0}$, $L_r$.

Assumption 1 is true, e.g., in finite spaces if transition matrix entries of $P$ are Lipschitz in the $|\mathcal{X}|$-dimensional MF vector. The sufficiency of stationary policies is obtained by the dynamic programming principle, which can also be used to compute exact optimal policies in the M3FC MDP. We use the value function $V^*$ as the fixed point of the Bellman equation, $V^*(x^0, \mu) = \max_{(h, u^0) \in \mathcal{H}(\mu) \times \mathcal{U}^0} r(x^0, u^0, \mu) + \gamma \mathbb{E}_{y^0 \sim p^0(y^0 \mid x^0, u^0, \mu)} V^*(y^0, T(x^0, u^0, \mu, h))$.

Theorem 1.

Under Assumption 1, there exist optimal stationary, deterministic policies $\hat{\pi}$, $\pi^0$ for the M3FC MDP (3) by choosing $(\hat{\pi}(x^0, \mu), \pi^0(x^0, \mu))$ from the maximizers of $\operatorname*{arg\,max}_{(h, u^0) \in \mathcal{H}(\mu) \times \mathcal{U}^0} r(x^0, u^0, \mu) + \gamma \mathbb{E}_{y^0 \sim p^0(y^0 \mid x^0, u^0, \mu)} V^*(y^0, T(x^0, u^0, \mu, h))$.

Remark 3.

We obtain existence of optimal deterministic stationary minor and major policies $\hat{\pi}$, $\pi^0$ via optimal joint policies $\tilde{\pi} \equiv \hat{\pi} \otimes \pi^0$, $(h_t, u^0_t) \sim \tilde{\pi}((h_t, u^0_t) \mid x^0_t, \mu_t)$.

The results follow from classical MDP theory (Hernández-Lerma and Lasserre, 2012). Thus, we may solve M3FC problems through the DPP, or approximately by using policy gradients with stationary policies for the M3FC MDP, which has naturally continuous actions.
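To make the DPP concrete, the sketch below runs value iteration for the Bellman equation above on a deliberately tiny, entirely hypothetical instance: two minor states, two minor actions, a binary major state and action, and toy choices for $r$, $p^0$ and $T$ (here $T$ does not depend on the major agent, purely for brevity). The mean field is summarized by $m = \mu(1)$ on a uniform grid and the decision rule by a coarse grid of probabilities, so this is a discretized approximation rather than an exact solver.

```python
import itertools

K = 10                                  # grid resolution for m = mu(1)
GRID = [k / K for k in range(K + 1)]    # discretized mean fields
PROBS = [0.0, 0.5, 1.0]                 # discretized values of pi'(u=1 | x)
GAMMA = 0.9                             # discount factor


def reward(x0, u0, m):
    """Toy r(x0, u0, mu): minor agents should gather at the major state."""
    return -abs(m - x0) - 0.1 * float(u0 != x0)


def next_mf(m, a0, a1):
    """Toy T: next mass in state 1 for pi'(1|x=0)=a0 and pi'(1|x=1)=a1.

    A minor agent reaches its chosen target state w.p. 0.9, else it stays.
    """
    from_0 = (1.0 - m) * a0 * 0.9
    from_1 = m * (a1 + (1.0 - a1) * 0.1)
    return from_0 + from_1


def major_kernel(x0, u0):
    """Toy p^0: the major agent moves to u0 w.p. 0.8, else it stays at x0."""
    probs = {x0: 0.0, u0: 0.0}
    probs[u0] += 0.8
    probs[x0] += 0.2
    return probs


def to_index(m):
    """Project a mean field onto the nearest grid point."""
    return min(K, max(0, round(m * K)))


def bellman_backup(V):
    """One application of the Bellman operator, with greedy actions."""
    V_new, policy = {}, {}
    for x0, k in itertools.product((0, 1), range(K + 1)):
        m = GRID[k]
        best, best_act = float("-inf"), None
        for a0, a1, u0 in itertools.product(PROBS, PROBS, (0, 1)):
            k_next = to_index(next_mf(m, a0, a1))
            q = reward(x0, u0, m) + GAMMA * sum(
                p * V[(y0, k_next)] for y0, p in major_kernel(x0, u0).items())
            if q > best:
                best, best_act = q, (a0, a1, u0)
        V_new[(x0, k)] = best
        policy[(x0, k)] = best_act
    return V_new, policy


def value_iteration(n_iters=200):
    """Iterate the backup; returns values and a stationary greedy policy."""
    V = {(x0, k): 0.0 for x0 in (0, 1) for k in range(K + 1)}
    policy = {}
    for _ in range(n_iters):
        V, policy = bellman_backup(V)
    return V, policy
```

The returned `policy` depends only on $(x^0, \mu)$ and not on time, which is the stationarity property stated above. The grid over the simplex grows quickly with $|\mathcal{X}|$, which is one reason to prefer policy gradient methods beyond toy cases.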

2.4 Finite Agent Convergence

Next, in order to show the approximate optimality of M3FC solutions, we first obtain propagation of chaos (Sznitman, 1991), i.e. convergence of empirical MFs to the limiting MF. The result theoretically backs the reduction of multi-agent control to single-agent MDPs, as the loss of optimality in the finite problem incurred by considering the M3FC problem vanishes for large systems. We assume standard Lipschitz conditions on policies (Gu et al., 2021; Mondal et al., 2022; Pásztor et al., 2023).

Assumption 2.

The classes of policies $\Pi$, $\Pi^0$ are equi-Lipschitz sets of policies, i.e. there exists $L_\Pi > 0$ such that for all $t$ and $\pi \in \Pi$, $\pi_t \in \mathcal{P}(\mathcal{U})^{\mathcal{X} \times \mathcal{P}(\mathcal{X})}$ is $L_\Pi$-Lipschitz, and similarly for major policies $\pi^0 \in \Pi^0$.

We note that Lipschitz policies are natural, as we usually parametrize policies in a Lipschitz manner; in particular, neural networks allow Lipschitz analysis (Pásztor et al., 2023; Herrera et al., 2023; Araujo et al., 2023). The result is that the limiting system approximates large finite systems.

Theorem 2.

Fix any family of equi-Lipschitz functions $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}^0 \times \mathcal{U}^0 \times \mathcal{P}(\mathcal{X})}$ with shared Lipschitz constant $L_{\mathcal{F}}$. Under Assumptions 1 and 2, $(x^{0,N}_t, u^{0,N}_t, \mu^N_t)$ converges weakly to $(x^0_t, u^0_t, \mu_t)$, uniformly over $f \in \mathcal{F}$, $(\pi, \pi^0) \in \Pi \times \Pi^0$, $\hat{\pi} = \Phi^{-1}(\pi)$ at all times $t \in \mathbb{N}$,

$$\sup_{f, \pi, \pi^0} \left| \mathbb{E}\left[ f(x^{0,N}_t, u^{0,N}_t, \mu^N_t) - f(x^0_t, u^0_t, \mu_t) \right] \right| \to 0. \tag{4}$$

Further, the convergence rate is $\mathcal{O}(1/\sqrt{N})$ if $|\mathcal{X}| < \infty$.
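To give some intuition for the $1/\sqrt{N}$ rate, consider the simplest special case, which is an illustrative assumption and not the setting of the proof: at a fixed time $t$, the $N$ minor states are i.i.d. with law $\mu_t$ on a finite $\mathcal{X}$. Then $N \mu^N_t(x)$ is binomial, and Jensen's inequality gives

```latex
\mathbb{E}\left| \mu^N_t(x) - \mu_t(x) \right|
\le \sqrt{\operatorname{Var}\left( \mu^N_t(x) \right)}
= \sqrt{\frac{\mu_t(x) \left( 1 - \mu_t(x) \right)}{N}}
\le \frac{1}{2\sqrt{N}},
\qquad
\mathbb{E}\left\| \mu^N_t - \mu_t \right\|_1
\le \frac{|\mathcal{X}|}{2\sqrt{N}}.
```

For an $L_{\mathcal{F}}$-Lipschitz $f$ (with respect to the $\ell_1$ distance, again an assumption made here for concreteness), the error in this special case is therefore at most $L_{\mathcal{F}} |\mathcal{X}| / (2\sqrt{N})$. In the actual system the minor agents are coupled through the MF and the major agent, so the general statement requires propagating such bounds over time.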

The above motivates M3FC by the following near optimality result of M3FC MDP solutions in the finite system, as it suffices to optimize over stationary M3FC policies.

Corollary 1.

Under Assumptions 1 and 2, optimal deterministic M3FC MDP policies $(\hat{\pi}^*, \pi^{0*}) \in \operatorname*{arg\,max}_{(\hat{\pi}, \pi^0)} J(\hat{\pi}, \pi^0)$ with $\Phi(\hat{\pi}^*) \in \Pi$ yield $\varepsilon$-optimal $(\Phi(\hat{\pi}^*), \pi^{0*})$ with $\varepsilon \to 0$ as $N \to \infty$ in the finite system, $J^N(\Phi(\hat{\pi}^*), \pi^{0*}) \geq \sup_{(\pi, \pi^0) \in \Pi \times \Pi^0} J^N(\pi, \pi^0) - \varepsilon$.

[Diagram with nodes "$N$-minor agent control", "M3FC", "Optimal $N$-minor agent control" and "M3FC policy", and arrows labeled "optimize (intractable)", "$N \to \infty$", "optimize" and "approx."]
Figure 4: Approximation of intractable $N$-agent control by M3FC (blue path), the solution of which is near-optimal for large $N$.

Therefore, one may solve difficult finite-agent MARL by detouring over the corresponding M3FC MDP as depicted in Figure 4, reducing to an MDP of a complexity independent of the number of agents $N$, which we solve in Section 3.

3 Major-Minor Mean Field MARL

As indicated above and in Figure 2, MARL via M3FC generalizes both single-agent RL and MARL via MFC in the searched policy solution space. Therefore, in M3FC one only optimizes over a tractable, smaller solution space of a single minor and major policy $\Pi, \Pi^0$. At the same time, the framework is highly general and handles arbitrary major agents with many minor agents simultaneously. The reduction of MARL problems to a fixed-complexity single-agent M3FC MDP is the key. In this section, we develop MARL algorithms based on the M3FC framework.

Recalling the motivation of MFC, it is crucial to find tractable sample-based MARL techniques for both complex problems where other methods fail, and for problems where we have no access to the dynamics or reward model. Relating to the former, RL has been applied before to solve MFC given that we know the MFC model equations (Carmona et al., 2023b; Pásztor et al., 2023; Mondal et al., 2022). However, regarding the latter, we should instead use the MFC formalism to give rise to novel MARL algorithms.

While the literature has usually focused its analysis on the former, in our work we analyze the proposed algorithm not on limiting M3FC MDPs, but on the more interesting finite M3FC system. In particular, if the M3FC MDP is known, one can instantiate finite systems of any size for training. We consider the following perspective: By Theorem 2, the M3FC MDP is approximated well by the finite system. Therefore, we can solve the limiting M3FC MDP by applying our proposed algorithm directly to finite M3FC systems.

Since we know by Theorem 1 that stationary policies suffice, we solve the M3FC MDP (3) using stationary policies and single-agent RL techniques but on its finite multi-agent instance (1), the combination of which we aptly refer to as Major-Minor Mean Field MARL (M3FMARL). The result is Algorithm 1, where we directly apply RL to multi-agent systems (1) by observing next states $(x^{0,N}_{t+1}, \mu^N_{t+1})$ and rewards $r^N_t \coloneqq r(x^{0,N}_t, u^{0,N}_t, \mu^N_t)$. The algorithm can be understood as a kind of hierarchical algorithm, as M3FC MDP actions specify behavior for all minor agents at once.

Algorithm 1 M3FMARL
for $n = 0, 1, \ldots$ do
   for $t = 0, \ldots, B_{\mathrm{len}} - 1$ do
      Sample M3FC action from RL policy, i.e. $u_t \equiv (u^{0,N}_t, \pi'_t) \sim \tilde{\pi}^\theta(\cdot \mid x^{0,N}_t, \mu^N_t)$.
      for $i = 1, \ldots, N$ do
         Sample $i$-th minor action $u^{i,N}_t \sim \pi'_t(\cdot \mid x^{i,N}_t)$.
      end for
      Execute $\{u^{0,N}_t, u^{1,N}_t, \ldots\}$ for next reward $r^N_t$, state $(x^{0,N}_{t+1}, \mu^N_{t+1})$ and termination $d_{t+1} \in \{0, 1\}$.
   end for
   Perform an update (on policy $\tilde{\pi}^\theta$) using transitions $B = ((x^{0,N}_t, \mu^N_t), u_t, r^N_t, d_{t+1}, (x^{0,N}_{t+1}, \mu^N_{t+1}))_{t \geq 0}$.
end for
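The inner loop of Algorithm 1 can be sketched in plain Python as follows. Everything here is a stand-in chosen for illustration: `ToyFiniteSystem` is a hypothetical two-state environment, `random_policy` replaces the learned policy $\tilde{\pi}^\theta$, and the update step is left to an external PG learner. The point is only the data flow: one M3FC action per step, from which all $N$ minor actions are sampled.

```python
import random
from typing import List, Tuple

N_STATES, N_ACTIONS = 2, 2
Obs = Tuple[int, List[float]]               # (major state, empirical mean field)
M3FCAction = Tuple[int, List[List[float]]]  # (major action, rule pi'[x][u])


def empirical_mf(states: List[int]) -> List[float]:
    """Empirical mean field mu^N: fraction of minor agents in each state."""
    return [states.count(s) / len(states) for s in range(N_STATES)]


class ToyFiniteSystem:
    """Hypothetical finite system: N minor agents plus one major agent."""

    def __init__(self, n_minor: int, rng: random.Random) -> None:
        self.rng = rng
        self.minor = [rng.randrange(N_STATES) for _ in range(n_minor)]
        self.major = 0

    def observe(self) -> Obs:
        return self.major, empirical_mf(self.minor)

    def step(self, u0: int, minor_actions: List[int]) -> Tuple[float, Obs, bool]:
        # Toy reward r(x0, u0, mu): minor agents should gather at the major state.
        reward = -abs(empirical_mf(self.minor)[1] - self.major)
        # Each agent reaches its chosen target state with probability 0.9.
        self.minor = [u if self.rng.random() < 0.9 else x
                      for x, u in zip(self.minor, minor_actions)]
        self.major = u0 if self.rng.random() < 0.8 else self.major
        return reward, self.observe(), False  # this toy task never terminates


def random_policy(obs: Obs, rng: random.Random) -> M3FCAction:
    """Stand-in for the RL policy: ignores obs, samples (u0, pi') at random."""
    rule = []
    for _ in range(N_STATES):
        p = rng.random()
        rule.append([1.0 - p, p])
    return rng.randrange(N_ACTIONS), rule


def collect_batch(env: ToyFiniteSystem, policy, batch_len: int,
                  rng: random.Random) -> list:
    """Roll out batch_len steps and return the transitions B."""
    batch = []
    obs = env.observe()
    for _ in range(batch_len):
        u0, rule = policy(obs, rng)                      # sample M3FC action
        minor_actions = [rng.choices(range(N_ACTIONS), weights=rule[x])[0]
                         for x in env.minor]            # sample minor actions
        reward, next_obs, done = env.step(u0, minor_actions)
        batch.append((obs, (u0, rule), reward, done, next_obs))
        obs = next_obs
    return batch  # a PG learner (e.g. PPO) would now update the policy on B
```

Note that the learner only ever sees $(x^{0,N}_t, \mu^N_t)$ and the M3FC action, so the size of each transition does not grow with the number of minor agents.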

3.1 M3FC-based Policy Gradients

The proposed algorithm can be theoretically motivated. As shown in the following, finite-agent policy gradients (PG) estimate the true limiting M3FC MDP PG. First, note that finite state-actions $\mathcal{X}, \mathcal{U}$ lead to continuous M3FC MDP actions $\mathcal{H}(\mu)$, while continuous $\mathcal{X}, \mathcal{U}$ even yield infinite-dimensional $\mathcal{H}(\mu)$. Therefore, we have at least continuous MDPs, complicating value-based learning.

For this reason, we mainly consider PG methods to solve M3FC-type MARL problems. We parametrize M3FC MDP solutions via RL policies $\tilde{\pi}^\theta$ with parameters $\theta$, outputting $\xi \in \Xi$ from some compact parameter space $\Xi$ with a Lipschitz map $\Gamma(\xi) = \pi'_t$ to $L_\Pi$-Lipschitz minor agent decision rules $\pi'_t$ (formally, $h_t = \mu_t \otimes \pi'_t$). Assuming the Lipschitzness of the policy network and its gradient in all arguments, on which there is a large body of recent literature (see e.g. Herrera et al. (2023); Araujo et al. (2023) and references therein), we formulate Assumption 3.

Assumption 3.

The parameter map $\Gamma$, joint policy $\tilde{\pi}^\theta$ and log-gradient $\nabla_\theta \log \tilde{\pi}^\theta$ (or gradient $\nabla_\theta \tilde{\pi}^\theta$) are $L_\Gamma$, $L_{\tilde{\pi}}$, $L_{\nabla \tilde{\pi}}$-Lipschitz and uniformly bounded.

Then, we can apply the PG theorem (Sutton et al., 1999) for the M3FC MDP. The M3FC MDP (3) essentially substitutes many-agent systems (1), which are natural approximations of the M3FC MDP by Theorem 2. Therefore, we show that M3FMARL (Algorithm 1), i.e. single-agent PG on the multi-agent M3FC system, approximates the true PG of the limiting M3FC MDP, in the case of many minor agents. In other words, M3FMARL solves MARL by approximately solving the single-agent M3FC MDP using policy gradients.

Theorem 3.

Under Assumptions 1, 2 and 3, the approximate PG of joint policy $\tilde{\pi}^\theta$ computed on the finite M3FC system (1) in Algorithm 1 uniformly tends to the true PG of the M3FC MDP (3), as $N \to \infty$.
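For orientation, the PG theorem invoked above can be written in its standard discounted form for the M3FC MDP. The symbols $Q^\theta$ (action-value function under $\tilde{\pi}^\theta$) and $d^\theta$ (discounted state visitation measure) are notation introduced here for illustration, and the normalization used in the paper's proofs may differ:

```latex
\nabla_\theta J(\tilde{\pi}^\theta)
\propto
\mathbb{E}_{(x^0, \mu) \sim d^\theta,\; (\xi, u^0) \sim \tilde{\pi}^\theta(\cdot \mid x^0, \mu)}
\left[
  Q^\theta(x^0, \mu, \xi, u^0) \,
  \nabla_\theta \log \tilde{\pi}^\theta(\xi, u^0 \mid x^0, \mu)
\right].
```

The finite-agent estimate in Algorithm 1 evaluates the same expression on samples $(x^{0,N}_t, \mu^N_t)$ from the finite system instead of $(x^0_t, \mu_t)$, which is where the Lipschitz and boundedness conditions on $\tilde{\pi}^\theta$ and its log-gradient enter.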

Importantly, the underlying MDP complexity is independent of the number of minor agents. Therefore, we would expect Algorithm 1 to be able to perform well in M3FC-type problems, possibly compared to straightforward MARL where each agent is handled separately. Intuitively, for many agents, the reward signal for any single agent can become uninformative: A cooperative, “averaged” reward remains almost unaffected by a single agent’s actions. This well-known credit assignment issue is therefore solved by the hierarchical structure of M3FC, as credit is assigned to M3FC actions, which affect all minor agents at once and hence receive aggregated credit. Another advantage is that MFC profits from any advances in single-agent RL.

Figure 5: Training curves (mean episode return) of M3FPPO (red), with shaded standard deviation, and maximum (blue) over all three trials (two for Foraging). (a) 2G; (b) Formation; (c) Beach; (d) Foraging; (e) Potential.

3.2 Implementation Details

We use the proximal policy optimization (PPO) algorithm (Schulman et al., 2017) to obtain an M3FC policy $\pi_{\mathrm{RL}}$, instantiating the major minor mean field PPO (M3FPPO) algorithm as an instance of M3FMARL, Algorithm 1. Other PG algorithms (A2C, leading to M3FA2C) are also compared in our experiments. We parametrize MFs in $\mathcal{P}(\mathcal{X})$ and joint distributions in $\mathcal{H}(\mu^N_t)$. In practice, for finite $\mathcal{X}$, $\mathcal{U}$, the parametrization of $\mathcal{P}(\mathcal{X})$ is immediate by finite-dimensional vectors $\mu^N_t \in \mathcal{P}(\mathcal{X})$. For M3FC actions, consider – in addition to the major agent action – the matrix $\xi \in [-1,1]^{\mathcal{X} \times \mathcal{U}}$, which is mapped to probabilities of minor actions in any minor state $\pi'_t(u \mid x) \coloneqq Z^{-1}(\xi_{xu} + 1 + \epsilon)$, for small $\epsilon = 10^{-10}$ and normalizer $Z$. 
For continuous $\mathcal{X}$, $\mathcal{U}$, we instead partition $\mathcal{X}$ into $M$ bins and represent $\mu^N_t$ as a histogram, mapping $\xi \in [-1,1]^{M \times 2}$ to diagonal Gaussian means and standard deviations, $\mu_{\mathcal{X}_i} \in \mathcal{U}$, $\sigma_{\mathcal{X}_i} \in [\epsilon, 0.25+\epsilon]$, for each of $M$ bins $\mathcal{X}_i \subseteq \mathcal{X}$. Major actions $u^{0,N}_t$ are categorical or diagonal Gaussian as usual. For large $\mathcal{X}, \mathcal{U}$, one could also consider kernel-based parametrizations (Cui et al., 2024).
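The two parametrizations above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names are hypothetical, the normalizer $Z$ is assumed to act per minor state so that each row sums to one, and the affine rescaling of the matrix entries to Gaussian means and standard deviations (for a one-dimensional action interval `[u_low, u_high]`) is one plausible choice that the text does not specify.

```python
import numpy as np

EPS = 1e-10  # the small epsilon from the text


def discrete_minor_policy(xi: np.ndarray) -> np.ndarray:
    """Map xi in [-1, 1]^{|X| x |U|} to probabilities pi'(u | x).

    Assumption: Z normalizes each row (one row per minor state x).
    """
    weights = np.clip(xi, -1.0, 1.0) + 1.0 + EPS  # strictly positive entries
    return weights / weights.sum(axis=1, keepdims=True)


def binned_gaussian_policy(xi: np.ndarray, u_low: float, u_high: float):
    """Map xi in [-1, 1]^{M x 2} to per-bin Gaussian means and std deviations.

    Assumption: column 0 is rescaled affinely to [u_low, u_high] and
    column 1 affinely to [EPS, 0.25 + EPS]; the paper only states the ranges.
    """
    xi = np.clip(xi, -1.0, 1.0)
    means = u_low + 0.5 * (xi[:, 0] + 1.0) * (u_high - u_low)
    stds = EPS + 0.25 * 0.5 * (xi[:, 1] + 1.0)
    return means, stds


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pi = discrete_minor_policy(rng.uniform(-1, 1, size=(4, 3)))
    print(pi.sum(axis=1))  # each row is a distribution over minor actions
    means, stds = binned_gaussian_policy(rng.uniform(-1, 1, size=(49, 2)), -1.0, 1.0)
    print(means.shape, stds.min(), stds.max())
```

All minor agents whose state falls into the same bin (or equals the same discrete state) share one row of the resulting policy, which is what keeps the M3FC action space independent of the number of agents.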

We use two hidden layers of $256$ nodes and $\tanh$ activations for the neural networks of the policies. The neural network policy outputs parameters of a diagonal Gaussian over the major action $u^0$ and matrices $U$ as discussed above. In the discrete Beach scenario below, the neural network instead outputs a categorical distribution using a final softmax layer. We used no GPUs and around 300,000 CPU core hours on Intel Xeon Platinum 9242 CPUs. Optimal transport costs are computed using POT (Flamary et al., 2021). Our M3FC MDP implementation follows the gym interface (Brockman et al., 2016), while the implementation of multi-agent RL as in the following fulfills RLlib interfaces (Liang et al., 2018). The RL implementations in our work are based on MARLlib 1.0 (Hu et al., 2023a) (MIT license), which uses RLlib 1.8 (Liang et al., 2018) (Apache-2.0 license) with hyperparameters in Table 3, and otherwise default settings.

3.3 Comparison to MARL

The M3FMARL algorithm falls into the paradigm of centralized training with decentralized execution (CTDE) (Zhang et al., 2021), as we sample a single central M3FC MDP action during training, but enable decentralized execution by sampling $\pi'_t$ separately on each agent instead. For instance, when converged to a deterministic M3FC policy (of which an optimal one is guaranteed to exist by Theorem 1), the M3FC action is always trivially equal for all agents.
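The difference between centralized and decentralized execution can be sketched as follows, for the finite-state case. This is an illustrative sketch, not the paper's code. The helper names are hypothetical, and the M3FC policy is assumed to be a diagonal Gaussian over the matrix $\xi$ with given mean and standard deviation.

```python
import numpy as np

EPS = 1e-10


def to_policy(xi: np.ndarray) -> np.ndarray:
    """Map xi in [-1, 1]^{... x |X| x |U|} to pi'(u | x), normalizing over u."""
    weights = np.clip(xi, -1.0, 1.0) + 1.0 + EPS
    return weights / weights.sum(axis=-1, keepdims=True)


def sample_minor_actions(rng, states, xi_mean, xi_std, decentralized):
    """Sample one minor action per agent.

    decentralized=False: one shared M3FC action xi for all agents (training).
    decentralized=True:  every agent draws its own xi (agent-wise randomization).
    """
    n_agents = len(states)
    shape = (n_agents, *xi_mean.shape)
    if decentralized:
        xi = rng.normal(xi_mean, xi_std, size=shape)
    else:
        xi = np.broadcast_to(rng.normal(xi_mean, xi_std), shape)
    # Row of pi' that applies to each agent's own state, shape (n_agents, |U|).
    probs = to_policy(xi)[np.arange(n_agents), states]
    return np.array([rng.choice(probs.shape[1], p=p) for p in probs])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    states = rng.integers(0, 3, size=10)  # 10 minor agents, 3 minor states
    xi_mean = rng.uniform(-1, 1, size=(3, 2))  # 3 states, 2 actions
    print(sample_minor_actions(rng, states, xi_mean, 0.1, decentralized=False))
    print(sample_minor_actions(rng, states, xi_mean, 0.1, decentralized=True))
```

With `xi_std = 0` both branches produce the same $\pi'_t$ for every agent, which mirrors the remark that a deterministic M3FC policy yields the same M3FC action on all agents.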

Since we also consider continuous minor agent action spaces in our experiments, we compare against PG methods for MARL. In particular, we firstly consider Independent PPO (IPPO), as PPO with independent learning (Tan, 1993) and parameter sharing (Gupta et al., 2017), and secondly also Multi-Agent PPO (MAPPO) with centralized critics. The latter has repeatedly shown strong state-of-the-art performance in cooperative MARL (de Witt et al., 2020; Papoudakis et al., 2021; Yu et al., 2022). We also separate major and minor agent policies for improved performance of IPPO / MAPPO. For comparison, we use the same observations for the policy input as in M3FMARL. The policy network architectures match, and the same PPO implementation and hyperparameters are shared with M3FPPO in Table 3. Minor agents are additionally allowed to observe their own states. More details can be found in Appendix R.

Table 2: Comparison of mean episode returns between best trained policies of standard MARL and M3FMARL methods on a system with $N=20$ agents ($\pm$ $95\%$ confidence interval, for a number of episodes as in Figure 9).

| Problem | IPPO | MAPPO | M3FA2C | M3FPPO |
|---|---|---|---|---|
| 2G | -43.9 ± 1.1 | -26.0 ± 0.5 | -30.6 ± 0.6 | -22.2 ± 0.56 |
| Formation | -51.1 ± 2.4 | -101.1 ± 7.1 | -79.2 ± 3.1 | -63.9 ± 4.2 |
| Beach | -350.3 ± 3.4 | -342.9 ± 4.7 | -424.8 ± 5.5 | -303.5 ± 3.4 |
| Foraging | 735.3 ± 46.4 | 803.9 ± 54.6 | 1398.0 ± 57.1 | 1479.4 ± 36.3 |
| Potential | -27.1 ± 1.4 | -26.7 ± 1.7 | -50.4 ± 5.5 | -31.3 ± 1.3 |

Figure 6: Training curves (mean episode return vs. time steps) of M3FPPO, trained on the finite systems with $N \in \{5, 10, 20\}$. (a) 2G; (b) Formation; (c) Beach; (d) Foraging; (e) Potential.

Figure 7: Comparing IPPO / MAPPO vs. results of M3FPPO (MF, ours), as in Figure 5 (no maxima, $N=20$).

Figure 9: Mean episode return of M3FC policy in finite systems as in Figure 5 over (a-c) $100$, (d) $300$ or (e) $500$ trials (95% confidence interval shaded). MF: CE, $N=500$; CE / DE: centralized / decentralized execution.

4 Experiments

In this section, we demonstrate the performance of M3FPPO on illustrative, practical problems. Unless noted otherwise, we use $M=49$ bins ($M=7$ in Potential), train for around $24$ hours, and train M3FPPO on the finite-agent system (1) with $N=300$ minor agents (similar results for fewer agents in Appendix R). Full descriptions, additional experiments, and discussions are in Appendix R.
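To make the binned mean field concrete, the following sketch computes a histogram representation of $\mu^N_t$ from minor agent states. It assumes, purely for illustration, a two-dimensional square state space and a regular $7 \times 7$ grid, which would give $M=49$ bins. The function name, the bounds, and the grid layout are hypothetical and are not taken from the paper.

```python
import numpy as np


def empirical_mean_field(states: np.ndarray, low: float, high: float,
                         bins_per_dim: int) -> np.ndarray:
    """Histogram of minor states (shape (N, d)) on the box [low, high]^d.

    Returns a flat probability vector with bins_per_dim ** d entries.
    """
    d = states.shape[1]
    edges = [np.linspace(low, high, bins_per_dim + 1)] * d
    # Clip so that agents on or beyond the boundary still land in an outer bin.
    counts, _ = np.histogramdd(np.clip(states, low, high), bins=edges)
    return counts.ravel() / len(states)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    minor_states = rng.normal(0.0, 0.5, size=(300, 2))  # N = 300 minor agents
    mu = empirical_mean_field(minor_states, -1.0, 1.0, bins_per_dim=7)
    print(mu.shape, mu.sum())
```

The resulting vector has a fixed length regardless of $N$, so the same policy network input can be used when the number of minor agents changes.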

Figure 8: Qualitative visualization of M3FC in Beach (a-d), Foraging (e-h). (a-d): empirical MF, major agent & target in green; (e-h): blue / green triangle: major agent / target; green / red dots: less- / more-than-half encumbered minor agents; purple: current foraging areas.

4.1 Problems

To verify the usefulness of M3FMARL whenever the M3FC model (1) is accurate, we consider $5$ benchmark tasks that fulfill the M3FC modelling assumptions. To begin, the simple two Gaussian (2G) problem has no major agent and is equipped with a time-dependent major state: A periodic, time-variant mixture of two Gaussians $\mu^*_t$ – the major state – is noisily observed analogously to $\mu^N_t$ via $M=49$ bins. Minor agents should then track the mixture distribution over time, which can find application for example in UAV-based cellular coverage of dynamic users (Mozaffari et al., 2016). In the Formation problem, we extend such formation control with major agents. In addition to 2G, one added major agent tracks a moving target. Meanwhile, minor agents instead track a formation around the dynamic major agent, see e.g. Yang et al. (2021) for applications. The Beach bar process is a studied classic (Arthur, 1994; Perrin et al., 2020), where minor agents minimize their distances to a bar and additionally avoid crowded areas. Here, the bar moves on a discrete torus. The Foraging problem is archetypal of swarm intelligence (Brambilla et al., 2013), and has agents forage randomly generated foraging areas. In particular, we can consider the logistics scenario depicted in Figure 1, where a major package truck moves in a restricted space (roads) while minor drones collect packages for urban parcel delivery (Marinelli et al., 2018). Drones fill up at package “foraging” areas, and unload near the major agent. 
Lastly, in the Potential problem, minor agents can generate a potential landscape, the gradient of which pushes the major agent – e.g., a large object affected by magnetic active matter (Jin and Zhang, 2021) – to be delivered to a variable target.

4.2 Evaluation

In Figure 5, we see that M3FPPO learning is stable, as M3FPPO reduces hard-to-analyze MARL to single-agent RL, avoiding pathologies of MARL such as non-stationarity of multi-agent learning, or the combinatorial complexity over numbers of agents. In Figure 6, we find similar success in directly training M3FPPO for small $N$ instead of transferring from high $N$. We conclude that M3FPPO remains applicable even with as few as $5$ agents. M3FPPO usually compares well against its A2C variant (M3FA2C) and IPPO / MAPPO, see Table 2 and Appendix R.2. Meanwhile, IPPO / MAPPO under the same hyperparameters as M3FPPO (large batch sizes, see Table 3) can be more unstable and lead to worse results, see Figure 7.

Qualitative behavior.

In Figure 8, we observe successfully trained behavior in Beach and Foraging: In Beach, M3FPPO learns to accumulate up to $70\%$ of agents on the bar, as more agents on the space lead to a suboptimal reduction in rewards. In Foraging, we find that agents successfully deplete foraging areas shown in the bottom left, moving on afterwards. Further, M3FPPO successfully learns to form mixtures of Gaussians in 2G, a Gaussian around a moving major agent successfully tracking its target in Formation, and similar success in pushing the major agent towards its target in Potential, see Appendix R.3.

Quantitative support of theory.

In Figure 9, we transfer the trained M3FPPO policy to $N = 2, \ldots, 50$, comparing against the performance in the limit ($N=500$). As $N$ grows, the performance converges to the limit, supporting Theorem 2 and Corollary 1. Any sufficiently large system has the same limiting performance as predicted by the theory. We thus have empirical support for scalability, and also transferability between varying numbers of minor agents.
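As a loose numerical illustration of why finite systems approach the limit, the sketch below measures how fast the empirical distribution of $N$ independent samples on a finite state space approaches the true distribution. It is a toy experiment and not one of the paper's experiments. It ignores control, dynamics, and the major agent, and only shows the familiar $1/\sqrt{N}$ decay of the sampling error of an empirical mean field. All names and the choice of a uniform distribution on five states are assumptions made for the example.

```python
import numpy as np


def mean_l1_error(rng, p: np.ndarray, n_agents: int, n_trials: int = 2000) -> float:
    """Average L1 distance between the empirical distribution of n_agents
    i.i.d. samples and the true distribution p, over n_trials repetitions."""
    counts = rng.multinomial(n_agents, p, size=n_trials)
    return float(np.abs(counts / n_agents - p).sum(axis=1).mean())


rng = np.random.default_rng(0)
p = np.full(5, 0.2)  # toy "limiting mean field" on 5 states
sizes = [2, 5, 10, 20, 50, 500]
errors = {n: mean_l1_error(rng, p, n) for n in sizes}

for n, err in errors.items():
    # If the error decays like 1 / sqrt(N), the last column stays roughly constant.
    print(f"N={n:4d}  L1 error={err:.4f}  error*sqrt(N)={err * np.sqrt(n):.3f}")
```

The error shrinks monotonically in $N$ in this toy setting, and the rescaled error settles to a roughly constant value for larger $N$.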

Comparison to MARL.

Comparing Figures 5, 7 and Table 2, we see that (i) by experience sharing, standard MARL can be more sample-efficient, as each step gives $N$ samples instead of just one; and (ii) M3FPPO matches or outperforms IPPO and MAPPO, despite having significantly less control over minor agent actions: All minor agents in a bin (with similar minor agent states) use the same action distributions, which suffices for strong results.

Decentralized execution.

Lastly, decentralized execution by agent-wise randomization – i.e. sampling M3FC actions per agent instead of a single shared, correlated M3FC action – has little to no effect, and can even marginally improve performance, see e.g., Beach in Figure 9(c). Figure 9 verifies the performance of M3FMARL as a CTDE method.

5 Conclusion and Discussion

We have proposed a generalization of MDPs and MFC, enabling tractable state-of-the-art MARL on general many-agent systems, with both theoretical and empirical support. Beyond the current model and its optimality guarantees, one could work on extended optimality conjectures in Appendix Q, refined approximations (Gast and Van Houdt, 2018), and local interactions (Qu et al., 2020b). Algorithmically, M3FC MDP actions $\mathcal{H}(\mu)$ could move beyond binning $\mathcal{X}$ to gain performance, e.g. via kernels. Lastly, one may try to quantify convergence to the rate $\mathcal{O}(1/\sqrt{N})$ for non-finite $\mathcal{X}$, as the current proof strategy would need hard-to-verify or unrealistic $d_\Sigma$-Lipschitzness.

Broader Impact

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Acknowledgements

This work has been co-funded by the LOEWE initiative (Hesse, Germany) within the emergenCITY center, and the Hessian Ministry of Science and the Arts (HMWK) within the projects “The Third Wave of Artificial Intelligence - 3AI” and hessian.AI. The authors acknowledge the Lichtenberg high performance computing cluster of the TU Darmstadt for providing computational facilities for the calculations of this research. We thank anonymous reviewers for their helpful comments to improve the manuscript.

References

  • Anderson and Kurtz (2011) David F Anderson and Thomas G Kurtz. Continuous time Markov chain models for chemical reaction networks. In Design and Analysis of Biomolecular Circuits, pages 3–42. Springer, 2011.
  • Araujo et al. (2023) Alexandre Araujo, Aaron J Havens, Blaise Delattre, Alexandre Allauzen, and Bin Hu. A unified algebraic perspective on Lipschitz neural networks. In Proc. ICLR, pages 1–15, 2023.
  • Arthur (1994) W Brian Arthur. Inductive reasoning and bounded rationality. Am. Econ. Rev., 84(2):406–411, 1994.
  • Bäuerle (2023) Nicole Bäuerle. Mean field Markov decision processes. Appl. Math. Optim., 88(1):12, 2023.
  • Bensoussan et al. (2013) Alain Bensoussan, Jens Frehse, and Phillip Yam. Mean field games and mean field type control theory, volume 101. Springer, 2013.
  • Bernstein et al. (2002) Daniel S Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of Markov decision processes. Math. Oper. Res., 27(4):819–840, 2002.
  • Billingsley (2013) Patrick Billingsley. Convergence of probability measures. John Wiley & Sons, 2013.
  • Bonesini et al. (2022) Ofelia Bonesini, Luciano Campi, and Markus Fischer. Correlated equilibria for mean field games with progressive strategies. arXiv:2212.01656, 2022.
  • Brambilla et al. (2013) Manuele Brambilla, Eliseo Ferrante, Mauro Birattari, and Marco Dorigo. Swarm robotics: A review from the swarm engineering perspective. Swarm Intell., 7(1):1–41, 2013.
  • Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI gym. arXiv:1606.01540, 2016.
  • Cabannes et al. (2022) Theophile Cabannes, Mathieu Laurière, Julien Perolat, Raphael Marinier, Sertan Girgin, Sarah Perrin, Olivier Pietquin, Alexandre M Bayen, Eric Goubault, and Romuald Elie. Solving n-player dynamic routing games with congestion: A mean-field approach. In Proc. AAMAS, volume 21, pages 1557–1559, 2022.
  • Caines and Huang (2019) Peter E Caines and Minyi Huang. Graphon mean field games and the GMFG equations: $\varepsilon$-Nash equilibria. In Proc. IEEE CDC, pages 286–292, 2019.
  • Caines and Kizilkale (2016) Peter E Caines and Arman C Kizilkale. $\epsilon$-Nash equilibria for partially observed LQG mean field games with a major player. IEEE Trans. Automat. Contr., 62(7):3225–3234, 2016.
  • Campi and Fischer (2022) Luciano Campi and Markus Fischer. Correlated equilibria and mean field games: a simple model. Math. Oper. Res., 2022.
  • Carmona (2020) René Carmona. Applications of mean field games in financial engineering and economic theory. arXiv:2012.05237, 2020.
  • Carmona and Delarue (2018) René Carmona and François Delarue. Probabilistic Theory of Mean Field Games with Applications I-II. Springer, 2018.
  • Carmona et al. (2016) René Carmona, François Delarue, and Daniel Lacker. Mean field games with common noise. Ann. Probab., 44(6):3740–3803, 2016.
  • Carmona et al. (2023a) René Carmona, Quentin Cormier, and H Mete Soner. Synchronization in a Kuramoto mean field game. Commun. Partial. Differ. Equ., 48(9):1214–1244, 2023a.
  • Carmona et al. (2023b) René Carmona, Mathieu Laurière, and Zongjun Tan. Model-free mean-field reinforcement learning: mean-field MDP and mean-field Q-learning. Ann. Appl. Probab., 33(6B):5334–5381, 2023b.
  • Cui and Koeppl (2021) Kai Cui and Heinz Koeppl. Approximately solving mean field games via entropy-regularized deep reinforcement learning. In Proc. AISTATS, pages 1909–1917, 2021.
  • Cui and Koeppl (2022) Kai Cui and Heinz Koeppl. Learning graphon mean field games and approximate Nash equilibria. In Proc. ICLR, pages 1–31, 2022.
  • Cui et al. (2021) Kai Cui, Anam Tahir, Mark Sinzger, and Heinz Koeppl. Discrete-time mean field control with environment states. In Proc. IEEE CDC, pages 5239–5246, 2021.
  • Cui et al. (2024) Kai Cui, Sascha H. Hauck, Christian Fabian, and Heinz Koeppl. Learning decentralized partially observable mean field control for artificial collective behavior. In Proc. ICLR, 2024.
  • Daskalakis et al. (2009) Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a Nash equilibrium. SIAM J. Comput., 39(1):195–259, 2009.
  • de Witt et al. (2020) Christian Schroeder de Witt, Tarun Gupta, Denys Makoviichuk, Viktor Makoviychuk, Philip HS Torr, Mingfei Sun, and Shimon Whiteson. Is independent learning all you need in the StarCraft multi-agent challenge? arXiv:2011.09533, 2020.
  • DeVore and Lorentz (1993) Ronald A DeVore and George G Lorentz. Constructive approximation, volume 303. Springer Science & Business Media, 1993.
  • Djehiche et al. (2017) Boualem Djehiche, Alain Tcheukam, and Hamidou Tembine. Mean-field-type games in engineering. AIMS Electron. Electr. Eng., 1(1):18–73, 2017.
  • Dunyak and Caines (2021) Alex Dunyak and Peter E Caines. Large scale systems and SIR models: A featured graphon approach. In Proc. IEEE CDC, pages 6928–6933. IEEE, 2021.
  • Flamary et al. (2021) Rémi Flamary, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, Adrien Corenflos, Kilian Fatras, Nemo Fournier, Léo Gautheron, Nathalie T.H. Gayraud, Hicham Janati, Alain Rakotomamonjy, Ievgen Redko, Antoine Rolet, Antony Schutz, Vivien Seguy, Danica J. Sutherland, Romain Tavenard, Alexander Tong, and Titouan Vayer. POT: Python optimal transport. J. Mach. Learn. Res., 22(78):1–8, 2021.
  • Ganapathi Subramanian et al. (2020) Sriram Ganapathi Subramanian, Pascal Poupart, Matthew E Taylor, and Nidhi Hegde. Multi type mean field reinforcement learning. In Proc. AAMAS, volume 19, pages 411–419, 2020.
  • Ganapathi Subramanian et al. (2021) Sriram Ganapathi Subramanian, Matthew E Taylor, Mark Crowley, and Pascal Poupart. Partially observable mean field reinforcement learning. In Proc. AAMAS, volume 20, pages 537–545, 2021.
  • Gast and Gaujal (2011) Nicolas Gast and Bruno Gaujal. A mean field approach for optimization in discrete time. Discret. Event Dyn. Syst., 21(1):63–101, 2011.
  • Gast and Van Houdt (2018) Nicolas Gast and Benny Van Houdt. A refined mean field approximation. ACM SIGMETRICS Perform. Eval. Rev., 46(1):113–113, 2018.
  • Gu et al. (2021) Haotian Gu, Xin Guo, Xiaoli Wei, and Renyuan Xu. Mean-field controls with Q-learning for cooperative MARL: convergence and complexity analysis. SIAM J. Math. Data Sci., 3(4):1168–1196, 2021.
  • Gu et al. (2023) Haotian Gu, Xin Guo, Xiaoli Wei, and Renyuan Xu. Dynamic programming principles for mean-field controls with learning. Oper. Res., 2023.
  • Guan et al. (2024) Yue Guan, Mohammad Afshari, and Panagiotis Tsiotras. Zero-sum games between mean-field teams: Reachability-based analysis under mean-field sharing. In Proc. AAAI, volume 38, pages 9731–9739, 2024.
  • Guo et al. (2019) Xin Guo, Anran Hu, Renyuan Xu, and Junzi Zhang. Learning mean-field games. In Proc. NeurIPS, pages 4966–4976, 2019.
  • Guo et al. (2022) Xin Guo, Anran Hu, Renyuan Xu, and Junzi Zhang. A general framework for learning mean-field games. Math. Oper. Res., 2022.
  • Gupta et al. (2017) Jayesh K Gupta, Maxim Egorov, and Mykel Kochenderfer. Cooperative multi-agent control using deep reinforcement learning. In Proc. AAMAS, pages 66–83, 2017.
  • Hernández-Lerma and Lasserre (2012) Onésimo Hernández-Lerma and Jean B Lasserre. Discrete-time Markov control processes: basic optimality criteria, volume 30. Springer Science & Business Media, 2012.
  • Hernández-Lerma and Muñoz de Ozak (1992) Onésimo Hernández-Lerma and Myriam Muñoz de Ozak. Discrete-time Markov control processes with discounted unbounded costs: optimality criteria. Kybernetika, 28(3):191–212, 1992.
  • Herrera et al. (2023) Calypso Herrera, Florian Krach, and Josef Teichmann. Local Lipschitz bounds of deep neural networks. arXiv:2004.13135, 2023.
  • Hu et al. (2023a) Siyi Hu, Yifan Zhong, Minquan Gao, Weixun Wang, Hao Dong, Xiaodan Liang, Zhihui Li, Xiaojun Chang, and Yaodong Yang. MARLlib: A scalable and efficient multi-agent reinforcement learning library. J. Mach. Learn. Res., 2023a.
  • Hu et al. (2023b) Yuanquan Hu, Xiaoli Wei, Junji Yan, and Hengxi Zhang. Graphon mean-field control for cooperative multi-agent reinforcement learning. J. Franklin Inst., 2023b.
  • Huang et al. (2006) Minyi Huang, Roland P Malhamé, and Peter E Caines. Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle. Commun. Inf. Syst., 6(3):221–252, 2006.
  • Jin and Zhang (2021) Dongdong Jin and Li Zhang. Collective behaviors of magnetic active matter: Recent progress toward reconfigurable, adaptive, and multifunctional swarming micro/nanorobots. Acc. Chem. Res., 55(1):98–109, 2021.
  • Kallenberg (2017) Olav Kallenberg. Random measures, theory and applications, volume 1. Springer, 2017.
  • Kiss et al. (2017) István Z Kiss, Joel C Miller, and Péter L Simon. Mathematics of Epidemics on Networks: From Exact to Approximate Models, volume 46. Springer, 2017. doi: 10.1007/978-3-319-50806-1.
  • Lasry and Lions (2007) Jean-Michel Lasry and Pierre-Louis Lions. Mean field games. Japanese J. Math., 2(1):229–260, 2007.
  • Laurière et al. (2022) Mathieu Laurière, Sarah Perrin, Matthieu Geist, and Olivier Pietquin. Learning mean field games: A survey. arXiv:2205.12944, 2022.
  • Liang et al. (2018) Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael Jordan, and Ion Stoica. RLlib: Abstractions for distributed reinforcement learning. In Proc. ICML, pages 3053–3062, 2018.
  • Liu et al. (2022) Xin Liu, Honghao Wei, and Lei Ying. Scalable and sample efficient distributed policy gradient algorithms in multi-agent networked systems. arXiv:2212.06357, 2022.
  • Marinelli et al. (2018) Mario Marinelli, Leonardo Caggiani, Michele Ottomanelli, and Mauro Dell’Orco. En route truck–drone parcel delivery for optimal vehicle routing strategies. IET Intell. Transp. Syst., 12(4):253–261, 2018.
  • Mondal et al. (2022) Washim Uddin Mondal, Mridul Agarwal, Vaneet Aggarwal, and Satish V Ukkusuri. On the approximation of cooperative heterogeneous multi-agent reinforcement learning (MARL) using mean field control (MFC). J. Mach. Learn. Res., 23(129):1–46, 2022.
  • Mondal et al. (2023) Washim Uddin Mondal, Vaneet Aggarwal, and Satish Ukkusuri. Mean-field control based approximation of multi-agent reinforcement learning in presence of a non-decomposable shared global state. Trans. Mach. Learn. Res., 2023. ISSN 2835-8856.
  • Motte and Pham (2022) Médéric Motte and Huyên Pham. Mean-field Markov decision processes with common noise and open-loop controls. Ann. Appl. Probab., 32(2):1421–1458, 2022.
  • Motte and Pham (2023) Médéric Motte and Huyên Pham. Quantitative propagation of chaos for mean field Markov decision process with common noise. Electron. J. Probab., 28:1–24, 2023.
  • Mozaffari et al. (2016) Mohammad Mozaffari, Walid Saad, Mehdi Bennis, and Mérouane Debbah. Efficient deployment of multiple unmanned aerial vehicles for optimal wireless coverage. IEEE Commun. Lett., 20(8):1647–1650, 2016.
  • Muller et al. (2021) Paul Muller, Mark Rowland, Romuald Elie, Georgios Piliouras, Julien Perolat, Mathieu Laurière, Raphael Marinier, Olivier Pietquin, and Karl Tuyls. Learning equilibria in mean-field games: Introducing mean-field PSRO. In Proc. AAMAS, volume 20, pages 926–934, 2021.
  • Nourian and Caines (2013) Mojtaba Nourian and Peter E Caines. $\epsilon$-Nash mean field game theory for nonlinear stochastic dynamical systems with major and minor agents. SIAM J. Contr. Optim., 51(4):3302–3331, 2013.
  • Nourian et al. (2012) Mojtaba Nourian, Peter E Caines, Roland P Malhame, and Minyi Huang. Nash, social and centralized solutions to consensus problems via mean field control theory. IEEE Trans. Automat. Contr., 58(3):639–653, 2012.
  • Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Proc. NeurIPS, pages 27730–27744, 2022.
  • Papoudakis et al. (2021) Georgios Papoudakis, Filippos Christianos, Lukas Schäfer, and Stefano V Albrecht. Benchmarking multi-agent deep reinforcement learning algorithms in cooperative tasks. In Proc. NeurIPS Track Datasets Benchmarks, 2021.
  • Parthasarathy (2005) Kalyanapuram Rangachari Parthasarathy. Probability measures on metric spaces, volume 352. American Mathematical Soc., 2005.
  • Pásztor et al. (2023) Barna Pásztor, Andreas Krause, and Ilija Bogunovic. Efficient model-based multi-agent mean-field reinforcement learning. Trans. Mach. Learn. Res., 2023.
  • Pérolat et al. (2022) Julien Pérolat, Sarah Perrin, Romuald Elie, Mathieu Laurière, Georgios Piliouras, Matthieu Geist, Karl Tuyls, and Olivier Pietquin. Scaling mean field games by online mirror descent. In Proc. AAMAS, volume 21, pages 1028–1037, 2022.
  • Perrin et al. (2020) Sarah Perrin, Julien Pérolat, Mathieu Laurière, Matthieu Geist, Romuald Elie, and Olivier Pietquin. Fictitious play for mean field games: Continuous time analysis and applications. In Proc. NeurIPS, volume 33, pages 13199–13213, 2020.
  • Perrin et al. (2022) Sarah Perrin, Mathieu Laurière, Julien Pérolat, Romuald Élie, Matthieu Geist, and Olivier Pietquin. Generalization in mean field games by learning master policies. In Proc. AAAI, volume 36, pages 9413–9421, 2022.
  • Pham and Wei (2018) Huyên Pham and Xiaoli Wei. Bellman equation and viscosity solutions for mean-field stochastic control problem. ESAIM Contr. Optim. Calc. Var., 24(1):437–461, 2018.
  • Qu et al. (2020a) Guannan Qu, Yiheng Lin, Adam Wierman, and Na Li. Scalable multi-agent reinforcement learning for networked systems with average reward. In Proc. NeurIPS, volume 33, pages 2074–2086, 2020a.
  • Qu et al. (2020b) Guannan Qu, Adam Wierman, and Na Li. Scalable reinforcement learning of localized policies for multi-agent networked systems. In Proc. Learn. Dyn. Contr., pages 256–266, 2020b.
  • Saldi et al. (2018) Naci Saldi, Tamer Başar, and Maxim Raginsky. Markov–Nash equilibria in mean-field games with discounted cost. SIAM J. Contr. Optim., 56(6):4256–4287, 2018.
  • Saldi et al. (2019) Naci Saldi, Tamer Başar, and Maxim Raginsky. Partially-observed discrete-time risk-sensitive mean-field games. In Proc. IEEE CDC, pages 317–322, 2019.
  • Sanjari and Yüksel (2020) Sina Sanjari and Serdar Yüksel. Optimal solutions to infinite-player stochastic teams and mean-field teams. IEEE Trans. Automat. Contr., 66(3):1071–1086, 2020.
  • Schrittwieser et al. (2020) Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature, 588(7839):604–609, 2020.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, 2017.
  • Şen and Caines (2014) Nevroz Şen and Peter E Caines. Mean field games with partially observed major player and stochastic mean field. In Proc. IEEE CDC, pages 2709–2715, 2014.
  • Şen and Caines (2016) Nevroz Şen and Peter E Caines. Mean field game theory with a partially observed major agent. SIAM J. Contr. Optim., 54(6):3174–3224, 2016.
  • Şen and Caines (2019) Nevroz Şen and Peter E Caines. Mean field games with partial observation. SIAM J. Contr. Optim., 57(3):2064–2091, 2019.
  • Shiri et al. (2019) Hamid Shiri, Jihong Park, and Mehdi Bennis. Massive autonomous UAV path planning: A neural network based mean-field game theoretic approach. In Proc. IEEE GLOBECOM, pages 1–6. IEEE, 2019.
  • Silver et al. (2014) David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In Proc. ICML, pages 387–395. PMLR, 2014.
  • Subramanian et al. (2022) Sriram Ganapathi Subramanian, Matthew E Taylor, Mark Crowley, and Pascal Poupart. Decentralized mean field games. In Proc. AAAI, volume 36, pages 9439–9447, 2022.
  • Sutton et al. (1999) Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proc. NIPS, pages 1057–1063, 1999.
  • Sznitman (1991) Alain-Sol Sznitman. Topics in propagation of chaos. In Ecole d’été de probabilités de Saint-Flour XIX—1989, pages 165–251. Springer, 1991.
  • Tan (1993) Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proc. ICML, pages 330–337, 1993.
  • Tchuendom et al. (2021) Rinel Foguen Tchuendom, Peter E Caines, and Minyi Huang. Critical nodes in graphon mean field games. In Proc. IEEE CDC, pages 166–170. IEEE, 2021.
  • Villani (2009) Cédric Villani. Optimal transport: old and new, volume 338. Springer, 2009.
  • Vinyals et al. (2019) Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
  • Wu et al. (2023) Minghui Wu, Xingmin Wang, Yafeng Yin, Henry Liu, Ben Wang, Jerome P Lynch, et al. Leveraging connected and automated vehicles for participatory traffic control. Technical report, University of Michigan. Center for Connected and Automated Transportation, 2023.
  • Yang et al. (2018) Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field multi-agent reinforcement learning. In Proc. ICML, pages 5571–5580, 2018.
  • Yang et al. (2021) Yue Yang, Yang Xiao, and Tieshan Li. Attacks on formation control for multiagent systems. IEEE Trans. Cybern., 52(12):12805–12817, 2021.
  • Yardim et al. (2023) Batuhan Yardim, Semih Cayci, Matthieu Geist, and Niao He. Policy mirror ascent for efficient and independent learning in mean field games. In Proc. ICML, pages 39722–39754. PMLR, 2023.
  • Yu et al. (2022) Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of PPO in cooperative multi-agent games. In Proc. NeurIPS Datasets and Benchmarks, 2022.
  • Zhang et al. (2021) Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. In Kyriakos G. Vamvoudakis, Yan Wan, Frank L. Lewis, and Derya Cansever, editors, Handbook of Reinforcement Learning and Control, pages 321–384. Springer International Publishing, Cham, 2021.

Appendix A Related Work

In this section, we provide additional context on related works. Since the introduction of MFGs in continuous and discrete time [Huang et al., 2006, Lasry and Lions, 2007, Saldi et al., 2018], MFGs have been studied in various forms, ranging from partially observed systems [Saldi et al., 2019, Şen and Caines, 2019] over learning-based solutions [Guo et al., 2019, Perrin et al., 2020, Cui and Koeppl, 2021, Guo et al., 2022, Pérolat et al., 2022, Perrin et al., 2022, Yardim et al., 2023] on graphs [Caines and Huang, 2019, Tchuendom et al., 2021, Cui and Koeppl, 2022, Hu et al., 2023b] to considering correlated equilibria [Muller et al., 2021, Campi and Fischer, 2022, Bonesini et al., 2022].

While many works focus on non-cooperative settings with self-interested agents, this can run counter to the goal of engineering many-agent behavior, e.g., achieving cooperative behavior in swarms of drones. Instead, we focus on the related setting of cooperative MFC [Pham and Wei, 2018, Gu et al., 2023, Mondal et al., 2022]; see also work on differential [Carmona and Delarue, 2018], static [Sanjari and Yüksel, 2020], or discrete-time deterministic MFC [Gast and Gaujal, 2011]. For the unfamiliar reader, we point to the extensive surveys on mean field systems [Bensoussan et al., 2013, Carmona and Delarue, 2018, Laurière et al., 2022].

In general comparison, another well-known line of mean field MARL [Yang et al., 2018, Ganapathi Subramanian et al., 2020, 2021, Subramanian et al., 2022] focuses on approximating the influence of other agents on any particular agent by their average actions. Relatedly, some MARL algorithms introduce approximations over agent neighborhoods based on exponential decay [Qu et al., 2020b, a, Liu et al., 2022]. In contrast, MFC assumes dependence on the entire distribution of agents and not, e.g., pairwise terms for each neighbor, per agent.

Appendix B Deterministic Mean Field Control

In the following, we provide the proofs omitted from the main text. We begin in this section by recapping standard deterministic MFC and introducing our general proof technique. The technique generalizes to the M3FC case and yields approximation properties and dynamic programming principles that go beyond finite spaces and Lipschitz continuity assumptions: they hold in compact spaces for MFC models under simple continuity. In standard MFC, we have the model without major agents,

\begin{align}
u^{i,N}_{t} &\sim \pi_{t}(u^{i,N}_{t} \mid x^{i,N}_{t}, \mu_{t}^{N}), \tag{5} \\
x^{i,N}_{t+1} &\sim p(x^{i,N}_{t+1} \mid x^{i,N}_{t}, u^{i,N}_{t}, \mu_{t}^{N}) \tag{6}
\end{align}

while in the limit, we have the MF evolution

\begin{align}
\mu_{t+1} = T(\mu_{t}, \mu_{t} \otimes \pi_{t}(\mu_{t})) \coloneqq \iint p(\cdot \mid x, u, \mu_{t}) \, \pi_{t}(\mathrm{d}u \mid x, \mu_{t}) \, \mu_{t}(\mathrm{d}x) \tag{7}
\end{align}

and MFC system

\begin{align}
h_{t} \sim \hat{\pi}_{t}(h_{t} \mid \mu_{t}), \quad \mu_{t+1} = T(\mu_{t}, h_{t}) \tag{8}
\end{align}

with objective $J(\hat{\pi}) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r(\mu_{t})\right]$.
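For finite state and action spaces, the integral defining the MF evolution reduces to a finite sum. The following minimal Python sketch illustrates this, together with a truncated version of the objective. The two-state kernel `toy_kernel`, the uniform policy `toy_policy`, and the reward `toy_reward` are hypothetical choices made purely for illustration and are not models from the paper.

```python
from typing import Callable, List

Dist = List[float]                          # distribution over states, mu[x]
Joint = List[List[float]]                   # joint over state-actions, h[x][u]
Kernel = Callable[[int, int, Dist], Dist]   # p(. | x, u, mu)
Policy = Callable[[int, Dist], Dist]        # pi(. | x, mu)


def joint_from_policy(mu: Dist, policy: Policy) -> Joint:
    """h = mu (x) pi(mu), i.e. h[x][u] = mu[x] * pi(u | x, mu)."""
    return [[mu[x] * pu for pu in policy(x, mu)] for x in range(len(mu))]


def mf_transition(mu: Dist, h: Joint, kernel: Kernel) -> Dist:
    """T(mu, h) = sum over (x, u) of p(. | x, u, mu) * h[x][u]."""
    nxt = [0.0] * len(mu)
    for x, row in enumerate(h):
        for u, mass in enumerate(row):
            for y, prob in enumerate(kernel(x, u, mu)):
                nxt[y] += mass * prob
    return nxt


def discounted_return(mu0: Dist, policy: Policy, kernel: Kernel,
                      reward: Callable[[Dist], float],
                      gamma: float, horizon: int) -> float:
    """Truncated objective: sum over t < horizon of gamma^t * r(mu_t)."""
    mu, total = list(mu0), 0.0
    for t in range(horizon):
        total += gamma ** t * reward(mu)
        mu = mf_transition(mu, joint_from_policy(mu, policy), kernel)
    return total


def toy_kernel(x: int, u: int, mu: Dist) -> Dist:
    """Hypothetical kernel: action 1 tries to switch state and fails
    with probability equal to the occupancy of the target state."""
    stay = 1.0 if u == 0 else mu[1 - x]
    out = [0.0, 0.0]
    out[x] = stay
    out[1 - x] = 1.0 - stay
    return out


def toy_policy(x: int, mu: Dist) -> Dist:
    """Hypothetical policy: pick either action uniformly at random."""
    return [0.5, 0.5]


def toy_reward(mu: Dist) -> float:
    """Hypothetical reward: penalize an uneven split between states."""
    return -abs(mu[0] - mu[1])
```

Since `toy_kernel` is continuous in `mu`, this toy model is meant to satisfy the kind of continuity required below; the deterministic recursion in `discounted_return` mirrors the fact that the limiting MF evolves deterministically under a fixed policy.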

Dynamic Programming and Propagation of Chaos

We may solve the hard finite-agent system (5) near-optimally by instead solving the MFC MDP, allowing direct application of single-agent RL to the MFC MDP with approximate optimality in large systems. Mild continuity assumptions are required.

Assumption 1.

The transition kernel $p$ and reward $r$ are continuous.

Assumption 2.

The considered class of policies $\Pi$ is equi-Lipschitz, i.e. there exists $L_{\Pi} > 0$ such that for all $t$ and $\pi \in \Pi$, $\pi_{t} \in \mathcal{P}(\mathcal{U})^{\mathcal{X} \times \mathcal{P}(\mathcal{X})}$ is $L_{\Pi}$-Lipschitz.

We note that Assumption 1 holds true in studied finite spaces, if each transition matrix entry of $P$ is continuous in the $|\mathcal{X}|$-dimensional MF vector on the simplex (but not necessarily Lipschitz as in [Gu et al., 2021, Mondal et al., 2022], the conditions of which we relax for deterministic MFC).

We show a dynamic programming principle [Hernández-Lerma and Lasserre, 2012] to solve for and show existence of a deterministic, stationary optimal policy via the value function $V^{*}$ as the fixed point of the Bellman equation $V^{*}(\mu) = \max_{h \in \mathcal{H}(\mu)} r(\mu) + \gamma V^{*}(T(\mu, h))$.

Theorem 1.

Under Assumption 1, there exists an optimal stationary, deterministic policy $\hat{\pi}$ for (8), with $\hat{\pi}(\mu) \in \operatorname*{arg\,max}_{h \in \mathcal{H}(\mu)} r(\mu) + \gamma V^{*}(T(\mu, h))$.
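To make the Bellman fixed point concrete, the following Python sketch runs value iteration on a discretized two-state toy problem. Everything specific here is an assumption made for illustration: the grid resolution `GRID`, the finite set `SWITCH` of switching probabilities that stands in for the decision set, the reward, and the nearest-neighbor projection onto the grid. The projection is an approximation that the theorem itself does not require.

```python
from typing import List, Tuple

GRID = 11                   # hypothetical grid: mu in {0.0, 0.1, ..., 1.0}
SWITCH = (0.0, 0.5, 1.0)    # hypothetical per-state switching probabilities
GAMMA = 0.9                 # discount factor


def reward(mu: float) -> float:
    """Hypothetical reward: prefer an even split between the two states."""
    return -(mu - 0.5) ** 2


def transition(mu: float, a0: float, a1: float) -> float:
    """Deterministic MF update for the toy model; mu is the mass in state 1.

    The pair (a0, a1) encodes the decision h: agents in state 0 switch
    with probability a0, agents in state 1 switch with probability a1.
    """
    return mu * (1.0 - a1) + (1.0 - mu) * a0


def nearest(mu: float) -> int:
    """Project a mean field onto the nearest grid index (approximation)."""
    return min(GRID - 1, max(0, round(mu * (GRID - 1))))


def value_iteration(sweeps: int = 200) -> Tuple[List[float], List[Tuple[float, float]]]:
    """Iterate V(mu) <- r(mu) + gamma * max_h V(T(mu, h)) on the grid."""
    values = [0.0] * GRID
    greedy = [(0.0, 0.0)] * GRID
    for _ in range(sweeps):
        new_values = []
        for i in range(GRID):
            mu = i / (GRID - 1)
            # Best continuation value and a decision attaining it.
            best, best_h = max(
                (values[nearest(transition(mu, a0, a1))], (a0, a1))
                for a0 in SWITCH
                for a1 in SWITCH
            )
            new_values.append(reward(mu) + GAMMA * best)
            greedy[i] = best_h
        values = new_values
    return values, greedy
```

The returned `greedy` list is a stationary, deterministic map from grid points to decisions, which is the form of policy whose optimality the theorem asserts for the undiscretized problem.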

This DPP can be used for computing solutions or to show optimality of stationary policies and existence of an optimum. Next, we show propagation of chaos [Sznitman, 1991]. Here, prior proof techniques [Gu et al., 2021, Mondal et al., 2022] are extended by our approach from finite to general compact spaces.

Theorem 2.

Fix any family of equicontinuous functions $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{P}(\mathcal{X})}$. Under Assumptions 1 and 2, the empirical MF converges weakly, uniformly over $f \in \mathcal{F}$, $\pi \in \Pi$, $\hat{\pi} = \Phi^{-1}(\pi)$, to the limiting MF at all times $t \in \mathbb{N}$, $\sup_{\pi \in \Pi} \sup_{f \in \mathcal{F}} \left| \mathbb{E}\left[f(\mu^{N}_{t})\right] - \mathbb{E}\left[f(\mu_{t})\right] \right| \to 0$.

Importantly, propagation of chaos allows one to show approximate optimality of MFC policies in the large finite control problem, which is of practical relevance for solving many-agent problems.

Corollary 1.

Under Assumptions 1 and 2, an optimal deterministic MFC policy $\pi^{*} \in \operatorname*{arg\,max}_{\hat{\pi}} J(\hat{\pi})$ yields an $\varepsilon$-optimal finite-agent policy $\Phi(\pi^{*}) \in \Pi$, $J^{N}(\Phi(\pi^{*})) \geq \sup_{\pi \in \Pi} J^{N}(\pi) - \varepsilon$, with $\varepsilon \to 0$ as $N \to \infty$.
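The convergence of the empirical MF to the limiting MF can be checked numerically on a small example. The Python sketch below simulates $N$ agents on two states under a fixed policy that is absorbed into the kernel, and compares the empirical MF with the deterministic limit. The kernel (`up_prob`, `DOWN_PROB`), the initial distribution, and the seed are hypothetical choices for illustration only.

```python
import random

DOWN_PROB = 0.3   # hypothetical probability of moving from state 1 to state 0


def up_prob(mu: float) -> float:
    """Hypothetical kernel: probability of moving from state 0 to state 1,
    continuous in the mean field mu (the fraction of agents in state 1)."""
    return 0.1 + 0.8 * mu


def limit_mf(mu0: float, steps: int) -> float:
    """Deterministic limiting MF after the given number of steps."""
    mu = mu0
    for _ in range(steps):
        mu = mu * (1.0 - DOWN_PROB) + (1.0 - mu) * up_prob(mu)
    return mu


def empirical_mf(n_agents: int, mu0: float, steps: int, seed: int) -> float:
    """Empirical MF of a simulated N-agent system after the given steps."""
    rng = random.Random(seed)
    # Initial states are sampled i.i.d. from the initial distribution.
    states = [1 if rng.random() < mu0 else 0 for _ in range(n_agents)]
    for _ in range(steps):
        mu = sum(states) / n_agents
        p_up = up_prob(mu)
        # Each agent transitions independently given the current empirical MF.
        states = [
            (0 if rng.random() < DOWN_PROB else 1) if s == 1
            else (1 if rng.random() < p_up else 0)
            for s in states
        ]
    return sum(states) / n_agents


if __name__ == "__main__":
    target = limit_mf(0.2, 5)
    for n in (10, 100, 1000, 10000):
        gap = abs(empirical_mf(n, 0.2, 5, seed=0) - target)
        print(f"N={n:6d}  |mu_N - mu| = {gap:.4f}")
```

In this toy model one would expect the printed gap to shrink as $N$ grows, although a single seeded run is only suggestive; the results above give convergence but, under mere continuity, no rate.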

Appendix C Continuity of MF dynamics

First, we establish continuity of the MFC dynamics $T$, which is used in the following proofs.

Lemma 1.

Under Assumption 1, we have $T(\mu_{n}, \nu_{n}) \to T(\mu, \nu)$ whenever $(\mu_{n}, \nu_{n}) \to (\mu, \nu)$.

Proof.

To show $T(\mu_{n}, \nu_{n}) \to T(\mu, \nu)$, consider any Lipschitz and bounded $f$ with Lipschitz constant $L_{f}$, then

\begin{align*}
&\left| \int f \, \mathrm{d}(T(\mu_{n}, \nu_{n}) - T(\mu, \nu)) \right| \\
&= \left| \iiint f(x') \, p(\mathrm{d}x' \mid x, u, \mu_{n}) \, \nu_{n}(\mathrm{d}x, \mathrm{d}u) - \iiint f(x') \, p(\mathrm{d}x' \mid x, u, \mu) \, \nu(\mathrm{d}x, \mathrm{d}u) \right| \\
&\quad \leq \iint \left| \int f(x') \, p(\mathrm{d}x' \mid x, u, \mu_{n}) - \int f(x') \, p(\mathrm{d}x' \mid x, u, \mu) \right| \nu_{n}(\mathrm{d}x, \mathrm{d}u) \\
&\qquad + \left| \iiint f(x') \, p(\mathrm{d}x' \mid x, u, \mu) \, (\nu_{n}(\mathrm{d}x, \mathrm{d}u) - \nu(\mathrm{d}x, \mathrm{d}u)) \right| \\
&\quad \leq \sup_{x \in \mathcal{X}, u \in \mathcal{U}} L_{f} W_{1}(p(\cdot \mid x, u, \mu_{n}), p(\cdot \mid x, u, \mu)) \\
&\qquad + \left| \iiint f(x') \, p(\mathrm{d}x' \mid x, u, \mu) \, (\nu_{n}(\mathrm{d}x, \mathrm{d}u) - \nu(\mathrm{d}x, \mathrm{d}u)) \right| \to 0
\end{align*}

for the first term by $1$-Lipschitzness of $\frac{f}{L_{f}}$ and Assumption 1 (with compactness implying the uniform continuity), and for the second by $\nu_{n} \to \nu$ and from continuity by the same argument of $(x, u) \mapsto \iint f(x') \, p(\mathrm{d}x' \mid x, u, \mu)$. ∎

Appendix D Proof of Theorem 1

Proof.

The MFC MDP fulfills [Hernández-Lerma and Lasserre, 2012], Assumption 4.2.1. Here, we use [Hernández-Lerma and Lasserre, 2012], Condition 3.3.4(b1) instead of (b2), see also alternatively [Hernández-Lerma and Muñoz de Ozak, 1992].

More specifically, for [Hernández-Lerma and Lasserre, 2012], Assumption 4.2.1(a), the cost function $-r$ is continuous by Assumption 1, therefore also bounded by compactness of $\mathcal{P}(\mathcal{X})$, and finally also inf-compact on the state-action space of the MFC MDP, since for any $\mu \in \mathcal{P}(\mathcal{X})$ the set $\{h \in \mathcal{H}(\mu) \mid -r(\mu) \leq c\}$ is trivially given by $\mathcal{H}(\mu)$ whenever $-r(\mu) \leq c$, and $\emptyset$ otherwise. Here, we show that $\mathcal{H}(\mu) \subseteq \mathcal{P}(\mathcal{X} \times \mathcal{U})$ is a closed subset of the compact space $\mathcal{P}(\mathcal{X} \times \mathcal{U})$ and therefore also compact. Note first that two measures $\mu, \mu' \in \mathcal{P}(\mathcal{X})$ are equal if and only if for all continuous and bounded $f$ we have $\int f \, \mathrm{d}\mu = \int f \, \mathrm{d}\mu'$, see e.g. [Billingsley, 2013], Theorem 1.3.

Therefore, as $\mathcal{H}(\mu)$ is defined by its first marginal $\mu$, $\mathcal{H}(\mu)$ can be written as an intersection

\begin{align*}
\mathcal{H}(\mu) = \bigcap_{f \in C_{b}(\mathcal{X})} \left\{ h \in \mathcal{P}(\mathcal{X} \times \mathcal{U}) \;\middle\lvert\; \int f \otimes \mathbf{1} \, \mathrm{d}h = \int f \, \mathrm{d}\mu \right\}
\end{align*}

of closed sets: Since $h \mapsto \int f \otimes \mathbf{1} \, \mathrm{d}h$ is continuous, its preimage of the closed set $\{\int f \, \mathrm{d}\mu\}$ is closed. Here, $\otimes$ denotes the tensor product of $f$ with the function $\mathbf{1}$ equal one, i.e. $f \otimes \mathbf{1}$ is the map $(x, u) \mapsto f(x)$.

Similarly, for [Hernández-Lerma and Lasserre, 2012], Assumption 4.2.1(b), the transition dynamics $T$ are weakly continuous, as for any $(\mu_{n}, \nu_{n}) \to (\mu, \nu) \in \mathcal{P}(\mathcal{X}) \times \mathcal{P}(\mathcal{X} \times \mathcal{U})$ we have $T(\mu_{n}, \nu_{n}) \to T(\mu, \nu)$ by Lemma 1 and therefore $\int f \, \mathrm{d}\delta_{T(\mu_{n}, \nu_{n})} = f(T(\mu_{n}, \nu_{n})) \to f(T(\mu, \nu)) = \int f \, \mathrm{d}\delta_{T(\mu, \nu)}$ for any continuous and bounded $f \colon \mathcal{P}(\mathcal{X}) \to \mathbb{R}$.

Furthermore, the MFC MDP fulfills [Hernández-Lerma and Lasserre, 2012], Assumption 4.2.2 by boundedness of $r$ from Assumption 1. Therefore, the desired statement follows from [Hernández-Lerma and Lasserre, 2012], Theorem 4.2.3. ∎

Appendix E Proof of Theorem 2

Proof.

Note that we can also show the slightly stronger $L_{1}$ convergence statement with the absolute value inside of the expectation, $\sup_{\pi \in \Pi} \sup_{f \in \mathcal{F}} \mathbb{E}\left[\left| f(\mu^{N}_{t}) - f(\mu_{t}) \right|\right] \to 0$, but since this statement is only true for deterministic MFC, we avoid it here to later extend our proof directly to M3FC.

The statement $\sup_{\pi \in \Pi} \sup_{f \in \mathcal{F}} \left| \mathbb{E}\left[f(\mu^{N}_{t})\right] - \mathbb{E}\left[f(\mu_{t})\right] \right| \to 0$ is shown inductively over $t \geq 0$. At time $t = 0$, it holds by the weak LLN argument, see also the first term below. Assuming the statement at time $t$, then for time $t + 1$ we have

\begin{align}
&\sup_{\pi \in \Pi} \sup_{f \in \mathcal{F}} \left| \mathbb{E}\left[ f(\mu^{N}_{t+1}) - f(\mu_{t+1}) \right] \right| \nonumber \\
&\quad \leq \sup_{\pi \in \Pi} \sup_{f \in \mathcal{F}} \left| \mathbb{E}\left[ f(\mu^{N}_{t+1}) - f(T(\mu^{N}_{t}, \mu^{N}_{t} \otimes \pi_{t}(\mu^{N}_{t}))) \right] \right| \tag{9} \\
&\qquad + \sup_{\pi \in \Pi} \sup_{f \in \mathcal{F}} \left| \mathbb{E}\left[ f(T(\mu^{N}_{t}, \mu^{N}_{t} \otimes \pi_{t}(\mu^{N}_{t}))) - f(\mu_{t+1}) \right] \right|. \tag{10}
\end{align}

For the first term (9), first note that by compactness of $\mathcal{P}(\mathcal{X})$, $\mathcal{F}$ is uniformly equicontinuous, and hence admits a non-decreasing, concave (as in [DeVore and Lorentz, 1993], Lemma 6.1) modulus of continuity $\omega_{\mathcal{F}} \colon [0, \infty) \to [0, \infty)$ where $\omega_{\mathcal{F}}(x) \to 0$ as $x \to 0$ and $|f(\mu) - f(\nu)| \leq \omega_{\mathcal{F}}(W_{1}(\mu, \nu))$ for all $f \in \mathcal{F}$.

We also have uniform equicontinuity of $\mathcal{F}$ with respect to the space $(\mathcal{P}(\mathcal{X}), d_{\Sigma})$ instead of $(\mathcal{P}(\mathcal{X}), W_{1})$, as the identity map $\mathrm{id} \colon (\mathcal{P}(\mathcal{X}), d_{\Sigma}) \to (\mathcal{P}(\mathcal{X}), W_{1})$ is uniformly continuous (as both $d_{\Sigma}$ and $W_{1}$ metrize the topology of weak convergence, and $\mathcal{P}(\mathcal{X})$ is compact), and therefore there exists a modulus of continuity $\tilde{\omega}$ for the identity map such that for any $\mu, \nu \in (\mathcal{P}(\mathcal{X}), d_{\Sigma})$, by the prequel

\begin{align*}
|f(\mu) - f(\nu)| \leq \omega_{\mathcal{F}}(W_{1}(\mathrm{id}\,\mu, \mathrm{id}\,\nu)) \leq \omega_{\mathcal{F}}(\tilde{\omega}(d_{\Sigma}(\mu, \nu)))
\end{align*}

with $\tilde{\omega}_{\mathcal{F}} \coloneqq \omega_{\mathcal{F}} \circ \tilde{\omega}$, which can be replaced by its least concave majorant (again as in [DeVore and Lorentz, 1993], Lemma 6.1).

Therefore, by Jensen’s inequality, for (9) we obtain

\begin{align*}
&\left| \mathbb{E}\left[ f(\mu^{N}_{t+1}) - f(T(\mu^{N}_{t}, \mu^{N}_{t} \otimes \pi_{t}(\mu^{N}_{t}))) \right] \right| \\
&\quad \leq \mathbb{E}\left[ \tilde{\omega}_{\mathcal{F}}(d_{\Sigma}(\mu^{N}_{t+1}, T(\mu^{N}_{t}, \mu^{N}_{t} \otimes \pi_{t}(\mu^{N}_{t})))) \right] \\
&\quad \leq \tilde{\omega}_{\mathcal{F}}\left( \mathbb{E}\left[ d_{\Sigma}(\mu^{N}_{t+1}, T(\mu^{N}_{t}, \mu^{N}_{t} \otimes \pi_{t}(\mu^{N}_{t}))) \right] \right)
\end{align*}

irrespective of $\pi$, $f$ via concavity of $\tilde{\omega}_{\mathcal{F}}$ (Jensen's inequality). Introducing for readability $x^{N}_{t} \equiv \{x^{i,N}_{t}\}_{i\in[N]}$, we then obtain

\begin{align*}
&\mathbb{E}\left[d_{\Sigma}(\mu^{N}_{t+1}, T(\mu^{N}_{t}, \mu^{N}_{t}\otimes\pi_{t}(\mu^{N}_{t})))\right] \\
&\quad= \sum_{m=1}^{\infty} 2^{-m}\, \mathbb{E}\left[\left|\int f_{m}\,\mathrm{d}(\mu^{N}_{t+1} - T(\mu^{N}_{t}, \mu^{N}_{t}\otimes\pi_{t}(\mu^{N}_{t})))\right|\right] \\
&\quad\leq \sup_{m\geq 1} \mathbb{E}\left[\mathbb{E}_{x^{N}_{t}}\left[\left|\int f_{m}\,\mathrm{d}(\mu^{N}_{t+1} - T(\mu^{N}_{t}, \mu^{N}_{t}\otimes\pi_{t}(\mu^{N}_{t})))\right|\right]\right],
\end{align*}

and by the following weak LLN argument, for the squared term and any $f_{m}$

\begin{align*}
&\mathbb{E}_{x^{N}_{t}}\left[\left|\int f_{m}\,\mathrm{d}(\mu^{N}_{t+1} - T(\mu^{N}_{t}, \mu^{N}_{t}\otimes\pi_{t}(\mu^{N}_{t})))\right|\right]^{2} \\
&\quad= \mathbb{E}_{x^{N}_{t}}\left[\left|\frac{1}{N}\sum_{i=1}^{N}\left(f_{m}(x^{i,N}_{t+1}) - \mathbb{E}_{x^{N}_{t}}\left[f_{m}(x^{i,N}_{t+1})\right]\right)\right|\right]^{2} \\
&\quad\leq \mathbb{E}_{x^{N}_{t}}\left[\left|\frac{1}{N}\sum_{i=1}^{N}\left(f_{m}(x^{i,N}_{t+1}) - \mathbb{E}_{x^{N}_{t}}\left[f_{m}(x^{i,N}_{t+1})\right]\right)\right|^{2}\right] \\
&\quad= \frac{1}{N^{2}}\sum_{i=1}^{N}\mathbb{E}_{x^{N}_{t}}\left[\left(f_{m}(x^{i,N}_{t+1}) - \mathbb{E}_{x^{N}_{t}}\left[f_{m}(x^{i,N}_{t+1})\right]\right)^{2}\right] \leq \frac{4}{N} \to 0
\end{align*}

by bounding $|f_{m}| \leq 1$, as the cross-terms are zero by conditional independence of the $x^{i,N}_{t+1}$ given $x^{N}_{t}$. By the preceding bounds, the term (9) hence converges to zero.
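The $4/N$ bound can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original proof: it assumes independent $\pm 1$-valued variables with heterogeneous success probabilities as a stand-in for $f_{m}(x^{i,N}_{t+1})$ conditioned on $x^{N}_{t}$. The function name `squared_mean_abs_deviation`, the probabilities, and the trial count are all hypothetical choices made for illustration.

```python
import random


def squared_mean_abs_deviation(n_agents, n_trials=500, seed=0):
    """Monte Carlo estimate of E[|1/N sum_i (f(X_i) - E f(X_i))|]^2.

    The X_i are independent but not identically distributed, mimicking
    agents that are conditionally independent given the joint state.
    """
    rng = random.Random(seed)
    # Hypothetical per-agent success probabilities (illustration only).
    probs = [rng.random() for _ in range(n_agents)]
    # Bounded test function with |f| <= 1: f = +1 on success, -1 otherwise,
    # so that E f(X_i) = 2 * p_i - 1.
    means = [2.0 * p - 1.0 for p in probs]
    total = 0.0
    for _ in range(n_trials):
        deviation = 0.0
        for p, mean in zip(probs, means):
            value = 1.0 if rng.random() < p else -1.0
            deviation += value - mean
        total += abs(deviation / n_agents)
    # Square of the estimated first absolute moment, as in the derivation.
    return (total / n_trials) ** 2


for n in (10, 100, 1000):
    # Columns: N, Monte Carlo estimate, theoretical bound 4/N.
    print(n, squared_mean_abs_deviation(n), 4.0 / n)
```

Under these assumptions the estimates should stay below $4/N$ and shrink as $N$ grows. The bound is loose because it only uses $|f_{m}| \leq 1$ and no information about the actual variances.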

For the second term (10), we have

\begin{align*}
&\sup_{\pi\in\Pi}\sup_{f\in\mathcal{F}}\left|\mathbb{E}\left[f(T(\mu^{N}_{t}, \mu^{N}_{t}\otimes\pi_{t}(\mu^{N}_{t}))) - f(\mu_{t+1})\right]\right| \\
&\quad= \sup_{\pi\in\Pi}\sup_{f\in\mathcal{F}}\left|\mathbb{E}\left[f(T(\mu^{N}_{t}, \mu^{N}_{t}\otimes\pi_{t}(\mu^{N}_{t}))) - f(T(\mu_{t}, \mu_{t}\otimes\pi_{t}(\mu_{t})))\right]\right| \\
&\quad\leq \sup_{\pi\in\Pi}\sup_{g\in\mathcal{G}}\left|\mathbb{E}\left[g(\mu^{N}_{t}) - g(\mu_{t})\right]\right| \to 0
\end{align*}

by the induction assumption, where we defined $g = f\circ\tilde{T}^{\pi_{t}}$ from the class $\mathcal{G}$ of equicontinuous functions with modulus of continuity $\omega_{\mathcal{G}} \coloneqq \omega_{\mathcal{F}}\circ\omega_{T}$, where $\omega_{T}$ denotes the uniform modulus of continuity of $\mu_{t}\mapsto\tilde{T}^{\pi_{t}}(\mu_{t}) \coloneqq T(\mu_{t}, \mu_{t}\otimes\pi_{t}(\mu_{t}))$ over all policies $\pi$.
Here, this equicontinuity of $\{\tilde{T}^{\pi_{t}}\}_{\pi\in\Pi}$ follows from Lemma 1 and the equicontinuity of the functions $\mu_{t}\mapsto\mu_{t}\otimes\pi_{t}(\mu_{t})$, which holds because $\Pi$ is uniformly Lipschitz, as we show in the following, completing the proof by induction:

Consider $\mu_{n}\to\mu\in\mathcal{P}(\mathcal{X})$; then we have

\begin{align*}
&\sup_{\pi\in\Pi} W_{1}(\mu_{n}\otimes\pi_{t}(\mu_{n}), \mu\otimes\pi_{t}(\mu)) \\
&\quad= \sup_{\pi\in\Pi}\sup_{\lVert f'\rVert_{\mathrm{Lip}}\leq 1}\left|\int f'\,\mathrm{d}(\mu_{n}\otimes\pi_{t}(\mu_{n}) - \mu\otimes\pi_{t}(\mu))\right| \\
&\quad\leq \sup_{\pi\in\Pi}\sup_{\lVert f'\rVert_{\mathrm{Lip}}\leq 1}\left|\iint f'(x,u)\,(\pi_{t}(\mathrm{d}u\mid x,\mu_{n}) - \pi_{t}(\mathrm{d}u\mid x,\mu))\,\mu_{n}(\mathrm{d}x)\right| \\
&\qquad+ \sup_{\pi\in\Pi}\sup_{\lVert f'\rVert_{\mathrm{Lip}}\leq 1}\left|\iint f'(x,u)\,\pi_{t}(\mathrm{d}u\mid x,\mu)\,(\mu_{n}(\mathrm{d}x) - \mu(\mathrm{d}x))\right|
\end{align*}

where for the first term

\begin{align*}
&\sup_{\pi\in\Pi}\sup_{\lVert f'\rVert_{\mathrm{Lip}}\leq 1}\left|\iint f'(x,u)\,(\pi_{t}(\mathrm{d}u\mid x,\mu_{n}) - \pi_{t}(\mathrm{d}u\mid x,\mu))\,\mu_{n}(\mathrm{d}x)\right| \\
&\quad\leq \sup_{\pi\in\Pi}\sup_{\lVert f'\rVert_{\mathrm{Lip}}\leq 1}\int\left|\int f'(x,u)\,(\pi_{t}(\mathrm{d}u\mid x,\mu_{n}) - \pi_{t}(\mathrm{d}u\mid x,\mu))\right|\mu_{n}(\mathrm{d}x) \\
&\quad\leq \sup_{\pi\in\Pi}\sup_{\lVert f'\rVert_{\mathrm{Lip}}\leq 1}\sup_{x\in\mathcal{X}}\left|\int f'(x,u)\,(\pi_{t}(\mathrm{d}u\mid x,\mu_{n}) - \pi_{t}(\mathrm{d}u\mid x,\mu))\right| \\
&\quad= \sup_{\pi\in\Pi}\sup_{x\in\mathcal{X}} W_{1}(\pi_{t}(\cdot\mid x,\mu_{n}), \pi_{t}(\cdot\mid x,\mu)) \\
&\quad\leq L_{\Pi} W_{1}(\mu_{n},\mu) \to 0
\end{align*}

by Assumption 2, and similarly for the second by first noting $1$-Lipschitzness of $x\mapsto\int\frac{f'(x,u)}{L_{\Pi}+1}\,\pi_{t}(\mathrm{d}u\mid x,\mu)$, as for $y\neq x$

\begin{align*}
&\left|\int\frac{f'(y,u)}{L_{\Pi}+1}\,\pi_{t}(\mathrm{d}u\mid y,\mu) - \int\frac{f'(x,u)}{L_{\Pi}+1}\,\pi_{t}(\mathrm{d}u\mid x,\mu)\right| \\
&\quad\leq \left|\int\frac{f'(y,u) - f'(x,u)}{L_{\Pi}+1}\,\pi_{t}(\mathrm{d}u\mid y,\mu)\right| + \left|\int\frac{f'(x,u)}{L_{\Pi}+1}\,(\pi_{t}(\mathrm{d}u\mid y,\mu) - \pi_{t}(\mathrm{d}u\mid x,\mu))\right| \\
&\quad\leq \frac{1}{L_{\Pi}+1}\,d(y,x) + \frac{1}{L_{\Pi}+1}\,W_{1}(\pi_{t}(\cdot\mid y,\mu), \pi_{t}(\cdot\mid x,\mu)) \\
&\quad\leq \left(\frac{1}{L_{\Pi}+1} + \frac{L_{\Pi}}{L_{\Pi}+1}\right) d(x,y) \tag{11}
\end{align*}

with $\frac{1}{L_{\Pi}+1} + \frac{L_{\Pi}}{L_{\Pi}+1} = 1$, and therefore again

\begin{align*}
&\sup_{\pi\in\Pi}\sup_{\lVert f'\rVert_{\mathrm{Lip}}\leq 1}\left|\iint f'(x,u)\,\pi_{t}(\mathrm{d}u\mid x,\mu)\,(\mu_{n}(\mathrm{d}x) - \mu(\mathrm{d}x))\right| \\
&\quad= \sup_{\pi\in\Pi}\sup_{\lVert f'\rVert_{\mathrm{Lip}}\leq 1}(L_{\Pi}+1)\left|\iint\frac{f'(x,u)}{L_{\Pi}+1}\,\pi_{t}(\mathrm{d}u\mid x,\mu)\,(\mu_{n}(\mathrm{d}x) - \mu(\mathrm{d}x))\right| \\
&\quad\leq (L_{\Pi}+1)\,W_{1}(\mu_{n},\mu) \to 0.
\end{align*}
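As a short worked consequence, not stated explicitly in the derivation, adding the bound on the first term to the bound on the second term gives an explicit constant that is uniform over $\Pi$:

\begin{align*}
\sup_{\pi\in\Pi} W_{1}(\mu_{n}\otimes\pi_{t}(\mu_{n}), \mu\otimes\pi_{t}(\mu))
\leq L_{\Pi} W_{1}(\mu_{n},\mu) + (L_{\Pi}+1)\,W_{1}(\mu_{n},\mu)
= (2L_{\Pi}+1)\,W_{1}(\mu_{n},\mu).
\end{align*}

Since the constant $2L_{\Pi}+1$ does not depend on $\pi$, the maps $\mu\mapsto\mu\otimes\pi_{t}(\mu)$ are uniformly Lipschitz in $W_{1}$ and in particular equicontinuous.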

This completes the proof by induction. ∎

Appendix F Proof of Corollary 1

Proof.

First, we show that from uniform convergence in Theorem 2, the finite-agent objectives converge uniformly to the MFC limit.

Lemma 1.

Under Assumptions 1 and 2, the finite-agent objective converges uniformly to the MFC limit,

\begin{equation*}
\sup_{\pi\in\Pi}\left|J^{N}(\pi) - J(\Phi^{-1}(\pi))\right| \to 0. \tag{12}
\end{equation*}
Proof.

For any $\varepsilon>0$, choose a time $T\in\mathbb{N}$ such that $\sum_{t=T}^{\infty}\gamma^{t}\left|\mathbb{E}\left[r(\mu^{N}_{t}) - r(\mu_{t})\right]\right| \leq \frac{\gamma^{T}}{1-\gamma}\max_{\mu} 2|r(\mu)| < \frac{\varepsilon}{2}$. By Theorem 2, $\sum_{t=0}^{T-1}\gamma^{t}\left|\mathbb{E}\left[r(\mu^{N}_{t}) - r(\mu_{t})\right]\right| < \frac{\varepsilon}{2}$ for sufficiently large $N$. The result follows. ∎
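The choice of the truncation time $T$ in this argument is constructive. As a minimal sketch, assuming a discount factor `gamma` in $(0,1)$ and a known bound `r_max` on $\max_{\mu}|r(\mu)|$ (both names, like the function names, are hypothetical and not from the paper), the smallest admissible $T$ can be found by direct search:

```python
def tail_bound(gamma, r_max, horizon):
    """Upper bound gamma^T / (1 - gamma) * 2 * r_max on the discounted tail."""
    return gamma ** horizon / (1.0 - gamma) * 2.0 * r_max


def truncation_horizon(gamma, r_max, eps):
    """Smallest T with tail_bound(gamma, r_max, T) < eps / 2."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if eps <= 0.0 or r_max < 0.0:
        raise ValueError("eps must be positive and r_max non-negative")
    horizon = 0
    # The bound decays geometrically in the horizon, so this loop terminates.
    while tail_bound(gamma, r_max, horizon) >= eps / 2.0:
        horizon += 1
    return horizon


T = truncation_horizon(gamma=0.9, r_max=1.0, eps=0.1)
print(T, tail_bound(0.9, 1.0, T))
```

The horizon depends only on `gamma`, `r_max` and `eps`, not on $N$ or on the policy, which is why the remaining finite sum can then be handled by Theorem 2 uniformly over $\Pi$.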

The approximate optimality of MFC solutions in the finite system follows immediately: By Lemma 1, we have

\begin{align*}
&J^{N}(\Phi(\pi^{*})) - \sup_{\pi\in\Pi} J^{N}(\pi) = \inf_{\pi\in\Pi}\left(J^{N}(\Phi(\pi^{*})) - J^{N}(\pi)\right) \\
&\quad\geq \inf_{\pi\in\Pi}\left(J^{N}(\Phi(\pi^{*})) - J(\pi^{*})\right) + \inf_{\pi\in\Pi}\left(J(\pi^{*}) - J(\Phi^{-1}(\pi))\right) + \inf_{\pi\in\Pi}\left(J(\Phi^{-1}(\pi)) - J^{N}(\pi)\right) \\
&\quad\geq -\frac{\varepsilon}{2} + 0 - \frac{\varepsilon}{2} = -\varepsilon
\end{align*}

for sufficiently large $N$. The first inequality adds and subtracts $J(\pi^{*})$ and $J(\Phi^{-1}(\pi))$ and uses that the infimum of a sum is at least the sum of the infima, and the second term is bounded below by zero by optimality of $\pi^{*}$ in the MFC problem. ∎

Appendix G Stochastic Mean Field Control with Common Noise and Major States

For convenience, we also restate the results for MFC with major states, or common noise. We have the finite MFC system with major states

\begin{align}
u^{i,N}_{t} &\sim \pi_{t}(u^{i,N}_{t} \mid x^{i,N}_{t}, x^{0,N}_{t}, \mu_{t}^{N}), \tag{13a} \\
x^{i,N}_{t+1} &\sim p(x^{i,N}_{t+1} \mid x^{i,N}_{t}, u^{i,N}_{t}, x^{0,N}_{t}, \mu_{t}^{N}), \quad x^{0,N}_{t+1} \sim p^{0}(x^{0,N}_{t+1} \mid x^{0,N}_{t}, \mu_{t}^{N}) \tag{13b}
\end{align}

and objective $J^{N}(\pi) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r(x^{0,N}_{t}, \mu^{N}_{t}) \right]$ analogous to (5), with the corresponding limiting MFC MDP with major states analogous to (8),

\begin{equation}
h_{t} \sim \hat{\pi}_{t}(h_{t} \mid x^{0}_{t}, \mu_{t}), \quad \mu_{t+1} = T(x^{0}_{t}, \mu_{t}, h_{t}), \quad x^{0}_{t+1} \sim p^{0}(x^{0}_{t+1} \mid x^{0}_{t}, \mu_{t}) \tag{14}
\end{equation}

with objective $J(\hat{\pi}) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r(x^{0}_{t}, \mu_{t}) \right]$, where $T(x^{0}, \mu, h) \coloneqq \iint p(\cdot \mid x, u, x^{0}, \mu) \, h(\mathrm{d}x, \mathrm{d}u)$.
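
For finite $\mathcal{X}$ and $\mathcal{U}$, the integral defining $T$ reduces to a weighted sum over state-action pairs. The following Python sketch illustrates this under assumptions that are not part of the paper: a two-state, two-action minor model, a hypothetical kernel `kernel(x0, mu)`, and a joint distribution $h = \mu \otimes \pi$ built from an arbitrarily chosen policy.

```python
import numpy as np

def kernel(x0, mu):
    """Hypothetical minor kernel p(x' | x, u, x0, mu), shape (|X|, |U|, |X|)."""
    n_x, n_u = 2, 2
    p = np.zeros((n_x, n_u, n_x))
    for x in range(n_x):
        target = 1 - x
        # Assumed dynamics: switching succeeds less often when the target
        # state is crowded and when the major state x0 equals 1.
        switch = 0.9 * (1.0 - 0.5 * mu[target]) * (1.0 - 0.25 * x0)
        p[x, 0, x] = 1.0               # u = 0: stay in place
        p[x, 1, target] = switch       # u = 1: attempt to switch
        p[x, 1, x] = 1.0 - switch
    return p

def mean_field_update(x0, mu, h):
    """T(x0, mu, h) = sum over (x, u) of p(. | x, u, x0, mu) * h(x, u)."""
    return np.einsum("xuy,xu->y", kernel(x0, mu), h)

mu = np.array([0.75, 0.25])
policy = np.array([[0.5, 0.5], [1.0, 0.0]])   # illustrative pi(u | x)
h = mu[:, None] * policy                       # joint h with state marginal mu
mu_next = mean_field_update(0, mu, h)
print(mu_next)
```

Since every row of the kernel is a probability vector and `h` sums to one, the output is again a probability vector on $\mathcal{X}$, which is what allows (14) to be iterated.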

Assumption 1.

The transition kernels $p$, $p^{0}$ and rewards $r$ are Lipschitz continuous with constants $L_{p}$, $L_{p^{0}}$, $L_{r}$.

Assumption 2.

The class of policies $\Pi$ is equi-Lipschitz, i.e. there exists $L_{\Pi} > 0$ such that for all $t$ and $\pi \in \Pi$, $\pi_{t} \in \mathcal{P}(\mathcal{U})^{\mathcal{X} \times \mathcal{P}(\mathcal{X})}$ is $L_{\Pi}$-Lipschitz.

Theorem 1.

Under Assumption 1, there exists an optimal stationary, deterministic policy $\hat{\pi}$ for the MFC MDP (14), obtained by choosing $\hat{\pi}(x^{0}, \mu)$ from $\operatorname*{arg\,max}_{h \in \mathcal{H}(\mu)} r(x^{0}, \mu) + \gamma \mathbb{E}_{y^{0} \sim p^{0}(y^{0} \mid x^{0}, \mu)} V^{*}(y^{0}, T(x^{0}, \mu, h))$, with $V^{*}$ the unique fixed point of the Bellman equation $V^{*}(x^{0}, \mu) = \max_{h \in \mathcal{H}(\mu)} r(x^{0}, \mu) + \gamma \mathbb{E}_{y^{0} \sim p^{0}(y^{0} \mid x^{0}, \mu)} V^{*}(y^{0}, T(x^{0}, \mu, h))$ (value function).
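
The fixed-point characterization of $V^{*}$ suggests value iteration. The sketch below applies it to a crude discretization, which is an illustration only and not the method of the paper: binary minor and major states, the mean field summarized by $m = \mu(1)$ on a uniform grid with nearest-neighbor projection, and the maximization over $\mathcal{H}(\mu)$ restricted to four deterministic switching rules. The functions `reward`, `major_kernel` and `next_m` are hypothetical stand-ins for $r$, $p^{0}$ and $T$.

```python
import numpy as np

GAMMA = 0.9
GRID = np.linspace(0.0, 1.0, 51)   # m = mu(1), so mu = (1 - m, m)
# Assumed action set: switch probability (a0, a1) per minor state.
ACTIONS = [(a0, a1) for a0 in (0.0, 1.0) for a1 in (0.0, 1.0)]

def reward(x0, m):
    """Hypothetical r(x0, mu): keep mass in state 1 near a major-dependent target."""
    return -(m - (0.2 if x0 == 0 else 0.8)) ** 2

def major_kernel(x0, m):
    """Hypothetical p0(. | x0, mu): the major state flips more often for large m."""
    flip = 0.1 + 0.3 * m
    return np.array([1 - flip, flip]) if x0 == 0 else np.array([flip, 1 - flip])

def next_m(x0, m, action):
    """Hypothetical T(x0, mu, h) for h = mu (x) pi with switching rule `action`."""
    success = 0.9 if x0 == 0 else 0.6
    leave0, leave1 = action[0] * success, action[1] * success
    return (1 - m) * leave0 + m * (1 - leave1)

def bellman(V):
    """One application of the discretized Bellman operator."""
    V_new = np.empty_like(V)
    for x0 in (0, 1):
        for i, m in enumerate(GRID):
            p0 = major_kernel(x0, m)
            best = -np.inf
            for a in ACTIONS:
                # Project the next mean field onto the nearest grid point.
                j = int(round(next_m(x0, m, a) * (len(GRID) - 1)))
                best = max(best, p0 @ V[:, j])
            V_new[x0, i] = reward(x0, m) + GAMMA * best
    return V_new

V = np.zeros((2, len(GRID)))
for _ in range(200):
    V = bellman(V)
residual = np.abs(bellman(V) - V).max()   # Bellman residual of the iterate
print(residual)
```

The discretized operator is a $\gamma$-contraction in the supremum norm, so the residual shrinks geometrically. How well the grid solution approximates $V^{*}$ of the original MFC MDP depends on the grid resolution and is not addressed by this sketch.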

Theorem 2.

Fix any family of equi-Lipschitz functions $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}^{0} \times \mathcal{P}(\mathcal{X})}$ with shared Lipschitz constant $L_{\mathcal{F}}$ for all $f \in \mathcal{F}$. Under Assumption 1, the random variable $(x^{0,N}_{t}, \mu_{t}^{N})$ converges weakly, uniformly over $\mathcal{F}$, $\Pi$, to $(x^{0}_{t}, \mu_{t})$ at all times $t \in \mathbb{N}$,

\begin{equation}
\sup_{\pi\in\Pi} \sup_{f\in\mathcal{F}} \left| \mathbb{E}\left[ f(x^{0,N}_{t}, \mu_{t}^{N}) - f(x^{0}_{t}, \mu_{t}) \right] \right| \to 0. \tag{15}
\end{equation}

Corollary 1.

Under Assumptions 1 and 2, optimal deterministic MFC policies $\pi^{*} \in \operatorname*{arg\,max}_{\pi} J(\pi)$ result in $\varepsilon$-optimal policies $\Phi(\pi^{*})$ in the finite-agent problem with $\varepsilon \to 0$ as $N \to \infty$,

\begin{equation}
J^{N}(\Phi(\pi^{*})) \geq \sup_{\pi\in\Pi} J^{N}(\pi) - \varepsilon. \tag{16}
\end{equation}

The proofs and interpretation are directly analogous to the M3FC case and its proofs below: one either leaves out the major agent actions or applies the M3FC results with a trivial singleton major action space, $|\mathcal{U}^{0}| = 1$.
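
The convergence in Theorem 2 can be illustrated numerically. The sketch below rests on simplifying assumptions that are not from the paper: binary minor and major states, a fixed policy in which every agent always attempts to switch, a hypothetical flip probability `switch_prob`, and a major state that evolves as pure common noise (independent of $\mu$), so that one sampled major path can be shared between the finite and the limiting system.

```python
import numpy as np

def switch_prob(x0, mu):
    """Hypothetical flip probability, indexed by the current minor state x."""
    # Entry x uses the mass of the target state 1 - x: crowded targets are
    # harder to reach, and major state x0 = 1 slows all switching down.
    return 0.8 * (1.0 - 0.5 * mu[::-1]) * (1.0 - 0.25 * x0)

def simulate(n_agents, x0_path, rng):
    """Empirical mean fields mu_t^N of the finite system along x0_path."""
    x = (rng.random(n_agents) < 0.25).astype(int)   # i.i.d. start, P(x = 1) = 0.25
    mus = []
    for x0 in x0_path:
        mu = np.bincount(x, minlength=2) / n_agents
        mus.append(mu)
        flip = rng.random(n_agents) < switch_prob(x0, mu)[x]
        x = np.where(flip, 1 - x, x)
    return np.array(mus)

def limit(x0_path):
    """Limiting mean fields mu_t along the same major-state path."""
    mu = np.array([0.75, 0.25])
    mus = []
    for x0 in x0_path:
        mus.append(mu)
        q = switch_prob(x0, mu)                     # q[x] = P(flip | x)
        mu = np.array([mu[0] * (1 - q[0]) + mu[1] * q[1],
                       mu[1] * (1 - q[1]) + mu[0] * q[0]])
    return np.array(mus)

rng = np.random.default_rng(0)
x0_path = rng.integers(0, 2, size=6)                # shared common-noise path
mu_lim = limit(x0_path)
errors = {n: np.abs(simulate(n, x0_path, rng) - mu_lim).sum(axis=1).max()
          for n in (10, 100_000)}
print(errors)
```

The dictionary `errors` holds the largest $\ell_1$ deviation between $\mu^{N}_{t}$ and $\mu_{t}$ over the horizon, which should shrink as $N$ grows. This checks convergence only along one realization of the common noise and for one policy, whereas Theorem 2 is uniform over $\mathcal{F}$ and $\Pi$.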

Appendix H Proof of Theorem 1

Proof.

The proof is analogous to Appendix D and starts by showing the continuity of $T$ (proved in Appendix I).

Lemma 1.

Under Assumption 1, for any sequence $(x^{0}_{n}, u^{0}_{n}, \mu_{n}, \nu_{n}) \to (x^{0}, u^{0}, \mu, \nu) \in \mathcal{X}^{0} \times \mathcal{U}^{0} \times \mathcal{P}(\mathcal{X}) \times \mathcal{P}(\mathcal{X} \times \mathcal{U})$, we have $T(x^{0}_{n}, u^{0}_{n}, \mu_{n}, \nu_{n}) \to T(x^{0}, u^{0}, \mu, \nu)$.

For [Hernández-Lerma and Lasserre, 2012], Assumption 4.2.1(a), the cost function $-r$ is continuous by Assumption 1, therefore also bounded by compactness of $\mathcal{X}^{0} \times \mathcal{P}(\mathcal{X})$, and finally also inf-compact on the state-action space of the M3FC MDP, since for any $(x^{0}, \mu) \in \mathcal{X}^{0} \times \mathcal{P}(\mathcal{X})$ the set $\{(h, u^{0}) \in \mathcal{H}(\mu) \times \mathcal{U}^{0} \mid -r(x^{0}, u^{0}, \mu) \leq c\}$ is given by $\mathcal{H}(\mu) \times \tilde{r}^{-1}((-\infty, c])$, where we defined $\tilde{r}(u^{0}) \coloneqq -r(x^{0}, u^{0}, \mu)$. Note that $\mathcal{H}(\mu)$ is compact by the same argument as in Appendix D, while $\tilde{r}$ is continuous by Assumption 1 and therefore its preimage of the closed set $(-\infty, c]$ is compact.

For [Hernández-Lerma and Lasserre, 2012], Assumption 4.2.1(b), consider any continuous and bounded $f \colon \mathcal{X}^{0} \times \mathcal{P}(\mathcal{X}) \to \mathbb{R}$. The continuity is uniform by compactness. Hence, $\sup_{x' \in \mathcal{X}^{0}} \left| f(x', \mu'_{n}) - f(x', \mu') \right| \to 0$ as $\mu'_{n} \to \mu' \in \mathcal{P}(\mathcal{X})$. Thus, whenever $(x^{0}_{n}, u^{0}_{n}, \mu_{n}, \nu_{n}) \to (x^{0}, u^{0}, \mu, \nu) \in \mathcal{X}^{0} \times \mathcal{U}^{0} \times \mathcal{P}(\mathcal{X}) \times \mathcal{P}(\mathcal{X} \times \mathcal{U})$, we have

\begin{align*}
&\left| \iint f(x', \mu') \, \delta_{T^{*}_{n}}(\mathrm{d}\mu') \, p^{0}(\mathrm{d}x' \mid x^{0}_{n}, u^{0}_{n}, \mu_{n}) - \iint f(x', \mu') \, \delta_{T^{*}}(\mathrm{d}\mu') \, p^{0}(\mathrm{d}x' \mid x^{0}, u^{0}, \mu) \right| \\
&\quad = \left| \int f(x', T^{*}_{n}) \, p^{0}(\mathrm{d}x' \mid x^{0}_{n}, u^{0}_{n}, \mu_{n}) - \int f(x', T^{*}) \, p^{0}(\mathrm{d}x' \mid x^{0}, u^{0}, \mu) \right| \\
&\quad \leq \left| \int f(x', T^{*}_{n}) \, p^{0}(\mathrm{d}x' \mid x^{0}_{n}, u^{0}_{n}, \mu_{n}) - \int f(x', T^{*}) \, p^{0}(\mathrm{d}x' \mid x^{0}_{n}, u^{0}_{n}, \mu_{n}) \right| \\
&\qquad + \left| \int f(x', T^{*}) \, p^{0}(\mathrm{d}x' \mid x^{0}_{n}, u^{0}_{n}, \mu_{n}) - \int f(x', T^{*}) \, p^{0}(\mathrm{d}x' \mid x^{0}, u^{0}, \mu) \right| \\
&\quad \leq \sup_{x' \in \mathcal{X}^{0}} \left| f(x', T^{*}_{n}) - f(x', T^{*}) \right| \\
&\qquad + \left| \int \tilde{f}(x') \, p^{0}(\mathrm{d}x' \mid x^{0}_{n}, u^{0}_{n}, \mu_{n}) - \int \tilde{f}(x') \, p^{0}(\mathrm{d}x' \mid x^{0}, u^{0}, \mu) \right| \to 0
\end{align*}

for the first term by the uniform continuity established above, where $T^{*}_{n} \coloneqq T(x^{0}_{n}, u^{0}_{n}, \mu_{n}, \nu_{n}) \to T^{*} \coloneqq T(x^{0}, u^{0}, \mu, \nu)$ by Lemma 1, and for the second term by applying Assumption 1 to $\tilde{f}(x') \coloneqq f(x', T^{*})$. This shows weak continuity of the dynamics.

Furthermore, the M3FC MDP fulfills [Hernández-Lerma and Lasserre, 2012], Assumption 4.2.2 by boundedness of $r$ from Assumption 1. Therefore, the desired statement follows from [Hernández-Lerma and Lasserre, 2012], Theorem 4.2.3. ∎

Appendix I Proof of Lemma 1

Proof.

To show $T(x^{0}_{n}, u^{0}_{n}, \mu_{n}, \nu_{n}) \to T(x^{0}, u^{0}, \mu, \nu)$, consider any Lipschitz and bounded $f$ with Lipschitz constant $L_{f}$, then

\begin{align*}
&\left| \int f \, \mathrm{d}\left( T(x^{0}_{n}, u^{0}_{n}, \mu_{n}, \nu_{n}) - T(x^{0}, u^{0}, \mu, \nu) \right) \right| \\
&= \left| \iiint f(x') \left( p(\mathrm{d}x' \mid x, u, x^{0}_{n}, u^{0}_{n}, \mu_{n}) \, \nu_{n}(\mathrm{d}x, \mathrm{d}u) - p(\mathrm{d}x' \mid x, u, x^{0}, u^{0}, \mu) \, \nu(\mathrm{d}x, \mathrm{d}u) \right) \right| \\
&\quad \leq \iint \left| \int f(x') \, p(\mathrm{d}x' \mid x, u, x^{0}_{n}, u^{0}_{n}, \mu_{n}) - \int f(x') \, p(\mathrm{d}x' \mid x, u, x^{0}, u^{0}, \mu) \right| \nu_{n}(\mathrm{d}x, \mathrm{d}u) \\
&\qquad + \left| \iiint f(x') \, p(\mathrm{d}x' \mid x, u, x^{0}, u^{0}, \mu) \left( \nu_{n}(\mathrm{d}x, \mathrm{d}u) - \nu(\mathrm{d}x, \mathrm{d}u) \right) \right| \\
&\quad \leq \sup_{x \in \mathcal{X}, u \in \mathcal{U}} L_{f} W_{1}\left( p(\cdot \mid x, u, x^{0}_{n}, u^{0}_{n}, \mu_{n}), \, p(\cdot \mid x, u, x^{0}, u^{0}, \mu) \right) \\
&\qquad + \left| \iiint f(x') \, p(\mathrm{d}x' \mid x, u, x^{0}, u^{0}, \mu) \left( \nu_{n}(\mathrm{d}x, \mathrm{d}u) - \nu(\mathrm{d}x, \mathrm{d}u) \right) \right| \to 0
\end{align*}

for the first term by $1$-Lipschitzness of $\frac{f}{L_f}$ and Assumption 1 (with compactness implying the uniform continuity), and for the second by $\nu_n \to \nu$ and continuity of $(x, u) \mapsto \iint f(x') \, p(\mathrm{d}x' \mid x, u, x^0, u^0, \mu)$ by the same argument. ∎

Appendix J Proof of Theorem 2

Proof.

The statement $\sup_{f, \pi, \pi^0} \left| \mathbb{E}\left[ f(x^{0,N}_t, u^{0,N}_t, \mu_t^N) - f(x^0_t, u^0_t, \mu_t) \right] \right|$ is shown inductively over $t \geq 0$. At time $t = 0$, it holds by the weak LLN argument, see also the first term below. Assuming the statement at time $t$, then for time $t + 1$ we have

\begin{align}
&\sup_{(\pi, \pi^0) \in \Pi \times \Pi^0} \sup_{f \in \mathcal{F}} \left| \mathbb{E}\left[ f(x^{0,N}_{t+1}, u^{0,N}_{t+1}, \mu^N_{t+1}) - f(x^0_{t+1}, u^0_{t+1}, \mu_{t+1}) \right] \right| \notag \\
&\quad \leq \sup_{\pi, \pi^0} \sup_{f \in \mathcal{F}} \left| \mathbb{E}\left[ f(x^{0,N}_{t+1}, u^{0,N}_{t+1}, \mu^N_{t+1}) - f(x^{0,N}_{t+1}, u^{0,N}_{t+1}, \hat{\mu}^N_{t+1}) \right] \right| \tag{17} \\
&\qquad + \sup_{\pi, \pi^0} \sup_{f \in \mathcal{F}} \left| \mathbb{E}\left[ f(x^{0,N}_{t+1}, u^{0,N}_{t+1}, \hat{\mu}^N_{t+1}) - f(x^0_{t+1}, u^0_{t+1}, \mu_{t+1}) \right] \right| \tag{18}
\end{align}

where for readability, we again write $\pi_t(x^0_t, \mu_t) \coloneqq \pi_t(\cdot \mid \cdot, x^0_t, \mu_t)$ and introduce the random variable

\begin{align*}
\hat{\mu}^N_{t+1} \coloneqq T(x^{0,N}_t, u^{0,N}_t, \mu^N_t, \mu^N_t \otimes \pi_t(x^{0,N}_t, \mu^N_t)).
\end{align*}
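To make this one-step map concrete, the following is a minimal sketch under simplifying assumptions: finite minor state and action spaces, and a fixed major state-action pair and mean field, so that the minor kernel $p(\cdot \mid x, u, x^0, u^0, \mu)$ reduces to a hypothetical array `P[x, u, x']`. It further assumes that $T$ takes the form $T(x^0, u^0, \mu, \nu) = \iint p(\cdot \mid x, u, x^0, u^0, \mu) \, \nu(\mathrm{d}x, \mathrm{d}u)$, which matches the integrals against $\nu$ used in the preceding proof. The function names `joint` and `mean_field_step` and all numbers are illustrative and not part of the paper.

```python
import numpy as np

def joint(mu, pi):
    """Joint state-action distribution mu (x) pi: nu[x, u] = mu[x] * pi[x, u]."""
    return mu[:, None] * pi

def mean_field_step(P, nu):
    """Deterministic next mean field: mu_next[y] = sum_{x,u} nu[x, u] * P[x, u, y].

    P is the minor transition kernel with the major state, major action and
    current mean field held fixed (an assumption of this sketch).
    """
    return np.einsum("xu,xuy->y", nu, P)

# Toy example with 3 minor states and 2 minor actions (hypothetical values).
mu = np.array([0.5, 0.3, 0.2])                  # current (empirical) mean field
pi = np.array([[0.9, 0.1],                      # pi[x, u]: action probabilities per state
               [0.5, 0.5],
               [0.2, 0.8]])
P = np.zeros((3, 2, 3))
P[:, 0, :] = np.eye(3)                          # action 0: stay in the current state
P[:, 1, :] = np.roll(np.eye(3), 1, axis=1)      # action 1: move to state (x + 1) mod 3

mu_hat = mean_field_step(P, joint(mu, pi))
print(mu_hat)
```

In the proof, $\hat{\mu}^N_{t+1}$ is this map applied to the random empirical quantities $(x^{0,N}_t, u^{0,N}_t, \mu^N_t)$, so it is itself random, whereas the sketch evaluates it for one fixed input.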

By compactness of $\mathcal{X}^0 \times \mathcal{U}^0 \times \mathcal{P}(\mathcal{X})$, $\mathcal{F}$ is uniformly equicontinuous, and hence admits a non-decreasing, concave (as in [DeVore and Lorentz, 1993], Lemma 6.1) modulus of continuity $\omega_{\mathcal{F}} \colon [0, \infty) \to [0, \infty)$ where $\omega_{\mathcal{F}}(x) \to 0$ as $x \to 0$ and $|f(x, u, \mu) - f(x', u', \nu)| \leq \omega_{\mathcal{F}}(d(x, x') + d(u, u') + W_1(\mu, \nu))$ for all $f \in \mathcal{F}$, and analogously there exists such $\tilde{\omega}_{\mathcal{F}}$ with respect to $(\mathcal{P}(\mathcal{X}), d_\Sigma)$ instead of $(\mathcal{P}(\mathcal{X}), W_1)$ as in Appendix E.
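Concavity of the modulus matters for the estimates that follow, since it allows an expectation to be moved inside $\tilde{\omega}_{\mathcal{F}}$. As a brief worked step (the standard Jensen inequality for concave functions, stated here for convenience): for any nonnegative integrable random variable $Z$,

\begin{align*}
\mathbb{E}\left[ \tilde{\omega}_{\mathcal{F}}(Z) \right] \leq \tilde{\omega}_{\mathcal{F}}\left( \mathbb{E}[Z] \right),
\end{align*}

and since $\tilde{\omega}_{\mathcal{F}}$ is non-decreasing, any upper bound on $\mathbb{E}[Z]$ may then be substituted inside $\tilde{\omega}_{\mathcal{F}}$.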

For the first term (17), let $x^N_t \equiv \{x^{i,N}_t\}_{i \in [N]}$. Then, by the weak LLN argument,

\begin{align}
&\sup_{\pi, \pi^0} \sup_{f \in \mathcal{F}} \left| \mathbb{E}\left[ f(x^{0,N}_{t+1}, u^{0,N}_{t+1}, \mu^N_{t+1}) - f(x^{0,N}_{t+1}, u^{0,N}_{t+1}, \hat{\mu}^N_{t+1}) \right] \right| \notag \\
&\quad \leq \sup_{\pi, \pi^0} \mathbb{E}\left[ \tilde{\omega}_{\mathcal{F}}(d_\Sigma(\mu^N_{t+1}, \hat{\mu}^N_{t+1})) \right] \notag \\
&\quad \leq \sup_{\pi, \pi^0} \tilde{\omega}_{\mathcal{F}}\left( \sum_{m=1}^{\infty} 2^{-m} \, \mathbb{E}\left[ \left| \mu^N_{t+1}(f_m) - \hat{\mu}^N_{t+1}(f_m) \right| \right] \right) \notag \\
&\quad \leq \sup_{\pi, \pi^0} \tilde{\omega}_{\mathcal{F}}\left( \sup_{m \geq 1} \mathbb{E}\left[ \mathbb{E}_{\beta_t}\left[ \left| \mu^N_{t+1}(f_m) - \hat{\mu}^N_{t+1}(f_m) \right| \right] \right] \right) \notag \\
&\quad = \sup_{\pi, \pi^0} \tilde{\omega}_{\mathcal{F}}\left( \sup_{m \geq 1} \mathbb{E}\left[ \mathbb{E}_{\beta_t}\left[ \left| \frac{1}{N} \sum_{i=1}^{N} \left( f_m(x^{i,N}_{t+1}) - \mathbb{E}_{\beta_t}\left[ f_m(x^{i,N}_{t+1}) \right] \right) \right| \right] \right] \right) \notag \\
&\quad \leq \sup_{\pi, \pi^0} \tilde{\omega}_{\mathcal{F}}\left( \sup_{m \geq 1} \mathbb{E}\left[ \mathbb{E}_{\beta_t}\left[ \left| \frac{1}{N} \sum_{i=1}^{N} \left( f_m(x^{i,N}_{t+1}) - \mathbb{E}_{\beta_t}\left[ f_m(x^{i,N}_{t+1}) \right] \right) \right|^2 \right] \right]^{1/2} \right) \notag \\
&\quad = \sup_{\pi, \pi^0} \tilde{\omega}_{\mathcal{F}}\left( \sup_{m \geq 1} \left( \frac{1}{N^2} \sum_{i=1}^{N} \mathbb{E}\left[ \mathbb{E}_{\beta_t}\left[ \left( f_m(x^{i,N}_{t+1}) - \mathbb{E}_{\beta_t}\left[ f_m(x^{i,N}_{t+1}) \right] \right)^2 \right] \right] \right)^{1/2} \right) \notag \\
&\quad \leq \tilde{\omega}_{\mathcal{F}}\left( \frac{2}{\sqrt{N}} \right) \to 0 \tag{19}
\end{align}

for $\beta_t \coloneqq (x^{0,N}_t, u^{0,N}_t, x^N_t)$ by bounding $|f_m| \leq 1$, as the cross-terms disappear.
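The argument inside the modulus in (19) can be checked numerically. The sketch below is only an illustration under stated assumptions: it replaces the conditional law of the next minor states given $\beta_t$ by independent, non-identically distributed draws with hypothetical per-agent probabilities `probs`, and it uses a single test function with values in $\{-1, +1\}$, so that $|f_m| \leq 1$ holds. It then estimates $\mathbb{E}\left[ \left| \frac{1}{N} \sum_i (f_m(x^i) - \mathbb{E}[f_m(x^i)]) \right| \right]$ by Monte Carlo and compares it with $2 / \sqrt{N}$. The function name `lln_gap` and all parameters are assumptions of this sketch.

```python
import numpy as np

def lln_gap(N, n_runs=2000, seed=0):
    """Monte Carlo estimate of E| (1/N) sum_i (f(x_i) - E[f(x_i)]) |.

    Agents are independent but not identically distributed, mimicking
    conditional independence of the minor agents given beta_t.
    """
    rng = np.random.default_rng(seed)
    # Hypothetical per-agent probabilities of f(x_i) = +1.
    probs = rng.uniform(0.1, 0.9, size=N)
    # Each row is one realization of (f(x_1), ..., f(x_N)) with values in {-1, +1}.
    samples = np.where(rng.random((n_runs, N)) < probs, 1.0, -1.0)
    means = 2.0 * probs - 1.0                      # E[f(x_i)] for each agent
    centered_avg = (samples - means).mean(axis=1)  # (1/N) sum_i (f(x_i) - E[f(x_i)])
    return np.abs(centered_avg).mean()

for N in (10, 100, 1000):
    # Print the estimated gap next to the bound 2 / sqrt(N).
    print(N, lln_gap(N), 2.0 / np.sqrt(N))
```

Each centered summand is bounded by $2$ in absolute value and the cross-terms have zero mean, so the second moment of the average is at most $4 / N$, which gives the $2 / \sqrt{N}$ bound after taking the square root. The estimated gap should therefore stay below the bound and shrink as $N$ grows.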

For the second term (18), by noting $\hat{\mu}^N_{t+1} = T(x^{0,N}_t, u^{0,N}_t, \mu^N_t, \mu^N_t \otimes \pi_t(x^{0,N}_t, \mu^N_t))$, we have

\begin{align}
&\sup_{\pi, \pi^0} \sup_{f \in \mathcal{F}} \left| \mathbb{E}\left[ f(x^{0,N}_{t+1}, u^{0,N}_{t+1}, \hat{\mu}^N_{t+1}) - f(x^0_{t+1}, u^0_{t+1}, \mu_{t+1}) \right] \right| \notag \\
&\quad = \sup_{\pi, \pi^0} \sup_{f \in \mathcal{F}} \left| \mathbb{E}\left[ \iint f(x', u', \hat{\mu}^N_{t+1}) \, \pi^0_t(\mathrm{d}u' \mid x', \mu^N_{t+1}) \, p^0(\mathrm{d}x' \mid x^{0,N}_t, u^{0,N}_t, \mu^N_t) \right. \right. \notag \\
&\qquad\qquad\qquad \left. \left. - \iint f(x', u', \mu_{t+1}) \, \pi^0_t(\mathrm{d}u' \mid x', \mu_{t+1}) \, p^0(\mathrm{d}x' \mid x^0_t, u^0_t, \mu_t) \right] \right| \notag \\
&\quad \leq \sup_{\pi, \pi^0} \sup_{f \in \mathcal{F}} \mathbb{E}\left[ \sup_{x'} \left| \int f(x', u', \hat{\mu}^N_{t+1}) \left( \pi^0_t(\mathrm{d}u' \mid x', \mu^N_{t+1}) - \pi^0_t(\mathrm{d}u' \mid x', \hat{\mu}^N_{t+1}) \right) \right| \right] \tag{20} \\
&\qquad + \sup_{\pi, \pi^0} \sup_{g \in \mathcal{G}} \left| \mathbb{E}\left[ g(x^{0,N}_t, u^{0,N}_t, \mu^N_t) - g(x^0_t, u^0_t, \mu_t) \right] \right| \tag{21}
\end{align}

and analyze each term separately, where we define the function $g \colon \mathcal{X}^0 \times \mathcal{U}^0 \times \mathcal{P}(\mathcal{X}) \to \mathbb{R}$ as

g(x0,u0,μ)f(x,u,T)πt0(dux,T)p0(dxx0,u0,μ)𝑔superscript𝑥0superscript𝑢0𝜇double-integral𝑓superscript𝑥superscript𝑢superscript𝑇subscriptsuperscript𝜋0𝑡conditionaldsuperscript𝑢superscript𝑥superscript𝑇superscript𝑝0conditionaldsuperscript𝑥superscript𝑥0superscript𝑢0𝜇\displaystyle g(x^{0},u^{0},\mu)\coloneqq\iint f(x^{\prime},u^{\prime},T^{*})% \pi^{0}_{t}(\mathrm{d}u^{\prime}\mid x^{\prime},T^{*})p^{0}(\mathrm{d}x^{% \prime}\mid x^{0},u^{0},\mu)italic_g ( italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_u start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_μ ) ≔ ∬ italic_f ( italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_u start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_T start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) italic_π start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( roman_d italic_u start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∣ italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_T start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) italic_p start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT ( roman_d italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∣ italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_u start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_μ )

from the class 𝒢𝒢\mathcal{G}caligraphic_G of such functions for any policies π,π0𝜋superscript𝜋0\pi,\pi^{0}italic_π , italic_π start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT, where TT(x0,u0,μ,μπt(x0,μ))superscript𝑇𝑇superscript𝑥0superscript𝑢0𝜇tensor-product𝜇subscript𝜋𝑡superscript𝑥0𝜇T^{*}\coloneqq T(x^{0},u^{0},\mu,\mu\otimes\pi_{t}(x^{0},\mu))italic_T start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ≔ italic_T ( italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_u start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_μ , italic_μ ⊗ italic_π start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_μ ) ).

For (20), defining a modulus of continuity $\tilde{\omega}_{\Pi^{0}}$ for $\Pi^{0}$ as for $\mathcal{F}$, we have

$$\begin{aligned}
&\sup_{\pi,\pi^{0}}\sup_{f\in\mathcal{F}}\mathbb{E}\left[\sup_{x'}\left|\int f(x',u',\hat{\mu}^{N}_{t+1})\,(\pi^{0}_{t}(\mathrm{d}u'\mid x',\mu^{N}_{t+1})-\pi^{0}_{t}(\mathrm{d}u'\mid x',\hat{\mu}^{N}_{t+1}))\right|\right] \\
&\quad\leq\sup_{\pi,\pi^{0}}\mathbb{E}\left[L_{\mathcal{F}}\sup_{x'}W_{1}(\pi^{0}_{t}(\cdot\mid x',\mu^{N}_{t+1}),\pi^{0}_{t}(\cdot\mid x',\hat{\mu}^{N}_{t+1}))\right] \\
&\quad\leq\sup_{\pi,\pi^{0}}\mathbb{E}\left[L_{\mathcal{F}}\,\tilde{\omega}_{\Pi^{0}}(d_{\Sigma}(\mu^{N}_{t+1},\hat{\mu}^{N}_{t+1}))\right]\leq L_{\mathcal{F}}\,\tilde{\omega}_{\Pi^{0}}\left(\frac{2}{\sqrt{N}}\right)\to 0.
\end{aligned}$$

Lastly, for (21), we first note that the class $\mathcal{G}$ of functions is equi-Lipschitz.

Lemma 1.

Under Assumptions 1 and 2, the map $(x^{0},u^{0},\mu)\mapsto T(x^{0},u^{0},\mu,\mu\otimes\pi_{t}(x^{0},\mu))$ is Lipschitz with constant $L_{T}\coloneqq(2L_{\Pi}+1)\cdot(L_{p}+(L_{p}+1)L_{\Pi}+(L_{p}+L_{\Pi}+1))$.

Lemma 2.

Under Assumptions 1 and 2, for any equi-Lipschitz $\mathcal{F}$ with constant $L_{\mathcal{F}}$, the function class $\mathcal{G}$ is equi-Lipschitz with constant $L_{\mathcal{G}}\coloneqq(L_{\mathcal{F}}L_{T}+L_{\mathcal{F}}L_{\Pi^{0}}L_{T}+L_{\mathcal{F}}L_{\Pi}L_{p^{0}})$.
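To get a feel for how these constants scale, the closed-form expressions in Lemmas 1 and 2 can be evaluated directly. The following is a minimal sketch only; the function name and argument names are hypothetical and chosen for illustration, with each argument standing for the Lipschitz constant of the same name in the lemma statements.

```python
def lipschitz_constants(L_p, L_Pi, L_Pi0, L_p0, L_F):
    """Evaluate L_T (Lemma 1) and L_G (Lemma 2) from the given constants.

    Argument names are illustrative: L_p, L_Pi, L_Pi0, L_p0 and L_F stand
    for the constants written L_p, L_Pi, L_{Pi^0}, L_{p^0} and L_F above.
    """
    # Lemma 1: L_T = (2 L_Pi + 1) * (L_p + (L_p + 1) L_Pi + (L_p + L_Pi + 1))
    L_T = (2 * L_Pi + 1) * (L_p + (L_p + 1) * L_Pi + (L_p + L_Pi + 1))
    # Lemma 2: L_G = L_F L_T + L_F L_{Pi^0} L_T + L_F L_Pi L_{p^0}
    L_G = L_F * L_T + L_F * L_Pi0 * L_T + L_F * L_Pi * L_p0
    return L_T, L_G
```

Since $L_{T}$ is quadratic in $L_{\Pi}$ and enters $L_{\mathcal{G}}$ multiplied by $L_{\mathcal{F}}$ and $L_{\Pi^{0}}$, the constants can grow quickly when the bound is iterated over time steps.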

Therefore, for (21), we have

$$\sup_{\pi,\pi^{0}}\sup_{g\in\mathcal{G}}\left|\mathbb{E}\left[g(x^{0,N}_{t},u^{0,N}_{t},\mu^{N}_{t})-g(x^{0}_{t},u^{0}_{t},\mu_{t})\right]\right|\to 0$$

by the induction assumption over the class $\mathcal{G}$ of equi-Lipschitz functions, completing the proof by induction. The existence of independent optimal $\pi$, $\pi^{0}$ follows from Remark 3. This completes the proof.

For finite minor states, we can quantify the convergence rate more precisely as $\mathcal{O}(1/\sqrt{N})$, since the two metrizations $d_{\Sigma}$ and $W_{1}$ are then Lipschitz equivalent and the above moduli of continuity reduce to multiplication by the Lipschitz constant. For convenience, we therefore use the $L_{1}$ distance. The convergence of the first term (17) is immediate by the weak LLN:

$$\begin{aligned}
&\sup_{\pi,\pi^{0}}\sup_{f\in\mathcal{F}}\left|\mathbb{E}\left[f(x^{0,N}_{t+1},u^{0,N}_{t+1},\mu^{N}_{t+1})-f(x^{0,N}_{t+1},u^{0,N}_{t+1},\hat{\mu}^{N}_{t+1})\right]\right| \\
&\quad\leq\sup_{\pi,\pi^{0}}L_{f}\,\mathbb{E}\left[\sum_{x\in\mathcal{X}}\left|\mu^{N}_{t+1}(x)-\hat{\mu}^{N}_{t+1}(x)\right|\right] \\
&\quad=\sup_{\pi,\pi^{0}}L_{f}\sum_{x\in\mathcal{X}}\mathbb{E}\left[\mathbb{E}\left[\left|\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}_{x}(x^{i,N}_{t+1})-\mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}_{x}(x^{i,N}_{t+1})\;\middle\lvert\;x^{0,N}_{t},u^{0,N}_{t},\mu^{N}_{t}\right]\right|\;\middle\lvert\;x^{0,N}_{t},u^{0,N}_{t},\mu^{N}_{t}\right]\right] \\
&\quad\leq L_{f}|\mathcal{X}|\sqrt{\frac{4}{N}},
\end{aligned}$$

and for the second term (18) we again use the induction assumption, completing the proof. ∎
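The $\mathcal{O}(1/\sqrt{N})$ bound on the first term can be checked numerically. The sketch below is illustrative only and not part of the proof: it assumes hypothetical, randomly generated next-state distributions for each of the $N$ minor agents (standing in for the conditional transition laws), samples the next states independently, and estimates the expected $L_{1}$ distance between the empirical distribution and its conditional expectation. The function names are assumptions, and the comparison value is the bound $|\mathcal{X}|\sqrt{4/N}$ from above with $L_{f}=1$.

```python
import math
import random


def mean_l1_error(num_agents, num_states, num_trials=100, seed=0):
    """Monte Carlo estimate of E[sum_x |mu^N(x) - mu_hat^N(x)|].

    Each agent draws its next state independently from its own
    (hypothetical, randomly generated) distribution over the states.
    """
    rng = random.Random(seed)
    states = list(range(num_states))
    # Illustrative per-agent next-state distributions (each sums to 1).
    dists = []
    for _ in range(num_agents):
        weights = [rng.random() + 1e-9 for _ in states]
        total = sum(weights)
        dists.append([w / total for w in weights])
    # Conditional expectation of the empirical next-state distribution.
    mu_hat = [sum(d[x] for d in dists) / num_agents for x in states]
    total_err = 0.0
    for _ in range(num_trials):
        counts = [0] * num_states
        for d in dists:
            counts[rng.choices(states, weights=d)[0]] += 1
        total_err += sum(abs(c / num_agents - m) for c, m in zip(counts, mu_hat))
    return total_err / num_trials


def lln_bound(num_agents, num_states):
    """The bound |X| * sqrt(4 / N) from the derivation, with L_f = 1."""
    return num_states * math.sqrt(4.0 / num_agents)
```

Calling `mean_l1_error(n, 5)` for increasing `n` and comparing against `lln_bound(n, 5)` shows the estimated error staying below the bound and shrinking as `n` grows.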

Appendix K Proof of Lemma 1

Proof.

First, note the Lipschitz continuity of $(x^{0},\mu)\mapsto\mu\otimes\pi_{t}(x^{0},\mu)$ as in Appendix E: for any $(x^{0}_{*},\mu_{*}),(x^{0},\mu)\in\mathcal{X}^{0}\times\mathcal{P}(\mathcal{X})$, we have

$$\begin{aligned}
&\sup_{\pi\in\Pi}W_{1}(\mu_{*}\otimes\pi_{t}(x^{0}_{*},\mu_{*}),\mu\otimes\pi_{t}(x^{0},\mu)) \\
&\quad=\sup_{\pi\in\Pi}\sup_{\lVert f'\rVert_{\mathrm{Lip}}\leq 1}\left|\int f'\,\mathrm{d}(\mu_{*}\otimes\pi_{t}(x^{0}_{*},\mu_{*})-\mu\otimes\pi_{t}(x^{0},\mu))\right| \\
&\quad\leq\sup_{\pi\in\Pi}\sup_{\lVert f'\rVert_{\mathrm{Lip}}\leq 1}\left|\iint f'(x,u)\,(\pi_{t}(\mathrm{d}u\mid x,x^{0}_{*},\mu_{*})-\pi_{t}(\mathrm{d}u\mid x,x^{0},\mu))\,\mu_{*}(\mathrm{d}x)\right| \\
&\qquad+\sup_{\pi\in\Pi}\sup_{\lVert f'\rVert_{\mathrm{Lip}}\leq 1}\left|\iint f'(x,u)\,\pi_{t}(\mathrm{d}u\mid x,x^{0},\mu)\,(\mu_{*}(\mathrm{d}x)-\mu(\mathrm{d}x))\right|
\end{aligned}$$

where for the first term

$$\begin{aligned}
&\sup_{\pi\in\Pi}\sup_{\lVert f'\rVert_{\mathrm{Lip}}\leq 1}\left|\iint f'(x,u)\,(\pi_{t}(\mathrm{d}u\mid x,x^{0}_{*},\mu_{*})-\pi_{t}(\mathrm{d}u\mid x,x^{0},\mu))\,\mu_{*}(\mathrm{d}x)\right| \\
&\quad\leq\sup_{\pi\in\Pi}\sup_{\lVert f'\rVert_{\mathrm{Lip}}\leq 1}\int\left|\int f'(x,u)\,(\pi_{t}(\mathrm{d}u\mid x,x^{0}_{*},\mu_{*})-\pi_{t}(\mathrm{d}u\mid x,x^{0},\mu))\right|\mu_{*}(\mathrm{d}x) \\
&\quad\leq\sup_{\pi\in\Pi}\sup_{\lVert f'\rVert_{\mathrm{Lip}}\leq 1}\sup_{x\in\mathcal{X}}\left|\int f'(x,u)\,(\pi_{t}(\mathrm{d}u\mid x,x^{0}_{*},\mu_{*})-\pi_{t}(\mathrm{d}u\mid x,x^{0},\mu))\right| \\
&\quad=\sup_{\pi\in\Pi}\sup_{x\in\mathcal{X}}W_{1}(\pi_{t}(\cdot\mid x,x^{0}_{*},\mu_{*}),\pi_{t}(\cdot\mid x,x^{0},\mu)) \\
&\quad\leq L_{\Pi}\,d((x^{0}_{*},\mu_{*}),(x^{0},\mu))
\end{aligned}$$

by Assumption 2, and similarly for the second by noting $1$-Lipschitzness of $x\mapsto\int\frac{f'(x,u)}{L_{\Pi}+1}\pi_{t}(\mathrm{d}u\mid x,x^{0},\mu)$, as before in (11), and therefore again

$$\begin{aligned}
&\sup_{\pi\in\Pi}\sup_{\lVert f'\rVert_{\mathrm{Lip}}\leq 1}\left|\iint f'(x,u)\,\pi_{t}(\mathrm{d}u\mid x,x^{0},\mu)\,(\mu_{*}(\mathrm{d}x)-\mu(\mathrm{d}x))\right| \\
&\quad=\sup_{\pi\in\Pi}\sup_{\lVert f'\rVert_{\mathrm{Lip}}\leq 1}(L_{\Pi}+1)\left|\iint\frac{f'(x,u)}{L_{\Pi}+1}\,\pi_{t}(\mathrm{d}u\mid x,x^{0},\mu)\,(\mu_{*}(\mathrm{d}x)-\mu(\mathrm{d}x))\right| \\
&\quad\leq(L_{\Pi}+1)\,W_{1}(\mu_{*},\mu).
\end{aligned}$$

Hence, the map $(x^{0},u^{0},\mu)\mapsto\mu\otimes\pi_{t}(x^{0},\mu)$ is Lipschitz with constant $(2L_{\Pi}+1)$.
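The constant in the statement above comes from adding the two term-wise bounds. As a short worked step, under the assumption (not restated in this appendix) that the product metric $d$ dominates its Wasserstein component, i.e. $W_{1}(\mu_{*},\mu)\leq d((x^{0}_{*},\mu_{*}),(x^{0},\mu))$, one obtains:

```latex
\begin{align*}
\sup_{\pi\in\Pi} W_{1}(\mu_{*}\otimes\pi_{t}(x^{0}_{*},\mu_{*}),\,\mu\otimes\pi_{t}(x^{0},\mu))
  &\leq L_{\Pi}\, d((x^{0}_{*},\mu_{*}),(x^{0},\mu)) + (L_{\Pi}+1)\, W_{1}(\mu_{*},\mu) \\
  &\leq (2L_{\Pi}+1)\, d((x^{0}_{*},\mu_{*}),(x^{0},\mu)).
\end{align*}
```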

As a result, the entire map $(x^{0},u^{0},\mu)\mapsto T(x^{0},u^{0},\mu,\mu\otimes\pi_{t}(x^{0},\mu))$ is Lipschitz, since for any $(x^{0}_{*},u^{0}_{*},\mu_{*})$ and $(x^{0},u^{0},\mu)$ we have

$$W_{1}(T(x^{0}_{*},u^{0}_{*},\mu_{*},\mu_{*}\otimes\pi_{t}(x^{0}_{*},\mu_{*})),T(x^{0},u^{0},\mu,\mu\otimes\pi_{t}(x^{0},\mu)))$$
$$\quad=\sup_{\lVert f'\rVert_{\mathrm{Lip}}\leq 1}\left|\iiint f'(x')\,p(\mathrm{d}x'\mid x,u,x^{0}_{*},u^{0}_{*},\mu_{*})\,\pi_{t}(\mathrm{d}u\mid x,x^{0}_{*},\mu_{*})\,\mu_{*}(\mathrm{d}x)\right.$$
$$\qquad\qquad\left.-\iiint f'(x')\,p(\mathrm{d}x'\mid x,u,x^{0},u^{0},\mu)\,\pi_{t}(\mathrm{d}u\mid x,x^{0},\mu)\,\mu(\mathrm{d}x)\right|$$
$$\quad\leq\sup_{\lVert f'\rVert_{\mathrm{Lip}}\leq 1}\sup_{(x,u)\in\mathcal{X}\times\mathcal{U}}\left|\int f'(x')\,(p(\mathrm{d}x'\mid x,u,x^{0}_{*},u^{0}_{*},\mu_{*})-p(\mathrm{d}x'\mid x,u,x^{0},u^{0},\mu))\right|$$
$$\qquad+\sup_{\lVert f'\rVert_{\mathrm{Lip}}\leq 1}\sup_{x\in\mathcal{X}}\left|\iint f'(x')\,p(\mathrm{d}x'\mid x,u,x^{0},u^{0},\mu)\,(\pi_{t}(\mathrm{d}u\mid x,x^{0}_{*},\mu_{*})-\pi_{t}(\mathrm{d}u\mid x,x^{0},\mu))\right|$$
+supfLip1|f(x)p(dxx,u,x0,u0,μ)πt(dux,x0,μ)(μ(dx)μ(dx))|\displaystyle\qquad+\sup_{\lVert f^{\prime}\rVert_{\mathrm{Lip}}\leq 1}\left|% \iiint f^{\prime}(x^{\prime})p(\mathrm{d}x^{\prime}\mid x,u,x^{0},u^{0},\mu)% \pi_{t}(\mathrm{d}u\mid x,x^{0},\mu)(\mu_{*}(\mathrm{d}x)-\mu(\mathrm{d}x))\right|+ roman_sup start_POSTSUBSCRIPT ∥ italic_f start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT roman_Lip end_POSTSUBSCRIPT ≤ 1 end_POSTSUBSCRIPT | ∭ italic_f start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) italic_p ( roman_d italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∣ italic_x , italic_u , italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_u start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_μ ) italic_π start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( roman_d italic_u ∣ italic_x , italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_μ ) ( italic_μ start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ( roman_d italic_x ) - italic_μ ( roman_d italic_x ) ) |
sup(x,u)𝒳×𝒰W1(p(x,u,x0,u0,μ),p(x,u,x0,u0,μ))\displaystyle\quad\leq\sup_{(x,u)\in\mathcal{X}\times\mathcal{U}}W_{1}(p(\cdot% \mid x,u,x^{0}_{*},u^{0}_{*},\mu_{*}),p(\cdot\mid x,u,x^{0},u^{0},\mu))≤ roman_sup start_POSTSUBSCRIPT ( italic_x , italic_u ) ∈ caligraphic_X × caligraphic_U end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_p ( ⋅ ∣ italic_x , italic_u , italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT , italic_u start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT , italic_μ start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) , italic_p ( ⋅ ∣ italic_x , italic_u , italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_u start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_μ ) )
+supx𝒳(Lp+1)W1(πt(x,x0,μ),πt(x,x0,μ))\displaystyle\qquad+\sup_{x\in\mathcal{X}}(L_{p}+1)W_{1}(\pi_{t}(\cdot\mid x,x% ^{0}_{*},\mu_{*}),\pi_{t}(\cdot\mid x,x^{0},\mu))+ roman_sup start_POSTSUBSCRIPT italic_x ∈ caligraphic_X end_POSTSUBSCRIPT ( italic_L start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT + 1 ) italic_W start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( ⋅ ∣ italic_x , italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT , italic_μ start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) , italic_π start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( ⋅ ∣ italic_x , italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_μ ) )
+sup(x,u)𝒳×𝒰(Lp+LΠ+1)W1(μ,μ)subscriptsupremum𝑥𝑢𝒳𝒰subscript𝐿𝑝subscript𝐿Π1subscript𝑊1subscript𝜇𝜇\displaystyle\qquad+\sup_{(x,u)\in\mathcal{X}\times\mathcal{U}}(L_{p}+L_{\Pi}+% 1)W_{1}(\mu_{*},\mu)+ roman_sup start_POSTSUBSCRIPT ( italic_x , italic_u ) ∈ caligraphic_X × caligraphic_U end_POSTSUBSCRIPT ( italic_L start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT + italic_L start_POSTSUBSCRIPT roman_Π end_POSTSUBSCRIPT + 1 ) italic_W start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_μ start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT , italic_μ )
(Lp+(Lp+1)LΠ+(Lp+LΠ+1))Ld((x0,u0,μ),(x0,u0,μ))absentsubscriptsubscript𝐿𝑝subscript𝐿𝑝1subscript𝐿Πsubscript𝐿𝑝subscript𝐿Π1subscript𝐿𝑑subscriptsuperscript𝑥0subscriptsuperscript𝑢0subscript𝜇superscript𝑥0superscript𝑢0𝜇\displaystyle\quad\leq\underbrace{(L_{p}+(L_{p}+1)L_{\Pi}+(L_{p}+L_{\Pi}+1))}_% {L_{*}}d((x^{0}_{*},u^{0}_{*},\mu_{*}),(x^{0},u^{0},\mu))≤ under⏟ start_ARG ( italic_L start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT + ( italic_L start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT + 1 ) italic_L start_POSTSUBSCRIPT roman_Π end_POSTSUBSCRIPT + ( italic_L start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT + italic_L start_POSTSUBSCRIPT roman_Π end_POSTSUBSCRIPT + 1 ) ) end_ARG start_POSTSUBSCRIPT italic_L start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_d ( ( italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT , italic_u start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT , italic_μ start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) , ( italic_x start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_u start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_μ ) )

with Lipschitz constant $L_T \coloneqq (2 L_\Pi + 1) \cdot L_*$ from Assumptions 1 and 2, using the same argument as in (11). ∎
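The Wasserstein bounds in this proof rest on Kantorovich-Rubinstein duality: for any $L$-Lipschitz function $f$, $\lvert \int f \, \mathrm{d}\mu - \int f \, \mathrm{d}\nu \rvert \leq L \, W_1(\mu, \nu)$. The following Python sketch is a numerical sanity check of that single inequality on a one-dimensional grid; it is not part of the original argument. The grid `points`, the weights `mu` and `nu`, and the test function `f` are illustrative assumptions, and the closed form $W_1 = \int \lvert F_\mu - F_\nu \rvert$ used in `wasserstein_1` is valid only in one dimension.

```python
from itertools import accumulate


def wasserstein_1(points, mu, nu):
    """W_1 between two distributions supported on sorted real points.

    Uses the one-dimensional identity W_1 = integral of |F_mu - F_nu|.
    """
    cdf_mu = list(accumulate(mu))
    cdf_nu = list(accumulate(nu))
    return sum(
        abs(cdf_mu[i] - cdf_nu[i]) * (points[i + 1] - points[i])
        for i in range(len(points) - 1)
    )


def expectation(f, points, weights):
    """Integral of f against a discrete distribution."""
    return sum(f(x) * w for x, w in zip(points, weights))


# Hypothetical example data (not taken from the paper).
points = [0.0, 1.0, 2.0, 3.0]
mu = [0.1, 0.4, 0.3, 0.2]
nu = [0.3, 0.3, 0.2, 0.2]

lipschitz_const = 2.0


def f(x):
    # Lipschitz with constant 2 on the real line.
    return lipschitz_const * abs(x - 1.5)


gap = abs(expectation(f, points, mu) - expectation(f, points, nu))
bound = lipschitz_const * wasserstein_1(points, mu, nu)
print(f"gap = {gap:.4f}, Lipschitz * W_1 = {bound:.4f}")
```

The printed gap should never exceed the printed bound, whichever Lipschitz test function is substituted for `f`.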

Appendix L Proof of Lemma 2

Proof.

For any $g \in \mathcal{G}$ and any $(x^0_*, u^0_*, \mu_*), (x^0, u^0, \mu) \in \mathcal{X}^0 \times \mathcal{U}^0 \times \mathcal{P}(\mathcal{X})$, let $T_* \coloneqq T(x^0_*, u^0_*, \mu_*, \mu_* \otimes \pi_t(x^0_*, \mu_*))$ and $T^* \coloneqq T(x^0, u^0, \mu, \mu \otimes \pi_t(x^0, \mu))$ for brevity. We have

\begin{align*}
&\left| g(x^0_*, u^0_*, \mu_*) - g(x^0, u^0, \mu) \right| \\
&\quad = \biggl| \iint f(x', u', T_*) \, \pi^0_t(\mathrm{d}u' \mid x', T_*) \, p^0(\mathrm{d}x' \mid x^0_*, u^0_*, \mu_*) \\
&\hspace{4em} - \iint f(x', u', T^*) \, \pi^0_t(\mathrm{d}u' \mid x', T^*) \, p^0(\mathrm{d}x' \mid x^0, u^0, \mu) \biggr| \\
&\quad \leq \sup_{x', u'} \left| f(x', u', T_*) - f(x', u', T^*) \right| \tag{22} \\
&\qquad + \sup_{x'} \biggl| \int f(x', u', T^*) \bigl( \pi^0_t(\mathrm{d}u' \mid x', T_*) - \pi^0_t(\mathrm{d}u' \mid x', T^*) \bigr) \biggr| \tag{23} \\
&\qquad + \biggl| \iint f(x', u', T^*) \, \pi^0_t(\mathrm{d}u' \mid x', T^*) \bigl( p^0(\mathrm{d}x' \mid x^0_*, u^0_*, \mu_*) - p^0(\mathrm{d}x' \mid x^0, u^0, \mu) \bigr) \biggr|. \tag{24}
\end{align*}

By Lemma 1, for (22) we obtain

\begin{align*}
&\sup_{x', u'} \bigl| f(x', u', T(x^0_*, u^0_*, \mu_*, \mu_* \otimes \pi_t(x^0_*, \mu_*))) - f(x', u', T(x^0, u^0, \mu, \mu \otimes \pi_t(x^0, \mu))) \bigr| \\
&\quad \leq L_{\mathcal{F}} L_T \, d\bigl( (x^0_*, u^0_*, \mu_*),\, (x^0, u^0, \mu) \bigr).
\end{align*}

Similarly, for (23), Assumption 2 gives

\begin{align*}
&\sup_{x'} \biggl| \int f(x', u', T(x^0, u^0, \mu, \mu \otimes \pi_t(x^0, \mu))) \\
&\hspace{2em} \bigl( \pi^0_t(\mathrm{d}u' \mid x', T(x^0_*, u^0_*, \mu_*, \mu_* \otimes \pi_t(x^0_*, \mu_*))) - \pi^0_t(\mathrm{d}u' \mid x', T(x^0, u^0, \mu, \mu \otimes \pi_t(x^0, \mu))) \bigr) \biggr| \\
&\quad \leq L_{\mathcal{F}} \, W_1\bigl( \pi^0_t(\cdot \mid x', T(x^0_*, u^0_*, \mu_*, \mu_* \otimes \pi_t(x^0_*, \mu_*))),\, \pi^0_t(\cdot \mid x', T(x^0, u^0, \mu, \mu \otimes \pi_t(x^0, \mu))) \bigr) \\
&\quad \leq L_{\mathcal{F}} L_{\Pi^0} L_T \, d\bigl( (x^0_*, u^0_*, \mu_*),\, (x^0, u^0, \mu) \bigr).
\end{align*}

Lastly, for (24), as in (11), Assumptions 1 and 2 again give

\begin{align*}
&\biggl| \iint f(x', u', T(x^0, u^0, \mu, \mu \otimes \pi_t(x^0, \mu))) \, \pi^0_t(\mathrm{d}u' \mid x', T(x^0, u^0, \mu, \mu \otimes \pi_t(x^0, \mu))) \\
&\hspace{6em} \bigl( p^0(\mathrm{d}x' \mid x^0_*, u^0_*, \mu_*) - p^0(\mathrm{d}x' \mid x^0, u^0, \mu) \bigr) \biggr| \\
&\quad \leq L_{\mathcal{F}} L_\Pi \, W_1\bigl( p^0(\cdot \mid x^0_*, u^0_*, \mu_*),\, p^0(\cdot \mid x^0, u^0, \mu) \bigr) \\
&\quad \leq L_{\mathcal{F}} L_\Pi L_{p^0} \, d\bigl( (x^0_*, u^0_*, \mu_*),\, (x^0, u^0, \mu) \bigr).
\end{align*}

Therefore, $\mathcal{G}$ is equi-Lipschitz with Lipschitz constant $(L_{\mathcal{F}} L_T + L_{\mathcal{F}} L_{\Pi^0} L_T + L_{\mathcal{F}} L_\Pi L_{p^0})$. ∎
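To see how the constants from Lemma 1 and Lemma 2 compose, the following Python sketch evaluates $L_*$, $L_T$, and the equi-Lipschitz constant of $\mathcal{G}$ from the base constants. The function name `lipschitz_constants` and its argument names are hypothetical; the formulas are copied from the two proofs, and the input values are arbitrary placeholders rather than values from the paper.

```python
def lipschitz_constants(L_p, L_p0, L_Pi, L_Pi0, L_F):
    """Compose the Lipschitz constants appearing in Lemmas 1 and 2.

    L_p, L_p0   : constants of the minor and major transition kernels
    L_Pi, L_Pi0 : constants of the minor and major policy classes
    L_F         : equi-Lipschitz constant of the function class F
    """
    # L_* from the final inequality in the proof of Lemma 1.
    L_star = L_p + (L_p + 1) * L_Pi + (L_p + L_Pi + 1)
    # L_T := (2 L_Pi + 1) * L_*.
    L_T = (2 * L_Pi + 1) * L_star
    # Sum of the three bounds on (22), (23), and (24).
    L_G = L_F * L_T + L_F * L_Pi0 * L_T + L_F * L_Pi * L_p0
    return L_star, L_T, L_G


# Placeholder inputs: every base constant set to 1.
L_star, L_T, L_G = lipschitz_constants(L_p=1, L_p0=1, L_Pi=1, L_Pi0=1, L_F=1)
print(L_star, L_T, L_G)
```

Even with all base constants equal to one, the composed constant is noticeably larger, which illustrates that the bound is qualitative (equi-Lipschitz) rather than tight.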

Appendix M Proof of Corollary 1

Proof.

As in Lemma 1, for any $\varepsilon > 0$, choose time $T \in \mathbb{N}$ such that

\begin{align*}
\sum_{t=T}^{\infty} \gamma^t \left| \operatorname{\mathbb{E}} \left[ r(x^{0,N}_t, u^{0,N}_t, \mu^N_t) - r(x^0_t, u^0_t, \mu_t) \right] \right| \leq \frac{\gamma^T}{1 - \gamma} \max_{\mu} 2 |r(\mu)| < \frac{\varepsilon}{2}.
\end{align*}
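To make the choice of $T$ concrete, the following Python sketch searches for the smallest horizon satisfying the tail bound above. It assumes a discount factor $\gamma \in (0, 1)$ and a known bound `r_max` on $\lvert r \rvert$; the function name `truncation_horizon` and the numerical inputs are illustrative and do not come from the paper.

```python
def tail_bound(gamma, r_max, T):
    """Right-hand side gamma^T / (1 - gamma) * 2 * r_max of the tail estimate."""
    return gamma ** T / (1.0 - gamma) * 2.0 * r_max


def truncation_horizon(gamma, r_max, eps):
    """Smallest T in N such that tail_bound(gamma, r_max, T) < eps / 2."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    if eps <= 0.0 or r_max < 0.0:
        raise ValueError("eps must be positive and r_max non-negative")
    T = 0
    # The bound decays geometrically in T, so this loop terminates.
    while tail_bound(gamma, r_max, T) >= eps / 2.0:
        T += 1
    return T


# Hypothetical inputs.
gamma, r_max, eps = 0.9, 1.0, 0.1
T = truncation_horizon(gamma, r_max, eps)
print(T, tail_bound(gamma, r_max, T))
```

Because the tail decays like $\gamma^T$, the required horizon grows only logarithmically in $1/\varepsilon$, and it does not depend on the number of agents $N$.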

By Theorem 2,

\begin{align*}
\sum_{t=0}^{T-1} \gamma^t \left| \operatorname{\mathbb{E}} \left[ r(x^{0,N}_t, u^{0,N}_t, \mu^N_t) - r(x^0_t, u^0_t, \mu_t) \right] \right| < \frac{\varepsilon}{2}
\end{align*}

for sufficiently large $N$. Therefore, $\sup_{(\pi, \pi^0) \in \Pi \times \Pi^0} \left| J^N(\pi, \pi^0) - J(\Phi^{-1}(\pi), \pi^0) \right| \to 0$.

As a result, we have

\begin{align*}
&J^{N}(\Phi(\hat{\pi}^{*}),\pi^{0*})-\sup_{(\pi,\pi^{0})\in\Pi\times\Pi^{0}}J^{N}(\pi,\pi^{0}) \\
&\quad=\inf_{(\pi,\pi^{0})\in\Pi\times\Pi^{0}}\left(J^{N}(\Phi(\hat{\pi}^{*}),\pi^{0*})-J^{N}(\pi,\pi^{0})\right) \\
&\quad\geq\inf_{(\pi,\pi^{0})\in\Pi\times\Pi^{0}}\left(J^{N}(\Phi(\hat{\pi}^{*}),\pi^{0*})-J(\hat{\pi}^{*},\pi^{0*})\right) \\
&\qquad+\inf_{(\pi,\pi^{0})\in\Pi\times\Pi^{0}}\left(J(\hat{\pi}^{*},\pi^{0*})-J(\pi,\pi^{0})\right) \\
&\qquad+\inf_{(\pi,\pi^{0})\in\Pi\times\Pi^{0}}\left(J(\pi,\pi^{0})-J^{N}(\pi,\pi^{0})\right) \\
&\quad\geq-\frac{\varepsilon}{2}+0-\frac{\varepsilon}{2}=-\varepsilon
\end{align*}

for sufficiently large $N$, where the second term is zero by optimality of $(\hat{\pi}^{*},\pi^{0*})$ in the M3FC problem. ∎

Appendix N Proof of Theorem 1

First, for completeness we give the finite M3FC system equations under the assumed Lipschitz parametrization for joint stationary M3FMARL policies\footnote{Note that deterministic joint policies $\tilde{\pi}^{\theta}$ (e.g. at convergence, or if using deterministic policy gradients [Silver et al., 2014]) are equivalent to using separate deterministic minor and major policies in (1), see also Remark 3.} $\tilde{\pi}^{\theta}$ used during centralized training with correlated minor agent actions, as

\begin{align*}
u^{0,N}_{t},\xi^{N}_{t} &\sim\tilde{\pi}^{\theta}(u^{0,N}_{t},\xi^{N}_{t}\mid x^{0,N}_{t},\mu_{t}^{N}),\quad \pi^{\prime N}_{t}=\Gamma(\xi^{N}_{t}),\quad u^{i,N}_{t}\sim\pi^{\prime N}_{t}(u^{i,N}_{t}\mid x^{i,N}_{t}), \\
x^{i,N}_{t+1} &\sim p(x^{i,N}_{t+1}\mid x^{i,N}_{t},u^{i,N}_{t},x^{0,N}_{t},u^{0,N}_{t},\mu_{t}^{N}),\quad x^{0,N}_{t+1}\sim p^{0}(x^{0,N}_{t+1}\mid x^{0,N}_{t},u^{0,N}_{t},\mu_{t}^{N}),
\end{align*}

as well as the limiting M3FC MDP under such parametrization as

\begin{align*}
u^{0}_{t},\xi_{t} &\sim\tilde{\pi}^{\theta}(u^{0}_{t},\xi_{t}\mid x^{0}_{t},\mu_{t}),\quad \pi^{\prime}_{t}=\Gamma(\xi_{t}),\quad h_{t}=\mu_{t}\otimes\pi^{\prime}_{t}, \\
\mu_{t+1} &=T(x^{0}_{t},u^{0}_{t},\mu_{t},h_{t}),\quad x^{0}_{t+1}\sim p^{0}(x^{0}_{t+1}\mid x^{0}_{t},u^{0}_{t},\mu_{t}).
\end{align*}
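The relation between the finite system and its mean field limit can be illustrated numerically. The following is a minimal sketch, not the paper's implementation: the binary state and action spaces, the transition kernel `minor_kernel`, the decision rule `pi`, and the fixed major state and action are all hypothetical choices made for illustration. It compares one step of the $N$-agent empirical distribution with the deterministic update $T(x^{0},u^{0},\mu,h)$ for $h=\mu\otimes\pi^{\prime}$.

```python
import numpy as np

def minor_kernel(x0, u0, mu):
    """Hypothetical minor kernel p(x' | x, u, x0, u0, mu), returned with shape (X, U, X)."""
    congestion = mu[1]                        # mass in state 1 couples the agents
    p_to_1 = np.array([[0.2, 0.7],            # rows: minor state x, columns: minor action u
                       [0.5, 0.9]])
    p_to_1 = np.clip(p_to_1 - 0.3 * congestion + 0.1 * x0 - 0.1 * u0, 0.0, 1.0)
    return np.stack([1.0 - p_to_1, p_to_1], axis=-1)

def mean_field_step(x0, u0, mu, pi):
    """Limiting update mu' = T(x0, u0, mu, h) with h(x, u) = mu(x) * pi(u | x)."""
    h = mu[:, None] * pi
    return np.einsum("xu,xuy->y", h, minor_kernel(x0, u0, mu))

def finite_step(rng, x0, u0, states, pi):
    """One step of the N-agent system: sample actions, then next minor states."""
    n = states.shape[0]
    mu_n = np.bincount(states, minlength=2) / n               # empirical mean field
    actions = (rng.random(n) < pi[states, 1]).astype(int)     # u^i ~ pi(. | x^i)
    p = minor_kernel(x0, u0, mu_n)
    return (rng.random(n) < p[states, actions, 1]).astype(int)

def one_step_gap(n, seed=0):
    """L1 distance between the empirical next-step distribution and the limiting update."""
    rng = np.random.default_rng(seed)
    pi = np.array([[0.3, 0.7], [0.6, 0.4]])   # shared minor decision rule pi'
    states = (rng.random(n) < 0.6).astype(int)
    mu_n = np.bincount(states, minlength=2) / n
    next_states = finite_step(rng, 1, 0, states, pi)
    mu_n_next = np.bincount(next_states, minlength=2) / n
    return np.abs(mu_n_next - mean_field_step(1, 0, mu_n, pi)).sum()
```

Under these assumptions the gap is pure sampling noise, so calling `one_step_gap` with increasing `n` should show it shrinking at roughly the rate $1/\sqrt{N}$.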

Then, by Sutton et al. [1999], the exact policy gradient for the limiting M3FC MDP is given as

\[
\nabla_{\theta}J(\tilde{\pi}^{\theta})=\sum_{t=0}^{\infty}\gamma^{t}\mathbb{E}\left[Q^{\theta}(x^{0}_{t},\mu_{t},u^{0}_{t},\xi_{t})\nabla_{\theta}\log\tilde{\pi}^{\theta}(u^{0}_{t},\xi_{t}\mid x^{0}_{t},\mu_{t})\right]
\]

under the action-value function

\[
Q^{\theta}(x^{0},\mu,u^{0},\xi)=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r(x^{0}_{t},u^{0}_{t},\mu_{t})\;\middle|\;x^{0}_{0}=x^{0},\mu_{0}=\mu,u^{0}_{0}=u^{0},\xi_{0}=\xi\right],
\]

while the approximation for the policy gradient on the finite M3FC system is given instead by

\[
\widehat{\nabla_{\theta}J}(\tilde{\pi}^{\theta})=\sum_{t=0}^{\infty}\gamma^{t}\mathbb{E}\left[\widehat{Q}^{\theta}(x^{0,N}_{t},\mu^{N}_{t},u^{0,N}_{t},\xi^{N}_{t})\nabla_{\theta}\log\tilde{\pi}^{\theta}(u^{0,N}_{t},\xi^{N}_{t}\mid x^{0,N}_{t},\mu^{N}_{t})\right]
\]

and the finite-agent action-values

\[
\widehat{Q}^{\theta}(x^{0},\mu,u^{0},\xi)=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r(x^{0,N}_{t},u^{0,N}_{t},\mu^{N}_{t})\;\middle|\;x^{0,N}_{0}=x^{0},\mu^{N}_{0}=\mu,u^{0,N}_{0}=u^{0},\xi^{N}_{0}=\xi\right],
\]

which are obtained, e.g., by on-policy samples and using critic estimates. Note that here, the conditional expectations are given by redefining the systems (1) and (3) with the values conditioned upon.
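As a rough illustration of how such a gradient approximation can be formed from on-policy samples, the sketch below computes a single-trajectory score-function estimate $\sum_t \gamma^t \widehat{Q}_t \nabla_\theta \log \tilde{\pi}^\theta$. It is an assumption-laden toy and not the paper's algorithm: the linear-softmax policy over a finite action set, the observation vector standing in for $(x^{0,N}_t,\mu^N_t)$, the integer action standing in for a discretized $(u^{0,N}_t,\xi^N_t)$, and all function names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax(theta, obs, action):
    """Gradient of log pi_theta(action | obs) for logits theta @ obs (hypothetical policy class)."""
    probs = softmax(theta @ obs)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    return np.outer(one_hot - probs, obs)     # same shape as theta

def policy_gradient_estimate(theta, trajectory, gamma):
    """Single-trajectory estimate of sum_t gamma^t * q_hat_t * grad log pi(a_t | o_t).

    trajectory: list of (obs, action, q_hat), where q_hat is a critic estimate.
    """
    grad = np.zeros_like(theta)
    for t, (obs, action, q_hat) in enumerate(trajectory):
        grad += gamma**t * q_hat * grad_log_softmax(theta, obs, action)
    return grad
```

Averaging this estimate over many sampled trajectories approximates the expectation in the formula above; practical implementations typically replace $\widehat{Q}^{\theta}$ by an advantage estimate to reduce variance.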

We then show that the approximation of the policy gradient is good for large systems, i.e.

\[
\left\|\widehat{\nabla_{\theta}J}(\tilde{\pi}^{\theta})-\nabla_{\theta}J(\hat{\pi}^{\theta})\right\|\to 0 \tag{25}
\]

as $N\to\infty$, uniformly over all current policy parameters $\theta$.

Proof of Theorem 1.

We use the following two propositions in the proof of Theorem 1; their proofs are given below.

Proposition 1.

Propagation of chaos holds for the M3FC systems with parameterized actions as in Theorem 2, i.e. under Assumptions 1, 2 and 1, for any equi-Lipschitz family $\mathcal{F}$, at all times $t\in\mathbb{N}$ uniformly,

\[
\sup_{f,\pi,\pi^{0}}\left|\mathbb{E}\left[f(x^{0,N}_{t},u^{0,N}_{t},\mu_{t}^{N})-f(x^{0}_{t},u^{0}_{t},\mu_{t})\right]\right|\to 0. \tag{26}
\]
Proposition 2.

Under Assumptions 1 and 2, the approximate action-values converge uniformly, $\widehat{Q}^{\theta}\to Q^{\theta}$ as $N\to\infty$.

As a result, we obtain

\begin{align*}
&\left\|\widehat{\nabla_{\theta}J}(\tilde{\pi}^{\theta})-\nabla_{\theta}J(\hat{\pi}^{\theta})\right\| \\
&\quad=\left\|\sum_{t=0}^{\infty}\gamma^{t}\mathbb{E}\left[\widehat{Q}^{\theta}(x^{0,N}_{t},\mu^{N}_{t},u^{0,N}_{t},\xi^{N}_{t})\nabla_{\theta}\log\tilde{\pi}^{\theta}(u^{0,N}_{t},\xi^{N}_{t}\mid x^{0,N}_{t},\mu^{N}_{t})-Q^{\theta}(x^{0}_{t},\mu_{t},u^{0}_{t},\xi_{t})\nabla_{\theta}\log\tilde{\pi}^{\theta}(u^{0}_{t},\xi_{t}\mid x^{0}_{t},\mu_{t})\right]\right\| \\
&\quad\leq\left\|\sum_{t=0}^{\infty}\gamma^{t}\mathbb{E}\left[\left(\widehat{Q}^{\theta}(x^{0,N}_{t},\mu^{N}_{t},u^{0,N}_{t},\xi^{N}_{t})-Q^{\theta}(x^{0,N}_{t},\mu^{N}_{t},u^{0,N}_{t},\xi^{N}_{t})\right)\nabla_{\theta}\log\tilde{\pi}^{\theta}(u^{0,N}_{t},\xi^{N}_{t}\mid x^{0,N}_{t},\mu^{N}_{t})\right]\right\| \\
&\qquad+\left\|\sum_{t=T}^{\infty}\gamma^{t}\mathbb{E}\left[Q^{\theta}(x^{0,N}_{t},\mu^{N}_{t},u^{0,N}_{t},\xi^{N}_{t})\nabla_{\theta}\log\tilde{\pi}^{\theta}(u^{0,N}_{t},\xi^{N}_{t}\mid x^{0,N}_{t},\mu^{N}_{t})-Q^{\theta}(x^{0}_{t},\mu_{t},u^{0}_{t},\xi_{t})\nabla_{\theta}\log\tilde{\pi}^{\theta}(u^{0}_{t},\xi_{t}\mid x^{0}_{t},\mu_{t})\right]\right\| \\
&\qquad+\left\|\sum_{t=0}^{T-1}\gamma^{t}\mathbb{E}\left[Q^{\theta}(x^{0,N}_{t},\mu^{N}_{t},u^{0,N}_{t},\xi^{N}_{t})\nabla_{\theta}\log\tilde{\pi}^{\theta}(u^{0,N}_{t},\xi^{N}_{t}\mid x^{0,N}_{t},\mu^{N}_{t})-Q^{\theta}(x^{0}_{t},\mu_{t},u^{0}_{t},\xi_{t})\nabla_{\theta}\log\tilde{\pi}^{\theta}(u^{0}_{t},\xi_{t}\mid x^{0}_{t},\mu_{t})\right]\right\|
\end{align*}

for any $T$. The first term vanishes as $N\to\infty$ by Proposition 2, since Assumption 1 uniformly bounds $\nabla_{\theta}\log\tilde{\pi}^{\theta}$. Note that we bound $\nabla_{\theta}\log\tilde{\pi}^{\theta}$ here, but we can also assume bounded gradients $\nabla_{\theta}\tilde{\pi}^{\theta}$ instead, e.g. (27).

For the second term, we similarly bound $\nabla_{\theta}\log\tilde{\pi}^{\theta}$ uniformly by Assumption 1 and $Q$ by Assumption 1, and then choose $T$ sufficiently large.
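The choice of $T$ can be made explicit. In the following sketch, $M_{Q}$ and $M_{\nabla}$ are hypothetical names (not notation from the paper) for the uniform bounds on $|Q^{\theta}|$ and $\|\nabla_{\theta}\log\tilde{\pi}^{\theta}\|$ provided by the assumptions, and the discount factor is assumed to satisfy $\gamma\in(0,1)$. Bounding each of the two products inside the expectation by $M_{Q}M_{\nabla}$ and summing the geometric series gives

\begin{align*}
\left\|\sum_{t=T}^{\infty}\gamma^{t}\mathbb{E}\left[Q^{\theta}(x^{0,N}_{t},\mu^{N}_{t},u^{0,N}_{t},\xi^{N}_{t})\nabla_{\theta}\log\tilde{\pi}^{\theta}(u^{0,N}_{t},\xi^{N}_{t}\mid x^{0,N}_{t},\mu^{N}_{t})-Q^{\theta}(x^{0}_{t},\mu_{t},u^{0}_{t},\xi_{t})\nabla_{\theta}\log\tilde{\pi}^{\theta}(u^{0}_{t},\xi_{t}\mid x^{0}_{t},\mu_{t})\right]\right\|
\leq\sum_{t=T}^{\infty}\gamma^{t}\cdot 2M_{Q}M_{\nabla}
=\frac{2M_{Q}M_{\nabla}\,\gamma^{T}}{1-\gamma},
\end{align*}

which is at most any given $\varepsilon>0$ once $T\geq\log\left(\varepsilon(1-\gamma)/(2M_{Q}M_{\nabla})\right)/\log\gamma$. This threshold depends on neither $N$ nor $\theta$, which is what the uniformity of the argument requires.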

Finally, for the last term, we note that we can write the difference as

\begin{align*}
&\left\| \sum_{t=0}^{T-1} \gamma^{t} \operatorname{\mathbb{E}}\left[ Q^{\theta}(x^{0,N}_{t},\mu^{N}_{t},u^{0,N}_{t},\xi^{N}_{t}) \nabla_{\theta}\log\tilde{\pi}^{\theta}(u^{0,N}_{t},\xi^{N}_{t}\mid x^{0,N}_{t},\mu^{N}_{t}) - Q^{\theta}(x^{0}_{t},\mu_{t},u^{0}_{t},\xi_{t}) \nabla_{\theta}\log\tilde{\pi}^{\theta}(u^{0}_{t},\xi_{t}\mid x^{0}_{t},\mu_{t}) \right] \right\| \\
&= \left\| \sum_{t=0}^{T-1} \gamma^{t} \operatorname{\mathbb{E}}\left[ \sum_{t'=0}^{\infty} \gamma^{t'} \operatorname{\mathbb{E}}\left[ r(x^{0\prime}_{t'},u^{0\prime}_{t'},\mu'_{t'}) \;\middle\lvert\; x^{0\prime}_{0}=x^{0,N}_{t}, \mu'_{0}=\mu^{N}_{t}, u^{0\prime}_{0}=u^{0,N}_{t}, \xi'_{0}=\xi^{N}_{t} \right] \nabla_{\theta}\log\tilde{\pi}^{\theta}(u^{0,N}_{t},\xi^{N}_{t}\mid x^{0,N}_{t},\mu^{N}_{t}) \right.\right. \\
&\qquad\qquad \left.\left. - \sum_{t'=0}^{\infty} \gamma^{t'} \operatorname{\mathbb{E}}\left[ r(x^{0\prime}_{t'},u^{0\prime}_{t'},\mu'_{t'}) \;\middle\lvert\; x^{0\prime}_{0}=x^{0}_{t}, \mu'_{0}=\mu_{t}, u^{0\prime}_{0}=u^{0}_{t}, \xi'_{0}=\xi_{t} \right] \nabla_{\theta}\log\tilde{\pi}^{\theta}(u^{0}_{t},\xi_{t}\mid x^{0}_{t},\mu_{t}) \right] \right\| \\
&\leq \left\| \sum_{t=0}^{T-1} \gamma^{t} \operatorname{\mathbb{E}}\left[ \sum_{t'=T'}^{\infty} \gamma^{t'} \operatorname{\mathbb{E}}\left[ r(x^{0\prime}_{t'},u^{0\prime}_{t'},\mu'_{t'}) \;\middle\lvert\; x^{0\prime}_{0}=x^{0,N}_{t}, \mu'_{0}=\mu^{N}_{t}, u^{0\prime}_{0}=u^{0,N}_{t}, \xi'_{0}=\xi^{N}_{t} \right] \nabla_{\theta}\log\tilde{\pi}^{\theta}(u^{0,N}_{t},\xi^{N}_{t}\mid x^{0,N}_{t},\mu^{N}_{t}) \right.\right. \\
&\qquad\qquad \left.\left. - \sum_{t'=T'}^{\infty} \gamma^{t'} \operatorname{\mathbb{E}}\left[ r(x^{0\prime}_{t'},u^{0\prime}_{t'},\mu'_{t'}) \;\middle\lvert\; x^{0\prime}_{0}=x^{0}_{t}, \mu'_{0}=\mu_{t}, u^{0\prime}_{0}=u^{0}_{t}, \xi'_{0}=\xi_{t} \right] \nabla_{\theta}\log\tilde{\pi}^{\theta}(u^{0}_{t},\xi_{t}\mid x^{0}_{t},\mu_{t}) \right] \right\| \\
&\quad + \left\| \sum_{t=0}^{T-1} \gamma^{t} \operatorname{\mathbb{E}}\left[ \sum_{t'=0}^{T'-1} \gamma^{t'} \operatorname{\mathbb{E}}\left[ r(x^{0\prime}_{t'},u^{0\prime}_{t'},\mu'_{t'}) \;\middle\lvert\; x^{0\prime}_{0}=x^{0,N}_{t}, \mu'_{0}=\mu^{N}_{t}, u^{0\prime}_{0}=u^{0,N}_{t}, \xi'_{0}=\xi^{N}_{t} \right] \nabla_{\theta}\log\tilde{\pi}^{\theta}(u^{0,N}_{t},\xi^{N}_{t}\mid x^{0,N}_{t},\mu^{N}_{t}) \right.\right. \\
&\qquad\qquad \left.\left. - \sum_{t'=0}^{T'-1} \gamma^{t'} \operatorname{\mathbb{E}}\left[ r(x^{0\prime}_{t'},u^{0\prime}_{t'},\mu'_{t'}) \;\middle\lvert\; x^{0\prime}_{0}=x^{0}_{t}, \mu'_{0}=\mu_{t}, u^{0\prime}_{0}=u^{0}_{t}, \xi'_{0}=\xi_{t} \right] \nabla_{\theta}\log\tilde{\pi}^{\theta}(u^{0}_{t},\xi_{t}\mid x^{0}_{t},\mu_{t}) \right] \right\|
\end{align*}

where we write the conditional M3FC system and random variables in the inner expectation with a prime, bounding again the former terms by choosing sufficiently large $T'$ and using Assumptions 1 and 1, while for the latter terms we use Proposition 1 on the functions

\begin{align}
f(x^{0},\mu) = \iint \operatorname{\mathbb{E}}\left[ r(x^{0\prime}_{t'},u^{0\prime}_{t'},\mu'_{t'}) \;\middle\lvert\; x^{0\prime}_{0}=x^{0}, \mu'_{0}=\mu, u^{0\prime}_{0}=u^{0}, \xi'_{0}=\xi \right] \nabla_{\theta}\tilde{\pi}^{\theta}(u^{0},\xi\mid x^{0},\mu) \,\mathrm{d}(u^{0},\xi) \tag{27}
\end{align}

for all $t'$, which are uniformly Lipschitz by Assumptions 1 and 1. This completes the proof. ∎
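As a brief reminder of why the functions in (27) involve $\nabla_{\theta}\tilde{\pi}^{\theta}$ rather than $\nabla_{\theta}\log\tilde{\pi}^{\theta}$, the standard log-derivative identity can be written out as follows. This is a sketch, assuming that $\tilde{\pi}^{\theta}(\cdot\mid x^{0},\mu)$ has a density that is positive wherever it is evaluated and that $g$ is an integrable function of $(u^{0},\xi)$; the symbol $g$ is introduced here for illustration only.

```latex
\begin{align*}
&\operatorname{\mathbb{E}}_{(u^{0},\xi)\sim\tilde{\pi}^{\theta}(\cdot\mid x^{0},\mu)}
  \left[ g(u^{0},\xi)\,\nabla_{\theta}\log\tilde{\pi}^{\theta}(u^{0},\xi\mid x^{0},\mu) \right] \\
&\quad= \iint g(u^{0},\xi)\,
  \frac{\nabla_{\theta}\tilde{\pi}^{\theta}(u^{0},\xi\mid x^{0},\mu)}
       {\tilde{\pi}^{\theta}(u^{0},\xi\mid x^{0},\mu)}\,
  \tilde{\pi}^{\theta}(u^{0},\xi\mid x^{0},\mu)\,\mathrm{d}(u^{0},\xi) \\
&\quad= \iint g(u^{0},\xi)\,\nabla_{\theta}\tilde{\pi}^{\theta}(u^{0},\xi\mid x^{0},\mu)\,\mathrm{d}(u^{0},\xi).
\end{align*}
```

Taking $g$ to be the inner conditional expectation of the reward at time $t'$ recovers the form of $f$ in (27), so that integrating out $(u^{0},\xi)$ leaves a function of $(x^{0},\mu)$ only.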

Appendix O Proof of Proposition 1

Proof.

The proof is exactly analogous to the proof of Theorem 2, except that instead of using Lipschitz constants of $(x^{0}_{t},u^{0}_{t},\mu_{t},h_{t})\mapsto T(x^{0}_{t},u^{0}_{t},\mu_{t},h_{t})$, one uses Lipschitz constants of $(x^{0}_{t},u^{0}_{t},\mu_{t},\xi_{t})\mapsto T(x^{0}_{t},u^{0}_{t},\mu_{t},\mu_{t}\otimes\Gamma(\xi_{t}))$ via the additional Assumption 1 on top of Assumptions 1 and 2. ∎

Appendix P Proof of Proposition 2

Proof.

To show $\widehat{Q}^{\theta}\to Q^{\theta}$ as $N\to\infty$ uniformly, it suffices to prove pointwise convergence due to compact support.

Therefore, fix any $x^{0},\mu,u^{0},\xi$. The convergence follows as in Corollary 1, from showing at any time $t$ that

\begin{align*}
&\sup_{f\in\mathcal{F}} \left| \operatorname{\mathbb{E}}\left[ f(x^{0}_{t},u^{0}_{t},\mu_{t}) \;\middle\lvert\; x^{0}_{0}=x^{0}, \mu_{0}=\mu, u^{0}_{0}=u^{0}, \xi_{0}=\xi \right] \right. \\
&\qquad\qquad \left. - \operatorname{\mathbb{E}}\left[ f(x^{0,N}_{t},u^{0,N}_{t},\mu^{N}_{t}) \;\middle\lvert\; x^{0,N}_{0}=x^{0,N}, \mu_{0}=\mu, u^{0,N}_{0}=u^{0}, \xi^{N}_{0}=\xi \right] \right| \to 0
\end{align*}

over any equi-Lipschitz family of functions $\mathcal{F}$, and applying for $f=r$ (using the set $\mathcal{F}$ of $L_{r}$-Lipschitz functions) by Assumption 1.
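The basic mechanism behind such uniform statements is a law of large numbers for the empirical mean field. As a simplified illustration only, assume a finite state space, i.i.d. agent states drawn from $\mu$, and the $L_1$ distance on distributions (none of which is claimed to match the paper's exact setting). If every $f\in\mathcal{F}$ is $L$-Lipschitz in $\mu$, then $\sup_{f\in\mathcal{F}}|\mathbb{E}[f(\mu^{N})] - f(\mu)| \leq L\,\mathbb{E}\|\mu^{N}-\mu\|_{1}$, so it is enough to see the right-hand side shrink with $N$. The function name and parameters below are hypothetical.

```python
import random


def empirical_l1_error(mu, n_agents, n_trials=200, seed=0):
    """Monte Carlo estimate of E || mu^N - mu ||_1.

    mu is a list of probabilities over a finite state space, and mu^N is
    the empirical distribution of n_agents i.i.d. samples drawn from mu.
    """
    rng = random.Random(seed)
    states = list(range(len(mu)))
    total = 0.0
    for _ in range(n_trials):
        # Sample the states of all agents independently from mu.
        sample = rng.choices(states, weights=mu, k=n_agents)
        counts = [0] * len(mu)
        for s in sample:
            counts[s] += 1
        # L1 distance between the empirical distribution and mu.
        total += sum(abs(c / n_agents - p) for c, p in zip(counts, mu))
    return total / n_trials


if __name__ == "__main__":
    mu = [0.2, 0.3, 0.5]
    for n in (10, 100, 1000):
        print(n, empirical_l1_error(mu, n))
```

Because the bound depends on $f$ only through the shared Lipschitz constant, the same estimate covers the whole family at once, which is the role played by equi-Lipschitzness in the argument.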

The statement is shown by considering time $t=0$, and then by induction for any $t\geq 1$. At time $t=0$, the statement follows from the weak LLN as in Theorem 2. For any subsequent times, we similarly have

\displaystyle \sup_{f\in\mathcal{F}} \left| \operatorname{\mathbb{E}}\left[ f(x^{0}_{t+1},u^{0}_{t+1},\mu_{t+1}) \;\middle\lvert\; x^{0}_{0}=x^{0}, \mu_{0}=\mu, u^{0}_{0}=u^{0}, \xi_{0}=\xi \right] \right.
\displaystyle \qquad\qquad \left. - \operatorname{\mathbb{E}}\left[ f(x^{0,N}_{t+1},u^{0,N}_{t+1},\mu^{N}_{t+1}) \;\middle\lvert\; x^{0,N}_{0}=x^{0,N}, \mu_{0}=\mu, u^{0,N}_{0}=u^{0}, \xi^{N}_{0}=\xi \right] \right|
\displaystyle \quad \leq \sup_{f\in\mathcal{F}} \left| \operatorname{\mathbb{E}}\left[ f(x^{0}_{t+1},u^{0}_{t+1},\mu_{t+1}) \;\middle\lvert\; x^{0}_{0}=x^{0}, \mu_{0}=\mu, u^{0}_{0}=u^{0}, \xi_{0}=\xi \right] \right.
\displaystyle \qquad\qquad \left. - \operatorname{\mathbb{E}}\left[ f(x^{0,N}_{t+1},u^{0,N}_{t+1},T(x^{0,N}_{t},u^{0,N}_{t},\mu^{N}_{t},\mu^{N}_{t}\otimes\Gamma(\xi^{N}_{t}))) \;\middle\lvert\; x^{0,N}_{0}=x^{0,N}, \mu_{0}=\mu, u^{0,N}_{0}=u^{0}, \xi^{N}_{0}=\xi \right] \right|
\displaystyle \qquad + \sup_{f\in\mathcal{F}} \left| \operatorname{\mathbb{E}}\left[ f(x^{0,N}_{t+1},u^{0,N}_{t+1},T(x^{0,N}_{t},u^{0,N}_{t},\mu^{N}_{t},\mu^{N}_{t}\otimes\Gamma(\xi^{N}_{t}))) \;\middle\lvert\; x^{0,N}_{0}=x^{0,N}, \mu_{0}=\mu, u^{0,N}_{0}=u^{0}, \xi^{N}_{0}=\xi \right] \right.
𝔼[f(xt+10,N,ut+10,N,μt+1N)|x00,N=x0,N,μ0=μ,u00,N=u0,ξ0N=ξ]|.\displaystyle\hskip 56.9055pt\left.-\operatorname{\mathbb{E}}\left[f(x^{0,N}_{% t+1},u^{0,N}_{t+1},\mu^{N}_{t+1})\;\middle\lvert\;x^{0,N}_{0}=x^{0,N},\mu_{0}=% \mu,u^{0,N}_{0}=u^{0},\xi^{N}_{0}=\xi\right]\right|.- blackboard_E [ italic_f ( italic_x start_POSTSUPERSCRIPT 0 , italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT , italic_u start_POSTSUPERSCRIPT 0 , italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT , italic_μ start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ) | italic_x start_POSTSUPERSCRIPT 0 , italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = italic_x start_POSTSUPERSCRIPT 0 , italic_N end_POSTSUPERSCRIPT , italic_μ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = italic_μ , italic_u start_POSTSUPERSCRIPT 0 , italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = italic_u start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_ξ start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = italic_ξ ] | .

As in Theorem 2, the latter term is bounded by the induction assumption, using uniform Lipschitzness of the dynamics, $x^0_t, u^0_t, \mu_t, \xi_t \mapsto T(x^0_t, u^0_t, \mu_t, \mu_t \otimes \Gamma(\xi_t))$ via Assumptions 2 and 1, while the former term is bounded as usual by the weak LLN. This completes the proof. ∎

Appendix Q Extended MFC Optimalities

Intuitively, in large MF systems governed by dynamics of the form (1), almost all information of the joint state $(x^{0,N}_t, x^{1,N}_t, \ldots, x^{N,N}_t)$ is contained in $(x^{0,N}_t, \mu^N_t)$, while heterogeneous policies should by the LLN be replaceable by a shared one. To complete the theory of MFC, it is therefore interesting to establish the optimality of the considered MF policies over arbitrary other policies acting on the joint state $(x^{0,N}_t, x^{1,N}_t, \ldots, x^{N,N}_t)$.

It seems plausible that optimality (Corollary 1) can be extended to larger classes of policies in the finite system. In particular, at least for finite state-action spaces, (i) any joint-state policy $\pi(\mathrm{d}u \mid x^{0,N}_t, x^{1,N}_t, \ldots, x^{N,N}_t)$ might in the limit be replaced by an averaged policy $\bar{\pi}(\mathrm{d}u \mid x^0, \mu) \coloneqq \sum_{x^N \in \mathcal{X}^N \colon \frac{1}{N} \sum_i \delta_{x^{i,N}} = \mu} \pi(\mathrm{d}u \mid x^0, x^N)$ under some exchangeability of agents; (ii) any optimal policy $\pi$ outputting joint actions for all agents might be replaced by an independent but identical policy for each agent, as in the limit all information is contained in the joint state-action distribution, any of which may be approximated increasingly closely by the LLN; and (iii) heterogeneous policies for each minor agent $\pi^1, \ldots, \pi^N$ might similarly be replaced by some averaged policy $\bar{\pi}(\pi^1, \ldots, \pi^N)$, averaging the action distributions in any specific state over the proportion of agent likelihoods in that state.

Showing such results would allow us to conclude that the policy classes $\Pi$ are natural and sufficient in MF systems, including MFC and also the competitive MFGs, as more general or heterogeneous policies will not perform much better. A result related to (iii) has been shown for static cases (Sanjari and Yüksel, 2020; Cui et al., 2021) and more recently in MFC and its two-team generalizations (Guan et al., 2024).

Appendix R Experimental Details

In this section, we give the experimental details that were omitted from the main text.

Table 3: Shared hyperparameter configurations for all algorithms.
| Symbol | Name | Value |
|---|---|---|
| $\gamma$ | Discount factor | $0.99$ |
| $\lambda$ | GAE lambda | $1$ |
| $\beta$ | KL coefficient | $0.03$ |
| $\epsilon$ | Clip parameter | $0.2$ |
| $l_r$ | Learning rate | $0.00005$ |
| $B_{\mathrm{len}}$ | Training batch size | $24000$ |
| $b_{\mathrm{len}}$ | Mini-batch size | $4000$ |
| $N_{\mathrm{SGD}}$ | Gradient steps per training batch | $8$ |

R.1 Problem Details

In this section, we give details of the problems considered in this work. We omit the superscript $N$ for readability.

2G.

In the 2G problem, we formally let $\mathcal{X} = [-2, 2]^2$, $\mathcal{U} = [-1, 1]^2$, $\mathcal{X}^0 = \{0, 1, \ldots, 49\}$ according to (13). We allow noisy movement of minor agents following the Gaussian law

\[
p(x^i_{t+1} \mid x^i_t, u^i_t) = \mathcal{N}\left( x^i_{t+1} \;\middle\lvert\; x^i_t + v_{\mathrm{max}} \frac{u^i_t}{\max(1, \lVert u^i_t \rVert_2)}, \mathrm{diag}(\sigma^2, \sigma^2) \right)
\]

for some maximum speed $v_{\mathrm{max}} = 0.2$ and noise covariance $\sigma^2 = 0.03$, projecting back actions $u$ with norm larger than $1$, with the additional modification that agent positions are clipped back into $\mathcal{X}$ whenever the agents move out of bounds.
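As a rough illustration of the transition law above, the following minimal sketch samples one noisy step for a single minor agent. The function name `minor_step` and its keyword arguments are hypothetical and not taken from the authors' code; only the constants $v_{\mathrm{max}} = 0.2$, $\sigma^2 = 0.03$ and the bounds of $\mathcal{X}$ come from the text.

```python
import math
import random

V_MAX = 0.2            # maximum speed v_max (from the text)
SIGMA2 = 0.03          # noise variance sigma^2 (from the text)
LOW, HIGH = -2.0, 2.0  # bounds of the state space X = [-2, 2]^2


def minor_step(x, u, v_max=V_MAX, sigma2=SIGMA2, rng=random):
    """Sample x_{t+1} for one minor agent (hypothetical helper, illustration only)."""
    norm = math.hypot(u[0], u[1])
    # Dividing by max(1, ||u||) projects actions with norm > 1 back onto the unit ball.
    scale = v_max / max(1.0, norm)
    std = math.sqrt(sigma2)
    nxt = []
    for xi, ui in zip(x, u):
        moved = xi + scale * ui + rng.gauss(0.0, std)
        # Clip the position back into X if the agent left the domain.
        nxt.append(min(HIGH, max(LOW, moved)))
    return tuple(nxt)
```

Setting `sigma2=0.0` gives the noise-free variant, which is convenient for checking the projection and clipping logic in isolation.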

We then consider a time-variant mixture of two Gaussians

\[
\mu^*_t \coloneqq \frac{1 + \cos(2\pi t / 50)}{2} \mathcal{N}\left( \mathbf{e}_1, \mathrm{diag}(\sigma_*^2, \sigma_*^2) \right) + \frac{1 - \cos(2\pi t / 50)}{2} \mathcal{N}\left( -\mathbf{e}_1, \mathrm{diag}(\sigma_*^2, \sigma_*^2) \right)
\]

for unit vector $\mathbf{e}_1$ and covariance $\sigma_*^2 = 0.05$, i.e. we have a period of $50$ time steps, and let the major state follow the clock dynamics $p^0(x^0 + 1 \bmod 50 \mid x^0, \mu) = 1$.
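The time-varying target and the clock state can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, and $\mathbf{e}_1$ is assumed to be the first standard basis vector $(1, 0)$.

```python
import math
import random

PERIOD = 50  # period of the target mixture in time steps (from the text)


def mixture_weights(t):
    """Weights of the components centered at +e_1 and -e_1 at time t."""
    w_plus = (1.0 + math.cos(2.0 * math.pi * t / PERIOD)) / 2.0
    return w_plus, 1.0 - w_plus


def clock_step(x0):
    """Deterministic major-state clock: x0 -> x0 + 1 mod 50."""
    return (x0 + 1) % PERIOD


def sample_target(t, sigma2=0.05, rng=random):
    """Draw one sample from mu*_t, assuming e_1 = (1, 0)."""
    w_plus, _ = mixture_weights(t)
    sign = 1.0 if rng.random() < w_plus else -1.0
    std = math.sqrt(sigma2)
    return (sign + rng.gauss(0.0, std), rng.gauss(0.0, std))
```

At $t = 0$ all mass sits on the component at $\mathbf{e}_1$, and half a period later it has moved to $-\mathbf{e}_1$.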

The goal of minor agents is to minimize the Wasserstein metric $\hat{W}_1$ under the squared Euclidean distance,

\[
\hat{W}_1(\mu, \mu') \coloneqq \inf_{\gamma \in \Gamma(\mu, \mu')} \left\{ \int \lVert x - y \rVert_2^2 \, \gamma(\mathrm{d}x, \mathrm{d}y) \right\}
\]

defined over all couplings $\Gamma(\mu, \mu')$ with first and second marginals $\mu$, $\mu'$ (which is strictly speaking not a metric but an optimal transportation cost, since the squared Euclidean distance fails the triangle inequality), between their empirical distribution and the desired mixture of Gaussians

\[
r(x^0_t, \mu_t) = -\hat{W}_1(\mu_t, \mu^*_t)
\]

which is computed numerically by the empirical distance, drawing $300$ samples from $\mu^*_t$.
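To make the empirical transport cost concrete, here is a deliberately naive sketch. It assumes both point sets have the same size $n$ with uniform weights, in which case an optimal coupling is attained at a permutation, and it searches all $n!$ permutations, so it is only usable for very small $n$. It is not the numerical method used in the experiments, which the text does not specify beyond the number of samples; the function names are hypothetical.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def empirical_transport_cost(xs, ys):
    """Exact transport cost between two uniform empirical measures of equal size.

    Brute force over all matchings; an illustration only, feasible for n <= 8 or so.
    """
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("this sketch assumes two non-empty point sets of equal size")
    best = min(
        sum(sq_dist(xs[i], ys[p[i]]) for i in range(n))
        for p in permutations(range(n))
    )
    return best / n
```

The reward in 2G would then be the negative of this cost between agent positions and target samples; for realistic sample sizes a dedicated assignment or optimal transport solver would be needed instead.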

The initialization of minor agents is uniform, i.e. $\mu_0 = \mathrm{Unif}(\mathcal{X})$, and $x^0_0 = 0$. For the sake of simulation, we define the episode length $T = 100$ after which a new episode starts.

Formation.

The Formation problem is an extension of the 2G problem, where instead $\mathcal{X}^0 = \mathcal{X} \times \mathcal{X}$ and $\mathcal{U}^0 = \mathcal{U}$, the major agent follows the same dynamics as the minor agents, and movements are noise-free, i.e. $\sigma^2 = 0$. The major agent state $x^0_t = (\hat{x}^0_t, x^*_t)$ here contains both the major agent position $\hat{x}^0_t$ and its target position $x^*_t$. The desired minor agent distribution is centered around the major agent

\[
\mu^*_t \coloneqq \mathcal{N}\left( \hat{x}^0_t, \mathrm{diag}(\sigma_*^2, \sigma_*^2) \right)
\]

with covariance $\sigma_*^2 = 0.3$, and is also observed by agents as in 2G via binning. Additionally, the major agent should follow a random target $x^*_t$ following discretized Ornstein-Uhlenbeck dynamics

\[
x^*_{t+1} \sim \mathcal{N}\left( 0.95 x^*_t, \mathrm{diag}(\sigma_{\mathrm{targ}}^2, \sigma_{\mathrm{targ}}^2) \right)
\]

with $\sigma_{\mathrm{targ}}^2 = 0.02$. Thus, similar to 2G, the reward function becomes

\[
r(x^0_t, u^0_t, \mu_t) = -\lVert \hat{x}^0_t - x^*_t \rVert_2 - \hat{W}_1(\mu_t, \mu^*_t).
\]

The initialization of agents is uniform, while the target starts around zero, i.e. $\mu_0 = \mathrm{Unif}(\mathcal{X})$ and $\mu^0_0 = \mathrm{Unif}(\mathcal{X}) \otimes \mathcal{N}\left( 0, \mathrm{diag}(\sigma_{\mathrm{targ}}^2, \sigma_{\mathrm{targ}}^2) \right)$. For the sake of simulation, we define the episode length $T = 100$ after which a new episode starts.
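The target process of the Formation problem, including its initialization around zero, might be simulated as in the sketch below. The function names and the `sigma2` keyword are hypothetical; the decay factor $0.95$ and $\sigma_{\mathrm{targ}}^2 = 0.02$ are the values stated above.

```python
import math
import random

DECAY = 0.95         # mean-reversion factor of the discretized OU dynamics
SIGMA_TARG2 = 0.02   # target noise variance sigma_targ^2


def init_target(sigma2=SIGMA_TARG2, rng=random):
    """Sample the initial target x*_0 ~ N(0, diag(sigma_targ^2, sigma_targ^2))."""
    std = math.sqrt(sigma2)
    return (rng.gauss(0.0, std), rng.gauss(0.0, std))


def target_step(x_star, sigma2=SIGMA_TARG2, rng=random):
    """Sample x*_{t+1} ~ N(0.95 x*_t, diag(sigma_targ^2, sigma_targ^2))."""
    std = math.sqrt(sigma2)
    return tuple(DECAY * c + rng.gauss(0.0, std) for c in x_star)
```

Because each step contracts the target toward the origin by a factor of $0.95$ before adding noise, the target tends to stay in a bounded region around zero.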

Beach Bar Process.

In the discrete beach bar process, we consider a discrete torus $\mathcal{X} = \{0, 1, \ldots, 4\}^2$, $\mathcal{X}^0 = \mathcal{X} \times \mathcal{X}$ and actions $\mathcal{U} = \mathcal{U}^0 = \{(0,0), (-1,0), (0,-1), (1,0), (0,1)\}$ indicating either staying in place or movement in any of the four cardinal directions. The major agent state $x^0_t = (\hat{x}^0_t, x^*_t)$ here contains both the major agent position $\hat{x}^0_t$ and its target position $x^*_t$. In other words, the dynamics follow

\[
\hat{x}^0_{t+1} = \hat{x}^0_t + u^0_t \bmod (5, 5), \quad x^i_{t+1} = x^i_t + u^i_t \bmod (5, 5).
\]

The target position follows a random walk on the torus

\[
x^*_{t+1} \sim x^*_t + \epsilon_t \, \mathrm{Unif}((-1,0), (0,-1), (1,0), (0,1)) \bmod (5, 5)
\]

with walking indicator $\epsilon_t \sim \mathrm{Bernoulli}(0.2)$, i.e. the target moves with probability $0.2$, uniformly in any of the four directions.
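A minimal sketch of this target random walk on the $5 \times 5$ torus follows. The names `target_walk_step` and `p_move` are hypothetical and used for illustration only.

```python
import random

SIZE = 5  # side length of the torus {0, ..., 4}^2
DIRECTIONS = [(-1, 0), (0, -1), (1, 0), (0, 1)]


def target_walk_step(x_star, p_move=0.2, rng=random):
    """One step of the target: move with probability p_move, else stay."""
    if rng.random() < p_move:
        dx, dy = rng.choice(DIRECTIONS)
        # Coordinates wrap around modulo 5 in each dimension.
        return ((x_star[0] + dx) % SIZE, (x_star[1] + dy) % SIZE)
    return x_star
```

The same modular update, with the chosen action in place of the random direction, would give the major and minor agent dynamics.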

The costs are then given by the average toroidal distance $d$ (the $L_1$ “wrap-around” distance on the torus) between the major agent and its target, the average distance between major and minor agents, and the crowdedness of agents

\[
r(x^0_t, u^0_t, \mu_t) = -0.5 \, d(x^0_t, x^*_t) - 2.5 \int d(x, x^0_t) \, \mu_t(\mathrm{d}x) - 6.25 \int \mu_t(x) \, \mu_t(\mathrm{d}x).
\]

The initialization of agents is uniform, while the target starts at zero, i.e. $\mu_0 = \mathrm{Unif}(\mathcal{X})$ and $\mu^0_0 = \mathrm{Unif}(\mathcal{X}) \otimes \delta_{(0,0)}$. For the sake of simulation, we define the episode length $T = 200$ after which a new episode starts.
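The beach bar reward defined above can be evaluated for a finite set of minor agents as in the following sketch. The function names are hypothetical, the major agent position $\hat{x}^0_t$ is used wherever the formula measures distance to the major agent, and the empirical distribution $\mu_t$ is built by counting agents per cell.

```python
from collections import Counter

SIZE = 5  # side length of the torus {0, ..., 4}^2


def torus_l1(a, b):
    """L1 wrap-around distance between two cells of the torus."""
    return sum(min(abs(ai - bi), SIZE - abs(ai - bi)) for ai, bi in zip(a, b))


def beach_bar_reward(major_pos, target_pos, minor_positions):
    """Reward for one time step, given the positions of all agents."""
    n = len(minor_positions)
    # Empirical distribution mu_t of minor agents over cells.
    mu = {cell: count / n for cell, count in Counter(minor_positions).items()}
    target_term = torus_l1(major_pos, target_pos)
    follow_term = sum(mass * torus_l1(cell, major_pos) for cell, mass in mu.items())
    crowd_term = sum(mass * mass for mass in mu.values())
    return -0.5 * target_term - 2.5 * follow_term - 6.25 * crowd_term
```

For example, with all minor agents on the major agent's cell and the target one step away, only the target term and the maximal crowdedness term contribute.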

For the neural network policy, we use a one-hot encoding of major states as input, i.e. the concatenation of two $5$-dimensional one-hot vectors for the major agent position $\hat{x}^0_t$ and its target position $x^*_t$ respectively.

Foraging.

In the Foraging problem, we formally define $\mathcal{X} = [-2, 2]^2 \times [0, 1]$, $\mathcal{U} = [-1, 1]^2 = \mathcal{U}^0$ and $\mathcal{X}^0 = ([-2, 2] \times [-2, -1]) \times \bigcup_{n=0}^{5} \left( [-2, 2]^2 \times [0, 1.5] \right)^n$. The minor agent states $x^i_t = (\hat{x}^i_t, \tilde{x}^i_t)$ here contain their positions $\hat{x}^i_t \in [-2, 2]^2$ and encumbrance (or inversely, free cargo space) $\tilde{x}^i_t \in [0, 1]$. Meanwhile, the major agent state $x^0_t = (\hat{x}^0_t, x^{\mathrm{env}}_t)$ here contains both the major agent position $\hat{x}^0_t$ restricted to $[-2, 2] \times [-2, -1]$, and the current environment state $x^{\mathrm{env}}_t$. Here, the minor and major agents move as in Formation, though with different maximum velocities for minor agents $v_{\mathrm{max}} = 0.3$ and major agent $v^0_{\mathrm{max}} = 0.1$ respectively.

An additional environmental state consists of up to $5$ spatially localized foraging areas, which are not observed by the agents. In each time step, $N_{t}\sim\mathrm{Pois}(0.2)$ new foraging areas appear, up to a maximum total number of $5$. The location $x^{m}_{t}$ of each foraging area $m=1,\ldots,5$ is sampled from $\mathrm{Unif}(\mathcal{X})$, while its initial size $L^{m}_{t}$ is sampled from $\mathrm{Unif}([0.5,1.5])$, making up the environment state $x^{\mathrm{env}}_{t}=(x^{m}_{t},L^{m}_{t})_{m}$. At every time step, each foraging area $m$ is depleted by nearby agents closer than range $0.5$,

\begin{align*}
L^{m}_{t+1} &= L^{m}_{t}-\Delta L^{m}(\mu_{t}),\\
\Delta L^{m}(\mu_{t}) &\coloneqq \min\left(L^{m}_{t+1}-L^{m}_{t},\ \min\left(0.1,\int(0.5-\lVert x-x^{m}_{t}\rVert_{2})^{+}\,\mu_{t}(\mathrm{d}x)\right)\right)
\end{align*}

where $(\cdot)^{+}\coloneqq\max(0,\cdot)$, until they are fully depleted and disappear ($L^{m}_{t+1}\leq 0$).
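As a rough illustration of the foraging-area dynamics, the following sketch replaces the integral over $\mu_{t}$ by an average over a finite list of agent positions. It is not the authors' implementation. The function names are hypothetical, foraging-area locations are assumed to be drawn from the square $[-2,2]^{2}$, and the outer cap on $\Delta L^{m}$ is read as the remaining size $L^{m}_{t}$; these are assumptions made for the sketch.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw a Poisson(rate) sample with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def spawn_areas(areas, rng, rate=0.2, max_areas=5):
    """Append newly appearing areas as (position, size), capped at max_areas."""
    n_new = min(sample_poisson(rate, rng), max_areas - len(areas))
    for _ in range(n_new):
        # ASSUMPTION: locations are drawn uniformly from the square [-2, 2]^2.
        position = (rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0))
        areas.append((position, rng.uniform(0.5, 1.5)))
    return areas


def depletion(area_pos, area_size, agent_positions, radius=0.5, max_rate=0.1):
    """Amount removed from one area in one step (empirical mean field)."""
    n = len(agent_positions)
    # Empirical version of the integral of (0.5 - ||x - x^m||_2)^+ over mu_t.
    pull = sum(max(0.0, radius - math.dist(p, area_pos)) for p in agent_positions) / n
    # ASSUMPTION: the removed amount cannot exceed the remaining size.
    return min(area_size, max_rate, pull)


def deplete_areas(areas, agent_positions):
    """Apply one depletion step and drop areas whose size reaches zero."""
    remaining = []
    for position, size in areas:
        new_size = size - depletion(position, size, agent_positions)
        if new_size > 0.0:
            remaining.append((position, new_size))
    return remaining
```

Calling `spawn_areas` and then `deplete_areas` once per time step gives one possible environment update; the order of the two calls is also an assumption.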

Minor agents accumulate encumbrance from nearby foraging areas and deposit it at a nearby major agent. The foraged amount is split among all nearby minor agents in proportion to their foraged contribution, and any amount beyond the maximum encumbrance $1$ is wasted,

\[
\tilde{x}^{i}_{t+1}=\begin{cases}\min\left(1,\tilde{x}^{i}_{t}+\Delta L^{m}(\mu_{t})\cdot\frac{(0.5-\lVert x-x^{m}_{t}\rVert_{2})^{+}}{\int(0.5-\lVert x-x^{m}_{t}\rVert_{2})^{+}\,\mu_{t}(\mathrm{d}x)}\right) & \text{if}\quad\lVert x^{i}_{t}-x^{0}_{t}\rVert_{2}\geq 0.5,\\ 0 & \text{else.}\end{cases}
\]

The reward at each time step is then given by the total amount that the minor agents foraged and subsequently deposited, where any clipped amount is wasted.
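The encumbrance update for a single minor agent and a single foraging area can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and argument layout are hypothetical, `mean_pull` stands for the integral in the denominator (computed elsewhere), and the returned deposited amount is only one plausible reading of how each agent contributes to the reward.

```python
import math


def update_encumbrance(load, agent_pos, major_pos, area_pos, delta_l, mean_pull,
                       radius=0.5, max_load=1.0):
    """Return (new_load, deposited) for one minor agent and one foraging area."""
    if math.dist(agent_pos, major_pos) < radius:
        # Within range of the major agent: the full load is deposited.
        return 0.0, load
    # This agent's weight (0.5 - ||x - x^m||_2)^+ towards the foraging area.
    pull = max(0.0, radius - math.dist(agent_pos, area_pos))
    # Share of the depleted amount, proportional to the agent's weight.
    share = delta_l * pull / mean_pull if mean_pull > 0.0 else 0.0
    # Anything beyond the maximum encumbrance is clipped, i.e. wasted.
    return min(max_load, load + share), 0.0
```

With several foraging areas, the shares would presumably be accumulated over all areas before clipping; the text does not spell this out, so the sketch handles one area only.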

The initialization of agents is uniform, while the environment starts empty, i.e. $\mu_{0}=\mathrm{Unif}(\mathcal{X})$ and $\mu^{0}_{0}=\mathrm{Unif}(\mathcal{X})\otimes\delta_{\emptyset}$. For the sake of simulation, we define the episode length $T=200$, after which a new episode starts.

Potential.

Lastly, in Potential we consider minor agents on a continuous one-dimensional torus $\mathcal{X}=[-2,2]$ (where the points $-2$ and $2$ are identified), actions $\mathcal{U}=[-1,1]$ and major state $\mathcal{X}^{0}=\mathcal{X}\times\mathcal{X}$. The minor agents move as in Foraging (wrapping around the torus instead of clipping), while the major agent follows the gradient of the potential landscape generated by minor agents, with the goal of staying close to its current target. The major agent state $x^{0}_{t}=(\hat{x}^{0}_{t},x^{*}_{t})$ here contains both the major agent position $\hat{x}^{0}_{t}$ and its target position $x^{*}_{t}$. For simplicity, here we use a linear repulsive force decreasing from $\frac{1}{N}$ to $0$ over a range of $1$,

\[
\hat{x}^{0}_{t+1}=\hat{x}^{0}_{t}+\frac{1}{20}\sum_{x_{\mathrm{off}}\in\{-4,0,4\}}\int(1-\lVert\hat{x}^{0}_{t}-x+x_{\mathrm{off}}\rVert_{2})^{+}\,\frac{\hat{x}^{0}_{t}-x+x_{\mathrm{off}}}{\lVert\hat{x}^{0}_{t}-x+x_{\mathrm{off}}\rVert_{2}}\,\mu_{t}(\mathrm{d}x)\mod[-2,2]
\]

where we let terms $0/0=0$ and use the offset $x_{\mathrm{off}}$ to account for the wrap-around on the torus.
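For a finite set of minor agents, the major agent update on the torus can be sketched as below, with the integral over $\mu_{t}$ replaced by an empirical average. The helper names `wrap` and `major_step` are hypothetical and the code is only an illustration of the formula, not the authors' implementation.

```python
def wrap(x, half_width=2.0):
    """Map x onto the torus [-2, 2), identifying the points -2 and 2."""
    return (x + half_width) % (2.0 * half_width) - half_width


def major_step(major_pos, minor_positions, step=1.0 / 20.0):
    """Move the major agent along the repulsive force of the minor agents."""
    n = len(minor_positions)
    force = 0.0
    for x in minor_positions:
        # The offsets -4, 0, 4 cover the copies of x seen across the wrap-around.
        for offset in (-4.0, 0.0, 4.0):
            diff = major_pos - x + offset
            dist = abs(diff)
            if dist == 0.0:
                continue  # convention 0/0 = 0
            # Linear repulsion: magnitude (1 - dist)^+, direction away from x.
            force += max(0.0, 1.0 - dist) * (diff / dist)
    return wrap(major_pos + step * force / n)
```

For example, a single minor agent at $-1.9$ pushes a major agent at $1.9$ towards smaller values, since the two are only $0.2$ apart across the wrap-around.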

The target follows the discretized Ornstein-Uhlenbeck process

\[
x^{*}_{t+1}\sim\mathcal{N}\left(0.99\,x^{*}_{t},\mathrm{diag}(\sigma_{\mathrm{targ}}^{2},\sigma_{\mathrm{targ}}^{2})\right)
\]

with covariance $\sigma_{\mathrm{targ}}^{2}=0.005$, and gives rise to the reward function via the toroidal distance between target and major agent

\[
r(x^{0}_{t},\mu_{t})=-d(\hat{x}^{0}_{t},x^{*}_{t}).
\]
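A minimal sketch of the target process and reward follows, assuming a scalar target on the one-dimensional torus and a toroidal distance $d$ given by the shorter way around a circle of circumference $4$. The function names are hypothetical, and whether the target itself is wrapped onto the torus is not stated in the text, so the sketch leaves it unwrapped.

```python
import math
import random


def target_step(target, rng, decay=0.99, variance=0.005):
    """One step of the discretized Ornstein-Uhlenbeck target process."""
    # random.Random.gauss takes the standard deviation, not the variance.
    return rng.gauss(decay * target, math.sqrt(variance))


def toroidal_distance(a, b, width=4.0):
    """Shortest distance between a and b on a circle of circumference width."""
    d = abs(a - b) % width
    return min(d, width - d)


def reward(major_pos, target):
    """Negative toroidal distance between the major agent and its target."""
    return -toroidal_distance(major_pos, target)
```

Under this reading, the reward is largest (zero) when the major agent sits exactly on its target, and positions on opposite sides of the identified boundary points count as close.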

The initialization of agents is uniform, while the target starts around zero, i.e. $\mu_{0}=\mathrm{Unif}(\mathcal{X})$ and $\mu^{0}_{0}=\mathrm{Unif}(\mathcal{X})\otimes\mathcal{N}\left(0,\mathrm{diag}(\sigma_{\mathrm{targ}}^{2},\sigma_{\mathrm{targ}}^{2})\right)$. For the sake of simulation, we define the episode length $T=100$, after which a new episode starts. In contrast to $M=7^{2}=49$ in 2G, Formation and Foraging, here we use $M=7$ bins for the one-dimensional problem.

Figure 10: Training curves (mean episode return vs. time steps) of M3FPPO in red, compared to A2C in blue. (a) 2G; (b) Formation; (c) Beach; (d) Foraging; (e) Potential.
Figure 11: Qualitative visualization of learned M3FC behavior in the 2G (a-d), Formation (e-h) and Potential (i-l) problems. Red: minor agent; blue triangle: major agent; green triangle: major agent target. (i-l): As in (e-h), with arrow for potential gradient (not to scale).
Figure 12: Training curves (mean episode return vs. time steps) of IPPO, trained on the systems with $N\in\{5,10,20\}$. (a) 2G; (b) Formation; (c) Beach; (d) Foraging; (e) Potential.
Figure 13: Training curves (mean episode return vs. time steps) of MAPPO, trained on the systems with $N\in\{5,10,20\}$. (a) 2G; (b) Formation; (c) Beach; (d) Foraging; (e) Potential.

R.2 Comparison to M3FA2C

In Figure 10 we can see that vanilla M3FA2C typically performs worse than M3FPPO, getting stuck in worse local optima. Here, we used the same hyperparameters as in PPO. This validates our choice of PPO for M3FMARL.

R.3 Qualitative results

In Figure 11, M3FPPO successfully learns to form mixtures of Gaussians in 2G, and a Gaussian around a moving major agent that tracks its target in Formation. As expected in 2G, the two Gaussians at their sinusoidal peaks $t=25$ and $t=50$ are not perfectly tracked, in order to minimize the cost in following time steps, when the other Gaussian reappears. Finally, in Potential the minor agents succeed in pushing the major agent towards its target, while spreading on both sides of the major agent to be able to track any random movement of the target.

R.4 Training M3FPPO, IPPO and MAPPO on smaller systems

In Figure 6 we verified the training of M3FPPO on small finite systems. Comparing to Figures 5 and 9, for M3FPPO we see little difference between training on a small finite-agent system and training on a large system and then applying the policy to the smaller system. For the chosen hyperparameters, the performance in the Potential problem depends on the initialization. However, M3FPPO compares especially favorably to IPPO in Beach and Foraging, even when directly training on the finite system. This shows that we can either (i) directly apply M3FPPO as a MARL algorithm to small systems, or (ii) train on a fixed system, and transfer the learned behavior to systems of almost arbitrary other sizes.

Analogously, in Figures 12 and 13 we show the results of training IPPO and MAPPO for around a day with $N=5$, $N=10$ and $N=20$ agents. As seen in the plots, the results for each number of agents are comparable to the analysis shown in the main text. In particular, whether transferring M3FPPO or comparing with Figure 6, we observe that M3FPPO continues to outperform or match the performance of IPPO and MAPPO, even in the setting with fewer agents.