Learning Regularized Graphon Mean-Field Games With Unknown Graphons
Abstract
We design and analyze reinforcement learning algorithms for Graphon Mean-Field Games
(GMFGs). In contrast to previous works that require the precise values of the graphons,
we aim to learn the Nash Equilibrium (NE) of the regularized GMFGs when the graphons
are unknown. Our contributions are threefold. First, we propose the Proximal Policy
Optimization for GMFG (GMFG-PPO) algorithm and show that it converges at a rate of
Õ(T −1/3 ) after T iterations with an estimation oracle, improving on a previous work by
Xie et al. (ICML, 2021). Second, using kernel embedding of distributions, we design efficient
algorithms to estimate the transition kernels, reward functions, and graphons from sampled
agents. Convergence rates are then derived when the positions of the agents are either
known or unknown. Results for the combination of the optimization algorithm GMFG-PPO
and the estimation algorithm are then provided. These algorithms are the first specifically
designed for learning graphons from sampled agents. Finally, the efficacy of the proposed
algorithms is corroborated through simulations. These simulations demonstrate that
learning the unknown graphons effectively reduces the exploitability.
© 2024 Fengzhuo Zhang, Vincent Y. F. Tan, Zhaoran Wang, and Zhuoran Yang.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided
at http://jmlr.org/papers/v25/23-1409.html.
1. Introduction
Multi-Agent Reinforcement Learning (MARL) aims to solve sequential decision-making
problems in multi-agent systems (Zhang et al., 2021; Gronauer and Diepold, 2022; Oroojlooy
and Hajinezhad, 2022). Although MARL has enjoyed tremendous successes across a wide
range of real-world applications (Tang and Ha, 2021; Wang et al., 2022a,b; Xu et al., 2021),
it suffers from the “curse of many agents” where the sizes of the state and action spaces
increase exponentially with the number of agents (Menda et al., 2018; Wang et al., 2020). A
potential remedy is to use the mean-field approximation (Yang et al., 2018; Carmona et al.,
2019). It assumes that the agents are homogeneous, and each agent is influenced only by the
common state distribution of agents. This assumption mitigates the exponential growth of
the state and action spaces (Wang et al., 2020; Guo et al., 2022a). However, the homogeneity
assumption heavily restricts the applicability of the Mean-Field Game (MFG). For example,
analyzing the propagation of Covid-19 in an extremely large population requires modeling
the fact that people in different regions have distinct activity intensities. This cannot be
captured by the mean-field approximation, which assumes a simplistic homogeneous setup.
As a result, the Graphon Mean-Field Game (GMFG) is proposed as a means to relax the
homogeneity assumption. It captures the heterogeneity of agents through graphons and
allows the number of agents to be potentially uncountably infinite (Parise and Ozdaglar,
2019; Carmona et al., 2022). GMFGs have achieved great successes in a wide range of
applications, including in networks (Gao and Caines, 2019) and epidemics (Aurell et al.,
2022a). In Aurell et al. (2022a), the states of people indicate their infection situation, and
the graphons represent the propagation intensity between different types of people.
However, learning algorithms for GMFGs require significantly more effort to design and
analyze. Cui and Koeppl (2021b) proposed to learn the Nash Equilibrium (NE) of GMFGs
by modifying existing MFG learning algorithms. However, these model-free algorithms suffer
from the fact that the distribution flow estimation in GMFG requires a large number of
samples due to the heterogeneity of the agents. In addition, these algorithms potentially
necessitate the use of a very large class of value functions. In particular, this function
class should include the nominal value function in GMFG with any graphons to satisfy the
realizability assumption (Jin et al., 2021; Zhan et al., 2022). Moreover, existing works only
prove the consistency of learning algorithms with rather stringent assumptions (Cui and
Koeppl, 2021b; Fabian et al., 2022). These assumptions include the contractivity of the
estimated operators and the access to the nominal value functions. The convergence rates of
algorithms in GMFGs with milder assumptions are currently lacking in the literature.
In this paper, we focus on learning the NE from data collected from a set of sampled
agents in a centralized manner. Concretely, the central planner has access to a simulator of
the GMFG, which generates the states and rewards of the agents with the agents' policies
as its inputs. However, only the states and rewards of a finite set of agents are revealed
to the learner. Compared with the settings in Cui and Koeppl (2021b) and Fabian et al.
(2022), our setting is more relevant in real-world applications where the number of agents is
always finite. We aim to learn the NE of the GMFG from the states and rewards of these
sampled agents.
Learning the NEs in our problem involves overcoming difficulties from the statistical and
optimization perspectives. From the statistical side, we suffer from the lack of information
about the inputs of the functions to estimate. The transition kernels and the reward functions
of each agent take as inputs the collective behavior of all the other agents and the graphon. In
contrast, we do not know the graphons and only have information provided by a finite subset
of agents. From the optimization perspective, each agent is faced with a non-stationary
environment formed by other agents. Thus, we should design policy optimization procedures
that ensure that the policy of each agent converges to the optimal one in a time-varying
environment, while also ensuring that the non-stationary environment converges to a NE.
Main Contributions Addressing these difficulties, we summarize our main contributions
and results in Table 1 and in more detail as follows:
1. We propose and analyze the Proximal Policy Optimization for GMFG (GMFG-PPO)
algorithm to learn the NE. Given an estimate oracle, our algorithm implements
a Proximal Policy Optimization (PPO)-like algorithm to update the agents’ poli-
cies (Schulman et al., 2017). The environment is simultaneously updated with a
carefully designed learning rate. These strategies overcome the optimization-related
hurdles. GMFG-PPO achieves a convergence rate Õ(T −1/3 ), where T is the number
of iterations. This convergence rate is faster than that of the algorithm in Xie et al.
(2021) and is proved under fewer assumptions. This improvement is attributed to our
carefully designed policy and environment update rates. In addition, the analysis of
our optimization leads to a faster convergence of the mirror descent algorithm on a
fixed MDP. As a byproduct, we generalize the result in Lan (2022) to inhomogeneous
MDPs with a finite horizon.
2. We design and analyze the model learning algorithm of GMFG under three different
agent sampling schemes, as shown in Table 1. The algorithm first incorporates the
graphon with the empirical measure to estimate the mean-embedding of each agent’s
influence. We then take this estimate as the input and perform a regression task;
this resolves the statistical difficulties mentioned above. In the case where sampled
agents have known and fixed positions, Theorem 5 shows that the convergence rate
for the model estimate is O((N L)−1 + N −1/2 ), where N is the number of sampled
agents, and L is the number of samples from each agent. We also consider two
additional scenarios—the case in which the agents are randomly sampled from the unit
interval but their positions are known, and learning from sampled agents with unknown
grid positions. Pertaining to the final scenario, Theorem 7 indicates that the lack
of information of the position of the agents results in the sample complexity being
degraded by an additional factor of O(N log N ).
3. Our model estimation algorithm is the first one proposed for GMFGs. It
recovers the underlying graphons from the states sampled from a finite number of
agents. This model-learning problem is a considerable generalization of the distribution
regression problem (Szabó et al., 2016). Detailed discussions are provided in Section 5.4.
Also, our graphon learning setting can be regarded as a novel addition to the existing
graphon estimation literature, as discussed in Section 2.
Paper Outline The rest of the paper is organized as follows. We discuss related works
in Section 2. In Section 3, we introduce the GMFGs and a key property that they possess,
namely equivariance. Our three sampling schemes are also introduced. In Section 4, we
propose GMFG-PPO and analyze its convergence rate assuming an estimation oracle. In
Section 5, we first introduce our mean-embedding procedure. Then we propose and analyze
the model-learning algorithms for three sampling schemes. In Section 6, we combine the
results from Sections 4 and 5. In Section 7, we provide the numerical simulation results to
corroborate our theoretical findings. In Section 8, we conclude our paper.
2. Related Works
The GMFG has been studied for several years as a model of games played among a large
number of heterogeneous agents. Parise and Ozdaglar (2019) first formulated the static
GMFG and proved that it is the limit of finite-agent games with graph structure. Carmona
et al. (2022) then generalized these results to the Bayesian setting. Caines and Huang (2019,
2021) formulated the continuous-time GMFG and studied the existence and the uniqueness
of their NE. As a special case, the continuous-time linear-quadratic GMFG was studied by
Aurell et al. (2022b); Tchuendom et al. (2020); Gao et al. (2020, 2021), where the existence
and uniqueness of NE were established, and the convergence of finite-agent games to GMFG
was analyzed. Learning of the NE on the discrete-time GMFG was first considered in Vasal
et al. (2020) via the master equation. After that, Cui and Koeppl (2021b) and Fabian et al.
(2022) proposed algorithms to learn the NE of discrete-time GMFG with dense and sparse
graphons, respectively.
As a special case, the MFG models a game between a large number of homogeneous
agents. This classical problem formulation was suggested in Lasry and Lions (2007); Huang
et al. (2006). NE learning algorithms for the continuous-time MFG have been designed via
fictitious play (Cardaliaguet and Hadikhanloo, 2017), mirror descent (Hadikhanloo, 2017),
generalized conditional gradient (Lavigne and Pfeiffer, 2022), and policy gradient (Guo
et al., 2022b). For discrete-time MFG, efficient algorithms have been proposed based on the
notion of contraction (Guo et al., 2019; Xie et al., 2021; Anahtarci et al., 2022; Yardim et al.,
2022; Guo et al., 2023). With the monotonicity condition, Perrin et al. (2020) and Perolat
et al. (2021) propose fictitious play and mirror descent algorithms for learning the NE,
respectively. Readers are encouraged to refer to Laurière et al. (2022) for a comprehensive
survey of MFGs.
The graphon estimation problem has been studied for a decade under different classes of
graphons and different performance metrics. Existing works mainly focus on the estimation
of graphons from the random graphs generated from it. Gao et al. (2015) first proposed a
rate-optimal algorithm to estimate the graphon at sampled points. The graphon estimation
is then studied under L2 norm (Klopp et al., 2017; Wolfe and Olhede, 2013), and cut
distance (Klopp and Verzelen, 2019). The spectral method for graphon estimation was
also studied in Xu (2018). For a comprehensive survey of graphon estimation, readers are
encouraged to refer to Gao and Ma (2021). Different from these works, we aim to estimate
the graphons without the graphs generated from them. Instead, we only have access to the
state and action samples of agents, who interact with each other according to an unknown
graphon structure.
3. Preliminaries
Graphons are measurable and symmetric functions that map [0, 1]2 to [0, 1]. By symmetry,
we mean that W (α, β) = W (β, α) for any α, β ∈ [0, 1]. The set of all graphons is denoted as
W = {W : [0, 1]2 → [0, 1] | W is symmetric}. In the following, graphons are used to represent
interactions between agents. We consider a finite horizon GMFG (I, S, A, µ1 , H, P ∗ , r∗ , W ∗ ).
In this game, each agent is indexed by α ∈ I = [0, 1]. The state space and the action space
of each agent are respectively denoted as S ⊆ Rds and A ⊆ Rda . We assume that S is a
compact subset of Rds and A is a finite subset of Rda . The horizon of the game is denoted
as H ∈ N. The initial state distribution of each agent is µ1 ∈ ∆(S), where ∆(S) is the
set of probability measures on S. We note that the initial distributions of agents can be
different by adding a new time step h = 0, where the reward is zero and the transition
kernel is independent of the action. The state transition kernels P^* = {P_h^*}_{h=1}^H are functions
Ph∗ : S × A × M(S) → ∆(S) for all h ∈ [H], where M(S) is the set of measures on S. In
contrast to the single-agent Markov Decision Process (MDP), the state dynamics of each
agent in a GMFG depends on an aggregate z ∈ M(S), which reflects the influence of other
agents on it. Since we consider the case in which the state space S is compact but potentially
infinite, we assume that Ph∗ (· | sh , ah , zh ) admits a probability density function with respect
to the Lebesgue measure on S for any s_h ∈ S, a_h ∈ A, z_h ∈ M(S), and h ∈ [H]. For time h and
agent α ∈ I, given a graphon W_h^* ∈ W^* = {W_h^*}_{h=1}^H, the aggregate z_h^α for agent α is defined as
$$z_h^{\alpha} = \int_0^1 W_h^{*}(\alpha, \beta)\,\mathcal{L}(s_h^{\beta})\,d\beta, \qquad (1)$$
where L(s) ∈ ∆(S) denotes the law of the random variable s. We note that the agents in this
game are heterogeneous. This means that each agent is affected differently by other agents
or, in other words, the aggregates zhα for different α ∈ I are, in general, different. Given
the state s_h^α ∈ S and the action a_h^α ∈ A of agent α, the agent transitions to a new state
s_{h+1}^α ∼ P_h^*(· | s_h^α, a_h^α, z_h^α). The reward functions r^* = {r_h^*}_{h=1}^H are deterministic functions
r_h^*: S × A × M(S) → R for all h ∈ [H]. For agent α ∈ I at time h, taking the action a_h^α
under the state s_h^α and the aggregate z_h^α earns the agent a reward of r_h^*(s_h^α, a_h^α, z_h^α).
We remark that the above GMFG subsumes the MFG (Xie et al., 2021; Anahtarci et al.,
2022) as a special case. To see this, let W_h(α, β) = 1 for all α, β ∈ I and h ∈ [H]; then the
agents are homogeneous, and the aggregate z_h^α in Eqn. (1) is simply the common state distribution of
these homogeneous agents.
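To make Eqn. (1) concrete, here is a minimal numerical sketch (not from the paper; names are illustrative) that approximates the aggregate z_h^α for a finite set of grid agents with discrete states. Setting the graphon to the constant 1 recovers the common state distribution of the MFG special case described above.

```python
import numpy as np

def aggregate(graphon, laws, alpha):
    """Approximate z_h^alpha in Eqn. (1) for n grid agents.

    graphon: callable W_h(x, y) on [0, 1]^2
    laws:    (n_agents, n_states) array; row j is the law of s_h^{beta_j}
    alpha:   position of the agent whose aggregate we compute
    The integral over beta is replaced by an average over the grid agents."""
    n_agents = laws.shape[0]
    betas = (np.arange(n_agents) + 1) / n_agents
    weights = np.array([graphon(alpha, b) for b in betas])
    return weights @ laws / n_agents

# With the constant graphon W = 1 (the MFG special case), the aggregate is
# just the average state distribution of the population:
laws = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
z = aggregate(lambda a, b: 1.0, laws, alpha=0.5)   # -> [0.533..., 0.466...]
```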
A Markov policy for agent α ∈ I is characterized by π^α = {π_h^α}_{h=1}^H ∈ Π^H, where
π_h^α: S → ∆(A) lies in the class Π = {π : S → ∆(A)}. The collection of policies of all
agents is denoted as π^I = (π^α)_{α∈I} ∈ Π^{I×H} = Π̃. We let µ_h^α = L(s_h^α) ∈ ∆(S) be the
state distribution of agent α at time h. Then µ_h^I = (µ_h^α)_{α∈I} ∈ ∆(S)^I is the set of
state distributions of all agents at time h. Note that the aggregate z_h^α is a function of the
distributions µ_h^I and the graphon W_h^*, so we may write it more explicitly as z_h^α(µ_h^I, W_h^*).
The distribution flow µ^I = (µ_h^I)_{h=1}^H ∈ ∆(S)^{I×H} = ∆̃ consists of the state distributions of all
agents at any given time.
In this work, we focus on the regularized problem (Nachum et al., 2017; Cui and
Koeppl, 2021a). This setting augments standard reward functions with the entropy of
the implemented policy. Some recent works have shown that entropy regularization can
accelerate the convergence of the policy gradient methods (Shani et al., 2020; Cen et al.,
2022). In a λ-regularized GMFG, when agent α implements policy πhα at time h, she will
receive a reward rh∗ (sαh , aαh , zhα ) − λ log πhα (aαh | sαh ) by taking action aαh at state sαh . Given the
underlying distribution flow µI and the policy π I , the value function and the action-value
function for agent α ∈ I in the λ-regularized game with λ > 0 are respectively defined as
$$V_m^{\lambda,\alpha}(s, \pi^{\alpha}, \mu^{I}, W^{*}) = \mathbb{E}^{\pi^{\alpha}}\bigg[\sum_{h=m}^{H} r_h^{*}\big(s_h^{\alpha}, a_h^{\alpha}, z_h^{\alpha}(\mu_h^{I}, W_h^{*})\big) - \lambda\log\pi_h^{\alpha}(a_h^{\alpha}\mid s_h^{\alpha}) \,\bigg|\, s_m^{\alpha} = s\bigg],$$
$$Q_h^{\lambda,\alpha}(s, a, \pi^{\alpha}, \mu^{I}, W^{*}) = r_h^{*}\big(s, a, z_h^{\alpha}(\mu_h^{I}, W_h^{*})\big) + \mathbb{E}\big[V_{h+1}^{\lambda,\alpha}(s_{h+1}^{\alpha}, \pi^{\alpha}, \mu^{I}, W^{*}) \,\big|\, s_h^{\alpha} = s,\ a_h^{\alpha} = a\big],$$
where the expectation E^{π^α}[·] is taken with respect to a_h^α ∼ π_h^α(· | s_h^α) and s_{h+1}^α ∼ P_h^*(· | s_h^α, a_h^α, z_h^α)
for all h ∈ [H]. The cumulative reward of agent α ∈ I under policy π^I is defined as
J^{λ,α}(π^α, µ^I, W^*) = E_{µ_1^α}[V_1^{λ,α}(s, π^α, µ^I, W^*)], where the expectation is taken with respect
to s ∼ µ_1^α.
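As an illustration of these definitions, the following is a minimal sketch (illustrative helper names, not the paper's code) of backward induction for the λ-regularized Q- and value functions of a single agent in a finite-state, finite-action setting, where the aggregate z_h^α has already been folded into the stage rewards and transitions.

```python
import numpy as np

def regularized_values(rewards, transitions, policy, lam):
    """Backward induction for the lambda-regularized V and Q of one agent.

    rewards[h]:     (n_states, n_actions) array of r_h(s, a, z_h)
    transitions[h]: (n_states, n_actions, n_states), row-stochastic in last axis
    policy[h]:      (n_states, n_actions) array of pi_h(a | s)
    Returns V of shape (H, n_states) and Q of shape (H, n_states, n_actions)."""
    H = len(rewards)
    n_states, n_actions = rewards[0].shape
    V = np.zeros((H + 1, n_states))
    Q = np.zeros((H, n_states, n_actions))
    for h in reversed(range(H)):
        # Q_h(s, a) = r_h(s, a) + E[V_{h+1}(s') | s, a]
        Q[h] = rewards[h] + transitions[h] @ V[h + 1]
        # V_h(s) = E_{a ~ pi_h(.|s)}[Q_h(s, a) - lam * log pi_h(a | s)]
        safe_log = np.log(np.where(policy[h] > 0, policy[h], 1.0))
        V[h] = np.sum(policy[h] * (Q[h] - lam * safe_log), axis=1)
    return V[:H], Q
```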
Definition 1 A NE of the λ-regularized GMFG is a pair (π^{*,I}, µ^{*,I}) ∈ Π̃ × ∆̃ that satisfies
the following two conditions:
• (Agent rationality) J^{λ,α}(π^{*,α}, µ^{*,I}, W^*) ≥ J^{λ,α}(π̃^α, µ^{*,I}, W^*) for all α ∈ I and π̃^α ∈ Π^H.
• (Distribution consistency) The distribution flow µ^{*,I} is equal to the distribution flow µ^{π^{*,I},I} induced by the policy π^{*,I}.
We define the operator that returns the optimal policy when the underlying distribution
flow is µ^I and the graphon is W as Γ_1^λ(µ^I, W) ∈ Π̃, i.e., π^I = Γ_1^λ(µ^I, W) if J^{λ,α}(π^α, µ^I, W) =
sup_{π̃^α ∈ Π^H} J^{λ,α}(π̃^α, µ^I, W) for all α ∈ I. In this work, we focus on the case where the GMFG is
regularized, i.e., λ > 0, so Γ_1^λ is uniquely defined. We also define the operator that returns
the distribution flow induced by the policy π^I as Γ_2(π^I, W^*) ∈ ∆̃, i.e., µ̃^I = Γ_2(π^I, W^*) if
$$\tilde\mu_{h+1}^{\alpha}(s') = \int_S \sum_{a\in A} \tilde\mu_h^{\alpha}(s)\,\pi_h^{\alpha}(a\mid s)\,P_h^{*}\big(s'\mid s, a, z_h^{\alpha}(\tilde\mu_h^{I}, W_h^{*})\big)\, ds$$
and µ̃_1^I = µ_1^I. Our goal in this paper is to learn the NE of the λ-regularized GMFG from
the data collected from the sampled agents. Before giving an overview of our agent sampling
schemes, we first introduce the equivariance property of the GMFG.
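Before turning to equivariance, the following is a minimal finite-agent, finite-state numerical sketch of the forward recursion that defines Γ_2 above; the grid discretization of [0, 1] and all names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def distribution_flow(policy, transition, graphon, mu1, horizon):
    """Discretized Gamma_2: propagate the state laws of n grid agents.

    policy[h][i]:        (n_states, n_actions) array, pi_h^{alpha_i}(a | s)
    transition(h, a, z): (n_states, n_states) row-stochastic matrix P_h(. | s, a, z)
    graphon(x, y):       interaction strength W_h(x, y) (taken time-invariant here)
    mu1:                 (n_agents, n_states) initial laws mu_1^{alpha_i}."""
    mu = [np.array(mu1, dtype=float)]
    n_agents = mu[0].shape[0]
    alphas = (np.arange(n_agents) + 1) / n_agents
    for h in range(horizon - 1):
        nxt = np.zeros_like(mu[-1])
        for i, alpha in enumerate(alphas):
            # aggregate z_h^alpha: graphon-weighted average of all agents' laws
            weights = np.array([graphon(alpha, beta) for beta in alphas])
            z = weights @ mu[-1] / n_agents
            for a in range(policy[h][i].shape[1]):
                mass = mu[-1][i] * policy[h][i][:, a]      # mu(s) * pi(a|s)
                nxt[i] += mass @ transition(h, a, z)       # push through P_h
        mu.append(nxt)
    return mu                                              # list of length `horizon`
```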
We denote the set of all the measure-preserving bijections as B[0,1] . Then the equivariance
property of the GMFG can be stated as follows.
Proposition 2 For any policy π^I ∈ Π̃, let its distribution flow on (S, A, µ_1, H, P^*, r^*, W^*)
be µ^I ∈ ∆̃; in other words, µ^I = Γ_2(π^I, W^*). For any φ ∈ B_{[0,1]}, define the φ-transformed
policy π^{φ,I} as π^{φ,α} = π^{φ(α)} for all α ∈ I, and denote its distribution flow on
(S, A, µ_1, H, P^*, r^*, W^{φ,*}) as µ^{φ,I} ∈ ∆̃, i.e., µ^{φ,I} = Γ_2(π^{φ,I}, W^{φ,*}). We have
µ^{φ,α} = µ^{φ(α)} for all α ∈ I.
The proof is provided in Appendix D. Proposition 2 shows that graphons transformed
by a measure-preserving bijection define the same game as the original graphons up to the
bijection. In Section 5.5, we learn the values of the graphon from the sampled agents without
information about their positions. This proposition shows that we can then learn the graphon only up to
a measure-preserving bijection, which motivates the definition of the permutation-invariant
risk in Section 5.5.
We consider three agent sampling schemes:
1. Agents are sampled from known grid positions. In particular, we sample the agents at the grid positions {i/N}_{i=1}^N ⊂ [0, 1], and we know the position of each agent;
2. Agents are sampled from known random positions. In particular, the positions of the N sampled agents are i.i.d. samples of Unif([0, 1]), and the positions of the agents are known;
3. Agents are sampled from grid positions, but the positions of the sampled agents are unknown. That is, we know that the positions of the sampled agents belong to the set {i/N}_{i=1}^N, but the position of each agent within this set is unknown (a toy sketch of these schemes is given below).
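The following toy sketch (illustrative, not from the paper) generates agent positions under the three schemes; in each case the true positions exist, but only the first two schemes reveal them to the learner.

```python
import numpy as np

def sample_positions(n_agents, scheme, seed=None):
    """Return (true_positions, observed_positions) for the three schemes;
    observed_positions is None when the positions are hidden from the learner."""
    rng = np.random.default_rng(seed)
    grid = np.arange(1, n_agents + 1) / n_agents
    if scheme == "known_grid":
        return grid, grid
    if scheme == "known_random":
        pos = np.sort(rng.uniform(0.0, 1.0, size=n_agents))
        return pos, pos
    if scheme == "unknown_grid":
        return rng.permutation(grid), None
    raise ValueError(f"unknown scheme: {scheme}")
```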
In Section 5, we design and analyze model learning algorithms that estimate the
transition kernel P^*, the reward function r^*, and the underlying graphons W^* for each of
these three sampling schemes. In the first two cases, the original graphons can be estimated
directly. However, in the third case, we cannot estimate the original graphons, since the positions
of the agents are unknown. Instead, we can only estimate the original graphons up to a
measure-preserving bijection. In this case, we need to recover the “relative positions” of
sampled agents to select the graphons from set W̃. For N agents, there are N ! potential
cases for their relative positions. The super-exponential size of the search space makes the
problem statistically challenging. For completeness, there is a fourth sampling scheme in which
the positions of the agents are unknown and random. However, the analysis of algorithms in
this case is difficult due to the need to carefully analyze order statistics, which is rather
different from the three cases above. We leave this case for future work.
Algorithm 1 GMFG-PPO
Procedure:
1: Initialize π_{1,h}^α(· | s) = Unif(A) for all s ∈ S, h ∈ [H], and α ∈ I.
procedure is known as fictitious play in Xie et al. (2021) and Perrin et al. (2020). It slows
down the update of the distribution flow. In our analysis, this deceleration is shown to
be important for learning the optimal policy with respect to the current distribution flow.
Finally, we improve the policy with one-step mirror descent (Line 6). We note that Line 6 is
in fact the closed-form solution to the optimization
$$\hat\pi_{t+1,h}^{\alpha}(\cdot\mid s) = \operatorname*{argmax}_{p\in\Delta(A)}\Big[\eta_{t+1}\big\langle \hat Q_h^{\lambda,\alpha}(s,\cdot,\pi_t^{\alpha},\hat{\bar\mu}_t^{I},\hat W),\,p\big\rangle - \lambda\bar H(p) - \mathrm{KL}\big(p\,\big\|\,\pi_{t,h}^{\alpha}(\cdot\mid s)\big)\Big]\quad\forall\,s\in S,$$
where H̄(p) = ⟨p, log p⟩ is the negative entropy function. This procedure is one-step policy
mirror descent in Lan (2022), and it also corresponds to the PPO algorithm in Schulman
et al. (2017). This policy improvement procedure aims to optimize the policy in the MDP
induced by µ̄ˆ_t^I. As µ̄ˆ_t^I converges to µ^{*,I}, this procedure learns the optimal
policy on µ^{*,I}, i.e., the policy π^{*,I} in the NE. Line 7 mixes the current policy iterate π̂^I
with the uniform distribution. Intuitively, this mixing ensures that the policy explores sufficiently
in order to eventually find the NE.
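The following is a minimal sketch of the Line 6 and Line 7 updates for a single state, assuming the objective exactly as reconstructed above (the stepsize placement in the paper's algorithm may differ slightly); all names are illustrative. Solving that objective in closed form gives a softmax of (η Q̂ + log π_t)/(1 + λ), and the exploration step mixes the result with the uniform distribution.

```python
import numpy as np

def mirror_descent_step(q_values, prev_policy, eta, lam):
    """argmax_{p in simplex} eta*<q, p> - lam*<p, log p> - KL(p || prev_policy).
    The maximizer is proportional to exp((eta*q + log prev_policy) / (1 + lam))."""
    logits = (eta * q_values + np.log(prev_policy)) / (1.0 + lam)
    logits -= logits.max()                 # numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum()

def mix_with_uniform(policy, beta):
    """Line-7-style mixing with Unif(A) to keep the policy exploratory."""
    return (1.0 - beta) * policy + beta / len(policy)
```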
GMFG-PPO differs from the NE learning algorithm of regularized MFG in Xie et al.
(2021) in three aspects. First, GMFG-PPO is designed to learn the NE of the regularized
GMFG. It involves graphon learning and requires the policy and action-value function
updates for all the agents. In contrast, the algorithm in Xie et al. (2021) can only learn the
NE of the regularized MFG, which is a special case of GMFG with constant graphons. It
only keeps track of the policy and action-value function of a representative agent. Second,
GMFG-PPO learns a non-stationary NE, whereas the algorithm in Xie et al. (2021) learns
a stationary NE. Finally, the stepsize ηt used in the policy improvement (Line 6) will be
set to be a (non-vanishing) constant in Section 4.2. In contrast, the algorithm in Xie et al.
(2021) sets ηt = o(1). Our choice of ηt is the chief reason for the improved convergence rate.
(π ∗,I , µ∗,I ). We measure the distances between policies and distribution flows with
$$D(\pi^{I}, \tilde\pi^{I}) = \int_0^1 \sum_{h=1}^{H} \mathbb{E}_{\mu_h^{*,\alpha}}\Big[\big\|\pi_h^{\alpha}(\cdot\mid s) - \tilde\pi_h^{\alpha}(\cdot\mid s)\big\|_1\Big]\, d\alpha, \quad\text{and}$$
$$d(\mu^{I}, \tilde\mu^{I}) = \int_0^1 \sum_{h=1}^{H} \big\|\mu_h^{\alpha} - \tilde\mu_h^{\alpha}\big\|_1\, d\alpha.$$
For the purpose of our convergence results, we make a few assumptions about the λ-
regularized GMFG. We first assume the Lipschitz continuity of transition kernels and reward
functions.
Assumption 1 The reward function r_h(s, a, z) is Lipschitz continuous in z for all h ∈ [H],
that is, |r_h(s, a, z) − r_h(s, a, z′)| ≤ L_r‖z − z′‖_1 for all h ∈ [H], s ∈ S, and a ∈ A. The
transition kernel P_h(· | s, a, z) is Lipschitz continuous in z with respect to the total variation distance,
that is, TV(P_h(· | s, a, z), P_h(· | s, a, z′)) ≤ L_P‖z − z′‖_1 for all h ∈ [H], s ∈ S, and a ∈ A.
This assumption is common in the MFG and GMFG literature (Cui and Koeppl, 2021b;
Anahtarci et al., 2022). We then assume that the composition of the operators Γλ1 and Γ2 is
contractive in the following sense.
Assumption 2 There exist constants d_1, d_2 > 0 with d_1 d_2 < 1 such that for any policies
π^I, π̃^I and distribution flows µ^I, µ̃^I, it holds that
$$D\big(\Gamma_1^{\lambda}(\mu^{I}, W^{*}), \Gamma_1^{\lambda}(\tilde\mu^{I}, W^{*})\big) \le d_1\, d(\mu^{I}, \tilde\mu^{I}) \quad\text{and}\quad d\big(\Gamma_2(\pi^{I}, W^{*}), \Gamma_2(\tilde\pi^{I}, W^{*})\big) \le d_2\, D(\pi^{I}, \tilde\pi^{I}).$$
This “contractive” assumption plays an important role in the design of efficient algorithms,
since it guarantees the convergence of both π I and µI using simple fixed point iterations.
This assumption is widely adopted in the MFG literature (Xie et al., 2021; Guo et al., 2019),
and it holds if the regularization parameter λ is sufficiently large compared to L_r and L_P (Anahtarci et al., 2022;
Cui and Koeppl, 2021a). We note that Assumption 2 indeed implies the existence and
uniqueness of the NE.
The proof is provided in Appendix O. For a policy π^I and any distribution flow µ^I, we
define the operator Γ_3 that satisfies µ^{+,I} = Γ_3(π^I, µ^I, W) as
$$\mu_1^{+,I} = \mu_1^{I}, \qquad \mu_{h+1}^{+,\alpha}(s') = \sum_{a\in A}\int_S \mu_h^{+,\alpha}(s)\,\pi_h^{\alpha}(a\mid s)\,P_h\big(s'\mid s, a, z_h^{\alpha}(\mu_h^{I}, W_h)\big)\,ds,$$
for all s′ ∈ S, α ∈ I, and h ≥ 1. The operator Γ_3 outputs the distribution flow µ^{+,I} for
implementing the policy π I on the MDP induced by µI . We now make an assumption about
certain concentrability coefficients.
Assumption 3 For any distribution flow µ^I, we define its induced optimal policy on the
MDP induced by it as π_µ^{*,I} = Γ_1^λ(µ^I, W^*) and the induced distribution flow as µ̃^{*,I} =
Γ_3(π_µ^{*,I}, µ^I, W^*). Then there exists a constant C_µ > 0 such that for any distribution flow
µ^I, it holds that
$$\sup_{\alpha\in I,\, h\in[H]} \mathbb{E}_{s\sim\tilde\mu_h^{*,\alpha}}\bigg[\Big(\frac{\mu_h^{*,\alpha}(s)}{\tilde\mu_h^{*,\alpha}(s)}\Big)^{2}\bigg] \le C_\mu^2.
$$
$$d\big(\hat\Gamma_2(\pi^{I}, \hat W), \Gamma_2(\pi^{I}, W^{*})\big) \le \varepsilon_\mu,$$
We make this assumption only for ease of the presentation of the analysis of our algorithm. In
Section 6, we will replace this assumption with the actual performance guarantee of our model
learning algorithms. When learning the model from L trajectories of N sampled agents, we
could quantify εµ and εQ as: (i) (known fixed positions) εµ = O(N −1/2 + (N L)−1/4 ) and
εQ = O(N −1/2 + (N L)−1/2 ). (ii) (known random positions) εµ = O(N −1/2 + (N L)−1/4 ) and
εQ = O(N −1/4 + (N L)−1/2 ). (iii) (unknown fixed positions) εµ = O(N −1/2 + (N/L)1/4 ) and
εQ = O(N −1/2 + (L)−1/2 ).
Theorem 4 We set αt = O(T −2/3 ), βt = O(T −1 ), and ηt to a constant that only depends
on λ, H and |A|. Under Assumptions 1, 2, 3, and 4, Algorithm 1 returns the policy π̄ I and
the distribution flow µ̄I that satisfies
$$\frac{1}{T}\sum_{t=1}^{T} D\big(\pi_t^{I}, \pi^{*,I}\big) + \frac{1}{T}\sum_{t=1}^{T} d\big(\hat{\bar\mu}_t^{I}, \mu^{*,I}\big) = O\Big(\frac{\sqrt{\log T}}{T^{1/3}}\Big) + O\big(\varepsilon_\mu + \varepsilon_Q + \sqrt{\varepsilon_\mu}\big).$$
The proof is provided in Appendix E. There are two main differences between Theorem 4 and Xie
et al. (2021, Theorem 1). First, we achieve a faster rate Õ(T −1/3 ) than the rate Õ(T −1/5 ) in
Xie et al. (2021). This improvement is attributed to the newly designed stepsize ηt , which
is a constant, but the algorithm in Xie et al. (2021) sets ηt to be O(T −2/5 ). Intuitively, a
stepsize ηt that is independent of T will result in faster convergence of an algorithm compared
to one that decays as T grows. However, the proof involves a novel optimization error
recursion analysis for this new stepsize. This novel optimization error recursion analysis
also generalizes Lan (2022, Theorem 1) to the time-inhomogeneous MDP with a finite
horizon. See Appendix F for the statement. Second, Theorem 4 does not require the first
condition in Assumptions 4 and 5 in Xie et al. (2021). Instead, we adopt the more realistic
Assumption 1 concerning the Lipschitzness of the transition kernels and reward functions to
control the difference between the MDPs induced by different distribution flows.
We have ω_{δ_{s,a,z}} ∈ H. We note that such a mean-embedding procedure does not cause the
problem to degenerate, since the embedding with the identity kernel reduces to δ_{s,a,z}.
For our regression setting, we embed the measure δ_{s_h^α} × δ_{a_h^α} × z_h^α(W_h^*) for all α ∈ I and
h ∈ [H]. Here the aggregate z_h^α is the influence aggregate for agent α at time h defined in
Eqn. (1). The mean-embedding of the measure δ_{s_h^α} × δ_{a_h^α} × z_h^α(W_h^*) is
$$\omega_h^{\alpha}(W_h^{*}) = \int_0^1\!\int_S W_h^{*}(\alpha, \beta)\, k\big(\cdot, (s_h^{\alpha}, a_h^{\alpha}, s)\big)\, \mu_h^{\beta}(s)\, ds\, d\beta.$$
Given such an embedding representation, we reformulate the transition kernels and the reward
functions as functions f_h^*, g_h^*: H → R that are defined as
$$s_{h+1}^{\alpha} = f_h^{*}\big(\omega_h^{\alpha}(W_h^{*})\big) + \varepsilon_h^{\alpha}, \qquad r_h^{\alpha} = g_h^{*}\big(\omega_h^{\alpha}(W_h^{*})\big) \quad\text{for all } h\in[H],\ \alpha\in I, \qquad (2)$$
where {ε_h^α}_{α∈I} are independent zero-mean noises. Since |s| ≤ B_S, we have |ε_h^α| ≤ 2B_S.
• The kernel k is bounded, i.e., there exists B_k > 0 such that k(x, x) ≤ B_k^2 for all x ∈ Ξ.
• The kernel K̄ (resp. K̃) is bounded, i.e., there exists B_{K̄} > 0 (resp. B_{K̃} > 0) such that K̄(ω, ω) ≤ B_{K̄}^2 (resp. K̃(ω, ω) ≤ B_{K̃}^2) for all ω ∈ H.
• The kernel K̄ (resp. K̃) is L_{K̄}-Lipschitz (resp. L_{K̃}-Lipschitz) continuous, i.e., ‖K̄(·, ω) − K̄(·, ω′)‖_{H̄} ≤ L_{K̄}‖ω − ω′‖_H (resp. ‖K̃(·, ω) − K̃(·, ω′)‖_{H̃} ≤ L_{K̃}‖ω − ω′‖_H) for all ω, ω′ ∈ H.
For ease of notation, we define the maximal boundedness parameter B_K = max{B_{K̄}, B_{K̃}}
and the maximal Lipschitz constant L_K = max{L_{K̄}, L_{K̃}}. Finally, we state the realizability
assumption, which guarantees that we choose a proper function class for our regression task.
We define the r-ball in an RKHS H̄ as B(r, H̄) = {f ∈ H̄ : ‖f‖_{H̄} ≤ r}.
Assumption 7 (Realizability) The nominal transition functions fh∗ , reward functions gh∗
and graphons Wh∗ satisfy that fh∗ ∈ B(r, H̄), gh∗ ∈ B(r̃, H̃) and Wh∗ ∈ W̃ for all h ∈ [H], where
r, r̃ > 0 are some constants.
For ease of notation, we define the maximal radius as r̄ = max{r, r̃}. We note that our
algorithms and analysis are also applicable to the general function class F and F̃, replacing
H and H̃. This assumption is realized when the chosen function classes are large enough. For
example, H and H̃ can be chosen as kernels spaces of neural networks (Jacot et al., 2018), and
W̃ can be a set of neural networks for the purpose of graphon estimation (Xia et al., 2023). In
addition to these non-parametric function classes, for the case in which we know the form of
the underlying graphon (e.g., Wh∗ (α, β) = a − b · (α + β) for some a, b ∈ R), we can also choose
the graphon class accordingly (e.g., W̃ = {W | W (α, β) = k1 − k2 · (α + β) for k1 , k2 ∈ R}).
Here we adopt RKHSs for H̄ and H̃ for ease of presentation.
$$\hat z_{\tau,h}^{i}(W_h) = \frac{1}{N-1}\sum_{j\neq i} W_h(\xi_i, \xi_j)\,\delta_{s_{\tau,h}^{j}}.$$
This estimate involves three kinds of error sources. The first is the graphon estimation error,
which originates from the difference between Wh and Wh∗ . The second is the agent sampling
error which originates from the approximation of uncountably many agents in [0, 1] with
N − 1 of them, i.e., an integral over [0, 1] is replaced by a sum over N − 1 terms. The
last is the state sampling error, in which we replace the integral of µ_{τ,h}^{ξ_j} over the state space S
with the singleton δ_{s_{τ,h}^j}. In the analysis, we handle these three errors separately. Given the
aggregate estimate ẑ_{τ,h}^i(W_h), the corresponding mean-embedding of the state, action, and
the aggregate for the i-th agent is
$$\hat\omega_{\tau,h}^{i}(W_h) = \frac{1}{N-1}\sum_{j\neq i} W_h(\xi_i, \xi_j)\, k\big(\cdot, (s_{\tau,h}^{i}, a_{\tau,h}^{i}, s_{\tau,h}^{j})\big). \qquad (4)$$
Taking this estimate as the input of fh∗ and gh∗ , we evaluate the square error of the prediction
and derive the estimates by minimizing the error. Thus, the estimation procedure for
learning the system dynamics, the reward functions, and the graphons can be expressed as
$$(\hat f_h, \hat g_h, \hat W_h) = \operatorname*{argmin}_{f\in B(r,\bar{\mathcal H}),\, g\in B(\tilde r,\tilde{\mathcal H}),\, \tilde W\in\tilde{\mathcal W}}\ \frac{1}{NL}\sum_{\tau=1}^{L}\sum_{i=1}^{N} \big(s_{\tau,h+1}^{i} - f(\hat\omega_{\tau,h}^{i}(\tilde W))\big)^2 + \big(r_{\tau,h}^{i} - g(\hat\omega_{\tau,h}^{i}(\tilde W))\big)^2. \qquad (5)$$
We note that the above optimization problem is, in general, non-convex. However, we focus
on the statistical property of it in this work, and the practical implementation can be done
with the help of non-convex optimization algorithms. In this estimation procedure, we form
our predictions of the states/rewards via the composition of two procedures, i.e.,
$$\{(s_{\tau,h}^{i}, a_{\tau,h}^{i})\}_{i=1}^{N} \;\xrightarrow{\;k,\,W\;}\; \hat\omega_{\tau,h}^{i}(W) \;\xrightarrow{\;f \text{ or } g\;}\; s_{\tau,h+1}^{i} \,/\, r_{\tau,h}^{i}. \qquad (6)$$
In the first stage, the states and actions are embedded with the kernel k and a selected
graphon W. In the second stage, the mean-embedding ω̂_{τ,h}^i(W) is forwarded through the functions
in H̄ or H̃.
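A minimal sketch of the first stage is given below, assuming a Gaussian base kernel k on (s, a, s′) triples represented as flat vectors (the choice of kernel and all names are illustrative): the empirical embedding of Eqn. (4) is stored as a weighted set of kernel anchors, and RKHS inner products between two embeddings reduce to double sums of the base kernel.

```python
import numpy as np

def base_kernel(x, y, bandwidth=1.0):
    """Gaussian base kernel k on flattened (s, a, s') triples."""
    d = x - y
    return np.exp(-np.dot(d, d) / (2.0 * bandwidth ** 2))

def empirical_embedding(i, states, actions, graphon, positions):
    """Eqn. (4): represent hat{omega}^i as (anchors, weights), i.e. the embedding
    is sum_j weights[j] * k(., anchors[j]) over the other agents j != i."""
    n = len(positions)
    anchors, weights = [], []
    for j in range(n):
        if j == i:
            continue
        anchors.append(np.concatenate([states[i], actions[i], states[j]]))
        weights.append(graphon(positions[i], positions[j]) / (n - 1))
    return np.array(anchors), np.array(weights)

def embedding_inner_product(emb1, emb2, bandwidth=1.0):
    """<omega_1, omega_2>_H = sum_{m,n} w1[m] w2[n] k(anchor1[m], anchor2[n])."""
    a1, w1 = emb1
    a2, w2 = emb2
    return sum(u * v * base_kernel(x, y, bandwidth)
               for x, u in zip(a1, w1) for y, v in zip(a2, w2))
```

The squared RKHS distance ‖ω − ω′‖²_H needed by a Lipschitz second-stage kernel, e.g., K̄(ω, ω′) = exp(−‖ω − ω′‖²_H/2σ²), is then ⟨ω, ω⟩ − 2⟨ω, ω′⟩ + ⟨ω′, ω′⟩.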
This two-stage prediction distinguishes our estimation procedure from the algorithms
designed for the distribution regression problem (Szabó et al., 2016; Fang et al., 2020;
Meunier et al., 2022). In the distribution regression problem, the covariate, i.e., the input
of f or g in Eqn. (6), is an unknown distribution. In this problem, we are tasked with
performing a regression from the data of the response variable and the i.i.d. samples of the
unknown distribution. Although the distribution regression problem also requires a two-
stage prediction similar to Eqn. (6), i.e., the covariate must first be estimated from i.i.d.
samples drawn from it, our problem setting involving graphons is a strict generalization
of distribution regression. First, the input of f or g in our problem is a function of a set of
distributions {µατ,h }α∈I . In contrast, the covariate of the distribution regression problem is a
single distribution. Second, in addition to the recovery of µIτ,h from its samples, our problem
requires the estimation of the graphon W to form ω̂τ,h i (W ). However, the distribution
regression problem only requires the recovery of a distribution from its i.i.d. samples, which
corresponds to the case that W is a constant function.
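Returning to the estimation procedure in Eqn. (5): a practical (heuristic) way to approach it is sketched below — enumerate a small candidate family of graphons and, for each candidate, fit f and g by kernel ridge regression in the second-stage RKHS, as a surrogate for the norm-ball constraint in Eqn. (5). The helper `gram_builder`, the ridge parameter, and all names are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def fit_model_for_graphon(embed_gram, next_states, rewards, ridge=1e-3):
    """embed_gram[m, n] = K_bar(omega_m, omega_n): Gram matrix of the second-stage
    kernel at the embedded samples; next_states, rewards: (n,) arrays (scalar
    states for simplicity). Fit f, g by kernel ridge regression and return the
    empirical squared loss of Eqn. (5) for this candidate graphon."""
    n = embed_gram.shape[0]
    reg = embed_gram + ridge * n * np.eye(n)
    alpha_f = np.linalg.solve(reg, next_states)
    alpha_g = np.linalg.solve(reg, rewards)
    resid_f = next_states - embed_gram @ alpha_f
    resid_g = rewards - embed_gram @ alpha_g
    return float(np.mean(resid_f ** 2 + resid_g ** 2))

def estimate_graphon(candidates, gram_builder, next_states, rewards):
    """Outer loop: gram_builder(W) is assumed to build the Gram matrix of the
    embeddings from Eqn. (4) under candidate graphon W; keep the best fit."""
    losses = [fit_model_for_graphon(gram_builder(W), next_states, rewards)
              for W in candidates]
    best = int(np.argmin(losses))
    return candidates[best], losses[best]
```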
Without loss of generality, we assume that ξ_i ≤ ξ_j for any i ≤ j in [N], and denote the set of positions
as ξ̄ = {ξ_i}_{i=1}^N. In this section, our behavior policies π_τ^I for τ ∈ [L] are set to be L_π-Lipschitz
policies, meaning that ‖π_h^α(· | s) − π_h^β(· | s)‖_1 ≤ L_π|α − β| for all h ∈ [H] and α, β ∈ I. We
note that restricting the behavior policies to Lipschitz policies does not limit the applicability
of our estimation procedure, since the NE is shown to be Lipschitz under Assumptions 1
and 5 in Appendix P.
Then we introduce the performance metric for our estimates. Given ξ̄, the joint distribution
of (s_{τ,h}^i, a_{τ,h}^i, µ_{τ,h}^I, r_{τ,h}^i, s_{τ,h+1}^i)_{i=1}^N is ∏_{i=1}^N ρ_{τ,h}^i, where ρ_{τ,h}^i = µ_{τ,h}^i × π_{τ,h}^i × δ_{µ_{τ,h}^I} × δ_{r_h^*} × P_h^*.
Here δ_{r_h^*} is the delta distribution induced by the deterministic function r_h^*. We define the
risk of (f, g, W) given ξ̄ as
$$R_{\bar\xi}(f, g, W) = \frac{1}{NL}\sum_{\tau=1}^{L}\sum_{i=1}^{N} \mathbb{E}_{\rho_{\tau,h}^{i}}\Big[\big(s_{\tau,h+1}^{i} - f(\omega_{\tau,h}^{i}(W))\big)^2 + \big(r_{\tau,h}^{i} - g(\omega_{\tau,h}^{i}(W))\big)^2\Big]. \qquad (7)$$
The risk Rξ̄ (f, g, W ) measures the mean square error of the estimates f, g, W with respect
to the distributions of states, actions and distribution flow on the sampled agents. This
risk definition is motivated by the distribution regression (Szabó et al., 2015), since our
framework is a generalization of the distribution regression, as discussed in Section 5.4. The
convergence rate of our estimates (fˆh , ĝh , Ŵh ) is stated as follows.
Theorem 5 Under Assumptions 1, 5, 6, and 7, if {ξi }N i=1 are known grid positions such
that ξi = i/N for i ∈ [N ], then with probability at least 1 − δ, the risk of the estimates in
Eqn. (5) can be bounded as
where
$$N_{B_r} = N_{\bar{\mathcal H}}\Big(\frac{3}{NL}, B(r, \bar{\mathcal H})\Big), \quad N_{\tilde B_{\tilde r}} = N_{\tilde{\mathcal H}}\Big(\frac{3}{NL}, B(\tilde r, \tilde{\mathcal H})\Big), \quad N_{\tilde{\mathcal W}} = N_\infty\Big(\frac{3}{L_K N L}, \tilde{\mathcal W}\Big).$$
The proof of the theorem and the definitions of these covering numbers are provided in
Appendix G. The estimation error in Theorem 5 consists of two terms: the first term
corresponds to the generalization error, and the second term corresponds to the mean-
embedding estimation error. The generalization error involves the error from optimizing
over the empirical mean of the risk in Eqn. (5) instead of the population risk in Eqn. (7).
The mean-embedding estimation error comes from the fact that we cannot directly observe
the distribution flow µIτ , but we need to estimate it from the states of sampled agents. As
discussed in Section 5.4, the mean-embedding estimation error consists of the agent sampling
error and the state sampling error. If we use finite general function classes, then the covering
number in the bound will be replaced by the cardinalities of these function classes. The
resultant convergence rate would thus be O(1/√N).
The model learning algorithm in Pasztor et al. (2021) for the MFG assumes access to
the nominal value of the distribution flow. Such an assumption can be achieved in MFGs by
sampling a large number of agents at each time, since all the agents are homogeneous and
have the same state distribution flow. This estimation procedure will, however, come at a
cost of O(1/√N), which is not reflected in their results. Moreover, such an assumption is
no longer realistic in the GMFG, since the agents in GMFG are heterogeneous, and the state
distributions of agents are different. Our estimation procedure in (5) does not require the
access to the nominal value of the distribution flow µIh . Instead, we estimate this quantity
from states of sampled agents and prove that such an estimate works for the heterogeneous
agents.
Compared to the risk with grid positions defined in Eqn. (7), the risk defined in Eqn. (8)
can be derived by taking the expectation with respect to the distribution of the positions, i.e.,
R(f, g, W) = E_{ξ̄}[R_{ξ̄}(f, g, W)]. The convergence rate of our estimates can be stated as
follows.
Theorem 6 Under Assumptions 5, 6, 7, and 1, if {ξi }N i=1 are known i.i.d. samples of
Unif([0, 1]), then with probability at least 1 − δ, the risk of the estimates in Eqn. (5) can be
bounded as
Figure 1: (a) The SBM graphon and three sampled agents; (b) the transformed SBM graphon and the correspondingly transformed agents. The left figure shows the SBM graphon and three sampled agents. Swapping the second and the third communities, we obtain the graphon on the right. The sampled agents are correspondingly swapped. Although the graphons and agent positions in the left and the right figures are not the same, the agents in both figures retain the same “relative positions” with respect to the underlying graphons.
We now consider the setting where the positions of the sampled agents {ξ_i}_{i=1}^N are on the
grid in [0, 1] but are unknown. This means that the set of sampled positions {ξ_i}_{i=1}^N is
equal to {i/N}_{i=1}^N, but we do not know which i/N each ξ_i corresponds to. In addition to
the data collection procedures in Section 5.1, we assume that we implement the same policy
over L independent rounds. This sampling method implies that the distributions defined in
Section 5.5 satisfy ρ_{τ,h}^α = ρ_{τ′,h}^α for all τ, τ′ ∈ [L], α ∈ I, and h ∈ [H].
Intuitively, since the position information is missing from our observations, we cannot
estimate the precise values of graphons. For example, the collected data from the agents
in Figure 1(a) is the same as that in Figure 1(b), so we cannot distinguish between these two
different graphons. However, we can see that these two graphons are the same up to a
measure-preserving bijection. Proposition 2 shows that the model with transformed graphons
is the same as the original model up to a measure-preserving bijection. Thus, in this section,
our goal is to estimate the model of GMFG up to a measure-preserving bijection.
In this setting, we cannot estimate the mean-embedding ω_{τ,h}^i(W_h^*) as in Eqn. (4), since
we do not know the agents' positions {ξ_i}_{i=1}^N. Instead, we need to estimate the “relative
positions” of these agents. Here the relative positions refer to the relationship between the
agents' positions and the underlying graphon. For example, in Figure 1, the agents retain
the same relative positions in the two different graphons. With N sampled agents, the relative
positions can be represented by a permutation of these agents. We denote the set of all
permutations of N objects as C^N, where |C^N| = N!. For a permutation σ ∈ C^N and a
graphon W, we estimate the relative position of the i-th agent as σ(i)/N for all i ∈ [N]. Then
Similar to Eqn. (14), Eqn. (9) is also an average over L episodes, since we implement the
same policy for L independent times. In this estimate, only the relative positions between the
agents and the underlying graphon matter, so we can equivalently express such an estimate
with a transformed graphon. We define ω̄ˆ_{τ,h}^i(W) as ω̄ˆ_{τ,h}^{i,σ}(W) with σ being the identity map. The
set of measure-preserving bijections that are permutations of the intervals [(i − 1)/N, i/N]
for i ∈ [N] is denoted as C_{[0,1]}^N. Then for some φ ∈ C_{[0,1]}^N, the estimate in Eqn. (9) can be
reformulated as
$$\hat{\bar\omega}_{\tau,h}^{i}(W^{\phi}) = \frac{1}{(N-1)L}\sum_{j\neq i}\sum_{\tau'=1}^{L} W\big(\phi(i/N), \phi(j/N)\big)\, k\big(\cdot, (s_{\tau,h}^{i}, a_{\tau,h}^{i}, s_{\tau',h}^{j})\big).$$
Given this mean-embedding estimate, our model estimation procedure can be stated as
$$(\hat f_h, \hat g_h, \hat W_h, \hat\phi_h) = \operatorname*{argmin}_{\substack{f\in B(r,\bar{\mathcal H}),\, g\in B(\tilde r,\tilde{\mathcal H}),\\ \tilde W\in\tilde{\mathcal W},\, \phi\in C_{[0,1]}^{N}}}\ \frac{1}{NL}\sum_{\tau=1}^{L}\sum_{i=1}^{N} \big(s_{\tau,h+1}^{i} - f(\hat{\bar\omega}_{\tau,h}^{i}(\tilde W^{\phi}))\big)^2 + \big(r_{\tau,h}^{i} - g(\hat{\bar\omega}_{\tau,h}^{i}(\tilde W^{\phi}))\big)^2. \qquad (10)$$
We note that the computational burden of this procedure can be high due to the large number
of permutations in C_{[0,1]}^N (for large N). However, this high computational burden is common
in graphon learning algorithms (Gao et al., 2015; Klopp et al., 2017). Our work mainly
focuses on the statistical analysis of the graphon problem rather than on its well-known
computational limitations; we leave these computational concerns to future work (a brute-force sketch for small N is given below).
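For small N, the permutation search in Eqn. (10) can be carried out by brute force, as in the following sketch (illustrative only; `loss_fn` is an assumed helper that evaluates the empirical loss of Eqn. (10) for a given assignment of agents to grid positions):

```python
import itertools
import numpy as np

def best_permutation(loss_fn, n_agents):
    """Brute-force minimization over the N! relative positions in Eqn. (10).
    loss_fn(perm) evaluates the empirical loss when agent i is placed at
    position (perm[i] + 1) / n_agents. Feasible only for small n_agents."""
    best_perm, best_loss = None, np.inf
    for perm in itertools.permutations(range(n_agents)):
        loss = loss_fn(perm)
        if loss < best_loss:
            best_perm, best_loss = perm, loss
    return best_perm, best_loss
```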
We then specify the performance metric under this setting. As mentioned earlier, we cannot
estimate the precise values of graphons. Thus, we measure the accuracy of our estimates by
transforming the graphon estimate with the optimal measure-preserving bijections. Such a
risk is known as the permutation-invariant risk, which is defined as
$$\bar R_{\bar\xi}(f, g, W) = \inf_{\phi\in B_{[0,1]}} \frac{1}{NL}\sum_{\tau=1}^{L}\sum_{i=1}^{N} \mathbb{E}_{\rho_{\tau,h}^{i}}\Big[\big(s_{\tau,h+1}^{i} - f(\omega_{\tau,h}^{i}(W^{\phi}))\big)^2 + \big(r_{\tau,h}^{i} - g(\omega_{\tau,h}^{i}(W^{\phi}))\big)^2\Big]$$
$$= \inf_{\phi\in B_{[0,1]}} \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{\rho_{h}^{i}}\Big[\big(s_{h+1}^{i} - f(\omega_{h}^{i}(W^{\phi}))\big)^2 + \big(r_{h}^{i} - g(\omega_{h}^{i}(W^{\phi}))\big)^2\Big], \qquad (11)$$
where
$$\tilde N_{B_r} = N_{\bar{\mathcal H}}\Big(\frac{3}{L}, B(r, \bar{\mathcal H})\Big), \quad \tilde N_{\tilde B_{\tilde r}} = N_{\tilde{\mathcal H}}\Big(\frac{3}{L}, B(\tilde r, \tilde{\mathcal H})\Big), \quad \tilde N_{\tilde{\mathcal W}} = N_\infty\Big(\frac{3}{L_K L}, \tilde{\mathcal W}\Big).$$
The proof is provided in Appendix I. The estimation error in Theorem 7 consists of three
terms: the first two terms correspond to the mean-embedding estimation error, and the
last term corresponds to the generalization error. As mentioned in Section 5.4, the mean-
embedding estimation error consists of agent sampling error and the state sampling error.
The first term in the bound represents the agent sampling error. Since the distance between
adjacent agents is 1/N, this approximation error is of order O(1/N). The second term
represents the state sampling error. The term √N in the numerator comes from the
estimation of the relative positions from C^N, whose size is N!, and the union bound over
this set. The third term, which is the generalization error, also suffers from the union
bound over the N! relative positions. Compared with Corollary 12 in Section 5.4, the result in
Theorem 7 suffers from a multiplicative factor of log N!. When the function classes are finite
and L = Θ(N^β) with β > 1, the convergence rate in Theorem 7 is O(max{N^{−(β−1)/2}, N^{−1}}).
In contrast, the convergence rate in Corollary 12 is O(N^{−(β+1)/2}).
Theorem 7 states the estimation error in the permutation-invariant risk. In fact, we can
also derive the convergence rate of our estimate of the relative positions φ̂_h. This means
that for some unknown correction ψ^* ∈ C_{[0,1]}^N, the risk defined in Eqn. (7) of our estimate
(f̂_h, ĝ_h, Ŵ_h^{φ̂_h∘ψ^*}) vanishes.
The proof is provided in Appendix M. Combined with Proposition 2, Corollary 8 shows that
the model estimate (f̂_h, ĝ_h, Ŵ_h^{φ̂_h}) converges to the nominal model in the sense that the two are
shown to be equivalent up to an unknown measure-preserving bijection ψ^* ∈ C_{[0,1]}^N.
Assumption 8 There exists L_ε > 0 such that the noises ε_h for h ∈ [H] satisfy, for any
a ∈ R, TV(ε_h + a, ε_h) ≤ L_ε a for all h ∈ [H].
This assumption enables us to control the total variation error of our transition kernels
Ph∗ by the estimation error of fh∗ . We note that Assumption 8 is satisfied for a wide range
of distributions, including the uniform distribution, the centralized Beta distributions for
α > 1, β > 1, and the truncated Gaussian distribution. We then assume that the behavior
policy πtb,I satisfies the following assumptions.
Assumption 9 There exist two constants C_π, C_π′ > 0 such that for all t ∈ [T],
$$\sup_{s\in S,\, a\in A,\, \alpha\in I,\, h\in[H]} \frac{\bar\pi_{t,h}^{*,\alpha}(a\mid s)}{\pi_{t,h}^{b,\alpha}(a\mid s)} \le C_\pi \quad\text{and}\quad \sup_{s\in S,\, a\in A,\, \alpha\in I,\, h\in[H]} \frac{\pi_{t+1,h}^{\alpha}(a\mid s)}{\pi_{t,h}^{b,\alpha}(a\mid s)} \le C_\pi'.$$
This assumption states that the behavior policy should explore the actions of the NE and
of the policy π_{t+1}^I. It is quite natural since we want to estimate the action-value function of
π_{t+1}^I from the data collected by π_t^{b,I}. Similar assumptions have been commonly made in the
off-policy evaluation literature (Kallus et al., 2021; Uehara et al., 2020).
Assumption 10 For any policy π^I ∈ Π̃, we define µ^{+,I} = Γ_3(π^I, µ̄_t^I, W^*). We also define
µ̄_t^{b,I} = Γ_3(π_t^{b,I}, µ̄_t^I, W^*). There exists a constant C_π″ > 0 such that for any t ∈ [T] and any
policy π^I specified above, we have
$$\sup_{s\in S,\, h\in[H],\, \alpha\in I} \frac{\mu_h^{+,\alpha}(s)}{\bar\mu_{t,h}^{b,\alpha}(s)} \le C_\pi''.$$
This assumption states that the behavior policy should be sufficiently exploratory such that
the distributions induced by other policies can be covered by that of the behavior policy.
Similar assumptions have been made in the policy optimization literature (Shani et al.,
2020; Agarwal et al., 2020). We note that if we take the behavior policy π_t^{b,I} = Unif(A)^{I×H}
to be the uniform distribution on the action space, then the constants in Assumptions 9 and
10 can be set as C_π = C_π′ = |A| and C_π″ = |A|^H.
Corollary 9 If we sample agents with known grid positions and adopt the estimation procedures in Eqns. (5) and
(12) to estimate the MDP, then under Assumptions 1 to 10, GMFG-PPO
returns estimates that, with probability at least 1 − δ, satisfy
$$\frac{1}{T}\sum_{t=1}^{T} D\big(\pi_t^{I}, \pi^{*,I}\big) + \frac{1}{T}\sum_{t=1}^{T} d\big(\hat{\bar\mu}_t^{I}, \mu^{*,I}\big) = O\bigg(\frac{B_S + \bar r B_K}{(NL)^{1/4}}\log^{\frac14}\!\frac{T N_{B_r} N_{\tilde B_{\tilde r}} N_{\tilde{\mathcal W}}}{\delta} + \frac{(B_S + \bar r B_K)^{\frac14}(\bar r L_K B_k)^{\frac14}}{(NL)^{1/8}}\log^{\frac14}\!\frac{T N L\, N_\infty(1/\sqrt{N}, \tilde{\mathcal W})}{\delta} + \frac{(B_S + \bar r B_K)^{\frac14}(B_S + \bar r B_K + \bar r L_K B_k)^{\frac14}}{N^{1/4}}\bigg) + O\Big(\frac{\sqrt{\log T}}{T^{1/3}}\Big).$$
The proof is provided in Appendix K. The error of learning NE consists of two types of
terms. The first originates from the estimation error of the distribution flow and the action-
value function. It involves the number of sampled agents N and the number of episodes
L. The second represents the optimization error and involves the number of iterations T .
Consider the case where the function classes are finite. To learn a NE with error ε measured
according to D(·, ·) and d(·, ·), we can run Algorithms 1 and 2 with T = Õ(ε−3 ) iterations
and O((N L)−1/8 + N −1/4 ) = ε. The second condition can be achieved by several parameter
settings, e.g., L = 1, N = O(ε−8 ) and L = O(ε−4 ), N = O(ε−4 ).
The result for the agents with known random positions is stated as follows.
Corollary 10 If we sample agents with known random positions and adopt the estimation procedures in Eqns. (5)
and (12) to estimate the MDP, then under Assumptions 1 to 10, GMFG-PPO
returns estimates that, with probability at least 1 − δ, satisfy
$$\frac{1}{T}\sum_{t=1}^{T} D\big(\pi_t^{I}, \pi^{*,I}\big) + \frac{1}{T}\sum_{t=1}^{T} d\big(\hat{\bar\mu}_t^{I}, \mu^{*,I}\big) = O\bigg(\frac{B_S + \bar r B_K}{(NL)^{1/4}}\log^{\frac14}\!\frac{T N_{B_r} N_{\tilde B_{\tilde r}} N_{\tilde{\mathcal W}}}{\delta} + \frac{(B_S + \bar r B_K)^{\frac14}(\bar r L_K B_k)^{\frac14}}{(NL)^{1/8}}\log^{\frac14}\!\frac{T N L\, N_\infty(1/\sqrt{NL}, \tilde{\mathcal W})}{\delta} + \frac{(B_S + \bar r B_K)^{\frac12}}{N^{1/8}}\log^{\frac14}\!\frac{\tilde N_{B_r} \tilde N_{\tilde B_{\tilde r}} \tilde N_{\tilde{\mathcal W}}}{\delta}\bigg) + O\Big(\frac{\sqrt{\log T}}{T^{1/3}}\Big).$$
For example, consider problems related to swarm robotics (Elamvazhuthi and Berman,
2019), where we would like to find the NE of a swarm of robots. The states and actions
are the kinetic signals and the accelerations of the robots, respectively. The reward is a quantity
related to the kinetic goal. Robots that are physically close usually occupy close
positions in the underlying graphon, since the interaction among the robots is
determined by their physical configuration. Thus, in this example, although we do not know their
exact positions in the graphon, we have information about their relative closeness in the graphon
via their physical positions. In addition, since the data points are stored on each robot,
the samples across different iterations are guaranteed to come from the same robot.
In this case, there is one sampled person from each state, and we assume that each person
knows which state she belongs to, i.e., which sampled person is the closest person to her.
Corollary 11 If we sample agents with unknown grid positions and adopt the estimation procedures in Eqns. (5)
and (12) to estimate the MDP, then under Assumptions 1 to 10, GMFG-PPO
returns estimates that, with probability at least 1 − δ, satisfy
$$\frac{1}{T}\sum_{t=1}^{T} D\big(\pi_t^{I}, \pi^{*,I}\big) + \frac{1}{T}\sum_{t=1}^{T} d\big(\hat{\bar\mu}_t^{I}, \mu^{*,I}\big) = O\bigg(\frac{\sqrt{B_k \bar r L_K (B_S + \bar r B_K)}}{N^{1/4}} + \frac{(B_S + \bar r B_K)^{\frac14}(\bar r L_K B_K N)^{\frac14} N^{1/8}}{L^{1/8}}\log^{\frac14}\!\frac{N L\, N_\infty(\sqrt{N/L}, \tilde{\mathcal W})}{\delta} + \frac{B_S + \bar r B_K}{L^{1/4}}\log^{\frac14}\!\frac{\sqrt{N}\,\tilde N_{B_r} \tilde N_{\tilde B_{\tilde r}} \tilde N_{\tilde{\mathcal W}}}{\delta}\bigg) + O\Big(\frac{\sqrt{\log T}}{T^{1/3}}\Big).$$
The proof is provided in Appendix N. Similar to Corollaries 9 and 10, the learning error
in Corollary 11 consists of the estimation error and the optimization error. To learn a NE
with error ε measured according to D(·, ·) and d(·, ·), we can run Algorithms 1 and 2 with
T = Õ(ε−3 ) iterations, N = O(ε−4 ) sampled agents, and L = O(ε−12 ) episodes.
7. Experiments
In this section, we utilize simulations to demonstrate the importance of learning the under-
lying graphons, thus corroborating our theoretical results. We simulate our algorithms on
the Susceptible-Infectious-Susceptible (SIS) problem and investment problem. The detailed
definitions of the problems are provided in Appendix A.
The SIS problem: This problem, which has also been considered in Cui and Koeppl
(2021a,b), models the propagation of an epidemic among a large population. People in the
population are infected with probability proportional to the number of infected neighbors.
An investment problem: This problem considers the situation where several compa-
nies aim to maximize their profits simultaneously. The profit of each company is proportional
to the quality of its product and decreases with the total quality of the products in its
neighborhood.
We experiment with four types of graphons: exp-graphon, SBM graphon, affine attachment graphon, and ranked attachment graphon.
(Figure 2: Simulation results for the SIS problem with agents at known grid positions. Panels (a)–(d) plot exploitability against the optimization iteration for exp-graphons, SBM graphons, ranked attachment graphons, and affine attachment graphons, respectively.)
The value of the exp-graphon is affine in the
exponential of the product of α and β, and is defined as
$$W_{\theta}^{\exp}(\alpha, \beta) = \frac{2\exp(\theta\cdot\alpha\beta)}{1 + \exp(\theta\cdot\alpha\beta)} - 1, \qquad (13)$$
which is parameterized by θ > 0. The SBM graphon splits [0, 1] into K ≥ 1 blocks,
which is parameterized by {p_k}_{k=0}^K. Here p_0 = 0 and p_K = 1, and the i-th block is
(p_{i−1}, p_i]. The value of the SBM graphon is then specified by {a_{ij}}_{i,j=1}^{K,K} with a_{ij} = a_{ji} as
W^{SBM}(α, β) = a_{ij} if p_{i−1} < α ≤ p_i and p_{j−1} < β ≤ p_j. The affine attachment graphon is
defined as W_{a,b}^{aff}(α, β) = a − b·(α + β), where a, b ∈ R parameterize the graphon. The ranked
attachment graphon is defined as W_{a,b}^{rank}(α, β) = a − b·αβ. This is a generalization of the
If we do not learn the underlying graphons, reasonable guesses for them would be constant
graphons W (α, β) = p for all α, β ∈ I, corresponding to the MFG. In the simulations, we
choose the constant p to be 0, 0.25, 0.5, 0.75 and 1. These values model the cases from the
independent agents to the most intensely interacting agents.
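For reference, the four graphon families described above can be written compactly as follows (a sketch with illustrative parameter names; the SBM block boundaries follow the (p_{i−1}, p_i] convention above):

```python
import numpy as np

def exp_graphon(theta):
    """Exp-graphon of Eqn. (13)."""
    return lambda a, b: 2.0 * np.exp(theta * a * b) / (1.0 + np.exp(theta * a * b)) - 1.0

def sbm_graphon(block_edges, block_values):
    """block_edges = [p_0, ..., p_K]; block_values[i][j] = a_{ij} (symmetric)."""
    edges = np.asarray(block_edges[1:])
    def w(a, b):
        i = int(np.searchsorted(edges, a, side="left"))
        j = int(np.searchsorted(edges, b, side="left"))
        return block_values[i][j]
    return w

def affine_attachment_graphon(a, b):
    return lambda x, y: a - b * (x + y)

def ranked_attachment_graphon(a, b):
    return lambda x, y: a - b * x * y
```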
(Figure 3: Simulation results for the investment problem with agents at known grid positions. Panels (a)–(d) plot exploitability against the optimization iteration for exp-graphons, SBM graphons, ranked attachment graphons, and affine attachment graphons, respectively.)
Figure 2 displays the exploitability for the algorithms in the SIS problem with different
graphons. To learn the system model, we sample N = 7 and N = 14 agents with known
grid positions. The number of episodes for data collection L is set to 125 and 500. The
line “mf, L = 125” refers to the model-free algorithm in Cui and Koeppl (2021b) that uses
125 trajectories from 7 agents to estimate the distribution flow and the value function
in each round. “L = 125” and “L = 500” refer to our algorithms that use 125 and
500 samples from each agent in each round for estimation of the graphons. Since the
messages from the results of different graphons are similar, we only display the results for
exp graphon and SBM graphon for brevity. Figure 2 demonstrates that our model-based
algorithm achieves lower exploitability than the model-free algorithm. The reason is that
the estimation error of the model-based algorithm is smaller, as mentioned in Section 6.
Lines “MD, p = 0, 0.25, 0.5, 0.75, 1” refer to the MFG learning algorithm that implements
one-step mirror descent in each iteration (Xie et al., 2021; Yardim et al., 2022). Lines
“FPI, p = 0, 0.25, 0.5, 0.75, 1” refer to the MFG learning algorithm that learns the optimal
policy for the current mean-field in each iteration (Guo et al., 2019). For these MFG learning
algorithms, the reward functions and transition kernels are known to the algorithm. Thus,
there are no error bars for these lines. Figure 2 shows that when we assume that the
heterogeneous agents are homogeneous, the learning algorithm for NE will suffer from a
large error (large exploitability). In contrast, learning the graphons will enable us to learn
the NE more accurately. These results demonstrate the necessity of our model learning
algorithm in Algorithm 2. We can also observe that the learning error for N = 7, L = 500
is less than that for N = 7, L = 125, which confirms that the learning error decreases with
an increasing number of trajectories L. In addition, the learning error for N = 14, L = 125
is less than that for N = 7, L = 125. This shows that the learning error decreases with an
increasing number of sampled agents N . These observations corroborate Corollary 9.
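For completeness, exploitability — the metric on the vertical axes of Figures 2–4 — measures how much an average agent could gain by unilaterally best responding while the population's distribution flow is held fixed. A hedged sketch is given below; `policy_return` and `best_response_return` are assumed helpers built on the (learned or true) model, not functions from the paper's code.

```python
import numpy as np

def exploitability(agent_positions, policy_return, best_response_return):
    """Average best-response gap over the sampled agents.

    policy_return(alpha):        J^{lambda,alpha} of the current policy profile
                                 under the induced distribution flow
    best_response_return(alpha): sup over single-agent policies of the same
                                 objective, with the flow held fixed."""
    gaps = [best_response_return(a) - policy_return(a) for a in agent_positions]
    return float(np.mean(gaps))
```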
(Figure 4: Simulation results for the SIS and investment problems with agents at random known positions. Panels: (a) SIS problem with exp-graphons; (b) SIS problem with SBM graphons; (c) investment problem with exp-graphons; (d) investment problem with SBM graphons. Each panel plots exploitability against the optimization iteration.)
Figure 3 displays the exploitability for algorithms in the investment problem of Cui and
Koeppl (2021b) with different graphons. “L = 25” and “L = 100” refer to our algorithms that estimate with 25 and 100 samples from each agent in each round. Although the investment problem has a larger state space than the SIS problem, the simulation results convey insights similar to those discussed above. These results corroborate Corollary 9.
Figure 4 displays the exploitability for algorithms in the SIS and investment problems
with different graphons, where sampled agents have known random positions. We note
that the lines “MD, p = 0, 0.25, 0.5, 0.75, 1” are the same as those in Figures 2 and 3, since these MFG algorithms have full information of the reward function and transition kernel, so the estimation setting does not affect their performance. Figure 4 shows that the GMFG learning algorithms perform better than the MFG learning algorithms. The reason is that the MFG learning algorithms wrongly assume that all the agents are homogeneous. Figure 4 also indicates that “N = 7, L = 500” (resp. “N = 7, L = 100”) performs better than “N = 7, L = 125” (resp. “N = 7, L = 25”) in the SIS problem (resp. the investment problem), which corroborates Corollary 10.
8. Conclusion
In this paper, we investigated learning the NE of GMFGs when the graphons are unknown. Provably efficient optimization algorithms were designed and analyzed with an estimation oracle, improving on the convergence rates of previous works. In addition, adopting mean-embedding ideas, we designed and analyzed model-based estimation algorithms with sampled agents, where the sampled agents have either known or unknown positions. These estimation algorithms are the first model-based algorithms for GMFGs that do not require the distribution flow information. We leave the analysis of more complex agent sampling schemes for future work.
Acknowledgments
This research work is funded by the Singapore Ministry of Education AcRF Tier 2 grant
(A-8000423-00-00) and Tier 1 grants (A-8000189-01-00 and A-8000980-00-00).
Appendices for
“Learning Graphon Mean-Field Games with Unknown
Graphons”
Investment problem: The state space is S = {0, 1, . . . , 9}, and the action space is A = {I, O}. The horizon is H = 50. The reward function is
$$ r_h^*(s, a, z) = \frac{0.3 \cdot s}{1 + \sum_{s' \in S} s' \cdot z(s')} - 2 \cdot \mathbb{I}_{\{a = I\}}, $$
and the transition kernel is
$$ P_h^*(s+1 \mid s, I, \cdot) = \frac{9-s}{10}, \qquad P_h^*(s \mid s, I, \cdot) = \frac{1+s}{10}, \qquad P_h^*(s \mid s, O, \cdot) = 1 $$
for all s ∈ {0, . . . , 8}, and s = 9 is an absorbing state.
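For concreteness, the following is a minimal Python sketch (ours, not the authors' code) of this reward function and transition kernel; the uniform aggregate z used in the example is purely illustrative.

```python
import numpy as np

# Sketch of the investment-problem MDP above: states 0..9, actions I ("invest") and O.
S = np.arange(10)            # state space {0, ..., 9}

def reward(s, a, z):
    """r_h^*(s, a, z) = 0.3*s / (1 + sum_{s'} s' * z(s')) - 2 * 1{a == 'I'}."""
    congestion = 1.0 + np.dot(S, z)          # 1 + sum_{s'} s' * z(s')
    return 0.3 * s / congestion - 2.0 * (a == "I")

def transition(s, a, rng):
    """Sample s_{h+1}: investing moves up w.p. (9 - s)/10, otherwise the state stays put."""
    if s == 9 or a == "O":                   # s = 9 is absorbing; O always keeps the state
        return s
    return s + 1 if rng.random() < (9 - s) / 10 else s

rng = np.random.default_rng(0)
z = np.full(10, 0.1)                         # hypothetical uniform aggregate, for illustration
print(reward(3, "I", z), transition(3, "I", rng))
```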
We then introduce our graphon parameters. We set θ = 3 for the exp-graphon. For the SBM graphon, we set K = 2, p0 = 1, p1 = 0.7, p2 = 1, a11 = a22 = 0.9, and a12 = a21 = 0.3. We set a = 1, b = 0.5 for the affine attachment graphon and the ranked attachment graphon. We set the regularization parameter as λ = 1 in our experiments. For the choices of the model classes, we note that the SIS and investment problems involve a set of parameters; for example, the reward function of the SIS problem involves the coefficients 10 and 2.5, and we estimate these coefficients. For the graphon classes, we note that all the graphons considered here are parameterized by a few parameters, and we estimate these parameters in the experiments. For the implementation of π^I and the computation of µ^I, we discretize the infinitely many agents indexed by [0, 1] into N = 100 groups, and approximate the policies and distribution flows within each group by one policy and one distribution flow. This step incurs an approximation error O(N^{-1}) with respect to the ℓ1 norm.
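The discretization just described can be sketched as follows; the graphon passed in the example (W(α, β) = 1 − αβ) is a hypothetical placeholder rather than one of the paper's parameterized graphons, and the aggregate z_h^α = ∫_0^1 W(α, β) µ_h^β dβ is approximated by a Riemann sum over the N groups.

```python
import numpy as np

# Minimal sketch of the discretization above: agents indexed by [0, 1] are grouped into
# N = 100 blocks, one policy / distribution flow per block, and the graphon aggregate
# z_h^alpha = \int_0^1 W(alpha, beta) mu_h^beta dbeta is approximated by a Riemann sum.
N = 100
alphas = (np.arange(N) + 0.5) / N            # midpoints of the N agent groups

def aggregate(W, mu):
    """Approximate z_h^alpha for every group alpha; `mu` has shape (N, |S|)."""
    weights = np.array([[W(a, b) for b in alphas] for a in alphas])   # (N, N) graphon values
    return weights @ mu / N                   # Riemann sum over beta with mesh 1/N

mu = np.full((N, 10), 0.1)                    # uniform distribution flow, for illustration
z = aggregate(lambda a, b: 1.0 - a * b, mu)   # hypothetical graphon, for illustration only
print(z.shape)                                # (100, 10)
```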
To shorten the simulation time and convey the main message, we only estimate the model at the beginning of the first iteration and reuse this estimate in the subsequent iterations to generate action-value function estimates. Figures 2, 3 and 4 are derived from twenty Monte Carlo runs of the algorithms. The error bars indicate the 25% and 75% quantiles of the errors. When simulating the cases with constant graphons, we implement the fixed-point iteration of the mirror descent operator (Yardim et al., 2022; Xie et al., 2021) or the game operator (Guo et al., 2019) to find the NE, and the optimal policy and the induced distribution flow are computed via dynamic programming and direct calculation with the nominal transition kernels and reward functions. Thus, there are no error bars for these cases.
Our simulation code builds on the code of Fabian et al. (2022) and Cui and Koeppl (2021b) for the simulation environment. We run our simulations on an Intel(R) Core(TM) i5-8257U CPU @ 1.40GHz.
We note that the exploitability ∆(π^I) defined in Section 7 is indeed ∆_λ(π^I) here. Then Proposition 3 in Geist et al. (2019) asserts that
$$ \Delta_\lambda(\pi^I) - \Delta_0(\pi^I) \le \lambda H \log|A| $$
for all λ ≥ 0. This inequality implies that the NE of the regularized (resp. unregularized) GMFG satisfies agent rationality in the unregularized (resp. regularized) game up to λH log|A|. This gap also appears in MFGs (Anahtarci et al., 2022; Xie et al., 2021), and mitigating this bias remains an unsolved problem even in MFGs, a strict subclass of GMFGs.
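As a quick numerical illustration of this gap (our own evaluation of the bound, using the investment-problem parameters λ = 1, H = 50, |A| = 2 stated in this appendix),
$$ \Delta_\lambda(\pi^I) - \Delta_0(\pi^I) \le \lambda H \log|A| = 1 \times 50 \times \log 2 \approx 34.7. $$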
We note that Eqn. (14) averages the states over the L episodes, since the distribution flows of these L episodes are the same. Correspondingly, the estimation procedure in Eqn. (5) is modified to be
$$ (\hat{f}_h, \hat{g}_h, \hat{W}_h) = \operatorname*{argmin}_{f \in \mathcal{B}(r,\bar{\mathcal{H}}),\, g \in \mathcal{B}(\tilde{r},\tilde{\mathcal{H}}),\, \tilde{W} \in \tilde{\mathcal{W}}} \; \frac{1}{NL} \sum_{\tau=1}^{L} \sum_{i=1}^{N} \Big[ \big( s^i_{\tau,h+1} - f\big(\hat{\tilde{\omega}}^i_{\tau,h}(\tilde{W})\big) \big)^2 + \big( r^i_{\tau,h} - g\big(\hat{\tilde{\omega}}^i_{\tau,h}(\tilde{W})\big) \big)^2 \Big]. \tag{15} $$
The convergence rate of the corresponding estimates $(\hat f_h, \hat g_h, \hat W_h)$ can be derived as follows.
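As an aside, the following Python sketch illustrates the spirit of the joint estimation in Eqn. (15) under simplifying assumptions of ours: a finite feature map stands in for the RKHS balls B(r, H̄) and B(r̃, H̃), ridge regression stands in for the constrained least squares, and the graphon class W̃ is replaced by a small hypothetical parametric family searched by grid. None of these choices are prescribed by the paper.

```python
import numpy as np

# Sketch of Eqn. (15): for each candidate graphon, build empirical mean embeddings (here:
# graphon-weighted feature averages over neighbors), fit f and g by ridge regression, and
# keep the candidate with the smallest empirical risk. All names below are illustrative.
rng = np.random.default_rng(0)
N, L, d = 8, 50, 5                       # agents, episodes, feature dimension
ridge = 1e-2

def feats(s, a, s_nbr):                  # hypothetical feature map phi(s, a, s')
    return np.array([1.0, s, a, s_nbr, s * s_nbr])

def embedding(W, xi, S, A):              # omega-hat^i: graphon-weighted average over j != i
    emb = np.zeros((N, d))
    for i in range(N):
        js = [j for j in range(N) if j != i]
        w = np.array([W(xi[i], xi[j]) for j in js])
        F = np.array([feats(S[i], A[i], S[j]) for j in js])
        emb[i] = w @ F / (N - 1)
    return emb

def empirical_risk(W, data, xi):
    X, ys, yr = [], [], []
    for S, A, R, Snext in data:          # one step of one episode: states, actions, rewards, next states
        E = embedding(W, xi, S, A)
        X.append(E); ys.append(Snext); yr.append(R)
    X, ys, yr = np.vstack(X), np.concatenate(ys), np.concatenate(yr)
    G = X.T @ X + ridge * np.eye(d)
    risk = 0.0
    for y in (ys, yr):                   # ridge fits for f (next state) and g (reward)
        beta = np.linalg.solve(G, X.T @ y)
        risk += np.mean((y - X @ beta) ** 2)
    return risk

xi = (np.arange(N) + 0.5) / N            # known agent positions on a grid
data = [(rng.integers(0, 10, N), rng.integers(0, 2, N), rng.random(N), rng.integers(0, 10, N))
        for _ in range(L)]
candidates = {t: (lambda a, b, t=t: np.exp(-t * abs(a - b))) for t in (1.0, 2.0, 3.0)}
best = min(candidates, key=lambda t: empirical_risk(candidates[t], data, xi))
print("selected graphon parameter:", best)
```

The selected parameter is simply the candidate whose fitted transition and reward models attain the smallest empirical risk, mirroring the argmin in Eqn. (15).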
where the last equality results from the definition of π^{φ,I} and the hypothesis. To show that µ^{φ(α)}_{h+1} = µ^{φ,α}_{h+1}, it remains to show z_h^{φ(α)}(µ^I_h, W^*_h) = z_h^{α}(µ^{φ,I}_h, W^{φ,*}_h). In fact, we have that
$$ z_h^{\alpha}\big(\mu^{\phi,I}_h, W^{\phi,*}_h\big) = \int_0^1 W^*_h\big(\phi(\alpha), \phi(\beta)\big)\, \mu^{\phi(\beta)}_h \, d\beta = \int_0^1 W^*_h\big(\phi(\alpha), \gamma\big)\, \mu^{\gamma}_h \, d\gamma, $$
where the last equality results from the substitution γ = φ(β). Thus, we conclude the proof of Proposition 2.
• Second, we derive the recurrence relationship of the policy learning error from the relationship derived in the first step.
Step 1: Analyze the property of the policy π̂^I_{t+1} derived in Line 6 of Algorithm 1.
We first note that the update of π̂^α_{t+1,h}(· | s) in Line 6 of Algorithm 1 can be equivalently defined as
$$ \hat{\pi}^{\alpha}_{t+1,h}(\cdot \mid s) = \operatorname*{argmax}_{p \in \Delta(A)} \Big[ \eta_{t+1} \big\langle \hat{Q}^{\lambda,\alpha}_h(s, \cdot, \pi^{\alpha}_t, \hat{\bar{\mu}}^I_t, \hat{W}), \, p \big\rangle - \lambda \eta_{t+1} \bar{H}(p) - \mathrm{KL}\big(p \,\|\, \pi^{\alpha}_{t,h}(\cdot \mid s)\big) \Big]. \tag{16} $$
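For intuition (this is our own remark, not part of the proof), when H̄ is the negative-entropy regularizer the proximal update in Eqn. (16) admits the standard closed form π̂^α_{t+1,h}(a | s) ∝ π^α_{t,h}(a | s)^{1/(1+λη_{t+1})} exp(η_{t+1} Q̂(s, a)/(1+λη_{t+1})), sketched below with a randomly generated action-value vector standing in for Q̂^{λ,α}_h.

```python
import numpy as np

# Closed-form solution of the proximal step in Eqn. (16), assuming H-bar(p) = <p, log p>:
#   pi_{t+1}(a|s)  proportional to  pi_t(a|s)^(1/(1+lam*eta)) * exp(eta * q(a) / (1 + lam*eta)).
def mirror_descent_step(q, pi_prev, eta, lam):
    logits = (eta * q + np.log(pi_prev)) / (1.0 + lam * eta)
    logits -= logits.max()                      # numerical stabilization
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
q = rng.random(4)                               # stand-in for the estimated action values
pi_prev = np.full(4, 0.25)                      # previous policy pi^alpha_{t,h}(.|s)
print(mirror_descent_step(q, pi_prev, eta=0.5, lam=1.0))
```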
Proposition 13 For the policy π̂^α_{t+1,h}(· | s) defined in Eqn. (16), we have that for all s ∈ S, p ∈ ∆(A), and h ∈ [H],
$$ \eta_{t+1} \big\langle \hat{Q}^{\lambda,\alpha}_h(s, \cdot, \pi^{\alpha}_t, \hat{\bar{\mu}}^I_t, \hat{W}), \, p - \hat{\pi}^{\alpha}_{t+1,h}(\cdot \mid s) \big\rangle + \lambda \eta_{t+1} \big[ R\big(\hat{\pi}^{\alpha}_{t+1,h}(\cdot \mid s)\big) - \bar{H}(p) \big] \\ \le \mathrm{KL}\big(p \,\|\, \pi^{\alpha}_{t,h}(\cdot \mid s)\big) - (1 + \lambda\eta_{t+1})\,\mathrm{KL}\big(p \,\|\, \hat{\pi}^{\alpha}_{t+1,h}(\cdot \mid s)\big) - \mathrm{KL}\big(\hat{\pi}^{\alpha}_{t+1,h}(\cdot \mid s) \,\|\, \pi^{\alpha}_{t,h}(\cdot \mid s)\big). $$
where the term (I) is the combination of the action-value function estimation error and the difference between π̂^I_{t+1} and π^I_{t+1}, defined as
$$ \mathrm{(I)} = \eta_{t+1} \Big[ \big\langle Q^{\lambda,\alpha}_h(s_h, \cdot, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*), \, p - \pi^{\alpha}_{t+1,h}(\cdot \mid s_h) \big\rangle - \big\langle \hat{Q}^{\lambda,\alpha}_h(s_h, \cdot, \pi^{\alpha}_t, \hat{\bar{\mu}}^I_t, \hat{W}), \, p - \hat{\pi}^{\alpha}_{t+1,h}(\cdot \mid s_h) \big\rangle \Big]. $$
The term (III) is the KL-divergence difference between π̂^I_{t+1} and π^I_{t+1}, defined as
$$ \mathrm{(III)} = \mathrm{KL}\big(\pi^{\alpha}_{t+1,h}(\cdot \mid s_h) \,\|\, \pi^{\alpha}_{t,h}(\cdot \mid s_h)\big) - \mathrm{KL}\big(\hat{\pi}^{\alpha}_{t+1,h}(\cdot \mid s_h) \,\|\, \pi^{\alpha}_{t,h}(\cdot \mid s_h)\big). $$
The term (IV) is also a KL-divergence difference between π̂^I_{t+1} and π^I_{t+1}, defined as
$$ \mathrm{(IV)} = (1 + \lambda\eta_{t+1}) \Big[ \mathrm{KL}\big(p \,\|\, \pi^{\alpha}_{t+1,h}(\cdot \mid s_h)\big) - \mathrm{KL}\big(p \,\|\, \hat{\pi}^{\alpha}_{t+1,h}(\cdot \mid s_h)\big) \Big]. $$
We define
$$ \Lambda^{\alpha}_{t+1,h} = 2\eta_{t+1} \big\| Q^{\lambda,\alpha}_h(s_h, \cdot, \pi^{\alpha}_t, \hat{\bar{\mu}}^I_t, W^*) - \hat{Q}^{\lambda,\alpha}_h(s_h, \cdot, \pi^{\alpha}_t, \hat{\bar{\mu}}^I_t, \hat{W}) \big\|_{\infty} + 2\eta_{t+1} H(1 + \lambda \log|A|)\beta_{t+1} \\ \quad + 2\eta_{t+1} \big[ L_r + H(1 + \lambda \log|A|) L_P \big] \varepsilon_{\mu} + 2\beta_{t+1} \log\frac{|A|}{\beta_t} + 2(1 + \lambda\eta_{t+1})\beta_{t+1}. \tag{18} $$
Then we can show the following bound.
Proposition 14 Under the assumptions in Theorem 4, (I) + (II) + (III) + (IV) ≤ Λ^α_{t+1,h}.
Proof See Appendix Q.2.2.
Then inequality (17) shows that
$$ \eta_{t+1} \big\langle Q^{\lambda,\alpha}_h(s_h, \cdot, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*), \, p - \pi^{\alpha}_{t+1,h}(\cdot \mid s_h) \big\rangle + \lambda\eta_{t+1} \big[ R\big(\pi^{\alpha}_{t+1,h}(\cdot \mid s_h)\big) - \bar{H}(p) \big] + \mathrm{KL}\big(\pi^{\alpha}_{t+1,h}(\cdot \mid s_h) \,\|\, \pi^{\alpha}_{t,h}(\cdot \mid s_h)\big) \\ \le \mathrm{KL}\big(p \,\|\, \pi^{\alpha}_{t,h}(\cdot \mid s_h)\big) - (1 + \lambda\eta_{t+1})\,\mathrm{KL}\big(p \,\|\, \pi^{\alpha}_{t+1,h}(\cdot \mid s_h)\big) + \Lambda^{\alpha}_{t+1,h}. \tag{19} $$
Step 2: Derive the recurrence relationship of the policy learning error from the relationship derived in Step 1, and bound the dynamical error in this recurrence relationship.
Inequality (19) implies that the improvement of π^I_{t+1} over π^I_t on the MDP induced by µ̄^I_t can be lower bounded as
$$ V^{\lambda,\alpha}_m(s, \pi^{\alpha}_{t+1}, \bar{\mu}^I_t, W^*) - V^{\lambda,\alpha}_m(s, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*) \\ = \mathbb{E}_{\pi^{\alpha}_{t+1}, \bar{\mu}^I_t} \bigg[ \sum_{h=m}^{H} \big\langle Q^{\lambda,\alpha}_h(s_h, \cdot, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*), \, \pi^{\alpha}_{t+1,h}(\cdot \mid s_h) - \pi^{\alpha}_{t,h}(\cdot \mid s_h) \big\rangle + \lambda \big[ R\big(\pi^{\alpha}_{t,h}(\cdot \mid s_h)\big) - R\big(\pi^{\alpha}_{t+1,h}(\cdot \mid s_h)\big) \big] \,\Big|\, s_m = s \bigg] \\ \ge \big\langle Q^{\lambda,\alpha}_m(s, \cdot, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*), \, \pi^{\alpha}_{t+1,m}(\cdot \mid s) - \pi^{\alpha}_{t,m}(\cdot \mid s) \big\rangle + \lambda \big[ R\big(\pi^{\alpha}_{t,m}(\cdot \mid s)\big) - R\big(\pi^{\alpha}_{t+1,m}(\cdot \mid s)\big) \big] - \frac{1}{\eta_{t+1}} \mathbb{E}_{\pi^{\alpha}_{t+1}, \bar{\mu}^I_t} \bigg[ \sum_{h=m}^{H} \Lambda^{\alpha}_{t+1,h} \,\Big|\, s_m = s \bigg], \tag{20} $$
where the equality results from Lemma 37, and the inequality results from inequality (19) and the non-negativity of the KL divergence.
We denote the optimal policy on the MDP induced by µ̄^I_t as π̄^{*,I}_t = Γ^λ_1(µ̄^I_t, W^*). Then Lemma 37 and inequality (20) imply that
$$ \eta_{t+1} \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ \big\langle Q^{\lambda,\alpha}_h(s_h, \cdot, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*), \, \bar{\pi}^{*,\alpha}_{t,h}(\cdot \mid s_h) - \pi^{\alpha}_{t+1,h}(\cdot \mid s_h) \big\rangle + \lambda \big[ R\big(\pi^{\alpha}_{t+1,h}(\cdot \mid s_h)\big) - R\big(\bar{\pi}^{*,\alpha}_{t,h}(\cdot \mid s_h)\big) \big] \Big] \\ \ge \eta_{t+1} \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ V^{\lambda,\alpha}_h(s_h, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*) - V^{\lambda,\alpha}_h(s_h, \pi^{\alpha}_{t+1}, \bar{\mu}^I_t, W^*) \Big] - \sum_{h=1}^{H} \sum_{m=h}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ \mathbb{E}_{\pi^{\alpha}_{t+1}, \bar{\mu}^I_t} \big[ \Lambda^{\alpha}_{t+1,m} \,\big|\, s_h \big] \Big] \\ \quad + \eta_{t+1} \mathbb{E}_{\mu^{\alpha}_1} \Big[ V^{\lambda,\alpha}_1(s_1, \bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t, W^*) - V^{\lambda,\alpha}_1(s_1, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*) \Big]. \tag{21} $$
Applying inequality (19) with p = π̄^{*,α}_{t,h}(· | s_h) to the left-hand side of inequality (21) and rearranging the terms, we have that
$$ \eta_{t+1} \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ V^{\lambda,\alpha}_h(s_h, \bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t, W^*) - V^{\lambda,\alpha}_h(s_h, \pi^{\alpha}_{t+1}, \bar{\mu}^I_t, W^*) \Big] + (1 + \lambda\eta_{t+1}) \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ \mathrm{KL}\big(\bar{\pi}^{*,\alpha}_{t,h}(\cdot \mid s_h) \,\|\, \pi^{\alpha}_{t+1,h}(\cdot \mid s_h)\big) \Big] \\ \le \eta_{t+1} \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ V^{\lambda,\alpha}_h(s_h, \bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t, W^*) - V^{\lambda,\alpha}_h(s_h, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*) \Big] - \eta_{t+1} \mathbb{E}_{\mu^{\alpha}_1} \Big[ V^{\lambda,\alpha}_1(s_1, \bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t, W^*) - V^{\lambda,\alpha}_1(s_1, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*) \Big] \\ \quad + \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ \mathrm{KL}\big(\bar{\pi}^{*,\alpha}_{t,h}(\cdot \mid s_h) \,\|\, \pi^{\alpha}_{t,h}(\cdot \mid s_h)\big) \Big] + \sum_{h=1}^{H} \sum_{m=h}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ \mathbb{E}_{\pi^{\alpha}_{t+1}, \bar{\mu}^I_t} \big[ \Lambda^{\alpha}_{t+1,m} \,\big|\, s_h \big] \Big] + \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \big[ \Lambda^{\alpha}_{t+1,h} \big]. \tag{22} $$
To handle the right-hand side of this inequality, we utilize the following proposition.
Proposition 15 For a λ-regularized finite-horizon MDP (S, A, H, {r_h}^H_{h=1}, {P_h}^H_{h=1}) with |r_h| ≤ 1 for all h ∈ [H], we denote the optimal policy as π* = {π*_h}^H_{h=1}. Then for any policy π, we have that
$$ \mathbb{E}_{\pi^*} \Big[ V^{\lambda}_1(s_1, \pi^*) - V^{\lambda}_1(s_1, \pi) \Big] \ge \beta^* \sum_{h=2}^{H} \mathbb{E}_{\pi^*} \Big[ V^{\lambda}_h(s_h, \pi^*) - V^{\lambda}_h(s_h, \pi) \Big], $$
where the expectation is taken with respect to the state distribution induced by π*, and β* > 0 is a constant that depends only on λ, H, and |A|.
Proof [Proof of Proposition 15] See Appendix Q.2.3.
Define θ∗ = 1/(1 + β ∗ ) < 1 and let ηt = η, where 1 + λη = 1/θ∗ . Proposition 15 shows that
$$ \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ V^{\lambda,\alpha}_h(s_h, \bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t, W^*) - V^{\lambda,\alpha}_h(s_h, \pi^{\alpha}_{t+1}, \bar{\mu}^I_t, W^*) \Big] + \frac{1}{\eta\theta^*} \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ \mathrm{KL}\big(\bar{\pi}^{*,\alpha}_{t,h}(\cdot \mid s_h) \,\|\, \pi^{\alpha}_{t+1,h}(\cdot \mid s_h)\big) \Big] \\ \le \theta^* \bigg\{ \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ V^{\lambda,\alpha}_h(s_h, \bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t, W^*) - V^{\lambda,\alpha}_h(s_h, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*) \Big] + \frac{1}{\eta\theta^*} \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ \mathrm{KL}\big(\bar{\pi}^{*,\alpha}_{t,h}(\cdot \mid s_h) \,\|\, \pi^{\alpha}_{t,h}(\cdot \mid s_h)\big) \Big] \bigg\} \\ \quad + \frac{1}{\eta} \sum_{h=1}^{H} \sum_{m=h}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ \mathbb{E}_{\pi^{\alpha}_{t+1}, \bar{\mu}^I_t} \big[ \Lambda^{\alpha}_{t+1,m} \,\big|\, s_h \big] \Big] + \frac{1}{\eta} \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \big[ \Lambda^{\alpha}_{t+1,h} \big]. \tag{23} $$
In the following, we will derive the rate of convergence of the term
$$ X^{\alpha}_t = \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ V^{\lambda,\alpha}_h(s_h, \bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t, W^*) - V^{\lambda,\alpha}_h(s_h, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*) \Big] + \frac{1}{\eta\theta^*} \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ \mathrm{KL}\big(\bar{\pi}^{*,\alpha}_{t,h}(\cdot \mid s_h) \,\|\, \pi^{\alpha}_{t,h}(\cdot \mid s_h)\big) \Big]. \tag{24} $$
We note that X^I_t is a good quantity with which to measure the “distance” between π^I_t and the NE. For the NE, π^{*,I} is the optimal policy on the MDP induced by its own distribution flow µ^{*,I}. Since µ̄^I_t is close to µ^I_t, we expect that π^I_t achieves high rewards on the MDP induced by µ̄^I_t if it is close to the NE. Inequality (23) shows that the recurrence relationship for X^α_t is
$$ X^{\alpha}_{t+1} \le \theta^* X^{\alpha}_t + \frac{1}{\eta} \sum_{h=1}^{H} \sum_{m=h}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ \mathbb{E}_{\pi^{\alpha}_{t+1}, \bar{\mu}^I_t} \big[ \Lambda^{\alpha}_{t+1,m} \,\big|\, s_h \big] \Big] + \frac{1}{\eta} \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \big[ \Lambda^{\alpha}_{t+1,h} \big] + \Delta^{\alpha}_{t+1}, \tag{25} $$
where ∆^α_{t+1} is the error introduced by the change of the environment, which is also called the dynamical error, and it is defined as
$$ \Delta^{\alpha}_{t+1} = X^{\alpha}_{t+1} - \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ V^{\lambda,\alpha}_h(s_h, \bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t, W^*) - V^{\lambda,\alpha}_h(s_h, \pi^{\alpha}_{t+1}, \bar{\mu}^I_t, W^*) \Big] - \frac{1}{\eta\theta^*} \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ \mathrm{KL}\big(\bar{\pi}^{*,\alpha}_{t,h}(\cdot \mid s_h) \,\|\, \pi^{\alpha}_{t+1,h}(\cdot \mid s_h)\big) \Big]. $$
Proposition 16 Under the assumptions in Theorem 4, the dynamical error satisfies
$$ \Delta^{\alpha}_{t+1} \le H \Big[ 2H(1 + \lambda\log|A|) + \lambda L_R + \frac{1}{\eta\theta^*}\log\frac{|A|^2}{\beta_{t+1}} + \frac{2}{\eta\theta^*}\max\Big\{ \log\frac{|A|}{\beta_{t+1}},\, L_R \Big\} \Big] \cdot \sum_{m=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_{t+1}, \bar{\mu}^I_{t+1}} \Big[ \big\| \bar{\pi}^{*,\alpha}_{t+1,m}(\cdot \mid s_m) - \bar{\pi}^{*,\alpha}_{t,m}(\cdot \mid s_m) \big\|_1 \Big] \\ \quad + \Big[ H\Big( H(1 + \lambda\log|A|) + \frac{1}{\eta\theta^*}\log\frac{|A|^2}{\beta_{t+1}} \Big) L_P + 2H\big( L_r + H(1 + \lambda\log|A|)L_P \big) \Big] \cdot \sum_{m=1}^{H} \int_0^1 \big\| \bar{\mu}^{\beta}_{t+1,m} - \bar{\mu}^{\beta}_{t,m} \big\|_1 \, d\beta \\ = C_1(\eta, \beta_{t+1}) \sum_{m=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_{t+1}, \bar{\mu}^I_{t+1}} \Big[ \big\| \bar{\pi}^{*,\alpha}_{t+1,m}(\cdot \mid s_m) - \bar{\pi}^{*,\alpha}_{t,m}(\cdot \mid s_m) \big\|_1 \Big] + C_2(\eta, \beta_{t+1}) \sum_{m=1}^{H} \int_0^1 \big\| \bar{\mu}^{\beta}_{t+1,m} - \bar{\mu}^{\beta}_{t,m} \big\|_1 \, d\beta. $$
In the above, we defined C_1(η, β_{t+1}) and C_2(η, β_{t+1}) for ease of subsequent notation.
Proof See Appendix Q.2.4.
We need the following proposition to relate the difference between the optimal policies
∗,α ∗,α
π̄t+1,m (· | sm ) and π̄t,m (· | sm ) in Proposition 16 to the distribution flows µ̄It+1 and µ̄It .
Proposition 17 For any two distribution flows µI and µ̃I , we define the optimal policies
π ∗,I = Γλ1 (µI , W ∗ ) and π̃ ∗,I = Γλ1 (µ̃I , W ∗ ). Under Assumption 1, we have that for any
h ∈ [H] and α ∈ [0, 1]
+ 2H 2H H(1 + λ log |A|)LP + Lr C1 (η, βt+1 ) + C2 (η, βt+1 ) αt , (27)
where the equality results from the definition of µ̄ ˆIt+1 , and the inequality results from the
triangle inequality. For the fourth term in the right-hand side of inequality (29), we have
that
$$ d(\bar{\mu}^{*,I}_t, \mu^{*,I}) = d\Big( \Gamma_2\big(\Gamma^{\lambda}_1(\bar{\mu}^I_t, W^*), W^*\big), \, \Gamma_2\big(\Gamma^{\lambda}_1(\mu^{*,I}, W^*), W^*\big) \Big) \le d_1 d_2 \, d(\bar{\mu}^I_t, \mu^{*,I}) \le d_1 d_2 \Big[ d(\bar{\mu}^I_t, \hat{\bar{\mu}}^I_t) + d(\hat{\bar{\mu}}^I_t, \mu^{*,I}) \Big], \tag{30} $$
where the equality results from the definitions of µ̄^{*,I}_t and µ^{*,I}, the first inequality results from Assumption 2, and the last inequality results from the triangle inequality. We then define µ̃^{*,I}_t = Γ_3(π̄^{*,I}_t, µ̄^I_t, W^*). For the third term in the right-hand side of inequality (29), we have that
$$ \le d_2 \int_0^1 \sum_{h=1}^{H} \mathbb{E}_{\mu^{*,\alpha}_h} \Big[ \big\| \pi^{\alpha}_{t,h}(\cdot \mid s) - \bar{\pi}^{*,\alpha}_{t,h}(\cdot \mid s) \big\|_1 \Big] d\alpha = d_2 \int_0^1 \sum_{h=1}^{H} \mathbb{E}_{\tilde{\mu}^{*,\alpha}_{t,h}} \bigg[ \frac{\mu^{*,\alpha}_h(s)}{\tilde{\mu}^{*,\alpha}_{t,h}(s)} \big\| \pi^{\alpha}_{t,h}(\cdot \mid s) - \bar{\pi}^{*,\alpha}_{t,h}(\cdot \mid s) \big\|_1 \bigg] d\alpha \\ \le d_2 C_{\mu} \sqrt{H} \sqrt{ 2 \int_0^1 \sum_{h=1}^{H} \mathbb{E}_{\tilde{\mu}^{*,\alpha}_{t,h}} \Big[ \mathrm{KL}\big(\bar{\pi}^{*,\alpha}_{t,h}(\cdot \mid s) \,\|\, \pi^{\alpha}_{t,h}(\cdot \mid s)\big) \Big] d\alpha }, \tag{31} $$
where the first inequality results from Assumption 2, and the second inequality results from Assumption 3 and the Cauchy–Schwarz inequality. Define Y_t = d(\hat{\bar{\mu}}^I_t, \mu^{*,I}). Combining
where d¯ = 1 − d1 d2 .
Recalling the expressions of µ̄^I_t and µ̄̂^I_t in Eqn. (79), we have that
$$ d(\bar{\mu}^I_t, \hat{\bar{\mu}}^I_t) \le \sum_{m=1}^{t-1} \alpha_{m,t-1}\, d(\hat{\mu}^I_m, \mu^I_m) \le \varepsilon_{\mu}, $$
where the first inequality results from the triangle inequality, and the second inequality results from Assumption 4. Taking α_t = α, we have that
$$ \frac{1}{T}\sum_{t=1}^{T} Y_t \le \frac{1}{\bar{d}\alpha T} Y_1 + \frac{1}{\bar{d}} \varepsilon_{\mu} + \frac{(1 + d_1 d_2)\sqrt{d_2 C_{\mu} H}}{\bar{d}} \cdot \frac{1}{T}\sum_{t=1}^{T} \sqrt{2\eta\theta^* \int_0^1 X^{\alpha}_t \, d\alpha} \\ \le \frac{1}{\bar{d}\alpha T} Y_1 + \frac{1}{\bar{d}} \varepsilon_{\mu} + \frac{(1 + d_1 d_2)\sqrt{d_2 C_{\mu} H}}{\bar{d}} \cdot \sqrt{2\eta\theta^* \, \frac{1}{T}\sum_{t=1}^{T} \int_0^1 X^{\alpha}_t \, d\alpha}, $$
where the last inequality results from Eqn. (28). Thus, we have
$$ \frac{1}{T}\sum_{t=1}^{T} Y_t = O\Big(\frac{\log T}{T^{1/3}}\Big) + O\big(\sqrt{\varepsilon_{\mu} + \varepsilon_Q} + \varepsilon_{\mu}\big). $$
$$ D(\pi^I_t, \pi^{*,I}) \le D(\pi^I_t, \bar{\pi}^{*,I}_t) + D(\bar{\pi}^{*,I}_t, \pi^{*,I}) \le D(\pi^I_t, \bar{\pi}^{*,I}_t) + d_1\, d(\bar{\mu}^I_t, \mu^{*,I}). $$
$$ \le \theta^* \bigg\{ \sum_{h=1}^{H} \mathbb{E}_{\pi^*} \Big[ V^{\lambda}_h(s_h, \pi^*) - V^{\lambda}_h(s_h, \pi_t) \Big] + \frac{1}{\eta\theta^*} \sum_{h=1}^{H} \mathbb{E}_{\pi^*} \Big[ \mathrm{KL}\big(\pi^*_h(\cdot \mid s_h) \,\|\, \pi_{t,h}(\cdot \mid s_h)\big) \Big] \bigg\}, $$
where 0 < θ* < 1 is a function of λ, H and |A|, and E_{π*} refers to the expectation with respect to the state distribution induced by π*.
Proof [Proof of Corollary 18] Similarly to Step 1 of the proof of Theorem 4, we have
$$ \eta_{t+1} \big\langle Q^{\lambda}_h(s_h, \cdot, \pi_t), \, p - \pi_{t+1,h}(\cdot \mid s_h) \big\rangle + \lambda\eta_{t+1} \big[ R\big(\pi_{t+1,h}(\cdot \mid s_h)\big) - R(p) \big] + \mathrm{KL}\big(\pi_{t+1,h}(\cdot \mid s_h) \,\|\, \pi_{t,h}(\cdot \mid s_h)\big) \\ \le \mathrm{KL}\big(p \,\|\, \pi_{t,h}(\cdot \mid s_h)\big) - (1 + \lambda\eta_{t+1})\, \mathrm{KL}\big(p \,\|\, \pi_{t+1,h}(\cdot \mid s_h)\big) $$
for any p ∈ ∆(A). Following the same pipeline as for inequality (23), we have that
$$ \sum_{h=1}^{H} \mathbb{E}_{\pi^*} \Big[ V^{\lambda}_h(s_h, \pi^*) - V^{\lambda}_h(s_h, \pi_{t+1}) \Big] + \frac{1}{\eta\theta^*} \sum_{h=1}^{H} \mathbb{E}_{\pi^*} \Big[ \mathrm{KL}\big(\pi^*_h(\cdot \mid s_h) \,\|\, \pi_{t+1,h}(\cdot \mid s_h)\big) \Big] \\ \le \theta^* \bigg\{ \sum_{h=1}^{H} \mathbb{E}_{\pi^*} \Big[ V^{\lambda}_h(s_h, \pi^*) - V^{\lambda}_h(s_h, \pi_t) \Big] + \frac{1}{\eta\theta^*} \sum_{h=1}^{H} \mathbb{E}_{\pi^*} \Big[ \mathrm{KL}\big(\pi^*_h(\cdot \mid s_h) \,\|\, \pi_{t,h}(\cdot \mid s_h)\big) \Big] \bigg\}, $$
This generalization error of the risk represents the error due to the fact that we optimize the empirical estimate of the risk rather than the population risk.
The estimation error of the mean-embedding represents the error due to the fact that we cannot observe the value of ω̂^i_{τ,h}(Ŵ_h); instead, we can only estimate it through the states of the sampled agents.
The empirical risk difference represents the error arising from the fact that we choose (f̂_h, ĝ_h, Ŵ_h) rather than (f*_h, g*_h, W*_h) by minimizing the empirical risk. From the procedure of Algorithm (5), we have
Empirical Risk Difference ≤ 0.
Thus, we have that
Rξ̄ (fˆh , ĝh , Ŵh ) − Rξ̄ (fh∗ , gh∗ , Wh∗ )
L N
1 XX i ˆ i
2 i ∗ i ∗
2
≤ Eρi sτ,h+1 − fh ωτ,h (Ŵh ) − sτ,h+1 − fh ωτ,h (Wh )
NL τ,h
τ =1 i=1
L XN
1 X 2 i 2
−2 siτ,h+1 − fˆh ωτ,h
i
(Ŵh ) − sτ,h+1 − fh∗ ωτ,h
i
(Wh∗ )
NL
τ =1 i=1
L N
1 XX i i
2 i ∗ i ∗
2
+ Eρi rτ,h − ĝh ωτ,h (Ŵh ) − rτ,h − gh ωτ,h (Wh )
NL τ,h
τ =1 i=1
L N
1 XX i i
2 i ∗ i ∗
2
−2 rτ,h − ĝh ωτ,h (Ŵh ) − rτ,h − gh ωτ,h (Wh )
NL
τ =1 i=1
L N
1 XX i i
2 i i
2
+2 sup sτ,h+1 − f ω̂τ,h (W̃ ) − sτ,h+1 − f ωτ,h (W̃ )
f ∈B(r,H̄),W̃ ∈W̃ NL
τ =1 i=1
L X N
1 X
i i
2 i i
2
+2 sup rτ,h − g ω̂τ,h (W̃ ) − rτ,h − g ωτ,h (W̃ )
g∈B(r̃,H̃),W̃ ∈W̃ NL
τ =1 i=1
= (I) + (II), (32)
We note that the terms related to the transition kernels and reward functions are similar. In
the following, we will only present the bounds for the terms related to the transition kernels,
and the bounds for the reward functions can be similarly derived.
Step 1: Bound the Estimation Error of Mean-embedding.
Considering term (II), we have that
$$ \sup_{f \in \mathcal{B}(r,\bar{\mathcal{H}}),\, \tilde{W} \in \tilde{\mathcal{W}}} \frac{1}{NL} \sum_{\tau=1}^{L} \sum_{i=1}^{N} \Big[ \big( s^i_{\tau,h+1} - f\big(\hat{\omega}^i_{\tau,h}(\tilde{W})\big) \big)^2 - \big( s^i_{\tau,h+1} - f\big(\omega^i_{\tau,h}(\tilde{W})\big) \big)^2 \Big] \\ \le \sup_{f \in \mathcal{B}(r,\bar{\mathcal{H}}),\, \tilde{W} \in \tilde{\mathcal{W}}} \frac{1}{NL} \sum_{\tau=1}^{L} \sum_{i=1}^{N} \Big| f\big(\hat{\omega}^i_{\tau,h}(\tilde{W})\big) - f\big(\omega^i_{\tau,h}(\tilde{W})\big) \Big| \cdot \Big| 2 s^i_{\tau,h+1} - f\big(\hat{\omega}^i_{\tau,h}(\tilde{W})\big) - f\big(\omega^i_{\tau,h}(\tilde{W})\big) \Big| \\ \le 2(B_S + r B_{\bar{K}})\, r L_K \sup_{\tilde{W} \in \tilde{\mathcal{W}}} \frac{1}{NL} \sum_{\tau=1}^{L} \sum_{i=1}^{N} \big\| \hat{\omega}^i_{\tau,h}(\tilde{W}) - \omega^i_{\tau,h}(\tilde{W}) \big\|_{\mathcal{H}}, \tag{33} $$
where the first inequality results from the triangle inequality, and the second inequality results from Assumption 6 and Lemma 33. Recall that the definitions of ω̂^i_{τ,h}(W) and ω^i_{τ,h}(W) are
$$ \omega^i_{\tau,h}(W) = \int_0^1 \int_{S} W(\xi_i, \beta)\, k\big(\cdot, (s^i_{\tau,h}, a^i_{\tau,h}, s)\big)\, \mu^{\beta}_{\tau,h}(s)\, ds\, d\beta, \qquad \hat{\omega}^i_{\tau,h}(W) = \frac{1}{N-1} \sum_{j \ne i} W(\xi_i, \xi_j)\, k\big(\cdot, (s^i_{\tau,h}, a^i_{\tau,h}, s^j_{\tau,h})\big), $$
where
$$ \bar{\omega}^i_{\tau,h}(W) = \frac{1}{N-1} \sum_{j \ne i} W(\xi_i, \xi_j) \int_{S} k\big(\cdot, (s^i_{\tau,h}, a^i_{\tau,h}, s)\big)\, \mu^{j}_{\tau,h}(s)\, ds. $$
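To make these objects concrete, the following Python sketch (ours) evaluates the empirical mean embedding ω̂^i_{τ,h}(W) at a point x = (s, a, s'); the Gaussian kernel and the graphon used in the example are hypothetical stand-ins, since the analysis only requires a bounded kernel and W ∈ W̃.

```python
import numpy as np

# Pointwise evaluation of the empirical mean embedding defined above:
#   omega_hat^i(W)(x) = (1/(N-1)) * sum_{j != i} W(xi_i, xi_j) * k(x, (s_i, a_i, s_j)).
def rbf(x, y, bw=1.0):                              # illustrative Gaussian kernel
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * bw ** 2))

def omega_hat(i, W, xi, s, a, x):
    js = [j for j in range(len(xi)) if j != i]
    return sum(W(xi[i], xi[j]) * rbf(x, (s[i], a[i], s[j])) for j in js) / (len(xi) - 1)

N = 8
xi = (np.arange(N) + 0.5) / N                       # agent positions
s = np.arange(N) % 10                               # hypothetical states at step (tau, h)
a = np.zeros(N)                                     # hypothetical actions
val = omega_hat(0, lambda u, v: 1.0 - u * v, xi, s, a, x=(s[0], a[0], 3))
print(val)
```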
i (W ) − ω̄ i (W )
For term (III) = supW ∈W̃ ωτ,h , we have that
τ,h H
Z 1Z
i i
W (ξi , β)k ·, (siτ,h , aiτ,h , s) µβτ,h (s) ds dβ
ωτ,h (W ) − ω̄τ,h (W ) H ≤
0 S
N −1 Z
1 X j N j−1
− W ξi , k ·, (siτ,h , aiτ,h , s) µτ,h (s) ds
N −1 N −1 S H
j=1
N −1 Z
1 X j N j−1
+ W ξi , k ·, (siτ,h , aiτ,h , s) µτ,h (s) ds
N −1 N −1 S
j=1
Z
1 X
W (ξi , ξj ) k ·, (siτ,h , aiτ,h , s) µjτ,h (s) ds
−
N −1 S H
j6=i
where the inequality results from the triangle inequality. For each term in the sum, we have
that
Z
W (ξi , β)k ·, (siτ,h , aiτ,h , s) µβτ,h (s)ds
S
Z
j N j−1
− W ξi , k ·, (siτ,h , aiτ,h , s) µτ,h (s)ds
N −1 S H
Z
j
k ·, (siτ,h , aiτ,h , s) µβτ,h (s)ds
≤ W (ξi , β) − W ξi ,
N −1 S H
Z j
j i i
β N −1
+ W ξi , k ·, (sτ,h , aτ,h , s) µτ,h (s) − µτ,h (s) ds
N −1 S H
j
j
≤ B k LW β − + Bk µβτ,h − µτ,h N −1
, (37)
N −1 1
where the first inequality results from the triangle inequality, and the second results from
Assumptions 5 and 6.
j h−1 j
j
µβτ,h − µτ,h sup πtβ (· | s) − πtN −1 (· | s)
X
N −1
≤ HLP LW β − + 1
1 N −1 s∈S t=1
j
≤ (HLP LW + HLπ ) β − , (38)
N −1
where the first inequality results from Proposition 19, and the second inequality results
from the Lipschitzness of behavior policies. Substituting inequalities (37) and (38) into
inequality (36), we have that
Z 1Z
W (ξi , β)k ·, (siτ,h , aiτ,h , s) µβτ,h (s)dsdβ
(V) =
0 S
N −1 Z
1 X j N j−1
− W ξi , k ·, (siτ,h , aiτ,h , s) µτ,h (s)ds
N −1 N −1 S H
j=1
N −1 Z j
X N −1 j
≤ Bk (LW + HLP LW + HLπ ) β − dβ
j−1 N −1
j=1 N −1
1
= Bk (LW + HLP LW + HLπ ).
2(N − 1)
Z
1 X
W (ξi , ξj ) k ·, (siτ,h , aiτ,h , s) µjτ,h (s)ds
−
N −1 S H
j6=i
1 X j j−1 j
≤ Bk LW ξj − + ξj − N −1
+ Bk µτ,h − µjτ,h 1
N −1 N −1 N −1
j6=1
N
1 X i
≤ 3 + 2Bk (LW + HLπ + HLp LW ) ξi − ,
N −1 N
i=1
where the first inequality results from the triangle inequality, and the second inequality results from
Proposition 19. Substituting the bounds for terms (V) and (VI) into inequality (35), we
have that
i i
(III) = sup ωτ,h (W ) − ω̄τ,h (W ) H
W ∈W̃
N
Bk (LW + HLP LW + HLπ ) 1 X i
≤ + 3 + 2Bk (LW + HLπ + HLp LW ) ξi −
2(N − 1) N −1 N
i=1
1 3
= Bk (LW + HLP LW + HLπ ) + . (39)
2(N − 1) N −1
To derive a concentration inequality for term (IV), we first construct the minimal ε−cover
of W̃ with respect to k · k∞ . The covering number is denoted as N∞ (ε, W̃). Then for any
W ∈ W̃, there exists a graphon Wi for i ∈ {1, · · · , N∞ (ε, W̃)} such that kW − Wi k∞ ≤ ε.
Then we have that
i i i i
ω̂τ,h (W ) − ω̄τ,h (W ) H
≤ ω̂τ,h (Wi ) − ω̄τ,h (Wi ) H
+ 2εBk ,
where the inequality results from the triangle inequality. In the following, we set ε = t/(4Bk ).
Then the concentration inequality for term (IV) can be derived as
i i
P ∃W̃ ∈ W̃, i ∈ [N ], τ ∈ [L], ω̂τ,h (W̃ ) − ω̄τ,h (W̃ ) H ≥ t)
i i
≤ P ∃j ∈ N∞ (ε, W̃) , i ∈ [N ], τ ∈ [L], ω̂τ,h (Wj ) − ω̄τ,h (Wj ) H ≥ t − 2εBk
i i
≤ N LN∞ (t/(4Bk ), W̃) max P ω̂τ,h (Wj ) − ω̄τ,h (Wj ) H ≥ t/2
j∈[N∞ ],i∈[N ],τ ∈[L]
(N − 1)t2
≤ 2N LN∞ (t/(4Bk ), W̃) exp − ,
32Bk2
where the first inequality results from the construction of the cover, the second inequality
results from the union bound, and the last inequality results from Lemma 32 and that
√
kW (ξi , ξj )k ·, (siτ,h , aiτ,h , sjτ,h )k ≤ Bk for any W ∈ W̃. For t ≥ 4Bk / N , we have that
i i
P ∃W̃ ∈ W̃, i ∈ [N ], τ ∈ [L], ω̂τ,h (W̃ ) − ω̄τ,h (W̃ ) H ≥ t)
√ (N − 1)t2
≤ 2N LN∞ (1/ N , W̃) exp − .
32Bk2
with probability at least 1 − δ. Substituting inequalities (40) and (39) into inequalities (33)
and (34), we have that
L N
1 XX i i
2 i i
2
sup sτ,h+1 − f ω̂τ,h (W ) − sτ,h+1 − f ωτ,h (W )
f ∈B(r,H̄),W ∈W̃ N L τ =1 i=1
√
(BS + rBK̄ )rLK Bk N LN∞ (1/ N , W̃)
≤O √ log , (41)
N δ
with probability at least 1 − δ.
Step 2: Bound the generalization error of risk.
Considering term (I), for ease of notation, we denote the quadruple (s^i_{τ,h}, a^i_{τ,h}, µ^I_{τ,h}, s^i_{τ,h+1}) as e^i_{τ,h}. We define the function f_W as
$$ f_W(e^i_{\tau,h}) = \big( s^i_{\tau,h+1} - f\big(\omega^i_{\tau,h}(W)\big) \big)^2 - \big( s^i_{\tau,h+1} - f^*_h\big(\omega^i_{\tau,h}(W^*_h)\big) \big)^2. $$
The corresponding function class is defined as F_{W̃} = {f_W | f ∈ B(r, H̄), W ∈ W̃}. Then we have that
$$ \mathrm{(I)} = \frac{1}{NL} \sum_{\tau=1}^{L} \sum_{i=1}^{N} \mathbb{E}_{\rho^i_{\tau,h}} \big[ \hat{f}_{\hat{W}}(e^i_{\tau,h}) \big] - \frac{2}{NL} \sum_{\tau=1}^{L} \sum_{i=1}^{N} \hat{f}_{\hat{W}}(e^i_{\tau,h}). $$
Combining inequalities (32), (41), and (42), we have that the following holds with probability
at least 1 − δ
≤2 sup R(f, g, W ) − Rξ̄ (f, g, W ) + Rξ̄ (fˆh , ĝh , Ŵh ) − Rξ̄ (fh∗ , gh∗ , Wh∗ )
f ∈B(r,H̄),g∈B(r̃,H̃),W ∈W̃
where (IX) is the generalization error of the risk from position sampling, and (X) is the difference between the risks given the positions. Similar to the proof of Theorem 5, the terms related to the transition kernels and the reward functions in inequality (43) are analogous. In the following, we will only present the proof for the terms related to the transition kernel; the results for the terms related to the reward functions can be similarly derived.
Step 1: Bound the generalization error of risk from position sampling.
We first define
$$ g_{f,W}(\alpha) = \frac{1}{L} \sum_{\tau=1}^{L} \mathbb{E}_{\rho^{\alpha}_{\tau,h}} \Big[ \big( s^{\alpha}_{\tau,h+1} - f\big(\omega^{\alpha}_{\tau,h}(W)\big) \big)^2 \Big]. $$
The corresponding function class for g_{f,W} is G_{F,W̃} = {g_{f,W} | f ∈ B(r, H̄), W ∈ W̃}. Then the term in (IX) that is related to the transition kernels can be expressed as
$$ 2 \sup_{g_{f,W} \in \mathcal{G}_{\mathcal{F},\tilde{\mathcal{W}}}} \bigg| \int_0^1 g_{f,W}(\alpha)\, d\alpha - \frac{1}{N} \sum_{i=1}^{N} g_{f,W}(\xi_i) \bigg|. $$
Let δ > 0 and let G_δ be a minimal L_∞ δ-cover of G_{F,W̃}. Then for any g_{f,W} ∈ G_{F,W̃}, there exists ḡ_{f,W} ∈ G_δ such that |g_{f,W}(α) − ḡ_{f,W}(α)| ≤ δ for all α ∈ I. For any t > 0, we set δ = t/4.
Then we have that
Z 1 N
1 X
P sup gf,W (α)dα − gf,W (ξi ) ≥ t
gf,W ∈GF ,W̃ 0 N
i=1
Z 1 N
t 1 X t
≤ N∞ ,G max P gf,W (α)dα − gf,W (ξi ) ≥
4 F ,W̃ gf,W ∈G t 0 N 2
4 i=1
N t2
t
≤ 2N∞ , GF ,W̃ exp − , (44)
4 2(BS + rBK̄ )4
where the first inequality results from the union bound, and the second inequality results from the fact that 0 ≤ g_{f,W}(α) ≤ (B_S + rB_K̄)² and Hoeffding's inequality. To upper bound the covering number in the tail probability, we note that
L
1X
Eρατ,h f ωτ,h (W ) − f¯ ωτ,h (W̄ )
α α
gf,W (α) − ḡf,W (α) ≤ 2(BS + rBK̄ )
L
τ =1
≤ 2(BS + rBK̄ ) BK̄ kf − f¯kH̄ + rLK Bk kW − W̄ k∞ ,
where the first inequality results from the definition of gf,W , and the second inequality
results from Lemma 33 and the triangle inequality. This inequality implies that
t t t
N∞ ,G ≤ NH̄ , B(r, H̄) · N∞ , W̃ .
4 F ,W̃ 16(BS + rBK̄ )BK̄ 16(BS + rBK̄ )rLK Bk
L N
1 XX i i
(XII) = 4(BS + rBK̄ )rLK sup ω̂τ,h (W ) − ω̄τ,h (W ) H
W ∈W̃ NL
τ =1 i=1
L X N
1 X
i i
(XIII) = 4(BS + rBK̄ )rLK sup ω̄τ,h (W ) − ωτ,h (W ) H
.
W ∈W̃ NL
τ =1 i=1
For term (XIII), we adopt a method different from the proof of Theorem 5. Let ε > 0 and let W̃_ε be an L_∞ ε-cover of W̃. Then for any W ∈ W̃, there exists W̄ ∈ W̃_ε such that ‖W̄ − W‖_∞ ≤ ε. Then we have
Z 1Z
i i
W (ξi , β) − W̄ (ξi , β) k ·, (siτ,h , aiτ,h , s) µβτ,h (s)dsdβ
kωτ,h (W ) − ωτ,h (W̄ )kH =
0 S H
≤ εBk ,
i (W ) −
where the inequality results from the triangle inequality. Similarly, we have that kω̄τ,h
i (W̄ )k ≤ εB . For any t > 0, we will set ε = t/(4B ). Then the tail probability for
ω̄τ,h H k k
(XIII) can be bounded as
L N
1 XX i i
P sup ω̄τ,h (W ) − ωτ,h (W ) H ≥ t
W ∈W̃ N L τ =1 i=1
i i
≤ P ∃W ∈ W̃, τ ∈ [L], i ∈ [N ], ω̄τ,h (W ) − ωτ,h (W ) H ≥ t
t i i t
≤ N LN∞ , W̃ max P ω̄τ,h (W ) − ωτ,h (W ) H
≥
4Bk W ∈W̃ t ,τ ∈[L],i∈[N ] 2
4Bk
(N − 1)t2
t
≤ 2N LN∞ , W̃ exp − ,
4Bk 8Bk2
where the second inequality results from the union bound, and the last inequality results from Lemma 32. For any 0 < δ < 1, we set
√ 1
2N LN∞ √N , W̃
2 2Bk
t= √ log .
N −1 δ
Then we have that
N LN∞ √1 , W̃
(BS + rBK̄ )rLK Bk N
(XIII) ≤ O √ log (46)
N δ
with probability at least 1 − δ.
For term (XII), we follow the proof of Theorem 5 and condition on the values of ξ¯ to
bound the tail probability. We have that
L N
1 XX i i
P sup ω̂τ,h (W ) − ω̄τ,h (W ) H
≥t
W ∈W̃ NL
τ =1 i=1
L N
1 XX i
= Eξ̄ P sup i
ω̂τ,h (W ) − ω̄τ,h (W ) H
≥ t ξ¯
W ∈W̃ N L
τ =1 i=1
√ (N − 1)t2
≤ 2N LN∞ (1/ N , W̃) exp − ,
32Bk2
where we condition on the values of ξ¯ in the first equality, and the inequality results from
inequality (40). Thus, we have that
√
(BS + rBK̄ )rLK Bk N LN∞ (1/ N , W̃)
(XII) ≤ O √ log (47)
N δ
with probability at least 1 − δ.
For term (XI), we just adopt the same conditional probability trick as shown in the
bound of (XII). From inequality (42), we have that
(BS + rBK̄ )4
NBr NW̃
(XI) ≤ O log (48)
NL δ
with probability at least 1 − δ.
Combining the inequalities (43), (45), (46), (47), and (48), we have that
L N
1 XX 2 i 2
= inf Eρi siτ,h+1 − fˆh ωτ,h
i
(Ŵhφ ) i
+ rτ,h − ĝh ωτ,h (Ŵhφ )
φ∈B[0,1] N L τ,h
τ =1 i=1
i ∗ i ∗
2 i ∗ i ∗
2
− sτ,h+1 − fh ωτ,h (Wh ) − rτ,h − gh ωτ,h (Wh )
L N
1 XX 2 i 2
≤ inf Eρi siτ,h+1 − fˆh ωτ,h
i
(Ŵhφ ) i
+ rτ,h − ĝh ωτ,h (Ŵhφ )
N
φ∈C[0,1] NL τ,h
τ =1 i=1
i ∗ i ∗
2 i ∗ i ∗
2
− sτ,h+1 − fh ωτ,h (Wh ) − rτ,h − gh ωτ,h (Wh )
This generalization error of the risk represents the error due to the fact that we optimize the empirical estimate of the risk rather than the population risk.
Estimation Error of Mean-embedding
L N
1 XX i 2 i 2
= 2 inf sτ,h+1 − fˆh ωτ,h
i
(Ŵhφ ) i
+ rτ,h − ĝh ωτ,h (Ŵhφ )
N
φ∈C[0,1] NL
τ =1 i=1
L X
N
1 2 i 2
(Ŵhφ ) (Ŵhφ )
X
− 2 inf siτ,h+1 − fˆh ω̄ i
ˆ τ,h i
ˆ τ,h
+ rτ,h − ĝh ω̄
N
φ∈C[0,1] NL
τ =1 i=1
L X
N
1 ∗ 2 ∗ 2
(Wh∗,φ ) (Wh∗,φ )
X
+2 siτ,h+1 − fh∗ ω̄ i
ˆ τ,h i
+ rτ,h − gh∗ ω̄ i
ˆ τ,h
NL
τ =1 i=1
L X N
1 X 2 i 2
−2 siτ,h+1 − fh∗ ωτ,h
i
(Wh∗ ) + rτ,h − gh∗ ωτ,h
i
(Wh∗ ) .
NL
τ =1 i=1
The estimation error of the mean-embedding represents the error due to the fact that we cannot observe the value of ω̂^i_{τ,h}(Ŵ_h); instead, we can only estimate it through the states of the sampled agents.
Empirical Risk Difference
L N
1 XX i 2 i 2
= 2 inf sτ,h+1 − fˆh ω̄ i
ˆ τ,h (Ŵhφ ) i
ˆ τ,h
+ rτ,h − ĝh ω̄ (Ŵhφ )
N
φ∈C[0,1] NL
τ =1 i=1
L N
1 ∗ 2
∗ 2
(Wh∗,φ ) (Wh∗,φ ) ,
XX
−2 siτ,h+1 − fh∗ ω̄ i
ˆ τ,h i
+ rτ,h − gh∗ ω̄ i
ˆ τ,h
NL
τ =1 i=1
where φ∗ ∈ C[0,1]
N is a permutation of ((i − 1)/N, i/N ] for i ∈ [N ] such that φ∗ (i/N ) = ξi .
From the estimation procedure of Algorithm (10), we have that
L N
1 XX i i ∗ 2
∗ 2
= sup ˆ τ,h
sτ,h+1 − f ω̄ (W φ◦φ ) i
+ rτ,h i
ˆ τ,h
− g ω̄ (W φ◦φ )
f,W,φ NL
τ =1 i=1
L N
1 XX 2 i 2
− siτ,h+1 − f ωτ,h
i
(W φ ) i
+ rτ,h − g ωτ,h (W φ )
NL
τ =1 i=1
L N
1 XX i ∗
≤ 4(BS + r̄BK )r̄LK sup ˆ τ,h (W φ◦φ ) − ωτ,h
ω̄ i
(W φ ) , (50)
NL H
N
W ∈W̃,φ∈C[0,1] τ =1 i=1
where the equality results from the fact that φ∗ is a measure-preserving bijection, and the
inequality results from the same arguments in inequality (33).
We decompose the error as
i ∗
ˆ τ,h
sup ω̄ (W φ◦φ ) − ωτ,h
i
(W φ ) H
W,φ
i ∗
≤ sup ω̄τ,h (W φ ) − ωτ,h
i
(W φ ) H
i
ˆ τ,h
+ sup ω̄ (W φ◦φ ) − ω̄τ,h
i
(W φ ) H
W,φ W,φ
where
Z 1Z
i
(W φ ) = W φ(ξi ), φ(β) k ·, (siτ,h , aiτ,h , s) µβτ,h (s)dsdβ,
ωτ,h
0 S
Z
1
k ·, (siτ,h , aiτ,h , s) µjτ,h (s)ds,
X
i φ
ω̄τ,h (W ) = W φ(ξi ), φ(ξj )
N −1 S
j6=i
L
∗ 1
W φ(ξi ), φ(ξj ) k ·, (siτ,h , aiτ,h , sjτ 0 ,h ) .
XX
i
(W φ◦φ ) =
ˆ τ,h
ω̄
(N − 1)L 0 j6=i τ =1
i (W φ ) − ω i (W φ )
For term supW,φ ω̄τ,h , we define the interval Ii = ((i − 1)/N, i/N ]
τ,h H
for i ∈ [N ]. Then we have that
i
ω̄τ,h (W φ ) − ωτ,h
i
(W φ ) H
2 X Z ξj Z
W φ(ξi ), φ(β) k ·, (siτ,h , aiτ,h , s) µβτ,h (s)dsdβ
≤ Bk +
N 1
ξj − N S
j6=i
Z
1 X
k ·, (siτ,h , aiτ,h , s) µjτ,h (s)ds
− W φ(ξi ), φ(ξj ) ,
N S H
j6=i
where the inequality results from the triangle inequality. For each term in the sum, we
bound it as
Z ξj Z
W φ(ξi ), φ(β) k ·, (siτ,h , aiτ,h , s) µβτ,h (s)dsdβ
1
ξj − N S
Z
1 X
k ·, (siτ,h , aiτ,h , s) µjτ,h (s)ds
− W φ(ξi ), φ(ξj )
N S H
j6=i
Z ξj Z
W φ(ξi ), φ(β) k ·, (siτ,h , aiτ,h , s) µβτ,h (s) − µjτ,h (s) dsdβ
≤
1
ξj − N S H
Z ξj Z
W φ(ξi ), φ(β) − W φ(ξi ), φ(ξj ) k ·, (siτ,h , aiτ,h , s) µjτ,h (s)dsdβ
+
1
ξj − N S H
LW Bk
=O ,
N2
where the first inequality results from the triangle inequality, and the second inequality
results from the same argument in inequality (38) and the fact that β and ξj are always in
N . Thus, we have that
the same interval for any φ ∈ C[0,1]
i φ i φ LW Bk
sup ω̄τ,h (W ) − ωτ,h (W ) H = O . (51)
W,φ N
N Lt2
≤ 2N !N LN∞ (t/(4Bk ), W̃) exp − ,
16Bk2
where the first inequality results from the proof of inequality (40), and the last inequality
results from Lemma 32. Thus, we have that with probability at least 1 − δ
r p
i φ◦φ ∗ i φ N N LN∞ ( N/L, W̃)
ˆ τ,h (W
sup ω̄ ) − ω̄τ,h (W ) H = O Bk log . (52)
W,φ L δ
h 1 X L X N L
i 2 X i
i
≤ P ∃f∈B(r, H̄), W∈W̃, max Eρi f (eτ,h , W, φ) − f (eτ,h , W, φ) ≥ t
N
φ∈C[0,1] NL τ,h NL
τ =1 i=1 τ =1
L N L
1 XX i 2 X i
≤ N ! max P ∃f∈B(r, H̄), W∈W̃, Eρi f (eτ,h , W, φ) − f (eτ,h , W, φ) ≥ t
N
φ∈C[0,1] NL τ,h NL
τ =1 i=1 τ =1
t t
≤ 14N !NH̄ , B(r, H̄) · N ∞ , W̃ ,
160(BS + rBK̄ )3 BK̄ 160(BS + rBK̄ )3 rLK Bk
where the second inequality results from the union bound and the fact that min_x f(x) − min_x g(x) ≤ max_x [f(x) − g(x)], and the final inequality results from Proposition 20. Thus, we have that with probability at least 1 − δ
L N L
1 XX
ˆ i
1 Xˆ i
inf Eρi fh (eτ,h , Ŵh , φ) − 2 inf fh (eτ,h , Ŵh , φ)
N
φ∈C[0,1] NL τ,h N
φ∈C[0,1] NL
τ =1 i=1 τ =1
(BS + rBK̄ )4
N ÑBr Ñ∞
=O log , (54)
L δ
where
3 3
ÑBr = NH̄ , B(r, H̄) , ÑW̃ = N∞ , W̃ .
L LK L
Combining inequalities (54) and (53), we have that
R̄ξ̄ (fˆh , ĝh , Ŵh ) − R̄ξ̄ (fh∗ , gh∗ , Wh∗ )
r p
LW Bk r̄LK (BS + r̄BK ) N N LN∞ ( N/L, W̃)
=O + (BS + r̄BK )r̄LK BK log
N L δ
(BS + r̄BK )4 N ÑBr ÑB̃r̃ Ñ∞
+ log ,
L δ
where
3
ÑB̃r̃ = NH̃ , B(r̃, H̃) .
L
Thus, we conclude the proof of Theorem 7.
• Bound the estimation error of distribution flow and action-value function estimate.
given ξ¯ as
L N
1 XX i i
2 i i
2
Rξ̄ (f, g, W ) = Eρ+,i sτ,h+1 − f ωτ,h (W ) + rτ,h − g ωτ,h (W )
NL τ,h
τ =1 i=1
N
1 X 2 i 2
= Eρ+,i sih+1 − f ωhi (W ) + rh+1 − g ωhi (W ) ,
N h
i=1
where the second equality results from the fact that we implement the same policy L times. The difference between this definition and Eqn. (7) is that we take the expectation with respect to ρ^{+,i}_{τ,h} instead of ρ^i_{τ,h}. The reason is that in the setting of Eqn. (7), the MDP is induced by the distribution flow of the policy itself, not by a pre-specified distribution flow. We state the performance guarantee as follows.
Corollary 21 Under Assumptions 5, 6, 7, and 1, if ξ_i = i/N for i ∈ [N], then the risk of the estimate derived in Algorithm (12) can be bounded as
Step 2: Generalize the performance guarantee from {ξ_i}^N_{i=1} to [0, 1] by Lipschitzness.
Intuitively, when the implemented policy is Lipschitz, we can generalize the performance guarantee on R_ξ̄(f, g, W) to one on R(f, g, W). Here we consider the case where the MDP
is induced by the distribution flow of the policy itself, i.e., the case specified in Section 5.
The results for the case where the MDP is induced by a pre-specified distribution flow can
be similarly derived. We note that
≤2 sup R(f, g, W ) − Rξ̄ (f, g, W ) + Rξ̄ (fˆh , ĝh , Ŵh ) − Rξ̄ (fh∗ , gh∗ , Wh∗ ).
f ∈B(r,H̄),g∈B(r̃,H̃),W ∈W̃
(55)
Then we attempt to bound the first term of the right-hand side of inequality (55). For any
two positions α, β ∈ I and f ∈ B(r, H̄), we have
α
2 β 2
Eρτ,h sτ,h+1 − f ωτ,h (W )
α − Eρβ sτ,h+1 − f ωτ,h (W )
τ,h
α
2 α
2
≤ Eρατ,h sτ,h+1 − f ωτ,h (W ) − Eρβ sτ,h+1 − f ωτ,h (W )
τ,h
α
2 β 2
+ Eρβ sτ,h+1 − f ωτ,h (W ) − Eρβ sτ,h+1 − f ωτ,h (W ) , (56)
τ,h τ,h
where the inequality results from the triangle inequality. For the first term in the right-hand
side of inequality (56), we have that
α
2 α
2
Eρτ,h sτ,h+1 − f ωτ,h (W )
α − Eρβ sτ,h+1 − f ωτ,h (W )
τ,h
h
≤ (BS + rBK̄ )2 kµατ,h − µβτ,h k1 + Eµατ,h kπτ,h β
α
(· | s) − πτ,h (· | s)k1
i
+ LP zhα (µIτ,h , Wh∗ ) − zhβ (µIτ,h , Wh∗ ) 1
≤ C(BS + rBK̄ )2 · |α − β|,
where C > 0 is a constant, the first inequality results from the definition of ρIτ,h , and the
last inequality adopts Proposition 19 and Assumption 5 to bound these three terms. The
second term in the right-hand side of inequality (56) can be bounded as
α
2 β 2
Eρβ sτ,h+1 − f ωτ,h (W ) − Eρβ sτ,h+1 − f ωτ,h (W )
τ,h τ,h
where the inequality results from Lemma 33 and Assumption 5. Thus, we conclude that
α
2 β 2
Eρατ,h sτ,h+1 − f ωτ,h (W ) − Eρβ sτ,h+1 − f ωτ,h (W )
τ,h
= O (BS + rBK̄ )(BS + rBK̄ + rLk Bk )|α − β| .
By decomposing the interval [0, 1] into the disjoint union of intervals ((i − 1)/N, i/N ]
for i ∈ [N ] and using this result, we can bound the first term of the right-hand side of
inequality (55) as
$$ \sup_{f \in \mathcal{B}(r,\bar{\mathcal{H}}),\, g \in \mathcal{B}(\tilde{r},\tilde{\mathcal{H}}),\, W \in \tilde{\mathcal{W}}} \big| R(f, g, W) - R_{\bar{\xi}}(f, g, W) \big| = O\bigg( \frac{(B_S + \bar{r}B_K)(B_S + \bar{r}B_K + \bar{r}L_K B_k)}{N} \bigg). \tag{57} $$
Eqn. (57) implies that we can transfer the results in Corollary 12 and Corollary 21 to
R(fˆh , ĝh , Ŵh ) with an additional term shown in Eqn. (57). Thus, for the case where the
MDP is induced by the distribution flow of the policy itself, we have that
For the case where the MDP is induced by a pre-specified distribution flow, we have that
Proposition 22 Given two GMFGs (P*, r*, W*) and (P̂, r̂, Ŵ), for a policy π^I ∈ Π̃, we define the distribution flows induced by this policy as µ^I = Γ_2(π^I, W*) and µ̂^I = Γ̂_2(π^I, Ŵ). Assume that the transition kernels P* and P̂ are equivalently defined by f* and f̂ ∈ B(r, H̄) from Eqn. (2). Under Assumption 8, we have that
$$ \sum_{h=1}^{H} \big\| \hat{\mu}^{\alpha}_h - \mu^{\alpha}_h \big\|_1 \le H(1 + r L_K L_{\varepsilon} B_k)^H \bigg( \sum_{m=1}^{H} \int_0^1 e^{\pi,\beta}_m \, d\beta + \sum_{m=1}^{H} e^{\pi,\alpha}_m \bigg), $$
where e^{π,α}_h is defined as
$$ e^{\pi,\alpha}_h = L_{\varepsilon} \sqrt{ \mathbb{E}_{\rho^{\alpha}_h} \Big[ \big( \hat{f}_h\big(\omega^{\alpha}_h(\hat{W}_h)\big) - f^*_h\big(\omega^{\alpha}_h(W^*_h)\big) \big)^2 \Big] }, \qquad \omega^{\alpha}_h(W) = \int_0^1 \int_{S} W(\alpha, \beta)\, k\big(\cdot, (s^{\alpha}_h, a^{\alpha}_h, s)\big)\, \mu^{\beta}_h(s)\, ds\, d\beta, $$
and ρ^α_h = µ^α_h × π^α_h for α ∈ I.
Since we implement the same policy π^I_t for L times in Step 1 of Algorithm 2, the ρ^α_{τ,h} for τ ∈ [L] are the same. Thus, we have
$$ d(\hat{\mu}^I_t, \mu^I_t) = \sum_{h=1}^{H} \int_0^1 \big\| \hat{\mu}^{\alpha}_{t,h} - \mu^{\alpha}_{t,h} \big\|_1 \, d\alpha \le C \sum_{h=1}^{H} \sqrt{ R(\hat{f}_h, \hat{g}_h, \hat{W}_h) - R(f^*_h, g^*_h, W^*_h) }, $$
where C > 0 is a constant, and the inequality results from Proposition 22 and the Hölder inequality. The right-hand side of this inequality will play the role of ε_µ in the proof of Theorem 4, and it is bounded in Eqn. (58).
Next, we bound the estimation error of the action-value function.
Proposition 23 Assume that we have two GMFGs (P*, r*, W*) and (P̂, r̂, Ŵ). For a policy π^I ∈ Π̃, a behavior policy π^{b,I} ∈ Π̃, and a distribution flow µ^I ∈ ∆̃, we define the distribution flow induced by the behavior policy on the GMFG (P*, r*, W*) with underlying distribution flow µ^I as µ^{b,I} = Γ_3(π^{b,I}, µ^I, W*). Assume that the transition kernels P* and P̂ are equivalently defined by f* and f̂ ∈ B(r, H̄) from Eqn. (2), and that the reward functions r* and r̂ are equivalently defined by g* and ĝ ∈ B(r̃, H̃) from Eqn. (2). Assume that sup_{s∈S, a∈A, α∈I, h∈[H]} π^α_h(a | s)/π^{b,α}_h(a | s) ≤ C. Under Assumption 8, we have that
$$ \mathbb{E}_{\rho^{b,\alpha}_h} \Big[ \big| \hat{Q}^{\lambda,\alpha}_h(s, a, \pi^{\alpha}, \mu^I, \hat{W}) - Q^{\lambda,\alpha}_h(s, a, \pi^{\alpha}, \mu^I, W^*) \big| \Big] \le C^H \sum_{m=h}^{H} \sqrt{ \mathbb{E}_{\rho^{b,\alpha}_m} \Big[ \big( \hat{g}_m\big(\omega^{\alpha}_m(\hat{W}_m)\big) - g^*_m\big(\omega^{\alpha}_m(W^*_m)\big) \big)^2 \Big] } \\ \quad + L_{\varepsilon} H (1 + \lambda \log|A|)\, C^H \sum_{m=h}^{H} \sqrt{ \mathbb{E}_{\rho^{b,\alpha}_m} \Big[ \big( \hat{f}_m\big(\omega^{\alpha}_m(\hat{W}_m)\big) - f^*_m\big(\omega^{\alpha}_m(W^*_m)\big) \big)^2 \Big] }, $$
where ρ^{b,α}_h is defined as ρ^{b,α}_h = µ^{b,α}_h · π^{b,α}_h, and Q̂^{λ,α}_h(s, a, π^α, µ^I, Ŵ) and Q^{λ,α}_h(s, a, π^α, µ^I, W*) denote the action-value functions on the GMFGs (P̂, r̂, Ŵ) and (P*, r*, W*), respectively.
Next, we will make use of Proposition 23 to bound the estimation error of action-value
function. Here, we adopt a different method to bound term (I) defined in Step 1 of the proof
of Theorem 4. From inequality (78), we have
+ ηt+1 Q̂λ,α α ˆI α α
h (sh , ·, πt , µ̄t , Ŵ ), π̂t+1,h (· | sh ) − πt+1,h (· | sh )
∗,I
For the third term in the right-hand side of inequality (60), if p = π̄t,h (· | sh ), we have that
∗,I
Qλ,α α ˆI ∗ λ,α α ˆI α
h (sh , ·, πt , µ̄t , W ) − Q̂h (sh , ·, πt , µ̄t , Ŵ ), π̄t,h (· | sh ) − πt+1,h (· | sh )
X λ,α
ˆIt , W ∗ ) − Q̂λ,α
Qh (sh , ah , πtα , µ̄ α ˆI
= h (sh , ah , πt , µ̄t , Ŵ )
a∈A
∗,I α
b,α
π̄t,h (ah | sh ) − πt+1,h (ah | sh )
· πt,h (ah | sh ) · b,α
πt,h (ah | sh )
Qλ,α λ,α b,α
X
≤ (Cπ + Cπ0 ) α ˆI ∗ α ˆI
h (sh , ah , πt , µ̄t , W ) − Q̂h (sh , ah , πt , µ̄t , Ŵ ) · πt,h (ah | sh ),
a∈A
(61)
where the inequality results from Assumption 9. We note that we can let p = π̄^{*,I}_{t,h}(· | s_h) throughout our proof, because this bound is used to upper bound the right-hand side of inequality (22), which is itself proved by taking p = π̄^{*,I}_{t,h}(· | s_h). Now we can define a new Λ^α_{t+1,h} with the terms in inequalities (60) and (61) replacing the original upper bound on term (I). In this case, the term ε_Q in inequality (27) can be replaced by the upper bound on the expectation of the third term on the right-hand side of inequality (60).
H
X
Qλ,α Q̂λ,α b,α
X
Eπ̄t∗,α ,µ̄It (Cπ + Cπ0 ) α ˆI ∗
h (sh , ah , πt , µ̄t , W ) − α ˆI
h (sh , ah , πt , µ̄t , Ŵ ) · πt,h (ah | sh )
h=1 a∈A
H q
X
Cπ0 )Cπ00 CπH H R(fˆh0 , ĝh0 , Ŵh0 ) − R(fh∗ , gh∗ , Wh∗ ),
≤ (Cπ + 1 + Lε H(1 + λ log |A|)
h=1
where the inequality results from Propositions 27 and 23 and Assumption 10. The right-hand side of this inequality can be further bounded using Eqn. (59).
Step 4: Conclude the final result.
Replacing εµ and εQ with the derived new bounds and using the union bound, we have
that
X T X T
1 I ∗,I 1 I ∗,I
D πt , π +d ˆt , µ
µ̄
T T
t=1 t=1
√ √
(BS + r̄BK )1/4 (r̄LK Bk )1/4
log T 1/4 T N LN∞ (1/ N , W̃)
=O +O log
T 1/3 (N L)1/8 δ
T NBr NB̃r̃ NW̃ (BS + r̄BK ) (BS + r̄BK + r̄LK Bk )1/4
1/4
BS + r̄BK 1/4
+ log + .
(N L)1/4 δ N 1/4
Thus, we conclude the proof of Corollary 9.
For ease of notation, we only write the definition of each term for the transition kernel; the terms for the reward functions can be easily derived. The Estimation Error of Mean-embedding can be bounded using inequality (50) in the proof of Theorem 7. In fact, since ψ* is the inverse function of φ*, the expression for the Estimation Error of Mean-embedding here is the same as the term in inequality (50). The generalization error of the risk can be bounded using inequality (54) in the proof of Theorem 7. Thus, we conclude the proof of Corollary 8.
(62)
Here we adopt steps similar to those in the proof of Corollary 9. We note that the only different procedure is the first step. Next, we will derive the performance guarantee of Algorithm (62).
In this setting, we implement {π^{(ξ_i − 1/N, ξ_i]}}^N_{i=1} for L times on the MDP induced by {µ^{(ξ_i − 1/N, ξ_i]}}^N_{i=1} to collect the dataset D_τ = {(s^{[N]}_{τ,h}, a^{[N]}_{τ,h}, r^{[N]}_{τ,h}, s^{[N]}_{τ,h+1})}^H_{h=1} for τ ∈ [L]. We define µ^{+,I} = Γ_3(π^I, µ^I, W*) as the distribution flow obtained by implementing π^I on the MDP induced by µ^I. We highlight that we will not use this quantity in the estimation procedure, but only in the analysis. The joint distribution of (s^i_{τ,h}, a^i_{τ,h}, r^i_{τ,h}, s^i_{τ,h+1})^N_{i=1} is ∏^N_{i=1} ρ^{+,i}_{τ,h}, where ρ^{+,i}_{τ,h} = µ^{+,i}_{τ,h} × π^i_{τ,h} × δ_{r_h} × P^*_h. As in the proof of Corollary 8, we define two bijections ψ*, φ* ∈ C^N_{[0,1]} by ψ*(ξ_i) = i/N for all i ∈ [N] and φ* ◦ ψ*(α) = φ*(ψ*(α)) = α for all α ∈ I.
With a slight abuse of notation, we define the risk of (f, g, W) given ξ̄ as
$$ R_{\bar{\xi}}(f, g, W) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{\rho^{+,i}_h} \Big[ \big( s^i_{h+1} - f\big(\omega^i_h(W)\big) \big)^2 + \big( r^i_{h+1} - g\big(\omega^i_h(W)\big) \big)^2 \Big]. $$
Then we only need to follow Steps 2, 3, and 4 in the proof of Corollary 9 exactly to prove the desired results. Thus, we conclude the proof of Corollary 11.
≤ d_1 d_2 d(µ^I, µ̃^I).
The Banach fixed-point theorem shows that there exists a fixed point of Γ_2 ◦ Γ^λ_1, which we denote by µ^{*,I}. We then define π^{*,I} = Γ^λ_1(µ^{*,I}, W*). Definition 1 shows that (π^{*,I}, µ^{*,I}) is an NE.
Next, we prove the uniqueness of the NE. Assume that there are two NEs (π^{*,I}, µ^{*,I}) and (π̃^{*,I}, µ̃^{*,I}). From Definition 1 of the NE, we have that
π^{*,I} = Γ^λ_1(µ^{*,I}, W*), µ^{*,I} = Γ_2(π^{*,I}, W*), π̃^{*,I} = Γ^λ_1(µ̃^{*,I}, W*), µ̃^{*,I} = Γ_2(π̃^{*,I}, W*).
Thus, we have d(µ^{*,I}, µ̃^{*,I}) = 0, which implies that they differ only on a set of agents of measure zero with respect to the Lebesgue measure on [0, 1]. Thus, we conclude the proof of Proposition 3.
Appendix P. Lipschitzness of NE
Proposition 25 Under Assumptions 1 and 5, for any NE (π^{λ,I}, µ^{λ,I}) of the λ-regularized GMFG with λ > 0, we have that
$$ \big\| \pi^{\lambda,\alpha}_h(\cdot \mid s) - \pi^{\lambda,\beta}_h(\cdot \mid s) \big\|_1 \le \frac{2H L_W \big[ L_r + H(1 + \lambda\log|A|) L_P \big]}{\lambda} |\alpha - \beta| \quad \text{for all } h \in [H],\ s \in S. $$
Proof [Proof of Proposition 25] For any distribution flow µ^I, we denote the optimal value function in the λ-regularized MDP induced by µ^I as V^{*,I} = (V^{*,I}_h)^H_{h=1}. Then we prove the proposition in two steps:
• Given any distribution flow µ^I, the optimal value function V^{*,I} is Lipschitz in the positions of the agents, i.e., |V^{*,α}_h(s) − V^{*,β}_h(s)| ≤ H[L_r + H(1 + λ log|A|)L_P]L_W |α − β| for all s ∈ S and h ∈ [H].
• Any policy π^I that achieves the optimal value function V^{*,I} is Lipschitz in the positions of the agents.
These two steps conclude the proof of Proposition 25 by noting that, for any λ-NE (π^{λ,I}, µ^{λ,I}), the policy π^{λ,I} achieves the maximal accumulated reward in the MDP induced by µ^{λ,I} according to Definition 1.
Step 1: Show that the optimal value function V^{*,I} is Lipschitz in the positions of the agents.
For any distribution flow µ^I ∈ ∆(S)^{I×H}, we define an operator acting on functions u : S → R as
$$ T^{\mu^I,\alpha}_h u(s) = \sup_{p \in \Delta(A)} \sum_{a \in A} p(a)\, r_h(s, a, z^{\alpha}_h) - \lambda R(p) + \sum_{a \in A} \int_{S} p(a)\, P_h(s' \mid s, a, z^{\alpha}_h)\, u(s')\, ds' \quad \text{for } h \in [H-1], \\ T^{\mu^I,\alpha}_H u(s) = \sup_{p \in \Delta(A)} \sum_{a \in A} p(a)\, r_H(s, a, z^{\alpha}_H) - \lambda R(p), $$
where R(·) is the negative entropy function. Since V^{*,I} is the optimal value function of the MDP induced by µ^I, we have that
$$ T^{\mu^I,\alpha}_h V^{*,\alpha}_{h+1}(s) = V^{*,\alpha}_h(s) \quad \text{and} \quad V^{*,\alpha}_{H+1}(s) = 0 \quad \text{for all } s \in S,\ h \in [H],\ \alpha \in I. $$
where the first inequality results from Assumption 1. Noting that ‖z^α_h − z^β_h‖_1 ≤ ‖∫_0^1 (W_h(α, γ) − W_h(β, γ)) µ^γ_h dγ‖_1 ≤ L_W |α − β|, we have
$$ \sup_{s \in S} \big| V^{*,\alpha}_h(s) - V^{*,\beta}_h(s) \big| \le \big[ L_r + H(1 + \lambda\log|A|) L_P \big] L_W |\alpha - \beta| + \sup_{s \in S} \big| V^{*,\alpha}_{h+1}(s) - V^{*,\beta}_{h+1}(s) \big|. $$
Summing this inequality over t = h, . . . , H and noting that V^{*,α}_{H+1}(s) = 0, we have
$$ \sup_{s \in S} \big| V^{*,\alpha}_h(s) - V^{*,\beta}_h(s) \big| \le H \big[ L_r + H(1 + \lambda\log|A|) L_P \big] L_W |\alpha - \beta|. $$
Step 2: Any policy that achieves the optimal value function V^{*,I} is Lipschitz in the positions of the agents.
Assume that policy π^I achieves the optimal value function V^{*,I}. For any α, β ∈ I, s ∈ S, and h ∈ [H], we have that
$$ \pi^{\alpha}_h(\cdot \mid s) = \operatorname*{argmax}_{p \in \Delta(A)} \sum_{a \in A} p(a)\, r_h(s, a, z^{\alpha}_h) - \lambda R(p) + \sum_{a \in A} \int_{S} p(a)\, P_h(s' \mid s, a, z^{\alpha}_h)\, V^{*,\alpha}_{h+1}(s')\, ds', \\ \pi^{\beta}_h(\cdot \mid s) = \operatorname*{argmax}_{p \in \Delta(A)} \sum_{a \in A} p(a)\, r_h(s, a, z^{\beta}_h) - \lambda R(p) + \sum_{a \in A} \int_{S} p(a)\, P_h(s' \mid s, a, z^{\beta}_h)\, V^{*,\beta}_{h+1}(s')\, ds'. $$
Define y^α(s, a) = r_h(s, a, z^α_h) + ∫_S P_h(s' | s, a, z^α_h) V^{*,α}_{h+1}(s') ds' for all α ∈ I. Lemma 34 shows that
$$ \big\| \pi^{\alpha}_h(\cdot \mid s) - \pi^{\beta}_h(\cdot \mid s) \big\|_1 \le \frac{1}{\lambda} \big\| y^{\alpha}(s, \cdot) - y^{\beta}(s, \cdot) \big\|_{\infty}. $$
The term ‖y^α(s, ·) − y^β(s, ·)‖_∞ can be bounded as
$$ \big\| y^{\alpha}(s, \cdot) - y^{\beta}(s, \cdot) \big\|_{\infty} \le \big[ L_r + H(1 + \lambda\log|A|) L_P \big] L_W |\alpha - \beta| + H \big[ L_r + H(1 + \lambda\log|A|) L_P \big] L_W |\alpha - \beta| \le 2H \big[ L_r + H(1 + \lambda\log|A|) L_P \big] L_W |\alpha - \beta|, $$
where the first inequality results from the triangle inequality, and the second inequality results from Step 1. Thus, we conclude that
$$ \big\| \pi^{\alpha}_h(\cdot \mid s) - \pi^{\beta}_h(\cdot \mid s) \big\|_1 \le \frac{2H \big[ L_r + H(1 + \lambda\log|A|) L_P \big] L_W}{\lambda} |\alpha - \beta|, $$
which proves the claim of Proposition 25.
kµαh+1 − µβh+1 k1
Z XZ XZ
= 0 α α α
Ph (s | s, a, zh )µh (s)πh (a | s)ds − Ph (s0 | s, a, zhβ )µβh (s)πhβ (a | s)ds ds0
S a∈A S a∈A S
where the first inequality results from the triangle inequality, and the second inequality
results from Assumptions 5 and 1. We further bound the first term in the right-hand side of
inequality (63) as
Z 1 Z 1
β γ
α
kzh − zh k1 = Wh (α, γ)µh dγ − Wh (β, γ)µγh dγ ≤ LW |α − β|,
0 0 1
where the inequality results from Assumption 5. Substituting this inequality to the right-hand
side of inequality (63), we derive that
h=1 s∈S
which results from that µα1 = µβ1 . Thus, we concludes the proof of Proposition 19.
holds. If such function does not exist, then fW is an arbitrary function in FW̃ . Then we
have that
i
i
2
Eρi fW (ẽτ,h ) − Eρi fW (ẽτ,h ) Dh Dh
τ,h τ,h
h 2 i
≤ Eρi fW (ẽiτ,h ) Dh
τ,h
2 i
∗ i ∗
2
≤ 4(BS + rBK̄ ) Eρi f ω̃τ,h (W ) − fh ω̃τ,h (Wh ) Dh
τ,h
where the second inequality results from Lemma 33, and the last equality results from the fact that E_{ρ^i_{τ,h}}[s̃^i_{τ,h+1} | D_h, ω̃^i_{τ,h}(W^*_h)] = f^*_h(ω̃^i_{τ,h}(W^*_h)). Then the tail probability for the ghost sample D̃_h is bounded as
1 X i
1 X i ε 1 X i
P Eρi fW (ẽτ,h ) Dh − fW (ẽτ,h ) ≥ α+β+ Eρi fW (ẽτ,h ) Dh
NL τ,h NL 2 NL τ,h
τ,i τ,i τ,i
2
1 P i 1 P i
E N L τ,i Eρi fW (ẽτ,h ) Dh − N L τ,i fW (ẽτ,h )
τ,h
≤
2
ε(α+β) ε P
i
2 + 2N L E
τ,i ρi f (ẽ
W τ,h ) D h
τ,h
4(BS +rBK̄ )2 1
fW (ẽiτ,h ) Dh
P
NL NL τ,i Eρiτ,h
≤
2
ε(α+β) ε P i
2 + 2N L τ,i Eρiτ,h fW (ẽτ,h ) Dh
4(BS + rBK̄ )2
≤ ,
(α + β)N Lε2
where the first inequality results from Chebyshev inequality, the second inequality results
from inequality (64), and the last inequality results from x/(a + x)2 ≤ 1/(4a) for any x, a > 0.
When N L ≥ 32(BS + rBK̄ )2 /((α + β)ε2 ), we have that
1 X 1 X
Eρi fW (ẽiτ,h ) Dh − fW (ẽiτ,h )
P
NL τ,h NL
τ,i τ,i
ε 1 X i
1
≥ α+β+ Eρi fW (ẽτ,h ) Dh ≤ . (65)
2 NL τ,h 8
τ,i
1 X 2 i 1 X 2 i 1 X 2 i
fW (ẽτ,h ) − Eρi fW (ẽτ,h ) ≤ ε α + β + Eρi fW (ẽτ,h )
NL NL τ,h NL τ,h
τ,i τ,i τ,i
1 X 2 i 2 i 1 X 2 i
+ 2P ∃fW ∈ FW̃ , fW (ẽτ,h )−Eρi fW (ẽτ,h ) ≤ ε α+β + Eρi fW (ẽτ,h )
NL τ,h NL τ,h
τ,i τ,i
where the inequality results from the union bound. Let δ > 0 and let F_δ be an L_1 δ-cover of F_W̃ on {e^i_{τ,h}}^{L,N}_{τ,i=1}. Then for any f_W ∈ F_W̃, there exists f̄_W ∈ F_δ such that
1 X
fW (eiτ,h ) − f¯W (eiτ,h ) ≤ δ.
NL
τ,i
1 X i 1 X i ¯
Uτ,h fW (eiτ,h ) − Uτ,h fW (eiτ,h ) ≤ δ
NL NL
τ,i τ,i
1 X 2 i 1 X ¯2 i
fW (eτ,h ) − fW (eτ,h ) ≥ −2(BS + rBK̄ )2 δ,
NL NL
τ,i τ,i
where these inequalities result from the triangle inequality. In the following, we take δ = εβ/5. Thus, we can bound the right-hand side of inequality (71) as
1 X i ε(α + β)
P ∃fW ∈ FW̃ , Uτ,h fW (eiτ,h ) ≥
NL 4
τ,i
ε2 (α + β)
ε(1 − ε) 1 X 2 i i L,N
− + fW (eτ,h ) {eτ,h }τ,i=1
8(BS + rBK̄ )2 (1 + ε) 8(BS + rBK̄ )2 (1 + ε) N L
τ,i
εβ i L,N 1 X i εα
≤ N1 , FW̃ , {eτ,h }τ,i=1 max P Uτ,h fW (eiτ,h ) ≥
5 fW ∈F εβ NL 4
5 τ,i
ε2 α
ε(1 − ε) 1 X 2 i i L,N
− + fW (eτ,h ) {eτ,h }τ,i=1
8(BS + rBK̄ )2 (1 + ε) 8(BS + rBK̄ )2 (1 + ε) N L
τ,i
ε2 (1 − ε)αN L
εβ i L,N
≤ 2N1 , FW̃ , {eτ,h }τ,i=1 exp − (72)
5 20(BS + rBK̄ )2 (1 + ε)
1 X
fW (eiτ,h ) − f¯W (eiτ,h )
NL
τ,i
where the inequality results from Lemma 33 and the triangle inequality. Thus, we have that
i L,N δ δ
N1 δ, FW̃ , {eτ,h }τ,i=1 ≤ NH̄ , B(r, H̄) · N∞ , W̃
4(BS + rBK̄ )BK̄ 4(BS + rBK̄ )rLK Bk
(73)
where the last inequality results from inequality (73). For N L ≤ 32(BS + rBK̄ )2 /((α + β)ε2 ),
we have that
ε2 (1 − ε)αN L
32(1 − ε)α 32 1
exp − 4
≥ exp − 2
≥ exp − ≥ .
20(BS + rBK̄ ) (1 + ε) 20(BS + rBK̄ ) (1 + ε)(α + β) 80 14
Pn
i=1 B − Eρ i g(Z) Eρi g(Z)
≤ 2 , (74)
n2 β 2 α + n1 ni=1 Eρi g(Z)
P
where the first inequality results from the Chebyshev inequality, and the last inequality results from the fact that g : X → [0, B]. For two constants a, b > 0 and variables 0 ≤ x_i ≤ b for i ∈ [n], some basic calculus shows that
Pn
i=1 (b − xi )xi nb
f (x1 , · · · , xn ) = 1 Pn ≤ .
a + n i=1 xi 2a
We take β = ε/4. If n ≥ 16B/(ε2 α), such probability is upper bounded by 1/2. Then we
have that
1 Pn 1 Pn
−
n i=1 g(Z i ) n i=1 Eρi g(Z)
P sup 1 Pn 1 Pn
>ε
g∈G α + n i=1 g(Zi ) + n i=1 Eρi g(Z)
n n
1X 3ε 1X
≤ 2P ∃g ∈ G, g(Zi ) − g(Z̃i ) ≥ 2α + g(Zi ) + g(Z̃i ) , (75)
n 8 n
i=1 i=1
where the inequality results from the conditional probability trick. The detailed procedure
can be found in Györfi et al. (2002, Theorem 11.6).
Step 2: Additional randomization by random signs.
Let {U_i}^n_{i=1} be independent random variables, uniformly distributed on {+1, −1} and independent of Z_1^n and Z̃_1^n. Then we have that
n n
1X 3ε 1X
P ∃g ∈ G, g(Zi ) − g(Z̃i ) ≥ 2α + g(Zi ) + g(Z̃i )
n 8 n
i=1 i=1
n n
1 X 3ε 1X
≤ 2E P ∃g ∈ G, Ui g(Zi ) ≥ α+ g(Zi ) Z1n = z1n , (76)
n 8 n
i=1 i=1
where the inequality results from the union bound. Let δ > 0 and let G_δ be an L_1 δ-cover of G on z_1^n. Then for any g ∈ G, there exists ḡ ∈ G_δ such that (1/n) Σ_{i=1}^n |g(z_i) − ḡ(z_i)| ≤ δ. Thus, we have that
have that
n n
1X 3ε 1X
P ∃g ∈ G, Ui g(Zi ) ≥ α+ g(Zi ) Z1n = z1n
n 8 n
i=1 i=1
n n
1 X 3ε 1X n n
≤ P ∃g ∈ Gδ , δ + Ui g(Zi ) ≥ α−δ+ g(Zi ) Z1 = z1
n 8 n
i=1 i=1
X n n
1 3εα 3εδ 3ε 1 X n n
≤ |Gδ | max P Ui g(Zi ) ≥ − −δ+ g(Zi ) Z1 = z1 ,
g∈Gδ n 8 8 8 n
i=1 i=1
where the last inequality follows from the union bound. Take δ = εα/5, then we have
3εα 3εδ εα
− −δ ≥ .
8 8 10
Thus, we can control the tail probability as
n n
1X 3ε 1X n n
P ∃g ∈ G, Ui g(Zi ) ≥ α+ g(Zi ) Z1 = z1
n 8 n
i=1 i=1
X n n
εα n 1 εα 3ε 1 X n n
≤ N1 , G, z1 max P Ui g(Zi ) ≥ + g(Zi ) Z1 = z1
5 g∈G εα n 10 8 n
5 i=1 i=1
4 Pn 2
2
εα 9ε 15 nαP + i=1 g(zi )
≤ N1 , G, z1n exp − n
5 128B i=1 g(zi )
2
εα 3αε n
≤ N1 , G, z1n exp − , (77)
5 40B
where the second inequality results from Hoeffding's inequality, and the last inequality results from the fact that (a + y)²/y ≥ 4a for any a, y > 0. Combining inequalities (75), (76) and (77), we conclude the proof of Proposition 26.
Then the first-order optimal condition of Eqn. (16) is that for any p ∈ ∆(A)
α
π̂t+1,h (· | s)
λ,α α ˆI α α
ηt+1 Q̂h (s, ·, πt , µ̄t , Ŵ ) − ληt+1 log π̂t+1,h (· | s) − log α , p − π̂t+1,h (· | s) ≤ 0.
πt,h (· | s)
Note that
Then we have
h i
ηt+1 Q̂λ,α α ˆI α α α α
h (s,·, πt , µ̄ t , Ŵ ), p− π̂t+1,h (·|s) +λη t+1 R π̂t+1,h (·|s) −R(p) +KL π̂t+1,h (·|s)kπt,h (·|s)
α α
≤ KL pkπt,h (· | s) − (1 + ληt+1 )KL pkπ̂t+1,h (· | s) .
+ ηt+1 Q̂λ,α α ˆI α α
h (sh , ·, πt , µ̄t , Ŵ ), π̂t+1,h (· | sh ) − πt+1,h (· | sh )
where the second inequality results from the triangle inequality, and the last inequality results from the Hölder inequality and the triangle inequality. To bound the second term in the right-hand side of inequality (78), we state the following proposition.
Proposition 27 Under Assumption 1, for any policy π^I and any two distribution flows µ^I and µ̃^I, we have that
$$ \big| Q^{\lambda,\alpha}_h(s, a, \pi^{\alpha}, \mu^I, W^*) - Q^{\lambda,\alpha}_h(s, a, \pi^{\alpha}, \tilde{\mu}^I, W^*) \big| \le \big[ L_r + H(1 + \lambda\log|A|) L_P \big] \sum_{m=h}^{H} \int_0^1 \big\| \mu^{\beta}_m - \tilde{\mu}^{\beta}_m \big\|_1 \, d\beta, \\ \big| V^{\lambda,\alpha}_h(s, \pi^{\alpha}, \mu^I, W^*) - V^{\lambda,\alpha}_h(s, \pi^{\alpha}, \tilde{\mu}^I, W^*) \big| \le \big[ L_r + H(1 + \lambda\log|A|) L_P \big] \sum_{m=h}^{H} \int_0^1 \big\| \mu^{\beta}_m - \tilde{\mu}^{\beta}_m \big\|_1 \, d\beta. $$
where the inequality results from the triangle inequality. Thus, we have
$$ \mathrm{(III)} = R\big(\pi^{\alpha}_{t+1,h}(\cdot \mid s_h)\big) - R\big(\hat{\pi}^{\alpha}_{t+1,h}(\cdot \mid s_h)\big) + \sum_{a \in A} \big( \pi^{\alpha}_{t+1,h}(a \mid s_h) - \hat{\pi}^{\alpha}_{t+1,h}(a \mid s_h) \big) \log\frac{1}{\pi^{\alpha}_{t,h}(a \mid s_h)} \\ \le \sum_{a \in A} \big| \pi^{\alpha}_{t+1,h}(a \mid s_h) - \hat{\pi}^{\alpha}_{t+1,h}(a \mid s_h) \big| \log\frac{|A|}{\beta_t} \le 2\beta_{t+1} \log\frac{|A|}{\beta_t}, $$
where the last inequality results from the definitions of π^α_{t+1,h} and π̂^α_{t+1,h}.
For term (IV), Lemma 36 shows that for β_{t+1} ≤ 1/2, we have that (IV) ≤ 2(1 + λη_{t+1})β_{t+1}. Summing these four terms, we conclude the proof of the proposition.
Step 1: Prove that E_{π*}[V^λ_h(s_h, π*) − V^λ_h(s_h, π)] ≥ γ* E_{π*}[V^λ_{h+1}(s_{h+1}, π*) − V^λ_{h+1}(s_{h+1}, π)] for all h ∈ [H].
If πt = πt∗ for all t ≥ h + 1, then the result trivially holds. In the following, we assume
that πt 6= πt∗ for some t ≥ h + 1. This implies that
λ
Eπ∗ [Vh+1 (sh+1 , π ∗ ) − Vh+1
λ
(sh+1 , π)] > 0.
For ease of notation, we define
$$ y(s, a) = r_h(s, a) + \int_S P_h(s' \mid s, a)\, V_{h+1}(s', \pi)\, ds', \qquad y^*(s, a) = r_h(s, a) + \int_S P_h(s' \mid s, a)\, V_{h+1}(s', \pi^*)\, ds'. $$
$$ \mathbb{E}_{\pi^*}\big[ V^{\lambda}_{h+1}(s_{h+1}, \pi^*) - V^{\lambda}_{h+1}(s_{h+1}, \pi) \big] = \mathbb{E}_{\pi^*}\big[ \langle y^*(s_h, \cdot) - y(s_h, \cdot), \, \pi^*_h(\cdot \mid s_h) \rangle \big], $$
where R(p) = ⟨p, log p⟩. In the following, we will prove that for any s ∈ S
h i
hy(s, ·), πh∗ (· | s) − πh (· | s)i + λ R πh (· | s) − R πh∗ (· | s) + hy ∗ (s, ·) − y(s, ·), πh∗ (· | s)i
These distributions admit the closed-form expressions p*(a) = exp(y*(s, a)/λ)/Z*(s) and p(a) = exp(y(s, a)/λ)/Z(s), where Z*(s) = Σ_{a} exp(y*(s, a)/λ) and Z(s) = Σ_{a} exp(y(s, a)/λ). To prove inequality (80), it suffices to prove that
hy(s, ·), p∗ − pi + λ R(p) − R(p∗ ) ≥ (γ ∗ − 1)hy ∗ (s, ·) − y(s, ·), p∗ i.
(81)
The left-hand side the inequality (81) is
p∗
∗ ∗ ∗ ∗ ∗
hy(s, ·), p − pi + λ R(p) − R(p ) = hλ log p, p − pi + λ R(p) − R(p ) = −λ p , log ,
p
(82)
where the first equality results from the closed-form expression of p, and the second equality results from the definition of R(·). We further expand this term as
exp y ∗ (s, ·)/λ
p∗
∗ Z(s) ∗
−λ p , log = −λ log ∗ − , y (s, ·) − y(s, ·) , (83)
p Z (s) Z ∗ (s)
where the equalities result from the closed-form expressions of p and p*. The right-hand side of inequality (81) is
p∗ ∗ Z ∗ (s)
∗ ∗ ∗ ∗
(γ − 1)hy (s, ·) − y(s, ·), p i = (γ − 1)λ log , p + log , (84)
p Z(s)
where the equalities result from the closed-form expressions of p and p*. Combining Eqn. (82),
(83), and (84), we have
γ∗
∗
Z ∗ (s)
y (s, ·)
⇔ exp , y (s, ·) − y(s, ·) ≤ Z ∗ (s) log
∗
. (85)
λ λ Z(s)
In the following, we prove inequality (85). The right-hand side of (85) can be lower bounded as
$$Z^*(s)\,\log\frac{Z^*(s)}{Z(s)} \ge \frac{\log B}{B-1}\sum_{a\in A}\exp\big(y^*(s,a)/\lambda\big)\cdot\bigg(\frac{\sum_{a\in A}\exp\big(y^*(s,a)/\lambda\big)}{\sum_{a\in A}\exp\big(y(s,a)/\lambda\big)} - 1\bigg) \ge \frac{\log B}{(B-1)\lambda}\cdot\sum_{a\in A}\exp\big(y(s,a)/\lambda\big)\big(y^*(s,a) - y(s,a)\big), \tag{86}$$
where $B = \exp\big(H(1+\lambda\log|A|)/\lambda\big)$, the first inequality results from the fact that $\frac{\log B}{B-1}(x-1) \le \log x$ for $x\in[1,B]$ together with the facts that $y^*(s,a) \ge y(s,a)$ and $|y^*(s,a)| \le H(1+\lambda\log|A|)$ for all $s\in S$ and $a\in A$, and the second inequality results from $\exp(x) - 1 \ge x$ and $y^*(s,a) \ge y(s,a)$. The left-hand side of inequality (85) can be upper bounded as
$$\frac{\gamma^*}{\lambda}\Big\langle \exp\big(y^*(s,\cdot)/\lambda\big),\; y^*(s,\cdot) - y(s,\cdot)\Big\rangle \le \frac{\gamma^*}{\lambda}\cdot B\cdot\sum_{a\in A}\exp\big(y(s,a)/\lambda\big)\big(y^*(s,a) - y(s,a)\big), \tag{87}$$
where the inequality results from the fact that $\exp\big(y^*(s,a)/\lambda\big) \le B\exp\big(y(s,a)/\lambda\big)$ for all $s\in S$ and $a\in A$. Combining inequalities (86) and (87), we prove inequality (85) given
$$0 < \gamma^* \le \frac{\log B}{B(B-1)}.$$
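As a sanity check on this step, the following minimal numerical sketch (ours, not part of the original argument) samples $0 \le y \le y^*$ with $y^*(s,a) \le H(1+\lambda\log|A|)$ and verifies inequality (85) when $\gamma^* = \log B/(B(B-1))$; all variable names and the chosen constants are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
H, lam, A = 2, 1.0, 3
ub = H * (1 + lam * np.log(A))            # upper bound on y*(s, a)
B = np.exp(ub / lam)                      # B = exp(H(1 + lambda log|A|) / lambda)
gamma_star = np.log(B) / (B * (B - 1))    # largest gamma* allowed by the condition above

for _ in range(1000):
    y_star = rng.uniform(0.0, ub, size=A)
    y = rng.uniform(0.0, 1.0, size=A) * y_star        # 0 <= y <= y*, so y* - y <= ub
    Z_star, Z = np.exp(y_star / lam).sum(), np.exp(y / lam).sum()
    lhs = gamma_star / lam * np.dot(np.exp(y_star / lam), y_star - y)   # LHS of (85)
    rhs = Z_star * np.log(Z_star / Z)                                   # RHS of (85)
    assert lhs <= rhs + 1e-9
print("inequality (85) held on all sampled instances")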
We define $D_h = \mathbb{E}_{\pi^*}\big[V^{\lambda}_h(s_h,\pi^*) - V^{\lambda}_h(s_h,\pi)\big]$ for $h\in[H]$. Then Step 1 shows that $D_h \ge \gamma^* D_{h+1}$ for all $h\in[H]$. Thus, we have
Term (V) is the error that measures the difference between the value functions induced by the optimal policies of $\bar{\mu}^I_{t+1}$ and $\bar{\mu}^I_t$, which is defined as
$$(\mathrm{V}) = \sum_{h=1}^{H}\mathbb{E}_{\bar{\pi}^{*,\alpha}_{t+1},\bar{\mu}^I_{t+1}}\Big[V^{\lambda,\alpha}_h\big(s_h,\bar{\pi}^{*,\alpha}_{t+1},\bar{\mu}^I_{t+1},W^*\big) - V^{\lambda,\alpha}_h\big(s_h,\bar{\pi}^{*,\alpha}_t,\bar{\mu}^I_{t+1},W^*\big)\Big].$$
To upper bound term (V), we note that the optimal policies of $\bar{\mu}^I_{t+1}$ and $\bar{\mu}^I_t$ satisfy the following property.
Proposition 28 For a $\lambda$-regularized finite-horizon MDP $(S, A, H, \{r_h\}_{h=1}^H, \{P_h\}_{h=1}^H)$ with $r_h \in [0,1]$ for all $h\in[H]$, we denote the optimal policy as $\pi^* = \{\pi^*_h\}_{h=1}^H$. Then we have that for any $s\in S$ and $h\in[H]$,
$$\min_{a\in A}\,\pi^*_h(a\,|\,s) \ge \frac{1}{1 + |A|\exp\big((H-h+1)(1+\lambda\log|A|)/\lambda\big)}.$$
Proof [Proof of Proposition 28] See Appendix Q.2.5.
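To illustrate why the softmax form of the regularized optimal policy yields this lower bound, here is a minimal numerical sketch (our own illustration; the regularized action values are sampled uniformly in $[0,(H-h+1)(1+\lambda\log|A|)]$, an assumption matching the bound used in the proof, and all names are ours):

import numpy as np

rng = np.random.default_rng(1)
H, lam, A = 3, 0.8, 5
for h in range(1, H + 1):
    M = (H - h + 1) * (1 + lam * np.log(A))     # upper bound on the regularized action values
    lower = 1.0 / (1.0 + A * np.exp(M / lam))   # lower bound claimed in Proposition 28
    for _ in range(1000):
        y = rng.uniform(0.0, M, size=A)         # action values at step h, in [0, M]
        pi = np.exp(y / lam) / np.exp(y / lam).sum()   # softmax form of the optimal policy
        assert pi.min() >= lower - 1e-12
print("softmax lower bound verified on sampled instances")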
Then we have that
$$V^{\lambda,\alpha}_h\big(s_h,\bar{\pi}^{*,\alpha}_{t+1},\bar{\mu}^I_{t+1},W^*\big) - V^{\lambda,\alpha}_h\big(s_h,\bar{\pi}^{*,\alpha}_t,\bar{\mu}^I_{t+1},W^*\big) \le \big(H(1+\lambda\log|A|) + \lambda L_R\big)\sum_{m=h}^{H}\mathbb{E}_{\bar{\pi}^{*,\alpha}_{t+1},\bar{\mu}^I_{t+1}}\big\|\bar{\pi}^{*,\alpha}_{t+1,m}(\cdot\,|\,s_m) - \bar{\pi}^{*,\alpha}_{t,m}(\cdot\,|\,s_m)\big\|_1,$$
where $L_R = \log\big(1 + |A|\exp\big(H(1+\lambda\log|A|)/\lambda\big)\big)$, and the inequality results from the performance difference lemma (Lemma 37), Proposition 28, and Lemma 38. Thus, we have that
$$(\mathrm{V}) \le H\big(H(1+\lambda\log|A|) + \lambda L_R\big)\sum_{m=1}^{H}\mathbb{E}_{\bar{\pi}^{*,\alpha}_{t+1},\bar{\mu}^I_{t+1}}\big\|\bar{\pi}^{*,\alpha}_{t+1,m}(\cdot\,|\,s_m) - \bar{\pi}^{*,\alpha}_{t,m}(\cdot\,|\,s_m)\big\|_1. \tag{89}$$
Term (VI) is the error that measures the difference between the distributions of states induced by the optimal policies of $\bar{\mu}^I_{t+1}$ and $\bar{\mu}^I_t$, which is defined as
$$(\mathrm{VI}) = \sum_{h=1}^{H}\Big(\mathbb{E}_{\bar{\pi}^{*,\alpha}_{t+1},\bar{\mu}^I_{t+1}} - \mathbb{E}_{\bar{\pi}^{*,\alpha}_t,\bar{\mu}^I_{t+1}}\Big)\Big[V^{\lambda,\alpha}_h\big(s_h,\bar{\pi}^{*,\alpha}_t,\bar{\mu}^I_{t+1},W^*\big) - V^{\lambda,\alpha}_h\big(s_h,\pi^{\alpha}_{t+1},\bar{\mu}^I_{t+1},W^*\big)\Big] + \frac{1}{\eta^*_\theta}\sum_{h=1}^{H}\Big(\mathbb{E}_{\bar{\pi}^{*,\alpha}_{t+1},\bar{\mu}^I_{t+1}} - \mathbb{E}_{\bar{\pi}^{*,\alpha}_t,\bar{\mu}^I_{t+1}}\Big)\,\mathrm{KL}\big(\bar{\pi}^{*,\alpha}_{t,h}(\cdot\,|\,s_h)\,\big\|\,\pi^{\alpha}_{t+1,h}(\cdot\,|\,s_h)\big).$$
Proposition 30 Given any policy $\pi^I$ and two distribution flows $\mu^I$ and $\tilde{\mu}^I$, we define $\mu^{+,I} = \Gamma_3(\pi^I,\mu^I,W^*)$ and $\tilde{\mu}^{+,I} = \Gamma_3(\pi^I,\tilde{\mu}^I,W^*)$. Under Assumption 1, we have that
$$\big\|\mu^{+,\alpha}_h - \tilde{\mu}^{+,\alpha}_h\big\|_1 \le L_P\sum_{m=1}^{h-1}\int_0^1 \|\mu^{\beta}_m - \tilde{\mu}^{\beta}_m\|_1\,d\beta.$$
$$(\mathrm{VII}) \le H\Big(H(1+\lambda\log|A|) + \frac{1}{\eta^*_\theta}\log\frac{|A|^2}{\beta_{t+1}}\Big)L_P\cdot\sum_{m=1}^{H}\int_0^1\big\|\bar{\mu}^{\beta}_{t+1,m} - \bar{\mu}^{\beta}_{t,m}\big\|_1\,d\beta. \tag{91}$$
Term (VIII) is the error that measures the difference between the value functions induced by the different distribution flows $\bar{\mu}^I_{t+1}$ and $\bar{\mu}^I_t$, which is defined as
$$(\mathrm{VIII}) = \sum_{h=1}^{H}\mathbb{E}_{\bar{\pi}^{*,\alpha}_t,\bar{\mu}^I_t}\Big[V^{\lambda,\alpha}_h\big(s_h,\bar{\pi}^{*,\alpha}_t,\bar{\mu}^I_{t+1},W^*\big) - V^{\lambda,\alpha}_h\big(s_h,\pi^{\alpha}_{t+1},\bar{\mu}^I_{t+1},W^*\big)\Big] - \sum_{h=1}^{H}\mathbb{E}_{\bar{\pi}^{*,\alpha}_t,\bar{\mu}^I_t}\Big[V^{\lambda,\alpha}_h\big(s_h,\bar{\pi}^{*,\alpha}_t,\bar{\mu}^I_t,W^*\big) - V^{\lambda,\alpha}_h\big(s_h,\pi^{\alpha}_{t+1},\bar{\mu}^I_t,W^*\big)\Big].$$
It can be bounded as
$$(\mathrm{VIII}) \le 2H\big(L_r + H(1+\lambda\log|A|)L_P\big)\sum_{m=1}^{H}\int_0^1\big\|\bar{\mu}^{\beta}_{t+1,m} - \bar{\mu}^{\beta}_{t,m}\big\|_1\,d\beta. \tag{92}$$
Term (IX) is the error that measures the difference between the KL divergences related to the optimal policies of $\bar{\mu}^I_{t+1}$ and $\bar{\mu}^I_t$, which is defined as
$$(\mathrm{IX}) = \frac{1}{\eta^*_\theta}\sum_{h=1}^{H}\mathbb{E}_{\bar{\pi}^{*,\alpha}_{t+1},\bar{\mu}^I_{t+1}}\Big[\mathrm{KL}\big(\bar{\pi}^{*,\alpha}_{t+1,h}(\cdot\,|\,s_h)\,\big\|\,\pi^{\alpha}_{t+1,h}(\cdot\,|\,s_h)\big) - \mathrm{KL}\big(\bar{\pi}^{*,\alpha}_{t,h}(\cdot\,|\,s_h)\,\big\|\,\pi^{\alpha}_{t+1,h}(\cdot\,|\,s_h)\big)\Big].$$
Combining Eqn. (88) and inequalities (89), (90), (91), (92), and (93), we conclude the proof of this proposition.
$$\pi^*_h(\cdot\,|\,s) = \operatorname*{argmax}_{p\in\Delta(A)}\;\langle r_h(s,\cdot),\,p\rangle - \lambda R(p) + \sum_{a\in A}\int_S p(a)\,P_h(s'\,|\,s,a)\,V^{\lambda}_{h+1}(s',\pi^*)\,ds'.$$
The desired result follows from the fact that $V^{\lambda}_h(s',\pi^*) \le (H-h+1)(1+\lambda\log|A|)$. Thus, we conclude the proof of Proposition 28.
where the first inequality results from the definition of $\Gamma_3$ and the triangle inequality, and the second inequality results from the triangle inequality. Note that $\mu^{+,\alpha}_1 = \tilde{\mu}^{+,\alpha}_1 = \mu^{\alpha}_1$. Summing over $h$, we prove the desired result. This completes the proof of Proposition 29.
$$\big\|\mu^{+,\alpha}_{h+1} - \tilde{\mu}^{+,\alpha}_{h+1}\big\|_1 \le \int_S \big|\mu^{+,\alpha}_h(s) - \tilde{\mu}^{+,\alpha}_h(s)\big|\sum_{a\in A}\int_S \pi^{\alpha}_h(a\,|\,s)\,P^*_h\big(s'\,|\,s,a,z^{\alpha}_h(\mu^I_h,W^*_h)\big)\,ds'\,ds + \int_S \tilde{\mu}^{+,\alpha}_h(s)\sum_{a\in A}\pi^{\alpha}_h(a\,|\,s)\int_S \big|P^*_h\big(s'\,|\,s,a,z^{\alpha}_h(\mu^I_h,W^*_h)\big) - P^*_h\big(s'\,|\,s,a,z^{\alpha}_h(\tilde{\mu}^I_h,W^*_h)\big)\big|\,ds'\,ds \le \big\|\mu^{+,\alpha}_h - \tilde{\mu}^{+,\alpha}_h\big\|_1 + L_P\,\big\|z^{\alpha}_h(\mu^I_h,W^*_h) - z^{\alpha}_h(\tilde{\mu}^I_h,W^*_h)\big\|_1,$$
where the first inequality results from the definition of $\Gamma_3$ and the triangle inequality, and the second inequality results from Assumption 1. For the term on the right-hand side, we have that
$$\big\|z^{\alpha}_h(\mu^I_h,W^*_h) - z^{\alpha}_h(\tilde{\mu}^I_h,W^*_h)\big\|_1 = \int_S\Big|\int_0^1 W^*_h(\alpha,\beta)\big(\mu^{\beta}_h(s) - \tilde{\mu}^{\beta}_h(s)\big)\,d\beta\Big|\,ds \le \int_0^1\|\mu^{\beta}_h - \tilde{\mu}^{\beta}_h\|_1\,d\beta,$$
where the inequality results from the triangle inequality and the fact that $|W^*_h| \le 1$. Summing over $h$, we prove the desired result. Thus, we conclude the proof of Proposition 30.
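The contraction of the graphon aggregate in the last display can be checked numerically. Below is a minimal discretized sketch (our own illustration; the graphon slice $W(\alpha,\beta)=\exp(-|\alpha-\beta|)$, the grid sizes, and all variable names are assumptions made only for this example):

import numpy as np

rng = np.random.default_rng(2)
n_beta, n_states, alpha = 200, 6, 0.3
betas = (np.arange(n_beta) + 0.5) / n_beta           # grid on [0, 1] for the agent index beta
W = np.exp(-np.abs(alpha - betas))                   # example graphon slice, values in (0, 1]

def sample_flow():
    m = rng.random((n_beta, n_states))
    return m / m.sum(axis=1, keepdims=True)          # one state distribution per beta

mu, mu_tilde = sample_flow(), sample_flow()
z = (W[:, None] * mu).mean(axis=0)                   # discretized z^alpha(mu, W)
z_tilde = (W[:, None] * mu_tilde).mean(axis=0)
lhs = np.abs(z - z_tilde).sum()                      # ||z(mu) - z(mu_tilde)||_1
rhs = np.abs(mu - mu_tilde).sum(axis=1).mean()       # discretized integral of ||mu^beta - mu_tilde^beta||_1
assert lhs <= rhs + 1e-12
print(f"aggregation bound: {lhs:.4f} <= {rhs:.4f}")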
$$\big|Q^{\lambda,\alpha}_h(s,a,\pi^{\alpha},\mu^I,W^*) - Q^{\lambda,\alpha}_h(s,a,\pi^{\alpha},\tilde{\mu}^I,W^*)\big| \le \big(L_r + H(1+\lambda\log|A|)L_P\big)\int_0^1\|\mu^{\beta}_h - \tilde{\mu}^{\beta}_h\|_1\,d\beta + \sum_{a'\in A}\int_S\big|Q^{\lambda,\alpha}_{h+1}(s',a',\pi^I,\mu^I,W^*) - Q^{\lambda,\alpha}_{h+1}(s',a',\pi^I,\tilde{\mu}^I,W^*)\big|\,\pi^{\alpha}_{h+1}(a'\,|\,s')\,P^*_h\big(s'\,|\,s,a,z^{\alpha}_h(\mu^I,W^*)\big)\,ds',$$
where the inequality results from the triangle inequality and Assumption 1. By induction, it is easy to prove that
$$\big|Q^{\lambda,\alpha}_h(s,a,\pi^{\alpha},\mu^I,W^*) - Q^{\lambda,\alpha}_h(s,a,\pi^{\alpha},\tilde{\mu}^I,W^*)\big| \le \big(L_r + H(1+\lambda\log|A|)L_P\big)\sum_{m=h}^{H}\int_0^1\|\mu^{\beta}_m - \tilde{\mu}^{\beta}_m\|_1\,d\beta.$$
From the relationship between the value function and the action-value function, we have that
$$\big|V^{\lambda,\alpha}_h(s,\pi^{\alpha},\mu^I,W^*) - V^{\lambda,\alpha}_h(s,\pi^{\alpha},\tilde{\mu}^I,W^*)\big| \le \big(L_r + H(1+\lambda\log|A|)L_P\big)\sum_{m=h}^{H}\int_0^1\|\mu^{\beta}_m - \tilde{\mu}^{\beta}_m\|_1\,d\beta.$$
$$+\; \sum_{a\in A}\int_S p(a)\,P_h\big(s'\,|\,s,a,z^{\alpha}_h(\mu^I_h,W^*_h)\big)\,V^{\lambda,\alpha}_{h+1}(s',\pi^{*,I},\mu^I,W^*)\,ds',$$
where the inequality results from the fact that $|\max_x f(x) - \max_x g(x)| \le \max_x|f(x) - g(x)|$ and Assumption 1. By induction, it is easy to prove the claim.
Next, we prove the claim related to the optimal policies. From the definition of the optimal policies, we define
$$y^{\alpha}_h(s,a) = r_h\big(s,a,z^{\alpha}_h(\mu^I_h,W^*_h)\big) + \int_S P_h\big(s'\,|\,s,a,z^{\alpha}_h(\mu^I_h,W^*_h)\big)\,V^{\lambda,\alpha}_{h+1}(s',\pi^{*,I},\mu^I,W^*)\,ds',$$
$$\tilde{y}^{\alpha}_h(s,a) = r_h\big(s,a,z^{\alpha}_h(\tilde{\mu}^I_h,W^*_h)\big) + \int_S P_h\big(s'\,|\,s,a,z^{\alpha}_h(\tilde{\mu}^I_h,W^*_h)\big)\,V^{\lambda,\alpha}_{h+1}(s',\pi^{*,I},\tilde{\mu}^I,W^*)\,ds'.$$
Then
$$\big\|\pi^{*,\alpha}_h(\cdot\,|\,s) - \tilde{\pi}^{*,\alpha}_h(\cdot\,|\,s)\big\|_1 \le \big\|y^{\alpha}_h(s,\cdot) - \tilde{y}^{\alpha}_h(s,\cdot)\big\|_\infty,$$
which proves the claim related to the optimal policies. Thus, we conclude the proof of Proposition 17.
where the generalization error of risk and the empirical risk difference are defined similarly to those in Theorem 5. From the procedure of Algorithm 5, we can bound the empirical risk difference. The generalization error of risk can be bounded by inequality (42) in the proof of Theorem 5. Thus, we conclude the proof of Corollary 21.
Assumption 8 implies that we can bound the total variation between $\mu^{\alpha}_{h+1}$ and $\hat{\mu}^{\alpha}_{h+1}$ as
$$\|\mu^{\alpha}_{h+1} - \hat{\mu}^{\alpha}_{h+1}\|_1 \le L_\varepsilon\,\mathbb{E}_{\rho^{\alpha}_h}\Big[\big|\hat{f}_h\big(\omega^{\alpha}_h(\hat{W}_h)\big) - f^*_h\big(\omega^{\alpha}_h(W^*_h)\big)\big|\Big] + L_\varepsilon\,\mathbb{E}_{\rho^{\alpha}_h}\Big[\big|\hat{f}_h\big(\omega^{\alpha}_h(\hat{W}_h)\big) - \hat{f}_h\big(\tilde{\omega}^{\alpha}_h(\hat{W}_h)\big)\big|\Big], \tag{94}$$
where
$$\tilde{\omega}^{\alpha}_h(W) = \int_0^1\int_S W(\alpha,\beta)\,k\big(\cdot,(s^i_{\tau,h},a^i_{\tau,h},s)\big)\,\hat{\mu}^{\beta}_h(s)\,ds\,d\beta.$$
The first term on the right-hand side of inequality (94) can be upper bounded as
$$L_\varepsilon\,\mathbb{E}_{\rho^{\alpha}_h}\Big[\big|\hat{f}_h\big(\omega^{\alpha}_h(\hat{W}_h)\big) - f^*_h\big(\omega^{\alpha}_h(W^*_h)\big)\big|\Big] \le e^{\pi,\alpha}_h,$$
where the inequality results from Hölder's inequality. The second term on the right-hand side of inequality (94) can be upper bounded as
$$L_\varepsilon\,\mathbb{E}_{\rho^{\alpha}_h}\Big[\big|\hat{f}_h\big(\omega^{\alpha}_h(\hat{W}_h)\big) - \hat{f}_h\big(\tilde{\omega}^{\alpha}_h(\hat{W}_h)\big)\big|\Big] \le rL_K L_\varepsilon\,\big\|\omega^{\alpha}_h(\hat{W}_h) - \tilde{\omega}^{\alpha}_h(\hat{W}_h)\big\|_{\mathcal{H}} \le rL_K L_\varepsilon B_k\int_0^1\|\hat{\mu}^{\beta}_h - \mu^{\beta}_h\|_1\,d\beta,$$
where the first inequality results from Lemma 33, and the second inequality results from the triangle inequality. Thus, we have that
$$\|\mu^{\alpha}_{h+1} - \hat{\mu}^{\alpha}_{h+1}\|_1 \le \|\mu^{\alpha}_h - \hat{\mu}^{\alpha}_h\|_1 + rL_K L_\varepsilon B_k\int_0^1\|\hat{\mu}^{\beta}_h - \mu^{\beta}_h\|_1\,d\beta + e^{\pi,\alpha}_h,$$
which proves the desired result. Thus, we conclude the proof of Proposition 22.
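For intuition on the aggregated kernel mean embedding $\tilde{\omega}^{\alpha}_h(W)$ used above, the following minimal sketch evaluates a discretized embedding at a query point with a Gaussian kernel; the graphon $W(\alpha,\beta)=1-|\alpha-\beta|$, the bandwidth, the grids, and all variable names are assumptions made only for this illustration:

import numpy as np

rng = np.random.default_rng(3)
n_beta, states, alpha, bw = 100, np.linspace(0.0, 1.0, 8), 0.4, 0.5
betas = (np.arange(n_beta) + 0.5) / n_beta

def rbf(x, y):
    return np.exp(-np.sum((x - y) ** 2) / (2 * bw ** 2))   # Gaussian kernel on (s, a, s'')

def W(a, b):
    return 1.0 - abs(a - b)                                 # example graphon, values in [0, 1]

mu_hat = rng.random((n_beta, len(states)))
mu_hat /= mu_hat.sum(axis=1, keepdims=True)                 # estimated state distribution of agent beta
s_i, a_i = 0.7, 0.2                                         # observed (state, action) pair of agent i

def omega_tilde_at(query):
    # Evaluate the embedding at an RKHS query point, discretizing the beta- and state-integrals.
    val = 0.0
    for b, mu_b in zip(betas, mu_hat):
        for s, w_s in zip(states, mu_b):
            val += W(alpha, b) * rbf(query, np.array([s_i, a_i, s])) * w_s
    return val / n_beta

print("embedding evaluated at a query point:", omega_tilde_at(np.array([0.5, 0.5, 0.5])))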
$$Q^{\lambda,\alpha}_h(s,a,\pi^{\alpha},\mu^I,W^*) = r_h\big(s,a,z^{\alpha}_h(\mu^I,W^*_h)\big) + \mathbb{E}\big[V^{\lambda,\alpha}_{h+1}(s',\pi^{\alpha},\mu^I,W^*)\,\big|\,s^{\alpha}_h = s, a^{\alpha}_h = a\big],$$
$$V^{\lambda,\alpha}_h(s,\pi^{\alpha},\mu^I,W^*) = \big\langle Q^{\lambda,\alpha}_h(s,\cdot,\pi^{\alpha},\mu^I,W^*),\,\pi^{\alpha}_h(\cdot\,|\,s)\big\rangle - \lambda R\big(\pi^{\alpha}_h(\cdot\,|\,s)\big), \tag{95}$$
where the inequality results from the triangle inequality and Eqn. (95). Since $\hat{Q}^{\lambda,\alpha}_{H+1} = Q^{\lambda,\alpha}_{H+1} = 0$, we have that
$$\mathbb{E}_{\rho^{b,\alpha}_h}\Big[\big|\hat{Q}^{\lambda,\alpha}_h(s,a,\pi^{\alpha},\mu^I,\hat{W}) - Q^{\lambda,\alpha}_h(s,a,\pi^{\alpha},\mu^I,W^*)\big|\Big] \le \sum_{m=h}^{H}\mathbb{E}_{\rho^{b,\alpha}_m}\Big[\big|\hat{g}_m\big(\omega^{\alpha}_h(\hat{W}_m)\big) - g^*_m\big(\omega^{\alpha}_h(W^*_m)\big)\big|\Big] + L_\varepsilon H(1+\lambda\log|A|)\sum_{m=h}^{H}\sqrt{\mathbb{E}_{\rho^{b,\alpha}_m}\Big[\big|\hat{f}_m\big(\omega^{\alpha}_h(\hat{W}_m)\big) - f^*_m\big(\omega^{\alpha}_h(W^*_m)\big)\big|\Big]},$$
where the inequality results from Assumption 8. Our desired result follows from Hölder's inequality. Thus, we conclude the proof of Proposition 23.
where the first equality results from the definition of $\tilde{\mu}^I_\tau$, and the second equality results from the facts that $\alpha \in ((i-1)/N, i/N]$ and that $\phi^*$ is a permutation of $\{((i-1)/N, i/N]\}_{i=1}^N$. Thus, we have that
$$\tilde{\omega}^i_{\tau,h}(W^{\phi}) = \int_0^1\int_S W\big(\phi(i/N),\phi(\beta)\big)\,k\big(\cdot,(s^i_{\tau,h},a^i_{\tau,h},s)\big)\,\mu^{\phi^*(\beta)}_{\tau,h}(s)\,ds\,d\beta = \int_0^1\int_S W\big(\phi\circ\psi^*\circ\phi^*(i/N),\,\phi\circ\psi^*\circ\phi^*(\beta)\big)\,k\big(\cdot,(s^i_{\tau,h},a^i_{\tau,h},s)\big)\,\mu^{\phi^*(\beta)}_{\tau,h}(s)\,ds\,d\beta = \int_0^1\int_S W\big(\phi\circ\psi^*(\xi_i),\,\phi\circ\psi^*(\gamma)\big)\,k\big(\cdot,(s^i_{\tau,h},a^i_{\tau,h},s)\big)\,\mu^{\gamma}_{\tau,h}(s)\,ds\,d\gamma = \omega^i_{\tau,h}\big(W^{\phi\circ\psi^*}\big),$$
where the second equality results from the fact that $\phi^*$ is the inverse function of $\psi^*$, and the third equality results from taking $\gamma = \phi^*(\beta)$ and $\phi^*(i/N) = \xi_i$. Thus, Algorithm (62) can be equivalently formulated as
$$R_{\bar{\xi}}\big(\hat{f}_h,\hat{g}_h,\hat{W}^{\hat{\phi}_h\circ\psi^*}_h\big) - R_{\bar{\xi}}\big(f^*_h,g^*_h,W^*_h\big) = \text{Generalization Error of Risk} + \text{Empirical Risk Difference},$$
where the generalization error of risk and the empirical risk difference are defined as
Generalization Error of Risk
$$= \frac{1}{NL}\sum_{\tau=1}^{L}\sum_{i=1}^{N}\mathbb{E}_{\rho^i_{\tau,h}}\Big[\big(s^i_{\tau,h+1} - \hat{f}_h\big(\omega^i_{\tau,h}(\hat{W}^{\hat{\phi}_h\circ\psi^*}_h)\big)\big)^2 - \big(s^i_{\tau,h+1} - f^*_h\big(\omega^i_{\tau,h}(W^*_h)\big)\big)^2\Big] - \frac{2}{NL}\sum_{\tau=1}^{L}\sum_{i=1}^{N}\Big[\big(s^i_{\tau,h+1} - \hat{f}_h\big(\omega^i_{\tau,h}(\hat{W}^{\hat{\phi}_h\circ\psi^*}_h)\big)\big)^2 - \big(s^i_{\tau,h+1} - f^*_h\big(\omega^i_{\tau,h}(W^*_h)\big)\big)^2\Big]$$
$$\quad + \frac{1}{NL}\sum_{\tau=1}^{L}\sum_{i=1}^{N}\mathbb{E}_{\rho^i_{\tau,h}}\Big[\big(r^i_{\tau,h} - \hat{g}_h\big(\omega^i_{\tau,h}(\hat{W}^{\hat{\phi}_h\circ\psi^*}_h)\big)\big)^2 - \big(r^i_{\tau,h} - g^*_h\big(\omega^i_{\tau,h}(W^*_h)\big)\big)^2\Big] - \frac{2}{NL}\sum_{\tau=1}^{L}\sum_{i=1}^{N}\Big[\big(r^i_{\tau,h} - \hat{g}_h\big(\omega^i_{\tau,h}(\hat{W}^{\hat{\phi}_h\circ\psi^*}_h)\big)\big)^2 - \big(r^i_{\tau,h} - g^*_h\big(\omega^i_{\tau,h}(W^*_h)\big)\big)^2\Big],$$
Empirical Risk Difference
$$= \frac{2}{NL}\sum_{\tau=1}^{L}\sum_{i=1}^{N}\Big[\big(s^i_{\tau,h+1} - \hat{f}_h\big(\omega^i_{\tau,h}(\hat{W}^{\hat{\phi}_h\circ\psi^*}_h)\big)\big)^2 - \big(s^i_{\tau,h+1} - f^*_h\big(\omega^i_{\tau,h}(W^*_h)\big)\big)^2\Big] + \frac{2}{NL}\sum_{\tau=1}^{L}\sum_{i=1}^{N}\Big[\big(r^i_{\tau,h} - \hat{g}_h\big(\omega^i_{\tau,h}(\hat{W}^{\hat{\phi}_h\circ\psi^*}_h)\big)\big)^2 - \big(r^i_{\tau,h} - g^*_h\big(\omega^i_{\tau,h}(W^*_h)\big)\big)^2\Big].$$
The generalization error of risk can be controlled exactly as in the proof of Theorem 7. Thus, we conclude the proof of Corollary 24.
Since $f'(1) = 0$, we have that $f'(\beta) \le 0$ for $\beta\in[0,1]$. Thus, we conclude the proof of Lemma 31.
Lemma 33 Let $\mathcal{H}$ be an RKHS with kernel $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ that satisfies (i) $k(x,x) \le B_k^2$ for all $x\in\mathcal{X}$, and (ii) $\|k(\cdot,x) - k(\cdot,x')\|_{\mathcal{H}} \le L_k\|x - x'\|_{\mathcal{X}}$ for all $x,x'\in\mathcal{X}$. Then for any $f\in B(r,\mathcal{H})$: (i) $|f(x)| \le rB_k$ for all $x\in\mathcal{X}$; (ii) $|f(x) - f(x')| \le rL_k\|x - x'\|_{\mathcal{X}}$ for all $x,x'\in\mathcal{X}$.
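These two RKHS bounds are easy to check numerically. The sketch below (our own illustration) uses a Gaussian kernel with bandwidth $\sigma$, for which one may take $B_k = 1$ and $L_k = 1/\sigma$, and a random $f$ rescaled to lie in $B(r,\mathcal{H})$; all names and constants are assumptions made only for this example:

import numpy as np

rng = np.random.default_rng(4)
sigma, r = 0.7, 2.0
B_k, L_k = 1.0, 1.0 / sigma            # valid constants for the Gaussian kernel

def k(x, y):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

centers = rng.normal(size=(5, 3))
c = rng.normal(size=5)
K = np.array([[k(u, v) for v in centers] for u in centers])
c *= r / np.sqrt(c @ K @ c)            # rescale so that ||f||_H = r

def f(x):
    return sum(ci * k(x, xi) for ci, xi in zip(c, centers))

for _ in range(2000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    assert abs(f(x)) <= r * B_k + 1e-9                                   # claim (i)
    assert abs(f(x) - f(y)) <= r * L_k * np.linalg.norm(x - y) + 1e-9    # claim (ii)
print("Lemma 33 bounds verified on random points")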
Proof [Proof of Lemma 33] For the first claim, we have that $|f(x)| = |\langle f, k(\cdot,x)\rangle_{\mathcal{H}}| \le \|f\|_{\mathcal{H}}\,\|k(\cdot,x)\|_{\mathcal{H}} \le rB_k$ by the reproducing property and the Cauchy-Schwarz inequality. The second claim follows similarly from $|f(x) - f(x')| = |\langle f, k(\cdot,x) - k(\cdot,x')\rangle_{\mathcal{H}}| \le rL_k\|x - x'\|_{\mathcal{X}}$.
Proof [Proof of Lemma 34] Define $f_i(x) = \langle x, y_i\rangle - f(x)$ for $i = 1,2$. Then Shalev-Shwartz (2012, Lemma 2.8) shows that
$$\frac{k}{2}\|x_1 - x_2\|^2 \le f_1(x_1) - f_1(x_2) \quad\text{and}\quad \frac{k}{2}\|x_1 - x_2\|^2 \le f_2(x_2) - f_2(x_1).$$
Summing these two inequalities, we have
$$k\|x_1 - x_2\|^2 \le \langle x_1 - x_2,\, y_1 - y_2\rangle \le \|x_1 - x_2\|\,\|y_1 - y_2\|_*,$$
where the second inequality results from the definition of the dual norm. Thus, we conclude the proof of Lemma 34.
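The conclusion this argument leads to is presumably a stability bound of the form $\|x_1 - x_2\| \le \|y_1 - y_2\|_*/k$. As a hedged numerical illustration (ours, not from the paper), take $f$ to be the negative entropy, which is 1-strongly convex with respect to $\|\cdot\|_1$ on the simplex; the maximizers are then softmax distributions, and the bound becomes $\|x_1 - x_2\|_1 \le \|y_1 - y_2\|_\infty$:

import numpy as np

rng = np.random.default_rng(8)
A = 6

def softmax(y):
    e = np.exp(y - y.max())
    return e / e.sum()            # maximizer of <p, y> - <p, log p> over the simplex

for _ in range(2000):
    y1, y2 = rng.normal(size=A), rng.normal(size=A)
    x1, x2 = softmax(y1), softmax(y2)
    assert np.abs(x1 - x2).sum() <= np.max(np.abs(y1 - y2)) + 1e-9   # ||x1 - x2||_1 <= ||y1 - y2||_inf
print("stability bound verified for the negative-entropy example")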
Lemma 35 (Lemma 3.3 in Cai et al. (2020)) For any distributions $p, p^* \in \Delta(A)$ and any function $g: A\to[0,H]$, it holds for $q\in\Delta(A)$ with $q(\cdot) \propto p(\cdot)\exp\big(\alpha g(\cdot)\big)$ that
Lemma 36 For any two distributions $p^*, p \in \Delta(A)$, let $\hat{p} = (1-\beta)p + \beta\,\mathrm{Unif}(A)$ with $\beta\in(0,1)$. Then
$$\mathrm{KL}(p^*\|\hat{p}) \le \log\frac{|A|}{\beta}, \qquad \mathrm{KL}(p^*\|\hat{p}) - \mathrm{KL}(p^*\|p) \le \frac{\beta}{1-\beta}.$$
where the second inequality results from log(x) ≤ x − 1 for x > 0. Thus, we conclude the
proof of Lemma 36.
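Both bounds in Lemma 36 are straightforward to verify numerically; the following minimal sketch (ours, with randomly drawn distributions and illustrative constants) checks them:

import numpy as np

rng = np.random.default_rng(5)
A, beta = 6, 0.2

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

for _ in range(2000):
    p_star = rng.dirichlet(np.ones(A))
    p = rng.dirichlet(np.ones(A))
    p_hat = (1 - beta) * p + beta / A                 # mixture with the uniform distribution
    assert kl(p_star, p_hat) <= np.log(A / beta) + 1e-9
    assert kl(p_star, p_hat) - kl(p_star, p) <= beta / (1 - beta) + 1e-9
print("Lemma 36 bounds verified on random distributions")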
where the expectation Eπ̃α ,µI is taken with respect to the randomness in implementing policy
π̃ α for agent α under the MDP induced by µI .
Proof [Proof of Lemma 37] From the definition of $V^{\lambda,\alpha}_1(s,\tilde{\pi}^{\alpha},\mu^I,W)$, we have
$$V^{\lambda,\alpha}_1(s,\tilde{\pi}^{\alpha},\mu^I,W) = \mathbb{E}_{\tilde{\pi}^{\alpha},\mu^I}\bigg[\sum_{h=1}^{H} r_h(s^{\alpha}_h,a^{\alpha}_h,z^{\alpha}_h) - \lambda\log\tilde{\pi}^{\alpha}_h(a^{\alpha}_h\,|\,s^{\alpha}_h) + V^{\lambda,\alpha}_h(s^{\alpha}_h,\pi^{\alpha},\mu^I,W) - V^{\lambda,\alpha}_h(s^{\alpha}_h,\pi^{\alpha},\mu^I,W)\,\bigg|\,s^{\alpha}_1 = s\bigg]$$
$$= \mathbb{E}_{\tilde{\pi}^{\alpha},\mu^I}\bigg[\sum_{h=1}^{H} r_h(s^{\alpha}_h,a^{\alpha}_h,z^{\alpha}_h) - \lambda\log\tilde{\pi}^{\alpha}_h(a^{\alpha}_h\,|\,s^{\alpha}_h) + V^{\lambda,\alpha}_{h+1}(s^{\alpha}_{h+1},\pi^{\alpha},\mu^I,W) - V^{\lambda,\alpha}_h(s^{\alpha}_h,\pi^{\alpha},\mu^I,W)\,\bigg|\,s^{\alpha}_1 = s\bigg] + V^{\lambda,\alpha}_1(s,\pi^{\alpha},\mu^I,W), \tag{97}$$
where the second equality results from rearranging the terms. We then focus on a part of the right-hand side of Eqn. (97):
$$\mathbb{E}_{\tilde{\pi}^{\alpha},\mu^I}\Big[r_h(s^{\alpha}_h,a^{\alpha}_h,z^{\alpha}_h) - \lambda\log\tilde{\pi}^{\alpha}_h(a^{\alpha}_h\,|\,s^{\alpha}_h) + V^{\lambda,\alpha}_{h+1}(s^{\alpha}_{h+1},\pi^{\alpha},\mu^I,W)\,\Big|\,s^{\alpha}_1 = s\Big]$$
$$= \mathbb{E}_{\tilde{\pi}^{\alpha},\mu^I}\Big[r_h(s^{\alpha}_h,a^{\alpha}_h,z^{\alpha}_h) + V^{\lambda,\alpha}_{h+1}(s^{\alpha}_{h+1},\pi^{\alpha},\mu^I,W)\,\Big|\,s^{\alpha}_1 = s\Big] - \lambda\,\mathbb{E}_{\tilde{\pi}^{\alpha},\mu^I}\Big[R\big(\tilde{\pi}^{\alpha}_h(\cdot\,|\,s^{\alpha}_h)\big)\,\Big|\,s^{\alpha}_1 = s\Big]$$
$$= \mathbb{E}_{\tilde{\pi}^{\alpha},\mu^I}\Big[\big\langle Q^{\lambda,\alpha}_h(s^{\alpha}_h,\cdot,\pi^{\alpha},\mu^I,W),\,\tilde{\pi}^{\alpha}_h(\cdot\,|\,s^{\alpha}_h)\big\rangle\,\Big|\,s^{\alpha}_1 = s\Big] - \lambda\,\mathbb{E}_{\tilde{\pi}^{\alpha},\mu^I}\Big[R\big(\tilde{\pi}^{\alpha}_h(\cdot\,|\,s^{\alpha}_h)\big)\,\Big|\,s^{\alpha}_1 = s\Big], \tag{98}$$
where $R(\cdot)$ is the negative entropy function, the inner product $\langle\cdot,\cdot\rangle$ is taken with respect to the action space $A$, and the second equality results from the definitions of $Q^{\lambda,\alpha}_h$ and $V^{\lambda,\alpha}_{h+1}$. Substituting Eqn. (98) into Eqn. (97) and noting the fact that $V^{\lambda,\alpha}_h(s^{\alpha}_h,\pi^{\alpha},\mu^I,W) = \big\langle Q^{\lambda,\alpha}_h(s^{\alpha}_h,\cdot,\pi^{\alpha},\mu^I,W),\,\pi^{\alpha}_h(\cdot\,|\,s^{\alpha}_h)\big\rangle - \lambda R\big(\pi^{\alpha}_h(\cdot\,|\,s^{\alpha}_h)\big)$, we derive the desired result, where the last equality results from the definition of the negative entropy $R(\cdot)$. This concludes the proof of Lemma 37.
Lemma 38 For a finite alphabet $\mathcal{X}$, define $R$ as the negative entropy function. For two distributions $p, q$ supported on $\mathcal{X}$, we have that
$$|R(p) - R(q)| \le \max\big\{\|\log(p)\|_\infty,\ \|\log(q)\|_\infty\big\}\,\|p - q\|_1.$$
where the first inequality results from the definition of the integral and the triangle inequality, and the second inequality results from Hölder's inequality. The desired result follows from the fact that for $t\in[0,1]$,
$$\big\|\log\big(q + t(p-q)\big)\big\|_\infty \le \max\big\{\|\log(p)\|_\infty,\ \|\log(q)\|_\infty\big\}.$$
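A quick numerical check of the Lemma 38 bound (our own sketch, drawing random distributions with strictly positive entries):

import numpy as np

rng = np.random.default_rng(6)
A = 5

def neg_entropy(p):
    return float(np.sum(p * np.log(p)))

for _ in range(2000):
    p = rng.dirichlet(np.ones(A))
    q = rng.dirichlet(np.ones(A))
    bound = max(np.max(np.abs(np.log(p))), np.max(np.abs(np.log(q)))) * np.sum(np.abs(p - q))
    assert abs(neg_entropy(p) - neg_entropy(q)) <= bound + 1e-9
print("Lemma 38 bound verified on random distributions")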
Lemma 40 (Lemma 39 in Wei et al. (2021)) Let $\{g_t\}_{t\ge 0}$ and $\{h_t\}_{t\ge 0}$ be non-negative sequences that satisfy $g_t \le (1-c)g_{t-1} + h_t$ for some $0 < c < 1$ and all $t\ge 1$. Then
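The conclusion of Lemma 40 is not reproduced here; one standard consequence of such a recursion, which is not necessarily the exact statement of the cited lemma, is the unrolled bound $g_t \le (1-c)^t g_0 + \sum_{s=1}^{t}(1-c)^{t-s}h_s$. The following minimal sketch (ours) checks it numerically:

import numpy as np

rng = np.random.default_rng(7)
c, T = 0.3, 50
h = rng.random(T + 1)                   # arbitrary non-negative h_t
g = np.empty(T + 1)
g[0] = rng.random()
for t in range(1, T + 1):
    g[t] = (1 - c) * g[t - 1] + h[t]    # take equality in g_t <= (1 - c) g_{t-1} + h_t

for t in range(T + 1):
    unrolled = (1 - c) ** t * g[0] + sum((1 - c) ** (t - s) * h[s] for s in range(1, t + 1))
    assert g[t] <= unrolled + 1e-9
print("unrolled bound holds for the sampled recursion")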
References
A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan. Optimality and approximation with
policy gradient methods in Markov decision processes. In Conference on Learning Theory,
pages 64–66. PMLR, 2020.
A. Aurell, R. Carmona, G. Dayanıklı, and M. Laurière. Finite state graphon games with
applications to epidemics. Dynamic Games and Applications, 12(1):49–81, 2022a.
A. Aurell, R. Carmona, and M. Lauriere. Stochastic graphon games: II. The linear-quadratic
case. Applied Mathematics & Optimization, 85(3):1–33, 2022b.
J. Bhandari and D. Russo. Global optimality guarantees for policy gradient methods. arXiv
preprint arXiv:1906.01786, 2019.
Q. Cai, Z. Yang, C. Jin, and Z. Wang. Provably efficient exploration in policy optimization.
In International Conference on Machine Learning, pages 1283–1294. PMLR, 2020.
P. E. Caines and M. Huang. Graphon mean field games and the GMFG equations: ε-Nash equilibria. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 286–292. IEEE, 2019.
P. E. Caines and M. Huang. Graphon mean field games and their equations. SIAM Journal
on Control and Optimization, 59(6):4373–4399, 2021.
P. Cardaliaguet and S. Hadikhanloo. Learning in mean field games: the fictitious play.
ESAIM: Control, Optimisation and Calculus of Variations, 23(2):569–591, 2017.
S. Cen, C. Cheng, Y. Chen, Y. Wei, and Y. Chi. Fast global convergence of natural policy
gradient methods with entropy regularization. Operations Research, 70(4):2563–2578,
2022.
K. Cui and H. Koeppl. Approximately solving mean field games via entropy-regularized
deep reinforcement learning. In International Conference on Artificial Intelligence and
Statistics, pages 1909–1917. PMLR, 2021a.
K. Cui and H. Koeppl. Learning graphon mean field games and approximate Nash equilibria.
International Conference on Learning Representations, 2021b.
C. Fabian, K. Cui, and H. Koeppl. Learning sparse graphon mean field games. arXiv
preprint arXiv:2209.03880, 2022.
Z. Fang, Z. Guo, and D. Zhou. Optimal learning rates for distribution regression. Journal
of Complexity, 56:101426, 2020.
C. Gao and Z. Ma. Minimax rates in network analysis: Graphon estimation, community
detection and hypothesis testing. Statistical Science, 36:16–33, 2021.
C. Gao, Y. Lu, and H. H. Zhou. Rate-optimal graphon estimation. The Annals of Statistics,
43:2624–2652, 2015.
S. Gao and P. E. Caines. Graphon control of large-scale networks of linear systems. IEEE
Transactions on Automatic Control, 65(10):4090–4105, 2019.
S. Gao, R. F. Tchuendom, and P. E. Caines. Linear quadratic graphon field games. arXiv
preprint arXiv:2006.03964, 2020.
S. Gao, P. E. Caines, and M. Huang. LQG graphon mean field games: Graphon invariant subspaces. In 2021 60th IEEE Conference on Decision and Control (CDC), pages 5253–5260. IEEE, 2021.
X. Guo, A. Hu, R. Xu, and J. Zhang. Learning mean-field games. Advances in Neural
Information Processing Systems, 32, 2019.
X. Guo, R. Xu, and T. Zariphopoulou. Entropy regularization for mean field games with learning. Mathematics of Operations Research, 47(4):3239–3260, 2022b.
X. Guo, A. Hu, R. Xu, and J. Zhang. A general framework for learning mean-field games.
Mathematics of Operations Research, 48(2):656–686, 2023.
A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.
C. Jin, Q. Liu, and S. Miryoosefi. Bellman Eluder dimension: New rich classes of RL
problems, and sample-efficient algorithms. Advances in Neural Information Processing
Systems, 34:13406–13418, 2021.
N. Kallus, Y. Saito, and M. Uehara. Optimal off-policy evaluation from multiple logging
policies. In International Conference on Machine Learning, pages 5247–5256. PMLR,
2021.
O. Klopp and N. Verzelen. Optimal graphon estimation in cut distance. Probability Theory
and Related Fields, 174:1033–1090, 2019.
O. Klopp, A. B. Tsybakov, and N. Verzelen. Oracle inequalities for network models and
sparse graphon estimation. The Annals of Statistics, 45:316–354, 2017.
G. Lan. Policy mirror descent for reinforcement learning: Linear convergence, new sampling
complexity, and generalized problem classes. Mathematical Programming, pages 1–48,
2022.
J. Lasry and P. Lions. Mean field games. Japanese Journal of Mathematics, 2(1):229–260, 2007.
M. Laurière, S. Perrin, M. Geist, and O. Pietquin. Learning mean field games: A survey.
arXiv preprint arXiv:2205.12944, 2022.
P. Lavigne and L. Pfeiffer. Generalized conditional gradient and learning in potential mean
field games. arXiv preprint arXiv:2209.12772, 2022.
O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans. Bridging the gap between value and
policy based reinforcement learning. Advances in Neural Information Processing Systems,
30, 2017.
F. Parise and A. Ozdaglar. Graphon games. In Proceedings of the 2019 ACM Conference on
Economics and Computation, pages 457–458, 2019.
I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The
Annals of Probability, pages 1679–1706, 1994.
L. Shani, Y. Efroni, and S. Mannor. Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5668–5675, 2020.
R. F. Tchuendom, P. E. Caines, and M. Huang. On the master equation for linear quadratic
graphon mean field games. In 2020 59th IEEE Conference on Decision and Control
(CDC), pages 1026–1031. IEEE, 2020.
M. Uehara, J. Huang, and N. Jiang. Minimax weight and Q-function learning for off-policy evaluation. In International Conference on Machine Learning, pages 9659–9668. PMLR, 2020.
D. Vasal, R. K. Mishra, and S. Vishwanath. Master equation of discrete time graphon mean
field games and teams. arXiv preprint arXiv:2001.05633, 2020.
L. Wang, Z. Yang, and Z. Wang. Breaking the curse of many agents: Provable mean
embedding Q-iteration for mean-field reinforcement learning. In International Conference
on Machine Learning, pages 10092–10103. PMLR, 2020.
Q. Xie, Z. Yang, Z. Wang, and A. Minca. Learning while playing in mean-field games:
Convergence and optimality. In International Conference on Machine Learning, pages
11436–11447. PMLR, 2021.
K. Xu, Y. Zhang, D. Ye, P. Zhao, and M. Tan. Relation-aware transformer for portfolio policy learning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pages 4647–4653, 2021.
Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang. Mean field multi-agent
reinforcement learning. In International Conference on Machine Learning, pages 5571–
5580. PMLR, 2018.
B. Yardim, S. Cayci, M. Geist, and N. He. Policy mirror ascent for efficient and independent
learning in mean field games. arXiv preprint arXiv:2212.14449, 2022.
W. Zhan, B. Huang, A. Huang, N. Jiang, and J. Lee. Offline reinforcement learning with
realizability and single-policy concentrability. In Conference on Learning Theory, pages
2730–2775. PMLR, 2022.