Learning Regularized Graphon Mean-Field Games With Unknown Graphons
Abstract
We design and analyze reinforcement learning algorithms for Graphon Mean-Field Games
(GMFGs). In contrast to previous works that require the precise values of the graphons,
we aim to learn the Nash Equilibrium (NE) of the regularized GMFGs when the graphons
are unknown. Our contributions are threefold. First, we propose the Proximal Policy
Optimization for GMFG (GMFG-PPO) algorithm and show that it converges at a rate of
Õ(T −1/3 ) after T iterations with an estimation oracle, improving on a previous work by
Xie et al. (ICML, 2021). Second, using kernel embedding of distributions, we design efficient
algorithms to estimate the transition kernels, reward functions, and graphons from sampled
agents. Convergence rates are then derived when the positions of the agents are either
known or unknown. Results for the combination of the optimization algorithm GMFG-PPO
and the estimation algorithm are then provided. These algorithms are the first specifically
designed for learning graphons from sampled agents. Finally, the efficacy of the proposed
algorithms is corroborated through simulations. These simulations demonstrate that
learning the unknown graphons effectively reduces the exploitability.
© 2024 Fengzhuo Zhang, Vincent Y. F. Tan, Zhaoran Wang, and Zhuoran Yang.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided
at http://jmlr.org/papers/v25/23-1409.html.
1. Introduction
Multi-Agent Reinforcement Learning (MARL) aims to solve sequential decision-making
problems in multi-agent systems (Zhang et al., 2021; Gronauer and Diepold, 2022; Oroojlooy
and Hajinezhad, 2022). Although MARL has enjoyed tremendous successes across a wide
range of real-world applications (Tang and Ha, 2021; Wang et al., 2022a,b; Xu et al., 2021),
it suffers from the “curse of many agents” where the sizes of the state and action spaces
increase exponentially with the number of agents (Menda et al., 2018; Wang et al., 2020). A
potential remedy is to use the mean-field approximation (Yang et al., 2018; Carmona et al.,
2019). It assumes that the agents are homogeneous, and each agent is influenced only by the
common state distribution of agents. This assumption mitigates the exponential growth of
the state and action spaces (Wang et al., 2020; Guo et al., 2022a). However, the homogeneity
assumption heavily restricts the applicability of the Mean-Field Game (MFG). For example,
analyzing the propagation of Covid-19 in an extremely large population requires modeling
the fact that people in different regions have distinct activity intensities. This cannot be
captured by the mean-field approximation, which assumes a simplistic homogeneous setup.
As a result, the Graphon Mean-Field Game (GMFG) is proposed as a means to relax the
homogeneity assumption. It captures the heterogeneity of agents through graphons and
allows the number of agents to be potentially uncountably infinite (Parise and Ozdaglar,
2019; Carmona et al., 2022). GMFGs have achieved great successes in a wide range of
applications, including in networks (Gao and Caines, 2019) and epidemics (Aurell et al.,
2022a). In Aurell et al. (2022a), the states of people indicate their infection situation, and
the graphons represent the propagation intensity between different types of people.
However, learning algorithms for GMFGs require significantly more effort to design and
analyze. Cui and Koeppl (2021b) proposed to learn the Nash Equilibrium (NE) of GMFGs
by modifying existing MFG learning algorithms. However, these model-free algorithms suffer
from the fact that the distribution flow estimation in GMFG requires a large number of
samples due to the heterogeneity of the agents. In addition, these algorithms potentially
necessitate the use of a very large class of value functions. In particular, this function
class should include the nominal value function in GMFG with any graphons to satisfy the
realizability assumption (Jin et al., 2021; Zhan et al., 2022). Moreover, existing works only
prove the consistency of learning algorithms with rather stringent assumptions (Cui and
Koeppl, 2021b; Fabian et al., 2022). These assumptions include the contractivity of the
estimated operators and the access to the nominal value functions. The convergence rates of
algorithms in GMFGs with milder assumptions are currently lacking in the literature.
In this paper, we focus on learning the NE from data collected from a set of sampled
agents in a centralized manner. Concretely, the central planner has access to a simulator of
the GMFG, which generates the states and rewards of the agents with the agents' policies
as its inputs. However, only the states and rewards of a finite set of agents are revealed
to the learner. Compared with the settings in Cui and Koeppl (2021b) and Fabian et al.
(2022), our setting is more relevant in real-world applications where the number of agents is
always finite. We aim to learn the NE of the GMFG from the states and rewards of these
sampled agents.
Learning the NEs in our problem involves overcoming difficulties from the statistical and
optimization perspectives. From the statistical side, we suffer from the lack of information
about the inputs of the functions to estimate. The transition kernels and the reward functions
of each agent take as inputs the collective behavior of all the other agents and the graphon. In
contrast, we do not know the graphons and only have information provided by a finite subset
of agents. From the optimization perspective, each agent is faced with a non-stationary
environment formed by other agents. Thus, we should design policy optimization procedures
that ensure that the policy of each agent converges to the optimal one in a time-varying
environment, while also ensuring that the non-stationary environment converges to a NE.
Main Contributions Addressing these difficulties, we summarize our main contributions
and results in Table 1 and in more detail as follows:
1. We propose and analyze the Proximal Policy Optimization for GMFG (GMFG-PPO)
algorithm to learn the NE. Given an estimate oracle, our algorithm implements
a Proximal Policy Optimization (PPO)-like algorithm to update the agents’ poli-
cies (Schulman et al., 2017). The environment is simultaneously updated with a
carefully designed learning rate. These strategies overcome the optimization-related
hurdles. GMFG-PPO achieves a convergence rate Õ(T −1/3 ), where T is the number
of iterations. This convergence rate is faster than that of the algorithm in Xie et al.
(2021) and is proved under fewer assumptions. This improvement is attributed to our
carefully designed policy and environment update rates. In addition, the analysis of
our optimization leads to a faster convergence of the mirror descent algorithm on a
fixed MDP. As a byproduct, we generalize the result in Lan (2022) to inhomogeneous
MDPs with a finite horizon.
2. We design and analyze the model learning algorithm of GMFG under three different
agent sampling schemes, as shown in Table 1. The algorithm first incorporates the
graphon with the empirical measure to estimate the mean-embedding of each agent’s
influence. We then take this estimate as the input and perform a regression task;
this resolves the statistical difficulties mentioned above. In the case where sampled
agents have known and fixed positions, Theorem 5 shows that the convergence rate
for the model estimate is O((N L)−1 + N −1/2 ), where N is the number of sampled
agents, and L is the number of samples from each agent. We also consider two
additional scenarios—the case in which the agents are randomly sampled from the unit
interval but their positions are known, and learning from sampled agents with unknown
grid positions. Pertaining to the final scenario, Theorem 7 indicates that the lack
of information of the position of the agents results in the sample complexity being
degraded by an additional factor of O(N log N ).
3. Our model estimation algorithm is the first one proposed for GMFGs. It
recovers the underlying graphons from the states sampled from a finite number of
agents. This model-learning problem is a considerable generalization of the distribution
regression problem (Szabó et al., 2016). Detailed discussions are provided in Section 5.4.
Also, our graphon learning setting can be regarded as a novel addition to the existing
graphon estimation literature, as discussed in Section 2.
Paper Outline The rest of the paper is organized as follows. We discuss related works
in Section 2. In Section 3, we introduce the GMFGs and a key property that they possess,
namely equivariance. Our three sampling schemes are also introduced. In Section 4, we
propose GMFG-PPO and analyze its convergence rate assuming an estimation oracle. In
Section 5, we first introduce our mean-embedding procedure. Then we propose and analyze
the model-learning algorithms for three sampling schemes. In Section 6, we combine the
results from Sections 4 and 5. In Section 7, we provide the numerical simulation results to
corroborate our theoretical findings. In Section 8, we conclude our paper.
2. Related Works
The GMFG has been studied for several years as a model of games played among a large
number of heterogeneous agents. Parise and Ozdaglar (2019) first formulated the static
GMFG and proved that it is the limit of finite-agent games with graph structure. Carmona
et al. (2022) then generalized these results to the Bayesian setting. Caines and Huang (2019,
2021) formulated the continuous-time GMFG and studied the existence and the uniqueness
of their NE. As a special case, the continuous-time linear-quadratic GMFG was studied by
Aurell et al. (2022b); Tchuendom et al. (2020); Gao et al. (2020, 2021), where the existence
and uniqueness of NE were established, and the convergence of finite-agent games to GMFG
was analyzed. Learning of the NE on the discrete-time GMFG was first considered in Vasal
et al. (2020) via the master equation. After that, Cui and Koeppl (2021b) and Fabian et al.
(2022) proposed algorithms to learn the NE of discrete-time GMFG with dense and sparse
graphons, respectively.
As a special case, the MFG models a game between a large number of homogeneous
agents. This classical problem formulation was suggested in Lasry and Lions (2007); Huang
et al. (2006). NE learning algorithms for the continuous-time MFG have been designed via
fictitious play (Cardaliaguet and Hadikhanloo, 2017), mirror descent (Hadikhanloo, 2017),
generalized conditional gradient (Lavigne and Pfeiffer, 2022), and policy gradient (Guo
et al., 2022b). For discrete-time MFG, efficient algorithms have been proposed based on the
notion of contraction (Guo et al., 2019; Xie et al., 2021; Anahtarci et al., 2022; Yardim et al.,
2022; Guo et al., 2023). With the monotonicity condition, Perrin et al. (2020) and Perolat
et al. (2021) propose fictitious play and mirror descent algorithms for learning the NE,
respectively. Readers are encouraged to refer to Laurière et al. (2022) for a comprehensive
survey of MFGs.
The graphon estimation problem has been studied for a decade under different classes of
graphons and different performance metrics. Existing works mainly focus on the estimation
of graphons from the random graphs generated from it. Gao et al. (2015) first proposed a
rate-optimal algorithm to estimate the graphon at sampled points. The graphon estimation
is then studied under L2 norm (Klopp et al., 2017; Wolfe and Olhede, 2013), and cut
distance (Klopp and Verzelen, 2019). The spectral method for graphon estimation was
also studied in Xu (2018). For a comprehensive survey of graphon estimation, readers are
encouraged to refer to Gao and Ma (2021). Different from these works, we aim to estimate
the graphons without the graphs generated from them. Instead, we only have access to the
state and action samples of agents, who interact with each other according to an unknown
graphon structure.
3. Preliminaries
Graphons are measurable and symmetric functions that map [0, 1]2 to [0, 1]. By symmetry,
we mean that W (α, β) = W (β, α) for any α, β ∈ [0, 1]. The set of all graphons is denoted as
W = {W : [0, 1]2 → [0, 1] | W is symmetric}. In the following, graphons are used to represent
interactions between agents. We consider a finite horizon GMFG (I, S, A, µ1 , H, P ∗ , r∗ , W ∗ ).
In this game, each agent is indexed by α ∈ I = [0, 1]. The state space and the action space
of each agent are respectively denoted as S ⊆ Rds and A ⊆ Rda . We assume that S is a
compact subset of Rds and A is a finite subset of Rda . The horizon of the game is denoted
as H ∈ N. The initial state distribution of each agent is µ1 ∈ ∆(S), where ∆(S) is the
set of probability measures on S. We note that the initial distributions of agents can be
different by adding a new time step h = 0, where the reward is zero and the transition
kernel is independent of the action. The state transition kernels P^* = {P_h^*}_{h=1}^H are functions
Ph∗ : S × A × M(S) → ∆(S) for all h ∈ [H], where M(S) is the set of measures on S. In
contrast to the single-agent Markov Decision Process (MDP), the state dynamics of each
agent in a GMFG depends on an aggregate z ∈ M(S), which reflects the influence of other
agents on it. Since we consider the case in which the state space S is compact but potentially
infinite, we assume that Ph∗ (· | sh , ah , zh ) admits a probability density function with respect
to the Lebesgue measure on S for any s_h ∈ S, a_h ∈ A, z_h ∈ M(S), and h ∈ [H]. For time h and
agent α ∈ I, given a graphon W_h^* ∈ W^* = {W_h^*}_{h=1}^H, the aggregate z_h^α for agent α is defined as
$$z_h^{\alpha} = \int_0^1 W_h^{*}(\alpha, \beta)\,\mathcal{L}(s_h^{\beta})\,d\beta, \qquad (1)$$
where L(s) ∈ ∆(S) denotes the law of the random variable s. We note that the agents in this
game are heterogeneous. This means that each agent is affected differently by other agents
or, in other words, the aggregates zhα for different α ∈ I are, in general, different. Given
the state s_h^α ∈ S and the action a_h^α ∈ A of agent α, the agent transitions to a new state
s_{h+1}^α ∼ P_h^*(· | s_h^α, a_h^α, z_h^α). The reward functions r^* = {r_h^*}_{h=1}^H are deterministic functions
r_h^*: S × A × M(S) → R for all h ∈ [H]. For agent α ∈ I at time h, taking the action a_h^α
under the state s_h^α and the aggregate z_h^α earns the agent a reward of r_h^*(s_h^α, a_h^α, z_h^α).
We remark that the above GMFG subsumes the MFG (Xie et al., 2021; Anahtarci et al.,
2022) as a special case. To see this, let W_h(α, β) = 1 for all α, β ∈ I and h ∈ [H]; then the
agents are homogeneous, and the aggregate z_h^α in Eqn. (1) is simply the common state distribution of
these homogeneous agents.
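To make Eqn. (1) concrete, here is a minimal numerical sketch (not from the paper; names are illustrative) that approximates the aggregate z_h^α for a finite set of grid agents with discrete states. Setting the graphon to the constant 1 recovers the common state distribution of the MFG special case described above.

```python
import numpy as np

def aggregate(graphon, laws, alpha):
    """Approximate z_h^alpha in Eqn. (1) for n grid agents.

    graphon: callable W_h(x, y) on [0, 1]^2
    laws:    (n_agents, n_states) array; row j is the law of s_h^{beta_j}
    alpha:   position of the agent whose aggregate we compute
    The integral over beta is replaced by an average over the grid agents."""
    n_agents = laws.shape[0]
    betas = (np.arange(n_agents) + 1) / n_agents
    weights = np.array([graphon(alpha, b) for b in betas])
    return weights @ laws / n_agents

# With the constant graphon W = 1 (the MFG special case), the aggregate is
# just the average state distribution of the population:
laws = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
z = aggregate(lambda a, b: 1.0, laws, alpha=0.5)   # -> [0.533..., 0.466...]
```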
A Markov policy for agent α ∈ I is characterized by π^α = {π_h^α}_{h=1}^H ∈ Π^H, where
π_h^α: S → ∆(A) lies in the class Π = {π : S → ∆(A)}. The collection of policies of all
agents is denoted as π^I = (π^α)_{α∈I} ∈ Π^{I×H} = Π̃. We let µ_h^α = L(s_h^α) ∈ ∆(S) be the
state distribution of agent α at time h. Then µ_h^I = (µ_h^α)_{α∈I} ∈ ∆(S)^I is the set of
state distributions of all agents at time h. Note that the aggregate z_h^α is a function of the
distributions µ_h^I and the graphon W_h^*, so we may write it more explicitly as z_h^α(µ_h^I, W_h^*).
The distribution flow µ^I = (µ_h^I)_{h=1}^H ∈ ∆(S)^{I×H} = ∆̃ consists of the state distributions of all
agents at any given time.
In this work, we focus on the regularized problem (Nachum et al., 2017; Cui and
Koeppl, 2021a). This setting augments standard reward functions with the entropy of
the implemented policy. Some recent works have shown that entropy regularization can
accelerate the convergence of the policy gradient methods (Shani et al., 2020; Cen et al.,
2022). In a λ-regularized GMFG, when agent α implements policy πhα at time h, she will
receive a reward rh∗ (sαh , aαh , zhα ) − λ log πhα (aαh | sαh ) by taking action aαh at state sαh . Given the
underlying distribution flow µI and the policy π I , the value function and the action-value
function for agent α ∈ I in the λ-regularized game with λ > 0 are respectively defined as
$$V_m^{\lambda,\alpha}(s, \pi^{\alpha}, \mu^{I}, W^{*}) = \mathbb{E}^{\pi^{\alpha}}\bigg[\sum_{h=m}^{H} r_h^{*}\big(s_h^{\alpha}, a_h^{\alpha}, z_h^{\alpha}(\mu_h^{I}, W_h^{*})\big) - \lambda\log\pi_h^{\alpha}(a_h^{\alpha}\mid s_h^{\alpha}) \,\bigg|\, s_m^{\alpha} = s\bigg],$$
$$Q_h^{\lambda,\alpha}(s, a, \pi^{\alpha}, \mu^{I}, W^{*}) = r_h^{*}\big(s, a, z_h^{\alpha}(\mu_h^{I}, W_h^{*})\big) + \mathbb{E}\big[V_{h+1}^{\lambda,\alpha}(s_{h+1}^{\alpha}, \pi^{\alpha}, \mu^{I}, W^{*}) \,\big|\, s_h^{\alpha} = s,\ a_h^{\alpha} = a\big],$$
where the expectation E^{π^α}[·] is taken with respect to a_h^α ∼ π_h^α(· | s_h^α) and s_{h+1}^α ∼ P_h^*(· | s_h^α, a_h^α, z_h^α)
for all h ∈ [H]. The cumulative reward of agent α ∈ I under policy π^I is defined as
J^{λ,α}(π^α, µ^I, W^*) = E_{µ_1^α}[V_1^{λ,α}(s, π^α, µ^I, W^*)], where the expectation is taken with respect
to s ∼ µ_1^α.
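As an illustration of these definitions, the following is a minimal sketch (illustrative helper names, not the paper's code) of backward induction for the λ-regularized Q- and value functions of a single agent in a finite-state, finite-action setting, where the aggregate z_h^α has already been folded into the stage rewards and transitions.

```python
import numpy as np

def regularized_values(rewards, transitions, policy, lam):
    """Backward induction for the lambda-regularized V and Q of one agent.

    rewards[h]:     (n_states, n_actions) array of r_h(s, a, z_h)
    transitions[h]: (n_states, n_actions, n_states), row-stochastic in last axis
    policy[h]:      (n_states, n_actions) array of pi_h(a | s)
    Returns V of shape (H, n_states) and Q of shape (H, n_states, n_actions)."""
    H = len(rewards)
    n_states, n_actions = rewards[0].shape
    V = np.zeros((H + 1, n_states))
    Q = np.zeros((H, n_states, n_actions))
    for h in reversed(range(H)):
        # Q_h(s, a) = r_h(s, a) + E[V_{h+1}(s') | s, a]
        Q[h] = rewards[h] + transitions[h] @ V[h + 1]
        # V_h(s) = E_{a ~ pi_h(.|s)}[Q_h(s, a) - lam * log pi_h(a | s)]
        safe_log = np.log(np.where(policy[h] > 0, policy[h], 1.0))
        V[h] = np.sum(policy[h] * (Q[h] - lam * safe_log), axis=1)
    return V[:H], Q
```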
Definition 1 A NE of the λ-regularized GMFG is a pair (π^{*,I}, µ^{*,I}) ∈ Π̃ × ∆̃ that satisfies
the following two conditions:
• (Agent rationality) J^{λ,α}(π^{*,α}, µ^{*,I}, W^*) ≥ J^{λ,α}(π̃^α, µ^{*,I}, W^*) for all α ∈ I and π̃^α ∈ Π^H.
• (Distribution consistency) The distribution flow µ^{*,I} is equal to the distribution flow µ^{π^{*,I},I} induced by the policy π^{*,I}.
We define the operator that returns the optimal policy when the underlying distribution
flow is µ^I and the graphon is W as Γ_1^λ(µ^I, W) ∈ Π̃, i.e., π^I = Γ_1^λ(µ^I, W) if J^{λ,α}(π^α, µ^I, W) =
sup_{π̃^α ∈ Π^H} J^{λ,α}(π̃^α, µ^I, W) for all α ∈ I. In this work, we focus on the case where the GMFG is
regularized, i.e., λ > 0, so Γ_1^λ is uniquely defined. We also define the operator that returns
the distribution flow induced by the policy π^I as Γ_2(π^I, W^*) ∈ ∆̃, i.e., µ̃^I = Γ_2(π^I, W^*) if
$$\tilde\mu_{h+1}^{\alpha}(s') = \int_S \sum_{a\in A} \tilde\mu_h^{\alpha}(s)\,\pi_h^{\alpha}(a\mid s)\,P_h^{*}\big(s'\mid s, a, z_h^{\alpha}(\tilde\mu_h^{I}, W_h^{*})\big)\, ds$$
and µ̃_1^I = µ_1^I. Our goal in this paper is to learn the NE of the λ-regularized GMFG from
the data collected from the sampled agents. Before giving an overview of our agent sampling
schemes, we first introduce the equivariance property of the GMFG.
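Before turning to equivariance, the following is a minimal finite-agent, finite-state numerical sketch of the forward recursion that defines Γ_2 above; the grid discretization of [0, 1] and all names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def distribution_flow(policy, transition, graphon, mu1, horizon):
    """Discretized Gamma_2: propagate the state laws of n grid agents.

    policy[h][i]:        (n_states, n_actions) array, pi_h^{alpha_i}(a | s)
    transition(h, a, z): (n_states, n_states) row-stochastic matrix P_h(. | s, a, z)
    graphon(x, y):       interaction strength W_h(x, y) (taken time-invariant here)
    mu1:                 (n_agents, n_states) initial laws mu_1^{alpha_i}."""
    mu = [np.array(mu1, dtype=float)]
    n_agents = mu[0].shape[0]
    alphas = (np.arange(n_agents) + 1) / n_agents
    for h in range(horizon - 1):
        nxt = np.zeros_like(mu[-1])
        for i, alpha in enumerate(alphas):
            # aggregate z_h^alpha: graphon-weighted average of all agents' laws
            weights = np.array([graphon(alpha, beta) for beta in alphas])
            z = weights @ mu[-1] / n_agents
            for a in range(policy[h][i].shape[1]):
                mass = mu[-1][i] * policy[h][i][:, a]      # mu(s) * pi(a|s)
                nxt[i] += mass @ transition(h, a, z)       # push through P_h
        mu.append(nxt)
    return mu                                              # list of length `horizon`
```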
We denote the set of all the measure-preserving bijections as B[0,1] . Then the equivariance
property of the GMFG can be stated as follows.
Proposition 2 For any policy π^I ∈ Π̃, let its distribution flow on (S, A, µ_1, H, P^*, r^*, W^*)
be µ^I ∈ ∆̃; in other words, µ^I = Γ_2(π^I, W^*). For any φ ∈ B_{[0,1]}, define the φ-transformed
policy π^{φ,I} as π^{φ,α} = π^{φ(α)} for all α ∈ I, and denote its distribution flow on
(S, A, µ_1, H, P^*, r^*, W^{φ,*}) as µ^{φ,I} ∈ ∆̃, i.e., µ^{φ,I} = Γ_2(π^{φ,I}, W^{φ,*}). We have
µ^{φ,α} = µ^{φ(α)} for all α ∈ I.
The proof is provided in Appendix D. Proposition 2 shows that graphons transformed
by a measure-preserving bijection define the same game as the original graphons up to the
bijection. In Section 5.5, we learn the values of the graphon from the sampled agents without
information about their positions. This proposition shows that we can then learn the graphon only up to
a measure-preserving bijection, which motivates the definition of the permutation-invariant
risk in Section 5.5.
We consider three agent sampling schemes:
1. Agents are sampled from known grid positions. In particular, we sample the agents at the grid positions {i/N}_{i=1}^N ⊂ [0, 1], and we know the position of each agent;
2. Agents are sampled from known random positions. In particular, the positions of the N sampled agents are i.i.d. samples of Unif([0, 1]), and the positions of the agents are known;
3. Agents are sampled from grid positions, but the positions of the sampled agents are unknown. That is, we know that the positions of the sampled agents belong to the set {i/N}_{i=1}^N, but the position of each agent within this set is unknown (a toy sketch of these schemes is given below).
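The following toy sketch (illustrative, not from the paper) generates agent positions under the three schemes; in each case the true positions exist, but only the first two schemes reveal them to the learner.

```python
import numpy as np

def sample_positions(n_agents, scheme, seed=None):
    """Return (true_positions, observed_positions) for the three schemes;
    observed_positions is None when the positions are hidden from the learner."""
    rng = np.random.default_rng(seed)
    grid = np.arange(1, n_agents + 1) / n_agents
    if scheme == "known_grid":
        return grid, grid
    if scheme == "known_random":
        pos = np.sort(rng.uniform(0.0, 1.0, size=n_agents))
        return pos, pos
    if scheme == "unknown_grid":
        return rng.permutation(grid), None
    raise ValueError(f"unknown scheme: {scheme}")
```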
In Section 5, we design and analyze model learning algorithms that estimate the
transition kernel P^*, the reward function r^*, and the underlying graphons W^* for each of
these three sampling schemes. In the first two cases, the original graphons can be estimated
directly. However, in the third case, we cannot estimate the original graphons, since the positions
of the agents are unknown. Instead, we can only estimate the original graphons up to a
measure-preserving bijection. In this case, we need to recover the “relative positions” of
sampled agents to select the graphons from set W̃. For N agents, there are N ! potential
cases for their relative positions. The super-exponential size of the search space makes the
problem statistically challenging. For completeness, there is a fourth sampling scheme in which
the positions of the agents are unknown and random. However, the analysis of algorithms in
this case is difficult due to the need to carefully analyze order statistics, which is rather
different from the three cases above. We leave this case for future work.
Algorithm 1 GMFG-PPO
Procedure:
1: Initialize π_{1,h}^α(· | s) = Unif(A) for all s ∈ S, h ∈ [H], and α ∈ I.
procedure is known as fictitious play in Xie et al. (2021) and Perrin et al. (2020). It slows
down the update of the distribution flow. In our analysis, this deceleration is shown to
be important for learning the optimal policy with respect to the current distribution flow.
Finally, we improve the policy with one-step mirror descent (Line 6). We note that Line 6 is
in fact the closed-form solution to the optimization
$$\hat\pi_{t+1,h}^{\alpha}(\cdot\mid s) = \operatorname*{argmax}_{p\in\Delta(A)}\Big[\eta_{t+1}\big\langle \hat Q_h^{\lambda,\alpha}(s,\cdot,\pi_t^{\alpha},\hat{\bar\mu}_t^{I},\hat W),\,p\big\rangle - \lambda\bar H(p) - \mathrm{KL}\big(p\,\big\|\,\pi_{t,h}^{\alpha}(\cdot\mid s)\big)\Big]\quad\forall\,s\in S,$$
where H̄(p) = ⟨p, log p⟩ is the negative entropy function. This procedure is one-step policy
mirror descent in Lan (2022), and it also corresponds to the PPO algorithm in Schulman
et al. (2017). This policy improvement procedure aims to optimize the policy in the MDP
induced by µ̄ˆ_t^I. As µ̄ˆ_t^I converges to µ^{*,I}, this procedure learns the optimal
policy on µ^{*,I}, i.e., the policy π^{*,I} in the NE. Line 7 mixes the current policy iterate π̂^I
with the uniform distribution. Intuitively, this mixing ensures that the policy explores sufficiently
in order to eventually find the NE.
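The following is a minimal sketch of the Line 6 and Line 7 updates for a single state, assuming the objective exactly as reconstructed above (the stepsize placement in the paper's algorithm may differ slightly); all names are illustrative. Solving that objective in closed form gives a softmax of (η Q̂ + log π_t)/(1 + λ), and the exploration step mixes the result with the uniform distribution.

```python
import numpy as np

def mirror_descent_step(q_values, prev_policy, eta, lam):
    """argmax_{p in simplex} eta*<q, p> - lam*<p, log p> - KL(p || prev_policy).
    The maximizer is proportional to exp((eta*q + log prev_policy) / (1 + lam))."""
    logits = (eta * q_values + np.log(prev_policy)) / (1.0 + lam)
    logits -= logits.max()                 # numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum()

def mix_with_uniform(policy, beta):
    """Line-7-style mixing with Unif(A) to keep the policy exploratory."""
    return (1.0 - beta) * policy + beta / len(policy)
```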
GMFG-PPO differs from the NE learning algorithm of regularized MFG in Xie et al.
(2021) in three aspects. First, GMFG-PPO is designed to learn the NE of the regularized
GMFG. It involves graphon learning and requires the policy and action-value function
updates for all the agents. In contrast, the algorithm in Xie et al. (2021) can only learn the
NE of the regularized MFG, which is a special case of GMFG with constant graphons. It
only keeps track of the policy and action-value function of a representative agent. Second,
GMFG-PPO learns a non-stationary NE, whereas the algorithm in Xie et al. (2021) learns
a stationary NE. Finally, the stepsize ηt used in the policy improvement (Line 6) will be
set to be a (non-vanishing) constant in Section 4.2. In contrast, the algorithm in Xie et al.
(2021) sets ηt = o(1). Our choice of ηt is the chief reason for the improved convergence rate.
(π ∗,I , µ∗,I ). We measure the distances between policies and distribution flows with
$$D(\pi^{I}, \tilde\pi^{I}) = \int_0^1 \sum_{h=1}^{H} \mathbb{E}_{\mu_h^{*,\alpha}}\Big[\big\|\pi_h^{\alpha}(\cdot\mid s) - \tilde\pi_h^{\alpha}(\cdot\mid s)\big\|_1\Big]\, d\alpha, \quad\text{and}$$
$$d(\mu^{I}, \tilde\mu^{I}) = \int_0^1 \sum_{h=1}^{H} \big\|\mu_h^{\alpha} - \tilde\mu_h^{\alpha}\big\|_1\, d\alpha.$$
For the purpose of our convergence results, we make a few assumptions about the λ-
regularized GMFG. We first assume the Lipschitz continuity of transition kernels and reward
functions.
Assumption 1 The reward function r_h(s, a, z) is Lipschitz continuous in z for all h ∈ [H],
that is, |r_h(s, a, z) − r_h(s, a, z′)| ≤ L_r‖z − z′‖_1 for all h ∈ [H], s ∈ S, and a ∈ A. The
transition kernel P_h(· | s, a, z) is Lipschitz continuous in z with respect to the total variation distance,
that is, TV(P_h(· | s, a, z), P_h(· | s, a, z′)) ≤ L_P‖z − z′‖_1 for all h ∈ [H], s ∈ S, and a ∈ A.
This assumption is common in the MFG and GMFG literature (Cui and Koeppl, 2021b;
Anahtarci et al., 2022). We then assume that the composition of the operators Γλ1 and Γ2 is
contractive in the following sense.
Assumption 2 There exist constants d_1, d_2 > 0 with d_1 d_2 < 1 such that for any policies
π^I, π̃^I and distribution flows µ^I, µ̃^I, it holds that
$$D\big(\Gamma_1^{\lambda}(\mu^{I}, W^{*}), \Gamma_1^{\lambda}(\tilde\mu^{I}, W^{*})\big) \le d_1\, d(\mu^{I}, \tilde\mu^{I}) \quad\text{and}\quad d\big(\Gamma_2(\pi^{I}, W^{*}), \Gamma_2(\tilde\pi^{I}, W^{*})\big) \le d_2\, D(\pi^{I}, \tilde\pi^{I}).$$
This “contractive” assumption plays an important role in the design of efficient algorithms,
since it guarantees the convergence of both π I and µI using simple fixed point iterations.
This assumption is widely adopted in the MFG literature (Xie et al., 2021; Guo et al., 2019),
and it holds if the regularization parameter λ is sufficiently large compared to L_r and L_P (Anahtarci et al., 2022;
Cui and Koeppl, 2021a). We note that Assumption 2 indeed implies the existence and
uniqueness of the NE.
The proof is provided in Appendix O. For a policy π^I and any distribution flow µ^I, we
define the operator Γ_3 that satisfies µ^{+,I} = Γ_3(π^I, µ^I, W) as
$$\mu_1^{+,I} = \mu_1^{I}, \qquad \mu_{h+1}^{+,\alpha}(s') = \sum_{a\in A}\int_S \mu_h^{+,\alpha}(s)\,\pi_h^{\alpha}(a\mid s)\,P_h\big(s'\mid s, a, z_h^{\alpha}(\mu_h^{I}, W_h)\big)\,ds,$$
for all s′ ∈ S, α ∈ I, and h ≥ 1. The operator Γ_3 outputs the distribution flow µ^{+,I} for
implementing the policy π I on the MDP induced by µI . We now make an assumption about
certain concentrability coefficients.
Assumption 3 For any distribution flow µ^I, we define its induced optimal policy on the
MDP induced by it as π_µ^{*,I} = Γ_1^λ(µ^I, W^*) and the induced distribution flow as µ̃^{*,I} =
Γ_3(π_µ^{*,I}, µ^I, W^*). Then there exists a constant C_µ > 0 such that for any distribution flow
µ^I, it holds that
$$\sup_{\alpha\in I,\, h\in[H]} \mathbb{E}_{s\sim\tilde\mu_h^{*,\alpha}}\bigg[\Big(\frac{\mu_h^{*,\alpha}(s)}{\tilde\mu_h^{*,\alpha}(s)}\Big)^{2}\bigg] \le C_\mu^2.
$$
$$d\big(\hat\Gamma_2(\pi^{I}, \hat W), \Gamma_2(\pi^{I}, W^{*})\big) \le \varepsilon_\mu,$$
We make this assumption only for ease of the presentation of the analysis of our algorithm. In
Section 6, we will replace this assumption with the actual performance guarantee of our model
learning algorithms. When learning the model from L trajectories of N sampled agents, we
could quantify εµ and εQ as: (i) (known fixed positions) εµ = O(N −1/2 + (N L)−1/4 ) and
εQ = O(N −1/2 + (N L)−1/2 ). (ii) (known random positions) εµ = O(N −1/2 + (N L)−1/4 ) and
εQ = O(N −1/4 + (N L)−1/2 ). (iii) (unknown fixed positions) εµ = O(N −1/2 + (N/L)1/4 ) and
εQ = O(N −1/2 + (L)−1/2 ).
Theorem 4 We set αt = O(T −2/3 ), βt = O(T −1 ), and ηt to a constant that only depends
on λ, H and |A|. Under Assumptions 1, 2, 3, and 4, Algorithm 1 returns the policy π̄ I and
the distribution flow µ̄I that satisfies
$$\frac{1}{T}\sum_{t=1}^{T} D\big(\pi_t^{I}, \pi^{*,I}\big) + \frac{1}{T}\sum_{t=1}^{T} d\big(\hat{\bar\mu}_t^{I}, \mu^{*,I}\big) = O\Big(\frac{\sqrt{\log T}}{T^{1/3}}\Big) + O\big(\varepsilon_\mu + \varepsilon_Q + \sqrt{\varepsilon_\mu}\big).$$
The proof is provided in Appendix E. There are two main differences between Theorem 4 and Xie
et al. (2021, Theorem 1). First, we achieve a faster rate Õ(T −1/3 ) than the rate Õ(T −1/5 ) in
Xie et al. (2021). This improvement is attributed to the newly designed stepsize ηt , which
is a constant, but the algorithm in Xie et al. (2021) sets ηt to be O(T −2/5 ). Intuitively, a
stepsize ηt that is independent of T will result in faster convergence of an algorithm compared
to one that decays as T grows. However, the proof involves a novel optimization error
recursion analysis for this new stepsize. This novel optimization error recursion analysis
also generalizes Lan (2022, Theorem 1) to the time-inhomogeneous MDP with a finite
horizon. See Appendix F for the statement. Second, Theorem 4 does not require the first
condition in Assumptions 4 and 5 in Xie et al. (2021). Instead, we adopt the more realistic
Assumption 1 concerning the Lipschitzness of the transition kernels and reward functions to
control the difference between the MDPs induced by different distribution flows.
We have ω_{δ_{s,a,z}} ∈ H. We note that such a mean-embedding procedure does not cause the
problem to degenerate, since the embedding with the identity kernel reduces to δ_{s,a,z}.
For our regression setting, we embed the measure δ_{s_h^α} × δ_{a_h^α} × z_h^α(W_h^*) for all α ∈ I and
h ∈ [H]. Here the aggregate z_h^α is the influence aggregate for agent α at time h defined in
Eqn. (1). The mean-embedding of the measure δ_{s_h^α} × δ_{a_h^α} × z_h^α(W_h^*) is
$$\omega_h^{\alpha}(W_h^{*}) = \int_0^1\!\int_S W_h^{*}(\alpha, \beta)\, k\big(\cdot, (s_h^{\alpha}, a_h^{\alpha}, s)\big)\, \mu_h^{\beta}(s)\, ds\, d\beta.$$
Given such an embedding representation, we reformulate the transition kernels and the reward
functions as functions f_h^*, g_h^*: H → R that are defined as
$$s_{h+1}^{\alpha} = f_h^{*}\big(\omega_h^{\alpha}(W_h^{*})\big) + \varepsilon_h^{\alpha}, \qquad r_h^{\alpha} = g_h^{*}\big(\omega_h^{\alpha}(W_h^{*})\big) \quad\text{for all } h\in[H],\ \alpha\in I, \qquad (2)$$
where {ε_h^α}_{α∈I} are independent zero-mean noises. Since |s| ≤ B_S, we have |ε_h^α| ≤ 2B_S.
• The kernel k is bounded, i.e., there exists B_k > 0 such that k(x, x) ≤ B_k^2 for all x ∈ Ξ.
• The kernel K̄ (resp. K̃) is bounded, i.e., there exists B_{K̄} > 0 (resp. B_{K̃} > 0) such that K̄(ω, ω) ≤ B_{K̄}^2 (resp. K̃(ω, ω) ≤ B_{K̃}^2) for all ω ∈ H.
• The kernel K̄ (resp. K̃) is L_{K̄}-Lipschitz (resp. L_{K̃}-Lipschitz) continuous, i.e., ‖K̄(·, ω) − K̄(·, ω′)‖_{H̄} ≤ L_{K̄}‖ω − ω′‖_H (resp. ‖K̃(·, ω) − K̃(·, ω′)‖_{H̃} ≤ L_{K̃}‖ω − ω′‖_H) for all ω, ω′ ∈ H.
For ease of notation, we define the maximal boundedness parameter B_K = max{B_{K̄}, B_{K̃}}
and the maximal Lipschitz constant L_K = max{L_{K̄}, L_{K̃}}. Finally, we state the realizability
assumption, which guarantees that we choose a proper function class for our regression task.
We define the r-ball in an RKHS H̄ as B(r, H̄) = {f ∈ H̄ : ‖f‖_{H̄} ≤ r}.
Assumption 7 (Realizability) The nominal transition functions fh∗ , reward functions gh∗
and graphons Wh∗ satisfy that fh∗ ∈ B(r, H̄), gh∗ ∈ B(r̃, H̃) and Wh∗ ∈ W̃ for all h ∈ [H], where
r, r̃ > 0 are some constants.
For ease of notation, we define the maximal radius as r̄ = max{r, r̃}. We note that our
algorithms and analysis are also applicable to the general function class F and F̃, replacing
H and H̃. This assumption is realized when the chosen function classes are large enough. For
example, H and H̃ can be chosen as kernels spaces of neural networks (Jacot et al., 2018), and
W̃ can be a set of neural networks for the purpose of graphon estimation (Xia et al., 2023). In
addition to these non-parametric function classes, for the case in which we know the form of
the underlying graphon (e.g., Wh∗ (α, β) = a − b · (α + β) for some a, b ∈ R), we can also choose
the graphon class accordingly (e.g., W̃ = {W | W (α, β) = k1 − k2 · (α + β) for k1 , k2 ∈ R}).
Here we adopt RKHSs for H̄ and H̃ for ease of presentation.
$$\hat z_{\tau,h}^{i}(W_h) = \frac{1}{N-1}\sum_{j\neq i} W_h(\xi_i, \xi_j)\,\delta_{s_{\tau,h}^{j}}.$$
This estimate involves three kinds of error sources. The first is the graphon estimation error,
which originates from the difference between Wh and Wh∗ . The second is the agent sampling
error which originates from the approximation of uncountably many agents in [0, 1] with
N − 1 of them, i.e., an integral over [0, 1] is replaced by a sum over N − 1 terms. The
last is the state sampling error, in which we replace the integral of µ_{τ,h}^{ξ_j} over the state space S
with the singleton δ_{s_{τ,h}^j}. In the analysis, we handle these three errors separately. Given the
aggregate estimate ẑ_{τ,h}^i(W_h), the corresponding mean-embedding of the state, action, and
the aggregate for the i-th agent is
$$\hat\omega_{\tau,h}^{i}(W_h) = \frac{1}{N-1}\sum_{j\neq i} W_h(\xi_i, \xi_j)\, k\big(\cdot, (s_{\tau,h}^{i}, a_{\tau,h}^{i}, s_{\tau,h}^{j})\big). \qquad (4)$$
Taking this estimate as the input of fh∗ and gh∗ , we evaluate the square error of the prediction
and derive the estimates by minimizing the error. Thus, the estimation procedure for
learning the system dynamics, the reward functions, and the graphons can be expressed as
$$(\hat f_h, \hat g_h, \hat W_h) = \operatorname*{argmin}_{f\in B(r,\bar{\mathcal H}),\, g\in B(\tilde r,\tilde{\mathcal H}),\, \tilde W\in\tilde{\mathcal W}}\ \frac{1}{NL}\sum_{\tau=1}^{L}\sum_{i=1}^{N} \big(s_{\tau,h+1}^{i} - f(\hat\omega_{\tau,h}^{i}(\tilde W))\big)^2 + \big(r_{\tau,h}^{i} - g(\hat\omega_{\tau,h}^{i}(\tilde W))\big)^2. \qquad (5)$$
We note that the above optimization problem is, in general, non-convex. However, we focus
on the statistical property of it in this work, and the practical implementation can be done
with the help of non-convex optimization algorithms. In this estimation procedure, we form
our predictions of the states/rewards via the composition of two procedures, i.e.,
$$\{(s_{\tau,h}^{i}, a_{\tau,h}^{i})\}_{i=1}^{N} \;\xrightarrow{\;k,\,W\;}\; \hat\omega_{\tau,h}^{i}(W) \;\xrightarrow{\;f \text{ or } g\;}\; s_{\tau,h+1}^{i} \,/\, r_{\tau,h}^{i}. \qquad (6)$$
In the first stage, the states and actions are embedded with the kernel k and a selected
graphon W. In the second stage, the mean-embedding ω̂_{τ,h}^i(W) is forwarded through the functions
in H̄ or H̃.
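A minimal sketch of the first stage is given below, assuming a Gaussian base kernel k on (s, a, s′) triples represented as flat vectors (the choice of kernel and all names are illustrative): the empirical embedding of Eqn. (4) is stored as a weighted set of kernel anchors, and RKHS inner products between two embeddings reduce to double sums of the base kernel.

```python
import numpy as np

def base_kernel(x, y, bandwidth=1.0):
    """Gaussian base kernel k on flattened (s, a, s') triples."""
    d = x - y
    return np.exp(-np.dot(d, d) / (2.0 * bandwidth ** 2))

def empirical_embedding(i, states, actions, graphon, positions):
    """Eqn. (4): represent hat{omega}^i as (anchors, weights), i.e. the embedding
    is sum_j weights[j] * k(., anchors[j]) over the other agents j != i."""
    n = len(positions)
    anchors, weights = [], []
    for j in range(n):
        if j == i:
            continue
        anchors.append(np.concatenate([states[i], actions[i], states[j]]))
        weights.append(graphon(positions[i], positions[j]) / (n - 1))
    return np.array(anchors), np.array(weights)

def embedding_inner_product(emb1, emb2, bandwidth=1.0):
    """<omega_1, omega_2>_H = sum_{m,n} w1[m] w2[n] k(anchor1[m], anchor2[n])."""
    a1, w1 = emb1
    a2, w2 = emb2
    return sum(u * v * base_kernel(x, y, bandwidth)
               for x, u in zip(a1, w1) for y, v in zip(a2, w2))
```

The squared RKHS distance ‖ω − ω′‖²_H needed by a Lipschitz second-stage kernel, e.g., K̄(ω, ω′) = exp(−‖ω − ω′‖²_H/2σ²), is then ⟨ω, ω⟩ − 2⟨ω, ω′⟩ + ⟨ω′, ω′⟩.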
This two-stage prediction distinguishes our estimation procedure from the algorithms
designed for the distribution regression problem (Szabó et al., 2016; Fang et al., 2020;
Meunier et al., 2022). In the distribution regression problem, the covariate, i.e., the input
of f or g in Eqn. (6), is an unknown distribution. In this problem, we are tasked with
performing a regression from the data of the response variable and the i.i.d. samples of the
unknown distribution. Although the distribution regression problem also requires a two-
stage prediction similar to Eqn. (6), i.e., the covariate must first be estimated from i.i.d.
samples drawn from it, our problem setting involving graphons is a strict generalization
of distribution regression. First, the input of f or g in our problem is a function of a set of
distributions {µατ,h }α∈I . In contrast, the covariate of the distribution regression problem is a
single distribution. Second, in addition to the recovery of µIτ,h from its samples, our problem
requires the estimation of the graphon W to form ω̂τ,h i (W ). However, the distribution
regression problem only requires the recovery of a distribution from its i.i.d. samples, which
corresponds to the case that W is a constant function.
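Returning to the estimation procedure in Eqn. (5): a practical (heuristic) way to approach it is sketched below — enumerate a small candidate family of graphons and, for each candidate, fit f and g by kernel ridge regression in the second-stage RKHS, as a surrogate for the norm-ball constraint in Eqn. (5). The helper `gram_builder`, the ridge parameter, and all names are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def fit_model_for_graphon(embed_gram, next_states, rewards, ridge=1e-3):
    """embed_gram[m, n] = K_bar(omega_m, omega_n): Gram matrix of the second-stage
    kernel at the embedded samples; next_states, rewards: (n,) arrays (scalar
    states for simplicity). Fit f, g by kernel ridge regression and return the
    empirical squared loss of Eqn. (5) for this candidate graphon."""
    n = embed_gram.shape[0]
    reg = embed_gram + ridge * n * np.eye(n)
    alpha_f = np.linalg.solve(reg, next_states)
    alpha_g = np.linalg.solve(reg, rewards)
    resid_f = next_states - embed_gram @ alpha_f
    resid_g = rewards - embed_gram @ alpha_g
    return float(np.mean(resid_f ** 2 + resid_g ** 2))

def estimate_graphon(candidates, gram_builder, next_states, rewards):
    """Outer loop: gram_builder(W) is assumed to build the Gram matrix of the
    embeddings from Eqn. (4) under candidate graphon W; keep the best fit."""
    losses = [fit_model_for_graphon(gram_builder(W), next_states, rewards)
              for W in candidates]
    best = int(np.argmin(losses))
    return candidates[best], losses[best]
```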
Without loss of generality, we assume that ξ_i ≤ ξ_j for any i ≤ j in [N], and denote the set of positions
as ξ̄ = {ξ_i}_{i=1}^N. In this section, our behavior policies π_τ^I for τ ∈ [L] are set to be L_π-Lipschitz
policies, meaning that ‖π_h^α(· | s) − π_h^β(· | s)‖_1 ≤ L_π|α − β| for all h ∈ [H] and α, β ∈ I. We
note that restricting the behavior policies to Lipschitz policies does not limit the applicability
of our estimation procedure, since the NE is shown to be Lipschitz under Assumptions 1
and 5 in Appendix P.
Then we introduce the performance metric for our estimates. Given ξ̄, the joint distribution
of (s_{τ,h}^i, a_{τ,h}^i, µ_{τ,h}^I, r_{τ,h}^i, s_{τ,h+1}^i)_{i=1}^N is ∏_{i=1}^N ρ_{τ,h}^i, where ρ_{τ,h}^i = µ_{τ,h}^i × π_{τ,h}^i × δ_{µ_{τ,h}^I} × δ_{r_h^*} × P_h^*.
Here δ_{r_h^*} is the delta distribution induced by the deterministic function r_h^*. We define the
risk of (f, g, W) given ξ̄ as
$$R_{\bar\xi}(f, g, W) = \frac{1}{NL}\sum_{\tau=1}^{L}\sum_{i=1}^{N} \mathbb{E}_{\rho_{\tau,h}^{i}}\Big[\big(s_{\tau,h+1}^{i} - f(\omega_{\tau,h}^{i}(W))\big)^2 + \big(r_{\tau,h}^{i} - g(\omega_{\tau,h}^{i}(W))\big)^2\Big]. \qquad (7)$$
The risk Rξ̄ (f, g, W ) measures the mean square error of the estimates f, g, W with respect
to the distributions of states, actions and distribution flow on the sampled agents. This
risk definition is motivated by the distribution regression (Szabó et al., 2015), since our
framework is a generalization of the distribution regression, as discussed in Section 5.4. The
convergence rate of our estimates (fˆh , ĝh , Ŵh ) is stated as follows.
Theorem 5 Under Assumptions 1, 5, 6, and 7, if {ξi }N i=1 are known grid positions such
that ξi = i/N for i ∈ [N ], then with probability at least 1 − δ, the risk of the estimates in
Eqn. (5) can be bounded as
where
$$N_{B_r} = N_{\bar{\mathcal H}}\Big(\frac{3}{NL}, B(r, \bar{\mathcal H})\Big), \quad N_{\tilde B_{\tilde r}} = N_{\tilde{\mathcal H}}\Big(\frac{3}{NL}, B(\tilde r, \tilde{\mathcal H})\Big), \quad N_{\tilde{\mathcal W}} = N_\infty\Big(\frac{3}{L_K N L}, \tilde{\mathcal W}\Big).$$
The proof of the theorem and the definitions of these covering numbers are provided in
Appendix G. The estimation error in Theorem 5 consists of two terms: the first term
corresponds to the generalization error, and the second term corresponds to the mean-
embedding estimation error. The generalization error involves the error from optimizing
over the empirical mean of the risk in Eqn. (5) instead of the population risk in Eqn. (7).
The mean-embedding estimation error comes from the fact that we cannot directly observe
the distribution flow µIτ , but we need to estimate it from the states of sampled agents. As
discussed in Section 5.4, the mean-embedding estimation error consists of the agent sampling
error and the state sampling error. If we use finite general function classes, then the covering
number in the bound will be replaced by the cardinalities of these function classes. The
resultant convergence rate would thus be O(1/√N).
The model learning algorithm in Pasztor et al. (2021) for the MFG assumes access to
the nominal value of the distribution flow. Such an assumption can be achieved in MFGs by
sampling a large number of agents at each time, since all the agents are homogeneous and
have the same state distribution flow. This estimation procedure will, however, come at a
cost of O(1/√N), which is not reflected in their results. Moreover, such an assumption is
no longer realistic in the GMFG, since the agents in GMFG are heterogeneous, and the state
distributions of agents are different. Our estimation procedure in (5) does not require the
access to the nominal value of the distribution flow µIh . Instead, we estimate this quantity
from states of sampled agents and prove that such an estimate works for the heterogeneous
agents.
Compared to the risk with grid positions defined in Eqn. (7), the risk defined in Eqn. (8)
can be derived by taking the expectation with respect to the distribution of the positions, i.e.,
R(f, g, W) = E_{ξ̄}[R_{ξ̄}(f, g, W)]. The convergence rate of our estimates can be stated as
follows.
Theorem 6 Under Assumptions 5, 6, 7, and 1, if {ξi }N i=1 are known i.i.d. samples of
Unif([0, 1]), then with probability at least 1 − δ, the risk of the estimates in Eqn. (5) can be
bounded as
Figure 1: (a) The SBM graphon and three sampled agents; (b) the transformed SBM graphon and the correspondingly transformed agents. The left figure shows the SBM graphon and three sampled agents. Swapping the second and the third communities, we obtain the graphon on the right. The sampled agents are correspondingly swapped. Although the graphons and agent positions in the left and the right figures are not the same, the agents in both figures retain the same “relative positions” with respect to the underlying graphons.
We now consider the setting where the positions of the sampled agents {ξ_i}_{i=1}^N are on the
grid in [0, 1] but are unknown. This means that the set of sampled positions {ξ_i}_{i=1}^N is
equal to {i/N}_{i=1}^N, but we do not know which i/N each ξ_i corresponds to. In addition to
the data collection procedures in Section 5.1, we assume that we implement the same policy
over L independent rounds. This sampling method implies that the distributions defined in
Section 5.5 satisfy ρ_{τ,h}^α = ρ_{τ′,h}^α for all τ, τ′ ∈ [L], α ∈ I, and h ∈ [H].
Intuitively, since the position information is missing from our observations, we cannot
estimate the precise values of graphons. For example, the collected data from the agents
in Figure 1(a) is the same as that in Figure 1(b), so we cannot distinguish between these two
different graphons. However, we can see that these two graphons are the same up to a
measure-preserving bijection. Proposition 2 shows that the model with transformed graphons
is the same as the original model up to a measure-preserving bijection. Thus, in this section,
our goal is to estimate the model of GMFG up to a measure-preserving bijection.
In this setting, we cannot estimate the mean-embedding ω_{τ,h}^i(W_h^*) as in Eqn. (4), since
we do not know the agents' positions {ξ_i}_{i=1}^N. Instead, we need to estimate the “relative
positions” of these agents. Here the relative positions refer to the relationship between the
agents' positions and the underlying graphon. For example, in Figure 1, the agents retain
the same relative positions in the two different graphons. With N sampled agents, the relative
positions can be represented by a permutation of these agents. We denote the set of all
permutations of N objects as C^N, where |C^N| = N!. For a permutation σ ∈ C^N and a
graphon W, we estimate the relative position of the i-th agent as σ(i)/N for all i ∈ [N]. Then
Similar to Eqn. (14), Eqn. (9) is also an average over L episodes, since we implement the
same policy for L independent times. In this estimate, only the relative positions between the
agents and the underlying graphon matter, so we can equivalently express such an estimate
with a transformed graphon. We define ω̄ˆ_{τ,h}^i(W) as ω̄ˆ_{τ,h}^{i,σ}(W) with σ being the identity map. The
set of measure-preserving bijections that are permutations of the intervals [(i − 1)/N, i/N]
for i ∈ [N] is denoted as C_{[0,1]}^N. Then for some φ ∈ C_{[0,1]}^N, the estimate in Eqn. (9) can be
reformulated as
$$\hat{\bar\omega}_{\tau,h}^{i}(W^{\phi}) = \frac{1}{(N-1)L}\sum_{j\neq i}\sum_{\tau'=1}^{L} W\big(\phi(i/N), \phi(j/N)\big)\, k\big(\cdot, (s_{\tau,h}^{i}, a_{\tau,h}^{i}, s_{\tau',h}^{j})\big).$$
Given this mean-embedding estimate, our model estimation procedure can be stated as
$$(\hat f_h, \hat g_h, \hat W_h, \hat\phi_h) = \operatorname*{argmin}_{\substack{f\in B(r,\bar{\mathcal H}),\, g\in B(\tilde r,\tilde{\mathcal H}),\\ \tilde W\in\tilde{\mathcal W},\, \phi\in C_{[0,1]}^{N}}}\ \frac{1}{NL}\sum_{\tau=1}^{L}\sum_{i=1}^{N} \big(s_{\tau,h+1}^{i} - f(\hat{\bar\omega}_{\tau,h}^{i}(\tilde W^{\phi}))\big)^2 + \big(r_{\tau,h}^{i} - g(\hat{\bar\omega}_{\tau,h}^{i}(\tilde W^{\phi}))\big)^2. \qquad (10)$$
We note that the computational burden of this procedure can be high due to the large number
of permutations in C_{[0,1]}^N (for large N). However, this high computational burden is common
in graphon learning algorithms (Gao et al., 2015; Klopp et al., 2017). Our work mainly
focuses on the statistical analysis of the graphon problem rather than on its well-known
computational limitations; we leave these computational concerns to future work (a brute-force sketch for small N is given below).
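For small N, the permutation search in Eqn. (10) can be carried out by brute force, as in the following sketch (illustrative only; `loss_fn` is an assumed helper that evaluates the empirical loss of Eqn. (10) for a given assignment of agents to grid positions):

```python
import itertools
import numpy as np

def best_permutation(loss_fn, n_agents):
    """Brute-force minimization over the N! relative positions in Eqn. (10).
    loss_fn(perm) evaluates the empirical loss when agent i is placed at
    position (perm[i] + 1) / n_agents. Feasible only for small n_agents."""
    best_perm, best_loss = None, np.inf
    for perm in itertools.permutations(range(n_agents)):
        loss = loss_fn(perm)
        if loss < best_loss:
            best_perm, best_loss = perm, loss
    return best_perm, best_loss
```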
We then specify the performance metric under this setting. As mentioned earlier, we cannot
estimate the precise values of graphons. Thus, we measure the accuracy of our estimates by
transforming the graphon estimate with the optimal measure-preserving bijections. Such a
risk is known as the permutation-invariant risk, which is defined as
$$\bar R_{\bar\xi}(f, g, W) = \inf_{\phi\in B_{[0,1]}} \frac{1}{NL}\sum_{\tau=1}^{L}\sum_{i=1}^{N} \mathbb{E}_{\rho_{\tau,h}^{i}}\Big[\big(s_{\tau,h+1}^{i} - f(\omega_{\tau,h}^{i}(W^{\phi}))\big)^2 + \big(r_{\tau,h}^{i} - g(\omega_{\tau,h}^{i}(W^{\phi}))\big)^2\Big]$$
$$= \inf_{\phi\in B_{[0,1]}} \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}_{\rho_{h}^{i}}\Big[\big(s_{h+1}^{i} - f(\omega_{h}^{i}(W^{\phi}))\big)^2 + \big(r_{h}^{i} - g(\omega_{h}^{i}(W^{\phi}))\big)^2\Big], \qquad (11)$$
where
$$\tilde N_{B_r} = N_{\bar{\mathcal H}}\Big(\frac{3}{L}, B(r, \bar{\mathcal H})\Big), \quad \tilde N_{\tilde B_{\tilde r}} = N_{\tilde{\mathcal H}}\Big(\frac{3}{L}, B(\tilde r, \tilde{\mathcal H})\Big), \quad \tilde N_{\tilde{\mathcal W}} = N_\infty\Big(\frac{3}{L_K L}, \tilde{\mathcal W}\Big).$$
The proof is provided in Appendix I. The estimation error in Theorem 7 consists of three
terms: the first two terms correspond to the mean-embedding estimation error, and the
last term corresponds to the generalization error. As mentioned in Section 5.4, the mean-
embedding estimation error consists of agent sampling error and the state sampling error.
The first term in the bound represents the agent sampling error. Since the distance between
adjacent agents is 1/N, this approximation error is of order O(1/N). The second term
represents the state sampling error. The term √N in the numerator comes from the
estimation of the relative positions from C^N, whose size is N!, and the union bound over
this set. The third term, which is the generalization error, also suffers from the union
bound over the N! relative positions. Compared with Corollary 12 in Section 5.4, the result in
Theorem 7 suffers from a multiplicative factor of log N!. When the function classes are finite
and L = Θ(N^β) with β > 1, the convergence rate in Theorem 7 is O(max{N^{−(β−1)/2}, N^{−1}}).
In contrast, the convergence rate in Corollary 12 is O(N^{−(β+1)/2}).
Theorem 7 states the estimation error in the permutation-invariant risk. In fact, we can
also derive the convergence rate of our estimate of the relative positions φ̂_h. This means
that for some unknown correction ψ^* ∈ C_{[0,1]}^N, the risk defined in Eqn. (7) of our estimate
(f̂_h, ĝ_h, Ŵ_h^{φ̂_h∘ψ^*}) vanishes.
The proof is provided in Appendix M. Combined with Proposition 2, Corollary 8 shows that
the model estimate (f̂_h, ĝ_h, Ŵ_h^{φ̂_h}) converges to the nominal model in the sense that the two are
shown to be equivalent up to an unknown measure-preserving bijection ψ^* ∈ C_{[0,1]}^N.
Assumption 8 There exists L_ε > 0 such that the noises ε_h for h ∈ [H] satisfy, for any
a ∈ R, TV(ε_h + a, ε_h) ≤ L_ε a for all h ∈ [H].
This assumption enables us to control the total variation error of our transition kernels
Ph∗ by the estimation error of fh∗ . We note that Assumption 8 is satisfied for a wide range
of distributions, including the uniform distribution, the centralized Beta distributions for
α > 1, β > 1, and the truncated Gaussian distribution. We then assume that the behavior
policy πtb,I satisfies the following assumptions.
Assumption 9 There exist two constants C_π, C_π′ > 0 such that for all t ∈ [T],
$$\sup_{s\in S,\, a\in A,\, \alpha\in I,\, h\in[H]} \frac{\bar\pi_{t,h}^{*,\alpha}(a\mid s)}{\pi_{t,h}^{b,\alpha}(a\mid s)} \le C_\pi \quad\text{and}\quad \sup_{s\in S,\, a\in A,\, \alpha\in I,\, h\in[H]} \frac{\pi_{t+1,h}^{\alpha}(a\mid s)}{\pi_{t,h}^{b,\alpha}(a\mid s)} \le C_\pi'.$$
This assumption states that the behavior policy should explore the actions of the NE and
of the policy π_{t+1}^I. It is quite natural since we want to estimate the action-value function of
π_{t+1}^I from the data collected by π_t^{b,I}. Similar assumptions have been commonly made in the
off-policy evaluation literature (Kallus et al., 2021; Uehara et al., 2020).
Assumption 10 For any policy π^I ∈ Π̃, we define µ^{+,I} = Γ_3(π^I, µ̄_t^I, W^*). We also define
µ̄_t^{b,I} = Γ_3(π_t^{b,I}, µ̄_t^I, W^*). There exists a constant C_π″ > 0 such that for any t ∈ [T] and any
policy π^I specified above, we have
$$\sup_{s\in S,\, h\in[H],\, \alpha\in I} \frac{\mu_h^{+,\alpha}(s)}{\bar\mu_{t,h}^{b,\alpha}(s)} \le C_\pi''.$$
This assumption states that the behavior policy should be sufficiently exploratory such that
the distributions induced by other policies can be covered by that of the behavior policy.
Similar assumptions have been made in the policy optimization literature (Shani et al.,
2020; Agarwal et al., 2020). We note that if we take the behavior policy π_t^{b,I} = Unif(A)^{I×H}
to be the uniform distribution on the action space, then the constants in Assumptions 9 and
10 can be set as C_π = C_π′ = |A| and C_π″ = |A|^H.
Corollary 9 If we sample agents with known grid positions and adopt the estimation procedures in Eqns. (5) and
(12) to estimate the MDP, then under Assumptions 1 to 10, GMFG-PPO
returns estimates that, with probability at least 1 − δ, satisfy
$$\frac{1}{T}\sum_{t=1}^{T} D\big(\pi_t^{I}, \pi^{*,I}\big) + \frac{1}{T}\sum_{t=1}^{T} d\big(\hat{\bar\mu}_t^{I}, \mu^{*,I}\big) = O\bigg(\frac{B_S + \bar r B_K}{(NL)^{1/4}}\log^{\frac14}\!\frac{T N_{B_r} N_{\tilde B_{\tilde r}} N_{\tilde{\mathcal W}}}{\delta} + \frac{(B_S + \bar r B_K)^{\frac14}(\bar r L_K B_k)^{\frac14}}{(NL)^{1/8}}\log^{\frac14}\!\frac{T N L\, N_\infty(1/\sqrt{N}, \tilde{\mathcal W})}{\delta} + \frac{(B_S + \bar r B_K)^{\frac14}(B_S + \bar r B_K + \bar r L_K B_k)^{\frac14}}{N^{1/4}}\bigg) + O\Big(\frac{\sqrt{\log T}}{T^{1/3}}\Big).$$
The proof is provided in Appendix K. The error of learning NE consists of two types of
terms. The first originates from the estimation error of the distribution flow and the action-
value function. It involves the number of sampled agents N and the number of episodes
L. The second represents the optimization error and involves the number of iterations T .
Consider the case where the function classes are finite. To learn a NE with error ε measured
according to D(·, ·) and d(·, ·), we can run Algorithms 1 and 2 with T = Õ(ε−3 ) iterations
and O((N L)−1/8 + N −1/4 ) = ε. The second condition can be achieved by several parameter
settings, e.g., L = 1, N = O(ε−8 ) and L = O(ε−4 ), N = O(ε−4 ).
The result for the agents with known random positions is stated as follows.
Corollary 10 If we sample agents with known random positions and adopt the estimation procedures in Eqns. (5)
and (12) to estimate the MDP, then under Assumptions 1 to 10, GMFG-PPO
returns estimates that, with probability at least 1 − δ, satisfy
$$\frac{1}{T}\sum_{t=1}^{T} D\big(\pi_t^{I}, \pi^{*,I}\big) + \frac{1}{T}\sum_{t=1}^{T} d\big(\hat{\bar\mu}_t^{I}, \mu^{*,I}\big) = O\bigg(\frac{B_S + \bar r B_K}{(NL)^{1/4}}\log^{\frac14}\!\frac{T N_{B_r} N_{\tilde B_{\tilde r}} N_{\tilde{\mathcal W}}}{\delta} + \frac{(B_S + \bar r B_K)^{\frac14}(\bar r L_K B_k)^{\frac14}}{(NL)^{1/8}}\log^{\frac14}\!\frac{T N L\, N_\infty(1/\sqrt{NL}, \tilde{\mathcal W})}{\delta} + \frac{(B_S + \bar r B_K)^{\frac12}}{N^{1/8}}\log^{\frac14}\!\frac{\tilde N_{B_r} \tilde N_{\tilde B_{\tilde r}} \tilde N_{\tilde{\mathcal W}}}{\delta}\bigg) + O\Big(\frac{\sqrt{\log T}}{T^{1/3}}\Big).$$
For example, consider problems related to swarm robotics (Elamvazhuthi and Berman,
2019), where we would like to find the NE of a swarm of robots. The states and actions
are the kinetic signals and the accelerations of the robots, respectively. The reward is a quantity
related to the kinetic goal. Robots that are physically close usually occupy close
positions in the underlying graphon, since the interaction among the robots is
determined by their physical configuration. Thus, in this example, although we do not know their
exact positions in the graphon, we have information about their relative closeness in the graphon
via their physical positions. In addition, since the data points are stored on each robot,
the samples across different iterations are guaranteed to come from the same robot.
In this case, there is one sampled person from each state, and we assume that each person
knows which state she belongs to, i.e., which sampled person is the closest person to her.
Corollary 11 If we sample agents with unknown grid positions and adopt the estimation procedures in Eqns. (5)
and (12) to estimate the MDP, then under Assumptions 1 to 10, GMFG-PPO
returns estimates that, with probability at least 1 − δ, satisfy
$$\frac{1}{T}\sum_{t=1}^{T} D\big(\pi_t^{I}, \pi^{*,I}\big) + \frac{1}{T}\sum_{t=1}^{T} d\big(\hat{\bar\mu}_t^{I}, \mu^{*,I}\big) = O\bigg(\frac{\sqrt{B_k \bar r L_K (B_S + \bar r B_K)}}{N^{1/4}} + \frac{(B_S + \bar r B_K)^{\frac14}(\bar r L_K B_K N)^{\frac14} N^{1/8}}{L^{1/8}}\log^{\frac14}\!\frac{N L\, N_\infty(\sqrt{N/L}, \tilde{\mathcal W})}{\delta} + \frac{B_S + \bar r B_K}{L^{1/4}}\log^{\frac14}\!\frac{\sqrt{N}\,\tilde N_{B_r} \tilde N_{\tilde B_{\tilde r}} \tilde N_{\tilde{\mathcal W}}}{\delta}\bigg) + O\Big(\frac{\sqrt{\log T}}{T^{1/3}}\Big).$$
The proof is provided in Appendix N. Similar to Corollaries 9 and 10, the learning error
in Corollary 11 consists of the estimation error and the optimization error. To learn a NE
with error ε measured according to D(·, ·) and d(·, ·), we can run Algorithms 1 and 2 with
T = Õ(ε−3 ) iterations, N = O(ε−4 ) sampled agents, and L = O(ε−12 ) episodes.
7. Experiments
In this section, we utilize simulations to demonstrate the importance of learning the under-
lying graphons, thus corroborating our theoretical results. We simulate our algorithms on
the Susceptible-Infectious-Susceptible (SIS) problem and investment problem. The detailed
definitions of the problems are provided in Appendix A.
The SIS problem: This problem, which has also been considered in Cui and Koeppl
(2021a,b), models the propagation of an epidemic among a large population. People in the
population are infected with probability proportional to the number of infected neighbors.
An investment problem: This problem considers the situation where several compa-
nies aim to maximize their profits simultaneously. The profit of each company is proportional
to the quality of its product and decreases with the total quality of the products in its
neighborhood.
We experiment with four types of graphons: exp-graphon, SBM graphon, affine attachment graphon, and ranked attachment graphon.
(Figure 2: Simulation results for the SIS problem with agents at known grid positions. Panels (a)–(d) plot exploitability against the optimization iteration for exp-graphons, SBM graphons, ranked attachment graphons, and affine attachment graphons, respectively.)
The value of the exp-graphon is affine in the
exponential of the product of α and β, and is defined as
$$W_{\theta}^{\exp}(\alpha, \beta) = \frac{2\exp(\theta\cdot\alpha\beta)}{1 + \exp(\theta\cdot\alpha\beta)} - 1, \qquad (13)$$
which is parameterized by θ > 0. The SBM graphon splits [0, 1] into K ≥ 1 blocks,
which is parameterized by {p_k}_{k=0}^K. Here p_0 = 0 and p_K = 1, and the i-th block is
(p_{i−1}, p_i]. The value of the SBM graphon is then specified by {a_{ij}}_{i,j=1}^{K,K} with a_{ij} = a_{ji} as
W^{SBM}(α, β) = a_{ij} if p_{i−1} < α ≤ p_i and p_{j−1} < β ≤ p_j. The affine attachment graphon is
defined as W_{a,b}^{aff}(α, β) = a − b·(α + β), where a, b ∈ R parameterize the graphon. The ranked
attachment graphon is defined as W_{a,b}^{rank}(α, β) = a − b·αβ. This is a generalization of the
If we do not learn the underlying graphons, reasonable guesses for them would be constant
graphons W (α, β) = p for all α, β ∈ I, corresponding to the MFG. In the simulations, we
choose the constant p to be 0, 0.25, 0.5, 0.75 and 1. These values model the cases from the
independent agents to the most intensely interacting agents.
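For reference, the four graphon families described above can be written compactly as follows (a sketch with illustrative parameter names; the SBM block boundaries follow the (p_{i−1}, p_i] convention above):

```python
import numpy as np

def exp_graphon(theta):
    """Exp-graphon of Eqn. (13)."""
    return lambda a, b: 2.0 * np.exp(theta * a * b) / (1.0 + np.exp(theta * a * b)) - 1.0

def sbm_graphon(block_edges, block_values):
    """block_edges = [p_0, ..., p_K]; block_values[i][j] = a_{ij} (symmetric)."""
    edges = np.asarray(block_edges[1:])
    def w(a, b):
        i = int(np.searchsorted(edges, a, side="left"))
        j = int(np.searchsorted(edges, b, side="left"))
        return block_values[i][j]
    return w

def affine_attachment_graphon(a, b):
    return lambda x, y: a - b * (x + y)

def ranked_attachment_graphon(a, b):
    return lambda x, y: a - b * x * y
```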
(Figure 3: Simulation results for the investment problem with agents at known grid positions. Panels (a)–(d) plot exploitability against the optimization iteration for exp-graphons, SBM graphons, ranked attachment graphons, and affine attachment graphons, respectively.)
Figure 2 displays the exploitability for the algorithms in the SIS problem with different
graphons. To learn the system model, we sample N = 7 and N = 14 agents with known
grid positions. The number of episodes for data collection L is set to 125 and 500. The
line “mf, L = 125” refers to the model-free algorithm in Cui and Koeppl (2021b) that uses
125 trajectories from 7 agents to estimate the distribution flow and the value function
in each round. “L = 125” and “L = 500” refer to our algorithms that use 125 and
500 samples from each agent in each round for estimation of the graphons. Since the
messages from the results of different graphons are similar, we only display the results for
exp graphon and SBM graphon for brevity. Figure 2 demonstrates that our model-based
algorithm achieves lower exploitability than the model-free algorithm. The reason is that
the estimation error of the model-based algorithm is smaller, as mentioned in Section 6.
Lines “MD, p = 0, 0.25, 0.5, 0.75, 1” refer to the MFG learning algorithm that implements
one-step mirror descent in each iteration (Xie et al., 2021; Yardim et al., 2022). Lines
“FPI, p = 0, 0.25, 0.5, 0.75, 1” refer to the MFG learning algorithm that learns the optimal
policy for the current mean-field in each iteration (Guo et al., 2019). For these MFG learning
algorithms, the reward functions and transition kernels are known to the algorithm. Thus,
there are no error bars for these lines. Figure 2 shows that when we assume that the
heterogeneous agents are homogeneous, the learning algorithm for NE will suffer from a
large error (large exploitability). In contrast, learning the graphons will enable us to learn
the NE more accurately. These results demonstrate the necessity of our model learning
algorithm in Algorithm 2. We can also observe that the learning error for N = 7, L = 500
is less than that for N = 7, L = 125, which confirms that the learning error decreases with
an increasing number of trajectories L. In addition, the learning error for N = 14, L = 125
is less than that for N = 7, L = 125. This shows that the learning error decreases with an
increasing number of sampled agents N . These observations corroborate Corollary 9.
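For completeness, exploitability — the metric on the vertical axes of Figures 2–4 — measures how much an average agent could gain by unilaterally best responding while the population's distribution flow is held fixed. A hedged sketch is given below; `policy_return` and `best_response_return` are assumed helpers built on the (learned or true) model, not functions from the paper's code.

```python
import numpy as np

def exploitability(agent_positions, policy_return, best_response_return):
    """Average best-response gap over the sampled agents.

    policy_return(alpha):        J^{lambda,alpha} of the current policy profile
                                 under the induced distribution flow
    best_response_return(alpha): sup over single-agent policies of the same
                                 objective, with the flow held fixed."""
    gaps = [best_response_return(a) - policy_return(a) for a in agent_positions]
    return float(np.mean(gaps))
```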
(Figure 4: Simulation results for the SIS and investment problems with agents at random known positions. Panels: (a) SIS problem with exp-graphons; (b) SIS problem with SBM graphons; (c) investment problem with exp-graphons; (d) investment problem with SBM graphons. Each panel plots exploitability against the optimization iteration.)
Figure 3 displays the exploitability for algorithms in the investment problem of Cui and
Koeppl (2021b) with different graphons. “L = 25” and “L = 100” refer to our algorithms that estimate with 25 and 100 samples from each agent in each round. Although the investment problem has a larger state space than the SIS problem, the simulation results convey insights similar to those discussed above. These results corroborate Corollary 9.
Figure 4 displays the exploitability for algorithms in the SIS and investment problems
with different graphons, where sampled agents have known random positions. We note
that the lines “MD, p = 0, 0.25, 0.5, 0.75, 1” are the same as those in Figures 2 and 3, since these MFG algorithms have full information of the reward function and transition kernel, so the estimation setting does not affect their performance. Figure 4 shows that the GMFG learning algorithms perform better than the MFG learning algorithms. The reason is that the MFG learning algorithms wrongly assume that all the agents are homogeneous. Figure 4 also indicates that “N = 7, L = 500” (resp. “N = 7, L = 100”) performs better than “N = 7, L = 125” (resp. “N = 7, L = 25”) in the SIS problem (resp. the investment problem), which corroborates Corollary 10.
8. Conclusion
In this paper, we investigated learning the NE of GMFGs when the graphons are unknown. Provably efficient optimization algorithms were designed and analyzed with an estimation oracle, improving on the convergence rates of previous works. In addition, adopting mean-embedding ideas, we designed and analyzed model-based estimation algorithms with sampled agents, where the sampled agents have either known or unknown positions. These estimation algorithms are the first model-based algorithms for GMFGs that do not require the distribution flow information. We leave the analysis of more complex agent sampling schemes for future work.
Acknowledgments
This research work is funded by the Singapore Ministry of Education AcRF Tier 2 grant
(A-8000423-00-00) and Tier 1 grants (A-8000189-01-00 and A-8000980-00-00).
Appendices for
“Learning Graphon Mean-Field Games with Unknown
Graphons”
Investment problem: The state space is S = {0, 1, . . . , 9}, and the action space is A = {I, O}. The horizon is H = 50. The reward function is
$$ r_h^*(s, a, z) = \frac{0.3 \cdot s}{1 + \sum_{s' \in S} s' \cdot z(s')} - 2 \cdot \mathbb{I}_{\{a = I\}}, $$
and the transition kernel is
$$ P_h^*(s+1 \mid s, I, \cdot) = \frac{9-s}{10}, \qquad P_h^*(s \mid s, I, \cdot) = \frac{1+s}{10}, \qquad P_h^*(s \mid s, O, \cdot) = 1 $$
for all s ∈ {0, . . . , 8}, and s = 9 is an absorbing state.
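For concreteness, the following is a minimal Python sketch (ours, not the authors' code) of this reward function and transition kernel; the uniform aggregate z used in the example is purely illustrative.

```python
import numpy as np

# Sketch of the investment-problem MDP above: states 0..9, actions I ("invest") and O.
S = np.arange(10)            # state space {0, ..., 9}

def reward(s, a, z):
    """r_h^*(s, a, z) = 0.3*s / (1 + sum_{s'} s' * z(s')) - 2 * 1{a == 'I'}."""
    congestion = 1.0 + np.dot(S, z)          # 1 + sum_{s'} s' * z(s')
    return 0.3 * s / congestion - 2.0 * (a == "I")

def transition(s, a, rng):
    """Sample s_{h+1}: investing moves up w.p. (9 - s)/10, otherwise the state stays put."""
    if s == 9 or a == "O":                   # s = 9 is absorbing; O always keeps the state
        return s
    return s + 1 if rng.random() < (9 - s) / 10 else s

rng = np.random.default_rng(0)
z = np.full(10, 0.1)                         # hypothetical uniform aggregate, for illustration
print(reward(3, "I", z), transition(3, "I", rng))
```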
We then introduce our graphon parameters. We set θ = 3 for the exp-graphon. For the SBM graphon, we set K = 2, p0 = 1, p1 = 0.7, p2 = 1, a11 = a22 = 0.9, and a12 = a21 = 0.3. We set a = 1, b = 0.5 for the affine attachment graphon and the ranked attachment graphon. We set the regularization parameter as λ = 1 in our experiments. For the choices of the model classes, we note that the SIS and investment problems involve a set of parameters; for example, the reward function of the SIS problem involves the coefficients 10 and 2.5, and we estimate these coefficients. For the graphon classes, we note that all the graphons considered here are parameterized by a few parameters, and we estimate these parameters in the experiments. For the implementation of π^I and the computation of µ^I, we discretize the infinitely many agents indexed by [0, 1] into N = 100 groups, and approximate the policies and distribution flows within each group by one policy and one distribution flow. This step incurs an approximation error O(N^{-1}) with respect to the ℓ1 norm.
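The discretization just described can be sketched as follows; the graphon passed in the example (W(α, β) = 1 − αβ) is a hypothetical placeholder rather than one of the paper's parameterized graphons, and the aggregate z_h^α = ∫_0^1 W(α, β) µ_h^β dβ is approximated by a Riemann sum over the N groups.

```python
import numpy as np

# Minimal sketch of the discretization above: agents indexed by [0, 1] are grouped into
# N = 100 blocks, one policy / distribution flow per block, and the graphon aggregate
# z_h^alpha = \int_0^1 W(alpha, beta) mu_h^beta dbeta is approximated by a Riemann sum.
N = 100
alphas = (np.arange(N) + 0.5) / N            # midpoints of the N agent groups

def aggregate(W, mu):
    """Approximate z_h^alpha for every group alpha; `mu` has shape (N, |S|)."""
    weights = np.array([[W(a, b) for b in alphas] for a in alphas])   # (N, N) graphon values
    return weights @ mu / N                   # Riemann sum over beta with mesh 1/N

mu = np.full((N, 10), 0.1)                    # uniform distribution flow, for illustration
z = aggregate(lambda a, b: 1.0 - a * b, mu)   # hypothetical graphon, for illustration only
print(z.shape)                                # (100, 10)
```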
To shorten the simulation time and convey the main message, we only estimate the model at the beginning of the first iteration and reuse this estimate in the subsequent iterations to generate action-value function estimates. Figures 2, 3 and 4 are derived from twenty Monte Carlo runs of the algorithms. The error bars indicate the 25% and 75% quantiles of the errors. When simulating the cases with constant graphons, we implement the fixed-point iteration of the mirror descent operator (Yardim et al., 2022; Xie et al., 2021) or the game operator (Guo et al., 2019) to find the NE, and the optimal policy and the induced distribution flow are computed via dynamic programming and direct calculation with the nominal transition kernels and reward functions. Thus, there are no error bars for these cases.
Our simulation code builds on the code of Fabian et al. (2022) and Cui and Koeppl (2021b) for the simulation environment. We run our simulations on an Intel(R) Core(TM) i5-8257U CPU @ 1.40GHz.
We note that the exploitability ∆(π^I) defined in Section 7 is indeed ∆_λ(π^I) here. Then Proposition 3 in Geist et al. (2019) asserts that
$$ \Delta_\lambda(\pi^I) - \Delta_0(\pi^I) \le \lambda H \log|A| $$
for all λ ≥ 0. This inequality implies that the NE of the regularized (resp. unregularized) GMFG satisfies agent rationality in the unregularized (resp. regularized) game up to λH log|A|. This gap also appears in MFGs (Anahtarci et al., 2022; Xie et al., 2021), and mitigating this bias remains an unsolved problem even in MFGs, a strict subclass of GMFGs.
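As a quick numerical illustration of this gap (our own evaluation of the bound, using the investment-problem parameters λ = 1, H = 50, |A| = 2 stated in this appendix),
$$ \Delta_\lambda(\pi^I) - \Delta_0(\pi^I) \le \lambda H \log|A| = 1 \times 50 \times \log 2 \approx 34.7. $$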
We note that Eqn. (14) averages the states over the L episodes, since the distribution flows of these L episodes are the same. Correspondingly, the estimation procedure in Eqn. (5) is modified to be
$$ (\hat{f}_h, \hat{g}_h, \hat{W}_h) = \operatorname*{argmin}_{f \in \mathcal{B}(r,\bar{\mathcal{H}}),\, g \in \mathcal{B}(\tilde{r},\tilde{\mathcal{H}}),\, \tilde{W} \in \tilde{\mathcal{W}}} \; \frac{1}{NL} \sum_{\tau=1}^{L} \sum_{i=1}^{N} \Big[ \big( s^i_{\tau,h+1} - f\big(\hat{\tilde{\omega}}^i_{\tau,h}(\tilde{W})\big) \big)^2 + \big( r^i_{\tau,h} - g\big(\hat{\tilde{\omega}}^i_{\tau,h}(\tilde{W})\big) \big)^2 \Big]. \tag{15} $$
The convergence rate of the corresponding estimates $(\hat f_h, \hat g_h, \hat W_h)$ can be derived as follows.
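As an aside, the following Python sketch illustrates the spirit of the joint estimation in Eqn. (15) under simplifying assumptions of ours: a finite feature map stands in for the RKHS balls B(r, H̄) and B(r̃, H̃), ridge regression stands in for the constrained least squares, and the graphon class W̃ is replaced by a small hypothetical parametric family searched by grid. None of these choices are prescribed by the paper.

```python
import numpy as np

# Sketch of Eqn. (15): for each candidate graphon, build empirical mean embeddings (here:
# graphon-weighted feature averages over neighbors), fit f and g by ridge regression, and
# keep the candidate with the smallest empirical risk. All names below are illustrative.
rng = np.random.default_rng(0)
N, L, d = 8, 50, 5                       # agents, episodes, feature dimension
ridge = 1e-2

def feats(s, a, s_nbr):                  # hypothetical feature map phi(s, a, s')
    return np.array([1.0, s, a, s_nbr, s * s_nbr])

def embedding(W, xi, S, A):              # omega-hat^i: graphon-weighted average over j != i
    emb = np.zeros((N, d))
    for i in range(N):
        js = [j for j in range(N) if j != i]
        w = np.array([W(xi[i], xi[j]) for j in js])
        F = np.array([feats(S[i], A[i], S[j]) for j in js])
        emb[i] = w @ F / (N - 1)
    return emb

def empirical_risk(W, data, xi):
    X, ys, yr = [], [], []
    for S, A, R, Snext in data:          # one step of one episode: states, actions, rewards, next states
        E = embedding(W, xi, S, A)
        X.append(E); ys.append(Snext); yr.append(R)
    X, ys, yr = np.vstack(X), np.concatenate(ys), np.concatenate(yr)
    G = X.T @ X + ridge * np.eye(d)
    risk = 0.0
    for y in (ys, yr):                   # ridge fits for f (next state) and g (reward)
        beta = np.linalg.solve(G, X.T @ y)
        risk += np.mean((y - X @ beta) ** 2)
    return risk

xi = (np.arange(N) + 0.5) / N            # known agent positions on a grid
data = [(rng.integers(0, 10, N), rng.integers(0, 2, N), rng.random(N), rng.integers(0, 10, N))
        for _ in range(L)]
candidates = {t: (lambda a, b, t=t: np.exp(-t * abs(a - b))) for t in (1.0, 2.0, 3.0)}
best = min(candidates, key=lambda t: empirical_risk(candidates[t], data, xi))
print("selected graphon parameter:", best)
```

The selected parameter is simply the candidate whose fitted transition and reward models attain the smallest empirical risk, mirroring the argmin in Eqn. (15).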
where the last equality results from the definition of π^{φ,I} and the hypothesis. To show that µ^{φ(α)}_{h+1} = µ^{φ,α}_{h+1}, it remains to show z_h^{φ(α)}(µ^I_h, W^*_h) = z_h^{α}(µ^{φ,I}_h, W^{φ,*}_h). In fact, we have that
$$ z_h^{\alpha}\big(\mu^{\phi,I}_h, W^{\phi,*}_h\big) = \int_0^1 W^*_h\big(\phi(\alpha), \phi(\beta)\big)\, \mu^{\phi(\beta)}_h \, d\beta = \int_0^1 W^*_h\big(\phi(\alpha), \gamma\big)\, \mu^{\gamma}_h \, d\gamma, $$
where the last equality results from the substitution γ = φ(β). Thus, we conclude the proof of Proposition 2.
• Second, we derive the recurrence relationship of the policy learning error from the relationship derived in the first step.
Step 1: Analyze the property of the policy π̂^I_{t+1} derived in Line 6 of Algorithm 1.
We first note that the update of π̂^α_{t+1,h}(· | s) in Line 6 of Algorithm 1 can be equivalently defined as
$$ \hat{\pi}^{\alpha}_{t+1,h}(\cdot \mid s) = \operatorname*{argmax}_{p \in \Delta(A)} \Big[ \eta_{t+1} \big\langle \hat{Q}^{\lambda,\alpha}_h(s, \cdot, \pi^{\alpha}_t, \hat{\bar{\mu}}^I_t, \hat{W}), \, p \big\rangle - \lambda \eta_{t+1} \bar{H}(p) - \mathrm{KL}\big(p \,\|\, \pi^{\alpha}_{t,h}(\cdot \mid s)\big) \Big]. \tag{16} $$
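For intuition (this is our own remark, not part of the proof), when H̄ is the negative-entropy regularizer the proximal update in Eqn. (16) admits the standard closed form π̂^α_{t+1,h}(a | s) ∝ π^α_{t,h}(a | s)^{1/(1+λη_{t+1})} exp(η_{t+1} Q̂(s, a)/(1+λη_{t+1})), sketched below with a randomly generated action-value vector standing in for Q̂^{λ,α}_h.

```python
import numpy as np

# Closed-form solution of the proximal step in Eqn. (16), assuming H-bar(p) = <p, log p>:
#   pi_{t+1}(a|s)  proportional to  pi_t(a|s)^(1/(1+lam*eta)) * exp(eta * q(a) / (1 + lam*eta)).
def mirror_descent_step(q, pi_prev, eta, lam):
    logits = (eta * q + np.log(pi_prev)) / (1.0 + lam * eta)
    logits -= logits.max()                      # numerical stabilization
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
q = rng.random(4)                               # stand-in for the estimated action values
pi_prev = np.full(4, 0.25)                      # previous policy pi^alpha_{t,h}(.|s)
print(mirror_descent_step(q, pi_prev, eta=0.5, lam=1.0))
```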
Proposition 13 For the policy π̂^α_{t+1,h}(· | s) defined in Eqn. (16), we have that for all s ∈ S, p ∈ ∆(A), and h ∈ [H],
$$ \eta_{t+1} \big\langle \hat{Q}^{\lambda,\alpha}_h(s, \cdot, \pi^{\alpha}_t, \hat{\bar{\mu}}^I_t, \hat{W}), \, p - \hat{\pi}^{\alpha}_{t+1,h}(\cdot \mid s) \big\rangle + \lambda \eta_{t+1} \big[ R\big(\hat{\pi}^{\alpha}_{t+1,h}(\cdot \mid s)\big) - \bar{H}(p) \big] \\ \le \mathrm{KL}\big(p \,\|\, \pi^{\alpha}_{t,h}(\cdot \mid s)\big) - (1 + \lambda\eta_{t+1})\,\mathrm{KL}\big(p \,\|\, \hat{\pi}^{\alpha}_{t+1,h}(\cdot \mid s)\big) - \mathrm{KL}\big(\hat{\pi}^{\alpha}_{t+1,h}(\cdot \mid s) \,\|\, \pi^{\alpha}_{t,h}(\cdot \mid s)\big). $$
where the term (I) is the combination of the action-value function estimation error and the difference between π̂^I_{t+1} and π^I_{t+1}, defined as
$$ \mathrm{(I)} = \eta_{t+1} \Big[ \big\langle Q^{\lambda,\alpha}_h(s_h, \cdot, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*), \, p - \pi^{\alpha}_{t+1,h}(\cdot \mid s_h) \big\rangle - \big\langle \hat{Q}^{\lambda,\alpha}_h(s_h, \cdot, \pi^{\alpha}_t, \hat{\bar{\mu}}^I_t, \hat{W}), \, p - \hat{\pi}^{\alpha}_{t+1,h}(\cdot \mid s_h) \big\rangle \Big]. $$
The term (III) is the KL-divergence difference between π̂^I_{t+1} and π^I_{t+1}, defined as
$$ \mathrm{(III)} = \mathrm{KL}\big(\pi^{\alpha}_{t+1,h}(\cdot \mid s_h) \,\|\, \pi^{\alpha}_{t,h}(\cdot \mid s_h)\big) - \mathrm{KL}\big(\hat{\pi}^{\alpha}_{t+1,h}(\cdot \mid s_h) \,\|\, \pi^{\alpha}_{t,h}(\cdot \mid s_h)\big). $$
The term (IV) is also a KL-divergence difference between π̂^I_{t+1} and π^I_{t+1}, defined as
$$ \mathrm{(IV)} = (1 + \lambda\eta_{t+1}) \Big[ \mathrm{KL}\big(p \,\|\, \pi^{\alpha}_{t+1,h}(\cdot \mid s_h)\big) - \mathrm{KL}\big(p \,\|\, \hat{\pi}^{\alpha}_{t+1,h}(\cdot \mid s_h)\big) \Big]. $$
We define
$$ \Lambda^{\alpha}_{t+1,h} = 2\eta_{t+1} \big\| Q^{\lambda,\alpha}_h(s_h, \cdot, \pi^{\alpha}_t, \hat{\bar{\mu}}^I_t, W^*) - \hat{Q}^{\lambda,\alpha}_h(s_h, \cdot, \pi^{\alpha}_t, \hat{\bar{\mu}}^I_t, \hat{W}) \big\|_{\infty} + 2\eta_{t+1} H(1 + \lambda \log|A|)\beta_{t+1} \\ \quad + 2\eta_{t+1} \big[ L_r + H(1 + \lambda \log|A|) L_P \big] \varepsilon_{\mu} + 2\beta_{t+1} \log\frac{|A|}{\beta_t} + 2(1 + \lambda\eta_{t+1})\beta_{t+1}. \tag{18} $$
Then we can show the following bound.
Proposition 14 Under the assumptions in Theorem 4, (I) + (II) + (III) + (IV) ≤ Λ^α_{t+1,h}.
Proof See Appendix Q.2.2.
Then inequality (17) shows that
$$ \eta_{t+1} \big\langle Q^{\lambda,\alpha}_h(s_h, \cdot, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*), \, p - \pi^{\alpha}_{t+1,h}(\cdot \mid s_h) \big\rangle + \lambda\eta_{t+1} \big[ R\big(\pi^{\alpha}_{t+1,h}(\cdot \mid s_h)\big) - \bar{H}(p) \big] + \mathrm{KL}\big(\pi^{\alpha}_{t+1,h}(\cdot \mid s_h) \,\|\, \pi^{\alpha}_{t,h}(\cdot \mid s_h)\big) \\ \le \mathrm{KL}\big(p \,\|\, \pi^{\alpha}_{t,h}(\cdot \mid s_h)\big) - (1 + \lambda\eta_{t+1})\,\mathrm{KL}\big(p \,\|\, \pi^{\alpha}_{t+1,h}(\cdot \mid s_h)\big) + \Lambda^{\alpha}_{t+1,h}. \tag{19} $$
Step 2: Derive the recurrence relationship of the policy learning error from the relationship derived in Step 1, and bound the dynamical error in this recurrence relationship.
Inequality (19) implies that the improvement of π^I_{t+1} over π^I_t on the MDP induced by µ̄^I_t can be lower bounded as
$$ V^{\lambda,\alpha}_m(s, \pi^{\alpha}_{t+1}, \bar{\mu}^I_t, W^*) - V^{\lambda,\alpha}_m(s, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*) \\ = \mathbb{E}_{\pi^{\alpha}_{t+1}, \bar{\mu}^I_t} \bigg[ \sum_{h=m}^{H} \big\langle Q^{\lambda,\alpha}_h(s_h, \cdot, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*), \, \pi^{\alpha}_{t+1,h}(\cdot \mid s_h) - \pi^{\alpha}_{t,h}(\cdot \mid s_h) \big\rangle + \lambda \big[ R\big(\pi^{\alpha}_{t,h}(\cdot \mid s_h)\big) - R\big(\pi^{\alpha}_{t+1,h}(\cdot \mid s_h)\big) \big] \,\Big|\, s_m = s \bigg] \\ \ge \big\langle Q^{\lambda,\alpha}_m(s, \cdot, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*), \, \pi^{\alpha}_{t+1,m}(\cdot \mid s) - \pi^{\alpha}_{t,m}(\cdot \mid s) \big\rangle + \lambda \big[ R\big(\pi^{\alpha}_{t,m}(\cdot \mid s)\big) - R\big(\pi^{\alpha}_{t+1,m}(\cdot \mid s)\big) \big] - \frac{1}{\eta_{t+1}} \mathbb{E}_{\pi^{\alpha}_{t+1}, \bar{\mu}^I_t} \bigg[ \sum_{h=m}^{H} \Lambda^{\alpha}_{t+1,h} \,\Big|\, s_m = s \bigg], \tag{20} $$
where the equality results from Lemma 37, and the inequality results from inequality (19) and the non-negativity of the KL divergence.
We denote the optimal policy on the MDP induced by µ̄^I_t as π̄^{*,I}_t = Γ^λ_1(µ̄^I_t, W^*). Then Lemma 37 and inequality (20) imply that
$$ \eta_{t+1} \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ \big\langle Q^{\lambda,\alpha}_h(s_h, \cdot, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*), \, \bar{\pi}^{*,\alpha}_{t,h}(\cdot \mid s_h) - \pi^{\alpha}_{t+1,h}(\cdot \mid s_h) \big\rangle + \lambda \big[ R\big(\pi^{\alpha}_{t+1,h}(\cdot \mid s_h)\big) - R\big(\bar{\pi}^{*,\alpha}_{t,h}(\cdot \mid s_h)\big) \big] \Big] \\ \ge \eta_{t+1} \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ V^{\lambda,\alpha}_h(s_h, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*) - V^{\lambda,\alpha}_h(s_h, \pi^{\alpha}_{t+1}, \bar{\mu}^I_t, W^*) \Big] - \sum_{h=1}^{H} \sum_{m=h}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ \mathbb{E}_{\pi^{\alpha}_{t+1}, \bar{\mu}^I_t} \big[ \Lambda^{\alpha}_{t+1,m} \,\big|\, s_h \big] \Big] \\ \quad + \eta_{t+1} \mathbb{E}_{\mu^{\alpha}_1} \Big[ V^{\lambda,\alpha}_1(s_1, \bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t, W^*) - V^{\lambda,\alpha}_1(s_1, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*) \Big]. \tag{21} $$
Applying inequality (19) with p = π̄^{*,α}_{t,h}(· | s_h) to the left-hand side of inequality (21) and rearranging the terms, we have that
$$ \eta_{t+1} \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ V^{\lambda,\alpha}_h(s_h, \bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t, W^*) - V^{\lambda,\alpha}_h(s_h, \pi^{\alpha}_{t+1}, \bar{\mu}^I_t, W^*) \Big] + (1 + \lambda\eta_{t+1}) \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ \mathrm{KL}\big(\bar{\pi}^{*,\alpha}_{t,h}(\cdot \mid s_h) \,\|\, \pi^{\alpha}_{t+1,h}(\cdot \mid s_h)\big) \Big] \\ \le \eta_{t+1} \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ V^{\lambda,\alpha}_h(s_h, \bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t, W^*) - V^{\lambda,\alpha}_h(s_h, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*) \Big] - \eta_{t+1} \mathbb{E}_{\mu^{\alpha}_1} \Big[ V^{\lambda,\alpha}_1(s_1, \bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t, W^*) - V^{\lambda,\alpha}_1(s_1, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*) \Big] \\ \quad + \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ \mathrm{KL}\big(\bar{\pi}^{*,\alpha}_{t,h}(\cdot \mid s_h) \,\|\, \pi^{\alpha}_{t,h}(\cdot \mid s_h)\big) \Big] + \sum_{h=1}^{H} \sum_{m=h}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ \mathbb{E}_{\pi^{\alpha}_{t+1}, \bar{\mu}^I_t} \big[ \Lambda^{\alpha}_{t+1,m} \,\big|\, s_h \big] \Big] + \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \big[ \Lambda^{\alpha}_{t+1,h} \big]. \tag{22} $$
To handle the right-hand side of this inequality, we utilize the following proposition.
Proposition 15 For a λ-regularized finite-horizon MDP (S, A, H, {r_h}^H_{h=1}, {P_h}^H_{h=1}) with |r_h| ≤ 1 for all h ∈ [H], we denote the optimal policy as π* = {π*_h}^H_{h=1}. Then for any policy π, we have that
$$ \mathbb{E}_{\pi^*} \Big[ V^{\lambda}_1(s_1, \pi^*) - V^{\lambda}_1(s_1, \pi) \Big] \ge \beta^* \sum_{h=2}^{H} \mathbb{E}_{\pi^*} \Big[ V^{\lambda}_h(s_h, \pi^*) - V^{\lambda}_h(s_h, \pi) \Big], $$
where the expectation is taken with respect to the state distribution induced by π*, and β* > 0 is a constant that depends only on λ, H, and |A|.
Proof [Proof of Proposition 15] See Appendix Q.2.3.
Define θ∗ = 1/(1 + β ∗ ) < 1 and let ηt = η, where 1 + λη = 1/θ∗ . Proposition 15 shows that
$$ \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ V^{\lambda,\alpha}_h(s_h, \bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t, W^*) - V^{\lambda,\alpha}_h(s_h, \pi^{\alpha}_{t+1}, \bar{\mu}^I_t, W^*) \Big] + \frac{1}{\eta\theta^*} \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ \mathrm{KL}\big(\bar{\pi}^{*,\alpha}_{t,h}(\cdot \mid s_h) \,\|\, \pi^{\alpha}_{t+1,h}(\cdot \mid s_h)\big) \Big] \\ \le \theta^* \bigg\{ \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ V^{\lambda,\alpha}_h(s_h, \bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t, W^*) - V^{\lambda,\alpha}_h(s_h, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*) \Big] + \frac{1}{\eta\theta^*} \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ \mathrm{KL}\big(\bar{\pi}^{*,\alpha}_{t,h}(\cdot \mid s_h) \,\|\, \pi^{\alpha}_{t,h}(\cdot \mid s_h)\big) \Big] \bigg\} \\ \quad + \frac{1}{\eta} \sum_{h=1}^{H} \sum_{m=h}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ \mathbb{E}_{\pi^{\alpha}_{t+1}, \bar{\mu}^I_t} \big[ \Lambda^{\alpha}_{t+1,m} \,\big|\, s_h \big] \Big] + \frac{1}{\eta} \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \big[ \Lambda^{\alpha}_{t+1,h} \big]. \tag{23} $$
In the following, we will derive the rate of convergence of the term
$$ X^{\alpha}_t = \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ V^{\lambda,\alpha}_h(s_h, \bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t, W^*) - V^{\lambda,\alpha}_h(s_h, \pi^{\alpha}_t, \bar{\mu}^I_t, W^*) \Big] + \frac{1}{\eta\theta^*} \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ \mathrm{KL}\big(\bar{\pi}^{*,\alpha}_{t,h}(\cdot \mid s_h) \,\|\, \pi^{\alpha}_{t,h}(\cdot \mid s_h)\big) \Big]. \tag{24} $$
We note that X^I_t is a good quantity with which to measure the “distance” between π^I_t and the NE. For the NE, π^{*,I} is the optimal policy on the MDP induced by its own distribution flow µ^{*,I}. Since µ̄^I_t is close to µ^I_t, we expect that π^I_t achieves high rewards on the MDP induced by µ̄^I_t if it is close to the NE. Inequality (23) shows that the recurrence relationship for X^α_t is
$$ X^{\alpha}_{t+1} \le \theta^* X^{\alpha}_t + \frac{1}{\eta} \sum_{h=1}^{H} \sum_{m=h}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ \mathbb{E}_{\pi^{\alpha}_{t+1}, \bar{\mu}^I_t} \big[ \Lambda^{\alpha}_{t+1,m} \,\big|\, s_h \big] \Big] + \frac{1}{\eta} \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \big[ \Lambda^{\alpha}_{t+1,h} \big] + \Delta^{\alpha}_{t+1}, \tag{25} $$
where ∆^α_{t+1} is the error introduced by the change of the environment, which is also called the dynamical error, and it is defined as
$$ \Delta^{\alpha}_{t+1} = X^{\alpha}_{t+1} - \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ V^{\lambda,\alpha}_h(s_h, \bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t, W^*) - V^{\lambda,\alpha}_h(s_h, \pi^{\alpha}_{t+1}, \bar{\mu}^I_t, W^*) \Big] - \frac{1}{\eta\theta^*} \sum_{h=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_t, \bar{\mu}^I_t} \Big[ \mathrm{KL}\big(\bar{\pi}^{*,\alpha}_{t,h}(\cdot \mid s_h) \,\|\, \pi^{\alpha}_{t+1,h}(\cdot \mid s_h)\big) \Big]. $$
Proposition 16 Under the assumptions in Theorem 4, the dynamical error satisfies
$$ \Delta^{\alpha}_{t+1} \le H \Big[ 2H(1 + \lambda\log|A|) + \lambda L_R + \frac{1}{\eta\theta^*}\log\frac{|A|^2}{\beta_{t+1}} + \frac{2}{\eta\theta^*}\max\Big\{ \log\frac{|A|}{\beta_{t+1}},\, L_R \Big\} \Big] \cdot \sum_{m=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_{t+1}, \bar{\mu}^I_{t+1}} \Big[ \big\| \bar{\pi}^{*,\alpha}_{t+1,m}(\cdot \mid s_m) - \bar{\pi}^{*,\alpha}_{t,m}(\cdot \mid s_m) \big\|_1 \Big] \\ \quad + \Big[ H\Big( H(1 + \lambda\log|A|) + \frac{1}{\eta\theta^*}\log\frac{|A|^2}{\beta_{t+1}} \Big) L_P + 2H\big( L_r + H(1 + \lambda\log|A|)L_P \big) \Big] \cdot \sum_{m=1}^{H} \int_0^1 \big\| \bar{\mu}^{\beta}_{t+1,m} - \bar{\mu}^{\beta}_{t,m} \big\|_1 \, d\beta \\ = C_1(\eta, \beta_{t+1}) \sum_{m=1}^{H} \mathbb{E}_{\bar{\pi}^{*,\alpha}_{t+1}, \bar{\mu}^I_{t+1}} \Big[ \big\| \bar{\pi}^{*,\alpha}_{t+1,m}(\cdot \mid s_m) - \bar{\pi}^{*,\alpha}_{t,m}(\cdot \mid s_m) \big\|_1 \Big] + C_2(\eta, \beta_{t+1}) \sum_{m=1}^{H} \int_0^1 \big\| \bar{\mu}^{\beta}_{t+1,m} - \bar{\mu}^{\beta}_{t,m} \big\|_1 \, d\beta. $$
In the above, we defined C_1(η, β_{t+1}) and C_2(η, β_{t+1}) for ease of subsequent notation.
Proof See Appendix Q.2.4.
We need the following proposition to relate the difference between the optimal policies
∗,α ∗,α
π̄t+1,m (· | sm ) and π̄t,m (· | sm ) in Proposition 16 to the distribution flows µ̄It+1 and µ̄It .
Proposition 17 For any two distribution flows µI and µ̃I , we define the optimal policies
π ∗,I = Γλ1 (µI , W ∗ ) and π̃ ∗,I = Γλ1 (µ̃I , W ∗ ). Under Assumption 1, we have that for any
h ∈ [H] and α ∈ [0, 1]
+ 2H 2H H(1 + λ log |A|)LP + Lr C1 (η, βt+1 ) + C2 (η, βt+1 ) αt , (27)
where the equality results from the definition of µ̄ ˆIt+1 , and the inequality results from the
triangle inequality. For the fourth term in the right-hand side of inequality (29), we have
that
$$ d(\bar{\mu}^{*,I}_t, \mu^{*,I}) = d\Big( \Gamma_2\big(\Gamma^{\lambda}_1(\bar{\mu}^I_t, W^*), W^*\big), \, \Gamma_2\big(\Gamma^{\lambda}_1(\mu^{*,I}, W^*), W^*\big) \Big) \le d_1 d_2 \, d(\bar{\mu}^I_t, \mu^{*,I}) \le d_1 d_2 \Big[ d(\bar{\mu}^I_t, \hat{\bar{\mu}}^I_t) + d(\hat{\bar{\mu}}^I_t, \mu^{*,I}) \Big], \tag{30} $$
where the equality results from the definitions of µ̄^{*,I}_t and µ^{*,I}, the first inequality results from Assumption 2, and the last inequality results from the triangle inequality. We then define µ̃^{*,I}_t = Γ_3(π̄^{*,I}_t, µ̄^I_t, W^*). For the third term in the right-hand side of inequality (29), we have that
$$ \le d_2 \int_0^1 \sum_{h=1}^{H} \mathbb{E}_{\mu^{*,\alpha}_h} \Big[ \big\| \pi^{\alpha}_{t,h}(\cdot \mid s) - \bar{\pi}^{*,\alpha}_{t,h}(\cdot \mid s) \big\|_1 \Big] d\alpha = d_2 \int_0^1 \sum_{h=1}^{H} \mathbb{E}_{\tilde{\mu}^{*,\alpha}_{t,h}} \bigg[ \frac{\mu^{*,\alpha}_h(s)}{\tilde{\mu}^{*,\alpha}_{t,h}(s)} \big\| \pi^{\alpha}_{t,h}(\cdot \mid s) - \bar{\pi}^{*,\alpha}_{t,h}(\cdot \mid s) \big\|_1 \bigg] d\alpha \\ \le d_2 C_{\mu} \sqrt{H} \sqrt{ 2 \int_0^1 \sum_{h=1}^{H} \mathbb{E}_{\tilde{\mu}^{*,\alpha}_{t,h}} \Big[ \mathrm{KL}\big(\bar{\pi}^{*,\alpha}_{t,h}(\cdot \mid s) \,\|\, \pi^{\alpha}_{t,h}(\cdot \mid s)\big) \Big] d\alpha }, \tag{31} $$
where the first inequality results from Assumption 2, and the second inequality results from Assumption 3 and the Cauchy–Schwarz inequality. Define Y_t = d(\hat{\bar{\mu}}^I_t, \mu^{*,I}). Combining
where d¯ = 1 − d1 d2 .
Recalling the expressions of µ̄^I_t and µ̄̂^I_t in Eqn. (79), we have that
$$ d(\bar{\mu}^I_t, \hat{\bar{\mu}}^I_t) \le \sum_{m=1}^{t-1} \alpha_{m,t-1}\, d(\hat{\mu}^I_m, \mu^I_m) \le \varepsilon_{\mu}, $$
where the first inequality results from the triangle inequality, and the second inequality results from Assumption 4. Taking α_t = α, we have that
$$ \frac{1}{T}\sum_{t=1}^{T} Y_t \le \frac{1}{\bar{d}\alpha T} Y_1 + \frac{1}{\bar{d}} \varepsilon_{\mu} + \frac{(1 + d_1 d_2)\sqrt{d_2 C_{\mu} H}}{\bar{d}} \cdot \frac{1}{T}\sum_{t=1}^{T} \sqrt{2\eta\theta^* \int_0^1 X^{\alpha}_t \, d\alpha} \\ \le \frac{1}{\bar{d}\alpha T} Y_1 + \frac{1}{\bar{d}} \varepsilon_{\mu} + \frac{(1 + d_1 d_2)\sqrt{d_2 C_{\mu} H}}{\bar{d}} \cdot \sqrt{2\eta\theta^* \, \frac{1}{T}\sum_{t=1}^{T} \int_0^1 X^{\alpha}_t \, d\alpha}, $$
where the last inequality results from Eqn. (28). Thus, we have
$$ \frac{1}{T}\sum_{t=1}^{T} Y_t = O\Big(\frac{\log T}{T^{1/3}}\Big) + O\big(\sqrt{\varepsilon_{\mu} + \varepsilon_Q} + \varepsilon_{\mu}\big). $$
$$ D(\pi^I_t, \pi^{*,I}) \le D(\pi^I_t, \bar{\pi}^{*,I}_t) + D(\bar{\pi}^{*,I}_t, \pi^{*,I}) \le D(\pi^I_t, \bar{\pi}^{*,I}_t) + d_1\, d(\bar{\mu}^I_t, \mu^{*,I}). $$
$$ \le \theta^* \bigg\{ \sum_{h=1}^{H} \mathbb{E}_{\pi^*} \Big[ V^{\lambda}_h(s_h, \pi^*) - V^{\lambda}_h(s_h, \pi_t) \Big] + \frac{1}{\eta\theta^*} \sum_{h=1}^{H} \mathbb{E}_{\pi^*} \Big[ \mathrm{KL}\big(\pi^*_h(\cdot \mid s_h) \,\|\, \pi_{t,h}(\cdot \mid s_h)\big) \Big] \bigg\}, $$
where 0 < θ* < 1 is a function of λ, H and |A|, and E_{π*} refers to the expectation with respect to the state distribution induced by π*.
Proof [Proof of Corollary 18] Similarly to Step 1 of the proof of Theorem 4, we have
$$ \eta_{t+1} \big\langle Q^{\lambda}_h(s_h, \cdot, \pi_t), \, p - \pi_{t+1,h}(\cdot \mid s_h) \big\rangle + \lambda\eta_{t+1} \big[ R\big(\pi_{t+1,h}(\cdot \mid s_h)\big) - R(p) \big] + \mathrm{KL}\big(\pi_{t+1,h}(\cdot \mid s_h) \,\|\, \pi_{t,h}(\cdot \mid s_h)\big) \\ \le \mathrm{KL}\big(p \,\|\, \pi_{t,h}(\cdot \mid s_h)\big) - (1 + \lambda\eta_{t+1})\, \mathrm{KL}\big(p \,\|\, \pi_{t+1,h}(\cdot \mid s_h)\big) $$
for any p ∈ ∆(A). Following the same pipeline as for inequality (23), we have that
$$ \sum_{h=1}^{H} \mathbb{E}_{\pi^*} \Big[ V^{\lambda}_h(s_h, \pi^*) - V^{\lambda}_h(s_h, \pi_{t+1}) \Big] + \frac{1}{\eta\theta^*} \sum_{h=1}^{H} \mathbb{E}_{\pi^*} \Big[ \mathrm{KL}\big(\pi^*_h(\cdot \mid s_h) \,\|\, \pi_{t+1,h}(\cdot \mid s_h)\big) \Big] \\ \le \theta^* \bigg\{ \sum_{h=1}^{H} \mathbb{E}_{\pi^*} \Big[ V^{\lambda}_h(s_h, \pi^*) - V^{\lambda}_h(s_h, \pi_t) \Big] + \frac{1}{\eta\theta^*} \sum_{h=1}^{H} \mathbb{E}_{\pi^*} \Big[ \mathrm{KL}\big(\pi^*_h(\cdot \mid s_h) \,\|\, \pi_{t,h}(\cdot \mid s_h)\big) \Big] \bigg\}, $$
This generalization error of the risk represents the error due to the fact that we optimize the empirical estimate of the risk rather than the population risk.
The estimation error of the mean-embedding represents the error due to the fact that we cannot observe the value of ω̂^i_{τ,h}(Ŵ_h); instead, we can only estimate it through the states of the sampled agents.
The empirical risk difference represents the error arising from the fact that we choose (f̂_h, ĝ_h, Ŵ_h) rather than (f*_h, g*_h, W*_h) by minimizing the empirical risk. From the procedure of Algorithm (5), we have
Empirical Risk Difference ≤ 0.
Thus, we have that
Rξ̄ (fˆh , ĝh , Ŵh ) − Rξ̄ (fh∗ , gh∗ , Wh∗ )
L N
1 XX i ˆ i
2 i ∗ i ∗
2
≤ Eρi sτ,h+1 − fh ωτ,h (Ŵh ) − sτ,h+1 − fh ωτ,h (Wh )
NL τ,h
τ =1 i=1
L XN
1 X 2 i 2
−2 siτ,h+1 − fˆh ωτ,h
i
(Ŵh ) − sτ,h+1 − fh∗ ωτ,h
i
(Wh∗ )
NL
τ =1 i=1
L N
1 XX i i
2 i ∗ i ∗
2
+ Eρi rτ,h − ĝh ωτ,h (Ŵh ) − rτ,h − gh ωτ,h (Wh )
NL τ,h
τ =1 i=1
L N
1 XX i i
2 i ∗ i ∗
2
−2 rτ,h − ĝh ωτ,h (Ŵh ) − rτ,h − gh ωτ,h (Wh )
NL
τ =1 i=1
L N
1 XX i i
2 i i
2
+2 sup sτ,h+1 − f ω̂τ,h (W̃ ) − sτ,h+1 − f ωτ,h (W̃ )
f ∈B(r,H̄),W̃ ∈W̃ NL
τ =1 i=1
L X N
1 X
i i
2 i i
2
+2 sup rτ,h − g ω̂τ,h (W̃ ) − rτ,h − g ωτ,h (W̃ )
g∈B(r̃,H̃),W̃ ∈W̃ NL
τ =1 i=1
= (I) + (II), (32)
We note that the terms related to the transition kernels and reward functions are similar. In
the following, we will only present the bounds for the terms related to the transition kernels,
and the bounds for the reward functions can be similarly derived.
Step 1: Bound the Estimation Error of Mean-embedding.
Considering term (II), we have that
$$ \sup_{f \in \mathcal{B}(r,\bar{\mathcal{H}}),\, \tilde{W} \in \tilde{\mathcal{W}}} \frac{1}{NL} \sum_{\tau=1}^{L} \sum_{i=1}^{N} \Big[ \big( s^i_{\tau,h+1} - f\big(\hat{\omega}^i_{\tau,h}(\tilde{W})\big) \big)^2 - \big( s^i_{\tau,h+1} - f\big(\omega^i_{\tau,h}(\tilde{W})\big) \big)^2 \Big] \\ \le \sup_{f \in \mathcal{B}(r,\bar{\mathcal{H}}),\, \tilde{W} \in \tilde{\mathcal{W}}} \frac{1}{NL} \sum_{\tau=1}^{L} \sum_{i=1}^{N} \Big| f\big(\hat{\omega}^i_{\tau,h}(\tilde{W})\big) - f\big(\omega^i_{\tau,h}(\tilde{W})\big) \Big| \cdot \Big| 2 s^i_{\tau,h+1} - f\big(\hat{\omega}^i_{\tau,h}(\tilde{W})\big) - f\big(\omega^i_{\tau,h}(\tilde{W})\big) \Big| \\ \le 2(B_S + r B_{\bar{K}})\, r L_K \sup_{\tilde{W} \in \tilde{\mathcal{W}}} \frac{1}{NL} \sum_{\tau=1}^{L} \sum_{i=1}^{N} \big\| \hat{\omega}^i_{\tau,h}(\tilde{W}) - \omega^i_{\tau,h}(\tilde{W}) \big\|_{\mathcal{H}}, \tag{33} $$
where the first inequality results from the triangle inequality, and the second inequality results from Assumption 6 and Lemma 33. Recall that the definitions of ω̂^i_{τ,h}(W) and ω^i_{τ,h}(W) are
$$ \omega^i_{\tau,h}(W) = \int_0^1 \int_{S} W(\xi_i, \beta)\, k\big(\cdot, (s^i_{\tau,h}, a^i_{\tau,h}, s)\big)\, \mu^{\beta}_{\tau,h}(s)\, ds\, d\beta, \qquad \hat{\omega}^i_{\tau,h}(W) = \frac{1}{N-1} \sum_{j \ne i} W(\xi_i, \xi_j)\, k\big(\cdot, (s^i_{\tau,h}, a^i_{\tau,h}, s^j_{\tau,h})\big), $$
where
$$ \bar{\omega}^i_{\tau,h}(W) = \frac{1}{N-1} \sum_{j \ne i} W(\xi_i, \xi_j) \int_{S} k\big(\cdot, (s^i_{\tau,h}, a^i_{\tau,h}, s)\big)\, \mu^{j}_{\tau,h}(s)\, ds. $$
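To make these objects concrete, the following Python sketch (ours) evaluates the empirical mean embedding ω̂^i_{τ,h}(W) at a point x = (s, a, s'); the Gaussian kernel and the graphon used in the example are hypothetical stand-ins, since the analysis only requires a bounded kernel and W ∈ W̃.

```python
import numpy as np

# Pointwise evaluation of the empirical mean embedding defined above:
#   omega_hat^i(W)(x) = (1/(N-1)) * sum_{j != i} W(xi_i, xi_j) * k(x, (s_i, a_i, s_j)).
def rbf(x, y, bw=1.0):                              # illustrative Gaussian kernel
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * bw ** 2))

def omega_hat(i, W, xi, s, a, x):
    js = [j for j in range(len(xi)) if j != i]
    return sum(W(xi[i], xi[j]) * rbf(x, (s[i], a[i], s[j])) for j in js) / (len(xi) - 1)

N = 8
xi = (np.arange(N) + 0.5) / N                       # agent positions
s = np.arange(N) % 10                               # hypothetical states at step (tau, h)
a = np.zeros(N)                                     # hypothetical actions
val = omega_hat(0, lambda u, v: 1.0 - u * v, xi, s, a, x=(s[0], a[0], 3))
print(val)
```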
i (W ) − ω̄ i (W )
For term (III) = supW ∈W̃ ωτ,h , we have that
τ,h H
Z 1Z
i i
W (ξi , β)k ·, (siτ,h , aiτ,h , s) µβτ,h (s) ds dβ
ωτ,h (W ) − ω̄τ,h (W ) H ≤
0 S
N −1 Z
1 X j N j−1
− W ξi , k ·, (siτ,h , aiτ,h , s) µτ,h (s) ds
N −1 N −1 S H
j=1
N −1 Z
1 X j N j−1
+ W ξi , k ·, (siτ,h , aiτ,h , s) µτ,h (s) ds
N −1 N −1 S
j=1
Z
1 X
W (ξi , ξj ) k ·, (siτ,h , aiτ,h , s) µjτ,h (s) ds
−
N −1 S H
j6=i
where the inequality results from the triangle inequality. For each term in the sum, we have
that
Z
W (ξi , β)k ·, (siτ,h , aiτ,h , s) µβτ,h (s)ds
S
Z
j N j−1
− W ξi , k ·, (siτ,h , aiτ,h , s) µτ,h (s)ds
N −1 S H
Z
j
k ·, (siτ,h , aiτ,h , s) µβτ,h (s)ds
≤ W (ξi , β) − W ξi ,
N −1 S H
Z j
j i i
β N −1
+ W ξi , k ·, (sτ,h , aτ,h , s) µτ,h (s) − µτ,h (s) ds
N −1 S H
j
j
≤ B k LW β − + Bk µβτ,h − µτ,h N −1
, (37)
N −1 1
where the first inequality results from the triangle inequality, and the second results from
Assumptions 5 and 6.
j h−1 j
j
µβτ,h − µτ,h sup πtβ (· | s) − πtN −1 (· | s)
X
N −1
≤ HLP LW β − + 1
1 N −1 s∈S t=1
j
≤ (HLP LW + HLπ ) β − , (38)
N −1
where the first inequality results from Proposition 19, and the second inequality results
from the Lipschitzness of behavior policies. Substituting inequalities (37) and (38) into
inequality (36), we have that
Z 1Z
W (ξi , β)k ·, (siτ,h , aiτ,h , s) µβτ,h (s)dsdβ
(V) =
0 S
N −1 Z
1 X j N j−1
− W ξi , k ·, (siτ,h , aiτ,h , s) µτ,h (s)ds
N −1 N −1 S H
j=1
N −1 Z j
X N −1 j
≤ Bk (LW + HLP LW + HLπ ) β − dβ
j−1 N −1
j=1 N −1
1
= Bk (LW + HLP LW + HLπ ).
2(N − 1)
Z
1 X
W (ξi , ξj ) k ·, (siτ,h , aiτ,h , s) µjτ,h (s)ds
−
N −1 S H
j6=i
1 X j j−1 j
≤ Bk LW ξj − + ξj − N −1
+ Bk µτ,h − µjτ,h 1
N −1 N −1 N −1
j6=1
N
1 X i
≤ 3 + 2Bk (LW + HLπ + HLp LW ) ξi − ,
N −1 N
i=1
where the first inequality results from the triangle inequality, and the second inequality results from
Proposition 19. Substituting the bounds for terms (V) and (VI) into inequality (35), we
have that
i i
(III) = sup ωτ,h (W ) − ω̄τ,h (W ) H
W ∈W̃
N
Bk (LW + HLP LW + HLπ ) 1 X i
≤ + 3 + 2Bk (LW + HLπ + HLp LW ) ξi −
2(N − 1) N −1 N
i=1
1 3
= Bk (LW + HLP LW + HLπ ) + . (39)
2(N − 1) N −1
To derive a concentration inequality for term (IV), we first construct the minimal ε−cover
of W̃ with respect to k · k∞ . The covering number is denoted as N∞ (ε, W̃). Then for any
W ∈ W̃, there exists a graphon Wi for i ∈ {1, · · · , N∞ (ε, W̃)} such that kW − Wi k∞ ≤ ε.
Then we have that
i i i i
ω̂τ,h (W ) − ω̄τ,h (W ) H
≤ ω̂τ,h (Wi ) − ω̄τ,h (Wi ) H
+ 2εBk ,
where the inequality results from the triangle inequality. In the following, we set ε = t/(4Bk ).
Then the concentration inequality for term (IV) can be derived as
i i
P ∃W̃ ∈ W̃, i ∈ [N ], τ ∈ [L], ω̂τ,h (W̃ ) − ω̄τ,h (W̃ ) H ≥ t)
i i
≤ P ∃j ∈ N∞ (ε, W̃) , i ∈ [N ], τ ∈ [L], ω̂τ,h (Wj ) − ω̄τ,h (Wj ) H ≥ t − 2εBk
i i
≤ N LN∞ (t/(4Bk ), W̃) max P ω̂τ,h (Wj ) − ω̄τ,h (Wj ) H ≥ t/2
j∈[N∞ ],i∈[N ],τ ∈[L]
(N − 1)t2
≤ 2N LN∞ (t/(4Bk ), W̃) exp − ,
32Bk2
where the first inequality results from the construction of the cover, the second inequality
results from the union bound, and the last inequality results from Lemma 32 and that
√
kW (ξi , ξj )k ·, (siτ,h , aiτ,h , sjτ,h )k ≤ Bk for any W ∈ W̃. For t ≥ 4Bk / N , we have that
i i
P ∃W̃ ∈ W̃, i ∈ [N ], τ ∈ [L], ω̂τ,h (W̃ ) − ω̄τ,h (W̃ ) H ≥ t)
√ (N − 1)t2
≤ 2N LN∞ (1/ N , W̃) exp − .
32Bk2
with probability at least 1 − δ. Substituting inequalities (40) and (39) into inequalities (33)
and (34), we have that
L N
1 XX i i
2 i i
2
sup sτ,h+1 − f ω̂τ,h (W ) − sτ,h+1 − f ωτ,h (W )
f ∈B(r,H̄),W ∈W̃ N L τ =1 i=1
√
(BS + rBK̄ )rLK Bk N LN∞ (1/ N , W̃)
≤O √ log , (41)
N δ
with probability at least 1 − δ.
Step 2: Bound the generalization error of risk.
Considering term (I), for ease of notation, we denote the quadruple (s^i_{τ,h}, a^i_{τ,h}, µ^I_{τ,h}, s^i_{τ,h+1}) as e^i_{τ,h}. We define the function f_W as
$$ f_W(e^i_{\tau,h}) = \big( s^i_{\tau,h+1} - f\big(\omega^i_{\tau,h}(W)\big) \big)^2 - \big( s^i_{\tau,h+1} - f^*_h\big(\omega^i_{\tau,h}(W^*_h)\big) \big)^2. $$
The corresponding function class is defined as F_{W̃} = {f_W | f ∈ B(r, H̄), W ∈ W̃}. Then we have that
$$ \mathrm{(I)} = \frac{1}{NL} \sum_{\tau=1}^{L} \sum_{i=1}^{N} \mathbb{E}_{\rho^i_{\tau,h}} \big[ \hat{f}_{\hat{W}}(e^i_{\tau,h}) \big] - \frac{2}{NL} \sum_{\tau=1}^{L} \sum_{i=1}^{N} \hat{f}_{\hat{W}}(e^i_{\tau,h}). $$
Combining inequalities (32), (41), and (42), we have that the following holds with probability
at least 1 − δ
≤2 sup R(f, g, W ) − Rξ̄ (f, g, W ) + Rξ̄ (fˆh , ĝh , Ŵh ) − Rξ̄ (fh∗ , gh∗ , Wh∗ )
f ∈B(r,H̄),g∈B(r̃,H̃),W ∈W̃
where (IX) is the generalization error of the risk from position sampling, and (X) is the difference between the risks given the positions. Similar to the proof of Theorem 5, the terms related to the transition kernels and the reward functions in inequality (43) are analogous. In the following, we will only present the proof for the terms related to the transition kernel; the results for the terms related to the reward functions can be similarly derived.
Step 1: Bound the generalization error of risk from position sampling.
We first define
$$ g_{f,W}(\alpha) = \frac{1}{L} \sum_{\tau=1}^{L} \mathbb{E}_{\rho^{\alpha}_{\tau,h}} \Big[ \big( s^{\alpha}_{\tau,h+1} - f\big(\omega^{\alpha}_{\tau,h}(W)\big) \big)^2 \Big]. $$
The corresponding function class for g_{f,W} is G_{F,W̃} = {g_{f,W} | f ∈ B(r, H̄), W ∈ W̃}. Then the term in (IX) that is related to the transition kernels can be expressed as
$$ 2 \sup_{g_{f,W} \in \mathcal{G}_{\mathcal{F},\tilde{\mathcal{W}}}} \bigg| \int_0^1 g_{f,W}(\alpha)\, d\alpha - \frac{1}{N} \sum_{i=1}^{N} g_{f,W}(\xi_i) \bigg|. $$
Let δ > 0 and let G_δ be a minimal L_∞ δ-cover of G_{F,W̃}. Then for any g_{f,W} ∈ G_{F,W̃}, there exists ḡ_{f,W} ∈ G_δ such that |g_{f,W}(α) − ḡ_{f,W}(α)| ≤ δ for all α ∈ I. For any t > 0, we set δ = t/4.
Then we have that
Z 1 N
1 X
P sup gf,W (α)dα − gf,W (ξi ) ≥ t
gf,W ∈GF ,W̃ 0 N
i=1
Z 1 N
t 1 X t
≤ N∞ ,G max P gf,W (α)dα − gf,W (ξi ) ≥
4 F ,W̃ gf,W ∈G t 0 N 2
4 i=1
N t2
t
≤ 2N∞ , GF ,W̃ exp − , (44)
4 2(BS + rBK̄ )4
where the first inequality results from the union bound, and the second inequality results from the fact that 0 ≤ g_{f,W}(α) ≤ (B_S + rB_K̄)² and Hoeffding's inequality. To upper bound the covering number in the tail probability, we note that
L
1X
Eρατ,h f ωτ,h (W ) − f¯ ωτ,h (W̄ )
α α
gf,W (α) − ḡf,W (α) ≤ 2(BS + rBK̄ )
L
τ =1
≤ 2(BS + rBK̄ ) BK̄ kf − f¯kH̄ + rLK Bk kW − W̄ k∞ ,
where the first inequality results from the definition of gf,W , and the second inequality
results from Lemma 33 and the triangle inequality. This inequality implies that
t t t
N∞ ,G ≤ NH̄ , B(r, H̄) · N∞ , W̃ .
4 F ,W̃ 16(BS + rBK̄ )BK̄ 16(BS + rBK̄ )rLK Bk
L N
1 XX i i
(XII) = 4(BS + rBK̄ )rLK sup ω̂τ,h (W ) − ω̄τ,h (W ) H
W ∈W̃ NL
τ =1 i=1
L X N
1 X
i i
(XIII) = 4(BS + rBK̄ )rLK sup ω̄τ,h (W ) − ωτ,h (W ) H
.
W ∈W̃ NL
τ =1 i=1
For term (XIII), we adopt a method different from the proof of Theorem 5. Let ε > 0 and let W̃_ε be an L_∞ ε-cover of W̃. Then for any W ∈ W̃, there exists W̄ ∈ W̃_ε such that ‖W̄ − W‖_∞ ≤ ε. Then we have
Z 1Z
i i
W (ξi , β) − W̄ (ξi , β) k ·, (siτ,h , aiτ,h , s) µβτ,h (s)dsdβ
kωτ,h (W ) − ωτ,h (W̄ )kH =
0 S H
≤ εBk ,
i (W ) −
where the inequality results from the triangle inequality. Similarly, we have that kω̄τ,h
i (W̄ )k ≤ εB . For any t > 0, we will set ε = t/(4B ). Then the tail probability for
ω̄τ,h H k k
(XIII) can be bounded as
L N
1 XX i i
P sup ω̄τ,h (W ) − ωτ,h (W ) H ≥ t
W ∈W̃ N L τ =1 i=1
i i
≤ P ∃W ∈ W̃, τ ∈ [L], i ∈ [N ], ω̄τ,h (W ) − ωτ,h (W ) H ≥ t
t i i t
≤ N LN∞ , W̃ max P ω̄τ,h (W ) − ωτ,h (W ) H
≥
4Bk W ∈W̃ t ,τ ∈[L],i∈[N ] 2
4Bk
(N − 1)t2
t
≤ 2N LN∞ , W̃ exp − ,
4Bk 8Bk2
where the second inequality results from the union bound, and the last inequality results from Lemma 32. For any 0 < δ < 1, we set
√ 1
2N LN∞ √N , W̃
2 2Bk
t= √ log .
N −1 δ
Then we have that
N LN∞ √1 , W̃
(BS + rBK̄ )rLK Bk N
(XIII) ≤ O √ log (46)
N δ
with probability at least 1 − δ.
For term (XII), we follow the proof of Theorem 5 and condition on the values of ξ¯ to
bound the tail probability. We have that
L N
1 XX i i
P sup ω̂τ,h (W ) − ω̄τ,h (W ) H
≥t
W ∈W̃ NL
τ =1 i=1
L N
1 XX i
= Eξ̄ P sup i
ω̂τ,h (W ) − ω̄τ,h (W ) H
≥ t ξ¯
W ∈W̃ N L
τ =1 i=1
√ (N − 1)t2
≤ 2N LN∞ (1/ N , W̃) exp − ,
32Bk2
where we condition on the values of ξ¯ in the first equality, and the inequality results from
inequality (40). Thus, we have that
√
(BS + rBK̄ )rLK Bk N LN∞ (1/ N , W̃)
(XII) ≤ O √ log (47)
N δ
with probability at least 1 − δ.
For term (XI), we just adopt the same conditional probability trick as shown in the
bound of (XII). From inequality (42), we have that
(BS + rBK̄ )4
NBr NW̃
(XI) ≤ O log (48)
NL δ
with probability at least 1 − δ.
Combining the inequalities (43), (45), (46), (47), and (48), we have that
L N
1 XX 2 i 2
= inf Eρi siτ,h+1 − fˆh ωτ,h
i
(Ŵhφ ) i
+ rτ,h − ĝh ωτ,h (Ŵhφ )
φ∈B[0,1] N L τ,h
τ =1 i=1
i ∗ i ∗
2 i ∗ i ∗
2
− sτ,h+1 − fh ωτ,h (Wh ) − rτ,h − gh ωτ,h (Wh )
L N
1 XX 2 i 2
≤ inf Eρi siτ,h+1 − fˆh ωτ,h
i
(Ŵhφ ) i
+ rτ,h − ĝh ωτ,h (Ŵhφ )
N
φ∈C[0,1] NL τ,h
τ =1 i=1
i ∗ i ∗
2 i ∗ i ∗
2
− sτ,h+1 − fh ωτ,h (Wh ) − rτ,h − gh ωτ,h (Wh )
This generalization error of the risk represents the error due to the fact that we optimize the empirical estimate of the risk rather than the population risk.
Estimation Error of Mean-embedding
L N
1 XX i 2 i 2
= 2 inf sτ,h+1 − fˆh ωτ,h
i
(Ŵhφ ) i
+ rτ,h − ĝh ωτ,h (Ŵhφ )
N
φ∈C[0,1] NL
τ =1 i=1
L X
N
1 2 i 2
(Ŵhφ ) (Ŵhφ )
X
− 2 inf siτ,h+1 − fˆh ω̄ i
ˆ τ,h i
ˆ τ,h
+ rτ,h − ĝh ω̄
N
φ∈C[0,1] NL
τ =1 i=1
L X
N
1 ∗ 2 ∗ 2
(Wh∗,φ ) (Wh∗,φ )
X
+2 siτ,h+1 − fh∗ ω̄ i
ˆ τ,h i
+ rτ,h − gh∗ ω̄ i
ˆ τ,h
NL
τ =1 i=1
L X N
1 X 2 i 2
−2 siτ,h+1 − fh∗ ωτ,h
i
(Wh∗ ) + rτ,h − gh∗ ωτ,h
i
(Wh∗ ) .
NL
τ =1 i=1
The estimation error of the mean-embedding represents the error due to the fact that we cannot observe the value of ω̂^i_{τ,h}(Ŵ_h); instead, we can only estimate it through the states of the sampled agents.
Empirical Risk Difference
L N
1 XX i 2 i 2
= 2 inf sτ,h+1 − fˆh ω̄ i
ˆ τ,h (Ŵhφ ) i
ˆ τ,h
+ rτ,h − ĝh ω̄ (Ŵhφ )
N
φ∈C[0,1] NL
τ =1 i=1
L N
1 ∗ 2
∗ 2
(Wh∗,φ ) (Wh∗,φ ) ,
XX
−2 siτ,h+1 − fh∗ ω̄ i
ˆ τ,h i
+ rτ,h − gh∗ ω̄ i
ˆ τ,h
NL
τ =1 i=1
where φ∗ ∈ C[0,1]
N is a permutation of ((i − 1)/N, i/N ] for i ∈ [N ] such that φ∗ (i/N ) = ξi .
From the estimation procedure of Algorithm (10), we have that
L N
1 XX i i ∗ 2
∗ 2
= sup ˆ τ,h
sτ,h+1 − f ω̄ (W φ◦φ ) i
+ rτ,h i
ˆ τ,h
− g ω̄ (W φ◦φ )
f,W,φ NL
τ =1 i=1
L N
1 XX 2 i 2
− siτ,h+1 − f ωτ,h
i
(W φ ) i
+ rτ,h − g ωτ,h (W φ )
NL
τ =1 i=1
L N
1 XX i ∗
≤ 4(BS + r̄BK )r̄LK sup ˆ τ,h (W φ◦φ ) − ωτ,h
ω̄ i
(W φ ) , (50)
NL H
N
W ∈W̃,φ∈C[0,1] τ =1 i=1
where the equality results from the fact that φ∗ is a measure-preserving bijection, and the
inequality results from the same arguments in inequality (33).
We decompose the error as
i ∗
ˆ τ,h
sup ω̄ (W φ◦φ ) − ωτ,h
i
(W φ ) H
W,φ
i ∗
≤ sup ω̄τ,h (W φ ) − ωτ,h
i
(W φ ) H
i
ˆ τ,h
+ sup ω̄ (W φ◦φ ) − ω̄τ,h
i
(W φ ) H
W,φ W,φ
where
Z 1Z
i
(W φ ) = W φ(ξi ), φ(β) k ·, (siτ,h , aiτ,h , s) µβτ,h (s)dsdβ,
ωτ,h
0 S
Z
1
k ·, (siτ,h , aiτ,h , s) µjτ,h (s)ds,
X
i φ
ω̄τ,h (W ) = W φ(ξi ), φ(ξj )
N −1 S
j6=i
L
∗ 1
W φ(ξi ), φ(ξj ) k ·, (siτ,h , aiτ,h , sjτ 0 ,h ) .
XX
i
(W φ◦φ ) =
ˆ τ,h
ω̄
(N − 1)L 0 j6=i τ =1
i (W φ ) − ω i (W φ )
For term supW,φ ω̄τ,h , we define the interval Ii = ((i − 1)/N, i/N ]
τ,h H
for i ∈ [N ]. Then we have that
i
ω̄τ,h (W φ ) − ωτ,h
i
(W φ ) H
2 X Z ξj Z
W φ(ξi ), φ(β) k ·, (siτ,h , aiτ,h , s) µβτ,h (s)dsdβ
≤ Bk +
N 1
ξj − N S
j6=i
Z
1 X
k ·, (siτ,h , aiτ,h , s) µjτ,h (s)ds
− W φ(ξi ), φ(ξj ) ,
N S H
j6=i
where the inequality results from the triangle inequality. For each term in the sum, we
bound it as
Z ξj Z
W φ(ξi ), φ(β) k ·, (siτ,h , aiτ,h , s) µβτ,h (s)dsdβ
1
ξj − N S
Z
1 X
k ·, (siτ,h , aiτ,h , s) µjτ,h (s)ds
− W φ(ξi ), φ(ξj )
N S H
j6=i
Z ξj Z
W φ(ξi ), φ(β) k ·, (siτ,h , aiτ,h , s) µβτ,h (s) − µjτ,h (s) dsdβ
≤
1
ξj − N S H
Z ξj Z
W φ(ξi ), φ(β) − W φ(ξi ), φ(ξj ) k ·, (siτ,h , aiτ,h , s) µjτ,h (s)dsdβ
+
1
ξj − N S H
LW Bk
=O ,
N2
where the first inequality results from the triangle inequality, and the second inequality
results from the same argument in inequality (38) and the fact that β and ξj are always in
N . Thus, we have that
the same interval for any φ ∈ C[0,1]
i φ i φ LW Bk
sup ω̄τ,h (W ) − ωτ,h (W ) H = O . (51)
W,φ N
N Lt2
≤ 2N !N LN∞ (t/(4Bk ), W̃) exp − ,
16Bk2
where the first inequality results from the proof of inequality (40), and the last inequality
results from Lemma 32. Thus, we have that with probability at least 1 − δ
r p
i φ◦φ ∗ i φ N N LN∞ ( N/L, W̃)
ˆ τ,h (W
sup ω̄ ) − ω̄τ,h (W ) H = O Bk log . (52)
W,φ L δ
h 1 X L X N L
i 2 X i
i
≤ P ∃f∈B(r, H̄), W∈W̃, max Eρi f (eτ,h , W, φ) − f (eτ,h , W, φ) ≥ t
N
φ∈C[0,1] NL τ,h NL
τ =1 i=1 τ =1
L N L
1 XX i 2 X i
≤ N ! max P ∃f∈B(r, H̄), W∈W̃, Eρi f (eτ,h , W, φ) − f (eτ,h , W, φ) ≥ t
N
φ∈C[0,1] NL τ,h NL
τ =1 i=1 τ =1
t t
≤ 14N !NH̄ , B(r, H̄) · N ∞ , W̃ ,
160(BS + rBK̄ )3 BK̄ 160(BS + rBK̄ )3 rLK Bk
where the second inequality results from the union bound and the fact that min_x f(x) − min_x g(x) ≤ max_x [f(x) − g(x)], and the final inequality results from Proposition 20. Thus, we have that with probability at least 1 − δ
L N L
1 XX
ˆ i
1 Xˆ i
inf Eρi fh (eτ,h , Ŵh , φ) − 2 inf fh (eτ,h , Ŵh , φ)
N
φ∈C[0,1] NL τ,h N
φ∈C[0,1] NL
τ =1 i=1 τ =1
(BS + rBK̄ )4
N ÑBr Ñ∞
=O log , (54)
L δ
where
3 3
ÑBr = NH̄ , B(r, H̄) , ÑW̃ = N∞ , W̃ .
L LK L
Combining inequalities (54) and (53), we have that
R̄ξ̄ (fˆh , ĝh , Ŵh ) − R̄ξ̄ (fh∗ , gh∗ , Wh∗ )
r p
LW Bk r̄LK (BS + r̄BK ) N N LN∞ ( N/L, W̃)
=O + (BS + r̄BK )r̄LK BK log
N L δ
(BS + r̄BK )4 N ÑBr ÑB̃r̃ Ñ∞
+ log ,
L δ
where
3
ÑB̃r̃ = NH̃ , B(r̃, H̃) .
L
Thus, we conclude the proof of Theorem 7.
• Bound the estimation error of distribution flow and action-value function estimate.
given ξ¯ as
L N
1 XX i i
2 i i
2
Rξ̄ (f, g, W ) = Eρ+,i sτ,h+1 − f ωτ,h (W ) + rτ,h − g ωτ,h (W )
NL τ,h
τ =1 i=1
N
1 X 2 i 2
= Eρ+,i sih+1 − f ωhi (W ) + rh+1 − g ωhi (W ) ,
N h
i=1
where the second equality results from the fact that we implement the same policy L times. The difference between this definition and Eqn. (7) is that we take the expectation with respect to ρ^{+,i}_{τ,h} instead of ρ^i_{τ,h}. The reason is that in the setting of Eqn. (7), the MDP is induced by the distribution flow of the policy itself, not by a pre-specified distribution flow. We state the performance guarantee as follows.
Corollary 21 Under Assumptions 5, 6, 7, and 1, if ξ_i = i/N for i ∈ [N], then the risk of the estimate derived in Algorithm (12) can be bounded as
Step 2: Generalize the performance guarantee from {ξ_i}^N_{i=1} to [0, 1] by Lipschitzness.
Intuitively, when the implemented policy is Lipschitz, we can generalize the performance guarantee on R_ξ̄(f, g, W) to one on R(f, g, W). Here we consider the case where the MDP
is induced by the distribution flow of the policy itself, i.e., the case specified in Section 5.
The results for the case where the MDP is induced by a pre-specified distribution flow can
be similarly derived. We note that
≤2 sup R(f, g, W ) − Rξ̄ (f, g, W ) + Rξ̄ (fˆh , ĝh , Ŵh ) − Rξ̄ (fh∗ , gh∗ , Wh∗ ).
f ∈B(r,H̄),g∈B(r̃,H̃),W ∈W̃
(55)
Then we attempt to bound the first term of the right-hand side of inequality (55). For any
two positions α, β ∈ I and f ∈ B(r, H̄), we have
α
2 β 2
Eρτ,h sτ,h+1 − f ωτ,h (W )
α − Eρβ sτ,h+1 − f ωτ,h (W )
τ,h
α
2 α
2
≤ Eρατ,h sτ,h+1 − f ωτ,h (W ) − Eρβ sτ,h+1 − f ωτ,h (W )
τ,h
α
2 β 2
+ Eρβ sτ,h+1 − f ωτ,h (W ) − Eρβ sτ,h+1 − f ωτ,h (W ) , (56)
τ,h τ,h
where the inequality results from the triangle inequality. For the first term in the right-hand
side of inequality (56), we have that
α
2 α
2
Eρτ,h sτ,h+1 − f ωτ,h (W )
α − Eρβ sτ,h+1 − f ωτ,h (W )
τ,h
h
≤ (BS + rBK̄ )2 kµατ,h − µβτ,h k1 + Eµατ,h kπτ,h β
α
(· | s) − πτ,h (· | s)k1
i
+ LP zhα (µIτ,h , Wh∗ ) − zhβ (µIτ,h , Wh∗ ) 1
≤ C(BS + rBK̄ )2 · |α − β|,
where C > 0 is a constant, the first inequality results from the definition of ρIτ,h , and the
last inequality adopts Proposition 19 and Assumption 5 to bound these three terms. The
second term in the right-hand side of inequality (56) can be bounded as
α
2 β 2
Eρβ sτ,h+1 − f ωτ,h (W ) − Eρβ sτ,h+1 − f ωτ,h (W )
τ,h τ,h
where the inequality results from Lemma 33 and Assumption 5. Thus, we conclude that
α
2 β 2
Eρατ,h sτ,h+1 − f ωτ,h (W ) − Eρβ sτ,h+1 − f ωτ,h (W )
τ,h
= O (BS + rBK̄ )(BS + rBK̄ + rLk Bk )|α − β| .
By decomposing the interval [0, 1] into the disjoint union of intervals ((i − 1)/N, i/N ]
for i ∈ [N ] and using this result, we can bound the first term of the right-hand side of
inequality (55) as
$$ \sup_{f \in \mathcal{B}(r,\bar{\mathcal{H}}),\, g \in \mathcal{B}(\tilde{r},\tilde{\mathcal{H}}),\, W \in \tilde{\mathcal{W}}} \big| R(f, g, W) - R_{\bar{\xi}}(f, g, W) \big| = O\bigg( \frac{(B_S + \bar{r}B_K)(B_S + \bar{r}B_K + \bar{r}L_K B_k)}{N} \bigg). \tag{57} $$
Eqn. (57) implies that we can transfer the results in Corollary 12 and Corollary 21 to
R(fˆh , ĝh , Ŵh ) with an additional term shown in Eqn. (57). Thus, for the case where the
MDP is induced by the distribution flow of the policy itself, we have that
For the case where the MDP is induced by a pre-specified distribution flow, we have that
Proposition 22 Given two GMFGs (P*, r*, W*) and (P̂, r̂, Ŵ), for a policy π^I ∈ Π̃, we define the distribution flows induced by this policy as µ^I = Γ_2(π^I, W*) and µ̂^I = Γ̂_2(π^I, Ŵ). Assume that the transition kernels P* and P̂ are equivalently defined by f* and f̂ ∈ B(r, H̄) from Eqn. (2). Under Assumption 8, we have that
$$ \sum_{h=1}^{H} \big\| \hat{\mu}^{\alpha}_h - \mu^{\alpha}_h \big\|_1 \le H(1 + r L_K L_{\varepsilon} B_k)^H \bigg( \sum_{m=1}^{H} \int_0^1 e^{\pi,\beta}_m \, d\beta + \sum_{m=1}^{H} e^{\pi,\alpha}_m \bigg), $$
where e^{π,α}_h is defined as
$$ e^{\pi,\alpha}_h = L_{\varepsilon} \sqrt{ \mathbb{E}_{\rho^{\alpha}_h} \Big[ \big( \hat{f}_h\big(\omega^{\alpha}_h(\hat{W}_h)\big) - f^*_h\big(\omega^{\alpha}_h(W^*_h)\big) \big)^2 \Big] }, \qquad \omega^{\alpha}_h(W) = \int_0^1 \int_{S} W(\alpha, \beta)\, k\big(\cdot, (s^{\alpha}_h, a^{\alpha}_h, s)\big)\, \mu^{\beta}_h(s)\, ds\, d\beta, $$
and ρ^α_h = µ^α_h × π^α_h for α ∈ I.
Since we implement the same policy π^I_t for L times in Step 1 of Algorithm 2, the ρ^α_{τ,h} for τ ∈ [L] are the same. Thus, we have
$$ d(\hat{\mu}^I_t, \mu^I_t) = \sum_{h=1}^{H} \int_0^1 \big\| \hat{\mu}^{\alpha}_{t,h} - \mu^{\alpha}_{t,h} \big\|_1 \, d\alpha \le C \sum_{h=1}^{H} \sqrt{ R(\hat{f}_h, \hat{g}_h, \hat{W}_h) - R(f^*_h, g^*_h, W^*_h) }, $$
where C > 0 is a constant, and the inequality results from Proposition 22 and the Hölder inequality. The right-hand side of this inequality will play the role of ε_µ in the proof of Theorem 4, and it is bounded in Eqn. (58).
Next, we bound the estimation error of the action-value function.
Proposition 23 Assume that we have two GMFGs (P*, r*, W*) and (P̂, r̂, Ŵ). For a policy π^I ∈ Π̃, a behavior policy π^{b,I} ∈ Π̃, and a distribution flow µ^I ∈ ∆̃, we define the distribution flow induced by the behavior policy on the GMFG (P*, r*, W*) with underlying distribution flow µ^I as µ^{b,I} = Γ_3(π^{b,I}, µ^I, W*). Assume that the transition kernels P* and P̂ are equivalently defined by f* and f̂ ∈ B(r, H̄) from Eqn. (2), and that the reward functions r* and r̂ are equivalently defined by g* and ĝ ∈ B(r̃, H̃) from Eqn. (2). Assume that sup_{s∈S, a∈A, α∈I, h∈[H]} π^α_h(a | s)/π^{b,α}_h(a | s) ≤ C. Under Assumption 8, we have that
$$ \mathbb{E}_{\rho^{b,\alpha}_h} \Big[ \big| \hat{Q}^{\lambda,\alpha}_h(s, a, \pi^{\alpha}, \mu^I, \hat{W}) - Q^{\lambda,\alpha}_h(s, a, \pi^{\alpha}, \mu^I, W^*) \big| \Big] \le C^H \sum_{m=h}^{H} \sqrt{ \mathbb{E}_{\rho^{b,\alpha}_m} \Big[ \big( \hat{g}_m\big(\omega^{\alpha}_m(\hat{W}_m)\big) - g^*_m\big(\omega^{\alpha}_m(W^*_m)\big) \big)^2 \Big] } \\ \quad + L_{\varepsilon} H (1 + \lambda \log|A|)\, C^H \sum_{m=h}^{H} \sqrt{ \mathbb{E}_{\rho^{b,\alpha}_m} \Big[ \big( \hat{f}_m\big(\omega^{\alpha}_m(\hat{W}_m)\big) - f^*_m\big(\omega^{\alpha}_m(W^*_m)\big) \big)^2 \Big] }, $$
where ρ^{b,α}_h is defined as ρ^{b,α}_h = µ^{b,α}_h · π^{b,α}_h, and Q̂^{λ,α}_h(s, a, π^α, µ^I, Ŵ) and Q^{λ,α}_h(s, a, π^α, µ^I, W*) denote the action-value functions on the GMFGs (P̂, r̂, Ŵ) and (P*, r*, W*), respectively.
Next, we will make use of Proposition 23 to bound the estimation error of action-value
function. Here, we adopt a different method to bound term (I) defined in Step 1 of the proof
of Theorem 4. From inequality (78), we have
+ ηt+1 Q̂λ,α α ˆI α α
h (sh , ·, πt , µ̄t , Ŵ ), π̂t+1,h (· | sh ) − πt+1,h (· | sh )
∗,I
For the third term in the right-hand side of inequality (60), if p = π̄t,h (· | sh ), we have that
∗,I
Qλ,α α ˆI ∗ λ,α α ˆI α
h (sh , ·, πt , µ̄t , W ) − Q̂h (sh , ·, πt , µ̄t , Ŵ ), π̄t,h (· | sh ) − πt+1,h (· | sh )
X λ,α
ˆIt , W ∗ ) − Q̂λ,α
Qh (sh , ah , πtα , µ̄ α ˆI
= h (sh , ah , πt , µ̄t , Ŵ )
a∈A
∗,I α
b,α
π̄t,h (ah | sh ) − πt+1,h (ah | sh )
· πt,h (ah | sh ) · b,α
πt,h (ah | sh )
Qλ,α λ,α b,α
X
≤ (Cπ + Cπ0 ) α ˆI ∗ α ˆI
h (sh , ah , πt , µ̄t , W ) − Q̂h (sh , ah , πt , µ̄t , Ŵ ) · πt,h (ah | sh ),
a∈A
(61)
where the inequality results from Assumption 9. We note that we can let p = π̄^{*,I}_{t,h}(· | s_h) throughout our proof, because this bound is used to upper bound the right-hand side of inequality (22), which is itself proved by taking p = π̄^{*,I}_{t,h}(· | s_h). Now we can define a new Λ^α_{t+1,h} with the terms in inequalities (60) and (61) replacing the original upper bound on term (I). In this case, the term ε_Q in inequality (27) can be replaced by the upper bound on the expectation of the third term on the right-hand side of inequality (60).
H
X
Qλ,α Q̂λ,α b,α
X
Eπ̄t∗,α ,µ̄It (Cπ + Cπ0 ) α ˆI ∗
h (sh , ah , πt , µ̄t , W ) − α ˆI
h (sh , ah , πt , µ̄t , Ŵ ) · πt,h (ah | sh )
h=1 a∈A
H q
X
Cπ0 )Cπ00 CπH H R(fˆh0 , ĝh0 , Ŵh0 ) − R(fh∗ , gh∗ , Wh∗ ),
≤ (Cπ + 1 + Lε H(1 + λ log |A|)
h=1
where the inequality results from Propositions 27 and 23 and Assumption 10. The right-hand side of this inequality can be further bounded using Eqn. (59).
Step 4: Conclude the final result.
Replacing εµ and εQ with the derived new bounds and using the union bound, we have
that
X T X T
1 I ∗,I 1 I ∗,I
D πt , π +d ˆt , µ
µ̄
T T
t=1 t=1
√ √
(BS + r̄BK )1/4 (r̄LK Bk )1/4
log T 1/4 T N LN∞ (1/ N , W̃)
=O +O log
T 1/3 (N L)1/8 δ
T NBr NB̃r̃ NW̃ (BS + r̄BK ) (BS + r̄BK + r̄LK Bk )1/4
1/4
BS + r̄BK 1/4
+ log + .
(N L)1/4 δ N 1/4
Thus, we conclude the proof of Corollary 9.
For ease of notation, we only write the definition of each term for the transition kernel; the terms for the reward functions can be easily derived. The Estimation Error of Mean-embedding can be bounded using inequality (50) in the proof of Theorem 7. In fact, since ψ* is the inverse function of φ*, the expression for the Estimation Error of Mean-embedding here is the same as the term in inequality (50). The generalization error of the risk can be bounded using inequality (54) in the proof of Theorem 7. Thus, we conclude the proof of Corollary 8.
(62)
Here we adopt steps similar to those in the proof of Corollary 9. We note that the only different procedure is the first step. Next, we will derive the performance guarantee of Algorithm (62).
In this setting, we implement {π^{(ξ_i − 1/N, ξ_i]}}^N_{i=1} for L times on the MDP induced by {µ^{(ξ_i − 1/N, ξ_i]}}^N_{i=1} to collect the dataset D_τ = {(s^{[N]}_{τ,h}, a^{[N]}_{τ,h}, r^{[N]}_{τ,h}, s^{[N]}_{τ,h+1})}^H_{h=1} for τ ∈ [L]. We define µ^{+,I} = Γ_3(π^I, µ^I, W*) as the distribution flow obtained by implementing π^I on the MDP induced by µ^I. We highlight that we will not use this quantity in the estimation procedure, but only in the analysis. The joint distribution of (s^i_{τ,h}, a^i_{τ,h}, r^i_{τ,h}, s^i_{τ,h+1})^N_{i=1} is ∏^N_{i=1} ρ^{+,i}_{τ,h}, where ρ^{+,i}_{τ,h} = µ^{+,i}_{τ,h} × π^i_{τ,h} × δ_{r_h} × P^*_h. As in the proof of Corollary 8, we define two bijections ψ*, φ* ∈ C^N_{[0,1]} by ψ*(ξ_i) = i/N for all i ∈ [N] and φ* ◦ ψ*(α) = φ*(ψ*(α)) = α for all α ∈ I.
With a slight abuse of notation, we define the risk of (f, g, W) given ξ̄ as
$$ R_{\bar{\xi}}(f, g, W) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{\rho^{+,i}_h} \Big[ \big( s^i_{h+1} - f\big(\omega^i_h(W)\big) \big)^2 + \big( r^i_{h+1} - g\big(\omega^i_h(W)\big) \big)^2 \Big]. $$
Then we only need to follow Steps 2, 3, and 4 in the proof of Corollary 9 exactly to prove the desired results. Thus, we conclude the proof of Corollary 11.
≤ d_1 d_2 d(µ^I, µ̃^I).
The Banach fixed-point theorem shows that there exists a fixed point of Γ_2 ◦ Γ^λ_1, which we denote by µ^{*,I}. We then define π^{*,I} = Γ^λ_1(µ^{*,I}, W*). Definition 1 shows that (π^{*,I}, µ^{*,I}) is an NE.
Next, we prove the uniqueness of the NE. Assume that there are two NEs (π^{*,I}, µ^{*,I}) and (π̃^{*,I}, µ̃^{*,I}). From Definition 1 of the NE, we have that
π^{*,I} = Γ^λ_1(µ^{*,I}, W*), µ^{*,I} = Γ_2(π^{*,I}, W*), π̃^{*,I} = Γ^λ_1(µ̃^{*,I}, W*), µ̃^{*,I} = Γ_2(π̃^{*,I}, W*).
Thus, we have d(µ^{*,I}, µ̃^{*,I}) = 0, which implies that they differ only on a set of agents of measure zero with respect to the Lebesgue measure on [0, 1]. Thus, we conclude the proof of Proposition 3.
Appendix P. Lipschitzness of NE
Proposition 25 Under Assumptions 1 and 5, for any NE (π^{λ,I}, µ^{λ,I}) of the λ-regularized GMFG with λ > 0, we have that
$$ \big\| \pi^{\lambda,\alpha}_h(\cdot \mid s) - \pi^{\lambda,\beta}_h(\cdot \mid s) \big\|_1 \le \frac{2H L_W \big[ L_r + H(1 + \lambda\log|A|) L_P \big]}{\lambda} |\alpha - \beta| \quad \text{for all } h \in [H],\ s \in S. $$
Proof [Proof of Proposition 25] For any distribution flow µ^I, we denote the optimal value function in the λ-regularized MDP induced by µ^I as V^{*,I} = (V^{*,I}_h)^H_{h=1}. Then we prove the proposition in two steps:
• Given any distribution flow µ^I, the optimal value function V^{*,I} is Lipschitz in the positions of the agents, i.e., |V^{*,α}_h(s) − V^{*,β}_h(s)| ≤ H[L_r + H(1 + λ log|A|)L_P]L_W |α − β| for all s ∈ S and h ∈ [H].
• Any policy π^I that achieves the optimal value function V^{*,I} is Lipschitz in the positions of the agents.
These two steps conclude the proof of Proposition 25 by noting that, for any λ-NE (π^{λ,I}, µ^{λ,I}), the policy π^{λ,I} achieves the maximal accumulated reward in the MDP induced by µ^{λ,I} according to Definition 1.
Step 1: Show that the optimal value function V^{*,I} is Lipschitz in the positions of the agents.
For any distribution flow µ^I ∈ ∆(S)^{I×H}, we define an operator acting on functions u : S → R as
$$ T^{\mu^I,\alpha}_h u(s) = \sup_{p \in \Delta(A)} \sum_{a \in A} p(a)\, r_h(s, a, z^{\alpha}_h) - \lambda R(p) + \sum_{a \in A} \int_{S} p(a)\, P_h(s' \mid s, a, z^{\alpha}_h)\, u(s')\, ds' \quad \text{for } h \in [H-1], \\ T^{\mu^I,\alpha}_H u(s) = \sup_{p \in \Delta(A)} \sum_{a \in A} p(a)\, r_H(s, a, z^{\alpha}_H) - \lambda R(p), $$
where R(·) is the negative entropy function. Since V^{*,I} is the optimal value function of the MDP induced by µ^I, we have that
$$ T^{\mu^I,\alpha}_h V^{*,\alpha}_{h+1}(s) = V^{*,\alpha}_h(s) \quad \text{and} \quad V^{*,\alpha}_{H+1}(s) = 0 \quad \text{for all } s \in S,\ h \in [H],\ \alpha \in I. $$
where the first inequality results from Assumption 1. Noting that ‖z^α_h − z^β_h‖_1 ≤ ‖∫_0^1 (W_h(α, γ) − W_h(β, γ)) µ^γ_h dγ‖_1 ≤ L_W |α − β|, we have
$$ \sup_{s \in S} \big| V^{*,\alpha}_h(s) - V^{*,\beta}_h(s) \big| \le \big[ L_r + H(1 + \lambda\log|A|) L_P \big] L_W |\alpha - \beta| + \sup_{s \in S} \big| V^{*,\alpha}_{h+1}(s) - V^{*,\beta}_{h+1}(s) \big|. $$
Summing this inequality over t = h, . . . , H and noting that V^{*,α}_{H+1}(s) = 0, we have
$$ \sup_{s \in S} \big| V^{*,\alpha}_h(s) - V^{*,\beta}_h(s) \big| \le H \big[ L_r + H(1 + \lambda\log|A|) L_P \big] L_W |\alpha - \beta|. $$
Step 2: Any policy that achieves the optimal value function V^{*,I} is Lipschitz in the positions of the agents.
Assume that policy π^I achieves the optimal value function V^{*,I}. For any α, β ∈ I, s ∈ S, and h ∈ [H], we have that
$$ \pi^{\alpha}_h(\cdot \mid s) = \operatorname*{argmax}_{p \in \Delta(A)} \sum_{a \in A} p(a)\, r_h(s, a, z^{\alpha}_h) - \lambda R(p) + \sum_{a \in A} \int_{S} p(a)\, P_h(s' \mid s, a, z^{\alpha}_h)\, V^{*,\alpha}_{h+1}(s')\, ds', \\ \pi^{\beta}_h(\cdot \mid s) = \operatorname*{argmax}_{p \in \Delta(A)} \sum_{a \in A} p(a)\, r_h(s, a, z^{\beta}_h) - \lambda R(p) + \sum_{a \in A} \int_{S} p(a)\, P_h(s' \mid s, a, z^{\beta}_h)\, V^{*,\beta}_{h+1}(s')\, ds'. $$
Define y^α(s, a) = r_h(s, a, z^α_h) + ∫_S P_h(s' | s, a, z^α_h) V^{*,α}_{h+1}(s') ds' for all α ∈ I. Lemma 34 shows that
$$ \big\| \pi^{\alpha}_h(\cdot \mid s) - \pi^{\beta}_h(\cdot \mid s) \big\|_1 \le \frac{1}{\lambda} \big\| y^{\alpha}(s, \cdot) - y^{\beta}(s, \cdot) \big\|_{\infty}. $$
The term ‖y^α(s, ·) − y^β(s, ·)‖_∞ can be bounded as
$$ \big\| y^{\alpha}(s, \cdot) - y^{\beta}(s, \cdot) \big\|_{\infty} \le \big[ L_r + H(1 + \lambda\log|A|) L_P \big] L_W |\alpha - \beta| + H \big[ L_r + H(1 + \lambda\log|A|) L_P \big] L_W |\alpha - \beta| \le 2H \big[ L_r + H(1 + \lambda\log|A|) L_P \big] L_W |\alpha - \beta|, $$
where the first inequality results from the triangle inequality, and the second inequality results from Step 1. Thus, we conclude that
$$ \big\| \pi^{\alpha}_h(\cdot \mid s) - \pi^{\beta}_h(\cdot \mid s) \big\|_1 \le \frac{2H \big[ L_r + H(1 + \lambda\log|A|) L_P \big] L_W}{\lambda} |\alpha - \beta|, $$
which proves the claim of Proposition 25.
kµαh+1 − µβh+1 k1
Z XZ XZ
= 0 α α α
Ph (s | s, a, zh )µh (s)πh (a | s)ds − Ph (s0 | s, a, zhβ )µβh (s)πhβ (a | s)ds ds0
S a∈A S a∈A S
where the first inequality results from the triangle inequality, and the second inequality
results from Assumptions 5 and 1. We further bound the first term in the right-hand side of
inequality (63) as
Z 1 Z 1
β γ
α
kzh − zh k1 = Wh (α, γ)µh dγ − Wh (β, γ)µγh dγ ≤ LW |α − β|,
0 0 1
where the inequality results from Assumption 5. Substituting this inequality to the right-hand
side of inequality (63), we derive that
h=1 s∈S
which results from that µα1 = µβ1 . Thus, we concludes the proof of Proposition 19.
holds. If such function does not exist, then fW is an arbitrary function in FW̃ . Then we
have that
i
i
2
Eρi fW (ẽτ,h ) − Eρi fW (ẽτ,h ) Dh Dh
τ,h τ,h
h 2 i
≤ Eρi fW (ẽiτ,h ) Dh
τ,h
2 i
∗ i ∗
2
≤ 4(BS + rBK̄ ) Eρi f ω̃τ,h (W ) − fh ω̃τ,h (Wh ) Dh
τ,h
where the second inequality results from Lemma 33, and the last equality results from the fact that E_{ρ^i_{τ,h}}[s̃^i_{τ,h+1} | D_h, ω̃^i_{τ,h}(W^*_h)] = f^*_h(ω̃^i_{τ,h}(W^*_h)). Then the tail probability for the ghost sample D̃_h is bounded as
1 X i
1 X i ε 1 X i
P Eρi fW (ẽτ,h ) Dh − fW (ẽτ,h ) ≥ α+β+ Eρi fW (ẽτ,h ) Dh
NL τ,h NL 2 NL τ,h
τ,i τ,i τ,i
2
1 P i 1 P i
E N L τ,i Eρi fW (ẽτ,h ) Dh − N L τ,i fW (ẽτ,h )
τ,h
≤
2
ε(α+β) ε P
i
2 + 2N L E
τ,i ρi f (ẽ
W τ,h ) D h
τ,h
4(BS +rBK̄ )2 1
fW (ẽiτ,h ) Dh
P
NL NL τ,i Eρiτ,h
≤
2
ε(α+β) ε P i
2 + 2N L τ,i Eρiτ,h fW (ẽτ,h ) Dh
4(BS + rBK̄ )2
≤ ,
(α + β)N Lε2
where the first inequality results from Chebyshev inequality, the second inequality results
from inequality (64), and the last inequality results from x/(a + x)2 ≤ 1/(4a) for any x, a > 0.
When N L ≥ 32(BS + rBK̄ )2 /((α + β)ε2 ), we have that
1 X 1 X
Eρi fW (ẽiτ,h ) Dh − fW (ẽiτ,h )
P
NL τ,h NL
τ,i τ,i
ε 1 X i
1
≥ α+β+ Eρi fW (ẽτ,h ) Dh ≤ . (65)
2 NL τ,h 8
τ,i
1 X 2 i 1 X 2 i 1 X 2 i
fW (ẽτ,h ) − Eρi fW (ẽτ,h ) ≤ ε α + β + Eρi fW (ẽτ,h )
NL NL τ,h NL τ,h
τ,i τ,i τ,i
1 X 2 i 2 i 1 X 2 i
+ 2P ∃fW ∈ FW̃ , fW (ẽτ,h )−Eρi fW (ẽτ,h ) ≤ ε α+β + Eρi fW (ẽτ,h )
NL τ,h NL τ,h
τ,i τ,i
where the inequality results from the union bound. Let δ > 0 and let F_δ be an L_1 δ-cover of F_W̃ on {e^i_{τ,h}}^{L,N}_{τ,i=1}. Then for any f_W ∈ F_W̃, there exists f̄_W ∈ F_δ such that
1 X
fW (eiτ,h ) − f¯W (eiτ,h ) ≤ δ.
NL
τ,i
1 X i 1 X i ¯
Uτ,h fW (eiτ,h ) − Uτ,h fW (eiτ,h ) ≤ δ
NL NL
τ,i τ,i
1 X 2 i 1 X ¯2 i
fW (eτ,h ) − fW (eτ,h ) ≥ −2(BS + rBK̄ )2 δ,
NL NL
τ,i τ,i
where these inequalities result from the triangle inequality. In the following, we take δ = εβ/5. Thus, we can bound the right-hand side of inequality (71) as
1 X i ε(α + β)
P ∃fW ∈ FW̃ , Uτ,h fW (eiτ,h ) ≥
NL 4
τ,i
ε2 (α + β)
ε(1 − ε) 1 X 2 i i L,N
− + fW (eτ,h ) {eτ,h }τ,i=1
8(BS + rBK̄ )2 (1 + ε) 8(BS + rBK̄ )2 (1 + ε) N L
τ,i
εβ i L,N 1 X i εα
≤ N1 , FW̃ , {eτ,h }τ,i=1 max P Uτ,h fW (eiτ,h ) ≥
5 fW ∈F εβ NL 4
5 τ,i
ε2 α
ε(1 − ε) 1 X 2 i i L,N
− + fW (eτ,h ) {eτ,h }τ,i=1
8(BS + rBK̄ )2 (1 + ε) 8(BS + rBK̄ )2 (1 + ε) N L
τ,i
ε2 (1 − ε)αN L
εβ i L,N
≤ 2N1 , FW̃ , {eτ,h }τ,i=1 exp − (72)
5 20(BS + rBK̄ )2 (1 + ε)
1 X
fW (eiτ,h ) − f¯W (eiτ,h )
NL
τ,i
where the inequality results from Lemma 33 and the triangle inequality. Thus, we have that
i L,N δ δ
N1 δ, FW̃ , {eτ,h }τ,i=1 ≤ NH̄ , B(r, H̄) · N∞ , W̃
4(BS + rBK̄ )BK̄ 4(BS + rBK̄ )rLK Bk
(73)
where the last inequality results from inequality (73). For N L ≤ 32(BS + rBK̄ )2 /((α + β)ε2 ),
we have that
ε2 (1 − ε)αN L
32(1 − ε)α 32 1
exp − 4
≥ exp − 2
≥ exp − ≥ .
20(BS + rBK̄ ) (1 + ε) 20(BS + rBK̄ ) (1 + ε)(α + β) 80 14
Pn
i=1 B − Eρ i g(Z) Eρi g(Z)
≤ 2 , (74)
n2 β 2 α + n1 ni=1 Eρi g(Z)
P
where the first inequality results from the Chebyshev inequality, and the last inequality results from the fact that g : X → [0, B]. For two constants a, b > 0 and variables 0 ≤ x_i ≤ b for i ∈ [n], some basic calculus shows that
Pn
i=1 (b − xi )xi nb
f (x1 , · · · , xn ) = 1 Pn ≤ .
a + n i=1 xi 2a
We take β = ε/4. If n ≥ 16B/(ε2 α), such probability is upper bounded by 1/2. Then we
have that
1 Pn 1 Pn
−
n i=1 g(Z i ) n i=1 Eρi g(Z)
P sup 1 Pn 1 Pn
>ε
g∈G α + n i=1 g(Zi ) + n i=1 Eρi g(Z)
n n
1X 3ε 1X
≤ 2P ∃g ∈ G, g(Zi ) − g(Z̃i ) ≥ 2α + g(Zi ) + g(Z̃i ) , (75)
n 8 n
i=1 i=1
where the inequality results from the conditional probability trick. The detailed procedure
can be found in Györfi et al. (2002, Theorem 11.6).
Step 2: Additional randomization by random signs.
Let {U_i}^n_{i=1} be independent random variables, uniformly distributed on {+1, −1} and independent of Z_1^n and Z̃_1^n. Then we have that
n n
1X 3ε 1X
P ∃g ∈ G, g(Zi ) − g(Z̃i ) ≥ 2α + g(Zi ) + g(Z̃i )
n 8 n
i=1 i=1
n n
1 X 3ε 1X
≤ 2E P ∃g ∈ G, Ui g(Zi ) ≥ α+ g(Zi ) Z1n = z1n , (76)
n 8 n
i=1 i=1
where the inequality results from the union bound. Let δ > 0 and let G_δ be an L_1 δ-cover of G on z_1^n. Then for any g ∈ G, there exists ḡ ∈ G_δ such that (1/n) Σ_{i=1}^n |g(z_i) − ḡ(z_i)| ≤ δ. Thus, we have that
have that
n n
1X 3ε 1X
P ∃g ∈ G, Ui g(Zi ) ≥ α+ g(Zi ) Z1n = z1n
n 8 n
i=1 i=1
n n
1 X 3ε 1X n n
≤ P ∃g ∈ Gδ , δ + Ui g(Zi ) ≥ α−δ+ g(Zi ) Z1 = z1
n 8 n
i=1 i=1
X n n
1 3εα 3εδ 3ε 1 X n n
≤ |Gδ | max P Ui g(Zi ) ≥ − −δ+ g(Zi ) Z1 = z1 ,
g∈Gδ n 8 8 8 n
i=1 i=1
where the last inequality follows from the union bound. Take δ = εα/5, then we have
3εα 3εδ εα
− −δ ≥ .
8 8 10
Thus, we can control the tail probability as
n n
1X 3ε 1X n n
P ∃g ∈ G, Ui g(Zi ) ≥ α+ g(Zi ) Z1 = z1
n 8 n
i=1 i=1
X n n
εα n 1 εα 3ε 1 X n n
≤ N1 , G, z1 max P Ui g(Zi ) ≥ + g(Zi ) Z1 = z1
5 g∈G εα n 10 8 n
5 i=1 i=1
4 Pn 2
2
εα 9ε 15 nαP + i=1 g(zi )
≤ N1 , G, z1n exp − n
5 128B i=1 g(zi )
2
εα 3αε n
≤ N1 , G, z1n exp − , (77)
5 40B
where the second inequality results from Hoeffding's inequality, and the last inequality results from the fact that (a + y)²/y ≥ 4a for any a, y > 0. Combining inequalities (75), (76) and (77), we conclude the proof of Proposition 26.
Then the first-order optimal condition of Eqn. (16) is that for any p ∈ ∆(A)
α
π̂t+1,h (· | s)
λ,α α ˆI α α
ηt+1 Q̂h (s, ·, πt , µ̄t , Ŵ ) − ληt+1 log π̂t+1,h (· | s) − log α , p − π̂t+1,h (· | s) ≤ 0.
πt,h (· | s)
Note that
Then we have
h i
ηt+1 Q̂λ,α α ˆI α α α α
h (s,·, πt , µ̄ t , Ŵ ), p− π̂t+1,h (·|s) +λη t+1 R π̂t+1,h (·|s) −R(p) +KL π̂t+1,h (·|s)kπt,h (·|s)
α α
≤ KL pkπt,h (· | s) − (1 + ληt+1 )KL pkπ̂t+1,h (· | s) .
+ ηt+1 Q̂λ,α α ˆI α α
h (sh , ·, πt , µ̄t , Ŵ ), π̂t+1,h (· | sh ) − πt+1,h (· | sh )
where the second inequality results from the triangle inequality, and the last inequality results from the Hölder inequality and the triangle inequality. To bound the second term in the right-hand side of inequality (78), we state the following proposition.
Proposition 27 Under Assumption 1, for any policy π^I and any two distribution flows µ^I and µ̃^I, we have that
$$ \big| Q^{\lambda,\alpha}_h(s, a, \pi^{\alpha}, \mu^I, W^*) - Q^{\lambda,\alpha}_h(s, a, \pi^{\alpha}, \tilde{\mu}^I, W^*) \big| \le \big[ L_r + H(1 + \lambda\log|A|) L_P \big] \sum_{m=h}^{H} \int_0^1 \big\| \mu^{\beta}_m - \tilde{\mu}^{\beta}_m \big\|_1 \, d\beta, \\ \big| V^{\lambda,\alpha}_h(s, \pi^{\alpha}, \mu^I, W^*) - V^{\lambda,\alpha}_h(s, \pi^{\alpha}, \tilde{\mu}^I, W^*) \big| \le \big[ L_r + H(1 + \lambda\log|A|) L_P \big] \sum_{m=h}^{H} \int_0^1 \big\| \mu^{\beta}_m - \tilde{\mu}^{\beta}_m \big\|_1 \, d\beta. $$
where the inequality results from the triangle inequality. Thus, we have
$$ \mathrm{(III)} = R\big(\pi^{\alpha}_{t+1,h}(\cdot \mid s_h)\big) - R\big(\hat{\pi}^{\alpha}_{t+1,h}(\cdot \mid s_h)\big) + \sum_{a \in A} \big( \pi^{\alpha}_{t+1,h}(a \mid s_h) - \hat{\pi}^{\alpha}_{t+1,h}(a \mid s_h) \big) \log\frac{1}{\pi^{\alpha}_{t,h}(a \mid s_h)} \\ \le \sum_{a \in A} \big| \pi^{\alpha}_{t+1,h}(a \mid s_h) - \hat{\pi}^{\alpha}_{t+1,h}(a \mid s_h) \big| \log\frac{|A|}{\beta_t} \le 2\beta_{t+1} \log\frac{|A|}{\beta_t}, $$
where the last inequality results from the definitions of π^α_{t+1,h} and π̂^α_{t+1,h}.
For term (IV), Lemma 36 shows that for β_{t+1} ≤ 1/2, we have that (IV) ≤ 2(1 + λη_{t+1})β_{t+1}. Summing these four terms, we conclude the proof of the proposition.
Step 1: Prove that E_{π*}[V^λ_h(s_h, π*) − V^λ_h(s_h, π)] ≥ γ* E_{π*}[V^λ_{h+1}(s_{h+1}, π*) − V^λ_{h+1}(s_{h+1}, π)] for all h ∈ [H].
If πt = πt∗ for all t ≥ h + 1, then the result trivially holds. In the following, we assume
that πt 6= πt∗ for some t ≥ h + 1. This implies that
λ
Eπ∗ [Vh+1 (sh+1 , π ∗ ) − Vh+1
λ
(sh+1 , π)] > 0.
For ease of notation, we define
$$ y(s, a) = r_h(s, a) + \int_S P_h(s' \mid s, a)\, V_{h+1}(s', \pi)\, ds', \qquad y^*(s, a) = r_h(s, a) + \int_S P_h(s' \mid s, a)\, V_{h+1}(s', \pi^*)\, ds'. $$
$$ \mathbb{E}_{\pi^*}\big[ V^{\lambda}_{h+1}(s_{h+1}, \pi^*) - V^{\lambda}_{h+1}(s_{h+1}, \pi) \big] = \mathbb{E}_{\pi^*}\big[ \langle y^*(s_h, \cdot) - y(s_h, \cdot), \, \pi^*_h(\cdot \mid s_h) \rangle \big], $$
where R(p) = ⟨p, log p⟩. In the following, we will prove that for any s ∈ S
h i
hy(s, ·), πh∗ (· | s) − πh (· | s)i + λ R πh (· | s) − R πh∗ (· | s) + hy ∗ (s, ·) − y(s, ·), πh∗ (· | s)i
These distributions admit the closed-form expressions p*(a) = exp(y*(s, a)/λ)/Z*(s) and p(a) = exp(y(s, a)/λ)/Z(s), where Z*(s) = Σ_{a} exp(y*(s, a)/λ) and Z(s) = Σ_{a} exp(y(s, a)/λ). To prove inequality (80), it suffices to prove that
hy(s, ·), p∗ − pi + λ R(p) − R(p∗ ) ≥ (γ ∗ − 1)hy ∗ (s, ·) − y(s, ·), p∗ i.
(81)
The left-hand side the inequality (81) is
p∗
∗ ∗ ∗ ∗ ∗
hy(s, ·), p − pi + λ R(p) − R(p ) = hλ log p, p − pi + λ R(p) − R(p ) = −λ p , log ,
p
(82)
where the first equality results from the closed-form expression of p, and the second equality results from the definition of R(·). We further expand this term as
exp y ∗ (s, ·)/λ
p∗
∗ Z(s) ∗
−λ p , log = −λ log ∗ − , y (s, ·) − y(s, ·) , (83)
p Z (s) Z ∗ (s)
where the equalities result from the closed-form expressions of p and p*. The right-hand side of inequality (81) is
p∗ ∗ Z ∗ (s)
∗ ∗ ∗ ∗
(γ − 1)hy (s, ·) − y(s, ·), p i = (γ − 1)λ log , p + log , (84)
p Z(s)
where the equalities result from the closed-form expressions of p and p*. Combining Eqn. (82),
(83), and (84), we have
γ∗
∗
Z ∗ (s)
y (s, ·)
⇔ exp , y (s, ·) − y(s, ·) ≤ Z ∗ (s) log
∗
. (85)
λ λ Z(s)
In the following, we prove inequality (85). The right-hand side of (85) can be lower bounded as
$$Z^*(s)\,\log\frac{Z^*(s)}{Z(s)} \ge \frac{\log B}{B-1}\sum_{a\in A}\exp\big(y^*(s,a)/\lambda\big)\cdot\bigg(\frac{\sum_{a\in A}\exp\big(y^*(s,a)/\lambda\big)}{\sum_{a\in A}\exp\big(y(s,a)/\lambda\big)} - 1\bigg) \ge \frac{\log B}{(B-1)\lambda}\cdot\sum_{a\in A}\exp\big(y(s,a)/\lambda\big)\big(y^*(s,a) - y(s,a)\big), \tag{86}$$
where $B = \exp\big(H(1+\lambda\log|A|)/\lambda\big)$, the first inequality results from the fact that $\frac{\log B}{B-1}(x-1) \le \log x$ for $x\in[1,B]$ together with the facts that $y^*(s,a) \ge y(s,a)$ and $|y^*(s,a)| \le H(1+\lambda\log|A|)$ for all $s\in S$ and $a\in A$, and the second inequality results from $\exp(x) - 1 \ge x$ and $y^*(s,a) \ge y(s,a)$. The left-hand side of inequality (85) can be upper bounded as
$$\frac{\gamma^*}{\lambda}\Big\langle \exp\big(y^*(s,\cdot)/\lambda\big),\; y^*(s,\cdot) - y(s,\cdot)\Big\rangle \le \frac{\gamma^*}{\lambda}\cdot B\cdot\sum_{a\in A}\exp\big(y(s,a)/\lambda\big)\big(y^*(s,a) - y(s,a)\big), \tag{87}$$
where the inequality results from the fact that $\exp\big(y^*(s,a)/\lambda\big) \le B\exp\big(y(s,a)/\lambda\big)$ for all $s\in S$ and $a\in A$. Combining inequalities (86) and (87), we prove inequality (85) given
$$0 < \gamma^* \le \frac{\log B}{B(B-1)}.$$
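As a sanity check on this step, the following minimal numerical sketch (ours, not part of the original argument) samples $0 \le y \le y^*$ with $y^*(s,a) \le H(1+\lambda\log|A|)$ and verifies inequality (85) when $\gamma^* = \log B/(B(B-1))$; all variable names and the chosen constants are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
H, lam, A = 2, 1.0, 3
ub = H * (1 + lam * np.log(A))            # upper bound on y*(s, a)
B = np.exp(ub / lam)                      # B = exp(H(1 + lambda log|A|) / lambda)
gamma_star = np.log(B) / (B * (B - 1))    # largest gamma* allowed by the condition above

for _ in range(1000):
    y_star = rng.uniform(0.0, ub, size=A)
    y = rng.uniform(0.0, 1.0, size=A) * y_star        # 0 <= y <= y*, so y* - y <= ub
    Z_star, Z = np.exp(y_star / lam).sum(), np.exp(y / lam).sum()
    lhs = gamma_star / lam * np.dot(np.exp(y_star / lam), y_star - y)   # LHS of (85)
    rhs = Z_star * np.log(Z_star / Z)                                   # RHS of (85)
    assert lhs <= rhs + 1e-9
print("inequality (85) held on all sampled instances")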
We define $D_h = \mathbb{E}_{\pi^*}\big[V^{\lambda}_h(s_h,\pi^*) - V^{\lambda}_h(s_h,\pi)\big]$ for $h\in[H]$. Then Step 1 shows that $D_h \ge \gamma^* D_{h+1}$ for all $h\in[H]$. Thus, we have
Term (V) is the error that measures the difference between the value functions induced by the optimal policies of $\bar{\mu}^I_{t+1}$ and $\bar{\mu}^I_t$, which is defined as
$$(\mathrm{V}) = \sum_{h=1}^{H}\mathbb{E}_{\bar{\pi}^{*,\alpha}_{t+1},\bar{\mu}^I_{t+1}}\Big[V^{\lambda,\alpha}_h\big(s_h,\bar{\pi}^{*,\alpha}_{t+1},\bar{\mu}^I_{t+1},W^*\big) - V^{\lambda,\alpha}_h\big(s_h,\bar{\pi}^{*,\alpha}_t,\bar{\mu}^I_{t+1},W^*\big)\Big].$$
To upper bound term (V), we note that the optimal policies of $\bar{\mu}^I_{t+1}$ and $\bar{\mu}^I_t$ satisfy the following property.
Proposition 28 For a $\lambda$-regularized finite-horizon MDP $(S, A, H, \{r_h\}_{h=1}^H, \{P_h\}_{h=1}^H)$ with $r_h \in [0,1]$ for all $h\in[H]$, we denote the optimal policy as $\pi^* = \{\pi^*_h\}_{h=1}^H$. Then we have that for any $s\in S$ and $h\in[H]$,
$$\min_{a\in A}\,\pi^*_h(a\,|\,s) \ge \frac{1}{1 + |A|\exp\big((H-h+1)(1+\lambda\log|A|)/\lambda\big)}.$$
Proof [Proof of Proposition 28] See Appendix Q.2.5.
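To illustrate why the softmax form of the regularized optimal policy yields this lower bound, here is a minimal numerical sketch (our own illustration; the regularized action values are sampled uniformly in $[0,(H-h+1)(1+\lambda\log|A|)]$, an assumption matching the bound used in the proof, and all names are ours):

import numpy as np

rng = np.random.default_rng(1)
H, lam, A = 3, 0.8, 5
for h in range(1, H + 1):
    M = (H - h + 1) * (1 + lam * np.log(A))     # upper bound on the regularized action values
    lower = 1.0 / (1.0 + A * np.exp(M / lam))   # lower bound claimed in Proposition 28
    for _ in range(1000):
        y = rng.uniform(0.0, M, size=A)         # action values at step h, in [0, M]
        pi = np.exp(y / lam) / np.exp(y / lam).sum()   # softmax form of the optimal policy
        assert pi.min() >= lower - 1e-12
print("softmax lower bound verified on sampled instances")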
Then we have that
$$V^{\lambda,\alpha}_h\big(s_h,\bar{\pi}^{*,\alpha}_{t+1},\bar{\mu}^I_{t+1},W^*\big) - V^{\lambda,\alpha}_h\big(s_h,\bar{\pi}^{*,\alpha}_t,\bar{\mu}^I_{t+1},W^*\big) \le \big(H(1+\lambda\log|A|) + \lambda L_R\big)\sum_{m=h}^{H}\mathbb{E}_{\bar{\pi}^{*,\alpha}_{t+1},\bar{\mu}^I_{t+1}}\big\|\bar{\pi}^{*,\alpha}_{t+1,m}(\cdot\,|\,s_m) - \bar{\pi}^{*,\alpha}_{t,m}(\cdot\,|\,s_m)\big\|_1,$$
where $L_R = \log\big(1 + |A|\exp\big(H(1+\lambda\log|A|)/\lambda\big)\big)$, and the inequality results from the performance difference lemma (Lemma 37), Proposition 28, and Lemma 38. Thus, we have that
$$(\mathrm{V}) \le H\big(H(1+\lambda\log|A|) + \lambda L_R\big)\sum_{m=1}^{H}\mathbb{E}_{\bar{\pi}^{*,\alpha}_{t+1},\bar{\mu}^I_{t+1}}\big\|\bar{\pi}^{*,\alpha}_{t+1,m}(\cdot\,|\,s_m) - \bar{\pi}^{*,\alpha}_{t,m}(\cdot\,|\,s_m)\big\|_1. \tag{89}$$
Term (VI) is the error that measures the difference between the distributions of states induced by the optimal policies of $\bar{\mu}^I_{t+1}$ and $\bar{\mu}^I_t$, which is defined as
$$(\mathrm{VI}) = \sum_{h=1}^{H}\Big(\mathbb{E}_{\bar{\pi}^{*,\alpha}_{t+1},\bar{\mu}^I_{t+1}} - \mathbb{E}_{\bar{\pi}^{*,\alpha}_t,\bar{\mu}^I_{t+1}}\Big)\Big[V^{\lambda,\alpha}_h\big(s_h,\bar{\pi}^{*,\alpha}_t,\bar{\mu}^I_{t+1},W^*\big) - V^{\lambda,\alpha}_h\big(s_h,\pi^{\alpha}_{t+1},\bar{\mu}^I_{t+1},W^*\big)\Big] + \frac{1}{\eta^*_\theta}\sum_{h=1}^{H}\Big(\mathbb{E}_{\bar{\pi}^{*,\alpha}_{t+1},\bar{\mu}^I_{t+1}} - \mathbb{E}_{\bar{\pi}^{*,\alpha}_t,\bar{\mu}^I_{t+1}}\Big)\,\mathrm{KL}\big(\bar{\pi}^{*,\alpha}_{t,h}(\cdot\,|\,s_h)\,\big\|\,\pi^{\alpha}_{t+1,h}(\cdot\,|\,s_h)\big).$$
Proposition 30 Given any policy $\pi^I$ and two distribution flows $\mu^I$ and $\tilde{\mu}^I$, we define $\mu^{+,I} = \Gamma_3(\pi^I,\mu^I,W^*)$ and $\tilde{\mu}^{+,I} = \Gamma_3(\pi^I,\tilde{\mu}^I,W^*)$. Under Assumption 1, we have that
$$\big\|\mu^{+,\alpha}_h - \tilde{\mu}^{+,\alpha}_h\big\|_1 \le L_P\sum_{m=1}^{h-1}\int_0^1 \|\mu^{\beta}_m - \tilde{\mu}^{\beta}_m\|_1\,d\beta.$$
$$(\mathrm{VII}) \le H\Big(H(1+\lambda\log|A|) + \frac{1}{\eta^*_\theta}\log\frac{|A|^2}{\beta_{t+1}}\Big)L_P\cdot\sum_{m=1}^{H}\int_0^1\big\|\bar{\mu}^{\beta}_{t+1,m} - \bar{\mu}^{\beta}_{t,m}\big\|_1\,d\beta. \tag{91}$$
Term (VIII) is the error that measures the difference between the value functions induced by the different distribution flows $\bar{\mu}^I_{t+1}$ and $\bar{\mu}^I_t$, which is defined as
$$(\mathrm{VIII}) = \sum_{h=1}^{H}\mathbb{E}_{\bar{\pi}^{*,\alpha}_t,\bar{\mu}^I_t}\Big[V^{\lambda,\alpha}_h\big(s_h,\bar{\pi}^{*,\alpha}_t,\bar{\mu}^I_{t+1},W^*\big) - V^{\lambda,\alpha}_h\big(s_h,\pi^{\alpha}_{t+1},\bar{\mu}^I_{t+1},W^*\big)\Big] - \sum_{h=1}^{H}\mathbb{E}_{\bar{\pi}^{*,\alpha}_t,\bar{\mu}^I_t}\Big[V^{\lambda,\alpha}_h\big(s_h,\bar{\pi}^{*,\alpha}_t,\bar{\mu}^I_t,W^*\big) - V^{\lambda,\alpha}_h\big(s_h,\pi^{\alpha}_{t+1},\bar{\mu}^I_t,W^*\big)\Big].$$
It can be bounded as
$$(\mathrm{VIII}) \le 2H\big(L_r + H(1+\lambda\log|A|)L_P\big)\sum_{m=1}^{H}\int_0^1\big\|\bar{\mu}^{\beta}_{t+1,m} - \bar{\mu}^{\beta}_{t,m}\big\|_1\,d\beta. \tag{92}$$
Term (IX) is the error that measures the difference between the KL divergences related to the optimal policies of $\bar{\mu}^I_{t+1}$ and $\bar{\mu}^I_t$, which is defined as
$$(\mathrm{IX}) = \frac{1}{\eta^*_\theta}\sum_{h=1}^{H}\mathbb{E}_{\bar{\pi}^{*,\alpha}_{t+1},\bar{\mu}^I_{t+1}}\Big[\mathrm{KL}\big(\bar{\pi}^{*,\alpha}_{t+1,h}(\cdot\,|\,s_h)\,\big\|\,\pi^{\alpha}_{t+1,h}(\cdot\,|\,s_h)\big) - \mathrm{KL}\big(\bar{\pi}^{*,\alpha}_{t,h}(\cdot\,|\,s_h)\,\big\|\,\pi^{\alpha}_{t+1,h}(\cdot\,|\,s_h)\big)\Big].$$
Combining Eqn. (88) and inequalities (89), (90), (91), (92), and (93), we conclude the proof of this proposition.
$$\pi^*_h(\cdot\,|\,s) = \operatorname*{argmax}_{p\in\Delta(A)}\;\langle r_h(s,\cdot),\,p\rangle - \lambda R(p) + \sum_{a\in A}\int_S p(a)\,P_h(s'\,|\,s,a)\,V^{\lambda}_{h+1}(s',\pi^*)\,ds'.$$
The desired result follows from the fact that $V^{\lambda}_h(s',\pi^*) \le (H-h+1)(1+\lambda\log|A|)$. Thus, we conclude the proof of Proposition 28.
where the first inequality results from the definition of $\Gamma_3$ and the triangle inequality, and the second inequality results from the triangle inequality. Note that $\mu^{+,\alpha}_1 = \tilde{\mu}^{+,\alpha}_1 = \mu^{\alpha}_1$. Summing over $h$, we prove the desired result. This completes the proof of Proposition 29.
$$\big\|\mu^{+,\alpha}_{h+1} - \tilde{\mu}^{+,\alpha}_{h+1}\big\|_1 \le \int_S \big|\mu^{+,\alpha}_h(s) - \tilde{\mu}^{+,\alpha}_h(s)\big|\sum_{a\in A}\int_S \pi^{\alpha}_h(a\,|\,s)\,P^*_h\big(s'\,|\,s,a,z^{\alpha}_h(\mu^I_h,W^*_h)\big)\,ds'\,ds + \int_S \tilde{\mu}^{+,\alpha}_h(s)\sum_{a\in A}\pi^{\alpha}_h(a\,|\,s)\int_S \big|P^*_h\big(s'\,|\,s,a,z^{\alpha}_h(\mu^I_h,W^*_h)\big) - P^*_h\big(s'\,|\,s,a,z^{\alpha}_h(\tilde{\mu}^I_h,W^*_h)\big)\big|\,ds'\,ds \le \big\|\mu^{+,\alpha}_h - \tilde{\mu}^{+,\alpha}_h\big\|_1 + L_P\,\big\|z^{\alpha}_h(\mu^I_h,W^*_h) - z^{\alpha}_h(\tilde{\mu}^I_h,W^*_h)\big\|_1,$$
where the first inequality results from the definition of $\Gamma_3$ and the triangle inequality, and the second inequality results from Assumption 1. For the term on the right-hand side, we have that
$$\big\|z^{\alpha}_h(\mu^I_h,W^*_h) - z^{\alpha}_h(\tilde{\mu}^I_h,W^*_h)\big\|_1 = \int_S\Big|\int_0^1 W^*_h(\alpha,\beta)\big(\mu^{\beta}_h(s) - \tilde{\mu}^{\beta}_h(s)\big)\,d\beta\Big|\,ds \le \int_0^1\|\mu^{\beta}_h - \tilde{\mu}^{\beta}_h\|_1\,d\beta,$$
where the inequality results from the triangle inequality and the fact that $|W^*_h| \le 1$. Summing over $h$, we prove the desired result. Thus, we conclude the proof of Proposition 30.
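The contraction of the graphon aggregate in the last display can be checked numerically. Below is a minimal discretized sketch (our own illustration; the graphon slice $W(\alpha,\beta)=\exp(-|\alpha-\beta|)$, the grid sizes, and all variable names are assumptions made only for this example):

import numpy as np

rng = np.random.default_rng(2)
n_beta, n_states, alpha = 200, 6, 0.3
betas = (np.arange(n_beta) + 0.5) / n_beta           # grid on [0, 1] for the agent index beta
W = np.exp(-np.abs(alpha - betas))                   # example graphon slice, values in (0, 1]

def sample_flow():
    m = rng.random((n_beta, n_states))
    return m / m.sum(axis=1, keepdims=True)          # one state distribution per beta

mu, mu_tilde = sample_flow(), sample_flow()
z = (W[:, None] * mu).mean(axis=0)                   # discretized z^alpha(mu, W)
z_tilde = (W[:, None] * mu_tilde).mean(axis=0)
lhs = np.abs(z - z_tilde).sum()                      # ||z(mu) - z(mu_tilde)||_1
rhs = np.abs(mu - mu_tilde).sum(axis=1).mean()       # discretized integral of ||mu^beta - mu_tilde^beta||_1
assert lhs <= rhs + 1e-12
print(f"aggregation bound: {lhs:.4f} <= {rhs:.4f}")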
$$\big|Q^{\lambda,\alpha}_h(s,a,\pi^{\alpha},\mu^I,W^*) - Q^{\lambda,\alpha}_h(s,a,\pi^{\alpha},\tilde{\mu}^I,W^*)\big| \le \big(L_r + H(1+\lambda\log|A|)L_P\big)\int_0^1\|\mu^{\beta}_h - \tilde{\mu}^{\beta}_h\|_1\,d\beta + \sum_{a'\in A}\int_S\big|Q^{\lambda,\alpha}_{h+1}(s',a',\pi^I,\mu^I,W^*) - Q^{\lambda,\alpha}_{h+1}(s',a',\pi^I,\tilde{\mu}^I,W^*)\big|\,\pi^{\alpha}_{h+1}(a'\,|\,s')\,P^*_h\big(s'\,|\,s,a,z^{\alpha}_h(\mu^I,W^*)\big)\,ds',$$
where the inequality results from the triangle inequality and Assumption 1. By induction, it is easy to prove that
$$\big|Q^{\lambda,\alpha}_h(s,a,\pi^{\alpha},\mu^I,W^*) - Q^{\lambda,\alpha}_h(s,a,\pi^{\alpha},\tilde{\mu}^I,W^*)\big| \le \big(L_r + H(1+\lambda\log|A|)L_P\big)\sum_{m=h}^{H}\int_0^1\|\mu^{\beta}_m - \tilde{\mu}^{\beta}_m\|_1\,d\beta.$$
From the relationship between the value function and the action-value function, we have that
$$\big|V^{\lambda,\alpha}_h(s,\pi^{\alpha},\mu^I,W^*) - V^{\lambda,\alpha}_h(s,\pi^{\alpha},\tilde{\mu}^I,W^*)\big| \le \big(L_r + H(1+\lambda\log|A|)L_P\big)\sum_{m=h}^{H}\int_0^1\|\mu^{\beta}_m - \tilde{\mu}^{\beta}_m\|_1\,d\beta.$$
$$+\; \sum_{a\in A}\int_S p(a)\,P_h\big(s'\,|\,s,a,z^{\alpha}_h(\mu^I_h,W^*_h)\big)\,V^{\lambda,\alpha}_{h+1}(s',\pi^{*,I},\mu^I,W^*)\,ds',$$
where the inequality results from the fact that $|\max_x f(x) - \max_x g(x)| \le \max_x|f(x) - g(x)|$ and Assumption 1. By induction, it is easy to prove the claim.
Next, we prove the claim related to the optimal policies. From the definition of the optimal policies, we define
$$y^{\alpha}_h(s,a) = r_h\big(s,a,z^{\alpha}_h(\mu^I_h,W^*_h)\big) + \int_S P_h\big(s'\,|\,s,a,z^{\alpha}_h(\mu^I_h,W^*_h)\big)\,V^{\lambda,\alpha}_{h+1}(s',\pi^{*,I},\mu^I,W^*)\,ds',$$
$$\tilde{y}^{\alpha}_h(s,a) = r_h\big(s,a,z^{\alpha}_h(\tilde{\mu}^I_h,W^*_h)\big) + \int_S P_h\big(s'\,|\,s,a,z^{\alpha}_h(\tilde{\mu}^I_h,W^*_h)\big)\,V^{\lambda,\alpha}_{h+1}(s',\pi^{*,I},\tilde{\mu}^I,W^*)\,ds'.$$
Then
$$\big\|\pi^{*,\alpha}_h(\cdot\,|\,s) - \tilde{\pi}^{*,\alpha}_h(\cdot\,|\,s)\big\|_1 \le \big\|y^{\alpha}_h(s,\cdot) - \tilde{y}^{\alpha}_h(s,\cdot)\big\|_\infty,$$
which proves the claim related to the optimal policies. Thus, we conclude the proof of Proposition 17.
where the generalization error of risk and the empirical risk difference are defined similarly to those in Theorem 5. From the procedure of Algorithm 5, we can bound the empirical risk difference. The generalization error of risk can be bounded by inequality (42) in the proof of Theorem 5. Thus, we conclude the proof of Corollary 21.
Assumption 8 implies that we can bound the total variation between $\mu^{\alpha}_{h+1}$ and $\hat{\mu}^{\alpha}_{h+1}$ as
$$\|\mu^{\alpha}_{h+1} - \hat{\mu}^{\alpha}_{h+1}\|_1 \le L_\varepsilon\,\mathbb{E}_{\rho^{\alpha}_h}\Big[\big|\hat{f}_h\big(\omega^{\alpha}_h(\hat{W}_h)\big) - f^*_h\big(\omega^{\alpha}_h(W^*_h)\big)\big|\Big] + L_\varepsilon\,\mathbb{E}_{\rho^{\alpha}_h}\Big[\big|\hat{f}_h\big(\omega^{\alpha}_h(\hat{W}_h)\big) - \hat{f}_h\big(\tilde{\omega}^{\alpha}_h(\hat{W}_h)\big)\big|\Big], \tag{94}$$
where
$$\tilde{\omega}^{\alpha}_h(W) = \int_0^1\int_S W(\alpha,\beta)\,k\big(\cdot,(s^i_{\tau,h},a^i_{\tau,h},s)\big)\,\hat{\mu}^{\beta}_h(s)\,ds\,d\beta.$$
The first term on the right-hand side of inequality (94) can be upper bounded as
$$L_\varepsilon\,\mathbb{E}_{\rho^{\alpha}_h}\Big[\big|\hat{f}_h\big(\omega^{\alpha}_h(\hat{W}_h)\big) - f^*_h\big(\omega^{\alpha}_h(W^*_h)\big)\big|\Big] \le e^{\pi,\alpha}_h,$$
where the inequality results from Hölder's inequality. The second term on the right-hand side of inequality (94) can be upper bounded as
$$L_\varepsilon\,\mathbb{E}_{\rho^{\alpha}_h}\Big[\big|\hat{f}_h\big(\omega^{\alpha}_h(\hat{W}_h)\big) - \hat{f}_h\big(\tilde{\omega}^{\alpha}_h(\hat{W}_h)\big)\big|\Big] \le rL_K L_\varepsilon\,\big\|\omega^{\alpha}_h(\hat{W}_h) - \tilde{\omega}^{\alpha}_h(\hat{W}_h)\big\|_{\mathcal{H}} \le rL_K L_\varepsilon B_k\int_0^1\|\hat{\mu}^{\beta}_h - \mu^{\beta}_h\|_1\,d\beta,$$
where the first inequality results from Lemma 33, and the second inequality results from the triangle inequality. Thus, we have that
$$\|\mu^{\alpha}_{h+1} - \hat{\mu}^{\alpha}_{h+1}\|_1 \le \|\mu^{\alpha}_h - \hat{\mu}^{\alpha}_h\|_1 + rL_K L_\varepsilon B_k\int_0^1\|\hat{\mu}^{\beta}_h - \mu^{\beta}_h\|_1\,d\beta + e^{\pi,\alpha}_h,$$
which proves the desired result. Thus, we conclude the proof of Proposition 22.
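For intuition on the aggregated kernel mean embedding $\tilde{\omega}^{\alpha}_h(W)$ used above, the following minimal sketch evaluates a discretized embedding at a query point with a Gaussian kernel; the graphon $W(\alpha,\beta)=1-|\alpha-\beta|$, the bandwidth, the grids, and all variable names are assumptions made only for this illustration:

import numpy as np

rng = np.random.default_rng(3)
n_beta, states, alpha, bw = 100, np.linspace(0.0, 1.0, 8), 0.4, 0.5
betas = (np.arange(n_beta) + 0.5) / n_beta

def rbf(x, y):
    return np.exp(-np.sum((x - y) ** 2) / (2 * bw ** 2))   # Gaussian kernel on (s, a, s'')

def W(a, b):
    return 1.0 - abs(a - b)                                 # example graphon, values in [0, 1]

mu_hat = rng.random((n_beta, len(states)))
mu_hat /= mu_hat.sum(axis=1, keepdims=True)                 # estimated state distribution of agent beta
s_i, a_i = 0.7, 0.2                                         # observed (state, action) pair of agent i

def omega_tilde_at(query):
    # Evaluate the embedding at an RKHS query point, discretizing the beta- and state-integrals.
    val = 0.0
    for b, mu_b in zip(betas, mu_hat):
        for s, w_s in zip(states, mu_b):
            val += W(alpha, b) * rbf(query, np.array([s_i, a_i, s])) * w_s
    return val / n_beta

print("embedding evaluated at a query point:", omega_tilde_at(np.array([0.5, 0.5, 0.5])))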
$$Q^{\lambda,\alpha}_h(s,a,\pi^{\alpha},\mu^I,W^*) = r_h\big(s,a,z^{\alpha}_h(\mu^I,W^*_h)\big) + \mathbb{E}\big[V^{\lambda,\alpha}_{h+1}(s',\pi^{\alpha},\mu^I,W^*)\,\big|\,s^{\alpha}_h = s, a^{\alpha}_h = a\big],$$
$$V^{\lambda,\alpha}_h(s,\pi^{\alpha},\mu^I,W^*) = \big\langle Q^{\lambda,\alpha}_h(s,\cdot,\pi^{\alpha},\mu^I,W^*),\,\pi^{\alpha}_h(\cdot\,|\,s)\big\rangle - \lambda R\big(\pi^{\alpha}_h(\cdot\,|\,s)\big), \tag{95}$$
where the inequality results from the triangle inequality and Eqn. (95). Since $\hat{Q}^{\lambda,\alpha}_{H+1} = Q^{\lambda,\alpha}_{H+1} = 0$, we have that
$$\mathbb{E}_{\rho^{b,\alpha}_h}\Big[\big|\hat{Q}^{\lambda,\alpha}_h(s,a,\pi^{\alpha},\mu^I,\hat{W}) - Q^{\lambda,\alpha}_h(s,a,\pi^{\alpha},\mu^I,W^*)\big|\Big] \le \sum_{m=h}^{H}\mathbb{E}_{\rho^{b,\alpha}_m}\Big[\big|\hat{g}_m\big(\omega^{\alpha}_h(\hat{W}_m)\big) - g^*_m\big(\omega^{\alpha}_h(W^*_m)\big)\big|\Big] + L_\varepsilon H(1+\lambda\log|A|)\sum_{m=h}^{H}\sqrt{\mathbb{E}_{\rho^{b,\alpha}_m}\Big[\big|\hat{f}_m\big(\omega^{\alpha}_h(\hat{W}_m)\big) - f^*_m\big(\omega^{\alpha}_h(W^*_m)\big)\big|\Big]},$$
where the inequality results from Assumption 8. Our desired result follows from Hölder's inequality. Thus, we conclude the proof of Proposition 23.
where the first equality results from the definition of $\tilde{\mu}^I_\tau$, and the second equality results from the facts that $\alpha \in ((i-1)/N, i/N]$ and that $\phi^*$ is a permutation of $\{((i-1)/N, i/N]\}_{i=1}^N$. Thus, we have that
$$\tilde{\omega}^i_{\tau,h}(W^{\phi}) = \int_0^1\int_S W\big(\phi(i/N),\phi(\beta)\big)\,k\big(\cdot,(s^i_{\tau,h},a^i_{\tau,h},s)\big)\,\mu^{\phi^*(\beta)}_{\tau,h}(s)\,ds\,d\beta = \int_0^1\int_S W\big(\phi\circ\psi^*\circ\phi^*(i/N),\,\phi\circ\psi^*\circ\phi^*(\beta)\big)\,k\big(\cdot,(s^i_{\tau,h},a^i_{\tau,h},s)\big)\,\mu^{\phi^*(\beta)}_{\tau,h}(s)\,ds\,d\beta = \int_0^1\int_S W\big(\phi\circ\psi^*(\xi_i),\,\phi\circ\psi^*(\gamma)\big)\,k\big(\cdot,(s^i_{\tau,h},a^i_{\tau,h},s)\big)\,\mu^{\gamma}_{\tau,h}(s)\,ds\,d\gamma = \omega^i_{\tau,h}\big(W^{\phi\circ\psi^*}\big),$$
where the second equality results from the fact that $\phi^*$ is the inverse function of $\psi^*$, and the third equality results from taking $\gamma = \phi^*(\beta)$ and $\phi^*(i/N) = \xi_i$. Thus, Algorithm (62) can be equivalently formulated as
$$R_{\bar{\xi}}\big(\hat{f}_h,\hat{g}_h,\hat{W}^{\hat{\phi}_h\circ\psi^*}_h\big) - R_{\bar{\xi}}\big(f^*_h,g^*_h,W^*_h\big) = \text{Generalization Error of Risk} + \text{Empirical Risk Difference},$$
where the generalization error of risk and the empirical risk difference are defined as
Generalization Error of Risk
$$= \frac{1}{NL}\sum_{\tau=1}^{L}\sum_{i=1}^{N}\mathbb{E}_{\rho^i_{\tau,h}}\Big[\big(s^i_{\tau,h+1} - \hat{f}_h\big(\omega^i_{\tau,h}(\hat{W}^{\hat{\phi}_h\circ\psi^*}_h)\big)\big)^2 - \big(s^i_{\tau,h+1} - f^*_h\big(\omega^i_{\tau,h}(W^*_h)\big)\big)^2\Big] - \frac{2}{NL}\sum_{\tau=1}^{L}\sum_{i=1}^{N}\Big[\big(s^i_{\tau,h+1} - \hat{f}_h\big(\omega^i_{\tau,h}(\hat{W}^{\hat{\phi}_h\circ\psi^*}_h)\big)\big)^2 - \big(s^i_{\tau,h+1} - f^*_h\big(\omega^i_{\tau,h}(W^*_h)\big)\big)^2\Big]$$
$$\quad + \frac{1}{NL}\sum_{\tau=1}^{L}\sum_{i=1}^{N}\mathbb{E}_{\rho^i_{\tau,h}}\Big[\big(r^i_{\tau,h} - \hat{g}_h\big(\omega^i_{\tau,h}(\hat{W}^{\hat{\phi}_h\circ\psi^*}_h)\big)\big)^2 - \big(r^i_{\tau,h} - g^*_h\big(\omega^i_{\tau,h}(W^*_h)\big)\big)^2\Big] - \frac{2}{NL}\sum_{\tau=1}^{L}\sum_{i=1}^{N}\Big[\big(r^i_{\tau,h} - \hat{g}_h\big(\omega^i_{\tau,h}(\hat{W}^{\hat{\phi}_h\circ\psi^*}_h)\big)\big)^2 - \big(r^i_{\tau,h} - g^*_h\big(\omega^i_{\tau,h}(W^*_h)\big)\big)^2\Big],$$
Empirical Risk Difference
$$= \frac{2}{NL}\sum_{\tau=1}^{L}\sum_{i=1}^{N}\Big[\big(s^i_{\tau,h+1} - \hat{f}_h\big(\omega^i_{\tau,h}(\hat{W}^{\hat{\phi}_h\circ\psi^*}_h)\big)\big)^2 - \big(s^i_{\tau,h+1} - f^*_h\big(\omega^i_{\tau,h}(W^*_h)\big)\big)^2\Big] + \frac{2}{NL}\sum_{\tau=1}^{L}\sum_{i=1}^{N}\Big[\big(r^i_{\tau,h} - \hat{g}_h\big(\omega^i_{\tau,h}(\hat{W}^{\hat{\phi}_h\circ\psi^*}_h)\big)\big)^2 - \big(r^i_{\tau,h} - g^*_h\big(\omega^i_{\tau,h}(W^*_h)\big)\big)^2\Big].$$
The generalization error of risk can be controlled exactly as in the proof of Theorem 7. Thus, we conclude the proof of Corollary 24.
Since $f'(1) = 0$, we have that $f'(\beta) \le 0$ for $\beta\in[0,1]$. Thus, we conclude the proof of Lemma 31.
Lemma 33 Let $\mathcal{H}$ be an RKHS with kernel $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ that satisfies (i) $k(x,x) \le B_k^2$ for all $x\in\mathcal{X}$, and (ii) $\|k(\cdot,x) - k(\cdot,x')\|_{\mathcal{H}} \le L_k\|x - x'\|_{\mathcal{X}}$ for all $x,x'\in\mathcal{X}$. Then for any $f\in B(r,\mathcal{H})$: (i) $|f(x)| \le rB_k$ for all $x\in\mathcal{X}$; (ii) $|f(x) - f(x')| \le rL_k\|x - x'\|_{\mathcal{X}}$ for all $x,x'\in\mathcal{X}$.
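These two RKHS bounds are easy to check numerically. The sketch below (our own illustration) uses a Gaussian kernel with bandwidth $\sigma$, for which one may take $B_k = 1$ and $L_k = 1/\sigma$, and a random $f$ rescaled to lie in $B(r,\mathcal{H})$; all names and constants are assumptions made only for this example:

import numpy as np

rng = np.random.default_rng(4)
sigma, r = 0.7, 2.0
B_k, L_k = 1.0, 1.0 / sigma            # valid constants for the Gaussian kernel

def k(x, y):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

centers = rng.normal(size=(5, 3))
c = rng.normal(size=5)
K = np.array([[k(u, v) for v in centers] for u in centers])
c *= r / np.sqrt(c @ K @ c)            # rescale so that ||f||_H = r

def f(x):
    return sum(ci * k(x, xi) for ci, xi in zip(c, centers))

for _ in range(2000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    assert abs(f(x)) <= r * B_k + 1e-9                                   # claim (i)
    assert abs(f(x) - f(y)) <= r * L_k * np.linalg.norm(x - y) + 1e-9    # claim (ii)
print("Lemma 33 bounds verified on random points")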
Proof [Proof of Lemma 33] For the first claim, we have that $|f(x)| = |\langle f, k(\cdot,x)\rangle_{\mathcal{H}}| \le \|f\|_{\mathcal{H}}\,\|k(\cdot,x)\|_{\mathcal{H}} \le rB_k$ by the reproducing property and the Cauchy-Schwarz inequality. The second claim follows similarly from $|f(x) - f(x')| = |\langle f, k(\cdot,x) - k(\cdot,x')\rangle_{\mathcal{H}}| \le rL_k\|x - x'\|_{\mathcal{X}}$.
Proof [Proof of Lemma 34] Define $f_i(x) = \langle x, y_i\rangle - f(x)$ for $i = 1,2$. Then Shalev-Shwartz (2012, Lemma 2.8) shows that
$$\frac{k}{2}\|x_1 - x_2\|^2 \le f_1(x_1) - f_1(x_2) \quad\text{and}\quad \frac{k}{2}\|x_1 - x_2\|^2 \le f_2(x_2) - f_2(x_1).$$
Summing these two inequalities, we have
$$k\|x_1 - x_2\|^2 \le \langle x_1 - x_2,\, y_1 - y_2\rangle \le \|x_1 - x_2\|\,\|y_1 - y_2\|_*,$$
where the second inequality results from the definition of the dual norm. Thus, we conclude the proof of Lemma 34.
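The conclusion this argument leads to is presumably a stability bound of the form $\|x_1 - x_2\| \le \|y_1 - y_2\|_*/k$. As a hedged numerical illustration (ours, not from the paper), take $f$ to be the negative entropy, which is 1-strongly convex with respect to $\|\cdot\|_1$ on the simplex; the maximizers are then softmax distributions, and the bound becomes $\|x_1 - x_2\|_1 \le \|y_1 - y_2\|_\infty$:

import numpy as np

rng = np.random.default_rng(8)
A = 6

def softmax(y):
    e = np.exp(y - y.max())
    return e / e.sum()            # maximizer of <p, y> - <p, log p> over the simplex

for _ in range(2000):
    y1, y2 = rng.normal(size=A), rng.normal(size=A)
    x1, x2 = softmax(y1), softmax(y2)
    assert np.abs(x1 - x2).sum() <= np.max(np.abs(y1 - y2)) + 1e-9   # ||x1 - x2||_1 <= ||y1 - y2||_inf
print("stability bound verified for the negative-entropy example")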
Lemma 35 (Lemma 3.3 in Cai et al. (2020)) For any distributions $p, p^* \in \Delta(A)$ and any function $g: A\to[0,H]$, it holds for $q\in\Delta(A)$ with $q(\cdot) \propto p(\cdot)\exp\big(\alpha g(\cdot)\big)$ that
Lemma 36 For any two distributions $p^*, p \in \Delta(A)$, let $\hat{p} = (1-\beta)p + \beta\,\mathrm{Unif}(A)$ with $\beta\in(0,1)$. Then
$$\mathrm{KL}(p^*\|\hat{p}) \le \log\frac{|A|}{\beta}, \qquad \mathrm{KL}(p^*\|\hat{p}) - \mathrm{KL}(p^*\|p) \le \frac{\beta}{1-\beta}.$$
where the second inequality results from log(x) ≤ x − 1 for x > 0. Thus, we conclude the
proof of Lemma 36.
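Both bounds in Lemma 36 are straightforward to verify numerically; the following minimal sketch (ours, with randomly drawn distributions and illustrative constants) checks them:

import numpy as np

rng = np.random.default_rng(5)
A, beta = 6, 0.2

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

for _ in range(2000):
    p_star = rng.dirichlet(np.ones(A))
    p = rng.dirichlet(np.ones(A))
    p_hat = (1 - beta) * p + beta / A                 # mixture with the uniform distribution
    assert kl(p_star, p_hat) <= np.log(A / beta) + 1e-9
    assert kl(p_star, p_hat) - kl(p_star, p) <= beta / (1 - beta) + 1e-9
print("Lemma 36 bounds verified on random distributions")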
where the expectation Eπ̃α ,µI is taken with respect to the randomness in implementing policy
π̃ α for agent α under the MDP induced by µI .
Proof [Proof of Lemma 37] From the definition of $V^{\lambda,\alpha}_1(s,\tilde{\pi}^{\alpha},\mu^I,W)$, we have
$$V^{\lambda,\alpha}_1(s,\tilde{\pi}^{\alpha},\mu^I,W) = \mathbb{E}_{\tilde{\pi}^{\alpha},\mu^I}\bigg[\sum_{h=1}^{H} r_h(s^{\alpha}_h,a^{\alpha}_h,z^{\alpha}_h) - \lambda\log\tilde{\pi}^{\alpha}_h(a^{\alpha}_h\,|\,s^{\alpha}_h) + V^{\lambda,\alpha}_h(s^{\alpha}_h,\pi^{\alpha},\mu^I,W) - V^{\lambda,\alpha}_h(s^{\alpha}_h,\pi^{\alpha},\mu^I,W)\,\bigg|\,s^{\alpha}_1 = s\bigg]$$
$$= \mathbb{E}_{\tilde{\pi}^{\alpha},\mu^I}\bigg[\sum_{h=1}^{H} r_h(s^{\alpha}_h,a^{\alpha}_h,z^{\alpha}_h) - \lambda\log\tilde{\pi}^{\alpha}_h(a^{\alpha}_h\,|\,s^{\alpha}_h) + V^{\lambda,\alpha}_{h+1}(s^{\alpha}_{h+1},\pi^{\alpha},\mu^I,W) - V^{\lambda,\alpha}_h(s^{\alpha}_h,\pi^{\alpha},\mu^I,W)\,\bigg|\,s^{\alpha}_1 = s\bigg] + V^{\lambda,\alpha}_1(s,\pi^{\alpha},\mu^I,W), \tag{97}$$
where the second equality results from rearranging the terms. We then focus on a part of the right-hand side of Eqn. (97):
$$\mathbb{E}_{\tilde{\pi}^{\alpha},\mu^I}\Big[r_h(s^{\alpha}_h,a^{\alpha}_h,z^{\alpha}_h) - \lambda\log\tilde{\pi}^{\alpha}_h(a^{\alpha}_h\,|\,s^{\alpha}_h) + V^{\lambda,\alpha}_{h+1}(s^{\alpha}_{h+1},\pi^{\alpha},\mu^I,W)\,\Big|\,s^{\alpha}_1 = s\Big]$$
$$= \mathbb{E}_{\tilde{\pi}^{\alpha},\mu^I}\Big[r_h(s^{\alpha}_h,a^{\alpha}_h,z^{\alpha}_h) + V^{\lambda,\alpha}_{h+1}(s^{\alpha}_{h+1},\pi^{\alpha},\mu^I,W)\,\Big|\,s^{\alpha}_1 = s\Big] - \lambda\,\mathbb{E}_{\tilde{\pi}^{\alpha},\mu^I}\Big[R\big(\tilde{\pi}^{\alpha}_h(\cdot\,|\,s^{\alpha}_h)\big)\,\Big|\,s^{\alpha}_1 = s\Big]$$
$$= \mathbb{E}_{\tilde{\pi}^{\alpha},\mu^I}\Big[\big\langle Q^{\lambda,\alpha}_h(s^{\alpha}_h,\cdot,\pi^{\alpha},\mu^I,W),\,\tilde{\pi}^{\alpha}_h(\cdot\,|\,s^{\alpha}_h)\big\rangle\,\Big|\,s^{\alpha}_1 = s\Big] - \lambda\,\mathbb{E}_{\tilde{\pi}^{\alpha},\mu^I}\Big[R\big(\tilde{\pi}^{\alpha}_h(\cdot\,|\,s^{\alpha}_h)\big)\,\Big|\,s^{\alpha}_1 = s\Big], \tag{98}$$
where $R(\cdot)$ is the negative entropy function, the inner product $\langle\cdot,\cdot\rangle$ is taken with respect to the action space $A$, and the second equality results from the definitions of $Q^{\lambda,\alpha}_h$ and $V^{\lambda,\alpha}_{h+1}$. Substituting Eqn. (98) into Eqn. (97) and noting the fact that $V^{\lambda,\alpha}_h(s^{\alpha}_h,\pi^{\alpha},\mu^I,W) = \big\langle Q^{\lambda,\alpha}_h(s^{\alpha}_h,\cdot,\pi^{\alpha},\mu^I,W),\,\pi^{\alpha}_h(\cdot\,|\,s^{\alpha}_h)\big\rangle - \lambda R\big(\pi^{\alpha}_h(\cdot\,|\,s^{\alpha}_h)\big)$, we derive the desired result, where the last equality results from the definition of the negative entropy $R(\cdot)$. This concludes the proof of Lemma 37.
Lemma 38 For a finite alphabet $\mathcal{X}$, define $R$ as the negative entropy function. For two distributions $p, q$ supported on $\mathcal{X}$, we have that
$$|R(p) - R(q)| \le \max\big\{\|\log(p)\|_\infty,\ \|\log(q)\|_\infty\big\}\,\|p - q\|_1.$$
where the first inequality results from the definition of the integral and the triangle inequality, and the second inequality results from Hölder's inequality. The desired result follows from the fact that for $t\in[0,1]$,
$$\big\|\log\big(q + t(p-q)\big)\big\|_\infty \le \max\big\{\|\log(p)\|_\infty,\ \|\log(q)\|_\infty\big\}.$$
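A quick numerical check of the Lemma 38 bound (our own sketch, drawing random distributions with strictly positive entries):

import numpy as np

rng = np.random.default_rng(6)
A = 5

def neg_entropy(p):
    return float(np.sum(p * np.log(p)))

for _ in range(2000):
    p = rng.dirichlet(np.ones(A))
    q = rng.dirichlet(np.ones(A))
    bound = max(np.max(np.abs(np.log(p))), np.max(np.abs(np.log(q)))) * np.sum(np.abs(p - q))
    assert abs(neg_entropy(p) - neg_entropy(q)) <= bound + 1e-9
print("Lemma 38 bound verified on random distributions")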
Lemma 40 (Lemma 39 in Wei et al. (2021)) Let $\{g_t\}_{t\ge 0}$ and $\{h_t\}_{t\ge 0}$ be non-negative sequences that satisfy $g_t \le (1-c)g_{t-1} + h_t$ for some $0 < c < 1$ and all $t\ge 1$. Then
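The conclusion of Lemma 40 is not reproduced here; one standard consequence of such a recursion, which is not necessarily the exact statement of the cited lemma, is the unrolled bound $g_t \le (1-c)^t g_0 + \sum_{s=1}^{t}(1-c)^{t-s}h_s$. The following minimal sketch (ours) checks it numerically:

import numpy as np

rng = np.random.default_rng(7)
c, T = 0.3, 50
h = rng.random(T + 1)                   # arbitrary non-negative h_t
g = np.empty(T + 1)
g[0] = rng.random()
for t in range(1, T + 1):
    g[t] = (1 - c) * g[t - 1] + h[t]    # take equality in g_t <= (1 - c) g_{t-1} + h_t

for t in range(T + 1):
    unrolled = (1 - c) ** t * g[0] + sum((1 - c) ** (t - s) * h[s] for s in range(1, t + 1))
    assert g[t] <= unrolled + 1e-9
print("unrolled bound holds for the sampled recursion")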
References
A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan. Optimality and approximation with
policy gradient methods in Markov decision processes. In Conference on Learning Theory,
pages 64–66. PMLR, 2020.
A. Aurell, R. Carmona, G. Dayanıklı, and M. Laurière. Finite state graphon games with
applications to epidemics. Dynamic Games and Applications, 12(1):49–81, 2022a.
A. Aurell, R. Carmona, and M. Lauriere. Stochastic graphon games: II. The linear-quadratic
case. Applied Mathematics & Optimization, 85(3):1–33, 2022b.
J. Bhandari and D. Russo. Global optimality guarantees for policy gradient methods. arXiv
preprint arXiv:1906.01786, 2019.
Q. Cai, Z. Yang, C. Jin, and Z. Wang. Provably efficient exploration in policy optimization.
In International Conference on Machine Learning, pages 1283–1294. PMLR, 2020.
P. E. Caines and M. Huang. Graphon mean field games and the GMFG equations: ε-Nash equilibria. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 286–292. IEEE, 2019.
P. E. Caines and M. Huang. Graphon mean field games and their equations. SIAM Journal
on Control and Optimization, 59(6):4373–4399, 2021.
P. Cardaliaguet and S. Hadikhanloo. Learning in mean field games: the fictitious play.
ESAIM: Control, Optimisation and Calculus of Variations, 23(2):569–591, 2017.
S. Cen, C. Cheng, Y. Chen, Y. Wei, and Y. Chi. Fast global convergence of natural policy
gradient methods with entropy regularization. Operations Research, 70(4):2563–2578,
2022.
K. Cui and H. Koeppl. Approximately solving mean field games via entropy-regularized
deep reinforcement learning. In International Conference on Artificial Intelligence and
Statistics, pages 1909–1917. PMLR, 2021a.
K. Cui and H. Koeppl. Learning graphon mean field games and approximate Nash equilibria.
International Conference on Learning Representations, 2021b.
C. Fabian, K. Cui, and H. Koeppl. Learning sparse graphon mean field games. arXiv
preprint arXiv:2209.03880, 2022.
Z. Fang, Z. Guo, and D. Zhou. Optimal learning rates for distribution regression. Journal
of Complexity, 56:101426, 2020.
C. Gao and Z. Ma. Minimax rates in network analysis: Graphon estimation, community
detection and hypothesis testing. Statistical Science, 36:16–33, 2021.
C. Gao, Y. Lu, and H. H. Zhou. Rate-optimal graphon estimation. The Annals of Statistics,
43:2624–2652, 2015.
S. Gao and P. E. Caines. Graphon control of large-scale networks of linear systems. IEEE
Transactions on Automatic Control, 65(10):4090–4105, 2019.
S. Gao, R. F. Tchuendom, and P. E. Caines. Linear quadratic graphon field games. arXiv
preprint arXiv:2006.03964, 2020.
S. Gao, P. E. Caines, and M. Huang. LQG graphon mean field games: Graphon invariant subspaces. In 2021 60th IEEE Conference on Decision and Control (CDC), pages 5253–5260. IEEE, 2021.
X. Guo, A. Hu, R. Xu, and J. Zhang. Learning mean-field games. Advances in Neural
Information Processing Systems, 32, 2019.
X. Guo, R. Xu, and T. Zariphopoulou. Entropy regularization for mean field games with learning. Mathematics of Operations Research, 47(4):3239–3260, 2022b.
X. Guo, A. Hu, R. Xu, and J. Zhang. A general framework for learning mean-field games.
Mathematics of Operations Research, 48(2):656–686, 2023.
A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.
C. Jin, Q. Liu, and S. Miryoosefi. Bellman Eluder dimension: New rich classes of RL
problems, and sample-efficient algorithms. Advances in Neural Information Processing
Systems, 34:13406–13418, 2021.
N. Kallus, Y. Saito, and M. Uehara. Optimal off-policy evaluation from multiple logging
policies. In International Conference on Machine Learning, pages 5247–5256. PMLR,
2021.
O. Klopp and N. Verzelen. Optimal graphon estimation in cut distance. Probability Theory
and Related Fields, 174:1033–1090, 2019.
O. Klopp, A. B. Tsybakov, and N. Verzelen. Oracle inequalities for network models and
sparse graphon estimation. The Annals of Statistics, 45:316–354, 2017.
G. Lan. Policy mirror descent for reinforcement learning: Linear convergence, new sampling
complexity, and generalized problem classes. Mathematical Programming, pages 1–48,
2022.
J. Lasry and P. Lions. Mean field games. Japanese Journal of Mathematics, 2(1):229–260, 2007.
M. Laurière, S. Perrin, M. Geist, and O. Pietquin. Learning mean field games: A survey.
arXiv preprint arXiv:2205.12944, 2022.
P. Lavigne and L. Pfeiffer. Generalized conditional gradient and learning in potential mean
field games. arXiv preprint arXiv:2209.12772, 2022.
O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans. Bridging the gap between value and
policy based reinforcement learning. Advances in Neural Information Processing Systems,
30, 2017.
F. Parise and A. Ozdaglar. Graphon games. In Proceedings of the 2019 ACM Conference on
Economics and Computation, pages 457–458, 2019.
I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. The
Annals of Probability, pages 1679–1706, 1994.
L. Shani, Y. Efroni, and S. Mannor. Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5668–5675, 2020.
R. F. Tchuendom, P. E. Caines, and M. Huang. On the master equation for linear quadratic
graphon mean field games. In 2020 59th IEEE Conference on Decision and Control
(CDC), pages 1026–1031. IEEE, 2020.
M. Uehara, J. Huang, and N. Jiang. Minimax weight and Q-function learning for off-policy evaluation. In International Conference on Machine Learning, pages 9659–9668. PMLR, 2020.
D. Vasal, R. K. Mishra, and S. Vishwanath. Master equation of discrete time graphon mean
field games and teams. arXiv preprint arXiv:2001.05633, 2020.
L. Wang, Z. Yang, and Z. Wang. Breaking the curse of many agents: Provable mean
embedding Q-iteration for mean-field reinforcement learning. In International Conference
on Machine Learning, pages 10092–10103. PMLR, 2020.
Q. Xie, Z. Yang, Z. Wang, and A. Minca. Learning while playing in mean-field games:
Convergence and optimality. In International Conference on Machine Learning, pages
11436–11447. PMLR, 2021.
K. Xu, Y. Zhang, D. Ye, P. Zhao, and M. Tan. Relation-aware transformer for portfolio policy learning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pages 4647–4653, 2021.
Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang. Mean field multi-agent
reinforcement learning. In International Conference on Machine Learning, pages 5571–
5580. PMLR, 2018.
B. Yardim, S. Cayci, M. Geist, and N. He. Policy mirror ascent for efficient and independent
learning in mean field games. arXiv preprint arXiv:2212.14449, 2022.
W. Zhan, B. Huang, A. Huang, N. Jiang, and J. Lee. Offline reinforcement learning with
realizability and single-policy concentrability. In Conference on Learning Theory, pages
2730–2775. PMLR, 2022.