Two-Player Zero-Sum Differential Games with One-Sided Information

Mukesh Ghimire1, Zhe Xu1, Yi Ren1


1 Arizona State University
{mghimire, xzhe1, yiren}@asu.edu
arXiv:2502.05314v2 [cs.GT] 13 Feb 2025

Abstract

Unlike Poker, where the action space A is discrete, differential games in the physical world often have continuous action spaces that are not amenable to discrete abstraction, rendering no-regret algorithms with O(|A|) complexity not scalable. To address this challenge within the scope of two-player zero-sum (2p0s) games with one-sided information, we show that (1) a computational complexity independent of |A| can be achieved by exploiting the convexification property of incomplete-information games and the Isaacs' condition that commonly holds for dynamical systems, and that (2) the computation of the two equilibrium strategies can be decoupled under one-sidedness of information. Leveraging these insights, we develop an algorithm that successfully approximates the optimal strategy in a homing game. Code is available on GitHub¹.

Multi-Agent AI in the Real World Workshop at AAAI 2025.
¹ https://github.com/ghimiremukesh/cams/tree/workshop

Introduction

The strength of game solvers has grown rapidly in the last decade, beating elite-level human players in Chess (Silver et al. 2017a), Go (Silver et al. 2017b), Poker (Brown and Sandholm 2019; Brown et al. 2020a), Diplomacy (FAIR† et al. 2022), and Stratego (Perolat et al. 2022), among others of increasing complexity. These successes motivated recent interest in solving differential games in continuous time and space, e.g., competitive sports (Wang et al. 2024; Ghimire et al. 2024), where critical strategic plays must be executed precisely within the continuous action space and at specific moments in time (e.g., consider set-piece scenarios in soccer). However, existing regret minimization algorithms, e.g., CFR+ (Tammelin 2014) and its variants (Burch, Johanson, and Bowling 2014; Moravčík et al. 2017; Brown et al. 2020a; Lanctot et al. 2009), and last-iterate online learning algorithms, e.g., variants of follow-the-regularized-leader (FTRL) (McMahan 2011; Perolat et al. 2021) and of mirror descent (Sokota et al. 2022; Cen, Wei, and Chi 2021; Vieillard et al. 2020), are designed for discrete actions and have computational complexities that increase with the size of the action space A. Applying these algorithms to differential games would therefore require either insightful action and time abstraction or enormous compute, neither of which is readily available.

As a step towards addressing this challenge, our study focuses on games with one-sided information, which represent a variety of attack-defence scenarios: Both players have common knowledge about the finite set of I possible payoff types and nature's distribution p0 over these types. At the beginning of the game, nature draws a type and informs Player 1 (P1) about the type but not P2. As the game progresses, the public belief about the chosen type is updated from p0 via Bayes' rule, based on the action sequence taken by P1. P1's goal is to minimize the expected cost over p0. This game is proved to have a value under Isaacs' condition (Cardaliaguet 2009). Due to the zero-sum nature, P1 may need to delay information release or manipulate P2's belief to take full advantage of the information asymmetry, while P2's strategy is to optimize the worst-case payoff. Real-world examples of the game include man-on-man matchups in sports, where the attacker has private information about which play is to be executed, and defense games where multiple potential targets are concerned.

The two differences between our game and commonly studied imperfect-information extensive-form games (IIEFGs) (Sandholm 2010; Perolat et al. 2022; FAIR† et al. 2022) are that: (1) IIEFGs often have belief spaces (e.g., belief about the opponent's cards in Poker) larger than their abstracted action spaces (e.g., betting categories in Poker), and (2) information asymmetry in our games is only one-sided. This paper investigates the potential computational advantages of exploiting these differences via the following insights: (1) At any infostate, P1's (resp. P2's) behavioral strategy is I (resp. I + 1)-atomic and convexifies the primal (resp. dual) value with respect to the public belief (Fig. 1). With this, we can reformulate the convex-concave minimax problem of size O(|A|) at each infostate into a nonconvex-nonconcave problem of size O(I²). When I² ≪ |A|, and in particular when |A| = ∞, the latter becomes more efficient to solve in practice. (2) Due to the one-sidedness of information, the equilibrium behavioral strategies of P1 and P2 can be solved separately through primal and dual formulations of the game, in each of which the opponent plays pure best responses. This decoupling avoids recurrent learning dynamics between the pair of strategies without regularization (Perolat et al. 2021).

Figure 1: SOTA algorithms like CFR require expanding over the entire action space (left), whereas our algorithm only requires expanding over at most I actions for P1 (I + 1 for P2) at each decision node (right).

To summarize, this work has two contributions: (1) familiarizing the broader AI community with the connections between computational game theory and differential game theory, and (2) providing the first algorithm with scalable convergence to the equilibrium of differential games with continuous action spaces and one-sided information.
FTRL variants & MMD (McMahan O ln(|A|) ε ln 1ε
Related Work 2011; Perolat et al. 2021; Sokota et al. to ε-QRE
2p0s games with incomplete information. (Harsanyi 2022)
1967) introduced a Bayesian game framework to solve Descent-ascent algorithms for nonconvex-nonconcave
incomplete-information normal-form games by transform- minimax problems. Existing developments in IIEFGs fo-
ing the game into an imperfect-information one involv- cused on convex-concave minimax problems due to the bi-
ing a chance mechanism. The seminal work of (Aumann, linear form of the expected payoff through the conversion of
Maschler, and Stearns 1995) extended this idea to re- games to their normal forms. This paper, on the other hand,
peated games and established the connection between value investigates the nonconvex-nonconcave minimax problems
convexification and belief manipulation. Within the same to be solved at every infostate when actions are considered
framework, Blackwell’s approachability theorem (Black- continuous. To this end, we use the doubly smoothed gra-
well 1956) naturally becomes the theoretical support for the dient descent ascent method (DS-GDA) which has a worst-
optimal strategy of the uninformed player (P2). Building on case complexity of O(ε−4 ) (Zheng et al. 2023).
top of (Aumann, Maschler, and Stearns 1995), (De Meyer
1996) introduced the concept of a dual game in which the be- 2p0s Differential Games w/ One-Sided Info.
havioral strategy of the uninformed player becomes Markov.
This concept later helped (Cardaliaguet 2007; Ghimire et al. Notations and preliminaries. We use ∆(I) as the sim-
2024) to establish the value existence proof for 2p0s differ- plex in RI , [K] := {1, ..., K} for K ∈ Z+ , a[i] as the ith
ential games with incomplete information. Unlike repeated element of vector a, ∂V as the subgradient of function V ,
games in which belief manipulation occurs only in the first and ⟨·, ·⟩ for vector product. Consider a time-invariant dy-
round of the game, differential games may have multiple namical system that defines the evolution of the joint state
critical collocation points in the joint space of time, state, x ∈ X ⊆ Rdx of P1 and P2 with control inputs u ∈ U and
and public belief where belief manipulations are necessary v ∈ V, respectively:
to achieve Nash equilibrium, depending on the specifications ẋ(t) = f (x(t), u, v). (1)
of system dynamics, payoffs, and state constraints (Ghimire The game starts at t0 ∈ [0, T ] from some initial state
et al. 2024). For this reason, scalable value and strategy ap- x(t0 ) = x0 . The initial belief p0 ∈ ∆(I) is set to nature’s
proximation for 2p0s differential games with incomplete in- distribution. P1 of type i accumulates a running cost li (u, v)
formation has not yet been achieved. during the game and receives a terminal cost gi (x(T )),
Imperfect information extensive-form games. IIEFGs where i ∼ p0 . The goal of P1 is to minimize the expected
represent the more general set of simultaneous or sequen- sum of the running and terminal costs, while P2 aims to
tial multi-agent decision-making problems with finite hori- maximize it. A behavioral strategy pair (η, ζ) is a Nash equi-
zons. Since any 2p0s IIEFG with finite action sets has a librium (NE) of a zero-sum game if and only if
normal-form formulation, a unique Nash equilibrium always Z T Z T
exists in the space of mixed strategies. Significant efforts inf sup Eη,ζ,i∼p0 li dt+gi = sup inf Eη,ζ,i∼p0 li dt+gi ,
η ζ 0 ζ η 0
have been taken to find equilibrium of large IIEFGs such as
(2)
poker (Koller and Megiddo 1992; Billings et al. 2003; Gilpin
and we call this common value the value of the game. A NE
and Sandholm 2006; Gilpin et al. 2007; Sandholm 2010;
is called pure if the strategies (η, ζ) are deterministic, spec-
Brown and Sandholm 2019), with a converging set of algo-
ifying a definite action for every decision point. It is called
rithms that are no-regret, average- or last-iterate converging,
mixed if the strategies are probabilistic, involving random-
and with sublinear or linear convergence rates (Zinkevich
ization over action spaces. When information is one-sided,
et al. 2007; Abernethy, Bartlett, and Hazan 2011; McMahan
η = {ηi }I since P1 prepares one strategy for each possible
2011; Tammelin 2014; Johanson et al. 2012; Lanctot et al.
game type. We introduce the following assumptions under
2009; Brown et al. 2019, 2020b; Perolat et al. 2021; Sokota
which mixed NE exists for Eq. (2) (Cardaliaguet 2009):
et al. 2022; Perolat et al. 2022; Schmid et al. 2023) (see sum-
mary in Tab. 1). These algorithms all have computational 1. U ⊆ Rdu and V ⊆ Rdv are compact and finite-
complexities increasing with |A|, provided that the equilib- dimensional sets.
rium behavioral strategy lies in the interior of the simplex 2. f : X × U × V → X is bounded, continuous, and uni-
∆(|A|). Critically, this assumption does not hold for differ- formly Lipschitz continuous with respect to x.
3. gi : X → R and li : U ×V → R are Lipschitz continuous backward induction for the dual value:
and bounded. Vτ∗ (T, x, p̂) = max{p̂i − gi (x(T ))}
i
4. Isaacs’ condition holds for the Hamiltonian H : X × 
Rdx → R: Vτ (t, x, p̂) = Vexp̂ min max Vτ∗ (t + τ, x + τ f (x, u, v), (7)

v∈V u∈U
H(x, ξ) := min max f (x, u, v)⊤ ξ − li (u, v)

u∈U v∈V p̂ − τ l(u, v)) ,
(3)
= max min f (x, u, v)⊤ ξ − li (u, v).
v∈V u∈U where, l = [l1 , ..., lI ]T . Then at any (t, x, p̂), P2 finds λ =
[λ1 , . . . , λI+1 ] and p̂k ∈ RI for k ∈ [I + 1] such that:
5. Both players have full knowledge about f , {gi }Ii=1 , I+1
{li }Ii=1 , p0 , and the NE of the game. Control inputs and
X  

Vτ (t, x, p̂) = λk min max Vτ∗ (t + τ, x + τ f (x, u, v),
states are fully observable and we assume perfect recall. k
v u

Critically, the Isaacs’ condition ensures that 2p0s differential  XI+1


games with complete information have pure NE. p̂ − τ l(u, v)) ; λk p̂k = p̂
k
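For concreteness, the one-step propagation x + τ f(x, u, v) that appears throughout the backward inductions below, and that is written ODE(x, τ, u, v; f) in the reformulated problems, is simply a forward-Euler step. The Python sketch below makes this explicit; the double-integrator dynamics are a hypothetical placeholder that loosely mirrors the 2D position-velocity states used in the experiments, not a system prescribed by the theory.

```python
import numpy as np

def euler_step(x, u, v, tau, f):
    """One forward-Euler step of Eq. (1): x_{t+tau} ~ x_t + tau * f(x_t, u, v).
    This is the ODE(x, tau, u, v; f) operator used in the reformulations below."""
    return np.asarray(x, dtype=float) + tau * np.asarray(f(x, u, v), dtype=float)

def f_double_integrator(x, u, v):
    # Hypothetical dynamics for illustration only: two decoupled 2D double
    # integrators, x = [pos1, vel1, pos2, vel2], with P1 applying acceleration u
    # and P2 applying acceleration v.
    pos1, vel1, pos2, vel2 = np.split(np.asarray(x, dtype=float), 4)
    return np.concatenate([vel1, u, vel2, v])

x0 = np.zeros(8)                                   # joint state of both players
x1 = euler_step(x0, u=np.array([1.0, 0.0]), v=np.array([0.0, -1.0]),
                tau=0.25, f=f_double_integrator)
```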
Behavioral strategy of P1. A behavioral strategy prescribes distributions over the action space at every subgame (t, x, p). In order to determine the strategy, it is necessary to first characterize the value function. From Cardaliaguet (2009), we obtain the following backward induction to approximate the value given a sufficiently fine time-discretization τ → 0⁺:

    Vτ(t, x, p) = Vex_p [ min_{u∈U} max_{v∈V} { Vτ(t + τ, x + τ f(x, u, v), p) + τ E_{i∼p}[li(u, v)] } ];  Vτ(T, x, p) = Σ_i p_i gi(x),    (4)

where Vex is the convexification operator (with respect to p). The behavioral strategy of P1 is computed as follows: P1 first finds λ = [λ_1, . . . , λ_I] ∈ ∆(I) and p^k ∈ ∆(I) for k ∈ [I] such that:

    Vτ(t, x, p) = Σ_{k∈[I]} λ_k [ min_{u∈U} max_{v∈V} { Vτ(t + τ, x + τ f(x, u, v), p^k) + τ E_{i∼p^k}[li(u, v)] } ];  Σ_{k∈[I]} λ_k p^k = p.    (5)

He then chooses u^k with Pr(u = u^k | i) = λ_k p^k[i] / p[i] if he is of type i, and updates the belief to p^k. This is famously known as the splitting mechanism in repeated games, and is a consequence of the "Cav u" theorem (Aumann, Maschler, and Stearns 1995; De Meyer 1996).
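As a minimal sketch of this splitting step, assuming the type-conditioned probabilities α_ki = Pr(u = u^k | i) introduced in the Methods section: the marginals λ_k and posteriors p^k follow directly from α and the current belief p, the posteriors average back to the prior (Σ_k λ_k p^k = p), and P1 of type i samples the atom k with probability α_ki. The names and the random α below are illustrative only.

```python
import numpy as np

def split_belief(alpha, p):
    """Splitting mechanism behind Eq. (5).
    alpha[k, i] = Pr(u = u^k | type i); each column of alpha sums to 1 over k.
    Returns lam[k]  = sum_i alpha[k, i] * p[i]      (marginal prob. of atom u^k)
    and     post[k] = (alpha[k, :] * p) / lam[k]    (posterior belief p^k)."""
    p = np.asarray(p, dtype=float)
    lam = alpha @ p
    safe = np.where(lam > 0.0, lam, 1.0)             # avoid dividing by an unused atom
    post = (alpha * p) / safe[:, None]
    return lam, post

rng = np.random.default_rng(0)
I = 3
alpha = rng.dirichlet(np.ones(I), size=I).T          # columns sum to 1 over k
p = np.array([0.5, 0.3, 0.2])
lam, post = split_belief(alpha, p)
assert np.allclose(lam @ post, p)                    # Cav-u splitting: sum_k lam_k p^k = p

# P1 of type i plays atom u^k with probability alpha[k, i] = lam_k p^k[i] / p[i],
# and the public belief then moves from p to post[k].
i = 1
k = rng.choice(I, p=alpha[:, i])
updated_belief = post[k]
```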
Behavioral strategy of P2. For P2, the idea is to reformulate the game so that we can compute the value using P2's behavioral strategies and P1's pure best responses. This can be achieved by introducing the Fenchel conjugate V* of V:

    V*(t0, x0, p̂) := max_p { p · p̂ − V(t0, x0, p) }
                   = inf_ζ sup_η max_{i∈{1,...,I}} { p̂_i − E_{η,ζ} [ gi(X_T^{t0,x0,η,ζ}) + ∫_{t0}^T li(η(s), ζ(s)) ds ] },    (6)

which describes a dual game with complete information, in which P2's goal is to minimize some worst-case dual payoff. It is proved that P2's equilibrium in the dual game starting from some (t0, x0, p̂) is also an equilibrium for the primal game if p̂ ∈ ∂_p V(t0, x0, p) (Cardaliaguet 2007).

P2's strategy can be obtained through the dual game using a procedure similar to that of P1's: We first obtain the backward induction for the dual value:

    Vτ*(T, x, p̂) = max_i { p̂_i − gi(x(T)) };
    Vτ*(t, x, p̂) = Vex_p̂ [ min_{v∈V} max_{u∈U} Vτ*(t + τ, x + τ f(x, u, v), p̂ − τ l(u, v)) ],    (7)

where l = [l1, ..., lI]ᵀ. Then, at any (t, x, p̂), P2 finds λ = [λ_1, . . . , λ_{I+1}] and p̂^k ∈ R^I for k ∈ [I + 1] such that:

    Vτ*(t, x, p̂) = Σ_{k=1}^{I+1} λ_k [ min_{v∈V} max_{u∈U} Vτ*(t + τ, x + τ f(x, u, v), p̂^k − τ l(u, v)) ];  Σ_{k=1}^{I+1} λ_k p̂^k = p̂,    (8)

where again l = [l1, ..., lI]ᵀ. P2's strategy is to compute the minimax solution v^k corresponding to p̂^k and choose v = v^k with probability λ_k.
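The first line of Eq. (6) can be evaluated numerically whenever an approximation of V(t0, x0, ·) is available. The sketch below does this by brute force over a belief grid for I = 2; the toy value function is a placeholder, not the paper's fitted model.

```python
import numpy as np

def fenchel_conjugate(V_of_p, p_hat, belief_grid):
    """Numerically evaluate the first line of Eq. (6):
    V*(t0, x0, p_hat) = max_p  <p, p_hat> - V(t0, x0, p),
    by brute force over a finite grid of beliefs in Delta(I)."""
    values = [float(np.dot(p, p_hat)) - V_of_p(p) for p in belief_grid]
    best = int(np.argmax(values))
    return values[best], belief_grid[best]

# Illustration with I = 2: beliefs p = (q, 1 - q) on a uniform grid, and a toy
# convex-in-p value function standing in for a fitted V(t0, x0, .).
belief_grid = [np.array([q, 1.0 - q]) for q in np.linspace(0.0, 1.0, 101)]
V_toy = lambda p: -4.0 * p[0] * p[1]                 # placeholder only
p_hat = np.array([0.3, -0.1])
v_star, p_star = fenchel_conjugate(V_toy, p_hat, belief_grid)
```

A maximizing belief p⋆ is one at which p̂ ∈ ∂_p V(t0, x0, p⋆), which is exactly the condition under which P2's dual-game equilibrium is also a primal equilibrium.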
Methods

Reformulation of the primal and dual games. To recap, at any (t, x), P1 computes actions u^k and their type-conditioned probabilities α_ki := Pr(u = u^k | i) such that Σ_{k=1}^I α_ki = 1 for i ∈ [I]. Then, λ_k = Σ_{i=1}^I α_ki p[i] and p^k[i] = α_ki p[i] / λ_k are both functions of α_ki. We can now reformulate (5) as follows:

    min_{{u^k},{α_ki}} max_{{v^k}}  Σ_{k=1}^I λ_k ( V(t + τ, x^k, p^k) + τ E_{i∼p^k}[li(u^k, v^k)] )
    s.t.  u^k ∈ U,  v^k ∈ V,  α_ki ∈ [0, 1],  x^k = ODE(x, τ, u^k, v^k; f),
          Σ_{k=1}^I α_ki = 1,  λ_k = Σ_{i=1}^I α_ki p[i],  p^k[i] = α_ki p[i] / λ_k,  ∀ i, k ∈ [I].    (P1)

(P1) is in general a nonconvex-nonconcave minimax problem of size (O(I(I + du)), O(I dv)) that needs to be solved at all sampled infostates (t, x, p) ∈ [0, T] × X × ∆(I). The resultant minimax objective is by definition the convexified value of the primal game.

P2, on the other hand, keeps track of the dual variable p̂ ∈ R^I instead of the public belief p during the dual game, and solves the following problem at all sampled infostates (t, x, p̂):

    min_{{v^k},{λ_k},{p̂^k}} max_{{u^k}}  Σ_{k=1}^{I+1} λ_k V*(t + τ, x^k, p̂^k − τ l(u^k, v^k))
    s.t.  u^k ∈ U,  v^k ∈ V,  x^k = ODE(x, τ, u^k, v^k; f),  λ_k ∈ [0, 1],
          Σ_{k=1}^{I+1} λ_k p̂^k = p̂,  Σ_{k=1}^{I+1} λ_k = 1,  k ∈ [I + 1].    (P2)

(P2) is in general nonconvex-nonconcave of size (O(I(I + dv)), O(I du)).
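The paper solves (P1) and (P2) with DS-GDA (Zheng et al. 2023). Purely as a structural illustration, and not as a reproduction of DS-GDA, the sketch below runs a plain projected simultaneous gradient descent-ascent loop on a toy nonconvex-nonconcave objective; the step size, iteration count, and box constraints are arbitrary.

```python
import numpy as np

def projected_gda(grad, theta, omega, lr, steps, theta_box, omega_box):
    """Plain projected simultaneous gradient descent-ascent.
    theta is the minimizing block (in (P1): the {u^k} and {alpha_ki});
    omega is the maximizing block (in (P1): the {v^k}).
    This loop is only a simplified stand-in for DS-GDA and makes no claim
    about matching its convergence guarantees."""
    for _ in range(steps):
        g_theta, g_omega = grad(theta, omega)
        theta = np.clip(theta - lr * g_theta, *theta_box)   # descent step, then project
        omega = np.clip(omega + lr * g_omega, *omega_box)   # ascent step, then project
    return theta, omega

# Toy objective phi(theta, omega) = sin(theta)cos(omega) + theta*omega/4, used only
# to make the sketch runnable; it is unrelated to the game's payoff.
def grad_phi(theta, omega):
    return (np.cos(theta) * np.cos(omega) + omega / 4.0,
            -np.sin(theta) * np.sin(omega) + theta / 4.0)

theta_star, omega_star = projected_gda(grad_phi, theta=1.0, omega=-1.0,
                                       lr=0.05, steps=500,
                                       theta_box=(-2.0, 2.0), omega_box=(-2.0, 2.0))
```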
Game solver. We propose a continuous-action mixed-strategy (CAMS) solver for 2p0s differential games with one-sided information. Our algorithm performs Bellman backup through (P1) (resp. (P2)) starting from the terminal condition in (4) (resp. (8)), at discretized time stamps t ∈ {T, T − τ, ..., 0} and (x, p) (resp. (x, p̂)) uniformly sampled in X × ∆(I) (resp. X × R^I). Specifically, at any t, with a value approximation model V̂_{t+1} : X × ∆(I) → R, we solve (P1) using DS-GDA at N collocation points (x, p) ∈ X × ∆(I) and collect a dataset D_t := {(x^(i), p^(i), Ṽ^(i))}_{i=1}^N, where Ṽ^(i) is the numerical approximation of the convexified value at (t, x^(i), p^(i)) for the minimax problem. Then we fit a model V̂_t(x, p) to D_t and go to t − τ. Alg. 1 summarizes the solver for the primal game. The dual game solver is defined similarly.

Algorithm 1: Continuous Action Mixed Strategy Solver (CAMS)
Require: τ, V(T, ·, ·), N, minimax solver O
1: Initialize {V̂_t}_{t=0}^{T−τ}, D ← ∅
2: S ← sample N states (x, p) ∈ X × ∆(I)
3: for t ∈ {T − τ, . . . , 0} do
4:   for (x, p) ∈ S do
5:     Append {(t, x, p), O(t, x, p)} to D
6:   end for
7:   Fit V̂_t to D
8: end for
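A minimal sketch of the outer loop of Alg. 1 for the primal game, assuming hypothetical callables solve_P1 (the minimax oracle O, e.g., DS-GDA applied to (P1)), sample_states, and fit (any regression routine for V̂_t); these are placeholders rather than the paper's implementation. The dual-game solver would follow the same loop with (P2) and dual states (x, p̂).

```python
import numpy as np

def cams_primal(T, tau, terminal_value, sample_states, solve_P1, fit, N=1024):
    """Sketch of the outer loop of Alg. 1 (primal game).
    terminal_value(x, p)      -> V(T, x, p) = sum_i p[i] * g_i(x)
    sample_states(N)          -> N collocation points (x, p) in X x Delta(I)
    solve_P1(t, x, p, V_next) -> numerical convexified value at (t, x, p),
                                 i.e. the minimax oracle O (e.g. DS-GDA on (P1))
    fit(dataset)              -> value model V_hat_t(x, p) fitted to D_t
    N is an arbitrary collocation count."""
    S = sample_states(N)                              # line 2 of Alg. 1
    V_next, V_hat = terminal_value, {}
    for t in np.arange(T - tau, -tau / 2.0, -tau):    # t in {T - tau, ..., 0}
        D_t = [(x, p, solve_P1(t, x, p, V_next)) for (x, p) in S]
        V_hat[round(float(t), 6)] = fit(D_t)          # fit V_hat_t to D_t
        V_next = V_hat[round(float(t), 6)]
    return V_hat
```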
Empirical Validation

We introduce Hexner's homing game (Hexner 1979) that has an analytical Nash equilibrium. We use variants of this game to compare CAMS with baselines (MMD, CFR+, and DeepCFR) on solution quality and computational cost. As shown in Fig. 2, it is a two-player game in which P1's goal is to get closer to the target Θ unknown to P2, while keeping P2 away and minimizing running costs. The cost to P1 is the expected value of the total cost:

    J = ∫_0^T ( uᵀ R1 u − vᵀ R2 v ) dt + [x1(T) − Θ]ᵀ K1 [x1(T) − Θ] − [x2(T) − Θ]ᵀ K2 [x2(T) − Θ],    (9)

where R1, R2 ≻ 0 and K1, K2 ⪰ 0 are control- and state-penalty matrices, respectively. Due to the quadratic cost and decoupled dynamics, this game can be solved analytically as done in Hexner (1979).
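A minimal sketch of evaluating the cost in Eq. (9) on a time-discretized trajectory; the matrices, horizon, and trajectories below are arbitrary placeholders rather than the parameters used in the experiments.

```python
import numpy as np

def hexner_cost(u_traj, v_traj, x1_T, x2_T, theta, R1, R2, K1, K2, tau):
    """Discretized evaluation of Eq. (9):
    J = int_0^T (u'R1 u - v'R2 v) dt + (x1(T)-theta)'K1(x1(T)-theta)
                                     - (x2(T)-theta)'K2(x2(T)-theta)."""
    running = sum(u @ R1 @ u - v @ R2 @ v for u, v in zip(u_traj, v_traj)) * tau
    e1, e2 = x1_T - theta, x2_T - theta
    return running + e1 @ K1 @ e1 - e2 @ K2 @ e2

# Placeholder data for illustration: 2D controls/positions, identity weights.
tau, steps = 0.25, 4
u_traj = [np.array([0.2, 0.0])] * steps
v_traj = [np.array([0.0, 0.1])] * steps
J = hexner_cost(u_traj, v_traj, x1_T=np.array([0.9, 0.0]), x2_T=np.array([-0.5, 0.2]),
                theta=np.array([1.0, 0.0]), R1=np.eye(2), R2=np.eye(2),
                K1=np.eye(2), K2=np.eye(2), tau=tau)
```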


Comparison on 1- and 4-stage Hexner's Games. We first use a normal-form Hexner's game with τ = T and a fixed P1 initial state x0 to demonstrate that IIEFG algorithms suffer from increasing costs along |A| while CAMS does not. We consider CFR+ (Tammelin 2014), MMD (Sokota et al. 2022), and a modified CFR-BR (Johanson et al. 2012) (dubbed CFR-BR-Primal, where we only focus on solving P1's optimal strategy) as baselines. Each player's state consists of 2D position and velocity. For baselines, we discretize the action sets A1 and A2 with sizes {16, 36, 64, 144}. All algorithms terminate when a threshold of NashConv (see Lanctot et al. (2017) for the definition) is met. For conciseness, we only consider solving P1's strategy and thus use P1's δ in NashConv. We then use DeepCFR as a baseline for a Hexner's game with 4 time-steps, where T = 1 and τ = 0.25. DeepCFRs were run for 1000 CFR iterations (resp. 100) with 10 (resp. 5) traversals for |A| = 9 (resp. 16). We compare the computational cost and the expected action error ε (and the average action error at each time-step, ε̄t, for the 4-stage game) from the ground-truth action of P1. Fig. 3 summarizes the comparisons. For the normal-form game, all baselines have complexities increasing with A, while CAMS is invariant. In the 4-stage game, CAMS achieves significantly better strategies than DeepCFR, as visualized in Fig. 4.

Figure 2: Hexner's game with a sample equilibrium trajectory. P1 starts to move to its target after tr.

Figure 3: Comparisons b/w CAMS and baseline algorithms: (a) wall time vs. |A|, (b) algorithm iterations vs. |A|, (c) action error ε vs. |A|, and (d) per-time-step action error ε̄t for the 4-stage game.

Figure 4: Trajectories using strategies from CAMS and DeepCFR (|A| = 9 and |A| = 16). Markers indicate initial position.

Conclusion

This work highlights the need for a scalable algorithm for solving incomplete-information differential games, which are structurally similar to imperfect-information games such as poker. We demonstrated that SOTA IIEFG solvers are intractable when it comes to solving differential games. To the authors' best knowledge, this is the first method to provide a tractable solution for incomplete-information differential games with continuous action spaces without problem-specific abstraction and discretization.
Acknowledgment

This work is partially supported by NSF CNS 2304863, CNS 2339774, IIS 2332476, and ONR N00014-23-1-2505.

References

Abernethy, J.; Bartlett, P. L.; and Hazan, E. 2011. Blackwell approachability and no-regret learning are equivalent. In Proceedings of the 24th Annual Conference on Learning Theory, 27–46. JMLR Workshop and Conference Proceedings.

Aumann, R. J.; Maschler, M.; and Stearns, R. E. 1995. Repeated games with incomplete information. MIT Press.

Billings, D.; Burch, N.; Davidson, A.; Holte, R.; Schaeffer, J.; Schauenberg, T.; and Szafron, D. 2003. Approximating game-theoretic optimal strategies for full-scale poker. In IJCAI, volume 3, 661.

Blackwell, D. 1956. An analog of the minimax theorem for vector payoffs.

Brown, N.; Bakhtin, A.; Lerer, A.; and Gong, Q. 2020a. Combining deep reinforcement learning and search for imperfect-information games. Advances in Neural Information Processing Systems, 33: 17057–17069.

Brown, N.; Bakhtin, A.; Lerer, A.; and Gong, Q. 2020b. Combining deep reinforcement learning and search for imperfect-information games. Advances in Neural Information Processing Systems, 33: 17057–17069.

Brown, N.; Lerer, A.; Gross, S.; and Sandholm, T. 2019. Deep counterfactual regret minimization. In International Conference on Machine Learning, 793–802. PMLR.

Brown, N.; and Sandholm, T. 2019. Superhuman AI for multiplayer poker. Science, 365(6456): 885–890.

Burch, N.; Johanson, M.; and Bowling, M. 2014. Solving imperfect information games using decomposition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 28.

Cardaliaguet, P. 2007. Differential games with asymmetric information. SIAM Journal on Control and Optimization, 46(3): 816–838.

Cardaliaguet, P. 2009. Numerical approximation and optimal strategies for differential games with lack of information on one side. Advances in Dynamic Games and Their Applications: Analytical and Numerical Developments, 1–18.

Cen, S.; Wei, Y.; and Chi, Y. 2021. Fast policy extragradient methods for competitive games with entropy regularization. Advances in Neural Information Processing Systems, 34: 27952–27964.

De Meyer, B. 1996. Repeated games, duality and the central limit theorem. Mathematics of Operations Research, 21(1): 237–251.

FAIR†, M. F. A. R. D. T.; Bakhtin, A.; Brown, N.; Dinan, E.; Farina, G.; Flaherty, C.; Fried, D.; Goff, A.; Gray, J.; Hu, H.; et al. 2022. Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Science, 378(6624): 1067–1074.

Ghimire, M.; Zhang, L.; Xu, Z.; and Ren, Y. 2024. State-Constrained Zero-Sum Differential Games with One-Sided Information. In Salakhutdinov, R.; Kolter, Z.; Heller, K.; Weller, A.; Oliver, N.; Scarlett, J.; and Berkenkamp, F., eds., Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, 15512–15539. PMLR.

Gilpin, A.; Hoda, S.; Pena, J.; and Sandholm, T. 2007. Gradient-based algorithms for finding Nash equilibria in extensive form games. In Internet and Network Economics: Third International Workshop, WINE 2007, San Diego, CA, USA, December 12-14, 2007. Proceedings 3, 57–69. Springer.

Gilpin, A.; and Sandholm, T. 2006. Finding equilibria in large sequential games of imperfect information. In Proceedings of the 7th ACM Conference on Electronic Commerce, 160–169.

Harsanyi, J. C. 1967. Games with incomplete information played by "Bayesian" players, I–III Part I. The basic model. Management Science, 14(3): 159–182.

Hexner, G. 1979. A differential game of incomplete information. Journal of Optimization Theory and Applications, 28: 213–232.

Johanson, M.; Bard, N.; Burch, N.; and Bowling, M. 2012. Finding optimal abstract strategies in extensive-form games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 26, 1371–1379.

Koller, D.; and Megiddo, N. 1992. The complexity of two-person zero-sum games in extensive form. Games and Economic Behavior, 4(4): 528–552.

Lanctot, M.; Waugh, K.; Zinkevich, M.; and Bowling, M. 2009. Monte Carlo sampling for regret minimization in extensive games. Advances in Neural Information Processing Systems, 22.

Lanctot, M.; Zambaldi, V.; Gruslys, A.; Lazaridou, A.; Tuyls, K.; Pérolat, J.; Silver, D.; and Graepel, T. 2017. A unified game-theoretic approach to multiagent reinforcement learning. Advances in Neural Information Processing Systems, 30.

McMahan, B. 2011. Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization. In Gordon, G.; Dunson, D.; and Dudík, M., eds., Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, 525–533. Fort Lauderdale, FL, USA: PMLR.

Moravčík, M.; Schmid, M.; Burch, N.; Lisý, V.; Morrill, D.; Bard, N.; Davis, T.; Waugh, K.; Johanson, M.; and Bowling, M. 2017. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337): 508–513.

Perolat, J.; De Vylder, B.; Hennes, D.; Tarassov, E.; Strub, F.; de Boer, V.; Muller, P.; Connor, J. T.; Burch, N.; Anthony, T.; et al. 2022. Mastering the game of Stratego with model-free multiagent reinforcement learning. Science, 378(6623): 990–996.

Perolat, J.; Munos, R.; Lespiau, J.-B.; Omidshafiei, S.; Rowland, M.; Ortega, P.; Burch, N.; Anthony, T.; Balduzzi, D.; De Vylder, B.; et al. 2021. From Poincaré recurrence to convergence in imperfect information games: Finding equilibrium via regularization. In International Conference on Machine Learning, 8525–8535. PMLR.

Sandholm, T. 2010. The state of solving large incomplete-information games, and application to poker. AI Magazine, 31(4): 13–32.

Schmid, M.; Moravčík, M.; Burch, N.; Kadlec, R.; Davidson, J.; Waugh, K.; Bard, N.; Timbers, F.; Lanctot, M.; Holland, G. Z.; et al. 2023. Student of Games: A unified learning algorithm for both perfect and imperfect information games. Science Advances, 9(46): eadg3256.

Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. 2017a. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.

Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017b. Mastering the game of Go without human knowledge. Nature, 550(7676): 354–359.

Sokota, S.; D'Orazio, R.; Kolter, J. Z.; Loizou, N.; Lanctot, M.; Mitliagkas, I.; Brown, N.; and Kroer, C. 2022. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. arXiv preprint arXiv:2206.05825.

Tammelin, O. 2014. Solving large imperfect information games using CFR+. arXiv preprint arXiv:1407.5042.

Vieillard, N.; Kozuno, T.; Scherrer, B.; Pietquin, O.; Munos, R.; and Geist, M. 2020. Leverage the average: an analysis of KL regularization in reinforcement learning. Advances in Neural Information Processing Systems, 33: 12163–12174.

Wang, Z.; Veličković, P.; Hennes, D.; Tomašev, N.; Prince, L.; Kaisers, M.; Bachrach, Y.; Elie, R.; Wenliang, L. K.; Piccinini, F.; et al. 2024. TacticAI: an AI assistant for football tactics. Nature Communications, 15(1): 1906.

Zheng, T.; Zhu, L.; So, A. M.-C.; Blanchet, J.; and Li, J. 2023. Universal gradient descent ascent method for nonconvex-nonconcave minimax optimization. Advances in Neural Information Processing Systems, 36: 54075–54110.

Zinkevich, M.; Johanson, M.; Bowling, M.; and Piccione, C. 2007. Regret minimization in games with incomplete information. Advances in Neural Information Processing Systems, 20.
