
Lecture 12: Learning and Evaluation in Multi-Agent Systems

Start Recording!

Reminders

● Poster Presentation in Agora on the 19th of April!
References for this lecture:

1. Balduzzi, David, et al. "Open-ended learning in symmetric zero-sum games." International Conference on Machine Learning. PMLR, 2019.

Today: Empirical Games

Motivations: Learning Objectives

Single player: hand-crafted notion of performance.

Multi-player: very simple notion of performance; the complexity of the task depends on the opponent(s).
Achieving super-human performance in multi-player games is very challenging and requires a deep understanding of the game.

● Go [Silver et al. 2016] (picture from DeepMind's blog post)
● Dota 2 [OpenAI et al. 2019] (picture from OpenAI's blog post)
● Poker [Brown and Sandholm 2019] (picture from FAIR's blog post)
● StarCraft II [Vinyals et al. 2019] (picture from DeepMind's blog post)
1. Setting
2. Evaluation of agents
   2.1. Agent vs. Agent (Elo)
   2.2. Agent vs. Task (Nash)
3. Training of Agents
   3.1. Self-Play
   3.2. Fictitious Self-Play
Antisymmetric (Zero-Sum) Game (Functional Form)

Players v, w ∈ W (example: RL policies).

Anti-symmetric payoff: φ: W × W → ℝ with φ(v, w) = −φ(w, v).

Intuition: switching the roles switches the result.

Examples: Chess, Go, Poker (need to randomize who starts).

NB: can generalize to non-zero-sum games (just heavier notation, because there are two losses).
Who are the players?

Here we care about the agent/player (same thing).

Usually: a parametrized policy u_θ.

But it can be anything that plays the game (e.g., a chess engine, a human player).
Interpretation of the payoff

● φ(u, w) > 0: u beats w.
● φ(u, w) < 0: w beats u.
● φ(u, w) = 0: it is a tie.

Probability of winning: P(u beats w) = φ(u, w) + 1/2.
Example: Transitive game

φ(u, v) = σ( f(u) − f(v)) − 1/2

Question: is this the only possible transitive payoff?

Answer: No!

Remark: note the difference between skill and strategy (a small numerical sketch follows).
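To make the transitive payoff concrete, here is a minimal Python sketch (not from the lecture; names are illustrative) assuming an agent is summarized by a scalar rating f. It checks antisymmetry numerically and recovers the win probability.

```python
import numpy as np

def sigma(x):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def transitive_payoff(f_u, f_v):
    """phi(u, v) = sigma(f(u) - f(v)) - 1/2: positive when u is stronger."""
    return sigma(f_u - f_v) - 0.5

# Antisymmetry check: phi(u, v) = -phi(v, u).
f_u, f_v = 2.0, 0.5
assert np.isclose(transitive_payoff(f_u, f_v), -transitive_payoff(f_v, f_u))

# Win-probability interpretation: P(u beats v) = phi(u, v) + 1/2.
print(transitive_payoff(f_u, f_v) + 0.5)  # ~0.82
```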


Example: Elo Rating

f(u): Elo rating of u, with φ(u, v) = σ( f(u) − f(v)) − 1/2.

Problem: the gradient of φ vanishes when the skill gap f(u) − f(v) is large.

Intuition: playing against a much weaker opponent gives almost no training signal (see the numbers below).
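A quick numerical illustration of this intuition, assuming the payoff φ(u, v) = σ(f(u) − f(v)) − 1/2 above: the training signal (the derivative of the payoff with respect to f(u)) collapses as the skill gap grows.

```python
import numpy as np

sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

# Gradient of phi(u, v) = sigma(f(u) - f(v)) - 1/2 with respect to f(u)
# is sigma'(gap) = sigma(gap) * (1 - sigma(gap)), where gap = f(u) - f(v).
for gap in [0.0, 2.0, 6.0]:
    s = sigma(gap)
    print(gap, round(s * (1 - s), 4))  # 0.25, 0.105, 0.0025: big gap, tiny signal
```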
Example: Elo Rating

f(u): Elo rating of u.

φ(u, v) = σ( f(u) − f(v)) − 1/2 is an antisymmetric payoff! :-)


Online Estimation of the Elo

Target: the true probability p_ij that player i beats player j.
Estimated probability: p̂_ij = σ( f_i − f_j).
Score of a match-up: s_ij ∈ {0, 1/2, 1} (loss, draw, win for player i).

Cross-entropy loss between the estimated and the true probability:
ℓ(f_i, f_j) = −[ s_ij log p̂_ij + (1 − s_ij) log(1 − p̂_ij) ].

(Stochastic) gradient descent on that loss:
f_i ← f_i + η (s_ij − p̂_ij),  f_j ← f_j − η (s_ij − p̂_ij).

Exercise: derive this gradient.
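Below is a sketch of the resulting online update, assuming the logistic model p̂_ij = σ(f_i − f_j) used above; the constants used by chess federations (base 10, 400-point scale) are omitted, and the names are illustrative.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def elo_update(f, i, j, s_ij, lr=0.1):
    """One SGD step on the cross-entropy between the match score s_ij
    (1 win, 0.5 draw, 0 loss for player i) and p_hat = sigma(f_i - f_j)."""
    p_hat = sigma(f[i] - f[j])
    # Gradient of the cross-entropy w.r.t. f_i is (p_hat - s_ij),
    # and w.r.t. f_j it is (s_ij - p_hat).
    f[i] += lr * (s_ij - p_hat)
    f[j] -= lr * (s_ij - p_hat)
    return f

# Toy usage: player 0 keeps beating player 1, so the ratings drift apart.
f = np.zeros(2)
for _ in range(100):
    f = elo_update(f, 0, 1, s_ij=1.0)
print(f)  # f[0] > 0 > f[1]
```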


Take-away

● Optimization perspective on the Elo: stochastic gradient descent with a constant step-size.

● SGD with a constant step-size does not converge (it only converges to a neighborhood whose size is proportional to the variance times the step-size).

Question: why still use a constant step-size?

Answer: because we want an estimate of the Elo at a given time (players' strength changes over time, so the estimator must keep tracking).
Population of agents: {v_1, …, v_n}.
Payoff matrix of the group: A_ij = φ(v_i, v_j).

Getting Elo From A

Intuition:
- If logit(p_ij) = f_i − f_j (transitive game),
- Then (1/n) Σ_j logit(p_ij) = f_i − f̄, i.e., the row average of the logit payoff matrix gives the individual Elo f_i minus the average Elo f̄.
Getting Elo From A

Theorem:
- If Ā is the antisymmetric matrix of logits, Ā_ij = logit(p_ij),
- We have the decomposition Ā_ij = (f_i − f_j) + B_ij,
  where (f_i − f_j) is the transitive component and B is the cyclic component.

Take-away:
● Elo = f.
● Elo is meaningful only if ||B|| << the spread of f.

Cyclic component: there exist cycles, e.g., P1 beats P2, P2 beats P3, and P3 beats P1 (a sketch of this decomposition follows).
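A sketch of this decomposition, assuming A already contains the logit payoffs (so A is antisymmetric): the Elo-like ratings are the row means and B is the cyclic residual. On a rock-paper-scissors matrix, the Elo part vanishes entirely.

```python
import numpy as np

def decompose_payoff(A):
    """Split an antisymmetric (logit) payoff matrix into a transitive part
    f_i - f_j and a cyclic residual B, with A = (f[:, None] - f[None, :]) + B."""
    assert np.allclose(A, -A.T), "payoff matrix must be antisymmetric"
    f = A.mean(axis=1)               # rating of each player, up to a constant
    transitive = f[:, None] - f[None, :]
    B = A - transitive               # cyclic component (also antisymmetric)
    return f, B

# Rock-paper-scissors-like matrix: purely cyclic, so f = 0 and B = A.
A = np.array([[0., 1., -1.],
              [-1., 0., 1.],
              [1., -1., 0.]])
f, B = decompose_payoff(A)
print(f)                  # [0, 0, 0]: Elo says everyone is equal
print(np.linalg.norm(B))  # large: Elo alone is not meaningful here
```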
Why do we care about this?

Elo is useful to predict the win-loss probability:
- under the assumption that the game is transitive, p̂_ij = σ( f_i − f_j).

Assuming we 'know' f_i and f_j, we can predict who will win.

We need a "higher-order" Elo in non-transitive games.


Higher Order Elo

Idea: a "PCA" on B.
- B is skew-symmetric → no PCA, but a similar decomposition exists: the real Schur decomposition, B = Q T Qᵀ with Q orthogonal and T block-diagonal with 2×2 rotation blocks.
- First K components: best rank-K estimate of B.

Estimate the principal components of B (sketch below).
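A sketch of this step using SciPy's real Schur decomposition; the matrix below is the rock-paper-scissors cyclic residual from the previous sketch, used purely for illustration.

```python
import numpy as np
from scipy.linalg import schur

# Cyclic (rock-paper-scissors-like) component B.
B = np.array([[0., 1., -1.],
              [-1., 0., 1.],
              [1., -1., 0.]])

# Real Schur decomposition: B = Q T Q^T with Q orthogonal.  Because B is
# skew-symmetric, T is block-diagonal with 2x2 "rotation" blocks
# [[0, b], [-b, 0]]; each block is one cyclic mode of the game.
T, Q = schur(B, output='real')
print(np.round(T, 3))
print(np.allclose(B, Q @ T @ Q.T))  # True

# Rows of Q, scaled by the block magnitudes, give each player a low-dimensional
# "cyclic vector"; keeping only the largest blocks gives a low-rank summary of B.
```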


Higher Order Elo

The performance of player i against player j depends on two quantities:

● Skill (Elo): the difference of skills f_i − f_j.
● Strategy (cyclic vector): the cyclic component B_ij, which says how much the game is cyclic.

Both are estimated with an empirical payoff matrix.

Caveat: we need all the pairwise matchups! (Not always the case… think about chess.)
Agents vs. Tasks

How are tasks combined?

[Table from the NeurIPS tutorial on learning dynamics by Marta Garnelo, Wojciech Czarnecki and David Balduzzi]
Desired properties

1. Invariant: adding redundant copies of an agent or task to the data should make no difference.

2. Continuous: the evaluation method should be robust to small changes in the data.

3. Interpretable: hard to formalize, but the procedure should agree with intuition in basic cases.
Maxent Nash Evaluation Method

Treat evaluation as a zero-sum meta-game between a meta-agent (a distribution p over agents) and a meta-task (a distribution q over tasks).

Theorem: among the Nash equilibria of this meta-game, there is a unique (p*, q*) that maximizes the entropy H(p*) + H(q*).

Best Agents

The best agents are the ones in the support of the maxent Nash (see the sketch below).
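A sketch of the meta-game evaluation, assuming scores are arranged in an agents × tasks matrix. It computes a Nash mixture with a standard zero-sum linear program (scipy.optimize.linprog); note that it returns an arbitrary Nash equilibrium, not necessarily the max-entropy one, which would require an extra convex program on top. All numbers are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def zero_sum_nash(A):
    """Nash strategy of the row player in the zero-sum game with payoff A
    (row player maximizes p^T A q).  Returns (p, game_value)."""
    n, m = A.shape
    # Variables x = [p_1, ..., p_n, v]; maximize v  <=>  minimize -v.
    c = np.zeros(n + 1)
    c[-1] = -1.0
    # For every column j:  v - sum_i p_i A_ij <= 0.
    A_ub = np.hstack([-A.T, np.ones((m, 1))])
    b_ub = np.zeros(m)
    # Probabilities sum to one.
    A_eq = np.concatenate([np.ones(n), [0.0]])[None, :]
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], res.x[-1]

# Meta-game between 3 agents and 2 tasks (rows: agents, columns: tasks),
# entries = score of the agent on the task (illustrative numbers).
A = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])
p, v = zero_sum_nash(A)          # Nash mixture over agents ("meta-agent")
q, _ = zero_sum_nash(-A.T)       # Nash mixture over tasks  ("meta-task")
print(np.round(p, 3), np.round(q, 3), round(v, 3))
```

With these illustrative numbers the value is 0.5 and several meta-agent mixtures are optimal (e.g. (3/7, 4/7, 0) and (0, 0, 1)), which is exactly why selecting the maximum-entropy equilibrium matters.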
Application: Atari

● Performance of each agent against the environments (uniform or Nash averaging).
● Difficulty of each environment against an average player (uniform or Nash averaging).
Training of Multiple Agents
Example: Elo Rating (recap)

f(u): Elo rating of u.

Problem: playing against a much weaker opponent gives almost no training signal.

Solution: self-play, i.e., play against a copy of yourself.
Open-ended Learning

A general framework to answer the question: "Who plays against whom?"

At each step, train a stronger agent against the chosen opponent(s).

Example: self-play, where the opponent is the latest agent v_t.

Conclusion:
- A general framework to understand algorithms such as self-play or fictitious self-play.
Self-Play

- Play against a copy of yourself.
- Well-calibrated opponent.
- Simple algorithm.
- Successful in Chess, Go and many other applications.
- Issue: only suited when the payoff is transitive:
  - improvement against v_t implies global improvement.


A Simple Example: The Bilinear Game

Simple (purely cyclic) payoff, e.g.: φ(θ, θ′) = θ₁θ′₂ − θ₂θ′₁.

Self-play update: θ_{t+1} = θ_t + η ∇_θ φ(θ, θ_t) |_{θ = θ_t}.

Remark: in practice the opponent is a frozen copy of θ_t (do not differentiate through it).
A Simple Example: The Bilinear Game

The vector field depends on your opponent.

Proposition: the dynamics of self-play diverge!

Conclusion: self-play is not enough if the game is not purely transitive (see the simulation below).
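A small simulation of this divergence, assuming the purely cyclic payoff φ(θ, θ′) = θ₁θ′₂ − θ₂θ′₁ sketched above: each self-play step multiplies the norm of θ by √(1 + η²), so the iterates spiral outward.

```python
import numpy as np

def phi_grad(theta, opponent):
    """Gradient in theta of phi(theta, opp) = theta[0]*opp[1] - theta[1]*opp[0]."""
    return np.array([opponent[1], -opponent[0]])

# Self-play: at each step, take a gradient step against a frozen copy of yourself.
theta = np.array([1.0, 0.0])
lr = 0.1
norms = []
for _ in range(200):
    theta = theta + lr * phi_grad(theta, opponent=theta.copy())
    norms.append(np.linalg.norm(theta))

print(norms[0], norms[-1])  # the norm keeps growing: the dynamics spiral outward
```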
Idea: Playing Against a Group of Agents

Population of agents: {v_1, …, v_n}.
Payoff matrix of the group: A_ij = φ(v_i, v_j).
Nash of an Empirical Game

Proposition: since A is antisymmetric, the empirical game has a symmetric Nash equilibrium (p*, p*), and its value is 0.

Mixture of agents: sample v_i with probability p_i.
Matrix of the Empirical Game

We can use this matrix for several purposes:
1. Evaluate (a group of) agents.
2. Evaluate the diversity of a group of agents.
3. Set up efficient training.

Many open questions remain:
- How to relate the empirical matrix to the real game?
- Are the proposed measures (see the next slides) meaningful?
How to Evaluate the Performance of a Population?

Relative population performance: v(P, Q) = p*ᵀ A_{P,Q} q*, where A_{P,Q} is the payoff matrix between the two populations and (p*, q*) is a Nash of the zero-sum game it defines.

Proposition: for any population B, v(B, B) = 0.
How to Evaluate the Diversity of a Population?

Effective diversity: d(P) = p*ᵀ ⌊A⌋₊ p*, where ⌊A⌋₊ keeps only the positive entries of the payoff matrix and p* is the Nash mixture.

Interpretation: how much the best agents (i.e., the agents in the Nash) exploit each other (sketch below).
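A sketch of both measures, assuming A is the empirical antisymmetric payoff matrix of the population and p_star is its Nash mixture (computed e.g. with the LP sketch earlier); the function names are illustrative.

```python
import numpy as np

def relative_population_performance(A_PQ, p_star, q_star):
    """v(P, Q) = p*^T A_PQ q*, where A_PQ[i, j] = phi(agent_i of P, agent_j of Q)
    and (p*, q*) is a Nash of the zero-sum game between the two populations."""
    return p_star @ A_PQ @ q_star

def effective_diversity(A, p_star):
    """How much the Nash agents exploit each other: d = p*^T [A]_+ p*,
    where [A]_+ keeps only the positive entries of the payoff matrix."""
    return p_star @ np.clip(A, 0.0, None) @ p_star

# Rock-paper-scissors population: the Nash is uniform and diversity is positive.
A = np.array([[0., 1., -1.],
              [-1., 0., 1.],
              [1., -1., 0.]])
p_star = np.ones(3) / 3
print(effective_diversity(A, p_star))                      # 1/3
print(relative_population_performance(A, p_star, p_star))  # 0: v(B, B) = 0
```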
How to Train Agents Efficiently?

Use this matrix to decide whom to train against:
● Train against the Nash.
● Train against the best response.
● Many other ways:
  - [Garnelo et al. 2021] (to appear at AAMAS)
  - see also the StarCraft II paper (league of agents).
Idea:
- Compute the Nash of the empirical game and play against it (PSRO).
- Seems like a good idea.
- Problem: it sometimes provides zero gradient (e.g., the bilinear example).
Fictitious Self-Play

- Group of agents v_1, …, v_t.
- Play against the 'best' opponent (in practice, a mixture of the past agents).
- Used in StarCraft II [Vinyals et al. 2019] (see the sketch below).
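A sketch of how the opponent-selection rules mentioned in this lecture differ, expressed as distributions over the current population; train_against and nash_solver are hypothetical placeholders for the RL inner loop and an empirical-game Nash solver.

```python
import numpy as np

def choose_opponent_mixture(rule, payoff_matrix, nash_solver=None):
    """Distribution over the current population v_1..v_t used to sample
    training opponents.  `nash_solver` should return a Nash mixture of the
    empirical game (e.g., the LP sketch from the evaluation section)."""
    t = payoff_matrix.shape[0]
    if rule == "self-play":        # only the latest agent
        p = np.zeros(t)
        p[-1] = 1.0
    elif rule == "fictitious":     # uniform over all past agents
        p = np.ones(t) / t
    elif rule == "psro":           # Nash of the empirical payoff matrix
        p = nash_solver(payoff_matrix)
    else:
        raise ValueError(rule)
    return p

# Illustrative empirical payoff matrix for a population of 3 agents.
A_t = np.array([[0., 0.2, -0.1],
                [-0.2, 0., 0.3],
                [0.1, -0.3, 0.]])
print(choose_opponent_mixture("self-play", A_t))    # [0. 0. 1.]
print(choose_opponent_mixture("fictitious", A_t))   # [1/3 1/3 1/3]
# choose_opponent_mixture("psro", A_t, nash_solver=...)   # plug in a Nash solver
# new_agent = train_against(population, opponent_distribution=p)  # hypothetical RL inner loop
```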
Conclusion

● Self-play is a very powerful method to train agents in a multi-agent framework.

● Sometimes it fails (when we need a diversity of agents to play the game).

● When we have a group of agents, we can use the empirical payoff matrix to:
  ○ evaluate agents,
  ○ train agents,
  ○ evaluate the group (performance and diversity).
