
Lecture 12: Learning and Evaluation in Multi-Agent Systems

Start Recording!

Reminders

● Poster Presentation in Agora on the 19th of April!
References for this lecture:

1. Balduzzi, David, et al. "Open-ended learning in symmetric zero-sum games." International Conference on Machine Learning. PMLR, 2019.

Today: Empirical Games

Motivations: Learning Objectives

Single player: hand-crafted notion of performance.

Multi-player: very simple notion of performance; the complexity of the task depends on the opponent(s).
Achieving super-human performance in multi-player games is very challenging and requires a deep understanding of the game.

● Go [Silver et al. 2016] (picture from DeepMind's blog post)
● Dota 2 [OpenAI et al. 2019] (picture from OpenAI's blog post)
● Poker [Brown and Sandholm 2019] (picture from FAIR's blog post)
● StarCraft II [Vinyals et al. 2019] (picture from DeepMind's blog post)
1. Setting
2. Evaluation of agents
   2.1. Agent vs. Agent (Elo)
   2.2. Agent vs. Task (Nash)
3. Training of Agents
   3.1. Self-Play
   3.2. Fictitious Self-Play
Antisymmetric (Zero-Sum) Game (Functional Form)

Players v, w ∈ W (example: RL policies).

Anti-symmetric payoff: φ: W × W → ℝ with φ(v, w) = −φ(w, v).

Intuition: switching the roles switches the result.

Examples: Chess, Go, Poker (need to randomize who starts).

NB: can generalize to non-zero-sum games (just heavier notation, because there are two losses).
Who are the players?

Here we care about the agent/player (same thing).

Usually: a parametrized policy u_θ.

But it can be anything that plays the game (e.g., a chess engine, a human player).
Interpretation of the payoff

● φ(u, w) > 0: u beats w.
● φ(u, w) < 0: w beats u.
● φ(u, w) = 0: it is a tie.

Probability of winning: P(u beats w) = φ(u, w) + 1/2.
Example: Transitive game

φ(u, v) = σ( f(u) − f(v)) − 1/2

Question: is this the only possible transitive payoff?

Answer: No!

Remark: note the difference between skill and strategy (a small numerical sketch follows).
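To make the transitive payoff concrete, here is a minimal Python sketch (not from the lecture; names are illustrative) assuming an agent is summarized by a scalar rating f. It checks antisymmetry numerically and recovers the win probability.

```python
import numpy as np

def sigma(x):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def transitive_payoff(f_u, f_v):
    """phi(u, v) = sigma(f(u) - f(v)) - 1/2: positive when u is stronger."""
    return sigma(f_u - f_v) - 0.5

# Antisymmetry check: phi(u, v) = -phi(v, u).
f_u, f_v = 2.0, 0.5
assert np.isclose(transitive_payoff(f_u, f_v), -transitive_payoff(f_v, f_u))

# Win-probability interpretation: P(u beats v) = phi(u, v) + 1/2.
print(transitive_payoff(f_u, f_v) + 0.5)  # ~0.82
```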


Example: Elo Rating

f(u): Elo rating of u, with φ(u, v) = σ( f(u) − f(v)) − 1/2.

Problem: the gradient of φ vanishes when the skill gap f(u) − f(v) is large.

Intuition: playing against a much weaker opponent gives almost no training signal (see the numbers below).
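A quick numerical illustration of this intuition, assuming the payoff φ(u, v) = σ(f(u) − f(v)) − 1/2 above: the training signal (the derivative of the payoff with respect to f(u)) collapses as the skill gap grows.

```python
import numpy as np

sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

# Gradient of phi(u, v) = sigma(f(u) - f(v)) - 1/2 with respect to f(u)
# is sigma'(gap) = sigma(gap) * (1 - sigma(gap)), where gap = f(u) - f(v).
for gap in [0.0, 2.0, 6.0]:
    s = sigma(gap)
    print(gap, round(s * (1 - s), 4))  # 0.25, 0.105, 0.0025: big gap, tiny signal
```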
Example: Elo Rating

f(u): Elo rating of u.

φ(u, v) = σ( f(u) − f(v)) − 1/2 is an antisymmetric payoff! :-)


Online Estimation of the Elo

Target: the true probability p_ij that player i beats player j.
Estimated probability: p̂_ij = σ( f_i − f_j).
Score of a match-up: s_ij ∈ {0, 1/2, 1} (loss, draw, win for player i).

Cross-entropy loss between the estimated and the true probability:
ℓ(f_i, f_j) = −[ s_ij log p̂_ij + (1 − s_ij) log(1 − p̂_ij) ].

(Stochastic) gradient descent on that loss:
f_i ← f_i + η (s_ij − p̂_ij),  f_j ← f_j − η (s_ij − p̂_ij).

Exercise: derive this gradient.
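Below is a sketch of the resulting online update, assuming the logistic model p̂_ij = σ(f_i − f_j) used above; the constants used by chess federations (base 10, 400-point scale) are omitted, and the names are illustrative.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def elo_update(f, i, j, s_ij, lr=0.1):
    """One SGD step on the cross-entropy between the match score s_ij
    (1 win, 0.5 draw, 0 loss for player i) and p_hat = sigma(f_i - f_j)."""
    p_hat = sigma(f[i] - f[j])
    # Gradient of the cross-entropy w.r.t. f_i is (p_hat - s_ij),
    # and w.r.t. f_j it is (s_ij - p_hat).
    f[i] += lr * (s_ij - p_hat)
    f[j] -= lr * (s_ij - p_hat)
    return f

# Toy usage: player 0 keeps beating player 1, so the ratings drift apart.
f = np.zeros(2)
for _ in range(100):
    f = elo_update(f, 0, 1, s_ij=1.0)
print(f)  # f[0] > 0 > f[1]
```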


Take-away

● Optimization perspective on the Elo: stochastic gradient descent with a constant step-size.

● SGD with a constant step-size does not converge (it only converges to a neighborhood whose size is proportional to the variance times the step-size).

Question: why still use a constant step-size?

Answer: because we want an estimate of the Elo at a given time (players' strength changes over time, so the estimator must keep tracking).
Population of agents: {v_1, …, v_n}.
Payoff matrix of the group: A_ij = φ(v_i, v_j).

Getting Elo From A

Intuition:
- If logit(p_ij) = f_i − f_j (transitive game),
- Then (1/n) Σ_j logit(p_ij) = f_i − f̄, i.e., the row average of the logit payoff matrix gives the individual Elo f_i minus the average Elo f̄.
Getting Elo From A

Theorem:
- If Ā is the antisymmetric matrix of logits, Ā_ij = logit(p_ij),
- We have the decomposition Ā_ij = (f_i − f_j) + B_ij,
  where (f_i − f_j) is the transitive component and B is the cyclic component.

Take-away:
● Elo = f.
● Elo is meaningful only if ||B|| << the spread of f.

Cyclic component: there exist cycles, e.g., P1 beats P2, P2 beats P3, and P3 beats P1 (a sketch of this decomposition follows).
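A sketch of this decomposition, assuming A already contains the logit payoffs (so A is antisymmetric): the Elo-like ratings are the row means and B is the cyclic residual. On a rock-paper-scissors matrix, the Elo part vanishes entirely.

```python
import numpy as np

def decompose_payoff(A):
    """Split an antisymmetric (logit) payoff matrix into a transitive part
    f_i - f_j and a cyclic residual B, with A = (f[:, None] - f[None, :]) + B."""
    assert np.allclose(A, -A.T), "payoff matrix must be antisymmetric"
    f = A.mean(axis=1)               # rating of each player, up to a constant
    transitive = f[:, None] - f[None, :]
    B = A - transitive               # cyclic component (also antisymmetric)
    return f, B

# Rock-paper-scissors-like matrix: purely cyclic, so f = 0 and B = A.
A = np.array([[0., 1., -1.],
              [-1., 0., 1.],
              [1., -1., 0.]])
f, B = decompose_payoff(A)
print(f)                  # [0, 0, 0]: Elo says everyone is equal
print(np.linalg.norm(B))  # large: Elo alone is not meaningful here
```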
Why do we care about this?

Elo is useful to predict the win-loss probability:
- under the assumption that the game is transitive, p̂_ij = σ( f_i − f_j).

Assuming we 'know' f_i and f_j, we can predict who will win.

We need a "higher-order" Elo in non-transitive games.


Higher Order Elo

Idea: a "PCA" on B.
- B is skew-symmetric → no PCA, but a similar decomposition exists: the real Schur decomposition, B = Q T Qᵀ with Q orthogonal and T block-diagonal with 2×2 rotation blocks.
- First K components: best rank-K estimate of B.

Estimate the principal components of B (sketch below).
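A sketch of this step using SciPy's real Schur decomposition; the matrix below is the rock-paper-scissors cyclic residual from the previous sketch, used purely for illustration.

```python
import numpy as np
from scipy.linalg import schur

# Cyclic (rock-paper-scissors-like) component B.
B = np.array([[0., 1., -1.],
              [-1., 0., 1.],
              [1., -1., 0.]])

# Real Schur decomposition: B = Q T Q^T with Q orthogonal.  Because B is
# skew-symmetric, T is block-diagonal with 2x2 "rotation" blocks
# [[0, b], [-b, 0]]; each block is one cyclic mode of the game.
T, Q = schur(B, output='real')
print(np.round(T, 3))
print(np.allclose(B, Q @ T @ Q.T))  # True

# Rows of Q, scaled by the block magnitudes, give each player a low-dimensional
# "cyclic vector"; keeping only the largest blocks gives a low-rank summary of B.
```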


Higher Order Elo

The performance of player i against player j depends on two quantities:

● Skill (Elo): the difference of skills f_i − f_j.
● Strategy (cyclic vector): the cyclic component B_ij, which says how much the game is cyclic.

Both are estimated with an empirical payoff matrix.

Caveat: we need all the pairwise matchups! (Not always the case… think about chess.)
Agents vs. Tasks

How are tasks combined?

[Table from the NeurIPS tutorial on learning dynamics by Marta Garnelo, Wojciech Czarnecki and David Balduzzi]
Desired properties

1. Invariant: adding redundant copies of an agent or task to the data should make no difference.

2. Continuous: the evaluation method should be robust to small changes in the data.

3. Interpretable: hard to formalize, but the procedure should agree with intuition in basic cases.
Maxent Nash Evaluation Method

Treat evaluation as a zero-sum meta-game between a meta-agent (a distribution p over agents) and a meta-task (a distribution q over tasks).

Theorem: among the Nash equilibria of this meta-game, there is a unique (p*, q*) that maximizes the entropy H(p*) + H(q*).

Best Agents

The best agents are the ones in the support of the maxent Nash (see the sketch below).
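A sketch of the meta-game evaluation, assuming scores are arranged in an agents × tasks matrix. It computes a Nash mixture with a standard zero-sum linear program (scipy.optimize.linprog); note that it returns an arbitrary Nash equilibrium, not necessarily the max-entropy one, which would require an extra convex program on top. All numbers are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def zero_sum_nash(A):
    """Nash strategy of the row player in the zero-sum game with payoff A
    (row player maximizes p^T A q).  Returns (p, game_value)."""
    n, m = A.shape
    # Variables x = [p_1, ..., p_n, v]; maximize v  <=>  minimize -v.
    c = np.zeros(n + 1)
    c[-1] = -1.0
    # For every column j:  v - sum_i p_i A_ij <= 0.
    A_ub = np.hstack([-A.T, np.ones((m, 1))])
    b_ub = np.zeros(m)
    # Probabilities sum to one.
    A_eq = np.concatenate([np.ones(n), [0.0]])[None, :]
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], res.x[-1]

# Meta-game between 3 agents and 2 tasks (rows: agents, columns: tasks),
# entries = score of the agent on the task (illustrative numbers).
A = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])
p, v = zero_sum_nash(A)          # Nash mixture over agents ("meta-agent")
q, _ = zero_sum_nash(-A.T)       # Nash mixture over tasks  ("meta-task")
print(np.round(p, 3), np.round(q, 3), round(v, 3))
```

With these illustrative numbers the value is 0.5 and several meta-agent mixtures are optimal (e.g. (3/7, 4/7, 0) and (0, 0, 1)), which is exactly why selecting the maximum-entropy equilibrium matters.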
Application: Atari

● Performance of each agent against the environments (uniform or Nash averaging).
● Difficulty of each environment against an average player (uniform or Nash averaging).
Training of Multiple Agents
Example: Elo Rating (recap)

f(u): Elo rating of u.

Problem: playing against a much weaker opponent gives almost no training signal.

Solution: self-play, i.e., play against a copy of yourself.
Open-ended Learning

A general framework to answer the question: "Who plays against whom?"

At each step, train a stronger agent against the chosen opponent(s).

Example: self-play, where the opponent is the latest agent v_t.

Conclusion:
- A general framework to understand algorithms such as self-play or fictitious self-play.
Self-Play

- Play against a copy of yourself.
- Well-calibrated opponent.
- Simple algorithm.
- Successful in Chess, Go and many other applications.
- Issue: only suited when the payoff is transitive:
  - improvement against v_t implies global improvement.


A Simple Example: The Bilinear Game

Simple (purely cyclic) payoff, e.g.: φ(θ, θ′) = θ₁θ′₂ − θ₂θ′₁.

Self-play update: θ_{t+1} = θ_t + η ∇_θ φ(θ, θ_t) |_{θ = θ_t}.

Remark: in practice the opponent is a frozen copy of θ_t (do not differentiate through it).
A Simple Example: The Bilinear Game

The vector field depends on your opponent.

Proposition: the dynamics of self-play diverge!

Conclusion: self-play is not enough if the game is not purely transitive (see the simulation below).
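A small simulation of this divergence, assuming the purely cyclic payoff φ(θ, θ′) = θ₁θ′₂ − θ₂θ′₁ sketched above: each self-play step multiplies the norm of θ by √(1 + η²), so the iterates spiral outward.

```python
import numpy as np

def phi_grad(theta, opponent):
    """Gradient in theta of phi(theta, opp) = theta[0]*opp[1] - theta[1]*opp[0]."""
    return np.array([opponent[1], -opponent[0]])

# Self-play: at each step, take a gradient step against a frozen copy of yourself.
theta = np.array([1.0, 0.0])
lr = 0.1
norms = []
for _ in range(200):
    theta = theta + lr * phi_grad(theta, opponent=theta.copy())
    norms.append(np.linalg.norm(theta))

print(norms[0], norms[-1])  # the norm keeps growing: the dynamics spiral outward
```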
Idea: Playing Against a Group of Agents

Population of agents: {v_1, …, v_n}.
Payoff matrix of the group: A_ij = φ(v_i, v_j).
Nash of an Empirical Game

Proposition: since A is antisymmetric, the empirical game has a symmetric Nash equilibrium (p*, p*), and its value is 0.

Mixture of agents: sample v_i with probability p_i.
Matrix of the Empirical Game

We can use this matrix for several purposes:
1. Evaluate (a group of) agents.
2. Evaluate the diversity of a group of agents.
3. Set up efficient training.

Many open questions remain:
- How to relate the empirical matrix to the real game?
- Are the proposed measures (see the next slides) meaningful?
How to Evaluate the Performance of a Population?

Relative population performance: v(P, Q) = p*ᵀ A_{P,Q} q*, where A_{P,Q} is the payoff matrix between the two populations and (p*, q*) is a Nash of the zero-sum game it defines.

Proposition: for any population B, v(B, B) = 0.
How to Evaluate the Diversity of a Population?

Effective diversity: d(P) = p*ᵀ ⌊A⌋₊ p*, where ⌊A⌋₊ keeps only the positive entries of the payoff matrix and p* is the Nash mixture.

Interpretation: how much the best agents (i.e., the agents in the Nash) exploit each other (sketch below).
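A sketch of both measures, assuming A is the empirical antisymmetric payoff matrix of the population and p_star is its Nash mixture (computed e.g. with the LP sketch earlier); the function names are illustrative.

```python
import numpy as np

def relative_population_performance(A_PQ, p_star, q_star):
    """v(P, Q) = p*^T A_PQ q*, where A_PQ[i, j] = phi(agent_i of P, agent_j of Q)
    and (p*, q*) is a Nash of the zero-sum game between the two populations."""
    return p_star @ A_PQ @ q_star

def effective_diversity(A, p_star):
    """How much the Nash agents exploit each other: d = p*^T [A]_+ p*,
    where [A]_+ keeps only the positive entries of the payoff matrix."""
    return p_star @ np.clip(A, 0.0, None) @ p_star

# Rock-paper-scissors population: the Nash is uniform and diversity is positive.
A = np.array([[0., 1., -1.],
              [-1., 0., 1.],
              [1., -1., 0.]])
p_star = np.ones(3) / 3
print(effective_diversity(A, p_star))                      # 1/3
print(relative_population_performance(A, p_star, p_star))  # 0: v(B, B) = 0
```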
How to Train Agents Efficiently?

Use this matrix to decide whom to train against:
● Train against the Nash.
● Train against the best response.
● Many other ways:
  - [Garnelo et al. 2021] (to appear at AAMAS)
  - see also the StarCraft II paper (league of agents).
Idea:
- Compute the Nash of the empirical game and play against it (PSRO).
- Seems like a good idea.
- Problem: it sometimes provides zero gradient (e.g., the bilinear example).
Fictitious Self-Play

- Group of agents v_1, …, v_t.
- Play against the 'best' opponent (in practice, a mixture of the past agents).
- Used in StarCraft II [Vinyals et al. 2019] (see the sketch below).
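A sketch of how the opponent-selection rules mentioned in this lecture differ, expressed as distributions over the current population; train_against and nash_solver are hypothetical placeholders for the RL inner loop and an empirical-game Nash solver.

```python
import numpy as np

def choose_opponent_mixture(rule, payoff_matrix, nash_solver=None):
    """Distribution over the current population v_1..v_t used to sample
    training opponents.  `nash_solver` should return a Nash mixture of the
    empirical game (e.g., the LP sketch from the evaluation section)."""
    t = payoff_matrix.shape[0]
    if rule == "self-play":        # only the latest agent
        p = np.zeros(t)
        p[-1] = 1.0
    elif rule == "fictitious":     # uniform over all past agents
        p = np.ones(t) / t
    elif rule == "psro":           # Nash of the empirical payoff matrix
        p = nash_solver(payoff_matrix)
    else:
        raise ValueError(rule)
    return p

# Illustrative empirical payoff matrix for a population of 3 agents.
A_t = np.array([[0., 0.2, -0.1],
                [-0.2, 0., 0.3],
                [0.1, -0.3, 0.]])
print(choose_opponent_mixture("self-play", A_t))    # [0. 0. 1.]
print(choose_opponent_mixture("fictitious", A_t))   # [1/3 1/3 1/3]
# choose_opponent_mixture("psro", A_t, nash_solver=...)   # plug in a Nash solver
# new_agent = train_against(population, opponent_distribution=p)  # hypothetical RL inner loop
```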
Conclusion

● Self-play is a very powerful method to train agents in a multi-agent framework.

● Sometimes it fails (when we need a diversity of agents to play the game).

● When we have a group of agents, we can use the empirical payoff matrix to:
  ○ evaluate agents,
  ○ train agents,
  ○ evaluate the group (performance and diversity).
