
Game Theory Lecture #14

Outline:

• Multiagent learning
• Regret matching
• Fictitious play
Single Agent Learning

• Setup:
– Two players: Player 1 vs. Nature
– Action sets: A1 and AN
– Payoffs: U : A1 × AN → R

                    Nature
                Rain    No Rain
P1  Umbrella      1        0
    No umbrella   0        1

            Player 1’s Payoff

• Player repeatedly interacts with nature


– Player’s action day t: a1 (t)
– Nature’s action day t: aN (t)
– Payoff day t: U (a1 (t), aN (t))
• Goal: Implement a strategy that provides desirable guarantees on average performance
• Case 1: Stationary environment

– Nature’s choice is made according to a non-adaptive (fixed) probability distribution pN ∈ ∆(AN)


– Theory available to optimize average performance, e.g., reinforcement learning

• Case 2: Non-stationary environment

– Nature’s choice is made according to an adaptive probability distribution, i.e., possibly pN (t) ≠ pN (t − 1)


– In general, pN (t) = f (a1 (0), ..., a1 (t − 1), aN (0), ..., aN (t − 1))
– One choice: aN (t) = βN (a1 (t − 1)), i.e., nature best responds to the player’s previous action (assume a zero-sum game)

• Question: Is a player’s environment stationary or non-stationary in a game?

Single agent learning (cont)

• Challenge: Hard to predict what nature is going to do


• Previous direction: Optimize worst-case payoffs (e.g., security strategies)
• Problem: Derived strategies might be highly inefficient given the actual behavior of nature
• Example:

                    Nature
                Rain    No Rain    Thunder
    Umbrella      1        0          0
P1  No umbrella   0        1          0
    Jacket       0.1      0.1        0.1

            Player 1’s Payoff
– What is security strategy?
– What is security level?
– How would answers change if there was never any thunder?

• Fact: Security strategies and values can be highly influenced by “rare” actions
• Are there “online” policies that can provide potentially better performance guarantees?
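• A minimal Python sketch (an illustration, assuming numpy and scipy are available) of how the security strategy and security level of a payoff matrix like the one above can be computed by linear programming:

import numpy as np
from scipy.optimize import linprog

# Rows: Umbrella, No umbrella, Jacket.  Columns: Rain, No Rain, Thunder.
U = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.1, 0.1, 0.1]])

n_rows, n_cols = U.shape
# Decision variables x = (p_1, ..., p_n, v); maximize v  <=>  minimize -v.
c = np.zeros(n_rows + 1)
c[-1] = -1.0
# For every column j require  v - p^T U[:, j] <= 0, i.e. p guarantees at least v.
A_ub = np.hstack([-U.T, np.ones((n_cols, 1))])
b_ub = np.zeros(n_cols)
# The mixed strategy must sum to one; v is a free variable.
A_eq = np.append(np.ones(n_rows), 0.0).reshape(1, -1)
b_eq = np.array([1.0])
bounds = [(0.0, 1.0)] * n_rows + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print("security strategy:", res.x[:n_rows])  # maximin mixed strategy over rows
print("security level:   ", res.x[-1])       # worst-case expected payoff

Dropping the Thunder column and re-running the same program shows how much a single “rare” column can move the security strategy and value.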

What about regret?

• New direction: Can a player optimize “what if” scenarios?


• Definition: Player’s average payoff at day t

  \bar{U}(t) = \frac{1}{t} \sum_{\tau=1}^{t} U(a_1(\tau), a_N(\tau))

• Definition: Player’s perceived average payoff at day t had the player committed to the fixed action a_1 while nature’s actions were unchanged

  \bar{v}^{a_1}(t) = \frac{1}{t} \sum_{\tau=1}^{t} U(a_1, a_N(\tau))

• Definition: Player’s regret at day t for not having used action a_1

  \bar{R}^{a_1}(t) = \bar{v}^{a_1}(t) - \bar{U}(t)

• Example:

Day                 1    2    3    4    5    6   ...
Player’s Decision   NU   U    NU   U    NU   NU  ...
Nature’s Decision   R    NR   R    R    NR   R   ...
Payoff              0    0    0    1    1    0   ...

– Ū(6)?
– v̄^U(6)?
– v̄^NU(6)?
– R̄^U(6)?
– R̄^NU(6)?
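• A minimal Python sketch of how these quantities can be computed for the trajectory above; the payoff function below encodes the umbrella game from the first page:

def payoff(a1, aN):
    # 1 if the player matches the weather (umbrella when it rains,
    # no umbrella when it does not), 0 otherwise.
    return 1.0 if (a1 == "U") == (aN == "R") else 0.0

player = ["NU", "U", "NU", "U", "NU", "NU"]   # player's decisions, days 1..6
nature = ["R", "NR", "R", "R", "NR", "R"]     # nature's decisions, days 1..6
t = len(player)

# Average payoff actually received.
U_bar = sum(payoff(a1, aN) for a1, aN in zip(player, nature)) / t
# Perceived average payoff for each fixed action against nature's sequence.
v_bar = {a1: sum(payoff(a1, aN) for aN in nature) / t for a1 in ("U", "NU")}
# Regret for each action.
R_bar = {a1: v_bar[a1] - U_bar for a1 in ("U", "NU")}

print("U_bar(6) =", U_bar)
print("v_bar(6) =", v_bar)
print("R_bar(6) =", R_bar)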

Regret Matching

• Positive regret = Player could have done something better in hindsight


• Q: Is it possible to make positive regret vanish asymptotically “irrespective” of nature?
• Consider the strategy Regret Matching: At day t play strategy p(t) ∈ ∆(A1), updated according to

  p^{U}(t+1) = \frac{[\bar{R}^{U}(t)]_+}{[\bar{R}^{U}(t)]_+ + [\bar{R}^{NU}(t)]_+}

  p^{NU}(t+1) = \frac{[\bar{R}^{NU}(t)]_+}{[\bar{R}^{U}(t)]_+ + [\bar{R}^{NU}(t)]_+}

• Notation: [·]+ is projection to positive orthant, i.e., [x]+ = max{x, 0}


• Strategy generalizes to more than two actions
• Fact: Positive regret asymptotically vanishes irrespective of nature

  [\bar{R}^{U}(t)]_+ \to 0

  [\bar{R}^{NU}(t)]_+ \to 0

• Example revisited:

Day                 1    2    3    4    5    6   ...
Player’s Decision   NU   U    NU   U    NU   NU  ...
Nature’s Decision   R    NR   R    R    NR   R   ...
Payoff              0    0    0    1    1    0   ...
– Regret matching strategy day 2?
– Regret matching strategy day 3?
– Regret matching strategy day 4?
– Regret matching strategy day 5?
– Regret matching strategy day 6?
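• A minimal Python sketch of this update rule replayed along the trajectory above; how to act when no regret is positive is left open in the notes, so defaulting to a uniform strategy in that case is an assumption:

def payoff(a1, aN):
    # Umbrella game: 1 if the player matches the weather, 0 otherwise.
    return 1.0 if (a1 == "U") == (aN == "R") else 0.0

player = ["NU", "U", "NU", "U", "NU", "NU"]   # player's decisions, days 1..6
nature = ["R", "NR", "R", "R", "NR", "R"]     # nature's decisions, days 1..6

for t in range(1, len(player) + 1):
    # Average payoff and positive part of the regret for each action.
    U_bar = sum(payoff(player[s], nature[s]) for s in range(t)) / t
    pos = {a1: max(sum(payoff(a1, nature[s]) for s in range(t)) / t - U_bar, 0.0)
           for a1 in ("U", "NU")}
    total = pos["U"] + pos["NU"]
    if total > 0:
        p_next = {a1: pos[a1] / total for a1 in ("U", "NU")}
    else:
        p_next = {"U": 0.5, "NU": 0.5}   # assumption: uniform when no positive regret
    print(f"regret matching strategy for day {t + 1}: {p_next}")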

Learning in games

• Consider the following one-shot game


– Players N
– Actions Ai
– Utility functions Ui : A → R
• Consider a repeated version of the above one-shot game where at each time t ∈ {1, 2, ...},
each player i ∈ N simultaneously
– Selects a strategy pi (t) ∈ ∆(Ai )
– Selects an action ai (t) randomly according to strategy pi (t)
– Receives utility Ui (ai (t), a−i (t))
– Each player updates strategy using available information

pi (t + 1) = f (a(0), a(1), ..., a(t); Ui )

• The strategy update function f (·) is referred to as the learning rule

– Ex: Cournot adjustment process

• Concern: How much information do players have access to?

– Structural form of utility function, i.e., Ui (·)?


– Action of other players, i.e., a−i (t)?
– Perceived reward for alternative actions, i.e., Ui (ai , a−i (t)) for any ai
– Utility received, Ui (a(t))

• Informational restrictions limit the class of admissible learning rules


• Goal: Provide asymptotic guarantees if all players follow a specific f (·)
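• For concreteness, a minimal Python sketch of this repeated-game protocol; the uniform rule is only a placeholder for a learning rule f(·), and all function names here are illustrative:

import random

def uniform_rule(history, actions, utility):
    # Placeholder learning rule: ignore the history and play uniformly at random.
    return {a: 1.0 / len(actions) for a in actions}

def repeated_play(actions, utilities, rules, T, seed=0):
    # actions[i]: action set of player i, utilities[i]: joint action -> payoff,
    # rules[i]: learning rule mapping (history, actions, utility) to a strategy.
    rng = random.Random(seed)
    history = []
    for t in range(T):
        # Each player selects a strategy from the observed history, then samples an action.
        strategies = [rules[i](history, actions[i], utilities[i])
                      for i in range(len(actions))]
        joint = tuple(rng.choices(list(p.keys()), weights=list(p.values()))[0]
                      for p in strategies)
        history.append(joint)   # players then receive Ui(a(t)) and update
    return history

# Example: a 2-player coordination game (both get 1 when their actions match).
U_coord = {("A", "A"): 1, ("A", "B"): 0, ("B", "A"): 0, ("B", "B"): 1}
hist = repeated_play([["A", "B"], ["A", "B"]],
                     [lambda a: U_coord[a], lambda a: U_coord[a]],
                     [uniform_rule, uniform_rule], T=5)
print(hist)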

Regret matching

• Consider the learning rule f(·) where

  p_i^{a_i}(t+1) = \frac{[\bar{R}_i^{a_i}(t)]_+}{\sum_{\tilde{a}_i \in A_i} [\bar{R}_i^{\tilde{a}_i}(t)]_+}

– p_i^{a_i}(t+1) = Probability player i plays action a_i at time t+1
– \bar{R}_i^{a_i}(t) = Regret of player i for action a_i at time t
• Fact: Max regret of all players goes to 0 (think of other players as “nature”)

  [\bar{R}_i^{a_i}(t)]_+ \to 0
• Result restated: The behavior converges to a “no-regret” point
• Question: Where are we? Is this a NE?
• Rewrite regret in terms of the empirical frequency of joint play z(t) ∈ ∆(A):

  \bar{U}_i(t) = \frac{1}{t} \sum_{\tau=1}^{t} U_i(a(\tau)) = U_i(z(t))

  \bar{v}_i^{a_i}(t) = \frac{1}{t} \sum_{\tau=1}^{t} U_i(a_i, a_{-i}(\tau)) = U_i(a_i, z_{-i}(t))

  \bar{R}_i^{a_i}(t) = \bar{v}_i^{a_i}(t) - \bar{U}_i(t) = U_i(a_i, z_{-i}(t)) - U_i(z(t))

• Characteristic of a no-regret point:

  \bar{R}_i^{a_i}(t) \le 0 \iff U_i(a_i, z_{-i}(t)) \le U_i(z(t))

• No-regret point restated: For any player i and action a_i,

  U_i(a_i, z_{-i}(t)) \le U_i(z(t))
• No-regret point = Coarse correlated equilibrium (slightly weaker notion than correlated
equilibrium)
• Slightly modified (and more complex) version of regret matching ensures convergence to
correlated equilibrium.
• Theorem: If all players follow the regret matching strategy, then the empirical frequency of play converges to the set of coarse correlated equilibria.
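• A minimal Python sketch of the no-regret check behind this theorem: given a history of joint play, it estimates Ui(z(t)) as the realized average and Ui(ai, z−i(t)) as the average against the other players’ observed actions, and reports player i’s largest regret (the matching-pennies history at the end is only an illustration):

def max_regret(history, actions_i, utility_i, i):
    # history: list of joint action tuples, utility_i: joint action -> payoff of
    # player i, i: player index.  Returns max over a_i of (v_bar_i^{a_i} - U_bar_i).
    t = len(history)
    avg = sum(utility_i(a) for a in history) / t          # U_i(z(t))
    regrets = []
    for ai in actions_i:
        # Replace player i's component of each joint action by the fixed a_i.
        v = sum(utility_i(a[:i] + (ai,) + a[i + 1:]) for a in history) / t
        regrets.append(v - avg)                           # U_i(a_i, z_-i(t)) - U_i(z(t))
    return max(regrets)

# Illustration: a short matching-pennies history; a no-regret point requires the
# returned value to be (asymptotically) non-positive for every player.
U_row = {("H", "H"): 1, ("H", "T"): -1, ("T", "H"): -1, ("T", "T"): 1}
history = [("H", "H"), ("H", "T"), ("T", "T"), ("T", "H")]
print("row player max regret:", max_regret(history, ["H", "T"], lambda a: U_row[a], 0))
print("col player max regret:", max_regret(history, ["H", "T"], lambda a: -U_row[a], 1))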

Convergence to NE?

• Recap: If all players follow the regret matching strategy then the empirical frequency
converges to the set of coarse correlated equilibria.
• This result holds irrespective of the underlying game!
• Problems:
– Predictability: Behavior will not necessarily settle down, i.e., only guarantees that
empirical frequency of play will be in the set of CCE
– Efficiency: The set of CCE is much larger than the set of NE. Are CCE worse than NE in terms of efficiency?
• Revised goal: Are there learning rules that converge to NE (as opposed to CCE) for any
game?
• Answer: No
• Theorem: There are no “natural” dynamics that lead to NE in every game (Hart, 2009).
– Natural = adaptive, simple, efficient (e.g., regret matching, Cournot, ...)
– Not natural = exhaustive search, mediator, ...
• Question: Are there natural dynamics that converge to NE for special game structures?
(e.g., zero-sum games?)

Fictitious Play

• Recall: A learning rule is of the form

pi (t + 1) = f (a(1), a(2), ..., a(t); Ui )

• Fictitious play: A learning rule where the strategy pi (t + 1) is a best response to the scenario where all players j ≠ i are selecting their action independently according to the empirical frequency of their past decisions.
• Define empirical frequencies qi (t) as follows:

  q_i^{a_i}(t) = \frac{1}{t} \sum_{\tau=1}^{t} I\{a_i(\tau) = a_i\}

• Fictitious play: Each player best responds to empirical frequencies

  p_i(t+1) \in \arg\max_{p_i \in \Delta(A_i)} U_i(p_i, q_{-i}(t))

  where

  U_i(p_i, q_{-i}(t)) = \sum_{a \in A} U_i(a_1, a_2, ..., a_n) \, p_i^{a_i} \prod_{j \neq i} q_j^{a_j}(t)

• FP facts: Beliefs (i.e., empirical frequencies) converge to NE for
– 2-player games with 2 moves per player
– Zero-sum games with an arbitrary number of moves per player
– Other game structures as well (more to come on this)

Fictitious play example

• Consider the following two-player zero-sum game

        L    C    R
  T    −1    0    1
  M     1   −1    0
  B     0    1   −1

• Suppose a(1) = {T, L}

– What is qrow (1)?


– What is qcol (1)?

• What is a(2)?

– What is qrow (2)?


– What is qcol (2)?

• What is a(3)?
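• A minimal Python sketch of fictitious play on this game (assuming numpy, and assuming the matrix entries are the row player’s payoffs); the first joint action is fixed to {T, L} as above, and breaking best-response ties by the lowest index is an assumption since the notes do not specify a tie-breaking rule:

import numpy as np

A = np.array([[-1, 0, 1],    # row player's payoffs; rows T, M, B and columns L, C, R
              [ 1, -1, 0],
              [ 0, 1, -1]])

row_counts = np.zeros(3)     # empirical counts of the row player's past actions
col_counts = np.zeros(3)     # empirical counts of the column player's past actions

row_a, col_a = 0, 0          # day 1 is fixed to (T, L)
for t in range(1, 11):
    row_counts[row_a] += 1
    col_counts[col_a] += 1
    q_row = row_counts / t   # empirical frequencies q_row(t), q_col(t)
    q_col = col_counts / t
    # Best responses to the opponent's empirical frequency; in the zero-sum game
    # the column player maximizes -A, i.e. minimizes the row player's payoff.
    row_a = int(np.argmax(A @ q_col))
    col_a = int(np.argmin(q_row @ A))
    print(f"day {t + 1}: row plays {'TMB'[row_a]}, column plays {'LCR'[col_a]}")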
