Proceedings of the AAMAS Workshop on Adaptive and Learning Agents
May 2010
Toronto, Canada
Editors
Marek Grześ and Matthew E. Taylor
This year’s edition of the Adaptive and Learning Agents workshop is the third after
the ALAMAS and ALAg workshops merged. ALAMAS was an annual European
workshop on Adaptive and Learning Agents and Multi-Agent Systems, held eight
times. ALAg was the international workshop on Adaptive and Learning agents, typ-
ically held in conjunction with AAMAS. To increase the strength, visibility, and
quality of the workshops, ALAMAS and ALAg were merged into the ALA workshop,
and a steering committee was appointed to guide its development. We are very happy
to present to you the proceedings of this special edition of the ALA workshop.
We thank all authors who responded to our call-for-papers. We expect that the
workshop will be both lively and informative, refining and producing future research
ideas. We are thankful to the members of the program committee for their high
quality reviews. We would like to thank all the members of the steering committee
for their guidance, and the AAMAS conference for providing an excellent venue for
our workshop.
Marek Grześ
Department of Computer Science
University of York
UK
[email protected]
Matthew E. Taylor
Department of Computer Science
The University of Southern California
USA
[email protected]
Program Committee
Steering Committee
Franziska Klügl
Daniel Kudenko
Ann Nowé
Lynne E. Parker
Sandip Sen
Peter Stone
Kagan Tumer
Karl Tuyls
CONTENTS
Foreword ii
Organisation iii
Contributed Papers
Learning to Take Turns 1
Peter Vrancx, Katja Verbeeck, and Ann Nowé
RESQ-learning in stochastic games 8
Daniel Hennes, Michael Kaisers, and Karl Tuyls
Adaptation of Stepsize Parameter to Minimize Exponential
Moving Average of Square Error by Newton’s Method 16
Itsuki Noda
Transfer Learning for Reinforcement Learning on a Physical Robot 24
Samuel Barrett, Matthew E. Taylor, and Peter Stone
Reinforcement Learning with Action Discovery 30
Bikramjit Banerjee and Landon Kraemer
Convergence, Targeted Optimality, and Safety in Multiagent Learning 38
Doran Chakraborty and Peter Stone
An Approach to Imitation Learning For Physically Heterogeneous
Robots 45
Jeff Allen and John Anderson
Multi-agent Reinforcement Learning with Reward Shaping
for KeepAway Takers 53
Sam Devlin, Marek Grześ, and Daniel Kudenko
Learn to Behave! Rapid Training of Behavior Automata 61
Sean Luke and Vittorio Ziparo
Policy Search and Policy Gradient Methods
for Autonomous Navigation 69
Matt Knudson and Kagan Tumer
A Comparison of Learning Approaches to Support
the Adaptive Provision of Distributed Services 77
Enda Barrett, Enda Howley, and Jim Duggan
Using bisimulation for policy transfer in MDPs 85
Pablo Samuel Castro and Doina Precup
The Evolution of Cooperation and Investment Strategies
in a Commons Dilemma 93
Enda Howley and Jim Duggan
Learning to Take Turns

Peter Vrancx
Computational Modeling Lab
Vrije Universiteit Brussel
Brussels, Belgium

Katja Verbeeck
Information Technology
KaHo St. Lieven (KULeuven)
Ghent, Belgium

Ann Nowé
Computational Modeling Lab
Vrije Universiteit Brussel
Brussels, Belgium
Table 1: Example Markov game with 2 states and 2 agents. Each agent has 2 actions in each state: actions a1 and a2 for agent 1 and b1 and b2 for agent 2. Rewards for joint actions in each state are given in the first row as matrix games. The second row specifies the transition probabilities to both states under each joint action.

            State 1                          State 2
  R:           b1        b2                     b1        b2
       a1   0.2/0.1     0/0             a1   1.0/0.5     0/0
       a2     0/0     0.2/0.1           a2     0/0     0.6/0.9

  T:   (a1,b1): (0.5,0.5)               (a1,b1): (0.5,0.5)
       (a1,b2): (0.5,0.5)               (a1,b2): (0.5,0.5)
       (a2,b1): (0.5,0.5)               (a2,b1): (0.5,0.5)
       (a2,b2): (0.5,0.5)               (a2,b2): (0.5,0.5)

idea is to tackle a Markov game by decomposing it into a set of multi-agent common interest problems, each reflecting one agent's preferences in the system. Simple reinforcement learning agents using Parameterised Learning Automata [11] are able to solve this set of MMDPs in parallel. A trusted third party is used to enforce that each MMDP is played and solved equally well. There is no need for the agents in the system to know which problem or reward function they are confronted with. As a result, a team of simple learning agents becomes able to switch play between desired joint policies rather than mixing individual policies. The role of the third party is minimal in the sense that only simple coordination signals need to be communicated. In case all agents fully trust their opponent players to stick with the proposed learning mechanism, the third party is even unnecessary. We will show how this technique can lead to turn-taking behavior in 2 different Markov games.

This paper is organized as follows: in the next section we introduce some background knowledge needed to develop our algorithm. In Section 3 our approach for learning correlated policies in Markov games is described. We explain our decomposition method and how it can easily be used in combination with a third party, comparable to the private signal modelled in the CE concept. We demonstrate this approach on a simple 2-state Markov game and a larger grid world problem in Section 4. We end with a discussion in Section 5.

2. BACKGROUND

In this section we describe some basic formalisms and background concepts used throughout the rest of this paper.

2.1 Markov Games

In this paper we adopt the formal setting of Markov games (also called stochastic games). Markov games are a straightforward extension of single agent Markov decision problems (MDPs) to the multi-agent case. A Markov game consists of a set of states S and a set of N agents. In each state s_i, A_{ik} = {a_{ik1}, . . . , a_{ikr}} is the action set available for agent k, with k : 1 . . . N. Actions in the game are the joint result of multiple agents choosing an action independently. The transition function T(s_i, a_i) and reward function R_k(s_i, a_i) determine the probability of moving to another state and the reward for each agent k, depending on the current state s_i and the joint action in this state s_i, i.e. a_i = (a_{i1}, . . . , a_{iN}) with a_{ik} ∈ A_{ik}. The reward function R_k(s, a) can be individual to each agent k, meaning that different agents can receive different rewards for the same state transition.

The goal of each individual agent in the game is to find a policy which maps each state to a strategy in order to maximize its reward. In this paper we consider the limit average reward, meaning that agents try to maximize their average reward over time. For a joint policy α consisting of a policy for each agent in the system, the limit average reward to agent k is defined as:

  J_k(α) ≡ lim_{l→∞} E[ (1/l) Σ_{t=0}^{l−1} R_k(s(t), a_1(t), . . . , a_N(t)) ]        (1)

In the remainder of this paper we will assume that the Markov chain of system states under every joint policy is ergodic. A Markov chain {x_l}_{l≥0} is said to be ergodic when the distribution of the chain converges to a limiting distribution π(α) = (π_1(α), . . . , π_{|S|}(α)) with ∀i, π_i(α) > 0 as l → ∞.

Due to the individual reward functions of the agents, it is in general impossible to find an optimal policy for all agents simultaneously. Instead, most approaches seek equilibrium points. In an equilibrium, no agent can improve its reward by changing its policy if all other agents keep their policy fixed. In a special case of the general Markov game framework, the so-called team games or multi-agent MDPs (MMDPs) [3], optimal policies do exist. In this case, the Markov game is purely cooperative and all agents share the same reward function. This specialization allows us to define the optimal policy as the joint agent policy which maximizes the payoff of all agents.

An example Markov game is given in Table 1. Each column in this table specifies one state of the problem. The first row gives the immediate rewards agents receive for a joint action, while the second row gives the transition probabilities. In this case transition probabilities are independent of the joint action chosen and the system moves to either state with equal probability.

2.2 Parameterised Learning Automata

Learning Automata are simple reinforcement learners which attempt to learn an optimal action, based on past actions and environmental feedback. Formally, the automaton is described by a tuple {A, β, p, U} where A = {a_1, . . . , a_r} is the set of possible actions the automaton can perform, p is the probability distribution over these actions, β is a random variable between 0 and 1 representing the environmental response, and U is a learning scheme used to update p.

In this paper we will make use of the so-called Parameterized Learning Automata (PLA) [11]. Instead of modifying probabilities directly, PLA use a parameter vector u(t) together with an exploration function g(u) and an update rule based on the REINFORCE algorithm [17]:

  u_i(t+1) = u_i(t) + λβ(t) ∂ln g/∂u_i (u(t), α(t)) + λh′(u_i(t)) + √λ b s_i(t)        (2)
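To make the limit average reward in (1) concrete, the following minimal sketch simulates the two-state game of Table 1 under a fixed joint policy and estimates J_k by averaging sampled rewards; the policy values and helper names are illustrative assumptions, not part of the paper. For the pure joint policy in which both agents always play their first action, the estimate should come out near (0.6, 0.3).

```python
import random

# Rewards for (state, a1, a2): pairs (reward agent 1, reward agent 2), taken from Table 1.
R = {
    (0, 0, 0): (0.2, 0.1), (0, 0, 1): (0.0, 0.0),
    (0, 1, 0): (0.0, 0.0), (0, 1, 1): (0.2, 0.1),
    (1, 0, 0): (1.0, 0.5), (1, 0, 1): (0.0, 0.0),
    (1, 1, 0): (0.0, 0.0), (1, 1, 1): (0.6, 0.9),
}

def step(state, a1, a2):
    """One transition of the Table 1 game: rewards from R, next state uniform over {0, 1}."""
    r1, r2 = R[(state, a1, a2)]
    next_state = random.randrange(2)      # every joint action moves to either state with prob. 0.5
    return next_state, r1, r2

def limit_average_reward(policy1, policy2, steps=200_000):
    """Estimate J_1 and J_2 of Eq. (1) by averaging rewards along one long trajectory."""
    state, total1, total2 = 0, 0.0, 0.0
    for _ in range(steps):
        a1 = 0 if random.random() < policy1[state] else 1   # policy[s] = Pr(first action in state s)
        a2 = 0 if random.random() < policy2[state] else 1
        state, r1, r2 = step(state, a1, a2)
        total1 += r1
        total2 += r2
    return total1 / steps, total2 / steps

if __name__ == "__main__":
    # Example joint policy: agent 1 always plays a1, agent 2 always plays b1.
    print(limit_average_reward(policy1=[1.0, 1.0], policy2=[1.0, 1.0]))
```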
where h′(x) is the derivative of h(x):

  h(x) = −K(x − L)^{2n}   if x ≥ L
         0                if |x| ≤ L                                            (3)
         −K(x + L)^{2n}   if x ≤ −L

{s_i(t) : t ≥ 0} is a set of i.i.d. variables with zero mean and variance σ², b is the learning parameter, σ and K are positive constants and n is a positive integer. In this update rule, the second term is a gradient following term, the third term is used to keep the solutions bounded and the final term is a random noise term. The gradient term is part of the original REINFORCE algorithm and allows agents to locally optimize their rewards. In [11] however, it was shown that this original algorithm is only locally optimal. Moreover, it was found that the algorithm could give rise to unbounded behavior, causing the values in u to go to infinity. To deal with these issues the authors in [11] added the extensions above. The random noise term √λ b s_i(t) is based on the concepts used in simulated annealing. It adds a random walk to the update that allows the algorithm to escape local optima that are not globally optimal. Additionally, the range for each value in the vector u is now limited to the interval [−L, L]. The h′(u_i(t)) term keeps each u_i bounded with |u_i| ≤ L. This term is 0 when the parameter u_i being updated is within the desired interval, but becomes either negative or positive when u_i leaves this interval. Provided that L is taken sufficiently large, the resulting update can still closely approximate the optimal solution, without resulting in unbounded behavior.

Groups of learning automata can be interconnected by using them as players in a repeated game. In such a game multiple automata interact with the same environment. A play a(t) = (a_1(t), . . . , a_n(t)) of n automata is a set of strategies chosen by the automata at stage t. Correspondingly, the response is now a vector β(t) = (β_1(t), . . . , β_n(t)), specifying a payoff for each automaton.

At every instance, all automata update their action probabilities based on the responses of the environment. Each automaton participating in the game operates without information concerning the number of participants, their strategies, their payoffs or actions.

In common interest games, where all agents receive the same feedback and a clear optimal solution exists, PLA can be used to assure that this global optimum is reached [11].

2.3 Automata Learning in Markov games

Besides the repeated games mentioned in the previous section, LA can also be used in more complex, multi-state problems. We now explain an automata-based algorithm, capable of finding pure equilibria in general-sum Markov games [15, 13] and optimal policies in MMDPs [14]. In the next section, this algorithm will serve as a building block for our turn taking approach. The algorithm is an extension of an LA algorithm for solving MDPs, originally proposed by Wheeler and Narendra [16].

The main idea behind the algorithm is that agent k associates a different learning automaton LA_k^i with each state s_i. The agents then defer the actual action selection in each state to the automaton they have associated with that state. Each time step each agent k in the system activates the LA_k^i that it associates with the current system state s_i. The joint action a_i, consisting of the actions of all automata associated with s_i, then triggers a transition to the next system state s_j and an individual reward R_k^i(s_i, a_i) for each agent. The agents then repeat the process in state s_j.

Automata in the system are not informed of the immediate reward that their joint action triggers. Instead each agent keeps track of the cumulative reward it has gathered up to the current time step. When the system returns to a state s_i that was previously visited, each agent k computes the time Δt_i that has passed since the last visit and the reward Δr_k^i that it has gathered since. Automaton LA_k^i then updates the action a it took last time using the following feedback:

  β_k^i = Δr_k^i / Δt_i                                                          (4)

In [15] it is shown that the behavior of this algorithm can be analyzed by examining an approximating limiting game. This game approximates the full Markov game by a single repeated game. Since the limiting game is an automata game, this limiting game view allows us to predict the behavior of the algorithm based on the convergence properties of the update rule used by the automata. When used with a common LA update scheme called linear reward-inaction [10], the system can be shown to converge towards pure Nash equilibria [15]. In the special case of MMDPs, where all agents receive the same reward and a globally optimal equilibrium still exists, the PLA introduced in the previous section can be used to achieve convergence to this global optimum [14]. In this paper we introduce another approach, in which the agents alternate between different joint policies in a general sum Markov game. In the next section we will show how this can be implemented, using the automata algorithm above.

3. MARKOV GAME TURN TAKING ALGORITHM

The main idea behind our algorithm is to split the Markov game into a number of common interest problems, one for each agent. These problems are then solved in parallel, allowing agents to switch between different joint policies, in order to satisfy different agents' preferences. The agents learn the preferred outcome for each participant in the game.

3.1 Markov game Decomposition

We develop a system in which agents alternate between optimising different agent goals in order to satisfy all agents in the system. This implies that we let agents switch between playing different joint policies.

To allow agents to switch between objectives we use a system based on Policy Time Sharing (PTS) approaches used in constrained MDPs [1]. A related approach was also used in a multi-objective reinforcement learning setting in [9]. In these systems, a single controller (i.e. agent) switches between alternate policies to keep a vector of payoffs in a target set. In this paper, on the other hand, we will consider a system composed of multiple independent controllers, each with an individual scalar payoff.

In a policy time sharing system the game play is divided into a series of periods. A single recurrent¹ state in the system is selected as the switch state. Play is then divided into episodes, with a single episode comprising the time-steps between 2 subsequent visits to the switch state. Each episode, a common predetermined system is used to select the reward function to optimize during the next episode.

¹ A recurrent state is a non-transient state. In the ergodic systems under study here all states are recurrent.
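The PLA update in (2)-(3) can be sketched as follows. This is a minimal illustration that assumes a softmax (Boltzmann) exploration function g(u); the constants are placeholders rather than the settings used in Section 4, and the noise scaling follows the reconstruction of Eq. (2) above.

```python
import math
import random

def h_prime(x, K=1.0, n=1, L=3.0):
    """Derivative of the bounding function h(x) in Eq. (3); zero inside [-L, L]."""
    if x >= L:
        return -2 * n * K * (x - L) ** (2 * n - 1)
    if x <= -L:
        return -2 * n * K * (x + L) ** (2 * n - 1)
    return 0.0

def boltzmann_probs(u):
    """Exploration function g(u): action probabilities via a softmax over the parameters."""
    m = max(u)
    e = [math.exp(ui - m) for ui in u]
    z = sum(e)
    return [ei / z for ei in e]

def pla_update(u, action, beta, lam=0.05, sigma=0.001, b=1.0, K=1.0, n=1, L=3.0):
    """One PLA step (Eq. 2): REINFORCE gradient of ln g, bounding term h', Gaussian noise."""
    probs = boltzmann_probs(u)
    new_u = []
    for i, ui in enumerate(u):
        # d ln g_a / d u_i for a softmax g: (1 - p_i) if i is the chosen action, else -p_i.
        grad = (1.0 if i == action else 0.0) - probs[i]
        noise = random.gauss(0.0, sigma)      # s_i(t): zero mean, standard deviation sigma
        new_u.append(ui + lam * beta * grad
                        + lam * h_prime(ui, K, n, L)
                        + math.sqrt(lam) * b * noise)
    return new_u

# Usage: sample an action from g(u), observe feedback beta in [0, 1], then update u.
u = [0.0, 0.0]
a = random.choices(range(2), weights=boltzmann_probs(u))[0]
u = pla_update(u, a, beta=0.7)
```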
(Figure 1: Markov game decomposition — the n-player Markov game is decomposed into MMDP 1, MMDP 2, . . . , MMDP n, one for each agent.)

(Figure 2, outline of the algorithm steps — at the start of a period the dispatcher determines the worst-performing agent i and sends the rewards; the agents update the PLAs used in the previous period and play the new period using the PLAs for objective i.)
3.2 Combining Joint Policies

Using the switching mechanism described above we can learn the joint policies that maximize each agent's individual payoff. One additional requirement to implement this system is a mechanism to decide which MMDP will be played next. This mechanism determines how the different joint policies learnt in the set of MMDPs are combined into a single solution and consequently how much each agent's goal is optimized.

Different methods could be used to implement the coordination mechanism. One possibility is to implement a communication protocol to let agents exchange rewards and negotiate about the next agent to aid. Alternatively it can be implemented using a centralized mechanism. In our setting we implement this switching mechanism using a separate dispatcher agent. This agent is separate from the other agents and does not participate in the actual learning problem. Instead this agent coordinates all other agents and determines the reward to optimize next. In this way the actual learning agents do not need information on the actions and rewards of others or even the fact that other agents are present in the system. Whenever the system reaches the switch state, the current episode ends and the dispatcher becomes active. The dispatcher then collects the total rewards up to the current time step for each agent and sends each agent in the system 2 pieces of information: a feedback for the last episode and the index of the next problem to be played. Figure 2 gives an outline of the algorithm steps.

The feedback is used by the agents to update the automata they used in the last episode. Since we assume the problem is ergodic, a single scalar reward is sufficient to update all automata in states visited during the last episode. The dispatcher can calculate this feedback by simply determining the average reward the agent corresponding to the last episode's goal gathered during the episode. The problem index sent to the learning agents indicates the next reward to be maximized. The learning agents themselves do not need to know whose reward they are optimizing; they can simply use the index to select the corresponding automata during the next episode.

The dispatcher can select from a wide variety of possible strategies to determine the next objective to optimize, depending on the desired outcome of the system. One possibility, for example, is to assign a fixed weight to each agent, which is then used by the dispatcher to determine the probability of selecting each agent as the next objective. Alternatively, the dispatcher could opt to maximize the maximum over the players' rewards and always select the agent having the highest possible payoff (also called a republican selection mechanism [6]). In [6] several possible mechanisms are discussed in the context of selecting a correlated equilibrium to use in updating the value function.

In this paper we focus on an egalitarian selection mechanism. This means we try to maximize the minimum of the players' rewards, and the dispatcher will always choose to optimize the payoff of the worst performing agent, i.e. the agent with the lowest average reward over time for the entire running time. In this way we can resolve dilemmas resulting from agents having different preferences for the game outcomes, by letting them take turns to play their optimum outcome. This allows each agent to achieve their desired objective at least some of the time. In situations such as the Battle of the Sexes game of Table 2, this assures that no agent will always be stuck with the minimum payoff.

Table 3: Possible outcomes for the Markov game in Table 1. Nash equilibria are marked with an asterisk.

                                     Agent 2
                   (b1,b1)     (b1,b2)     (b2,b1)     (b2,b2)
  Agent 1 (a1,a1)  0.6/0.3*    0.1/0.05    0.5/0.25      0/0
          (a1,a2)  0.1/0.05    0.4/0.5*      0/0       0.3/0.45
          (a2,a1)  0.5/0.25      0/0       0.6/0.3*    0.1/0.05
          (a2,a2)    0/0       0.3/0.45    0.1/0.05    0.4/0.5*

4. EXPERIMENTS

In this section we demonstrate the behavior of our approach on 2 Markov games and show that it does achieve a fair payoff division between agents. As a first problem setting we use the Markov game of Table 1. Table 3 lists the possible combinations of deterministic policies for this game, together with their expected average reward for each agent. We observe that the Markov game has 4 pure equilibrium points. All of these equilibria have asymmetric payoffs, with 2 equilibria favoring agent 1 and giving payoffs (0.6, 0.3), and the other equilibria favoring agent 2 with payoffs (0.4, 0.5). Figure 3 gives a typical run of the algorithm, which shows that agents equalize their average reward, while still obtaining a payoff between both equilibrium payoffs. All PLA used a Boltzmann exploration function and update parameters λ = 0.05, σ = 0.001, L = 3.0, K = n = 1.0. These parameter settings were determined empirically based on settings reported in [11, 14]. Over 20 runs of 100000 iterations the agents achieved an average payoff of 0.42891 (std. dev: 0.00199), with an average payoff difference at the finish of 0.00001.

In a second experiment we apply the algorithm to a somewhat larger Markov game given by the grid world shown in Figure 4(a). This problem is based on the experiments described in [6]. Two agents start from the lower corners of the grid and try to reach the goal location (top row center). When the agents try to enter the same non-goal location they stay in their original place and receive a penalty −1. The agents receive a reward when they both enter the goal location. The reward they receive depends on how they enter the goal location, however. If an agent enters from the bottom he receives a reward of 100. If he enters from either side he receives a reward of 75. A state in this problem is given by the joint location of both agents, resulting in a total of 81 states for this 3 × 3 grid. Agents have four actions corresponding to moves in the 4 compass directions. Moves in the grid are stochastic and have a chance of 0.01 of failing². The game continues until both agents arrive in the goal location together, then agents receive their reward and are put back in their starting positions. As described in [6], this problem has pure equilibria corresponding to the joint policies where one agent prefers a path entering the goal from the side and the other one enters from the bottom. These equilibria are asymmetric and result in one agent always receiving the maximum reward, while the other always receives the lower reward.

² When a move fails the agent either stays put or arrives in a random neighboring location.
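The switching logic of Section 3.2 can be summarized in a small dispatcher sketch. The class and method names are hypothetical; it implements the egalitarian rule (optimize the worst-performing agent) and the episode feedback described above (average reward of the episode's target agent over the episode length).

```python
class Dispatcher:
    """Egalitarian dispatcher sketch: tracks per-agent cumulative rewards and, at every
    visit of the switch state, picks the agent with the lowest long-run average reward."""

    def __init__(self, n_agents):
        self.n = n_agents
        self.total = [0.0] * n_agents          # cumulative reward per agent since t = 0
        self.t = 0                             # global time step
        self.episode_start_total = list(self.total)
        self.episode_start_t = 0
        self.current_goal = 0                  # index of the MMDP / agent objective being played

    def record(self, rewards):
        """Accumulate the joint reward vector of one time step."""
        for k, r in enumerate(rewards):
            self.total[k] += r
        self.t += 1

    def on_switch_state(self):
        """Called whenever the system re-enters the switch state: returns (feedback, next goal)."""
        dt = max(self.t - self.episode_start_t, 1)
        # Feedback = average reward gathered by the agent whose objective was just played.
        gained = self.total[self.current_goal] - self.episode_start_total[self.current_goal]
        feedback = gained / dt
        # Egalitarian selection: next objective is the agent with the lowest average reward.
        averages = [tot / max(self.t, 1) for tot in self.total]
        self.current_goal = min(range(self.n), key=lambda k: averages[k])
        self.episode_start_total = list(self.total)
        self.episode_start_t = self.t
        return feedback, self.current_goal
```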
In order to apply the LA algorithms all rewards described above were scaled to lie in [0, 1]. The turn-taking Markov game algorithm was applied as follows. Each agent assigns 2 PLA to every state. The starting state (both agents in their starting position) is selected as the switch state. Each time the agents enter this start state, they receive an index i ∈ {1, 2} and a reward for the last start-to-goal episode. Using this information the agents can then update the PLA used in the last episode. During the next episode they play using the PLA corresponding to the index i. When the PLA have converged, this system results in agents taking turns to use the optimal route.

Results for a typical run are shown in Figure 5, with the same parameter settings as given above. From the figure it is clear that agents equalize their reward, both receiving an average reward that is between the average rewards for the 2 paths played in an equilibrium. For comparison purposes we also show the rewards obtained by 2 agents converging to one of the deterministic equilibria.

5. DISCUSSION AND FUTURE WORK

In this paper we introduced a multi-agent learning algorithm which allows agents to switch between stationary policies in order to equalize the reward division among the agent population. In the present system we rely on a dispatcher agent to select the objective to play and to correlate the agents' policy switches. If we assume that all agents are cooperative and willing to sacrifice some payoff in order to equalize the rewards in the population³, this functionality could also be embedded in the agents, either by letting agents communicate or by allowing each agent to observe all rewards as is done in e.g. [8, 6]. In systems where agents cannot be trusted or are not willing to cooperate, methods from computational mechanism design could be used to ensure that agents' selfish interests are aligned with the global system utility. Another possible approach is considered in [4], where the other agents can choose to punish uncooperative agents, leading to lower rewards for those agents.

Note also that in the system presented here agents learn to correlate on the joint actions they play. In [6] an approach was presented to learn correlated equilibria. A deeper study on the relation between our turn-taking policies and correlated equilibrium still needs to be done. The main difference we put forward here is that a turn-taking policy was proposed as a vehicle to reach fair reward divisions among the agents. Furthermore, the system in [6] requires agents to learn in the joint action space and relies on centralized computation of correlated equilibria. In our system agents only learn probabilities for their individual action sets and coordination only takes place in the switch state, rather than at every state.

In [18] the concept of cyclic equilibria in Markov games was proposed as an alternative to Nash equilibrium. These cyclic equilibria refer to a sequence of policies that reach a limit cycle in the game. However, again no link was made with individual agent preferences and how they compare to each other.

³ Systems satisfying this assumption are referred to as homo egualis systems [5].

6. REFERENCES

[1] E. Altman and A. Shwartz. Time-Sharing Policies for Controlled Markov Chains. Operations Research, 41(6):1116–1124, 1993.
[2] R. Aumann. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, 1:67–96, 1974.
[3] C. Boutilier. Planning, learning and coordination in multiagent decision processes. In Proceedings of the 6th Conference on Theoretical Aspects of Rationality and Knowledge, pages 195–210, Holland, 1996.
[4] S. de Jong and K. Tuyls. Learning to cooperate in public-goods interactions. In EUMAS 2008, 2008.
[5] S. de Jong, K. Tuyls, and K. Verbeeck. Fairness in multi-agent systems. Knowledge Engineering Review, 23(2):153–180, 2008.
[6] A. Greenwald and K. Hall. Correlated Q-learning. In Proceedings of the Twentieth International Conference on Machine Learning, pages 242–249, 2003.
[7] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
[8] J. Hu and M. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4:1039–1069, 2003.
[9] S. Mannor and N. Shimkin. The Steering Approach for Multi-Criteria Reinforcement Learning. Advances in Neural Information Processing Systems, 2:1563–1570, 2002.
[10] K. Narendra and M. Thathachar. Learning Automata: An Introduction. Prentice-Hall International, Inc., 1989.
[11] M. Thathachar and V. Phansalkar. Learning the global maximum with parameterized learning automata. IEEE Transactions on Neural Networks, 6(2):398–406, 1995.
[12] P. Vanderschraaf and B. Skyrms. Learning to take turns. Erkenntnis, 59:311–347, November 2003.
[13] P. Vrancx. Decentralised reinforcement learning in Markov games. PhD thesis, Computational Modeling Lab, Vrije Universiteit Brussel, 2010.
[14] P. Vrancx, K. Verbeeck, and A. Nowé. Optimal Convergence in Multi-agent MDPs. Lecture Notes in Computer Science, Knowledge-Based Intelligent Information and Engineering Systems (KES 2007), 4694:107–114, 2007.
[15] P. Vrancx, K. Verbeeck, and A. Nowé. Decentralized learning in Markov games. IEEE Transactions on Systems, Man and Cybernetics (Part B: Cybernetics), 38(4):976–981, 2008.
[16] R. Wheeler and K. Narendra. Decentralized learning in finite Markov chains. IEEE Transactions on Automatic Control, AC-31:519–526, 1986.
[17] R. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8:229–256, 1992.
[18] M. Zinkevich, A. Greenwald, and M. Littman. Cyclic Equilibria in Markov Games. Advances in Neural Information Processing Systems, 18:1641, 2006.
Figure 3: Typical run of the turn-taking algorithm on the Markov game in Table 1. (a) Average reward over time for agent 1. (b) Average reward over time for agent 2. (Both panels plot average reward against iteration.)

Figure 4: (a) Deterministic equilibrium solution for the grid world problem. (b) Average reward over time for 2 agents converging to equilibrium.

Figure 5: Results of the turn-taking algorithm in the grid world problem. The coloured lines give the average reward over time for both agents. Grey lines give the rewards for agents playing one of the deterministic equilibria.
RESQ-learning in stochastic games

Daniel Hennes, Michael Kaisers, and Karl Tuyls
within an evolutionary game theoretic framework. Second, the inverse approach, reverse engineering the RESQ-learning algorithm, is demonstrated in Section 3. Section 4 delivers a comparative study of the newly devised algorithm and its dynamics. Section 5 concludes this article.

given below:

  Q_i(t+1) ← Q_i(t) + min(β/x_i, 1) · α ( r_i(t) + γ max_j Q_j(t) − Q_i(t) )
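A small sketch of the frequency-adjusted update shown above, with assumed variable names: x holds the current policy (action frequencies) and Q the value estimates; regenerating x from Q via a Boltzmann policy is omitted.

```python
def faq_update(Q, x, action, reward, alpha=0.1, beta=0.01, gamma=0.0):
    """Frequency Adjusted Q-learning step for the chosen action: the effective
    learning rate is scaled by min(beta / x_i, 1), so rarely selected actions
    receive a proportionally larger update."""
    i = action
    target = reward + gamma * max(Q)          # one-step Q-learning target
    Q[i] += min(beta / x[i], 1.0) * alpha * (target - Q[i])
    return Q
```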
Figure 1: Overview of trajectory plots for stateless games: Prisoners' Dilemma (top row) and Matching Pennies game (bottom row). Panels from left to right: LA, LA dynamics, FAQ, FAQ dynamics; the axes show the probability of the first action for players 1 and 2.
Figure 1 (top row) shows the dynamics in the single state Prisoners' Dilemma. The automata game as well as the corresponding replicator dynamics show similar evolution toward the equilibrium strategy of mutual defection. Action probabilities are plotted for action 1 (in this case cooperate); x- and y-axis correspond to the action of player 1 and 2 respectively. Hence, the Nash equilibrium point is located at the origin (0, 0). FAQ-learners evolve to a joint policy close to Nash. Constant temperature prohibits full convergence. Learning in the Matching Pennies game, Figure 1 (bottom row), shows cyclic behavior for automata games and its replicator dynamics alike. FAQ-learning successfully converges to the mixed equilibrium due to its exploration scheme.

2.2 Multi-state learning dynamics

The main limitation of the evolutionary game theoretic approach to multi-agent learning has been its restriction to stateless repeated games. Even though real-life tasks might be modeled statelessly, the majority of such problems naturally relates to multi-state situations. Vrancx et al. [20] have made the first attempt to extend replicator dynamics to multi-state games. More precisely, the authors have combined replicator dynamics and piecewise dynamics, called piecewise replicator dynamics, to model the learning behavior of agents in stochastic games. Recently, this promising proof of concept has been formally studied in [5] and extended to state-coupled replicator dynamics [6], which form the foundation for the later described inverse approach.

2.2.1 Stochastic games

Stochastic games extend the concept of Markov decision processes to multiple agents, and allow to model multi-state games in an abstract manner. The concept of repeated games is generalized by introducing probabilistic switching between multiple states. At any time t, the game is in a specific state featuring a particular payoff function and an admissible action set for each player. Players take actions simultaneously and hereafter receive an immediate payoff depending on their joint action. A transition function maps the joint action space to a probability distribution over all states which in turn determines the probabilistic state change. Thus, similar to a Markov decision process, actions influence the state transitions. A formal definition of stochastic games (also called Markov games) is given below.

Definition 1. The game G = ⟨n, S, A, q, r, π¹ . . . πⁿ⟩ is a stochastic game with n players and k states. At each stage t, the game is in a state s ∈ S = (s¹, . . . , sᵏ) and each player i chooses an action aⁱ from its admissible action set Aⁱ(s) according to its strategy πⁱ(s). The payoff function r(s, a) : ∏_{i=1}^n Aⁱ(s) ↦ ℝⁿ maps the joint action a = (a¹, . . . , aⁿ) to an immediate payoff value for each player. The transition function q(s, a) : ∏_{i=1}^n Aⁱ(s) ↦ Δ^{k−1} determines the probabilistic state change, where Δ^{k−1} is the (k − 1)-simplex and q_{s′}(s, a) is the transition probability from state s to s′ under joint action a.

In this work we restrict our consideration to the set of games where all states s ∈ S are in the same ergodic set. The motivation for this restriction is twofold. In the presence of more than one ergodic set one could analyze the corresponding sub-games separately. Furthermore, the restriction ensures that the game has no absorbing states. Games with absorbing states are of no particular interest in respect to evolution or learning since any type of exploration will eventually lead to absorption. The formal definition of an ergodic set in stochastic games is given below.

Definition 2. In the context of a stochastic game G, E ⊆ S is an ergodic set if and only if the following conditions hold:
(a) For all s ∈ E, if G is in state s at stage t, then at t + 1: Pr(G in some state s′ ∈ E) = 1, and
(b) for all proper subsets E′ ⊂ E, (a) does not hold.

Note that in repeated games, player i either tries to maximize the limit of the average of stage rewards (e.g., Learning Automata)

  max_{πⁱ} lim inf_{T→∞} (1/T) Σ_{t=1}^{T} rⁱ(t)                                  (7)

or the discounted sum of stage rewards Σ_{t=1}^{T} rⁱ(t) δ^{t−1} with 0 < δ < 1 (e.g., Q-learning), where rⁱ(t) is the immediate stage reward for player i at time step t.

2.2.2 2-State Prisoners' Dilemma

The 2-State Prisoners' Dilemma is a stochastic game for two players. The payoff matrices are given by

  (A¹, B¹) = ( 3,3   0,10 )          (A², B²) = ( 4,4   0,10 )
             ( 10,0  2,2  )                     ( 10,0  1,1  )

where Aˢ determines the payoff for player 1 and Bˢ for player 2 in state s. The first action of each player is cooperate and the second is defect. Player 1 receives r¹(s, a) = Aˢ_{a₁,a₂} while player 2 gets r²(s, a) = Bˢ_{a₁,a₂} for a given joint action a = (a₁, a₂). Similarly, the transition probabilities are given by the matrices Q^{s→s′}, where q_{s′}(s, a) = Q^{s→s′}_{a₁,a₂} is the probability for a transition from state s to state s′.

  Q^{s¹→s²} = ( 0.1  0.9 )           Q^{s²→s¹} = ( 0.1  0.9 )
              ( 0.9  0.1 )                       ( 0.9  0.1 )

The probabilities to continue in the same state after the transition are q_{s¹}(s¹, a) = Q^{s¹→s¹}_{a₁,a₂} = 1 − Q^{s¹→s²}_{a₁,a₂} and q_{s²}(s², a) = Q^{s²→s²}_{a₁,a₂} = 1 − Q^{s²→s¹}_{a₁,a₂}.

Essentially a Prisoners' Dilemma is played in both states, and if regarded separately, defect is still a dominating strategy. One might assume that the Nash equilibrium strategy in this game is to defect at every stage. However, the only pure stationary equilibria in this game reflect strategies where one of the players defects in one state while cooperating in the other and the second player does exactly the opposite. Hence, a player betrays his opponent in one state while being exploited himself in the other state.

2.2.3 2-State Matching Pennies game

Another 2-player, 2-actions and 2-state game is the 2-State Matching Pennies game. This game has a mixed Nash equilibrium with joint strategies π¹ = (.75, .25), π² = (.5, .5) in state 1 and π¹ = (.25, .75), π² = (.5, .5) in state 2. Payoff and transition matrices are given below.

  (A¹, B¹) = ( 1,0  0,1 )            (A², B²) = ( 0,1  1,0 )
             ( 0,1  1,0 )                       ( 1,0  0,1 )

  Q^{s¹→s²} = ( 1  1 )               Q^{s²→s¹} = ( 0  0 )
              ( 0  0 )                           ( 1  1 )
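The 2-State Prisoners' Dilemma above can be written down directly as data; the sketch below samples one stage of the stochastic game under given mixed strategies (helper names are assumptions, not from the paper).

```python
import random

# Payoff bimatrices (A^s, B^s): entry [a1][a2] = (reward player 1, reward player 2).
PAYOFF = [
    [[(3, 3), (0, 10)], [(10, 0), (2, 2)]],      # state 1
    [[(4, 4), (0, 10)], [(10, 0), (1, 1)]],      # state 2
]
# Q^{s -> s'}: probability of switching to the *other* state under joint action (a1, a2).
SWITCH = [
    [[0.1, 0.9], [0.9, 0.1]],                    # from state 1 to state 2
    [[0.1, 0.9], [0.9, 0.1]],                    # from state 2 to state 1
]

def play_stage(state, pi1, pi2):
    """One stage: sample the joint action from the players' strategies in `state`,
    return the rewards and the successor state."""
    a1 = 0 if random.random() < pi1[state][0] else 1    # pi[state] = (P(cooperate), P(defect))
    a2 = 0 if random.random() < pi2[state][0] else 1
    r1, r2 = PAYOFF[state][a1][a2]
    next_state = 1 - state if random.random() < SWITCH[state][a1][a2] else state
    return r1, r2, next_state

# Example: one of the pure equilibria described above — player 1 defects in state 1
# and cooperates in state 2, player 2 does exactly the opposite.
pi1 = [(0.0, 1.0), (1.0, 0.0)]
pi2 = [(1.0, 0.0), (0.0, 1.0)]
print(play_stage(0, pi1, pi2))
```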
2.2.4 Networks of learning automata

To cope with stochastic games, the learning algorithms in Section 2.1 need to be adapted to account for multiple states. To this end, we use a network of automata for each agent [19]. An agent associates a dedicated learning automaton (LA) to each state of the game and control is passed on from one automaton to another. Each LA tries to optimize the policy in its state using the standard update rule given in (1). Only a single LA is active and selects an action at each stage of the game. However, the immediate reward from the environment is not directly fed back to this LA. Instead, when the LA becomes active again, i.e., next time the same state is played, it is informed about the cumulative reward gathered since the last activation and the time that has passed by.

The reward feedback τⁱ for agent i's automaton LAⁱ(s) associated with state s is defined as

  τⁱ(t) = Δrⁱ/Δt = ( Σ_{l=t₀(s)}^{t−1} rⁱ(l) ) / ( t − t₀(s) ),                   (8)

where rⁱ(t) is the immediate reward for agent i in epoch t and t₀(s) is the last occurrence function, which determines when state s was visited last. The reward feedback in epoch t equals the cumulative reward Δrⁱ divided by the time-frame Δt. The cumulative reward Δrⁱ is the sum over all immediate rewards gathered in all states beginning with epoch t₀(s) and including the last epoch t − 1. The time-frame Δt measures the number of epochs that have passed since automaton LAⁱ(s) has been active last. This means the state policy is updated using the average stage reward over the interim immediate rewards.

2.2.5 Average reward game

For a repeated automata game, let the objective of player i at stage t₀ be to maximize the limit average reward r̄ⁱ = lim inf_{T→∞} (1/T) Σ_{t=t₀}^{T} rⁱ(t) as defined in (7). The scope of this paper is restricted to stochastic games where the sequence of game states X(t) is ergodic. Hence, there exists a stationary distribution x over all states, where fraction x_s determines the frequency of state s in X. Therefore, we can rewrite r̄ⁱ as r̄ⁱ = Σ_{s∈S} x_s Pⁱ(s), where Pⁱ(s) is the expected payoff of player i in state s.

Now, let us assume the game is in state s at stage t₀ and

An intuitive explanation of (9) goes as follows. At each stage, players consider the infinite horizon of payoffs under current strategies. We untangle the current state s from all other states s′ ≠ s and the limit average payoff r̄ becomes the sum of the immediate payoff for joint action a in state s and the expected payoffs in all other states. Payoffs are weighted by the frequency x_s of corresponding state occurrences. Thus, if players invariably play joint action a every time the game is in state s and play their fixed strategies π(s′) for all other states, the limit average reward for T → ∞ is expressed by (9).

Since a specific joint action a is played in state s, the stationary distribution x depends on s and a as well. A formal definition is given below.

Definition 3. For G = ⟨n, S, A, q, r, π¹ . . . πⁿ⟩ where S itself is the only ergodic set in S = (s¹ . . . sᵏ), we say x(s, a) is a stationary distribution of the stochastic game G if and only if Σ_{z∈S} x_z(s, a) = 1 and

  x_z(s, a) = x_s(s, a) q_z(s, a) + Σ_{s′∈S−{s}} x_{s′}(s, a) Qⁱ(s′),

where

  Qⁱ(s′) = Σ_{a′∈∏_{i=1}^n Aⁱ(s′)} q_z(s′, a′) ∏_{i=1}^n πⁱ_{a′ⁱ}(s′).

Based on this notion of stationary distribution and (9) we can define the average reward game as follows.

Definition 4. For a stochastic game G where S itself is the only ergodic set in S = (s¹ . . . sᵏ), we define the average reward game for some state s ∈ S as the normal-form game

  Ḡ(s, π¹ . . . πⁿ) = ⟨n, A¹(s) . . . Aⁿ(s), r̄, π¹(s) . . . πⁿ(s)⟩,

where each player i plays a fixed strategy πⁱ(s′) in all states s′ ≠ s. The payoff function r̄ is given by

  r̄(s, a) = x_s(s, a) r(s, a) + Σ_{s′∈S−{s}} x_{s′}(s, a) Pⁱ(s′).

2.2.6 State-coupled replicator dynamics

We reconsider the replicator equations for population π as given in (2):

  dπ_i/dt = π_i [ (Aσ)_i − π′Aσ ]

Essentially, the payoff of an individual in population π, playing pure strategy i against population σ, is compared to the average payoff of population π. In the context of an average reward game Ḡ with payoff function r̄ the expected payoff for player i and pure action j is given by

  Pⱼⁱ(s) = Σ_{a∈∏_{l≠i} Aˡ(s)} ( r̄ⁱ(a*) ∏_{l≠i} πˡ_{a*ˡ}(s) ).

If each player i is represented by a population πⁱ, we can set up a system of differential equations, each similar to (2), where the payoff matrix A is substituted by the average reward game payoff r̄. Furthermore, σ now represents all remaining populations πˡ where l ≠ i.

Definition 5. The multi-population state-coupled replicator dynamics are defined by the following system of differential equations:

  dπⱼⁱ(s)/dt = πⱼⁱ [ x_s(π) ( Pⁱ(s, eⱼ) − Pⁱ(s, πⁱ(s)) ) ],                        (10)
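The stationary distribution of Definition 3 can be approximated numerically. The sketch below (assumed helper names, specialised to the 2-State Prisoners' Dilemma of Section 2.2.2) estimates the state frequencies x under fixed strategies by power iteration on the induced Markov chain — the same weights that the average reward game of Definition 4 applies to the payoffs.

```python
def induced_chain(switch, pi1, pi2):
    """Transition matrix of the state process when both players use fixed strategies.
    switch[s][a1][a2] = probability of leaving state s (two-state game)."""
    n_states = len(switch)
    P = [[0.0] * n_states for _ in range(n_states)]
    for s in range(n_states):
        for a1 in range(2):
            for a2 in range(2):
                p_joint = pi1[s][a1] * pi2[s][a2]
                p_leave = switch[s][a1][a2]
                P[s][1 - s] += p_joint * p_leave
                P[s][s] += p_joint * (1.0 - p_leave)
    return P

def stationary(P, iters=10_000):
    """Power iteration: x <- x P until (approximately) stationary."""
    x = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        x = [sum(x[z] * P[z][s] for z in range(len(P))) for s in range(len(P))]
    return x

# Strategies per state as (P(cooperate), P(defect)); SWITCH as in the previous sketch.
SWITCH = [[[0.1, 0.9], [0.9, 0.1]], [[0.1, 0.9], [0.9, 0.1]]]
pi1 = [(0.0, 1.0), (1.0, 0.0)]
pi2 = [(1.0, 0.0), (0.0, 1.0)]
print(stationary(induced_chain(SWITCH, pi1, pi2)))
```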
where eⱼ is the jᵗʰ unit vector. Pⁱ(s, ω) is the expected payoff for an individual of population i playing some strategy ω in state s. Pⁱ is defined as

  Pⁱ(s, ω) = Σ_{j∈Aⁱ(s)} [ ωⱼ Σ_{a∈∏_{l≠i} Aˡ(s)} ( r̄ⁱ(s, a*) ∏_{l≠i} πˡ_{a*ˡ}(s) ) ],

where r̄ is the payoff function of Ḡ(s, π¹ . . . πⁿ) and

  a* = (a¹ . . . aⁱ⁻¹, j, aⁱ . . . aⁿ).

Furthermore, x is the stationary distribution over all states S under π, with

  Σ_{s∈S} x_s(π) = 1 and
  x_s(π) = Σ_{z∈S} [ x_z(π) Σ_{a∈∏_{i=1}^n Aⁱ(s)} ( q_s(z, a) ∏_{i=1}^n πⁱ_{aⁱ}(s) ) ].

In total this system has N = Σ_{s∈S} Σ_{i=1}^n |Aⁱ(s)| replicator equations.

(Figure 2, referenced in Section 4, appears here: trajectory plots with panels SC-RD (state 1, state 2), RESQ dynamics (state 1, state 2) and RESQ (state 1, state 2); the axes show the policy components π₁¹(s) and π₁²(s).)

3. INVERSE APPROACH

The forward approach has focused on deriving predictive models for the learning dynamics of existing multi-agent reinforcement learners. These models help to gain deeper insight and allow to tune parameter settings. In this section we demonstrate the inverse approach, designing a dynamical system that does indeed converge to pure and mixed Nash equilibria and reverse engineering that system, resulting in a new multi-agent reinforcement learning algorithm, i.e. RESQ-learning.

Results for stateless games provide evidence that exploration is the key to prevent cycling around attractors. Hence, we aim to combine the exploration-mutation term of FAQ-learning dynamics with state-coupled replicator dynamics.

3.1 Linking LA and Q-learning dynamics

First, we link the dynamics of learning automata and Q-learning for the stateless case. We recall from Section 2.1.3 that the learning dynamics of LA correspond to the standard multi-population replicators scaled by the learning rate α:

  dπ_i/dt = π_i α [ (Aσ)_i − π′Aσ ]

The FAQ replicator dynamics (see Section 2.1.4) contain a selection part equivalent to the multi-population replicator dynamics, and an additional mutation part originating from the Boltzmann exploration scheme:

  dπ_i/dt = π_i β ( τ⁻¹ [ (Aσ)_i − π′Aσ ] − log π_i + Σ_k π_k log π_k )
          = π_i βτ⁻¹ [ (Aσ)_i − π′Aσ ] − π_i β ( log π_i − Σ_k π_k log π_k )

The learning rate of FAQ is now denoted by β. Let us assume α = βτ⁻¹ ⇒ β = ατ. Note that from β ∈ [0, 1] it follows that 0 ≤ ατ ≤ 1. Then we can rewrite the FAQ replicator equation as follows:

  dπ_i/dt = π_i α [ (Aσ)_i − π′Aσ ] − π_i ατ ( log π_i − Σ_k π_k log π_k )

In the limit τ → 0 the mutation term collapses and the dynamics of learning automata become:

  dπ_i/dt = π_i α [ (Aσ)_i − π′Aσ ]

3.2 State-coupled RD with mutation

After we have established the connection between the learning dynamics of FAQ-learning and learning automata, extending this link to multi-state games is straightforward.
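The link above can be checked numerically. The sketch below integrates the FAQ replicator dynamics (selection plus Boltzmann mutation, in the sign convention used in the reconstruction of the equations above) for a single-state two-action game with Euler steps; the game matrices, parameters and step count are illustrative assumptions.

```python
import math

def faq_replicator_step(pi, sigma, A, alpha=0.01, tau=0.1, dt=0.01):
    """One Euler step of d pi_i/dt = pi_i * alpha * ( [(A sigma)_i - pi'A sigma]
       - tau * (log pi_i - sum_k pi_k log pi_k) ) for a two-action player."""
    Asigma = [sum(A[i][j] * sigma[j] for j in range(2)) for i in range(2)]
    avg = sum(pi[i] * Asigma[i] for i in range(2))
    entropy_like = sum(pi[k] * math.log(pi[k]) for k in range(2))
    new_pi = []
    for i in range(2):
        selection = Asigma[i] - avg
        mutation = -tau * (math.log(pi[i]) - entropy_like)
        new_pi.append(pi[i] + dt * pi[i] * alpha * (selection + mutation))
    new_pi = [max(p, 1e-9) for p in new_pi]   # guard against numerical drift out of the simplex
    total = sum(new_pi)
    return [p / total for p in new_pi]

# Matching Pennies: A is the row player's payoff matrix, B is written from the
# column player's own perspective (its actions index the rows).
A = [[1, 0], [0, 1]]
B = [[0, 1], [1, 0]]
pi, sigma = [0.9, 0.1], [0.2, 0.8]
for _ in range(200_000):
    pi, sigma = faq_replicator_step(pi, sigma, A), faq_replicator_step(sigma, pi, B)
print(pi, sigma)   # with the mutation term, the trajectory should spiral toward the mixed equilibrium
```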
The mutation term

  −τ ( log π_i − Σ_k π_k log π_k )                                               (11)

is solely dependent on the agent's policy π and thus independent of any payoff computation. Therefore, the average reward game remains the sound measure for the limit of the average of stage rewards under the assumptions made in Section 2.2.5. The equations of the dynamical system in (2.2.5) are complemented with the mutation term (11), resulting in the following state-coupled replicator equations with mutation:

  dπⱼⁱ(s)/dt = πⱼⁱ [ x_s(π) ( Pⁱ(s, eⱼ) − Pⁱ(s, πⁱ(s)) ) − τ ( log πⱼⁱ − Σ_k πₖⁱ log πₖⁱ ) ]        (12)

In the next section we introduce the corresponding RESQ-learning algorithm.

3.3 RESQ-learning

In [6] the authors have shown that maximizing the expected average stage reward over interim immediate rewards relates to the average reward game played in state-coupled replicator dynamics. We reverse this result to obtain a learner equivalent to state-coupled replicator dynamics with mutation.

Analogous to the description in Section 2.2.4, a network of learners is used for each agent. The reward feedback signal is equal to (8) while the update rule now incorporates the same exploration term as in (12). If a(t) = i:

  π_i(t+1) ← π_i(t) + α [ r(t) (1 − π_i(t)) − τ ( log π_i − Σ_k π_k log π_k ) ]

otherwise:

  π_i(t+1) ← π_i(t) + α [ −r(t) π_i(t) − τ ( log π_i − Σ_k π_k log π_k ) ]

Hence, RESQ-learning is essentially a multi-state policy iterator using exploration equivalent to the Boltzmann policy generation scheme.

4. RESULTS AND DISCUSSION

This section sets the newly proposed RESQ-learning algorithm in perspective by examining the underlying dynamics of state-coupled replicator dynamics with mutation and traces of the resulting learning algorithm.

First, we explore the behavior of the dynamical system, as derived in Section 3.2, and verify the desired convergence behavior, i.e., convergence to pure and mixed Nash equilibria. Figure 3 shows multiple trajectory traces in the 2-State Prisoners' Dilemma, originating from random strategy profiles in both states. Analysis reveals that all trajectories converge close to either one of the two pure Nash equilibrium points described in Section 2.2.2.

(Figures 3 and 4, referenced in the text, appear here: trajectory traces of the policy components π₁¹(s) and π₁²(s) over time t for the 2-State Prisoners' Dilemma and the 2-State Matching Pennies game.)
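A compact sketch of the RESQ policy update reconstructed above, for one automaton (one state); variable names are assumptions. The chosen action is reinforced linear-reward-inaction style, every component receives the Boltzmann-style mutation term, and the result is clipped and renormalized so that the policy remains a distribution.

```python
import math

def resq_update(pi, chosen, reward, alpha=0.01, tau=0.1, eps=1e-6):
    """One RESQ-learning step for the policy of a single state.
    `reward` is the averaged episode feedback of Eq. (8), assumed to lie in [0, 1]."""
    entropy_like = sum(p * math.log(p) for p in pi)
    new_pi = []
    for i, p in enumerate(pi):
        selection = reward * (1.0 - p) if i == chosen else -reward * p
        mutation = -tau * (math.log(p) - entropy_like)
        new_pi.append(p + alpha * (selection + mutation))
    new_pi = [min(max(p, eps), 1.0) for p in new_pi]   # keep the update inside the simplex
    total = sum(new_pi)
    return [p / total for p in new_pi]
```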
As mentioned before for the stateless case, constant temperature prohibits full convergence. Figure 4 shows trajectory traces in the 2-State Matching Pennies game. Again, all traces converge close to Nash, thus affirming the statement that exploration-mutation is crucial to prevent cycling and to converge in games with mixed optimal strategies.

Figure 2 shows a comparison between state-coupled replicator dynamics (SC-RD), the RESQ-dynamics as in (12), and an empirical learning trace of RESQ-learners. As mentioned above, "pure" state-coupled replicator dynamics without the exploration-mutation term fail to converge. The trajectory of the state space of this dynamical system exhibits cycling behavior around the mixed Nash equilibrium (see Section 2.2.3). RESQ-dynamics successfully converge near to the Nash-optimal joint policy. Furthermore, we present the learning trace of two RESQ-learners in order to judge the predictive quality of the corresponding state-coupled dynamics with mutation. Due to the stochasticity involved in the action selection process, the learning trace is more noisy. However, we clearly observe that RESQ-learning indeed successfully inherits the convergence behavior of state-coupled replicator dynamics with mutation.

Further experiments are required to verify the performance of RESQ-learning in real applications and to gain insight into how it competes with multi-state Q-learning and the SARSA algorithm [15]. In particular, the speed and quality of convergence need to be considered. Therefore, the theoretical framework needs to be extended to account for decreasing temperature to balance exploration and exploitation over time.

5. CONCLUSIONS

The contributions of this article can be summarized as follows. First, we have demonstrated the forward approach to modeling multi-agent reinforcement learning within an evolutionary game theoretic framework. In particular, the stateless learning dynamics of learning automata and FAQ-learning as well as state-coupled replicator dynamics for stochastic games have been discussed. Based on the insights that were gained from the forward approach, RESQ-learning has been introduced by reverse engineering state-coupled replicator dynamics injected with the Q-learning Boltzmann mutation scheme. We have provided empirical confirmation that RESQ-learning successfully inherits the convergence behavior of its evolutionary counterpart. Results have shown that RESQ-learning provides convergence to pure as well as mixed Nash equilibria in a selection of stateless and stochastic multi-agent games.

6. REFERENCES

[1] Tilman Börgers and Rajiv Sarin. Learning through reinforcement and replicator dynamics. Journal of Economic Theory, 77(1), 1997.
[2] Bruce Bueno de Mesquita. Game theory, political economy, and the evolving study of war and peace. American Political Science Review, 100(4):637–642, November 2006.
[3] Herbert Gintis. Game Theory Evolving. A Problem-Centered Introduction to Modelling Strategic Interaction. Princeton University Press, Princeton, 2000.
[4] Eduardo Rodrigues Gomes and Ryszard Kowalczyk. Dynamic analysis of multiagent Q-learning with epsilon-greedy exploration. In ICML, 2009.
[5] Daniel Hennes, Karl Tuyls, and Matthias Rauterberg. Formalizing multi-state learning dynamics. In Proc. of the 2008 Intl. Conf. on Intelligent Agent Technology, 2008.
[6] Daniel Hennes, Karl Tuyls, and Matthias Rauterberg. State-coupled replicator dynamics. In Proc. of 8th Intl. Conf. on Autonomous Agents and Multiagent Systems, 2009.
[7] Shlomit Hon-Snir, Dov Monderer, and Aner Sela. A learning approach to auctions. Journal of Economic Theory, 82:65–88, November 1998.
[8] Junling Hu and Michael P. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4:1039–1069, 2003.
[9] Michael Kaisers and Karl Tuyls. Frequency adjusted multi-agent Q-learning. In Proc. of 9th Intl. Conf. on Autonomous Agents and Multiagent Systems, 2010.
[10] Michael L. Littman. Friend-or-foe Q-learning in general-sum games. In ICML, pages 322–328, 2001.
[11] Shervin Nouyan, Roderich Groß, Michael Bonani, Francesco Mondada, and Marco Dorigo. Teamwork in self-organized robot colonies. Transactions on Evolutionary Computation, 13(4):695–711, 2009.
[12] Liviu Panait, Karl Tuyls, and Sean Luke. Theoretical advantages of lenient learners: An evolutionary game theoretic perspective. Journal of Machine Learning Research, 9:423–457, 2008.
[13] S. Phelps, M. Marcinkiewicz, and S. Parsons. A novel method for automatic strategy acquisition in n-player non-zero-sum games. In AAMAS '06: Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, pages 705–712, Hakodate, Japan, 2006. ACM.
[14] Y. Shoham, R. Powers, and T. Grenager. If multi-agent learning is the answer, what is the question? Journal of Artificial Intelligence, 171(7):365–377, 2006.
[15] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[16] K. Tuyls and S. Parsons. What evolutionary game theory tells us about multiagent learning. Artificial Intelligence, 171(7):115–153, 2007.
[17] Karl Tuyls, Pieter J. 't Hoen, and Bram Vanschoenwinkel. An evolutionary dynamical analysis of multi-agent learning in iterated games. Autonomous Agents and Multi-Agent Systems, 12:115–153, 2005.
[18] Karl Tuyls, Katja Verbeeck, and Tom Lenaerts. A selection-mutation model for Q-learning in multi-agent systems. In Proc. of 2nd Intl. Conf. on Autonomous Agents and Multiagent Systems, 2003.
[19] Katja Verbeeck, Peter Vrancx, and Ann Nowé. Networks of learning automata and limiting games. In ALAMAS, 2006.
[20] Peter Vrancx, Karl Tuyls, Ronald Westra, and Ann Nowé. Switching dynamics of multi-agent learning. In Proc. of 7th Intl. Conf. on Autonomous Agents and Multiagent Systems, 2008.
Adaptation of Stepsize Parameter to Minimize Exponential Moving Average of Square Error by Newton's Method

Itsuki Noda
ITRI, AIST
1-1-1 Umezono, Tsukuba, Japan
2. RECURSIVE EXPONENTIAL MOVING AVERAGE

The Recursive Exponential Moving Average (REMA) ξ_t^⟨k⟩ is defined as follows:

  ξ_t^⟨0⟩     = x_t
  ξ_{t+1}^⟨1⟩ = x̃_{t+1} = (1 − α)x̃_t + αx_t
  ξ_{t+1}^⟨k⟩ = ξ_t^⟨k⟩ + α(ξ_t^⟨k−1⟩ − ξ_t^⟨k⟩)
              = (1 − α)ξ_t^⟨k⟩ + αξ_t^⟨k−1⟩
              = α Σ_{τ=0}^{∞} (1 − α)^τ ξ_{t−τ}^⟨k−1⟩.                            (3)

Using the REMA, we can derive the following lemma and theorem about partial differentials of the estimated values x̃_t by the stepsize parameter α [5, 6].

Lemma 1. The first partial derivative of REMA ξ_t^⟨k⟩ by α is given by the following equation:

  ∂ξ_t^⟨k⟩/∂α = (k/α) (ξ_t^⟨k⟩ − ξ_t^⟨k+1⟩)                                        (4)

Theorem 1. The k-th partial derivative of EMA x̃_t (= ξ_t^⟨1⟩) is given by the following equation:

  ∂^k x̃_t / ∂α^k = (−α)^{−k} k! (ξ_t^⟨k+1⟩ − ξ_t^⟨k⟩)                             (5)

In the previous work, GDASS-REMA updates the stepsize α in the direction that gradually decreases the following squared error of the estimation in each time t [5, 6]:

  E_t = (1/2)(x̃_t − x_t)².                                                       (6)

Therefore, the actual update schema in GDASS is:

  α ← α − η · sign(∂E_t/∂α) = α − η · sign((x̃_t − x_t) · ∂x̃_t/∂α).

3. EXPONENTIAL MOVING AVERAGE OF SQUARED ERROR

Because Theorem 1 provides a way to calculate higher order derivatives of x̃_t by α, we can get a higher order Taylor expansion of E_t by α as follows:

  E_t(Δα) = E_t(0) + (∂E_t/∂α) Δα + (1/2)(∂²E_t/∂α²) Δα² + (1/6)(∂³E_t/∂α³) Δα³ + · · · .

Therefore, if we focus on the expansion of the first and second order terms, we can determine the optimum change of the stepsize, Δα*⟨t⟩, which minimizes the error at time t, using Newton's method as follows:

  Δα*⟨t⟩ = (∂E_t/∂α) / (∂²E_t/∂α²).                                               (7)

However, updating α using the above equation directly does not work well, because the given value x_t includes noise that should be eliminated in the calculation of x̃_t, so that α tends to be adjusted to estimate the noise factor instead of the true value of x_t.

So, in the following sections, we focus on the EMA of the squared error and construct a method to minimize the averaged error by Newton's method.

3.1 Squared Error and Derivatives

Here, we re-define the squared error shown in Eq. (6) using an error δ_t of the given and estimated values, x_t and x̃_t, as follows:

  δ_t = x̃_t − x_t
  E_t = (1/2) δ_t².

Then, we have the following theorem.

Theorem 2. The k-th partial derivative of the squared error E_t by α is calculated by the following equations:

  ∂^k E_t / ∂α^k = Σ_{i=0}^{k−1} [ (k−1)! / ((k−1−i)! i!) ] (∂^i δ_t/∂α^i)(∂^{k−i} δ_t/∂α^{k−i}),   (8)

where

  ∂^0 δ_t/∂α^0 = δ_t
  ∂^k δ_t/∂α^k = ∂^k x̃_t/∂α^k   (k > 0).                                          (9)

(See Appendix for the proof.)

3.2 EMA of Squared Error and Partial Derivatives

As discussed above, our target is a method to determine the stepsize parameter α that minimizes the EMA of the squared error E_t. Here, we define the EMA Ẽ_t as follows:

  Ẽ_{t+1} = (1 − β)Ẽ_t + βE_t,                                                    (10)

where β is another stepsize parameter for the EMA of the squared error. This Ẽ_t is equal to the estimated variance of the expected reward value introduced in [8].

As in the case of the squared error E_t shown in Eq. (7), we can estimate the optimal stepsize value α* to minimize Ẽ_t using Newton's method and the Taylor expansion as follows:

  Δα* = (∂Ẽ_t/∂α) / (∂²Ẽ_t/∂α²)                                                   (11)
  α* = α − Δα*                                                                    (12)

On the other hand, we can accumulate higher order partial derivatives of Ẽ_t by α from Eq. (10) by the following equations:

  ∂Ẽ_{t+1}/∂α     = (1 − β) ∂Ẽ_t/∂α     + β ∂E_t/∂α                               (13)
  ∂²Ẽ_{t+1}/∂α²   = (1 − β) ∂²Ẽ_t/∂α²   + β ∂²E_t/∂α²                             (14)
  ∂^kẼ_{t+1}/∂α^k = (1 − β) ∂^kẼ_t/∂α^k + β ∂^kE_t/∂α^k                           (15)
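The REMA recursion and Theorem 1 translate directly into code. The sketch below maintains ξ^⟨k⟩ for k = 0 . . . K and returns the k-th derivative of the EMA x̃_t with respect to α via Eq. (5); class and method names are assumptions.

```python
import math

class REMA:
    """Recursive exponential moving averages xi^(0..K) of a scalar stream (Eq. 3)."""

    def __init__(self, alpha, K, x0):
        self.alpha = alpha
        self.xi = [x0] * (K + 1)           # xi[k] holds xi^(k)_t

    def observe(self, x):
        # Update higher orders first so each xi^(k) uses the previous value of xi^(k-1),
        # mirroring the order used in the RRASP-N procedure below.
        for k in range(len(self.xi) - 1, 0, -1):
            self.xi[k] = (1 - self.alpha) * self.xi[k] + self.alpha * self.xi[k - 1]
        self.xi[0] = x

    def ema(self):
        return self.xi[1]                   # x~_t = xi^(1)_t

    def d_ema(self, k):
        """k-th partial derivative of the EMA w.r.t. alpha via Eq. (5); requires 1 <= k < K."""
        return (-self.alpha) ** (-k) * math.factorial(k) * (self.xi[k + 1] - self.xi[k])
```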
This means that these partial derivatives can be calculated in the same manner as an EMA, using the derivatives of the squared error ∂^k E_t/∂α^k. As shown in Eq. (8) and Eq. (9), these values can be determined systematically using the REMA ξ_t^⟨k⟩. Finally, we get the following procedure to obtain the optimal stepsize α* that minimizes the EMA of the squared error. We call it Rapid Recursive Adaptation of Stepsize Parameter by Newton's method (RRASP-N).

    Initialize: ∀k ∈ {0 ... kmax − 1}: ξ^⟨k⟩ ← x_0
                ∀k ∈ {0 ... kmax − 2}: ∂^k Ẽ/∂α^k ← 0
    while forever do
        Let x be an observation.
        for k = kmax − 1 to 1 do
            ξ^⟨k⟩ ← (1 − α)ξ^⟨k⟩ + αξ^⟨k−1⟩
        end for
        ξ^⟨0⟩ ← x
        δ ← ξ^⟨1⟩ − x
        for k = 1 to kmax − 2 do
            Calculate ∂^k E/∂α^k by Eq. (8), Eq. (9), and Eq. (5).
            Update ∂^k Ẽ/∂α^k by Eq. (13) to Eq. (15).
        end for
        if ∂²Ẽ/∂α² > 0 then
            Calculate Δα* by Eq. (11).
            if |Δα*| > α then
                Δα* ← sign(Δα*) · α
            end if
            α ← α − Δα*
            if α is not in [αmin, αmax] then
                let α be αmin or αmax.
            end if
            for k = 1 to kmax − 1 do
                Update ξ^⟨k⟩ according to the change of α, using ∂ξ^⟨k⟩/∂α determined by Eq. (4).
            end for
        end if
    end while

In this procedure, α is updated only when ∂²Ẽ/∂α² is positive, because the change of Ẽ with α is concave down when ∂²Ẽ/∂α² < 0. We also cut off Δα* for the following reason: the Taylor expansion of ξ^⟨k⟩ using Eq. (5) includes the term (Δα/α)^n, which becomes huge when α is small. Therefore, the truncation error of the Taylor expansion may be large and affect other calculations in the procedure. In order to avoid such effects, we limit the absolute value of Δα* to at most the value of α.
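The following Python sketch (ours, not code from the paper) implements one reading of the RRASP-N procedure above for kmax = 4, i.e., using the first and second order derivatives of Ẽ. The class name, buffer layout, and default parameters are assumptions; the sign of the α update follows Eq. (12).

import math

class RRASPN:
    """Sketch of RRASP-N: adapt the EMA stepsize alpha by a Newton step on the
    EMA of the squared estimation error (Eqs. (8)-(15))."""

    def __init__(self, alpha=0.5, beta=0.01, kmax=4, alpha_min=1e-3, alpha_max=0.999):
        self.alpha, self.beta, self.kmax = alpha, beta, kmax
        self.alpha_min, self.alpha_max = alpha_min, alpha_max
        self.xi = None                        # REMA buffers xi[k], k = 0 .. kmax-1
        self.dEt = [0.0] * (kmax - 1)         # accumulated d^k E~ / d alpha^k, k = 0 .. kmax-2

    def observe(self, x):
        a = self.alpha
        if self.xi is None:                   # initialize all buffers to the first observation
            self.xi = [float(x)] * self.kmax
            return self.xi[1]
        # REMA recursion (Eq. (3)): higher orders first, then store the observation.
        for k in range(self.kmax - 1, 0, -1):
            self.xi[k] = (1 - a) * self.xi[k] + a * self.xi[k - 1]
        self.xi[0] = float(x)
        delta = self.xi[1] - x
        # d^k delta / d alpha^k: index 0 is delta itself (Eq. (9)); k >= 1 from Theorem 1 (Eq. (5)).
        d_delta = [delta] + [(-a) ** (-k) * math.factorial(k) * (self.xi[k + 1] - self.xi[k])
                             for k in range(1, self.kmax - 1)]
        # d^k E_t / d alpha^k (Eq. (8)), accumulated into the EMA derivatives (Eqs. (13)-(15)).
        for k in range(1, self.kmax - 1):
            dE_k = sum(math.comb(k - 1, i) * d_delta[i] * d_delta[k - i] for i in range(k))
            self.dEt[k] = (1 - self.beta) * self.dEt[k] + self.beta * dE_k
        # Newton step (Eqs. (11)-(12)), applied only where E~ is locally convex in alpha.
        if self.dEt[2] > 0:
            d_alpha = self.dEt[1] / self.dEt[2]
            d_alpha = max(-a, min(a, d_alpha))            # cut-off: |d_alpha| <= alpha
            new_alpha = min(self.alpha_max, max(self.alpha_min, a - d_alpha))
            # Shift buffers for the new alpha using Lemma 1 (Eq. (4));
            # the highest-order buffer is left unshifted in this sketch.
            for k in range(1, self.kmax - 1):
                self.xi[k] += (k / a) * (self.xi[k] - self.xi[k + 1]) * (new_alpha - a)
            self.alpha = new_alpha
        return self.xi[1]                     # current estimate x~_t

A caller would feed successive observations x_t to observe() and read back the running estimate x̃_t, while the stepsize α adapts on-line.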
4. EXPERIMENTS

In order to show the performance of RRASP-N, we carried out several experiments.

4.1 Exp.1: Finding the Optimal Stepsize

In order to show that RRASP-N can determine the optimal stepsize α*, we conducted an experiment using the following noisy random walk as the given value x_t:

    x_t = v_t + ε_t,                                                   (16)

where ε_t is a random noise whose average and standard deviation are 0 and σ_ε, respectively. The true value v_t is a random walk defined by the following equation:

    v_{t+1} = v_t + Δv_t,

where Δv_t is a random noise whose average and standard deviation are 0 and σ_v, respectively. Figure 1 shows an example of the noisy random walk. In this graph, the band-like spikes are the given sequence x_t, and the curves at the center of the band are the true value v_t and its learning result x̃_t.

Figure 1: Exp.1: Changes of the Learned Expected Value x̃_t using the Stepsize α Acquired by RRASP-N (x plotted against the learning cycle, 0 to 1000, for γ = 0.05, best α = 0.0488; legend: current, agent, real).

For such noisy random walks, we can calculate the optimal stepsize by the following equation:

    α* = (−γ² + √(γ⁴ + 4γ²)) / 2,                                      (17)

where γ = σ_v / σ_ε. Of course, the standard deviations are not given to learning agents, so the stepsize must be acquired through learning, as in RRASP-N.
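For reference, a small Python check (ours) of Eq. (17) for the noise ratios used in the figures below:

import math

def optimal_alpha(gamma):
    """Optimal stepsize for a noisy random walk, Eq. (17), with gamma = sigma_v / sigma_e."""
    return (-gamma**2 + math.sqrt(gamma**4 + 4 * gamma**2)) / 2

for g in (3.333, 2.0, 1.25, 1.0, 0.333, 0.05):
    print(f"gamma = {g:6.3f}  ->  alpha* = {optimal_alpha(g):.3f}")
# The printed values are approximately 0.923, 0.828, 0.693, 0.618, 0.282, 0.049,
# matching the "best alpha" lines shown in Figures 2 and 3.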
Figure 2 shows the results of the adaptation of α by RRASP-N for the given values x_t. Each graph of the figure indicates the changes of α through learning, with the optimal value of α indicated by a horizontal line, for a different setting of the noisy random walk. As shown in these graphs, the acquired α quickly converges to the optimal value of α.

Moreover, the speed of convergence is drastically improved compared with GDASS, proposed in the previous work. Figure 3 shows the results of adaptation by GDASS for the same settings as Figure 2. While the adaptation of GDASS converges to the optimal stepsize gradually, as is the nature of gradient descent methods, RRASP-N adapts the stepsize to the optimal value quickly by jumping to it directly using Newton's method.

4.2 Exp.2: Adaptation for a Square-waved True Value

In the second experiment, the given values are a square-waved true value with large noise, as shown on the right of Figure 4. The actual value of x_t is generated by the following equations:

    x_t = v_t + ε_t
    v_t = 10  if 2000n < t < 2000n + 1000, n = 0, 1, 2, ...
          5   otherwise,

where ε_t is a random noise whose average and standard deviation are 0 and σ_ε, respectively. For such given values, the stepsize should become large right after the changes of the true value (t = 1000, 2000, 3000, ...) to catch up with the change, and should decrease immediately to zero to reduce the noise factor.
Figure 2: Exp.1-a: Adjustment of the Stepsize Parameter by RRASP-N for Various Ratios of the Standard Deviations of the Random Walk and the Noise. Panels: (a) γ = 3.33, α* = 0.923; (b) γ = 2.00, α* = 0.828; (c) γ = 1.25, α* = 0.693; (d) γ = 1.00, α* = 0.618; (e) γ = 0.33, α* = 0.282; (f) γ = 0.05, α* = 0.049. Each panel plots α (and the best α) against the learning cycle (0 to 1000).

Figure 3: Exp.1-b: Adjustment of the Stepsize Parameter by GDASS for Various Ratios of the Standard Deviations of the Random Walk and the Noise. Panels (a) to (f) use the same settings as Figure 2.
Figure 4: Exp.2: Changes of the Stepsize Parameters and the Learned Estimated Values by the RRASP-N, GDASS, and OSA Methods (in the case of the square-waved true value). Left panels: α against the learning cycle (0 to 10000); right panels: given (current), estimated (agent), and true (real) x values against the learning cycle.
Figure 4 shows the results of adaptation by (a) RRASP-N, (b) GDASS, and (c) OSA (Optimal Stepsize Algorithm) [4]. In the figure, the left three graphs indicate the changes of α under each method, and the right ones indicate the changes of the given values x_t, the true values v_t, and the learned estimated values x̃_t by Eq. (1). These graphs show the features of the three methods. Because OSA, shown in (c), uses statistical information about the given and learned values, the changes of the stepsize are stable but tend to be delayed with respect to the changes of the true value. GDASS, shown in (b), can respond to the changes of the true value quickly, but the stepsize tends to be unstable due to the noise factors included in the given value, because GDASS looks only at the difference between the given and estimated values at each time t. Therefore, it is hard to detect the environmental change clearly from the changes of α. Compared with these methods, RRASP-N can catch the changes of the true value quickly (α goes up right after every change of the true value), and also shows robust and stable changes of the stepsize, because it uses the EMA of squared errors to reduce the effects of noise.

4.3 Exp.3: Repeated Multi-agent Resource Sharing

Finally, I conducted a learning experiment using repeated multi-agent resource sharing. In the experiment, we suppose that multiple agents share four resources (resource-0 ... resource-3). The agents are grouped into three types: fixed users, who never change their choice from a certain resource; random hoppers, who choose one of the resources randomly every cycle; and big players, who usually stay on a certain resource but sometimes change their choice. In addition, a learning agent tries to estimate the average utility of each resource. The population and weight of each group are as follows:

    type            population   weight
    fixed users          1          7
    random hoppers      17          1
    big players          2         10
    learning agent       1          1

where the weight of an agent means its degree of resource consumption compared with a random hopper. Therefore, a resource that is used by big players, who have a big weight, will have a poor utility. Each resource also has its own capacity, which indicates the size of the resource. In this experiment, the actual utility of a resource k at time t is calculated by the following equation:

    utility_k(t) = 1 / (1 + totalWeight_k(t) / capacity_k),

where totalWeight_k(t) is the sum of the weights of the agents who choose resource k at time t.
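A minimal sketch (ours) of one cycle of this utility computation; the capacity values and the random choices below are illustrative assumptions, since their concrete settings are not restated here.

import random

# (population, weight) for each group, as in the table above.
GROUPS = {"fixed users": (1, 7), "random hoppers": (17, 1),
          "big players": (2, 10), "learning agent": (1, 1)}
CAPACITY = [10.0, 10.0, 10.0, 10.0]       # assumed capacities, one per resource
N_RESOURCES = len(CAPACITY)

def utilities(choices):
    """Utility of each resource given a list of (weight, chosen_resource) pairs."""
    total_weight = [0.0] * N_RESOURCES
    for weight, k in choices:
        total_weight[k] += weight
    return [1.0 / (1.0 + w / c) for w, c in zip(total_weight, CAPACITY)]

# One illustrative cycle: the single fixed user stays on resource 0, everyone else
# picks a resource at random here (a simplification of the behaviours described above).
choices = [(GROUPS["fixed users"][1], 0)]
for name in ("random hoppers", "big players", "learning agent"):
    population, weight = GROUPS[name]
    choices += [(weight, random.randrange(N_RESOURCES)) for _ in range(population)]
print(utilities(choices))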
The purpose of the learning agent is to acquire an estimate of the utility of each resource by reducing the noise factor caused by the random hoppers. At the same time, the learning agent must adapt to the drastic changes brought about when big players change their choice. Therefore, the agent must adapt its stepsize parameter according to the changes of the environment.

Figure 5 shows the result of the learning. Each graph on the right of the figure indicates the given and estimated utilities at each time step, while the graphs on the left show the changes of the stepsize parameter for each resource. Note that an independent stepsize parameter is assigned to each resource; therefore, the parameters change independently of each other. From these graphs, we can see that the agent can estimate suitable expected utilities by reducing the noise factor of the random hoppers. It can also adapt to the drastic changes caused by the big players. The changes of the stepsize parameters in Figure 5 show how the agent adapts to the changes of the environment.

5. CONCLUDING REMARKS

In this article, I proposed a method to adapt the stepsize parameter, called rapid recursive adaptation of the stepsize parameter by Newton's method (RRASP-N), in which the EMA of the squared error of the estimated value is minimized by changing the stepsize parameter. RRASP-N utilizes higher order partial derivatives of the estimated value and of the EMA of the squared error with respect to the stepsize parameter, which can be calculated systematically from the recursive exponential moving average.

Experimental results show that RRASP-N responds to changes of the environment quickly enough to adapt the stepsize parameter to a suitable value. At the same time, the behavior of RRASP-N is stable because it uses a statistical value, the EMA of the squared error. We can also apply RRASP-N to various noise models: while we used only Gaussian noise for the input values, the formalization only supposes the minimization of the squared error between the expected and given values. In fact, the situation used in Exp.3 is a non-Gaussian noise case, in which the random hoppers provide the noisy effects in the environment. As shown in the experimental results, RRASP-N performs reasonably in such an environment.

Although the experimental setups shown in this article are kept simple to demonstrate the features of the proposed method, the generality of RRASP-N is supported by the theorems, so it can be applied generally to reinforcement learning methods that use the EMA formula. For example, it is easy to apply it to Q-learning with multiple states and actions. Of course, the situation of acquiring the best stepsize for Q-learning is not so simple, because the learning speed of a Q-value affects the backup values of the Q-values of other state-action pairs. We have been investigating such cases, and found that there can exist local minima for the stepsize under certain conditions [7]. The conditions and their effects are still under investigation.

There are still several open issues, including:

• the effects of different stepsize parameters for states and actions in Q-learning;
• the utilization of higher order derivatives (k > 2) to analyze the structure of the error function Ẽ_t with respect to α;
• the tuning of β, the stepsize parameter for the EMA of the squared error.

Acknowledgment

This work was supported by JSPS KAKENHI 21500153.
Figure 5: Changes of the Stepsize Parameter (left) and the Given and Estimated Utility (right) for each resource, plotted against the learning cycle (0 to 10000). Panels: (a) Resource 0, (b) Resource 1, (c) Resource 2, (d) Resource 3.
6. REFERENCES

[1] A. Bonarini, A. Lazaric, E. Munoz de Cote, and M. Restelli. Improving cooperation among self-interested reinforcement learning agents. In Proc. of the Workshop on Reinforcement Learning in Non-Stationary Environments, ECML-PKDD 2005, Oct. 2005.
[2] M. Bowling and M. Veloso. Multiagent learning using a variable learning rate. Artificial Intelligence, 136:215-250, 2002.
[3] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5, Dec. 2003.
[4] A. P. George and W. B. Powell. Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming. Machine Learning, 65(1):167-198, 2006.
[5] I. Noda. Adaptation of stepsize parameter for non-stationary environments by recursive exponential moving average. In Proc. of the ECML 2009 LNIID Workshop, pages 24-31. ECML, Sep. 2009.
[6] I. Noda. Recursive adaptation of stepsize parameter for non-stationary environments. In M. E. Taylor and K. Tuyls, editors, Adaptive Learning Agents: Second Workshop, ALA 2009. Springer, May 2009 (to appear).
[7] I. Noda. Relation between stepsize parameter and stochastic reward on reinforcement learning. In Proc. of JSAI 2009, pages 1D2-OS6-13. JSAI, Jun. 2009. (In Japanese.)
[8] M. Sato, H. Kimura, and S. Kobayashi. TD algorithm for the variance of return and mean-variance reinforcement learning. Transactions of the Japanese Society for Artificial Intelligence, 16(3F):353-362, 2001. (In Japanese.)
[9] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

APPENDIX

A. PROOF OF THEOREM 2

First of all, I show the following lemma.

Lemma 2. The partial derivative of δ_t by α is equal to the derivative of x̃_t by α:

    ∂δ_t/∂α = ∂x̃_t/∂α.                                                (18)

In general, we also get the following equation for the k-th partial derivatives:

    ∂^k δ_t/∂α^k = ∂^k x̃_t/∂α^k.                                      (19)

This holds because δ_t = x̃_t − x_t and the given value x_t does not depend on α.

On the other hand, we can calculate ∂^k x̃_t/∂α^k from the REMA ξ_t^⟨k⟩ using Theorem 1. Therefore, we can calculate ∂^k δ_t/∂α^k by the REMA.

Here, let us focus on the partial derivatives of E_t. Suppose that the j-th partial derivative of E_t by α satisfies Eq. (8) for all j ≤ k, as follows:

    ∂^j E_t/∂α^j = Σ_{i=0..j−1} [(j−1)! / ((j−1−i)! i!)] (∂^i δ/∂α^i)(∂^{j−i} δ/∂α^{j−i}).

In this case, we have the (k+1)-th partial derivative as follows:

    ∂^{k+1} E_t/∂α^{k+1} = Σ_{i=0..k−1} [(k−1)! / ((k−1−i)! i!)]
                           · [ (∂^i δ/∂α^i)(∂^{k−i+1} δ/∂α^{k−i+1}) + (∂^{i+1} δ/∂α^{i+1})(∂^{k−i} δ/∂α^{k−i}) ].

Here, we rearrange the equation by collecting the terms (∂^i δ/∂α^i)(∂^{k−i+1} δ/∂α^{k−i+1}). Their factors a_i can then be calculated as follows. In the case of i = 0 or i = k,

    a_i = 1 = ((k+1)−1)! / (((k+1)−1−i)! i!).

In the case of 0 < i < k,

    a_i = (k−1)! / ((k−1−(i−1))! (i−1)!) + (k−1)! / ((k−1−i)! i!)
        = (k−1)! [ 1/((k−i)!(i−1)!) + 1/((k−1−i)! i!) ]
        = (k−1)! (i + (k−i)) / ((k−i)! i!)
        = k! / ((k−i)! i!)
        = ((k+1)−1)! / (((k+1)−1−i)! i!).

Therefore, Eq. (8) is satisfied for the (k+1)-th partial derivative. As a result, Eq. (8) is satisfied for all k > 0.
Proceedings of the AAMAS Workshop on Adaptive and Learning Agents, May 2010, Toronto, Canada
Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning

General Terms
Algorithms, Experimentation

Keywords
Transfer Learning, Robotics, Reinforcement Learning, Artificial Intelligence

ABSTRACT
As robots become more widely available, many capabilities that were once only practical to develop and test in simulation are becoming feasible on real, physically grounded, robots. This new-found feasibility is important because simulators rarely represent the world with sufficient fidelity that developed behaviors will work as desired in the real world. However, development and testing on robots remains difficult and time consuming, so it is desirable to minimize the number of trials needed when developing robot behaviors.
This paper focuses on reinforcement learning (RL) on physically grounded robots. A few noteworthy exceptions notwithstanding, RL has typically been done purely in simulation, or, at best, initially in simulation with the eventual learned behaviors run on a real robot. However, some recent RL methods exhibit sufficiently low sample complexity to enable learning entirely on robots. One such method is transfer learning for RL. The main contribution of this paper is the first empirical demonstration that transfer learning can significantly speed up and even improve the asymptotic performance of RL done entirely on a physical robot. In addition, we show that transferring information learned in simulation can bolster additional learning on the robot.

1. INTRODUCTION
Physically grounded robots need to be able to learn from their experience, both in order to deal with changing environments and to adapt to new problems. For the purpose of online learning of sequential decision making tasks with limited feedback, value-function-based reinforcement learning (RL) [15] is an appealing paradigm, because of the well-defined semantics of the value function and its elegant theoretical properties. However, a few notable successes notwithstanding (e.g., flying RC helicopters [3, 9] and quadruped walking [8]), RL algorithms have typically been applied only in simulation, or at best trained in simulation with the eventual learned behaviors run on a real robot (e.g., [6], [10], and [5]).
Learning on physically grounded robots is difficult for several reasons, including environmental and sensor noise, high costs of failure (such as a crashed helicopter), the large amount of time it takes to perform tasks, and the fact that robots' dynamics are often not constant due to wear and tear on their motors. Thus, to the extent possible, it is desirable to train robots in a controlled environment before sending them out into the world. Doing so can reduce damage to the robots and prepare them to deal with expected situations. However, when encountering unexpected situations after "deployment" in the real world, the robot will have to continue to adapt. Such unexpected situations can even arise from the dynamics of the robot itself changing as its joints break, or as repairs are made. It is conceivable to relearn tasks from scratch each time a change happens, but due to the time and cost of learning, this is not practical. Instead, it is desirable for the robot to reuse prior information in order to learn faster. The concept of reusing information from past learning is the idea behind transfer learning.
Transfer learning for RL tasks has been shown to be effective in simulation [18], but no prior work has been done on transfer learning on physically grounded robots. The main contribution of this paper is the first empirical demonstration that transfer learning for RL can significantly speed up and even improve the asymptotic performance of RL with learning done entirely on a physical robot, specifically using Q-value reuse for the Sarsa(λ) algorithm [19]. In addition, we show that transferring information from learning in simulation can improve subsequent learning on the robot.
To this end, we introduce a novel reinforcement learning task for humanoid robots and demonstrate that transfer learning can be effective for this task. The results additionally represent one of the first successful applications of reinforcement learning on the Nao humanoid platform developed by Aldebaran¹. A limited amount of previous work has been done using the Nao, but this work focused on simulation, with only a single run on a physical robot [7].
The remainder of the paper is organized as follows. Section 2 presents the main algorithms used in our experiments, namely Sarsa(λ) and Q-value reuse. Section 3 introduces our experimental testbed and fully specifies the task to be learned. Sections 4 and 5 present the results of our experiments. Section 6 further situates the results in the literature, and Section 7 concludes.

¹ http://www.aldebaran-robotics.com/eng

2. BACKGROUND
Reinforcement learning (RL) is a framework for learning sequential decisions with delayed rewards [15]. RL is promising for robotics because it handles online learning with limited feedback where actions taken affect the environment. RL has been extensively studied in many domains, with positive results.
However, RL techniques can require long training times. Therefore, especially on robots, it can be useful to reuse knowledge learned from similar problems to speed up training via transfer learning.
Value-function-based RL algorithms assume that the task to be learned can be modeled as a Markov Decision Process (MDP). An MDP is a four-tuple (S, A, T, R), where S is a state space, A is an action space, T is a transition function, and R is a reward function.

Figure: Q-value reuse schematic. The source-task agent interacts with the source-task environment (action a_source, state s_source, reward r_source), the target-task agent interacts with the target-task environment (a_target, s_target, r_target), and the two are connected through an inter-task mapping.
Specifically, the robot's task was to hit an orange ball as far as possible at a 45° angle with its right hand. It used its onboard camera to observe the result of each trial and calculate the reward signal. The robot is seated with the ball 80 mm in front of the center of the robot and 170 mm to its right. Note that the robot is not given the ball's location except for the information in the reward signal. Every 75 ms, the robot is given the current positions of the joints and their velocities as observations.
The reward signal is given by r = d · cos(θ), where d is the distance that the ball moved and θ is the angle between the ball's trajectory and the 45° target angle. If the ball was not seen for sufficiently long, it was assumed to have been hit backwards, and the action was assigned a reward of −100. All other steps were given a reward of −1 to encourage the agent to find a fast action sequence to hit the ball.
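A minimal sketch (ours, not the authors' code) of this reward signal; the function names and the ball_seen flag are assumptions introduced for illustration.

import math

BACKWARD_HIT_REWARD = -100.0   # ball not seen long enough: assumed to be hit backwards
STEP_REWARD = -1.0             # per-step penalty, encouraging a fast action sequence

def step_reward() -> float:
    """Reward for an intermediate (non-terminal) time step."""
    return STEP_REWARD

def episode_reward(ball_seen: bool, distance: float, trajectory_angle_deg: float) -> float:
    """Terminal reward r = d * cos(theta), theta measured from the 45-degree target angle."""
    if not ball_seen:
        return BACKWARD_HIT_REWARD
    theta = math.radians(trajectory_angle_deg - 45.0)
    return distance * math.cos(theta)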
The reward from vision can be inaccurate, due to the ball moving outside the sight range of the robot, the arm obscuring the sight of the ball, and noisy distance estimates of the ball. However, we measured these effects and found that they were not very significant. Figure 2 compares the robot's estimate of the reward with measurements taken by hand using a tape measure and a protractor. Out of 50 episodes, only two successful hits were not seen by the robot and incorrectly assumed to be backward hits with reward −100. The R-squared value of the robot's estimations was 0.86.
As shown in Figure 3 (supplied by Aldebaran¹), the robot can use four joints to help it hit the ball: shoulder pitch, shoulder roll, elbow roll, and elbow yaw. For each episode, these joints start at a fixed position with no initial velocity; these values are given in Table 1 and depicted in the left-most frames of Figure 4. Also, the ball starts in the same position for every episode, as shown in Figure 4. At each time step, the robot can accelerate one joint in either direction or leave all the joints alone. Therefore, the robot has nine actions: {no change, accelerate the shoulder pitch upwards, accelerate the shoulder pitch downwards, accelerate the shoulder roll clockwise, accelerate the shoulder roll counter-clockwise, etc.}. Furthermore, it has eight observations: the position and velocity of each joint. The velocities are kept in the range [−100°/s, +100°/s], and the actions are taken every 75 ms (more than 13 times per second) to change the velocity by 50°/s.
It is possible to learn this task without any prior information, but the process can be slow, and the robot converges to a mediocre policy. Our work focuses on improving this learning, specifically by using a related source task as prior information. In this simpler task, the robot only has control of the two shoulder joints, with the elbow roll and yaw fixed at 0° and 0°. Therefore, the robot has only five actions and four observations. We will refer to this simpler task as the source task and the original task as the target task. The keyframes of the robot performing the two tasks can be seen in Figure 4.

Figure 3: Joint movements possible for the task.

Figure 4: Keyframes of the robot tasks: (a) source task, (b) target task.

The robot has less control in this source task, and therefore cannot hit the ball as far as in the target task, but it can learn faster as the problem is simpler. Our central hypothesis is that using Q-value reuse to transfer information from this source task will enable the robot to learn faster on the target task.
As this work focuses on transfer learning on robots, the main case considered was transferring from the source task to the target task on the robot, compared to learning the target task with no prior information. We also replicated both tasks in the Webots simulator² to test our algorithm in a different, though similar, environment (as the dynamics of the simulator do not entirely match the physical robot). We do not assume that a useful simulator will be available in all cases, which is why we focus on transfer on the robot itself. In this case, the simulator allows us to better evaluate the effectiveness and robustness of the algorithm and to run many more experiments than physical robots allow. However, we emphasize that for the main result of the paper, both the source and target tasks were learned on the physical robot.
We refer to the source task on the robot as SOURCE ROB, the target task on the robot as TARGET ROB, the source task in the simulator as SOURCE SIM, and the target task in the simulator as TARGET SIM. The main test of our algorithm is in how the transfer from SOURCE ROB to TARGET ROB and from SOURCE SIM to TARGET SIM performs.

Table 1: Joint angle ranges and starting positions.

    Joint            Min     Max     Start
    Shoulder pitch    0°     115°    115°
    Shoulder roll   −90°      5°     −75°
    Elbow roll        0°     120°     45°
    Elbow yaw       −90°     90°     −45°

² Cyberbotics Ltd. http://www.cyberbotics.com
However, the use of the simulator allows for several other paths for transferring information, and we discuss this idea further in Section 5.
A significant part of the work was done using the Webots simulator², and this work relies heavily on the code developed by the UT Austin Villa robot soccer team³. This code base provides the interface between the learning agent and the robot's actions, as well as providing visual detection of the ball.

³ http://www.cs.utexas.edu/users/austinvilla

4. RESULTS
Transfer learning can be evaluated in many different ways [18]. In this paper, our main focus is on "weak" transfer, meaning that we assume that the time spent in the source task does not count against the learner in the target task. This is the case when the robot has already learned the source task, so this training time is not a new cost. For example, if a robot was trained in a lab before being sent out, we might be interested in the time it would take the robot to learn a new task, and less interested in how long the robot was trained in the lab. We also show one "strong" transfer result, where time spent in the source task does count.
For all experiments, we plot the running average reward for each approach, taken with a 25-episode moving window for the robot tests and a 50-episode window for the simulation tests. Each test on the robot represents five runs, each lasting 50 episodes. In the simulator, each test averages 50 runs, each lasting 1,000 episodes. The 50 episodes on a robot take approximately 30 minutes, and 1,000 episodes in simulation take approximately three hours. This data allows us to draw conclusions with statistical significance and to reason about the convergence of each approach.
The baseline that we use is learning TARGET ROB with no prior information. Figure 5a shows that transfer from SOURCE ROB to TARGET ROB is helpful, improving the reward throughout the entire test. The initial few episodes of each algorithm are very noisy, so the initial positive performance of TARGET ROB is not significant, just the effect of a few outliers. This graph is an evaluation of weak transfer: we do not depict training time in the source task.
Figure 5b shifts the transfer plot 50 episodes to the right to represent the strong transfer scenario. Though not as dramatic, the result is still positive, thus demonstrating that it can be useful to break a robot task into robot subtasks, and then transfer from the subtasks to the target task. In this test, the robot performs about as well in the source task as in the target task, because it does not have enough trials to completely explore the target task and find a good behavior.
Unfortunately, the small number of tests on the robots means that we cannot draw statistical conclusions about the performance of the methods. However, the tests were also replicated in simulation with good results. Figure 6a shows that the transfer from SOURCE SIM to TARGET SIM is helpful, even after a large number of episodes. The differences between the final rewards of each method are statistically significant with a confidence of 99%, and the error bars in the diagram show the standard deviation of the average rewards. Figure 6b shows that our results for strong transfer hold in simulation. Overall, Figures 5 and 6 suggest that transfer learning works on robots, and can greatly speed up learning and reach better end behaviors.

Figure 5: Transfer on the robot to the target task: (a) weak transfer, (b) strong transfer.

Figure 6: Transfer in the simulator: (a) weak transfer, (b) strong transfer.
5. ADDITIONAL EXPERIMENTS
In addition to providing statistically significant results, the use of the simulator opens several other paths for transferring knowledge between tasks, including two-step transfer, where we learn sequentially from multiple source tasks. Two-step transfer is performed as described in Section 2, with the value function

    Q(s, a) = Q1(χ_X1(s), χ_A1(a)) + Q2(χ_X2(s), χ_A2(a)) + Q3(s, a).
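The following Python sketch (ours; the mapping functions, table shapes, and the tabular representation are illustrative assumptions, standing in for the paper's function approximation) shows this Q-value composition: the frozen source tables Q1 and Q2 are consulted through inter-task mappings, while only Q3 is updated in the target task.

import numpy as np

class TwoStepTransferQ:
    """Q-value reuse with two frozen source tables and one learned target table."""

    def __init__(self, q1, map1, q2, map2, n_states, n_actions):
        self.q1, self.map1 = q1, map1          # source task 1: table and (state, action) mapping
        self.q2, self.map2 = q2, map2          # source task 2: table and (state, action) mapping
        self.q3 = np.zeros((n_states, n_actions))   # target-task correction, learned online

    def value(self, s, a):
        s1, a1 = self.map1(s, a)
        s2, a2 = self.map2(s, a)
        return self.q1[s1, a1] + self.q2[s2, a2] + self.q3[s, a]

    def sarsa_update(self, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
        # Only the target-task component Q3 is adjusted; the source tables stay fixed.
        td_error = r + gamma * self.value(s_next, a_next) - self.value(s, a)
        self.q3[s, a] += alpha * td_error

For one-step transfer, the same structure applies with a single source table; eligibility traces, as in Sarsa(λ), are omitted here for brevity.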
Figure 7: The possible paths for transferring information to TARGET ROB among SOURCE ROB, TARGET ROB, SOURCE SIM, and TARGET SIM; the numbered edges (1 to 5) correspond to the tests listed below.

We consider TARGET ROB to be the target for all of the tests, and we continue using 1,000 episodes in simulation and 50 on the physical robot. Figure 7 shows all the ways to transfer information to TARGET ROB, with numbers corresponding to the following tests:

1. SOURCE ROB → TARGET ROB
2. SOURCE SIM → TARGET ROB
3. TARGET SIM → TARGET ROB
4. SOURCE SIM → SOURCE ROB → TARGET ROB
5. SOURCE SIM → TARGET SIM → TARGET ROB

Test 1 is further investigated in Section 4, and the results of the three one-step transfer tests (tests 1, 2, and 3) are displayed in Figure 8. Transferring from TARGET SIM produces the biggest improvement in the early episodes, due to its having already learned about the entire state-action space. However, the agent does have to learn about the differences between the simulated and real robots. Also, transferring from SOURCE SIM performs better than transferring from SOURCE ROB, probably due to the higher number of runs in SOURCE SIM, which allow the agent to explore the state-action space more completely. In the end, all of the transfer methods end up with similar performance, and all perform much better than starting with no prior information.
The two types of two-step transfer were also tested (tests 4 and 5), and the results are shown in Figure 9. Both methods show a substantial boost in the early episodes but later plateau, achieving results similar to those of the other transfer methods. The results of the two-step transfer are not better than some of the one-step transfers, but Figure 10 shows that multi-step transfer can be beneficial, giving a large early boost.

Figure 10: Comparison of one- and two-step transfer.

Though all of these results are for weak transfer, we speculate that these trends will hold for strong transfer (as they did in both one-step transfer cases). Furthermore, transferring from simulation to a physical robot raises the possibility of assigning different costs to training time spent in the simulator and on the robot. For example, if we consider simulation time to be insignificant, then tests 2, 3, and 5 are all evidence of strong transfer.

6. RELATED WORK
One of the earliest uses of transfer learning for reinforcement learning was by Selfridge et al. [12] in the familiar cart-pole domain. In this work, the function approximator was reused for poles of different sizes and weights, with good effect.
Taylor and Stone [18] recently surveyed the use of transfer learning in reinforcement learning. Significant prior work in this area has been performed, with good results. However, little work has been done in applying transfer learning to the area of robotics. Taylor and Stone discuss several approaches to transfer learning, and point out several ways to evaluate the effects of the transfer. Our research focuses on Q-value reuse with supervised task transfer.
Taylor et al. [19] explored Q-value reuse in temporal difference learning with good results. They specifically evaluate a Sarsa agent using a CMAC for function approximation. However, this work focuses on the simulated domain of keepaway soccer. Our work applies this research to a physical robot, and has a greater difference between the source and target tasks.
One interesting approach to transfer learning is to extract higher-level strategies from the policy learned by the agent. Torrey et al. [20] explored this idea using relational macros to represent the strategies learned by inductive logic programming (ILP) in the RoboCup breakaway domain, but this requires the domain to be translated into first-order logic. It is also possible to break a single problem into a series of smaller tasks. Then, the agent learns each of these sub-tasks and combines the learned knowledge for the full task.
In the target task, the state space, actions, and transition function are the same as in the sub-tasks, and the information is transferred via Q-value transfer. Singh [13] also explored this area, naming it "compositional learning."
It is possible to learn a mapping between source and target tasks autonomously (e.g., when a human is unable or unwilling to provide such a mapping). Talvitie and Singh [16] developed an algorithm to generate possible state variable mappings and to learn which mapping is best, treated as an n-armed bandit problem. Further work has been done by Taylor et al. [17] using a model-based approach to reduce the samples needed; they transfer observed (s, a, r, s′) instances, which allows the source and target agents to have different representations for the task. However, these methods are not as reliable as hand-mapping and can be unnecessary for smaller domains.
Unfortunately, tests on robots can be slow, and most learning algorithms require a large amount of training data to perform well. Therefore, it can be useful to train an agent in simulation and transfer these behaviors to a robot [6, 10, 5]. However, we cannot assume that a simulator will accurately model complex perception or manipulation tasks, so it is often useful to tune the behavior from the simulator by running more tests on a robot. This requires combining information about a source simulation task and a target robot task, but no work we know of treats this as a transfer learning problem.
Another way to speed up learning is to use prior demonstrations. Researchers have shown that sub-optimal demonstrations can be sufficient to teach an agent to control an autonomous helicopter [1, 4]. Unfortunately, this requires an expert in the domain to perform the demonstrations, which is not always possible.

7. CONCLUSIONS AND FUTURE WORK
This paper empirically tests transfer learning for RL on physical robots. The results show that model-free RL can be effective on a robot, and that transfer learning can speed up learning on physical robots.
Furthermore, this prior information can be learned in simulation, even if the simulator does not completely capture the dynamics of the robot. For example, the simulator does not model collisions between the robot's different parts, so the dynamics of the arm hitting the body are never learned in the simulator. However, the behaviors learned in the simulator serve as good starting points for learning on the robot. This result is useful when a simulator is available, since simulator tests are significantly easier to run than robot tests: it suggests that only a relatively small amount of tuning is necessary to adapt behaviors learned in the simulator to the real robot. The main motivation for this work is that in some situations learning must be performed entirely on a physical platform, and the positive results in that setting are the main contribution of this paper.
This work opens up several interesting directions for future work. For example, it is worth investigating whether other learning algorithms can learn this task faster than Sarsa, and if so, whether Q-value reuse (if applicable) shows similar benefits with these other algorithms. It would also be interesting to see how different methods for transfer learning perform on this task. In the long run, we view the research reported in this paper as just the first of many possible applications of transfer learning for RL to physical robots.

8. REFERENCES
[1] P. Abbeel and A. Y. Ng. Exploration and apprenticeship learning in reinforcement learning. In ICML '05, pages 1-8. ACM, 2005.
[2] J. S. Albus. Brains, Behavior, and Robotics. Byte Books, Peterborough, NH, 1981.
[3] J. A. Bagnell and J. Schneider. Autonomous helicopter control using reinforcement learning policy search methods. In ICRA '01, pages 1615-1620. IEEE Press, 2001.
[4] A. Coates, P. Abbeel, and A. Y. Ng. Learning for control from multiple demonstrations. In ICML '08, pages 144-151. ACM, 2008.
[5] Y. Davidor. Genetic Algorithms and Robotics: A Heuristic Strategy for Optimization. World Scientific Publishing Co., Inc., 1991.
[6] E. Gat. On the role of simulation in the study of autonomous mobile robots. In AAAI-95 Spring Symposium on Lessons Learned from Implemented Software Architectures for Physical Agents, March 1995.
[7] T. Hester, M. Quinlan, and P. Stone. Generalized model learning for reinforcement learning on a humanoid robot. In ICRA '10, 2010.
[8] N. Kohl and P. Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. In ICRA '04, May 2004.
[9] A. Y. Ng, H. J. Kim, M. I. Jordan, and S. Sastry. Inverted autonomous helicopter flight via reinforcement learning. In International Symposium on Experimental Robotics. MIT Press, 2004.
[10] J. M. Porta and E. Celaya. Efficient gait generation using reinforcement learning. In Proceedings of the Fourth International Conference on Climbing and Walking Robots, pages 411-418, 2001.
[11] G. A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Dept., 1994.
[12] O. G. Selfridge, R. S. Sutton, and A. G. Barto. Training and tracking in robotics. In IJCAI, pages 670-672, 1985.
[13] S. P. Singh. Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8:323-339, 1992.
[14] R. S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In NIPS '96, 1996.
[15] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 1998.
[16] E. Talvitie and S. Singh. An experts algorithm for transfer learning. In IJCAI, pages 1065-1070, 2007.
[17] M. E. Taylor, N. K. Jong, and P. Stone. Transferring instances for model-based reinforcement learning. In ECML PKDD, pages 488-505, September 2008.
[18] M. E. Taylor and P. Stone. Transfer learning for reinforcement learning domains: A survey. JMLR, 10(1):1633-1685, 2009.
[19] M. E. Taylor, P. Stone, and Y. Liu. Transfer learning via inter-task mappings for temporal difference learning. JMLR, 8(1):2125-2167, 2007.
[20] L. Torrey, J. W. Shavlik, T. Walker, and R. Maclin. Relational macros for transfer in reinforcement learning. In ILP '07, 2007.
Proceedings of the AAMAS Workshop on Adaptive and Learning Agents, May 2010, Toronto, Canada
Categories and Subject Descriptors
I.2.11 [Artificial Intelligence]: Distributed Artificial Intelligence - Multiagent Systems; I.2.6 [Artificial Intelligence]: Learning

General Terms
Algorithms, Performance

Keywords
Multi-agent learning, Reinforcement learning

ABSTRACT
The design of reinforcement learning solutions to many problems artificially constrains the action set available to an agent, in order to limit the exploration/sample complexity. While exploring, if an agent can discover new actions that break through the constraints of its basic/atomic action set, then the quality of the learned decision policy can improve. On the flip side, considering all possible non-atomic actions might explode the exploration complexity. We present a potential-based solution to this dilemma, and empirically evaluate it in grid navigation tasks. In particular, we show that both the solution quality and the sample complexity improve significantly when basic reinforcement learning is coupled with action discovery.

1. INTRODUCTION
Reinforcement learning is a popular framework for agent-based solutions to many problems, primarily because of the simplicity of design and the strong convergence guarantees in the face of uncertainty and limited feedback. In typical on-line reinforcement learning problems, an agent interacts with an unknown environment by executing actions and learns to optimize long-term payoffs (feedback from the environment) consequent to selecting actions from a given set, A, in every state. In most cases, care is taken to ensure that the set of actions is not too large, usually by discretizing continuous action spaces (see [7] for an exception). This is because a large action set can slow down exploratory learning by creating too many alternate trajectories through the state space to be explored. However, in order to curtail this exploration space, action sets are oftentimes artificially limited (in addition to the physical limitations of an agent), leading to constraints in the learned behaviors as well.
Consider a simple example of this limitation: assume that the robotic arm in Figure 1(a) is physically limited to rotating by no less than 2° at a time. The goal is to get it to rotate by 13°. Instead of allowing it to explore every possible action (2° to 359°), the designer might prefer to allow only 4 actions, viz., 2° clockwise and anti-clockwise, and 5° clockwise and anti-clockwise. Although this would enable the robot to learn an action policy for any integer goal angle, many of those policies would have constraints that are imposed by the design choice, not by the robot's physical limitation; e.g., it would have to execute three 5° actions followed by a 2° action in the reverse direction. However, the robot could have learned to simply turn by 13° in one smooth motion, had the learning problem not been artificially constrained. On the other hand, allowing a full-blown action set might slow down learning to such an extent that no performance improvement (over a random-policy baseline) may be observed in any reasonable time frame.

Figure 1: Motivating examples: (a) a robotic arm that must rotate from θ_init to θ_goal, (b) a grid-world navigation task.

Consider a second example, a grid-world navigation task, as shown in Figure 1(b). In such worlds, the action set is usually assumed to contain the 8 atomic actions that an agent can take to move from one state (tile corner) to an (8-connected) neighboring state. However, the optimal policies generated by such an action set can make for unnatural navigation paths, such as the path from state A to the bottleneck B shown in solid arrows in Figure 1(b). The most natural path from A to B would be the dotted arrow in Figure 1(b), but accommodating such actions might make the action set of the agent too large.
This example also highlights the difference between our work and the theory of options [13]. An option in this example might allow an agent to move to the doorway (B) with a temporally extended action sub-plan that consists of the same atomic actions (i.e., the chain of solid arrows). In contrast, our method adopts a fundamentally new action (the dotted arrow), whereby an agent can move in a straight line to B, instead of being constrained by the set of atomic actions. However, as mentioned before, it is not immediately clear whether such additional actions must come at the cost of a reduced learning rate.
In this paper, we propose a method to address the tradeoff between discovering new actions and keeping the learning rate high. We allow a reinforcement learning agent to start exploring its environment with the same (limited) basic/atomic action set, but enable it to discover new actions on-line that are expected to lead to its goal faster. As the agent augments its action set with these newly discovered promising actions, its learning rate might be expected to fall. However, if only the most promising actions are added, then they may actually decrease the time to reach the goal, thereby accelerating the learning. We experimentally study the relative effects of these two factors in the grid-world navigation domain with single agents. We show that action discovery can indeed improve the solution quality while significantly reducing the exploration/sample complexity. Furthermore, the reason behind the success of action discovery, viz., the improvement in the connectivity of the state graph, indicates an added benefit for multi-agent coordination learning. Coordination Problems (CPs) [3] are points in multi-agent sequential decision problems where agents must coordinate their actions in order to optimize future global returns. With fewer CPs, the learning problem is simplified, leading to faster learning. Since action discovery can reduce the number of points where agents would need to coordinate (i.e., reduce CPs), action discovery can greatly enhance the learning rates in multi-agent coordination learning tasks. In order to verify this intuition, we adapt the Joint Action Learning (JAL) algorithm [4] with action discovery in a multi-agent box-pushing task, and show that the beneficial impact of action discovery does indeed apply.

2. REINFORCEMENT LEARNING
Reinforcement learning (RL) problems are modeled as Markov Decision Processes, or MDPs [12]. An MDP is given by the tuple {S, A, R, T}, where S is the set of environmental states that an agent can be in at any given time, A is the set of actions it can choose from at any state, R : S × A → ℜ is the reward function, i.e., R(s, a) specifies the reward from the environment that the agent gets for executing action a ∈ A in state s ∈ S, and T : S × A × S → [0, 1] is the state transition probability function, specifying the probability of the next state in the Markov chain consequent to the agent's selection of an action in a state. The agent's goal is to learn a policy (action decision function) π : S → A that maximizes the sum of discounted future rewards from any state s, given by

    V^π(s) = E_T[R(s, π(s)) + γR(s′, π(s′)) + γ²R(s′′, π(s′′)) + ...],

where s, s′, s′′, ... are samplings from the distribution T following the Markov chain with policy π, and γ ∈ (0, 1) is the discount factor.
A common method for learning the value function V defined above, through online interactions with the environment, is to learn an action-quality function Q given by

    Q(s, a) = R(s, a) + γ max_π Σ_{s′} T(s, a, s′) V^π(s′).             (1)

This quality value stands for the discounted sum of rewards obtained when the agent starts from state s, executes action a, and follows the optimal policy thereafter. Action-quality functions are preferred over value functions, since the optimal policy can be calculated more easily from the former. The Q function can be learned by online dynamic programming using various update rules, such as temporal difference (TD) methods [12]. In this paper, we use the on-policy Sarsa rule, given by

    Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γQ(s_{t+1}, a_{t+1}) − Q(s_t, a_t)],

where α ∈ (0, 1] is the learning rate, r_{t+1} is the actual environmental reward, and s_{t+1} ∼ T(s_t, a_t, ·) is the actual next state resulting from the agent's choice of action a_t in state s_t. We assume that the agent uses an ε-greedy strategy for action selection: it selects action a_t = arg max_b Q(s_t, b) in state s_t with probability (1 − ε), but with probability ε it selects a random action.
Sarsa is named after the acronym of its steps: state, action, reward, state, action. From state s_t, the agent picks action a_t, receives a reward r_{t+1}, transitions to state s_{t+1}, and then selects action a_{t+1} in that state. It is only at this point that it can update Q(s_t, a_t), using the above TD rule.
given by,
constrained action sets [10] seeks to limit the action set of an agent
during exploration by constructing appropriate Lyapunov functions
V π (s) = ET [R(s, π(s))+γR(s′, π(s′ ))+γ 2 R(s′′ , π(s′′ ))+. . .] to guide exploration, while action transfer [11] seeks to bias action
selection in new tasks by exploiting successful actions from pre-
where s, s′ , s′′ , . . . are samplings from the distribution T following vious tasks. Given the significant prior effort in reducing sample
the Markov chain with policy π, and γ ∈ (0, 1) is the discount complexity, some by eliminating or reducing the weight of avail-
factor. able actions, it may sound counterproductive to seek to expand an
A common method for learning the value-function, V as defined agent’s available action set. Our insight is that with the discovery
above, through online interactions with the environment, is to learn of new actions that circumvent policy constraints, more efficient
Page 31 of 99
policies can be learned and exploited to ultimately learn to achieve to physical limitations of the agent or the environment, for all s, s′ .
the goal faster. As a bonus, the quality of the learned solution is It is useful to deal with both possibilities uniformly, with a cost
also expected to improve. function.
The basic insight that learning temporally extended abstractions We assume that for a given domain, a cost function c : S ×
of ground behavior can increase the learning rate by reusing ab- S 7→ ℜ, is always available, such that c(s, s′ ) gives the cost of
stractions, has been verified before in the context of options [13]. executing an action that would take an agent from state s to state
However, there is a fundamental difference between our work and s′ , i.e., ass′ . If c(s, s′ ) < ∞, this simply means that there is some
the theory of options. While options can be loosely thought of action (whether atomic or newly discovered) that takes the agent
as labels for a series of atomic actions that are useful to execute from state s directly to state s′ . However, if c(s, s′ ) = ∞, then no
in the same sequence in many different states, and are geared to- such action exists. c is virtually an oracle that can be enquired by
ward reusable knowledge, our work considers actually new actions. the agent for pairs of states that it has seen in the past. Our setting
When options are considered as additional actions that an agent can is different from regular RL settings in that the agent does not know
select in place of an atomic action, they have been shown to expe- the state space a priori, but has access to a transition function oracle
dite learning. However, discovering options is not a simple task. In (c), whereas in regular RL settings the state space is known but the
contrast, it may be simple to discover new ground actions outside transition function is unknown.
an agent’s set of atomic actions, as we demonstrate in grid naviga- The cost function also serves as the measure of action complex-
tion tasks. Rather than bank on their reusability as with options, we ity, and can be used to exponentiate γ for SMDPs. For actions
rely on the ability of these new actions to improve the policy qual- outside the atomic action set (A0 ), and having a finite cost, we do
ity by connecting topologically distant states in the state graph. It not assume that a reward sample for such an action is available un-
is not immediately clear if such qualitative enhancement will also less this action is actually executed. Hence the first time that such
reduce sample complexity. But our experiments in simple grid nav- an action is discovered (line 16, Algorithm 1), the reward is esti-
igation tasks show that this is indeed possible. mated (r̂ in line 18, Algorithm 1) on the basis of the actual rewards
Reinforcement learning in multi-agent sequential decision tasks r1 , r2 .
has been an active area of research [3, 6, 8, 2]. In multi-agent Clearly, accepting every newly discovered action into the set of
systems the decision complexity (typically the size of the Q-table) actions will be expensive for learning. For instance, in a grid of
usually depends exponentially on the number of agents, and so it is size n × n, there may be O(n2 ) such new actions, per state, i.e.,
even less intuitive whether worsening the decision complexity by potentially O(n4 ) actions to contend with. Accomodating such a
accommodating new actions can help the learning rate at all. We large number of actions will impact the exploration and reduce the
answer this question affirmatively, by showing that Joint Action learning rate. Fortunately, many of these actions may be needless
Learners (JAL) [4] with action discovery do learn better policies to explore, e.g., if they lead away from the goal. It is possible to
with lower sample complexity in a multi-agent box pushing task estimate the value potential of a state, Φ, precisely for this pur-
than regular JALs. pose. Potential functions, Φ(s), have been used before, to shape
rewards and reduce the sample complexity of reinforcement learn-
4. ACTION DISCOVERY ing [9]. Such functions can be set by the agent designers or domain
In reinforcement learning problems, the atomic action set, A0 , designers. In this paper, we use such functions to informatively
is usually fixed. Even if new options are discovered, these options select among newly discovered actions. To illustrate our heuristic
are described in terms of the atomic actions from A0 . However,
S
in many cases new actions that are neither included in A0 , nor 3
Φ( S 3 )
precluded by the agent’s capabilities, may be able to improve the
agent’s performance by
• reducing the number of steps to the goal, or the total solution
cost γ cost(s 1, s3 )
Page 32 of 99
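The approach therefore assumes only two pieces of domain knowledge: a cost oracle c(s, s′) and a potential function Φ(s). As a purely illustrative grid-navigation instantiation — the grid size, obstacle layout, and function names below are our own assumptions, not taken from the paper — these two pieces might look as follows:

```python
import math

# Hypothetical grid world: states are (x, y) cells.  GOAL and OBSTACLES are
# placeholders chosen for illustration only.
GOAL = (8, 8)
OBSTACLES = {(4, y) for y in range(2, 7)}  # assumed wall segment

def blocked(s, s2):
    """Assumed line test: does the straight segment from s to s2 cross an obstacle?"""
    steps = 20
    for i in range(steps + 1):
        x = s[0] + (s2[0] - s[0]) * i / steps
        y = s[1] + (s2[1] - s[1]) * i / steps
        if (round(x), round(y)) in OBSTACLES:
            return True
    return False

def cost(s, s2):
    """c(s, s'): distance if some action can take the agent from s directly to s',
    and infinity if the transition is physically precluded."""
    return math.inf if blocked(s, s2) else math.dist(s, s2)

def potential(s):
    """Phi(s): a simple goal-proximity potential (1/(1+d) here to avoid division
    by zero at the goal; the paper's experiments use 1/distance)."""
    return 1.0 / (1.0 + math.dist(s, GOAL))
```

Any state pair the agent has seen with finite cost is a candidate new action; the potential function is what lets the agent rank such candidates without having to execute them first.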
To illustrate our heuristic, … action (atomic or otherwise) that had transitioned the agent from s1 to s2. This question may be heuristically answered by comparing the potential backup values from both s2 and s3 to s1. These potential backup values can be estimated as γ^{c(s1,s2)} Φ(s2) from s2, and γ^{c(s1,s3)} Φ(s3) from s3. Consequently, we use the following criterion for accepting a newly discovered action, a_{s1 s3}:

γ^{c(s1,s3)} Φ(s3) > (1 + δ) γ^{c(s1,s2)} Φ(s2)

where δ is a slack variable guiding the degree of conservatism in accepting new actions. This step is shown in line 16 of Algorithm 1. Furthermore, new actions merely facilitate reaching the goal; they are not necessary for the agent to reach the goal. The agent should be able to find a baseline policy to the goal using just the atomic actions, in the worst case. Hence, we use the above test rather conservatively (δ > 0) to select or reject a newly discovered action.

[Figure 3: the two grid navigation tasks, G1 and G2, with Start and Goal cells marked.]

Algorithm 1 Sarsa-AD (Sarsa with Action Discovery)
1: Initialize ǫ, δ, α, γ
…
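Only the first line of the Sarsa-AD listing survives above, but the acceptance test itself is a one-line comparison once c and Φ are available. A minimal sketch (function and parameter names are ours, not the authors'); per the text, this is the check that Algorithm 1 applies at line 16:

```python
def accept_new_action(s1, s2, s3, cost, potential, gamma=0.9, delta=0.1):
    """Return True if the newly discovered action a_{s1 s3} should be accepted.

    Implements the criterion from the text:
        gamma**c(s1, s3) * Phi(s3) > (1 + delta) * gamma**c(s1, s2) * Phi(s2)
    where s2 is the state reached by the action that originally took the agent
    out of s1, and delta > 0 makes the test conservative.
    """
    c13 = cost(s1, s3)
    c12 = cost(s1, s2)
    if c13 == float("inf"):  # no physically realizable action from s1 to s3
        return False
    lhs = (gamma ** c13) * potential(s3)
    rhs = (1.0 + delta) * (gamma ** c12) * potential(s2)
    return lhs > rhs
```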
Figures 4 and 7 show the learned solution qualities on the 2 maps in Figure 3, respectively, as total path lengths. As one might expect, …

[Figures 6–9 (plots; curves: All actions, δ=0, δ=1, Basic SARSA; x-axis: Number of Episodes; y-axis: learned path length to goal (meters) or size of the discovered action set).]
Figure 6: Growth in the size of the set At(.) − A0(.) over visited states, against episodes for task G1.
Figure 7: Plot of solution quality against episodes for task G2.
Figure 8: Plot of sample complexity against episodes for task G2.
Figure 9: Growth in the size of the set At(.) − A0(.) over visited states, against episodes for task G2.
problem. In our experiments we consider two agents pushing a box on a plane, so we allow one agent to exert a force along the x-axis only (we call it the x-agent), and the other along the y-axis only (the y-agent). By removing overlap in the directionalities of the forces, we ensure that the agents do not trivially coordinate at some decision points. This serves the purpose of isolating the impact of action discovery on CPs, with the impact on accidental coordination being removed. Note, however, that this is only meant for our experimental set-up; it is not necessary to preclude overlaps in the agents' atomic action sets. Also, agents can achieve such clean separation of their action sets by prior agreement in cooperative domains. It is worth noting that in this setting, the multi-agent box-pushing task is very closely related to the single-agent navigation task studied earlier.

We allow each agent to test the feasibility of a new action using the same method as in Algorithm 1. If a new action passes the test, then all agents discover that action and augment their action sets in that joint state with the appropriate component of the discovered action. Therefore, if an action (x′, y′) is discovered, the x-agent appends x′ as a new action in its own list of actions in that state, and also includes y′ as a new action of the other agent in that state. The y-agent performs the corresponding operations as well. This means that with each discovery, the size of the joint action table grows at the rate of O(|At|^{n−1}), where At is the largest of the current action sets over n agents. Given such a phenomenal growth in decision complexity, it is unclear if action discovery will benefit multi-agent learning.

4.3 Experiments in the Box-pushing Task
We use a 9×9 grid for the discrete box-pushing task, as shown in Figure 10. Each JAL uses action discovery as shown in Algorithm 1 with similar parameters as in the single-agent experiments (with some differences):

• A0 consists of 3 actions for each state, for each agent: ±1 or 0 in its chosen direction,
• c(s, s′) = distance(s, s′), with simple line-tests detecting blocked paths (i.e., c(s, s′) = ∞),
• rewards are 1 for any action reaching the goal, -1 for hitting any obstacle including the boundary, and 0 otherwise,
• φ(s) = 1 / distance(s, goal),
• r̂ = r1 + r2,
• δ = 0.1, α = 0.25, γ = 0.9, and ǫ = 0.01.
• All learning algorithms (including basic Sarsa JAL) use the Φ function for state-action value initialization (these settings are collected in the sketch below).
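For concreteness, the experimental settings listed above can be gathered into code as follows. This is our own rendering under assumed names; in particular, the goal cell is a placeholder since Figure 10 is not reproduced here:

```python
import math

GRID = 9                      # 9x9 discrete box-pushing grid
GOAL = (8, 4)                 # assumed goal cell (illustrative only)
ALPHA, GAMMA, DELTA, EPSILON = 0.25, 0.9, 0.1, 0.01

def atomic_actions():
    """A0 per agent: -1, 0, or +1 force along that agent's own axis."""
    return (-1, 0, 1)

def reward(next_state, hit_obstacle):
    """+1 on reaching the goal, -1 for hitting any obstacle or the boundary, 0 otherwise."""
    if next_state == GOAL:
        return 1.0
    return -1.0 if hit_obstacle else 0.0

def phi(state):
    """phi(s) = 1 / distance(s, goal); used both to initialize state-action
    values and to screen newly discovered joint actions."""
    d = math.dist(state, GOAL)
    return 1.0 / d if d > 0 else float("inf")

def estimated_reward(r1, r2):
    """r_hat = r1 + r2 for a newly discovered action, per the settings above."""
    return r1 + r2
```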
[Figure 11 (plot): learned solution quality vs. number of episodes, 0–1e+06; curve labelled Discovery.]
Figure 11: Plot of solution quality against episodes for the multi-agent box-pushing task.

5. CONCLUSION
… agent case. While (1) is to be expected, (2) is not quite intuitive and needs further investigation. Increasing the number of agents in the box-pushing task will necessitate overlap in the action spaces of the agents. We will allow all agents to act in both x and y directions, but at any given time an agent must pick an action in one of the two directions. This means an agent can choose the magnitude of the force exerted on the box, while the orientation must be either in the x-direction or the y-direction. This restriction would ensure that agents do not discover actions in arbitrary orientations, since that would reduce the need to coordinate with other agents. A technical difficulty arising from not imposing this restriction is that the outcome of a joint action (where each action can be in an arbitrary orientation) may not fall on a grid point in discrete maps.

We also plan to investigate the impact of increasing the number of agents on the benefit accrued from action discovery in continuous maps, where an action would be composed of two choices: the magnitude of the force and the orientation. However, since such domains require some kind of function approximation for learning the action values, it is not immediately clear how a newly discovered action could be reconciled with a function approximator that
usually works with a fixed set of discrete actions. There is very little work that considers both continuous action spaces and continuous state spaces, and it would be non-trivial to adapt any of these techniques to accommodate new actions.
7. ACKNOWLEDGMENTS
We would like to thank the anonymous reviewers for helpful
comments. This work was supported by a start-up grant from the
University of Southern Mississippi.
8. REFERENCES
[1] P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng. An
application of reinforcement learning to aerobatic helicopter
flight. In NIPS 19, 2007.
[2] S. Abdallah and V. Lesser. Multiagent reinforcement
learning and self-organization in a network of agents. In
Proceedings of 6th International Conference on Autonomous
Agents and Multiagent Systems (AAMAS), 2007.
[3] C. Boutilier. Sequential optimality and coordination in
multiagent systems. In Proceedings of the Sixteenth
International Joint Conference on Artificial Intelligence,
pages 478–485, 1999.
[4] C. Claus and C. Boutilier. The dynamics of reinforcement
learning in cooperative multiagent systems. In Proceedings
of the 15th National Conference on Artificial Intelligence,
pages 746–752, Menlo Park, CA, 1998. AAAI Press/MIT
Press.
[5] R. H. Crites and A. G. Barto. Improving elevator
performance using reinforcement learning. In Advances in
Neural Information Processing Systems 8, volume 8, pages
1017–1023, 1996.
[6] J. Hu and M. P. Wellman. Nash Q-learning for general-sum
stochastic games. Journal of Machine Learning Research,
4:1039–1069, 2003.
[7] A. Lazaric, M. Restelli, and A. Bonarini. Reinforcement
learning in continuous action spaces through sequential
monte carlo methods. In J. Platt, D. Koller, Y. Singer, and
S. Roweis, editors, Advances in Neural Information
Processing Systems 20, pages 833–840, Cambridge, MA,
2008. MIT Press.
[8] M. L. Littman. Markov games as a framework for
multi-agent reinforcement learning. In Proc. of the 11th Int.
Conf. on Machine Learning, pages 157–163, San Mateo, CA,
1994. Morgan Kaufmann.
[9] A. Y. Ng, D. Harada, and S. Russell. Policy invariance under
reward transformations: Theory and application to reward
shaping. In Proc. 16th International Conf. on Machine
Learning, pages 278–287. Morgan Kaufmann, 1999.
[10] T. J. Perkins and A. G. Barto. Lyapunov-constrained action
sets for reinforcement learning. In Proceedings of the ICML,
pages 409–416, 2001.
[11] A. A. Sherstov and P. Stone. Improving action selection in
MDP’s via knowledge transfer. In Proceedings of the
Twentieth National Conference on Artificial Intelligence,
July 2005.
[12] R. Sutton and A. G. Barto. Reinforcement Learning: An
Introduction. MIT Press, 1998.
[13] R. Sutton, D. Precup, and S. Singh. Between MDPs and
semi-MDPs: A framework for temporal abstraction in
reinforcement learning. Artificial Intelligence, 112:181–211,
1999.
[14] G. Tesauro. Temporal difference learning and TD-gammon. Communications of the ACM, 38(3):58–68, 1995.
[15] E. Wiewiora. Potential based shaping and Q-value initialization are equivalent. Journal of Artificial Intelligence Research, pages 205–208, 2003.
Proceedings of the AAMAS Workshop on Adaptive and Learning Agents, May 2010, Toronto, Canada
proposed algorithm, PCM(A) [8], is, to the best of our knowledge, the only known MAL algorithm to date that achieves compatibility, safety and targeted optimality against adaptive opponents in arbitrary repeated games.

1.2 Contributions
CMLeS improves on Awesome by guaranteeing both safety and targeted optimality against adaptive opponents. It improves upon PCM(A) in five ways.
1. The only guarantees of optimality against adaptive opponents that PCM(A) provides are against the ones that are drawn from an initially chosen target set. In contrast, CMLeS can model every adaptive opponent whose memory is bounded by Kmax. Thus it does not require a target set as input: its only input is Kmax, an upper bound on the memory size of the adaptive opponents that it is willing to model and exploit.
2. PCM(A) achieves targeted optimality against adaptive opponents by requiring all feasible joint histories of size Kmax to be visited a sufficient number of times. Kmax for PCM(A) is the maximum memory size of any opponent from its target set. CMLeS significantly improves this by requiring a sufficient number of visits to all feasible joint histories only of size K+1. Thus CMLeS promises targeted optimality in a number of steps polynomial in λ^{−Size(K+1)}, in comparison to PCM(A), which provides similar guarantees but in a number of steps polynomial in λ^{−Size(Kmax)}. This sample efficiency property makes CMLeS a good candidate for online learning.
3. Unlike PCM(A), CMLeS promises targeted optimality against opponents which eventually become memory-bounded with K ≤ Kmax.
4. PCM(A) can only guarantee convergence to a payoff within ǫ of the desired Nash equilibrium payoff with a probability δ. In contrast, CMLeS guarantees convergence in self-play with probability 1.
5. CMLeS is relatively simple in its design. It tackles the entire problem of targeted optimality and safety by running an algorithm that implicitly achieves either of the two, without having to reason separately about adaptive and arbitrary opponents.
The remainder of the paper is organized as follows. Section 2 presents background and definitions, Sections 3 and 4 present our algorithm, Section 5 presents empirical results and Section 6 concludes.

2. BACKGROUND AND CONCEPTS
This section reviews the definitions and concepts necessary for fully specifying CMLeS.
A matrix game is defined as an interaction between n agents. Without loss of generality, we assume that the sets of actions available to all the agents are the same, i.e., A1 = . . . = An = A. The payoff received by agent i during each step of interaction is determined by a utility function over the agents' joint action, ui : A^n ↦ ℜ. Without loss of generality, we assume that the payoffs are bounded in the range [0,1]. A repeated game is a setting in which the agents play the same matrix game repeatedly and infinitely often.
A single stage Nash equilibrium is a stationary strategy profile {π1*, . . . , πn*} such that for every agent i and for every other possible stationary strategy πi, the following inequality holds: E_{(π1*,...,πi*,...,πn*)} ui(·) ≥ E_{(π1*,...,πi,...,πn*)} ui(·). It is a strategy profile in which no agent has an incentive to unilaterally deviate from its own share of the strategy. A maximin strategy for an agent is a strategy which maximizes its own minimum payoff. It is often called the safety strategy, because resorting to it guarantees the agent a minimum payoff.
An adaptive opponent strategy looks back at the most recent K joint actions played in the current history of play to determine its next stochastic action profile. K is referred to as the memory size of the opponent (K being the minimum memory size that fully characterizes the opponent strategy). The strategy of such an opponent is then a mapping π : A^{nK} ↦ ∆A. If we consider opponents whose future behavior depends on the entire history, we lose the ability to (provably) learn anything about them in a single repeated game, since we see a given history only once. The concept of memory-boundedness limits the opponent's ability to condition on history, thereby giving us a chance to learn its policy.
We now specify what we mean by playing optimally against adaptive opponents. For notational clarity, we denote the other agents as a single agent o. It has been shown previously [4] that the dynamics of playing against such an o can be modeled as a Markov Decision Process (MDP) whose transition probability function and reward function are determined by the opponents' (joint) strategy π. As the MDP is induced by an adversary, this setting is called an Adversary Induced MDP, or AIM for short.
An AIM is characterized by the K of the opponent which induces it: the AIM's state space is the set of all feasible joint action sequences of length K. By way of example, consider the game of Roshambo, or rock-paper-scissors (Figure 1), and assume that o is a single agent and has K = 1, meaning that it acts entirely based on the immediately previous joint action. Let the current state be (R, P), meaning that on the previous action, i selected R and o selected P. Assume that from that state, o plays actions R, P and S with probability 0.25, 0.25, and 0.5 respectively. When i chooses to take action S in state (R, P), the probabilities of transitioning to states (S, R), (S, P) and (S, S) are then 0.25, 0.25 and 0.5 respectively. Transitions to states that have a different action for i, such as (R, R), have probability 0. The reward obtained by i when it transitions to state (S, R) is -1, and so on.
The optimal policy of the MDP associated with the AIM is the optimal policy for playing against o. A policy that achieves an expected return within ǫ of the expected return achieved by the optimal policy is called an ǫ-optimal policy (the corresponding return is called the ǫ-optimal return). If π is known, then we can compute the optimal policy (and hence an ǫ-optimal policy) by dynamic programming [9]. However, we do not assume that π or even K are known in advance: they need to be learned in online play. We use the discounted payoff criterion in our computation of an ǫ-optimal policy, with γ denoting the discount factor.
Finally, it is important to note that there exist opponents in the literature which do not allow convergence to the optimal policy once a certain set of moves has been played. For example, the grim-trigger opponent in the well-known Prisoner's Dilemma (PD) game, an opponent with memory size 1, plays cooperate at first, but then plays defect forever once the other agent has played defect once. Thus, there is no way of detecting its strategy without defecting, after which it is impossible to recover to the optimal strategy of mutual cooperation.
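To make the AIM construction concrete, the following sketch builds the induced transition distribution for the memory-1 rock-paper-scissors opponent of the example above. Only the example's numbers are taken from the text; the dictionaries and function names are our own:

```python
# Rock-paper-scissors payoff for agent i (keyed by (i's action, o's action)).
PAYOFF = {
    ("R", "R"): 0, ("R", "P"): -1, ("R", "S"): 1,
    ("P", "R"): 1, ("P", "P"): 0,  ("P", "S"): -1,
    ("S", "R"): -1, ("S", "P"): 1, ("S", "S"): 0,
}

# Opponent strategy pi for a K = 1 opponent: maps the previous joint action
# (the AIM state) to a distribution over o's next action.  Only the example's
# state (R, P) is filled in here.
OPPONENT_PI = {("R", "P"): {"R": 0.25, "P": 0.25, "S": 0.5}}

def aim_transitions(state, my_action):
    """Return {next_state: probability} for the adversary-induced MDP.

    The next AIM state is (my_action, o_action); its probability is what
    pi(state) assigns to o_action.  The reward is read off the payoff table
    for the resulting joint action.
    """
    return {(my_action, o_action): p
            for o_action, p in OPPONENT_PI[state].items()}

# aim_transitions(("R", "P"), "S") == {("S","R"): 0.25, ("S","P"): 0.25, ("S","S"): 0.5}
# and PAYOFF[("S", "R")] == -1, matching the example in the text.
```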
In our analysis, we constrain the class of adaptive opponents to include only those which do not negate the possibility of convergence to optimal exploitation, given any arbitrary initial sequence of exploratory moves [7].

      R     P     S
R    0,0  −1,1  1,−1
P   1,−1   0,0  −1,1
S   −1,1  1,−1   0,0
(R = Rock, P = Paper, S = Scissors.) Partial transition function for state (R, P) and action S: to (S, R) with probability 0.25, to (S, P) with probability 0.25, and to (S, S) with probability 0.5.
Figure 1: Example of AIM

Equipped with the required concepts, we are now ready to specify our algorithms. First, in Section 3, we present an algorithm that only guarantees safety and targeted optimality against adaptive opponents. Then, in Section 4, we introduce the full-blown CMLeS algorithm that additionally incorporates convergence.

3. MODEL LEARNING WITH SAFETY
In this section, we introduce a novel algorithm, Model Learning with Safety (MLeS), that ensures safety and targeted optimality against adaptive opponents.

3.1 Overview
MLeS begins with the hypothesis that the opponent is an adaptive opponent (denoted as o) with an unknown memory size, K, that is bounded above by a known value, Kmax. MLeS maintains a model for each possible value of o's memory size, from k = 0 to Kmax, plus one additional model for memory size Kmax+1. Each model π̂k is a mapping A^{nk} ↦ ∆A representing a possible o strategy. π̂k is the maximum likelihood distribution based on the observed actions played by o for each joint history of size k encountered. Henceforth we will refer to a joint history of size k as sk, and to the empirical distribution captured by π̂k for sk as π̂k(sk). π̂k(sk, ao) will denote the probability assigned to action ao by π̂k(sk). When a particular sk is encountered and o's action in the next step is observed, the empirical distribution π̂k(sk) is updated. Such updates happen for every π̂k, on every step. For every sk, MLeS maintains a count value v(sk), which is the number of times sk has been encountered. We call an opponent model an ǫ approximation of π when, for any history of size K, it predicts the true opponent action distribution with error at most ǫ.
On each step, MLeS selects π̂best (and correspondingly kbest) as the one from among the Kmax+1 models (from 0 to Kmax) that currently appears to best describe o's behavior. The mechanism for selecting π̂best ensures that, with high probability, it is either π̂K (the most compact representation of π) or a model with a smaller k which is a good approximation of π. Once such a π̂best is picked, MLeS takes a step towards learning an ǫ-optimal policy for the underlying AIM induced by kbest. If it cannot determine such a π̂best, it defaults to playing the maximin strategy for safety.
Thus, the operations performed by MLeS on each step can be summarized as follows:
1. Update all models based on the past step.
2. Determine π̂best (and hence kbest). If a π̂best cannot be determined, then return null.
3. If π̂best ≠ null, take a step towards solving the reinforcement learning (RL) problem for the AIM induced by kbest. Otherwise, play the maximin strategy.
Of these three steps, step 2 is by far the most complex. We present how MLeS addresses it next.

3.2 Model selection
The objective of MLeS is to find a kbest which is either K (the true memory size) or a suboptimal k s.t. π̂k is a good approximation of π (o's true policy). It does so by comparing models of increasing size to determine at which point the larger models cease to become more predictive of o's behavior. We start by proposing a metric called ∆k, which is an estimate of how much the models π̂k and π̂k+1 differ from each other. But first, we introduce two notations that will be instrumental in explaining the metric. We denote by (ai, ao)·sk a joint history of size k+1 that has sk as its last k joint actions and (ai, ao) as the last, (k+1)'th, joint action. For any sk, we define a set Aug(sk) = ∪_{∀ai,ao∈A²} {(ai, ao)·sk | v((ai, ao)·sk) > 0}. In other words, Aug(sk) contains all joint histories of size k+1 which have sk as their last k joint actions and have been visited at least once. ∆k is then defined as max_{sk, sk+1∈Aug(sk), ao∈A} |π̂k(sk, ao) − π̂k+1(sk+1, ao)|. We say that π̂k and π̂k+1 are ∆k distant from one another.
Based on the concept of ∆k, we make two observations that will come in handy for our theoretical claims made later in this subsection.
Observation 1. For all k ∈ [K, Kmax], k ∈ N, and for any k-sized joint history sk and any sk+1 ∈ Aug(sk), E(π̂k(sk)) = E(π̂k+1(sk+1)). Hence E(∆k) = 0.
Let sK be the last K joint actions in sk and sk+1. π̂k(sk) and π̂k+1(sk+1) represent draws from the same fixed distribution π(sK). So their expectations will always be equal to π(sK). This is because o just looks at the most recent K joint actions in its history to decide on its next-step action.
Observation 2. For k < K, k ∈ N, ∆k is a random variable with 0 ≤ E(∆k) ≤ 1.
In this case, in the computation of π̂k(sk), the draws can come from different distributions. This is because k < K and there is no guarantee of stationarity of π̂k(sk). Thus, ∆k can be any arbitrary random variable with an expected value between 0 and 1.
High-level idea: Alg. 1 presents how MLeS selects kbest. We denote the current values of π̂k and ∆k at time t as π̂k^t and ∆k^t respectively.
Definition 1. {σk^t}_{t∈1,2,...} is a sequence of real numbers, unique to each k, s.t. it satisfies the following:
1. it is a positive decreasing sequence, tending to 0 as t → ∞;
2. for a fixed high probability ρ > 0 and for k ∈ [K, Kmax], Pr(∆k^t < σk^t) > ρ.
The reason for choosing such a {σk^t}_{t∈1,2,...} sequence for each k will become clear over the next two paragraphs. Later, we will show how we compute the σk^t's. MLeS iterates over values of k starting from 0 to Kmax and picks the minimum k s.t. for all k ≤ k′ ≤ Kmax, the condition ∆k′^t < σk′^t is satisfied (steps 3-11).
For k < K, there is no guarantee that ∆k will tend to 0 as t → ∞ (Observation 2). More often than not, ∆k will
tend to a positive value quickly. On the other hand, σk^t → 0 as t → ∞ (condition 1 of Definition 1). This leads to one of the following two cases:
1) σk^t becomes ≤ ∆k^t and step 6 of Alg. 1 holds, thus rejecting k as a possible candidate for selection.
2) k gets selected. However, then we are sure that π̂k^t is no more than Σ_{k≤k′<K} σk′^t distant from π̂K^t (the best model of π we have at present). With increasingly many time steps, π̂k^t needs to be an increasingly better approximation of π̂K^t to keep getting selected.
For k ≥ K, all ∆k^t's → 0 as t → ∞ (Observation 1). Since for all k ≥ K: Pr(∆k^t < σk^t) > ρ (condition 2 of Definition 1), K gets selected with a high probability ρ^{Kmax−K+1}. A model with memory size more than K is selected with probability at most (1 − ρ^{Kmax−K+1}), which is a small value.

Algorithm 1: Find-Model
output: kbest, π̂best
1  kbest ← −1, π̂best ← null
2  for all 0 ≤ k ≤ Kmax, compute ∆k^t and σk^t
3  for 0 ≤ k ≤ Kmax do
4      flag ← true
5      for k ≤ k′ ≤ Kmax do
6          if ∆k′^t ≥ σk′^t then
7              flag ← false
8              break
9      if flag then
10         kbest ← k; π̂best ← π̂k^t
11         break
12 return kbest and π̂best

We now address the final part of Alg. 1 that we have yet to specify: setting the σk^t's (step 2).
Choosing σk^t: In the computation of ∆k^t, MLeS chooses a specific sk^t from the set of all possible joint histories of size k, a specific sk+1^t from Aug(sk^t), and an action ao^t, for which the models π̂k^t and π̂k+1^t differ maximally on that particular time step. So, … number of visits to any member from Aug(sk). So,

Pr(|π̂k+1^t(sk+1^t, ao^t) − E(π̂k+1^t(sk+1^t, ao^t))| < σk^t) > √ρ    (5)
⟹ Pr(∆k^t < σk^t) > ρ

The problem now boils down to selecting a suitable σk^t s.t. Inequality 5 is satisfied. Hoeffding's inequality gives us an upper bound for σk^t in Inequality 5. Using that upper bound and solving for σk^t, we get

σk^t = √( (1 / (2 v(sk+1^t))) ln( 2 / (1 − √ρ) ) ).

So, in general, for each k ∈ [0, Kmax], the σk^t value is set as above. Note that v(sk+1^t) is the number of visits to the specific sk+1^t chosen for the computation of ∆k^t. Setting σk^t as above satisfies both of the conditions specified in Definition 1. Condition 1 follows implicitly since, in infinite play, the action selection mechanism ensures infinite visits to all joint histories of a finite length.
Theoretical underpinnings: Now we state our main theoretical result regarding model selection.
Lemma 3.1. After all feasible joint histories of size K+1 have been visited ((K+1) / (2ǫ²)) ln( 2 / (1 − √ρ) ) times, then with probability at least ρ^{Kmax+2}, the π̂best returned by Alg. 1 is an ǫ approximation of π. ρ is the fixed high probability value from Condition 2 of Definition 1.
Proof. When all k < K have been rejected, Alg. 1 selects K with probability at least ρ^{Kmax−K+1}. If p is the probability of selecting any k < K as kbest, the probability of selecting any k ≤ K as kbest is then at least p + (1 − p)ρ^{Kmax−K+1} > ρ^{Kmax−K+1} > ρ^{Kmax+1}. If kbest = K, then we know that ∆K^t < σK^t. So from Inequality 4,

Pr(|π̂K^t(sK^t, ao^t) − π(sK^t, ao^t)| < σK^t) > √ρ
⟹ Pr(|π̂K^t(sK^t, ao^t) − π(sK^t, ao^t)| < σK^t) > ρ

sK^t and ao^t are the respective joint history of size K and action for which the models π̂K^t and π̂K+1^t maximally differ at t. So in this case, with probability ρ, π̂best is a σK^t approximation of π. In similar fashion it can be shown that if kbest < K, then with probability ρ, π̂best is a Σ_{k≤k′≤K} σk′^t
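Algorithm 1 together with the choice of σk^t above can be sketched compactly in code. The following is our own rendering under simplified data structures (dictionaries of empirical distributions and visit counts), not the authors' implementation:

```python
import math

def sigma(visits, rho):
    """sigma_k^t from Hoeffding's inequality: sqrt(ln(2 / (1 - sqrt(rho))) / (2 v))."""
    if visits == 0:
        return float("inf")
    return math.sqrt(math.log(2.0 / (1.0 - math.sqrt(rho))) / (2.0 * visits))

def delta_k(model_k, model_k1, aug):
    """Largest disagreement between pi_hat_k and pi_hat_{k+1} over visited
    augmented histories; aug maps each s_k to its visited extensions s_{k+1}."""
    worst = 0.0
    for s_k, extensions in aug.items():
        for s_k1 in extensions:
            for a_o, p in model_k[s_k].items():
                worst = max(worst, abs(p - model_k1[s_k1].get(a_o, 0.0)))
    return worst

def find_model(deltas, sigmas, k_max):
    """Pick the smallest k such that delta_{k'} < sigma_{k'} for every k <= k' <= k_max;
    return -1 ('null') if no k qualifies, mirroring Algorithm 1 (Find-Model)."""
    for k in range(k_max + 1):
        if all(deltas[kp] < sigmas[kp] for kp in range(k, k_max + 1)):
            return k
    return -1
```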
reinforcement learning problem of computing a near-optimal Theorem 3.2. For any arbitrary ǫ > 0 and δ > 0, MLeS
policy for that AIM. In order to solve this RL problem, with probability at least 1-δ, achieves at least within ǫ + L(δ)
MLeS uses the variant of the R-Max algorithm that does of the expected value of the best response against any adap-
not assume that the mixing time of the underlying MDP tive opponent, in number of time steps polynomial in 1ǫ ,
is known [3]. R-Max is a model based RL algorithm that ln( 1δ ) and λ−Size(K+1) .
converges to playing an ǫ-optimal policy for an MDP with
probability 1-δ, in time complexity polynomial in 1ǫ , ln( 1δ ), Against an arbitrary o, our claims rely on o not behaving as
and the state space size of the MDP. A separate instantia- a Kmax adaptive opponent in the limit. This means ∆Kmax
tion of the R-Max algorithm is maintained for each of the tends to a positive value, as t → ∞. Alg. 1 returns π̂best
possible Kmax +1 AIMs pertaining to the possible memory as null in the limit, with probability 1. MLeS will then
sizes of o, i.e, M0 , M1 , . . . , MKmax . On each step, based on subsequently converge to playing the maximin strategy, thus
the kbest returned, the R-Max instance for the AIM Mkbest ensuring safety.
is selected to take an action.
The steps that ensure targeted optimality against adap- 4. CONVERGENCE AND MODEL LEARN-
tive opponents are then as follows: ING WITH SAFETY
1. First, ensure that √Alg. 1 keeps returning aǫ(1−γ)
kbest ≤ K with In this section we build on MLeS to introduce a novel MAL
a high probability 1 − δ s.t. π̂best is an 2Size(K) approxi- algorithm for an arbitrary repeated game which achieves
mation of π. The conditions for that to happen are given by safety, targeted optimality, and convergence, as defined in
Lemma 3.1. Playing optimally against such an approxima- Section 1. We call our algorithm, Convergence with Model
tion of π, guarantees an 2ǫ -optimal payoff against o (Lemma Learning and Safety: (CMLeS). CMLeS begins by testing
4 of [3]). Thus an 2ǫ optimal policy for such a model will the opponents to see if they are also running CMLeS (self-
guarantee an ǫ-optimal payoff against o. play); when not, it uses MLeS as a subroutine.
2. Once such √ a kbest ≤ K is selected by Alg. 1 with a high
probability
√ 1 − δ on every step, then with a probability 4.1 Overview
1 − δ, converge to playing an 2ǫ optimal policy for Mkbest . CMLeS (Alg. 2) can be tuned to converge to any Nash
In order to achieve that, the R-Max instantiation for Mkbest equilibrium of the repeated game in self-play. Here, for the
will require a certain fixed number of visits to every joint his- sake of clarity, we present a variant which converges to the
tory of size kbest . Since the kbest selected by Alg. 1 is at most single stage Nash equilibrium. This equilibrium also has
K with a high probability , a sufficient number of visits to the advantage of being the easiest of all Nash equilibria to
every joint history of size K will suffice convergence to an 2ǫ compute and hence has historically been the preferred solu-
optimal policy. tion concept in multiagent learning [2, 5]. The extension of
It can be shown that our R-Max-based action selection CMLeS to allow for convergence to other Nash equilibria is
strategy implicitly achieves both of above steps in number straightforward, only requiring keeping track of the proba-
of time steps polynomial in 1ǫ , ln( 1δ ) and λ−Size(K+1) . Note, bility distribution for every conditional strategy present in
we do not have the ability to take samples at will from dif- the specification of the equilibrium.
ferent histories, but may need to follow a chain of different Steps 1 - 2: Like Awesome, we assume that all agents
histories to get a sample pertaining to one history. In the have access to a Nash equilibrium solver and they compute
worst case, the chain can be the full set of all histories, the same Nash equilibrium profile. If there are finitely many
with each transition occurring with λ. Hence the unavoid- equilibria, then this assumption can be lifted with each agent
able dependence on λ−Size(K+1) , in time complexity. The choosing randomly an equilibrium profile, so that there is a
bounds we provide are extremely pessimistic and likely to non-zero probability that the computed equilibrium coin-
be tractable against most opponents. For example against cides.
opponents which only condition on MLeS’s recent history of Steps 3 - 4: The algorithm maintains a null hypothesis
actions, λ−Size(K+1) dependency gets replaced by a depen- that all agents are playing equilibrium (AAP E). The hy-
dency over just |A|K+1 . pothesis is not rejected unless the algorithm is certain with
So far what we have shown is that MLeS, with a high prob- probability 1 that the other agents are not playing CMLeS.
ability 1-δ on each step, converges to playing an ǫ-optimal τ keeps count of the number of times the algorithm reaches
policy. It is important to note that, acting in this fash- step 4.
ion does not guarantee it a return that is 1-δ times the ǫ- Steps 5 - 8 (Same as Awesome): Whenever the algo-
optimal return. However, we can compute an upper bound rithm reaches step 5, it plays the equilibrium strategy for
on the loss and show that the loss is extremely small for a fixed number of episodes, Nτ . It keeps a running esti-
small values of δ. Let rt be the random variable that de- mate of the empirical distribution of actions played by all
notes the reward obtained on time step t by following the agents, including itself, during this run. At step 8, if for
ǫ-optimal policy. P The maximum loss incurred P is t : |(1 − any agent j, the empirical distribution φτj differs from πj∗
δ) ∞ t ∞ t
<| ∞
− δ)t E(rt )|P
P
t=0 γ E(rt ) − t=0 γ (1P t=0 γ E(rt ) − by at least ǫτe , AAP E is set to false. The CMLeS agent
∞ t t ∞ t ∞ t
(1 − δ)t | ≤
P
t=0 γ (1 − δ) E(rt )| ≤ | t=0 γ − t=0 γ has reason to believe that j may not be playing the same
γδ
(1−γ)(1−γ(1−δ))
. In the above computation, we assume that algorithm. {ǫτe }τ ∈1,2,... represents a decreasing sequence of
whenever MLeS does not play the ǫ-optimal policy, it gets positive numbers converging to 0 in the limit. Similarly
the minimum reward of 0. We denote this loss as L(δ), since {Nτ }τ ∈1,2,... represents an increasing sequence of positive
it is a function of δ (γ being fixed). Note that L(δ) can be numbers converging to infinity in the limit. The ǫτe and Nτ
made extremely small by selecting a very small δ. values for each τ are assigned in a similar fashion to Awe-
This brings us to our main theorem regarding MLeS. some (Definition 4 of [5]).
Steps 10 - 20: Once AAP E is set to false, the algorithm
Algorithm 2: CMLeS Proof of 2. This part of the proof follows from Hoeffd-
input : n, τ = 0 ing’s inequality. CMLeS reaches step 22 with a probability
1 for ∀j ∈ {1, 2, . . . , n} do at least δ in τ polynomial in κ1 and ln( 1δ ), where κ is the
2 πj∗ ← ComputeNashEquilibriumStrategy() maximum probability that any agent assigns to any action
3 AAP E ← true other than ao for a recent Kmax joint history of all agents
4 while AAP E do playing ao .
5 for Nτ rounds do
∗
6 Play πself Theorem 4.2. In self-play, CMLeS converges to playing
7 for each agent j update φτj the Nash equilibrium of the repeated game, with probability
8 recompute AAP E using the φτj ’s and πj∗ ’s 1.
9 if AAP E is false then Proof. We prove the theorem by proving the following:
10 if τ = 0 then 1) In self-play, every time after AAP E is set to false, there
11 Play ao , Kmax +1 times
is a non-zero probability that AAP E is never set to false
12 else if τ = 1 then again.
13 Play ao , Kmax times followed by a
2) If AAP E is never set to false again, then CMLeS con-
14 random action other than ao
verges to the Nash equilibrium with probability 1.
15 else The proof of (1) follows by similar reasoning as in Awe-
16 Play ao , Kmax +1 times
some (Theorem 3 of [5]). If AAP E is never set to false, then
17 if any other agent plays differently then all agents must be playing CMLeS (From Theorem 4.1). As
18 AAP E ← f alse
Nτ approaches ∞, φτj approaches πj∗ . So the agents con-
19 else verge to playing the Nash equilibrium with probability 1 in
20 AAP E ← true
the limit.
21 τ ←τ +1
22 Play MLeS 5. RESULTS
We now present empirical results that supplement the the-
oretical claims. We focus on how efficiently CMLeS models
goes through a series of steps in which it checks whether the adaptive opponents in comparison to existing algorithms,
other agents are really CMLeS agents. The details are ex- PCM(A) and Awesome. For CMLeS, we set ǫ = 0.1, δ =
plained below when we describe the convergence properties 0.01 and Kmax = 10. To make the comparison fair with
of CMLeS (Theorem 4.1). PCM(A), we use the same values of ǫ and δ and always in-
Step 22: When the algorithm reaches here, it is sure (proba- clude the respective opponent in the target set of PCM(A).
bility 1) that the other agents are not CMLeS agents. Hence We also add an adaptive strategy with K = 10 to the target
it switches to playing MLeS. set of PCM(A), so that it needs to explore joint histories of
size 10.
4.2 Theoretical underpinnings We use the 3-player Prisoner’s Dilemma (PD) game as our
We now state our main convergence theorems. representative matrix game. The game is a 3 player version
of the N-player PD present in GAMUT.4 The adaptive op-
Theorem 4.1. CMLeS satisfies both the criteria of tar- ponent strategies we test against are :
geted optimality and safety. 1. Type 1: every other player plays defect if in the last 5
steps CMLeS played defect even once. Otherwise, they play
Proof. To prove the theorem, we need to prove:
cooperate. The opponents are thus deterministic adaptive
1. For opponents not themselves playing CMLeS, CMLeS
strategies with K = 5.
always reaches step 22 with some probability;
2. Type 2: every other player behaves as type-1 with 0.5
2. There exists a value of τ , for and above which, the above
probability, or else plays completely randomly. In this case,
probability is at least δ.
the opponents are stochastic with K = 5.
Proof of 1. We utilize the property that a K adaptive oppo-
The total number of joint histories of size 10 in this case
nent is also a Kmax adaptive opponent (see Observation 1).
is 810 , which makes PCM(A) highly inefficient. However,
The first time AAP E is set to false, it selects a random
CM LeS quickly figures out the true K and converges to op-
action ao and then plays it Kmax +1 times in a row. The
timal behavior in tractable number of steps. Figure 2 shows
second time when AAP E is set to false, it plays ao , Kmax
our results against these two types of opponents. The Y-
times followed by a different action. If the other agents have
axis shows the payoff of each algorithm as a fraction of the
behaved identically in both of the above situations, then
optimal payoff achievable against the respective opponent.
CMLeS knows : 1) either the rest of the agents are play-
Also plotted in the same graph, is the fraction of times CM-
ing CMLeS, or, 2) they are adaptive and plays stochasti-
LeS chooses the right memory size (denoted as convg in the
cally for a Kmax bounded memory where all agents play ao .
plot). Each plot has been averaged over 30 runs to increase
The latter observation comes in handy below. Henceforth,
robustness. Against type-1 opponents (Figure 2(i)), CMLeS
whenever AAP E is set to false, CMLeS always plays ao ,
figures out the true memory size in about 2000 steps and
Kmax +1 times in a row. Since a non-CMLeS opponent must
converges to playing optimally by 16000 episodes. Against
be stochastic (from the above observation), at some point of
type-2 opponents (Figure 2(ii)), it takes a little longer to
time, it will play a different action on the Kmax +1’th step
figure out the correct memory size (about 35000 episodes)
with a non-zero probability. CMLeS then rejects the null hy-
because in this case, the number of feasible joint histories of
pothesis that all other agents are CMLeS agents and jumps
4
to step 22. http://gamut.stanford.edu/userdoc.pdf
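Steps 10–20 of Algorithm 2 amount to a scripted handshake that only agents running CMLeS would reproduce exactly. A rough sketch, with our own function names and with all surrounding bookkeeping omitted:

```python
import random

def handshake_plan(a0, actions, k_max, tau):
    """Scripted play for the tau-th time AAPE turns false (Alg. 2, steps 10-16):
    tau = 0 or tau >= 2 -> play a0 for K_max + 1 steps;
    tau = 1             -> play a0 for K_max steps, then a random action != a0."""
    if tau == 1:
        other = random.choice([a for a in actions if a != a0])
        return [a0] * k_max + [other]
    return [a0] * (k_max + 1)

def others_follow_script(other_agent_actions, expected_script):
    """Steps 17-20: if any other agent deviates from the script, the hypothesis
    that all agents run CMLeS is rejected and CMLeS falls back to MLeS (step 22)."""
    return list(other_agent_actions) == list(expected_script)
```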
[Figure 2 (two plots): ratio of achieved to optimal payoff vs. episode, against the trigger strategy (0–20000 episodes) and against the 50% random / 50% trigger strategy (0–80000 episodes); curves: CMLeS, convg, PCM(A), AWESOME.]
Figure 2: Against adaptive opponents

size 6 is much larger. Both Awesome and PCM(A) perform much worse. PCM(A) plays a random exploration strategy until it has visited every possible joint history of size Kmax; hence it keeps getting a constant payoff during this whole exploration phase.
When Kmax was set to 4, MLeS converged to playing the maximin strategy in about 10000 episodes against both of the above opponents. The convergence part of MLeS uses the framework of Awesome, and the results closely match it.

7. REFERENCES
[1] B. Banerjee and J. Peng. Performance bounded reinforcement learning in strategic interactions. In AAAI'04: Proceedings of the 19th National Conference on Artificial Intelligence, pages 2–7. AAAI Press / The MIT Press, 2004.
[2] M. Bowling and M. Veloso. Convergence of gradient dynamics with a variable learning rate. In Proc. 18th International Conf. on Machine Learning, pages 27–34. Morgan Kaufmann, San Francisco, CA, 2001.
[3] R. I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 3:213–231, 2003.
[4] D. Chakraborty and P. Stone. Online multiagent learning against memory bounded adversaries. In ECML, pages 211–226, Antwerp, Belgium, 2008.
[5] V. Conitzer and T. Sandholm. Awesome: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. In J. Mach. Learn. Res., pages 23–43. Springer, 2006.
[6] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, pages 13–30, 1963.
[7] R. Powers and Y. Shoham. Learning against opponents with bounded memory. In IJCAI, pages 817–822, 2005.
[8] R. Powers, Y. Shoham, and T. Vu. A general criterion and an algorithmic framework for learning in multi-agent systems. Mach. Learn., 67(1-2):45–76, 2007.
[9] R. S. Sutton and A. G. Barto. Reinforcement Learning. MIT Press, 1998.
Proceedings of the AAMAS Workshop on Adaptive and Learning Agents, May 2010, Toronto, Canada
each other, even within parts of the same task. This enables the execution). This provides the imitator with a model that can make
robot to be more resistant to bad demonstrations as well as adapt- predictions about what behaviours a specific demonstrator might
able to heterogeneous demonstrators. use at a given time.
The experimental domain we use to ground this work is robotic Prior work in imitation learning has often used a series of demon-
soccer. In our evaluation, an imitating robot learns to shoot the soc- strations from demonstrators that are similar in skill level and phys-
cer ball into an open goal, from a range of demonstrators that differ iologies [8, 19]. The approach presented in this paper is designed
in size as well as physiology (humanoid vs. wheeled). While this from the bottom up to learn from multiple demonstrators that vary
problem may seem trivial to a human adult, it is quite challeng- physically, as well as in underlying control programs and skill lev-
ing to an individual that is learning about its own motion control. els.
Manoeuvring behind the soccer ball and lining it up for a kick is Some recent work in humanoid robots imitating humans has used
a difficult task for an autonomous agent to perform, even without many demonstrations, but not necessarily different demonstrators,
considering the ball’s destination - just as it would be for a young and very few have modelled each demonstrator separately. Those
child. It is also a task where it is easy to visualize a broad range that do employ different demonstrators, such as [8], often have
of skills (demonstrators that have good versus poor motor control), demonstrators of similar skills and physiologies (in this work all
and one where heterogeneity matters (that is, there are visual dif- humans performing simple drawing tasks) that also manipulate their
ferences in how physiologically-distinct robots move). environment using the same parts of their physiology as the imita-
Beyond simply improving learning, there are good application- tor (in this case the imitator was a humanoid robot learning how to
independent reasons for allowing a robot to learn from heteroge- draw letters, the demonstrators and imitators used the same hands
neous demonstrators. The time taken to create or adapt a control to draw). Inamura et al. [13, 12] use HMMs in their mimesis ar-
program for a particular robot physiology is often wasted when chitecture for imitation learning. They trained a humanoid robot to
robots are abandoned in favour of newer models, or different de- learn motions from human demonstrators, though they did not sep-
signs (e.g. switching from a wheeled robot to a one that has tank arately model or rank demonstrator skills relative to each other as
treads). A learning system needs to be able to learn from others we do in our work. Moreover, they also only use humanoid demon-
that are physiologically different than the imitator if the knowledge strators, significantly limiting heterogeneity.
of various demonstrators is to be passed on. Learning should be Nicolescu and Matarić [19] motivate the desire to have robots
robust enough to allow any type of demonstrator to work. Learning with the ability to generalize over multiple teaching experiences.
should also benefit correspondingly from a heterogeneous breadth They explain that the quality of a teacher’s demonstration and par-
of demonstrators. It may be possible to discover and adapt ele- ticularities of the environment can prevent the imitator from learn-
ments of a performance by a physically distinct demonstrator that ing from a single trial. They also note that multiple trials help to
have not yet been exploited by demonstrators of the same physi- identify important parts of a task, but point out that repeated obser-
ology, for example. Further, imitating robots that can learn from vations of irrelevant steps can cause the imitator to learn undesir-
any type of demonstrator can also learn from robots that developed able behaviours. They do not implement any method of modelling
their control programs through imitation. Imitation can therefore individual demonstrators, or try to evaluate demonstrator skill lev-
provide a mechanism for passing down knowledge between gener- els as our work does. By ranking demonstrators relative to each
ations of robots. other and mixing the best elements from among all demonstrators,
we believe that our system can minimize the behaviours it learns
2. RELATED WORK that contain irrelevant steps.
A number of imitation learning approaches have influenced this
work. Demiris and Hayes [10] developed a computational model 3. METHODOLOGY
based on the phenomenon of body babbling, where babies prac- The robots used in this work can be seen in Figure 1. The robot
tice movement through self-generated activity [17]. Demiris and imitator, a two-wheeled robot built from a Lego Mindstorms kit, is
Hayes [10] devised their system using forward models to predict on the far left. One of the three robot types used for demonstrators
the outcomes of the imitator’s behaviours, in order to find the best is physically identical (i.e. homogeneous) to the imitator, in order
match to an observed demonstrator’s behaviour. A forward model to provide a baseline to compare how well the imitator learns from
takes as input the state of the environment and a control command heterogeneous demonstrators. Two demonstrators that are hetero-
that is to be applied. Using this information, the forward model geneous along different dimensions are also employed. The first is
predicts the next state and outputs it. In their implementation, the a humanoid robot based on a Bioloid kit, using a cellphone for vi-
effects of all the behaviours are predicted and compared to the ac- sion and processing. The choice of a humanoid was made because
tual demonstrator’s state at the next time step. Each behaviour has it provides an extremely different physiology from the imitator in
an error signal that is then used to update its confidence that it can terms of how motions made by the robot appear visually. Both the
match that particular demonstrator behaviour. Our work differs in differences in outcomes of individual actions, as well as the visual
that we use forward models to model entire behaviour repertoires appearance produced by the additional motions necessary for hu-
of demonstrators, not individual behaviours. manoid balancing should be a significant challenge to a framework
Demiris and Hayes [10] use one forward model for each be- for imitation learning in terms of adapting to heterogeneity. The
haviour, which is then refined based on how accurately the forward third demonstrator type is a two-wheeled Citizen Eco-Be (version
model predicts the behaviour’s outcome. By using many of these I) robot which is about 1/10 the size of the imitator. This was
forward models, Demiris and Hayes construct a repertoire of be- chosen because while its physiology is similar, the large size dif-
haviours with predictive capabilities. In contrast, the forward mod- ference (accompanied by significant variation in how long it takes
els in our framework model the repertoire of individual demonstra- the robot to move the same distance) makes for a different extreme
tors (instead of having an individual forward model for each be- of heterogeneity than the challenge presented by a humanoid robot.
haviour), and contain individual behaviours learned from specific The imitation learning robot observes one demonstrator at a time,
demonstrators within them (the behaviours can still predict their with the demonstrated task being that of shooting a ball into an
effects on the environment, but these effects are not refined during empty goal, similar to a penalty kick in soccer. This task should
Figure 1: Two views of the heterogeneous robots used in this
work (a standard ballpoint pen is used to give a rough illus-
tration of scale). The right side of the image shows the robots
with visual markers in place to allow motion to be tracked by a
global vision system.
Figure 2: Imitation Learning Architecture
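The forward-model idea discussed in this section — take the environment state and a control command, predict the next state, and keep a per-behaviour confidence that is adjusted from prediction error, as in Demiris and Hayes [10] — can be summarized in a small interface. Everything below (class and method names, the initial confidence of 0.5, the update rule) is our own illustrative assumption, not the authors' code:

```python
class BehaviourForwardModel:
    """One forward model per demonstrator, holding that demonstrator's
    behaviour repertoire and a confidence value for each behaviour."""

    def __init__(self, behaviours):
        # behaviours: mapping name -> callable(state, command) -> predicted next state
        self.behaviours = dict(behaviours)
        self.confidence = {name: 0.5 for name in self.behaviours}

    def predict(self, state, command):
        """Predict the next state under every behaviour in the repertoire."""
        return {name: b(state, command) for name, b in self.behaviours.items()}

    def update_confidence(self, predictions, observed_next_state, rate=0.1):
        """Nudge each behaviour's confidence up or down according to how well it
        predicted the demonstrator's actual next state (its error signal)."""
        for name, predicted in predictions.items():
            error = 0.0 if predicted == observed_next_state else 1.0
            self.confidence[name] += rate * ((1.0 - error) - self.confidence[name])
```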
behaviours are added to the forward model representing the imita-
tor, as seen in Figure 3. Each model representing a demonstrator
is then used to process each demonstration from all demonstrators
one last time before the forward model does its processing. Essen-
tially this is the stage where the imitator is using the forward models
representing each of its demonstrators to predict what each individ-
ual demonstrator would do in the current situation. By the time all
the forward models representing all the demonstrators are trained,
the model representing the imitator has a number of additional be-
haviours in its repertoire as a result of this process, and serves as
a generalized predictive model of all useful activity obtained from
all demonstrators. Finally, the imitator does the processing of each
demonstration using the candidate behaviours added by the forward
models for the demonstrators as shown in Figure 4. The same pro-
cess of behaviour proposal and decay described earlier allows the
imitator to keep some demonstrator behaviours, and discard oth-
ers, while also learning new behaviours of its own as a result of the
common behaviours extrapolated from multiple demonstrators.
To model the relative skill levels of the demonstrators in our sys-
tem, each of the demonstrator forward models maintains a demon-
strator specific learning rate, the learning preference (LP). The learn-
ing preference is analogous to how people favour certain teachers,
and tend to learn more from these preferred teachers. A higher LP
indicates that a demonstrator is more skilled than its peers, so be-
haviours should be learned from it at a faster rate than a demonstra-
tor with a lower LP. The LP is used as a weight when updating the
frequency of two behaviours or primitives occurring in sequence.
The LP of a demonstrator begins at the half way point between
the minimum (0) and maximum (1) values. When updating the fre-
quencies of sequentially occurring behaviours, a minimum increase
in frequency (referred to as minFreq in equation 1) is preserved (a
value of 0.05, obtained during experimentation), to ensure that a
forward model for a demonstrator that has an LP of 0 does not
stagnate. The forward model for a given demonstrator would still
update frequencies, albeit more slowly than if its LP was above 0.
Equation 2 shows the decay step, where the decay rate is equal to
1 − LP and the decayStep is a constant (0.007 was used in our
experiments).
LP = LP ± lpShapeAmount (3)
The LP of a demonstrator is increased if one of its behaviours
results in the demonstrator (ordered from highest LP increase to
lowest): scoring a goal, moving the ball closer to the goal, or mov-
ing closer to the ball. The LP of a demonstrator is decreased if
the opposite of these criteria results from one of the demonstra-
tor’s behaviours. Equation 3 shows the update step, where lpSha-
peAmount is either a constant (0.001) if the LP is adjusted by the
non-criteria factors, or plus or minus 0.01 for a behaviour that re-
sults in scoring a correct/incorrect goal, 0.005 for moving the ball
closer to the goal, or 0.002 for moving the robot closer to the ball.
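The LP update of Equation 3, with the shaping amounts just quoted, can be written as a small rule. The event labels and the clamping to [0, 1] are our own assumptions; the constants are the ones stated in the text:

```python
# Shaping amounts for Equation 3: LP <- LP +/- lpShapeAmount.
LP_SHAPE = {
    "scored_goal": 0.01,           # correct goal: +0.01; incorrect/own goal: -0.01
    "ball_closer_to_goal": 0.005,
    "robot_closer_to_ball": 0.002,
    "other": 0.001,                # non-criteria adjustments use the 0.001 constant
}

def update_learning_preference(lp, event, positive=True):
    """Shift a demonstrator's learning preference after one of its behaviours,
    keeping LP inside its stated [0, 1] range (clamping assumed, not stated)."""
    amount = LP_SHAPE.get(event, LP_SHAPE["other"])
    lp = lp + amount if positive else lp - amount
    return max(0.0, min(1.0, lp))
```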
These criteria are obviously domain-specific, and are used to shape
the learning (a technique that has been shown to be effective in other domains [14]) in our system to speed up the imitator's learning. Though this may seem like pure reinforcement learning, these criteria do not directly influence which behaviours are saved and which behaviours are deleted. The criteria merely influence the LP of a demonstrator, affecting how much the imitator will learn

Figure 4: In the final phase of training, all demonstrations are first passed to the demonstrator models to elicit any candidate behaviour nominations before the forward model for the imitator processes the demonstration.
Demonstrator Goals Scored Wrong Goals Scored
RC2004 27 4
Citizen 15 3
Bioloid 12 1
Table 1: The number of goals and wrong goals scored for each demonstrator.

Figure 5: Field configurations. The demonstrator is represented by a square with a line that indicates the robot's orientation. The target goal is indicated by a black rectangle; the demonstrator's own goal is white.

random order. All of the forward models for each demonstrator predict and update their models at this time, one step ahead of the
forward model for the imitator. This is done to allow each forward
model a chance to nominate additional candidate behaviours rele-
from that particular demonstrator. Dependence on these criteria vant to the current demonstration instance, to the forward model
was minimized so that future work (such as learning the criteria for the imitator.
from demonstrators) can remove them entirely. The total number of goals each demonstrator scored during all
When the learning process is complete, the imitator is left with a final forward model that it can use as a basis for performing the tasks it has learned from the demonstrators.

Figure 4: In the final phase of training, all demonstrations are first passed to the demonstrator models to elicit any candidate behaviour nominations before the forward model for the imitator processes the demonstration.

4. EXPERIMENTAL RESULTS
To evaluate this approach in a heterogeneous setting, we employed the robots previously shown in Figure 1 to gather demonstrations. Each of the robots used in these experiments was controlled using its own behaviour-based control system - since the work presented here focusses on overcoming differences in physiology, all of the robots used code that was developed for robotic soccer competitions, and all would be considered expert demonstrations. The Bioloid and Lego Mindstorms robots were demonstrated on a 1020 x 810 cm field, while the Citizen was demonstrated on a 56 x 34.5 cm field (this was because the size difference of the robot made for significant battery power issues given the distances covered on the large size field). The ball used by the Bioloid and Lego Mindstorms robots was 10 centimeters in diameter, while the ball used by the Citizen robot was 2.5 centimeters in diameter.

We limited the positions to the two field configurations shown in Figure 5. In the configuration on the left, the demonstrator is positioned for a direct approach to the ball. As a more challenging scenario, we also used a more degenerate configuration (on the right), where the demonstrator is positioned for a direct approach to the ball, but the ball is lined up to its own goal – risking putting the ball in one's own net while manoeuvring, and requiring a greater field distance to traverse with the ball.

Figure 5: Field configurations. The demonstrator is represented by a square with a line that indicates the robot's orientation. The target goal is indicated by a black rectangle, the demonstrator's own goal is white.

The individual demonstrators were recorded by the Ergo global vision system [3] while they performed 25 goal kicks for each of the two field configurations. The global vision system continually captures the x and y motion and orientation of the demonstrating robot and the ball. The demonstrations were filtered manually for simple vision problems such as when the vision server was unable to track the robot, or when the robot broke down (falls/loses power). The individual demonstrations were considered complete when the ball or robot left the field.

One learning trial consists of each forward model representing a given demonstrator training on the full set of kick demonstrations for that particular demonstrator, presented in random order. Once the forward models representing each demonstrator are trained, the forward model representing the imitator begins training. At this point all the forward models for the demonstrators have been trained on their own data, and have provided the forward model representing the imitator with candidate behaviours. The forward model for the imitator then processes all the demonstrations for each of the two field configurations (a total of 150 attempted goal kicks) in random order. All of the forward models for each demonstrator predict and update their models at this time, one step ahead of the forward model for the imitator. This is done to allow each forward model a chance to nominate additional candidate behaviours, relevant to the current demonstration instance, to the forward model for the imitator.

The total number of goals each demonstrator scored during all 50 of their individual demonstrations is given in Table 1.

Demonstrator   Goals Scored   Wrong Goals Scored
RC2004         27             4
Citizen        15             3
Bioloid        12             1
Table 1: The number of goals and wrong goals scored for each demonstrator.

To determine if the order in which an imitator using our approach is exposed to the various demonstrators - specifically, the degree of heterogeneity - had any effect on its learning, we chose to order the demonstrators in two ways. The first is in order of similarity to the imitator. In this ordering, the MindStorms robot demonstrator (labelled RC2004 here because its expert-level control code was from our small-sized team at RoboCup-2004) is first, then the Citizen demonstrator (which is much smaller than the imitator, but still a two-wheeled robot), and finally the Bioloid demonstrator. The shorthand we have adopted for this ordering is RCB. The second ordering is the reverse of the first, that is, in order of most physical differences from the imitator. The second ordering is thus Bioloid, Citizen, RC2004, or BCR for short. The orderings determine when each set of training data is used to train the demonstrator forward models (starting with the first demonstrator's set in the order). The imitator's final model follows the same ordering when passing each set of demonstrations to the demonstrator forward models during the final training phase described in Section 3.

For each of the two orderings, we ran 100 trials. The results of the forward model training processes using the RCB and BCR demonstrator orderings are presented here. All the following data has been averaged over 100 trials. Though the resulting LPs of the demonstrators are not given in this paper, all three demonstrators in both orderings ended up with LPs (with a range of 0 to 1) close to the maximum (over 0.95 on average). In our experiments on differing skill levels, poor demonstrators had LPs of approximately 0.25, while average demonstrators had LPs of approximately 0.5. These results indicate that the LP is accurately judging all demonstrators to be skilled, even those that have different physiologies from the imitator.

Figures 6 and 7 show results for the number of behaviours created and deleted for each of the forward models representing the given demonstrators, with the two orderings for comparison purposes and standard deviations given above each bar. It can be seen that the RCB and BCR demonstration orderings do not affect the number of behaviours created or deleted from any of the forward models. The forward models representing the Bioloid demonstrator can be seen to create many more behaviours than the other forward models (and have a higher standard deviation), but they also end up deleting many more than the others. The vast difference in physiology from the other two-wheeled robots causes the forward models representing the humanoid to build many behaviours in an attempt to match the visual outcome of the Bioloid's demonstrations. When trying to use those behaviours to predict the outcome of the other two-wheeled robot demonstrators, they do not match frequently enough (i.e. they are not a useful basis for imitation),
Figure 6: The number of behaviours created, comparing RCB and BCR demonstrator orderings. Corresponding standard deviations are given at the top of each bar.

Figure 8: The number of permanent behaviours in each forward model, comparing RCB and BCR demonstrator orderings. Corresponding standard deviations are given at the top of each bar.
Figure 10: The number of candidate behaviours not moved to the forward model representing the imitator, because they were already there, comparing RCB and BCR demonstrator orderings. Corresponding standard deviations are given at the top of each bar.

Figure 11: The number of candidate behaviours that earned permanency after being moved to the forward model representing the imitator, comparing RCB and BCR demonstrator orderings. Corresponding standard deviations are given at the top of each bar.

Demonstrator Ordering   Goals Scored   Wrong Goals Scored
RCB                     11             9
BCR                     7              13
Table 2: The number of goals and wrong goals scored for two imitators trained with the different demonstrator orderings.

ordering than the RCB ordering. This is likely due to the number of candidate behaviours that get rejected because they already exist in the forward model representing the imitator, but the standard deviation could also explain this. The results for duplicate candidate behaviours can be seen in Figure 10. The forward models for the RC2004 demonstrator have fewer duplicates rejected when they are first (RCB), as is true for the forward models for the Bioloid when the Bioloid is first (BCR). In both cases, however, this is not much of a difference given the standard deviations involved. The forward models for both demonstrators appear to have more candidate behaviours rejected when they are last in the ordering, but this could also be explained simply by the standard deviations involved. Again, the forward models representing the Citizen demonstrators are less affected by ordering, as they appear in the middle both times.

Similar results are found when looking at the number of candidate behaviours that achieve permanency to the forward model representing the imitator. In Figure 11, the forward models representing the RC2004 and Bioloid demonstrators can be seen to have more behaviours made permanent to the forward model for the imitator when their demonstrations appear first in the ordering, but this is explainable by considering the standard deviation. The forward models representing the two-wheeled demonstrators seem to have an advantage in the number of their candidate behaviours becoming permanent to the forward model for the imitator over the forward models representing the Bioloid demonstrator. This is likely due to the same reasons of physiology discussed when looking at the number of behaviours created and deleted by each of the forward models.

To evaluate the performance of the imitators trained using this approach, we selected two imitators from the learning trials evaluated in this section at random (one from the RCB training order, and one from the BCR order). We used the forward models to control (as described in Section 3) the Lego Mindstorms robots and recorded them in exactly the same way that we recorded the demonstrators, for 25 shots on goal in each of the two field configurations (Figure 5) for a total of 50 trials. Table 2 shows the results of these penalty kick attempts by the two imitators trained using our framework. The orderings do not show a significant difference. The main reason the final trained imitator did not perform objectively better is that the current behaviour being executed was not stopped if another behaviour became more applicable during its execution. This caused the imitator, when demonstrating its skills, to stick to a chosen behaviour, even if using that behaviour resulted in poor results. That is, it is a flaw in the learner demonstrating its skills, not in its learning. The only time the execution of a behaviour was stopped was when the primitives it was about to execute predicted that it might move the imitator off the field, since the imitator was not tasked with learning behaviours that kept it on the field. In the end, the results from heterogeneous demonstrators were comparable to those using only homogeneous demonstrators of the same skill level [1], which is itself very positive because of the additional difficulties involved with heterogeneity.

5. CONCLUSION
We have presented the results and analysis of the experiments used to evaluate our approach to developing an imitation learning architecture that can learn from multiple demonstrators of varying physiologies and skill levels. The results in Section 4 show that our approach can be used to learn from demonstrators that have heterogeneous physiologies. The humanoid demonstrator was not learned from as much as the two-wheeled robots that had similar physiologies to the imitator, but the imitator still learned approximately 12% of its permanent behaviours from the Bioloid, as seen in Figures 11 and 8. The Citizen robot was nearly as effective a demonstrator as the RC2004 robot. This is somewhat surprising, as the RC2004 robot has an identical physiology to the imitator,
while the Citizen robot is approximately 1/10 the imitator's size. This could be partially due to the fact that the Citizen robot has the same limited command set as the imitator, compared to the vastly expanded set of primitive commands available to the RC2004. The Citizen moves much slower due to its size, so the demonstration conversion process must have compensated substantially to give the Citizen demonstrator results so close to the RC2004 demonstrator. Size differences, apparently, are easier to compensate for than differences in physiology, at least to the degree of the differences between wheeled and humanoid robots.

The results presented in Section 4 also show that this framework is not affected drastically by the order that demonstrators are presented to the forward models. There are some effects from candidate behaviours being rejected if a forward model for a given demonstrator is the last to be trained, since the other forward models have already had a chance to get their candidate behaviours added, increasing the chances of duplicates. In practice this does not seem to adversely affect the LP of any of the forward models, and so the order of demonstrations is mostly negligible.

Our forward models, when used as control systems, did not perform as well as the expert demonstrators, but they were still able to control the imitator adequately. The main focus of our research was in developing an imitation learning architecture that could learn from multiple demonstrators of varying physiologies and skill levels. The results of the conversion processes, the predictions, and the influence that the LP of a forward model for a given demonstrator has on what an imitator learns from that particular demonstrator all indicate that the learning architecture we have devised is capable of properly modelling relative demonstrator skill levels as well as learning from physiologically distinct demonstrators. A stronger focus on the refinement of behaviour preconditions and control, similar to the work of Demiris and Hayes [10], could make our entire system more robust.

6. REFERENCES
[1] Allen, J. Imitation learning from multiple demonstrators using global vision. Master's thesis, Department of Computer Science, University of Manitoba, Winnipeg, Canada, August 2009.
[2] Anderson, J., Tanner, B., and Baltes, J. Reinforcement learning from teammates of varying skill in robotic soccer. In Proceedings of the 2004 FIRA Robot World Congress (Busan, Korea, October 2004), FIRA.
[3] Baltes, J., and Anderson, J. Intelligent global vision for teams of mobile robots. In Mobile Robots: Perception & Navigation, S. Kolski, Ed. Advanced Robotic Systems International/pro literatur Verlag, Vienna, Austria, 2007, ch. 9, pp. 165–186.
[4] Billard, A., and Matarić, M. J. A biologically inspired robotic model for learning by imitation. In Proceedings of Autonomous Agents 2000 (Barcelona, Spain, June 2000), pp. 373–380.
[5] Breazeal, C., and Scassellati, B. Challenges in building robots that imitate people. In Imitation in Animals and Artifacts, K. Dautenhahn and C. Nehaniv, Eds. MIT Press, 2002, pp. 363–390.
[6] Breazeal, C., and Scassellati, B. Robots that imitate humans. Trends in Cognitive Sciences 6, 11 (2002), 481–487.
[7] Calderon, C. A., and Hu, H. Goal and actions: Learning by imitation. In Proceedings of the AISB '03 Second International Symposium on Imitation in Animals and Artifacts (Aberystwyth, Wales, 2003), pp. 179–182.
[8] Calinon, S., and Billard, A. Learning of Gestures by Imitation in a Humanoid Robot. In Imitation and Social Learning in Robots, Humans and Animals: Behavioural, Social and Communicative Dimensions, K. Dautenhahn and C. Nehaniv, Eds. Cambridge University Press, 2007, pp. 153–177.
[9] Dearden, A., and Demiris, Y. Learning forward models for robots. In Proceedings of IJCAI-05 (Edinburgh, Scotland, August 2005), pp. 1440–1445.
[10] Demiris, J., and Hayes, G. Imitation as a dual-route process featuring predictive and learning components: A biologically plausible computational model. In Imitation in Animals and Artifacts, K. Dautenhahn and C. Nehaniv, Eds. MIT Press, 2002, pp. 327–361.
[11] Hartigan, J. A., and Wong, M. A. Algorithm AS 136: A k-means clustering algorithm. Applied Statistics 28, 1 (1979), 100–108.
[12] Inamura, T., Nakamura, Y., Toshima, I., and Tanie, H. Embodied symbol emergence based on mimesis theory. International Journal of Robotics Research 23, 4 (2004), 363–377.
[13] Inamura, T., Toshima, I., and Nakamura, Y. Acquiring motion elements for bidirectional computation of motion recognition and generation. In Proceedings of the International Symposium on Experimental Robotics (ISER) (Sant'Angelo d'Ischia, Italy, July 2002), pp. 357–366.
[14] Matarić, M. J. Reinforcement learning in the multi-robot domain. Autonomous Robots 4, 1 (1997), 73–83.
[15] Matarić, M. J. Getting humanoids to move and imitate. IEEE Intelligent Systems (July 2000), 18–24.
[16] Matarić, M. J. Sensory-motor primitives as a basis for imitation: linking perception to action and biology to robotics. In Imitation in Animals and Artifacts, K. Dautenhahn and C. Nehaniv, Eds. MIT Press, 2002, pp. 391–422.
[17] Meltzoff, A. N., and Moore, M. K. Explaining facial imitation: A theoretical model. In Early Development and Parenting, vol. 6. John Wiley and Sons, Ltd., 1997, pp. 179–192.
[18] Nehaniv, C. L., and Dautenhahn, K. Of hummingbirds and helicopters: An algebraic framework for interdisciplinary studies of imitation and its applications. In Interdisciplinary Approaches to Robot Learning, J. Demiris and A. Birk, Eds., vol. 24. World Scientific Press, 2000, pp. 136–161.
[19] Nicolescu, M., and Matarić, M. J. Natural methods for robot task learning: Instructive demonstration, generalization and practice. In Proceedings of AAMAS-2003 (Melbourne, Australia, July 2003), pp. 241–248.
[20] Rabiner, L., and Juang, B. An introduction to Hidden Markov Models. IEEE ASSP Magazine (1986), 4–16.
[21] Riley, P., and Veloso, M. Coaching a simulated soccer team by opponent model recognition. In Proceedings of the Fifth International Conference on Autonomous Agents (May 2001), pp. 155–156.
Proceedings of the AAMAS Workshop on Adaptive and Learning Agents, May 2010, Toronto, Canada
Our research presented here is the first step of a bigger ongoing project on the use of domain knowledge and the analysis of the suitability of reward shaping in KeepAway and in multi-agent RL in general.

The paper is organised as follows. Section 2 presents a more detailed introduction to reinforcement learning and Section 3 introduces reward shaping. The subsequent section introduces RoboCup Soccer and the problem of learning takers in that domain. Next, Section 5 discusses our approach to learning takers with reward shaping. Details of the experimental evaluation are in Section 6 and the obtained results are collected and discussed in Section 7. The final section concludes the paper.

2. REINFORCEMENT LEARNING AND MARKOV DECISION PROCESSES
Reinforcement learning is a paradigm which allows agents to learn by reward and punishment from interactions with the environment [19]. The numeric feedback received from the environment is used to improve the agent's actions. The majority of work in the area of reinforcement learning (RL) applies a Markov Decision Process as a mathematical model [13].

A Markov Decision Process (MDP) is a tuple ⟨S, A, T, R⟩, where S is the state space, A is the action space, T(s, a, s′) = Pr(s′|s, a) is the probability that action a in state s will lead to state s′, and R(s, a, s′) is the immediate reward received when action a taken in state s results in a transition to state s′. The problem of solving an MDP is to find a policy (i.e., a mapping from states to actions) which maximises the accumulated reward. When the environment dynamics (transition probabilities and the reward function) are available, this task can be solved using iterative approaches like policy and value iteration [2].

MDPs constitute a modelling framework for RL agents whose goal is to learn an optimal policy when the environment dynamics are not available and, thus, value iteration cannot be used. However, the concept of an iterative approach in itself is the backbone of the majority of RL algorithms. These algorithms apply so-called temporal-difference updates to propagate information about the values of states, V(s), or state-action pairs, Q(s, a) [18]. These updates are based on the difference of two temporally different estimates of a particular state or state-action value. The SARSA algorithm is such a method [19]. After each real transition, (s, a) → (s′, r), in the environment, it updates state-action values by the formula:

Q(s, a) ← Q(s, a) + α[r + γQ(s′, a′) − Q(s, a)].  (1)

It modifies the value of taking action a in state s when, after executing this action, the environment returned reward r, moved to a new state s′, and action a′ was chosen in state s′.

3. REWARD SHAPING
The immediate reward r in the update rule given by Equation 1 represents the feedback from the environment. The idea of reward shaping is to provide an additional reward which will improve the convergence of the learning agent with regard to learning speed, the quality of the final solution, or both [11, 14]. This concept can be represented by the following formula for the SARSA algorithm:

Q(s, a) ← Q(s, a) + α[r + F(s, a, s′) + γQ(s′, a′) − Q(s, a)],  (2)

where F(s, a, s′) is the general form of the shaping reward.

Even though reward shaping has been powerful in many experiments, it quickly turned out that, when used improperly, it can also be very misleading [14]. To deal with such problems, potential-based reward shaping was proposed [11] as the difference of some potential function Φ defined over a source state s and a destination state s′:

F(s, s′) = γΦ(s′) − Φ(s),  (3)

where γ is the discount factor. When the potential function Φ(s) is a function of states only, actions can be omitted in F, yielding F : S × S → R and F(s, s′). Ng et al. [11] proved that reward shaping defined in this way, that is, according to Equation 3, guarantees learning a policy which is equivalent to the one learned without reward shaping when the same heuristic knowledge represented by Φ(s) were used directly to initialise the value function. This is an important fact, because when function approximation is used in big environments, where the structural properties of the state space are not clear, it is not easy to initialise the value function. Reward shaping represents a flexible and theoretically correct method to incorporate background knowledge into RL algorithms. Its properties have been proven for RL in both infinite- and finite-horizon MDPs [11]. It was, however, indicated in [5] that the standard formulation of potential-based reward shaping according to [11] can fail in domains with multiple goals. One of the solutions suggested in [5] to overcome this problem is to use F(·, ·, g) = 0 for each goal state g ∈ G.

When the shaping reward is computed according to Equation 3, the application of reward shaping reduces to the problem of how to define the potential function Φ(s). In this paper, we address this issue in a novel context of multi-agent learning (details in Section 5) and evaluate it in the RoboCup KeepAway domain [15, 16], which is introduced in the next section. Our experiments apply function approximation to represent the value function in this task. Even though with function approximation the optimal policy might not be representable, our application of potential-based reward shaping is still valid and justified. Potential-based reward shaping guarantees that the new MDP with a modified reward function has the same solution as the original MDP solved by reinforcement learning without reward shaping. Therefore the learning problem remains the same, allowing the methods presented here to be applied with or without an approximate function representation.

The work of Ng et al. [11] formally specified requirements on reward shaping. The idea of giving an additional external reward was investigated by numerous researchers before that. For example, interesting observations on the behaviour and problems of reward shaping were reported in [14] and were then influential in the formalisation of the potential-based reward function. Another early work suggesting progress estimators, which also resembles the idea of the potential function, was presented in [9].
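As a concrete reading of Equations 1–3, the sketch below shows a single tabular SARSA update with an optional potential-based shaping term. The tabular Q dictionary and the placeholder potential function are our own assumptions for illustration; this is not the implementation used in the experiments reported here.

```python
# Illustrative SARSA update with potential-based reward shaping (Eqs. 1-3).
from collections import defaultdict

Q = defaultdict(float)      # Q[(state, action)] -> estimated value
alpha, gamma = 0.1, 1.0     # learning rate and discount factor (placeholders)

def phi(state):
    """Heuristic potential over states; domain knowledge would go here."""
    return 0.0

def shaping_reward(s, s_next):
    """F(s, s') = gamma * phi(s') - phi(s), as in Equation 3."""
    return gamma * phi(s_next) - phi(s)

def sarsa_update(s, a, r, s_next, a_next, shaped=True):
    """One update step: Equation 2 when shaped=True, Equation 1 otherwise."""
    f = shaping_reward(s, s_next) if shaped else 0.0
    td_target = r + f + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```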
4. MULTI-AGENT LEARNING IN ROBOCUP SOCCER
RoboCup is an international project (see http://www.robocup.org/ for more information) which aims at providing an experimental framework in which various technologies can be integrated and evaluated. The overall research challenge is to create humanoid robots which would play at human masters level. Since the full game of soccer is complex, researchers have developed several simulated environments which can be used to evaluate techniques for specific sub-problems. One of such sub-problems is the KeepAway task [15, 16] (see http://userweb.cs.utexas.edu/~AustinVilla/sim/Keepaway/ for more information).
In this task (see Figure 1), N players (keepers) learn how to keep the ball when attacked by N − 1 takers and when playing within a small area of the football pitch, which makes the problem more difficult.

Figure 1: Snapshot of a 3 vs. 2 KeepAway game.

This task is multi-agent [21] in its nature; however, most research has focused on learning one specific behaviour at a time. Overall, there are three types of high level behaviour in this task. The behaviour of the agents trying to take the ball is one of these high level behaviours, but let us first consider the agents trying to maintain possession of the ball: the keepers.

For keepers there are two distinct situations: either the keeper has possession of the ball or it does not. If it is not in possession of the ball, a keeper executes a fixed hand-coded policy which directs it to be in a position convenient to receive the ball from the keeper which is. The third behaviour is that of the keeper in possession of the ball. Previous work has attempted to learn this behaviour using reinforcement learning whilst the takers adhere to a hand-coded policy [15, 16, 4].

However, in this work we are interested in learning the behaviour of the takers and so the keepers shall now follow a hand-coded policy both with and without the ball. The hand-coded behaviour of a keeper with the ball was originally specified in [16] and has since been used in other work on learning takers [7, 10].

Although the work of [15, 16, 4] has multiple agents learning during each episode, the implementation is not a true multi-agent learning example. At any one time only one agent is learning, namely the keeper in possession of the ball, and all other agents are following fixed hand-coded policies. Therefore the agent is in effect learning within a static environment. However, when we consider takers with the ability to learn, the problem becomes multi-agent as all takers learn simultaneously.

Previous attempts to learn the behaviour of takers proved relatively successful [7, 10] and were a useful resource when attempting to develop novel approaches. In [7] the first basic learning taker was developed using SARSA reinforcement learning with tile coding to decide the action of a taker every 15 cycles. This work emphasised that allowing a taker to decide an action every cycle caused indecisiveness in the agent because the short time elapsed between decisions did not allow adequate time for the true benefit or cost of an action to be realised. In experiments allowing decisions to be made every cycle, takers oscillate between decisions, causing poor performance.

This observation was again witnessed by [10], who noted that updates at any interval between 15 and 40 had comparable results but intervals larger than 60 or less than 10 were largely unsuccessful. To avoid this hesitation they chose to switch from the SARSA algorithm for reinforcement learning to the Advantage(λ) Learning algorithm. In their work a significant improvement in performance was seen when takers learnt every cycle by Advantage(λ) Learning instead of infrequently by SARSA. However, the comparison is not complete as takers using Advantage(λ) Learning also implemented a more advanced function approximation technique. Also, the two bodies of work [7, 10] cannot be directly compared as they used different state representations.

These two papers appear to encompass the entirety of currently published work in this problem domain. However, there still remains large room for improvement in the development of a learning taker. The more challenging a taker can become, the more it will challenge researchers interested in learning the behaviours of keepers. The work we have undertaken has resulted in takers performing significantly better than the performances reported in both these papers, against the same opposing keepers in games with the same set-up and in games more challenging to the takers.

This problem domain also provides a suitable test bed for other, more generally applicable research into multi-agent reinforcement learning. Given the learning taker we have developed, we were able to then expand upon the basic implementation and incorporate three novel approaches to potential-based reward shaping in a multi-agent context.

5. PROPOSED METHOD
In this section we provide more details on our learning takers and the reward shaping techniques used. In our investigation we compare the performance of RL takers without reward shaping (the base learner) to takers using one of three types of reward shaping detailed below.

5.1 Base Learner
Our base learning taker combines the work of both previous papers [7, 10] on learning takers in KeepAway. As in both these papers, the takers can on each update choose either to tackle the keeper with the ball or mark any of the remaining keepers. To tackle a keeper, the taker runs directly to the keeper currently in possession of the ball. To mark a keeper, the taker moves close to the keeper, trying to maintain its position on the intercepting line of a straight pass of the ball from its current location to the current location of the keeper.

To learn when to perform these actions we use the SARSA algorithm with tile coding, as in [7]. Then from [10] we use the state representation (originally suggested in [16]) and the reward function: -1 for every cycle the episode continues to run and +10 for ending the episode. Given the observations made by both papers we update only after every 15 cycles.

Figure 2: State Representation of Base Learner (from [10]).
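The following skeleton reflects our reading of the base learner just described: SARSA over the tackle/mark actions, a reward of -1 per cycle and +10 for ending the episode, and decisions taken every 15 cycles. The environment interface, the epsilon-greedy policy, and the crude feature discretisation (standing in for the actual tile coder) are assumptions made for this sketch only.

```python
# Sketch of the base learning taker (our reading of the description above).
import random
from collections import defaultdict

ACTIONS = ["tackle", "mark_keeper_2", "mark_keeper_3"]   # 3 vs. 2 KeepAway
Q = defaultdict(float)
alpha, gamma, epsilon = 0.125, 1.0, 0.01
DECISION_INTERVAL = 15            # cycles between action decisions / updates

def discretise(features):
    """Crude stand-in for the tile-coding function approximator."""
    return tuple(round(f) for f in features)

def choose_action(state):
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def decision_point(env, state, action):
    """Run the chosen action for 15 cycles, then apply one SARSA update."""
    reward, done = 0, False
    for _ in range(DECISION_INTERVAL):
        features, done = env.step(action)     # hypothetical environment API
        reward += 10 if done else -1          # +10 ends the episode, -1 otherwise
        if done:
            break
    next_state = discretise(features)
    next_action = choose_action(next_state)
    target = reward if done else reward + gamma * Q[(next_state, next_action)]
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    return next_state, next_action, done
```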
We chose the state representation from [10] as opposed to the one used by [7], because the latter represented fewer observations of the environment and we expected this to limit the performance of learning takers. Our experiments presented below show that the more detailed state representation, from [10] and illustrated in Figure 2, improves the performance of the basic learning taker, supporting our expectations and providing a useful comparison to the more novel approaches we take in the following two subsections.

5.2 Simple Reward Shaping
Our first extension is to apply a simple potential-based reward shaping function to the existing base agent to incorporate prior domain knowledge into the agent. It is expected that, given this knowledge, the agent will converge quicker to an equal or better performance than the base learning approach alone. This agent is intended to show that the use of reward shaping is both applicable and beneficial in multi-agent reinforcement learning.

Specifically, the domain knowledge we have applied states that takers can improve their performance by separating and marking different players. By following this principle, they are able to limit the passing options of the keepers and reduce the time the keepers maintain possession.

We have implemented a reward shaping function that encourages separation by adding the change in distance between the takers to the reward they receive from the basic learning algorithm. Assuming our domain knowledge is correct, the addition of this potential-based function will ensure better co-operation between the agents developed than those following the hand-coded policy. Also, although this knowledge could be learned by the base learner, the new agent will know from the beginning to attempt to separate and so will converge quicker.

5.3 Heterogeneous Shaping
In experiments with the previous agent, based upon simple reward shaping, all taker agents will be equivalent or homogeneous. A more interesting problem is that of heterogeneous agents, whereby different agents co-operating on the same team combine different skills to outperform their homogeneous counterparts [1].

Given the previous hypothesis, that takers sticking together is detrimental to performance, more complex prior domain knowledge can be incorporated, stating that it is beneficial for one taker to tackle and another to fall back and mark.

In effect, this new domain knowledge defines two roles: one of a tackling taker and one of a marking taker. We thus use heterogeneous reward shaping to encourage these roles in the learning takers, which to our knowledge is a novel idea. By rewarding one taker for choosing a tackling action when previously choosing a marking action, and punishing it when it changes from choosing tackling to now marking, the agent will be encouraged to tackle. A similar approach reversing the punishment and reward will then encourage the other taker to mark.

These roles, however, are not hard-coded; we are not limiting the action choices available to the takers. Both takers can still choose either to mark or tackle, and reinforcement learning will still have them explore the use of both action choices. Therefore in extreme cases when it is necessary for the marking agent to tackle, it will still make the correct decision and tackle, but in general it will choose to mark as the reward shaping function applied will make this appear more lucrative.

Therefore, it is hoped that these two roles are beneficial to winning possession. If they are, then the agent will converge quicker to an equal or better performance than the base learner, because the takers without reward shaping will have to learn these roles themselves. However, if these roles are not beneficial, the takers will still be able to learn an equal policy as the roles are not enforced but merely encouraged.

The successful application of this reward shaping will illustrate the potential benefits of using heterogeneous reward shaping in multi-agent systems to encourage roles.

5.4 Combining Shaping Functions
Finally, we have also considered the incorporation of both pieces of domain knowledge into one team of takers. This way the takers can be encouraged to take roles but also consider the benefit of separating.

When combining shaping functions it is important that each is scaled individually, because calculating the potential difference of both states and scaling the sum would give a different meaning to the resultant reward shaping; it would not accurately represent the domain knowledge intended. Therefore the potential-based reward shaping function changes from Equation 3 given in Section 3 to:

F(s, s′) = τ1(γΦ1(s′) − Φ1(s)) + τ2(γΦ2(s′) − Φ2(s)),  (4)

where γ is the discount factor, Φ1 and Φ2 are the potential functions and τ1 and τ2 are two separate scaling factors.

Given that our motive is to publicise the use of heterogeneous reward shaping for encouraging roles, our scaling will emphasise the heterogeneous reward shaping function. This agent will still include the separation-based reward shaping function, but by scaling the function appropriately it will have less of an impact on the resultant behaviour than the encouragement to take up a specific role.

It is expected that, as this agent will benefit from both pieces of domain knowledge, this will be our best performing agent and as such will be a beneficial contribution to the RoboCup KeepAway research field.

6. EXPERIMENTAL DESIGN
The experiments undergone were performed in RoboCup Soccer Simulator v11.1.0 compiled against RoboCup Soccer Simulator Base Code v11.1.0. The KeepAway player code used was keepaway-player v0.6. Keepers were based upon the hand-coded policy publicly available in this release and takers were based upon our own extensions to this base player.

For takers both with and without reward shaping, the SARSA algorithm of reinforcement learning was used with the parameters α = 0.125, γ = 1.0 and ε = 0.01. For function approximation a tile coding function with 13 groups of 32 single-dimension tilings was used. All takers used one group per each feature in the observation and split angles into ten degree intervals and distances into three meter intervals.

Experiments were performed on pitches of sizes 20×20, 30×30, 40×40, and 50×50 meters. These values were chosen to show the performance of our takers in similar contexts to previous work on learning the behaviour of takers and also in more complex problem domains.

The addition of reward shaping functions must be scaled to maximise the performance of the respective agents. The value of these scaling factors was found through experimental testing, therefore they may not be the optimal settings. However, they are sufficient to show the improvement in performance the methods are capable of. For the simple reward shaping agent the value of separation was doubled before being added to the basic reward function.

For the heterogeneous shaping approach, agents were either rewarded or penalised by 5 for changing their action from marking to tackling and vice versa.
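To illustrate Equations 3 and 4 and the role encouragement of Section 5.3, the sketch below combines a separation potential with a role-based adjustment. The separation potential, the state encoding, and the default scaling values are placeholders of our own; only the doubling of the separation term, the ±5 (±10 when combined) role adjustment, and the form of Equation 4 come from the text.

```python
# Sketch of the shaping terms used by the taker variants described above.
gamma = 1.0

def phi_separation(state):
    """Potential encouraging takers to spread out (distance between takers)."""
    return state["distance_between_takers"]   # placeholder state encoding

def simple_shaping(s, s_next, scale=2.0):
    """Equation 3 with the separation potential; the value was doubled."""
    return scale * (gamma * phi_separation(s_next) - phi_separation(s))

def combined_shaping(s, s_next, phi1, phi2, tau1, tau2):
    """Equation 4: two potential functions, each scaled individually."""
    return (tau1 * (gamma * phi1(s_next) - phi1(s))
            + tau2 * (gamma * phi2(s_next) - phi2(s)))

def role_shaping(prev_action, action, encouraged="tackle", amount=5.0):
    """Heterogeneous shaping (Section 5.3): reward a switch towards the
    encouraged role, penalise a switch away from it, 0 if unchanged.
    Use amount=10.0 for the combined agent."""
    if prev_action == action:
        return 0.0
    return amount if action == encouraged else -amount
```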
When combining shaping functions we wanted to emphasise the heterogeneous knowledge, and so for changing their action these takers were either rewarded or penalised by 10 and for separation

7. RESULTS
Experiments on the simplest domains were relatively unhelpful. All agents converged quickly to good results with little variation between the approaches used. For both pitches of size 20x20 and 30x30, illustrated in Figures 3 and 4, it is important to consider that both axes represent small changes in time in their given dimension and the differences between agents are both brief and insignificantly small (only 0.4 seconds for pitch size 20x20). When taking into consideration the statistical variation between samples, no one agent is seen to reliably be significantly better than any other.

Figure 4: Takers v KA06 at 30x30.

Figure 5: Takers v KA06 at 40x40.
ceive the initial performance improvement from exploiting the domain knowledge but later, instead of being hindered by this flawed knowledge, would gain the benefit of exploring all potential decisions and so match the superior converged performance of the base learner.

learner and the simplest reward shaping function. Also, all subsequent increases in the complexity of domain knowledge applied through reward shaping result in a significant increase in performance. These initial gains in performance are highlighted for clarity in Figure 7.

Figure 7: Takers v KA06 at 40x40 with Initial Performance Highlighted.

Finally, with regard to the 40x40 problem domain, it is supportive of our method to note that the slight difference in performance previously noticed in the takers' performances after convergence is contradicted by Figure 6. Although the average performance of the base learner is lower than all of the takers using reward shaping, the upper bounds of variation in this result are higher than the lower bounds of all takers using reward shaping and equivalent to the average of some. Therefore the one benefit of using the basic learner instead of incorporating domain knowledge, namely the perceived improvement in performance at the time of convergence, is not statistically significant.

Figure 9: Takers v KA06 at 50x50 with Performance Variation Illustrated.

The results of Figures 8 and 9 further support the conclusions made thus far. As previously seen in the change from pitch sizes of 30x30 to 40x40, there is a significant rise in difficulty when increasing the pitch size from 40x40 to 50x50. Given the yet again higher difficulty, a more significant improvement can and has been witnessed when incorporating domain knowledge into multiple agents co-learning in a single system.

Firstly, there is now a significant gap between the upper bound of initial performance in takers using even just the simplest reward shaping function and the lower bound of initial performance by the base learner. On average takers benefiting from both reward
shaping functions can begin taking possession of the ball 6 seconds faster than takers not using any reward shaping. The initial gain in performance is highlighted in Figure 10.

Figure 10: Takers v KA06 at 50x50 with Initial Performance Highlighted.

This gain in performance remains roughly constant throughout the first 4 hours of training. It then begins to shrink but still outperforms the base learner for up to approximately 8 hours. Even after the first 8 hours of training, the base learner can only match the performance of the novel approaches and never significantly outperforms any of them.

Furthermore, the agents solely encouraged to take heterogeneous roles did adhere to the encouragement and after convergence were seen to almost exclusively stick to their assigned roles. The exceptions were the specific contexts in which it was more beneficial to performance to ignore the encouraged role and make a non-characteristic action decision. By using RL with reward shaping to encourage roles, these deviations from the encouraged role were possible, whereas an agent with enforced roles would not have learnt to, nor been able to, exploit these specific contexts.

Finally, a closing note of some interest. It appears, in particular in Figure 9 but also to a degree in Figure 6, that the variation in performances achieved is notably smaller in agents making use of the heterogeneous reward shaping. These results are not broad enough to make any firm conclusions at this time, but it would be interesting to explore this observation further in a deeper study of the heterogeneous reward shaping approach for multi-agent systems. It may be that this is characteristic of the encouragement of roles; however, it may also be the case that this is simply an artifact of this specific piece of domain knowledge in this particular problem domain.

8. CONCLUSION
In conclusion, we have demonstrated the applicability and benefits of using potential-based reward shaping in multi-agent reinforcement learning. By incorporating domain knowledge in an agent's design, the agent can converge quicker to an equal or superior policy than agents learning by reinforcement alone.

The results documented here are a first step in demonstrating the potential benefit of heterogeneous reward shaping to encourage roles. We have successfully designed heterogeneous agents that co-operate in a multi-agent system to outperform agents using either homogeneous reward shaping functions or incorporating no domain knowledge. By encouraging roles through reward shaping, as opposed to enforcing them through hard-coded limitation of either actions or state representations, agents can choose to exploit the given domain knowledge, and so benefit from fast convergence rates, but can also still choose to explore, allowing the discovery of optimal policies where they diverge in specific contexts from their encouraged roles.

Although the specific reward shaping functions implemented have used domain-specific knowledge, the types of domain knowledge represented are generally applicable. The knowledge that takers should try to stay separate is an example of knowledge regarding how agents should maintain states relative to each other. Maintaining a state relative to either team-mates or opponents is a common type of knowledge applicable in many multi-agent systems. For example, it has been shown in the predator/prey problem domain that it is beneficial for predators to consider the relative location of their supporting predator to aid co-ordination [20]. Similarly, having one tackler and one marker is specific to takers in KeepAway, but the knowledge that agents should specialise into roles is common in multi-agent systems. For example, again in the predator/prey problem domain, it has been shown that it is beneficial to have one predator take a hunting role and another take a scouting role [20]. Therefore the use of reward shaping, whether homogeneous, heterogeneous or combined, could be applied in general to any multi-agent system that would benefit from agents having these types of knowledge, with the expected benefits being similar to those documented in the KeepAway domain.

Finally, our last contribution is that of the taker learning with the combined domain knowledge of both encouraging separation and roles. This taker has the best currently published performance of any taker in the RoboCup KeepAway problem domain.

We intend to continue this work along the following avenues. Firstly, we believe that there is the potential to apply similar reward shaping functions to the keepers in a true multi-agent learning domain. Recent work [8] has expanded the keepers to learn both whilst on and off the ball. Currently, despite simultaneous learning, the behaviour of keepers with the ball is very different to that of those without the ball. However, the application of a separation-based reward shaping function could be adapted to further improve the performance of the keepers. This continued cycle of improving takers and then improving keepers will continue to push research efforts in this problem domain, eventually leading to research into the simultaneous learning of both keepers and takers.

A larger, more general contribution, however, would be to continue to investigate the potential of heterogeneous reward shaping in multi-agent systems. Again this could foreseeably be applied to a true multi-agent learning set of keepers, with perhaps some keepers encouraged to mislead takers by making runs off the ball and others encouraged to sneak away from markers to true open positions. Other classic multi-agent domains, such as task distribution or predator/prey, may also be interesting to study when applying this technique, to highlight its general applicability and widen the audience for this method.

9. REFERENCES
[1] T. Balch. Learning Roles: Behavioral Diversity in Robot Teams. In AAAI Workshop on Multiagent Learning, 1997.
[2] D. P. Bertsekas. Dynamic Programming and Optimal Control (2 Vol Set). Athena Scientific, 3rd edition, 2007.
[3] L. Busoniu, R. Babuska, and B. De Schutter. A Comprehensive Survey of MultiAgent Reinforcement Learning. IEEE Transactions on Systems, Man & Cybernetics, Part C: Applications and Reviews, 38(2):156, 2008.
[4] S. Devlin, M. Grześ, and D. Kudenko. Reinforcement learning in robocup keepaway with partial observability. In IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT'09), 2009.
[5] M. Grześ. Improving exploration in reinforcement learning through domain knowledge and parameter analysis. Technical report, University of York, 2010. (in preparation).
[6] M. Grześ and D. Kudenko. Plan-based reward shaping for reinforcement learning. In Proceedings of the 4th IEEE International Conference on Intelligent Systems (IS'08), pages 22–29. IEEE, 2008.
[7] A. Iscen and U. Erogul. A new perspective to the keepaway soccer: the takers. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 3, pages 1341–1344. International Foundation for Autonomous Agents and Multiagent Systems, 2008.
[8] S. Kalyanakrishnan and P. Stone. Learning complementary multiagent behaviors: a case study. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, pages 1359–1360. International Foundation for Autonomous Agents and Multiagent Systems, 2009.
[9] M. J. Mataric. Reward functions for accelerated learning. In Proceedings of the 11th International Conference on Machine Learning, pages 181–189, 1994.
[10] H. Min, J. Zeng, J. Chen, and J. Zhu. A Study of Reinforcement Learning in a New Multiagent Domain. In IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT'08), volume 2, 2008.
[11] A. Y. Ng, D. Harada, and S. J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning, pages 278–287, 1999.
[12] J. Peters, S. Vijayakumar, and S. Schaal. Reinforcement learning for humanoid robotics. In Proceedings of Humanoids2003, Third IEEE-RAS International Conference on Humanoid Robots, 2003.
[13] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1994.
[14] J. Randløv and P. Alstrom. Learning to drive a bicycle using reinforcement learning and shaping. In Proceedings of the 15th International Conference on Machine Learning, pages 463–471, 1998.
[15] P. Stone and R. S. Sutton. Scaling reinforcement learning toward robocup soccer. In The 18th International Conference on Machine Learning, pages 537–544. Morgan Kaufmann, San Francisco, CA, 2001.
[16] P. Stone, R. S. Sutton, and G. Kuhlmann. Reinforcement learning for RoboCup-soccer keepaway. Adaptive Behavior, 13(3):165–188, 2005.
[17] R. Sutton. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. Advances in Neural Information Processing Systems, pages 1038–1044, 1996.
[18] R. S. Sutton. Temporal credit assignment in reinforcement learning. PhD thesis, Department of Computer Science, University of Massachusetts, Amherst, 1984.
[19] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[20] M. Tan. Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents. In Proceedings of the Tenth International Conference on Machine Learning, volume 337, 1993.
[21] M. Wooldridge. An Introduction to MultiAgent Systems. John Wiley and Sons, 2002.
Proceedings of the AAMAS Workshop on Adaptive and Learning Agents, May 2010, Toronto, Canada
Learn to Behave!
Rapid Training of Behavior Automata
The learning domain for an HFA behavior can obviously be complex and of high dimensionality, depending on the number of basic behaviors and the dimensionality of the agent's feature vector. This in turn can require a large number of training sessions to adequately describe the domain. It is not reasonable to expect a demonstrator to perform that many training sessions, and so it is important to reduce the domain space complexity or training difficulty. We have done this in three ways:

• An HFA encourages task decomposition. Rather than learn one large behavior, the system may be trained on simpler behaviors, which are then composed into a higher-level learned behavior. This essentially projects the full learning space into multiple lower-dimensional spaces.

• Feature vector reduction. Our system allows the user to specify precisely those features he feels are necessary for a given learned HFA, which in turn dramatically reduces the learning space. Each HFA, including lower-level HFAs, may have its own different reduced feature vector.

• Generalization by parametrization. All behaviors, including HFAs themselves, may be parameterized with targets: for example, rather than create a behavior go-to-home-base, we can create a general behavior go-to(A), and allow for higher-level behaviors to specify the meaning of the target A at a future time. This can significantly reduce the number of behaviors which must be trained (a brief sketch of this idea follows below).

By employing these complexity-reduction measures, our system ideally enables the rapid construction of complex behaviors, with internal state and a variety of sensor features, in real time entirely by training from demonstration.
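As a small illustration of the parametrization point above, a behavior such as go-to(A) can be written once with an unbound target and bound later by a higher-level behavior. The class and function names below are ours, not the system's API; this is only a sketch of the idea.

```python
# Illustrative parameterized behavior: go_to(A) is defined once and its
# target is bound only when a higher-level behavior needs it.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Behavior:
    name: str
    act: Callable[[Dict[str, float]], str]   # feature vector -> low-level command

def go_to(target: str) -> Behavior:
    """Generic go-to(A); the meaning of `target` is supplied by the caller."""
    def act(features: Dict[str, float]) -> str:
        # turn until roughly facing the bound target, then drive forward
        if abs(features[f"direction-to({target})"]) > 0.1:
            return "rotate"
        return "forward"
    return Behavior(name=f"go-to({target})", act=act)

# Higher-level behaviors bind the parameter as needed:
go_home = go_to("home-base")
go_ball = go_to("ball")
```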
The remainder of the paper is laid out as follows. We begin with a discussion of related work. We then describe the basic HFA model and our approach to learning the transition functions in the automaton. We follow this with a training example of a nontrivial foraging behavior, then conclude with a discussion of future directions.

2. RELATED WORK
Our approach generally fits under the category of learning from demonstration [3], an overall term for training agent actions by having a human demonstrator perform the action on behalf of the agent. Because the proper action to perform in a given situation is directly provided to the agent, this is broadly speaking a supervised learning task, though a significant body of research in the topic actually involves reinforcement learning, whereby the demonstrator's actions are converted into a reinforcement signal from which the agent is expected to derive a policy. The lion's share of learning from demonstration literature comes not from virtual or game agents but from autonomous robotics. For a large survey of the area, see [2].

Learning Plans. One learning from demonstration area, closely related to our own research, involves the learning of largely directed acyclic graphs of behaviors (essentially plans) from sequences of actions [1, 16, 18, 21], possibly augmented with sequence iteration [25]. Like our approach, these plans are often parameterizable.

Such plan networks generally have limited or no recurrence: instead they usually tend to be organized as sequences or simultaneous groups of behaviors which activate further behaviors downstream. This is mostly a feature of the problem being tackled: such plans are largely induced from ordered sequences of actions intended to produce a result. Since we are training goal-less behaviors rather than plans, our model instead assumes a rich level of recurrence; and for the same reason the specific ordering of actions is less helpful.

Learning Policies. Another large body of work in learning from demonstration involves observing a demonstrator perform various actions when in various world situations. From this the system gleans a set of ⟨situation, action⟩ tuples performed and builds a policy function π(situation) → action from these tuples. This can be tackled as a supervised learning task [4, 5, 8, 10, 12, 15]. However, some literature instead transforms the problem into a reinforcement learning task by providing the learner only with a reinforcement signal based on how closely the learned policy matches the tuples provided by the demonstrator [9, 24]. This is curious given that the problem is, in essence, supervised; the reinforcement methods are in some sense working with reduced information.

Our approach differs from these methods in an important way. Instead of learning situation→action rules, our model learns the transition functions of an HFA with predefined internal states, each corresponding to a possible basic behavior. This enables the demonstrator to differentiate transitions to new behaviors not just based on the current world situation but also the current behavior. That is, we learn rules of the form ⟨previous action, situation⟩ → action. Another, somewhat different use of internal state would be to distinguish between aliased observations of hidden world situations, something which may be accomplished through learning hidden Markov models (for example, [13]).

Hierarchical Models. The use of hierarchies in robot or agent behaviors is very old indeed, going back as early as Brooks's Subsumption Architecture [7]. Hierarchies are a natural way to achieve layered learning [22] via task decomposition. This is a common strategy to simplify the state space: see [11] for an example. While it is possible in these cases to induce the hierarchy itself, usually such methods iteratively compose hierarchies in a bottom-up fashion.

Our HFA model bears some similarity to hierarchical behavior networks such as those for virtual agents [6] or physical robots [17], in which feed-forward plans are developed, then incorporated as subunits in larger and more complex plans. In such literature, the actual application of hierarchy to learning from demonstration has been unexpectedly limited. However, learning from demonstration has been applied more extensively to multi-level reinforcement learning, as in [23], albeit with a fixed hierarchy.

Language Induction. One cannot mention learning finite state automata without noting that they have a long history in language induction and grammatical inference, with a correspondingly massive literature. For recent surveys of techniques using automata for grammar induction, see [19,
special state is the optional done state, whose behavior sim-
ply sets a done flag and immediately transitions to the start
Start state. This is used to potentially indicate to higher-level
HFAs that the behavior of the current HFA is “done”.
Always
Figure 1 shows a simple automaton with four states, corre-
Rotate Rotate sponding to the behaviors start, rotate-left, rotate-right, and
Left Right forward. It may appear at first glance that not all HFAs can
be built with this model: for example, what if there were two
states in which the rotate-left behavior needed to be done?
This can be handled by creating a simple HFA which does
Always Always
If
nothing but transition to the rotate-left state and stay there.
If
no obstacle is in front obstacle is in front This automaton is then stored as a behavior called rotate-
and or left2 and used in our HFA as an additional state, but one
obstacle is " 5.2 to left obstacle is ! 2.3 to left which performs the identical behavior to rotate-left.
with its current state, then applies the transition function to determine a new state for the next timestep, if any. When a performed behavior is itself an HFA, this operation is recursive: the child HFA likewise performs one step of its current behavior, and applies its transition function. Additionally, when an HFA transitions to a state whose behavior is an HFA, that HFA is initialized: its initial state is set to the start state, and its done flag is cleared.

Formal Model. For the purposes of this work, we define the class of hierarchical finite-state automata models H as the set of tuples ⟨S, F, T, B, M⟩ where:

• S = {S0, S1, ..., Sn} is a set of states, including a distinguished start state S0, and possibly also one done state S∗. Exactly one state is active at any time.

• F = {F1, F2, ..., Fn} is a set of observable features in the environment. The set of features is partitioned into three disjoint subsets representing categorical (C), continuous (R) and toroidal (A) features. Each Fi can assume a value fi drawn from a finite (in the case of C) or infinite (in the case of R and A) number of possible values. At any point in time, the present assumed values f⃗ = ⟨f1, f2, ..., fn⟩ for each of the F1, F2, ..., Fn are known as the environment's current feature vector.

• T : F1 × F2 × ... × Fn × S → S is a transition function which maps a given state Si, and the current feature vector ⟨f1, f2, ..., fn⟩, onto a new state Sj. The done state S∗ is the sole state which transitions to the start state S0, and does so always: for all Sk ≠ S∗ and all f⃗, T(f⃗, Sk) ≠ S0, and for all f⃗, T(f⃗, S∗) = S0.

• B = {B1, B2, ..., Bn} is a set of atomic behaviors. By default, the special behavior idle, which corresponds to inactivity, is in B, as may also be the optional behavior done.

• M : S → H ∪ B is a one-to-one mapping function of states to basic behaviors or hierarchical automata. M(S0) = idle, and M(S∗) = done. M is constrained by the stipulation that recursion is not permitted, that is, if an HFA H ∈ H contains a mapping M which maps to (among other things) a child HFA H′, then neither H′ nor any of its descendent HFAs may contain mappings which include H.

We further generalize the model by introducing free variables (G1, ..., Gn) for basic behaviors and features: these free variables are known as targets. The model remains unaltered, by replacing behaviors Bi with Bi(G1, ..., Gn) and features Fi with Fi(G1, ..., Gn). The main differences are that the evaluation of the transition function and the execution of behaviors will both be based on ground instances of the free variables.

4. LEARNING FROM DEMONSTRATION
The above mechanism is sufficient to hand-code HFA behaviors to do a variety of tasks; but our approach was meant instead to enable the learning of such tasks. Our learning algorithm presumes that the HFA has a fixed set of states, comprising the combined set of atomic behaviors and all previously learned HFAs. Thus, the learning task consists only of learning the transitions among the states: given a state and a feature vector, decide which state (drawn from a finite set) to transition to. This is an ordinary classification task. Specifically, for each state Si we must learn a classifier f⃗ → S whose attributes are the environmental features and whose classes are the various states. Once the classifiers have been learned, the HFA can then be added to our library of behaviors and itself be used as a state later on.

Because the potential number of features can be very high, and many unrelated to the task, and because we want to learn based on a very small number of samples, we wish to reduce the dimensionality of the input space to the machine learning algorithm. This is done by allowing the user to specify beforehand which features will matter to train a given behavior. For example, to learn a Figure-8 pattern around two unspecified targets A and B, the user might indicate a desire to use only four parameterized features: distance-to(A), distance-to(B), direction-to(A), and direction-to(B). During training the user temporarily binds A and B to some ground targets in the environment, but after training they are unbound again. The resulting learned behavior will itself have two parameters (A and B), which must ultimately be bound to use it in any meaningful way later on.

The training process works as follows. The HFA starts in the "start" state (idling). The user then directs the agent to perform various behaviors in the environment as time progresses. When the agent is presently performing a behavior associated with a state Si and the user chooses a new behavior associated with the state Sj, the agent transitions to this new behavior and records an example of the form ⟨Si, f⃗, Sj⟩, where f⃗ is the current feature vector. Immediately after the agent has transitioned to Sj, it turns out to be often helpful to record an additional example of the form ⟨Sj, f⃗, Sj⟩. This adds at least one "default" (that is, "keep doing state Sj") example, and is nearly always correct, since in that current world situation the user, who had just transitioned to Sj, would nearly always want to stay in Sj rather than instantaneously transition away again.

At the completion of the training session, the system then builds transition functions from the recorded examples. For each state Sk, we build a decision tree DSk based on all examples where Sk is the first element, that is, of the form ⟨Sk, f⃗, Si⟩. Here, f⃗ and Si form a data sample for the classifier: f⃗ is the input feature and Si is the desired output class. If there are no examples at all (because the user never transitioned from Sk), the transition function is simply defined as always transitioning back to Sk.

At the end of this process, our approach has built some N decision trees, one per state, which collectively form the transition function for the HFA. After training, some states will be unreachable because the user never visited them, and so no learned classification function ever mapped to them. These states may be discarded. The agent can then be left to wander about in the world on its own, using the resulting HFA.

Though in theory many classification algorithms are applicable (such as K-Nearest-Neighbor or Support Vector Machines), in our experiments we chose to use a variant of the C4.5 Decision Tree algorithm [20] for several reasons:

1. Many areas of interest in the feature space of our agent approximately take the form of rectangular regions (angles, distances, etc.).
Figure 3: Feature selection and target assignment.
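To make the training procedure described above concrete, the sketch below records ⟨state, feature vector, next state⟩ examples as the demonstrator switches behaviors and then fits one classifier per state. It is only an illustrative sketch under our own naming: scikit-learn's DecisionTreeClassifier stands in for the authors' C4.5 variant, and the class and method names are hypothetical, not the paper's implementation.

    # Illustrative sketch only: one transition classifier per HFA state,
    # trained from demonstration examples (state, features, next_state).
    # scikit-learn's CART tree stands in for the paper's C4.5 variant.
    from collections import defaultdict
    from sklearn.tree import DecisionTreeClassifier

    class TransitionLearner:
        def __init__(self):
            # examples[state] -> list of (feature_vector, next_state)
            self.examples = defaultdict(list)
            self.trees = {}

        def record(self, state, features, next_state):
            """Called when the demonstrator switches the agent to next_state."""
            self.examples[state].append((list(features), next_state))
            # "Default" example: immediately after entering next_state,
            # the demonstrator almost always wants to stay there.
            self.examples[next_state].append((list(features), next_state))

        def fit(self):
            """Build one decision tree per state from the recorded examples."""
            for state, data in self.examples.items():
                X = [f for f, _ in data]
                y = [nxt for _, nxt in data]
                self.trees[state] = DecisionTreeClassifier().fit(X, y)

        def transition(self, state, features):
            """Return the next state; stay put if the state was never left."""
            tree = self.trees.get(state)
            if tree is None:
                return state
            return tree.predict([list(features)])[0]

In a demonstration session, record() would be invoked each time the trainer presses a key to switch behaviors; after fit(), transition() plays the role of the per-state decision trees described above.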
[Figure 4 (diagram): learned state-transition diagrams for GoTo(A), Harvest, Deposit, and Forage, with transitions conditioned on features such as whether A is to the left, to the right, roughly ahead, or close enough, whether food is below the agent, whether the agent is full, and whether a sub-behavior has signalled done.]

Figure 4: The Forage behavior and its sub-behaviors: Deposit, Harvest, and GoTo(Parameter A). All conditions not shown are assumed to indicate that the agent remain in its current state.
behaviors, and demonstrate the agent properly foraging, in a matter of minutes.

The GoTo(A) Behavior. This behavior caused the agent to go to the object marked A. The behavior was a straightforward bang-bang servoing controller: rotate left if A is to the left, else rotate right if A is to the right; else go forward; and when close enough to the target, enter the "done" state.
We trained the GoTo(A) behavior by temporarily declaring a marker in the environment to be Parameter A, and reducing the features to just distance-to(A) and angle-to(A). We then placed the agent in various situations with respect to Parameter A and "drove" it over to A by pressing keys corresponding to the rotate-left, rotate-right, forward, and done behaviors. After a short training session, the system quickly learned the necessary behaviors to accurately go to the target and signal completion. Once completed, it was made available in the library as go-to(A).

The Harvest Behavior. This behavior caused the agent to go to the nearest food, then load it into the agent. When the agent had filled up, it would signal that it was done. If the agent had not filled up yet but the food had been depleted, the agent would search for a new food location and continue harvesting. This behavior employed the previously-learned go-to(A) behavior as a subsidiary behavior, binding its Parameter A to the nearest-food target. This behavior also employed the features food-below-me and food-stored-in-me.
We trained the Harvest Behavior by directing the agent to go to the nearest food, then load it, then (if appropriate) signal "done", else go get more food. We also placed the agent in various corner-case situations (such as if the agent started out already filled up with food). Again, we were able to rapidly train the agent to perform harvesting. Once completed, it was made available in the library as harvest.
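The way a learned behavior such as go-to(A) is re-used inside Harvest (and, below, Deposit) can be pictured with a small sketch of the recursive tick and target binding described earlier. This is not the authors' implementation: the class layout, the behavior names, and the nearest-food resolver are hypothetical stand-ins, included only to illustrate the hierarchy.

    # Illustrative sketch (hypothetical names): a state's behavior may itself be a
    # previously learned HFA, ticked recursively and re-initialized on entry.
    class HFA:
        def __init__(self, behaviors, transition):
            self.behaviors = behaviors    # state name -> callable or child HFA
            self.transition = transition  # (state, features) -> next state name
            self.state = "start"
            self.done = False

        def reset(self):
            self.state, self.done = "start", False

        def step(self, features):
            """One tick: run the current state's behavior, then transition."""
            behavior = self.behaviors.get(self.state)
            if isinstance(behavior, HFA):
                behavior.step(features)          # recursive tick of a child HFA
            elif callable(behavior):
                behavior(features)               # atomic behavior, e.g. rotate-left
            if self.state == "done":
                self.done = True                 # signal completion to the parent
            nxt = self.transition(self.state, features)
            if nxt != self.state:
                self.state = nxt
                entered = self.behaviors.get(nxt)
                if isinstance(entered, HFA):
                    entered.reset()              # entering a child HFA re-initializes it

    # Hypothetical composition: Harvest re-uses the learned go-to behavior with its
    # target parameter bound to the nearest food (Deposit would bind it to the station).
    #   harvest = HFA(
    #       behaviors={"start": None,
    #                  "go-to-food": make_go_to(target=nearest_food),  # bound go-to(A)
    #                  "load-food": load_food,
    #                  "done": None},
    #       transition=harvest_decision_trees)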
The Deposit Behavior. This behavior caused the agent to go to the station, unload its food, and signal that it is done. If the agent was already empty when starting, it would immediately signal done. This behavior also used the previously-learned go-to(A) behavior as a subsidiary state behavior, but instead bound its Parameter A to the station target. It used the features food-stored-in-me and distance-to(station). We trained the Deposit Behavior in a similar manner as the Harvest Behavior, including various corner cases. Once completed, it was made available in the library as deposit.

The Forage Behavior. This simple top-level behavior just cycled between depositing and harvesting. Accordingly, this behavior employed the previously-learned deposit and harvest behaviors. The behavior used only the done feature.

6. CONCLUSION
In this paper, we have presented an approach for training agent behaviors using a hierarchical deterministic finite state automata model and a classification algorithm, implemented as a variant of the C4.5 algorithm. The main goal of our approach is to enable users to train agents rapidly based on a small number of training examples. In order to achieve this goal, we trade off learning complexity with training effort, by enabling trainers to decompose the learning task in a hierarchical manner, to learn general parameterized behaviors, and to explicitly select the most appropriate features to use when learning. This in turn reduces the dimensionality of the learning problem.
We have developed a proof-of-concept testbed simulator which appears to work well: we can train parameterized, hierarchical behaviors for a variety of tasks in a short period of time. We are presently deploying the platform to robots in our laboratory. In the meantime, there are a number of interesting issues that remain to be dealt with.

Multiple Agents. Our immediate next goal is to move to training multiple agents. In the general case, multiagent learning is a much more complex task than single-agent learning, involving game-theoretic issues which may be well outside the scope of the learning facility. However we believe there are obvious approaches to certain simple multiagent learning scenarios: for example, teaching agents to perform actions as homogeneous behavior groups (perhaps by training an agent with respect to other agents not under his control, but moving them similarly). Another area of multiple agent training may involve hierarchies of agents, with certain agents in control of teams of other agents.

Unlearning. There are two major reasons why an agent may make an error. First, it may have learned poorly due to an insufficient number of examples or unfortunately located examples. Second, it may have been misled due to bad examples. This second situation arises due to errors in the training process, something that's surprisingly easy to do! When an agent makes a mistake, the user can jump in and correct it immediately, which causes the system to drop back into training mode and add those new examples to the behavior's collection. However this does not cause any errant examples to be removed. Since the agent made an error based not on examples but rather based on the learned function, identifying which examples were improper, and whether to remove them, may prove a challenge.

Programming versus Training. We have sought to train agents rather than explicitly code them. However we also aimed to do so with a minimum of training. These goals are somewhat in conflict. To reduce the training necessary, we typically must reduce the problem space complexity and/or dimensionality. We have so far done so by allowing the user to inject domain knowledge into the problem (via task decomposition, for example, or by explicitly training for certain corner cases). This is essentially a step towards having the user explicitly declare part of the solution rather than have the learner induce it. So is this learning or coding? We think that training of this sort is somewhere in-between: in some sense the learning algorithm is relieving the trainer from having to "code" everything himself. The question worth studying is: how much learning is useful before the number of samples required to learn outweighs the reduced "coding" load, so to speak, on the trainer?

Other Representations. HFAs cannot straightforwardly do parallelism or planning. We chose HFAs largely because they were simple enough to make training intuitively feasible. Now that we've demonstrated this, we wish to examine how to train with other common representations, such as Petri nets or hierarchical task network plans, to demonstrate the generality of the approach.

7. ACKNOWLEDGMENTS
This work was supported in part by NSF grant 0916870.

8. REFERENCES
[1] R. Angros, W. L. Johnson, J. Rickel, and A. Scholer. Learning domain knowledge for teaching procedural skills. In The First International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 1372–1378. ACM, 2002.
[2] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57:469–483, 2009.
[3] C. G. Atkeson and S. Schaal. Robot learning from demonstration. In D. H. Fisher, editor, Proceedings of the Fourteenth International Conference on Machine Learning (ICML), pages 12–20. Morgan Kaufmann, 1997.
[4] M. Bain and C. Sammut. A framework for behavioural cloning. In Machine Intelligence 15, pages 103–129. Oxford University Press, 1996.
[5] D. C. Bentivegna, C. G. Atkeson, and G. Cheng. Learning tasks from observation and practice. Robotics and Autonomous Systems, 47(2-3):163–169, 2004.
[6] R. Bindiganavale, W. Schuler, J. M. Allbeck, N. I. Badler, A. K. Joshi, and M. Palmer. Dynamically altering agent behaviors using natural language instructions. In Autonomous Agents, pages 293–300. ACM Press, 2000.
[7] R. A. Brooks. Intelligence without representation. Artificial Intelligence, 47:139–159, 1991.
[8] S. Calinon and A. Billard. Incremental learning of gestures by imitation in a humanoid robot. In C. Breazeal, A. C. Schultz, T. Fong, and S. B. Kiesler, editors, Proceedings of the Second ACM SIGCHI/SIGART Conference on Human-Robot Interaction (HRI), pages 255–262. ACM, 2007.
[9] A. Coates, P. Abbeel, and A. Y. Ng. Apprenticeship learning for helicopter control. Communications of the ACM, 52(7):97–105, 2009.
[10] J. D. P. K. Egbert and D. Ventura. Learning policies for embodied virtual agents through demonstration. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1257–1252, 2007.
[11] D. Grollman and O. Jenkins. Learning robot soccer skills from demonstration. In IEEE 6th International Conference on Development and Learning (ICDL), pages 276–281, July 2007.
[12] M. Kasper, G. Fricke, K. Steuernagel, and E. von Puttkamer. A behavior-based mobile robot architecture for learning from demonstration. Robotics and Autonomous Systems, 34(2-3):153–164, 2001.
[13] D. Kulic, D. Lee, C. Ott, and Y. Nakamura. Incremental learning of full body motion primitives for humanoid robots. In 8th IEEE-RAS International Conference on Humanoid Robots, pages 326–332, Dec. 2008.
[14] S. Luke, C. Cioffi-Revilla, L. Panait, K. Sullivan, and G. Balan. Mason: A multi-agent simulation environment. Simulation, 81(7):517–527, July 2005.
[15] J. Nakanishi, J. Morimoto, G. Endo, G. Cheng, S. Schaal, and M. Kawato. Learning from demonstration and adaptation of biped locomotion. Robotics and Autonomous Systems, 47(2-3):79–91, 2004.
[16] M. N. Nicolescu and M. J. Mataric. Learning and interacting in human-robot domains. IEEE Transactions on Systems, Man, and Cybernetics, Part A, 31(5):419–430, 2001.
[17] M. N. Nicolescu and M. J. Mataric. A hierarchical architecture for behavior-based robots. In The First International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 227–233. ACM, 2002.
[18] M. N. Nicolescu and M. J. Mataric. Natural methods for robot task learning: instructive demonstrations, generalization and practice. In The Second International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 241–248. ACM, 2003.
[19] R. Parekh and V. Honavar. Grammar inference, automata induction, and language acquisition. In Handbook of Natural Language Processing, pages 727–764. Marcel Dekker, 2000.
[20] J. R. Quinlan. C4.5: Programs for Machine Learning (Morgan Kaufmann Series in Machine Learning). Morgan Kaufmann, 1 edition, January 1993.
[21] P. E. Rybski, K. Yoon, J. Stolarz, and M. M. Veloso. Interactive robot task training through dialog and demonstration. In C. Breazeal, A. C. Schultz, T. Fong, and S. B. Kiesler, editors, Proceedings of the Second ACM SIGCHI/SIGART Conference on Human-Robot Interaction (HRI), pages 49–56. ACM, 2007.
[22] P. Stone and M. M. Veloso. Layered learning. In R. L. de Mántaras and E. Plaza, editors, 11th European Conference on Machine Learning (ECML), pages 369–381. Springer, 2000.
[23] Y. Takahashi and M. Asada. Multi-layered learning system for real robot behavior acquisition. In V. Kordic, A. Lazinica, and M. Merdan, editors, Cutting Edge Robotics. Pro Literatur, 2005.
[24] Y. Takahashi, Y. Tamura, and M. Asada. Mutual development of behavior acquisition and recognition based on value system. In 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 386–392. IEEE, 2008.
[25] H. Veeraraghavan and M. M. Veloso. Learning task specific plans through sound and visually interpretable demonstrations. In 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2599–2604. IEEE, 2008.
[26] E. Vidal. Grammatical inference: An introductory survey. In Grammatical Inference and Applications, pages 1–4. Springer, 1994.
Proceedings of the AAMAS Workshop on Adaptive and Learning Agents, May 2010, Toronto, Canada
tion results. Finally, in Section 5 we provide a summary and a discussion of the results as well as directions for further research.

1.1 Related Work
Model-free learning algorithms such as reinforcement learning can be used for navigation applications [24]. The online reinforcement learning algorithm OLPOMDP [3] has been used successfully in applications including general robot control [9]. By operating on a parameterized functional representation of the knowledge gained during operation, instead of on specific models, important features of the world in which the robot (or agent) operates can be the focus. This allows adaptive behavior to be learned as a connection between features, rather than the degree to which a provided model is accurate.
Reward shaping in reinforcement learning for robotics allows the balancing of specific agent tasks and automatic agent-agent or agent-environment interactions [6, 7]. The methodology was based on domain knowledge first, and then augmented by suggestions from an external trainer to "shape" the agent learning progress. The concept was further expanded to reinforcement learning for situated agents [20]. The use of reward shaping concepts has proven successful in the area of robotics, including its application to policy search techniques for navigation and non-Markovian pole balancing [13, 14]. More recently, analysis was done to evaluate the impact of shaping on reward horizons [19] and generate a methodology to allow dynamically shaped rewards (rather than the typical static shaped reward) [18]. Further extensions on dynamic shaping have moved into the area of adaptive shaping in general robot domains [16] and the robot soccer domain [12].
Of particular interest to the work presented in this paper is the empirical analysis of genetic algorithms and temporal difference learning in reinforcement learning robotic soccer [26, 30]. Both techniques are model-free control approaches, and the work presented in these papers provides a comprehensive analysis of the behavior of both in a domain where the robots have limited resources, including observational capabilities.

2. ROBOT NAVIGATION
In problems where resources are limited, particularly in the robots' abilities to observe their surroundings, correctly interpreting the incoming information and mapping it to coherent actions (e.g., navigation) is a key concern. In this work, we explore three algorithms for navigation based on environment information obtained from sonar and inertial sensors: i) a deterministic navigation algorithm is used to provide a deterministic action choice based on state information collected, ii) a policy search algorithm that uses a multi-layered neural network as a policy and an evolutionary algorithm as the search method to assign a "quality" to each potential path, and iii) a policy gradient algorithm that uses a multi-layered neural network for the policy function.

2.1 State and Action Spaces
In the work presented here, the mobile robotic platform has a limited set of sensing capabilities, and provides a non-deterministic outcome of actions taken. As a result, a unique set of spaces is required to accurately represent the environment surrounding the robot and provide maximum articulation capabilities.
To encode as much information as possible from the sensors, as simply as possible, incoming state details are distilled into two state variables:

Object Distance: A distance to the nearest object dθi is provided for each vehicle-relative angle θi. This represents a potential obstacle, such as a wall or rock.

Destination Heading: The difference between the potential path heading θi and the vehicle-relative destination heading αdes is provided. This indicates how significant a correction is required for the robot to point directly toward the destination.

This state representation is of course quite predictable in the destination heading, never exceeding |180| degrees and symmetrical about the destination heading αdes. The object distance state variable can vary widely however, and depends strongly on the resolution of the environment sensing available. Both state variables also depend on the accuracy of the sensors that provide distance and track robot orientation.
To provide a space of actions that is as directly indicative of robot task needs as possible, but abstract enough to reduce the impact of non-determinism, the concept of "path quality" is introduced. This quality, represented by X(θi), is calculated in varying ways dependent on the algorithm used, but represents the quality of a potential path for the robot to take next. In producing a distribution of quality for all possible paths at each time-step, the state of the environment is represented, and a path can be chosen either via the maximum quality, or by sampling to inject exploration behavior.

2.2 Policy Search
The state/action structure of navigation presented in Section 2.1 contains a beneficial approach to path selection. It is simple, which reduces computational complexity as well as the number of potentially unpredictable behaviors, and applies well to a mobile robot with limited sensing capabilities.
To capture those benefits while injecting the benefits of adaptability into navigation control, the state and action spaces are maintained. Therefore, for this work, the baseline network structure created is a two-layer, sigmoid-activated artificial neural network with two inputs, one output unit, and eight hidden units (selected through an empirical performance study).
This network is run at each time step, for each potential path, generating a path quality function in a similar fashion to that of the deterministic algorithm. The difference is in the replacement of static predefined probability distributions by an adaptive artificial neural network.
An evolutionary search algorithm for ranking and subsequently locating successful networks within a population [21] is applied here. The algorithm maintains a population of ten networks, uses mutation to modify individuals, and ranks them based on a performance metric specific to the domain. The search algorithm used is shown in Figure 1, which displays the logical flow of the algorithm.
The definitions for the variables and functions located in the algorithm shown in Figure 1 are as follows:
    T: indexes episodes
    t: indexes time-steps within each episode
    θi: angle of potential path i
    N: indexes networks with appropriate subscripts
    N′: mutated network for use in control of the current episode
    X(θi): path quality assigned to each potential path
    αu: chosen vehicle-relative robot angle for the next time-step
    F(X(αu)): linear function mapping the quality of the chosen path to robot speed
    Vu: chosen robot speed

Figure 1 (the policy search algorithm):

    Initialize N networks at T = 0
    For T < Tmax loop:
      1. Pick a random network Ni from the population
         With probability ε: Ncurrent ← Ni
         With probability 1 − ε: Ncurrent ← Nbest
      2. Modify Ncurrent to produce N′
      3. Control the robot with N′ for the next episode (tepi time steps)
         For t < tepi loop:
           3.1 For θi ≤ 360 loop: run N′ to produce X(θi)
           3.2 αu ← argmax X(θi)
           3.3 Vu ← F(X(αu))
      4. Rank N′ based on performance (objective function)

In this domain, modifying a policy (Step 2) involves adding a randomly generated number to every weight within the network. This can be done in a large variety of ways; however, it is done here by sampling from a random Cauchy distribution [1], where the samples are limited to the continuous range [-10.0, 10.0]. Ranking of the network performance (Step 4) is done using a domain-specific objective function, and is discussed in detail in Section 3.

2.3 Policy Gradient
The reinforcement learning technique of adaptive control is a structure of algorithm that must be formed such that it can be "rewarded" directly based on a predefined objective function of the next state of the robot, or the next state and action taken. In this work, the algorithm is rewarded based on the change of state resulting from an action previously taken (therefore a direct function of the next state achieved). There are a great many reinforcement learning algorithm structures; however, for this work an online partially observable MDP [3] algorithm proved the most successful.
The definitions for the variables and functions located in the algorithm shown in Figure 2 are as follows:

    T: indexes episodes
    t: indexes time-steps within each episode
    θi: angle of potential path i
    ω: weights of the neural network function approximator
    π(a|s, ω): path quality assignment policy
    f(a′|s, ω): output of the neural network function approximator
    P(a): probability of taking action a
    e: eligibility traces for parameters ω
    β, α: discounting factor and learning rate, respectively
    F(f(a|s, ω)): linear function mapping the quality of the chosen path to robot speed
    Vu: chosen robot speed

Figure 2 (the policy gradient algorithm):

    Initialize ω and e
    For T < Tmax loop:
      For t < tepi loop:
        1. Capture current state s
        2. Sample action a from π(a|s, ω)
           2.1 For θi ≤ 360 loop: run the network to produce f(a′|s, ω)
           2.2 g(a′) = Σ_{a′} f(a′|s, ω)
           2.3 P(a) = f(a|s, ω) / g(a′)
        3. Execute action a and capture reward r
        4. e ← βe + ∇ω π(a|s, ω)
        5. ω ← ω + α e r
        6. Vu ← F(f(a|s, ω))

The state and action spaces here are very important as well. In order to maintain comparability, the spaces are identical to those used in the deterministic and policy search techniques described in Sections 2.1 and 2.2. This structure, where path quality is assigned to potential paths, lends itself to reinforcement learning, with a minor modification to fit within the online partially observable MDP [3] policy update strategy. This change involves normalizing each network output by the sum over all outputs to produce a probability distribution for the path quality assignment policy:

$$\pi(a \mid s, \omega) = \frac{f(a \mid s, \omega)}{\sum_{a'} f(a' \mid s, \omega)} \qquad (1)$$

where a is a sampled action, s is the current state, ω are the network weights, and f(a|s, ω) is the output of the neural network function approximator that provides the value of the sampled action a (path quality), given current state s and weights ω.¹

¹ The output of the neural network approximator f(a′|s, ω) is essentially identical to the path quality distribution X(θi) shown in Figure 1. The neural network does not produce a probability distribution, so dividing by the sum of all qualities normalizes to a probability from which the next action can be sampled, injecting exploration into the algorithm.
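As an illustration of Equation 1 and the sampling step of Figure 2 (steps 2.1–2.3), the snippet below normalizes a vector of path qualities into a probability distribution and samples the next path. It is a minimal numpy sketch under our own naming, not the authors' code; the quality vector would come from the two-input network of Section 2.2, whose sigmoid outputs are assumed positive.

    # Minimal sketch of the action-selection step (Equation 1; Figure 2, steps 2.1-2.3).
    # qualities[i] stands in for f(a_i|s,w), the network's (positive) output for path i.
    import numpy as np

    def sample_path(qualities, rng=None):
        """Normalize path qualities into a distribution and sample the next path."""
        rng = rng or np.random.default_rng()
        q = np.asarray(qualities, dtype=float)
        probs = q / q.sum()        # Equation 1: pi(a|s,w) = f(a|s,w) / sum_a' f(a'|s,w)
        return rng.choice(len(q), p=probs), probs

    # Example with made-up qualities for four candidate headings.
    action, probs = sample_path([0.2, 0.9, 0.4, 0.1])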
The following is the neural network function approximator gradient with respect to parameters ω (network weights)²:

$$\nabla_\omega \pi(a \mid s, \omega) = \frac{\frac{d}{d\omega} f(a \mid s, \omega) \sum_{a'} f(a' \mid s, \omega) \;-\; f(a \mid s, \omega) \sum_{a'} \frac{d}{d\omega} f(a' \mid s, \omega)}{\left|\sum_{a'} f(a' \mid s, \omega)\right|^{2}} \qquad (2)$$

where the derivative term with regard to the output weights is:

$$\frac{d}{d\omega_j} f(a \mid s, \omega) = f(a \mid s, \omega)\,\bigl(1 - f(a \mid s, \omega)\bigr)\, h_j = \delta_o h_j \qquad (3)$$

and the derivative with respect to the hidden weights results in:

3.1 Episodic Objective
An episodic objective aims to capture three important aspects of mobile robot navigation in unknown environments under the capability restrictions described above: 1) the total path length the robot uses to reach the destination, 2) the time the robot consumes reaching the destination, and 3) the time the robot consumes recovering from a collision with an obstacle. These incorporate choosing the shortest path, executing it with greatest speed, and doing so in a safe manner. In order to convert the above to maximization rather than minimization, and support constantly shifting initial conditions, the best possible behavior is incorporated, generating the following objective function:

$$R(s) = \alpha\,(d_{best} - d_{actual}) + \beta\,(t_{best} - t_{actual}) - \gamma\, t_{collision} \qquad (5)$$
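Equation 2 above is simply the quotient rule applied to the normalized policy of Equation 1. As a sanity-check sketch (not the authors' code), the function below evaluates ∇ω π for one sampled action from the network outputs f and their per-weight derivatives df/dω, which a backpropagation routine for the two-layer network would supply (cf. Equation 3); the array shapes and names are our own assumptions.

    # Sketch of Equation 2: gradient of the normalized policy pi(a|s,w), given
    # network outputs f (one per candidate path) and their derivatives
    # df[i, j] = d f(a_i|s,w) / d w_j (assumed to be provided by backprop).
    import numpy as np

    def policy_gradient(f, df, a):
        """Return d pi(a|s,w) / d w as a vector over the weights w."""
        f = np.asarray(f, dtype=float)        # shape (num_paths,)
        df = np.asarray(df, dtype=float)      # shape (num_paths, num_weights)
        total = f.sum()
        # Quotient rule: (df_a * sum_f - f_a * sum_df) / (sum_f)^2
        return (df[a] * total - f[a] * df.sum(axis=0)) / total**2

    # The eligibility-trace update of Figure 2 (steps 4 and 5) then reads:
    #   e = beta * e + policy_gradient(f, df, a)
    #   w = w + alpha * e * r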
where θ̄ is the change in state regarding the robot's heading as it relates to the destination heading, and d̄ is the change in state regarding the distance to the next impassable object. The constants η1 and η2 are in place to provide scaling and additional shaping. This linear formulation provides a much more deterministic indication as to the source of the reward in a single action choice, representing the desire for the policy to assign high quality to paths that turn toward the goal, but away from nearby objects.
The objective in Equation 8 brought the policy gradient algorithm into the same learning time scale as the policy search algorithm. However, overall performance was still not satisfactory, as poor policies were not sufficiently penalized. Therefore, the objective was further modified to exponentially penalize large deviations, leading to:

$$r(s) = \xi \left( 1 + e^{-\eta_1 \bar{\theta}} - e^{-\eta_2 \bar{d}} \right) \qquad (9)$$

where r is the reward used in Figure 2. This reward provided more robust performance in both simple and complex environments, and allowed the policy gradient algorithm to learn similar successful behavior to that of the policy search and deterministic algorithms.
This reward is calculated, and therefore the weight update is performed, at every time-step to allow the algorithm to directly observe and be rewarded based on the perceived success or failure of the last action taken. This allows the eligibility traces to more accurately track the effect of each network weight on the resulting state over time. Conversely, utilizing the episodic objective function requires that the traces interpret the effect of each network weight on up to 600 actions before a reward is received. There are factors to mitigate the effect of taking so many actions during an episode, most notably changing how often the parameters are updated (Figure 3), which effectively reduces how many actions the robot chooses during an episode.

Figure 3: Effect of changing the interval between weight updates in the policy gradient navigation algorithm. n is the number of time steps between updates. The episodic objective is plotted against learning episode for a representative selection of update intervals.

The inverse effect to that shown in Figure 3 occurs when the more specific reward in Equation 9 is used in conjunction with the policy search algorithm. That algorithm modifies all weights simultaneously, regardless of their direct effect on each action taken during an episode. Therefore, if the evolution search is run at each time-step (or a small subset of time-steps) and the networks are ranked on such a specific basis, it is very unlikely that the search will locate a network capable of being successful over an entire episode. For example, at the beginning of the episode the search may locate a successful network, and use it (with only probability ε of exploration) throughout, resulting in low system level performance.

3.3 Objective Equivalence
The episodic objective (Equation 5) is better suited for time-extended learning, as with policy search, and the focused objective (Equation 9) is better suited for state-change parameter updates, as with policy gradient. Still, it is important to study the behavior of both algorithms to ensure that they are achieving the same overall behavior goals and are being provided with the same level of information.
Equation 5 shows that by minimizing total path length and collisions while maximizing speed the robot can maximize system level performance (a maximum of 0, otherwise negative). Conceptually, Equation 9 shows that if the algorithm minimizes θ̄ by choosing actions that point the robot toward the destination, it is in effect minimizing total robot path length at the end of the episode. Additionally, the robot speed is again based linearly on the quality assignment to the path chosen, and therefore by being more confident about quality assignment, the algorithm maximizes robot speed. Finally, if the algorithm maximizes d̄, it is choosing actions that point the robot away from nearby obstacles and therefore minimizes collisions.
Figure 4 shows the calculated focused objective r(s) during a learning session. The policy search algorithm does not use the objective for learning; rather, the average objective over an episode is calculated and displayed. It is shown that while the policy search algorithm does not maximize the focused objective in a stable fashion, when it converges to the highest performance of its own objective (Equation 5), it simultaneously converges to the highest performance of the focused objective (Equation 9) used by the policy gradient algorithm. Likewise for policy gradient, shown on the right of Figure 4: when policy gradient converges to its best performance of the focused objective, it is also converging to the best performance of the episodic objective. These two results demonstrate empirically the equivalence of the two objectives used for learning.

4. EXPERIMENTS
Several experiments were designed to evaluate the navigation algorithms for a specific set of behaviors discussed in the problem definition. These progressively increased in difficulty and scope from basic navigation to a destination, through advanced navigation in cluttered environments. In all experiments, an arena of 5 meters square was created with a varying number of obstacles, depending on the experiment. The learning method is episodic in that the robot is allowed to operate for a fixed maximum amount of time (tepi = 60 seconds in this work). Learning is executed for 2000 episodes, and each experiment is run 40 times for each algorithm. These experiments evaluate not only the navigation algorithm's ability to seek a destination, but safely and intricately navigate around obstacles in an unknown en-
Figure 4: Left: The focused reward is plotted for both algorithms during a learning session. The average focused objective
r(s) per episode is calculated for all algorithms, though the policy search algorithm does not use it for learning. Right: The
policy gradient algorithm behavior as measured by two objectives: The algorithm learns using the focused objective, but both
the focused and episodic objectives are plotted. The comparison clearly shows that the two objective functions are functionally
equivalent in measuring robot performance. Error bars are omitted for clarity.
Figure 5: Left: The impact of obstacle density is shown. Maximum episodic objective achieved is plotted against varying
number of obstacles within the environment. Right: The result of the learning in a dense environment containing 20 obstacles.
The objective function is plotted for the random, deterministic, policy search, and policy gradient algorithms as an average over
40 iterations.
vironment, including when state information is inaccurate and action results are stochastic.

4.1 Impact of Obstacle Density
We now focus on the performance of the algorithms with respect to the density of obstacles within the environment. With limited environment detection capabilities, as the environment becomes more dense with hazards, the robot must be careful with path quality assignments such that safe operation is ensured. This is reflected in Figure 5 when the number of obstacles is low. While both adaptive algorithms consistently outperform the deterministic navigation algorithm, they have similar performance until the environment becomes complex.
As expected, all three algorithms drop in performance when the number of obstacles increases. The deterministic algorithm consistently drops in performance as the environment increases in density, and has a sharp deterioration rate. The policy search algorithm is able to maintain acceptable performance early on, but sharply declines between 15 and 20 obstacles, unable to locate the destination on occasion within the time allotted. The best performing algorithm is policy gradient, which degrades gracefully and is able to maintain its performance even when the environment is extremely dense with obstacles.
Figure 5 (right) shows that the policy gradient algorithm, utilizing the more focused objective, was able to more successfully encode information learned during operation. It does this by trading off robot speed in order to operate more safely in environments more dangerous for operation. This
Figure 6: Left: The results of the learning in a dense environment with 15 obstacles while sensor and actuator noise was
present. The objective function is plotted for the random, deterministic, policy search, and policy gradient algorithms as an
average over 40 iterations. Right: The results of the learning in a dense environment with 20 obstacles while sensor and actuator
noise was present. The objective function is plotted for the random, deterministic, policy search, and policy gradient algorithms
as an average over 40 iterations.
trade-off, and subsequent learned behavior, is an important aspect of mobile robot navigation.

utilizes a focused state-based objective that allows it to learn intricate behavior in complex environments.
making the transition to control of a physical robot in a real-world setting. In addition, our future work will focus on coordination in multiple physical robots, blending our current work of robot coordination and robot navigation.

Acknowledgements
This work was partially supported by AFOSR grant FA9550-08-1-0187 and NSF grant IIS-0910358.

6. REFERENCES
[1] A. K. Agogino and K. Tumer. Efficient evaluation functions for evolving coordination. Evolutionary Computation, 16(2):257–288, 2008.
[2] B. Banerjee and J. Peng. Adaptive policy gradient in multiagent learning. In The Conference on Autonomous Agents and Multiagent Systems, pages 686–692, 2003.
[3] P. Bartlett and J. Baxter. Stochastic optimization of controlled partially observable Markov decision processes. Decision and Control, 1:124–129, 2000.
[4] J. Bohren and T. Foote. Little Ben: The Ben Franklin racing team's entry in the 2007 DARPA Urban Challenge. Field Robotics Research, 25:588–614, 2008.
[5] M. Cummins and P. Newman. Probabilistic appearance based navigation and loop closing. In Robotics and Automation, IEEE International Conference on, pages 2042–2048, 2007.
[6] M. Dorigo and M. Colombetti. Robot shaping: Developing situated agents through learning. Technical Report TR-92-040, International Computer Science Institute, 1993.
[7] M. Dorigo and M. Colombetti. Robot Shaping: An Experiment in Behavior Engineering. MIT Press, 1998.
[8] A. El-Fakdi and M. Carreras. Policy gradient based reinforcement learning for real autonomous underwater cable tracking. In Intelligent Robots and Systems, IEEE/RSJ International Conference on, pages 3635–3640, 2008.
[9] A. El-Fakdi, M. Carreras, N. Palomeras, and P. Ridao. Autonomous underwater vehicle control using reinforcement learning policy search methods. Oceans, pages 2: 793–798, 2005.
[10] P. Fabiani and V. Fuertes. Autonomous flight and navigation of VTOL UAVs: from autonomy demonstrations to out-of-sight flights. Aerospace Science and Technology, 11:183–193, 2007.
[11] J. A. Fernandez-Leon, G. G. Acosta, and M. A. Mayosky. Behavioral control through evolutionary neurocontrollers for autonomous mobile robot navigation. Robotics and Autonomous Systems, In Press, Corrected Proof:411–419, 2008.
[12] T. Gabel and M. Riedmiller. Learning a partial behavior for a competitive robotic soccer agent. KI Zeitschrift, 20:18–23, 2006.
[13] F. Gomez and R. Miikkulainen. Incremental evolution of complex general behavior. Adaptive Behavior, 5:5–317, 1997.
[14] F. Gomez and R. Miikkulainen. Solving non-Markovian control tasks with neuroevolution. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-99), pages 1356–1361, Stockholm, Sweden, 1999.
[15] D. M. Helmick and S. I. Roumeliotis. Slip-compensated path following for planetary exploration rovers. Advanced Robotics, 20:1257–1280, 2006.
[16] G. Konidaris and A. Barto. Autonomous shaping: knowledge transfer in reinforcement learning. In Int. Conference on Machine Learning, pages 489–496, 2006.
[17] C. Kunz and C. Murphy. Deep sea underwater robotic exploration in the ice-covered Arctic ocean with AUVs. In Intelligent Robots and Systems, IEEE/RSJ International Conference on, 2008.
[18] A. Laud and G. DeJong. Reinforcement learning and shaping: Encouraging intended behaviors. In Int. Conference on Machine Learning, pages 355–362, 2002.
[19] A. Laud and G. DeJong. The influence of reward on the speed of reinforcement learning: An analysis of shaping. In Int. Conference on Machine Learning, 2003.
[20] M. J. Matarić. Reward functions for accelerated learning. In Machine Learning: Proceedings of the Eleventh International Conference, pages 181–189, San Francisco, CA, 1994.
[21] D. Moriarty and R. Miikkulainen. Forming neural networks through efficient and adaptive coevolution. Evolutionary Computation, 5:373–399, 2002.
[22] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Pearson Education, Inc., 2003.
[23] M. Saggar, T. D'Silva, N. Kohl, and P. Stone. Autonomous Learning of Stable Quadruped Locomotion, chapter 9, pages 98–109. Springer Berlin / Heidelberg, 2007.
[24] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[25] R. S. Sutton and D. McAllester. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12:1057–1063, 2000.
[26] M. E. Taylor, S. Whiteson, and P. Stone. Comparing evolutionary and temporal difference methods for reinforcement learning. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 1321–1328, 2006.
[27] R. Tedrake and T. Zhang. Stochastic policy gradient reinforcement learning on a simple 3D biped. In Intelligent Robots and Systems, IEEE/RSJ International Conference on, pages 2849–2854, 2004.
[28] S. Thrun and M. Montemerlo. Stanley: The robot that won the DARPA Grand Challenge. Robotic Systems, 23:661–692, 2006.
[29] S. Thrun and G. Sukhatme. Robotics: Science and Systems I. MIT Press, 2005.
[30] S. Whiteson, M. E. Taylor, and P. Stone. Empirical studies in action selection for reinforcement learning. Adaptive Behavior, 15(1):33–50, 2007.
Proceedings of the AAMAS Workshop on Adaptive and Learning Agents, May 2010, Toronto, Canada
ABSTRACT
Recent advances in service oriented technologies offer metered computational resources to consumers on demand. In certain environments these consumers are software agents, capable of autonomously procuring resources and completing tasks. The consumer agent can solicit relevant services from other software agents known as provider agents. These provider agents are instantiated with various business logic, which culminates in their service offering. Consumer demand varies over time, meaning each provider agent's service offering must adapt in order to succeed. Agents offering services for which there is little demand greatly reduce the probability of successful provision. Since service agents occupy a finite resource, offerings for which demand is low waste resources. The goal of this research is to create provider agents capable of adapting their service offerings to meet the available demand, thus maximising available revenues. To achieve this we implemented two separate algorithms, each tackling the problem from different perspectives, comparing their efficacy in this environment. Firstly, we adopted a centralised approach where a Genetic Algorithm evolves agents online to meet the fluctuating demand for services. In our results the genetic algorithm achieved a significant improvement over a non-elastic fixed agent supply. Secondly, a decentralised approach was developed using Reinforcement Learning. Provider agents individually learned through trial and error which services to offer. Results showed that reinforcement learning was slightly less adaptive than the genetic algorithm in meeting the varying demand. Both approaches displayed a significant improvement over the fixed aggregate supply. Critically, reinforcement learning requires far less computational or communication overhead, as agents make decisions from environmental experience.

Categories and Subject Descriptors
H.4 [Service Computing, Artificial Intelligence]: Multiagent Systems

General Terms
Genetic Algorithm, Reinforcement Learning

Keywords
Adaptive services, Demand estimation

1. INTRODUCTION
With the expansion of utility and cloud computing, more and more companies are outsourcing their computational requirements rather than processing them in-house. This has led to an increase in the number of providers offering cloud-based services and charging for usage either on fixed-rate tariffs or a pay-per-use basis. Zimory, a German company, launched what is being touted as a global marketplace for computational resource trading in January 2009. Although the idea of trading computational resources similarly to traditional commodities has been around for some time [2], with a number of projects [6] [7] developing structures for trading resources, its implementation has not. In this environment, potentially excess compute capacity could be sold on the open market, generating a revenue stream where previously there was none. Technologies are emerging where agents are capable of autonomously procuring and supplying services on the web. These agents must procure the relevant resources to achieve a specified goal, often engaging with multiple service providers to ensure successful completion. The service providers must decide what services to offer for consumption. The selection choice can involve deciding from amongst a large set of possible service offerings. When market conditions are uncertain the service provider is presented with a difficult decision. Choosing a service offering for which there is little demand results in wasted resources as the instantiation costs remain. Therefore this paper raises an interesting research question: what service offering should a provider agent expose to increase its chances of successful consumption?
In addressing this question we have investigated the use of Reinforcement Learning and a Genetic Algorithm to create adaptive service agents, capable of autonomously altering their service offerings online. Using these techniques the agents can adapt to fluctuating demands for services. The methods enable them to alter their offerings, meeting market demand, and concurrently optimising their limited resources. We compare the two approaches empirically, focussing on optimising limited resources while maintaining an adequate service level. The goal of each approach is to maximise the amount of revenue earned from available resources, while minimising lost revenues resulting from unfulfilled service requests. In these environments agent interactions are often governed by Service Level Agreements (SLAs), where a minimum requirement of service level is stipulated. This can often lead to over-provisioning of services to ensure compliance. Over-provisioning ensures compliance but it also
increases costs and inefficiency. Our aim is to find a solution that can meet agreements but also maximise efficiency, generating greater revenues from the existing resources. The genetic algorithm provides a centralised approach to tackling the problem of service choice among the provider agents. It is responsible for controlling agent population numbers and deciding what number of provider agents offer which services each time. In contrast, reinforcement learning offers a decentralised approach to tackling the same problem, where the provider agents make decisions based on their past experiences of what services to offer when.
The following sections of this paper are structured as follows: Background Research provides an overview of relevant work in this field. A number of aspects of service computing are discussed as well as an overview of a genetic algorithm and reinforcement learning. Model Design details the service provider/consumer model and describes the algorithms used in this paper. Simulator Design describes the simulator used to generate our service-oriented environment test-bed. Experimental Results evaluates the performance of the two methods, finally leading to Conclusions.

2. BACKGROUND RESEARCH
This section details relevant work in the area of service computing. We begin by outlining the artificial intelligence methods used in this paper. We document related work addressing the relevancy of these approaches to our problem. Finally we look at other related research in service computing.

2.1 Centralised vs Decentralised
In this section we examine both centralised and decentralised approaches to solving the problem. Using the genetic algorithm to evolve suitable agents, we address the problem in a centralised manner. The genetic algorithm determines the agent configuration, evolving the fittest agents for the environment each time. It is also solely responsible for adding and removing provider agents to and from the environment, depending on an agent's fitness. The creation of new agents and subsequent service allocation incurs an instantiation cost.
In contrast, reinforcement learning represents a decentralised approach to solving the problem, where agents individually make decisions on what services to offer each time. The reinforcement learning agents cannot increase the resources at their disposal and, similarly to the genetic algorithm, are limited to choosing from among the available services. A switching cost is applied should a learning agent elect to provide a service other than the one it currently provides. This cost is dealt with through the agents' returns and is explained in more detail in Section 4. Importantly, an unsuccessful learning agent can also decide to make itself idle for a period of time. Once in this idle phase the agent's resources are freed up and available to be used elsewhere. After a certain time the agent comes out of the idle state and begins service provision once more.

2.2 Genetic Algorithms
Genetic algorithms are stochastic search and optimisation techniques based on evolution. In their simplest form, a set of possible solutions to a particular problem are evaluated in an iterative manner. From the fittest of these solutions, the next generation is created and the evaluation process begins once more. A solution's suitability to its environment is determined using a fitness function. By iterating through successive generations an optimal solution can be found for the given environment. In our model the GA attempts to find optimal solutions continuously, in an online fashion. As demand for services varies with time, the optimum solution is not static and changes continuously.
A number of researchers have looked at using genetic algorithms to estimate demand for traditional utilities and commodities, such as oil [3] and electricity [1]. Using historical demand figures as well as a number of other parameters such as GNP, population, import and export figures, future projections were made, proving the viability of using the approach for demand prediction. Demand estimations were carried out retrospectively and not online.

2.3 Reinforcement Learning
Reinforcement learning involves the agent learning through trial and error from interactions with its environment. The reinforcement learning agent has an explicit goal which it endeavors to achieve. The environment presents the learning agent with the necessary evaluative feedback required to achieve this goal. This feedback or reward consists of a scalar value through which the learning agent determines its performance. Through repeated interactions with its environment the agent learns which actions result in higher rewards. The agent's goal is to maximise its reward in the long run.

2.3.1 Value Functions
The objective of the learning agent is to optimise its value function. The agent makes decisions on its value estimates of states and actions. V^π(s) is called the state value function for policy π. It is the value of state s under policy π and amounts to the return you expect to achieve, starting in s and following policy π from then on.

$$V^{\pi}(s) = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s \right\} \qquad (1)$$

Q^π(s, a) is called the action-value function. It is the value of choosing action a while in state s under policy π.

$$Q^{\pi}(s, a) = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s, a_t = a \right\} \qquad (2)$$

A number of studies have looked at applying reinforcement learning to resource allocation problems [14]. The author presented a framework using reinforcement learning, capable of dynamically allocating resources in a distributed system. While adaptive service provision is not a resource allocation problem, the parallels between them still merit their inclusion. Tesauro investigated the use of a hybrid reinforcement learning technique for autonomic resource allocation [13]. He applied this research to optimising server allocation in data centers. Germain-Renaud et al. [4] looked at similar resource allocation issues. Here a workload demand prediction technique was used to predict the resource allocation required each time. Reinforcement learning has also been successfully applied to grid computing as a job scheduler. Here the scheduler can seamlessly adapt its decisions to changes in the distributions of inter-arrival time, QoS requirements, and resource availability [10].
Figure 1: Individual Demand for Multiple Services (Stable). [Plot of quantity demanded against time (s) for Service1–Service4.]

Figure 2: Aggregate Service Demand. [Plot of aggregate quantity demanded against time (s) for the volatile and stable demand patterns.]
2.4 Service Selection and Composition
In service computing, recent work by Jacyno et al. investigated an approach where agents share local demand estimates among one another and adapt their service offerings to suit [5]. The agents in this work did not learn from their environment; instead they communicated with one another and altered their service offerings based on the information they acquired. By limiting the flow of information between agents, the authors were able to show the system had the capability to self-organise decentrally into communities where agents reliably provide the most requested service types.
A number of researchers in the field have addressed the problem of optimal service selection [8] [11]. Optimal service selection involves developing approaches to selecting an adequate number of service providers to ensure task completion, within certain constraints such as time, quality and budget.
Service composition has also received much attention in recent years. Composition addresses the issue of composing required functionality from amongst the available services. These services often belong to numerous different providers, offering varied or similar functionality. The objective of the composing algorithm is to service the request by composing the required functionality from the existing services. Weise et al. [15] compared the performance of an informed/uninformed search and a genetic algorithm for composing web services. The evolutionary approach, where solutions to requests were evolved, proved to be much slower than the search algorithms but was shown to always successfully satisfy requests.

3. MODEL DESIGN
In this section we discuss the model which we use to evaluate our learning approaches. We discuss how demand is generated and controlled, and also the architecture of the agents adopting the different approaches.

3.1 Agent Interactions And Demand Function
The agent environment supports two types of agents, service providers and service consumers. Agents interact in a controlled manner, with consumer agents always procuring services from provider agents. Consumers do not differentiate among the available providers in terms of quality or price. Also, they are not bound by deadlines nor restricted by budgetary constraints. The focus of this paper is to develop techniques to improve service adaptability and dynamism. The consumer agent requests a service from any available provider agent, with the provider agent remunerated upon completion. Payments for services are homogeneous. The provider agent's goal is to maximise its income throughout the course of its interactions with the environment. The income earned through interactions denotes its fitness. This influences its degree of success in the genetic algorithm, where fitter agents have a higher probability of producing offspring. It also represents the agent's reward in reinforcement learning, biasing the agent's interactions in order to achieve maximum reward.
The demand for each service, manifested through the service consumer, is controlled exogenously through the use of a sine function. The demand pattern generated using the sine function is purely deterministic, with no degree of randomness applied. Using this method we can evaluate performance more accurately over successive runs. Figure 1 shows the demand curves for four separate services where services are increasing and decreasing in demand. This demand pattern is relatively stable and is depicted in Figure 2 by the stable demand curve. The demand patterns in Figure 2 depict services on an aggregate level for both stable and volatile environments. These two demand patterns were selected in order to perform preliminary analysis of our approaches in disparate environments. Greater variance in demand will be addressed in future work.

3.2 Agent Architecture
Implementation of both the genetic algorithm and reinforcement learning requires two architecturally different provider agents. Learning agents require greater autonomy and decision-making skills than their evolved counterparts. This section outlines both architectures and introduces SARSA, the reinforcement learning algorithm used in this work.
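To make the demand model of Section 3.1 concrete, here is a small Python sketch of a deterministic sinusoidal demand generator; the period, amplitude and phase offsets are illustrative assumptions rather than the authors' actual parameters.

import math

def service_demand(t, peak=100.0, period=900.0, phase=0.0):
    """Deterministic sinusoidal demand for one service at time t (no randomness)."""
    # Shift and scale sin into [0, peak] so demand is never negative.
    return 0.5 * peak * (1.0 + math.sin(2.0 * math.pi * t / period + phase))

def aggregate_demand(t, num_services=4, peak=100.0, period=900.0):
    """Aggregate demand over all services; phases are staggered so individual
    services rise and fall at different times (cf. Figures 1 and 2)."""
    return sum(service_demand(t, peak, period, phase=i * math.pi / num_services)
               for i in range(num_services))

if __name__ == "__main__":
    for t in range(0, 1800, 300):
        print(t, round(aggregate_demand(t), 1))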
Page 79 of 99
3.2.1 Evolutionary Architecture
Each agent's service type is encoded as a bit string. This represents an agent's gene and resides in its chromosome. Since our model is currently concerned only with an agent's service type, this is the only gene residing in the chromosome. Individuals are selected for reproduction using roulette wheel selection, based on their fitness. Roulette wheel selection involves ranking chromosomes in terms of their fitness and probabilistically selecting them. The selection process is weighted in favour of chromosomes possessing a higher fitness. To ensure that agents already optimal for their environment are not lost in the evolutionary process, elitism is applied. Elitism involves the selection of a certain percentage of the fittest chromosomes and moving them straight into the next generation, avoiding the normal selection process. In creating offspring for the next generation, the selection of two parents is required. Each pairing results in the reproduction of two offspring. During reproduction, crossover and mutation, fundamental principles of genetic algorithms, are applied probabilistically. Crossover involves taking certain aspects/traits of both parents' chromosomes and creating a new chromosome. There are a number of ways to achieve this, including single point crossover, two point crossover and uniform crossover. Our crossover function employs single point crossover, where a point in the bit string is randomly selected, at which crossover is applied. Crossover generally has a high probability of occurrence; mutation, on the other hand, generally does not. Mutation involves randomly altering an agent's bit string, changing aspects of a chromosome. Occurrences of mutation were biased towards an increase or decrease of only 1 of a possible n services. Once the chromosome has been created, an agent is formed and is added to the population. The agent's performance is measured through its fitness value.

3.2.2 SARSA Learning
In this paper we use a classical reinforcement learning algorithm known as Sarsa. Sarsa belongs to a collection of algorithms called Temporal Difference (TD) methods. Not requiring a complete model of the environment, TD methods possess a significant advantage. TD methods have the capability of being able to make predictions incrementally and in an on-line fashion, without having to wait until the episode has terminated.
The learning agent interacts with its environment through a sequence of discretised time steps. At the end of each time period t the agent occupies state s_t ∈ S, where S represents the set of all possible states. Here the agent chooses an action a_t ∈ A(s_t), where A(s_t) is the set of all possible actions within state s_t. The agent receives a numerical reward or return, r_{t+1} ∈ ℜ, and enters a new state s′ = s_{t+1} [12]. The goal of the reinforcement learning agent is to maximise its returns in the long run, often forgoing short term gains in place of long term benefits. By introducing a discount factor γ, (0 < γ < 1), an agent's degree of myopia can be controlled. A value close to 1 for γ assigns a greater weight to future rewards, while a value close to 0 considers only the most recent rewards. For our experiments we have assigned γ a value of 0.9, forcing the agent to place greater emphasis on future rewards.
The update rule for Sarsa is defined as

    Q(s, a) ← Q(s, a) + α[r + γ Q(s′, a′) − Q(s, a)]        (3)

and is calculated each time a non-terminal state is reached. Approximations of Q^π, which are indicative of the benefit of taking action a while in state s, are calculated after each time interval. Actions are chosen based on π, the policy being followed. As mentioned previously, Q^π(s, a) is the value of taking action a while in state s, under policy π. The research presented in this paper uses an ϵ-greedy policy to decide what action to select while in a particular state. In the service environment we wish to reduce the probability of selection for all non-greedy actions. Put simply, we do not wish to select a service for which there is no demand. All non-greedy actions have a probability of selection of ϵ/|A(s)|, with the higher probability, 1 − ϵ + ϵ/|A(s)|, weighted in favour of the greedy strategy.

4. SIMULATOR DESIGN
The simulations presented in this paper involve agents interacting with one another in a discrete time environment. To simulate the performance of the two approaches, slightly different configurations had to be applied. The design of each algorithm is detailed below.

4.1 Evolutionary Simulations
For our experiments, evolution occurs over the entire population, with offspring from the fittest agents replacing only the weakest agents in the population. An agent's fitness A_f is calculated as the sum of all payoffs divided by the number of services provided during the time step:

    A_f = (1/n) ∑_{i=0}^{n} x_i

After each time step an agent's income from service provision is representative of its fitness for the environment, resulting in the fittest agent earning the most income. The population of agents is proportional to the amount of services in the system at any particular time. The percentage of elitism E applied to the population varies depending on whether a service is rising or falling in demand. To ensure adequate service provision to cater for spikes in demand, the genetic algorithm allows for a certain percentage of over-provisioning of supply. This percentage of over-provisioning is probabilistically chosen, where a greater weighting is applied to increasing supply where demand is rising. A crossover rate of 85% and a mutation rate of 5% are also used. Mutation is fundamental to the success of the genetic algorithm; without it, adaptivity could not be achieved. If demand for a particular service approaches 0, evolution may favour a generation of offspring which do not support this service. Without mutation this service will become extinct, resulting in major losses should demand for it increase again. The agent population is initially dispersed randomly, ensuring an even distribution of the available services. Offspring are created using a single point crossover of both parents' service gene, with the actual crossover point being randomly selected each time. Mutation occurred probabilistically throughout this process. Occurrences of mutation were biased towards an increase or decrease of only 1 of a possible n, with n being the number of services available. Pairings produce two offspring, which are added to the agent population, replacing weaker agents. To reflect the instantiation costs of loading the various business logic, the genetic algorithm is penalised each time it evolves. This is achieved by restricting the evolution process to every second time interval. Figure 3 gives a general overview of the design of the genetic algorithm.

Figure 3: Overview of the genetic algorithm design (rank, selection, agents).
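A minimal Python sketch of one evolutionary step in the spirit of Sections 3.2.1 and 4.1 (roulette-wheel selection, elitism, single-point crossover of a bit-string chromosome, probabilistic mutation, replacement of the weakest agents); the elite fraction and the generic bit-flip mutation are simplifying assumptions, since the paper biases mutation towards a change of one service.

import random

CROSSOVER_RATE, MUTATION_RATE = 0.85, 0.05  # rates quoted in Section 4.1

def roulette_select(population, fitness):
    """Fitness-proportionate (roulette-wheel) selection of one chromosome."""
    pick, acc = random.uniform(0.0, sum(fitness)), 0.0
    for chrom, fit in zip(population, fitness):
        acc += fit
        if acc >= pick:
            return chrom
    return population[-1]

def crossover(parent_a, parent_b):
    """Single-point crossover of two bit-string chromosomes, producing two offspring."""
    if random.random() > CROSSOVER_RATE:
        return parent_a[:], parent_b[:]
    point = random.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:], parent_b[:point] + parent_a[point:]

def mutate(chromosome):
    """Flip each bit with a small probability (mutation keeps rare services alive)."""
    return [(1 - bit) if random.random() < MUTATION_RATE else bit
            for bit in chromosome]

def next_generation(population, fitness, elite_fraction=0.1):
    """Elitism plus offspring from roulette-selected parents replacing the weakest."""
    ranked = [c for _, c in sorted(zip(fitness, population),
                                   key=lambda pair: pair[0], reverse=True)]
    offspring = list(ranked[: max(1, int(elite_fraction * len(population)))])
    while len(offspring) < len(population):
        a, b = roulette_select(population, fitness), roulette_select(population, fitness)
        for child in crossover(a, b):
            offspring.append(mutate(child))
    return offspring[: len(population)]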
Page 80 of 99
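The Sarsa update of Eq. (3) combined with ϵ-greedy selection can be written in a few lines; this Python sketch uses a tabular Q function and a hypothetical environment interface, with γ = 0.9 as in the paper and the remaining parameter values chosen only for illustration.

import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore uniformly; otherwise take the greedy action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_episode(env, actions, Q=None, alpha=0.1, gamma=0.9, epsilon=0.1):
    """One on-policy Sarsa episode:
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]   (Eq. 3)."""
    Q = Q if Q is not None else defaultdict(float)
    state = env.reset()                      # hypothetical environment interface
    action = epsilon_greedy(Q, state, actions, epsilon)
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        next_action = epsilon_greedy(Q, next_state, actions, epsilon)
        td_target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
        Q[(state, action)] += alpha * (td_target - Q[(state, action)])
        state, action = next_state, next_action
    return Q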
4.2 Learning Simulations
For our experiments the service environment contains n services from which an agent can choose. The set of all possible service choices C = {c_1, c_2, ..., c_n, c*} contains all the services the environment supports. c* represents a temporary idle state which the agent may enter. When the agent enters this state it is inactive in the environment, freeing up its resources for usage elsewhere. While in the idle state the agent releases its lock on the system resources, allowing another process to obtain this lock and use the resources as it pleases. Once finished, the process relinquishes the lock, allowing the agent to exit the idle state. As stated previously, the policy being followed is an ϵ-greedy policy, meaning loosely: choose the action which results in the highest reward most of the time. All non-greedy actions with the exception of c* are subsequently weighted similarly. The addition of the idle state reduces the effectiveness of the policy during exploration. The provider only wishes to choose this action once it has tried all the other possible actions. To address this we designate a learning phase where non-greedy actions are not equiprobable. After each non-greedy action other than the idle state has been explored, the learning phase ends and all non-greedy actions become equiprobable. This increases the probability of the agent entering the idle state. The agent's action set for each state contains a choice of whether to remain in the current state or switch to one of the n other possibilities: A = {remain, switch_{c_1, c_2, ..., c_n, c*}}. The set of returns achievable is R = {0, +0.5, +1}. Actions that result in successful service provision yield a reward of +1. However, if successful provision is achieved through switching services, the reward is halved to +0.5. Halving the reward only occurs for the first time interval, with the agent receiving a reward of +1 thereafter for successful provision, as long as it does not switch. This represents the expense of instantiating the necessary business logic for the new service. Applying a switching cost in this way enhances the stability of the system, preventing unnecessary switching. From an efficiency perspective this could prove costly if carried out in sufficient quantity. Averaged returns for each state-action pair are stored in a lookup table. Storing Q values this way is possible as the number of states and actions is kept relatively small for analytical purposes. Learning in a more expansive system containing larger numbers of services would require function approximation or neural networks to approximate Q values. The steps involved in the reinforcement learning algorithm are depicted by Algorithm 1 below.

5. EXPERIMENTAL RESULTS
This section compares the performance of both the centralised and decentralised approaches using the simulator discussed in the previous section as a test bed. The experiment evaluates performance in two distinct demand environments. Depicted in Figure 2, the stable curve simulates an environment where consumer demand fluctuates moderately. The volatile demand curve simulates more dramatic demand fluctuations. Since all service demand (both stable and volatile) is generated using sinusoidal functions, the aggregate demand is itself sinusoidal in nature (the sum of sinusoids of equal frequency is again a sinusoid).
The corresponding results are empirically analysed and presented in both graphical and tabular form. A number of the metrics used to explain the results are not intuitive and require further explanation. Demand represents the total aggregate demand curve and not the demand curve for a single service. Aggregated in this demand curve are all the services present in the system at run-time. Missed requests is the percentage of consumer requests which provider agents were unable to fulfil. For analytical purposes we term the difference between total adaptive supply and the fixed rate supply the efficiency E_f of the approach. It can be calculated as follows:

    E_f = ∫_0^T f(t) dt − ∫_0^T g(t) dt ≈ ∑_{t=0}^{T} (f(t) − g(t))        (4)

where T is the terminal time step, the function f(t) represents the fixed supply and g(t) the aggregate adaptive supply (genetic algorithm or reinforcement learning). This value represents the number of resources freed up during the experiment which can subsequently be reallocated to other processes.
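Since supply levels are recorded per discrete time step, Eq. (4) reduces to a simple sum; the Python sketch below computes E_f and, as an assumption about how the percentages reported in Section 5 are derived, expresses it relative to the total fixed-rate supply.

def efficiency(fixed_supply, adaptive_supply):
    """E_f ~ sum_t (f(t) - g(t)): resources freed relative to a fixed-rate supply.
    Both arguments are per-time-step supply levels of equal length (Eq. 4)."""
    return sum(f - g for f, g in zip(fixed_supply, adaptive_supply))

def efficiency_percentage(fixed_supply, adaptive_supply):
    """Efficiency expressed as a percentage of the total fixed-rate supply
    (an assumed normalisation, used here only for illustration)."""
    return 100.0 * efficiency(fixed_supply, adaptive_supply) / sum(fixed_supply)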
Page 81 of 99
Figure 4: Aggregate Curves for RL and GA. Figure 5: Aggregate Curves for RL and GA. (Aggregate service supply under stable and volatile demand: fixed supply, aggregate supply (GA), aggregate supply (RL) and aggregate demand; quantity vs. time.)
5.1 Efficiency Comparison
This experiment evaluates each algorithm's performance in a substantially sized agent community. The number of agents initially added to the community is 400. For analysis purposes only 4 services are available for selection, meaning provider agents must select a service from this service set. Peak quantity demanded for the entire simulation is restricted to 100 units demanded per service. This restriction is only necessary for evaluation purposes, in order to ground our results within measurable bounds. At times of peak demand for individual services, the maximum demand requires 100 provider agents to completely satisfy consumer requests. In a system where supply is fixed and peak demand is known, it is possible to adequately meet demand by setting the minimum level of provision to 100 provider agents per service. This is depicted by the flat-lined fixed supply curve shown in Figures 4 and 5. In service environments it is unlikely that this value would be known or could be reliably estimated from experience. Therefore over-provisioning of providers may be much higher than the minimum fixed supply used here. In any case we demonstrate the performance of our approaches against this minimum fixed rate supply in terms of efficiency. As detailed above, this paper terms efficiency the percentage difference between the flat rate supply and the adaptive approaches.

5.1.1 Stable Demand
Figure 3 shows the performance for both approaches when averaging over 50 runs. From the graph it is clear that the evolutionary approach slightly outperforms the learning approach in adapting to demand. The genetic algorithm converges faster than reinforcement learning, as it quickly identifies the least fit agents and removes them. Reinforcement learning takes longer to converge, as unsuccessful agents explore their non-greedy actions first before moving into the idle state. The average number of service requests missed by service providers was higher for the genetic algorithm than for reinforcement learning. While marginally missing more service requests, the genetic algorithm had a greater efficiency of 28.75%, as opposed to 23.35% for reinforcement learning. Both approaches performed very strongly at meeting the majority of service requests. The results show that the percentage of missed requests for both approaches is very low, standing at 0.9% for the genetic algorithm, with an even lower percentage value of 0.03% for reinforcement learning. Potentially, if the percentage of missed requests is high, consumer confidence may decline, reducing overall demand. An important consideration is to maintain the balance between increased efficiency and the percentage of missed requests. While increasing efficiency generates greater revenues, if this increase comes at the expense of reduced consumer confidence, this could negatively impact market share in the long run.

5.1.2 Volatile Demand
Similarly to the previous experiment, the results are averaged over 50 runs. Figure 4 shows the supply curves for both the genetic algorithm and reinforcement learning where demand for services exhibits greater volatility. It is evident from the graphs that the genetic algorithm displays greater adaptability in this environment. As shown in Figure 4, the genetic algorithm's supply curve tracks the aggregate demand curve. The efficiency of the genetic algorithm, shown in Table 2, is much higher than that of reinforcement learning. The genetic algorithm maintains similar adaptability in both environments, with a performance of 28.73%. Reinforcement learning, however, does not perform as well in the volatile environment, achieving an efficiency value of 14.48%, a reduction of 9.27% from the stable environment. The principal factor affecting the measure of efficiency in reinforcement learning is the number of agents occupying the idle state. In a volatile environment the distribution of consumer demand is more equitable among the provider agents. This results in fewer agents entering the idle state, as they will have received
Page 82 of 99
returns from service provision. If the learning agent does not enter the idle state then other processes will not be able to obtain a lock on its resources, thus reducing efficiency. The percentage of missed requests is higher for the genetic algorithm, at 1.47%, than for reinforcement learning, at 0.05%. Although this value is still quite low, it might be unacceptable for some critical systems, such as stock or banking services. Missing only 0.05% of requests, reinforcement learning performs extremely well in this regard. It is able to maintain a much more stable supply while still achieving an acceptable level of efficiency.

6. CONCLUSIONS
This paper has presented two separate approaches aimed at tackling agent adaptivity in a service environment. Creating dynamic and adaptive processes has been identified as one of the principal research challenges in service oriented computing [9]. Furthermore, these processes should possess the capability of continually morphing themselves to respond to environmental demands.
Earlier we proposed a research question: where service demand is uncertain, which service offering should a provider agent choose to expose? The work outlined in this paper has demonstrated two approaches which successfully created provider agents capable of reconfiguring their service offering to meet the available demand. We showed for both techniques how the provider agents were capable of deciding which service offering to select in order to maximise available revenues. Both approaches have achieved significant performance gains when compared to a fixed rate supply. While the genetic algorithm demonstrated greater efficiency, the costs involved in evolving an entire population of distributed agents may outweigh the resource saving. Gathering global information from hundreds or possibly thousands of service agents in a distributed environment could outweigh the performance benefits detailed in the results section. In smaller service environments, where resources are limited and the computational overhead required to run the genetic algorithm is small, the genetic algorithm will achieve excellent adaptivity and efficiency.
This paper also defined a formal specification for an emergent service environment, depicting it as a continuous action-state space reinforcement learning problem. Learning through trial and error, the agents quickly identified services that were in demand and adapted their offering to meet them. The efficiency of the learning approach outperformed the fixed rate supply for both demand environments. The service level maintained by the learning agents was superior to that of the evolved agents, demonstrating its reliability for critical systems, where failed requests carry large penalties.

7. ACKNOWLEDGEMENT
The authors would like to gratefully acknowledge the continued support of Science Foundation Ireland.

8. REFERENCES
[1] A. Azadeh, S.F. Ghaderi, S. Tarverdian, and M. Saberi. Integration of artificial neural networks and genetic algorithm to predict electrical energy consumption. Applied Mathematics and Computation, 186(2):1731–1741, 2007.
[2] Rajkumar Buyya, Chee S. Yeo, and Srikumar Venugopal. Market-oriented cloud computing: Vision, hype, and reality for delivering IT services as computing utilities. Aug 2008.
[3] Olcay Ersel Canyurt and Harun Kemal Öztürk. Three different applications of genetic algorithm (GA) search techniques on oil demand estimation. Energy Conversion and Management, 47(18-19):3138–3148, 2006.
[4] D. Gmach, J. Rolia, L. Cherkasova, and A. Kemper. Workload analysis and demand prediction of enterprise data center applications. In Workload Characterization, 2007. IISWC 2007. IEEE 10th International Symposium on, pages 171–180, Sept. 2007.
[5] Mariusz Jacyno, Seth Bullock, Michael Luck, and Terry R. Payne. Emergent service provisioning and demand estimation through self-organizing agent communities. In AAMAS '09: Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems, pages 481–488, Richland, SC, 2009. International Foundation for Autonomous Agents and Multiagent Systems.
[6] Yun Fu, Jeffrey Chase, Brent Chun, Stephen Schwab, and Amin Vahdat. SHARP: An architecture for secure resource peering. In Proceedings of the 19th ACM Symposium on Operating System Principles, pages 133–148.
[7] Kevin Lai, Lars Rasmusson, Eytan Adar, Li Zhang, and Bernardo A. Huberman. Tycoon: An implementation of a distributed, market-based resource allocation system. Multiagent Grid Syst., 1(3):169–182, 2005.
[8] Daniel A. Menascé, Emiliano Casalicchio, and Vinod Dubey. On optimal service selection in service oriented architectures. Performance Evaluation, In Press, Corrected Proof, 2009.
[9] M.P. Papazoglou, P. Traverso, S. Dustdar, and F. Leymann. Service-oriented computing: State of the art and research challenges. Computer, 40(11):38–45, Nov. 2007.
[10] Julien Perez, Cécile Germain-Renaud, Balazs Kégl, and Charles Loomis. Grid differentiated services: A reinforcement learning approach. In CCGRID '08: Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid, pages 287–294, Washington, DC, USA, 2008. IEEE Computer Society.
[11] Sebastian Stein, Enrico Gerding, Alex C. Rogers, Kate Larson, and Nicholas R. Jennings. Flexible procurement of services with uncertain durations. In Second International Workshop on Optimisation in Multi-Agent Systems (OptMas), May 2009.
[12] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, 1998.
[13] Gerald Tesauro, Nicholas K. Jong, Rajarshi Das, and Mohamed N. Bennani. On the use of hybrid reinforcement learning for autonomic resource allocation. Cluster Computing, 10(3):287–299, 2007.
[14] David Vengerov. A reinforcement learning approach to dynamic resource allocation. Eng. Appl. Artif. Intell., 20(3):383–390, 2007.
Page 83 of 99
[15] T. Weise, S. Bleul, D. Comes, and K. Geihs. Different
approaches to semantic web service composition. In
Internet and Web Applications and Services, 2008.
ICIW ’08. Third International Conference on, pages
90–96, June 2008.
Page 84 of 99
Proceedings of the AAMAS Workshop on Adaptive and Learning Agents, May 2010, Toronto, Canada
Page 85 of 99
notation and discuss related work. In Sec. 3 we present knowledge transfer using bisimulation metrics, and discuss its theoretical properties. Sec. 4 presents approximants to overcome bisimulation's blindness to a state's value function, while Sec. 5 presents approximations to the bisimulation metrics, designed to speed up the computation. In Sec. 6 we discuss the use of bisimulation with options. Sec. 7 contains empirical illustrations of the proposed algorithms. Finally, in Sec. 8 we conclude and present ideas for future work.

2. BACKGROUND
A Markov decision process (MDP) is a 4-tuple ⟨S, A, P, R⟩, where S is a finite state space, A is a finite set of actions, P : S × A → Dist(S) specifies the next-state transition probabilities (Dist(X) denotes the set of distributions over the set X), and R : S × A → ℝ is the reward function. A policy π : S → A specifies the action choices for each state. The value of a state s ∈ S under a policy π is defined as: V^π(s) = E_π{∑_{t=0}^∞ γ^t r_t | s_0 = s}, where r_t is the reward received at time step t, and γ ∈ (0, 1) is a discount factor. Solving an MDP means finding the optimal value V*(s) = max_π V^π(s), and the associated policy π*. In a finite MDP, there is a unique optimal value function, and at least one deterministic optimal policy. The optimal value function obeys the Bellman optimality equations:

    V*(s) = max_{a∈A} [ R(s, a) + γ ∑_{s′∈S} P(s, a)(s′) V*(s′) ]        (1)

The action-value function, Q* : S × A → ℝ, gives the optimal value for each state-action pair, given that the optimal policy is followed afterwards. It obeys a similar set of optimality equations:

    Q*(s, a) = R(s, a) + γ ∑_{s′∈S} P(s, a)(s′) V*(s′)        (2)

Several types of knowledge can be transferred between MDPs. Existing work includes transferring models (e.g., Sunmola & Wyatt, 2006), using samples obtained by interacting with one MDP to learn a good policy in a different MDP (e.g., Lazaric, Restelli & Bonarini, 2008), transferring values (e.g., Ferrante, Lazaric & Restelli, 2008), or transferring policies. In this paper, we focus on the latter approach, and mention just a few pieces of work most closely related to our approach. The main idea of policy transfer methods is to take policies learned on small tasks and apply them to larger tasks. Sherstov & Stone (2005) show how policies learned previously can be used to restrict the policy space in MDPs with many actions. Taylor et al. (2007) transfer policies, represented as neural network action selectors, from a source to a target task. A hand-coded mapping between the two tasks is used in the process. MDP homomorphisms (Ravindran & Barto, 2002) allow correspondences to be defined between state-action pairs, rather than just states. Follow-up work (e.g., Ravindran & Barto, 2003; Konidaris & Barto, 2007) uses MDP homomorphisms and options to transfer knowledge between MDPs with different state and action spaces. Wolfe & Barto (2006) construct a reduced MDP using options and MDP homomorphisms, and transfer the policy between two states if they both map to the same state in the reduced MDP. Unfortunately, because the work is based on an equivalence relation, rather than a metric, small perturbations in the reward or transition dynamics make the results brittle. Soni & Singh (2006) transfer policies learned in a small domain as options for a larger domain, assuming that a mapping between state variables is given. A closely related idea was presented in (Sorg & Singh, 2009), where the authors use soft homomorphisms to perform transfer and provide theoretical bounds on the loss incurred from the transfer.

3. KNOWLEDGE TRANSFER USING BISIMULATION METRICS
Suppose that we are given two MDPs M1 = ⟨S1, A, P, R⟩, M2 = ⟨S2, A, P, R⟩ with the same action sets, and a metric d : S1 × S2 → ℝ between their state spaces. We define the policy π_d on M2 as

    ∀t ∈ S2.  π_d(t) = π*(arg min_{s∈S1} d(s, t))        (3)

In other words, π_d(t) does what is optimal for the state in S1 that is closest to t according to metric d. Algorithm 1 implements this approach.

Algorithm 1 PolicyTransfer(M1, M2, d)
1: Compute d∼
2: for t ∈ S2 do
3:    s*(t) ← arg min_{s∈S1} d∼(s, t)
4:    π∼(t) ← π*(s*(t))
5: end for
6: return π∼

Note that the mapping between the states of the two MDPs is defined implicitly by the distance metric d. Hence, it is clear that this is an important choice. We will now study the use of bisimulation metrics as a choice for d.
Bisimulation for MDPs was defined by Givan, Dean & Greig (2003) based on the notion of probabilistic bisimulation from process algebra (Larsen & Skou, 1991). Intuitively, bisimilar states have the same long-term behavior.

Definition 1. A relation E ⊆ S × S is said to be a bisimulation relation if whenever sEt:
1. ∀a ∈ A. R(s, a) = R(t, a)
2. ∀a ∈ A. ∀C ∈ S/E. ∑_{s′∈C} P(s, a)(s′) = ∑_{s′∈C} P(t, a)(s′)
where S/E is the set of all equivalence classes in S w.r.t. equivalence relation E. Two states s and t are called bisimilar, denoted s ∼ t, if there exists a bisimulation relation E such that sEt.

Ferns, Panangaden & Precup (2004) defined a bisimulation metric, and proved that it is an appropriate quantitative analogue of bisimulation. The metric is not brittle, like bisimulation: if the transitions or rewards of two bisimilar states are changed slightly, the states will no longer be bisimilar, but they will remain close in the metric. A metric d is a bisimulation metric if for any s, t ∈ S, d(s, t) = 0 ⇔ s ∼ t.
The bisimulation metric is based on the Kantorovich probability metric T_K(d)(P, Q) applied to state probability distributions P and Q, where d is a semimetric on S. It is defined by the following primal linear program (LP):

    max_{u_i, i=1,...,|S|}  ∑_{i=1}^{|S|} (P(s_i) − Q(s_i)) u_i        (4)
    subject to:  ∀i, j.  u_i − u_j ≤ d(s_i, s_j)
                 ∀i.  0 ≤ u_i ≤ 1
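The primal LP of Eq. (4) can be solved directly with an off-the-shelf LP solver; the Python sketch below uses scipy.optimize.linprog, which is an assumption of this illustration (the paper instead solves the equivalent dual as a min-cost flow problem with the CS2 solver).

import numpy as np
from scipy.optimize import linprog

def kantorovich(P, Q, d):
    """T_K(d)(P, Q) via the primal LP of Eq. (4):
    maximise sum_i (P_i - Q_i) u_i  s.t.  u_i - u_j <= d_ij,  0 <= u_i <= 1."""
    n = len(P)
    rows, rhs = [], []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            row = np.zeros(n)
            row[i], row[j] = 1.0, -1.0
            rows.append(row)
            rhs.append(d[i, j])
    res = linprog(c=-(np.asarray(P) - np.asarray(Q)),      # linprog minimises
                  A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(0.0, 1.0)] * n, method="highs")
    return -res.fun

# Tiny usage example with an assumed 3-state metric d:
if __name__ == "__main__":
    d = np.array([[0.0, 0.5, 1.0], [0.5, 0.0, 0.5], [1.0, 0.5, 0.0]])
    print(kantorovich([1.0, 0.0, 0.0], [0.0, 0.0, 1.0], d))  # ~1.0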
Page 86 of 99
The following is the equivalent dual formulation:

    min_{l_kj, k=1,...,|S|, j=1,...,|S|}  ∑_{k,j=1}^{|S|} l_kj d(s_k, s_j)        (5)
    subject to:  ∀k.  ∑_{j=1}^{|S|} l_kj = P(s_k)
                 ∀j.  ∑_{k=1}^{|S|} l_kj = Q(s_j)
                 ∀k, j.  l_kj ≥ 0

Intuitively, T_K(d)(P, Q) calculates the cost of "converting" P into Q under d. The dual formulation is a Minimum Cost Flow (MCF) problem, where the network consists of two copies of the state space, a source node and a sink node. The source node is connected to one of the copies of the state space, each node i with supply equal to P(s_i); each node j of the second copy of the state space is connected to the sink node, each with demand equal to Q(s_j). Each "supply" node is connected to every "demand" node, with the cost from supply node i to demand node j being d(s_i, s_j) (see Ferns et al., 2004 for more details).

THEOREM 3.1. (From Ferns et al., 2004) Let M be the set of

PROOF.
    |Q*_2(t, a_t^T) − V*_1(s)| = |Q*_2(t, a_t^T) − Q*_1(s, a_t^T)|
    = | R(t, a_t^T) + γ ∑_{t′∈S_2} P(t, a_t^T)(t′) V*_2(t′) − ( R(s, a_t^T) + γ ∑_{s′∈S_1} P(s, a_t^T)(s′) V*_1(s′) ) |
    ≤ max_{a∈A} { |R(t, a_t^T) − R(s, a_t^T)| + γ | ∑_{t′∈S_2} P(t, a_t^T)(t′) V*_2(t′) − ∑_{s′∈S_1} P(s, a_t^T)(s′) V*_1(s′) | }
    ≤ max_{a∈A} { |R(t, a) − R(s, a)| + γ T_K(d∼)(P(t, a), P(s, a)) }
    = d∼(s, t)
where the second-to-last line follows from the fact that V*_1 and V*_2 together constitute a feasible solution to the primal LP of T_K(d∼), by Lemma 3.2.

We can now use the last lemmas to bound the loss incurred when using the transferred policy.

THEOREM 3.4. For all t ∈ S2 let a_t^T = π∼(t); then |Q*_2(t, a_t^T) − V*_2(t)| ≤ 2 min_{s∈S1} d∼(s, t).
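Once a metric between the two state spaces has been computed, the transfer step itself (Algorithm 1, whose loss Theorem 3.4 bounds) is just a nearest-neighbour lookup; a Python sketch, assuming a precomputed |S1| × |S2| distance matrix and the optimal source policy as inputs.

import numpy as np

def transfer_policy(d, pi_source):
    """Nearest-source-state transfer in the spirit of Algorithm 1: given a
    |S1| x |S2| distance matrix d (e.g. the bisimulation metric d~) and the
    optimal source policy pi_source, each target state copies the action of
    its closest source state."""
    closest = np.argmin(d, axis=0)              # best source state for each t in S2
    return {t: pi_source[closest[t]] for t in range(d.shape[1])}

# Usage with toy data (assumed, for illustration only):
if __name__ == "__main__":
    d = np.array([[0.1, 0.9, 0.4],
                  [0.8, 0.2, 0.5]])             # 2 source states, 3 target states
    pi_source = {0: "left", 1: "right"}
    print(transfer_policy(d, pi_source))        # {0: 'left', 1: 'right', 2: 'left'}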
Page 87 of 99
Taylor, Precup & Panangaden (2009) introduce lax bisimulation metrics, dL. The idea is to have a metric for state-action pairs rather than just for state pairs. Given two MDPs M1 = ⟨S1, A1, P1, R1⟩ and M2 = ⟨S2, A2, P2, R2⟩, for all s ∈ S1, t ∈ S2, a ∈ A1 and b ∈ A2,

    dL((s, a), (t, b)) = |R1(s, a) − R2(t, b)| + γ T_K(dL)(P1(s, a), P2(t, b))

From the distance between state-action pairs we can then define a state lax-bisimulation metric. We use the same symbol dL for the state lax-bisimulation metric, but the arguments will resolve any ambiguity. For all s ∈ S1 and t ∈ S2:

    dL(s, t) = max( max_{a∈A1} min_{b∈A2} dL((s, a), (t, b)),  max_{b∈A2} min_{a∈A1} dL((s, a), (t, b)) )

Now we can define our transferred policy via Algorithm 2.

Algorithm 2 laxBisimTransfer(S1, S2)
1: Compute dL
2: for all t ∈ S2 do
3:    s_t ← arg min_{s∈S1} dL(s, t)
4:    b_t ← arg min_{b∈A2} dL((s_t, π*(s_t)), (t, b))
5:    πL(t) ← b_t
6: end for
7: return πL

Again, we use the same symbol d≈ for the state metric, but the arguments will resolve any ambiguity. For all s ∈ S1 and t ∈ S2,

    d≈(s, t) = max_{b∈A2} d≈(s, (t, b))

We can now use Algorithm 2 again, but with the new metric d≈, to obtain the transferred policy π≈. In other words, π≈(t) finds the closest state s ∈ S1 to t under d≈ and then chooses the action b from t that is closest to π*(s).
We can then prove the following results.

LEMMA 4.1. For all s ∈ S1 and t ∈ S2, |V*_1(s) − V*_2(t)| ≤ d≈(s, t).

PROOF.
    |V*_1(s) − V*_2(t)| = |Q*_1(s, a*_s) − Q*_2(t, a*_t)|
    ≤ |R(s, a*_s) − R(t, a*_t)| + γ | ∑_{s′∈S_1} P(s, a*_s)(s′) V*_1(s′) − ∑_{t′∈S_2} P(t, a*_t)(t′) V*_2(t′) |
    ≤ |R(s, a*_s) − R(t, a*_t)| + γ T_K(d≈)(P(s, a*_s), P(t, a*_t))        (by induction)
    = d≈((s, a*_s), (t, a*_t))
    ≤ max_{b∈A2} d≈((s, a*_s), (t, b))
    = d≈(s, t)
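A Python sketch of the lax transfer of Algorithm 2, assuming the metric on state-action pairs has already been computed and is stored as a 4-dimensional array d_sa[s, a, t, b] with integer-indexed actions (an illustrative data layout, not the authors' implementation).

import numpy as np

def lax_bisim_transfer(d_sa, pi_source):
    """Transfer in the spirit of Algorithm 2. d_sa[s, a, t, b] is the lax metric on
    state-action pairs; pi_source maps each source state to its optimal action index.
    The induced state metric is
    dL(s,t) = max(max_a min_b d_sa[s,a,t,b], max_b min_a d_sa[s,a,t,b])."""
    d_state = np.maximum(d_sa.min(axis=3).max(axis=1),    # max_a min_b
                         d_sa.min(axis=1).max(axis=2))    # max_b min_a
    policy = {}
    for t in range(d_sa.shape[2]):
        s_t = int(np.argmin(d_state[:, t]))               # closest source state
        policy[t] = int(np.argmin(d_sa[s_t, pi_source[s_t], t, :]))  # action closest to pi*(s_t)
    return policy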
Page 88 of 99
With this result we now have a lower bound on the value of the action transferred which takes into consideration the value function in the source system. This suggests Algorithm 3 to obtain the transferred policy πPess. In other words, πPess(t) uses the source state with the highest guaranteed lower bound on the value of its optimal action. This clearly overcomes the problem of the last example.

Algorithm 3 pessTransfer(S1, S2)
1: Compute d≈
2: for all t ∈ S2 do
3:    for all s ∈ S1 do
4:       LB(s, t) ← V*_1(s) − d≈(s, t)
5:    end for
6:    s_t ← arg max_{s∈S1} LB(s, t)
7:    b_t ← arg min_{b∈A2} d≈(s_t, (t, b))
8:    πPess(t) ← b_t
9: end for
10: return πPess

4.2 Optimistic approach
The idea of the pessimistic approach is appealing as it uses the value function of the source system as well as the metric to guide the transfer. However, there is still an underlying problem in all of the previous algorithms. This is in fact a problem with bisimulation when used for transfer. The problem is an inherent "pessimism" in bisimulation: we always consider the action that maximizes the distance between two states. This pessimism is what equips bisimulation with all the mathematical guarantees, since we are usually "upper-bounding". However, one may (not so infrequently) encounter situations where this pessimism produces a poor transfer. For instance, assume there is a source state s whose optimal action can be transferred with almost no loss as action b in a target state t (i.e. d≈(s, (t, b)) is almost 0); however, assume there is another action c in t such that d≈(s, (t, c)) is very large. This large distance may disqualify state s as a transfer candidate for state t, when it may very well be the best choice! The inherent pessimism of bisimulation would have overlooked this ideal transfer. If we had taken a more "optimistic" approach, then we would have ignored d≈(s, (t, c)) and focused on d≈(s, (t, b)). This idea motivates the main algorithmic contribution of the paper.
We start by defining a new metric, dOpt(s, t) = min_{b∈A2} d≈((s, a*_s), (t, b)), and use Algorithm 3, but with dOpt instead of d≈ in the computation of LB(s, t), to obtain our transferred policy πOpt. In other words, πOpt(t) chooses the action with the highest optimistic lower bound on the value of the action. By removing the pessimism we lose our theoretical properties, so we can no longer say that this lower bound is guaranteed. However, intuition tells us that this should be a better method to guide the transfer. Indeed, we shall see in Section 7 that this method outperforms all the rest.

5. SPEEDING UP THE COMPUTATION
As was mentioned previously, the long computation time of these methods is mainly due to the fact that each iteration of the Kantorovich metric computation (see Theorem 3.1) requires solving |S1| × |S2| × |A2| MCF problems, which is very expensive. In the first approximation we propose to solve T_K(d) only once, with d(s, t) = V*_1(s) − max_{b∈A2} R(t, b) for all s ∈ S1 and t ∈ S2. The intuition behind this rough distance estimate is that we want the target state to try to match the optimal value of the source state. Since the optimal value function for the target system is not known, we use the immediate reward as a myopic estimate.
The second approximant still will not scale very well to large problems, due to the MCF computations. As the number of states increases, the number of variables and constraints of each MCF problem also increases. In this second approximation we fix a number of clusters k, which will split our reward region (i.e. the interval from the minimum reward to the maximum reward) into k regions. For each s ∈ S1 we choose what cluster it belongs to by checking in which of the k reward regions R(s, π*(s)) falls; similarly, for each t ∈ S2 we choose what cluster it belongs to by checking in which of the k reward regions max_{b∈A2} R(t, b) falls. Having thus reduced the state space into k clusters, when we compute T_K(d)(P(s, a), P(t, b)) we are no longer looking at transition probabilities into individual states, but rather transition probabilities into one of the k clusters. By doing so we have put an upper limit on the number of variables and constraints of each MCF. If the reward structure in the domains in question is relatively sparse, we can get away with a small k. Finally, we only iterate the Kantorovich metric computation once, setting d(s, t) = |R(s, π*(s)) − max_{b∈A2} R(t, b)| for all s ∈ S1 and t ∈ S2.

6. BISIMULATION FOR OPTIONS
An option o is a triple ⟨I_o, π_o, β_o⟩, where I_o ⊆ S is the set of states where the option is enabled, π_o : S → Dist(A) is the policy for the option, and β_o : S → [0, 1] is the probability of the option terminating at each state (Sutton, Precup & Singh, 1999). Options are temporally abstract actions and generalize one-step primitive actions. Given that an option o is started at state s, we can define Pr(s′|s, o) as the discounted probability of ending in state s′ given that we started in state s and followed option o. We can also define the expected reward received throughout the lifetime of an option as R(s, o) (see Sutton, Precup & Singh, 1999 for details). Based on the above definition, we introduce bisimulation for MDPs with options.

Definition 2. A relation E ⊆ S × S is said to be an option-bisimulation relation if whenever sEt:
1. ∀o ∈ OPT. R(s, o) = R(t, o)
2. ∀o ∈ OPT. ∀C ∈ S/E. ∑_{s′∈C} Pr(s′|s, o) = ∑_{s′∈C} Pr(s′|t, o)
Two states s and t are said to be option-bisimilar if there exists an option-bisimulation relation E such that sEt. Let s ∼O t denote the maximal option-bisimulation relation.

Similarly, a metric d is an option-bisimulation metric if for any s, t ∈ S, d(s, t) = 0 ⇔ s ∼O t.
Ferns et al. (2004) pass the next-state transition probabilities into T_K(d). However, in our case we will pass in Pr(·|s, o) for s ∈ S and o ∈ OPT, which is a subprobability distribution. To account for this, we add two dummy nodes in the dual formulation of the Kantorovich metric above, which absorb any leftover probability mass. These dummy nodes are still connected as the other nodes, but with a cost of 1 (see Van Breugel & Worrell, 2001 for more details).
Option-bisimulation metrics are very similar to the usual bisimulation metrics in terms of properties, as can be seen from the following theorem:

THEOREM 6.1. Let

    F(d)(s, t) = max_{o∈OPT} ( |R(s, o) − R(t, o)| + γ T_K(d)(Pr(·|s, o), Pr(·|t, o)) )

Then F has a least fixed point d∼, and d∼ is an option-bisimulation metric.
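The pessimistic transfer of Algorithm 3 and its optimistic variant differ only in how the state distance is obtained from d≈(s, (t, b)); the Python sketch below assumes a precomputed array d_stb[s, t, b] in which the source side already uses the optimal source action a*_s, an assumption made purely for this illustration.

import numpy as np

def bound_based_transfer(d_stb, v_source, optimistic=False):
    """Sketch of Algorithm 3 (pessimistic) and its optimistic variant.
    d_stb[s, t, b] approximates d~=(s, (t, b)); v_source[s] = V1*(s).
    LB(s, t) = V1*(s) - d(s, t), with d = max_b (pessimistic) or min_b (optimistic)."""
    d_state = d_stb.min(axis=2) if optimistic else d_stb.max(axis=2)
    lb = v_source[:, None] - d_state                 # shape |S1| x |S2|
    policy = {}
    for t in range(d_stb.shape[1]):
        s_t = int(np.argmax(lb[:, t]))               # source state with best lower bound
        policy[t] = int(np.argmin(d_stb[s_t, t, :])) # closest action in the target state
    return policy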
Page 89 of 99
Table 1: Running times (in seconds) and ‖V2* − V2T‖∞

                First instance (4to4)   Second instance (4to4)  Second instance (4to3)  Second instance (3to4)
Algorithm       Time      ‖V2*−V2T‖∞    Time      ‖V2*−V2T‖∞    Time      ‖V2*−V2T‖∞    Time      ‖V2*−V2T‖∞
Bisim           15.565    0.952872      -         -             -         -             -         -
Lax-Bisim       66.167    0.847645      128.135   0.749583      67.115    0.749583      100.394   0.749583
Pessimistic     25.082    0.954625      47.723    0.904437      24.125    0.875533      37.048    0.904437
Optimistic      23.053    0.335348      47.710    0.327911      24.649    0.360802      39.672    0.002052
Approximant1    0.725     0.744038      1.484     0.744036      0.820     0.721627      1.214     0.744036
Approximant2    0.443     0.744038      0.949     0.632880      0.556     0.532724      0.762     0.632880
The proof is almost identical to that of Theorem 4.5 in (Ferns et al., 2004), so we omit it for succinctness. As shown there, d∼ can be approximated to a desired accuracy δ by applying F for ⌈ln δ / ln γ⌉ steps.
All the results presented for the four algorithms carry over easily to the option-bisimulation case. Their proofs will be omitted for succinctness.

7. EXPERIMENTAL RESULTS
To illustrate the performance of the various policy transfer algorithms, we used the grid world navigation task of (Sutton, Precup & Singh, 1999), consisting of four rooms in a square (a room in each corner) connected by four hallways, one between each pair of rooms. There are four primitive actions: ∧ (up), ∨ (down), < (left) and > (right). When one of the actions is chosen, the agent moves in the desired direction with 0.9 probability, and with 0.1 probability uniformly moves in one of the other three directions or stays in the same place. Whenever a move would take the agent into a wall, the agent remains in the same position.
There are four global options available in every state, analogous to the ∧, ∨, < and > primitive actions. We will refer to them as u, d, l and r, respectively. If an agent chooses option u, then the option will take it to the hallway above its position. If there is no hallway in that direction, then the option will take the agent to the middle of the upper wall. The option terminates as soon as the agent reaches the respective hallway or position along the wall. All other options are similar. There is a single goal placed in one of the hallways, yielding a reward of 1. Everywhere else the agent receives a reward of 0.
The above topology for the rooms can be instantiated with different numbers of cells. We started with a tiny instance, where there are only 8 states: one for each of the rooms, and one for each of the hallways, with the goal in the rightmost hallway (Figure 1).

Figure 1: Tiny instance with the optimal policy. Red state is the goal state.

This tiny domain only has 4 options, which are simply the primitive actions. The various metrics (d∼, dL, d≈, and dOpt) were computed between the tiny instance and each of the larger instances, using a desired accuracy of δ = 0.01, and then the policy transfer algorithms were applied. We also used the two approximants on the optimistic algorithm. For all experiments we used the CS2 algorithm for the MCF problems (Frangioni & Manca, 2006) and a discount factor of γ = 0.9. In the second approximant we set the number of clusters to 8 (note that we are looking at option reward regions, rather than primitive reward regions).
We used a domain with 44 states as the large domain. We varied the number and type of options available in the larger domain. In the first instance the target domain only had primitive actions as options: ∧, ∨, < and >. In the second instance, the domain was equipped with 8 options: {∧, ∨, <, >, u, d, l, r}. Clearly the original bisimulation metric approach could not be run, because of the difference in number of options. We also ran experiments where the target system only has 3 rooms (the bottom right room was removed) but the source still has 4 rooms, and where the source system only has 3 rooms (the bottom right room was removed) and the target system has 4 rooms. Table 1 displays the running times of the various algorithms, as well as ‖V2* − V2T‖∞. In Figure 2 we display the transferred policies when using lax-bisim, the pessimistic and the optimistic approach. The colors indicate which state in the tiny instance was used for the transfer. In Figures 3 and 4 we compare the performance of the various algorithms when used to speed up learning. Standard Q-learning was performed, but the agent was biased towards the transferred policy. In other words, if the agent did not have another option with a better Q-value than the current Q-value estimate for the transferred policy, it would choose the transferred policy. These results clearly demonstrate the superior performance of the optimistic approach.
In Table 2 we examine the performance of the second approximant compared to that of the first as we scale the sizes of the target systems. The first approximant was only able to solve the problem with 104 states, at which point it ran out of memory. We can see that we still obtain reasonable results with the second approximation, even as the number of states gets larger.

8. CONCLUSIONS AND FUTURE WORK
In this paper we presented six new algorithms for performing policy transfer on MDPs that were based on bisimulation metrics. We started off with algorithms that had very strong theoretical results but poor empirical performance. Using these initial algorithms as inspiration, we defined new algorithms that traded some of the theoretical guarantees for improved performance. Finally, we presented two approximation algorithms to overcome the computational overhead of bisimulation metrics. The second of these was shown to scale very well to very large problems. We presented empirical evidence of the suitability of our algorithms for speeding up learning.
Our algorithms would also be very useful if we had a model distribution from which problems were sampled and we wanted to avoid solving the value function for each sampled model. This situation is commonly encountered in Bayesian RL, where a Dirichlet distribution over models is maintained and updated with each transition. Most algorithms sample a number of models from the Dirichlet distribution and solve the value function for each in order to make the next action choice. We could use our algorithms to transfer the policy from the small source to just one of the target systems (the mean model, for instance), and use that policy for all the other samples. It would be useful to obtain empirical evidence to justify these claims, as well as theoretical bounds on the loss of
Page 90 of 99
Figure 2: Transferred policies in second instance. Left: lax-bisim, middle: pessimistic, right: optimistic. (Each panel shows the transferred action, u/d/l/r, for every state.)
Figure 3: Cumulative reward vs. time step in the first instance (left) and the second instance (right) for: no transfer, Bisim, Lax-bisim, Pessimistic, Optimistic, Approximant 1 and Approximant 2.
Page 91 of 99
(Cumulative reward vs. time step for the 3 rooms (target) and 3 rooms (source) second-instance experiments; curves: no transfer, Lax-bisim, Pessimistic, Optimistic, Approximant 1, Approximant 2.)
Figure 4: Comparison of performance of transfer algorithms (left: 4 rooms to 3 rooms, right: 3 rooms to 4 rooms)
Alessandro Lazaric, Marcello Restelli, and Andrea Bonarini. Transfer of samples in batch reinforcement learning. In ICML, pages 544–551, 2008.
Theodore J. Perkins and Doina Precup. Using options for knowledge transfer in reinforcement learning. Technical Report UM-CS-1999-034, University of Massachusetts, Amherst, 1999.
Caitlin Phillips. Knowledge transfer in Markov Decision Processes. Technical report, McGill University, 2006.
Martin L. Puterman. Markov Decision Processes. John Wiley & Sons, New York, NY, 1994.
Balaraman Ravindran and Andrew G. Barto. Model minimization in hierarchical reinforcement learning. In Fifth Symposium on Abstraction, Reformulation and Approximation, 2002.
Balaraman Ravindran and Andrew G. Barto. Relativized options: Choosing the right transformation. In Proceedings of the 20th International Conference on Machine Learning, 2003.
Alexander A. Sherstov and Peter Stone. Improving action selection in MDP's via knowledge transfer. In Proceedings of the 20th National Conference on Artificial Intelligence, 2005.
Vishal Soni and Satinder Singh. Using homomorphism to transfer options across reinforcement learning domains. In Proceedings of the National Conference on Artificial Intelligence (AAAI-06), 2006.
Jonathan Sorg and Satinder Singh. Transfer via soft homomorphisms. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems (AAMAS-2009), 2009.
Funlade T. Sunmola and Jeremy L. Wyatt. Model transfer for Markov decision tasks via parameter matching. In Proceedings of the 25th Workshop of the UK Planning and Scheduling Special Interest Group (Plan-SIG 2006), 2006.
Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.
Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10:1633–1685, 2009.
Matthew E. Taylor, Peter Stone, and Yaxin Liu. Transfer learning via inter-task mappings for temporal difference learning. Journal of Machine Learning Research, 8:2125–2167, 2007.
Jonathan Taylor, Doina Precup, and Prakash Panangaden. Bounding performance loss in approximate MDP homomorphisms. In Advances in Neural Information Processing Systems 21, in press, 2009.
Franck van Breugel and James Worrell. An algorithm for quantitative verification of probabilistic transition systems. In Proceedings of the 12th International Conference on Concurrency Theory (CONCUR), pages 336–350, 2001.
Alicia Peregrin Wolfe and Andrew G. Barto. Defining object types and options using MDP homomorphisms. In Proceedings of the ICML-06 Workshop on Structural Knowledge Transfer for Machine Learning, 2006.
Page 92 of 99
Proceedings of the AAMAS Workshop on Adaptive and Learning Agents, May 2010, Toronto, Canada
ABSTRACT
This paper examines the evolution of agent strategies in a commons dilemma using a tag interaction model. Through the use of a tag-mediated interaction model, individuals can determine their interactions based on their tag similarity. The simulations presented show the significance and benefits of agents that contribute to the commons. A series of experiments examine the importance of the tag space in an n-player dilemma. The paper shows the emergence of cooperation through tag-mediated interactions in the n-player games. Simulation results show the evolution of strategies that contribute heavily to the value of the shared commons.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Artificial Intelligence - Multiagent systems

General Terms
Multi-Agent Cooperation, Learning and Evolution, Tag Mediated Interactions

Keywords
Evolution, Learning, Tag Mediated Interactions, Cooperation

1. INTRODUCTION
When a common resource is being shared among a number of individuals, each individual benefits most by using as much of the resource as possible. While this is the individually rational choice, it results in collective irrationality and a non Pareto-optimal result for all participants. These n-player dilemmas are common throughout many real world scenarios. For example, the computing community is particularly concerned with how finite resources can be used most efficiently where conflicting and potentially selfish demands are placed on those resources. Those resources may range from processor time to bandwidth.
One example commonly used throughout existing research is the Tragedy of the Commons [7]. This outlines a scenario whereby villagers are allowed to graze their cows on the village green. This common resource will be overgrazed and lost to everyone if the villagers allow all their cows to graze, yet if everyone limits their use of the village green, it will continue to be useful to all villagers. Another example is the Diner's Dilemma, where a group of people in a restaurant agree to split their bill equally. Each has the choice to exploit the situation and order the most expensive items on the menu. If all members of the group apply this strategy, then all participants will end up paying more [5].
These games are all classified as n-player dilemmas, as they involve multiple participants interacting as a group. N-player dilemmas have been shown to result in widespread defection unless agent interactions are structured. This is most commonly achieved through using spatial constraints which limit agent interactions to specified neighbourhoods on a spatial grid. Limiting group size has been shown to benefit cooperation in these n-player dilemmas [24]. Agent interaction models such as spatial constraints, social networks and tags offer a basis for agents to determine their peer interactions and the subsequent emergence of cooperation. This paper examines a series of simulations involving a tag-mediated interaction environment. Tags are visible markings or social cues which serve to bias agent interactions based on their similarity [9].
In this paper we will examine an n-player dilemma, and study the evolution of strategies when individuals can contribute some of their payoffs towards the value of the commons. The theory that commitment changes the incentives of players is a familiar principle in economics. Applications of these principles include bargaining [19], monetary policy [17], industrial organisation [4], and strategic trade policy [2]. The simulations presented in this paper use the well known n-player Prisoner's Dilemma (NPD). Agents bias their interactions through a tag mediated environment. The results show the evolution of widespread contributing strategies throughout the population, despite this being a suboptimal strategy.
This paper examines the impact of the tag space and its effects on the emergence of cooperation in the n-player dilemma. In this context the experiments will show the effects of investment strategies on the emergence of cooperation. The research presented in this paper will address three specific research questions:

1. What is the impact of tag space on the emergence of cooperation in an n-player dilemma?
2. Will agent strategies evolve investment properties in a commons dilemma?
3. How do investment strategies impact on the emergence of cooperation in a commons dilemma?

The following section of this paper will provide an introduction to the NPD and a number of well known agent interaction models. The topics of tag-mediated interactions and
the n-player Prisoner's Dilemma will be discussed in detail. In the experimental setup section we will discuss our simulator.

2. RELATED RESEARCH

[Figure: Tragedy of the Commons]
3. SIMULATOR DESIGN

In this section we outline the overall design of our simulator. Firstly, we outline the agent genome and how it influences agent behaviour. We then describe our tag-mediated agent interaction model. This paper examines agent learning through evolution, and as a result we use a genetic algorithm; this algorithm and its parameter settings are also outlined in this section.

3.1 Agent Genome

In our model each agent is represented through an agent genome. This genome holds a number of genes which represent how that particular agent behaves.

Genome = (G_C, G_T, G_I)    (2)

The G_C gene represents the probability of an agent cooperating in a particular move. The G_T gene represents the agent tag. This is represented in the range [0...1] and is used to determine which games each agent participates in. Finally, G_I represents an individual's willingness to contribute to the commons. This is again represented on a scale to indicate that some individuals may choose to invest more than others.

For the first generation these agent genes are generated using a uniform distribution. Over subsequent generations new agent genomes are generated using our genetic algorithm. Each of these genes is an evolved attribute and is fixed for that individual's lifetime; therefore changes in the population only occur through new offspring which have evolved genetic traits.

3.3 Agent Interactions

In our simulations each agent interacts through a fixed bias tag-mediated interaction model. We adopt a similar tag implementation to that outlined by Riolo [15] and more recently by [10]. In our model each agent has a G_T gene which is used as its tag value. Each agent A is given the opportunity to make game offers to all other agents in the population. The intention is that agent A will host a game, and the probability that other agents will participate is determined using the following formulation:

d_{A,B} = 1 − |A_{GT} − B_{GT}|    (4)

This equation is based on the absolute difference between the tag values of two agents A and B. This value is used to generate two roulette wheels, R_{ab} and R_{ba}, for A and B. These two roulette wheels are then used to determine agent A's attitude to B and agent B's attitude to A. An agent B will only participate in the game when both roulette wheels have indicated acceptance.
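As a rough illustration of how the genome of Equation 2 and the participation rule of Equation 4 could be realised, the following Python sketch may help. It is not taken from the authors' simulator: the names (Agent, joins_game), the uniform initialisation helper, and the treatment of the two roulette wheels as two independent draws against d_{A,B} are assumptions, and the bias threshold of Riolo's fixed-bias model is omitted.

```python
import random
from dataclasses import dataclass


@dataclass
class Agent:
    """Genome of Equation 2: all three genes lie in the range [0, 1]."""
    g_c: float  # G_C: probability of cooperating in a particular move
    g_t: float  # G_T: tag value, biases which games the agent joins
    g_i: float  # G_I: willingness to invest part of the payoff in the commons


def random_agent(rng: random.Random) -> Agent:
    # First-generation genes are drawn from a uniform distribution.
    return Agent(rng.random(), rng.random(), rng.random())


def participation_probability(a: Agent, b: Agent) -> float:
    """Equation 4: d_{A,B} = 1 - |A_GT - B_GT|."""
    return 1.0 - abs(a.g_t - b.g_t)


def joins_game(host: Agent, other: Agent, rng: random.Random) -> bool:
    """The two roulette wheels R_ab and R_ba are modelled here as two
    independent draws against d_{A,B}; both must indicate acceptance."""
    p = participation_probability(host, other)
    return rng.random() < p and rng.random() < p
```

In a full simulation each agent would in turn act as host and offer a game to every other agent in the population, with a check such as joins_game deciding which agents form the resulting playing group.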
Each parent has a set of three genes: G_C, G_T and G_I. A probability of 0.9 is applied in favor of selecting two random genes from the fittest parent, and one gene from the other parent. Each gene is exposed to a 2% chance of mutation. When applied to a gene, the mutation operator changes it through a displacement chosen from a Gaussian distribution with a mean of 0 and a standard deviation of 0.5. Since tag values are considered arbitrary, the tag space is viewed as circular. Therefore, when adding and subtracting displacements on the G_T gene, values over 1.0 wrap around and are calculated upwards from 0, while values under 0 are calculated downwards from 1.0. For the two remaining genes the actual value is significant, so displacements resulting in values above 1.0 are set to 1.0, while those below 0 are set to 0.
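The crossover and mutation operators just described could be sketched as follows. This is only an illustration under stated assumptions: the genes are ordered [G_C, G_T, G_I], and the 0.9 is read as the probability that the fitter parent donates the two genes; it is not the authors' implementation.

```python
import random

P_FITTEST = 0.9   # probability that the fitter parent donates two of the three genes
P_MUTATE = 0.02   # per-gene mutation chance
SIGMA = 0.5       # standard deviation of the Gaussian displacement


def crossover(fitter: list[float], other: list[float], rng: random.Random) -> list[float]:
    """Child receives two random genes from one parent and the remaining gene
    from the other; with probability 0.9 the fitter parent supplies the two."""
    two, one = (fitter, other) if rng.random() < P_FITTEST else (other, fitter)
    picked = set(rng.sample(range(3), 2))
    return [two[i] if i in picked else one[i] for i in range(3)]


def mutate(genome: list[float], rng: random.Random) -> list[float]:
    """Gaussian displacement per gene: the tag gene (index 1) wraps around the
    circular tag space, while G_C and G_I are clamped to [0, 1]."""
    g_c, g_t, g_i = genome
    if rng.random() < P_MUTATE:
        g_c = min(1.0, max(0.0, g_c + rng.gauss(0.0, SIGMA)))
    if rng.random() < P_MUTATE:
        g_t = (g_t + rng.gauss(0.0, SIGMA)) % 1.0  # circular: 1.1 -> 0.1, -0.1 -> 0.9
    if rng.random() < P_MUTATE:
        g_i = min(1.0, max(0.0, g_i + rng.gauss(0.0, SIGMA)))
    return [g_c, g_t, g_i]
```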
[Figure 3: average cooperation in the standard commons dilemma over 1000 generations, for populations of 10, 25, 50, 75 and 100 agents]

Figure 3 shows the standard commons dilemma using a number of alternative agent population sizes. This has the effect of altering the size of the tag space in the model. As shown through a similar experiment by Howley et al., the size of the tag space has a significant effect on the emergence of cooperation [11]. Where there are only a small number of agents in the population, cooperators can avoid exploiters much more easily due to the partitioning effects of the tag environment. However, in larger populations this is much more difficult and exploiters are much more likely to be present in an n-player interaction. Therefore cooperation does not evolve in larger populations, where the tag space is undermined. The probability of being exploited is higher with each peer interaction an individual participates in; therefore in larger populations it is generally considered more difficult to establish and maintain cooperative interactions and avoid exploitation.

4.2 N-Player Dilemma with Investment

In this section we examine the extended version of the n-player dilemma. All experimental parameters are held over from the previous experiment.

[Figure 5: Average Investment Gene (N-Player Dilemma with Investments)]

The data shown in Figure 5 shows how the G_I gene evolved in various population sizes. The results show that high investment strategies evolved in almost all cases except for the smallest populations. The evolutionary pressure to evolve such a strategy stems from the increased fitness of those individuals who were part of highly investing, cooperative groups. This combination facilitates many high value
games, which help offset exploitation from rogue strategies on the periphery of the cluster. Once established, this trait will increase throughout the population with the help of the tag mechanism, which promotes homogeneity. As discussed by many previous authors, individuals who share tag values are likely to share many other genetic traits due to their shared ancestry [13].

[Figure 6: Average Commons Value over 1000 generations, for populations of 10, 25, 50, 75 and 100 agents]

The data shown in Figure 6 shows the average value of games in the commons dilemma with investment. These payoffs represent the entire value of the commons which each of the players then had to play for. As would be expected, these payoff values correlate strongly with the number of agents in the population, since a larger population increases the likely number of individuals participating in and contributing to the n-player games. The most significant aspect of this data is the scale of the payoffs for the smallest populations. The influence of the investment extension appears to be smallest in these populations, and this appears to be simply because there were not enough participants contributing for the extension to have a major effect on the population.

To investigate the issue of game participation in more detail, we now examine the average numbers of participants in each of the models.

[Figure 7: Average Number of Participants over 1000 generations, for populations of 10, 25, 50, 75 and 100 agents]

Game participation levels correlate with the population sizes. Increased numbers of participants in an n-player game make the chances of maintaining high levels of cooperation less likely. Exploitation is much more likely to occur in a game with many participants, and therefore undermines any cooperation that may have been established previously. This data correlates with the levels of cooperation we identified previously in Figure 3. The model with the most participants was the least cooperative, while the model with the fewest game participants was the most cooperative.

[Figure 8: Average Participants (N-Player Dilemma with Investments)]

The data shown in Figure 8 mirrors the previous simulations shown in Figure 7. Here we identify many similar features, yet we also notice that the levels of game participation are lower in the extended game environment. This is a result of the extended game encouraging more tag diversity throughout the population. This has the effect of reducing the size of tag groups, and game participation is therefore lower than in the traditional game. This feature only becomes apparent in larger populations, and therefore it is in the largest population that we identify the most significant changes with respect to game participation and the strategies evolved.

4.3 Tag Space Evolution

In this section we will examine the evolution of the tag space in both game environments. In this case we use two populations of 100 agents, and 100 experimental runs.

[Figure 9: Average Proportion of Unique Tag Values over generations, Classic N-Player Dilemma vs. N-Player Dilemma with Investments]
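Before turning to the results, a minimal sketch of how the proportion of unique tag values plotted in Figure 9 (and compared in Table 1 below) could be measured and tested is given here; the rounding resolution used to decide when two real-valued tags count as identical is an assumption, since the paper does not state it.

```python
import statistics
from scipy import stats  # two-tailed t-test, as used for the Table 1 comparison


def unique_tag_proportion(tags: list[float], decimals: int = 6) -> float:
    """Distinct tag values divided by the maximum possible number of
    distinct tags (one per agent); the rounding resolution is an assumption."""
    return len({round(t, decimals) for t in tags}) / len(tags)


def compare_models(classic_runs: list[float], investment_runs: list[float]):
    """Per-run average proportions for the two game environments,
    compared with a two-tailed independent-samples t-test."""
    res = stats.ttest_ind(classic_runs, investment_runs)
    return statistics.mean(classic_runs), statistics.mean(investment_runs), res.pvalue
```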
Figure 9 shows the evolution of tags over time through successive generations. The proportion of unique tag values is calculated relative to the maximum number of possible tag values, and is then recorded and averaged over many generations and experiments. In the initial generations there is a very large number of tag values; however, over time the number of unique tag values falls and converges to a relatively small number of tag groups. This is an indication of the high levels of mimicking that occur due to tags. However, we observe higher numbers of unique tag values in the n-player commons with investment. This indicates more tag groups and a higher degree of tag diversity throughout the population. The n-player dilemma with investment helps clusters of cooperators to emerge and avoid being exploited by invaders. This happens because potential defectors must have a reserve fund available to contribute in line with their strategy, or else they cannot participate in the game. Furthermore, high value games have the added benefit of spreading wealth among a group of individuals with similar tag values, which has the effect of promoting that group's strategy traits throughout the population in the following generation. This inevitably encourages cooperation and contribution strategies, as the group can only achieve high levels of fitness if it is composed of high cooperators and investors.

Table 1: Average Unique Tag Values

    Model                                µ         σ
    Classic N-Player Dilemma             6.31%     0.52%
    N-Player Dilemma with Investments    12.698%   0.87%

As indicated in Figure 9, and also through the data presented in Table 1, the differences between the two models are significant. The data shown in Table 1 is recorded from 100 experimental runs using populations of 100 agents. The differences between the levels of unique tags recorded in each model were found to be statistically significant when examined using a two-tailed t-test with a 95% confidence interval. These differences reinforce our observations earlier in the paper regarding the contrasting levels of cooperation recorded in the models.

5. CONCLUSIONS

In this paper we have examined a number of issues. Firstly, we proposed an adaptation of the classic n-player commons dilemma to include agent investments into the commons. This was achieved while maintaining the essential characteristics of the game. Secondly, we examined the tag space and its significance with respect to the emergence of cooperation in the n-player PD. The study of n-player games using tags is a recent area of research, and the significance of the tag space is an important consideration in that study. The results presented in this paper show that the relationship between the tag space and the population size is vitally important in the n-player commons dilemma. Finally, this paper has outlined a series of experiments showing the significant impact of individuals investing in the commons. Importantly, we have learned that this has the effect of encouraging cooperation when sufficient numbers of individuals are participating in the n-player games. While quite unstable for small populations, this new adapted game offers a means of engendering cooperation in larger populations, where cooperation is more difficult to achieve. There are many alternative means of encouraging cooperation in these scenarios, but in this case we wanted to show the benefits of individuals investing in their shared resource.

Earlier in this paper we posed a number of research questions. We will now refer to these through the following subheadings.

Tag Space: The tag space was found to be very significant in the emergence of cooperation in the classic n-player commons dilemma. Clustering was much more difficult to achieve in larger populations when compared to smaller populations.

Investment Strategies: Throughout our simulations we identified the emergence of agents with high investment genes. This was promoted through the clustering of the tag environment and the mimicry that it encourages.

Emergence of Cooperation: The emergence of investment strategies and the emergence of cooperation in large populations are very closely linked. The fitness of certain tag clusters is assisted by the investment mechanism, as this helps avoid exploitation. Furthermore, the increased payoffs help sustain and promote the evolution of a particular tag cluster. This has the effect of promoting that tag's associated strategy characteristics.

Through addressing these research questions, we have provided a clear picture of the importance of the tag space with respect to the n-player commons dilemma. Importantly, we have also shown how investment strategies can result in the promotion of cooperative traits in otherwise difficult conditions. This highlights the potential benefits of studying extensions to these well known games.

6. SUMMARY AND FUTURE RESEARCH

In this paper we have presented a novel extension to the n-player commons dilemma and shown the significant effects this extension has on the emergence of cooperation. Furthermore, this paper has outlined a number of significant experimental results which show the evolution of cooperation with respect to agent investment and cooperation choices. In future research we would like to study the effects of investment mechanisms on the conservation and more efficient utilisation of common resources, such as fossil fuels, water and computing resources.

7. ACKNOWLEDGMENTS

The authors would like to gratefully acknowledge the continued support of Science Foundation Ireland.

8. REFERENCES

[1] S. W. Benard. Adaptation and network structure in social dilemmas. Paper presented at the annual meeting of the American Sociological Association, Atlanta Hilton Hotel, Atlanta, GA, 2003.
[2] J. A. Brander and B. J. Spencer. Export subsidies and international market share rivalry. NBER Working Papers 1464, National Bureau of Economic Research, Inc., Sept. 1984.
[3] J. Coleman. Foundations of Social Theory. Belknap Press, August 1998.
[4] C. Fershtman and K. L. Judd. Equilibrium incentives in oligopoly. Discussion Papers 642, Northwestern University, Center for Mathematical Studies in Economics and Management Science, Dec. 1984.
[5] N. S. Glance and B. A. Huberman. The dynamics of social dilemmas. Scientific American, 270(3):76–81, 1994.
[6] D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley Professional, January 1989.
[7] G. Hardin. The tragedy of the commons. Science, 162(3859):1243–1248, December 1968.
[8] J. Holland. The effects of labels (tags) on social interactions. Working Paper 93-10-064, Santa Fe Institute, 1993.
[9] J. H. Holland. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. University of Michigan Press, 1975.
[10] E. Howley and J. Duggan. The evolution of agent strategies and sociability in a commons dilemma. Lecture Notes in Computer Science, Springer-Verlag, Berlin, in press.
[11] E. Howley and C. O'Riordan. The emergence of cooperation among agents using simple fixed bias tagging. In Proceedings of the 2005 Congress on Evolutionary Computation (IEEE CEC'05), volume 2, pages 1011–1016. IEEE Press, 2005.
[12] M. Macy and A. Flache. Learning dynamics in social dilemmas. P Natl Acad Sci USA, 99(3):7229–7236, 2002.
[13] M. Matlock and S. Sen. Effective tag mechanisms for evolving coordination. In AAMAS '07: Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, pages 1–8, New York, NY, USA, 2007. ACM.
[14] A. McDonald and S. Sen. The success and failure of tag-mediated evolution of cooperation. In LAMAS, pages 155–164, 2005.
[15] R. Riolo. The effects and evolution of tag-mediated selection of partners in populations playing the iterated prisoner's dilemma. In ICGA, pages 378–385, 1997.
[16] R. Riolo, M. Cohen, and R. Axelrod. Evolution of cooperation without reciprocity. Nature, 414:441–443, 2001.
[17] K. Rogoff. The optimal degree of commitment to an intermediate monetary target. The Quarterly Journal of Economics, 100(4):1169–89, November 1985.
[18] Y. Sato. Trust, assurance, and inequality: A rational choice model of mutual trust. The Journal of Mathematical Sociology, 26(1):1–16, 2002.
[19] T. C. Schelling. The Strategy of Conflict. Harvard University Press, Cambridge, 1960.
[20] K. Suzuki. Effects of conflict between emergent charging agents in social dilemma. In MAMUS, pages 120–136, 2003.
[21] G. Torsvik. Social capital and economic development: A plea for the mechanisms. Rationality and Society, 12(4):451–476, 2000.
[22] T. Yamagishi and K. S. Cook. Generalized exchange and social dilemmas. Social Psychology Quarterly, 56(4):235–248, 1993.
[23] T. Yamashita, R. L. Axtell, K. Kurumatani, and A. Ohuchi. Investigation of mutual choice metanorm in group dynamics for solving social dilemmas. In MAMUS, pages 137–153, 2003.
[24] X. Yao and P. J. Darwen. An experimental study of n-person iterated prisoner's dilemma games. Informatica, 18:435–450, 1994.