Proceedings of the AAMAS Workshop on Adaptive and Learning Agents
May 2010
Toronto, Canada
Editors
Marek Grześ and Matthew E. Taylor
This year’s edition of the Adaptive and Learning Agents workshop is the third after
the ALAMAS and ALAg workshops merged. ALAMAS was an annual European
workshop on Adaptive and Learning Agents and Multi-Agent Systems, held eight
times. ALAg was the international workshop on Adaptive and Learning agents, typ-
ically held in conjunction with AAMAS. To increase the strength, visibility, and
quality of the workshops, ALAMAS and ALAg were merged into the ALA workshop,
and a steering committee was appointed to guide its development. We are very happy
to present to you the proceedings of this special edition of the ALA workshop.
We thank all authors who responded to our call-for-papers. We expect that the
workshop will be both lively and informative, refining and producing future research
ideas. We are thankful to the members of the program committee for their high
quality reviews. We would like to thank all the members of the steering committee
for their guidance, and the AAMAS conference for providing an excellent venue for
our workshop.
Marek Grześ
Department of Computer Science
University of York
UK
[email protected]
Matthew E. Taylor
Department of Computer Science
The University of Southern California
USA
[email protected]
Program Committee
Steering Committee
Franziska Klügl
Daniel Kudenko
Ann Nowé
Lynne E. Parker
Sandip Sen
Peter Stone
Kagan Tumer
Karl Tuyls
CONTENTS
Foreword ii
Organisation iii
Contributed Papers
Learning to Take Turns 1
Peter Vrancx, Katja Verbeeck, and Ann Nowé
RESQ-learning in stochastic games 8
Daniel Hennes, Michael Kaisers, and Karl Tuyls
Adaptation of Stepsize Parameter to Minimize Exponential
Moving Average of Square Error by Newton’s Method 16
Itsuki Noda
Transfer Learning for Reinforcement Learning on a Physical Robot 24
Samuel Barrett, Matthew E. Taylor, and Peter Stone
Reinforcement Learning with Action Discovery 30
Bikramjit Banerjee and Landon Kraemer
Convergence, Targeted Optimality, and Safety in Multiagent Learning 38
Doran Chakraborty and Peter Stone
An Approach to Imitation Learning For Physically Heterogeneous
Robots 45
Jeff Allen and John Anderson
Multi-agent Reinforcement Learning with Reward Shaping
for KeepAway Takers 53
Sam Devlin, Marek Grześ, and Daniel Kudenko
Learn to Behave! Rapid Training of Behavior Automata 61
Sean Luke and Vittorio Ziparo
Policy Search and Policy Gradient Methods
for Autonomous Navigation 69
Matt Knudson and Kagan Tumer
A Comparison of Learning Approaches to Support
the Adaptive Provision of Distributed Services 77
Enda Barrett, Enda Howley, and Jim Duggan
Using bisimulation for policy transfer in MDPs 85
Pablo Samuel Castro and Doina Precup
The Evolution of Cooperation and Investment Strategies
in a Commons Dilemma 93
Enda Howley and Jim Duggan
Learning to Take Turns

Peter Vrancx
Computational Modeling Lab
Vrije Universiteit Brussel
Brussels, Belgium

Katja Verbeeck
Information Technology
KaHo St. Lieven (KULeuven)
Ghent, Belgium

Ann Nowé
Computational Modeling Lab
Vrije Universiteit Brussel
Brussels, Belgium
Table 1: Example Markov game with 2 states and 2 agents. Each agent has 2 actions in each state: actions a1 and a2 for agent 1 and b1 and b2 for agent 2. Rewards for joint actions in each state are given in the first row as matrix games. The second row specifies the transition probabilities to both states under each joint action.

            State 1                          State 2
  R:           b1        b2                     b1        b2
       a1   0.2/0.1     0/0             a1   1.0/0.5     0/0
       a2     0/0     0.2/0.1           a2     0/0     0.6/0.9

  T:   (a1,b1): (0.5,0.5)               (a1,b1): (0.5,0.5)
       (a1,b2): (0.5,0.5)               (a1,b2): (0.5,0.5)
       (a2,b1): (0.5,0.5)               (a2,b1): (0.5,0.5)
       (a2,b2): (0.5,0.5)               (a2,b2): (0.5,0.5)

idea is to tackle a Markov game by decomposing it into a set of multi-agent common interest problems, each reflecting one agent's preferences in the system. Simple reinforcement learning agents using Parameterised Learning Automata [11] are able to solve this set of MMDPs in parallel. A trusted third party is used to enforce that each MMDP is played and solved equally well. There is no need for the agents in the system to know which problem or reward function they are confronted with. As a result, a team of simple learning agents becomes able to switch play between desired joint policies rather than mixing individual policies. The role of the third party is minimal in the sense that only simple coordination signals need to be communicated. In case all agents fully trust their opponent players to stick with the proposed learning mechanism, the third party is even unnecessary. We will show how this technique can lead to turn-taking behavior in 2 different Markov games.

This paper is organized as follows: in the next section we introduce some background knowledge needed to develop our algorithm. In Section 3 our approach for learning correlated policies in Markov games is described. We explain our decomposition method and how it can easily be used in combination with a third party, comparable to the private signal modelled in the CE concept. We demonstrate this approach on a simple 2-state Markov game and a larger grid world problem in Section 4. We end with a discussion in Section 5.

2. BACKGROUND

In this section we describe some basic formalisms and background concepts used throughout the rest of this paper.

2.1 Markov Games

In this paper we adopt the formal setting of Markov games (also called stochastic games). Markov games are a straightforward extension of single agent Markov decision problems (MDPs) to the multi-agent case. A Markov game consists of a set of states S and a set of N agents. In each state s_i, A_{ik} = {a_{ik1}, . . . , a_{ikr}} is the action set available for agent k, with k : 1 . . . N. Actions in the game are the joint result of multiple agents choosing an action independently. The transition function T(s_i, a_i) and reward function R_k(s_i, a_i) determine the probability of moving to another state and the reward for each agent k, depending on the current state s_i and the joint action in this state s_i, i.e. a_i = (a_{i1}, . . . , a_{iN}) with a_{ik} ∈ A_{ik}. The reward function R_k(s, a) can be individual to each agent k, meaning that different agents can receive different rewards for the same state transition.

The goal of each individual agent in the game is to find a policy which maps each state to a strategy in order to maximize its reward. In this paper we consider the limit average reward, meaning that agents try to maximize their average reward over time. For a joint policy α consisting of a policy for each agent in the system, the limit average reward to agent k is defined as:

  J_k(α) ≡ lim_{l→∞} E[ (1/l) Σ_{t=0}^{l−1} R_k(s(t), a_1(t), . . . , a_N(t)) ]        (1)

In the remainder of this paper we will assume that the Markov chain of system states under every joint policy is ergodic. A Markov chain {x_l}_{l≥0} is said to be ergodic when the distribution of the chain converges to a limiting distribution π(α) = (π_1(α), . . . , π_{|S|}(α)) with ∀i, π_i(α) > 0 as l → ∞.

Due to the individual reward functions of the agents, it is in general impossible to find an optimal policy for all agents simultaneously. Instead, most approaches seek equilibrium points. In an equilibrium, no agent can improve its reward by changing its policy if all other agents keep their policy fixed. In a special case of the general Markov game framework, the so-called team games or multi-agent MDPs (MMDPs) [3], optimal policies do exist. In this case, the Markov game is purely cooperative and all agents share the same reward function. This specialization allows us to define the optimal policy as the joint agent policy which maximizes the payoff of all agents.

An example Markov game is given in Table 1. Each column in this table specifies one state of the problem. The first row gives the immediate rewards agents receive for a joint action, while the second row gives the transition probabilities. In this case transition probabilities are independent of the joint action chosen and the system moves to either state with equal probability.

2.2 Parameterised Learning Automata

Learning Automata are simple reinforcement learners which attempt to learn an optimal action, based on past actions and environmental feedback. Formally, the automaton is described by a tuple {A, β, p, U} where A = {a_1, . . . , a_r} is the set of possible actions the automaton can perform, p is the probability distribution over these actions, β is a random variable between 0 and 1 representing the environmental response, and U is a learning scheme used to update p.

In this paper we will make use of the so-called Parameterized Learning Automata (PLA) [11]. Instead of modifying probabilities directly, PLA use a parameter vector u(t) together with an exploration function g(u) and an update rule based on the REINFORCE algorithm [17]:

  u_i(t+1) = u_i(t) + λβ(t) ∂ln g/∂u_i (u(t), α(t)) + λh′(u_i(t)) + √λ b s_i(t)        (2)
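To make the limit average reward in (1) concrete, the following minimal sketch simulates the two-state game of Table 1 under a fixed joint policy and estimates J_k by averaging sampled rewards; the policy values and helper names are illustrative assumptions, not part of the paper. For the pure joint policy in which both agents always play their first action, the estimate should come out near (0.6, 0.3).

```python
import random

# Rewards for (state, a1, a2): pairs (reward agent 1, reward agent 2), taken from Table 1.
R = {
    (0, 0, 0): (0.2, 0.1), (0, 0, 1): (0.0, 0.0),
    (0, 1, 0): (0.0, 0.0), (0, 1, 1): (0.2, 0.1),
    (1, 0, 0): (1.0, 0.5), (1, 0, 1): (0.0, 0.0),
    (1, 1, 0): (0.0, 0.0), (1, 1, 1): (0.6, 0.9),
}

def step(state, a1, a2):
    """One transition of the Table 1 game: rewards from R, next state uniform over {0, 1}."""
    r1, r2 = R[(state, a1, a2)]
    next_state = random.randrange(2)      # every joint action moves to either state with prob. 0.5
    return next_state, r1, r2

def limit_average_reward(policy1, policy2, steps=200_000):
    """Estimate J_1 and J_2 of Eq. (1) by averaging rewards along one long trajectory."""
    state, total1, total2 = 0, 0.0, 0.0
    for _ in range(steps):
        a1 = 0 if random.random() < policy1[state] else 1   # policy[s] = Pr(first action in state s)
        a2 = 0 if random.random() < policy2[state] else 1
        state, r1, r2 = step(state, a1, a2)
        total1 += r1
        total2 += r2
    return total1 / steps, total2 / steps

if __name__ == "__main__":
    # Example joint policy: agent 1 always plays a1, agent 2 always plays b1.
    print(limit_average_reward(policy1=[1.0, 1.0], policy2=[1.0, 1.0]))
```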
where h′(x) is the derivative of h(x):

  h(x) = −K(x − L)^{2n}   if x ≥ L
         0                if |x| ≤ L                                            (3)
         −K(x + L)^{2n}   if x ≤ −L

{s_i(t) : t ≥ 0} is a set of i.i.d. variables with zero mean and variance σ², b is the learning parameter, σ and K are positive constants and n is a positive integer. In this update rule, the second term is a gradient following term, the third term is used to keep the solutions bounded and the final term is a random noise term. The gradient term is part of the original REINFORCE algorithm and allows agents to locally optimize their rewards. In [11] however, it was shown that this original algorithm is only locally optimal. Moreover, it was found that the algorithm could give rise to unbounded behavior, causing the values in u to go to infinity. To deal with these issues the authors in [11] added the extensions above. The random noise term √λ b s_i(t) is based on the concepts used in simulated annealing. It adds a random walk to the update that allows the algorithm to escape local optima that are not globally optimal. Additionally, the range for each value in the vector u is now limited to the interval [−L, L]. The h′(u_i(t)) term keeps each u_i bounded with |u_i| ≤ L. This term is 0 when the parameter u_i being updated is within the desired interval, but becomes either negative or positive when u_i leaves this interval. Provided that L is taken sufficiently large, the resulting update can still closely approximate the optimal solution, without resulting in unbounded behavior.

Groups of learning automata can be interconnected by using them as players in a repeated game. In such a game multiple automata interact with the same environment. A play a(t) = (a_1(t), . . . , a_n(t)) of n automata is a set of strategies chosen by the automata at stage t. Correspondingly, the response is now a vector β(t) = (β_1(t), . . . , β_n(t)), specifying a payoff for each automaton.

At every instance, all automata update their action probabilities based on the responses of the environment. Each automaton participating in the game operates without information concerning the number of participants, their strategies, their payoffs or actions.

In common interest games, where all agents receive the same feedback and a clear optimal solution exists, PLA can be used to assure that this global optimum is reached [11].

2.3 Automata Learning in Markov games

Besides the repeated games mentioned in the previous section, LA can also be used in more complex, multi-state problems. We now explain an automata-based algorithm, capable of finding pure equilibria in general-sum Markov games [15, 13] and optimal policies in MMDPs [14]. In the next section, this algorithm will serve as a building block for our turn taking approach. The algorithm is an extension of an LA algorithm for solving MDPs, originally proposed by Wheeler and Narendra [16].

The main idea behind the algorithm is that agent k associates a different learning automaton LA_k^i with each state s_i. The agents then defer the actual action selection in each state to the automaton they have associated with that state. Each time step each agent k in the system activates the LA_k^i that it associates with the current system state s_i. The joint action a_i, consisting of the actions of all automata associated with s_i, then triggers a transition to the next system state s_j and an individual reward R_k^i(s_i, a_i) for each agent. The agents then repeat the process in state s_j.

Automata in the system are not informed of the immediate reward that their joint action triggers. Instead each agent keeps track of the cumulative reward it has gathered up to the current time step. When the system returns to a state s_i that was previously visited, each agent k computes the time Δt_i that has passed since the last visit and the reward Δr_k^i that it has gathered since. Automaton LA_k^i then updates the action a it took last time using the following feedback:

  β_k^i = Δr_k^i / Δt_i                                                          (4)

In [15] it is shown that the behavior of this algorithm can be analyzed by examining an approximating limiting game. This game approximates the full Markov game by a single repeated game. Since the limiting game is an automata game, this limiting game view allows us to predict the behavior of the algorithm based on the convergence properties of the update rule used by the automata. When used with a common LA update scheme called linear reward-inaction [10], the system can be shown to converge towards pure Nash equilibria [15]. In the special case of MMDPs, where all agents receive the same reward and a globally optimal equilibrium still exists, the PLA introduced in the previous section can be used to achieve convergence to this global optimum [14]. In this paper we introduce another approach, in which the agents alternate between different joint policies in a general sum Markov game. In the next section we will show how this can be implemented, using the automata algorithm above.

3. MARKOV GAME TURN TAKING ALGORITHM

The main idea behind our algorithm is to split the Markov game into a number of common interest problems, one for each agent. These problems are then solved in parallel, allowing agents to switch between different joint policies, in order to satisfy different agents' preferences. The agents learn the preferred outcome for each participant in the game.

3.1 Markov game Decomposition

We develop a system in which agents alternate between optimising different agent goals in order to satisfy all agents in the system. This implies that we let agents switch between playing different joint policies.

To allow agents to switch between objectives we use a system based on Policy Time Sharing (PTS) approaches used in constrained MDPs [1]. A related approach was also used in a multi-objective reinforcement learning setting in [9]. In these systems, a single controller (i.e. agent) switches between alternate policies to keep a vector of payoffs in a target set. In this paper, on the other hand, we will consider a system composed of multiple independent controllers, each with an individual scalar payoff.

In a policy time sharing system the game play is divided into a series of periods. A single recurrent¹ state in the system is selected as the switch state. Play is then divided into episodes, with a single episode comprising the time-steps between 2 subsequent visits to the switch state. Each episode, a common predetermined system is used to select the reward function to optimize during the next episode.

¹ A recurrent state is a non-transient state. In the ergodic systems under study here all states are recurrent.
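The PLA update in (2)-(3) can be sketched as follows. This is a minimal illustration that assumes a softmax (Boltzmann) exploration function g(u); the constants are placeholders rather than the settings used in Section 4, and the noise scaling follows the reconstruction of Eq. (2) above.

```python
import math
import random

def h_prime(x, K=1.0, n=1, L=3.0):
    """Derivative of the bounding function h(x) in Eq. (3); zero inside [-L, L]."""
    if x >= L:
        return -2 * n * K * (x - L) ** (2 * n - 1)
    if x <= -L:
        return -2 * n * K * (x + L) ** (2 * n - 1)
    return 0.0

def boltzmann_probs(u):
    """Exploration function g(u): action probabilities via a softmax over the parameters."""
    m = max(u)
    e = [math.exp(ui - m) for ui in u]
    z = sum(e)
    return [ei / z for ei in e]

def pla_update(u, action, beta, lam=0.05, sigma=0.001, b=1.0, K=1.0, n=1, L=3.0):
    """One PLA step (Eq. 2): REINFORCE gradient of ln g, bounding term h', Gaussian noise."""
    probs = boltzmann_probs(u)
    new_u = []
    for i, ui in enumerate(u):
        # d ln g_a / d u_i for a softmax g: (1 - p_i) if i is the chosen action, else -p_i.
        grad = (1.0 if i == action else 0.0) - probs[i]
        noise = random.gauss(0.0, sigma)      # s_i(t): zero mean, standard deviation sigma
        new_u.append(ui + lam * beta * grad
                        + lam * h_prime(ui, K, n, L)
                        + math.sqrt(lam) * b * noise)
    return new_u

# Usage: sample an action from g(u), observe feedback beta in [0, 1], then update u.
u = [0.0, 0.0]
a = random.choices(range(2), weights=boltzmann_probs(u))[0]
u = pla_update(u, a, beta=0.7)
```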
(Figure 1: Markov game decomposition — the n-player Markov game is decomposed into MMDP 1, MMDP 2, . . . , MMDP n, one for each agent.)

(Figure 2, outline of the algorithm steps — at the start of a period the dispatcher determines the worst-performing agent i and sends the rewards; the agents update the PLAs used in the previous period and play the new period using the PLAs for objective i.)
3.2 Combining Joint Policies

Using the switching mechanism described above we can learn the joint policies that maximize each agent's individual payoff. One additional requirement to implement this system is a mechanism to decide which MMDP will be played next. This mechanism determines how the different joint policies learnt in the set of MMDPs are combined into a single solution and consequently how much each agent's goal is optimized.

Different methods could be used to implement the coordination mechanism. One possibility is to implement a communication protocol to let agents exchange rewards and negotiate about the next agent to aid. Alternatively it can be implemented using a centralized mechanism. In our setting we implement this switching mechanism using a separate dispatcher agent. This agent is separate from the other agents and does not participate in the actual learning problem. Instead this agent coordinates all other agents and determines the reward to optimize next. In this way the actual learning agents do not need information on the actions and rewards of others or even the fact that other agents are present in the system. Whenever the system reaches the switch state, the current episode ends and the dispatcher becomes active. The dispatcher then collects the total rewards up to the current time step for each agent and sends each agent in the system 2 pieces of information: a feedback for the last episode and the index of the next problem to be played. Figure 2 gives an outline of the algorithm steps.

The feedback is used by the agents to update the automata they used in the last episode. Since we assume the problem is ergodic, a single scalar reward is sufficient to update all automata in states visited during the last episode. The dispatcher can calculate this feedback by simply determining the average reward the agent corresponding to the last episode's goal gathered during the episode. The problem index sent to the learning agents indicates the next reward to be maximized. The learning agents themselves do not need to know whose reward they are optimizing; they can simply use the index to select the corresponding automata during the next episode.

The dispatcher can select from a wide variety of possible strategies to determine the next objective to optimize, depending on the desired outcome of the system. One possibility, for example, is to assign a fixed weight to each agent, which is then used by the dispatcher to determine the probability of selecting each agent as the next objective. Alternatively, the dispatcher could opt to maximize the maximum over the players' rewards and always select the agent having the highest possible payoff (also called a republican selection mechanism [6]). In [6] several possible mechanisms are discussed in the context of selecting a correlated equilibrium to use in updating the value function.

In this paper we focus on an egalitarian selection mechanism. This means we try to maximize the minimum of the players' rewards, and the dispatcher will always choose to optimize the payoff of the worst performing agent, i.e. the agent with the lowest average reward over time for the entire running time. In this way we can resolve dilemmas resulting from agents having different preferences for the game outcomes, by letting them take turns to play their optimum outcome. This allows each agent to achieve their desired objective at least some of the time. In situations such as the Battle of the Sexes game of Table 2, this assures that no agent will always be stuck with the minimum payoff.

Table 3: Possible outcomes for the Markov game in Table 1. Nash equilibria are marked with an asterisk.

                                     Agent 2
                   (b1,b1)     (b1,b2)     (b2,b1)     (b2,b2)
  Agent 1 (a1,a1)  0.6/0.3*    0.1/0.05    0.5/0.25      0/0
          (a1,a2)  0.1/0.05    0.4/0.5*      0/0       0.3/0.45
          (a2,a1)  0.5/0.25      0/0       0.6/0.3*    0.1/0.05
          (a2,a2)    0/0       0.3/0.45    0.1/0.05    0.4/0.5*

4. EXPERIMENTS

In this section we demonstrate the behavior of our approach on 2 Markov games and show that it does achieve a fair payoff division between agents. As a first problem setting we use the Markov game of Table 1. Table 3 lists the possible combinations of deterministic policies for this game, together with their expected average reward for each agent. We observe that the Markov game has 4 pure equilibrium points. All of these equilibria have asymmetric payoffs, with 2 equilibria favoring agent 1 and giving payoffs (0.6, 0.3), and the other equilibria favoring agent 2 with payoffs (0.4, 0.5). Figure 3 gives a typical run of the algorithm, which shows that agents equalize their average reward, while still obtaining a payoff between both equilibrium payoffs. All PLA used a Boltzmann exploration function and update parameters λ = 0.05, σ = 0.001, L = 3.0, K = n = 1.0. These parameter settings were determined empirically based on settings reported in [11, 14]. Over 20 runs of 100000 iterations the agents achieved an average payoff of 0.42891 (std. dev: 0.00199), with an average payoff difference at the finish of 0.00001.

In a second experiment we apply the algorithm to a somewhat larger Markov game given by the grid world shown in Figure 4(a). This problem is based on the experiments described in [6]. Two agents start from the lower corners of the grid and try to reach the goal location (top row center). When the agents try to enter the same non-goal location they stay in their original place and receive a penalty −1. The agents receive a reward when they both enter the goal location. The reward they receive depends on how they enter the goal location, however. If an agent enters from the bottom he receives a reward of 100. If he enters from either side he receives a reward of 75. A state in this problem is given by the joint location of both agents, resulting in a total of 81 states for this 3 × 3 grid. Agents have four actions corresponding to moves in the 4 compass directions. Moves in the grid are stochastic and have a chance of 0.01 of failing². The game continues until both agents arrive in the goal location together, then agents receive their reward and are put back in their starting positions. As described in [6], this problem has pure equilibria corresponding to the joint policies where one agent prefers a path entering the goal from the side and the other one enters from the bottom. These equilibria are asymmetric and result in one agent always receiving the maximum reward, while the other always receives the lower reward.

² When a move fails the agent either stays put or arrives in a random neighboring location.
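The switching logic of Section 3.2 can be summarized in a small dispatcher sketch. The class and method names are hypothetical; it implements the egalitarian rule (optimize the worst-performing agent) and the episode feedback described above (average reward of the episode's target agent over the episode length).

```python
class Dispatcher:
    """Egalitarian dispatcher sketch: tracks per-agent cumulative rewards and, at every
    visit of the switch state, picks the agent with the lowest long-run average reward."""

    def __init__(self, n_agents):
        self.n = n_agents
        self.total = [0.0] * n_agents          # cumulative reward per agent since t = 0
        self.t = 0                             # global time step
        self.episode_start_total = list(self.total)
        self.episode_start_t = 0
        self.current_goal = 0                  # index of the MMDP / agent objective being played

    def record(self, rewards):
        """Accumulate the joint reward vector of one time step."""
        for k, r in enumerate(rewards):
            self.total[k] += r
        self.t += 1

    def on_switch_state(self):
        """Called whenever the system re-enters the switch state: returns (feedback, next goal)."""
        dt = max(self.t - self.episode_start_t, 1)
        # Feedback = average reward gathered by the agent whose objective was just played.
        gained = self.total[self.current_goal] - self.episode_start_total[self.current_goal]
        feedback = gained / dt
        # Egalitarian selection: next objective is the agent with the lowest average reward.
        averages = [tot / max(self.t, 1) for tot in self.total]
        self.current_goal = min(range(self.n), key=lambda k: averages[k])
        self.episode_start_total = list(self.total)
        self.episode_start_t = self.t
        return feedback, self.current_goal
```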
In order to apply the LA algorithms all rewards described above were scaled to lie in [0, 1]. The turn-taking Markov game algorithm was applied as follows. Each agent assigns 2 PLA to every state. The starting state (both agents in their starting position) is selected as the switch state. Each time the agents enter this start state, they receive an index i ∈ {1, 2} and a reward for the last start-to-goal episode. Using this information the agents can then update the PLA used in the last episode. During the next episode they play using the PLA corresponding to the index i. When the PLA have converged, this system results in agents taking turns to use the optimal route.

Results for a typical run are shown in Figure 5, with the same parameter settings as given above. From the figure it is clear that agents equalize their reward, both receiving an average reward that is between the average rewards for the 2 paths played in an equilibrium. For comparison purposes we also show the rewards obtained by 2 agents converging to one of the deterministic equilibria.

5. DISCUSSION AND FUTURE WORK

In this paper we introduced a multi-agent learning algorithm which allows agents to switch between stationary policies in order to equalize the reward division among the agent population. In the present system we rely on a dispatcher agent to select the objective to play and to correlate the agents' policy switches. If we assume that all agents are cooperative and willing to sacrifice some payoff in order to equalize the rewards in the population³, this functionality could also be embedded in the agents, either by letting agents communicate or by allowing each agent to observe all rewards as is done in e.g. [8, 6]. In systems where agents cannot be trusted or are not willing to cooperate, methods from computational mechanism design could be used to ensure that agents' selfish interests are aligned with the global system utility. Another possible approach is considered in [4], where the other agents can choose to punish uncooperative agents, leading to lower rewards for those agents.

Note also that in the system presented here agents learn to correlate on the joint actions they play. In [6] an approach was presented to learn correlated equilibria. A deeper study on the relation between our turn-taking policies and correlated equilibrium still needs to be done. The main difference we put forward here is that a turn-taking policy was proposed as a vehicle to reach fair reward divisions among the agents. Furthermore, the system in [6] requires agents to learn in the joint action space and relies on centralized computation of correlated equilibria. In our system agents only learn probabilities for their individual action sets and coordination only takes place in the switch state, rather than at every state.

In [18] the concept of cyclic equilibria in Markov games was proposed as an alternative to Nash equilibrium. These cyclic equilibria refer to a sequence of policies that reach a limit cycle in the game. However, again no link was made with individual agent preferences and how they compare to each other.

³ Systems satisfying this assumption are referred to as homo egualis systems [5].

6. REFERENCES

[1] E. Altman and A. Shwartz. Time-Sharing Policies for Controlled Markov Chains. Operations Research, 41(6):1116–1124, 1993.
[2] R. Aumann. Subjectivity and correlation in randomized strategies. Journal of Mathematical Economics, 1:67–96, 1974.
[3] C. Boutilier. Planning, learning and coordination in multiagent decision processes. In Proceedings of the 6th Conference on Theoretical Aspects of Rationality and Knowledge, pages 195–210, Holland, 1996.
[4] S. de Jong and K. Tuyls. Learning to cooperate in public-goods interactions. In EUMAS 2008, 2008.
[5] S. de Jong, K. Tuyls, and K. Verbeeck. Fairness in multi-agent systems. Knowledge Engineering Review, 23(2):153–180, 2008.
[6] A. Greenwald and K. Hall. Correlated Q-learning. In Proceedings of the Twentieth International Conference on Machine Learning, pages 242–249, 2003.
[7] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
[8] J. Hu and M. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4:1039–1069, 2003.
[9] S. Mannor and N. Shimkin. The Steering Approach for Multi-Criteria Reinforcement Learning. Advances in Neural Information Processing Systems, 2:1563–1570, 2002.
[10] K. Narendra and M. Thathachar. Learning Automata: An Introduction. Prentice-Hall International, Inc., 1989.
[11] M. Thathachar and V. Phansalkar. Learning the global maximum with parameterized learning automata. IEEE Transactions on Neural Networks, 6(2):398–406, 1995.
[12] P. Vanderschraaf and B. Skyrms. Learning to take turns. Erkenntnis, 59:311–347, November 2003.
[13] P. Vrancx. Decentralised reinforcement learning in Markov games. PhD thesis, Computational Modeling Lab, Vrije Universiteit Brussel, 2010.
[14] P. Vrancx, K. Verbeeck, and A. Nowé. Optimal Convergence in Multi-agent MDPs. Lecture Notes in Computer Science, Knowledge-Based Intelligent Information and Engineering Systems (KES 2007), 4694:107–114, 2007.
[15] P. Vrancx, K. Verbeeck, and A. Nowé. Decentralized learning in Markov games. IEEE Transactions on Systems, Man and Cybernetics (Part B: Cybernetics), 38(4):976–981, 2008.
[16] R. Wheeler and K. Narendra. Decentralized learning in finite Markov chains. IEEE Transactions on Automatic Control, AC-31:519–526, 1986.
[17] R. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8:229–256, 1992.
[18] M. Zinkevich, A. Greenwald, and M. Littman. Cyclic Equilibria in Markov Games. Advances in Neural Information Processing Systems, 18:1641, 2006.
Figure 3: Typical run of the turn-taking algorithm on the Markov game in Table 1. (a) Average reward over time for agent 1. (b) Average reward over time for agent 2. (Both panels plot average reward against iteration.)

Figure 4: (a) Deterministic equilibrium solution for the grid world problem. (b) Average reward over time for 2 agents converging to equilibrium.

Figure 5: Results of the turn-taking algorithm in the grid world problem. The coloured lines give the average reward over time for both agents. Grey lines give the rewards for agents playing one of the deterministic equilibria.
RESQ-learning in stochastic games

Daniel Hennes, Michael Kaisers, and Karl Tuyls
within an evolutionary game theoretic framework. Second, the inverse approach, reverse engineering the RESQ-learning algorithm, is demonstrated in Section 3. Section 4 delivers a comparative study of the newly devised algorithm and its dynamics. Section 5 concludes this article.

given below:

  Q_i(t+1) ← Q_i(t) + min(β/x_i, 1) · α ( r_i(t) + γ max_j Q_j(t) − Q_i(t) )
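A small sketch of the frequency-adjusted update shown above, with assumed variable names: x holds the current policy (action frequencies) and Q the value estimates; regenerating x from Q via a Boltzmann policy is omitted.

```python
def faq_update(Q, x, action, reward, alpha=0.1, beta=0.01, gamma=0.0):
    """Frequency Adjusted Q-learning step for the chosen action: the effective
    learning rate is scaled by min(beta / x_i, 1), so rarely selected actions
    receive a proportionally larger update."""
    i = action
    target = reward + gamma * max(Q)          # one-step Q-learning target
    Q[i] += min(beta / x[i], 1.0) * alpha * (target - Q[i])
    return Q
```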
Figure 1: Overview of trajectory plots for stateless games: Prisoners' Dilemma (top row) and Matching Pennies game (bottom row). Panels from left to right: LA, LA dynamics, FAQ, FAQ dynamics; the axes show the probability of the first action for players 1 and 2.
Figure 1 (top row) shows the dynamics in the single state Prisoners' Dilemma. The automata game as well as the corresponding replicator dynamics show similar evolution toward the equilibrium strategy of mutual defection. Action probabilities are plotted for action 1 (in this case cooperate); x- and y-axis correspond to the action of player 1 and 2 respectively. Hence, the Nash equilibrium point is located at the origin (0, 0). FAQ-learners evolve to a joint policy close to Nash. Constant temperature prohibits full convergence. Learning in the Matching Pennies game, Figure 1 (bottom row), shows cyclic behavior for automata games and its replicator dynamics alike. FAQ-learning successfully converges to the mixed equilibrium due to its exploration scheme.

2.2 Multi-state learning dynamics

The main limitation of the evolutionary game theoretic approach to multi-agent learning has been its restriction to stateless repeated games. Even though real-life tasks might be modeled statelessly, the majority of such problems naturally relates to multi-state situations. Vrancx et al. [20] have made the first attempt to extend replicator dynamics to multi-state games. More precisely, the authors have combined replicator dynamics and piecewise dynamics, called piecewise replicator dynamics, to model the learning behavior of agents in stochastic games. Recently, this promising proof of concept has been formally studied in [5] and extended to state-coupled replicator dynamics [6], which form the foundation for the later described inverse approach.

2.2.1 Stochastic games

Stochastic games extend the concept of Markov decision processes to multiple agents, and allow to model multi-state games in an abstract manner. The concept of repeated games is generalized by introducing probabilistic switching between multiple states. At any time t, the game is in a specific state featuring a particular payoff function and an admissible action set for each player. Players take actions simultaneously and hereafter receive an immediate payoff depending on their joint action. A transition function maps the joint action space to a probability distribution over all states which in turn determines the probabilistic state change. Thus, similar to a Markov decision process, actions influence the state transitions. A formal definition of stochastic games (also called Markov games) is given below.

Definition 1. The game G = ⟨n, S, A, q, r, π¹ . . . πⁿ⟩ is a stochastic game with n players and k states. At each stage t, the game is in a state s ∈ S = (s¹, . . . , sᵏ) and each player i chooses an action aⁱ from its admissible action set Aⁱ(s) according to its strategy πⁱ(s). The payoff function r(s, a) : ∏_{i=1}^n Aⁱ(s) ↦ ℝⁿ maps the joint action a = (a¹, . . . , aⁿ) to an immediate payoff value for each player. The transition function q(s, a) : ∏_{i=1}^n Aⁱ(s) ↦ Δ^{k−1} determines the probabilistic state change, where Δ^{k−1} is the (k − 1)-simplex and q_{s′}(s, a) is the transition probability from state s to s′ under joint action a.

In this work we restrict our consideration to the set of games where all states s ∈ S are in the same ergodic set. The motivation for this restriction is twofold. In the presence of more than one ergodic set one could analyze the corresponding sub-games separately. Furthermore, the restriction ensures that the game has no absorbing states. Games with absorbing states are of no particular interest in respect to evolution or learning since any type of exploration will eventually lead to absorption. The formal definition of an ergodic set in stochastic games is given below.

Definition 2. In the context of a stochastic game G, E ⊆ S is an ergodic set if and only if the following conditions hold:
(a) For all s ∈ E, if G is in state s at stage t, then at t + 1: Pr(G in some state s′ ∈ E) = 1, and
(b) for all proper subsets E′ ⊂ E, (a) does not hold.

Note that in repeated games, player i either tries to maximize the limit of the average of stage rewards (e.g., Learning Automata)

  max_{πⁱ} lim inf_{T→∞} (1/T) Σ_{t=1}^{T} rⁱ(t)                                  (7)

or the discounted sum of stage rewards Σ_{t=1}^{T} rⁱ(t) δ^{t−1} with 0 < δ < 1 (e.g., Q-learning), where rⁱ(t) is the immediate stage reward for player i at time step t.

2.2.2 2-State Prisoners' Dilemma

The 2-State Prisoners' Dilemma is a stochastic game for two players. The payoff matrices are given by

  (A¹, B¹) = ( 3,3   0,10 )          (A², B²) = ( 4,4   0,10 )
             ( 10,0  2,2  )                     ( 10,0  1,1  )

where Aˢ determines the payoff for player 1 and Bˢ for player 2 in state s. The first action of each player is cooperate and the second is defect. Player 1 receives r¹(s, a) = Aˢ_{a₁,a₂} while player 2 gets r²(s, a) = Bˢ_{a₁,a₂} for a given joint action a = (a₁, a₂). Similarly, the transition probabilities are given by the matrices Q^{s→s′}, where q_{s′}(s, a) = Q^{s→s′}_{a₁,a₂} is the probability for a transition from state s to state s′.

  Q^{s¹→s²} = ( 0.1  0.9 )           Q^{s²→s¹} = ( 0.1  0.9 )
              ( 0.9  0.1 )                       ( 0.9  0.1 )

The probabilities to continue in the same state after the transition are q_{s¹}(s¹, a) = Q^{s¹→s¹}_{a₁,a₂} = 1 − Q^{s¹→s²}_{a₁,a₂} and q_{s²}(s², a) = Q^{s²→s²}_{a₁,a₂} = 1 − Q^{s²→s¹}_{a₁,a₂}.

Essentially a Prisoners' Dilemma is played in both states, and if regarded separately, defect is still a dominating strategy. One might assume that the Nash equilibrium strategy in this game is to defect at every stage. However, the only pure stationary equilibria in this game reflect strategies where one of the players defects in one state while cooperating in the other and the second player does exactly the opposite. Hence, a player betrays his opponent in one state while being exploited himself in the other state.

2.2.3 2-State Matching Pennies game

Another 2-player, 2-actions and 2-state game is the 2-State Matching Pennies game. This game has a mixed Nash equilibrium with joint strategies π¹ = (.75, .25), π² = (.5, .5) in state 1 and π¹ = (.25, .75), π² = (.5, .5) in state 2. Payoff and transition matrices are given below.

  (A¹, B¹) = ( 1,0  0,1 )            (A², B²) = ( 0,1  1,0 )
             ( 0,1  1,0 )                       ( 1,0  0,1 )

  Q^{s¹→s²} = ( 1  1 )               Q^{s²→s¹} = ( 0  0 )
              ( 0  0 )                           ( 1  1 )
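The 2-State Prisoners' Dilemma above can be written down directly as data; the sketch below samples one stage of the stochastic game under given mixed strategies (helper names are assumptions, not from the paper).

```python
import random

# Payoff bimatrices (A^s, B^s): entry [a1][a2] = (reward player 1, reward player 2).
PAYOFF = [
    [[(3, 3), (0, 10)], [(10, 0), (2, 2)]],      # state 1
    [[(4, 4), (0, 10)], [(10, 0), (1, 1)]],      # state 2
]
# Q^{s -> s'}: probability of switching to the *other* state under joint action (a1, a2).
SWITCH = [
    [[0.1, 0.9], [0.9, 0.1]],                    # from state 1 to state 2
    [[0.1, 0.9], [0.9, 0.1]],                    # from state 2 to state 1
]

def play_stage(state, pi1, pi2):
    """One stage: sample the joint action from the players' strategies in `state`,
    return the rewards and the successor state."""
    a1 = 0 if random.random() < pi1[state][0] else 1    # pi[state] = (P(cooperate), P(defect))
    a2 = 0 if random.random() < pi2[state][0] else 1
    r1, r2 = PAYOFF[state][a1][a2]
    next_state = 1 - state if random.random() < SWITCH[state][a1][a2] else state
    return r1, r2, next_state

# Example: one of the pure equilibria described above — player 1 defects in state 1
# and cooperates in state 2, player 2 does exactly the opposite.
pi1 = [(0.0, 1.0), (1.0, 0.0)]
pi2 = [(1.0, 0.0), (0.0, 1.0)]
print(play_stage(0, pi1, pi2))
```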
2.2.4 Networks of learning automata

To cope with stochastic games, the learning algorithms in Section 2.1 need to be adapted to account for multiple states. To this end, we use a network of automata for each agent [19]. An agent associates a dedicated learning automaton (LA) to each state of the game and control is passed on from one automaton to another. Each LA tries to optimize the policy in its state using the standard update rule given in (1). Only a single LA is active and selects an action at each stage of the game. However, the immediate reward from the environment is not directly fed back to this LA. Instead, when the LA becomes active again, i.e., next time the same state is played, it is informed about the cumulative reward gathered since the last activation and the time that has passed by.

The reward feedback τⁱ for agent i's automaton LAⁱ(s) associated with state s is defined as

  τⁱ(t) = Δrⁱ/Δt = ( Σ_{l=t₀(s)}^{t−1} rⁱ(l) ) / ( t − t₀(s) ),                   (8)

where rⁱ(t) is the immediate reward for agent i in epoch t and t₀(s) is the last occurrence function, which determines when state s was visited last. The reward feedback in epoch t equals the cumulative reward Δrⁱ divided by the time-frame Δt. The cumulative reward Δrⁱ is the sum over all immediate rewards gathered in all states beginning with epoch t₀(s) and including the last epoch t − 1. The time-frame Δt measures the number of epochs that have passed since automaton LAⁱ(s) has been active last. This means the state policy is updated using the average stage reward over the interim immediate rewards.

2.2.5 Average reward game

For a repeated automata game, let the objective of player i at stage t₀ be to maximize the limit average reward r̄ⁱ = lim inf_{T→∞} (1/T) Σ_{t=t₀}^{T} rⁱ(t) as defined in (7). The scope of this paper is restricted to stochastic games where the sequence of game states X(t) is ergodic. Hence, there exists a stationary distribution x over all states, where fraction x_s determines the frequency of state s in X. Therefore, we can rewrite r̄ⁱ as r̄ⁱ = Σ_{s∈S} x_s Pⁱ(s), where Pⁱ(s) is the expected payoff of player i in state s.

Now, let us assume the game is in state s at stage t₀ and

An intuitive explanation of (9) goes as follows. At each stage, players consider the infinite horizon of payoffs under current strategies. We untangle the current state s from all other states s′ ≠ s and the limit average payoff r̄ becomes the sum of the immediate payoff for joint action a in state s and the expected payoffs in all other states. Payoffs are weighted by the frequency x_s of corresponding state occurrences. Thus, if players invariably play joint action a every time the game is in state s and play their fixed strategies π(s′) for all other states, the limit average reward for T → ∞ is expressed by (9).

Since a specific joint action a is played in state s, the stationary distribution x depends on s and a as well. A formal definition is given below.

Definition 3. For G = ⟨n, S, A, q, r, π¹ . . . πⁿ⟩ where S itself is the only ergodic set in S = (s¹ . . . sᵏ), we say x(s, a) is a stationary distribution of the stochastic game G if and only if Σ_{z∈S} x_z(s, a) = 1 and

  x_z(s, a) = x_s(s, a) q_z(s, a) + Σ_{s′∈S−{s}} x_{s′}(s, a) Qⁱ(s′),

where

  Qⁱ(s′) = Σ_{a′∈∏_{i=1}^n Aⁱ(s′)} q_z(s′, a′) ∏_{i=1}^n πⁱ_{a′ⁱ}(s′).

Based on this notion of stationary distribution and (9) we can define the average reward game as follows.

Definition 4. For a stochastic game G where S itself is the only ergodic set in S = (s¹ . . . sᵏ), we define the average reward game for some state s ∈ S as the normal-form game

  Ḡ(s, π¹ . . . πⁿ) = ⟨n, A¹(s) . . . Aⁿ(s), r̄, π¹(s) . . . πⁿ(s)⟩,

where each player i plays a fixed strategy πⁱ(s′) in all states s′ ≠ s. The payoff function r̄ is given by

  r̄(s, a) = x_s(s, a) r(s, a) + Σ_{s′∈S−{s}} x_{s′}(s, a) Pⁱ(s′).

2.2.6 State-coupled replicator dynamics

We reconsider the replicator equations for population π as given in (2):

  dπ_i/dt = π_i [ (Aσ)_i − π′Aσ ]

Essentially, the payoff of an individual in population π, playing pure strategy i against population σ, is compared to the average payoff of population π. In the context of an average reward game Ḡ with payoff function r̄ the expected payoff for player i and pure action j is given by

  Pⱼⁱ(s) = Σ_{a∈∏_{l≠i} Aˡ(s)} ( r̄ⁱ(a*) ∏_{l≠i} πˡ_{a*ˡ}(s) ).

If each player i is represented by a population πⁱ, we can set up a system of differential equations, each similar to (2), where the payoff matrix A is substituted by the average reward game payoff r̄. Furthermore, σ now represents all remaining populations πˡ where l ≠ i.

Definition 5. The multi-population state-coupled replicator dynamics are defined by the following system of differential equations:

  dπⱼⁱ(s)/dt = πⱼⁱ [ x_s(π) ( Pⁱ(s, eⱼ) − Pⁱ(s, πⁱ(s)) ) ],                        (10)
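The stationary distribution of Definition 3 can be approximated numerically. The sketch below (assumed helper names, specialised to the 2-State Prisoners' Dilemma of Section 2.2.2) estimates the state frequencies x under fixed strategies by power iteration on the induced Markov chain — the same weights that the average reward game of Definition 4 applies to the payoffs.

```python
def induced_chain(switch, pi1, pi2):
    """Transition matrix of the state process when both players use fixed strategies.
    switch[s][a1][a2] = probability of leaving state s (two-state game)."""
    n_states = len(switch)
    P = [[0.0] * n_states for _ in range(n_states)]
    for s in range(n_states):
        for a1 in range(2):
            for a2 in range(2):
                p_joint = pi1[s][a1] * pi2[s][a2]
                p_leave = switch[s][a1][a2]
                P[s][1 - s] += p_joint * p_leave
                P[s][s] += p_joint * (1.0 - p_leave)
    return P

def stationary(P, iters=10_000):
    """Power iteration: x <- x P until (approximately) stationary."""
    x = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        x = [sum(x[z] * P[z][s] for z in range(len(P))) for s in range(len(P))]
    return x

# Strategies per state as (P(cooperate), P(defect)); SWITCH as in the previous sketch.
SWITCH = [[[0.1, 0.9], [0.9, 0.1]], [[0.1, 0.9], [0.9, 0.1]]]
pi1 = [(0.0, 1.0), (1.0, 0.0)]
pi2 = [(1.0, 0.0), (0.0, 1.0)]
print(stationary(induced_chain(SWITCH, pi1, pi2)))
```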
where eⱼ is the jᵗʰ unit vector. Pⁱ(s, ω) is the expected payoff for an individual of population i playing some strategy ω in state s. Pⁱ is defined as

  Pⁱ(s, ω) = Σ_{j∈Aⁱ(s)} [ ωⱼ Σ_{a∈∏_{l≠i} Aˡ(s)} ( r̄ⁱ(s, a*) ∏_{l≠i} πˡ_{a*ˡ}(s) ) ],

where r̄ is the payoff function of Ḡ(s, π¹ . . . πⁿ) and

  a* = (a¹ . . . aⁱ⁻¹, j, aⁱ . . . aⁿ).

Furthermore, x is the stationary distribution over all states S under π, with

  Σ_{s∈S} x_s(π) = 1 and
  x_s(π) = Σ_{z∈S} [ x_z(π) Σ_{a∈∏_{i=1}^n Aⁱ(s)} ( q_s(z, a) ∏_{i=1}^n πⁱ_{aⁱ}(s) ) ].

In total this system has N = Σ_{s∈S} Σ_{i=1}^n |Aⁱ(s)| replicator equations.

(Figure 2, referenced in Section 4, appears here: trajectory plots with panels SC-RD (state 1, state 2), RESQ dynamics (state 1, state 2) and RESQ (state 1, state 2); the axes show the policy components π₁¹(s) and π₁²(s).)

3. INVERSE APPROACH

The forward approach has focused on deriving predictive models for the learning dynamics of existing multi-agent reinforcement learners. These models help to gain deeper insight and allow to tune parameter settings. In this section we demonstrate the inverse approach, designing a dynamical system that does indeed converge to pure and mixed Nash equilibria and reverse engineering that system, resulting in a new multi-agent reinforcement learning algorithm, i.e. RESQ-learning.

Results for stateless games provide evidence that exploration is the key to prevent cycling around attractors. Hence, we aim to combine the exploration-mutation term of FAQ-learning dynamics with state-coupled replicator dynamics.

3.1 Linking LA and Q-learning dynamics

First, we link the dynamics of learning automata and Q-learning for the stateless case. We recall from Section 2.1.3 that the learning dynamics of LA correspond to the standard multi-population replicators scaled by the learning rate α:

  dπ_i/dt = π_i α [ (Aσ)_i − π′Aσ ]

The FAQ replicator dynamics (see Section 2.1.4) contain a selection part equivalent to the multi-population replicator dynamics, and an additional mutation part originating from the Boltzmann exploration scheme:

  dπ_i/dt = π_i β ( τ⁻¹ [ (Aσ)_i − π′Aσ ] − log π_i + Σ_k π_k log π_k )
          = π_i βτ⁻¹ [ (Aσ)_i − π′Aσ ] − π_i β ( log π_i − Σ_k π_k log π_k )

The learning rate of FAQ is now denoted by β. Let us assume α = βτ⁻¹ ⇒ β = ατ. Note that from β ∈ [0, 1] it follows that 0 ≤ ατ ≤ 1. Then we can rewrite the FAQ replicator equation as follows:

  dπ_i/dt = π_i α [ (Aσ)_i − π′Aσ ] − π_i ατ ( log π_i − Σ_k π_k log π_k )

In the limit τ → 0 the mutation term collapses and the dynamics of learning automata become:

  dπ_i/dt = π_i α [ (Aσ)_i − π′Aσ ]

3.2 State-coupled RD with mutation

After we have established the connection between the learning dynamics of FAQ-learning and learning automata, extending this link to multi-state games is straightforward.
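The link above can be checked numerically. The sketch below integrates the FAQ replicator dynamics (selection plus Boltzmann mutation, in the sign convention used in the reconstruction of the equations above) for a single-state two-action game with Euler steps; the game matrices, parameters and step count are illustrative assumptions.

```python
import math

def faq_replicator_step(pi, sigma, A, alpha=0.01, tau=0.1, dt=0.01):
    """One Euler step of d pi_i/dt = pi_i * alpha * ( [(A sigma)_i - pi'A sigma]
       - tau * (log pi_i - sum_k pi_k log pi_k) ) for a two-action player."""
    Asigma = [sum(A[i][j] * sigma[j] for j in range(2)) for i in range(2)]
    avg = sum(pi[i] * Asigma[i] for i in range(2))
    entropy_like = sum(pi[k] * math.log(pi[k]) for k in range(2))
    new_pi = []
    for i in range(2):
        selection = Asigma[i] - avg
        mutation = -tau * (math.log(pi[i]) - entropy_like)
        new_pi.append(pi[i] + dt * pi[i] * alpha * (selection + mutation))
    new_pi = [max(p, 1e-9) for p in new_pi]   # guard against numerical drift out of the simplex
    total = sum(new_pi)
    return [p / total for p in new_pi]

# Matching Pennies: A is the row player's payoff matrix, B is written from the
# column player's own perspective (its actions index the rows).
A = [[1, 0], [0, 1]]
B = [[0, 1], [1, 0]]
pi, sigma = [0.9, 0.1], [0.2, 0.8]
for _ in range(200_000):
    pi, sigma = faq_replicator_step(pi, sigma, A), faq_replicator_step(sigma, pi, B)
print(pi, sigma)   # with the mutation term, the trajectory should spiral toward the mixed equilibrium
```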
The mutation term

  −τ ( log π_i − Σ_k π_k log π_k )                                               (11)

is solely dependent on the agent's policy π and thus independent of any payoff computation. Therefore, the average reward game remains the sound measure for the limit of the average of stage rewards under the assumptions made in Section 2.2.5. The equations of the dynamical system in (2.2.5) are complemented with the mutation term (11), resulting in the following state-coupled replicator equations with mutation:

  dπⱼⁱ(s)/dt = πⱼⁱ [ x_s(π) ( Pⁱ(s, eⱼ) − Pⁱ(s, πⁱ(s)) ) − τ ( log πⱼⁱ − Σ_k πₖⁱ log πₖⁱ ) ]        (12)

In the next section we introduce the corresponding RESQ-learning algorithm.

3.3 RESQ-learning

In [6] the authors have shown that maximizing the expected average stage reward over interim immediate rewards relates to the average reward game played in state-coupled replicator dynamics. We reverse this result to obtain a learner equivalent to state-coupled replicator dynamics with mutation.

Analogous to the description in Section 2.2.4, a network of learners is used for each agent. The reward feedback signal is equal to (8) while the update rule now incorporates the same exploration term as in (12). If a(t) = i:

  π_i(t+1) ← π_i(t) + α [ r(t) (1 − π_i(t)) − τ ( log π_i − Σ_k π_k log π_k ) ]

otherwise:

  π_i(t+1) ← π_i(t) + α [ −r(t) π_i(t) − τ ( log π_i − Σ_k π_k log π_k ) ]

Hence, RESQ-learning is essentially a multi-state policy iterator using exploration equivalent to the Boltzmann policy generation scheme.

4. RESULTS AND DISCUSSION

This section sets the newly proposed RESQ-learning algorithm in perspective by examining the underlying dynamics of state-coupled replicator dynamics with mutation and traces of the resulting learning algorithm.

First, we explore the behavior of the dynamical system, as derived in Section 3.2, and verify the desired convergence behavior, i.e., convergence to pure and mixed Nash equilibria. Figure 3 shows multiple trajectory traces in the 2-State Prisoners' Dilemma, originating from random strategy profiles in both states. Analysis reveals that all trajectories converge close to either one of the two pure Nash equilibrium points described in Section 2.2.2.

(Figures 3 and 4, referenced in the text, appear here: trajectory traces of the policy components π₁¹(s) and π₁²(s) over time t for the 2-State Prisoners' Dilemma and the 2-State Matching Pennies game.)
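A compact sketch of the RESQ policy update reconstructed above, for one automaton (one state); variable names are assumptions. The chosen action is reinforced linear-reward-inaction style, every component receives the Boltzmann-style mutation term, and the result is clipped and renormalized so that the policy remains a distribution.

```python
import math

def resq_update(pi, chosen, reward, alpha=0.01, tau=0.1, eps=1e-6):
    """One RESQ-learning step for the policy of a single state.
    `reward` is the averaged episode feedback of Eq. (8), assumed to lie in [0, 1]."""
    entropy_like = sum(p * math.log(p) for p in pi)
    new_pi = []
    for i, p in enumerate(pi):
        selection = reward * (1.0 - p) if i == chosen else -reward * p
        mutation = -tau * (math.log(p) - entropy_like)
        new_pi.append(p + alpha * (selection + mutation))
    new_pi = [min(max(p, eps), 1.0) for p in new_pi]   # keep the update inside the simplex
    total = sum(new_pi)
    return [p / total for p in new_pi]
```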
As mentioned before for the stateless case, constant temperature prohibits full convergence. Figure 4 shows trajectory traces in the 2-State Matching Pennies game. Again, all traces converge close to Nash, thus affirming the statement that exploration-mutation is crucial to prevent cycling and to converge in games with mixed optimal strategies.

Figure 2 shows a comparison between state-coupled replicator dynamics (SC-RD), the RESQ-dynamics as in (12), and an empirical learning trace of RESQ-learners. As mentioned above, "pure" state-coupled replicator dynamics without the exploration-mutation term fail to converge. The trajectory of the state space of this dynamical system exhibits cycling behavior around the mixed Nash equilibrium (see Section 2.2.3). RESQ-dynamics successfully converge near to the Nash-optimal joint policy. Furthermore, we present the learning trace of two RESQ-learners in order to judge the predictive quality of the corresponding state-coupled dynamics with mutation. Due to the stochasticity involved in the action selection process, the learning trace is more noisy. However, we clearly observe that RESQ-learning indeed successfully inherits the convergence behavior of state-coupled replicator dynamics with mutation.

Further experiments are required to verify the performance of RESQ-learning in real applications and to gain insight into how it competes with multi-state Q-learning and the SARSA algorithm [15]. In particular, the speed and quality of convergence need to be considered. Therefore, the theoretical framework needs to be extended to account for decreasing temperature to balance exploration and exploitation over time.

5. CONCLUSIONS

The contributions of this article can be summarized as follows. First, we have demonstrated the forward approach to modeling multi-agent reinforcement learning within an evolutionary game theoretic framework. In particular, the stateless learning dynamics of learning automata and FAQ-learning as well as state-coupled replicator dynamics for stochastic games have been discussed. Based on the insights that were gained from the forward approach, RESQ-learning has been introduced by reverse engineering state-coupled replicator dynamics injected with the Q-learning Boltzmann mutation scheme. We have provided empirical confirmation that RESQ-learning successfully inherits the convergence behavior of its evolutionary counterpart. Results have shown that RESQ-learning provides convergence to pure as well as mixed Nash equilibria in a selection of stateless and stochastic multi-agent games.

6. REFERENCES

[1] Tilman Börgers and Rajiv Sarin. Learning through reinforcement and replicator dynamics. Journal of Economic Theory, 77(1), 1997.
[2] Bruce Bueno de Mesquita. Game theory, political economy, and the evolving study of war and peace. American Political Science Review, 100(4):637–642, November 2006.
[3] Herbert Gintis. Game Theory Evolving. A Problem-Centered Introduction to Modelling Strategic Interaction. Princeton University Press, Princeton, 2000.
[4] Eduardo Rodrigues Gomes and Ryszard Kowalczyk. Dynamic analysis of multiagent Q-learning with epsilon-greedy exploration. In ICML, 2009.
[5] Daniel Hennes, Karl Tuyls, and Matthias Rauterberg. Formalizing multi-state learning dynamics. In Proc. of the 2008 Intl. Conf. on Intelligent Agent Technology, 2008.
[6] Daniel Hennes, Karl Tuyls, and Matthias Rauterberg. State-coupled replicator dynamics. In Proc. of 8th Intl. Conf. on Autonomous Agents and Multiagent Systems, 2009.
[7] Shlomit Hon-Snir, Dov Monderer, and Aner Sela. A learning approach to auctions. Journal of Economic Theory, 82:65–88, November 1998.
[8] Junling Hu and Michael P. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4:1039–1069, 2003.
[9] Michael Kaisers and Karl Tuyls. Frequency adjusted multi-agent Q-learning. In Proc. of 9th Intl. Conf. on Autonomous Agents and Multiagent Systems, 2010.
[10] Michael L. Littman. Friend-or-foe Q-learning in general-sum games. In ICML, pages 322–328, 2001.
[11] Shervin Nouyan, Roderich Groß, Michael Bonani, Francesco Mondada, and Marco Dorigo. Teamwork in self-organized robot colonies. Transactions on Evolutionary Computation, 13(4):695–711, 2009.
[12] Liviu Panait, Karl Tuyls, and Sean Luke. Theoretical advantages of lenient learners: An evolutionary game theoretic perspective. Journal of Machine Learning Research, 9:423–457, 2008.
[13] S. Phelps, M. Marcinkiewicz, and S. Parsons. A novel method for automatic strategy acquisition in n-player non-zero-sum games. In AAMAS '06: Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, pages 705–712, Hakodate, Japan, 2006. ACM.
[14] Y. Shoham, R. Powers, and T. Grenager. If multi-agent learning is the answer, what is the question? Journal of Artificial Intelligence, 171(7):365–377, 2006.
[15] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[16] K. Tuyls and S. Parsons. What evolutionary game theory tells us about multiagent learning. Artificial Intelligence, 171(7):115–153, 2007.
[17] Karl Tuyls, Pieter J. 't Hoen, and Bram Vanschoenwinkel. An evolutionary dynamical analysis of multi-agent learning in iterated games. Autonomous Agents and Multi-Agent Systems, 12:115–153, 2005.
[18] Karl Tuyls, Katja Verbeeck, and Tom Lenaerts. A selection-mutation model for Q-learning in multi-agent systems. In Proc. of 2nd Intl. Conf. on Autonomous Agents and Multiagent Systems, 2003.
[19] Katja Verbeeck, Peter Vrancx, and Ann Nowé. Networks of learning automata and limiting games. In ALAMAS, 2006.
[20] Peter Vrancx, Karl Tuyls, Ronald Westra, and Ann Nowé. Switching dynamics of multi-agent learning. In Proc. of 7th Intl. Conf. on Autonomous Agents and Multiagent Systems, 2008.
Adaptation of Stepsize Parameter to Minimize Exponential Moving Average of Square Error by Newton's Method

Itsuki Noda
ITRI, AIST
1-1-1 Umezono, Tsukuba, Japan
2. RECURSIVE EXPONENTIAL MOVING AVERAGE

The Recursive Exponential Moving Average (REMA) ξ_t^⟨k⟩ is defined as follows:

  ξ_t^⟨0⟩     = x_t
  ξ_{t+1}^⟨1⟩ = x̃_{t+1} = (1 − α)x̃_t + αx_t
  ξ_{t+1}^⟨k⟩ = ξ_t^⟨k⟩ + α(ξ_t^⟨k−1⟩ − ξ_t^⟨k⟩)
              = (1 − α)ξ_t^⟨k⟩ + αξ_t^⟨k−1⟩
              = α Σ_{τ=0}^{∞} (1 − α)^τ ξ_{t−τ}^⟨k−1⟩.                            (3)

Using the REMA, we can derive the following lemma and theorem about partial differentials of the estimated values x̃_t by the stepsize parameter α [5, 6].

Lemma 1. The first partial derivative of REMA ξ_t^⟨k⟩ by α is given by the following equation:

  ∂ξ_t^⟨k⟩/∂α = (k/α) (ξ_t^⟨k⟩ − ξ_t^⟨k+1⟩)                                        (4)

Theorem 1. The k-th partial derivative of EMA x̃_t (= ξ_t^⟨1⟩) is given by the following equation:

  ∂^k x̃_t / ∂α^k = (−α)^{−k} k! (ξ_t^⟨k+1⟩ − ξ_t^⟨k⟩)                             (5)

In the previous work, GDASS-REMA updates the stepsize α in the direction that gradually decreases the following squared error of the estimation in each time t [5, 6]:

  E_t = (1/2)(x̃_t − x_t)².                                                       (6)

Therefore, the actual update schema in GDASS is:

  α ← α − η · sign(∂E_t/∂α) = α − η · sign((x̃_t − x_t) · ∂x̃_t/∂α).

3. EXPONENTIAL MOVING AVERAGE OF SQUARED ERROR

Because Theorem 1 provides a way to calculate higher order derivatives of x̃_t by α, we can get a higher order Taylor expansion of E_t by α as follows:

  E_t(Δα) = E_t(0) + (∂E_t/∂α) Δα + (1/2)(∂²E_t/∂α²) Δα² + (1/6)(∂³E_t/∂α³) Δα³ + · · · .

Therefore, if we focus on the expansion of the first and second order terms, we can determine the optimum change of the stepsize, Δα*⟨t⟩, which minimizes the error at time t, using Newton's method as follows:

  Δα*⟨t⟩ = (∂E_t/∂α) / (∂²E_t/∂α²).                                               (7)

However, updating α using the above equation directly does not work well, because the given value x_t includes noise that should be eliminated in the calculation of x̃_t, so that α tends to be adjusted to estimate the noise factor instead of the true value of x_t.

So, in the following sections, we focus on the EMA of the squared error and construct a method to minimize the averaged error by Newton's method.

3.1 Squared Error and Derivatives

Here, we re-define the squared error shown in Eq. (6) using an error δ_t of the given and estimated values, x_t and x̃_t, as follows:

  δ_t = x̃_t − x_t
  E_t = (1/2) δ_t².

Then, we have the following theorem.

Theorem 2. The k-th partial derivative of the squared error E_t by α is calculated by the following equations:

  ∂^k E_t / ∂α^k = Σ_{i=0}^{k−1} [ (k−1)! / ((k−1−i)! i!) ] (∂^i δ_t/∂α^i)(∂^{k−i} δ_t/∂α^{k−i}),   (8)

where

  ∂^0 δ_t/∂α^0 = δ_t
  ∂^k δ_t/∂α^k = ∂^k x̃_t/∂α^k   (k > 0).                                          (9)

(See Appendix for the proof.)

3.2 EMA of Squared Error and Partial Derivatives

As discussed above, our target is a method to determine the stepsize parameter α that minimizes the EMA of the squared error E_t. Here, we define the EMA Ẽ_t as follows:

  Ẽ_{t+1} = (1 − β)Ẽ_t + βE_t,                                                    (10)

where β is another stepsize parameter for the EMA of the squared error. This Ẽ_t is equal to the estimated variance of the expected reward value introduced in [8].

As in the case of the squared error E_t shown in Eq. (7), we can estimate the optimal stepsize value α* to minimize Ẽ_t using Newton's method and the Taylor expansion as follows:

  Δα* = (∂Ẽ_t/∂α) / (∂²Ẽ_t/∂α²)                                                   (11)
  α* = α − Δα*                                                                    (12)

On the other hand, we can accumulate higher order partial derivatives of Ẽ_t by α from Eq. (10) by the following equations:

  ∂Ẽ_{t+1}/∂α     = (1 − β) ∂Ẽ_t/∂α     + β ∂E_t/∂α                               (13)
  ∂²Ẽ_{t+1}/∂α²   = (1 − β) ∂²Ẽ_t/∂α²   + β ∂²E_t/∂α²                             (14)
  ∂^kẼ_{t+1}/∂α^k = (1 − β) ∂^kẼ_t/∂α^k + β ∂^kE_t/∂α^k                           (15)
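The REMA recursion and Theorem 1 translate directly into code. The sketch below maintains ξ^⟨k⟩ for k = 0 . . . K and returns the k-th derivative of the EMA x̃_t with respect to α via Eq. (5); class and method names are assumptions.

```python
import math

class REMA:
    """Recursive exponential moving averages xi^(0..K) of a scalar stream (Eq. 3)."""

    def __init__(self, alpha, K, x0):
        self.alpha = alpha
        self.xi = [x0] * (K + 1)           # xi[k] holds xi^(k)_t

    def observe(self, x):
        # Update higher orders first so each xi^(k) uses the previous value of xi^(k-1),
        # mirroring the order used in the RRASP-N procedure below.
        for k in range(len(self.xi) - 1, 0, -1):
            self.xi[k] = (1 - self.alpha) * self.xi[k] + self.alpha * self.xi[k - 1]
        self.xi[0] = x

    def ema(self):
        return self.xi[1]                   # x~_t = xi^(1)_t

    def d_ema(self, k):
        """k-th partial derivative of the EMA w.r.t. alpha via Eq. (5); requires 1 <= k < K."""
        return (-self.alpha) ** (-k) * math.factorial(k) * (self.xi[k + 1] - self.xi[k])
```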
This means that these partial derivatives can be calculated in the same manner as an EMA, using the derivatives of the squared error ∂^k E_t/∂α^k. As shown in Eq. (8) and Eq. (9), these values can be determined systematically using the REMA ξ_t^⟨k⟩. Finally, we get the following procedure to obtain the optimal stepsize α* that minimizes the EMA of the squared error. We call it Rapid Recursive Adaptation of Stepsize Parameter by Newton's method (RRASP-N).

    Initialize: ∀k ∈ {0 ... kmax − 1}: ξ^⟨k⟩ ← x_0
                ∀k ∈ {0 ... kmax − 2}: ∂^k Ẽ/∂α^k ← 0
    while forever do
        Let x be an observation.
        for k = kmax − 1 to 1 do
            ξ^⟨k⟩ ← (1 − α)ξ^⟨k⟩ + αξ^⟨k−1⟩
        end for
        ξ^⟨0⟩ ← x
        δ ← ξ^⟨1⟩ − x
        for k = 1 to kmax − 2 do
            Calculate ∂^k E/∂α^k by Eq. (8), Eq. (9), and Eq. (5).
            Update ∂^k Ẽ/∂α^k by Eq. (13) to Eq. (15).
        end for
        if ∂²Ẽ/∂α² > 0 then
            Calculate Δα* by Eq. (11).
            if |Δα*| > α then
                Δα* ← sign(Δα*) · α
            end if
            α ← α − Δα*
            if α is not in [αmin, αmax] then
                let α be αmin or αmax.
            end if
            for k = 1 to kmax − 1 do
                Update ξ^⟨k⟩ according to the change of α, using ∂ξ^⟨k⟩/∂α determined by Eq. (4).
            end for
        end if
    end while

In this procedure, α is updated only when ∂²Ẽ/∂α² is positive, because the change of Ẽ with α is concave down when ∂²Ẽ/∂α² < 0. We also cut off Δα* for the following reason: the Taylor expansion of ξ^⟨k⟩ using Eq. (5) includes the term (Δα/α)^n, which becomes huge when α is small. Therefore, the truncation error of the Taylor expansion may be large and affect other calculations in the procedure. In order to avoid such effects, we limit the absolute value of Δα* to at most the value of α.
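The following Python sketch (ours, not code from the paper) implements one reading of the RRASP-N procedure above for kmax = 4, i.e., using the first and second order derivatives of Ẽ. The class name, buffer layout, and default parameters are assumptions; the sign of the α update follows Eq. (12).

import math

class RRASPN:
    """Sketch of RRASP-N: adapt the EMA stepsize alpha by a Newton step on the
    EMA of the squared estimation error (Eqs. (8)-(15))."""

    def __init__(self, alpha=0.5, beta=0.01, kmax=4, alpha_min=1e-3, alpha_max=0.999):
        self.alpha, self.beta, self.kmax = alpha, beta, kmax
        self.alpha_min, self.alpha_max = alpha_min, alpha_max
        self.xi = None                        # REMA buffers xi[k], k = 0 .. kmax-1
        self.dEt = [0.0] * (kmax - 1)         # accumulated d^k E~ / d alpha^k, k = 0 .. kmax-2

    def observe(self, x):
        a = self.alpha
        if self.xi is None:                   # initialize all buffers to the first observation
            self.xi = [float(x)] * self.kmax
            return self.xi[1]
        # REMA recursion (Eq. (3)): higher orders first, then store the observation.
        for k in range(self.kmax - 1, 0, -1):
            self.xi[k] = (1 - a) * self.xi[k] + a * self.xi[k - 1]
        self.xi[0] = float(x)
        delta = self.xi[1] - x
        # d^k delta / d alpha^k: index 0 is delta itself (Eq. (9)); k >= 1 from Theorem 1 (Eq. (5)).
        d_delta = [delta] + [(-a) ** (-k) * math.factorial(k) * (self.xi[k + 1] - self.xi[k])
                             for k in range(1, self.kmax - 1)]
        # d^k E_t / d alpha^k (Eq. (8)), accumulated into the EMA derivatives (Eqs. (13)-(15)).
        for k in range(1, self.kmax - 1):
            dE_k = sum(math.comb(k - 1, i) * d_delta[i] * d_delta[k - i] for i in range(k))
            self.dEt[k] = (1 - self.beta) * self.dEt[k] + self.beta * dE_k
        # Newton step (Eqs. (11)-(12)), applied only where E~ is locally convex in alpha.
        if self.dEt[2] > 0:
            d_alpha = self.dEt[1] / self.dEt[2]
            d_alpha = max(-a, min(a, d_alpha))            # cut-off: |d_alpha| <= alpha
            new_alpha = min(self.alpha_max, max(self.alpha_min, a - d_alpha))
            # Shift buffers for the new alpha using Lemma 1 (Eq. (4));
            # the highest-order buffer is left unshifted in this sketch.
            for k in range(1, self.kmax - 1):
                self.xi[k] += (k / a) * (self.xi[k] - self.xi[k + 1]) * (new_alpha - a)
            self.alpha = new_alpha
        return self.xi[1]                     # current estimate x~_t

A caller would feed successive observations x_t to observe() and read back the running estimate x̃_t, while the stepsize α adapts on-line.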
4. EXPERIMENTS

In order to show the performance of RRASP-N, we carried out several experiments.

4.1 Exp.1: Finding the Optimal Stepsize

In order to show that RRASP-N can determine the optimal stepsize α*, we conducted an experiment using the following noisy random walk as the given value x_t:

    x_t = v_t + ε_t,                                                   (16)

where ε_t is a random noise whose average and standard deviation are 0 and σ_ε, respectively. The true value v_t is a random walk defined by the following equation:

    v_{t+1} = v_t + Δv_t,

where Δv_t is a random noise whose average and standard deviation are 0 and σ_v, respectively. Figure 1 shows an example of the noisy random walk. In this graph, the band-like spikes are the given sequence x_t, and the curves at the center of the band are the true value v_t and its learning result x̃_t.

Figure 1: Exp.1: Changes of the Learned Expected Value x̃_t using the Stepsize α Acquired by RRASP-N (x plotted against the learning cycle, 0 to 1000, for γ = 0.05, best α = 0.0488; legend: current, agent, real).

For such noisy random walks, we can calculate the optimal stepsize by the following equation:

    α* = (−γ² + √(γ⁴ + 4γ²)) / 2,                                      (17)

where γ = σ_v / σ_ε. Of course, the standard deviations are not given to learning agents, so the stepsize must be acquired through learning, as in RRASP-N.
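For reference, a small Python check (ours) of Eq. (17) for the noise ratios used in the figures below:

import math

def optimal_alpha(gamma):
    """Optimal stepsize for a noisy random walk, Eq. (17), with gamma = sigma_v / sigma_e."""
    return (-gamma**2 + math.sqrt(gamma**4 + 4 * gamma**2)) / 2

for g in (3.333, 2.0, 1.25, 1.0, 0.333, 0.05):
    print(f"gamma = {g:6.3f}  ->  alpha* = {optimal_alpha(g):.3f}")
# The printed values are approximately 0.923, 0.828, 0.693, 0.618, 0.282, 0.049,
# matching the "best alpha" lines shown in Figures 2 and 3.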
Figure 2 shows the results of the adaptation of α by RRASP-N for the given values x_t. Each graph of the figure indicates the changes of α through learning, with the optimal value of α indicated by a horizontal line, for a different setting of the noisy random walk. As shown in these graphs, the acquired α quickly converges to the optimal value of α.

Moreover, the speed of convergence is drastically improved compared with GDASS, proposed in the previous work. Figure 3 shows the results of adaptation by GDASS for the same settings as Figure 2. While the adaptation of GDASS converges to the optimal stepsize gradually, as is the nature of gradient descent methods, RRASP-N adapts the stepsize to the optimal value quickly by jumping to it directly using Newton's method.

4.2 Exp.2: Adaptation for a Square-waved True Value

In the second experiment, the given values are a square-waved true value with large noise, as shown on the right of Figure 4. The actual value of x_t is generated by the following equations:

    x_t = v_t + ε_t
    v_t = 10  if 2000n < t < 2000n + 1000, n = 0, 1, 2, ...
          5   otherwise,

where ε_t is a random noise whose average and standard deviation are 0 and σ_ε, respectively. For such given values, the stepsize should become large right after the changes of the true value (t = 1000, 2000, 3000, ...) to catch up with the change, and should decrease immediately to zero to reduce the noise factor.
Figure 2: Exp.1-a: Adjustment of the Stepsize Parameter by RRASP-N for Various Ratios of the Standard Deviations of the Random Walk and the Noise. Panels: (a) γ = 3.33, α* = 0.923; (b) γ = 2.00, α* = 0.828; (c) γ = 1.25, α* = 0.693; (d) γ = 1.00, α* = 0.618; (e) γ = 0.33, α* = 0.282; (f) γ = 0.05, α* = 0.049. Each panel plots α (and the best α) against the learning cycle (0 to 1000).

Figure 3: Exp.1-b: Adjustment of the Stepsize Parameter by GDASS for Various Ratios of the Standard Deviations of the Random Walk and the Noise. Panels (a) to (f) use the same settings as Figure 2.
Figure 4: Exp.2: Changes of the Stepsize Parameters and the Learned Estimated Values by the RRASP-N, GDASS, and OSA Methods (in the case of the square-waved true value). Left panels: α against the learning cycle (0 to 10000); right panels: given (current), estimated (agent), and true (real) x values against the learning cycle.
Figure 4 shows the results of adaptation by (a) RRASP-N, (b) GDASS, and (c) OSA (Optimal Stepsize Algorithm) [4]. In the figure, the left three graphs indicate the changes of α under each method, and the right ones indicate the changes of the given values x_t, the true values v_t, and the learned estimated values x̃_t by Eq. (1). These graphs show the features of the three methods. Because OSA, shown in (c), uses statistical information about the given and learned values, the changes of the stepsize are stable but tend to be delayed with respect to the changes of the true value. GDASS, shown in (b), can respond to the changes of the true value quickly, but the stepsize tends to be unstable due to the noise factors included in the given value, because GDASS looks only at the difference between the given and estimated values at each time t. Therefore, it is hard to detect the environmental change clearly from the changes of α. Compared with these methods, RRASP-N can catch the changes of the true value quickly (α goes up right after every change of the true value), and also shows robust and stable changes of the stepsize, because it uses the EMA of squared errors to reduce the effects of noise.

4.3 Exp.3: Repeated Multi-agent Resource Sharing

Finally, I conducted a learning experiment using repeated multi-agent resource sharing. In the experiment, we suppose that multiple agents share four resources (resource-0 ... resource-3). The agents are grouped into three types: fixed users, who never change their choice from a certain resource; random hoppers, who choose one of the resources randomly every cycle; and big players, who usually stay on a certain resource but sometimes change their choice. In addition, a learning agent tries to estimate the average utility of each resource. The population and weight of each group are as follows:

    type            population   weight
    fixed users          1          7
    random hoppers      17          1
    big players          2         10
    learning agent       1          1

where the weight of an agent means its degree of resource consumption compared with a random hopper. Therefore, a resource that is used by big players, who have a big weight, will have a poor utility. Each resource also has its own capacity, which indicates the size of the resource. In this experiment, the actual utility of a resource k at time t is calculated by the following equation:

    utility_k(t) = 1 / (1 + totalWeight_k(t) / capacity_k),

where totalWeight_k(t) is the sum of the weights of the agents who choose resource k at time t.
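A minimal sketch (ours) of one cycle of this utility computation; the capacity values and the random choices below are illustrative assumptions, since their concrete settings are not restated here.

import random

# (population, weight) for each group, as in the table above.
GROUPS = {"fixed users": (1, 7), "random hoppers": (17, 1),
          "big players": (2, 10), "learning agent": (1, 1)}
CAPACITY = [10.0, 10.0, 10.0, 10.0]       # assumed capacities, one per resource
N_RESOURCES = len(CAPACITY)

def utilities(choices):
    """Utility of each resource given a list of (weight, chosen_resource) pairs."""
    total_weight = [0.0] * N_RESOURCES
    for weight, k in choices:
        total_weight[k] += weight
    return [1.0 / (1.0 + w / c) for w, c in zip(total_weight, CAPACITY)]

# One illustrative cycle: the single fixed user stays on resource 0, everyone else
# picks a resource at random here (a simplification of the behaviours described above).
choices = [(GROUPS["fixed users"][1], 0)]
for name in ("random hoppers", "big players", "learning agent"):
    population, weight = GROUPS[name]
    choices += [(weight, random.randrange(N_RESOURCES)) for _ in range(population)]
print(utilities(choices))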
The purpose of the learning agent is to acquire an estimate of the utility of each resource by reducing the noise factor caused by the random hoppers. At the same time, the learning agent must adapt to the drastic changes brought about when big players change their choice. Therefore, the agent must adapt its stepsize parameter according to the changes of the environment.

Figure 5 shows the result of the learning. Each graph on the right of the figure indicates the given and estimated utilities at each time step, while the graphs on the left show the changes of the stepsize parameter for each resource. Note that an independent stepsize parameter is assigned to each resource; therefore, the parameters change independently of each other. From these graphs, we can see that the agent can estimate suitable expected utilities by reducing the noise factor of the random hoppers. It can also adapt to the drastic changes caused by the big players. The changes of the stepsize parameters in Figure 5 show how the agent adapts to the changes of the environment.

5. CONCLUDING REMARKS

In this article, I proposed a method to adapt the stepsize parameter, called rapid recursive adaptation of the stepsize parameter by Newton's method (RRASP-N), in which the EMA of the squared error of the estimated value is minimized by changing the stepsize parameter. RRASP-N utilizes higher order partial derivatives of the estimated value and of the EMA of the squared error with respect to the stepsize parameter, which can be calculated systematically from the recursive exponential moving average.

Experimental results show that RRASP-N responds to changes of the environment quickly enough to adapt the stepsize parameter to a suitable value. At the same time, the behavior of RRASP-N is stable because it uses a statistical value, the EMA of the squared error. We can also apply RRASP-N to various noise models: while we used only Gaussian noise for the input values, the formalization only supposes the minimization of the squared error between the expected and given values. In fact, the situation used in Exp.3 is a non-Gaussian noise case, in which the random hoppers provide the noisy effects in the environment. As shown in the experimental results, RRASP-N performs reasonably in such an environment.

Although the experimental setups shown in this article are kept simple to demonstrate the features of the proposed method, the generality of RRASP-N is supported by the theorems, so it can be applied generally to reinforcement learning methods that use the EMA formula. For example, it is easy to apply it to Q-learning with multiple states and actions. Of course, the situation of acquiring the best stepsize for Q-learning is not so simple, because the learning speed of a Q-value affects the backup values of the Q-values of other state-action pairs. We have been investigating such cases, and found that there can exist local minima for the stepsize under certain conditions [7]. The conditions and their effects are still under investigation.

There are still several open issues, including:

• the effects of different stepsize parameters for states and actions in Q-learning;
• the utilization of higher order derivatives (k > 2) to analyze the structure of the error function Ẽ_t with respect to α;
• the tuning of β, the stepsize parameter for the EMA of the squared error.

Acknowledgment

This work was supported by JSPS KAKENHI 21500153.
Figure 5: Changes of the Stepsize Parameter (left) and the Given and Estimated Utility (right) for each resource, plotted against the learning cycle (0 to 10000). Panels: (a) Resource 0, (b) Resource 1, (c) Resource 2, (d) Resource 3.
6. REFERENCES

[1] A. Bonarini, A. Lazaric, E. Munoz de Cote, and M. Restelli. Improving cooperation among self-interested reinforcement learning agents. In Proc. of the Workshop on Reinforcement Learning in Non-Stationary Environments, ECML-PKDD 2005, Oct. 2005.
[2] M. Bowling and M. Veloso. Multiagent learning using a variable learning rate. Artificial Intelligence, 136:215-250, 2002.
[3] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5, Dec. 2003.
[4] A. P. George and W. B. Powell. Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming. Machine Learning, 65(1):167-198, 2006.
[5] I. Noda. Adaptation of stepsize parameter for non-stationary environments by recursive exponential moving average. In Proc. of the ECML 2009 LNIID Workshop, pages 24-31. ECML, Sep. 2009.
[6] I. Noda. Recursive adaptation of stepsize parameter for non-stationary environments. In M. E. Taylor and K. Tuyls, editors, Adaptive Learning Agents: Second Workshop, ALA 2009. Springer, May 2009 (to appear).
[7] I. Noda. Relation between stepsize parameter and stochastic reward on reinforcement learning. In Proc. of JSAI 2009, pages 1D2-OS6-13. JSAI, Jun. 2009. (In Japanese.)
[8] M. Sato, H. Kimura, and S. Kobayashi. TD algorithm for the variance of return and mean-variance reinforcement learning. Transactions of the Japanese Society for Artificial Intelligence, 16(3F):353-362, 2001. (In Japanese.)
[9] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

APPENDIX

A. PROOF OF THEOREM 2

First of all, I show the following lemma.

Lemma 2. The partial derivative of δ_t by α is equal to the derivative of x̃_t by α:

    ∂δ_t/∂α = ∂x̃_t/∂α.                                                (18)

In general, we also get the following equation for the k-th partial derivatives:

    ∂^k δ_t/∂α^k = ∂^k x̃_t/∂α^k.                                      (19)

This holds because δ_t = x̃_t − x_t and the given value x_t does not depend on α.

On the other hand, we can calculate ∂^k x̃_t/∂α^k from the REMA ξ_t^⟨k⟩ using Theorem 1. Therefore, we can calculate ∂^k δ_t/∂α^k by the REMA.

Here, let us focus on the partial derivatives of E_t. Suppose that the j-th partial derivative of E_t by α satisfies Eq. (8) for all j ≤ k, as follows:

    ∂^j E_t/∂α^j = Σ_{i=0..j−1} [(j−1)! / ((j−1−i)! i!)] (∂^i δ/∂α^i)(∂^{j−i} δ/∂α^{j−i}).

In this case, we have the (k+1)-th partial derivative as follows:

    ∂^{k+1} E_t/∂α^{k+1} = Σ_{i=0..k−1} [(k−1)! / ((k−1−i)! i!)]
                           · [ (∂^i δ/∂α^i)(∂^{k−i+1} δ/∂α^{k−i+1}) + (∂^{i+1} δ/∂α^{i+1})(∂^{k−i} δ/∂α^{k−i}) ].

Here, we rearrange the equation by collecting the terms (∂^i δ/∂α^i)(∂^{k−i+1} δ/∂α^{k−i+1}). Their factors a_i can then be calculated as follows. In the case of i = 0 or i = k,

    a_i = 1 = ((k+1)−1)! / (((k+1)−1−i)! i!).

In the case of 0 < i < k,

    a_i = (k−1)! / ((k−1−(i−1))! (i−1)!) + (k−1)! / ((k−1−i)! i!)
        = (k−1)! [ 1/((k−i)!(i−1)!) + 1/((k−1−i)! i!) ]
        = (k−1)! (i + (k−i)) / ((k−i)! i!)
        = k! / ((k−i)! i!)
        = ((k+1)−1)! / (((k+1)−1−i)! i!).

Therefore, Eq. (8) is satisfied for the (k+1)-th partial derivative. As a result, Eq. (8) is satisfied for all k > 0.
Proceedings of the AAMAS Workshop on Adaptive and Learning Agents, May 2010, Toronto, Canada
Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning

General Terms
Algorithms, Experimentation

Keywords
Transfer Learning, Robotics, Reinforcement Learning, Artificial Intelligence

ABSTRACT
As robots become more widely available, many capabilities that were once only practical to develop and test in simulation are becoming feasible on real, physically grounded, robots. This new-found feasibility is important because simulators rarely represent the world with sufficient fidelity that developed behaviors will work as desired in the real world. However, development and testing on robots remains difficult and time consuming, so it is desirable to minimize the number of trials needed when developing robot behaviors.
This paper focuses on reinforcement learning (RL) on physically grounded robots. A few noteworthy exceptions notwithstanding, RL has typically been done purely in simulation, or, at best, initially in simulation with the eventual learned behaviors run on a real robot. However, some recent RL methods exhibit sufficiently low sample complexity to enable learning entirely on robots. One such method is transfer learning for RL. The main contribution of this paper is the first empirical demonstration that transfer learning can significantly speed up and even improve the asymptotic performance of RL done entirely on a physical robot. In addition, we show that transferring information learned in simulation can bolster additional learning on the robot.

1. INTRODUCTION
Physically grounded robots need to be able to learn from their experience, both in order to deal with changing environments and to adapt to new problems. For the purpose of online learning of sequential decision making tasks with limited feedback, value-function-based reinforcement learning (RL) [15] is an appealing paradigm, because of the well-defined semantics of the value function and its elegant theoretical properties. However, a few notable successes notwithstanding (e.g., flying RC helicopters [3, 9] and quadruped walking [8]), RL algorithms have typically been applied only in simulation, or at best trained in simulation with the eventual learned behaviors run on a real robot (e.g., [6], [10], and [5]).
Learning on physically grounded robots is difficult for several reasons, including environmental and sensor noise, high costs of failure (such as a crashed helicopter), the large amount of time it takes to perform tasks, and the fact that robots' dynamics are often not constant due to wear and tear on their motors. Thus, to the extent possible, it is desirable to train robots in a controlled environment before sending them out into the world. Doing so can reduce damage to the robots and prepare them to deal with expected situations. However, when encountering unexpected situations after "deployment" in the real world, the robot will have to continue to adapt. Such unexpected situations can even arise from the dynamics of the robot itself changing as its joints break, or as repairs are made. It is conceivable to relearn tasks from scratch each time a change happens, but due to the time and cost of learning, this is not practical. Instead, it is desirable for the robot to reuse prior information in order to learn faster. The concept of reusing information from past learning is the idea behind transfer learning.
Transfer learning for RL tasks has been shown to be effective in simulation [18], but no prior work has been done on transfer learning on physically grounded robots. The main contribution of this paper is the first empirical demonstration that transfer learning for RL can significantly speed up and even improve the asymptotic performance of RL with learning done entirely on a physical robot, specifically using Q-value reuse for the Sarsa(λ) algorithm [19]. In addition, we show that transferring information from learning in simulation can improve subsequent learning on the robot.
To this end, we introduce a novel reinforcement learning task for humanoid robots and demonstrate that transfer learning can be effective for this task. The results additionally represent one of the first successful applications of reinforcement learning on the Nao humanoid platform developed by Aldebaran¹. A limited amount of previous work has been done using the Nao, but this work focused on simulation, with only a single run on a physical robot [7].
The remainder of the paper is organized as follows. Section 2 presents the main algorithms used in our experiments, namely Sarsa(λ) and Q-value reuse. Section 3 introduces our experimental testbed and fully specifies the task to be learned. Sections 4 and 5 present the results of our experiments. Section 6 further situates the results in the literature, and Section 7 concludes.

¹ http://www.aldebaran-robotics.com/eng

2. BACKGROUND
Reinforcement learning (RL) is a framework for learning sequential decisions with delayed rewards [15]. RL is promising for robotics because it handles online learning with limited feedback where actions taken affect the environment. RL has been extensively studied in many domains, with positive results.
However, RL techniques can require long training times. Therefore, especially on robots, it can be useful to reuse knowledge learned from similar problems to speed up training via transfer learning.
Value-function-based RL algorithms assume that the task to be learned can be modeled as a Markov Decision Process (MDP). An MDP is a four-tuple (S, A, T, R), where S is a state space, A is an action space, T is a transition function, and R is a reward function.

Figure: Q-value reuse schematic. The source-task agent interacts with the source-task environment (action a_source, state s_source, reward r_source), the target-task agent interacts with the target-task environment (a_target, s_target, r_target), and the two are connected through an inter-task mapping.
Specifically, the robot's task was to hit an orange ball as far as possible at a 45° angle with its right hand. It used its onboard camera to observe the result of each trial and calculate the reward signal. The robot is seated with the ball 80 mm in front of the center of the robot and 170 mm to its right. Note that the robot is not given the ball's location except for the information in the reward signal. Every 75 ms, the robot is given the current positions of the joints and their velocities as observations.
The reward signal is given by r = d · cos(θ), where d is the distance that the ball moved and θ is the angle between the ball's trajectory and the 45° target angle. If the ball was not seen for sufficiently long, it was assumed to have been hit backwards, and the action was assigned a reward of −100. All other steps were given a reward of −1 to encourage the agent to find a fast action sequence to hit the ball.
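A minimal sketch (ours, not the authors' code) of this reward signal; the function names and the ball_seen flag are assumptions introduced for illustration.

import math

BACKWARD_HIT_REWARD = -100.0   # ball not seen long enough: assumed to be hit backwards
STEP_REWARD = -1.0             # per-step penalty, encouraging a fast action sequence

def step_reward() -> float:
    """Reward for an intermediate (non-terminal) time step."""
    return STEP_REWARD

def episode_reward(ball_seen: bool, distance: float, trajectory_angle_deg: float) -> float:
    """Terminal reward r = d * cos(theta), theta measured from the 45-degree target angle."""
    if not ball_seen:
        return BACKWARD_HIT_REWARD
    theta = math.radians(trajectory_angle_deg - 45.0)
    return distance * math.cos(theta)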
The reward from vision can be inaccurate, due to the ball moving outside the sight range of the robot, the arm obscuring the sight of the ball, and noisy distance estimates of the ball. However, we measured these effects and found that they were not very significant. Figure 2 compares the robot's estimate of the reward with measurements taken by hand using a tape measure and a protractor. Out of 50 episodes, only two successful hits were not seen by the robot and incorrectly assumed to be backward hits with reward −100. The R-squared value of the robot's estimations was 0.86.
As shown in Figure 3 (supplied by Aldebaran¹), the robot can use four joints to help it hit the ball: shoulder pitch, shoulder roll, elbow roll, and elbow yaw. For each episode, these joints start at a fixed position with no initial velocity; these values are given in Table 1 and depicted in the left-most frames of Figure 4. Also, the ball starts in the same position for every episode, as shown in Figure 4. At each time step, the robot can accelerate one joint in either direction or leave all the joints alone. Therefore, the robot has nine actions: {no change, accelerate the shoulder pitch upwards, accelerate the shoulder pitch downwards, accelerate the shoulder roll clockwise, accelerate the shoulder roll counter-clockwise, etc.}. Furthermore, it has eight observations: the position and velocity of each joint. The velocities are kept in the range [−100°/s, +100°/s], and the actions are taken every 75 ms (more than 13 times per second) to change the velocity by 50°/s.
It is possible to learn this task without any prior information, but the process can be slow, and the robot converges to a mediocre policy. Our work focuses on improving this learning, specifically by using a related source task as prior information. In this simpler task, the robot only has control of the two shoulder joints, with the elbow roll and yaw fixed at 0° and 0°. Therefore, the robot has only five actions and four observations. We will refer to this simpler task as the source task and the original task as the target task. The keyframes of the robot performing the two tasks can be seen in Figure 4.

Figure 3: Joint movements possible for the task.

Figure 4: Keyframes of the robot tasks: (a) source task, (b) target task.

The robot has less control in this source task, and therefore cannot hit the ball as far as in the target task, but it can learn faster as the problem is simpler. Our central hypothesis is that using Q-value reuse to transfer information from this source task will enable the robot to learn faster on the target task.
As this work focuses on transfer learning on robots, the main case considered was transferring from the source task to the target task on the robot, compared to learning the target task with no prior information. We also replicated both tasks in the Webots simulator² to test our algorithm in a different, though similar, environment (as the dynamics of the simulator do not entirely match the physical robot). We do not assume that a useful simulator will be available in all cases, which is why we focus on transfer on the robot itself. In this case, the simulator allows us to better evaluate the effectiveness and robustness of the algorithm and to run many more experiments than physical robots allow. However, we emphasize that for the main result of the paper, both the source and target tasks were learned on the physical robot.
We refer to the source task on the robot as SOURCE ROB, the target task on the robot as TARGET ROB, the source task in the simulator as SOURCE SIM, and the target task in the simulator as TARGET SIM. The main test of our algorithm is in how the transfer from SOURCE ROB to TARGET ROB and from SOURCE SIM to TARGET SIM performs.

Table 1: Joint angle ranges and starting positions.

    Joint            Min     Max     Start
    Shoulder pitch    0°     115°    115°
    Shoulder roll   −90°      5°     −75°
    Elbow roll        0°     120°     45°
    Elbow yaw       −90°     90°     −45°

² Cyberbotics Ltd. http://www.cyberbotics.com
However, the use of the simulator allows for several other paths for transferring information, and we discuss this idea further in Section 5.
A significant part of the work was done using the Webots simulator², and this work relies heavily on the code developed by the UT Austin Villa robot soccer team³. This code base provides the interface between the learning agent and the robot's actions, as well as providing visual detection of the ball.

³ http://www.cs.utexas.edu/users/austinvilla

4. RESULTS
Transfer learning can be evaluated in many different ways [18]. In this paper, our main focus is on "weak" transfer, meaning that we assume that the time spent in the source task does not count against the learner in the target task. This is the case when the robot has already learned the source task, so this training time is not a new cost. For example, if a robot was trained in a lab before being sent out, we might be interested in the time it would take the robot to learn a new task, and less interested in how long the robot was trained in the lab. We also show one "strong" transfer result, where time spent in the source task does count.
For all experiments, we plot the running average reward for each approach, taken with a 25-episode moving window for the robot tests and a 50-episode window for the simulation tests. Each test on the robot represents five runs, each lasting 50 episodes. In the simulator, each test averages 50 runs, each lasting 1,000 episodes. The 50 episodes on a robot take approximately 30 minutes, and 1,000 episodes in simulation take approximately three hours. This data allows us to draw conclusions with statistical significance and to reason about the convergence of each approach.
The baseline that we use is learning TARGET ROB with no prior information. Figure 5a shows that transfer from SOURCE ROB to TARGET ROB is helpful, improving the reward throughout the entire test. The initial few episodes of each algorithm are very noisy, so the initial positive performance of TARGET ROB is not significant, just the effect of a few outliers. This graph is an evaluation of weak transfer: we do not depict training time in the source task.
Figure 5b shifts the transfer plot 50 episodes to the right to represent the strong transfer scenario. Though not as dramatic, the result is still positive, thus demonstrating that it can be useful to break a robot task into robot subtasks, and then transfer from the subtasks to the target task. In this test, the robot performs about as well in the source task as in the target task, because it does not have enough trials to completely explore the target task and find a good behavior.
Unfortunately, the small number of tests on the robots means that we cannot draw statistical conclusions about the performance of the methods. However, the tests were also replicated in simulation with good results. Figure 6a shows that the transfer from SOURCE SIM to TARGET SIM is helpful, even after a large number of episodes. The differences between the final rewards of each method are statistically significant with a confidence of 99%, and the error bars in the diagram show the standard deviation of the average rewards. Figure 6b shows that our results for strong transfer hold in simulation. Overall, Figures 5 and 6 suggest that transfer learning works on robots, and can greatly speed up learning and reach better end behaviors.

Figure 5: Transfer on the robot to the target task: (a) weak transfer, (b) strong transfer.

Figure 6: Transfer in the simulator: (a) weak transfer, (b) strong transfer.
5. ADDITIONAL EXPERIMENTS
In addition to providing statistically significant results, the use of the simulator opens several other paths for transferring knowledge between tasks, including two-step transfer, where we learn sequentially from multiple source tasks. Two-step transfer is performed as described in Section 2, with the value function

    Q(s, a) = Q1(χ_X1(s), χ_A1(a)) + Q2(χ_X2(s), χ_A2(a)) + Q3(s, a).
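The following Python sketch (ours; the mapping functions, table shapes, and the tabular representation are illustrative assumptions, standing in for the paper's function approximation) shows this Q-value composition: the frozen source tables Q1 and Q2 are consulted through inter-task mappings, while only Q3 is updated in the target task.

import numpy as np

class TwoStepTransferQ:
    """Q-value reuse with two frozen source tables and one learned target table."""

    def __init__(self, q1, map1, q2, map2, n_states, n_actions):
        self.q1, self.map1 = q1, map1          # source task 1: table and (state, action) mapping
        self.q2, self.map2 = q2, map2          # source task 2: table and (state, action) mapping
        self.q3 = np.zeros((n_states, n_actions))   # target-task correction, learned online

    def value(self, s, a):
        s1, a1 = self.map1(s, a)
        s2, a2 = self.map2(s, a)
        return self.q1[s1, a1] + self.q2[s2, a2] + self.q3[s, a]

    def sarsa_update(self, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
        # Only the target-task component Q3 is adjusted; the source tables stay fixed.
        td_error = r + gamma * self.value(s_next, a_next) - self.value(s, a)
        self.q3[s, a] += alpha * td_error

For one-step transfer, the same structure applies with a single source table; eligibility traces, as in Sarsa(λ), are omitted here for brevity.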
Figure 7: The possible paths for transferring information to TARGET ROB among SOURCE ROB, TARGET ROB, SOURCE SIM, and TARGET SIM; the numbered edges (1 to 5) correspond to the tests listed below.

We consider TARGET ROB to be the target for all of the tests, and we continue using 1,000 episodes in simulation and 50 on the physical robot. Figure 7 shows all the ways to transfer information to TARGET ROB, with numbers corresponding to the following tests:

1. SOURCE ROB → TARGET ROB
2. SOURCE SIM → TARGET ROB
3. TARGET SIM → TARGET ROB
4. SOURCE SIM → SOURCE ROB → TARGET ROB
5. SOURCE SIM → TARGET SIM → TARGET ROB

Test 1 is further investigated in Section 4, and the results of the three one-step transfer tests (tests 1, 2, and 3) are displayed in Figure 8. Transferring from TARGET SIM produces the biggest improvement in the early episodes, due to its having already learned about the entire state-action space. However, the agent does have to learn about the differences between the simulated and real robots. Also, transferring from SOURCE SIM performs better than transferring from SOURCE ROB, probably due to the higher number of runs in SOURCE SIM, which allow the agent to explore the state-action space more completely. In the end, all of the transfer methods end up with similar performance, and all perform much better than starting with no prior information.
The two types of two-step transfer were also tested (tests 4 and 5), and the results are shown in Figure 9. Both methods show a substantial boost in the early episodes but later plateau, achieving results similar to those of the other transfer methods. The results of the two-step transfer are not better than some of the one-step transfers, but Figure 10 shows that multi-step transfer can be beneficial, giving a large early boost.

Figure 10: Comparison of one- and two-step transfer.

Though all of these results are for weak transfer, we speculate that these trends will hold for strong transfer (as they did in both one-step transfer cases). Furthermore, transferring from simulation to a physical robot raises the possibility of assigning different costs to training time spent in the simulator and on the robot. For example, if we consider simulation time to be insignificant, then tests 2, 3, and 5 are all evidence of strong transfer.

6. RELATED WORK
One of the earliest uses of transfer learning for reinforcement learning was by Selfridge et al. [12] in the familiar cart-pole domain. In this work, the function approximator was reused for poles of different sizes and weights, with good effect.
Taylor and Stone [18] recently surveyed the use of transfer learning in reinforcement learning. Significant prior work in this area has been performed, with good results. However, little work has been done in applying transfer learning to the area of robotics. Taylor and Stone discuss several approaches to transfer learning, and point out several ways to evaluate the effects of the transfer. Our research focuses on Q-value reuse with supervised task transfer.
Taylor et al. [19] explored Q-value reuse in temporal difference learning with good results. They specifically evaluate a Sarsa agent using a CMAC for function approximation. However, this work focuses on the simulated domain of keepaway soccer. Our work applies this research to a physical robot, and has a greater difference between the source and target tasks.
One interesting approach to transfer learning is to extract higher-level strategies from the policy learned by the agent. Torrey et al. [20] explored this idea using relational macros to represent the strategies learned by inductive logic programming (ILP) in the RoboCup breakaway domain, but this requires the domain to be translated into first-order logic. It is also possible to break a single problem into a series of smaller tasks. Then, the agent learns each of these sub-tasks and combines the learned knowledge for the full task.
In the target task, the state space, actions, and transition function are the same as in the sub-tasks, and the information is transferred via Q-value transfer. Singh [13] also explored this area, naming it "compositional learning."
It is possible to learn a mapping between source and target tasks autonomously (e.g., when a human is unable or unwilling to provide such a mapping). Talvitie and Singh [16] developed an algorithm to generate possible state variable mappings and to learn which mapping is best, treated as an n-armed bandit problem. Further work has been done by Taylor et al. [17] using a model-based approach to reduce the samples needed; they transfer observed (s, a, r, s′) instances, which allows the source and target agents to have different representations for the task. However, these methods are not as reliable as hand-mapping and can be unnecessary for smaller domains.
Unfortunately, tests on robots can be slow, and most learning algorithms require a large amount of training data to perform well. Therefore, it can be useful to train an agent in simulation and transfer these behaviors to a robot [6, 10, 5]. However, we cannot assume that a simulator will accurately model complex perception or manipulation tasks, so it is often useful to tune the behavior from the simulator by running more tests on a robot. This requires combining information about a source simulation task and a target robot task, but no work we know of treats this as a transfer learning problem.
Another way to speed up learning is to use prior demonstrations. Researchers have shown that sub-optimal demonstrations can be sufficient to teach an agent to control an autonomous helicopter [1, 4]. Unfortunately, this requires an expert in the domain to perform the demonstrations, which is not always possible.

7. CONCLUSIONS AND FUTURE WORK
This paper empirically tests transfer learning for RL on physical robots. The results show that model-free RL can be effective on a robot, and that transfer learning can speed up learning on physical robots.
Furthermore, this prior information can be learned in simulation, even if the simulator does not completely capture the dynamics of the robot. For example, the simulator does not model collisions between the robot's different parts, so the dynamics of the arm hitting the body are never learned in the simulator. However, the behaviors learned in the simulator serve as good starting points for learning on the robot. This result is useful when a simulator is available, since simulator tests are significantly easier to run than robot tests: it suggests that only a relatively small amount of tuning is necessary to adapt behaviors learned in the simulator to the real robot. The main motivation for this work is that in some situations learning must be performed entirely on a physical platform, and the positive results in that setting are the main contribution of this paper.
This work opens up several interesting directions for future work. For example, it is worth investigating whether other learning algorithms can learn this task faster than Sarsa, and if so, whether Q-value reuse (if applicable) shows similar benefits with these other algorithms. It would also be interesting to see how different methods for transfer learning perform on this task. In the long run, we view the research reported in this paper as just the first of many possible applications of transfer learning for RL to physical robots.

8. REFERENCES
[1] P. Abbeel and A. Y. Ng. Exploration and apprenticeship learning in reinforcement learning. In ICML '05, pages 1-8. ACM, 2005.
[2] J. S. Albus. Brains, Behavior, and Robotics. Byte Books, Peterborough, NH, 1981.
[3] J. A. Bagnell and J. Schneider. Autonomous helicopter control using reinforcement learning policy search methods. In ICRA '01, pages 1615-1620. IEEE Press, 2001.
[4] A. Coates, P. Abbeel, and A. Y. Ng. Learning for control from multiple demonstrations. In ICML '08, pages 144-151. ACM, 2008.
[5] Y. Davidor. Genetic Algorithms and Robotics: A Heuristic Strategy for Optimization. World Scientific Publishing Co., Inc., 1991.
[6] E. Gat. On the role of simulation in the study of autonomous mobile robots. In AAAI-95 Spring Symposium on Lessons Learned from Implemented Software Architectures for Physical Agents, March 1995.
[7] T. Hester, M. Quinlan, and P. Stone. Generalized model learning for reinforcement learning on a humanoid robot. In ICRA '10, 2010.
[8] N. Kohl and P. Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. In ICRA '04, May 2004.
[9] A. Y. Ng, H. J. Kim, M. I. Jordan, and S. Sastry. Inverted autonomous helicopter flight via reinforcement learning. In International Symposium on Experimental Robotics. MIT Press, 2004.
[10] J. M. Porta and E. Celaya. Efficient gait generation using reinforcement learning. In Proceedings of the Fourth International Conference on Climbing and Walking Robots, pages 411-418, 2001.
[11] G. A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Dept., 1994.
[12] O. G. Selfridge, R. S. Sutton, and A. G. Barto. Training and tracking in robotics. In IJCAI, pages 670-672, 1985.
[13] S. P. Singh. Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8:323-339, 1992.
[14] R. S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In NIPS '96, 1996.
[15] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 1998.
[16] E. Talvitie and S. Singh. An experts algorithm for transfer learning. In IJCAI, pages 1065-1070, 2007.
[17] M. E. Taylor, N. K. Jong, and P. Stone. Transferring instances for model-based reinforcement learning. In ECML PKDD, pages 488-505, September 2008.
[18] M. E. Taylor and P. Stone. Transfer learning for reinforcement learning domains: A survey. JMLR, 10(1):1633-1685, 2009.
[19] M. E. Taylor, P. Stone, and Y. Liu. Transfer learning via inter-task mappings for temporal difference learning. JMLR, 8(1):2125-2167, 2007.
[20] L. Torrey, J. W. Shavlik, T. Walker, and R. Maclin. Relational macros for transfer in reinforcement learning. In ILP '07, 2007.
Proceedings of the AAMAS Workshop on Adaptive and Learning Agents, May 2010, Toronto, Canada
Categories and Subject Descriptors
I.2.11 [Artificial Intelligence]: Distributed Artificial Intelligence - Multiagent Systems; I.2.6 [Artificial Intelligence]: Learning

General Terms
Algorithms, Performance

Keywords
Multi-agent learning, Reinforcement learning

ABSTRACT
The design of reinforcement learning solutions to many problems artificially constrains the action set available to an agent, in order to limit the exploration/sample complexity. While exploring, if an agent can discover new actions that break through the constraints of its basic/atomic action set, then the quality of the learned decision policy can improve. On the flip side, considering all possible non-atomic actions might explode the exploration complexity. We present a potential-based solution to this dilemma, and empirically evaluate it in grid navigation tasks. In particular, we show that both the solution quality and the sample complexity improve significantly when basic reinforcement learning is coupled with action discovery.

1. INTRODUCTION
Reinforcement learning is a popular framework for agent-based solutions to many problems, primarily because of the simplicity of design and the strong convergence guarantees in the face of uncertainty and limited feedback. In typical on-line reinforcement learning problems, an agent interacts with an unknown environment by executing actions and learns to optimize long-term payoffs (feedback from the environment) consequent to selecting actions from a given set, A, in every state. In most cases, care is taken to ensure that the set of actions is not too large, usually by discretizing continuous action spaces (see [7] for an exception). This is because a large action set can slow down exploratory learning by creating too many alternate trajectories through the state space to be explored. However, in order to curtail this exploration space, action sets are oftentimes artificially limited (in addition to the physical limitations of an agent), leading to constraints in the learned behaviors as well.
Consider a simple example of this limitation: assume that the robotic arm in Figure 1(a) is physically limited to rotating by no less than 2° at a time. The goal is to get it to rotate by 13°. Instead of allowing it to explore every possible action (2° to 359°), the designer might prefer to allow only 4 actions, viz., 2° clockwise and anti-clockwise, and 5° clockwise and anti-clockwise. Although this would enable the robot to learn an action policy for any integer goal angle, many of those policies would have constraints that are imposed by the design choice, not by the robot's physical limitation; e.g., it would have to execute three 5° actions followed by a 2° action in the reverse direction. However, the robot could have learned to simply turn by 13° in one smooth motion, had the learning problem not been artificially constrained. On the other hand, allowing a full-blown action set might slow down learning to such an extent that no performance improvement (over a random-policy baseline) may be observed in any reasonable time frame.

Figure 1: Motivating examples: (a) a robotic arm that must rotate from θ_init to θ_goal, (b) a grid-world navigation task.

Consider a second example, a grid-world navigation task, as shown in Figure 1(b). In such worlds, the action set is usually assumed to contain the 8 atomic actions that an agent can take to move from one state (tile corner) to an (8-connected) neighboring state. However, the optimal policies generated by such an action set can make for unnatural navigation paths, such as the path from state A to the bottleneck B shown in solid arrows in Figure 1(b). The most natural path from A to B would be the dotted arrow in Figure 1(b), but accommodating such actions might make the action set of the agent too large.
This example also highlights the difference between our work and the theory of options [13]. An option in this example might allow an agent to move to the doorway (B) with a temporally extended action sub-plan that consists of the same atomic actions (i.e., the chain of solid arrows). In contrast, our method adopts a fundamentally new action (the dotted arrow), whereby an agent can move in a straight line to B, instead of being constrained by the set of atomic actions. However, as mentioned before, it is not immediately clear whether such additional actions must come at the cost of a reduced learning rate.
In this paper, we propose a method to address the tradeoff between discovering new actions and keeping the learning rate high. We allow a reinforcement learning agent to start exploring its environment with the same (limited) basic/atomic action set, but enable it to discover new actions on-line that are expected to lead to its goal faster. As the agent augments its action set with these newly discovered promising actions, its learning rate might be expected to fall. However, if only the most promising actions are added, then they may actually decrease the time to reach the goal, thereby accelerating the learning. We experimentally study the relative effects of these two factors in the grid-world navigation domain with single agents. We show that action discovery can indeed improve the solution quality while significantly reducing the exploration/sample complexity. Furthermore, the reason behind the success of action discovery, viz., the improvement in the connectivity of the state graph, indicates an added benefit for multi-agent coordination learning. Coordination Problems (CPs) [3] are points in multi-agent sequential decision problems where agents must coordinate their actions in order to optimize future global returns. With fewer CPs, the learning problem is simplified, leading to faster learning. Since action discovery can reduce the number of points where agents would need to coordinate (i.e., reduce CPs), action discovery can greatly enhance the learning rates in multi-agent coordination learning tasks. In order to verify this intuition, we adapt the Joint Action Learning (JAL) algorithm [4] with action discovery in a multi-agent box-pushing task, and show that the beneficial impact of action discovery does indeed apply.

2. REINFORCEMENT LEARNING
Reinforcement learning (RL) problems are modeled as Markov Decision Processes, or MDPs [12]. An MDP is given by the tuple {S, A, R, T}, where S is the set of environmental states that an agent can be in at any given time, A is the set of actions it can choose from at any state, R : S × A → ℜ is the reward function, i.e., R(s, a) specifies the reward from the environment that the agent gets for executing action a ∈ A in state s ∈ S, and T : S × A × S → [0, 1] is the state transition probability function, specifying the probability of the next state in the Markov chain consequent to the agent's selection of an action in a state. The agent's goal is to learn a policy (action decision function) π : S → A that maximizes the sum of discounted future rewards from any state s, given by

    V^π(s) = E_T[R(s, π(s)) + γR(s′, π(s′)) + γ²R(s′′, π(s′′)) + ...],

where s, s′, s′′, ... are samplings from the distribution T following the Markov chain with policy π, and γ ∈ (0, 1) is the discount factor.
A common method for learning the value function V defined above, through online interactions with the environment, is to learn an action-quality function Q given by

    Q(s, a) = R(s, a) + γ max_π Σ_{s′} T(s, a, s′) V^π(s′).             (1)

This quality value stands for the discounted sum of rewards obtained when the agent starts from state s, executes action a, and follows the optimal policy thereafter. Action-quality functions are preferred over value functions, since the optimal policy can be calculated more easily from the former. The Q function can be learned by online dynamic programming using various update rules, such as temporal difference (TD) methods [12]. In this paper, we use the on-policy Sarsa rule, given by

    Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γQ(s_{t+1}, a_{t+1}) − Q(s_t, a_t)],

where α ∈ (0, 1] is the learning rate, r_{t+1} is the actual environmental reward, and s_{t+1} ∼ T(s_t, a_t, ·) is the actual next state resulting from the agent's choice of action a_t in state s_t. We assume that the agent uses an ε-greedy strategy for action selection: it selects action a_t = arg max_b Q(s_t, b) in state s_t with probability (1 − ε), but with probability ε it selects a random action.
Sarsa is named after the acronym of its steps: state, action, reward, state, action. From state s_t, the agent picks action a_t, receives a reward r_{t+1}, transitions to state s_{t+1}, and then selects action a_{t+1} in that state. It is only at this point that it can update Q(s_t, a_t), using the above TD rule.
given by,
constrained action sets [10] seeks to limit the action set of an agent
during exploration by constructing appropriate Lyapunov functions
V π (s) = ET [R(s, π(s))+γR(s′, π(s′ ))+γ 2 R(s′′ , π(s′′ ))+. . .] to guide exploration, while action transfer [11] seeks to bias action
selection in new tasks by exploiting successful actions from pre-
where s, s′ , s′′ , . . . are samplings from the distribution T following vious tasks. Given the significant prior effort in reducing sample
the Markov chain with policy π, and γ ∈ (0, 1) is the discount complexity, some by eliminating or reducing the weight of avail-
factor. able actions, it may sound counterproductive to seek to expand an
A common method for learning the value-function, V as defined agent’s available action set. Our insight is that with the discovery
above, through online interactions with the environment, is to learn of new actions that circumvent policy constraints, more efficient
Page 31 of 99
policies can be learned and exploited to ultimately learn to achieve to physical limitations of the agent or the environment, for all s, s′ .
the goal faster. As a bonus, the quality of the learned solution is It is useful to deal with both possibilities uniformly, with a cost
also expected to improve. function.
The basic insight that learning temporally extended abstractions We assume that for a given domain, a cost function c : S ×
of ground behavior can increase the learning rate by reusing ab- S 7→ ℜ, is always available, such that c(s, s′ ) gives the cost of
stractions, has been verified before in the context of options [13]. executing an action that would take an agent from state s to state
However, there is a fundamental difference between our work and s′ , i.e., ass′ . If c(s, s′ ) < ∞, this simply means that there is some
the theory of options. While options can be loosely thought of action (whether atomic or newly discovered) that takes the agent
as labels for a series of atomic actions that are useful to execute from state s directly to state s′ . However, if c(s, s′ ) = ∞, then no
in the same sequence in many different states, and are geared to- such action exists. c is virtually an oracle that can be enquired by
ward reusable knowledge, our work considers actually new actions. the agent for pairs of states that it has seen in the past. Our setting
When options are considered as additional actions that an agent can is different from regular RL settings in that the agent does not know
select in place of an atomic action, they have been shown to expe- the state space a priori, but has access to a transition function oracle
dite learning. However, discovering options is not a simple task. In (c), whereas in regular RL settings the state space is known but the
contrast, it may be simple to discover new ground actions outside transition function is unknown.
an agent’s set of atomic actions, as we demonstrate in grid naviga- The cost function also serves as the measure of action complex-
tion tasks. Rather than bank on their reusability as with options, we ity, and can be used to exponentiate γ for SMDPs. For actions
rely on the ability of these new actions to improve the policy qual- outside the atomic action set (A0 ), and having a finite cost, we do
ity by connecting topologically distant states in the state graph. It not assume that a reward sample for such an action is available un-
is not immediately clear if such qualitative enhancement will also less this action is actually executed. Hence the first time that such
reduce sample complexity. But our experiments in simple grid nav- an action is discovered (line 16, Algorithm 1), the reward is esti-
igation tasks show that this is indeed possible. mated (r̂ in line 18, Algorithm 1) on the basis of the actual rewards
Reinforcement learning in multi-agent sequential decision tasks r1 , r2 .
has been an active area of research [3, 6, 8, 2]. In multi-agent Clearly, accepting every newly discovered action into the set of
systems the decision complexity (typically the size of the Q-table) actions will be expensive for learning. For instance, in a grid of
usually depends exponentially on the number of agents, and so it is size n × n, there may be O(n2 ) such new actions, per state, i.e.,
even less intuitive whether worsening the decision complexity by potentially O(n4 ) actions to contend with. Accomodating such a
accommodating new actions can help the learning rate at all. We large number of actions will impact the exploration and reduce the
answer this question affirmatively, by showing that Joint Action learning rate. Fortunately, many of these actions may be needless
Learners (JAL) [4] with action discovery do learn better policies to explore, e.g., if they lead away from the goal. It is possible to
with lower sample complexity in a multi-agent box pushing task estimate the value potential of a state, Φ, precisely for this pur-
than regular JALs. pose. Potential functions, Φ(s), have been used before, to shape
rewards and reduce the sample complexity of reinforcement learn-
4. ACTION DISCOVERY ing [9]. Such functions can be set by the agent designers or domain
In reinforcement learning problems, the atomic action set, A0 , designers. In this paper, we use such functions to informatively
is usually fixed. Even if new options are discovered, these options select among newly discovered actions. To illustrate our heuristic
are described in terms of the atomic actions from A0 . However,
S
in many cases new actions that are neither included in A0 , nor 3
Φ( S 3 )
precluded by the agent’s capabilities, may be able to improve the
agent’s performance by
• reducing the number of steps to the goal, or the total solution
cost γ cost(s 1, s3 )
Page 32 of 99
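The approach therefore assumes only two pieces of domain knowledge: a cost oracle c(s, s′) and a potential function Φ(s). As a purely illustrative grid-navigation instantiation — the grid size, obstacle layout, and function names below are our own assumptions, not taken from the paper — these two pieces might look as follows:

```python
import math

# Hypothetical grid world: states are (x, y) cells.  GOAL and OBSTACLES are
# placeholders chosen for illustration only.
GOAL = (8, 8)
OBSTACLES = {(4, y) for y in range(2, 7)}  # assumed wall segment

def blocked(s, s2):
    """Assumed line test: does the straight segment from s to s2 cross an obstacle?"""
    steps = 20
    for i in range(steps + 1):
        x = s[0] + (s2[0] - s[0]) * i / steps
        y = s[1] + (s2[1] - s[1]) * i / steps
        if (round(x), round(y)) in OBSTACLES:
            return True
    return False

def cost(s, s2):
    """c(s, s'): distance if some action can take the agent from s directly to s',
    and infinity if the transition is physically precluded."""
    return math.inf if blocked(s, s2) else math.dist(s, s2)

def potential(s):
    """Phi(s): a simple goal-proximity potential (1/(1+d) here to avoid division
    by zero at the goal; the paper's experiments use 1/distance)."""
    return 1.0 / (1.0 + math.dist(s, GOAL))
```

Any state pair the agent has seen with finite cost is a candidate new action; the potential function is what lets the agent rank such candidates without having to execute them first.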
To illustrate our heuristic, … action (atomic or otherwise) that had transitioned the agent from s1 to s2. This question may be heuristically answered by comparing the potential backup values from both s2 and s3 to s1. These potential backup values can be estimated as γ^{c(s1,s2)} Φ(s2) from s2, and γ^{c(s1,s3)} Φ(s3) from s3. Consequently, we use the following criterion for accepting a newly discovered action, a_{s1 s3}:

γ^{c(s1,s3)} Φ(s3) > (1 + δ) γ^{c(s1,s2)} Φ(s2)

where δ is a slack variable guiding the degree of conservatism in accepting new actions. This step is shown in line 16 of Algorithm 1. Furthermore, new actions merely facilitate reaching the goal; they are not necessary for the agent to reach the goal. The agent should be able to find a baseline policy to the goal using just the atomic actions, in the worst case. Hence, we use the above test rather conservatively (δ > 0) to select or reject a newly discovered action.

[Figure 3: the two grid navigation tasks, G1 and G2, with Start and Goal cells marked.]

Algorithm 1 Sarsa-AD (Sarsa with Action Discovery)
1: Initialize ǫ, δ, α, γ
…
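Only the first line of the Sarsa-AD listing survives above, but the acceptance test itself is a one-line comparison once c and Φ are available. A minimal sketch (function and parameter names are ours, not the authors'); per the text, this is the check that Algorithm 1 applies at line 16:

```python
def accept_new_action(s1, s2, s3, cost, potential, gamma=0.9, delta=0.1):
    """Return True if the newly discovered action a_{s1 s3} should be accepted.

    Implements the criterion from the text:
        gamma**c(s1, s3) * Phi(s3) > (1 + delta) * gamma**c(s1, s2) * Phi(s2)
    where s2 is the state reached by the action that originally took the agent
    out of s1, and delta > 0 makes the test conservative.
    """
    c13 = cost(s1, s3)
    c12 = cost(s1, s2)
    if c13 == float("inf"):  # no physically realizable action from s1 to s3
        return False
    lhs = (gamma ** c13) * potential(s3)
    rhs = (1.0 + delta) * (gamma ** c12) * potential(s2)
    return lhs > rhs
```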
Figures 4 and 7 show the learned solution qualities on the 2 maps in Figure 3, respectively, as total path lengths. As one might expect, …

[Figures 6–9 (plots; curves: All actions, δ=0, δ=1, Basic SARSA; x-axis: Number of Episodes; y-axis: learned path length to goal (meters) or size of the discovered action set).]
Figure 6: Growth in the size of the set At(.) − A0(.) over visited states, against episodes for task G1.
Figure 7: Plot of solution quality against episodes for task G2.
Figure 8: Plot of sample complexity against episodes for task G2.
Figure 9: Growth in the size of the set At(.) − A0(.) over visited states, against episodes for task G2.
problem. In our experiments we consider two agents pushing a box on a plane, so we allow one agent to exert a force along the x-axis only (we call it the x-agent), and the other along the y-axis only (the y-agent). By removing overlap in the directionalities of the forces, we ensure that the agents do not trivially coordinate at some decision points. This serves the purpose of isolating the impact of action discovery on CPs, with the impact on accidental coordination being removed. Note, however, that this is only meant for our experimental set-up; it is not necessary to preclude overlaps in the agents' atomic action sets. Also, agents can achieve such clean separation of their action sets by prior agreement in cooperative domains. It is worth noting that in this setting, the multi-agent box-pushing task is very closely related to the single-agent navigation task studied earlier.

We allow each agent to test the feasibility of a new action using the same method as in Algorithm 1. If a new action passes the test, then all agents discover that action and augment their action sets in that joint state with the appropriate component of the discovered action. Therefore, if an action (x′, y′) is discovered, the x-agent appends x′ as a new action in its own list of actions in that state, and also includes y′ as a new action of the other agent in that state. The y-agent performs the corresponding operations as well. This means that with each discovery, the size of the joint action table grows at the rate of O(|At|^{n−1}), where At is the largest of the current action sets over n agents. Given such a phenomenal growth in decision complexity, it is unclear if action discovery will benefit multi-agent learning.

4.3 Experiments in the Box-pushing Task
We use a 9×9 grid for the discrete box-pushing task, as shown in Figure 10. Each JAL uses action discovery as shown in Algorithm 1 with similar parameters as in the single-agent experiments (with some differences):

• A0 consists of 3 actions for each state, for each agent: ±1 or 0 in its chosen direction,
• c(s, s′) = distance(s, s′), with simple line-tests detecting blocked paths (i.e., c(s, s′) = ∞),
• rewards are 1 for any action reaching the goal, -1 for hitting any obstacle including the boundary, and 0 otherwise,
• φ(s) = 1 / distance(s, goal),
• r̂ = r1 + r2,
• δ = 0.1, α = 0.25, γ = 0.9, and ǫ = 0.01.
• All learning algorithms (including basic Sarsa JAL) use the Φ function for state-action value initialization (these settings are collected in the sketch below).
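For concreteness, the experimental settings listed above can be gathered into code as follows. This is our own rendering under assumed names; in particular, the goal cell is a placeholder since Figure 10 is not reproduced here:

```python
import math

GRID = 9                      # 9x9 discrete box-pushing grid
GOAL = (8, 4)                 # assumed goal cell (illustrative only)
ALPHA, GAMMA, DELTA, EPSILON = 0.25, 0.9, 0.1, 0.01

def atomic_actions():
    """A0 per agent: -1, 0, or +1 force along that agent's own axis."""
    return (-1, 0, 1)

def reward(next_state, hit_obstacle):
    """+1 on reaching the goal, -1 for hitting any obstacle or the boundary, 0 otherwise."""
    if next_state == GOAL:
        return 1.0
    return -1.0 if hit_obstacle else 0.0

def phi(state):
    """phi(s) = 1 / distance(s, goal); used both to initialize state-action
    values and to screen newly discovered joint actions."""
    d = math.dist(state, GOAL)
    return 1.0 / d if d > 0 else float("inf")

def estimated_reward(r1, r2):
    """r_hat = r1 + r2 for a newly discovered action, per the settings above."""
    return r1 + r2
```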
[Figure 11 (plot): learned solution quality vs. number of episodes, 0–1e+06; curve labelled Discovery.]
Figure 11: Plot of solution quality against episodes for the multi-agent box-pushing task.

5. CONCLUSION
… agent case. While (1) is to be expected, (2) is not quite intuitive and needs further investigation. Increasing the number of agents in the box-pushing task will necessitate overlap in the action spaces of the agents. We will allow all agents to act in both x and y directions, but at any given time an agent must pick an action in one of the two directions. This means an agent can choose the magnitude of the force exerted on the box, while the orientation must be either in the x-direction or the y-direction. This restriction would ensure that agents do not discover actions in arbitrary orientations, since that would reduce the need to coordinate with other agents. A technical difficulty arising from not imposing this restriction is that the outcome of a joint action (where each action can be in an arbitrary orientation) may not fall on a grid point in discrete maps.

We also plan to investigate the impact of increasing the number of agents on the benefit accrued from action discovery in continuous maps, where an action would be composed of two choices: the magnitude of the force and the orientation. However, since such domains require some kind of function approximation for learning the action values, it is not immediately clear how a newly discovered action could be reconciled with a function approximator that
usually works with a fixed set of discrete actions. There is very little work that considers both continuous action spaces and continuous state spaces, and it would be non-trivial to adapt any of these techniques to accommodate new actions.
7. ACKNOWLEDGMENTS
We would like to thank the anonymous reviewers for helpful
comments. This work was supported by a start-up grant from the
University of Southern Mississippi.
8. REFERENCES
[1] P. Abbeel, A. Coates, M. Quigley, and A. Y. Ng. An
application of reinforcement learning to aerobatic helicopter
flight. In NIPS 19, 2007.
[2] S. Abdallah and V. Lesser. Multiagent reinforcement
learning and self-organization in a network of agents. In
Proceedings of 6th International Conference on Autonomous
Agents and Multiagent Systems (AAMAS), 2007.
[3] C. Boutilier. Sequential optimality and coordination in
multiagent systems. In Proceedings of the Sixteenth
International Joint Conference on Artificial Intelligence,
pages 478–485, 1999.
[4] C. Claus and C. Boutilier. The dynamics of reinforcement
learning in cooperative multiagent systems. In Proceedings
of the 15th National Conference on Artificial Intelligence,
pages 746–752, Menlo Park, CA, 1998. AAAI Press/MIT
Press.
[5] R. H. Crites and A. G. Barto. Improving elevator
performance using reinforcement learning. In Advances in
Neural Information Processing Systems 8, volume 8, pages
1017–1023, 1996.
[6] J. Hu and M. P. Wellman. Nash Q-learning for general-sum
stochastic games. Journal of Machine Learning Research,
4:1039–1069, 2003.
[7] A. Lazaric, M. Restelli, and A. Bonarini. Reinforcement
learning in continuous action spaces through sequential
monte carlo methods. In J. Platt, D. Koller, Y. Singer, and
S. Roweis, editors, Advances in Neural Information
Processing Systems 20, pages 833–840, Cambridge, MA,
2008. MIT Press.
[8] M. L. Littman. Markov games as a framework for
multi-agent reinforcement learning. In Proc. of the 11th Int.
Conf. on Machine Learning, pages 157–163, San Mateo, CA,
1994. Morgan Kaufmann.
[9] A. Y. Ng, D. Harada, and S. Russell. Policy invariance under
reward transformations: Theory and application to reward
shaping. In Proc. 16th International Conf. on Machine
Learning, pages 278–287. Morgan Kaufmann, 1999.
[10] T. J. Perkins and A. G. Barto. Lyapunov-constrained action
sets for reinforcement learning. In Proceedings of the ICML,
pages 409–416, 2001.
[11] A. A. Sherstov and P. Stone. Improving action selection in
MDP’s via knowledge transfer. In Proceedings of the
Twentieth National Conference on Artificial Intelligence,
July 2005.
[12] R. Sutton and A. G. Barto. Reinforcement Learning: An
Introduction. MIT Press, 1998.
[13] R. Sutton, D. Precup, and S. Singh. Between MDPs and
semi-MDPs: A framework for temporal abstraction in
reinforcement learning. Artificial Intelligence, 112:181–211,
1999.
[14] G. Tesauro. Temporal difference learning and TD-gammon. Communications of the ACM, 38(3):58–68, 1995.
[15] E. Wiewiora. Potential based shaping and Q-value initialization are equivalent. Journal of Artificial Intelligence Research, pages 205–208, 2003.
Proceedings of the AAMAS Workshop on Adaptive and Learning Agents, May 2010, Toronto, Canada
proposed algorithm, PCM(A) [8], is, to the best of our knowledge, the only known MAL algorithm to date that achieves compatibility, safety and targeted optimality against adaptive opponents in arbitrary repeated games.

1.2 Contributions
CMLeS improves on Awesome by guaranteeing both safety and targeted optimality against adaptive opponents. It improves upon PCM(A) in five ways.
1. The only guarantees of optimality against adaptive opponents that PCM(A) provides are against the ones that are drawn from an initially chosen target set. In contrast, CMLeS can model every adaptive opponent whose memory is bounded by Kmax. Thus it does not require a target set as input: its only input is Kmax, an upper bound on the memory size of the adaptive opponents that it is willing to model and exploit.
2. PCM(A) achieves targeted optimality against adaptive opponents by requiring all feasible joint histories of size Kmax to be visited a sufficient number of times. Kmax for PCM(A) is the maximum memory size of any opponent from its target set. CMLeS significantly improves this by requiring a sufficient number of visits to all feasible joint histories only of size K+1. Thus CMLeS promises targeted optimality in a number of steps polynomial in λ^{−Size(K+1)}, in comparison to PCM(A), which provides similar guarantees but in a number of steps polynomial in λ^{−Size(Kmax)}. This sample efficiency property makes CMLeS a good candidate for online learning.
3. Unlike PCM(A), CMLeS promises targeted optimality against opponents which eventually become memory-bounded with K ≤ Kmax.
4. PCM(A) can only guarantee convergence to a payoff within ǫ of the desired Nash equilibrium payoff with a probability δ. In contrast, CMLeS guarantees convergence in self-play with probability 1.
5. CMLeS is relatively simple in its design. It tackles the entire problem of targeted optimality and safety by running an algorithm that implicitly achieves either of the two, without having to reason separately about adaptive and arbitrary opponents.
The remainder of the paper is organized as follows. Section 2 presents background and definitions, Sections 3 and 4 present our algorithm, Section 5 presents empirical results and Section 6 concludes.

2. BACKGROUND AND CONCEPTS
This section reviews the definitions and concepts necessary for fully specifying CMLeS.
A matrix game is defined as an interaction between n agents. Without loss of generality, we assume that the sets of actions available to all the agents are the same, i.e., A1 = . . . = An = A. The payoff received by agent i during each step of interaction is determined by a utility function over the agents' joint action, ui : A^n ↦ ℜ. Without loss of generality, we assume that the payoffs are bounded in the range [0,1]. A repeated game is a setting in which the agents play the same matrix game repeatedly and infinitely often.
A single stage Nash equilibrium is a stationary strategy profile {π1*, . . . , πn*} such that for every agent i and for every other possible stationary strategy πi, the following inequality holds: E_{(π1*,...,πi*,...,πn*)} ui(·) ≥ E_{(π1*,...,πi,...,πn*)} ui(·). It is a strategy profile in which no agent has an incentive to unilaterally deviate from its own share of the strategy. A maximin strategy for an agent is a strategy which maximizes its own minimum payoff. It is often called the safety strategy, because resorting to it guarantees the agent a minimum payoff.
An adaptive opponent strategy looks back at the most recent K joint actions played in the current history of play to determine its next stochastic action profile. K is referred to as the memory size of the opponent (K being the minimum memory size that fully characterizes the opponent strategy). The strategy of such an opponent is then a mapping π : A^{nK} ↦ ∆A. If we consider opponents whose future behavior depends on the entire history, we lose the ability to (provably) learn anything about them in a single repeated game, since we see a given history only once. The concept of memory-boundedness limits the opponent's ability to condition on history, thereby giving us a chance to learn its policy.
We now specify what we mean by playing optimally against adaptive opponents. For notational clarity, we denote the other agents as a single agent o. It has been shown previously [4] that the dynamics of playing against such an o can be modeled as a Markov Decision Process (MDP) whose transition probability function and reward function are determined by the opponents' (joint) strategy π. As the MDP is induced by an adversary, this setting is called an Adversary Induced MDP, or AIM for short.
An AIM is characterized by the K of the opponent which induces it: the AIM's state space is the set of all feasible joint action sequences of length K. By way of example, consider the game of Roshambo, or rock-paper-scissors (Figure 1), and assume that o is a single agent and has K = 1, meaning that it acts entirely based on the immediately previous joint action. Let the current state be (R, P), meaning that on the previous action, i selected R and o selected P. Assume that from that state, o plays actions R, P and S with probability 0.25, 0.25, and 0.5 respectively. When i chooses to take action S in state (R, P), the probabilities of transitioning to states (S, R), (S, P) and (S, S) are then 0.25, 0.25 and 0.5 respectively. Transitions to states that have a different action for i, such as (R, R), have probability 0. The reward obtained by i when it transitions to state (S, R) is -1, and so on.
The optimal policy of the MDP associated with the AIM is the optimal policy for playing against o. A policy that achieves an expected return within ǫ of the expected return achieved by the optimal policy is called an ǫ-optimal policy (the corresponding return is called the ǫ-optimal return). If π is known, then we can compute the optimal policy (and hence an ǫ-optimal policy) by dynamic programming [9]. However, we do not assume that π or even K are known in advance: they need to be learned in online play. We use the discounted payoff criterion in our computation of an ǫ-optimal policy, with γ denoting the discount factor.
Finally, it is important to note that there exist opponents in the literature which do not allow convergence to the optimal policy once a certain set of moves has been played. For example, the grim-trigger opponent in the well-known Prisoner's Dilemma (PD) game, an opponent with memory size 1, plays cooperate at first, but then plays defect forever once the other agent has played defect once. Thus, there is no way of detecting its strategy without defecting, after which it is impossible to recover to the optimal strategy of mutual cooperation.
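To make the AIM construction concrete, the following sketch builds the induced transition distribution for the memory-1 rock-paper-scissors opponent of the example above. Only the example's numbers are taken from the text; the dictionaries and function names are our own:

```python
# Rock-paper-scissors payoff for agent i (keyed by (i's action, o's action)).
PAYOFF = {
    ("R", "R"): 0, ("R", "P"): -1, ("R", "S"): 1,
    ("P", "R"): 1, ("P", "P"): 0,  ("P", "S"): -1,
    ("S", "R"): -1, ("S", "P"): 1, ("S", "S"): 0,
}

# Opponent strategy pi for a K = 1 opponent: maps the previous joint action
# (the AIM state) to a distribution over o's next action.  Only the example's
# state (R, P) is filled in here.
OPPONENT_PI = {("R", "P"): {"R": 0.25, "P": 0.25, "S": 0.5}}

def aim_transitions(state, my_action):
    """Return {next_state: probability} for the adversary-induced MDP.

    The next AIM state is (my_action, o_action); its probability is what
    pi(state) assigns to o_action.  The reward is read off the payoff table
    for the resulting joint action.
    """
    return {(my_action, o_action): p
            for o_action, p in OPPONENT_PI[state].items()}

# aim_transitions(("R", "P"), "S") == {("S","R"): 0.25, ("S","P"): 0.25, ("S","S"): 0.5}
# and PAYOFF[("S", "R")] == -1, matching the example in the text.
```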
In our analysis, we constrain the class of adaptive opponents to include only those which do not negate the possibility of convergence to optimal exploitation, given any arbitrary initial sequence of exploratory moves [7].

      R     P     S
R    0,0  −1,1  1,−1
P   1,−1   0,0  −1,1
S   −1,1  1,−1   0,0
(R = Rock, P = Paper, S = Scissors.) Partial transition function for state (R, P) and action S: to (S, R) with probability 0.25, to (S, P) with probability 0.25, and to (S, S) with probability 0.5.
Figure 1: Example of AIM

Equipped with the required concepts, we are now ready to specify our algorithms. First, in Section 3, we present an algorithm that only guarantees safety and targeted optimality against adaptive opponents. Then, in Section 4, we introduce the full-blown CMLeS algorithm that additionally incorporates convergence.

3. MODEL LEARNING WITH SAFETY
In this section, we introduce a novel algorithm, Model Learning with Safety (MLeS), that ensures safety and targeted optimality against adaptive opponents.

3.1 Overview
MLeS begins with the hypothesis that the opponent is an adaptive opponent (denoted as o) with an unknown memory size, K, that is bounded above by a known value, Kmax. MLeS maintains a model for each possible value of o's memory size, from k = 0 to Kmax, plus one additional model for memory size Kmax+1. Each model π̂k is a mapping A^{nk} ↦ ∆A representing a possible o strategy. π̂k is the maximum likelihood distribution based on the observed actions played by o for each joint history of size k encountered. Henceforth we will refer to a joint history of size k as sk, and to the empirical distribution captured by π̂k for sk as π̂k(sk). π̂k(sk, ao) will denote the probability assigned to action ao by π̂k(sk). When a particular sk is encountered and o's action in the next step is observed, the empirical distribution π̂k(sk) is updated. Such updates happen for every π̂k, on every step. For every sk, MLeS maintains a count value v(sk), which is the number of times sk has been encountered. We call an opponent model an ǫ approximation of π when, for any history of size K, it predicts the true opponent action distribution with error at most ǫ.
On each step, MLeS selects π̂best (and correspondingly kbest) as the one from among the Kmax+1 models (from 0 to Kmax) that currently appears to best describe o's behavior. The mechanism for selecting π̂best ensures that, with high probability, it is either π̂K (the most compact representation of π) or a model with a smaller k which is a good approximation of π. Once such a π̂best is picked, MLeS takes a step towards learning an ǫ-optimal policy for the underlying AIM induced by kbest. If it cannot determine such a π̂best, it defaults to playing the maximin strategy for safety.
Thus, the operations performed by MLeS on each step can be summarized as follows:
1. Update all models based on the past step.
2. Determine π̂best (and hence kbest). If a π̂best cannot be determined, then return null.
3. If π̂best ≠ null, take a step towards solving the reinforcement learning (RL) problem for the AIM induced by kbest. Otherwise, play the maximin strategy.
Of these three steps, step 2 is by far the most complex. We present how MLeS addresses it next.

3.2 Model selection
The objective of MLeS is to find a kbest which is either K (the true memory size) or a suboptimal k s.t. π̂k is a good approximation of π (o's true policy). It does so by comparing models of increasing size to determine at which point the larger models cease to become more predictive of o's behavior. We start by proposing a metric called ∆k, which is an estimate of how much the models π̂k and π̂k+1 differ from each other. But first, we introduce two notations that will be instrumental in explaining the metric. We denote by (ai, ao)·sk a joint history of size k+1 that has sk as its last k joint actions and (ai, ao) as the last, (k+1)'th, joint action. For any sk, we define a set Aug(sk) = ∪_{∀ai,ao∈A²} {(ai, ao)·sk | v((ai, ao)·sk) > 0}. In other words, Aug(sk) contains all joint histories of size k+1 which have sk as their last k joint actions and have been visited at least once. ∆k is then defined as max_{sk, sk+1∈Aug(sk), ao∈A} |π̂k(sk, ao) − π̂k+1(sk+1, ao)|. We say that π̂k and π̂k+1 are ∆k distant from one another.
Based on the concept of ∆k, we make two observations that will come in handy for our theoretical claims made later in this subsection.
Observation 1. For all k ∈ [K, Kmax], k ∈ N, and for any k-sized joint history sk and any sk+1 ∈ Aug(sk), E(π̂k(sk)) = E(π̂k+1(sk+1)). Hence E(∆k) = 0.
Let sK be the last K joint actions in sk and sk+1. π̂k(sk) and π̂k+1(sk+1) represent draws from the same fixed distribution π(sK). So their expectations will always be equal to π(sK). This is because o just looks at the most recent K joint actions in its history to decide on its next-step action.
Observation 2. For k < K, k ∈ N, ∆k is a random variable with 0 ≤ E(∆k) ≤ 1.
In this case, in the computation of π̂k(sk), the draws can come from different distributions. This is because k < K and there is no guarantee of stationarity of π̂k(sk). Thus, ∆k can be any arbitrary random variable with an expected value between 0 and 1.
High-level idea: Alg. 1 presents how MLeS selects kbest. We denote the current values of π̂k and ∆k at time t as π̂k^t and ∆k^t respectively.
Definition 1. {σk^t}_{t∈1,2,...} is a sequence of real numbers, unique to each k, s.t. it satisfies the following:
1. it is a positive decreasing sequence, tending to 0 as t → ∞;
2. for a fixed high probability ρ > 0 and for k ∈ [K, Kmax], Pr(∆k^t < σk^t) > ρ.
The reason for choosing such a {σk^t}_{t∈1,2,...} sequence for each k will become clear over the next two paragraphs. Later, we will show how we compute the σk^t's. MLeS iterates over values of k starting from 0 to Kmax and picks the minimum k s.t. for all k ≤ k′ ≤ Kmax, the condition ∆k′^t < σk′^t is satisfied (steps 3-11).
For k < K, there is no guarantee that ∆k will tend to 0 as t → ∞ (Observation 2). More often than not, ∆k will
tend to a positive value quickly. On the other hand, σk^t → 0 as t → ∞ (condition 1 of Definition 1). This leads to one of the following two cases:
1) σk^t becomes ≤ ∆k^t and step 6 of Alg. 1 holds, thus rejecting k as a possible candidate for selection.
2) k gets selected. However, then we are sure that π̂k^t is no more than Σ_{k≤k′<K} σk′^t distant from π̂K^t (the best model of π we have at present). With increasingly many time steps, π̂k^t needs to be an increasingly better approximation of π̂K^t to keep getting selected.
For k ≥ K, all ∆k^t's → 0 as t → ∞ (Observation 1). Since for all k ≥ K: Pr(∆k^t < σk^t) > ρ (condition 2 of Definition 1), K gets selected with a high probability ρ^{Kmax−K+1}. A model with memory size more than K is selected with probability at most (1 − ρ^{Kmax−K+1}), which is a small value.

Algorithm 1: Find-Model
output: kbest, π̂best
1  kbest ← −1, π̂best ← null
2  for all 0 ≤ k ≤ Kmax, compute ∆k^t and σk^t
3  for 0 ≤ k ≤ Kmax do
4      flag ← true
5      for k ≤ k′ ≤ Kmax do
6          if ∆k′^t ≥ σk′^t then
7              flag ← false
8              break
9      if flag then
10         kbest ← k; π̂best ← π̂k^t
11         break
12 return kbest and π̂best

We now address the final part of Alg. 1 that we have yet to specify: setting the σk^t's (step 2).
Choosing σk^t: In the computation of ∆k^t, MLeS chooses a specific sk^t from the set of all possible joint histories of size k, a specific sk+1^t from Aug(sk^t), and an action ao^t, for which the models π̂k^t and π̂k+1^t differ maximally on that particular time step. So, … number of visits to any member from Aug(sk). So,

Pr(|π̂k+1^t(sk+1^t, ao^t) − E(π̂k+1^t(sk+1^t, ao^t))| < σk^t) > √ρ    (5)
⟹ Pr(∆k^t < σk^t) > ρ

The problem now boils down to selecting a suitable σk^t s.t. Inequality 5 is satisfied. Hoeffding's inequality gives us an upper bound for σk^t in Inequality 5. Using that upper bound and solving for σk^t, we get

σk^t = √( (1 / (2 v(sk+1^t))) ln( 2 / (1 − √ρ) ) ).

So, in general, for each k ∈ [0, Kmax], the σk^t value is set as above. Note that v(sk+1^t) is the number of visits to the specific sk+1^t chosen for the computation of ∆k^t. Setting σk^t as above satisfies both of the conditions specified in Definition 1. Condition 1 follows implicitly since, in infinite play, the action selection mechanism ensures infinite visits to all joint histories of a finite length.
Theoretical underpinnings: Now we state our main theoretical result regarding model selection.
Lemma 3.1. After all feasible joint histories of size K+1 have been visited ((K+1) / (2ǫ²)) ln( 2 / (1 − √ρ) ) times, then with probability at least ρ^{Kmax+2}, the π̂best returned by Alg. 1 is an ǫ approximation of π. ρ is the fixed high probability value from Condition 2 of Definition 1.
Proof. When all k < K have been rejected, Alg. 1 selects K with probability at least ρ^{Kmax−K+1}. If p is the probability of selecting any k < K as kbest, the probability of selecting any k ≤ K as kbest is then at least p + (1 − p)ρ^{Kmax−K+1} > ρ^{Kmax−K+1} > ρ^{Kmax+1}. If kbest = K, then we know that ∆K^t < σK^t. So from Inequality 4,

Pr(|π̂K^t(sK^t, ao^t) − π(sK^t, ao^t)| < σK^t) > √ρ
⟹ Pr(|π̂K^t(sK^t, ao^t) − π(sK^t, ao^t)| < σK^t) > ρ

sK^t and ao^t are the respective joint history of size K and action for which the models π̂K^t and π̂K+1^t maximally differ at t. So in this case, with probability ρ, π̂best is a σK^t approximation of π. In similar fashion it can be shown that if kbest < K, then with probability ρ, π̂best is a Σ_{k≤k′≤K} σk′^t
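Algorithm 1 together with the choice of σk^t above can be sketched compactly in code. The following is our own rendering under simplified data structures (dictionaries of empirical distributions and visit counts), not the authors' implementation:

```python
import math

def sigma(visits, rho):
    """sigma_k^t from Hoeffding's inequality: sqrt(ln(2 / (1 - sqrt(rho))) / (2 v))."""
    if visits == 0:
        return float("inf")
    return math.sqrt(math.log(2.0 / (1.0 - math.sqrt(rho))) / (2.0 * visits))

def delta_k(model_k, model_k1, aug):
    """Largest disagreement between pi_hat_k and pi_hat_{k+1} over visited
    augmented histories; aug maps each s_k to its visited extensions s_{k+1}."""
    worst = 0.0
    for s_k, extensions in aug.items():
        for s_k1 in extensions:
            for a_o, p in model_k[s_k].items():
                worst = max(worst, abs(p - model_k1[s_k1].get(a_o, 0.0)))
    return worst

def find_model(deltas, sigmas, k_max):
    """Pick the smallest k such that delta_{k'} < sigma_{k'} for every k <= k' <= k_max;
    return -1 ('null') if no k qualifies, mirroring Algorithm 1 (Find-Model)."""
    for k in range(k_max + 1):
        if all(deltas[kp] < sigmas[kp] for kp in range(k, k_max + 1)):
            return k
    return -1
```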
reinforcement learning problem of computing a near-optimal Theorem 3.2. For any arbitrary ǫ > 0 and δ > 0, MLeS
policy for that AIM. In order to solve this RL problem, with probability at least 1-δ, achieves at least within ǫ + L(δ)
MLeS uses the variant of the R-Max algorithm that does of the expected value of the best response against any adap-
not assume that the mixing time of the underlying MDP tive opponent, in number of time steps polynomial in 1ǫ ,
is known [3]. R-Max is a model based RL algorithm that ln( 1δ ) and λ−Size(K+1) .
converges to playing an ǫ-optimal policy for an MDP with
probability 1-δ, in time complexity polynomial in 1ǫ , ln( 1δ ), Against an arbitrary o, our claims rely on o not behaving as
and the state space size of the MDP. A separate instantia- a Kmax adaptive opponent in the limit. This means ∆Kmax
tion of the R-Max algorithm is maintained for each of the tends to a positive value, as t → ∞. Alg. 1 returns π̂best
possible Kmax +1 AIMs pertaining to the possible memory as null in the limit, with probability 1. MLeS will then
sizes of o, i.e, M0 , M1 , . . . , MKmax . On each step, based on subsequently converge to playing the maximin strategy, thus
the kbest returned, the R-Max instance for the AIM Mkbest ensuring safety.
is selected to take an action.
The steps that ensure targeted optimality against adap- 4. CONVERGENCE AND MODEL LEARN-
tive opponents are then as follows: ING WITH SAFETY
1. First, ensure that √Alg. 1 keeps returning aǫ(1−γ)
kbest ≤ K with In this section we build on MLeS to introduce a novel MAL
a high probability 1 − δ s.t. π̂best is an 2Size(K) approxi- algorithm for an arbitrary repeated game which achieves
mation of π. The conditions for that to happen are given by safety, targeted optimality, and convergence, as defined in
Lemma 3.1. Playing optimally against such an approxima- Section 1. We call our algorithm, Convergence with Model
tion of π, guarantees an 2ǫ -optimal payoff against o (Lemma Learning and Safety: (CMLeS). CMLeS begins by testing
4 of [3]). Thus an 2ǫ optimal policy for such a model will the opponents to see if they are also running CMLeS (self-
guarantee an ǫ-optimal payoff against o. play); when not, it uses MLeS as a subroutine.
2. Once such √ a kbest ≤ K is selected by Alg. 1 with a high
probability
√ 1 − δ on every step, then with a probability 4.1 Overview
1 − δ, converge to playing an 2ǫ optimal policy for Mkbest . CMLeS (Alg. 2) can be tuned to converge to any Nash
In order to achieve that, the R-Max instantiation for Mkbest equilibrium of the repeated game in self-play. Here, for the
will require a certain fixed number of visits to every joint his- sake of clarity, we present a variant which converges to the
tory of size kbest . Since the kbest selected by Alg. 1 is at most single stage Nash equilibrium. This equilibrium also has
K with a high probability , a sufficient number of visits to the advantage of being the easiest of all Nash equilibria to
every joint history of size K will suffice convergence to an 2ǫ compute and hence has historically been the preferred solu-
optimal policy. tion concept in multiagent learning [2, 5]. The extension of
It can be shown that our R-Max-based action selection CMLeS to allow for convergence to other Nash equilibria is
strategy implicitly achieves both of above steps in number straightforward, only requiring keeping track of the proba-
of time steps polynomial in 1ǫ , ln( 1δ ) and λ−Size(K+1) . Note, bility distribution for every conditional strategy present in
we do not have the ability to take samples at will from dif- the specification of the equilibrium.
ferent histories, but may need to follow a chain of different Steps 1 - 2: Like Awesome, we assume that all agents
histories to get a sample pertaining to one history. In the have access to a Nash equilibrium solver and they compute
worst case, the chain can be the full set of all histories, the same Nash equilibrium profile. If there are finitely many
with each transition occurring with λ. Hence the unavoid- equilibria, then this assumption can be lifted with each agent
able dependence on λ−Size(K+1) , in time complexity. The choosing randomly an equilibrium profile, so that there is a
bounds we provide are extremely pessimistic and likely to non-zero probability that the computed equilibrium coin-
be tractable against most opponents. For example against cides.
opponents which only condition on MLeS’s recent history of Steps 3 - 4: The algorithm maintains a null hypothesis
actions, λ−Size(K+1) dependency gets replaced by a depen- that all agents are playing equilibrium (AAP E). The hy-
dency over just |A|K+1 . pothesis is not rejected unless the algorithm is certain with
So far what we have shown is that MLeS, with a high prob- probability 1 that the other agents are not playing CMLeS.
ability 1-δ on each step, converges to playing an ǫ-optimal τ keeps count of the number of times the algorithm reaches
policy. It is important to note that, acting in this fash- step 4.
ion does not guarantee it a return that is 1-δ times the ǫ- Steps 5 - 8 (Same as Awesome): Whenever the algo-
optimal return. However, we can compute an upper bound rithm reaches step 5, it plays the equilibrium strategy for
on the loss and show that the loss is extremely small for a fixed number of episodes, Nτ . It keeps a running esti-
small values of δ. Let rt be the random variable that de- mate of the empirical distribution of actions played by all
notes the reward obtained on time step t by following the agents, including itself, during this run. At step 8, if for
ǫ-optimal policy. P The maximum loss incurred P is t : |(1 − any agent j, the empirical distribution φτj differs from πj∗
δ) ∞ t ∞ t
<| ∞
− δ)t E(rt )|P
P
t=0 γ E(rt ) − t=0 γ (1P t=0 γ E(rt ) − by at least ǫτe , AAP E is set to false. The CMLeS agent
∞ t t ∞ t ∞ t
(1 − δ)t | ≤
P
t=0 γ (1 − δ) E(rt )| ≤ | t=0 γ − t=0 γ has reason to believe that j may not be playing the same
γδ
(1−γ)(1−γ(1−δ))
. In the above computation, we assume that algorithm. {ǫτe }τ ∈1,2,... represents a decreasing sequence of
whenever MLeS does not play the ǫ-optimal policy, it gets positive numbers converging to 0 in the limit. Similarly
the minimum reward of 0. We denote this loss as L(δ), since {Nτ }τ ∈1,2,... represents an increasing sequence of positive
it is a function of δ (γ being fixed). Note that L(δ) can be numbers converging to infinity in the limit. The ǫτe and Nτ
made extremely small by selecting a very small δ. values for each τ are assigned in a similar fashion to Awe-
This brings us to our main theorem regarding MLeS. some (Definition 4 of [5]).
Steps 10 - 20: Once AAP E is set to false, the algorithm
Algorithm 2: CMLeS Proof of 2. This part of the proof follows from Hoeffd-
input : n, τ = 0 ing’s inequality. CMLeS reaches step 22 with a probability
1 for ∀j ∈ {1, 2, . . . , n} do at least δ in τ polynomial in κ1 and ln( 1δ ), where κ is the
2 πj∗ ← ComputeNashEquilibriumStrategy() maximum probability that any agent assigns to any action
3 AAP E ← true other than ao for a recent Kmax joint history of all agents
4 while AAP E do playing ao .
5 for Nτ rounds do
∗
6 Play πself Theorem 4.2. In self-play, CMLeS converges to playing
7 for each agent j update φτj the Nash equilibrium of the repeated game, with probability
8 recompute AAP E using the φτj ’s and πj∗ ’s 1.
9 if AAP E is false then Proof. We prove the theorem by proving the following:
10 if τ = 0 then 1) In self-play, every time after AAP E is set to false, there
11 Play ao , Kmax +1 times
is a non-zero probability that AAP E is never set to false
12 else if τ = 1 then again.
13 Play ao , Kmax times followed by a
2) If AAP E is never set to false again, then CMLeS con-
14 random action other than ao
verges to the Nash equilibrium with probability 1.
15 else The proof of (1) follows by similar reasoning as in Awe-
16 Play ao , Kmax +1 times
some (Theorem 3 of [5]). If AAP E is never set to false, then
17 if any other agent plays differently then all agents must be playing CMLeS (From Theorem 4.1). As
18 AAP E ← f alse
Nτ approaches ∞, φτj approaches πj∗ . So the agents con-
19 else verge to playing the Nash equilibrium with probability 1 in
20 AAP E ← true
the limit.
21 τ ←τ +1
22 Play MLeS 5. RESULTS
We now present empirical results that supplement the the-
oretical claims. We focus on how efficiently CMLeS models
goes through a series of steps in which it checks whether the adaptive opponents in comparison to existing algorithms,
other agents are really CMLeS agents. The details are ex- PCM(A) and Awesome. For CMLeS, we set ǫ = 0.1, δ =
plained below when we describe the convergence properties 0.01 and Kmax = 10. To make the comparison fair with
of CMLeS (Theorem 4.1). PCM(A), we use the same values of ǫ and δ and always in-
Step 22: When the algorithm reaches here, it is sure (proba- clude the respective opponent in the target set of PCM(A).
bility 1) that the other agents are not CMLeS agents. Hence We also add an adaptive strategy with K = 10 to the target
it switches to playing MLeS. set of PCM(A), so that it needs to explore joint histories of
size 10.
4.2 Theoretical underpinnings We use the 3-player Prisoner’s Dilemma (PD) game as our
We now state our main convergence theorems. representative matrix game. The game is a 3 player version
of the N-player PD present in GAMUT.4 The adaptive op-
Theorem 4.1. CMLeS satisfies both the criteria of tar- ponent strategies we test against are :
geted optimality and safety. 1. Type 1: every other player plays defect if in the last 5
steps CMLeS played defect even once. Otherwise, they play
Proof. To prove the theorem, we need to prove:
cooperate. The opponents are thus deterministic adaptive
1. For opponents not themselves playing CMLeS, CMLeS
strategies with K = 5.
always reaches step 22 with some probability;
2. Type 2: every other player behaves as type-1 with 0.5
2. There exists a value of τ , for and above which, the above
probability, or else plays completely randomly. In this case,
probability is at least δ.
the opponents are stochastic with K = 5.
Proof of 1. We utilize the property that a K adaptive oppo-
The total number of joint histories of size 10 in this case
nent is also a Kmax adaptive opponent (see Observation 1).
is 810 , which makes PCM(A) highly inefficient. However,
The first time AAP E is set to false, it selects a random
CM LeS quickly figures out the true K and converges to op-
action ao and then plays it Kmax +1 times in a row. The
timal behavior in tractable number of steps. Figure 2 shows
second time when AAP E is set to false, it plays ao , Kmax
our results against these two types of opponents. The Y-
times followed by a different action. If the other agents have
axis shows the payoff of each algorithm as a fraction of the
behaved identically in both of the above situations, then
optimal payoff achievable against the respective opponent.
CMLeS knows : 1) either the rest of the agents are play-
Also plotted in the same graph, is the fraction of times CM-
ing CMLeS, or, 2) they are adaptive and plays stochasti-
LeS chooses the right memory size (denoted as convg in the
cally for a Kmax bounded memory where all agents play ao .
plot). Each plot has been averaged over 30 runs to increase
The latter observation comes in handy below. Henceforth,
robustness. Against type-1 opponents (Figure 2(i)), CMLeS
whenever AAP E is set to false, CMLeS always plays ao ,
figures out the true memory size in about 2000 steps and
Kmax +1 times in a row. Since a non-CMLeS opponent must
converges to playing optimally by 16000 episodes. Against
be stochastic (from the above observation), at some point of
type-2 opponents (Figure 2(ii)), it takes a little longer to
time, it will play a different action on the Kmax +1’th step
figure out the correct memory size (about 35000 episodes)
with a non-zero probability. CMLeS then rejects the null hy-
because in this case, the number of feasible joint histories of
pothesis that all other agents are CMLeS agents and jumps
4
to step 22. http://gamut.stanford.edu/userdoc.pdf
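Steps 10–20 of Algorithm 2 amount to a scripted handshake that only agents running CMLeS would reproduce exactly. A rough sketch, with our own function names and with all surrounding bookkeeping omitted:

```python
import random

def handshake_plan(a0, actions, k_max, tau):
    """Scripted play for the tau-th time AAPE turns false (Alg. 2, steps 10-16):
    tau = 0 or tau >= 2 -> play a0 for K_max + 1 steps;
    tau = 1             -> play a0 for K_max steps, then a random action != a0."""
    if tau == 1:
        other = random.choice([a for a in actions if a != a0])
        return [a0] * k_max + [other]
    return [a0] * (k_max + 1)

def others_follow_script(other_agent_actions, expected_script):
    """Steps 17-20: if any other agent deviates from the script, the hypothesis
    that all agents run CMLeS is rejected and CMLeS falls back to MLeS (step 22)."""
    return list(other_agent_actions) == list(expected_script)
```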
[Figure 2 (two plots): ratio of achieved to optimal payoff vs. episode, against the trigger strategy (0–20000 episodes) and against the 50% random / 50% trigger strategy (0–80000 episodes); curves: CMLeS, convg, PCM(A), AWESOME.]
Figure 2: Against adaptive opponents

size 6 is much larger. Both Awesome and PCM(A) perform much worse. PCM(A) plays a random exploration strategy until it has visited every possible joint history of size Kmax; hence it keeps getting a constant payoff during this whole exploration phase.
When Kmax was set to 4, MLeS converged to playing the maximin strategy in about 10000 episodes against both of the above opponents. The convergence part of MLeS uses the framework of Awesome, and the results closely match it.

7. REFERENCES
[1] B. Banerjee and J. Peng. Performance bounded reinforcement learning in strategic interactions. In AAAI'04: Proceedings of the 19th National Conference on Artificial Intelligence, pages 2–7. AAAI Press / The MIT Press, 2004.
[2] M. Bowling and M. Veloso. Convergence of gradient dynamics with a variable learning rate. In Proc. 18th International Conf. on Machine Learning, pages 27–34. Morgan Kaufmann, San Francisco, CA, 2001.
[3] R. I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. J. Mach. Learn. Res., 3:213–231, 2003.
[4] D. Chakraborty and P. Stone. Online multiagent learning against memory bounded adversaries. In ECML, pages 211–226, Antwerp, Belgium, 2008.
[5] V. Conitzer and T. Sandholm. Awesome: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. In J. Mach. Learn. Res., pages 23–43. Springer, 2006.
[6] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, pages 13–30, 1963.
[7] R. Powers and Y. Shoham. Learning against opponents with bounded memory. In IJCAI, pages 817–822, 2005.
[8] R. Powers, Y. Shoham, and T. Vu. A general criterion and an algorithmic framework for learning in multi-agent systems. Mach. Learn., 67(1-2):45–76, 2007.
[9] R. S. Sutton and A. G. Barto. Reinforcement Learning. MIT Press, 1998.
Proceedings of the AAMAS Workshop on Adaptive and Learning Agents, May 2010, Toronto, Canada
each other, even within parts of the same task. This enables the execution). This provides the imitator with a model that can make
robot to be more resistant to bad demonstrations as well as adapt- predictions about what behaviours a specific demonstrator might
able to heterogeneous demonstrators. use at a given time.
The experimental domain we use to ground this work is robotic Prior work in imitation learning has often used a series of demon-
soccer. In our evaluation, an imitating robot learns to shoot the soc- strations from demonstrators that are similar in skill level and phys-
cer ball into an open goal, from a range of demonstrators that differ iologies [8, 19]. The approach presented in this paper is designed
in size as well as physiology (humanoid vs. wheeled). While this from the bottom up to learn from multiple demonstrators that vary
problem may seem trivial to a human adult, it is quite challeng- physically, as well as in underlying control programs and skill lev-
ing to an individual that is learning about its own motion control. els.
Manoeuvring behind the soccer ball and lining it up for a kick is Some recent work in humanoid robots imitating humans has used
a difficult task for an autonomous agent to perform, even without many demonstrations, but not necessarily different demonstrators,
considering the ball’s destination - just as it would be for a young and very few have modelled each demonstrator separately. Those
child. It is also a task where it is easy to visualize a broad range that do employ different demonstrators, such as [8], often have
of skills (demonstrators that have good versus poor motor control), demonstrators of similar skills and physiologies (in this work all
and one where heterogeneity matters (that is, there are visual dif- humans performing simple drawing tasks) that also manipulate their
ferences in how physiologically-distinct robots move). environment using the same parts of their physiology as the imita-
Beyond simply improving learning, there are good application- tor (in this case the imitator was a humanoid robot learning how to
independent reasons for allowing a robot to learn from heteroge- draw letters, the demonstrators and imitators used the same hands
neous demonstrators. The time taken to create or adapt a control to draw). Inamura et al. [13, 12] use HMMs in their mimesis ar-
program for a particular robot physiology is often wasted when chitecture for imitation learning. They trained a humanoid robot to
robots are abandoned in favour of newer models, or different de- learn motions from human demonstrators, though they did not sep-
signs (e.g. switching from a wheeled robot to a one that has tank arately model or rank demonstrator skills relative to each other as
treads). A learning system needs to be able to learn from others we do in our work. Moreover, they also only use humanoid demon-
that are physiologically different than the imitator if the knowledge strators, significantly limiting heterogeneity.
of various demonstrators is to be passed on. Learning should be Nicolescu and Matarić [19] motivate the desire to have robots
robust enough to allow any type of demonstrator to work. Learning with the ability to generalize over multiple teaching experiences.
should also benefit correspondingly from a heterogeneous breadth They explain that the quality of a teacher’s demonstration and par-
of demonstrators. It may be possible to discover and adapt ele- ticularities of the environment can prevent the imitator from learn-
ments of a performance by a physically distinct demonstrator that ing from a single trial. They also note that multiple trials help to
have not yet been exploited by demonstrators of the same physi- identify important parts of a task, but point out that repeated obser-
ology, for example. Further, imitating robots that can learn from vations of irrelevant steps can cause the imitator to learn undesir-
any type of demonstrator can also learn from robots that developed able behaviours. They do not implement any method of modelling
their control programs through imitation. Imitation can therefore individual demonstrators, or try to evaluate demonstrator skill lev-
provide a mechanism for passing down knowledge between gener- els as our work does. By ranking demonstrators relative to each
ations of robots. other and mixing the best elements from among all demonstrators,
we believe that our system can minimize the behaviours it learns
2. RELATED WORK that contain irrelevant steps.
A number of imitation learning approaches have influenced this
work. Demiris and Hayes [10] developed a computational model 3. METHODOLOGY
based on the phenomenon of body babbling, where babies prac- The robots used in this work can be seen in Figure 1. The robot
tice movement through self-generated activity [17]. Demiris and imitator, a two-wheeled robot built from a Lego Mindstorms kit, is
Hayes [10] devised their system using forward models to predict on the far left. One of the three robot types used for demonstrators
the outcomes of the imitator’s behaviours, in order to find the best is physically identical (i.e. homogeneous) to the imitator, in order
match to an observed demonstrator’s behaviour. A forward model to provide a baseline to compare how well the imitator learns from
takes as input the state of the environment and a control command heterogeneous demonstrators. Two demonstrators that are hetero-
that is to be applied. Using this information, the forward model geneous along different dimensions are also employed. The first is
predicts the next state and outputs it. In their implementation, the a humanoid robot based on a Bioloid kit, using a cellphone for vi-
effects of all the behaviours are predicted and compared to the ac- sion and processing. The choice of a humanoid was made because
tual demonstrator’s state at the next time step. Each behaviour has it provides an extremely different physiology from the imitator in
an error signal that is then used to update its confidence that it can terms of how motions made by the robot appear visually. Both the
match that particular demonstrator behaviour. Our work differs in differences in outcomes of individual actions, as well as the visual
that we use forward models to model entire behaviour repertoires appearance produced by the additional motions necessary for hu-
of demonstrators, not individual behaviours. manoid balancing should be a significant challenge to a framework
Demiris and Hayes [10] use one forward model for each be- for imitation learning in terms of adapting to heterogeneity. The
haviour, which is then refined based on how accurately the forward third demonstrator type is a two-wheeled Citizen Eco-Be (version
model predicts the behaviour’s outcome. By using many of these I) robot which is about 1/10 the size of the imitator. This was
forward models, Demiris and Hayes construct a repertoire of be- chosen because while its physiology is similar, the large size dif-
haviours with predictive capabilities. In contrast, the forward mod- ference (accompanied by significant variation in how long it takes
els in our framework model the repertoire of individual demonstra- the robot to move the same distance) makes for a different extreme
tors (instead of having an individual forward model for each be- of heterogeneity than the challenge presented by a humanoid robot.
haviour), and contain individual behaviours learned from specific The imitation learning robot observes one demonstrator at a time,
demonstrators within them (the behaviours can still predict their with the demonstrated task being that of shooting a ball into an
effects on the environment, but these effects are not refined during empty goal, similar to a penalty kick in soccer. This task should
Figure 1: Two views of the heterogeneous robots used in this
work (a standard ballpoint pen is used to give a rough illus-
tration of scale). The right side of the image shows the robots
with visual markers in place to allow motion to be tracked by a
global vision system.
Figure 2: Imitation Learning Architecture
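The forward-model idea discussed in this section — take the environment state and a control command, predict the next state, and keep a per-behaviour confidence that is adjusted from prediction error, as in Demiris and Hayes [10] — can be summarized in a small interface. Everything below (class and method names, the initial confidence of 0.5, the update rule) is our own illustrative assumption, not the authors' code:

```python
class BehaviourForwardModel:
    """One forward model per demonstrator, holding that demonstrator's
    behaviour repertoire and a confidence value for each behaviour."""

    def __init__(self, behaviours):
        # behaviours: mapping name -> callable(state, command) -> predicted next state
        self.behaviours = dict(behaviours)
        self.confidence = {name: 0.5 for name in self.behaviours}

    def predict(self, state, command):
        """Predict the next state under every behaviour in the repertoire."""
        return {name: b(state, command) for name, b in self.behaviours.items()}

    def update_confidence(self, predictions, observed_next_state, rate=0.1):
        """Nudge each behaviour's confidence up or down according to how well it
        predicted the demonstrator's actual next state (its error signal)."""
        for name, predicted in predictions.items():
            error = 0.0 if predicted == observed_next_state else 1.0
            self.confidence[name] += rate * ((1.0 - error) - self.confidence[name])
```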
behaviours are added to the forward model representing the imita-
tor, as seen in Figure 3. Each model representing a demonstrator
is then used to process each demonstration from all demonstrators
one last time before the forward model does its processing. Essen-
tially this is the stage where the imitator is using the forward models
representing each of its demonstrators to predict what each individ-
ual demonstrator would do in the current situation. By the time all
the forward models representing all the demonstrators are trained,
the model representing the imitator has a number of additional be-
haviours in its repertoire as a result of this process, and serves as
a generalized predictive model of all useful activity obtained from
all demonstrators. Finally, the imitator does the processing of each
demonstration using the candidate behaviours added by the forward
models for the demonstrators as shown in Figure 4. The same pro-
cess of behaviour proposal and decay described earlier allows the
imitator to keep some demonstrator behaviours, and discard oth-
ers, while also learning new behaviours of its own as a result of the
common behaviours extrapolated from multiple demonstrators.
To model the relative skill levels of the demonstrators in our sys-
tem, each of the demonstrator forward models maintains a demon-
strator specific learning rate, the learning preference (LP). The learn-
ing preference is analogous to how people favour certain teachers,
and tend to learn more from these preferred teachers. A higher LP
indicates that a demonstrator is more skilled than its peers, so be-
haviours should be learned from it at a faster rate than a demonstra-
tor with a lower LP. The LP is used as a weight when updating the
frequency of two behaviours or primitives occurring in sequence.
The LP of a demonstrator begins at the half way point between
the minimum (0) and maximum (1) values. When updating the fre-
quencies of sequentially occurring behaviours, a minimum increase
in frequency (referred to as minFreq in equation 1) is preserved (a
value of 0.05, obtained during experimentation), to ensure that a
forward model for a demonstrator that has an LP of 0 does not
stagnate. The forward model for a given demonstrator would still
update frequencies, albeit more slowly than if its LP was above 0.
Equation 2 shows the decay step, where the decay rate is equal to
1 − LP and the decayStep is a constant (0.007 was used in our
experiments).
LP = LP ± lpShapeAmount (3)
The LP of a demonstrator is increased if one of its behaviours
results in the demonstrator (ordered from highest LP increase to
lowest): scoring a goal, moving the ball closer to the goal, or mov-
ing closer to the ball. The LP of a demonstrator is decreased if
the opposite of these criteria results from one of the demonstra-
tor’s behaviours. Equation 3 shows the update step, where lpSha-
peAmount is either a constant (0.001) if the LP is adjusted by the
non-criteria factors, or plus or minus 0.01 for a behaviour that re-
sults in scoring a correct/incorrect goal, 0.005 for moving the ball
closer to the goal, or 0.002 for moving the robot closer to the ball.
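The LP update of Equation 3, with the shaping amounts just quoted, can be written as a small rule. The event labels and the clamping to [0, 1] are our own assumptions; the constants are the ones stated in the text:

```python
# Shaping amounts for Equation 3: LP <- LP +/- lpShapeAmount.
LP_SHAPE = {
    "scored_goal": 0.01,           # correct goal: +0.01; incorrect/own goal: -0.01
    "ball_closer_to_goal": 0.005,
    "robot_closer_to_ball": 0.002,
    "other": 0.001,                # non-criteria adjustments use the 0.001 constant
}

def update_learning_preference(lp, event, positive=True):
    """Shift a demonstrator's learning preference after one of its behaviours,
    keeping LP inside its stated [0, 1] range (clamping assumed, not stated)."""
    amount = LP_SHAPE.get(event, LP_SHAPE["other"])
    lp = lp + amount if positive else lp - amount
    return max(0.0, min(1.0, lp))
```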
These criteria are obviously domain-specific, and are used to shape
the learning (a technique that has been shown to be effective in other domains [14]) in our system to speed up the imitator's learning. Though this may seem like pure reinforcement learning, these criteria do not directly influence which behaviours are saved and which behaviours are deleted. The criteria merely influence the LP of a demonstrator, affecting how much the imitator will learn

Figure 4: In the final phase of training, all demonstrations are first passed to the demonstrator models to elicit any candidate behaviour nominations before the forward model for the imitator processes the demonstration.
Demonstrator Goals Scored Wrong Goals Scored
RC2004 27 4
Citizen 15 3
Bioloid 12 1
Table 1: The number of goals and wrong goals scored for each demonstrator.

Figure 5: Field configurations. The demonstrator is represented by a square with a line that indicates the robot's orientation. The target goal is indicated by a black rectangle; the demonstrator's own goal is white.

random order. All of the forward models for each demonstrator predict and update their models at this time, one step ahead of the
forward model for the imitator. This is done to allow each forward
model a chance to nominate additional candidate behaviours rele-
from that particular demonstrator. Dependence on these criteria vant to the current demonstration instance, to the forward model
was minimized so that future work (such as learning the criteria for the imitator.
from demonstrators) can remove them entirely. The total number of goals each demonstrator scored during all
When the learning process is complete, the imitator is left with a final forward model that it can use as a basis for performing the tasks it has learned from the demonstrators.

Figure 4: In the final phase of training, all demonstrations are first passed to the demonstrator models to elicit any candidate behaviour nominations before the forward model for the imitator processes the demonstration.

4. EXPERIMENTAL RESULTS
To evaluate this approach in a heterogeneous setting, we employed the robots previously shown in Figure 1 to gather demonstrations. Each of the robots used in these experiments was controlled using its own behaviour-based control system - since the work presented here focusses on overcoming differences in physiology, all of the robots used code that was developed for robotic soccer competitions, and all would be considered expert demonstrations. The Bioloid and Lego Mindstorms robots were demonstrated on a 1020 x 810 cm field, while the Citizen was demonstrated on a 56 x 34.5 cm field (this was because the size difference of the robot made for significant battery power issues given the distances covered on the large size field). The ball used by the Bioloid and Lego Mindstorms robots was 10 centimeters in diameter, while the ball used by the Citizen robot was 2.5 centimeters in diameter.

We limited the positions to the two field configurations shown in Figure 5. In the configuration on the left, the demonstrator is positioned for a direct approach to the ball. As a more challenging scenario, we also used a more degenerate configuration (on the right), where the demonstrator is positioned for a direct approach to the ball, but the ball is lined up to its own goal – risking putting the ball in one's own net while manoeuvring, and requiring a greater field distance to traverse with the ball.

Figure 5: Field configurations. The demonstrator is represented by a square with a line that indicates the robot's orientation. The target goal is indicated by a black rectangle, the demonstrator's own goal is white.

The individual demonstrators were recorded by the Ergo global vision system [3] while they performed 25 goal kicks for each of the two field configurations. The global vision system continually captures the x and y motion and orientation of the demonstrating robot and the ball. The demonstrations were filtered manually for simple vision problems such as when the vision server was unable to track the robot, or when the robot broke down (falls/loses power). The individual demonstrations were considered complete when the ball or robot left the field.

One learning trial consists of each forward model representing a given demonstrator training on the full set of kick demonstrations for that particular demonstrator, presented in random order. Once the forward models representing each demonstrator are trained, the forward model representing the imitator begins training. At this point all the forward models for the demonstrators have been trained on their own data, and have provided the forward model representing the imitator with candidate behaviours. The forward model for the imitator then processes all the demonstrations for each of the two field configurations (a total of 150 attempted goal kicks) in random order. All of the forward models for each demonstrator predict and update their models at this time, one step ahead of the forward model for the imitator. This is done to allow each forward model a chance to nominate additional candidate behaviours, relevant to the current demonstration instance, to the forward model for the imitator.

The total number of goals each demonstrator scored during all 50 of their individual demonstrations is given in Table 1.

Demonstrator   Goals Scored   Wrong Goals Scored
RC2004         27             4
Citizen        15             3
Bioloid        12             1
Table 1: The number of goals and wrong goals scored for each demonstrator.

To determine if the order in which an imitator using our approach is exposed to the various demonstrators - specifically, the degree of heterogeneity - had any effect on its learning, we chose to order the demonstrators in two ways. The first is in order of similarity to the imitator. In this ordering, the MindStorms robot demonstrator (labelled RC2004 here because its expert-level control code was from our small-sized team at RoboCup-2004) is first, then the Citizen demonstrator (which is much smaller than the imitator, but still a two-wheeled robot), and finally the Bioloid demonstrator. The shorthand we have adopted for this ordering is RCB. The second ordering is the reverse of the first, that is, in order of most physical differences from the imitator. The second ordering is thus Bioloid, Citizen, RC2004, or BCR for short. The orderings determine when each set of training data is used to train the demonstrator forward models (starting with the first demonstrator's set in the order). The imitator's final model follows the same ordering when passing each set of demonstrations to the demonstrator forward models during the final training phase described in Section 3.

For each of the two orderings, we ran 100 trials. The results of the forward model training processes using the RCB and BCR demonstrator orderings are presented here. All the following data has been averaged over 100 trials. Though the resulting LPs of the demonstrators are not given in this paper, all three demonstrators in both orderings ended up with LPs (with a range of 0 to 1) close to the maximum (over 0.95 on average). In our experiments on differing skill levels, poor demonstrators had LPs of approximately 0.25, while average demonstrators had LPs of approximately 0.5. These results indicate that the LP is accurately judging all demonstrators to be skilled, even those that have different physiologies from the imitator.

Figures 6 and 7 show results for the number of behaviours created and deleted for each of the forward models representing the given demonstrators, with the two orderings for comparison purposes and standard deviations given above each bar. It can be seen that the RCB and BCR demonstration orderings do not affect the number of behaviours created or deleted from any of the forward models. The forward models representing the Bioloid demonstrator can be seen to create many more behaviours than the other forward models (and have a higher standard deviation), but they also end up deleting many more than the others. The vast difference in physiology from the other two-wheeled robots causes the forward models representing the humanoid to build many behaviours in an attempt to match the visual outcome of the Bioloid's demonstrations. When trying to use those behaviours to predict the outcome of the other two-wheeled robot demonstrators, they do not match frequently enough (i.e. they are not a useful basis for imitation),
Figure 6: The number of behaviours created, comparing RCB and BCR demonstrator orderings. Corresponding standard deviations are given at the top of each bar.

Figure 8: The number of permanent behaviours in each forward model, comparing RCB and BCR demonstrator orderings. Corresponding standard deviations are given at the top of each bar.
Figure 10: The number of candidate behaviours not moved to the forward model representing the imitator, because they were already there, comparing RCB and BCR demonstrator orderings. Corresponding standard deviations are given at the top of each bar.

Figure 11: The number of candidate behaviours that earned permanency after being moved to the forward model representing the imitator, comparing RCB and BCR demonstrator orderings. Corresponding standard deviations are given at the top of each bar.

Demonstrator Ordering   Goals Scored   Wrong Goals Scored
RCB                     11             9
BCR                     7              13
Table 2: The number of goals and wrong goals scored for two imitators trained with the different demonstrator orderings.

ordering than the RCB ordering. This is likely due to the number of candidate behaviours that get rejected because they already exist in the forward model representing the imitator, but the standard deviation could also explain this. The results for duplicate candidate behaviours can be seen in Figure 10. The forward models for the RC2004 demonstrator have fewer duplicates rejected when they are first (RCB), as is true for the forward models for the Bioloid when the Bioloid is first (BCR). In both cases, however, this is not much of a difference given the standard deviations involved. The forward models for both demonstrators appear to have more candidate behaviours rejected when they are last in the ordering, but this could also be explained simply by the standard deviations involved. Again, the forward models representing the Citizen demonstrators are less affected by ordering, as they appear in the middle both times.

Similar results are found when looking at the number of candidate behaviours that achieve permanency to the forward model representing the imitator. In Figure 11, the forward models representing the RC2004 and Bioloid demonstrators can be seen to have more behaviours made permanent to the forward model for the imitator when their demonstrations appear first in the ordering, but this is explainable by considering the standard deviation. The forward models representing the two-wheeled demonstrators seem to have an advantage in the number of their candidate behaviours becoming permanent to the forward model for the imitator over the forward models representing the Bioloid demonstrator. This is likely due to the same reasons of physiology discussed when looking at the number of behaviours created and deleted by each of the forward models.

To evaluate the performance of the imitators trained using this approach, we selected two imitators from the learning trials evaluated in this section at random (one from the RCB training order, and one from the BCR order). We used the forward models to control (as described in Section 3) the Lego Mindstorms robots and recorded them in exactly the same way that we recorded the demonstrators, for 25 shots on goal in each of the two field configurations (Figure 5) for a total of 50 trials. Table 2 shows the results of these penalty kick attempts by the two imitators trained using our framework. The orderings do not show a significant difference. The main reason the final trained imitator did not perform objectively better is that the current behaviour being executed was not stopped if another behaviour became more applicable during its execution. This caused the imitator, when demonstrating its skills, to stick to a chosen behaviour, even if using that behaviour resulted in poor results. That is, it is a flaw in the learner demonstrating its skills, not in its learning. The only time the execution of a behaviour was stopped was when the primitives it was about to execute predicted that it might move the imitator off the field, since the imitator was not tasked with learning behaviours that kept it on the field. In the end, the results from heterogeneous demonstrators were comparable to those using only homogeneous demonstrators of the same skill level [1], which is itself very positive because of the additional difficulties involved with heterogeneity.

5. CONCLUSION
We have presented the results and analysis of the experiments used to evaluate our approach to developing an imitation learning architecture that can learn from multiple demonstrators of varying physiologies and skill levels. The results in Section 4 show that our approach can be used to learn from demonstrators that have heterogeneous physiologies. The humanoid demonstrator was not learned from as much as the two-wheeled robots that had similar physiologies to the imitator, but the imitator still learned approximately 12% of its permanent behaviours from the Bioloid, as seen in Figures 11 and 8. The Citizen robot was nearly as effective a demonstrator as the RC2004 robot. This is somewhat surprising, as the RC2004 robot has an identical physiology to the imitator,
while the Citizen robot is approximately 1/10 the imitator's size. This could be partially due to the fact that the Citizen robot has the same limited command set as the imitator, compared to the vastly expanded set of primitive commands available to the RC2004. The Citizen moves much slower due to its size, so the demonstration conversion process must have compensated substantially to give the Citizen demonstrator results so close to the RC2004 demonstrator. Size differences, apparently, are easier to compensate for than differences in physiology, at least to the degree of the differences between wheeled and humanoid robots.

The results presented in Section 4 also show that this framework is not affected drastically by the order that demonstrators are presented to the forward models. There are some effects from candidate behaviours being rejected if a forward model for a given demonstrator is the last to be trained, since the other forward models have already had a chance to get their candidate behaviours added, increasing the chances of duplicates. In practice this does not seem to adversely affect the LP of any of the forward models, and so the order of demonstrations is mostly negligible.

Our forward models, when used as control systems, did not perform as well as the expert demonstrators, but they were still able to control the imitator adequately. The main focus of our research was in developing an imitation learning architecture that could learn from multiple demonstrators of varying physiologies and skill levels. The results of the conversion processes, the predictions, and the influence that the LP of a forward model for a given demonstrator has on what an imitator learns from that particular demonstrator all indicate that the learning architecture we have devised is capable of properly modelling relative demonstrator skill levels as well as learning from physiologically distinct demonstrators. A stronger focus on the refinement of behaviour preconditions and control, similar to the work of Demiris and Hayes [10], could make our entire system more robust.

6. REFERENCES
[1] Allen, J. Imitation learning from multiple demonstrators using global vision. Master's thesis, Department of Computer Science, University of Manitoba, Winnipeg, Canada, August 2009.
[2] Anderson, J., Tanner, B., and Baltes, J. Reinforcement learning from teammates of varying skill in robotic soccer. In Proceedings of the 2004 FIRA Robot World Congress (Busan, Korea, October 2004), FIRA.
[3] Baltes, J., and Anderson, J. Intelligent global vision for teams of mobile robots. In Mobile Robots: Perception & Navigation, S. Kolski, Ed. Advanced Robotic Systems International/pro literatur Verlag, Vienna, Austria, 2007, ch. 9, pp. 165–186.
[4] Billard, A., and Matarić, M. J. A biologically inspired robotic model for learning by imitation. In Proceedings of Autonomous Agents 2000 (Barcelona, Spain, June 2000), pp. 373–380.
[5] Breazeal, C., and Scassellati, B. Challenges in building robots that imitate people. In Imitation in Animals and Artifacts, K. Dautenhahn and C. Nehaniv, Eds. MIT Press, 2002, pp. 363–390.
[6] Breazeal, C., and Scassellati, B. Robots that imitate humans. Trends in Cognitive Sciences 6, 11 (2002), 481–487.
[7] Calderon, C. A., and Hu, H. Goal and actions: Learning by imitation. In Proceedings of the AISB '03 Second International Symposium on Imitation in Animals and Artifacts (Aberystwyth, Wales, 2003), pp. 179–182.
[8] Calinon, S., and Billard, A. Learning of Gestures by Imitation in a Humanoid Robot. In Imitation and Social Learning in Robots, Humans and Animals: Behavioural, Social and Communicative Dimensions, K. Dautenhahn and C. Nehaniv, Eds. Cambridge University Press, 2007, pp. 153–177.
[9] Dearden, A., and Demiris, Y. Learning forward models for robots. In Proceedings of IJCAI-05 (Edinburgh, Scotland, August 2005), pp. 1440–1445.
[10] Demiris, J., and Hayes, G. Imitation as a dual-route process featuring predictive and learning components: A biologically plausible computational model. In Imitation in Animals and Artifacts, K. Dautenhahn and C. Nehaniv, Eds. MIT Press, 2002, pp. 327–361.
[11] Hartigan, J. A., and Wong, M. A. Algorithm AS 136: A k-means clustering algorithm. Applied Statistics 28, 1 (1979), 100–108.
[12] Inamura, T., Nakamura, Y., Toshima, I., and Tanie, H. Embodied symbol emergence based on mimesis theory. International Journal of Robotics Research 23, 4 (2004), 363–377.
[13] Inamura, T., Toshima, I., and Nakamura, Y. Acquiring motion elements for bidirectional computation of motion recognition and generation. In Proceedings of the International Symposium on Experimental Robotics (ISER) (Sant'Angelo d'Ischia, Italy, July 2002), pp. 357–366.
[14] Matarić, M. J. Reinforcement learning in the multi-robot domain. Autonomous Robots 4, 1 (1997), 73–83.
[15] Matarić, M. J. Getting humanoids to move and imitate. IEEE Intelligent Systems (July 2000), 18–24.
[16] Matarić, M. J. Sensory-motor primitives as a basis for imitation: linking perception to action and biology to robotics. In Imitation in Animals and Artifacts, K. Dautenhahn and C. Nehaniv, Eds. MIT Press, 2002, pp. 391–422.
[17] Meltzoff, A. N., and Moore, M. K. Explaining facial imitation: A theoretical model. In Early Development and Parenting, vol. 6. John Wiley and Sons, Ltd., 1997, pp. 179–192.
[18] Nehaniv, C. L., and Dautenhahn, K. Of hummingbirds and helicopters: An algebraic framework for interdisciplinary studies of imitation and its applications. In Interdisciplinary Approaches to Robot Learning, J. Demiris and A. Birk, Eds., vol. 24. World Scientific Press, 2000, pp. 136–161.
[19] Nicolescu, M., and Matarić, M. J. Natural methods for robot task learning: Instructive demonstration, generalization and practice. In Proceedings of AAMAS-2003 (Melbourne, Australia, July 2003), pp. 241–248.
[20] Rabiner, L., and Juang, B. An introduction to Hidden Markov Models. IEEE ASSP Magazine (1986), 4–16.
[21] Riley, P., and Veloso, M. Coaching a simulated soccer team by opponent model recognition. In Proceedings of the Fifth International Conference on Autonomous Agents (May 2001), pp. 155–156.
Proceedings of the AAMAS Workshop on Adaptive and Learning Agents, May 2010, Toronto, Canada
Our research presented here is the first step of a bigger ongoing project on the use of domain knowledge and the analysis of the suitability of reward shaping in KeepAway and in multi-agent RL in general.

The paper is organised as follows. Section 2 presents a more detailed introduction to reinforcement learning and Section 3 introduces reward shaping. The subsequent section introduces RoboCup Soccer and the problem of learning takers in that domain. Next, Section 5 discusses our approach to learning takers with reward shaping. Details of the experimental evaluation are in Section 6 and the obtained results are collected and discussed in Section 7. The final section concludes the paper.

2. REINFORCEMENT LEARNING AND MARKOV DECISION PROCESSES
Reinforcement learning is a paradigm which allows agents to learn by reward and punishment from interactions with the environment [19]. The numeric feedback received from the environment is used to improve the agent's actions. The majority of work in the area of reinforcement learning (RL) applies a Markov Decision Process as a mathematical model [13].

A Markov Decision Process (MDP) is a tuple ⟨S, A, T, R⟩, where S is the state space, A is the action space, T(s, a, s′) = Pr(s′|s, a) is the probability that action a in state s will lead to state s′, and R(s, a, s′) is the immediate reward received when action a taken in state s results in a transition to state s′. The problem of solving an MDP is to find a policy (i.e., a mapping from states to actions) which maximises the accumulated reward. When the environment dynamics (transition probabilities and the reward function) are available, this task can be solved using iterative approaches like policy and value iteration [2].

MDPs constitute a modelling framework for RL agents whose goal is to learn an optimal policy when the environment dynamics are not available and, thus, value iteration cannot be used. However, the concept of an iterative approach in itself is the backbone of the majority of RL algorithms. These algorithms apply so-called temporal-difference updates to propagate information about the values of states, V(s), or state-action pairs, Q(s, a) [18]. These updates are based on the difference of two temporally different estimates of a particular state or state-action value. The SARSA algorithm is such a method [19]. After each real transition, (s, a) → (s′, r), in the environment, it updates state-action values by the formula:

Q(s, a) ← Q(s, a) + α[r + γQ(s′, a′) − Q(s, a)].  (1)

It modifies the value of taking action a in state s when, after executing this action, the environment returned reward r, moved to a new state s′, and action a′ was chosen in state s′.

3. REWARD SHAPING
The immediate reward r in the update rule given by Equation 1 represents the feedback from the environment. The idea of reward shaping is to provide an additional reward which will improve the convergence of the learning agent with regard to learning speed, the quality of the final solution, or both [11, 14]. This concept can be represented by the following formula for the SARSA algorithm:

Q(s, a) ← Q(s, a) + α[r + F(s, a, s′) + γQ(s′, a′) − Q(s, a)],  (2)

where F(s, a, s′) is the general form of the shaping reward.

Even though reward shaping has been powerful in many experiments, it quickly turned out that, when used improperly, it can also be very misleading [14]. To deal with such problems, potential-based reward shaping was proposed [11] as the difference of some potential function Φ defined over a source state s and a destination state s′:

F(s, s′) = γΦ(s′) − Φ(s),  (3)

where γ is the discount factor. When the potential function Φ(s) is a function of states only, actions can be omitted in F, yielding F : S × S → R and F(s, s′). Ng et al. [11] proved that reward shaping defined in this way, that is, according to Equation 3, guarantees learning a policy which is equivalent to the one learned without reward shaping when the same heuristic knowledge represented by Φ(s) were used directly to initialise the value function. This is an important fact, because when function approximation is used in big environments, where the structural properties of the state space are not clear, it is not easy to initialise the value function. Reward shaping represents a flexible and theoretically correct method to incorporate background knowledge into RL algorithms. Its properties have been proven for RL in both infinite- and finite-horizon MDPs [11]. It was, however, indicated in [5] that the standard formulation of potential-based reward shaping according to [11] can fail in domains with multiple goals. One of the solutions suggested in [5] to overcome this problem is to use F(·, ·, g) = 0 for each goal state g ∈ G.

When the shaping reward is computed according to Equation 3, the application of reward shaping reduces to the problem of how to define the potential function Φ(s). In this paper, we address this issue in a novel context of multi-agent learning (details in Section 5) and evaluate it in the RoboCup KeepAway domain [15, 16], which is introduced in the next section. Our experiments apply function approximation to represent the value function in this task. Even though with function approximation the optimal policy might not be representable, our application of potential-based reward shaping is still valid and justified. Potential-based reward shaping guarantees that the new MDP with a modified reward function has the same solution as the original MDP solved by reinforcement learning without reward shaping. Therefore the learning problem remains the same, allowing the methods presented here to be applied with or without an approximate function representation.

The work of Ng et al. [11] formally specified requirements on reward shaping. The idea of giving an additional external reward was investigated by numerous researchers before that. For example, interesting observations on the behaviour and problems of reward shaping were reported in [14] and were then influential in the formalisation of the potential-based reward function. Another early work suggesting progress estimators, which also resembles the idea of the potential function, was presented in [9].
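As a concrete reading of Equations 1–3, the sketch below shows a single tabular SARSA update with an optional potential-based shaping term. The tabular Q dictionary and the placeholder potential function are our own assumptions for illustration; this is not the implementation used in the experiments reported here.

```python
# Illustrative SARSA update with potential-based reward shaping (Eqs. 1-3).
from collections import defaultdict

Q = defaultdict(float)      # Q[(state, action)] -> estimated value
alpha, gamma = 0.1, 1.0     # learning rate and discount factor (placeholders)

def phi(state):
    """Heuristic potential over states; domain knowledge would go here."""
    return 0.0

def shaping_reward(s, s_next):
    """F(s, s') = gamma * phi(s') - phi(s), as in Equation 3."""
    return gamma * phi(s_next) - phi(s)

def sarsa_update(s, a, r, s_next, a_next, shaped=True):
    """One update step: Equation 2 when shaped=True, Equation 1 otherwise."""
    f = shaping_reward(s, s_next) if shaped else 0.0
    td_target = r + f + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```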
4. MULTI-AGENT LEARNING IN ROBOCUP SOCCER
RoboCup is an international project (see http://www.robocup.org/ for more information) which aims at providing an experimental framework in which various technologies can be integrated and evaluated. The overall research challenge is to create humanoid robots which would play at human masters level. Since the full game of soccer is complex, researchers have developed several simulated environments which can be used to evaluate techniques for specific sub-problems. One of such sub-problems is the KeepAway task [15, 16] (see http://userweb.cs.utexas.edu/~AustinVilla/sim/Keepaway/ for more information).
In this task (see Figure 1), N players (keepers) learn how to keep the ball when attacked by N − 1 takers and when playing within a small area of the football pitch, which makes the problem more difficult.

Figure 1: Snapshot of a 3 vs. 2 KeepAway game.

This task is multi-agent [21] in its nature; however, most research has focused on learning one specific behaviour at a time. Overall, there are three types of high level behaviour in this task. The behaviour of the agents trying to take the ball is one of these high level behaviours, but let us first consider the agents trying to maintain possession of the ball: the keepers.

For keepers there are two distinct situations: either the keeper has possession of the ball or it does not. If it is not in possession of the ball, a keeper executes a fixed hand-coded policy which directs it to be in a position convenient to receive the ball from the keeper which is. The third behaviour is that of the keeper in possession of the ball. Previous work has attempted to learn this behaviour using reinforcement learning whilst the takers adhere to a hand-coded policy [15, 16, 4].

However, in this work we are interested in learning the behaviour of the takers and so the keepers shall now follow a hand-coded policy both with and without the ball. The hand-coded behaviour of a keeper with the ball was originally specified in [16] and has since been used in other work on learning takers [7, 10].

Although the work of [15, 16, 4] has multiple agents learning during each episode, the implementation is not a true multi-agent learning example. At any one time only one agent is learning, namely the keeper in possession of the ball, and all other agents are following fixed hand-coded policies. Therefore the agent is in effect learning within a static environment. However, when we consider takers with the ability to learn, the problem becomes multi-agent as all takers learn simultaneously.

Previous attempts to learn the behaviour of takers proved relatively successful [7, 10] and were a useful resource when attempting to develop novel approaches. In [7] the first basic learning taker was developed using SARSA reinforcement learning with tile coding to decide the action of a taker every 15 cycles. This work emphasised that allowing a taker to decide an action every cycle caused indecisiveness in the agent because the short time elapsed between decisions did not allow adequate time for the true benefit or cost of an action to be realised. In experiments allowing decisions to be made every cycle, takers oscillate between decisions, causing poor performance.

This observation was again witnessed by [10], who noted that updates at any interval between 15 and 40 had comparable results but intervals larger than 60 or less than 10 were largely unsuccessful. To avoid this hesitation they chose to switch from the SARSA algorithm for reinforcement learning to the Advantage(λ) Learning algorithm. In their work a significant improvement in performance was seen when takers learnt every cycle by Advantage(λ) Learning instead of infrequently by SARSA. However, the comparison is not complete as takers using Advantage(λ) Learning also implemented a more advanced function approximation technique. Also, the two bodies of work [7, 10] cannot be directly compared as they used different state representations.

These two papers appear to encompass the entirety of currently published work in this problem domain. However, there still remains large room for improvement in the development of a learning taker. The more challenging a taker can become, the more it will challenge researchers interested in learning the behaviours of keepers. The work we have undertaken has resulted in takers performing significantly better than the performances reported in both these papers, against the same opposing keepers in games with the same set-up and in games more challenging to the takers.

This problem domain also provides a suitable test bed for other, more generally applicable research into multi-agent reinforcement learning. Given the learning taker we have developed, we were able to then expand upon the basic implementation and incorporate three novel approaches to potential-based reward shaping in a multi-agent context.

5. PROPOSED METHOD
In this section we provide more details on our learning takers and the reward shaping techniques used. In our investigation we compare the performance of RL takers without reward shaping (the base learner) to takers using one of three types of reward shaping detailed below.

5.1 Base Learner
Our base learning taker combines the work of both previous papers [7, 10] on learning takers in KeepAway. As in both these papers, the takers can on each update choose either to tackle the keeper with the ball or mark any of the remaining keepers. To tackle a keeper, the taker runs directly to the keeper currently in possession of the ball. To mark a keeper, the taker moves close to the keeper, trying to maintain its position on the intercepting line of a straight pass of the ball from its current location to the current location of the keeper.

To learn when to perform these actions we use the SARSA algorithm with tile coding, as in [7]. Then from [10] we use the state representation (originally suggested in [16]) and the reward function: -1 for every cycle the episode continues to run and +10 for ending the episode. Given the observations made by both papers we update only after every 15 cycles.

Figure 2: State Representation of Base Learner (from [10]).
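The following skeleton reflects our reading of the base learner just described: SARSA over the tackle/mark actions, a reward of -1 per cycle and +10 for ending the episode, and decisions taken every 15 cycles. The environment interface, the epsilon-greedy policy, and the crude feature discretisation (standing in for the actual tile coder) are assumptions made for this sketch only.

```python
# Sketch of the base learning taker (our reading of the description above).
import random
from collections import defaultdict

ACTIONS = ["tackle", "mark_keeper_2", "mark_keeper_3"]   # 3 vs. 2 KeepAway
Q = defaultdict(float)
alpha, gamma, epsilon = 0.125, 1.0, 0.01
DECISION_INTERVAL = 15            # cycles between action decisions / updates

def discretise(features):
    """Crude stand-in for the tile-coding function approximator."""
    return tuple(round(f) for f in features)

def choose_action(state):
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def decision_point(env, state, action):
    """Run the chosen action for 15 cycles, then apply one SARSA update."""
    reward, done = 0, False
    for _ in range(DECISION_INTERVAL):
        features, done = env.step(action)     # hypothetical environment API
        reward += 10 if done else -1          # +10 ends the episode, -1 otherwise
        if done:
            break
    next_state = discretise(features)
    next_action = choose_action(next_state)
    target = reward if done else reward + gamma * Q[(next_state, next_action)]
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    return next_state, next_action, done
```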
We chose the state representation from [10] as opposed to the one used by [7], because the latter represented fewer observations of the environment and we expected this to limit the performance of learning takers. Our experiments presented below show that the more detailed state representation, from [10] and illustrated in Figure 2, improves the performance of the basic learning taker, supporting our expectations and providing a useful comparison to the more novel approaches we take in the following two subsections.

5.2 Simple Reward Shaping
Our first extension is to apply a simple potential-based reward shaping function to the existing base agent to incorporate prior domain knowledge into the agent. It is expected that, given this knowledge, the agent will converge quicker to an equal or better performance than the base learning approach alone. This agent is intended to show that the use of reward shaping is both applicable and beneficial in multi-agent reinforcement learning.

Specifically, the domain knowledge we have applied states that takers can improve their performance by separating and marking different players. By following this principle, they are able to limit the passing options of the keepers and reduce the time the keepers maintain possession.

We have implemented a reward shaping function that encourages separation by adding the change in distance between the takers to the reward they receive from the basic learning algorithm. Assuming our domain knowledge is correct, the addition of this potential-based function will ensure better co-operation between the agents developed than those following the hand-coded policy. Also, although this knowledge could be learned by the base learner, the new agent will know from the beginning to attempt to separate and so will converge quicker.

5.3 Heterogeneous Shaping
In experiments with the previous agent, based upon simple reward shaping, all taker agents will be equivalent or homogeneous. A more interesting problem is that of heterogeneous agents, whereby different agents co-operating on the same team combine different skills to outperform their homogeneous counterparts [1].

Given the previous hypothesis, that takers sticking together is detrimental to performance, more complex prior domain knowledge can be incorporated, stating that it is beneficial for one taker to tackle and another to fall back and mark.

In effect, this new domain knowledge defines two roles: one of a tackling taker and one of a marking taker. We thus use heterogeneous reward shaping to encourage these roles in the learning takers, which to our knowledge is a novel idea. By rewarding one taker for choosing a tackling action when previously choosing a marking action, and punishing it when it changes from choosing tackling to now marking, the agent will be encouraged to tackle. A similar approach reversing the punishment and reward will then encourage the other taker to mark.

These roles, however, are not hard-coded; we are not limiting the action choices available to the takers. Both takers can still choose either to mark or tackle, and reinforcement learning will still have them explore the use of both action choices. Therefore in extreme cases when it is necessary for the marking agent to tackle, it will still make the correct decision and tackle, but in general it will choose to mark as the reward shaping function applied will make this appear more lucrative.

Therefore, it is hoped that these two roles are beneficial to winning possession. If they are, then the agent will converge quicker to an equal or better performance than the base learner, because the takers without reward shaping will have to learn these roles themselves. However, if these roles are not beneficial, the takers will still be able to learn an equal policy as the roles are not enforced but merely encouraged.

The successful application of this reward shaping will illustrate the potential benefits of using heterogeneous reward shaping in multi-agent systems to encourage roles.

5.4 Combining Shaping Functions
Finally, we have also considered the incorporation of both pieces of domain knowledge into one team of takers. This way the takers can be encouraged to take roles but also consider the benefit of separating.

When combining shaping functions it is important that each is scaled individually, because calculating the potential difference of both states and scaling the sum would give a different meaning to the resultant reward shaping; it would not accurately represent the domain knowledge intended. Therefore the potential-based reward shaping function changes from Equation 3 given in Section 3 to:

F(s, s′) = τ1(γΦ1(s′) − Φ1(s)) + τ2(γΦ2(s′) − Φ2(s)),  (4)

where γ is the discount factor, Φ1 and Φ2 are the potential functions and τ1 and τ2 are two separate scaling factors.

Given that our motive is to publicise the use of heterogeneous reward shaping for encouraging roles, our scaling will emphasise the heterogeneous reward shaping function. This agent will still include the separation-based reward shaping function, but by scaling the function appropriately it will have less of an impact on the resultant behaviour than the encouragement to take up a specific role.

It is expected that, as this agent will benefit from both pieces of domain knowledge, this will be our best performing agent and as such will be a beneficial contribution to the RoboCup KeepAway research field.

6. EXPERIMENTAL DESIGN
The experiments undergone were performed in RoboCup Soccer Simulator v11.1.0 compiled against RoboCup Soccer Simulator Base Code v11.1.0. The KeepAway player code used was keepaway-player v0.6. Keepers were based upon the hand-coded policy publicly available in this release and takers were based upon our own extensions to this base player.

For takers both with and without reward shaping, the SARSA algorithm of reinforcement learning was used with the parameters α = 0.125, γ = 1.0 and ε = 0.01. For function approximation a tile coding function with 13 groups of 32 single-dimension tilings was used. All takers used one group per each feature in the observation and split angles into ten degree intervals and distances into three meter intervals.

Experiments were performed on pitches of sizes 20×20, 30×30, 40×40, and 50×50 meters. These values were chosen to show the performance of our takers in similar contexts to previous work on learning the behaviour of takers and also in more complex problem domains.

The addition of reward shaping functions must be scaled to maximise the performance of the respective agents. The value of these scaling factors was found through experimental testing, therefore they may not be the optimal settings. However, they are sufficient to show the improvement in performance the methods are capable of. For the simple reward shaping agent the value of separation was doubled before being added to the basic reward function.

For the heterogeneous shaping approach, agents were either rewarded or penalised by 5 for changing their action from marking to tackling and vice versa.
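To illustrate Equations 3 and 4 and the role encouragement of Section 5.3, the sketch below combines a separation potential with a role-based adjustment. The separation potential, the state encoding, and the default scaling values are placeholders of our own; only the doubling of the separation term, the ±5 (±10 when combined) role adjustment, and the form of Equation 4 come from the text.

```python
# Sketch of the shaping terms used by the taker variants described above.
gamma = 1.0

def phi_separation(state):
    """Potential encouraging takers to spread out (distance between takers)."""
    return state["distance_between_takers"]   # placeholder state encoding

def simple_shaping(s, s_next, scale=2.0):
    """Equation 3 with the separation potential; the value was doubled."""
    return scale * (gamma * phi_separation(s_next) - phi_separation(s))

def combined_shaping(s, s_next, phi1, phi2, tau1, tau2):
    """Equation 4: two potential functions, each scaled individually."""
    return (tau1 * (gamma * phi1(s_next) - phi1(s))
            + tau2 * (gamma * phi2(s_next) - phi2(s)))

def role_shaping(prev_action, action, encouraged="tackle", amount=5.0):
    """Heterogeneous shaping (Section 5.3): reward a switch towards the
    encouraged role, penalise a switch away from it, 0 if unchanged.
    Use amount=10.0 for the combined agent."""
    if prev_action == action:
        return 0.0
    return amount if action == encouraged else -amount
```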
When combining shaping functions we wanted to emphasise the heterogeneous knowledge, and so for changing their action these takers were either rewarded or penalised by 10 and for separation

7. RESULTS
Experiments on the simplest domains were relatively unhelpful. All agents converged quickly to good results with little variation between the approaches used. For both pitches of size 20x20 and 30x30, illustrated in Figures 3 and 4, it is important to consider that both axes represent small changes in time in their given dimension and the differences between agents are both brief and insignificantly small (only 0.4 seconds for pitch size 20x20). When taking into consideration the statistical variation between samples, no one agent is seen to reliably be significantly better than any other.

Figure 4: Takers v KA06 at 30x30.

Figure 5: Takers v KA06 at 40x40.
ceive the initial performance improvement from exploiting the domain knowledge but later, instead of being hindered by this flawed knowledge, would gain the benefit of exploring all potential decisions and so match the superior converged performance of the base learner.

learner and the simplest reward shaping function. Also, all subsequent increases in the complexity of domain knowledge applied through reward shaping result in a significant increase in performance. These initial gains in performance are highlighted for clarity in Figure 7.

Figure 7: Takers v KA06 at 40x40 with Initial Performance Highlighted.

Finally, with regard to the 40x40 problem domain, it is supportive of our method to note that the slight difference in performance previously noticed in the takers' performances after convergence is contradicted by Figure 6. Although the average performance of the base learner is lower than all of the takers using reward shaping, the upper bounds of variation in this result are higher than the lower bounds of all takers using reward shaping and equivalent to the average of some. Therefore the one benefit of using the basic learner instead of incorporating domain knowledge, namely the perceived improvement in performance at the time of convergence, is not statistically significant.

Figure 9: Takers v KA06 at 50x50 with Performance Variation Illustrated.

The results of Figures 8 and 9 further support the conclusions made thus far. As previously seen in the change from pitch sizes of 30x30 to 40x40, there is a significant rise in difficulty when increasing the pitch size from 40x40 to 50x50. Given the yet again higher difficulty, a more significant improvement can and has been witnessed when incorporating domain knowledge into multiple agents co-learning in a single system.

Firstly, there is now a significant gap between the upper bound of initial performance in takers using even just the simplest reward shaping function and the lower bound of initial performance by the base learner. On average takers benefiting from both reward
shaping functions can begin taking possession of the ball 6 seconds faster than takers not using any reward shaping. The initial gain in performance is highlighted in Figure 10.

Figure 10: Takers v KA06 at 50x50 with Initial Performance Highlighted.

This gain in performance remains roughly constant throughout the first 4 hours of training. It then begins to shrink but still outperforms the base learner for up to approximately 8 hours. Even after the first 8 hours of training, the base learner can only match the performance of the novel approaches and never significantly outperforms any of them.

Furthermore, the agents solely encouraged to take heterogeneous roles did adhere to the encouragement and after convergence were seen to almost exclusively stick to their assigned roles. The exceptions were the specific contexts in which it was more beneficial to performance to ignore the encouraged role and make a non-characteristic action decision. By using RL with reward shaping to encourage roles, these deviations from the encouraged role were possible, whereas an agent with enforced roles would not have learnt to, nor been able to, exploit these specific contexts.

Finally, a closing note of some interest. It appears, in particular in Figure 9 but also to a degree in Figure 6, that the variation in performances achieved is notably smaller in agents making use of the heterogeneous reward shaping. These results are not broad enough to make any firm conclusions at this time, but it would be interesting to explore this observation further in a deeper study of the heterogeneous reward shaping approach for multi-agent systems. It may be that this is characteristic of the encouragement of roles; however, it may also be the case that this is simply an artifact of this specific piece of domain knowledge in this particular problem domain.

8. CONCLUSION
In conclusion, we have demonstrated the applicability and benefits of using potential-based reward shaping in multi-agent reinforcement learning. By incorporating domain knowledge in an agent's design, the agent can converge quicker to an equal or superior policy than agents learning by reinforcement alone.

The results documented here are a first step in demonstrating the potential benefit of heterogeneous reward shaping to encourage roles. We have successfully designed heterogeneous agents that co-operate in a multi-agent system to outperform agents using either homogeneous reward shaping functions or incorporating no domain knowledge. By encouraging roles through reward shaping, as opposed to enforcing them through hard-coded limitation of either actions or state representations, agents can choose to exploit the given domain knowledge, and so benefit from fast convergence rates, but can also still choose to explore, allowing the discovery of optimal policies where they diverge in specific contexts from their encouraged roles.

Although the specific reward shaping functions implemented have used domain-specific knowledge, the types of domain knowledge represented are generally applicable. The knowledge that takers should try to stay separate is an example of knowledge regarding how agents should maintain states relative to each other. Maintaining a state relative to either team-mates or opponents is a common type of knowledge applicable in many multi-agent systems. For example, it has been shown in the predator/prey problem domain that it is beneficial for predators to consider the relative location of their supporting predator to aid co-ordination [20]. Similarly, having one tackler and one marker is specific to takers in KeepAway, but the knowledge that agents should specialise into roles is common in multi-agent systems. For example, again in the predator/prey problem domain, it has been shown that it is beneficial to have one predator take a hunting role and another take a scouting role [20]. Therefore the use of reward shaping, whether homogeneous, heterogeneous or combined, could be applied in general to any multi-agent system that would benefit from agents having these types of knowledge, with the expected benefits being similar to those documented in the KeepAway domain.

Finally, our last contribution is that of the taker learning with the combined domain knowledge of both encouraging separation and roles. This taker has the best currently published performance of any taker in the RoboCup KeepAway problem domain.

We intend to continue this work along the following avenues. Firstly, we believe that there is the potential to apply similar reward shaping functions to the keepers in a true multi-agent learning domain. Recent work [8] has expanded the keepers to learn both whilst on and off the ball. Currently, despite simultaneous learning, the behaviour of keepers with the ball is very different to that of those without the ball. However, the application of a separation-based reward shaping function could be adapted to further improve the performance of the keepers. This continued cycle of improving takers and then improving keepers will continue to push research efforts in this problem domain, eventually leading to research into the simultaneous learning of both keepers and takers.

A larger, more general contribution, however, would be to continue to investigate the potential of heterogeneous reward shaping in multi-agent systems. Again this could foreseeably be applied to a true multi-agent learning set of keepers, with perhaps some keepers encouraged to mislead takers by making runs off the ball and others encouraged to sneak away from markers to true open positions. Other classic multi-agent domains, such as task distribution or predator/prey, may also be interesting to study when applying this technique, to highlight its general applicability and widen the audience for this method.

9. REFERENCES
[1] T. Balch. Learning Roles: Behavioral Diversity in Robot Teams. In AAAI Workshop on Multiagent Learning, 1997.
[2] D. P. Bertsekas. Dynamic Programming and Optimal Control (2 Vol Set). Athena Scientific, 3rd edition, 2007.
[3] L. Busoniu, R. Babuska, and B. De Schutter. A Comprehensive Survey of MultiAgent Reinforcement Learning. IEEE Transactions on Systems, Man & Cybernetics, Part C: Applications and Reviews, 38(2):156, 2008.
[4] S. Devlin, M. Grześ, and D. Kudenko. Reinforcement learning in robocup keepaway with partial observability. In IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT'09), 2009.
[5] M. Grześ. Improving exploration in reinforcement learning through domain knowledge and parameter analysis. Technical report, University of York, 2010. (in preparation).
[6] M. Grześ and D. Kudenko. Plan-based reward shaping for reinforcement learning. In Proceedings of the 4th IEEE International Conference on Intelligent Systems (IS'08), pages 22–29. IEEE, 2008.
[7] A. Iscen and U. Erogul. A new perspective to the keepaway soccer: the takers. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 3, pages 1341–1344. International Foundation for Autonomous Agents and Multiagent Systems, 2008.
[8] S. Kalyanakrishnan and P. Stone. Learning complementary multiagent behaviors: a case study. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, pages 1359–1360. International Foundation for Autonomous Agents and Multiagent Systems, 2009.
[9] M. J. Mataric. Reward functions for accelerated learning. In Proceedings of the 11th International Conference on Machine Learning, pages 181–189, 1994.
[10] H. Min, J. Zeng, J. Chen, and J. Zhu. A Study of Reinforcement Learning in a New Multiagent Domain. In IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT'08), volume 2, 2008.
[11] A. Y. Ng, D. Harada, and S. J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning, pages 278–287, 1999.
[12] J. Peters, S. Vijayakumar, and S. Schaal. Reinforcement learning for humanoid robotics. In Proceedings of Humanoids2003, Third IEEE-RAS International Conference on Humanoid Robots, 2003.
[13] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1994.
[14] J. Randløv and P. Alstrom. Learning to drive a bicycle using reinforcement learning and shaping. In Proceedings of the 15th International Conference on Machine Learning, pages 463–471, 1998.
[15] P. Stone and R. S. Sutton. Scaling reinforcement learning toward robocup soccer. In The 18th International Conference on Machine Learning, pages 537–544. Morgan Kaufmann, San Francisco, CA, 2001.
[16] P. Stone, R. S. Sutton, and G. Kuhlmann. Reinforcement learning for RoboCup-soccer keepaway. Adaptive Behavior, 13(3):165–188, 2005.
[17] R. Sutton. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding. Advances in Neural Information Processing Systems, pages 1038–1044, 1996.
[18] R. S. Sutton. Temporal credit assignment in reinforcement learning. PhD thesis, Department of Computer Science, University of Massachusetts, Amherst, 1984.
[19] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[20] M. Tan. Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents. In Proceedings of the Tenth International Conference on Machine Learning, volume 337, 1993.
[21] M. Wooldridge. An Introduction to MultiAgent Systems. John Wiley and Sons, 2002.
Proceedings of the AAMAS Workshop on Adaptive and Learning Agents, May 2010, Toronto, Canada
Learn to Behave!
Rapid Training of Behavior Automata
The learning domain for an HFA behavior can obviously be complex and of high dimensionality, depending on the number of basic behaviors and the dimensionality of the agent's feature vector. This in turn can require a large number of training sessions to adequately describe the domain. It is not reasonable to expect a demonstrator to perform that many training sessions, and so it is important to reduce the domain space complexity or training difficulty. We have done this in three ways:

• An HFA encourages task decomposition. Rather than learn one large behavior, the system may be trained on simpler behaviors, which are then composed into a higher-level learned behavior. This essentially projects the full learning space into multiple lower-dimensional spaces.

• Feature vector reduction. Our system allows the user to specify precisely those features he feels are necessary for a given learned HFA, which in turn dramatically reduces the learning space. Each HFA, including lower-level HFAs, may have its own different reduced feature vector.

• Generalization by parametrization. All behaviors, including HFAs themselves, may be parameterized with targets: for example, rather than create a behavior go-to-home-base, we can create a general behavior go-to(A), and allow for higher-level behaviors to specify the meaning of the target A at a future time. This can significantly reduce the number of behaviors which must be trained (a brief sketch of this idea follows below).

By employing these complexity-reduction measures, our system ideally enables the rapid construction of complex behaviors, with internal state and a variety of sensor features, in real time entirely by training from demonstration.
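As a small illustration of the parametrization point above, a behavior such as go-to(A) can be written once with an unbound target and bound later by a higher-level behavior. The class and function names below are ours, not the system's API; this is only a sketch of the idea.

```python
# Illustrative parameterized behavior: go_to(A) is defined once and its
# target is bound only when a higher-level behavior needs it.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Behavior:
    name: str
    act: Callable[[Dict[str, float]], str]   # feature vector -> low-level command

def go_to(target: str) -> Behavior:
    """Generic go-to(A); the meaning of `target` is supplied by the caller."""
    def act(features: Dict[str, float]) -> str:
        # turn until roughly facing the bound target, then drive forward
        if abs(features[f"direction-to({target})"]) > 0.1:
            return "rotate"
        return "forward"
    return Behavior(name=f"go-to({target})", act=act)

# Higher-level behaviors bind the parameter as needed:
go_home = go_to("home-base")
go_ball = go_to("ball")
```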
The remainder of the paper is laid out as follows. We begin with a discussion of related work. We then describe the basic HFA model and our approach to learning the transition functions in the automaton. We follow this with a training example of a nontrivial foraging behavior, then conclude with a discussion of future directions.

2. RELATED WORK
Our approach generally fits under the category of learning from demonstration [3], an overall term for training agent actions by having a human demonstrator perform the action on behalf of the agent. Because the proper action to perform in a given situation is directly provided to the agent, this is broadly speaking a supervised learning task, though a significant body of research in the topic actually involves reinforcement learning, whereby the demonstrator's actions are converted into a reinforcement signal from which the agent is expected to derive a policy. The lion's share of learning from demonstration literature comes not from virtual or game agents but from autonomous robotics. For a large survey of the area, see [2].

Learning Plans. One learning from demonstration area, closely related to our own research, involves the learning of largely directed acyclic graphs of behaviors (essentially plans) from sequences of actions [1, 16, 18, 21], possibly augmented with sequence iteration [25]. Like our approach, these plans are often parameterizable.

Such plan networks generally have limited or no recurrence: instead they usually tend to be organized as sequences or simultaneous groups of behaviors which activate further behaviors downstream. This is mostly a feature of the problem being tackled: such plans are largely induced from ordered sequences of actions intended to produce a result. Since we are training goal-less behaviors rather than plans, our model instead assumes a rich level of recurrence; and for the same reason the specific ordering of actions is less helpful.

Learning Policies. Another large body of work in learning from demonstration involves observing a demonstrator perform various actions when in various world situations. From this the system gleans a set of ⟨situation, action⟩ tuples performed and builds a policy function π(situation) → action from these tuples. This can be tackled as a supervised learning task [4, 5, 8, 10, 12, 15]. However, some literature instead transforms the problem into a reinforcement learning task by providing the learner only with a reinforcement signal based on how closely the learned policy matches the tuples provided by the demonstrator [9, 24]. This is curious given that the problem is, in essence, supervised; the reinforcement methods are in some sense working with reduced information.

Our approach differs from these methods in an important way. Instead of learning situation→action rules, our model learns the transition functions of an HFA with predefined internal states, each corresponding to a possible basic behavior. This enables the demonstrator to differentiate transitions to new behaviors not just based on the current world situation but also the current behavior. That is, we learn rules of the form ⟨previous action, situation⟩ → action. Another, somewhat different use of internal state would be to distinguish between aliased observations of hidden world situations, something which may be accomplished through learning hidden Markov models (for example, [13]).

Hierarchical Models. The use of hierarchies in robot or agent behaviors is very old indeed, going back as early as Brooks's Subsumption Architecture [7]. Hierarchies are a natural way to achieve layered learning [22] via task decomposition. This is a common strategy to simplify the state space: see [11] for an example. While it is possible in these cases to induce the hierarchy itself, usually such methods iteratively compose hierarchies in a bottom-up fashion.

Our HFA model bears some similarity to hierarchical behavior networks such as those for virtual agents [6] or physical robots [17], in which feed-forward plans are developed, then incorporated as subunits in larger and more complex plans. In such literature, the actual application of hierarchy to learning from demonstration has been unexpectedly limited. However, learning from demonstration has been applied more extensively to multi-level reinforcement learning, as in [23], albeit with a fixed hierarchy.

Language Induction. One cannot mention learning finite state automata without noting that they have a long history in language induction and grammatical inference, with a correspondingly massive literature. For recent surveys of techniques using automata for grammar induction, see [19,
special state is the optional done state, whose behavior sim-
ply sets a done flag and immediately transitions to the start
Start state. This is used to potentially indicate to higher-level
HFAs that the behavior of the current HFA is “done”.
Always
Figure 1 shows a simple automaton with four states, corre-
Rotate Rotate sponding to the behaviors start, rotate-left, rotate-right, and
Left Right forward. It may appear at first glance that not all HFAs can
be built with this model: for example, what if there were two
states in which the rotate-left behavior needed to be done?
This can be handled by creating a simple HFA which does
Always Always
If
nothing but transition to the rotate-left state and stay there.
If
no obstacle is in front obstacle is in front This automaton is then stored as a behavior called rotate-
and or left2 and used in our HFA as an additional state, but one
obstacle is " 5.2 to left obstacle is ! 2.3 to left which performs the identical behavior to rotate-left.
with its current state, then applies the transition function to determine a new state for the next timestep, if any. When a performed behavior is itself an HFA, this operation is recursive: the child HFA likewise performs one step of its current behavior, and applies its transition function. Additionally, when an HFA transitions to a state whose behavior is an HFA, that HFA is initialized: its initial state is set to the start state, and its done flag is cleared.

Formal Model. For the purposes of this work, we define the class of hierarchical finite-state automata models H as the set of tuples ⟨S, F, T, B, M⟩ where:

• S = {S0, S1, ..., Sn} is a set of states, including a distinguished start state S0, and possibly also one done state S∗. Exactly one state is active at any time.

• F = {F1, F2, ..., Fn} is a set of observable features in the environment. The set of features is partitioned into three disjoint subsets representing categorical (C), continuous (R) and toroidal (A) features. Each Fi can assume a value fi drawn from a finite (in the case of C) or infinite (in the case of R and A) number of possible values. At any point in time, the present assumed values f⃗ = ⟨f1, f2, ..., fn⟩ for each of the F1, F2, ..., Fn are known as the environment's current feature vector.

• T : F1 × F2 × ... × Fn × S → S is a transition function which maps a given state Si, and the current feature vector ⟨f1, f2, ..., fn⟩, onto a new state Sj. The done state S∗ is the sole state which transitions to the start state S0, and does so always: for all Sk ≠ S∗ and all f⃗, T(f⃗, Sk) ≠ S0, and for all f⃗, T(f⃗, S∗) = S0.

• B = {B1, B2, ..., Bn} is a set of atomic behaviors. By default, the special behavior idle, which corresponds to inactivity, is in B, as may also be the optional behavior done.

• M : S → H ∪ B is a one-to-one mapping function of states to basic behaviors or hierarchical automata. M(S0) = idle, and M(S∗) = done. M is constrained by the stipulation that recursion is not permitted, that is, if an HFA H ∈ H contains a mapping M which maps to (among other things) a child HFA H′, then neither H′ nor any of its descendent HFAs may contain mappings which include H.

We further generalize the model by introducing free variables (G1, ..., Gn) for basic behaviors and features: these free variables are known as targets. The model remains unaltered, by replacing behaviors Bi with Bi(G1, ..., Gn) and features Fi with Fi(G1, ..., Gn). The main differences are that the evaluation of the transition function and the execution of behaviors will both be based on ground instances of the free variables.

4. LEARNING FROM DEMONSTRATION
The above mechanism is sufficient to hand-code HFA behaviors to do a variety of tasks; but our approach was meant instead to enable the learning of such tasks. Our learning algorithm presumes that the HFA has a fixed set of states, comprising the combined set of atomic behaviors and all previously learned HFAs. Thus, the learning task consists only of learning the transitions among the states: given a state and a feature vector, decide which state (drawn from a finite set) to transition to. This is an ordinary classification task. Specifically, for each state Si we must learn a classifier f⃗ → S whose attributes are the environmental features and whose classes are the various states. Once the classifiers have been learned, the HFA can then be added to our library of behaviors and itself be used as a state later on.

Because the potential number of features can be very high, and many unrelated to the task, and because we want to learn based on a very small number of samples, we wish to reduce the dimensionality of the input space to the machine learning algorithm. This is done by allowing the user to specify beforehand which features will matter to train a given behavior. For example, to learn a Figure-8 pattern around two unspecified targets A and B, the user might indicate a desire to use only four parameterized features: distance-to(A), distance-to(B), direction-to(A), and direction-to(B). During training the user temporarily binds A and B to some ground targets in the environment, but after training they are unbound again. The resulting learned behavior will itself have two parameters (A and B), which must ultimately be bound to use it in any meaningful way later on.

The training process works as follows. The HFA starts in the "start" state (idling). The user then directs the agent to perform various behaviors in the environment as time progresses. When the agent is presently performing a behavior associated with a state Si and the user chooses a new behavior associated with the state Sj, the agent transitions to this new behavior and records an example of the form ⟨Si, f⃗, Sj⟩, where f⃗ is the current feature vector. Immediately after the agent has transitioned to Sj, it turns out to be often helpful to record an additional example of the form ⟨Sj, f⃗, Sj⟩. This adds at least one "default" (that is, "keep doing state Sj") example, and is nearly always correct, since in that current world situation the user, who had just transitioned to Sj, would nearly always want to stay in Sj rather than instantaneously transition away again.

At the completion of the training session, the system then builds transition functions from the recorded examples. For each state Sk, we build a decision tree DSk based on all examples where Sk is the first element, that is, of the form ⟨Sk, f⃗, Si⟩. Here, f⃗ and Si form a data sample for the classifier: f⃗ is the input feature and Si is the desired output class. If there are no examples at all (because the user never transitioned from Sk), the transition function is simply defined as always transitioning back to Sk.

At the end of this process, our approach has built some N decision trees, one per state, which collectively form the transition function for the HFA. After training, some states will be unreachable because the user never visited them, and so no learned classification function ever mapped to them. These states may be discarded. The agent can then be left to wander about in the world on its own, using the resulting HFA.

Though in theory many classification algorithms are applicable (such as K-Nearest-Neighbor or Support Vector Machines), in our experiments we chose to use a variant of the C4.5 Decision Tree algorithm [20] for several reasons:

1. Many areas of interest in the feature space of our agent approximately take the form of rectangular regions (angles, distances, etc.).
Figure 3: Feature selection and target assignment.
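To make the training procedure described above concrete, the sketch below records ⟨state, feature vector, next state⟩ examples as the demonstrator switches behaviors and then fits one classifier per state. It is only an illustrative sketch under our own naming: scikit-learn's DecisionTreeClassifier stands in for the authors' C4.5 variant, and the class and method names are hypothetical, not the paper's implementation.

    # Illustrative sketch only: one transition classifier per HFA state,
    # trained from demonstration examples (state, features, next_state).
    # scikit-learn's CART tree stands in for the paper's C4.5 variant.
    from collections import defaultdict
    from sklearn.tree import DecisionTreeClassifier

    class TransitionLearner:
        def __init__(self):
            # examples[state] -> list of (feature_vector, next_state)
            self.examples = defaultdict(list)
            self.trees = {}

        def record(self, state, features, next_state):
            """Called when the demonstrator switches the agent to next_state."""
            self.examples[state].append((list(features), next_state))
            # "Default" example: immediately after entering next_state,
            # the demonstrator almost always wants to stay there.
            self.examples[next_state].append((list(features), next_state))

        def fit(self):
            """Build one decision tree per state from the recorded examples."""
            for state, data in self.examples.items():
                X = [f for f, _ in data]
                y = [nxt for _, nxt in data]
                self.trees[state] = DecisionTreeClassifier().fit(X, y)

        def transition(self, state, features):
            """Return the next state; stay put if the state was never left."""
            tree = self.trees.get(state)
            if tree is None:
                return state
            return tree.predict([list(features)])[0]

In a demonstration session, record() would be invoked each time the trainer presses a key to switch behaviors; after fit(), transition() plays the role of the per-state decision trees described above.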
[Figure 4 (diagram): learned state-transition diagrams for GoTo(A), Harvest, Deposit, and Forage, with transitions conditioned on features such as whether A is to the left, to the right, roughly ahead, or close enough, whether food is below the agent, whether the agent is full, and whether a sub-behavior has signalled done.]

Figure 4: The Forage behavior and its sub-behaviors: Deposit, Harvest, and GoTo(Parameter A). All conditions not shown are assumed to indicate that the agent remain in its current state.
behaviors, and demonstrate the agent properly foraging, in a matter of minutes.

The GoTo(A) Behavior. This behavior caused the agent to go to the object marked A. The behavior was a straightforward bang-bang servoing controller: rotate left if A is to the left, else rotate right if A is to the right; else go forward; and when close enough to the target, enter the "done" state.
We trained the GoTo(A) behavior by temporarily declaring a marker in the environment to be Parameter A, and reducing the features to just distance-to(A) and angle-to(A). We then placed the agent in various situations with respect to Parameter A and "drove" it over to A by pressing keys corresponding to the rotate-left, rotate-right, forward, and done behaviors. After a short training session, the system quickly learned the necessary behaviors to accurately go to the target and signal completion. Once completed, it was made available in the library as go-to(A).

The Harvest Behavior. This behavior caused the agent to go to the nearest food, then load it into the agent. When the agent had filled up, it would signal that it was done. If the agent had not filled up yet but the food had been depleted, the agent would search for a new food location and continue harvesting. This behavior employed the previously-learned go-to(A) behavior as a subsidiary behavior, binding its Parameter A to the nearest-food target. This behavior also employed the features food-below-me and food-stored-in-me.
We trained the Harvest Behavior by directing the agent to go to the nearest food, then load it, then (if appropriate) signal "done", else go get more food. We also placed the agent in various corner-case situations (such as if the agent started out already filled up with food). Again, we were able to rapidly train the agent to perform harvesting. Once completed, it was made available in the library as harvest.
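The way a learned behavior such as go-to(A) is re-used inside Harvest (and, below, Deposit) can be pictured with a small sketch of the recursive tick and target binding described earlier. This is not the authors' implementation: the class layout, the behavior names, and the nearest-food resolver are hypothetical stand-ins, included only to illustrate the hierarchy.

    # Illustrative sketch (hypothetical names): a state's behavior may itself be a
    # previously learned HFA, ticked recursively and re-initialized on entry.
    class HFA:
        def __init__(self, behaviors, transition):
            self.behaviors = behaviors    # state name -> callable or child HFA
            self.transition = transition  # (state, features) -> next state name
            self.state = "start"
            self.done = False

        def reset(self):
            self.state, self.done = "start", False

        def step(self, features):
            """One tick: run the current state's behavior, then transition."""
            behavior = self.behaviors.get(self.state)
            if isinstance(behavior, HFA):
                behavior.step(features)          # recursive tick of a child HFA
            elif callable(behavior):
                behavior(features)               # atomic behavior, e.g. rotate-left
            if self.state == "done":
                self.done = True                 # signal completion to the parent
            nxt = self.transition(self.state, features)
            if nxt != self.state:
                self.state = nxt
                entered = self.behaviors.get(nxt)
                if isinstance(entered, HFA):
                    entered.reset()              # entering a child HFA re-initializes it

    # Hypothetical composition: Harvest re-uses the learned go-to behavior with its
    # target parameter bound to the nearest food (Deposit would bind it to the station).
    #   harvest = HFA(
    #       behaviors={"start": None,
    #                  "go-to-food": make_go_to(target=nearest_food),  # bound go-to(A)
    #                  "load-food": load_food,
    #                  "done": None},
    #       transition=harvest_decision_trees)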
The Deposit Behavior. This behavior caused the agent to go to the station, unload its food, and signal that it is done. If the agent was already empty when starting, it would immediately signal done. This behavior also used the previously-learned go-to(A) behavior as a subsidiary state behavior, but instead bound its Parameter A to the station target. It used the features food-stored-in-me and distance-to(station). We trained the Deposit Behavior in a similar manner as the Harvest Behavior, including various corner cases. Once completed, it was made available in the library as deposit.

The Forage Behavior. This simple top-level behavior just cycled between depositing and harvesting. Accordingly, this behavior employed the previously-learned deposit and harvest behaviors. The behavior used only the done feature.

6. CONCLUSION
In this paper, we have presented an approach for training agent behaviors using a hierarchical deterministic finite state automata model and a classification algorithm, implemented as a variant of the C4.5 algorithm. The main goal of our approach is to enable users to train agents rapidly based on a small number of training examples. In order to achieve this goal, we trade off learning complexity with training effort, by enabling trainers to decompose the learning task in a hierarchical manner, to learn general parameterized behaviors, and to explicitly select the most appropriate features to use when learning. This in turn reduces the dimensionality of the learning problem.
We have developed a proof-of-concept testbed simulator which appears to work well: we can train parameterized, hierarchical behaviors for a variety of tasks in a short period of time. We are presently deploying the platform to robots in our laboratory. In the meantime, there are a number of interesting issues that remain to be dealt with.

Multiple Agents. Our immediate next goal is to move to training multiple agents. In the general case, multiagent learning is a much more complex task than single-agent learning, involving game-theoretic issues which may be well outside the scope of the learning facility. However we believe there are obvious approaches to certain simple multiagent learning scenarios: for example, teaching agents to perform actions as homogeneous behavior groups (perhaps by training an agent with respect to other agents not under his control, but moving them similarly). Another area of multiple agent training may involve hierarchies of agents, with certain agents in control of teams of other agents.

Unlearning. There are two major reasons why an agent may make an error. First, it may have learned poorly due to an insufficient number of examples or unfortunately located examples. Second, it may have been misled due to bad examples. This second situation arises due to errors in the training process, something that's surprisingly easy to do! When an agent makes a mistake, the user can jump in and correct it immediately, which causes the system to drop back into training mode and add those new examples to the behavior's collection. However this does not cause any errant examples to be removed. Since the agent made an error based not on examples but rather based on the learned function, identifying which examples were improper, and whether to remove them, may prove a challenge.

Programming versus Training. We have sought to train agents rather than explicitly code them. However we also aimed to do so with a minimum of training. These goals are somewhat in conflict. To reduce the training necessary, we typically must reduce the problem space complexity and/or dimensionality. We have so far done so by allowing the user to inject domain knowledge into the problem (via task decomposition, for example, or by explicitly training for certain corner cases). This is essentially a step towards having the user explicitly declare part of the solution rather than have the learner induce it. So is this learning or coding? We think that training of this sort is somewhere in-between: in some sense the learning algorithm is relieving the trainer from having to "code" everything himself. The question worth studying is: how much learning is useful before the number of samples required to learn outweighs the reduced "coding" load, so to speak, on the trainer?

Other Representations. HFAs cannot straightforwardly do parallelism or planning. We chose HFAs largely because they were simple enough to make training intuitively feasible. Now that we've demonstrated this, we wish to examine how to train with other common representations, such as Petri nets or hierarchical task network plans, to demonstrate the generality of the approach.

7. ACKNOWLEDGMENTS
This work was supported in part by NSF grant 0916870.

8. REFERENCES
[1] R. Angros, W. L. Johnson, J. Rickel, and A. Scholer. Learning domain knowledge for teaching procedural skills. In The First International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 1372–1378. ACM, 2002.
[2] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57:469–483, 2009.
[3] C. G. Atkeson and S. Schaal. Robot learning from demonstration. In D. H. Fisher, editor, Proceedings of the Fourteenth International Conference on Machine Learning (ICML), pages 12–20. Morgan Kaufmann, 1997.
[4] M. Bain and C. Sammut. A framework for behavioural cloning. In Machine Intelligence 15, pages 103–129. Oxford University Press, 1996.
[5] D. C. Bentivegna, C. G. Atkeson, and G. Cheng. Learning tasks from observation and practice. Robotics and Autonomous Systems, 47(2-3):163–169, 2004.
[6] R. Bindiganavale, W. Schuler, J. M. Allbeck, N. I. Badler, A. K. Joshi, and M. Palmer. Dynamically altering agent behaviors using natural language instructions. In Autonomous Agents, pages 293–300. ACM Press, 2000.
[7] R. A. Brooks. Intelligence without representation. Artificial Intelligence, 47:139–159, 1991.
[8] S. Calinon and A. Billard. Incremental learning of gestures by imitation in a humanoid robot. In C. Breazeal, A. C. Schultz, T. Fong, and S. B. Kiesler, editors, Proceedings of the Second ACM SIGCHI/SIGART Conference on Human-Robot Interaction (HRI), pages 255–262. ACM, 2007.
[9] A. Coates, P. Abbeel, and A. Y. Ng. Apprenticeship learning for helicopter control. Communications of the ACM, 52(7):97–105, 2009.
[10] J. D. P. K. Egbert and D. Ventura. Learning policies for embodied virtual agents through demonstration. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1257–1252, 2007.
[11] D. Grollman and O. Jenkins. Learning robot soccer skills from demonstration. In IEEE 6th International Conference on Development and Learning (ICDL), pages 276–281, July 2007.
[12] M. Kasper, G. Fricke, K. Steuernagel, and E. von Puttkamer. A behavior-based mobile robot architecture for learning from demonstration. Robotics and Autonomous Systems, 34(2-3):153–164, 2001.
[13] D. Kulic, D. Lee, C. Ott, and Y. Nakamura. Incremental learning of full body motion primitives for humanoid robots. In 8th IEEE-RAS International Conference on Humanoid Robots, pages 326–332, Dec. 2008.
[14] S. Luke, C. Cioffi-Revilla, L. Panait, K. Sullivan, and G. Balan. Mason: A multi-agent simulation environment. Simulation, 81(7):517–527, July 2005.
[15] J. Nakanishi, J. Morimoto, G. Endo, G. Cheng, S. Schaal, and M. Kawato. Learning from demonstration and adaptation of biped locomotion. Robotics and Autonomous Systems, 47(2-3):79–91, 2004.
[16] M. N. Nicolescu and M. J. Mataric. Learning and interacting in human-robot domains. IEEE Transactions on Systems, Man, and Cybernetics, Part A, 31(5):419–430, 2001.
[17] M. N. Nicolescu and M. J. Mataric. A hierarchical architecture for behavior-based robots. In The First International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 227–233. ACM, 2002.
[18] M. N. Nicolescu and M. J. Mataric. Natural methods for robot task learning: instructive demonstrations, generalization and practice. In The Second International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 241–248. ACM, 2003.
[19] R. Parekh and V. Honavar. Grammar inference, automata induction, and language acquisition. In Handbook of Natural Language Processing, pages 727–764. Marcel Dekker, 2000.
[20] J. R. Quinlan. C4.5: Programs for Machine Learning (Morgan Kaufmann Series in Machine Learning). Morgan Kaufmann, 1 edition, January 1993.
[21] P. E. Rybski, K. Yoon, J. Stolarz, and M. M. Veloso. Interactive robot task training through dialog and demonstration. In C. Breazeal, A. C. Schultz, T. Fong, and S. B. Kiesler, editors, Proceedings of the Second ACM SIGCHI/SIGART Conference on Human-Robot Interaction (HRI), pages 49–56. ACM, 2007.
[22] P. Stone and M. M. Veloso. Layered learning. In R. L. de Mántaras and E. Plaza, editors, 11th European Conference on Machine Learning (ECML), pages 369–381. Springer, 2000.
[23] Y. Takahashi and M. Asada. Multi-layered learning system for real robot behavior acquisition. In V. Kordic, A. Lazinica, and M. Merdan, editors, Cutting Edge Robotics. Pro Literatur, 2005.
[24] Y. Takahashi, Y. Tamura, and M. Asada. Mutual development of behavior acquisition and recognition based on value system. In 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 386–392. IEEE, 2008.
[25] H. Veeraraghavan and M. M. Veloso. Learning task specific plans through sound and visually interpretable demonstrations. In 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2599–2604. IEEE, 2008.
[26] E. Vidal. Grammatical inference: An introductory survey. In Grammatical Inference and Applications, pages 1–4. Springer, 1994.
Proceedings of the AAMAS Workshop on Adaptive and Learning Agents, May 2010, Toronto, Canada
tion results. Finally, in Section 5 we provide a summary and a discussion of the results as well as directions for further research.

1.1 Related Work
Model-free learning algorithms such as reinforcement learning can be used for navigation applications [24]. The online reinforcement learning algorithm OLPOMDP [3] has been used successfully in applications including general robot control [9]. By operating on a parameterized functional representation of the knowledge gained during operation, instead of on specific models, important features of the world in which the robot (or agent) operates can be the focus. This allows adaptive behavior to be learned as a connection between features, rather than the degree to which a provided model is accurate.
Reward shaping in reinforcement learning for robotics allows the balancing of specific agent tasks and automatic agent-agent or agent-environment interactions [6, 7]. The methodology was based on domain knowledge first, and then augmented by suggestions from an external trainer to "shape" the agent learning progress. The concept was further expanded to reinforcement learning for situated agents [20]. The use of reward shaping concepts has proven successful in the area of robotics, including its application to policy search techniques for navigation and non-Markovian pole balancing [13, 14]. More recently, analysis was done to evaluate the impact of shaping on reward horizons [19] and generate a methodology to allow dynamically shaped rewards (rather than the typical static shaped reward) [18]. Further extensions on dynamic shaping have moved into the area of adaptive shaping in general robot domains [16] and the robot soccer domain [12].
Of particular interest to the work presented in this paper is the empirical analysis of genetic algorithms and temporal difference learning in reinforcement learning robotic soccer [26, 30]. Both techniques are model-free control approaches, and the work presented in these papers provides a comprehensive analysis of the behavior of both in a domain where the robots have limited resources, including observational capabilities.

2. ROBOT NAVIGATION
In problems where resources are limited, particularly in the robots' abilities to observe their surroundings, correctly interpreting the incoming information and mapping it to coherent actions (e.g., navigation) is a key concern. In this work, we explore three algorithms for navigation based on environment information obtained from sonar and inertial sensors: i) a deterministic navigation algorithm is used to provide a deterministic action choice based on state information collected, ii) a policy search algorithm that uses a multi-layered neural network as a policy and an evolutionary algorithm as the search method to assign a "quality" to each potential path, and iii) a policy gradient algorithm that uses a multi-layered neural network for the policy function.

2.1 State and Action Spaces
In the work presented here, the mobile robotic platform has a limited set of sensing capabilities, and provides a non-deterministic outcome of actions taken. As a result, a unique set of spaces is required to accurately represent the environment surrounding the robot and provide maximum articulation capabilities.
To encode as much information as possible from the sensors, as simply as possible, incoming state details are distilled into two state variables:

Object Distance: A distance to the nearest object dθi is provided for each vehicle-relative angle θi. This represents a potential obstacle, such as a wall or rock.

Destination Heading: The difference between the potential path heading θi and the vehicle-relative destination heading αdes is provided. This indicates how significant a correction is required for the robot to point directly toward the destination.

This state representation is of course quite predictable in the destination heading, never exceeding |180| degrees and symmetrical about the destination heading αdes. The object distance state variable can vary widely however, and depends strongly on the resolution of the environment sensing available. Both state variables also depend on the accuracy of the sensors that provide distance and track robot orientation.
To provide a space of actions that is as directly indicative of robot task needs as possible, but abstract enough to reduce the impact of non-determinism, the concept of "path quality" is introduced. This quality, represented by X(θi), is calculated in varying ways dependent on the algorithm used, but represents the quality of a potential path for the robot to take next. In producing a distribution of quality for all possible paths at each time-step, the state of the environment is represented, and a path can be chosen either via the maximum quality, or by sampling to inject exploration behavior.

2.2 Policy Search
The state/action structure of navigation presented in Section 2.1 contains a beneficial approach to path selection. It is simple, which reduces computational complexity as well as the number of potentially unpredictable behaviors, and applies well to a mobile robot with limited sensing capabilities.
To capture those benefits while injecting the benefits of adaptability into navigation control, the state and action spaces are maintained. Therefore, for this work, the baseline network structure created is a two-layer, sigmoid-activated artificial neural network with two inputs, one output unit, and eight hidden units (selected through an empirical performance study).
This network is run at each time step, for each potential path, generating a path quality function in a similar fashion to that of the deterministic algorithm. The difference is in the replacement of static predefined probability distributions by an adaptive artificial neural network.
An evolutionary search algorithm for ranking and subsequently locating successful networks within a population [21] is applied here. The algorithm maintains a population of ten networks, uses mutation to modify individuals, and ranks them based on a performance metric specific to the domain. The search algorithm used is shown in Figure 1, which displays the logical flow of the algorithm.
The definitions for the variables and functions located in the algorithm shown in Figure 1 are as follows:
    T: indexes episodes
    t: indexes time-steps within each episode
    θi: angle of potential path i
    N: indexes networks with appropriate subscripts
    N′: mutated network for use in control of the current episode
    X(θi): path quality assigned to each potential path
    αu: chosen vehicle-relative robot angle for the next time-step
    F(X(αu)): linear function mapping the quality of the chosen path to robot speed
    Vu: chosen robot speed

Figure 1 (the policy search algorithm):

    Initialize N networks at T = 0
    For T < Tmax loop:
      1. Pick a random network Ni from the population
         With probability ε: Ncurrent ← Ni
         With probability 1 − ε: Ncurrent ← Nbest
      2. Modify Ncurrent to produce N′
      3. Control the robot with N′ for the next episode (tepi time steps)
         For t < tepi loop:
           3.1 For θi ≤ 360 loop: run N′ to produce X(θi)
           3.2 αu ← argmax X(θi)
           3.3 Vu ← F(X(αu))
      4. Rank N′ based on performance (objective function)

In this domain, modifying a policy (Step 2) involves adding a randomly generated number to every weight within the network. This can be done in a large variety of ways; however, it is done here by sampling from a random Cauchy distribution [1], where the samples are limited to the continuous range [-10.0, 10.0]. Ranking of the network performance (Step 4) is done using a domain-specific objective function, and is discussed in detail in Section 3.

2.3 Policy Gradient
The reinforcement learning technique of adaptive control is a structure of algorithm that must be formed such that it can be "rewarded" directly based on a predefined objective function of the next state of the robot, or the next state and action taken. In this work, the algorithm is rewarded based on the change of state resulting from an action previously taken (therefore a direct function of the next state achieved). There are a great many reinforcement learning algorithm structures; however, for this work an online partially observable MDP [3] algorithm proved the most successful.
The definitions for the variables and functions located in the algorithm shown in Figure 2 are as follows:

    T: indexes episodes
    t: indexes time-steps within each episode
    θi: angle of potential path i
    ω: weights of the neural network function approximator
    π(a|s, ω): path quality assignment policy
    f(a′|s, ω): output of the neural network function approximator
    P(a): probability of taking action a
    e: eligibility traces for parameters ω
    β, α: discounting factor and learning rate, respectively
    F(f(a|s, ω)): linear function mapping the quality of the chosen path to robot speed
    Vu: chosen robot speed

Figure 2 (the policy gradient algorithm):

    Initialize ω and e
    For T < Tmax loop:
      For t < tepi loop:
        1. Capture current state s
        2. Sample action a from π(a|s, ω)
           2.1 For θi ≤ 360 loop: run the network to produce f(a′|s, ω)
           2.2 g(a′) = Σ_{a′} f(a′|s, ω)
           2.3 P(a) = f(a|s, ω) / g(a′)
        3. Execute action a and capture reward r
        4. e ← βe + ∇ω π(a|s, ω)
        5. ω ← ω + α e r
        6. Vu ← F(f(a|s, ω))

The state and action spaces here are very important as well. In order to maintain comparability, the spaces are identical to those used in the deterministic and policy search techniques described in Sections 2.1 and 2.2. This structure, where path quality is assigned to potential paths, lends itself to reinforcement learning, with a minor modification to fit within the online partially observable MDP [3] policy update strategy. This change involves normalizing each network output by the sum over all outputs to produce a probability distribution for the path quality assignment policy:

$$\pi(a \mid s, \omega) = \frac{f(a \mid s, \omega)}{\sum_{a'} f(a' \mid s, \omega)} \qquad (1)$$

where a is a sampled action, s is the current state, ω are the network weights, and f(a|s, ω) is the output of the neural network function approximator that provides the value of the sampled action a (path quality), given current state s and weights ω.¹

¹ The output of the neural network approximator f(a′|s, ω) is essentially identical to the path quality distribution X(θi) shown in Figure 1. The neural network does not produce a probability distribution, so dividing by the sum of all qualities normalizes to a probability from which the next action can be sampled, injecting exploration into the algorithm.
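As an illustration of Equation 1 and the sampling step of Figure 2 (steps 2.1–2.3), the snippet below normalizes a vector of path qualities into a probability distribution and samples the next path. It is a minimal numpy sketch under our own naming, not the authors' code; the quality vector would come from the two-input network of Section 2.2, whose sigmoid outputs are assumed positive.

    # Minimal sketch of the action-selection step (Equation 1; Figure 2, steps 2.1-2.3).
    # qualities[i] stands in for f(a_i|s,w), the network's (positive) output for path i.
    import numpy as np

    def sample_path(qualities, rng=None):
        """Normalize path qualities into a distribution and sample the next path."""
        rng = rng or np.random.default_rng()
        q = np.asarray(qualities, dtype=float)
        probs = q / q.sum()        # Equation 1: pi(a|s,w) = f(a|s,w) / sum_a' f(a'|s,w)
        return rng.choice(len(q), p=probs), probs

    # Example with made-up qualities for four candidate headings.
    action, probs = sample_path([0.2, 0.9, 0.4, 0.1])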
The following is the neural network function approximator gradient with respect to parameters ω (network weights)²:

$$\nabla_\omega \pi(a \mid s, \omega) = \frac{\frac{d}{d\omega} f(a \mid s, \omega) \sum_{a'} f(a' \mid s, \omega) \;-\; f(a \mid s, \omega) \sum_{a'} \frac{d}{d\omega} f(a' \mid s, \omega)}{\left|\sum_{a'} f(a' \mid s, \omega)\right|^{2}} \qquad (2)$$

where the derivative term with regard to the output weights is:

$$\frac{d}{d\omega_j} f(a \mid s, \omega) = f(a \mid s, \omega)\,\bigl(1 - f(a \mid s, \omega)\bigr)\, h_j = \delta_o h_j \qquad (3)$$

and the derivative with respect to the hidden weights results in:

3.1 Episodic Objective
An episodic objective aims to capture three important aspects of mobile robot navigation in unknown environments under the capability restrictions described above: 1) the total path length the robot uses to reach the destination, 2) the time the robot consumes reaching the destination, and 3) the time the robot consumes recovering from a collision with an obstacle. These incorporate choosing the shortest path, executing it with greatest speed, and doing so in a safe manner. In order to convert the above to maximization rather than minimization, and support constantly shifting initial conditions, the best possible behavior is incorporated, generating the following objective function:

$$R(s) = \alpha\,(d_{best} - d_{actual}) + \beta\,(t_{best} - t_{actual}) - \gamma\, t_{collision} \qquad (5)$$
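Equation 2 above is simply the quotient rule applied to the normalized policy of Equation 1. As a sanity-check sketch (not the authors' code), the function below evaluates ∇ω π for one sampled action from the network outputs f and their per-weight derivatives df/dω, which a backpropagation routine for the two-layer network would supply (cf. Equation 3); the array shapes and names are our own assumptions.

    # Sketch of Equation 2: gradient of the normalized policy pi(a|s,w), given
    # network outputs f (one per candidate path) and their derivatives
    # df[i, j] = d f(a_i|s,w) / d w_j (assumed to be provided by backprop).
    import numpy as np

    def policy_gradient(f, df, a):
        """Return d pi(a|s,w) / d w as a vector over the weights w."""
        f = np.asarray(f, dtype=float)        # shape (num_paths,)
        df = np.asarray(df, dtype=float)      # shape (num_paths, num_weights)
        total = f.sum()
        # Quotient rule: (df_a * sum_f - f_a * sum_df) / (sum_f)^2
        return (df[a] * total - f[a] * df.sum(axis=0)) / total**2

    # The eligibility-trace update of Figure 2 (steps 4 and 5) then reads:
    #   e = beta * e + policy_gradient(f, df, a)
    #   w = w + alpha * e * r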
where θ̄ is the change in state regarding the robot's heading as it relates to the destination heading, and d̄ is the change in state regarding the distance to the next impassable object. The constants η1 and η2 are in place to provide scaling and additional shaping. This linear formulation provides a much more deterministic indication as to the source of the reward in a single action choice, representing the desire for the policy to assign high quality to paths that turn toward the goal, but away from nearby objects.
The objective in Equation 8 brought the policy gradient algorithm into the same learning time scale as the policy search algorithm. However, overall performance was still not satisfactory, as poor policies were not sufficiently penalized. Therefore, the objective was further modified to exponentially penalize large deviations, leading to:

$$r(s) = \xi \left( 1 + e^{-\eta_1 \bar{\theta}} - e^{-\eta_2 \bar{d}} \right) \qquad (9)$$

where r is the reward used in Figure 2. This reward provided more robust performance in both simple and complex environments, and allowed the policy gradient algorithm to learn similar successful behavior to that of the policy search and deterministic algorithms.
This reward is calculated, and therefore the weight update is performed, at every time-step to allow the algorithm to directly observe and be rewarded based on the perceived success or failure of the last action taken. This allows the eligibility traces to more accurately track the effect of each network weight on the resulting state over time. Conversely, utilizing the episodic objective function requires that the traces interpret the effect of each network weight on up to 600 actions before a reward is received. There are factors to mitigate the effect of taking so many actions during an episode, most notably changing how often the parameters are updated (Figure 3), which effectively reduces how many actions the robot chooses during an episode.

Figure 3: Effect of changing the interval between weight updates in the policy gradient navigation algorithm. n is the number of time steps between updates. The episodic objective is plotted against learning episode for a representative selection of update intervals.

The inverse effect to that shown in Figure 3 occurs when the more specific reward in Equation 9 is used in conjunction with the policy search algorithm. That algorithm modifies all weights simultaneously, regardless of their direct effect on each action taken during an episode. Therefore, if the evolution search is run at each time-step (or a small subset of time-steps) and the networks are ranked on such a specific basis, it is very unlikely that the search will locate a network capable of being successful over an entire episode. For example, at the beginning of the episode the search may locate a successful network, and use it (with only probability ε of exploration) throughout, resulting in low system level performance.

3.3 Objective Equivalence
The episodic objective (Equation 5) is better suited for time-extended learning, as with policy search, and the focused objective (Equation 9) is better suited for state-change parameter updates, as with policy gradient. Still, it is important to study the behavior of both algorithms to ensure that they are achieving the same overall behavior goals and are being provided with the same level of information.
Equation 5 shows that by minimizing total path length and collisions while maximizing speed the robot can maximize system level performance (a maximum of 0, otherwise negative). Conceptually, Equation 9 shows that if the algorithm minimizes θ̄ by choosing actions that point the robot toward the destination, it is in effect minimizing total robot path length at the end of the episode. Additionally, the robot speed is again based linearly on the quality assignment to the path chosen, and therefore by being more confident about quality assignment, the algorithm maximizes robot speed. Finally, if the algorithm maximizes d̄, it is choosing actions that point the robot away from nearby obstacles and therefore minimizes collisions.
Figure 4 shows the calculated focused objective r(s) during a learning session. The policy search algorithm does not use the objective for learning; rather, the average objective over an episode is calculated and displayed. It is shown that while the policy search algorithm does not maximize the focused objective in a stable fashion, when it converges to the highest performance of its own objective (Equation 5), it simultaneously converges to the highest performance of the focused objective (Equation 9) used by the policy gradient algorithm. Likewise for policy gradient, shown on the right of Figure 4: when policy gradient converges to its best performance of the focused objective, it is also converging to the best performance of the episodic objective. These two results demonstrate empirically the equivalence of the two objectives used for learning.

4. EXPERIMENTS
Several experiments were designed to evaluate the navigation algorithms for a specific set of behaviors discussed in the problem definition. These progressively increased in difficulty and scope from basic navigation to a destination, through advanced navigation in cluttered environments. In all experiments, an arena of 5 meters square was created with a varying number of obstacles, depending on the experiment. The learning method is episodic in that the robot is allowed to operate for a fixed maximum amount of time (tepi = 60 seconds in this work). Learning is executed for 2000 episodes, and each experiment is run 40 times for each algorithm. These experiments evaluate not only the navigation algorithm's ability to seek a destination, but safely and intricately navigate around obstacles in an unknown en-
Figure 4: Left: The focused reward is plotted for both algorithms during a learning session. The average focused objective
r(s) per episode is calculated for all algorithms, though the policy search algorithm does not use it for learning. Right: The
policy gradient algorithm behavior as measured by two objectives: The algorithm learns using the focused objective, but both
the focused and episodic objectives are plotted. The comparison clearly shows that the two objective functions are functionally
equivalent in measuring robot performance. Error bars are omitted for clarity.
Figure 5: Left: The impact of obstacle density is shown. Maximum episodic objective achieved is plotted against varying
number of obstacles within the environment. Right: The result of the learning in a dense environment containing 20 obstacles.
The objective function is plotted for the random, deterministic, policy search, and policy gradient algorithms as an average over
40 iterations.
vironment, including when state information is inaccurate and action results are stochastic.

4.1 Impact of Obstacle Density
We now focus on the performance of the algorithms with respect to the density of obstacles within the environment. With limited environment detection capabilities, as the environment becomes more dense with hazards, the robot must be careful with path quality assignments such that safe operation is ensured. This is reflected in Figure 5 when the number of obstacles is low. While both adaptive algorithms consistently outperform the deterministic navigation algorithm, they have similar performance until the environment becomes complex.
As expected, all three algorithms drop in performance when the number of obstacles increases. The deterministic algorithm consistently drops in performance as the environment increases in density, and has a sharp deterioration rate. The policy search algorithm is able to maintain acceptable performance early on, but sharply declines between 15 and 20 obstacles, unable to locate the destination on occasion within the time allotted. The best performing algorithm is policy gradient, which degrades gracefully and is able to maintain its performance even when the environment is extremely dense with obstacles.
Figure 5 (right) shows that the policy gradient algorithm, utilizing the more focused objective, was able to more successfully encode information learned during operation. It does this by trading off robot speed in order to operate more safely in environments more dangerous for operation. This
Figure 6: Left: The results of the learning in a dense environment with 15 obstacles while sensor and actuator noise was
present. The objective function is plotted for the random, deterministic, policy search, and policy gradient algorithms as an
average over 40 iterations. Right: The results of the learning in a dense environment with 20 obstacles while sensor and actuator
noise was present. The objective function is plotted for the random, deterministic, policy search, and policy gradient algorithms
as an average over 40 iterations.
trade-off, and subsequent learned behavior, is an important aspect of mobile robot navigation.

utilizes a focused state-based objective that allows it to learn intricate behavior in complex environments.
making the transition to control of a physical robot in a real-world setting. In addition, our future work will focus on coordination in multiple physical robots, blending our current work of robot coordination and robot navigation.

Acknowledgements
This work was partially supported by AFOSR grant FA9550-08-1-0187 and NSF grant IIS-0910358.

6. REFERENCES
[1] A. K. Agogino and K. Tumer. Efficient evaluation functions for evolving coordination. Evolutionary Computation, 16(2):257–288, 2008.
[2] B. Banerjee and J. Peng. Adaptive policy gradient in multiagent learning. In The Conference on Autonomous Agents and Multiagent Systems, pages 686–692, 2003.
[3] P. Bartlett and J. Baxter. Stochastic optimization of controlled partially observable Markov decision processes. Decision and Control, 1:124–129, 2000.
[4] J. Bohren and T. Foote. Little Ben: The Ben Franklin racing team's entry in the 2007 DARPA Urban Challenge. Field Robotics Research, 25:588–614, 2008.
[5] M. Cummins and P. Newman. Probabilistic appearance based navigation and loop closing. In Robotics and Automation, IEEE International Conference on, pages 2042–2048, 2007.
[6] M. Dorigo and M. Colombetti. Robot shaping: Developing situated agents through learning. Technical Report TR-92-040, International Computer Science Institute, 1993.
[7] M. Dorigo and M. Colombetti. Robot Shaping: An Experiment in Behavior Engineering. MIT Press, 1998.
[8] A. El-Fakdi and M. Carreras. Policy gradient based reinforcement learning for real autonomous underwater cable tracking. In Intelligent Robots and Systems, IEEE/RSJ International Conference on, pages 3635–3640, 2008.
[9] A. El-Fakdi, M. Carreras, N. Palomeras, and P. Ridao. Autonomous underwater vehicle control using reinforcement learning policy search methods. Oceans, pages 2: 793–798, 2005.
[10] P. Fabiani and V. Fuertes. Autonomous flight and navigation of VTOL UAVs: from autonomy demonstrations to out-of-sight flights. Aerospace Science and Technology, 11:183–193, 2007.
[11] J. A. Fernandez-Leon, G. G. Acosta, and M. A. Mayosky. Behavioral control through evolutionary neurocontrollers for autonomous mobile robot navigation. Robotics and Autonomous Systems, In Press, Corrected Proof:411–419, 2008.
[12] T. Gabel and M. Riedmiller. Learning a partial behavior for a competitive robotic soccer agent. KI Zeitschrift, 20:18–23, 2006.
[13] F. Gomez and R. Miikkulainen. Incremental evolution of complex general behavior. Adaptive Behavior, 5:5–317, 1997.
[14] F. Gomez and R. Miikkulainen. Solving non-Markovian control tasks with neuroevolution. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-99), pages 1356–1361, Stockholm, Sweden, 1999.
[15] D. M. Helmick and S. I. Roumeliotis. Slip-compensated path following for planetary exploration rovers. Advanced Robotics, 20:1257–1280, 2006.
[16] G. Konidaris and A. Barto. Autonomous shaping: knowledge transfer in reinforcement learning. In Int. Conference on Machine Learning, pages 489–496, 2006.
[17] C. Kunz and C. Murphy. Deep sea underwater robotic exploration in the ice-covered Arctic ocean with AUVs. In Intelligent Robots and Systems, IEEE/RSJ International Conference on, 2008.
[18] A. Laud and G. DeJong. Reinforcement learning and shaping: Encouraging intended behaviors. In Int. Conference on Machine Learning, pages 355–362, 2002.
[19] A. Laud and G. DeJong. The influence of reward on the speed of reinforcement learning: An analysis of shaping. In Int. Conference on Machine Learning, 2003.
[20] M. J. Matarić. Reward functions for accelerated learning. In Machine Learning: Proceedings of the Eleventh International Conference, pages 181–189, San Francisco, CA, 1994.
[21] D. Moriarty and R. Miikkulainen. Forming neural networks through efficient and adaptive coevolution. Evolutionary Computation, 5:373–399, 2002.
[22] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Pearson Education, Inc., 2003.
[23] M. Saggar, T. D'Silva, N. Kohl, and P. Stone. Autonomous Learning of Stable Quadruped Locomotion, chapter 9, pages 98–109. Springer Berlin / Heidelberg, 2007.
[24] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[25] R. S. Sutton and D. McAllester. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12:1057–1063, 2000.
[26] M. E. Taylor, S. Whiteson, and P. Stone. Comparing evolutionary and temporal difference methods for reinforcement learning. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 1321–1328, 2006.
[27] R. Tedrake and T. Zhang. Stochastic policy gradient reinforcement learning on a simple 3D biped. In Intelligent Robots and Systems, IEEE/RSJ International Conference on, pages 2849–2854, 2004.
[28] S. Thrun and M. Montemerlo. Stanley: The robot that won the DARPA Grand Challenge. Robotic Systems, 23:661–692, 2006.
[29] S. Thrun and G. Sukhatme. Robotics: Science and Systems I. MIT Press, 2005.
[30] S. Whiteson, M. E. Taylor, and P. Stone. Empirical studies in action selection for reinforcement learning. Adaptive Behavior, 15(1):33–50, 2007.
Proceedings of the AAMAS Workshop on Adaptive and Learning Agents, May 2010, Toronto, Canada
ABSTRACT
Recent advances in service oriented technologies offer metered computational resources to consumers on demand. In certain environments these consumers are software agents, capable of autonomously procuring resources and completing tasks. The consumer agent can solicit relevant services from other software agents known as provider agents. These provider agents are instantiated with various business logic, which culminates in their service offering. Consumer demand varies over time, meaning each provider agent's service offering must adapt in order to succeed. Agents offering services for which there is little demand greatly reduce the probability of successful provision. Since service agents occupy a finite resource, offerings for which demand is low waste resources. The goal of this research is to create provider agents capable of adapting their service offerings to meet the available demand, thus maximising available revenues. To achieve this we implemented two separate algorithms, each tackling the problem from different perspectives, comparing their efficacy in this environment. Firstly, we adopted a centralised approach where a Genetic Algorithm evolves agents online to meet the fluctuating demand for services. In our results the genetic algorithm achieved a significant improvement over a non-elastic fixed agent supply. Secondly, a decentralised approach was developed using Reinforcement Learning. Provider agents individually learned through trial and error which services to offer. Results showed that reinforcement learning was slightly less adaptive than the genetic algorithm in meeting the varying demand. Both approaches displayed a significant improvement over the fixed aggregate supply. Critically, reinforcement learning requires far less computational or communication overhead, as agents make decisions from environmental experience.

Categories and Subject Descriptors
H.4 [Service Computing, Artificial Intelligence]: Multiagent Systems

General Terms
Genetic Algorithm, Reinforcement Learning

Keywords
Adaptive services, Demand estimation

1. INTRODUCTION
With the expansion of utility and cloud computing, more and more companies are outsourcing their computational requirements rather than processing them in-house. This has led to an increase in the number of providers offering cloud-based services and charging for usage either on fixed-rate tariffs or a pay-per-use basis. Zimory, a German company, launched what is being touted as a global marketplace for computational resource trading in January 2009. Although the idea of trading computational resources similarly to traditional commodities has been around for some time [2], with a number of projects [6] [7] developing structures for trading resources, its implementation has not. In this environment, potentially excess compute capacity could be sold on the open market, generating a revenue stream where previously there was none. Technologies are emerging where agents are capable of autonomously procuring and supplying services on the web. These agents must procure the relevant resources to achieve a specified goal, often engaging with multiple service providers to ensure successful completion. The service providers must decide what services to offer for consumption. The selection choice can involve deciding from amongst a large set of possible service offerings. When market conditions are uncertain the service provider is presented with a difficult decision. Choosing a service offering for which there is little demand results in wasted resources as the instantiation costs remain. Therefore this paper raises an interesting research question: what service offering should a provider agent expose to increase its chances of successful consumption?
In addressing this question we have investigated the use of Reinforcement Learning and a Genetic Algorithm to create adaptive service agents, capable of autonomously altering their service offerings online. Using these techniques the agents can adapt to fluctuating demands for services. The methods enable them to alter their offerings, meeting market demand, and concurrently optimising their limited resources. We compare the two approaches empirically, focussing on optimising limited resources while maintaining an adequate service level. The goal of each approach is to maximise the amount of revenue earned from available resources, while minimising lost revenues resulting from unfulfilled service requests. In these environments agent interactions are often governed by Service Level Agreements (SLAs), where a minimum requirement of service level is stipulated. This can often lead to over-provisioning of services to ensure compliance. Over-provisioning ensures compliance but it also
increases costs and inefficiency. Our aim is to find a solution that can meet agreements but also maximise efficiency, generating greater revenues from the existing resources. The genetic algorithm provides a centralised approach to tackling the problem of service choice among the provider agents. It is responsible for controlling agent population numbers and deciding what number of provider agents offer which services each time. In contrast, reinforcement learning offers a decentralised approach to tackling the same problem, where the provider agents make decisions based on their past experiences of what services to offer when.
The following sections of this paper are structured as follows: Background Research provides an overview of relevant work in this field. A number of aspects of service computing are discussed as well as an overview of a genetic algorithm and reinforcement learning. Model Design details the service provider/consumer model and describes the algorithms used in this paper. Simulator Design describes the simulator used to generate our service-oriented environment test-bed. Experimental Results evaluates the performance of the two methods, finally leading to Conclusions.

2. BACKGROUND RESEARCH
This section details relevant work in the area of service computing. We begin by outlining the artificial intelligence methods used in this paper. We document related work addressing the relevancy of these approaches to our problem. Finally we look at other related research in service computing.

2.1 Centralised vs Decentralised
In this section we examine both centralised and decentralised approaches to solving the problem. Using the genetic algorithm to evolve suitable agents, we address the problem in a centralised manner. The genetic algorithm determines the agent configuration, evolving the fittest agents for the environment each time. It is also solely responsible for adding and removing provider agents to and from the environment, depending on an agent's fitness. The creation of new agents and subsequent service allocation incurs an instantiation cost.
In contrast, reinforcement learning represents a decentralised approach to solving the problem, where agents individually make decisions on what services to offer each time. The reinforcement learning agents cannot increase the resources at their disposal and, similarly to the genetic algorithm, are limited to choosing from among the available services. A switching cost is applied should a learning agent elect to provide a service other than the one it currently provides. This cost is dealt with through the agents' returns and is explained in more detail in Section 4. Importantly, an unsuccessful learning agent can also decide to make itself idle for a period of time. Once in this idle phase the agent's resources are freed up and available to be used elsewhere. After a certain time the agent comes out of the idle state and begins service provision once more.

2.2 Genetic Algorithms
Genetic algorithms are stochastic search and optimisation techniques based on evolution. In their simplest form, a set of possible solutions to a particular problem are evaluated in an iterative manner. From the fittest of these solutions, the next generation is created and the evaluation process begins once more. A solution's suitability to its environment is determined using a fitness function. By iterating through successive generations an optimal solution can be found for the given environment. In our model the GA attempts to find optimal solutions continuously, in an online fashion. As demand for services varies with time, the optimum solution is not static and changes continuously.
A number of researchers have looked at using genetic algorithms to estimate demand for traditional utilities and commodities, such as oil [3] and electricity [1]. Using historical demand figures as well as a number of other parameters such as GNP, population, import and export figures, future projections were made, proving the viability of using the approach for demand prediction. Demand estimations were carried out retrospectively and not online.

2.3 Reinforcement Learning
Reinforcement learning involves the agent learning through trial and error from interactions with its environment. The reinforcement learning agent has an explicit goal which it endeavors to achieve. The environment presents the learning agent with the necessary evaluative feedback required to achieve this goal. This feedback or reward consists of a scalar value through which the learning agent determines its performance. Through repeated interactions with its environment the agent learns which actions result in higher rewards. The agent's goal is to maximise its reward in the long run.

2.3.1 Value Functions
The objective of the learning agent is to optimise its value function. The agent makes decisions on its value estimates of states and actions. V^π(s) is called the state value function for policy π. It is the value of state s under policy π and amounts to the return you expect to achieve, starting in s and following policy π from then on.

$$V^{\pi}(s) = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s \right\} \qquad (1)$$

Q^π(s, a) is called the action-value function. It is the value of choosing action a while in state s under policy π.

$$Q^{\pi}(s, a) = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\Big|\, s_t = s, a_t = a \right\} \qquad (2)$$

A number of studies have looked at applying reinforcement learning to resource allocation problems [14]. The author presented a framework using reinforcement learning, capable of dynamically allocating resources in a distributed system. While adaptive service provision is not a resource allocation problem, the parallels between them still merit their inclusion. Tesauro investigated the use of a hybrid reinforcement learning technique for autonomic resource allocation [13]. He applied this research to optimising server allocation in data centers. Germain-Renaud et al. [4] looked at similar resource allocation issues. Here a workload demand prediction technique was used to predict the resource allocation required each time. Reinforcement learning has also been successfully applied to grid computing as a job scheduler. Here the scheduler can seamlessly adapt its decisions to changes in the distributions of inter-arrival time, QoS requirements, and resource availability [10].
Figure 1: Individual Demand for Multiple Services (Stable). [Plot of quantity demanded against time (s) for Service1–Service4.]

Figure 2: Aggregate Service Demand. [Plot of aggregate quantity demanded against time (s) for the volatile and stable demand patterns.]
2.4 Service Selection and Composition
In service computing, recent work by Jacyno et al. investigated an approach where agents share local demand estimates among one another and adapt their service offerings to suit [5]. The agents in this work did not learn from their environment; instead they communicated with one another and altered their service offerings based on the information they acquired. By limiting the flow of information between agents, the authors were able to show the system had the capability to self-organise decentrally into communities where agents reliably provide the most requested service types.
A number of researchers in the field have addressed the problem of optimal service selection [8] [11]. Optimal service selection involves developing approaches to selecting an adequate number of service providers to ensure task completion, within certain constraints such as time, quality and budget.
Service composition has also received much attention in recent years. Composition addresses the issue of composing required functionality from amongst the available services. These services often belong to numerous different providers, offering varied or similar functionality. The objective of the composing algorithm is to service the request by composing the required functionality from the existing services. Weise et al. [15] compared the performance of an informed/uninformed search and a genetic algorithm for composing web services. The evolutionary approach, where solutions to requests were evolved, proved to be much slower than the search algorithms but was shown to always successfully satisfy requests.

3. MODEL DESIGN
In this section we discuss the model which we use to evaluate our learning approaches. We discuss how demand is generated and controlled, and also the architecture of the agents adopting the different approaches.

3.1 Agent Interactions And Demand Function
The agent environment supports two types of agents, service providers and service consumers. Agents interact in a controlled manner, with consumer agents always procuring services from provider agents. Consumers do not differentiate among the available providers in terms of quality or price. Also, they are not bound by deadlines nor restricted by budgetary constraints. The focus of this paper is to develop techniques to improve service adaptability and dynamism. The consumer agent requests a service from any available provider agent, with the provider agent remunerated upon completion. Payments for services are homogeneous. The provider agent's goal is to maximise its income throughout the course of its interactions with the environment. The income earned through interactions denotes its fitness. This influences its degree of success in the genetic algorithm, where fitter agents have a higher probability of producing offspring. It also represents the agent's reward in reinforcement learning, biasing the agent's interactions in order to achieve maximum reward.
The demand for each service, manifested through the service consumer, is controlled exogenously through the use of a sine function. The demand pattern generated using the sine function is purely deterministic, with no degree of randomness applied. Using this method we can evaluate performance more accurately over successive runs. Figure 1 shows the demand curves for four separate services where services are increasing and decreasing in demand. This demand pattern is relatively stable and is depicted in Figure 2 by the stable demand curve. The demand patterns in Figure 2 depict services on an aggregate level for both stable and volatile environments. These two demand patterns were selected in order to perform preliminary analysis of our approaches in disparate environments. Greater variance in demand will be addressed in future work.

3.2 Agent Architecture
Implementation of both the genetic algorithm and reinforcement learning requires two architecturally different provider agents. Learning agents require greater autonomy and decision-making skills than their evolved counterparts. This section outlines both architectures and introduces SARSA, the reinforcement learning algorithm used in this work.
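To make the demand model of Section 3.1 concrete, here is a small Python sketch of a deterministic sinusoidal demand generator; the period, amplitude and phase offsets are illustrative assumptions rather than the authors' actual parameters.

import math

def service_demand(t, peak=100.0, period=900.0, phase=0.0):
    """Deterministic sinusoidal demand for one service at time t (no randomness)."""
    # Shift and scale sin into [0, peak] so demand is never negative.
    return 0.5 * peak * (1.0 + math.sin(2.0 * math.pi * t / period + phase))

def aggregate_demand(t, num_services=4, peak=100.0, period=900.0):
    """Aggregate demand over all services; phases are staggered so individual
    services rise and fall at different times (cf. Figures 1 and 2)."""
    return sum(service_demand(t, peak, period, phase=i * math.pi / num_services)
               for i in range(num_services))

if __name__ == "__main__":
    for t in range(0, 1800, 300):
        print(t, round(aggregate_demand(t), 1))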
Page 79 of 99
3.2.1 Evolutionary Architecture
Each agent's service type is encoded as a bit string. This represents an agent's gene and resides in its chromosome. Since our model is currently concerned only with an agent's service type, this is the only gene residing in the chromosome. Individuals are selected for reproduction using roulette wheel selection, based on their fitness. Roulette wheel selection involves ranking chromosomes in terms of their fitness and probabilistically selecting them. The selection process is weighted in favour of chromosomes possessing a higher fitness. To ensure that agents already optimal for their environment are not lost in the evolutionary process, elitism is applied. Elitism involves the selection of a certain percentage of the fittest chromosomes and moving them straight into the next generation, avoiding the normal selection process. In creating offspring for the next generation, the selection of two parents is required. Each pairing results in the reproduction of two offspring. During reproduction, crossover and mutation, fundamental principles of genetic algorithms, are applied probabilistically. Crossover involves taking certain aspects/traits of both parents' chromosomes and creating a new chromosome. There are a number of ways to achieve this, including single point crossover, two point crossover and uniform crossover. Our crossover function employs single point crossover, where a point in the bit string is randomly selected, at which crossover is applied. Crossover generally has a high probability of occurrence; mutation, on the other hand, generally does not. Mutation involves randomly altering an agent's bit string, changing aspects of a chromosome. Occurrences of mutation were biased towards an increase or decrease of only 1 of a possible n services. Once the chromosome has been created, an agent is formed and is added to the population. The agent's performance is measured through its fitness value.

3.2.2 SARSA Learning
In this paper we use a classical reinforcement learning algorithm known as Sarsa. Sarsa belongs to a collection of algorithms called Temporal Difference (TD) methods. Not requiring a complete model of the environment, TD methods possess a significant advantage. TD methods have the capability of being able to make predictions incrementally and in an on-line fashion, without having to wait until the episode has terminated.
The learning agent interacts with its environment through a sequence of discretised time steps. At the end of each time period t the agent occupies state s_t ∈ S, where S represents the set of all possible states. Here the agent chooses an action a_t ∈ A(s_t), where A(s_t) is the set of all possible actions within state s_t. The agent receives a numerical reward or return, r_{t+1} ∈ ℜ, and enters a new state s′ = s_{t+1} [12]. The goal of the reinforcement learning agent is to maximise its returns in the long run, often forgoing short term gains in place of long term benefits. By introducing a discount factor γ, (0 < γ < 1), an agent's degree of myopia can be controlled. A value close to 1 for γ assigns a greater weight to future rewards, while a value close to 0 considers only the most recent rewards. For our experiments we have assigned γ a value of 0.9, forcing the agent to place greater emphasis on future rewards.
The update rule for Sarsa is defined as

    Q(s, a) ← Q(s, a) + α[r + γ Q(s′, a′) − Q(s, a)]        (3)

and is calculated each time a non-terminal state is reached. Approximations of Q^π, which are indicative of the benefit of taking action a while in state s, are calculated after each time interval. Actions are chosen based on π, the policy being followed. As mentioned previously, Q^π(s, a) is the value of taking action a while in state s, under policy π. The research presented in this paper uses an ϵ-greedy policy to decide what action to select while in a particular state. In the service environment we wish to reduce the probability of selection for all non-greedy actions. Put simply, we do not wish to select a service for which there is no demand. All non-greedy actions have a probability of selection of ϵ/|A(s)|, with the higher probability, 1 − ϵ + ϵ/|A(s)|, weighted in favour of the greedy strategy.

4. SIMULATOR DESIGN
The simulations presented in this paper involve agents interacting with one another in a discrete time environment. To simulate the performance of the two approaches, slightly different configurations had to be applied. The design of each algorithm is detailed below.

4.1 Evolutionary Simulations
For our experiments, evolution occurs over the entire population, with offspring from the fittest agents replacing only the weakest agents in the population. An agent's fitness A_f is calculated as the sum of all payoffs divided by the number of services provided during the time step:

    A_f = (1/n) ∑_{i=0}^{n} x_i

After each time step an agent's income from service provision is representative of its fitness for the environment, resulting in the fittest agent earning the most income. The population of agents is proportional to the amount of services in the system at any particular time. The percentage of elitism E applied to the population varies depending on whether a service is rising or falling in demand. To ensure adequate service provision to cater for spikes in demand, the genetic algorithm allows for a certain percentage of over-provisioning of supply. This percentage of over-provisioning is probabilistically chosen, where a greater weighting is applied to increasing supply where demand is rising. A crossover rate of 85% and a mutation rate of 5% are also used. Mutation is fundamental to the success of the genetic algorithm; without it, adaptivity could not be achieved. If demand for a particular service approaches 0, evolution may favour a generation of offspring which do not support this service. Without mutation this service will become extinct, resulting in major losses should demand for it increase again. The agent population is initially dispersed randomly, ensuring an even distribution of the available services. Offspring are created using a single point crossover of both parents' service gene, with the actual crossover point being randomly selected each time. Mutation occurred probabilistically throughout this process. Occurrences of mutation were biased towards an increase or decrease of only 1 of a possible n, with n being the number of services available. Pairings produce two offspring, which are added to the agent population, replacing weaker agents. To reflect the instantiation costs of loading the various business logic, the genetic algorithm is penalised each time it evolves. This is achieved by restricting the evolution process to every second time interval. Figure 3 gives a general overview of the design of the genetic algorithm.

Figure 3: Overview of the genetic algorithm design (rank, selection, agents).
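A minimal Python sketch of one evolutionary step in the spirit of Sections 3.2.1 and 4.1 (roulette-wheel selection, elitism, single-point crossover of a bit-string chromosome, probabilistic mutation, replacement of the weakest agents); the elite fraction and the generic bit-flip mutation are simplifying assumptions, since the paper biases mutation towards a change of one service.

import random

CROSSOVER_RATE, MUTATION_RATE = 0.85, 0.05  # rates quoted in Section 4.1

def roulette_select(population, fitness):
    """Fitness-proportionate (roulette-wheel) selection of one chromosome."""
    pick, acc = random.uniform(0.0, sum(fitness)), 0.0
    for chrom, fit in zip(population, fitness):
        acc += fit
        if acc >= pick:
            return chrom
    return population[-1]

def crossover(parent_a, parent_b):
    """Single-point crossover of two bit-string chromosomes, producing two offspring."""
    if random.random() > CROSSOVER_RATE:
        return parent_a[:], parent_b[:]
    point = random.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:], parent_b[:point] + parent_a[point:]

def mutate(chromosome):
    """Flip each bit with a small probability (mutation keeps rare services alive)."""
    return [(1 - bit) if random.random() < MUTATION_RATE else bit
            for bit in chromosome]

def next_generation(population, fitness, elite_fraction=0.1):
    """Elitism plus offspring from roulette-selected parents replacing the weakest."""
    ranked = [c for _, c in sorted(zip(fitness, population),
                                   key=lambda pair: pair[0], reverse=True)]
    offspring = list(ranked[: max(1, int(elite_fraction * len(population)))])
    while len(offspring) < len(population):
        a, b = roulette_select(population, fitness), roulette_select(population, fitness)
        for child in crossover(a, b):
            offspring.append(mutate(child))
    return offspring[: len(population)]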
Page 80 of 99
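The Sarsa update of Eq. (3) combined with ϵ-greedy selection can be written in a few lines; this Python sketch uses a tabular Q function and a hypothetical environment interface, with γ = 0.9 as in the paper and the remaining parameter values chosen only for illustration.

import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore uniformly; otherwise take the greedy action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_episode(env, actions, Q=None, alpha=0.1, gamma=0.9, epsilon=0.1):
    """One on-policy Sarsa episode:
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]   (Eq. 3)."""
    Q = Q if Q is not None else defaultdict(float)
    state = env.reset()                      # hypothetical environment interface
    action = epsilon_greedy(Q, state, actions, epsilon)
    done = False
    while not done:
        next_state, reward, done = env.step(action)
        next_action = epsilon_greedy(Q, next_state, actions, epsilon)
        td_target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
        Q[(state, action)] += alpha * (td_target - Q[(state, action)])
        state, action = next_state, next_action
    return Q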
4.2 Learning Simulations
For our experiments the service environment contains n services from which an agent can choose. The set of all possible service choices C = {c_1, c_2, ..., c_n, c*} contains all the services the environment supports. c* represents a temporary idle state which the agent may enter. When the agent enters this state it is inactive in the environment, freeing up its resources for usage elsewhere. While in the idle state the agent releases its lock on the system resources, allowing another process to obtain this lock and use the resources as it pleases. Once finished, the process relinquishes the lock, allowing the agent to exit the idle state. As stated previously, the policy being followed is an ϵ-greedy policy, meaning loosely: choose the action which results in the highest reward most of the time. All non-greedy actions with the exception of c* are subsequently weighted similarly. The addition of the idle state reduces the effectiveness of the policy during exploration. The provider only wishes to choose this action once it has tried all the other possible actions. To address this we designate a learning phase where non-greedy actions are not equiprobable. After each non-greedy action other than the idle state has been explored, the learning phase ends and all non-greedy actions become equiprobable. This increases the probability of the agent entering the idle state. The agent's action set for each state contains a choice of whether to remain in the current state or switch to one of the n other possibilities: A = {remain, switch_{c_1, c_2, ..., c_n, c*}}. The set of returns achievable is R = {0, +0.5, +1}. Actions that result in successful service provision yield a reward of +1. However, if successful provision is achieved through switching services, the reward is halved to +0.5. Halving the reward only occurs for the first time interval, with the agent receiving a reward of +1 thereafter for successful provision, as long as it does not switch. This represents the expense of instantiating the necessary business logic for the new service. Applying a switching cost in this way enhances the stability of the system, preventing unnecessary switching. From an efficiency perspective this could prove costly if carried out in sufficient quantity. Averaged returns for each state-action pair are stored in a lookup table. Storing Q values this way is possible as the number of states and actions is kept relatively small for analytical purposes. Learning in a more expansive system containing larger numbers of services would require function approximation or neural networks to approximate Q values. The steps involved in the reinforcement learning algorithm are depicted by Algorithm 1 below.

5. EXPERIMENTAL RESULTS
This section compares the performance of both the centralised and decentralised approaches using the simulator discussed in the previous section as a test bed. The experiment evaluates performance in two distinct demand environments. Depicted in Figure 2, the stable curve simulates an environment where consumer demand fluctuates moderately. The volatile demand curve simulates more dramatic demand fluctuations. Since all service demand (both stable and volatile) is generated using sinusoidal functions, the aggregate demand is itself sinusoidal in nature (the sum of sinusoids of equal frequency is again a sinusoid).
The corresponding results are empirically analysed and presented in both graphical and tabular form. A number of the metrics used to explain the results are not intuitive and require further explanation. Demand represents the total aggregate demand curve and not the demand curve for a single service. Aggregated in this demand curve are all the services present in the system at run-time. Missed requests is the percentage of consumer requests which provider agents were unable to fulfil. For analytical purposes we term the difference between total adaptive supply and the fixed rate supply the efficiency E_f of the approach. It can be calculated as follows:

    E_f = ∫_0^T f(t) dt − ∫_0^T g(t) dt ≈ ∑_{t=0}^{T} (f(t) − g(t))        (4)

where T is the terminal time step, the function f(t) represents the fixed supply and g(t) the aggregate adaptive supply (genetic algorithm or reinforcement learning). This value represents the number of resources freed up during the experiment which can subsequently be reallocated to other processes.
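Since supply levels are recorded per discrete time step, Eq. (4) reduces to a simple sum; the Python sketch below computes E_f and, as an assumption about how the percentages reported in Section 5 are derived, expresses it relative to the total fixed-rate supply.

def efficiency(fixed_supply, adaptive_supply):
    """E_f ~ sum_t (f(t) - g(t)): resources freed relative to a fixed-rate supply.
    Both arguments are per-time-step supply levels of equal length (Eq. 4)."""
    return sum(f - g for f, g in zip(fixed_supply, adaptive_supply))

def efficiency_percentage(fixed_supply, adaptive_supply):
    """Efficiency expressed as a percentage of the total fixed-rate supply
    (an assumed normalisation, used here only for illustration)."""
    return 100.0 * efficiency(fixed_supply, adaptive_supply) / sum(fixed_supply)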
Page 81 of 99
Figure 4: Aggregate Curves for RL and GA. Figure 5: Aggregate Curves for RL and GA. (Aggregate service supply under stable and volatile demand: fixed supply, aggregate supply (GA), aggregate supply (RL) and aggregate demand; quantity vs. time.)
5.1 Efficiency Comparison
This experiment evaluates each algorithm's performance in a substantially sized agent community. The number of agents initially added to the community is 400. For analysis purposes only 4 services are available for selection, meaning provider agents must select a service from this service set. Peak quantity demanded for the entire simulation is restricted to 100 units demanded per service. This restriction is only necessary for evaluation purposes, in order to ground our results within measurable bounds. At times of peak demand for individual services, the maximum demand requires 100 provider agents to completely satisfy consumer requests. In a system where supply is fixed and peak demand is known, it is possible to adequately meet demand by setting the minimum level of provision to 100 provider agents per service. This is depicted by the flat-lined fixed supply curve shown in Figures 4 and 5. In service environments it is unlikely that this value would be known or could be reliably estimated from experience. Therefore over-provisioning of providers may be much higher than the minimum fixed supply used here. In any case we demonstrate the performance of our approaches against this minimum fixed rate supply in terms of efficiency. As detailed above, this paper terms efficiency the percentage difference between the flat rate supply and the adaptive approaches.

5.1.1 Stable Demand
Figure 3 shows the performance for both approaches when averaging over 50 runs. From the graph it is clear that the evolutionary approach slightly outperforms the learning approach in adapting to demand. The genetic algorithm converges faster than reinforcement learning, as it quickly identifies the least fit agents and removes them. Reinforcement learning takes longer to converge, as unsuccessful agents explore their non-greedy actions first before moving into the idle state. The average number of service requests missed by service providers was higher for the genetic algorithm than for reinforcement learning. While marginally missing more service requests, the genetic algorithm had a greater efficiency of 28.75%, as opposed to 23.35% for reinforcement learning. Both approaches performed very strongly at meeting the majority of service requests. The results show that the percentage of missed requests for both approaches is very low, standing at 0.9% for the genetic algorithm, with an even lower percentage value of 0.03% for reinforcement learning. Potentially, if the percentage of missed requests is high, consumer confidence may decline, reducing overall demand. An important consideration is to maintain the balance between increased efficiency and the percentage of missed requests. While increasing efficiency generates greater revenues, if this increase comes at the expense of reduced consumer confidence, this could negatively impact market share in the long run.

5.1.2 Volatile Demand
Similarly to the previous experiment, the results are averaged over 50 runs. Figure 4 shows the supply curves for both the genetic algorithm and reinforcement learning where demand for services exhibits greater volatility. It is evident from the graphs that the genetic algorithm displays greater adaptability in this environment. As shown in Figure 4, the genetic algorithm's supply curve tracks the aggregate demand curve. The efficiency of the genetic algorithm, shown in Table 2, is much higher than that of reinforcement learning. The genetic algorithm maintains similar adaptability in both environments, with a performance of 28.73%. Reinforcement learning, however, does not perform as well in the volatile environment, achieving an efficiency value of 14.48%, a reduction of 9.27% from the stable environment. The principal factor affecting the measure of efficiency in reinforcement learning is the number of agents occupying the idle state. In a volatile environment the distribution of consumer demand is more equitable among the provider agents. This results in fewer agents entering the idle state, as they will have received
Page 82 of 99
returns from service provision. If the learning agent does not enter the idle state then other processes will not be able to obtain a lock on its resources, thus reducing efficiency. The percentage of missed requests is higher for the genetic algorithm, at 1.47%, than for reinforcement learning, at 0.05%. Although this value is still quite low, it might be unacceptable for some critical systems, such as stock or banking services. Missing only 0.05% of requests, reinforcement learning performs extremely well in this regard. It is able to maintain a much more stable supply while still achieving an acceptable level of efficiency.

6. CONCLUSIONS
This paper has presented two separate approaches aimed at tackling agent adaptivity in a service environment. Creating dynamic and adaptive processes has been identified as one of the principal research challenges in service oriented computing [9]. Furthermore, these processes should possess the capability of continually morphing themselves to respond to environmental demands.
Earlier we proposed a research question: where service demand is uncertain, which service offering should a provider agent choose to expose? The work outlined in this paper has demonstrated two approaches which successfully created provider agents capable of reconfiguring their service offering to meet the available demand. We showed for both techniques how the provider agents were capable of deciding which service offering to select in order to maximise available revenues. Both approaches have achieved significant performance gains when compared to a fixed rate supply. While the genetic algorithm demonstrated greater efficiency, the costs involved in evolving an entire population of distributed agents may outweigh the resource saving. Gathering global information from hundreds or possibly thousands of service agents in a distributed environment could outweigh the performance benefits detailed in the results section. In smaller service environments, where resources are limited and the computational overhead required to run the genetic algorithm is small, the genetic algorithm will achieve excellent adaptivity and efficiency.
This paper also defined a formal specification for an emergent service environment, depicting it as a continuous action-state space reinforcement learning problem. Learning through trial and error, the agents quickly identified services that were in demand and adapted their offering to meet them. The efficiency of the learning approach outperformed the fixed rate supply for both demand environments. The service level maintained by the learning agents was superior to that of the evolved agents, demonstrating its reliability for critical systems, where failed requests carry large penalties.

7. ACKNOWLEDGEMENT
The authors would like to gratefully acknowledge the continued support of Science Foundation Ireland.

8. REFERENCES
[1] A. Azadeh, S.F. Ghaderi, S. Tarverdian, and M. Saberi. Integration of artificial neural networks and genetic algorithm to predict electrical energy consumption. Applied Mathematics and Computation, 186(2):1731–1741, 2007.
[2] Rajkumar Buyya, Chee S. Yeo, and Srikumar Venugopal. Market-oriented cloud computing: Vision, hype, and reality for delivering IT services as computing utilities. Aug 2008.
[3] Olcay Ersel Canyurt and Harun Kemal Öztürk. Three different applications of genetic algorithm (GA) search techniques on oil demand estimation. Energy Conversion and Management, 47(18-19):3138–3148, 2006.
[4] D. Gmach, J. Rolia, L. Cherkasova, and A. Kemper. Workload analysis and demand prediction of enterprise data center applications. In Workload Characterization, 2007. IISWC 2007. IEEE 10th International Symposium on, pages 171–180, Sept. 2007.
[5] Mariusz Jacyno, Seth Bullock, Michael Luck, and Terry R. Payne. Emergent service provisioning and demand estimation through self-organizing agent communities. In AAMAS '09: Proceedings of The 8th International Conference on Autonomous Agents and Multiagent Systems, pages 481–488, Richland, SC, 2009. International Foundation for Autonomous Agents and Multiagent Systems.
[6] Yun Fu, Jeffrey Chase, Brent Chun, Stephen Schwab, and Amin Vahdat. SHARP: An architecture for secure resource peering. In Proceedings of the 19th ACM Symposium on Operating System Principles, pages 133–148.
[7] Kevin Lai, Lars Rasmusson, Eytan Adar, Li Zhang, and Bernardo A. Huberman. Tycoon: An implementation of a distributed, market-based resource allocation system. Multiagent Grid Syst., 1(3):169–182, 2005.
[8] Daniel A. Menascé, Emiliano Casalicchio, and Vinod Dubey. On optimal service selection in service oriented architectures. Performance Evaluation, In Press, Corrected Proof, 2009.
[9] M.P. Papazoglou, P. Traverso, S. Dustdar, and F. Leymann. Service-oriented computing: State of the art and research challenges. Computer, 40(11):38–45, Nov. 2007.
[10] Julien Perez, Cécile Germain-Renaud, Balazs Kégl, and Charles Loomis. Grid differentiated services: A reinforcement learning approach. In CCGRID '08: Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid, pages 287–294, Washington, DC, USA, 2008. IEEE Computer Society.
[11] Sebastian Stein, Enrico Gerding, Alex C. Rogers, Kate Larson, and Nicholas R. Jennings. Flexible procurement of services with uncertain durations. In Second International Workshop on Optimisation in Multi-Agent Systems (OptMas), May 2009.
[12] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, 1998.
[13] Gerald Tesauro, Nicholas K. Jong, Rajarshi Das, and Mohamed N. Bennani. On the use of hybrid reinforcement learning for autonomic resource allocation. Cluster Computing, 10(3):287–299, 2007.
[14] David Vengerov. A reinforcement learning approach to dynamic resource allocation. Eng. Appl. Artif. Intell., 20(3):383–390, 2007.
Page 83 of 99
[15] T. Weise, S. Bleul, D. Comes, and K. Geihs. Different
approaches to semantic web service composition. In
Internet and Web Applications and Services, 2008.
ICIW ’08. Third International Conference on, pages
90–96, June 2008.
Page 84 of 99
Proceedings of the AAMAS Workshop on Adaptive and Learning Agents, May 2010, Toronto, Canada
Page 85 of 99
notation and discuss related work. In Sec. 3 we present knowledge transfer using bisimulation metrics, and discuss its theoretical properties. Sec. 4 presents approximants to overcome bisimulation's blindness to a state's value function, while Sec. 5 presents approximations to the bisimulation metrics, designed to speed up the computation. In Sec. 6 we discuss the use of bisimulation with options. Sec. 7 contains empirical illustrations of the proposed algorithms. Finally, in Sec. 8 we conclude and present ideas for future work.

2. BACKGROUND
A Markov decision process (MDP) is a 4-tuple ⟨S, A, P, R⟩, where S is a finite state space, A is a finite set of actions, P : S × A → Dist(S) specifies the next-state transition probabilities (Dist(X) denotes the set of distributions over the set X), and R : S × A → ℝ is the reward function. A policy π : S → A specifies the action choices for each state. The value of a state s ∈ S under a policy π is defined as: V^π(s) = E_π{∑_{t=0}^∞ γ^t r_t | s_0 = s}, where r_t is the reward received at time step t, and γ ∈ (0, 1) is a discount factor. Solving an MDP means finding the optimal value V*(s) = max_π V^π(s), and the associated policy π*. In a finite MDP, there is a unique optimal value function, and at least one deterministic optimal policy. The optimal value function obeys the Bellman optimality equations:

    V*(s) = max_{a∈A} [ R(s, a) + γ ∑_{s′∈S} P(s, a)(s′) V*(s′) ]        (1)

The action-value function, Q* : S × A → ℝ, gives the optimal value for each state-action pair, given that the optimal policy is followed afterwards. It obeys a similar set of optimality equations:

    Q*(s, a) = R(s, a) + γ ∑_{s′∈S} P(s, a)(s′) V*(s′)        (2)

Several types of knowledge can be transferred between MDPs. Existing work includes transferring models (e.g., Sunmola & Wyatt, 2006), using samples obtained by interacting with one MDP to learn a good policy in a different MDP (e.g., Lazaric, Restelli & Bonarini, 2008), transferring values (e.g., Ferrante, Lazaric & Restelli, 2008), or transferring policies. In this paper, we focus on the latter approach, and mention just a few pieces of work most closely related to our approach. The main idea of policy transfer methods is to take policies learned on small tasks and apply them to larger tasks. Sherstov & Stone (2005) show how policies learned previously can be used to restrict the policy space in MDPs with many actions. Taylor et al. (2007) transfer policies, represented as neural network action selectors, from a source to a target task. A hand-coded mapping between the two tasks is used in the process. MDP homomorphisms (Ravindran & Barto, 2002) allow correspondences to be defined between state-action pairs, rather than just states. Follow-up work (e.g., Ravindran & Barto, 2003; Konidaris & Barto, 2007) uses MDP homomorphisms and options to transfer knowledge between MDPs with different state and action spaces. Wolfe & Barto (2006) construct a reduced MDP using options and MDP homomorphisms, and transfer the policy between two states if they both map to the same state in the reduced MDP. Unfortunately, because the work is based on an equivalence relation, rather than a metric, small perturbations in the reward or transition dynamics make the results brittle. Soni & Singh (2006) transfer policies learned in a small domain as options for a larger domain, assuming that a mapping between state variables is given. A closely related idea was presented in (Sorg & Singh, 2009), where the authors use soft homomorphisms to perform transfer and provide theoretical bounds on the loss incurred from the transfer.

3. KNOWLEDGE TRANSFER USING BISIMULATION METRICS
Suppose that we are given two MDPs M1 = ⟨S1, A, P, R⟩, M2 = ⟨S2, A, P, R⟩ with the same action sets, and a metric d : S1 × S2 → ℝ between their state spaces. We define the policy π_d on M2 as

    ∀t ∈ S2.  π_d(t) = π*(arg min_{s∈S1} d(s, t))        (3)

In other words, π_d(t) does what is optimal for the state in S1 that is closest to t according to metric d. Algorithm 1 implements this approach.

Algorithm 1 PolicyTransfer(M1, M2, d)
1: Compute d∼
2: for t ∈ S2 do
3:    s*(t) ← arg min_{s∈S1} d∼(s, t)
4:    π∼(t) ← π*(s*(t))
5: end for
6: return π∼

Note that the mapping between the states of the two MDPs is defined implicitly by the distance metric d. Hence, it is clear that this is an important choice. We will now study the use of bisimulation metrics as a choice for d.
Bisimulation for MDPs was defined by Givan, Dean & Greig (2003) based on the notion of probabilistic bisimulation from process algebra (Larsen & Skou, 1991). Intuitively, bisimilar states have the same long-term behavior.

Definition 1. A relation E ⊆ S × S is said to be a bisimulation relation if whenever sEt:
1. ∀a ∈ A. R(s, a) = R(t, a)
2. ∀a ∈ A. ∀C ∈ S/E. ∑_{s′∈C} P(s, a)(s′) = ∑_{s′∈C} P(t, a)(s′)
where S/E is the set of all equivalence classes in S w.r.t. equivalence relation E. Two states s and t are called bisimilar, denoted s ∼ t, if there exists a bisimulation relation E such that sEt.

Ferns, Panangaden & Precup (2004) defined a bisimulation metric, and proved that it is an appropriate quantitative analogue of bisimulation. The metric is not brittle, like bisimulation: if the transitions or rewards of two bisimilar states are changed slightly, the states will no longer be bisimilar, but they will remain close in the metric. A metric d is a bisimulation metric if for any s, t ∈ S, d(s, t) = 0 ⇔ s ∼ t.
The bisimulation metric is based on the Kantorovich probability metric T_K(d)(P, Q) applied to state probability distributions P and Q, where d is a semimetric on S. It is defined by the following primal linear program (LP):

    max_{u_i, i=1,...,|S|}  ∑_{i=1}^{|S|} (P(s_i) − Q(s_i)) u_i        (4)
    subject to:  ∀i, j.  u_i − u_j ≤ d(s_i, s_j)
                 ∀i.  0 ≤ u_i ≤ 1
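The primal LP of Eq. (4) can be solved directly with an off-the-shelf LP solver; the Python sketch below uses scipy.optimize.linprog, which is an assumption of this illustration (the paper instead solves the equivalent dual as a min-cost flow problem with the CS2 solver).

import numpy as np
from scipy.optimize import linprog

def kantorovich(P, Q, d):
    """T_K(d)(P, Q) via the primal LP of Eq. (4):
    maximise sum_i (P_i - Q_i) u_i  s.t.  u_i - u_j <= d_ij,  0 <= u_i <= 1."""
    n = len(P)
    rows, rhs = [], []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            row = np.zeros(n)
            row[i], row[j] = 1.0, -1.0
            rows.append(row)
            rhs.append(d[i, j])
    res = linprog(c=-(np.asarray(P) - np.asarray(Q)),      # linprog minimises
                  A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=[(0.0, 1.0)] * n, method="highs")
    return -res.fun

# Tiny usage example with an assumed 3-state metric d:
if __name__ == "__main__":
    d = np.array([[0.0, 0.5, 1.0], [0.5, 0.0, 0.5], [1.0, 0.5, 0.0]])
    print(kantorovich([1.0, 0.0, 0.0], [0.0, 0.0, 1.0], d))  # ~1.0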
Page 86 of 99
The following is the equivalent dual formulation:

    min_{l_kj, k=1,...,|S|, j=1,...,|S|}  ∑_{k,j=1}^{|S|} l_kj d(s_k, s_j)        (5)
    subject to:  ∀k.  ∑_{j=1}^{|S|} l_kj = P(s_k)
                 ∀j.  ∑_{k=1}^{|S|} l_kj = Q(s_j)
                 ∀k, j.  l_kj ≥ 0

Intuitively, T_K(d)(P, Q) calculates the cost of "converting" P into Q under d. The dual formulation is a Minimum Cost Flow (MCF) problem, where the network consists of two copies of the state space, a source node and a sink node. The source node is connected to one of the copies of the state space, each node i with supply equal to P(s_i); each node j of the second copy of the state space is connected to the sink node, each with demand equal to Q(s_j). Each "supply" node is connected to every "demand" node, with the cost from supply node i to demand node j being d(s_i, s_j) (see Ferns et al., 2004 for more details).

THEOREM 3.1. (From Ferns et al., 2004) Let M be the set of

PROOF.
    |Q*_2(t, a_t^T) − V*_1(s)| = |Q*_2(t, a_t^T) − Q*_1(s, a_t^T)|
    = | R(t, a_t^T) + γ ∑_{t′∈S_2} P(t, a_t^T)(t′) V*_2(t′) − ( R(s, a_t^T) + γ ∑_{s′∈S_1} P(s, a_t^T)(s′) V*_1(s′) ) |
    ≤ max_{a∈A} { |R(t, a_t^T) − R(s, a_t^T)| + γ | ∑_{t′∈S_2} P(t, a_t^T)(t′) V*_2(t′) − ∑_{s′∈S_1} P(s, a_t^T)(s′) V*_1(s′) | }
    ≤ max_{a∈A} { |R(t, a) − R(s, a)| + γ T_K(d∼)(P(t, a), P(s, a)) }
    = d∼(s, t)
where the second-to-last line follows from the fact that V*_1 and V*_2 together constitute a feasible solution to the primal LP of T_K(d∼), by Lemma 3.2.

We can now use the last lemmas to bound the loss incurred when using the transferred policy.

THEOREM 3.4. For all t ∈ S2 let a_t^T = π∼(t); then |Q*_2(t, a_t^T) − V*_2(t)| ≤ 2 min_{s∈S1} d∼(s, t).
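Once a metric between the two state spaces has been computed, the transfer step itself (Algorithm 1, whose loss Theorem 3.4 bounds) is just a nearest-neighbour lookup; a Python sketch, assuming a precomputed |S1| × |S2| distance matrix and the optimal source policy as inputs.

import numpy as np

def transfer_policy(d, pi_source):
    """Nearest-source-state transfer in the spirit of Algorithm 1: given a
    |S1| x |S2| distance matrix d (e.g. the bisimulation metric d~) and the
    optimal source policy pi_source, each target state copies the action of
    its closest source state."""
    closest = np.argmin(d, axis=0)              # best source state for each t in S2
    return {t: pi_source[closest[t]] for t in range(d.shape[1])}

# Usage with toy data (assumed, for illustration only):
if __name__ == "__main__":
    d = np.array([[0.1, 0.9, 0.4],
                  [0.8, 0.2, 0.5]])             # 2 source states, 3 target states
    pi_source = {0: "left", 1: "right"}
    print(transfer_policy(d, pi_source))        # {0: 'left', 1: 'right', 2: 'left'}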
Page 87 of 99
Taylor, Precup & Panangaden (2009) introduce lax bisimulation metrics, dL. The idea is to have a metric for state-action pairs rather than just for state pairs. Given two MDPs M1 = ⟨S1, A1, P1, R1⟩ and M2 = ⟨S2, A2, P2, R2⟩, for all s ∈ S1, t ∈ S2, a ∈ A1 and b ∈ A2,

    dL((s, a), (t, b)) = |R1(s, a) − R2(t, b)| + γ T_K(dL)(P1(s, a), P2(t, b))

From the distance between state-action pairs we can then define a state lax-bisimulation metric. We use the same symbol dL for the state lax-bisimulation metric, but the arguments will resolve any ambiguity. For all s ∈ S1 and t ∈ S2:

    dL(s, t) = max( max_{a∈A1} min_{b∈A2} dL((s, a), (t, b)),  max_{b∈A2} min_{a∈A1} dL((s, a), (t, b)) )

Now we can define our transferred policy via Algorithm 2.

Algorithm 2 laxBisimTransfer(S1, S2)
1: Compute dL
2: for all t ∈ S2 do
3:    s_t ← arg min_{s∈S1} dL(s, t)
4:    b_t ← arg min_{b∈A2} dL((s_t, π*(s_t)), (t, b))
5:    πL(t) ← b_t
6: end for
7: return πL

Again, we use the same symbol d≈ for the state metric, but the arguments will resolve any ambiguity. For all s ∈ S1 and t ∈ S2,

    d≈(s, t) = max_{b∈A2} d≈(s, (t, b))

We can now use Algorithm 2 again, but with the new metric d≈, to obtain the transferred policy π≈. In other words, π≈(t) finds the closest state s ∈ S1 to t under d≈ and then chooses the action b from t that is closest to π*(s).
We can then prove the following results.

LEMMA 4.1. For all s ∈ S1 and t ∈ S2, |V*_1(s) − V*_2(t)| ≤ d≈(s, t).

PROOF.
    |V*_1(s) − V*_2(t)| = |Q*_1(s, a*_s) − Q*_2(t, a*_t)|
    ≤ |R(s, a*_s) − R(t, a*_t)| + γ | ∑_{s′∈S_1} P(s, a*_s)(s′) V*_1(s′) − ∑_{t′∈S_2} P(t, a*_t)(t′) V*_2(t′) |
    ≤ |R(s, a*_s) − R(t, a*_t)| + γ T_K(d≈)(P(s, a*_s), P(t, a*_t))        (by induction)
    = d≈((s, a*_s), (t, a*_t))
    ≤ max_{b∈A2} d≈((s, a*_s), (t, b))
    = d≈(s, t)
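A Python sketch of the lax transfer of Algorithm 2, assuming the metric on state-action pairs has already been computed and is stored as a 4-dimensional array d_sa[s, a, t, b] with integer-indexed actions (an illustrative data layout, not the authors' implementation).

import numpy as np

def lax_bisim_transfer(d_sa, pi_source):
    """Transfer in the spirit of Algorithm 2. d_sa[s, a, t, b] is the lax metric on
    state-action pairs; pi_source maps each source state to its optimal action index.
    The induced state metric is
    dL(s,t) = max(max_a min_b d_sa[s,a,t,b], max_b min_a d_sa[s,a,t,b])."""
    d_state = np.maximum(d_sa.min(axis=3).max(axis=1),    # max_a min_b
                         d_sa.min(axis=1).max(axis=2))    # max_b min_a
    policy = {}
    for t in range(d_sa.shape[2]):
        s_t = int(np.argmin(d_state[:, t]))               # closest source state
        policy[t] = int(np.argmin(d_sa[s_t, pi_source[s_t], t, :]))  # action closest to pi*(s_t)
    return policy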
Page 88 of 99
With this result we now have a lower bound on the value of the action transferred which takes into consideration the value function in the source system. This suggests Algorithm 3 to obtain the transferred policy πPess. In other words, πPess(t) uses the source state with the highest guaranteed lower bound on the value of its optimal action. This clearly overcomes the problem of the last example.

Algorithm 3 pessTransfer(S1, S2)
1: Compute d≈
2: for all t ∈ S2 do
3:    for all s ∈ S1 do
4:       LB(s, t) ← V*_1(s) − d≈(s, t)
5:    end for
6:    s_t ← arg max_{s∈S1} LB(s, t)
7:    b_t ← arg min_{b∈A2} d≈(s_t, (t, b))
8:    πPess(t) ← b_t
9: end for
10: return πPess

4.2 Optimistic approach
The idea of the pessimistic approach is appealing as it uses the value function of the source system as well as the metric to guide the transfer. However, there is still an underlying problem in all of the previous algorithms. This is in fact a problem with bisimulation when used for transfer. The problem is an inherent "pessimism" in bisimulation: we always consider the action that maximizes the distance between two states. This pessimism is what equips bisimulation with all the mathematical guarantees, since we are usually "upper-bounding". However, one may (not so infrequently) encounter situations where this pessimism produces a poor transfer. For instance, assume there is a source state s whose optimal action can be transferred with almost no loss as action b in a target state t (i.e. d≈(s, (t, b)) is almost 0); however, assume there is another action c in t such that d≈(s, (t, c)) is very large. This large distance may disqualify state s as a transfer candidate for state t, when it may very well be the best choice! The inherent pessimism of bisimulation would have overlooked this ideal transfer. If we had taken a more "optimistic" approach, then we would have ignored d≈(s, (t, c)) and focused on d≈(s, (t, b)). This idea motivates the main algorithmic contribution of the paper.
We start by defining a new metric, dOpt(s, t) = min_{b∈A2} d≈((s, a*_s), (t, b)), and use Algorithm 3, but with dOpt instead of d≈ in the computation of LB(s, t), to obtain our transferred policy πOpt. In other words, πOpt(t) chooses the action with the highest optimistic lower bound on the value of the action. By removing the pessimism we lose our theoretical properties, so we can no longer say that this lower bound is guaranteed. However, intuition tells us that this should be a better method to guide the transfer. Indeed, we shall see in Section 7 that this method outperforms all the rest.

5. SPEEDING UP THE COMPUTATION
As was mentioned previously, the long computation time of these methods is mainly due to the fact that each iteration of the Kantorovich metric computation (see Theorem 3.1) requires solving |S1| × |S2| × |A2| MCF problems, which is very expensive. In the first approximation we propose to solve T_K(d) only once, with d(s, t) = V*_1(s) − max_{b∈A2} R(t, b) for all s ∈ S1 and t ∈ S2. The intuition behind this rough distance estimate is that we want the target state to try to match the optimal value of the source state. Since the optimal value function for the target system is not known, we use the immediate reward as a myopic estimate.
The second approximant still will not scale very well to large problems, due to the MCF computations. As the number of states increases, the number of variables and constraints of each MCF problem also increases. In this second approximation we fix a number of clusters k, which will split our reward region (i.e. the interval from the minimum reward to the maximum reward) into k regions. For each s ∈ S1 we choose what cluster it belongs to by checking in which of the k reward regions R(s, π*(s)) falls; similarly, for each t ∈ S2 we choose what cluster it belongs to by checking in which of the k reward regions max_{b∈A2} R(t, b) falls. Having thus reduced the state space into k clusters, when we compute T_K(d)(P(s, a), P(t, b)) we are no longer looking at transition probabilities into individual states, but rather transition probabilities into one of the k clusters. By doing so we have put an upper limit on the number of variables and constraints of each MCF. If the reward structure in the domains in question is relatively sparse, we can get away with a small k. Finally, we only iterate the Kantorovich metric computation once, setting d(s, t) = |R(s, π*(s)) − max_{b∈A2} R(t, b)| for all s ∈ S1 and t ∈ S2.

6. BISIMULATION FOR OPTIONS
An option o is a triple ⟨I_o, π_o, β_o⟩, where I_o ⊆ S is the set of states where the option is enabled, π_o : S → Dist(A) is the policy for the option, and β_o : S → [0, 1] is the probability of the option terminating at each state (Sutton, Precup & Singh, 1999). Options are temporally abstract actions and generalize one-step primitive actions. Given that an option o is started at state s, we can define Pr(s′|s, o) as the discounted probability of ending in state s′ given that we started in state s and followed option o. We can also define the expected reward received throughout the lifetime of an option as R(s, o) (see Sutton, Precup & Singh, 1999 for details). Based on the above definition, we introduce bisimulation for MDPs with options.

Definition 2. A relation E ⊆ S × S is said to be an option-bisimulation relation if whenever sEt:
1. ∀o ∈ OPT. R(s, o) = R(t, o)
2. ∀o ∈ OPT. ∀C ∈ S/E. ∑_{s′∈C} Pr(s′|s, o) = ∑_{s′∈C} Pr(s′|t, o)
Two states s and t are said to be option-bisimilar if there exists an option-bisimulation relation E such that sEt. Let s ∼O t denote the maximal option-bisimulation relation.

Similarly, a metric d is an option-bisimulation metric if for any s, t ∈ S, d(s, t) = 0 ⇔ s ∼O t.
Ferns et al. (2004) pass the next-state transition probabilities into T_K(d). However, in our case we will pass in Pr(·|s, o) for s ∈ S and o ∈ OPT, which is a subprobability distribution. To account for this, we add two dummy nodes in the dual formulation of the Kantorovich metric above, which absorb any leftover probability mass. These dummy nodes are still connected as the other nodes, but with a cost of 1 (see Van Breugel & Worrell, 2001 for more details).
Option-bisimulation metrics are very similar to the usual bisimulation metrics in terms of properties, as can be seen from the following theorem:

THEOREM 6.1. Let

    F(d)(s, t) = max_{o∈OPT} ( |R(s, o) − R(t, o)| + γ T_K(d)(Pr(·|s, o), Pr(·|t, o)) )

Then F has a least fixed point d∼, and d∼ is an option-bisimulation metric.
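The pessimistic transfer of Algorithm 3 and its optimistic variant differ only in how the state distance is obtained from d≈(s, (t, b)); the Python sketch below assumes a precomputed array d_stb[s, t, b] in which the source side already uses the optimal source action a*_s, an assumption made purely for this illustration.

import numpy as np

def bound_based_transfer(d_stb, v_source, optimistic=False):
    """Sketch of Algorithm 3 (pessimistic) and its optimistic variant.
    d_stb[s, t, b] approximates d~=(s, (t, b)); v_source[s] = V1*(s).
    LB(s, t) = V1*(s) - d(s, t), with d = max_b (pessimistic) or min_b (optimistic)."""
    d_state = d_stb.min(axis=2) if optimistic else d_stb.max(axis=2)
    lb = v_source[:, None] - d_state                 # shape |S1| x |S2|
    policy = {}
    for t in range(d_stb.shape[1]):
        s_t = int(np.argmax(lb[:, t]))               # source state with best lower bound
        policy[t] = int(np.argmin(d_stb[s_t, t, :])) # closest action in the target state
    return policy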
Page 89 of 99
Table 1: Running times (in seconds) and ‖V2* − V2T‖∞

                First instance (4to4)   Second instance (4to4)  Second instance (4to3)  Second instance (3to4)
Algorithm       Time      ‖V2*−V2T‖∞    Time      ‖V2*−V2T‖∞    Time      ‖V2*−V2T‖∞    Time      ‖V2*−V2T‖∞
Bisim           15.565    0.952872      -         -             -         -             -         -
Lax-Bisim       66.167    0.847645      128.135   0.749583      67.115    0.749583      100.394   0.749583
Pessimistic     25.082    0.954625      47.723    0.904437      24.125    0.875533      37.048    0.904437
Optimistic      23.053    0.335348      47.710    0.327911      24.649    0.360802      39.672    0.002052
Approximant1    0.725     0.744038      1.484     0.744036      0.820     0.721627      1.214     0.744036
Approximant2    0.443     0.744038      0.949     0.632880      0.556     0.532724      0.762     0.632880
The proof is almost identical to that of Theorem 4.5 in (Ferns et al., 2004), so we omit it for succinctness. As shown there, d∼ can be approximated to a desired accuracy δ by applying F for ⌈ln δ / ln γ⌉ steps.
All the results presented for the four algorithms carry over easily to the option-bisimulation case. Their proofs will be omitted for succinctness.

7. EXPERIMENTAL RESULTS
To illustrate the performance of the various policy transfer algorithms, we used the grid world navigation task of (Sutton, Precup & Singh, 1999), consisting of four rooms in a square (a room in each corner) connected by four hallways, one between each pair of rooms. There are four primitive actions: ∧ (up), ∨ (down), < (left) and > (right). When one of the actions is chosen, the agent moves in the desired direction with 0.9 probability, and with 0.1 probability uniformly moves in one of the other three directions or stays in the same place. Whenever a move would take the agent into a wall, the agent remains in the same position.
There are four global options available in every state, analogous to the ∧, ∨, < and > primitive actions. We will refer to them as u, d, l and r, respectively. If an agent chooses option u, then the option will take it to the hallway above its position. If there is no hallway in that direction, then the option will take the agent to the middle of the upper wall. The option terminates as soon as the agent reaches the respective hallway or position along the wall. All other options are similar. There is a single goal placed in one of the hallways, yielding a reward of 1. Everywhere else the agent receives a reward of 0.
The above topology for the rooms can be instantiated with different numbers of cells. We started with a tiny instance, where there are only 8 states: one for each of the rooms, and one for each of the hallways, with the goal in the rightmost hallway (Figure 1).

Figure 1: Tiny instance with the optimal policy. Red state is the goal state.

This tiny domain only has 4 options, which are simply the primitive actions. The various metrics (d∼, dL, d≈, and dOpt) were computed between the tiny instance and each of the larger instances, using a desired accuracy of δ = 0.01, and then the policy transfer algorithms were applied. We also used the two approximants on the optimistic algorithm. For all experiments we used the CS2 algorithm for the MCF problems (Frangioni & Manca, 2006) and a discount factor of γ = 0.9. In the second approximant we set the number of clusters to 8 (note that we are looking at option reward regions, rather than primitive reward regions).
We used a domain with 44 states as the large domain. We varied the number and type of options available in the larger domain. In the first instance the target domain only had primitive actions as options: ∧, ∨, < and >. In the second instance, the domain was equipped with 8 options: {∧, ∨, <, >, u, d, l, r}. Clearly the original bisimulation metric approach could not be run, because of the difference in number of options. We also ran experiments where the target system only has 3 rooms (the bottom right room was removed) but the source still has 4 rooms, and where the source system only has 3 rooms (the bottom right room was removed) and the target system has 4 rooms. Table 1 displays the running times of the various algorithms, as well as ‖V2* − V2T‖∞. In Figure 2 we display the transferred policies when using lax-bisim, the pessimistic and the optimistic approach. The colors indicate which state in the tiny instance was used for the transfer. In Figures 3 and 4 we compare the performance of the various algorithms when used to speed up learning. Standard Q-learning was performed, but the agent was biased towards the transferred policy. In other words, if the agent did not have another option with a better Q-value than the current Q-value estimate for the transferred policy, it would choose the transferred policy. These results clearly demonstrate the superior performance of the optimistic approach.
In Table 2 we examine the performance of the second approximant compared to that of the first as we scale the sizes of the target systems. The first approximant was only able to solve the problem with 104 states, at which point it ran out of memory. We can see that we still obtain reasonable results with the second approximation, even as the number of states gets larger.

8. CONCLUSIONS AND FUTURE WORK
In this paper we presented six new algorithms for performing policy transfer on MDPs that were based on bisimulation metrics. We started off with algorithms that had very strong theoretical results but poor empirical performance. Using these initial algorithms as inspiration, we defined new algorithms that traded some of the theoretical guarantees for improved performance. Finally, we presented two approximation algorithms to overcome the computational overhead of bisimulation metrics. The second of these was shown to scale very well to very large problems. We presented empirical evidence of the suitability of our algorithms for speeding up learning.
Our algorithms would also be very useful if we had a model distribution from which problems were sampled and we wanted to avoid solving the value function for each sampled model. This situation is commonly encountered in Bayesian RL, where a Dirichlet distribution over models is maintained and updated with each transition. Most algorithms sample a number of models from the Dirichlet distribution and solve the value function for each in order to make the next action choice. We could use our algorithms to transfer the policy from the small source to just one of the target systems (the mean model, for instance), and use that policy for all the other samples. It would be useful to obtain empirical evidence to justify these claims, as well as theoretical bounds on the loss of
Page 90 of 99
Figure 2: Transferred policies in second instance. Left: lax-bisim, middle: pessimistic, right: optimistic. (Each panel shows the transferred action, u/d/l/r, for every state.)
Figure 3: Cumulative reward vs. time step in the first instance (left) and the second instance (right) for: no transfer, Bisim, Lax-bisim, Pessimistic, Optimistic, Approximant 1 and Approximant 2.
Page 91 of 99
(Cumulative reward vs. time step for the 3 rooms (target) and 3 rooms (source) second-instance experiments; curves: no transfer, Lax-bisim, Pessimistic, Optimistic, Approximant 1, Approximant 2.)
Figure 4: Comparison of performance of transfer algorithms (left: 4 rooms to 3 rooms, right: 3 rooms to 4 rooms)
Alessandro Lazaric, Marcello Restelli, and Andrea Bonarini. Transfer of samples in batch reinforcement learning. In ICML, pages 544–551, 2008.
Theodore J. Perkins and Doina Precup. Using options for knowledge transfer in reinforcement learning. Technical Report UM-CS-1999-034, University of Massachusetts, Amherst, 1999.
Caitlin Phillips. Knowledge transfer in Markov Decision Processes. Technical report, McGill University, 2006.
Martin L. Puterman. Markov Decision Processes. John Wiley & Sons, New York, NY, 1994.
Balaraman Ravindran and Andrew G. Barto. Model minimization in hierarchical reinforcement learning. In Fifth Symposium on Abstraction, Reformulation and Approximation, 2002.
Balaraman Ravindran and Andrew G. Barto. Relativized options: Choosing the right transformation. In Proceedings of the 20th International Conference on Machine Learning, 2003.
Alexander A. Sherstov and Peter Stone. Improving action selection in MDP's via knowledge transfer. In Proceedings of the 20th National Conference on Artificial Intelligence, 2005.
Vishal Soni and Satinder Singh. Using homomorphism to transfer options across reinforcement learning domains. In Proceedings of the National Conference on Artificial Intelligence (AAAI-06), 2006.
Jonathan Sorg and Satinder Singh. Transfer via soft homomorphisms. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems (AAMAS-2009), 2009.
Funlade T. Sunmola and Jeremy L. Wyatt. Model transfer for Markov decision tasks via parameter matching. In Proceedings of the 25th Workshop of the UK Planning and Scheduling Special Interest Group (Plan-SIG 2006), 2006.
Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.
Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10:1633–1685, 2009.
Matthew E. Taylor, Peter Stone, and Yaxin Liu. Transfer learning via inter-task mappings for temporal difference learning. Journal of Machine Learning Research, 8:2125–2167, 2007.
Jonathan Taylor, Doina Precup, and Prakash Panangaden. Bounding performance loss in approximate MDP homomorphisms. In Advances in Neural Information Processing Systems 21, in press, 2009.
Franck van Breugel and James Worrell. An algorithm for quantitative verification of probabilistic transition systems. In Proceedings of the 12th International Conference on Concurrency Theory (CONCUR), pages 336–350, 2001.
Alicia Peregrin Wolfe and Andrew G. Barto. Defining object types and options using MDP homomorphisms. In Proceedings of the ICML-06 Workshop on Structural Knowledge Transfer for Machine Learning, 2006.
Page 92 of 99
Proceedings of the AAMAS Workshop on Adaptive and Learning Agents, May 2010, Toronto, Canada
ABSTRACT
This paper examines the evolution of agent strategies in a commons dilemma using a tag interaction model. Through the use of a tag-mediated interaction model, individuals can determine their interactions based on their tag similarity. The simulations presented show the significance and benefits of agents that contribute to the commons. A series of experiments examine the importance of the tag space in an n-player dilemma. The paper shows the emergence of cooperation through tag-mediated interactions in the n-player games. Simulation results show the evolution of strategies that contribute heavily to the value of the shared commons.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Artificial Intelligence - Multiagent systems

General Terms
Multi-Agent Cooperation, Learning and Evolution, Tag Mediated Interactions

Keywords
Evolution, Learning, Tag Mediated Interactions, Cooperation

1. INTRODUCTION
When a common resource is being shared among a number of individuals, each individual benefits most by using as much of the resource as possible. While this is the individually rational choice, it results in collective irrationality and a non Pareto-optimal result for all participants. These n-player dilemmas are common throughout many real world scenarios. For example, the computing community is particularly concerned with how finite resources can be used most efficiently where conflicting and potentially selfish demands are placed on those resources. Those resources may range from processor time to bandwidth.
One example commonly used throughout existing research is the Tragedy of the Commons [7]. This outlines a scenario whereby villagers are allowed to graze their cows on the village green. This common resource will be overgrazed and lost to everyone if the villagers allow all their cows to graze, yet if everyone limits their use of the village green, it will continue to be useful to all villagers. Another example is the Diner's Dilemma, where a group of people in a restaurant agree to split their bill equally. Each has the choice to exploit the situation and order the most expensive items on the menu. If all members of the group apply this strategy, then all participants will end up paying more [5].
These games are all classified as n-player dilemmas, as they involve multiple participants interacting as a group. N-player dilemmas have been shown to result in widespread defection unless agent interactions are structured. This is most commonly achieved through using spatial constraints which limit agent interactions to specified neighbourhoods on a spatial grid. Limiting group size has been shown to benefit cooperation in these n-player dilemmas [24]. Agent interaction models such as spatial constraints, social networks and tags offer a basis for agents to determine their peer interactions and the subsequent emergence of cooperation. This paper examines a series of simulations involving a tag-mediated interaction environment. Tags are visible markings or social cues which serve to bias agent interactions based on their similarity [9].
In this paper we will examine an n-player dilemma, and study the evolution of strategies when individuals can contribute some of their payoffs towards the value of the commons. The theory that commitment changes the incentives of players is a familiar principle in economics. Applications of these principles include bargaining [19], monetary policy [17], industrial organisation [4], and strategic trade policy [2]. The simulations presented in this paper use the well known n-player Prisoner's Dilemma (NPD). Agents bias their interactions through a tag mediated environment. The results show the evolution of widespread contributing strategies throughout the population, despite this being a suboptimal strategy.
This paper examines the impact of the tag space and its effects on the emergence of cooperation in the n-player dilemma. In this context the experiments will show the effects of investment strategies on the emergence of cooperation. The research presented in this paper will address three specific research questions:

1. What is the impact of tag space on the emergence of cooperation in an n-player dilemma?
2. Will agent strategies evolve investment properties in a commons dilemma?
3. How do investment strategies impact on the emergence of cooperation in a commons dilemma?

The following section of this paper will provide an introduction to the NPD and a number of well known agent interaction models. The topics of tag-mediated interactions and
the n-player Prisoner's Dilemma will be discussed in detail. In the experimental setup section we will discuss our simulator.

2. RELATED RESEARCH

[Figure: Tragedy of the Commons]
3. SIMULATOR DESIGN

In this section we outline the overall design of our simulator. Firstly, we outline the agent genome and how it influences agent behaviour. We then describe our tag-mediated agent interaction model. This paper examines agent learning through evolution, and as a result we use a genetic algorithm; this algorithm and its parameter settings are also outlined in this section.

3.1 Agent Genome

In our model each agent is represented through an agent genome. This genome holds a number of genes which represent how that particular agent behaves.

Genome = (G_C, G_T, G_I)    (2)

The G_C gene represents the probability of an agent cooperating in a particular move. The G_T gene represents the agent tag. This is represented in the range [0...1] and is used to determine which games each agent participates in. Finally, G_I represents an individual's willingness to contribute to the commons. This is again represented on a scale to indicate that some individuals may choose to invest more than others.

For the first generation these agent genes are generated using a uniform distribution. Over subsequent generations new agent genomes are generated using our genetic algorithm. Each of these genes is an evolved attribute and is fixed for that individual's lifetime; therefore changes in the population only occur through new offspring which have evolved genetic traits.

3.3 Agent Interactions

In our simulations each agent interacts through a fixed bias tag-mediated interaction model. We adopt a similar tag implementation to that outlined by Riolo [15] and more recently by [10]. In our model each agent has a G_T gene which is used as its tag value. Each agent A is given the opportunity to make game offers to all other agents in the population. The intention is that agent A will host a game, and the probability that other agents will participate is determined using the following formulation:

d_{A,B} = 1 − |A_{GT} − B_{GT}|    (4)

This equation is based on the absolute difference between the tag values of two agents A and B. This value is used to generate two roulette wheels, R_{ab} and R_{ba}, for A and B. These two roulette wheels are then used to determine agent A's attitude to B and agent B's attitude to A. An agent B will only participate in the game when both roulette wheels have indicated acceptance.
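As a rough illustration of how the genome of Equation 2 and the participation rule of Equation 4 could be realised, the following Python sketch may help. It is not taken from the authors' simulator: the names (Agent, joins_game), the uniform initialisation helper, and the treatment of the two roulette wheels as two independent draws against d_{A,B} are assumptions, and the bias threshold of Riolo's fixed-bias model is omitted.

```python
import random
from dataclasses import dataclass


@dataclass
class Agent:
    """Genome of Equation 2: all three genes lie in the range [0, 1]."""
    g_c: float  # G_C: probability of cooperating in a particular move
    g_t: float  # G_T: tag value, biases which games the agent joins
    g_i: float  # G_I: willingness to invest part of the payoff in the commons


def random_agent(rng: random.Random) -> Agent:
    # First-generation genes are drawn from a uniform distribution.
    return Agent(rng.random(), rng.random(), rng.random())


def participation_probability(a: Agent, b: Agent) -> float:
    """Equation 4: d_{A,B} = 1 - |A_GT - B_GT|."""
    return 1.0 - abs(a.g_t - b.g_t)


def joins_game(host: Agent, other: Agent, rng: random.Random) -> bool:
    """The two roulette wheels R_ab and R_ba are modelled here as two
    independent draws against d_{A,B}; both must indicate acceptance."""
    p = participation_probability(host, other)
    return rng.random() < p and rng.random() < p
```

In a full simulation each agent would in turn act as host and offer a game to every other agent in the population, with a check such as joins_game deciding which agents form the resulting playing group.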
Each parent has a set of three genes: G_C, G_T and G_I. A probability of 0.9 is applied in favor of selecting two random genes from the fittest parent, and one gene from the other parent. Each gene is exposed to a 2% chance of mutation. When applied to a gene, the mutation operator changes it through a displacement chosen from a Gaussian distribution with a mean of 0 and a standard deviation of 0.5. Since tag values are considered arbitrary, the tag space is viewed as circular. Therefore, when adding and subtracting displacements on the G_T gene, values over 1.0 wrap around and are calculated upwards from 0, while values under 0 are calculated downwards from 1.0. For the two remaining genes the actual value is significant, so displacements resulting in values above 1.0 are set to 1.0, while those below 0 are set to 0.
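The crossover and mutation operators just described could be sketched as follows. This is only an illustration under stated assumptions: the genes are ordered [G_C, G_T, G_I], and the 0.9 is read as the probability that the fitter parent donates the two genes; it is not the authors' implementation.

```python
import random

P_FITTEST = 0.9   # probability that the fitter parent donates two of the three genes
P_MUTATE = 0.02   # per-gene mutation chance
SIGMA = 0.5       # standard deviation of the Gaussian displacement


def crossover(fitter: list[float], other: list[float], rng: random.Random) -> list[float]:
    """Child receives two random genes from one parent and the remaining gene
    from the other; with probability 0.9 the fitter parent supplies the two."""
    two, one = (fitter, other) if rng.random() < P_FITTEST else (other, fitter)
    picked = set(rng.sample(range(3), 2))
    return [two[i] if i in picked else one[i] for i in range(3)]


def mutate(genome: list[float], rng: random.Random) -> list[float]:
    """Gaussian displacement per gene: the tag gene (index 1) wraps around the
    circular tag space, while G_C and G_I are clamped to [0, 1]."""
    g_c, g_t, g_i = genome
    if rng.random() < P_MUTATE:
        g_c = min(1.0, max(0.0, g_c + rng.gauss(0.0, SIGMA)))
    if rng.random() < P_MUTATE:
        g_t = (g_t + rng.gauss(0.0, SIGMA)) % 1.0  # circular: 1.1 -> 0.1, -0.1 -> 0.9
    if rng.random() < P_MUTATE:
        g_i = min(1.0, max(0.0, g_i + rng.gauss(0.0, SIGMA)))
    return [g_c, g_t, g_i]
```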
[Figure 3: average cooperation in the standard commons dilemma over 1000 generations, for populations of 10, 25, 50, 75 and 100 agents]

Figure 3 shows the standard commons dilemma using a number of alternative agent population sizes. This has the effect of altering the size of the tag space in the model. As shown through a similar experiment by Howley et al., the size of the tag space has a significant effect on the emergence of cooperation [11]. Where there are only a small number of agents in the population, cooperators can avoid exploiters much more easily due to the partitioning effects of the tag environment. However, in larger populations this is much more difficult and exploiters are much more likely to be present in an n-player interaction. Therefore cooperation does not evolve in larger populations, where the tag space is undermined. The probability of being exploited is higher with each peer interaction an individual participates in; therefore in larger populations it is generally considered more difficult to establish and maintain cooperative interactions and avoid exploitation.

4.2 N-Player Dilemma with Investment

In this section we examine the extended version of the n-player dilemma. All experimental parameters are held over from the previous experiment.

[Figure 5: Average Investment Gene (N-Player Dilemma with Investments)]

The data shown in Figure 5 shows how the G_I gene evolved in various population sizes. The results show that high investment strategies evolved in almost all cases except for the smallest populations. The evolutionary pressure to evolve such a strategy stems from the increased fitness of those individuals who were part of highly investing, cooperative groups. This combination facilitates many high value
games, which help offset exploitation from rogue strategies on the periphery of the cluster. Once established, this trait will increase throughout the population with the help of the tag mechanism, which promotes homogeneity. As discussed by many previous authors, individuals who share tag values are likely to share many other genetic traits due to their shared ancestry [13].

[Figure 6: Average Commons Value over 1000 generations, for populations of 10, 25, 50, 75 and 100 agents]

The data shown in Figure 6 shows the average value of games in the commons dilemma with investment. These payoffs represent the entire value of the commons which each of the players then had to play for. As would be expected, these payoff values correlate strongly with the number of agents in the population, since a larger population increases the likely number of individuals participating in and contributing to the n-player games. The most significant aspect of this data is the scale of the payoffs for the smallest populations. The influence of the investment extension appears to be smallest in these populations, and this appears to be simply because there were not enough participants contributing for the extension to have a major effect on the population.

To investigate the issue of game participation in more detail, we now examine the average numbers of participants in each of the models.

[Figure 7: Average Number of Participants over 1000 generations, for populations of 10, 25, 50, 75 and 100 agents]

Game participation levels correlate with the population sizes. Increased numbers of participants in an n-player game make the chances of maintaining high levels of cooperation less likely. Exploitation is much more likely to occur in a game with many participants, and therefore undermines any cooperation that may have been established previously. This data correlates with the levels of cooperation we identified previously in Figure 3. The model with the most participants was the least cooperative, while the model with the fewest game participants was the most cooperative.

[Figure 8: Average Participants (N-Player Dilemma with Investments)]

The data shown in Figure 8 mirrors the previous simulations shown in Figure 7. Here we identify many similar features, yet we also notice that the levels of game participation are lower in the extended game environment. This is a result of the extended game encouraging more tag diversity throughout the population. This has the effect of reducing the size of tag groups, and game participation is therefore lower than in the traditional game. This feature only becomes apparent in larger populations, and therefore it is in the largest population that we identify the most significant changes with respect to game participation and the strategies evolved.

4.3 Tag Space Evolution

In this section we will examine the evolution of the tag space in both game environments. In this case we use two populations of 100 agents, and 100 experimental runs.

[Figure 9: Average Proportion of Unique Tag Values over generations, Classic N-Player Dilemma vs. N-Player Dilemma with Investments]
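Before turning to the results, a minimal sketch of how the proportion of unique tag values plotted in Figure 9 (and compared in Table 1 below) could be measured and tested is given here; the rounding resolution used to decide when two real-valued tags count as identical is an assumption, since the paper does not state it.

```python
import statistics
from scipy import stats  # two-tailed t-test, as used for the Table 1 comparison


def unique_tag_proportion(tags: list[float], decimals: int = 6) -> float:
    """Distinct tag values divided by the maximum possible number of
    distinct tags (one per agent); the rounding resolution is an assumption."""
    return len({round(t, decimals) for t in tags}) / len(tags)


def compare_models(classic_runs: list[float], investment_runs: list[float]):
    """Per-run average proportions for the two game environments,
    compared with a two-tailed independent-samples t-test."""
    res = stats.ttest_ind(classic_runs, investment_runs)
    return statistics.mean(classic_runs), statistics.mean(investment_runs), res.pvalue
```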
Figure 9 shows the evolution of tags over time through successive generations. The proportion of unique tag values is calculated relative to the maximum number of possible tag values, and is then recorded and averaged over many generations and experiments. In the initial generations there is a very large number of tag values; however, over time the number of unique tag values falls and converges to a relatively small number of tag groups. This is an indication of the high levels of mimicking that occur due to tags. However, we observe higher numbers of unique tag values in the n-player commons with investment. This indicates more tag groups and a higher degree of tag diversity throughout the population. The n-player dilemma with investment helps clusters of cooperators to emerge and avoid being exploited by invaders. This happens because potential defectors must have a reserve fund available to contribute in line with their strategy, or else they cannot participate in the game. Furthermore, high value games have the added benefit of spreading wealth among a group of individuals with similar tag values, which has the effect of promoting that group's strategy traits throughout the population in the following generation. This inevitably encourages cooperation and contribution strategies, as the group can only achieve high levels of fitness if it is composed of high cooperators and investors.

Table 1: Average Unique Tag Values

    Model                                µ         σ
    Classic N-Player Dilemma             6.31%     0.52%
    N-Player Dilemma with Investments    12.698%   0.87%

As indicated in Figure 9, and also through the data presented in Table 1, the differences between the two models are significant. The data shown in Table 1 is recorded from 100 experimental runs using populations of 100 agents. The differences between the levels of unique tags recorded in each model were found to be statistically significant when examined using a two-tailed t-test with a 95% confidence interval. These differences reinforce our observations earlier in the paper regarding the contrasting levels of cooperation recorded in the models.

5. CONCLUSIONS

In this paper we have examined a number of issues. Firstly, we proposed an adaptation of the classic n-player commons dilemma to include agent investments into the commons. This was achieved while maintaining the essential characteristics of the game. Secondly, we examined the tag space and its significance with respect to the emergence of cooperation in the n-player PD. The study of n-player games using tags is a recent area of research, and the significance of the tag space is an important consideration in that study. The results presented in this paper show that the relationship between the tag space and the population size is vitally important in the n-player commons dilemma. Finally, this paper has outlined a series of experiments showing the significant impact of individuals investing in the commons. Importantly, we have learned that this has the effect of encouraging cooperation when sufficient numbers of individuals are participating in the n-player games. While quite unstable for small populations, this new adapted game offers a means of engendering cooperation in larger populations, where cooperation is more difficult to achieve. There are many alternative means of encouraging cooperation in these scenarios, but in this case we wanted to show the benefits of individuals investing in their shared resource.

Earlier in this paper we posed a number of research questions. We will now refer to these through the following subheadings.

Tag Space: The tag space was found to be very significant in the emergence of cooperation in the classic n-player commons dilemma. Clustering was much more difficult to achieve in larger populations when compared to smaller populations.

Investment Strategies: Throughout our simulations we identified the emergence of agents with high investment genes. This was promoted through the clustering of the tag environment and the mimicry that it encourages.

Emergence of Cooperation: The emergence of investment strategies and the emergence of cooperation in large populations are very closely linked. The fitness of certain tag clusters is assisted by the investment mechanism, as this helps avoid exploitation. Furthermore, the increased payoffs help sustain and promote the evolution of a particular tag cluster. This has the effect of promoting that tag's associated strategy characteristics.

Through addressing these research questions, we have provided a clear picture of the importance of the tag space with respect to the n-player commons dilemma. Importantly, we have also shown how investment strategies can result in the promotion of cooperative traits in otherwise difficult conditions. This highlights the potential benefits of studying extensions to these well known games.

6. SUMMARY AND FUTURE RESEARCH

In this paper we have presented a novel extension to the n-player commons dilemma and shown the significant effects this extension has on the emergence of cooperation. Furthermore, this paper has outlined a number of significant experimental results which show the evolution of cooperation with respect to agent investment and cooperation choices. In future research we would like to study the effects of investment mechanisms on the conservation and more efficient utilisation of common resources, such as fossil fuels, water and computing resources.

7. ACKNOWLEDGMENTS

The authors would like to gratefully acknowledge the continued support of Science Foundation Ireland.

8. REFERENCES

[1] S. W. Benard. Adaptation and network structure in social dilemmas. Paper presented at the annual meeting of the American Sociological Association, Atlanta Hilton Hotel, Atlanta, GA, 2003.
[2] J. A. Brander and B. J. Spencer. Export subsidies and international market share rivalry. NBER Working Papers 1464, National Bureau of Economic Research, Inc., Sept. 1984.
[3] J. Coleman. Foundations of Social Theory. Belknap Press, August 1998.
[4] C. Fershtman and K. L. Judd. Equilibrium incentives in oligopoly. Discussion Papers 642, Northwestern University, Center for Mathematical Studies in Economics and Management Science, Dec. 1984.
[5] N. S. Glance and B. A. Huberman. The dynamics of social dilemmas. Scientific American, 270(3):76–81, 1994.
[6] D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley Professional, January 1989.
[7] G. Hardin. The tragedy of the commons. Science, 162(3859):1243–1248, December 1968.
[8] J. Holland. The effects of labels (tags) on social interactions. Working Paper 93-10-064, Santa Fe Institute, 1993.
[9] J. H. Holland. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. University of Michigan Press, 1975.
[10] E. Howley and J. Duggan. The evolution of agent strategies and sociability in a commons dilemma. Lecture Notes in Computer Science, Springer-Verlag, Berlin, in press.
[11] E. Howley and C. O'Riordan. The emergence of cooperation among agents using simple fixed bias tagging. In Proceedings of the 2005 Congress on Evolutionary Computation (IEEE CEC'05), volume 2, pages 1011–1016. IEEE Press, 2005.
[12] M. Macy and A. Flache. Learning dynamics in social dilemmas. P Natl Acad Sci USA, 99(3):7229–7236, 2002.
[13] M. Matlock and S. Sen. Effective tag mechanisms for evolving coordination. In AAMAS '07: Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, pages 1–8, New York, NY, USA, 2007. ACM.
[14] A. McDonald and S. Sen. The success and failure of tag-mediated evolution of cooperation. In LAMAS, pages 155–164, 2005.
[15] R. Riolo. The effects and evolution of tag-mediated selection of partners in populations playing the iterated prisoner's dilemma. In ICGA, pages 378–385, 1997.
[16] R. Riolo, M. Cohen, and R. Axelrod. Evolution of cooperation without reciprocity. Nature, 414:441–443, 2001.
[17] K. Rogoff. The optimal degree of commitment to an intermediate monetary target. The Quarterly Journal of Economics, 100(4):1169–89, November 1985.
[18] Y. Sato. Trust, assurance, and inequality: A rational choice model of mutual trust. The Journal of Mathematical Sociology, 26(1):1–16, 2002.
[19] T. C. Schelling. The Strategy of Conflict. Harvard University Press, Cambridge, 1960.
[20] K. Suzuki. Effects of conflict between emergent charging agents in social dilemma. In MAMUS, pages 120–136, 2003.
[21] G. Torsvik. Social capital and economic development: A plea for the mechanisms. Rationality and Society, 12(4):451–476, 2000.
[22] T. Yamagishi and K. S. Cook. Generalized exchange and social dilemmas. Social Psychology Quarterly, 56(4):235–248, 1993.
[23] T. Yamashita, R. L. Axtell, K. Kurumatani, and A. Ohuchi. Investigation of mutual choice metanorm in group dynamics for solving social dilemmas. In MAMUS, pages 137–153, 2003.
[24] X. Yao and P. J. Darwen. An experimental study of n-person iterated prisoner's dilemma games. Informatica, 18:435–450, 1994.