Author: Department of Electrical Engineering and Computer Science, July 27, 2007
Certified by: Daniela Rus, Professor, Thesis Supervisor
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students
Distributed Reinforcement Learning for Self-Reconfiguring
Modular Robots
by
Paulina Varshavskaya
Abstract
In this thesis, we study distributed reinforcement learning in the context of automat-
ing the design of decentralized control for groups of cooperating, coupled robots.
Specifically, we develop a framework and algorithms for automatically generating dis-
tributed controllers for self-reconfiguring modular robots using reinforcement learning.
The promise of self-reconfiguring modular robots is that of robustness, adaptability
and versatility. Yet most state-of-the-art distributed controllers are laboriously hand-
crafted and task-specific, due to the inherent complexities of distributed, local-only
control. In this thesis, we propose and develop a framework for using reinforcement
learning for automatic generation of such controllers. The approach is profitable be-
cause reinforcement learning methods search for good behaviors during the lifetime
of the learning agent, and are therefore applicable to online adaptation as well as
automatic controller design. However, we must overcome the challenges due to the
fundamental partial observability inherent in a distributed system such as a self-
reconfiguring modular robot.
We use a family of policy search methods that we adapt to our distributed problem.
The outcome of a local search is always influenced by the search space dimensional-
ity, its starting point, and the amount and quality of available exploration through
experience. We undertake a systematic study of the effects that certain robot and
task parameters, such as the number of modules, presence of exploration constraints,
availability of nearest-neighbor communications, and partial behavioral knowledge
from previous experience, have on the speed and reliability of learning through policy
search in self-reconfiguring modular robots. In the process, we develop novel algorith-
mic variations and compact search space representations for learning in our domain,
which we test experimentally on a number of tasks.
This thesis is an empirical study of reinforcement learning in a simulated lattice-
based self-reconfiguring modular robot domain. However, our results contribute to
the broader understanding of automatic generation of group control and design of
distributed reinforcement learning algorithms.
This work was supported by Boeing Corporation. I am very grateful for their
support.
I cannot begin to thank my friends and housemates, past and present, of the
Bishop Allen Drive Cooperative, who have showered me with their friendship and
created the most amazing, supportive and intense home I have ever had the pleasure
to live in. I am too lazy to enumerate all of you — because I know I cannot disappoint
you — but you have a special place in my heart. To everybody who hopped in the
car with me and left the urban prison behind, temporarily, repeatedly, to catch the
sun, the rocks and the snow, to give ourselves a boost of sanity — thank you.
To my life- and soul-mate Luke: thank you for seeing me through this.
Contents
1 Introduction 7
1.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.1 Self-reconfiguring modular robots . . . . . . . . . . . . . . . . 9
1.2.2 Research motivation vs. control reality . . . . . . . . . . . . . 11
1.3 Conflicting assumptions . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.1 Assumptions of a Markovian world . . . . . . . . . . . . . . . 14
1.3.2 Assumptions of the kinematic model . . . . . . . . . . . . . . 16
1.3.3 Possibilities for conflict resolution . . . . . . . . . . . . . . . . 17
1.3.4 Case study: locomotion by self-reconfiguration . . . . . . . . . 18
1.4 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.5 Thesis contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.6 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2 Related Work 22
2.1 Self-reconfiguring modular robots . . . . . . . . . . . . . . . . . . . . 22
2.1.1 Hardware lattice-based systems . . . . . . . . . . . . . . . . . 22
2.1.2 Automated controller design . . . . . . . . . . . . . . . . . . . 23
2.1.3 Automated path planning . . . . . . . . . . . . . . . . . . . . 23
2.2 Distributed reinforcement learning . . . . . . . . . . . . . . . . . . . . 24
2.2.1 Multi-agent Q-learning . . . . . . . . . . . . . . . . . . . . . . 24
2.2.2 Hierarchical distributed learning . . . . . . . . . . . . . . . . . 24
2.2.3 Coordinated learning . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.4 Reinforcement learning by policy search . . . . . . . . . . . . 25
2.3 Agreement algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.5.4 Learning in feature spaces . . . . . . . . . . . . . . . . . . . . 36
3.5.5 Learning from individual experience . . . . . . . . . . . . . . . 39
3.5.6 Comparison to hand-designed controllers . . . . . . . . . . . . 40
3.5.7 Remarks on policy correctness and scalability . . . . . . . . . 42
3.6 Key issues in gradient ascent for SRMRs . . . . . . . . . . . . . . . . 45
3.6.1 Number of parameters to learn . . . . . . . . . . . . . . . . . 46
3.6.2 Quality of experience . . . . . . . . . . . . . . . . . . . . . . . 47
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.3.4 Experiments with policy transformation . . . . . . . . . . . . 92
6.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7 Concluding Remarks 96
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.3 Limitations and future work . . . . . . . . . . . . . . . . . . . . . . . 98
Chapter 1
Introduction
The goal of this thesis is to develop and demonstrate a framework and algorithms
for decentralized automatic design of distributed control for groups of cooperating,
coupled robots. This broad area of research will advance the state of knowledge in
distributed systems of mobile robots, which are becoming ever more present in our
lives. Swarming robots, teams of autonomous vehicles, and ad hoc mobile sensor
networks are now used to automate aspects of exploration of remote and dangerous
sites, search and rescue, and even conservation efforts. The advance of coordinated
teams is due to their natural parallelism: they are more efficient than single agents;
they cover more terrain more quickly; they are more robust to random variation and
random failure. As human operators and beneficiaries of multitudes of intelligent
robots, we would very much like to just tell them to go, and expect them to decide
on their own exactly how to achieve the desired behavior, how to avoid obstacles
and pitfalls along the way, how to coordinate their efforts, and how to adapt to
an unknown, changing environment. Ideally, a group of robots faced with a new
task, or with a new environment, should take the high-level task goal as input and
automatically decide what team-level and individual-level interactions are needed to
make it happen.
The reality of distributed mobile system control is much different. Distributed
programs must be painstakingly developed and verified by human designers. The
interactions and interference the system will be subject to are not always clear. And
development of control for such distributed mobile systems is in general difficult
for serially-minded human beings. Yet automatic search for the correct distributed
controller is problematic — there is incredible combinatorial explosion of the search
space due to the multiplicity of agents; there are also intrinsic control difficulties for
each agent due to the limitations of local information and sensing. The difficulties
are dramatically multiplied when the search itself needs to happen in a completely
distributed way with the limited resources of each agent.
This thesis is a study in automatic adaptable controller design for a particular
class of distributed mobile systems: self-reconfiguring modular robots. Such robots
are made up of distinct physical modules, which have degrees of freedom (DOFs) to
move with respect to each other and to connect to or disconnect from each other. They
have been proposed as versatile tools for unknown and changing situations, when mis-
sions are not pre-determined, in environments that may be too remote or hazardous
for humans. These types of scenarios are where a robot’s ability to transform its
shape and function would be undoubtedly advantageous. Clearly, adaptability and
automatic decision-making with respect to control is essential for any such situations.
A self-reconfiguring modular robot is controlled in a distributed fashion by the
many processors embedded in its modules (usually one processor per module), where
each processor is responsible for only a few of the robot’s sensors and actuators. For
the robot as a whole to cohesively perform any task, the modules need to coordinate
their efforts without any central controller. As with other distributed mobile systems,
in self-reconfiguring modular robots (SRMRs) we give up on easier centralized control
in the hope of increased versatility and robustness. Yet most modular robotic systems
run task-specific, hand-designed algorithms. For example, a distributed localized rule-
based system controlling adaptive locomotion (Butler et al. 2001) has 16 manually
designed local rules, whose asynchronous interactions needed to be anticipated and
verified by hand. A related self-assembly system had 81 such rules (Kotay & Rus
2004). It is very difficult to manually generate correct distributed controllers of this
complexity.
Instead, we examine how to automatically generate local controllers for groups
of robots capable of local sensing and communications to near neighbors. Both the
controllers and the process of automation itself must be computationally fast and only
require the kinds of resources available to individuals in such decentralized groups.
In this thesis we present a framework for automating distributed controller design
for SRMRs through reinforcement learning (RL) (Sutton & Barto 1999): the intuitive
framework where agents observe and act in the world and receive a signal which
tells them how well they are doing. Learning may be done from scratch (with no
initial knowledge of how to solve the task), or it can be seeded with some amount
of information — received wisdom directly from the designer, or solutions found in
prior experience. Automatic learning from scratch will be hindered by the large search
spaces pervasive in SRMR control. It is therefore to be expected that constraining and
guiding the search with extra information will in no small way influence the outcome of
learning and the quality of the resulting distributed controllers. Here, we undertake a
systematic study of the parameters and constraints influencing reinforcement learning
in SRMRs.
It is important to note that unlike with evolutionary algorithms, reinforcement
learning optimizes during the lifetime of the learning agent, as it attempts to perform
a task. The challenge in this case is that learning itself must be fully distributed.
The advantage is that the same paradigm can be used to (1) automate controller
design, and (2) run distributed adaptive algorithms directly on the robot, enabling it
to change its behavior as the environment, its goal, or its own composition changes.
Such capability of online adaptation would take us closer to the goal of more versatile,
robust and adaptable robots through modularity and self-reconfiguration. We are ul-
timately pursuing both goals (automatic controller generation and online adaptation)
in applying RL methods to self-reconfigurable modular robots.
Learning distributed controllers, especially in domains such as SRMRs where
neighbors interfere with each other and determine each other’s immediate capabil-
ities, is an extremely challenging problem for two reasons: the space of possible be-
haviors is huge, and it is plagued by numerous local optima which take the form
of bizarre and dysfunctional robot configurations. In this thesis, we describe strate-
gies for minimizing dead-end search in the neighborhoods of such local optima by
using easily available information, smarter starting points, and inter-module commu-
nication. The best strategies take advantage of the structural properties of modular
robots, but they can be generalized to a broader range of problems in cooperative
group behavior. In this thesis, we have developed novel variations on distributed
learning algorithms. The undertaking is both profitable and hard due to the na-
ture of the platforms we study. SRMRs have exceedingly large numbers of degrees of
freedom, yet may be constrained in very specific ways by their structural composition.
1.2 Background
1.2.1 Self-reconfiguring modular robots
Self-reconfiguring modular robotics (Everist et al. 2004, Kotay & Rus 2005, Murata
et al. 2001, Shen et al. 2006, Yim et al. 2000, Zykov et al. 2005) is a young and grow-
ing field of robotics with a promise of robustness, versatility and self-organization.
Robots made of a large number of identical modules, each of which has computa-
tional, sensory, and motor capabilities, are proposed as universal tools for unknown
and changing environments where adaptability is required.
Figure 1-1: Some recent self-reconfiguring modular robots (SRMRs): (a) The
Molecule (Kotay & Rus 2005), (b) SuperBot (Shen et al. 2006). Both can act as
lattice-based robots. SuperBot can also act as a chain-type robot (reprinted with
permission).
Figure 1-1 shows two recent systems. The first one, called the Molecule (Kotay
& Rus 2005), is a pure lattice system. The robot is composed of male and female
“atoms”, which together represent a “molecule” structure which is also the basis of
a lattice. The second system, in figure 1-1b, is SuperBot (Shen et al. 2006), which is
a hybrid. On the one hand, it may function in a lattice-based fashion similar to the
Molecule: SuperBot modules are cubic, with six possible connection sites, one at each
face of the cube, which makes for a straightforward cubic grid. On the other hand,
the robot can be configured topologically and function as a chain (e.g., in a snake
configuration) or tree (e.g., in a tetrapod, hexapod or humanoid configuration).
While chain-type or hybrid self-reconfiguring systems are better suited for tasks
requiring high mobility, lattice-based robots are useful in construction and shape-
maintenance scenarios, as well as for self-repair. The research effort in designing
lattice-based robots and developing associated theory has also been directed towards
future systems and materials. As an example, miniaturization may lead to fluid-like
motion through lattice-based self-reconfiguration. At the other extreme, we might
envision self-reconfiguring buildings built of block modules that are able and required
to infrequently change their position within the building.
The focus of this thesis is mainly on lattice-based robots. However, chain-type
systems may also benefit from adaptable controllers designed automatically through
reinforcement learning.
1.2.2 Research motivation vs. control reality

Part of the motivation for building SRMRs is scientific: the desire to understand the
properties of self-organization through building and studying self-assembling, self-
replicating and self-reconfiguring artificial systems. This nature-inspired scientific
motive is pervasive especially in stochastically reconfigurable robotics, which focuses
on self-assembly.
The designers of actuated, deterministic SRMRs, on the other hand, tend to
motivate their efforts with the promise of versatility, robustness and adaptability
inherent in modular architectures. Modular robots are expected to become a kind of
universal tool, changing shape and therefore function as the task requirements also
change. It is easy to imagine, for example, a search and rescue scenario with such
a versatile modular robot. Initially in a highly-mobile legged configuration that will
enable the robot to walk fast over rough terrain, it can reconfigure into a number
of serpentine shapes that burrow into the piles of rubble, searching in parallel for
signs of life. Another example, often mentioned in the literature, concerns space and
planetary exploration. A modular robot can be tightly packed into a relatively small
container for travel on a spaceship. Upon arrival, it can deploy itself according to
mission (a “rover” without wheels, a lunar factory) by moving its modules into a
correct configuration. If any of the modules should break down or become damaged,
the possibility of self-reconfiguration allows for their replacement with fresh stock,
adding the benefits of possible self-repair to modular robots.
The proclaimed qualities of versatility, robustness and adaptability with reference
to the ideal “killer apps” are belied by the rigidity of state-of-the-art controllers
available for actuated self-reconfiguring modular robots.
Existing controllers
Fully distributed controllers of many potentially redundant degrees of freedom are
required for robustness and adaptability. Compiling a global high-level goal into
distributed control laws or rules based solely on local interactions is a notoriously
difficult exercise. Yet currently, most modular robotic systems run task-specific, hand-
designed algorithms, for example, the rule-based systems in (Butler et al. 2004), which
took hours of designer time to synthesize. One such rule-based controller for a two-
dimensional lattice robot is reproduced here in figure 1-2. These rules guarantee
eastward locomotion by self-reconfiguration on a square lattice.
Automating the design of such distributed rule systems and control laws is clearly
needed, and the present work presents a framework in which it can be done. There has
been some exploration of automation in this case. In some systems the controllers were
automatically generated using evolutionary techniques in simulation before being ap-
plied to the robotic system itself (Kamimura et al. 2004, Mytilinaios et al. 2004). Cur-
rent research in automatic controller generation for SRMRs and related distributed
systems will be examined more closely in chapter 2. One of the important differences
between the frameworks of evolutionary algorithms and reinforcement learning is the
perspective from which optimization occurs.
Figure 1-2: A hand-designed rule-based controller for locomotion by self-
reconfiguration on a square lattice. Reprinted with permission from Butler et al.
(2004).
Figure 1-3: The reinforcement learning framework. (a) In a fully observable world,
the agent can estimate a value function for each state and use it to select its actions.
(b) In a partially observable world, the agent does not know which state it is in due to
sensor limitations; instead of a value function, the agent updates its policy parameters
directly.
In a partially observable world, the agent does not have direct access to
the state of the world, but may observe a local, noisy measure on parts of it. We
will demonstrate below that in the case of modular robots, even their very simplified
abstract kinematic models violate the assumptions necessary for the powerful tech-
niques to apply and the theorems to hold. As a result of this conflict, some creativity
is essential in designing or applying learning algorithms to the SRMR domain.
Reinforcement learning
Consider the class of problems in which an agent, such as a robot, has to achieve some
task by undertaking a series of actions in the environment, as shown in figure 1-3a.
The agent perceives the state of the environment, selects an action to perform from
its repertoire and executes it, thereby affecting the environment which transitions to
a new state. The agent also receives a scalar signal indicating its level of performance,
called the reward signal. The problem for the agent is to find a good strategy (called
a policy) for selecting actions given the states. The class of such problems can be
described by stochastic processes called Markov decision processes (see below). The
problem of finding an optimal policy can be solved in a number of ways. For instance,
if a model of the environment is known to the agent, it can use dynamic programming
algorithms to optimize its policy (Bertsekas 1995). However, in many cases the model
is either unknown initially or impossibly difficult to compute. In these cases, the agent
must act in the world to learn how it works.
Reinforcement learning is sometimes used in the literature to refer to the class
of problems that we have just described. It is more appropriately used to name the
set of statistical learning techniques employed to solve this class of problems in the
cases when a model of the environment is not available to the agent. The agent may
learn such a model, and then solve the underlying decision process directly. Or it
may estimate a value function associated with every state it has visited (as is the
case in figure 1-3a) from the reward signal it has received. Or it may update the
policy directly using the reward signal. We refer the reader to a textbook (Sutton
& Barto 1999) for a good introduction to the problems addressed by reinforcement
learning and the details of standard solutions.
It is assumed that the way the world works does not change over time, so that the
agent can actually expect to optimize its behavior with respect to the world. This is
the assumption of a stationary environment. It is also assumed that the probability
of the world entering the state s′ at the next time step is determined solely by the
current state s and the action chosen by the agent. This is the Markovian world
assumption, which we formalize below. Finally, it is assumed that the agent knows
all the information it needs about the current state of the world and its own actions
in it. This is the full observability assumption.
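Written out explicitly, the three assumptions can be summarized as follows (a restatement in standard MDP notation, not additional assumptions introduced here):

Stationarity:        P(st+1 = s′ | st = s, at = a) does not depend on the time step t.
Markov property:     P(st+1 | st, at, st−1, at−1, . . . , s0) = P(st+1 | st, at).
Full observability:  ot = st, i.e., the agent perceives the current world state exactly.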
Multi-agent MDPs
When multiple agents are learning to behave in the same world that all of them
affect, this more complex interaction is usually described as a multi-agent MDP.
Different formulations of processes involving multiple agents exist, depending on the
assumptions we make about the information available to the learning agent or agents.
In the simplest case we imagine that learning agents have access to a kind of oracle
that observes the full state of the process, but execute factored actions (i.e., what is
called one action of the MDP is the result of coordinated mini-actions of all modules
executed at the same time). The individual components of a factored action need
to be coordinated among the agents, and information about the actions taken by all
modules also must be available to every learning agent. This formulation satisfies all
of the strong assumptions of a Markovian world, including full observability. However,
as we argue later in section 1.3.2, to learn and maintain a policy with respect to each
state is in practical SRMRs unrealistic, as well as prohibitively expensive in both
experience and amount of computation.
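For concreteness (the notation here is assumed for illustration), with M modules the factored action is the tuple of the modules' simultaneous mini-actions, and the transition model is defined over the joint choice:

at = (at(1), at(2), . . . , at(M)),    P(st+1 | st, at) = P(st+1 | st, at(1), . . . , at(M)).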
In a more realistic formulation, modules have access only to factored (local, limited, perhaps noisy) observations over
the state. In this case, the assumption of full observability is no longer satisfied. In
general, powerful learning techniques such as Q-learning (Watkins & Dayan 1992) are
no longer guaranteed to converge to a solution, and optimal behavior is much harder
to find.
Furthermore, the environment with which each agent interacts comprises all the
other agents who are learning at the same time and thus changing their behavior.
The world is non-stationary due to many agents learning at once, and thus cannot
even be treated as a POMDP, although in principle agents could build a weak model
of the competence of the rest of agents in the world (Chang et al. 2004).
Figure 1-4: The sliding-cube kinematic model for lattice-based modular robots: (1)
a sliding transition, and (2) a convex transition.
How does our kinematic model fit into these possible sets of assumptions? We
base most of our development and experiments on the standard sliding-cube model
for lattice-based self-reconfigurable robots (Butler et al. 2004, Fitch & Butler 2006,
Varshavskaya et al. 2004, Varshavskaya et al. 2007). In the sliding-cube model, each
module of the robot is represented as a cube (or a square in two dimensions), which
can be connected to other module-cubes at any of its six (four in 2D) faces. The
cube cannot move on its own; however, it can move, one cell of the lattice at a time,
on a substrate of like modules in the following two ways, as shown in figure 1-4: 1)
if the neighboring module M1 that it is attached to has another neighbor M2 in the
direction of motion, the moving cube can slide to a new position on top of M2 , or
2) if there is no such neighbor M2 , the moving cube can make a convex transition to
the same lattice cell in which M2 would have been. Provided the relevant neighbors
are present in the right positions, these motions can be performed relative to any of
the six faces of the cubic module. The motions in this simplified kinematic model
can actually be executed by physically implemented robots (Butler et al. 2004). Of
those mentioned in section 1.2.1, the Molecule, MTRAN-II, and SuperBot can all
form configurations required to perform the motions of the kinematic model, and
therefore controllers developed in simulation for this model are potentially useful for
these robots as well.
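To make the two transitions concrete, here is a minimal 2D sketch in Python; the grid encoding, the helper's name, and the omission of further connectivity or collision checks are simplifying assumptions made for illustration, not the simulator used in this thesis.

# occupied: set of (x, y) lattice cells holding modules.
# direction: unit step, e.g. (1, 0) for motion to the East.
def possible_transition(occupied, module, direction):
    # Return the target cell of a sliding or convex transition, or None.
    x, y = module
    dx, dy = direction
    ahead = (x + dx, y + dy)
    # M1 must be a face neighbor to one side of the direction of motion.
    for sx, sy in ((-dy, dx), (dy, -dx)):
        m1 = (x + sx, y + sy)
        if m1 not in occupied:
            continue
        m2 = (x + sx + dx, y + sy + dy)   # M1's neighbor in the direction of motion
        if m2 in occupied and ahead not in occupied:
            return ahead                  # sliding transition: end up on top of M2
        if m2 not in occupied:
            return m2                     # convex transition: take the cell M2 would have occupied
    return None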
A further assumption of the sliding-cube model is that the learning agents are the individual
modules of the model. In physical instantiations, more than one physical module may
coordinate to comprise one simulation “cube”. The modules have limited resources,
which affects any potential application of learning algorithms:
limited actuation: each module can execute one of a small set of discrete actions;
each action takes a fixed amount of time to execute, potentially resulting in a
unit displacement of the module in the lattice
limited power: actuation requires a significantly greater amount of power than com-
putation or communication
limited computation and memory: on-board computation is usually limited to
microcontrollers
clock: the system may be either synchronized to a common clock, which is a rather
unrealistic assumption, or asynchronous, with every module running its code
in its own time
Clearly, if an individual module knew the global configuration and position of
the entire robot, as well as the action each module is about to take, then it would
also know exactly the state (i.e., position and configuration) of the robot at the next
time step, as we assume a deterministic kinematic model. Thus, the world can be
Markovian. However, this global information is not available to individual modules
due to limitations in computational power and communications bandwidth. They
may communicate with their neighbors at each of their faces to find out the local
configuration of their neighborhood region. Hence, there is only a partial observation
of the world state in the learning agent, and our kinematic model assumptions are in
conflict with those of powerful RL techniques.
1.3.4 Case study: locomotion by self-reconfiguration
We now examine the problem of synthesizing locomotion gaits for lattice-based mod-
ular self-reconfiguring robots. The abstract kinematic model of a 2D lattice-based
robot is shown in figure 1-5. The modules are constrained to be connected to each
other in order to move with respect to each other; they are unable to move on their
own. The robot is positioned on an imaginary 2D grid; and each module can observe
at each of its 4 faces (positions 1, 3, 5, and 7 on the grid) whether or not there is
another connected module on that neighboring cell. A module (call it M ) can also
ask those neighbors to confirm whether or not there are other connected modules at
the corner positions (2, 4, 6, and 8 on the grid) with respect to M . These eight bits
of observation comprise the local configuration that each module perceives.
Thus, the module M in the figure knows that lattice cells number 3, 4, and 5
have other modules in them, but lattice cells 1, 2, and 6-8 are free space. M has a
repertoire of 9 actions, one for moving into each adjacent lattice cell (face or corner),
and one for staying in the current position. Given any particular local configuration
of the neighborhood, only a subset of those actions can be executed, namely those
that correspond to the sliding or convex motions about neighbor modules at any of
the face sites. This knowledge may or may not be available to the modules. If it is
not, then the module needs to learn which actions it will be able to execute by trying
and failing.
The modules are learning a locomotion gait with the eastward direction of motion
receiving positive rewards, and the westward receiving negative rewards. Specifically,
a reward of +1 is given for each lattice-unit of displacement to the East, −1 for each
lattice-unit of displacement to the West, and 0 for no displacement. M ’s goal in figure
1-5 is to move right. The objective function to be maximized through learning is the
displacement of the whole robot’s center of mass along the x axis, which is equivalent
to the average displacement incurred by individual modules.
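To make the setup concrete, here is a small Python sketch (the bit ordering and function names are illustrative assumptions) of how the eight observation bits can be packed into a table index and how the episode reward just described can be computed.

import numpy as np

def observation_index(neighbor_bits):
    # Pack the 8 surrounding cells (1 = occupied, 0 = empty) into an index in [0, 255];
    # the ordering of the bits is an arbitrary but fixed choice.
    index = 0
    for i, bit in enumerate(neighbor_bits):
        index |= int(bit) << i
    return index

def episode_reward(x_start, x_end):
    # Displacement of the robot's center of mass along x over one episode,
    # which equals the mean displacement of the individual modules.
    return float(np.mean(x_end) - np.mean(x_start))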
Clearly, if the full global configuration of this modular robot were known to each
module, then each of them could learn to select the best action in every global
configuration with an MDP-based algorithm. If there are only two or three modules,
this may be done, as there are only 4 (or 18, respectively) possible global configurations. However,
the 8 modules shown in figure 1-5 can be in a prohibitively large number of global
configurations; and we are hoping to build modular systems with tens and hundreds of
modules. Therefore it is more practical to resort to learning in a partially observable
world, where each module only knows about its local neighborhood configuration.
The algorithms we present in this thesis were not developed exclusively for this
scenario. They are general techniques applicable to multi-agent domains. However,
the example problem above is useful in understanding how they instantiate to the
SRMR domain. Throughout the thesis, it will be used in experiments to illustrate
how information, search constraints and the degree of inter-module communication
influence the speed and reliability of reinforcement learning in SRMRs.
1.4 Overview
This thesis provides an algorithmic study, evaluated empirically, of using distributed
reinforcement learning by policy search with different information sources for learning
distributed controllers in self-reconfiguring modular robots.
At the outset, we delineate the design parameters and aspects of both robot and
task that contribute to three dimensions along which policy search algorithms may be
described: the number of parameters to be learned (the number of knobs to tweak),
the starting point of the search, and the amount and quality of experience that
modules can gather during the learning process. Each contribution presented in this
thesis relates in some way to manipulating at least one of these issues of influence in
order to increase the speed and reliability of learning and the quality of its outcome.
For example, we present two novel representations for lattice-based SRMR control,
both of which aim to compress the search space by reducing the number of parame-
ters to learn. We quickly find, however, that compact representations require careful
construction, shifting the design burden away from design of behavior, but ultimately
mandating full human participation. On the other hand, certain constraints on ex-
ploration are very easy for any human designer to articulate, for example, that the
robot should not attempt physically impossible actions. These constraints effectively
reduce the number of parameters to be learned. Other simple suggestions that can
be easily articulated by the human designer (for example: try moving up when you
see neighbors on the right) can provide good starting points to policy search, and we
also explore their effectiveness.
Further, we propose novel variations on the theme of policy search which are
specifically geared to learning of group behavior. For instance, we demonstrate how
incremental policy search, in which each new learning process starts from the point at
which a previous run with fewer modules had stopped, reduces the likelihood of being
trapped in an undesirable local optimum, and results in better learned behaviors. This
algorithm is thus a systematic way of generating good starting points for distributed
group learning.
In addition, we develop a framework for incorporating agreement (also known as
consensus) algorithms into policy search, and we present a new algorithm with which
individual learners in distributed systems such as SRMRs can agree on both rewards
and experience gathered during the learning process — information which can be
incorporated into learning updates, and will result in consistently better policies and
markedly faster learning. This algorithmic innovation thus provides a way of in-
creasing the amount and quality of individuals’ experience during learning. Active
agreement also has an effect on the kind of policies that are learned.
Finally, as we evaluate our claims and algorithms on a number of tasks requiring
SRMR cooperation, we also inquire into the possibility of post-learning knowledge
transfer between policies which were learned for different tasks from different reward
functions. It turns out that local geometric transformations of learned policies can
provide another systematic way of automatically generating good starting points for
further learning.
These contributions bring us forward towards delivering on the promise of adapt-
ability and versatility of SRMRs — properties often used for motivating SRMR re-
search, but conspicuously absent from most current systems. The results of this study
are significant, because they present a framework in which future efforts in automat-
ing controller generation and online adaptation of group behavior can be grounded.
In addition, this thesis establishes empirically where the difficulties in such cooper-
ative group learning lie and how they can be mitigated through the exploitation of
relevant domain properties.
² In the nontechnical sense of adaptable behavior.
Chapter 2
Related Work
More recent hardware systems have had a greater focus on stochastic self-assembly
from modular parts, e.g., the 2D systems such as programmable parts (Bishop et al.
2005) and growing machines (Griffith et al. 2005), where the energy and mobility
are provided to the modules through the flow generated by an air table; and the 3D
systems such as those of (White et al. 2005), where modules are in passive motion
inside a liquid whose flow also provides exogenous energy and mobility to the modular
parts. Miche (Gilpin et al. 2007) is a deterministic self-disassembly system that is
related to this work insofar as the parts used in Miche are not moving. Such work is
outside the scope of this thesis.
By contrast, in our approach, once the policy is learned it takes no extra run-time plan-
ning to execute it. On the other hand, Fitch and Butler’s algorithm works in an
entirely asynchronous manner; and they have demonstrated path-planning on very
large numbers of modules.
2.2.4 Reinforcement learning by policy search
In partially observable environments, policy search has been extensively used (Peshkin
2001, Ng & Jordan 2000, Baxter & Bartlett 2001, Bagnell et al. 2004) whenever mod-
els of the environment are not available to the learning agent. In robotics, successful
applications include humanoid robot motion (Schaal et al. 2003), autonomous heli-
copter flight (Ng et al. 2004), navigation (Grudic et al. 2003), and bipedal walking
(Tedrake et al. 2005), as well as simulated mobile manipulation (Martin 2004). In
some of those cases, a model identification step is required prior to reinforcement
learning, for example from human control of the plant (Ng et al. 2004), or human
motion capture (Schaal et al. 2003). In other cases, the system is brilliantly engi-
neered to reduce the search task to estimating as little as a single parameter (Tedrake
et al. 2005).
Function approximation, such as neural networks or coarse coding, has also been
widely used to improve performance of reinforcement learning algorithms, for example
in policy gradient algorithms (Sutton et al. 2000) or approximate policy iteration
(Lagoudakis & Parr 2003).
gradient MDP learning together with pairwise agreement on rewards. By contrast,
our approach is applicable in partially observable situations. In addition, we take
advantage of the particular policy search algorithm to let learning agents agree not
only on reward, but also on experience. The implications and results of this approach
are discussed in chapter 5.
Chapter 3
The first approach we take in dealing with distributed actions, local observations and
partial observability is to describe the problem of locomotion by self-reconfiguration
of a modular robot as a multi-agent Partially Observable Markov Decision Process
(POMDP).
In this chapter, we first describe the derivation of the basic GAPS algorithm,
and in section 3.3 the issues concerning its implementation in a distributed learning
system. We then develop a variation using a feature-based representation and a log-
linear policy in section 3.4. Finally, we present some experimental results (section
3.5) from our case study of locomotion by self-reconfiguration, which demonstrate
that policy search works where Markov-assuming techniques do not.
One option would be to learn controllers with internal state (Meuleau et al. 1999). However, internal
state is unlikely to help in our case, as it would multiply drastically the number of
parameters to be learned, which is already considerable.
To learn in this POMDP we use a policy search algorithm based on gradient
ascent in policy space (GAPS), proposed by Peshkin (2001). This approach assumes
a given parametric form of the policy πθ (o, a) = P (a|o, θ), where θ is the vector
of policy parameters, and the policy is a differentiable function of the parameters.
The learning proceeds by gradient ascent on the parameters θ to maximize expected
long-term reward.
The quantity being maximized is the value of the policy, Vθ = Σ_{h∈H} P(h|θ) R(h), where
h ∈ H is a history of states, observations and actions over an episode of T time steps,
and R(h) is the total reward accrued along h. The probability of a history factors as

P(h|θ) = P(s0) ∏_{t=1}^{T} P(ot|st) P(at|ot, θ) P(st+1|st, at)
       = P(s0) [ ∏_{t=1}^{T} P(ot|st) P(st+1|st, at) ] [ ∏_{t=1}^{T} P(at|ot, θ) ]
       = Ξ(h) Ψ(h, θ),

where Ξ(h) is not known to the learning agent and does not depend on θ, and
Ψ(h, θ) is known and differentiable under the assumption of a differentiable policy
representation. Therefore, ∂Vθ/∂θk = Σ_{h∈H} R(h) Ξ(h) ∂Ψ(h, θ)/∂θk.
The differentiable part of the update gives:

∂Ψ(h, θ)/∂θk = ∂/∂θk ∏_{t=1}^{T} πθ(at, ot)
             = Σ_{t=1}^{T} [ ∂πθ(at, ot)/∂θk ] ∏_{τ≠t} πθ(aτ, oτ)
             = [ ∏_{t=1}^{T} πθ(at, ot) ] Σ_{t=1}^{T} [ ∂πθ(at, ot)/∂θk ] / πθ(at, ot)
             = Ψ(h, θ) Σ_{t=1}^{T} ∂/∂θk ln πθ(at, ot).
Therefore,

∂Vθ/∂θk = Σ_{h∈H} R(h) P(h|θ) Σ_{t=1}^{T} ∂/∂θk ln πθ(at, ot).

In GAPS, the policy is represented by a lookup table of parameters θ(o, a), one for each
observation-action pair, which is converted into a probability distribution over actions
by Boltzmann's law:

πθ(a, o) = exp( β θ(o, a) ) / Σ_{a′} exp( β θ(o, a′) ),

where β is an inverse temperature parameter that controls the steepness of the curve
and therefore the level of exploration.
This gradient ascent algorithm has some important properties. As with any
stochastic hill-climbing method, it can only be relied upon to reach a local opti-
mum in the represented space, and will only converge if the learning rate is reduced
over time. We can attempt to mitigate this problem by running GAPS from many
initialization points and choosing the best policy, or by initializing the parameters in
a smarter way. The latter approach is explored in chapter 4.
Algorithm 1 GAPS (observation function o, M modules)
Initialize parameters θ ← small random numbers
for each episode do
Calculate policy π(θ)
Initialize observation counts N ← 0
Initialize observation-action counts C ← 0
for each timestep in episode do
for each module m do
observe om and increment N (om )
choose a from π(om , θ) and increment C(om , a)
execute a
end for
end for
Get global reward R
Update θ according to
θ(o, a) ← θ(o, a) + α R ( C(o, a) − π(o, a, θ) N(o) )
Update π(θ) using Boltzmann’s law
end for
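For concreteness, here is a minimal Python sketch of one centralized factored GAPS episode in the style of Algorithm 1. The simulator interface (num_modules, observe, act, reward) is a placeholder assumption made for illustration, not an API defined in this thesis.

import numpy as np

def boltzmann(theta, beta):
    # Boltzmann (softmax) policy: one distribution over actions per observation.
    z = np.exp(beta * (theta - theta.max(axis=1, keepdims=True)))
    return z / z.sum(axis=1, keepdims=True)

def gaps_episode(theta, env, alpha=0.01, beta=1.0, T=50):
    # theta: (n_obs, n_actions) table of policy parameters shared by all modules.
    n_obs, n_act = theta.shape
    pi = boltzmann(theta, beta)           # calculate policy pi(theta)
    N = np.zeros(n_obs)                   # observation counts
    C = np.zeros((n_obs, n_act))          # observation-action counts
    for _ in range(T):
        for m in range(env.num_modules):
            o = env.observe(m)            # local observation index for module m
            a = np.random.choice(n_act, p=pi[o])
            N[o] += 1
            C[o, a] += 1
            env.act(m, a)
    R = env.reward()                      # global reward at the end of the episode
    theta += alpha * R * (C - pi * N[:, None])
    return theta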
In the fully distributed setting, where each module not only observes and acts but also learns its own parameterized policy, the algorithm can be extended
in the most obvious way to Distributed GAPS (DGAPS), as was also done by Peshkin
(2001). A theorem in his thesis says that the centralized factored version of GAPS
and the distributed multi-agent version will make the same updates given the same
experience. That means that, given the same observation and action counts, and the
same rewards, the two instantiations of the algorithm will find the same solutions.
In our domain of a distributed modular robot, DGAPS is naturally preferable.
However, in a fully distributed scenario, agents do not in fact have access to the
same experience and the same rewards as all others. Instead of requiring an identical
reward signal for all agents, we take each module’s displacement along the x axis to
be its reward signal: Rm = xm , since we assume that modules do not communicate.
This means that the policy value landscape is now different for each agent. However,
the agents are physically coupled with no disconnections allowed. If the true reward
is R = (Σ_{m=1}^{N} xm)/N, and each individual reward is Rm = xm, then each Rm is a bounded
estimate of R that is at most N/2 away from it (in the worst-case scenario where
all modules are connected in one line; see the worked check below). Furthermore, as each episode is initialized,
modules are placed at random in the starting configuration of the robot. Therefore
R, the position of the robot's center of mass, is the expected value E[xm] of any one module's
position along the x axis. We can easily see that in the limit, as the number of turns
This estimate is better the fewer modules we have and the larger R is. Therefore it
makes sense to simplify the problem in the initial phase of learning, while the rewards
are small, by starting with fewer modules, as we explore in chapter 4. It may also
make sense to split up the experience into larger episodes, which would generate larger
rewards all other things being equal, especially as the number of modules increases.
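As a quick worked check of the N/2 bound in the worst case mentioned above, suppose the
N modules form a single horizontal line at positions x, x + 1, . . . , x + N − 1. Then

R = (1/N) Σ_{m=1}^{N} xm = x + (N − 1)/2,    and    max_m |Rm − R| = (N − 1)/2 < N/2,

so even the end modules' individual rewards are within N/2 of the true reward.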
Other practical implications of diverse experience and rewards among the learning
agents are examined in chapter 5.
For the log-linear policy, the derivative of the log of the policy with respect to each
parameter θk takes a simple form:

∂/∂θk ln πθ(at, ot) = β ( φk(at, ot) − Σ_a φk(a, ot) πθ(a, ot) ) = λk(t).
The accumulated traces λk are only slightly more computationally involved than the
simple counts of the original GAPS algorithm: there are |A| + N more computations
per timestep than in GAPS, where A is the set of actions, and N the number of
features. The updates are just as intuitive, however, as they still assign more credit
to those features that differentiate more between actions, normalized by the actions’
likelihood under the current policy.
The algorithm (see Algorithm 2) is guaranteed to converge to a local maximum in
policy value space, for a given feature and policy representation. However, the feature
space may exclude good policies — using a feature-based representation shifts the
¹ The term log-linear refers to the log-linear combination of features in the policy representation.
Algorithm 2 LLGAPS (Observation function o, N feature functions φ)
Initialize parameters θ ← small random numbers
for each episode do
Initialize traces Λ ← 0
for each timestep t in episode do
Observe current situation ot and get the feature responses Φ(∗, ot) for every action
Sample action at according to policy (Boltzmann’s law)
for k = 1 to N do
Λk ← Λk + β (φk (at , ot ) − Σa φk (a, ot )πθ (a, ot ))
end for
end for
θ ←θ+αRΛ
end for
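A matching Python sketch of one LLGAPS episode for a single module follows; the simulator interface and the features callable are assumptions made for illustration.

import numpy as np

def llgaps_episode(theta, env, features, alpha=0.01, beta=1.0, T=50):
    # theta: (K,) parameter vector, one entry per feature.
    # features: callable phi(o) returning an (n_actions, K) matrix of feature responses.
    trace = np.zeros_like(theta)                  # accumulated traces Lambda
    for _ in range(T):
        o = env.observe()
        phi = features(o)                         # feature responses for every action
        logits = beta * phi.dot(theta)
        p = np.exp(logits - logits.max())
        p /= p.sum()                              # Boltzmann policy over actions
        a = np.random.choice(len(p), p=p)
        trace += beta * (phi[a] - p.dot(phi))     # phi_k(a_t,o_t) - sum_a phi_k(a,o_t) pi(a,o_t)
        env.act(a)
    R = env.reward()
    return theta + alpha * R * trace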
Figure 3-1: Experimental setup: simulated 2D modular robot on a lattice.
As previewed in section 1.3.4, each module can observe its immediate Moore
neighborhood (eight immediate neighbor cells) to see if those cells are occupied by
a neighboring module or empty. Each module can also execute one of nine actions,
which are to move into one of the neighboring cells or a NOP. The simulator is
synchronous, and modules execute their actions in turn, such that no module can go
twice before every other module has taken its turn. This assumption is unrealistic;
however, it helps us avoid learning to coordinate motion across the entire robot. There
are examples in the literature (e.g., in Fitch & Butler 2007) of algorithms intended
for use in lattice-based SRMRs that provide for turn-taking. The task is to learn to
move in one direction (East) by self-reconfiguration. The reward function measures
the progress the modules make in one episode along the x axis of the simulator. Figure
3-2 shows how a policy (a) can be executed by modules to produce a locomotion gait
(b) and therefore gain reward.
We run the experiments in two broad conditions. First, the goal of automatically
generating controllers can in principle be achieved in simulation off-line, and then
imparted to the robot for run-time execution. In that case, we can require all modules
to share one set of policy parameters, that is, to pool their local observations and
actions for one learning “super-agent” to make one set of parameter updates, which
then propagates to all the modules on the next episode. Second, we run the same
learning algorithms on individual modules operating independently, without sharing
their experience. We predict that more experience will result in faster learning curves
and better learning results.
Figure 3-2: (a) A locomotion gait policy (conditional probability distribution over
actions given an observation). (b) First few configurations of 9 modules executing
the policy during a test run: in blue, module with lowest ID number, in red, module
which is about to execute an action, in green, module which has just finished its
action.
Figure 3-4: Reward distributions after 4,000 episodes of learning for Q-learning, Sarsa
and GAPS (10 trials per algorithm). This box and whisker plot, generated using
MATLAB®, shows the lower quartile, the median and the upper quartile values as
box lines for each algorithm. The whiskers are vertical lines that show the extent of
the rest of the data points. Outliers are data beyond the end of the whiskers. The
notches on the box represent a robust estimate of the uncertainty about the medians
at the 5% significance level.
Unless noted otherwise, in all conditions the learning rate started at α = 0.01,
decreased over the first 1,500 episodes, and remained at its minimal value of 0.001
thereafter. The inverse temperature parameter started at β = 1, increased over
the first 1,500 episodes, and remained at its maximal value of 3 thereafter. This
ensured more exploration and larger updates in the beginning of each learning trial.
These parameters were selected by trial and error, as the ones consistently generating
the best results. Whenever results are reported as smoothed, downsampled average
reward curves, the smoothing was done with a 100-point moving window on the
results of every trial, which were then downsampled for the sake of clarity, to exclude
random variation and variability due to continued within-trial exploration.
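The smoothing and downsampling used for these curves amounts to the following; the downsampling step size is an assumption, since the text only specifies the 100-point window.

import numpy as np

def smooth_and_downsample(rewards, window=100, step=50):
    # Moving-average smoothing over a fixed window, then downsampling for clarity.
    kernel = np.ones(window) / window
    smoothed = np.convolve(rewards, kernel, mode="valid")
    return smoothed[::step]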
Hypothesis: GAPS is capable of finding good locomotion gaits, but Q-learning and
Sarsa are not.
³ As we will see later, a locally optimal policy is not always a good locomotion gait.
Figure 3-3a shows the smoothed average learning curves for both algorithms. Fif-
teen modules were learning to locomote eastward in 10 separate trial runs (50 trials
for GAPS). As predicted, gradient ascent receives considerably more reward than ei-
ther Q-learning or Sarsa. The one-way ANOVA reveals that the rewards obtained
after 4,000 episodes of learning differed significantly as a function of the learning
algorithm used (F(2, 27) = 33.45, p < .01). Figure 3-4 shows the mean rewards
obtained by the three algorithms, with error bars and outliers. The post-hoc multiple
comparisons test reports that the GAPS mean reward is significantly different from
both Sarsa and Q-learning at 99% confidence level, but the latter are not significantly
different from each other. In test trials this discrepancy manifested itself as finding a
good policy for moving eastward (one such policy is shown in figure 3-2) for GAPS,
and failing to find a reasonable policy for Q-learning and Sarsa: modules oscillated,
moving up-and-down or left-and-right and the robot did not make progress. In figure
3-3b we see the raw rewards collected at each episode in one typical trial run of both
learning algorithms.
While these results demonstrate that modules running GAPS learn a good policy,
they also show that it takes a long time for gradient ascent to find it. We next
examine the extent to which we can reduce the number of search space dimensions,
and therefore, the experience required by GAPS through employing the feature-based
approach of section 3.4.
Figure 3-5: 6 modules learning with LLGAPS and 144 features. (a) One of the
features used for function approximation in the LLGAPS experiments: this feature
function returns 1 if there are neighbors in all three cells of the upper left corner
and at = NE (action 2). (b) Smoothed (100-point moving window), downsampled average
rewards over 10 runs, with standard error: comparison between original GAPS and
LLGAPS.
Figure 3-5b presents the results of comparing the performance of LLGAPS with
the original GAPS algorithm on the locomotion task for 6 modules. We see that both
algorithms are comparable in both their speed of learning and the average quality
of the resulting policies. If anything, around episode 1,500 basic GAPS has better
performance. However, a two-way mixed-factor ANOVA (repeated measures every
1000 episodes, GAPS vs. LLGAPS) reveals no statistical significance. Our hypothesis
was wrong, at least for this representation.
However, increasing the number of modules reveals that LLGAPS is more prone to
finding unacceptable local optima, as explained in section 4.3.2. We also hypothesized
that increasing the size of the observed neighborhood would favor the feature-based
LLGAPS over the basic GAPS. As the size of the observation increases, the number of
possible local configurations grows exponentially, whereas the features can be designed
to grow linearly. We ran a round of experiments with an increased neighborhood
size of 12 cells obtained as follows: the acting module observes its immediate face
neighbors in positions 1,3,5, and 7, and requests from each of them a report on the
three cells adjacent to their own faces⁵. Thus the original Moore neighborhood plus
four additional bits of observation are available to every module, as shown in figure
3-7a. This setup results in 2¹² × 9 = 36,864 parameters to estimate for GAPS. For
LLGAPS we incorporated the extra information thus obtained into an additional
72 features as follows: one partial neighborhood mask per extra neighbor present,
⁵ If no neighbor is present at a face, and so no information is available about a corner neighbor, it is assumed to be an empty cell.
Figure 3-6: All 16 observation masks for feature generation (full corner, full line,
empty corner, and empty line patterns; filled cells indicate a neighbor present): each
mask, combined with one of 9 possible actions, generates one of 144 features describing
the space of policies.
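Features of this kind can be generated mechanically from a list of masks. In the following sketch the mask encoding (which cells to test and what occupancy they must have) is an assumption made for illustration, not the representation used in this thesis.

import itertools

def make_features(masks, n_actions=9):
    # masks: list of (cells, required) pairs, where cells indexes into the 8-bit
    # local observation and required gives the occupancy those cells must have.
    def feature(cells, required, action):
        def phi(a, obs_bits):
            return float(a == action and
                         all(obs_bits[c] == r for c, r in zip(cells, required)))
        return phi
    return [feature(cells, required, action)
            for (cells, required), action in itertools.product(masks, range(n_actions))]

With 16 masks and 9 actions this produces the 144 features referred to above.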
                           15 modules    20 modules
GAPS                       0.1 ± 0.1     5 ± 1.7
LLGAPS with 144 features   6.6 ± 1.3     9.7 ± 0.2

Table 3.1: Mean number of configurations that do not correspond to locomotion gaits,
out of 10 test trials, for 10 policies learned by GAPS vs. LLGAPS with 144 features,
after 6,000 episodes of learning, with standard error. We will use the mean number
of non-gait configurations throughout this thesis as one of the performance measures
for various algorithms, representations and constraints. The standard error σ̂/√N
(N is the number of sampled policies, here equal to 10) included in this type of table
both gives an estimate of how spread-out the values from different policies are, and
is a measure of quality of the mean value as a statistic.
and one per extra neighbor absent, where each of those 8 partial masks generates
9 features, one per possible action. We found that increasing the observation size
indeed slows the basic GAPS algorithm down (figure 3-7b, where LLGAPS ran with
a much lower learning rate to avoid too-large update steps: α = 0.005 decreasing to
α = 1e−5 over the first 1,500 episodes and remaining at the minimal value thereafter).
During the first few hundred episodes, LLGAPS does better than GAPS. However,
we have also found that there is no significant difference in performance after both
have found their local optima (mixed-factor ANOVA with repeated measures every
1000 episodes, starting at episode 500 reveals no significance for GAPS vs. LLGAPS
but a significant difference at 99% confidence level for the interaction between the two
factors of episodes and algorithm). Table 3.1 shows that the feature-based LLGAPS
is even more prone to falling into undesirable local optima.
Two possible explanations would account for this phenomenon. On the one hand,
feature spaces need to be carefully designed to avoid these pitfalls, which shifts the hu-
man designer’s burden from developing distributed control algorithms to developing
sets of features. The latter task may not be any less challenging. On the other hand,
even well-designed feature spaces may eliminate redundancy in the policy space, and
therefore make it harder for local search to reach an acceptable optimum. This effect
is the result of compressing the policy representation into fewer parameters. When
an action is executed that leads to worse performance, the features that contributed
to the selection of this action will all get updated. That is, the update step results
a simultaneous change in different places in the observation-action space, which can
more easily create a new policy with even worse performance. Thus, the parameters
before the update would be locally optimal.
Figure 3-7: (a) During experiments with an extended observation, modules had access
to 12 bits of observation as shown here. (b) Smoothed (100 point moving window),
downsampled average rewards obtained by 6 modules over 10 trials, with standard
error: comparison between basic GAPS and LLGAPS.
In the centralized factored case, learning should be faster
than in the distributed case, as the learner has more data on each episode. However,
as the size of the robot (i.e., the number of modules acting and learning together)
grows, so does the discrepancy between the cumulative experience of all agents and the in-
dividual experience of one agent. When many modules are densely packed into an
initial shape at the start of every episode, we can expect the ones “stuck” in the
middle of the shape to not gather any experience at all, and to receive zero reward,
for at least as long as it takes the modules on the perimeter to learn to move out of
the way. While a randomized initial position is supposed to mitigate this problem,
as the number of modules grows, so grows the probability that any one module will
rarely or never be on the perimeter at the outset of an episode. This is a direct
consequence of our decision to start locomotion from a tightly packed rectangular
configuration with a height-to-length ratio closest to 1. That decision was motivated
by the self-unpacking and deploying scenario often mentioned in SRMR research.
Hypothesis: Distributed GAPS will learn much slower than centralized factored
GAPS, and may not learn at all.
Figure 3-8: Comparison of learning performance between the centralized, factored
GAPS and the distributed GAPS (DGAPS) implementations: smoothed (10-point
window) average rewards obtained during 10 learning trials with 15 modules. (a)
first 10,000 episodes, and (b) up to 100,000 episodes of learning.
Figure 3-9: Screenshot sequence of 15 modules executing the best policy found by
GAPS for eastward locomotion.
Table 3.2: Rules for eastward locomotion, distilled from the best policy learned by
GAPS. Each rule maps a local configuration of the current actor and its neighbor
modules to one of the actions NE, SE, E, or S.
The learned policy is rewarded only for how far east the robot has gone at the end of an episode. On the other hand, the hand-designed
policies of Butler et al. (2001) were specifically developed to maintain the convex,
roughly square or cubic overall shape of the robot, and to avoid any holes within that
shape. It turns out that one can go farther faster if this constraint is relaxed. The
shape-maintaining hand-designed policy, executed by 15 modules for 50 time steps
(as shown in figure 3.5.6), achieves an average reward per episode of 5.8 (σ = 0.9),
whereas its counterpart learned using GAPS (execution sequence shown in figure
3.5.6) achieves an average reward of 16.5 (σ = 1.2). Table 3.2 graphically represents
the rules distilled from the best policy learned by GAPS. The robot executing this
policy unfolds itself into a two-layer thread, then uses a thread-gait to move East.
While very good at maximizing this particular reward signal, these policies no longer
have the “nice” properties of the hand-designed policies of Butler et al. (2001). By
learning with no constraints and a very simple objective function (maximize horizon-
tal displacement), we forgo any maintenance of shape, or indeed any guarantees that
the learning algorithm will converge on a policy that is a locomotion gait. The best
we can say is that, given the learning setup, it will converge on a policy that locally
maximizes the simple reward.
Table 3.3: Learned rules for eastward locomotion: a locally optimal gait.
[Graphical rule diagrams: local Moore-neighborhood patterns, with t marking the current actor and d a neighbor module, mapped to the actions N, NE, E, SE, and S.]
Figure 3-11: Locomotion of nm modules following rules in table 3.3, treated as de-
terministic. After the last state shown, module n executes action S, modules 1 then
2 execute action E, then module 4 can execute action NE, and so forth. The length
of the sequence is determined by the dimensions n and m of the original rectangular
shape.
assumption of synchronous turn-taking. Figure 3-11 shows the first several actions
of the only possible cyclic sequence for nm modules following these rules if treated
as deterministic. The particular assignment of module IDs in the figure is chosen for
clarity and is not essential to the argument.
However, the rules are stochastic. The probability of the robot’s center of mass
moving eastward over the course of an episode is equal to the probability, where there
are T turns in an episode, that during t out of T turns the center of mass moved
eastward and t > T − t. As θ(o, a) → ∞, π(o, a, θ) → 1 for the correct action a, so t will also grow and correct actions become increasingly likely. And when the correct actions are executed, the center of mass is always expected to move eastward during a turn.
Naturally, in practice once the algorithm has converged, we can extract deterministic
rules from the table of learned parameters by selecting the highest parameter value
per state. However, it has also been shown (Littman 1994) that for some POMDPs
the best stochastic state-free policy is better than the best deterministic one. It may
be that randomness could help in our case also.
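As a concrete illustration of this extraction step, the short sketch below (Python, with a hypothetical [n_obs, n_actions] parameter table) picks, for every local observation, the action with the largest learned parameter, i.e. the mode of the softmax policy:

    import numpy as np

    def extract_deterministic_rules(theta):
        """theta: table of learned GAPS parameters, one row per local observation.
        Returns, for each observation, the index of the action with the highest
        parameter value (the most probable action under the softmax)."""
        return np.argmax(theta, axis=1)

Sampling from the softmax instead of taking the argmax preserves the stochastic behavior whose potential benefits are noted above.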
While some locally optimal policies will indeed generate locomotion gaits like the
one just described, others will instead form protrusions in the direction of higher
reward without moving the whole robot. Figure 3-12 shows some configurations
Figure 3-12: Some locally optimal policies result in robot configurations that are not
locomotion gaits.
resulting from such policies — the prevalence of these will grow with the robot size.
In the next chapter, we examine in detail several strategies for constraining the policy
search and providing more information to the learning agents in an effort to avoid
local optima that do not correspond to acceptable locomotion gaits.
Figure 3-13: Scalability issues with GAPS: 15, 20 and 100 modules. (a) Smoothed
(10-point window) average reward over 10 learning trials. (b) Mean number of stuck
configurations, out of 10 test trials for each of 10 learned policies after 10,000 episodes
of learning, with standard error.
no. modules no. bad configs no. bad policy
15 0.1±0.1 2±1.3
20 0.9±0.8 3±1.5
Table 3.4: Results of introducing a stability requirement into the simulator: mean
number of non-gaits, out of 10 test runs, for 10 learned policies, with standard error
(“bad policy” refers to tests in which the modules stopped moving after a while de-
spite being in an acceptable configuration; this effect is probably due to the reward
structure in these experiments: since unstable configurations are severely penalized,
it is possible that the modules prefer, faced with a local neighborhood that previously
led to an unstable configuration, to not do anything.)
Figure 3-14: Results of introducing a stability constraint into the simulator: two
non-gait configurations.
At this point we might note that some of the undesirable configurations shown in figure 3-12 would not be possible on a physical robot, since they are not stable. Although we explicitly state our intent to work only with abstract rules of motion, we might nevertheless consider introducing another rule enforced by the simulator: (3) the robot’s center of mass must be within its footprint. In a
series of experiments, we let the modules learn in this new environment, where any
episode where rule (3) is broken is immediately terminated with a reward of -10 to
all modules (centralized factored learning).
The results of the experiments (table 3.4) show that non-gait local optima still
exist, even though unstable long protrusions are no longer possible. Some such con-
figurations can be seen in figure 3-14a and 3-14b. For the rest of this thesis, we return
to the original two rules of the simulator.
number of parameters The more parameters we have to estimate, the more data
or experience is required (sample complexity), which results in slower learning
for a fixed amount of experience per history (episode of learning).
starting point Stochastic gradient ascent will only converge to a local optimum
in policy space. Therefore the parameter settings at which the search begins
matter for both the speed of convergence and the goodness of the resulting
policy.
Policy representation
Clearly, the number of possible observations and executable actions determines the
number of parameters in a full tabular representation. In a feature-based represen-
tation, this number can be reduced. However, we have seen in this chapter that the
recasting of the policy space into a number of log-linearly compositional features is
not a straightforward task. Simple reduction in the number of parameters does not
guarantee a more successful learning approach. Instead, good policies may be ex-
cluded from the reduced space representation altogether. There may be more local optima that do not correspond to acceptable policies at all, or fewer optima overall, either of which would result in a harder search problem.
The number of parameters to set can also be reduced by designing suitable pre-
processing steps on the observation space of each agent.
Robot size
In modular robots, especially self-reconfiguring modular robots, the robot size influ-
ences the effective number of parameters to be learned. We can see this relationship
clearly in our case study of locomotion by self-reconfiguration on a 2D lattice-based
robot. The full tabular representation of a policy in this case involves 2^8 = 256 possible local observations and 9 possible actions, for a total of 256 × 9 = 2,304 parameters.
However, if the robot is composed of only two modules, learning to leap-frog over
Figure 3-15: All possible local observations for (a) two modules, and (b) three mod-
ules, other than those in (a).
each other in order to move East, then effectively each agent only sees 4 possible
observations, as shown in figure 3-15a. If there are only three modules, there are only
18 possible local observations. Therefore the number of parameters that the agent
has to set will be limited to 4 × 9 = 36 or 18 × 9 = 162 respectively. This effective
reduction will result in faster learning for smaller robots composed of fewer modules.
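The arithmetic behind these counts is just the product of reachable local observations and available actions; a two-line check (Python) of the figures quoted above:

    n_actions = 9
    for n_obs, case in [(4, "2 modules"), (18, "3 modules"), (2**8, "full table")]:
        print(case, n_obs * n_actions, "parameters")   # 36, 162, and 2304 respectively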
Search constraints
Another way of reducing the effective number of parameters to learn is to impose
constraints on the search by stochastic gradient ascent. This can be achieved, for
example, by containing exploration to only a small subset of actions.
In SRMRs, robot size also contributes to the relationship between episode length
and quality of learning updates. In particular, for our case study task of locomotion
by self-reconfiguration, there is a clear dependency between the number of modules in
a robot and the capability of the simple reward function to discriminate between good
and bad policies. To take an example, there is a clear difference between 15 modules
executing a good locomotion gait for 50 timesteps and the same number of modules
forming an arm-like protrusion, such as the one shown in figure 3-12a. This difference
is both visible to us and to the learning agent through the reward signal (center of
mass displacement along the x axis), which will be much greater in the case of a
locomotion gait. However, if 100 modules followed both policies, after 50 timesteps,
due to the nature of the threadlike locomotion gaits, the robot would not be able to
differentiate between what we perceive as a good locomotion policy, or the start of
an arm-like protrusion, based on reward alone. Therefore, as robot size increases, we
must either provide more sophisticated reward functions, or increase episode length
to increase the amount of experience per episode for the learning modules.
Shared experience
We have seen that centralized factored GAPS can learn good policies while fully
distributed GAPS cannot. This is due to the amount and quality of experience the
modules have access to during learning. In the centralized version, each module
makes the same updates as all others as they share all experience: the sum total of
all encountered observations and all executed actions play into each update. In the
distributed version, each module is on its own for collecting experience and only has
access to its own observation and action counts. This results in extremely poor to
nonexistent exploration for most modules — especially those initially placed at the
center of the robot. If it were possible for modules to share their experience in a fully
distributed implementation of GAPS, we could expect successful learning due to the
increase in both amount and quality of experience available to each learner.
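The following sketch (Python; the names and the way counters are stored are our own assumptions) shows the centralized factored update in which pooled counts from all modules produce one shared parameter change — precisely the pooling that fully distributed learners lose when each module keeps only its own counts:

    import numpy as np

    def softmax_rows(theta, beta=1.0):
        z = beta * (theta - theta.max(axis=1, keepdims=True))
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def centralized_factored_update(theta, per_module_counts, R, alpha):
        """theta: [n_obs, n_actions]; per_module_counts: one (C_o, C_oa) pair per
        module, where C_o holds observation counts and C_oa observation-action
        counts from the episode; R: the shared episode reward."""
        C_o = sum(c_o for c_o, _ in per_module_counts)     # pooled experience
        C_oa = sum(c_oa for _, c_oa in per_module_counts)
        pi = softmax_rows(theta)
        return theta + alpha * R * (C_oa - C_o[:, None] * pi)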
3.7 Summary
We have formulated the locomotion problem for a SRMR as a multi-agent POMDP
and applied gradient-ascent search in policy value space to solve it. Our results
suggest that automating controller design by learning is a promising approach. We
should, however, bear in mind the potential drawbacks of direct policy search as the
learning technique of choice.
As with all hill-climbing methods, there is a guarantee of GAPS converging to a
local optimum in policy value space, given infinite data, but no proof of convergence
to the global optimum is possible. A local optimum is the best solution we can find
to a POMDP problem. Unfortunately, not all local optima correspond to reasonable
locomotion gaits.
In addition, we have seen that GAPS takes on average a rather long time (mea-
sured in thousands of episodes) to learn. We have identified three key issues that con-
tribute to the speed and quality of learning in stochastic gradient ascent algorithms
such as GAPS, and we have established which robot parameters can contribute to
the make-up of these three variables. In this chapter, we have already attempted, un-
successfully, to address one of the issues — the number of policy parameters to learn
— by introducing feature spaces. In the next two chapters, we explore the influence
of robot size, search constraints, episode length, information sharing, and smarter
policy representations on the speed and reliability of learning in SRMRs. The goal to keep in sight as we report the results of those experiments is to find the right mix of automation, easily available constraints, and information that will help guide the automated search for good distributed controllers.
Chapter 4
We have established that stochastic gradient ascent in policy space works in principle
for the task of locomotion by self-reconfiguration. In particular, if modules can some-
how pool their experience together and average their rewards, provided that there
are not too many of them, the learning algorithm will converge to a good policy.
However, we have also seen that even in this centralized factored case, increasing the size of the robot uncovers the algorithm’s susceptibility to local optima which do not correspond to acceptable policies. In general, local search will be plagued by
these unless we can provide either a good starting point, or guidance in the form of
constraints on the search space.
Modular robot designers are usually well placed to provide either a starting point
or search constraints, as we can expect them to have some idea about what a reason-
able policy, or at least parts of it, should look like. In this chapter, we examine how
specifying such information or constraints affects learning by policy search.
Figure 4-1: Determining if actions are legal for the purpose of constraining exploration: (a) A = {NOP}, (b) A = {NOP}, (c) A = {NOP}, (d) A = {2(NE), 7(W)}.
of parameters that need to be learned and (2) causes the initial exploration phases
to be more efficient because the robot will not waste its time trying out impossible
actions. The second effect is probably more important than the first.
The following rules were used to pre-select the subset A_i^t of actions possible for module i at time t, given the local configuration as the immediate Moore neighborhood (see also figure 4-1):
4. A_i^t = {legal actions based on local neighbor configuration and the sliding cube model}
5. A_i^t = A_i^t − {any action that would lead into an already occupied cell}
These rules are applied in the above sequence and incorporated into the GAPS algorithm by setting the corresponding θ(o_t, a_t) to a large negative value, thereby making it extremely unlikely that actions not in A_i^t would be randomly selected by the policy. Those parameters are not updated, thereby constraining the search at every time step.
1. NOP stands for ‘no operation’ and means the module’s action is to stay in its current location and not attempt any motion.
2. This is a very conservative rule, which prevents disconnection of the robot locally.
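A minimal sketch of this masking (Python; the tabular parameter array and the externally supplied set of legal action indices are hypothetical names): illegal actions receive a large negative score, so the softmax essentially never samples them, and only the legal entries are ever updated.

    import numpy as np

    NEG = -1.0e6   # effectively removes an action from the softmax

    def masked_action_probs(theta, obs, legal_actions):
        """Pre-screening: actions outside the legal subset get a large negative
        score, so their probability under the softmax is (near) zero."""
        scores = np.full(theta.shape[1], NEG)
        for a in legal_actions:
            scores[a] = theta[obs, a]
        z = scores - scores.max()
        p = np.exp(z)
        return p / p.sum()

    def masked_update(theta, grad, obs, legal_actions, alpha):
        """Only the parameters of legal actions are updated, so the constraint
        persists from one time step to the next."""
        for a in legal_actions:
            theta[obs, a] += alpha * grad[obs, a]
        return theta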
We predict that the added space structure and constraints that were introduced
here will result in the modular robot finding good policies with less experience.
Starting from a partially known policy may be especially important for problems
where experience is scarce.
Figure 4-3: Some local optima which are not locomotion gaits, found by GAPS with
pre-screening for legal actions and (a)-(f) 1-hop communications, (g)-(h) multi-hop
communications.
of its actions. The module will be able to tell whether any given action will generate
a legal motion onto a new lattice cell, or not. However, the module still will not
know which state (full configuration of the robot) this motion will result in. Nor will
it be able to tell what it will observe locally at the next timestep. The amount of
knowledge given is really minimal. Nevertheless, with these constraints the robot no
longer needs to learn not to try and move into lattice cells that are already occupied,
or those that are not immediately attached to its neighbors. Therefore we predicted
that less experience would be required for our learning algorithms.
Hypothesis: Constraining exploration to legal actions only will result in faster learn-
ing (shorter convergence times).
In fact, figure 4-2 shows that there is a marked improvement, especially as the
number of modules grows.
15 modules 20 modules
GAPS 0.1±0.1 5±1.7
GAPS with pre-screened legal actions
no extra comms 0.3±0.3 4.2±1.5
1-hop 0.3±0.2 2.5±1.3
multi-hop 0.3±0.2 0.2±0.1
Table 4.1: In each of 10 learning trials, the learning was stopped after 10,000 episodes
and the resulting policy was tested 10 times. The table shows the mean number of
times, with standard error, during these 10 test runs, that modules became stuck in
some configuration due to a locally optimal policy.
any, will drive them further along the protrusion, and no module will attempt to go
down and back as the rewards in that case will immediately be negative.
Local optima are always present in policy space, and empirically it seems that GAPS, with or without pre-screening for legal motions, is more likely to converge to one of them as the number of modules comprising the robot increases.
Having encountered this problem in scaling up the modular system, we introduce new information to the policy search algorithms to reduce the chance of convergence to an unacceptable local optimum.
4.3.3 Extra communicated observations
In our locomotion task, locally optimal but globally suboptimal policies do not allow
modules to wait for their neighbors to form a solid supporting base underneath them.
Instead, they push ahead forming long protrusions. We could reduce the likelihood
of such formations by introducing new constraints, artificially requiring modules to
stop moving and wait in certain configurations. For that purpose, we expand the
local observation space of each module by one bit, which is communicated by the
module’s South/downstairs neighbor, if any. The bit is set if the South neighbor is
not supported by the ground or another module underneath. If a module’s bit is set
it is not allowed to move at this timestep. We thus create more time in which other
modules may move into the empty space underneath to fill in the base. Note that
although we have increased the number of possible observations by a factor of 2 by
introducing the extra bit, we are at the same time restricting the set of legal actions
in half of those configurations to {NOP}. Therefore, we do not expect GAPS to
require any more experience to learn policies in this case than before.
We investigate two communication algorithms for the setting of the extra bit.
If the currently acting module is M1 and it has a South neighbor M2, then either 1) M1 asks M2 if it is supported; if not, M2 sends the set-bit message and M1’s bit is set (this is the one-hop scheme), or 2) M1 generates a support request that propagates South until either the ground is reached, or one of the modules replies with the set-bit message and all of their bits are set (this is the multi-hop scheme).
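One possible rendering of the multi-hop scheme, sketched below in Python under the assumption that cells are (x, y) coordinates with the ground at y = 0 (not the representation used by our simulator): the request walks down the chain of South neighbors, and if any module in that chain hangs with nothing underneath, every requester above it must set its bit and choose NOP this time step.

    def multi_hop_wait_bits(module_cell, occupied):
        """module_cell: (x, y) of the requesting module; occupied: set of (x, y)
        cells holding modules.  Returns the set of cells whose wait-bit should
        be set as a result of this support request."""
        x, y = module_cell
        requesters = [(x, y)]
        below = (x, y - 1)
        while below in occupied:
            bx, by = below
            if by == 0:
                return set()               # chain rests on the ground: nobody waits
            if (bx, by - 1) not in occupied:
                return set(requesters)     # `below` is unsupported: set-bit reply
            requesters.append(below)
            below = (bx, by - 1)
        return set()                       # no South neighbor below: no request needed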
Experimentally (see table 4.1) we find that it is not enough to ask just one neigh-
bor. While the addition of a single request on average halved the number of stuck con-
figurations per 10 test trials for policies learned by 20 modules, the chain-of-support
multi-hop scheme generated almost no such configurations. On the other hand, the
addition of communicated information and waiting constraints does not seem to af-
fect the amount of experience necessary for learning. Figure 4-4a shows that the learning curves for both sets of experiments with 15 modules are practically indistinguishable.
However, the average obtained rewards should be lower for algorithms that con-
sistently produce more suboptimal policies. This distinction is more visible when the
modular system is scaled up. When 20 modules run the four proposed versions of gra-
dient ascent, there is a clear distinction in average obtained reward, with the highest
value achieved by policies resulting from multi-hop communications (see figure 4-4b).
The discrepancy reflects the greatly reduced number of trial runs in which modules
get stuck, while the best found policies remain the same in all conditions.
information to the modules? In particular, if the modules knew the required direction
of motion, would the number of “stuck” configurations be zero?
Figure 4-5: Two global configurations that are aliased assuming only local, partial
observations, even in the presence of directional information: (a) module M is really
at the front of the robot and should not move, (b) module M should be able to move
despite having no “front” neighbors.
height=3 height=4
# modules baseline baseline multi-hop
15 0.1±0.1 0.2±0.1 -
16 0.1±0.1 5.4±1.3 2.1±1.0
17 0.1±0.1 3.9±1.5 2.5±1.3
18 1.0±1.0 1.5±1.1 0.5±0.3
19 0.2±0.1 3.3±1.3 1.1±1.0
20 0.0±0.0 5.0±1.7 0.2±0.1
Table 4.2: Mean number of non-gait stuck configurations per 10 test trials, with
standard error, of 10 policies learned by modules running centralized GAPS with
(multi-hop) or without (baseline) pre-screening for legal actions and multi-hop com-
munications.
Large size effect Conversely, when the robot is composed of upwards of 100 mod-
ules, the sheer scale of the problem results in a policy space landscape with
drastically more local optima which do not correspond to locomotion gaits. In
this case, even with the best initial conditions, modules will have trouble learn-
ing in the basic setup. However, biasing the search and scaling up incrementally,
as we explore in this chapter, will help.
Therefore, when comparing different algorithms and the effect of extra constraints
and information, throughout this thesis, we have selected two examples to demon-
strate these effects: 1) 15 modules in a lower aspect ratio initial configuration (height/width
= 3/5), which is helpful for horizontal locomotion, and 2) 20 modules in a higher as-
pect ratio initial condition (height/width = 4/5), which creates more local optima
related problems for the learning process. The initial configurations were chosen to
most closely mimic a tight packing for transportation in rectangular “boxes”.
In table 4.2, we report one of the performance measures used throughout this
thesis (the mean number of non-gait configurations per 10 test runs of a policy) for
all robot sizes between 15 and 20 modules, in order to demonstrate the effect of the
initial condition. In all cases, at the start of every episode the robot’s shape was a
25 modules 90 modules 100 modules
height=3 1.1±1.0 9±1.0 9.4±0.4
Table 4.3: Larger robots get stuck with locally optimal non-gait policies despite a
lower aspect ratio of the initial configuration.
Figure 4-6 shows how incremental GAPS performs on the locomotion task. Its
effect is most powerful when there are only a few modules learning to behave at
the same time; when the number of modules increases beyond the size of the local
observation space, the effect becomes negligible. Statistical analysis fails to reveal any
significant difference between the mean rewards obtained by basic GAPS vs. IGAPS
for 6, 15, or 20 modules. Nevertheless, we see in figure 4-7b that only 2.1 out of 10 test
runs on average produce arm-like protrusions when the algorithm was seeded in the
Figure 4-6: Smoothed, downsampled average rewards, with standard error, over 10
trials: comparison between basic GAPS and the incremental extension (a) for 6 mod-
ules and (b) for 15 modules.
incremental way, against 5 out of 10 for basic GAPS. The lack of statistical significance
on the mean rewards may be due to the short time scale of the experiments: in
50 timesteps, 20 modules may achieve very similar rewards whether they follow a
locomotion gait or build an arm in the direction of larger rewards. A drawback
of the incremental learning approach is that it will only work to our advantage on
tasks where the optimal policy does not change as the number of modules increases.
Otherwise, IGAPS may well lead us more quickly to a globally suboptimal local
maximum.
We have taken the incremental idea a little further in a series of experiments where
robot size was increased in larger increments. Starting with 4 modules in a 2 × 2
configuration, we have added enough modules at a time to increase the square length
by one, eventually reaching a 10 × 10 initial configuration. The results can be seen in figure 4-8 and table 4.4. The number of locally optimal non-gait configurations
was considerable for 100 modules, even in the IGAPS condition, but it was halved
with respect to the baseline GAPS.
Figure 4-9 shows the results of another experiment in better starting points. Usu-
ally the starting parameters for all our algorithms are initialized to either small ran-
dom numbers or all zeros. However, sometimes we have a good idea of what certain
parts of our distributed controller should look like. It may therefore make sense to
seed the learning algorithm with a good starting point by imparting to it our incom-
plete knowledge. We have made a preliminary investigation of this idea by partially
specifying two good policy “rules” before learning started in GAPS. Essentially we
initialized the policy parameters to a very strong preference for the ‘up’ (North) ac-
tion when the module sees neighbors only on its right (East), and a correspondingly
strong preference for the ‘down’ (South) action when the module sees neighbors only
on its left (West).
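One way this seeding could look, as a sketch only (the bit ordering of the Moore neighborhood and the action indexing below are our own assumptions, not the encoding used in the simulator): observations whose occupied cells lie entirely on the East side are biased toward North, and their West-side mirror images toward South.

    import numpy as np

    CELLS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]   # assumed bit order
    EAST_SIDE = {CELLS.index(c) for c in ("NE", "E", "SE")}
    WEST_SIDE = {CELLS.index(c) for c in ("SW", "W", "NW")}
    N_OBS, N_ACTIONS, BIAS = 2**8, 9, 10.0                 # BIAS: a strong preference

    def occupied_cells(obs):
        return {k for k in range(8) if obs & (1 << k)}

    theta = np.zeros((N_OBS, N_ACTIONS))
    for obs in range(N_OBS):
        cells = occupied_cells(obs)
        if cells and cells <= EAST_SIDE:
            theta[obs, CELLS.index("N")] = BIAS   # neighbors only on the right: go up
        elif cells and cells <= WEST_SIDE:
            theta[obs, CELLS.index("S")] = BIAS   # neighbors only on the left: go down
    # GAPS then starts its search from these parameters instead of all zeros.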
15 modules 20 modules
GAPS 0.1±0.1 5±1.7
IGAPS 0.3±0.3 2.1±1.3
Figure 4-7: Incremental GAPS with 20 modules: (a) Smoothed (100-point window),
downsampled average rewards obtained by basic vs. incremental GAPS while learning
locomotion, (b) the table shows the mean number, with standard error, of dysfunctional configurations, out of 10 test trials for each of 10 learned policies, for 15 and 20 modules learning with basic vs. incremental GAPS; IGAPS helps as the number of modules grows.
Hypothesis: Seeding GAPS with a partially known policy will result in faster learn-
ing and fewer local optima.
Table 4.4: Mean number, with standard error, of locally optimal configurations which
do not correspond to acceptable locomotion gaits out of 10 test trials for 10 learned
policies: basic GAPS algorithm vs. incremental GAPS with increments by square side
length. All learning trials had episodes of length T=50, but test trials had episodes
of length T=250, in order to give the larger robots time to unfold.
relies more heavily on horizontal motion. The effect will not necessarily generalize to
any designer-determined starting point. In fact, we expect there to be a correlation
between how many “rules” are pre-specified before learning and how fast a good
policy is found.
We expect extra constraints and information to be even more useful when expe-
rience is scarce, as is the case when modules learn independently in a distributed
fashion without tying their policy parameters to each other.
Figure 4-9: Smoothed average rewards over 10 trials: effect of introducing a partially
known policy for (a) 15 modules and (b) 20 modules.
expect that it will take proportionately more time for individual modules to find
good policies on their own.
Here we describe the results of experiments on modules learning individual policies
without sharing observation or parameter information. In all cases, the learning rate
started at α = 0.01, decreased uniformly over 7,000 episodes until 0.001 and remained
at 0.001 thereafter. The inverse temperature started at β = 1, increased uniformly
over 7,000 episodes until 3 and remained at 3 thereafter. The performance curves in
this section were smoothed with a 100-point moving average and downsampled for
clarity.
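Interpreting “decreased/increased uniformly” as linear interpolation (our reading of the schedule), the annealing used in these runs can be written as:

    def schedules(episode, ramp=7000):
        """Learning rate alpha falls linearly from 0.01 to 0.001 over the first
        `ramp` episodes and then stays there; inverse temperature beta rises
        linearly from 1 to 3 over the same range and then stays there."""
        frac = min(episode / float(ramp), 1.0)
        alpha = 0.01 + frac * (0.001 - 0.01)
        beta = 1.0 + frac * (3.0 - 1.0)
        return alpha, beta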
We see in figure 4-10 that the basic unconstrained version of the algorithm is
struggling in the distributed setting. Despite a random permutation of module po-
sitions in the start state before each learning episode, there is much less experience
available to individual agents. Therefore we observe that legal actions constraints
and extra communicated observations help dramatically. Surprisingly, without any of
these extensions, GAPS is also greatly helped by initializing the optimization with a
partially specified policy. The same two “rules” were used in these experiments as in
section 4.3.5. A two-way 4 × 2 ANOVA on the mean rewards after 100,000 episodes
was performed with the following independent variables: information and constraints
group (one of: none, legal pre-screening, 1-hop communications and multi-hop com-
munications), and partially known policy (yes or no). The results reveal a statistically
significant difference in means for both main factors, and for their simple interaction,
at the 99% confidence level.³
3. These results do not reflect any longitudinal data or effects.
Figure 4-10: Smoothed, downsampled average rewards over 10 trials with 15 modules
learning to locomote using only their own experience, with standard error: (a) basic
distributed GAPS vs. GAPS with pre-screening for legal motions vs. GAPS with
pre-screening and 1-hop (green) or multi-hop (cyan) communications with neighbors
below, and (b) basic distributed GAPS vs. GAPS with all actions and a partially
specified policy as starting point.
Hypothesis: Seeding policy search with a partially known policy will help dramat-
ically in distributed GAPS.
Figure 4-11a compares the average rewards during the course of learning in four conditions of GAPS, all initialized with the same partial policy: 1) basic GAPS with no extra constraints, 2) GAPS with pre-screening for legal actions only, 3) GAPS with legal actions and an extra observation bit communicated by the 1-hop protocol, and 4) by the multi-hop protocol. The curves in 1) and 4) are almost
indistinguishable, whereas 2) is consistently lower. However, this very comparable
performance can be due to the fact that there is not enough time in each episode (50
steps) to disambiguate between an acceptable locomotion gait and locally optimal
protrusion-making policies.
Hypothesis: Test runs with longer episodes will disambiguate between performance
of policies learned with different sources of information. In particular, dis-
tributed GAPS with multi-hop communications will find many fewer bad local
optima.
Figure 4-11: (a) Smoothed, downsampled averaged reward over 10 trials of 15 mod-
ules running algorithms seeded with a partial policy, with and without pre-screening
for legal actions and extra communicated observations, with standard error, (b) Av-
erage rewards obtained by 15 modules with episodes of length T = 150 after 100,000
episodes of distributed GAPS learning seeded with partially known policy, with var-
ious degrees of constraints and information, with standard deviation.
4.5 Discussion
The experimental results of section 4.3 are evidence that reinforcement learning can
be used fruitfully in the domain of self-reconfigurable modular robots. Most of our
results concern the improvement in SRMR usability through automated development
of distributed controllers for such robots; and to that effect we have demonstrated
that, provided with a good policy representation and enough constraints on the search
space, gradient ascent algorithms converge to good policies given enough time and
experience. We have also shown a number of ways to structure and constrain the
learning space such that less time and experience is required and local optima become
less likely. Imparting domain knowledge or partial policy knowledge requires more
involvement from the human designer, and our desire to reduce the search space in
this way is driven by the idea of finding a good balance between human designer
skills and the automatic optimization. If the right balance is achieved, the humans
can seed the learning algorithms with the kinds of insight that are easy for us to provide; and the robot can then learn to improve on its own.
We have explored two ways of imparting knowledge to the learning system. On
the one hand, we constrain exploration by effectively disallowing those actions that
would result in a failure of motion (sections 4.1 and 4.3.1) or any action other than
NOP in special cases (section 4.3.3). On the other hand, we initialize the algorithm
at a better starting point by an incremental addition of modules (sections 4.2.1 and
4.3.5) or by a partially pre-specified policy (section 4.3.5). These experiments suggest
that a good representation is very important for learning locomotion gaits in SRMRs.
Local optima abound in the policy space when observations consist of the 8 bits of the immediate Moore neighborhood. Restricting exploration to legal motions only
15 modules 20 modules
T=50 T=50 T=100
Distributed GAPS 10±0.0 10±0.0
DGAPS with pre-screened legal actions
1-hop comms 1.3±0.2 10±0.0 9.6±0.4
multi-hop comms 0.9±0.3 9.6±0.4 3.7±1.1
DGAPS with partially known policy
no extra restrictions or info 1.2±0.4 10±0.0 9.9±0.1
pre-screened legal actions 3.8±0.6 8.8±0.6 9.5±0.4
+ 1-hop 0.8±0.2 9.9±0.1 7.6±0.7
+ multi-hop 0.3±0.3 4.7±1.0 1.3±0.3
Table 4.5: In each of the 10 learning trials, the learning was stopped after 100,000
episodes and the resulting policy was tested 10 times. The table shows the mean
number of times, with standard error, during these 10 test runs for each policy, that
modules became stuck in some configuration due to a locally optimal policy.
did not prevent this problem. It seems that more information than what is available
locally is needed for learning good locomotion gaits.
Therefore, as a means to mitigate the prevalence of local optima, we have intro-
duced a very limited communications protocol between neighboring modules. This
increased the observation space and potentially the number of parameters to esti-
mate. However, the additional search constraint of forbidding any motion when lack of support is communicated to the module allowed modules to avoid the pitfalls of local optima substantially more often.
We have also found that local optima were more problematic the more modules
were acting and learning at once; they were also preferentially found by robots which
started from taller initial configurations (higher aspect ratio). The basic formulation
of GAPS with no extra restrictions moved into a configuration from which the modules would not move only once in 100 test trials when there were 15 modules learning. When there were 20 modules learning, this number increased to almost half of all
test trials. It is important to use resources and build supporting infrastructure into
the learning algorithm in a way commensurate with the scale of the problem; more
scaffolding is needed for harder problems involving a larger number of modules.
Restricting the search space and seeding the algorithms with good starting points
is even more important when modules learn in a completely distributed fashion from
their experience and their local rewards alone (section 4.4). We have shown that in
the distributed case introducing constraints, communicated information, and partially
known policies all contribute to successful learning of a locomotion gait, where the
basic distributed GAPS algorithm has not found a good policy in 100,000 episodes.
This is not very surprising, considering the amount of exploration each individual
module needs to do in the unrestricted basic GAPS case. However, given enough
constraints and enough experience, we have shown that it is possible to use the same
RL algorithms on individual modules. It is clear that more work is necessary before
GAPS is ported to physical robots: for instance, we cannot afford to run a physical
system for 100,000 episodes of 50 actions per module each.
In the next chapter, we extend our work using the idea of coordination and com-
munication between modules that goes beyond single-bit exchanges.
Chapter 5
Agreement in Distributed
Reinforcement Learning
5.1 Agreement Algorithms
Agreement (also known as consensus) algorithms are pervasive in distributed systems
research. First introduced by Tsitsiklis et al. (1986), they have been employed in many
fields including control theory, sensor networks, biological modeling, and economics.
They are applicable in synchronous and partially asynchronous settings, whenever a group of independently operating agents can iteratively update variables in their storage with a view to converging to a common value for each variable. Flocking is
a common example of an agreement algorithm applied to a mobile system of robots
or animals: each agent maintains a velocity vector and keeps averaging it with its
neighbors’ velocities, until eventually, all agents move together in the same direction.
Formally, each processor i maintains a value x_i and updates it according to
x_i(t + 1) = Σ_j a_ij x_j(τ_j^i(t)),  if t ∈ T^i,
x_i(t + 1) = x_i(t),  otherwise,
where a_ij are nonnegative coefficients which sum to 1, T^i is the set of times at which processor i updates its value, and τ_j^i(t) determines the amount of time by which the value x_j is outdated. If the graph representing the communication network among the processors is connected, and if there is a finite upper bound on communication delays between processors, then the values x_i will exponentially converge to a common intermediate value x such that x_min(0) ≤ x ≤ x_max(0) (part of Proposition 3.1 in
Bertsekas & Tsitsiklis (1997)). Exactly what value x the processors will agree on in
the limit will depend on the particular situation.
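As a toy illustration only (not the modules' algorithm), the synchronous Python sketch below uses equal coefficients, no communication delays, and a fixed connected graph; the values converge to a common number between the initial minimum and maximum, though not necessarily to their mean.

    import numpy as np

    def synchronous_agreement(x0, neighbors, iters=200):
        """x0: dict node -> initial value; neighbors: dict node -> list of neighbors.
        Every node repeatedly replaces its value with the average of itself and
        its neighbors (equal coefficients a_ij, no delays)."""
        x = dict(x0)
        for _ in range(iters):
            x = {i: np.mean([x[i]] + [x[j] for j in neighbors[i]]) for i in x}
        return x

    values = synchronous_agreement({1: 0.0, 2: 0.0, 3: 5.0}, {1: [2], 2: [1, 3], 3: [2]})
    # All three values end up (approximately) equal, somewhere between 0 and 5.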
The partially asynchronous gradient-like version of this update takes the form
x_i(t + 1) = Σ_j a_ij x_j(τ_j^i(t)) + γ s_i(t),
where γ is a small decreasing positive step size, and s_i(t) is in the possibly noisy direction of the gradient of some continuously differentiable nonnegative cost function, with the noise denoted w_i(t). It is assumed that the noise is uncorrelated in time and independent for different i's.
However, GAPS is a stochastic gradient ascent algorithm, which means that the
updates performed climb the estimated gradient of the policy value function ∇V̂ ,
which may, depending on the quality of the agents’ experience, be very different from
∇V , as shown in figure 5-1. A requirement of the partially asynchronous gradient-
like update algorithm described in this section is that the update direction si (t) be in
the same quadrant as ∇V , but even this weak assumption is not necessarily satisfied
by stochastic gradient ascent methods. In addition, the “noise”, described by the
wi (t) term above, is correlated in a SRMR, as modules are physically attached to
each other. In the rest of this section, we nevertheless propose a class of methods for
using the agreement algorithm with GAPS, and demonstrate their suitability for the
problem of learning locomotion by self-reconfiguration.
Figure 5-1: Stochastic gradient ascent updates are not necessarily in the direction of
the true gradient.
Figure 5-2: Two implementations of agreement exchanges within the GAPS learning
algorithm: (a) at the end of each episode, and (b) at each time step.
Algorithm 3 Asynchronous ID-based collection of experience. This algorithm can
be run at the end of each episode before the learning update rule.
Require: unique id
send message m = ⟨id, R, C_o, C_oa⟩ to all neighbors
loop
  if timeout then
    proceed to update
  end if
  whenever message m = ⟨id_m, R_m, C_o^m, C_oa^m⟩ arrives
Algorithm 4 Asynchronous average-based collection of experience. This algorithm
may be run at the end of each episode, or the procedure labeled average may be
used at each time step while gathering experience, as shown also in figure 5-2b.
send message m = ⟨R, C_o, C_oa⟩ to a random neighbor
repeat
  if timeout then
    proceed to update
  end if
  whenever message m = ⟨R_m, C_o^m, C_oa^m⟩ arrives
  average:
    reward R ← ½ (R + R_m)
    experience C_o ← ½ (C_o + C_o^m),  C_oa ← ½ (C_oa + C_oa^m)
    send to a random neighbor new message m′ = ⟨R, C_o, C_oa⟩
until converged
update:
  GAPS update rule
this quantity is then substituted into the GAPS update rule, the updates become:
∆θ_oa = αR ( (1/N) Σ_{i=1}^{N} C_oa^i − π_θ(o, a) (1/N) Σ_{i=1}^{N} C_o^i )
      = (1/N) αR (C_oa − C_o π_θ(o, a)),
which is equal to the centralized GAPS updates scaled by a constant 1/N. Note that
synchronicity is also required at the moment of the updates, i.e., all modules must
make updates simultaneously, in order to preserve the stationarity of the underlying
process. Therefore, modules learning with distributed GAPS using an agreement
algorithm with synchronous pairwise disjoint updates to come to a consensus on both
the value of the reward, and the average of experience counters, will make stochastic
gradient ascent updates, and therefore will converge to a local optimum in policy
space.
The algorithm (Algorithm 4) requires the scaling of the learning rate by a constant
proportional to the number of modules. Since stochastic gradient ascent is in general
sensitive to the learning rate, this pitfall cannot be avoided. In practice, however,
scaling the learning rate up by an approximate factor loosely dependent (to the order
of magnitude) on the number of modules, works just as well.
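A minimal sketch of the pairwise exchange and the rescaled update (Python; the module records, the precomputed policy matrix pi, and the exact scaling are illustrative assumptions rather than the implementation used in our experiments):

    import numpy as np

    def pairwise_average(a, b):
        """One Algorithm-4-style exchange between two neighboring modules: both
        replace their reward and experience counters by the pairwise means."""
        R = 0.5 * (a["R"] + b["R"])
        C_o = 0.5 * (a["C_o"] + b["C_o"])
        C_oa = 0.5 * (a["C_oa"] + b["C_oa"])
        a.update(R=R, C_o=C_o.copy(), C_oa=C_oa.copy())
        b.update(R=R, C_o=C_o, C_oa=C_oa)

    def agreed_gaps_update(theta, module, pi, alpha, n_modules):
        """After agreement the counters approximate 1/N of the pooled experience,
        so the learning rate is scaled up by (roughly) N to recover the magnitude
        of the centralized update."""
        return theta + (alpha * n_modules) * module["R"] * (
            module["C_oa"] - module["C_o"][:, None] * pi)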
Another practical issue is the communications bandwidth required for the imple-
mentation of agreement on experience. While many state-of-the-art microcontrollers
have large capacity for speedy communications (e.g., the new Atmel AT32 architecture
includes built-in full speed 12 Mbps USB), depending on the policy representation,
significant downsizing of required communications can be achieved. The most obvious
compression technique is to not transmit any zeros. If exchanges happen at the end of
the episode, instead of sharing the full tabular representation of the counts, we notice
Algorithm 5 Synchronous average-based collection of experience. This algorithm
may be run at the end of each episode, or the procedure labeled average may be
used at each time step while gathering experience, as shown also in figure 5-2b.
send to all n neighbors message m = ⟨R, C_o, C_oa⟩
for all time steps do
  average:
    receive from all n neighbors messages m = ⟨R^m, C_o^m, C_oa^m⟩
    average reward R ← 1/(n+1) (R + Σ_{m=1}^{n} R^m)
    average experience C_o ← 1/(n+1) (C_o + Σ_{m=1}^{n} C_o^m),  C_oa ← 1/(n+1) (C_oa + Σ_{m=1}^{n} C_oa^m)
    send to all n neighbors m′ = ⟨R, C_o, C_oa⟩
end for
update:
  GAPS update rule
that often experience histories h = ⟨o_1, a_1, o_2, a_2, ..., o_T, a_T⟩ are shorter — in fact, only 2T symbols long, where T is the length of an episode. In addition, these can be sorted by observation with the actions combined: h_sorted = ⟨o_1: a_1 a_2 a_1 a_4 a_4, o_2: a_2, o_3: a_3 a_3, ...⟩, and finally, if the number of repetitions in each string of actions warrants it, each observation's substring of actions can be re-sorted according to action, and a lossless compression scheme can be used, such as run-length encoding, giving h_rle = ⟨o_1: 2a_1 a_2 2a_4, o_2: a_2, o_3: 2a_3, ...⟩. Note that this issue unveils the trade-off in experience gathering between longer episodes and sharing experience through neighbor communication.
How can a module incorporate this message into computing the averages of shared experience? Upon receiving a string h_rle^m coming from module m, align own and m's contributions by observation, creating a new entry for any o_i that was not found in its own h_rle. Then for each o_i, combine action substrings, re-sort according to action and re-compress the action substrings such that there is only one number n_j preceding any action a_j. At this point, all of these numbers n are halved, and now represent the pairwise average of the current module's and m's combined experience gathered during this episode. Observation counts (averages) are trivially the sum of action counts (averages) written in the observation's substring.
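A compact Python sketch of this scheme (the dictionary representation is our own choice; actual messages would be encoded strings): counting actions per observation is equivalent to the sort-and-run-length-encode step, and halving the merged counts yields the pairwise average.

    from collections import Counter, defaultdict

    def compress(history):
        """history: [(o_1, a_1), (o_2, a_2), ...].  Group by observation and count
        actions; the counts play the role of the run-length-encoded substrings."""
        table = defaultdict(Counter)
        for o, a in history:
            table[o][a] += 1
        return {o: dict(acts) for o, acts in table.items()}

    def merge_pairwise(h_own, h_other):
        """Align two compressed histories by observation, add the action counts,
        and halve them: the result is the pairwise average of the two modules'
        experience for this episode."""
        merged = defaultdict(Counter)
        for h in (h_own, h_other):
            for o, acts in h.items():
                merged[o].update(acts)
        return {o: {a: n / 2.0 for a, n in acts.items()} for o, acts in merged.items()}

    # The (averaged) observation count for o is simply sum(merged[o].values()).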
Figure 5-3: Smoothed (100-point window), downsampled average center of mass dis-
placement, with standard error, over 10 runs with 15 modules learning to locomote
eastward, as compared to the baseline distributed GAPS and standard GAPS with
given global mean reward. (a) The modules run the agreement algorithm on rewards
to convergence at the end of episodes. (b) Distribution of first occurrences of cen-
ter of mass displacement greater than 90% of maximal displacement for each group:
baseline DGAPS, DGAPS with given mean as common reward, and DGAPS with
agreement in two conditions – two exchanges per timestep and to convergence at the
end of episode.
thereafter. This encouraged more exploration during the initial episodes of learning.
Figure 5-3a demonstrates the results of experiments with the learning modules
agreeing exclusively on the rewards generated during each episode. The two con-
ditions, as shown in figure 5-2, diverged on the time at which exchanges of reward
took place. The practical difference between the end-of-episode versus the during-
the-episode conditions is that in the former, averaging steps can be taken until convergence³ on a stable value of R, whereas in the latter, all averaging stops with the end of the episode, which will likely arrive before actual agreement occurs.
Figure 5-3a shows that despite this early stopping point, not-quite-agreeing on a
common reward is significantly better than using each module’s individual estimate.
As predicted, actually converging on a common value for the reward results, on av-
erage, in very good policies after 100,000 episodes of learning in a fully distributed
manner. By comparison, as we have previously seen, distributed GAPS with indi-
vidual rewards only does not progress much beyond an essentially random policy. In
order to measure convergence times (that is, the speed of learning) for the different al-
gorithms, we use a single-number measure obtained as follows: for each learning run,
average robot center of mass displacement dx was smoothed with 100-point moving
window and downsampled to filter out variability due to within-run exploration and
random variation. We then calculate a measure of speed of learning (akin to con-
vergence time) for each condition by comparing the average first occurrence of dx
3. As recorded by each individual module separately.
Mean number of stuck configurations (table for figure 5-4b):
DGAPS 10±0.0
DGAPS with common R 0.4±0.2
agreement at each t 0.5±0.2
agreement at end of episode 0.2±0.1
(Annotation for figure 5-4a: reward mean 2.4; agreement gives 0.63.)
Figure 5-4: Effect of active agreement on reward during learning: (a) The modules with the greatest rewards in such configurations have the fewest neighbors (module 15 has 1 neighbor and a reward of 5) and influence the agreed-upon value the least. The modules with the most neighbors (modules 1-10) do not have a chance to get any reward, and influence the agreed-upon value the most. (b) Mean number of stuck configurations, with standard error, out of 10 test runs each of 10 learned policies, after 15 modules learned to locomote eastward for 100,000 episodes.
greater than 90% of its maximum value for each condition, as shown in figure 5-3b.
The Kruskal-Wallis test with four groups (baseline original DGAPS with individual rewards, DGAPS with common reward R = (1/N) Σ_{i=1}^{N} R_i available to individual modules, agreement at each timestep, and agreement at the end of each episode) reveals statistically significant differences in convergence times (χ²(3, 36) = 32.56, p < 10⁻⁶) at the 99% confidence level. Post-hoc analysis shows no significant difference between DGAPS with common rewards and agreement run at the end of each episode.
Agreement (two exchanges at every timestep) run during locomotion results in a
significantly later learning convergence measure for that condition – this is not sur-
prising, as exchanges are cut off at the end of episode when the modules still have
not necessarily reached agreement.
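The measure can be computed along the following lines (a sketch; the downsampling step is omitted here and the smoothing window is the 100-point one used throughout):

    import numpy as np

    def convergence_episode(dx, window=100, threshold=0.9):
        """dx: per-episode average center-of-mass displacement for one learning run.
        Smooth with a moving window, then return the first index at which the
        smoothed curve exceeds `threshold` of its own maximum."""
        kernel = np.ones(window) / window
        smooth = np.convolve(dx, kernel, mode="valid")
        return int(np.argmax(smooth >= threshold * smooth.max()))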
While after 100,000 episodes of learning, the policies learned with agreement at
the end of each episode and those learned with given common mean R receive the
same amount of reward, according to figure 5-3a, agreement seems to help more earlier
in the learning process. This could be a benefit of averaging among all immediate
neighbors, as opposed to pairwise exchanges. Consider 15 modules which started
forming a protrusion early on during the learning process, such that at the end of
some episode, they are in the configuration depicted in figure 5-4a. The average robot
displacement here is R_mean = 2.4. However, if instead of being given that value, modules actively run the synchronous agreement at the end of this episode, they will eventually arrive at the value of R_agree = 0.63. This discrepancy is due to the fact that most of the modules, and notably those that have on average more neighbors, have received zero individual rewards. Those that have received reward, because they started forming an arm, have fewer neighbors and therefore less influence on R_agree, which results in smaller updates for their policy parameters, and ultimately with less learning “pull”
Figure 5-5: Smoothed (100-point window), downsampled average center of mass dis-
placement, with standard error, over 10 runs with 15 modules learning to locomote
eastward, as compared to the baseline distributed GAPS. (a) The modules run the
agreement algorithm on rewards and experience counters at the end of episodes with
original vs. scaled learning rates. (b) Distribution of first occurrences of center of
mass displacement greater than 90% of maximal displacement for each group.
Figure 5-6: Comparison between centralized GAPS, distributed GAPS and dis-
tributed GAPS with agreement on experience: (a) smoothed (100-point window)
average center of mass displacement over 10 trials with 15 modules learning to loco-
mote eastward (T=50), (b) same for 20 modules with episode length T=100: learn-
ing was done starting with α = 0.2 and decreasing it uniformly over the first 1,500
episodes to 0.02.
Table 5.1: Mean number of non-gait local optima, with standard error, out of 10 test
trials for 10 learned policies. The starting learning rate was α = 0.1 for 15 modules
and α = 0.2 for 20 modules.
Figure 5-7: Screenshots taken every 5 timesteps of 20 modules executing two compact
gait policies found by learning with agreement on reward and experience.
5.4 Discussion
We have demonstrated that with agreement-style exchanges of rewards and experience among the learning modules, we can expect performance in distributed reinforcement learning to increase by a factor of ten, and to approach that of a centralized system for smaller numbers of modules and lower aspect ratios, such as we
have with 15 modules. This result suggests that the algorithms and extensions we
presented in chapters 3 and 4 may be applicable to fully distributed systems. With
higher numbers of modules (and higher aspect ratios of the initial configuration),
there is a significant difference between rewards obtained by centralized learning and
distributed learning with agreement, which is primarily due to the kinds of policies
that are favored by the different learning algorithms.
While centralized GAPS maximizes reward by unfolding the robot’s modules into
a two-layer thread, distributed GAPS with agreement finds gaits that preserve the
blobby shape of the initial robot configuration. This difference is due to the way
modules calculate their rewards by repeated averaging with their neighbors. As we
explained in section 5.3, modules will agree on a common reward value that is not
the mean of their initial estimates, if the updates are not pairwise. This generates an
effect where modules with many neighbors have more of an influence on the resulting
agreed-upon reward value than those with fewer neighbors. It therefore makes sense
that locomotion gaits found through this process tend to keep modules within closer
proximity to each other. There is, after all, a difference between trying to figure out
what to do based on the national GDP, and trying to figure out what to do based on
how you and your neighbors are faring.
We have developed a class of algorithms incorporating stochastic gradient ascent
in policy space (GAPS) and agreement algorithms, creating a framework for fully
distributed implementations of this POMDP learning technique.
This work is very closely related to the distributed optimization algorithm de-
scribed by Moallemi & Van Roy (2003) for networks, where the global objective
function is the average of locally measurable signals. They proposed a distributed
policy gradient algorithm where local rewards were pairwise averaged in an asyn-
chronous but disjoint (in time and processor set) way. They prove that the resulting local
gradient estimates converge in the limit to the global gradient estimate. Crucially,
they do not make any assumptions about the policies the processors are learning,
beyond continuity and continuous differentiability. In contrast, here we are interested
in a special case where ideally all modules’ policies are identical. Therefore, we are
able to take advantage of neighbors’ experience as well as rewards.
Practical issues in sharing experience include the amount of required communica-
tions bandwidth and message complexity. We have proposed a simple way of limiting
the size of each message by eliminating zeros, sorting experience by observation and
action, and using run-length encoding as representation of counts (and ultimately,
averages). We must remember, however, that in most cases, communication is much
cheaper and easier than actuation, both in resource (e.g., power) consumption, and in
terms of the learning process. Modules would benefit greatly from sharing experience even at a higher communication cost, and thereby learning from others' mistakes, rather
than continuously executing suboptimal actions that in the physical world may well
result in danger to the robot or to something or someone in the environment.
If further reduction in communications complexity is required, in recent work
Moallemi & Van Roy (2006) proposed another distributed protocol called consensus
propagation, where neighbors pass two messages: their current estimate of the mean,
and their current estimate of a quantity related to the cardinality of the mean esti-
mate, i.e. a measure of how many other processors so far have contributed to the
construction of this estimate of the mean. The protocol converges extremely fast
to the exact average of initial values in the case of singly-connected communication
graphs (trees). It may benefit GAPS-style distributed learning to first construct a
spanning tree of the robot, then run consensus propagation on experience. However,
we do not want to give up the benefits of simple neighborhood averaging of rewards
in what concerns avoidance of undesirable local optima.
We may also wish in the future to develop specific caching and communication
protocols for passing just the required amount of information, and therefore requiring
less bandwidth, given the policy representation and current experience.
Chapter 6
Reducing observation and parameter space may lead to good policies disappearing
from the set of representable behaviors. In general, it will be impossible to achieve
some tasks using limited local observation. For example, it is impossible to learn a
policy that reaches to a goal position in a random direction from the center of the
robot, using only local neighborhoods, as there is no way locally of either observing
the goal or otherwise differentiating between desired directions of motion. Therefore,
we introduce a simple model of minimal sensing at each module and describe two
ways of using the sensory information: by incorporating it into a larger observation
space for GAPS, or by using it in a framework of geometric policy transformations,
for a kind of transfer of learned behavior between different spatially defined tasks.
Figure 6-1: 25 modules learning to build tall towers (reaching upward) with basic
GAPS versus incremental GAPS (in both cases with pre-screened legal actions): (a)
Smoothed (100-point window), downsampled average height difference from initial
configuration, with standard error (b) Distribution of first occurrence of 90% of max-
imal height achieved during learning, for each group separately.
Figure 6-2: A sequence of 25 modules, initially in a square configuration, executing
a learned policy that builds a tall tower (frames taken every 5 steps).
non-locomotion task. We can see in figure 6-1a that on average, modules learning
with basic GAPS do not increase the height of the robot over time, whereas modules
taking the incremental learning approach create an average height difference after
7,000 episodes.
Figure 6-1b shows the distribution of the first occurrence of a height difference greater than 90% of the maximal height difference achieved in each group. We take this as a crude measure of convergence speed. By this measure basic GAPS converges right away, while incremental GAPS needs on average over 6,000 episodes (a Kruskal-Wallis non-parametric ANOVA shows a difference at the 99% significance level: χ²(1, 18) = 15.26, p < .0001). However, by looking also at figure 6-1a, we notice that
the maximal values are very different. Incremental GAPS actually learns a tower-
building policy, whereas basic GAPS does not learn anything.
Figure 6-2 demonstrates an execution sequence of a policy learned by 25 mod-
ules through the incremental process, using pre-screening for legal actions. Not all
policies produce tower-like structures. Often the robot reaches up from both sides
at once, standing tall but with double vertical protrusions. This is due to two factors.
First, we do not specifically reward tower-building, only simple height difference, so
policies can appear successful to modules as soon as they achieve any positive reward.
This problem is analogous to the horizontal protrusions appearing as local maxima to
modules learning to locomote in chapter 4. Second, the eight immediately
neighboring cells do not constitute a sufficient
representation for disambiguating between the robot’s sides in a task that is essen-
tially symmetric with respect to the vertical. Therefore modules can either learn to
only go up, which produces double towers, or they can learn to occasionally come
down on the other side, which produces something like a locomotion gait instead. A
larger observed neighborhood could be sufficient to disambiguate in this case, which
again highlights the trade-off between being able to represent the desired policy and
compactness of representation.
Figure 6-3: First 80 time-steps of executing a policy for locomotion over obsta-
cles, captured in frames every 10 steps (15 modules). The policy was learned with
standard-representation GAPS seeded with a partial policy.
                 2-bit        3-bit
basic GAPS       10 ± 0.0     10 ± 0.0
partial policy    4.0 ± 1.0    5.1 ± 0.6
Figure 6-4: Locomotion over obstacles with or without initializing GAPS to a partially
known policy, for both 2-bit and 3-bit observations, learned by 15 modules. (a)
Smoothed (100-point window), downsampled average rewards, with standard error.
(b) Mean number of non-gait stuck configurations, with standard error, out of 10 test
runs for each of 10 learned policies.
The table in figure 6-4b reports, for a sample of 10 learned policies, the average
number of times modules became stuck in a dysfunctional configuration, i.e., a local
optimum that does not correspond to a locomotion gait. Locomotion over obstacles
is a very hard learning task, and the results reflect that: even 15 modules were unable
to learn any good policies unless we biased the search with a partially known policy.
In addition, we observe that increasing the number of
parameters to learn by using a more general 3-bit representation leads to a significant
drop in average rewards obtained by policies after 10,000 episodes of learning: average
reward curves for biased and unbiased GAPS with 3-bit observation are practically
indistinguishable. While the mean number of arm-like locally optimal configurations
reported in the table for the biased condition and 3-bit observation is significantly
lower than for unbiased GAPS, this result fails to take into account that the “good”
policies in that case still perform poorly. The blinder policies, represented with
only 2 bits per observed cell, fared much better due to the drastically smaller
dimensionality of the policy parameter space in which GAPS has to search. Larger,
more general observation spaces require longer learning.
6.2.1 Minimal sensing: spatial gradients
We assume that each module is capable of making measurements of some underlying
spatially distributed function f such that it knows, at any time, in which direction
lies the largest gradient of f . We will assume the sensing resolution to be either
one in 4 (N, E, S or W) or one in 8 (N, NE, E, SE, S, SW, W or NW) possible
directions. This sensory information adds a new dimension to the observation space
of the module. There are now 2^8 possible neighborhood observations × 5 possible
gradient observations × 9 possible actions = 11,520 parameters in the case of the
lower resolution, and respectively 2^8 × 9 × 9 = 20,736 parameters in the case of the
higher resolution. This is a substantial increase. The obvious advantage is to broaden
the representation to apply to a wider range of problems and tasks. However, if we
severely downsize the observation space and thus limit the number of parameters to
learn, it should speed up the learning process significantly.
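As a quick check of these counts, the arithmetic can be reproduced directly in a few lines of Python (a sketch; the variable names are ours, and the baseline count without gradient sensing simply follows from the same 2^8 neighborhood observations and 9 actions before the extra dimension is added):

# Policy-table sizes when the sensed gradient direction is folded into the
# observation space (reproducing the counts given in the text).
NEIGHBORHOODS = 2 ** 8   # occupancy patterns of the 8 surrounding lattice cells
ACTIONS = 9              # possible actions, as given in the text

low_res = NEIGHBORHOODS * 5 * ACTIONS    # 4-direction gradient sensing -> 11,520
high_res = NEIGHBORHOODS * 9 * ACTIONS   # 8-direction gradient sensing -> 20,736
baseline = NEIGHBORHOODS * ACTIONS       # no gradient sensing -> 2,304

print(low_res, high_res, baseline)       # 11520 20736 2304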
Figure 6-6: Reaching to a random goal position using sensory gradient information
during learning and test phases. Policy execution captured every 5 frames.
leaves an average of 1.7 proper failures per 10 test trials (standard deviation: 1.6).
Three out of 10 policies displayed no failures, and another two failed only once.
function f′. The intuition is that if f′ is sufficiently similar to f, and if all possible
gradient observations have been explored during the learning of the old policy, then
the modules will get good performance on the new task with the old policy. The
condition of sufficient similarity in this case is that f′ be a linear transformation of
f. This is subject to symmetry-breaking experimental constraints, such as requiring
that the modules stay connected to the ground line at all times.
The idea that having seen all gradient observations and learned how to behave
given them and the local configuration should be of benefit in previously unseen
scenarios can be taken further. In the next section, we experiment with a limited set
of gradient directions during the learning phase, then test using random directions.
Figure 6-7: When there is a known policy that optimizes reward R, modules should
be able to use local geometric transformations of the observation and policy to achieve
better than random performance in terms of reward R′ (sensed as direction of greatest
gradient). The ground line breaks the rotational transformations when ∠RR′ > π/2,
such that a flip by the vertical becomes necessary. The implementation is discretized
with π/4 intervals, which corresponds to 8 cells.
Algorithm 6 GAPS with transformations at every timestep
Initialize parameters θ according to experimental condition
for each episode do
    Calculate policy π(θ)
    Initialize observation counts N ← 0
    Initialize observation-action counts C ← 0
    for each timestep in episode do
        for each module m do
            sense direction of greatest gradient ω
            observe o
            if ω ≤ π/2 then
                rotate o by −ω → o′
            else
                flip o by the vertical and rotate by π/2 − ω → o′
            end if
            increment N(o′)
            choose a from π(o′, θ)
            if ω ≤ π/2 then
                rotate a by ω → a′
            else
                flip a by the vertical and rotate by ω − π/2 → a′
            end if
            increment C(o′, a′)
            execute a′
        end for
    end for
    Get global reward R
    Update θ according to
        θ(o′, a′) += α R ( C(o′, a′) − π(o′, a′, θ) N(o′) )
    Update π(θ) using Boltzmann's law
end for
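To make the rotate-and-flip operations concrete, here is a minimal Python sketch of how the local 8-cell neighborhood observation could be transformed under the π/4 discretization of Algorithm 6. The encoding of the neighborhood as a tuple ordered clockwise from North and the particular sign conventions are our illustrative assumptions, not the simulator's actual data structures.

# Neighborhood cells ordered clockwise from North: (N, NE, E, SE, S, SW, W, NW).
def rotate_obs(obs, k):
    # Rotate the observation by k steps of pi/4: a cyclic shift of the 8-tuple.
    k = k % 8
    return tuple(obs[(i + k) % 8] for i in range(8))

def flip_obs(obs):
    # Flip about the vertical axis: E <-> W, NE <-> NW, SE <-> SW; N and S are fixed.
    n, ne, e, se, s, sw, w, nw = obs
    return (n, nw, w, sw, s, se, e, ne)

def canonical_observation(obs, omega_steps):
    # Map a raw observation into the policy's canonical frame, mirroring Algorithm 6:
    # rotate by -omega when omega <= pi/2, otherwise flip about the vertical and
    # rotate by pi/2 - omega. Angles are given in integer multiples of pi/4.
    if omega_steps <= 2:                       # omega <= pi/2
        return rotate_obs(obs, -omega_steps)
    return rotate_obs(flip_obs(obs), 2 - omega_steps)

# The action chosen in the canonical frame is mapped back to the world frame by the
# inverse transformation (rotate by +omega, or flip about the vertical and rotate by
# omega - pi/2), exactly as in the second if-statement of Algorithm 6.

# Example: with the gradient two pi/4 steps away from the policy's learned direction,
# a module with neighbors to its N and S sees the canonical observation below.
obs = (1, 0, 0, 0, 1, 0, 0, 0)
print(canonical_observation(obs, 2))           # -> (0, 0, 1, 0, 0, 0, 1, 0)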
Figure 6-8: Policy transformation with the standard GAPS representation when the
reward direction is rotated by π/2 from the horizontal (i.e., the modules now need to
go North, but they only know how to go East): (a) find the local neighborhood and
rotate it by −π/2; (b)-(c) the resulting observation generates a policy; (d) rotate the
policy back by π/2 to get an equivalent policy for moving North.
Figure 6-9: Equivalent observation and corresponding policy for module M (a) before
and (b) after the flip.
case of a local optimum). Thus, modules maintain a single policy with the standard
number of parameters – the sensory information is used in transformations alone, and
the modules are learning a “universal” policy that should apply in all directions (see
Algorithm 6). The experiments were then run in three conditions with ten trials in
each condition:
2. Modules learned eastward locomotion, but were tested on the task of reaching
to a random target (post-learning knowledge transfer).
mean success ± std. err.                     15 mods     20 mods
learn with transformations                   8.1 ± .4    5.1 ± .6
learn eastward locomotion,
    test with transformations                6.4 ± .8    5.0 ± .8
use policy for eastward locomotion
    as starting point                        7.6 ± .4    8.6 ± .5
random policy                                3           0
Figure 6-10: (a) Average over 10 learning trials of smoothed (100-point window),
downsampled reward over time (learning episodes) for 20 modules running centralized
learning with episode length of T=50, learning to reach to an arbitrary goal position
with geometric transformations of observations, actions, and rewards, starting with
no information (blue) vs. starting with existing locomotion policy (black). Standard
error bars at 95% confidence level. (b) Table of average number of times the goal was
reached out of 10 test trials, with standard error: using locomotion policy as a good
starting point for learning with transformations works best with a larger number of
modules.
Table 6-10b summarizes the success rates of the policies learned under these three
conditions on the task of reaching to a randomly placed target within the robot’s
reachable space, with the success rate of the random policy shown for comparison.
We can see that all conditions do significantly better than the random policy, for both
15 and 20 modules. In addition, a nonparametric Kruskal-Wallis ANOVA shows no
significant difference between the success rate of the three conditions for 15 modules,
but for 20 modules, there is a significant difference (χ2 (2, 27) = 13.54, p = .0011) at
the 99% confidence level. A post-hoc multiple comparison shows that the policies
from condition 3 are significantly more successful than all others. As the problem
becomes harder with a larger number of modules, biasing the search with a good
starting point, generated by geometrically transforming an existing policy, helps
learning more.
Fig. 6-10a shows the smoothed, downsampled average rewards (with standard
error) obtained over time by 20 modules learning with transformations in conditions
1 and 3. We can see clearly that biasing GAPS with a good starting point, obtained
by transforming a good policy for a different but related task, results in faster
learning and more reward overall.
6.4 Discussion
We have demonstrated the applicability of GAPS to tasks beyond simple locomotion,
and presented results in learning to build tall structures (reach upward), move over
rough terrain and reach to a random goal position. As we move from simple locomo-
tion to broader classes of behavior for lattice-based SRMRs, we quickly outgrow the
standard model for each module’s local observations. Therefore, in this chapter we
have introduced a model of local gradient sensing that expands the range of behaviors
the robots can learn. We have additionally introduced a
novel, compressed policy representation for lattice-based SRMRs that reduces the
number of parameters to learn.
In evaluating our results on different objective functions, we have also studied the
possibility of behavioral knowledge transfer between different tasks, assuming the
ability to sense spatially distributed gradients, with rewards defined in terms of those
gradients. We find that it is possible to transfer a policy learned
on one task to another, either by folding gradient information into the observation
space, or by applying local geometric transformations of the policy (or, equivalently,
observation and action) based on the sensed gradient information. The relevance of
the resulting policies to new tasks comes from the spatial properties of lattice-based
modular robots, which require observations, actions and rewards to be local functions
of the lattice space.
Finally, we demonstrate empirically that geometric policy transformations can
provide a systematic way to automatically find good starting points for further policy
search.
Chapter 7
Concluding Remarks
7.1 Summary
In this thesis, we have described the problem of automating distributed controller gen-
eration for self-reconfiguring modular robots (SRMRs) through reinforcement learn-
ing. In addition to being an intuitive approach for describing the problem, distributed
reinforcement learning also provides a double-pronged mechanism for both control
generation and online adaptation.
We have demonstrated that the problem is technically difficult due to the distributed
but coupled nature of modular robots, where each module (and therefore each learning
agent) has access only to local observations and a local estimate of reward. We
then proposed a class of algorithms taking Gradient Ascent in Policy Space (GAPS)
as a baseline starting point. We have shown that GAPS-style algorithms are capable
of learning policies in situations where the more powerful Q-learning and Sarsa are
not, since those algorithms rely on the strong assumption of full observability of the MDP. We iden-
tified three key issues in using policy gradient algorithms for learning, and specified
which parameters of the SRMR problem contribute to these issues. We then ad-
dressed scalability issues presented by learning in SRMRs, including the proliferation
of local optima that correspond to unacceptable behaviors, and suggested ways to mitigate
various aspects of this problem through search constraints, good starting points, and
appropriate policy representations. In this context, we discussed both centralized and
fully distributed learning. In the latter case, we proposed a class of agreement-based
GAPS-style algorithms for sharing both local rewards and local experience, which we
have shown speed up learning by at least a factor of 10. Along the way we provided
experimental evidence for our algorithms’ performance on the problem of locomo-
tion by self-reconfiguration on a two-dimensional lattice, and further evaluated our
results on problems with different objective functions. Finally, we proposed a geo-
metric transformation framework for sharing behavioral knowledge between different
tasks and as another systematic way to automatically find good starting points for
distributed search.
7.2 Conclusions
We can draw several conclusions from our study.
2. Robot and problem design decisions affect the speed and reliability of learning
by gradient ascent, in particular:
3. It is possible to mitigate scalability and local optima issues through appropriate
design decisions. Pre-screening for legal actions effectively reduces the number of
parameters to be learned. Smart, appropriate policy representations also help, but
they require careful human participation and trade-offs between representational
power (the ideal policy must be included), the number of policy parameters, and
the quality of the policy-value landscape that these parameters create (a compact
representation may have many local optima yet lack the redundancy of the full
representation).
4. Unlike centralized but factored GAPS, fully distributed gradient ascent with
partial observability and local rewards is impossible without improvement in
either (a) amount and quality of experience, or (b) quality of local estimates
of reward. Improvement in (a) can be achieved through good design strate-
gies mentioned above, and either (a) or (b) or both benefit tremendously from
agreement-style sharing among local neighbors.
6. The difference in reward functions between centralized learning and distributed
learning with agreement generates a difference in learned policies: for larger
numbers of modules and initial configurations with higher aspect ratios, agreement
by neighborhood averaging of local estimates of rewards and experience favors
finding more compact gaits, and is unlikely to find the same two-layer thread-like
gaits that prevail in the solutions found by centralized learning.
to the robots. We have not addressed online adaptation in this thesis. In the future
we would like to see the robots, seeded with a good policy learned offline, run
a much faster distributed adaptation algorithm to make practical run-time on-the-
robot adjustments to changing environments and goals. We are currently exploring
new directions within the same concept of restricting the search in an intelligent way
without requiring too much involved analysis on the part of the human designer. It
is nevertheless worth noting that the computations required by the GAPS algorithm
are simple enough to run on a microcontroller with limited computing power and
memory.
A more extensive empirical analysis of the space of possible representations and
search restrictions is needed for a definitive strategy in making policy search work
fast and reliably in SRMRs. In particular, we would like to be able to describe the
properties of good feature spaces for our domain, and ways to construct them. A
special study of the influence of policy parameter redundancy would be equally useful
and interesting: for example, we would like to know in general when restricting ex-
ploration to only legal actions might be harmful instead of helpful because it removes
a source of smoothness and redundancy in the landscape.
We have studied how search constraints and information affect the learning of
locomotion gaits in SRMRs in this thesis. The general conclusion is that more infor-
mation is always better, especially when combined with restrictions on how it may be
used, which essentially guides and constrains the search. This opens up the theoretical
question of the amount and kinds of information needed for particular systems and
tasks, which could provide a fruitful direction for future research.
As we explore collaboration between the human designer and the automated learn-
ing agent, our experiments bring to light some issues that such interactions could
raise. As we have observed, the partial information coming from the human designer
can potentially lead the search away from the global optimum. This misdirection can
take the form of a bad feature representation, an overconstrained search, or a
partial policy that favors suboptimal actions.
Another issue is how to provide the learning agents with a reward signal that
gives good estimates of the underlying objective function. Experiments
have shown that a simple performance measure such as displacement during a time
period cannot always disambiguate between good behavior (i.e., a locomotion gait) and
an unacceptable local optimum (e.g., a long arm-like protrusion in the right direction).
In this thesis, we have restricted experiments to objective functions that depend on
a spatially described gradient. Future work could include more sophisticated reward
signals.
We have also seen throughout that the learning success rate (as measured, for ex-
ample, by the percentage of policies tested which exhibited correct behavior) depends
not only on robot size, but also on the initial condition, specifically, the configuration
that the robot assumes at the start of every episode. In this thesis, we have limited
our study to a few such initial configurations and robot sizes in order to compare the
different approaches to learning. In the future, we would like to describe the space of
initial conditions, and research their influence on learning performance more broadly.
Finally, the motivation of this work has been to address distributed RL problems of
a cooperative nature; as a special case of those problems, some of our approaches
(e.g., sharing experience) apply only to agents learning essentially identical policies.
While this is a limitation of the present work, we believe that most of the algorithmic
and representational findings we have discussed should also be valid in less homogeneous
and less cooperative settings. In particular, sharing experience but not local rewards
might lead to interesting developments in games with non-identical payoffs. These
developments are also potential subjects for future work.
In addition to the above, we would like to see the ideas described in this thesis
applied to concrete models of existing modular robots, as well as to other distributed
domains, such as chain and hybrid modular robots, ad hoc mobile networks, and
robotic swarms.
Bibliography
Bagnell, J. A., Kakade, S., Ng, A. Y. & Schneider, J. (2004), Policy search by dynamic
programming, in ‘Advances in Neural Information Processing Systems 16’.
Bishop, J., Burden, S., Klavins, E., Kreisberg, R., Malone, W., Napp, N. & Nguyen,
T. (2005), Self-organizing programmable parts, in ‘International Conference on
Intelligent Robots and Systems, IEEE/RSJ Robotics and Automation Society’.
Buhl, J., Sumpter, D. J., Couzin, I., Hale, J., Despland, E., Miller, E. & Simpson, S. J.
(2006), ‘From disorder to order in marching locusts’, Science 312, 1402–1406.
Butler, Z., Kotay, K., Rus, D. & Tomita, K. (2001), Cellular automata for decen-
tralized control of self-reconfigurable robots, in ‘Proceedings of the International
Conference on Robots and Automation’.
Butler, Z., Kotay, K., Rus, D. & Tomita, K. (2004), ‘Generic distributed control
for locomotion with self-reconfiguring robots’, International Journal of Robotics
Research 23(9), 919–938.
Chang, Y.-H., Ho, T. & Kaelbling, L. P. (2004), All learning is local: Multi-agent
learning in global reward games, in ‘Neural Information Processing Systems’.
Everist, J., Mogharei, K., Suri, H., Ranasinghe, N., Khoshnevis, B., Will, P. & Shen,
W.-M. (2004), A system for in-space assembly, in ‘Proc. Int. Conference on
Robots and Systems (IROS)’.
Fernandez, F. & Parker, L. E. (2001), ‘Learning in large cooperative multi-robot
domains’, International Journal of Robotics and Automation: Special issue on
Computational Intelligence Techniques in Cooperative Robots 16(4), 217–226.
Grudic, G. Z., Kumar, V. & Ungar, L. (2003), Using policy reinforcement learning on
autonomous robot controllers, in ‘Proceedings of the IEEE/RSJ Intl. Conference
on Intelligent Robots and Systems’, Las Vegas, Nevada.
Guestrin, C., Koller, D. & Parr, R. (2002), Multiagent planning with factored MDPs,
in ‘Advances in Neural Information Processing Systems (NIPS)’, Vol. 14, The
MIT Press.
Jadbabaie, A., Lin, J. & Morse, A. S. (2003), ‘Coordination of groups of mobile au-
tonomous agents using nearest neighbor rules’, IEEE Transactions on Automatic
Control 48(6).
Kamimura, A., Kurokawa, H., Yoshida, E., Murata, S., Tomita, K. & Kokaji, S.
(2004), Distributed adaptive locomotion by a modular robotic system, M-TRAN
II – from local adaptation to global coordinated motion using cpg controllers, in
‘Proc. of Int. Conference on Robots and Systems (IROS)’.
Kitano, H., Asada, M., Kuniyoshi, Y., Noda, I. & Osawa, E. (1997), RoboCup: The
robot world cup initiative, in W. L. Johnson & B. Hayes-Roth, eds, ‘Proceedings
of the First International Conference on Autonomous Agents (Agents’97)’, ACM
Press, New York, pp. 340–347.
Kotay, K. & Rus, D. (2004), Generic distributed assembly and repair algorithms for
self-reconfiguring robots, in ‘IEEE Intl. Conf. on Intelligent Robots and Systems’,
Sendai, Japan.
Kotay, K. & Rus, D. (2005), Efficient locomotion for a self-reconfiguring robot, in
‘Proc. of Int. Conference on Robotics and Automation (ICRA)’.
Kotay, K., Rus, D., Vona, M. & McGray, C. (1998), The self-reconfiguring robotic
molecule, in ‘Proceedings of the IEEE International Conference on Robotics and
Automation’, Leuven, Belgium, pp. 424–431.
Kubica, J. & Rieffel, E. (2002), Collaborating with a genetic programming system
to generate modular robotic code, in W. et al., ed., ‘GECCO 2002: Proceedings
of the Genetic and Evolutionary Computation Conference’, Morgan Kaufmann
Publishers, New York, pp. 804–811.
Lagoudakis, M. G. & Parr, R. (2003), ‘Least-squares policy iteration’, Journal of
Machine Learning Research 4, 1107–1149.
Littman, M. L. (1994), Memoryless policies: Theoretical limitations and practical
results, in ‘From Animals to Animats 3: Proceedings of the 3rd International
Conference on Simulation of Adaptive Behavior (SAB)’, Cambridge, MA.
Marthi, B., Russell, S. & Andre, D. (2006), A compact, hierarchically optimal Q-
function decomposition, in ‘Proceedings of the Intl. Conf. on Uncertainty in AI’,
Cambridge, MA.
Martin, M. (2004), The essential dynamics algorithm: Fast policy search in continuous
worlds, in ‘MIT Media Lab Tech Report’.
Matarić, M. J. (1997), ‘Reinforcement learning in the multi-robot domain’, Au-
tonomous Robots 4(1), 73–83.
Meuleau, N., Peshkin, L., Kim, K.-E. & Kaelbling, L. P. (1999), Learning finite-
state controllers for partially observable environments, in ‘Proc. of 15th Conf.
on Uncertainty in Artificial Intelligence (UAI)’.
Moallemi, C. C. & Van Roy, B. (2003), Distributed optimization in adaptive networks,
in ‘Proceedings of Intl. Conference on Neural Information Processing Systems’.
Moallemi, C. C. & Van Roy, B. (2006), ‘Consensus propagation’, IEEE Transactions
on Information Theory 52(11).
Murata, S., Kurokawa, H. & Kokaji, S. (1994), Self-assembling machine, in ‘Pro-
ceedings of IEEE Int. Conf. on Robotics and Automation (ICRA)’, San Diego,
California, pp. 441–448.
Murata, S., Kurokawa, H., Yoshida, E., Tomita, K. & Kokaji, S. (1998), A 3D self-
reconfigurable structure, in ‘Proceedings of the 1998 IEEE International Confer-
ence on Robotics and Automation’, pp. 432–439.
Murata, S., Yoshida, E., Kurokawa, H., Tomita, K. & Kokaji, S. (2001), ‘Self-repairing
mechanical systems’, Autonomous Robots 10, 7–21.
Mytilinaios, E., Marcus, D., Desnoyer, M. & Lipson, H. (2004), Design and evolved
blueprints for physical self-replicating machines, in ‘Proc. of 9th Int. Conference
on Artificial Life (ALIFE IX)’.
Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E.
& Liang, E. (2004), Autonomous inverted helicopter flight via reinforcement
learning, in ‘ISER’.
Ng, A. Y. & Jordan, M. (2000), PEGASUS: A policy search method for large MDPs and
POMDPs, in ‘Proc. of Int. Conference on Uncertainty in AI (UAI)’.
Olfati-Saber, R., Franco, E., Frazzoli, E. & Shamma, J. S. (2005), Belief consensus
and distributed hypothesis testing in sensor networks, in ‘Proceedings of the
Workshop on Network Embedded Sensing and Control’, Notre Dame University,
South Bend, IN.
Østergaard, E. H., Kassow, K., Beck, R. & Lund, H. H. (2006), ‘Design of the ATRON
lattice-based self-reconfigurable robot’, Autonomous Robots 21(2), 165–183.
Pamecha, A., Chiang, C.-J., Stein, D. & Chirikjian, G. (1996), Design and imple-
mentation of metamorphic robots, in ‘Proceedings of the 1996 ASME Design
Engineering Technical Conference and Computers in Engineering Conference’.
Peshkin, L., Kim, K., Meuleau, N. & Kaelbling, L. P. (2000), Learning to cooperate
via policy search, in ‘Proceedings of the 16th Conference on Uncertainty in
Artificial Intelligence (UAI)’, p. 489.
Schaal, S., Peters, J., Nakanishi, J. & Ijspeert, A. (2003), Learning movement primi-
tives, in ‘Int. Symposium on Robotics Research (ISRR)’.
Schneider, J., Wong, W.-K., Moore, A. & Riedmiller, M. (1999), Distributed value
functions, in ‘Proceedings of the International Conference on Machine Learning’.
Shen, W.-M., Krivokon, M., Chiu, H., Everist, J., Rubenstein, M. & Venkatesh,
J. (2006), ‘Multimode locomotion via SuperBot reconfigurable robots’, Au-
tonomous Robots 20(2), 165–177.
Suh, J. W., Homans, B. & Yim, M. (2002), Telecubes: Mechanical design of a module
for self-reconfiguration, in ‘Proceedings of the IEEE International Conference on
Robotics and Automation’, Washington, DC.
Sutton, R. S., McAllester, D., Singh, S. & Mansour, Y. (2000), Policy gradient meth-
ods for reinforcement learning with function approximation, in ‘Advances in
Neural Information Processing Systems 12’.
Toothaker, L. E. & Chang, H. S. (1980), ‘On “the analysis of ranked data derived
from completely randomized factorial designs”’, Journal of Educational Statistics
5(2), 169–176.
Varshavskaya, P., Kaelbling, L. P. & Rus, D. (2004), Distributed learning for modular
robots, in ‘Proc. Int. Conference on Robots and Systems (IROS)’.
Vicsek, T., Czirók, A., Ben-Jacob, E., Cohen, I. & Schochet, O. (1995), ‘Novel type
of phase transition in a system of self-driven particles’, Physical Review Letters
75(6), 1226–1229.
White, P., Zykov, V., Bongard, J. & Lipson, H. (2005), Three-dimensional stochastic
reconfiguration of modular robots, in ‘Proceedings of Robotics: Science and
Systems’, Cambridge, MA.
Wolpert, D., Wheeler, K. & Tumer, K. (1999), Collective intelligence for control of
distributed dynamical systems, Technical Report NASA-ARC-IC-99-44.
Yu, W., Takuya, I., Iijima, D., Yokoi, H. & Kakazu, Y. (2002), Using interaction-based
learning to construct an adaptive and fault-tolerant multi-link floating robot, in
H. Asama, T. Arai, T. Fukuda & T. Hasegawa, eds, ‘Distributed Autonomous
Robotic Systems’, Vol. 5, Springer, pp. 455–464.
Zykov, V., Mytilinaios, E., Adams, B. & Lipson, H. (2005), ‘Self-reproducing ma-
chines’, Nature 435(7038), 163–164.