Hierarchical Problem Solving Using Reinforcement Learning: Methodology and Methods
by
Yassine Faihe
DISSERTATION
1999
Acknowledgements

I am indebted to my advisor, Professor Jean-Pierre Muller, for his support and encouragement. While giving me extensive freedom to conduct my research, he has always provided me with useful advice and original ideas. My introduction to the field of reinforcement learning as well as the direction taken by my research come from his guidance.

I am grateful to my thesis committee. I greatly acknowledge Paul Bourgine, who has helped me to develop the mathematical aspect of my thesis, and for the useful discussions we had in Paris. I would also like to thank Tony Prescott for his explanations, which have been of great help in my understanding of the action selection mechanism, as well as for his excellent comments about the dissertation. Thanks must also go to Dario Floreano and Killian Stoffel. Their questions and their remarks allowed me to clarify some important issues.

The interactions I have had with the CASCAD Team members: Antoine, Abdelaziz, Eric, Fabrice, Luba, Luc-Laurent and Matthieu have always been fruitful and of great interest.

Finally I would like to thank Carolina Badii, who has proofread the draft of this dissertation and has helped in improving the style of the written English.
Contents

1 Introduction
2 Background: Reinforcement Learning
   2.1 Formulation
      2.1.1 Framework
      2.1.2 Markov Decision Processes
      2.1.3 Returns and Optimality Criteria
   2.2 Temporal Credit Assignment
      2.2.1 Value Functions and Optimal Policies
      2.2.2 Dynamic Programming
      2.2.3 Temporal Difference Learning
   2.3 Structural Credit Assignment
      2.3.1 Prediction with Function Approximator
      2.3.2 Neural networks
      2.3.3 Connectionist Reinforcement Learning
   2.4 Summary
3 The Postman Robot Problem
4 The Methodology
5.1 Statement
6.1 Summary of contributions
6.2 Practical Issues
6.3 Future work
6.4 Epilogue
List of Tables

3.1 The letter arrival patterns for each office.
4.1 Outline of the evaluation forms.
4.2 Steps needed by the robot to move between different places in the environment.
List of Figures

2.1 Reinforcement learning framework.
2.2 The policy iteration method builds a sequence of policies that converges to the optimal policy. PE and PI are respectively the policy evaluation and the policy improvement operators.
2.3 The policy iteration algorithm.
2.4 The value iteration algorithm.
2.5 Evolution of traces according to the state visits.
2.6 Algorithms of Q(λ) and Sarsa(λ) with either replacing or accumulating traces. For λ = 0 we have the Sarsa and one-step Q-learning algorithms.
2.7 Multi-layer perceptron network.
2.8 A connection between units of consecutive layers. The index of the layers decreases from the output to the input.
2.9 Algorithm of Sarsa(λ) with a connectionist function approximator.
2.10 An Elman network as used by Lin (1992).
5.6 Average of the quality criterion as a function of decision steps. The top graph concerns the periodic letters flow and the bottom graph the Poisson distribution letters flow.
Chapter 1
Introduction

1.1
This thesis is about the use of autonomous agents to solve problems. A problem is defined by an environment and a task to achieve. For instance, the environment could be a building with an elevator group and the task could be to control the elevator cars so as to reduce the passengers' waiting time (Crites 1996). An autonomous agent is an entity that has the ability to interact, without human intervention, with dynamic and unpredictable environments through sensing and acting devices. It can sense some aspects of the environment's state and influence its dynamics. During this interaction the agent exhibits a behavior. When tightly coupled with the environment, the agent is said to be embedded (Kaelbling 1993b), that is, being a part of this environment and having quick reactions to stimuli.
The classical approach to building embedded autonomous agents has been to program them. The designer uses his own expertise and a priori knowledge to anticipate all possible patterns of interaction, or analyzes and models the problem with differential equations. In the latter case the agent's controller is derived using methods developed in the field of control theory. However the increasing complexity of the problems, coming from difficult tasks or from non-linear, stochastic and unstructured environments, limits the applicability of such methods, even though adaptive methods to tune certain parameters of the controller do exist.
One way of overcoming this difficulty is autonomous programming, that is, making the agent acquire the necessary skills to achieve the given task from the interaction with the environment. Such a process is called learning and refers to the ability to modify one's knowledge according to experience. Apart from freeing the designer from explicitly programming the agent, learning is useful to maintain the agent's capability to perform a task under changing circumstances. Thus learning agents are more flexible, robust and able to cope with uncertainty and changing environments.
First research on learning focused on supervised learning, where a tutor trains a system using input-output pair examples. Because such training examples are not always available, applications of supervised learning methods are restricted to pattern recognition and classification, and function approximation. Reinforcement learning (RL) is applicable in more general and difficult cases. In the reinforcement learning paradigm, an agent learns how to achieve a given task from its own interaction with the environment. To do so it modifies its decision process on the basis of a feedback which is a scalar evaluation of its current performance. Positive and negative (high and low) values of this scalar correspond to rewards and punishments respectively. Thus the agent solves the problem when it behaves in a way that maximizes rewards and minimizes punishments. RL methods have proven to perform well on simple problems but become impractical to use when the problem's complexity increases.

The main motivation of the work presented in this dissertation is to scale up reinforcement learning to complex problems.
1.2
Two closely linked reasons can explain why reinforcement learning fails to solve complex problems. First, the appropriate reinforcement function, that is, the one that makes the agent solve the problem when rewards are maximized, is not easy to find. So far there has been no systematic way to design such a function. The second reason is that the number of situations that the agent may encounter during its interaction with the environment increases with the complexity of the problem, so the search process is slowed down and learning becomes impractical.
A behavior is the description, from the observer's point of view and at different levels of abstraction, of a sequence of actions produced by the agent via its coupling with the environment; it is produced by motor mechanisms interacting with the environment (Braitenberg 1984; Pfeifer and Scheier 1998). The design process of a behavior therefore consists in transposing the observer's point of view into the mechanisms available to the agent.
Having these arguments in mind, it is now possible to tackle the obstacles that limit the scalability of reinforcement learning.

Let's start with the curse of dimensionality. When solving a problem requires the agent to perform a long sequence of actions, it becomes very hard to discover such a sequence, especially when the reinforcements are sparse, because the exploration is not guided. One may introduce local reinforcements (given by a teacher) to guide the exploration or come up with efficient exploration strategies. One may also argue that the agent does not have the adequate actions, otherwise it would have solved the problem in few decision steps (Martin 1998). Thus, we propose to add the missing actions to the agent's repertoire by allowing it to learn them. Actually these new actions correspond to skills that solve parts of the problem. So it is necessary to perform a problem decomposition in order to identify the needed skills. If the skills found are still too difficult to learn, the corresponding sub-problems are decomposed once again. The resulting agent's architecture is a hierarchically structured set of skills where each skill is learned using previously acquired ones.
The direct consequence of this approach is that we will have to design several simple reinforcement functions (one for each sub-problem) rather than a single global and complex one. However the necessity to have a means of describing behaviors still remains.
In order to systemize the approach we mentioned above, and to manage the overall design process, a methodology is required. Issues that should be raised by such a methodology concern:

the analysis of the problem and the specification of the desired behavior;

the problem decomposition into sub-problems and the learning of the corresponding skills;

A methodology that meets these requirements, as well as methods to address the above issues, are proposed in this thesis and constitute our main contribution.
1.3
In this thesis we investigate the methodological aspect of hierarchical problem solving using agents that learn by reinforcement. The next chapter defines the reinforcement learning problem. It provides a mathematical formulation of the problem and reviews techniques to solve it. Chapter 3 presents the postman robot problem and describes the testbed used in this work. In chapter 4 a new agent design methodology is introduced with details of its components. One particular component of the methodology, the coordination, is addressed in depth in chapter 5. Both chapters 4 and 5 report and analyze the experimental results we have obtained. Finally in chapter 6, we summarize the contribution of our work, discuss some practical issues, and suggest directions for future research.
Chapter 2
Background: Reinforcement Learning

In this chapter we introduce the reinforcement learning problem. We first set up the framework by defining how the agent interacts with the environment and formalize the problem as the optimal control of a Markov decision process. The solutions are presented from the credit assignment point of view. Both temporal and structural credit assignment problems are described and state-of-the-art methods to solve them are reviewed.
2.1 Formulation
2.1.1 Framework
The agent, the environment it interacts with and the task it has to achieve are the components that define the reinforcement learning framework (figure 2.1). The interaction between the agent and the environment is continuous. On one hand the agent's decision process selects actions according to the perceived situations of the environment, and on the other hand these situations evolve under the influence of the actions. Each time the agent performs an action, it receives a reward. A reward is a scalar value that tells the agent how well it is fulfilling the given task. To be formal, let's denote x a representation of the environment's state as it is perceived by the agent, a the selected action, and r the received reward. The agent's decision process is called policy and is a mapping from states to actions. A learning agent modifies its policy according to its experience and to its goal, which is to maximize the cumulated rewards over time. Such an amount is called return and will be explained later. Because of its flexibility and its abstraction, the reinforcement learning framework can be used to specify several kinds of problems. Actually, time steps at which an interaction occurs have to be seen as decision-making steps rather than fixed ticks of real time, and states and actions may range from low-level interaction devices to high-level descriptions and decisions.

Figure 2.1: Reinforcement learning framework.
2.1.2 Markov Decision Processes
A Markov decision process (MDP) consists of a set of states X and a set of actions A which allow movement from one state to another. In each state x only a subset of actions A(x) ⊆ A is available. The dynamics of the process is governed by a set of transition matrices. There is one matrix P(a) for each action a, where each element P_xy(a) denotes the probability of a transition to state y given x and a. If an action a is not available in state x then P_xy(a) = 0. At the end of each transition a reward r = R(x, a, y) is generated. The immediate evaluation of a transition is generally expressed by the expected reward:

R(x, a) = E[R(x, a, y)] = Σ_{y∈X} P_xy(a) R(x, a, y).    (2.1)

In this thesis we assume that the process is discrete and that both X and A are finite.
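As an illustration of this formulation, the transition matrices P(a) and rewards R(x, a, y) can be held in small tabular structures. The sketch below (Python; the three-state toy process and all names are purely illustrative, not taken from the thesis) computes the expected reward of equation (2.1) and samples one transition.

    import random

    # Toy MDP: states X = {0, 1, 2}, actions A = {0, 1}.
    # P[a][x][y] is the transition probability P_xy(a),
    # R[a][x][y] is the reward R(x, a, y).
    P = {0: [[0.9, 0.1, 0.0], [0.0, 0.8, 0.2], [0.1, 0.0, 0.9]],
         1: [[0.2, 0.8, 0.0], [0.0, 0.1, 0.9], [0.5, 0.5, 0.0]]}
    R = {0: [[0, 1, 0], [0, 0, 5], [0, 0, 0]],
         1: [[0, 2, 0], [0, 0, 10], [-1, 0, 0]]}

    def expected_reward(x, a):
        # Equation (2.1): R(x, a) = sum_y P_xy(a) R(x, a, y).
        return sum(p * r for p, r in zip(P[a][x], R[a][x]))

    def step(x, a):
        # Sample the next state y according to P_xy(a) and return (y, reward).
        y = random.choices(range(3), weights=P[a][x])[0]
        return y, R[a][x][y]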
Policy

In general the outcome of a process, in terms of states and rewards, at a given time step depends on the prior sequence of states, or past history, H_t = {x_t, a_t, x_{t-1}, a_{t-1}, ..., x_0, a_0}. When it is possible to predict the next state and the next expected reward only on the basis of the current state, then the process is said to have the Markov property or to be Markovian. Formally the Markov property can be expressed by the following equality:

P(x_{t+1} = y, r_t = r | H_t) = P(x_{t+1} = y, r_t = r | x_t, a_t).    (2.2)

One can notice the importance of the Markov property in the sense that the decision is only a function of the current state. The case where an agent has to deal with non-Markov states, either because it interacts with a non-Markov environment or because of its incomplete perceptions, will be discussed later.
2.1.3 Returns and Optimality Criteria
While interacting with the process the agent receives a sequence of rewards, and a natural measure of its long-term performance is their sum:

r_1 + r_2 + r_3 + ... + r_n + ...    (2.3)

Such a measure of long-term reward is called return (Barto et al. 1990). Because of the stochasticity of the controlled process we will consider the expected value of the return,

E_π [ Σ_{t=0}^{N} ω(t) r_t ],    (2.4)

where E_π is the expectation operator when policy π is used, N is the horizon of the return and ω is a weighting factor. Several optimality criteria have been investigated in the literature (Mahadevan 1996), but all can be expressed in the above form. Here we will focus on the case where N → ∞ and ω(t) = γ^t, where 0 ≤ γ < 1, which represents the expected discounted total reward. The discount factor γ acts as an attenuator. Hence one unit of reward received at time t + τ is equivalent to γ^τ units at time t. This optimality criterion is attractive because of its mathematical properties, which make the computation of the optimal policy more tractable: the return value is finite (because 0 ≤ γ < 1 and as long as the reward function is bounded) and the optimal infinite-horizon policy is always stationary.
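As a small worked example of the discounted criterion (a sketch; the reward sequence is made up), the discounted return of a finite trajectory can be accumulated backwards:

    def discounted_return(rewards, gamma=0.9):
        # G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g
        return g

    print(discounted_return([0, 0, 1, 0, 5]))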
2.2 Temporal Credit Assignment

The temporal credit assignment problem (TCA) consists in attributing credit or blame to individual actions on the basis of the result of a whole plan of actions, and is a concern for most real decision problems. Indeed some actions may generate low immediate payoff but can contribute to producing higher rewards in the future. Sometimes several actions have to be performed before getting a reward: the reward is said to be delayed. In this section we review dynamic programming (DP) and temporal difference (TD) learning, which are techniques that solve the TCA problem. Although DP algorithms can compute optimal policies for MDPs, they are not very useful to solve reinforcement learning problems because an accurate model of the environment is usually not available. However dynamic programming provides important theoretical foundations for understanding the function of temporal difference methods.
2.2.1 Value Functions and Optimal Policies

A widely used approach to deal with delayed rewards is to estimate the worth of a state or a decision in terms of future expected rewards. Given an optimality criterion we can define a value function for a policy π, V^π : X → ℝ, as a mapping from states to real values. We have:

V^π(x) = E_π [ Σ_{t=0}^{∞} γ^t r_t | x_0 = x ],    (2.5)

which expresses the expected return when the policy π is followed starting from state x. In the same way we can define a utility function for policy π, Q^π : X × A → ℝ, mapping state-action pairs to real values. Q^π(x, a) expresses the utility of performing action a in state x and following policy π thereafter:

Q^π(x, a) = E_π [ Σ_{t=0}^{∞} γ^t r_t | x_0 = x, a_0 = a ].    (2.6)

Given two policies π and π′, we say that π is better than (or an improvement of) π′ if the value function of the first policy is at least equal to that of the second policy in every state, and is greater for at least one state. Hence the optimal policy π* is the one which cannot be improved anymore. Its value function is V*. Many optimal policies may exist but they all have the same optimal value function V*. Now we will see how such optimal policies can be induced.
2.2.2 Dynamic Programming

The starting point of dynamic programming comes from equation 2.5 written in a recursive form:

V^π(x) = R(x, π(x)) + γ Σ_{y∈X} P_xy(π(x)) V^π(y),    (2.7)

and, for an optimal policy π*,

V^*(x) = R(x, π^*(x)) + γ Σ_{y∈X} P_xy(π^*(x)) V^*(y).    (2.8)
As all optimal policies have the same optimal value function V*, and V*(x) ≥ V^{π_i}(x) for all x ∈ X and for all policies π_i, we obtain:

V^*(x) = max_{a∈A(x)} [ R(x, a) + γ Σ_{y∈X} P_xy(a) V^*(y) ].    (2.9)

This equation is known as the Bellman optimality equation (or Bellman's equation for π*). When V* is known, the optimal policy can be easily derived:

π^*(x) = arg max_{a∈A(x)} [ R(x, a) + γ Σ_{y∈X} P_xy(a) V^*(y) ].    (2.10)

There are several computational techniques to solve the Bellman equation. Here we will limit ourselves to two of them: value iteration and policy iteration. But let's first see how the evaluation of a given policy can be computed.
Policy Evaluation

Let's define V_n^π(x) as the expected return if policy π is followed for n steps only, starting from state x. For n = 1, the expected return is simply the expected immediate reward when action a = π(x) is performed:

V_1^π(x) = R(x, a).    (2.11)

Assuming that V_1^π is known and that the next observed state when a is performed in x is y with probability P_xy(a), we have, for all x ∈ X:

V_2^π(x) = R(x, a) + γ Σ_{y∈X} P_xy(a) V_1^π(y).    (2.12)

Similarly we can determine V_3^π from V_2^π, V_4^π from V_3^π, and in the general case V_{n+1}^π from V_n^π:

V_{n+1}^π(x) = R(x, π(x)) + γ Σ_{y∈X} P_xy(π(x)) V_n^π(y),    (2.13)

for all x ∈ X. After a high number of iterations N over all states, V_N^π(x) can be considered as a good approximation of V^π(x), given an arbitrary initial V_0^π(x).
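A minimal sketch of this iterative policy evaluation, reusing the illustrative P and R structures introduced earlier (function and variable names are assumptions, not the thesis code):

    def policy_evaluation(P, R, policy, gamma=0.9, n_iters=200):
        """Iterate equation (2.13) until V_n approximates V^pi."""
        n_states = len(next(iter(P.values())))
        V = [0.0] * n_states                      # arbitrary initial V_0
        for _ in range(n_iters):
            V_new = [0.0] * n_states
            for x in range(n_states):
                a = policy[x]
                V_new[x] = sum(P[a][x][y] * (R[a][x][y] + gamma * V[y])
                               for y in range(n_states))
            V = V_new
        return V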
Policy Iteration

The policy iteration method consists of two procedures: the policy evaluation and the policy improvement. Thus, starting from any initial policy π_0 we successively evaluate it, obtaining V^{π_0}, improve it, obtaining π_1, and so on until the optimal policy is reached (figure 2.2). Once a policy π_n is evaluated, the result V^{π_n} is used to make the improvement:

π_{n+1}(x) = arg max_{a∈A(x)} [ R(x, a) + γ Σ_{y∈X} P_xy(a) V^{π_n}(y) ].    (2.14)

Figure 2.2: The policy iteration method builds a sequence of policies π_0 →(PE) V^{π_0} →(PI) π_1 →(PE) V^{π_1} →(PI) ... that converges to π*. PE and PI are respectively the policy evaluation and the policy improvement operators.
Value Iteration

The policy evaluation phase in the policy iteration algorithm needs a lot of computation and has to be performed after each improvement. Instead of making an improvement after each policy evaluation, it is possible to make it after only one backup of each state. This procedure amounts to directly computing the optimal value function using equation 2.9. The backup operation becomes:

V_{n+1}(x) = max_{a∈A(x)} [ R(x, a) + γ Σ_{y∈X} P_xy(a) V_n(y) ],    (2.15)

for all x ∈ X. The complete value iteration algorithm is given in figure 2.4.
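The following sketch combines the backup of equation (2.15) with the greedy extraction of equation (2.10), again on the illustrative P and R structures (a sketch, not the algorithm listing of figure 2.4):

    def value_iteration(P, R, gamma=0.9, n_iters=200):
        """Apply the backup of equation (2.15) to every state at each iteration."""
        n_states = len(next(iter(P.values())))
        actions = list(P.keys())
        V = [0.0] * n_states
        for _ in range(n_iters):
            V = [max(sum(P[a][x][y] * (R[a][x][y] + gamma * V[y])
                         for y in range(n_states))
                     for a in actions)
                 for x in range(n_states)]
        # Greedy policy extraction, equation (2.10).
        policy = [max(actions,
                      key=lambda a: sum(P[a][x][y] * (R[a][x][y] + gamma * V[y])
                                        for y in range(n_states)))
                  for x in range(n_states)]
        return V, policy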
The algorithms presented in the previous section are called synchronous dynamic programming algorithms because at each iteration the value function is updated for the entire state space. In the case where the state space is very large, the solution of the MDP becomes computationally intractable. Asynchronous dynamic programming relaxes this rule and allows backups to be applied to only a subset of the state set, which may be a singleton (Gauss-Seidel DP) and may vary at each iteration. Let X_n ⊆ X be the set of states whose value functions will be backed up during the iteration stage n = 0, 1, ... The backups are done as follows:

V_{n+1}(x) = max_{a} [ R(x, a) + γ Σ_{y∈X} P_xy(a) V_n(y) ]   if x ∈ X_n;
V_{n+1}(x) = V_n(x)                                            otherwise.    (2.16)

The choice of X_n is crucial for the convergence to V*. Ideally each state should be backed up infinitely often, which means that it should be contained in all the subsets X_n.
The relaxation introduced by asynchronous DP is very useful when the computation of the optimal policy occurs while interacting with an unknown process. In this case the states are backed up as they are encountered. Adaptive real-time dynamic programming (ARTDP) (Barto et al. 1995) relies on this principle to perform on-line control of a process. It involves the estimation of the process' model, the policy computation, and the control. Each time a transition is observed, the estimate of the transition probability matrices P̂(a) is updated:

P̂_xy(a) = n_xy(a) / n_x(a),    (2.17)

where n_xy(a) is the number of transitions from x to y when a is performed, and n_x(a) = Σ_{y∈X} n_xy(a) is the number of times a was performed in x. The estimation of the immediate reward R̂(x, a) is simply updated with the average of the observed immediate rewards for this state-action pair. After an infinite number of updates the estimated model converges to the true process. At each time step t the optimal value function is estimated using the current process model estimation and the previous optimal value function estimation V̂_{t-1}. With an accurate model only one backup would be necessary and V̂_t would be equal to V*. However, in the present case such a model is not available and there are little variations between two consecutive estimations of the model. For these reasons active exploration mechanisms have been investigated (Barto and Singh 1990) to speed up the identification phase.
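A sketch of the model estimation described above (the incremental running average is one standard way to realize "updated with the average"; all names are illustrative):

    from collections import defaultdict

    n_xy = defaultdict(int)     # transition counts n_xy(a), keyed by (x, a, y)
    n_x = defaultdict(int)      # visit counts n_x(a), keyed by (x, a)
    R_hat = defaultdict(float)  # running average of observed rewards

    def observe(x, a, y, r):
        n_xy[(x, a, y)] += 1
        n_x[(x, a)] += 1
        R_hat[(x, a)] += (r - R_hat[(x, a)]) / n_x[(x, a)]

    def P_hat(x, a, y):
        # Equation (2.17).
        return n_xy[(x, a, y)] / n_x[(x, a)] if n_x[(x, a)] else 0.0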
2.2.3 Temporal Difference Learning

Temporal difference learning (Sutton 1988) methods are concerned with solving a prediction problem and, unlike DP methods, do not need a model of the environment's dynamics. Such methods are referred to as direct or model-free methods, as opposed to indirect methods like ARTDP or model-based methods like DP. In this section we present the general principle behind the prediction of the value function of an MDP and then extend it to the control problem. Finally we will see how the efficiency of TD methods can be improved with eligibility traces and review some popular TD algorithms.
Prediction

For a Markov decision process and a policy π, the prediction problem concerns the value function V^π. Let V̂(x) be an estimate of V^π(x). Given an experience ⟨x, a, r, y⟩ and the estimates of each of these states, V̂(x) and V̂(y), it appears, relying on equation 2.7, that r + γV̂(y) is a better estimate of V^π(x) than V̂(x). The temporal difference error (TD-error)

ΔV̂ = r + γ V̂(y) − V̂(x)    (2.18)

is simply the difference between these two estimates, and is used to update the previous estimate of V^π. The construction of an estimate of V^π directly from the observation of successive states and rewards is done using the following update rule:

V̂(x) ← V̂(x) + α ΔV̂,    (2.19)

where 0 < α ≤ 1 is the learning rate. Equation 2.19 is known as the TD(0) equation. Each time the state x is visited and the above update is applied, the estimate V̂(x) becomes closer to V^π(x).
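A tabular TD(0) update fits in a few lines; this sketch assumes a dictionary of estimates and a stream of ⟨x, r, y⟩ transitions generated by following the policy (names are illustrative):

    def td0_update(V, x, r, y, alpha=0.1, gamma=0.9):
        # Equations (2.18) and (2.19): move V(x) toward r + gamma * V(y).
        delta = r + gamma * V.get(y, 0.0) - V.get(x, 0.0)
        V[x] = V.get(x, 0.0) + alpha * delta
        return delta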
Control

To use TD methods for the control problem, the prediction has to be made on the utility function Q^π(x, a) rather than on the value function V^π(x). On the other hand we need to expand the experience mentioned above by adding b, which is the chosen action when y is observed. At the end of each state-action pair transition ⟨(x, a), r, (y, b)⟩, the same update rule as for V^π(x) is applied to estimate Q^π(x, a):

Q̂(x, a) ← Q̂(x, a) + α ΔQ̂,    (2.20)

where ΔQ̂ = r + γ Q̂(y, b) − Q̂(x, a). We notice that there is a mutual influence between the policy π and the utility function Q^π. In effect a new update of Q^π changes π, which then modifies Q^π, and so on until both of them become optimal. Algorithms based on this update rule are called Sarsa (because of the tuple State, Action, Reward, State, Action) and were first investigated by Rummery and Niranjan (1994), who called it Modified Q-learning.

Q-learning (Watkins 1989) is another algorithm also based on TD-learning, which directly estimates the optimal utility function Q*. It uses the following update rule:

Q̂(x, a) ← Q̂(x, a) + α ΔQ̂,    (2.21)

where

ΔQ̂ = r + γ max_b Q̂(y, b) − Q̂(x, a).    (2.22)

Unlike Sarsa, Q-learning does not need to know the actual action that will be executed during the next experience; it simply takes the greedy action with respect to y and the current estimate of Q*. Q-learning is qualified as an asynchronous or off-policy algorithm because it can learn the utility function of a policy (the optimal one) while following another (by observing the behavior of another agent for instance). The convergence of these algorithms is guaranteed if all state-action pairs are visited an infinite number of times and the learning rate is decayed adequately. Moreover the Sarsa algorithm requires that the control policy converges little by little towards a greedy policy.
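For concreteness, the two update rules side by side as a sketch (dictionary-based tables; names are illustrative):

    def sarsa_update(Q, x, a, r, y, b, alpha=0.1, gamma=0.9):
        # On-policy update, equation (2.20): b is the action the policy selects in y.
        delta = r + gamma * Q.get((y, b), 0.0) - Q.get((x, a), 0.0)
        Q[(x, a)] = Q.get((x, a), 0.0) + alpha * delta

    def q_learning_update(Q, x, a, r, y, actions, alpha=0.1, gamma=0.9):
        # Off-policy update, equations (2.21)-(2.22): back up the greedy value in y.
        best = max(Q.get((y, b), 0.0) for b in actions)
        delta = r + gamma * best - Q.get((x, a), 0.0)
        Q[(x, a)] = Q.get((x, a), 0.0) + alpha * delta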
Eligibility Traces

One way of improving learning and dealing more efficiently with the temporal credit assignment is not only to update the value function of the state which is currently visited, but to update those that have led to it as well. To do so, we keep a record of the degree of recency of the visited states: their eligibility traces. Thus the estimate of the value function is updated for each state according to its eligibility. The update rule is

V̂(x) ← V̂(x) + α ΔV̂ e(x)   for each x ∈ X,    (2.23)

where e(x) is the eligibility of state x. It is updated on-line either by accumulating traces,

e(x) ← γλ e(x) + 1   if x is the current state;
e(x) ← γλ e(x)       otherwise,    (2.24)

or by replacing traces,

e(x) ← 1             if x is the current state;
e(x) ← γλ e(x)       otherwise,    (2.25)

where 0 ≤ λ ≤ 1 is the trace-decay factor. The difference between these two eligibility trace mechanisms is emphasized in figure 2.5. Basically, accumulating traces take into account both the frequency and the recency of the state whereas replacing traces only consider the recency. Both traces decay exponentially according to γλ when the state is no longer visited. Recent work has reported the superiority of replacing traces (Singh and Sutton 1996). Prediction algorithms based on the update 2.23 are called TD(λ) and are a generalization of TD(0). The way we introduced the eligibility traces is called the backward view of TD(λ) (Sutton and Barto 1998). It is intuitive and can be directly implemented.

Figure 2.5: Evolution of accumulating and replacing traces according to the state visits.
On the other hand, the forward view of TD(λ) is a more theoretical view and consists in making updates using predictions over several forthcoming steps.

Eligibility traces can also be used to enhance the performance of control algorithms such as Sarsa or Q-learning. However it is required to have traces for each state-action pair and not only for each state. The algorithms resulting from this combination are Sarsa(λ) (Rummery 1995) and Q(λ) (Peng and Williams 1996), and are presented in figure 2.6. The counterpart of the efficiency in the use of eligibility traces is their computational cost, because the value function and the eligibility traces have to be updated for each state (or state-action pair for the control). However there are some promising results that overcome this drawback (Cichosz 1995; Wiering and Schmidhuber 1998). The principle is to update only the states whose traces are above a certain threshold and ignore the remaining states.
Exploration

As was pointed out earlier, the convergence of TD control algorithms to an optimal policy is essentially subject to the requirement to visit all state-action pairs an infinite number of times. This is obviously not possible in practice because it would take too long before starting the optimal control. The agent is therefore faced with an interesting trade-off between (i) performing actions that will increase its knowledge about the environment (i.e. visiting new states or consolidating its experience) and (ii) actions that are optimal relative to its current estimate of the optimal policy. In fact some actions are known to give good results in a particular situation but some others are not known at all and might produce better results. This trade-off is called the exploration-exploitation dilemma. Methods to solve this dilemma can be classified into two categories: undirected methods and directed methods.

Undirected methods, also called ad hoc methods, do not use any knowledge about the learning process to direct the exploration: they make a random exploration. The simplest technique to do so is called the ε-greedy policy. It takes a greedy action by default and, with probability ε, a random action. The parameter ε is set to 1 at the beginning to favour exploration and is then progressively decreased.
Figure 2.6: Algorithms of Q(λ) and Sarsa(λ) with either replacing or accumulating traces. For λ = 0 we obtain the Sarsa and one-step Q-learning algorithms.
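Since the original algorithm listing of figure 2.6 does not survive extraction, the following is only a sketch of a tabular Sarsa(λ) episode loop with replacing traces, written from the update rules given above; the environment interface (env.reset, env.step) and the policy function are assumptions, not the thesis implementation.

    def sarsa_lambda_episode(Q, env, policy, actions,
                             alpha=0.1, gamma=0.9, lam=0.8):
        e = {}                                   # eligibility traces e(x, a)
        x = env.reset()
        a = policy(Q, x)
        done = False
        while not done:
            y, r, done = env.step(a)             # assumed interface
            b = policy(Q, y)
            delta = r + gamma * Q.get((y, b), 0.0) - Q.get((x, a), 0.0)
            e[(x, a)] = 1.0                      # replacing trace, eq. (2.25)
            for u in actions:                    # clear traces of other actions in x
                if u != a:
                    e.pop((x, u), None)
            for (s, u), tr in list(e.items()):
                Q[(s, u)] = Q.get((s, u), 0.0) + alpha * delta * tr
                e[(s, u)] = gamma * lam * tr     # decay all traces
            x, a = y, b
        return Q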
Another widely used undirected technique draws actions from the Boltzmann (softmax) distribution, in which action a is selected in state x with probability

P(a | x) = e^{Q(x,a)/T} / Σ_{b∈A} e^{Q(x,b)/T},    (2.26)

where T is the temperature parameter which controls the exploration. With a high temperature the probabilities are uniform, and as T decreases the probability of choosing the greedy action becomes closer to one.

Directed methods (see (Thrun 1992; Wyatt 1997; Wilson 1996) for more details) are based on an exploration bonus which is added to the utility function. It is worth mentioning that this bonus is simply a random value in the case of undirected methods. As for directed methods, the bonus is based on one or a combination of the following criteria:

the counter criterion, which takes into account the number of times that a state-action pair is visited;

the error criterion, which uses the variation of the utility function: the higher the variation of the utility, the more its corresponding state-action pair is preferred;

the recency criterion, which promotes state-action pairs that have been tried least recently.

Other techniques that seem to be powerful and promising are based on the Gittins' indexes and are currently investigated by (Meuleau and Bourgine 1998).
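A sketch of the two undirected selection rules discussed above (dictionary-based utility table; names are illustrative):

    import math, random

    def boltzmann_action(Q, x, actions, T=1.0):
        # Equation (2.26): softmax over the current utility estimates.
        prefs = [math.exp(Q.get((x, a), 0.0) / T) for a in actions]
        total = sum(prefs)
        return random.choices(actions, weights=[p / total for p in prefs])[0]

    def epsilon_greedy_action(Q, x, actions, eps=0.1):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((x, a), 0.0))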
2.3 Structural Credit Assignment

The natural and simplest way of representing the estimates of the value and utility functions is to use a lookup table. Such a table will have a single entry for each state or state-action pair. This kind of representation is well-suited for simple tasks with small state and action spaces. However when these spaces become huge, the problem faced goes beyond the prohibitive amount of memory needed to store the values of each entry. Specifically, the greater the number of situations with which the agent has to deal, the smaller the probability that the same situation will be faced more than once. Thus the learning process becomes difficult and the agent needs some generalization ability, which allows it to make a fair decision in a situation it has never faced before. This is known as the structural credit assignment problem and is concerned with attributing credit (or blame) to features of the faced situations in order to generalize across them.

To deal with this problem, value (or utility) functions are represented using function approximators. An ideal function approximator should use a fixed and limited amount of resources to represent a function, have good generalization abilities and be parameterizable to allow on-line estimation of the function.

Several generalization methods and function approximators have been developed and used in reinforcement learning: methods based on Hamming distance and statistical clustering (Mahadevan and Connell 1992), the Cerebellar Model Articulation Controller (CMAC) (Tham 1995; Santamaria et al. 1997; Benbrahim and Franklin 1997) and neural networks (Rummery 1995; Millan 1996). Here we will focus on neural networks, and on the multi-layer perceptron (MLP) in particular, because they are well-suited to implement gradient-descent methods (a widely used method for function approximation) using the error back-propagation algorithm, and finally because it is the approximator we used in our experiments.
2.3.1 Prediction with Function Approximator

In this section we present the general algorithm that combines both temporal difference methods and function approximation techniques. It is based on the gradient-descent approach and can be used with any function approximator.

Let's assume we have at our disposal the true values of V^π (the function we want to approximate) for each x ∈ X. Also let V̂_p(x) = V̂(p, x) be the function which approximates V^π, where p is a parameter vector. It is those parameters that are tuned so that V̂_p(x) becomes closer to V^π(x) for each x ∈ X. Finding a good approximation of V^π using V̂_p consists in finding the configuration of p that minimizes the quadratic error over the state space:

E = (1/2) Σ_{x∈X} [ V^π(x) − V̂_p(x) ]².    (2.27)

To do so, gradient-descent methods progressively reduce the observed error at each step. The parameter vector is tuned in the opposite direction of the gradient of the error with respect to p:

p ← p − α ∇_p E = p + α [ V^π(x) − V̂_p(x) ] ∇_p V̂_p(x),    (2.28)

where α is the learning rate and ∇_p is the gradient operator with respect to p. The learning rate weights the strength of the tuning so that only a small step is taken in the improving direction. If the learning rate is tuned to completely reduce the error on the observed example, then the parameter vector will not converge because it will be destabilized after each new update.

In the case of TD learning, the value we want to approach with V̂_p(x) after an experience ⟨x, a, r, y⟩ is r + γ V̂_p(y). Hence the update rule for the parameter vector is

p ← p + α ΔV̂_p e,    (2.29)

where ΔV̂_p is the TD error r + γ V̂_p(y) − V̂_p(x), α is the learning rate and e is the eligibility trace vector. In the tabular case eligibility traces were assigned to each state; in the present case they are assigned to each component of the parameter vector. Their update is

e ← γλ e + ∇_p V̂_p(x),    (2.30)

where e has an initial value of zero.

The equations presented here can be extended to estimate the utility function Q^π(x, a) in the same way as for the tabular case. In the next section we briefly introduce neural networks, and then we show how they can be used with the above update rules.
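To make the update rules concrete, here is a sketch of equations (2.28)-(2.30) for a linear approximator V̂_p(x) = p·φ(x); the feature function phi is an assumption (any fixed feature coding would do), not something prescribed by the thesis.

    import numpy as np

    def td_lambda_step(p, e, phi, x, r, y, alpha=0.05, gamma=0.9, lam=0.8):
        """One gradient TD(lambda) update of the parameter vector p."""
        v_x = p @ phi(x)
        v_y = p @ phi(y)
        delta = r + gamma * v_y - v_x       # TD error
        e = gamma * lam * e + phi(x)        # eq. (2.30): gradient of p.phi(x) is phi(x)
        p = p + alpha * delta * e           # eq. (2.29)
        return p, e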
2.3.2 Neural networks

Artificial neural networks (ANN), also known as connectionist networks, are mathematical and computational models inspired from human nervous cells. Their basic components are simple processing units (also called neurons or perceptrons) interconnected by weighted synaptic links. Each unit receives signals from other units or external sources and processes them. The result of processing is either used as input to other units or as output of the network.
Architecture

As we said above, we will only consider multi-layer perceptron (MLP) networks. In such networks, units are organized in layers: units interacting with the outside are in the input or output layers, and all other units belong to the hidden layers (figure 2.7). When the units are connected in a forward way (from the input to the output layer) we have a feed-forward network. Sometimes certain units in the hidden or output layers are fed back to previous layers and give a recurrent network.

Figure 2.7: Multi-layer perceptron network.
Activation

The activation in the network is computed by propagating the units' activation from the input to the output. The connection between two units is defined by a weight w_ij^q which determines the effect that the activation a_j^{q-1} of unit j has on unit i (figure 2.8). The activation of a unit i (its output) is calculated in the following manner:

a_i^q = F(s_i^q),    (2.31)

where q indexes the layer, F is an activation function and s_i^q is the weighted sum of the unit's inputs plus a bias b_i^q,

s_i^q = Σ_j w_ij^q a_j^{q-1} + b_i^q.    (2.32)

Figure 2.8: A connection between units of consecutive layers. The index of the layers decreases from the output to the input.

The activation function F has to be non-linear and is usually either sigmoidal, semi-linear or tangential. However a sigmoid activation function is very often used:

F(s) = 1 / (1 + e^{-s}).    (2.33)
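A sketch of this forward propagation with sigmoid units (shapes and names are illustrative assumptions):

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))     # equation (2.33)

    def forward(weights, biases, x):
        """Propagate an input through an MLP (equations (2.31)-(2.32)).
        weights[q] has shape (units in layer q+1, units in layer q)."""
        a = np.asarray(x, dtype=float)
        activations = [a]
        for W, b in zip(weights, biases):
            a = sigmoid(W @ a + b)
            activations.append(a)
        return activations                  # activations[-1] is the network output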
Back-Propagation

The principle of the back-propagation method is to propagate the error, namely the difference between the desired output and the actual output, from the output to the input units so as to know the error of each unit. It consists in using a gradient-descent technique to minimize the quadratic error

E = (1/2) (d − a)²,    (2.34)

where d is the desired output vector and a is the actual output vector of the network. To do so the gradient ∂E/∂w_ij^q is computed by decomposing it into two terms which will be separately evaluated:

∂E/∂w_ij^q = (∂E/∂s_i^q)(∂s_i^q/∂w_ij^q).    (2.35)

The second term is simply

∂s_i^q/∂w_ij^q = ∂/∂w_ij^q [ Σ_k w_ik^q a_k^{q-1} + b_i^q ] = a_j^{q-1},    (2.36)

and the first term, which is the error ε_i^q on the unit i of the layer q, is decomposed once again to give:

ε_i^q = ∂E/∂s_i^q = (∂E/∂a_i^q)(∂a_i^q/∂s_i^q).    (2.37)

As a_i^q = F(s_i^q) we immediately deduce

∂a_i^q/∂s_i^q = F′(s_i^q).    (2.38)

For the calculation of ∂E/∂a_i^q we have to consider two distinct cases, in which layer q is or is not the output layer. If it is, then

∂E/∂a_i^q = −(d_i − a_i^q),    (2.39)

and the error of an output unit is

ε_i^q = −(d_i − a_i^q) F′(s_i^q).    (2.40)

When the layer q is not the output layer, the gradient ∂E/∂a_i^q is derived from the errors of the forward layers:

∂E/∂a_i^q = Σ_k (∂E/∂s_k^{q+1})(∂s_k^{q+1}/∂a_i^q) = Σ_k ε_k^{q+1} w_ki^{q+1},    (2.41)

so that

ε_i^q = [ Σ_k ε_k^{q+1} w_ki^{q+1} ] F′(s_i^q).    (2.42)
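A compact sketch of these equations as one gradient-descent step, reusing the forward pass above (a generic illustration, not the thesis implementation; the learning rate and in-place updates are assumptions):

    import numpy as np

    def backprop_step(weights, biases, x, d, alpha=0.1):
        """One step on E = 0.5 * ||d - a||^2 (equations (2.34)-(2.42))."""
        activations = forward(weights, biases, x)
        a_out = activations[-1]
        # Output-layer error, eq. (2.40); F'(s) = a * (1 - a) for the sigmoid.
        eps = -(np.asarray(d) - a_out) * a_out * (1.0 - a_out)
        for q in reversed(range(len(weights))):
            a_prev = activations[q]
            grad_W = np.outer(eps, a_prev)         # dE/dw_ij^q = eps_i^q * a_j^{q-1}
            grad_b = eps
            if q > 0:
                # Hidden-layer error, eq. (2.42), using the weights before the update.
                eps = (weights[q].T @ eps) * a_prev * (1.0 - a_prev)
            weights[q] -= alpha * grad_W
            biases[q] -= alpha * grad_b
        return weights, biases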
Figure 2.9: Algorithm of Sarsa(λ) with a connectionist function approximator.

2.3.3 Connectionist Reinforcement Learning

To represent the utility with MLP networks (called in this case Q-nets), one has to carefully define a certain number of issues.

Basically Q-nets take as inputs a state x and an action a and produce their utility Q(x, a) as an output. So the first issue concerns the use of a single network whose inputs encode both the state and the action, or a set of |A| distinct networks whose inputs encode only the state. The monolithic case may give fair results when the action space is continuous but is not efficient to deal with domains with discrete actions. This limitation comes from the fact that the network has, in this case, to model a highly non-linear function, because for the same state different actions (usually having a similar representation) may have very different utility functions. Moreover this architecture does not support the use of eligibility traces. The distributed architecture, also called OAON (One Action One Network) (Lin 1992), associates one network to each action to reduce the interferences between actions and is suitable for use with eligibility traces.

The second issue concerns non-Markov states. Recall that a Markov state is necessary and sufficient to make the right decision and to predict the next state for a given action in a given state. When the agent does not have a Markov state it faces the hidden state problem. To cope with this problem the agent has to build an internal Markov state using history information. Recurrent neural networks construct such a history in a compact way: units in the hidden layer are fed back to a part of the input layer called the context, the rest of the input layer being devoted to the state (figure 2.10). These networks are known as Elman networks and have been used by Lin (1992) to solve several non-Markov tasks.

The last issue refers to the specifications of each of the three layers.

Figure 2.10: An Elman network as used by Lin (1992).
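A hedged sketch of the OAON idea: keep one approximator per discrete action and evaluate only the one corresponding to the chosen action. For brevity each "network" is reduced here to a linear parameter vector; the class and method names are illustrative, not the thesis code.

    import numpy as np

    class OAONQnet:
        """One Action One Network: one approximator per discrete action."""
        def __init__(self, actions, n_features):
            self.params = {a: np.zeros(n_features) for a in actions}

        def q(self, phi_x, a):
            return self.params[a] @ phi_x

        def greedy(self, phi_x):
            return max(self.params, key=lambda a: self.q(phi_x, a))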
The Input Layer

The input vector of the neural network is a representation, in terms of features coding, of the state to be evaluated. It is called the input pattern. The design of this vector is very important and has a great impact on the learning and generalization abilities of the network. The choice of the features requires a good knowledge of the task domain and their coding depends on their nature.

As far as the features allow it, the most simple and efficient way of representing them is a binary coding. If a feature has a finite and small number of possible values, for instance such as a lift's location in a building, then one input unit is associated with each of them. The unit is 'on' when the feature has the corresponding value and 'off' otherwise.
The number of hidden layers as well as the number of units in each layer are the factors that define the degrees of freedom of a neural network. Hence the more complicated the function, the more numerous the hidden layers and units. In an MLP a single hidden layer is usually sufficient, but there is no systematic means of determining the exact number of hidden units. However it has been reported by Rummery (1995) that, in reinforcement learning applications, the final performance of the system is no longer affected beyond a certain number of hidden units; only the convergence time and the computational cost become high. Therefore a possible strategy to find the ideal number of hidden units would be to start with a small number of hidden units and to increase it up to the point where no improvement can be observed.
The Output Layer

The output of the network, when it is used to approximate a utility function, is a real value. It can be either encoded by several sigmoidal output units using the technique of overlapping Gaussian ranges (Pomerleau 1991)¹ or by a single unit. In the latter case the activation function of this unit may be either linear or sigmoidal. However with a linear function the output value is not bounded, therefore a high error may be back-propagated and thereby make the units overshoot. If a sigmoid function is utilized, the output value is within the range [0, 1], so the immediate reinforcement has also to be within this range. In practice either we have an idea about the variation range of the reinforcement, so we can scale it, or we use a very small learning rate which will slow down the learning process. To overcome this handicap Benbrahim and Franklin (1997) developed a method called Self Scaling Reinforcement (SSR) which self-scales the reinforcement signal according to the minima and the maxima observed.

¹ A brief description of the coarse coding technique using the sigmoid function is given in section 4.4.2.
2.4 Summary

This chapter has set up the foundations of reinforcement learning and has overviewed related existing methods and algorithms. Let's recall that reinforcement learning has to be viewed as a class of problems or as an adaptive control paradigm rather than a particular learning technique. RL has become very popular in the field of intelligent autonomous agents and has attracted researchers from other disciplines like statistics, psychology and artificial intelligence. RL is becoming increasingly mature because, on one hand, its theoretical aspects (link with dynamic programming, choice of optimality criteria, analysis of various algorithms' behavior, function approximators) are intensively investigated and, on the other hand, the number of practical applications is continuously growing. Examples of such applications are: elevator control (Crites 1996), TD-Gammon (Tesauro 1995), dynamic channel allocation in cellular telephone systems (Singh and Bertsekas 1997) and job-shop scheduling (Zhang and Dietterich 1995). The efforts are currently focused on scaling up reinforcement learning to large, complex and partially observable problems. They involve issues such as continuous state and action spaces, representation, hierarchical control and task decomposition, and methodologies for the general application of RL. The last two issues constitute the central theme of this thesis.
Chapter 3
The Postman Robot Problem

In this thesis the postman robot problem is used as an application framework for the methodology we will introduce, and as a testbed for our experiments. In this chapter we describe the postman robot task as well as the robot and the particular setups that we have used.
3.1

The postman robot is given a set of parallel and conflicting objectives and must satisfy them as best as it can. The robot acts in an office environment composed of offices, a battery charger and a mailbox. Its task is to collect letters from the offices and post them in the mailbox. While achieving its postman's task as efficiently as possible, the robot has to avoid collisions with obstacles and recharge its batteries to prevent break-downs.
3.2 The robot

The physical robot is a Nomad 200 mobile platform (figure 3.1). It has 16 infrared sensors for ranges less than 40 centimeters, 16 sonar sensors for ranges between 40 and 650 centimeters, and 20 tactile sensors to detect contact with objects. It is also equipped with wheel encoders and a compass to compute its current location and orientation relative to its initial ones. Finally, it has three wheels controlled together by two motors which make it translate and rotate. A third motor controls the turret rotation.

Figure 3.1: The Nomad 200 mobile platform.
3.3 The Environment

The postman robot's decisions are mainly driven by the letters flow, as well as by the batteries' level. In this section we describe their dynamics and the related assumptions.

3.3.1 Assumptions

The robot picks up the letters as soon as it is in an office, posts them in the mailbox, and recharges its batteries once it is near the charger (because it does not have any grasping or recharging devices).
3.3.2 Dynamics

Let's denote x_r(t) the number of letters that the robot holds, x_{l_i}(t) the number of letters in each office i, and x_b(t) the batteries' level, at a given time step t. The evolution of these parameters is governed by the following dynamics.

Letters in an office i:

x_{l_i}(t+1) = x_{l_i}(t) + λ_i(t),   or 0 if the robot picks up the letters from office i,

where λ_i(t) is the number of incoming letters in office i at time step t.

Letters transported by the robot: x_r(t+1) increases by x_{l_i}(t) when the robot picks up the letters from office i, falls to 0 when the robot posts its letters in the mailbox, and is unchanged otherwise.

Batteries' level: the level decreases by a fixed amount at each time step, x_b(t+1) = x_b(t) − Δx_b, except when the robot is recharging at the charger.
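A sketch of one step of these dynamics as a simulator routine; the arrival process (a Bernoulli approximation of the Poisson flows of table 3.1), the consumption constant and the state fields are illustrative assumptions, not the exact simulator used in the thesis.

    import random

    def simulate_step(state, action, rates=(0.03, 0.05, 0.07), consumption=0.001):
        """One time step of the letters-flow and battery dynamics (illustrative).
        state = {"office": [n1, n2, n3], "robot": n, "battery": level}."""
        for i, lam in enumerate(rates):          # letter arrivals per office
            if random.random() < lam:
                state["office"][i] += 1
        if action.startswith("pickup_"):         # e.g. "pickup_2"
            i = int(action.split("_")[1]) - 1
            state["robot"] += state["office"][i]
            state["office"][i] = 0
        elif action == "post":
            state["robot"] = 0
        elif action == "recharge":
            state["battery"] = 1.0
        state["battery"] -= consumption          # constant consumption per step
        return state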
3.3.3 Testbed

The particular environment we used for our experiments is composed of three offices, one mailbox and one charger (figure 3.2). Its size is approximately 13m x 13m. Letter arrivals in each office are either periodic (i.e. n letters each p time steps) or follow a Poisson distribution. Table 3.1 shows the letters flow patterns that were used.

            Periodic            Poisson
            (letters/period)    (mean letters/time steps)
Office 1    1/40                3/100
Office 2    1/30                5/100
Office 3    1/20                7/100

Table 3.1: The letter arrival patterns for each office.
To carry out the experiments we had at our disposal the Nomad 200 development host, which simulates the robot's sensors and kinematics, and we wrote a program which simulates the letters arrival and the batteries' dynamics (figure 3.2). Although the robot's simulator is realistic, it is time consuming. For example, it takes about 30 seconds to move from one office to another when the simulator is run on a Sun Ultra 1 station. To speed up the simulation process, we have proceeded in the following manner. When the navigation behaviors were learned (using the Nomad 200 simulator) we measured the number of time steps needed to move from one place to another. These measures are used to define a grid simulator which is then coupled to the letters flow and batteries simulator. Thus we can test and validate the coordination of these elementary behaviors much faster, while still being able to reuse the learned coordination with the robot's simulator. As our navigation algorithms rely on the odometry (see section 4.5.1), we were unable to reuse them on the real robot because of the drift. We are currently developing other navigation behaviors based on beacons' detection.
Figure 3.2: The Nomad 200 development host and the letters flow and batteries simulator. The environment comprises three offices, the mailbox and the charger.
3.4 Summary

We have chosen the postman robot task because it provides an opportunity to apply reinforcement learning to build both reactive (navigation, obstacle avoidance) and planning (collecting and posting letters efficiently) skills of the robot. It is an instance of a more general task involving the coordination of concurrent and interfering behaviors, and is analogous to the optimal foraging problem which is usually faced by animals (Stephens and Krebs 1986). Let's add, by the way, that a postman robot is currently running in a building of Carnegie Mellon University and that its design and implementation have involved about 10 persons (Simmons et al. 1997).
Chapter 4
The Methodology

This chapter introduces a methodology to solve problems using reinforcement learning. We begin this chapter by justifying the need for a methodology in reinforcement learning. An interaction model between the agent and the environment is then presented and some important notions like agent or behavior are clarified. Then we describe the Hierarchical Problem Solving (HPS) methodology as well as its associated methods, and apply it to the postman robot problem. Finally we report the experiments we carried out and the results we obtained.
4.1

Problem solving using embedded reinforcement learning agents has become very attractive because the level of abstraction at which the designer intervenes is raised (i.e. the agent is told what to do using the reinforcement function and not how to do it) and little programming effort is required (most of the work is done by autonomous training).

Nevertheless, and despite its mathematical foundations, reinforcement learning cannot be used as it is to make agents solve complex problems. Such a limitation is essentially due to the huge search space the agent has to deal with and to the difficulty in finding the adequate reinforcement function. One way to solve complex problems is to adopt a divide-and-conquer approach: (1) breaking down the initial problem into sub-problems with small state spaces and simple reinforcement functions, (2) solving each sub-problem, (3) combining the solutions of each sub-problem to solve the original problem.
The above procedure is recognized to give fair results and has been widely applied in reinforcement learning (see (Mahadevan and Connell 1992; Lin 1993; Kalmar et al. 1998; Dietterich 1997) for instance). However only experienced designers can overcome the tricks that may appear during its use. In this chapter we introduce a methodology which integrates this procedure and helps the designer to build efficient control architectures for reinforcement learning agents.

The objective of a methodology, in any engineering field, is to provide helpful guidelines to engineers during the design process. Its role is of great importance because it not only ensures the quality of the final product but also optimizes the use of available resources, the tasks' allocation over several persons, as well as the management of the whole process. The different stages in a general engineering methodology are shown in figure 4.1. The next two sections review attempts to determine principles for the agent's design process.

Figure 4.1: The stages of a general engineering design process: define the problem, analyze the problem, make the design choices, and implement, test and validate the solution.
By setting up the foundations for autonomous agents' design principles, Pfeifer (1996) wanted to provide new insights into understanding intelligence. His main argument is that the best way to understand intelligence is to build autonomous agents. Another major purpose is that the agent's design relies on the intuitions of experienced designers and that this know-how is often left implicit in most scientific publications. Thus the design principles aim at making this knowledge explicit and provide guidance on how to build autonomous agents.

The design principles which were proposed are clustered into two classes. The first class is called task environment and concerns the definition of the ecological niche in which the agent will evolve, as well as the task it has to achieve and the behaviors it has to exhibit. The second class is devoted to the design of the agent itself and is constituted of seven principles which include issues such as agent morphology and control architecture. We review these principles as they were summarized in (Pfeifer and Scheier 1998):
1. The complete agent principle. The kind of agents of interest are the complete agents, i.e. agents that are autonomous, self-sufficient, embodied and situated.

2. The principle of parallel, loosely coupled processes. Intelligence is emergent from an agent-environment interaction based on a large number of coupled processes that run in parallel, loosely coupled processes that run asynchronously and are connected to the sensory-motor apparatus.

3. The principle of sensory-motor coordination. All intelligent behavior (e.g. perception, categorization, memory) is to be conceived as a sensory-motor coordination which serves to structure the input.

7. The value principle. This principle states that the agent has to be equipped with a value system and a mechanism for self-supervised learning employing principles of self-organization.

These design principles were successfully applied to build "Sahabot", a mobile robot whose behavior is inspired from the desert ant's behavior.
4.1.2 The BAT Methodology

The need for a principled approach to developing learning autonomous agents also motivates the efforts of Dorigo and Colombetti (1998) to define a new technological discipline called Behavior Engineering. Behavior Engineering aims at providing a methodology, a repertoire of models and a set of tools supporting all the phases of the agent development process. The methodology they proposed, called Behavior Analysis and Training (BAT) (Colombetti et al. 1996), is based on the experience acquired during their past research, and covers several issues in the building process of autonomous robots such as specification, design, training, and assessment. The BAT methodology comprises the following stages:

1. The informal (i.e. in natural language) description of the agent and its environment as well as the requirements of the desired behavior.

2. The analysis of the behavior and its decomposition into simple behaviors. The interaction between these behaviors is then defined using some operators (independent sum, combination, suppression, sequence). The result of this stage is a structured behavior.

3. The specification of the robot components, including in particular the sensors and the effectors, the controller architecture, the reinforcement function for each elementary behavior, the training strategy, and sometimes the extensions that should be added to the environment. A set of generic control architectures based on Behavioral Modules (BM) is provided to implement the structured behavior.

4. The design, the implementation and the verification of the control architecture.
The two approaches presented above constitute the main and, to the best of our knowledge, the only attempts to define a principled and systematic means of designing autonomous agents. Both of them were developed within, and especially for, the robotics field. However, some remarks can be made about them.

Pfeifer's design principles provide a set of recommendations and advice to respect, rather than guidelines to follow. They also do not deal with the testing and evaluation issues, and only timidly address the learning aspect. However, the difference between a behavior and the mechanism which produces it by interaction with the environment has been clearly stated and highlighted (this point will be detailed in the next section).

The BAT methodology explicitly guides the designer during all the stages and defines the expected result at the end of each of them. Learning is considered as an integrated part of the methodology, and the role of the trainer in making the learning process efficient is stressed. However, we regret a certain lack of formalism in the specification phase, and the fact that the decomposition process relies heavily on the designer's intuition and past experience.

In conclusion, we can point out that these approaches are (or may be) complementary, in the sense that the first one addresses the scientific part while the second one addresses the engineering part of the design of autonomous agents.
4.2

At this stage it is worth clarifying the notion of behavior, which is usually encountered in agent applications and in robotics in particular. A behavior is the description, from the observer's point of view and at different levels of abstraction, of a sequence of actions produced by the agent via its coupling with the environment. In simple words, an agent's behavior can also be defined as the result of the interaction between the agent's sensory-motor loops and the environment. In this section we describe this interaction within the reinforcement learning framework, in more depth than in chapter 2.

Figure 4.2: The agent's behavior modeled as the coupling of two dynamical systems. The sensory-motor loop (perception, revision, decision, execution) together with the reinforcement function constitutes the agent's point of view; the environment and its coupling with the agent constitute the observer's point of view.

As shown in figure 4.2, the agent's behavior is modeled as a coupling of two dynamical systems: the agent, constituted here by a single sensory-motor loop, and the environment. We also distinguish between the different points of view:

- the agent's point of view, which takes into account the internal mechanism that generates the behavior. The internal state is made Markov by the revision (or reconstruction) process, which ranges from the identity function up to the most sophisticated knowledge revision process;
- the external observer's point of view, which considers the environment, including the agent;
- the reinforcement signal, which previously came from the task (figure 2.1), is now a part of the agent. More precisely, it is a part of the agent's a priori knowledge given by the designer: the phylogenetic inheritance;
- complex behaviors may be produced by simple mechanisms through their interaction with the environment (Braitenberg 1984; Pfeifer and Scheier 1998). Hence the behavior's design process would be a projection from the problem's domain (the observer's point of view) to the co-domain (the robot's point of view).

From now on, we will use the term behavior to describe an agent solving a problem. Also, problem decomposition and sub-problem will be replaced by behavior decomposition and sub-behavior. Thus a behavior is constituted by a hierarchy of sub-behaviors, just as if we had a hierarchy of agents in which each agent solves a sub-problem. In addition, these changes will place the stress on the design of an interaction rather than on that of an isolated agent.
4.3

The Hierarchical Problem Solving (HPS) methodology we propose aims at providing a systematic approach to the use of embedded reinforcement learning agents to solve problems. It focuses on the agent's design and, more specifically, on the hierarchical aspect of the control architecture. The methodology assumes that the environment, the agent and its interaction devices, as well as the problem to solve, are predefined.

The HPS methodology will guide the designer by telling him how to:
- formally specify the agent's behavior;
- decompose the global behavior into a hierarchy of sub-behaviors;
- produce the elementary behaviors of the hierarchy, i.e. the behaviors of the lowest level;
- coordinate the sensory-motor loops at a given level of the hierarchy to get the behavior of the level above.

Figure 4.3 gives an overview of the different stages of the methodology. We notice that:
- the controller's design is iterative, that is, the results of the global behavior's evaluation can be used to correct the specifications. The cycle is repeated until the expected behavior is observed;
- the analysis process is top-down and from the observer's point of view, while the design process is bottom-up and from the robot's point of view;
- the distinction between the different points of view allows us to identify which parts have to be treated by the designer and which have to be learned by the robot. Hence we can easily combine engineering and evolution.

Figure 4.3: Overview of the stages of the HPS methodology: formal specification of the behavior, decomposition into a hierarchy of behaviors, production of the elementary behaviors of the hierarchy, and coordination of the sensory-motor loops.
4.3.1 Specification

The specification stage has an important role in the HPS methodology. On the one hand all the subsequent stages rely on it, and on the other hand it provides the assessment stage with a useful reference for matching. The dynamics of the interaction between the agent and the environment was formalized as an MDP. Thus a behavior will be represented by a particular trajectory in the MDP's state space.

By associating a quality criterion with each possible trajectory, we then have a means of specifying the desired behavior. The quality criterion can be expressed as the combination of an objective function and some constraints on the trajectory. The objective function largely depends on the nature of the problem and represents a measure of the system's performance, such as the letters collected or the fuel consumption or, more generally, the squared deviation from an optimal value. It is expressed as an integral over the trajectory generated by a control policy \pi, for a horizon N:

J(\pi) = \int_0^N f(x(t), t) \, dt.    (4.1)
The constraint set C = \{x \in X \mid \varphi_1(x) = 0, \ldots, \varphi_n(x) = 0\} reflects the aspects of the trajectory which are undesirable. So the goal is to optimize the objective function while at the same time satisfying the constraints. The constraints are enforced by augmenting the objective function as follows:

J'(\pi, \lambda) = J(\pi) + \sum_i \lambda_i \int_0^N \varphi_i(x(t)) \, dt
               = \int_0^N \Big[ f(x(t), t) + \sum_i \lambda_i \varphi_i(x(t)) \Big] dt
               = \int_0^N F(x(t), \lambda, t) \, dt,    (4.2)

where the auxiliary function F(x, \lambda) is called the Hamiltonian function and the \lambda_i are the Lagrange multipliers. They are computed using the exterior penalties method (Minoux 1986): \lambda_i(x) = 0 if the constraint \varphi_i(x) = 0 is satisfied and \lambda_i(x) = p_i otherwise. The positive constant p_i weights the strength of the penalty.

At the end of this stage the desired behavior is specified.
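To make the augmented criterion of equation 4.2 concrete, the following Python sketch evaluates J'(\pi, \lambda) on a discretized trajectory using the exterior penalties method described above. The function names, the dictionary-based state encoding and the example objective and constraint are illustrative assumptions, not part of the original methodology.

# Sketch: evaluating the augmented criterion J'(pi, lambda) of equation 4.2
# on a sampled trajectory, with exterior penalties (lambda_i = 0 when the
# constraint is satisfied, p_i otherwise).

def augmented_criterion(trajectory, dt, objective, constraints, penalties):
    """trajectory: list of states x(t); objective(x, t) -> float;
    constraints: list of functions phi_i(x) (satisfied when phi_i(x) <= 0);
    penalties: list of positive constants p_i weighting each penalty."""
    total = 0.0
    for step, x in enumerate(trajectory):
        t = step * dt
        f = objective(x, t)
        # Exterior penalties: only violated constraints contribute.
        penalty = sum(p * phi(x) for phi, p in zip(constraints, penalties)
                      if phi(x) > 0.0)
        total += (f + penalty) * dt          # integrand F(x(t), lambda, t)
    return total

# Hypothetical postman-like state: letters in the offices, carried letters, battery level.
objective = lambda x, t: sum(x["letters"]) + x["carried"]
constraints = [lambda x: 60.0 - x["battery"]]        # battery above a 60% threshold
trajectory = [{"letters": [3, 1, 2], "carried": 4, "battery": 70.0},
              {"letters": [2, 1, 2], "carried": 5, "battery": 55.0}]
print(augmented_criterion(trajectory, dt=1.0, objective=objective,
                          constraints=constraints, penalties=[10.0]))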
4.3.2 Decomposition

Human designers are usually skillful at decomposing a complex task. However, with a systematic approach they can perform better decompositions.

To decompose the main behavior into a hierarchy of sub-behaviors we propose a graphically based approach. The first step of the decomposition procedure is to graphically represent F as a function of time steps or decision steps. The next step consists in identifying the positive contributions of the agent to the optimization of this function, as well as the associated decision making (or behavior selection). These contributions usually appear as falling edges in the case of a minimization. Of course, between two falling edges other decisions could have been made, except that they do not have a positive contribution or their contribution does not appear because of the nature of the function and the kind of representation.

The surface corresponding to the integral we have to minimize is decomposed into a series of rectangles whose sides are, respectively, the distance between two falling edges and the value of the function when the second falling edge occurs (figure 4.4). We notice that the sum of the rectangles' surfaces is not exactly equal to the integral of F, but to the actual measure of the agent's contribution. This measure concerns aspects of the environment that are controllable by the agent and allows us to compare the performances of two agents. For example, in the postman robot problem, the robot can choose which office to go to but cannot act on the letters' flow. In effect, while the robot is moving towards a given place, the number of letters in standby in the offices as well as the batteries' level evolve independently of the robot's destination. They are actually affected when the destination is reached, that is, when the execution of the robot's decision is completed. The surface of each rectangle can be minimized by reducing one of its two sides. The processes consisting in minimizing each of these sides correspond to two concurrent behaviors.

The obtained behaviors are then formally specified and decomposed once again. The procedure is repeated until the behaviors cannot be decomposed anymore or can easily be produced. At that point we have a hierarchy of sub-behaviors.

Figure 4.4: The surface corresponding to the integral of F is decomposed into rectangles delimited by falling edges.
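As an illustration of the graphical decomposition procedure, the sketch below locates the falling edges of a sampled performance function F and computes the rectangle surfaces delimited by two consecutive falling edges. The function and variable names, as well as the sample values, are illustrative assumptions.

# Sketch: locating falling edges of a sampled function F and computing the
# rectangle surfaces used by the decomposition procedure (width = distance
# between two falling edges, height = value of F at the second edge).

def falling_edges(F):
    """Return the indices where F drops, i.e. F[k] < F[k-1]."""
    return [k for k in range(1, len(F)) if F[k] < F[k - 1]]

def rectangle_surfaces(F):
    edges = [0] + falling_edges(F)
    surfaces = []
    for prev, cur in zip(edges, edges[1:]):
        width = cur - prev              # decision steps between two falling edges
        height = F[cur]                 # value of F when the second edge occurs
        surfaces.append(width * height)
    return surfaces

F = [9, 9, 9, 6, 6, 6, 6, 2, 2, 1]      # hypothetical sampled values of F
print(falling_edges(F))                 # indices of the falling edges
print(rectangle_surfaces(F))            # one surface per rectangle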
Mathematical Support

In this section we provide a mathematical support for the graphically based decomposition method presented above. Let us first introduce the fundamental definition and theorem (taken from (Minoux 1986)) on which function decomposition methods rely.

Definition. We say that a function f is decomposable into f_1 and f_2 if f is separable (i.e. it can be put into the form f(x, y) = f_1(x, f_2(y))), and if moreover the function f_1 is monotone non-decreasing relative to its second argument. The following fundamental result can then be stated:

Theorem. Let f be a real function of x and of y = (y_1, \ldots, y_k). If f is decomposable with f(x, y) = f_1(x, f_2(y)), then we have

\min_{x,y} f(x, y) = \min_x \{ f_1(x, \min_y \{ f_2(y) \}) \}.

In particular, for the surface of a rectangle with sides l_1 and l_2,

\min_{(l_1, l_2)} l_1 \cdot l_2 = \min_{l_1} f_1(l_1, \min_{l_2} f_2(l_2)),

where f_1(u, v) = u \cdot v and f_2(x) = x, when l_1 and l_2 are both positive.
4.3.3 Production of the Behaviors

In this section we present a generic sensory-motor loop (figure 4.5) which allows us to generate a behavior given its specification. This stage of the methodology essentially consists in making design choices, and it concerns the elementary behaviors as well as the other sub-behaviors of the hierarchy.

The core of the sensory-motor loop is the learning system, which computes the utility of each command. The nature of the representation of the utility function depends on the size of the state space. A simple lookup table is sufficient for small spaces, but a function approximator such as those presented in section 2.3 is needed for huge spaces.

From the perceptions we have to generate an internal state representation which must be, on the one hand, complete enough to allow the prediction of future states and rewards and, on the other hand, selective, i.e. containing only information which is relevant to the behavior associated with the sensory-motor loop. Such a representation can also be learned, as reported by McCallum (1996).

The reinforcement function is an important part of the sensory-motor loop, and great care must be taken to ensure that it will lead to the desired behavior. It translates the agent's perceptions into a reward value.
The different exploration strategies were presented in section 2.2.3, so the designer can choose the most suitable among them.

Finally, as an output, the sensory-motor loop generates signals which activate or inhibit the commands. The command set may contain atomic commands, which directly interact with the environment, or sensory-motor loops in the case of a coordination.

Figure 4.5: The generic sensory-motor loop. The perceptions feed the state representation and the reinforcement function; the utility function representation, the exploration policy and the action selection mechanism produce activation/inhibition signals for the command set.
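The structure of the generic sensory-motor loop of figure 4.5 can be summarized by the following sketch. The class layout, the method names and the tabular utility representation are assumptions made for illustration only; they are not the implementation used later in the experiments.

import random

# Sketch of the generic sensory-motor loop of figure 4.5: a state
# representation, a utility function representation, a reinforcement
# function and an exploration policy feeding an action selection mechanism.
class SensoryMotorLoop:
    def __init__(self, commands, state_fn, reward_fn, epsilon=0.1):
        self.commands = commands          # atomic commands or lower-level loops
        self.state_fn = state_fn          # perceptions -> internal state
        self.reward_fn = reward_fn        # perceptions -> scalar reinforcement
        self.epsilon = epsilon            # exploration parameter
        self.q = {}                       # tabular utility representation

    def utility(self, state, command):
        return self.q.get((state, command), 0.0)

    def select(self, perceptions):
        """Action selection: epsilon-greedy over the commands' utilities."""
        state = self.state_fn(perceptions)
        if random.random() < self.epsilon:
            return random.choice(self.commands)
        return max(self.commands, key=lambda c: self.utility(state, c))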
The Reinforcement Function

To avoid the generation of wrong behaviors we propose to use the function that specifies the behavior to define the reinforcement function. As the specification function is defined from the observer's point of view, the expected behavior will be generated when this function is optimized. We then define the instantaneous reinforcement as the difference between the surfaces of two consecutive rectangles:

r(T) = F(x(T-1), \lambda) \cdot \Delta T - F(x(T), \lambda) \cdot \Delta T,    (4.3)

where T is a decision step and \Delta T is the difference, in terms of time steps, between the two decision steps T-1 and T. The reinforcement function has the form of a gradient and gives continuous information on the progress made by the agent. In addition, learning is sped up and exploration is improved (Mataric 1994). Given that the reinforcement learning algorithms we use maximize the cumulated discounted reward over an infinite horizon, we have:
\sum_{T=0}^{\infty} \gamma^T r(T) = \sum_{T=0}^{\infty} \gamma^T \big( F(x(T-1), \lambda) \cdot \Delta T - F(x(T), \lambda) \cdot \Delta T \big)
                               = F(x(-1), \lambda) \cdot \Delta_0 + (\gamma - 1) \sum_{T=0}^{\infty} \gamma^T F(x(T), \lambda) \cdot \Delta_{T+1}.    (4.4)
We notice that maximizing equation 4.4 is equivalent to the initial objective, which is to minimize equation 4.2, because 0 < \gamma < 1 and as long as the value of \gamma is chosen so that \gamma^N becomes negligible.
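A direct transcription of equation 4.3 is given below; the helper name and the way F is sampled at decision steps are assumptions made for illustration.

# Sketch: the instantaneous reinforcement of equation 4.3, computed as the
# difference between the surfaces of two consecutive rectangles.
def instantaneous_reward(F_prev, F_cur, delta_T):
    """F_prev = F(x(T-1), lambda), F_cur = F(x(T), lambda),
    delta_T = number of time steps elapsed between decisions T-1 and T."""
    return F_prev * delta_T - F_cur * delta_T

# The agent is rewarded when F decreases between two decision steps.
print(instantaneous_reward(F_prev=8.0, F_cur=5.0, delta_T=3))   # positive: progress
print(instantaneous_reward(F_prev=5.0, F_cur=7.0, delta_T=2))   # negative: regression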
The Internal State

To build an internal state that meets the completeness and selectiveness properties, the designer has to consider the following two guidelines. First, he has to identify the perceptions on which the specification function depends, that is, those which make the function change when they evolve. Then the designer has to check whether the instantaneous perceptions are sufficient to make efficient decisions. If not, some kind of context or short-term memory has to be added.
4.3.4 Coordination

During this stage the designer has to answer the following questions:
- Is the observed behavior correct?
- If not, why?
- What are the agent's performances?

Wyatt et al. (1998) argue that a correct approach is to employ multiple forms of evaluation. Thus it is possible to disambiguate the error source and to provide explanations of why the agent failed or succeeded.

Here we make the distinction between the behavior assessment (Colombetti et al. 1996) and the evaluation of the agent's learning. The former is a qualitative criterion and the latter a quantitative one. Moreover, we add two viewpoints: internal and external.

To assess a behavior the designer should validate its correctness and its robustness. This is done from the observer's point of view. A behavior is correct when the task assigned to the agent is fulfilled. For example, we will validate the postman robot if we see the robot collecting and posting the letters without running out of energy. On the other hand, a behavior is robust if it remains correct when structural changes of the environment occur. Robustness is strongly linked to the adaptiveness property. If the correct behavior is not generated, then the designer should verify the learning system qualitatively, that is, determine whether the agent is learning or not. A problem during this verification is usually due to a programming error in the software architecture.

Table 4.1: The different forms of evaluation, by viewpoint.

              Qualitative                Quantitative
  Internal    Is the robot learning?     Convergence speed, average reward
  External    Correctness, robustness
4.4 Case Study

In this study we describe the application of the HPS methodology to the postman robot problem.

To fulfill its task, the postman robot has to minimize the number of letters x_{l_i} in the offices as well as the number of letters x_r it holds, by respectively collecting and then posting them. The following objective function is derived:

f_1(x, t) = \sum_i x_{l_i}(t) + x_r(t),    (4.5)

subject to the constraint that the batteries' level stays above the threshold x_{b_{th}}:

\varphi_1(x, t) = x_{b_{th}} - x_b(t) \le 0.    (4.6)

The corresponding performance criterion is

F_1(x, \lambda_1, t) = \sum_i x_{l_i}(t) + x_r(t) + \lambda_1(x, t)\,(x_{b_{th}} - x_b(t)).    (4.7)
The two concurrent behaviors that are involved in the minimization of a rectangle's surface are:
- move to the nearest place providing a positive contribution, for the horizontal side;
- move to the place providing the highest contribution, for the vertical side.

Figure 4.6: The functions F_1, F_{21} and F_{22}, and the penalty term.

For the first behavior the robot has to minimize the distance x_d traveled between two decision steps T-1 and T. The corresponding objective function is

f_{21}(x, T) = x_d(T),    (4.8)

subject to providing a positive contribution (a falling edge in the graph). In effect, the robot may move to the nearest office, but that office may not contain any letter. This constraint is expressed by \varphi_2(x, T) = 0, where

\varphi_2(x, T) = \begin{cases} 0, & \text{if } F_1(x, \lambda_1, t_{T-1}) - F_1(x, \lambda_1, t_T) > 0 \\ 1, & \text{otherwise,} \end{cases}    (4.9)

where t_T is the time step corresponding to the decision step T. We obtain
F_{21}(x, \lambda_{21}, T) = x_d(T) + \lambda_{21}(x, T)\,\varphi_2(x, T).    (4.10)
For the second behavior, the contribution achieved at decision step T has to be maximized, which gives the objective function

f_{22}(x, T) = F_1(x, \lambda_1, t_{T-1}) - F_1(x, \lambda_1, t_T).    (4.11)

These behaviors are in turn produced by coordinating five elementary navigation behaviors:
- move to an office (3 behaviors);
- move to the mailbox;
- move to the batteries' charger.

For each navigation behavior the objective function is

f_3(x, t) = x_\theta(t),    (4.12)

where x_\theta is the robot's orientation with respect to the goal, which has to be minimized subject to the obstacle avoidance constraint

\varphi_{3i}(x) = d_{s_i} - x_{s_i} < 0,    (4.13)

where x_{s_i} is the robot's reading of sensor i, which indicates the distance to the nearest obstacle, and d_{s_i} is the nearest safe distance to an obstacle. The performance criterion we obtain is

F_3(x, \lambda_3, t) = x_\theta(t) + \sum_i \lambda_{3i}(x, t)\,(d_{s_i} - x_{s_i}(t)).    (4.14)

The hierarchy of sub-behaviors obtained at the end of this stage is sketched in figure 4.7.
Figure 4.7: The hierarchy of sub-behaviors of the postman robot. The postman behavior coordinates the intermediate behaviors nearest and highest, which in turn coordinate the elementary behaviors move to office 1, move to office 2, move to office 3, move to mailbox and move to charger.
The design choices that were made for each sensory-motor loop are now described. Each sub-behavior of the hierarchy will be learned using connectionist reinforcement learning. A single MLP with a sigmoid activation function and a single output unit was used to represent the utility function of each command. Some components of the perception vector are represented using a sigmoidal coarse coding, as in (Rummery 1995). Basically, such a coding works as follows. A number of sigmoid functions, one for each input neuron, are spread across the input space (figure 4.8). As the sigmoid functions overlap each other, each input value will be coded by several values in [0,1], corresponding to the value of each sigmoid function for that input. The input patterns for each network as well as the reinforcement functions are detailed in the experiments section.
Concerning the exploration policy, a simple \epsilon-greedy policy was used. A command is chosen according to the probability P(a = \arg\max_{a \in A(x)} Q(x, a) \mid x) = 1 - \epsilon, where \epsilon is decreased from 1 to 0 in N_{exp} exploration steps.

Figure 4.8: The real-valued input x is coarse coded into four values in [0,1], which are 0.05, 0.55, 0.95 and 1.0, and which constitute a suitable input for a neural network.
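A minimal implementation of the sigmoidal coarse coding of figure 4.8 could look as follows. The number of sigmoids, their centres and their slope are assumptions chosen only to reproduce the flavour of the example, not the exact parameters used in the experiments.

import math

# Sketch: sigmoidal coarse coding of a real input value. Each input neuron
# has its own sigmoid spread across the input range; the input is coded by
# the vector of all sigmoid responses (values in [0, 1]).
def sigmoid_coarse_code(x, centres, slope=1.0):
    return [1.0 / (1.0 + math.exp(-slope * (x - c))) for c in centres]

# Four overlapping sigmoids spread over the input range [0, 10].
centres = [2.0, 4.0, 6.0, 8.0]
print([round(v, 2) for v in sigmoid_coarse_code(5.0, centres, slope=1.5)])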
4.4.3 Coordination

We used a coordination mechanism in which the sensory-motor loops of a given layer are treated as simple commands by the upper level. Once they are activated they keep control of the agent until they are complete. Control is then returned to the sensory-motor loop which activated them. This kind of coordination is called Hierarchical Q-learning (Lin 1993).
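The run-to-completion coordination just described can be sketched as follows. The class layout, the fixed durations and the random stand-in for the Q-based selection are illustrative assumptions, not the implementation used in the experiments.

import random

# Sketch of the coordination scheme described above: the upper level treats
# lower-level sensory-motor loops as commands; an activated loop keeps control
# of the agent until it is complete, then control returns to the upper level.
class SubLoop:
    def __init__(self, name, duration):
        self.name, self.duration = name, duration
    def run_to_completion(self):
        for step in range(self.duration):      # keeps control until completion
            pass                               # primitive commands would act here
        return self.duration                   # elapsed time, relevant for SMDP-style updates

def coordinate(sub_loops, n_decisions=5):
    for decision in range(n_decisions):
        chosen = random.choice(sub_loops)      # stands in for Q-value-based selection
        elapsed = chosen.run_to_completion()
        print(f"decision {decision}: ran {chosen.name} for {elapsed} steps")

coordinate([SubLoop("move to office 1", 40), SubLoop("move to charger", 44)])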
4.4.4 Evaluation and Validation

To judge the effectiveness of the overall behavior we defined the following metrics:
- the average number of letters in standby in the offices, the average number of letters carried by the robot, as well as the average batteries' level, for the external assessment. These values are updated at each interaction cycle (the lowest temporal resolution) to guarantee uniformity in the comparison with other architectures;
- the average of the global quality criterion, updated at each decision step, to evaluate the agent's learning, for the internal assessment.
4.5 Experiments

The postman robot behavior is learned incrementally. With this technique the robot is first trained to learn the elementary behaviors and then to learn the upper behaviors using the previously acquired skills. This process, called modular learning, is repeated for each level of the hierarchy. The navigation behaviors are learned separately and preserved using persistent neural networks. The coordination behaviors are then learned so as to achieve the global behavior.
4.5.1 Learning to Navigate

Mobile robot navigation towards a goal while avoiding obstacles has been studied in the reinforcement learning context by Rummery (1995) and Millan (1996). Their work is an extension of that of Prescott and Mayhew (1992) and Krose and Van Dam (1993), in which the robot avoids obstacles not in order to get to a target location, but just to explore the environment. It may be seen as an adaptive construction of a potential field (i.e. the goal generates a potential which pulls the robot towards it, and the obstacles produce a potential which repels the robot away), where the potential vector in a given position is defined by the robot's action with the highest utility in that situation. In classical path planning (Khatib 1986; Barraquand and Latombe 1991) the potential field is computed using a priori knowledge about the environment's configuration.

In our experiments, recurrent neural networks with 2 hidden units were used to learn the navigation behaviors. A network's input pattern is a vector of 26 components which are real numbers in the interval [0,1]. The first 16 components correspond to the inverse exponential of the distance sensor readings, e^{-d/k}, where k is a weighting factor set to 50 during the experiments and d is a combination of infrared and sonar readings so as to provide measures between 0 and 650 centimeters. The next 8 components are a sigmoid coarse coding of the robot's orientation relative to the goal. The orientation is computed using odometry. The remaining 2 components represent the input context and are linked to the outputs of the hidden units. The input context as well as the orientation allow the robot to differentiate between several situations corresponding to the same sensor configuration.
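The 26-component input vector described above can be assembled as in the sketch below. The sensor values, the coarse-coding centres and slope, and the helper names are assumptions used only to make the layout of the vector explicit.

import math

# Sketch: assembling the 26-component input vector of a navigation network:
# 16 inverse-exponential distance readings, 8 coarse-coded orientation values
# and 2 context components fed back from the hidden units.
def distance_features(readings_cm, k=50.0):
    return [math.exp(-d / k) for d in readings_cm]          # e^(-d/k), in [0, 1]

def orientation_features(theta, centres, slope=2.0):
    return [1.0 / (1.0 + math.exp(-slope * (theta - c))) for c in centres]

def build_input(readings_cm, theta, context):
    assert len(readings_cm) == 16 and len(context) == 2
    centres = [-math.pi + i * (2 * math.pi / 7) for i in range(8)]
    return (distance_features(readings_cm)
            + orientation_features(theta, centres)
            + list(context))

x = build_input([120.0] * 16, theta=0.5, context=[0.0, 0.0])
print(len(x))   # 26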
The command set contains three navigation commands: turn-left and turn-right (\Delta\theta = 22.5°, translation of 25 cm) and move-forward (\Delta\theta = 0°, translation of 25 cm). The reinforcement function is

r_3(t) = F_3(x, \lambda_3, t-1) - F_3(x, \lambda_3, t).

Safety thresholds concern only the nine frontal sensors and define a security zone in front of the robot (figure 4.9). We notice that the safety thresholds are higher in front of the robot than on its sides, simply because the robot can still move even if its sides are near an obstacle, but cannot do so if its front is concerned. The values of the Lagrange multipliers when the constraints are violated are chosen to give a penalty proportional to the violated surface of the security zone, the overall zone being equivalent to the maximum deviation of the robot's heading from the goal, which is 180 degrees.

The networks' weights were initialized with random values between -0.1 and 0.1, the discount factor \gamma was fixed to 0.99, the learning rate to 2.0, the eligibility trace factor \lambda to 0.5 and the exploration parameter N_{exp} to 1500 steps. As the networks' output is in the range [0,1], we scaled the reinforcement signal between -0.1 and 0.1 to prevent the units from overshooting.
Figure 4.9: The security zone in front of the robot defined by the safety thresholds of the nine frontal sensors.

The robot was trained to learn each of the five navigation behaviors in a series of trials, with each trial starting with the robot placed in a different room and ending when it reaches the target location. Figure 4.10 shows the robot's trajectories when it navigates from one room to another, once it has learned. To evaluate the robot's learning performance we considered the behavior move to the charger. The robot was trained to reach the charger starting from office 3 only. After learning, it was able to find the optimal path leading to the charger starting from office 3 (figure 4.11) and also starting from other rooms (figure 4.12), thus exhibiting generalization abilities. Moreover, it reacts efficiently to unexpected obstacles (figure 4.13). The learning curves of figure 4.14 show that the robot learns how to move to the charger after 4 trials, corresponding to 6258 steps. However, the path found is not yet optimal and sometimes not safe either; the reason is that during this trial there is a residual exploration of 38%. Thereafter, from the 22nd trial, the path is optimal (between 41 and 44 steps) and safe, as we see in figure 4.15 that there are no more penalties after the 22nd trial. It is worth adding that during this trial the residual exploration was 16%.
4.5.2 Learning the Coordination

In this section we report the experiments we carried out to coordinate the navigation behaviors. A grid simulator configured with the distances shown in table 4.2 was used for this purpose. As shown in the hierarchy of figure 4.7, two intermediate behaviors, nearest and highest, as well as the global behavior postman have to be learned.
Figure 4.10: Trajectories of the robot navigating from one room to another after learning.

Figure 4.11: The path to the charger learned starting from office 3.

Figure 4.12: Paths to the charger starting from other rooms.

Figure 4.13: Generalization abilities.

Figure 4.14: Number of steps needed to reach the charger starting from office 3, for each trial.

Figure 4.15: Average penalties per trial for the behavior move to the charger.
Table 4.2: Steps needed by the robot to move between the different places in the environment.

              Office 1   Office 2   Office 3   Mailbox
  Office 2       39
  Office 3       62         41
  Mailbox        34         43         65
  Charger        40         29         42         44
Once again, the robot was first trained to learn the two intermediate behaviors, which were preserved thereafter, and then trained to learn the global behavior. We used feed-forward neural networks to store the Q-values. The same network architecture was used for the three above behaviors, as they share the same state space. It is composed of 40 input units, 3 hidden units and one output unit. All units have a sigmoid activation function. The input pattern is as follows:
- 35 units: each set of 7 units represents a sigmoidal coarse coding of either the number of letters in standby in one of the three offices, the number of letters carried by the robot, or the batteries' level;
- 5 units: each of these units represents a possible location of the robot, i.e. in which place it is, so that exactly one unit is 'on' at any decision step.

However, the architectures differ in their number of networks and in their reinforcement functions. The intermediate behaviors needed five networks each: one for each navigation behavior. The global behavior needed only two networks: one for each intermediate behavior. The reinforcement function of each behavior is directly computed from the corresponding performance criterion, as explained in section 4.3.3.
The networks' weights, in each sensory-motor loop, were initialized with random values in the range [-0.1, 0.1], and the rest of the parameters were set as follows: the discount factor \gamma = 0.99, the eligibility trace factor \lambda = 0.5, the learning rate 2.0 and N_{exp} = 1500. As for the navigation behaviors, the reinforcement function was scaled between -0.1 and 0.1.

Since we did not have an optimal policy for the postman robot problem, we decided to compare the performance of our hierarchical architecture with those of a flat architecture (figure 4.16) and of a hand-coded controller. In the flat architecture, the high-level behavior (postman) directly controls the navigation behaviors. It uses the same reinforcement function as the behavior highest does. We tried to use a rule-based reinforcement function, but the results were so bad that it would not be fair to compare them with the hierarchical architecture. So the comparison bears primarily on the architectures rather than on the behaviors' specification, because the behaviors were specified in the same way. The hand-coded controller uses a simple heuristic to choose between the navigation behaviors. This heuristic consists in moving to the office with the highest number of letters, posting the letters when the number of letters carried by the robot is higher than the number of letters in each office, and recharging the batteries when their level is below the threshold.

Each of these controllers was tested over 50000 decision steps, a decision step corresponding to an elementary behavior selection, and for each letters flow configuration of table 3.1. The batteries' level threshold was set to 60%.
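The hand-coded heuristic can be written down directly; the state representation and the priority order among the three rules are assumptions made for illustration.

# Sketch of the hand-coded controller's heuristic: recharge when the battery
# is below the threshold, post when carrying more letters than any office
# holds, otherwise move to the office with the most letters.
def hand_coded_decision(letters_in_offices, letters_carried, battery, threshold=60.0):
    if battery < threshold:
        return "move to charger"
    if letters_carried > max(letters_in_offices):
        return "move to mailbox"
    best_office = max(range(len(letters_in_offices)),
                      key=lambda i: letters_in_offices[i])
    return f"move to office {best_office + 1}"

print(hand_coded_decision([3, 7, 2], letters_carried=5, battery=80.0))  # office 2
print(hand_coded_decision([3, 7, 2], letters_carried=9, battery=80.0))  # mailbox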
Figure 4.16: The flat architecture used for the comparison with the hierarchical one. The postman behavior directly coordinates the five navigation behaviors move to office 1, move to office 2, move to office 3, move to mailbox and move to charger.
The tables of figure 4.17 show the obtained results. Recall that a good postman robot is one which minimizes both the number of letters in standby in the offices and the number of carried letters, while keeping its batteries' level above the fixed threshold. We can see that both RL systems achieved good performances compared to those of the hand-coded controller. The main reason is that the learning agents implicitly take into account parameters like the distances between the rooms and the letters flows. Thus they can anticipate the effect of their decisions and move, for example, to the office from which the highest number of letters will actually be collected. The hand-coded agent decides to move to the office which contains the highest number of letters at the moment the decision is taken, but not necessarily when it is completed. On the other hand, we notice that the hierarchical architecture outperforms the flat architecture. With the former architecture there are on average 11.38 and 10.32 (respectively with periodic and Poisson flow) fewer letters in standby in the offices than with the latter, whereas the average number of carried letters rises by only 7.56 and 4.23 letters. Moreover, better energy management is achieved by the hierarchical architecture. As can be observed in the curves of figure 4.18, the hierarchical architecture learns a better strategy than the flat one, and does so very quickly, i.e. it does not behave badly in the beginning. To explain this superiority we argue that the hierarchical architecture explores a smaller search space, in the sense that it coordinates only two sensory-motor loops which are pre-learned, whereas the flat architecture coordinates five sensory-motor loops. Another reason is that we were actually solving a Semi-Markov Decision Problem, that is, an MDP where the durations of the actions are not all the same. The hierarchical architecture takes this feature into account and explicitly considers the elapsed time between two decisions, whereas the flat architecture does not.
4.6 Summary

We have presented a methodology whose objective is to provide helpful guidelines to analyze and design agents capable of solving complex reinforcement learning problems. The methodology must be seen as a conceptual framework in which a number of methods are to be defined. The postman robot case study illustrated how the HPS methodology can be applied. The proposed specification and decomposition methods were successfully tested and gave good results. The methodology must now be applied to other problems in order to be generalized and completed, and some effort is needed to improve our methods or to propose new ones.
Periodic flow
  Parameters                     Hand-coded     Flat           Hierarchical
                                 architecture   architecture   architecture
  Average letters in Office 1    11.72          8.49           6.40
  Average letters in Office 2    9.87           9.70           4.76
  Average letters in Office 3    15.32          17.17          13.95
  Average letters carried        16.59          16.42          23.96
  Average battery level          68.95          68.92          72.62
  Average quality criterion      -46.61         -42.88         -38.18

Poisson distribution flow
  Parameters                     Hand-coded     Flat           Hierarchical
                                 architecture   architecture   architecture
  Average letters in Office 1    15.52          9.43           8.18
  Average letters in Office 2    16.44          13.54          9.72
  Average letters in Office 3    20.00          21.97          16.32
  Average letters carried        21.09          25.59          29.82
  Average battery level          69.78          70.56          77.51
  Average quality criterion      -62.10         -55.86         -49.80

Figure 4.17: Tables summarizing the performance of the coordination methods for the different letters flow configurations.
Figure 4.18: Average of the quality criterion as a function of decision steps, for the hand-coded controller, the flat architecture and the hierarchical architecture. The top graph concerns the periodic letters flow and the bottom graph the Poisson distribution letters flow.
Chapter 5

The Coordination Problem

This chapter concerns the coordination problem, that is, how a complex behavior can be generated by the coordination of several sensory-motor loops. We first review the hierarchical methods that have been proposed so far to scale up reinforcement learning. Then we discuss the properties that a coordination mechanism should have and propose a new coordination method based on the restless bandits theory. Restless bandits allocation indexes are an extension of the Gittins indexes and are borrowed from the field of optimal scheduling. They concern problems involving the sharing of limited resources between several ongoing projects. The performances of the proposed method are illustrated on the postman robot problem and compared to those of Hierarchical Q-learning (Lin 1993).
5.1 Statement

Consider a collection of sensory-motor loops organized in a hierarchical structure (figure 5.1) in which the sensory-motor loops at a given level have a direct causal influence, in terms of activation or inhibition, on the sensory-motor loops of the level below. In such a hierarchy, decision making and learning occur at different levels, but the interaction with the environment can only take place at the lowest level. Finally, each sensory-motor loop has its own internal state, depending on the level at which it intervenes as well as on the task it has to solve. Generally stated, the coordination problem within a hierarchy of sensory-motor loops consists in activating at each time step one sensory-motor loop at each level in order to generate the expected global behavior. This is also known as the action selection problem and concerns the resolution of the conflicts which arise when several actions or behaviors compete for access to limited motor resources. It has been studied in ethology (McFarland 1981) as well as in adaptive behavior (Tyrell 1993).

Figure 5.1: A hierarchy of sensory-motor loops. Primitive commands operate at level 0 (time scale t); the sensory-motor loops S11, S12, ..., S1n operate at level 1 (time scale T1), the loops S21, S22, ..., S2n at level 2 (time scale T2), and so on up to Sn at level n (time scale Tn).
5.2 Related Work

It has been recognized that the use of hierarchies in reinforcement learning improves learning performance. It allows a better exploration of the search space and the reuse of skills previously learned at a given level to acquire new skills at the level above, and it speeds up the overall learning process. Although we are especially interested in the selection device, that is, the mechanism that allows switching between sensory-motor loops, we take the opportunity to review most of the work done in hierarchical reinforcement learning. This work can be roughly grouped into four categories:

1. command and temporal abstraction;
5.2.1 Command and Temporal Abstraction

When we think about coordination methods, the first idea that comes to mind is to treat the sensory-motor loops as primitive commands. This approach was first introduced by Mahadevan and Connell (1992). In their work a global behavior consisting in box-pushing was decomposed into elementary sub-behaviors (finder, pusher, unwedger) which were learned independently using reinforcement learning. A hand-coded arbiter switches between the sub-behaviors according to their applicability conditions and their precedence.

Lin (1993) went further and proposed a system in which both the sub-behaviors and the arbiter are learned using Q-learning. The task to be achieved consisted of finding a batteries' charger in an office environment and connecting to it. As this task is too difficult to learn for a monolithic agent, it was decomposed into three sub-behaviors: following walls on the robot's left/right hand side, passing a door, and docking on the charger. Each sub-behavior Sb_i was learned by a single sensory-motor loop with Q-learning using a local reinforcement function. These new skills are then used as actions by the arbiter, which learns Q(state, Sb_i) with a global reinforcement function and state space. A sub-behavior is selected according to its Q-value and some applicability conditions, and a new decision is made when the active sub-behavior ends or another one becomes applicable.
5.2.2 Feudal Q-Learning

The principle of this approach, proposed by Dayan and Hinton (1993), is to operate a coarsening at each level of the hierarchy, that is, each state at a given level represents an aggregation of states at the immediately lower level. The goal state is also abstracted, so that for each level i the goal state is the one to which the goal state of the lower level i-1 belongs. Given that each level has an assigned manager, the learning procedure works as follows. The manager of a level i, being in abstract state S^i, performs a command C^i which should lead him to another abstract state. This command becomes a goal for the manager of the lower level i-1, in the sense that commands have to be executed in order to enter a state S^{i-1} belonging to the aggregation targeted by C^i. This procedure continues down to the lowest level, where a primitive command is executed. An abstract action ends when a new state is observed at the same level. A manager is then rewarded if he achieves his goal and punished otherwise. If the goal is reached at a given level, its manager delegates to his sub-manager the responsibility to search within the zone defined by his abstract state. This approach was applied to a robot navigation task in an 8 x 8 grid without obstacles.

It has recently been extended by Dietterich (1997), who adds the possibility of hierarchical learning of the Q-values. The value function of an abstract command (i.e., the sum of rewards generated by the execution of this abstract command) is treated as an immediate reward by the level that selects it, just as the first level does with primitive commands. The direct consequence of this improvement is the polling execution of the hierarchy, that is, a decision is made at each level at each time step.
5.2.3 MDP Decomposition

MDP decomposition methods consist in partitioning the state space into regions and computing an optimal policy for each of them. The resulting policies are then combined to solve the initial MDP.

In the HDG algorithm (Hierarchical Distance to Goal) proposed by Kaelbling (1993a) the state space is partitioned so that each region corresponds to a landmark. A landmark is actually a specific state, and a region is composed of the states that are closer to its landmark than to any other one. First, a high-level policy that leads to the goal region (i.e. the region containing the goal state) starting from any other region is learned. It gives the agent the next region to reach on the route from its current region (i.e. the agent's closest landmark) to the goal region. Then, for each region, a policy that allows the agent to move to the neighboring regions is learned. Once the agent is in the goal's region, it learns how to reach the goal state. The union of these policies defines the global solution.

The landmarks are given a priori by the designer. However, methods to find them autonomously are currently being investigated. Similar approaches have been studied by Parr (1998), Dean and Lin (1995) and Hauskrecht et al. (1998).
5.2.4 W-Learning

In modular approaches, sensory-motor loops are used as gating mechanisms whose role is to control the flow of commands from the bottom to the top of the hierarchy. There is no temporal or state abstraction: the problem is solved at the lowest level of abstraction by using the suggestions of several experts. Humphrys (1997) and Whitehead et al. (1993) proposed a two-level architecture in which several modules (sensory-motor loops) compete to get control of the agent. Each module learns how to achieve a sub-goal and maintains its own Q-value tables. In a given state x observed by the agent, each module M_i suggests a command c_i it wants to see executed. The module chooses the command according to its utility Q_i(x, c_i) and strengthens it with a weight W_i(x). The agent finds the module M_k with the highest weight,

W_k(x) = \max_i W_i(x),

and executes the suggested command c_k. The value of W_i(x) may be computed as

W_i(x) = Q_i(x, c_i),

called maximize best happiness by Humphrys (1997) and nearest neighbor by Whitehead et al. (1993), or as

W_i(x) = \sum_j Q_j(x, c_i),

called maximize collective happiness by Humphrys (1997) and greatest mass by Whitehead et al. (1993).

A more interesting way to compute W_i(x) is to make it express the difference between the utility Q_i(x, c_i) that a module M_i has of being obeyed and the utility Q_i(x, c_k) of not being obeyed (actually following the suggestion of module M_k):

W_i(x) = Q_i(x, c_i) - Q_i(x, c_k).

This approach is similar to the one we introduce in section 5.4.1. The idea is to minimize the worst unhappiness, that is, to perform the command c_k of the module M_k that will suffer most if it is not obeyed:

W_k(x) = \max_i \max_j \big( Q_i(x, c_i) - Q_i(x, c_j) \big).

However, the result of the selection is greatly influenced by the order in which the modules' suggestions are examined, and the same set of commands is needed for all modules.

To overcome this drawback, Humphrys (1997) proposed the following update rule, which he called W-learning, to estimate W_i(x) online, even when the modules do not share the same set of commands:

W_i(x) \leftarrow (1 - \alpha) W_i(x) + \alpha \big( Q_i(x, c_i) - (r_i + \gamma \max_{b \in C_i} Q_i(y, b)) \big),

for all i \ne k, where M_k is the winning module. We notice that the transition is caused by the command c_k and that the error represents the loss of profit of module M_i. This rule assumes that Q_i is already learned. Therefore, if Q_i and W_i(x) are to be estimated conjointly, then it is necessary to delay the learning of W_i(x).
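The W-learning update can be sketched as follows. The learning-rate symbol and the dictionary-based storage are assumptions, and, as stated above, the Q-values are assumed to be already learned.

# Sketch of the W-learning update for a non-winning module i, after the winner's
# command has been executed and the transition x -> y with reward r_i observed:
#   W_i(x) <- (1 - alpha) * W_i(x) + alpha * (Q_i(x, c_i) - (r_i + gamma * max_b Q_i(y, b)))
def w_learning_update(W_i, Q_i, x, y, c_i, r_i, commands_i, alpha=0.1, gamma=0.99):
    target = Q_i[(x, c_i)] - (r_i + gamma * max(Q_i[(y, b)] for b in commands_i))
    W_i[x] = (1.0 - alpha) * W_i.get(x, 0.0) + alpha * target
    return W_i[x]

# Hypothetical two-command module with pre-learned Q-values.
Q = {("s0", "a"): 1.0, ("s0", "b"): 0.2, ("s1", "a"): 0.5, ("s1", "b"): 0.4}
W = {}
print(w_learning_update(W, Q, x="s0", y="s1", c_i="a", r_i=0.0, commands_i=["a", "b"]))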
5.2.5 Compositional Q-Learning

Singh (1992) developed an architecture to solve compositional tasks, that is, tasks which can be expressed as a sequence of sub-tasks. The originality of his approach is that the sub-tasks are not assigned a priori to sensory-motor loops. During the learning phase a reward is generated only when a sub-task is achieved or when the whole composite task is completed. A gating function learns to select the sensory-motor loop that will actually perform its command. The winning sensory-motor loop is the one which has the best estimate of the Q-values (or the smallest expected error) for the sub-task that is currently executed. Because the sensory-motor loop that has produced the least error learns the most (in proportion to the error), the more a sensory-motor loop learns a given sub-task, the more it improves its Q-value estimate. Thus its probability of being selected for the same sub-task increases, leading to the emergence of an assignment of sub-tasks to sensory-motor loops.
5.2.6 Macro Q-Learning

Finally, Sutton et al. (1998) studied the case where an MDP has to be solved using abstract actions (options or macro-actions, as they called them). To do so they used SMDP Q-learning (Bradtke and Duff 1995; Mahadevan et al. 1997) and introduced the notion of Termination Improvement. During the execution of a particular option o, launched at time t from state s_t and normally terminating at time t+k, it is possible to update the utility values of performing option o (as well as of other options whose trajectories are included in that of option o) from each state s_{t+i} (1 \le i < k). Thus, information to make decisions is available in every state, and an ongoing option can be interrupted in any state in favor of a more promising option. The notion of macro-action interruption is discussed in the next section.
5.3

In the behavior selection scheme, once a sensory-motor loop is selected it will keep control of the agent until its completion¹. Hence learning will occur only at the termination of this sensory-motor loop. Because a selected sensory-motor loop cannot be interrupted, this scheme may have some drawbacks in problems involving the satisfaction of multiple and concurrent objectives. On the other hand, exploration is improved because the state space is covered using big steps (Dietterich 1997).

In the action selection scheme only one behavior will remain: the one which is produced by the overall system. It may be analyzed at various levels as consisting of streams of actions ranging from reactive to planning operations. At each time step the system learns and makes decisions at each level of the hierarchy. Therefore sensory-motor loops at any level may interrupt each other. Such a continual interruption leads to inefficient exploration because it reduces the probability of reaching sensible states.

Let us illustrate these two schemes with the traditional ethological example of an animal having to satisfy both hunger and thirst drives. We assume that food and water are in different locations and that there are several levels of thirst and hunger. Suppose that the animal is hungry and that this activates the behavior leading it towards the food. If the thirst level becomes higher than the hunger level and the animal cannot interrupt the selected behavior, it might die of dehydration en route towards the food. On the other hand, if it can interrupt its behaviors at any time and the levels of thirst and hunger alternately become higher and lower relative to each other, it may die of starvation or dehydration somewhere between the two locations.

These two approaches seem to be extreme but complementary. To find a compromise we can either introduce a model of fatigue, based on time sharing, into the action selection scheme, or allow the interruption of sensory-motor loops in the behavior selection scheme. The second method seems more natural than the first one but may exhibit unstable behavior. In effect, two sensory-motor loops with close activation degrees may interrupt each other (as explained in the above example), thus generating an oscillation. This phenomenon is called dithering (Prescott et al. 1999; Redgrave et al. 1999).

¹ A sensory-motor loop is completed when it reaches a state which is a goal or in which it is not applicable anymore.
A way of overcoming this problem is to add some kind of persistence to the active sensory-motor loop. This means that to interrupt an active sensory-motor loop, the candidate sensory-motor loop (i.e. the one with the highest activation degree among the applicable but inactive sensory-motor loops) must not only have a greater activation degree than the active sensory-motor loop, but must also exceed it by a given constant w/2. The constant w is the width of the hysteresis loop representing the behavior switching between active and passive phases (figure 5.2). It has been hypothesized that a selection mechanism of this form is implemented in the vertebrate brain by the basal ganglia (Prescott et al. 1999; Redgrave et al. 1999).

Figure 5.2: The hysteresis loop representing the behavior switching between the active and passive phases. I_c is the index of the candidate behavior and w is the width of the hysteresis.
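The persistence mechanism of figure 5.2 amounts to a simple hysteresis test; the sketch below illustrates it with made-up index values and names.

# Sketch: hysteresis-based switching. The candidate behavior replaces the
# active one only if its index exceeds the active index by more than w/2,
# which prevents dithering between behaviors with close activation degrees.
def select_with_hysteresis(indexes, active, w=0.4):
    candidate = max(indexes, key=indexes.get)
    if candidate != active and indexes[candidate] > indexes[active] + w / 2.0:
        return candidate          # interruption: the candidate wins clearly
    return active                 # persistence: keep the active behavior

indexes = {"feed": 0.55, "drink": 0.60}
print(select_with_hysteresis(indexes, active="feed"))   # stays "feed"
indexes["drink"] = 0.80
print(select_with_hysteresis(indexes, active="feed"))   # switches to "drink"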
5.4 Indexed Policy

An indexed policy consists in allocating an index to each sensory-motor loop and activating the one with the highest index in a winner-take-all manner. Of course, indexes which are computed adaptively and on-line are highly desirable. In Hierarchical Q-learning (Lin 1993) the indexes simply correspond to the Q-values of selecting a sensory-motor loop in a certain global state. In W-learning the index is the value of the strength W(x). Here we introduce another method to compute such indexes, based on the restless bandits theory, which we call RBI-learning.
5.4.1 The Restless Bandits

The restless bandits are an extension of the multi-armed bandit problem and have been studied by Whittle (1988) and by Weber and Weiss (1989). The initial problem concerns n projects, the state of a project i at time t being denoted by x_i(t). At each time step t only one project may be operated. If the operated project is i, then it generates a reward r_i(t) and makes a transition x_i(t) \to x_i(t+1) according to its transition probabilities P_i. The other n-1 projects remain frozen, i.e. they neither produce reward nor change state. A project is said to be in an active or a passive phase depending on whether it is selected or not. Gittins (1989) has shown that an index policy is optimal for this problem. Such an index is denoted I_i(x_i) and is a function of the project i as well as of its state x_i:

I_i(x_i) = \max_{\tau > 0} \frac{E\big[ \sum_{t=0}^{\tau - 1} \gamma^t r_i(t) \big]}{E\big[ \sum_{t=0}^{\tau - 1} \gamma^t \big]}.    (5.1)

This index can be interpreted as the maximal value of the reward density relative to the stopping time \tau. The optimal policy is simply to select the project with the greatest index. The nice property of such a strategy is that I_i only depends on information concerning project i. The dimensionality of the problem is thus considerably reduced.
To gain a better and more intuitive understanding of the Gittins indexes, we will examine the following didactic example provided by Duff², where for the sake of simplicity the rewards are deterministic. Imagine several stacks containing numbers, which are rewards, and suppose that we can see the entire contents of each stack. Our goal is to pop the stacks in an order that maximizes the discounted sum of the resulting reward stream. We can convince ourselves that the optimal strategy involves popping the stack with the highest reward density:

D_i = \max_{T} \frac{\sum_{k=0}^{T-1} \gamma^k r_i(k)}{\sum_{k=0}^{T-1} \gamma^k},    (5.2)

where r_i(k) is the content of stack i at position k, starting from the top. Stacks with a higher reward density contain high rewards near their top and have to be popped first because of the discount factor (figure 5.3).

² Personal communication.

Figure 5.3: Stacks' reward densities for \gamma = 0.9 (D_1 = 4.05, D_2 = 8.84, D_3 = 2.07). Notice that the stack to pop is not necessarily the one with the highest value at its top.
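The stack example can be reproduced numerically as follows; the stack contents below are made up and do not correspond to those of figure 5.3.

# Sketch: reward density of a stack (equation 5.2) and the greedy strategy of
# always popping the stack with the highest density.
def reward_density(stack, gamma=0.9):
    """stack[0] is the top; returns the max over T of the discounted reward density."""
    best = float("-inf")
    num = den = 0.0
    for k, r in enumerate(stack):
        num += (gamma ** k) * r
        den += gamma ** k
        best = max(best, num / den)
    return best

stacks = [[1, 9, 2], [3, 0, 0], [2, 2, 2]]       # hypothetical stack contents
densities = [reward_density(s) for s in stacks]
print([round(d, 2) for d in densities])
# The stack to pop is not necessarily the one with the highest top value.
print("pop stack", densities.index(max(densities)) + 1)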
Unfortunately this method cannot be directly applied to solve the coordination problem (the projects being replaced by sensory-motor loops³), because the fundamental assumption (i.e. that the unselected projects remain frozen) is not valid anymore. This happens in many cases, and especially in mobile robotics, because the states of the sensory-motor loops are built from the same agent's perceptions, and these perceptions evolve whatever the selected sensory-motor loop is.

To treat the restless bandits problem we will use the following notation: r_i^k denotes the reward generated by project i in phase k, and P_i(x, y, k) its transition probabilities, where k = 1 or 2 for respectively the active or the passive phase.

³ To introduce the theory we will use the term project instead of sensory-motor loop.
If we want to maximize the discounted sum of rewards over an infinite horizon, then for a single project i we just have to solve the following optimality equation:

V_i(x) = \max_{k=1,2} E\Big\{ r_i^k + \gamma \sum_{y \in X_i} P_i(x, y, k) V_i(y) \Big\},    (5.3)

where V_i(x) is the value function of project i in state x. To do so we will compute the Q-values

Q_i(x, k) = E\Big\{ r_i^k + \gamma \sum_{y \in X_i} P_i(x, y, k) V_i(y) \Big\}, \qquad V_i(x) = \max_{b=1,2} Q_i(x, b),    (5.4)

and then decide to activate or to freeze the project according to its Q-values.
Consider now the multi-project case. We are essentially interested in maximizing

E\Big\{ \sum_t \gamma^t \sum_i r_i^{k_i(t)}(t) \Big\},    (5.5)

subject to the constraint that only one project is active at each time step. Accounting for this constraint with multipliers l_i(t) leads to the relaxed criterion

E\Big\{ \sum_t \gamma^t \Big( \sum_i r_i^{k_i(t)}(t) - \sum_i l_i(t) \Big) \Big\},    (5.6)

which decouples into per-project optimality equations of the same form as (5.3) and (5.4),

V_i(x) = \max_{k=1,2} \big\{ L_i^k(x) \big\}, \qquad Q_i(x, k) = L_i^k(x),    (5.7), (5.8)

where

L_i^k(x) = r_i^k + \gamma \sum_{y \in X_i} P_i(x, y, k) V_i(y).    (5.9)
Proof. Let x = (x_1, x_2, \ldots, x_i, \ldots, x_n) be the composite state of the global problem, and let Q(x, k) be the utility of activating project k in state x:

Q(x, k) = Q_k(x_k, 1) + \sum_{i \ne k} Q_i(x_i, 2), \qquad \forall k \in [1, n].

For any two projects k and m the terms \sum_{i \ne k, m} Q_i(x_i, 2) cancel out, so that

Q(x, k) - Q(x, m) = \big( Q_k(x_k, 1) - Q_k(x_k, 2) \big) - \big( Q_m(x_m, 1) - Q_m(x_m, 2) \big).

Hence activating the project with the greatest value of Q_i(x_i, 1) - Q_i(x_i, 2) maximizes Q(x, k).

Intuitively we can see that this index actually reflects the need for a project to be active with respect to the exploration and exploitation criteria. Indeed, the value of the index I_i(x_i) = Q_i(x_i, 1) - Q_i(x_i, 2) increases if
- Q_i(x_i, 1) increases, which means that the project needs to be active (exploitation phase), or
- Q_i(x_i, 2) decreases, which means that the project does not want to be passive (exploration of the effects of its activation). This condition holds as long as the project deteriorates during its passive phase (i.e., receives negative rewards).
Figure 5.4: Algorithm of RBI-learning.
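Since the body of the algorithm in figure 5.4 is not reproduced here, the following sketch only illustrates how an RBI-learning step could be organized from the description in the text: each sensory-motor loop maintains Q-values for its active and passive phases, its index is the difference between them, and the loop with the greatest index is activated. All names, the tabular storage, the update form, and the choice to update every loop's current phase online are assumptions, not the original algorithm.

# Hedged sketch of one RBI-learning step (not the original algorithm of figure 5.4).
ACTIVE, PASSIVE = 1, 2

def index(Q, i, x):
    return Q[i].get((x, ACTIVE), 0.0) - Q[i].get((x, PASSIVE), 0.0)

def rbi_step(Q, states, rewards, next_states, alpha=0.1, gamma=0.99):
    # Activate the loop with the greatest index, in a winner-take-all manner.
    winner = max(Q, key=lambda i: index(Q, i, states[i]))
    for i in Q:
        phase = ACTIVE if i == winner else PASSIVE
        x, y, r = states[i], next_states[i], rewards[i][phase]
        old = Q[i].get((x, phase), 0.0)
        best_next = max(Q[i].get((y, k), 0.0) for k in (ACTIVE, PASSIVE))
        Q[i][(x, phase)] = old + alpha * (r + gamma * best_next - old)
    return winner

# Hypothetical usage with two sensory-motor loops.
Q = {"mailbox": {}, "charger": {}}
states = {"mailbox": "s0", "charger": "s0"}
rewards = {"mailbox": {ACTIVE: 0.05, PASSIVE: -0.02},
           "charger": {ACTIVE: 0.01, PASSIVE: -0.01}}
print(rbi_step(Q, states, rewards, next_states={"mailbox": "s1", "charger": "s1"}))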
On the other hand, the utilities for a project of being active or passive can be seen respectively as activation and inhibition signals. Thus, persistence may be implemented by simply removing the inhibition signal from the selected sensory-motor loop and keeping it for the others.

Our coordination method may be situated between Hierarchical Q-learning and W-learning. RBI-learning and W-learning are similar because they are both motivated by the same criterion, which is to reduce the loss of profit when a project (module) is not selected (obeyed). However, they differ in the sense that RBI-learning supports temporal abstraction (like Hierarchical Q-learning) whereas W-learning does not: W-learning needs to perform an update after each execution of a primitive command. In addition, RBI-learning is supported by a strong theory and does not require any pre-learned Q-values.
5.5 Experiments

The coordination method presented above is now evaluated and its performances compared to those of Hierarchical Q-learning⁴. To do so we followed the HPS methodology, so some results and settings from the previous chapter are reused. We decided to test these coordination methods on the flat architecture, using the behavior selection scheme. The network architecture used to implement the Hierarchical Q-learning method is the same as in the previous chapter; we review it here. Each of the five neural networks of the architecture is composed of 40 input units, 3 hidden units and one output unit. All units have a sigmoid activation function. The input pattern is as follows:
- 35 units: each set of 7 units represents a sigmoidal coarse coding of either the number of letters in standby in one of the three offices, the number of letters carried by the robot, or the batteries' level;
- 5 units: each of these units represents a possible location of the robot, i.e. in which place it is.

Recall also that the function to be optimized is f(x, \lambda, t) = \sum_i x_{l_i}(t) + x_r(t) + \lambda(x, t)(x_{b_{th}} - x_b(t)), and that the instantaneous reinforcement function is r(t) = f(x, \lambda, t) - f(x, \lambda, t-1). For RBI-learning, the above function is linearly decomposed into five functions, one for each elementary sensory-motor loop. We obtained:
- f_i(x, t) = x_{l_i}(t) for the sensory-motor loops corresponding to the behaviors move to office i;
- f_4(x, t) = x_r(t) for the sensory-motor loop corresponding to the behavior move to the mailbox;
- f_5(x, \lambda, t) = \lambda(x, t)(x_{b_{th}} - x_b(t)) for the sensory-motor loop corresponding to the behavior move to the charger.

⁴ We have not made any comparison with W-learning. The reason is that it is not applicable to the postman robot problem because it does not support temporal abstraction. In effect, the update of the Q-values of primitive commands (the robot's movements) would be inefficient because the state space is huge and the reinforcement is only given when the robot reaches one of the sub-goals (offices, mailbox, charger).
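The linear decomposition of the global function into one term per sensory-motor loop can be written directly, as in the sketch below. The dictionary layout and the state encoding are assumptions; the terms for the office behaviors are inferred from the linear decomposition stated above, and the battery term is assumed to belong to move to the charger.

# Sketch: linear decomposition of f(x, lambda, t) into one function per
# elementary sensory-motor loop, as used by RBI-learning.
def per_loop_objectives(bth=60.0):
    return {
        "move to office 1": lambda x: x["letters"][0],
        "move to office 2": lambda x: x["letters"][1],
        "move to office 3": lambda x: x["letters"][2],
        "move to mailbox":  lambda x: x["carried"],
        "move to charger":  lambda x: max(0.0, bth - x["battery"]),  # exterior penalty term
    }

x = {"letters": [3, 1, 4], "carried": 2, "battery": 55.0}
fs = per_loop_objectives()
print({name: f(x) for name, f in fs.items()})
print(sum(f(x) for f in fs.values()))    # sums back to the global criterion (unit penalty weight)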
We used two different network architectures to implement the restless bandits method. In the first one, all the sensory-motor loops share the same state space, so it is similar to the architecture of Hierarchical Q-learning. In the second architecture, the state space is reduced for each sensory-motor loop in order to keep only the features relevant to the function to be optimized. Thus, for each sensory-motor loop, we kept the features representing the robot's location (5 units) and the features representing the quantity to be optimized (7 units), corresponding for example to the number of letters carried by the robot for the behavior move to the mailbox. Hence we obtained networks with 12 input units, 2 hidden units and one output unit. For both architectures, however, we needed two networks for each sensory-motor loop: one to approximate the Q-values of each phase (passive or active) of the sensory-motor loop.

The networks' weights, in each sensory-motor loop, were initialized with random values in the range [-0.1, 0.1], and the reinforcement function was scaled between -0.1 and 0.1. The rest of the parameters were set as follows: \gamma = 0.99, \lambda = 0.5, a learning rate of 2.0 and, for Hierarchical Q-learning, N_{exp} = 1500. There was no exploration phase for RBI-learning.

Each of these controllers was tested over 50000 decision steps, a decision step corresponding to an elementary behavior selection, and for each letters flow configuration of table 3.1. The batteries' level threshold was set to 60%.
The results reported in the tables of figure 5.5 show that RBI-learning outperforms Hierarchical Q-learning. For the Poisson distribution we can see that with the former method there are on average 8.75 fewer letters in standby in the offices than with the latter, whereas the average number of carried letters increases by only 5.88 letters. For the periodic flow the numbers of carried letters are almost the same, whereas the number of letters in standby drops by 5.33 letters for RBI-learning. Moreover, better energy management is achieved by the restless bandits method. The performances of RBI-learning can be explained by the fact that the reinforcement function is decomposed and that there is a good balance between exploration and exploitation, which allows a good strategy to be found very quickly (figure 5.6).

Surprisingly, RBI-learning with the reduced state space did not give the expected results. We expected that, because of the small search spaces, a better strategy would have been found, or at least that the indexes would have been learned more quickly. It seems that, in practice, better performances are to be expected when the state space is the same⁵.

⁵ Personal communication.
Periodic flow

    Parameters                    Hierarchical   Restless Bandits   Restless Bandits
                                  Q-learning     (full space)       (reduced space)
    Average letters in Office 1       8.49            8.88               9.82
    Average letters in Office 2       9.70            8.26              16.65
    Average letters in Office 3      17.17           12.89              18.47
    Average letters carried          16.42           15.31              16.18
    Average battery level            68.92           71.00              71.42
    Average quality criterion       -42.88          -37.42             -53.06

Poisson distribution flow

    Parameters                    Hierarchical   Restless Bandits   Restless Bandits
                                  Q-learning     (full space)       (reduced space)
    Average letters in Office 1       9.43            7.75              10.15
    Average letters in Office 2      13.54           11.21              17.99
    Average letters in Office 3      21.97           17.23              21.00
    Average letters carried          25.59           31.47              27.82
    Average battery level            70.56           72.12              68.17
    Average quality criterion       -55.86          -52.82             -63.26

Figure 5.5: Tables summarizing the performances of the coordination methods for the different letters flow configurations.
5.6 Summary

In order to solve the coordination problem we took inspiration from the functioning of the action selection device of natural control systems. We proposed a computational model based on restless bandits indexes that implements such a device and showed that its performance surpasses that of an existing method. However, we have used the behavior selection scheme without interruption in our implementation because, so far, we do not have a clear idea of how interruption should work. We think that this issue is of great importance and we will investigate it in our future work.
[Figure 5.6 shows two plots of the average quality criterion against the time step (0 to 50000), comparing Hierarchical Q-Learning, Restless Bandits with full space and Restless Bandits with reduced space.]

Figure 5.6: Average of the quality criterion as a function of decision steps. The top graph concerns the periodic letters flow and the bottom graph the Poisson distribution letters flow.
Chapter 6

Conclusion

6.1 Summary of contributions
The work presented in this thesis was motivated by the need to solve complex problems using embedded agents learning by reinforcement. We identified and analyzed the reasons that make standard reinforcement learning methods impractical in complex domains and proposed some mechanisms to scale up these approaches. Our contributions are summarized as follows.
We set up a new design methodology whose aim is to systematize the agent's design process (Faihe and Muller 1997). It provides a conceptual framework for designing hierarchical control architectures for embedded autonomous agents. The objectives of each stage of the methodology were clearly defined and the distinction was made between what the agent has to learn and what has to be given a priori by the designer.
Assuming that the solution to the problem corresponds to a particular pattern of interaction between the agent and the environment, we established the relationship between solving a problem and generating a behavior. We then proposed a way of formally specifying a behavior. To do so we used a quality criterion composed of an objective function and a set of constraints. The desired behavior is the one generating a trajectory (in the interaction space) that optimizes the objective function without violating the constraints. In addition to being both a formal and natural means of defining a behavior, the proposed method allows us to derive the reinforcement function (as a progress estimator), to learn the behavior, and provides a good basis for the decomposition process.
A graphical approach was proposed to perform the problem's decomposition (or the behavior's decomposition, since to each problem corresponds a set of behaviors that solves it). Although this technique still relies partly on the designer's intuition and experience, it makes it possible to discover sub-behaviors that would not be identified otherwise.
Concerning the coordination problem, we reviewed the features that are present in the behavior selection mechanism of natural systems and highly desirable in artificial systems. We proposed a coordination method based on restless bandits indexes (Faihe and Muller 1998). It extends and generalizes W-learning, is completely distributed and has been shown to be more powerful than Hierarchical Q-learning for the postman robot task.
The feasibility of the methodology as well as the performance of the methods were demonstrated on the postman robot problem, which is a non-trivial task. In addition, we developed and implemented a three-level architecture, which is rarely found in the reinforcement learning literature.
6.2 Practical Issues
The implementation of the postman robot architecture was not straightforward and sometimes resulted in agents that failed to converge to a satisfactory solution. The main difficulty was finding a good tuning of the parameters, which are the learning rate, the eligibility trace factor, the discount factor and the number of exploration steps Nexp. Unfortunately there is no scientific method for tuning such parameters, so they are chosen according to one's own experience and experiments as well as those reported by other researchers.
We noticed that the learning rate and the eligibility trace factor are closely linked and that the evolution of one of them affects the appropriate value of the other. A bad setting of these parameters results either in a slow convergence or in a complete failure of the learning process. We decided to measure the performance of the agent (i.e. the average value of the quality criterion after 50000 decision steps) for several values of both parameters. The best results were obtained for an eligibility trace factor of 0.5 and a learning rate of 2.0, which are the values used during our experiments for all the architectures.
The number of exploration steps was easy to find. Starting with a small value of Nexp (200 steps), we increased it progressively up to 8000 steps and recorded the agent's performance. We noticed that the performance improves while Nexp increases, then stabilizes between 1500 and 4000 steps, and deteriorates thereafter. In effect, if the value of Nexp is too low the agent will be unable to find a good policy (due to the lack of search) while, on the other hand, a high value will prevent the agent from consolidating its knowledge because of the random perturbations.
For the discount factor one may wonder whether to discount (γ < 1) or not (γ = 1). Discounting is useful for any task that is learned in trials. The navigation tasks, for example, are well suited to being learned with discounting because solutions that allow the agent to reach the goal in very few steps are preferred. The scheduling task (the coordination of the navigation behaviors), however, is a continuous task. Therefore a natural and logical optimality criterion would be the average reward received over time. General results for online learning with such a criterion are still under development (Mahadevan 1994). Nevertheless, we obtained fair performance with γ = 0.99.
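In symbols, the two optimality criteria being contrasted are the discounted return and the average reward; these are the standard textbook definitions rather than formulas taken from this chapter:

    R_t = \sum_{k=0}^{\infty} \gamma^k \, r(t+k+1), \quad 0 \le \gamma < 1,
    \qquad\text{versus}\qquad
    \rho = \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} r(t).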
Another difficulty we had to face concerns the stability of the neural networks. It was impossible to get a stable network with a linear output unit, even with a very low learning rate (of the order of 10^-3). For this reason we used networks with non-linear output units. Nevertheless, we were constrained to scale the reinforcement value between -0.1 and 0.1 to avoid large updates, which may make the units blow up.
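Such a scaling can be as simple as the following sketch; the bound on the raw reinforcement (r_max) is an assumption made for the example.

    def scale_reinforcement(r, r_max=10.0, bound=0.1):
        # Map a raw reinforcement into [-0.1, 0.1]; r_max is the largest
        # magnitude the raw reinforcement is expected to take (assumed here).
        return max(-bound, min(bound, bound * r / r_max))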
6.3 Future work
Further research that can be carried out in the direction of the work presented in this dissertation is twofold: it may concern the extension of the methodology or the improvement of the proposed methods.

One possible way of extending the methodology would be to automate the processes which require extensive human intervention. Such processes are the decomposition of a behavior into sub-behaviors and the design of the sensory-motor loops. We do believe that animals which learn by reinforcement (such as birds learning to fly) were born with all the necessary structures to achieve such learning. These structures are genetically
Epilogue

The work presented in this thesis takes place within the general context of learning and development in artificial creatures. The long-term objective is to find mechanisms that
Bibliography
Barraquand, J. and J. C. Latombe (1991). Robot motion planning: A distributed representation approach. The International Journal of Robotics Research 10(6), 628-649.

Barto, A., R. Sutton, and C. Watkins (1990). Learning and sequential decision making. In M. Gabriel and J. W. Moore (Eds.), Learning and Computational Neuroscience. MIT Press.

Barto, A. G., S. J. Bradtke, and S. P. Singh (1995). Learning to act using real-time dynamic programming. Artificial Intelligence 72(1-2), 81-138.

Barto, A. G. and S. P. Singh (1990). On the computational economics of reinforcement learning. In D. S. Touretzky (Ed.), Connectionist Models: Proceedings of the 1990 Summer School. Morgan Kaufmann.

Benbrahim, H. and J. A. Franklin (1997). Biped dynamic walking using reinforcement learning. Robotics and Autonomous Systems 22, 284-302.

Bradtke, S. J. and M. O. Duff (1995). Reinforcement learning methods for continuous-time Markov decision problems. In Advances in Neural Information Processing Systems 7. MIT Press.

Braitenberg, V. (1984). Vehicles: Experiments in Synthetic Psychology. MIT Press.

Cichosz, P. (1995). Truncating temporal differences: On the efficient implementation of TD(λ) for reinforcement learning. Journal of Artificial Intelligence Research 2, 287-318.

Colombetti, M., M. Dorigo, and G. Borghi (1996). Behavior analysis and design: a methodology for behavior engineering. IEEE Transactions on Systems, Man and Cybernetics 26.
Crites, R. H. (1996). Large-Scale Dynamic Optimization using Teams of Reinforcement Learning Agents. Ph.D. thesis, University of Massachusetts.

Dayan, P. and G. E. Hinton (1993). Feudal reinforcement learning. In Advances in Neural Information Processing Systems 5.

Dean, T. and S.-H. Lin (1995). Decomposition techniques for planning in stochastic domains. Technical Report CS-95-10, Brown University.

Dietterich, T. G. (1997). Hierarchical reinforcement learning with the MAXQ value function decomposition. Technical report, Oregon State University.

Dorigo, M. and M. Colombetti (1998). Robot Shaping: An Experiment in Behavior Engineering. MIT Press/Bradford Books.

Faihe, Y. and J.-P. Muller (1997). Behavior analysis and design: Towards a methodology. In A. Birk and J. Demiris (Eds.), Proceedings of the Sixth European Workshop on Learning Robots (EWLR6), Lecture Notes in Artificial Intelligence. Springer-Verlag.

Faihe, Y. and J.-P. Muller (1998). Behaviors coordination using restless bandits allocation indexes. In Proceedings of the Fifth International Conference on Simulation of Adaptive Behavior (SAB98).

Gittins, J. C. (1989). Multi-armed Bandit Allocation Indices. Wiley.

Hauskrecht, M., N. Meuleau, C. Boutilier, L. P. Kaelbling, and T. Dean (1998). Hierarchical solution of Markov decision processes using macro-actions. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI98).

Humphrys, M. (1997). Action Selection Methods using Reinforcement Learning. Ph.D. thesis, University of Cambridge.

Kaelbling, L. P. (1993a). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning. Morgan Kaufmann.

Kaelbling, L. P. (1993b). Learning in Embedded Systems. MIT Press.
Mataric, M. J. (1994). Reward functions for accelerated learning. In Proceedings of the Eleventh International Conference on Machine Learning. Morgan Kaufmann.

McCallum, A. (1996). Reinforcement Learning with Selective Perception and Hidden State. Ph.D. thesis, University of Rochester.

McFarland, D. (1981). Animal Behaviour. Longman.

Meuleau, N. and P. Bourgine (1998). Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Machine Learning. In press.

Millan, J. d. R. (1996). Rapid, safe and incremental learning of navigation strategies. IEEE Transactions on Systems, Man and Cybernetics 26.

Minoux, M. (1986). Mathematical Programming. John Wiley and Sons.

Parr, R. (1998). Flexible decomposition algorithms for weakly coupled Markov decision problems. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI98).

Peng, J. and R. J. Williams (1996). Incremental multi-step Q-learning. Machine Learning 22, 283-290.

Pfeifer, R. (1996). Building "fungus eaters": Design principles of autonomous agents. In Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior (SAB96).

Pfeifer, R. and C. Scheier (1998). Introduction to "New Artificial Intelligence". MIT Press. Book manuscript under review.

Pomerleau, D. A. (1991). Efficient training of artificial neural networks for autonomous navigation. Neural Computation 3(1), 88-97.

Prescott, T. J. and J. E. Mayhew (1992). Obstacle avoidance through reinforcement learning. In Advances in Neural Information Processing Systems 4, pp. 523-530. Morgan Kaufmann.

Prescott, T. J., P. Redgrave, and K. Gurney (1999). Layered control architectures in robots and vertebrates. Adaptive Behavior. To appear.
Redgrave, P., T. J. Prescott, and K. Gurney (1999). The basal ganglia: A vertebrate solution to the selection problem? Neuroscience. To appear.

Rummery, G. A. (1995). Problem Solving With Reinforcement Learning. Ph.D. thesis, University of Cambridge.

Rummery, G. A. and M. Niranjan (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR166, Cambridge University.

Santamaria, J. C., R. S. Sutton, and A. Ram (1997). Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior 6(2), 163-217.

Simmons, R., R. Goodwin, K. Z. Haigh, S. Koenig, and J. O'Sullivan (1997). A modular architecture for office delivery robots. In Proceedings of the First International Conference on Autonomous Agents. ACM Press.

Singh, S. P. (1992). Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning 8(3/4), 323-339.

Singh, S. P. and D. Bertsekas (1997). Reinforcement learning for dynamic channel allocation in cellular telephone systems. In Advances in Neural Information Processing Systems. MIT Press.

Singh, S. P. and R. S. Sutton (1996). Reinforcement learning with replacing eligibility traces. Machine Learning 22, 123-158.

Stephens, D. W. and J. R. Krebs (1986). Foraging Theory. Princeton University Press.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3, 9-44.

Sutton, R. S. and A. G. Barto (1998). Reinforcement Learning: An Introduction. MIT Press.

Sutton, R. S., D. Precup, S. Singh, and B. Ravindran (1998). Improved switching among temporally abstract actions. In Advances in Neural Information Processing Systems 11. MIT Press.
Zhang, W. and T. G. Dietterich (1995). A reinforcement learning approach to job-shop scheduling. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence. Morgan Kaufmann.