
A Learning Classifier System

Approach to Relational
Reinforcement Learning

Drew Mellor
B. Comp. Sci. (Hons)

Submitted in partial fulfilment of the requirements for the degree of


Doctor of Philosophy (Computer Science)

School of Electrical Engineering and Computer Science


The University of Newcastle
Callaghan, 2308
Australia

February 2008
I hereby certify that the work embodied in this thesis is the result of original
research and has not been submitted for a higher degree to any other Uni-
versity or Institution.

(Signed): ..................................................................
Acknowledgements

A PhD thesis is a sizeable undertaking and invariably depends on contributions


from many people besides the author. I would like to extend a warm thank-you to
the people in the following list, all of whom, although some of them may not realise
it, contributed to this research. Mirka Miller encouraged me to do a PhD in the first
place and was instrumental during the application process. Sašo Džeroski gave a
keynote talk at ICML 2002 which inspired the thesis topic. My supervisors Stephan
Chalup and Huilin Ye gave me their trust and a free hand to develop the topic.
Frans Henskens went beyond his professional duty to help secure much needed travel
funding for the presentation of my work. Stewart Wilson, Tim Kovacs and Martin
Butz provided encouragement and validation at a crucial time. Sašo Džeroski, Kurt
Driessens, Martijn van Otterlo and Federico Divina helpfully answered my queries
about relational reinforcement learning and other topics. Robert King advised on
appropriate statistical tests, while Aaron Scott, David Montgomery and Geoff Mar-
tin tirelessly provided valuable technical support without which this thesis would
have come to nothing. During the writing phase many people proofread chap-
ters: Alyssa Brugman, Dirk Brugman, Elena Prieto, Erol Engin, Linda Seymour,
Michael Quinlan and of course my supervisors, Stephan and Huilin. Throughout
the course of the candidature my parents Rob and Cherie Mellor always gave me
their support, while Helen Giggins and the legged Robocup team, Michael, Craig,
Naomi, Kenny, Steve and the others were at hand for much needed regular diver-
sions. I would also like to thank anyone else that belongs on this list but whom I
might have forgotten to add, for which I sincerely apologise. Last but not least, I
would like to mention my long time pet cat and companion, “Crackles”, who passed
away during the candidature and to whom this thesis is dedicated.
Contents

List of Symbols vii

Abstract xiii

1 Introduction 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.1.1 Blocks World . . . . . . . . . . . . . . . . . . . . . . . 3

1.1.2 The Reinforcement Learning Problem . . . . . . . . . 4

1.1.3 Structural Regularity . . . . . . . . . . . . . . . . . . 6

1.1.4 First-Order Logic . . . . . . . . . . . . . . . . . . . . . 7

1.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.2.1 Learning Classifier Systems . . . . . . . . . . . . . . . 10

1.2.2 Inductive Logic Programming . . . . . . . . . . . . . . 10

1.2.3 Strengths and Weaknesses . . . . . . . . . . . . . . . . 12

1.3 Thesis Objectives . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . 14


2 Background 17

2.1 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . 18

2.1.1 Acting Optimally . . . . . . . . . . . . . . . . . . . . . 20

2.1.2 Solving Markov Decision Processes . . . . . . . . . . . 22

2.2 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . 26

2.2.1 The TD(0) Algorithm . . . . . . . . . . . . . . . . . . 26

2.2.2 The Q-Learning Algorithm . . . . . . . . . . . . . . . 28

2.3 Generalisation . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.3.1 Function Approximation . . . . . . . . . . . . . . . . . 31

2.3.2 Aggregation . . . . . . . . . . . . . . . . . . . . . . . . 33

2.3.3 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . 35

2.4 Relational Reinforcement Learning . . . . . . . . . . . . . . . 36

2.4.1 Propositionally Factored MDPs . . . . . . . . . . . . . 36

2.4.2 Relational Markov Decision Processes . . . . . . . . . 40

2.4.3 Aggregating States and Actions Under an RMDP . . . 43

2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3 A Survey of Relational Reinforcement Learning 47

3.1 Connections to Other Fields . . . . . . . . . . . . . . . . . . . 48

3.2 Existing RRL Methods and Approaches . . . . . . . . . . . . 49

3.2.1 Static Generalisation . . . . . . . . . . . . . . . . . . . 50

3.2.2 Dynamic Generalisation . . . . . . . . . . . . . . . . . 53



3.2.3 Policy Learning . . . . . . . . . . . . . . . . . . . . . . 55

3.2.4 Policy Driven Approaches . . . . . . . . . . . . . . . . 58

3.2.5 Other Dynamic Methods . . . . . . . . . . . . . . . . 60

3.2.6 Extensions and Related Methods . . . . . . . . . . . . 61

3.3 Dimensions of RRL . . . . . . . . . . . . . . . . . . . . . . . . 64

3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4 The XCS Learning Classifier System 71

4.1 The XCS System . . . . . . . . . . . . . . . . . . . . . . . . . 72

4.1.1 System Architecture . . . . . . . . . . . . . . . . . . . 73

4.1.2 The Rule Base . . . . . . . . . . . . . . . . . . . . . . 74

4.1.3 The Production Subsystem . . . . . . . . . . . . . . . 77

4.1.4 The Credit Assignment Subsystem . . . . . . . . . . . 80

4.1.5 The Rule Discovery Subsystem . . . . . . . . . . . . . 86

4.2 Accuracy-Based Fitness . . . . . . . . . . . . . . . . . . . . . 91

4.3 Biases within XCS . . . . . . . . . . . . . . . . . . . . . . . . 94

4.3.1 The Generality and Optimality Hypotheses . . . . . . 94

4.3.2 Butz’s Evolutionary Pressures . . . . . . . . . . . . . . 96

4.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 97

4.4 Alternative Rule Languages . . . . . . . . . . . . . . . . . . . 99

4.5 Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 104



4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

5 The FOXCS System 109

5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

5.2 Representational Aspects . . . . . . . . . . . . . . . . . . . . 112

5.2.1 Background Knowledge . . . . . . . . . . . . . . . . . 113

5.2.2 Representation of the Inputs . . . . . . . . . . . . . . 114

5.2.3 Representation of the Rules . . . . . . . . . . . . . . . 115

5.2.4 Expressing Generalisations Within a Single Rule . . . 116

5.3 The Matching Operation . . . . . . . . . . . . . . . . . . . . . 118

5.3.1 The Order of Atoms Within a Rule . . . . . . . . . . . 119

5.3.2 The Use of Inequations . . . . . . . . . . . . . . . . . 119

5.3.3 Caching . . . . . . . . . . . . . . . . . . . . . . . . . . 120

5.4 The Production Subsystem . . . . . . . . . . . . . . . . . . . 121

5.5 The Rule Discovery Subsystem . . . . . . . . . . . . . . . . . 124

5.5.1 Declaring the Rule Language . . . . . . . . . . . . . . 124

5.5.2 The Covering Operation . . . . . . . . . . . . . . . . . 127

5.5.3 The Mutation Operations . . . . . . . . . . . . . . . . 130

5.5.4 Subsumption Deletion . . . . . . . . . . . . . . . . . . 140

5.6 Implementation Notes . . . . . . . . . . . . . . . . . . . . . . 143

5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

6 Application to Inductive Logic Programming 145



6.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . 146

6.1.1 Materials . . . . . . . . . . . . . . . . . . . . . . . . . 146

6.1.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . 148

6.2 Comparison to ILP Algorithms . . . . . . . . . . . . . . . . . 149

6.3 Verifying the Generality Hypothesis . . . . . . . . . . . . . . 152

6.4 The Effect of Subsumption Deletion on Efficiency . . . . . . . 156

6.5 The Effect of Learning Rate Annealing on Performance . . . 157

6.6 The Influence of the Selection Method . . . . . . . . . . . . . 162

6.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

7 Application to Relational Reinforcement Learning 171

7.1 Experiments In Blocks World . . . . . . . . . . . . . . . . . . 172

7.2 Scaling Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

7.2.1 P-Learning . . . . . . . . . . . . . . . . . . . . . . . . 183

7.2.2 An Implementation of P-Learning for Foxcs . . . . . 184

7.2.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . 186

7.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192

8 Conclusion 193

8.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

8.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 196

8.3 Significance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198

8.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 201



A First-Order Logic 203

A.1 Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204

A.2 Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208

A.3 Herbrand Interpretations . . . . . . . . . . . . . . . . . . . . 213

A.4 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215

A.5 Induction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216

A.6 Substitution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219

B Inductive Logic Programming 222

B.1 Learning from Entailment . . . . . . . . . . . . . . . . . . . . 224

B.2 Learning from Interpretations . . . . . . . . . . . . . . . . . . 226

B.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229

C The Inductive Logic Programming Tasks 231

C.1 The Prediction of Mutagenic Activity . . . . . . . . . . . . . 231

C.2 The Prediction of Biodegradability . . . . . . . . . . . . . . . 234

C.3 Predicting Traffic Congestion and Accidents . . . . . . . . . . 235

C.4 Classifying Hands of Poker . . . . . . . . . . . . . . . . . . . 238

D Validity Tests for the Blocks World Environment 241


List of Symbols

The lists below contain symbols and acronyms used commonly throughout
the thesis. Where relevant, page numbers are given to where the symbol or
abbreviation is introduced.

Logic

∧ conjunction

∨ disjunction

¬ negation

← implication

∃ existential quantifier

∀ universal quantifier

|= entailment, 215

θ a substitution

θ−1 an inverse substitution

const(Φ) the set of constant symbols occurring in the logical sentence Φ

vars(Φ) the set of variables occurring in the logical sentence Φ


Reinforcement Learning

t discrete time step, 18

s a state, 18

a an action, 18

r a reward, 18

S state space, 18

A action space, 18

T (s, a, s′) the probability of transition from s to s′ under a, 19

R(s, a) expected reward from taking a in s, 19

A(s) set of actions possible in state s, 19

π policy, 21

π∗ optimal policy, 22

π(s) the action to take in state s under policy π, 21

V π (s) the value of state s under policy π, 21

V ∗ (s) the value of state s under an optimal policy, 22

Q∗ (s, a) the value of state-action pair (s, a) under an optimal policy, 28

γ discount factor, 21

α learning rate, 27

ε probability of selecting a random action under an ε-greedy policy, 29

Relational Reinforcement Learning

L an alphabet over first-order logic

C set of constant symbols

F set of function symbols

P set of predicate symbols

PA set of predicate symbols for representing actions

PS set of predicate symbols for representing states

PB set of predicate symbols for representing background knowledge

Learning Classifier Systems

The following list gives the parameters associated with an individual rule j
in Xcs (the parameter’s name is given in parentheses):

Aj action advocated by j (action), 74

Cj the set of states to which j applies (condition), 74

pj an estimate of the mean expected payoff for j (prediction), 75

εj an estimate of the mean expected absolute difference between pj


and the actual payoff (error), 75

Fj determines j’s probability of selection for reproduction (fitness),


75

exp j the number of times j has been a member of an action set (ex-
perience), 75

ns j an estimate of the mean size of the action sets that j has been a
member of (niche size), 76

ts j the time step when the GA was most recently invoked on an


action set that j belonged to (time step), 76

nj the number of micro-rules represented by j (numerosity), 76

dv j probability that j is deleted (deletion vote), 90

Other important symbols and parameters for Xcs:

[P] the current set of rules contained within the system (population),
74

[M] the set of rules matching the current state (match set), 77

[A] the subset of [M] advocating the selected action (action set), 80

[A]−1 [A] from the previous time step (action set), 80

N the maximum number of rules in terms of micro-rules that may


exist in [P ] at any one time, 77

ρ the target value when updating p, 81

κ an inverse measure of ε used for calculating fitness (accuracy),


82

a, b, ε0 parameters for calculating κ, 82

α learning rate for p, ε, F , and ns updates, 81

γ discount factor, 81

θGA a threshold used by the triggering mechanism for the GA, 89

θdel a threshold used by the rule deletion mechanism, 90

δ the fraction of the mean fitness below which a rule’s probability


of deletion is increased, 90

θsub a threshold used by subsumption deletion, 91



The following symbols are specific to Foxcs:

Φj the logical part of j, replacing Aj and Cj , consisting of a definite


clause over first-order logic, 115

µi determines the probability of selecting an evolutionary operation,


i ∈ {del,c2v,v2a,add,v2c,a2v,rep}, 134

Acronyms

The following acronyms are used throughout this thesis:

GA Genetic algorithm, 104

ILP Inductive logic programming, 10, 222

LCS Learning classifier system, 10, 71

MDP Markov decision process, 18

RRL Relational reinforcement learning, 36, 47


Abstract

Machine learning methods usually represent knowledge and hypotheses using


attribute-value languages, principally because of their simplicity and demon-
strated utility over a broad variety of problems. However, attribute-value
languages have limited expressive power and for some problems the target
function can only be expressed as an exhaustive conjunction of specific cases.
Such problems are handled better with inductive logic programming (ILP)
or relational reinforcement learning (RRL), which employ more expressive
languages, typically languages over first-order logic. Methods developed
within these fields generally extend upon attribute-value algorithms; how-
ever, many attribute-value algorithms that are potentially viable for RRL,
the younger of the two fields, remain to be extended.

This thesis investigates an approach to RRL derived from the learning clas-
sifier system Xcs. In brief, the new system, Foxcs, generates, evaluates,
and evolves a population of “condition-action” rules that are definite clauses
over first-order logic. The rules are typically comprehensible enough to be
understood by humans and can be inspected to determine the acquired prin-
ciples. Key properties of Foxcs, which are inherited from Xcs, are that it
is general (applies to arbitrary Markov decision processes), model-free (re-
wards and state transitions are “black box” functions), and “tabula rasa”
(the initial policy can be unspecified). Furthermore, in contrast to decision
tree learning, its rule-based approach is ideal for incrementally learning ex-
pressions over first-order logic, a valuable characteristic for an RRL system.


Perhaps the most novel aspect of Foxcs is its inductive component, which
synthesizes evolutionary computation and first-order logic refinement for
incremental learning. New evolutionary operators were developed because
previous combinations of evolutionary computation and first-order logic were
non-incremental. The effectiveness of the inductive component was empiri-
cally demonstrated by benchmarking on ILP tasks, which found that Foxcs
produced hypotheses of comparable accuracy to several well-known ILP al-
gorithms. Further benchmarking on RRL tasks found that the optimality
of the policies learnt was at least comparable to that of existing RRL sys-
tems. Finally, a significant advantage of its use of variables in rules was
demonstrated: unlike RRL systems that did not use variables, Foxcs, with
appropriate extensions, learnt scalable policies that were genuinely indepen-
dent of the dimensionality of the task environment.
Chapter 1

Introduction

“If we knew what it was we were doing, it would not be called


research, would it?”
—Albert Einstein

A small child learns that blocks fit into holes according to their shape.
A student distinguishes between crystals on the basis of their symmetry.
An ornithologist determines the species of a finch from its markings. The
ability to identify similarity between different objects or situations plays
an important role in decision making. In the scenarios above, similarity
is pattern-based; it is the purpose of this thesis to explore a method that
automatically discovers the relevant patterns in a given problem and uses
them as a basis for decision making.

The decision making paradigm considered in this thesis is known as rein-


forcement learning (Sutton and Barto, 1998). Under this paradigm, the
decision maker is given feedback about the quality of its choices through
a reward signal; it is the goal of the decision maker to maximise the re-
ward accumulated over time. The ability to automatically recognise simi-
larity between different situations can improve the efficiency of reinforcement
learning methods: the outcomes of previously experienced situations can be


reused in new situations that the method would otherwise have to approach
from scratch. Many methods have been devised to automatically recognise
similarity between different situations in the reinforcement learning setting;
however, the automatic detection of pattern-based similarity in this context
is relatively new and less well explored than other approaches.

In this thesis we adopt the perspective of an artificial intelligence researcher.


That is, we take a computational approach to the topic, based on the study
and evaluation of algorithms. Furthermore, the work itself can be situated
within the field of relational reinforcement learning (RRL) (Džeroski et al.,
2001; van Otterlo, 2005). This relatively recent field focuses on reinforcement
learning methods that are particularly well-suited to recognising and using
patterns in the input. In the remainder of this chapter I first illustrate the
relational reinforcement learning problem and the challenges involved and
then sketch the approach that will be developed and evaluated in subsequent
chapters.

1.1 Motivation

The purpose of this section is to illustrate, through example, the essence


of the relational reinforcement learning problem. The example concerns
a task environment known as blocks world. Blocks world, originating from
work in natural language processing (Winograd, 1972), has had an extensive
association with artificial intelligence, particularly with planning algorithms.
It is the planning version of blocks world (Slaney and Thiébaux, 2001),
somewhat simpler than the original natural language processing version,
which is presented here.

Figure 1.1: A sequence of transitions in bw5 .

1.1.1 Blocks World

The blocks world environment, denoted bwn , contains a finite collection of


n blocks and a floor. Each block is associated with a label, such as a string
of lower case letters and digits, that uniquely identifies the block; in this
example, block labels are lower case letters of the alphabet. Each block
rests either on top of exactly one other block or on the floor. If a block has
nothing on top of it then it is said to be “clear”, and it can be moved, either
to the floor or onto another clear block.

Many different tasks can be devised within the blocks world environment.
Džeroski et al. (2001) have defined three, stack, unstack, and onab, which
have become standard benchmarking tasks for evaluating relational rein-
forcement learning systems. They are described below.

Stack: the goal of this task is to arrange all the blocks into a single stack (the
order of the blocks in the stack is unimportant).

Unstack: the goal of this task is to position all the blocks on the floor.

Onab: for this task, which is more complex than stack and unstack, two
blocks are designated A and B respectively. The goal is to place A
directly on B.

For the purpose of reinforcement learning it is useful to characterise blocks


world in terms of a set of states, a set of actions, and a transition function.

A state corresponds to a specific arrangement of the blocks; an action cor-


responds to moving a block (note that for each state, only a subset of the
actions apply); and the transition function returns the state that follows
from applying a particular action to a particular state. Figure 1.1 shows
some transitions in bw5 .
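
To make the state, action and transition components concrete, the sketch below encodes a blocks world state as a mapping from each block to whatever it rests on. The representation and function names are illustrative only and are not the encoding used by the system developed later in the thesis.

FLOOR = "floor"

def clear_blocks(state):
    # A block is clear if nothing rests on it.
    supports = set(state.values())
    return [b for b in state if b not in supports]

def legal_actions(state):
    # Move a clear block either to the floor or onto another clear block.
    clear = clear_blocks(state)
    actions = []
    for b in clear:
        if state[b] != FLOOR:
            actions.append((b, FLOOR))
        for dest in clear:
            if dest != b:
                actions.append((b, dest))
    return actions

def transition(state, action):
    # Deterministic transition: return the successor arrangement.
    block, dest = action
    successor = dict(state)
    successor[block] = dest
    return successor

# An example bw5 state: d on e, b on a, and e, a, c on the floor.
s = {"d": "e", "e": FLOOR, "b": "a", "a": FLOOR, "c": FLOOR}
print(legal_actions(s))             # eight legal moves in this state
print(transition(s, ("d", FLOOR)))  # move block d to the floor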

The aim, under reinforcement learning, for each of the above tasks is to find
a policy that solves it optimally (that is, in the least number of steps). A
policy defines how to behave in any given situation; in other words, it is a
function that maps each state in the task environment to some action. A
policy thus completely solves the problem, in the sense that it specifies an
action for every possible state. Optimal policies for these tasks are already
known: for stack, place any clear block on top of the block which is the
highest; for unstack, select any clear block that is not already on the floor
and move it there; and for onab, first remove the blocks above A and B,
and then move A onto B. Therefore, the optimality of a potential solution
to any of the tasks can be measured by comparing it, in terms of the number
of steps it takes to reach a goal state, to the corresponding known optimal
policy.

1.1.2 The Reinforcement Learning Problem

Two defining features of reinforcement learning problems are that they pro-
ceed according to trial-and-error and that feedback is given through rewards
(Sutton and Barto, 1998). Under the trial-and-error approach, an agent, the
reinforcement learning system, interacts with the environment to determine
cause-and-effect relationships. For the above tasks, interaction naturally
breaks up into separate episodes. An episode begins with the blocks in ran-
domly generated positions, it proceeds in a sequence of steps, where one
block is moved per step, and continues until a goal state is reached. For
other tasks, interaction may continue indefinitely without being broken up
into episodes. In both cases—episodic and continuing—the agent is respon-

sible for determining how the interaction unfolds. This trial-and-error ap-
proach to interaction distinguishes reinforcement learning from supervised
learning (where an external teacher provides the interaction, in the form of
examples of optimal or desired behaviour).

The use of rewards also distinguishes reinforcement learning from other


forms of machine learning. Rewards are determined by a function that maps
the current state into a scalar numerical value, providing crucial information
for assessing policies. In fact, the object of a reinforcement learning system
is to discover or “learn” the policy which maximises the amount of reward
accumulated over the long term. A policy that, for all states, maximises the
accumulated reward is an optimal policy. For the above tasks, a reward of
−1 is given on every step, which, since it is negative, can be viewed as a
cost to be minimised. An optimal policy solves the task in the least number
of steps precisely because it minimises the accumulated cost.

In order to find an optimal policy, a reinforcement learning system has two


distinct challenges to overcome. First is the credit assignment problem.
Measuring the reward accumulated over an entire episode allows the com-
plete sequence of actions in the episode to be evaluated as a whole. But
how does an agent differentiate between the specific actions—between those
which led it closer to the goal from those which did not? Particularly when,
as in this example, all actions receive the same reward?

The second challenge is to overcome the curse of dimensionality. As shown


in Table 1.1, the number of states in bwn grows very quickly as the number
of blocks, n, increases; in fact, the complexity of the state space is greater
than O(n!).¹ Furthermore, there are between 1 and n² − n actions avail-
able within each state. For example, when n = 10 blocks there are between
one and 90 actions for each of the 58,941,091 ground states. Hence, sam-
pling every state-action combination is, in general, an intractable approach.
¹ Note that asymptotically, O(n!) grows even faster than O(cⁿ), which is very formidable
indeed. Slaney and Thiébaux (2001) give a precise expression for the number of states in
bw_n as Σ_{i=0}^{n} C(n, i) (n−1)!/(i−1)!.

Table 1.1: The number of states and actions per state for blocks world as the number
of blocks increases.

domain      number of states      number of actions per state

bw3                       13              1–6
bw4                       73              1–12
bw5                      501              1–20
bw6                    4,051              1–30
bw7                   37,633              1–42
bw8                  394,353              1–56
bw9                4,596,553              1–72
bw10              58,941,091              1–90
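
As a quick sanity check, the closed-form expression quoted in the footnote above reproduces the state counts in Table 1.1. The snippet below is only an illustration of that formula (the i = 0 term vanishes, so the sum starts at i = 1); it assumes Python's math.comb and math.factorial.

from math import comb, factorial

def num_states(n):
    # Number of blocks world states for n blocks:
    # sum over i of C(n, i) * (n-1)! / (i-1)!, with the vanishing i = 0 term omitted.
    return sum(comb(n, i) * factorial(n - 1) // factorial(i - 1)
               for i in range(1, n + 1))

for n in range(3, 11):
    print(n, num_states(n))   # 13, 73, 501, 4051, ..., 58941091, as in Table 1.1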

How can the agent act optimally if it cannot experience every state-action
combination occurring in the environment?

Detailed answers to these two challenges must wait until the problem frame-
work is formally defined in Chapter 2. However, let us here focus a little
more on the second problem, as relational reinforcement learning is distin-
guished from other forms of reinforcement learning by its approach to this
challenge.

1.1.3 Structural Regularity

Without direct experience of the outcome of each possible state and action
combination, the best that an agent can do is to make decisions in new
situations based on previously experienced situations and outcomes. In other
words, the system infers a general, hypothetical relationship between states
or state-action combinations and long term rewards that is consistent with
observation. For this inductive approach to work, the environment must
contain regularity linking states or state-action combinations to long term
reward. In some environments, for example, the long term reward is a
function of the current state-action combination. In those environments, it is

possible to use function approximation techniques, such as neural networks,


which approximate the entire function given some points on the function.

When attempting to infer a general relationship between state-action com-


binations and long term rewards, the success or failure of specific techniques
crucially depends on the nature of the regularity exhibited by the environ-
ment. For the blocks world tasks described above, regularity takes a struc-
tural form. It turns out that for any state, the least number of steps to a
goal state depends on the arrangement of the blocks in that state. Further-
more, all possible arrangements of the blocks can be described by a number
of distinct patterns (see Figure 1.2). Thus, since the number of steps to
the goal determines the long term reward, states can be linked to long term
reward via patterns. In other words, the arrangement of the blocks forms
a basis of regularity in blocks world sufficient for predicting the long term
reward associated with any state. Because for a given value of n there are
significantly fewer patterns than states in bwn , associating long term reward
with patterns circumvents the curse of dimensionality.

1.1.4 First-Order Logic

Structural regularity like that exhibited by blocks world can be difficult to


represent under the predominant representational paradigm for reinforce-
ment learning and machine learning generally: the so-called attribute-value
framework. If the relevant patterns are known prior to solving the task
then an effective attribute-value approach is based on defining a collection
of features that represent the patterns. This approach works well in chess,
for instance, where centuries of human experience can be used to define fea-
tures relevant for assessing the strength of board positions. In many other
domains, however, it is precisely the relevant patterns or features that are un-
known and which we are trying to discover. For example, the identification
of a library of sub-molecular structures associated with certain outcomes,
like bio-degradability or carcinogenicity, would allow valuable predictions to

Figure 1.2: The optimal policies for stack and unstack in bw4 . Note that the state
space of bw4 is partitioned into five patterns that, under these policies, are associated
with the optimal number of steps to the goal. A similar decomposition can be determined
for onab, where the patterns must additionally specify the blocks designated as A and B.

be made about newly synthesized chemicals and pharmaceuticals containing


the structures. It is therefore important to develop alternative approaches
that do not rely on prior knowledge of the relevant patterns or structures in
the environment.

A quality of structural regularity is that it is composed from particular


combinations of atomic elements occurring in the environment. In blocks
world, for example, the atomic elements are blocks and the patterns are
particular arrangements of blocks. This quality of the regularity suggests an
alternative to the feature-based approach: the use of languages, such as first-
order logic, that support the combination of atomic elements into complex,
abstract structures. For example, the bottom left pattern in Figure 1.2 can
be written in Prolog, a logic programming language, as follows:

on(A,B),     % block A is on block B
on(C,D),
cl(A),       % block A is clear
cl(C),
on_fl(B),    % block B is on the floor
on_fl(D),
A\=C,
B\=D.

Note that A, B, C, and D are variables and not block labels; thus, the resulting
pattern is genuinely abstract, applying in all situations where the blocks sat-
isfy the specified relationships and properties. The use of variables, however,
necessitates the inequations A\=C and B\=D. Without these extra constraints,
the example also represents a two block pattern corresponding to the case
where A=C and B=D.
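
To illustrate what it means for this pattern to match a state, the sketch below represents a state as a set of ground atoms and searches for a substitution of the variables that makes every atom of the pattern true while respecting the inequations. It is a naive, brute-force matcher written only for illustration; it is not the Prolog-based matching used by the system in this thesis.

from itertools import product

# A bw4 state as ground atoms: two two-block towers (a on b, c on d).
state = {("on", "a", "b"), ("on", "c", "d"),
         ("cl", "a"), ("cl", "c"),
         ("on_fl", "b"), ("on_fl", "d")}

# The pattern from the text, split into atoms and inequations.
pattern = [("on", "A", "B"), ("on", "C", "D"), ("cl", "A"), ("cl", "C"),
           ("on_fl", "B"), ("on_fl", "D")]
inequations = [("A", "C"), ("B", "D")]
variables = ["A", "B", "C", "D"]

def matches(state, pattern, inequations, variables):
    # Try every assignment of blocks to variables (repeats allowed, so the
    # inequations are what rule out the degenerate binding with A=C and B=D).
    blocks = {term for atom in state for term in atom[1:]}
    for binding in product(sorted(blocks), repeat=len(variables)):
        theta = dict(zip(variables, binding))
        ground = {(atom[0],) + tuple(theta[v] for v in atom[1:]) for atom in pattern}
        if ground <= state and all(theta[x] != theta[y] for x, y in inequations):
            return theta
    return None

print(matches(state, pattern, inequations, variables))
# e.g. {'A': 'a', 'B': 'b', 'C': 'c', 'D': 'd'}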

The above example shows that structural regularity can be represented ab-
stractly by expressions in first-order logic. However, it does not show how to
generate these expressions. This issue—the development of methods that in-
duce general, hypothetical relationships between states or state-action com-
binations and long term reward using languages, like first-order logic, that
can abstractly represent structural regularity—is the chief concern of rela-
tional reinforcement learning.

1.2 Approach

This thesis presents a new relational reinforcement learning method. Be-


cause the development of inductive techniques is more mature under attribute-
value frameworks than under first-order logic, we do not develop the new
technique from scratch but instead follow a typical approach, which is to
“upgrade” a method developed under the attribute-value framework to first-
order logic. This approach allows existing, well-understood algorithms and

heuristics to be reused, and has an additional benefit in that many aspects


of the system architecture will be already familiar to a community of re-
searchers.

The method developed in this thesis, Foxcs, “upgrades” the learning clas-
sifier system Xcs (Wilson, 1995, 1998). In doing so, it borrows ideas and
methods from inductive logic programming (Nienhuys-Cheng and de Wolf,
1997). The approach is thus based on a synthesis of concepts and techniques
from learning classifier systems and inductive logic programming.

1.2.1 Learning Classifier Systems

Learning classifier systems (LCS) are a rule-based framework that originally


grew out of the application of genetic algorithms (genetic algorithms are
described in Section 4.5) to machine learning. The approach developed in
this thesis is a derivative of the Xcs system (Wilson, 1995, 1998), the most
mature and suitable LCS for addressing reinforcement learning.

In brief, Xcs contains a population of condition-action rules that represent


the system’s hypotheses. Each rule specifies how to behave when the current
situation satisfies the condition specified by the rule. Furthermore, each rule
is also associated with a parameter that estimates the expected long term
reward that will accumulate after following the rule. Rules are automatically
generated by the system using principles based in large part on genetic
algorithms. The long term reward estimate, on the other hand, is updated
using techniques that have a close connection to temporal difference learning,
a fundamental method of reinforcement learning.

1.2.2 Inductive Logic Programming

Like relational reinforcement learning, inductive logic programming (ILP)


(Muggleton, 1991; Muggleton and De Raedt, 1994; Nienhuys-Cheng and

de Wolf, 1997) addresses machine learning tasks that exhibit structural reg-
ularity. Although the context of ILP is supervised and unsupervised learn-
ing, rather than reinforcement learning, concepts and techniques from ILP
can be used to provide a principled approach to the upgrade of Xcs. The
primary ideas borrowed from ILP are the notions of consistency and back-
ground knowledge, and the techniques of refinement and θ-subsumption.

Consistency In Foxcs we are concerned with computing the consistency


of a rule (represented by a definite clause over first-order logic) with
respect to the current state (a set of atoms over first-order logic) and
an action (an atom). Testing for consistency proceeds according to
deductive principles and can be automated under logic languages such
as Prolog.

Background knowledge The deductive process can refer to a set of addi-


tional rules, called a background theory, when testing for consistency.
The background theory contains general domain knowledge about the
task environment and providing a “good” theory is often the key to
solving a task effectively.

Refinement Foxcs uses refinement to generate new rules based on exist-


ing rules. There are two kinds of refinement, downwards and upwards.
Given a clause in first-order logic, a downward refinement operator is
a function that computes a set of specialisations of it; conversely, an
upward refinement operator computes a set of generalisations. Down-
ward refinement is used for top-down searches while upward refinement
is used for bottom-up searches. Because search in evolutionary com-
putation is bi-directional, both upward and downward refinement is
employed by Foxcs.

θ-subsumption Given two clauses in first-order logic, θ-subsumption is a


technique for determining whether one is a specialisation of the other.
θ-subsumption is known to be incomplete (that is, there are cases

of specialisation that it does not detect); however, it is efficient and


therefore commonly used by ILP systems.

Note that induction under first-order logic presents special challenges. The
hypothesis space defined by a first-order logic language is generally richer
than an hypothesis space under an attribute-value language, and meth-
ods are thus potentially less efficient and more computationally complex.
Unfortunately, some time complexity results for inductive logic program-
ming are negative (Kietz, 1993; Cohen, 1993, 1995; Cohen and Page, 1995;
De Raedt, 1997), showing that the complexity is greater than polynomial
and thus intractable. However, under the non-monotonic setting (originat-
ing from Helft, 1989) tractable PAC-learnability results have been obtained
(De Raedt and Džeroski, 1994). The non-monotonic setting is therefore
adopted by the inductive component of Foxcs.

1.2.3 Strengths and Weaknesses

There are several advantages arising from the use of the LCS framework
for relational reinforcement learning. Many existing RRL methods do not
possess certain desirable characteristics for practical reinforcement learning.
First, they may restrict the problem framework in some way; second, they
may require a partial or complete model of the environment’s dynamics;
and third, they may require candidate hypotheses to be provided by the
user. The method developed in this thesis avoids the above limitations.
Specifically:

• The approach applies to Markov decision processes. This problem


framework is very general and no simplifications were made to it. To
emphasize, the representational paradigm adopted by the current ap-
proach does not limit the generality of the Markov decision process
framework.

• The approach is model-free. In other words, it can be applied to en-


vironments where the transitions are not known. All necessary infor-
mation is collected through interacting with the environment.

• Induction is performed automatically. That is, candidate rules do not


have to be hand-crafted. Instead, the system automatically generates,
evaluates and refines rules itself.

Other advantages include:

• Domain specific information is provided by a user-defined rule lan-


guage. For RRL tasks, it is perhaps easier to define an appropriate
rule language than it is to provide other forms of domain specific in-
formation, such as suitable distance metrics or kernel functions.

• Hypotheses are typically comprehensible enough to be understood by


humans and can be inspected to determine the general principles that
have been learnt. The comprehensibility is partly due to the rule-
based approach of LCS, which, in contrast to architectures like neural
networks, represents hypotheses in an understandable form: condition-
action rules. It is further enhanced in Foxcs by the syntax of first-
order logic, which supports the use of mnemonic names.

A disadvantage of the LCS approach is that it has lengthy training times


compared to other reinforcement learning systems. This is perhaps due
to the stochastic nature of the evolutionary component (which generates
and evolves rules). Offsetting this weakness is that few other relational
reinforcement learning systems contain all the above listed advantages.

1.3 Thesis Objectives

The purpose of this thesis is to investigate the approach to RRL outlined in


Section 1.2. In other words the stated aim is:

To derive and evaluate a relational reinforcement learning system


based on the learning classifier system Xcs.

And the approach will be:

To “upgrade” the rule language of Xcs to definite clauses over


first-order logic.

The expected outcome of the research is twofold: the contribution of a de-


sign following the given approach and an empirical demonstration of the
effectiveness of the approach. In order to derive an RRL system based on
the LCS framework, novel covering and mutation operations will need to be
tailored for use with representation in first-order logic.² Evaluation of the
system will focus on demonstrating the effectiveness or influence of individ-
ual system components as well as the effectiveness of the overall approach.
Evaluation will also assess the utility of the system for learning scalable poli-
cies: policies which are optimal independent of the size of the environment.

1.4 Thesis Outline

The remainder of this thesis is organised as follows. Chapters 2 to 4 fo-


cus on background material. Chapter 2 formally describes the problem to
be addressed: a logic extension of the Markov decision process; and covers
fundamentals of Markov decision processes, reinforcement learning meth-
ods, general approaches to overcoming the “curse of dimensionality”, and
the incorporation of first-order logic into the framework of Markov decision
processes. Chapter 3 then reviews existing relational reinforcement learning
methods, describes different dimensions for categorising RRL systems, and
² There do exist systems that have previously adapted mutation for representation in
first-order logic (Augier et al., 1995; Kókai, 2001; Divina and Marchiori, 2002). However,
they are supervised learning systems and their mutation operations do not satisfy the
requirements of the reinforcement learning paradigm.

places the current work in context. Chapter 4 gives a complete description


of the algorithmic aspects of the learning classifier system Xcs from which
our method will be derived, and assesses its potential for extension with
first-order logic.

Chapters 5 to 7 contain the primary contributions of the thesis. Chapter 5


describes the new system, Foxcs, focussing on the extensions made to Xcs
(it does not repeat the details of Xcs given in Chapter 4). Novel algorithms
that are tailored for covering and mutating first-order logic expressions un-
der the LCS framework are provided. Chapters 6 and 7 then empirically
evaluate the system. In Chapter 6, Foxcs is benchmarked on ILP tasks,
primarily to assess the effectiveness of its novel inductive mechanisms. It
is found to perform at a level comparable to that of specialist inductive al-
gorithms. In Chapter 7, Foxcs is benchmarked on RRL tasks set in the
blocks world environment. It is found to perform as well as previous sys-
tems, demonstrating the overall effectiveness of the approach. This chapter
also shows that by incorporating P-Learning (Džeroski et al., 2001), Foxcs
can learn policies that scale up to arbitrary sized environments. Finally,
Chapter 8 concludes the body of the thesis, summarising the main findings
and identifying areas for further investigation.

Several appendices have been provided in addition to the main body of the
thesis. The first two appendices include background material that is relevant
to the thesis in order to make it more self-contained. Appendix A gives
definitions and concepts from first-order logic, while Appendix B focuses on
two formal descriptions of the learning problems addressed by ILP systems.
The final two appendices give details of the tasks addressed by Foxcs.
Appendix C describes the ILP tasks from Chapter 6, while Appendix D
lists constraints that were used to prevent Foxcs from generating rules that
do not match any states in blocks world.
Chapter 2

Background

“Good and evil, reward and punishment, are the only motives to
a rational creature . . . ”
—John Locke

In the previous chapter we gave an intuitive explanation of relational rein-


forcement learning. The aim of this chapter is to provide a more formal basis
for the topic. We begin by introducing Markov decision processes, a formal
description of the problem framework underlying reinforcement learning.
Next we briefly describe some fundamental reinforcement learning methods,
TD(0) and Q-Learning. The essence of these methods is the principle of
temporal difference learning; this principle also finds expression in learning
classifier systems, which are in turn the basis of the method developed in
this thesis. We see that temporal difference learning methods like TD(0) and
Q-Learning are impractical for large Markov decision processes and examine
the general approach for overcoming the problem, which is to incorporate a
generalisation or inductive component into the method. Finally, we moti-
vate and describe an approach to generalisation based around using abstract
expressions in first-order logic. This approach is the foundation of relational
reinforcement learning.


Figure 2.1: Interaction between the agent and environment in an MDP.

2.1 Markov Decision Processes

The type of learning scenario that is considered in this thesis is known as


a Markov decision process (MDP). An MDP is a discrete, finite, dynamical
system consisting of an agent and an environment which interact in a feed-
back loop, as shown in Figure 2.1. At time t ∈ Z, the environment exists
in a particular state, st ∈ S, where S is the set of all states in which the
environment can exist. The agent selects and executes an action, at ∈ A,
where A is the set of all actions available to the agent. The environment
responds by moving to another state, st+1 , and the agent receives feedback
about its choice of action, at , through a reward rt+1 ∈ R. The objective of
the agent is to maximise the amount of reward that it receives by selecting
the appropriate action at each time step.

More formally, an MDP is defined as follows:

Definition 1 A Markov decision process is a tuple ⟨S, A, T, R⟩ consisting


of:

• a finite set of states S,

• a finite set of actions A,



• a transition function T : S × A × S → [0, 1],

• a reward function R : S × A → R.

Each of these components are now discussed in turn.

The state space, S, is the set of states that describe each possible situa-
tion that the agent may encounter; we assume that S is discrete and
finite. It is also assumed that the agent has perfect sensors, that is, it
is able to correctly identify the current state of the environment. For
now we make no assumptions about how the states are represented to
the agent.

The action space, A, is the set of actions through which the agent inter-
acts with the environment. As with the state space, we assume that
the action space A is discrete and finite. Frequently not all actions are
applicable to every state; if this is the case, then the set of admissible
actions of s is the subset of A that is applicable to s and is denoted
A(s).

The transition function, T , gives the probability that when the agent
executes action a in state s then the following state is s′, that is,
T (s, a, s′) = Pr{st+1 = s′ | st = s, at = a}. The transition function
is a proper probability distribution over successor states s′, that is, for
all s ∈ S and a ∈ A, Σ_{s′} T (s, a, s′) = 1. A deterministic MDP is the
special case when T : S × A × S → {0, 1}.

The reward function, R, gives the expected immediate reward for tak-
ing action a in state s, that is, R(s, a) = E{rt+1 | st = s, at = a}.
The reward function is sometimes state-based, in which case R may
be given as a function over S; the two formulations are related by
R(s, a) = R(s) for all s ∈ S and a ∈ A(s).
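
A direct transcription of Definition 1 into code can be as simple as the container below, with states and actions enumerated as integers and T and R stored as dense arrays. The field names and the array layout are our own illustrative choices (the admissible-action sets A(s) are left to the caller), not a representation taken from the thesis; the later sketches in this chapter assume the same convention.

import numpy as np
from dataclasses import dataclass

@dataclass
class MDP:
    # A finite MDP <S, A, T, R> with S = {0, ..., n_states-1} and A = {0, ..., n_actions-1}.
    n_states: int
    n_actions: int
    T: np.ndarray   # shape (|S|, |A|, |S|); T[s, a, s'] = Pr{next state s' | state s, action a}
    R: np.ndarray   # shape (|S|, |A|);      R[s, a] = expected immediate reward

    def __post_init__(self):
        # Each T[s, a, :] must be a proper probability distribution over successor states.
        assert np.allclose(self.T.sum(axis=2), 1.0)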

In an MDP, the dynamics of the environment depend only on the current


state and action and are independent of the time step. This requirement

is formalised by the Markov property, which says that a task is Markovian


if at every time step t, the probability of a transition occurring given the
entire history of transitions, st , at , rt , st−1 , at−1 , . . . , r1 , s0 , a0 , is the same as
its probability of occurring given only the current state, st , and action, at :

P r{st+1 , rt+1 | st , at , rt , st−1 , at−1 , . . . , r1 , s0 , a0 } = P r{st+1 , rt+1 | st , at },

for all time steps t. Hence, in a Markovian task, the current state and
action summarise all the information required to predict the next state and
the immediate reward.

Before considering precisely what it means to act optimally in an MDP, we


note that tasks may be either episodic or continuing. In an episodic task the
sequence of time steps is finite, that is, t ∈ {1, 2, . . . , n} for some positive
integer n ≠ ∞. In a continuing task, on the other hand, the agent interacts
with the environment indefinitely, thus t ∈ {1, 2, . . . , ∞}. All the tasks
considered in this thesis are episodic. Episodic tasks occur naturally when
the agent is attempting to arrive at some goal state, such as a destination
in a navigation task. The framework presented in the following section for
determining optimal behaviour treats both episodic and continuing tasks
uniformly, thus there is no loss of generality from considering episodic tasks
only.

2.1.1 Acting Optimally

The agent’s objective is to maximise the long term reward accumulated from
interacting with the environment. A simple measure of long term reward
is the expected sum of rewards obtained from the environment after the
current time step until the end of the episode. Under this criterion, the
agent aims to maximise E{rt+1 + rt+2 + . . . + rn } for all t ∈ {1, 2, . . . , n − 1},
where n is the final step of the episode. However, this measure does not
adequately deal with continuing MDPs.

A better optimality criterion for continuing MDPs is the infinite-horizon dis-
counted metric. Under this criterion, the value that the agent is attempting
to maximise is E{ Σ_{k=0}^{∞} γ^k r_{t+k+1} }, where γ is a discount factor satisfying
0 ≤ γ < 1. The use of a discount factor creates a geometrically decreasing
series which prevents the summation from going to infinity. The discounted
infinite-horizon optimality criterion is attractive because it handles both con-
tinuing and episodic tasks uniformly¹ and because most results under the
criterion also extend to the undiscounted case where appropriate (Littman,
1996, page 57).

The behaviour of the agent is determined by a policy. A deterministic policy


π : S → A specifies that action π(s) ∈ A is to be executed when in state
s ∈ S. Given a policy π, the expected long term value that an agent would
receive from executing π can be expressed as the sum of discounted expected
future rewards:
    V^π(s) = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s },                    (2.1)

where V π is referred to as the value function. The expectation in (2.1) is


with respect to the state distribution from following π, and the discounting
ensures that V π (s) is finite for all s ∈ S.

The value function can be rewritten in recursive form as a set of simultaneous


linear Bellman equations, one for each state, s ∈ S:
    V^π(s) = R(s, π(s)) + γ Σ_{s′} T(s, π(s), s′) V^π(s′).                  (2.2)

A Bellman equation for state s says that V π (s), the value of state s under
policy π, is the immediate reward R(s, π(s)) plus the discounted value of the
next state V π (s′), averaged over all possible next states s′ according to the
likelihood of their occurrence T (s, π(s), s′). The system of Bellman equations
can be solved by standard methods for systems of linear equations, such as
Gaussian elimination.
¹ An episodic task can be converted to a continuing task by assuming that the goal
states are absorbing, that is, a goal state transitions only to itself and only generates
rewards of zero.
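
Because (2.2) is a system of |S| linear equations in the unknowns V^π(s), it can be solved directly once T, R, γ and the policy are known. The sketch below does this with a dense linear solve in numpy for a deterministic policy; it is an illustration of the equations, not code from the thesis.

import numpy as np

def evaluate_policy(T, R, policy, gamma):
    # Solve the Bellman equations (2.2) for a deterministic policy.
    #   T: (|S|, |A|, |S|) transition probabilities
    #   R: (|S|, |A|) expected rewards
    #   policy: length-|S| integer array giving the action chosen in each state
    n = T.shape[0]
    states = np.arange(n)
    T_pi = T[states, policy]   # rows T(s, pi(s), .)
    R_pi = R[states, policy]   # entries R(s, pi(s))
    # Rearranging V = R_pi + gamma * T_pi V gives (I - gamma * T_pi) V = R_pi.
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)

For γ < 1 the matrix I − γT_π is always invertible, so the solve succeeds for any policy.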

Given a policy, π, then, it is possible to calculate its corresponding value


function, V π . The reverse can also be achieved, that is, it is possible to
determine a policy given a value function. The greedy policy with respect
to value function V is defined as:

    π_V(s) = arg max_a [ R(s, a) + γ Σ_{s′} T(s, a, s′) V(s′) ].            (2.3)

The greedy policy πV is thus obtained by always selecting the action which
leads to the maximum one-step value with respect to V .

The agent’s objective can be put as the task of finding an optimal policy.
By optimal we mean that the value for each state under the policy is at least
as great as the value when following any other policy. More formally, V ∗ ,
the value function for an optimal policy π ∗ satisfies:

    V^*(s) = max_π V^π(s),

for all s ∈ S. There is always at least one policy that is optimal (Howard,
1960), although it may not be unique. There is a corresponding Bellman
optimality equation for V ∗ :

    V^*(s) = max_a [ R(s, a) + γ Σ_{s′} T(s, a, s′) V^*(s′) ],              (2.4)

for each s ∈ S. It can be shown that πV ∗ , any policy which is greedy with
respect to the optimal value function V ∗ , is an optimal policy (Puterman,
1994). Hence, one approach to finding an optimal policy is to calculate V ∗
and then derive πV ∗ . However, due to the presence of the non-linear max-
imisation operator in (2.4), Gaussian elimination is insufficient for solving
for V ∗ . Algorithms for computing V ∗ are considered in the following section.

2.1.2 Solving Markov Decision Processes

Three fundamental methods for solving MDPs are linear programming, value
iteration and policy iteration. These three methods analytically compute the

Given a variable V(s) for each s ∈ S

Minimise:   Σ_s V(s)

Subject to the constraints:

    V(s) ≥ R(s, a) + γ Σ_{s′} T(s, a, s′) V(s′),   ∀s ∈ S and a ∈ A

Figure 2.2: Solving an MDP using linear programming.

optimal value function V ∗ corresponding to a given MDP, M = ⟨S, A, T, R⟩,


and discount factor, γ. As mentioned above, once V ∗ has been computed
the optimal policy, π ∗ , can be derived as the greedy policy with respect to
V ∗ . Puterman (1994) has described and analysed each of the methods in
detail.

Linear Programming

An MDP can be solved by reformulating the optimal value function as a


linear program (d’Epenoux, 1963), which can then be solved by the simplex
method or a variant of it. A linear program consists of a set of variables,
a set of inequalities over the variables, and a linear objective function. The
linear program for an MDP is shown in Figure 2.2. Linear programming
is analytically attractive, but as noted by Puterman (1994, page 223), it is
typically less efficient for solving MDPs than other methods.
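
For concreteness, the linear program of Figure 2.2 can be handed to an off-the-shelf solver. The sketch below uses scipy.optimize.linprog (our choice of solver, not the thesis's), rewriting each constraint V(s) ≥ R(s, a) + γ Σ_{s′} T(s, a, s′)V(s′) in the ≤ form the solver expects and lifting the solver's default non-negativity bounds.

import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(T, R, gamma):
    # Compute V* via the linear program of Figure 2.2.
    #   T: (|S|, |A|, |S|) transition probabilities; R: (|S|, |A|) rewards.
    n_states, n_actions, _ = T.shape
    c = np.ones(n_states)                  # objective: minimise sum_s V(s)
    A_ub, b_ub = [], []
    for s in range(n_states):
        for a in range(n_actions):
            # V(s) >= R(s,a) + gamma * sum_s' T(s,a,s') V(s')
            # becomes: gamma * T(s,a,.) V - V(s) <= -R(s,a)
            row = gamma * T[s, a]
            row[s] -= 1.0
            A_ub.append(row)
            b_ub.append(-R[s, a])
    result = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                     bounds=[(None, None)] * n_states)
    return result.x                        # the optimal value function V*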

Policy Iteration

Policy iteration (Howard, 1960) computes a sequence of value functions,


V0 , V1 , V2 , . . . , Vk , where V0 is assigned arbitrarily and Vi is the value func-
tion of the greedy policy with respect to Vi−1 for i = 1, 2, . . . , k. Each Vi

Initialise V_0(s) arbitrarily for all s ∈ S
i := 0
Repeat
    i := i + 1
    For each s ∈ S do
        π_i(s) := arg max_a [ R(s, a) + γ Σ_{s′} T(s, a, s′) V_{i−1}(s′) ]
    V_i := the solution to the system of Bellman equations for π_i:
        V_{π_i}(s) = R(s, π_i(s)) + γ Σ_{s′} T(s, π_i(s), s′) V_{π_i}(s′),   ∀s ∈ S
Until V_i(s) = V_{i−1}(s) for all s ∈ S

Figure 2.3: The policy iteration algorithm.

in the sequence is strictly closer to V ∗ than Vi−1 except for Vk (Puterman,


1994). Since there are at most |A|^|S| policies, and the policy πi improves
at each iteration, the optimal policy is guaranteed to be found after a finite
number of steps (for finite state and action spaces). In practice, policy iter-
ation often converges in only a few iterations; however, this is offset by the
cost of computing the value function Vi at each iteration i. An algorithm
for policy iteration is given in Figure 2.3. The computation of πi and Vπi
are called the policy improvement and policy evaluation steps respectively.
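
Putting the two steps together, Figure 2.3 alternates an exact policy evaluation (a linear solve) with a greedy policy improvement. A compact sketch over dense numpy arrays is given below; it terminates when the greedy policy stops changing, which for a finite MDP happens after finitely many iterations. Again, this is an illustration rather than code from the thesis.

import numpy as np

def policy_iteration(T, R, gamma):
    # Policy iteration (Figure 2.3) for a finite MDP given as dense arrays.
    n_states, n_actions, _ = T.shape
    policy = np.zeros(n_states, dtype=int)    # arbitrary initial policy
    states = np.arange(n_states)
    while True:
        # Policy evaluation: solve the Bellman equations for the current policy.
        T_pi = T[states, policy]
        R_pi = R[states, policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
        # Policy improvement: act greedily with respect to V.
        Q = R + gamma * T @ V                 # Q[s, a] = R(s,a) + gamma * sum_s' T(s,a,s') V(s')
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy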

As noted above, the computation of the value function, Vi , at each iteration


of the algorithm is expensive. Puterman and Shin (1978) have observed that
the exact value of Vi typically does not need to be calculated in order to
derive πi+1 . They proposed modified policy iteration, where Vi is approxi-
mated at each iteration instead of being solved exactly. Policy iteration can
be regarded as a special case of modified policy iteration in which the ap-
proximation of Vi is taken to the extreme of an exact solution. In the other
extreme, the value iteration algorithm (which is discussed next) is obtained

Initialise V_0(s) arbitrarily for all s ∈ S
i := 0
Repeat
    i := i + 1
    For each s ∈ S do
        V_i(s) := max_a [ R(s, a) + γ Σ_{s′} T(s, a, s′) V_{i−1}(s′) ]
Until max_s |V_i(s) − V_{i−1}(s)| < ε
Return the greedy policy with respect to V_i

Figure 2.4: The value iteration algorithm.

when the approximation of Vi is terminated after just a single step. Versions


of modified policy iteration can perform very efficiently (Puterman, 1994).

Value Iteration

Like policy iteration, value iteration (Bellman, 1957) also computes a se-
quence of value functions. The algorithm for value iteration, given in Fig-
ure 2.4, is obtained by simply turning the Bellman optimality equation (2.4)
into an update rule.

Tseng (1990) has shown that there exists an i∗, polynomial in 1/(1 − γ), such
that the greedy policy with respect to Vi∗ is optimal (even though computing
the optimal value function itself requires an infinite number of steps). However,
rather than calculate i∗ and terminate value iteration after the i∗th loop,
in practice termination occurs when the Bellman residual, the maximum
difference between two successive value functions, or max_s |Vi(s) − Vi−1(s)|,
is less than an error ε. Williams and Baird (1993) have shown that if the
Bellman residual is less than ε, then the maximum error between the value
function of the corresponding greedy policy and the optimal value function
is less than 2γε/(1 − γ), that is:

    max_s |V^{πVi}(s) − V∗(s)| < 2γε/(1 − γ).
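The computation of Figure 2.4, together with the Bellman-residual stopping
test just described, can be sketched as follows (again assuming tabular arrays
for T and R; the function name is ours, not the thesis's).

import numpy as np

def value_iteration(T, R, gamma, epsilon=1e-6):
    n_states = T.shape[0]
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * np.einsum('sat,t->sa', T, V)   # one-step lookahead
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < epsilon:        # Bellman residual test
            return Q.argmax(axis=1), V_new             # greedy policy and its value estimate
        V = V_new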

2.2 Reinforcement Learning

The previous section introduced MDPs and briefly looked at analytical


methods for solving them. These methods require the MDP to be fully
specified, that is, each element of the task M = hS, A, T, Ri is known. Re-
inforcement learning methods (Sutton and Barto, 1998; Kaelbling et al.,
1996) are another group of methods that address MDPs; however, unlike
the analytical methods presented above, they have the advantage that they
can operate without requiring any knowledge of the environment’s dynamics
(i.e. T and R). This is of great practical significance because for many tasks,
T , and sometimes R, are unknown or difficult to acquire (this is the case
in robot control, for example). Reinforcement learning algorithms compen-
sate for the lack of knowledge about T and R by sampling the environment.
This dependence on sampling is a defining characteristic of reinforcement
learning methods.

A fundamental principle underlying many reinforcement learning algorithms


is temporal difference learning. Below, two key temporal difference learn-
ing algorithms, TD(0) and Q-Learning, are presented. They illustrate how
knowledge of T and R can be effectively replaced by experience gained from
sampling the environment. The first algorithm focuses on the problem of
computing the value function corresponding to a given policy; the second,
on the problem of computing an optimal value function.

2.2.1 The TD(0) Algorithm

The TD(0) algorithm (Sutton, 1988) is designed to estimate the value func-
tion V π given a policy π. Although it does not find an optimal value

function, it illustrates very simply the central idea of temporal difference


learning, which underlies much of reinforcement learning.

The intuition behind the TD(0) algorithm is based on the following rewrite
of the value function equation (2.1):
    V^π(s) = E_π { Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s }

           = E_π { r_{t+1} + γ Σ_{k=0}^{∞} γ^k r_{t+k+2} | s_t = s }

           = E_π { r_{t+1} + γ V^π(s_{t+1}) | s_t = s }                  (2.5)

Equation (2.5) says that V π (st ) can be expressed as the expectation of


rt+1 + γV π (st+1 ) with respect to π, which leads to the idea of estimating
V π (st ) by averaging rt+1 + γV π (st+1 ) according to samples of rt+1 and st+1
obtained by following π.

Figure 2.5 gives the TD(0) algorithm. The algorithm maintains a table V
which estimates V π , and which is updated according to the experience from
interacting with the MDP. The experience can be expressed as a sequence
of tuples hs, π(s), r, s0 i, where each tuple indicates that the agent began in
state s, executed action π(s), received reward r and moved to state s0 . An
experience tuple for s provides a sample reward r and next state s0 , which is
used to update V (s) according to (1 − α)V (s) + α(r + γV (s0 )), where α is a
learning rate satisfying 0 < α < 1. For each experience tuple, hs, π(s), r, s0 i,
the update acts to reduce the difference between V (s) and r + γV (s0 ), which
leads to the name of the algorithm, temporal difference learning. For a
particular state, the input from each experience tuple which starts from that
state is averaged over the long term according to the learning rate α. Since
the experience tuples are expected to approximate the true distribution of
rewards and transitions with respect to π over the long term, V (s) should
converge to the value given by (2.1) in the limit as the number of experience
tuples sampled approaches infinity. Convergence results for TD(0) have been
given by Sutton (1988), Dayan (1992), Jaakkola et al. (1994) and Tsitsiklis

(1994).

    Initialise V(s) arbitrarily for all s ∈ S
    For each episode do
        Initialise s
        While s ∉ G do
            a := π(s)
            Execute a and observe r and s′
            V(s) := (1 − α)V(s) + α(r + γV(s′))
            s := s′

Figure 2.5: The TD(0) algorithm. Here, G is a set of goal states; for continuing tasks
G = ∅.
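A direct transcription of Figure 2.5 is sketched below. The environment
interface (reset() returning a start state and step(a) returning the next
state, the reward and a termination flag) is an assumption of the sketch,
not something specified by the thesis.

def td0(env, policy, gamma, alpha, n_episodes):
    V = {}                                        # value table; unseen states default to 0
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + gamma * V.get(s_next, 0.0)
            V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * target
            s = s_next
    return V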

2.2.2 The Q-Learning Algorithm

The Q-Learning algorithm (Watkins, 1989) addresses the problem of learn-


ing an optimal policy for an MDP. Like TD(0) it maintains a table of values
for the purpose of estimation. In the case of Q-Learning though, the ta-
ble estimates the optimal Q function, Q∗ . The optimal Q function, Q∗ , is
related to the optimal value function, V ∗ , through the following equation:
    Q∗(s, a) = R(s, a) + γ Σ_{s′} T(s, a, s′)V∗(s′).

That is, the optimal action value for (s, a) is the immediate reward plus the
discounted optimal value of the next state, s0 , averaged over all possible next
states. The Q function formulation is convenient because it associates values
with state-action pairs rather than states, which avoids the need to make a
forward reasoning step in order to find optimal actions. In other words, to
find π∗(s), the optimal action for state s, given V∗, you need to calculate a
solution to π∗(s) = arg max_a [R(s, a) + γ Σ_{s′} T(s, a, s′)V∗(s′)]. However, to
find π∗(s) given Q∗, you only need to solve π∗(s) = arg max_a Q∗(s, a), which
completely eliminates the dependence on R and T.

    Initialise Q̂(s, a) arbitrarily for all s ∈ S and a ∈ A
    For each episode do
        Initialise s
        While s ∉ G do
            Select a using an ε-greedy policy over Q̂ (see footnote 2)
            Execute a and observe r and s′
            Q̂(s, a) := (1 − α)Q̂(s, a) + α(r + γ max_{a′} Q̂(s′, a′))
            s := s′

Figure 2.6: The Q-Learning algorithm.

Figure 2.6 gives the Q-Learning algorithm for estimating Q∗ . As with TD(0),
under Q-Learning the experience derived from interacting with the MDP can
be expressed as a sequence of tuples hs, a, r, s0 i, where each tuple indicates
that the agent began in state s, executed action a, received reward r and
moved to state s0 . To arrive at an accurate estimate of Q∗ this experience
needs to be averaged over many samples. Thus, a learning rate α, satisfying
0 < α < 1, averages together the current experience with the previous
estimate.

It can be proved that Q̂ converges to Q∗ under Q-Learning given that α


decays appropriately and that each state-action pair is sampled an infinite
number of times (Watkins and Dayan, 1992; Jaakkola et al., 1994; Tsitsiklis, 1994).

2
An ε-greedy policy behaves greedily most of the time, but with probability ε it selects
an action uniformly at random. That is, the ε-greedy policy with respect to Q, πQ(s), is:

    πQ(s) = arg max_a Q(s, a)             with probability 1 − ε,
            a random action from A(s)     otherwise.

The first requirement can be arranged by setting α = 1/k, where k is


the number of times that an experience tuple has been sampled from the
MDP. The second requirement explains why the algorithm uses an -greedy
policy to select actions instead of a greedy policy, as it ensures that each
state-action pair will be sampled an infinite number of times in the limit as
the number of experience tuples sampled approaches infinity. Fortunately,
with only a finite amount of sampling, approximately correct action-values
are usually found.
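A tabular sketch of Figure 2.6, using the same assumed reset/step environment
interface as the earlier sketches together with an explicit list of actions, might
look as follows (it is an illustration, not the thesis's implementation).

import random

def q_learning(env, actions, gamma, alpha, epsilon, n_episodes):
    Q = {}                                              # maps (s, a) to an estimate of Q*(s, a)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:               # explore
                a = random.choice(actions)
            else:                                       # exploit: greedy action under Q
                a = max(actions, key=lambda x: Q.get((s, x), 0.0))
            s_next, r, done = env.step(a)
            target = r + gamma * max(Q.get((s_next, x), 0.0) for x in actions)
            Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
            s = s_next
    return Q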

2.3 Generalisation

The algorithms discussed above all work by storing V or Q in memory. Tech-


nological progress is continually increasing the computational and memory
resources of modern computers, and for small MDPs it is feasible to store
V or Q entirely in memory. MDPs, however, suffer from Bellman’s “curse
of dimensionality”, the tendency of the state space to scale exponentially
with the number of state variables (Bellman, 1957). For many tasks, |S|,
and sometimes |A|, are very large, too large for V and Q to be stored ex-
plicitly. For example, although the number of variables characterising chess
is quite small—it has 2 × 16 pieces which each belong to one of six types
and which are arranged on an 8 × 8 board—it is estimated to contain ap-
proximately 10^43 legal positions (Shannon, 1950), which is well beyond the
storage capability of any computer in the imaginable future.

Note that T and R are subject to the curse of dimensionality too, as storing
these functions in matrix form also requires space proportional to |S| and
|A|. This has negative implications for the above methods that need access
to T and R. In many cases, the dynamics of the environment, that is, T and
R, can often be compactly formulated as a set of equations, so this is less of a
problem than the problem of storing V and Q. However, the main difficulty
arising from the curse of dimensionality relates not to storage space, but time
complexity. Even if storage requirements were not a problem, the training

time required to exhaustively sample large state and action spaces would be
prohibitive.

The curse of dimensionality leads to a dilemma. On the one hand, a well


intentioned fine-grain modelling of an environment is self-defeating, since it
will put the computational requirements of finding a solution beyond reach.
On the other hand, an attempt to side-step the explosion in the size of S
by modelling the environment coarsely can lead to an unsatisfactory level
of realism. The general approach to overcoming the curse of dimensionality
is to model the environment at an acceptably realistic level and reduce the
time complexity by sampling only a portion of S × A. When the value func-
tion of the sampled portion of S × A is calculated, an inductive algorithm
simultaneously generalises it to the, hopefully, relevant unsampled portions.
Induction, the problem of generalising from examples, has been addressed
by countless methods within machine learning, and many of these methods
can be adapted for use by MDP solution methods. There are two fundamen-
tal groups of methods for inducing generalisations in this context, function
approximation and aggregation techniques. They are now discussed in turn.

2.3.1 Function Approximation

Under function approximation, the value function, V , is approximated by a


function, Ṽθ, parameterised by θ ∈ R^k.3 The complexity of storing V is thus
now proportional to k rather than |S|. A simple function for approximating
V is a linear approximator, Ṽθ(s) = Σ_{i=1}^{k} θi si, where s = hs1 , s2 , . . . , sk i and
θ = hθ1 , θ2 , . . . , θk i. The task becomes one of finding θ such that Ṽθ approx-
imates V well. If linear approximators are insufficient to approximate V
then non-linear methods, such as multilayer neural networks, decision trees,
or support vector machines, can be used instead. An introduction to func-
tion approximation in reinforcement learning has been given by Sutton and

3
Function approximation can similarly be used for the Q function, but for simplicity
we only illustrate its use for approximating V .

Barto (1998, chapter 8), and the approach has received rigorous analytical
attention from Bertsekas and Tsitsiklis (1996).
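To indicate how a linear approximator is trained in practice, the following
sketch performs a semi-gradient TD(0) update on the parameter vector θ. The
feature mapping features(s) and the reset/step environment interface are
assumptions introduced for the example.

import numpy as np

def td0_linear(env, policy, features, k, gamma, alpha, n_episodes):
    theta = np.zeros(k)                 # parameters of V_theta(s) = theta . features(s)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            phi, phi_next = features(s), features(s_next)
            td_error = r + gamma * np.dot(theta, phi_next) - np.dot(theta, phi)
            theta += alpha * td_error * phi          # semi-gradient TD(0) update
            s = s_next
    return theta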

Function approximation is perhaps the most commonly used approach to


generalisation in reinforcement learning. Since examples of reinforcement
learning systems using function approximation are too numerous to list
comprehensively, we suffice with mentioning a few key examples instead.
Samuel (1959) wrote a checker playing program which was perhaps the ear-
liest recognisable example of temporal difference learning and which made
use of linear function approximation to deal with the enormous state space
of checkers. Another game playing program, TD-Gammon (Tesauro, 1994,
1995), which achieved the level of grandmaster at backgammon, used a mul-
tilayer neural network to approximate the value function. And in the area
of (simulated) robotic control, Sutton (1996) successfully applied the linear
CMAC method (Albus, 1975) to approximate the Q function.

Despite its popularity and the existence of the above mentioned positive re-
sults, particularly the grandmaster backgammon player, TD-Gammon, there
are known to be some negative results from incorporating function approx-
imation into reinforcement learning and dynamic programming algorithms.
Examples of divergence on simple MDPs have been produced with temporal
difference methods, value iteration, and policy iteration when combined with
very benign forms of linear function approximation (Baird, 1995; Boyan and
Moore, 1995; Tsitsiklis and Roy, 1996; Gordon, 1995). This shows that the
convergence of these algorithms, when combined with function approxima-
tion, cannot be guaranteed in general and has motivated the development of
new algorithms that when combined with function approximation are stable
(Baird, 1995; Baird and Moore, 1998; Gordon, 1995; Precup et al., 2001,
2006).

2.3.2 Aggregation

In its simplest form, aggregation partitions S or S × A into a collection


of disjoint clusters. Each cluster contains a group of states or state-action
pairs that, ideally, have similar or equal value according to V or Q. In
more sophisticated forms of aggregation an individual state may belong to
multiple clusters. Aggregation techniques maintain a single value for the
entire cluster, such as an estimate of the average value of the states or
state-action pairs contained within the cluster, hence the space required for
representing V or Q is thus now dependent on the number of clusters rather
than on |S| or |S × A|.

Let us consider a simple form of aggregation in more detail. The state space
S is partitioned into disjoint subsets, S1 , . . . , Sn , where each partition Si is
associated with a value Ṽi . The optimal value function, V ∗ , is approximated
by Ṽ , where Ṽ (s) = Ṽi for all s ∈ Si . According to Tsitsiklis and Roy (1996),
there are no inherent limitations with using this type of aggregation. That
is, given some ε > 0, the partitions can be defined as:

    Si = {s | iε ≤ V∗(s) < (i + 1)ε},

for all i. Thus, the optimal value function V∗ can be approximated with
accuracy ε. However, it is precisely the value function V∗ that we are trying
to predict, thus in practice we are unable to find a partition of S using V ∗ .
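In practice a partition of this kind is supplied as a mapping from states to
cluster indices, and a single value is learnt per cluster. The sketch below is
illustrative only; the cluster(s) function and the reset/step environment
interface are hypothetical.

def td0_aggregated(env, policy, cluster, gamma, alpha, n_episodes):
    V = {}                                        # one value per cluster index
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            c, c_next = cluster(s), cluster(s_next)
            target = r + gamma * V.get(c_next, 0.0)
            V[c] = (1 - alpha) * V.get(c, 0.0) + alpha * target
            s = s_next
    return V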

The general approach taken by aggregation techniques is to identify reg-


ularities in the attributes characterising S that correspond to regularities
in the V or Q estimate. This approach is well suited to factored MDPs
(Boutilier et al., 2000), for example. In a factored MDP the state space, S,
is characterised by a set of random variables, that is, s = {s1 , s2 , . . . , sk }
for all s ∈ S, where si is a variable over a finite domain val(si ). Typi-
cally, variables are Boolean, although any finite domain can be used. The
state space, S, is just the possible assignment of values to variables, hence
S = val(s1 ) × val(s2 ) × . . . × val(sk ). The size of the state space is
|S| = Π_{i=1}^{k} |val(si )|, thus the memory requirements for storing V
explicitly are exponential in k.

Figure 2.7: A value function tree (reproduced from Boutilier et al., 2000).

An example of a compact representation of V under factored state spaces is
shown in Figure 2.7. Here, a decision tree represents
the value function. Internal nodes in the tree represent boolean variables
characterising the state space. The left subtree under a variable represents
the case when the variable is true, while the right subtree represents the
false case. Leaf nodes store the value of states which are consistent with
the corresponding branch. The decision tree, in effect, partitions the state
space according to an estimate of the value function, and is thus a form of
aggregation.
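A value tree of this kind is easy to query. The encoding below, as nested
tuples with numeric leaves, is our own illustration and is not the
representation used by Boutilier et al. (2000); the variable names in the
example are hypothetical.

def tree_value(tree, state):
    # Internal nodes are (variable, true_subtree, false_subtree); leaves are values.
    while isinstance(tree, tuple):
        variable, true_branch, false_branch = tree
        tree = true_branch if state[variable] else false_branch
    return tree

# Example with hypothetical boolean variables x1 and x2:
# tree_value(('x1', 10.0, ('x2', 9.0, 7.0)), {'x1': False, 'x2': True}) -> 9.0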

The above example shows that aggregation may be non-uniform, that is,
a certain variable may only be relevant to the value function in particular
regions of S. Contrast this non-uniformity in the value function with linear
function approximation. Under linear function approximation, the value
function is a weighted contribution of each feature, si . The weights, θi ,
may change over time but they are fixed with respect to S, suggesting that,
while aggregation can be useful for value functions that are conditionally
dependent on the variables describing S, linear function approximation is

better suited for smooth value functions.

Aggregation methods can be either static or dynamic. Under static aggre-


gation, a suitable partition or clustering is user engineered or computed in a
preprocessing step and given to the aggregation technique to evaluate (Gi-
van et al., 2003; Kim and Dean, 2003; Dean and Givan, 1997). Alternatively,
a dynamic aggregation technique automatically generates and refines clus-
ters for itself (Guestrin et al., 2003; Zhang and Baras, 2001; Boutilier et al.,
2000; Chapman and Kaelbling, 1991; Bertsekas and non A. David, 1989).

Some analysis of using static aggregation techniques for value function gen-
eralisation has been performed. A convergence proof for Q-Learning with
a general form of static aggregation called soft state aggregation4 has been
given by Singh et al. (1995). Tsitsiklis and Roy (1996) give a convergence
proof for value iteration under a static partition of S. Note, however, that
in both cases the policy which is converged to is optimal with respect to the
given clustering and is not, in general, necessarily the same as π ∗ , the opti-
mal policy for the unclustered MDP. For example, in the extreme case where
all states are grouped into a single cluster, only the action with the highest
expected value over the entire state space will be selected. The quality of
the policy thus depends on the quality of the clustering.

2.3.3 Remarks

Whether function approximation performs better than aggregation, or vice


versa, depends on the task at hand. Function approximation is well suited
to environments where the state space, S, is naturally represented by real-
valued vectors—for example, environments which are perceived through sen-
sor devices such as cameras and microphones—and where the value function

4
In soft state aggregation, each state s belongs to cluster x with probability P (x|s).
Each state s may belong to several clusters. Partitioning is a special case of soft state
aggregation where for each state s there is a single cluster x such that P (x|s) = 1 and
P (y|s) = 0 for all other clusters y ≠ x.

is smooth with respect to the state signal—that is, environments where small
changes in the state signal lead to small changes in the value function. Ag-
gregation techniques, on the other hand, are particularly suited to tasks that
exhibit regularity in the value function which is conditionally dependent on
variables that factor S and A.

2.4 Relational Reinforcement Learning

In the previous section we saw that, because of the curse of dimensionality,


there is a need for reinforcement learning algorithms to generalise over the
value function. In this section we consider an approach to generalisation
that makes use of the abstractive principles of first-order logic. Reinforce-
ment learning algorithms and methods utilising this approach are collectively
known as Relational Reinforcement Learning (RRL).

2.4.1 Propositionally Factored MDPs

Before discussing how first-order logic can be used to address the generalisa-
tion problem, let us first motivate interest in it by considering limitations of
propositional representations. We consider propositionally factored MDPs
specifically; however, the limitations that will be raised apply to attribute-
value representations in general, including those used in conjunction with
function approximation and aggregation techniques within reinforcement
learning.

Propositionally factored MDPs were discussed in Section 2.3.2, where they


were called factored MDPs. Recall that in a propositionally factored MDP,
the state space S is characterised by a set of random variables, s = {s1 , s2 ,
. . . , sk } for all s ∈ S, where si is a variable over a finite domain val(si ).
In the following example the Q function is represented by a collection of
rules, where each rule can be viewed as aggregating a set of triples over

Table 2.1: Representing poker hands using a rank-suit factoring. An example for each
class of poker hand is given. The variables Ri and Si are, respectively, the rank and suit
of the ith card (the order of cards is arbitrarily assigned) in the hand.

Class R1 S1 R2 S2 R3 S3 R4 S4 R5 S5
four-of-a-kind 2 ♠ 2 ♦ 2 ♥ 2 ♣ 8 ♦
fullhouse Q ♠ Q ♣ 4 ♥ 4 ♦ 4 ♠
flush 3 ♣ 9 ♣ J ♣ K ♣ A ♣
straight 4 ♣ 5 ♦ 6 ♠ 7 ♥ 8 ♣
three-of-a-kind 7 ♠ 7 ♦ 7 ♣ A ♣ 9 ♥
two-pair 8 ♣ 4 ♣ 8 ♥ 3 ♠ 4 ♠
pair 7 ♦ K ♣ K ♠ 10 ♣ 2 ♥

S × A × R. Syntactically, the rules are definite clauses in propositional logic,


each having the form: action ← condition, where action ∈ A and condition
is a collection of propositions connected by the logical symbols ¬, ∨, and
∧. Each proposition in condition has the form si ◦ k, where si is a variable
factoring the state space S, k is a value from val(si ), and ◦ belongs to a
set of predetermined operators, such as {≤, =, ≠, ≥}. This representation
is general enough to allow condition to represent any arbitrary aggregation
of states over S. Each rule is associated with a constant representing a Q
value, but in the following example it will be omitted in order to focus on
representational issues.

The task which we consider is that of classifying hands of poker (adapted


from Blockeel, 1998). A poker hand contains five cards, each uniquely identi-
fied by its rank and suit attributes. The state space of poker can be factored
by 10 variables using the propositional factoring shown in Table 2.1. Each
card, i, within the hand is represented by two variables, Ri for its rank and
Si for its suit. Since there are five cards in a hand, S = val(R1 ) × val(S1 ) ×
. . . × val(R5 ) × val(S5 ).5 The action space is the set of classes to which
5
Note, however, that the subset of S consisting of those hands which contain the same
card twice or more, {s ∈ S | ∃(i, j) s.t. Ri = Rj , Si = Sj , and i ≠ j}, does not correspond
to valid hands.

a hand may belong, that is A = {four-of-a-kind , fullhouse, flush, straight,


three-of-a-kind , two-pair , pair }.

Although the above representation is quite natural for representing individ-


ual states and actions, its utility is very limited for compactly representing
the Q function. For example, under this representation, a rule for classifying
a pair would look as follows:

pair ← (R1 = 2 ∧ R2 = 2 ∧ ranks 3, 4 and 5 not equal to 2 or each other )


∨ (R1 = 3 ∧ R2 = 3 ∧ ranks 3, 4 and 5 not equal to 3 or each other )
∨ ...
∨ (R1 = 2 ∧ R3 = 2 ∧ ranks 2, 4 and 5 not equal to 2 or each other )
∨ ...

This rule amounts to checking, in an exhaustive disjunction of specific cases,


whether exactly two of the five cards have equal rank. Since there are 13
different rank values and 10 distinct pairs of rank variables to check, the
above rule contains 130 lines. Note that each line checks the remaining rank
variables in order to ensure that the hand is not actually a two-pair, three-
of-a-kind, fullhouse, or four-of-a-kind. Each of these checks are themselves
an exhaustive disjunction of specific cases, for example the first line of the
above rule expands to:

(R1 = 2 ∧ R2 = 2 ∧ R3 = 3 ∧ R4 = 4 ∧ R5 6= 2 ∧ R5 6= 3 ∧ R5 6= 4)
∨ (R1 = 2 ∧ R2 = 2 ∧ R3 = 3 ∧ R4 = 5 ∧ R5 6= 2 ∧ R5 6= 3 ∧ R5 6= 5)
∨ ...
∨ (R1 = 2 ∧ R2 = 2 ∧ R3 = 3 ∧ R4 = A ∧ R5 6= 2 ∧ R5 6= 3 ∧ R5 6= A)
∨ ...
∨ (R1 = 2 ∧ R2 = 2 ∧ R3 = A ∧ R4 = 3 ∧ R5 6= 2 ∧ R5 6= A ∧ R5 6= 3)
∨ (R1 = 2 ∧ R2 = 2 ∧ R3 = A ∧ R4 = 4 ∧ R5 6= 2 ∧ R5 6= A ∧ R5 6= 4)
∨ ...
∨ (R1 = 2 ∧ R2 = 2 ∧ R3 = A ∧ R4 = K ∧ R5 6= 2 ∧ R5 6= A ∧ R5 6= K)

All in all, the expanded rule contains 130 × 132 = 17,160 lines. Although

the approach stops short of listing every single instance of a pair, of which
there are approximately one million, it remains an unsatisfactory approach
for representing generalisations for this task. And although the rank-suit
factoring can be used with more compact structures than rules, such as
trees, the resulting level of complexity remains of the same order.
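For contrast, the concept itself is trivial to state procedurally. The sketch
below (ours, not the thesis's) classifies a pair directly from the five ranks;
the looping and counting it relies on is exactly the kind of quantification
that a flat propositional rule set cannot express compactly.

from collections import Counter

def is_pair(ranks):
    # True iff exactly two of the five ranks are equal and the rest are distinct.
    return sorted(Counter(ranks).values()) == [1, 1, 1, 2]

# e.g. is_pair(['7', 'K', 'K', '10', '2']) evaluates to True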

A better factoring, at least for identifying a pair, can be obtained from the
observation that the above rule detects when exactly two rank variables are
equal. Let Xi,j be a boolean variable such that:

    Xi,j = T if and only if the ranks of cards i and j are equal, and F otherwise.

Since by symmetry Xi,j = Xj,i , and Xi,i is unnecessary for detecting pairs,
only ten variables are needed: X1,2 , X1,3 , X1,4 , X1,5 , X2,3 , X2,4 , X2,5 , X3,4 ,
X3,5 , and X4,5 . Under this factoring, a rule for identifying a pair requires
only 10 lines:

pair ← (X1,2 = T ∧ X1,3 = F ∧ . . . ∧ X4,5 = F )


∨ (X1,2 = F ∧ X1,3 = T ∧ . . . ∧ X4,5 = F )
∨ ...
∨ (X1,2 = F ∧ X1,3 = F ∧ . . . ∧ X4,5 = T )

Each line in the rule checks that exactly one variable is T and the rest are
F . Taking this approach to the extreme, a new boolean variable, X, could
be defined such that X = T if and only if one pair of ranks in the hand are
equal. A rule which correctly classifies a hand as a pair is then simply:

pair ← X.

In summary, we have seen through the rank-suit factoring of poker hands


that factoring the state space in the most natural way may be inadequate
for learning compact rules. More compact rules are possible when the state
space is factored with variables representing the equality between specific
ranks, but this relied on prior knowledge of the features that were directly

relevant to the concept to be learnt. Frequently we are not in this fortunate


situation and would like the generalisation algorithm to identify the relevant
concepts for us.

2.4.2 Relational Markov Decision Processes

We now turn our attention to relational factorings of MDPs: factorings


that make use of the abstractive qualities of first-order logic. This approach
can achieve a very compact representation of the value function and may
be particularly advantageous when relevant propositional features are not
known at design time. Readers unfamiliar with first-order logic may like to
consult Appendix A while reading this section.

The state and action space of an MDP (that is, S and A) can be factored
using an appropriate alphabet over first-order logic. An alphabet in first-
order logic is a tuple hC, F, P, di, consisting of: a set of constant symbols,
C; a set of function symbols, F; a set predicate symbols, P; and a function
d : F ∪ P → Z, where d(i) is called the arity of i. In the following, it will
be convenient to partition P into two disjoint subsets, one for representing
states, PS , and another for actions, PA .

We now introduce the relationally factored MDP (van Otterlo, 2004), which
is a type of MDP where S and A are factored into atoms by an alphabet
over first-order logic.

Definition 2 (Relationally factored MDP) A relationally factored MDP


(RMDP) is the tuple hL, S, A, T, Ri consisting of:

• An alphabet over first-order logic L = hC, F, PS ∪ PA , di

• A set of states S ⊆ 2^HB[hC,F ,PS ,di], a subset of the Herbrand interpretations over hC, F, PS , di

• A set of actions A = HB[hC, F, PA , di], the Herbrand base over hC, F, PA , di



• A transition function T : S × A × S → [0, 1]

• A reward function R : S × A → R

Note that 2^HB[L] denotes the powerset of HB[L], that is, the set of all Her-
brand interpretations over L. Also note that S is typically a subset and not
the complete set of all Herbrand interpretations over hC, F, PS , di. This is
because not every Herbrand interpretation over hC, F, PS , di necessarily cor-
responds to a valid state. For the sake of simplicity we have not formalised
a test to distinguish valid from invalid states; for such a test, see van Otterlo
(2005).

For every MDP there exists an equivalent RMDP; thus, the framework is
not more restrictive than MDPs. Given an MDP M = hS, A, T, Ri, an
equivalent RMDP, M 0 = hL, S 0 , A0 , T 0 , R0 i, can be trivially derived from M .
The alphabet L = hC, ∅, PS ∪ PA , di is constructed as follows: C contains a
single dummy constant; PS contains a unique predicate symbol is for each
state s ∈ S; PA is constructed analogously over A; and d(i) = 1 for all
i ∈ PS ∪ PA . Under this construction, S 0 and A0 are isomorphic to S and
A respectively. The transition function T 0 is defined from T , such that each
element of S and A in T is replaced by its equivalent in S 0 or A0 . The reward
function R0 can be defined in an analogous fashion. Using this construction,
M 0 is equivalent to M since each element of M 0 , except for L, which has no
corresponding element in M , is isomorphic to its corresponding element in
M.

However, M 0 , as constructed above, would not confer any additional benefit


for generalisation compared to M . Instead, it is usual to employ an alphabet
L = hC, ∅, PS ∪ PA , di, where C is a set of constants representing a collection
of objects contained in the given environment (and frequently also their
relevant attributes); PS is a set of predicates specifying properties of and
relationships between the elements of C; PA is a set of predicates representing
the actions to perform on the objects; and d, of course, being the arity of the
predicates in PS ∪ PA . The following example illustrates such an alphabet

Table 2.2: Representing hands of poker using a relational factoring.

Class Hand
class(fourofakind ) {card (first, two, spades), card (second , two, diamonds),
card (third , two, hearts), card (fourth, two, clubs),
card (fifth, eight, diamonds)}
class(fullhouse) {card (first, queen, spades), card (second , queen, clubs),
card (third , four , hearts), card (fourth, four , diamonds),
card (fifth, four , spades)}
class(flush) {card (first, three, clubs), card (second , nine, clubs),
card (third , jack , clubs), card (fourth, king, clubs),
card (fifth, ace, clubs)}
class(two-pair ) {card (first, eight, clubs), card (second , four , clubs),
card (third , eight, hearts), card (fourth, three, spades),
card (fifth, four , spades)}
class(pair ) {card (first, seven, diamonds), card (second , king, clubs),
card (third , king, spades), card (fourth, ten, clubs),
card (fifth, two, hearts)}

for poker.

Example 1 Table 2.2 shows several poker hands selected from Table 2.1
represented under a relational alphabet. The alphabet L = hC, ∅, PS ∪ PA , di
is as follows: C = {first, second , . . . , fifth} ∪ {two, three, . . . , ace} ∪
{diamonds, hearts, spades, clubs} ∪ {fourofakind , fullhouse, . . . , pair },
which represent the cards in the hand, the ranks, the suits, and the classes
respectively; PS = {card }; PA = {class}; and d(card ) = 3 and d(class) = 1.


Note that the above factoring is as natural for representing hands of poker as


the propositional factoring in Table 2.1.

2.4.3 Aggregating States and Actions Under an RMDP

Under an RMDP, a cluster or aggregation of states is represented by an


abstract state. Recall that a state s ∈ S is a Herbrand interpretation, a set of
atoms over hC, F, PS , di that do not contain variables. Similarly, an abstract
state, s̃, is a set of atoms over hC, F, PS , di but it may contain variables. The
set of states aggregated by an abstract state s̃ is S(s̃) = {s ∈ S | s̃θ ⊆ s}.
Abstract actions are defined analogously, that is, an abstract action ã is an
atom over hC, F, PA , di, it may contain variables, and represents a set of
actions A(ã) = {a ∈ A | ãθ = a}.

A concise rule for correctly classifying a pair in poker using the language
given in Example 1 is:

class(pair ) ← card (C1 , R, _), card (C2 , R, _), card (C3 , R3 , _), card (C4 , R4 , _),
card (C5 , R5 , _)

Note that the condition part of the rule is an abstract state; the vari-
ables6 being the terms beginning with an uppercase letter (this convention
of indicating a variable using an uppercase letter is followed throughout the
thesis). The underscore, "_", is a special variable, called the anonymous
variable, which can represent any constant, not unlike the wildcard, *, of
UNIX or the “don’t care” symbol, #, of traditional learning classifier system
languages. The rule states that if the cards C1 and C2 both have the same
rank, R, and if the rank of all the other cards in the hand, R3 , R4 , and R5 ,
are different from R, then the hand is a pair.

The advantage of using first-order logic for expressing the concept of a pair in
poker becomes apparent when it is compared to the propositional factorings
and rules in Section 2.4.1. Like the propositional factoring in Table 2.1, only
the basic features of poker cards, ranks and suits, are represented in the first-
order logic language given in Example 1. However, the above rule, which
is expressed in this language, is much more compact than its equivalent
rule over the propositional factoring. It is also more compact than the first
6
In this example, different variables are assumed to represent different constants.

propositional rule which made use of high level propositional features. Only
the final, trivial, propositional rule is more compact than the above rule, but
it relies on a feature that directly indicates the presence of a pair, which, if
known, would generally make learning the concept of a pair redundant.

2.5 Summary

This chapter covered background material in reinforcement learning and


relational reinforcement learning in order to lay the foundations for the re-
mainder of the thesis. In particular, the chapter dealt with fundamentals
from MDPs, reinforcement learning methods, generalisation, and relational
reinforcement learning. MDPs were introduced as the problem formalisa-
tion: sequential decision making in an agent-environment context. Key ele-
ments of an MDP include a set of states, S, which describe the environment;
a set of actions, A, which are available to the agent; a transition function, T ,
which gives the outcome of taking an arbitrary action in an arbitrary state;
and a reward function, R, which gives the reward associated with each tran-
sition. The notion of behaviour was formalised as a policy, π, which is a
mapping from states to actions. Any policy, π, is associated with a cor-
responding value function, V π , which indicates the amount of reward that
can be expected from following the policy, and which can be conveniently
expressed in recursive form through a set of Bellman equations. Conversely,
any value function is associated with a corresponding policy which is greedy
with respect to a one-step search over the value function. An optimal policy,

π∗, is a policy, not necessarily unique, such that V^{π∗}(s) ≥ V^π(s) for all π
and s ∈ S.

When the dynamics of an environment (that is, the transition function T


and reward function R) are fully known, then an optimal policy can be
computed using a variety of methods from linear programming and dynamic
programming. The latter includes value iteration and policy iteration, which
utilise the recursive nature of Bellman equations to compute the optimal

value function. When the dynamics of the environment are unknown, a


value function can be estimated from samples of the environment, which is
the approach of reinforcement learning methods like TD(0) and Q-Learning.
These reinforcement learning algorithms can be shown to converge to the
correct solution in the limit as the number of times each state or state-action
pair is sampled approaches infinity (other assumptions may also apply).

The curse of dimensionality refers to the problem that realistic modelling


of an environment leads to an explosion in the size of the state space. This
problem motivates the development of methods which include a general-
isation mechanism for assigning values to states or state-action pairs that
have not been sampled. Two fundamental approaches to generalisation were
described. The first approximated the value function from examples using
function approximation techniques like curve fitting or neural networks. The
second compactly aggregated states or state-action pairs into clusters, as-
signing a common value to each cluster. Function approximation is well
suited to smooth value functions, while aggregation methods may be more
suited when the value function is conditionally dependent on variables that
factor the state space.

We saw that the abstractive mechanisms of first-order logic can be used


to solve the generalisation problem in reinforcement learning. The advan-
tage of first-order logic is that it can compactly represent generalisations
which attribute value or propositional representations are unable to express
without substantial use of high-level pre-engineered features. This ability
provides a compelling motivation for investigating the combination of rein-
forcement learning methods with representation in first-order logic, leading
to the development of relational reinforcement learning.

We conclude by restating that this chapter was not intended to give a com-
prehensive overview of each of the background topics covered. Some notable
topics have been omitted because they are not directly relevant to this thesis.
They include:

• Extensions to the MDP formulation, including multiple agents (Littman,


1994), continuous rather than discrete time (Crites and Barto, 1998;
Doya, 2000), and partially observable states (Åström, 1965; Sondik,
1978; Cassandra et al., 1994).

• Extensions of or variations to the basic methods, such as the use of
eligibility traces (Sutton, 1988; Watkins, 1989; Rummery and Niranjan,
1994; Peng and Williams, 1996), model-based reinforcement learning,
which uses experience to build an approximate model of the envi-
ronment (Sutton, 1990; Moore and Atkeson, 1993; Peng and Williams,
1993), and methods which compute a policy directly without calculat-
ing the value function (Sutton et al., 2000; Baxter and Bartlett, 2001;
Bartlett and Baxter, 2002).

• Methods other than generalisation which reduce the complexity of the


problem or solution. For example, symmetries in the environment can
be used to effectively reduce the size of the state space (Fitch et al.,
2005). Alternatively, tasks that contain identifiable subtasks may lead
to a hierarchical decomposition of the value function (Dietterich, 2000;
Hengst, 2002). Again, support for macro-actions, sometimes called
options, may increase the efficiency of learning (Sutton et al., 1999).

For a general introduction to reinforcement learning, covering many of the


above topics, the reader can consult Sutton and Barto (1998) and Kaelbling
et al. (1996).
Chapter 3

A Survey of Relational
Reinforcement Learning

“. . . one cannot study learning in the absence of assumptions


about how the acquired knowledge is described, . . . ”
—Pat Langley

In the previous chapter we examined the foundations of relational rein-


forcement learning. The aim of this chapter is to survey existing methods
for relational reinforcement learning (RRL). We start by situating the field
of RRL in relation to other fields. After this, we describe existing RRL
methods. Although the focus is on true reinforcement learning approaches,
related approaches based supervised learning and dynamic programming are
also mentioned. We then present concepts that are useful for categorising
RRL methods. Finally, we consider the current approach in relationship to
existing methods and discuss its significance.


3.1 Connections to Other Fields

Relational reinforcement learning (van Otterlo, 2005; Tadepalli et al., 2004b)


is principally connected to three other fields: reinforcement learning (RL),
inductive logic programming (ILP), and a subfield of artificial intelligence
known as planning. Figure 3.1 shows the relationships between RRL and
these three fields. From reinforcement learning, the parent field of RRL,
RRL derives its fundamental approach, including a problem formulation
based on MDPs and the calculation of a policy through methods based on
estimating the value function. RRL is also influenced, to a large extent, by
inductive logic programming—in fact it owes its origin to the combination
of RL and ILP. The development of the field of ILP demonstrated that reg-
ularity in some problems was best captured using relational languages, such
as first-order logic. These problems had been formulated as the induction
of general rules from specific examples (see Appendix B), which was the
kind of problem faced by RL algorithms attempting to generalise over the
value function. The field of RRL appeared when techniques from ILP were
applied to the problem of generalisation in reinforcement learning (Džeroski
et al., 1998a, 2001), resulting in the use of relational languages to compactly
represent the value function, and also, in some cases, the incorporation of
ILP algorithms into RL systems.

Figure 3.1: Connections between relational reinforcement learning and other fields. (RRL
derives its problem formulation and methods from RL, its representation and methods
from ILP, and its representation and tasks from planning.)

There is also a connection between RRL and the artificial intelligence sub-
field of planning. These two fields overlap in the sense that they are both
agent-based frameworks which focus on the issue of how to compute an
optimal behaviour given a particular environment. Furthermore, planning
environments have traditionally been described using relational languages.
For example, the well known blocks world environment, used extensively
to benchmark RRL algorithms and systems, originates from the planning
community. However, RRL is differentiated from planning in the following
ways:

• In an MDP, the agent's aim is to maximise the reward signal, R; under
the planning framework, rewards are absent and the agent's aim is
instead to find a goal state.

• An RL agent does not usually have direct access to a model of the


environment (that is, the transition and reward functions, T and R);
a planning agent can arbitrarily access state transition information.

• A typical approach of an RL agent is to compute a value function from


which a policy is derived; a planning agent usually searches directly
for the shortest path to the goal.

However, despite these differences, many planning tasks can be reformulated


as an MDP and hence solved, at least in principle, using reinforcement
learning or dynamic programming methods.

3.2 Existing RRL Methods and Approaches

This section surveys existing RRL systems. We focus primarily on rein-


forcement learning systems and do not consider relational extensions of dy-
namic programming in depth. The first section (Section 3.2.1) covers static

approaches followed by four sections (sections 3.2.2–3.2.5) on dynamic ap-


proaches. Here, static and dynamic refer to the generalisation process: under
static approaches, generalisations remain fixed throughout training; while
under dynamic approaches, generalisations are generated and refined online
by the system. The final section (Section 3.2.6) considers systems which
either extend RRL or are closely related to the field. Overviews of RRL
have been compiled by van Otterlo (2005), Tadepalli et al. (2004b), van Ot-
terlo (2004), and van Otterlo (2002), of which the first three are the most
current. The RRL workshop proceedings (Tadepalli et al., 2004a) contains
much recent material in the area.

3.2.1 Static Generalisation

In the previous chapter we saw that basic reinforcement learning methods


use a table representation to store the value function, V or Q. One approach
to relational reinforcement learning replaces the table with a static abstrac-
tion device which compactly clusters states or state-action pairs. A value es-
timation algorithm, such as Q-Learning, is then modified so that it evaluates
the state or state-action clusters. Examples of this approach are Carcass
(van Otterlo, 2004), Logical Q-Learning (Kersting and De Raedt, 2003),
Logical TD(λ) (Kersting and De Raedt, 2004), and rQ-Learning (Morales,
2003, 2004).

Table 3.1 lists the abstraction devices employed by the above systems to
structure the value function. Carcass uses decision rules of the form:

{ã1 , . . . , ãn } ← s̃

where the ãi are abstract actions and s̃ is an abstract state. Each rule
implicitly represents the n rules: ãi ← s̃, where 1 ≤ i ≤ n. For each
implicit sub-rule, an estimate of its corresponding value, Q(s̃, ãi ), is stored
and updated throughout training. A Carcass rule, {ã1 , . . . , ãn } ← s̃, can
be seen as essentially clustering a subset of S × A into n clusters, where the

Table 3.1: A listing of abstraction devices used by static RRL systems. †The form
which rules take in the Carcass framework differs slightly from those in Logical TD(λ)
and Logical Q-Learning; see text for details.

system device for structuring V or Q


Carcass Ordered decision list of abstract decision rules†
Logical TD(λ) Ordered decision list of abstract decision rules
Logical Q-Learning Same as Logical TD(λ)
rQ-Learning Prolog relations

ith cluster is:

{(s, a) ∈ S × A(s) | ∃θ : s = s̃θ, a = ãi θ}

and has a corresponding estimated value, Q(s̃, ãi ). A set of such rules, as
illustrated in Table 3.2, is then used to decompose S × A into a finite set of
clusters for compactly representing the Q function. In Logical TD(λ) and
Logical Q-Learning, decision rules are analogous to Carcass rules except
that n = 1.

A different approach to abstraction is taken by rQ-Learning (Morales, 2003,


2004). Here, r-states and r-actions represent clusters of states and state-
action pairs respectively. Each r-state and r-action has a corresponding
Prolog relation which takes as an argument a state, s ∈ S, for an r-state
relation, or a state-action pair, (s, a) ∈ S × A, for an r-action relation. An
r-state is the set of all states, s ∈ S, such that the corresponding relation
succeeds given s. An r-action is defined analogously over state-action pairs.
The use of Prolog relations supports a very high level of abstraction (see
Table 3.3) and does not require S and A to be factored relationally. A
consequence of using such a high level of abstraction is that a cluster, an
r-action, will typically contain state-action pairs whose real Q values differ
from one another, thus the Q value estimate associated with the cluster is
essentially an average over the cluster.

Orthogonal to the issue of the abstraction device is the choice of algorithm



Table 3.2: An ordered list of rules for bw4 that could be evaluated by the Carcass
system (adapted from van Otterlo, 2004).

1. {mv_fl(A)} ←
   on(A, B), on(B, C), on(C, D), on_fl(D), cl(A)

2. {mv_fl(A), mv(A, D), mv(D, A)} ←
   on(A, B), on(B, C), on_fl(C), on_fl(D), cl(A), cl(D)

3. {mv_fl(A), mv(A, D)} ←
   on(A, B), on_fl(B), on(C, D), on_fl(D), cl(A), cl(D)

4. {mv_fl(A), mv(A, C), mv(C, A)} ←
   on(A, B), on_fl(B), on_fl(C), cl(A), cl(C)

5. {mv(A, B)} ←
   on_fl(A), on_fl(B), cl(A), cl(B)

for calculating the value function. In principle, any table-based value func-
tion method can be adapted, although to date, only variants of Q-Learning,
TD(λ),1 and prioritised sweeping (van Otterlo, 2004)2 have been used. Non-
value function methods could also be adapted for static RRL. For example,
in the work of Itoh and Nakamura (2004), a list of n condition-action rules
implementing a learning and a planning component was provided to a rein-
forcement learning system. Each rule i was associated with a probability pi
indicating its likelihood of being invoked when its condition was satisfied.
Gradient descent was performed on the probability vector, hp1 , . . . , pn i, in

1
The TD(λ) algorithm (Sutton, 1988) is a generalisation of TD(0) that uses eligibility
traces. Eligibility traces are a mechanism for making efficient use of the sampled data and
generally reduce the amount of training required (see Sutton and Barto, 1998, chapter 7).
2
See Section 3.2.6 for additional comments on this system.

Table 3.3: The definition for an r-state for the King and Rook versus King chess
endgame (adapted from Morales, 2004). The argument, s, is the board position. This
relation covers more than 3,000 positions.

r_state1(s) :-
kings_in_opposition(s),
rook_divides_kings(s).

order to learn, in effect, which rules to keep and which to abandon. The
intention of the system was to “learn when to learn and plan” and it did not
attempt to generalise over the value function. However, it could be made to
do so by replacing its rule list with the kinds of abstraction devices listed in
Table 3.1.

The advantage of using static abstraction is the simplicity of the approach.


If there already exists adequate knowledge for creating a suitable abstrac-
tion for a particular environment, then the overhead of dynamically gen-
erating clusters during training is unnecessary. Alternatively, if a suitable
abstraction is unknown, it might be possible to generate it automatically
in a preprocessing phase; Morales (2004) has taken steps in this direction.
Another important benefit of simplicity is that it facilitates analysis, leading
to a convergence proof for Logical TD(λ) (Kersting and De Raedt, 2004).

3.2.2 Dynamic Generalisation

We now move to RRL systems that generalise dynamically. In this section,


we consider a family of RRL methods that achieve dynamic generalisation
by combining Q-Learning with inductive techniques over first-order logic.
Belonging to this general approach are Q-rrl (Džeroski et al., 1998a, 2001),
Rrl-tg (Driessens et al., 2001), Rrl-rib (Driessens and Ramon, 2003),
Rrl-kbr (Gärtner et al., 2003a; Ramon and Driessens, 2004), and Trendi
(Driessens and Džeroski, 2005).

Figure 3.2: A Q-tree for blocks world (adapted from Džeroski et al., 2001). Non-leaf
nodes contain atoms factoring the state space: the left subtree represents the case where
the test corresponding to the atom succeeds, while the right subtree represents failure. The
root of the tree additionally contains the action, mv(D,E) here, and atoms whose variables
bind to globally relevant attributes, in this case, goal_on(A,B) and numberofblocks(C).
The leaf nodes store the Q value for the corresponding branch.

Q-rrl, the first ever relational reinforcement learning system, was devel-
oped in the seminal work of Džeroski et al. (2001, 1998a).3 The system
combined Q-Learning with the ILP decision tree algorithm, Tilde (Block-
eel and De Raedt, 1998). During training, the algorithm builds a Q-tree that
partitions S × A into clusters containing (s, a) pairs whose Q(s, a) estimate
is equal (see Figure 3.2). The system was evaluated on three blocks world
tasks, stack, unstack and onab, hat have now become standard benchmarks
for RRL systems.

The Q-rrl system was, however, a first attempt at RRL and contained some
significant inefficiencies. Prominent among them was that the Q-trees were
generated from scratch after each training episode, requiring every sampled
(s, a) pair and its associated Q(s, a) estimate to be explicitly retained in
memory until the completion of training. Subsequent research focussed on
3
Some sources, including the original article by Džeroski et al. (1998a), refer to the
system as Rrl. We prefer Q-rrl in order to avoid confusion with the acronym for the
name of the field. It also distinguishes between a different version of the system, P-rrl,
reported in (Džeroski et al., 2001).

developing an improved version, Rrl-tg (Driessens et al., 2001), which used


a relational “upgrade” of the G algorithm (Chapman and Kaelbling, 1991)
to build the decision tree incrementally so that the (s, a) pairs could be
discarded after sampling.

The next two methods, Rrl-rib (Driessens and Ramon, 2003) and Rrl-kbr
(Gärtner et al., 2003a; Ramon and Driessens, 2004), replace the decision
tree of Rrl-tg with other techniques for approximating the value func-
tion. The Rrl-rib system uses an instance-based method related to the
k-nearest neighbour algorithm (Aha et al., 1991), and the Rrl-kbr system
employs Gaussian processes (MacKay, 1998) with graph kernels (Gärtner
et al., 2003b). These two methods do not, like other RRL methods, form
clusters of state-action pairs using the abstractive mechanisms of first-order
logic, such as variables. Rather, they rely on calculating the similarity of
the current input to previously experienced examples in order to make a
prediction for the value of the input. On blocks world tasks, Rrl-rib and
Rrl-kbr were found to outperform Rrl-tg with respect to the level of op-
timality attained by the learnt policies. They are, however, more expensive
in terms of computation and memory requirements, their predictions are less
transparent to humans, and they rely on domain specific distance metrics
in order to compute similarity.

The final system, Trendi (Driessens and Džeroski, 2005), hybridises the
decision tree- and instance-based learning methods, Rrl-tg and Rrl-rib,
in order to combine the efficiency of the former with the better performance
of the latter. Trendi has achieved better performance levels than either
method individually, although not as good as Rrl-kbr, and has an efficiency
level comparable to Rrl-tg.

3.2.3 Policy Learning

Apart from the development of the first RRL method, Q-rrl, another in-
novation contained in (Džeroski et al., 2001) was the development of policy


learning, or P-Learning. The authors recognised that optimal policies could


be expressed more naturally when a decision tree, called a P-tree, clustered
S × A into groups of optimal or suboptimal (s, a) pairs. In other words, a
P-tree effectively generalises over a policy, π : S → A, rather than a value

function, Q : S × A → R. As an example, consider the optimal policy


for the unstack task, which is to always move a block to the floor (see Fig-
ure 3.3a). Although the optimal policy is simple to express directly, a Q-tree
requires at least one branch for each distinct Q value occurring in the task
(Figure 3.3b). A P-tree, in contrast, labels each branch as optimal or non-
optimal (Figure 3.3c), allowing optimal state-action pairs to be clustered
together irrespective of their predicted Q-values; in this example, the entire
optimal policy corresponds to a single branch.

Figure 3.3: a) An optimal sequence of actions for the unstack task, and their associated
Q values. An hypothetical b) Q-tree and c) P-tree for the task. A P-tree is similar to a
Q-tree except that the Q values are replaced with the labels “optimal” or “non-optimal”.

P-Learning proceeds as an auxiliary process to Q-Learning. Given a Q


function, the P function is defined as:

    P(s, a) = 1 if a = arg max_{a′} Q(s, a′), and 0 otherwise,                (3.1)

where the values 1 and 0 indicate optimality and non-optimality respectively.


The P function thus merely indicates the greedy policy with respect to the Q
function. Given an estimate of a Q function as a Q-tree derived from samples
of (s, a) pairs and associated rewards, a P-tree is induced over the (s, a) pairs
from the P function in a process called P-rrl. Obviously, the correctness
of the policy produced under P-Learning depends on the accuracy of the Q
function estimates.
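As a small illustration of equation (3.1), the following Python sketch labels each action in a state as optimal or non-optimal given a table of Q estimates. The dictionary-based Q table, the state and action names, and the function name are illustrative assumptions, not part of P-rrl itself.

    def p_function(q, state, actions):
        # Equation (3.1): an action is labelled 1 (optimal) if it attains the
        # maximum Q estimate in this state, and 0 (non-optimal) otherwise.
        best = max(q[(state, a)] for a in actions)
        return {a: 1 if q[(state, a)] == best else 0 for a in actions}

    # Hypothetical Q estimates for two actions in some state s0:
    q = {("s0", "mv_fl(a)"): 1.0, ("s0", "mv_fl(b)"): 0.9}
    print(p_function(q, "s0", ["mv_fl(a)", "mv_fl(b)"]))
    # {'mv_fl(a)': 1, 'mv_fl(b)': 0}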

A benefit of P-Learning is that it produces policies which may scale to larger


versions of the training task better than those produced under Q-Learning.
For example, Džeroski et al. (2001) show that after training on blocks world
with n blocks, the policy represented by the P-tree generalises to blocks
worlds with an arbitrary number of blocks. In contrast, the policy under
the Q-tree is limited to the n block environment. Driessens et al. (2001)
have reported similar results. The significance of P-Learning is, thus, that
it can solve large tasks which are intractable for Q-Learning. Learning first
proceeds on a small version of the task, after which the resulting policy is
transferred to the larger version with minimal or no retraining.

Inspired by this approach, Cole et al. (2003) developed a system which



combined Q-Learning and P-Learning with an incremental version of the


higher-order logic decision tree learner, Alkemy (Lloyd, 2003; Ng, 2005).
The system was successfully applied to generalisations of the blocks world
onab task, in which multiple pairs of blocks must be correctly placed. In one
task the system starts with the goal of correctly placing 5 pairs of blocks.
It begins training in a world with 5 blocks and is then presented with an
extra block and pair at constant intervals throughout training. The aim is
to handle the extra block and pair with minimal retraining. This task is
more complex than those of Džeroski et al. (2001) and demonstrates the
potentially powerful nature of abstraction in relational languages.

However, note that although Cole et al.’s system has an architecture that
supports dynamic clustering, only limited use was made of the ability in the
reported experiments. In particular, the Q-tree was statically organised us-
ing an heuristic, with dynamic clustering being reserved for the P-Learning
phase. In contrast, the experiments in Džeroski et al. (2001) allowed gener-
alisation to occur in both the Q-Learning and P-Learning phases.

In summary, P-Learning is performed as an auxiliary process to Q-Learning


with the purpose of exploiting the powerful abstractive qualities of first-order
(or higher-order) logic. By generalising over π : S → A rather than the value
function, Q : S × A → R, generalisations can take a more unconstrained
form. Policies resulting from P-Learning have been shown to scale, with
little or no retraining, to larger versions of the training task.

3.2.4 Policy Driven Approaches

Like P-Learning, the two systems considered in this section generalise over
a policy rather than a value function; that is, over π : S → A rather
than Q : S × A → R. Thus, as with P-Learning, both systems avoid the
constraints of generalising over Q.

The first system is based on approximate policy iteration (API) (Bertsekas



and Tsitsiklis, 1996). The API technique modifies policy iteration (see Sec-
tion 2.1.2, page 23) so that the value function calculation or estimate of a
representative set of states, S̃ ⊂ S, is generalised over the whole of S us-
ing function approximation.4 Fern et al. (2004a) adapt API for RRL, but
instead of employing value function approximation, they use an abstract
policy in the form of a relational decision list. Their method works as fol-
lows. Like the policy iteration algorithm (see Figure 2.3), the procedure
essentially loops over two steps: policy evaluation and policy improvement.
During the policy evaluation step, a set, D, is constructed, consisting of a
tuple ⟨s, π(s), Q̂(s, a1 ), . . . , Q̂(s, am )⟩ for each s ∈ S̃, where π is the current
policy and Q̂(s, ai ) is an estimate of Q(s, ai ) as calculated by policy rollout.
To compute Q̂(s, ai ) by policy rollout involves:

1. Generating n trajectories of length h starting from state s; each tra-


jectory begins by executing action ai for the first action and then uses
policy π for the subsequent actions.

2. Summing the discounted rewards over each individual trajectory; re-


wards are generated by a heuristic function.

3. Averaging the summations over the n trajectories.
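The three steps above can be condensed into a short sketch. The Python code below is a schematic reading of policy rollout under stated assumptions rather than the implementation of Fern et al. (2004a): the simulator object sim and its step method, which returns the next state and a reward (a heuristic reward in their setting), are assumed interfaces.

    def rollout_q(sim, s, a_i, policy, n, h, gamma):
        # Estimate Q-hat(s, a_i): average the discounted return of n
        # trajectories of length h that begin with a_i and then follow policy.
        total = 0.0
        for _ in range(n):
            state, action, ret = s, a_i, 0.0
            for t in range(h):
                state, reward = sim.step(state, action)  # assumed simulator call
                ret += (gamma ** t) * reward             # step 2: discounted sum
                action = policy(state)                   # step 1: follow pi after a_i
            total += ret
        return total / n                                 # step 3: average over n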

In the policy improvement step, a new policy π′ is calculated which is greedy


with respect to the Q̂(s, a) values found during the evaluation of π. However,
because of the abstract formulation of the policies, only an approximation of
a greedy policy is obtained. Given an example ⟨s, π(s), Q̂(s, a1 ), . . . , Q̂(s, am )⟩
in D, the Q-advantage of taking action a instead of action π(s) in state s is
∆(s, a) = Q(s, a) − Q(s, π(s)). The value of an abstract rule is then the num-
ber of examples in D where the rule fires plus the cumulative Q-advantage
over those examples. A decision list representing π′ is incrementally built
that maximises the value of its rules over D using a supervised learning
technique described by Yoon et al. (2002).

4 Approximate policy iteration should not be confused with the modified policy iter-
ation method discussed in Section 2.1.2. Unlike API, modified policy iteration computes
a separate estimate for each s ∈ S and thus does not perform generalisation.

Like P-Learning, the method computes a policy rather than a value func-
tion and thus may generalise in a more unconstrained fashion. Unlike P-
Learning, however, it avoids the need to maintain an approximate value
function in an intermediate layer. The method is elegant and powerful, pro-
ducing positive results on blocks world tasks that are substantially more
complex than those considered by any other RRL method. However, the
use of policy rollout does require an unconstrained simulator—the ability
to generate a trajectory for an arbitrary policy and initial state—and an
heuristic function for estimating the value of a state.

Another policy driven system is Grey (Muller and van Otterlo, 2005). The
system evolves a population of policies where each policy is a relational
decision list. Each individual policy is assessed over a number of randomly
generated instances of a task and assigned a fitness value based on the total
reward obtained, the number of time steps taken, and the number of rules
in the policy. Individuals with high fitness are reproduced using crossover
and mutation to derive a new generation of policies. This new generation
is assessed and reproduced to obtain another generation, and so on, until
some terminal condition is reached. No detailed results have been reported
for Grey as yet.

3.2.5 Other Dynamic Methods

We now consider two RRL methods that dynamically generalise over the
value function, but which do not use temporal difference-like methods to
compute value function estimates. The first system, SVRRL (Sanner, 2005),
is based on statistical methods that restrict it to undiscounted, finite-horizon
(episodic) tasks with a single terminal reward of success or failure. The value
function is represented as a Bayes network where a node in the network
represents an atomic formula or a conjunction of atomic formulae. The

value of the network and its structure are updated online using a novel Bayes
network algorithm.

The second system (Walker et al., 2004) focuses on the problem of predicting
Qπ for a given policy, π, rather than on computing an optimal policy, π ∗ .
Generalisation is performed by estimating Qπ as a weighted combination of
features, where each feature is a conjunction of atomic formulae. Training
starts by sampling a set of (s, a, Qπ (s, a)) triples according to π, then a set of
features is stochastically generated from the samples, and finally regularised
kernel regression is applied to learn the weights. A number of Qπ estimates
are generated and then combined using ensemble learning. Since the method
requires Qπ (s, a) values to be provided as data, in experiments, they focus
on a deterministic, undiscounted, episodic task so that Qπ (s, a) values can
be computed from the reward on the final step.

3.2.6 Extensions and Related Methods

In this section we briefly describe approaches which extend or are closely


related to RRL. We begin with a model-based extension of the Carcass
system described in Section 3.2.2. Recall that Carcass takes an ordered
list of abstract decision rules, and estimates the value of each rule using a
variant of Q-Learning. A model-based extension of Carcass (van Otterlo,
2004), based on the prioritised sweeping algorithm,5 estimates the portion
of T and R corresponding to the system’s rules by using samples of the
states and rewards obtained during training. Evaluation of a given rule
can be performed using its associated model, reducing the need for further
sampling.
5 Prioritised sweeping (Moore and Atkeson, 1993) is an example of model-based rein-
forcement learning, a type of reinforcement learning where the system gathers experience
in order to build a partial, approximate model of the environment, which can be used as
an additional source of input for calculating Q values. In addition to prioritised sweep-
ing, a prominent example of model-based reinforcement learning is the Dyna framework
(Sutton, 1990, 1991a,b).

A related system, Qlarc (Croonenborghs et al., 2004), also builds a partial


abstract model from information sampled during training. The model identi-
fies some of the abstract states—typically those representing goal states—as
being “interesting”. The agent’s policy is then derived from a look-ahead
search for the maximum estimated Q value over the interesting states. A
third system which learns an abstract model of the environment is Trail
(Benson, 1996). The system uses an ILP algorithm to generalise a model,
classifying regions of S × A according to whether they can access other
specified regions, from sample trajectories through the environment. Both
systems make use of the abstractive features of first-order logic in order to
compactly represent the model.

Special mention should be made of the work by Lecœuche (2001), which


emphasises the comprehensibility motivation for expressing policies in rela-
tional form. The system applies a reinforcement learning algorithm to learn
an optimal policy, π ∗ , over ground states and actions. It then converts π ∗ to
an abstract policy, π̃ ∗ , for the purpose of simplification and understandabil-
ity. The system takes π ∗ in the form of a list of rules and treats it as a set of
examples for the relational decision list learner Foidl (Califf and Mooney,
1998). The system was applied to a real world task—automated dialogue
management—where it produced an abstract policy that was more compre-
hensible than the original ground policy, making a case for using relational
languages to transparently represent policies. Note that generalisation is
usually employed online during training in order to make the learning pro-
cess tractable, but here it was applied in a separate post-processing phase
in order to make a policy more transparent to humans,6 dialogue designers
in this case.

6 An extension was described, however, where an abstract policy was produced at regu-
lar intervals throughout the training of the reinforcement learning component. An abstract
policy was derived from the current ground policy, which was used to guide subsequent
exploration of the agent and was found to reduce the amount of training required to find
an optimal policy.

Some reinforcement learning methods have used representations that are in-
termediate between propositional and first-order logic. For example, deictic
representations have been combined with Q-Learning in the hope of achiev-
ing some of the abstractive power of first-order logic without the associated
algorithmic complexity. The approach has been applied to blocks world
with mixed results: Whitehead and Ballard (1991) have reported success,
but Finney et al. (2002a,b) have found its performance disappointing.

Moving away from reinforcement learning, we now consider relational ex-


tensions to dynamic programming. These can be divided into methods that
are exact (Kersting et al., 2004; Hölldobler and Skvortsova, 2004; Groß-
mann et al., 2002; Boutilier et al., 2000) or approximate (Karabaev and
Skvortsova, 2005; Sanner and Boutilier, 2005; Gretton and Thiébaux, 2004).
Exact methods use deductive principles to create error free generalisations
over the value function, but are computationally very expensive. Approxi-
mate methods avoid the computational expense of exact methods by induc-
ing generalisations instead. However, since the generalisations may not be
completely accurate, they can potentially produce non-optimal policies.

There are also supervised learning systems which have been specifically de-
signed to learn policies for relational MDPs or planning tasks similar to
relational MDPs (Yoon et al., 2002; Martı́n and Geffner, 2004; Khardon,
1999). The general approach is to induce a policy from a set of examples
consisting of (s, a) pairs belonging, ideally, to an optimal policy for the
task. A typical representation device for the policy is a decision list of gen-
eral rules expressed in a relational language. Since the rules of the policy
do not have to cluster the (s, a) pairs according to the value function, this
approach should inherit the advantages of other policy driven methods, such
as the ability to handle scaled up versions of the task without retraining.
The training examples could be generated from optimal policies produced by
applying dynamic programming or an existing planning algorithm to small
versions of the task.

Finally, we list some additional work which:

• Investigates the use of guidance techniques in conjunction with RRL,


such as human generated trajectories through the state-action space
(Driessens and Džeroski, 2004; Morales and Sammut, 2004), succes-
sively increasing the level of difficulty of the task (Fern et al., 2004b),
and discovering guidance heuristics (Yoon et al., 2005).

• Hierarchically decomposes the task and applies RRL methods to the


subtasks (Roncagliolo and Tadepalli, 2004; Aycinena, 2002; Driessens
and Blockeel, 2001).

• Extends RRL formalisms and methods to multiagent settings, such as


zero-sum Markov games (Finzi and Lukasiewicz, 2005, 2004a,b) and
collaborative agents (Croonenborghs et al., 2006; Letia and Precup,
2002).

3.3 Dimensions of RRL

The previous section surveyed reinforcement learning systems that make


use of the abstractive mechanisms of relational representations. The meth-
ods that we looked at can be divided along several dimensions, which are
described below. These dimensions are particularly useful for understand-
ing RRL methods; however, note that they apply to reinforcement learning
methods in general.

• Static or dynamic generalisation. The systems in Section 3.2.1 evalu-


ate static generalisations, which must be user-engineered or otherwise
developed prior to training; the systems in sections 3.2.2–3.2.5 auto-
matically create and modify generalisations dynamically throughout
the course of training.

• Model-based or model-free methods. Model-based methods rely on



knowledge of the transition and reward functions, T and R, and in-


clude the following systems: the model-based reinforcement learning
systems, Carcass (page 61) and Qlarc (page 62), which supplement
online learning with simulations over known or estimated portions of T
and R; the dynamic programming systems (page 63), which compute
their policy directly from T and R; and the API method of Fern et al.
(2004a) (page 59), which relies on an unconstrained simulator and can
therefore be considered to be model-based. Model-free methods, on
the other hand, avoid using T and R. They include most of the other
systems discussed above.

• Exact or approximate generalisations. The majority of the systems


discussed perform generalisation using inductive processes, which are
subject to error and are therefore approximate. The exceptions are
the dynamic programming systems that deduce generalisations (see
page 63). Because deduction is truth preserving, their generalisations
are exact. The static systems do not perform generalisation themselves
but, rather, evaluate user-engineered generalisations, which can be
either exact or approximate.

• Policies computed analytically or through experience. Under the ana-


lytic approach, all information required to compute the policy is fully
known in advance; when using experience, information is obtained
through interaction with the environment. The dynamic programming
algorithms take the former approach, while the reinforcement learning
methods, including the model-based methods, take the latter. Experi-
ence may either be derived from a real environment or from simulation
via a model of the environment.

• Policies represented directly or inferred from a value function. Most


systems took the latter approach with P-Learning (page 55), the pol-
icy driven systems (page 58), and the supervised learning systems
(page 63) being the exceptions. Note, however, that P-Learning does
rely on learning a value function as an intermediate step. Representing

policies directly lessens the constraints placed on the generalisations


and may allow more compact and transferrable policies to be learnt.

3.4 Discussion

Relational reinforcement learning is a young field and no doubt many inno-


vative approaches will be developed in the immediate future. Here we wish
to focus, however, on the prospects for relational reinforcement learning as
envisaged and realised in the seminal work of Džeroski et al. (1998a, 2001).
Although the original system, Q-rrl, had methodological shortcomings—
notably, the need to maintain a complete history of the states, actions, and
rewards experienced during training—it possessed the following desirable
characteristics:

• addressed MDPs generally: it did not place constraints on the type of
MDP that could be considered;

• model-free learning: the environment's dynamics could be unknown;

• dynamic generalisation: the system automatically generated its own
generalisations.

These characteristics are essential for realising the full potential of the rein-
forcement learning framework and without them, a system limits its prac-
tical significance. In subsequent work, Driessens and others addressed the
methodological shortcomings of Q-rrl (Driessens et al., 2001) and explored
some alternative approaches (Driessens and Ramon, 2003; Gärtner et al.,
2003a; Driessens and Džeroski, 2005), all of which possess the above charac-
teristics. However, apart from that work very few realised RRL systems—
perhaps only the system of Cole et al. (2003)—possess the above three char-
acteristics in combination.

Table 3.4: Comparison of several RRL systems including Foxcs based on comprehensi-
bility and required domain knowledge.

System     Comprehensibility   Domain Knowledge

Foxcs      high                hypothesis language
Rrl-tg     high                hypothesis language
Rrl-rib    low                 distance metric
Rrl-kbr    low                 kernel function
Trendi     intermediate        hypothesis language & distance metric

The approach developed in this thesis also possesses the characteristics men-
tioned above. For this reason, in terms of the aims of the system, we situate
the current work as being most closely related to the family of systems devel-
oped by Driessens and others (Driessens et al., 2001; Driessens and Ramon,
2003; Gärtner et al., 2003a; Driessens and Džeroski, 2005). Methodologi-
cally, Driessens’ systems are all closely related. Each contains a Q-Learning
component combined with a generalisation mechanism. The primary differ-
ence between them lies in the generalisation mechanism: Rrl-tg (Driessens
et al., 2001) contains a decision tree builder; Rrl-rib (Driessens and Ra-
mon, 2003), an instance-based component; Rrl-kbr (Gärtner et al., 2003a),
a Gaussian process; and Trendi (Driessens and Džeroski, 2005) a hybrid
decision tree, instance-based algorithm. The current work, being based on
Xcs, can be viewed in a similar way: it combines a rule-based Q-learning
component with an evolutionary approach to generalisation.

It is interesting to compare the current work to Driessens’ systems. We


consider three dimensions: comprehensibility, the type of domain knowl-
edge required, and how generalisations are represented; performance and
efficiency comparisons will be made in Chapter 7. Comprehensibility refers
to the level to which the generalisations produced by the system can be
understood by humans. The type of domain knowledge refers to the form
that required domain specific knowledge takes; all the systems require some

form of domain specific knowledge, which is usually provided by the user of


the system.

Comparisons based on the first two dimensions are summarised in Table 3.4.
As rule- and tree-based systems, Foxcs and Rrl-tg produce readable struc-
tures and so have a high level of comprehensibility. At the other end of the
spectrum, Rrl-rib and Rrl-kbr produce predictions that are not trans-
parent to humans. Trendi, as a hybrid system, lies somewhere between
Rrl-tg and Rrl-rib.

Regarding domain knowledge, Foxcs and Rrl-tg need an appropriate hy-


pothesis language in order to provide the systems with elements for con-
structing their rules and trees; Rrl-rib and Rrl-kbr rely on distance met-
rics and kernel functions respectively in order to compute similarity; and
Trendi, which requires the most domain specific information, takes both
an hypothesis language and a distance metric. In first-order logic, it is per-
haps easier to define an appropriate rule language than a distance metric or
kernel function. For example, Driessens (2004) was unable to apply Rrl-
rib and Rrl-kbr to two of his three test environments due to difficulty in
defining appropriate distance metrics and kernel functions for the domains;
Rrl-tg, in contrast, had no such problems.

Another notable difference between the systems concerns how generalisa-


tions are represented. Foxcs and Rrl-tg use the abstractive qualities of
first-order logic itself, such as variables; Rrl-rib and Rrl-kbr rely in-
stead on distance metrics and kernel functions over first-order logic; and
Trendi, as a hybrid, utilises both approaches. A detailed comparison be-
tween these approaches would be interesting, but unfortunately it is beyond
the scope of this thesis. Note however, that in the context of ILP, meth-
ods based on distance metrics and kernel functions intuitively appear to be
more naturally suited to data having a significant numerical component. For
RRL, the ultimate significance of the different approaches is an open issue.
However, it is yet to be shown that Rrl-rib and Rrl-kbr can discover the

kinds of policies learnt by Rrl-tg under P-Learning, such as the optimal


policy in Figure 3.3c.7

In summary then, Foxcs belongs, along with Driessens’ systems, to a small


but important group of RRL systems that support model-free learning, gen-
eralise dynamically, and do not restrict the MDP model. In relation to those
systems, Foxcs produces hypotheses at the upper end of comprehensibility,
takes domain knowledge in a form that is at least as convenient if not easier
to define than the other approaches, and exploits the full abstractive power
of first-order logic. For these reasons, we believe that the approach taken
by Foxcs is significant enough to the field of RRL to warrant investigation.

3.5 Summary

In this chapter we focussed on the field of relational reinforcement learn-


ing: placing the field in context to others, surveying existing methods, and
presenting several dimensions for categorising RRL methods. The survey
has allowed us to situate Foxcs in relation to other RRL systems and to
argue for the system’s significance based on a list of characteristics that it
displays.

7 Knowing an optimal policy, one could perhaps reverse engineer an appropriate metric
or kernel. This is not a practical approach though, as it presupposes the solution.
Chapter 4

The XCS Learning Classifier System

“The theory of evolution by cumulative natural selection is the


only theory we know of that is in principle capable of explaining
the existence of organized complexity.”
—Richard Dawkins

The previous chapter examined existing approaches to RRL; in this chapter


we begin to focus on the methodology adopted by Foxcs. Recall that Foxcs
adopts the learning classifier system (LCS) framework and in particular de-
rives from the Xcs system (Wilson, 1995, 1998); the purpose of this chapter
is to lay out the essentials of Xcs. The LCS framework first appeared in
the 1970s (Holland, 1976), predating the emergence of the field of reinforce-
ment learning in the 1980s. Several noteworthy implementations followed
(including Holland and Reitman, 1978; Goldberg, 1983; Wilson, 1985, 1987;
Robertson and Riolo, 1988; Booker, 1989; Dorigo and Sirtori, 1991; Wilson,
1994), most of which can be characterised methodologically as a kind of
rule-based temporal difference learning system combined with a genetic al-
gorithm. However, it was the development of Xcs in the mid-1990s (Wilson,


1995) which realised the potential of the framework for solving MDPs and
which initiated an expansion of interest in the area (dubbed the “Learning
Classifier System Renaissance” by Cribbs and Smith, 1996). In contrast to
earlier systems, the genetic algorithm in Xcs used accuracy-based fitness,
overcoming the problem of strong overgeneral rules (Kovacs, 2001) that had
hindered the earlier strength-based systems. The Xcs system has subse-
quently become the most widely investigated and well documented system
in the field of LCS.

The purpose of this chapter is to give an introduction to the Xcs system,


which is the methodological basis for our approach to relational reinforce-
ment learning developed in the next chapter. We start by giving a detailed
description of the algorithmic processes of Xcs (Section 4.1). We then con-
sider some important aspects of the system design, its accuracy-based fitness
(Section 4.2) and its system biases (Section 4.3). Next, Xcs systems featur-
ing alternative rule languages are mentioned (Section 4.4). An independent
section giving some background on genetic algorithms follows (Section 4.5).
Finally, the chapter concludes with a summary (Section 4.6).

4.1 The XCS System

The Xcs system, originally developed by Wilson (1995), was the result of
a line of investigation that, beginning with Zcs (Wilson, 1994), sought to
simplify the algorithmic complexity of LCS systems, which over the years
had steadily increased. However Xcs went beyond these aims, introducing
innovations of which accuracy-based fitness is perhaps the most significant.
Later, Wilson (1998) refined Xcs into what is now widely recognised as the
standard version of the system, although further improvements are known
(cf. Butz et al., 2003). In this section we describe Xcs procedurally, following
the Butz and Wilson (2002) specification except where otherwise noted.
Readers seeking a comprehensive treatment of Xcs are directed to the recent
book by Butz (2006).


Figure 4.1: Architecture of the Xcs system.

4.1.1 System Architecture

The Xcs system is a rule-based, evolutionary, and online Michigan system.1


As shown in Figure 4.1, the architecture of Xcs consists of four interacting
components or subsystems: a rule base containing a population of rules; a
production subsystem for deciding which action to execute given the current
state of the environment; a credit assignment subsystem for calculating the
value of rules; and finally, a rule discovery subsystem for generating new
rules. In the following sections, each of the above components is examined
in turn.

1 There are two types of LCS, Michigan and Pittsburgh systems. In a Michigan system,
each individual in the population is a rule representing a partial policy. In a Pittsburgh
system, on the other hand, the individuals in the population are rule sets each representing
a complete policy. While Michigan systems are generally trained online, the additional
overhead due to the need to evaluate multiple policies in Pittsburgh systems means that
they are usually trained offline. Recent work in Pittsburgh systems is exemplified by Gale
(Llorà, 2002; Bernadó et al., 2002).

4.1.2 The Rule Base

The rule base, denoted [P], contains a population of condition-action rules.2


Each rule represents a policy fragment and the entire set of rules taken
together plus logic for resolving conflicts between them constitutes the sys-
tem’s policy for interacting with the environment. In representing the policy,
rules perform a generalisation role as the number of rules in [P] typically
satisfies |[P]| ≪ |S × A|.

Rule Syntax and Semantics

In this thesis the following syntax is used to represent the condition and
action of an individual rule:

action ← condition

The condition and action are fixed length bit-strings over the alphabets
{0, 1, #} and {0, 1} respectively, that is condition ∈ {0, 1, #}m and action ∈
{0, 1}n . For example, two instances of valid rules over {0, 1, #}3 and {0, 1}2
are 00 ← 010 and 11 ← 0##. Under this bit-string paradigm, the state
and action spaces are assumed to be a subset of the binary strings, that is
S ⊆ {0, 1}m and A ⊆ {0, 1}n ; typically S and A are derived by mapping a
feature space pertaining to the specific environment into a space of binary
strings. It will be convenient to have a shorthand notation to represent
conditions and actions in formulas: given a rule j, the notation Cj and Aj
denotes the condition and action of j respectively.

The semantics of a rule is implemented by the matching operation. The


matching operation is defined as follows.

Definition 3 (Matching) Given a rule j with condition Cj ∈ {0, 1, #}m
and a state s ∈ S ⊆ {0, 1}m , then j is said to match s if and only if the
string Cj matches the string s precisely, symbol for symbol, except that the
"don't care" symbol, #, can match either 0 or 1.

2 Note that by convention, rules in [P] are termed "classifiers" in the LCS literature.
We prefer to just use "rule" in order to emphasize the rule-based nature of the system, and
thus its connection to other rule-based systems such as rule-based RRL and ILP systems.

For example, the rule 10 ← 111 matches the state 111 but not 010 or any
other state s ≠ 111. The # symbol, also called the "don't care" symbol,
supports generalisation by allowing a single condition to match multiple
states, for example the condition string 10## generalises over four states
(i.e. 1000, 1001, 1010 and 1011). A rule’s action does not contain the #
symbol, thus it does not generalise over A.
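To make Definition 3 concrete, the following Python sketch implements the matching operation over the ternary condition alphabet; the function name is an illustrative choice, not part of the Xcs specification.

    def matches(condition: str, state: str) -> bool:
        # A condition over {0, 1, #} matches a binary state symbol for symbol,
        # with # ("don't care") matching either 0 or 1 (Definition 3).
        return len(condition) == len(state) and all(
            c == '#' or c == s for c, s in zip(condition, state))

    assert matches("10##", "1011")       # 10## generalises over four states
    assert not matches("111", "010")     # fixed bits must agree exactly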

The Rule’s Parameters

The rules also keep track of various estimates. To this end, each rule contains
the following three parameters:

1. The prediction p ∈ R, which estimates the expected payoff received


from executing the rule.

2. The error ε ∈ R, which estimates the average absolute difference be-


tween p and the actual payoff received.

3. The fitness F ∈ R, which estimates the accuracy (an inverse function


of the error) of the rule.

The payoff referred to above is a long term measure of reward and is best
thought of as a being analogous to the optimal value function, Q∗ . A de-
tailed comparison between Xcs and Q-Learning is provided by Kovacs (2002,
section 6.1); we note here that in the case where there is a separate rule for
each (s, a) pair then p is indeed an estimate of Q∗ (s, a).

Each rule also contains an additional four auxiliary parameters:

1. The experience, exp, is a counter recording the number of times the


rule has been updated.

2. The niche size, ns (also commonly called the action set size), is used
to help balance resources across different portions of S × A.

3. The timestamp, ts, is used to determine when to trigger evolutionary


operations.

4. The numerosity, n, is a technique that allows rules whose condition


and action are identical to be represented by a single rule.

The function of the numerosity parameter is elaborated below; the function


of the other three parameters is best described in appropriate parts of the re-
maining system description. Note that parameters will often be subscripted
throughout the remainder of the thesis in order to indicate the rule with
which they are associated, for example pj refers to the prediction of rule j.

Numerosity

Throughout the course of the evolutionary process it is quite likely for an


individual rule to be generated that contains the same condition and action
as another rule already present in the rule base. Rather than storing the new
rule separately, it is incorporated into the existing rule, which is achieved
by discarding the new rule and incrementing the existing rule’s numerosity.
Similarly, whenever a rule is selected for deletion, the numerosity of the
selected rule is decremented and only when its numerosity drops to zero is
the rule actually removed from the rule base. Thus, a rule’s numerosity
indicates the number of “virtual” rules that it represents. This distinction
between rules and virtual rules has led to the terminology of macro-rule and
micro-rule.

Due to the use of numerosity there are potentially two units of measure for
the size of the rule base, one is in terms of the macro-rules and the other is in
terms of the “virtual” or micro-rules. The latter is used herein except where
otherwise noted. The size of the rule base, |[P]|, in terms of micro-rules is

given by the formula:


|[P]| = \sum_{j \in [P]} n_j \le N, \qquad (4.1)

where N is a user set constant indicating the maximum number of micro-


rules which the rule base can hold. The numerosity parameter is not in-
tended to affect the logical operation of the system; rather, its purpose is
to simplify rule manipulation and increase system efficiency. Efficiency is
improved because each macro-rule, j, replaces nj micro-rules, reducing the
number of matching operations from nj to 1.
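The role of numerosity and equation (4.1) can be sketched as follows. The Rule class, its field names, and its default parameter values are assumptions made purely for illustration (they are reused in later sketches in this chapter), not the system's data structures.

    from dataclasses import dataclass

    @dataclass
    class Rule:
        condition: str
        action: str
        n: int = 1        # numerosity: number of "virtual" micro-rules represented
        p: float = 10.0   # prediction (placeholder initial value)
        eps: float = 0.0  # error
        F: float = 0.01   # fitness (placeholder initial value)
        exp: int = 0      # experience
        ns: float = 1.0   # niche (action set) size estimate
        ts: int = 0       # timestamp

    def population_size(pop):
        # |[P]| measured in micro-rules, as in equation (4.1); must not exceed N.
        return sum(rule.n for rule in pop)

    def absorb_or_append(pop, new):
        # A duplicate condition-action pair is merged into the existing
        # macro-rule by incrementing its numerosity.
        for rule in pop:
            if rule.condition == new.condition and rule.action == new.action:
                rule.n += 1
                return
        pop.append(new)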

4.1.3 The Production Subsystem

The production subsystem, responsible for interacting with the environment


and for coordinating the rule update and rule discovery subsystems, runs
the procedure shown in Figure 4.2. The main body of the procedure, the
inner loop, is known as the operational cycle. Each step of the operational
cycle is elaborated below. It may be useful to consult the schematic diagram
of Xcs in Figure 4.3 while reading the following descriptions.

1. The first step is to construct the match set [M], which contains all the
rules in [P] that match the state st ∈ S for the current time step t.

2. Second, if the match set [M] is empty, which typically occurs at the
beginning of a training run, then the covering operation (see Sec-
tion 4.1.5) is called to produce a new rule. This step ensures that
[M] will contain at least one rule.

3. Third, for each a ∈ A advocated by a rule in [M], the system prediction


P (a) is computed as the fitness-weighted average prediction of the rules
advocating a. That is, [M] is partitioned into a number of mutually
exclusive subsets, [M]a1 , . . . , [M]an , according to the actions a1 , . . . , an
advocated by the rules in [M], and the average prediction weighted by

[P] := ∅
For each episode do
    [A]−1 := ∅
    For each step of the episode do
        1. Construct the match set [M].
        2. Trigger covering if [M] is empty.
        3. Calculate the system prediction for each action suggested by at least one rule in [M].
        4. Select an action ε-greedily over the predictions. Form the action set [A].
        5. Execute the selected action and obtain the reward.
        6. Call the credit assignment procedure on [A]−1 and, if it is the terminal step of the episode, [A].
        7. Conditionally trigger the rule discovery component on [A]−1 and, if it is the terminal step of the episode, [A].
        8. [A]−1 := [A]

Figure 4.2: The production subsystem algorithm of Xcs.

fitness is taken over each subset:


P(a) = \frac{\sum_{j \in [M]_a} p_j F_j}{\sum_{j \in [M]_a} F_j}. \qquad (4.2)

4. Fourth, an action at is selected in an ε-greedy fashion over P (a1 ), . . . , P (an ).3


That is, at is selected according to:

a_t = \begin{cases} \arg\max_{a \in \{a_1, \ldots, a_n\}} P(a) & \text{with probability } 1 - \epsilon, \\ \text{a random action from } \{a_1, \ldots, a_n\} & \text{otherwise.} \end{cases}

3 The ε-greedy method is now typical for action selection in Xcs (Butz and Wilson,
2002), although originally an alternative “explore-exploit” method was used (Wilson, 1995,
1998).


Figure 4.3: Schematic diagram of Xcs.



Some actions in A may not be advocated by any rule in [M] and will
thus not be considered for selection; however, there will always be at
least one action advocated due to the covering operation on step 2. As
part of this step the action set [A] is constructed, which is identical to
[M]at (a code sketch of this prediction and selection computation is given after step 8).

5. Fifth, the selected action, at , is executed and a reward, rt+1 , obtained.

6. Sixth,

(a) The credit assignment procedure is called on [A]−1 , the action set
formed on the previous time step, t − 1.
(b) If it is the terminal step of an episode then credit assignment is
called on [A].

7. Seventh,

(a) The rule discovery component is conditionally triggered on [A]−1 .


(b) If it is the terminal step of an episode then rule discovery is
conditionally triggered on [A].

8. Finally, if it is not the terminal step of an episode, [A] is assigned to


[A]−1 .
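As a concrete reading of steps 3 and 4 above, the following Python sketch forms the fitness-weighted system predictions of equation (4.2) and then selects an action ε-greedily. It uses the illustrative Rule fields from the earlier sketch and is not the Butz and Wilson (2002) pseudocode.

    import random

    def select_action(match_set, epsilon):
        # Step 3: system prediction P(a), the fitness-weighted average of the
        # predictions of the rules in [M] that advocate a (equation 4.2).
        prediction = {}
        for a in {rule.action for rule in match_set}:
            advocates = [r for r in match_set if r.action == a]
            prediction[a] = (sum(r.p * r.F for r in advocates) /
                             sum(r.F for r in advocates))
        # Step 4: epsilon-greedy selection over the advocated actions only.
        if random.random() < epsilon:
            chosen = random.choice(list(prediction))
        else:
            chosen = max(prediction, key=prediction.get)
        action_set = [r for r in match_set if r.action == chosen]  # [A] = [M]_a
        return chosen, action_set, prediction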

We now continue to the credit assignment subsystem.

4.1.4 The Credit Assignment Subsystem

This subsystem is responsible for updating most of the rule parameters.


In particular, it updates the prediction, p, error, ε, and fitness, F , as well
as the experience counter, exp, and the niche size, ns. The two remaining
parameters, the timestamp, ts, and the numerosity, n, are not updated here,
but rather in the rule discovery subsystem.

At each time step, t, except the first, an update is performed on [A]−1 , the
action set from the previous step, t − 1. Then, if t is the terminal step of

an episode, an update is also performed on the current action set, [A]. The
update procedure proceeds by first updating all the prediction estimates,
p, then the errors, ε, then the fitness values, F , and finally the experience
counter, exp, and the niche size, ns. We now describe each of the updates
in turn.

The Prediction and Error Parameter Updates

For each rule j in the action set, the prediction estimate, pj , is updated
according to the equation:

pj := (1 − α)pj + αρ (4.3)

where α is a learning rate satisfying 0 < α < 1 and ρ is the new target value
for the prediction. The target ρ is defined as:

\rho = \begin{cases} r_t + \gamma \max_{a \in \{a_1, \ldots, a_n\}} P(a) & \text{if updating } [A]_{-1} \\ r_{t+1} & \text{if updating } [A], \end{cases} \qquad (4.4)
where γ is a discount factor satisfying 0 ≤ γ < 1 (see Section 2.1.1),
a1 , . . . , an are the actions advocated by [M], and P is the system prediction
calculated on the current time step t. Note that rt+1 is the reward obtained
from executing action at and is obtained on time step t (not on step t + 1),
which is consistent with the notation presented earlier in Chapter 2.

The update represented by equations (4.3) and (4.4) is essentially a rule-


based version of the update performed on the Q(s, a) estimates in Q-Learning
(Section 2.2.2). The main difference is that here the prediction P (a) is av-
eraged over many estimates whereas in Q-Learning (including Q-Learning
with function approximation) there is only a single estimate for each (s, a)
pair. This similarity between the two updates leads us to consider the pre-
diction pj to be an estimate of the Q(s, a) values averaged over all the (s, a)
pairs which match the rule.

Next, the error parameters are updated according to:

εj := (1 − α)εj + α(|ρ − pj |) (4.5)



where εj is the error value associated with rule j, and α and ρ are as de-
scribed above. The error εj averages |ρ − pj | over consecutive updates. If
all (s, a) pairs which match rule j have the same Q(s, a) value then |ρ − pj |
tends to go to 0 in the limit as the number of updates approaches infinity,
although it also depends on the other rules in the population that affect the
value of ρ. A special group of rules are those whose error satisfies εj < ε0 ,
where ε0 is a small value close to zero, since any rule belonging to this group
is likely to represent a region over S × A that has a uniform Q value.
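A minimal Python sketch of the target and the prediction and error updates of equations (4.3)-(4.5), under the same illustrative Rule fields as before; following the order stated in the text, predictions are updated before errors.

    def update_target(reward, prediction, gamma, terminal):
        # Equation (4.4): the target rho is the reward plus the discounted
        # maximum system prediction, or just the reward on a terminal step.
        return reward if terminal else reward + gamma * max(prediction.values())

    def update_prediction_and_error(action_set, rho, alpha):
        # Equations (4.3) and (4.5).
        for rule in action_set:
            rule.p = (1 - alpha) * rule.p + alpha * rho
            rule.eps = (1 - alpha) * rule.eps + alpha * abs(rho - rule.p)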

The Fitness Parameter Update

Xcs uses accuracy-based fitness, perhaps the most significant feature dis-
tinguishing it from other LCS systems. This fitness measure is designed to
reflect the importance given to rules with low error, particularly rules where
εj < ε0 . The fitness is updated in a three step procedure, listed below.
First, the accuracy, κ, is calculated according to the equation:

\kappa_j = \begin{cases} 1 & \text{if } \varepsilon_j < \varepsilon_0 \\ a\,(\varepsilon_0/\varepsilon_j)^b & \text{otherwise,} \end{cases} \qquad (4.6)

where ε0 is a threshold controlling the tolerance for prediction error. Any


rule where εj < ε0 is considered to be fully accurate. Accuracy is an inverse
power function of error, thus as error decreases the accuracy increases, with
the rate of change being greater as εj approaches ε0 . The constants a and
b control the shape of the curve (see Figure 4.4a).

Next the relative accuracy, κ′, is calculated:

\kappa'_j = \frac{\kappa_j n_j}{\sum_{i \in [A]} \kappa_i n_i} \qquad (4.7)

where nj is the numerosity of rule j. Note that we use [A] here to represent
either [A] or [A]−1 , whichever set is being updated. The relative accuracy
normalises the accuracy over the interval [0, 1] with respect to the other
accuracies, thus eliminating differences in the accuracies which originate


Figure 4.4: The accuracy and relative accuracy metrics: a) the relationship between
accuracy κ and error ε (from Butz, 2006), and b) the effects of the relative accuracy
calculation (from Kovacs, 2002).

from rewards of different magnitude. As a result of normalisation, all relative
accuracies sum to one, that is, Σj∈[A] κ′j = 1. Hence, if the sum of the
accuracies is greater than 1 then the relative accuracies are less than their
corresponding accuracies. On the other hand, the relative accuracies are
larger than the accuracies if the accuracy sum is less than 1 (see Figure 4.4b).
One unfortunate consequence of relative accuracy is that if [A] contains a
single rule then its relative accuracy, κ′, is one irrespective of its accuracy, κ.

Finally, the fitness parameter, F , is updated:

Fj := (1 − α)Fj + ακ′j . (4.8)

Note that each rule's relative accuracy, κ′, is weighted by numerosity. Since


the fitness, F , is directly based on κ′, it too is weighted by numerosity. Con-
sequently, numerosity is never explicitly included in calculations involving
fitness, such as selection for reproduction, as it is already factored into the
fitness value. In other words, fitness is never multiplied by numerosity.
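The three-step fitness update of equations (4.6)-(4.8) can be sketched as follows; eps0, a, b and alpha stand for the constants ε0, a, b and the learning rate, and the rule fields are the illustrative ones used in the earlier sketches.

    def update_fitness(action_set, eps0, a, b, alpha):
        # Accuracy (4.6): fully accurate below the error threshold eps0,
        # otherwise an inverse power function of the error.
        kappa = {id(r): 1.0 if r.eps < eps0 else a * (eps0 / r.eps) ** b
                 for r in action_set}
        # Relative accuracy (4.7): normalise by the numerosity-weighted sum.
        total = sum(kappa[id(r)] * r.n for r in action_set)
        for r in action_set:
            rel = kappa[id(r)] * r.n / total
            # Fitness (4.8): move fitness towards the relative accuracy.
            r.F = (1 - alpha) * r.F + alpha * rel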

Other Parameter Updates

In addition to the above updates to the prediction, error, and fitness, most
of the auxiliary parameters are also updated: the niches size, ns, the expe-
rience, exp, and potentially the time stamp, ts. First, the niche size, ns, is
updated according to:

ns j := (1 − α)ns j + α|[A]|, (4.9)

where [A] again represents either [A] or [A]−1 . Note that |[A]| is measured
in terms of micro-rules as in equation (4.1). Next, the experience parameter
is incremented, exp j := exp j + 1. Finally, if rule discovery is triggered
(Section 4.1.5) then the time stamps of all rules in the action set are assigned
to the current time step:
ts j := t. (4.10)

The MAM Technique

In the above we have assumed that the learning rate, α, is a fixed constant.
In this section we discuss the Moyenne Adaptive Modifée (MAM) technique
(Venturini, 1994), which involves using a non-constant learning rate in rule
updates. The intention of MAM is to more quickly reduce parameter error
due to initialisation by using true sample averaging in the early stages of the
rule’s life. More specifically, the kth update to rule j’s prediction adjusts
the parameter according to:

p_j := \Big(1 - \frac{1}{k}\Big) p_j + \frac{1}{k}\,\rho

for the first n updates to rule j. From the n + 1 update onwards, the update
rule returns to:
pj := (1 − α)pj + αρ.

In other words, the substitution α = 1/k is made for the first n updates. Note
that here we are illustrating the technique using the prediction update, but

it applies everywhere where α is used, with the sole exception of the fitness
update.

We now give the MAM technique. For the kth update of rule j, the technique
sets α as follows:

\alpha = \begin{cases} 1/k & \text{if } k < 1/\beta \\ \beta & \text{otherwise,} \end{cases} \qquad (4.11)

where 0 < β < 1. Note that k can be derived from the experience parameter
expj . Assuming that exp j is incremented before all other updates are per-
formed, then the substitution k = exp j can be made in (4.11). The MAM
technique is applied to the updates for prediction p, error ε and niche size ns
given in equations (4.3), (4.5) and (4.9), but as noted above, fitness updates
do not employ MAM updates. That is, α = β in the fitness update (4.8).
We now consider another technique which uses a non-constant α.

Annealing the Learning Rate

Section 2.2.2 mentioned that for Q-Learning, α should be decayed in or-


der to ensure convergence of the algorithm. The close similarity of Xcs to
Q-Learning suggests that α should be decayed in Xcs too. In Q-Learning,
α can be decayed using the technique of annealing. On the kth update,
annealing would set α = 1/k. The procedure can be adapted to Xcs straight-
forwardly: on the kth update of rule j, annealing sets α according to:

\alpha = \frac{1}{k}. \qquad (4.12)

Like the MAM technique, annealing is applied when updating the predic-
tion, p, error, ε and niche size, ns, parameters using equations (4.3), (4.5)
and (4.9), but not when updating the fitness parameter, F , in equation (4.8).

Note that a comparison of equations (4.11) and (4.12) shows that α is iden-
tical under both methods when k < 1/β. Thus, despite the difference in
motivation between the two techniques, they operate in essentially the same

way; the only difference being that under MAM updates, α reverts to a
constant after some point, whilst under annealing it continually decays.
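The two schedules can be written as one small Python helper; the boolean flag used to switch between them is purely an illustrative device, not part of the Xcs specification.

    def learning_rate(k, beta, anneal=False):
        # MAM, equation (4.11): 1/k while k < 1/beta, then the constant beta.
        # Annealing, equation (4.12): 1/k on every update.
        if anneal:
            return 1.0 / k
        return 1.0 / k if k < 1.0 / beta else beta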

The MAM technique was incorporated into Xcs in the original version (Wil-
son, 1995) and has subsequently become part of the system’s standard spec-
ification (Butz and Wilson, 2002); however, the use of annealing in Xcs, is
uncommon. In fact, the author is not aware of any implementations of Xcs
which make use of annealing. Nevertheless, the significance of annealing for
convergence in Q-Learning suggests that it would be useful within Xcs also.

4.1.5 The Rule Discovery Subsystem

The final component is the rule discovery subsystem, which is responsible


for exploring a space of generalisations over S ×A and which principally con-
tains mechanisms for generating and deleting rules. In the following sections
we first consider the three mechanisms for generating rules: initialisation,
covering, and the genetic algorithm. After a rule is created it has many
parameters that must be set; therefore, we next consider the issue of initial-
ising a rule’s parameters. Finally, we describe the two deletion mechanisms:
rule deletion and subsumption deletion.

Initialising the Rule Base

Initialising the rule base with a population of rules allows the incorporation
of user knowledge into the system in the form of an approximate or partial
policy. Alternatively, an initial population may be randomly generated. One
scheme would be to generate N rules with randomly assigned action and
condition strings. A random action is typically generated by setting each
bit in the string to 0 or 1 with equal likelihood; and a random condition, by
setting each bit to # with probability P# , or to 0 or 1 otherwise with equal
likelihood.

Rule Creation Through Covering

In practice, it is uncommon to initialise the rule base with a population


of rules as just described; instead the technique of covering is usually used.
The rule base in Xcs typically starts out empty (although this is not strictly
necessary) and is populated with rules that are created by the covering
procedure whenever the system encounters a state that has no matching
rule. The covering procedure operates as follows:

1. First, it is triggered when the match set is empty; that is, when


|[M]| = 0.

2. Next, it produces a new rule with an action string that is randomly


generated and a condition string that is identical to the current state, s.

3. Then, it sets each bit in the rule’s condition to # with probability P# ,


which generalises the condition while ensuring that it still matches s.

4. Finally, it adds the new rule to [P], after which deletion occurs if
|[P]| > N .

New rules created in this way can populate an empty rule base, hence the
system does not require the user to supply an initial set of rules. Note that
when covering is triggered, it is usually relatively early in a run, when the
state space is still not completely covered by [P].
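A Python sketch of the covering operation described in steps 1-4 above, reusing the illustrative Rule class from earlier; how random actions are encoded and how deletion is invoked are left abstract.

    import random

    def cover(state, p_hash, actions):
        # Steps 2 and 3: copy the state into the condition, then generalise
        # each bit to # with probability p_hash so the rule still matches state.
        condition = ''.join('#' if random.random() < p_hash else bit
                            for bit in state)
        return Rule(condition=condition, action=random.choice(actions))

    # Step 1 triggers this when [M] is empty; step 4 then adds the new rule
    # to [P] and calls deletion if |[P]| > N.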

The covering procedure’s triggering mechanism may be a generalisation of


that described above. Covering is triggered when |[M]| < θmna , where θmna is
a constant, typically equal to |A|, rather than when |[M]| = 0. Alternatively,
in action set covering (Kovacs, 2002), a new rule is created for each a ∈ A
where |[M]a | < θmna , where typically θmna = 1. Under this last version, the
rule’s action string is set to a, and the procedure thereby ensures that all
actions are advocated by at least one rule.

Rule Creation Through Application of a Genetic Algorithm



The last way of generating rules is by applying a genetic algorithm (GA).


The procedure involves four processes: selection, crossover, mutation, and
potentially deletion. First, two rules are selected from the action set (which
we again denote as [A], although it can represent either [A] or [A]−1 ) pro-
portional to their fitness. Proportional selection is typically implemented as
roulette wheel selection. Under roulette wheel selection P r(j), the proba-
bility of selecting rule j ∈ [A], is given by:

Pr(j) = \frac{F_j}{\sum_{i \in [A]} F_i}. \qquad (4.13)

Proportional selection ensures that each rule in [A] has some chance of being
selected.

There is some evidence that an alternative selection procedure, tournament


selection, performs better in Xcs than the more typical roulette wheel se-
lection (Butz et al., 2005, 2003). Under this selection technique, a “tourna-
ment” is held between the rules in [A] where the rule with the greatest fitness
“wins” and is selected for reproduction. The tournaments do not contain all
the rules in [A] but rather consist of τ |[A]| micro-rules, where 0 < τ < 1 (a
typical value is τ = 0.4). The rules are selected randomly from [A], thus all
rules in [A] have some chance of participating in the tournament; although
only the top fitness-ranked rule participating in the tournament will actu-
ally reproduce. When selecting two rules, two independent tournaments are
held.

After selection, the two parent rules are copied and the copies are subjected, first,
to single point crossover and then, second, to either free or niche mutation
(see Section 4.5 for a description of the crossover and mutation operations).
Finally, rule deletion is applied if the size of [P] exceeds N after the two new
rules are added to the population.

The GA is triggered using a mechanism whose purpose is to allocate re-


sources (i.e. rules) “approximately equally to the different match sets”, thus
the mechanism operates so that “the rate of reproduction per match set per

unit of time is approximately constant” (quotations from Wilson, 1995).4


The implementation of the mechanism relies on the timestamp parameter,
ts. It checks if the average difference between the current time step and the
timestamp ts over all the rules in [A] is greater than a threshold θGA . More
precisely, the GA is invoked on time step t if:
\frac{\sum_{j \in [A]} (t - ts_j)}{|[A]|} > \theta_{GA}. \qquad (4.14)

If the GA is invoked then ts is set to the current time step t.

Parameter Initialisation

Whenever a new rule is created, initial values for its prediction, error and
fitness parameters must be assigned. This section lists the methods used to
determine the initial parameter values. The aim of these methods is to make
a reasonable guess at the true values in order to reduce the time needed to
evaluate the rules online; however, the system should be robust to fairly
arbitrary initial parameter values. An exception, noted by Kovacs (2002), is
that new rules should not be given a large initial fitness value, which might
enable them to influence action selection and reproduction before they are
evaluated properly. The following methods for setting the parameters are
given by Butz and Wilson (2002):

• For an initial population and for rules created by covering: The initial
prediction, error and fitness values are set to user supplied constants.
An alternative method which can be used in the case of covering is to
set the initial prediction and error to the mean values over the rule
base and the initial fitness to 10% of the mean fitness over the rule
base (Kovacs, 2002).
4 Note that Wilson is referring to match sets, not action sets; this is because the original
version of Xcs applied the GA to match sets rather than action sets. Subsequent versions
of Xcs starting from (Wilson, 1998) applied the GA to action sets; for these versions,
replacing “match set” with “action set” in the quotations above has the intended meaning.

• For rules created by the genetic algorithm: Rules created by the genetic
algorithm usually have two parents, and hence the initial prediction
and error are set to the mean of the parents' values and the initial
fitness to 10% of the mean of the parents' fitness. If there is only one
parent then the initial prediction and error can be set to the parent’s
values and initial fitness to 10% of the parent’s fitness.

Rule Deletion

As previously mentioned, the maximum size of the rule base is bounded


by N , a user supplied constant. If the size of the rule base exceeds N
after insertion then a rule is selected for deletion. The selection process
has two aims: to balance the resources (i.e. the number of rules) occurring
in the different action sets and to remove rules with low fitness. Selection
is implemented by assigning a deletion vote to each rule in the rule base
representing the rule’s probability of deletion. The deletion vote, dv, is
calculated as follows:

ns·n·F̄

(F/n) if exp > θdel and F < δ F̄
dv = (4.15)
 ns · n otherwise

where ns is the niche size estimate, n is the numerosity, F̄ is the average


fitness of the population, and θdel and δ are user set thresholds. The equation
assigns a value to dv proportional to ns · n, therefore rules which occur in
large action sets or which have a large numerosity have a greater probability
of deletion. In addition, if a rule is sufficiently experienced (exp > θdel )
and its fitness it is significantly less than the average fitness of the rule base
(F < δ F̄ ), then the deletion vote is increased by a factor which is the inverse
of its relative fitness, F̄ /(F · n).

When a rule is selected for deletion its numerosity n is decremented by 1.


If the numerosity falls to 0 then the rule is removed from the rule base.
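The selection-for-deletion step might look as follows in Python; the fields ns, n,
F and exp stand for the niche size estimate, numerosity, fitness and experience,
the computation of the population's average fitness F̄ is one plausible choice,
and the whole fragment is a sketch of equation (4.15) rather than a transcription
of any particular implementation:

    import random

    def deletion_vote(rule, mean_fitness, theta_del, delta):
        vote = rule.ns * rule.n
        if rule.exp > theta_del and rule.F < delta * mean_fitness:
            vote *= mean_fitness / (rule.F / rule.n)   # penalise low relative fitness
        return vote

    def delete_one(population, theta_del, delta):
        mean_fitness = sum(r.F for r in population) / sum(r.n for r in population)
        votes = [deletion_vote(r, mean_fitness, theta_del, delta) for r in population]
        chosen = random.choices(population, weights=votes, k=1)[0]
        chosen.n -= 1                       # decrement numerosity
        if chosen.n == 0:
            population.remove(chosen)       # the rule disappears from the rule base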

Subsumption Deletion

Subsumption deletion is a technique for biasing Xcs in favour of rules that


are more general without being less accurate. A rule j may subsume another
rule i if the following three criteria are met:

1. Rule j is sufficiently experienced. That is, expj > θsub where θsub is a
user set threshold.

2. Rule j has perfect accuracy. That is, εj < ε0 .

3. The condition of rule j is not less general than the condition of rule i.

The generalisation test is performed by comparing condition strings: Cj


must be the same as Ci except that it may contain # symbols where Ci has
a 0 or 1. For example a rule with condition 1##0 could subsume a rule with
condition 1000, 1010, 1100, 1110, 10#0, 11#0, 1#00, 1#10, or 1##0. When
rule j subsumes another rule i, the numerosity of j is incremented and i is
deleted.
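For the ternary language the generality test of criterion 3 reduces to a
position-wise comparison of the condition strings, as the following Python sketch
illustrates (the rules are also required to advocate the same action, which is
implicit when subsumption is tested within an action set):

    def is_not_less_general(cj, ci):
        # cj is at least as general as ci: at every position cj either
        # equals ci or carries the wildcard '#'.
        return all(x == '#' or x == y for x, y in zip(cj, ci))

    def could_subsume(rule_j, rule_i, theta_sub, eps0):
        return (rule_j.exp > theta_sub                        # criterion 1
                and rule_j.eps < eps0                         # criterion 2
                and rule_j.A == rule_i.A                      # same action
                and is_not_less_general(rule_j.C, rule_i.C))  # criterion 3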

There are two forms of subsumption deletion, GA subsumption and action


set subsumption. The first form tests any new rule created by the genetic
algorithm for subsumption by its parent rules. The second form is run
each time an action set is updated. The most general of the experienced
and accurate rules is found, and all rules occurring in the set are tested for
subsumption by it.

4.2 Accuracy-Based Fitness

Now that the computational processes of Xcs have been described it is useful
at this point to elucidate some aspects of the system design. We have already
noted that Xcs can be seen as a kind of rule-based Q-Learning system and
that the prediction calculation is a form of temporal difference learning, but

we have not yet discussed the GA and in particular the significance of the
fitness calculation. Why make fitness a function of accuracy and not some
other metric? Prior to Xcs, the use of accuracy-based fitness was uncommon5
and its incorporation into Xcs represented a major advance in the field,
leading to a deeper understanding of the LCS framework. Below we set out
some advantages of using accuracy-based fitness.

A Solution to the Problem of Strong Overgeneral Rules

Most LCS implementations prior to Xcs used strength-based fitness. These


systems had a “strength” parameter which was analogous to Xcs’s pre-
diction parameter, except that it was used for both prediction and fitness.
Unfortunately, strength-based fitness leads to a problem with strong over-
general rules (Kovacs, 2001, 2000). Before explaining the notion of a strong
overgeneral rule, we need to define an optimal rule, which is a rule that is
consistent with the optimal policy, π ∗ , and a sub-optimal rule, which is a rule
that is inconsistent, either partially or completely, with π ∗ . A strong over-
general rule is a rule which is sub-optimal, but receives such a high payoff
on average that the GA selects it for reproduction in favour of rules which
are optimal but whose average payoff is not as great. Strong overgeneral
rules are a problem because they can displace optimal rules from the rule
base.

In strength-based systems, the technique of default hierarchies (a method for


chaining rules together over successive time steps; see Riolo, 1988; Goldberg,
1989; Smith, 1991) was sometimes able to remedy the problem of strong
overgenerals, but at the cost of increasing the algorithmic complexity of the
system. Switching to accuracy-based fitness was a significant improvement
as it removes the problem of strong overgeneral rules altogether. Because
fitness is now a function of accuracy rather than strength, rules with a

greater average payoff can no longer displace rules with a lesser average
payoff except on the basis of accuracy. Avoiding the problem of strong
overgeneral rules is a principal motivation for using accuracy-based fitness.

5 Wilson (1995) gives an account of the use of accuracy metrics in LCS systems before
Xcs.

Complete Maps

Another important effect of accuracy-based fitness is that it tends to produce


a complete map of Q∗ , the optimal value function (Kovacs, 2001, 2000), so
long as the bound on the size of the population, N , is adequate for storing
the number of rules required to represent it. By a complete map it is meant
that for each (s, a) ∈ S × A there is at least one rule j which matches s,
and whose action is a, and who accurately predicts Q∗ (s, a) (i.e. j satisfies:
j matches s, Aj = a, pj = Q∗ (s, a), and εj < ε0 ). Since fitness is a function
of accuracy, there is a general pressure to reproduce accurate rules (here we
are simplifying matters by ignoring some of the effects of relative accuracy
on fitness, see page 82), even those which represent a sub-optimal policy
fragment. Thus, since selection favours all accurate rules more-or-less
equally, over time the tendency is for the population to contain
accurate rules for the entire Q∗ function.

Is it a waste of resources to keep a complete map of the Q∗ function? Learn-


ing just the portion of Q∗ which corresponds to the optimal policy π ∗ would
generally require fewer rules and would thus perhaps require less training.
However, it turns out that there is an advantage to learning accurate, sub-
optimal rules. Consider a situation where in the current state of the envi-
ronment there are only two actions, both of which lead to the termination
of the current episode. One action, a+ , results in a positive reward while
the other, a− , leads to a negative reward (such is the case, for example, in
a binary classification task, where there are only two classes). Imagine that
the system has no rules which accurately predict for a+ , but that it does
have rules which predict the negative reward associated with a− . In this
case, knowing which action not to take, a− , means the system can select the

optimal action, a+ . This is an advantage when it is very difficult to discover


an accurate rule for a+ . In general, unlike the previous example, knowing
accurate, sub-optimal rules is not sufficient for computing the optimal deci-
sion; but it will allow the system to avoid selecting the (sub-optimal) actions
associated with them.

4.3 Biases within XCS

Further insight into the design of Xcs can be gained by examining key biases
which influence the behaviour of the system and the mechanisms which
produce them. These biases attempt to provide an answer to the question
“How does the population of rules evolve over time?” An analytical answer
to this question is difficult as it relies on the complex interaction of several
stochastic processes (including selection, mutation, crossover and deletion;
additionally, the transition function may also be stochastic). Therefore,
rather than attempt to provide a quantitative analysis of the population
dynamics we will be content to identify qualitative features which influence
the constitution of the population.

4.3.1 The Generality and Optimality Hypotheses

The first account of population dynamics in Xcs was provided by Wilson


(1995) in the same paper that introduced the system. In his Generality
Hypothesis, Wilson (1995) identifies an important bias which promotes ac-
curate and maximally general rules in the population. Recall that an accu-
rate rule is a rule whose prediction is estimated to be within tolerance ε0 of
Q(s, a) for all (s, a) pairs aggregated by the rule. A maximally general rule
is a rule which is accurate but will lose accuracy if generalised (i.e. if any 0s
or 1s are changed to a #). The hypothesis states that:

Definition 4 (The Generality Hypothesis) In Xcs, training tends to
increase the proportion of rules in [P] that are accurate and maximally
general.

The hypothesis rests on the observation that reproduction occurs in action


sets, either [A] or [A]−1 , while deletion occurs over the whole rule base, [P].
Thus, in addition to its fitness, the probability that a particular rule repro-
duces depends on the probability of its occurrence in an action set, which
in turn depends on the level of generality of the rule.6 But the selection
of rules for removal occurs over the whole population and is thus unbiased
with respect to generality. The net result over time should be a tendency to
accumulate rules which are increasingly more general, the effect of selection
from [A] and deletion from [P], and which are also accurate, the effect of
accuracy-based fitness.7

An additional bias is identified in the Optimality Hypothesis (Kovacs, 1996,


1997), which states that:

Definition 5 (The Optimality Hypothesis) Given sufficient training,


Xcs forms a subpopulation, [O], containing a complete, minimal, and non-
overlapping set of rules.

The Optimality Hypothesis does not identify any mechanisms for produc-
ing [O] in addition to those given in the Generality Hypothesis; instead, it
says that these mechanisms are sufficient for generating [O] given sufficient
training. Empirical support for the hypothesis has been obtained by Kovacs
(1996, 1997) who observed the formation of [O] on multiplexor tasks having
up to 11 input bits.

6 The assumption here is that the more general a rule is, the more likely it is to occur
in an action set.
7 This explanation of the Generality Hypothesis has been influenced by Butz (2006).

4.3.2 Butz’s Evolutionary Pressures

Later, a number of evolutionary pressures were identified within Xcs which


account for key biases displayed by the system, including the bias described
in Wilson’s Generality Hypothesis. These pressures were originally described
by Butz and Pelikan (2001) and were subject to further investigation in
(Butz et al., 2003, 2004; Butz, 2006). The following five pressures were
identified:

1. Set pressure. Reproduction from [A] and deletion from [P] creates a
pressure towards rules with greater semantic generality.8

2. Fitness pressure. Accuracy-based fitness results in a pressure towards


rules with higher accuracy.

3. Mutation pressure. Mutation pushes towards an equal number of 0s,


1s and #s in each rule, assuming that each bit value has an equal
probability of being set under the mutation operation.

4. Deletion pressure. Deleting proportionally to the average size of the


action sets to which a rule belongs encourages equal distribution of
rules in each environmental “niche”. Factoring accuracy into the se-
lection calculation tends to remove inaccurate rules and thus also adds
an additional pressure towards higher accuracy.

5. Subsumption pressure. Subsumption provides a pressure toward rules


with greater syntactical generality and reduces the number of matching
operations required.

The first two pressures account for the bias identified in the Generality Hy-
pothesis while the other three describe additional biases. Butz et al. (2003)
provide an analysis of set pressure showing that the average expected level of
generality in [A] is indeed greater than the average level of generality in [P],
supporting the validity of the Generality Hypothesis. Unfortunately for
our purposes, the result depends on bit-string analysis and is not language
independent.

8 The semantic generality of a rule refers to its level of generality as determined by its
frequency of occurrence in action sets. Syntactic generality, on the other hand, refers to
the rule's level of generality as determined by its logical form. A rule's level of semantic
generality is clearly influenced by its syntactic generality, but also by the distribution of
states sampled during training.

4.3.3 Discussion

Now that we have identified key biases present in Xcs, we would like to
consider the effect that modifying Xcs for relational reinforcement learning
will have on them. In particular, we would like to show that changing the
rule language from bit-strings to first-order logic will not adversely disrupt
the biases. We are not, however, interested in considering the effect of any
other changes required to modify the system for RRL, since these changes
are secondary in the sense that they arise because the rule-language has
changed, and if a particular modification is disruptive then it is possible that
another can be designed. On the other hand, changing the rule language
is critical to achieving our intention: if Xcs supports bit-string rules only
then our intention cannot be realised.

Let us consider the Generality Hypothesis first. While analytical support
for the Generality Hypothesis (Butz et al., 2003) relies on bit-string
analysis, we can see no reason why the principles outlined in the Generality
Hypothesis would not hold for other rule languages. First, the mechanism
which reproduces from action sets and deletes from the overall population
is language independent. And second, intuitively, the relationship between
the probability of a rule occurring in an action set and its level of generality also
appears to be language independent. However, to provide a formal argument
of the second point for the case of first-order logic rules is difficult because
analysis of generality in first-order logic is much less straightforward than
it is for bit-strings. We nevertheless conjecture that in general Wilson’s
Generality Hypothesis will hold under alternative rule languages, including

first-order logic rule languages.

Since the Optimality Hypothesis is based on the same principles as the Gen-
erality Hypothesis, it should also hold under alternative rule languages. As
an aside, we note that in some circumstances the Optimality Hypothesis
and the Generality Hypothesis make conflicting predictions about the com-
position of the population, as the following example shows.

Example 2 Consider a task environment having state space S = {0, 1}2


and action space A = {1} and for which the Q function is:

Q(s, a) = 100 if s = 00, and Q(s, a) = 0 otherwise.

Under the standard ternary bit-string language there are six accurate rules
which can be formed, four specific rules: 1 ← 00, 1 ← 01, 1 ← 10, and
1 ← 11 and two general rules: 1 ← #1 and 1 ← 1#. Note that no single rule
can have a condition which represents {01, 10, 11}, the three states whose
Q-value is 0. For this task the Generality Hypothesis says that training
tends to push the composition of the population to:

1 ← 00 1 ← #1 1 ← 1#

as these are the maximally general rules which accurately represent the
Q function. On the other hand, the Optimality Hypothesis suggests that
training tends to produce one of the following two sets of rules:
1 ← 00 1 ← 01 1 ← 1#
1 ← 00 1 ← 10 1 ← #1
since they are the complete, minimal, non-overlapping rule sets which accu-
rately represent the Q function.
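The enumeration in the example can be checked mechanically; the following short
Python fragment counts the accurate conditions for action 1 under the Q function
given above:

    from itertools import product

    Q = {"00": 100, "01": 0, "10": 0, "11": 0}   # Q(s, 1) from Example 2

    def matches(cond, state):
        return all(c == '#' or c == s for c, s in zip(cond, state))

    accurate = [cond for cond in map("".join, product("01#", repeat=2))
                if len({Q[s] for s in Q if matches(cond, s)}) == 1]

    print(accurate)   # ['00', '01', '10', '11', '1#', '#1'] -- the six accurate rules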


The above example was contrived to show a task for which a population
containing only maximally general rules is not consistent with a popula-
tion containing only non-overlapping rules. Empirically, Kovacs (2002, sec-
tion 3.5.2) has observed that Xcs tends to squeeze out overlapping rules,

supporting the view that in practice Xcs follows the Optimality Hypothesis
when the two hypotheses disagree.

Returning to the main theme of this discussion, we now consider the effect
of the rule language on the evolutionary pressures identified by Butz. Of
these pressures, set, fitness, and deletion pressure are language independent,
while subsumption and mutation pressure rely on language dependent fea-
tures. The language dependent part of subsumption, however, is limited to
testing if a particular rule is a generalisation of another, and such a test is
usually possible, so the influence of the rule language on subsumption pres-
sure is generally negligible. Mutation pressure, on the other hand, largely
relies on operations which are dependent on representation. Thus, the bias
of the Xcs system towards accurate and maximally general classifiers is rel-
atively language independent, while the rule discovery processes, which rely
on mutation and other evolutionary variation operators, are the representa-
tionally dependent factor.

In conclusion, it appears that most of the biases discussed above are un-
affected by the choice of rule language, from which a case can be made
that Xcs is essentially a language neutral architecture. This analysis sug-
gests that, in principle, Xcs will not be adversely affected by changing the
rule language to first-order logic. It also suggests that other rule languages
could be used, and indeed, many alternative rule languages have already
been implemented, as we consider next.

4.4 Alternative Rule Languages

In addition to the above analysis, we can further satisfy ourselves that the
functioning of Xcs will not be impaired by changing the rule language by
considering the existence of previous work which has extended Xcs with
alternative rule languages. The LCS framework was originally conceived
to use bit-string rules, and the Xcs system follows this tradition. The

prevalence of bit-string rules within the LCS paradigm is due to the central
role that the GA occupies, as it is the principal mechanism for rule discovery.
However the LCS framework, and Xcs in particular, is flexible enough to
support other rule languages, providing that there are variation operations
available to act upon arbitrary rules in the new language in order to produce
variation. For instance, Xcs has been extended with rule languages over:

• continuous spaces (Wilson, 2000, 2002; Stone and Bull, 2003, 2005;
Butz, 2005; Dam et al., 2005; Butz et al., 2006; Lanzi and Wilson,
2006)

• fuzzy logic (Casillas et al., 2004)

• Lisp-like S-expressions (Lanzi, 2001, 1999b)

• multilayer neural networks (Bull and O’Hara, 2002)

Hence, alternatives to bit-string rules have been repeatedly, and effectively,


implemented before. However, although some rule languages have been
used which represent structural or relational aspects of data—in particular,
languages over S-expressions—the use of a rule language over first-order
logic has not been explored before in either Xcs or other LCS architectures.9
Below we look at four LCS systems whose rule languages are the most closely
related to first-order logic. Note that we have broadened the scope here to
include LCS systems in general and not just those based on Xcs.

Recall that a bit-string classifier expresses a condition as a string over the


ternary alphabet {0,1,#}. The # symbol supports generalisation by allowing
a single condition to match a range of inputs, for example the condition
string 10## generalises over four input values. However, as noted by Schu-
urmans and Schaeffer (1989), the # symbol cannot express relationships
between the values at different bits, for example it cannot express “the first
9 However, first-order logic has been combined with other evolutionary systems; see “Evolu-
tionary Computation in Inductive Logic Programming” on page 101.

Evolutionary Computation in Inductive Logic Programming

The earliest systems to combine evolutionary computation and representation


in first-order logic appeared in the mid 1990s. Many of these systems aimed
to extend the representational capability of genetic algorithms by interpreting
bit-strings as expressions in first-order logic. They include Glp (Osborn
et al., 1995), Regal (Giordana and Neri, 1995; Neri and Saitta, 1995),
G-Net (Anglano et al., 1997, 1998), and Dogma (Hekanaho, 1996, 1998).
The last three systems require the user to provide a “template” which defines
the exact structure of the rules which are evolved. The function of the
template is to facilitate the mapping between bit-strings and first-order logic
rules and allows conventional bit-string mutation and crossover operations
to be retained by the GA. However, the template in fact propositionalises
the representation, and is therefore perhaps an unsatisfactory technique
with respect to ILP. In particular, for ILP tasks, it is typical that the set
of instances belonging to a class do not exhibit uniform structure which
can be described by a single template. Additionally, in ILP, the structure
represented by the template is not assumed to be known by the user, rather
it is the task of the system to automatically discover it.

Related to the above systems is an approach described by Tamaddoni-Nezhad


and Muggleton (2003, 2001, 2000). The approach incorporates a GA into
Progol in order to replace complex clause evaluation with simple bitwise
operations, thus increasing the efficiency of the system. Unfortunately, like
the above template-based systems, the bit-string representation used by the
GA propositionalises Progol and thus reduces its representational power
(Mărginean, 2003).

. . . continued on page 102

and fourth bits are equal and the second and third bits are equal” (i.e. ABBA
where A and B are variables over bit values). It is this lack of expressive
power that limits the usefulness of bit-strings in relational domains. The
following systems all employ rule languages which can represent relational
concepts like ABBA.

Evolutionary Computation in Inductive Logic Programming


(continued)

Another approach is to abandon the bit-string GA in favour of increasing


the representational sophistication; that is, to evolve expressions directly in
first-order (or even higher-order) logic. This path is more appropriate for
ILP as it avoids the limitations of propositionalisation. Examples of this
approach include Steps (Thie and Giraud-Carrier, 2005), Ecl (Divina and
Marchiori, 2002), GeLog (Kókai, 2001; Fühner and Kókai, 2003), Siao1
(Augier et al., 1995; Augier and Venturini, 1996), and Glps (Wong and
Leung, 1995). Each of the systems, however, uses a somewhat different
representational device. For instance, Ecl and Siao1 evolve rule sets, Steps
evolves trees, and GeLog and Glps evolve forests of trees. Steps takes
representational sophistication to the extreme by employing higher-order
logic. Now that bit-strings are no longer used, new mutation and crossover
operations are required; given the diversity of representation in the sys-
tems, it is not surprising that a variety of new operations have been developed.

A recent survey by Divina (2006) reviews many of the systems described above.

The Vcs system (Shu and Schaeffer, 1989) attempted to solve the above rep-
resentational limitation of bit-strings by extending the bit-string alphabet
to support variables. Fixed length bit-strings over the alphabet could repre-
sent certain first-order logic rules if a mapping was provided for encoding the
predicates and variables as bit-strings. Furthermore, custom mutation and
crossover operators which handled the variable and predicate bit-strings as
units were adapted from the standard bit-string versions. However, encoding
first-order logic rules into bit-strings in fact propositionalises the representa-
tion, so it is not representationally equivalent to ILP or RRL systems, which
directly use first-order logic languages.

The Vcs system must have faced significant challenges during its implemen-
tation. At that time, the development of Xcs still lay ahead five years into
the future, so the significance of accuracy-based fitness was not yet under-

stood. Also, relational learning was still maturing, with the development of
many ILP algorithms as well as the non-monotonic setting10 yet to occur.
Thus, the concept behind Vcs may have been too ambitious for its time
and, as far as the author is aware, Vcs was unfortunately never completed.

Dl-cs (Reiser, 1999) was another LCS system that followed the path of
encoding first-order logic rules into bit-strings that support variables. Like
Vcs, it used an encoding from first-order logic to fixed length bit-strings,
and thus propositionalised the representation. Unlike Vcs, Dl-cs was suc-
cessfully implemented but empirical study showed that it performed disap-
pointingly, although some of its poor performance may be attributed to the
use of strength-based fitness and possibly to other innovations (it maintained
two separate populations for instance).

Two other LCS systems should be mentioned in this context, Xcsl (Lanzi,
2001, 1999b) and Gp-cs (Ahluwalia and Bull, 1999). These two systems
were inspired by genetic programming (Koza, 1992) and its utility for rule
discovery in LCS systems. Genetic programming is a style of evolutionary
computation with many similarities to genetic algorithms; the greatest dif-
ference being that the population in a genetic program consists of Lisp-like
S-expressions, instead of bit-strings, that are mutated and recombined us-
ing tree-based operations. The significance of employing S-expressions over
bit-strings is that, due to their recursive structure, they are more expressive.

Essentially, Xcsl and Gp-cs replace the GA component within the LCS
with a genetic program. Each rule is now partially represented by an S-
expression: in Xcsl the rule’s condition is an S-expression while its action
is a bit-string; in Gp-cs it is the reverse. A limitation of this dual rep-
resentation for actions and conditions is that variables cannot range over
the entire rule as they do in first-order logic. Regrettably, a comparison
between the representational characteristics of first-order logic and Lisp S-

expressions is well beyond the scope of this thesis, but we note that the two
formalisations are united under some higher-order logics (Lloyd, 2003).

10 The non-monotonic, or “learning from interpretations”, setting is particularly relevant
to reinforcement learning systems; see Appendix B.

4.5 Genetic Algorithms

This section gives some elementary background on genetic algorithms and


their use in Xcs. The field of evolutionary computation contains several
classes of parallel, stochastic search algorithms. One of these classes, genetic
algorithms (GAs), originates from the work of Holland (1971, 1975) and is
distinguished from other types of evolutionary computation by being loosely
based on a genetic metaphor that emphasizes the role of bit-string “chromo-
somes” and their recombination through crossover, an operation analogous
to sexual reproduction in biological systems. Goldberg (1989) gives a com-
prehensive introduction to GAs, while Bäck et al. (2000) provides a broader
treatment of evolutionary computation in general, including GAs.

The essential elements of a GA are:

• A population of individuals; each individual is represented by a chro-


mosome, typically a binary string, that can be mapped to a candidate
solution for the search.

• A fitness function, which evaluates the quality of individuals with re-


spect to the task at hand.

• Variation operations, typically crossover and mutation, which generate


new individuals from the chromosomes of existing ones.

Many types of crossover and mutation operations have been developed for
GAs; here we mention two, single point crossover and free mutation, which
are commonly used within Xcs. In single point crossover, two parent chro-
mosomes are recombined to form two offspring chromosomes. A bit location
is randomly generated and the substrings before and after the bit location

Figure 4.5: Single point crossover. The parents 101|1111 and 000|1010 are cut at the
same randomly chosen point and their tail segments are exchanged, giving the children
101|1010 and 000|1111.

in each parent chromosome are exchanged to create the offspring (see Fig-
ure 4.5). In free mutation, each bit in the string is mutated with a given
probability. When a bit is mutated it is set to a randomly selected element
from the set of symbols in the bit-string alphabet minus the current bit
value.
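Both operations are simple to state in code; the sketch below works over the
ternary condition alphabet, with mu denoting the user-set per-symbol mutation
probability:

    import random

    ALPHABET = "01#"

    def single_point_crossover(c1, c2):
        # Exchange the substrings after a randomly chosen cut point.
        point = random.randint(1, len(c1) - 1)
        return c1[:point] + c2[point:], c2[:point] + c1[point:]

    def free_mutation(cond, mu):
        # Each symbol mutates with probability mu to a different symbol
        # drawn uniformly from the rest of the alphabet.
        out = []
        for sym in cond:
            if random.random() < mu:
                out.append(random.choice([a for a in ALPHABET if a != sym]))
            else:
                out.append(sym)
        return "".join(out)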

The prototypical GA begins with a randomly initialised population and


proceeds by evolving successive generations in a cyclic process. During each
cycle the following steps (illustrated in Figure 4.6) are performed:

1. Select individuals stochastically for reproduction based on their fitness.

2. Generate offspring through recombination (crossover) and mutation.

3. Select individuals for the next generation.

An intuitive explanation of this process is that selection (step 1) focusses the


search on the best individuals, crossover (step 2) recombines the chromo-
somes from two parents in the hope of inheriting the best features of both,
and mutation (step 2) performs a stochastic local search over the chromo-
somes. The next generation (step 3) may simply consist of all offspring, or
it may additionally contain some of the highly fit parents (a process called
elitism).

Figure 4.6: The ith cycle of a genetic algorithm. Generation i (chromosomes with their
fitness values) passes through selection, recombination and mutation to produce
generation i+1.

The GA within Xcs is somewhat more specialised than the generic de-
scription given above. In particular, individuals in the generic GA usually
represent a whole solution to the given problem, whereas individuals in Xcs
are fragments of a policy. Thus, in Xcs, the entire population, together with
logic for resolving conflicts, represents the solution to the problem. Also,
the GA in Xcs often uses a restricted version of free mutation, called niche
mutation, where the mutated bit-strings are guaranteed to match the cur-
rent state (a bit can only be mutated to # or to the corresponding bit value
in the current state). Finally, we note that while GAs enjoy a prominent
position within LCS design, there appears to be no inherent reason why al-
ternative evolutionary algorithms, or other parallel search algorithms, could
not be adapted for use within an LCS implementation. Indeed, some ex-
isting LCS systems have already used alternatives to the GA (for example,
Lanzi, 1999b; Stolzmann, 2000).

4.6 Summary

In this chapter we have described in detail the basic method from which we
will derive a relational reinforcement learning system capable of dynamic
generalisation. The system, Xcs, perhaps the most prominent instance of
the LCS paradigm, can be characterised as a kind of rule-based Q-Learning
system combined with a genetic algorithm. In the following chapter, Xcs

will be extended for representation with first-order logic.

We saw that each rule in Xcs contains an action and a condition part
which together represent a policy fragment, and that a rule also performs
a generalisation role through its ability to represent a cluster of states in
a single compact condition. The paradigm used for representing the rules
is the bit-string, which represents the action and condition as fixed length
strings over the alphabets {0, 1} and {0, 1, #} respectively. A rule is said
to match a state, also given as a string over {0, 1}, if its condition string
is identical to the state, except for any “don’t care” symbols, #, which can
match either a 0 or a 1.

Each rule is also associated with a number of parameters, principally the


prediction, which is an estimate of the payoff, a value analogous to the
expected Q-value for the policy fragment; the error, which is an estimate
indicating the magnitude of the variance of the payoff; and the fitness, an
inverse function of error (also called accuracy) which is used by the genetic
algorithm.

Architecturally, the four key components of Xcs are a rule-base, a production


subsystem, a credit assignment subsystem and a rule discovery subsystem.
These components function in the following way: the rule-base acts as a
repository for a population of rules; the production subsystem is responsible
for decision making based on the current population of rules; the credit
assignment subsystem updates the rule parameters based on the reward
signal from the environment; and the rule discovery subsystem generates
new rules for the system, chiefly by applying a genetic algorithm to existing
rules with high fitness, although a covering operation is also used which
generates new rules matching the current state.

Learning classifier systems are complex systems using mechanisms and heuris-
tics which have been fine-tuned over the decades since their inception in the
1970s. In this context, the primary contribution of Xcs is its accuracy-
based fitness. Earlier systems used a single parameter, strength, for both

prediction and fitness which led to the problem of strong overgeneral rules:
the displacement of rules which consistently advocate an optimal action by
rules which advocate a sub-optimal action but whose payoff is higher than
that of the optimal rules. By basing fitness on accuracy instead of strength,
rules which always advocate optimal actions in Xcs cannot be displaced by
strong overgeneral rules.

Key biases which determine the composition of rules in Xcs’s population


are described in the Generality Hypothesis and the Optimality Hypothesis.
The Generality Hypothesis proposes that given enough training, most of the
population will consist of rules which are both accurate and maximally gen-
eral. The Optimality Hypothesis further maintains that a sub-population of
complete, minimal and non-overlapping rules will eventually develop when-
ever possible. Also, a number of evolutionary pressures have been identified
which describe additional biases: set pressure, fitness pressure, mutation
pressure, deletion pressure and subsumption pressure. We have conjectured
that all these biases are unaffected by merely changing the rule language,
with the exception of mutation pressure and possibly subsumption pressure.
Finally, we have seen that Xcs has been previously extended with other rule
languages, although never with first-order logic.
Chapter 5

The FOXCS System

“To invent, you need a good imagination and a pile of junk.”


—Thomas A. Edison

Recall that the objective of this research is to develop and evaluate an ap-
proach to relational reinforcement learning based on the learning classifier
system Xcs. Background material has been dealt with, including funda-
mentals of the RRL problem, a survey of RRL methods, and a description
of Xcs; the purpose of this chapter is thus to present a detailed design real-
ising the proposed approach. The system is named Foxcs: a “First-Order
Logic Extended Classifier System”. In brief, Foxcs automatically gener-
ates, evolves, and evaluates a population of condition-action rules taking
the form of definite clauses over first-order logic. As mentioned previously,
the primary advantages of this approach are that it is model-free, general-
isation is performed automatically, no restrictions are placed on the MDP
framework, it produces rules that are comprehensible to humans, and it uses
a rule-language to provide domain specific bias.

Perhaps the most novel aspect of the system is that it features an inductive
component based on a synthesis of evolutionary computation and refine-
ment in ILP. Although there do exist systems that have previously adapted


evolutionary computation for representation in first-order logic,1 they are


supervised learning systems and their evolutionary operations do not sat-
isfy the online requirements of the reinforcement learning paradigm. New
covering and mutation operations were therefore tailored for use in Foxcs.

Note that Mellor (2005) gives an earlier, proof-of-concept version of Foxcs.


The principal difference between that earlier version and the one described
in this chapter lies in the covering and mutation operations employed. The
earlier version used domain specific operations; here, general domain inde-
pendent operations have been developed.

This chapter is set out as follows. First, the overall approach is briefly
described and the principal modifications required to produce Foxcs from
Xcs are listed. Next, forming the bulk of the chapter, each modification
is fully described in its turn (in general, the details of Xcs will not be
repeated; readers unfamiliar with Xcs might like to consult the previous
chapter). Then a brief overview of the system from a software engineering
perspective follows. Finally, a summary of Foxcs concludes the chapter.

5.1 Overview

The Foxcs system is a learning classifier system designed to evolve rules


expressed in first-order logic. The approach taken was to extend the Xcs
system by “upgrading” its representation to definite clauses over first-order
logic. Using this approach, it was hoped that most of the biases present
in Xcs, which are the result of substantial innovation and fine-tuning over
many years, would be unaffected and could be directly leveraged by Foxcs.

Within inductive logic programming, many systems have been obtained sim-
ilarly. For example, the well known Foil (Quinlan, 1990), Icl (De Raedt
and Van Laer, 1995), Ribl (Emde and Wettschereck, 1996), Claudien

1 See “Evolutionary Computation in Inductive Logic Programming” on page 101.

Figure 5.1: The architecture of Foxcs with modified components indicated. The rule
base, production subsystem, credit assignment subsystem and rule discovery subsystem
interact with a Markov decision process through states, rewards and actions; the numbered
modifications (1 to 6) summarised below are located on these components.

(De Raedt and Dehaspe, 1997), Warmr (Dehaspe and De Raedt, 1997), and
Tilde (Blockeel and De Raedt, 1998), amongst others, have all been derived
following the “upgrade” approach. A general methodology for achieving
such an upgrade has even been formulated (Van Laer and De Raedt, 2001).
Note that representations other than definite clauses, such as first-order logic
trees, have been employed in some cases.

Architecturally, the high-level design of Foxcs is identical to that of Xcs.


Recall from the previous chapter that the architecture of Xcs consists of
four subsystems: the rule base, production subsystem, credit assignment
subsystem, and rule discovery subsystem. This four-tiered system is retained
by Foxcs; however, individual components and elements of the subsystems
have been modified to support the change in representation from bit-strings
to first-order logic. Figure 5.1 shows the high-level architecture of Foxcs
and locates the modifications, which are summarised below:

1. Rules, states, and actions are represented by expressions over first-


order logic.

2. A task specific rule language is defined by the user of the system using
special language declaration commands.

3. The matching operation is redefined for the new representation.

4. During the operational cycle, a separate match set is constructed for


each action. On any particular cycle an individual rule may occur in
multiple match sets.

5. The covering and mutation operations are tailored for producing and
adapting rules in first-order logic.

6. The test for subsumption deletion is redefined.

The following sections describe these modifications in greater detail.

5.2 Representational Aspects

Like its parent Xcs, the Foxcs system accepts inputs and produces rules;
however, unlike Xcs these inputs and rules are expressed in a language
over first-order logic. The language is specified via an alphabet, which is a
tuple ⟨C, F, P, d⟩, consisting of a set of constant symbols C; a set of function
symbols F (although functions are not supported by Foxcs, i.e. F = ∅); a
set of predicate symbols P; and a function d : F ∪ P → Z, called the arity, which
specifies the number of arguments associated with each element of F ∪ P.
For the purpose of describing a task environment, it will be convenient to
partition the set of predicate symbols, P, into three disjoint subsets, PS ,
PA , and PB , which contain the predicate symbols for states, actions and
background knowledge respectively.

Example 3 Recall the blocks world environment, bwn , introduced in Chap-


ter 1. Let LBW = ⟨C, ∅, PS ∪ PA , d⟩ be an alphabet for a language over
first-order logic that describes the states and actions in bwn , where PS =
{on, on_fl, cl}; PA = {mv, mv_fl}; C = {a, b, c, . . .} and |C| = n; and d is
defined as: d(on) = 2, d(on_fl) = 1, d(cl) = 1, d(mv) = 2 and d(mv_fl) = 1.


In the remainder of this section, we describe how background knowledge,


inputs and rules are represented using L (a command for declaring the lan-
guage L to the system is described later in Section 5.5.1). The semantics of
the representation is based on the theory of Herbrand interpretations; readers
unfamiliar with Herbrand interpretations might like to consult Appendix A,
which gives a description of the theory.

5.2.1 Background Knowledge

As mentioned above, the language L contains predicates for background


knowledge, PB , in addition to predicates for describing states, PS , and ac-
tions, PA . A background theory, let us call it B, containing definitions for
these predicates is provided by the user of the system in the form of definite
clauses. The use of background knowledge is a feature of many ILP and
RRL systems, and providing a “good” theory is often the key to solving a
task effectively.

Example 4 A useful relation in blocks world is above(X, Y ), which is true


whenever there is a sequence of blocks leading from X down to Y . A recur-
sive definition for above(X, Y ) is:

above(X, Y ) ← on(X, Y )
above(X, Y ) ← on(X, Z), above(Z, Y )

Let us extend the language for blocks world so that LBW =


⟨C, ∅, PS ∪ PA ∪ PB , d⟩, where PB = {above} and d(above) = 2.


s = {cl(a), on(a, b), on_fl(b), cl(c), on(c, d), on(d, e), on_fl(e)}
A(s) = {mv_fl(a), mv(a, c), mv_fl(c), mv(c, a)}

Figure 5.2: An hypothetical system input: block a rests on b, and block c rests on d,
which rests on e; both stacks stand on the floor.

5.2.2 Representation of the Inputs

An input to the system describes the present state of the environment and
the potential for action within it. More specifically, it is a pair (s, A(s)),
where s ∈ S is the current state and A(s) is the set of admissible actions for
s. The state, s, and the set of admissible actions, A(s), are each represented
by an Herbrand interpretation—a set of ground atoms which are true—over
the language L.2

Example 5 Figure 5.2 shows a diagram and representation of an hypothet-


ical input (s, A(s)) from bw5 over the language LBW .


2 The representation of inputs with Herbrand interpretations is called learning from in-
terpretations or non-monotonic learning in the ILP literature (De Raedt, 1997; De Raedt
and Džeroski, 1994). Blockeel (1998, page 89) notes that a limitation of using this
paradigm is that rules with recursive definitions cannot be learnt; however, he suggests
that recursive rules occur comparatively rarely.

5.2.3 Representation of the Rules

Rules in Foxcs contain a logical part Φ, which replaces the bit-string action
and condition of their counterparts in Xcs. Apart from this change, all
other parameters and estimates associated with rules in Xcs are retained
by Foxcs and function as they do in Xcs.

Definition 6 The logical part of a rule in Foxcs, Φ, is a definite clause


over a first-order language L.

A definite clause has the form:

ϕ0 ← ϕ1 , . . . , ϕn

where each ϕi is an atom of the form ρ(τ1 , . . . , τd(ρ) ), where ρ ∈ P, each τi


is either a constant from C or a variable, and d(ρ) is the arity of ρ. If the
arity of ρ is zero (i.e. d(ρ) = 0) then ρ is a proposition and the bracketed list
of arguments is omitted. The head of the rule, ϕ0 , can be thought of as the
rule’s action and only ever contains a predicate from PA . The rule body,
ϕ1 , . . . , ϕn , can be thought of as the rule’s condition and contains predicates
from PS and PB .

Example 6 The logical form, Φ, of three rules over the language LBW are
given below:

rl1 : mv(A, B) ← cl(A), cl(B), on(B, C), on_fl(C)
rl2 : mv(A, B) ← cl(A), cl(B)
rl3 : mv_fl(A) ← cl(A), above(A, B)

These rules are illustrated in Figure 5.3.




Note that the action part of each rule in the above example contains
variables; such rules are said to have abstract actions and can advocate more

Figure 5.3: Three rules for blocks world (rl1 , rl2 and rl3 from Example 6) and an illus-
tration of the patterns that they describe. A serrated edge on a block signifies that it
rests on a stack of zero or more blocks.

than one action for a particular state. Rules in Foxcs are thus capable of
generalising over actions in addition to states.

5.2.4 Expressing Generalisations Within a Single Rule

Now that the representation of rules and inputs has been described, it is
interesting to characterise how the first-order logic rules of Foxcs express
generalisations. There are two ways that a rule can generalise: the first is
through the use of variables, and the second is through underspecification.
Although these two generalisation mechanisms have been separated for clar-
ity, it should be noted that an individual rule usually contains variables and
is also typically underspecified; that is, both mechanisms generally occur
together in the same rule.

The presence of variables within a rule allows it to generalise over attributes


or objects, and the resulting rule may be visualised as an abstract pattern or
template (see Figure 5.4a). With underspecification, generalisation occurs
because the rule omits to specify, either concretely or abstractly, some of

(a) mv(A, B) ← cl(C), on(C, F ), on_fl(F ), cl(A), on(A, D), on(D, E), cl(B), on_fl(B)
(b) mv(a, b) ← cl(a), on(a, c), cl(b), on_fl(b), highest(a)

Figure 5.4: (a) A rule containing variables; the rule generalises over all states where the
blocks are arranged in the indicated pattern. (b) An underspecified rule; this rule gener-
alises over all arrangements of unspecified blocks that preserve the truth of highest(a).

the objects or attributes present in the state (see Figure 5.4b).

The variables of first-order logic are analogous to the # symbol in the bit-
string languages of LCS systems in the sense that they both generalise over
multiple values. Of course, unlike variables, the use of # in different loca-
tions within a rule cannot express a correspondence between the values of
the attributes at those locations. Underspecification, on the other hand,
typically does not occur in bit-string rules of LCS systems; an exception is
Xcsm (Lanzi, 1999a), which achieves the effect through the incorporation
of messy coded bit-strings (Goldberg et al., 1989, 1990).

Generalising through the use of variables and under-specification is very typ-


ical of both ILP and RRL systems generally.3 However, all such dynamically
generalising, model-free RRL systems use tree representations, in contrast
to the rule-based representation of Foxcs. Despite the widely acknowledged
utility of tree structures and algorithms for the purpose of generalisation,
there is reason to believe that rule-based approaches are preferable to tree-
based ones in the context of RRL. Further discussion of this point is taken
up on page 199.

3 Other approaches within these areas include the use of distance metrics and kernel
functions.

5.3 The Matching Operation

With the new representation comes a corresponding redefinition for match-


ing. The matching operation in Foxcs works as follows: a rule rl with
logical part Φ successfully matches a ground state-action pair (s, a) if and
only if Φ, s, and the background theory B, together entail a. This leads to
the following definition of matching:

Definition 7 A given rule rl with logical part Φ matches a ground state-


action pair (s, a) under background theory B, if and only if:

Φ ∧ s ∧ B |= a (5.1)

Example 7 Let us consider matching the rule rl3 from Example 6 to the
input (s, A(s)) illustrated in Figure 5.2. The values of Φ, s and A(s) are
reproduced here for convenience:

Φ = mv_fl(A) ← cl(A), above(A, B)
s = {cl(a), on(a, b), on_fl(b), cl(c), on(c, d), on(d, e), on_fl(e)}
A(s) = {mv_fl(a), mv(a, c), mv_fl(c), mv(c, a)}

A separate matching operation is run for each (s, a) ∈ {s} × A(s) because
matching is defined for (s, a) pairs rather than (s, A(s)) pairs. In this ex-
ample, rl3 will be matched to four (s, a) pairs because |A(s)| = 4.

Before matching, the Prolog knowledge base is initialised with B. When


matching begins, the first step is to assert Φ and s to the knowledge base.
Next, a query is made for each of the actions, a1 , . . . , a4 , in turn. The queries
corresponding to a1 (?- mv_fl(a)) and a3 (?- mv_fl(c)) succeed, while the
queries for a2 (?- mv(a,c)) and a4 (?- mv(c,a)) fail; hence, rl3 matches
(s, a1 ) and (s, a3 ), but not (s, a2 ) and (s, a4 ).
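The example can be reproduced with any Prolog engine. The sketch below uses the
pyswip Python bindings to SWI-Prolog; the file bw_background.pl and the
hand-specialised query strings are illustrative devices only and are not part of
Foxcs:

    from pyswip import Prolog

    prolog = Prolog()
    prolog.consult("bw_background.pl")    # defines above/2 as in Example 4

    state = ["cl(a)", "on(a,b)", "on_fl(b)",
             "cl(c)", "on(c,d)", "on(d,e)", "on_fl(e)"]
    for fact in state:                    # assert s into the knowledge base
        prolog.assertz(fact)

    # Rule rl3 has head mv_fl(A) and body cl(A), above(A,B).  For a ground
    # action the head is unified with the action and the body is then queried.
    queries = {"mv_fl(a)": "cl(a), above(a,B)",
               "mv_fl(c)": "cl(c), above(c,B)"}
    for action, body in queries.items():
        entailed = bool(list(prolog.query(body)))
        print(action, "matched" if entailed else "not matched")
    # mv(a,c) and mv(c,a) cannot unify with the head mv_fl(A), so rl3 fails on them.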


Note that the rule in the above example generalised over two ground actions.
It is because of this ability to generalise over actions that the rules in Foxcs
are matched against (s, a) pairs rather than against the state s only: so that
matching is unambiguous with respect to actions.

5.3.1 The Order of Atoms Within a Rule

The order in which atoms occur within Φ is important for producing
sound results. In particular, when a rule contains certain expressions, such
as an inequality or a negation, Prolog can return an error depending on the
order of the atoms. For example, the two rules:

minor(X) ← age(X, Y ), Y < 21


minor(X) ← Y < 21, age(X, Y )

are both meant to indicate that X is a minor if X’s age is less than 21. The
first rule works as intended; however, in the second rule, Y is not bound
when the inequality is evaluated, hence the rule will produce an error. A
similar problem can occur for negation; therefore, care must be taken that
any variable occurring in an inequality or a negation already occurs earlier
in the rule.4

5.3.2 The Use of Inequations

Another consideration involves the usual semantics of first-order logic which


allows different variables within Φ to unify to the same value. In some
environments, such as blocks world, it can be useful if each distinct variable
in Φ refers to a different object or attribute. Prolog lacks a mechanism
for automatically enforcing this requirement, therefore, following Khardon
(1999), an inequation is placed into Φ for each pair of distinct variables
4 Note that this problem could also be overcome by using a Prolog implementation that
supports co-routining. Co-routining effectively delays the evaluation of a problematic
atom until its variables can be bound.

occurring in Φ.5 For instance, when inequations are included, the logical
form of the three rules from Example 6 would be:

mv(A, B) ← cl(A), cl(B), A ≠ B, on(B, C), B ≠ C, on_fl(C)
mv(A, B) ← cl(A), cl(B), A ≠ B
mv_fl(A) ← cl(A), above(A, B), A ≠ B

Note that inequations are placed as leftmost as possible in the rule after
the pair of variables that they refer to occur. This has the effect of causing
matching to terminate as early as possible when the inequation is not satis-
fied and was found to lead to a significant reduction in the running time of
the system compared to placing them at the end of the rule.
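This placement policy can be sketched as a single left-to-right pass over the
body, inserting an inequation as soon as the second variable of a same-typed
pair has appeared; the Inequation class and the variables and type attributes
below are hypothetical names, and head variables are taken to be introduced by
their first occurrence in the body, as in the rules above:

    def add_inequations(body):
        out, seen = [], []
        for atom in body:
            out.append(atom)
            for var in atom.variables:          # variables in order of appearance
                if var in seen:
                    continue                    # only pair a newly introduced variable
                for prev in seen:
                    if prev.type == var.type:   # inequations are type dependent
                        out.append(Inequation(prev, var))
                seen.append(var)
        return out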

The added inequations are artifacts that implement the semantics of the rule
and are thus not regarded as syntactic elements of Φ. Hence, the inequations
are not directly subject to mutation nor are they explicitly shown when Φ
is written out. The use of inequations is controlled by a user setting, which
is set to “on” by default. The default setting is assumed throughout the
remainder of this thesis except where otherwise indicated.

5.3.3 Caching

The computational efficiency of the Xcs framework, and indeed the LCS
approach in general, relies heavily on the efficiency of matching. This is
because on each operational cycle (that is, on each time step t) matching is
run on every rule in [P]. Matching rules in Xcs is cheap: linear in the number
of bits; however, it is a more expensive operation in Foxcs. This additional

expense for performing matching in Foxcs is the cost of increasing the
expressive power of the rule language.

5 Actually the inequations are type dependent, that is, an inequation will only be added
if the two variables refer to attributes of the same user defined type (these types are
declared with the mode command discussed in Section 5.5.1). This precludes the addition of
many unnecessary inequations and also prevents complications that arise when attributes
which have different user defined types range over the same basic types (such as integers
and floating point numbers).

In order to improve the efficiency of matching in Foxcs, each rule j ∈ [P]


is associated with a cache, Kj , that can be searched with logarithmic time
complexity. The cache contains up to NK records indexed on (s, a);6 each
record, Kj (s, a), is a boolean indicating the success or failure of matching j
to (s, a). The cache is used as follows:

• Kj is searched before j is matched to (s, a). If Kj (s, a) exists then it


is used, otherwise j is matched to (s, a).

• If a matching operation is performed on j and (s, a) then a record of


the result is inserted into Kj .

• If |Kj | > NK after an insertion then the oldest record is deleted.

Unless otherwise stated, Foxcs always used caching in experiments. The


maximum size of the cache was set to NK = 1000 records.

This technique works well for ILP tasks, which usually have a few hundred
data items only. However, typically |S| ≫ NK for RRL tasks, which reduces
the effectiveness of caching for RRL tasks; an interesting exception is looping
behaviour: if the loop is completed in NK steps or less then caching can be
effective.
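A per-rule cache with this behaviour can be sketched in a few lines of Python
(the key_string serialiser and the matches call are stand-ins; an ordered
dictionary gives constant-time rather than logarithmic look-up, but the
record-keeping is the same):

    from collections import OrderedDict

    class MatchCache:
        def __init__(self, capacity=1000):        # N_K = 1000 in the experiments
            self.capacity = capacity
            self.records = OrderedDict()           # insertion order doubles as age

        def lookup(self, key):
            return self.records.get(key)           # None when no record exists

        def insert(self, key, matched):
            self.records[key] = matched
            if len(self.records) > self.capacity:
                self.records.popitem(last=False)   # discard the oldest record

    def cached_match(rule, s, a):
        key = key_string(s, a)                     # serialise (s, a), cf. footnote 6
        hit = rule.cache.lookup(key)
        if hit is not None:
            return hit
        result = matches(rule, s, a)               # the entailment test of (5.1)
        rule.cache.insert(key, result)
        return result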

5.4 The Production Subsystem

The production subsystem for Foxcs is largely identical to its counterpart


in Xcs. However, some modifications were made in order to accommodate
the system’s ability to generalise over actions as well as states. Figure 5.5
6 At an implementation level, (s, a) is converted into a key string in order to facilitate
search.

input: the current input to the system, (s, A(s))

1. Construct match sets. For each a ∈ A(s) there is a corresponding


match set [M]a , where [M]a = {rl ∈ [P] | Φrl matches (s, a)}.

2. Trigger covering if required. For each action a ∈ A(s) where


[M]a = ∅, the covering operation is called to produce a rule cov-
ering (s, a).

3. Calculate system predictions. For each a ∈ A(s) the system pre-


diction P (a) is calculated as a fitness weighted sum of the predic-
tions of the rules in [M]a .

4. Select action. An action a∗ ∈ A(s) is selected ε-greedily over


the system predictions. On greedy steps, if some actions are tied
for maximum prediction then the action is selected from these uniformly
at random. The action set is assigned: [A] := [M]a∗ .

5. Execute action. The selected action, a∗ is executed, and a reward


is obtained.

6. Assign credit. The parameters of the rules in the action set of the
previous cycle, [A]−1 are updated. If it is the terminal step of an
episode then the rules in [A] are also updated.

7. Trigger mutation. From time to time mutation is run on a parent


rule selected from [A]−1 . If it is the terminal step of an episode
then mutation may be triggered on [A] also.

Figure 5.5: The algorithm for the operational cycle.

gives the algorithm for the operational cycle of Foxcs. The first modifica-
tion, which has been noted before, is that input to the system includes the
set of admissible actions, A(s), for the current state s ∈ S. The next, and
principal, modification occurs on step 1 of the cycle and involves the con-

struction of a separate match set, denoted [M]a , for each action a ∈ A(s).
Note that an individual rule can, and frequently does, belong to more than
one match set if it contains an abstract action; for instance, the rule in
Example 7 would belong to two match sets.

At step 2, another modification was made for triggering the covering op-
eration. The standard mechanism, as given by Butz and Wilson (2002),
invokes covering whenever fewer actions are advocated than a user set thresh-
old. In Foxcs, the covering operation is called for each action a which has an
empty match set, [M]a . This version originates from Kovacs (2002), where it
is termed action set covering; it requires knowledge of A(s), but eliminates
the need for a user setting. For relational environments, action set covering
appears to be a more natural choice than using a fixed threshold, because
the number of actions frequently depends upon the state. For instance, in
bwn the number of actions that apply in an arbitrary state varies between
1 and n2 − n.

Another difference occurs on step 3. The formula for calculating the system
predictions was not itself modified. That is, P (a) is still computed according
to equation (4.2) as the fitness weighted sum of the individual predictions
of the rules in [M]a . The equation is reproduced here for convenience:
                    P(a) = \frac{\sum_{j \in [M]_a} p_j F_j}{\sum_{j \in [M]_a} F_j} .

However, because rules can generalise over actions, it is possible, unlike the
situation with Xcs, for an individual rule to occur in more than one [M]a
and thus contribute to multiple system prediction calculations on a single
cycle.7
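The calculation just described, together with the ε-greedy selection of step 4, is illustrated by the short Python sketch below. It is a simplification rather than the thesis implementation, and the attribute names prediction and fitness are assumptions introduced for the example.

import random

def system_predictions(match_sets):
    """Compute P(a) for each admissible action from its match set [M]_a,
    weighting each rule's prediction by its fitness."""
    predictions = {}
    for action, rules in match_sets.items():
        total_fitness = sum(r.fitness for r in rules)
        weighted_sum = sum(r.prediction * r.fitness for r in rules)
        predictions[action] = weighted_sum / total_fitness
    return predictions

def select_action(predictions, epsilon=0.1):
    """Epsilon-greedy selection over the system predictions; ties for the
    maximum prediction are broken uniformly at random."""
    actions = list(predictions)
    if random.random() < epsilon:
        return random.choice(actions)              # exploratory step
    best = max(predictions.values())
    tied = [a for a in actions if predictions[a] == best]
    return random.choice(tied)                     # greedy step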

The final change concerns the rule updates which are made on step 6. The
7 Note, however, that a rule cannot achieve perfect accuracy unless each of these actions
yields the same expected payoff. Thus, generalisations made over actions—like general-
isations made over states—must reflect a uniformity in the payoff landscape or the rule
will suffer low fitness. In other words, fit rules in Foxcs are those rules where Q(s, a) is
uniform for each (s, a) pair that matches the rule.

updates are identical to the procedure employed by Xcs as described in


Section 4.1.4, with the following exception. A user set option allows the
learning rate, α, to be annealed as a function of the rule’s experience (see
Section 4.1.4, page 85). In the experiments with Foxcs reported throughout
this thesis, unless otherwise indicated, α was always annealed as it was found
to produce better results than using a constant value for α.

5.5 The Rule Discovery Subsystem

The rule discovery subsystem of Foxcs contains several important modifi-


cations. The first modification is the introduction of a command to declare
a task-specific rule language. In order to create and mutate rules, the system
needs to know the predicates that constitute the rule language; because these
depend on the task at hand, they must be declared by the user of the system
on a task-by-task basis, which necessitates a declaration mechanism for the
rule language. Further modifications are the
redefinition of the covering and mutation operations. These operations have
the same purpose as their counterparts in Xcs but have been completely
redefined for representation with first-order logic. Crossover was not im-
plemented because, under first-order logic, recombination of the syntactic
elements of a definite clause generally doesn’t produce a recombination ef-
fect at the semantic level. The final modification is the use of θ-subsumption
to detect generality for subsumption deletion. Below, each of these modifi-
cations are discussed in turn.

5.5.1 Declaring the Rule Language

Each environment requires its own specific vocabulary for describing it.
When the system creates or modifies rules—through covering or mutation—
it consults user defined declarations in order to determine which literals
may be added, modified, or deleted. A language declaration command that

enables such user provided definitions is a common feature of most ILP


systems. In Foxcs the command is:

mode(type, [min, max], neg, pred),

which declares a predicate for inclusion in P. The first three arguments to


mode are:

type Indicates whether the predicate is a member of PA , PS or PB (the


values are a, s and b respectively).

min, max Integers specifying the minimum and maximum number of oc-
currences of the predicate allowed within an individual rule.

neg A boolean value which determines whether the predicate may be negated
or not.

The last argument, pred, declares the predicate itself. The general form
of pred is r(m1 , . . . , md(r) ), where r is the predicate symbol, mi is a dec-
laration for the ith argument of r, and d(r) is the arity of r. However,
if r is a proposition—which is frequently the case for action predicates in
classification tasks—then the form of pred reduces to just r.

Each declaration mi is a list, [arg type, spec1 , spec2 , . . .], where arg type is a
type specifier,8 and the remaining arguments, spec1 , spec2 , . . ., are “mode”
symbols which determine how the ith argument may be set. The mode
symbols and their meanings are given below:

“+” Input variable. The argument may be set to a named variable (of the
same type) that already occurs in the rule.

“-” Output variable. The argument may be set to a named variable that
does not already occur in the rule.
8 The types are user defined. Note that user defined types are not automatically sup-
ported by Prolog and the developer must provide the code which handles them.

“#” Constant. The argument may be a constant.

“_” Anonymous variable. The argument may be set to the anonymous
    variable.9

“!” If the argument is a variable then it must be unique within the literal’s
argument list.

The list should contain at least one of the first four symbols. If it contains
more than one of the above symbols, then the argument may be set according
to any of the symbols present.

Example 8 The following are mode declarations for the blocks world lan-
guage LBW (see examples 3 and 4).

mode(a,[1,1],false,mv([block,-,#],[block,-,#])).
mode(a,[1,1],false,mv_fl([block,-,#])).
mode(s,[0,5],false,cl([block,-,#])).
mode(s,[0,5],false,on([block,-,#],[block,-,#])).
mode(s,[0,5],false,on_fl([block,-,#])).
mode(b,[0,5],true,above([block,-,#],[block,-,#])).

The language, L = ⟨C, ∅, P, d⟩, corresponding to the declarations is as fol-
lows: the set of constants, C, is not explicitly declared; instead, constants are
obtained when required through interaction with the environment; the set
of predicates P is partitioned into PA = {mv, mv_fl}, PS = {cl, on, on_fl}
and PB = {above}; and the function d is given implicitly by the number of
arguments for each predicate.


The next example makes more sophisticated use of argument types and
modes.
9 The role of the anonymous variable in first-order logic is analogous to the role of the
“don’t care” symbol, #, in bit-string languages.

Example 9 The following are mode declarations that could be used by the
system for learning to recognise hands of Poker.

mode(a,[1,1],false,hand([class,#])).
mode(s,[5,5],false,card([rank,+,-,_],[suit,+,-,_],[position,-])).
mode(b,[0,1],true,succession([rank,+,!],[rank,+,!],[rank,+,!],[rank,+,!],
[rank,+,!])).

The first predicate, hand, is an action predicate which is used to indicate


the class of the hand. The second predicate, card, is a state predicate
which represents a playing card and takes three arguments, which are the
rank of the card, its suit, and its position in the hand. The last predicate,
succession, is a background predicate which takes five ranks as arguments
and succeeds if there is an ordering of its arguments that forms a sequence
of successive ranks.

Note the use of “!” in the declaration of succession; the symbol prevents
rules from being generated that contain a succession literal having two or
more equal arguments, which would be unsatisfiable, and thus reduces the
size of the rule space to be searched. This illustrates the purpose of “!”,
which is to improve efficiency.
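For readers who prefer to think of these declarations as data, the sketch below shows one possible in-memory representation of a mode declaration, using the card/3 declaration from Example 9. It is purely illustrative; the class and field names are assumptions and do not reflect the system's actual C++/Prolog data structures.

from dataclasses import dataclass
from typing import List

@dataclass
class ArgDecl:
    arg_type: str          # user-defined type, e.g. "rank" or "suit"
    modes: List[str]       # subset of {"+", "-", "#", "_", "!"}

@dataclass
class ModeDecl:
    kind: str              # "a" (action), "s" (state) or "b" (background)
    min_occurs: int        # minimum occurrences of the predicate in a rule
    max_occurs: int        # maximum occurrences of the predicate in a rule
    negatable: bool        # whether the literal may appear negated
    predicate: str         # predicate symbol
    args: List[ArgDecl]    # one declaration per argument

# mode(s,[5,5],false,card([rank,+,-,_],[suit,+,-,_],[position,-])) as data:
CARD = ModeDecl(
    kind="s", min_occurs=5, max_occurs=5, negatable=False,
    predicate="card",
    args=[ArgDecl("rank", ["+", "-", "_"]),
          ArgDecl("suit", ["+", "-", "_"]),
          ArgDecl("position", ["-"])],
)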


5.5.2 The Covering Operation

In Foxcs, the covering operation generates a new rule which matches a given
state-action pair, (s, a). The algorithm for covering is given in Figure 5.6 and
works as follows. First, a new rule, rl, is created and its parameters, except
for Φ, are initialised in the same way as in Xcs.10 Next, the logical part, Φ,
is set to a definite clause corresponding to a ← s. At this stage the rule does
not generalise because s and a are ground, so in the next two steps an inverse
10 See Section 4.1.5, page 89. The method that assigns population means is used, except
when the population is empty in which case user supplied constants are assigned.

input: a state-action pair (s, a)

1. create and initialise a new rule rl

2. Φ := ϕ0 ← ϕ1 , . . . , ϕn , where ϕ0 = a and ϕ1 , . . . , ϕn is the clause


containing all the facts in s (as part of this step each place p in
Φ is associated with a set of mode symbols Mp according to the
mode declarations)

3. θ⁻¹ := {⟨c1 , {p1,1 , . . . , p1,k1 }⟩/v1 , . . . , ⟨cl , {pl,1 , . . . , pl,kl }⟩/vl },
   where c1 , . . . , cl are the constants of Φ, the pi,j are the places in Φ
   where ci occurs such that Θ(Mpi,j ) = “-”, and v1 , . . . , vl are vari-
   ables to be substituted for c1 , . . . , cl (each vi is unique).

4. Φ := Φθ⁻¹

5. insert rl into [P ] with deletion

Figure 5.6: The algorithm for covering.

substitution is created and applied to the rule in order to generalise it (the


inverse substitution replaces some or all of the constants in the rule with
variables; Section A.6 describes substitutions and inverse substitutions more
formally). Whether a variable will be assigned to a particular argument,
or whether it will be left as a constant, depends on the mode declaration
corresponding to the argument. Finally, rl is inserted into the population,
with deletion occurring if the number of rules exceeds the user set parameter
N . We now discuss some of these steps in more detail.

At step 2 of the covering algorithm, when Φ is originally assigned, each


argument of each formula contained in Φ is also associated with its list
of mode symbols as specified by the appropriate mode declaration. This
association allows the system to determine whether an argument can be
a variable in the following step (and it will also be useful for performing
mutation on the rule later on). It is convenient to refer to an argument by

its place within Φ, where the term at place ⟨i, j⟩ in Φ : ϕ0 ← ϕ1 , . . . , ϕn is


the term tj in ϕi : f (t1 , . . . , tm ). The set of mode symbols associated with
the argument at place p is thus denoted Mp .

At step 3, an inverse substitution, θ−1 , is created to generalise the rule. In


generating the substitution, the function Θ(Mp ) is used to determine if the
constant at place p will be replaced by a variable. The function returns
a randomly selected mode symbol from Mp ∩ {“#”,“-”}; if “-” is returned
then the constant at place p is replaced, otherwise it remains.11

Although not described in the algorithm given above, the actual covering
operation also takes type information into account when creating the inverse
substitution at step 3. Without checking type information, an inverse substi-
tution potentially has undesirable consequences when different data types
use the same values for constants (which typically occurs with numerical
data, for instance). For example, consider the clause:

age(anne, 60), weight(anne, 60).

Here, the constant anne represents the same person in both literals, but the
value of 60 for both age and weight is coincidence. Thus, replacing anne in
both literals with variable X, say, creates a useful association, but replacing
60 in both literals with Y is much less likely to be useful. The system
therefore performs type checking to ensure that the inverse substitution is
type safe, that is, that constants of attributes with different types will not
be substituted with the same variable. In the above example, the numerical
arguments would be declared to have different types:

mode(s,[1,1],false,age([name,-],[age,-])).
mode(s,[1,1],false,weight([name,-],[weight,-])).

Finally, at step 5 it is possible that a rule already exists in the population


11 Note that this requires that the mode declarations for the formulas in s and a must
specify either the symbol “#” or “-”, or both, for each argument of each formula. In other
words, each argument of every predicate belonging to PS ∪ PA must be able to be an
output variable or a constant.

whose logical part, Φ, is syntactically identical to that of the new rule. If


this is the case, then, as is typical for Xcs, the numerosity of the existing
rule is incremented and the new rule is discarded (note that insertion after
mutation also similarly first checks for syntactically identical rules).
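The following Python sketch pulls the covering steps together. It is a simplified illustration made under several stated assumptions: typing, the minimum/maximum occurrence limits and the "!" mode are ignored, and the helper mode_for_place is hypothetical; the thesis implementation itself operates on Prolog clauses.

import random
from itertools import count

def cover(state_facts, action, mode_for_place):
    """Simplified sketch of covering: build the clause `action <- state`
    and generalise it by replacing constants with variables wherever the
    mode declarations permit (one variable per constant)."""
    clause = [action] + list(state_facts)      # literal 0 is the head, a
    fresh = (f"V{i}" for i in count())
    variable_for = {}                          # constant -> its variable

    def generalise(term, place):
        choices = mode_for_place(*place) & {"#", "-"}
        if choices and random.choice(sorted(choices)) == "-":   # Θ(M_p)
            return variable_for.setdefault(term, next(fresh))
        return term                            # keep the constant

    return [(pred,) + tuple(generalise(arg, (i, j + 1))
                            for j, arg in enumerate(args))
            for i, (pred, *args) in enumerate(clause)]

# Usage with a blocks-world state (cf. Example 8), where every argument
# place is declared to allow "-" and "#":
state = [("cl", "a"), ("on", "a", "b"), ("on_fl", "b")]
rule = cover(state, ("mv_fl", "a"), lambda i, j: {"-", "#"})
print(rule)   # e.g. [('mv_fl', 'V0'), ('cl', 'V0'), ('on', 'V0', 'V1'), ('on_fl', 'V1')]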

5.5.3 The Mutation Operations

The mutation operations of Foxcs are based on processes that are analogous
to refinement operations. Refinement operations are used to search the
hypothesis space in ILP systems and compute a set of specialisations or
generalisations of a given rule, expression or hypothesis. In Foxcs, versions
of refinement operations have been created that are suitable for its learning
classifier system framework. The operations produce a single, stochastically
refined rule and can access the current state, s, and set of admissible actions,
A(s), but no other elements of S or A.

The mutation operations in Foxcs consist of two groups: generalising mu-


tations and specialising mutations. An ILP system typically performs either
a general-to-specific or a specific-to-general search and therefore usually em-
ploys either generalisation operators or specialisation operators, depending
on the direction of the search, but not often both. In contrast, because the
evolutionary search in Xcs is bi-directional (in the sense that it searches
in both the specific-to-general and general-to-specific direction), the Foxcs
system uses both generalising and specialising operators. For both groups—
generalising and specialising—Foxcs defines three operations: one opera-
tion works at the level of the literals in the rule; another works at the level of
the arguments to the literals; and the third is a variation of the second that
specifically handles the anonymous variable. In addition to the mutation
operations, Foxcs also employs a reproduction operation that copies the
rule but does not mutate it.

In the subsections below, the generalising and specialising mutation oper-


ations are first described. Note that, although the operators are based on

fairly central principles from refinement in ILP, it certainly would be possi-


ble to invent other operations or variations of them. Next, the reproduction
operation is described. Following that, the procedure for determining which
operation to select is detailed. Finally, some examples are given to illustrate
how the operations function.

The Generalising Mutations

The three generalising operations are del (delete literal), c2v (constant to
variable), and v2a (variable to anonymous variable). They are described
below.

del: A literal from the body of the rule is randomly selected and deleted.

c2v: Each occurrence of a particular constant, c, at a location in the rule


which may hold a variable, is replaced with a variable, v. The oper-
ation first randomly selects the constant c from the set of constants
occurring in the rule. It then selects the variable v from the union of
the set of variables occurring in the rule (such that type consistency
between c and each variable in the set is satisfied) and a new variable
not already occurring in the rule. An inverse substitution is then per-
formed which replaces c with v at those locations in the rule where c
is allowed to be a variable.

v2a: A variable at a specific location is replaced with the anonymous vari-


able. This operation first creates the set of all places in the rule where a
variable occurs and which may be assigned to the anonymous variable.
It then randomly selects one of the places and replaces the variable at
that location with the anonymous variable.

The algorithms for the three generalising mutation operations are given in
Figure 5.7. However, they have been simplified by ignoring several factors:
type information; the minimum and maximum number of literals allowed in

input: a rule with logical part Φ

del

1. ϕi := random literal of Φ : ϕ0 ← ϕ1 , . . . , ϕn such that i ≠ 0

2. Φ := Φ − ϕi

c2v

1. c := random element of Const(Φ)

2. v := random element of Vars(Φ) ∪ {v′ : v′ is a new variable, i.e.
   v′ ∉ Vars(Φ)}

3. θ⁻¹ := {⟨c, {p1 , . . . , pk }⟩/v}, where the set of pi are the places in
   Φ where c occurs and “+” ∈ Mpi if v ∈ Vars(Φ), or “-” ∈ Mpi if
   v is a new variable

4. Φ := Φθ⁻¹

v2a

1. p := random element of {p1 , . . . , pk }, the set of places pi in Φ s.t.
   the term at pi is a variable and “_” ∈ Mpi

2. θ⁻¹ := {⟨v, p⟩/“_”}, where v is the variable at place p

3. Φ := Φθ⁻¹

Figure 5.7: The algorithms for the generalising mutations. The sets Const(Φ) and
Vars(Φ) are, respectively, the set of constants and the set of variables occurring in Φ.

a rule; the mode symbol “!”; and various failure conditions (e.g. the c2v
operation is invoked on a rule where Const(Φ) = ∅).

The Specialising Mutations

The three specialising operations are add (add literal), v2c (variable to
constant), and a2v (anonymous variable to variable). The operations are
described below. Note however, that in order for the child rule to belong
to the same action set as its parent, it must match a state-action pair (s, a)
corresponding to the current state, s, and an action a ∈ A(s).

add: A new literal is generated and added to the body of the rule. A
predicate is selected from PS ∪ PB and its arguments are filled in
according to the mode symbols associated with the argument. If any
constants are to be assigned to the arguments of the predicate, then
they are generated from the current state, s, such that the new rule
matches (s, a) for some action a ∈ A(s).

v2c: Each occurrence of a particular variable, v, at a location in the rule


which may hold a constant, is replaced with a constant, c. The op-
eration first randomly selects the variable v from the set of variables
occurring in the rule. It then finds a set, C, containing constants
which are candidates for replacement. Each constant c0 ∈ C is such
that the rule j matches (s, a) for the current state s and for some ac-
tion a ∈ A(s), where j is the new rule is obtained by replacing each
occurrence of v with c0 (at places which are allowed to hold a constant).
The constant c is then selected at random from C.

a2v: An anonymous variable at a specific location is replaced with a named


variable. This operation first creates the set of all places in the rule
where an anonymous variable occurs and which may be assigned to an
input (“+”) or output (“-”) variable. It then randomly selects one of
the places, p, and creates a set of variables, V , which are candidates for
replacement of the anonymous variable at p. The variables which are
included in V depend on the mode symbols associated with p: if “+” is
associated with p then V will include all the variables already occurring

in the rule (subject to type constraints); and if “-” is associated with


p then V will include a new free variable which does not already occur
in the rule. Finally, the anonymous variable at p is replaced with a
randomly selected variable from V .

The algorithms for the three specialising mutation operations are given in
Figures 5.8 and 5.9. Again, they have been simplified in the same way that
the generalising mutations were.

The Reproduction Operation

In addition to the mutation operations there is also a reproduction operation:

rep: Increments the numerosity of the parent rule.

This operation does not alter Φ; instead, its purpose is to encourage the
most highly fit rules to dominate the population.

Selecting a Mutation Operation

When mutation is triggered the system randomly selects one of the above
seven operations to apply. Each operation, i ∈ {del,c2v,v2a,add,v2c,a2v,
rep}, is associated with a weight µi , and its selection probability is propor-
tional to its relative weight, µi / Σj µj . If the operation fails to produce offspring
then another randomly selected operation is applied, and so on, until one
succeeds.

The weighting scheme allows the search to be biased in favour of some


operations over others. For example, a typical approach for the system
when addressing ILP tasks is to produce rules through covering that are very
general—containing only one or two literals—and to incrementally specialise
them through mutation by “growing” the number of literals in the rule;

inputs: a rule with logical part Φ, and the current system input
(s, A(s))

add

1. randomly select a mode declaration for a state or background pred-


icate, and let ρ be the name of the predicate, n be the arity, and
M1 , . . . , Mn be the modes for the arguments where Mi is the set
of mode symbols declared for the ith argument

2. ϕ := ρ(τ1 , . . . , τn ), where each τi is a random element of
   ⋃_{mode ∈ Mi} T(mode), where

      T(mode) =
      \begin{cases}
        \mathrm{Vars}(\Phi) & \text{if } mode = \text{``+''} \\
        \{ v : v \text{ is a new variable} \} & \text{if } mode = \text{``-''} \\
        \{ \text{``\_''} \} & \text{if } mode = \text{``\_''} \\
        \{ \text{``\#''} \} & \text{if } mode = \text{``\#''}
      \end{cases}
      \qquad (5.2)

3. Φ := Φ ∪ ϕ

4. θ := random element of {θ1 , . . . , θk }, the set of all substitutions
   θi s.t. θi = {⟨“#”, p1 ⟩/c1 , . . . , ⟨“#”, pl ⟩/cl }, where p1 , . . . , pl are
   the places in Φ where “#” occurs, and c1 , . . . , cl are constants s.t.
   ∃a ∈ A(s) : Φθ matches (s, a)

5. Φ := Φθ

Figure 5.8: The algorithm for the add mutation.



inputs: a rule with logical part Φ, and the current system input
(s, A(s))

v2c

1. v := random element of Vars(Φ)

2. θ := random element of {θ1 , . . . , θk }, the set of all substitutions θi
   s.t. θi = {⟨v, {p1 , . . . , pk }⟩/c}, where the pj are the places
   in Φ where v occurs and “#” ∈ Mpj , and c is a constant s.t.
   ∃a ∈ A(s) : Φθ matches (s, a)

3. Φ := Φθ

a2v

1. p := random element of {p1 , . . . , pk }, the set of places pi in Φ
   s.t. the term at pi is the anonymous variable and “-” ∈ Mpi or
   “+” ∈ Mpi

2. if “+” ∈ Mpi then V := Vars(Φ) else V := ∅

3. if “-” ∈ Mpi then V := V ∪ {v′ : v′ is a new variable, i.e. v′ ∉ Vars(Φ)}

4. θ := {⟨“_”, p⟩/v}, where v is a random element of V

5. Φ := Φθ

Figure 5.9: The algorithms for the v2c and a2v mutations.

under this approach, µadd should be set larger than µdel so that the
frequency of add operations is greater than that of del operations.
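A minimal sketch of this weighted selection scheme is given below. It is illustrative only; in particular, removing a failed operator from the pool is an assumption made so that the loop terminates (the text does not specify this detail), and the dummy operators in the usage lines stand in for the real mutation routines.

import random

def select_operator(weights, operators):
    """Roulette-wheel selection of a mutation operator: operator i is chosen
    with probability mu_i / sum_j mu_j.  If the chosen operator fails
    (returns None), another is selected until one succeeds."""
    pool = [name for name, w in weights.items() if w > 0]
    while pool:
        total = sum(weights[n] for n in pool)
        r = random.uniform(0, total)
        cumulative = 0.0
        for name in pool:
            cumulative += weights[name]
            if r <= cumulative:
                break
        child = operators[name]()            # attempt the chosen operation
        if child is not None:
            return name, child
        pool.remove(name)                    # assumed: do not retry a failed operator
    return None, None

# Usage sketch with dummy operators that always succeed:
mu = {"rep": 20, "add": 60, "del": 20}       # weights favouring add over del
ops = {name: (lambda name=name: f"child produced by {name}") for name in mu}
print(select_operator(mu, ops))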

Examples

We now give some examples to illustrate how the add and del operations
work. We do not provide explicit examples of the other four mutations
because they are concerned with assigning the arguments of literals, which
is also performed by steps two and four in the add operation. Therefore the
add example will also suffice to illustrate the kinds of processes involved in
the c2v, v2a, v2c and a2v mutations.

Example 10 (The ADD operation) The following mode declarations re-


late to the Mutagenesis task, a well known ILP task:

mode(a,[1,1],false,active).
mode(a,[1,1],false,inactive).
mode(s,[1,1],false,molecule([compound,-])).
mode(b,[0,20],false,atm([compound,+],[atomid,+,-],[element,#],[integer,#],[charge,#,-])).
mode(b,[0,20],false,bond([compound,+],[atomid,+,-],[atomid,+,-],[integer,#])).

Suppose that Foxcs is applied to the Mutagenesis task with the above
declarations, and that at some point during training the system applies the
add mutation to the following rule:

Φ : active ← molecule(A).

At step 1 of the operation, the system randomly selects one of the mode
declarations for a state or background predicate (i.e. a declaration specifying
type “s” or “b”). Let the selected declaration be the one relating to the
predicate atm. Also during this step, Mi is assigned to the set of mode
symbols associated with argument i of the predicate. Here, M1 = {“+”},
M2 = {“+”,“-”}, M3 = {“#”}, M4 = {“#”} and M5 = {“#”,“-”}.

At step 2, the terms to the five arguments of atm are assigned. First, the
set of candidate terms, call it Ti , is computed for each argument i, where
Ti = ∪_{m∈Mi} T(m) (for the definition of T(m) see equation (5.2)). The
sets of candidates are: T1 = {A}, T2 = {B}, T3 = {#}, T4 = {#} and
T5 = {#, C}. A couple of comments are in order:

• Note that due to type checking, A ∉ T2 , even though “+” ∈ M2
  and A ∈ Vars(Φ). This is because A is already associated with the
  type compound through molecule(A) whereas the type of the second
  argument to atm is atomid. If A were assigned to the second argument
  of atm then the new rule would be unsatisfiable.

• The symbol # which occurs in some of the sets is a placeholder for


constants which are generated on a later step.

Second, a term from each Ti is selected at random to be the ith argu-


ment of atm. Let the terms be selected such that the resulting literal is
atm(A, B, #, #, C).

At step 3, the rule now has the form:

Φ : active ← molecule(A), atm(A, B, #, #, C).

At step 4, values are generated to be the constants in atm(A, B, #, #, C)


for each place containing the placeholder #. In order to be consistent with
the online learning paradigm, the constants are generated from the current
system input, (s, A(s)), only. Now suppose that the current state is s =
{molecule(d1)}, A(s) = {active} and that molecule d1 is associated with
the following atm facts:

atm(d1,d1_1,c,22,-0.117).
atm(d1,d1_2,c,22,-0.117).
atm(d1,d1_3,c,22,-0.117).
atm(d1,d1_4,c,195,-0.087).
atm(d1,d1_5,c,195,0.013).

atm(d1,d1_6,c,22,-0.117).
atm(d1,d1_7,h,3,0.142).
atm(d1,d1_8,h,3,0.143).
atm(d1,d1_9,h,3,0.142).
atm(d1,d1_10,h,3,0.142).
...

From this data, a set of candidate substitutions, Θ, is found, where each


substitution specifies the constant to be assigned to each place pi in Φ which
contains the # symbol. In this example, the places, pi , are p1 = ⟨#, ⟨2, 3⟩⟩
and p2 = ⟨#, ⟨2, 4⟩⟩, that is, the places at the third and fourth arguments
respectively of the second literal in the body of the rule. Each candidate
θ ∈ Θ must satisfy the condition that Φθ matches (s, a) for some a ∈ A(s).
Here, the set of substitutions satisfying the matching condition is:

Θ = {{p1 /c, p2 /22}, {p1 /c, p2 /195}, {p1 /h, p2 /3}, . . .}

One of the substitutions from Θ is selected at random, let this be θ =


{p1 /c, p2 /195}.

Finally, on step 5, the selected substitution, θ, is applied to derive the new


rule:
Φθ : active ← molecule(A), atm(A, B, c, 195, C).

Example 11 (The DEL operation) The del operation appears, perhaps,


too straightforward to warrant an example. However, the algorithm given
earlier was a simplification, and in practice the operation must preserve the
integrity between a rule’s input and output variables. The operation thus
cannot delete just any arbitrary literal in the rule, otherwise the integrity
of some of the variables may be lost. For example, deleting the atm literal
in the rule:

active ← molecule(A), atm(A, B, c, 22, C), C < 0.316



would cause the rule to become unsatisfiable because then C would be a


free variable and the inequality, C < 0.316, could not be satisfied. Similarly,
deleting the molecule literal would mean that variable A would no longer
be bound to the ID of the current molecule given in s, and the rule will
either be always satisfied or always unsatisfied, depending on the background
knowledge for the task, irrespective of the value of s.

These complications with deletion are one of the reasons why variable argu-
ments are declared as either input (“+”) or output (“-”) variables: so that
the system can reason about which literals can and cannot be deleted. In
general, if a rule contains a variable, v, which occurs as an input variable at
some location in the rule, then v must also occur as an output variable at
a previous location. The del mutation is therefore constrained such that it
will only delete literals which do not result in a violation of this condition.
Similarly, the v2a mutation will not rename an output variable if it would
result in a violation of the condition. The appropriate declarations for the
above example are:

mode(s,[1,1],false,molecule([compound,-])).
mode(b,[0,20],false,atm([compound,+],[atomid,+,-],[element,#],[integer,#],[charge,#,-])).
mode(b,[0,20],false,<([charge,-],[float,#])).
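A sketch of such an integrity check is given below. It illustrates only the stated condition: typing and the other mode symbols are ignored, the function names are invented, and the mode sets used in the example at the end are chosen for the sake of the illustration rather than taken verbatim from the declarations above.

def is_variable(term):
    """Prolog-style convention: variables begin with an upper-case letter."""
    return isinstance(term, str) and term[:1].isupper()

def violates_variable_integrity(clause, delete_index, modes_for_literal):
    """Return True if deleting the body literal at `delete_index` would leave
    an input ("+") variable occurrence with no earlier output ("-")
    occurrence of the same variable.  `clause` is a list of literal tuples,
    literal 0 being the head; `modes_for_literal[k][j]` is the set of mode
    symbols declared for argument j of literal k."""
    produced = set()                          # variables bound by an output place
    for k, (pred, *args) in enumerate(clause):
        if k == delete_index:
            continue                          # pretend the literal is deleted
        for j, term in enumerate(args):
            if not is_variable(term):
                continue
            modes = modes_for_literal[k][j]
            if "+" in modes and term not in produced:
                return True                   # input variable lost its producer
            if "-" in modes:
                produced.add(term)
    return False

# The rule from Example 11:  active <- molecule(A), atm(A,B,c,22,C), C < 0.316
clause = [("active",),
          ("molecule", "A"),
          ("atm", "A", "B", "c", "22", "C"),
          ("<", "C", "0.316")]
modes = {0: [],
         1: [{"-"}],
         2: [{"+"}, {"+", "-"}, {"#"}, {"#"}, {"#", "-"}],
         3: [{"+"}, {"#"}]}                   # first argument of "<" treated as input here
print(violates_variable_integrity(clause, 2, modes))   # True: deleting atm frees C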

5.5.4 Subsumption Deletion

The final modification required for the specification of Foxcs is the test
for subsumption. Recall that in the Xcs system, the subsumption deletion
technique (see Section 4.1.5, page 91) helps to encourage the system to
converge to a population containing maximally general rules. If rule i is
accurate (i.e. ε < ε0 ) and sufficiently experienced (i.e. exp > θsub ) then i
may subsume j if i is a generalisation of j. In Foxcs, the accuracy and
experience conditions remain, but the test for generalisation is modified for

representation in first-order logic.

Testing for Generality

In first order logic, the θ-subsumption procedure (Plotkin, 1969, 1971) can
be used to determine if one rule is a generalisation of another. The definition
of θ-subsumption is as follows:

Definition 8 (θ-Subsumption) The rule Φg θ-subsumes the rule Φs if


and only if there exists a substitution θ such that Φg θ ⊆ Φs . Under θ-
subsumption Φg is a generalisation of Φs , and Φs is a specialisation of Φg .

Note that for the purpose of testing θ-subsumption, a rule, Φ, is represented


as the set of literals which constitute it.

Example 12 The rule Φg : mv(A, B) ← cl(A), cl(B) θ-subsumes the rule


Φs : mv(a, c) ← cl(a), on(a, b), on_fl(b), cl(c) because ∃θ : Φg θ ⊆ Φs , namely
θ = {A/a, B/c}.
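Since θ-subsumption amounts to searching for a suitable substitution, it can be illustrated with a small backtracking procedure. The Python sketch below is illustrative rather than the thesis implementation (which delegates the test to Prolog); it ignores the distinction between head and body literals, which is harmless here because action predicates and state/background predicates are disjoint.

def theta_subsumes(general, specific):
    """Return a substitution theta with general*theta ⊆ specific, or None.
    Literals are tuples (predicate, arg1, ..., argN); variables are strings
    starting with an upper-case letter, constants are lower-case strings."""
    def is_var(t):
        return isinstance(t, str) and t[:1].isupper()

    def match_literal(lit_g, lit_s, theta):
        """Try to extend theta so that lit_g*theta equals lit_s."""
        if lit_g[0] != lit_s[0] or len(lit_g) != len(lit_s):
            return None
        theta = dict(theta)
        for t_g, t_s in zip(lit_g[1:], lit_s[1:]):
            if is_var(t_g):
                if theta.get(t_g, t_s) != t_s:
                    return None              # variable already bound differently
                theta[t_g] = t_s
            elif t_g != t_s:
                return None                  # constants must be identical
        return theta

    def search(remaining, theta):
        if not remaining:
            return theta
        first, rest = remaining[0], remaining[1:]
        for lit_s in specific:
            extended = match_literal(first, lit_s, theta)
            if extended is not None:
                result = search(rest, extended)
                if result is not None:
                    return result
        return None

    return search(list(general), {})

# The clauses of Example 12, represented as lists of literals:
phi_g = [("mv", "A", "B"), ("cl", "A"), ("cl", "B")]
phi_s = [("mv", "a", "c"), ("cl", "a"), ("on", "a", "b"),
         ("on_fl", "b"), ("cl", "c")]
print(theta_subsumes(phi_g, phi_s))   # {'A': 'a', 'B': 'c'}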


A limitation of θ-subsumption is that there are cases of generalisation which


it does not detect. In first order logic, generalisation is typically taken to be
equivalent to entailment, that is, Φg is more general than Φs iff Φg |= Φs .
Unfortunately, θ-subsumption is not equivalent to entailment. An example
where Φg |= Φs , but Φg does not θ-subsume Φs is (Blockeel, 1998):

Φg : ϕ1 (Y ) ← ϕ1 (X), ϕ2 (X, Y )
Φs : ϕ1 (Z) ← ϕ1 (X), ϕ2 (X, Y ), ϕ2 (Y, Z)

Despite this limitation with θ-subsumption, testing for θ-subsumption is


more efficient than testing for entailment and for this reason it is typically
preferred over entailment.

[Figure 5.10 is a block diagram with three layers: input files (task environment, system settings); core modules (a Prolog module handling matching and subsumption, and the learning classifier system and experiment modules in C++); and output files (population, trace, performance statistics).]

Figure 5.10: Architecture of the Foxcs system. Arrows indicate the flow of data.

A Faster Generality Test for GA Subsumption

Recall that there are two varieties of subsumption deletion in Xcs, GA sub-
sumption and action set subsumption. For action set subsumption, Foxcs
uses θ-subsumption to test for generalisation, but for GA subsumption there
is a simpler, constant time test. In Foxcs, unlike in Xcs, each type of mu-
tation operation acts to either always generalise or always specialise rules.
Thus, in order to determine whether a parent rule is more general than its
child, each mutation operation sets or clears a flag depending on whether it
generalises or specialises. The subsumption test then just tests the flag to
determine whether the child rule should be subsumed by its parent or not.

5.6 Implementation Notes

This section provides a brief description of the architecture of the Foxcs


system from a software engineering perspective rather than from the logi-
cal perspective given in the preceding sections. From this perspective the
system consists of three layers: input files, core modules, and output files.
Figure 5.10 shows the elements in each layer and how they interact. The
input layer consists of the task environment, including the rule language
declarations and a knowledge base in the case of an ILP task or an imple-
mentation of a transition function for an RRL task; and the system settings,
such as values for system parameters and experimental configurations. The
output layer includes performance statistics, the final population of rules,
and a trace file showing the history of interaction with the environment.
The core system itself consists of three high level modules.

• The first module contains routines for interacting with the environ-
ment, such as initiating an episode, obtaining the current input, ex-
ecuting actions and receiving rewards, and also for implementing the
matching and θ-subsumption operations.

• The second module implements all the learning classifier system sub-
systems: the rulebase, the production system with the exception of
matching, credit assignment, and rule discovery operations with the
exception of θ-subsumption.

• The last module handles functionality relating to running experiments


and computing statistics about system performance.

Note that when ILP tasks are addressed they must be converted into an
equivalent MDP formulation of the task so that the LCS methodology can be
applied. In practice these conversions are implemented by placing additional
routines in the task environment files which respond to interaction requests.

The Foxcs system is implemented in C++ and Prolog as indicated in the


diagram. Rules are contained in the learning classifier system module and
are therefore implemented as C++ objects, but are translated into equiva-
lent Prolog clauses for matching and θ-subsumption. We used GNU Prolog
1.2.16 and GNU g++ 2.95.2. Although these compilers are not the most
current versions, they were used because of stability issues which arose when
later versions of GNU C++ and Prolog were used in conjunction with each
other. Development and experimentation with Foxcs took place on the
Solaris 9 and 10 Intel operating system.

5.7 Summary

In this chapter we have presented the details of the Foxcs system. The sys-
tem was obtained by extending the learning classifier system Xcs for rep-
resentation with first-order logic. Essentially, the extension was achieved
through the following steps: replacing the bit-string rules of Xcs with defi-
nite clauses in first-order logic; redefining matching as a test for consistency
with respect to a given state-action pair; replacing bit-string mutation and
crossover with online versions of upward and downward ILP refinement; and
replacing the bit-string subsumption test with θ-subsumption. In addition
to these changes, the system also requires a task specific rule language to be
declared as part of the task specification.

By deriving from the LCS methodology, the Foxcs system inherits the
following qualities: it is model-free, automatically discovers generalisations,
and does not restrict the MDP framework. In the remainder of this thesis
we focus on an empirical evaluation of the Foxcs system.
Chapter 6

Application to Inductive
Logic Programming

“Two algorithms walk into a bar. . . ”


—Dirk, a friend of the author

We now, in this chapter and the next, turn to an empirical evaluation of


the Foxcs system. This chapter focuses on evaluating specific aspects and
components of the system using ILP tasks; the following chapter assesses
the overall system performance on RRL tasks. The first experiment in this
chapter benchmarks Foxcs on several ILP tasks (Section 6.2). A number
of factors motivate our interest in doing so: first, ILP tasks constitute an
important category of machine learning problems and the ability to solve
them is significant in its own right; second, ILP is a more mature field than
RRL and offers better scope for benchmarking in terms of the number of
available tasks and published results; and finally, and perhaps most impor-
tantly, it allows us to empirically verify the effectiveness of Foxcs’s novel
inductive mechanisms. The last point follows because the system’s perfor-
mance on ILP tasks depends almost entirely on its ability to generalise,
since the problem of delayed reward in ILP tasks is non-existent due to the


fact that all rewards are immediate. We find that Foxcs is generally not
significantly outperformed by several well-known ILP systems, confirming
Foxcs’s ability to generalise effectively.

Next we give evidence that the evolution of rules in Foxcs is consistent


with the Generality Hypothesis (Section 6.3). This result is interesting to
the field of LCS, as it supports the view that the biases identified by the
Generality Hypothesis are general in the sense that they are largely lan-
guage independent. After that, we show that the new subsumption deletion
procedure, which has greater time complexity than its counterpart in Xcs,
does increase the efficiency of the system (Section 6.4). Next, we present
evidence that annealing the learning rate produces better results than us-
ing a fixed learning rate for ILP tasks (Section 6.5). We suggest that this
result should extend to Xcs systems generally for classification tasks. Fi-
nally, we compare the two selection methods typically used in Xcs systems,
tournament and proportional selection, finding that for Foxcs there is a
performance–efficiency trade-off (Section 6.6).

6.1 Experimental Setup

This section provides details of the materials and system settings that were
used for the experiments reported in this chapter.

6.1.1 Materials

Brief descriptions of the ILP data sets which were used are given below.
More detailed descriptions are provided in Appendix C, including examples
of the specific rule language declarations that were used.

Mutagenesis. The aim of this task is to predict the mutagenic activity


of nitroaromatic compounds (Srinivasan et al., 1996). The data is
divided into two subsets, a 188 molecule “regression friendly” subset,

and a 42 molecule “regression unfriendly” subset. In these experiments


the former subset was the one used, which contains 125 active and 63
inactive molecules.

There are two different levels of description for the molecules. The
first level, NS+S1, contains a low level structural description of the
molecule in terms of its constituent atoms and bonds, plus its logP
and LUMO values. The second level, NS+S2, contains additional in-
formation about higher level sub-molecular structures. The second
level is a superset of the first and generally allows a small increase in
predictive accuracy to be attained.

Biodegradability. This data set contains 328 molecules that must be clas-
sified as either resistant (143 examples) or degradable (185 examples)
from their associated structural descriptions—which are very similar
to those in the Mutagenesis data set—and molecular weights. Block-
eel et al. (2004) give several levels of description for the molecules; the
Global+R level is used here.

Traffic. This data set originates from Džeroski et al. (1998b). The task
is to predict critical road sections responsible for accidents and con-
gestion given traffic sensor readings and road geometry. The data
contains 66 examples of congestion, 62 accidents, and 128 non-critical
sections, totaling 256 examples in all.

Poker. This task originates from Blockeel et al. (1999). The aim is to
classify hands of Poker into eight classes, fourofakind, fullhouse,
flush, straight, threeofakind, twopair, pair and nought. The
first seven classes are defined as normal for Poker, although no dis-
tinction is made between royal, straight, and ordinary flushes. The
last class, nought, consists of all the hands that do not belong to any
of the other classes. In Poker, the frequency with which the different
classes occur when dealing is extremely uneven, but in these experi-
ments the data examples were artificially stratified to ensure that the
class distribution was approximately equal (see Section C.4).

Transforming an ILP task into an MDP is relatively straightforward. The


states, S, consist of the data examples plus a dummy terminal state—let
us call it s∅ —and the actions, A, contain the class labels which are used
for classifying the examples. Training is broken into a series of single step
episodes, where each episode involves presenting a data example, s ∈ S,
to the system which it must classify by choosing an action, a ∈ A. The
transition function, T , is the trivial one: T (s, a, s′) = 1 if s′ = s∅ , otherwise
T (s, a, s′) = 0. And finally, the reward signal, R, is two-valued: R(s, a) = r,
where r > 0, if a is the correct class for s, otherwise R(s, a) = −r.
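The episodic wrapper just described is small enough to sketch directly. The Python class below is a hedged illustration of such a conversion, not the code used for the experiments; the class and method names are invented, and the default reward of ±10 matches the setting given later in Section 6.1.2.

import random

class ILPClassificationEnv:
    """Single-step episodic MDP view of an ILP data set: each episode
    presents one example, the action is a class label, and the reward is
    +r / -r for a correct / incorrect classification."""

    TERMINAL = "s_terminal"              # the dummy terminal state

    def __init__(self, examples, class_labels, reward=10.0):
        self.examples = examples         # list of (description, true_class)
        self.class_labels = class_labels
        self.reward = reward
        self.current = None

    def start_episode(self):
        """Sample a training example and return it as the initial state."""
        self.current = random.choice(self.examples)
        return self.current[0], list(self.class_labels)   # (s, A(s))

    def step(self, action):
        """Classify the current example; every episode ends immediately."""
        _, true_class = self.current
        r = self.reward if action == true_class else -self.reward
        return self.TERMINAL, r, True    # next state, reward, episode done

env = ILPClassificationEnv(examples=[("example_1", "classA")],
                           class_labels=["classA", "classB"])
state, admissible = env.start_episode()
print(env.step("classA"))                # ('s_terminal', 10.0, True)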

6.1.2 Methodology

Unless otherwise noted, all results for the Foxcs system were obtained under
the following settings. The system parameters were set to: N = 1000, ε =
10%, α = 0.1, β = 0.1, ε0 = 0.01, ν = 5, θga = 50, θsub = 20, θdel = 20, and
δ = 0.1. The mutation parameters were: µrep = 20, µadd = 60, µdel = 20,
and µi = 0 for all i ∈ {c2v, v2a, v2c, a2v}. Tournament selection was used
with τ = 0.4. The learning rate was annealed and both GA and action
set subsumption were used. The system was trained for 100,000 steps, but
rule discovery was switched off after 90,000 steps in order to reduce the
disruptive effect of rules which have not had sufficient experience to be
adequately evaluated because they have been generated late in training (see
Figure 6.1). Finally, the reward given was 10 for a correct classification and
−10 for an incorrect classification.

In many of the following experiments the system’s predictive accuracy was


measured.1 A typical procedure for measuring predictive accuracy is 10-fold
cross-validation (see “The k-Fold Cross-Validation Method” on page 150),
however care must be taken when using this method with Foxcs. Because
1 Predictive accuracy refers to the percentage of correct classifications made by the
system on data withheld from it during training. Note that it is a performance measure
and is not related to the accuracy, κ, or relative accuracy, κ0 , which are computed during
the fitness calculation of Foxcs and Xcs.

[Figure 6.1 is a line plot of Accuracy (%), ranging from 75 to 85, against Training Episode (×1000), ranging from 0 to 100, for System A and System B.]
Figure 6.1: This graph of the system performance on the Mutagenesis NS+S1 data set
is indicative of the performance improvement gained by switching off rule discovery prior
to the completion of training. After 90,000 training episodes rule discovery is switched
off for system A but left on for system B. Comparing the performance of the two systems
after this point illustrates the negative effect which is exerted by freshly evolved rules.

the system is not deterministic, it produces different rules when an experi-


ment is re-run, which potentially reduces the reproducibility of a result even
when the same data folds are used. All predictive accuracies reported for
Foxcs have been determined by repeating 10-fold cross-validation ten times
in order to minimise the effects of non-determinism; this helps to ensure
that the reported results are reproducible.
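The repeated stratified cross-validation procedure is summarised by the sketch below (the k-fold method itself is described in the “k-Fold Cross-Validation Method” box on page 150). It is a generic illustration: train_fn and classify_fn are placeholders for the learner being evaluated, and the round-robin fold construction is one simple way of achieving stratification.

import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Split example indices into k folds so that each class appears in each
    fold in roughly the same proportion as in the full data set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)          # deal indices out round-robin
    return folds

def repeated_cv(train_fn, classify_fn, examples, labels, k=10, repeats=10):
    """Repeated stratified k-fold cross-validation: the mean accuracy over
    `repeats` independent k-fold partitionings."""
    scores = []
    for rep in range(repeats):
        for fold in stratified_folds(labels, k, seed=rep):
            if not fold:
                continue
            held_out = set(fold)
            training = [(examples[i], labels[i])
                        for i in range(len(examples)) if i not in held_out]
            model = train_fn(training)
            correct = sum(classify_fn(model, examples[i]) == labels[i]
                          for i in fold)
            scores.append(correct / len(fold))
    return sum(scores) / len(scores)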

6.2 Comparison to ILP Algorithms

In this section we compare Foxcs to several well known ILP algorithms.


ILP tasks are essentially classification tasks and it has been shown that
Xcs is competitive with other machine learning algorithms at classification.
For instance, Bernadó et al. (2002) compared Xcs to six algorithms (each
belonging to a different methodological paradigm) on 15 classification tasks

The k-Fold Cross-Validation Method

The k-fold cross-validation method is a commonly used technique for estimat-


ing the predictive accuracy of a classification algorithm (Kohavi, 1995). The
method works as follows:

1. The data is partitioned, usually by random sampling, into k subsets of


approximately equal size;

2. Each subset is used in turn as the test set, and the remaining k − 1
subsets as the training set;

3. The algorithm’s predictive accuracy is estimated as the average accuracy


over the k test sets, where accuracy is the number of correct classifica-
tions divided by the number of instances in the test set.

A typical choice for the number of partitions is k = 10.

In stratified cross-validation, the partitions are stratified so that each class


is represented in approximately the same proportion as in the original data set,
which reduces the variance of the estimate. In this thesis all partitions are strat-
ified unless otherwise noted. Repeated stratified cross-validation, where for
example ten-fold cross-validation is repeated 10 times and the resulting esti-
mates averaged, usually reduces the variance even further.

and found that it was not significantly outperformed by any of the algorithms
in relation to its predictive accuracy; and Butz (2004) has also obtained
similar results. We are therefore interested to determine whether Foxcs
inherits Xcs’s ability at classification and performs at a level comparable
with existing ILP algorithms.

This experiment compared the Foxcs system to several ILP algorithms and
systems. Four well-known ILP algorithms were selected, Foil (Quinlan,
1990), Progol (Muggleton, 1995), Icl (De Raedt and Van Laer, 1995)
and Tilde (Blockeel and De Raedt, 1998). The first three systems are
rule-based, while Tilde uses a tree-based representation. A fifth system
containing an evolutionary component, Ecl (Divina and Marchiori, 2002),

Table 6.1: Comparison between the predictive accuracy of Foxcs and selected ILP
algorithms on the Mutagenesis, Biodegradability, and Traffic data sets. The standard
deviations, where available, are given in parentheses. An * (**) indicates that the value
is significantly different from the corresponding value obtained by Foxcs according to an
unpaired t-test (assuming unequal variance) with confidence level 95% (99%) (note that a
significance test could not be run for the cases where no standard deviation was reported).

Algorithm Predictive Accuracy (%)

Mute (NS+S1) Mute (NS+S2) Biodegradability Traffic


Foxcs 84 (3) 87 (2) 74 (2) 93 (1)
Icl 87 (10) 88 (8) 75 (1) 93 (4)
Tilde 85 86 74 (1) 94 (4)
Progol 82 (3) 88 (2) – 94 (3)
Foil 83 82 – –
Ecl – 90 (1)** 74 (4) 93 (2)

was also selected. The Ecl system employs a memetic algorithm which
hybridises evolutionary and ILP search heuristics. These algorithms are all
supervised learning algorithms since—as far as the author is aware—Foxcs
is the first reinforcement learning system to be applied to ILP tasks.

Table 6.1 compares the predictive accuracy of Foxcs to that of the ILP
systems on the Mutagenesis, Biodegradability, and Traffic data sets. The
results were taken from the following sources. For the Mutagenesis data
set: Srinivasan et al. (1996) for Progol, Blockeel and De Raedt (1998)
for Tilde and Foil,2 Van Laer (2002) for Icl and Divina (2004) for Ecl;
for Biodegradability: Blockeel et al. (2004) for Tilde and Icl, and Divina
(2004) for Ecl; and for Traffic: Džeroski et al. (1998b) for all systems
except Ecl, which is Divina (2004). All predictive accuracies have been

2 Blockeel and De Raedt (1998) are a secondary source for Foil on the Mutagenesis
data set; we note that the primary source, (Srinivasan et al., 1995), has been withdrawn
and that its replacement, (Srinivasan et al., 1999), reassesses Progol but unfortunately
does not replicate the experiments for Foil.

measured using 10-fold cross validation.3 The folds are provided with the
Mutagenesis and Biodegradability data sets but are generated independently
for the Traffic data. For the Biodegradability data, five different 10-fold
partitionings are provided and the final result is the mean performance over
the five 10-fold partitions. For consistency, all results have been rounded
to the lowest precision occurring in the sources (i.e. whole numbers); more
precise results for Foxcs on these tasks can be found in Section 6.6. Where
a source provides multiple results due to different settings of the algorithm,
the best results that were obtained are given.

From the table it can be seen that Foxcs generally performs at a level
comparable to the other systems. In only one case is Foxcs significantly
outperformed. This finding confirms that Foxcs retains Xcs’s ability at
classification and validates the efficacy of Foxcs’s novel combination of
evolutionary computation and online ILP refinement operators.

6.3 Verifying the Generality Hypothesis

The Generality Hypothesis (see Section 4.3) states that the number of ac-
curate and maximally general rules in the population of Xcs will tend to
increase over time. In Chapter 4 it was suggested that the Generality Hy-
pothesis would also hold under alternative rule languages, since the mech-
anisms in Xcs which the hypothesis identifies as generating the pressure
towards accuracy and maximal generality are essentially language neutral
(see page 97). In this section we seek to empirically determine if the be-
haviour of Foxcs is consistent with the Generality Hypothesis.

Use of a synthetic data set is indicated for this experiment, since at least two
potential problems arise with real-world data sets, such as the Mutagenesis,

3 Note that, as previously described in Section 6.1.2, the predictive accuracy for Foxcs
is calculated by performing ten repetitions of the 10-fold cross validation procedure and
taking the mean.

[Figure 6.2 is a line plot against Episode (0 to 30,000) showing two curves on a 0 to 0.9 scale: Performance and Population size (/N).]

Figure 6.2: Graph of system performance and population size for the poker task. The
system performance was measured as the proportion of correct classifications over the last
50 non-exploratory episodes, and the population size is the number of macro-rules divided
by N (500). Results are averaged over 10 separate runs.

Biodegradability, and Traffic data sets used in the preceding experiments.


First, correct class definitions are not typically known with real-world data
sets, thus it is not usually possible to verify the level of generality of the rules
evolved by the system. Second, it is also possible that the data itself may not
be sufficiently representative of the different classes for the system to be able
to induce maximally general definitions. In order to overcome these difficul-
ties we used the synthetic poker data set; it was suitable for this experiment
as definitions of the different poker hands are common knowledge, allowing
the level of generality of rules evolved by Foxcs to be assessed. Also, a very
large quantity of training instances could be generated, ensuring that each
class was sufficiently represented.

For this experiment some of the system settings were different from those
given in Section 6.1.2. The poker task is easier to solve than the previous
tasks, so training was terminated after 30,000 episodes (rule discovery was
switched off at 25,000 episodes) and N was decreased to 500. However, the
task does require a high value for θsub (see Mellor, 2005, for an explanation);
we used θsub = 500. Finally, the anonymous variable is useful for this task,

so the v2a and a2v mutations were used. Hence, the mutation parameters
were set as follows: µi = 20, for i ∈ {rep, add, del, v2a, a2v} and µj = 0
for j ∈ {v2c, c2v}.

The system performance and the population size was measured throughout
training and the results are shown in Figure 6.2. The performance plot shows
that the system is able to make error-free classifications after 25,000 training
episodes (when evolution was switched off).4 After an initial period where
the rule base is populated by covering, the size of the population decreases
throughout the course of training. This decline in diversity of the population
indicates that generalisation is occurring and that accurate, specific rules are
being displaced or subsumed by accurate, but more general rules.

The rules evolved by the system were inspected to determine if the most
numerous rules were accurate and maximally general. A rule, rl, is max-
imally general if every hand that matches rl’s condition part belongs to
rl’s class, and if every hand belonging to rl’s class does match rl’s condi-
tion. It was found that maximally general rules did evolve, and that the
minimum number of episodes required to evolve maximally general rules in
greater proportion than sub-optimally general classifiers for all classes was
approximately 25,000.

A sample containing the accurate and correct rules belonging to the final
population for one arbitrarily selected run is given in Table 6.2. In this
sample the proportion of maximally general rules is 146/152 = 96.1%. Note
that for some classes several different maximally general classifiers evolved,
which is possible because the rule language does allow for more than one
maximally general expression for each of the classes.

We conclude that for this task, the generalisation behaviour of Foxcs is


consistent with the Generality Hypothesis—at least, for the portion of the
population that makes correct classifications.

4 Note that this result improves upon those reported by Mellor (2005) for an earlier
version of the Foxcs system.

Table 6.2: All rules with κ = 1 and p = 10 after 30,000 episodes of the Poker task, ordered
by numerosity. In the first column, an asterisk indicates that the rule is maximally general; the
second column gives the numerosity; and the remaining columns give the logical part, Φ.
* 19 fullhouse ← card(A, B, _), card(D, B, F), card(G, H, C), card(J, B, L), card(M, H, _)
* 18 twopair ← card(A, B, _), card(D, B, _), card(G, H, F), card(J, H, C), card(M, N, _)
* 17 flush ← card(A, B, C), card(D, E, C), card(G, H, C), card(J, K, C), card(M, N, C)
* 17 fourofakind ← card(A, B, C), card(D, B, F), card(G, B, I), card(J, K, _), card(M, B, O)
* 11 threeofakind ← card(A, B, C), card(D, E, _), card(G, H, F), card(J, B, L), card(M, B, _)
* 11 straight ← card(A, B, C), card(D, E, _), card(G, H, _), card(J, K, _), card(M, N, F), consecutive(N, H, B, K, E)
* 11 pair ← card(A, B, _), card(D, E, F), card(G, E, C), card(J, K, _), card(M, N, _)
* 10 threeofakind ← card(A, B, C), card(D, E, F), card(G, H, _), card(J, B, L), card(M, B, _)
* 9 pair ← card(A, B, _), card(D, E, _), card(G, E, C), card(J, K, _), card(M, N, F)
* 5 nought ← card(A, B, _), card(D, E, F), card(G, H, _), card(J, K, C), card(M, N, _), not(consecutive(H, K, B, N, E))
* 5 nought ← card(A, B, _), card(D, E, F), card(G, H, C), card(J, K, _), card(M, N, _), not(consecutive(H, K, B, N, E))
* 4 straight ← card(A, B, C), card(D, E, _), card(G, H, _), card(J, K, _), card(M, N, F), consecutive(E, B, H, N, K)
  3 straight ← card(A, B, _), card(D, E, _), card(G, H, I), card(J, K, C), card(M, N, F), consecutive(H, N, B, K, E)
* 3 nought ← card(A, B, C), card(D, E, F), card(G, H, _), card(J, K, _), card(M, N, _), not(consecutive(H, N, B, K, E))
  2 pair ← card(A, B, _), card(D, K, F), card(G, H, C), card(J, K, L), card(M, E, _)
* 2 nought ← card(A, B, C), card(D, E, F), card(G, H, _), card(J, K, _), card(M, N, _), not(consecutive(N, K, H, B, E))
* 2 fullhouse ← card(A, B, _), card(D, B, F), card(G, H, C), card(J, H, L), card(M, H, _)
  1 fourofakind ← card(A, _, C), card(D, B, F), card(G, B, I), card(J, K, C), card(M, B, O)
* 1 fullhouse ← card(A, B, _), card(D, H, F), card(G, H, C), card(J, B, L), card(M, H, L)
* 1 nought ← card(A, B, _), card(D, E, F), card(G, H, C), card(J, K, _), card(M, N, _), not(consecutive(N, K, B, H, E))

Table 6.3: Comparison between using and not using subsumption deletion on execution
time, size of the population, and predictive accuracy for the Traffic data set. Standard
deviations are given in parentheses.

Setting Time (sec) |[P]| (# macro-rules) Accuracy (%)

Both 5,202 (63.4) 161.9 (6.3) 92.6 (1.6)


AS 5,496 (55.1) 162.2 (6.7) 92.5 (1.3)
GA 5,797 (82.1) 314.9 (7.8) 92.4 (1.3)
None 12,244 (60.4) 729.0 (4.5) 92.8 (1.4)

6.4 The Effect of Subsumption Deletion on Efficiency

The subsumption deletion technique, which Foxcs inherits from Xcs, aims
to reduce the size of the population, in terms of macro-rules, without ad-
versely affecting the system’s decision making capability (see Section 5.5.4).
Ideally, there is an efficiency gain associated with the use of subsumption
deletion because a system that contains a population with fewer macro-
rules also requires fewer matching operations. However, in Foxcs, the time
cost for running the θ-subsumption test potentially offsets the reduction in
matching overhead. As it is not possible to analytically assess the relative
benefit of having fewer rules against the cost of performing θ-subsumption,
the aim of this experiment is to empirically determine whether the use of
subsumption deletion translates to a net gain in efficiency.
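As a point of reference, the decision at the heart of GA subsumption can be sketched as follows. This is an illustrative Python sketch only, not the Foxcs implementation: the rule fields, the threshold values (taken from the parameter settings reported in Section 7.1, which may differ for the ILP runs) and the theta_subsumes helper, which stands for the θ-subsumption test, are assumptions based on the standard Xcs subsumption conditions.

EPSILON_0 = 0.001   # accuracy threshold (assumed; value as in Section 7.1's settings)
THETA_SUB = 100     # experience threshold (likewise assumed)

def could_subsume(rule):
    # A rule may act as a subsumer only if it is accurate and well experienced.
    return rule.error < EPSILON_0 and rule.experience > THETA_SUB

def ga_subsumption(child, parent, theta_subsumes):
    # GA subsumption: rather than inserting a newly generated child, increment
    # the numerosity of a parent that is accurate, experienced and more general,
    # where "more general" means the parent's logical part theta-subsumes the
    # child's. The theta_subsumes argument stands for that (potentially costly) test.
    if could_subsume(parent) and theta_subsumes(parent.body, child.body):
        parent.numerosity += 1
        return True      # the child is discarded
    return False         # the child is inserted into the population as usual

The empirical question examined below is whether the savings from the resulting smaller population outweigh the cost of the theta_subsumes calls.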

Recall that there are two types of subsumption deletion, GA subsumption


and action set subsumption (see Section 4.1.5). Foxcs was therefore tested
under four settings: no subsumption deletion, GA subsumption only, action
set subsumption only, and both action set subsumption and GA subsump-
tion. Experiments, run according to the methodology given in Section 6.1.2,
measured the influence of each setting on three factors: the final size of the
population in terms of the number of macro-rules, the predictive accuracy,
and the total execution time of the system.

The results of the experiments on the Traffic data set are shown in Table 6.3.
They verify that using subsumption deletion does reduce the population size
without significantly affecting the predictive accuracy. Action set subsump-
tion reduces the size of the population more than GA subsumption, which
is unsurprising given that it is typically invoked more frequently. The re-
sults also show that using subsumption deletion significantly reduces the
execution time of the system; under each of the three subsumption deletion
settings the execution time was less than half the execution time without
using subsumption deletion (see Figure 6.3). Comparing execution times
under action set subsumption and GA subsumption shows relatively little
difference compared to the difference in the population size under the two
settings. This disparity suggests another factor affecting efficiency apart
from the population size; perhaps a subset of the population that has the
longest matching times is removed under either action set or GA subsumption.

The experiments were also performed on the Mutagenesis and Biodegrad-


ability data sets. For these data sets, the effect on execution time was even
more pronounced. In fact, without subsumption deletion, some runs could
not be run to completion because they took longer than several days.

In summary, the use of subsumption deletion in Foxcs was found to be


beneficial for the system, providing a substantial improvement to efficiency.

6.5 The Effect of Learning Rate Annealing on Performance

In this experiment we assess the influence of annealing the learning rate com-
pared to using a fixed constant for the learning rate. Annealing the learning
rate in temporal difference learning algorithms produces value function estimates which are an average of the sampled discounted reward (Sutton and Barto, 1998, section 2.5).
[Bar chart: execution time (sec) under the settings Both, AS, GA and None.]
Figure 6.3: The effect of using subsumption deletion on the execution time.

Using a learning rate which is a fixed constant,
on the other hand, produces a recency weighted average in which rewards
experienced more recently are given greater weight in the calculation of the
estimate (Sutton and Barto, 1998, section 2.6). Thus, in temporal difference
algorithms, using a fixed learning rate increases the sensitivity of the system
to recently experienced inputs compared to annealing.5 The close similarity
between updates in temporal difference learning and Xcs suggests that this
connection between learning rate and sensitivity to recent inputs also holds
for Xcs and Xcs derivative systems including Foxcs.
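The distinction can be made concrete by comparing the two update rules; the following sketch is illustrative only and the variable names are not taken from Foxcs.

def annealed_update(estimate, target, n):
    # Sample average: the learning rate 1/n decays with the number of updates n,
    # so every target contributes equally to the estimate.
    return estimate + (1.0 / n) * (target - estimate)

def fixed_update(estimate, target, beta):
    # Recency-weighted average: a constant learning rate beta gives geometrically
    # greater weight to the most recently observed targets.
    return estimate + beta * (target - estimate)

# Example: after many identical targets followed by one outlier, the fixed-rate
# estimate moves noticeably towards the outlier while the annealed one barely moves.
est_a, est_f = 0.0, 0.0
for n, target in enumerate([1.0] * 99 + [0.0], start=1):
    est_a = annealed_update(est_a, target, n)
    est_f = fixed_update(est_f, target, beta=0.2)
print(est_a, est_f)   # approximately 0.99 versus 0.8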

Sutton and Barto suggest that using a fixed learning rate is useful for
tracking non-stationary environments.6 However, with respect to classifi-
cation tasks, the recency effect which is produced by using a fixed learning
rate could make the system’s value estimates—and by extension, its overall
performance—sensitive to the order in which training examples are pre-
sented. In other words, using a fixed learning rate could bias the system
towards predicting the most commonly occurring class of the most recently experienced examples.

5 More accurately, under a fixed constant learning rate, β, the sensitivity to the current input is greater than for annealing after 1/β training steps. See Section 4.1.4, page 85.
6 A non-stationary MDP is one where the transition function, T, and the reward function, R, change over time.
In order to determine if using a fixed learning rate


or annealing does affect the predictive performance of Foxcs, we compared
the performance of the system under both learning rate settings7 on the
Mutagenesis, Biodegradability and Traffic data sets.

For each data set we ran three experiments: one, which we call RAND,
where the training examples were selected at random from the training set,
and two, SEQ and SEQ2, where the training examples were presented to
the system in a fixed sequential order. In the SEQ experiment, the data
was divided into groups such that all examples belonging to a single group
are instances of the same class; since all examples from one group were
presented before moving to the next, the short term class distribution in
the SEQ experiment varied from the overall class distribution. The SEQ2
experiment took this approach to the extreme by setting the number of
groups equal to the number of classes.
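The three sampling schemes can be pictured with the following sketch. It is an illustrative reconstruction rather than the experiment code; in particular, the number of groups per class used for SEQ, the shuffling of group order, and the assumption that each example carries a label attribute are not taken from the thesis.

import random
from itertools import chain

def rand_order(examples):
    # RAND: examples are drawn in a uniformly random order.
    order = list(examples)
    random.shuffle(order)
    return order

def seq_order(examples, groups_per_class=3):
    # SEQ: examples are split into single-class groups and the groups are
    # presented one after another, so the short-term class distribution drifts
    # away from the overall distribution.
    groups = []
    for label in set(ex.label for ex in examples):
        members = [ex for ex in examples if ex.label == label]
        size = max(1, len(members) // groups_per_class)
        groups.extend(members[i:i + size] for i in range(0, len(members), size))
    random.shuffle(groups)
    return list(chain.from_iterable(groups))

def seq2_order(examples):
    # SEQ2: one group per class, i.e. all examples of a class are presented
    # before any example of the next class.
    return sorted(examples, key=lambda ex: ex.label)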

The results of the experiments are shown in Table 6.4 and Figure 6.4. The
system performed better under annealing than under a fixed learning rate
on all data sets and sampling methods. According to the unpaired t-test
(assuming equal variances), the improvement is significant with confidence
level 95% in all cases, and significant with confidence level 99% in all but two
cases. Annealing also produced results which were more comparable across
the different sampling methods than using a fixed learning rate. The SEQ2
sampling method—where all data examples from one class are presented
before the next—generally proved to be the most challenging, although to
a much lesser extent for annealing than for a fixed learning rate.

Evidence for sensitivity to the order in which inputs are sampled can be
observed in Figure 6.5, which shows the performance of the system through-
out training on the Mutagenesis data set (NS+S2). The predictive accuracy
was measured at regular intervals of 1,000 training episodes in the same experiments that generated the corresponding results in Figure 6.4.

7 We note that the MAM technique (see Section 4.1.4, page 84) was used for the fixed learning rate.
Table 6.4: Comparison between using a fixed learning rate (F) and annealing (A) on
the predictive accuracy of Foxcs for several data sets and sampling methods. Standard
deviations are given in parentheses. An * (**) indicates that corresponding A and F values
are significantly different according to an unpaired t-test with confidence level 95% (99%).

Task Sampling method

RAND SEQ SEQ2


Mute (NS+S1) A 84.5 (2.4)** 84.7 (1.5)** 82.7 (2.3)*
F 81.1 (2.5) 79.6 (1.9) 80.5 (1.4)
Mute (NS+S2) A 86.4 (2.2)* 86.2 (2.4)** 83.7 (2.7)**
F 83.1 (3.6) 79.0 (2.4) 74.1 (1.9)
Bio A 72.3 (1.7)** 73.8 (0.9)** 70.2 (1.9)**
F 69.1 (2.2) 63.3 (1.1) 57.2 (1.4)
Traffic A 92.7 (0.9)** 92.6 (1.3)** 91.3 (1.4)**
F 90.4 (1.6) 89.6 (1.9) 74.2 (3.7)

Note the
oscillations which can be observed in the graphs for the SEQ and SEQ2 sam-
pling methods, and which are particularly prominent for the fixed learning
rate. The peaks in the graph correspond to when training examples belong-
ing to the majority class have been recently presented, while the troughs
correspond to when the recent examples are from the minority class.8 This
strong correspondence between system performance and the order of the
training examples is consistent with a sensitivity to recent inputs.

In conclusion, we found that annealing produced better results than a fixed
learning rate for Foxcs on several ILP tasks under a variety of sampling
schemes. We suggest that sensitivity to the distribution of recent inputs
accounts for the poorer performance observed when using a fixed learning rate.

8 Similar results were obtained for the Mutagenesis NS+S1 setting. For Biodegradability, the difference between the number of examples for the majority and minority classes is much less and the oscillating effect is therefore less noticeable. For Traffic, the oscillations are also less noticeable because there are more than two classes.
[Bar charts: predictive accuracy (%) under annealed and fixed learning rates for the sampling methods RAND, SEQ and SEQ2, one panel per data set (Mutagenesis NS+S1, Mutagenesis NS+S2, Biodegradability, Traffic).]
Figure 6.4: Comparison between using a fixed learning rate and annealing on the predictive accuracy of Foxcs for several data sets and sampling methods.

Furthermore, we believe that these results have implications for Xcs.
Hence, although the regular practice in Xcs is to use a fixed learning rate,
annealing may produce better results when the system is applied to classification
tasks.9 Annealing may also be beneficial for Xcs systems on tasks
containing noisy data.

9 Multiplexor tasks—which are the typical benchmark for Xcs—may represent an exceptional case because, unlike most classification tasks, it is possible to achieve 100% predictive accuracy on them. In this situation, the long term and short term error for a correct rule is the same irrespective of the sample distribution: zero. For other tasks where perfect accuracy is not possible, the system must, in part, rely on overgeneral rules in order to make classification predictions. As overgeneral rules have a non-zero error whose value depends on the sample distribution, they benefit most from annealing.
[Line graphs: predictive accuracy (%) against training episode (×1000) under fixed and annealed learning rates, one panel per sampling method (RAND, SEQ, SEQ2).]
Figure 6.5: Comparison between using a fixed learning rate and annealing for the Mutagenesis NS+S2 data set. There is a graph for each sampling method as labelled in the figure.

6.6 The Influence of the Selection Method

This section compares the influence of the two methods—proportional and


tournament—for selecting a parent rule for mutation on the system’s per-
formance. Although proportional selection is typically used for Xcs, Butz
et al. (2003) found that tournament selection produces better results
and we seek to verify that this result holds for Foxcs. We also wish to
compare the execution times under the two techniques.
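For reference, the two selection methods can be sketched as follows. The sketch is illustrative only; the handling of numerosity and the meaning of the tournament fraction τ follow the description of Butz et al. (2003) rather than the Foxcs source, and the rule fields are assumptions.

import random

def proportional_selection(action_set):
    # Roulette-wheel selection: a rule is chosen with probability proportional
    # to its (macro-classifier) fitness.
    total = sum(rule.fitness for rule in action_set)
    point = random.uniform(0.0, total)
    cumulative = 0.0
    for rule in action_set:
        cumulative += rule.fitness
        if cumulative >= point:
            return rule
    return action_set[-1]

def tournament_selection(action_set, tau=0.4):
    # Each micro-classifier enters the tournament independently with probability
    # tau; the entrant with the highest fitness per micro-classifier wins.
    # The tournament is rerun if no rule enters.
    while True:
        entrants = [rule for rule in action_set
                    if any(random.random() < tau for _ in range(rule.numerosity))]
        if entrants:
            return max(entrants, key=lambda r: r.fitness / r.numerosity)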

We obtained results for Foxcs on the Mutagenesis, Biodegradability and


Traffic data sets under both proportional and tournament selection (τ =
0.4). We also experimented with several µi values to see how tolerant the
selection method was to different values of the mutation parameters. Four


(µadd , µdel ) settings were used, which were (40,40), (50,30), (60,20) and
(70,10); we omitted values where the frequency of add operations was less
than del operations because add is the more important operation (this is
because for these tasks, covering generates rules that are very general, con-
taining only one or two literals, and which must be subsequently specialised
by “growing” the rule using the add mutation).
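On the reading that the µ values act as relative weights over the mutation operators (they are listed as percentages in the parameter settings of Section 7.1), the choice of operator amounts to a simple weighted draw. The sketch below is illustrative only; in particular, assigning the remaining weight of 20 to the replace operator is an assumption.

import random

def choose_mutation_operator(mu):
    # Pick a mutation operator with probability proportional to its mu weight,
    # e.g. mu = {"add": 70, "del": 10, "rep": 20}. 'add' grows a rule by adding
    # a literal, 'del' drops one, 'rep' replaces one.
    operators = list(mu)
    return random.choices(operators, weights=[mu[op] for op in operators], k=1)[0]

# Under the (70,10) setting, 'add' is chosen far more often than 'del',
# which drives the general-to-specific search described above.
print(choose_mutation_operator({"add": 70, "del": 10, "rep": 20}))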

Performance

The performance results for the system in terms of its predictive accuracy
under the different settings are given in Table 6.5. With only two excep-
tions, tournament selection outperformed proportional selection for all tasks
and (µadd ,µdel ) settings. For the two exceptions—which occurred on the
Traffic data set—the performance of the system was comparable under both
selection methods. According to an unpaired t-test (assuming equal vari-
ance), many of the observed improvements under tournament selection were
significant. Tournament selection was also less sensitive to the values of
the mutation parameters, producing results which are generally comparable
across the different settings. It is evident from these results that the choice
of selection method has an influence on performance. The superiority of
tournament selection over proportional selection observed here is consistent
with the findings of Butz et al. (2003).

Note that, with respect to the (µadd ,µdel ) settings, the best accuracies were
generally obtained under higher ratios of µadd to µdel , that is, the (70,10)
and (60,20) settings; however, it would be dangerous to infer that these
parameter values will produce the best performances in general, since the
results are an artifact of the general-to-specific direction of the search which
is occurring for these particular tasks.

Observant readers might have noticed that the results for the Biodegrad-
ability task reported in Section 6.2 are better than those given here for the equivalent parameter settings.
Table 6.5: Comparison between tournament and proportional selection on several ILP
data sets and (µadd ,µdel ) settings. The best accuracy for each task is shown in bold.
An * (**) indicates that the value is significantly different from its corresponding value
under the other selection method according to an unpaired t-test with confidence level
95% (99%).

Task Predictive Accuracy (%)

(40,40) (50,30) (60,20) (70,10)


Tournament Selection
Mute (NS+S1) 83.7 (2.6)** 83.8 (1.2)* 83.7 (1.4) 83.8 (2.5)
Mute (NS+S2) 86.5 (2.1)** 86.2 (1.4)** 86.1 (1.5) 86.7 (2.4)
Bio 71.6 (1.5)** 71.5 (1.0) 72.7 (1.7)* 72.9 (2.1)
Traffic 91.7 (1.4) 91.8 (1.9) 92.4 (1.3) 92.6 (1.2)

Proportional Selection
Mute (NS+S1) 80.3 (1.5)** 81.3 (3.1)* 82.3 (2.0) 82.2 (1.8)
Mute (NS+S2) 82.9 (2.3)** 83.8 (2.0)** 84.8 (2.0) 85.7 (1.6)
Bio 69.3 (1.4)** 71.1 (1.8) 70.8 (1.4)* 71.6 (1.9)
Traffic 91.2 (1.2) 91.9 (1.4) 92.5 (1.3) 91.9 (1.0)

[Bar charts of the Table 6.5 accuracies: predictive accuracy (%) under tournament and proportional selection for the four (µadd, µdel) settings, one panel per data set (Mutagenesis NS+S1, Mutagenesis NS+S2, Biodegradability, Traffic).]
This is because here training was performed


on only one of the five 10-fold cross validation partitions that were used in
Section 6.2, and the predictive accuracy on this partition is worse than the
mean accuracy over the whole five partitions.

Efficiency

The execution times for the system under the different settings are given in
Table 6.6. These results show that the system was significantly less efficient
under tournament selection than under proportional selection according to
the Mann-Whitney U test.10 Also, under tournament selection the execu-
tion time of the system rises as the proportion of add operations increases,
although this effect is less evident on the Traffic data set. Observation of
the system during training showed that the extra time was due chiefly to a
small number of rules which had disproportionately lengthy matching times
rather than, for example, any differences in the time complexity of the two
selection methods. Under tournament selection these rules were more likely
to be generated, although why this should be so is not clear.

One speculation goes as follows. Under either selection method, the fittest
rule in a given niche is more likely to be selected for mutation than the other
rules in the niche. If the fittest rule in a niche has a particularly lengthy
matching time, then it would lead to a proliferation of rules which, because
of their parent, are likely to have lengthy matching times. If tournament
selection were more likely to select the fittest rule than proportional selec-
tion, then that would account for the difference which was observed in the
system’s execution time under the two methods. Is tournament selection
more likely to select the fittest rule than proportional selection? Under
tournament selection with τ = 0.4, the fittest rule will be selected at least
40% of the time and usually more, depending on its numerosity value. Under
proportional selection, the fittest rule may be selected less frequently than
40%, depending on the fitness and number of other rules in the niche. Thus,
the mechanism does appear to be plausible.
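A small simulation makes this argument concrete. The niche composition used below, five equally-numerous rules with the fittest holding 30% of the total fitness, is purely an assumption for illustration, and the tournament mechanics follow the description of Butz et al. (2003).

import random

def fittest_selection_rate(fitnesses, tau=0.4, trials=100_000):
    # Estimate how often each selection method picks the fittest rule in a
    # niche of equally-numerous rules.
    best = max(range(len(fitnesses)), key=fitnesses.__getitem__)
    total = sum(fitnesses)
    prop_hits = tourn_hits = 0
    for _ in range(trials):
        # Proportional (roulette-wheel) selection.
        point, cumulative = random.uniform(0, total), 0.0
        for i, f in enumerate(fitnesses):
            cumulative += f
            if cumulative >= point:
                prop_hits += (i == best)
                break
        # Tournament selection: each rule enters with probability tau and the
        # fittest entrant wins; the tournament is rerun if nobody enters.
        entrants = []
        while not entrants:
            entrants = [i for i in range(len(fitnesses)) if random.random() < tau]
        tourn_hits += (max(entrants, key=fitnesses.__getitem__) == best)
    return prop_hits / trials, tourn_hits / trials

# Five rules where the fittest holds 30% of the niche fitness (an assumption):
print(fittest_selection_rate([0.30, 0.25, 0.20, 0.15, 0.10]))
# roughly (0.30, 0.43): tournament selects the fittest rule noticeably more often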
10 The distribution of execution times appears to be not well approximated by a normal distribution. The non-parametric Mann-Whitney U test is thus more appropriate here than the t-test as it does not assume that data measurements necessarily belong to a normal distribution.
Table 6.6: Comparison between tournament and proportional selection with respect to
execution time in tabular and bar graph format. The figures given are the mean time
(and standard deviation) for an entire 10-fold cross validation experiment. The best time
for each task is shown in bold. According to the Mann-Whitney U test, the difference
between corresponding measurements produced under the two selection methods is
significant with a confidence level of 99% for all four mutation parameter settings.

Task Execution Time (sec)

(40,40) (50,30) (60,20) (70,10)


Tournament Selection
Mute (NS+S1) 28,262 (41,143) 20,014 (12,387) 36,271 (25,977) 66,702 (24,455)
Mute (NS+S2) 6,021 (750) 8,188 (1,628) 10,128 (1,361) 14,269 (1,816)
Bio 4,283 (526) 7,770 (6,361) 10,194 (4,997) 30,804 (22,973)
Traffic 4,793 (86) 5,140 (81) 5,384 (111) 5,178 (76)

Proportional Selection
Mute (NS+S1) 3,459 (133) 4,036 (279) 4,413 (244) 7,351 (4,326)
Mute (NS+S2) 3,135 (133) 3,568 (167) 4,152 (152) 4,920 (336)
Bio 2,718 (59) 3,428 (1,151) 3,499 (323) 4,160 (459)
Traffic 4,114 (114) 4,490 (83) 4,545 (95) 4,513 (59)

[Bar charts of the Table 6.6 execution times under tournament and proportional selection for the four (µadd, µdel) settings, one panel per data set (Mutagenesis NS+S1, Mutagenesis NS+S2, Biodegradability, Traffic).]

Under the above explanation, the large variance in the execution time ob-
served for the Mutagenesis and Biodegradability tasks under tournament
selection can be accounted for as being the consequence of a large varia-
tion in the matching times of highly fit rules for those tasks. Conversely,
the smaller variance observed for the Traffic data set—and also the closer
similarity between the execution times of the two selection methods—is ac-
counted for if there is only a small variance in the matching times of highly
fit rules for that task.

Conclusion

In this section we compared the performance of the system under tourna-


ment and proportional selection. In summary, we found that the use of
tournament selection led to better results with respect to predictive accu-
racy and was less sensitive to the mutation parameter settings; however, with
respect to efficiency, the use of proportional selection outperformed tourna-
ment selection. Which setting to use depends upon the user’s needs; how-
ever, the use of tournament selection on the (40,40) or (50,30) (µadd ,µdel )
setting provides a reasonable performance-efficiency trade-off for ILP tasks.

6.7 Summary

In this chapter, we focussed on evaluating various aspects and components


of Foxcs using ILP tasks. We began by comparing Foxcs to several well-
known ILP algorithms and found that it performed at a level comparable to
the ILP systems. This result verifies the inductive aspects of Foxcs, partic-
ularly its novel evolutionary component which implements mutation through


online, upwards and downwards refinement operations. Note that Foxcs is
at a slight disadvantage on classification tasks due to its online learning ap-
proach. The ILP systems are able to process data in batch mode, so, for
example, they can compute summary statistics which are error free with
respect to the available data. Foxcs, in contrast, processes data examples
one-at-a-time and keeps running estimates which are potentially subject to
error. Despite this, Foxcs was not significantly outperformed in general
and was able to achieve predictive accuracy values within 1–3% of the best
values obtained by the ILP systems to which it was compared.

We next sought to determine if the behaviour of Foxcs is consistent with the


Generality Hypothesis, which describes the bias in Xcs towards an increase
in the proportion of accurate and maximally general rules in the population.
We tested Foxcs on a synthetic data set, Poker, for which the class defini-
tions are common knowledge and found that the rules evolved by Foxcs did
appear to be consistent with the Generality Hypothesis. When this result is
considered along with the now considerable number of systems which
have extended the representational capability of Xcs (see Section 4.4), it
suggests that the biases identified by the Generality Hypothesis are essen-
tially independent of the rule language.

The final three sections focussed on evaluating specific components of Foxcs:


subsumption deletion, annealing the learning rate, and the method of selec-
tion of rules for reproduction. Subsumption deletion was found to be effec-
tive for increasing efficiency, while annealing improved the system’s predic-
tive accuracy compared to using a fixed learning rate. We believe that the
result for annealing extends to Xcs systems in general. The choice of selec-
tion method was found to lead to a performance–efficiency trade-off, with
the best predictive accuracies occurring under tournament selection, while
the most efficient execution times occurred under proportional selection.

The most significant drawback observed with Foxcs is that occasionally


some runs take a disproportionately long time to complete compared to the


average run time. We believe that these cases of lengthy execution times
are due to a proliferation of highly fit rules which have long matching times.
Several factors appear to influence the likelihood of occurrence of particu-
larly high above-average run times: the task itself, the selection method, and
the mutation parameters settings. If timely execution is important, then—
as the speed of execution is not related to the level of performance—multiple
instances of Foxcs can be run in parallel and the results taken from the
fastest system.

We conclude by remarking that many innovations in Xcs design could be


incorporated into Foxcs. For example, alternative fitness measures could
be used, such as bilateral accuracy (Butz et al., 2003) or classification ac-
curacy (Bernadó-Mansilla and Garrell-Guiu, 2003). In this thesis we have
focussed on demonstrating an effective design for Foxcs which is close to
the “standard” version of Xcs, and have left the evaluation of alternative
versions of Foxcs based on new Xcs components for future work.11

11 However, a version of Foxcs was implemented based on the modifications to Xcs suggested by Bernadó-Mansilla and Garrell-Guiu (2003) for handling supervised learning tasks. Unfortunately, it performed disappointingly in comparison to the unmodified version of Foxcs.
Chapter 7

Application to Relational
Reinforcement Learning

“Your true value depends entirely on what you are compared


with.”
—Bob Wells

In the previous chapter, Foxcs was applied to ILP tasks in order to evaluate
and assess the influence of individual system components, particularly the
novel inductive mechanism. In this chapter, the focus moves to an assess-
ment of the overall integrity of the system. Whereas ILP tasks essentially
pose a single challenge (that of generalisation, the system’s ability to clas-
sify previously unexperienced data items), reinforcement learning tasks, on
the other hand, usually pose a double challenge: not only must the system
generalise to previously unexperienced situations, it must also deal with “de-
layed reward”. An episode of a reinforcement learning task, in contrast to
that of a classification task, typically involves making a sequence of decisions
where the outcome of each decision not only affects the immediate reward
but also all subsequent rewards until the end of the episode. Thus, RRL
tasks challenge the credit assignment component of Foxcs in equal measure


to the rule discovery component, and require both components to integrate


effectively for the system to function well overall. Hence, in this chapter,
the performance and integrity of Foxcs is evaluated by application to RRL
tasks.

This chapter is divided into two sections. In Section 7.1 Foxcs is bench-
marked on blocks world tasks in order to empirically demonstrate the effec-
tiveness of the system on RRL tasks. In these experiments the number of
blocks in the environment is constant. In Section 7.2 we address the prob-
lem of how to learn scalable policies in blocks world with Foxcs. Scalable
policies are independent of the number of blocks in the environment. Both
sections contain comparisons to existing RRL results and systems.

7.1 Experiments In Blocks World

The aim of these experiments is to demonstrate that Foxcs can achieve op-
timal or near-optimal behaviour on RRL tasks. Comparisons to previously
published results obtained for other RRL systems are made. Note however,
that the number of results available for comparison is more limited for RRL
than ILP due to the field’s more recent development.

Materials

For these experiments, Foxcs was benchmarked on two bwn tasks, stack
and onab (Džeroski et al., 2001), under several values of n. The tasks were
introduced in Chapter 1 and are described again below for convenience.

stack The goal of this task is to arrange all the blocks into a single stack; the
order of the blocks in the stack is unimportant. An optimal policy for
this task is to move any clear block onto the highest block.

onab For this task two blocks are designated A and B respectively. The
goal is to place A directly on B. An optimal policy is to first move


the blocks above A and B to the floor and then to move A onto B.

Training episodes began with the blocks in randomly generated positions


(except that positions corresponding to goal states were discarded)1 and
continued until either a goal state was reached or an upper limit on the
number of time steps (set to 100) was exceeded. A reward of −1 was given
on each time step. Since the reward is negative it can be viewed as a cost
to be minimised. An optimal policy minimises the long term cost because
it solves the task in the least number of steps.
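A sketch of the episode structure just described follows; it is illustrative only, and the environment interface (env and its methods) is an assumption.

MAX_STEPS = 100
STEP_REWARD = -1        # a cost of one per action, to be minimised

def run_episode(env, policy):
    # Run one blocks-world episode: start from a random non-goal state and act
    # until the goal is reached or the step limit is exceeded.
    state = env.random_non_goal_state()
    total_reward, steps = 0, 0
    while not env.is_goal(state) and steps < MAX_STEPS:
        action = policy(state)
        state = env.apply(state, action)
        total_reward += STEP_REWARD
        steps += 1
    return total_reward, steps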

The rule language employed by Foxcs for stack and onab contained the
following predicates:

• mv/2 Moves a given block onto another block. The arguments are the
block to move and the destination block respectively.

• mv_fl/1 Moves the given block to the floor.

• ab/2 This predicate is only used for the onab task. It identifies the
two blocks A and B respectively.

• cl/1 The given block is clear. That is, there is no block on top of it.

• on/2 The arguments are two blocks. The first block is on the second.

• on_fl/1 The given block is on the floor.

• above/2 The arguments are two blocks belonging to the same stack.
The first block is above the second.

• highest/1 There is no block higher than the given block. Height is


measured by the number of blocks between the floor and the given
block.
1 The method of Slaney and Thiébaux (2001) was used to generate all random blocks world states. This method generates states according to a uniform distribution.
Table 7.1: The mode declarations for the blocks world tasks, stack and onab.

inequations( true ).

mode( a, [1,1], false, mv([block,-,+], [block,-,+]) ).


mode( a, [1,1], false, mv_fl([block,-,+]) ).
mode( s, [0,1], false, ab([block,-,+,!], [block,-,+,!]) ). % only for onab
mode( s, [1,7], false, cl([block,-,+]) ).
mode( s, [0,6], false, on([block,-,+,!], [block,-,+,!]) ).
mode( s, [1,7], false, on_fl([block,-,+]) ).
mode( b, [0,1], false, above([block,-,+,!], [block,-,+,!]) ).
mode( b, [0,1], false, highest([block,-,+]) ).

These predicates were declared as shown in Table 7.1.

Note that many combinations of the above predicates do not correspond to


legal states or state-action pairs in bwn . Unrestricted mutation could thus
waste resources by generating a large quantity of useless rules. For this rea-
son, rules were tested for validity after mutation. If they failed the validity
check, they were discarded and mutation was rerun on the same parent.
The validity tests are listed in Appendix D. Although not all invalid rules
are detected by the given tests, inspection after training revealed that the
number of invalid rules typically consisted of less than 2% of the population.
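The retry-on-invalid scheme amounts to the following loop. This is a sketch only: mutate and is_valid_rule stand in for the actual mutation operators and the validity tests of Appendix D, and the attempt cap is an assumed safeguard rather than part of the thesis design.

def mutate_until_valid(parent, mutate, is_valid_rule, max_attempts=1000):
    # Apply mutation to the parent; if the offspring fails the validity tests,
    # discard it and rerun mutation on the same parent.
    for _ in range(max_attempts):
        child = mutate(parent)
        if is_valid_rule(child):
            return child
    return None   # give up if no valid offspring is produced (assumed safeguard)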

Methodology

Each experiment involved 10 separate runs on a specific task. Each run


contains interleaved training and evaluation phases: after every 50 training
episodes, the system was evaluated on 100 test episodes during which all
learning behaviour was switched off. The same 100 start states were used
for the test episodes in any particular run. The evaluation measure is the
percentage of the 100 test episodes that were completed in the optimal
number of steps and all results are averaged over the 10 separate runs.
Because of the potentially disruptive effect of freshly evolved rules, evolution
was switched off prior to the completion of training (see Section 6.1.2).

Unless otherwise noted, all results for the Foxcs system were obtained under
the following settings. The system parameters were: N = 1000, ϵ = 10%,
α = 0.1, β = 0.1, ϵ0 = 0.001, ν = 5, θga = 50, θsub = 100, θdel = 20, and
δ = 0.1. The mutation parameters were: µrep = 25, µadd = 50, µdel = 25,
and µi = 0 for all i ∈ {c2v, v2a, v2c, a2v}. Proportional selection was
used, the learning rate was annealed, and GA subsumption was used but
not action set subsumption.

Recall that under action set subsumption, an accurate and sufficiently ex-
perienced rule may subsume all other rules in an action set that are spe-
cialisations of it. For the blocks world tasks, we observed that action set
subsumption sometimes disrupted the population due to the subsumption
of genuinely accurate rules by rules which were only temporarily accurate.
This problem with action set subsumption has been previously reported by
Butz (2004), who also remedied it by switching off the mechanism.

Covering was not used to generate the initial population. Under covering,
the tasks are “too easy” in the sense that the rules produced by covering
are sufficient to form an optimal policy without any need for mutation.
Thus, rather than use covering, the rule base was given an initial population
instead. The initial population consisted of the following two rules:

mv_fl(A) ← cl(A), on(A, B)

mv(A, B) ← cl(A), cl(B)

These two rules were selected because between them they cover the entire
state-action space of blocks world. However, despite the use of an initial pop-
ulation, covering was not disabled: if either of the above two rules were to be
deleted from the population, it is possible that the state-action space would
no longer be completely covered. Thus, covering was permitted. However,
observation of training runs showed that it was invoked only very infrequently.
[Line graphs: accuracy against training episode for the stack task, one panel per world size (BW4, BW5, BW6, BW7).]
Figure 7.1: The performance of Foxcs on the stack task in bwn for n = 4, 5, 6, 7 blocks. The graphs show the mean accuracy (solid line) and standard deviation (dotted line) over ten runs. Accuracy is the percentage of episodes completed in the optimal number of steps; it does not refer to κ. The point at which evolution was switched off is indicated by the vertical dashed line.

Results

Figures 7.1 and 7.2 show the performance of Foxcs on the stack and onab
tasks respectively in bwn|n∈{4,5,6,7} . After evolution was switched off, the
performance of Foxcs on the stack task was optimal in bw4 and bw5 and
near-optimal in bw6 and bw7 . By near-optimal we mean that the final
accuracy is ≥ 98%. For onab it was optimal in bw4 and near-optimal in
bw5 but not for bw6 and bw7 .
[Line graphs: accuracy against training episode for the onab task, one panel per world size (BW4, BW5, BW6, BW7).]
Figure 7.2: The performance of Foxcs on the onab task in bwn for n = 4, 5, 6, 7 blocks.

We note that once Foxcs no longer achieved optimal or near-optimal per-


formance (that is, for onab in bwn|n=6,7 ), the best accuracy was achieved
relatively early in the training process, after which performance degrades.
Unfortunately, we do not have an explanation. However, to determine if
the system could achieve better, or even optimal, performance for those
tasks with more training, we increased the number of training episodes to
100,000. Slight improvements were observed but the system was still not
able to achieve optimal performance.

We now list, in Table 7.2, the ten most numerous rules discovered by Foxcs
after one arbitrarily selected run on stack in bw4 . The rules are easy to
interpret diagrammatically, as shown in Figure 7.3. That the rules can be
examined conceptually like this demonstrates the comprehensibility aspect
of Foxcs’s first-order logic, rule-based approach.
Table 7.2: The ten most numerous rules discovered by Foxcs after one arbitrarily
selected run on stack in bw4. The second rule is illustrated in Figure 7.3.

Φ | p | ε | F | n
mv(A, B) ← cl(A), cl(B), on_fl(A), on(C, D), highest(B) | -1 | 0 | 0.999 | 199
mv(A, B) ← cl(A), cl(B), on_fl(C), on_fl(B) | -2.7103 | 0.0007 | 0.610 | 110
mv(A, B) ← cl(A), cl(B), on_fl(C), on(B, C), on_fl(D), highest(B) | -1.9008 | 0.0032 | 0.387 | 82
mv(A, B) ← cl(A), cl(B), on_fl(B) | -2.7102 | 0.0015 | 0.322 | 58
mv_fl(A) ← cl(A), on(A, B), on_fl(C), cl(D), on(D, C) | -2.7101 | 0.0006 | 0.500 | 50
mv_fl(A) ← cl(A), on(A, B), on_fl(C), on(B, D), above(A, B), highest(A) | -2.7103 | 0.0008 | 0.272 | 43
mv(A, B) ← cl(A), cl(B), on_fl(C), on(B, C), on_fl(D), cl(D), highest(B) | -1.9004 | 0.0006 | 0.168 | 36
mv(A, B) ← cl(A), cl(B), on(B, C), on_fl(D), above(A, D) | -1.9005 | 0.0007 | 0.171 | 36
mv(A, B) ← cl(A), cl(B), on_fl(C), on(B, C), cl(D), highest(B) | -1.9004 | 0.0005 | 0.164 | 35
mv(A, B) ← cl(A), cl(B), on(B, C), on_fl(D), highest(B), above(A, D) | -1.9005 | 0.0007 | 0.157 | 33

Of course, understanding the system’s overall policy is more complex as it is based on calculations


involving multiple rules. However, the ability to interpret individual rules is
useful for understanding specific situations. Also, an indicative policy can
frequently be obtained after isolating and interpreting specific rules, such as
rules with high fitness or numerosity.

Comparison

The experiments in this section are most directly comparable to those re-
ported by Džeroski et al. (2001, section 6.3.1) for Q-rrl and by Driessens
et al. (2001, section 4.1) for Rrl-tg. In those experiments they addressed
the same tasks in bwn|n∈{3,4,5} . In terms of the accuracy level attained,
Foxcs performed at least as well as Q-rrl and Rrl-tg, except for the
onab task in bw5 , where Rrl-tg was able to achieve optimal performance.
Q-rrl and Rrl-tg, however, required significantly fewer training episodes
to converge. As a rough guide, the number of training episodes required by Foxcs was approximately an order of magnitude greater than Rrl-tg and approximately two orders of magnitude greater than Q-rrl.
[Diagram omitted: the two block configurations described by the rule.]
Figure 7.3: All patterns of blocks in bw4 that match the rule mv(A, B) ← cl(A), cl(B), on_fl(C), on_fl(B).

Thus, for the

approximately two orders of magnitude greater than Q-rrl. Thus, for the
stack task for example, Q-rrl required tens of training episodes, Rrl-tg
required hundreds, and Foxcs, thousands.

The discrepancy in training time may be partly accounted for by the follow-
ing two factors. First, Q-rrl trains particularly quickly because after every
episode it processes the entire history of states ever visited during all pre-
ceding episodes. Q-rrl thus processes more information per episode than
Rrl-tg and Foxcs. Second, learning classifier systems train more slowly
than other reinforcement learning systems due to the stochastic nature of
the evolutionary component.

Although the results in this section show that Foxcs can solve RRL tasks,
the performance of Foxcs was limited by the scale of the task. Scale, mea-
sured as n for bwn , negatively affects both the performance and execution
time.
Further Experiments

Recall from Section 1.1.2 that |S| > O(n!) for bwn . It is not surprising then
that as n increases, the performance of Foxcs deteriorates. In particular,
each additional block tends to decrease:

• the maximum level of accuracy obtained,

and to increase:

• the number of training episodes required to reach a certain perfor-


mance level,

• the number of rules required, and

• the average matching time per rule.

Due to these factors, it becomes infeasible to apply Foxcs to bwn above


a certain value of n depending on the resources available. To illustrate
this point, we ran further experiments that compared the execution time
of Foxcs on bw4 , bw5 , bw6 , and bw7 for both stack and onab. For each
experiment, 20,000 training episodes were interleaved with 20,000 evaluation
episodes (after every 100 training episodes, 100 evaluation episodes were
run), for a total of 40,000 episodes in all.

Table 7.3 shows the results of the experiment. From the table, it appears
that the execution times increase super-linearly with respect to the number
of blocks. Although the results are not as bad as a greater than O(n!)
growth in the state space might lead one to predict, there is nevertheless an
alternative approach that completely avoids any extra time cost at all as n
increases. The approach, presented in the following section, allows policies
to be learnt that are independent of n. Training is performed under small
values of n to minimise training times and the resulting policies scale up to
arbitrary values of n.
Table 7.3: Mean execution time (and standard deviation) over 10 runs of 40,000 episodes
on the stack and onab tasks in bwn for n = 4, 5, 6, 7.

Task Execution Time (sec)

bw4 bw5 bw6 bw7


stack 1,578 (383) 7,136 (4,036) 17,336 (1,529) 35,655 (7,946)
onab 7,165 (708) 19,852 (1,987) 42,684 (5,940) 79,450 (7,859)

[Bar charts of the Table 7.3 execution times (sec) against the number of blocks, one panel each for stack and onab.]

Conclusion

The results in this section demonstrate that Foxcs was successfully able to
address RRL tasks, simultaneously solving the twin challenges of delayed
reward and generalisation. It achieved optimal or near-optimal performance
for both the stack and onab tasks in bwn environments containing up to
n = 7 and 5 blocks respectively. Unfortunately, as n, the number of blocks,
increases, the performance of Foxcs deteriorates and is thus, without mod-
ification, not a scalable approach. In Section 7.2 we address the problem of
how to learn scalable policies with Foxcs.

7.2 Scaling Up

Simple optimal policies exist for the stack and onab tasks that are indepen-
dent of the number of blocks and which can be expressed straightforwardly using the rule language given in Table 7.1.
Table 7.4: Optimal policies for stack and onab that are independent of the number of
blocks in the environment.

stack
mv(A, B) ← cl(A), highest(B)

onab
mv(A, B) ← cl(A), cl(B), ab(A, B)
mv_fl(A) ← cl(A), ab(B, C), above(A, B)
mv_fl(A) ← cl(A), ab(B, C), above(A, C)
mv_fl(A) ← cl(A), ab(A, B), above(A, B)
mv_fl(B) ← cl(B), ab(A, B), above(B, A)

Examples of these policies are shown in Table 7.4. If the policies can be learnt in bwn under a small
value of n, they can be transferred to bwn for arbitrary values of n, thus
providing the system with a way to scale up to large environments. Unfortu-
nately, these scalable policies cannot be learnt by Foxcs under a standard
approach. This is because there is variance in the Q values of the state-
action pairs matching any particular rule in the policy. The rules therefore
have non-zero error, ε, and thus poor fitness, F .

The P-Learning method (Džeroski et al., 2001), however, provides a solution


to the problem of generalising over optimal Q values having a non-zero
variance. We therefore investigate the use of P-Learning in conjunction
with Foxcs in this section. We first describe P-Learning, then give an
implementation of it for the Foxcs system, and finally demonstrate the
effectiveness of the implementation on stack and onab.
[Diagram omitted: three block configurations with Q(s, a) = −1, −2.7 and −1.9 respectively.]
Figure 7.4: Three (s, a) pairs from bw5 which match the rule mv(A, B) ← cl(A), highest(B) and their associated Q(s, a) value for the stack task under an optimal policy. The Q values were calculated under a reward of −1 per step and γ = 0.9.

7.2.1 P-Learning

The intention of P-Learning (Džeroski et al., 2001) is to learn a function
P : S × A → {0, 1} that represents the optimal policy, π∗, as follows:

P(s, a) = 1 if a = π∗(s), and 0 otherwise.    (7.1)

Because (7.1) depends on π ∗ (s), it is not in a form useful for learning


π ∗ (s). In P-Learning, π ∗ (s) is replaced with an approximation, such as
arg maxa Q(s, a). P-Learning thus proceeds in conjunction with Q-Learning
or some other method for approximating π ∗ . The P function can be stored
in a table but, like Q-Learning, P-Learning is most beneficial when combined
with generalisation. However, the generalisations formed under P-Learning
are not constrained by the Q function. They can thus represent π ∗ more
directly than those formed under Q-Learning.

As an illustration of why it can be helpful to generalise over P instead of


Q, consider the optimal policy for the stack task given in Table 7.4. The
rule not only compactly represents π ∗ , it is also optimal regardless of the
number of blocks in the world. Unfortunately, as Figure 7.4 illustrates,
Q(s, a) is not uniform over all (s, a) pairs matching the rule. It is thus not
a valid generalisation over Q. In the Foxcs system, for instance, the rule
has non-zero error, ε, and thus poor fitness, F . However, because the rule
is optimal, P (s, a) = 1 over all (s, a) pairs matching it. Hence, the rule is
a valid generalisation over P . In general, any rule is a valid generalisation
over the P function if all (s, a) pairs that match it satisfy a = π ∗ (s).

7.2.2 An Implementation of P-Learning for Foxcs

We now present an implementation of P-Learning for the Foxcs architec-


ture.2 This implementation is not limited to Foxcs but applies to Xcs in
general. In order to implement P-Learning, Foxcs maintains two separate
populations of rules, [P]Q and [P]P . The first population, [P]Q , functions as
usual, while the second population, [P]P , is used to estimate the P function.
The introduction of the second population also brings with it an additional
match set, [M]P , and an additional action set, [A]P , both composed of rules
belonging to [P]P . The rule update and discovery components operate on
the secondary population as described below.

Parameter Updates

On each time step, all required updates to [A]Q−1 and [A]Q are first performed.
Then, all members of [M]P are updated as follows. All updates
remain unchanged except for the prediction, p, and error, ε, parameters.
The prediction, p, and error, ε, are updated as given by equations (4.3) and
(4.5), except that P (s, a) estimates are used in the updates instead of the
target ρ defined in equation (4.4). Let s ∈ S be the current state. For each
rule j ∈ [M]P , the prediction parameter, pj , and the error parameter, εj ,

2 Previous implementations of P-Learning have been described in Section 3.2.3.
are updated according to:

pj := (1 − α)pj + αP (s, a), (7.2)


εj := (1 − α)εj + α(|P (s, a) − pj |), (7.3)

where a ∈ A(s). There are up to |A(s)| updates performed on pj and εj :


one for each a ∈ A(s) such that (s, a) matches j. During the updates, the
P function is estimated from the system prediction as follows:

P(s, a) = 1 if max_{u∈A(s)} P(u) − P(a) < ϵ, and 0 otherwise,    (7.4)

where P(a) is the system prediction for action a (see equation (4.2)) and ϵ
is a tolerance that allows for error in P(u) and P(a). In all experiments
involving P-Learning in this chapter, ϵ = 0.05.

Errors in P (u) and P (a) essentially act as a source of noise for P-Learning.
For this reason, we found that the best results were obtained when using, in
addition to a tolerance, ϵ, two separate learning rates: one, αQ, for standard
updates and another, αP , for the P-Learning updates, (7.2) and (7.3). We
set αQ = 0.1 and αP = 0.002. The αP value is very small to compensate for
the noisy signal; no annealing was performed on αP .

Note that for each time step t apart from the terminal step, the rules in [A]Q wait
until time step t + 1 before being updated because ρ depends on rules that
match state st+1 . In contrast, the rules in [M]P are updated immediately
because P (s, a) depends only on rules matching the current state.
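Putting equations (7.2)–(7.4) together, the update applied to the secondary population on each step can be sketched as follows. This is a sketch only: system_prediction stands for the computation of P(a) from [P]Q in equation (4.2), the rule fields are assumptions, and the ordering of the error and prediction updates follows standard Xcs practice since the thesis does not spell it out.

TOLERANCE = 0.05    # the tolerance in equation (7.4)
ALPHA_P = 0.002     # learning rate for the P-Learning updates

def p_target(state, action, actions, system_prediction):
    # Equation (7.4): label the action 1 if its system prediction is within the
    # tolerance of the best prediction available in the current state, else 0.
    best = max(system_prediction(state, u) for u in actions)
    return 1.0 if best - system_prediction(state, action) < TOLERANCE else 0.0

def update_match_set_p(match_set_p, state, actions, system_prediction):
    # Equations (7.2) and (7.3): update the prediction and error of every rule
    # in [M]P, once for each action of the current state that the rule matches.
    for rule in match_set_p:
        for action in actions:
            if rule.matches(state, action):
                target = p_target(state, action, actions, system_prediction)
                rule.error += ALPHA_P * (abs(target - rule.p) - rule.error)   # eq. (7.3)
                rule.p += ALPHA_P * (target - rule.p)                         # eq. (7.2)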

Rule Discovery

The rule discovery component functions identically for both populations,


except that for the secondary population rules are selected from [A]P in-
stead of [A]P−1 . Note that, if advantageous, it is possible to use different
rule languages for [P]Q and [P]P . Having different languages for [P]Q and
[P]P allows for differences in the generalisations required over Q and P . In
the following experiments, however, the languages for [P]Q and [P]P were
identical, with one exception, described later.

7.2.3 Experiments

The aim of this experiment is to demonstrate that Foxcs can learn optimal
policies that scale to arbitrary sized environments using the P-Learning
extension. The blocks world benchmarking tasks, stack and onab, were
again used.

Method

The P-Learning version of Foxcs, as described in Section 7.2.2, was used


in these experiments. The system used the parameter values given in sec-
tions 7.1 and 7.2.2. The training and evaluation procedure, however, was
changed as set out below.

First, the behavioural policy of the system depended on whether the sys-
tem was performing a training or evaluation episode. For training episodes,
the policy was determined by the primary population, [P]Q , as usual. Dur-
ing evaluation, however, the behavioural policy was determined by the sec-
ondary population, [P]P , containing the rules learnt through P-Learning.
As usual, all learning was suspended when the system was performing an
evaluation episode.

Second, during training the number of blocks was varied from episode to
episode, encouraging the P-Learning component to learn policies that were
independent of the number of blocks.3 The initial state of each episode
was randomly selected from bwn, where n was randomly selected from [3, 5].
3 In preliminary experiments, the number of blocks, n, was kept at a constant throughout training. This scheme resulted in P-Learning policies that were optimal with respect to bwn only. However, by varying n from episode to episode, any rule j ∈ [P]P that was not optimal in bwn over all values of n tested suffered from poor fitness, Fj.
Because episodes contain different numbers of blocks, the predicate
num_blocks/1 was added to the state signal and rule language for [P]Q (but
not to the rule language for [P]P ) so that rules can explicitly limit them-
selves to states containing a particular number of blocks. The argument of
num_blocks is an integer specifying the appropriate number of blocks.

Finally, during evaluation the number of blocks was also varied from episode
to episode. Before training, 100 test episodes were generated. The initial
state of each test episode was randomly selected from bwn where n was
randomly selected from [3, 16]. The system was thus tested on much larger
blocks world environments than it was trained on. The P-Learning com-
ponent was, however, not activated or evaluated until after 20,000 train-
ing episodes had elapsed. This delay allowed time for [P]Q to evolve so
that the system prediction, P (a), would be relatively stable by the time
P-Learning commenced.4 After this point, both populations, [P]Q and [P]P ,
were updated and evolved in parallel. Updates continued until the termina-
tion of training; evolution ceased for both populations after 45,000 training
episodes.
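The episode-generation scheme just described can be summarised in a short sketch. It is illustrative only; random_state stands for the uniform blocks-world state generator of Slaney and Thiébaux mentioned in Section 7.1.

import random

TRAIN_BLOCK_RANGE = range(3, 6)    # n drawn from [3, 5] for training episodes
TEST_BLOCK_RANGE = range(3, 17)    # n drawn from [3, 16] for evaluation episodes

def training_start_state(random_state):
    # Each training episode starts from a uniformly generated state of a world
    # whose size is itself drawn at random.
    n = random.choice(TRAIN_BLOCK_RANGE)
    return n, random_state(n)

def make_test_episodes(random_state, count=100):
    # The 100 test start states are generated once, before training begins,
    # and reused at every evaluation point.
    return [random_state(random.choice(TEST_BLOCK_RANGE)) for _ in range(count)]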

Results

Figure 7.5 shows the results of the experiments with the P-Learning version
of Foxcs on the tasks stack and onab. By the end of training, the system
was able to act near-optimally on both tasks. For onab the mean accuracy
after evolution was switched off was ≥ 98.6%. Recall that most evaluation
episodes occurred in worlds containing many more blocks than were present
in training episodes. The results thus demonstrate that Foxcs had learnt
genuinely scalable policies, which could handle blocks worlds of effectively
arbitrary size.

The graphs reveal that the system’s performance on onab can be divided into two phases.
4 The dependence of P-Learning on the system prediction, P(a), is given in equation (7.4).
[Graphs (a)–(d): accuracy and error against training episode for the stack and onab tasks in BW3 to BW16.]
Figure 7.5: The performance of Foxcs under P-Learning on the stack and onab tasks. Graphs (a) and (b) give performance in terms of accuracy, the percentage of episodes completed in the optimal number of steps. Graphs (c) and (d) give performance in terms of the extra number of steps, totaled over all 100 evaluation episodes, taken to reach the goal compared to an optimal policy. Each plot is averaged over ten runs. The dashed line indicates the point at which evolution was switched off.

First there is a phase during which performance rapidly
improves, accounting for most of the learning. This is followed by a sec-
ond, much longer phase during which performance improves more slowly.
Inspection of execution traces showed that although the system had dis-
covered optimal [P]P rules, some of them nevertheless had a non-zero error
estimate, ε. The error originated from occasional inaccuracy in the system
prediction, P(a), computed from the [P]Q rules, and slowed down the iden-
tification of optimal [P]P rules. This sensitivity to error in P (a) is remedied
by the introduction of the tolerance, ϵ, and the learning rate, αP, described
in Section 7.2.2.

An alternative solution may have been to disable evolution in [P]Q once


P-Learning commenced. This solution follows from the observation that
freshly evolved rules are the typical source of error in P (a). However, it
was not tried because we wished to show that P-Learning could occur in
parallel with normal system processes. Also, another drawback with that
approach is that it would make the system more sensitive to the point at
which P-Learning starts. If the error in P (a) is too great when P-Learning
commences then it is unlikely that the P-Learning component will be able
to discover optimal rules.

Comparison

The problem of learning scalable policies for stack and onab has been previ-
ously addressed by Džeroski et al. (2001) and Driessens et al. (2001) using
the P-Learning technique. Also, Driessens and Džeroski (2005) have bench-
marked several different RRL systems, Rrl-tg, Rrl-rib, Rrl-kbr, and
Trendi,5 on stack and onab where the number of blocks was allowed to
vary.6 Although the experiments in Driessens and Džeroski (2005) do not
specifically address the issue of learning scalable policies, to perform well
the systems must learn policies that translate to environments containing a
different number of blocks than those experienced under training. In par-
ticular, the number of blocks was varied between 3 and 5 during training
and the resulting policies were evaluated in blocks world environments con-
taining 3–10 blocks. In addition, the systems were allowed to learn from
samples of optimal trajectories in bw10 , so testing did not explicitly show
that the policies learnt scaled to environments containing more blocks than
experienced during training. Nevertheless, the ability to translate policies to
5 The systems have been described previously in Chapter 3.
6 Earlier results for some of these systems have been published by Driessens and Ramon (2003) and Gärtner et al. (2003a). However, the comparison in Driessens and Džeroski (2005) is the most recent and complete (in terms of the number of systems tested).
Table 7.5: The accuracy of scalable policies for the stack and onab tasks learnt by
previous RRL systems. Also given are the number of training episodes taken. Rrl-tg†
used P-Learning; Rrl-tg did not. Source for P-rrl is Džeroski et al. (2001); for
Rrl-tg†, Driessens et al. (2001); and for all other comparison systems, Driessens and
Džeroski (2005).

System     Accuracy (%)          No. of Training Episodes (×1,000)
           stack     onab        stack    onab
Foxcs      100%      ∼ 98%       20       50
P-rrl      100%      ∼ 90%       0.045    0.045
Rrl-tg†    100%      ∼ 92%       30       30
Rrl-tg     ∼ 88%     ∼ 92%       0.5      12.5
Rrl-rib    ∼ 98%     ∼ 90%       0.5      2.5
Rrl-kbr    100%      ∼ 98%       0.5      2.5
Trendi     100%      ∼ 99%       0.5      2.5

Nevertheless, the ability to translate policies to previously unexperienced environments would conceivably contribute when


scaling up to larger environments.

Table 7.5 summarises the results obtained by Džeroski et al. (2001), Driessens
et al. (2001) and Driessens and Džeroski (2005). Note that results were orig-
inally presented graphically, therefore all accuracy values less than 100% are
approximate and should thus be considered as indicative rather than abso-
lute. As can be observed in the table, Foxcs achieved accuracy results as
good as or better than these previous systems. Also shown in the table are
the number of episodes used to train the systems. Unfortunately for Foxcs,
all other systems except for Rrl-tg† trained on significantly fewer episodes.

It is important to emphasise that the systems evaluated in Driessens and


Džeroski (2005) have not learnt policies that are genuinely independent of
the number of blocks; they all required training in the largest of the test
environments, bw10 . Although Rrl-tg, given an adequate language bias,
is able to express genuinely scalable policies (similar to those given in Ta-
ble 7.4) without the use of P-Learning it will not do so.7 Unlike Rrl-tg,
Rrl-rib and Rrl-kbr are unable to even express scalable policies because
they do not support variables and do not form appropriate abstract expres-
sions which can represent the policies. The Trendi system, as a hybrid of
Rrl-rib and Rrl-tg, would also have problems expressing scalable policies. Thus, of all these systems, only Rrl-tg has the potential to learn policies that are genuinely independent of the number of blocks.

Finally, it is worth mentioning a key difference between the training pro-


cedure for Foxcs and systems reported in Driessens and Džeroski (2005):
the reward signal for Rrl-tg, Rrl-rib, Rrl-kbr, and Trendi was always
zero except when a goal state was reached, whereupon a positive reward
was given. Such a reward signal is sparse and not very locally informative.
Looping behaviour is particularly problematic because only rewards of zero
can be received until the loop is broken out of. Guided exploration, as de-
scribed by Driessens and Džeroski (2004), was thus provided to facilitate
the training process. In contrast, the reward signal for Foxcs effectively
assigned a −1 cost to each action performed and no guided exploration was
provided. Note that since looping increases the amount of negative reward
accumulated, the cost-per-action reward signal actively discourages looping
behaviour.
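To make the difference concrete, the two reward schemes can be sketched as follows; this is a minimal illustration only, and the function names and the exact magnitude of the goal reward are assumptions rather than details taken from any of the systems above.

    def goal_only_reward(reached_goal: bool) -> float:
        """Sparse signal used with the comparison systems: zero on every step,
        with a positive reward only on the step that reaches a goal state."""
        return 1.0 if reached_goal else 0.0

    def cost_per_action_reward() -> float:
        """Cost-per-action signal used with Foxcs: every action performed costs
        -1, so a looping agent keeps accumulating negative reward until the loop
        is broken and the goal is reached."""
        return -1.0

Under the first scheme a looping agent receives no informative feedback at all, whereas under the second its return degrades with every wasted step, which is the sense in which the cost-per-action signal discourages looping.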

Conclusion

The results in this section demonstrate that Foxcs, in conjunction with P-


Learning, was successfully able to learn scalable policies. It achieved optimal
performance on the stack task and near-optimal performance on onab. In
contrast, none of the comparison systems was able to learn genuinely scalable
7 Recall that Rrl-tg incorporates a tree building algorithm. As such it performs generalisation by constructing trees that define clusters of state-action pairs whose Q-values have a low variance over the cluster. Optimal scalable policies for stack and onab cannot be expressed so that they satisfy this constraint. Therefore, without P-Learning, Rrl-tg, like Foxcs, will not converge to a scalable policy.

policies for both of the evaluation tasks; in fact, most could not learn scalable
policies for either of the tasks.

7.3 Summary

In this chapter we demonstrated empirically that Foxcs is an effective ap-


proach for addressing RRL tasks. Foxcs was evaluated on the blocks world
environment and compared to several other RRL systems in a series of exper-
iments. In the first group of experiments, Foxcs was trained and evaluated
in several environments containing a constant number of blocks. Foxcs
was able to learn policies that performed as well as or better than previous
systems. As the number of blocks grows, however, performance deterio-
rates and training times increase. Therefore, a second set of experiments
addressed the problem of learning scalable policies that are independent of
the number of blocks. For these experiments Foxcs was augmented with
P-Learning (Džeroski et al., 2001) and was successfully able to learn scalable
policies for both of the evaluation tasks; a result that goes beyond the ca-
pabilities of existing systems. One drawback with Foxcs is that it requires
longer training times than the systems to which it was compared; however,
the cost of training becomes irrelevant when it can scale up, since acceptable
training times can be achieved by selecting a small environment to train in.
Chapter 8

Conclusion

“The outcome of any serious research can only be to make two


questions grow where only one grew before.”
—Thorstein Veblen

This chapter concludes the body of the thesis. We provide a summary of


the aims, approach, and outcomes of the thesis (Section 8.1), a statement
of the thesis contributions (Section 8.2), a discussion of the significance of
the work (Section 8.3) and some final ideas for future work (Section 8.4).

8.1 Summary

The purpose of this thesis was to investigate an approach to relational rein-


forcement learning based on the learning classifier system Xcs. Recall that
RRL is in general concerned with the generalisation problem in reinforce-
ment learning, and specifically focussed on methods that exploit structural
regularity exhibited by the task environment, particularly methods that use
the abstractive qualities of first-order logic. As RRL is a relatively recent
area, many design possibilities are yet to be investigated; in this thesis we
contributed a new approach, the Foxcs system, which was derived by “up-


grading” the representational aspects of Xcs. In order to upgrade Xcs


in a principled fashion, we borrowed and adapted relevant concepts and
techniques, such as consistency, refinement, and θ-subsumption, from logic
programming and inductive logic programming.

Advantages arising from the approach taken by Foxcs are listed below.
First, unlike some other RRL approaches, no restrictions are placed on the
MDP framework. Second, the system is model-free; it does not need a model
of the environment’s dynamics but instead gathers information through in-
teracting with the environment. This feature is of benefit when, as is often
the case for real-world tasks, the dynamics of the environment are unknown.
Third, rules are automatically generated and refined; candidate rules do not
have to be supplied by the user; however, if known, they can be used to
initialise the system. Fourth, domain specific bias is provided by the defini-
tion of the rule language; for most RRL environments, it is perhaps easier
to define an appropriate rule language than to provide other forms of bias
used in RRL systems, such as a distance metric or kernel function. Finally,
it produces rules that are comprehensible to humans.

As the inductive component of Foxcs is significantly novel, involving the


synthesis of evolutionary computation with online versions of ILP refine-
ment operators, a comparison with several specialist inductive algorithms
on ILP tasks was performed to empirically demonstrate the effectiveness of
the mechanism. We found that in general, Foxcs was not significantly out-
performed by the ILP algorithms. As these algorithms are supervised learn-
ing systems, and since feedback under the reinforcement learning paradigm
is less informative than under supervised learning, this result is compelling
evidence that the system is able to generalise effectively.

Other specific components of Foxcs evaluated included subsumption dele-


tion, annealing the learning rate, and the method of selection of rules for
reproduction. Despite the additional time complexity introduced by θ-
subsumption into the subsumption test, subsumption deletion was found to

significantly improve system efficiency. For ILP tasks, the use of annealing
was found to improve the system's predictive accuracy compared to using a
fixed learning rate; we believe that this result should extend to Xcs systems
in general when addressing classification tasks. Finally, the choice of selec-
tion method was found to lead to a performance–efficiency trade-off, with
the best predictive accuracies occurring under tournament selection, while
the most efficient execution times occurred under proportional selection.

In order to demonstrate the overall effectiveness of the system for RRL,


Foxcs was benchmarked on two blocks world tasks, stack and onab. For
both tasks, the system was found to perform at a level comparable to or bet-
ter than existing systems under similar conditions with respect to solution
optimality. We also demonstrated that Foxcs could learn policies that are
independent of the number of the blocks by incorporating the P-Learning
technique (Džeroski et al., 2001); at present only two other RRL systems
have been able to demonstrate this ability (Driessens et al., 2001; Cole et al.,
2003).1 Our main conclusion is therefore that an effective RRL system can
be obtained by following the approach adopted by Foxcs.

The main limitation of the approach regards efficiency. There are two as-
pects to efficiency: the number of training steps the system requires to reach
a certain performance level and the computational efficiency per step. Com-
putational efficiency is largely determined by the efficiency of the matching
operation, as matching is run on all members of [P] on each time step.
Matching rules is cheap in Xcs, but in Foxcs it is a more expensive oper-
ation. In order to improve the computational efficiency of Foxcs, a cache
was associated with each rule, storing the results of matching. This tech-
nique works well for ILP tasks, which usually involve only a few hundred
data items; for RRL tasks however, |S| is typically too large for caching to
be as effective.

1 Here we are not distinguishing between Džeroski et al. (2001) and Driessens et al. (2001) as the latter's system is a refinement of the former's.

With respect to training times, it was observed that Foxcs requires more
training than other RRL methods. This represents a negative aspect of
adopting the LCS approach: Foxcs must evaluate a large number of stochas-
tically generated rules, many of which will be sub-optimal. However, re-
search into Xcs is ongoing and recent developments have improved the Xcs
framework; Foxcs will benefit from many of these developments, including
any that reduce the amount of training required by Xcs and its derivatives.

A lesser drawback regards parameter tuning. The Xcs system has a large
number of free parameters and Foxcs introduces several more (in partic-
ular, the mutation parameters, µi ). Although guidelines for setting the
parameters of Xcs exist (see Butz and Wilson, 2002, for example), often
the best results require some fine-tuning. Parameter fine-tuning can be a
time-consuming process, particularly when tuning multiple parameters that
have interacting effects.

8.2 Contributions

The primary contribution of this thesis is the development and evaluation of


Foxcs, a relational reinforcement learning method based on a synthesis of
techniques and concepts from learning classifier systems and inductive logic
programming. A summary of Foxcs including its strengths, weaknesses,
and performance results has already been given in the previous section; we
emphasize here that benchmarking showed that Foxcs performs competi-
tively with existing ILP and RRL methods in terms of solution optimality.

We now itemise the contributions of the thesis on a chapter by chapter basis.

Chapter 4

• Gave a scheme for annealing the learning rate that is general to the
Xcs framework (Section 4.1.4).

• Identified a potential conflict between the Generalisation and Optimality hypotheses (Section 4.3.3).

Chapter 5

• Developed a new RRL system, Foxcs, based on Xcs (entire chapter).

• Developed an algorithm for the covering operation (Section 5.5.2).

• Developed algorithms, based on online upwards (generalisation) and


downwards (specialisation) ILP refinement operations, for use as the
mutation operations. The incremental nature of reinforcement learn-
ing means that refinements must be generated online, that is, with
respect to one state at a time; as far as the author is aware, this is the
first use of online ILP refinement operators (Section 5.5.3).

• As far as the author is aware, the first use of caching in a learning


classifier system to improve the efficiency of matching (Section 5.3.3).

Chapter 6

• Demonstrated the effectiveness of Foxcs’s generalisation processes by


comparison to specialist inductive algorithms. First application of a
reinforcement learning system to ILP tasks (Section 6.2).

• Presented evidence that the behaviour of Foxcs is consistent with the


Generalisation hypothesis (Section 6.3).

• Demonstrated that despite the increase in the time complexity of the


subsumption test in Foxcs, the use of subsumption deletion results in a net
efficiency gain to the system (Section 6.4).

• Demonstrated that annealing the learning rate improves the perfor-


mance of Foxcs on ILP tasks. This result could have implications for
Xcs on classification tasks generally (Section 6.5).

• Demonstrated that there is a performance-efficiency trade-off between


the use of tournament and proportional selection in Foxcs (Section 6.6).

Chapter 7

• Demonstrated the effectiveness of Foxcs on RRL tasks. Foxcs was


able to learn optimal or near optimal policies on the blocks world task,
stack, containing up to 7 blocks, and on the onab task containing up
to 5 blocks (Section 7.1).

• Gave an implementation of the P-Learning technique for Foxcs that


is general to Xcs (Section 7.2.2).

• Demonstrated that with the P-Learning extension, Foxcs could learn


optimal policies that were independent of the number of blocks in the
stack and onab tasks (Section 7.2.3).

8.3 Significance

Elsewhere in this thesis we have already discussed several desirable prop-


erties of the approach taken by Foxcs for addressing RRL. To reiterate:
it applies to MDPs generally, is model free, automatically induces general
rules “tabula rasa” (no initialisation with candidate rules required), uses
language bias, and produces relatively comprehensible rules. However, per-
haps its most compelling advantage is that it relies directly on the abstrac-
tive qualities of first-order logic itself, particularly the use of variables, to
represent generalisations. The effect of this was best demonstrated in Sec-
tion 7.2 where Foxcs, when extended with P-Learning, was, unlike any of
the comparison systems, able to learn genuinely scalable policies for both
of the evaluation tasks, which allowed the system to perform optimally or
near-optimally in versions of the task environments whose state spaces were
significantly larger than those experienced during training.2
2 The largest environment experienced during training was bw5, where |S| = 501; in contrast, the largest environment experienced during testing was bw16, where |S| = 1,290,434,218,669,921, a very substantial increase in the size of the state space.

For stack and onab, the blocks world tasks considered in Section 7.2, systems
that can directly express the optimal policies shown in Table 7.4 have the
advantage, because those policies capture the salient regularity precisely and
independently of the number of blocks, and thus scale up indefinitely. In
contrast, the comparison systems that generalised using distance metrics
and kernel functions could not represent these or equivalent policies and
therefore required training in the largest test environment, bw10.
Having to perform training in bw10 is a severe problem due to the size of
the state space; hence, these systems, unable to utilise the typical trial-and-
error approaches in reinforcement learning, resorted to learning from expert
traces exemplifying optimal behaviour. Learning a scalable policy, on the
other hand, neatly avoids the problem altogether.

Both Foxcs and Rrl-tg generalise using the abstractive qualities of first-
order logic. However, when combined with P-Learning, Rrl-tg was only
able to demonstrate that it could learn a scalable policy for stack and not
for the more difficult of the two evaluation tasks, onab. An explanation most
likely lies in Rrl-tg’s use of a regression tree to represent the value func-
tion, since there is reason to believe that rules, which are used by Foxcs, are
better matched for the needs of RRL than trees. Because they are modular,
changing one rule does not directly change another; indeed, this indepen-
dence is essential to the effective functioning of the evolutionary process
within Foxcs and other LCS systems. On the other hand, changing the
internal node of a first-order logic tree can directly affect descendants of
that node, which has negative consequences for incrementally grown trees.
In the initial stages of training, the value function is typically poorly es-
timated, and the tree that is induced will therefore contain poor splitting
decisions. Subsequent training incrementally grows the tree, refining it, but
not correcting any of the errors introduced higher up the tree. The result, as
noted by Driessens and Džeroski (2004), is that the performance of Rrl-tg
is sensitive to the order in which states are sampled, which may explain why
it was outperformed by Foxcs in Section 7.2 (see Table 7.5). Ideally, non-

leaf nodes should be refined during training; however, as mentioned above,


this will potentially affect the descendants of a refined node.3 In contrast,
rules are easily changed without affecting each other and thus appear to be
a more convenient representation for RRL.

Detractors of the LCS approach employed by Foxcs may point to its length-
ier training times. However, the cost of training becomes irrelevant when
scaling up: an acceptable training time can be straightforwardly produced
by selecting a small environment to train in. Furthermore, research into
Xcs is vigorous and ongoing and any discoveries that improve the efficiency
of Xcs would be expected to carry over to Foxcs.

The key issue is whether other task environments, particularly for real-world
problems, exhibit regularity that, like the blocks world tasks addressed in
this thesis, can be most effectively expressed with first-order (or higher-
order) logic using the abstractive property of variables. There is promise
here, as one of the guiding motivations in the historical development of first-
order logic, including its support for variables, has been to create a formal
language capable of expressing generalisations about the world around us in
a concise and convenient format. However, the issue remains open, providing
an important agenda for both the ILP and RRL communities to address in
future years.

3 It is possible to restructure incrementally grown, attribute-value trees so that the tree produced is independent of the order in which training examples are presented (Utgoff et al., 1997). However, to do so relies on an independence assumption: that the meaning of a test associated with a particular node is independent of the node's position in the tree. Unfortunately, this assumption is not satisfied by first-order logic trees because, due to the presence of variables, the meaning of a test at a particular node can depend on the tests in its ancestors.

8.4 Future Work

We conclude by identifying some directions for future work on the Foxcs


system. One direction would be to extend Foxcs with higher-order logic,
which would yield at least three advantages. First, implementations of
higher-order logic generally support an inbuilt type system. Such a fea-
ture is beneficial because type information effectively reduces the size of
the hypothesis space. Although Foxcs supports types (through the mode
declaration), it was necessary to hand code all type processing into the sys-
tem because Prolog, in its standard forms, has no inbuilt support for types.
Second, higher-order logic is more expressive than first-order logic. For ex-
ample, first-order logic supports variables that range over the individuals
of the domain; higher-order logic additionally supports variables that range
over sets of individuals in the domain. Third, it is easier to preserve the
recombinatorial effect of crossover within the semantics of higher-order logic
than those of first-order logic. No effective crossover operation has been de-
veloped for definite clauses over first-order logic because it is difficult to map
recombination at the syntactic level to recombination at the semantic level
of the logic. Unlike Xcs then, Foxcs did not utilise crossover. However, an
effective crossover operation has been developed for higher-order logic (Thie
and Giraud-Carrier, 2005).

Another extension to Foxcs would be to include special techniques for han-


dling numerical attributes. While most attributes in ILP and RRL tasks
are over nominal domains, some tasks contain attributes that are numeri-
cal (that is, attributes that are real- or integer-valued). When handling a
numerical attribute we are usually interested in finding a particular range
over the attribute; a straightforward nominal approach is inappropriate here
because it would represent a range as a collection of separate cases. One ap-
proach, called discretisation, would be to partition the original attribute into
a finite number of intervals in a preprocessing step, effectively converting a
numeric attribute into a nominal attribute (Van Laer et al., 1997; Blockeel

and De Raedt, 1997; Divina et al., 2003). However, this approach limits the
task framework; it assumes that the state space can be preprocessed and is
thus more suitable for ILP tasks than RRL tasks. Another approach would be to introduce interval predicates into the rule language, similar to the approach taken in (Wilson, 2000, 2001b,a; Stone and Bull, 2003; Dam et al.,
2005). Under this approach the interval predicates are generated through
covering and are subsequently modified by mutation operations that adjust
the parameters defining the interval. This approach has the advantage that
it preserves the system’s online, incremental approach and is therefore more
suitable for reinforcement learning tasks.
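As a rough illustration of the first of these approaches, an equal-width discretisation step might look like the following sketch; it is illustrative only, and the cited systems use their own, more elaborate discretisation schemes.

    def equal_width_cut_points(values, n_bins):
        """Internal cut points splitting the observed range into n_bins equally
        wide intervals."""
        lo, hi = min(values), max(values)
        width = (hi - lo) / n_bins or 1.0          # guard against a constant attribute
        return [lo + i * width for i in range(1, n_bins)]

    def discretise(value, cut_points):
        """Map a numeric value to a nominal interval label such as 'bin2'."""
        index = sum(value >= c for c in cut_points)
        return f"bin{index}"

    cuts = equal_width_cut_points([0.1, 0.4, 0.7, 2.5, 3.0], n_bins=3)
    print([discretise(v, cuts) for v in [0.1, 0.4, 0.7, 2.5, 3.0]])
    # ['bin0', 'bin0', 'bin0', 'bin2', 'bin2']

Because such a step assumes the attribute's range can be inspected in advance, it fits the preprocessing framework described above rather than the online setting.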

Finally, we finish by suggesting an alternative but related approach to


Foxcs: upgrading a Pittsburgh style LCS for RRL. Pittsburgh systems
search the policy space directly without appealing to the value function. Ap-
plied to RRL, they may therefore possess a similar advantage to P-Learning:
the ability to discover hypotheses that are unconstrained by the structure
of the value function. This approach has already been proposed by Muller
and van Otterlo (2005); however, as far as the author is aware, there has
only been a preliminary implementation to date. As yet, the ability of the
approach to compete with P-Learning has not been demonstrated and is
therefore worthy of further investigation.
Appendix A

First-Order Logic

The purpose of this appendix is to provide the reader with definitions and concepts from first-order logic (also commonly known as predicate calculus) that are relevant to the thesis. We begin by presenting the syntax of first-order logic, the rules that determine which expressions are well formed. Next we consider semantics, the basis
for determining if a syntactically correct expression is true or false. Then
a concept with significant and practical application to logic programming,
Herbrand interpretations, is introduced. Inference, which examines when a
set of premises justifies a conclusion, and induction, a key form of inference
that forms the basis of making generalisations using logic, follow. Finally, we
describe substitutions and inverse substitutions, concepts that are used by
the mutation and covering operations of Foxcs. Note that this appendix is
by necessity introductory; for a more expanded treatment of first-order logic
in the context of artificial intelligence and logic programming the reader is
referred to (Genesereth and Nilsson, 1987; Lloyd, 1987).


A.1 Syntax

Expressions over first-order logic are strings of characters that are arranged
according to a formal syntax. The syntax determines which expressions are
legal expressions over the logic and which are not. The syntax is based on
an alphabet, defined as follows:

Definition 9 An alphabet over first-order logic contains the following sym-


bols:

1. A set of constants, C.

2. A set of function symbols, F.

3. A set of predicate symbols, P.

4. A set of variables, V.

5. Five logical connectives: ¬, ∨, ∧, → and ↔.

6. Two quantifiers: ∃ (the existential quantifier) and ∀ (the universal


quantifier).

7. Three punctuation symbols: ‘(’, ‘)’ and ‘,’.

The usual assumptions are that F and P are finite and that C and V are
countable: there exist two sequences, c1 , c2 , . . . and v1 , v2 , . . ., which enumer-
ate all the elements of C and V respectively, possibly with repetitions.
Also, conventionally, the sets C, F, P and V contain strings of alpha-numeric
characters.1
1 The convention in Prolog, followed throughout most of this thesis, is that variables begin with an upper case letter, while constants and function and predicate symbols begin with a lower case letter. However, in this chapter only, variables are represented by lower case letters.

Each function and predicate symbol is associated with a non-negative inte-


ger, called its arity, which designates how many arguments are associated
with the particular function or predicate that the symbol represents. A
function, d : F ∪ P → Z, returns the arity for each symbol in F ∪ P. A
predicate with an arity of zero is sometimes called a nullary predicate or,
more commonly, a proposition. If each predicate in P is a proposition then
the resulting logic reduces to propositional logic.

Note that while the sets C, F, P and V can vary from alphabet to alphabet,
the remaining symbols are always the same. For this reason, it is convenient
to drop the logical connectives, quantifiers and punctuation symbols when
specifying an alphabet, their existence being implicitly assumed. Also, al-
phabets that vary only in the symbols that constitute V produce languages
that are semantically equivalent; thus, variable symbols are generally gener-
ated (automatically) when needed rather than explicitly specified as part of
the alphabet. Therefore, throughout this thesis we specify an alphabet for
first-order logic as a tuple ⟨C, F, P, d⟩.

Definition 10 A term over the alphabet ⟨C, F, P, d⟩ is either:

1. A constant symbol from C.

2. A variable.

3. An expression of the form f (τ1 , . . . , τn ), where f ∈ F, d(f ) = n and


each τi is a term.

In other words, a term is a constant, variable or functional expression. Note


that the arguments of a functional expression are themselves terms, which
includes other functional expressions; thus a functional term may be a nested
expression.
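As a concrete, purely illustrative reading of this definition, terms could be encoded as nested Python values: constants and variables as strings, distinguished here by capitalisation as in Prolog (recall from the footnote above that the appendix text itself writes variables in lower case), and a functional term f(τ1, . . . , τn) as a tuple whose first element is the function symbol.

    def is_variable(term):
        """Variables are capitalised strings under this illustrative encoding."""
        return isinstance(term, str) and term[:1].isupper()

    def is_ground(term):
        """A term is ground if no variable occurs in it, however deeply nested."""
        if isinstance(term, tuple):                  # functional term f(t1, ..., tn)
            return all(is_ground(arg) for arg in term[1:])
        return not is_variable(term)                 # constant or variable

    print(is_ground(("f", "a", ("g", "b"))))         # True:  f(a, g(b)) has no variables
    print(is_ground(("f", "X", "a")))                # False: f(X, a) contains a variable

The same encoding is reused in the later sketches in this appendix.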

Definition 11 A sentence (also called a well-formed formula) over the al-


phabet ⟨C, F, P, d⟩ is either:

1. An expression of the form φ(τ1 , . . . , τn ), where φ ∈ P, d(φ) = n and


each τi is a term. Such an expression is called an atomic sentence or
atom.

2. An expression of the form (¬Φ), (Φ ∨ Ψ), (Φ ∧ Ψ), (Φ → Ψ) or (Φ ↔


Ψ), where Φ and Ψ are sentences over ⟨C, F, P, d⟩. Such an expression
is called a logical sentence.

3. An expression of the form (∃x Φ) or (∀x Φ) where x is a variable and


Φ is a sentence over ⟨C, F, P, d⟩. Such an expression is called a quan-
tified sentence.

Note that it is convenient to write atomic sentences involving mathematical


relations in infix rather than prefix form; for example, (τ1 < τ2 ) instead of
<(τ1 , τ2 ). Also note that it can be convenient to write the sentence (Φ → Ψ)
as (Ψ ← Φ); this latter form is used frequently throughout the thesis.

A variable may occur as a term in a sentence without being associated with


an enclosing quantifier. If this is the case then the variable is known as a free
variable; otherwise the variable is associated with an enclosing quantifier and
is known as a bound variable. A sentence that contains no free variables is
called a closed sentence, while a sentence that contains no variables at all is
called a ground sentence. Note that variables may only refer to the objects
in the domain of discourse (this domain is described in the following section)
and not to functions or relations. In contrast, variables in higher-order logic
may also refer to functions and relations.

Parentheses in sentences are required in order to prevent ambiguity; however,


many parentheses can be removed by imposing precedence on the connec-
tives and quantifiers. A typical precedence order is given in Table A.1.

Definition 12 A first-order language over an alphabet is the set of all sen-


tences over the alphabet.

Table A.1: Precedence of connectives and quantifiers ordered from highest to lowest.

¬  ∀  ∃
→  ↔

An important type of first-order sentence is the clause: a disjunction of


universally quantified literals, where a literal is an atom or a negated atom.

Definition 13 A clause is a sentence of the form:

∀x1 . . . ∀xk (ψ1 ∨ . . . ∨ ψn )

where each ψi is a literal and x1 , . . . , xk are all the variables occurring in


ψ1 ∨ . . . ∨ ψn .

It is convenient to represent a clause with a more succinct notation. The


expression:
φ1 , . . . , φn ← ψ1 , . . . , ψm (A.1)

denotes the clause:

∀x1 . . . ∀xk (φ1 ∨ . . . ∨ φn ∨ ¬ψ1 ∨ . . . ∨ ¬ψm ) (A.2)

where φi and ψi are atoms and x1 . . . xk are all the variables occurring in
the clause. This notation is derived from the following equivalence:

∀x1 . . . ∀xk (φ1 ∨ . . . ∨ φn ∨ ¬ψ1 ∨ . . . ∨ ¬ψm ) =


∀x1 . . . ∀xk (φ1 ∨ . . . ∨ φn ← ψ1 ∧ . . . ∧ ψm )     (A.3)

This equivalence, in turn, follows once the semantics of the connectives is


given. The succinct notation for a clause as given in (A.1) is obtained from

the right hand side of (A.3) by making the universal quantifiers and parentheses implicit and replacing the disjunction and conjunction symbols with commas.
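For instance, with purely illustrative predicate symbols, the expression

    p(x, y) ← q(x, z), r(z, y)

denotes the clause

    ∀x ∀y ∀z (p(x, y) ∨ ¬q(x, z) ∨ ¬r(z, y)).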

Clausal form is a subset of first-order logic and thus may appear to be


restrictive. However, once the semantics of first-order logic is established, it
can be shown that any sentence in a first-order language can be rewritten
as an equivalent set of clauses (a procedure that performs this conversion is
given in Genesereth and Nilsson, 1987).

An important subset of the clauses are the definite clauses.

Definition 14 A definite clause is a clause of the form:

φ ← ψ1 , . . . , ψ m .

The head (or consequent) of the clause is φ while the body (or antecedent)
is ψ1 , . . . , ψm .

From the definition it can be seen that a definite clause is a clause that
contains exactly one positive literal (a positive literal is an atom, a negative
literal is a negated atom): its head. The body of a definite clause may be
empty, in which case the clause is known as a fact. Unlike general clausal
form, definite clauses do restrict first-order logic; nevertheless, they largely
form the basis for logic programming because they are still quite expressive
and they can be processed more efficiently than general clauses.

A.2 Semantics

Semantics is concerned with attaching meaning to the expressions of a lan-


guage; in the case of first-order logic this is achieved by deciding if a syntac-
tically well-formed sentence is true or false. From the preceding section it is
clear that first-order logic is symbolic: well-formed sentences are constructed

by combining sets of symbols according to certain rules. The symbols, of


course, are not significant in themselves; rather, their role (specifically, the
role of constants and variables) is to refer to the elements of some domain,
called the domain of discourse, just as the symbols ‘1’, ‘2’, ‘3’ and so on
are generally understood to refer to the elements of the natural numbers.
Importantly, there is great flexibility in the type of domain that can be rep-
resented; for example, it may be concrete or abstract, real or fictional, finite
or infinite.

Just as it is the role of the constants and variables to refer to the elements
of the domain, the function and predicate symbols are likewise used to rep-
resent functions and relations over the domain respectively. A function over
a domain maps elements or groups of elements over the domain to other
elements. A relation, on the other hand, effectively groups elements into
tuples; note that singleton relations, which group the elements in tuples of
size 1, are not only permissible but quite frequent. Usually only a small
subset of the functions and relations over the domain are of interest; each
of these will be assigned a function or relation symbol while the rest are
ignored.

From this understanding of the role of the symbols in first-order logic, it


follows that a term corresponds to an element of the domain of discourse.
A well-formed sentence, in turn, corresponds to an assertion, which may
be either true or false, about the elements of the domain that are referred
to by the sentence’s terms. Determining the truth of the sentence involves
a number of steps: first a mapping between the terms and predicates of a
language and the elements and relations of a domain is specified; next, the
truth of atomic sentences is determined by reference to this mapping; and
last, the truth of logical and quantified sentences follows by considering the
connectives and quantifiers involved.

Formally, the mapping between symbols and domain elements, functions


and relations is given by an interpretation:

Definition 15 Let |I| be the domain of discourse. An interpretation I of


a first-order language over ⟨C, F, P, d⟩ contains the following mappings:

1. A mapping from constants to the elements of the domain of discourse,


IC : C → |I|.

2. A mapping for each function symbol f ∈ F that defines a function


over the elements of the domain of discourse, If : |I|^d(f) → |I|.

3. A mapping for each predicate symbol φ ∈ P that defines a relation over


the domain of discourse, Iφ : |I|^d(φ) → {true, false}.

Here, |I|^n means the set of all n-tuples over |I|; for example, |I|^3 = |I| × |I| × |I|.

Variables are treated separately from the constant, function and predicate
symbols.

Definition 16 Let |I| be the domain of discourse. A variable assignment


V is a mapping from the variables of a first-order language to the elements
of the domain of discourse; that is, V : V → |I|.

Terms over an alphabet can be assigned to elements of the domain of dis-


course by combining an interpretation and a variable assignment.

Definition 17 Let I be an interpretation of a first-order language and V be


a variable assignment for the variables of the language. The term assignment
TIV corresponding to I and V is a mapping from terms to the elements of
the domain of discourse, |I|, such that:

1. If τ is a constant then TIV (τ ) = IC (τ ).

2. If τ is a variable then TIV (τ ) = V (τ ).

3. If τ is a term of the form f(τ1, . . . , τn) then TIV(τ) = If(τ1′, . . . , τn′) where τi′ = TIV(τi).

We are now ready to consider a notion of truth called satisfaction. Deter-


mining satisfaction is simplest for the atomic sentences: each term in the
atom is mapped to an element of the domain of discourse using a given term
assignment, if the corresponding relation is true according to the interpre-
tation associated with the term assignment then the sentence is satisfied,
otherwise it is not. The truth of any non-atomic sentence follows directly
once the connectives and quantifiers can be handled.

Definition 18 A sentence Φ over a first-order logic language is satisfiable with respect to a term assignment TIV (written ⊨I Φ[v]) if it meets one of the following conditions:

1. If the sentence is atomic, Φ = φ(τ1, . . . , τn), then ⊨I Φ[v] if and only if Iφ(τ1′, . . . , τn′) = true, where τi′ = TIV(τi); otherwise ⊭I Φ[v].

2. If the sentence is a logical sentence then:

(a) ⊨I (¬Φ)[v] if and only if ⊭I Φ[v].

(b) ⊨I (Φ1 ∨ . . . ∨ Φn)[v] if and only if ⊨I Φi[v] for some i, 1 ≤ i ≤ n.

(c) ⊨I (Φ1 ∧ . . . ∧ Φn)[v] if and only if ⊨I Φi[v] for all i = 1, . . . , n.

(d) ⊨I (Φ → Ψ)[v] if and only if ⊭I Φ[v] or ⊨I Ψ[v].

(e) ⊨I (Φ ← Ψ)[v] if and only if ⊨I Φ[v] or ⊭I Ψ[v].

(f) ⊨I (Φ ↔ Ψ)[v] if and only if ⊨I (Φ → Ψ)[v] and ⊨I (Φ ← Ψ)[v].

3. If the sentence is logically quantified then:

(a) ⊨I (∃x Φ)[v] if and only if there exists ι ∈ |I| such that ⊨I Φ[u] where U(x) = ι and U(y) = V(y) for y ≠ x.

(b) ⊨I (∀x Φ)[v] if and only if for all ι ∈ |I| it is the case that ⊨I Φ[u] where U(x) = ι and U(y) = V(y) for y ≠ x.
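The quantifier-free part of this definition can be read operationally; the sketch below evaluates a ground sentence against an interpretation given simply as the set of its true ground atoms. The encoding and names are illustrative and are not part of Foxcs.

    def satisfies(interpretation, sentence):
        """Evaluate a ground, quantifier-free sentence against an interpretation
        represented as a set of true ground atoms, e.g. {("on", ("a", "b"))}."""
        kind = sentence[0]
        if kind == "atom":                        # ("atom", "on", ("a", "b"))
            _, predicate, args = sentence
            return (predicate, args) in interpretation
        if kind == "not":
            return not satisfies(interpretation, sentence[1])
        if kind == "or":
            return any(satisfies(interpretation, s) for s in sentence[1:])
        if kind == "and":
            return all(satisfies(interpretation, s) for s in sentence[1:])
        if kind == "implies":                     # true unless antecedent holds and consequent fails
            return (not satisfies(interpretation, sentence[1])
                    or satisfies(interpretation, sentence[2]))
        raise ValueError(f"unknown connective: {kind}")

    I = {("on", ("a", "b")), ("cl", ("a",))}
    print(satisfies(I, ("implies", ("atom", "cl", ("a",)), ("atom", "on", ("a", "b")))))  # True

The quantified cases would be handled over a finite domain by iterating the possible variable assignments, exactly as items 3(a) and 3(b) describe.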

Note that satisfaction is relative: the satisfaction of a sentence is dependent


on a specific interpretation and variable assignment. This is potentially

problematic: we might not know which interpretation and variable assign-


ment an author had in mind when we are faced with determining the truth
of a sentence in first-order logic. Thus, in the following it will generally be
necessary to consider all possible interpretations and variable assignments.

Note that the problem is alleviated slightly for some sentences. In particular, a variable assignment has no effect on the satisfaction of a sentence that contains no free variables—that is, ground and closed sentences. If a sentence is ground then it contains no variables, and consequently no appeal is made to the variable assignment in order to determine satisfaction. Alternatively, if a sentence is closed then each variable occurs within the scope of a quantifier; thus, it is the quantifier rules, rather than the variable assignment, that determine how each variable is bound.

The concept of a model distinguishes those interpretations that satisfy a


sentence from those that do not.

Definition 19 Let Φ be a sentence over a first-order language and let I be


an interpretation. Then I is a model of Φ if and only if I satisfies Φ for all
variable assignments.

In this thesis we are only concerned with closed sentences; the definition
of a model of a closed sentence can be simplified since their satisfaction or
otherwise is independent of any variable assignment.

Definition 20 Let Φ be a closed sentence over a first-order language and


let I be an interpretation. Then I is a model of Φ if and only if I satisfies
Φ.

A closed sentence is satisfiable if and only if there exists some interpretation


that satisfies it; otherwise it is unsatisfiable.

The concept of a model naturally extends to a set of closed sentences.



Definition 21 Let Γ = {Φ1 , Φ2 , . . .} be a set of closed sentences over a


first-order language and let I be an interpretation. Then I is a model of Γ
if and only if I is a model for each sentence Φi ∈ Γ.

A set of closed sentences Γ is satisfiable (synonymously, consistent) if and


only if there is an interpretation that satisfies every sentence belonging to
Γ; otherwise, it is unsatisfiable (inconsistent).

A.3 Herbrand Interpretations

We now introduce an important class of interpretations, the Herbrand in-


terpretations, which simplify the space of interpretations that need to be
considered when testing for satisfiability.

Definition 22 The Herbrand universe over an alphabet ⟨C, F, P, d⟩, denoted HU[⟨C, F, P, d⟩], is the set of all terms such that each term is either:

1. A constant symbol from C.

2. An expression of the form f(τ1, . . . , τn), where f ∈ F, d(f) = n and each τi ∈ HU[⟨C, F, P, d⟩].

That is, the Herbrand universe over a given alphabet consists of all the
ground terms—terms that do not contain variables—over the alphabet.

Definition 23 The Herbrand base over an alphabet ⟨C, F, P, d⟩, denoted HB[⟨C, F, P, d⟩], is the set of all atomic sentences φ(τ1, . . . , τn), such that φ ∈ P, d(φ) = n and each τi ∈ HU[⟨C, F, P, d⟩].

The definition states that the Herbrand base over a given alphabet consists
of the set of all atomic sentences over the alphabet such that each argument
is a ground term over the alphabet; more simply put, it is the set of ground
atomic sentences over the alphabet.
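For a finite alphabet the two sets can be enumerated directly, as in the sketch below; the names are illustrative, and the universe has to be truncated at some nesting depth whenever a function symbol is present, since it is then infinite.

    from itertools import product

    def herbrand_universe(constants, functions, depth):
        """Ground terms over the alphabet, built up to the given nesting depth."""
        terms = set(constants)
        for _ in range(depth):
            terms |= {(f,) + args
                      for f, arity in functions.items()
                      for args in product(terms, repeat=arity)}
        return terms

    def herbrand_base(constants, functions, predicates, depth):
        """Ground atoms: every predicate applied to tuples of ground terms."""
        universe = herbrand_universe(constants, functions, depth)
        return {(p,) + args
                for p, arity in predicates.items()
                for args in product(universe, repeat=arity)}

    # Constants {a, b}, one unary function symbol f, one binary predicate on:
    base = herbrand_base({"a", "b"}, {"f": 1}, {"on": 2}, depth=1)
    print(len(base))   # 16 atoms: on applied to all 4 x 4 pairs over {a, b, f(a), f(b)}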

Definition 24 An Herbrand interpretation over an alphabet ⟨C, F, P, d⟩ is an interpretation such that the domain of discourse is the Herbrand universe, |I| = HU[⟨C, F, P, d⟩]; and which contains the following mappings:

1. A mapping IC : C → C, which maps each constant to itself.

2. A mapping If : HU[⟨C, F, P, d⟩]^d(f) → HU[⟨C, F, P, d⟩] for each function symbol f ∈ F.

3. A mapping Iφ : HU[⟨C, F, P, d⟩]^d(φ) → {true, false} for each predicate symbol φ ∈ P.

Note that the IC mapping is unambiguous and in practice the If mappings,


which usually correspond to known functions, are too. Thus, when specify-
ing an Herbrand interpretation the IC and If mappings are conventionally
left implicit. Also note that the Iφ mappings partition the Herbrand base
into two disjoint subsets: the set of ground atoms which are satisfied by
the interpretation and the remaining set of atoms which are not. This ob-
servation, together with the convention of leaving the IC and If mappings
implicit, allows an Herbrand interpretation to be conveniently specified as
a subset of the Herbrand base (usually the satisfiable atoms, as this set is
typically smaller than the set of unsatisfiable atoms).

The concept of a model naturally applies to Herbrand interpretations.

Definition 25 Let Γ = {Φ1 , Φ2 , . . .} be a set of closed sentences over a


first-order language and let I be an Herbrand interpretation. Then I is an
Herbrand model of Γ if and only if I is a model for each sentence Φi ∈ Γ.

The following proposition shows that checking for the existence of a model
can be performed by considering the Herbrand interpretations only.

Proposition 1 Let Γ = {Φ1 , Φ2 , . . .} be a set of clauses over a first-order


language. Then Γ has a model if and only if Γ has an Herbrand model.

For a proof see Nienhuys-Cheng and de Wolf (1997) or Lloyd (1987). Note
the mild restriction that Γ be a set of clauses; that is, Γ cannot contain
non-clausal sentences.

The significance of the above proposition is that satisfiability can be de-


termined from just the Herbrand interpretations alone; which greatly re-
duces the number of interpretations that need to be considered. It is also
handy because Herbrand interpretations are conveniently represented using
the given first-order language; non-Herbrand interpretations require addi-
tional machinery because their domain of discourse is something other than
the syntactic elements of the language itself.

A.4 Inference

Inference is the process of deriving conclusions from premises. There are


many forms of inference; however, central to all forms is the concept of entail-
ment. Intuitively, to say that a sentence Φ (or set of sentences {Φ1 , Φ2 , . . .})
entails the sentence Ψ means that whenever Φ (or {Φ1 , Φ2 , . . .}) is true then
Ψ must be true also. More formally:

Definition 26 Let Γ = {Φ1 , Φ2 , . . .} be a set of sentences over a first-order


language and let Ψ be a sentence over the same language. Then Γ entails Ψ
(written Γ |= Ψ) if and only if every model of Γ is also a model of Ψ.

Entailment is the formal basis upon which it can be decided if a conclusion


follows from a set of premises; inference, on the other hand, refers to the
actual process of showing that a conclusion follows from the given premises.
Traditionally, when making an inference, one appeals to established infer-
ence rules, such as modus ponens and modus tollens,2 and in doing so, creates
2 The modus ponens rule states that if a set of premises consists of the sentences Φ → Ψ and Φ then Ψ can be inferred. The modus tollens rule is the complement of modus ponens; it states that if the premises are Φ → Ψ and ¬Ψ then ¬Φ can be inferred.

a derivation that ensures that the premises entail the conclusion. More re-
cently, inference algorithms have been developed, which allow a systematic
approach to inference. There are three types of inference algorithms: for-
ward chaining, backward chaining (backward chaining forms the basis for
logic programming environments, including Prolog) and resolution.3

Two desirable qualities of inference procedures are soundness and complete-


ness. An inference procedure is sound if and only if all sentences which
it can derive from a set of sentences Γ are also entailed by Γ. Modus po-
nens, modus tollens, forward and backward chaining and resolution are all
examples of sound inference rules and processes. An inference procedure is
complete, on the other hand, if and only if it can derive any sentence that
is entailed by Γ. Unfortunately, none of the above techniques are complete;
however, resolution exhibits a weaker form of completeness called refutation
completeness. An inference procedure is refutation complete if and only if
it can always derive a contradiction given a set of sentences Γ that are in-
consistent; in other words, although it cannot derive every conclusion that
is entailed by a set of premises, it can decide if a given conclusion is entailed
by the premises.

A.5 Induction

If every crow you had ever seen had been black then you might decide that
all crows are black. This is an example of a kind of inference called induction,
which involves the derivation of a general law from particular instances. In-
duction contrasts against deduction, which refers to the derivation of specific
conclusions by reference to a general law or principle; modus ponens and
modus tollens are examples of deductive reasoning. Both forms of reasoning
are very significant: many important mathematical theorems are justified
using deductive reasoning, while induction can be seen as the foundation of
3 An overview of these three types of algorithms is provided by Russell and Norvig (2003).

scientific enquiry.

Induction proceeds from a set of sentences which represent beliefs about


a task environment. The beliefs are divided into two parts: the data, ∆,
containing a set of facts; and the background theory, Γ, which is a set of
clauses. These sentences must be consistent with each other, otherwise no sensible conclusions can be drawn. Furthermore, the background theory should not entail the data; that is, Γ ⊭ ∆.

Definition 27 An hypothesis Φ is an inductive conclusion of a given back-


ground theory and data set if and only if:

1. The hypothesis is consistent with the background theory and data:

Γ ∪ ∆ ⊭ ¬Φ

2. The hypothesis explains the data:

Γ ∪ Φ ⊨ ∆
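As a small worked instance of this definition, with purely illustrative predicate and constant symbols, let Γ = {crow(c1) ←, crow(c2) ←}, ∆ = {black(c1), black(c2)} and Φ = (black(x) ← crow(x)). Then Γ ∪ ∆ ⊭ ¬Φ, so the first condition holds, and Γ ∪ Φ ⊨ ∆, so the second condition holds; hence Φ is an inductive conclusion. Note also that Γ ⊭ ∆ on its own, as required of the background theory.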

Unlike deductive reasoning, the conclusions produced by induction are not


necessarily sound. That is, although an inductive hypothesis must be consis-
tent with the data and background theory, it is not required to be entailed by
them. For instance, the hypothesis that “all swans are white” was consistent
with European knowledge at the beginning of the 18th century; however, by
the end of that century the hypothesis was no longer consistent with the ev-
idence, as black swans had been discovered in Australia. The point is that
there was an interpretation, under which swans are white in Europe and
black in Australia, that was a model for the data available at the beginning
of the 18th century but which was not a model for the original hypothesis;
thus, the inductive hypothesis was not entailed by early European knowl-
edge. The only time in which an inductive hypothesis is entailed by the
premises is when the data exhaustively describes all possibilities covered by

the hypothesis; in the case of the swan hypothesis above, the colour of every
swan in existence would need to be known.

One especially noteworthy point is that for any background theory and data
set there are generally a great multiplicity of inductive hypotheses. Is it
possible to discriminate between them for the purpose of preferring one over
another? And is there any justifiable basis for doing so? We shall briefly
indicate some approaches to the first issue; however, the second issue moves
into philosophical territory, placing it beyond the scope of this thesis.

One way of potentially discriminating between competing hypotheses is to


use model maximisation techniques, which select the most general hypothesis
available. Let Φ and Ψ be two inductive hypotheses for the same background
theory and data set. If the set of models for Φ is a proper superset of the
models of Ψ then model maximisation would select Φ over Ψ. Note that if
the hypotheses are incompatible—that is, their models cannot be ordered
by the proper subset relation—then the technique cannot be used. Another
way to choose between competing hypotheses is by application of Ockham's
razor: select the “simplest” hypothesis. Of course “simplest” can have many
definitions; an obvious example in the context of first-order logic is that one
sentence is simpler than another if it contains fewer atoms.

The use of language bias is another way to exclude or order potential hy-
potheses. One type of language bias, conceptual bias, results from the use
of a particular alphabet, which excludes the possibility of hypotheses that
cannot be expressed as sentences over that alphabet. Another type, logical
bias, restricts hypotheses to a particular logical structure, such as the defi-
nite clauses (which is a bias present in Foxcs). Note that while it is difficult
to formally justify specific biases, in practice it is impossible to invent in-
ductive methods, logical, statistical, or otherwise, that are completely bias
free.4

4 Mitchell (1997) provides a discussion of bias in the field of machine learning generally.

A.6 Substitution

We conclude this chapter by defining the concept of a substitution and an


inverse substitution. These concepts are employed by Foxcs within its
covering and mutation operations in order to generalise and specialise its
rules.

A substitution is an association between variables and terms. The intention


behind making the association is to enable the derivation of a new clause
from an existing clause by replacing the variables in the original clause with
their associated terms. More formally:

Definition 28 A substitution, θ = {x1 /τ1 , . . . , xn /τn }, is a finite set of


associations between variables and terms (often constants) such that:

1. Each variable is associated with one term only; that is, xi ≠ xj for all i ≠ j.

2. None of the variables xi may occur within any of the associated terms
τi .

The terms associated with the variables are frequently called the bindings
for those variables. When a substitution θ is applied to a clause Φ, all
bound variables occurring in Φ are replaced by their bindings. Any variables
without bindings are left unchanged. The convention for denoting the clause
resulting from a substitution is, somewhat unusually, Φθ (that is, functional
notation is not employed). For example, let Φ = P (x) ∨ Q(x) ∨ R(y) and
θ = {x/bill }, then:

Φθ = P (bill ) ∨ Q(bill ) ∨ R(y)

Note that applying a substitution to a universally quantified clause produces


a new clause that may be less general but not more general than the original
clause. Thus, substitutions can be understood to “specialise” a clause.
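Applying a substitution is a purely mechanical operation, and under the illustrative encoding used earlier in this appendix (variables as capitalised strings, functional terms and literals as tuples) it can be sketched as follows; the helper name is not taken from Foxcs.

    def apply_substitution(expr, theta):
        """Replace every bound variable in a term or literal by its binding."""
        if isinstance(expr, tuple):                # literal or functional term
            return (expr[0],) + tuple(apply_substitution(arg, theta) for arg in expr[1:])
        return theta.get(expr, expr)               # bound variable, free variable or constant

    clause = [("p", "X"), ("q", "X"), ("r", "Y")]  # P(x) ∨ Q(x) ∨ R(y)
    theta = {"X": "bill"}
    print([apply_substitution(literal, theta) for literal in clause])
    # [('p', 'bill'), ('q', 'bill'), ('r', 'Y')]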

It can also be useful to apply a substitution in reverse; that is, to replace


terms with variables. An inverse substitution can be used for this purpose.
An inverse substitution relies on the notion of a place, the location of a term
within a clause.

Definition 29 A place within a term τ is a tuple of natural numbers, de-


fined recursively as follows:

1. The term at place ⟨i⟩ in term τ = f(τ1, . . . , τn) is τi.

2. The term at place ⟨i1, . . . , in⟩ in term τ = f(τ1, . . . , τn) is the term at place ⟨i2, . . . , in⟩ in τi1.

A place within a literal is defined analogously.

Example 13 Let ψ = P(bill, bob, F(bill), ben). Then the places of bill in ψ are ⟨1⟩ and ⟨3, 1⟩.


A place within a clause is defined next.

Definition 30 Let τ be the term at place p in literal φ, where φ is a literal


within the clause Φ. The place of τ in Φ is the pair ⟨φ, p⟩.

Example 14 Let Ψ = P (bill ) ∨ Q(bill ) ∨ R(y). Then the places of bill in


Ψ are ⟨P(bill), ⟨1⟩⟩ and ⟨Q(bill), ⟨1⟩⟩.


Now the concept of an inverse substitution can be defined.

Definition 31 Let θ = {x1 /τ1 , . . . , xn /τn } be a substitution for the clause


Φ. The corresponding inverse substitution is:

θ−1 = {⟨τ1/{p1,1, . . . , p1,m1}⟩/x1, . . . , ⟨τn/{pn,1, . . . , pn,mn}⟩/xn}.



When the inverse substitution θ−1 is applied to the clause Φ, denoted Φθ−1 ,
each term τi is replaced at the places pi,1 , . . . , pi,mi in Φ by the variable xi .
It follows that Φθθ−1 = Φ. Also note that just as a substitution cannot
make a clause more general, an inverse substitution cannot make a clause
less general. Thus, an inverse substitution can be applied to “generalise” a
clause.
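The following sketch, under the same illustrative encoding, computes the places at which a term occurs in a literal and then applies an inverse substitution by replacing the term at those places with a variable; it mirrors Examples 13 and 14 but is not code from Foxcs.

    def places(target, expr, prefix=()):
        """All places (tuples of argument positions) at which target occurs in expr."""
        found = [prefix] if expr == target else []
        if isinstance(expr, tuple):
            for i, arg in enumerate(expr[1:], start=1):
                found.extend(places(target, arg, prefix + (i,)))
        return found

    def replace_at(expr, place, variable):
        """Replace whatever sits at the given place in expr with variable."""
        if not place:
            return variable
        args = list(expr[1:])
        args[place[0] - 1] = replace_at(args[place[0] - 1], place[1:], variable)
        return (expr[0],) + tuple(args)

    literal = ("p", "bill", "bob", ("f", "bill"), "ben")   # P(bill, bob, F(bill), ben)
    print(places("bill", literal))                         # [(1,), (3, 1)]
    print(replace_at(replace_at(literal, (1,), "X"), (3, 1), "X"))
    # ('p', 'X', 'bob', ('f', 'X'), 'ben')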
Appendix B

Inductive Logic
Programming

It was the combination of an inductive logic programming (ILP) algorithm,


Tilde (Blockeel and De Raedt, 1998), with Q-Learning that led to the
first RRL system (Džeroski et al., 1998a, 2001). Since then, concepts and
techniques from ILP continue to play an important role in RRL. Although
it is beyond the scope of the thesis to provide a detailed account of ILP, in
this appendix we formally describe the learning problem addressed by ILP
systems, an area of fundamental significance for RRL.

Recall that along with the credit assignment problem, the generalisation
problem is one of the two key challenges that a reinforcement learning al-
gorithm must address. One way of addressing the generalisation problem
is through inductive inference, the central concern of ILP, which leads to
the subfield of RRL. This appendix presents two of the most important
formalisations of induction for ILP; between them, they describe the ma-
jority of all ILP systems. However, these two formalisations are not, as we
shall see, equally useful for the purposes of reinforcement learning, which
places special requirements on an inductive algorithm. The two problem
settings are learning from entailment and learning from interpretations, also


commonly known as the normal and non-monotonic settings respectively.1

Before presenting these two formalisms, it is worth briefly noting some his-
tory of ILP. The name “Inductive Logic Programming” was coined by Mug-
gleton (1991) who defined it as the intersection between machine learning
and logic programming (today, such a definition would include RRL but ILP
by convention refers specifically to supervised and unsupervised learning sys-
tems). However, the origins of the field of ILP occurred much earlier in the
work of Banerji (1964), who used first-order logic to represent hypotheses
learnt under a paradigm called concept learning. Subsequently, ILP was
greatly influenced by two key dissertations: that of Plotkin (1971), who
was the first to formalise induction using clausal logic; and that of Shapiro
(1983), who introduced the notion of refinement operators.2 An influential
system, Foil, perhaps the first to take the approach of upgrading an existing
machine learning method, in this case learning rule sets, to first-order logic,
appeared in Quinlan (1990). Around this time, the establishment of ILP
as a separate subfield of machine learning arose, in part due to the field’s
christening (Muggleton, 1991) as well as to the identification of a thorough
research agenda (Muggleton and De Raedt, 1994). Since then, important
contributions include, amongst others, a comprehensive description of the
theoretical foundations of ILP (Nienhuys-Cheng and de Wolf, 1997) and the
application of ILP to data mining (Džeroski and Lavrač, 2001).

It is also worth mentioning the benefits that arise from incorporating first-order logic into machine learning. These benefits include:

• A formal theoretical basis to learning (i.e. induction)

• A highly expressive language for representing hypotheses (see Section 2.4 for further discussion on this point)

• A straightforward way of including background knowledge about the task environment into the learning process

• The generation of relatively comprehensible hypotheses

1 Descriptions of these settings, and others, can be found in (Muggleton and De Raedt, 1994; De Raedt and Džeroski, 1994; Wrobel and Džeroski, 1995; Nienhuys-Cheng and de Wolf, 1997; De Raedt, 1997).

2 Actually, Shapiro first introduced his refinement operators in (Shapiro, 1981) and then subsequently incorporated them into his PhD thesis.

We now describe the learning from entailment and learning from interpretations settings in turn.

B.1 Learning from Entailment

This setting is the more general of the two; it applies to concept learning,
classification, regression and program synthesis, and also allows recursive
definitions to be learnt. Concept learning, perhaps the simplest of these
scenarios, is convenient for illustrating learning from entailment. The idea
behind concept learning is to induce a definition of a general “concept” from
specific data. Under this setting the “concept” is a relation represented by
a predicate, say φ, the definition takes the form of definite clauses, and the
data are typically facts over φ.3 The data is divided into two: a set of
positive examples and a set of negative examples; each positive (negative)
example is a fact that is consistent (inconsistent) with the concept to be
induced. Background knowledge is also provided, usually in the form of
facts or definite clauses, which contains other data necessary or helpful for
inducing a definition for φ. More formally, concept learning under learning
from entailment is defined as follows:

Definition 32 (Learning from entailment) Given:

• A set of positive examples E + ; each example is a fact

• A set of negative examples E −


• A background theory B

Find: An hypothesis Φ such that:

1. ∀e ∈ E + : Φ ∧ B ⊨ e,

2. ∀e ∈ E − : Φ ∧ B ⊭ e.

In other words, the hypothesis, Φ, together with the background knowledge, B, should entail all the positive examples, E + , but none of the negative examples, E − .

3 In principle, non-fact clauses can be used to represent data but this is unusual.

Example 15 Suppose:



B = { parent(ane, bill) ←
      parent(bill, carl) ←
      parent(uma, vera) ←
      parent(vera, walt) ← }

E + = {grandparent(ane, carl), grandparent(uma, walt)}


E − = {grandparent(ane, bill), grandparent(ane, uma)},

then the theory

grandparent(X, Z) ← parent(X, Y ), parent(Y, Z)

in conjunction with B implies all the positive examples but none of the
negative examples.
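The entailment conditions of Definition 32 can be checked mechanically by posing Prolog queries. The following sketch is illustrative only: it writes B and the hypothesis of Example 15 as an ordinary logic program and queries each example.

% Background knowledge B and the induced hypothesis of Example 15,
% written as an ordinary Prolog program.
parent(ane, bill).
parent(bill, carl).
parent(uma, vera).
parent(vera, walt).

grandparent(X, Z) :- parent(X, Y), parent(Y, Z).

% Each positive example succeeds as a query (it is entailed):
%   ?- grandparent(ane, carl).      % true
%   ?- grandparent(uma, walt).      % true
% Each negative example fails (it is not entailed):
%   ?- grandparent(ane, bill).      % false
%   ?- grandparent(ane, uma).       % false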


Unfortunately, ILP under the learning from entailment setting is not PAC-
learnable with polynomial time complexity (Džeroski et al., 1992; Cohen, 1995; Cohen and Page, 1995). The next setting, however, does come with
tractable PAC-learnability results.

B.2 Learning from Interpretations

Under the learning from interpretations setting, the induced concept defi-
nition is no longer required to imply the positive data items; each positive
data item should, instead, satisfy the induced definition. Continuing with
concept learning, we again wish to induce a clausal definition for predicate φ
from positive and negative data and background knowledge. Each data item
under this setting is an Herbrand interpretation (recall from Section A.3 that
an Herbrand interpretation is a set of facts), which contrasts with learning
from entailment, where typically each data item is a fact; this difference has
positive implications for efficiency. Background knowledge is incorporated
by extending the interpretation into the minimal Herbrand model of the
data item and the background knowledge.

Definition 33 Let Γ be a set of clauses. The minimal Herbrand model of


Γ, denoted M(Γ), is the set of all facts entailed by Γ.

The minimal Herbrand model of Γ is uniquely determined if Γ consists of


definite clauses (note that facts are a type of definite clause, so Γ may contain
facts).

Example 16 Consider the following facts, which describe the blocks world
state shown in Figure B.1: e = {cl(a), on(a, b), on_fl(b), cl(c), on(c, d), on(d, e), on_fl(e)}; and the background theory, B:

above(X, Y ) ← on(X, Y )
above(X, Y ) ← on(X, Z), above(Z, Y )

Then M(e ∧ B) = {cl(a), on(a, b), on_fl(b), cl(c), on(c, d), on(d, e), on_fl(e), above(a, b), above(c, d), above(d, e), above(c, e)}, which includes all the facts from e in addition to those that can be derived from the background theory together with e, i.e. the facts describing which blocks are above which.


Figure B.1: A blocks world state consisting of two stacks: block a on block b (with b on the floor), and block c on block d on block e (with e on the floor).
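The minimal Herbrand model of Example 16 can be reproduced with a short Prolog program. The sketch below is illustrative only: the facts of e are listed directly, the two clauses of B are added, and the derivable above/2 facts are enumerated.

% The interpretation e of Example 16, written as Prolog facts.
cl(a).   on(a, b).   on_fl(b).
cl(c).   on(c, d).   on(d, e).   on_fl(e).

% The background theory B.
above(X, Y) :- on(X, Y).
above(X, Y) :- on(X, Z), above(Z, Y).

% Enumerating the derivable above/2 facts:
%   ?- findall(above(X, Y), above(X, Y), Facts).
%   Facts = [above(a,b), above(c,d), above(d,e), above(c,e)]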

The idea behind learning from interpretations is to induce a set of clauses


that are satisfied by the positive examples (in conjunction with the back-
ground knowledge) but not the negative examples (also in conjunction with
the background knowledge). Intuitively, each interpretation “describes” a
situation while the induced clauses express any regularities that exist in
the descriptions of the positive but not negative examples. More formally,
concept learning under learning from interpretations is defined as follows:

Definition 34 (Learning from interpretations) Given:

• A set of positive examples E + ; each example is an Herbrand interpretation

• A set of negative examples E −

• A background theory B

Find: An hypothesis Φ such that:

1. ∀e ∈ E + : M(e ∧ B) satisfies Φ,

2. ∀e ∈ E − : M(e ∧ B) does not satisfy Φ.

Frequently there are no negative examples; that is, E − = ∅. The use of


the minimal Herbrand model, M(e ∧ B), is in effect an application of the

closed world assumption; that is, all facts that are not part of M(e ∧ B) are assumed not to hold for the example e. For instance, in Example 16 the fact above(b, a) does not belong to M(e ∧ B) and is therefore treated as false for that example. The closed world assumption is, in turn, a form of non-monotonic reasoning, hence the alternative name of this setting.

Example 17 Adapting Example 15 for this setting we have B = ∅, E − = ∅


and E + = {e1 , e2 }, where:

e1 = {grandparent(ane, carl), parent(ane, bill), parent(bill, carl)}


e2 = {grandparent(uma, walt), parent(uma, vera), parent(vera, walt)}

The hypothesis:

grandparent(X, Z) ← parent(X, Y ), parent(Y, Z)

is satisfied by both examples.
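To make the notion of satisfaction concrete, the check for e1 can be carried out with a small Prolog sketch (illustrative only; the helper predicate violated/0 is introduced solely for this check, and since B = ∅ here, M(e1 ∧ B) is simply e1 itself). The clause is satisfied by e1 exactly when no grounding makes its body true while its head is false.

% The interpretation e1 of Example 17, stored as facts.
parent(ane, bill).
parent(bill, carl).
grandparent(ane, carl).

% The hypothesis clause is violated iff some grounding satisfies the
% body but not the head.
violated :- parent(X, Y), parent(Y, Z), \+ grandparent(X, Z).

%   ?- violated.      % fails, so the hypothesis is satisfied by e1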




Comparing Examples 15 and 17 shows an important difference between the


two settings. In Example 15, much of the information is contained in the
background knowledge, B. However, in Example 17 all information relating
to a specific data item has been explicitly relocated from B to within the
data item. According to Blockeel (1998), this explicit localisation and separation of information within the data items improves the efficiency of the learning
process; and indeed, unlike the case for learning from entailment, tractable
PAC-learnability results have been obtained for learning from interpreta-
tions (De Raedt and Džeroski, 1994). Another advantage which arises from
the clear separation of data items is that it makes it easier to upgrade meth-
ods from attribute-value frameworks, which rely on completely separating
the data belonging to each item. On the negative side, the learning from in-
terpretations setting has difficulty with examples that are infinite Herbrand
interpretations and cannot learn recursive definitions.

B.3 Discussion

Of the two settings, learning from interpretations provides the better match
for the needs of reinforcement learning. The interactive nature of rein-
forcement learning means that data is typically processed incrementally,
which the separation of data items under learning from interpretations con-
veniently supports. Additionally, states and state-action pairs can be mod-
eled quite naturally by Herbrand interpretations, as shown in Section 5.2.
The tractable complexity result for learning from interpretations also in-
creases its attractiveness.

Generalisation in reinforcement learning resembles classification more than


concept learning;4 hence, it is useful to reformulate the learning from inter-
pretations setting as a classification problem, as has been done below:5

Definition 35 (Learning from interpretations) Given:

• A set of labelled examples E = {(e, y)}; each example contains an


Herbrand interpretation e and a label y

• A background theory B

Find: An hypothesis Φ such that for all (e, y) ∈ E:

1. Φ ∧ e ∧ B ⊨ y

2. Φ ∧ e ∧ B ⊭ y′ for all y′ ≠ y

There are two straightforward ways to map the elements of this setting to the reinforcement learning problem. One way is to let an example represent a state (e ∈ S) and its label represent an action (y ∈ A); then we have an approach that learns a policy directly without appealing to the value function, such as the supervised learning systems mentioned in Section 3.2.6, page 63. Alternatively, e represents a state-action pair (s, a) and y the pair's corresponding value Q(s, a); then we have an approach based on value function estimation, as exemplified by the Rrl-tg system.

4 Concept learning can be viewed as a special case of classification where the definition of only a single class is to be induced.
5 This formulation is due to Blockeel (1998).

Foxcs most closely resembles the first approach, even though it estimates
the value function. However, note that in Foxcs the hypothesis is a set of
definite clauses and that a prediction value, which is essentially a partial
value function estimate, is associated with each clause in a way that is
not modeled above. Also note that the second condition is relaxed, with
competing actions being resolved by appealing to the associated prediction
values.
Appendix C

The Inductive Logic Programming Tasks

This appendix describes the ILP tasks that were used for evaluating the
Foxcs system. The data sets for some of these, and other, ILP tasks are
publicly available from the web sites at OUCL (2000) and MLnet OiS (2000).

C.1 The Prediction of Mutagenic Activity

The task considered here is the prediction of mutagenic activity from in-
formation known about specific nitroaromatic compounds. Many nitroaro-
matic compounds are mutagenic, that is, they can cause DNA to mutate
and are thus potentially carcinogenic. A procedure called the Ames test can detect mutagenicity; however, some compounds, such as antibiotics, cannot be tested due to the risk of toxicity to the test organisms. The development
of methods for predicting mutagenicity is therefore of considerable interest
because they avoid the potential risks associated with detection.

The Mutagenesis data set, available from the OUCL (2000) site, contains
information about 230 nitroaromatic compounds; it was introduced by Srini-


Table C.1: A sample of the Mutagenesis data set. The sample gives the NS+S1 de-
scriptions for the compound d1. The atm facts list the atoms occurring in the molecule
(there are 26 atoms for d1); each fact gives the element (e.g. carbon, c), the element’s
configuration, and its partial charge. The bond facts list all bonds that exist between the
atoms of the molecule; each fact identifies the two atoms participating in the bond and
also gives the type of the bond. Each compound also has a logp and lumo fact.

% atm(Compound,AtomId,Element,Configuration,PartialCharge)

atm(d1,d1_1,c,22,-0.117).
atm(d1,d1_2,c,22,-0.117).
atm(d1,d1_3,c,22,-0.117).
(...)
atm(d1,d1_25,o,40,-0.388).
atm(d1,d1_26,o,40,-0.388).

% bond(Compound,AtomId,AtomId,Type)

bond(d1,d1_1,d1_2,7).
bond(d1,d1_2,d1_3,7).
bond(d1,d1_3,d1_4,7).
(...)
bond(d1,d1_24,d1_25,2).
bond(d1,d1_24,d1_26,2).

% logp and lumo

logp(d1,4.23).
lumo(d1,-1.246).

vasan et al. (1994) and has subsequently been widely used for evaluating ILP
algorithms (an overview of selected results is provided by Lodhi and Mug-
gleton (2005)). Following Srinivasan et al. (1996), each compound in the
data set has two levels of description, NS+S1 and NS+S2, which contain
the following information:

NS+S1 contains a structural description of each compound, consisting of


all the atoms and bonds occurring in the molecule. It also contains two
attributes associated with each compound, logP and LUMO, which
are, respectively, a measure of the compound’s hydrophobicity, and
the energy of the compound’s lowest unoccupied molecular orbital.

NS+S2 includes (in addition to all information in NS+S1) chemical con-


cepts such as benzene, anthracene, phenanthrene, and others, describ-
ing submolecular structures present within the compounds.

Two other attributes, called I1 and Ia, are also known to be relevant; however, they are generally reserved for use with attribute-value systems. A representative sample of the data is given in Table C.1; note, however, that the chemical concepts from NS+S2 are typically converted into ground clauses, which improves the efficiency of an algorithm when run on the data set.

The compounds in the data set are usually divided into two subsets, a “re-
gression friendly” subset of 188 compounds (125 active and 63 inactive), and
a “regression unfriendly” subset of 42 compounds (13 active and 29 inac-
tive). The regression unfriendly subset has been found to yield poor results
using linear regression with attribute value descriptions (Debnath et al.,
1991; Srinivasan et al., 1996), thus it forms the more interesting group for
study in the ILP context. However, we have chosen to use the regression
friendly data set because it has been used more often (which is probably
due to its larger size) and thus has the greater number of published results
available for comparison.

At each time step the input to Foxcs consists of the ID of a compound only; all other data is provided as background knowledge, which is accessed
through the support predicates. The rule language is given by the mode
declarations in Table C.2.

Table C.2: The mode declarations for the Mutagenesis data set.

% modes for level NS+S1

mode( a, [1,1], false, active).


mode( a, [1,1], false, inactive).
mode( s, [1,1], false, molecule([compound,-])).
mode( b, [0,10],false, atm([compound,+],[atomid,+],[element,#],[integer,#],[charge,#,-])).
mode( b, [0,10],false, atm([compound,+],[atomid,-],[element,#],[integer,#],[charge,#,-])).
mode( b, [0,5], false, bond([compound,+],[atomid,+,!],[atomid,+,!],[integer,#])).
mode( b, [0,5], false, bond([compound,+],[atomid,+],[atomid,-],[integer,#])).
mode( b, [0,5], false, bond([compound,+],[atomid,-],[atomid,+],[integer,#])).
mode( b, [0,5], false, bond([compound,+],[atomid,-],[atomid,-],[integer,#])).
mode( b, [0,1], false, lumo([compound,+],[energy,#,-])).
mode( b, [0,1], false, logp([compound,+],[hydrophob,#,-])).
mode( b, [0,20], false, gteq([charge,+],[float,#])).
mode( b, [0,20], false, lteq([charge,+],[float,#])).
mode( b, [0,1], false, gteq([energy,+],[float,#])).
mode( b, [0,1], false, lteq([energy,+],[float,#])).
mode( b, [0,1], false, gteq([hydrophob,+],[float,#])).
mode( b, [0,1], false, lteq([hydrophob,+],[float,#])).

% additional modes for level NS+S2

mode( b, [0,3], false, benzene([compound,+],[ring,-])).


mode( b, [0,3], false, carbon_5_aromatic_ring([compound,+],[ring,-])).
mode( b, [0,3], false, carbon_6_ring([compound,+],[ring,-])).
mode( b, [0,3], false, hetero_aromatic_6_ring([compound,+],[ring,-])).
mode( b, [0,3], false, hetero_aromatic_5_ring([compound,+],[ring,-])).
mode( b, [0,3], false, ring_size_6([compound,+],[ring,-])).
mode( b, [0,3], false, ring_size_5([compound,+],[ring,-])).
mode( b, [0,3], false, nitro([compound,+],[ring,-])).
mode( b, [0,3], false, methyl([compound,+],[ring,-])).
mode( b, [0,3], false, anthracene([compound,+],[ringlist,-])).
mode( b, [0,3], false, phenanthrene([compound,+],[ringlist,-])).
mode( b, [0,3], false, ball3([compound,+],[ringlist,-])).
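To illustrate the rule language defined by these mode declarations, a rule of the following general shape can be expressed. The rule is hypothetical: it is not one of the rules learnt in the experiments, and its numeric thresholds are invented for illustration.

active ← molecule(C), atm(C, A, c, 22, Charge), lteq(Charge, -0.1),
         lumo(C, E), lteq(E, -1.5)

That is, a compound C is predicted to be active if it contains a carbon atom in configuration 22 whose partial charge is at most -0.1 and if the compound's LUMO energy is at most -1.5.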

C.2 The Prediction of Biodegradability

The aim of this task, introduced by Džeroski et al. (1999), is to predict


the half-life of a chemical in order to assess its potential as an environmen-

tally friendly agent—in particular, the task focuses on “biodegradation in


an aqueous environment under aerobic conditions, which affects the quality
of surface- and groundwater” (Džeroski et al., 1999). In order to support
a classification approach, the half-life values, which are real numbers, are discretised into several classes, from which the system chooses its prediction. Following Blockeel et al. (2004) we consider the simplest case where
the half-life values are discretised into just two classes: resistant and degrad-
able.

The data set contains 328 molecules (143 resistant, 185 degradable) and their
associated descriptions. Blockeel et al. (2004) give several levels of descrip-
tion for the molecules, which are selected depending on whether the system
makes use of propositional or relational data. We use the relational repre-
sentation called Global + R, which describes molecules structurally using a
representation that is very similar to that for the Mutagenesis task—with
predicates for the atoms, bonds and higher level sub-molecular structures of
the compounds. The molecular weight is also given for each compound as
it is relevant to biodegradability. We omit a representation of the data due
to its similarity to the Mutagenesis data set.

This data set, and the following one, was obtained from Sašo Džeroski and
used with his kind permission.

C.3 Predicting Traffic Congestion and Accidents

The goal of expert systems for road traffic management (Cuena et al., 1992,
1995) is to suggest actions to avoid or reduce traffic problems like conges-
tion given information about the current state of traffic over a wide area.
The construction of an expert system is tailored specifically for each city
and requires strategies to be input from expert human traffic controllers. In
addition to the human experts the process could be augmented with ma-
chine learning methods, which would infer general principles from a large

Table C.3: Part of the data relating to the road section RDLT_ronda_de_Dalt_en_Artesania. The predicate names are in Spanish due to
the origin of the data set. The predicate secciones_posteriores relates adjacent road
sections, tipo gives the road type, velocidad refers to velocity, ocupacion to the amount
of time that the sensor was occupied, and saturacion to traffic flow.

% connections for RDLT_ronda_de_Dalt_en_Artesania

secciones_posteriores(’RDLT_ronda_de_Dalt_en_Artesania’,
’RDLT_ronda_de_Dalt_en_salida_a_Roquetas’).
secciones_posteriores(’RDLT_ronda_de_Dalt_en_Artesania’,
’RDLT_salida_a_Roquetas’).

% section type

tipo(’RDLT_ronda_de_Dalt_en_Artesania’,carretera).

% arguments are (Time,Section,Value)

velocidad(11,’RDLT_ronda_de_Dalt_en_Artesania’,84.0).
ocupacion(11,’RDLT_ronda_de_Dalt_en_Artesania’,319.25).
saturacion(11,’RDLT_ronda_de_Dalt_en_Artesania’,71.0).
velocidad(12,’RDLT_ronda_de_Dalt_en_Artesania’,84.75).
ocupacion(12,’RDLT_ronda_de_Dalt_en_Artesania’,277.25).
saturacion(12,’RDLT_ronda_de_Dalt_en_Artesania’,60.75).
(...)

knowledge base of actual situations and outcomes.

The specific task considered here is to predict critical road sections—that


is, sections responsible for accidents and congestion—given sensor readings
and road geometry. Typically traffic problems manifest in the road sec-
tions following a critical section, thus a critical section is actually a section
preceding the one where the problems are observed. The road structure is
generally such that many road sections precede and follow a single section.
This many-to-one relationship between road sections suggests the use of ILP methods.

Table C.4: The mode declarations for the Traffic data set. The predicate velocidadd
is a discretised version of velocidad, containing 3 values baja, media, alta; similarly for
saturaciond and ocupaciond.

mode( a, [1,1], false, accident ).


mode( a, [1,1], false, congestion ).
mode( a, [1,1], false, noncs ).
mode( s, [1,1], false, section([section,-]) ).
mode( s, [1,1], false, time([time,-]) ).
mode( b, [0,1], false, secciones_posteriores([section,-], [section,+]) ).
mode( b, [0,1], false, secciones_posteriores([section,+], [section,-]) ).
mode( b, [0,2], false, tipo([section,+], [type,#]) ).
mode( b, [0,2], false, velocidadd([time,+], [section,+], [value,#]) ).
mode( b, [0,2], false, saturaciond([time,+], [section,+], [value,#]) ).
mode( b, [0,2], false, ocupaciond([time,+], [section,+], [value,#]) ).


The data set was used with the kind permission of Martin Molina and is
not publicly available. Introduced by Džeroski et al. (1998b), it contains
data simulated for various road sections in the city of Barcelona. The data
includes two components: one contains the road section details and the other describes the traffic conditions. The data distribution is 66 examples of
congestion, 62 accidents, and 128 non-critical sections (256 examples in to-
tal).

At each time step the input to Foxcs consists of a road section and a sensor
reading time; the remaining data is supplied as background knowledge. A
sample of the data is shown in Table C.3, and the rule language for the task
is given in Table C.4.
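As with the Mutagenesis task, a hypothetical rule helps to illustrate the rule language defined by these declarations (the rule below is invented for illustration and is not one of the rules learnt in the experiments):

congestion ← section(S), time(T), saturaciond(T, S, alta),
             secciones_posteriores(S, S2), ocupaciond(T, S2, alta)

that is, section S is predicted to be critical for congestion at time T if its own saturation is high and a section S2 that follows it shows high occupancy.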

C.4 Classifying Hands of Poker

The aim of this task, which originates from Blockeel et al. (1999), is to suc-
cessfully recognise eight types of hands in the card game of Poker. The eight
classes are fourofakind, fullhouse, flush, straight, threeofakind,
twopair, pair, and nought. The first seven classes are defined as nor-
mal for Poker, although no distinction is made between royal, straight, and
ordinary flushes. The last class, nought, consists of all the hands that do
not belong to one of the other classes.

In Poker the cards in the hand are selected randomly, resulting in a par-
ticularly uneven class distribution. The table below gives the number of
hands belonging to each class, which clearly shows the uneven distribution;
for example, a nought hand occurs on average with frequency 1,302,540/2,598,960, or approximately once every 2 hands, compared to a four of a kind, which occurs on average with frequency 624/2,598,960, or only once every 4,165 hands.

fourofakind          624
fullhouse          3,744
flush              5,148
straight          10,200
threeofakind      54,912
two pairs        123,552
pair           1,098,240
nought         1,302,540
total          2,598,960
Such a large disparity can pose difficulties for the learning process; thus, in
the experiments with Foxcs data sampling was stratified so that the class
distribution was approximately equal.

The rule language for the task is given in Table C.5. The predicate card
represents a playing card and contains three arguments, which are the rank
of the card, its suit, and its position in the hand. The predicate succession

Table C.5: The mode declarations for the Poker task.

inequations(true).

mode( a, [1,1], false, fourofakind ).


mode( a, [1,1], false, fullhouse ).
mode( a, [1,1], false, flush ).
mode( a, [1,1], false, straight ).
mode( a, [1,1], false, threeofakind ).
mode( a, [1,1], false, twopair ).
mode( a, [1,1], false, pair ).
mode( a, [1,1], false, nought ).
mode( a, [1,1], false, hand([class,#]) ).
mode( s, [5,5], false, card([position,-],[rank,+,-,_],[suit,+,-,_]) ).
mode( b, [0,1], true, succession([rank,+,!],[rank,+,!],[rank,+,!],[rank,+,!],[rank,+,!])).

takes five ranks as arguments and evaluates to true when there is an ordering
of the ranks given by the arguments that forms an unbroken sequence. Given
this very natural representation for Poker concepts, the task is relational
because it is the relationships between the ranks and suits of the cards in
the hand, rather than their values, that define the classes.1 Readers familiar
with Poker will note that the class definitions do not depend on the order
in which cards are dealt; why, then, is the position attribute included in
card? The reason for having the position attribute is discussed below.
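An implementation of succession is not listed here; the following is only a possible sketch of such a background predicate, assuming that ranks are encoded as integers and setting aside the ace-low straight.

% A possible definition of succession/5 (illustrative only). It
% succeeds when the five ranks, once sorted, form an unbroken
% ascending sequence. (msort/2 sorts a list without removing
% duplicates and is available in most Prolog systems.)
succession(A, B, C, D, E) :-
    msort([A, B, C, D, E], [R1, R2, R3, R4, R5]),
    R2 =:= R1 + 1,
    R3 =:= R2 + 1,
    R4 =:= R3 + 1,
    R5 =:= R4 + 1.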

Inclusion of the Position Attribute

The use of the anonymous variable was allowed in the card predicate in
order to increase the level of generality that a single rule could attain. This,
in turn, created a need for the attribute, position, to ensure that each
literal represents a separate card. To see this more concretely, imagine that
the system uses the above representation minus the position attribute, and

1 Thornton (2000) illustrates, through an anecdote of the true story of the Cincinnati System, the pitfalls of using non-relational representations to express concepts in Poker.

that it has discovered the following rule for classifying a pair:

pair ← card(A, _), card(A, _), card(B, _), card(C, E), card(D, F)

The rule correctly classifies all hands that contain a single pair. The first
two literals in the condition represent the two cards in the hand that have
the same rank, and the other three literals represent the remaining cards.
Note that because of the use of inequations on the variables A, B, C, and D,
there cannot be more than two cards with the same rank; however, since
the first two literals are identical, there is nothing to prevent them from
matching the same card. Thus, many non-pair hands will also be covered
by the rule, which means the rule is not always correct.

The solution is to introduce an extra attribute to card so that each literal


will match a separate card, such as the “position” attribute already de-
scribed. Using the extra attribute, the rule corresponding to the one above is:

pair ← card(G, A, _), card(H, A, _), card(I, B, _), card(J, C, E), card(K, D, F)

Here the position attribute ensures that the first two literals must repre-
sent different cards—again through the use of inequations, this time on the
variables G and H. Now the rule correctly classifies all hands that contain a
single pair, but it no longer matches any non-pair hands; it is therefore a
correct and maximally general definition of a pair.
Appendix D

Validity Tests for the Blocks World Environment

This appendix describes the validity tests that were used to constrain mutation from producing rules that do not match any states in bwn. Note that the validity checks do not entirely eliminate all illegal possibilities. The validity checks are implemented by a set of Prolog rules defining the relation is_illegal/2. The two arguments to is_illegal are the action and condition of the rule to be checked. The definitions make use of the relation memberof/2, which takes two lists as arguments and checks that all elements in the first occur in the second:

memberof([],_).
memberof([H|T],List) :- member(H,List), memberof(T,List).
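As an illustration of how memberof/2 behaves under unification (the query below is illustrative only):

% ?- memberof([on(A,B), cl(C)], [cl(X), on(X,Y), on_fl(Y)]).
% succeeds, binding A = X, B = Y and C = X, since every element of
% the first list unifies with some member of the second.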

The definitions for is_illegal are given below.

is_illegal( Action, Condition ) :-

% action mv_fl(A) is inconsistent with condition containing on_fl(A)

Action = mv_fl(A),
memberof([on_fl(B)],Condition),


A==B.

is_illegal( Action, Condition ) :-

% action mv_fl(A) is inconsistent with a condition that does not


% contain either cl(A),on(A,B) or cl(A),above(A,B)

Action = mv_fl(A),
((memberof([cl(B),on(C,D)],Condition),A==B,A==C) ->
fail
;
((memberof([cl(B),above(F,G)],Condition),A==B,A==F) ->
fail
;
true
)
).

is_illegal( Action, Condition ) :-

% action mv(A,B) is inconsistent with a condition that does not


% contain cl(A),cl(B)

Action = mv(A,B),
((memberof([cl(C),cl(D)],Condition),A==C,B==D) -> fail; true).

is_illegal( _, Condition ) :-

% on(A,B) and on(B,A) are inconsistent

memberof([on(A,B),on(C,D)],Condition),
A==D,
B==C.

is_illegal( _, Condition ) :-

% on(A,B) and on(C,B) are inconsistent unless A=C

memberof([on(A,B),on(C,D)],Condition),
A\==C,
B==D.

is_illegal( _, Condition ) :-

% on(A,B) and on(A,C) are inconsistent unless B=C

memberof([on(A,B),on(C,D)],Condition),
A==C,
B\==D.

is_illegal( _, Condition ) :-

% on(A,B) and on_fl(A) are inconsistent

memberof([on(A,B),on_fl(C)],Condition),
A==C.

is_illegal( _, Condition ) :-

% on(A,B) and cl(B) are inconsistent

memberof([on(A,B),cl(C)],Condition),
B==C.

is_illegal( _, Condition ) :-

% on(A,B) and highest(B) are inconsistent

memberof([on(A,B),highest(C)],Condition),
B==C.

is_illegal( _, Condition ) :-

% above(A,B) and highest(B) are inconsistent

memberof([above(A,B),highest(C)],Condition),
B==C.

is_illegal( _, Condition ) :-

% on(A,B) and above(B,A) are inconsistent

memberof([on(A,B),above(C,D)],Condition),
A==D,
B==C.

is_illegal( _, Condition ) :-

% cl(A) and above(B,A) are inconsistent

memberof([cl(A),above(B,C)],Condition),
A==C.

is_illegal( _, Condition ) :-

% on_fl(A) and above(A,B) are inconsistent

memberof([on_fl(A),above(B,C)],Condition),
A==B.
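As a usage illustration (the query below is illustrative only), the first clause rejects a rule whose action moves a block to the floor while its condition already asserts that the same block is on the floor:

% ?- is_illegal(mv_fl(A), [cl(A), on(A,B), on_fl(A)]).
% succeeds via the first clause: on_fl(A) occurs in the condition and
% shares the variable A with the action mv_fl(A).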
Bibliography

Aha, D. W., Kibler, D., and Albert, M. K. (1991). Instance-based learning


algorithms. Machine Learning, 6(1):37–66.

Ahluwalia, M. and Bull, L. (1999). A genetic programming-based classifier


system. In Banzhaf et al. (1999), pages 11–18.

Albus, J. S. (1975). A new approach to manipulator control: Cerebellar


model articulation controller (CMAC). Journal of Dynamic Systems,
97:220–227.

Anglano, C., Giordana, A., Lo Bello, G., and Saitta, L. (1997). A network
genetic algorithm for concept learning. In Bäck, T., editor, Proceedings of
the Seventh International Conference on Genetic Algorithms (ICGA97),
San Francisco, CA. Morgan Kaufmann.

Anglano, C., Giordana, A., Lo Bello, G., and Saitta, L. (1998). An experi-
mental evaluation of coevolutive concept learning. In Shavlik, J. W., edi-
tor, Proceedings of the 15th International Conference on Machine Learn-
ing (ICML 1998), pages 19–27. Morgan Kaufmann, San Francisco, CA.

Åström, K. J. (1965). Optimal control of Markov decision processes with


incomplete state estimation. Journal of Mathematical Analysis and Ap-
plications, 10(1):174–205.

Augier, S. and Venturini, G. (1996). SIAO1: A first order logic machine


learning system using genetic algorithms. In Fogarty, T. C. and Ven-
turini, G., editors, 13th International Conference on Machine Learning


(ICML’96) — Workshop on Evolutionary Algorithms and Machine Learn-


ing.

Augier, S., Venturini, G., and Kodratoff, Y. (1995). Learning first order
logic rules with a genetic algorithm. In Fayyad, U. M. and Uthurusamy,
R., editors, The First International Conference on Knowledge Discovery
and Data Mining (KDD-95), pages 21–26. AAAI Press.

Aycinena, M. (2002). Hierarchical relational reinforcement learning.


http://csail.mit.edu/~aycinena/curis2002/final_paper.pdf.

Bacchus, F. and Jaakkola, T., editors (2005). Proceedings of the 21st Con-
ference on Uncertainty in Artificial Intelligence (UAI ’05). AUAI Press.

Bäck, T., Fogel, D. B., and Michalewicz, Z., editors (2000). Evolutionary
Computation, volume 1. Institute of Physics Publishing.

Baird, L. C. (1995). Residual algorithms: Reinforcement learning with func-


tion approximation. In Proceedings of the 12th International Conference
on Machine Learning, pages 30–37. Morgan Kaufmann.

Baird, L. C. and Moore, A. W. (1998). Gradient descent for general rein-


forcement learning. In Kearns, M. J., Solla, S. A., and Cohn, D. A., edi-
tors, Advances in Neural Information Processing Systems 11. MIT Press.

Banerji, R. B. (1964). A language for the description of concepts. General


Systems, 9:135–141.

Banzhaf, W., Daida, J., Eiben, A. E., Garzon, M. H., Honavar, V., Jakiela,
M., and Smith, R. E., editors (1999). Proceedings of the Genetic and
Evolutionary Computation Conference (GECCO’99). Morgan Kaufmann.

Bartlett, P. L. and Baxter, J. (2002). Estimation and approximation bounds


for gradient-based reinforcement learning. Journal of Computer and Sys-
tem Sciences, 64(1):133–150.

Baxter, J. and Bartlett, P. L. (2001). Infinite-horizon gradient-based policy


search. Journal of Artificial Intelligence Research, 15:319–350.

Bellman, R. E. (1957). Dynamic Programming. Princeton University Press.

Benson, S. S. (1996). Learning Action Models for Reactive Autonomous


Agents. PhD thesis, Department of Computer Science, Stanford Uni-
versity.

Bernadó, E., Llorà, X., and Garrel, J. M. (2002). XCS and GALE: A
comparative study of two learning classifier systems on data mining. In
Lanzi, P. L., Stolzmann, W., and Wilson, S. W., editors, Advances in
Learning Classifier Systems: 4th International Workshop, IWLCS 2001,
volume 2321 of LNCS, pages 115–132. Springer-Verlag.

Bernadó-Mansilla, E. and Garrell-Guiu, J. M. (2003). Accuracy-based learn-


ing classifier systems: Models, analysis and applications to classification
tasks. Evolutionary Computation, 11(3):209–238.

Bertsekas, D. P. and Castañon, D. A. (1989). Adaptive aggregation meth-


ods for infinite horizon dynamic programming. IEEE Transactions on
Automatic Control, 34(6):589–598.

Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming.


Athena Scientific.

Beyer, H.-G. and O’Reilly, U.-M., editors (2005). Proceedings of the Genetic
and Evolutionary Computation Conference, GECCO 2005. ACM Press.

Blockeel, H. (1998). Top-Down Induction of First-Order Logical Decision


Trees. PhD thesis, Department of Computer Science, Katholieke Univer-
siteit Leuven, Leuven, Belgium.

Blockeel, H. and De Raedt, L. (1997). Lookahead and discretization in ILP.


In Džeroski, S. and Lavrač, N., editors, Proceedings of the 7th Interna-
tional Workshop on Inductive Logic Programming, volume 1297 of LNCS,
pages 77–84. Springer-Verlag.

Blockeel, H. and De Raedt, L. (1998). Top-down induction of first-order


logical decision trees. Artificial Intelligence, 101(1–2):285–297.

Blockeel, H., De Raedt, L., Jacobs, N., and Demoen, B. (1999). Scaling
up inductive logic programming by learning from interpretations. Data
Mining and Knowledge Discovery, 3(1):59–93.

Blockeel, H., Džeroski, S., Kompare, B., Kramer, S., Pfahringer, B., and
Van Laer, W. (2004). Experiments in predicting biodegradability. Applied
Artificial Intelligence, 18(2):157–181.

Booker, L. B. (1989). Triggered rule discovery in classifier systems. In


Schaffer (1989), pages 265–274.

Boutilier, C., Dearden, R., and Goldszmidt, M. (2000). Stochastic dynamic


programming with factored representations. Artificial Intelligence, 121(1–
2):49–107.

Boyan, J. A. and Moore, A. W. (1995). Generalization in reinforcement


learning: Safely approximating the value function. In Tesauro, G., Touret-
zky, D. S., and Leen, T. K., editors, Advances in Neural Information
Processing Systems 7, pages 369–376, Cambridge, MA. The MIT Press.

Brodley, C. E., editor (2004). Proceedings of the Twenty-First International


Conference on Machine Learning (ICML 2004), volume 69 of ACM In-
ternational Conference Proceeding Series. ACM Press.

Bull, L. and O’Hara, T. (2002). Accuracy-based neuro and neuro-fuzzy


classifier systems. In Langdon et al. (2002), pages 905–911.

Butz, M. V. (2004). Rule-based Evolutionary Online Learning Systems:


Learning Bounds, Classification, and Prediction. PhD thesis, University
of Illinois at Urbana-Champaign, 104 S. Mathews Avenue, Urbana, IL
61801, U.S.A.

Butz, M. V. (2005). Kernel-based, ellipsoidal conditions in the real-valued


XCS classifier system. In Beyer and O’Reilly (2005), pages 1835–1842.

Butz, M. V. (2006). Rule-based Evolutionary Online Learning Systems: A


Principled Approach to LCS Analysis and Design, volume 191 of Studies
in Fuzziness and Soft Computing. Springer.

Butz, M. V., Goldberg, D. E., and Tharakunnel, K. (2003). Analysis and im-
provement of fitness exploitation in XCS: Bounding models, tournament
selection, and bilateral accuracy. Evolutionary Computation, 11(3):239–
277.

Butz, M. V., Kovacs, T., Lanzi, P. L., and Wilson, S. W. (2004). Toward
a theory of generalization and learning in XCS. IEEE Transactions on
Evolutionary Computation, 8(1):28–46.

Butz, M. V., Lanzi, P. L., and Wilson, S. W. (2006). Hyper-ellipsoidal


conditions in XCS: Rotation, linear approximation, and solution structure.
In Cattolico (2006), pages 1457–1464.

Butz, M. V. and Pelikan, M. (2001). Analyzing the evolutionary pressures


in XCS. In Spector et al. (2001), pages 935–942.

Butz, M. V., Sastry, K., and Goldberg, D. E. (2005). Strong, stable, and
reliable fitness pressure in XCS due to tournament selection. Genetic
Programming and Evolvable Machines, 6(1):53–77.

Butz, M. V. and Wilson, S. W. (2002). An algorithmic description of XCS.


Soft Computing, 6(3–4):144–153.

Califf, M. E. and Mooney, R. J. (1998). Advantages of decision lists and


implicit negatives in inductive logic programming. New Generation Com-
puting, 16(3):263–281.

Casillas, J., Carse, B., and Bull, L. (2004). Fuzzy XCS: an accuracy-based
fuzzy classifier system. In Proceedings of the XII Congreso Espanol sobre
Tecnologia y Logica Fuzzy (ESTYLF 2004), pages 369–376.

Cassandra, A. R., Kaelbling, L. P., and Littman, M. L. (1994). Acting op-


timally in partially observable stochastic domains. In Proceedings of the
Twelfth National Conference on Artificial Intelligence (AAAI-94), vol-
ume 2, pages 1023–1028, Seattle, Washington, USA. AAAI Press/MIT
Press.

Cattolico, M., editor (2006). Proceedings of the Genetic and Evolutionary


Computation Conference, GECCO 2006. ACM Press.

Chapman, D. and Kaelbling, L. P. (1991). Input generalization in delayed


reinforcement learning: An algorithm and performance comparisons. In
Mylopoulos, J. and Reiter, R., editors, Proceedings of the 12th Interna-
tional Joint Conference on Artificial Intelligence (IJCAI-91), pages 726–
731. Morgan Kaufmann.

Cohen, W. W. (1993). Learnability of restricted logic programs. In Mug-


gleton, S. H., editor, Third International Workshop on Inductive Logic
Programming, ILP’93, pages 41–71.

Cohen, W. W. (1995). PAC-learning recursive logic programs: Negative


results. Journal of Artificial Intelligence Research, 2:541–573.

Cohen, W. W. and Page, C. D. (1995). Polynomial learnability and induc-


tive logic programming: Methods and results. New Generation Comput-
ing, 13(3):369–409.

Cole, J., Lloyd, J., and Ng, K. S. (2003). Symbolic learn-


ing for adaptive agents. In Proceedings of the Annual Partner
Conference, Smart Internet Technology Cooperative Research Centre.
http://users.rsise.anu.edu.au/~jwl/crc_paper.pdf.

Cribbs, H. B. and Smith, R. E. (1996). Classifier system renaissance: New


analogies, new directions. In Koza, J. R., Goldberg, D. E., Fogel, D. B.,
and Riolo, R. L., editors, Proceedings of the First Annual Conference of
Genetic Programming, pages 547–552. The MIT Press.

Crites, R. H. and Barto, A. G. (1998). Elevator group control using multiple


reinforcement learning agents. Machine Learning, 33(2-3):235–262.

Croonenborghs, T., Ramon, J., and Bruynooghe, M. (2004). Towards in-


formed reinforcement learning. In Tadepalli et al. (2004a), pages 21–26.
http://eecs.oregonstate.edu/research/rrl/index.html.

Croonenborghs, T., Tuyls, K., Ramon, J., and Bruynooghe, M. (2006).


Multi-agent relational reinforcement learning: Explorations in multi-state
coordination tasks. In Tuyls, K., ’t Hoen, P. J., Verbeeck, K., and Sen, S.,
editors, Learning and Adaptation in Multi-Agent Systems, volume 3898 of
LNCS, pages 192–206. Springer-Verlag.

Cuena, J., Giorgio, A., and Boero, M. (1992). A general knowledge-based


architecture for traffic control: The KITS approach. In Proceedings of the
International Conference on Artificial Intelligence Applications in Trans-
portation Engineering.

Cuena, J., Hernández, J., and Molina, M. (1995). Knowledge-based mod-


els for adaptive traffic management. Transportation Research: Part C,
3(5):311–337.

Dam, H. H., Abbass, H. A., and Lokan, C. (2005). Be real! XCS with
continuous-valued inputs. Technical Report TR-ALAR-200504001, The
Artificial Life and Adaptive Robotics Laboratory, University of New South
Wales, Northcott Drive, Campbell, Canberra, ACT 2600, Australia.

Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learn-


ing, 8(3–4):341–362.

De Raedt, L. (1997). Logical settings for concept-learning. Artificial Intel-


ligence, 95(1):187–201.

De Raedt, L. and Dehaspe, L. (1997). Clausal discovery. Machine Learning,


26(2–3):99–146.

De Raedt, L. and Džeroski, S. (1994). First-order jk-clausal theories are


PAC-learnable. Artificial Intelligence, 70(1–2):375–392.

De Raedt, L. and Van Laer, W. (1995). Inductive constraint logic. In Jantke,


K. P., Shinohara, T., and Zeugmann, T., editors, Proceedings of the Sixth
International Workshop on Algorithmic Learning Theory, volume 997 of
LNCS, pages 80–94. Springer-Verlag.

Dean, T. and Givan, R. (1997). Model minimization in Markov decision


processes. In Proceedings of the 14th National Conference on Artificial
Intelligence and 9th Innovative Applications of Artificial Intelligence Con-
ference (AAAI-97/IAAI-97), pages 106–111.

Debnath, A. K., Lopez de Compadre, R. L., Debnath, G., Schusterman,


A. J., and Hansch, C. (1991). Structure-activity relationship of mutagenic
aromatic and heteroaromatic nitro compounds. Correlation with molecu-
lar orbital energies and hydrophobicity. Journal of Medicinal Chemistry,
34(2):786–797.

Dehaspe, L. and De Raedt, L. (1997). Mining association rules in multiple


relations. In Lavrac, N. and Džeroski, S., editors, Inductive Logic Pro-
gramming, 7th International Workshop, ILP-97, volume 1297 of LNCS,
pages 125–132. Springer-Verlag.

d’Epenoux, F. (1963). A probabilistic production and inventory problem.


Management Science, 10:98–108.

Dietterich, T. G. (2000). Hierarchical reinforcement learning with the


MAXQ value function decomposition. Journal of Artificial Intelligence
Research, 13:227–303.

Divina, F. (2004). Hybrid Genetic Relational Search for Inductive Learn-


ing. PhD thesis, Department of Computer Science, Vrije Universiteit,
Amsterdam, The Netherlands.

Divina, F. (2006). Evolutionary concept learning in first order logic: An


overview. AI Communications, 19(1):13–33.

Divina, F., Keijzer, M., and Marchiori, E. (2003). A method for handling
numerical attributes in GA-based inductive concept learners. In Cantú-
Paz, E., Foster, J. A., Deb, K., Davis, D., Roy, R., O’Reilly, U.-M., Beyer,
H.-G., Standish, R., Kendall, G., Wilson, S., Harman, M., Wegener, J.,
Dasgupta, D., Potter, M. A., Schultz, A. C., Dowsland, K., Jonoska, N.,

and Miller, J., editors, Genetic and Evolutionary Computation – GECCO


2003, volume 2723 of LNCS, pages 898–908. Springer-Verlag.

Divina, F. and Marchiori, E. (2002). Evolutionary concept learning. In


Langdon et al. (2002), pages 343–350.

Dorigo, M. and Sirtori, E. (1991). ALECSYS: A parallel laboratory for learn-


ing classifier systems. In Belew, R. K. and Booker, L. B., editors, Pro-
ceedings of the Fourth International Conference on Genetic Algorithms,
pages 296–302. Morgan Kaufmann.

Doya, K. (2000). Reinforcement learning in continuous time and space.


Neural Computation, 12(1):219–245.

Driessens, K. (2004). Relational Reinforcement Learning. PhD thesis, De-


partment of Computer Science, Katholieke Universiteit Leuven, Leuven,
Belgium.

Driessens, K. and Blockeel, H. (2001). Learning digger using hierarchical


reinforcement learning for concurrent goals. In Proceedings of the 5th
European Workshop on Reinforcement Learning (EWRL’01).

Driessens, K. and Džeroski, S. (2004). Integrating guidance into relational


reinforcement learning. Machine Learning, 57(3):271–304.

Driessens, K. and Džeroski, S. (2005). Combining model-based and instance-


based learning for first order regression. In De Raedt, L. and Wrobel,
S., editors, Proceedings of the Twenty-Second International Conference
on Machine Learning (ICML 2005), volume 119 of ACM International
Conference Proceeding Series, pages 193–200. ACM Press.

Driessens, K. and Ramon, J. (2003). Relational instance based regression for


relational reinforcement learning. In Fawcett, T. and Mishra, N., editors,
Machine Learning, Proceedings of the Twentieth International Conference
(ICML 2003), pages 123–130. AAAI Press.

Driessens, K., Ramon, J., and Blockeel, H. (2001). Speeding up relational


reinforcement learning through the use of an incremental first order de-
cision tree learner. In De Raedt, L. and Flach, P., editors, Proceedings
of the 12th European Conference on Machine Learning, pages 97–108.
Springer-Verlag.

Džeroski, S., Blockeel, H., Kompare, B., Kramer, S., Pfahringer, B., and
Van Laer, W. (1999). Experiments in predicting biodegradability. In
Džeroski, S. and Flach, P., editors, International Workshop on Inductive
Logic Programming, volume 1634 of LNCS, pages 80–91. Springer-Verlag.

Džeroski, S., De Raedt, L., and Blockeel, H. (1998a). Relational reinforce-


ment learning. In International Workshop on Inductive Logic Program-
ming, pages 11–22.

Džeroski, S., De Raedt, L., and Driessens, K. (2001). Relational reinforce-


ment learning. Machine Learning, 43(1–2):7–52.

Džeroski, S., Jacobs, N., Molina, M., Moure, C., Muggleton, S., and van
Laer, W. (1998b). Detecting traffic problems with ILP. In Proceedings
of the Eighth International Conference on Inductive Logic Programming,
volume 1446 of LNCS, pages 281–290. Springer-Verlag.

Džeroski, S. and Lavrač, N., editors (2001). Relational Data Mining.


Springer.

Džeroski, S., Muggleton, S., and Russell, S. (1992). PAC-learnability of


determinate logic programs. In Proceedings of the fifth annual workshop
on Computational learning theory (COLT ’92), pages 128–135. ACM.

Emde, W. and Wettschereck, D. (1996). Relational instance-based learning.


In Proceedings of the Thirteenth International Conference (ICML ’96),
pages 122–130. Morgan Kaufmann.

Fern, A., Yoon, S. W., and Givan, R. (2004a). Approximate policy iteration
with a policy language bias. In Thrun, S., Saul, L. K., and Schölkopf, B.,

editors, Proceedings of 16th Conference on Advances in Neural Informa-


tion Processing, NIPS 2003. MIT Press.

Fern, A., Yoon, S. W., and Givan, R. (2004b). Learning domain-specific


control knowledge from random walks. In Zilberstein, S., Koehler, J., and
Koenig, S., editors, Proceedings of the Fourteenth International Confer-
ence on Automated Planning and Scheduling (ICAPS 2004), pages 191–
199. AAAI Press.

Finney, S., Gardiol, N. H., Kaelbling, L. P., and Oates, T. (2002b). The
thing that we tried didn’t work very well: Deictic representation in rein-
forcement learning. In Proceedings of the 18th International Conference
on Uncertainty in Artificial Intelligence (UAI-02).

Finney, S., Gardiol, N. H., Kaelbling, L. P., and Oates, T. (April 2002a).
Learning with deictic representation. Technical Report AI Laboratory
Memo AIM-2002-006, MIT, Cambridge, MA.

Finzi, A. and Lukasiewicz, T. (2004a). Game-theoretic agent programming


in Golog. In de Mántaras, R. L. and Saitta, L., editors, Proceedings
of the 16th Eureopean Conference on Artificial Intelligence (ECAI’2004),
including Prestigious Applicants of Intelligent Systems (PAIS 2004), pages
23–27. IOS Press.

Finzi, A. and Lukasiewicz, T. (2004b). Relational Markov games. In Alferes,


J. J. and Leite, J. A., editors, Proceedings of the 9th European Conference
on Logics in Artificial Intelligence (JELIA 2004), volume 3229 of LNCS,
pages 320–333. Springer-Verlag.

Finzi, A. and Lukasiewicz, T. (2005). Game theoretic Golog under partial


observability. In Dignum, F., Dignum, V., Koenig, S., Kraus, S., Singh,
M. P., and Wooldridge, M., editors, Proceedings of the 4th International
Joint Conference on Autonomous Agents and Multiagent Systems (AA-
MAS 2005), pages 1301–1302. ACM.

Fitch, R., Hengst, B., Šuc, D., Calbert, G., and Scholz, J. (2005). Struc-
tural abstraction experiments in reinforcement learning. In Zhang, S. and
Jarvis, R., editors, Proceedings of the 18th Australian Joint Conference on
Artificial Intelligence (AI 2005), volume 3809 of LNCS, pages 164–175.
Springer-Verlag.

Fühner, T. and Kókai, G. (2003). Incorporating linkage learning into the


GeLog framework. Journal Acta Cybernetica, 16(2):209–228.

Gärtner, T., Driessens, K., and Ramon, J. (2003a). Graph kernels and
Gaussian processes for relational reinforcement learning. In Horváth and
Yamamoto (2003), pages 146–163.

Gärtner, T., Flach, P., and Wrobel, S. (2003b). On graph kernels: Hardness
results and efficient alternatives. In Schölkopf, B. and Warmuth, M. K.,
editors, Learning Theory and Kernel Machines: 16th Annual Conference
on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, vol-
ume 2777 of LNCS, pages 129–143. Springer-Verlag.

Genesereth, M. R. and Nilsson, N. J. (1987). Logical Foundations of Artifi-


cial Intelligence. Morgan Kaufmann.

Giordana, A. and Neri, F. (1995). Search-intensive concept induction. Evo-


lutionary Computation, 3(4):375–419.

Givan, R., Dean, T., and Greig, M. (2003). Equivalence notions and model
minimization in Markov decision processes. Artificial Intelligence, 147(1–
2):163–223.

Goldberg, D. E. (1983). Computer-Aided Gas Pipeline Operation using Gen-


etic Algorithms and Rule-Learning. PhD thesis, University of Michigan.

Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and


Machine Learning. Addison-Wesley.

Goldberg, D. E., Deb, K., and Korb, B. (1990). Messy genetic algorithms
revisited: Studies in mixed size and scale. Complex Systems, 4(4):415–444.

Goldberg, D. E., Korb, B., and Deb, K. (1989). Messy genetic algorithms:
Motivation, analysis and first results. Complex Systems, 3(4):493–530.

Gordon, G. J. (1995). Stable function approximation in dynamic program-


ming. In Prieditis, A. and Russell, S., editors, Proceedings of the Twelfth
International Conference on Machine Learning, pages 261–268, San Fran-
cisco, CA. Morgan Kaufmann.

Gretton, C. and Thiébaux, S. (2004). Exploiting first-order regression


in inductive policy selection. In Tadepalli et al. (2004a), pages 51–56.
http://eecs.oregonstate.edu/research/rrl/index.html.

Großmann, A., Hölldobler, S., and Skvortsova, O. (2002). Symbolic dynamic


programming within the fluent calculus. In Ishii, N., editor, Proceedings
of the IASTED International Conference on Artificial and Computational
Intelligence, pages 378–383, Tokyo, Japan. ACTA Press.

Guestrin, C., Koller, D., Parr, R., and Venkataraman, S. (2003). Efficient
solution algorithms for factored MDPs. Journal of Artificial Intelligence
Research (JAIR), 19:399–468.

Hekanaho, J. (1996). Background knowledge in GA-based concept learning.


In Proceedings of the 13th International Conference on Machine Learning,
pages 234–242.

Hekanaho, J. (1998). DOGMA: A GA-based relational learner. In Page, D.,


editor, Proceedings of the 8th International Conference on Inductive Logic
Programming, volume 1446 of LNCS, pages 205–214. Springer-Verlag.

Helft, N. (1989). Induction as nonmonotonic inference. In Proceedings of the


First International Conference on Principles of Knowledge Representation
and Reasoning, pages 149–156. Morgan Kaufmann.

Hengst, B. (2002). Discovering hierarchy in reinforcement learning with


HEXQ. In Sammut, C. and Hoffmann, A. G., editors, Proceedings of the
Nineteenth International Conference on Machine Learning (ICML 2002),
pages 243–250. Morgan Kaufmann.

Holland, J. H. (1976). Adaptation. In Rosen, R. and Snell, F. M., editors,


Progress in Theoretical Biology, volume 4, NY. Plenum.

Holland, J. H. and Reitman, J. S. (1978). Cognitive systems based on adap-


tive algorithms. In Waterman, D. and Hayes-Roth, F., editors, Pattern-
Directed Inference Systems. Academic Press, Orlando, FL, USA.

Hölldobler, S. and Skvortsova, O. (2004). A logic-based approach to dynamic


programming. In Proceedings of the AAAI Workshop on Learning and
Planning in Markov Processes—Advances and Challenges, pages 31–36.
AAAI Press.

Horváth, T. and Yamamoto, A., editors (2003). Proceedings of the 13th In-
ternational Conference on Inductive Logic Programming, ILP 2003, vol-
ume 2835 of LNCS. Springer-Verlag.

Howard, R. A. (1960). Dynamic Programming and Markov Processes. The


MIT Press, Cambridge, MA.

Itoh, H. and Nakamura, K. (2004). Towards learning to learn and plan by


relational reinforcement learning. In Tadepalli et al. (2004a), pages 34–39.
http://eecs.oregonstate.edu/research/rrl/index.html.

Jaakkola, T., Jordan, M. I., and Singh, S. P. (1994). On the convergence of


stochastic iterative dynamic programming algorithms. Neural Computa-
tion, 6(6):1185–1201.

Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement


learning: A survey. Journal of Artificial Intelligence Research, 4:237–285.

Karabaev, E. and Skvortsova, O. (2005). A heuristic search algorithm for


solving first-order MDPs. In Bacchus and Jaakkola (2005), pages 292–299.

Kersting, K. and De Raedt, L. (2003). Logical Markov decision programs.


In Working Notes of the IJCAI-2003 Workshop on Learning Statistical
Models from Relational Data (SRL-03), pages 63–70.

Kersting, K. and De Raedt, L. (2004). Logical Markov decision programs


and the convergence of logical TD(λ). In Camacho, R., King, R. D., and
Srinivasan, A., editors, Proceedings of the Inductive Logic Programming,
14th International Conference, ILP 2004, volume 3194 of LNCS, pages
180–197. Springer-Verlag.

Kersting, K., van Otterlo, M., and De Raedt, L. (2004). Bellman goes
relational. In Brodley (2004).

Khardon, R. (1999). Learning action strategies for planning domains. Arti-


ficial Intelligence, 113(1–2):125–148.

Kietz, J.-U. (1993). Some lower bounds for the computational complexity
of inductive logic programming. In Proceedings of the Sixth European
Conference on Machine Learning (ECML93), volume 667 of LNCS, pages
115–123. Springer-Verlag.

Kim, K.-E. and Dean, T. (2003). Solving factored MDPs using non-
homogeneous partitions. Artificial Intelligence, 147(1–2):225–251.

Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy


estimation and model selection. In Proceedings of the Fourteenth Inter-
national Joint Conference on Artificial Intelligence (IJCAI 95), pages
1137–1145. Morgan Kaufmann.

Kókai, G. (2001). GeLog - A system combining genetic algorithm with induc-


tive logic programming. In Proceedings of the International Conference,
7th Fuzzy Days on Computational Intelligence, Theory and Applications,
volume 2206 of LNCS, pages 326–344. Springer-Verlag.

Kovacs, T. (1996). Evolving optimal populations with XCS classifier sys-


tems. Master’s thesis, School of Computer Science, University of Birm-
ingham, UK.

Kovacs, T. (1997). XCS classifier system reliably evolves accurate, complete,


and minimal representations for boolean functions. In Roy, Chawdhry, and

Pant, editors, Soft Computing in Engineering Design and Manufacturing,


pages 59–68. Springer-Verlag.

Kovacs, T. (2000). Strength or accuracy? fitness calculation in learning


classifier systems. In Lanzi et al. (2000), pages 143–160.

Kovacs, T. (2001). Towards a theory of strong overgeneral classifiers. In


Martin, W. and Spears, W., editors, Foundations of Genetic Algorithms
6, pages 165–184. Morgan Kaufmann.

Kovacs, T. (2002). A Comparison of Strength and Accuracy-Based Fitness


in Learning Classifier Systems. PhD thesis, School of Computer Science,
University of Birmingham, UK.

Koza, J. R. (1992). Genetic Programming: On the Programming of Comput-


ers by Means of Natural Selection. The MIT Press, Cambridge,
MA.

Langdon, W. B., Cantú-Paz, E., Mathias, K. E., Roy, R., Davis, D., Poli,
R., Balakrishnan, K., Honavar, V., Rudolph, G., Wegener, J., Bull, L.,
Potter, M. A., Schultz, A. C., Miller, J. F., Burke, E. K., and Jonoska, N.,
editors (2002). GECCO 2002: Proceedings of the Genetic and Evolution-
ary Computation Conference, New York, USA, 9-13 July 2002. Morgan
Kaufmann.

Lanzi, P. L. (1999a). Extending the representation of classifier conditions,


part I: From binary to messy coding. In Banzhaf et al. (1999), pages
337–344.

Lanzi, P. L. (1999b). Extending the representation of classifier conditions,


part II: From messy codings to S-expressions. In Banzhaf et al. (1999),
pages 345–352.

Lanzi, P. L. (2001). Mining interesting knowledge from data with the XCS
classifier system. In Spector et al. (2001), pages 958–965.

Lanzi, P. L., Stolzmann, W., and Wilson, S. W., editors (2000). Learning Classifier Systems: From Foundations to Applications, volume 1813 of LNCS. Springer-Verlag.

Lanzi, P. L. and Wilson, S. W. (2006). Using convex hulls to represent classifier conditions. In Cattolico (2006), pages 1481–1488.

Lecœuche, R. (2001). Learning optimal dialogue management rules by using reinforcement learning and inductive logic programming. In NAACL 2001, Language Technologies 2001: The Second Meeting of the North American Chapter of the Association for Computational Linguistics. http://acl.ldc.upenn.edu/N/N01/.

Letia, I. A. and Precup, D. (2002). Developing collaborative Golog agents by reinforcement learning. International Journal on Artificial Intelligence Tools, 11(2):233–246.

Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th International Conference on Machine Learning (ML-94), pages 157–163, New Brunswick, NJ. Morgan Kaufmann.

Littman, M. L. (1996). Algorithms for Sequential Decision Making. PhD thesis, Department of Computer Science, Brown University.

Llorà, X. (2002). Genetic Based Machine Learning using Fine-grained Parallelism for Data Mining. PhD thesis, Enginyeria i Arquitectura La Salle, Ramon Llull University, Barcelona.

Lloyd, J. W. (1987). Foundations of Logic Programming. Springer, 2nd edition.

Lloyd, J. W. (2003). Logic for Learning: Learning Comprehensible Theories from Structured Data. Cognitive Technologies Series. Springer.

Lodhi, H. and Muggleton, S. (2005). Is Mutagenesis still challenging? In Proceedings of the 15th International Conference on Inductive Logic Programming, ILP 2005, pages 35–40. Late breaking papers.

MacKay, D. J. C. (1998). Introduction to Gaussian processes. http://www.cs.toronto.edu/~mackay/BayesGP.html.

Martín, M. and Geffner, H. (2004). Learning generalized policies from planning examples using concept languages. Applied Intelligence, 20(1):9–19.

Mellor, D. (2005). A first order logic classifier system. In Beyer and O’Reilly (2005), pages 1819–1826.

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.

MLnet OiS (2000). The machine learning network online information service. http://www.mlnet.org/cgi-bin/mlnetois.pl/?File=datasets.html.

Moore, A. W. and Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13:103–130.

Morales, E. F. (2003). Scaling up reinforcement learning with a relational representation. In Proceedings of the Workshop on Adaptability in Multi-agent Systems (AORC–2003), pages 15–26.

Morales, E. F. (2004). Relational state abstractions for reinforcement learning. In Tadepalli et al. (2004a), pages 27–32. http://eecs.oregonstate.edu/research/rrl/index.html.

Morales, E. F. and Sammut, C. (2004). Learning to fly by combining reinforcement learning with behavioural cloning. In Brodley (2004), pages 598–605.

Mărginean, F. A. (2003). Which first-order logic clauses can be learned using genetic algorithms? In Horváth and Yamamoto (2003), pages 233–250.

Muggleton, S. (1991). Inductive logic programming. New Generation Computing, 8(4):295–318.

Muggleton, S. (1995). Inverse entailment and Progol. New Generation Computing, Special issue on Inductive Logic Programming, 13(3–4):245–286.

Muggleton, S. and De Raedt, L. (1994). Inductive logic programming: Theory and methods. Journal of Logic Programming, 19/20:629–679.

Muller, T. J. and van Otterlo, M. (2005). Evolutionary reinforcement learning in relational domains. In Seventh European Workshop on Reinforcement Learning (EWRL’05).

Neri, F. and Saitta, L. (1995). Analysis of genetic algorithms evolution under pure selection. In Eshelman, L. J., editor, Proceedings of the 6th International Conference on Genetic Algorithms, pages 32–41. Morgan Kaufmann, Pittsburgh, PA, USA.

Ng, K. S. (2005). Learning Comprehensible Theories from Structured Data. PhD thesis, Research School of Information Sciences and Engineering, Australian National University.

Nienhuys-Cheng, S.-H. and de Wolf, R. (1997). Foundations of Inductive Logic Programming, volume 1228 of LNCS. Springer-Verlag.

Osborn, T. R., Charif, A., Lamas, R., and Dubossarsky, E. (1995). Genetic logic programming. In IEEE Conference on Evolutionary Computation, pages 728–732. IEEE Press.

OUCL (2000). The Oxford University Computing Laboratory. http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/applications.html.

Peng, J. and Williams, R. J. (1993). Efficient learning and planning within the Dyna framework. Adaptive Behaviour, 1(4):437–454.

Peng, J. and Williams, R. J. (1996). Incremental multi-step Q-Learning. Machine Learning, 22(1–3):283–290.

Plotkin, G. D. (1969). A note on inductive generalisation. In Meltzer, B. and Michie, D., editors, Proceedings of the Machine Intelligence Workshop, volume 5, pages 153–163. Edinburgh University Press.

Plotkin, G. D. (1971). Automatic Methods of Inductive Inference. PhD thesis, Edinburgh University.

Precup, D., Sutton, R. S., and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of the 18th International Conference on Machine Learning, pages 417–424, San Francisco, CA. Morgan Kaufmann.

Precup, D., Sutton, R. S., Paduraru, C., Koop, A., and Singh, S. P. (2006). Off-policy learning with options and recognizers. In Weiss, Y., Schölkopf, B., and Platt, J., editors, Advances in Neural Information Processing Systems 18, pages 1097–1104, Cambridge, MA. MIT Press.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York.

Puterman, M. L. and Shin, M. C. (1978). Modified policy iteration algorithms for discounted Markov decision problems. Management Science, 24(11):1127–1137.

Quinlan, J. R. (1990). Learning logical definitions from relations. Machine Learning, 5(3):239–266.

Ramon, J. and Driessens, K. (2004). On the numeric stability of Gaussian processes regression for relational reinforcement learning. In Tadepalli et al. (2004a), pages 10–14. http://eecs.oregonstate.edu/research/rrl/index.html.

Reiser, P. (1999). Evolutionary Algorithms for Learning Formulae in First-order Logic. PhD thesis, Department of Computer Science, University of Wales, Aberystwyth.

Riolo, R. L. (1988). Empirical Studies of Default Hierarchies and Sequences of Rules in Learning Classifier Systems. PhD thesis, University of Michigan.

Robertson, G. G. and Riolo, R. L. (1988). A tale of two classifier systems. Machine Learning, 3(2–3):139–159.

Roncagliolo, S. and Tadepalli, P. (2004). Function approximation in hierarchical relational reinforcement learning. In Tadepalli et al. (2004a), pages 69–73. http://eecs.oregonstate.edu/research/rrl/index.html.

Rummery, G. A. and Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department.

Russell, S. J. and Norvig, P. (2003). Artificial Intelligence: A Modern Approach. Prentice Hall.

Samuel, A. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210–229.

Sanner, S. (2005). Simultaneous learning of structure and value in relational reinforcement learning. In Driessens, K., Fern, A., and van Otterlo, M., editors, Proceedings of the ICML’05 Workshop on Rich Representations for Reinforcement Learning. http://www.cs.waikato.ac.nz/%7Ekurtd/rrfrl/.

Sanner, S. and Boutilier, C. (2005). Approximate linear programming for first-order MDPs. In Bacchus and Jaakkola (2005), pages 509–517.

Schaffer, J. D., editor (1989). Proceedings of the Third International Conference on Genetic Algorithms, San Mateo, CA. Morgan Kaufmann.

Schuurmans, D. and Schaeffer, J. (1989). Representational difficulties with classifier systems. In Schaffer (1989), pages 328–333.

Shannon, C. (1950). Programming a computer for playing chess. Philosophical Magazine, 41(4):256–275.

Shapiro, E. Y. (1981). An algorithm that infers theories from facts. In Hayes, P. J., editor, Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI ’81), pages 446–451. William Kaufmann.

Shapiro, E. Y. (1983). Algorithmic Program Debugging. MIT Press, Cambridge, MA, USA.

Shu, L. and Schaeffer, J. (1989). VCS: Variable classifier system. In Schaffer (1989), pages 334–339.

Singh, S. P., Jaakkola, T., and Jordan, M. I. (1995). Reinforcement learning with soft state aggregation. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems 7 (NIPS), pages 361–368. MIT Press.

Slaney, J. and Thiébaux, S. (2001). Blocks World revisited. Artificial Intelligence, 125:119–153.

Smith, R. E. (1991). Default Hierarchy Formation and Memory Exploitation in Learning Classifier Systems. PhD thesis, University of Alabama.

Sondik, E. J. (1978). The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research, 26(2):282–304.

Spector, L., Goodman, E. D., Wu, A., Langdon, W. B., Voigt, H.-M., Gen, M., Sen, S., Dorigo, M., Pezeshk, S., Garzon, M. H., and Burke, E., editors (2001). Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), San Francisco, California, USA. Morgan Kaufmann.

Srinivasan, A., King, R. D., and Muggleton, S. (1999). The role of background knowledge: using a problem from chemistry to examine the performance of an ILP program. Technical Report PRG-TR-08-99, Oxford University Computing Laboratory, Oxford, UK.

Srinivasan, A., Muggleton, S., and King, R. (1995). Comparing the use of background knowledge by inductive logic programming systems. In De Raedt, L., editor, Proceedings of the Fifth International Inductive Logic Programming Workshop. Katholieke Universiteit, Leuven. Withdrawn from publication and replaced by Srinivasan et al. (1999).

Srinivasan, A., Muggleton, S., King, R., and Sternberg, M. (1994). Mutagenesis: ILP experiments in a non-determinate biological domain. In Wrobel, S., editor, Proceedings of the 4th International Workshop on Inductive Logic Programming, pages 217–232. Gesellschaft für Mathematik und Datenverarbeitung MBH.

Srinivasan, A., Muggleton, S., Sternberg, M. J. E., and King, R. D. (1996). Theories for mutagenicity: A study in first-order and feature-based induction. Artificial Intelligence, 85(1–2):277–299.

Stolzmann, W. (2000). An introduction to anticipatory classifier systems. In Lanzi et al. (2000), pages 175–194.

Stone, C. and Bull, L. (2003). For real! XCS with continuous-valued inputs. Evolutionary Computation, 11(3):299–336.

Stone, C. and Bull, L. (2005). An analysis of continuous-valued representations for learning classifier systems. In Bull, L. and Kovacs, T., editors, Foundations of Learning Classifier Systems, volume 183 of Studies in Fuzziness and Soft Computing, pages 127–176. Springer-Verlag.

Sutton, R., McAllester, D., Singh, S., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12:1057–1063.

Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3(1):9–44.

Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216–224. Morgan Kaufmann.

Sutton, R. S. (1991a). DYNA, an integrated architecture for learning, planning and reacting. SIGART Bulletin, 2:160–163.

Sutton, R. S. (1991b). Planning by incremental dynamic programming. In Proceedings of the Eighth International Workshop on Machine Learning, pages 353–357. Morgan Kaufmann.

Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Touretzky, D. S., Mozer, M., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems 8 (NIPS), pages 1038–1044. MIT Press.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. The MIT Press.

Sutton, R. S., Precup, D., and Singh, S. P. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1–2):181–211.

Tadepalli, P., Givan, R., and Driessens, K., editors (2004a). Proceedings of the ICML’04 Workshop on Relational Reinforcement Learning. http://eecs.oregonstate.edu/research/rrl/index.html.

Tadepalli, P., Givan, R., and Driessens, K. (2004b). Relational reinforcement learning: an overview. In Tadepalli et al. (2004a), pages 1–9. http://eecs.oregonstate.edu/research/rrl/index.html.

Tamaddoni-Nezhad, A. and Muggleton, S. (2000). Searching the subsumption lattice by a genetic algorithm. In Cussens, J. and Frisch, A. M., editors, Proceedings of the 10th International Conference on Inductive Logic Programming, ILP 2000, volume 1866 of LNCS, pages 243–252. Springer-Verlag.

Tamaddoni-Nezhad, A. and Muggleton, S. (2001). Using genetic algorithms for learning clauses in first-order logic. In Spector et al. (2001), pages 639–646.

Tamaddoni-Nezhad, A. and Muggleton, S. (2003). A genetic algorithms approach to ILP. In Matwin, S. and Sammut, C., editors, Proceedings of the 12th International Conference on Inductive Logic Programming (ILP 2002), volume 2583 of LNCS, pages 285–300. Springer-Verlag.

Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2):215–219.

Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–67.

Thie, C. J. and Giraud-Carrier, C. (2005). Learning concept descriptions with typed evolutionary programming. IEEE Transactions on Knowledge and Data Engineering, 17(12):1664–1677.

Thornton, C. (2000). Truth from Trash: How Learning Makes Sense. The MIT Press.

Tseng, P. (1990). Solving h-horizon, stationary Markov decision problems in time proportional to log(h). Operations Research Letters, 9(5):287–297.

Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202.

Tsitsiklis, J. N. and Roy, B. V. (1996). Feature-based methods for large scale dynamic programming. Machine Learning, 22(1–3):59–94.

Utgoff, P. E., Berkman, N. C., and Clouse, J. A. (1997). Decision tree induction based on efficient tree restructuring. Machine Learning, 29(1):5–44.

Van Laer, W. (2002). From Propositional to First Order Logic in Machine Learning and Data Mining. PhD thesis, Katholieke Universiteit Leuven, Belgium.

Van Laer, W. and De Raedt, L. (2001). How to upgrade propositional learners to first order logic: A case study. In Paliouras, G., Karkaletsis, V., and Spyropoulos, C. D., editors, Machine Learning and Its Applications, volume 2049 of LNCS, pages 102–126. Springer-Verlag.

Van Laer, W., Raedt, L. D., and Džeroski, S. (1997). On multi-class problems and discretization in inductive logic programming. In Proceedings of the 10th International Symposium on Methodologies for Intelligent Systems, volume 1325 of LNCS, pages 277–286. Springer-Verlag.

van Otterlo, M. (2002). Relational representations in reinforcement learning: Review and open problems. In de Jong, E. and Oates, T., editors, Proceedings of the ICML-2002 Workshop on Development of Representations, pages 39–46. The University of New South Wales, Sydney, NSW.

van Otterlo, M. (2004). Reinforcement learning for relational MDPs. In Nowé, A., Lenaerts, T., and Steenhaut, K., editors, Proceedings of the Annual Machine Learning Conference of Belgium and the Netherlands (BeNeLearn’04), pages 138–145.

van Otterlo, M. (2005). A survey of reinforcement learning in relational domains. Technical Report TR-CTIT-05-31, University of Twente, The Netherlands.

Venturini, G. (1994). Apprentissage Adaptatif et Apprentissage Supervisé par Algorithme Génétique. PhD thesis, Université de Paris-Sud.

Walker, T., Shavlik, J., and Maclin, R. (2004). Relational reinforcement learning via sampling the space of first-order conjunctive features. In Tadepalli et al. (2004a), pages 15–20. http://eecs.oregonstate.edu/research/rrl/index.html.

Watkins, C. (1989). Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge University, UK.

Watkins, C. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4):279–292.

Whitehead, S. D. and Ballard, D. H. (1991). Learning to perceive and act by trial and error. Machine Learning, 7(1):45–83.

Williams, R. J. and Baird, L. C. (1993). Tight performance bounds on greedy policies based on imperfect value functions. Technical Report NU-CCS-93-14, Northeastern University, College of Computer Science, Boston, MA, USA.

Wilson, S. W. (1985). Knowledge growth in an artificial animal. In Grefenstette, J. J., editor, Proceedings of the 1st International Conference on Genetic Algorithms (ICGA), pages 16–23. Lawrence Erlbaum Associates.

Wilson, S. W. (1987). Classifier systems and the animat problem. Machine Learning, 2(3):199–228.

Wilson, S. W. (1994). ZCS: A zeroth level classifier system. Evolutionary Computation, 2(1):1–18.

Wilson, S. W. (1995). Classifier fitness based on accuracy. Evolutionary Computation, 3(2):149–175.

Wilson, S. W. (1998). Generalization in the XCS classifier system. In Koza, J. R., Banzhaf, W., Chellapilla, K., Deb, K., Dorigo, M., Fogel, D. B., Garzon, M. H., Goldberg, D. E., Iba, H., and Riolo, R., editors, Genetic Programming 1998: Proceedings of the Third Annual Conference, pages 665–674, University of Wisconsin, Madison, Wisconsin, USA. Morgan Kaufmann.

Wilson, S. W. (2000). Get real! XCS with continuous-valued inputs. In Lanzi et al. (2000), pages 209–222.

Wilson, S. W. (2001a). Function approximation with a classifier system. In Spector et al. (2001), pages 974–981.

Wilson, S. W. (2001b). Mining oblique data with XCS. In Lanzi, P. L., Stolzmann, W., and Wilson, S. W., editors, Advances in Learning Classifier Systems: Proceedings of the Third International Workshop (IWLCS 2001), volume 1996 of LNCS, pages 158–174. Springer-Verlag.

Wilson, S. W. (2002). Classifiers that approximate functions. Natural Computing, 1(2–3):211–234.

Winograd, T. (1972). Understanding natural language. Cognitive Psychology, 3(1):1–191.

Wong, M. L. and Leung, K. S. (1995). Inducing logic programs with genetic algorithms: The genetic logic programming system. IEEE Expert, 10(5):68–76.

Wrobel, S. and Džeroski, S. (1995). The ILP description learning problem: Towards a general model-level definition of data mining in ILP. In Morik, K. and Herrmann, J., editors, Proceedings of FGML-95, Annual Workshop of the GI Special Interest Group Machine Learning (GI FG 1.1.3), pages 33–39. University of Dortmund.

Yoon, S. W., Fern, A., and Givan, R. (2002). Inductive policy selection for first-order MDPs. In Darwiche, A. and Friedman, N., editors, UAI ’02, Proceedings of the 18th Conference in Uncertainty in Artificial Intelligence. Morgan Kaufmann.

Yoon, S. W., Fern, A., and Givan, R. (2005). Learning measures of progress for planning domains. In Veloso, M. M. and Kambhampati, S., editors, Proceedings, The Twentieth National Conference on Artificial Intelligence and the Seventeenth Innovative Applications of Artificial Intelligence Conference (IAAI 2005), pages 1217–1222. AAAI Press.

Zhang, C. and Baras, J. S. (2001). A new adaptive aggregation algorithm for infinite horizon dynamic programming. Technical Report CSHCN TR 2001-5, Center for Satellite and Hybrid Communication Networks, College Park, MD, USA.
