Thesis
Approach to Relational
Reinforcement Learning
Drew Mellor
B. Comp. Sci. (Hons)
February 2008
I hereby certify that the work embodied in this thesis is the result of original
research and has not been submitted for a higher degree to any other Uni-
versity or Institution.
(Signed): ..................................................................
Acknowledgements
Abstract  xiii
1 Introduction  1
1.1 Motivation  2
1.2 Approach  9
2 Background  17
2.3 Generalisation  30
2.3.2 Aggregation  33
2.3.3 Remarks  35
2.5 Summary  44
3.4 Discussion  66
3.5 Summary  69
4.3.3 Discussion  97
8 Conclusion  193
The lists below contain symbols and acronyms used commonly throughout
the thesis. Where relevant, page numbers indicate where the symbol or
abbreviation is introduced.
Logic
∧ conjunction
∨ disjunction
¬ negation
← implication
∃ existential quantifier
∀ universal quantifier
|= entailment, 215
θ a substitution
Reinforcement Learning
s a state, 18
a an action, 18
r a reward, 18
S state space, 18
A action space, 18
π policy, 21
π∗ optimal policy, 22
γ discount factor, 21
α learning rate, 27
The following list gives the parameters associated with an individual rule j
in Xcs (the parameter’s name is given in parentheses):
exp j   the number of times j has been a member of an action set (experience), 75
ns j    an estimate of the mean size of the action sets that j has been a member of (niche size), 76
[P]   the current set of rules contained within the system (population), 74
[M] the set of rules matching the current state (match set), 77
[A] the subset of [M] advocating the selected action (action set), 80
γ discount factor, 81
Acronyms
This thesis investigates an approach to RRL derived from the learning clas-
sifier system Xcs. In brief, the new system, Foxcs, generates, evaluates,
and evolves a population of “condition-action” rules that are definite clauses
over first-order logic. The rules are typically comprehensible enough to be
understood by humans and can be inspected to determine the acquired prin-
ciples. Key properties of Foxcs, which are inherited from Xcs, are that it
is general (applies to arbitrary Markov decision processes), model-free (re-
wards and state transitions are “black box” functions), and “tabula rasa”
(the initial policy can be unspecified). Furthermore, in contrast to decision
tree learning, its rule-based approach is ideal for incrementally learning ex-
pressions over first-order logic, a valuable characteristic for an RRL system.
Perhaps the most novel aspect of Foxcs is its inductive component, which
synthesizes evolutionary computation and first-order logic refinement for
incremental learning. New evolutionary operators were developed because
previous combinations of evolutionary computation and first-order logic were
non-incremental. The effectiveness of the inductive component was empiri-
cally demonstrated by benchmarking on ILP tasks, which found that Foxcs
produced hypotheses of comparable accuracy to several well-known ILP al-
gorithms. Further benchmarking on RRL tasks found that the optimality
of the policies learnt was at least comparable to that of existing RRL sys-
tems. Finally, a significant advantage of its use of variables in rules was
demonstrated: unlike RRL systems that did not use variables, Foxcs, with
appropriate extensions, learnt scalable policies that were genuinely indepen-
dent of the dimensionality of the task environment.
Chapter 1
Introduction
A small child learns that blocks fit into holes according to their shape.
A student distinguishes between crystals on the basis of their symmetry.
An ornithologist determines the species of a finch from its markings. The
ability to identify similarity between different objects or situations plays
an important role in decision making. In the scenarios above, similarity
is pattern-based; it is the purpose of this thesis to explore a method that
automatically discovers the relevant patterns in a given problem and uses
them as a basis for decision making.
reused in new situations that the method would otherwise have to approach
from scratch. Many methods have been devised to automatically recognise
similarity between different situations in the reinforcement learning setting;
however, the automatic detection of pattern-based similarity in this context
is relatively new and less well explored than other approaches.
1.1 Motivation
[Figure: a blocks world example showing a sequence of three states over blocks a–e, connected by block-moving actions.]
Many different tasks can be devised within the blocks world environment.
Džeroski et al. (2001) have defined three, stack, unstack, and onab, which
have become standard benchmarking tasks for evaluating relational rein-
forcement learning systems. They are described below.
Stack: the goal of this task is to arrange all the blocks into a single stack (the
order of the blocks in the stack is unimportant).
Unstack: the goal of this task is to position all the blocks on the floor.
Onab: for this task, which is more complex than stack and unstack, two
blocks are designated A and B respectively. The goal is to place A
directly on B.
The aim, under reinforcement learning, for each of the above tasks is to find
a policy that solves it optimally (that is, in the least number of steps). A
policy defines how to behave in any given situation; in other words, it is a
function that maps each state in the task environment to some action. A
policy thus completely solves the problem, in the sense that it specifies an
action for every possible state. Optimal policies for these tasks are already
known: for stack, place any clear block on top of the block which is the
highest; for unstack, select any clear block that is not already on the floor
and move it there; and for onab, first remove the blocks above A and B,
and then move A onto B. Therefore, the optimality of a potential solution
to any of the tasks can be measured by comparing it, in terms of the number
of steps it takes to reach a goal state, to the corresponding known optimal
policy.
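To make the notion of a policy concrete, the sketch below gives one possible encoding of the known optimal stack and unstack policies in Python. The state representation (a tuple of stacks, each listed bottom-to-top) and the action format are hypothetical simplifications introduced for illustration only.

# A minimal, hypothetical blocks-world sketch: a state is a tuple of stacks,
# each stack a tuple of block names ordered bottom-to-top. An action is
# ("move", block, destination), where the destination is a clear block or "floor".

def optimal_stack_action(state):
    """Optimal policy for stack: move any clear block onto the highest stack."""
    stacks = [list(s) for s in state if s]
    if len(stacks) <= 1:
        return None  # already a single stack: goal reached
    tallest = max(stacks, key=len)
    for s in stacks:
        if s is not tallest:
            return ("move", s[-1], tallest[-1])

def optimal_unstack_action(state):
    """Optimal policy for unstack: move any clear block not on the floor to the floor."""
    for s in state:
        if len(s) > 1:
            return ("move", s[-1], "floor")
    return None  # every block is already on the floor: goal reached

# Example: blocks a, b, c with c on a and b alone.
print(optimal_unstack_action((("a", "c"), ("b",))))  # ("move", "c", "floor")
print(optimal_stack_action((("a", "c"), ("b",))))    # ("move", "b", "c")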
Two defining features of reinforcement learning problems are that they pro-
ceed according to trial-and-error and that feedback is given through rewards
(Sutton and Barto, 1998). Under the trial-and-error approach, an agent, the
reinforcement learning system, interacts with the environment to determine
cause-and-effect relationships. For the above tasks, interaction naturally
breaks up into separate episodes. An episode begins with the blocks in ran-
domly generated positions, it proceeds in a sequence of steps, where one
block is moved per step, and continues until a goal state is reached. For
other tasks, interaction may continue indefinitely without being broken up
into episodes. In both cases—episodic and continuing—the agent is respon-
sible for determining how the interaction unfolds. This trial-and-error ap-
proach to interaction distinguishes reinforcement learning from supervised
learning (where an external teacher provides the interaction, in the form of
examples of optimal or desired behaviour).
Table 1.1: The number of states and actions per state for blocks world as the number
of blocks increases.
Environment    Number of states    Actions per state
bw3                      13         1–6
bw4                      73         1–12
bw5                     501         1–20
bw6                   4,051         1–30
bw7                  37,633         1–42
bw8                 394,353         1–56
bw9               4,596,553         1–72
bw10             58,941,091         1–90
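As a hedged aside on Table 1.1 (not material from the thesis itself), the state counts can be reproduced by reading a blocks world state as a partition of the n distinguishable blocks into ordered stacks; the number of such partitions obeys a simple recurrence, and the largest number of admissible actions, n(n − 1), arises when every block is on the floor. The short check below assumes exactly this combinatorial reading of the table.

# Hypothetical check of Table 1.1: the number of blocks-world states with n blocks
# is taken to be the number of ways of partitioning n labelled blocks into ordered
# stacks, which satisfies a(n) = (2n - 1) a(n-1) - (n - 1)(n - 2) a(n-2) with
# a(0) = a(1) = 1. The maximum number of actions per state is n(n - 1).

def num_states(n):
    a, b = 1, 1  # a(0), a(1)
    for k in range(2, n + 1):
        a, b = b, (2 * k - 1) * b - (k - 1) * (k - 2) * a
    return b

for n in range(3, 11):
    print(n, num_states(n), n * (n - 1))
# prints 3 13 6, 4 73 12, 5 501 20, ..., 10 58941091 90, matching the table.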
How can the agent act optimally if it cannot experience every state-action
combination occurring in the environment?
Detailed answers to these two challenges must wait until the problem frame-
work is formally defined in Chapter 2. However, let us here focus a little
more on the second problem, as relational reinforcement learning is distin-
guished from other forms of reinforcement learning by its approach to this
challenge.
Without direct experience of the outcome of each possible state and action
combination, the best that an agent can do is to make decisions in new
situations based on previously experienced situations and outcomes. In other
words, the system infers a general, hypothetical relationship between states
or state-action combinations and long term rewards that is consistent with
observation. For this inductive approach to work, the environment must
contain regularity linking states or state-action combinations to long term
reward. In some environments, for example, the long term reward is a
function of the current state-action combination. In those environments, it is
Figure 1.2: The optimal policies for stack and unstack in bw4 . Note that the state
space of bw4 is partitioned into five patterns that, under these policies, are associated
with the optimal number of steps to the goal. A similar decomposition can be determined
for onab, where the patterns must additionally specify the blocks designated as A and B.
Note that A, B, C, and D are variables and not block labels; thus, the resulting
pattern is genuinely abstract, applying in all situations where the blocks sat-
isfy the specified relationships and properties. The use of variables, however,
necessitates the inequations A\=C and B\=D. Without these extra constraints,
the example also represents a two block pattern corresponding to the case
where A=C and B=D.
The above example shows that structural regularity can be represented ab-
stractly by expressions in first-order logic. However, it does not show how to
generate these expressions. This issue—the development of methods that in-
duce general, hypothetical relationships between states or state-action com-
binations and long term reward using languages, like first-order logic, that
can abstractly represent structural regularity—is the chief concern of rela-
tional reinforcement learning.
1.2 Approach
The method developed in this thesis, Foxcs, “upgrades” the learning clas-
sifier system Xcs (Wilson, 1995, 1998). In doing so, it borrows ideas and
methods from inductive logic programming (Nienhuys-Cheng and de Wolf,
1997). The approach is thus based on a synthesis of concepts and techniques
from learning classifier systems and inductive logic programming.
Inductive logic programming (ILP; Nienhuys-Cheng and de Wolf, 1997) addresses machine learning tasks that exhibit structural reg-
ularity. Although the context of ILP is supervised and unsupervised learn-
ing, rather than reinforcement learning, concepts and techniques from ILP
can be used to provide a principled approach to the upgrade of Xcs. The
primary ideas borrowed from ILP are the notions of consistency and back-
ground knowledge, and the techniques of refinement and θ-subsumption.
Note that induction under first-order logic presents special challenges. The
hypothesis space defined by a first-order logic language is generally richer
than an hypothesis space under an attribute-value language, and meth-
ods are thus potentially less efficient and more computationally complex.
Unfortunately, some time complexity results for inductive logic program-
ming are negative (Kietz, 1993; Cohen, 1993, 1995; Cohen and Page, 1995;
De Raedt, 1997), showing that the complexity is greater than polynomial
and thus intractable. However, under the non-monotonic setting (originat-
ing from Helft, 1989) tractable PAC-learnability results have been obtained
(De Raedt and Džeroski, 1994). The non-monotonic setting is therefore
adopted by the inductive component of Foxcs.
There are several advantages arising from the use of the LCS framework
for relational reinforcement learning. Many existing RRL methods do not
possess certain desirable characteristics for practical reinforcement learning.
First, they may restrict the problem framework in some way; second, they
may require a partial or complete model of the environment’s dynamics;
and third, they may require candidate hypotheses to be provided by the
user. The method developed in this thesis avoids the above limitations.
Specifically:
Several appendices have been provided in addition to the main body of the
thesis. The first two appendices include background material that is relevant
to the thesis in order to make it more self-contained. Appendix A gives
definitions and concepts from first-order logic, while Appendix B focuses on
two formal descriptions of the learning problems addressed by ILP systems.
The final two appendices give details of the tasks addressed by Foxcs.
Appendix C describes the ILP tasks from Chapter 6, while Appendix D
lists constraints that were used to prevent Foxcs from generating rules that
do not match any states in blocks world.
Chapter 2
Background
“Good and evil, reward and punishment, are the only motives to
a rational creature . . . ”
—John Locke
[Figure: the agent–environment interaction. On each time step the agent receives the current state and a reward from the environment and responds with an action.]
• a reward function R : S × A → R.
The state space, S, is the set of states that describe each possible situa-
tion that the agent may encounter; we assume that S is discrete and
finite. It is also assumed that the agent has perfect sensors, that is, it
is able to correctly identify the current state of the environment. For
now we make no assumptions about how the states are represented to
the agent.
The action space, A, is the set of actions through which the agent inter-
acts with the environment. As with the state space, we assume that
the action space A is discrete and finite. Frequently not all actions are
applicable to every state; if this is the case, then the set of admissible
actions of s is the subset of A that is applicable to s and is denoted
A(s).
The transition function, T , gives the probability that when the agent executes action a in state s then the following state is s′, that is, T(s, a, s′) = Pr{s_{t+1} = s′ | s_t = s, a_t = a}. The transition function is a proper probability distribution over successor states s′, that is, for all s ∈ S and a ∈ A, ∑_{s′} T(s, a, s′) = 1. A deterministic MDP is the special case in which, for every s ∈ S and a ∈ A(s), there is a single successor state s′ such that T(s, a, s′) = 1.
The reward function, R, gives the expected immediate reward for tak-
ing action a in state s, that is, R(s, a) = E{rt+1 | st = s, at = a}.
The reward function is sometimes state-based, in which case R may
be given as a function over S; the two formulations are related by
R(s, a) = R(s) for all s ∈ S and a ∈ A(s).
for all time steps t. Hence, in a Markovian task, the current state and
action summarise all the information required to predict the next state and
the immediate reward.
The agent’s objective is to maximise the long term reward accumulated from
interacting with the environment. A simple measure of long term reward
is the expected sum of rewards obtained from the environment after the
current time step until the end of the episode. Under this criterion, the
agent aims to maximise E{rt+1 + rt+2 + . . . + rn } for all t ∈ {1, 2, . . . , n − 1},
where n is the final step of the episode. However, this measure does not
adequately deal with continuing MDPs.
counted metric. Under this criterion, the value that the agent is attempting to maximise is E{ ∑_{k=0}^{∞} γ^k r_{t+k+1} }, where γ is a discount factor satisfying
0 ≤ γ < 1. The use of a discount factor creates a geometrically decreasing
series which prevents the summation from going to infinity. The discounted infinite-horizon optimality criterion is attractive because it handles both continuing and episodic tasks uniformly1 and because most results under the criterion also extend to the undiscounted case where appropriate (Littman,
1996, page 57).
A Bellman equation for state s says that V π (s), the value of state s under
policy π, is the immediate reward R(s, π(s)) plus the discounted value of the
next state V π (s0 ), averaged over all possible next states s0 according to the
likelihood of their occurrence, T(s, π(s), s′). The system of Bellman equations is linear and can be solved exactly by standard methods such as Gaussian elimination.
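For illustration, policy evaluation amounts to solving the linear system (I − γT_π)V = R_π. The sketch below does this for a tiny, made-up two-state MDP; the numbers are arbitrary and are not drawn from the thesis.

import numpy as np

# Evaluate a fixed policy pi on a small hypothetical MDP by solving the Bellman
# equations V = R_pi + gamma * T_pi V directly, i.e. (I - gamma * T_pi) V = R_pi.
gamma = 0.9
# T_pi[s, s2] = probability of moving from s to s2 under the policy's action.
T_pi = np.array([[0.8, 0.2],
                 [0.1, 0.9]])
# R_pi[s] = expected immediate reward R(s, pi(s)).
R_pi = np.array([1.0, 0.0])

V = np.linalg.solve(np.eye(2) - gamma * T_pi, R_pi)
print(V)  # the exact value of each state under pi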
1
An episodic task can be converted to a continuing task by assuming that the goal
states are absorbing, that is, a goal state transitions only to itself and only generates
rewards of zero.
The greedy policy πV is thus obtained by always selecting the action which
leads to the maximum one-step value with respect to V .
The agent’s objective can be put as the task of finding an optimal policy.
By optimal we mean that the value for each state under the policy is at least
as great as the value when following any other policy. More formally, V ∗ ,
the value function for an optimal policy π ∗ satisfies:
for all s ∈ S. There is always at least one policy that is optimal (Howard,
1960), although it may not be unique. There is a corresponding Bellman
optimality equation for V ∗ :
" #
X
∗ 0 ∗ 0
V (s) = max R(s, a) + γ T (s, a, s )V (s ) , (2.4)
a
s0
for each s ∈ S. It can be shown that πV ∗ , any policy which is greedy with
respect to the optimal value function V ∗ , is an optimal policy (Puterman,
1994). Hence, one approach to finding an optimal policy is to calculate V ∗
and then derive πV ∗ . However, due to the presence of the non-linear max-
imisation operator in (2.4), Gaussian elimination is insufficient for solving
for V ∗ . Algorithms for computing V ∗ are considered in the following section.
Three fundamental methods for solving MDPs are linear programming, value
iteration and policy iteration. These three methods analytically compute the
Linear Programming

[Figure: the linear programming formulation of the MDP, with the objective Maximise ∑_s V(s) subject to the Bellman constraints.]

Policy Iteration

[Figure 2.3: the policy iteration algorithm.]

[Figure 2.4: the value iteration algorithm.]
Value Iteration
Like policy iteration, value iteration (Bellman, 1957) also computes a se-
quence of value functions. The algorithm for value iteration, given in Fig-
ure 2.4, is obtained by simply turning the Bellman optimality equation (2.4)
into an update rule.
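A minimal sketch of value iteration in this spirit, applied to a small hypothetical MDP (the transition and reward arrays below are invented for illustration and are not taken from the thesis):

import numpy as np

def value_iteration(T, R, gamma=0.9, theta=1e-8):
    """Value iteration for a small MDP.
    T[a, s, s2] is the probability of moving from s to s2 under action a;
    R[s, a] is the expected immediate reward. Returns V* and a greedy policy."""
    n_actions, n_states, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R(s, a) + gamma * sum_s2 T(s, a, s2) * V(s2)
        Q = R + gamma * np.einsum("ast,t->sa", T, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new, Q.argmax(axis=1)
        V = V_new

# A tiny hypothetical two-state, two-action MDP.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],    # action 0: stay in the current state
              [[0.0, 1.0], [1.0, 0.0]]])   # action 1: switch state
R = np.array([[0.0, 1.0],                  # rewards in state 0 for actions 0, 1
              [2.0, 0.0]])                 # rewards in state 1 for actions 0, 1
V_star, policy = value_iteration(T, R)
print(V_star, policy)   # approximately [19, 20] with greedy policy [1, 0]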
function and the optimal value function is less than 2εγ/(1 − γ), that is:

    max_s |V^{π_{V_i}}(s) − V*(s)| < 2εγ/(1 − γ).
The TD(0) algorithm (Sutton, 1988) is designed to estimate the value func-
tion V π given a policy π. Although it does not find an optimal value
The intuition behind the TD(0) algorithm is based on the following rewrite
of the value function equation (2.1):
    V^π(s) = E_π{ ∑_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s }
           = E_π{ r_{t+1} + γ ∑_{k=0}^{∞} γ^k r_{t+k+2} | s_t = s }
Figure 2.5 gives the TD(0) algorithm. The algorithm maintains a table V
which estimates V π , and which is updated according to the experience from
interacting with the MDP. The experience can be expressed as a sequence
of tuples hs, π(s), r, s0 i, where each tuple indicates that the agent began in
state s, executed action π(s), received reward r and moved to state s0 . An
experience tuple for s provides a sample reward r and next state s0 , which is
used to update V (s) according to (1 − α)V (s) + α(r + γV (s0 )), where α is a
learning rate satisfying 0 < α < 1. For each experience tuple, hs, π(s), r, s0 i,
the update acts to reduce the difference between V (s) and r + γV (s0 ), which
leads to the name of the algorithm, temporal difference learning. For a
particular state, the input from each experience tuple which starts from that
state is averaged over the long term according to the learning rate α. Since
the experience tuples are expected to approximate the true distribution of
rewards and transitions with respect to π over the long term, V (s) should
converge to the value given by (2.1) in the limit as the number of experience
tuples sampled approaches infinity. Convergence results for TD(0) have been
given by Sutton (1988), Dayan (1992), Jaakkola et al. (1994) and Tsitsiklis (1994).
Initialise s
While s ∉ G do
    a := π(s)
    Execute a and observe r and s′
    V (s) := (1 − α)V (s) + α(r + γV (s′))
    s := s′

Figure 2.5: The TD(0) algorithm. Here, G is a set of goal states; for continuing tasks G = ∅.
That is, the optimal action value for (s, a) is the immediate reward plus the
discounted optimal value of the next state, s0 , averaged over all possible next
states. The Q function formulation is convenient because it associates values
with state-action pairs rather than states, which avoids the need to make a
forward reasoning step in order to find optimal actions. In other words, to
find π ∗ (s), the optimal action for state s, given V ∗ , you need to calculate a
solution to π*(s) = arg max_a [ R(s, a) + γ ∑_{s′} T(s, a, s′) V*(s′) ]. However, to find π*(s) given Q*, you only need to solve π*(s) = arg max_a Q*(s, a), which completely eliminates the dependence on R and T.

[Figure 2.6: the Q-Learning algorithm.]
Figure 2.6 gives the Q-Learning algorithm for estimating Q∗ . As with TD(0),
under Q-Learning the experience derived from interacting with the MDP can
be expressed as a sequence of tuples hs, a, r, s0 i, where each tuple indicates
that the agent began in state s, executed action a, received reward r and
moved to state s0 . To arrive at an accurate estimate of Q∗ this experience
needs to be averaged over many samples. Thus, a learning rate α, satisfying
0 < α < 1, averages together the current experience with the previous
estimate.
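A compact tabular sketch of the Q-Learning update just described, using ε-greedy action selection; the environment interface (reset, actions, step) is a hypothetical convenience for this illustration, not part of any particular library:

import random
from collections import defaultdict

def q_learning(env, n_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-Learning. `env` is assumed to provide reset() -> s,
    actions(s) -> list of admissible actions, and step(s, a) -> (r, s2, done)."""
    Q = defaultdict(float)            # Q[(s, a)], initialised to 0
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            acts = env.actions(s)
            if random.random() < epsilon:                      # explore
                a = random.choice(acts)
            else:                                              # exploit
                a = max(acts, key=lambda a2: Q[(s, a2)])
            r, s2, done = env.step(s, a)
            target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in env.actions(s2))
            # average the new sample into the running estimate
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s2
    return Q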
2
An ε-greedy policy behaves greedily most of the time, but with probability ε it selects an action uniformly at random. That is, πQ(s), the ε-greedy policy with respect to Q, is:

    πQ(s) = arg max_a Q(s, a)            with probability 1 − ε,
            a random action from A(s)    otherwise.
2.3 Generalisation
Note that T and R are subject to the curse of dimensionality too, as storing
these functions in matrix form also requires space proportional to |S| and
|A|. This has negative implications for the above methods that need access
to T and R. In many cases, the dynamics of the environment, that is, T and
R, can be compactly formulated as a set of equations, so this is less of a
problem than the problem of storing V and Q. However, the main difficulty
arising from the curse of dimensionality relates not to storage space, but time
complexity. Even if storage requirements were not a problem, the training
time required to exhaustively sample large state and action spaces would be
prohibitive.
θ = hθ1 , θ2 , . . . , θk i. The task becomes one of finding θ such that Ṽθ approx-
imates V well. If linear approximators are insufficient to approximate V
then non-linear methods, such as multilayer neural networks, decision trees,
or support vector machines, can be used instead. An introduction to func-
tion approximation in reinforcement learning has been given by Sutton and
3
Function approximation can similarly be used for the Q function, but for simplicity
we only illustrate its use for approximating V .
Barto (1998, chapter 8), and the approach has received rigorous analytical
attention from Bertsekas and Tsitsiklis (1996).
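As a concrete illustration, a linear approximator represents Ṽθ(s) as the dot product of a feature vector φ(s) with the parameters θ, and a TD(0)-style gradient update adjusts θ from sampled experience. The feature map and experience stream below are assumed to be supplied by the user; this is a sketch, not the method of any particular system discussed here.

import numpy as np

def td0_linear(experience, phi, n_features, alpha=0.05, gamma=0.9):
    """Estimate V_pi with a linear approximator V(s) ~= theta . phi(s).
    `experience` yields (s, r, s2, done) tuples sampled while following pi;
    `phi(s)` returns a feature vector of length n_features (both hypothetical)."""
    theta = np.zeros(n_features)
    for s, r, s2, done in experience:
        v_s = theta @ phi(s)
        v_s2 = 0.0 if done else theta @ phi(s2)
        td_error = r + gamma * v_s2 - v_s
        theta += alpha * td_error * phi(s)   # the gradient of V(s) w.r.t. theta is phi(s)
    return theta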
Despite its popularity and the existence of the above mentioned positive re-
sults, particularly the grandmaster backgammon player, TD-Gammon, there
are known to be some negative results from incorporating function approx-
imation into reinforcement learning and dynamic programming algorithms.
Examples of divergence on simple MDPs have been produced with temporal
difference methods, value iteration, and policy iteration when combined with
very benign forms of linear function approximation (Baird, 1995; Boyan and
Moore, 1995; Tsitsiklis and Roy, 1996; Gordon, 1995). This shows that the
convergence of these algorithms, when combined with function approxima-
tion, cannot be guaranteed in general and has motivated the development of
new algorithms that when combined with function approximation are stable
(Baird, 1995; Baird and Moore, 1998; Gordon, 1995; Precup et al., 2001,
2006).
2.3.2 Aggregation
Let us consider a simple form of aggregation in more detail. The state space
S is partitioned into disjoint subsets, S1 , . . . , Sn , where each partition Si is
associated with a value Ṽi . The optimal value function, V ∗ , is approximated
by Ṽ , where Ṽ (s) = Ṽi for all s ∈ Si . According to Tsitsiklis and Roy (1996),
there are no inherent limitations with using this type of aggregation. That
is, given some ε > 0, the partitions can be defined as:
for all i. Thus, the optimal value function V ∗ can be approximated with
accuracy ε. However, it is precisely the value function V ∗ that we are trying
to predict, thus in practice we are unable to find a partition of S using V ∗ .
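For concreteness, the sketch below shows TD(0)-style learning over a fixed (static) partition of the state space: one value is stored per cluster, so every state in a cluster shares the same estimate. The cluster_of function and the experience stream are assumed to be given; this is an illustrative sketch only.

from collections import defaultdict

def td0_aggregated(experience, cluster_of, alpha=0.1, gamma=0.9):
    """TD(0) with a static state aggregation: V is stored once per cluster.
    `cluster_of(s)` maps a state to its partition index (assumed to be supplied);
    `experience` yields (s, r, s2, done) tuples sampled while following a policy."""
    V = defaultdict(float)   # one value per cluster
    for s, r, s2, done in experience:
        i, j = cluster_of(s), cluster_of(s2)
        target = r if done else r + gamma * V[j]
        V[i] = (1 - alpha) * V[i] + alpha * target
    return V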
Figure 2.7: A value function tree (reproduced from Boutilier et al., 2000).
∏_{i=1}^{k} |val(s_i)|; thus the memory requirements for storing V explicitly are
exponential in k. An example of a compact representation of V under fac-
tored state spaces is shown in Figure 2.7. Here, a decision tree represents
the value function. Internal nodes in the tree represent boolean variables
characterising the state space. The left subtree under a variable represents
the case when the variable is true, while the right subtree represents the
false case. Leaf nodes store the value of states which are consistent with
the corresponding branch. The decision tree, in effect, partitions the state
space according to an estimate of the value function, and is thus a form of
aggregation.
The above example shows that aggregation may be non-uniform, that is,
a certain variable may only be relevant to the value function in particular
regions of S. Contrast this non-uniformity in the value function with linear
function approximation. Under linear function approximation, the value
function is a weighted contribution of each feature, si . The weights, θi ,
may change over time but they are fixed with respect to S, suggesting that,
while aggregation can be useful for value functions that are conditionally
dependent on the variables describing S, linear function approximation is
Some analysis of using static aggregation techniques for value function gen-
eralisation has been performed. A convergence proof for Q-Learning with
a general form of static aggregation called soft state aggregation4 has been
given by Singh et al. (1995). Tsitsiklis and Roy (1996) give a convergence
proof for value iteration under a static partition of S. Note, however, that
in both cases the policy which is converged to is optimal with respect to the
given clustering and is not, in general, necessarily the same as π ∗ , the opti-
mal policy for the unclustered MDP. For example, in the extreme case where
all states are grouped into a single cluster, only the action with the highest
expected value over the entire state space will be selected. The quality of
the policy thus depends on the quality of the clustering.
2.3.3 Remarks
4
In soft state aggregation, each state s belongs to cluster x with probability P (x|s).
Each state s may belong to several clusters. Partitioning is a special case of soft state
aggregation where for each state s there is a single cluster x such that P (x|s) = 1 and
P (y|s) = 0 for all other clusters y ≠ x.
is smooth with respect to the state signal—that is, environments where small
changes in the state signal lead to small changes in the value function. Ag-
gregation techniques, on the other hand, are particularly suited to tasks that
exhibit regularity in the value function which is conditionally dependent on
variables that factor S and A.
Before discussing how first-order logic can be used to address the generalisa-
tion problem, let us first motivate interest in it by considering limitations of
propositional representations. We consider propositionally factored MDPs
specifically; however, the limitations that will be raised apply to attribute-
value representations in general, including those used in conjunction with
function approximation and aggregation techniques within reinforcement
learning.
Table 2.1: Representing poker hands using a rank-suit factoring. An example for each
class of poker hand is given. The variables Ri and Si are, respectively, the rank and suit
of the ith card (the order of cards is arbitrarily assigned) in the hand.
Class R1 S1 R2 S2 R3 S3 R4 S4 R5 S5
four-of-a-kind 2 ♠ 2 ♦ 2 ♥ 2 ♣ 8 ♦
fullhouse Q ♠ Q ♣ 4 ♥ 4 ♦ 4 ♠
flush 3 ♣ 9 ♣ J ♣ K ♣ A ♣
straight 4 ♣ 5 ♦ 6 ♠ 7 ♥ 8 ♣
three-of-a-kind 7 ♠ 7 ♦ 7 ♣ A ♣ 9 ♥
two-pair 8 ♣ 4 ♣ 8 ♥ 3 ♠ 4 ♠
pair 7 ♦ K ♣ K ♠ 10 ♣ 2 ♥
(R1 = 2 ∧ R2 = 2 ∧ R3 = 3 ∧ R4 = 4 ∧ R5 ≠ 2 ∧ R5 ≠ 3 ∧ R5 ≠ 4)
∨ (R1 = 2 ∧ R2 = 2 ∧ R3 = 3 ∧ R4 = 5 ∧ R5 ≠ 2 ∧ R5 ≠ 3 ∧ R5 ≠ 5)
∨ ...
∨ (R1 = 2 ∧ R2 = 2 ∧ R3 = 3 ∧ R4 = A ∧ R5 ≠ 2 ∧ R5 ≠ 3 ∧ R5 ≠ A)
∨ ...
∨ (R1 = 2 ∧ R2 = 2 ∧ R3 = A ∧ R4 = 3 ∧ R5 ≠ 2 ∧ R5 ≠ A ∧ R5 ≠ 3)
∨ (R1 = 2 ∧ R2 = 2 ∧ R3 = A ∧ R4 = 4 ∧ R5 ≠ 2 ∧ R5 ≠ A ∧ R5 ≠ 4)
∨ ...
∨ (R1 = 2 ∧ R2 = 2 ∧ R3 = A ∧ R4 = K ∧ R5 ≠ 2 ∧ R5 ≠ A ∧ R5 ≠ K)
All in all, the expanded rule contains 130 × 132 = 17,160 lines. Although
the approach stops short of listing every single instance of a pair, of which
there are approximately one million, it remains an unsatisfactory approach
for representing generalisations for this task. And although the rank-suit
factoring can be used with more compact structures than rules, such as
trees, the resulting level of complexity remains of the same order.
A better factoring, at least for identifying a pair, can be obtained from the
observation that the above rule detects when exactly two rank variables are
equal. Let Xi,j be a boolean variable such that:
    Xi,j = T   if and only if the ranks of cards i and j are equal,
           F   otherwise.
Since by symmetry Xi,j = Xj,i , and Xi,i is unnecessary for detecting pairs,
only ten variables are needed: X1,2 , X1,3 , X1,4 , X1,5 , X2,3 , X2,4 , X2,5 , X3,4 ,
X3,5 , and X4,5 . Under this factoring, a rule for identifying a pair requires
only 10 lines:
Each line in the rule checks that exactly one variable is T and the rest are
F . Taking this approach to the extreme, a new boolean variable, X, could
be defined such that X = T if and only if one pair of ranks in the hand are
equal. A rule which correctly classifies a hand as a pair is then simply:
pair ← X.
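The boolean refactoring above is straightforward to state procedurally. The sketch below (a hypothetical illustration using numeric ranks) computes the ten Xi,j variables for a hand and classifies the hand as a pair exactly when one of them is true:

from itertools import combinations

def is_pair(ranks):
    """ranks is a 5-tuple of card ranks. Compute the ten X_{i,j} variables
    (True iff cards i and j share a rank) and classify the hand as a pair
    when exactly one of them is True."""
    X = [ranks[i] == ranks[j] for i, j in combinations(range(5), 2)]
    return sum(X) == 1

print(is_pair((7, 13, 13, 10, 2)))   # True: exactly one pair of equal ranks
print(is_pair((2, 2, 2, 2, 8)))      # False: a four-of-a-kind sets six X variables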
The state and action space of an MDP (that is, S and A) can be factored
using an appropriate alphabet over first-order logic. An alphabet in first-
order logic is a tuple hC, F, P, di, consisting of: a set of constant symbols,
C; a set of function symbols, F; a set of predicate symbols, P; and a function
d : F ∪ P → Z, where d(i) is called the arity of i. In the following, it will
be convenient to partition P into two disjoint subsets, one for representing
states, PS , and another for actions, PA .
We now introduce the relationally factored MDP (van Otterlo, 2004), which
is a type of MDP where S and A are factored into atoms by an alphabet
over first-order logic.
• A reward function R : S × A → R
Note that 2^HB[L] denotes the powerset of HB[L], that is, the set of all Her-
brand interpretations over L. Also note that S is typically a subset and not
the complete set of all Herbrand interpretations over hC, F, PS , di. This is
because not every Herbrand interpretation over hC, F, PS , di necessarily cor-
responds to a valid state. For the sake of simplicity we have not formalised
a test to distinguish valid from invalid states; for such a test, see van Otterlo
(2005).
For every MDP there exists an equivalent RMDP; thus, the framework is
not more restrictive than MDPs. Given an MDP M = hS, A, T, Ri, an
equivalent RMDP, M 0 = hL, S 0 , A0 , T 0 , R0 i, can be trivially derived from M .
The alphabet L = hC, ∅, PS ∪ PA , di is constructed as follows: C contains a
single dummy constant; PS contains a unique predicate symbol is for each
state s ∈ S; PA is constructed analogously over A; and d(i) = 1 for all
i ∈ PS ∪ PA . Under this construction, S 0 and A0 are isomorphic to S and
A respectively. The transition function T 0 is defined from T , such that each
element of S and A in T is replaced by its equivalent in S 0 or A0 . The reward
function R0 can be defined in an analogous fashion. Using this construction,
M 0 is equivalent to M since each element of M 0 , except for L, which has no
corresponding element in M , is isomorphic to its corresponding element in
M.
Class Hand
class(fourofakind ) {card (first, two, spades), card (second , eight, diamonds),
card (third , two, spades), card (fourth, two, clubs),
card (fifth, two, diamonds)}
class(fullhouse) {card (first, queen, spades), card (second , queen, clubs),
card (third , four , hearts), card (fourth, four , diamonds),
card (fifth, four , spades)}
class(flush) {card (first, three, clubs), card (second , nine, clubs),
card (third , jack , clubs), card (fourth, king, clubs),
card (fifth, ace, clubs)}
class(two-pair ) {card (first, eight, clubs), card (second , four , clubs),
card (third , eight, hearts), card (fourth, three, spades),
card (fifth, four , spades)}
class(pair ) {card (first, seven, diamonds), card (second , king, clubs),
card (third , king, spades), card (fourth, ten, clubs),
card (fifth, two, hearts)}
Table 2.2: Poker hands from Table 2.1 represented under a relational alphabet for poker.
Example 1 Table 2.2 shows several poker hands selected from Table 2.1
represented under a relational alphabet. The alphabet L = hC, ∅, PS ∪ PA , di
is as follows: C = {first, second , . . . , fifth} ∪ {two, three, . . . , ace} ∪
{diamonds, hearts, spades, clubs} ∪ {fourofakind , fullhouse, . . . , pair },
which represent the cards in the hand, the ranks, the suits, and the classes
respectively; PS = {card }; PA = {class}; and d(card ) = 3 and d(class) = 1.
A concise rule for correctly classifying a pair in poker using the language
given in Example 1 is:
Note that the condition part of the rule is an abstract state; the vari-
ables6 being the terms beginning with an uppercase letter (this convention
of indicating a variable using an uppercase letter is followed throughout the
thesis). The underscore, “_”, is a special variable, called the anonymous
variable, which can represent any constant, not unlike the wildcard, *, of
UNIX or the “don’t care” symbol, #, of traditional learning classifier system
languages. The rule states that if the cards C1 and C2 both have the same
rank, R, and if the rank of all the other cards in the hand, R3 , R4 , and R5 ,
are different from R, then the hand is a pair.
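To illustrate how a condition containing variables covers many ground states, the sketch below performs a naive match of a conjunction of card/3 atoms against a hand represented as a set of ground atoms, under the convention noted in the footnote that different variables denote different constants. The condition shown only approximates a pair test and is not the thesis's actual rule; all names here are hypothetical.

def is_var(t):
    """A term is a variable if it is the anonymous variable '_' or starts with an uppercase letter."""
    return t == "_" or t[0].isupper()

def match(condition, state):
    """Find a substitution binding the variables of `condition` (a list of atoms
    written as tuples, e.g. ("card", "C1", "R", "_")) so that every atom, after
    substitution, occurs in `state` (a set of ground atoms). Distinct variables
    must bind to distinct constants; '_' matches anything. Naive backtracking."""
    def solve(atoms, theta):
        if not atoms:
            return theta
        head, rest = atoms[0], atoms[1:]
        for ground in state:
            if len(ground) != len(head):
                continue
            theta2, ok = dict(theta), True
            for t, g in zip(head, ground):
                if t == "_":
                    continue
                if is_var(t):
                    if t in theta2:
                        ok = theta2[t] == g
                    elif g in theta2.values():
                        ok = False   # distinct variables must denote distinct constants
                    else:
                        theta2[t] = g
                else:
                    ok = (t == g)
                if not ok:
                    break
            if ok:
                result = solve(rest, theta2)
                if result is not None:
                    return result
        return None
    return solve(list(condition), {})

# An abstract condition approximating "two distinct cards share a rank"; the
# thesis's actual pair rule additionally constrains the ranks of the other cards.
condition = [("card", "C1", "R", "_"), ("card", "C2", "R", "_")]
hand = {("card", "first", "seven", "diamonds"), ("card", "second", "king", "clubs"),
        ("card", "third", "king", "spades"), ("card", "fourth", "ten", "clubs"),
        ("card", "fifth", "two", "hearts")}
print(match(condition, hand))   # e.g. {'C1': 'second', 'R': 'king', 'C2': 'third'}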
The advantage of using first-order logic for expressing the concept of a pair in
poker becomes apparent when it is compared to the propositional factorings
and rules in Section 2.4.1. Like the propositional factoring in Table 2.1, only
the basic features of poker cards, ranks and suits, are represented in the first-
order logic language given in Example 1. However, the above rule, which
is expressed in this language, is much more compact than its equivalent
rule over the propositional factoring. It is also more compact than the first
6
In this example, different variables are assumed to represent different constants.
propositional rule which made use of high level propositional features. Only
the final, trivial, propositional rule is more compact than the above rule, but
it relies on a feature that directly indicates the presence of a pair, which, if
known, would generally make learning the concept of a pair redundant.
2.5 Summary
We conclude by restating that this chapter was not intended to give a com-
prehensive overview of each of the background topics covered. Some notable
topics have been omitted because they are not directly relevant to this thesis.
They include:
Chapter 3

A Survey of Relational Reinforcement Learning
There is also a connection between RRL and the artificial intelligence sub-
field of planning. These two fields overlap in the sense that they are both
agent-based frameworks which focus on the issue of how to compute an
optimal behaviour given a particular environment. Furthermore, planning
environments have traditionally been described using relational languages.
For example, the well known blocks world environment, used extensively
to benchmark RRL algorithms and systems, originates from the planning
community. However, RRL is differentiated from planning in the following
ways:
Figure 3.1: Connections between relational reinforcement learning and other fields.
the planning framework, rewards are absent and the agent’s aim is
instead to find a goal state.
Table 3.1 lists the abstraction devices employed by the above systems to
structure the value function. Carcass uses decision rules of the form:
{ã1 , . . . , ãn } ← s̃
where the ãi are abstract actions and s̃ is an abstract state. Each rule
implicitly represents the n rules: ãi ← s̃, where 1 ≤ i ≤ n. For each
implicit sub-rule, an estimate of its corresponding value, Q(s̃, ãi ), is stored
and updated throughout training. A Carcass rule, {ã1 , . . . , ãn } ← s̃, can
be seen as essentially clustering a subset of S × A into n clusters, where the
Table 3.1: A listing of abstraction devices used by static RRL systems. †The form
which rules take in the Carcass framework differs slightly from those in Logical TD(λ)
and Logical Q-Learning; see text for details.
and has a corresponding estimated value, Q(s̃, ãi ). A set of such rules, as
illustrated in Table 3.2, is then used to decompose S × A into a finite set of
clusters for compactly representing the Q function. In Logical TD(λ) and
Logical Q-Learning, decision rules are analogous to Carcass rules except
that n = 1.
Table 3.2: An ordered list of rules for bw4 that could be evaluated by the Carcass
system (adapted from van Otterlo, 2004).
for calculating the value function. In principle, any table-based value func-
tion method can be adapted, although to date, only variants of Q-Learning,
TD(λ),1 and prioritised sweeping (van Otterlo, 2004)2 have been used. Non-
value function methods could also be adapted for static RRL. For example,
in the work of Itoh and Nakamura (2004), a list of n condition-action rules
implementing a learning and a planning component was provided to a rein-
forcement learning system. Each rule i was associated with a probability pi
indicating its likelihood of being invoked when its condition was satisfied.
Gradient descent was performed on the probability vector, hp1 , . . . , pn i, in
1
The TD(λ) algorithm (Sutton, 1988) is a generalisation of TD(0) that uses eligibility
traces. Eligibility traces are a mechanism for making efficient use of the sampled data and
generally reduce the amount of training required (see Sutton and Barto, 1998, chapter 7).
2
See Section 3.2.6 for additional comments on this system.
Table 3.3: The definition for an r-state for the King and Rook versus King chess
endgame (adapted from Morales, 2004). The argument, s, is the board position. This
relation covers more than 3,000 positions.
r_state1(s) :-
kings_in_opposition(s),
rook_divides_kings(s).
order to learn, in effect, which rules to keep and which to abandon. The
intention of the system was to “learn when to learn and plan” and it did not
attempt to generalise over the value function. However, it could be made to
do so by replacing its rule list with the kinds of abstraction devices listed in
Table 3.1.
Figure 3.2: A Q-tree for blocks world (adapted from Džeroski et al., 2001). Non-leaf
nodes contain atoms factoring the state space: the left subtree represents the case where
the test corresponding to the atom succeeds, while the right subtree represents failure. The
root of the tree additionally contains the action, mv(D,E) here, and atoms whose variables
bind to globally relevant attributes, in this case, goal on(A,B) and numberofblocks(C).
The leaf nodes store the Q value for the corresponding branch.
Q-rrl, the first ever relational reinforcement learning system, was devel-
oped in the seminal work of Džeroski et al. (2001, 1998a).3 The system
combined Q-Learning with the ILP decision tree algorithm, Tilde (Block-
eel and De Raedt, 1998). During training, the algorithm builds a Q-tree that
partitions S × A into clusters containing (s, a) pairs whose Q(s, a) estimate
is equal (see Figure 3.2). The system was evaluated on three blocks world
tasks, stack, unstack and onab, that have now become standard benchmarks
for RRL systems.
The Q-rrl system was, however, a first attempt at RRL and contained some
significant inefficiencies. Prominent among them was that the Q-trees were
generated from scratch after each training episode, requiring every sampled
(s, a) pair and its associated Q(s, a) estimate to be explicitly retained in
memory until the completion of training. Subsequent research focussed on
3
Some sources, including the original article by Džeroski et al. (1998a), refer to the
system as Rrl. We prefer Q-rrl in order to avoid confusion with the acronym for the
name of the field. It also distinguishes between a different version of the system, P-rrl,
reported in (Džeroski et al., 2001).
The next two methods, Rrl-rib (Driessens and Ramon, 2003) and Rrl-kbr
(Gärtner et al., 2003a; Ramon and Driessens, 2004), replace the decision
tree of Rrl-tg with other techniques for approximating the value func-
tion. The Rrl-rib system uses an instance-based method related to the
k-nearest neighbour algorithm (Aha et al., 1991), and the Rrl-kbr system
employs Gaussian processes (MacKay, 1998) with graph kernels (Gärtner
et al., 2003b). Unlike other RRL methods, these two methods do not form clusters of state-action pairs using the abstractive mechanisms of first-order logic, such as variables. Rather, they rely on calculating the similarity of
the current input to previously experienced examples in order to make a
prediction for the value of the input. On blocks world tasks, Rrl-rib and
Rrl-kbr were found to outperform Rrl-tg with respect to the level of op-
timality attained by the learnt policies. They are, however, more expensive
in terms of computation and memory requirements, their predictions are less
transparent to humans, and they rely on domain specific distance metrics
in order to compute similarity.
The final system, Trendi (Driessens and Džeroski, 2005), hybridises the
decision tree- and instance-based learning methods, Rrl-tg and Rrl-rib,
in order to combine the efficiency of the former with the better performance
of the latter. Trendi has achieved better performance levels than either
method individually, although not as good as Rrl-kbr, and has an efficiency
level comparable to Rrl-tg.
Apart from the development of the first RRL method, Q-rrl, another in-
novation contained in (Džeroski et al., 2001) was the development of policy
Figure 3.3: a) An optimal sequence of actions for the unstack task, and their associated Q values. An hypothetical b) Q-tree and c) P-tree for the task. A P-tree is similar to a Q-tree except that the Q values are replaced with the labels “optimal” or “non-optimal”.
However, note that although Cole et al.’s system has an architecture that
supports dynamic clustering, only limited use was made of the ability in the
reported experiments. In particular, the Q-tree was statically organised us-
ing an heuristic, with dynamic clustering being reserved for the P-Learning
phase. In contrast, the experiments in Džeroski et al. (2001) allowed gener-
alisation to occur in both the Q-Learning and P-Learning phases.
Like P-Learning, the two systems considered in this section generalise over
a policy rather than a value function; that is, over π : S → A rather
than Q : S × A → R. Thus, as with P-Learning, both systems avoid the
constraints of generalising over Q.
and Tsitsiklis, 1996). The API technique modifies policy iteration (see Sec-
tion 2.1.2, page 23) so that the value function calculation or estimate of a
representative set of states, S̃ ⊂ S, is generalised over the whole of S us-
ing function approximation.4 Fern et al. (2004a) adapt API for RRL, but
instead of employing value function approximation, they use an abstract
policy in the form of a relational decision list. Their method works as fol-
lows. Like the policy iteration algorithm (see Figure 2.3), the procedure
essentially loops over two steps: policy evaluation and policy improvement.
During the policy evaluation step, a set, D, is constructed, consisting of a
tuple ⟨s, π(s), Q̂(s, a1 ), . . . , Q̂(s, am )⟩ for each s ∈ S̃, where π is the current
policy and Q̂(s, ai ) is an estimate of Q(s, ai ) as calculated by policy rollout.
To compute Q̂(s, ai ) by policy rollout involves:
that maximises the value of its rules over D using a supervised learning
technique described by Yoon et al. (2002).
Like P-Learning, the method computes a policy rather than a value func-
tion and thus may generalise in a more unconstrained fashion. Unlike P-
Learning, however, it avoids the need to maintain an approximate value
function in an intermediate layer. The method is elegant and powerful, pro-
ducing positive results on blocks world tasks that are substantially more
complex than those considered by any other RRL method. However, the
use of policy rollout does require an unconstrained simulator—the ability
to generate a trajectory for an arbitrary policy and initial state—and an
heuristic function for estimating the value of a state.
Another policy driven system is Grey (Muller and van Otterlo, 2005). The
system evolves a population of policies where each policy is a relational
decision list. Each individual policy is assessed over a number of randomly
generated instances of a task and assigned a fitness value based on the total
reward obtained, the number of time steps taken, and the number of rules
in the policy. Individuals with high fitness are reproduced using crossover
and mutation to derive a new generation of policies. This new generation
is assessed and reproduced to obtain another generation, and so on, until
some terminal condition is reached. No detailed results have been reported
for Grey as yet.
We now consider two RRL methods that dynamically generalise over the
value function, but which do not use temporal difference-like methods to
compute value function estimates. The first system, SVRRL (Sanner, 2005),
is based on statistical methods that restrict it to undiscounted, finite-horizon
(episodic) tasks with a single terminal reward of success or failure. The value
function is represented as a Bayes network where a node in the network
represents an atomic formula or a conjunction of atomic formulae. The
value of the network and its structure is updated online using a novel Bayes
network algorithm.
The second system (Walker et al., 2004) focuses on the problem of predicting
Qπ for a given policy, π, rather than on computing an optimal policy, π ∗ .
Generalisation is performed by estimating Qπ as a weighted combination of
features, where each feature is a conjunction of atomic formulae. Training
starts by sampling a set of (s, a, Qπ (s, a)) triples according to π, then a set of
features is stochastically generated from the samples, and finally regularised
kernel regression is applied to learn the weights. A number of Qπ estimates
are generated and then combined using ensemble learning. Since the method
requires Qπ (s, a) values to be provided as data, in experiments, they focus
on a deterministic, undiscounted, episodic task so that Qπ (s, a) values can
be computed from the reward on the final step.
6
An extension was described, however, where an abstract policy was produced at regu-
lar intervals throughout the training of the reinforcement learning component. An abstract
policy was derived from the current ground policy, which was used to guide subsequent
exploration of the agent and was found to reduce the amount of training required to find
an optimal policy.
Some reinforcement learning methods have used representations that are in-
termediate between propositional and first-order logic. For example, deictic
representations have been combined with Q-Learning in the hope of achiev-
ing some of the abstractive power of first-order logic without the associated
algorithmic complexity. The approach has been applied to blocks world
with mixed results: Whitehead and Ballard (1991) have reported success,
but Finney et al. (2002a,b) have found its performance disappointing.
There are also supervised learning systems which have been specifically de-
signed to learn policies for relational MDPs or planning tasks similar to
relational MDPs (Yoon et al., 2002; Martín and Geffner, 2004; Khardon,
1999). The general approach is to induce a policy from a set of examples
consisting of (s, a) pairs belonging, ideally, to an optimal policy for the
task. A typical representation device for the policy is a decision list of gen-
eral rules expressed in a relational language. Since the rules of the policy
do not have to cluster the (s, a) pairs according to the value function, this
approach should inherit the advantages of other policy driven methods, such
as the ability to handle scaled up versions of the task without retraining.
The training examples could be generated from optimal policies produced by
applying dynamic programming or an existing planning algorithm to small
versions of the task.
3.4 Discussion
These characteristics are essential for realising the full potential of the rein-
forcement learning framework and without them, a system limits its prac-
tical significance. In subsequent work, Driessens and others addressed the
methodological shortcomings of Q-rrl (Driessens et al., 2001) and explored
some alternative approaches (Driessens and Ramon, 2003; Gärtner et al.,
2003a; Driessens and Džeroski, 2005), all of which possess the above charac-
teristics. However, apart from that work very few realised RRL systems—
perhaps only the system of Cole et al. (2003)—possess the above three char-
acteristics in combination.
Table 3.4: Comparison of several RRL systems including Foxcs based on comprehensi-
bility and required domain knowledge.
The approach developed in this thesis also possesses the characteristics men-
tioned above. For this reason, in terms of the aims of the system, we situate
the current work as being most closely related to the family of systems devel-
oped by Driessens and others (Driessens et al., 2001; Driessens and Ramon,
2003; Gärtner et al., 2003a; Driessens and Džeroski, 2005). Methodologi-
cally, Driessens’ systems are all closely related. Each contains a Q-Learning
component combined with a generalisation mechanism. The primary differ-
ence between them lies in the generalisation mechanism: Rrl-tg (Driessens
et al., 2001) contains a decision tree builder; Rrl-rib (Driessens and Ra-
mon, 2003), an instance-based component; Rrl-kbr (Gärtner et al., 2003a),
a Gaussian process; and Trendi (Driessens and Džeroski, 2005) a hybrid
decision tree, instance-based algorithm. The current work, being based on
Xcs, can be viewed in a similar way: it combines a rule-based Q-learning
component with an evolutionary approach to generalisation.
Comparisons based on the first two dimensions are summarised in Table 3.4.
As rule- and tree-based systems, Foxcs and Rrl-tg produce readable struc-
tures and so have a high level of comprehensibility. At the other end of the
spectrum, Rrl-rib and Rrl-kbr produce predictions that are not trans-
parent to humans. Trendi, as a hybrid system, lies somewhere between
Rrl-tg and Rrl-rib.
3.5 Summary
7
Knowing an optimal policy, one could perhaps reverse engineer an appropriate metric
or kernel. This is not a practical approach though, as it presupposes the solution.
Chapter 4

The XCS Learning Classifier System
1995) which realised the potential of the framework for solving MDPs and
which initiated an expansion of interest in the area (dubbed the “Learning
Classifier System Renaissance” by Cribbs and Smith, 1996). In contrast to
earlier systems, the genetic algorithm in Xcs used accuracy-based fitness,
overcoming the problem of strong overgeneral rules (Kovacs, 2001) that had
hindered the earlier strength-based systems. The Xcs system has subse-
quently become the most widely investigated and well documented system
in the field of LCS.
The Xcs system, originally developed by Wilson (1995), was the result of
a line of investigation that, beginning with Zcs (Wilson, 1994), sought to
simplify the algorithmic complexity of LCS systems, which over the years
had steadily increased. However Xcs went beyond these aims, introducing
innovations of which accuracy-based fitness is perhaps the most significant.
Later, Wilson (1998) refined Xcs into what is now widely recognised as the
standard version of the system, although further improvements are known
(cf. Butz et al., 2003). In this section we describe Xcs procedurally, following
the Butz and Wilson (2002) specification except where otherwise noted.
Readers seeking a comprehensive treatment of Xcs are directed to the recent
book by Butz (2006).
[Figure: the Xcs system and its interaction with the environment.]
1
There are two types of LCS, Michigan and Pittsburgh systems. In a Michigan system,
each individual in the population is a rule representing a partial policy. In a Pittsburgh
system, on the other hand, the individuals in the population are rule sets each representing
a complete policy. While Michigan systems are generally trained online, the additional
overhead due to the need to evaluate multiple policies in Pittsburgh systems means that
they are usually trained offline. Recent work in Pittsburgh systems is exemplified by Gale
(Llorà, 2002; Bernadó et al., 2002).
In this thesis the following syntax is used to represent the condition and
action of an individual rule:
action ← condition
The condition and action are fixed length bit-strings over the alphabets
{0, 1, #} and {0, 1} respectively, that is condition ∈ {0, 1, #}m and action ∈
{0, 1}n . For example, two instances of valid rules over {0, 1, #}3 and {0, 1}2
are 00 ← 010 and 11 ← 0##. Under this bit-string paradigm, the state
and action spaces are assumed to be a subset of the binary strings, that is
S ⊆ {0, 1}m and A ⊆ {0, 1}n ; typically S and A are derived by mapping a
feature space pertaining to the specific environment into a space of binary
strings. It will be convenient to have a shorthand notation to represent
conditions and actions in formulas: given a rule j, the notation Cj and Aj
denotes the condition and action of j respectively.
Given a rule j and a state s ∈ S ⊆ {0, 1}m , j is said to match s if and only if the
string Cj matches the string s precisely, symbol for symbol, except that the
“don’t care” symbol, #, can match either 0 or 1.
For example, the rule 10 ← 111 matches the state 111 but not 010 or any
other state s 6= 111. The # symbol, also called the “don’t care” symbol,
supports generalisation by allowing a single condition to match multiple
states, for example the condition string 10## generalises over four states
(i.e. 1000, 1001, 1010 and 1011). A rule’s action does not contain the #
symbol, thus it does not generalise over A.
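The matching test can be transcribed directly; the sketch below is an illustrative implementation rather than the Xcs reference code:

def matches(condition, state):
    """Return True if the rule condition (a string over '0', '1', '#') matches
    the state (a string over '0', '1'): every position must agree exactly,
    except that '#' matches either bit."""
    return len(condition) == len(state) and all(
        c == '#' or c == s for c, s in zip(condition, state))

print(matches("10##", "1011"))   # True: the two '#' positions accept any bit
print(matches("111", "010"))     # False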
The rules also keep track of various estimates. To this end, each rule contains
the following three parameters:
The payoff referred to above is a long term measure of reward and is best
thought of as a being analogous to the optimal value function, Q∗ . A de-
tailed comparison between Xcs and Q-Learning is provided by Kovacs (2002,
section 6.1); we note here that in the case where there is a separate rule for
each (s, a) pair then p is indeed an estimate of Q∗ (s, a).
2. The niche size, ns (also commonly called the action set size), is used
to help balance resources across different portions of S × A.
Numerosity
Due to the use of numerosity there are potentially two units of measure for
the size of the rule base: one is in terms of the macro-rules and the other is in
terms of the "virtual" or micro-rules. The latter is used herein except where
otherwise noted. The size of the rule base, |[P]|, in terms of micro-rules is
the sum of the numerosities of the rules it contains (equation (4.1)).
1. The first step is to construct the match set [M], which contains all the
rules in [P] that match the state st ∈ S for the current time step t.
2. Second, if the match set [M] is empty, which typically occurs at the
beginning of a training run, then the covering operation (see Sec-
tion 4.1.5) is called to produce a new rule. This step ensures that
[M] will contain at least one rule.
[P] := ∅
[A]−1 := ∅
3 The ε-greedy method is now typical for action selection in Xcs (Butz and Wilson,
2002), although originally an alternative “explore-exploit” method was used (Wilson, 1995,
1998).
[Figure: the architecture of the XCS system. The Rule Base holds the population of rules, [P]; the Production Subsystem performs matching, producing the match set [M], and action selection, producing the action set [A] and the action sent to the environment; the Credit Assignment Subsystem applies the rule updates using the reward; and the Rule Discovery Subsystem generates new classifiers with a genetic algorithm.]
Some actions in A may not be advocated by any rule in [M] and will
thus not be considered for selection; however, there will always be at
least one action advocated due to the covering operation on step 2. As
part of this step the action set [A] is constructed, which is identical to the
subset of [M] advocating the selected action, at .
6. Sixth,
(a) The credit assignment procedure is called on [A]−1 , the action set
formed on the previous time step, t − 1.
(b) If it is the terminal step of an episode then credit assignment is
called on [A].
7. Seventh,
At each time step, t, except the first, an update is performed on [A]−1 , the
action set from the previous step, t − 1. Then, if t is the terminal step of
an episode, an update is also performed on the current action set, [A]. The
update procedure proceeds by first updating all the prediction estimates,
p, then the errors, ε, then the fitness values, F , and finally the experience
counter, exp, and the niche size, ns. We now describe each of the updates
in turn.
For each rule j in the action set, the prediction estimate, pj , is updated
according to the equation:
pj := (1 − α)pj + αρ (4.3)
where α is a learning rate satisfying 0 < α < 1 and ρ is the new target value
for the prediction. The target ρ is defined as:
ρ = rt + γ max_{a∈{a1 ,...,an }} P (a)   if updating [A]−1 , and
ρ = rt+1                                 if updating [A].        (4.4)
where γ is a discount factor satisfying 0 ≤ γ < 1 (see Section 2.1.1),
a1 , . . . , an are the actions advocated by [M], and P is the system prediction
calculated on the current time step t. Note that rt+1 is the reward obtained
from executing action at and is obtained on time step t (not on step t + 1),
which is consistent with the notation presented earlier in Chapter 2.
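As a hedged illustration of equations (4.3) and (4.4), the following Python sketch updates the predictions of the rules in an action set towards a target ρ; the rule objects, their field p, and the system_predictions mapping are illustrative assumptions rather than part of the Xcs specification.

def prediction_target(reward, gamma, system_predictions, terminal):
    # Equation (4.4): the target is the reward plus the discounted maximum
    # system prediction, except on a terminal update where it is the reward alone.
    if terminal:
        return reward
    return reward + gamma * max(system_predictions.values())

def update_predictions(action_set, rho, alpha):
    # Equation (4.3): a Widrow-Hoff step of size alpha towards the target rho.
    for rule in action_set:
        rule.p += alpha * (rho - rule.p)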
Next, the error estimate, ε, is updated according to the equation:
εj := (1 − α)εj + α|ρ − pj |                                   (4.5)
where εj is the error value associated with rule j, and α and ρ are as de-
scribed above. The error εj averages |ρ − pj | over consecutive updates. If
all (s, a) pairs which match rule j have the same Q(s, a) value then |ρ − pj |
tends towards 0 as the number of updates grows, although this also depends
on the other rules in the population, which affect the value of ρ. A special
group of rules are those whose error satisfies εj < ε0 ,
where ε0 is a small value close to zero, since any rule belonging to this group
is likely to represent a region over S × A that has a uniform Q value.
Xcs uses accuracy-based fitness, perhaps the most significant feature dis-
tinguishing it from other LCS systems. This fitness measure is designed to
reflect the importance given to rules with low error, particularly rules where
εj < ε0 . The fitness is updated in a three step procedure, listed below.
First, the accuracy, κ, is calculated according to the equation:
κj = 1  if εj < ε0 ,   and   κj = a(ε0 /εj )^b  otherwise.       (4.6)
Second, the relative accuracy, κ′j , is calculated:
κ′j = κj nj / Σ_{i∈[A]} κi ni                                    (4.7)
where nj is the numerosity of rule j. Note that we use [A] here to represent
either [A] or [A]−1 , whichever set is being updated. The relative accuracy
normalises the accuracy over the interval [0, 1] with respect to the other
accuracies, thus eliminating differences in the accuracies which originate
Figure 4.4: The accuracy and relative accuracy metrics: a) the relationship between
accuracy κ and error ε (from Butz, 2006), and b) the effects of the relative accuracy
calculation (from Kovacs, 2002).
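The accuracy calculations of equations (4.6) and (4.7) can be sketched as follows; the field names (eps for εj, n for the numerosity nj) and the treatment of a and b as plain arguments are assumptions made for illustration, not the thesis's implementation.

def accuracy(eps_j, eps0, a, b):
    # Equation (4.6): rules below the error threshold are maximally accurate,
    # otherwise accuracy falls off as a power of the relative error.
    return 1.0 if eps_j < eps0 else a * (eps0 / eps_j) ** b

def relative_accuracies(action_set, eps0, a, b):
    # Equation (4.7): normalise the numerosity-weighted accuracies over the action set.
    kappa = {rule: accuracy(rule.eps, eps0, a, b) for rule in action_set}
    total = sum(kappa[rule] * rule.n for rule in action_set)
    return {rule: kappa[rule] * rule.n / total for rule in action_set}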
In addition to the above updates to the prediction, error, and fitness, most
of the auxiliary parameters are also updated: the niche size, ns, the expe-
rience, exp, and potentially the time stamp, ts. First, the niche size, ns, is
updated according to:
nsj := (1 − α)nsj + α|[A]|                                      (4.9)
where [A] again represents either [A] or [A]−1 . Note that |[A]| is measured
in terms of micro-rules as in equation (4.1). Next, the experience parameter
is incremented, exp j := exp j + 1. Finally, if rule discovery is triggered
(Section 4.1.5) then the time stamp of all rules in the action set are assigned
to the current time step:
ts j := t. (4.10)
In the above we have assumed that the learning rate, α, is a fixed constant.
In this section we discuss the Moyenne Adaptive Modifiée (MAM) technique
(Venturini, 1994), which involves using a non-constant learning rate in rule
updates. The intention of MAM is to more quickly reduce parameter error
due to initialisation by using true sample averaging in the early stages of the
rule’s life. More specifically, the kth update to rule j’s prediction adjusts
the parameter according to:
pj := (1 − 1/k) pj + (1/k) ρ
for the first n updates to rule j. From the (n + 1)th update onwards, the update
rule returns to:
pj := (1 − α)pj + αρ.
In other words, the substitution α = 1/k is made for the first n updates. Note
that here we are illustrating the technique using the prediction update, but
it applies wherever α is used, with the sole exception of the fitness
update.
We now give the MAM technique. For the kth update of rule j, the technique
sets α as follows:
α = 1/k   if k < 1/β,   and   α = β   otherwise.              (4.11)
where 0 < β < 1. Note that k can be derived from the experience parameter
expj . Assuming that exp j is incremented before all other updates are per-
formed, then the substitution k = exp j can be made in (4.11). The MAM
technique is applied to the updates for prediction p, error ε and niche size ns
given in equations (4.5), (4.3) and (4.9), but as noted above, fitness updates
do not employ MAM updates. That is, α = β in the fitness update (4.8).
We now consider another technique, annealing, which also uses a non-constant
α. Under annealing, the learning rate for the kth update is simply:
α = 1/k.                                                      (4.12)
Like the MAM technique, annealing is applied when updating the predic-
tion, p, error, ε and niche size, ns, parameters using equations (4.5), (4.3)
and (4.9), but not when updating the fitness parameter, F , in equation (4.8).
Note that a comparison of equations (4.11) and (4.12) shows that α is iden-
tical under both methods when k < 1/β. Thus, despite the difference in
motivation between the two techniques, they operate in essentially the same
way; the only difference being that under MAM updates, α reverts to a
constant after some point, whilst under annealing it continually decays.
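The two learning-rate schedules can be contrasted with a small sketch of equations (4.11) and (4.12); the function names are illustrative only.

def mam_rate(k, beta):
    # MAM (eq. 4.11): true sample averaging while the rule is young, then a constant rate.
    return 1.0 / k if k < 1.0 / beta else beta

def annealed_rate(k):
    # Annealing (eq. 4.12): the rate keeps decaying for the life of the rule.
    return 1.0 / k

# The schedules coincide while k < 1/beta; thereafter MAM holds alpha at beta
# whereas annealing continues to shrink it.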
The MAM technique was incorporated into Xcs in the original version (Wil-
son, 1995) and has subsequently become part of the system’s standard spec-
ification (Butz and Wilson, 2002); however, the use of annealing in Xcs is
uncommon. In fact, the author is not aware of any implementations of Xcs
which make use of annealing. Nevertheless, the significance of annealing for
convergence in Q-Learning suggests that it would be useful within Xcs also.
Initialising the rule base with a population of rules allows the incorporation
of user knowledge into the system in the form of an approximate or partial
policy. Alternatively, an initial population may be randomly generated. One
scheme would be to generate N rules with randomly assigned action and
condition strings. A random action is typically generated by setting each
bit in the string to 0 or 1 with equal likelihood; and a random condition, by
setting each bit to # with probability P# , or to 0 or 1 otherwise with equal
likelihood.
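A minimal sketch of this random initialisation scheme is given below (plain Python; the parameter name p_hash stands in for P#).

import random

def random_condition(m, p_hash):
    # Each condition position is # with probability p_hash, otherwise 0 or 1
    # with equal likelihood.
    return ''.join('#' if random.random() < p_hash else random.choice('01')
                   for _ in range(m))

def random_action(n):
    # Each action bit is 0 or 1 with equal likelihood.
    return ''.join(random.choice('01') for _ in range(n))

# An initial population of N random (action, condition) rules:
# population = [(random_action(n), random_condition(m, p_hash)) for _ in range(N)]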
4. Finally, it adds the new rule to [P], after which deletion occurs if
|[P]| > N .
New rules created in this way can populate an empty rule base, hence the
system does not require the user to supply an initial set of rules. Note that
when covering is triggered, it is usually relatively early in a run, when the
state space is still not completely covered by [P].
Parent rules are selected from [A] with probability proportional to their fitness:
Pr(j) = Fj / Σ_{i∈[A]} Fi .                                   (4.13)
Proportional selection ensures that each rule in [A] has some chance of being
selected.
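Proportional (roulette-wheel) selection according to equation (4.13) can be sketched as follows; the fitness field F on each rule and the list representation of [A] are illustrative assumptions.

import random

def proportional_select(action_set):
    # Rule j is chosen with probability F_j / sum_i F_i (eq. 4.13), so every
    # rule in [A] with non-zero fitness has some chance of selection.
    total = sum(rule.F for rule in action_set)
    threshold = random.random() * total
    running = 0.0
    for rule in action_set:
        running += rule.F
        if running >= threshold:
            return rule
    return action_set[-1]  # guard against floating-point round-off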
After selection, the two parent rules are copied and the copies are subjected,
first, to single point crossover and then, second, to either free or niche mutation
(see Section 4.5 for a description of the crossover and mutation operations).
Finally, rule deletion is applied if the size of [P] exceeds N after the two new
rules are added to the population.
Parameter Initialisation
Whenever a new rule is created, initial values for its prediction, error and
fitness parameters must be assigned. This section lists the methods used to
determine the initial parameter values. The aim of these methods is to make
a reasonable guess at the true values in order to reduce the time needed to
evaluate the rules online; however, the system should be robust to fairly
arbitrary initial parameter values. An exception, noted by Kovacs (2002), is
that new rules should not be given a large initial fitness value, which might
enable them to influence action selection and reproduction before they are
evaluated properly. The following methods for setting the parameters are
given by Butz and Wilson (2002):
• For an initial population and for rules created by covering: The initial
prediction, error and fitness values are set to user supplied constants.
An alternative method which can be used in the case of covering is to
set the initial prediction and error to the mean values over the rule
base and the initial fitness to 10% of the mean fitness over the rule
base (Kovacs, 2002).
4 Note that Wilson is referring to match sets, not action sets; this is because the original
version of Xcs applied the GA to match sets rather than action sets. Subsequent versions
of Xcs starting from (Wilson, 1998) applied the GA to action sets; for these versions,
replacing “match set” with “action set” in the quotations above has the intended meaning.
• For rules created by the genetic algorithm: Rules created by the genetic
algorithm usually have two parents, and hence the initial prediction
and error are set to the mean of the parents' values and the initial
fitness to 10% of the mean of the parents' fitness. If there is only one
parent then the initial prediction and error can be set to the parent’s
values and initial fitness to 10% of the parent’s fitness.
Rule Deletion
Subsumption Deletion
1. Rule j is sufficiently experienced. That is, expj > θsub where θsub is a
user set threshold.
3. The condition of rule j is not less general than the condition of rule i.
Now that the computational processes of Xcs have been described it is useful
at this point to elucidate some aspects of the system design. We have already
noted that Xcs can be seen as a kind of rule-based Q-Learning system and
that the prediction calculation is a form of temporal difference learning, but
we have not yet discussed the GA and in particular the significance of the
fitness calculation. Why make fitness a function of accuracy and not some
other metric? Prior to Xcs, the use of accuracy-based fitness was uncommon5
and its incorporation into Xcs represented a major advance in the field,
leading to a deeper understanding of the LCS framework. Below we set out
some advantages of using accuracy-based fitness.
5 Wilson (1995) gives an account of the use of accuracy metrics in LCS systems before
Xcs.
Under accuracy-based fitness, rules with a greater average payoff can no longer
displace rules with a lesser average payoff except on the basis of accuracy.
Avoiding the problem of strong
overgeneral rules is a principal motivation for using accuracy-based fitness.
Complete Maps
Further insight into the design of Xcs can be gained by examining key biases
which influence the behaviour of the system and the mechanisms which
produce them. These biases attempt to provide an answer to the question
“How does the population of rules evolve over time?” An analytical answer
to this question is difficult as it relies on the complex interaction of several
stochastic processes (including selection, mutation, crossover and deletion;
additionally, the transition function may also be stochastic). Therefore,
rather than attempt to provide a quantitative analysis of the population
dynamics we will be content to identify qualitative features which influence
the constitution of the population.
Together, these mechanisms tend to increase the proportion of rules in [P] that are accurate and maximally gen-
eral.
The Optimality Hypothesis does not identify any mechanisms for produc-
ing [O] in addition to those given in the Generality Hypothesis; instead, it
says that these mechanisms are sufficient for generating [O] given sufficient
training. Empirical support for the hypothesis has been obtained by Kovacs
(1996, 1997) who observed the formation of [O] on multiplexor tasks having
up to 11 input bits.
6 The assumption here is that the more general a rule is, the more likely it is to occur
in an action set.
7 This explanation of the Generality Hypothesis has been influenced by Butz (2006).
1. Set pressure. Reproduction from [A] and deletion from [P] creates a
pressure towards rules with greater semantic generality.8
The first two pressures account for the bias identified in the Generality Hy-
pothesis while the other three describe additional biases. Butz et al. (2003)
8 The semantic generality of a rule refers to its level of generality as determined by its
frequency of occurrence in action sets. Syntactic generality, on the other hand, refers to
the rule’s level of generality as determined by its logical form. A rule’s level of semantic
generality is clearly influenced by its syntactic generality, but also by the distribution of
states sampled during training.
provide an analysis of set pressure showing that the average expected level of
generality in [A] is indeed greater than the average level of generality in [P],
supporting the validity of the Generality Hypothesis. Unfortunately for
our purposes, the result depends on bit-string analysis and is not language
independent.
4.3.3 Discussion
Now that we have identified key biases present in Xcs, we would like to
consider the effect that modifying Xcs for relational reinforcement learning
will have on them. In particular, we would like to show that changing the
rule language from bit-strings to first-order logic will not adversely disrupt
the biases. We are not, however, interested in considering the effect of any
other changes required to modify the system for RRL, since these changes
are secondary in the sense that they arise because the rule-language has
changed, and if a particular modification is disruptive then it is possible that
another can be designed. On the other hand, changing the rule language
is critical to achieving our intention: if Xcs supports bit-string rules only
then our intention cannot be realised.
Since the Optimality Hypothesis is based on the same principles as the Gen-
erality Hypothesis, it should also hold under alternative rule languages. As
an aside, we note that in some circumstances the Optimality Hypothesis
and the Generality Hypothesis make conflicting predictions about the com-
position of the population, as the following example shows.
Under the standard ternary bit-string language there are six accurate rules
which can be formed, four specific rules: 1 ← 00, 1 ← 01, 1 ← 10, and
1 ← 11 and two general rules: 1 ← #1 and 1 ← 1#. Note that no single rule
can have a condition which represents {01, 10, 11}, the three states whose
Q-value is 0. For this task the Generality Hypothesis says that training
tends to push the composition of the population to:
1 ← 00 1 ← #1 1 ← 1#
as these are the maximally general rules which accurately represent the
Q function. On the other hand, the Optimality Hypothesis suggests that
training tends to produce one of the following two sets of rules:
1 ← 00 1 ← 01 1 ← 1#
1 ← 00 1 ← 10 1 ← #1
since they are the complete, minimal, non-overlapping rule sets which accu-
rately represent the Q function.
The above example was contrived to show a task for which a population
containing only maximally general rules is not consistent with a popula-
tion containing only non-overlapping rules. Empirically, Kovacs (2002, sec-
tion 3.5.2) has observed that Xcs tends to squeeze out overlapping rules,
supporting the view that in practice Xcs follows the Optimality Hypothesis
when the two hypotheses disagree.
Returning to the main theme of this discussion, we now consider the effect
of the rule language on the evolutionary pressures identified by Butz. Of
these pressures, set, fitness, and deletion pressure are language independent,
while subsumption and mutation pressure rely on language dependent fea-
tures. The language dependent part of subsumption, however, is limited to
testing if a particular rule is a generalisation of another, and such a test is
usually possible, so the influence of the rule language on subsumption pres-
sure is generally negligible. Mutation pressure, on the other hand, largely
relies on operations which are dependent on representation. Thus, the bias
of the Xcs system towards accurate and maximally general classifiers is rel-
atively language independent, while the rule discovery processes, which rely
on mutation and other evolutionary variation operators, are the representa-
tionally dependent factor.
In conclusion, it appears that most of the biases discussed above are un-
affected by the choice of rule language, from which a case can be made
that Xcs is essentially a language neutral architecture. This analysis sug-
gests that, in principle, Xcs will not be adversely affected by changing the
rule language to first-order logic. It also suggests that other rule languages
could be used, and indeed, many alternative rule languages have already
been implemented, as we consider next.
In addition to the above analysis, we can further satisfy ourselves that the
functioning of Xcs will not be impaired by changing the rule language by
considering the existence of previous work which has extended Xcs with
alternative rule languages. The LCS framework was originally conceived
to use bit-string rules, and the Xcs system follows this tradition. The
prevalence of bit-string rules within the LCS paradigm is due to the central
role that the GA occupies, as it is the principal mechanism for rule discovery.
However, the LCS framework, and Xcs in particular, is flexible enough to
support other rule languages, provided that variation operations are available
to act upon arbitrary rules in the new language. For instance, Xcs has been
extended with rule languages over:
• continuous spaces (Wilson, 2000, 2002; Stone and Bull, 2003, 2005;
Butz, 2005; Dam et al., 2005; Butz et al., 2006; Lanzi and Wilson,
2006)
For example, no bit-string condition can express the concept that "the first
and fourth bits are equal and the second and third bits are equal" (i.e. ABBA
where A and B are variables over bit values). It is this lack of expressive
power that limits the usefulness of bit-strings in relational domains. The
following systems all employ rule languages which can represent relational
concepts like ABBA.
A recent survey by Divina (2006) reviews many of the systems described above.
The Vcs system (Shu and Schaeffer, 1989) attempted to solve the above rep-
resentational limitation of bit-strings by extending the bit-string alphabet
to support variables. Fixed length bit-strings over the alphabet could repre-
sent certain first-order logic rules if a mapping was provided for encoding the
predicates and variables as bit-strings. Furthermore, custom mutation and
crossover operators which handled the variable and predicate bit-strings as
units were adapted from the standard bit-string versions. However, encoding
first-order logic rules into bit-strings in fact propositionalises the representa-
tion, so it is not representationally equivalent to ILP or RRL systems, which
directly use first-order logic languages.
The Vcs system must have faced significant challenges during its implemen-
tation. At that time, the development of Xcs still lay five years in the
future, so the significance of accuracy-based fitness was not yet under-
stood. Also, relational learning was still maturing, with the development of
many ILP algorithms as well as the non-monotonic setting10 yet to occur.
Thus, the concept behind Vcs may have been too ambitious for its time
and, as far as the author is aware, Vcs was unfortunately never completed.
Dl-cs (Reiser, 1999) was another LCS system that followed the path of
encoding first-order logic rules into bit-strings that support variables. Like
Vcs, it used an encoding from first-order logic to fixed length bit-strings,
and thus propositionalised the representation. Unlike Vcs, Dl-cs was suc-
cessfully implemented but empirical study showed that it performed disap-
pointingly, although some of its poor performance may be attributed to the
use of strength-based fitness and possibly to other innovations (it maintained
two separate populations for instance).
Two other LCS systems should be mentioned in this context, Xcsl (Lanzi,
2001, 1999b) and Gp-cs (Ahluwalia and Bull, 1999). These two systems
were inspired by genetic programming (Koza, 1992) and its utility for rule
discovery in LCS systems. Genetic programming is a style of evolutionary
computation with many similarities to genetic algorithms; the greatest dif-
ference being that the population in a genetic program consists of Lisp-like
S-expressions, instead of bit-strings, that are mutated and recombined us-
ing tree-based operations. The significance of employing S-expressions over
bit-strings is that, due to their recursive structure, they are more expressive.
Essentially, Xcsl and Gp-cs replace the GA component within the LCS
with a genetic program. Each rule is now partially represented by an S-
expression: in Xcsl the rule’s condition is an S-expression while its action
is a bit-string; in Gp-cs it is the reverse. A limitation of this dual rep-
resentation for actions and conditions is that variables cannot range over
the entire rule as they do in first-order logic. Regrettably, a comparison
between the representational characteristics of first-order logic and Lisp S-
10 The non-monotonic, or "learning from interpretations", setting is particularly relevant
to reinforcement learning systems; see Appendix B.
expressions is well beyond the scope of this thesis, but we note that the two
formalisations are united under some higher-order logics (Lloyd, 2003).
Many types of crossover and mutation operations have been developed for
GAs; here we mention two, single point crossover and free mutation, which
are commonly used within Xcs. In single point crossover, two parent chro-
mosomes are recombined to form two offspring chromosomes. A bit location
is randomly generated and the substrings before and after the bit location
in each parent chromosome are exchanged to create the offspring (see Fig-
ure 4.5). [Figure 4.5: an example of single point crossover in which parents
101|1111 and 000|1010 are cut after the third bit and recombined to produce
the children 101|1010 and 000|1111.] In free mutation, each bit in the string is mutated with a given
probability. When a bit is mutated it is set to a randomly selected element
from the set of symbols in the bit-string alphabet minus the current bit
value.
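A sketch of the two operations just described, for ternary condition strings, is given below (illustrative Python, not the thesis implementation).

import random

def single_point_crossover(parent1, parent2):
    # Exchange the substrings after a randomly chosen cut point, as in Figure 4.5.
    point = random.randrange(1, len(parent1))
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])

def free_mutation(condition, mu, alphabet='01#'):
    # Each position is mutated with probability mu to a randomly chosen symbol
    # other than its current value.
    return ''.join(random.choice(alphabet.replace(c, '')) if random.random() < mu else c
                   for c in condition)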
The GA within Xcs is somewhat more specialised than the generic de-
scription given above. In particular, individuals in the generic GA usually
represent a whole solution to the given problem, whereas individuals in Xcs
are fragments of a policy. Thus, in Xcs, the entire population, together with
logic for resolving conflicts, represents the solution to the problem. Also,
the GA in Xcs often uses a restricted version of free mutation, called niche
mutation, where the mutated bit-strings are guaranteed to match the cur-
rent state (a bit can only be mutated to # or to the corresponding bit value
in the current state). Finally, we note that while GAs enjoy a prominent
position within LCS design, there appears to be no inherent reason why al-
ternative evolutionary algorithms, or other parallel search algorithms, could
not be adapted for use within an LCS implementation. Indeed, some ex-
isting LCS systems have already used alternatives to the GA (for example,
Lanzi, 1999b; Stolzmann, 2000).
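One reading of the niche mutation restriction described above is sketched here: a mutated position becomes # if it currently holds a bit value, or the corresponding bit of the current state if it holds #, so a child of a matching parent still matches the state. This is an illustrative interpretation, not the specified operator.

import random

def niche_mutation(condition, state, mu):
    out = []
    for c, s in zip(condition, state):
        if random.random() < mu:
            # Mutate only to # or to the bit value in the current state.
            out.append('#' if c != '#' else s)
        else:
            out.append(c)
    return ''.join(out)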
4.6 Summary
In this chapter we have described in detail the basic method from which we
will derive a relational reinforcement learning system capable of dynamic
generalisation. The system, Xcs, perhaps the most prominent instance of
the LCS paradigm, can be characterised as a kind of rule-based Q-Learning
system combined with a genetic algorithm. In the following chapter, Xcs
We saw that each rule in Xcs contains an action and a condition part
which together represent a policy fragment, and that a rule also performs
a generalisation role through its ability to represent a cluster of states in
a single compact condition. The paradigm used for representing the rules
is the bit-string, which represents the action and condition as fixed length
strings over the alphabets {0, 1} and {0, 1, #} respectively. A rule is said
to match a state, also given as a string over {0, 1}, if its condition string
is identical to the state, except for any “don’t care” symbols, #, which can
match either a 0 or a 1.
Learning classifier systems are complex systems using mechanisms and heuris-
tics which have been fine-tuned over the decades since their inception in the
1970s. In this context, the primary contribution of Xcs is its accuracy-
based fitness. Earlier systems used a single parameter, strength, for both
prediction and fitness which led to the problem of strong overgeneral rules:
the displacement of rules which consistently advocate an optimal action by
rules which advocate a sub-optimal action but whose payoff is higher than
that of the former. By basing fitness on accuracy instead of strength,
rules which always advocate optimal actions in Xcs cannot be displaced by
strong overgeneral rules.
Recall that the objective of this research is to develop and evaluate an ap-
proach to relational reinforcement learning based on the learning classifier
system Xcs. Background material has been dealt with, including funda-
mentals of the RRL problem, a survey of RRL methods, and a description
of Xcs; the purpose of this chapter is thus to present a detailed design real-
ising the proposed approach. The system is named Foxcs: a “First-Order
Logic Extended Classifier System”. In brief, Foxcs automatically gener-
ates, evolves, and evaluates a population of condition-action rules taking
the form of definite clauses over first-order logic. As mentioned previously,
the primary advantages of this approach are that it is model-free, general-
isation is performed automatically, no restrictions are placed on the MDP
framework, it produces rules that are comprehensible to humans, and it uses
a rule-language to provide domain specific bias.
Perhaps the most novel aspect of the system is that it features an inductive
component based on a synthesis of evolutionary computation and refine-
ment in ILP. Although there do exist systems that have previously adapted
This chapter is set out as follows. First, the overall approach is briefly
described and the principal modifications required to produce Foxcs from
Xcs are listed. Next, forming the bulk of the chapter, each modification
is fully described in its turn (in general, the details of Xcs will not be
repeated; readers unfamiliar with Xcs might like to consult the previous
chapter). Then a brief overview of the system from a software engineering
perspective follows. Finally, a summary of Foxcs concludes the chapter.
5.1 Overview
Within inductive logic programming, many systems have been obtained sim-
ilarly. For example, the well known Foil (Quinlan, 1990), Icl (De Raedt
and Van Laer, 1995), Ribl (Emde and Wettschereck, 1996), Claudien
1 See "Evolutionary Computation in Inductive Logic Programming" on page 101.
(De Raedt and Dehaspe, 1997), Warmr (Dehaspe and De Raedt, 1997), and
Tilde (Blockeel and De Raedt, 1998), amongst others, have all been derived
following the “upgrade” approach. A general methodology for achieving
such an upgrade has even been formulated (Van Laer and De Raedt, 2001).
Note that representations other than definite clauses, such as first-order logic
trees, have been employed in some cases.
2. A task specific rule language is defined by the user of the system using
special language declaration commands.
5. The covering and mutation operations are tailored for producing and
adapting rules in first-order logic.
Like its parent Xcs, the Foxcs system accepts inputs and produces rules;
however, unlike Xcs these inputs and rules are expressed in a language
over first-order logic. The language is specified via an alphabet, which is a
tuple ⟨C, F, P, d⟩, consisting of a set of constant symbols C; a set of function
symbols F (although functions are not supported by Foxcs, i.e. F = ∅); a
set of predicate symbols P; and a function d : F ∪ P → Z, called the arity, which
specifies the number of arguments associated with each element of F ∪ P.
For the purpose of describing a task environment, it will be convenient to
partition the set of predicate symbols, P, into three disjoint subsets, PS ,
PA , and PB , which contain the predicate symbols for states, actions and
background knowledge respectively.
above(X, Y ) ← on(X, Y )
above(X, Y ) ← on(X, Z), above(Z, Y )
[Figure 5.2: a blocks world state comprising two stacks, block a on block b, and block c on block d on block e, together with its representation as a system input:]
s = {cl(a), on(a, b), on_fl(b), cl(c), on(c, d), on(d, e), on_fl(e)}
A(s) = {mv_fl(a), mv(a, c), mv_fl(c), mv(c, a)}
An input to the system describes the present state of the environment and
the potential for action within it. More specifically, it is a pair (s, A(s)),
where s ∈ S is the current state and A(s) is the set of admissible actions for
s. The state, s, and the set of admissible actions, A(s), are each represented
by an Herbrand interpretation—a set of ground atoms which are true—over
the language L.2
2 The representation of inputs with Herbrand interpretations is called learning from in-
terpretations or non-monotonic learning in the ILP literature (De Raedt, 1997; De Raedt
and Džeroski, 1994). Blockeel (1998, page 89) notes that a limitation of using this
paradigm is that rules with recursive definitions cannot be learnt; however, he suggests
that recursive rules occur comparatively rarely.
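For illustration only, the input (s, A(s)) of Figure 5.2 could be encoded outside of Prolog as sets of ground atoms, each atom being a predicate name paired with a tuple of constants; Foxcs itself holds these interpretations in its Prolog module, so the representation below is an assumption made for the sketches that follow.

# The state s and admissible actions A(s) from Figure 5.2 as sets of ground atoms.
s = {('cl', ('a',)), ('on', ('a', 'b')), ('on_fl', ('b',)),
     ('cl', ('c',)), ('on', ('c', 'd')), ('on', ('d', 'e')), ('on_fl', ('e',))}

admissible_actions = {('mv_fl', ('a',)), ('mv', ('a', 'c')),
                      ('mv_fl', ('c',)), ('mv', ('c', 'a'))}

system_input = (s, admissible_actions)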
Rules in Foxcs contain a logical part Φ, which replaces the bit-string action
and condition of their counterparts in Xcs. Apart from this change, all
other parameters and estimates associated with rules in Xcs are retained
by Foxcs and function as they do in Xcs.
ϕ0 ← ϕ1 , . . . , ϕn
Example 6 The logical forms, Φ, of three rules over the language LBW are
given below:
Note that the action parts of all the rules in the above example contain
variables; such rules are said to have abstract actions and can advocate more
than one action for a particular state. Rules in Foxcs are thus capable of
generalising over actions in addition to states.
Figure 5.3: Three rules for blocks world and an illustration of the patterns that they
describe. A serrated edge on a block signifies that it rests on a stack of zero or more
blocks.
Now that the representation of rules and inputs has been described, it is
interesting to characterise how the first-order logic rules of Foxcs express
generalisations. There are two ways that a rule can generalise: the first is
through the use of variables, and the second is through underspecification.
Although these two generalisation mechanisms have been separated for clar-
ity, it should be noted that an individual rule usually contains variables and
is also typically underspecified; that is, both mechanisms generally occur
together in the same rule.
Figure 5.4: (a) A rule containing variables; the rule generalises over all states where the
blocks are arranged in the indicated pattern. (b) An underspecified rule; this rule gener-
alises over all arrangements of unspecified blocks that preserve the truth of highest(a).
The variables of first-order logic are analogous to the # symbol in the bit-
string languages of LCS systems in the sense that they both generalise over
multiple values. Of course, unlike variables, the use of # in different loca-
tions within a rule cannot express a correspondence between the values of
the attributes at those locations. Underspecification, on the other hand,
typically does not occur in bit-string rules of LCS systems; an exception is
Xcsm (Lanzi, 1999a), which achieves the effect through the incorporation
of messy coded bit-strings (Goldberg et al., 1989, 1990).
(This point is taken up on page 199.)
Φ ∧ s ∧ B |= a (5.1)
Example 7 Let us consider matching the rule rl3 from Example 6 to the
input (s, A(s)) illustrated in Figure 5.2. The values of Φ, s and A(s) are
reproduced here for convenience:
A separate matching operation is run for each (s, a) ∈ {s} × A(s) because
matching is defined for (s, a) pairs rather than (s, A(s)) pairs. In this ex-
ample, rl3 will be matched to four (s, a) pairs because |A(s)| = 4.
Note that the rule in the above example generalised over two ground actions.
It is because of this ability to generalise over actions that the rules in Foxcs
are matched against (s, a) pairs rather than against the state s only: matching
against (s, a) pairs makes it unambiguous which actions a rule advocates.
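In Foxcs the matching test of (5.1) is delegated to Prolog; purely as a sketch of what that test amounts to, the following brute-force Python version enumerates candidate substitutions over the constants of the input. Background clauses, negation and inequations are ignored, variables are identified by an initial capital letter, and the atom representation is the one assumed above.

from itertools import product

def rule_matches(head, body, state_atoms, ground_action):
    # Does some substitution theta ground the rule so that its head equals the
    # given action and every body atom is true in the state?
    is_var = lambda t: t[0].isupper()
    atoms = [head] + list(body)
    variables = sorted({t for _, args in atoms for t in args if is_var(t)})
    constants = sorted({t for _, args in state_atoms | {ground_action} for t in args})
    for values in product(constants, repeat=len(variables)):
        theta = dict(zip(variables, values))
        ground = lambda atom: (atom[0], tuple(theta.get(t, t) for t in atom[1]))
        if (ground(head) == ground_action
                and all(ground(b) in state_atoms for b in body)):
            return True
    return False

# e.g. a rule mv(X, Y) <- cl(X), cl(Y) matched against the input of Figure 5.2:
# rule_matches(('mv', ('X', 'Y')), [('cl', ('X',)), ('cl', ('Y',))], s, ('mv', ('a', 'c')))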
The order of the atoms within Φ is important for producing
sound results. In particular, when a rule contains certain expressions, such
as an inequality or a negation, Prolog can return an error depending on the
order of the atoms. For example, the two rules:
are both meant to indicate that X is a minor if X’s age is less than 21. The
first rule works as intended; however, in the second rule, Y is not bound
when the inequality is evaluated, hence the rule will produce an error. A
similar problem can occur for negation; therefore, care must be taken that
any variable occurring in an inequality or a negation already occurs earlier
in the rule.4
In Foxcs, an inequation is automatically added for each pair of distinct variables
occurring in Φ.5 For instance, when inequations are included, the logical
form of the three rules from Example 6 would be:
Note that each inequation is placed as far to the left as possible in the rule,
once both of the variables that it refers to have occurred. This has the effect of causing
matching to terminate as early as possible when the inequation is not satis-
fied and was found to lead to a significant reduction in the running time of
the system compared to placing them at the end of the rule.
The added inequations are artifacts that implement the semantics of the rule
and are thus not regarded as syntactic elements of Φ. Hence, the inequations
are not directly subject to mutation nor are they explicitly shown when Φ
is written out. The use of inequations is controlled by a user setting, which
is set to “on” by default. The default setting is assumed throughout the
remainder of this thesis except where otherwise indicated.
5.3.3 Caching
The computational efficiency of the Xcs framework, and indeed the LCS
approach in general, relies heavily on the efficiency of matching. This is
because on each operational cycle (that is, on each time step t) matching is
run on every rule in [P]. Matching rules in Xcs is cheap: linear in the number
of bits; however, it is a more expensive operation in Foxcs. This additional
5 Actually the inequations are type dependent, that is, an inequation will only be added
if the two variables refer to attributes of the same user defined type (these types are
declared with the mode command discussed in Section 5.5.1). This precludes the addition of
many unnecessary inequations and also prevents complications that arise when attributes
which have different user defined types range over the same basic types (such as integers
and floating point numbers).
This technique works well for ILP tasks, which usually have a few hundred
data items only. However, typically |S| ≫ NK for RRL tasks, which reduces
the effectiveness of caching for such tasks; an interesting exception is looping
behaviour: if the loop is completed in NK steps or fewer then caching can be
effective.
6. Assign credit. The parameters of the rules in the action set of the
previous cycle, [A]−1 are updated. If it is the terminal step of an
episode then the rules in [A] are also updated.
gives the algorithm for the operational cycle of Foxcs. The first modifica-
tion, which has been noted before, is that input to the system includes the
set of admissible actions, A(s), for the current state s ∈ S. The next, and
principal, modification occurs on step 1 of the cycle and involves the con-
struction of a separate match set, denoted [M]a , for each action a ∈ A(s).
Note that an individual rule can, and frequently does, belong to more than
one match set if it contains an abstract action; for instance, the rule in
Example 7 would belong to two match sets.
At step 2, another modification was made for triggering the covering op-
eration. The standard mechanism, as given by Butz and Wilson (2002),
invokes covering whenever fewer actions are advocated than a user set thresh-
old. In Foxcs, the covering operation is called for each action a which has an
empty match set, [M]a . This version originates from Kovacs (2002), where it
is termed action set covering; it requires knowledge of A(s), but eliminates
the need for a user setting. For relational environments, action set covering
appears to be a more natural choice than using a fixed threshold, because
the number of actions frequently depends upon the state. For instance, in
bwn the number of actions that apply in an arbitrary state varies between
1 and n^2 − n.
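The per-action match sets and action set covering just described might be sketched as follows; match and cover are placeholder callables standing in for the matching test and the covering operation, not the names used in the thesis.

def build_match_sets(population, state, admissible_actions, match, cover):
    # Step 1: a separate match set [M]_a is built for each admissible action.
    # Step 2: covering is invoked for any action whose match set is empty.
    match_sets = {}
    for a in admissible_actions:
        match_sets[a] = [rule for rule in population if match(rule, state, a)]
        if not match_sets[a]:
            new_rule = cover(state, a)   # create a rule matching (s, a)
            population.append(new_rule)
            match_sets[a] = [new_rule]
    return match_sets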
Another difference occurs on step 3. The formula for calculating the system
predictions was not itself modified. That is, P (a) is still computed according
to equation (4.2) as the fitness weighted sum of the individual predictions
of the rules in [M]a . The equation is reproduced here for convenience:
P (a) = Σ_{j∈[M]a} pj Fj / Σ_{j∈[M]a} Fj .
However, because rules can generalise over actions, then, unlike the situation
with Xcs, it is possible for an individual rule to occur in more than one [M]a
and thus contribute to multiple system prediction calculations on a single
cycle.7
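A sketch of the system prediction calculation over the per-action match sets is given below; because an individual rule can sit in several match sets, it may contribute to several of the returned predictions. The rule fields p and F are assumed as before.

def system_predictions(match_sets):
    # Fitness-weighted prediction P(a) for each admissible action; covering
    # guarantees that no match set is empty.
    predictions = {}
    for a, rules in match_sets.items():
        total_fitness = sum(rule.F for rule in rules)
        predictions[a] = sum(rule.p * rule.F for rule in rules) / total_fitness
    return predictions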
The final change concerns the rule updates which are made on step 6.
7 Note however, that a rule cannot achieve perfect accuracy unless each of these actions
yields the same expected payoff. Thus, generalisations made over actions—like general-
isations made over states—must reflect a uniformity in the payoff landscape or the rule
will suffer low fitness. In other words, fit rules in Foxcs are those rules where Q(s, a) is
uniform for each (s, a) pair that matches the rule.
Each environment requires its own specific vocabulary for describing it.
When the system creates or modifies rules—through covering or mutation—
it consults user defined declarations in order to determine which literals
may be added, modified, or deleted. A language declaration command that
min, max Integers specifying the minimum and maximum number of oc-
currences of the predicate allowed within an individual rule.
neg A boolean value which determines whether the predicate may be negated
or not.
The last argument, pred, declares the predicate itself. The general form
of pred is r(m1 , . . . , md(r) ), where r is the predicate symbol, mi is a dec-
laration for the ith argument of r, and d(r) is the arity of r. However,
if r is a proposition—which is frequently the case for action predicates in
classification tasks—then the form of pred reduces to just r.
Each declaration mi is a list, [arg type, spec1 , spec2 , . . .], where arg type is a
type specifier,8 and the remaining arguments, spec1 , spec2 , . . ., are “mode”
symbols which determine how the ith argument may be set. The mode
symbols and their meanings are given below:
“+” Input variable. The argument may be set to a named variable (of the
same type) that already occurs in the rule.
“-” Output variable. The argument may be set to a named variable that
does not already occur in the rule.
8 The types are user defined. Note that user defined types are not automatically sup-
ported by Prolog and the developer must provide the code which handles them.
“!” If the argument is a variable then it must be unique within the literal’s
argument list.
The list should contain at least one of the first four symbols. If it contains
more than one of the above symbols, then the argument may be set according
to any of the symbols present.
Example 8 The following are mode declarations for the blocks world lan-
guage LBW (see examples 3 and 4).
mode(a,[1,1],false,mv([block,-,#],[block,-,#])).
mode(a,[1,1],false,mv_fl([block,-,#])).
mode(s,[0,5],false,cl([block,-,#])).
mode(s,[0,5],false,on([block,-,#],[block,-,#])).
mode(s,[0,5],false,on_fl([block,-,#])).
mode(b,[0,5],true,above([block,-,#],[block,-,#])).
The next example makes more sophisticated use of argument types and
modes.
9 The role of the anonymous variable in first-order logic is analogous to the role of the
“don’t care” symbol, #, in bit-string languages.
Example 9 The following are mode declarations that could be used by the
system for learning to recognise hands of Poker.
mode(a,[1,1],false,hand([class,#])).
mode(s,[5,5],false,card([rank,+,-,_],[suit,+,-,_],[position,-])).
mode(b,[0,1],true,succession([rank,+,!],[rank,+,!],[rank,+,!],[rank,+,!],[rank,+,!])).
Note the use of “!” in the declaration of succession; the symbol prevents
rules from being generated that contain a succession literal having two or
more equal arguments, which would be unsatisfiable, and thus reduces the
size of the rule space to be searched. This illustrates the purpose of “!”,
which is to improve efficiency.
In Foxcs, the covering operation generates a new rule which matches a given
state-action pair, (s, a). The algorithm for covering is given in Figure 5.6 and
works as follows. First, a new rule, rl, is created and its parameters, except
for Φ, are initialised in the same way as in Xcs.10 Next, the logical part, Φ,
is set to a definite clause corresponding to a ← s. At this stage the rule does
not generalise because s and a are ground, so in the next two steps an inverse
10 See Section 4.1.5, page 89. The method that assigns population means is used, except
when the population is empty in which case user supplied constants are assigned.
4. Φ := Φθ−1
Although not described in the algorithm given above, the actual covering
operation also takes type information into account when creating the inverse
substitution at step 3. Without checking type information, an inverse substi-
tution potentially has undesirable consequences when different data types
use the same values for constants (which typically occurs with numerical
data, for instance). For example, consider the clause:
Here, the constant anne represents the same person in both literals, but the
value of 60 for both age and weight is a coincidence. Thus, replacing anne in
both literals with variable X, say, creates a useful association, but replacing
60 in both literals with Y is much less likely to be useful. The system
therefore performs type checking to ensure that the inverse substitution is
type safe, that is, that constants of attributes with different types will not
be substituted with the same variable. In the above example, the numerical
arguments would be declared to have different types:
mode(s,[1,1],false,age([name,-],[age,-])).
mode(s,[1,1],false,weight([name,-],[weight,-])).
The mutation operations of Foxcs are based on processes that are analogous
to refinement operations. Refinement operations are used to search the
hypothesis space in ILP systems and compute a set of specialisations or
generalisations of a given rule, expression or hypothesis. In Foxcs, versions
of refinement operations have been created that are suitable for its learning
classifier system framework. The operations produce a single, stochastically
refined rule and can access the current state, s, and set of admissible actions,
A(s), but no other elements of S or A.
The three generalising operations are del (delete literal), c2v (constant to
variable), and v2a (variable to anonymous variable). They are described
below.
del: A literal from the body of the rule is randomly selected and deleted.
The algorithms for the three generalising mutation operations are given in
Figure 5.7. However, they have been simplified by ignoring several factors:
type information; the minimum and maximum number of literals allowed in
a rule; the mode symbol "!"; and various failure conditions (e.g. when the c2v
operation is invoked on a rule where Const(Φ) = ∅).
del
2. Φ := Φ − ϕi
c2v
3. θ−1 := {⟨c, {p1 , . . . , pk }⟩/v}, where the set of pi are the places in
Φ where c occurs and “+” ∈ Mpi if v ∈ Vars(Φ), or “-” ∈ Mpi if
v is a new variable
4. Φ := Φθ−1
v2a
3. Φ := Φθ−1
Figure 5.7: The algorithms for the generalising mutations. The sets Const(Φ) and
Vars(Φ) are, respectively, the set of constants and the set of variables occurring in Φ.
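As a simplified sketch of the del and c2v operations (ignoring mode symbols, type information, the minimum number of literals, and the choice between reusing an existing variable and introducing a fresh one), a clause can be manipulated as a head atom plus a list of body atoms, with variables identified by an initial capital letter. None of this mirrors the thesis's Prolog and C++ implementation.

import random

def del_literal(head, body):
    # del: remove a randomly chosen body literal (variable-binding checks omitted).
    i = random.randrange(len(body))
    return head, body[:i] + body[i + 1:]

def constant_to_variable(head, body, fresh_variable='V0'):
    # c2v: replace every occurrence of a randomly chosen constant with a variable
    # (here always a fresh one).
    atoms = [head] + list(body)
    constants = sorted({t for _, args in atoms for t in args if not t[0].isupper()})
    c = random.choice(constants)
    rename = lambda atom: (atom[0], tuple(fresh_variable if t == c else t for t in atom[1]))
    return rename(head), [rename(b) for b in body]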
The three specialising operations are add (add literal), v2c (variable to
constant), and a2v (anonymous variable to variable). The operations are
described below. Note however, that in order for the child rule to belong
to the same action set as its parent, it must match a state-action pair (s, a)
corresponding to the current state, s, and an action a ∈ A(s).
add: A new literal is generated and added to the body of the rule. A
predicate is selected from PS ∪ PB and its arguments are filled in
according to the mode symbols associated with the argument. If any
constants are to be assigned to the arguments of the predicate, then
they are generated from the current state, s, such that the new rule
matches (s, a) for some action a ∈ A(s).
The algorithms for the three specialising mutation operations are given in
Figures 5.8 and 5.9. Again, they have been simplified in the same way that
the generalising mutations were.
This operation does not alter Φ; instead, its purpose is to encourage the
most highly fit rules to dominate the population.
When mutation is triggered the system randomly selects one of the above
seven operations to apply. Each operation, i ∈ {del,c2v,v2a,add,v2c,a2v,
rep}, is associated with a weight µi , and its selection probability is propor-
tional to its relative weight, µi / Σj µj . If the operation fails to produce offspring
then another randomly selected operation is applied, and so on, until one
succeeds.
inputs: a rule with logical part Φ, and the current system input
(s, A(s))
add
3. Φ := Φ ∪ ϕ
5. Φi := Φθ
inputs: a rule with logical part Φ, and the current system input
(s, A(s))
v2c
3. Φi := Φθ
a2v
5. Φ := Φθ
Figure 5.9: The algorithms for the v2c and a2v mutations.
Under this approach, µadd should be set larger than µdel so that the
frequency of add operations is greater than that of del operations.
Examples
We now give some examples to illustrate how the add and del operations
work. We do not provide explicit examples of the other four mutations
because they are concerned with assigning the arguments of literals, which
is also performed by steps two and four in the add operation. Therefore the
add example will also suffice to illustrate the kinds of processes involved in
the c2v, v2a, v2c and a2v mutations.
mode(a,[1,1],false,active).
mode(a,[1,1],false,inactive).
mode(s,[1,1],false,molecule([compound,-])).
mode(b,[0,20],false,atm([compound,+],[atomid,+,-],[element,#],[integer,#],[charge,#,-])).
mode(b,[0,20],false,bond([compound,+],[atomid,+,-],[atomid,+,-],[integer,#])).
Suppose that Foxcs is applied to the Mutagenesis task with the above
declarations, and that at some point during training the system applies the
add mutation to the following rule:
Φ : active ← molecule(A).
At step 1 of the operation, the system randomly selects one of the mode
declarations for a state or background predicate (i.e. a declaration specifying
type “s” or “b”). Let the selected declaration be the one relating to the
predicate atm. Also during this step, Mi is assigned to the set of mode
symbols associated with argument i of the predicate. Here, M1 = {“+”},
M2 = {“+”,“-”}, M3 = {“#”}, M4 = {“#”} and M5 = {“#”,“-”}.
At step 2, the terms for the five arguments of atm are assigned. First, the
set of candidate terms, call it Ti , is computed for each argument i, where
Ti = ∪_{m∈Mi} T (m) (for the definition of T (m) see equation (5.2)). The
sets of candidates are: T1 = {A}, T2 = {B}, T3 = {#}, T4 = {#} and
T5 = {#, C}. A couple of comments are in order:
atm(d1,d1_1,c,22,-0.117).
atm(d1,d1_2,c,22,-0.117).
atm(d1,d1_3,c,22,-0.117).
atm(d1,d1_4,c,195,-0.087).
atm(d1,d1_5,c,195,0.013).
atm(d1,d1_6,c,22,-0.117).
atm(d1,d1_7,h,3,0.142).
atm(d1,d1_8,h,3,0.143).
atm(d1,d1_9,h,3,0.142).
atm(d1,d1_10,h,3,0.142).
...
These complications with deletion are one of the reasons why variable argu-
ments are declared as either input (“+”) or output variables (“-”): so that
the system can reason about which literals can and cannot be deleted. In
general, if a rule contains a variable, v, which occurs as an input variable at
some location in the rule, then v must also occur as an output variable at
a previous location. The del mutation is therefore constrained such that it
will only delete literals which do not result in a violation of this condition.
Similarly, the v2a mutation will not rename an output variable if it would
result in a violation of the condition. The appropriate declarations for the
above example are:
mode(s,[1,1],false,molecule([compound,-])).
mode(b,[0,20],false,atm([compound,+],[atomid,+,-],[element,#],[integer,#],[charge,#,-])).
mode(b,[0,20],false,<([charge,-],[float,#])).
The final modification required for the specification of Foxcs is the test
for subsumption. Recall that in the Xcs system, the subsumption deletion
technique (see Section 4.1.5, page 91) helps to encourage the system to
converge to a population containing maximally general rules. If rule i is
accurate (i.e. ε < ε0 ) and sufficiently experienced (i.e. exp > θsub ) then i
may subsume j if i is a generalisation of j. In Foxcs, the accuracy and
experience conditions remain, but the test for generalisation is modified for
In first-order logic, the θ-subsumption procedure (Plotkin, 1969, 1971) can
be used to determine if one rule is a generalisation of another. A clause Φg
θ-subsumes a clause Φs if and only if there exists a substitution θ such that
Φg θ ⊆ Φs , where each clause is treated as the set of its literals. For example,
consider the two clauses:
Φg : ϕ1 (Y ) ← ϕ1 (X), ϕ2 (X, Y )
Φs : ϕ1 (Z) ← ϕ1 (X), ϕ2 (X, Y ), ϕ2 (Y, Z)
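The θ-subsumption test can be mechanised by brute force, as in the sketch below, where a clause is represented as a set of signed literals and the variables of the first clause range over the terms of the second; the predicate names phi1 and phi2 simply transliterate ϕ1 and ϕ2. This is an illustrative checker, not the test used inside Foxcs.

from itertools import product

def theta_subsumes(general, specific):
    # general theta-subsumes specific iff some substitution theta over the
    # variables of general makes every literal of general a literal of specific.
    is_var = lambda t: t[0].isupper()
    vars_g = sorted({t for _, _, args in general for t in args if is_var(t)})
    terms_s = sorted({t for _, _, args in specific for t in args})
    for values in product(terms_s, repeat=len(vars_g)):
        theta = dict(zip(vars_g, values))
        applied = {(sign, name, tuple(theta.get(t, t) for t in args))
                   for sign, name, args in general}
        if applied <= specific:
            return True
    return False

# The clauses above, with the head literal positive and the body literals negated:
phi_g = {('+', 'phi1', ('Y',)), ('-', 'phi1', ('X',)), ('-', 'phi2', ('X', 'Y'))}
phi_s = {('+', 'phi1', ('Z',)), ('-', 'phi1', ('X',)),
         ('-', 'phi2', ('X', 'Y')), ('-', 'phi2', ('Y', 'Z'))}
# theta_subsumes(phi_g, phi_s) can then be evaluated directly.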
Figure 5.10: Architecture of the Foxcs system. Arrows indicate the flow of data.
Recall that there are two varieties of subsumption deletion in Xcs, GA sub-
sumption and action set subsumption. For action set subsumption, Foxcs
uses θ-subsumption to test for generalisation, but for GA subsumption there
is a simpler, constant time test. In Foxcs, unlike in Xcs, each type of mu-
tation operation acts to either always generalise or always specialise rules.
Thus, in order to determine whether a parent rule is more general than its
child, each mutation operation sets or clears a flag depending on whether it
generalises or specialises. The subsumption test then just tests the flag to
determine whether the child rule should be subsumed by its parent or not.
• The first module contains routines for interacting with the environ-
ment, such as initiating an episode, obtaining the current input, ex-
ecuting actions and receiving rewards, and also for implementing the
matching and θ-subsumption operations.
• The second module implements all the learning classifier system sub-
systems: the rulebase, the production system with the exception of
matching, credit assignment, and rule discovery operations with the
exception of θ-subsumption.
Note that when ILP tasks are addressed they must be converted into an
equivalent MDP formulation of the task so that the LCS methodology can be
applied. In practice these conversions are implemented by placing additional
routines in the task environment files which respond to interaction requests.
5.7 Summary
In this chapter we have presented the details of the Foxcs system. The sys-
tem was obtained by extending the learning classifier system Xcs for rep-
resentation with first-order logic. Essentially, the extension was achieved
through the following steps: replacing the bit-string rules of Xcs with defi-
nite clauses in first-order logic; redefining matching as a test for consistency
with respect to a given state-action pair; replacing bit-string mutation and
crossover with online versions of upward and downward ILP refinement; and
replacing the bit-string subsumption test with θ-subsumption. In addition
to these changes, the system also requires a task specific rule language to be
declared as part of the task specification.
By deriving from the LCS methodology, the Foxcs system inherits the
following qualities: it is model-free, automatically discovers generalisations,
and does not restrict the MDP framework. In the remainder of this thesis
we focus on an empirical evaluation of the Foxcs system.
Chapter 6
Application to Inductive
Logic Programming
fact that all rewards are immediate. We find that Foxcs is generally not
significantly outperformed by several well-known ILP systems, confirming
Foxcs’s ability to generalise effectively.
This section provides details of the materials and system settings that were
used for the experiments reported in this chapter.
6.1.1 Materials
Brief descriptions of the ILP data sets which were used are given below.
More detailed descriptions are provided in Appendix C, including examples
of the specific rule language declarations that were used.
There are two different levels of description for the molecules. The
first level, NS+S1, contains a low level structural description of the
molecule in terms of its constituent atoms and bonds, plus its logP
and LUMO values. The second level, NS+S2, contains additional in-
formation about higher level sub-molecular structures. The second
level is a superset of the first and generally allows a small increase in
predictive accuracy to be attained.
Biodegradability. This data set contains 328 molecules that must be clas-
sified as either resistant (143 examples) or degradable (185 examples)
from their associated structural descriptions—which are very similar
to those in the Mutagenesis data set—and molecular weights. Block-
eel et al. (2004) give several levels of description for the molecules; the
Global+R level is used here.
Traffic. This data set originates from Džeroski et al. (1998b). The task
is to predict critical road sections responsible for accidents and con-
gestion given traffic sensor readings and road geometry. The data
contains 66 examples of congestion, 62 accidents, and 128 non-critical
sections, totaling 256 examples in all.
Poker. This task originates from Blockeel et al. (1999). The aim is to
classify hands of Poker into eight classes, fourofakind, fullhouse,
flush, straight, threeofakind, twopair, pair and nought. The
first seven classes are defined as normal for Poker, although no dis-
tinction is made between royal, straight, and ordinary flushes. The
last class, nought, consists of all the hands that do not belong to any
of the other classes. In Poker, the frequency with which the different
classes occur when dealing is extremely uneven, but in these experi-
ments the data examples were artificially stratified to ensure that the
class distribution was approximately equal (see Section C.4).
6.1.2 Methodology
Unless otherwise noted, all results for the Foxcs system were obtained under
the following settings. The system parameters were set to: N = 1000, =
10%, α = 0.1, β = 0.1, ε0 = 0.01, ν = 5, θga = 50, θsub = 20, θdel = 20, and
δ = 0.1. The mutation parameters were: µrep = 20, µadd = 60, µdel = 20,
and µi = 0 for all i ∈ {c2v, v2a, v2c, a2v}. Tournament selection was used
with τ = 0.4. The learning rate was annealed and both GA and action
set subsumption were used. The system was trained for 100,000 steps, but
rule discovery was switched off after 90,000 steps in order to reduce the
disruptive effect of rules generated late in training, which have not had
sufficient experience to be adequately evaluated (see
Figure 6.1). Finally, the reward given was 10 for a correct classification and
−10 for an incorrect classification.
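The protocol just described amounts to the following loop, sketched here in
Python purely for concreteness; the learner object and its methods are
placeholders standing in for the corresponding Foxcs components.

TRAIN_STEPS = 100_000
RULE_DISCOVERY_CUTOFF = 90_000      # rule discovery is switched off here

def run_classification_training(env, learner):
    for step in range(TRAIN_STEPS):
        state = env.init_episode()                       # one example per episode
        action = learner.select_action(state)            # explore / exploit
        reward, _terminal = env.execute_action(action)   # +10 correct, -10 incorrect
        learner.update(state, action, reward)
        if step == RULE_DISCOVERY_CUTOFF:
            learner.rule_discovery_enabled = False       # avoid disruption from fresh rules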
[Plot of accuracy (%) against training episode (x1000) for System A and System B.]
Figure 6.1: This graph of the system performance on the Mutagenesis NS+S1 data set
is indicative of the performance improvement gained by switching off rule discovery prior
to the completion of training. After 90,000 training episodes rule discovery is switched
off for system A but left on for system B. Comparing the performance of the two systems
after this point illustrates the negative effect which is exerted by freshly evolved rules.
2. Each subset is used in turn as the test set, and the remaining k − 1
subsets as the training set;
and found that it was not significantly outperformed by any of the algorithms
in relation to its predictive accuracy; and Butz (2004) has also obtained
similar results. We are therefore interested to determine whether Foxcs
inherits Xcs’s ability at classification and performs at a level comparable
with existing ILP algorithms.
This experiment compared the Foxcs system to several ILP algorithms and
systems. Four well-known ILP algorithms were selected, Foil (Quinlan,
1990), Progol (Muggleton, 1995), Icl (De Raedt and Van Laer, 1995)
and Tilde (Blockeel and De Raedt, 1998). The first three systems are
rule-based, while Tilde uses a tree-based representation. A fifth system
containing an evolutionary component, Ecl (Divina and Marchiori, 2002),
was also selected. The Ecl system employs a memetic algorithm that
hybridises evolutionary and ILP search heuristics. These algorithms are all
supervised learning algorithms since—as far as the author is aware—Foxcs
is the first reinforcement learning system to be applied to ILP tasks.
Table 6.1: Comparison between the predictive accuracy of Foxcs and selected ILP
algorithms on the Mutagenesis, Biodegradability, and Traffic data sets. The standard
deviations, where available, are given in parentheses. An * (**) indicates that the value
is significantly different from the corresponding value obtained by Foxcs according to an
unpaired t-test (assuming unequal variance) with confidence level 95% (99%) (note that a
significance test could not be run for the cases where no standard deviation was reported).
Table 6.1 compares the predictive accuracy of Foxcs to that of the ILP
systems on the Mutagenesis, Biodegradability, and Traffic data sets. The
results were taken from the following sources. For the Mutagenesis data
set: Srinivasan et al. (1996) for Progol, Blockeel and De Raedt (1998)
for Tilde and Foil,2 Van Laer (2002) for Icl and Divina (2004) for Ecl;
for Biodegradability: Blockeel et al. (2004) for Tilde and Icl, and Divina
(2004) for Ecl; and for Traffic: Džeroski et al. (1998b) for all systems
except Ecl, which is Divina (2004). All predictive accuracies have been
2 Blockeel and De Raedt (1998) are a secondary source for Foil on the Mutagenesis
data set; we note that the primary source, (Srinivasan et al., 1995), has been withdrawn
and that its replacement, (Srinivasan et al., 1999), reassesses Progol but unfortunately
does not replicate the experiments for Foil.
measured using 10-fold cross validation.3 The folds are provided with the
Mutagenesis and Biodegradability data sets but are generated independently
for the Traffic data. For the Biodegradability data, five different 10-fold
partitionings are provided and the final result is the mean performance over
the five 10-fold partitions. For consistency, all results have been rounded
to the lowest precision occurring in the sources (i.e. whole numbers); more
precise results for Foxcs on these tasks can be found in Section 6.6. Where
a source provides multiple results due to different settings of the algorithm,
the best results that were obtained are given.
From the table it can be seen that Foxcs generally performs at a level
comparable to the other systems. In only one case is Foxcs significantly
outperformed. This finding confirms that Foxcs retains Xcs’s ability at
classification and validates the efficacy of Foxcs’s novel combination of
evolutionary computation and online ILP refinement operators.
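The significance tests behind Table 6.1 are unpaired t-tests that do not
assume equal variances (Welch's test). A minimal sketch using SciPy is given
below; the per-fold accuracies are invented for illustration only.

from scipy import stats

foxcs_folds = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.80, 0.81, 0.79]
other_folds = [0.86, 0.84, 0.88, 0.85, 0.87, 0.83, 0.86, 0.85, 0.84, 0.86]

# Welch's unpaired t-test: unequal variances are not assumed.
t_stat, p_value = stats.ttest_ind(foxcs_folds, other_folds, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# p < 0.05 indicates significance at the 95% level, p < 0.01 at the 99% level.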
6.3 Verifying the Generality Hypothesis
The Generality Hypothesis (see Section 4.3) states that the number of ac-
curate and maximally general rules in the population of Xcs will tend to
increase over time. In Chapter 4 it was suggested that the Generality Hy-
pothesis would also hold under alternative rule languages, since the mech-
anisms in Xcs which the hypothesis identifies as generating the pressure
towards accuracy and maximal generality are essentially language neutral
(see page 97). In this section we seek to empirically determine if the be-
haviour of Foxcs is consistent with the Generality Hypothesis.
Use of a synthetic data set is indicated for this experiment, since at least two
potential problems arise with real-world data sets, such as the Mutagenesis,
3 Note that, as previously described in Section 6.1.2, the predictive accuracy for Foxcs
is calculated by performing ten repetitions of the 10-fold cross validation procedure and
taking the mean.
Figure 6.2: Graph of system performance and population size for the poker task. The
system performance was measured as the proportion of correct classifications over the last
50 non-exploratory episodes, and the population size is the number of macro-rules divided
by N (500). Results are averaged over 10 separate runs.
For this experiment some of the system settings were different from those
given in Section 6.1.2. The poker task is easier to solve than the previous
tasks, so training was terminated after 30,000 episodes (rule discovery was
switched off at 25,000 episodes) and N was decreased to 500. However, the
task does require a high value for θsub (see Mellor, 2005, for an explanation);
we used θsub = 500. Finally, the anonymous variable is useful for this task,
so the v2a and a2v mutations were used. Hence, the mutation parameters
were set as follows: µi = 20, for i ∈ {rep, add, del, v2a, a2v} and µj = 0
for j ∈ {v2c, c2v}.
The system performance and the population size were measured throughout
training and the results are shown in Figure 6.2. The performance plot shows
that the system is able to make error-free classifications after 25,000 training
episodes (when evolution was switched off).4 After an initial period where
the rule base is populated by covering, the size of the population decreases
throughout the course of training. This decline in diversity of the population
indicates that generalisation is occurring and that accurate, specific rules are
being displaced or subsumed by accurate, but more general rules.
The rules evolved by the system were inspected to determine if the most
numerous rules were accurate and maximally general. A rule, rl, is max-
imally general if every hand that matches rl’s condition part belongs to
rl’s class, and if every hand belonging to rl’s class matches rl’s condi-
tion. It was found that maximally general rules did evolve, and that the
minimum number of episodes required to evolve maximally general rules in
greater proportion than sub-optimally general classifiers for all classes was
approximately 25,000.
A sample containing the accurate and correct rules belonging to the final
population for one arbitrarily selected run is given in Table 6.2. In this
sample the proportion of maximally general rules is 146/152 = 96.1%. Note
that for some classes several different maximally general classifiers evolved,
which is possible because the rule language does allow for more than one
maximally general expression for each of the classes.
4 Note that this result improves upon those reported by Mellor (2005) for an earlier
version of the Foxcs system.
Table 6.2: All rules with κ = 1 and p = 10 after 30,000 episodes of the Poker task, ordered
by numerosity. In the first column, an asterisk indicates that the rule is maximally general; the
second column gives the numerosity; and the remaining columns give the logical part, Φ.
* 19  fullhouse ← card(A, B, _), card(D, B, F), card(G, H, C), card(J, B, L), card(M, H, _)
* 18  twopair ← card(A, B, _), card(D, B, _), card(G, H, F), card(J, H, C), card(M, N, _)
* 17  flush ← card(A, B, C), card(D, E, C), card(G, H, C), card(J, K, C), card(M, N, C)
* 17  fourofakind ← card(A, B, C), card(D, B, F), card(G, B, I), card(J, K, _), card(M, B, O)
* 11  threeofakind ← card(A, B, C), card(D, E, _), card(G, H, F), card(J, B, L), card(M, B, _)
* 11  straight ← card(A, B, C), card(D, E, _), card(G, H, _), card(J, K, _), card(M, N, F), consecutive(N, H, B, K, E)
* 11  pair ← card(A, B, _), card(D, E, F), card(G, E, C), card(J, K, _), card(M, N, _)
* 10  threeofakind ← card(A, B, C), card(D, E, F), card(G, H, _), card(J, B, L), card(M, B, _)
*  9  pair ← card(A, B, _), card(D, E, _), card(G, E, C), card(J, K, _), card(M, N, F)
*  5  nought ← card(A, B, _), card(D, E, F), card(G, H, _), card(J, K, C), card(M, N, _), not(consecutive(H, K, B, N, E))
*  5  nought ← card(A, B, _), card(D, E, F), card(G, H, C), card(J, K, _), card(M, N, _), not(consecutive(H, K, B, N, E))
*  4  straight ← card(A, B, C), card(D, E, _), card(G, H, _), card(J, K, _), card(M, N, F), consecutive(E, B, H, N, K)
   3  straight ← card(A, B, _), card(D, E, _), card(G, H, I), card(J, K, C), card(M, N, F), consecutive(H, N, B, K, E)
*  3  nought ← card(A, B, C), card(D, E, F), card(G, H, _), card(J, K, _), card(M, N, _), not(consecutive(H, N, B, K, E))
   2  pair ← card(A, B, _), card(D, K, F), card(G, H, C), card(J, K, L), card(M, E, _)
*  2  nought ← card(A, B, C), card(D, E, F), card(G, H, _), card(J, K, _), card(M, N, _), not(consecutive(N, K, H, B, E))
*  2  fullhouse ← card(A, B, _), card(D, B, F), card(G, H, C), card(J, H, L), card(M, H, _)
   1  fourofakind ← card(A, _, C), card(D, B, F), card(G, B, I), card(J, K, C), card(M, B, O)
*  1  fullhouse ← card(A, B, _), card(D, H, F), card(G, H, C), card(J, B, L), card(M, H, L)
*  1  nought ← card(A, B, _), card(D, E, F), card(G, H, C), card(J, K, _), card(M, N, _), not(consecutive(N, K, B, H, E))
Table 6.3: Comparison between using and not using subsumption deletion on execution
time, size of the population, and predictive accuracy for the Traffic data set. Standard
deviations are given in parentheses.
The subsumption deletion technique, which Foxcs inherits from Xcs, aims
to reduce the size of the population, in terms of macro-rules, without ad-
versely affecting the system’s decision making capability (see Section 5.5.4).
Ideally, there is an efficiency gain associated with the use of subsumption
deletion because a system that contains a population with fewer macro-
rules also requires fewer matching operations. However, in Foxcs, the time
cost for running the θ-subsumption test potentially offsets the reduction in
matching overhead. As it is not possible to analytically assess the relative
benefit of having fewer rules against the cost of performing θ-subsumption,
the aim of this experiment is to empirically determine whether the use of
subsumption deletion translates to a net gain in efficiency.
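For readers unfamiliar with the operation, the sketch below implements
θ-subsumption for function-free clauses directly from its definition: clause C
θ-subsumes clause D if some substitution θ makes Cθ a subset of D. It is an
illustrative reimplementation rather than the code used in Foxcs; literals are
(predicate, arguments) tuples and any argument beginning with an upper-case
letter is treated as a variable. The backtracking search it performs is the
potential time cost referred to above.

def is_var(term):
    return isinstance(term, str) and term[:1].isupper()

def unify_literal(lit_c, lit_d, theta):
    """Try to extend substitution theta so that lit_c under theta equals lit_d."""
    pred_c, args_c = lit_c
    pred_d, args_d = lit_d
    if pred_c != pred_d or len(args_c) != len(args_d):
        return None
    theta = dict(theta)
    for a, b in zip(args_c, args_d):
        if is_var(a):
            if a in theta and theta[a] != b:
                return None
            theta[a] = b
        elif a != b:                        # constants must match exactly
            return None
    return theta

def theta_subsumes(clause_c, clause_d, theta=None):
    """True if some substitution maps every literal of clause_c onto clause_d."""
    theta = {} if theta is None else theta
    if not clause_c:
        return True
    first, rest = clause_c[0], clause_c[1:]
    for lit_d in clause_d:
        extended = unify_literal(first, lit_d, theta)
        if extended is not None and theta_subsumes(rest, clause_d, extended):
            return True
    return False

# mv(A,B) <- cl(A) subsumes mv(X,Y) <- cl(X), on(X,Z)
general  = [("mv", ("A", "B")), ("cl", ("A",))]
specific = [("mv", ("X", "Y")), ("cl", ("X",)), ("on", ("X", "Z"))]
print(theta_subsumes(general, specific))    # True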
The results of the experiments on the Traffic data set are shown in Table 6.3.
They verify that using subsumption deletion does reduce the population size
without significantly affecting the predictive accuracy. Action set subsump-
tion reduces the size of the population more than GA subsumption, which
is unsurprising given that it is typically invoked more frequently. The re-
sults also show that using subsumption deletion significantly reduces the
execution time of the system; under each of the three subsumption deletion
settings the execution time was less than half the execution time without
using subsumption deletion (see Figure 6.3). Comparing execution times
under action set subsumption and GA subsumption shows relatively little
difference compared to the difference in the population size under the two
settings. This disparity suggests another factor affecting efficiency apart
from the population size; perhaps the subset of the population with the
longest matching times is removed under either action set or GA subsump-
tion.
Figure 6.3: The effect of using subsumption deletion on the execution time.
6.5 The Effect of Learning Rate Annealing on Performance
In this experiment we assess the influence of annealing the learning rate com-
pared to using a fixed constant for the learning rate. Annealing the learning
rate in temporal difference learning algorithms produces value function esti-
mates which are an average of the sampled discounted reward (Sutton and
Barto, 1998, section 2.5). Using a learning rate which is a fixed constant,
on the other hand, produces a recency weighted average in which rewards
experienced more recently are given greater weight in the calculation of the
estimate (Sutton and Barto, 1998, section 2.6). Thus, in temporal difference
algorithms, using a fixed learning rate increases the sensitivity of the system
to recently experienced inputs compared to annealing.5 The close similarity
between updates in temporal difference learning and Xcs suggests that this
connection between learning rate and sensitivity to recent inputs also holds
for Xcs and Xcs derivative systems including Foxcs.
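The difference between the two schemes can be seen in a few lines of code;
the reward sequence below is invented purely to exaggerate the recency effect.

def sample_average(targets):
    estimate, n = 0.0, 0
    for target in targets:
        n += 1
        estimate += (1.0 / n) * (target - estimate)     # annealed learning rate
    return estimate

def recency_weighted(targets, beta=0.1):
    estimate = 0.0
    for target in targets:
        estimate += beta * (target - estimate)          # fixed learning rate
    return estimate

targets = [10] * 50 + [-10] * 10                        # a recent run of -10 rewards
print(sample_average(targets))     # 6.67, i.e. (50*10 - 10*10)/60: barely moved
print(recency_weighted(targets))   # about -3.0: pulled strongly towards recent targets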
Sutton and Barto suggest that using a fixed learning rate is useful for
tracking non-stationary environments.6 However, with respect to classifi-
cation tasks, the recency effect which is produced by using a fixed learning
rate could make the system’s value estimates—and by extension, its overall
performance—sensitive to the order in which training examples are pre-
sented. In other words, using a fixed learning rate could bias the system
towards predicting the most commonly occurring class of the most recently
presented examples.
5 More accurately, under a fixed constant learning rate, β, the sensitivity to the current
input is greater than for annealing after 1/β training steps. See Section 4.1.4, page 85.
6 A non-stationary MDP is one where the transition function, T , and the reward func-
tion, R, change over time.
For each data set we ran three experiments: one, which we call RAND,
where the training examples were selected at random from the training set,
and two, SEQ and SEQ2, where the training examples were presented to
the system in a fixed sequential order. In the SEQ experiment, the data
was divided into groups such that all examples belonging to a single group
are instances of the same class; since all examples from one group were
presented before moving to the next, the short term class distribution in
the SEQ experiment varied from the overall class distribution. The SEQ2
experiment took this approach to the extreme by setting the number of
groups equal to the number of classes.
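A sketch of how the three presentation orders might be generated is given
below; the number of groups per class used for SEQ is an assumption made for
illustration, as the text does not specify it.

import random
from itertools import chain

def rand_order(examples):
    order = list(examples)
    random.shuffle(order)                   # RAND: examples drawn at random
    return order

def seq_order(examples, groups_per_class=4):
    """SEQ: split each class into groups and present one whole group at a time."""
    by_class = {}
    for x, label in examples:
        by_class.setdefault(label, []).append((x, label))
    groups = []
    for items in by_class.values():
        size = max(1, len(items) // groups_per_class)
        groups.extend(items[i:i + size] for i in range(0, len(items), size))
    random.shuffle(groups)
    return list(chain.from_iterable(groups))

def seq2_order(examples):
    """SEQ2: one group per class, i.e. all examples of a class before the next."""
    return seq_order(examples, groups_per_class=1)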
The results of the experiments are shown in Table 6.4 and Figure 6.4. The
system performed better under annealing than under a fixed learning rate
on all data sets and sampling methods. According to the unpaired t-test
(assuming equal variances), the improvement is significant with confidence
level 95% in all cases, and significant with confidence level 99% in all but two
cases. Annealing also produced results which were more comparable across
the different sampling methods than using a fixed learning rate. The SEQ2
sampling method—where all data examples from one class are presented
before the next—generally proved to be the most challenging, although to
a much lesser extent for annealing than for a fixed learning rate.
Table 6.4: Comparison between using a fixed learning rate (F) and annealing (A) on
the predictive accuracy of Foxcs for several data sets and sampling methods. Standard
deviations are given in parentheses. An * (**) indicates that corresponding A and F values
are significantly different according to an unpaired t-test with confidence level 95% (99%).
Evidence for sensitivity to the order in which inputs are sampled can be
observed in Figure 6.5, which shows the performance of the system through-
out training on the Mutagenesis data set (NS+S2). The predictive accuracy
was measured at regular intervals of 1,000 training episodes in the same ex-
7 We note that the MAM technique (see Section 4.1.4, page 84) was used for the fixed
learning rate.
periments that generated the corresponding results in Figure 6.4. Note the
oscillations which can be observed in the graphs for the SEQ and SEQ2 sam-
pling methods, and which are particularly prominent for the fixed learning
rate. The peaks in the graph correspond to when training examples belong-
ing to the majority class have been recently presented, while the troughs
correspond to when the recent examples are from the minority class.8 This
strong correspondence between system performance and the order of the
training examples is consistent with a sensitivity to recent inputs.
8 Similar results were obtained for the Mutagenesis NS+S1 setting. For Biodegradabil-
ity, the difference between the number of examples for the majority and minority classes is
much less and the oscillating effect is therefore less noticeable. For Traffic, the oscillations
are also less noticeable because there are more than two classes.
[Bar charts of predictive accuracy (%) under the annealed and fixed learning rates for the
RAND, SEQ, and SEQ2 sampling methods; panels include the Biodegradability and Traffic
data sets.]
Figure 6.4: Comparison between using a fixed learning rate and annealing on the pre-
dictive accuracy of Foxcs for several data sets and sampling methods.
accounts for the poorer performance observed when using a fixed learning
rate. Furthermore, we believe that these results have implications for Xcs.
Hence, although the regular practice in Xcs is to use a fixed learning rate,
annealing may produce better results when the system is applied to classi-
fication tasks.9 Annealing may also be beneficial for Xcs systems on tasks
containing noisy data.
9 Multiplexor tasks—which are the typical benchmark for Xcs—may represent an ex-
ceptional case because, unlike most classification tasks, it is possible to achieve 100%
predictive accuracy on them. In this situation, the long term and short term error for
a correct rule is the same irrespective of the sample distribution: zero. For other tasks
where perfect accuracy is not possible, the system must, in part, rely on overgeneral rules
in order to make classification predictions. As overgeneral rules have a non-zero error
whose value depends on the sample distribution, they benefit most from annealing.
[Line graphs of accuracy (%) against training episode (x1000) for the fixed and annealed
learning rates, one panel per sampling method: RAND, SEQ, and SEQ2.]
Figure 6.5: Comparison between using a fixed learning rate and annealing for the
Mutagenesis NS+S2 data set. There is a graph for each sampling method as labelled
in the figure.
Performance
The performance results for the system in terms of its predictive accuracy
under the different settings are given in Table 6.5. With only two excep-
tions, tournament selection outperformed proportional selection for all tasks
and (µadd ,µdel ) settings. For the two exceptions—which occurred on the
Traffic data set—the performance of the system was comparable under both
selection methods. According to an unpaired t-test (assuming equal vari-
ance), many of the observed improvements under tournament selection were
significant. Tournament selection was also less sensitive to the values of
the mutation parameters, producing results which are generally comparable
across the different settings. It is evident from these results that the choice
of selection method has an influence on performance. The superiority of
tournament selection over proportional selection observed here is consistent
with the findings of Butz et al. (2003).
Note that, with respect to the (µadd ,µdel ) settings, the best accuracies were
generally obtained under higher ratios of µadd to µdel , that is, the (70,10)
and (60,20) settings; however, it would be dangerous to infer that these
parameter values will produce the best performances in general, since the
results are an artifact of the general-to-specific direction of the search which
is occurring for these particular tasks.
Observant readers might have noticed that the results for the Biodegrad-
ability task reported in Section 6.2 are better than those given here for the
Table 6.5: Comparison between tournament and proportional selection on several ILP
data sets and (µadd ,µdel ) settings. The best accuracy for each task is shown in bold.
An * (**) indicates that the value is significantly different from its corresponding value
under the other selection method according to an unpaired t-test with confidence level
95% (99%).
Proportional Selection
(µadd ,µdel )     (40,40)        (50,30)        (60,20)       (70,10)
Mute (NS+S1)      80.3 (1.5)**   81.3 (3.1)*    82.3 (2.0)    82.2 (1.8)
Mute (NS+S2)      82.9 (2.3)**   83.8 (2.0)**   84.8 (2.0)    85.7 (1.6)
Bio               69.3 (1.4)**   71.1 (1.8)     70.8 (1.4)*   71.6 (1.9)
Traffic           91.2 (1.2)     91.9 (1.4)     92.5 (1.3)    91.9 (1.0)
[Bar charts of predictive accuracy (%) under tournament and proportional selection for the
(µadd ,µdel ) settings 40−40, 50−30, 60−20 and 70−10; panels include the Biodegradability
and Traffic data sets.]
Efficiency
The execution times for the system under the different settings are given in
Table 6.6. These results show that the system was significantly less efficient
under tournament selection than under proportional selection according to
the Mann-Whitney U test.10 Also, under tournament selection the execu-
tion time of the system rises as the proportion of add operations increases,
although this effect is less evident on the Traffic data set. Observation of
the system during training showed that the extra time was due chiefly to a
small number of rules which had disproportionately lengthy matching times
rather than, for example, any differences in the time complexity of the two
selection methods. Under tournament selection these rules were more likely
to be generated, although why this should be so is not clear.
One speculation goes as follows. Under either selection method, the fittest
rule in a given niche is more likely to be selected for mutation than the other
rules in the niche. If the fittest rule in a niche has a particularly lengthy
matching time, then it would lead to a proliferation of rules which, because
of their parent, are likely to have lengthy matching times. If tournament
selection were more likely to select the fittest rule than proportional selec-
tion, then that would account for the difference which was observed in the
system’s execution time under the two methods. Is tournament selection
more likely to select the fittest rule than proportional selection? Under
tournament selection with τ = 0.4, the fittest rule will be selected at least
10 The distribution of execution times does not appear to be well approximated by a normal
distribution. The non-parametric Mann-Whitney U test is thus more appropriate here
than the t-test as it does not assume that data measurements necessarily belong to a
normal distribution.
Table 6.6: Comparison between tournament and proportional selection with respect to
execution time in tabular and bar graph format. The figures given are the mean time
(and standard deviation) for an entire 10-fold cross validation experiment. The best time
for each task is shown in bold. According to the Mann-Whitney U test, the difference
between corresponding measurements produced under the two selection methods is
significant with a confidence level of 99% for all four mutation parameter settings.
Proportional Selection
(µadd ,µdel )     (40,40)        (50,30)         (60,20)       (70,10)
Mute (NS+S1)      3,459 (133)    4,036 (279)     4,413 (244)   7,351 (4,326)
Mute (NS+S2)      3,135 (133)    3,568 (167)     4,152 (152)   4,920 (336)
Bio               2,718 (59)     3,428 (1,151)   3,499 (323)   4,160 (459)
Traffic           4,114 (114)    4,490 (83)      4,545 (95)    4,513 (59)
[Bar charts of execution time (sec) under tournament and proportional selection for the
(µadd ,µdel ) settings 40−40, 50−30, 60−20 and 70−10; panels include the Biodegradability
and Traffic data sets.]
40% of the time and usually more depending on its numerosity value. Under
proportional selection, the fittest rule may be selected less frequently than
40%, depending on the fitness and number of other rules in the niche. Thus,
the mechanism does appear to be plausible.
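This argument can also be checked numerically with a small simulation; the
niche fitnesses below are invented and every rule is assumed to have a
numerosity of one.

import random

def tournament_pick(fitnesses, tau=0.4):
    k = max(1, round(tau * len(fitnesses)))             # tournament of 40% of the niche
    entrants = random.sample(range(len(fitnesses)), k)
    return max(entrants, key=lambda i: fitnesses[i])

def proportional_pick(fitnesses):
    return random.choices(range(len(fitnesses)), weights=fitnesses)[0]

fitnesses = [0.9, 0.5, 0.4, 0.3, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1]   # rule 0 is fittest
trials = 100_000
t_rate = sum(tournament_pick(fitnesses) == 0 for _ in range(trials)) / trials
p_rate = sum(proportional_pick(fitnesses) == 0 for _ in range(trials)) / trials
print(f"tournament:   {t_rate:.2f}")    # about 0.40: wins whenever it enters the tournament
print(f"proportional: {p_rate:.2f}")    # 0.9/2.9, about 0.31, for these fitnesses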
Under the above explanation, the large variance in the execution time ob-
served for the Mutagenesis and Biodegradability tasks under tournament
selection can be accounted for as being the consequence of a large varia-
tion in the matching times of highly fit rules for those tasks. Conversely,
the smaller variance observed for the Traffic data set—and also the closer
similarity between the execution times of the two selection methods—is ac-
counted for if there is only a small variance in the matching times of highly
fit rules for that task.
Conclusion
6.7 Summary
11 However, a version of Foxcs was implemented based on the modifications to Xcs
suggested by Bernadó-Mansilla and Garrell-Guiu (2003) for handling supervised learn-
ing tasks. Unfortunately, it performed disappointingly in comparison to the unmodified
version of Foxcs.
Chapter 7
Application to Relational
Reinforcement Learning
In the previous chapter, Foxcs was applied to ILP tasks in order to evaluate
and assess the influence of individual system components, particularly the
novel inductive mechanism. In this chapter, the focus moves to an assess-
ment of the overall integrity of the system. Whereas ILP tasks essentially
pose a single challenge (that of generalisation, the system’s ability to clas-
sify previously unexperienced data items), reinforcement learning tasks
usually pose a double challenge: not only must the system
generalise to previously unexperienced situations, it must also deal with “de-
layed reward”. An episode of a reinforcement learning task, in contrast to
that of a classification task, typically involves making a sequence of decisions
where the outcome of each decision not only affects the immediate reward
but also all subsequent rewards until the end of the episode. Thus, RRL
tasks challenge the credit assignment component of Foxcs in equal measure
This chapter is divided into two sections. In Section 7.1 Foxcs is bench-
marked on blocks world tasks in order to empirically demonstrate the effec-
tiveness of the system on RRL tasks. In these experiments the number of
blocks in the environment is constant. In Section 7.2 we address the prob-
lem of how to learn scalable policies in blocks world with Foxcs. Scalable
policies are independent of the number of blocks in the environment. Both
sections contain comparisons to existing RRL results and systems.
7.1 Experiments in Blocks World
The aim of these experiments is to demonstrate that Foxcs can achieve op-
timal or near-optimal behaviour on RRL tasks. Comparisons to previously
published results obtained for other RRL systems are made. Note, however,
that the number of results available for comparison is more limited for RRL
than ILP due to the field’s more recent development.
Materials
For these experiments, Foxcs was benchmarked on two bwn tasks, stack
and onab (Džeroski et al., 2001), under several values of n. The tasks were
introduced in Chapter 1 and are described again below for convenience.
stack The goal of this task is to arrange all the blocks into a single stack; the
order of the blocks in the stack is unimportant. An optimal policy for
this task is to move any clear block onto the highest block.
onab For this task two blocks are designated A and B respectively. The
goal is to put block A directly on top of block B.
The rule language employed by Foxcs for stack and onab contained the
following predicates:
• mv/2 Moves a given block onto another block. The arguments are the
block to move and the destination block respectively.
• ab/2 This predicate is only used for the onab task. It identifies the
two blocks A and B respectively.
• cl/1 The given block is clear. That is, there is no block on top of it.
• on/2 The arguments are two blocks. The first block is on the second.
• above/2 The arguments are two blocks belonging to the same stack.
The first block is above the second.
Table 7.1: The mode declarations for the blocks world tasks, stack and onab.
inequations( true ).
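To make the representation concrete, the sketch below encodes a bwn state
with the on/2 and cl/1 relations listed above and expresses the textbook
optimal stack policy (move a clear block onto the highest block) procedurally.
The dictionary-based encoding is an illustrative assumption, not the internal
representation used by Foxcs.

def height(block, on):
    h = 1
    while block in on:                      # `on` maps a block to the block beneath it
        block = on[block]
        h += 1
    return h

def optimal_stack_action(blocks, on):
    """Return a mv(Block, Destination) action for the stack task."""
    clear = [b for b in blocks if b not in on.values()]     # cl/1: nothing on top
    highest = max(blocks, key=lambda b: height(b, on))
    movable = [b for b in clear if b != highest]
    if not movable:
        return None                                         # already a single stack
    return ("mv", movable[0], highest)

blocks = ["a", "b", "c", "d"]
on = {"d": "a"}                 # bw4 state: d is on a; b and c sit on the floor
print(optimal_stack_action(blocks, on))     # ('mv', 'b', 'd')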
Methodology
number of steps and all results are averaged over the 10 separate runs.
Because of the potentially disruptive effect of freshly evolved rules, evolution
was switched off prior to the completion of training (see Section 6.1.2).
Unless otherwise noted, all results for the Foxcs system were obtained under
the following settings. The system parameters were: N = 1000, = 10%,
α = 0.1, β = 0.1, ε0 = 0.001, ν = 5, θga = 50, θsub = 100, θdel = 20, and
δ = 0.1. The mutation parameters were: µrep = 25, µadd = 50, µdel = 25,
and µi = 0 for all i ∈ {c2v, v2a, v2c, a2v}. Proportional selection was
used, the learning rate was annealed, and GA subsumption was used but
not action set subsumption.
Recall that under action set subsumption, an accurate and sufficiently ex-
perienced rule may subsume all other rules in an action set that are spe-
cialisations of it. For the blocks world tasks, we observed that action set
subsumption sometimes disrupted the population due to the subsumption
of genuinely accurate rules by rules which were only temporarily accurate.
This problem with action set subsumption has been previously reported by
Butz (2004), who also remedied it by switching off the mechanism.
Covering was not used to generate the initial population. Under covering,
the tasks are “too easy” in the sense that the rules produced by covering
are sufficient to form an optimal policy without any need for mutation.
Thus, rather than use covering, the rule base was given an initial population
instead. The initial population consisted of the following two rules:
These two rules were selected because between them they cover the entire
state-action space of blocks world. However, despite the use of an initial pop-
ulation, covering was not disabled: if either of the above two rules were to be
deleted from the population, it is possible that the state-action space would
no longer be completely covered. Thus, covering was permitted. However,
observation of training runs showed that it was invoked only very infrequently.
Figure 7.1: The performance of Foxcs on the stack task in bwn for n = 4, 5, 6, 7 blocks.
The graphs show the mean accuracy (solid line) and standard deviation (dotted line) over
ten runs. Accuracy is the percentage of episodes completed in the optimal number of
steps; it does not refer to κ. The point at which evolution was switched off is indicated
by the vertical dashed line.
Results
Figures 7.1 and 7.2 show the performance of Foxcs on the stack and onab
tasks respectively in bwn|n∈{4,5,6,7} . After evolution was switched off, the
performance of Foxcs on the stack task was optimal in bw4 and bw5 and
near-optimal in bw6 and bw7 . By near-optimal we mean that the final
accuracy is ≥ 98%. For onab it was optimal in bw4 and near-optimal in
bw5 but not for bw6 and bw7 .
Figure 7.2: The performance of Foxcs on the onab task in bwn for n = 4, 5, 6, 7 blocks.
We now list, in Table 7.2, the ten most numerous rules discovered by Foxcs
after one arbitrarily selected run on stack in bw4 . The rules are easy to
interpret diagrammatically, as shown in Figure 7.3. That the rules can be
examined conceptually like this demonstrates the comprehensibility aspect
of Foxcs’s first-order logic, rule-based approach. Of course, understanding
Table 7.2: The ten most numerous rules discovered by Foxcs after one arbitrarily
selected run on stack in bw4 . The second rule is illustrated in Figure 7.3.
Φ p ε F n
Comparison
The experiments in this section are most directly comparable to those re-
ported by Džeroski et al. (2001, section 6.3.1) for Q-rrl and by Driessens
et al. (2001, section 4.1) for Rrl-tg. In those experiments they addressed
the same tasks in bwn|n∈{3,4,5} . In terms of the accuracy level attained,
Foxcs performed at least as well as Q-rrl and Rrl-tg, except for the
onab task in bw5 , where Rrl-tg was able to achieve optimal performance.
Q-rrl and Rrl-tg, however, required significantly fewer training episodes
to converge. As a rough guide, the number of training episodes required by
Figure 7.3: All patterns of blocks in bw4 that match the rule mv(A, B) ←
cl(A), cl(B), on_fl(C), on_fl(B).
The discrepancy in training time may be partly accounted for by the follow-
ing two factors. First, Q-rrl trains particularly quickly because after every
episode it processes the entire history of states ever visited during all pre-
ceding episodes. Q-rrl thus processes more information per episode than
Rrl-tg and Foxcs. Second, learning classifier systems train more slowly
than other reinforcement learning systems due to the stochastic nature of
the evolutionary component.
Although the results in this section show that Foxcs can solve RRL tasks,
the performance of Foxcs was limited by the scale of the task. Scale, mea-
sured as n for bwn , negatively affects both the performance and execution
time.
Further Experiments
Recall from Section 1.1.2 that |S| > O(n!) for bwn . It is not surprising then
that as n increases, the performance of Foxcs deteriorates. In particular,
each additional block tends to decrease:
and to increase:
Table 7.3 shows the results of the experiment. From the table, it appears
that the execution times increase super-linearly with respect to the number
of blocks. Although the results are not as bad as a greater than O(n!)
growth in the state space might lead one to predict, there is nevertheless an
alternative approach that completely avoids any extra time cost at all as n
increases. The approach, presented in the following section, allows policies
to be learnt that are independent of n. Training is performed under small
values of n to minimise training times and the resulting policies scale up to
arbitrary values of n.
Table 7.3: Mean execution time (and standard deviation) over 10 runs of 40,000 episodes
on the stack and onab tasks in bwn for n = 4, 5, 6, 7.
[Bar charts of execution time (sec) against the number of blocks (4-7) for the stack and
onab tasks.]
Conclusion
The results in this section demonstrate that Foxcs was successfully able to
address RRL tasks, simultaneously solving the twin challenges of delayed
reward and generalisation. It achieved optimal or near-optimal performance
for both the stack and onab tasks in bwn environments containing up to
n = 7 and 5 blocks respectively. Unfortunately, as n, the number of blocks,
increases, the performance of Foxcs deteriorates and is thus, without mod-
ification, not a scalable approach. In Section 7.2 we address the problem of
how to learn scalable policies with Foxcs.
7.2 Scaling Up
Simple optimal policies exist for the stack and onab tasks that are indepen-
dent of the number of blocks and which can be expressed straightforwardly
using the rule language given in Table 7.1. Examples of these policies are
shown in Table 7.4.
Table 7.4: Optimal policies for stack and onab that are independent of the number of
blocks in the environment.
stack
onab
If the policies can be learnt in bwn under a small
value of n, they can be transferred to bwn for arbitrary values of n, thus
providing the system with a way to scale up to large environments. Unfortu-
nately, these scalable policies cannot be learnt by Foxcs under a standard
approach. This is because there is variance in the Q values of the state-
action pairs matching any particular rule in the policy. The rules therefore
have non-zero error, ε, and thus poor fitness, F .
Figure 7.4: Three (s, a) pairs from bw5 which match the rule mv(A, B) ←
cl(A), highest(B) and their associated Q(s, a) value for the stack task under an optimal
policy. The Q values were calculated under a reward of −1 per step and γ = 0.9.
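A small worked calculation makes the point concrete: with the reward of −1
per step and γ = 0.9 used in Figure 7.4, optimal state-action pairs that lie
one, two, and three steps from the goal receive different Q values, so a
single rule covering all of them cannot have zero error.

gamma = 0.9

def q_value(steps_to_goal):
    """Discounted return of an optimal action when the goal is `steps_to_goal` steps away."""
    return sum(-1 * gamma**k for k in range(steps_to_goal))

for steps in (1, 2, 3):
    print(steps, round(q_value(steps), 3))      # -1.0, -1.9, -2.71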
7.2.1 P-Learning
Q(s, a) is not uniform over all (s, a) pairs matching the rule. It is thus not
a valid generalisation over Q. In the Foxcs system, for instance, the rule
has non-zero error, ε, and thus poor fitness, F . However, because the rule
is optimal, P (s, a) = 1 over all (s, a) pairs matching it. Hence, the rule is
a valid generalisation over P . In general, any rule is a valid generalisation
over the P function if all (s, a) pairs that match it satisfy a = π ∗ (s).
Parameter Updates
formed. Then, all members of [M]P are updated as follows. All updates
remain unchanged except for the prediction, p, and error, ε, parameters.
The prediction, p, and error, ε, are updated as given by equations (4.3) and
(4.5), except that P (s, a) estimates are used in the updates instead of the
target ρ defined in equation (4.4). Let s ∈ S be the current state. For each
rule j ∈ [M]P , the prediction parameter, pj , and the error parameter, εj ,
2 Previous implementations of P-Learning have been described in Section 3.2.3.
where P (a) is the system prediction for action a (see equation (4.2)) and a
tolerance is included to allow for error in P (u) and P (a). In all experiments
involving P-Learning in this chapter the tolerance was set to 0.05.
Errors in P (u) and P (a) essentially act as a source of noise for P-Learning.
For this reason, we found that the best results were obtained when using, in
addition to the tolerance, two separate learning rates: one, αQ , for standard
updates and another, αP , for the P-Learning updates, (7.2) and (7.3). We
set αQ = 0.1 and αP = 0.002. The αP value is very small to compensate for
the noisy signal; no annealing was performed on αP .
Note that for each time step t apart from the terminal step, the rules in [A]Q wait
until time step t + 1 before being updated because ρ depends on rules that
match state st+1 . In contrast, the rules in [M]P are updated immediately
because P (s, a) depends only on rules matching the current state.
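The update just described can be summarised for a single rule in [M]P as in
the sketch below. The rule is reduced to a dictionary, the test used to decide
whether an action looks optimal is an assumption about the form of
equation (7.4), and the constants follow the text (αP = 0.002, tolerance 0.05).

ALPHA_P = 0.002
TOLERANCE = 0.05

def p_target(action, system_prediction):
    """system_prediction maps each action to the prediction P(a) from the Q population."""
    best = max(system_prediction.values())
    return 1.0 if system_prediction[action] >= best - TOLERANCE else 0.0

def p_learning_update(rule, target):
    """Update the prediction p and error epsilon of one rule in the P population."""
    rule["epsilon"] += ALPHA_P * (abs(target - rule["p"]) - rule["epsilon"])
    rule["p"] += ALPHA_P * (target - rule["p"])

rule = {"p": 0.5, "epsilon": 0.2}
target = p_target("mv_a_b", {"mv_a_b": -1.0, "mv_b_a": -1.9})   # mv_a_b looks optimal
p_learning_update(rule, target)
print(rule)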
Rule Discovery
the following experiments, however, the languages for [P]Q and [P]P were
identical, with one exception, described later.
7.2.3 Experiments
The aim of this experiment is to demonstrate that Foxcs can learn optimal
policies that scale to arbitrary sized environments using the P-Learning
extension. The blocks world benchmarking tasks, stack and onab, were
again used.
Method
First, the behavioural policy of the system depended on whether the sys-
tem was performing a training or evaluation episode. For training episodes,
the policy was determined by the primary population, [P]Q , as usual. Dur-
ing evaluation, however, the behavioural policy was determined by the sec-
ondary population, [P]P , containing the rules learnt through P-Learning.
As usual, all learning was suspended when the system was performing an
evaluation episode.
Second, during training the number of blocks was varied from episode to
episode, encouraging the P-Learning component to learn policies that were
independent of the number of blocks.3 The initial state of each episode
was randomly selected from bwn , where n was randomly selected from
3 In preliminary experiments, the number of blocks, n, was kept constant through-
out training. This scheme resulted in P-Learning policies that were optimal with respect
to bwn only. However, by varying n from episode to episode, any rule j ∈ [P]P that was
not optimal in bwn over all values of n tested suffered from poor fitness, Fj .
[3, 5]. Because episodes contain different numbers of blocks, the predicate
num blocks/1 was added to the state signal and rule language for [P]Q (but
not to the rule language for [P]P ) so that rules can explicitly limit them-
selves to states containing a particular number of blocks. The argument of
num blocks is an integer specifying the appropriate number of blocks.
Finally, during evaluation the number of blocks was also varied from episode
to episode. Before training, 100 test episodes were generated. The initial
state of each test episode was randomly selected from bwn where n was
randomly selected from [3, 16]. The system was thus tested on much larger
blocks world environments than it was trained on. The P-Learning com-
ponent was, however, not activated or evaluated until after 20,000 train-
ing episodes had elapsed. This delay allowed time for [P]Q to evolve so
that the system prediction, P (a), would be relatively stable by the time
P-Learning commenced.4 After this point, both populations, [P]Q and [P]P ,
were updated and evolved in parallel. Updates continued until the termina-
tion of training; evolution ceased for both populations after 45,000 training
episodes.
Results
Figure 7.5 shows the results of the experiments with the P-Learning version
of Foxcs on the tasks stack and onab. By the end of training, the system
was able to act near-optimally on both tasks. For onab the mean accuracy
after evolution was switched off was ≥ 98.6%. Recall that most evaluation
episodes occurred in worlds containing many more blocks than were present
in training episodes. The results thus demonstrate that Foxcs had learnt
genuinely scalable policies, which could handle blocks worlds of effectively
arbitrary size.
The graphs reveal that the system’s performance on onab can be divided
4 The dependence of P-Learning on the system prediction, P (a), is given in equa-
tion (7.4).
[Panels: Stack (bw3 to bw16) and Onab (bw3 to bw16); accuracy and total extra steps
against training episode.]
Figure 7.5: The performance of Foxcs under P-Learning on the stack and onab tasks.
Graphs (a) and (b) give performance in terms of accuracy, the percentage of episodes
completed in the optimal number of steps. Graphs (c) and (d) give performance in terms
of the extra number of steps, totaled over all 100 evaluation episodes, taken to reach the
goal compared to an optimal policy. Each plot is averaged over ten runs. The dashed line
indicates the point at which evolution was switched off.
into two phases. First there is a phase during which performance rapidly
improves, accounting for most of the learning. This is followed by a sec-
ond, much longer phase during which performance improves more slowly.
Inspection of execution traces showed that although the system had dis-
covered optimal [P]P rules, some of them nevertheless had a non-zero error
estimate, ε. The error originated from occasional inaccuracy in the system
prediction, P (a), computed from the [P]Q rules, and slowed down the iden-
tification of optimal [P]P rules. This sensitivity to error in P (a) is remedied
by the introduction of the tolerance and the learning rate, αP , described
in Section 7.2.2.
Comparison
The problem of learning scalable policies for stack and onab has been previ-
ously addressed by Džeroski et al. (2001) and Driessens et al. (2001) using
the P-Learning technique. Also, Driessens and Džeroski (2005) have bench-
marked several different RRL systems, Rrl-tg, Rrl-rib, Rrl-kbr, and
Trendi,5 on stack and onab where the number of blocks was allowed to
vary.6 Although the experiments in Driessens and Džeroski (2005) do not
specifically address the issue of learning scalable policies, to perform well
the systems must learn policies that translate to environments containing a
different number of blocks than those experienced under training. In par-
ticular, the number of blocks was varied between 3 and 5 during training
and the resulting policies were evaluated in blocks world environments con-
taining 3–10 blocks. In addition, the systems were allowed to learn from
samples of optimal trajectories in bw10 , so testing did not explicitly show
that the policies learnt scaled to environments containing more blocks than
experienced during training. Nevertheless, the ability to translate policies to
5 The systems have been described previously in Chapter 3.
6 Earlier results for some of these systems have been published by Driessens and Ramon
(2003) and Gärtner et al. (2003a). However, the comparison in Driessens and Džeroski
(2005) is the most recent and complete (in terms of the number of systems tested).
Table 7.5: The accuracy of scalable policies for the stack and onab tasks learnt by
previous RRL systems. Also given are the number of training episodes taken. Rrl-tg†
used P-Learning; Rrl-tg did not. Source for P-rrl is Džeroski et al. (2001); for
Rrl-tg†, Driessens et al. (2001); and for all other comparison systems, Driessens and
Džeroski (2005).
Table 7.5 summarises the results obtained by Džeroski et al. (2001), Driessens
et al. (2001) and Driessens and Džeroski (2005). Note that results were orig-
inally presented graphically, therefore all accuracy values less than 100% are
approximate and should thus be considered as indicative rather than abso-
lute. As can be observed in the table, Foxcs achieved accuracy results as
good as or better than these previous systems. Also shown in the table are
the number of episodes used to train the systems. Unfortunately for Foxcs,
all other systems except for Rrl-tg† trained on significantly fewer episodes.
ble 7.4) without the use of P-Learning it will not do so.7 Unlike Rrl-tg,
Rrl-rib and Rrl-kbr are unable to even express scalable policies because
they do not support variables and do not form appropriate abstract expres-
sions which can represent the policies. The Trendi system, as a hybrid of
Rrl-rib and Rrl-tg, would also have problems expressing scalable policies.
Thus, of all these systems, only Rrl-tg has the potential to learn policies
that are genuinely independent of the number of blocks.
Conclusion
policies for both of the evaluation tasks; in fact, most could not learn scalable
policies for either of the tasks.
7.3 Summary
Chapter 8
Conclusion
8.1 Summary
Advantages arising from the approach taken by Foxcs are listed below.
First, unlike some other RRL approaches, no restrictions are placed on the
MDP framework. Second, the system is model-free; it does not need a model
of the environment’s dynamics but instead gathers information through in-
teracting with the environment. This feature is of benefit when, as is often
the case for real-world tasks, the dynamics of the environment are unknown.
Third, rules are automatically generated and refined; candidate rules do not
have to be supplied by the user; however, if known, they can be used to
initialise the system. Fourth, domain specific bias is provided by the defini-
tion of the rule language; for most RRL environments, it is perhaps easier
to define an appropriate rule language than to provide other forms of bias
used in RRL systems, such as a distance metric or kernel function. Finally,
it produces rules that are comprehensible to humans.
significantly improve system efficiency. For ILP tasks, the use of annealing
was found to improve the system’s predictive accuracy compared to using a
fixed learning rate; we believe that this result should extend to Xcs systems
in general when addressing classification tasks. Finally, the choice of selec-
tion method was found to lead to a performance–efficiency trade-off, with
the best predictive accuracies occurring under tournament selection, while
the most efficient execution times occurred under proportional selection.
The main limitation of the approach regards efficiency. There are two as-
pects to efficiency: the number of training steps the system requires to reach
a certain performance level and the computational efficiency per step. Com-
putational efficiency is largely determined by the efficiency of the matching
operation, as matching is run on all members of [P] on each time step.
Matching rules is cheap in Xcs, but in Foxcs it is a more expensive oper-
ation. In order to improve the computational efficiency of Foxcs, a cache
was associated with each rule, storing the results of matching. This tech-
nique works well for ILP tasks, which usually involve only a few hundred
data items; for RRL tasks however, |S| is typically too large for caching to
be as effective.
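The caching technique amounts to memoising the matching test on a per-rule
basis, along the lines of the sketch below; this is an illustration rather than
the actual implementation, and the state key is assumed to be any hashable
identifier for a state, such as a frozenset of its ground facts.

class CachedRule:
    def __init__(self, clause, match_fn):
        self.clause = clause
        self._match_fn = match_fn           # the (expensive) first-order matching test
        self._cache = {}

    def matches(self, state_key, state):
        """Memoise the result of matching this rule against each distinct state."""
        if state_key not in self._cache:
            self._cache[state_key] = self._match_fn(self.clause, state)
        return self._cache[state_key]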
1 Here we are not distinguishing between Džeroski et al. (2001) and Driessens et al.
(2001) as the latter’s system is a refinement of the former’s.
With respect to training times, it was observed that Foxcs requires more
training than other RRL methods. This represents a negative aspect of
adopting the LCS approach: Foxcs must evaluate a large number of stochas-
tically generated rules, many of which will be sub-optimal. However, re-
search into Xcs is ongoing and recent developments have improved the Xcs
framework; Foxcs will benefit from many of these developments, including
any that reduce the amount of training required by Xcs and its derivatives.
A lesser drawback regards parameter tuning. The Xcs system has a large
number of free parameters and Foxcs introduces several more (in partic-
ular, the mutation parameters, µi ). Although guidelines for setting the
parameters of Xcs exist (see Butz and Wilson, 2002, for example), often
the best results require some fine-tuning. Parameter fine-tuning can be a
time-consuming process, particularly when tuning multiple parameters that
have interacting effects.
8.2 Contributions
Chapter 4
• Gave a scheme for annealing the learning rate that is general to the
Xcs framework (Section 4.1.4).
Chapter 5
Chapter 6
Chapter 7
8.3 Significance
For stack and onab, the blocks world tasks considered in Section 7.2, systems
that can directly express the optimal policies shown in Table 7.4 have the
advantage, because those policies capture the salient regularity precisely and
independently of the number of blocks, and thus scale up indefinitely. In
contrast, the comparison systems that generalised using distance metrics
and kernel functions could not represent these or equivalent policies and
therefore required training in the largest test environment, bw10 .
Having to perform training in bw10 is a severe problem due to the size of
the state space; hence, these systems, unable to utilise the typical trial-and-
error approaches in reinforcement learning, resorted to learning from expert
traces exemplifying optimal behaviour. Learning a scalable policy, on the
other hand, neatly avoids the problem altogether.
Both Foxcs and Rrl-tg generalise using the abstractive qualities of first-
order logic. However, when combined with P-Learning, Rrl-tg was only
able to demonstrate that it could learn a scalable policy for stack and not
for the more difficult of the two evaluation tasks, onab. An explanation most
likely lies in Rrl-tg’s use of a regression tree to represent the value func-
tion, since there is reason to believe that rules, which are used by Foxcs, are
better matched for the needs of RRL than trees. Because they are modular,
changing one rule does not directly change another; indeed, this indepen-
dence is essential to the effective functioning of the evolutionary process
within Foxcs and other LCS systems. On the other hand, changing the
internal node of a first-order logic tree can directly affect descendents of
that node, which has negative consequences for incrementally grown trees.
In the initial stages of training, the value function is typically poorly es-
timated, and the tree that is induced will therefore contain poor splitting
decisions. Subsequent training incrementally grows the tree, refining it, but
not correcting any of the errors introduced higher up the tree. The result, as
noted by Driessens and Džeroski (2004), is that the performance of Rrl-tg
is sensitive to the order in which states are sampled, which may explain why
it was outperformed by Foxcs in Section 7.2 (see Table 7.5). Ideally, non-
Detractors of the LCS approach employed by Foxcs may point to its length-
ier training times. However, the cost of training becomes irrelevant when
scaling up: an acceptable training time can be straightforwardly produced
by selecting a small environment to train in. Furthermore, research into
Xcs is vigorous and ongoing and any discoveries that improve the efficiency
of Xcs would be expected to carry over to Foxcs.
The key issue is whether other task environments, particularly for real-world
problems, exhibit regularity that, like the blocks world tasks addressed in
this thesis, can be most effectively expressed with first-order (or higher-
order) logic using the abstractive property of variables. There is promise
here, as one of the guiding motivations in the historical development of first-
order logic, including its support for variables, has been to create a formal
language capable of expressing generalisations about the world around us in
a concise and convenient format. However, the issue remains open, providing
an important agenda for both the ILP and RRL communities to address in
future years.
3
It is possible to restructure incrementally grown, attribute-value trees so that the tree
produced is independent of the order in which training examples are presented (Utgoff
et al., 1997). However, doing so relies on an independence assumption: that the meaning
of a test associated with a particular node is independent of the node’s position in the
tree. Unfortunately, this assumption is not satisfied by first-order logic trees because, due
to the presence of variables, the meaning of a test at a particular node can depend on the
tests in its ancestors.
8.4 Future Work
and De Raedt, 1997; Divina et al., 2003). However, this approach limits the
task framework; it assumes that the state space can be preprocessed and is
thus more suitable for ILP tasks than RRL tasks. Another approach would be to introduce interval predicates into the rule language, similar to the approach taken by Wilson (2000, 2001a,b), Stone and Bull (2003) and Dam et al. (2005). Under this approach the interval predicates are generated through
covering and are subsequently modified by mutation operations that adjust
the parameters defining the interval. This approach has the advantage that
it preserves the system’s online, incremental approach and is therefore more
suitable for reinforcement learning tasks.
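As an illustration only, an interval predicate of this kind, and a mutation that perturbs its bounds, might be sketched in Prolog as follows (the predicate and helper names are hypothetical and not part of Foxcs; the sketch assumes SWI-Prolog's library(random)):

:- use_module(library(random)).

% in_range(X, Lo, Hi): a hypothetical interval predicate; a rule condition could
% use it to constrain a numeric attribute X to the interval [Lo, Hi].
in_range(X, Lo, Hi) :- X >= Lo, X =< Hi.

% mutate_interval(+Lo0, +Hi0, +Step, -Lo, -Hi): a hypothetical mutation that
% shifts each bound by a random amount in [-Step, Step] and keeps Lo =< Hi.
mutate_interval(Lo0, Hi0, Step, Lo, Hi) :-
    random(R1), random(R2),
    Lo1 is Lo0 + (2*R1 - 1)*Step,
    Hi1 is Hi0 + (2*R2 - 1)*Step,
    Lo is min(Lo1, Hi1),
    Hi is max(Lo1, Hi1).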
Appendix A
First-Order Logic
The purpose of this appendix is to provide the reader with definitions and concepts from first-order logic (also commonly known as predicate calculus) that are relevant to the thesis. We begin by presenting the syntax of first-order logic, the rules that determine which expressions are well formed. Next we consider semantics, the basis
for determining if a syntactically correct expression is true or false. Then
a concept with significant and practical application to logic programming,
Herbrand interpretations, is introduced. Inference, which examines when a
set of premises justifies a conclusion, and induction, a key form of inference
that forms the basis of making generalisations using logic, follow. Finally, we
describe substitutions and inverse substitutions, concepts that are used by
the mutation and covering operations of Foxcs. Note that this appendix is
by necessity introductory; for a more expanded treatment of first-order logic
in the context of artificial intelligence and logic programming the reader is
referred to (Genesereth and Nilsson, 1987; Lloyd, 1987).
A.1 Syntax
Expressions over first-order logic are strings of characters that are arranged
according to a formal syntax. The syntax determines which expressions are
legal expressions over the logic and which are not. The syntax is based on
an alphabet, defined as follows:
1. A set of constants, C.
2. A set of function symbols, F.
3. A set of predicate symbols, P.
4. A set of variables, V.
5. The logical connectives: ∧, ∨, ¬, → and ↔.
6. The quantifiers: ∀ and ∃.
7. Punctuation symbols: parentheses and the comma.
The usual assumptions are that F and P are finite and that C and V are countable: there exist two sequences, c1, c2, . . . and v1, v2, . . ., which enumerate all the elements of C and V respectively, possibly with repetitions.
Also, conventionally, the sets C, F, P and V contain strings of alpha-numeric
characters.1
1
The convention in Prolog, followed throughout most of this thesis, is that variables
begin with an upper case letter, while constants and function and predicate symbols begin
with a lower case letter. However, in this chapter only, variables are represented by lower
case letters.
Note that while the sets C, F, P and V can vary from alphabet to alphabet,
the remaining symbols are always the same. For this reason, it is convenient
to drop the logical connectives, quantifiers and punctuation symbols when
specifying an alphabet, their existence being implicitly assumed. Also, al-
phabets that vary only in the symbols that constitute V produce languages
that are semantically equivalent; thus, variable symbols are generally gener-
ated (automatically) when needed rather than explicitly specified as part of
the alphabet. Therefore, throughout this thesis we specify an alphabet for
first-order logic as a tuple ⟨C, F, P, d⟩.
A clause is written in clausal notation as

φ1 , . . . , φn ← ψ1 , . . . , ψm

where φi and ψi are atoms and x1 , . . . , xk are all the variables occurring in
the clause. This notation is derived from the following equivalence:

φ1 , . . . , φn ← ψ1 , . . . , ψm ≡ ∀x1 · · · ∀xk (φ1 ∨ · · · ∨ φn ∨ ¬ψ1 ∨ · · · ∨ ¬ψm)    (A.3)

The left hand side of (A.3) is obtained from the right hand side of (A.3) by making the universal quantifiers, parentheses, and negation symbols implicit and replacing the disjunction and conjunction symbols with commas.
A definite clause is a clause of the form

φ ← ψ1 , . . . , ψm .
The head (or consequent) of the clause is φ while the body (or antecedent)
is ψ1 , . . . , ψm .
From the definition it can be seen that a definite clause is a clause that
contains exactly one positive literal (a positive literal is an atom, a negative
literal is a negated atom): its head. The body of a definite clause may be
empty, in which case the clause is known as a fact. Unlike general clausal
form, definite clauses do restrict the expressiveness of first-order logic; nevertheless, they largely form the basis for logic programming because they are still quite expressive and they can be processed more efficiently than general clauses.
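For instance, using Prolog syntax (in which ':-' plays the role of '←'), a fact and a definite clause with a non-empty body look as follows; the predicates are illustrative only:

% a fact: a definite clause with an empty body
parent(ann, bill).

% a definite clause whose head is grandparent(X, Z) and whose body is the
% conjunction parent(X, Y), parent(Y, Z)
grandparent(X, Z) :- parent(X, Y), parent(Y, Z).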
A.2 Semantics
Just as it is the role of the constants and variables to refer to the elements
of the domain, the function and predicate symbols are likewise used to rep-
resent functions and relations over the domain respectively. A function over
a domain maps elements or groups of elements over the domain to other
elements. A relation, on the other hand, effectively groups elements into
tuples; note that singleton relations, which group the elements in tuples of
size 1, are not only permissible but quite frequent. Usually only a small
subset of the functions and relations over the domain are of interest; each
of these will be assigned a function or relation symbol while the rest are
ignored.
Here, |I|^n denotes the set of all n-tuples over |I|; for example, |I|^3 = |I| × |I| × |I|.
Variables are treated separately from the constant, function and predicate
symbols.
(a) ⊢I (∃x Φ)[V ] if and only if there exists ι ∈ |I| such that ⊢I Φ[U ],
where U (x) = ι and U (y) = V (y) for y ≠ x.
(b) ⊢I (∀x Φ)[V ] if and only if for all ι ∈ |I| it is the case that ⊢I Φ[U ],
where U (x) = ι and U (y) = V (y) for y ≠ x.
Note that the problem is alleviated slightly for some sentences. In particular, a variable assignment has no effect on the satisfaction of a sentence that contains no free variables, that is, ground and closed sentences. If a sentence is ground then it contains no variables, and consequently no appeal is made to the variable assignment in order to determine satisfaction. Alternatively, if a sentence is closed then each variable occurs within the scope of a quantifier; thus, it is the quantifier rules, rather than the variable assignment, that determine how each variable is bound.
In this thesis we are only concerned with closed sentences; the definition
of a model of a closed sentence can be simplified since its satisfaction or otherwise is independent of any variable assignment.
That is, the Herbrand universe over a given alphabet consists of all the
ground terms—terms that do not contain variables—over the alphabet.
The definition states that the Herbrand base over a given alphabet consists
of the set of all atomic sentences over the alphabet such that each argument
is a ground term over the alphabet; more simply put, it is the set of ground
atomic sentences over the alphabet.
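For example (an illustrative alphabet, not one used elsewhere in the thesis), consider an alphabet with constants C = {a, b}, no function symbols, and a single binary predicate symbol on. Then the Herbrand universe is {a, b} and the Herbrand base is {on(a, a), on(a, b), on(b, a), on(b, b)}. Adding even one function symbol, say a unary f, makes the Herbrand universe infinite: a, b, f(a), f(b), f(f(a)), and so on.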
The following proposition shows that checking for the existence of a model
can be performed by considering the Herbrand interpretations only.
For a proof see Nienhuys-Cheng and de Wolf (1997) or Lloyd (1987). Note
the mild restriction that Γ be a set of clauses; that is, Γ cannot contain
non-clausal sentences.
A.4 Inference
a derivation that ensures that the premises entail the conclusion. More recently, inference algorithms have been developed that systematise this process. There are three main types of inference algorithm: forward chaining, backward chaining (which forms the basis for logic programming environments, including Prolog) and resolution.3
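As a small illustration of backward chaining (the program below is illustrative and not drawn from the thesis), a Prolog system answers a query by matching it against clause heads and recursively attempting to prove the resulting subgoals:

parent(ann, bill).
parent(bill, carl).

ancestor(X, Y) :- parent(X, Y).
ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).

% The query ?- ancestor(ann, carl). is answered by backward chaining: the first
% ancestor clause fails because parent(ann, carl) is not a fact, so the second
% clause is tried; parent(ann, Z) binds Z = bill, and ancestor(bill, carl) then
% succeeds via the first clause and the fact parent(bill, carl).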
A.5 Induction
If every crow you had ever seen had been black then you might decide that
all crows are black. This is an example of a kind of inference called induction,
which involves the derivation of a general law from particular instances. In-
duction contrasts with deduction, which refers to the derivation of specific
conclusions by reference to a general law or principle; modus ponens and
modus tollens are examples of deductive reasoning. Both forms of reasoning
are very significant: many important mathematical theorems are justified
using deductive reasoning, while induction can be seen as the foundation of
3
An overview of these three types of algorithms is provided by Russell and Norvig
(2003).
scientific enquiry.
Γ ∪ ∆ ⊭ ¬Φ
Γ ∪ Φ |= ∆
the hypothesis; in the case of the swan hypothesis above, the colour of every
swan in existence would need to be known.
One especially noteworthy point is that for any background theory and data set there is generally a great multiplicity of inductive hypotheses. Is it
possible to discriminate between them for the purpose of preferring one over
another? And is there any justifiable basis for doing so? We shall briefly
indicate some approaches to the first issue; however, the second issue moves
into philosophical territory, placing it beyond the scope of this thesis.
The use of language bias is another way to exclude or order potential hy-
potheses. One type of language bias, conceptual bias, results from the use
of a particular alphabet, which excludes the possibility of hypotheses that
cannot be expressed as sentences over that alphabet. Another type, logical
bias, restricts hypotheses to a particular logical structure, such as definite clauses (a bias present in Foxcs). Note that while it is difficult
to formally justify specific biases, in practice it is impossible to invent in-
ductive methods, logical, statistical, or otherwise, that are completely bias
free.4
4
Mitchell (1997) provides a discussion of bias in the field of machine learning generally.
A.6 Substitution
A substitution θ is a finite set of bindings of the form {x1/τ1 , . . . , xn/τn }, where each xi is a variable and each τi is a term, such that:
1. Each variable is associated with one term only; that is, xi ≠ xj for all i ≠ j.
2. None of the variables xi may occur within any of the associated terms τi.
The terms associated with the variables are frequently called the bindings
for those variables. When a substitution θ is applied to a clause Φ, all
bound variables occurring in Φ are replaced by their bindings. Any variables
without bindings are left unchanged. The convention for denoting the clause
resulting from a substitution is, somewhat unusually, Φθ (that is, functional
notation is not employed). For example, let Φ = P (x) ∨ Q(x) ∨ R(y) and
θ = {x/bill }, then:
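In a logic programming system a substitution can be applied by systematically replacing the bound variables of a term; the following sketch (hypothetical helper predicates, assuming SWI-Prolog) applies a list of bindings of the form Var/Term:

% apply_subst(+Bindings, +Term0, -Term): apply each binding Var/Value in turn.
apply_subst([], Term, Term).
apply_subst([Var/Value|Rest], Term0, Term) :-
    replace_var(Var, Value, Term0, Term1),
    apply_subst(Rest, Term1, Term).

% replace_var(+Var, +Value, +Term0, -Term): replace occurrences of Var by Value.
replace_var(Var, Value, Term0, Value) :- Term0 == Var, !.
replace_var(_, _, Term0, Term0) :- var(Term0), !.      % other variables unchanged
replace_var(_, _, Term0, Term0) :- atomic(Term0), !.   % constants unchanged
replace_var(Var, Value, Term0, Term) :-
    Term0 =.. [F|Args0],
    maplist(replace_var(Var, Value), Args0, Args),
    Term =.. [F|Args].

% e.g. ?- apply_subst([X/bill], p(X, q(X, Y)), T).   gives   T = p(bill, q(bill, Y)).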
Example 13 Let ψ = P (bill, bob, F (bill), ben). Then the places of bill in ψ are ⟨1⟩ and ⟨3, 1⟩.
When the inverse substitution θ⁻¹ is applied to the clause Φ, denoted Φθ⁻¹, each term τi is replaced at the places pi,1 , . . . , pi,mi in Φ by the variable xi. It follows that Φθθ⁻¹ = Φ. Also note that just as a substitution cannot
make a clause more general, an inverse substitution cannot make a clause
less general. Thus, an inverse substitution can be applied to “generalise” a
clause.
Appendix B
Inductive Logic
Programming
Recall that along with the credit assignment problem, the generalisation
problem is one of the two key challenges that a reinforcement learning al-
gorithm must address. One way of addressing the generalisation problem
is through inductive inference, the central concern of ILP, which leads to
the subfield of RRL. This appendix presents two of the most important
formalisations of induction for ILP; between them, they describe the ma-
jority of all ILP systems. However, these two formalisations are not, as we
shall see, equally useful for the purposes of reinforcement learning, which
places special requirements on an inductive algorithm. The two problem
settings are learning from entailment and learning from interpretations, also known as the normal setting and the non-monotonic setting, respectively.
Before presenting these two formalisms, it is worth briefly noting some his-
tory of ILP. The name “Inductive Logic Programming” was coined by Mug-
gleton (1991) who defined it as the intersection between machine learning
and logic programming (today, such a definition would include RRL but ILP
by convention refers specifically to supervised and unsupervised learning sys-
tems). However, the origins of the field of ILP lie much earlier, in the
work of Banerji (1964), who used first-order logic to represent hypotheses
learnt under a paradigm called concept learning. Subsequently, ILP was
greatly influenced by two key dissertations: that of Plotkin (1971), who
was the first to formalise induction using clausal logic; and that of Shapiro
(1983), who introduced the notion of refinement operators.2 An influential
system, Foil, perhaps the first to take the approach of upgrading an existing
machine learning method, in this case learning rule sets, to first-order logic,
appeared in Quinlan (1990). Around this time ILP became established as a separate subfield of machine learning, in part due to the field's christening (Muggleton, 1991) as well as to the identification of a thorough
research agenda (Muggleton and De Raedt, 1994). Since then, important
contributions include, amongst others, a comprehensive description of the
theoretical foundations of ILP (Nienhuys-Cheng and de Wolf, 1997) and the
application of ILP to data mining (Džeroski and Lavrač, 2001).
It is also worth mentioning the benefits that arise from incorporating first-
order logic into machine learning. These benefits include:
1
Descriptions of these settings, and others, can be found in (Muggleton and De Raedt,
1994; De Raedt and Džeroski, 1994; Wrobel and Džeroski, 1995; Nienhuys-Cheng and
de Wolf, 1997; De Raedt, 1997).
2
Actually, Shapiro first introduced his refinement operators in (Shapiro, 1981) and then
subsequently incorporated them into his PhD thesis.
We now describe the learning from entailment and learning from interpretations settings in turn.
B.1 Learning from Entailment
This setting is the more general of the two; it applies to concept learning,
classification, regression and program synthesis, and also allows recursive
definitions to be learnt. Concept learning, perhaps the simplest of these
scenarios, is convenient for illustrating learning from entailment. The idea
behind concept learning is to induce a definition of a general “concept” from
specific data. Under this setting the “concept” is a relation represented by
a predicate, say φ, the definition takes the form of definite clauses, and the
data are typically facts over φ.3 The data is divided into two: a set of
positive examples and a set of negative examples; each positive (negative)
example is a fact that is consistent (inconsistent) with the concept to be
induced. Background knowledge is also provided, usually in the form of
facts or definite clauses, which contains other data necessary or helpful for
inducing a definition for φ. More formally, concept learning under learning
from entailment is defined as follows:
Given:
• A background theory B,
• A set of positive examples E+ and a set of negative examples E−,
find a hypothesis Φ (a set of definite clauses) such that:
1. ∀e ∈ E+ : Φ ∧ B |= e,
2. ∀e ∈ E− : Φ ∧ B ⊭ e.
Example 15 Suppose the background theory is:
B = { parent(ane, bill) ←,  parent(bill, carl) ←,  parent(uma, vera) ←,  parent(vera, walt) ← }
A hypothesis Φ is then sought which, in conjunction with B, implies all the positive examples but none of the negative examples.
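In practice the tests Φ ∧ B |= e and Φ ∧ B ⊭ e are commonly approximated by loading the background theory and hypothesis into a Prolog engine and posing each example as a query. A minimal sketch of such a coverage test (assuming SWI-Prolog; the predicate names are hypothetical):

% covers(+Clauses, +Example): Example is provable once Clauses (the hypothesis
% together with the background theory) have been asserted.
covers(Clauses, Example) :-
    setup_call_cleanup(maplist(assertz, Clauses),
                       \+ \+ call(Example),
                       maplist(retract, Clauses)).

% acceptable(+Clauses, +Pos, +Neg): every positive example is covered and no
% negative example is covered.
acceptable(Clauses, Pos, Neg) :-
    forall(member(E, Pos), covers(Clauses, E)),
    forall(member(E, Neg), \+ covers(Clauses, E)).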
Unfortunately, ILP under the learning from entailment setting is not PAC-
learnable with polynomial time complexity (Džeroski et al., 1992; Cohen,
1995; Cohen and Jr., 1995). The next setting, however, does come with
tractable PAC-learnability results.
B.2 Learning from Interpretations
Under the learning from interpretations setting, the induced concept defi-
nition is no longer required to imply the positive data items; each positive
data item should, instead, satisfy the induced definition. Continuing with
concept learning, we again wish to induce a clausal definition for predicate φ
from positive and negative data and background knowledge. Each data item
under this setting is an Herbrand interpretation (recall from Section A.3 that
an Herbrand interpretation is a set of facts), which contrasts with learning
from entailment, where typically each data item is a fact; this difference has
positive implications for efficiency. Background knowledge is incorporated
by extending the interpretation into the minimal Herbrand model of the
data item and the background knowledge.
Example 16 Consider the following facts, which describe the blocks world
state shown in Figure B.1: e = {cl(a), on(a, b), on_fl(b), cl(c), on(c, d), on(d, e), on_fl(e)}; and the background theory, B:
above(X, Y ) ← on(X, Y )
above(X, Y ) ← on(X, Z), above(Z, Y )
Then M(e ∧ B) = {cl(a), on(a, b), on_fl(b), cl(c), on(c, d), on(d, e), on_fl(e), above(a, b), above(c, d), above(d, e), above(c, e)}, which includes all the facts from e in addition to those that can be derived from the background theory together with e, i.e. the facts describing which blocks are above which.
Figure B.1: The blocks world state described by e in Example 16: block a is on block b, and block c is on block d, which is on block e.
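The derived part of M(e ∧ B) in Example 16 can be reproduced in a Prolog system by asserting the facts of e alongside the clauses of B and collecting the provable above/2 facts (a sketch assuming SWI-Prolog; load_example is a hypothetical helper):

:- dynamic cl/1, on/2, on_fl/1.

above(X, Y) :- on(X, Y).
above(X, Y) :- on(X, Z), above(Z, Y).

% assert the facts of the example e
load_example :-
    maplist(assertz, [cl(a), on(a, b), on_fl(b), cl(c), on(c, d), on(d, e), on_fl(e)]).

% ?- load_example, findall(above(X, Y), above(X, Y), Facts).
% Facts = [above(a, b), above(c, d), above(d, e), above(c, e)]  (up to ordering).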
More formally, concept learning under learning from interpretations is defined as follows. Given:
• A background theory B,
• A set of positive examples E+ and a set of negative examples E−, each an Herbrand interpretation,
find a hypothesis Φ such that:
1. ∀e ∈ E+ : M(e ∧ B) satisfies Φ,
2. ∀e ∈ E− : M(e ∧ B) does not satisfy Φ.
Evaluating satisfaction against the minimal Herbrand model M(e ∧ B) amounts to making the closed world assumption; that is, all facts that are not part of M(e ∧ B) are
assumed to not apply to the example e. The closed world assumption is, in
turn, a form of non-monotonic reasoning, hence the alternative name of this
setting.
The hypothesis:
B.3 Discussion
Of the two settings, learning from interpretations provides the better match
for the needs of reinforcement learning. The interactive nature of rein-
forcement learning means that data is typically processed incrementally,
which the separation of data items under learning from interpretations con-
veniently supports. Additionally, states and state-action pairs can be mod-
eled quite naturally by Herbrand interpretations, as shown in Section 5.2.
The tractable complexity result for learning from interpretations also in-
creases its attractiveness.
Given:
• A background theory B,
• A set of examples, each a pair consisting of an instance e and its class label y,
find a hypothesis Φ such that for every example:
1. Φ ∧ e ∧ B |= y,
2. Φ ∧ e ∧ B ⊭ y′ for all y′ ≠ y.
There are two ways to straightforwardly map the elements of this setting to the
reinforcement learning problem. One way is to let an example represent
a state (e ∈ S) and its label represent an action (y ∈ A); then we have
4
Concept learning can be viewed as a special case of classification where the definition
of only a single class is to be induced.
5
This formulation is due to Blockeel (1998).
Foxcs most closely resembles the first approach, even though it estimates
the value function. However, note that in Foxcs the hypothesis is a set of
definite clauses and that a prediction value, which is essentially a partial
value function estimate, is associated with each clause in a way that is
not modeled above. Also note that the second condition is relaxed, with
competing actions being resolved by appealing to the associated prediction
values.
Appendix C
The Inductive Logic Programming Tasks
This appendix describes the ILP tasks that were used for evaluating the
Foxcs system. The data sets for some of these, and other, ILP tasks are
publicly available from the web sites at OUCL (2000) and MLnet OiS (2000).
C.1 The Prediction of Mutagenic Activity
The task considered here is the prediction of mutagenic activity from information known about specific nitroaromatic compounds. Many nitroaromatic compounds are mutagenic, that is, they can cause DNA to mutate and are thus potentially carcinogenic. A procedure called the Ames test can detect mutagenicity; however, some compounds, such as antibiotics, cannot be tested due to the risk of toxicity to the test organisms. The development of methods for predicting mutagenicity is therefore of considerable interest because such methods avoid the potential risks associated with testing.
The Mutagenesis data set, available from the OUCL (2000) site, contains
information about 230 nitroaromatic compounds; it was introduced by Srinivasan et al. (1994).
Table C.1: A sample of the Mutagenesis data set. The sample gives the NS+S1 de-
scriptions for the compound d1. The atm facts list the atoms occurring in the molecule
(there are 26 atoms for d1); each fact gives the element (e.g. carbon, c), the element’s
configuration, and its partial charge. The bond facts list all bonds that exist between the
atoms of the molecule; each fact identifies the two atoms participating in the bond and
also gives the type of the bond. Each compound also has a logp and lumo fact.
% atm(Compound,AtomId,Element,Configuration,PartialCharge)
atm(d1,d1_1,c,22,-0.117).
atm(d1,d1_2,c,22,-0.117).
atm(d1,d1_3,c,22,-0.117).
(...)
atm(d1,d1_25,o,40,-0.388).
atm(d1,d1_26,o,40,-0.388).
% bond(Compound,AtomId,AtomId,Type)
bond(d1,d1_1,d1_2,7).
bond(d1,d1_2,d1_3,7).
bond(d1,d1_3,d1_4,7).
(...)
bond(d1,d1_24,d1_25,2).
bond(d1,d1_24,d1_26,2).
logp(d1,4.23).
lumo(d1,-1.246).
The data set has subsequently been widely used for evaluating ILP
algorithms (an overview of selected results is provided by Lodhi and Mug-
gleton (2005)). Following Srinivasan et al. (1996), each compound in the
data set has two levels of description, NS+S1 and NS+S2, which contain
the following information:
Two other attributes, called I1 and Ia, are also known to be relevant; however, they are generally reserved for use with attribute-value systems. A representative sample of the data is given in Table C.1; note, however, that the chemical concepts from NS+S2 are typically converted into ground clauses, which improves the efficiency of an algorithm when run on the data set.
The compounds in the data set are usually divided into two subsets, a “re-
gression friendly” subset of 188 compounds (125 active and 63 inactive), and
a “regression unfriendly” subset of 42 compounds (13 active and 29 inac-
tive). The regression unfriendly subset has been found to yield poor results
using linear regression with attribute value descriptions (Debnath et al.,
1991; Srinivasan et al., 1996), thus it forms the more interesting group for
study in the ILP context. However, we have chosen to use the regression
friendly data set because it has been used more often (which is probably
due to its larger size) and thus has the greater number of published results
available for comparison.
Table C.2: The mode declarations for the Mutagenesis data set.
The data set contains 328 molecules (143 resistant, 185 degradable) and their
associated descriptions. Blockeel et al. (2004) give several levels of descrip-
tion for the molecules, which are selected depending on whether the system
makes use of propositional or relational data. We use the relational repre-
sentation called Global + R, which describes molecules structurally using a
representation that is very similar to that for the Mutagenesis task—with
predicates for the atoms, bonds and higher level sub-molecular structures of
the compounds. The molecular weight is also given for each compound as
it is relevant to biodegradability. We omit a representation of the data due
to its similarity to the Mutagenesis data set.
This data set and the following one were obtained from Sašo Džeroski and are used with his kind permission.
The goal of expert systems for road traffic management (Cuena et al., 1992,
1995) is to suggest actions to avoid or reduce traffic problems, such as congestion, given information about the current state of traffic over a wide area.
The construction of an expert system is tailored specifically for each city
and requires strategies to be input from expert human traffic controllers. In
addition to the human experts the process could be augmented with ma-
chine learning methods, which would infer general principles from a large
secciones_posteriores(’RDLT_ronda_de_Dalt_en_Artesania’,
’RDLT_ronda_de_Dalt_en_salida_a_Roquetas’).
secciones_posteriores(’RDLT_ronda_de_Dalt_en_Artesania’,
’RDLT_salida_a_Roquetas’).
% section type
tipo(’RDLT_ronda_de_Dalt_en_Artesania’,carretera).
velocidad(11,’RDLT_ronda_de_Dalt_en_Artesania’,84.0).
ocupacion(11,’RDLT_ronda_de_Dalt_en_Artesania’,319.25).
saturacion(11,’RDLT_ronda_de_Dalt_en_Artesania’,71.0).
velocidad(12,’RDLT_ronda_de_Dalt_en_Artesania’,84.75).
ocupacion(12,’RDLT_ronda_de_Dalt_en_Artesania’,277.25).
saturacion(12,’RDLT_ronda_de_Dalt_en_Artesania’,60.75).
(...)
Table C.4: The mode declarations for the Traffic data set. The predicate velocidadd is a discretised version of velocidad, containing the 3 values baja, media and alta (low, medium and high); similarly for saturaciond and ocupaciond.
methods.
The data set was used with the kind permission of Martin Molina and is
not publicly available. Introduced by Džeroski et al. (1998b), it contains
data simulated for various road sections in the city of Barcelona. The data
includes two components: one containing the road section details and the other describing the traffic conditions. The data distribution is 66 examples of
congestion, 62 accidents, and 128 non-critical sections (256 examples in to-
tal).
At each time step the input to Foxcs consists of a road section and a sensor
reading time; the remaining data is supplied as background knowledge. A
sample of the data is shown in Table C.3, and the rule language for the task
is given in Table C.4.
C.4 Classifying Hands of Poker
The aim of this task, which originates from Blockeel et al. (1999), is to suc-
cessfully recognise eight types of hands in the card game of Poker. The eight
classes are fourofakind, fullhouse, flush, straight, threeofakind,
twopair, pair, and nought. The first seven classes are defined as nor-
mal for Poker, although no distinction is made between royal, straight, and
ordinary flushes. The last class, nought, consists of all the hands that do
not belong to one of the other classes.
In Poker the cards in the hand are selected randomly, resulting in a particularly uneven class distribution. The table below gives the number of hands belonging to each class, which clearly shows the uneven distribution; for example, a nought hand occurs on average with frequency 1,302,540/2,598,960, or roughly one hand in every two.
fourofakind 624
fullhouse 3,744
flush 5,148
straight 10,200
threeofakind 54,912
two pairs 123,552
pair 1,098,240
nought 1,302,540
total 2,598,960
The rule language for the task is given in Table C.5. The predicate card
represents a playing card and contains three arguments, which are the rank
of the card, its suit, and its position in the hand. The predicate succession
takes five ranks as arguments and evaluates to true when there is an ordering
of the ranks given by the arguments that forms an unbroken sequence. Given
this very natural representation for Poker concepts, the task is relational
because it is the relationships between the ranks and suits of the cards in
the hand, rather than their values, that define the classes.1 Readers familiar
with Poker will note that the class definitions do not depend on the order
in which cards are dealt; why, then, is the position attribute included in
card? The reason for having the position attribute is discussed below.
The use of the anonymous variable was allowed in the card predicate in
order to increase the level of generality that a single rule could attain. This,
in turn, created a need for the attribute, position, to ensure that each
literal represents a separate card. To see this more concretely, imagine that
the system uses the above representation minus the position attribute, and consider a rule of the kind sketched below.
1
Thornton (2000) illustrates, through an anecdote of the true story of the Cincinnati
System, the pitfalls of using non-relational representations to express concepts in Poker.
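A rule of the kind described here might be sketched as follows, using Prolog-style syntax with explicit disequalities standing in for the inequations (the actual Foxcs rule syntax may differ):

% intended reading: A is the rank shared by two cards; B, C and D are the ranks
% of the remaining three cards
pair :- card(A, _), card(A, _), card(B, _), card(C, _), card(D, _),
        A \= B, A \= C, A \= D, B \= C, B \= D, C \= D.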
The rule correctly classifies all hands that contain a single pair. The first
two literals in the condition represent the two cards in the hand that have
the same rank, and the other three literals represent the remaining cards.
Note that because of the use of inequations on the variables A, B, C, and D,
there cannot be more than two cards with the same rank; however, since
the first two literals are identical, there is nothing to prevent them from
matching the same card. Thus, many non-pair hands will also be covered
by the rule, which means the rule is not always correct.
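With the position attribute restored, the corrected rule might be sketched as follows (again in hypothetical Prolog-style syntax; G and H are the positions of the two paired cards):

pair :- card(A, _, G), card(A, _, H), card(B, _, _), card(C, _, _), card(D, _, _),
        G \= H,
        A \= B, A \= C, A \= D, B \= C, B \= D, C \= D.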
Here the position attribute ensures that the first two literals must repre-
sent different cards—again through the use of inequations, this time on the
variables G and H. Now the rule correctly classifies all hands that contain a
single pair, but it no longer matches any non-pair hands; it is therefore a
correct and maximally general definition of a pair.
Appendix D
Validity Tests for the Blocks World Environment
This appendix describes the validity tests that were used to prevent mutation from producing rules that do not match any states in bwn. Note that the validity checks do not eliminate all illegal possibilities. The validity checks are implemented by a set of Prolog rules defining the relation is_illegal/2. The two arguments to is_illegal are the action and condition of the rule to be checked. The definitions make use of the relation
memberof/2, which takes two lists as arguments and checks that all elements
in the first occur in the second:
memberof([],_).
memberof([H|T],List) :- member(H,List), memberof(T,List).
% the action mv_fl(A) cannot apply when the condition already asserts on_fl(A)
is_illegal( Action, Condition ) :-
Action = mv_fl(A),
memberof([on_fl(B)],Condition),
A==B.
% a rule with action mv_fl(A) is illegal unless its condition asserts that A is
% clear and on (or above) another block
is_illegal( Action, Condition ) :-
Action = mv_fl(A),
((memberof([cl(B),on(C,D)],Condition),A==B,A==C) ->
fail
;
((memberof([cl(B),above(F,G)],Condition),A==B,A==F) ->
fail
;
true
)
).
% a rule with action mv(A,B) is illegal unless its condition asserts that both
% A and B are clear
is_illegal( Action, Condition ) :-
Action = mv(A,B),
((memberof([cl(C),cl(D)],Condition),A==C,B==D) -> fail; true).
% a condition containing both on(A,B) and on(B,A) is contradictory
is_illegal( _, Condition ) :-
memberof([on(A,B),on(C,D)],Condition),
A==D,
B==C.
% two different blocks cannot be on the same block
is_illegal( _, Condition ) :-
memberof([on(A,B),on(C,D)],Condition),
A\==C,
B==D.
% the same block cannot be on two different blocks
is_illegal( _, Condition ) :-
memberof([on(A,B),on(C,D)],Condition),
A==C,
B\==D.
% a block cannot be both on another block and on the floor
is_illegal( _, Condition ) :-
memberof([on(A,B),on_fl(C)],Condition),
A==C.
% a block that has another block on it cannot be clear
is_illegal( _, Condition ) :-
memberof([on(A,B),cl(C)],Condition),
B==C.
% a block that has another block on it cannot be the highest
is_illegal( _, Condition ) :-
memberof([on(A,B),highest(C)],Condition),
B==C.
% a block that has another block above it cannot be the highest
is_illegal( _, Condition ) :-
memberof([above(A,B),highest(C)],Condition),
B==C.
% on(A,B) is inconsistent with above(B,A)
is_illegal( _, Condition ) :-
memberof([on(A,B),above(C,D)],Condition),
A==D,
B==C.
% a block that has another block above it cannot be clear
is_illegal( _, Condition ) :-
memberof([cl(A),above(B,C)],Condition),
A==C.
% a block on the floor cannot be above another block
is_illegal( _, Condition ) :-
memberof([on_fl(A),above(B,C)],Condition),
A==B.
Bibliography
Anglano, C., Giordana, A., Lo Bello, G., and Saitta, L. (1997). A network
genetic algorithm for concept learning. In Bäck, T., editor, Proceedings of
the Seventh International Conference on Genetic Algorithms (ICGA97),
San Francisco, CA. Morgan Kaufmann.
Anglano, C., Giordana, A., Lo Bello, G., and Saitta, L. (1998). An experi-
mental evaluation of coevolutive concept learning. In Shavlik, J. W., edi-
tor, Proceedings of the 15th International Conference on Machine Learn-
ing (ICML 1998), pages 19–27. Morgan Kaufmann, San Francisco, CA.
Augier, S., Venturini, G., and Kodratoff, Y. (1995). Learning first order
logic rules with a genetic algorithm. In Fayyad, U. M. and Uthurusamy,
R., editors, The First International Conference on Knowledge Discovery
and Data Mining (KDD-95), pages 21–26. AAAI Press.
Bacchus, F. and Jaakkola, T., editors (2005). Proceedings of the 21st Con-
ference on Uncertainty in Artificial Intelligence (UAI ’05). AUAI Press.
Bäck, T., Fogel, D. B., and Michalewicz, Z., editors (2000). Evolutionary
Computation, volume 1. Institute of Physics Publishing.
Banzhaf, W., Daida, J., Eiben, A. E., Garzon, M. H., Honavar, V., Jakiela,
M., and Smith, R. E., editors (1999). Proceedings of the Genetic and
Evolutionary Computation Conference (GECCO’99). Morgan Kaufmann.
Bernadó, E., Llorà, X., and Garrel, J. M. (2002). XCS and GALE: A
comparative study of two learning classifier systems on data mining. In
Lanzi, P. L., Stolzmann, W., and Wilson, S. W., editors, Advances in
Learning Classifier Systems: 4th International Workshop, IWLCS 2001,
volume 2321 of LNCS, pages 115–132. Springer-Verlag.
Beyer, H.-G. and O’Reilly, U.-M., editors (2005). Proceedings of the Genetic
and Evolutionary Computation Conference, GECCO 2005. ACM Press.
Blockeel, H., De Raedt, L., Jacobs, N., and Demoen, B. (1999). Scaling
up inductive logic programming by learning from interpretations. Data
Mining and Knowledge Discovery, 3(1):59–93.
Blockeel, H., Džeroski, S., Kompare, B., Kramer, S., Pfahringer, B., and
Van Laer, W. (2004). Experiments in predicting biodegradability. Applied
Artificial Intelligence, 18(2):157–181.
Butz, M. V., Goldberg, D. E., and Tharakunnel, K. (2003). Analysis and im-
provement of fitness exploitation in XCS: Bounding models, tournament
selection, and bilateral accuracy. Evolutionary Computation, 11(3):239–
277.
Butz, M. V., Kovacs, T., Lanzi, P. L., and Wilson, S. W. (2004). Toward
a theory of generalization and learning in XCS. IEEE Transactions on
Evolutionary Computation, 8(1):28–46.
Butz, M. V., Sastry, K., and Goldberg, D. E. (2005). Strong, stable, and
reliable fitness pressure in XCS due to tournament selection. Genetic
Programming and Evolvable Machines, 6(1):53–77.
Casillas, J., Carse, B., and Bull, L. (2004). Fuzzy XCS: an accuracy-based
fuzzy classifier system. In Proceedings of the XII Congreso Espanol sobre
Tecnologia y Logica Fuzzy (ESTYLF 2004), pages 369–376.
Dam, H. H., Abbass, H. A., and Lokan, C. (2005). Be real! XCS with
continuous-valued inputs. Technical Report TR-ALAR-200504001, The
Artificial Life and Adaptive Robotics Laboratory, University of New South
Wales, Northcott Drive, Campbell, Canberra, ACT 2600, Australia.
Divina, F., Keijzer, M., and Marchiori, E. (2003). A method for handling
numerical attributes in GA-based inductive concept learners. In Cantú-
Paz, E., Foster, J. A., Deb, K., Davis, D., Roy, R., O’Reilly, U.-M., Beyer,
H.-G., Standish, R., Kendall, G., Wilson, S., Harman, M., Wegener, J.,
Dasgupta, D., Potter, M. A., Schultz, A. C., Dowsland, K., Jonoska, N.,
Džeroski, S., Blockeel, H., Kompare, B., Kramer, S., Pfahringer, B., and
Van Laer, W. (1999). Experiments in predicting biodegradability. In
Džeroski, S. and Flach, P., editors, International Workshop on Inductive
Logic Programming, volume 1634 of LNCS, pages 80–91. Springer-Verlag.
Džeroski, S., Jacobs, N., Molina, M., Moure, C., Muggleton, S., and van
Laer, W. (1998b). Detecting traffic problems with ILP. In Proceedings
of the Eighth International Conference on Inductive Logic Programming,
volume 1446 of LNCS, pages 281–290. Springer-Verlag.
Fern, A., Yoon, S. W., and Givan, R. (2004a). Approximate policy iteration
with a policy language bias. In Thrun, S., Saul, L. K., and Schölkopf, B.,
Finney, S., Gardiol, N. H., Kaelbling, L. P., and Oates, T. (2002b). The
thing that we tried didn’t work very well: Deictic representation in rein-
forcement learning. In Proceedings of the 18th International Conference
on Uncertainty in Artificial Intelligence (UAI-02).
Finney, S., Gardiol, N. H., Kaelbling, L. P., and Oates, T. (2002a).
Learning with deictic representation. Technical Report AI Laboratory
Memo AIM-2002-006, MIT, Cambridge, MA.
Fitch, R., Hengst, B., Šuc, D., Calbert, G., and Scholz, J. (2005). Struc-
tural abstraction experiments in reinforcement learning. In Zhang, S. and
Jarvis, R., editors, Proceedings of the 18th Australian Joint Conference on
Artificial Intelligence (AI 2005), volume 3809 of LNCS, pages 164–175.
Springer-Verlag.
Gärtner, T., Driessens, K., and Ramon, J. (2003a). Graph kernels and
Gaussian processes for relational reinforcement learning. In Horváth and
Yamamoto (2003), pages 146–163.
Gärtner, T., Flach, P., and Wrobel, S. (2003b). On graph kernels: Hardness
results and efficient alternatives. In Schölkopf, B. and Warmuth, M. K.,
editors, Learning Theory and Kernel Machines: 16th Annual Conference
on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, vol-
ume 2777 of LNCS, pages 129–143. Springer-Verlag.
Givan, R., Dean, T., and Greig, M. (2003). Equivalence notions and model
minimization in Markov decision processes. Artificial Intelligence, 147(1–
2):163–223.
Goldberg, D. E., Deb, K., and Korb, B. (1990). Messy genetic algorithms
revisited: Studies in mixed size and scale. Complex Systems, 4(4):415–444.
Goldberg, D. E., Korb, B., and Deb, K. (1989). Messy genetic algorithms:
Motivation, analysis and first results. Complex Systems, 3(4):493–530.
Guestrin, C., Koller, D., Parr, R., and Venkataraman, S. (2003). Efficient
solution algorithms for factored MDPs. Journal of Artificial Intelligence
Research (JAIR), 19:399–468.
Horváth, T. and Yamamoto, A., editors (2003). Proceedings of the 13th In-
ternational Conference on Inductive Logic Programming, ILP 2003, vol-
ume 2835 of LNCS. Springer-Verlag.
Kersting, K., van Otterlo, M., and De Raedt, L. (2004). Bellman goes
relational. In Brodley (2004).
Kietz, J.-U. (1993). Some lower bounds for the computational complexity
of inductive logic programming. In Proceedings of the Sixth European
Conference on Machine Learning (ECML93), volume 667 of LNCS, pages
115–123. Springer-Verlag.
Kim, K.-E. and Dean, T. (2003). Solving factored MDPs using non-
homogeneous partitions. Artificial Intelligence, 147(1–2):225–251.
Langdon, W. B., Cantú-Paz, E., Mathias, K. E., Roy, R., Davis, D., Poli,
R., Balakrishnan, K., Honavar, V., Rudolph, G., Wegener, J., Bull, L.,
Potter, M. A., Schultz, A. C., Miller, J. F., Burke, E. K., and Jonoska, N.,
editors (2002). GECCO 2002: Proceedings of the Genetic and Evolution-
ary Computation Conference, New York, USA, 9-13 July 2002. Morgan
Kaufmann.
Lanzi, P. L. (2001). Mining interesting knowledge from data with the XCS
classifier system. In Spector et al. (2001), pages 958–965.
Lanzi, P. L., Stolzmann, W., and Wilson, S. W., editors (2000). Learning
Classifier Systems: from Foundations to Applications, volume 1813 of
LNCS. Springer-Verlag.
Mellor, D. (2005). A first order logic classifier system. In Beyer and O’Reilly
(2005), pages 1819–1826.
MLnet OiS (2000). The machine learning network online information service.
http://www.mlnet.org/cgi-bin/mlnetois.pl/?File=datasets.html.
Osborn, T. R., Charif, A., Lamas, R., and Dubossarsky, E. (1995). Genetic
logic programming. In IEEE Conference on Evolutionary Computation,
pages 728–732. IEEE Press.
Precup, D., Sutton, R. S., Paduraru, C., Koop, A., and Singh, S. P. (2006).
Off-policy learning with options and recognizers. In Weiss, Y., Schölkopf,
B., and Platt, J., editors, Advances in Neural Information Processing
Systems 18, pages 1097–1104, Cambridge, MA. MIT Press.
Spector, L., Goodman, E. D., Wu, A., Langdon, W. B., Voigt, H.-M., Gen,
M., Sen, S., Dorigo, M., Pezeshk, S., Garzon, M. H., and Burke, E.,
editors (2001). Proceedings of the Genetic and Evolutionary Computa-
tion Conference (GECCO-2001), San Francisco, California, USA. Morgan
Kaufmann.
Srinivasan, A., King, R. D., and Muggleton, S. (1999). The role of back-
ground knowledge: using a problem from chemistry to examine the per-
formance of an ILP program. Technical Report PRG-TR-08-99, Oxford
University Computing Laboratory, Oxford, UK.
Srinivasan, A., Muggleton, S., and King, R. (1995). Comparing the use
of background knowledge by inductive logic programming systems. In
De Raedt, L., editor, Proceedings of the Fifth International Inductive
Logic Programming Workshop. Katholieke Universteit, Leuven. With-
drawn from publication and replaced by Srinivasan et al. (1999).
Srinivasan, A., Muggleton, S., King, R., and Sternberg, M. (1994). Mu-
tagenesis: ILP experiments in a non-determinate biological domain. In
Stone, C. and Bull, L. (2003). For real! XCS with continuous-valued inputs.
Evolutionary Computation, 11(3):299–336.
Sutton, R., McAllester, D., Singh, S., and Mansour, Y. (2000). Policy
gradient methods for reinforcement learning with function approximation.
Advances in Neural Information Processing Systems, 12:1057–1063.
Sutton, R. S., Precup, D., and Singh, S. P. (1999). Between MDPs and semi-
MDPs: A framework for temporal abstraction in reinforcement learning.
Artificial Intelligence, 112(1–2):181–211.
Tadepalli, P., Givan, R., and Driessens, K., editors (2004a). Proceed-
ings of the ICML’04 Workshop on Relational Reinforcement Learning.
http://eecs.oregonstate.edu/research/rrl/index.html.
Thornton, C. (2000). Truth from Trash: How Learning Makes Sense. The
MIT Press.
Van Laer, W., De Raedt, L., and Džeroski, S. (1997). On multi-class prob-
lems and discretization in inductive logic programming. In Proceedings of
the 10th International Symposium on Methodologies for Intelligent Systems,
volume 1325 of LNCS, pages 277–286. Springer-Verlag.
Yoon, S. W., Fern, A., and Givan, R. (2002). Inductive policy selection
for first-order MDPs. In Darwiche, A. and Friedman, N., editors, UAI
’02, Proceedings of the 18th Conference in Uncertainty in Artificial Intel-
ligence. Morgan Kaufmann.
Yoon, S. W., Fern, A., and Givan, R. (2005). Learning measures of progress
for planning domains. In Veloso, M. M. and Kambhampati, S., editors,
Proceedings, The Twentieth National Conference on Artificial Intelligence
and the Seventeenth Innovative Application of Artificial Intelligence Con-
ference (IAAI 2005), pages 1217–1222. AAAI Press.