Learning Classifier System
• Trial-and-error search
  – neither exploitation nor exploration can be pursued exclusively without failing at the task
• Life-long learning
  – on-going exploration
Reinforcement Learning
• Policy : S → A
• The agent repeatedly observes a state, takes an action and receives a reward:
  s0 –(a0, r0)→ s1 –(a1, r1)→ s2 –(a2, r2)→ ...
State value function, V

  State, s   V(s)
  s0         ...
  s1         10
  s2         15
  s3         6

• V(s) predicts the future total reward we can obtain by entering state s
• We can exploit V greedily, i.e. in s, choose the action a for which the following is largest:
  r(s, a) + Σ_{s' ∈ S} p(s, a, s') V(s')
• Example transitions from s0:
  r(s0, a1) = 2, p(s0, a1, s1) = 0.7, p(s0, a1, s2) = 0.3
  r(s0, a2) = 5, p(s0, a2, s2) = 0.5, p(s0, a2, s3) = 0.5
• Choosing a1: 2 + 0.7 × 10 + 0.3 × 15 = 13.5
• Choosing a2: 5 + 0.5 × 15 + 0.5 × 6 = 15.5
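A small sketch (variable names are mine, not from the slides) of this greedy exploitation of V, using the rewards and transition probabilities of the example above:

```python
# V(s) for the successor states in the example
V = {"s1": 10, "s2": 15, "s3": 6}

# p[(s, a)] maps successor states to probabilities; r[(s, a)] is the immediate reward
p = {
    ("s0", "a1"): {"s1": 0.7, "s2": 0.3},
    ("s0", "a2"): {"s2": 0.5, "s3": 0.5},
}
r = {("s0", "a1"): 2, ("s0", "a2"): 5}

def greedy_action(s, actions):
    """Choose the action maximising r(s,a) + sum over s' of p(s,a,s') * V(s')."""
    return max(actions,
               key=lambda a: r[(s, a)] + sum(prob * V[s2] for s2, prob in p[(s, a)].items()))

print(greedy_action("s0", ["a1", "a2"]))   # a2, since 15.5 > 13.5
```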
Action value function, Q

  State, s   Action, a   Q(s, a)
  s0         a1          13.5
  s0         a2          15.5
  s1         a1          ...
  s1         a2          ...

• Q(s, a) predicts the future total reward we can obtain by executing a in s
• We can exploit Q greedily, i.e. in s, choose the action a for which Q(s, a) is largest
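The Q values in the table come from the quantities on the previous slide; as a reminder of the relationship (assuming the same undiscounted, one-step lookahead used there):

```latex
Q(s,a) \;=\; r(s,a) + \sum_{s' \in S} p(s,a,s')\,V(s'),
\qquad \text{e.g. } Q(s_0,a_2) = 5 + 0.5 \times 15 + 0.5 \times 6 = 15.5
```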
Q Learning (Watkins 1989)
• For each (s, a), initialise Q(s, a) arbitrarily
• Observe current state, s
• Do until reach goal state
  – select an action a and execute it, receiving reward r and observing the new state s’
  – update Q(s, a)
  – s ← s’
• Exploration versus exploitation: the action-selection step must balance the two
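A minimal sketch of the loop above in code; the environment interface, learning rate alpha, discount gamma and ε-greedy exploration are assumptions, not specified on the slide:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                       # Q[(s, a)], initialised arbitrarily (here: 0)
    for _ in range(episodes):
        s = env.reset()                          # observe current state, s
        done = False
        while not done:                          # do until reach goal state
            # exploration versus exploitation: epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)        # execute a, receive r, observe s'
            best_next = 0.0 if done else max(Q[(s_next, a_)] for a_ in env.actions(s_next))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next                           # s <- s'
    return Q
```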
Backup Diagram for Q Learning
[Diagram: from state s, taking action a with value Q(s, a), the agent receives reward r and reaches s’; the backup then looks across the actions a’ available in s’ and their values Q(s’, a’)]
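Read as an update, the backup depicted is the standard Watkins rule (the learning rate α and discount γ are the usual parameters, not shown on the slide):

```latex
Q(s,a) \leftarrow Q(s,a) + \alpha \Bigl( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \Bigr)
```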
Function Approximation
• Q can be represented by a table only if the number of states & actions is small
• Besides, this makes poor use of experience
• Hence, we use function approximation, e.g.
  – neural nets
  – weighted linear functions (sketched below)
  – case-based/instance-based/memory-based representations
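As an illustration of the weighted-linear-function option; the names, feature vector phi(s, a) and update scheme here are assumptions, not taken from the slides:

```python
import numpy as np

class LinearQ:
    """Q(s, a) approximated as a weighted linear function of features phi(s, a)."""
    def __init__(self, n_features, alpha=0.01):
        self.w = np.zeros(n_features)   # one weight per feature
        self.alpha = alpha

    def value(self, phi):
        """Q(s, a) = w . phi(s, a)"""
        return float(np.dot(self.w, phi))

    def update(self, phi, target):
        """Move the prediction towards a target such as r + gamma * max_a' Q(s', a')."""
        error = target - self.value(phi)
        self.w += self.alpha * error * phi
```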
Classifier Systems
• John Holland   • Stewart Wilson

Rules (condition : action)   P    E     F
  #011 : 01                  43   .01   99
  11## : 00                  32   .13    9
  #0## : 11                  14   .05   52
  001# : 01                  27   .24    3
  #0#1 : 11                  18   .02   92
  1#01 : 10                  24   .17   15

Match set (rules whose condition matches the current input):
  #011 : 01, #0## : 11, 001# : 01, #0#1 : 11

Prediction array (fitness-weighted prediction for each action):
  01 → 42.5    10 → –    11 → 16.5

Action set (match-set rules advocating the selected action, 01):
  #011 : 01, 001# : 01
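A small sketch (names are mine) of how the prediction array above is formed: each action's entry is the fitness-weighted average of the predictions P of the matching rules that advocate it.

```python
# Each rule: (condition, action, P, E, F), as in the match set above.
match_set = [
    ("#011", "01", 43, 0.01, 99),
    ("#0##", "11", 14, 0.05, 52),
    ("001#", "01", 27, 0.24, 3),
    ("#0#1", "11", 18, 0.02, 92),
]

def prediction_array(rules):
    """Fitness-weighted average prediction for each action advocated in the match set."""
    totals, weights = {}, {}
    for _, action, P, _, F in rules:
        totals[action] = totals.get(action, 0) + P * F
        weights[action] = weights.get(action, 0) + F
    return {a: totals[a] / weights[a] for a in totals}

print(prediction_array(match_set))   # {'01': 42.5..., '11': 16.5...}
```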
CBR/IBL/MBR for RL
[Diagram: CBR/IBL/MBR combined with Reinforcement Learning]
• Conventionally, the case has two parts
  – problem description, representing (s, a), i.e. state s & action a
  – solution, representing Q(s, a) (real-valued)
Case-Based XCS
[Diagram: Case-Based Reasoning combined with Reinforcement Learning; a query has problem = state s and an as-yet-unknown solution]
• Here the case gains a third part
  – outcome, representing Q(s, a)
• Given a new s, predict a, guided by case outcomes as well as similarities
Case Outcomes
• In CBR research, storing outcomes is not common, but neither is it new, e.g.
  – cases have three parts in [Kolodner 1993]
  – IB3’s classification records [Aha et al. 1991]
• They
  – influence retrieval and reuse
  – are updated in cases, based on performance
  – guide maintenance and discovery
Outcomes in Case-Based XCS
• Each case outcome is a record of
  – experience: how many times it has appeared in an action set
  – prediction of future reward, P: its estimate of Q(s, a)
  – prediction error, E: the average error in P
  – fitness, F: inversely related to E
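A minimal sketch of the case structure this implies (the class and field names are mine, not from the slides):

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    experience: int = 0        # times the case has appeared in an action set
    prediction: float = 0.0    # P, the case's estimate of Q(s, a)
    error: float = 0.0         # E, average error in P
    fitness: float = 0.0       # F, inversely related to E

@dataclass
class Case:
    problem: list              # the state, s (e.g. a real-valued feature vector)
    solution: str              # the action, a
    outcome: Outcome           # the record above, updated from experience
```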
Retrieval and Reuse
• The Match Set contains the k-nearest neighbours, but similarity is weighted by fitness
• Fitness, F, is accuracy (derived from the prediction error, E) relative to the total accuracies of the previous Action Set
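An illustrative sketch of fitness-weighted retrieval, reusing the Case class sketched earlier; the similarity measure and the way fitness weights the ranking are my assumptions:

```python
import math

def similarity(x, y):
    """Assumed similarity: inverse of Euclidean distance between state vectors."""
    return 1.0 / (1.0 + math.dist(x, y))

def match_set(cases, query_state, k=4):
    """Return the k nearest cases, ranking by similarity weighted by fitness."""
    scored = [(similarity(c.problem, query_state) * c.outcome.fitness, c) for c in cases]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]
```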
Deletion
• ‘Random’ deletion
– probability inversely related to fitness
[Plot: results over trials 1–496 for IB1 (498), IB2 (93), IB3 (82) and CBR-XCS (498); y-axis scale 0–60]
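A sketch of one way to realise ‘random’ deletion with probability inversely related to fitness; the slides do not give the exact scheme, so this uses roulette-wheel selection over inverse fitness:

```python
import random

def choose_case_to_delete(cases, eps=1e-6):
    """Roulette-wheel selection: lower fitness gives a higher probability of deletion."""
    weights = [1.0 / (c.outcome.fitness + eps) for c in cases]
    return random.choices(cases, weights=weights, k=1)[0]
```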
Recommender System Dialogs
• 1470 holidays; 8 descriptive attributes
• Leave-one-in experiments
  – each holiday in turn is the target holiday
  – questions are asked until the retrieval set contains 5 holidays or no questions remain
  – the simulated user answers a question with the value from the target holiday
  – 25-fold cross-validation (different orderings)
Users Who Always Answer
• Best policy is to choose the remaining question that has highest entropy
• State, s, records the entropies for each question
• k = 4; ε starts at 1 and, after ~150 steps, decays exponentially
• Delayed reward = −(numQuestionsAsked)³
• Multi-step backup
• No GA
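A sketch of the “By Entropy” idea referred to above: ask the remaining question whose attribute has the highest entropy over the current candidate holidays (the same entropies make up the state, s). The data layout (holidays as dicts of attribute values) is an assumption:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of one attribute's value distribution over the candidate holidays."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def next_question(candidates, remaining_attributes):
    """'By Entropy' baseline: choose the remaining question with highest entropy."""
    return max(remaining_attributes,
               key=lambda attr: entropy([h[attr] for h in candidates]))
```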
Does the learned policy minimise dialog length?
[Plot: dialog length over trials 1–1441 for Random, By Entropy and CBR-XCS (126); y-axis scale 0–4]
Users Who Don’t Always Answer
• Schmitt 2002:
  – an entropy-like policy (simVar)
  – but also customer-adaptive (a Bayesian net predicts reaction to future questions based on reactions to previous ones)
• Suppose users feel there is a ‘natural’ question order
  – if the actual question order matches the natural order, users will always answer
  – if the actual question order doesn’t match the natural order, users may, with non-zero probability, not answer (sketched below)
• A trade-off
  – learning the natural order
    • to maximise the chance of getting an answer
  – learning to ask the highest-entropy questions
    • to maximise the chance of reducing the size of the retrieval set, if given an answer
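A sketch of the simulated user described above; the refusal probability and the notion of “matching the natural order so far” are assumptions of mine:

```python
import random

def user_answers(question, questions_asked_so_far, natural_order, p_refuse=0.5):
    """The user always answers if the questions so far have followed the natural order;
    otherwise they may decline to answer with some non-zero probability."""
    expected_prefix = natural_order[:len(questions_asked_so_far) + 1]
    follows_order = questions_asked_so_far + [question] == expected_prefix
    return True if follows_order else random.random() > p_refuse
```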
Does the learned policy find a good trade-off?
[Plot: dialog length over trials 1–1441 for Random, By Entropy, By the Ordering and CBR-XCS (104); y-axis scale 0–7]
Bridge’s 11 REs
[Diagram: a cycle of “RE” steps, including Retrieve, Reuse, Respond, Receive (sensory input, reward), Reap, Reinforce and Reflect]