
Reinforcement Learning & Case-Based Learning Classifier Systems
(modified from slides by Derek Bridge, ICCBR'05)
Reinforcement Learning
• The agent interacts with its environment to achieve a goal
• It receives reward (possibly delayed reward) for its actions
  – it is not told what actions to take
• Trial-and-error search
  – neither exploitation nor exploration can be pursued exclusively without failing at the task
• Life-long learning
  – on-going exploration
Reinforcement Learning
Policy π : S → A
[Diagram: the agent repeatedly observes a state, selects an action and receives a reward, generating the trajectory s0, a0, r0; s1, a1, r1; s2, a2, r2; ...]
State value function, V
V(s) predicts the future total reward we can obtain by entering state s.

  State, s   V(s)
  s0         ...
  s1         10
  s2         15
  s3         6

[Diagram: from s0, action a1 has r(s0, a1) = 2, p(s0, a1, s1) = 0.7, p(s0, a1, s2) = 0.3;
action a2 has r(s0, a2) = 5, p(s0, a2, s2) = 0.5, p(s0, a2, s3) = 0.5.]

π can exploit V greedily, i.e. in s, choose the action a for which the following is largest:

    r(s, a) + Σ_{s' ∈ S} p(s, a, s') V(s')

Choosing a1: 2 + 0.7 × 10 + 0.3 × 15 = 13.5
Choosing a2: 5 + 0.5 × 15 + 0.5 × 6 = 15.5
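To make the greedy exploitation of V concrete, here is a minimal Python sketch; the transition probabilities, rewards and V values are the illustrative numbers from the slide, and the function and variable names are my own:

```python
# Minimal sketch of greedily exploiting a state value function V,
# reproducing the slide's example around state s0.

V = {"s1": 10, "s2": 15, "s3": 6}                       # state values V(s)
R = {("s0", "a1"): 2, ("s0", "a2"): 5}                  # immediate rewards r(s, a)
P = {                                                   # transition probabilities p(s, a, s')
    ("s0", "a1"): {"s1": 0.7, "s2": 0.3},
    ("s0", "a2"): {"s2": 0.5, "s3": 0.5},
}

def one_step_value(s, a):
    """r(s, a) + sum over s' of p(s, a, s') * V(s')."""
    return R[(s, a)] + sum(p * V[s_next] for s_next, p in P[(s, a)].items())

def greedy_action(s, actions):
    """Choose the action with the largest one-step lookahead value."""
    return max(actions, key=lambda a: one_step_value(s, a))

print(one_step_value("s0", "a1"))          # 13.5
print(one_step_value("s0", "a2"))          # 15.5
print(greedy_action("s0", ["a1", "a2"]))   # 'a2'
```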
Action value function, Q
Q(s, a) predicts the future total reward we can obtain by executing a in s.

  State, s   Action, a   Q(s, a)
  s0         a1          13.5
  s0         a2          15.5
  s1         a1          ...
  s1         a2          ...

π can exploit Q greedily, i.e. in s, choose the action a for which Q(s, a) is largest.
Q Learning

For each (s, a), initialise Q(s, a) arbitrarily
Observe current state, s
Do until reach goal state
    Select action a by exploiting Q ε-greedily, i.e.
    with probability ε, choose a randomly;
    else choose the a for which Q(s, a) is largest
    (exploration versus exploitation)
    Execute a, entering state s' and receiving immediate reward r
    Update the table entry for Q(s, a) using the one-step temporal difference update rule, TD(0):
        Q(s, a) ← Q(s, a) + α (r + γ max_{a'} Q(s', a') − Q(s, a))
    s ← s'

[Watkins 1989]
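The pseudocode above translates directly into a short tabular implementation. A minimal Python sketch, assuming a generic environment object with reset()/step() methods; the environment interface, learning rate and discount factor are illustrative choices, not part of the slides:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=100, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q learning with epsilon-greedy action selection (Watkins 1989)."""
    Q = defaultdict(float)                      # Q(s, a), initialised to 0 (arbitrary)
    for _ in range(episodes):
        s = env.reset()                         # observe current state
        done = False
        while not done:                         # do until reach goal state
            if random.random() < epsilon:       # explore: random action
                a = random.choice(actions)
            else:                               # exploit: argmax_a Q(s, a)
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)       # execute a, observe s' and reward r
            # One-step temporal difference update, TD(0)
            target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```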
Backup Diagram for Q Learning
[Diagram: from state s, taking action a gives Q(s, a); the agent receives reward r and enters state s', where each possible action a' has value Q(s', a').]
Function Approximation
• Q can be represented by a table only if the number of states & actions is small
• Besides, a table makes poor use of experience
• Hence, we use function approximation, e.g.
  – neural nets
  – weighted linear functions
  – case-based/instance-based/memory-based representations
Classifier Systems
• John Holland
  – Classifier systems are rule-based systems with components for performance, reinforcement and discovery [Holland 1986]
  – They influenced the development of RL and GAs
• Stewart Wilson
  – ZCS simplifies Holland's classifier systems [Wilson 1994]
  – XCS extends ZCS and uses accuracy-based fitness [Wilson 1995]
  – Under simplifying assumptions, XCS implements Q Learning [Dorigo & Bersini 1994]
XCS
Environment: state 0011 → action 01 → reward -5

[Architecture diagram, reconstructed:]

Rule base (condition : action, with prediction P, error E, fitness F):
  #011 : 01   P=43  E=.01  F=99
  11## : 00   P=32  E=.13  F=9
  #0## : 11   P=14  E=.05  F=52
  001# : 01   P=27  E=.24  F=3
  #0#1 : 11   P=18  E=.02  F=92
  1#01 : 10   P=24  E=.17  F=15
  ...

Match set (rules whose conditions match state 0011):
  #011 : 01,  #0## : 11,  001# : 01,  #0#1 : 11

Prediction array (fitness-weighted prediction for each action):
  00: -    01: 42.5    10: -    11: 16.5

Action set (rules in the match set that advocate the chosen action, 01):
  #011 : 01,  001# : 01

On receipt of reward, P, E and F are updated for the previous action set; rules are removed by deletion and new rules are created by discovery (GA).
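To illustrate the data flow above, here is a minimal Python sketch of matching and building the fitness-weighted prediction array; the rule format and numbers follow the slide, while the helper names are my own:

```python
# Each rule: (condition with '#' wildcards, action, prediction P, error E, fitness F)
rules = [
    ("#011", "01", 43, .01, 99),
    ("11##", "00", 32, .13, 9),
    ("#0##", "11", 14, .05, 52),
    ("001#", "01", 27, .24, 3),
    ("#0#1", "11", 18, .02, 92),
    ("1#01", "10", 24, .17, 15),
]

def matches(condition, state):
    """A rule matches when every non-# position agrees with the state."""
    return all(c == "#" or c == s for c, s in zip(condition, state))

def prediction_array(rules, state):
    """Fitness-weighted average prediction for each action advocated in the match set."""
    match_set = [r for r in rules if matches(r[0], state)]
    preds = {}
    for action in {r[1] for r in match_set}:
        advocates = [r for r in match_set if r[1] == action]
        preds[action] = (sum(P * F for _, _, P, _, F in advocates)
                         / sum(F for *_, F in advocates))
    return match_set, preds

match_set, preds = prediction_array(rules, "0011")
print(preds)   # approximately {'01': 42.5, '11': 16.6}
```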
Case-Based XCS
Environment: state 0011 → action 01 → reward -5

[Architecture diagram, reconstructed: the rule base is replaced by a case base, rules with # wildcards are replaced by fully-specified cases, and a new Insertion component adds cases.]

Case base (case : action, with prediction P, error E, fitness F):
  1011 : 01   P=43  E=.01  F=99
  1100 : 00   P=32  E=.13  F=9
  0000 : 11   P=14  E=.05  F=52
  0111 : 01   P=27  E=.24  F=3
  0001 : 11   P=18  E=.02  F=92
  1101 : 10   P=24  E=.17  F=15
  ...

Match set (cases retrieved for state 0011):
  1011 : 01,  0000 : 11,  0111 : 01,  0001 : 11

Prediction array:  00: -    01: 42.5    10: -    11: 16.5

Action set (cases advocating the chosen action, 01):
  1011 : 01,  0111 : 01

On receipt of reward, P, E and F are updated for the previous action set; cases are removed by deletion, created by discovery (GA), and added by insertion.
Case-Based Reasoning
• Similar to clustering before decision-making using cluster centroids, but for categorical or partially numeric data
• Generalised examples (cases) are stored, often in an exception hierarchy (the case base)
• Given new inputs, the best-matching lowest-level cases are retrieved
• Adaptation rules describe how the stored case's answers are modified for the new case
CBR/IBL/MBR for RL
• Conventionally, the case has two parts
  – problem description, representing (s, a)
  – solution, representing Q(s, a)
• Hence, the task is regression, i.e. given a new (s, a), predict Q(s, a) (real-valued)

           CBR/IBL/MBR                Reinforcement Learning
  Case     problem                    state s & action a
           solution (real-valued)     Q(s, a)
  Query    problem                    state s & action a
           solution (real-valued)     ?
Case-Based XCS
• The case has three parts
  – problem description, representing s
  – solution, representing a
  – outcome, representing Q(s, a)
• Given a new s, predict a, guided by case outcomes as well as similarities

           Case-Based Reasoning       Reinforcement Learning
  Case     problem                    state s
           solution                   action a
           outcome (real-valued)      Q(s, a)
  Query    problem                    state s
           solution                   ?
Case Outcomes
• In CBR research, storing outcomes is not common but neither is it new, e.g.
  – cases have three parts in [Kolodner 1993]
  – IB3's classification records [Aha et al. 1991]
• They
  – influence retrieval and reuse
  – are updated in cases, based on performance
  – guide maintenance and discovery
Outcomes in Case-Based XCS
• Each case outcome is a record of
  – experience: how many times it appeared in an action set
  – prediction of future reward, P: this is its estimate of Q(s, a)
  – prediction error, E: average error in P
  – fitness, F: inversely related to E
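A minimal Python sketch of the case structure described above; the field names follow the slides, while the class itself and its defaults are my own illustration:

```python
from dataclasses import dataclass

@dataclass
class Case:
    problem: str           # problem description, representing state s
    solution: str          # solution, representing action a
    # Outcome record:
    experience: int = 0    # how many times the case appeared in an action set
    P: float = 0.0         # prediction of future reward, an estimate of Q(s, a)
    E: float = 0.0         # prediction error: average error in P
    F: float = 1.0         # fitness, inversely related to E

# Example case, using the bit-string states and actions from the XCS diagrams
c = Case(problem="1011", solution="01", experience=12, P=43.0, E=0.01, F=99.0)
```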
Retrieval and Reuse
• The match set contains the k-nearest neighbours, but similarity is weighted by fitness
• From the prediction array, we choose the action with the highest predicted total future reward, but the cases' predictions are weighted by similarity and fitness
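A minimal sketch of this retrieval-and-reuse step, reusing the Case class from the earlier sketch and assuming a user-supplied similarity function sim(state, problem) in [0, 1]; the weighting scheme shown (similarity × fitness) is one plausible reading of the slide, not a specification:

```python
import heapq

def retrieve_and_reuse(case_base, state, sim, k=4):
    """Build a fitness-weighted k-NN match set, then a similarity- and
    fitness-weighted prediction array, and return the best action."""
    # Match set: k nearest neighbours, with similarity weighted by fitness
    match_set = heapq.nlargest(k, case_base, key=lambda c: sim(state, c.problem) * c.F)

    # Prediction array: weighted average of P for each advocated action
    prediction_array = {}
    for action in {c.solution for c in match_set}:
        advocates = [c for c in match_set if c.solution == action]
        weights = [sim(state, c.problem) * c.F for c in advocates]
        prediction_array[action] = (sum(w * c.P for w, c in zip(weights, advocates))
                                    / sum(weights))

    best_action = max(prediction_array, key=prediction_array.get)
    return match_set, prediction_array, best_action
```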
Reinforcement
• On receipt of reward r, for each case in the previous action set
  – P is updated by the TD(0) rule
  – E is moved towards the difference between the case's previous value of P and its new value of P
  – F is computed from accuracy κ, which is based on error E
Fitness F and Accuracy κ
Accuracy κ is computed from the prediction error E: the higher the error, the lower the accuracy.
Fitness F is accuracy κ relative to the total accuracies of the previous action set.
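The slide's formula for κ did not survive extraction. For reference, the standard XCS accuracy and relative-fitness definitions [Wilson 1995] have the following form; the parameters ε₀, α, ν and β are XCS's, not defined in these slides, and the original slide may have differed in detail:

```latex
\kappa =
\begin{cases}
1 & \text{if } E < \varepsilon_0 \\
\alpha \, (E / \varepsilon_0)^{-\nu} & \text{otherwise}
\end{cases}
\qquad
F \leftarrow F + \beta \left( \frac{\kappa}{\sum_{c \in [A]_{-1}} \kappa_c} - F \right)
```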
Deletion
• 'Random' deletion
  – probability inversely related to fitness
• Or case ci might be deleted if there is another case cj such that
  – cj has sufficient experience
  – cj has sufficient fitness (accuracy)
  – cj subsumes ci, i.e.
    • sim(ci, cj) < θ (or could use a competence model)
    • cj's action = ci's action
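A minimal sketch of the subsumption-based deletion test, assuming the Case class and a sim function as in the earlier sketches; the thresholds are illustrative parameters, and the comparison direction for sim follows the slide as written:

```python
def subsumes(cj, ci, sim, theta, exp_min=20, fitness_min=10.0):
    """cj may subsume ci if cj is experienced and fit enough,
    advocates the same action, and satisfies the similarity test."""
    return (cj.experience >= exp_min
            and cj.F >= fitness_min
            and cj.solution == ci.solution
            and sim(ci.problem, cj.problem) < theta)   # as stated on the slide

def delete_subsumed(case_base, sim, theta):
    """Remove every case that is subsumed by some other case."""
    return [ci for ci in case_base
            if not any(subsumes(cj, ci, sim, theta)
                       for cj in case_base if cj is not ci)]
```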
Discovery by GA
• Steady-state reproduction, not generational
• The GA runs in a niche (an action set), not panmictically
• It runs only if the time since the last GA for these cases exceeds a threshold
• From the action set, two parents are selected; two offspring are created by crossover and mutation
• They are not retained if subsumed by their parents
• If retained, deletion may take place
Example Application: Spam Classification
• Emails from my mailbox, stripped of attachments
  – 498 of them, approx. 75% spam
  – highly personal definition of spam
  – highly noisy
  – processed in chronological order
• Textual similarity based on a text compression ratio
• k = 1; ε = 0
• No GA
Spam Classification
• Rewards
  – correct: 1
  – spam classified as ham: -100
  – ham classified as spam: -1000
• Other ways of reflecting this asymmetry
  – skewing the voting [Delany et al. 2005]
  – loss functions, e.g. [Wilke & Bergmann 1996]
Has Spam had its Chips?
[Chart: % correct (y-axis) against number of emails processed, 1 to 496 (x-axis), comparing IB1 (498 cases), IB2 (93), IB3 (82) and CBR-XCS (498); annotation: 'Use Best'.]
Recommender System Dialogs
• 1470 holidays; 8 descriptive attributes
• Leave-one-in experiments
  – each holiday in turn is the target holiday
  – questions are asked until the retrieval set contains ≤ 5 holidays or no questions remain
  – a simulated user answers a question with the value from the target holiday
  – 25-fold cross-validation (different orderings)
Users Who Always Answer
• The best policy is to choose the remaining question that has the highest entropy
• The state, s, records the entropies for each question
• k = 4; ε starts at 1 and, after ~150 steps, decays exponentially
• Delayed reward = −(numQuestionsAsked)³
• Multi-step backup
• No GA
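To make the entropy policy concrete, here is a minimal Python sketch that computes the entropy of each unanswered attribute over the current retrieval set and picks the highest; the data layout (a list of dicts mapping attribute to value) is my own illustration:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of the distribution of attribute values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def question_entropies(retrieval_set, unanswered_attributes):
    """Entropy of each candidate question over the current retrieval set."""
    return {attr: entropy([case[attr] for case in retrieval_set])
            for attr in unanswered_attributes}

def highest_entropy_question(retrieval_set, unanswered_attributes):
    ents = question_entropies(retrieval_set, unanswered_attributes)
    return max(ents, key=ents.get)

# Example: holidays described by a few attributes
holidays = [
    {"region": "Alps", "type": "skiing", "duration": 7},
    {"region": "Alps", "type": "hiking", "duration": 14},
    {"region": "Coast", "type": "beach", "duration": 7},
]
print(highest_entropy_question(holidays, ["region", "type", "duration"]))  # 'type'
```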
Does the learned policy minimise dialog length?
[Chart: dialog length (y-axis, roughly 0 to 4 questions) against dialog number, 1 to 1441 (x-axis), comparing Random, By Entropy and CBR-XCS (126 cases).]
Users Who Don't Always Answer
• Schmitt 2002:
  – an entropy-like policy (simVar)
  – but also customer-adaptive (a Bayesian net predicts reaction to future questions based on reactions to previous ones)
• Suppose users feel there is a 'natural' question order
  – if the actual question order matches the natural order, users will always answer
  – if the actual question order doesn't match the natural order, with non-zero probability users may not answer
• A trade-off
  – learning the natural order, to maximise the chance of getting an answer
  – learning to ask the highest-entropy questions, to maximise the chance of reducing the size of the retrieval set, if given an answer
Does the learned policy find a good trade-off?
[Chart: dialog length (y-axis, roughly 0 to 7 questions) against dialog number, 1 to 1441 (x-axis), comparing Random, By Entropy, By the Ordering and CBR-XCS (104 cases).]
Bridge's 11 REs
[Diagram: a cycle of eleven processes — Receive sensory input, Retrieve, Reuse, Respond, Reap reward, Reinforce, Reflect, Refine, Reduce, Retain, Replenish.]

You might also like