R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}    (1)

where T is the time-step at which the epoch terminates.

The goal of reinforcement learning is to learn to select the action with the maximum return R_t achievable for a given observation [Sutton and Barto, 1998]. From Equation (1), we can define the action value Q^{\pi}(s, a) at time t as the expected reward for selecting an action a in a given state s_t and thereafter following a policy π:

Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a]    (2)

The optimal policy selects the action with the optimal Q-value, i.e., the highest expected return obtained by following an optimal sequence of actions. This obeys the Bellman optimality equation:

Q^{*}(s, a) = \mathbb{E}[r + \gamma \max_{a'} Q^{*}(s', a') \mid s, a]    (3)

In temporal-difference (TD) learning methods such as Q-learning or Sarsa, the Q-values are updated after each time-step, instead of after each epoch as happens in Monte Carlo learning.

Algorithm 1 Sarsa: Learn function Q : X × A → R
procedure SARSA(X, A, R, T, α, γ)
    Initialize Q : X × A → R uniformly
    while Q is not converged do
        Start in state s ∈ X
        Choose a from s using the policy derived from Q (e.g., ε-greedy)
        while s is not terminal do
            Take action a, observe r, s'
            Choose a' from s' using the policy derived from Q (e.g., ε-greedy)
            Q(s, a) ← Q(s, a) + α · (r + γ · Q(s', a') − Q(s, a))
            s ← s'
            a ← a'
    return Q

Sarsa converges with probability 1 to an optimal policy as long as all state-action pairs are visited an infinite number of times. Unfortunately, Sarsa cannot be straightforwardly applied to continuous or large state spaces: such spaces are difficult to explore, since accurately estimating the value of each state requires visiting it frequently, which makes the estimation of the Q-values inefficient.
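To make Algorithm 1 concrete, the following is a minimal tabular Sarsa sketch in Python; the environment interface (reset/step returning reward, next state and a done flag) and the ε-greedy helper are assumptions made for illustration, not part of the paper.

import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    # With probability epsilon explore, otherwise act greedily on Q.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.01):
    Q = defaultdict(float)  # tabular Q(s, a), lazily initialised to 0
    for _ in range(episodes):
        s = env.reset()                         # start in state s
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:                         # until s is terminal
            r, s_next, done = env.step(a)       # take action a, observe r, s'
            a_next = epsilon_greedy(Q, s_next, actions, epsilon)
            # One-step Sarsa update: Q(s,a) += alpha * (r + gamma*Q(s',a') - Q(s,a))
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q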
2.3 Eligibility trace
Since it takes time to back-propagate rewards to the previous Q-values, the above model suffers from slow training in sparse-reward environments. Eligibility traces are a mechanism for handling the problem of delayed rewards. Many temporal-difference (TD) methods, including Sarsa and Q-learning, can use eligibility traces. In the popular Sarsa(λ) and Q-learning(λ) algorithms, λ controls the decay of the eligibility traces, or equivalently the weighting of the n-step returns. In the case of Sarsa(λ) with accumulating traces, this leads to the following update rule, applied to every state-action pair at each step:

\delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)
e_t(s, a) = \gamma \lambda \, e_{t-1}(s, a) + \mathbf{1}[s = s_t, a = a_t]
Q(s, a) \leftarrow Q(s, a) + \alpha \, \delta_t \, e_t(s, a)
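A minimal sketch of this update in Python, assuming the same tabular setting as the previous sketch; the pruning threshold and the helper name sarsa_lambda_update are illustrative, not taken from the paper.

from collections import defaultdict

def sarsa_lambda_update(Q, trace, s, a, r, s_next, a_next,
                        alpha=0.1, gamma=0.99, lam=0.9):
    # TD error of the transition (s, a) -> r, s', a'.
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    # Accumulating trace: the visited pair becomes more eligible.
    trace[(s, a)] += 1.0
    for key in list(trace.keys()):
        # Every eligible pair receives a share of the TD error ...
        Q[key] += alpha * delta * trace[key]
        # ... and its eligibility then decays by a factor gamma * lambda.
        trace[key] *= gamma * lam
        if trace[key] < 1e-8:
            del trace[key]  # prune negligible traces to keep the table small
    return Q, trace

# Q and trace can both be defaultdict(float); the update is called once per
# environment step inside the Sarsa loop of the previous sketch.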
Each one-minute interval is described by the open price, the close price, and the highest/lowest prices observed during that interval (Figure 3(b)). We limit the possible actions to Buy, Hold and Sell. The reward is computed from the gain or loss realized after buying or selling. We assume that a single agent has a limited impact on the stock market; for this reason, the price is not influenced by the actions of the agent. Each training episode is followed by a testing episode that evaluates the average reward of the agent on another subset of the same stock price series. Each episode is played until the training data are consumed, approximately 10^5 iterations.
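As an illustration of the Buy/Hold/Sell setting described above, a simplified reward (an assumption made for this sketch, not the paper's exact definition) credits the agent with the gain or loss of its current position over the candlestick.

from enum import Enum

class Action(Enum):
    BUY = 0
    HOLD = 1
    SELL = 2

def step_reward(action, position, open_price, close_price):
    # Toy sketch: a Buy opens a long position, a Sell opens a short position,
    # and Hold keeps the previous position unchanged.
    if action is Action.BUY:
        position = 1
    elif action is Action.SELL:
        position = -1
    # The reward is the profit or loss of the position over the interval.
    reward = position * (close_price - open_price)
    return reward, position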
Our system learns to trade on a minutely stock index. In total, we used 4 datasets with durations varying between 2 and 5 years. We trained the model on one stock index and used the other datasets to generate the rules. Among the training examples, 80% are randomly selected for training the model and the remainder for testing it. We performed a grid search to find the optimal parameters for initializing the Q-values and found that a mean µ = 0.25 and a standard deviation σ = 0.2 worked best. We use K = 100 as the decay factor of the eligibility traces. In the case of manually created rules, we first compute the percentage increase in the share price 14 days later and then estimate an optimal action associated with each pattern. In total, we took into account 40 candlestick patterns. The mined patterns were filtered with C = 5, the minimum number of times a pattern must occur in the training data.

We follow a simplified version of the technique of [Mashayekhi and Gras, 2015] to generate rules from a random forest. Briefly, we extract the rules top to bottom (root to leaf) and filter them to avoid redundancy. In practice, we annotate 6000 samples into 3 classes; each sample is the aggregation of the last 5 prices. We labeled the dataset according to the price change pdiff 14 days later (pdiff >= 0.5%, pdiff <= -0.5%, -0.5% < pdiff < 0.5%) to train a random forest; a short sketch of this labelling step is given at the end of this subsection. We compute pdiff as the average between the open and close prices.

Figure 4: Comparison of the average number of Q-values visited at least one time over 3 runs. (a) Sarsa-rb(λ) without sub-states; (b) Sarsa-rb(λ) with sub-states.

In order to better understand the impact of the sub-states on learning, we analyze and compare Sarsa-rb(λ) with and without sub-states. We also investigate properties of the sub-states of the manually created rules.

Table 1 reports the number of times the Q-values are updated on average. We ran the experiments 10 times for 500 episodes with the same hyper-parameters. The first row shows the average number of times states are updated and the second row the average number of steps between two consecutive updates of a state. With the sub-states technique, states are updated +1500% more often and the time between two updates decreases. We observed that frequent updates of the sub-states lead to a faster convergence of the Q-values.

Figure 4 shows the number of states and sub-states updated at least once over time. At each iteration, we count the number of states, or states with a sub-state, visited. On average, states are updated for the first time much earlier when the sub-states technique is used. Sub-states play an important role both for early updates and for the update frequency. Frequently updating the sub-states of a state improves the accuracy of the estimation of its Q-values, which can significantly decrease learning time, especially when the number of states is large.
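Returning to the rule-generation step above, the labelling of the samples can be sketched as follows; the ±0.5% thresholds, the 14-step horizon and the 5-price window follow the description, while pdiff is simplified here to a plain relative change and the function name label_samples is ours, not the paper's.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def label_samples(prices, horizon=14, threshold=0.005, window=5):
    # Each sample aggregates the last `window` prices; the label encodes the
    # relative price change `horizon` steps later with a 0.5% threshold.
    X, y = [], []
    for t in range(window, len(prices) - horizon):
        X.append(prices[t - window:t])
        pdiff = (prices[t + horizon] - prices[t]) / prices[t]
        if pdiff >= threshold:
            y.append(0)      # price goes up: buy-like class
        elif pdiff <= -threshold:
            y.append(2)      # price goes down: sell-like class
        else:
            y.append(1)      # flat: hold-like class
    return np.array(X), np.array(y)

# X, y = label_samples(prices)   # prices: 1-D array of prices (loading omitted)
# forest = RandomForestClassifier(n_estimators=100).fit(X, y)
# The decision paths of `forest` (root to leaf) can then be read off as rules.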
4.2 Overall Performance
We compared Sarsa-rb(λ) trained with the sub-states mechanism to a deep recurrent Q-learning model [Hausknecht and Stone, 2015] and a DDPG model [Lillicrap et al., 2015]. For this evaluation, we individually tuned the hyper-parameters of each model. We decreased the learning rate from α = 0.3 to α = 0.0001, varied the eligibility-trace parameter from λ = 0.9 to λ = 0.995, then used ε = 0.01, λ = 0.9405 and K = 100, and we tuned the neural network architectures of DQN and DDPG. The results are obtained by running the algorithms with the same environment hyper-parameters, and the plots are averaged over 5 runs. Finally, we used the manually created rules as the states of Sarsa-rb(λ).

We report the learning curves on the testing dataset in Figure 5. Sarsa-rb(λ) consistently achieves a higher score than DQN and DDPG. As shown in Figure 5, Sarsa-rb(λ) clearly improves over DQN: after converging, its average reward is around 3.3 times higher. DDPG fluctuates less than Sarsa-rb(λ) but is also less effective.

Figure 5: Performance curves for a selection of algorithms: original Deep Q-learning algorithm (red), Deep Deterministic Policy Gradients algorithm (green) and Sarsa-rb(λ) (blue).

5 Conclusion
This paper introduced a new model to combine reinforcement learning and external knowledge. We demonstrated its ability to solve a complex and highly fluctuating task, trading in the stock market. Additionally, the algorithm is fully interpretable and understandable: in a given state, we can explain the impact of each variable and pattern on the action selection. Our central thesis is to enhance the state representation of Sarsa(λ) with background knowledge and to speed up learning with a sub-states mechanism. Further benefits stem from efficiently updating the eligibility traces. Moreover, our approach can easily be adapted to solve new tasks with a very limited amount of human work. We have demonstrated the effectiveness of our algorithm in decreasing the training time and learning a better and more efficient policy. In the future, we plan to evaluate our idea with other TD methods. Another challenge is how to generate the rules during the training phase and discard useless rules to decrease learning time and improve computational efficiency. Finally, we are interested in extending our experiments to new environments such as textual or visual environments.

References
[Barto and Mahadevan, 2003] Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.
[Bellemare et al., 2013] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. 2013.
[Bougie and Ichise, 2017] N. Bougie and R. Ichise. Deep Reinforcement Learning Boosted by External Knowledge. ArXiv e-prints, December 2017.
[d'Avila Garcez et al., 2018] A. d'Avila Garcez, A. Resende Riquetti Dutra, and E. Alonso. Towards Symbolic Reinforcement Learning with Common Sense. ArXiv e-prints, April 2018.
[Dundar et al., ] Murat Dundar, Balaji Krishnapuram, Jinbo Bi, and R Bharat Rao. Learning classifiers when the training data is not iid.
[Gambardella and Dorigo, 1995] Luca M Gambardella and Marco Dorigo. Ant-q: A reinforcement learning approach to the traveling salesman problem. In Machine Learning Proceedings 1995, pages 252–260. Elsevier, 1995.
[Garnelo et al., 2016] Marta Garnelo, Kai Arulkumaran, and Murray Shanahan. Towards deep symbolic reinforcement learning. arXiv preprint arXiv:1609.05518, 2016.
[Hausknecht and Stone, 2015] Matthew Hausknecht and Peter Stone. Deep recurrent q-learning for partially observable mdps. 2015.
[Lake et al., 2016] Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, pages 1–101, 2016.
[Lample and Chaplot, 2017] Guillaume Lample and Devendra Singh Chaplot. Playing fps games with deep reinforcement learning. In Proceedings of AAAI, pages 2140–2146, 2017.
[Lillicrap et al., 2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[Louppe, 2014] Gilles Louppe. Understanding random forests: From theory to practice. arXiv preprint arXiv:1407.7502, 2014.
[Mashayekhi and Gras, 2015] Morteza Mashayekhi and Robin Gras. Rule extraction from random forest: the rf+hc methods. In Proceedings of Canadian Conference on Artificial Intelligence, pages 223–237. Springer, 2015.
[Mnih et al., 2013] Volodymyr Mnih, Koray Kavukcuoglu,
David Silver, Alex Graves, Ioannis Antonoglou, Daan
Wierstra, and Martin Riedmiller. Playing atari with deep
reinforcement learning. arXiv preprint arXiv:1312.5602,
2013.
[Nison, 2001] Steve Nison. Japanese candlestick charting
techniques: a contemporary guide to the ancient invest-
ment techniques of the Far East. Penguin, 2001.
[Pal, 2005] Mahesh Pal. Random forest classifier for remote
sensing classification. International Journal of Remote
Sensing, 26(1):217–222, 2005.
[Randløv and Alstrøm, 1998] Jette Randløv and Preben Al-
strøm. Learning to drive a bicycle using reinforcement
learning and shaping. In Proceedings of ICML, volume 98,
pages 463–471, 1998.
[Safavian and Landgrebe, 1991] S Rasoul Safavian and
David Landgrebe. A survey of decision tree classifier
methodology. IEEE transactions on systems, man, and
cybernetics, 21(3):660–674, 1991.
[Singh and Sutton, 1996] Satinder P Singh and Richard S
Sutton. Reinforcement learning with replacing eligibility
traces. Machine learning, 22(1-3):123–158, 1996.
[Sutton and Barto, 1998] Richard S Sutton and Andrew G
Barto. Reinforcement learning: An introduction. MIT
press Cambridge, 1998.
[Sutton, 1996] Richard S Sutton. Generalization in rein-
forcement learning: Successful examples using sparse
coarse coding. In Advances in neural information process-
ing systems, pages 1038–1044, 1996.
[Tessler et al., 2017] Chen Tessler, Shahar Givony, Tom Za-
havy, Daniel J. Mankowitz, and Shie Mannor. A deep hi-
erarchical approach to lifelong learning in minecraft. In
Proceedings of AAAI Conference on Artificial Intelligence,
pages 1553–1561, 2017.
[Watkins and Dayan, 1992] Christopher JCH Watkins and
Peter Dayan. Q-learning. Machine learning, 8(3-4):279–
292, 1992.