
Rule-based Reinforcement Learning augmented by External Knowledge

Nicolas Bougie (1, 2), Ryutaro Ichise (1)

1 National Institute of Informatics
2 The Graduate University for Advanced Studies, Sokendai

[email protected], [email protected]

Abstract

Reinforcement learning has achieved several successes in sequential decision problems. However, these methods require a large number of training iterations in complex environments. A standard paradigm to tackle this challenge is to extend reinforcement learning to handle function approximation with deep learning. Lack of interpretability and the impossibility of introducing background knowledge limit their usability in many safety-critical real-world scenarios. In this paper, we study how to combine reinforcement learning and external knowledge. We derive a rule-based variant of the Sarsa(λ) algorithm, which we call Sarsa-rb(λ), that augments data with complex knowledge and exploits similarities among states. We apply our method to a trading task from the Stock Market Environment. We show that the resulting algorithm not only leads to much better performance but also improves training speed compared to the Deep Q-learning (DQN) algorithm and the Deep Deterministic Policy Gradients (DDPG) algorithm.

1 Introduction

Over the last few years, reinforcement learning (RL) has made significant progress in learning good policies in many domains. Well-known temporal difference (TD) methods such as Sarsa [Sutton, 1996] or Q-learning [Watkins and Dayan, 1992] learn to predict the best action to take through step-wise interactions with the environment. In particular, Q-learning has been shown to be effective in solving the traveling salesman problem [Gambardella and Dorigo, 1995] or learning to drive a bicycle [Randløv and Alstrøm, 1998]. However, large or continuous state spaces limit their application to simple environments.

Recently, combining advances in deep learning and reinforcement learning has proved very successful in mastering complex tasks. A significant example is the combination of neural networks and Q-learning, resulting in "Deep Q-Learning" (DQN) [Mnih et al., 2013], which achieves human-level performance on many tasks including Atari video games [Bellemare et al., 2013].

Learning from scratch and lack of interpretability impose some problems on deep reinforcement learning methods. Randomly initializing the weights of a neural network is inefficient, and training such a model is often intractable in many domains because of the large amount of data required. Additionally, most RL algorithms cannot incorporate external knowledge, which limits their performance. Moreover, the impossibility of explaining and understanding the reasons for a decision restricts their use to non-safety-critical domains, excluding for example medicine or law. An approach to tackle these problems is to combine simple reinforcement learning techniques with external knowledge.

A powerful recent idea to address the problem of computational expense is to modularize the model into an ensemble of experts [Lample and Chaplot, 2017], [Bougie and Ichise, 2017]. The task is divided into a sequence of stages and, for each one, a policy is learned. Since each expert focuses on learning one stage of the task, the reduced set of actions to consider leads to a shorter learning period. Although this approach is conceptually simple, it does not handle very complicated environments or environments with a large set of actions.

Another technique, Hierarchical Learning [Tessler et al., 2017], [Barto and Mahadevan, 2003], is used to solve complex tasks, such as "simulating the human brain" [Lake et al., 2016]. It is inspired by human learning, which uses previous experiences to face new situations. Instead of learning the entire task directly, the agent learns different sub-tasks. By reusing knowledge acquired from the previous sub-tasks, learning is faster and easier. Some limitations are the necessity to re-train the model, which is time-consuming, and problems related to the catastrophic forgetting of knowledge from previous tasks. All the previously cited approaches suffer from a lack of interpretability, reducing their usage in critical applications such as autonomous driving.

Another approach, Symbolic Reinforcement Learning [Garnelo et al., 2016], [d'Avila Garcez et al., 2018], combines a system that learns an abstracted representation of the environment with high-order reasoning. However, it has several limitations: it cannot support ongoing adaptation to a new environment and cannot handle external sources of prior knowledge.

This paper demonstrates that a simple reinforcement learning agent can overcome these challenges and learn control policies.
Our model is trained with a variant of the Sarsa(λ) algorithm [Singh and Sutton, 1996]. We introduce external knowledge by representing the states as rules. Rules transform the raw data into a compressed and high-level representation. To deal with the problem of training speed and highly fluctuating environments [Dundar et al., ], we use a sub-states mechanism. Sub-states allow a more frequent update of the Q-values, thereby smoothing and speeding up the learning. Furthermore, we adapted eligibility traces, which turned out to be critical in guiding the algorithm to solve tasks.

In order to evaluate our method, we constructed a variety of trading environment simulations based on real stock market data. Our rule-based approach, Sarsa-rb(λ), can learn to trade in a small number of iterations. In many cases, we are able to outperform the well-known Deep Q-learning algorithm in terms of quality of policy and training time. Sarsa-rb(λ) also exhibits higher performance than DDPG [Lillicrap et al., 2015] after converging.

The paper is organized as follows. Section 2 gives an overview of reinforcement learning. Section 3 describes the main contributions of the paper. Section 4 presents the experiments and the results. Section 5 presents the main conclusions drawn from the work.

2 Reinforcement Learning

Reinforcement learning consists of an agent learning a policy π by interacting with an environment. At each time-step the agent receives an observation s_t and chooses an action a_t. The agent gets a feedback from the environment called a reward r_t. Given this reward and the observation, the agent can update its policy to improve the future rewards.

Given a discount factor γ, the future discounted reward, called the return R_t, is defined as follows:

R_t = Σ_{t'=t}^{T} γ^{t'−t} r_{t'}    (1)

where T is the time-step at which the epoch terminates. The goal of reinforcement learning is to learn to select the action with the maximum return R_t achievable for a given observation [Sutton and Barto, 1998]. From Equation (1), we can define the action value Q^π(s, a) at a time t as the expected reward for selecting an action a in a given state s_t and following a policy π:

Q^π(s, a) = E[R_t | s_t = s, a]    (2)

The optimal policy is defined as selecting the action with the optimal Q-value, the highest expected return, followed by an optimal sequence of actions. This obeys the Bellman optimality equation:

Q*(s, a) = E[r + γ max_{a'} Q*(s', a') | s, a]    (3)

In temporal difference (TD) learning methods such as Q-learning or Sarsa, the Q-values are updated after each time-step instead of after each epoch, as happens in Monte Carlo learning.

2.1 Q-learning algorithm

Q-learning [Watkins and Dayan, 1992] is a common technique to approximate π ≈ π*. The estimation of the action value function is iteratively performed by updating Q(s, a). This algorithm is considered an off-policy method since the update rule is unrelated to the policy that is learned, as follows:

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]    (4)

The choice of the action follows a policy derived from Q. The most common policy, the ε-greedy policy, trades off the exploration/exploitation dilemma. In case of exploration, a random action is sampled, whereas exploitation selects the action with the highest estimated return. In order to converge to a stable policy, the probability of exploitation must increase over time. An obvious approach to adapting Q-learning to continuous domains is to discretize the state space, which leads to an explosion of the number of Q-values. Therefore, a good estimation of the Q-values in this context is often intractable.
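To make the update of Equation (4) and the ε-greedy policy concrete, the following minimal Python sketch shows one tabular Q-learning step; the dictionary-based Q-table and the explicit action list are illustrative assumptions, not part of the paper.

import random
from collections import defaultdict

# Tabular Q-values, defaulting to 0 for unseen (state, action) pairs.
Q = defaultdict(float)

def epsilon_greedy(state, actions, epsilon):
    """Explore with probability epsilon, otherwise pick the greedy action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(s, a, r, s_next, actions, alpha, gamma):
    """Off-policy update of Equation (4): bootstrap on max_a' Q(s', a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])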
2.2 Sarsa algorithm

Sarsa is a temporal difference (TD) control method. The key difference between Q-learning and Sarsa is that Sarsa is an on-policy method. This implies that the Q-values are learned based on the action performed by the current policy instead of a greedy policy. The update rule becomes:

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]    (5)

Algorithm 1 Sarsa: Learn function Q : X × A → R
procedure SARSA(X, A, R, T, α, γ)
    Initialize Q : X × A → R uniformly
    while Q is not converged do
        Start in state s ∈ X
        Choose a from s using the policy derived from Q (e.g., ε-greedy)
        while s is not terminal do
            Take action a, observe r, s'
            Choose a' from s' using the policy derived from Q (e.g., ε-greedy)
            Q(s, a) ← Q(s, a) + α · (r + γ · Q(s', a') − Q(s, a))
            s ← s'
            a ← a'
    return Q

Sarsa converges with probability 1 to an optimal policy as long as all the action-value states are visited an infinite number of times. Unfortunately, it is not possible to straightforwardly apply Sarsa to continuous or large state spaces. Such large spaces are difficult to explore, since a frequent visit of each state is required to accurately estimate its value, resulting in an inefficient estimation of the Q-values.
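A direct Python transcription of Algorithm 1 might look as follows; the env object with reset()/step() methods and the discrete action list are assumptions made for illustration, since the paper does not prescribe an interface.

import random
from collections import defaultdict

def sarsa(env, actions, episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Sarsa (Algorithm 1): on-policy TD control with an epsilon-greedy policy."""
    Q = defaultdict(float)

    def policy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)   # take action a, observe r and s'
            a_next = policy(s_next)         # choose a' from s' with the current policy
            # Equation (5): bootstrap on the action actually taken next (on-policy).
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q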
2.3 Eligibility trace

Since it takes time to back-propagate the rewards to the previous Q-values, the above model suffers from slow training in sparse-reward environments. Eligibility traces are a mechanism to handle the problem of delayed rewards. Many temporal-difference (TD) methods including Sarsa or Q-learning can use eligibility traces. In the popular Sarsa(λ) or Q-learning(λ), λ refers to eligibility traces or n-step returns. In the case of Sarsa(λ), this leads to the following update rule:

Q_{t+1}(s, a) = Q_t(s, a) + α [r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)] e_t(s, a),  for all s, a    (6)

where

e_t(s, a) = λ e_{t−1}(s, a) + 1,  if s = s_t and a = a_t
e_t(s, a) = λ e_{t−1}(s, a),      otherwise    (7)

The temporal difference error for a state is estimated in a bootstrapping process. Instead of looking only at the current reward, in Monte Carlo methods the prediction is made based on the successive states. The TD(λ) method is similar: the current temporal difference error is used to update all the visited states of the corresponding episode. At each step, the reward is back-propagated to the prior states according to their frequency of visit. The parameter λ ∈ [0, 1] controls the trade-off between one-step TD methods (TD(0)) and full-episode methods (Monte Carlo).
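As a concrete reading of Equations (6)-(7), the sketch below performs one Sarsa(λ) step with accumulating traces stored in a dictionary; it follows the paper's form of Equation (7), where the traces decay by λ alone, and the dictionary-based tables are an illustrative assumption.

def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, alpha, gamma, lam):
    """One Sarsa(lambda) update following Equations (6)-(7).

    Q and E are dicts mapping (state, action) pairs to floats;
    E holds the eligibility traces.
    """
    delta = r + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0)  # TD error
    # Equation (7): decay every trace, then bump the trace of the visited pair.
    for key in list(E):
        E[key] *= lam
    E[(s, a)] = E.get((s, a), 0.0) + 1.0
    # Equation (6): broadcast the TD error to all pairs in proportion to their trace.
    for key, e in E.items():
        Q[key] = Q.get(key, 0.0) + alpha * delta * e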
3 Rule-based Sarsa(λ)

We first present the general idea of our algorithm, Sarsa-rb, a variant of the Sarsa algorithm.

We propose a simple method, Sarsa-rb, that enables Sarsa in continuous spaces, boosted by injecting external knowledge. The idea behind Sarsa-rb is to enhance the state representation and the Q-value initialization with background knowledge to make training more efficient and interpretable. As in Sarsa, Sarsa-rb estimates the Q-values. However, each state is represented by a rule. Representing the states by rules has several advantages. It makes it possible to combine reinforcement learning and complex knowledge. Furthermore, the number of Q-values is reduced, which makes the training much faster.

While Sarsa-rb provides some advantages over Sarsa in terms of quality of policy, we can significantly improve its training time with a sub-states mechanism. Instead of updating one Q-value at each iteration, our model updates several Q-values which share similar information with the current state, leading to a significant speed-up. Finally, we adapt the eligibility trace technique to take advantage of the sub-states, yielding Sarsa-rb(λ).

3.1 Rule-based Sarsa (Sarsa-rb)

The Sarsa algorithm maintains a parametrized Q-function which maps the states S to their Q-values. Instead of using the state space, or a discretization of it, as states, we enhance the state representation by mapping rules and actions to Q-values. As depicted in Figure 1, states are replaced by a set of rules, R. The rules associate a pattern with an action and allow the introduction of complex background knowledge.

Figure 1: An illustration of the update of the Q-function. The Q-values of the state s2 and its sub-states are updated. The sub-states sharing similar information with s2, in blue, are also modified.

A pattern is a conjunction of variables and can be arbitrarily complex. The variables represent significant events in the task. For example, in a task involving driving a car, a variable could be (speed between 20 and 50 km/h) and an example of a pattern is ((speed between 20 and 50 km/h) ∧ (pedestrian crossing the road)). Finally, a rule recommends an action (e.g., brake) for a pattern.

Given an observation obs_t, the active state is the state whose associated pattern is satisfied, in other words, all of whose variables are active. Since no pattern is always satisfied, we added an "empty" state: the default state, active regardless of the input.

Our contribution here is to provide modifications to Sarsa which allow the state representation to be improved with background knowledge. The rules are a way to abstract the states from the environment and to deal with continuous or complex data representations. In addition, by taking advantage of the rules during the Q-value initialization, the initial policy benefits from background knowledge. Moreover, in many domains |R| << |S|, resulting in a reduction of the number of Q-values to estimate. We filter out irrelevant rules by keeping only the most frequent ones.

In Sarsa, the Q-values are uniformly initialized. For a state s represented by a pattern p, p controls the activation of the state and we use the rule to improve the initialization of the Q-values:

Q(s_{t=0}, a_{t=0}) = N(µ, σ²),  if rule_action = a
Q(s_{t=0}, a_{t=0}) = 0,         otherwise    (8)

with µ the mean and σ² the variance. The Q-value of the action recommended by the rule follows a normal distribution centered around µ, and the other Q-values are initialized to 0.
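The following sketch illustrates how a rule set could define the state space and how Equation (8) could seed the Q-table. The Rule class (a pattern given as a list of boolean predicates plus a recommended action) is an illustrative assumption; the default values µ = 0.25 and σ = 0.2 are those reported in the experiments.

import random

class Rule:
    """A rule: a pattern (conjunction of predicates over the observation) and a recommended action."""
    def __init__(self, name, predicates, action=None):
        self.name = name
        self.predicates = predicates   # list of functions obs -> bool
        self.action = action           # recommended action, may be None

    def is_active(self, obs):
        return all(p(obs) for p in self.predicates)

def initialize_q(rules, actions, mu=0.25, sigma=0.2):
    """Equation (8): the recommended action is drawn from N(mu, sigma^2), the others start at 0."""
    Q = {}
    for rule in rules:
        for a in actions:
            Q[(rule.name, a)] = random.gauss(mu, sigma) if a == rule.action else 0.0
    return Q

def active_state(rules, obs, default="empty"):
    """Return the first rule whose pattern is satisfied, or the default 'empty' state."""
    for rule in rules:
        if rule.is_active(obs):
            return rule.name
    return default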
3.2 Prior Knowledge for Rule Generation

To create the rules, we compared two methods. One consists of manually creating them according to our knowledge about the task. Automatic extraction retrieves patterns from external sources of data.
External Knowledge Based Rules
An intuitive approach to creating the rules relies on human or background knowledge about the domain. For example, if the task involves driving a car, background knowledge can be extracted from highway rules. The action associated with a pattern can be left empty if it cannot be predicted, without much affecting the quality of the agent.

For example, we can use our expertise about time series and stock markets. To this end, the rules can be based on candlestick patterns [Nison, 2001]. This stock-market technique estimates the trend of the share price by identifying patterns in the time series. Candlestick pattern analysis relies on patterns composed of the open, high, close and low prices of the previous observations.

Automatically Learned Rules
In real-world environments, the rules can be automatically captured by supervised machine learning methods. We follow an idea similar to [Mashayekhi and Gras, 2015]. The method extracts the rules from a random forest [Pal, 2005], an ensemble of decision trees [Safavian and Landgrebe, 1991]. A decision tree consists of several nodes that branch to two sub-trees based on a threshold value on a variable. We call the terminal nodes leaf nodes. A single decision tree has a very limited generalization capability and a high variance. Ensemble models such as random forests reduce the variance by building many trees and predicting based on a consensus among the decision trees. A simple tree traversal method can directly extract rules from the trees [Louppe, 2014].
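As a rough sketch of this extraction step, the snippet below walks each tree of a scikit-learn random forest from root to leaf and emits every root-to-leaf path as a rule, in the spirit of [Mashayekhi and Gras, 2015]; the feature names and the filtering of redundant rules are placeholders rather than the authors' exact procedure.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def extract_rules(forest, feature_names):
    """Collect one rule per root-to-leaf path: a list of (feature, op, threshold) conditions and a predicted class."""
    rules = []
    for estimator in forest.estimators_:
        tree = estimator.tree_
        def walk(node, conditions):
            if tree.children_left[node] == -1:            # leaf node
                predicted_class = int(np.argmax(tree.value[node]))
                rules.append((conditions, predicted_class))
                return
            name = feature_names[tree.feature[node]]
            threshold = tree.threshold[node]
            walk(tree.children_left[node], conditions + [(name, "<=", threshold)])
            walk(tree.children_right[node], conditions + [(name, ">", threshold)])
        walk(0, [])
    return rules

# Usage sketch, mirroring the experiments (20 shallow trees of maximum depth 4):
# forest = RandomForestClassifier(n_estimators=20, max_depth=4).fit(X, y)
# rules = extract_rules(forest, feature_names)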
3.3 Sub-states

In TD methods without eligibility traces, one Q-value, that of the current state s_t, is updated at each iteration. Instead, we propose a technique to update the states which share similar information with s_t. We augment each Q-value with an ensemble of sub-states, sub_s. Since each state is represented by a pattern, we define the sub-states as its sub-patterns, the combinations of its variables. To avoid a too large number of sub-states, we limit the size of the sub-rules to conjunctions of at least 3 variables. The goal is to get most of the benefits of the shared information among the states while keeping the rest of the Sarsa algorithm intact and efficient. We provide modifications to the Q-value estimation and update, inspired by Sarsa, which allow the use of sub-states.

The estimation of a Q-value Q'(s, a) in Sarsa-rb takes into account the Q-value itself and the values of the sub-states:

Q'(s, a) = Q(s, a) + Σ_{s' ∈ sub_s} Q(s', a)    (9)

with sub_s the sub-states of a state s.

Figure 2: Estimation of a Q-value, Q'(s, a), with the sub-states technique. In addition to the Q-value Q(s, a) itself, the sub-state values Q(s', a) are taken into account.

Figure 2 shows an example of a Q-value estimation. Q(s', a) refers to the estimated value of the sub-state s' given the action a. Adding this term grounds the values of the unvisited states, making their values induced by the values of similar visited states. Note that we limit the weight of the term Q(s', a) in the Q'(s, a) estimation such that Q(s', a) << Q(s, a), to ensure convergence towards an optimal policy. We achieve this mechanism during the update step.

The update process propagates the reward to all similar sub-states, leading to a more frequent and earlier update of the states. Our approach to this problem is to increment the eligibility traces of the similar sub-states.
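A minimal sketch of the aggregated estimate of Equation (9), assuming that states and sub-states are represented as frozensets of variable names; the sub_weight factor is an illustrative way of keeping Q(s', a) << Q(s, a), which the paper enforces during the update step without giving an explicit formula.

from itertools import combinations

def sub_states(pattern, min_size=3):
    """All sub-patterns of a state's pattern with at least `min_size` variables (the paper's cutoff)."""
    variables = sorted(pattern)
    subs = []
    for size in range(min_size, len(variables)):
        subs.extend(frozenset(c) for c in combinations(variables, size))
    return subs

def estimate_q(Q, state_pattern, action, sub_weight=0.01):
    """Equation (9): the state's own Q-value plus the (down-weighted) values of its sub-states."""
    state = frozenset(state_pattern)
    total = Q.get((state, action), 0.0)
    for sub in sub_states(state):
        total += sub_weight * Q.get((sub, action), 0.0)
    return total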
3.4 Eligibility Trace

Directly implementing Sarsa-rb proved to be slow to learn in environments with sparse rewards. Our method, Sarsa-rb(λ), is derived from Sarsa(λ). Adding n-step returns helps to propagate the current reward r_t to the earlier states. We allow a propagation of r_t to the earlier sub-states by changing their eligibility traces. The underlying idea is that a sub-state similar to the current state is likely to get a similar reward by following the same action. The update of the current state s remains unchanged from Sarsa(λ):

E(s, a) = E(s, a) + 1
E(y, a) = E(y, a) + e^{−sim(y, s)},       if y is a sub-state of s
E(y, a) = E(y, a) + e^{−sim(y, s)²} / K,  otherwise    (10)

E(s) denotes the eligibility trace of the state s and E(y) the eligibility trace of the sub-state y. We refer to sim(y, s) as the similarity between the sub-state y and the state s. We compute the similarity score as the number of different variables between a sub-state y and a state s, sim(y, s) = |y ∩ s|. We bound the score between 0 (identical) and 1. Note that we only take into account the sub-states sharing at least two variables.

Since sub-states are often updated, we avoid exploding eligibility trace values by adding an exponential decay and a constant K. This constant must be strictly positive. A high value leads to a small increase of the eligibility traces of the sub-states sharing only a few similar sub-patterns. Updates performed in this manner allow the Q-values to be estimated more accurately. Experiments also indicate that this method decreases the number of necessary visits and yields faster-converging policies.
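One possible implementation of the trace increments in Equation (10) is sketched below. The similarity function counts the fraction of the state's variables missing from y, normalized to [0, 1] so that 0 means identical, which is one reading of the paper's description and should be treated as an assumption; states and sub-states are frozensets of variable names, and K = 100 follows the experiments.

import math

def similarity(y, s):
    """Dissimilarity-style score in [0, 1]: 0 when y contains every variable of s, 1 when it shares none."""
    if not s:
        return 1.0
    return 1.0 - len(y & s) / len(s)

def bump_traces(E, state, action, candidate_substates, K=100.0):
    """Equation (10): increment the trace of the current pair and of related sub-states."""
    E[(state, action)] = E.get((state, action), 0.0) + 1.0
    for y in candidate_substates:
        if len(y & state) < 2:                      # only sub-states sharing at least two variables
            continue
        sim = similarity(y, state)
        if y < state and len(y) >= 3:               # y is a sub-pattern of the current state
            E[(y, action)] = E.get((y, action), 0.0) + math.exp(-sim)
        else:
            E[(y, action)] = E.get((y, action), 0.0) + math.exp(-sim ** 2) / K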
4 Experiments

We evaluated Sarsa-rb(λ) on the OpenAI trading environment, a complex and fluctuating simulation built from real stock market data. The agent observes the last stock price, described by the open price, the close price, and the highest/lowest price during the one-minute interval (Figure 3(b)). We limit the possible actions to Buy, Hold and Sell. The reward is computed according to the win/loss after buying or selling. We consider that a single agent has a limited impact on the stock market price; for this reason, the price is not influenced by the actions of the agent. Each training episode is followed by a testing episode to evaluate the average reward of the agent on another subset of the same stock price. Each episode was played until the training data were consumed, approximately 10^5 iterations.

Figure 3: Example of a sample of data from the environment. (a) An example of OHLC chart (open, high, low, close); the left plot shows the time series. (b) Structure of one observation; the right plot is the structure of one data point, one observation from the environment.

Our system learns to trade on a minutely stock index. In total, we used 4 datasets with durations varying between 2 years and 5 years. We trained the model on one stock index and used the other datasets to generate the rules. Among the training examples, 80% are randomly selected for training the model and the remaining for testing it. We performed a grid search to find the optimal parameters for initializing the Q-values and found that a mean µ = 0.25 and σ = 0.2 were the best parameters. We use K = 100 as the decay factor of the eligibility traces. In the case of manually created rules, we first compute the percentage increase in the share price 14 days later and then estimate an optimal action associated with each pattern. In total, we took into account 40 candlestick patterns. The mined patterns were filtered with C = 5, the minimum number of times a pattern must occur in the training data.

We follow a simplified version of the technique used by [Mashayekhi and Gras, 2015] to generate rules from a random forest. Briefly, we extract the rules top to bottom (root to leaf) and filter the rules to avoid redundancy. In practice, we annotate 6000 samples into 3 classes. Each sample is the aggregation of the last 5 prices. We labeled the dataset according to the price change pdiff 14 days later (pdiff >= 0.5%, pdiff <= -0.5%, -0.5% < pdiff < 0.5%) to train a random forest. We compute pdiff as the average between the open and close price. In order to limit the number of rules, and since the impact on accuracy was minimal, we built 20 trees with a maximum height of 4. In total, we retrieved 855 rules.
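The labeling step described above could be sketched as follows; the pandas column names, the row-based horizon, and the reading of "average between the open and close price" as a mid price are assumptions for illustration.

import pandas as pd

def label_samples(prices: pd.DataFrame, horizon: int, threshold: float = 0.005):
    """Label each observation by the relative price change `horizon` rows ahead
    (here, the number of minutely observations covering 14 days):
    'up' if pdiff >= +0.5%, 'down' if pdiff <= -0.5%, 'flat' otherwise."""
    mid = (prices["open"] + prices["close"]) / 2.0   # one reading of "average between open and close"
    pdiff = (mid.shift(-horizon) - mid) / mid
    labels = pd.Series("flat", index=prices.index)
    labels[pdiff >= threshold] = "up"
    labels[pdiff <= -threshold] = "down"
    return labels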
We analyze the impact of the sub-states technique on the agent. Furthermore, we evaluate Sarsa-rb(λ) and compare the improvement with DQN and DDPG in terms of training speed and in terms of quality of policy.

4.1 Sub-states

In order to better understand the impact of the sub-states on the learning, we analyze and compare Sarsa-rb(λ) with and without sub-states. We also investigate properties of the sub-states of the manually created rules.

Table 1: Comparison of performance in terms of frequency of visit of the states, for Sarsa-rb with and without sub-states.

Settings                                            No sub-states    Sub-states
Average number of updates                           376.789          5873.17
Average duration between two consecutive updates    11715.51         2189.31

Table 1 reports the number of times the Q-values are updated on average. We ran the experiments 10 times for 500 episodes with the same hyper-parameters. The first row shows the average number of times states are updated and the second row shows the average number of steps between two consecutive updates of states. The states are updated +1500% more with the sub-states technique, and the time between two updates is also decreased. We observed that a frequent update of the sub-states leads to a faster convergence of the Q-values.

Figure 4: Comparison of the average number of Q-values visited at least one time over 3 runs. (a) Sarsa-rb(λ) without sub-states. (b) Sarsa-rb(λ) with sub-states.

Figure 4 shows the number of states whose Q-values (or those of their sub-states) are updated at least once over time. At each iteration, we count the number of states, or states with a sub-state, visited. On average, states are updated for the first time much earlier when the sub-states technique is used. Sub-states play an important role in early updates and in the update frequency. Frequently updating the sub-states of a state improves the accuracy of the estimation of its Q-values, which can significantly decrease learning time, especially when the number of states is large.
4.2 Overall Performance

We compared Sarsa-rb(λ) trained with the sub-states mechanism to a deep recurrent Q-learning model [Hausknecht and Stone, 2015] and a DDPG [Lillicrap et al., 2015] model. For this evaluation, we individually tuned the hyper-parameters of each model. We decreased the learning rate from α = 0.3 to α = 0.0001 and the eligibility trace from λ = 0.9 to λ = 0.995, then used ε = 0.01, λ = 0.9405 and K = 100, and we tuned the neural network architectures of DQN and DDPG. The results are obtained by running the algorithms with the same environment hyper-parameters. The plots are averaged over 5 runs. Finally, we used the manually created rules as the states of Sarsa-rb(λ).

We report the learning curves on the testing dataset in Figure 5. Sarsa-rb(λ) always achieves a higher score than DQN and DDPG. As shown in Figure 5, Sarsa-rb(λ) clearly improves over DQN: we obtained an average reward after converging around 3.3 times higher. DDPG appears less fluctuating than Sarsa-rb(λ) but also less effective.

Figure 5: Performance curves for a selection of algorithms: original Deep Q-learning algorithm (red), Deep Deterministic Policy Gradients algorithm (green) and Sarsa-rb(λ) (blue).
5 Conclusion

This paper introduced a new model to combine reinforcement learning and external knowledge. We demonstrated its ability to solve a complex and highly fluctuating task, trading in the stock market. Additionally, this algorithm is fully interpretable and understandable: in a given state, we can explain the impact of each variable and of the patterns on the action selection. Our central thesis is to enhance the state representation of Sarsa(λ) with background knowledge and to speed up learning with a sub-states mechanism. Further benefits stem from efficiently updating eligibility traces. Moreover, our approach can be easily adapted to solve new tasks with a very limited amount of human work. We have demonstrated the effectiveness of our algorithm in decreasing the training time and learning a better and more efficient policy. In the future, we are planning to evaluate our idea with other TD methods. Another challenge is how to generate the rules during the training phase and discard the useless rules to decrease learning time and improve computational efficiency. Finally, we are interested in extending our experiments to new environments such as textual or visual environments.

References

[Barto and Mahadevan, 2003] Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.

[Bellemare et al., 2013] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. 2013.

[Bougie and Ichise, 2017] N. Bougie and R. Ichise. Deep reinforcement learning boosted by external knowledge. ArXiv e-prints, December 2017.

[d'Avila Garcez et al., 2018] A. d'Avila Garcez, A. Resende Riquetti Dutra, and E. Alonso. Towards symbolic reinforcement learning with common sense. ArXiv e-prints, April 2018.

[Dundar et al., ] Murat Dundar, Balaji Krishnapuram, Jinbo Bi, and R. Bharat Rao. Learning classifiers when the training data is not iid.

[Gambardella and Dorigo, 1995] Luca M. Gambardella and Marco Dorigo. Ant-Q: A reinforcement learning approach to the traveling salesman problem. In Machine Learning Proceedings 1995, pages 252–260. Elsevier, 1995.

[Garnelo et al., 2016] Marta Garnelo, Kai Arulkumaran, and Murray Shanahan. Towards deep symbolic reinforcement learning. arXiv preprint arXiv:1609.05518, 2016.

[Hausknecht and Stone, 2015] Matthew Hausknecht and Peter Stone. Deep recurrent Q-learning for partially observable MDPs. 2015.

[Lake et al., 2016] Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, pages 1–101, 2016.

[Lample and Chaplot, 2017] Guillaume Lample and Devendra Singh Chaplot. Playing FPS games with deep reinforcement learning. In Proceedings of AAAI, pages 2140–2146, 2017.

[Lillicrap et al., 2015] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[Louppe, 2014] Gilles Louppe. Understanding random forests: From theory to practice. arXiv preprint arXiv:1407.7502, 2014.

[Mashayekhi and Gras, 2015] Morteza Mashayekhi and Robin Gras. Rule extraction from random forest: the RF+HC methods. In Proceedings of Canadian Conference on Artificial Intelligence, pages 223–237. Springer, 2015.
[Mnih et al., 2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[Nison, 2001] Steve Nison. Japanese candlestick charting techniques: a contemporary guide to the ancient investment techniques of the Far East. Penguin, 2001.

[Pal, 2005] Mahesh Pal. Random forest classifier for remote sensing classification. International Journal of Remote Sensing, 26(1):217–222, 2005.

[Randløv and Alstrøm, 1998] Jette Randløv and Preben Alstrøm. Learning to drive a bicycle using reinforcement learning and shaping. In Proceedings of ICML, volume 98, pages 463–471, 1998.

[Safavian and Landgrebe, 1991] S. Rasoul Safavian and David Landgrebe. A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics, 21(3):660–674, 1991.

[Singh and Sutton, 1996] Satinder P. Singh and Richard S. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1-3):123–158, 1996.

[Sutton and Barto, 1998] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.

[Sutton, 1996] Richard S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems, pages 1038–1044, 1996.

[Tessler et al., 2017] Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J. Mankowitz, and Shie Mannor. A deep hierarchical approach to lifelong learning in Minecraft. In Proceedings of AAAI Conference on Artificial Intelligence, pages 1553–1561, 2017.

[Watkins and Dayan, 1992] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
