Learning To Trade Using Q-Learning
Uirá Caiado
October 15, 2016
Abstract
In this project, I will present an adaptive learning model to trade a
single stock under the reinforcement learning framework. This area of
machine learning consists of training an agent by reward and punishment
without needing to specify the expected action. The agent learns from
its experience and develops a strategy that maximizes its profits. The
simulation results show initial success in using learning techniques to
build algorithmic trading strategies.
1 Introduction
In this section, I will provide a high-level overview of the project, describe the
problem addressed and the metrics used to measure the performance of the
models employed.
1.2 Problem Statement
Algo trading4 strategies are usually programs that follow a predefined set of
instructions to place their orders.
The primary challenge of this approach is building these rules in a way
that can consistently generate profit without being too sensitive to market
conditions. Thus, the goal of this project is to develop an adaptive learning
model that can learn those rules by itself and trade a particular asset using
the reinforcement learning framework in an environment that replays historical
high-frequency data.
As [1] described, reinforcement learning can be considered a model-free
approximation of dynamic programming. The knowledge of the underlying pro-
cesses is not assumed but learned from experience. The agent can access some
information about the environment state, such as the order flow imbalance, the sizes
of the best bid and offer and so on. At each time step t, it should generate
some valid action, such as buying stocks or inserting a limit order at the Ask side. The
agent should also receive a reward or a penalty at each time step if it is already
carrying a position from previous rounds or if it has made a trade (the costs of
the operations are computed as a penalty). Based on the rewards and penalties
it gets, the agent should learn an optimal policy for trading this particular stock,
maximizing the profit it receives from its actions and resulting positions.
This project starts with an overview of the dataset and shows how the en-
vironment states will be represented in Section 2. The same section also dives
into the reinforcement learning framework and defines the benchmark used at the
end of the project. Section 3 discretizes the environment states by transforming
their variables and clustering them into six groups. It also describes the implemen-
tation of the model and the environment, as well as the process of improvement
made upon the algorithm used. Section 4 presents the final model and statistically
compares its performance to the chosen benchmark. Section 5 concludes the
project with some closing remarks and possible improvements.
1.3 Metrics
Different metrics are used to support the decisions made throughout the project.
We use the mean Silhouette Coefficient5 of all samples to justify the clustering
method chosen to reduce the state space representation of the environment. As
exposed in the scikit-learn documentation, this coefficient is computed from the
mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for
each sample. The score for a single sample is given by $s = \frac{b - a}{\max(a, b)}$.
We compute the average of these scores over all samples. The resulting coefficient
varies between the worst case (−1) and the best one (+1).
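To make the computation concrete, here is a tiny sketch of the single-sample score (the helper function is hypothetical, not part of scikit-learn):

```python
def silhouette_sample(a, b):
    """Silhouette score of one sample: s = (b - a) / max(a, b)."""
    return (b - a) / max(a, b)

# A sample far closer to its own cluster (a = 0.5) than to the nearest
# other cluster (b = 5.0) scores close to +1.
assert abs(silhouette_sample(a=0.5, b=5.0) - 0.9) < 1e-9
```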
We use the Sharpe ratio6 to help us understand the performance impact of
different values of the model parameters. The Sharpe ratio is measured on the first
difference (∆r) of the accumulated PnL curve of the model, where the first difference
is defined as $\Delta r_t = PnL_t - PnL_{t-1}$.
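A minimal sketch of this measurement, assuming a plain ratio over the PnL increments (no annualization and no risk-free rate):

```python
import numpy as np

def sharpe_of_pnl(pnl_curve):
    """Sharpe ratio of the first differences of an accumulated PnL curve."""
    delta_r = np.diff(np.asarray(pnl_curve, dtype=float))  # PnL_t - PnL_{t-1}
    return delta_r.mean() / delta_r.std()
```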
Finally, as we shall justify later, the performance of my agent will be com-
pared to the performance of a random agent. These performances will be measured
primarily in Reais (the Brazilian currency) made by the agents.
4 Source: http://goo.gl/b9jAqE
5 Source: https://goo.gl/3MmROJ
6 Source: https://en.wikipedia.org/wiki/Sharpe_ratio
To com-
pare the final PnL of both agents in the simulations, we will perform a one-sided
Welch’s unequal variances t-test7 for the alternative hypothesis that the learning
agent’s expected PnL is greater than that of the random agent. As the
implementation of the t-test in scipy8 is two-sided, to perform
the one-sided test we will divide the p-value by 2, compare it to a critical value
of 0.05, and require that the t-statistic is greater than zero. In the next section, I
will detail the behavior of the learning agent.
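A sketch of how this one-sided comparison could be carried out with scipy (the function name and the lists of simulated PnLs are illustrative):

```python
from scipy import stats

def learner_beats_random(pnl_learner, pnl_random, alpha=0.05):
    """One-sided Welch's t-test: is the learner's expected PnL greater than the random agent's?"""
    t_stat, p_two_sided = stats.ttest_ind(pnl_learner, pnl_random, equal_var=False)
    p_one_sided = p_two_sided / 2.0  # halve the two-sided p-value
    return (t_stat > 0) and (p_one_sided < alpha)
```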
2 Analysis
In this section, I will explore the dataset that will be used in the simulation, de-
fine and justify the inputs employed in the state representation of the algorithm,
explain the reinforcement learning techniques used and describe the benchmark.
The data used in the simulations was collected from Bloomberg. There are
45 files, each one with 110,000 rows on average, resulting in 5,631,273 rows in
total and almost 230 MB of information. Figure 1 shows the structure of one
of them.
Each file is composed of four different fields. The column Date is the time-
stamp of the row and has a precision of seconds. Type is the kind of information
that the row encompasses: the type ’TRADE’ relates to an actual trade that
has happened, ’BID’ is related to changes in the best Bid level and ’ASK’, to
the best Offer level. Price is the current best bid or ask (or the traded price) and
Size is the accumulated quantity at that price and side.
All this data will be used to create the environment where my agent will
operate. This environment is an order book, where the agent will be able to
insert limit orders and execute trades at the best prices. The order book (table
2) is represented by two binary trees, one for the Bid and the other for the Ask
side. As can be seen in the table below, the nodes of these trees are sorted by
price (price level) in ascending order on the Bid side and descending order on
the Ask side. At each price level, there is another binary tree composed of the
orders placed by the agents, which are sorted by time of arrival. The first order
to arrive is the first order filled when a new trade comes in.
The environment will answer with the agent’s current position and Profit
and Loss (PnL) every time the agent executes a trade or has an order filled.
The cost of the trade will be accounted as a penalty.
The agent will also be able to sense the state of the environment every two
seconds and include it in its own state representation. So, this internal state will
be represented by a set of variables about the current situation of the market
and the state of the agent, given in table 3.
Table 3: State variables of the agent.

Name        Type     Description
qOFI        integer  Net order flow in the last 10 seconds
book_ratio  float    Bid size over the Ask size
position    integer  Current position of my agent
OrderBid    boolean  Whether the agent has an order at the bid side
OrderAsk    boolean  Whether the agent has an order at the ask side

Regarding the Order Flow Imbalance (OFI), there are many ways to measure it.
[2] argued that the order flow imbalance is a measure of supply/demand imbalance
and defined it as a sum of individual event contributions:

$\mathrm{OFI}_k = \sum_{n = N(t_{k-1})+1}^{N(t_k)} e_n$

where $N(t_{k-1}) + 1$ and $N(t_k)$ are the indices of the first and last events in the
interval. The term $e_n$ was defined by the authors as a measure of the contribution of
the n-th event to the sizes of the bid and ask queues:

$e_n = \mathbb{1}_{\{P^B_n \ge P^B_{n-1}\}}\, q^B_n - \mathbb{1}_{\{P^B_n \le P^B_{n-1}\}}\, q^B_{n-1} - \mathbb{1}_{\{P^A_n \le P^A_{n-1}\}}\, q^A_n + \mathbb{1}_{\{P^A_n \ge P^A_{n-1}\}}\, q^A_{n-1}$

where $q^B_n$ and $q^A_n$ are the accumulated quantities at the best bid and ask at
event n, the subscript n − 1 refers to the previous observation, and $\mathbb{1}$ is an
indicator11 function.
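A minimal sketch of how a single event contribution could be computed from two consecutive best-quote snapshots (the function and argument names are illustrative, not the project's actual code):

```python
def event_contribution(bid_px, bid_qty, ask_px, ask_qty,
                       prev_bid_px, prev_bid_qty, prev_ask_px, prev_ask_qty):
    """Contribution e_n of one book event to the order flow imbalance."""
    e_n = 0.0
    if bid_px >= prev_bid_px:   # bid did not move down: add current bid size
        e_n += bid_qty
    if bid_px <= prev_bid_px:   # bid did not move up: subtract previous bid size
        e_n -= prev_bid_qty
    if ask_px <= prev_ask_px:   # ask did not move up: subtract current ask size
        e_n -= ask_qty
    if ask_px >= prev_ask_px:   # ask did not move down: add previous ask size
        e_n += prev_ask_qty
    return e_n

# The OFI of a 10-second window is then the sum of e_n over all events in it.
```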
The figure below plots the 10-second log-return of PETR4 against the
contemporaneous OFI. The log-return12 is defined as $r_t = \ln \frac{P_t}{P_{t-1}}$,
where $P_t$ is the current price of PETR4 and $P_{t-1}$ is the previous one.
As described by [2] in a similar test, figure 2 suggests that the order flow
imbalance is a strong driver of high-frequency price changes, and this variable
will be used to describe the current state of the order book.
11 Source: https://en.wikipedia.org/wiki/Indicator_function
12 Source: https://quantivity.wordpress.com/2011/02/21/why-log-returns/
2.2 Algorithms and Techniques
Based on [3], algo trading might be conveniently modeled in the framework
of reinforcement learning. As suggested by [3], this framework adjusts the pa-
rameters of an agent to maximize the expected payoff or reward generated by
its actions. Thus, the agent learns a policy that tells it the actions it
must perform to achieve its best performance. This optimal policy is exactly
what we hope to find when we are building an automated trading strategy.
According to [1], Markov decision processes (MDPs) are the most common
model when implementing reinforcement learning. The MDP model of the envi-
ronment consists, among other things, of a discrete set of states S and a discrete
set of actions taken from A. In this project, depending on the position of the
learner (long or short), at each time step t it will be allowed to choose an action
$a_t$ from different subsets of the action space A, which consists of six possible
actions:
$a_t \in (none, buy, sell, best\_bid, best\_ask, best\_both)$
Where none indicates that the agent shouldn’t have any order in the market.
Buy and sell mean that the agent should execute a market order to buy or sell
100 stocks (the size of an order will always be a hundred shares). This kind of
action will only be allowed subject to a trailing stop13 of 4 cents: when the
agent is losing more than 4 cents from the maximum PnL of a given position,
it may choose between stopping the position by sending a market order, trying to close
it using a limit order, or doing nothing. best_bid and best_ask indicate that the
agent should keep an order at the best price only on the mentioned side, and best_both
that it should have orders at the best price on both sides.
So, at each discrete time step t, the agent senses the current state $s_t$ and
chooses to take an action $a_t$. The environment responds by providing the agent
a reward $r_t = r(s_t, a_t)$ and by producing the succeeding state $s_{t+1} = \delta(s_t, a_t)$.
The functions r and δ depend only on the current state and action (they are mem-
oryless14), are part of the environment and are not necessarily known to the
agent.
The task of the agent is to learn a policy π that maps each state to an action
(π : S → A), selecting its next action $a_t$ based solely on the current observed
state $s_t$, that is, $\pi(s_t) = a_t$. The optimal policy, or control strategy, is the one
that produces the greatest possible cumulative reward over time. So, we can
state that:

$V^\pi(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$

where $V^\pi(s_t)$ is also called the discounted cumulative reward. It rep-
resents the cumulative value achieved by following a policy π from an initial
state $s_t$, and $\gamma \in [0, 1]$ is a constant that determines the relative value of delayed
versus immediate rewards.
If we set γ = 0, only immediate rewards are considered. As γ → 1, future
rewards are given greater emphasis relative to immediate rewards. The optimal
policy π∗ that maximizes $V^\pi(s_t)$ for all states s can be written as:

$\pi^* = \arg\max_\pi V^\pi(s), \quad \forall s$
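As a small worked example of the discounted cumulative reward (the helper below is hypothetical):

```python
def discounted_return(rewards, gamma):
    """V(s_t) = sum_i gamma**i * r_{t+i} for a finite sequence of rewards."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# With rewards [1, 1, 1] and gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
assert discounted_return([1, 1, 1], 0.5) == 1.75
```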
13 Source: https://goo.gl/SVmVzJ
14 Source: https://en.wikipedia.org/wiki/Markov_process
However, learning π ∗ : S → A directly is difficult because the available
training data does not provide training examples of the form (s, a). Instead,
as [4] explained, the only available information is the sequence of immediate
rewards r(si , ai ) for i = 1, 2, 3, ...
So, as we are trying to maximize the cumulative reward $V^*(s_t)$ for all states
s, the agent should prefer $s_1$ over $s_2$ whenever $V^*(s_1) > V^*(s_2)$. Given that the
agent must choose among actions and not states, and that it isn’t able to perfectly
predict the immediate reward and the immediate successor of every possible state-
action transition, we must also learn $V^*$ indirectly.
To solve that, we define a function Q(s, a) such that its value is the maximum
discounted cumulative reward that can be achieved starting from state s and
applying action a as the first action. So, we can write:

$Q(s, a) = r(s, a) + \gamma V^*(\delta(s, a))$

As δ(s, a) is the state resulting from applying action a to state s (the suc-
cessor) chosen by following the optimal policy, $V^*$ is the cumulative value of the
immediate successor state discounted by a factor γ. Thus, what we are trying
to achieve is

$\pi^*(s) = \arg\max_a Q(s, a)$
This implies that the optimal policy can be obtained even if the agent just uses
the current action a and state s and chooses the action that maximizes Q(s, a).
It is also important to notice that the function above implies that the agent
can select optimal actions even when it has no knowledge of the functions r and
δ.
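A minimal sketch of this greedy policy extraction over a tabular Q (the dictionary layout, keyed by (state, action), is an assumption):

```python
def greedy_action(q_table, state, allowed_actions):
    """pi*(s) = argmax over the allowed actions a of Q(s, a)."""
    return max(allowed_actions, key=lambda a: q_table.get((state, a), 0.0))
```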
Lastly, according to [4], there are some conditions that ensure that re-
inforcement learning converges toward an optimal policy. On a deterministic
MDP, the agent must select actions in a way that it visits every possible state-
action pair infinitely often. This requirement can be a problem in the environ-
ment where the agent will operate.
As most of the inputs suggested in the last subsection were defined in an infinite
space, in Section 3 I will discretize them before using them to train my
agent, hopefully keeping the state space representation manageable. We will also
see how [4] defined a reliable way to estimate training values for Q, given
only a sequence of immediate rewards r.
2.3 Benchmark
In 1988, the Wall Street Journal created a Dartboard Contest15 , where Journal
staffers threw darts at a stock table to select their assets, while investment
experts picked their own stocks. After six months, they compared the results
of the two methods. Adjusting the results for risk level, they found out that the
pros had barely beaten the random pickers.
Given that, the benchmark used to measure the performance of the learner
will be the amount of money made by a random agent. So, my goal will be to
outperform this agent, which should just produce some random action from the set
of allowed actions taken from A at each time step t.
15 Source: http://www.automaticfinances.com/monkey-stock-picking/
Just like my learner’s, the set of actions can change over time depending on
the agent’s open position, which is limited to 100 stocks at most, on either side.
When it reaches its limit, it will only be allowed to perform actions that decrease
its position. So, for instance, if it is already long16 100 shares, the possible
moves would be (none, sell, best_ask). If it is short17, it can only perform
(none, buy, best_bid). A sketch of this benchmark appears below.
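The sketch assumes the action names from Section 2.2 and a position limit of 100 shares; the helper names are illustrative:

```python
import random

ALL_ACTIONS = ['none', 'buy', 'sell', 'best_bid', 'best_ask', 'best_both']

def allowed_actions(position, limit=100):
    """Actions available given the current open position (at most 100 shares on either side)."""
    if position >= limit:    # already long: only actions that decrease the position
        return ['none', 'sell', 'best_ask']
    if position <= -limit:   # already short: only actions that decrease the position
        return ['none', 'buy', 'best_bid']
    return ALL_ACTIONS

def random_agent_action(position):
    """The benchmark agent just samples uniformly from the allowed actions."""
    return random.choice(allowed_actions(position))
```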
The performance will be measured primarily by the money made by the
agents (which is what the learner will optimize). First, I will analyze whether the learning
agent was able to improve its performance on the same dataset over different
trials. Later on, I will use the policy learned to simulate the learning agent’s
behavior on a different dataset and then compare the final Profit and Loss
of both agents. All data analyzed will be obtained by simulation.
As a last reference, in the final section we will also compare the total
return of the learner to a buy-and-hold strategy in BOVA11 and in the stock
traded, to check whether we are consistently beating the market and not just being
profitable, as the Udacity reviewer noted.
3 Methodology
In this section, I will discretize the input space and implement an agent to learn
the Q function.
The scales of the variables are very different and, in the case of the Book Ratio,
its distribution is heavily skewed.
16 Source: https://goo.gl/GgXJgR
17 Source: https://goo.gl/XFR7q3
I will apply a logarithmic transformation to this variable and re-scale both
to lie between a given minimum and maximum value of each feature using the
function MinMaxScaler18 from scikit-learn. So, both variables will be scaled to
lie between 0 and 1 by applying the formula $z_i = \frac{x_i - \min X}{\max X - \min X}$,
where z is the transformed variable, $x_i$ is the value to be transformed and X is
a vector with all the x that will be transformed. The result of the transformation
can be seen in figure 4.
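A sketch of the two transformations with scikit-learn, using made-up values in place of the real features:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative values only; in the project these come from the book snapshots.
ofi = np.array([[-1200.0], [300.0], [2500.0]])
book_ratio = np.array([[0.12], [1.0], [8.3]])

# Log-transform the Book Ratio, then re-scale both columns to [0, 1].
features = np.hstack([ofi, np.log(book_ratio)])
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(features)
# Each column now follows z_i = (x_i - min X) / (max X - min X).
```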
As mentioned before, in an MDP environment the agent must visit every
possible state-action pair infinitely often. If I just bucketize the variables and
combine them, I will end up with a huge number of states to explore. So, to
reduce the state space, I am going to group those variables using the K-Means and
Gaussian Mixture Model (GMM) clustering algorithms. Then I will quantify the
”goodness” of the clustering results by calculating each data point’s silhouette
coefficient19. The silhouette coefficient for a data point measures how similar it
is to its assigned cluster, and it varies from −1 (dissimilar) to 1 (similar). In
figure 5, I calculate the mean silhouette coefficient for K-Means and
GMM using different numbers of clusters. I also test different covariance
structures for the GMM.
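A sketch of how this comparison could be run with scikit-learn (the function name and the cluster range are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

def compare_clusterers(X, cluster_range=range(2, 9)):
    """Mean silhouette coefficient of K-Means and GMM for several numbers of clusters."""
    scores = {}
    for k in cluster_range:
        km_labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
        gmm = GaussianMixture(n_components=k, covariance_type='full', random_state=0).fit(X)
        scores[k] = (silhouette_score(X, km_labels), silhouette_score(X, gmm.predict(X)))
    return scores
```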
The maximum score was achieved using two clusters. However, I believe
that the market can’t be simplified that much. So, I will use K-Means with
six centroids to group the variables. In figure 6 we can see how the algorithm
classified the data. Also, in table 4, the centroids are shown in their original
scales.
Figure 6: Clusters Found.
Curiously, the algorithm gave more emphasis to the BOOK_RATIO when
its value was very large (the bid size almost eight times greater than the ask
size) or tiny (the bid size about one tenth of the ask size). The other clusters
seem mostly dominated by the OFI. In the next subsection, I will discuss how I
have implemented Q-learning, how I intend to perform the simulations, and
run some tests.
3.2 Implementation
As we have seen, learning the Q function corresponds to learning the optimal
policy. According to [5], the optimal state-action value function Q∗ is defined
for all (s, a) ∈ S × A as the expected return for taking the action a ∈ A at the
state s ∈ S and following the optimal policy thereafter. So, it can be written as [4] suggested:

$Q^*(s, a) = r(s, a) + \gamma \max_{a'} Q^*(\delta(s, a), a')$

The recursive nature of the function above implies that our agent doesn’t
know the actual Q function; it can only estimate Q, which we will refer to as Q̂. It
will represent its hypothesis Q̂ as a large table that attributes to each pair (s, a)
a value for Q̂(s, a), the current hypothesis about the actual but unknown
value Q(s, a). I will initialize this table with zeros, but it could be filled with
random numbers, according to [4]. Still according to him, the agent should repeatedly
observe its current state s and apply algorithm 1.
Algorithm 1 Update Q-table
1: loop: observe the current state s and the allowed actions A∗, then:
2:   Choose some action a and execute it
3:   Receive the immediate reward r = r(s, a)
4:   Initialize the table entry Q̂(s, a) to zero if there is no entry (s, a)
5:   Observe the new state s′ = δ(s, a)
6:   Update the table entry for Q̂(s, a) following Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)
7:   s ← s′

One issue with the proposed strategy is that the agent could over-commit
to actions that presented positive Q̂ values early in the simulation, failing to
explore other actions that could have even higher values. [4] proposed using a
probabilistic approach to select actions, assigning higher probabilities to actions
with high Q̂ values, but giving every action at least a nonzero probability. So,
I will implement the following relation:

$P(a_i \mid s) = \frac{k^{\hat{Q}(s, a_i)}}{\sum_j k^{\hat{Q}(s, a_j)}}$

Where $P(a_i \mid s)$ is the probability of selecting the action $a_i$ given the state
s. The constant k is positive and determines how strongly the selection favors
actions with high Q̂ values.
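The sketch below illustrates how the Q̂ table, the update rule of algorithm 1 and this selection rule fit together; the class layout and default parameters are illustrative, not the project's actual implementation:

```python
import random
from collections import defaultdict

class QLearner:
    """Tabular Q-learning with the k-based probabilistic action selection."""

    def __init__(self, gamma=0.5, k=0.3):
        self.gamma = gamma           # discount factor
        self.k = k                   # how strongly selection weights actions by their Q-hat value
        self.q = defaultdict(float)  # Q-hat table keyed by (state, action), initialized to zero

    def choose_action(self, state, allowed_actions):
        """P(a_i | s) proportional to k ** Q_hat(s, a_i)."""
        weights = [self.k ** self.q[(state, a)] for a in allowed_actions]
        draw, acc = random.uniform(0, sum(weights)), 0.0
        for action, w in zip(allowed_actions, weights):
            acc += w
            if draw <= acc:
                return action
        return allowed_actions[-1]

    def update(self, state, action, reward, next_state, next_allowed):
        """Q_hat(s, a) <- r + gamma * max over a' of Q_hat(s', a')."""
        best_next = max(self.q[(next_state, a)] for a in next_allowed) if next_allowed else 0.0
        self.q[(state, action)] = reward + self.gamma * best_next
```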
Ideally, to optimize the policy found, the agent should iterate over the same
dataset repeatedly until it is not able to improve its PnL. The policy learned
should then be tested against the same dataset to check its consistency. Lastly,
this policy will be tested on the day subsequent to the training session. So,
before performing the out-of-sample test, we will use the following procedure:

Each training session will include data from the largest part of a trading
session, starting at 10:30 and closing at 16:30. Also, the agent will be allowed
to hold a position of at most 100 shares (long or short). When the
training session is over, all positions of the learner will be closed out, so the
agent always starts a new session without carrying positions.
The agent will be allowed to take an action every 2 seconds and, due to this
delay, every time it decides to insert limit orders, it will place them 1 cent worse
than the best price. So, if the best bid is 12.00 and the best ask is 12.02 and
the agent chooses the action best_both, it should include a buy order at
11.99 and a sell order at 12.03. It will be allowed to cancel these orders after 2
seconds. However, if these orders are filled in the meantime, the environment
will inform the agent so it can update its current position. Even so, it will only
take new actions after those 2 seconds have passed.
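A tiny sketch of this price rule for limit orders (a hypothetical helper, assuming a 1-cent tick):

```python
def limit_order_prices(best_bid, best_ask, tick=0.01):
    """Place limit orders one tick worse than the current best quotes."""
    return round(best_bid - tick, 2), round(best_ask + tick, 2)

# Example from the text: best bid 12.00 / best ask 12.02 -> buy at 11.99, sell at 12.03
assert limit_order_prices(12.00, 12.02) == (11.99, 12.03)
```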
One of the biggest complications of the approach proposed in this project was
to find a reasonable representation of the environment state that wasn’t too
big to visit each state-action pair sufficiently often, but was still useful in the
learning process. As exposed above, we solved that by clustering the inputs.
In the next subsection, I will try different configurations of k and γ to try
to improve the performance of the learning agent over the same trial.
3.3 Refinement
As mentioned before, we should iterate over the same dataset and check the
policy learned on the same observations until convergence. Given the time
required to perform each train-test iteration, ”until convergence” will be 10
repetitions. We are going to train the model on the dataset from 08/15/2016.
After each iteration, we will check how the agent would perform using the policy
it has just learned. The agent will use γ = 0.7 and k = 0.3 in the first training
session. Figure 7 shows the results of the first round of iterations.
The curve Train in the charts is the PnL obtained during the training session,
when the agent was allowed to explore new actions randomly. Test is the
PnL obtained using strictly the policy learned.
Although the agent was able to profit at the end of every single round,
”convergence” is something that I cannot claim. For instance, the PnL was
worse in the last round than in the first one. I believe this stability of results
is difficult to obtain in day-trading. For example, even if the agent thinks that
it should buy before the market goes up, whether its order is filled does not
depend on its will.
We will focus on improving the final PnL of the agent. However, less vari-
ability in the results is also desired, especially at the beginning of the day, when the
strategy hasn’t made any money yet. So, we will also look at the Sharpe ratio20
of the first difference of the accumulated PnL produced by each configuration.
20 Source: https://en.wikipedia.org/wiki/Sharpe_ratio
First, we are going to iterate through some values of k and look at the
performance in the training phase during the first hours of the training session. We
will also use just 5 iterations here to speed up the tests.
When the agent was set to use k = 0.8 and k = 2.0, it achieved very similar
results and Sharpe ratios. As the variable k controls the likelihood of the agent
trying new actions based on the Q̂ values already observed, I will prefer the smaller
value because it improves the agent’s chance to explore. In figure 9 we
perform the same analysis varying only γ.
As explained before, when γ approaches one, future rewards are given greater
emphasis relative to the immediate reward. When it is zero, only immediate rewards
are considered. Despite the fact that the best parameter was γ = 0.9, I am
not comfortable giving so little attention to immediate rewards; it sounds
dangerous when we are talking about stock markets. So, I will arbitrarily choose
γ = 0.5 for the next tests.
In figure 10, an agent was trained using γ = 0.5 and k = 0.8, and its
performance in the out-of-sample test is compared to the previous implementation.
In this case, the dataset from 07/16/2016 was used. The current configuration
improved the performance of the model. We will discuss the final results in the
next section.
Figure 10: PnL From The First vs. Second Configuration.
4 Results
In this section, I will evaluate the final model, test its robustness and compare
its performance to the benchmark established earlier.
The model was able to make money on two different days after being trained
on the session previous to each day. The performance on the third day was
pretty bad. However, even after losing a lot of money at the beginning of the day,
the agent was able to recover most of its losses by the end of the session.
Looking just at this data, the performance of the model seems very un-
stable and a little disappointing. In the next subsection, we will see why it is
not that bad.
4.2 Justification
Lastly, I am going to compare the final model with the performance of a random
agent. We are going to compare the performance of those agents in the out-of-
sample tests.
As the learning agent strictly follows the policy learned, I will simulate the
operations of this agent on the tested datasets just once; even if I ran more trials,
the return would be the same. However, I will simulate the
operations of the random agent 20 times on each dataset. As this agent can
take any action at each run, its performance can be very good or very bad. So,
I will compare the performance of the learning agent to the average performance
of the random agent.
In figure 12 we can see how much money, in Reais (R$), each one has
made on the first dataset used in this project, from 08/16/2016. The learning
agent was trained using data from 08/15/2016, the previous day.
Figure 13: Performance from Agents in different Days.
The difference between the agents’ final PnL was statistically significant in favor
of the learner (p-value < 0.000). Curiously, on the worst day of the test, the random agent
also performed poorly, suggesting that it wasn’t a problem with my agent, but
something that happened in the market.
I believe these results are encouraging because they suggest that, using
the same learning framework on different days, we can successfully find practical
solutions that adapt well to new circumstances.
5 Conclusion
In this section, I will discuss the final result of the model, summarize the entire
problem solution and suggest some improvements that could be made.
Figure 14: Performance from Agents in different Days.
them against a different dataset. Finally, after we selected the best parame-
ters, we trained the model on different days and tested it against the subsequent
sessions.
We compared these results to the returns of a random agent and concluded
that our model was significantly better during the period of the tests.
One of the most interesting parts of this project was defining the state
representation of the environment. I found that when we increase the state
space too much, it becomes very hard for the agent to learn an acceptable policy in
the number of trials we have used. The number of trials was mostly
determined by the time each run took (several minutes).
It was interesting to see that, even clustering the variables using k-means,
the agent was still capable of using the resulting clusters to learn something
useful from the environment.
Building the environment was the most difficult and challenging part of the
entire project. Not only was finding an adequate structure to build the order book
non-trivial, but making the environment operate it correctly was also hard. It
has to manage different orders from various agents and ensure that each agent
can place, cancel or fill orders (or have its orders filled) in the right sequence.
Overall, I believe that the simulation results have shown initial success in
bringing reinforcement learning techniques to the building of algorithmic trading strate-
gies. Developing a strategy that doesn’t perform any arbitrage21 and still never loses
money is something very unlikely to achieve. This agent was able to mimic the
performance of an average random agent sometimes and outperform it at other
times. In the long run, that would be good enough.
5.2 Improvement
Many areas could be explored to improve the current model and refine the
test results. I wasn’t able to achieve a stable solution using Q-Learning, and I
believe that this is mostly due to the non-deterministic nature of the problem. So, we
could test Recurrent Reinforcement Learning22, for instance,
21 Source: https://en.wikipedia.org/wiki/Arbitrage
22 Source: https://goo.gl/4U4ntD
which [3] argued could outperform Q-learning in terms of stability and computational
convenience.
Also, I believe that different state representations should be tested much
more deeply. The state observed by the agent is one of the most relevant aspects of
a reinforcement learning problem, and there are probably better representations
for the given task than the one used in this project.
Another future extension to this project could include a more realistic
environment, where other agents respond to the actions of the learning agent.
Lastly, we could test other reward functions for the problem posed. It would be
interesting, for example, to include some future information in the response of the
environment to the actions of the agent, to see how it would affect the policies
learned.
References
[1] Nicholas Tung Chan and Christian Shelton. An electronic market-maker.
Technical Report AI-MEMO-2001-005, MIT, AI Lab, 2001.
[2] Rama Cont, Arseniy Kukanov, and Sasha Stoikov. The price impact of order
book events. Journal of financial econometrics, 12(1):47–88, 2014.
[3] Xin Du, Jinjian Zhai, and Koupin Lv. Algorithm trading using q-learning
and recurrent reinforcement learning.
[4] T.M. Mitchell. Machine Learning. McGraw-Hill International Editions.
McGraw-Hill, 1997.