Simulating Market Maker Behavior Using Deep Reinforcement Learning To Understand Market Microstructure
MARCUS ELWIN
Abstract
Market microstructure studies the process of exchanging assets under
explicit trading rules. With algorithmic trading and high-frequency
trading, modern financial markets have seen profound changes in mar-
ket microstructure in the last 5 to 10 years. As a result, previously established methods in the field of market microstructure often become faulty or insufficient. Machine learning and, in particular, reinforce-
ment learning has become more ubiquitous in both finance and other
fields today with applications in trading and optimal execution. This the-
sis uses reinforcement learning to understand market microstructure
by simulating a stock market based on NASDAQ Nordics and training
market maker agents on this stock market.
Simulations are run on both a dealer market and a limit order book market, differentiating this work from previous studies. DQN and PPO algorithms are used on these simulated environments, where stochastic optimal control theory has mainly been used before. The market maker agents suc-
cessfully reproduce stylized facts in historical trade data from each sim-
ulation, such as mean reverting prices and absence of linear autocorrelations
in price changes as well as beating random policies employed on these
markets with a positive profit & loss of maximum 200%. Other trad-
ing dynamics in real-world markets have also been exhibited via the
agents' interactions, mainly: bid-ask spread clustering, optimal inventory management, declining spreads and independence of inventory and spreads, indicating that reinforcement learning with PPO and DQN is a relevant choice when modelling market microstructure.
Sammanfattning
Market microstructure studies how the exchange of financial assets takes place according to explicit rules. Algorithmic and high-frequency trading have changed the structures of modern financial markets during the last 5 to 10 years. This has also affected the reliability of previously used methods, for example from econometrics, for studying market microstructure. Machine learning and reinforcement learning have become more popular, with many different areas of application both in finance and in other fields. Within finance, these types of methods have mainly been used for trading and optimal execution of orders. In this thesis, reinforcement learning and market microstructure are combined in order to simulate a stock market based on NASDAQ in the Nordics. Market maker agents are trained there via reinforcement learning, with the goal of understanding the market microstructure that arises through the agents' interactions.
Acknowledgements
I would like to express my gratitude to my thesis advisor Hamid Reza Faragardi as well as my examiner Elena Troubitsyna, for your support and helpful comments throughout the course of the the-
sis. I am profoundly grateful to Björn Hertzberg at Nasdaq for intro-
ducing me to both the field of market microstructure and the thesis
subject. My research would have been impossible without the aid and
support of Jaakko Valli at Nasdaq. Finally I would like to thank my
family and girlfriend for having patience with me during this long
process.
Contents
1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Objective
  1.4 Delimitation
  1.5 Methodology
  1.6 Contribution
  1.7 Societal and sustainability issues
  1.8 Outline of the thesis
2 Background
  2.1 Market Microstructure
    2.1.1 Market Participants
    2.1.2 Trading Mechanisms
    2.1.3 Price Formation & Discovery
    2.1.4 Inventory-based models
  2.2 Artificial Financial Markets
  2.3 Reinforcement Learning
    2.3.1 The Main Concepts
    2.3.2 Exploration versus Exploitation
    2.3.3 Algorithms & Learning in RL
    2.3.4 Issues with deep RL
3 Related Work
  3.1 Agent based Financial Markets
  3.2 Market Microstructure
  3.3 Reinforcement Learning in Finance
4 Research Summary
  4.1 Research Methodology
    4.1.1 Research Question
    4.1.2 Research Goals
    4.1.3 Research Challenges
  4.2 Research Methods
    4.2.1 Literature Study
    4.2.2 Implementation, Experiments & Evaluation
  4.3 Validity
    4.3.1 Construct Validity
    4.3.2 Internal Validity
    4.3.3 Conclusion Validity
  4.4 Ethics
5 Implementation
  5.1 Overview
  5.2 Environments & Experiments
    5.2.1 DealerMarket-v1
    5.2.2 DealerMarket-v2
    5.2.3 LOBMarket-v1
  5.3 Implementation
    5.3.1 Neural Network Models
    5.3.2 Software & Hardware
  5.4 Visualization
  5.5 Evaluation
6 Results
  6.1 Stylized Facts of Data
  6.2 Agent's Rewards
  6.3 Price Impact Regressions
  6.4 Bid-Ask Spread & Inventory
  6.5 Profit & Loss
  6.6 Visualization of agents states & actions
7 Discussion & Conclusions
  7.1 Discussions
  7.2 Conclusions
Bibliography
List of Figures
2.1 Snapshot of the LOB for the ticker ORCL (Oracle) after the 10 000th event during that day. Blue bars indicate sell limit orders, whilst red bars are buy limit orders. Source: Cartea, Jaimungal, and Penalva [13]
2.2 Three components of bid-ask spread in short-term and long-term response to a market buy order. Source: Foucault, Pagano, and Röell [24]
2.3 Example of volatility signature plots based on [8], showing the expected pattern for mean reverting, trending and uncorrelated returns.
2.4 Basic overview of the reinforcement learning setting with an agent interacting via actions (A_t) with its environment, moving through states (S_t), and gaining different rewards (R_t). Source: Sutton and Barto [61]
Chapter 1
Introduction
1.1 Background
Modern financial markets such as NASDAQ, CME and NYSE have all been affected by the rise and presence of Algorithmic Trading and High-Frequency Trading (HFT) [2], which are causing, for instance, more
fragmented markets. Both types of trading consist of using computer
programs to implement investment and trading strategies [1]. These
strategies have, according to Abergel [1] and O’Hara [49], raised var-
ious questions about their effects on the financial markets, mainly in
such areas as: liquidity, volatility, price discovery, systematic risk, manipu-
lation and market organization. A recent example of the effects of algorithmic trading and HFT on financial markets is the Flash Crash of 6 May 2010. In the course of 30 minutes, U.S. stock mar-
ket indices, stock-index futures, options, and exchange-traded funds,
experienced a sudden price drop of more than five percent, followed
by a rapid rebound [38, 37] (see an illustration of this in Figure 1.1).
1.2 Problem
Due to the abundance of available high frequency data, fragmented
markets and more sophisticated trading algorithms, the financial mar-
kets have become harder to understand during the past decade. Traditional methods used in market microstructure might have become obsolete, as mentioned in O’Hara [49]. A traditional supervised learning approach is not suitable for this thesis, because financial markets are dynamic complex systems; the agents must be able to adapt and dynamically learn optimal behaviour. Therefore, the problem in this thesis is
to understand trading dynamics, by using reinforcement learning, on
a simulated Nordic stock market.
1.3 Objective
The objective of this thesis is two-fold. Firstly, for the principal, the objective is to have a functional exchange simulator (EXSIM), where different parameters, policies, reward functions and other things affecting the market structure can be changed, in order to study and simulate modern financial markets on a microscopic level. Secondly, from the thesis' point of view, the objective is to investigate the research question and goals stated in chapter 4.
1.4 Delimitation
This thesis does not use any real-world data for the simulations and
experiments conducted. Instead, only simulated data from the agent’s interaction with its environment is used. This is due to security constraints at the principal. No multiagent reinforcement learning has
been used, due to time constraints. Instead, several different single
agents are tested and evaluated.
1.5 Methodology
The methodology in the thesis is empirical with the focus on the quan-
titative approach. Data are collected from the agent’s interactions with
the environments through various experiments testing different sce-
narios, behaviours and collecting different statistics. More information
about the methodology is found in chapter 4 and chapter 5.
1.6 Contribution
The contribution of this thesis is to show the use of reinforcement learning in more complex environments such as financial markets, with a more realistic limit order book market compared to previous studies. The aim is to provide new insights and methods in the field of
market microstructure. The target audience is both academia and the
industry, which can benefit from this work.
Chapter 2
Background
1. Price formation and discovery, i.e., looking into the black box of the market and seeing how latent demands are translated into prices.
2. Market structure and design, i.e., what rules exist, and how they affect the black box of the market.
3. Information and disclosure, i.e., how the inner workings of the black box of the market affect the behaviour of traders and strategies.
All of these will be covered in this section, in order to give the reader a comprehensive overview of how modern financial markets operate. However, for the interested reader, more details regarding market microstructure can be found in O’Hara [50], Harris [29], Bouchaud et al. [8], Abrol, Chesir, and Mehta [2], and Abergel [1].
Market makers will be the main participant covered in this thesis. This is because they play an important role in providing liquidity to financial markets, albeit with the dual objective of optimizing their inventory of stocks and making a profit for themselves.
1 Traders communicate their buying and selling intentions via an auctioneer.
Figure 2.1: Snapshot of the LOB for the ticker ORCL (Oracle) after the
10 000th event during that day. Blue bars indicate sell limit orders,
whilst red bars are buy limit orders. Source: Cartea, Jaimungal, and
Penalva [13]
In the limit order market, bids and offers are accumulated in the
limit order book (LOB). The orders in the LOB are accumulated by firstly
price priority and secondly time priority [30]. In dealer markets, par-
ticipants can only trade at the bid and ask quotes posted by specialized
intermediaries, i.e., dealers or market makers [24]. Note that the LOB is
very dynamic as it consists of limit orders, which can be cancelled or modified at any time. Thus, the state of the LOB can change extremely often [30]. For this thesis this means that the agents will have
a very large state space, with many possible actions that they need to
explore, in order to find optimal policies. More formally, according
to Bouchaud et al. [8], one can see an order x as a tuple consisting of sign/direction (\epsilon_x), price (p_x), volume (v_x) and submission time (t_x):
x = (\epsilon_x, p_x, v_x, t_x)    (2.1)
Both the tick size and lot size affect trading, where the lot size dictates the smallest permissible order size. The tick size \vartheta dictates how much more expensive it is for a trader to gain priority by choosing a higher price for a buy order or a lower price for a sell order [8]. Sometimes it is also useful to consider the relative tick size \vartheta_r, which is equal to \vartheta divided by the mid-price of a given asset [8]. To make things more complicated, modern markets have many different order types, such as hidden, reserve, iceberg and fill-or-kill orders, to mention a few; [24, 13, 30] give good overviews of these. There also exist hybrid markets, such as the traditionally quote-driven NASDAQ and London Stock Exchange (LSE) [24].
Market makers quote two prices: the bid price and the ask price, where
the difference between these is the market makers spread [44]. By quot-
ing bid and ask prices, market makers are also providing liquidity to
the market. Spreads measure the execution cost of a small transac-
tion, by measuring how close the price of a trade is to the market price,
where the market price is the equilibrium price, i.e., the price where de-
mand equals supply [13]. One approach is by using the midprice in
Equation 2.2:
S_t = \frac{1}{2}(a_t + b_t)    (2.2)
which is the simple average of the bid (b_t) and ask (a_t) prices at time t.
However, the most common spread measures are the quoted spread and the relative quoted spread [13, 24], shown in Equation 2.3 and Equation 2.4:
QS_t = a_t - b_t    (2.3)
RQS_t = \frac{a_t - b_t}{S_t}    (2.4)
On the contrary, the effective spread or half-spread measures the re-
alized difference between the price paid and the midprice, which can
also be negative indicating that one is buying at a price below or sell-
ing above the market price [13]. ES and QS differ in the fact that ES can only be measured when there is a trade, while QS is always observable [13]. Some stylized facts known about the bid-ask spread are [30,
43, 8]:
• The trade prices series is a martingale & the order flow is not
symmetric.
• The spread declines over time & the bid-ask spread is lower in high-volume securities and wider for riskier securities.
• For large-tick stocks, the spread is almost equal to one tick. Small-
tick stocks have a broader distribution of spreads.
I_t = \frac{V_t^b}{V_t^b + V_t^a}    (2.5)
where Vtb is the bid volume at time t, conversely Vta is the ask volume
at time t. The denominator (total volume at time t) in Equation 2.5
normalizes the imbalance, therefore I_t ∈ [0, 1]. Bouchaud et al. [8] provide a qualitative interpretation of I_t. A closely related, signed measure is:
\rho_t = \frac{V_t^b - V_t^a}{V_t^b + V_t^a}    (2.7)
this measure takes values ρ_t ∈ [−1, 1]. Usually one computes ρ_t by looking only at-the-touch, or at the best n levels of the LOB, where the first level is the best price, followed by the second price level and so forth. There are more buy orders when the imbalance is high, and more sell orders when the imbalance is low. The willingness of an agent to post limit orders is strongly dependent on the value of the imbalance [13].
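To make the measures above concrete, the following is a minimal Python sketch (illustrative only, not the thesis' EXSIM code; the function and variable names are assumptions) that computes the midprice, quoted spread, relative quoted spread and the two imbalance measures of Equations 2.2-2.5 and 2.7 for a single LOB level:

def lob_metrics(bid, ask, v_bid, v_ask):
    # Snapshot metrics for one LOB level (Equations 2.2-2.5 and 2.7)
    mid = 0.5 * (ask + bid)                   # midprice S_t, Eq. (2.2)
    qs = ask - bid                            # quoted spread QS_t, Eq. (2.3)
    rqs = qs / mid                            # relative quoted spread, Eq. (2.4)
    imb = v_bid / (v_bid + v_ask)             # imbalance I_t in [0, 1], Eq. (2.5)
    rho = (v_bid - v_ask) / (v_bid + v_ask)   # signed imbalance in [-1, 1], Eq. (2.7)
    return mid, qs, rqs, imb, rho

# Example: best bid 99.5 and best ask 100.5 with 300 vs. 100 shares queued
print(lob_metrics(bid=99.5, ask=100.5, v_bid=300, v_ask=100))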
Ho & Stoll Model. Ho and Stoll [33] present a model that handles the risk which the market maker faces when providing his service. In the model the following assumptions are made: transactions follow a Poisson process, the dealer faces uncertainty over the future, and the arrival rate of orders depends on the bid and ask prices. The objective of the dealer is to maximize the expected utility of his total wealth E[U(W_T)] at time horizon T, where:
W_T = F_T + I_T + Y_T    (2.9)
Equation 2.9 is called the dealer's pricing problem, where F_T, I_T and Y_T are the dealer's cash account, inventory and base wealth. This is, in fact, an optimization problem where the aim is to maximize the value function J(·) using dynamic programming, yielding the optimization problem in Equation 2.10:
J(t, F, I, Y) = \max_{a,b} E[U(W_T)]    (2.10)
where U is the utility function, a and b are the ask and bid adjust-
ments and t, F, I, Y are the state variables time, cash, inventory and
base wealth [50]. The function J(·) gives the level of utility given that
the dealer’s decisions are made optimally [50]. There is no intermedi-
ate consumption before time T in this model. The recurrence relation found by using the principle of optimality is (see Appendix C):
\max_{a,b}(dJ/dt) = J_t + LJ
where LJ is the operator collecting the drift and diffusion terms of the state variables (given explicitly in Appendix C).
Finally, from Equation 2.12 and Equation 2.13 one gets the bid-ask spread as:
s = \alpha/\beta + (J - SJ)/(2\,SJ_F\,Q) + (J - BJ)/(2\,BJ_F\,Q)    (2.14)
The first term of Equation 2.14 is the spread which maximizes the ex-
pected returns from selling and buying stocks. The rest of the terms are
seen as risk premiums for sale and purchase transactions. This shows
how the dealer or market maker sets the spread without knowing what
side the transaction will have, i.e., bid or ask [33].
Inventory-based models are just one set of models used in the market microstructure literature, and the most relevant one for this thesis. However, another important family is information-based models, allowing for examination of market dynamics and thus providing insights into the adjustment process of prices [50]. For popular models see, for instance, Glosten and Milgrom [25], Das [18] and Das [19].
Some well-known stylized facts of asset returns are [16]:
• Non-Gaussian returns & heavy tails, i.e., the unconditional distribution of returns seems to display a power-law or Pareto-like tail.
• Volatility clustering.
In reinforcement learning the environment is typically modelled as a Markov decision process (MDP), described by a set of states S, a set of actions A, transition dynamics, a reward function R and a discount factor γ ∈ [0, 1]. Note that a γ close to 0 leads to myopic behavior, i.e., the agent only cares about immediate rewards, while a γ close to 1 makes the agent more far-sighted. The agent-environment interaction breaks naturally into sub-sequences, also called episodes, which explains Equation 2.18.
Another key concept underlying RL is the Markov property, i.e., only the current state affects the next state [4]. More formally, this means that P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \ldots, S_t]. A central quantity is the state-value function of a policy π, which gives the expected return when starting in state s and following π thereafter [61]:
v_\pi(s) = E_\pi[G_t \mid S_t = s] = E_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\right], \quad \forall s \in S.    (2.20)
Note that the value of the terminal state is always zero [61]. Similarly
one can define the action-value function for policy π [61], which is the
value of taking action a in state s under policy π, denoted qπ (s, a). This
is the expected return starting from s, taking the action a and thereafter following policy π:
q_\pi(s, a) = E_\pi[G_t \mid S_t = s, A_t = a] = E_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\right].    (2.21)
Using Equation 2.24 and Equation 2.25 together with the Bellman equations in Equation 2.22 and Equation 2.23 yields the Bellman optimality equations:
v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_*(s')\right]    (2.26)
q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} q_*(s', a')\right]    (2.27)
For a finite MDP Equation 2.26 and Equation 2.27 have a unique solu-
tion independent of the policy [61]. Nevertheless, in practice there is
no closed form solution for these equations. Therefore, one must re-
sort to approximate and iterative methods using dynamic programming
or Monte Carlo methods.
In Q-learning, the action-value estimates are updated according to:
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right]    (2.33)
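As a concrete illustration of the update rule in Equation 2.33, the following is a minimal tabular Q-learning sketch in Python; it is not the thesis' implementation, and all names, hyper-parameter values and the four-action set are illustrative assumptions:

import random
from collections import defaultdict

Q = defaultdict(float)              # Q[(state, action)] -> estimated value
alpha, gamma, epsilon = 0.1, 0.99, 0.1
actions = [0, 1, 2, 3]              # e.g. move bid/ask up or down

def select_action(state):
    # epsilon-greedy behaviour policy used while learning
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(s, a, r, s_next):
    # one application of the update in Eq. (2.33)
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])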
The DQN algorithm instead learns a parameterized approximation Q(s, a; \theta) and, at iteration i, minimizes the loss
L_i(\theta_i) = E_{(s, a, r, s') \sim U(D)}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)\right)^2\right]    (2.34)
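A hedged sketch of how the sampled loss in Equation 2.34 can be computed for a minibatch drawn from the replay memory D is shown below; the arrays of Q-values are assumed to come from the online and frozen target networks, and the masking of terminal transitions (dones) is a common practical addition that Equation 2.34 does not spell out:

import numpy as np

def dqn_loss(q_online, q_target_next, actions, rewards, dones, gamma=0.99):
    # q_online:      (batch, n_actions) array of Q(s, a; theta_i)
    # q_target_next: (batch, n_actions) array of Q(s', a'; theta_i^-)
    idx = np.arange(len(actions))
    q_sa = q_online[idx, actions]                                # Q(s, a; theta_i)
    targets = rewards + gamma * (1.0 - dones) * q_target_next.max(axis=1)
    return np.mean((targets - q_sa) ** 2)                        # Eq. (2.34)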
5 The optimal action-value function.
6 Increasing rewards in states that are close to the end goal.
7 Only gives reward in the goal state.
Chapter 3
Related Work
They have, for instance, a limit order book, as this thesis aims to use. However, it is unclear whether they employ a functional matching engine, which this study does by using a simulated limit order book market. They also use different types of agents, for instance zero-intelligence traders, technical traders, evolutionary agents, risk-averse agents and mean-variance agents. Nevertheless, none of these are based on reinforcement learning, which will be used in this thesis.
One of the main benefits of using agent-based modelling is the ability to produce realistic system dynamics, comparable to those observed empirically [53]. Platt and Gebbie [53] calibrate their agents by using heuristic optimization, the Nelder-Mead simplex algorithm and a genetic algorithm. They use a simpler form of matching engine in their study, and they also use a low-frequency trader and a high-frequency trader. This thesis focuses on simulating the behaviour of a market maker.
In [50, 30, 44, 43] both inventory-based and information based models are
discussed. However, in this thesis the author focuses on inventory-
based models for the learning agents. Moreover, information-based
models can be used in future studies. A common inventory-based
model, that uses dynamic programming to find the optimal dealer
price in a one-dealer market is given in Ho and Stoll [33]. This model
is the core idea for the simulated dealer markets used in this thesis.
The main idea underpinning IRL is to find the reward function given observations of optimal trading behaviour, and then use it for trader identification. Though this is interesting, this thesis is not interested in capturing key characteristics of already optimal HFT strategies. Instead, the author wants to see what optimal or non-optimal behaviour the agents will develop, and how they will react to changed market conditions. Therefore, reinforcement learning is used instead. Some previous studies exist using reinforcement learning for market making; see, for instance, [35, 15, 23, 58].
Chapter 4
Research Summary
4.3 Validity
In this section a discussion about validity is provided, covering construct, internal and conclusion validity. This is important when devising different tests to make sure that the simulated data from each simulation is valid. Validity, in short, indicates the degree to which an instrument measures what it is supposed to measure [39].
4.4 Ethics
Ethics, independently of whether the research is quantitative or qualitative, concerns the moral principles in planning, conducting and reporting the results of research studies [28]. No possible ethical violations have been identified in conducting this research, as all the data used is simulated without any connection to real-world market participants. The author also believes that the results of this thesis cannot be used for any unethical behavior.
3 An independent variable that has not been taken into account, affecting the dependent variables.
Chapter 5
Implementation
5.1 Overview
2. Secondly, training the models for some two million time steps, in intervals of 10 000 time steps¹, to collect data and to monitor and visualize the learning of the agent. Here the author also used different random seeds in order to take into account randomness in the results.
1 Each time step is equivalent to 1/10th of a second, meaning that each simulation lasts for at most (1000 ∗ 2000)/(3600 ∗ 8.5) ≈ 555 hours or some 65 trading days.
5.2.1 DealerMarket-v1
This environment is inspired by the ideas underpinning Ho and Stoll [33], meaning that the equilibrium price follows a Brownian motion and that there is a changing demand curve controlled by the slope parameter in Table 5.1. Also note that the orders in the simulation arrive according to a Poisson distribution based on the demand curve. All these parameters change after each episode in the OpenAI envi-
ronment. DealerMarket-v1 only has a single agent, who is a dealer or
market maker with four possible actions: Move bid up (0), Move bid
down (1), Move ask up (2) and Move ask down (3). There are no hid-
den states in the environment, in order to make it fairly easy for the
agent, to use this as a benchmark to other environments and agents.
As input (only the last 10 frames), the agent has the following observed state variables: volume imbalance, offset imbalance, inventory imbalance, spread, wealth and share value. These are fed into the environment during training (when the agent samples possible actions) using the OpenAI environment class. They are referred to as [ibv, ibo, ibif, sp, w, v] hereafter and defined below:
ibv = (last at bid - last at ask)/(last at bid + last at ask) (5.1)
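To illustrate how such an environment can be framed, the following is a minimal OpenAI Gym skeleton for a DealerMarket-v1-style environment, with the four discrete actions and the six observed scalars stacked over the last 10 frames as described above; the class name, bounds and all dynamics placeholders are assumptions and not the thesis' EXSIM code:

import numpy as np
import gym
from gym import spaces

class DealerMarketV1(gym.Env):
    # Skeleton only: the market dynamics are left as placeholders.

    N_FRAMES, N_FEATURES = 10, 6      # last 10 frames of [ibv, ibo, ibif, sp, w, v]

    def __init__(self):
        # 0: move bid up, 1: move bid down, 2: move ask up, 3: move ask down
        self.action_space = spaces.Discrete(4)
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf,
            shape=(self.N_FRAMES, self.N_FEATURES), dtype=np.float32)

    def reset(self):
        self.frames = np.zeros((self.N_FRAMES, self.N_FEATURES), dtype=np.float32)
        return self.frames

    def step(self, action):
        # placeholder: update quotes, draw Poisson order arrivals along the
        # demand curve, recompute the observed features and the reward
        reward, done, info = 0.0, False, {}
        return self.frames, reward, done, info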
5.2.2 DealerMarket-v2
Figure 5.2: Example showing the change in reference price for the DealerMarket-v2 environment.
The DealerMarket-v2 agent instead has these actions: Move bid up (0),
Move bid down (1), Move ask up (2), Move ask down (3), Move ref
price up (4) and Move ref price down (5). In Figure 5.2 the reference
(ref) price is used to simulate price changes. Also, looking at Table 5.1
this environment is more volatile and complicated. The reward function is also different, as shown in Listing 5.1. In practice the agent is given +1 for each share's worth of wealth at the end of an episode, and it is also penalized with −100 × (% time left) when running out of cash. In the listing, tv is the true price, ti the inventory, ii the initial inventory, iv the initial value, tf the funds, i_f the initial funds, sc the current step and sm the maximum number of steps.
if not self._is_episode_over():
    # small penalty for posting no bid/ask volume, scaled by the quote offsets
    return -1 * ((self.events.volume_bid == 0) * 0.005 +
                 (self.events.volume_ask == 0) * 0.005) * \
           (self.state.offs_bid + self.state.offs_ask) / 50
else:
    # terminal reward: change in inventory value plus change in cash,
    # minus a penalty proportional to the fraction of time left
    return (tv * ti - ii * iv) / iv + (tf - i_f) / iv \
           - 100 * ((sm - sc) / sm)  # -0.01*10000
Listing 5.1: Reward function for DealerMarket-v2.
5.2.3 LOBMarket-v1
In order to make the experiments more realistic, a simplified version of a limit order book market was used, with an order book and a matching engine. Firstly, what the agent observes is a bit different from before. As input the agent gets the 10 last frames, however now with the following variables: stance bid, stance ask, best bid, best ask, the agent's best bid and ask, offset of bid and ask, trades and levels for bid and ask, the agent's trades, imbalance volume, imbalance of wealth and relative wealth. In total this is 25 scalar statistics, which is what one could expect to be distributed to participants on a real trading platform².
Secondly, levels here indicate the vision width in each direction from the reference price, where the agent can see (+20/−20) levels of previous prices in the LOB. The reward function is slightly changed
and varies a bit, as shown in Listing 5.2. As with DealerMarket-v2 (see
Listing 5.1) the move from sparse rewards to more carefully shaped
rewards is due to better performance when training the agents.
# if time, inventory or funds didn't run out
if not self._is_episode_over():
    # per-step reward based on the agent's quotes relative to the true price,
    # clipped to [-1, 1] per side
    return min(1, max(-1, atvb * ((ampb - self.info['price_true'])
                                  / self.info['price_true']))) \
         + min(1, max(-1, atva * ((ampa - self.info['price_true'])
                                  / self.info['price_true'])))
else:
    # terminal reward: change in inventory value and cash, with penalties
    # for going broke and for the fraction of time left
    return ((tv * ti - ii * iv) / iv + (tf - i_f) / iv
            - 5 * self.events['went_broke'] - 25 * ((sm - sc - 1) / sm))
Listing 5.2: Reward function for LOBMarket-v1.
Finally, as outputs the agent has ten actions: the same six actions (0) to (5) as in DealerMarket-v2. However, new for this environment are four other actions: move, submit or cancel orders at the bid or ask quote prices.
The environment flow simulation is the same as before with Poisson
arrivals of orders et cetera. In terms of complexity, this environment
is seen as the hardest for the agent to learn and navigate in, as seen in
Table 5.1.
2 At NASDAQ this type of information is sent out via the so-called Net Order Imbalance Indicator (NOII).
5.3 Implementation
5.3.1 Neural Network Models
The following neural network models were used in the different environments; they were chosen based on both a hyper-parameter search and what has been used in previous literature. All the models used are shown in Table 5.2 and in Figure 5.3 to Figure 5.5. For DealerMarket-v1 the author trained an 8-layer fully-connected neural network (FCNN) using the DQN agent and a Boltzmann policy, with LeakyReLU as the activation function.
Table 5.2: The different network architectures used in the thesis. Type indicates what type of agent and policy is used. The random agent samples actions via a uniform distribution.
Environment        Architecture             Type               Library
DealerMarket-v1    8-layer FCNN             DQN + Boltzmann    keras-RL
DealerMarket-v2    8-layer FCNN             PPO + ε-decay      tensorforce
DealerMarket-v2    Random model             Random policy      tensorforce
LOBMarket-v1       6-layer FCNN + 2 LSTM    PPO + ε-decay      tensorforce
LOBMarket-v1       Random model             Random policy      tensorforce
In Table 5.2 the main network architectures and models used are shown. DealerMarket-v1 is the only agent using keras-RL and DQN, which did not have any option for training a random policy. The two other environments used tensorforce and the PPO algorithm. As can be seen, the network architectures are quite similar, with the following number of hidden neurons in each individual layer: [1024, 512, 512, 256, 256, 128, 64]. This was based on trial and error, the hyper-parameter search and previous papers.
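As an illustration, a minimal Keras sketch of the fully-connected architecture in Table 5.2 is shown below, using the hidden sizes [1024, 512, 512, 256, 256, 128, 64] and LeakyReLU activations mentioned in the text; the input shape, output size and use of tensorflow.keras are assumptions rather than the thesis' exact code. In keras-RL such a model would typically be wrapped in a DQN agent together with a Boltzmann policy and a replay memory, per Table 5.2.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, LeakyReLU

def build_fcnn(n_frames=10, n_features=6, n_actions=4):
    # Fully-connected architecture of Table 5.2 (hidden sizes from the text,
    # everything else assumed).
    model = Sequential()
    model.add(Flatten(input_shape=(n_frames, n_features)))
    for units in [1024, 512, 512, 256, 256, 128, 64]:
        model.add(Dense(units))
        model.add(LeakyReLU())
    model.add(Dense(n_actions, activation="linear"))  # one Q-value per action
    return model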
keras-RL
Keras-RL is a library for reinforcement learning developed by Plappert [52], with some state-of-the-art reinforcement learning algorithms such as Deep Q-Learning and SARSA. It provides an easy-to-use interface on top of Keras's modular API for building neural networks. The library also works seamlessly with OpenAI Gym, which is why it was initially selected as the primary library for this project. However, as the project progressed, a more actively maintained library was needed. Hence the choice fell on Tensorforce.
Tensorforce
Tensorforce is developed by Schaarschmidt, Kuhnle, and Fricke [56], and is another Python-based reinforcement learning library. It is built on top of TensorFlow, with a modular API where parameters are passed using Python dictionaries. Some implemented agents in the library that are of interest for this project are A3C, PPO, DQN, and both a random and a constant agent for sanity checks [56].
Platform specification
The majority of the experiments and simulations were performed on a virtual machine on AWS. The author used a p2.xlarge EC2 instance on AWS, with the following specifications: 1 Tesla K80 GPU, 4 vCPUs and 61 GiB RAM. For initial testing and debugging, local experiments were also run on an Intel i5 dual-core CPU at 2.40 GHz with 8 GB RAM, running the Windows 10 Pro operating system.
5.4 Visualization
5.5 Evaluation
A common approach in machine learning to evaluate models is to use k-fold cross-validation. However, in this thesis all data is generated from each completed simulation. Therefore, other metrics are gathered after each simulation in order to evaluate the agents; these are discussed below.
• Price Impact Regression. An estimate of Kyle's lambda (λ), i.e., the price impact of the agent's orders, is obtained via regression, using order imbalance (q_n), inventories (∇I_n) and initial inventories (∇Init_n), where a larger λ implies that volumes have a larger impact on prices (see the sketch after this list).
• Visualizing learning & Strategies. Looking at how the agents act
together with the price data stored from each run, to see what
strategies are used by the agents.
• Spreads & Inventory. Analyzing how the spreads & inventory are
changing during training. For instance, if the spreads decline
over time, how the correlation between spreads and inventory
changes, and the order imbalance.
• Net Profit and Loss (Net PnL). Calculating the Net PnL of the agent to see if the market maker has in fact learned to make a profit or not, thus optimizing its inventory levels; this is defined in Equation 5.8.
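As an illustration of the price impact regression described above, the following is a hedged statsmodels sketch; the DataFrame layout and column names are hypothetical and not taken from the thesis' code:

import pandas as pd
import statsmodels.api as sm

# df is assumed to hold one row per interval with columns 'd_mid' (change in
# mid price), 'q' (order imbalance), 'd_inv' (inventory term) and
# 'd_init_inv' (initial inventory term).
def price_impact_regression(df: pd.DataFrame):
    X = sm.add_constant(df[["q", "d_inv", "d_init_inv"]])
    model = sm.OLS(df["d_mid"], X).fit()
    return model   # model.params["q"] plays the role of the lambda estimate

# print(price_impact_regression(df).summary())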
Chapter 6
Results
In this section the results from the different experiments and environ-
ments are presented, where DealerMarket-v1 is not as extensively ana-
lyzed as the other environments because it serves as an initial baseline.
This section starts with some stylized facts about the simulated data, continuing with a breakdown of different statistics gathered after each simulation.
Finally, looking at plot (D) one can also clearly see an absence of linear correlations, with near-zero autocorrelations after lag 1. This is also in line with what is stated in previous literature. In Figure 6.2 one
can see similar patterns for LOBMarket-v1 as in Figure 6.1. However,
note that both the signature plot (A) and midprice changes plot (B) are
different. The signature plot is still mean reverting with a small trend,
whilst the changes in mid prices are more frequent.
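A small, hedged Python sketch of how the "absence of linear autocorrelation" stylized fact can be checked on the simulated mid prices is shown below; the pandas Series of mid prices is an assumed input format:

import pandas as pd

def return_autocorrelations(mid_prices: pd.Series, max_lag: int = 10):
    # Autocorrelation of mid-price changes; values near zero beyond lag 1 are
    # consistent with the absence of linear autocorrelation.
    changes = mid_prices.diff().dropna()
    return {lag: changes.autocorr(lag) for lag in range(1, max_lag + 1)}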
Regarding Table 6.1 above one can see that the accumulated reward is
fairly consistent for DealerMarket-v1 compared to the other environ-
ments presented later. Also note that the reward schemes are some-
what different between the different environments as discussed previ-
ously, which will obviously affect the distribution of the accumulated
rewards. This environment is easier to navigate in for the agent with
less volatility and less arrivals of orders, compared to the other envi-
ronments.
As can be seen from Table 6.1 the standard deviation of the reward
is quite large, mainly due to the randomness associated with train-
ing reinforcement learning agents. Nevertheless, calculating confi-
dence intervals for the reward results in an average reward close to
400 (391.81 ± 16.18). In Table 6.1 for DealerMarket-v2 the agent makes
on average a reward close to 22 (22.47 ± 2.85).
Note also that the reward is much smaller for this environment com-
pared to the DealerMarket-v1 environment, most likely due to higher
complexity in the DealerMarket-v2 environment with higher volatil-
ity in the underlying prices of the asset and changed reward function,
explaining the quite wide fluctuation between min and max values of
the reward. Finally, looking at the reward for the LOBMarket-v1 in
Table 6.1, the mean reward is the smallest, close to −5 (−5.02 ± 0.88). As this is the hardest environment, it is also harder for the agent to find optimal behavior.
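For reference, confidence intervals of the kind quoted for the mean rewards (e.g. 391.81 ± 16.18) can be computed with a normal approximation as in the hedged sketch below; the exact procedure used in the thesis is not specified, so this is only one reasonable choice:

import numpy as np

def reward_confidence_interval(episode_rewards, z=1.96):
    # 95% normal-approximation CI for the mean episode reward
    rewards = np.asarray(episode_rewards, dtype=float)
    mean = rewards.mean()
    half_width = z * rewards.std(ddof=1) / np.sqrt(len(rewards))
    return mean, half_width   # reported as mean ± half_width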
Figure 6.3: Plots from training the DealerMarket-v1 agent (mean reward and mean loss per episode against the number of episodes), with the best weights after training for 2·10^6 time-steps. Plotted with a 95% confidence interval.
Figure 6.4: Mean reward against the number of episodes (10 000s) for the DealerMarket-v2 agent compared to a random policy.
In Figure 6.4 the mean reward for both the DealerMarket-v2 agent
and a random policy on the environment is shown, both this and the
LOBMarket-v1 agent used the PPO algorithm. The mean is taken after
each episode, i.e., after some 10 000 time steps, which clearly shows that the agent is worse than a random policy in the beginning of training. The agent slowly learns a more optimal behaviour after some 75 000 time steps, when it starts to outperform the random policy quite consistently. Compared to DealerMarket-v1, the mean reward seems to converge more smoothly using the PPO algorithm, as the same type of fluctuations previously seen are not present here. This is expected, as policy gradient methods tend to converge more smoothly than DQN.
Figure 6.5: Mean reward for the LOBMarket-v1 agent (LOB-V1) compared to a random policy.
Table 6.2: Results from running the price impact regression on the change in mid prices during the simulations. ** indicates significant estimates.

DealerMarket-v1
       Estimate      Std. Error   t value   Pr(>|t|)
β0     1.0648        0.5943       1.79      0.1711
β1     0.0112        0.0084       1.32      0.2771
β2     −0.0757**     0.0237       −3.19     0.0497
β3     −0.0058       0.0116       −0.50     0.6542

DealerMarket-v2
β0     −56.8926**    25.199       −2.26     0.0242
β1     3.4324**      1.1661       2.94      0.0033
β2     −0.2601**     0.0149       −17.44    0.0000
β3     0.0817**      0.0242       3.38      0.0008

LOBMarket-v1
β0     37.6559       187.0559     0.20      0.8408
β1     43.9947**     11.6959      3.76      0.0003
β2     −12.8060      12.3870      −1.03     0.3034
β3     −32.0142      18.0639      −1.77     0.0791
DealerMarket-v1 is where the agent has the smallest price impact, which is expected due to lower volatility and a lower frequency of orders. However, with higher volatility in the DealerMarket-v2 environment, the price impact seems to increase. Similarly, this can be seen for LOBMarket-v1, which has the largest price impact. Furthermore, another reason can be that the agent is posting more orders, thus affecting prices more often compared to the other environments. Finally, transforming the regressors with a square root yielded higher R² and lower AIC values when performing the regressions, indicating a better model.
Table 6.3: Summary of the agents' inventories and spreads.

Inventories
Model              Mean      Std      Max       Min      CI
DealerMarket-v1    939.41    629.04   3024.19   2.67     120.17
DealerMarket-v2    1206.18   831.95   5994.08   0.00     30.26
LOBMarket-v1       214.96    133.56   1430.24   −1.00    4.91

Spreads
DealerMarket-v1    10.28     2.20     22.73     2.38     0.42
DealerMarket-v2    39.49     6.81     47.45     6.50     0.25
LOBMarket-v1       22.95     4.36     32.00     4.40     0.16
Table 6.3 shows a summary of the different agents' inventories and spreads. In Figure 6.6 for DealerMarket-v1, looking at plot (A), the gap between the bid, ask and mid prices is quite big. One would expect the price difference to be narrower; this might be due to the simplistic nature of the environment. However, in plot (C) one sees that the spread is decreasing, and looking at the inventory in plot (B) and the volumes in plot (D), the inventory seems to be increasing, while the bid and ask volumes are similar and close to 5, which interestingly is similar to the level at which the Q-value stagnates. This may be because the agent learns that 5 seems to be the optimal volume to post in this environment.
For DealerMarket-v2 in Figure 6.7, in plot (A) the prices have tightened, likely as it gets harder for the agent to make a profit, hence it needs to post more bid and ask quotes. Also, the inventory seems to be increasing (B), whilst the spread is decreasing (C); however, the posted bid and ask volumes seem to be decreasing. In LOBMarket-v1, Figure 6.8, the difference between the prices is also very tight, likely because the agent posts many more bid and ask quotes. The inventory in plot (B) seems to be increasing at the end of training, and at the same time the spread in plot (C) is decreasing. Note that the prices, posted volumes and inventory have all decreased.
Figure 6.6: Plots for DealerMarket-v1, showing (A) bid-ask prices, (B) change in inventory, (C) bid-ask spread in percentage and (D) bid-ask volumes, against the number of episodes (10 000s).
Figure 6.7: Plots for DealerMarket-v2, showing (A) mid, bid and ask prices, (B) inventory and initial inventory, (C) the agent's percentage spread and (D) change in posted bid and ask volumes during training, against the number of episodes (10 000s).
Figure 6.8: Plots for LOBMarket-v1, showing (A) prices, (B) inventory, (C) percentage spread and (D) posted volumes during training, against the number of episodes (10 000s).
Figure 6.9: Plots for DealerMarket-v2, showing mid, bid and ask prices (A) and spread (B) against the inventory.
In graphs (A) and (B) in Figure 6.9 some correlation between prices
and inventory is visible. In fact all prices are negatively correlated to
the inventory with ρ ≈ −0.41. The spreads are also negatively corre-
lated to the inventory with ρ ≈ −0.16. However, to establish whether
spreads are independent of inventory as stated by Ho and Stoll [33], a
Chi-squared Test of Independence is needed. This test is performed on the
full dataset with significance level α = 0.05, resulting in a p-value of
0.2403, thus failing to reject the null hypothesis (H0 ) of independence.
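A hedged sketch of how such a Chi-squared Test of Independence between spreads and inventory can be carried out with scipy is shown below; the binning of the two continuous variables is an illustrative choice, not the thesis' exact procedure:

import pandas as pd
from scipy.stats import chi2_contingency

def spread_inventory_independence(spread: pd.Series, inventory: pd.Series, bins=5):
    # contingency table of binned spread vs. binned inventory levels
    table = pd.crosstab(pd.cut(spread, bins), pd.cut(inventory, bins))
    chi2, p_value, dof, _ = chi2_contingency(table)
    return chi2, p_value, dof

# A p-value above 0.05 (such as the reported 0.2403) means the null hypothesis
# of independence cannot be rejected at the 5% level.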
Figure 6.10: Plot for LOBMarket-v1, showing the correlation between mid, bid and ask price (A) and spread (B) against the inventory, averaged over each time-step.
Figure 6.11 and Figure 6.12: Order imbalance during training (A) and order imbalance against queue imbalance (B), for DealerMarket-v2 and LOBMarket-v1 respectively.
Figure 6.13: Net Profit & Loss for DealerMarket-v2, showing accumulated P&L and percentage P&L for the agent during training, averaged over 1000 time steps.
Figure 6.14: Net Profit & Loss for LOBMarket-v1, showing accumulated P&L and percentage P&L for the agent during training, averaged over 1000 time steps.
In Table 6.4 a summary of the P&L for the DealerMarket-v2 and LOBMarket-v1 environments is presented. In Figure 6.13 one can see the DealerMarket-v2 agent's average profit and loss (P&L) per episode, whilst in Figure 6.14 the P&L for LOBMarket-v1 is shown. Notice that it takes some time until the agents actually learn not to go bankrupt; around 1 million time steps seem to be needed. After that, each agent has the chance of making between 100% and 200% of its initial investment. However, as the difficulty of the environment increases (LOBMarket-v1), the agent has a harder time reaching higher rewards. Nonetheless, this is still above zero and better than the random strategies tested on the same environments.
Figure 6.15: (a) The agent's actions after 84 episodes, where the agent loses quite frequently. (b) After 462 episodes, the agent is slowly starting to learn not to go bust. (c) After 1169 episodes, the agent has learned not to go bust and makes a small profit. (d) After 2006 episodes, the agent is balancing quotes and makes profits frequently.
Figure 6.16: (a) The agent's actions after 1 episode, where the agent loses quite frequently. (b) After 363 episodes, the agent is slowly starting to learn not to go bust. (c) After 1069 episodes, the agent has learned not to go bust and makes a small profit. (d) After 1896 episodes, the agent is balancing quotes and makes profits frequently.
Chapter 7
Discussion & Conclusions
7.1 Discussions
Stylized Facts
In general, when analyzing agent based markets, one usually analyses
the statistical properties of simulated data to validate the experimental
setup. Firstly, looking at the signature plots for DealerMarket-v2 and
LOBMarket-v1 one clearly sees mean-reverting behaviour of the sam-
pled volatility from the changes in the midprices. These graphs do
exhibit some flatness, indicating weak mean reversion as mentioned
in [8]. Secondly, looking at the changes in the midprices during the
simulations, one sees as indicated by Bouchaud et al. [8] some activity
clustering.
2 α = 0.05, or with 95% confidence.
Looking instead at the rolling mean for the last 100 episodes is of more interest: there DealerMarket-v2 has a P&L of 17977 and LOBMarket-v1 a P&L of −1010. This is not surprising, as it takes time even for experienced traders to make a profit in the market. Nevertheless, this is far better than a random strategy. Note that each simulation is run for approximately 65 trading days, where the agents start tabula rasa. Training for an even longer time, or using pre-trained models, would have affected the P&L positively, while also increasing the chance of overfitting. Looking at Q3, the third quartile of the rewards, the agent instead makes a P&L of 300200 and 16330 for the same two environments.
Moving on to how the spreads and inventories are related, and looking at Figure 6.9 and Figure 6.10, the prices and inventories clearly tend to cluster around certain inventory levels, suggesting that some inventory levels are more favourable than others. Ho and Stoll [33] state that the spreads are independent of the inventory levels. This is tested by performing a Chi-squared Test of Independence (H0) on the full dataset for both environments; the null hypothesis of independence cannot be rejected with 95% confidence. Focusing on queue dynamics and order imbalances for DealerMarket-v2 and LOBMarket-v1 in Figure 6.11 and Figure 6.12, the average order imbalance is 0.53 for DealerMarket-v2 and 0.49 for LOBMarket-v1, meaning that on average both the ask and bid queues are of equal length, i.e., there is balanced buy and sell pressure in the LOB. From the literature [8, 13], a higher order imbalance means that the agents should post more buy orders and fewer sell orders, and vice versa when the imbalance is low; this is also seen in the simulations.
Also note that the shape of panel (B) in Figure 6.11 and Figure 6.12 indicates a monotonic correlation, as in [8], between the queue imbalance and the direction of the next price movement. Finally, looking at the visualizations of the agents' behavior in Figure 6.15 and Figure 6.16, some similar patterns emerge. The agent goes bankrupt very quickly early in training; this is also where the agent displays odd behaviour, such as posting orders at the same price for a long time. However, after some 350-450 episodes the agent is not going bankrupt as often. Here the agent also learns that occasionally doing nothing is optimal. At around 1000 episodes the agent is rarely going bankrupt, and at the end of training the agent learns how to make the market by posting quotes more aggressively in response to changed prices, lower inventory and higher volatility, making a profit and balancing the trade-off between its cash and inventory. To summarize, the following has been observed during the different experiments and simulations:
• The simulated environments are realistic, as several known stylized facts found in [16] are successfully reproduced.
• Both DQN and PPO agents with adequate reward schemes, i.e., employing sparse or shaped rewards, outperform random or zero-intelligence agents.
7.2 Conclusions
The research question examined in this thesis was the following:
”Will trading dynamics such as the bid-ask spread clustering, optimal trade execution and optimal inventory costs, be exhibited and learned by reinforcement learning agents on a simulated market?”
After completing this work, the author believes that the research question has been answered. The main conclusions are: (1) the simulated environments are realistic and (2) DQN & PPO agents can successfully replicate trading dynamics such as bid-ask spread clustering. The author concludes that reinforcement learning is a suitable choice for modelling market participants' behaviour, such as that of market makers and HFT traders, when using DQN or PPO agents.
Compared to previous research this thesis shows that both DQN &
PPO based reinforcement learning agents are realistic choices when
simulating behavior in a dealer market and a limit order book, in or-
der to understand market microstructure. Nonetheless, more research is needed to further validate this work on real-world data and by applying it to other market structures.
Bibliography
[10] Danny Busch. “MiFID II: regulating high frequency trading, other forms of algorithmic trading and direct electronic market access”. In: Law and Financial Markets Review 10.2 (2016), pp. 72–82.
[11] Lucian Busoniu, Robert Babuska, and Bart De Schutter. “A comprehensive survey of multiagent reinforcement learning”. In: IEEE Transactions on Systems, Man, and Cybernetics—Part C: Applications and Reviews 38.2 (2008).
[12] Lucian Busoniu, Robert Babuska, and Bart De Schutter. “Multi-agent reinforcement learning: An overview”. In: Innovations in Multi-Agent Systems and Applications-1. Springer, 2010, pp. 183–221.
[13] Álvaro Cartea, Sebastian Jaimungal, and José Penalva. Algorithmic and High-Frequency Trading. Cambridge University Press, 2015.
[14] Patrícia Xufre Casqueiro and António J. L. Rodrigues. “Neuro-dynamic trading methods”. In: European Journal of Operational Research 175.3 (2006), pp. 1400–1412.
[15] Nicholas Tung Chan and Christian Shelton. “An electronic market-maker”. (2001).
[16] Rama Cont. “Empirical properties of asset returns: stylized facts and statistical issues”. (2001).
[17] Vincent Darley. A NASDAQ Market Simulation: Insights on a Major Market from the Science of Complex Adaptive Systems. Vol. 1. World Scientific, 2007.
[18] Sanmay Das. “A learning market-maker in the Glosten–Milgrom model”. In: Quantitative Finance 5.2 (2005), pp. 169–180.
[19] Sanmay Das. “Intelligent market-making in artificial financial markets”. (2003).
[20] Michael A. H. Dempster and Vasco Leemans. “An automated FX trading system using adaptive reinforcement learning”. In: Expert Systems with Applications 30.3 (2006), pp. 543–552.
[21] Prafulla Dhariwal et al. OpenAI Baselines. https://github.com/openai/baselines. 2017.
[22] Xin Du, Jinjian Zhai, and Koupin Lv. “Algorithm Trading using Q-Learning and Recurrent Reinforcement Learning”. In: Positions 1 (2016), p. 1.
[59] David Silver et al. “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529.7587 (2016), pp. 484–489.
[60] David Silver et al. “Mastering the game of Go without human knowledge”. In: Nature 550.7676 (2017), p. 354.
[61] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Vol. 1. MIT Press, 2017.
[62] Csaba Szepesvári. “Algorithms for reinforcement learning”. (2009).
[63] William M. K. Trochim. Conclusion Validity. 2006. URL: https://socialresearchmethods.net/kb/concval.php.
[64] Oriol Vinyals et al. “StarCraft II: a new challenge for reinforcement learning”. In: arXiv preprint arXiv:1708.04782 (2017).
[65] Huiwei Wang et al. “Reinforcement learning in energy trading game among smart microgrids”. In: IEEE Transactions on Industrial Electronics 63.8 (2016), pp. 5109–5119.
[66] Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009. ISBN: 978-0-387-98140-6. URL: http://ggplot2.org.
[67] Hadley Wickham and Romain Francois. dplyr: A Grammar of Data Manipulation. R package version 0.5.0. 2016. URL: https://CRAN.R-project.org/package=dplyr.
[68] Steve Y. Yang et al. “Algorithmic trading behavior identification using reward learning method”. In: 2014 International Joint Conference on Neural Networks (IJCNN). IEEE, 2014, pp. 3807–3814.
[69] Steve Yang et al. “Behavior based learning in identifying high frequency trading strategies”. In: 2012 IEEE Conference on Computational Intelligence for Financial Engineering & Economics (CIFEr). IEEE, 2012, pp. 1–8.
Appendix A
Correlation Between Variables
Figure A.1 and Figure A.2 below show scatter matrices with the correlation between some of the most important variables; these were used when analyzing the data. The abbreviations are the following: av - ask volume, bv - bid volume, bp - bid price, ap - ask price, mp - mid price, ara - ask arrival rate, arb - bid arrival rate, d_inv - change in inventory, d_cash - change in funds, val - underlying asset price, inv - inventory, cash - funds.
Looking at the scatter matrix one can see for instance that cash is neg-
atively correlated with both the bid volume and the ask volume. Con-
versely the agents’ cash is positively correlated with both the bid and
ask prices.
Figure A.1: Scatter matrix between some of the most relevant variables, to find correlations for DealerMarket-v2.
Figure A.2: Scatter matrix between some of the most relevant variables, to find correlations for LOBMarket-v1.
Appendix B
Policy and Value Iteration
Policy iteration for estimating π ≈ π_*, based on Sutton and Barto [61]:
1. Initialization:
   V(s) ∈ R and π(s) ∈ A(s) arbitrarily for all s ∈ S
2. Policy Evaluation:
   repeat
       Δ ← 0
       foreach s ∈ S do
           v ← V(s)
           V(s) ← Σ_{s',r} p(s', r | s, π(s)) [r + γ V(s')]
           Δ ← max(Δ, |v − V(s)|)
       end
   until Δ < θ (a small positive threshold)
3. Policy Improvement:
   policy_stable ← true
   foreach s ∈ S do
       old_action ← π(s)
       π(s) ← argmax_a Σ_{s',r} p(s', r | s, a) [r + γ V(s')]
       if old_action ≠ π(s) then policy_stable ← false
   end
   if policy_stable then stop and return V ≈ v_* and π ≈ π_*; else go to 2
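For completeness, a hedged Python sketch of the policy iteration algorithm above is given below, assuming the tabular dynamics p(s', r | s, a) are available as a dictionary of (probability, next state, reward) triples; it is illustrative only:

def policy_iteration(states, actions, p, gamma=0.99, theta=1e-6):
    # p[(s, a)] is a list of (prob, s_next, r) triples
    V = {s: 0.0 for s in states}
    pi = {s: actions[0] for s in states}

    def q(s, a):
        return sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[(s, a)])

    while True:
        # 2. Policy evaluation
        while True:
            delta = 0.0
            for s in states:
                v = V[s]
                V[s] = q(s, pi[s])
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        # 3. Policy improvement
        policy_stable = True
        for s in states:
            old_action = pi[s]
            pi[s] = max(actions, key=lambda a: q(s, a))
            if old_action != pi[s]:
                policy_stable = False
        if policy_stable:
            return V, pi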
Appendix C
Derivation of Ho & Stoll Model
Let Y be a function of time t and an Ito process x:
Y = f(x, t)    (C.1)
where t is time and x in Equation C.1 is a well-defined Ito process [50], given in Equation C.2:
dx = \mu\,dt + \sigma\,dz    (C.2)
dY = \frac{\partial f}{\partial t}\,dt + \frac{\partial f}{\partial x}\,dx + \frac{1}{2}\frac{\partial^2 f}{\partial x^2}(dx)^2    (C.3)
   = \frac{\partial f}{\partial t}\,dt + \frac{\partial f}{\partial x}[\mu\,dt + \sigma\,dz] + \frac{1}{2}\frac{\partial^2 f}{\partial x^2}\sigma^2\,dt    (C.4)
rewriting Equation C.4 gives Ito’s Lemma in Equation C.5:
dY = \left(\frac{\partial f}{\partial t} + \frac{\partial f}{\partial x}\mu + \frac{1}{2}\sigma^2\frac{\partial^2 f}{\partial x^2}\right)dt + \frac{\partial f}{\partial x}\sigma\,dz    (C.5)
• The arrival rate of buy orders (λa ) and sell orders (λb ) will depend
on the dealer’s ask and bid prices.
• The dealer faces uncertainty over the future value of his portfolio X.
dY = r_Y Y\,dt + Y\,dZ_Y    (C.9)
The objective of the dealer is now to maximize the expected utility of
his total wealth E[U (WT )] at time horizon T , where
W_T = F_T + I_T + Y_T    (C.10)
Equation C.10 is what is termed the dealer's pricing problem. This is in fact an optimization problem with the goal of maximizing the value function J(·) using dynamic programming, yielding the optimization problem in Equation C.11:
J(t, F, I, Y) = \max_{a,b} E[U(W_T)]    (C.11)
where U is the utility function, a and b are the ask and bid adjustments, and t, F, I, Y are the state variables time, cash, inventory and base wealth [50]. The function J(·) gives the level of utility given that the dealer's decisions are made optimally [50]. As there is no intermediate consumption before time T in this model, the recurrence relation found by using the principle of optimality is:
\max_{a,b}(dJ/dt) = J_t + LJ
LJ = J_F\,r F + J_I\,r_I I + J_Y\,r_Y Y + \frac{1}{2} J_{II}\sigma_I^2 I^2 + \frac{1}{2} J_{YY}\sigma_Y^2 Y^2 + J_{IY}\sigma_{IY} I Y    (C.14)
J_t + LJ is the total time derivative of derived utility when there are no transactions. Equation C.13 determines the solution, which is hard to solve explicitly, and Ho and Stoll [33] do not solve the general problem but introduce some transformations and simplifications in order to solve it: firstly, by looking at the problem only at the endpoint (τ), where it is equal to zero, and secondly, by taking the first-order approximation of the Taylor series expansion of Equation C.11 [50]. Ho and Stoll [33] then also define two new operators, the sell (SJ) and buy (BJ) operators, where λ(a) = α − βA and λ(b) = α + βB are symmetric linear supply and demand functions facing the dealer. There is no closed-form solution for this problem; nonetheless, via approximations, the bid and ask quotes can be found.
Finally, from Equation 2.12 and Equation 2.13 the bid-ask spread is:
s = \alpha/\beta + (J - SJ)/(2\,SJ_F\,Q) + (J - BJ)/(2\,BJ_F\,Q)    (C.20)
The first term of Equation C.20 is the spread which maximizes the expected returns from selling and buying stocks, whilst the rest of the terms can be seen as risk premiums for sale and purchase transactions. This is because the dealer or market maker sets the spread without knowing which side the transaction will have, i.e., bid or ask [33].
Table C.1: Tactics dealers or market makers use to manage their inventories and order flow. Adapted from [29].