Deep Order Flow Imbalance: Extracting Alpha at Multiple Horizons
Petter Kolm
NYU Courant
[email protected]
linkedin.com/in/petterkolm
1 / 46
Petter Kolm
Petter is the Director of the Mathematics in Finance Master’s program and a Clinical Professor of Mathematics at the Courant Institute of Mathematical Sciences, New York University. In this role he interacts with major financial institutions such as investment banks, financial service providers, insurance companies and hedge funds. Petter worked in the Quantitative Strategies group at Goldman Sachs Asset Management developing proprietary investment strategies, portfolio and risk analytics in equities, fixed income and commodities.
Petter was awarded “Quant of the Year” in 2021 by Portfolio Management Research (PMR) and Journal
of Portfolio Management (JPM) for his contributions to the field of quantitative portfolio theory. Petter
is a frequent speaker, panelist and moderator at academic and industry conferences and events. He
is a member of the editorial boards of the International Journal of Portfolio Analysis and Management
(IJPAM), Journal of Financial Data Science (JFDS), Journal of Investment Strategies (JoIS), and Journal
of Portfolio Management (JPM). Petter is an Advisory Board Member of Alternative Data Group (ADG),
AISignals and Operations in Trading (Aisot), Betterment (one of the largest robo-advisors) and Volatility
and Risk Institute at NYU Stern. He is also on the Board of Directors of the International Association for
Quantitative Finance (IAQF) and Scientific Advisory Board Member of the Artificial Intelligence Finance
Institute (AIFI).
Petter is the co-author of several well-known finance books including, Financial Modeling of the Equity
Market: From CAPM to Cointegration (Wiley, 2006); Trends in Quantitative Finance (CFA Research
Institute, 2006); Robust Portfolio Management and Optimization (Wiley, 2007); and Quantitative Equity
Investing: Techniques and Strategies (Wiley, 2010). Financial Modeling of the Equity Markets was among
the “Top 10 Technical Books” selected by Financial Engineering News in 2006.
As a consultant and expert witness, Petter has provided his services in areas including alternative data, data
science, econometrics, forecasting models, high frequency trading, machine learning, portfolio optimization
with transaction costs, quantitative and systematic trading, risk management, robo-advisory, smart beta
strategies, trading strategies, transaction costs, and tax-aware investing.
He holds a Ph.D. in Mathematics from Yale University; an M.Phil. in Applied Mathematics from the Royal
Institute of Technology, Stockholm, Sweden; and an M.S. in Mathematics from ETH Zurich, Switzerland.
2 / 46
The M.S. in Mathematics in Finance at NYU Courant
3 / 46
Introduction & Motivation
4 / 46
Our Articles Related to This Talk
6 / 46
Signal Generation
◮ This type of alpha generation is very challenging for a number of reasons
◮ Enormous amounts of data (TB & PB sizes)
◮ Specialist infrastructure is required to store, process and analyze it
◮ Data is noisy, non-stationary and fat tailed
◮ The field is extremely competitive: every one of your competitors is trying to do the same thing with the same data
◮ The current state of the art is to employ quants to extract and handcraft features using expert domain knowledge, which then become high-value IP
7 / 46
The Bitter Lesson
The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore’s law, or rather its generalization of continued exponentially falling cost per unit of computation.
. . . Researchers seek to leverage their human knowledge of the domain [to improve performance], but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to.
Rich Sutton, “The Bitter Lesson,” March 2019
8 / 46
Examples of the Bitter Lesson
◮ Chess: Methods that defeated Kasparov were based on
massive deep search, not the special structure of Chess
◮ Go: 20 years later, leveraging learning from self play (to
discover value functions) provided the edge as this allowed
massive computation to be brought to bear
◮ Computer Vision: Early methods searched for edges, cylinders
etc. This has all been discarded in favour of modern
Convolutional Neural Networks (CNNs)
◮ Natural Language Applications: Methods based on linguistics and knowledge of words/phonemes are also no longer used. Modern approaches are all deep learning/statistically based (BERT, FinBERT)
9 / 46
Return Forecasting
10 / 46
Review: Limit Order Book (LOB)
11 / 46
General Problem Statement
◮ Focus on the top 10 (non-zero) levels of the LOB and define the vector
$$x_t := \left(a_t^1, v_t^{1,a}, b_t^1, v_t^{1,b}, \ldots, a_t^{10}, v_t^{10,a}, b_t^{10}, v_t^{10,b}\right)^\top \in \mathbb{R}^{40} \quad (1)$$
where $a_t^i$, $b_t^i$ are the ask and bid prices at level $i$ and $v_t^{i,a}$, $v_t^{i,b}$ the corresponding volumes
◮ For each stock and time t, fix a horizon h and consider the time series regression problem
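One way to write this regression out (a sketch, consistent with the ARX specification later in the talk; here W is the lookback window, set to 100 below, and $r_{t,h}$ is the mid-price return over horizon h):
$$r_{t,h} = f\left(x_{t-1}, x_{t-2}, \ldots, x_{t-W}\right) + \varepsilon_{t,h}$$
where $f$ is linear for the ARX benchmark and a neural network otherwise.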
12 / 46
Overview: Neural Networks
◮ The simplest Neural Network is best thought of as being made
of sequential applications (layers) of a linear transform and an
element wise non-linearity
[Figure: a simple feedforward network diagram with four inputs and one output]
13 / 46
Overview: Multilayer Perceptron (MLP)
◮ More generally, suppose we have L ≥ 1 hidden layers, each with $N_l \in \mathbb{N}$ neurons. If $x \equiv h^{(0)} \in \mathbb{R}^{N_0}$ is the input, then an MLP with L hidden layers takes the form
$$h^{(l)} = f\left(W^{(l)} h^{(l-1)} + b^{(l)}\right), \quad l = 1, 2, \ldots, L,$$
$$y = W^{(L+1)} h^{(L)} + b^{(L+1)}$$
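As a minimal illustration, a NumPy sketch of this forward pass (not the code from the study; the layer sizes here are made up):

```python
import numpy as np

def relu(z):
    # Element-wise non-linearity f
    return np.maximum(z, 0.0)

def mlp_forward(x, weights, biases):
    """Forward pass: h^(l) = f(W^(l) h^(l-1) + b^(l)) for l = 1..L, then a linear output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):   # hidden layers 1..L
        h = relu(W @ h + b)
    W_out, b_out = weights[-1], biases[-1]        # output layer L+1
    return W_out @ h + b_out

# Hypothetical sizes: 40 inputs (one LOB snapshot), two hidden layers of width 32, scalar output
rng = np.random.default_rng(0)
sizes = [40, 32, 32, 1]
weights = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
y = mlp_forward(rng.normal(size=40), weights, biases)
```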
14 / 46
Overview: Recurrent Neural Networks (RNNs)
◮ MLP is great for function approximation but has no notion of
order. Recurrent neural networks were designed for sequence
tasks (NLP, time series forecasting etc.)
◮ This time there is an input $x_t$ at each time step together with a hidden state $h_t$:
$$h_t = f_h(x_t, h_{t-1}), \qquad y_t = f_o(h_t)$$
◮ The non-linear functions $f_h$, $f_o$ handle the recurrent transformation and the output transformation
◮ Trained via backpropagation through time (BPTT) where the
network is unrolled. Issues with vanishing/exploding gradients
◮ Important architecture avoiding this is the Long Short-Term
Memory Network (LSTM) (Hochreiter and Schmidhuber,
1997)
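A minimal Keras sketch of an LSTM regressor of the kind used later in the talk (the window length, layer width and number of horizons below are illustrative, not the configuration from the article):

```python
import tensorflow as tf

W, D, H = 100, 20, 10  # lookback window, input dimension (e.g. OF vector), number of horizons

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(W, D)),  # summarize the sequence into a hidden state
    tf.keras.layers.Dense(H),                      # one return forecast per horizon
])
model.compile(optimizer="adam", loss="mse")
```

Training then proceeds with `model.fit` on (input window, multi-horizon return) pairs, with early stopping monitored on a validation set.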
15 / 46
Overview: Convolutional Neural Networks (CNNs)
◮ Originally introduced in computer vision, specifically image
classification
◮ The layers apply (multiple) kernels to locally average parts of
the image, then apply a non-linearity
16 / 46
Financial CNNs
◮ We can think of an order book as a matrix, one state per row
◮ Temporal convolutions are interpreted as non-linear moving
averages
◮ Spatial convolutions aggregate information across the different
levels (and sides) of the LOB
◮ The most well-known model (Zhang, Zohren, and Roberts, 2019) sequentially applies spatial and temporal convolutions, processing the order book into a series of features
◮ The convolutional layers act as a preprocessing step; the resulting time series is used as the input to an LSTM
◮ The CNN-LSTM can be trained via BPTT
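A much simplified Keras sketch of this idea (not the DeepLOB/inception architecture itself; kernel sizes and widths are illustrative): a spatial convolution first combines each (price, volume) pair and aggregates across the resulting columns, a temporal convolution forms non-linear moving averages, and the feature sequence is fed to an LSTM.

```python
import tensorflow as tf

W = 100  # lookback window of LOB snapshots, each a 40-dimensional vector (10 levels x 4)

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(W, 40, 1)),
    # Spatial: combine (price, volume) pairs, then aggregate across the 20 resulting columns
    tf.keras.layers.Conv2D(16, kernel_size=(1, 2), strides=(1, 2), activation="relu"),
    tf.keras.layers.Conv2D(16, kernel_size=(1, 20), activation="relu"),
    # Temporal: non-linear moving average over the time dimension
    tf.keras.layers.Conv2D(16, kernel_size=(4, 1), padding="same", activation="relu"),
    # Collapse the singleton spatial axis; the CNN output is a length-W feature sequence
    tf.keras.layers.Reshape((W, 16)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(10),  # one forecast per horizon
])
model.compile(optimizer="adam", loss="mse")
```

The whole CNN-LSTM is differentiable end to end, so it can be trained jointly via BPTT as noted above.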
17 / 46
Related Work
◮ Due to the success of NNs in classification, the existing literature reformulates the regression problem as one of classification
◮ There are three main groups of authors who have focussed on
this problem, including
◮ Tsantekidis, Passalis, Tefas, Kanniainen, Gabbouj, and Iosifidis
(2020) - Focus on 5 Finnish stocks (FI-2010), investigate
different architectures
◮ Sirignano and Cont (2019) - Focus on S&P 500 with a single
architecture, investigate questions of universality
◮ Zhang, Zohren, and Roberts (2019) - Focus on 5 LSE stocks,
investigate sophisticated CNN-LSTM inception networks
◮ There are still a lot of open practical questions to investigate
18 / 46
Some Important Questions for the Practitioner
◮ Should we transform the raw LOB before inputting into the
NN?
◮ What architecture(s) are best?
◮ Is there a linkage between model predictive performance and
stock characteristics / microstructural properties?
◮ What kind of horizon do these alphas have?
19 / 46
What About Stationarity?
◮ Recall the original regression formulation
20 / 46
How to Difference a LOB?
◮ Given two snapshots of a limit order book, define the following
transform
$$bOF_{t,i} := \begin{cases} v_t^{i,b}, & \text{if } b_t^i > b_{t-1}^i, \\ v_t^{i,b} - v_{t-1}^{i,b}, & \text{if } b_t^i = b_{t-1}^i, \\ -v_{t-1}^{i,b}, & \text{if } b_t^i < b_{t-1}^i \end{cases}$$
$$aOF_{t,i} := \begin{cases} -v_{t-1}^{i,a}, & \text{if } a_t^i > a_{t-1}^i, \\ v_t^{i,a} - v_{t-1}^{i,a}, & \text{if } a_t^i = a_{t-1}^i, \\ v_t^{i,a}, & \text{if } a_t^i < a_{t-1}^i \end{cases}$$
◮ If the bid price doesn’t move and the bid size decreases, that’s a negative imbalance (signal); if the bid price moves up, that’s a positive imbalance; if the bid price moves down, that’s a negative imbalance
◮ Classical transform taken from the market microstructure
literature Cont, Kukanov, and Stoikov (2014)
21 / 46
Order Flows & Order Flow Imbalances
◮ We form order flow (OF) by concatenation and order flow
imbalance (OFI) by subtraction
$$OF_t := \begin{pmatrix} bOF_t \\ aOF_t \end{pmatrix} \in \mathbb{R}^{20}, \qquad OFI_t := bOF_t - aOF_t \in \mathbb{R}^{10}$$
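A small NumPy sketch of these transforms (the function and variable names are my own; each argument is a length-10 array taken from two consecutive LOB snapshots):

```python
import numpy as np

def order_flow(bid_px, bid_sz, ask_px, ask_sz,
               bid_px_prev, bid_sz_prev, ask_px_prev, ask_sz_prev):
    """Per-level bid/ask order flow, OF and OFI from two consecutive LOB snapshots."""
    # Bid side: price up -> +current size, unchanged -> size change, price down -> -previous size
    bOF = np.where(bid_px > bid_px_prev, bid_sz,
          np.where(bid_px == bid_px_prev, bid_sz - bid_sz_prev, -bid_sz_prev))
    # Ask side: mirror image of the bid side
    aOF = np.where(ask_px > ask_px_prev, -ask_sz_prev,
          np.where(ask_px == ask_px_prev, ask_sz - ask_sz_prev, ask_sz))
    OF = np.concatenate([bOF, aOF])   # order flow, in R^20
    OFI = bOF - aOF                   # order flow imbalance, in R^10
    return OF, OFI
```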
22 / 46
Our Setup & Contribution
◮ Recast the problem as standard regression and include multiple
horizons in the output
23 / 46
Data & Model Fitting
24 / 46
Data Description
◮ We use data for the time period January 1, 2019 through
January 31, 2020 (LOBSTER, WRDS)
◮ Dataset is very large, approximately 10TB uncompressed. We
need specialist infrastructure to process and store
◮ Across our universe we have stocks with a few updates per day
(EBAY) and stocks with many updates per day (MSFT).
Information flows at different rates for different stocks and so
horizon cannot be constant in time across stocks
25 / 46
Defining Return Horizons
◮ We use a stock-specific increment defined via
$$\Delta t := \frac{2.34 \cdot 10^7}{N}, \quad (6)$$
where the numerator is the number of milliseconds in a trading day and the denominator N denotes the average number of non-zero tick-by-tick mid-price returns
◮ We then define horizons in terms of ∆t or “number of average price changes.” We focus on short horizons, $\frac{k}{5}\Delta t$ for $k = 1, \ldots, 10$, i.e. between 0 and 2 average price changes
◮ Insert a fixed latency buffer of 10ms for all intervals to mimic
production setting
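In code the horizon construction is straightforward (a sketch; N would be estimated per stock from the data, and the 10ms latency buffer is applied separately when labelling returns):

```python
MS_PER_TRADING_DAY = 2.34e7  # 6.5 hours = 23,400,000 ms

def delta_t_ms(avg_daily_price_changes):
    """Stock-specific time increment, equation (6)."""
    return MS_PER_TRADING_DAY / avg_daily_price_changes

def horizons_ms(avg_daily_price_changes, k_max=10):
    """Horizons (k/5) * delta_t for k = 1..k_max, i.e. 0.2 to 2 average price changes."""
    dt = delta_t_ms(avg_daily_price_changes)
    return [k / 5 * dt for k in range(1, k_max + 1)]
```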
26 / 46
Model Universe
◮ ARX - Autoregressive with exogenous features (linear model)
$$r_t = w_0 + \sum_{i=1}^{100} v_i^\top x_{t-i} + \varepsilon_t$$
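Fitting the ARX benchmark is ordinary least squares on a lagged design matrix; a minimal NumPy sketch (X is the T × D matrix of input vectors, r the length-T return series):

```python
import numpy as np

def build_lagged_design(X, W=100):
    """Stack W lags of the feature matrix X (T x D) into a (T-W) x (W*D) design matrix."""
    T, _ = X.shape
    return np.asarray([X[t - W:t][::-1].ravel() for t in range(W, T)])  # most recent lag first

def fit_arx(X, r, W=100):
    """Least-squares estimates of w0 and the lag coefficients v_1, ..., v_W."""
    Z = build_lagged_design(X, W)
    Z = np.hstack([np.ones((Z.shape[0], 1)), Z])   # intercept column for w0
    coefs, *_ = np.linalg.lstsq(Z, r[W:], rcond=None)
    return coefs
```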
27 / 46
Model Hyperparameters
28 / 46
Training
◮ To mimic a real life production setting we perform (for each
symbol) rolling fits over our time period
◮ We choose a (1,4,1) configuration
◮ The first week is for validation (early stopping)
◮ The middle 4 weeks are for training
◮ The final week is for out of sample testing
◮ We then step this forward by 3 weeks (to keep numbers of fits
manageable)
◮ Overall we have 12 × 115 × 18 (models × stocks × time periods) NNs to fit
◮ We apply winsorization and Z -scoring to all independent and
dependent variables used in the regressions
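A schematic of the rolling (1, 4, 1) split (a sketch of the bookkeeping; only the week counts and the 3-week step are taken from the slide):

```python
def rolling_splits(weeks, n_val=1, n_train=4, n_test=1, step=3):
    """Yield (validation, training, test) week lists for each rolling fit."""
    window = n_val + n_train + n_test
    for start in range(0, len(weeks) - window + 1, step):
        w = weeks[start:start + window]
        yield w[:n_val], w[n_val:n_val + n_train], w[n_val + n_train:]

# Usage: for each (val, train, test) triple, fit on train, early-stop on val, evaluate on test
```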
29 / 46
Network Fitting
◮ Training deep learning models at this scale would likely be impossible without the use of GPUs, which offer a ∼10× speedup
◮ Our computations are performed on the NYU Greene and Hudson high performance computing environments. Greene has 32K CPU cores, 332 NVIDIA GPUs and 145TB of RAM distributed over 568 nodes; Hudson has 960 CPU cores, 160 AMD GPUs and 10TB of RAM distributed across 20 nodes
◮ Code is written in Python and NNs are trained using
Tensorflow 2.3.1 and Keras (ADAM optimizer)
◮ For LSTM/CNN-LSTM-based models, we use a single GPU,
either NVIDIA Quadro RTX8000 (48GB), an NVIDIA V100
(32GB) or an AMD MI-50 (32GB).
◮ Time to train a single model varies in the range of 10-60
minutes, depending on model, stock and GPU
30 / 46
Results
31 / 46
Order Flow Imbalance or Limit Order Book?
[Figure: out-of-sample R²_OS (%) for the ARX, CNN-LSTM, LSTM, LSTM (3), LSTM-MLP and MLP models, comparing LOB and OF inputs across the two panels]
32 / 46
Discussion
◮ Recall the key questions we set out to address
◮ Understand the effects of different RHS in the regression
◮ Compare multiple architectures across a large set of symbols
◮ OF input is clearly better than LOB - stationarity of inputs is
important
◮ LSTM based models outperform non-LSTM models
◮ Depth/CNN layers do not seem to outperform plain LSTM
after converting to OF
◮ Significant alpha at all horizons for the OF models. Small R² but high profitability due to the short horizons
33 / 46
Stock Characteristics
◮ We have seen that a regular LSTM model with OF input is an
excellent (non-complex) model
◮ Use the following stock characteristics to study model
performance
◮ Tick Size - Fraction of time the spread is equal to $0.01 (large
tick stocks approximately 1)
◮ Log Updates - Log Number of updates/day
◮ Log Trades - Log Number of trades/day
◮ Log Price Chg - Log Number of price changes/day
◮ All numbers computed per stock by averaging across the time
period
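A rough pandas sketch of how these per-stock characteristics could be assembled from daily summary data (the column names are hypothetical, and the tick-size measure here is event-weighted for simplicity):

```python
import numpy as np
import pandas as pd

def stock_characteristics(daily):
    """daily: one row per (stock, day) with columns frac_spread_1tick,
    n_updates, n_trades, n_price_changes; returns one row per stock."""
    per_stock = daily.groupby("stock").mean(numeric_only=True)   # average across the period
    return pd.DataFrame({
        "TickSize":    per_stock["frac_spread_1tick"],           # fraction of time spread == $0.01
        "LogUpdates":  np.log(per_stock["n_updates"]),
        "LogTrades":   np.log(per_stock["n_trades"]),
        "LogPriceChg": np.log(per_stock["n_price_changes"]),
    })
```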
34 / 46
Characteristics Correlation
[Figure: correlation heatmap of the stock characteristics Tick Size, Log Updates, Log Trades, Log Price Chg and Log(Updates/PriceChg); e.g. Log Updates and Log Trades have correlation 0.86]
35 / 46
Methodology
◮ We fix the model to be (LSTM, OF), average across horizons and out-of-sample data points
◮ We are left with a single R²_OS per stock (115 points)
36 / 46
Cross Sectional Performance
[Figure: cross-sectional out-of-sample R²_OS (%) per stock, shown in four panels (i)–(iv)]
37 / 46
Explaining Performance
[Figure: R²_OS (%) per stock versus Log(Updates/PriceChg), with fitted line R²_OS = 0.0089·log(x) − 0.0089]
38 / 46
Discussion
◮ There are clear dependencies on updates, tick size and trades
◮ Regression analysis (see article) shows that the best characteristic is in fact a combination, Log(Updates/PriceChg). This explains performance (R²_OS)
39 / 46
Additional Results & Robustness Checks
Many additional questions addressed in the article
◮ How far ahead can we predict returns? (about 2-3 price
changes)
◮ How sensitive are the results to the fixed window length
W = 100? (interestingly not very)
◮ What if we use an OFI or volume-only LOB RHS? (removing
prices helps but OF is the best)
◮ Motivation for results in terms of inductive biases - an
interesting new concept from the ML literature
40 / 46
Conclusion & Extensions
41 / 46
Conclusion
◮ We built a framework to evaluate different inputs and deep
learning models for return predictions
◮ We demonstrated that stationarity of the inputs is critical to
getting good outcomes
◮ Our results suggest that universality deserves a second look, as model results have strong microstructural dependencies
◮ Simple architectures perform as well as complicated ones
42 / 46
Extensions
◮ Better understand relationship between classification and
regression
◮ How about Bayesian deep nets or other “fancy” models?
◮ Can we use this setup to perform volume prediction?
◮ Can we extend what we have to other venues/asset classes
(futures)?
43 / 46
Contact
Petter Kolm
NYU Courant
[email protected]
https://www.linkedin.com/in/petterkolm
44 / 46
References I
Briola, Antonio, Jeremy Turiel, and Tomaso Aste (2020). Deep Learning
Modeling Of Limit Order Book: A Comparative Perspective. arXiv:
2007.07319 [q-fin.TR].
Cont, Rama, Arseniy Kukanov, and Sasha Stoikov (2014). “The Price Impact
Of Order Book Events”. In: Journal Of Financial Econometrics 12.1,
pp. 47–88.
Hochreiter, Sepp and Jürgen Schmidhuber (1997). “Long Short-Term Memory”.
In: Neural Computation 9.8, pp. 1735–1780.
Hornik, Kurt (1991). “Approximation Capabilities Of Multilayer Feedforward
Networks”. In: Neural Networks 4.2, pp. 251–257.
Kolm, Petter N., Jeremy Turiel, and Nicholas Westray (2021). “Deep Order
Flow Imbalance: Extracting Alpha from the Limit Order Book”. In: Working
Paper.
Kolm, Petter N. and Nicholas Westray (2023). “A Bayesian Approach to
Analyzing Information Content of Cross-Sectional and Multilevel Order Flow
Imbalance”. In preparation.
45 / 46
References II
Sirignano, Justin A. and Rama Cont (2019). “Universal Features Of Price
Formation In Financial Markets: Perspectives From Deep Learning”. In:
Quantitative Finance 19.9, pp. 1449–1459.
Tsantekidis, Avraam, Nikolaos Passalis, Anastasios Tefas, Juho Kanniainen,
Moncef Gabbouj, and Alexandros Iosifidis (2020). “Using Deep Learning For
Price Prediction By Exploiting Stationary Limit Order Book Features”. In:
Applied Soft Computing 93, p. 106401.
Zhang, Zihao, Stefan Zohren, and Stephen Roberts (2019). “DeepLOB: Deep
Convolutional Neural Networks for Limit Order Books”. In: IEEE
Transactions On Signal Processing 67.11, pp. 3001–3012.
46 / 46