
Forecasting Stock Prices from the Limit Order Book using Convolutional Neural Networks


Avraam Tsantekidis∗, Nikolaos Passalis∗, Anastasios Tefas∗,
Juho Kanniainen†, Moncef Gabbouj‡ and Alexandros Iosifidis‡
∗ Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
{avraamt, passalis}@csd.auth.gr, [email protected]
† Laboratory of Industrial and Information Management, Tampere University of Technology, Tampere, Finland

[email protected]
‡ Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland

{moncef.gabbouj, alexandros.iosifidis}@tut.fi

Abstract—In today's financial markets, where most trades are performed in their entirety by electronic means and the largest fraction of them is completely automated, an opportunity has arisen from analyzing this vast amount of transactions. Since all transactions are recorded in great detail, investors can analyze all the generated data and detect repeated patterns in the price movements. Detecting such patterns in advance allows them to take profitable positions or avoid anomalous events in the financial markets. In this work we propose a deep learning methodology, based on Convolutional Neural Networks (CNNs), that predicts the price movements of stocks, using as input large-scale, high-frequency time-series derived from the order book of financial exchanges. The dataset we use contains more than 4 million limit order events, and our comparison with other methods, such as Multilayer Neural Networks and Support Vector Machines, shows that CNNs are better suited for this kind of task.

I. INTRODUCTION

Financial markets present an opportunity for perceptive investors to buy undervalued assets and short overvalued ones. One way to take advantage of such circumstances is to observe the market and determine which moves one has to make to produce the largest profit with the least amount of risk. The idea of using mathematical models to predict aspects of financial markets has manifested as the field of quantitative analysis. The basic premise of this field is the analysis of the time-series produced by the markets using mathematical and statistical modelling, which allows us to extract valuable predictions about various aspects of the markets, such as the volatility, the trend, or the real value of an asset. However, these mathematical models rely on handcrafted features and have their parameters tuned manually by observation, which can reduce the accuracy of their predictions. Furthermore, asset price movements in the financial markets very frequently exhibit irrational behaviour, since they are largely influenced by human activity that mathematical models fail to capture.

Machine Learning, and especially Deep Learning, has been perceived as a solution to the aforementioned limitations of handcrafted systems. Given some input features, machine learning models can be used to predict the behaviour of various aspects of financial markets [1], [2], [3], [4]. This has led several organizations, such as hedge funds and investment firms, to use machine learning models alongside conventional mathematical models in their trading operations.

Furthermore, the modernization of exchanges and the automation of trading have dramatically increased the volume of trading that happens daily and, as a result, the amount of data produced inside the exchanges. This has created an opportunity for the exchanges to gather all the trading data and create comprehensive logs of every transaction. This data contains valuable signals that can be used to forecast changes in the market, which can in turn be used by algorithms to make the correct decisions in live trading. However, applying machine learning techniques to such large-scale data is not a straightforward task. Being able to utilize information at this scale can provide strategies for many different market conditions, but also safeguard against volatile market movements.
The main contribution of this work is the proposal of a deep learning methodology, based on Convolutional Neural Networks (CNNs), that can be used for predicting future mid-price movements from large-scale, high-frequency limit order data. This includes an intelligent normalization scheme that takes into account the differences in price scales between different stocks and different time periods of an individual stock. Note that even though this data seems very suitable for analysis with deep learning techniques, there have been only a few published works using it, mainly due to the cost of obtaining such data and the possible unwillingness to publish positive results that could be used for profit.

In Section 2, related work on machine learning models applied to financial data is briefly presented. Then, the large-scale dataset we use is described in detail in Section 3. In Section 4 the proposed deep learning methodology is introduced, while in Section 5 the experimental evaluation is provided. Finally, conclusions are drawn and future work is discussed in Section 6.
II. RELATED WORK

Deep Learning has been shown to significantly improve upon previous machine learning methods in tasks such as speech recognition [5], image captioning [6], [7], and question answering [8]. Deep Learning models, such as Convolutional Neural Networks (CNNs) [9] and Recurrent Neural Networks (RNNs) [10], have greatly contributed to the increase of performance in these fields, with ever deeper architectures producing even better results [11].

In the work on Deep Portfolio Theory [12], the authors use autoencoders to optimize the performance of a portfolio and beat several profit benchmarks, such as the biotechnology IBB Index. Similarly, in [2] a Restricted Boltzmann Machine (RBM) is trained to encode monthly closing prices of stocks and is then fine-tuned to predict whether each stock's price will move above or below the median change. This strategy is compared to a simple momentum strategy, and it is established that the proposed method achieves significant improvements in annualized returns.

The daily data of the S&P 500 market fund prices and Google domestic trends of 25 terms like "bankruptcy" and "insurance" are used as the input to a recurrent neural network that is trained to predict the volatility of the market fund's price [3]. This method greatly improves upon existing benchmarks, such as autoregressive GARCH and Lasso techniques.

An application using high-frequency limit order book (LOB) data is [4], where the authors create a set of handcrafted features, such as price differences, bid-ask spreads, and price and volume derivatives. A Support Vector Machine (SVM) is then trained to predict whether the mid-price will move upwards or downwards in the near future using these features. However, only 2000 data points are used for training the SVM in each training round, limiting the prediction accuracy of the model.

To the best of our knowledge, this is the first work that uses a large-scale dataset with more than 4 million limit orders to train CNNs for predicting the price movement of stocks. The method proposed in this paper is also combined with an intelligent normalization scheme that takes into account the differences in the price scales between different stocks and time periods, which is essential for effectively scaling to such large-scale data.
III. HIGH FREQUENCY LIMIT ORDER DATA

In financial equity markets, a limit order is a type of order to buy or sell a specific number of shares at a set price. For example, a sell limit order (ask) of $10 with a volume of 100 indicates that the seller wishes to sell the 100 shares for no less than $10 a piece. Respectively, a buy limit order (bid) of $10 means that the buyer wishes to buy a specified amount of shares for no more than $10 each.

Consequently, the order book has two sides: the bid side, containing buy orders with prices $p_b(t)$ and volumes $v_b(t)$, and the ask side, containing sell orders with prices $p_a(t)$ and volumes $v_a(t)$. The orders are sorted on both sides based on the price. On the bid side, $p_b^{(1)}(t)$ is the highest available buy price, and on the ask side, $p_a^{(1)}(t)$ is the lowest available sell price.

Whenever a bid order price exceeds an ask order price, $p_b^{(i)}(t) > p_a^{(j)}(t)$, the orders "annihilate": they are executed and the traded assets are exchanged between the investors. If more than two orders fulfill the price requirement, the effect chains to them as well. Since matched orders do not usually have the same requested volume, the order with the greater size remains in the order book with the remaining unfulfilled volume.
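To make the matching mechanics concrete, here is a minimal order-book sketch in Python (our own illustrative code; the class and method names are not from the paper or from any exchange API):

```python
import heapq

class MinimalOrderBook:
    """Toy limit order book: best bid is the highest buy price,
    best ask is the lowest sell price."""

    def __init__(self):
        self.bids = []  # max-heap via negated prices: (-price, volume)
        self.asks = []  # min-heap: (price, volume)

    def add(self, side, price, volume):
        # A new order first matches against the opposite side while the
        # bid price is >= the ask price ("annihilation"); any unfilled
        # remainder then rests in the book.
        book, opposite = (self.bids, self.asks) if side == "bid" else (self.asks, self.bids)
        while volume > 0 and opposite:
            best = opposite[0]
            best_price = -best[0] if side == "ask" else best[0]
            crosses = price >= best_price if side == "bid" else price <= best_price
            if not crosses:
                break
            traded = min(volume, best[1])
            volume -= traded
            if best[1] - traded == 0:
                heapq.heappop(opposite)
            else:
                # Same key, reduced volume, so the heap stays valid.
                opposite[0] = (best[0], best[1] - traded)
        if volume > 0:
            heapq.heappush(book, (-price if side == "bid" else price, volume))

book = MinimalOrderBook()
book.add("ask", 10.0, 100)  # sell 100 shares at no less than $10
book.add("bid", 10.0, 40)   # buy 40 shares at no more than $10 -> trade
print(book.asks)            # [(10.0, 60)]: 60 unfilled shares remain
```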
Several tasks arise from this data, ranging from the prediction of the price trend and the regression of the future value of a metric, e.g., volatility, to the detection of anomalous events that cause price jumps, either upwards or downwards. These tasks can lead to interesting applications, such as protecting investments when market conditions are unreliable, or taking advantage of such conditions to create automated trading techniques for profit.

Methods that utilize this data often subsample it using re-sampling techniques, such as OHLC (Open-High-Low-Close) resampling [13], to ensure that a specific number of values exists for each timeframe, e.g., every minute or every day. Even though the OHLC method preserves the trend features of the market movements, it removes all the microstructure information of the markets. This is one of the problems CNNs can solve to take full advantage of the information contained in the data, since they can more accurately pick up recurring patterns between time steps.
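As an illustration of the re-sampling being contrasted here, the pandas snippet below (our example, on synthetic prices) builds one-minute OHLC bars; every tick between a bar's open, high, low and close is discarded, which is exactly the microstructure loss described above:

```python
import numpy as np
import pandas as pd

# Synthetic mid-price series sampled once per second (illustrative only).
index = pd.date_range("2010-06-01 10:00", periods=600, freq="s")
prices = pd.Series(100 + np.random.randn(600).cumsum() * 0.01, index=index)

# One-minute OHLC bars: each bar keeps only 4 numbers out of 60 ticks.
bars = prices.resample("1min").ohlc()
print(bars.head())
```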
IV. CONVOLUTIONAL NEURAL NETWORKS FOR FINANCIAL DATA

The input data consist of the 10 orders on each side of the LOB (bid and ask). Each order is described by 2 values, the price and the volume, so in total we have 40 values for each timestep. The stock data, provided by Nasdaq Nordic, come from the Finnish companies Kesko Oyj, Outokumpu Oyj, Sampo, Rautaruukki and Wartsila Oyj. The time period used for collecting the data ranges from the 1st to the 14th of June 2010 (only business days are included), while the data are provided by the Nasdaq Nordic data feeds [14], [15].

The dataset is made up of 10 days for 5 different stocks, and the total number of messages is 4.5 million, with equally many separate depths. Since the price and volume range is much greater than the range of the values of the activation function of our neural network, we need to normalize the data before feeding them to the network. To this end, standardization (z-score) is employed to normalize the data:

$$x_{\text{norm}} = \frac{x - \bar{x}}{\sigma_{\bar{x}}} \quad (1)$$

where $x$ is the vector of values we want to normalize, $\bar{x}$ is the mean value of the data and $\sigma_{\bar{x}}$ is the standard deviation of the data. Instead of simply normalizing all the values together, we take into account the scale differences between order prices and order volumes and use a separate normalizer, with different mean and standard deviation, for each of them. Also, since different stocks have different price ranges and drastic distribution shifts might occur in an individual stock across different days, the normalization of the current day's values uses the mean and standard deviation calculated from the previous day's data.
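A minimal NumPy sketch of this normalization scheme follows (our own code; the assumption that even-numbered feature columns hold prices and odd-numbered columns hold volumes is ours, made only to keep the example concrete):

```python
import numpy as np

def zscore(x, mean, std):
    """Equation (1) with externally supplied statistics."""
    return (x - mean) / std

def normalize_days(days):
    """days: list of (T, 40) arrays of raw LOB features. Assumed layout
    for this sketch only: even columns are prices, odd columns volumes."""
    price_cols = np.arange(40) % 2 == 0
    out = []
    for prev, cur in zip(days[:-1], days[1:]):
        norm = cur.astype(float)
        for cols in (price_cols, ~price_cols):
            # Separate normalizer per feature type, with statistics taken
            # from the *previous* day so no future information leaks in.
            norm[:, cols] = zscore(cur[:, cols], prev[:, cols].mean(),
                                   prev[:, cols].std())
        out.append(norm)
    return out  # the first day only supplies statistics for the second

days = [np.random.rand(1000, 40) * 100 for _ in range(3)]
normalized = normalize_days(days)
print(len(normalized), normalized[0].shape)  # 2 (1000, 40)
```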
We want to predict the direction in which the price will change. In this work the term price is used to refer to the mid-price of a stock, which is defined as the mean between the best bid price and the best ask price at time $t$:

$$p_t = \frac{p_a^{(1)}(t) + p_b^{(1)}(t)}{2} \quad (2)$$

This is a virtual value for the price, since no order can be executed at that exact price, but predicting its upward or downward movement provides a good estimate of the price of future orders. A set of discrete choices must be constructed from our data to use as targets for our classification model. Simply using $p_t > p_{t+k}$ to determine the direction of the mid-price would introduce an unmanageable amount of noise, since the smallest change would be registered as an upward or downward movement.

Note that each consecutive depth sample is only slightly different from the previous one, so the short-term changes between prices are very small and noisy. In order to filter such noise from the extracted labels, we use the following smoothed approach. First, the mean of the previous $k$ mid-prices, denoted by $m_b$, and the mean of the next $k$ mid-prices, denoted by $m_a$, are defined as:

$$m_b(t) = \frac{1}{k} \sum_{i=0}^{k} p_{t-i} \quad (3)$$

$$m_a(t) = \frac{1}{k} \sum_{i=1}^{k} p_{t+i} \quad (4)$$

where $p_t$ is the mid-price as described in Equation (2). Then, a label $l_t$ that expresses the direction of price movement at time $t$ is extracted by comparing the previously defined quantities ($m_b$ and $m_a$):

$$l_t = \begin{cases} 1, & \text{if } m_b(t) > m_a(t) \cdot (1 + \alpha) \\ -1, & \text{if } m_b(t) < m_a(t) \cdot (1 - \alpha) \\ 0, & \text{otherwise} \end{cases} \quad (5)$$

where the threshold $\alpha$ is set as the least amount of change in price that must occur for it to be considered upward or downward. If the price does not exceed this limit, the sample is considered to belong to the stationary class. Therefore, the resulting label expresses the trend we wish to predict. Note that this process is applied for every time step in our data.
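The labelling rule can be implemented directly; the NumPy sketch below (our code, with $k$ and $\alpha$ as free parameters) follows Equations (3)-(5) as printed, including the $1/k$ factor in front of the $(k+1)$-term sum of Equation (3):

```python
import numpy as np

def extract_labels(p, k, alpha):
    """Directional labels l_t per Equations (3)-(5)."""
    p = np.asarray(p, dtype=float)
    labels = np.zeros(len(p), dtype=int)
    for t in range(k, len(p) - k):
        m_b = p[t - k:t + 1].sum() / k     # Eq. (3): previous mid-prices
        m_a = p[t + 1:t + k + 1].sum() / k  # Eq. (4): next mid-prices
        if m_b > m_a * (1 + alpha):
            labels[t] = 1    # m_b exceeds m_a by more than the threshold
        elif m_b < m_a * (1 - alpha):
            labels[t] = -1   # m_b falls below m_a by more than the threshold
        # otherwise the sample stays in the stationary class (0)
    return labels

mid_prices = 10 + 0.01 * np.cumsum(np.random.randn(500))
labels = extract_labels(mid_prices, k=20, alpha=2e-4)  # alpha value assumed
print(np.bincount(labels + 1))  # counts of the -1 / 0 / +1 classes
```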
To forecast the mid-price movement of a stock, a Convolutional Neural Network is used. The 100 most recent LOB depths are fed as input to the network. Therefore, the input matrix for each time-step is defined as $X = [x_1, x_2, \ldots, x_{100}]^T \in \mathbb{R}^{100 \times 40}$, where $x_i$ is the 40-dimensional vector that describes the $i$-th most recent LOB depth. This vector contains the 10 highest bid orders and the 10 lowest ask orders; each order contains 2 values, a price and a size, totalling 40 values.
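Building these inputs is a sliding-window operation over the normalized depth features; a short NumPy sketch (ours, with the window-to-label pairing as an assumption) follows:

```python
import numpy as np

def make_windows(depth_features, labels, window=100):
    """depth_features: (T, 40) normalized depths; returns (N, 100, 40)
    inputs, each paired with the label of its most recent time step."""
    X, y = [], []
    for t in range(window - 1, len(depth_features)):
        X.append(depth_features[t - window + 1:t + 1])
        y.append(labels[t])
    return np.stack(X), np.asarray(y)

feats = np.random.randn(1000, 40).astype(np.float32)
labs = np.random.randint(-1, 2, size=1000)
X, y = make_windows(feats, labs)
print(X.shape, y.shape)  # (901, 100, 40) (901,)
```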

Fig. 1: Typical Convolutional Neural Network architecture using 1D convolutions: a 2D input sequence passes through alternating convolution and pooling layers, followed by fully connected layers and a softmax over the output classes.

CNNs are composed of a series of convolutional and pooling layers followed by a set of fully connected layers, as shown in Figure 1. Each convolutional layer $i$ is equipped with a set of filters $W_i \in \mathbb{R}^{S \times D \times N}$ that is convolved with the input tensor, where $S$ is the number of used filters, $D$ is the filter size, and $N$ is the number of input channels. The output of a convolutional layer can optionally be pooled using a pooling layer. For example, a max pooling layer with size 2 will subsample its input by a factor of 2 by applying the maximum function on each consecutive pair of vectors of the input matrix. Using a series of convolutional and pooling layers allows for capturing the fine temporal dynamics of the time-series as well as correlating temporally distant features. After the last convolutional/pooling layer, a set of fully connected layers is used to classify the input time-series. The network's output expresses the categorical distribution for the three direction labels (upward, downward and stationary), as described in Equation (5), for each time-step.

The parameters of the model are learned by minimizing the categorical cross entropy loss defined as:

$$\mathcal{L}(W) = -\sum_{i=1}^{L} y_i \cdot \log \hat{y}_i \quad (6)$$

where $L$ is the number of different labels and the notation $W$ is used to refer to the parameters of the CNN. The ground truth vector is denoted by $y$, while $\hat{y}$ is the predicted label distribution. The loss is summed over all samples in each batch. The most commonly used method to minimize the loss function defined in Equation (6) and learn the parameters $W$ of the model is gradient descent [16]:

$$W' = W - \eta \cdot \frac{\partial \mathcal{L}}{\partial W} \quad (7)$$

where $W'$ are the parameters of the model after each gradient descent step and $\eta$ is the learning rate. In this work we utilize the Adaptive Moment Estimation algorithm, known as ADAM [17], which ensures that the learning steps are scale invariant with respect to the parameter gradients.
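The models themselves were written in Blocks/Theano (see below); purely to illustrate Equations (6) and (7), a single training step might look like the following PyTorch sketch (our code, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, batch_x, batch_y):
    """One update: categorical cross entropy (Eq. (6)), summed over the
    batch as in the text, minimized with ADAM instead of plain Eq. (7).
    batch_y holds class indices in {0, 1, 2}, e.g. l_t + 1."""
    optimizer.zero_grad()
    logits = model(batch_x)            # (batch, 3) unnormalized scores
    loss = F.cross_entropy(logits, batch_y, reduction="sum")
    loss.backward()                    # computes dL/dW by backpropagation
    optimizer.step()                   # adaptive version of W' = W - eta*dL/dW
    return loss.item()

# Example wiring (hypothetical names):
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # ADAM [17]
# loss = train_step(model, optimizer, batch_x, batch_y)
```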
The CNN and MLP models, along with all the training algorithms, were developed using the Blocks framework [18] and the Theano library [19], [20], while for the SVM method the implementation provided by the scikit-learn library [21] was used.

V. EXPERIMENTAL EVALUATION

The architecture of the proposed CNN model consists of the following layers:

1) 2D Convolution with 16 filters of size (4, 40)
2) 1D Convolution with 16 filters of size (4,) and max pooling with size (2,)
3) 1D Convolution with 32 filters of size (3,)
4) 1D Convolution with 32 filters of size (3,) and max pooling with size (2,)
5) Fully connected layer with 32 neurons
6) Fully connected layer with 3 neurons

A visual representation of our model is shown in Figure 3. Leaky Rectifying Linear Units [22] are used as the activation function for both the convolutional layers and the first fully connected layer, while the softmax function is used for the output layer of the network.
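As a re-implementation sketch (ours, in PyTorch rather than the original Blocks/Theano, with the Leaky ReLU slope left at the library default as an assumption), the layer list above translates to:

```python
import torch
import torch.nn as nn

class LOBCNN(nn.Module):
    """Sketch of the layer list above for a (1, 100, 40) input."""
    def __init__(self):
        super().__init__()
        self.conv2d = nn.Conv2d(1, 16, kernel_size=(4, 40))  # -> (16, 97, 1)
        self.conv1 = nn.Conv1d(16, 16, kernel_size=4)        # -> (16, 94)
        self.pool1 = nn.MaxPool1d(2)                         # -> (16, 47)
        self.conv2 = nn.Conv1d(16, 32, kernel_size=3)        # -> (32, 45)
        self.conv3 = nn.Conv1d(32, 32, kernel_size=3)        # -> (32, 43)
        self.pool2 = nn.MaxPool1d(2)                         # -> (32, 21)
        self.fc1 = nn.Linear(32 * 21, 32)
        self.fc2 = nn.Linear(32, 3)
        self.act = nn.LeakyReLU(0.01)  # Leaky ReLU [22]; slope assumed

    def forward(self, x):              # x: (batch, 1, 100, 40)
        x = self.act(self.conv2d(x)).squeeze(-1)  # drop width dim of size 1
        x = self.pool1(self.act(self.conv1(x)))
        x = self.act(self.conv2(x))
        x = self.pool2(self.act(self.conv3(x)))
        x = self.act(self.fc1(x.flatten(1)))
        return self.fc2(x)             # softmax is applied inside the loss

print(LOBCNN()(torch.randn(16, 1, 100, 40)).shape)  # torch.Size([16, 3])
```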
Fig. 2: Training statistics of the CNN on the task of predicting the price movement of horizon k = 20, showing the training cost, F1 score and Cohen's κ on the train and test data. Each point on the x-axis denotes 2,500 training iterations.

For training our model, we use batches of 16 samples, where each sample consists of a sequence of 100 consecutive depths. Each depth consists of the 40 values described in Section IV. The dataset of 10 days is split in a configuration of 7 days for training and 3 days for testing. We train the same model for 3 different prediction horizons k, as defined in Equations (3) and (4).

To measure the performance of our model we use Cohen's kappa [23], which measures the concordance between sets of given answers, taking into consideration the possibility of random agreements happening. We also report the mean recall, precision and F1 score over all 3 classes. Recall is the number of true positive samples divided by the sum of true positives and false negatives, while precision is the number of true positives divided by the sum of true positives and false positives. F1 score is the harmonic mean of the precision and recall metrics.
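All of these metrics are available in scikit-learn [21], which is already used here for the SVM; a small sketch with placeholder predictions:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, precision_recall_fscore_support

y_true = np.random.randint(-1, 2, size=1000)   # dummy labels in {-1, 0, 1}
y_pred = np.random.randint(-1, 2, size=1000)   # dummy model output

kappa = cohen_kappa_score(y_true, y_pred)      # agreement beyond chance [23]
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro")           # mean over the 3 classes
print(f"kappa={kappa:.3f} recall={recall:.3f} "
      f"precision={precision:.3f} f1={f1:.3f}")
```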

The results of our experiments are shown in Table I. We compare our results with those of a Linear SVM model and an MLP model with Leaky Rectifiers as the activation function. The SVM model is trained using stochastic gradient descent, since the dataset is too large to use a closed-form solution. The MLP model uses a single hidden layer with 128 neurons with Leaky ReLU activations. The regularization parameter of the SVM was chosen using cross-validation on a split from the training set. Since neither of these models is sequential, we feed the concatenation of the previous 100 depth samples as input and use as the prediction target the price movement associated with the last depth sample.

TABLE I: Experimental results for different prediction horizons k

Prediction Horizon k = 10
Model   Recall    Precision   F1        Cohen's κ
SVM     39.62%    44.92%      35.88%    0.068
MLP     47.81%    60.78%      48.27%    0.226
CNN     50.98%    65.54%      55.21%    0.35

Prediction Horizon k = 20
Model   Recall    Precision   F1        Cohen's κ
SVM     45.08%    47.77%      43.20%    0.139
MLP     51.33%    65.20%      51.12%    0.255
CNN     54.79%    67.38%      59.17%    0.39

Prediction Horizon k = 50
Model   Recall    Precision   F1        Cohen's κ
SVM     46.05%    60.30%      49.42%    0.243
MLP     55.21%    67.14%      55.95%    0.324
CNN     55.58%    67.12%      59.44%    0.38

Fig. 3: A visual representation of the evaluated CNN model: 2D input (100, 40) → 2D convolution (4, 40) with 16 filters → 1D convolution (4,) with 16 filters → max pooling (2,) → 1D convolution (3,) with 32 filters → 1D convolution (3,) with 32 filters → max pooling (2,) → fully connected layer with 32 neurons → softmax over the 3 classes.

The proposed method significantly outperforms all the other evaluated models on the presented metrics, showing that the convolutional neural network can better handle the sequential nature of the LOB data and better determine the microstructure of the market in order to detect the mid-price changes that occur.

VI. CONCLUSION

In this work we trained a CNN on high-frequency LOB data, applying a temporally aware normalization scheme on the volumes and prices of the LOB depth. The proposed approach was evaluated using different prediction horizons, and it was demonstrated that it performs significantly better than other techniques, such as Linear SVMs and MLPs, when trying to predict short-term price movements.
There are several interesting future research directions. First, more data can be used to train the proposed model, scaling up to a billion training samples, to determine whether using more data leads to better classification performance. With more data, the "burn-in" phase can also be increased along with the prediction horizon, to gauge the model's ability to predict the trend further into the future. Also, an attention mechanism [6], [24] can be introduced to allow the network to capture only the relevant information and avoid noise. Finally, more advanced trainable normalization techniques can be used, as it was established that normalization is essential to ensure that the learned model generalizes well on unseen data.
REFERENCES

[1] M. F. Dixon, D. Klabjan, and J. H. Bang, "Classification-based financial markets prediction using deep neural networks," 2016.
[2] L. Takeuchi and Y.-Y. A. Lee, "Applying deep learning to enhance momentum trading strategies in stocks," 2013.
[3] R. Xiong, E. P. Nichols, and Y. Shen, "Deep learning stock volatility with Google domestic trends," arXiv preprint arXiv:1512.04916, 2015.
[4] A. N. Kercheval and Y. Zhang, "Modelling high-frequency limit order book dynamics with support vector machines," Quantitative Finance, vol. 15, no. 8, pp. 1315–1329, 2015.
[5] A. Graves, A.-r. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6645–6649.
[6] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in Proceedings of the International Conference on Machine Learning, vol. 14, 2015, pp. 77–81.
[7] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, "Deep captioning with multimodal recurrent neural networks (m-RNN)," arXiv preprint arXiv:1412.6632, 2014.
[8] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei, "Visual7W: Grounded question answering in images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4995–5004.
[9] Y. LeCun, Y. Bengio et al., "Convolutional networks for images, speech, and time series," The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10, 1995.
[10] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[11] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[12] J. Heaton, N. Polson, and J. Witte, "Deep portfolio theory," arXiv preprint arXiv:1605.07230, 2016.
[13] D. Yang and Q. Zhang, "Drift-independent volatility estimation based on high, low, open, and close prices," The Journal of Business, vol. 73, no. 3, pp. 477–492, 2000.
[14] A. Ntakaris, M. Magris, J. Kanniainen, M. Gabbouj, and A. Iosifidis, "Benchmark dataset for mid-price prediction of limit order book data," 2017.
[15] M. Siikanen, J. Kanniainen, and J. Valli, "Limit order books and liquidity around scheduled and non-scheduled announcements: Empirical evidence from Nasdaq Nordic," Finance Research Letters, vol. to appear, 2016.
[16] P. J. Werbos, "Backpropagation through time: what it does and how to do it," Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
[17] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[18] B. Van Merriënboer, D. Bahdanau, V. Dumoulin, D. Serdyuk, D. Warde-Farley, J. Chorowski, and Y. Bengio, "Blocks and Fuel: Frameworks for deep learning," arXiv preprint arXiv:1506.00619, 2015.
[19] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, "Theano: A CPU and GPU math compiler in Python," in Proc. 9th Python in Science Conf., 2010, pp. 1–7.
[20] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio, "Theano: new features and speed improvements," arXiv preprint arXiv:1211.5590, 2012.
[21] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[22] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proceedings of the International Conference on Machine Learning, vol. 30, no. 1, 2013.
[23] J. Cohen, "A coefficient of agreement for nominal scales," Educational and Psychological Measurement, vol. 20, no. 1, pp. 37–46, 1960.
[24] K. Cho, A. Courville, and Y. Bengio, "Describing multimedia content using attention-based encoder-decoder networks," IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1875–1886, 2015.
