This research article presents the first publicly available benchmark dataset for mid-price forecasting using limit order book data from the Nasdaq Nordic stock market, comprising approximately 4 million time series samples over 10 trading days. It outlines an experimental protocol for evaluating machine learning methods in this context and provides baseline performance results for comparison. The dataset aims to facilitate advancements in high-frequency trading research and improve the performance of expert systems in financial markets.

Received: 15 May 2018 Accepted: 7 July 2018

DOI: 10.1002/for.2543

RESEARCH ARTICLE

Benchmark dataset for mid-price forecasting of limit order book data with machine learning methods

Adamantios Ntakaris1, Martin Magris2, Juho Kanniainen2, Moncef Gabbouj1, Alexandros Iosifidis3

1 Laboratory of Signal Processing, Tampere University of Technology, Tampere, Finland
2 Laboratory of Industrial and Information Management, Tampere University of Technology, Tampere, Finland
3 Department of Engineering, Electrical and Computer Engineering, Aarhus University, Aarhus, Denmark

Correspondence
Adamantios Ntakaris, Laboratory of Signal Processing, Tampere University of Technology, Korkeakoulunkatu 1, Tampere, Finland.
Email: [email protected]

Funding information
H2020 Marie Sklodowska-Curie Actions, Grant/Award Number: MSCA-ITN-ETN 675044

Abstract
Managing the prediction of metrics in high-frequency financial markets is a challenging task. An efficient way is by monitoring the dynamics of a limit order book to identify the information edge. This paper describes the first publicly available benchmark dataset of high-frequency limit order markets for mid-price prediction. We extracted normalized data representations of time series data for five stocks from the Nasdaq Nordic stock market for a time period of 10 consecutive days, leading to a dataset of ∼4,000,000 time series samples in total. A day-based anchored cross-validation experimental protocol is also provided that can be used as a benchmark for comparing the performance of state-of-the-art methodologies. The performance of baseline approaches is also provided to facilitate experimental comparisons. We expect that such a large-scale dataset can serve as a testbed for devising novel solutions of expert systems for high-frequency limit order book data analysis.

KEYWORDS
high-frequency trading, limit order book, mid-price, machine learning, ridge regression, single hidden feedforward neural network

1 INTRODUCTION

Automated trading became a reality when the majority of exchanges adopted it globally. This environment is ideal for high-frequency traders. High-frequency trading (HFT) and a centralized matching engine, referred to as a limit order book (LOB), are the main drivers for generating big data (Seddon & Currie, 2017). In this paper, we describe a new order book dataset consisting of approximately 4 million events for 10 consecutive trading days for five stocks. The data are derived from the ITCH feed provided by Nasdaq OMX Nordic and consist of the time-ordered sequences of messages that track and record all the events occurring in the specific market. It provides a complete market-wide history of 10 trading days. Additionally, we define an experimental protocol to evaluate the performance of research methods in mid-price prediction.1

Datasets, like the one presented here, come with challenges, including the selection of appropriate data transformation, normalization, description, and classification. This type of massive dataset requires a very good understanding of the available information that can be extracted

1 Mid-price is the average of the best bid and best ask prices.

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the
original work is properly cited.
© 2018 The Authors Journal of Forecasting Published by John Wiley & Sons Ltd.

852 wileyonlinelibrary.com/journal/for Journal of Forecasting. 2018;37:852–866.


NTAKARIS ET AL. 853

for further processing. We follow the information edge, as has been recently presented by Kercheval and Zhang (2015). The authors provide a detailed description of representations that can be used for a mid-price movement prediction metric. In light of this data representation, they apply nonlinear classification based on support vector machines (SVM) in order to predict the movement of this metric. Such a supervised learning model exploits class labels2 for short- and long-term prediction. However, they train their model based on a very small (when compared to the size of the data that can be available for such applications) dataset of 4,000 samples. This is due to the limitations of many nonlinear kernel-based classification models related to their time and space complexity with respect to the training data size. On the other hand, Sirignano (2016) uses large amounts of data for nonlinear classification based on a feedforward network. The author takes advantage of the local spatial structure3 of the data for modeling the joint distribution of the LOB's state based on its current state.

Despite the major importance of publicly available datasets for advancing research in the HFT field, there are no detailed publicly available benchmark datasets for method evaluation purposes. In this paper, we describe the first publicly available dataset4 for an LOB-based HFT that has been collected in the hope of facilitating future research in the field. Based on Kercheval and Zhang (2015), we provide time series representations of approximately 4,000,000 trading events and annotations for five classification problems. Baseline results of two widely used methods (linear and nonlinear regression models) are also provided. In this way, we introduce this new problem for the expert systems community and provide a testbed for facilitating future research. We hope that attracting the interest of expert systems will lead to the rapid improvement of the performance achieved in the provided dataset, thus leading to much better state-of-the-art solutions to this important problem.

The dataset described in this paper can be useful for financial expert systems in two ways. First, it can be used to identify circumstances under which markets are stable, which is very important for liquidity providers (market makers) to make the spread. Consequently, such an intelligent system would be valuable as a framework that can increase liquidity provision. Secondly, analysis of the data can be used for model selection by speculative traders, who are trading based on their predictions on market movements. In future research, this paper can be employed to identify order book spoofing, that is, situations where markets are exposed to manipulation by limit orders. In this case, spoofers could aim to move markets in certain directions by limit orders that are canceled before they are filled. Therefore, this research is relevant not only for market makers and traders but also for supervisors and regulators.

Therefore, the present work makes the following contributions: (1) To the best of our knowledge this is the first publicly available LOB-ITCH dataset for machine learning experiments on the prediction of mid-price movements. (2) We provide baseline methods based on ridge regression and a new implementation of an RBF neural network based on the k-means algorithm. (3) The paper provides information about the prediction of mid-price movements to market makers, traders, and regulators. This paper does not suggest any trading strategies and is reliant on purely machine learning metrics prediction. Overall, this work is an empirical exploration of the challenges that come with high-frequency trading and machine learning applications.

The data from the Nasdaq Helsinki Stock Exchange offer important benefits. In the USA the limit orders for a given asset are spread between several exchanges, causing fragmentation of liquidity. The fragmentation poses a problem for empirical research, because, as Gould, Porter, Williams, McDonald, Fenn, and Howison (2013) point out, the “differences between different trading platforms' matching rules and transaction costs complicate comparisons between different limit order books for the same asset.” These issues related to fragmentation are not present with data obtained from less fragmented Nasdaq Nordic markets. Moreover, Helsinki Exchange is a pure limit order market, where the market makers have a limited role.

The rest of the paper is organized as follows. We provide a comprehensive literature review of the field in Section 2. Dataset and experimental protocol descriptions are provided in Section 3. Quantitative and qualitative comparisons of the new dataset, along with related data sources, are provided in Section 4. In Section 5, we describe the engineering of our baselines. Section 6 presents our empirical results and Section 7 concludes.

2 Labels are extracted from annotations provided by experts and represent the direction of the mid-price. Three different states are defined: upward, downward, and stationary movement.
3 By local movement, the author means that the conditional movement of the future price (e.g., best ask price movement) depends, locally, on the current LOB state.
4 The dataset can be downloaded from: https://etsin.avointiede.fi/dataset/urn-nbn-fi-csc-kata20170601153214969115.

2 MACHINE LEARNING FOR HFT AND LOB

The complex nature of HFT and LOB spaces is suitable for interdisciplinary research. In this section, we provide a comprehensive review of recent methods exploiting

machine learning approaches. Regression models, neural networks, and several other methods have been proposed to make inferences of the stock market. Existing literature ranges from metric prediction to optimal trading strategies identification. The research community has tried to tackle the challenges of prediction and data inference from different angles. Although mid-price prediction can be considered a traditional time series prediction problem, there are several challenges that justify HFT as a unique problem.

2.1 Regression analysis

Regression models have been widely used for HFT and LOB prediction. Zheng, Moulines, and Abergel (2012) utilize logistic regression in order to predict the inter-trade price jump. Alvim, dos Santos, and Milidiu (2010) use support vector regression (SVR) and partial least squares (PLS) for trading volume forecasting for 10 Bovespa stocks. Pai and Lin (2005) use a hybrid model for stock price prediction. They combine an autoregressive integrated moving average (ARIMA) model and an SVM classifier in order to model nonlinearities of class structure in regression estimation models. Liu and Park (2015) develop a multivariate linear model to explain short-term stock price movement where a bid–ask spread is used for classification purposes. Detollenaere and D'hondt (2017) apply an adaptive least absolute shrinkage and selection operator (LASSO)5 for variable selection, which best explains the transaction cost of the split order. They apply an adjusted ordinal logistic method for classifying ex ante transaction costs into groups. Cenesizoglu, Dionne, and Zhou (2014) work on a similar problem. They hold that the state of the limit order can be informative for the direction of future prices and try to prove their position by using an autoregressive model.

Panayi, Peters, Danielsson, and Zigrand (2016) use generalized linear models (GLM) and generalized additive models for location, shape, and scale (GAMLSS) in order to relate the threshold exceedance duration (TED), which measures the length of time required for liquidity replenishment, to the state of the LOB. Yu (2006) tries to extract information from order information and order submission based on the ordered probit model.6 The author shows, in the case of Shanghai's stock market, that an LOB's information is affected by the trader's strategy, with different impacts on the bid and ask sides. Amaya, Filbien, Okou, and Roch (2015) use panel regression7 for order imbalances and liquidity costs in LOBs so as to identify resilience in the market. Their findings show that such order imbalances cause liquidity issues that last for up to 10 minutes. Malik and Lon Ng (2014) analyze the asymmetric intra-day patterns of LOBs. They apply regression with a power transformation on the notional volume weighted average price (NVWAP) curves in order to conclude that both sides of the market behave asymmetrically to market conditions.8 In the same direction, Ranaldo (2004) examines the relationship between trading activity and the order flow dynamics in LOBs, where the empirical investigation is based on a probit model. Cao, Hansch, and Wang (2009) examine the depth of different levels of an order book by using an autoregressive (AR) model of order 5 (the AR(5) framework). They find that levels beyond the best bid and best ask prices provide moderate information regarding the true value of an asset. Finally, Creamer (2012) suggests that the LogitBoost algorithm is ideal for selecting the right combination of technical indicators.9

5 Adaptive weights are used for penalizing different coefficients in the $\ell_1$ penalty term.
6 The method is the generalization of a linear regression model when the dependent variable is discrete.
7 Panel regression models provide information on data characteristics individually, but also across both individuals over time.
8 Market conditions of an industry sector have an impact on sellers and buyers who are related to it. Factors to consider include the number of competitors in the sector. For example, if there is a surplus, new companies may find it difficult to enter the market and remain in business.
9 Technical indicators are mainly used for short-term price movement predictions. They are formulas based on historical data.

2.2 Neural networks

HFT is mainly a scalping10 strategy according to which the chaotic nature of the data creates the proper framework for the application of neural networks. Levendovszky and Kia (2012) propose a multilayer feedforward neural network for predicting the price of a EUR/USD pair, trained by using the backpropagation algorithm. Sirignano (2016) proposes a new method for training deep neural networks that try to model the joint distribution of the bid and ask depth, where a focal point is the spatial nature11 of LOB levels. Bogoev and Karam (2016) propose the use of a single hidden-layer feedforward neural (SLFN) network for the detection of quote stuffing and momentum ignition. Dixon (2016) uses a recurrent neural network (RNN) for mid-price predictions of T-bond12 and ES futures13 based on ultra-high-frequency data. Rehman, Khan, and Mahmud (2014) apply recurrent Cartesian genetic programming evolved artificial neural network (RCGPANN) for predicting five currency rates against the Australian dollar. Galeshchuk (2016) suggests that a multilayer perceptron (MLP) architecture, with three hidden layers, is suitable for exchange rate prediction. Majhi, Panda, and Sahoo (2009) use the functional link artificial neural network (FLANN) in order to predict price movements in the DJIA14 and S&P 50015 stock indices.

Deep belief networks are employed by Sharang and Rao (2015) to design a medium-frequency portfolio trading strategy. Hallgren and Koski (2016) use continuous-time Bayesian networks (CTBNs) for causality detection. They apply their model on tick-by-tick high-frequency foreign exchange (FX) EUR/USD data using a Skellam process.16 Sandoval and Hernández (2015) create a profitable trading strategy by combining hierarchical hidden Markov models (HHMM), where they consider wavelet-based LOB information filtering. In their work, they also consider a two-layer feedforward neural network in order to classify the upcoming states. They nevertheless report limitations in the neural network in terms of the volume of the input data.

10 Scalping is a type of trading strategy according to which the trader tries to make a profit for small changes in a stock.
11 The spatial nature of this type of neural network and its gradient can be evaluated at far fewer grid points. This makes the model less computationally expensive. Furthermore, the suggested architecture can model the entire distribution in the $\mathbb{R}^d$ space.
12 Treasury bond (T-bond) is a long-term fixed interest rate debt security issued by the federal government.
13 E-mini S&P 500 (ES futures) are electronically traded futures contracts whose value is one-fifth the size of standard S&P futures.
14 The Dow Jones Industrial Average (DJIA) is the price-weighted average of the 30 largest, publicly owned US companies.
15 S&P 500 is the index that provides a summary of the overall market by tracking some of the 500 top stocks in the US stock market.
16 A Skellam process is defined as $S(t) = N^{(1)}(t) - N^{(2)}(t)$, $t \geq 0$, where $N^{(1)}(t)$ and $N^{(2)}(t)$ are two independent homogeneous Poisson processes.

2.3 Maximum margin and reinforcement learning

Palguna and Pollak (2016) use nonparametric methods on features derived from LOB, which are incorporated into order execution strategies for mid-price prediction. In the same direction, Kercheval and Zhang (2015) employ a multi-class SVM for mid-price and price spread crossing prediction. Han et al. (2015) base their research on Kercheval and Zhang by using multi-class SVM for mid-price movement prediction. More precisely, they compare multi-class SVM (exploring linear and RBF kernels) to decision trees using bagging for variance reduction.

Kim (2001) uses input/output hidden Markov models (IOHMMs) and reinforcement learning (RL) in order to identify the order flow distribution and market-making strategies, respectively. Yang et al. (2015) apply apprenticeship learning17 methods, like linear inverse reinforcement learning (LIRL) and Gaussian process IRL (GPIRL), to recognize traders or algorithmic trades based on the observed limit orders. Chan and Shelton (2001) use RL for market-making strategies, where experiments based on a Monte Carlo simulation and a state–action–reward–state–action (SARSA) algorithm test the efficacy of their policy. In the same vein, Kearns and Nevmyvaka (2013) implement RL for trade execution optimization in lit and dark pools. Especially in the case of dark pools, they apply a censored exploration algorithm to the problem of smart order routing (SOR). Yang, Paddrik, Hayes, Todd, Kirilenko, Beling, and Scherer (2012) examine an IRL algorithm for the separation of HFT strategies from other algorithmic trading activities. They also apply the same algorithm to the identification of manipulative HFT strategies (i.e., spoofing). Felker, Mazalov, and Watt (2014) predict changes in the price of quotes from several exchanges. They apply a feature-weighted Euclidean distance to the centroid of a training cluster, where feature selection is taken into consideration because several exchanges are included in their model.

17 Motivation for apprenticeship learning is to use IRL techniques to learn the reward function and then use this function in order to define a Markov decision problem (MDP).

2.4 Additional methods for HFT and LOB

HFT and LOB research activity also covers topics like the optimal submission strategies of bid and ask orders, with a focus on the inventory risk that stems from an asset's value uncertainty, as in the work of Avellaneda and Stoikov (2008). Chang (2015) models the dynamics of LOB by using a Bayesian inference of the Markov chain model class, tested on high-frequency data. An and Chan (2017) suggest a new stochastic model that is based on independent compound Poisson processes of the order flow. Talebi, Hoang, and Gavrilova (2014) try to predict trends in the FX market by employing a multivariate Gaussian classifier (MGC) combined with Bayesian voting. Fletcher, Hussain, and Shawe-Taylor (2010) examine trading opportunities for the EUR/USD where the price movement is based on multiple kernel learning (MKL). More specifically, the authors utilize SimpleMKL and the more recent LPBoostMKL methods for training a multi-class SVM. Christensen and Woodmansey (2013) develop a classification method based on the Gaussian kernel in order to identify iceberg18 orders for GLOBEX.

Maglaras, Moallemi, and Zheng (2015) consider the LOB as a multi-class queueing system in order to solve the problem of limit and market order placement. Mankad, Michailidis, and Kirilenko (2013) apply a static plaid clustering technique to synthetic data in order to classify the different types of trades. Aramonte, Schindler, and Rosen (2013) show that the information asymmetry in a high-frequency environment is crucial.

Vella and Ng (2016) use higher-order fuzzy systems (i.e., an adaptive neuro-fuzzy inference system) by introducing T2 fuzzy sets, where the goal is to reduce microstructure noise in the HFT sphere. Abernethy and Kale (2013) apply market-maker strategies based on low-regret algorithms for the stock market. Almgren and Lorenz (2006) explain price momentum by modeling Brownian motion with a drift whose distribution is updated based on Bayesian inference. Næs and Skjeltorp (2006) show that the order book slope measures the elasticity of supplied quantity as a function of asset prices related to volatility, trading activity, and an asset's dispersion beliefs.

18 Iceberg order is the conditional request made to the broker to sell or buy a larger quantity of the stock, but in smaller predefined quantities.

3 THE LOB DATASET

In this section, we describe in detail our dataset collected in order to facilitate future research in LOB-based HFT. We start by providing a detailed description of the data in Section 3.1. Data processing steps are followed in order to extract message books and LOBs, as described in Section 3.2.

3.1 Data description


Extracting information from the ITCH flow, and without relying on third-party data providers, we analyze stocks from different industry sectors for 10 full days of ultra-high-frequency intra-day data. The data provide information regarding trades against hidden orders. Coherently, the nondisplayable hidden portions of the total volume of a so-called iceberg order are not accessible from the data. Our ITCH feed data are day specific and market wide, which means that we deal with one file per day with data over all the securities. Information (block A in Figure 1) regarding (i) messages for order submissions, (ii) trades, and (iii) cancellations is included. For each order, its type (buy/sell), price, quantity, and exact time stamp on a millisecond basis are available. In addition, (iv) administrative messages (i.e., trading halts or basic security data), (v) event controls (i.e., start and ending of trading days, states of market segments), and (vi) net order imbalance indicators are also included.

FIGURE 1 Data processing flow [Colour figure can be viewed at wileyonlinelibrary.com]

The next step is the development and implementation of a C++ converter to extract all the information relevant to a given security. We perform the same process for five stocks traded on the Nasdaq OMX Nordic at the Helsinki exchange from June 1, 2010 to June 14, 2010.19 These data are stored in a Linux cluster. Information related to the five stocks is illustrated in Table 1. The selected stocks20 are traded in one exchange (Helsinki) only. By choosing only one stock market exchange, the trader has the advantage of avoiding issues associated with fragmented markets. In the case of fragmented markets, the limit orders for a given asset are spread between several exchanges, posing problems for empirical data analysis (O'Hara & Ye, 2011).

The Helsinki Stock Exchange, operated by Nasdaq Nordic, is a pure electronic limit order market. The ITCH feed keeps a record of all the events, including those that take place outside active trading hours. At the Helsinki exchange, the trading period goes from 10:00 to 18:25 (local time, UTC/GMT +2 hours). However, in the ITCH feed, we observe several records outside those trading hours. In particular, we consider the regulated auction period before 10:00, which is used to set the opening price of the day (the so-called pre-opening period) before trading begins. This is a structurally different mechanism following different rules with respect to the order book flow during trading hours. Similarly, another structural break in the order book's dynamics is due to the different regulations that are in force between 18:25 and 18:30 (the so-called post-opening period). As a result, we retain exclusively the events occurring between 10:30 and 18:00. More information related to the above-mentioned issues can be found in Siikanen, Kanniainen, and Luoma (2017) and Siikanen, Kanniainen, and Valli (2017). Here, the order book is expected to have comparable dynamics with no biases or exceptions caused by its proximity to the market opening and closing times.

19 There have been about 23,000 active order books, the vast majority of which are very illiquid, show sporadic activity, and correspond to little and noisy data.
20 The choice is driven by the necessity of having a sufficient amount of data for training (this excludes illiquid stocks) while covering different industry sectors. These five selected stocks (see Table 1), which aggregate input message list and order book data for feature extraction, are about 4 GB; RTRKS was suspended from trading and delisted from the Helsinki exchange on November 20, 2014.

3.2 Limit order and message books

Message books and LOBs are processed for each of the 10 days for the five stocks. More specifically, there are two types of messages that are particularly relevant here: (i) “add order messages,” corresponding to order submissions; and (ii) “modify order messages,” corresponding to updates on the status of existing orders through order cancellations and order executions. Example message21 and limit order22 books are illustrated in Tables 2 and 3, respectively.

LOB is a centralized trading method that is incorporated by the majority of exchanges globally. It aggregates the limit orders of both sides (i.e., the ask and bid sides) of the stock market (e.g., the Nordic stock market). LOB matches every new event type according to several characteristics. Event types and LOB characteristics describe the current state of this matching engine. Event types can be executions, order submissions, and order cancellations. Characteristics of LOB are the resolution parameters (Gould, Porter, Williams, McDonald, Fenn, & Howison, 2013), which are the tick size $\pi$ (i.e., the smallest permissible price interval between different orders), and the lot size $\sigma$ (i.e., the smallest amount of a stock that can be traded, with admissible order sizes defined as $\{k\sigma \mid k = 1, 2, \ldots\}$). Order inflow and resolution parameters will formulate the dynamics of the LOB, whose current state will be identified by the state variable of four elements $(s_t^b, q_t^b, s_t^a, q_t^a)$, $t \geq 0$, where $s_t^b$ ($s_t^a$) is the best bid (ask) price and $q_t^b$ ($q_t^a$) is the size of the best bid (ask) level at time $t$.

In our data, timestamps are expressed in milliseconds since 1 January 1970 and shifted by three hours with respect to Eastern European Time (in the data, the trading day goes from 7:00 to 15:25). ITCH feed prices are recorded up to 4 decimal places and, in our data, the decimal point is removed by multiplying the price by 10,000, where currency is in euros for the Helsinki exchange. The tick size, defined as the smallest possible gap between the ask and bid prices, is 1 cent. Similarly, order quantities are constrained to integers greater than one.

21 A sample from FI0009002422 on June 1, 2010.
22 A sample from FI0009002422 on June 1, 2010.

3.3 Data availability and distribution

In compliance with Nasdaq OMX agreements, the normalized feature dataset is made available to the research community.23 The open-access version of our data has been normalized in order to prevent reconstruction of the original Nasdaq data.

23 We thank Ms. Sonja Salminen at Nasdaq for her support and help.

3.4 Experimental protocol

In order to make our dataset a benchmark that can be used for the evaluation of HFT methods based on LOB information, the data are accompanied by the following experimental protocol. We develop a day-based prediction framework following an anchored forward cross-validation format. More specifically, the training set is increased by 1 day in each fold and stops after n − 1 days (i.e., after 9 days in our case where n = 10). On each fold, the test set corresponds to 1 day of data, which moves in a rolling window format. The experimental setup is illustrated in Figure 2. Performance is measured by calculating the mean accuracy, recall, precision, and F1 score over all folds, as well as the corresponding standard deviation. We measure our results based on these metrics, which are defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{1}$$

$$\text{Precision} = \frac{TP}{TP + FP}, \tag{2}$$
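As an illustration of the experimental protocol of Section 3.4, the following Python sketch enumerates the anchored forward folds for n = 10 days and computes the four metrics of that section from confusion counts. The function names, the one-vs-rest reading of TP, TN, FP, and FN for a single class, and the convention of returning 0 when a denominator is zero are assumptions made for this example; they are not details given in the paper.

```python
def anchored_folds(n_days):
    """Yield (train_days, test_day): train on days 1..k, test on day k + 1."""
    for k in range(1, n_days):
        yield list(range(1, k + 1)), k + 1


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion counts.

    Assumption: a zero denominator yields 0.0 instead of raising an error.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# With n = 10 days there are 9 folds: ([1], 2), ([1, 2], 3), ..., ([1..9], 10).
folds = list(anchored_folds(10))

# Hypothetical confusion counts for one class on one fold.
accuracy, precision, recall, f1 = classification_metrics(tp=8, tn=5, fp=2, fn=1)
```

Collecting each metric over the nine folds and reporting its mean and standard deviation gives the summary figures the protocol asks for.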

TABLE 1 Stocks used in the analysis

ID     ISIN code     Company          Sector              Industry
KESBV  FI0009000202  Kesko Oyj        Consumer Defensive  Grocery Stores
OUT1V  FI0009002422  Outokumpu Oyj    Basic Materials     Steel
SAMPO  FI0009003305  Sampo Oyj        Financial Services  Insurance
RTRKS  FI0009003552  Rautaruukki Oyj  Basic Materials     Steel
WRT1V  FI0009000727  Wärtsilä Oyj     Industrials         Diversified Industrials

TABLE 2 Message list example

Timestamp      ID       Price   Quantity  Event         Side
1275386347944  6505727  126200  400       Cancellation  Ask
1275386347981  6505741  126500  300       Submission    Ask
1275386347981  6505741  126500  300       Cancellation  Ask
1275386348070  6511439  126100  17        Execution     Bid
1275386348070  6511439  126100  17        Submission    Bid
1275386348101  6511469  126600  300       Cancellation  Ask
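The Timestamp and Price columns of Table 2 use the encoding described in Section 3.2 (milliseconds since 1 January 1970, and prices multiplied by 10,000). A minimal Python sketch of the decoding is given below. The function name and the constant are hypothetical, and reading the raw value as UTC is an assumption about how the three-hour shift mentioned in the text is applied.

```python
from datetime import datetime, timezone

PRICE_SCALE = 10_000  # prices are stored as euros multiplied by 10,000


def decode_message(timestamp_ms, price_int):
    """Return (datetime, price in euros) for one message-list row.

    Assumption: the raw timestamp is interpreted as milliseconds since
    1970-01-01 00:00:00 UTC.
    """
    seconds, millis = divmod(timestamp_ms, 1000)
    when = datetime.fromtimestamp(seconds, tz=timezone.utc)
    when = when.replace(microsecond=millis * 1000)
    return when, price_int / PRICE_SCALE


# First row of Table 2.
when, price = decode_message(1275386347944, 126200)
```

Under this reading the first row falls on June 1, 2010, which agrees with the sample date given for Table 2, and the price is 12.62 euros.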

TABLE 3 Order book example (L1 = Level 1, L2 = Level 2)

| Timestamp | Mid-price | Spread | L1 ask price | L1 ask quantity | L1 bid price | L1 bid quantity | L2 ask price | L2 ask quantity | L2 bid price | L2 bid quantity | … |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1275386347944 | 126200 | 200 | 126300 | 300 | 126100 | 17 | 126400 | 4765 | 126000 | 2800 | … |
| 1275386347981 | 126200 | 200 | 126300 | 300 | 126100 | 17 | 126400 | 4765 | 126000 | 2800 | … |
| 1275386347981 | 126200 | 200 | 126300 | 300 | 126100 | 17 | 126400 | 4765 | 126000 | 2800 | … |
| 1275386348070 | 126050 | 100 | 126100 | 291 | 126000 | 2800 | 126200 | 300 | 125900 | 1120 | … |
| 1275386348070 | 126050 | 100 | 126100 | 291 | 126000 | 2800 | 126200 | 300 | 125900 | 1120 | … |
| 1275386348101 | 126050 | 100 | 126100 | 291 | 126000 | 2800 | 126200 | 300 | 125900 | 1120 | … |
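The Mid-price and Spread columns of Table 3 follow directly from the Level 1 quotes. The short Python sketch below reproduces them for the two distinct book states in the table; the function name is hypothetical and chosen only for illustration.

```python
def mid_price_and_spread(best_ask, best_bid):
    """Mid-price is the average of the best quotes; spread is their difference."""
    return (best_ask + best_bid) / 2, best_ask - best_bid


# Level 1 (ask, bid) prices taken from the first and fourth rows of Table 3.
for ask, bid in [(126300, 126100), (126100, 126000)]:
    print(mid_price_and_spread(ask, bid))
# prints (126200.0, 200) and then (126050.0, 100)
```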

$$\text{Recall} = \frac{TP}{TP + FN}, \tag{3}$$

$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \tag{4}$$

where TP and TN represent the true positives and true negatives, respectively, of the mid-price prediction label compared with the ground truth, and FP and FN represent the false positives and false negatives, respectively. Among the above metrics, we focus on the F1 score. The main reason is that, with unbalanced classes like ours, F1 is affected in only one direction by skewed class distributions. In contrast, accuracy cannot differentiate between the number of correct labels (i.e., related to mid-price movement direction prediction) of different classes, whereas the other three metrics can separate the correct labels among different classes, with F1 being the harmonic mean of Precision and Recall.

We follow an event-based inflow, as used in Li et al. (2016). This is because events (i.e., orders, executions, and cancellations) do not follow a uniform inflow rate. Time intervals between two consecutive events can vary from milliseconds to several minutes. Event-based data representation avoids issues related to such big differences in data flow. As a result, each of our representations is a vector that contains information for 10 consecutive events. Event-based data description leads to a dataset of approximately 400,000 representations (i.e., 394,337 representations). We represent these events using the 144-dimensional representation proposed by Kercheval and Zhang (2015), formed by three types of features: (a) the raw data of a 10-level limit order book containing price and volume values for bid and ask orders; (b) features describing the state of the LOB, exploiting past information; and (c) features describing the information edge in the raw data by taking time into account. Time derivatives of stock price and volume are calculated for short- and long-term projections. More specifically, the types in features u7, u8, and u9 are: trades, orders, cancellations, deletion, execution of a visible limit order, and execution of a hidden limit order. Expressions used for calculating these features are provided in Table 4. One limitation of the adopted features
FIGURE 2 Experimental setup framework

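TABLE 4 Feature sets

| Feature set | Description | Details |
|---|---|---|
| Basic | $u_1 = \{P_i^{ask}, V_i^{ask}, P_i^{bid}, V_i^{bid}\}_{i=1}^{n}$ | 10 (= n)-level LOB data |
| Time-insensitive | $u_2 = \{(P_i^{ask} - P_i^{bid}), (P_i^{ask} + P_i^{bid})/2\}_{i=1}^{n}$ | Spread & Mid-price |
| | $u_3 = \{P_n^{ask} - P_1^{ask}, P_1^{bid} - P_n^{bid}, \lvert P_{i+1}^{ask} - P_i^{ask}\rvert, \lvert P_{i+1}^{bid} - P_i^{bid}\rvert\}_{i=1}^{n}$ | Price differences |
| | $u_4 = \{\frac{1}{n}\sum_{i=1}^{n} P_i^{ask}, \frac{1}{n}\sum_{i=1}^{n} P_i^{bid}, \frac{1}{n}\sum_{i=1}^{n} V_i^{ask}, \frac{1}{n}\sum_{i=1}^{n} V_i^{bid}\}$ | Price & Volume means |
| | $u_5 = \{\sum_{i=1}^{n}(P_i^{ask} - P_i^{bid}), \sum_{i=1}^{n}(V_i^{ask} - V_i^{bid})\}$ | Accumulated differences |
| Time-sensitive | $u_6 = \{dP_i^{ask}/dt, dP_i^{bid}/dt, dV_i^{ask}/dt, dV_i^{bid}/dt\}_{i=1}^{n}$ | Price & Volume derivation |
| | $u_7 = \{\lambda_{\Delta t}^{1}, \lambda_{\Delta t}^{2}, \lambda_{\Delta t}^{3}, \lambda_{\Delta t}^{4}, \lambda_{\Delta t}^{5}, \lambda_{\Delta t}^{6}\}$ | Average intensity per type |
| | $u_8 = \{\mathbf{1}_{\lambda_{\Delta t}^{1} > \lambda_{\Delta T}^{1}}, \mathbf{1}_{\lambda_{\Delta t}^{2} > \lambda_{\Delta T}^{2}}, \mathbf{1}_{\lambda_{\Delta t}^{3} > \lambda_{\Delta T}^{3}}, \mathbf{1}_{\lambda_{\Delta t}^{4} > \lambda_{\Delta T}^{4}}, \mathbf{1}_{\lambda_{\Delta t}^{5} > \lambda_{\Delta T}^{5}}, \mathbf{1}_{\lambda_{\Delta t}^{6} > \lambda_{\Delta T}^{6}}\}$ | Relative intensity comparison |
| | $u_9 = \{d\lambda^{1}/dt, d\lambda^{2}/dt, d\lambda^{3}/dt, d\lambda^{4}/dt, d\lambda^{5}/dt, d\lambda^{6}/dt\}$ | Limit activity acceleration |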
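As an illustration of the time-insensitive sets u2 to u5 in Table 4, the following NumPy sketch computes them from one order book snapshot. It is a sketch under stated assumptions rather than the feature extraction used for the published dataset: the function name and the argument layout (one array per side, best level first) are hypothetical, and the adjacent-level differences in u3 are taken over the n − 1 neighbouring pairs that exist in an n-level book.

```python
import numpy as np


def time_insensitive_features(p_ask, v_ask, p_bid, v_bid):
    """Compute u2..u5 of Table 4 for one snapshot (arrays ordered from level 1 to level n)."""
    p_ask, v_ask, p_bid, v_bid = (
        np.asarray(a, dtype=float) for a in (p_ask, v_ask, p_bid, v_bid)
    )
    # u2: per-level spread and per-level mid-price
    u2 = np.concatenate([p_ask - p_bid, (p_ask + p_bid) / 2])
    # u3: price range on each side, then absolute differences between adjacent levels
    u3 = np.concatenate([
        [p_ask[-1] - p_ask[0], p_bid[0] - p_bid[-1]],
        np.abs(np.diff(p_ask)),
        np.abs(np.diff(p_bid)),
    ])
    # u4: mean price and mean volume on each side
    u4 = np.array([p_ask.mean(), p_bid.mean(), v_ask.mean(), v_bid.mean()])
    # u5: accumulated price and volume differences across levels
    u5 = np.array([(p_ask - p_bid).sum(), (v_ask - v_bid).sum()])
    return u2, u3, u4, u5
```

With the two levels shown in the first row of Table 3 (ask prices 126300 and 126400, bid prices 126100 and 126000), u2 contains the spreads 200 and 400 followed by the mid-prices 126200 and 126200.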

is the lack of information related to order flow (i.e., the sequence of order book messages). However, as can be seen in the Results (Section 6), the baselines achieve relatively good performance, and therefore we leave the introduction of extra features that can enhance performance to future research.

We provide three sets of data, each created by following a different data normalization strategy (z-score, min–max, and decimal precision normalization) for every data sample i. Z-score, in particular, is the normalization process through which we subtract the mean from our input data for each feature separately and divide by the standard deviation of the given sample:

$$x_i^{(\text{z-score})} = \frac{x_i - \frac{1}{N}\sum_{j=1}^{N} x_j}{\sqrt{\frac{1}{N}\sum_{j=1}^{N} (x_j - \bar{x})^2}}, \tag{5}$$

where $\bar{x}$ denotes the mean vector, as appears in Equation 5. On the other hand, min–max scaling, as described by

$$x_i^{(MM)} = \frac{x_i - x_{min}}{x_{max} - x_{min}}, \tag{6}$$

is the process of subtracting the minimum value from each feature and dividing it by the difference between the maximum and minimum value of that feature sample. The third scaling setup is the decimal precision approach. This normalization method is based on moving the decimal points of each of the feature values. Calculations follow the absolute value of each feature sample:

$$x_i^{(DP)} = \frac{x_i}{10^k}, \tag{7}$$

where k is the smallest integer for which the maximum of $|x^{(DP)}|$ is below 1.

Having defined the event representations, we use five different projection horizons for our labels. Each of these

TABLE 5 HFT dataset examples

| No. | Dataset | Publicly available | Unit time | Period | Asset class / No. of stocks | Size | Annotations |
|---|---|---|---|---|---|---|---|
| 1 | Dukascopy | ✓ | ms | Up to date | Various | ∼20,000 events/day | × |
| 2 | truefx | ✓ | ms | Up to date | 15 FX pairs | ∼300,000 events/day | × |
| 3 | Nasdaq | AuR | ms | 2008-09 | Equity / 120 | - | × |
| 4 | Nasdaq | AuR | ms | 10/07 & 06/08 | Equity / 500 | ∼55,000 events/day | × |
| 5 | Nasdaq | × | ms | - | Equity / 5 | 2,000 data points | × |
| 6 | Euronext | AuR | - | - | Several products | - | × |
| 7 | Nasdaq | × | ns | 01/14-08/15 | Equity / 489 | 50 TB | × |
| 8 | Our–Nasdaq | ✓ | ms | 01-14/06/10 | Equity / 5 | 4 M samples | ✓ |

horizons portrays a different future projection interval of the mid-price movement (i.e., upward, downward, and stationary mid-price movement). More specifically, we extract labels based on short-term and long-term, event-based, relative changes for the next 1, 2, 3, 5, and 10 events for our representations dataset.

Our labels describe the percentage change of the mid-price, which is calculated as follows:

$$l_i^{(k)} = \frac{\frac{1}{k}\sum_{j=i+1}^{i+k} m_j - m_i}{m_i}, \tag{8}$$

where $m_j$ is the future mid-price (k = 1, 2, 3, 5, or 10 next events in our representations) and $m_i$ is the current mid-price. The extracted labels are based on a threshold for the percentage change of 0.002. For percentage changes equal to or greater than 0.002, we use label 1. For percentage changes strictly between −0.002 and 0.002 (i.e., from −0.00199 to 0.00199), we use label 2, and for percentage changes smaller than or equal to −0.002, we use label 3.

4 EXISTING DATA SETS DESCRIBED IN THE LITERATURE

In this section, we list existing HFT datasets described in the literature and provide qualitative and quantitative comparisons to our dataset. The following works mainly focus on datasets that are related to machine learning methods.

There are mainly three sources of data from which a high-frequency trader can choose. The first option is the use of publicly available data (e.g., (1) Dukascopy and (2) truefx), where no prior agreement is required for data acquisition. The second option is data available upon request for academic purposes, which can be found in (3) Brogaard, Hendershott, and Riordan (2014), (4) Hasbrouck and Saar (2013), (5) De Winne and D'hondt (2007), Detollenaere and D'hondt (2017), and Carrion (2013). Finally, the third and most common option is data through platforms requiring a subscription fee, like those in (6) Kercheval and Zhang (2015) and Li et al. (2016), and (7) Sirignano (2016). Existing data sources and characteristics are listed in Table 5.

In particular, the datasets are at a millisecond resolution, except for number 6 in the table. Access to various asset classes including FX, commodities, indices, and stocks is also provided. To the best of our knowledge, there is no available literature based on this type of dataset for equities. Another source of free tick-by-tick historical data is the truefx.com site, but the site provides data only for the FX market for several pairs of currencies at a millisecond resolution. The data contain information regarding timestamps (in millisecond resolution) and bid and ask prices. Each of these .csv files contains approximately 200,000 events per day. This type of data is used in a mean-reverting jump-diffusion model, as presented in Suwanpetai (2016).

There is a second category of datasets available upon request (AuR), as seen in Hasbrouck and Saar (2013). In this paper, the authors use the Nasdaq OMX ITCH for two periods: October 2007 and June 2008. For these periods, they run samples at 10-minute intervals for each day, where they set a cutoff mechanism for available messages per period (footnote 24). The main disadvantage of uniformly sampling HFT data is that the trader loses vital information. Events come randomly, with inactive periods varying from a few milliseconds to several minutes or hours. In our work, we overcome this challenge by considering the information based on event inflow, rather than equal time sampling. Another example of data that is available only for academic purposes is Brogaard et al. (2014). The dataset contains information regarding timestamps, price, and buy–sell side prices but no other details related to daily events or feature vectors. Hasbrouck and Saar provide a detailed description of their Nasdaq OMX ITCH data, which is not directly accessible for testing and comparison with their

24 The authors provide a threshold, which is based on 250 events per 10-minute sample interval.

baselines. They use these data to apply low-latency strategies based on measures that capture links between submissions, cancellations, and executions. De Winne and D'hondt (2007) and Detollenaere and D'hondt (2017) use similar datasets from Euronext for LOB construction. They specify that their dataset is available upon request from the provider. What is more, the data provider supplies details regarding the LOB construction by the user. Our work fills that gap since our dataset provides the full LOB depth and is ready for use and comparison with our baselines.

The last category of dataset has dissemination restrictions. An example is the paper by Kercheval and Zhang (2015), where the authors try to predict the mid-price movement by using machine learning (i.e., SVM). They train their model with a very small number of samples (i.e., 4,000 samples). HFT activity can produce a huge volume of trading events daily, as our database does with 100,000 daily events for only one stock. Moreover, the datasets in Kercheval and Zhang and in Sirignano (2016) are not publicly available, which makes comparison with other methods impossible. In the same direction, we also add works such as Hasbrouck (2009), Kalay, Sade, and Wohl (2004), and Kalay, Wei, and Wohl (2002), which utilize TAQ and Tel Aviv stock exchange datasets (not for machine learning methods) and require subscription.

5 BASELINES

In order to provide performance baselines for our new dataset of HFT with LOB data, we conducted experiments with two regression models using the data representations described in Section 3.4. Details on the models used are provided in Sections 5.1 and 5.2. The baseline performances are provided in Section 6.

5.1 Ridge regression (RR)

Ridge regression defines a linear mapping, expressed by the matrix $W \in \mathbb{R}^{D \times C}$, that optimally maps a set of vectors $x_i \in \mathbb{R}^{D}$, $i = 1, \dots, N$ to another set of vectors (noted as target vectors) $t_i \in \mathbb{R}^{C}$, $i = 1, \dots, N$, by optimizing the following criterion:

$$W^{*} = \arg\min_{W} \sum_{i=1}^{N} \|W^{T} x_i - t_i\|_2^2 + \lambda \|W\|_F^2, \tag{9}$$

or using a matrix notation:

$$W^{*} = \arg\min_{W} \|W^{T} X - T\|_F^2 + \lambda \|W\|_F^2. \tag{10}$$

In the above, $X = [x_1, \dots, x_N]$ and $T = [t_1, \dots, t_N]$ are matrices formed by the samples $x_i$ and $t_i$ as columns, respectively.

In our case, each sample $x_i$ corresponds to an event, represented by a vector (with D = 144), as described in Section 3.4. For the three-class classification problems in our dataset, the elements of vectors $t_i \in \mathbb{R}^{C}$ (C = 3 in our case) take values $t_{ik} = 1$ if $x_i$ belongs to class k, and $t_{ik} = -1$ otherwise. The solution of Equation 10 is given by

$$W = X \left(X^{T} X + \lambda I\right)^{-1} T^{T}, \tag{11}$$

or

$$W = \left(X X^{T} + \lambda I\right)^{-1} X T^{T}, \tag{12}$$

where I is the identity matrix of appropriate dimensions. Here, we should note that, in our case, where the size of the data is large, W should be computed using Equation 12, since the calculation of Equation 11 is computationally very expensive (it inverts an N × N matrix, whereas Equation 12 inverts a D × D matrix).

After the calculation of W, a new (test) sample $x \in \mathbb{R}^{D}$ is mapped on its corresponding representation in space $\mathbb{R}^{C}$ (that is, $o = W^{T} x$) and is classified according to the maximum value of its projection:

$$l_x = \arg\max_{k} o_k. \tag{13}$$

5.2 SLFN network-based nonlinear regression

We also test the performance of a nonlinear regression model. Since the application of kernel-based regression is computationally too intensive for the size of our data, we use an SLFN (Figure 3) network-based regression model. Such a model is formed as follows.

FIGURE 3 SLFN

For fast network training, we train our network based on the algorithm proposed in Huang, Zhou, Ding, and Zhang (2012), Zhang, Kwok, and Parvin (2009), and Iosifidis, Tefas, and Pitas (2017). This algorithm is formed by

two processing steps. In the first step, the network's hidden layer weights are determined either randomly (Huang, Zhou, Ding, & Zhang, 2012) or by applying clustering on the training data. We apply K-means clustering in order to determine K prototype vectors, which are subsequently used as the network's hidden layer weights.

Having determined the network's hidden layer weights $V \in \mathbb{R}^{D \times K}$, the input data $x_i$, $i = 1, \dots, N$ are nonlinearly mapped to vectors $h_i \in \mathbb{R}^{K}$, expressing the data representations in the feature space $\mathbb{R}^{K}$ determined by the network's hidden layer outputs. We use the radial basis function (that is, $h_i = \phi_{RBF}(x_i)$), calculated in an element-wise manner, as follows:

$$h_{ik} = \exp\left(-\frac{\|x_i - v_k\|_2^2}{2\sigma^2}\right), \quad k = 1, \dots, K, \tag{14}$$

where $\sigma$ is a hyperparameter denoting the spread of the RBF neuron and $v_k$ corresponds to the kth column of V.

The network's output weights $W \in \mathbb{R}^{K \times C}$ are subsequently determined by solving for

$$W^{*} = \arg\min_{W} \|W^{T} H - T\|_F^2 + \lambda \|W\|_F^2, \tag{15}$$

where $H = [h_1, \dots, h_N]$ is a matrix formed by the network's hidden layer outputs for the training data and T is a matrix formed by the network's target vectors $t_i$, $i = 1, \dots, N$, as defined in Section 5.1. The network's output weights are given by

$$W = \left(H H^{T} + \lambda I\right)^{-1} H T^{T}. \tag{16}$$

After calculation of the network parameters V and W, a new (test) sample $x \in \mathbb{R}^{D}$ is mapped on its corresponding representations in spaces $\mathbb{R}^{K}$ and $\mathbb{R}^{C}$; that is, $h = \phi_{RBF}(x)$ and $o = W^{T} h$, respectively. It is classified according to the maximal network output:

$$l_x = \arg\max_{k} o_k. \tag{17}$$

6 RESULTS

In our first set of experiments, we have applied two supervised machine learning methods, as described in Sections 5.1 and 5.2, on a dataset that does not include the auction period. Results with the auction period will also be available. Since there is not a widely adopted experimental protocol for these datasets, we provide information for the five different label scenarios under the three normalization setups.

The tables in this section provide details regarding the results of experiments conducted on raw data and three different normalization setups. We present these results, for our baseline models, in order to give insight into the preprocessing step for a dataset like ours, to examine the strength of the predictability of the projected time horizon, and to understand the implications of the suggested methods. Data normalization can significantly improve the metrics' performance in combination with the use of the right classifier. More specifically, we measure the predictive power of our models via accuracy, precision, recall, and F1 score. For instance, Table 6 presents the results based on raw data (i.e., no data decoding), and in the case of the linear classifier RR and label 5 (i.e., the 5th mid-price event as predicted horizon), we achieve an F1 score of 40%, whereas in Table 7 (i.e., the Z-score data decoding method), Table 8 (i.e., the min–max data decoding method), and Table 9 (i.e., the decimal precision decoding method), we achieve 43%, 42%, and 40%, respectively. This shows

TABLE 6 Results based on unfiltered representations

| Label | RR Accuracy | RR Precision | RR Recall | RR F1 |
|---|---|---|---|---|
| 1 | 0.637 ± 0.055 | 0.505 ± 0.145 | 0.337 ± 0.003 | 0.268 ± 0.014 |
| 2 | 0.555 ± 0.064 | 0.504 ± 0.131 | 0.376 ± 0.023 | 0.320 ± 0.050 |
| 3 | 0.489 ± 0.061 | 0.423 ± 0.109 | 0.397 ± 0.031 | 0.356 ± 0.070 |
| 5 | 0.429 ± 0.049 | 0.402 ± 0.113 | 0.425 ± 0.038 | 0.400 ± 0.093 |
| 10 | 0.453 ± 0.054 | 0.400 ± 0.105 | 0.400 ± 0.030 | 0.347 ± 0.066 |

| Label | SLFN Accuracy | SLFN Precision | SLFN Recall | SLFN F1 |
|---|---|---|---|---|
| 1 | 0.636 ± 0.055 | 0.299 ± 0.075 | 0.335 ± 0.002 | 0.262 ± 0.015 |
| 2 | 0.536 ± 0.069 | 0.387 ± 0.132 | 0.345 ± 0.009 | 0.260 ± 0.035 |
| 3 | 0.473 ± 0.074 | 0.334 ± 0.080 | 0.357 ± 0.005 | 0.270 ± 0.021 |
| 5 | 0.381 ± 0.038 | 0.342 ± 0.058 | 0.370 ± 0.020 | 0.327 ± 0.043 |
| 10 | 0.401 ± 0.039 | 0.284 ± 0.102 | 0.356 ± 0.020 | 0.290 ± 0.070 |
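The RR rows in these tables come from the ridge regression baseline of Section 5.1. The NumPy sketch below shows one way Equations 12 and 13 could be implemented, using a linear solve instead of an explicit matrix inverse. It is an illustration and not the authors' code: the function names and the 0-based class indices are assumptions, and samples are stored as columns of X to match the notation of the text.

```python
import numpy as np


def one_vs_rest_targets(labels, n_classes):
    """Build T (C x N) with +1 in the row of each sample's class and -1 elsewhere."""
    labels = np.asarray(labels)
    T = -np.ones((n_classes, labels.size))
    T[labels, np.arange(labels.size)] = 1.0
    return T


def fit_ridge(X, T, lam):
    """Equation 12: W = (X X^T + lam I)^{-1} X T^T, with X of shape D x N."""
    D = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(D), X @ T.T)  # D x C


def predict(W, X):
    """Equation 13: class index with the largest output o = W^T x, per column of X."""
    return np.argmax(W.T @ X, axis=0)
```

Only a D × D system is solved here, which for D = 144 stays cheap no matter how many samples N are used; the equivalent form in Equation 11 would require an N × N system.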

TABLE 7 Results based on Z-score normalization

| Label | RR Accuracy | RR Precision | RR Recall | RR F1 |
|---|---|---|---|---|
| 1 | 0.480 ± 0.040 | 0.418 ± 0.021 | 0.435 ± 0.029 | 0.410 ± 0.022 |
| 2 | 0.498 ± 0.052 | 0.444 ± 0.025 | 0.443 ± 0.031 | 0.440 ± 0.031 |
| 3 | 0.463 ± 0.045 | 0.438 ± 0.027 | 0.437 ± 0.033 | 0.433 ± 0.034 |
| 5 | 0.439 ± 0.042 | 0.436 ± 0.028 | 0.433 ± 0.028 | 0.427 ± 0.041 |
| 10 | 0.429 ± 0.046 | 0.429 ± 0.028 | 0.429 ± 0.043 | 0.416 ± 0.044 |

| Label | SLFN Accuracy | SLFN Precision | SLFN Recall | SLFN F1 |
|---|---|---|---|---|
| 1 | 0.643 ± 0.056 | 0.512 ± 0.037 | 0.366 ± 0.019 | 0.327 ± 0.046 |
| 2 | 0.556 ± 0.066 | 0.550 ± 0.029 | 0.378 ± 0.011 | 0.327 ± 0.030 |
| 3 | 0.512 ± 0.069 | 0.497 ± 0.024 | 0.424 ± 0.047 | 0.389 ± 0.082 |
| 5 | 0.473 ± 0.036 | 0.468 ± 0.024 | 0.464 ± 0.028 | 0.459 ± 0.031 |
| 10 | 0.477 ± 0.048 | 0.453 ± 0.056 | 0.432 ± 0.025 | 0.410 ± 0.040 |
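Tables 7 to 9 differ only in the normalization applied to the representations (Equations 5 to 7). A possible NumPy sketch of the three schemes, applied per feature, is shown below. The function names and the samples-by-features layout are assumptions for illustration; the sketch also assumes that no feature is constant or identically zero, and it takes its statistics from whatever array it is given (for example, the training folds).

```python
import numpy as np


def z_score(X):
    """Equation 5: subtract the per-feature mean, divide by the per-feature standard deviation (1/N)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def min_max(X):
    """Equation 6: rescale each feature to the range [0, 1]."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)


def decimal_precision(X):
    """Equation 7: divide each feature by 10^k, with k the smallest integer giving max |x| < 1."""
    max_abs = np.abs(X).max(axis=0)
    k = np.floor(np.log10(max_abs)) + 1
    return X / 10.0 ** k
```

For a price column whose largest absolute value is 126300, `decimal_precision` uses k = 6, so the scaled values lie near 0.126.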

TABLE 8 Results based on min–max normalization

| Label | RR Accuracy | RR Precision | RR Recall | RR F1 |
|---|---|---|---|---|
| 1 | 0.637 ± 0.054 | 0.499 ± 0.118 | 0.339 ± 0.005 | 0.272 ± 0.015 |
| 2 | 0.561 ± 0.063 | 0.467 ± 0.117 | 0.400 ± 0.028 | 0.368 ± 0.060 |
| 3 | 0.492 ± 0.070 | 0.428 ± 0.111 | 0.400 ± 0.030 | 0.357 ± 0.072 |
| 5 | 0.437 ± 0.048 | 0.419 ± 0.078 | 0.429 ± 0.043 | 0.417 ± 0.063 |
| 10 | 0.452 ± 0.054 | 0.421 ± 0.110 | 0.399 ± 0.028 | 0.348 ± 0.066 |

| Label | SLFN Accuracy | SLFN Precision | SLFN Recall | SLFN F1 |
|---|---|---|---|---|
| 1 | 0.640 ± 0.055 | 0.488 ± 0.104 | 0.348 ± 0.007 | 0.291 ± 0.022 |
| 2 | 0.558 ± 0.065 | 0.469 ± 0.066 | 0.399 ± 0.023 | 0.367 ± 0.050 |
| 3 | 0.499 ± 0.063 | 0.447 ± 0.068 | 0.410 ± 0.032 | 0.370 ± 0.063 |
| 5 | 0.453 ± 0.038 | 0.441 ± 0.041 | 0.444 ± 0.030 | 0.432 ± 0.050 |
| 10 | 0.450 ± 0.048 | 0.432 ± 0.070 | 0.406 ± 0.037 | 0.377 ± 0.062 |
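The SLFN rows correspond to the two-step training of Section 5.2: prototypes from K-means, an RBF hidden layer (Equation 14), and ridge-regularized output weights (Equation 16). The NumPy sketch below illustrates those steps under stated assumptions and is not the authors' implementation: the plain Lloyd-style `kmeans`, all function names, and the rows-as-samples layout of X are choices made here for brevity, and the RBF uses the standard Gaussian form with a negative exponent.

```python
import numpy as np


def kmeans(X, K, n_iter=20, seed=0):
    """Plain Lloyd iterations on X (N x D); returns K prototype vectors (K x D)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for k in range(K):
            members = X[assign == k]
            if len(members):  # keep the old prototype if a cluster is empty
                centers[k] = members.mean(axis=0)
    return centers


def rbf_hidden(X, centers, sigma):
    """Equation 14: hidden outputs H (K x N) for samples X (N x D)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2)).T


def fit_output_weights(H, T, lam):
    """Equation 16: W = (H H^T + lam I)^{-1} H T^T, giving a K x C matrix."""
    K = H.shape[0]
    return np.linalg.solve(H @ H.T + lam * np.eye(K), H @ T.T)


def predict(X, centers, W, sigma):
    """Equation 17: class index with the largest network output, per sample."""
    return np.argmax(W.T @ rbf_hidden(X, centers, sigma), axis=0)
```

In use, `centers = kmeans(X_train, K)` would be followed by `H = rbf_hidden(X_train, centers, sigma)` and `W = fit_output_weights(H, T, lam)`, with K, sigma, and lam treated as hyperparameters to be tuned.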

TABLE 9 Results based on decimal precision normalization

| Label | RR Accuracy | RR Precision | RR Recall | RR F1 |
|---|---|---|---|---|
| 1 | 0.638 ± 0.054 | 0.518 ± 0.132 | 0.341 ± 0.007 | 0.277 ± 0.018 |
| 2 | 0.551 ± 0.066 | 0.473 ± 0.118 | 0.372 ± 0.018 | 0.315 ± 0.045 |
| 3 | 0.490 ± 0.069 | 0.432 ± 0.113 | 0.386 ± 0.023 | 0.330 ± 0.059 |
| 5 | 0.435 ± 0.051 | 0.406 ± 0.115 | 0.430 ± 0.039 | 0.405 ± 0.095 |
| 10 | 0.451 ± 0.052 | 0.417 ± 0.108 | 0.399 ± 0.029 | 0.349 ± 0.067 |

| Label | SLFN Accuracy | SLFN Precision | SLFN Recall | SLFN F1 |
|---|---|---|---|---|
| 1 | 0.641 ± 0.055 | 0.512 ± 0.027 | 0.351 ± 0.007 | 0.297 ± 0.024 |
| 2 | 0.565 ± 0.063 | 0.505 ± 0.020 | 0.410 ± 0.026 | 0.385 ± 0.054 |
| 3 | 0.504 ± 0.061 | 0.465 ± 0.032 | 0.421 ± 0.040 | 0.393 ± 0.073 |
| 5 | 0.457 ± 0.038 | 0.451 ± 0.029 | 0.449 ± 0.031 | 0.438 ± 0.046 |
| 10 | 0.461 ± 0.053 | 0.453 ± 0.036 | 0.420 ± 0.035 | 0.399 ± 0.053 |
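The Label column in Tables 6 to 9 is the prediction horizon k (1, 2, 3, 5, or 10 events). As a sketch of how a label could be derived from a mid-price series using the relative change of Equation 8 and the 0.002 threshold, consider the Python function below. The function name, the 0-based indexing, and the `alpha` parameter name are assumptions for illustration.

```python
def label_mid_price(mid, i, k, alpha=0.002):
    """Label event i from the mean of the next k mid-prices (requires i + k < len(mid)).

    Returns 1 (upward), 2 (stationary) or 3 (downward).
    """
    future_mean = sum(mid[i + 1 : i + k + 1]) / k
    change = (future_mean - mid[i]) / mid[i]
    if change >= alpha:
        return 1
    if change <= -alpha:
        return 3
    return 2
```

For the series `[100.0, 100.1, 100.5, 100.6, 99.0]`, event 0 is stationary at horizon 1 (relative change 0.001) but upward at horizon 3 (relative change about 0.004), which shows how the same event can receive different labels at different horizons.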

that in the case of the linear classifier the suggested decoding methods did not offer any significant improvements, since the variability of the performance range is approximately 3 percentage points. On the other hand, our nonlinear classifier (i.e., SLFN) for the same projected time horizon (i.e., label 5) responded more strongly to the decoding process. SLFN achieves 33% for the F1 score for nonnormalized data, while the Z-score, min–max, and decimal precision methods achieve 46%, 43%, and 44%, respectively. As a result, normalization improves the F1 score performance by roughly 10 to 13 percentage points.

Normalization and model selection can also affect the predictability of mid-price movements over the projected time horizon. Very interesting results come to light if we compare the F1 performance over different time horizons. For instance, we can see that, regardless of the decoding method, the F1 score is always better for label 5 than for label 1, meaning that our models'

predictions are better further in the future. This result is significant, especially with unfiltered data and min–max and decimal precision normalizations, when the F1 score is approximately 27% in the case of the one-step prediction problem (label 1), and 43% in the case of the five-step problem (label 5).

Another aspect of the experimental results above stems from the pros and cons of linear and nonlinear classifiers. More specifically, the RR linear classifier performed better on the raw dataset and for the Z-score decoding method in terms of F1 when compared to the SLFN (i.e., nonlinear classifier). This is not the case for the last decoding methods (i.e., min–max and decimal precision), where our nonlinear classifier presents similar or better results than RR. One explanation for this F1 performance discrepancy lies in how each of these methods is designed. The RR classifier tends to be very efficient in high-dimensional problems, and these types of problems are linearly separable in most cases. Another reason that RR can perform better when compared to a nonlinear classifier is that RR can control model complexity by penalizing the weight norm through the ridge parameter, which is selected via cross-validation. On the other hand, a nonlinear classifier offers more degrees of freedom for class separation, which helps in some cases but also makes it prone to overfitting.

7 CONCLUSION

This paper described a new benchmark dataset formed by the Nasdaq ITCH feed data for five stocks for 10 consecutive trading days. Data representations that exploit order flow features were made available. We formulated five classification tasks based on mid-price movement predictions for 1, 2, 3, 5, and 10 predicted horizons. Baseline performances of two regression models were also provided in order to facilitate future research in the field. Despite the data size, we achieved an average out-of-sample performance (F1) of approximately 46% for both methods. These very promising results show that machine learning can effectively predict mid-price movement.

Potential avenues of research that can benefit from exploiting the provided data include: (a) prediction of the stability of the market, which is very important for liquidity providers (market makers) to make the spread, as well as for traders to increase liquidity provision (when markets can be predicted to be stable); (b) prediction of market movements, which is important for expert systems used by speculative traders; (c) identification of order book spoofing, that is, situations where markets are manipulated by limit orders. Although there is no spoofing activity information available for the provided data, the exploitation of such a large corpus of data can be used in order to identify patterns in stock markets that can be further analyzed as normal or abnormal.

ACKNOWLEDGMENT

This work was supported by H2020 Project BigDataFinance MSCA-ITN-ETN 675044 (http://bigdatafinance.eu), Training for Big Data in Financial Research and Risk Management.

ORCID

Adamantios Ntakaris http://orcid.org/0000-0001-6949-5337

REFERENCES

Abernethy, J., & Kale, S. (2013). Adaptive market making via online learning. In Advances in Neural Information Processing Systems (pp. 2058–2066). Cambridge, MA: MIT Press.
Almgren, R., & Lorenz, J. (2006). Bayesian adaptive trading with a daily cycle. Journal of Trading, 1(4), 38–46.
Alvim, L. G., dos Santos, C. N., & Milidiu, R. L. (2010). Daily volume forecasting using high frequency predictors. In Proceedings of the 10th IASTED International Conference, Acta Press, Calgary, Canada, Vol. 674, pp. 248.
Amaya, D., Filbien, J.-Y., Okou, C., & Roch, A. F. (2015). Distilling liquidity costs from limit order books. Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2660226.
An, Y., & Chan, N. H. (2017). Short-term stock price prediction based on limit order book dynamics. Journal of Forecasting, 36(5), 541–556.
Aramonte, S., Schindler, J. W., & Rosen, S. (2013). Assessing and combining financial conditions indexes. Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2976840.
Avellaneda, M., & Stoikov, S. (2008). High-frequency trading in a limit order book. Quantitative Finance, 8(3), 217–224.
Bogoev, D., & Karam, A. (2016). An Empirical Detection of High Frequency Trading Strategies. (Working Paper). Durham, UK: Durham University.
Brogaard, J., Hendershott, T., & Riordan, R. (2014). High-frequency trading and price discovery. Review of Financial Studies, 27(8), 2267–2306.
Cao, C., Hansch, O., & Wang, X. (2009). The information content of an open limit-order book. Journal of Futures Markets, 29(1), 16–41.
Carrion, A. (2013). Very fast money: High-frequency trading on the NASDAQ. Journal of Financial Markets, 16(4), 680–711.
Cenesizoglu, T., Dionne, G., & Zhou, X. (2014). Effects of the limit order book on price dynamics. Retrieved from https://depot.erudit.org/bitstream/003996dd/1/CIRPEE14-26.pdf.
Chan, N. T., & Shelton, C. (2001). An electronic market-maker. Retrieved from https://dspace.mit.edu/bitstream/handle/1721.1/7220/AIM-2001-005.pdf?sequence=2.
Chang, Y. L. (2015). Inferring Markov chain for modeling order book dynamics in high frequency environment. International Journal of Machine Learning and Computing, 5(3), 247–251.

Christensen, H. L., & Woodmansey, R. (2013). Prediction of hidden liquidity in the limit order book of globex futures. Journal of Trading, 8(3), 68–95.
Creamer, G. (2012). Model calibration and automated trading agent for euro futures. Quantitative Finance, 12(4), 531–545.
De Winne, R., & D'hondt, C. (2007). Hide-and-seek in the market: Placing and detecting hidden orders. Review of Finance, 11(4), 663–692.
Detollenaere, B., & D'hondt, C. (2017). Identifying expensive trades by monitoring the limit order book. Journal of Forecasting, 36(3), 273–290.
Dixon, M. (2016). High frequency market making with machine learning. Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2868473.
Felker, T., Mazalov, V., & Watt, S. M. (2014). Distance-based high-frequency trading. Procedia Computer Science, 29, 2055–2064.
Fletcher, T., Hussain, Z., & Shawe-Taylor, J. (2010). Multiple kernel learning on the limit order book. In Proceedings of the First Workshop on Applications of Pattern Analysis, Vol. 11, pp. 167–174.
Galeshchuk, S. (2016). Neural networks performance in exchange rate prediction. Neurocomputing, 172, 446–452.
Gould, M. D., Porter, M. A., Williams, S., McDonald, M., Fenn, D. J., & Howison, S. D. (2013). Limit order books. Quantitative Finance, 13(11), 1709–1742.
Hallgren, J., & Koski, T. (2016). Testing for causality in continuous time Bayesian network models of high-frequency data. arXiv preprint retrieved from https://arxiv.org/abs/1601.06651.
Han, J., Hong, J., Sutardja, N., & Wong, S. F. (2015). Machine Learning Techniques for Price Change Forecast Using the Limit Order Book Data. (Working Paper). Berkeley, CA: University of California, Berkeley.
Hasbrouck, J. (2009). Trading costs and returns for US equities: Estimating effective costs from daily data. Journal of Finance, 64(3), 1445–1477.
Hasbrouck, J., & Saar, G. (2013). Low-latency trading. Journal of Financial Markets, 16(4), 646–679.
Huang, G.-B., Zhou, H., Ding, X., & Zhang, R. (2012). Extreme learning machine for regression and multiclass classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 42(2), 513–529.
Iosifidis, A., Tefas, A., & Pitas, I. (2017). Approximate kernel extreme learning machine for large scale data classification. Neurocomputing, 219, 210–220.
Kalay, A., Sade, O., & Wohl, A. (2004). Measuring stock illiquidity: An investigation of the demand and supply schedules at the TASE. Journal of Financial Economics, 74(3), 461–486.
Kalay, A., Wei, L., & Wohl, A. (2002). Continuous trading or call auctions: Revealed preferences of investors at the Tel Aviv stock exchange. Journal of Finance, 57(1), 523–542.
Kearns, M., & Nevmyvaka, Y. (2013). Machine Learning for Market Microstructure and High Frequency Trading. In D. Easley, M. López De Prado, & M. O'Hara (Eds.), High Frequency Trading: New Realities for Traders, Markets and Regulators. London, UK: Risk Books.
Kercheval, A. N., & Zhang, Y. (2015). Modelling high-frequency limit order book dynamics with support vector machines. Quantitative Finance, 15(8), 1315–1329.
Kim, A. J. (2001). Input/Output Hidden Markov Models for Modeling Stock Order Flows. (Technical Report No. 1370). Cambridge, MA: MIT AI Laboratory.
Levendovszky, J., & Kia, F. (2012). Prediction based-high frequency trading on financial time series. Periodica Polytechnica: Electrical Engineering and Computer Science, 56(1), 29–34.
Li, X., Xie, H., Wang, R., Cai, Y., Cao, J., Wang, F., Min, H., & Deng, F. (2016). Empirical analysis: Stock market prediction via extreme learning machine. Neural Computing and Applications, 27(1), 67–78.
Liu, J., & Park, S. (2015). Behind stock price movement: Supply and demand in market microstructure and market influence. Journal of Trading, 10(3), 13–23.
Maglaras, C., Moallemi, C. C., & Zheng, H. (2015). Optimal execution in a limit order book and an associated microstructure market impact model. Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2610808.
Majhi, R., Panda, G., & Sahoo, G. (2009). Development and performance evaluation of FLANN based model for forecasting of stock markets. Expert Systems with Applications, 36(3), 6800–6808.
Malik, A., & Lon Ng, W. (2014). Intraday liquidity patterns in limit order books. Studies in Economics and Finance, 31(1), 46–71.
Mankad, S., Michailidis, G., & Kirilenko, A. (2013). Discovering the ecosystem of an electronic financial market with a dynamic machine-learning method. Algorithmic Finance, 2(2), 151–165.
Næs, R., & Skjeltorp, J. A. (2006). Order book characteristics and the volume–volatility relation: Empirical evidence from a limit order market. Journal of Financial Markets, 9(4), 408–432.
O'Hara, M., & Ye, M. (2011). Is market fragmentation harming market quality? Journal of Financial Economics, 100(3), 459–474.
Pai, P.-F., & Lin, C.-S. (2005). A hybrid Arima and support vector machines model in stock price forecasting. Omega, 33(6), 497–505.
Palguna, D., & Pollak, I. (2016). Mid-price prediction in a limit order book. IEEE Journal of Selected Topics in Signal Processing, 10(6), 1083–1092.
Panayi, E., Peters, G. W., Danielsson, J., & Zigrand, J.-P. (2016). Designating market maker behaviour in limit order book markets. Econometrics and Statistics, 5, 20–44.
Ranaldo, A. (2004). Order aggressiveness in limit order book markets. Journal of Financial Markets, 7(1), 53–74.
Rehman, M., Khan, G. M., & Mahmud, S. A. (2014). Foreign currency exchange rates prediction using CGP and recurrent neural network. IERI Procedia, 10, 239–244.
Sandoval, J., & Hernández, G. (2015). Computational visual analysis of the order book dynamics for creating high-frequency foreign exchange trading strategies. Procedia Computer Science, 51, 1593–1602.
Seddon, J. J., & Currie, W. L. (2017). A model for unpacking big data analytics in high-frequency trading. Journal of Business Research, 70, 300–307.
Sharang, A., & Rao, C. (2015). Using machine learning for medium frequency derivative portfolio trading. arXiv preprint retrieved from https://arxiv.org/abs/1512.06228.
Siikanen, M., Kanniainen, J., & Luoma, A. (2017). What drives the sensitivity of limit order books to company announcement arrivals? Economics Letters, 159, 65–68.
Siikanen, M., Kanniainen, J., & Valli, J. (2017). Limit order books and liquidity around scheduled and non-scheduled announcements: Empirical evidence from NASDAQ Nordic. Finance Research Letters, 21, 264–271.
Sirignano, J. (2016). Deep learning for limit order books. Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2710331.
Suwanpetai, P. (2016). Estimation of exchange rate models after news announcement. In AP16Thai Conference 2016: Sixth Asia–Pacific Conference on Global Business, Economics, Finance and Social Sciences.
Talebi, H., Hoang, W., & Gavrilova, M. L. (2014). Multi-scale foreign exchange rates ensemble for classification of trends in FOREX market. Procedia Computer Science, 29, 2065–2075.
Vella, V., & Ng, W. L. (2016). Improving risk-adjusted performance in high frequency trading using interval type-2 fuzzy logic. Expert Systems with Applications, 55, 70–86.
Yang, S., Paddrik, M., Hayes, R., Todd, A., Kirilenko, A., Beling, P., & Scherer, W. (2012). Behavior Based Learning in Identifying High Frequency Trading Strategies. In 2012 IEEE Conference on Computational Intelligence for Financial Engineering and Economics (CIFEr), IEEE, Piscataway, NJ, pp. 1–8.
Yang, S. Y., Qiao, Q., Beling, P. A., Scherer, W. T., & Kirilenko, A. A. (2015). Gaussian process-based algorithmic trading strategy identification. Quantitative Finance, 15(10), 1683–1703.
Yu, Y. (2006). The Limit Order Book Information and the Order Submission Strategy: a Model Explanation. In 2006 International Conference on Service Systems and Service Management, IEEE, Piscataway, NJ, Vol. 1, pp. 687–691.
Zhang, K., Kwok, J. T., & Parvin, B. (2009). Prototype Vector Machine for Large Scale Semi-Supervised Learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ACM, New York, NY, pp. 1233–1240.
Zheng, B., Moulines, E., & Abergel, F. (2012). Price jump prediction in limit order book. Available at SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2026454.

Adamantios Ntakaris is an ESR within the Marie Curie BigDataFinance project in the Dept. of Signal Processing at Tampere University of Technology. He received a B.Sc. in Mathematics in 2009 from the Aristotle University of Thessaloniki and an M.Sc. in Financial Modelling and Optimization in 2014 from the University of Edinburgh. In 2014 Adamantios completed an industrial placement at Standard Life Investments in Edinburgh. Before commencing his PhD, he worked as an Effective Interest Rate Analyst at CitiGroup investment bank in Edinburgh, and as a Maths Olympiad Coach in Thessaloniki.

Martin Magris has been an Early Stage Researcher within the Marie Curie BigDataFinance training network in the Laboratory of Industrial and Information Management at Tampere University of Technology (Finland) since April 2016. He received a B.Sc. in Statistics and Mathematics in 2013 and an M.Sc. in Statistical and Actuarial Sciences in 2015 from Università degli studi di Trieste, Italy. As a part of his master's studies, Martin visited Aarhus University for seven months in 2014. In the years 2015-2016, before commencing his PhD, Martin worked as an actuarial analyst for a non-life insurance company, specifically in car-insurance pricing and in the development, profit-testing and pricing of multiple-peril non-life insurance products.

Juho Kanniainen is a Professor of Financial Engineering at the Tampere University of Technology, Finland. His research agenda is focused on quantitative finance with emphasis on big data problems. Dr. Kanniainen has published in many journals in Finance and Engineering, including Review of Finance, Journal of Banking and Finance, and Digital Signal Processing. He has been coordinating two international EU projects, BigDataFinance (www.bigdatafinance.eu) and HPCFinance (www.hpcfinance.eu).

Moncef Gabbouj is a Professor of Signal Processing at the Department of Signal Processing, Tampere University of Technology, Tampere, Finland. He was Academy of Finland Professor during 2011-2015. He held several visiting professorships at different universities. Dr. Gabbouj is currently the TUT-Site Director of the NSF IUCRC funded Center for Visual and Decision Informatics. His research interests include Big Data analytics, multimedia content-based analysis, indexing and retrieval, artificial intelligence, machine learning, pattern recognition, nonlinear signal and image processing and analysis, voice conversion, and video processing and coding.

Alexandros Iosifidis is currently an Assistant Professor of Machine Learning and Computer Vision in the Department of Engineering, at Aarhus University, Denmark. He has held Postdoctoral Researcher positions in Tampere University of Technology, Finland and Aristotle University of Thessaloniki, Greece. He has participated in many R&D projects financed by EU, Greek, Finnish, and Danish funding agencies and companies. He has co-authored more than 120 papers in international journals and conferences proposing novel Machine Learning techniques and their application in a variety of problems.

How to cite this article: Ntakaris A, Magris M, Kanniainen J, Gabbouj M, Iosifidis A. Benchmark dataset for mid-price forecasting of limit order book data with machine learning methods. Journal of Forecasting. 2018;37:852–866. https://doi.org/10.1002/for.2543
