Deep Learning With Long Short-Term Memory Networks For Financial Market Predictions

T. Fischer, C. Krauss

Article history: Received 12 May 2017; Accepted 27 November 2017; Available online 5 December 2017

Keywords: Finance; Statistical arbitrage; LSTM; Machine learning; Deep learning

Abstract
Long short-term memory (LSTM) networks are a state-of-the-art technique for sequence learning. They are less commonly applied to financial time series predictions, yet inherently suitable for this domain. We deploy LSTM networks for predicting out-of-sample directional movements for the constituent stocks of the S&P 500 from 1992 until 2015. With daily returns of 0.46 percent and a Sharpe ratio of 5.8 prior to transaction costs, we find LSTM networks to outperform memory-free classification methods, i.e., a random forest (RAF), a deep neural net (DNN), and a logistic regression classifier (LOG). The outperformance relative to the general market is very clear from 1992 to 2009, but as of 2010, excess returns seem to have been arbitraged away with LSTM profitability fluctuating around zero after transaction costs. We further unveil sources of profitability, thereby shedding light into the black box of artificial neural networks. Specifically, we find one common pattern among the stocks selected for trading – they exhibit high volatility and a short-term reversal return profile. Leveraging these findings, we are able to formalize a rules-based short-term reversal strategy that yields 0.23 percent prior to transaction costs. Further regression analysis unveils low exposure of the LSTM returns to common sources of systematic risk – also compared to the three benchmark models.

© 2017 Elsevier B.V. All rights reserved.
1. Introduction

Prediction tasks on financial time series are notoriously difficult, primarily driven by the high degree of noise and the generally accepted, semi-strong form of market efficiency (Fama, 1970). Yet, there is a plethora of well-known capital market anomalies that are in stark contrast with the notion of market efficiency. For example, Jacobs (2015) or Green, Hand, and Zhang (2013) provide surveys comprising more than 100 of such capital market anomalies, which effectively rely on return predictive signals to outperform the market. However, the financial models used to establish a relationship between these return predictive signals (the features) and future returns (the targets) are usually transparent in nature and not able to capture complex non-linear dependencies.

In the last years, initial evidence has been established that machine learning techniques are capable of identifying (non-linear) structures in financial market data, see Huck (2009, 2010), Takeuchi and Lee (2013), Moritz and Zimmermann (2014), Dixon, Klabjan, and Bang (2015), and further references in Atsalakis and Valavanis (2009) as well as Sermpinis, Theofilatos, Karathanasopoulos, Georgopoulos, and Dunis (2013). Specifically, we expand on the recent work of Krauss, Do, and Huck (2017) on the same data sample for the sake of comparability. The authors use deep learning, random forests, gradient-boosted trees, and different ensembles as forecasting methods on all S&P 500 constituents from 1992 to 2015. One key finding is that deep neural networks with returns of 0.33 percent per day prior to transaction costs underperform gradient-boosted trees with 0.37 percent and random forests with 0.43 percent. The latter fact is surprising, given that deep learning has "dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains" (LeCun, Bengio, and Hinton, 2015, p. 436). At first sight, we would expect similar improvements in the domain of time series predictions. However, Krauss et al. (2017, p. 695) point out that "neural networks are notoriously difficult to train" and that it "may well be that there are configurations in parameter space to further improve the performance" of deep learning.

In this paper, we primarily focus on deep learning, and on further exploring its potential in a large-scale time series prediction problem. In this respect, we make three contributions to the literature.
• First, we focus on long short-term memory (LSTM) networks, one of the most advanced deep learning architectures for sequence learning tasks, such as handwriting recognition, speech recognition, or time series prediction (Graves et al., 2009; Graves, Mohamed, & Hinton, 2013; Hochreiter & Schmidhuber, 1997; Schmidhuber, 2015). Surprisingly, to our knowledge, there has been no previous attempt to deploy LSTM networks on a large, liquid, and survivor bias free stock universe to assess its performance in large-scale financial market prediction tasks. Selected applications, as in Xiong, Nichols, and Shen (2015), focus on predicting the volatility of the S&P 500, on forecasting a small sample of foreign exchange rates (Giles, Lawrence, & Tsoi, 2001), or on assessing the impact of incorporating news for specific companies (Siah and Myers, 2016). We fill this void and apply LSTM networks to all S&P 500 constituents from 1992 until 2015. Hereby, we provide an in-depth guide on data preprocessing, as well as development, training, and deployment of LSTM networks for financial time series prediction tasks. Last but not least, we contrast our findings to selected benchmarks from the literature – a random forest (the best performing benchmark), a standard deep neural net (to show the value-add of the LSTM architecture), and a standard logistic regression (to establish a baseline). The LSTM network outperforms the memory-free methods with statistically and economically significant returns of 0.46 percent per day – compared to 0.43 percent for the RAF, 0.32 percent for the standard DNN, and 0.26 percent for the logistic regression. This relative advantage also holds true with regard to predictive accuracy, where a Diebold–Mariano test confirms superior forecasts of the LSTM networks compared to the applied benchmarks. Our findings are largely robust to microstructural effects. Specifically, when we implement the LSTM strategy on volume-weighted-average-prices (VWAPs) instead of closing prices, we see a decline in profitability, but the results are still statistically and economically significant. The same holds true for a weekly implementation with lower turnover – even after introducing a one-day-waiting rule after the signal. Only as of 2010, the edge of the LSTM seems to have been arbitraged away, with LSTM profitability fluctuating around zero after transaction costs, and RAF profitability dipping strictly into the negative domain.
• Second, we aim at shedding light into the black box of artificial neural networks – thereby unveiling sources of profitability. Generally, we find that stocks selected for trading exhibit high volatility, below-mean momentum, extremal directional movements in the last days prior to trading, and a tendency for reversing these extremal movements in the near-term future.
• Third, we synthesize the findings of the latter part into a simplified, rules-based trading strategy that aims at capturing the quintessence of the patterns the LSTM acts upon for selecting winning and losing stocks. A strategy that buys short-term extremal losers and sells short-term extremal winners leads to daily returns of 0.23 percent prior to transaction costs – so only about 50 percent of the LSTM returns. Regression analyses on systematic risk factors unveil a remaining alpha of 0.42 percent of the LSTM prior to transaction costs and generally lower exposure to common sources of systematic risk, compared to the benchmark models.

The remainder of this paper is organized as follows. Section 2 briefly covers the data sample, software packages, and hardware. Section 3 provides an in-depth discussion of our methodology, i.e., the generation of training and trading sets, the construction of input sequences, the model architecture and training as well as the forecasting and trading steps. Section 4 presents the results and discusses our most relevant findings in light of the existing literature. Finally, Section 5 concludes.

2. Data, software, hardware

2.1. Data

For the empirical application, we use the S&P 500 index constituents from Thomson Reuters. For eliminating survivor bias, we first obtain all month end constituent lists for the S&P 500 from Thomson Reuters from December 1989 to September 2015. We consolidate these lists into one binary matrix, indicating whether the stock is an index constituent in the subsequent month. As such, we are able to approximately reproduce the S&P 500 at any given point in time between December 1989 and September 2015. In a second step, for all stocks having ever been a constituent of the index during that time frame, we download daily total return indices from January 1990 until October 2015. Return indices are cum-dividend prices and account for all relevant corporate actions and stock splits, making them the most adequate metric for return computations. Following Clegg and Krauss (2018), we report average summary statistics in Table 1, split by industry sector. They are based on equal-weighted portfolios per sector, generated monthly, and constrained to index constituency of the S&P 500.

Table 1
Average monthly summary statistics for S&P 500 constituents from January 1990 until October 2015, split by industry. They are based on equal-weighted portfolios per industry as defined by the Global Industry Classification Standards Code, formed on a monthly basis, and restricted to index constituency of the S&P 500. Monthly returns and standard deviations are denoted in percent.

2.2. Software and hardware

Data preparation and handling is entirely conducted in Python 3.5 (Python Software Foundation, 2016), relying on the packages numpy (Van Der Walt, Colbert, & Varoquaux, 2011) and pandas (McKinney, 2010). Our deep learning LSTM networks are developed with keras (Chollet, 2016) on top of Google TensorFlow, a powerful library for large-scale machine learning on heterogeneous systems (Abadi et al., 2015). Moreover, we make use of sci-kit learn (Pedregosa et al., 2011) for the random forest and logistic regression models and of H2O (H2O, 2016) for the standard deep net.

… feature vector V of dimension n_i × T_study, where T_study denotes the number of days in the study period. Then, we standardize the returns by subtracting the mean (μ_m^train) and dividing them by the standard deviation (σ_m^train) obtained from the training set:⁴

R̃_{t,s}^m = (R_{t,s}^m − μ_m^train) / σ_m^train.   (2)

² The S&P 500 constituency count slightly fluctuates around 500 over time.
³ The reason for exhibiting no more price data is generally due to delisting. Delisting may be caused by various reasons, such as bankruptcy, mergers and acquisitions, etc. Note that we do not eliminate stocks during the trading period in case they drop out of the S&P 500. The only criterion for being traded is that they have price information available for feature generation.
⁴ It is key to obtain mean and standard deviation from the training set only, in order to avoid look-ahead biases.
⁵ We have 1000 days in the study period and a sequence length of 240 days. As such, 760 sequences can be created per stock. Given that there are approximately 500 stocks in the S&P 500, we have a total of approximately 380,000 sequences.
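To make the standardization and the construction of input sequences (cf. the caption of Fig. 1 below) concrete, the following is a minimal Python sketch, not the authors' original code: it assumes a pandas DataFrame `returns` of one-day returns for a single study period (rows = days, columns = stocks), and all names are illustrative.

```python
import numpy as np
import pandas as pd

SEQ_LEN = 240  # sequence length in trading days

def make_sequences(returns: pd.DataFrame, train_end: int):
    """Standardize one-day returns with training-set moments only and cut
    overlapping 240-day input sequences (cf. Fig. 1 and footnote 5)."""
    train = returns.iloc[:train_end]
    # moments from the training set only (pooled over stocks), to avoid look-ahead bias
    mu, sigma = train.values.mean(), train.values.std()
    standardized = (returns - mu) / sigma

    sequences, target_index = [], []
    for stock in standardized.columns:
        series = standardized[stock].values
        # each sequence covers days t-239..t; the prediction targets day t+1
        for t in range(SEQ_LEN - 1, len(series) - 1):
            sequences.append(series[t - SEQ_LEN + 1 : t + 1])
            target_index.append((stock, t + 1))
    return np.asarray(sequences)[..., np.newaxis], target_index  # shape (n, 240, 1)
```

With 1000 days per study period and a sequence length of 240 days, this yields the 760 overlapping sequences per stock mentioned in footnote 5.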
Fig. 1. Construction of input sequences for LSTM networks (both the feature vector and the sequences are shown transposed).
Fig. 2. Structure of LSTM memory cell following Graves (2013) and Olah (2015).
consisting of so called memory cells. Each of the memory cells has three gates maintaining and adjusting its cell state s_t: a forget gate (f_t), an input gate (i_t), and an output gate (o_t). The structure of a memory cell is illustrated in Fig. 2.

At every timestep t, each of the three gates is presented with the input x_t (one element of the input sequence) as well as the output h_{t−1} of the memory cells at the previous timestep t − 1. Hereby, the gates act as filters, each fulfilling a different purpose:
• The forget gate defines which information is removed from the cell state.
• The input gate specifies which information is added to the cell state.
• The output gate specifies which information from the cell state is used as output.

The equations below are vectorized and describe the update of the memory cells in the LSTM layer at every timestep t. Hereby, the following notation is used:
• x_t is the input vector at timestep t.
• W_{f,x}, W_{f,h}, W_{s̃,x}, W_{s̃,h}, W_{i,x}, W_{i,h}, W_{o,x}, and W_{o,h} are weight matrices.
• b_f, b_s̃, b_i, and b_o are bias vectors.
• f_t, i_t, and o_t are vectors for the activation values of the respective gates.
• s_t and s̃_t are vectors for the cell states and candidate values.
• h_t is a vector for the output of the LSTM layer.

During a forward pass, the cell states s_t and outputs h_t of the LSTM layer at timestep t are calculated as follows.

In the first step, the LSTM layer determines which information should be removed from its previous cell states s_{t−1}. Therefore, the activation values f_t of the forget gates at timestep t are computed based on the current input x_t, the outputs h_{t−1} of the memory cells at the previous timestep (t − 1), and the bias terms b_f of the forget gates. The sigmoid function finally scales all activation values into the range between 0 (completely forget) and 1 (completely remember):

f_t = sigmoid(W_{f,x} x_t + W_{f,h} h_{t−1} + b_f).   (3)

In the second step, the LSTM layer determines which information should be added to the network's cell states (s_t). This procedure comprises two operations: first, candidate values s̃_t, that could potentially be added to the cell states, are computed. Second, the activation values i_t of the input gates are calculated:

s̃_t = tanh(W_{s̃,x} x_t + W_{s̃,h} h_{t−1} + b_s̃),   (4)

i_t = sigmoid(W_{i,x} x_t + W_{i,h} h_{t−1} + b_i).   (5)

In the third step, the new cell states s_t are calculated based on the results of the previous two steps, with ◦ denoting the Hadamard (elementwise) product:

s_t = f_t ◦ s_{t−1} + i_t ◦ s̃_t.   (6)

In the last step, the output h_t of the memory cells is derived as denoted in the following two equations:

o_t = sigmoid(W_{o,x} x_t + W_{o,h} h_{t−1} + b_o),   (7)

h_t = o_t ◦ tanh(s_t).   (8)

When processing an input sequence, its features are presented timestep by timestep to the LSTM network. Hereby, the input at each timestep t (in our case, one single standardized return) is processed by the network as denoted in the equations above. Once the last element of the sequence has been processed, the final output for the whole sequence is returned.
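To make Eqs. (3)–(8) concrete, the following NumPy sketch runs a single LSTM memory cell over one 240-step sequence; the randomly initialized weights and the dictionary-based bookkeeping are purely illustrative and not part of the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, s_prev, W, b):
    """One timestep of the LSTM layer, following Eqs. (3)-(8).
    W maps gate name -> (input weights, recurrent weights); b maps gate name -> bias."""
    f_t = sigmoid(W['f'][0] @ x_t + W['f'][1] @ h_prev + b['f'])        # Eq. (3)
    s_tilde = np.tanh(W['s'][0] @ x_t + W['s'][1] @ h_prev + b['s'])    # Eq. (4)
    i_t = sigmoid(W['i'][0] @ x_t + W['i'][1] @ h_prev + b['i'])        # Eq. (5)
    s_t = f_t * s_prev + i_t * s_tilde                                  # Eq. (6), Hadamard products
    o_t = sigmoid(W['o'][0] @ x_t + W['o'][1] @ h_prev + b['o'])        # Eq. (7)
    h_t = o_t * np.tanh(s_t)                                            # Eq. (8)
    return h_t, s_t

# toy dimensions: 1 input feature per timestep, 25 hidden units (as in Section 3.3)
rng = np.random.default_rng(0)
n_in, n_hidden = 1, 25
W = {g: (rng.normal(size=(n_hidden, n_in)), rng.normal(size=(n_hidden, n_hidden)))
     for g in 'fsio'}
b = {g: np.zeros(n_hidden) for g in 'fsio'}

h = s = np.zeros(n_hidden)
for x in rng.normal(size=(240, 1)):   # one 240-step sequence of standardized returns
    h, s = lstm_step(x, h, s, W, b)   # h after the last step is the sequence output
```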
During training, and similar to traditional feed-forward networks, the weights and bias terms are adjusted in such a way that they minimize the loss of the specified objective function across the training samples. Since we are dealing with a classification problem, we use cross-entropy as objective function.

The number of weights and bias terms being trained is calculated as follows: let h denote the number of hidden units of the LSTM layer, and i the number of input features; then the number of parameters of the LSTM layer that needs to be trained is:

4hi + 4h + 4h² = 4(hi + h + h²) = 4(h(i + 1) + h²).   (9)

Hereby, 4hi refers to the dimensions of the four weight matrices applied to the inputs at each gate, i.e., W_{f,x}, W_{s̃,x}, W_{i,x}, and W_{o,x}. The 4h refers to the dimensions of the four bias vectors (b_f, b_s̃, b_i, and b_o). Finally, the 4h² corresponds to the dimensions of the weight matrices applied to the outputs at the previous timestep, i.e., W_{f,h}, W_{s̃,h}, W_{i,h}, and W_{o,h}.

For the training of the LSTM network, we apply three advanced methods via keras. First, we make use of RMSprop, a mini-batch version of rprop (Tieleman & Hinton, 2012), as optimizer. The selection of RMSprop is motivated from the literature as it is "usually a good choice for recurrent neural networks" (Chollet, 2016). Second, following Gal and Ghahramani (2016), we apply dropout regularization within the recurrent layer. Hereby, a fraction of the input units is randomly dropped at each update during training time, both at the input gates and the recurrent connections, resulting in reduced risk of overfitting and better generalization. Based on initial experiments on the year 1991 (which is not used as part of the out-of-sample trading periods), we have observed that higher dropout values go along with a decline in performance and therefore settled on a relatively low dropout value of 0.1. Third, we make use of early stopping to dynamically derive the number of epochs for training for each study period individually and to further reduce the risk of overfitting. Hereby, the training samples are split into two sets: one training and one validation set. The first set is used to train the network and to iteratively adjust its parameters so that the loss function is minimized. After each epoch (one pass across the samples of the first set), the network predicts the unseen samples from the validation set and a validation loss is computed. Once the validation loss does not decrease for patience periods, the training is stopped and the weights of the model with the lowest validation loss are restored (see ModelCheckpoint callback in Chollet, 2016). Following Granger (1993), who suggests to hold back about 20 percent of the sample as "post-sample" data, we use 80 percent of the training samples as training set and 20 percent as validation set (samples are assigned randomly to either training or validation set), a maximum training duration of 1000 epochs, and an early stopping patience of 10. The specified topology of our trained LSTM network is hence as follows:
• Input layer with 1 feature and 240 timesteps.
• LSTM layer with h = 25 hidden neurons and a dropout value of 0.1. This configuration yields 2752 parameters for the LSTM, leading to a sensible number of approximately 93 training examples per parameter. This value has been chosen in analogy to the configuration of the deep net in Krauss et al. (2017). A high number of observations per parameter allows for more robust estimates in case of such noisy training data, and reduces the risk of overfitting.
• Output layer (dense layer) with two neurons and softmax activation function⁶ – a standard configuration.

⁶ Alternatively, one output neuron with a sigmoid activation function would also be a valid setup.
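The topology and training choices above translate almost directly into keras. The following is a minimal sketch rather than the authors' original code; argument names follow the keras 2 API (older versions name the dropout arguments differently), the checkpoint file name is a placeholder, and data handling is omitted.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.callbacks import EarlyStopping, ModelCheckpoint

model = Sequential()
# input: sequences of 240 timesteps with 1 feature; 25 hidden units,
# dropout of 0.1 on the inputs and on the recurrent connections
model.add(LSTM(25, input_shape=(240, 1), dropout=0.1, recurrent_dropout=0.1))
model.add(Dense(2, activation='softmax'))   # two output neurons, softmax
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

# Eq. (9): 4(h(i + 1) + h^2) = 2700 weights in the LSTM layer, plus
# 25 * 2 + 2 = 52 in the dense output layer, i.e., 2752 parameters in total
assert model.count_params() == 2752

callbacks = [EarlyStopping(monitor='val_loss', patience=10),
             ModelCheckpoint('lstm_best.h5', monitor='val_loss',
                             save_best_only=True)]
# model.fit(X_train, y_train, validation_split=0.2, epochs=1000,
#           callbacks=callbacks)
# note: the paper assigns the 20 percent validation samples randomly,
# whereas keras' validation_split takes the last fraction of the data
```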
3.4. Benchmark models – random forest, deep net, and logistic regression

For benchmarking the LSTM, we choose random forests, i.e., a robust yet high-performing machine learning method, a standard deep net, i.e., for showing the advantage of the LSTM, and a logistic regression, i.e., a standard classifier as baseline. Note that random forests, standard deep nets, and the feature generation for memory-free methods follow the specifications outlined in Krauss et al. (2017) for benchmarking reasons. Specifically, we use cumulative returns R_{t,s}^m as features with m ∈ {{1, ..., 20} ∪ {40, 60, ..., 240}}, see Eq. (1), and the same targets as defined in Section 3.2.2. For the logistic regression model, we standardize the returns as denoted in Eq. (2).⁷ In the subsequent paragraphs, we briefly outline how we calibrate the benchmarking methods.

⁷ We perform no standardization of the returns for the other two models as this is automatically carried out for the deep neural network in its H2O implementation, or not required in case of the random forest.

Random forest: The first algorithm for random decision forests has been suggested by Ho (1995), and was later expanded by Breiman (2001). Simply speaking, random forests are composed of many deep yet decorrelated decision trees built on different bootstrap samples of the training data. Two key techniques are used in the random forest algorithm – random feature selection to decorrelate the trees, and bagging to build them on different bootstrap samples. The algorithm is fairly simple: for each of the B trees in the committee, a bootstrap sample is drawn from the training data. A decision tree is developed on the bootstrap sample. At each split, only a subset m of the p features is available as potential split criterion. The growing stops once the maximum depth J is reached. The final output is a committee of B trees, and classification is performed as majority vote. We set the number of trees B to 1000, and the maximum depth to J = 20, allowing for substantially higher order interactions. Random feature selection is left at a default value of m = √p for classification, see Pedregosa et al. (2011).

We use a random forest as benchmark for two compelling reasons. First, it is a state-of-the-art machine learning model that requires virtually no tuning and usually delivers good results. Second, random forests in this configuration are the best single technique in Krauss et al. (2017) and the method of choice in Moritz and Zimmermann (2014) – a large-scale machine learning application on monthly stock market data. As such, random forests serve as a powerful benchmark for any innovative machine learning model.
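Under these settings, the random forest benchmark maps directly onto scikit-learn; a minimal sketch (the n_jobs and random_state arguments are illustrative additions, not taken from the paper):

```python
from sklearn.ensemble import RandomForestClassifier

# B = 1000 trees, maximum depth J = 20, random feature selection left at the
# classification default of sqrt(p) features per split
rf = RandomForestClassifier(n_estimators=1000, max_depth=20,
                            max_features='sqrt', n_jobs=-1, random_state=0)
# rf.fit(X_train, y_train); class probabilities via rf.predict_proba(X_trade)
```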
Deep neural network: We deploy a standard DNN to show the relative advantage of LSTM networks. Specifically, we use a feed-forward neural network with 31 input neurons, 31 neurons in the first, 10 in the second, 5 in the third hidden layer, and 2 neurons in the output layer. The activation function is maxout with two channels, following Goodfellow, Warde-Farley, Mirza, Courville, and Bengio (2013), and softmax in the output layer. Dropout is set to 0.5, and L1 regularization with shrinkage 0.00001 is used – see Krauss et al. (2017) for further details.

Logistic regression: As baseline model, we also deploy logistic regression. Details about our implementation are available in the documentation of sci-kit learn (Pedregosa et al., 2011) and the references therein. The optimal L2 regularization is determined among 100 choices on a logarithmic scale between 0.0001 and 10,000 via 5-fold cross-validation on the respective training set, and L-BFGS is deployed to find an optimum, while restricting the maximum number of iterations to 100. Logistic regression serves as a baseline, so that we can derive the incremental value-add of the much more complex and computationally intensive LSTM network in comparison to a standard classifier.
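The described regularization search corresponds to scikit-learn's cross-validated logistic regression; a minimal sketch (note that scikit-learn parametrizes the L2 penalty via the inverse regularization strength C, so the illustrative grid below spans 0.0001 to 10,000):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

log_reg = LogisticRegressionCV(Cs=np.logspace(-4, 4, 100),  # 100 values, 1e-4 ... 1e4
                               cv=5, penalty='l2',
                               solver='lbfgs', max_iter=100)
# log_reg.fit(X_train_standardized, y_train)
```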
3.5. Forecasting, ranking, and trading

For all models, we forecast the probability P̂_{t+1|t}^s for each stock s to out-/underperform the cross-sectional median in period t + 1, making only use of information up until time t. Then, we rank all stocks for each period t + 1 in descending order of this probability.
Table 2
Panel A: p-values of the Diebold–Mariano (DM) test for the null hypothesis that the forecasts of method i have inferior accuracy compared to the forecasts of method j. Panel B: p-values of the Pesaran–Timmermann (PT) test for the null hypothesis that predictions and responses are independently distributed. Both panels are based on the k = 10 portfolio from December 1992 to October 2015.
The top of the ranking corresponds to the most undervalued stocks that are expected to outperform the cross-sectional median in t + 1. As such, we go long the top k and short the flop k stocks of each ranking, for a long-short portfolio consisting of 2k stocks – see Huck (2009, 2010).
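The ranking and portfolio formation of Section 3.5 can be sketched in a few lines; the function below is illustrative and assumes a pandas Series `prob_up` of one period's predicted out-performance probabilities, indexed by stock:

```python
import pandas as pd

def long_short_portfolio(prob_up: pd.Series, k: int = 10):
    """Rank one period's forecasts and pick the k stocks to go long and short."""
    ranking = prob_up.sort_values(ascending=False)  # descending probability
    longs = list(ranking.index[:k])    # top k: expected to beat the cross-sectional median
    shorts = list(ranking.index[-k:])  # flop k: expected to underperform it
    return longs, shorts
```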
4. Results

Our results are presented in three stages. First, we analyze returns prior to and after transaction costs of 5 bps per half-turn, following Avellaneda and Lee (2010), and contrast the performance of the LSTM network against the random forest, the deep neural net, and the logistic regression. Second, we derive common patterns within the top and flop stocks, thus unveiling sources of profitability. Third, we develop a simplified trading strategy based on these findings, and show that we can achieve part of the LSTM performance by capturing the most visible pattern with a transparent trading rule.

4.1. Performance review

4.1.1. Overview

First, we analyze the characteristics of portfolios consisting of 2k stocks, i.e., the top k stocks we go long, and the flop k stocks we go short. We choose k ∈ {10, 50, 100, 150, 200} and compare the performance of the novel LSTM with the other approaches along the dimensions mean return per day, annualized standard deviation, annualized Sharpe ratio, and accuracy – prior to transaction costs.

We see the following trends. Irrespective of the portfolio size k, the LSTM shows favorable characteristics vis-à-vis the other approaches. Specifically, daily returns prior to transaction costs are at 0.46 percent, compared to 0.43 percent for the RAF, 0.32 percent for the DNN, and 0.26 percent for the LOG for k = 10. Also for larger portfolio sizes, the LSTM achieves the highest mean returns per day, with the exception of k = 200, where it is tied with the RAF. With respect to standard deviation – a risk metric – the LSTM is on a similar level as the RAF, with slightly lower values for k = 10, and slightly higher values for increasing portfolio sizes. Both LSTM and RAF exhibit much lower standard deviation than the DNN and the logistic regression – across all levels of k. The Sharpe ratio, or return per unit of risk, is highest for the LSTM up until k = 100, and slightly less than the RAF for even larger portfolios, when the lower standard deviation of the RAF outweighs the higher return of the LSTM. Accuracy, meaning the share of correct classifications, is an important machine learning metric. We see a clear advantage of the LSTM for the k = 10 portfolio, a slight edge until k = 100, and a tie with the RAF for increasing sizes.

We focus our subsequent analyses on the long-short portfolio with k = 10.

4.1.2. Details on predictive accuracy

The key task of the employed machine learning methods is to accurately predict whether a stock outperforms its cross-sectional median or not. In this paragraph, we benchmark the predictive accuracy of the LSTM forecasts against those of the other methods, and against random guessing. Furthermore, we compare the financial performance of the LSTM with 100,000 randomly generated long-short portfolios.

First, we deploy the Diebold and Mariano (1995) (DM) test to evaluate the null that the forecasts of method i have inferior accuracy compared to the forecasts of method j, with i, j ∈ {LSTM, RAF, DNN, LOG} and i ≠ j. For each forecast of each method, we assign a 0 in case the individual stock of the k = 10 portfolio is correctly classified and a 1 otherwise, and use this vector of classification errors as input for the DM test. In total, we hence consider 5750 × 2 × k = 115,000 individual forecasts for the stocks in the k = 10 portfolio for 5750 trading days in total. Results are depicted in panel A of Table 2. In line one, for the null that the LSTM forecast is inferior to the forecasts of RAF, DNN, or LOG, we obtain p-values of 0.0143, 0.0037, and 0.0000, respectively. If we test at a five percent significance level, and apply a Bonferroni correction for three comparisons, the adjusted significance level is 1.67 percent, and we can still reject the individual null hypotheses that the LSTM forecasts are less accurate than the RAF, DNN, or LOG forecasts. Hence, it makes sense to assume that the LSTM forecasts are superior to those of the other considered methods. Similarly, we can reject the null that the RAF forecasts are inferior to the LOG forecasts as well as the null that the DNN forecasts are inferior to the LOG forecasts. In other words, the predictions of the sophisticated machine learning approaches all outperform those of a standard logistic regression classifier. Apparently, the former are able to capture complex dependencies in our financial time series data that cannot be extracted by a standard logistic regression. However, from the DM test matrix, we cannot infer that the RAF forecasts outperform the DNN forecasts or vice versa – both methods seem to exhibit similar predictive accuracy. Our key finding is, though, that the LSTM network – despite its significantly higher computational cost – is the method of choice in terms of forecasting accuracy.

Second, we use the Pesaran–Timmermann (PT) test to evaluate the null hypotheses that prediction and response are independently distributed for each of the forecasting methods. We find p-values of zero up to the fourth digit, suggesting that the null can be rejected at any sensible level of significance. In other words, each machine learning method we employ exhibits statistically significant predictive accuracy.

Third, we provide a statistical estimate for the probability of the LSTM network having randomly achieved these results. For k = 10, we consider a total of 5750 × 10 × 2 = 115,000 top and flop stocks, of which 54.3 percent are correctly classified. If the true accuracy of the LSTM network was indeed 50 percent, we could model the number of "successes", i.e., the number of correctly classified stocks X in the top/flop, with a binomial distribution, so X ∼ B(n = 115,000, p = 0.5, q = 0.5). For such a large n, X approximately follows N(μ = np, σ = √(npq)). Now, we can easily compute the probability of achieving more than 54.3 percent accuracy if the LSTM network had a true accuracy of 50 percent.
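This normal-approximation check can be reproduced in a few lines of Python (a sketch for illustration; the paper evaluates the exact figure with the R package Rmpfr, as noted below):

```python
from scipy import stats

n, acc = 115_000, 0.543                      # forecasts in the k = 10 portfolio, observed accuracy
mu, sigma = 0.5 * n, (n * 0.5 * 0.5) ** 0.5  # moments of B(n, 0.5) for the normal approximation
p_value = stats.norm.sf(acc * n, loc=mu, scale=sigma)
print(p_value)                               # on the order of 1e-187
```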
Fig. 3. Daily performance characteristics for long-short portfolios of different sizes: mean return (excluding transaction costs), standard deviation, annualized Sharpe ratio (excluding transaction costs), and accuracy from December 1992 to October 2015.
Fig. 4. Empirical distribution of mean daily returns of 100,000 sampled monkey trading long-short portfolios (excluding transaction costs).
We use the R package Rmpfr of Maechler (2016) to evaluate multiple-precision floating point numbers and compute a probability of 2.7742e−187 that a random classifier performs as well as the LSTM by chance alone.

Finally, we assess the financial performance of 100,000 randomly sampled portfolios in the sense of Malkiel's monkey throwing darts at the Wall Street Journal's stock page (Malkiel, 2007). Hereby, we randomly sample 10 stocks for the long and 10 stocks for the short portfolio without replacement for each of the 5750 trading days. All these portfolios over the 5750 days can be interpreted as those being picked by one monkey. Then, we compute the mean average daily return of the combined long-short portfolios over these 5750 days to evaluate the monkey's performance. The results of 100,000 replications, i.e., of 100,000 different monkeys, are illustrated in Fig. 4. As expected, we see an average daily return of zero prior to transaction costs. More importantly, even the best performing "monkey", with an average daily return of 0.05 percent, does not even come close to the results of the applied models shown in Fig. 3.

4.1.3. Details on financial performance

Table 3 provides insights into the financial performance of the LSTM, compared to the benchmarks, prior to and after transaction costs.

Return characteristics: In panel A of Table 3, we see that the LSTM exhibits favorable return characteristics. Mean returns of 0.46 percent before and 0.26 percent after transaction costs are statistically significant, with a Newey–West t-statistic of 16.9336 before and 9.5792 after transaction costs, compared to a critical value of 1.9600 (5 percent significance level). The median is only slightly smaller than the mean return, and quartiles as well as minimum and maximum values suggest that results are not caused by outliers. The share of positive returns is at 55.74 percent after transaction costs, an astonishingly high value for a long-short portfolio. The second best model is the RAF, with mean returns of 0.23 percent after transaction costs, albeit at slightly higher standard deviation (0.0209 LSTM vs. 0.0215 RAF). The DNN places third with mean returns of 0.12 percent per day after transaction costs – still statistically significant – compared to the logistic regression. The simplest model achieves mean returns of 0.06 percent per day after transaction costs, which are no longer significantly different from zero (Newey–West t-statistic of 1.6666 compared to a critical value of 1.9600). Note that the LSTM shows strong performance compared to the literature. The ensemble in Krauss et al. (2017), which consists of a deep net, gradient-boosted trees, and a random forest, yields average returns of 0.25 percent per day on the same time frame, data set, and after transaction costs.
Table 3
Panels A, B, and C illustrate performance characteristics of the k = 10 portfolio, before and after transaction costs for the LSTM, compared to RAF, DNN, LOG, and to the general market (MKT) from December 1992 to October 2015. MKT represents the general market as in Kenneth R. French's data library, see here. Panel A depicts daily return characteristics. Panel B depicts daily risk characteristics. Panel C depicts annualized risk-return metrics. Newey–West standard errors with a one-lag correction are used. The first four value columns refer to returns before transaction costs, the next four to returns after transaction costs.

                           Before transaction costs              After transaction costs
   Metric                  LSTM     RAF      DNN      LOG        LSTM     RAF      DNN      LOG       MKT
A  Mean return (long)      0.0029   0.0030   0.0022   0.0021     0.0019   0.0020   0.0012   0.0011    –
   Mean return (short)     0.0017   0.0012   0.0010   0.0005     0.0007   0.0002   0.0000   −0.0005   –
   Mean return             0.0046   0.0043   0.0032   0.0026     0.0026   0.0023   0.0012   0.0006    0.0004
   Standard error          0.0003   0.0003   0.0004   0.0004     0.0003   0.0003   0.0004   0.0004    0.0001
   t-statistic             16.9336  14.1136  8.9486   7.0006     9.5792   7.5217   3.3725   1.6666    2.8305
   Minimum                 −0.2176  −0.2058  −0.1842  −0.1730    −0.2196  −0.2078  −0.1862  −0.1750   −0.0895
   Quartile 1              −0.0053  −0.0050  −0.0084  −0.0089    −0.0073  −0.0070  −0.0104  −0.0109   −0.0046
   Median                  0.0040   0.0032   0.0025   0.0022     0.0020   0.0012   0.0005   0.0002    0.0008
   Quartile 3              0.0140   0.0124   0.0140   0.0133     0.0120   0.0104   0.0120   0.0113    0.0058
   Maximum                 0.1837   0.3822   0.4284   0.4803     0.1817   0.3802   0.4264   0.4783    0.1135
   Share > 0               0.6148   0.6078   0.5616   0.5584     0.5574   0.5424   0.5146   0.5070    0.5426
   Standard dev.           0.0209   0.0215   0.0262   0.0269     0.0209   0.0215   0.0262   0.0269    0.0117
   Skewness                −0.1249  2.3052   1.2724   1.8336     −0.1249  2.3052   1.2724   1.8336    −0.1263
   Kurtosis                11.6967  40.2716  20.6760  30.2379    11.6967  40.2716  20.6760  30.2379   7.9791
B  1-percent VaR           −0.0525  −0.0475  −0.0676  −0.0746    −0.0545  −0.0495  −0.0696  −0.0766   −0.0320
   1-percent CVaR          −0.0801  −0.0735  −0.0957  −0.0995    −0.0821  −0.0755  −0.0977  −0.1015   −0.0461
   5-percent VaR           −0.0245  −0.0225  −0.0333  −0.0341    −0.0265  −0.0245  −0.0353  −0.0361   −0.0179
   5-percent CVaR          −0.0430  −0.0401  −0.0550  −0.0568    −0.0450  −0.0421  −0.0570  −0.0588   −0.0277
   Max. drawdown           0.4660   0.3187   0.5594   0.5595     0.5233   0.7334   0.9162   0.9884    0.5467
C  Return p.a.             2.0127   1.7749   1.0610   0.7721     0.8229   0.6787   0.2460   0.0711    0.0925
   Excess return p.a.      1.9360   1.7042   1.0085   0.7269     0.7764   0.6359   0.2142   0.0437    0.0646
   Standard dev. p.a.      0.3323   0.3408   0.4152   0.4266     0.3323   0.3408   0.4152   0.4266    0.1852
   Downside dev. p.a.      0.2008   0.1857   0.2524   0.2607     0.2137   0.1988   0.2667   0.2751    0.1307
   Sharpe ratio p.a.       5.8261   5.0001   2.4288   1.7038     2.3365   1.8657   0.5159   0.1024    0.3486
   Sortino ratio p.a.      10.0224  9.5594   4.2029   2.9614     3.8499   3.4135   0.9225   0.2583    0.7077
In other words, the LSTM as single model achieves a slightly stronger performance than a fully-fledged ensemble.

Risk characteristics: In panel B of Table 3, we observe a mixed picture with respect to risk characteristics. In terms of daily value at risk (VaR), the LSTM achieves second place after the RAF, with a 1-percent VaR of −5.45 percent compared to −4.95 percent for the RAF. The riskiest strategy stems from the logistic regression model, where a loss of −7.66 percent is exceeded in one percent of all cases – more than twice as risky as a buy-and-hold investment in the general market. However, the LSTM has the lowest maximum drawdown of 52.33 percent – compared to all other models and the general market.

Annualized risk-return metrics: In panel C of Table 3, we analyze risk-return metrics on an annualized basis. We see that the LSTM achieves the highest annualized returns of 82.29 percent after transaction costs, compared to the RAF (67.87 percent), the DNN (24.60 percent), the LOG (7.11 percent), and the general market (9.25 percent). Annualized standard deviation is at the second lowest level of 33.23 percent, compared to all benchmarks. The Sharpe ratio scales excess return by standard deviation, and thus can be interpreted as a signal-to-noise ratio in finance, or the return per unit of risk. We see that the LSTM achieves the highest level of 2.34, with the RAF coming in second with 1.87, while all other methods have a Sharpe ratio well below 1.0.

From a financial perspective, we have two key findings. First, the LSTM outperforms the RAF by a clear margin in terms of return characteristics and risk-return metrics. We are thus able to show that LSTM networks – which are inherently suitable for time series prediction tasks – outperform shallow tree-based models as well as standard deep learning. Second, we demonstrate that a standard logistic regression is not able to capture the same level of information from the feature space – even though we perform in-sample cross-validation to find optimal regularization values.

4.1.4. A critical review of LSTM profitability over time

In Fig. 5, we display strategy performance over time, i.e., from January 1993 to October 2015. We focus on the most competitive techniques, i.e., the LSTM and the random forest.

1993/01–2000/12: These early times are characterized by strong performance – with the LSTM being superior to the RAF with respect to average returns per day, Sharpe ratio, and accuracy in almost all years. Cumulative payouts on 1 USD average investment per day reach a level of over 11 USD for the LSTM and over 8 USD for the RAF until 2000. When considering this outperformance, it is important to note that LSTM networks have been introduced in 1997, and can only be feasibly deployed ever since the emergence of GPU computing in the late 2000s. As such, the exceptionally high returns in the 90s may well be driven by the fact that LSTMs were either unknown to or completely unfeasible for the majority of market professionals at that time. A similar argument holds true for random forests.

2001/01–2009/12: The second period corresponds to a time of moderation. The LSTM is still able to produce positive returns after transaction costs in all years, albeit at much lower levels compared to the 90s. When considering cumulative payouts, we see that the outperformance of the LSTM compared to the random forest persists up to the financial crisis. A key advantage of these tree-based methods is their robustness to noise and outliers – which plays out during such volatile times. The RAF achieves exceptionally high returns and consistent accuracy values at a Sharpe ratio of up to 6. As such, total payouts on 1 USD investment amount to 4 USD for the LSTM, and to 5.6 USD for the RAF – with the majority of the RAF payouts being achieved during the financial crisis.

It seems reasonable to believe that this period of moderation is caused by an increasing diffusion of such strategies among industry professionals, thus gradually eroding profitability. However, for the RAF, the global financial crisis in 2008/2009 constitutes an exception – with a strong resurgence in profitability.
Fig. 5. Contrast of LSTM and RAF performance from January 1993 to October 2015 for the k = 10 portfolio, i.e., development of cumulative profits on 1 USD average
investment per day, average daily returns by year, annualized Sharpe ratio, and average accuracy per year, after transaction costs.
Following the literature, these profits may be driven by two factors. First, it is reasonable to believe that investors are "losing sight of the trees for the forest" (Jacobs and Weber, 2015, p. 75) at times of financial turmoil – thus creating relative-value arbitrage opportunities, see Clegg and Krauss (2018). Second, at times of high volatility, limits to arbitrage are exceptionally high as well, making it hard to capture such relative-value arbitrage opportunities. Specifically, short selling costs may rise for hard-to-borrow stocks – see Gregoriou (2012), Engelberg, Reed, and Ringgenberg (2017) – or, in even more severe cases, short-selling may be banned altogether. But also the long side is affected, e.g., when widening spreads and decreasing liquidity set a cap on returns.

2010/01–2015/10: The third period corresponds to a time of deterioration. The random forest loses its edge, and destroys more than 1 USD in value, based on an average investment of 1 USD per day. By contrast, the LSTM continues realizing higher accuracy scores in almost all years and is able to keep capital approximately constant, after transaction costs.
Fig. 6. Time-varying share of industries in the k = 10 portfolio minus share of these industries in the S&P 500, calculated over number of stocks. A positive value indicates
that the industry is overweighted and vice versa.
When comparing to the literature, it is remarkable to see that the ensemble presented in Krauss et al. (2017) also runs into a massive drawdown as of 2010 – whereas the single-model LSTM returns fluctuate around zero. Regardless, the edge of the presented machine learning methods seems to have been arbitraged away. This effect can be observed in many academic research papers targeting quantitative strategies – see, for example, Bogomolov (2013), Rad, Low, and Faff (2016), Green, Hand, and Zhang (2017), Clegg and Krauss (2018), among others.

4.1.5. Industry breakdown

Coming from the LSTM's ability to identify structure, we next analyze potential preferences for certain industries in the k = 10 portfolio. Specifically, we consider the difference between the share of an industry in the k = 10 portfolio and the share of that industry in the S&P 500 at that time. A positive value indicates that the industry is overweighted by the LSTM network, and a negative value indicates that it is underweighted.

Fig. 6 depicts our findings for the most interesting industries – oil and gas, technology, financials, and all others. First, we see that there is a significant overweight of technology stocks building up at the end of the 90s – corresponding to the growing dot-com bubble and its bust. Second, we observe a rise in financial stocks around the years 2008/2009 – corresponding to the global financial crisis. And third, oil and gas stocks gain in weight as of 2014 – falling together with the recent oil glut and the significant drop in crude oil prices. Note: a more detailed breakdown depicting the relative overweight for each of the GICS industries individually for the top k and the flop k portfolio is provided in the Appendix.⁸ It is interesting to see that the overweight in each industry adequately captures major market developments. We hypothesize that this behavior is driven by increasing volatility levels, and further elaborate on that point in the following section.

4.2. Common patterns in the top and flop stocks

Machine learning approaches – most notably artificial neural networks – are commonly considered as black-box methods. In this section, we aim for shining light into that black box, thus unveiling common patterns in the top and flop stocks.

First, we conduct a very simple yet effective analysis. For every day, we extract all 240-day return sequences for the top and flop k stocks.⁹ Then, we stack all 5750 × 10 top and the same number of flop sequences on top of each other. For better representation, we accumulate the 240 returns of each sequence to a return index, starting at a level of 0 on day t − 240, and then average these return index sequences. We hence obtain two generalized sequences, containing the patterns of the top 10 and of the flop 10 stocks. Results are depicted in Fig. 7, contrasted to the behavior of the cross-section across all stocks (mean). We see that the top and the flop stocks both exhibit below-mean momentum in the sense of Jegadeesh and Titman (1993), i.e., they perform poorly from day t − 240 until day t − 10, compared to the cross-section. From day t − 9 until t (note: the prediction is made on day t), the top stocks start crashing at accelerating pace, losing about 50 percent of what they have gained during the previous 230 days on average. By contrast, the flop stocks show an inverse pattern during the past 10 days prior to trading, and exhibit increasingly higher returns. It is compelling to see that the LSTM extracted such a strong commonality among the flop stocks and the top stocks.

Second, based on this visual insight, we construct further time-series characteristics for the top and flop stocks. Thereby, we follow the same methodology as above for generating the descriptive statistics, i.e., we compute the 240-day return sequence for each stock, calculate the desired statistic, and average over all k stocks, with k ∈ {1, 5, 10, 100, 200}. Specifically, we consider the following statistics:
• (Multi-)period returns, as defined in Section 3.2, with m ∈ {1, 5, 20, 240}, denoted as Return_t_t-m in the graphic, where m is counting backwards from day t, the last element of the sequence (i.e., the day on which the prediction for t + 1 is made). Moreover, we consider the cumulative return from day t − 20 of the sequence until day t − 240 of the sequence, denoted as Return_t-20_t-240.
• Sample standard deviations, computed over the same time frames as above, and following the same naming conventions.
• Sample skewness and kurtosis over the full 240 days of each sequence.
• The coefficients of a Carhart regression in the sense of Gatev, Goetzmann, and Rouwenhorst (2006), Carhart (1997). Thus, we extract the alpha of each stock (FF_Alpha – denoting the idiosyncratic return of the stock beyond market movements), the beta (FF_Mkt-RF – denoting how much the stock moves when the market moves by 1 percent), the small minus big factor (FF_SMB – denoting the loading on small versus large cap stocks), the high minus low factor (FF_HML – denoting the loading on value versus growth stocks), the momentum factor (FF_Mom – denoting the loading on the momentum factor in the sense of Jegadeesh and Titman (1993), Carhart (1997)), the short-term reversal factor (FF_ST_Rev – denoting the loading on short-term reversal effects), and the R squared (FF_R_squared – denoting the percentage of return variance explained by the factor model). Please note that these coefficients refer to the individual stocks' 240-day history prior to being selected by the LSTM for trading and not to the exposure of the resulting LSTM strategy to these factors (see Section 4.3 for this analysis). A minimal sketch of such a regression is given below.

⁸ We thank an anonymous referee for this suggestion.
⁹ A return sequence is generated as described in Section 3.2.
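A regression of this kind can be sketched with statsmodels, assuming daily excess returns of one stock over its 240-day history and the corresponding daily factor series from Kenneth R. French's data library (column names are illustrative):

```python
import statsmodels.api as sm

def factor_loadings(stock_excess_returns, factors):
    """OLS of a stock's daily excess returns on market, size, value, momentum,
    and short-term reversal factors; both inputs must be aligned on dates."""
    X = sm.add_constant(factors[['Mkt-RF', 'SMB', 'HML', 'Mom', 'ST_Rev']])
    res = sm.OLS(stock_excess_returns, X, missing='drop').fit()
    return res.params, res.rsquared  # alpha is params['const'], betas per factor
```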
Fig. 7. Averaged, normalized return index of top and flop 10 stocks over sequence of the 240 one day returns prior to trading for LSTM strategy.
Fig. 8. Time-series characteristics of top and flop k stocks for LSTM strategy. Statistics are first computed over the 240-day return sequences for each stock in the top or flop
k, as described in the bullet list in Section 4.2 (including naming conventions) and then averaged over all top k or all flop k stocks. The mean is calculated similarly, however
across all stocks. Note that there is a color-coding along each row, separately for top and flop k stocks. Thereby, the darker the green, the higher the value, and the darker
the red, the lower the value. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Results are shown in Fig. 8. The graphical patterns from the last paragraph now become apparent in a quantitative manner – across different values of k. First, the top stocks exhibit highly negative returns in the last days prior to the prediction, and the flop stocks highly positive returns. This behavior corresponds to short-term reversal strategies, as outlined in Jegadeesh (1990), Lehmann (1990), Lo and MacKinlay (1990) – to name a few. The LSTM network seems to independently find the stock market anomaly that stocks that sharply fall in the last days then tend to rise in the next period and vice versa. The effect is stronger, the smaller k, i.e., the lower the number of stocks considered in the portfolio. Second, both top and flop stocks exhibit weak momentum in the sense of Jegadeesh and Titman (1993). For example, the top 10 stocks show an average momentum of 9.1 percent from day t − 240 until day t − 20 of the sequence, compared to 11.7 percent, i.e., the mean across all stocks. The flop stocks exhibit a similar pattern. The smaller k, the stronger the underperformance with respect to the momentum effect. Third, when considering standard deviation, it becomes obvious that volatility plays an important role (see also LeBaron, 1992). Clearly, high volatility stocks are preferred compared to the market, and volatility is increasing for the more extreme parts of the ranking. Volatility in the sense of beta can be an important return predictive signal – see Baker, Bradley, and Wurgler (2011), Frazzini and Pedersen (2014), Hong and Sraer (2016) – and a higher beta is a key characteristic of the selected stocks. Also, skewness is similar to the general market, and the returns of the top and flop stocks are more leptokurtic than the general market – a potential return predictive signal remotely relating to the works of Kumar (2009), Boyer, Mitton, and Vorkink (2010), Bali, Cakici, and Whitelaw (2011) on stocks with "lottery-type" features. Finally, we see that the above mentioned time-series characteristics are also confirmed in the regression coefficients. Top and flop k stocks exhibit higher beta, a negative loading on the momentum factor, and a positive loading on the short-term reversal factor – with the respective magnitude increasing with lower values for k. We observe a slight loading on the SMB factor, meaning that smaller stocks among the S&P 500 constituents are selected – which usually have higher volatility.

Given that the LSTM network independently extracted these patterns from 240-day sequences of standardized returns, it is astonishing to see how well some of them relate to commonly known capital market anomalies. This finding is compelling, given that none of the identified characteristics is explicitly coded as a feature, but instead derived by the LSTM network all by itself – another key difference to the memory-free models, such as the random forest, which are provided with multi-period returns as features.
Table 5
Panel A names the LSTM model variants, with different trading frequencies and types of execution (M1 through M5). Panels B, C, and D illustrate performance characteristics of the k = 10 portfolio, before and after transaction costs for all five model variants, compared to the general market (MKT) from January 1998 to October 2015. MKT represents the general market as in Kenneth R. French's data library, see here. Panel B depicts daily return characteristics. Panel C depicts daily risk characteristics. Panel D depicts annualized risk-return metrics. Newey–West standard errors are used. The first five value columns refer to returns before transaction costs, the next five to returns after transaction costs.

                           Before transaction costs                          After transaction costs
A  Model                   M1       M2       M3       M4       M5            M1       M2       M3       M4       M5           MKT
   Frequency               Daily    Daily    Weekly   Weekly   Weekly        Daily    Daily    Weekly   Weekly   Weekly       –
   Execution               Close_t  VWAP_t   Close_t  VWAP_t   Close_{t+1}   Close_t  VWAP_t   Close_t  VWAP_t   Close_{t+1}  –
B  Mean return (long)      0.0025   0.0021   0.0014   0.0012   0.0013        0.0015   0.0011   0.0012   0.0010   0.0011       0.0003
   Mean return (short)     0.0014   0.0015   0.0001   0.0001   0.0000        0.0004   0.0005   −0.0001  −0.0001  −0.0002      –
   Mean return             0.0040   0.0036   0.0015   0.0013   0.0013        0.0020   0.0016   0.0011   0.0009   0.0009       0.0003
   Standard error          0.0003   0.0003   0.0003   0.0003   0.0003        0.0003   0.0003   0.0003   0.0003   0.0003       0.0002
   t-statistic             11.9814  11.5662  5.7083   5.1998   4.9941        5.9313   5.1980   4.1749   3.6172   3.4622       1.8651
   Minimum                 −0.2176  −0.1504  −0.1261  −0.1156  −0.1246       −0.2193  −0.1522  −0.1265  −0.1160  −0.1250      −0.0895
   Quartile 1              −0.0064  −0.0064  −0.0068  −0.0064  −0.0068       −0.0084  −0.0084  −0.0072  −0.0068  −0.0072      −0.0054
   Median                  0.0030   0.0027   0.0009   0.0007   0.0010        0.0010   0.0007   0.0005   0.0003   0.0006       0.0008
   Quartile 3              0.0136   0.0124   0.0092   0.0082   0.0089        0.0116   0.0104   0.0088   0.0078   0.0085       0.0064
   Maximum                 0.1837   0.1700   0.2050   0.1814   0.1997        0.1815   0.1678   0.2045   0.1810   0.1993       0.1135
   Share > 0               0.5847   0.5807   0.5337   0.5299   0.5379        0.5276   0.5212   0.5216   0.5120   0.5234       0.5384
   Standard dev.           0.0224   0.0208   0.0185   0.0177   0.0181        0.0224   0.0208   0.0185   0.0177   0.0181       0.0127
   Skewness                −0.0682  0.4293   0.9199   0.6518   0.8163        −0.0682  0.4293   0.9195   0.6515   0.8160       −0.0737
   Kurtosis                10.9634  6.1028   12.3432  9.5637   12.7913       10.9634  6.1028   12.3409  9.5620   12.7897      6.7662
C  1-percent VaR           −0.0571  −0.0524  −0.0516  −0.0470  −0.0493       −0.0591  −0.0544  −0.0520  −0.0474  −0.0496      −0.0341
   1-percent CVaR          −0.0861  −0.0717  −0.0662  −0.0659  −0.0673       −0.0880  −0.0736  −0.0666  −0.0663  −0.0676      −0.0491
   5-percent VaR           −0.0271  −0.0268  −0.0251  −0.0239  −0.0252       −0.0291  −0.0287  −0.0255  −0.0243  −0.0256      −0.0195
   5-percent CVaR          −0.0468  −0.0432  −0.0412  −0.0395  −0.0403       −0.0487  −0.0452  −0.0416  −0.0399  −0.0407      −0.0297
   Max. drawdown           0.4660   0.3717   0.3634   0.4695   0.3622        0.5230   0.6822   0.4041   0.5808   0.3794       0.5467
D  Return p.a.             1.5439   1.3639   0.3939   0.3384   0.3328        0.5376   0.4287   0.2604   0.2103   0.2052       0.0668
   Excess return p.a.      1.4923   1.3159   0.3655   0.3112   0.3056        0.5063   0.3996   0.2348   0.1856   0.1806       0.0450
   Standard dev. p.a.      0.3561   0.3308   0.2940   0.2818   0.2879        0.3557   0.3305   0.2939   0.2817   0.2879       0.2016
   Downside dev. p.a.      0.2192   0.1962   0.1863   0.1802   0.1851        0.2323   0.2103   0.1891   0.1830   0.1879       0.1424
   Sharpe ratio p.a.       4.1908   3.9777   1.2433   1.1043   1.0615        1.4233   1.2091   0.7988   0.6589   0.6274       0.2234
   Sortino ratio p.a.      7.0444   6.9507   2.1145   1.8784   1.7982        2.3139   2.0380   1.3770   1.1488   1.0916       0.4689
term reversal strategy or Huck (2009, 2010) in his machine learning strategies, we introduce a weekly variant of the LSTM strategy – effectively bridging the gap between lower turnover (weekly vs. daily) and retaining a sufficient number of training examples for the LSTM to successfully extract structure from the data. The latter excludes a monthly variant, without significantly changing the study design. Specifically, we design the LSTM in full analogy to the cornerstones outlined in Section 3, but on a 5-day instead of a 1-day horizon.12 Moreover, we implement five overlapping portfolios of LSTM strategies with a 5-day forecast horizon to avoid a bias introduced by the starting date (Jegadeesh & Titman, 1993). Thereby, each portfolio is offset from the others by one trading day, and returns are averaged across all portfolios. Note that we log the invested capital mark-to-market on a daily basis for all five portfolios, so we are able to exhibit daily returns for the sake of comparability.

Second, execution on volume-weighted average prices (VWAPs) is more realistic and much less susceptible to bid-ask bounce (see, for example, Lee, Chan, Faff, & Kalev, 2003). Hence, we use minute-binned transaction data from January 1998 to October 2015 for all S&P 500 stocks from QuantQuote to create VWAPs, which we use for feature engineering, for target creation, and for backtesting – see Section 3.13 Specifically, we divide the 391 minute-bins of each trading day (starting at 09:30 and ending at 16:00, including the opening) into 23 bins of 17 minutes duration, so we obtain 23 VWAPs. We use the 22nd VWAP as anchor point for creating features and the 23rd VWAP for executing our trades, and create features in full analogy to Section 3.2 (a minimal sketch of this binning is given below).

Third, a one-day-waiting rule as in Gatev et al. (2006) is a proper remedy against bid-ask bounce. Specifically, we delay the execution by one entire trading day after signal generation. This rule only makes sense for the weekly strategy; in case of the daily strategy, the delay covers the entire forecast horizon, rendering the predictions null and void.

Fourth, for the sake of completeness, we use transaction costs of 5 bps per half-turn throughout this paper – a fairly robust value for U.S. large caps. For example, Jha (2016) assumes merely 2 bps for the largest 500 stocks of the U.S. stock universe over a similar time horizon.

In a nutshell, we run the following robustness checks on the LSTM strategy. Model M1 is the baseline, showing the results of the standard LSTM strategy as discussed in the previous sections, i.e., with daily turnover and execution on the close, but constrained to the time frame of the QuantQuote data set from 1998 to 2015. Model M2 shows the effects when executing this strategy on the VWAP instead of the closing price. Models M3–M5 are weekly variants of the baseline strategy. Thereby, M3 is executed on the closing price, M4 on the VWAP, and M5 with a one-day-waiting rule on the closing price. The results of these variants are depicted in Table 5 – before and after transaction costs of 5 bps per half-turn.
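To make the VWAP construction concrete, the following minimal Python/pandas sketch aggregates one day of minute bars into 23 volume-weighted average prices of 17 minutes each and picks the 22nd VWAP as feature anchor and the 23rd VWAP as execution price. The column names 'price' and 'volume' and the data layout are assumptions for illustration; this is not the exact code used in the study.

```python
import pandas as pd

def intraday_vwaps(minute_bars: pd.DataFrame, n_bins: int = 23) -> pd.Series:
    """Aggregate one trading day of minute bars (09:30-16:00, 391 rows)
    into n_bins volume-weighted average prices.

    minute_bars is assumed to have columns 'price' and 'volume' and to be
    sorted by time; with 391 minutes and 23 bins, each bin spans 17 minutes.
    """
    bars = minute_bars.reset_index(drop=True)
    bin_id = bars.index // (len(bars) // n_bins)   # 391 // 23 = 17 minutes per bin
    turnover = bars["price"] * bars["volume"]
    vwap = turnover.groupby(bin_id).sum() / bars["volume"].groupby(bin_id).sum()
    return vwap                                    # Series indexed 0..22

# Usage (hypothetical data): anchor features on the 22nd VWAP, execute on the 23rd.
# day_bars = ...                         # 391 minute bars of one stock on one day
# vwaps = intraday_vwaps(day_bars)
# feature_anchor_price = vwaps.iloc[21]  # 22nd VWAP (zero-based index 21)
# execution_price = vwaps.iloc[22]       # 23rd VWAP
```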
12 As such, the time step, the forecast horizon, and the raw returns comprise a duration of 5 days. In analogy, the sequence length is 52 to reflect one year, consisting of 52 weeks.
13 To be able to evaluate the performance of the LSTM over the whole period from January 1998 until October 2015, we use the Thomson Reuters dataset to train the model on the first 750 days prior to January 1998. Out-of-sample prediction (trading) is then performed on the VWAPs calculated from the QuantQuote data. In all following study periods, we train and predict on the QuantQuote data.

After transaction costs, the baseline LSTM strategy M1 results in average returns of 0.20 percent per day on the shorter time frame from 1998 to 2015. Executing on VWAPs instead of at the closing price in case of model M2 leads to a deterioration to 0.16 percent per day – still statistically significant with a t-statistic above 5 and economically meaningful with returns of 43 percent p.a.
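As a back-of-the-envelope illustration of how the 5 bps per half-turn assumption turns gross into net daily returns, consider the sketch below. The assumption of four half-turns per day (full daily turnover of both the long and the short leg) is our own illustrative reading and may differ from the exact turnover accounting of the backtest.

```python
def net_return(gross_return: float,
               half_turns_per_day: int = 4,
               cost_per_half_turn: float = 0.0005) -> float:
    """Deduct transaction costs from a daily gross return.

    With 5 bps (0.0005) per half-turn and full daily turnover of both the
    long and the short leg (4 half-turns), daily costs amount to 20 bps.
    """
    return gross_return - half_turns_per_day * cost_per_half_turn

# Example: a gross return of 0.40 percent per day nets out to 0.20 percent,
# matching the order of magnitude reported for the baseline model M1.
print(net_return(0.0040))  # 0.0020
```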
Fig. 9. Performance of the LSTM model variants from January 1998 to October 2015 for the k = 10 portfolio, i.e., development of cumulative profits on 1 USD average investment per day after transaction costs.
The weekly strategies generate similar returns per day, i.e., 0.11 percent in case of M3 (execution on close), 0.09 percent in case of M4 (execution on VWAP), and 0.09 percent in case of M5 (execution on close, one-day-waiting) – all of them statistically and economically significant with t-statistics above 3 and annualized returns above 20 percent. The rest of the table reads in full analogy to Table 3, and shows a relative outperformance of the LSTM strategy compared to the general market, after robustness checks.

In recent years, however, as depicted in Fig. 9, we see that returns flatten out, so the LSTM edge seems to have been arbitraged away. Despite this fact, model performance is robust in light of market frictions over an extended period of time, i.e., from 1998 up until 2009, with particular upward spikes in returns around the financial crisis.

5. Conclusion

In this paper, we apply long short-term memory networks to a large-scale financial market prediction task on the S&P 500, from December 1992 until October 2015. With our work, we make three key contributions to the literature. The first contribution focuses on the large-scale empirical application of LSTM networks to financial time series prediction tasks. We provide an in-depth guide, closely following the entire data science value chain. Specifically, we frame a proper prediction task, derive sensible features in the form of 240-day return sequences, standardize the features during preprocessing to facilitate model training, discuss a suitable LSTM architecture and training algorithm, and derive a trading strategy based on the predictions, in line with the existing literature. We compare the results of the LSTM against a random forest, a standard deep net, and a simple logistic regression. We find the LSTM, a methodology inherently suitable for this domain, to beat the standard deep net and the logistic regression by a very clear margin. Most of the time – with the exception of the global financial crisis – the random forest is also outperformed. Our findings of statistically and economically significant returns of 0.46 percent per day prior to transaction costs posit a clear challenge to the semi-strong form of market efficiency, and show that deep learning could have been an effective predictive modeling technique in this domain up until 2010. These findings are largely robust in light of market frictions, given that profitability remains statistically and economically significant when executing the daily LSTM strategy on VWAPs instead of closing prices, and when running a weekly variant of the LSTM strategy with a one-day waiting period after signal generation. As of 2010, the markets exhibit an increase in efficiency with respect to the machine learning methods we deploy, with LSTM profitability fluctuating around zero and RAF profitability dipping strictly into the negative domain.

Also, the conceptual and empirical aspects of LSTM networks outlined in this paper go beyond a pure financial market application; they are intended as a guideline for other researchers wishing to deploy this effective methodology to other time series prediction tasks with large amounts of training data.

Second, we disentangle the black box "LSTM" and unveil common patterns of stocks that are selected for profitable trading. We find that the LSTM portfolio consists of stocks with below-mean momentum, strong short-term reversal characteristics, and high volatility and beta. All these findings relate to some extent to existing capital market anomalies. It is impressive to see that some of them are independently extracted by the LSTM from a 240-day sequence of standardized raw returns. It is subject to future research to identify further, more subtle patterns that LSTM neural networks learn from financial market data, and to validate the profit potential of these patterns in more refined, rules-based trading strategies.

Third, based on the common patterns of the LSTM portfolio, we devise a simplified rules-based trading strategy. Specifically, we short short-term winners and buy short-term losers, and hold the position for one day – just like in the LSTM application (a minimal sketch of this rule is given at the end of this section). With this transparent, simplified strategy, we achieve returns of 0.23 percent per day prior to transaction costs – about 50 percent of the LSTM returns. Further regression analysis on common sources of systematic risk unveils a remaining alpha of 0.42 percent of the LSTM prior to transaction costs and generally a lower risk exposure compared to the other models we deploy.

Overall, we have successfully demonstrated that an LSTM network is able to effectively extract meaningful information from noisy financial time series data. Compared to random forests, standard deep nets, and logistic regression, it is the method of choice with respect to prediction accuracy and with respect to daily returns after transaction costs. As it turns out, deep learning – in the form of LSTM networks – hence seems to constitute an advancement in this domain as well.
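As an illustration of the simplified rules-based benchmark referenced above, the following Python sketch ranks stocks by their most recent one-day return, buys the biggest short-term losers, shorts the biggest short-term winners, and holds the equally weighted positions for one day. The one-day lookback, the column layout, and the choice of k = 10 stocks per leg are assumptions for the example; the precise ranking signal is the one described in the empirical part of the paper.

```python
import pandas as pd

def reversal_returns(daily_returns: pd.DataFrame, k: int = 10) -> pd.Series:
    """Daily returns of a simple short-term reversal strategy.

    daily_returns: DataFrame of daily stock returns (index: dates, columns: tickers).
    Each day, rank stocks by the previous day's return, go long the k biggest
    losers and short the k biggest winners, hold for one day with equal weights.
    """
    signal = daily_returns.shift(1)            # previous day's return as ranking signal
    valid = signal.dropna(how="all")
    strategy = []
    for date, row in valid.iterrows():
        ranked = row.dropna().sort_values()
        losers, winners = ranked.index[:k], ranked.index[-k:]
        realized = daily_returns.loc[date]
        long_leg = realized[losers].mean()     # buy short-term losers
        short_leg = -realized[winners].mean()  # short short-term winners
        strategy.append(0.5 * (long_leg + short_leg))
    return pd.Series(strategy, index=valid.index)
```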
Appendix. Detailed industry breakdown

Fig. A1.
Fig. A1. Time-varying share of industries in the "top-10" and "flop-10" portfolios minus the share of these industries in the S&P 500, calculated over the number of stocks. A positive value indicates that the industry is overweighted and vice versa.
References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org, http://tensorflow.org/.
Atsalakis, G. S., & Valavanis, K. P. (2009). Surveying stock market forecasting techniques – Part II: Soft computing methods. Expert Systems with Applications, 36(3), 5932–5941.
Avellaneda, M., & Lee, J.-H. (2010). Statistical arbitrage in the US equities market. Quantitative Finance, 10(7), 761–782.
Baker, M., Bradley, B., & Wurgler, J. (2011). Benchmarks as limits to arbitrage: Understanding the low-volatility anomaly. Financial Analysts Journal, 67(1), 40–54.
Bali, T. G., Cakici, N., & Whitelaw, R. F. (2011). Maxing out: Stocks as lotteries and the cross-section of expected returns. Journal of Financial Economics, 99(2), 427–446.
Bogomolov, T. (2013). Pairs trading based on statistical variability of the spread process. Quantitative Finance, 13(9), 1411–1430. doi:10.1080/14697688.2012.748934.
Boyer, B., Mitton, T., & Vorkink, K. (2010). Expected idiosyncratic skewness. Review of Financial Studies, 23(1), 169–202.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Britz, D. (2015). Recurrent neural network tutorial, part 4 – Implementing a GRU/LSTM RNN with Python and Theano. http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/.
Carhart, M. M. (1997). On persistence in mutual fund performance. The Journal of Finance, 52(1), 57. doi:10.2307/2329556.
Chollet, F. (2016). Keras. https://github.com/fchollet/keras.
Clegg, M., & Krauss, C. (2018). Pairs trading with partial cointegration. Quantitative Finance, 18(1), 121–138. doi:10.1080/14697688.2017.1370122.
Conrad, J., & Kaul, G. (1989). Mean reversion in short-horizon expected returns. Review of Financial Studies, 2(2), 225–240.
Diebold, F. X., & Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business & Economic Statistics, 13(3), 253–263.
Dixon, M., Klabjan, D., & Bang, J. H. (2015). Implementing deep neural networks for financial market prediction on the Intel Xeon Phi. In Proceedings of the eighth workshop on high performance computational finance (pp. 1–6).
Engelberg, J., Reed, A. V., & Ringgenberg, M. (2017). Short selling risk. Journal of Finance. doi:10.2139/ssrn.2312625. (forthcoming)
Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work. The Journal of Finance, 25(2), 383–417. doi:10.2307/2325486.
Fama, E. F., & French, K. R. (1996). Multifactor explanations of asset pricing anomalies. The Journal of Finance, 51(1), 55–84. doi:10.2307/2329302.
Frazzini, A., & Pedersen, L. H. (2014). Betting against beta. Journal of Financial Economics, 111(1), 1–25. doi:10.1016/j.jfineco.2013.10.005.
Gal, Y., & Ghahramani, Z. (2016). A theoretically grounded application of dropout in recurrent neural networks. In Proceedings of the 2016 advances in neural information processing systems (pp. 1019–1027).
Gatev, E., Goetzmann, W. N., & Rouwenhorst, K. G. (2006). Pairs trading: Performance of a relative-value arbitrage rule. Review of Financial Studies, 19(3), 797–827.
Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10), 2451–2471.
Giles, C. L., Lawrence, S., & Tsoi, A. C. (2001). Noisy time series prediction using recurrent neural networks and grammatical inference. Machine Learning, 44(1), 161–183.
Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., & Bengio, Y. (2013). Maxout networks. In Proceedings of the 30th international conference on machine learning (pp. 1319–1327).
Granger, C. W. (1993). Strategies for modelling nonlinear time-series relationships. Economic Record, 69(3), 233–238.
Graves, A. (2013). Generating sequences with recurrent neural networks. CoRR, arXiv preprint arXiv:1308.0850.
Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., & Schmidhuber, J. (2009). A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5), 855–868.
Graves, A., Mohamed, A.-r., & Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In Proceedings of the 2013 IEEE international conference on acoustics, speech and signal processing (pp. 6645–6649). IEEE.
Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5), 602–610.
Green, J., Hand, J. R. M., & Zhang, X. F. (2017). The characteristics that provide independent information about average U.S. monthly stock returns. The Review of Financial Studies, 30(12), 4389–4436. doi:10.1093/rfs/hhx019.
Green, J., Hand, J. R. M., & Zhang, X. F. (2013). The supraview of return predictive signals. Review of Accounting Studies, 18(3), 692–730. doi:10.1007/s11142-013-9231-1.
Gregoriou, G. N. (2012). Handbook of short selling. Amsterdam and Boston, MA: Academic Press.
H2O (2016). H2O documentation. http://h2o.ai/docs, http://h2o-release.s3.amazonaws.com/h2o/rel-tukey/4/docs-website/h2o-docs/index.html.
Ho, T. K. (1995). Random decision forests. In Proceedings of the third international conference on document analysis and recognition: 1 (pp. 278–282). IEEE.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hong, H., & Sraer, D. A. (2016). Speculative betas. The Journal of Finance, 71(5), 2095–2144.
Huck, N. (2009). Pairs selection and outranking: An application to the S&P 100 index. European Journal of Operational Research, 196(2), 819–825.
Huck, N. (2010). Pairs trading and outranking: The multi-step-ahead forecasting case. European Journal of Operational Research, 207(3), 1702–1716.
Jacobs, H. (2015). What explains the dynamics of 100 anomalies? Journal of Banking & Finance, 57, 65–85. doi:10.1016/j.jbankfin.2015.03.006.
Jacobs, H., & Weber, M. (2015). On the determinants of pairs trading profitability. Journal of Financial Markets, 23, 75–97. doi:10.1016/j.finmar.2014.12.001.
Jegadeesh, N. (1990). Evidence of predictable behavior of security returns. The Journal of Finance, 45(3), 881–898. doi:10.2307/2328797.
Jegadeesh, N., & Titman, S. (1993). Returns to buying winners and selling losers: Implications for stock market efficiency. The Journal of Finance, 48(1), 65–91.
Jegadeesh, N., & Titman, S. (1995). Overreaction, delayed reaction, and contrarian profits. Review of Financial Studies, 8(4), 973–993.
Jha, V. (2016). Timing equity quant positions with short-horizon alphas. The Journal of Trading, 11(3), 53–59.
Karpathy, A. (2015). The unreasonable effectiveness of recurrent neural networks. http://karpathy.github.io/2015/05/21/rnn-effectiveness/.
Krauss, C., Do, X. A., & Huck, N. (2017). Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500. European Journal of Operational Research, 259(2), 689–702.
Kumar, A. (2009). Who gambles in the stock market? The Journal of Finance, 64(4), 1889–1933. doi:10.1111/j.1540-6261.2009.01483.x.
LeBaron, B. (1992). Some relations between volatility and serial correlations in stock market returns. The Journal of Business, 65(2), 199–219.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. doi:10.1038/nature14539.
Lee, D. D., Chan, H., Faff, R. W., & Kalev, P. S. (2003). Short-term contrarian investing – Is it profitable? ... Yes and No. Journal of Multinational Financial Management, 13(4), 385–404.
Lehmann, B. N. (1990). Fads, martingales, and market efficiency. The Quarterly Journal of Economics, 105(1), 1. doi:10.2307/2937816.
Lo, A. W., & MacKinlay, A. C. (1990). When are contrarian profits due to stock market overreaction? Review of Financial Studies, 3(2), 175–205.
Maechler, M. (2016). Rmpfr: R MPFR – multiple precision floating-point reliable. R package. https://cran.r-project.org/package=Rmpfr.
Malkiel, B. G. (2007). A random walk down Wall Street: The time-tested strategy for successful investing. WW Norton & Company.
McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the ninth Python in science conference: 445 (pp. 51–56).
Medsker, L. (2000). Recurrent neural networks: Design and applications. International series on computational intelligence. CRC Press.
Moritz, B., & Zimmermann, T. (2014). Deep conditional portfolio sorts: The relation between past and future stock returns. Working paper. LMU Munich and Harvard University.
Olah, C. (2015). Understanding LSTM networks. http://colah.github.io/posts/2015-08-Understanding-LSTMs/.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Peterson, B. G., & Carl, P. (2014). PerformanceAnalytics: Econometric tools for performance and risk analysis. R package. http://CRAN.R-project.org/package=PerformanceAnalytics.
Python Software Foundation (2016). Python 3.5.2 documentation. Available at https://docs.python.org/3.5/.
R Core Team (2016). R: A language and environment for statistical computing. http://www.R-project.org/.
Rad, H., Low, R. K. Y., & Faff, R. (2016). The profitability of pairs trading strategies: Distance, cointegration and copula methods. Quantitative Finance, 16(10), 1541–1558.
Sak, H., Senior, A. W., & Beaufays, F. (2014). Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. CoRR, arXiv preprint arXiv:1402.1128.
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117.
Sermpinis, G., Theofilatos, K., Karathanasopoulos, A., Georgopoulos, E. F., & Dunis, C. (2013). Forecasting foreign exchange rates with adaptive neural networks using radial-basis functions and particle swarm optimization. European Journal of Operational Research, 225(3), 528–540. doi:10.1016/j.ejor.2012.10.020.
Siah, K. W., & Myers, P. (2016). Stock market prediction through technical and public sentiment analysis. http://kienwei.mit.edu/sites/default/files/images/stock-market-prediction.pdf.
Takeuchi, L., & Lee, Y.-Y. (2013). Applying deep learning to enhance momentum trading strategies in stocks. Working paper, Stanford University.
Tieleman, T., & Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2), 26–30.
Van Der Walt, S., Colbert, S. C., & Varoquaux, G. (2011). The NumPy array: A structure for efficient numerical computation. Computing in Science & Engineering, 13(2), 22–30.
Xiong, R., Nichols, E. P., & Shen, Y. (2015). Deep learning stock volatility with Google domestic trends. arXiv e-prints arXiv:1512.04916.