Enhancing Time Series Momentum Strategies Using Deep Neural Networks
B. Lim, S. Zohren and S. Roberts are with the Department of Engineering Science and the Oxford-Man Institute of Quantitative Finance, University of Oxford, Oxford, United Kingdom (email: [email protected], [email protected], [email protected]).
Abstract—While time series momentum [1] is a well-studied phenomenon in finance, common strategies require the explicit definition of both a trend estimator and a position sizing rule. In this paper, we introduce Deep Momentum Networks – a hybrid approach which injects deep learning-based trading rules into the volatility scaling framework of time series momentum. The model simultaneously learns both trend estimation and position sizing in a data-driven manner, with networks directly trained by optimising the Sharpe ratio of the signal. Backtesting on a portfolio of 88 continuous futures contracts, we demonstrate that the Sharpe-optimised LSTM improved traditional methods by more than two times in the absence of transaction costs, and continues to outperform when considering transaction costs of up to 2-3 basis points. To account for more illiquid assets, we also propose a turnover regularisation term which trains the network to factor in costs at run-time.

I. INTRODUCTION

Momentum as a risk premium in finance has been extensively documented in the academic literature, with evidence of persistent abnormal returns demonstrated across a range of asset classes, prediction horizons and time periods [2, 3, 4]. Based on the philosophy that strong price trends have a tendency to persist, time series momentum strategies are typically designed to increase position sizes with large directional moves and reduce positions at other times. Although the intuition underpinning the strategy is clear, specific implementation details can vary widely between signals, with a plethora of methods available to estimate the magnitude of price trends [5, 6, 4] and to map them to actual traded positions [7, 8, 9].

In recent times, deep neural networks have been increasingly used for time series prediction, outperforming traditional benchmarks in applications such as demand forecasting [10], medicine [11] and finance [12]. With the development of modern architectures such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) [13], deep learning models have been favoured for their ability to build representations of a given dataset [14] – capturing temporal dynamics and cross-sectional relationships in a purely data-driven manner. The adoption of deep neural networks has also been facilitated by powerful open-source frameworks such as TensorFlow [15] and PyTorch [16] – which use automatic differentiation to compute gradients for backpropagation without having to explicitly derive them in advance. In turn, this flexibility has allowed deep neural networks to go beyond standard classification and regression models. For instance, hybrid methods that combine traditional time-series models with neural network components have been observed to outperform pure methods in either category [17] – e.g. the exponential smoothing RNN [18], autoregressive CNNs [19] and Kalman filter variants [20, 21] – while also making outputs easier to interpret by practitioners. Furthermore, these frameworks have also enabled the development of new loss functions for training neural networks, such as adversarial loss functions in generative adversarial networks (GANs) [22].

While numerous papers have investigated the use of machine learning for financial time series prediction, they typically focus on casting the underlying prediction problem as a standard regression or classification task [23, 24, 25, 12, 26, 19, 27] – with regression models forecasting expected returns, and classification models predicting the direction of future price movements. This approach, however, could lead to suboptimal performance in the context of time-series momentum for several reasons.
Firstly, sizing positions based on expected returns alone does not take risk characteristics into account – such as the volatility or skew of the predictive returns distribution – which could inadvertently expose signals to large downside moves. This is particularly relevant as raw momentum strategies without adequate risk adjustments, such as volatility scaling [7], are susceptible to large crashes during periods of market panic [28, 29]. Furthermore, even with volatility scaling – which leads to positively skewed returns distributions and long-option-like behaviour [30, 31] – trend following strategies can place more losing trades than winning ones and still be profitable on the whole, as they size up only into large but infrequent directional moves. As such, [32] argue that the fraction of winning trades is a meaningless metric of performance, given that it cannot be evaluated independently from the trading style of the strategy. Similarly, high classification accuracies may not necessarily translate into positive strategy performance, as profitability also depends on the magnitude of returns in each class. This is also echoed in betting strategies such as the Kelly criterion [33], which requires both win/loss probabilities and betting odds for optimal sizing in binomial games. In light of the deficiencies of standard supervised learning techniques, new loss functions and training methods would need to be explored for position sizing – accounting for trade-offs between risk and reward.

In this paper, we introduce a novel class of hybrid models that combines deep learning-based trading signals with the volatility scaling framework used in time series momentum strategies [8, 1] – which we refer to as Deep Momentum Networks (DMNs). This improves existing methods from several angles. Firstly, by using deep neural networks to directly generate trading signals, we remove the need to manually specify both the trend estimator and the position sizing methodology – allowing them to be learnt directly using modern time series prediction architectures. Secondly, by utilising automatic differentiation in existing backpropagation frameworks, we explicitly optimise networks for risk-adjusted performance metrics, i.e. the Sharpe ratio [34], improving the risk profile of the signal on the whole. Lastly, retaining a consistent framework with other momentum strategies also allows us to retain desirable attributes from previous works – specifically volatility scaling, which plays a critical role in the positive performance of time series momentum strategies [9]. This consistency also helps when making comparisons to existing methods, and facilitates the interpretation of different components of the overall signal by practitioners.

II. RELATED WORKS

A. Classical Momentum Strategies

Momentum strategies are traditionally divided into two categories – namely (multivariate) cross sectional momentum [35, 24] and (univariate) time series momentum [1, 8]. Cross sectional momentum strategies focus on the relative performance of securities against each other, buying relative winners and selling relative losers. By ranking a universe of stocks based on their past returns and trading the top decile against the bottom decile, [35] find that securities that recently outperformed their peers over the past 3 to 12 months continue to outperform on average over the next month. The performance of cross sectional momentum has also been shown to be stable across time [36], and across a variety of markets and asset classes [4].

Time series momentum extends the idea to focus on an asset's own past returns, building portfolios comprising all securities under consideration. This was initially proposed by [1], who describe a concrete strategy which uses volatility scaling and trades positions based on the sign of returns over the past year – demonstrating profitability across 58 different liquid instruments individually over 25 years of data. Since then, numerous trading rules have been proposed – with various trend estimation techniques and methods to map them to traded positions. For instance, [6] documents a wide range of linear and non-linear filters to measure trends and a statistic to test for their significance – although methods to size positions with these estimates are not directly discussed. [8] adopt a similar approach to [1], regressing the log price over the past 12 months against time and using the regression coefficient t-statistics to determine the direction of the traded position. While Sharpe ratios were comparable between the two, t-statistic based trend estimation led to a 66% reduction in portfolio turnover and consequently trading costs. More sophisticated trading rules are proposed in [4] and [37], taking volatility-normalised moving average convergence divergence (MACD) indicators
as inputs. Despite the diversity of options, few comparisons have been made between the trading rules themselves, offering little clear evidence or intuitive reasoning to favour one rule over the next. We hence propose the use of deep neural networks to generate these rules directly, avoiding the need for explicit specification. Training them based on risk-adjusted performance metrics, the networks hence learn optimal trading rules directly from the data itself.

B. Deep Learning in Finance

Machine learning has long been used for financial time series prediction, with recent deep learning applications studying mid-price prediction using daily data [26], or using limit order book data in a high frequency trading setting [25, 12, 38]. While a variety of CNN and RNN models have been proposed, they typically frame the forecasting task as a classification problem, demonstrating the improved accuracy of their method in predicting the direction of the next price movement. Trading rules are then manually defined in relation to class probabilities – either by using thresholds on classification probabilities to determine when to initiate positions [26], or by incorporating these thresholds into the classification problem itself by dividing price movements into buy, hold and sell classes depending on magnitude [12, 38]. In addition to restricting the universe of strategies to those which rely on high accuracy, further gains might be made by learning trading rules directly from the data and removing the need for manual specification – both of which are addressed in our proposed method.

Deep learning regression methods have also been considered in cross-sectional strategies [23, 24], ranking assets on the basis of expected returns over the next time period. Using a variety of linear, tree-based and neural network models, [23] demonstrate the outperformance of non-linear methods, with deep neural networks – specifically 3-layer multilayer perceptrons (MLPs) – having the best out-of-sample predictive R². Machine learning portfolios were then built by ranking stocks on a monthly basis using model predictions, with the best strategy coming from a 4-layer MLP that trades the top decile against the bottom decile of predictions. In other works, [24] adopt a similar approach using autoencoder and denoising autoencoder architectures, incorporating volatility scaling into their model as well. While the results with basic deep neural networks are promising, they do not consider more modern architectures for time series prediction, such as the LSTM [39] and WaveNet [40] architectures which we evaluate for the DMN. Moreover, to the best of our knowledge, our paper is the first to consider the use of deep learning within the context of time series momentum strategies – opening up possibilities in an alternate class of signals.

Popularised by the success of DeepMind's AlphaGo Zero [41], deep reinforcement learning (RL) has also gained much attention in recent times – prized for its ability to recommend path-dependent actions in dynamic environments. RL is of particular interest within the context of optimal execution and automated hedging [42, 43], for example, where actions taken can have an impact on future states of the world (e.g. market impact). However, deep RL methods generally require a realistic simulation environment (for Q-learning or policy gradient methods), or a model of the world (for model-based RL), to provide feedback to agents during training – both of which are difficult to obtain in practice.

III. STRATEGY DEFINITION

Adopting the terminology of [8], the combined returns of a time series momentum (TSMOM) strategy can be expressed as below – characterised by a trading rule or signal X_t ∈ [−1, 1]:

r_{t,t+1}^{TSMOM} = \frac{1}{N_t} \sum_{i=1}^{N_t} X_t^{(i)} \, \frac{\sigma_{tgt}}{\sigma_t^{(i)}} \, r_{t,t+1}^{(i)}.   (1)

Here r_{t,t+1}^{TSMOM} is the realised return of the strategy from day t to t+1, N_t is the number of included assets at t, and r_{t,t+1}^{(i)} is the one-day return of asset i. We set the annualised volatility target σ_tgt to be 15% and scale asset returns with an ex-ante volatility estimate σ_t^{(i)} – computed using an exponentially weighted moving standard deviation with a 60-day span on r_{t,t+1}^{(i)}.
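The volatility-scaled strategy return of Equation (1) can be computed directly from daily data. The following is a minimal pandas sketch, not the authors' implementation, assuming `returns` and `positions` are DataFrames indexed by date with one column per asset; the column names and synthetic data are purely illustrative.

```python
# Sketch of the volatility-scaled TSMOM return in Equation (1).
import numpy as np
import pandas as pd

VOL_TARGET = 0.15  # annualised volatility target (15%)


def tsmom_returns(returns: pd.DataFrame, positions: pd.DataFrame) -> pd.Series:
    # Ex-ante volatility: 60-day-span exponentially weighted std of daily
    # returns, annualised with sqrt(252).
    ex_ante_vol = returns.ewm(span=60).std() * np.sqrt(252)

    # Positions decided at t earn the return from t to t+1, so yesterday's
    # position and volatility estimate are applied to today's return.
    scaled = positions.shift(1) * (VOL_TARGET / ex_ante_vol.shift(1)) * returns

    # Equal-weight average across the assets available on each day.
    return scaled.mean(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dates = pd.bdate_range("2010-01-01", periods=500)
    rets = pd.DataFrame(rng.normal(0.0, 0.01, (500, 3)), index=dates,
                        columns=["asset_a", "asset_b", "asset_c"])
    pos = np.sign(rets.rolling(252).sum())  # simple sgn(past-year return) rule
    print(tsmom_returns(rets, pos).dropna().head())
```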
A. Standard Trading Rules

In traditional financial time series momentum strategies, the construction of a trading signal X_t is typically divided into two steps: 1) estimating future trends based on past information, and 2) computing the actual positions to hold. We illustrate this in this section using two examples from the academic literature [1, 4], which we also include as benchmarks in our tests.

Moskowitz et al. 2012 [1]: In their original paper on time series momentum, a simple trading rule is adopted as below:

Trend Estimation:  Y_t^{(i)} = r_{t-252,t}^{(i)}   (2)
Position Sizing:   X_t^{(i)} = \mathrm{sgn}(Y_t^{(i)})   (3)

This broadly uses the past year's returns as a trend estimate for the next time step – taking a maximum long position when the expected trend is positive (i.e. sgn(r_{t-252,t}^{(i)})) and a maximum short position when negative.

Baz et al. 2015 [4]: A more sophisticated trading rule uses volatility-normalised MACD indicators – built from exponentially weighted moving averages of prices over short and long time-scales S_k ∈ {8, 16, 32} and L_k ∈ {24, 48, 96} – as trend estimates Y_t^{(i)}, and maps them to positions through the sizing function φ(y) plotted in Exhibit 1.

Exhibit 1: Position Sizing Function φ(y)

From the plot of φ(y), we can see that positions are increased until |Y_t^{(i)}| = √2 ≈ 1.41, before decreasing back to zero for larger moves. This allows the signal to reduce positions in instances where assets are overbought or oversold – defined to be when |q_t^{(i)}|, the intermediate volatility-normalised MACD signal, is observed to be larger than 1.41 times its past year's standard deviation.
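For reference, the Moskowitz et al. rule of Equations (2)-(3) reduces to a few lines of pandas; the sketch below assumes `prices` is a Series of daily prices for a single contract.

```python
# Sketch of the sgn(returns) benchmark rule in Equations (2)-(3).
import numpy as np
import pandas as pd


def sgn_returns_rule(prices: pd.Series) -> pd.Series:
    trend_estimate = prices.pct_change(252)   # Y_t: return over the past year
    return np.sign(trend_estimate)            # X_t = sgn(Y_t)
```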
B. Machine Learning Extensions

Machine learning models can also be used to produce the trend estimate directly, i.e.:

Trend Estimation:  Y_t^{(i)} = f(u_t^{(i)}; \boldsymbol{\theta}),   (9)

where f(·) is the output of the machine learning model, which takes in a vector of input features u_t^{(i)} and model parameters θ to generate predictions. Taking volatility-normalised returns as targets, the following mean-squared error and binary cross-entropy losses can be used for training:

\mathcal{L}_{reg}(\boldsymbol{\theta}) = \frac{1}{M} \sum_{\Omega} \left( Y_t^{(i)} - \frac{r_{t,t+1}^{(i)}}{\sigma_t^{(i)}} \right)^2   (10)

\mathcal{L}_{binary}(\boldsymbol{\theta}) = -\frac{1}{M} \sum_{\Omega} \left( I \log Y_t^{(i)} + (1 - I) \log\left(1 - Y_t^{(i)}\right) \right),   (11)

where Ω = { (Y_1^{(1)}, r_{1,2}^{(1)}/\sigma_1^{(1)}), \ldots, (Y_{T-1}^{(N)}, r_{T-1,T}^{(N)}/\sigma_{T-1}^{(N)}) } is the set of all M possible prediction and target tuples across all N assets and T time steps. For the binary classification case, I is the indicator function I(r_{t,t+1}^{(i)}/\sigma_t^{(i)} > 0) – making Y_t^{(i)} the estimated probability of a positive return.

This still leaves us to specify how trend estimates map to positions, and we do so using a similar form to Equation (3):

Position Sizing:
Regression:      X_t^{(i)} = \mathrm{sgn}(Y_t^{(i)})   (12)
Classification:  X_t^{(i)} = \mathrm{sgn}(Y_t^{(i)} - 0.5)   (13)

As such, we take a maximum long position when the expected returns are positive in the regression case, or when the probability of a positive return is greater than 0.5 in the classification case.

Direct Outputs: An alternative approach is to use machine learning models to generate positions directly – simultaneously learning both trend estimation and position sizing in the same function, i.e.:

Direct Outputs:  X_t^{(i)} = f(u_t^{(i)}; \boldsymbol{\theta}).   (14)

Given the lack of direct information on the optimal positions to hold at each step – which is required to produce labels for standard regression and classification models – calibration would hence need to be performed by directly optimising performance metrics. Specifically, we focus on optimising the average return and the annualised Sharpe ratio via the loss functions below:

\mathcal{L}_{returns}(\boldsymbol{\theta}) = -\mu_R = -\frac{1}{M} \sum_{\Omega} R(i, t) = -\frac{1}{M} \sum_{\Omega} X_t^{(i)} \, \frac{\sigma_{tgt}}{\sigma_t^{(i)}} \, r_{t,t+1}^{(i)}   (15)

\mathcal{L}_{sharpe}(\boldsymbol{\theta}) = -\frac{\mu_R \times \sqrt{252}}{\sqrt{\left(\sum_{\Omega} R(i,t)^2\right)/M - \mu_R^2}}   (16)

where μ_R is the average return over Ω, and R(i, t) is the return captured by the trading rule for asset i at time t.

IV. DEEP MOMENTUM NETWORKS

In this section, we examine a variety of architectures that can be used in Deep Momentum Networks – all of which can be easily reconfigured to generate the predictions described in Section III-B. This is achieved by implementing the models using the Keras API in TensorFlow [15], where output activation functions can be flexibly interchanged to generate predictions of the different types (e.g. expected returns, binary probabilities, or direct positions). Arbitrary loss functions can also be defined for direct outputs, with gradients for backpropagation being easily computed using the built-in libraries for automatic differentiation.
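As a rough illustration of this setup, and not the authors' implementation, the sketch below defines a custom Keras loss following Equation (16) and compiles it against a small direct-output LSTM (one of the architectures examined below) with a tanh output layer. All layer sizes, the sequence length, dropout rates and optimiser settings are illustrative placeholders, and the tensors named in the commented-out call to `fit` are hypothetical.

```python
# Sketch: Sharpe-ratio loss (Equation (16)) plus a direct-output LSTM in Keras.
import tensorflow as tf
from tensorflow.keras import layers


def sharpe_loss(y_true, y_pred):
    """Negative annualised Sharpe ratio, computed per minibatch.

    y_pred holds the positions X_t produced by the network; y_true holds the
    volatility-scaled returns (sigma_tgt / sigma_t) * r_{t,t+1}, so their
    product is the captured return R(i, t)."""
    captured = y_true * y_pred
    mean_r = tf.reduce_mean(captured)
    var_r = tf.reduce_mean(tf.square(captured)) - tf.square(mean_r)
    return -mean_r * tf.sqrt(252.0) / tf.sqrt(var_r + 1e-9)


def build_direct_output_lstm(seq_len: int, num_features: int,
                             hidden: int = 40, dropout: float = 0.3) -> tf.keras.Model:
    inputs = layers.Input(shape=(seq_len, num_features))
    # dropout / recurrent_dropout loosely approximate the recurrent dropout
    # masks of [46] used in the paper's training procedure.
    states = layers.LSTM(hidden, return_sequences=True,
                         dropout=dropout, recurrent_dropout=dropout)(inputs)
    # tanh output -> direct positions X_t in [-1, 1] at every time step; a
    # linear or sigmoid activation instead would recover the regression or
    # classification variants.
    positions = layers.Dense(1, activation="tanh")(states)
    return tf.keras.Model(inputs, positions)


model = build_direct_output_lstm(seq_len=63, num_features=8)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss=sharpe_loss)
# Hypothetical training call, with `features` and `vol_scaled_returns` built elsewhere:
# model.fit(features, vol_scaled_returns, batch_size=512, epochs=100,
#           validation_split=0.1,
#           callbacks=[tf.keras.callbacks.EarlyStopping(patience=25,
#                                                       restore_best_weights=True)])
```

Because Keras losses receive `(y_true, y_pred)` pairs, any differentiable performance metric of the captured returns can be substituted here without changing the rest of the pipeline.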
A. Network Architectures

Lasso Regression: In the simplest case, a standard linear model could be used to generate predictions as below:

Z_t^{(i)} = g\left( \mathbf{w}^T u_{t-\tau:t}^{(i)} + b \right),   (17)

where Z_t^{(i)} ∈ {X_t^{(i)}, Y_t^{(i)}} depending on the prediction task, w is a weight vector for the linear model, and b is a bias term. Here g(·) is an activation function which depends on the specific prediction type – linear for standard regression, sigmoid for binary classification, and the tanh function for direct outputs.

Additional regularisation is also provided during training by augmenting the various loss functions to include an additional L1 regulariser as below:

\tilde{\mathcal{L}}(\boldsymbol{\theta}) = \mathcal{L}(\boldsymbol{\theta}) + \alpha \|\mathbf{w}\|_1,   (18)

where L(θ) corresponds to one of the loss functions described in Section III-B, ||w||_1 is the L1 norm of w, and α is a constant term which we treat as an additional hyperparameter. To incorporate recent history into predictions as well, we concatenate inputs over the past τ days into a single input vector – i.e. u_{t-\tau:t}^{(i)} = [u_{t-\tau}^{(i)T}, \ldots, u_t^{(i)T}]^T. This was fixed to be τ = 5 days for the tests in Section V.

Multilayer Perceptron (MLP): Increasing the degree of model complexity slightly, a 2-layer neural network can be used to incorporate non-linear effects:

h_t^{(i)} = \tanh\left( \mathbf{W}_h u_{t-\tau:t}^{(i)} + \mathbf{b}_h \right)   (19)
Z_t^{(i)} = g\left( \mathbf{W}_z h_t^{(i)} + \mathbf{b}_z \right),   (20)

where h_t^{(i)} is the hidden state of the MLP using an internal tanh activation function, tanh(·), and W. and b. are layer weight matrices and biases respectively.

WaveNet: More modern techniques such as convolutional neural networks (CNNs) have also been used in the domain of time series prediction – particularly in the form of autoregressive architectures, e.g. [19]. These typically take the form of 1D causal convolutions, sliding convolutional filters across time to extract useful representations which are then aggregated in higher layers of the network. To increase the size of the receptive field – or the length of history fed into the CNN – dilated CNNs such as WaveNet [40] have been proposed, which skip over inputs at intermediate levels with a predetermined dilation rate. This allows the network to effectively increase the amount of historical information it uses without a large increase in computational cost.

Let us consider a dilated convolutional layer with residual connections, which takes the form below:

\psi(\mathbf{u}) = \underbrace{\tanh(\mathbf{W}\mathbf{u}) \odot \sigma(\mathbf{V}\mathbf{u})}_{\text{Gated Activation}} + \underbrace{\mathbf{A}\mathbf{u} + \mathbf{b}}_{\text{Skip Connection}}.   (21)

Here W and V are weight matrices associated with the gated activation function, and A and b are the weights and biases used to transform u to match the dimensionality of the layer outputs for the skip connection. The equations for the WaveNet architecture used in our investigations can then be expressed as:

s_{\text{weekly}}^{(i)}(t) = \psi\left( u_{t-5:t}^{(i)} \right)   (22)

s_{\text{monthly}}^{(i)}(t) = \psi\left( \left[ s_{\text{weekly}}^{(i)}(t),\; s_{\text{weekly}}^{(i)}(t-5),\; s_{\text{weekly}}^{(i)}(t-10),\; s_{\text{weekly}}^{(i)}(t-15) \right] \right)   (23)

s_{\text{quarterly}}^{(i)}(t) = \psi\left( \left[ s_{\text{monthly}}^{(i)}(t),\; s_{\text{monthly}}^{(i)}(t-21),\; s_{\text{monthly}}^{(i)}(t-42) \right] \right).   (24)

Here each intermediate layer s_{\cdot}^{(i)}(t) aggregates representations at weekly, monthly and quarterly frequencies respectively. Intermediate layers are then concatenated at each layer before passing through a 2-layer MLP to generate outputs, i.e.:

s_t^{(i)} = \left[ s_{\text{weekly}}^{(i)}(t),\; s_{\text{monthly}}^{(i)}(t),\; s_{\text{quarterly}}^{(i)}(t) \right]   (25)

h_t^{(i)} = \tanh\left( \mathbf{W}_h s_t^{(i)} + \mathbf{b}_h \right)   (26)

Z_t^{(i)} = g\left( \mathbf{W}_z h_t^{(i)} + \mathbf{b}_z \right).   (27)

State sizes for the intermediate layers s_{\text{weekly}}^{(i)}(t), s_{\text{monthly}}^{(i)}(t), s_{\text{quarterly}}^{(i)}(t) and the MLP hidden state h_t^{(i)} are fixed to be the same, allowing us to use a single hyperparameter to define the architecture. To independently evaluate the performance of CNN and RNN architectures, the above also excludes the LSTM block (i.e. the context stack) described in [40], focusing purely on the merits of the dilated CNN model.
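The gated activation block ψ(u) of Equation (21) maps naturally onto causal 1D convolutions. The sketch below is one possible Keras rendering under that assumption; filter counts, kernel sizes and dilation rates are illustrative placeholders rather than the paper's settings.

```python
# Sketch of the gated activation + skip connection block of Equation (21)
# as a causal, dilated 1D convolution in the Keras functional API.
import tensorflow as tf
from tensorflow.keras import layers


def gated_conv_block(x, filters: int, kernel_size: int, dilation_rate: int):
    # Gated activation: tanh(W * u) elementwise-multiplied with sigmoid(V * u).
    filt = layers.Conv1D(filters, kernel_size, padding="causal",
                         dilation_rate=dilation_rate, activation="tanh")(x)
    gate = layers.Conv1D(filters, kernel_size, padding="causal",
                         dilation_rate=dilation_rate, activation="sigmoid")(x)
    gated = layers.Multiply()([filt, gate])

    # Skip connection: a 1x1 convolution (A u + b) matching the output width.
    skip = layers.Conv1D(filters, 1)(x)
    return layers.Add()([gated, skip])
```

Stacking such blocks with increasing dilation rates, and concatenating their outputs as in Equations (22)-(27), yields the dilated-CNN variant evaluated here.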
Long Short-term Memory (LSTM): Traditionally used in sequence prediction for natural language processing, recurrent neural networks – specifically long short-term memory (LSTM) architectures [39] – have been increasingly used in time series prediction tasks. The equations for the LSTM in our model are provided below:

f_t^{(i)} = \sigma\left( \mathbf{W}_f u_t^{(i)} + \mathbf{V}_f h_{t-1}^{(i)} + \mathbf{b}_f \right)   (28)
i_t^{(i)} = \sigma\left( \mathbf{W}_i u_t^{(i)} + \mathbf{V}_i h_{t-1}^{(i)} + \mathbf{b}_i \right)   (29)
o_t^{(i)} = \sigma\left( \mathbf{W}_o u_t^{(i)} + \mathbf{V}_o h_{t-1}^{(i)} + \mathbf{b}_o \right)   (30)
c_t^{(i)} = f_t^{(i)} \odot c_{t-1}^{(i)} + i_t^{(i)} \odot \tanh\left( \mathbf{W}_c u_t^{(i)} + \mathbf{V}_c h_{t-1}^{(i)} + \mathbf{b}_c \right)   (31)
h_t^{(i)} = o_t^{(i)} \odot \tanh\left( c_t^{(i)} \right)   (32)
Z_t^{(i)} = g\left( \mathbf{W}_z h_t^{(i)} + \mathbf{b}_z \right),   (33)

where ⊙ is the Hadamard (element-wise) product, σ(·) is the sigmoid activation function, W. and V. are weight matrices for the different layers, f_t^{(i)}, i_t^{(i)}, o_t^{(i)} correspond to the forget, input and output gates respectively, c_t^{(i)} is the cell state, and h_t^{(i)} is the hidden state of the LSTM. From these equations, we can see that the LSTM uses the cell state as a compact summary of past information, controlling memory retention with the forget gate and incorporating new information via the input gate. As such, the LSTM is able to learn representations of long-term relationships relevant to the prediction task – sequentially updating its internal memory states with new observations at each step.

B. Training Details

Model calibration was undertaken using minibatch stochastic gradient descent with the Adam optimiser [44], based on the loss functions defined in Section III-B. Backpropagation was performed for up to a maximum of 100 training epochs, using 90% of a given block of training data, with the most recent 10% retained as a validation dataset. Validation data is then used to determine convergence – with early stopping triggered when the validation loss has not improved for 25 epochs – and to identify the optimal model across hyperparameter settings. Hyperparameter optimisation was conducted using 50 iterations of random search, with full details provided in Appendix B. For additional information on deep neural network calibration, please refer to [13].

Dropout regularisation [45] was a key feature to avoid overfitting in the neural network models – with dropout rates included as hyperparameters during training. This was applied to the inputs and hidden state for the MLP, as well as the inputs, Equation (22), and outputs, Equation (26), of the convolutional layers in the WaveNet architecture. For the LSTM, we adopted the same dropout masks as in [46] – applying dropout to the RNN inputs, recurrent states and outputs.

V. PERFORMANCE EVALUATION

A. Overview of Dataset

The predictive performance of the different architectures was evaluated via a backtest using 88 ratio-adjusted continuous futures contracts downloaded from the Pinnacle Data Corp CLC Database [47]. These contracts spanned a variety of asset classes – including commodities, fixed income and currency futures – and contained prices from 1990 to 2015. A full breakdown of the dataset can be found in Appendix A.

B. Backtest Description

Throughout our backtest, the models were recalibrated from scratch every 5 years – re-running the entire hyperparameter optimisation procedure using all data available up to the recalibration point. Model weights were then fixed for signals generated over the next 5-year period, ensuring that tests were performed out-of-sample.

For the Deep Momentum Networks, we incorporate a series of useful features adopted by standard time series momentum strategies in Section III-A to generate predictions at each step:

1) Normalised Returns – Returns over the past day, 1-month, 3-month, 6-month and 1-year periods are used, normalised by a measure of daily volatility scaled to an appropriate time scale. For instance, normalised annual returns were taken to be r_{t-252,t}^{(i)} / (\sigma_t^{(i)} \sqrt{252}).
2) MACD Indicators – We also include the MACD indicators – i.e. the trend estimates Y_t^{(i)} of the Baz et al. [4] rule in Section III-A – using the same short time-scales S_k ∈ {8, 16, 32} and long time-scales L_k ∈ {24, 48, 96}, as illustrated in the sketch below.
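The following pandas sketch outlines both feature groups for a single contract, assuming `prices` is a Series of daily settlement prices. The 63-day MACD normalisation window and the use of EWM spans for S_k / L_k are simplifying assumptions for illustration, not the exact construction of [4].

```python
# Sketch of the Deep Momentum Network input features listed above.
import numpy as np
import pandas as pd


def build_features(prices: pd.Series) -> pd.DataFrame:
    returns = prices.pct_change()
    daily_vol = returns.ewm(span=60).std()
    feats = {}

    # 1) Normalised returns over several horizons, scaled by sqrt(horizon).
    for label, horizon in [("1d", 1), ("1m", 21), ("3m", 63), ("6m", 126), ("1y", 252)]:
        feats[f"norm_return_{label}"] = (
            prices.pct_change(horizon) / (daily_vol * np.sqrt(horizon)))

    # 2) MACD-style indicators for each (S_k, L_k) pair, normalised by a
    #    rolling standard deviation of prices (63 days is illustrative).
    for short_span, long_span in [(8, 24), (16, 48), (32, 96)]:
        macd = prices.ewm(span=short_span).mean() - prices.ewm(span=long_span).mean()
        feats[f"macd_{short_span}_{long_span}"] = macd / prices.rolling(63).std()

    return pd.DataFrame(feats)
```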
For comparisons against traditional time series momentum strategies, we also incorporate the following reference benchmarks:

1) Long Only with Volatility Scaling (X_t^{(i)} = 1)
2) Sgn(Returns) – Moskowitz et al. 2012 [1]
3) MACD Signal – Baz et al. 2015 [4]

Finally, performance was judged based on the following metrics:

1) Profitability – Expected returns (E[Returns]) and the percentage of positive returns observed across the test period.
2) Risk – Daily volatility (Vol.), downside deviation and the maximum drawdown (MDD) of the overall portfolio.
3) Performance Ratios – Risk-adjusted performance was measured by the Sharpe ratio E[Returns]/Vol., the Sortino ratio E[Returns]/(Downside Deviation) and the Calmar ratio E[Returns]/MDD, as well as the average profit over the average loss, Ave. P / Ave. L.

C. Results and Discussion

Aggregating the out-of-sample predictions from 1995 to 2015, we compute performance metrics both for the strategy returns based on Equation (1) (Exhibit 2), and for portfolios with an additional layer of volatility scaling – which brings overall strategy returns to match the 15% volatility target (Exhibit 3). Given the large differences in returns volatility seen in Exhibit 2, this rescaling also helps to facilitate comparisons between the cumulative returns of different strategies – which are plotted for the various loss functions in Exhibit 4. We note that strategy returns in this section are computed in the absence of transaction costs, allowing us to focus on the raw predictive ability of the models themselves. The impact of transaction costs is explored further in Section VI, where we undertake a deeper analysis of signal turnover. More detailed results can also be found in Appendix C, which echo the findings below.

Focusing on the raw signal outputs, the Sharpe ratio-optimised LSTM outperforms all benchmarks as expected, improving on the best neural network model (the Sharpe-optimised MLP) by 44% and on the best reference benchmark (Sgn(Returns)) by more than two times. In conjunction with Sharpe ratio improvements to both the linear and MLP models, this highlights the benefits of using models which capture non-linear relationships, and which have access to more time history via an internal memory state.

Additional model complexity, however, does not necessarily lead to better predictive performance, as demonstrated by the underperformance of WaveNet compared to both the reference benchmarks and simple linear models. Part of this can be attributed to the difficulties in tuning models with multiple design parameters – for instance, better results could possibly be achieved by using alternative dilation rates, numbers of convolutional layers, and hidden state sizes in Equations (22) to (24) for the WaveNet. In contrast, only a single design parameter is sufficient to specify the hidden state size in both the MLP and LSTM models. Analysing the relative performance within each model class, we can see that models which directly generate positions perform the best – demonstrating the benefits of simultaneously learning both the trend estimation and position sizing functions. In addition, with the exception of a slight decrease in the MLP, Sharpe-optimised models outperform returns-optimised ones, with standard regression and classification benchmarks taking third and fourth place respectively.

From Exhibit 3, while the addition of volatility scaling at the portfolio level improved performance ratios on the whole, it had a larger beneficial effect on the machine learning models compared to the reference benchmarks – propelling Sharpe-optimised MLPs to outperform returns-optimised ones, and even leading to Sharpe-optimised linear models beating the reference benchmarks. From a risk perspective, we can see that both volatility and downside deviation also become a lot more comparable, with the former hovering close to 15.5% and the latter around 10%. However, Sharpe-optimised LSTMs still retained the lowest MDD across all models, with superior risk-adjusted performance ratios across the board. Referring to the cumulative returns plots for the rescaled portfolios in Exhibit 4, the benefits of direct outputs with Sharpe ratio optimisation can also be observed – with larger cumulative returns observed for the linear, MLP and LSTM models compared to the reference benchmarks. Furthermore, we note the general underperformance of models which use standard regression and classification methods for trend estimation – hinting at the difficulties faced in selecting an appropriate position sizing function, and in optimising models to generate positions without accounting for risk.
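For reference, the metrics reported in Exhibits 2 and 3 can be computed from a daily strategy return series as in the sketch below; the annualisation conventions (252 trading days, compounded drawdowns) are assumptions of this sketch rather than details stated in the paper.

```python
# Sketch of the performance metrics used in Exhibits 2 and 3.
import numpy as np
import pandas as pd


def performance_metrics(daily_returns: pd.Series) -> dict:
    ann_return = daily_returns.mean() * 252
    ann_vol = daily_returns.std() * np.sqrt(252)
    downside = daily_returns[daily_returns < 0].std() * np.sqrt(252)
    cumulative = (1 + daily_returns).cumprod()
    max_drawdown = (1 - cumulative / cumulative.cummax()).max()
    return {
        "E[Return]": ann_return,
        "Vol.": ann_vol,
        "Downside Deviation": downside,
        "MDD": max_drawdown,
        "Sharpe": ann_return / ann_vol,
        "Sortino": ann_return / downside,
        "Calmar": ann_return / max_drawdown,
        "% of +ve Returns": (daily_returns > 0).mean(),
        "Ave. P / Ave. L": daily_returns[daily_returns > 0].mean()
                           / abs(daily_returns[daily_returns < 0].mean()),
    }
```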
Exhibit 2: Performance Metrics – Raw Signal Outputs
(Columns: E[Return], Vol., Downside Deviation, MDD, Sharpe, Sortino, Calmar, % of +ve Returns, Ave. P / Ave. L)
Reference
Long Only 0.039 0.052 0.035 0.167 0.738 1.086 0.230 53.8% 0.970
Sgn(Returns) 0.054 0.046 0.032 0.083 1.192 1.708 0.653 54.8% 1.011
MACD 0.030 0.031 0.022 0.081 0.976 1.356 0.371 53.9% 1.015
Linear
Sharpe 0.041 0.038 0.028 0.119 1.094 1.462 0.348 54.9% 0.997
Ave. Returns 0.047 0.045 0.031 0.164 1.048 1.500 0.287 53.9% 1.022
MSE 0.049 0.047 0.032 0.164 1.038 1.522 0.298 54.3% 1.000
Binary 0.013 0.044 0.030 0.167 0.295 0.433 0.078 50.6% 1.028
MLP
Sharpe 0.044 0.031 0.025 0.154 1.383 1.731 0.283 56.0% 1.024
Ave. Returns 0.064* 0.043 0.030 0.161 1.492 2.123 0.399 55.6% 1.031
MSE 0.039 0.046 0.032 0.166 0.844 1.224 0.232 52.7% 1.035
Binary 0.003 0.042 0.028 0.233 0.080 0.120 0.014 50.8% 0.981
WaveNet
Sharpe 0.030 0.035 0.026 0.101 0.854 1.167 0.299 53.5% 1.008
Ave. Returns 0.032 0.040 0.028 0.113 0.788 1.145 0.281 53.8% 0.980
MSE 0.022 0.042 0.028 0.134 0.536 0.786 0.166 52.4% 0.994
Binary 0.000 0.043 0.029 0.313 0.011 0.016 0.001 50.2% 0.995
LSTM
Sharpe 0.045 0.016* 0.011* 0.021* 2.804* 3.993* 2.177* 59.6%* 1.102*
Ave. Returns 0.054 0.046 0.033 0.164 1.165 1.645 0.326 54.8% 1.003
MSE 0.031 0.046 0.032 0.163 0.669 0.959 0.189 52.8% 1.003
Binary 0.012 0.039 0.026 0.255 0.300 0.454 0.046 51.0% 1.012
Exhibit 3: Performance Metrics – Rescaled to Target Volatility
(Columns as in Exhibit 2)

Reference
Long Only 0.117 0.154 0.102 0.431 0.759 1.141 0.271 53.8% 0.973
Sgn(Returns) 0.215 0.154 0.102 0.264 1.392 2.108 0.815 54.8% 1.041
MACD 0.172 0.155 0.106 0.317 1.111 1.622 0.543 53.9% 1.031
Linear
Sharpe 0.232 0.155 0.103 0.303 1.496 2.254 0.765 54.9% 1.056
Ave. Returns 0.189 0.154 0.100 0.372 1.225 1.893 0.507 53.9% 1.047
MSE 0.186 0.154 0.099* 0.365 1.211 1.889 0.509 54.3% 1.025
Binary 0.051 0.155 0.103 0.558 0.332 0.496 0.092 50.6% 1.033
MLP
Sharpe 0.312 0.154 0.102 0.335 2.017 3.042 0.930 56.0% 1.104
Ave. Returns 0.266 0.154 0.099* 0.354 1.731 2.674 0.752 55.6% 1.065
MSE 0.156 0.154 0.099* 0.371 1.017 1.582 0.422 52.7% 1.062
Binary 0.017 0.154 0.102 0.661 0.108 0.162 0.025 50.8% 0.986
WaveNet
Sharpe 0.148 0.155 0.103 0.349 0.956 1.429 0.424 53.5% 1.018
Ave. Returns 0.136 0.154 0.101 0.356 0.881 1.346 0.381 53.8% 0.993
MSE 0.084 0.153* 0.101 0.459 0.550 0.837 0.184 52.4% 0.995
Binary 0.007 0.155 0.103 0.779 0.045 0.068 0.009 50.2% 1.001
LSTM
Sharpe 0.451* 0.155 0.105 0.209* 2.907* 4.290* 2.159* 59.6%* 1.113*
Ave. Returns 0.208 0.154 0.102 0.365 1.349 2.045 0.568 54.8% 1.028
MSE 0.121 0.154 0.100 0.362 0.791 1.211 0.335 52.8% 1.020
Binary 0.075 0.155 0.099* 0.682 0.486 0.762 0.110 51.0% 1.043
Exhibit 4: Cumulative Returns - Rescaled to Target Volatility
This is particularly relevant for binary classification methods, which produce relatively flat equity lines and underperform the reference benchmarks in general. Some of these poor results can be explained by the implicit decision threshold adopted. From the percentage of positive returns captured in Exhibit 3, most binary classification models have about a 50% accuracy which, while expected of a classifier with a 0.5 probability threshold, is far below the accuracies seen in the other benchmarks. Furthermore, performance is made worse by the fact that the model's magnitude of gains versus losses (Ave. P / Ave. L) is much smaller than for competing methods – with average loss magnitudes even outweighing profits for the MLP classifier (Ave. P / Ave. L = 0.986). As such, these observations lend support to the direct generation of position sizes with machine learning methods, given the multiple considerations (e.g. decision thresholds and profit/loss magnitudes) that would be required to incorporate standard supervised learning methods into a profitable trading strategy.

Strategy performance could also be aided by diversification across a range of assets, particularly when the correlation between signals is low. Hence, to evaluate the raw quality of the underlying signal, we investigate the performance constituents of the time series momentum portfolios – using box plots for a variety of performance metrics, plotting the minimum, lower quartile, median, upper quartile, and maximum values across individual futures contracts. We present in Exhibit 5 plots of one metric per category in Section V-B; similar results for the other performance ratios are documented in Appendix C. In general, the Sharpe ratio plots in Exhibit 5a echo previous findings, with direct output methods performing better than indirect trend estimation models. However, as seen in Exhibit 5c, this is mainly attributable to a significant reduction in signal volatility for the Sharpe-optimised methods, despite a comparable range of average returns in Exhibit 5b. The benefits of retaining the volatility scaling can also be observed, with individual signal volatility capped near the target across all methods – even with a naive sgn(.) position sizer. As such, the combination of volatility scaling, direct outputs and Sharpe ratio optimisation were all key to the performance gains of Deep Momentum Networks.
Exhibit 5: Performance Across Individual Assets
(a) Sharpe Ratio, (b) Average Returns, (c) Volatility
VI. TURNOVER ANALYSIS

To investigate how transaction costs affect strategy performance, we first analyse the daily position changes of the signal – characterised for asset i by the daily turnover ζ_t^{(i)} as defined in [8]:

\zeta_t^{(i)} = \sigma_{tgt} \left| \frac{X_t^{(i)}}{\sigma_t^{(i)}} - \frac{X_{t-1}^{(i)}}{\sigma_{t-1}^{(i)}} \right|,   (34)

which is broadly proportional to the volume of asset i traded on day t with reference to the updated portfolio weights.

Exhibit 6a shows the average strategy turnover across all assets from 1995 to 2015, focusing on positions generated by the raw signal outputs. As the box plots are charted on a logarithmic scale, we note that while the machine learning-based models have similar turnover to one another, they also trade significantly more than the reference benchmarks – approximately 10 times more compared to the Long Only benchmark. This is also reflected in Exhibit 6b, which compares the average daily returns against the average daily turnover – with the ratios for machine learning models lying close to the x-axis.

To concretely quantify the impact of transaction costs on performance, we also compute the ex-cost Sharpe ratios – using the rebalancing costs defined in [8] to adjust our returns for a variety of transaction cost assumptions. For the results in Exhibit 7, the top of each bar chart marks the maximum cost-free Sharpe ratio of the strategy, with each coloured block denoting the Sharpe ratio reduction for the corresponding cost assumption. In line with the turnover analysis, the reference benchmarks demonstrate the most resilience to high transaction costs (up to 5bps), with the profitability of most machine learning models persisting only up to 4bps. However, we still obtain higher cost-adjusted Sharpe ratios with the Sharpe-optimised LSTM for costs of up to 2-3 bps, demonstrating its suitability for trading more liquid instruments.

A. Turnover Regularisation

One simple way to account for transaction costs is to use cost-adjusted returns \tilde{r}_{t,t+1}^{TSMOM} directly during training, augmenting the strategy returns defined in Equation (1) as below:

\tilde{r}_{t,t+1}^{TSMOM} = \frac{\sigma_{tgt}}{N_t} \sum_{i=1}^{N_t} \left( \frac{X_t^{(i)}}{\sigma_t^{(i)}} \, r_{t,t+1}^{(i)} - c \left| \frac{X_t^{(i)}}{\sigma_t^{(i)}} - \frac{X_{t-1}^{(i)}}{\sigma_{t-1}^{(i)}} \right| \right),   (35)

where c is a constant reflecting transaction cost assumptions. As such, using \tilde{r}_{t,t+1}^{TSMOM} in Sharpe ratio loss functions during training corresponds to optimising the ex-cost risk-adjusted returns, and c \left| X_t^{(i)}/\sigma_t^{(i)} - X_{t-1}^{(i)}/\sigma_{t-1}^{(i)} \right| can also be interpreted as a regularisation term for turnover.

Given that the Sharpe-optimised LSTM is still profitable in the presence of small transaction costs, we seek to quantify the effectiveness of turnover regularisation when costs are prohibitively high – considering the extreme case where c = 10bps in our investigation. Tests were focused on the Sharpe-optimised LSTM with and without the turnover regulariser (LSTM + Reg. for the former) – including the additional portfolio-level volatility scaling to bring signal volatilities to the same level. Based on the results in Exhibit 8, we can see that turnover regularisation does help improve the LSTM in the presence of large costs, leading to slightly better performance ratios when compared to the reference benchmarks.
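Operationally, the turnover regulariser amounts to a one-line change to the Sharpe loss: the captured returns are replaced by the cost-adjusted returns of Equation (35). The TensorFlow sketch below assumes the positions, volatility-scaled returns and scaling factors are available as (batch, time, 1) tensors; the tensor names are illustrative, not the authors' code.

```python
# Sketch of the cost-adjusted returns in Equation (35) and the corresponding
# turnover-regularised Sharpe loss.
import tensorflow as tf


def cost_adjusted_returns(positions, scaled_returns, vol_scale, cost_bps=10.0):
    """positions: X_t; scaled_returns: (sigma_tgt / sigma_t) * r_{t,t+1};
    vol_scale: sigma_tgt / sigma_t, used to form the turnover penalty."""
    c = cost_bps * 1e-4
    captured = positions * scaled_returns            # ex-cost R(i, t)
    scaled_pos = positions * vol_scale               # sigma_tgt * X_t / sigma_t
    turnover = tf.abs(scaled_pos[:, 1:, :] - scaled_pos[:, :-1, :])
    return captured[:, 1:, :] - c * turnover


def cost_adjusted_sharpe_loss(positions, scaled_returns, vol_scale, cost_bps=10.0):
    r = cost_adjusted_returns(positions, scaled_returns, vol_scale, cost_bps)
    mu = tf.reduce_mean(r)
    var = tf.reduce_mean(tf.square(r)) - tf.square(mu)
    return -mu * tf.sqrt(252.0) / tf.sqrt(var + 1e-9)
```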
Exhibit 6: Turnover Analysis

Exhibit 8: Performance Metrics with Transaction Costs of c = 10bps – Rescaled to Target Volatility
(Columns as in Exhibit 2)

Long Only 0.097 0.154* 0.103 0.482 0.628 0.942 0.201 53.3% 0.970
Sgn(Returns) 0.133 0.154* 0.102* 0.373 0.861 1.296 0.356 53.3% 1.011
MACD 0.111 0.155 0.106 0.472 0.719 1.047 0.236 52.5% 1.020*
LSTM -0.833 0.157 0.114 1.000 -5.313 -7.310 -0.833 33.9% 0.793
LSTM + Reg. 0.141* 0.154* 0.102* 0.371* 0.912* 1.379* 0.379* 53.4%* 1.014
VII. CONCLUSIONS

We introduce Deep Momentum Networks – a hybrid class of deep learning models which retain the volatility scaling framework of time series momentum strategies while using deep neural networks to output position-targeting trading signals. Two approaches to position generation were evaluated here. Firstly, we cast trend estimation as a standard supervised learning problem – using machine learning models to forecast the expected asset returns or the probability of a positive return at the next time step – and apply a simple maximum long/short trading rule based on the direction of the next return. Secondly, trading rules were directly generated as outputs from the model, which we calibrate by maximising the Sharpe ratio or the average strategy return. Testing this on a universe of continuous futures contracts, we demonstrate clear improvements in risk-adjusted performance by calibrating models with the Sharpe ratio – where the LSTM model achieved the best results. Incorporating transaction costs, the Sharpe-optimised LSTM outperforms benchmarks up to 2-3 basis points of costs, demonstrating its suitability for trading more liquid assets. To accommodate high-cost settings, we introduce a turnover regulariser to use during training, which was shown to be effective even in extreme scenarios (i.e. c = 10bps).

Future work includes extensions of the framework presented here to incorporate ways to deal better with non-stationarity in the data, such as using the recently introduced Recurrent Neural Filters [48]. Another direction of future work focuses on the study of time series momentum at the microstructure level.

VIII. ACKNOWLEDGEMENTS

We would like to thank Anthony Ledford, James Powrie and Thomas Flury for their interesting comments, as well as the Oxford-Man Institute of Quantitative Finance for financial support.

REFERENCES

[1] T. J. Moskowitz, Y. H. Ooi, and L. H. Pedersen, "Time series momentum," Journal of Financial Economics, vol. 104, no. 2, pp. 228–250, 2012, Special Issue on Investor Sentiment.
[2] B. Hurst, Y. H. Ooi, and L. H. Pedersen, "A century of evidence on trend-following investing," The Journal of Portfolio Management, vol. 44, no. 1, pp. 15–29, 2017.
[3] Y. Lempérière, C. Deremble, P. Seager, M. Potters, and J.-P. Bouchaud, "Two centuries of trend following," Journal of Investment Strategies, vol. 3, no. 3, pp. 41–61, 2014.
[4] J. Baz, N. Granger, C. R. Harvey, N. Le Roux, and S. Rattray, "Dissecting investment strategies in the cross section and time series," SSRN, 2015. [Online]. Available: https://ssrn.com/abstract=2695101
[5] A. Levine and L. H. Pedersen, "Which trend is your friend," Financial Analysts Journal, vol. 72, no. 3, 2016.
[6] B. Bruder, T.-L. Dao, J.-C. Richard, and T. Roncalli, "Trend filtering methods for momentum strategies," SSRN, 2013. [Online]. Available: https://ssrn.com/abstract=2289097
[7] A. Y. Kim, Y. Tse, and J. K. Wald, "Time series momentum and volatility scaling," Journal of Financial Markets, vol. 30, pp. 103–124, 2016.
[8] N. Baltas and R. Kosowski, "Demystifying time-series momentum strategies: Volatility estimators, trading rules and pairwise correlations," SSRN, 2017. [Online]. Available: https://ssrn.com/abstract=2140091
[9] C. R. Harvey, E. Hoyle, R. Korgaonkar, S. Rattray, M. Sargaison, and O. van Hemert, "The impact of volatility targeting," SSRN, 2018. [Online]. Available: https://ssrn.com/abstract=3175538
[10] N. Laptev, J. Yosinski, L. E. Li, and S. Smyl, "Time-series extreme event forecasting with neural networks at Uber," in Time Series Workshop – International Conference on Machine Learning (ICML), 2017.
[11] B. Lim and M. van der Schaar, "Disease-Atlas: Navigating disease trajectories using deep learning," in Proceedings of the 3rd Machine Learning for Healthcare Conference (MLHC), ser. Proceedings of Machine Learning Research, vol. 85, 2018, pp. 137–160.
[12] Z. Zhang, S. Zohren, and S. Roberts, "DeepLOB: Deep convolutional neural networks for limit order books," IEEE Transactions on Signal Processing, 2019.
[13] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[14] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[15] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
[16] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," in Autodiff Workshop – Conference on Neural Information Processing (NIPS), 2017.
[17] S. Makridakis, E. Spiliotis, and V. Assimakopoulos, "The M4 competition: Results, findings, conclusion and way forward," International Journal of Forecasting, vol. 34, no. 4, pp. 802–808, 2018.
[18] S. Smyl, J. Ranganathan, and A. Pasqua. (2018) M4 forecasting competition: Introducing a new hybrid ES-RNN model. [Online]. Available: https://eng.uber.com/m4-forecasting-competition/
[19] M. Binkowski, G. Marti, and P. Donnat, "Autoregressive convolutional neural networks for asynchronous time series," in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 80, 2018, pp. 580–589.
[20] S. S. Rangapuram, M. W. Seeger, J. Gasthaus, L. Stella, Y. Wang, and T. Januschowski, "Deep state space models for time series forecasting," in Advances in Neural Information Processing Systems 31 (NeurIPS), 2018.
[21] M. Fraccaro, S. Kamronn, U. Paquet, and O. Winther, "A disentangled recognition and nonlinear dynamics model for unsupervised learning," in Advances in Neural Information Processing Systems 30 (NIPS), 2017.
[22] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems 27 (NIPS), 2014.
[23] S. Gu, B. T. Kelly, and D. Xiu, "Empirical asset pricing via machine learning," Chicago Booth Research Paper No. 18-04; 31st Australasian Finance and Banking Conference 2018, 2017. [Online]. Available: https://ssrn.com/abstract=3159577
[24] S. Kim, "Enhancing the momentum strategy through deep regression," Quantitative Finance, vol. 0, no. 0, pp. 1–13, 2019.
[25] J. Sirignano and R. Cont, "Universal features of price formation in financial markets: Perspectives from deep learning," SSRN, 2018. [Online]. Available: https://ssrn.com/abstract=3141294
[26] S. Ghoshal and S. Roberts, "Thresholded ConvNet ensembles: Neural networks for technical forecasting," in Data Science in Fintech Workshop – Conference on Knowledge Discovery and Data Mining (KDD), 2018.
[27] W. Bao, J. Yue, and Y. Rao, "A deep learning framework for financial time series using stacked autoencoders and long-short term memory," PLOS ONE, vol. 12, no. 7, pp. 1–24, 2017.
[28] P. Barroso and P. Santa-Clara, "Momentum has its moments," Journal of Financial Economics, vol. 116, no. 1, pp. 111–120, 2015.
[29] K. Daniel and T. J. Moskowitz, "Momentum crashes," Journal of Financial Economics, vol. 122, no. 2, pp. 221–247, 2016.
[30] R. Martins and D. Zou, "Momentum strategies offer a positive point of skew," Risk Magazine, 2012.
[31] P. Jusselin, E. Lezmi, H. Malongo, C. Masselin, T. Roncalli, and T.-L. Dao, "Understanding the momentum risk premium: An in-depth journey through trend-following strategies," SSRN, 2017. [Online]. Available: https://ssrn.com/abstract=3042173
[32] M. Potters and J.-P. Bouchaud, "Trend followers lose more than they gain," Wilmott Magazine, 2016.
[33] L. M. Rotando and E. O. Thorp, "The Kelly criterion and the stock market," The American Mathematical Monthly, vol. 99, no. 10, pp. 922–931, 1992.
[34] W. F. Sharpe, "The Sharpe ratio," The Journal of Portfolio Management, vol. 21, no. 1, pp. 49–58, 1994.
[35] N. Jegadeesh and S. Titman, "Returns to buying winners and selling losers: Implications for stock market efficiency," The Journal of Finance, vol. 48, no. 1, pp. 65–91, 1993.
[36] ——, "Profitability of momentum strategies: An evaluation of alternative explanations," The Journal of Finance, vol. 56, no. 2, pp. 699–720, 2001.
[37] J. Rohrbach, S. Suremann, and J. Osterrieder, "Momentum and trend following trading strategies for currencies revisited – combining academia and industry," SSRN, 2017. [Online]. Available: https://ssrn.com/abstract=2949379
[38] Z. Zhang, S. Zohren, and S. Roberts, "BDLOB: Bayesian deep convolutional neural networks for limit order books," in Bayesian Deep Learning Workshop – Conference on Neural Information Processing (NeurIPS), 2018.
[39] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[40] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," CoRR, vol. abs/1609.03499, 2016.
[41] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, "Mastering the game of Go without human knowledge," Nature, vol. 550, pp. 354–359, 2017.
[42] P. N. Kolm and G. Ritter, "Dynamic replication and hedging: A reinforcement learning approach," The Journal of Financial Data Science, vol. 1, no. 1, pp. 159–171, 2019.
[43] H. Bühler, L. Gonon, J. Teichmann, and B. Wood, "Deep Hedging," arXiv e-prints, p. arXiv:1802.03042, 2018.
[44] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.
[45] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[46] Y. Gal and Z. Ghahramani, "A theoretically grounded application of dropout in recurrent neural networks," in Advances in Neural Information Processing Systems 29 (NIPS), 2016.
[47] "Pinnacle Data Corp. CLC Database," https://pinnacledata2.com/clc.html.
[48] B. Lim, S. Zohren, and S. Roberts, "Recurrent Neural Filters: Learning independent Bayesian filtering steps for time series prediction," arXiv e-prints, p. arXiv:1901.08096, 2019.
APPENDIX

A. Dataset Details

From the full 98 ratio-adjusted continuous futures contracts in the Pinnacle Data Corp CLC Database, we extract the 88 which have less than 10% of their data missing – with a breakdown by asset class below:

1) Commodities:

Identifier  Description
BC  BRENT CRUDE OIL, composite
BG  BRENT GASOIL, comp.
BO  SOYBEAN OIL
CC  COCOA
CL  CRUDE OIL
CT  COTTON #2
C   CORN
DA  MILK III, Comp.
FC  FEEDER CATTLE
GC  GOLD (COMMEX)
GI  GOLDMAN SAKS C. I.
HG  COPPER
HO  HEATING OIL #2
JO  ORANGE JUICE
KC  COFFEE
KW  WHEAT, KC
LB  LUMBER
LC  LIVE CATTLE
LH  LIVE HOGS
MW  WHEAT, MINN
NG  NATURAL GAS
NR  ROUGH RICE
O   OATS
PA  PALLADIUM
PL  PLATINUM
RB  RBOB GASOLINE
SB  SUGAR #11
SI  SILVER (COMMEX)
SM  SOYBEAN MEAL
S   SOYBEANS
W   WHEAT, CBOT
ZA  PALLADIUM, electronic
ZB  RBOB, Electronic
ZC  CORN, Electronic
ZF  FEEDER CATTLE, Electronic
ZG  GOLD, Electronic
ZH  HEATING OIL, electronic
ZI  SILVER, Electronic
ZK  COPPER, electronic
ZL  SOYBEAN OIL, Electronic
ZM  SOYBEAN MEAL, Electronic
ZN  NATURAL GAS, electronic
ZO  OATS, Electronic
ZP  PLATINUM, electronic
ZR  ROUGH RICE, Electronic
ZS  SOYBEANS, Electronic
ZT  LIVE CATTLE, Electronic
ZU  CRUDE OIL, Electronic
ZW  WHEAT, Electronic
ZZ  LEAN HOGS, Electronic

2) Equities:

Identifier  Description
AX  GERMAN DAX INDEX
CA  CAC40 INDEX
EN  NASDAQ, MINI
ER  RUSSELL 2000, MINI
ES  S & P 500, MINI
HS  HANG SENG
LX  FTSE 100 INDEX
MD  S&P 400 (Mini electronic)
SC  S & P 500, composite
SP  S & P 500, day session
XU  DOW JONES EUROSTOXX50
XX  DOW JONES STOXX 50
YM  Mini Dow Jones ($5.00)

3) Fixed Income:

Identifier  Description
AP  AUSTRALIAN PRICE INDEX
DT  EURO BOND (BUND)
FA  T-NOTE, 5yr day session
FB  T-NOTE, 5yr composite
GS  GILT, LONG BOND
TA  T-NOTE, 10yr day session
TD  T-NOTES, 2yr day session
TU  T-NOTES, 2yr composite
TY  T-NOTE, 10yr composite
UA  T-BONDS, day session
UB  EURO BOBL
US  T-BONDS, composite

4) FX:

Identifier  Description
AD  AUSTRALIAN $$, day session
AN  AUSTRALIAN $$, composite
BN  BRITISH POUND, composite
CB  CANADIAN 10YR BOND
CN  CANADIAN $$, composite
DX  US DOLLAR INDEX
FN  EURO, composite
FX  EURO, day session
JN  JAPANESE YEN, composite
MP  MEXICAN PESO
NK  NIKKEI INDEX
SF  SWISS FRANC, day session
SN  SWISS FRANC, composite

To reduce the impact of outliers, we also winsorise the data by capping/flooring it to be within 5 times its exponentially weighted moving (EWM) standard deviations from its EWM average – computed using a 252-day half life.
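The winsorisation step above amounts to a simple clipping operation; a minimal pandas sketch, assuming `series` holds the raw daily values for one contract, is shown below.

```python
# Sketch of the EWM winsorisation described in Appendix A.
import pandas as pd


def winsorise(series: pd.Series, halflife: int = 252, n_std: float = 5.0) -> pd.Series:
    ewm = series.ewm(halflife=halflife)
    mean, std = ewm.mean(), ewm.std()
    # Cap/floor at n_std exponentially weighted standard deviations from the EWM mean.
    return series.clip(lower=mean - n_std * std, upper=mean + n_std * std)
```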
B. Hyperparameter Optimisation

Exhibit 9: Hyperparameter Search Range

Hyperparameters | Random Search Grid | Notes
Dropout Rate | 0.1, 0.2, 0.3, 0.4, 0.5 | Neural Networks Only
Hidden Layer Size | 5, 10, 20, 40, 80 | Neural Networks Only
Minibatch Size | 256, 512, 1024, 2048 |
Learning Rate | 10^-5, 10^-4, 10^-3, 10^-2, 10^-1, 10^0 |
Max Gradient Norm | 10^-4, 10^-3, 10^-2, 10^-1, 10^0, 10^1 |
L1 Regularisation Weight (α) | 10^-5, 10^-4, 10^-3, 10^-2, 10^-1 | Lasso Regression Only
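The 50-iteration random search over this grid can be sketched as below; `train_and_validate` is a hypothetical helper that calibrates a model for a given parameter dictionary and returns its validation loss.

```python
# Sketch of random search over the hyperparameter grid in Exhibit 9.
import random

SEARCH_GRID = {
    "dropout_rate": [0.1, 0.2, 0.3, 0.4, 0.5],
    "hidden_layer_size": [5, 10, 20, 40, 80],
    "minibatch_size": [256, 512, 1024, 2048],
    "learning_rate": [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0],
    "max_gradient_norm": [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0],
}


def random_search(train_and_validate, n_iter: int = 50, seed: int = 0):
    rng = random.Random(seed)
    best = (float("inf"), None)
    for _ in range(n_iter):
        params = {name: rng.choice(options) for name, options in SEARCH_GRID.items()}
        candidate = (train_and_validate(params), params)
        best = min(best, candidate, key=lambda x: x[0])  # keep lowest validation loss
    return best
```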
C. Additional Results
In addition to the selected results in Section V, we also present a full list of results here for completeness – echoing the key findings reported in the discussion. Detailed descriptions of the plots and tables can be found below:
1) Cross-Validation Performance: The testing pro-
cedure in Section V can also be interpreted as a cross-
validation approach – splitting the original dataset
into six 5-year blocks (1990-2015), calibrating using
an expanding window of data, and testing out-of-
sample on the next block outside the training set.
As such, for consistency with machine learning
literature, we present our results in a cross validation
format as well – reporting the average value across
all blocks ± 2 standard deviations. Furthermore, this
also gives an indication of how signal performance
varies across the various time periods.
• Exhibit 10 – Cross-validation results for raw
signal outputs.
• Exhibit 11 – Cross-validation results for signals
which have been rescaled to target volatility at
the portfolio level.
2) Metrics Across Individual Assets: We also provide additional plots of other risk metrics and performance ratios across individual assets, as described below:
• Exhibit 12 – Performance ratios across individual assets.
• Exhibit 13 – Reward versus risk (average returns and volatility) across individual assets.
Exhibit 10: Cross-Validation Performance – Raw Signal Outputs
(Columns: E[Return], Vol., Downside Deviation, MDD)
Reference
Long Only 0.043 ± 0.028 0.054 ± 0.016 0.037 ± 0.013 0.116 ± 0.091
Sgn(Returns) 0.047 ± 0.051 0.046 ± 0.012 0.032 ± 0.007 0.067 ± 0.041
MACD 0.026 ± 0.032 0.032 ± 0.008 0.023 ± 0.007 0.054 ± 0.048
Linear
Sharpe 0.034 ± 0.030 0.039 ± 0.028 0.028 ± 0.020 0.072 ± 0.096
Ave. Returns 0.033 ± 0.031 0.046 ± 0.025 0.031 ± 0.018 0.110 ± 0.114
MSE 0.047 ± 0.038 0.049 ± 0.019 0.033 ± 0.013 0.100 ± 0.121
Binary 0.012 ± 0.028 0.045 ± 0.011 0.031 ± 0.009 0.109 ± 0.045
MLP
Sharpe 0.038 ± 0.027 0.030 ± 0.041 0.021 ± 0.028 0.062 ± 0.160
Ave. Returns 0.056 ± 0.046* 0.044 ± 0.024 0.030 ± 0.017 0.075 ± 0.150
MSE 0.037 ± 0.051 0.048 ± 0.021 0.032 ± 0.015 0.109 ± 0.134
Binary -0.004 ± 0.028 0.042 ± 0.007 0.028 ± 0.006 0.111 ± 0.079
WaveNet
Sharpe 0.030 ± 0.030 0.038 ± 0.019 0.027 ± 0.015 0.069 ± 0.055
Ave. Returns 0.034 ± 0.043 0.042 ± 0.003 0.030 ± 0.003 0.088 ± 0.062
MSE 0.024 ± 0.046 0.043 ± 0.010 0.030 ± 0.010 0.102 ± 0.056
Binary -0.009 ± 0.023 0.043 ± 0.008 0.030 ± 0.008 0.159 ± 0.107
LSTM
Sharpe 0.045 ± 0.030 0.017 ± 0.004* 0.012 ± 0.003* 0.019 ± 0.005*
Ave. Returns 0.045 ± 0.050 0.048 ± 0.018 0.034 ± 0.011 0.104 ± 0.119
MSE 0.023 ± 0.037 0.048 ± 0.022 0.033 ± 0.017 0.116 ± 0.082
Binary -0.005 ± 0.088 0.042 ± 0.003 0.027 ± 0.006 0.151 ± 0.211
(Columns: Sharpe, Sortino, Calmar, Fraction of +ve Returns, Ave. P / Ave. L)
Reference
Long Only 0.839 ± 0.786 1.258 ± 1.262 0.420 ± 0.490 0.546 ± 0.025 0.956 ± 0.135
Sgn(Returns) 1.045 ± 1.230 1.528 ± 1.966 0.864 ± 1.539 0.543 ± 0.061 1.002 ± 0.067
MACD 0.839 ± 1.208 1.208 ± 1.817 0.625 ± 1.033 0.532 ± 0.030 1.016 ± 0.079
Linear
Sharpe 1.025 ± 1.530 1.451 ± 2.154 0.800 ± 1.772 0.544 ± 0.042 1.000 ± 0.100
Ave. Returns 0.757 ± 0.833 1.150 ± 1.378 0.397 ± 0.686 0.530 ± 0.009 1.005 ± 0.100
MSE 1.012 ± 1.126 1.532 ± 1.811 0.708 ± 1.433 0.540 ± 0.024 1.008 ± 0.096
Binary 0.288 ± 0.729 0.434 ± 1.113 0.123 ± 0.313 0.506 ± 0.027 1.024 ± 0.051
MLP
Sharpe 1.669 ± 2.332 2.420 ± 3.443 1.665 ± 2.738 0.554 ± 0.063 1.069 ± 0.151
Ave. Returns 1.415 ± 1.781 2.127 ± 2.996 1.520 ± 2.761 0.553 ± 0.043 1.022 ± 0.134
MSE 0.821 ± 1.334 1.270 ± 2.160 0.652 ± 1.684 0.525 ± 0.025 1.036 ± 0.127
Binary -0.099 ± 0.648 -0.180 ± 0.956 -0.013 ± 0.304 0.500 ± 0.042 0.986 ± 0.064
WaveNet
Sharpe 0.780 ± 0.538 1.118 ± 0.854 0.477 ± 0.610 0.535 ± 0.022 0.990 ± 0.094
Ave. Returns 0.809 ± 1.113 1.160 ± 1.615 0.501 ± 1.036 0.543 ± 0.059 0.963 ± 0.069
MSE 0.513 ± 0.991 0.744 ± 1.477 0.276 ± 0.509 0.527 ± 0.033 0.979 ± 0.077
Binary -0.220 ± 0.523 -0.329 ± 0.768 -0.043 ± 0.145 0.499 ± 0.011 0.969 ± 0.043
LSTM
Sharpe 2.781 ± 2.081* 3.978 ± 3.160* 2.488 ± 1.921* 0.593 ± 0.054* 1.104 ± 0.199*
Ave. Returns 0.961 ± 1.268 1.397 ± 1.926 0.679 ± 1.552 0.547 ± 0.039 0.972 ± 0.118
MSE 0.451 ± 0.526 0.668 ± 0.812 0.184 ± 0.170 0.520 ± 0.026 0.996 ± 0.048
Binary -0.114 ± 2.147 -0.191 ± 3.435 0.227 ± 1.241 0.495 ± 0.077 1.002 ± 0.077
Exhibit 11: Cross-Validation Performance – Rescaled to Target Volatility
(Columns: E[Return], Vol., Downside Deviation, MDD)
Reference
Long Only 0.131 ± 0.142 0.154 ± 0.001 0.104 ± 0.014 0.304 ± 0.113
Sgn(Returns) 0.186 ± 0.184 0.154 ± 0.002 0.101 ± 0.012 0.194 ± 0.126
MACD 0.140 ± 0.166 0.154 ± 0.002 0.105 ± 0.010 0.243 ± 0.129
Linear
Sharpe 0.182 ± 0.273 0.155 ± 0.003 0.105 ± 0.007 0.232 ± 0.175
Ave. Returns 0.127 ± 0.141 0.154 ± 0.003 0.101 ± 0.009 0.318 ± 0.177
MSE 0.170 ± 0.189 0.154 ± 0.003 0.099 ± 0.006* 0.256 ± 0.221
Binary 0.049 ± 0.170 0.155 ± 0.002 0.104 ± 0.013 0.351 ± 0.114
MLP
Sharpe 0.271 ± 0.375 0.154 ± 0.008 0.104 ± 0.000 0.186 ± 0.259
Ave. Returns 0.233 ± 0.270 0.154 ± 0.003 0.101 ± 0.010 0.194 ± 0.277
MSE 0.148 ± 0.178 0.154 ± 0.003 0.100 ± 0.009 0.268 ± 0.262
Binary -0.011 ± 0.117 0.154 ± 0.002 0.102 ± 0.018 0.377 ± 0.221
WaveNet
Sharpe 0.131 ± 0.103 0.154 ± 0.002 0.104 ± 0.009 0.254 ± 0.164
Ave. Returns 0.142 ± 0.196 0.154 ± 0.002 0.103 ± 0.003 0.262 ± 0.204
MSE 0.087 ± 0.150 0.153 ± 0.003* 0.101 ± 0.009 0.307 ± 0.247
Binary -0.030 ± 0.099 0.155 ± 0.001 0.105 ± 0.006 0.485 ± 0.283
LSTM
Sharpe 0.435 ± 0.342* 0.155 ± 0.002 0.108 ± 0.012 0.164 ± 0.077*
Ave. Returns 0.157 ± 0.202 0.153 ± 0.002* 0.102 ± 0.011 0.285 ± 0.196
MSE 0.087 ± 0.091 0.154 ± 0.003 0.100 ± 0.006 0.310 ± 0.130
Binary -0.008 ± 0.332 0.155 ± 0.002 0.100 ± 0.009 0.428 ± 0.495
(Columns: Sharpe, Sortino, Calmar, Fraction of +ve Returns, Ave. P / Ave. L)
Reference
Long Only 0.847 ± 0.915 1.287 ± 1.475 0.445 ± 0.579 0.546 ± 0.025 0.958 ± 0.164
Sgn(Returns) 1.213 ± 1.205 1.856 ± 1.944 1.098 ± 1.658 0.543 ± 0.061 1.028 ± 0.070
MACD 0.911 ± 1.086 1.361 ± 1.733 0.643 ± 0.958 0.532 ± 0.030 1.023 ± 0.074
Linear
Sharpe 1.176 ± 1.772 1.752 ± 2.615 1.060 ± 2.376 0.544 ± 0.042 1.025 ± 0.139
Ave. Returns 0.826 ± 0.914 1.287 ± 1.504 0.471 ± 0.777 0.530 ± 0.009 1.016 ± 0.116
MSE 1.101 ± 1.220 1.729 ± 2.037 0.890 ± 1.787 0.540 ± 0.024 1.022 ± 0.116
Binary 0.321 ± 1.105 0.509 ± 1.720 0.169 ± 0.585 0.506 ± 0.027 1.031 ± 0.127
MLP
Sharpe 1.757 ± 2.405 2.623 ± 3.626 2.091 ± 3.474 0.554 ± 0.063 1.085 ± 0.176
Ave. Returns 1.516 ± 1.764 2.336 ± 2.923 1.771 ± 2.889 0.553 ± 0.043 1.038 ± 0.141
MSE 0.960 ± 1.163 1.510 ± 1.927 0.864 ± 1.999 0.525 ± 0.025 1.059 ± 0.103
Binary -0.071 ± 0.756 -0.140 ± 1.133 0.000 ± 0.372 0.500 ± 0.042 0.991 ± 0.072
WaveNet
Sharpe 0.849 ± 0.663 1.270 ± 1.060 0.575 ± 0.680 0.535 ± 0.022 1.000 ± 0.121
Ave. Returns 0.920 ± 1.271 1.376 ± 1.915 0.738 ± 1.591 0.543 ± 0.059 0.979 ± 0.066
MSE 0.565 ± 0.972 0.854 ± 1.513 0.364 ± 0.665 0.527 ± 0.033 0.986 ± 0.081
Binary -0.196 ± 0.641 -0.298 ± 0.946 -0.044 ± 0.206 0.499 ± 0.011 0.974 ± 0.067
LSTM
Sharpe 2.803 ± 2.195* 4.084 ± 3.469* 2.887 ± 3.030* 0.593 ± 0.054* 1.106 ± 0.216*
Ave. Returns 1.023 ± 1.312 1.564 ± 2.131 0.706 ± 1.440 0.547 ± 0.039 0.980 ± 0.127
MSE 0.563 ± 0.580 0.865 ± 0.901 0.284 ± 0.269 0.520 ± 0.026 1.014 ± 0.016
Binary -0.050 ± 2.152 -0.122 ± 3.381 0.190 ± 1.152 0.495 ± 0.077 1.012 ± 0.048
Exhibit 12: Performance Ratios Across Individual Assets
Exhibit 13: Reward vs Risk Across Individual Assets
(b) Volatility