Data Mining for Algorithmic Asset Management
Giovanni Montana and Francesco Parrella
20.1 Introduction
In recent years there has been increasing interest in active approaches to investing
that rely exclusively on mining financial data, such as market-neutral strate-
gies [11]. This is a general class of investments that seeks to neutralize certain
market risks by detecting market inefficiencies and taking offsetting long and short
positions, with the ultimate goal of achieving positive returns independently of mar-
ket conditions. A specific instance of market-neutral strategies that heavily relies
on temporal data mining is referred to as statistical arbitrage [11, 14]. Algorithmic
asset management systems embracing this principle are developed to make spread
trades, namely trades that derive returns from the estimated relationship between
two statistically related securities.
In this section we outline the rationale behind the statistical arbitrage system
that forms the theme of this chapter, and provide a description of its main com-
ponents. Our system imports n + 1 cross-sectional financial data streams at discrete
time points t = 1, 2, . . .. In the sequel, we will assume that consecutive time intervals
are all equal to 24 hours, and that a trading decision is made on a daily basis. Specif-
ically, after importing and processing the data streams at each time t, a decision to
either buy or short sell a number of shares of a target security Y is made, and an
order is executed. Different sampling frequencies (e.g. irregularly spaced intervals)
and trading frequencies could also be incorporated with only minor modifications.
The imported data streams represent the prices of n + 1 assets. We denote by
yt the price of the security Y being traded by the system, whereas the remaining
n streams, collected in a vector st = (st1 , . . . , stn )T , refer to a large collection of fi-
nancial assets and economic indicators, such as other security prices and indices,
which possess some explanatory power in relation to Y . These streams will be used
to estimate the fair price of the target asset Y at each observational time point t, in a
way that will be specified below. We postulate that the price of Y at each time t can
be decomposed into two components, that is yt = zt + mt , where zt represents the
current fair price of Y, and the additive term mt represents a potential mispricing.
No further assumptions are made regarding the data generating process. Clearly, if
the markets were always perfectly efficient, we would have that yt = zt at all times.
However, when |mt | > 0, an arbitrage opportunity arises. For instance, a negative mt
indicates that Y is temporarily under-valued. In this case, it is sensible to expect that
the market will promptly react to this temporary inefficiency with the effect of mov-
ing the target price up. Under this scenario, an investor would then buy a number
of shares hoping that, by time t + 1, a profit proportional to yt+1 − yt will be made.
Our system is designed to identify and exploit possible statistical arbitrage opportu-
nities of this sort in an automated fashion. This trading strategy can be formalized
by means of a binary decision rule dt ∈ {0, 1} where dt = 0 encodes a sell signal,
and dt = 1 a buy signal. Accordingly, we write
d_t(m_t) = \begin{cases} 0, & m_t > 0 \\ 1, & m_t < 0 \end{cases} \qquad (20.1)
where we have made explicit the dependence on the current mispricing mt = yt − zt.
If we denote the change in price observed on the day following the trading decision
as rt+1 = yt+1 − yt , we can also introduce a 0 − 1 loss function Lt+1 (dt , rt+1 ) =
|dt − 1(rt+1 >0) |, where the indicator variable 1(rt+1 >0) equals one if rt+1 > 0 and
zero otherwise. For instance, if the system generates a sell signal at time t, but the
security’s price increases over the next time interval, the system incurs a unit loss.
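To fix ideas, the decision rule (20.1) and the 0−1 loss can be written in a few lines of Python; the function names below are ours, and the tie case m_t = 0, which (20.1) leaves unspecified, is mapped to a buy signal purely for concreteness.

def trading_decision(mispricing):
    # Rule (20.1): a positive mispricing (price above fair value) triggers a sell
    # signal (0); a negative mispricing triggers a buy signal (1). The tie case
    # m_t = 0 is mapped to a buy here.
    return 0 if mispricing > 0 else 1

def zero_one_loss(decision, next_day_return):
    # L_{t+1}(d_t, r_{t+1}) = |d_t - 1(r_{t+1} > 0)|
    return abs(decision - (1 if next_day_return > 0 else 0))

# Example: the asset trades below its estimated fair value (m_t = -2), so the
# rule issues a buy; the price then rises, and no loss is incurred.
d_t = trading_decision(-2.0)      # -> 1 (buy)
loss = zero_one_loss(d_t, 1.5)    # -> 0
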
Obviously, the fair price zt is never directly observable, and therefore the mispricing
mt is also unknown. The system we propose extracts knowledge from the
large collection of data streams, and incrementally imputes the fair price zt on the
basis of the newly extracted knowledge, in an efficient way. Although we expect
some streams to have high explanatory power, most streams will carry little signal
and will mostly contribute to generate noise. Furthermore, when n is large, we ex-
pect several streams to be highly correlated over time, and highly dependent streams
will provide redundant information. To cope with both of these issues, the system
extracts knowledge in the form of a feature vector xt , dynamically derived from st ,
that captures as much information as possible at each time step. We require the
components of the feature vector xt to be fewer in number than n and to be uncorrelated
with each other. Effectively, during this step the system extracts informative
patterns while performing dimensionality reduction.
As soon as the feature vector xt is extracted, the pattern enters as input of a non-
parametric regression model that provides an estimate of the fair price of Y at the
current time t. The estimate of zt is denoted by ẑt = ft (xt ; φ ), where ft (·; φ ) is a
time-varying function depending upon the specification of a hyperparameter vector
φ . With the current ẑt at hand, an estimated mispricing m̂t is computed and used to
determine the trading rule (20.1). The major difficulty in setting up this learning step
lies in the fact that the true fair price zt is never made available to us, and therefore it
cannot be learnt directly. To cope with this problem, we use the observed price yt as
a surrogate for the fair price and note that proper choices of φ can generate sensible
estimates ẑt , and therefore realistic mispricing m̂t .
We have thus identified a number of practical issues that will have to be ad-
dressed next: (a) how to recursively extract and update the feature vector xt from
the streaming data, (b) how to specify and recursively update the pricing function
ft (·; φ ), and finally (c) how to select the hyperparameter vector φ .
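Before addressing (a)–(c) in turn, the following schematic Python sketch shows how the three steps would fit together in a single daily iteration; extract_features and estimate_fair_price are hypothetical placeholders for the components developed in the rest of the chapter, not part of any published implementation.

from typing import Callable, Sequence

def daily_step(s_t: Sequence[float],
               y_t: float,
               extract_features: Callable[[Sequence[float]], Sequence[float]],
               estimate_fair_price: Callable[[Sequence[float]], float]) -> int:
    # (a) recursive feature extraction, (b) fair-price estimation,
    # (c) trading decision via rule (20.1): 0 = sell, 1 = buy.
    x_t = extract_features(s_t)
    z_hat = estimate_fair_price(x_t)
    m_hat = y_t - z_hat
    return 0 if m_hat > 0 else 1

# Toy usage with stand-in components (identity features, mean-of-streams price).
decision = daily_step(s_t=[101.0, 99.5, 100.2],
                      y_t=98.0,
                      extract_features=lambda s: s,
                      estimate_fair_price=lambda x: sum(x) / len(x))
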
In order to extract knowledge from the streaming data and capture important
features of the underlying market in real-time, the system recursively performs a
principal component analysis, and extracts those components that explain a large
percentage of variability in the n streams. Upon arrival, each stream is first nor-
malized so that all streams have equal means and standard deviations. Let us call
Ct = E(st stT ) the unknown population covariance matrix of the n streams. The al-
gorithm proposed by [16] provides an efficient procedure to incrementally update
the eigenvectors of Ct when new data points arrive, in a way that does not require
the explicit computation of the covariance matrix. First, note that an eigenvector gt
of Ct satisfies the characteristic equation λt gt = Ct gt , where λt is the corresponding
eigenvalue. Let us call ht the current estimate of Ct gt using all the data up to the
current time t. This is given by h_t = (1/t) \sum_{i=1}^{t} s_i s_i^T g_i,
which is the incremental average of s_i s_i^T g_i, where s_i s_i^T accounts for the contribution
to the estimate of C_i at point i. Observing that g_t = h_t / ||h_t||, an obvious choice
is to estimate g_t as h_{t-1} / ||h_{t-1}||. After some manipulations, a recursive expression
for h_t can be found as
h_t = \frac{t-1}{t} h_{t-1} + \frac{1}{t} s_t s_t^T \frac{h_{t-1}}{\|h_{t-1}\|} \qquad (20.2)
Once the first k eigenvectors have been extracted recursively, the data streams are projected
onto these directions in order to obtain the required feature vector xt . We are thus
given a sequence of paired observations (y1 , x1 ), . . . , (yt , xt ) where each xt is a k-
dimensional feature vector representing the latest market information and yt is the
price of the security being traded.
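As an illustration, a minimal Python sketch of the recursion (20.2) and of the projection step is given below. It assumes the streams have already been normalised, initialises the eigenvector estimates arbitrarily, and obtains higher-order components by deflating the current observation; the amnesic-averaging refinements of [16] are omitted, and all names are ours.

import numpy as np

def ccipca_update(h, s, t):
    # One step of Eq. (20.2):
    # h_t = ((t-1)/t) h_{t-1} + (1/t) s_t s_t^T h_{t-1} / ||h_{t-1}||
    return ((t - 1) / t) * h + (1.0 / t) * s * (s @ (h / np.linalg.norm(h)))

def update_components(H, s, t):
    # Update k eigenvector estimates (rows of H); deflating the observation after
    # each component keeps successive directions (approximately) uncorrelated.
    residual = np.asarray(s, dtype=float).copy()
    for j in range(H.shape[0]):
        H[j] = ccipca_update(H[j], residual, t)
        g = H[j] / np.linalg.norm(H[j])
        residual = residual - (residual @ g) * g
    return H

def features(H, s):
    # Project the (normalised) streams onto the current eigenvector estimates.
    G = H / np.linalg.norm(H, axis=1, keepdims=True)
    return G @ s

# Toy usage: n = 5 streams, k = 2 components, eigenvector estimates initialised
# to small random vectors (in practice, e.g. from the first observations).
rng = np.random.default_rng(0)
H = rng.normal(size=(2, 5))
for t, s in enumerate(rng.normal(size=(500, 5)), start=1):
    H = update_components(H, s, t)
x_t = features(H, rng.normal(size=5))
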
Our objective is to generate an estimate of the target security’s fair price using the
data points observed so far. In previous work [9, 10], we assumed that the fair price
depends linearly on xt and that the linear coefficients are allowed to evolve smoothly
over time. Specifically, we assumed that the fair price can be learned by recursively
minimizing the following loss function
\sum_{i=1}^{t-1} (y_i - w_i^T x_i)^2 + C (w_{i+1} - w_i)^T (w_{i+1} - w_i) \qquad (20.3)
that is, a penalized version of ordinary least squares. Temporal changes in the time-
varying linear regression weights wt result in an additional loss due to the penalty
term in (20.3). The severity of this penalty depends upon the magnitude of the regularization
parameter C, which is a non-negative scalar: at one extreme, when C
gets very large, (20.3) reduces to the ordinary least squares loss function with time-invariant
weights; at the other extreme, when C is small, abrupt temporal changes in the
estimated weights are permitted. Recursive estimation equations and a connection
to the Kalman filter can be found in [10], which also describes a related algorith-
mic asset management system for trading futures contracts. In this chapter we depart
from previous work in two main directions. First, the rather strong linearity assumption
is relaxed so as to add more flexibility in modelling the relationship between
the extracted market patterns and the security’s price. Second, we adopt a differ-
ent and more robust loss function. According to our new specification, estimated
prices ft (xt ) that are within ±ε of the observed price yt are always considered fair
prices, for a given user-defined positive scalar ε related to the noise level in the data.
At the same time, we would also like ft (xt ) to be as flat as possible. A standard
way to ensure this requirement is to impose an additional penalty term
controlling the norm of the weights, ||w||² = wT w. For simplicity of exposition, let
us suppose again that the function to be learned is linear and can be expressed as
ft (xt ) = wT xt + b, where b is a scalar representing the bias. Introducing slack vari-
ables ξt, ξt∗ quantifying estimation errors greater than ε, the learning task can be
cast as the following minimization problem,
\min_{w_t,\, b_t} \; \frac{1}{2} w_t^T w_t + C \sum_{i=1}^{t} (\xi_i + \xi_i^*) \qquad (20.4)
\text{s.t.} \quad
\begin{cases}
-y_i + (w_i^T x_i + b_i) + \varepsilon + \xi_i \ge 0 \\
\;\;\, y_i - (w_i^T x_i + b_i) + \varepsilon + \xi_i^* \ge 0 \\
\;\;\, \xi_i, \xi_i^* \ge 0, \qquad i = 1, \ldots, t
\end{cases} \qquad (20.5)
that is, the support vector regression framework originally introduced by Vapnik
[15]. In this optimization problem, the constant C is a regularization parameter de-
termining the trade-off between the flatness of the function and the tolerated addi-
tional estimation error. A linear loss of |ξt | − ε is imposed any time the error |ξt | is
greater than ε , whereas a zero loss is used otherwise. Another advantage of having
an ε -insensitive loss function is that it will ensure sparseness of the solution, i.e.
the solution will be represented by means of a small subset of sample points. This
aspect introduces non-negligible computational speed-ups, which are particularly
beneficial in time-aware trading applications. As pointed out before, our objective
is to learn from the data in an incremental way. Following well-established results (see,
for instance, [5]), the constrained optimization problem defined by Eqs. (20.4) and
(20.5) can be solved using a Lagrange function,
L = \frac{1}{2} w_t^T w_t + C \sum_{i=1}^{t} (\xi_i + \xi_i^*) - \sum_{i=1}^{t} (\eta_i \xi_i + \eta_i^* \xi_i^*)
\qquad\qquad\qquad\qquad\qquad\qquad (20.6)
\;\; - \sum_{i=1}^{t} \alpha_i (\varepsilon + \xi_i - y_i + w_t^T x_i + b_t) - \sum_{i=1}^{t} \alpha_i^* (\varepsilon + \xi_i^* + y_i - w_t^T x_i - b_t)
where αi , αi∗ , ηi and ηi∗ are the Lagrange multipliers, and have to satisfy positivity
constraints, for all i = 1, . . . ,t. The partial derivatives of (20.6) with respect to w, b, ξ
and ξ∗ are required to vanish for optimality. By doing so, each ηi can be expressed
as C − αi and can therefore be removed (analogously for ηi∗). Moreover, we can
write the weight vector as w_t = \sum_{i=1}^{t} (\alpha_i - \alpha_i^*) x_i, and the approximating function
can be expressed as a support vector expansion, that is
f_t(x_t) = \sum_{i=1}^{t} \theta_i \, x_i^T x_t + b_t \qquad (20.7)
where each coefficient θi has been defined as the difference αi − αi∗ . The dual opti-
mization problem leads to another Lagrangian function, and its solution is provided
by the Karush-Kuhn-Tucker (KKT) conditions, whose derivation in this context can
be found in [13]. After defining the margin function hi (xi ) as the difference fi (xi ) − yi
for all time points i = 1, . . . ,t, the KKT conditions can be expressed in terms of
θi , hi (xi ), ε and C. In turn, each data point (xi , yi ) can be classified as belonging to
each one of the following three auxiliary sets,
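To make the pricing step concrete, the sketch below estimates the fair price with an ε-insensitive SVR and an RBF kernel (parameterised by σ through gamma = 1/(2σ²), our assumption) using scikit-learn. It refits on the accumulated history rather than performing the exact incremental updates of [8, 13], so it should be read as an illustration of the loss function and of the mispricing computation, not as the system's actual learning algorithm.

import numpy as np
from sklearn.svm import SVR

def estimate_mispricing(history_x, history_y, x_t, y_t, C=10.0, epsilon=0.1, sigma=1.0):
    # Fit an epsilon-insensitive SVR with an RBF kernel on past (feature, price)
    # pairs, predict the current fair price z_hat, and return m_hat = y_t - z_hat.
    # Refitting from scratch is shown only for illustration; the system described
    # in the text updates the solution incrementally [8, 13].
    model = SVR(kernel="rbf", C=C, epsilon=epsilon, gamma=1.0 / (2.0 * sigma ** 2))
    model.fit(np.asarray(history_x), np.asarray(history_y))
    z_hat = float(model.predict(np.asarray(x_t).reshape(1, -1))[0])
    return y_t - z_hat

# Toy usage: two noisy features explaining the price; the observed price is then
# pushed above the fitted fair value, so a positive m_hat maps to a sell signal (0).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 100.0 + 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
m_hat = estimate_mispricing(X, y, X[-1], y[-1] + 0.5)
signal = 0 if m_hat > 0 else 1
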
of the experts in the pool that predict 0 (short sell) to the total weight q1 of the
algorithms predicting 1 (buy). These two proportions are computed, respectively,
as q_0 = \sum_{e:\, d_t^{(e)} = 0} \omega_e and q_1 = \sum_{e:\, d_t^{(e)} = 1} \omega_e. The final trading decision taken by the
WMV algorithm is
d_t^{(*)} = \begin{cases} 0 & \text{if } q_0 > q_1 \\ 1 & \text{otherwise} \end{cases} \qquad (20.9)
Each day the meta-algorithm is told whether or not its last trade was successful,
and a 0−1 penalty is applied, as described in Section 20.2. Each time the WMV
incurs a loss, the weights of all those experts in the pool that agreed with the master
algorithm are each multiplied by a fixed scalar coefficient β selected by the user,
with 0 < β < 1. That is, when an expert e makes a mistake, its weight is downgraded
to βωe. For a chosen β, WMV gradually decreases the influence of experts
that make a large number of mistakes and gives high relative weights to the experts
that make few mistakes.
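Read literally, the weighted majority voting step can be sketched as follows; the data structures and function names are ours, and β = 0.7 is used only as an example value.

def wmv_decision(weights, expert_decisions):
    # Rule (20.9): compare the total weight behind 'sell' (0) and 'buy' (1).
    q0 = sum(w for w, d in zip(weights, expert_decisions) if d == 0)
    q1 = sum(w for w, d in zip(weights, expert_decisions) if d == 1)
    return 0 if q0 > q1 else 1

def wmv_update(weights, expert_decisions, master_decision, master_lost, beta=0.7):
    # When the master incurs a loss, every expert that agreed with the master
    # decision has its weight multiplied by beta (0 < beta < 1).
    if not master_lost:
        return list(weights)
    return [w * beta if d == master_decision else w
            for w, d in zip(weights, expert_decisions)]

# Toy usage: three experts with equal initial weights.
weights = [1.0, 1.0, 1.0]
decisions = [1, 0, 1]
d_star = wmv_decision(weights, decisions)                            # -> 1 (buy)
weights = wmv_update(weights, decisions, d_star, master_lost=True)   # -> [0.7, 1.0, 0.7]
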
Table 20.1 Statistical and financial indicators summarizing the performance of the 2560 experts
over the entire data set. We use the following notation: SR=Sharpe Ratio, WT=Winning Trades,
LT=Losing Trades, MG=Mean Gain, ML=Mean Loss, and MDD=Maximum Drawdown. PnL,
WT, LT, MG, ML and MDD are reported as percentages.
Summary Gross SR Net SR Gross PnL Net PnL Volatility WT LT MG ML MDD
Best 1.13 1.10 17.90 17.40 15.90 50.16 45.49 0.77 0.70 0.20
Worst -0.36 -0.39 -5.77 -6.27 15.90 47.67 47.98 0.72 0.76 0.55
Average 0.54 0.51 8.50 8.00 15.83 48.92 46.21 0.75 0.72 0.34
Std 0.36 0.36 5.70 5.70 0.20 1.05 1.01 0.02 0.02 0.19
With the chosen grid of values for each one of the three key parameters (ε varies
between 10^−1 and 10^−8, while both C and σ vary between 0.0001 and 1000), the
pool comprises 2560 experts. The performance of these individual experts is summarized
in Table 20.1, which also reports a number of financial indicators (see
the caption for details). In particular, the Sharpe Ratio provides a measure of risk-adjusted
return, and is computed as the average return produced by an expert over
the entire period divided by its standard deviation. For instance, the
best expert over the entire period achieves a promising 1.13 ratio, while the worst
expert yields negative risk-adjusted returns. The maximum drawdown represents the
total percentage loss experienced by an expert before it starts winning again. From
this table, it clearly emerges that choosing the right parameter combination, or ex-
pert, is crucial for this application, and relying on a single expert is a risky choice.
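For reference, one conventional way of computing a Sharpe ratio and a maximum drawdown from a series of daily returns is sketched below; the annualisation factor and compounding convention are our assumptions and may differ from the exact definitions behind Table 20.1.

import numpy as np

def sharpe_ratio(daily_returns, periods_per_year=252):
    # Average return divided by its standard deviation, annualised.
    r = np.asarray(daily_returns, dtype=float)
    return np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1)

def max_drawdown(daily_returns):
    # Largest peak-to-trough loss of the cumulative wealth curve, as a fraction.
    wealth = np.cumprod(1.0 + np.asarray(daily_returns, dtype=float))
    running_peak = np.maximum.accumulate(wealth)
    return float(np.max(1.0 - wealth / running_peak))

# Toy usage on simulated daily strategy returns.
rng = np.random.default_rng(1)
r = rng.normal(loc=0.0005, scale=0.01, size=500)
print(sharpe_ratio(r), max_drawdown(r))
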
Fig. 20.2 Sharpe Ratio produced by two competing strategies, Follow the Best Expert (FBE) and Majority Voting (MV), compared with the best and average expert in the pool.

Fig. 20.3 Sharpe Ratio produced by Weighted Majority Voting (WMV) as a function of β, compared with the best and average expert in the pool; see Table 20.2 for more summary statistics.

[Figure: cumulative P&L (×10^5) generated by WMV.]
simulation. Based upon 10,000 repetitions, this distribution has mean −0.012 and
standard deviation 0.404. With reference to this distribution, we are then able to
compute empirical p-values associated with the observed Sharpe Ratios, after costs;
see Table 20.2. For instance, we note that a value as high as 1.45 or even higher
(β = 0.7) would have been observed by chance in only 10 out of 10,000 cases.
These findings support our belief that the SVR-based algorithmic trading system
does capture informative signals and produces statistically meaningful results.
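One common way to build such a null distribution is to simulate random daily trading decisions on the same return series; the sketch below illustrates that idea and the resulting empirical p-value, and is not a reproduction of the authors' procedure (the annualisation convention is our assumption).

import numpy as np

def empirical_pvalue(observed_sharpe, daily_returns, n_rep=10_000, seed=0):
    # Null distribution of the (annualised) Sharpe ratio under random daily
    # long/short decisions on the same return series, and the one-sided p-value
    # of the observed after-cost Sharpe ratio.
    rng = np.random.default_rng(seed)
    r = np.asarray(daily_returns, dtype=float)
    null_sharpes = np.empty(n_rep)
    for b in range(n_rep):
        signs = rng.choice([-1.0, 1.0], size=r.size)
        strat = signs * r
        null_sharpes[b] = np.sqrt(252) * strat.mean() / strat.std(ddof=1)
    return float(np.mean(null_sharpes >= observed_sharpe))

# e.g. p = empirical_pvalue(1.45, target_security_daily_returns)
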
Table 20.2 Statistical and financial indicators summarizing the performance of Weighted Majority
Voting (WMV) as a function of β. See the caption of Figure 20.1 and Section 20.4 for more details.
β Gross SR Net SR Gross PnL Net PnL Volatility WT LT MG ML MDD p-value
0.5 1.34 1.31 21.30 20.80 15.90 53.02 42.63 0.74 0.73 0.24 0.001
0.6 1.33 1.30 21.10 20.60 15.90 52.96 42.69 0.75 0.73 0.27 0.001
0.7 1.49 1.45 23.60 23.00 15.90 52.71 42.94 0.76 0.71 0.17 0.001
0.8 1.18 1.15 18.80 18.30 15.90 51.84 43.81 0.75 0.72 0.17 0.002
0.9 0.88 0.85 14.10 13.50 15.90 50.03 45.61 0.76 0.71 0.25 0.014
References
1. C.C. Aggarwal, J. Han, J. Wang, and P.S. Yu. Data Streams: Models and Algorithms, chapter
On Clustering Massive Data Streams: A Summarization Paradigm, pages 9–38. Springer,
2007.
2. C. Alexander and A. Dimitriu. Sources of over-performance in equity markets: mean rever-
sion, common trends and herding. Technical report, ISMA Center, University of Reading,
UK, 2005.
3. L. Cao and F. Tay. Support vector machine with adaptive parameters in financial time series
forecasting. IEEE Transactions on Neural Networks, 14(6):1506–1518, 2003.
4. N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University
Press, 2006.
5. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge
University Press, 2000.
6. R.J. Elliott, J. van der Hoek, and W.P. Malcolm. Pairs trading. Quantitative Finance, pages
271–276, 2005.
7. N. Littlestone and M.K. Warmuth. The weighted majority algorithm. Information and Com-
putation, 108:212–226, 1994.
8. J. Ma, J. Theiler, and S. Perkins. Accurate on-line support vector regression. Neural Computation,
15(11):2683–2703, 2003.
9. G. Montana, K. Triantafyllopoulos, and T. Tsagaris. Data stream mining for market-neutral
algorithmic trading. In Proceedings of the ACM Symposium on Applied Computing, pages
966–970, 2008.
10. G. Montana, K. Triantafyllopoulos, and T. Tsagaris. Flexible least squares for
temporal data mining and statistical arbitrage. Expert Systems with Applications,
doi:10.1016/j.eswa.2008.01.062, 2008.
11. J. G. Nicholas. Market-Neutral Investing: Long/Short Hedge Fund Strategies. Bloomberg
Professional Library, 2000.
12. S. Papadimitriou, J. Sun, and C. Faloutsos. Data Streams: Models and Algorithms, chapter
Dimensionality reduction and forecasting on streams, pages 261–278. Springer, 2007.
13. F. Parrella and G. Montana. A note on incremental support vector regression. Technical report,
Imperial College London, 2008.
14. A. Pole. Statistical Arbitrage. Algorithmic Trading Insights and Techniques. Wiley Finance,
2007.
15. V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
16. J. Weng, Y. Zhang, and W. S. Hwang. Candid covariance-free incremental principal compo-
nent analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(8):1034–
1040, 2003.