
Chapter 20

Data Mining for Algorithmic Asset Management

Giovanni Montana and Francesco Parrella

Abstract Statistical arbitrage refers to a class of algorithmic trading systems implementing data mining strategies. In this chapter we describe a computational framework for statistical arbitrage based on support vector regression. The algorithm learns the fair price of the security under management by minimizing a regularized ε-insensitive loss function in an on-line fashion, using the most recent market information acquired by means of streaming financial data. The difficult issue of adaptive learning in non-stationary environments is addressed by adopting an ensemble learning approach, where a meta-algorithm strategically combines the opinions of a pool of experts. Experimental results based on nearly seven years of historical data for the iShare S&P 500 ETF demonstrate that satisfactory risk-adjusted returns can be achieved by the data mining system even after transaction costs.

20.1 Introduction

In recent years there has been increasing interest in active approaches to investing that rely exclusively on mining financial data, such as market-neutral strategies [11]. This is a general class of investments that seeks to neutralize certain market risks by detecting market inefficiencies and taking offsetting long and short positions, with the ultimate goal of achieving positive returns independently of market conditions. A specific instance of market-neutral strategies that relies heavily on temporal data mining is referred to as statistical arbitrage [11, 14]. Algorithmic asset management systems embracing this principle are developed to make spread trades, namely trades that derive returns from the estimated relationship between two statistically related securities.

Giovanni Montana, Francesco Parrella


Imperial College London, Department of Mathematics, 180 Queen’s Gate, London SW7 2AZ, UK,
e-mail: {g.montana,f.parrella}@imperial.ac.uk


An example of a statistical arbitrage strategy is pairs trading [6]. The rationale behind this strategy is an intuitive one: if the difference between two statistically dependent securities tends to fluctuate around a long-term equilibrium, then temporary deviations from this equilibrium may be exploited by going long on the security that is currently under-valued, and shorting the security that is over-valued (relative to the paired asset) in a given proportion. By allowing short selling, these strategies try to benefit from decreases, not just increases, in prices. Profits are made when the assumed equilibrium is restored.
The system we describe in this chapter can be seen as a generalization of pairs trading. In our setup, only one of the two dependent assets giving rise to the spread is a tradable security under management. The paired asset is instead an artificial one, generated as the result of a data mining process that extracts patterns from a large population of data streams, and utilizes these patterns to build up the synthetic stream in real time. The extracted patterns will be interpreted as being representative of the current market conditions, whereas the synthetic asset will represent the fair price of the target security being traded by the system. The underlying concept that we try to exploit is the existence of time-varying cross-sectional dependencies among securities. Several data mining techniques have been developed lately to capture dependencies among data streams in a time-aware fashion, both in terms of latent factors [12] and clusters [1]. Recent developments include novel database architectures and paradigms such as Complex Event Processing (CEP) that discern patterns in streaming data, ranging from simple correlations to more elaborate queries.
In financial applications, data streams arrive into the system one data point at a time, and quick decisions need to be made. A prerequisite for a trading system to operate efficiently is to learn the novel information content of the most recent data in an incremental way, slowly forgetting the previously acquired knowledge and, ideally, without having to access all the data that have been previously stored. To meet these requirements, our system builds upon incremental algorithms that efficiently process data points as they arrive. In particular, we deploy a modified version of on-line support vector regression [8] as a powerful function approximation device that can discover non-negligible divergences between the paired assets in real time. Streaming financial data are also characterized by the fact that the underlying data generating mechanism is constantly evolving (i.e. it is non-stationary), a notion otherwise referred to as concept drift [12]. Due to this difficulty, particularly in the high-frequency trading spectrum, a trading system's ability to capture profitable inefficiencies has an ever-decreasing half life: where once a system might have remained viable for long periods, it is now increasingly common for a trading system's performance to decay in a matter of days or even hours. Our attempt to deal with this challenge in an autonomous way is based on an ensemble learning approach, where a pool of trading algorithms, or experts, is evolved in parallel and then strategically combined by a master algorithm. The expectation is that combining expert opinion can lead to fewer trading mistakes in all market conditions.

20.2 Backbone of the Asset Management System

In this section we outline the rationale behind the statistical arbitrage system that forms the theme of this chapter, and provide a description of its main components. Our system imports n + 1 cross-sectional financial data streams at discrete time points t = 1, 2, . . .. In the sequel, we will assume that consecutive time intervals are all equal to 24 hours, and that a trading decision is made on a daily basis. Specifically, after importing and processing the data streams at each time t, a decision to either buy or short sell a number of shares of a target security Y is made, and an order is executed. Different sampling frequencies (e.g. irregularly spaced intervals) and trading frequencies could also be incorporated with only minor modifications.
The imported data streams represent the prices of n + 1 assets. We denote by yt the price of the security Y being traded by the system, whereas the remaining n streams, collected in a vector st = (st1, . . . , stn)T, refer to a large collection of financial assets and economic indicators, such as other security prices and indices, which possess some explanatory power in relation to Y. These streams will be used to estimate the fair price of the target asset Y at each observational time point t, in a way that will be specified below. We postulate that the price of Y at each time t can be decomposed into two components, that is yt = zt + mt, where zt represents the current fair price of Y, and the additive term mt represents a potential mispricing. No further assumptions are made regarding the data generating process. Clearly, if the markets were always perfectly efficient, we would have yt = zt at all times. However, when |mt| > 0, an arbitrage opportunity arises. For instance, a negative mt indicates that Y is temporarily under-valued. In this case, it is sensible to expect that the market will promptly react to this temporary inefficiency with the effect of moving the target price up. Under this scenario, an investor would then buy a number of shares hoping that, by time t + 1, a profit proportional to yt+1 − yt will be made.
Our system is designed to identify and exploit possible statistical arbitrage opportunities of this sort in an automated fashion. This trading strategy can be formalized by means of a binary decision rule dt ∈ {0, 1}, where dt = 0 encodes a sell signal and dt = 1 a buy signal. Accordingly, we write

d_t(m_t) = \begin{cases} 0 & m_t > 0 \\ 1 & m_t < 0 \end{cases}    (20.1)

where we have made explicit the dependence on the current mispricing mt = yt − zt. If we denote the change in price observed on the day following the trading decision by rt+1 = yt+1 − yt, we can also introduce a 0–1 loss function L_{t+1}(d_t, r_{t+1}) = |d_t − 1_{(r_{t+1} > 0)}|, where the indicator variable 1_{(r_{t+1} > 0)} equals one if rt+1 > 0 and zero otherwise. For instance, if the system generates a sell signal at time t, but the security's price increases over the next time interval, the system incurs a unit loss.
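As a concrete illustration, the sketch below implements the decision rule (20.1) and the 0–1 loss in Python; the function names and example values are purely illustrative, and the estimated mispricing is assumed to be supplied by the rest of the system.

# Minimal sketch of the trading rule (20.1) and the 0-1 loss; names and inputs are illustrative.

def trading_decision(mispricing):
    """Rule (20.1): sell signal (0) if the asset looks over-valued (m_t > 0), buy (1) otherwise."""
    return 0 if mispricing > 0 else 1

def zero_one_loss(decision, next_day_price_change):
    """Unit loss whenever the decision disagrees with the sign of r_{t+1} = y_{t+1} - y_t."""
    went_up = 1 if next_day_price_change > 0 else 0
    return abs(decision - went_up)

# Example: the security looks under-valued (m_t < 0), so the system buys;
# if the price then falls, a unit loss is incurred.
d_t = trading_decision(-0.35)      # -> 1 (buy)
loss = zero_one_loss(d_t, -0.02)   # -> 1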
Obviously, the fair price zt is never directly observable, and therefore the mispricing mt is also unknown. The system we propose extracts knowledge from the large collection of data streams, and incrementally imputes the fair price zt on the basis of the newly extracted knowledge, in an efficient way. Although we expect some streams to have high explanatory power, most streams will carry little signal and will mostly contribute to generating noise. Furthermore, when n is large, we expect several streams to be highly correlated over time, and highly dependent streams will provide redundant information. To cope with both of these issues, the system extracts knowledge in the form of a feature vector xt, dynamically derived from st, that captures as much information as possible at each time step. We require the components of the feature vector xt to be fewer than n in number, and to be uncorrelated with each other. Effectively, during this step the system extracts informative patterns while performing dimensionality reduction.
As soon as the feature vector xt is extracted, the pattern enters as input to a non-parametric regression model that provides an estimate of the fair price of Y at the current time t. The estimate of zt is denoted by ẑt = ft(xt; φ), where ft(·; φ) is a time-varying function depending upon the specification of a hyperparameter vector φ. With the current ẑt at hand, an estimated mispricing m̂t is computed and used to determine the trading rule (20.1). The major difficulty in setting up this learning step lies in the fact that the true fair price zt is never made available to us, and therefore it cannot be learnt directly. To cope with this problem, we use the observed price yt as a surrogate for the fair price, and note that proper choices of φ can generate sensible estimates ẑt, and therefore realistic mispricings m̂t.
We have thus identified a number of practical issues that will have to be addressed next: (a) how to recursively extract and update the feature vector xt from the streaming data, (b) how to specify and recursively update the pricing function ft(·; φ), and finally (c) how to select the hyperparameter vector φ.

20.3 Expert-based Incremental Learning

In order to extract knowledge from the streaming data and capture important features of the underlying market in real time, the system recursively performs a principal component analysis, and extracts those components that explain a large percentage of the variability in the n streams. Upon arrival, each stream is first normalized so that all streams have equal means and standard deviations. Let us call Ct = E(st stT) the unknown population covariance matrix of the n streams. The algorithm proposed by [16] provides an efficient procedure to incrementally update the eigenvectors of Ct when new data points arrive, in a way that does not require the explicit computation of the covariance matrix. First, note that an eigenvector gt of Ct satisfies the characteristic equation λt gt = Ct gt, where λt is the corresponding eigenvalue. Let us call ĥt the current estimate of Ct gt using all the data up to the current time t. This is given by

\hat{h}_t = \frac{1}{t} \sum_{i=1}^{t} s_i s_i^T g_i,

which is the incremental average of si siT gi, where si siT accounts for the contribution to the estimate of Ci at point i. Observing that gt = ht/||ht||, an obvious choice is to estimate gt as ĥt−1/||ĥt−1||. After some manipulations, a recursive expression for ĥt can be found as

\hat{h}_t = \frac{t-1}{t}\,\hat{h}_{t-1} + \frac{1}{t}\, s_t s_t^T \frac{\hat{h}_{t-1}}{\|\hat{h}_{t-1}\|}    (20.2)
Once the first k eigenvectors have been extracted recursively, the data streams are projected onto these directions in order to obtain the required feature vector xt. We are thus given a sequence of paired observations (y1, x1), . . . , (yt, xt), where each xt is a k-dimensional feature vector representing the latest market information and yt is the price of the security being traded.
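The following Python sketch shows one way the recursion (20.2) and the projection step could be implemented for the leading eigenvector only (k = 1, as used later in the chapter); the class name, the initialization with the first observation, and the use of NumPy are assumptions rather than part of the original system.

import numpy as np

# Sketch of the incremental eigenvector update (20.2) and the projection step.
# Only the first principal direction is tracked (k = 1); variable names are illustrative.

class IncrementalFirstComponent:
    def __init__(self, n_streams):
        self.t = 0
        self.h = np.zeros(n_streams)          # running estimate of C_t g_t

    def update(self, s_t):
        """Update h_t as in (20.2) and return the feature x_t = g_t^T s_t."""
        self.t += 1
        if self.t == 1:
            self.h = s_t.copy()               # initialise with the first observation (one common choice)
        else:
            g_prev = self.h / np.linalg.norm(self.h)
            # s_t s_t^T (h_{t-1} / ||h_{t-1}||) equals s_t times the scalar s_t . g_prev
            self.h = ((self.t - 1) / self.t) * self.h \
                     + (1.0 / self.t) * s_t * float(s_t @ g_prev)
        g_t = self.h / np.linalg.norm(self.h)
        return float(g_t @ s_t)               # projection onto the leading direction

# Usage: feed one normalized stream vector per day, e.g.
# ipc = IncrementalFirstComponent(n_streams=455); x_t = ipc.update(s_t)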
Our objective is to generate an estimate of the target security's fair price using the data points observed so far. In previous work [9, 10], we assumed that the fair price depends linearly on xt and that the linear coefficients are allowed to evolve smoothly over time. Specifically, we assumed that the fair price can be learned by recursively minimizing the following loss function

\sum_{i=1}^{t-1} \left[ (y_i - w_i^T x_i)^2 + C\, (w_{i+1} - w_i)^T (w_{i+1} - w_i) \right]    (20.3)

that is, a penalized version of ordinary least squares. Temporal changes in the time-varying linear regression weights wt result in an additional loss due to the penalty term in (20.3). The severity of this penalty depends upon the magnitude of the regularization parameter C, which is a non-negative scalar: at one extreme, when C gets very large, (20.3) reduces to the ordinary least squares loss function with time-invariant weights; at the other extreme, when C is small, abrupt temporal changes in the estimated weights are permitted. Recursive estimation equations and a connection to the Kalman filter can be found in [10], which also describes a related algorithmic asset management system for trading futures contracts. In this chapter we depart from previous work in two main directions. First, the rather strong linearity assumption is relaxed so as to add more flexibility in modelling the relationship between the extracted market patterns and the security's price. Second, we adopt a different and more robust loss function. According to our new specification, estimated prices ft(xt) that are within ±ε of the observed price yt are always considered fair prices, for a given user-defined positive scalar ε related to the noise level in the data. At the same time, we would also like ft(xt) to be as flat as possible. A standard way to ensure this requirement is to impose an additional penalty controlling the norm of the weights, ||w||² = wTw. For simplicity of exposition, let us suppose again that the function to be learned is linear and can be expressed as ft(xt) = wTxt + b, where b is a scalar representing the bias. Introducing slack variables ξt, ξt* quantifying estimation errors greater than ε, the learning task can be cast as the following minimization problem,

\min_{w_t,\, b_t} \;\; \frac{1}{2} w_t^T w_t + C \sum_{i=1}^{t} (\xi_i + \xi_i^*)    (20.4)


\text{s.t.} \quad \begin{cases} -y_i + (w_i^T x_i + b_i) + \varepsilon + \xi_i \ge 0 \\ \;\;\; y_i - (w_i^T x_i + b_i) + \varepsilon + \xi_i^* \ge 0 \\ \;\;\; \xi_i,\, \xi_i^* \ge 0, \qquad i = 1, \ldots, t \end{cases}    (20.5)
that is, the support vector regression framework originally introduced by Vapnik [15]. In this optimization problem, the constant C is a regularization parameter determining the trade-off between the flatness of the function and the tolerated additional estimation error. A linear loss of |ξt| − ε is imposed any time the error |ξt| is greater than ε, whereas a zero loss is incurred otherwise. Another advantage of the ε-insensitive loss function is that it ensures sparseness of the solution, i.e. the solution will be represented by means of a small subset of sample points. This aspect introduces non-negligible computational speed-ups, which are particularly beneficial in time-aware trading applications. As pointed out before, our objective is to learn from the data in an incremental way. Following well-established results (see, for instance, [5]), the constrained optimization problem defined by Eqs. (20.4) and (20.5) can be solved using a Lagrange function,

L = \frac{1}{2} w_t^T w_t + C \sum_{i=1}^{t} (\xi_i + \xi_i^*) - \sum_{i=1}^{t} (\eta_i \xi_i + \eta_i^* \xi_i^*)
    - \sum_{i=1}^{t} \alpha_i (\varepsilon + \xi_i - y_i + w_t^T x_i + b_t) - \sum_{i=1}^{t} \alpha_i^* (\varepsilon + \xi_i^* + y_i - w_t^T x_i - b_t)    (20.6)

where αi, αi*, ηi and ηi* are the Lagrange multipliers, which have to satisfy positivity constraints for all i = 1, . . . , t. The partial derivatives of (20.6) with respect to w, b, ξ and ξ* are required to vanish for optimality. By doing so, each ηi can be expressed as C − αi and therefore can be removed (and analogously for ηi*). Moreover, we can write the weight vector as wt = ∑ti=1(αi − αi*)xi, and the approximating function can be expressed as a support vector expansion, that is

f_t(x_t) = \sum_{i=1}^{t} \theta_i\, x_i^T x_t + b_t    (20.7)

where each coefficient θi has been defined as the difference αi − αi*. The dual optimization problem leads to another Lagrangian function, and its solution is provided by the Karush-Kuhn-Tucker (KKT) conditions, whose derivation in this context can be found in [13]. After defining the margin function hi(xi) as the difference fi(xi) − yi for all time points i = 1, . . . , t, the KKT conditions can be expressed in terms of θi, hi(xi), ε and C. In turn, each data point (xi, yi) can be classified as belonging to one of the following three auxiliary sets,

S = \{\, i \mid (\theta_i \in [0, +C] \wedge h_i(x_i) = -\varepsilon) \vee (\theta_i \in [-C, 0] \wedge h_i(x_i) = +\varepsilon) \,\}
E = \{\, i \mid (\theta_i = -C \wedge h_i(x_i) \ge +\varepsilon) \vee (\theta_i = +C \wedge h_i(x_i) \le -\varepsilon) \,\}    (20.8)
R = \{\, i \mid \theta_i = 0 \wedge |h_i(x_i)| \le \varepsilon \,\}

and an incremental learning algorithm can be constructed by appropriately allocating new data points to these sets [8]. Our learning algorithm is based on this idea, although our definition (20.8) is different. In [13] we argue that a sequential learning algorithm adopting the original definitions proposed by [8] will not always satisfy the KKT conditions, and we provide a detailed derivation of the algorithm for both incremental learning and forgetting of old data points (C++ code of our implementation is available upon request).
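To make the role of the sets in (20.8) concrete, the sketch below classifies sample indices into S, E and R given the coefficients θi, the margins hi(xi), ε and C. This is only the bookkeeping step, not the incremental algorithm of [8, 13]; the numerical tolerance and all names are illustrative assumptions.

# Sketch of the partition defined in (20.8): support (S), error (E) and remaining (R) sets.
# Only the classification step is shown; the tolerance and names are illustrative.

def classify_points(theta, margin, eps, C, tol=1e-8):
    S, E, R = [], [], []
    for i, (th, h) in enumerate(zip(theta, margin)):
        on_lower = (0.0 <= th <= C) and abs(h + eps) < tol     # theta in [0, +C] and h_i = -eps
        on_upper = (-C <= th <= 0.0) and abs(h - eps) < tol    # theta in [-C, 0] and h_i = +eps
        if on_lower or on_upper:
            S.append(i)
        elif (abs(th + C) < tol and h >= eps) or (abs(th - C) < tol and h <= -eps):
            E.append(i)
        elif abs(th) < tol and abs(h) <= eps:
            R.append(i)
    return S, E, R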
In summary, three parameters affect the estimation of the fair price using support vector regression. First, the C parameter featuring in Eq. (20.4), which regulates the trade-off between model complexity and training error. Second, the parameter ε controlling the width of the ε-insensitive tube used to fit the training data. Finally, the σ value required by the kernel. We collect these three user-defined coefficients in the hyperparameter vector φ. Continuous or adaptive tuning of φ would be particularly important for on-line learning in non-stationary environments, where previously selected parameters may turn out to be sub-optimal in later periods. Some variations of SVR have been proposed in the literature (e.g. in [3]) in order to deal with these difficulties. However, most algorithms proposed for financial forecasting with SVR operate in an off-line fashion and try to tune the hyperparameters using either exhaustive grid searches or other search strategies (for instance, evolutionary algorithms), which are very computationally demanding.
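To illustrate how a single expert with hyperparameters φ = (C, ε, σ) could turn the feature xt into an estimated fair price and a trading signal, the sketch below uses scikit-learn's batch ε-insensitive SVR with a Gaussian kernel, refit on a rolling window, as a stand-in for the on-line algorithm of [13]; the class name, window length, warm-up rule and kernel parameterization (gamma = 1/(2σ²)) are assumptions.

import numpy as np
from sklearn.svm import SVR

# One SVR "expert" with hyperparameters phi = (C, epsilon, sigma). A batch SVR refit on a
# rolling window stands in for the on-line algorithm described in the text.

class FairPriceExpert:
    def __init__(self, C=1.0, epsilon=0.01, sigma=1.0, window=250):
        self.model = SVR(kernel="rbf", C=C, epsilon=epsilon, gamma=1.0 / (2.0 * sigma ** 2))
        self.window = window
        self.X, self.y = [], []

    def step(self, x_t, y_t):
        """Return the trading signal d_t in {0, 1} for today's feature vector and price."""
        self.X.append(np.atleast_1d(x_t))
        self.y.append(y_t)
        X = np.asarray(self.X[-self.window:])
        y = np.asarray(self.y[-self.window:])
        if len(y) < 10:                       # warm-up period: default to a buy signal
            return 1
        self.model.fit(X, y)                  # the observed price serves as a surrogate for the fair price
        z_hat = self.model.predict(np.atleast_1d(x_t).reshape(1, -1))[0]
        m_hat = y_t - z_hat                   # estimated mispricing
        return 0 if m_hat > 0 else 1          # trading rule (20.1)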
Rather than trying to optimize φ, we take an ensemble learning approach: an entire population of p SVR experts is continuously evolved, in parallel, with each expert being characterized by its own parameter vector φ^(e), e = 1, . . . , p. Each expert, based on its own opinion regarding the current fair value of the target asset (i.e. an estimate z_t^(e)), generates a binary trading signal of the form (20.1), which we now denote by d_t^(e). A meta-algorithm is then responsible for combining the p trading signals generated by the experts. Thus formulated, the algorithmic trading problem is related to the task of predicting binary sequences from expert advice, which has been extensively studied in the machine learning literature and is related to sequential portfolio selection decisions [4]. Our goal is for the trading algorithm to perform nearly as well as the best expert in the pool so far: that is, to guarantee that at any time our meta-algorithm does not perform much worse than whichever expert has made the fewest mistakes to date. The implicit assumption is that, out of the many SVR experts, some of them are able to capture temporary market anomalies and therefore make good predictions.
The specific expert combination scheme that we have decided to adopt here is the Weighted Majority Voting (WMV) algorithm introduced in [7]. The WMV algorithm maintains a list of non-negative weights ω1, . . . , ωp, one for each expert, and predicts based on a weighted majority vote of the expert opinions. Initially, all weights are set to one. The meta-algorithm forms its prediction by comparing the total weight q0 of the experts in the pool that predict 0 (short sell) to the total weight q1 of the experts predicting 1 (buy). These two proportions are computed, respectively, as

q_0 = \sum_{e\,:\, d_t^{(e)} = 0} \omega_e \qquad \text{and} \qquad q_1 = \sum_{e\,:\, d_t^{(e)} = 1} \omega_e.

The final trading decision taken by the WMV algorithm is

d_t^{(*)} = \begin{cases} 0 & \text{if } q_0 > q_1 \\ 1 & \text{otherwise} \end{cases}    (20.9)
Each day the meta-algorithm is told whether or not its last trade was successful, and a 0–1 penalty is applied, as described in Section 20.2. Each time the WMV algorithm incurs a loss, the weights of all those experts in the pool that agreed with the master algorithm are each multiplied by a fixed scalar coefficient β selected by the user, with 0 < β < 1. That is, when an expert e makes a mistake, its weight is downgraded to βωe. For a chosen β, WMV gradually decreases the influence of experts that make a large number of mistakes, and gives the experts that make few mistakes high relative weights.
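A minimal sketch of the WMV meta-algorithm just described is given below; the interface (a list of expert signals per day and a feedback call with the realized price change) is an assumption, while the initialization, the vote (20.9) and the multiplicative penalty follow the text.

# Minimal sketch of Weighted Majority Voting over the pool of experts; interface names
# are illustrative, the update rule follows the description in the text.

class WeightedMajorityVoting:
    def __init__(self, n_experts, beta=0.7):
        assert 0.0 < beta < 1.0
        self.beta = beta
        self.weights = [1.0] * n_experts          # all weights start at one

    def decide(self, signals):
        """Combine the expert signals d_t^(e) in {0, 1} into the meta-decision d_t^(*) of (20.9)."""
        q0 = sum(w for w, d in zip(self.weights, signals) if d == 0)
        q1 = sum(w for w, d in zip(self.weights, signals) if d == 1)
        return 0 if q0 > q1 else 1

    def feedback(self, signals, decision, next_day_price_change):
        """After observing r_{t+1}, downgrade the experts that agreed with a losing meta-decision."""
        correct = 1 if next_day_price_change > 0 else 0
        if decision != correct:
            self.weights = [w * self.beta if d == decision else w
                            for w, d in zip(self.weights, signals)]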

20.4 An Application to the iShare Index Fund

Our empirical analysis is based on historical data for an exchange-traded fund (ETF). ETFs are relatively new financial instruments that have exploded in popularity over the last few years. ETFs are securities that combine elements of both index funds and stocks: like index funds, they are pools of securities that track specific market indexes at a very low cost; like stocks, they are traded on major stock exchanges and can be bought and sold anytime during normal trading hours. Our target security is the iShare S&P 500 Index Fund, one of the most liquid ETFs. The historical time series data cover a period of about seven years, from 19/05/2000 to 28/06/2007, for a total of 1856 daily observations. This fund tracks the S&P 500 Price Index very closely and therefore generates returns that are highly correlated with the underlying market conditions. Given the nature of our target security, the explanatory data streams are taken to be a subset of all constituents of the underlying S&P 500 Price Index comprising n = 455 stocks, namely all those stocks whose historical data were available over the entire period chosen for our analysis. The results we present here are generated out-of-sample by emulating the behavior of a real-time trading system. At each time point, the system first projects the most recently arrived data points onto a space of reduced dimension. In order to implement this step, we have set k = 1 so that only the first eigenvector is extracted. Our choice is backed up by empirical evidence, commonly reported in the financial literature, that the first principal component of a group of securities captures the market factor (see, for instance, [2]). Optimal values of k > 1 could be inferred from the streaming data in an incremental way, but we do not discuss this direction any further here.

Table 20.1 Statistical and financial indicators summarizing the performance of the 2560 experts over the entire data set. We use the following notation: SR = Sharpe Ratio, WT = Winning Trades, LT = Losing Trades, MG = Mean Gain, ML = Mean Loss, and MDD = Maximum Drawdown. PnL, WT, LT, MG, ML and MDD are reported as percentages.

Summary   Gross SR  Net SR  Gross PnL  Net PnL  Volatility    WT     LT    MG    ML   MDD
Best          1.13    1.10      17.90    17.40       15.90  50.16  45.49  0.77  0.70  0.20
Worst        -0.36   -0.39      -5.77    -6.27       15.90  47.67  47.98  0.72  0.76  0.55
Average       0.54    0.51       8.50     8.00       15.83  48.92  46.21  0.75  0.72  0.34
Std           0.36    0.36       5.70     5.70        0.20   1.05   1.01  0.02  0.02  0.19

With the chosen grid of values for each one of the three key parameters (ε varies between 10−1 and 10−8, while both C and σ vary between 0.0001 and 1000), the pool comprises 2560 experts. The performance of these individual experts is summarized in Table 20.1, which also reports a number of financial indicators (see the caption for details). In particular, the Sharpe Ratio provides a measure of risk-adjusted return, and is computed as the average return produced by an expert over the entire period divided by its standard deviation. For instance, the best expert over the entire period achieves a promising 1.13 ratio, while the worst expert yields negative risk-adjusted returns. The maximum drawdown represents the total percentage loss experienced by an expert before it starts winning again. From this table, it clearly emerges that choosing the right parameter combination, or expert, is crucial for this application, and relying on a single expert is a risky choice.
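As a reference for how these indicators could be computed, the sketch below derives the Sharpe Ratio and the maximum drawdown from a series of daily returns; the annualization factor is an assumption (the text only defines the ratio of average return to its standard deviation), and the function names are illustrative.

import numpy as np

# Sketch of the two summary indicators used in Table 20.1, computed from daily returns.
# The annualization factor of 252 trading days is an assumption; names are illustrative.

def sharpe_ratio(daily_returns, periods_per_year=252):
    r = np.asarray(daily_returns, dtype=float)
    return float(np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1))

def max_drawdown(daily_returns):
    """Largest peak-to-trough loss of the cumulative return curve, as a fraction."""
    wealth = np.cumprod(1.0 + np.asarray(daily_returns, dtype=float))
    running_peak = np.maximum.accumulate(wealth)
    return float(np.max(1.0 - wealth / running_peak))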

Fig. 20.1 Time-dependency of the best expert: each square represents the expert that produced the highest Sharpe Ratio during the last trading month (22 days). The horizontal line indicates the best expert overall. Historical window sizes of different lengths produced very similar patterns. (Axes: Expert Index versus Month.)

However, even if an optimal parameter combination could be quickly identified, it would soon become sub-optimal. As anticipated, the best performing expert in the pool varies dynamically and quite rapidly across time. This important aspect can be appreciated by looking at the pattern reported in Figure 20.1, which identifies the best expert over time by considering the Sharpe Ratio generated in the last trading month.

Fig. 20.2 Sharpe Ratio produced by two competing strategies, Follow the Best Expert (FBE) and Majority Voting (MV), as a function of window size. (Axes: Sharpe Ratio versus Window Size, with reference levels for the Best, Average and Worst expert.)

Fig. 20.3 Sharpe Ratio produced by Weighted Majority Voting (WMV) as a function of the β parameter. See Table 20.2 for more summary statistics. (Axes: Sharpe Ratio versus β, with reference levels for the Best, Average and Worst expert.)
Fig. 20.4 Comparison of profit and losses generated by Buy-and-Hold (B&H) versus Weighted Majority Voting (WMV), after costs (see the text for details). (Axes: P&L versus Day.)

From these results, it clearly emerges that the overall performance of the system may be improved by dynamically selecting or combining experts. For comparison, we also present results produced by two alternative strategies. The first one, which we call Follow the Best Expert (FBE), consists in following the trading decision of the best performing expert seen so far, where again the optimality criterion used to elect the best expert is the Sharpe Ratio. That is, on each day, the best expert is the one that generated the highest Sharpe Ratio over the last m trading days, for a given value of m. The second algorithm is Majority Voting (MV). Analogously to WMV, this meta-algorithm combines the (unweighted) opinions of all the experts in the pool and takes a majority vote. In our implementation, a majority vote is reached if the number of experts deliberating for either one of the trading signals represents a fraction of the total experts at least as large as q, where the optimal q value is learnt by the MV algorithm on each day using the last m trading days. Figure 20.2 reports the Sharpe Ratio obtained by these two competing strategies, FBE and MV, as a function of the window size m. The overall performance of a simple-minded strategy such as FBE falls well below the average expert performance, whereas MV always outperforms the average expert. For some specific values of the window size (around 240 days), MV even improves upon the best model in the pool.
The WMV algorithm depends upon only one parameter, the scalar β. Figure 20.3 shows that WMV consistently outperforms the average expert regardless of the chosen β value. More surprisingly, for a wide range of β values, this algorithm also outperforms the best performing expert by a large margin (Figure 20.3). Clearly, the WMV strategy is able to strategically combine the expert opinions in a dynamic way. As our ultimate measure of profitability, we compare financial returns generated by WMV with returns generated by a simple Buy-and-Hold (B&H) investment strategy. Figure 20.4 compares the profits and losses obtained by our algorithmic trading system with B&H, and illustrates the typical market-neutral behavior of the active trading system. Furthermore, we have attempted to include realistic estimates of transaction costs, and to characterize the statistical significance of these results. Only estimated and visible costs are considered here, such as bid-ask spreads and fixed commission fees. The bid-ask spread on a security represents the difference between the lowest available quote to sell the security under consideration (the ask or the offer) and the highest available quote to buy the same security (the bid). Historical tick-by-tick data gathered from a number of exchanges using the OpenTick provider have been used to estimate bid-ask spreads in terms of basis points, or bps (a basis point is defined here as 10000 (a − b)/m, where a is the ask, b is the bid, and m is their average). In 2005 we observed a mean bps of 2.46, which went down to 1.55 in 2006 and to 0.66 in 2007. On the basis of these findings, all the net results presented in Table 20.2 assume an indicative estimate of 2 bps and a fixed commission fee ($10).
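As a quick worked example of the basis-point formula above (with illustrative quotes): for an ask of a = 100.01 and a bid of b = 99.99, the mid-price is m = 100.00 and the spread amounts to 10000 × 0.02/100.00 = 2 bps, matching the indicative cost estimate used in Table 20.2.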
Finally, one may be tempted to question whether very high risk-adjusted returns, such as those generated by WMV with our data, could have been produced only by chance. In order to address this question and gain an understanding of the statistical significance of our empirical results, we first approximate the Sharpe Ratio distribution (after costs) under the hypothesis of random trading decisions, i.e. when sell and buy signals are generated on each day with equal probabilities, using Monte Carlo simulation. Based upon 10,000 repetitions, this distribution has mean −0.012 and standard deviation 0.404. With reference to this distribution, we are then able to compute empirical p-values associated with the observed Sharpe Ratios, after costs; see Table 20.2. For instance, we note that a value as high as 1.45 or higher (β = 0.7) would have been observed by chance only in 10 out of 10,000 cases. These findings support our belief that the SVR-based algorithmic trading system does capture informative signals and produces statistically meaningful results.
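A sketch of the randomization test just described is given below: Sharpe Ratios obtained from purely random daily long/short decisions form a reference distribution, and the empirical p-value of an observed ratio is the fraction of simulations that match or exceed it. Transaction costs and the exact return definition are simplifying assumptions here, as are all names.

import numpy as np

# Reference distribution of Sharpe Ratios under random daily trading decisions, and the
# empirical p-value of an observed ratio. Costs are omitted for brevity; names are illustrative.

def random_trading_sharpe_distribution(daily_returns, n_sims=10000, seed=0):
    rng = np.random.default_rng(seed)
    r = np.asarray(daily_returns, dtype=float)
    sharpes = np.empty(n_sims)
    for k in range(n_sims):
        position = rng.integers(0, 2, size=r.size) * 2 - 1   # -1 = short, +1 = long, equal probability
        strategy_returns = position * r
        sharpes[k] = np.sqrt(252) * strategy_returns.mean() / strategy_returns.std(ddof=1)
    return sharpes

def empirical_p_value(observed_sharpe, simulated_sharpes):
    """Fraction of random-trading simulations with a Sharpe Ratio at least as large as the observed one."""
    return float(np.mean(simulated_sharpes >= observed_sharpe))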

Table 20.2 Statistical and financial indicators summarizing the performance of Weighted Majority Voting (WMV) as a function of β. See the caption of Table 20.1 and Section 20.4 for more details.

  β   Gross SR  Net SR  Gross PnL  Net PnL  Volatility    WT     LT    MG    ML   MDD  p-value
0.5       1.34    1.31      21.30    20.80       15.90  53.02  42.63  0.74  0.73  0.24    0.001
0.6       1.33    1.30      21.10    20.60       15.90  52.96  42.69  0.75  0.73  0.27    0.001
0.7       1.49    1.45      23.60    23.00       15.90  52.71  42.94  0.76  0.71  0.17    0.001
0.8       1.18    1.15      18.80    18.30       15.90  51.84  43.81  0.75  0.72  0.17    0.002
0.9       0.88    0.85      14.10    13.50       15.90  50.03  45.61  0.76  0.71  0.25    0.014

References

1. C.C. Aggarwal, J. Han, J. Wang, and P.S. Yu. Data Streams: Models and Algorithms, chapter On Clustering Massive Data Streams: A Summarization Paradigm, pages 9–38. Springer, 2007.
2. C. Alexander and A. Dimitriu. Sources of over-performance in equity markets: mean reversion, common trends and herding. Technical report, ISMA Center, University of Reading, UK, 2005.
3. L. Cao and F. Tay. Support vector machine with adaptive parameters in financial time series forecasting. IEEE Transactions on Neural Networks, 14(6):1506–1518, 2003.
4. N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
5. N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
6. R.J. Elliott, J. van der Hoek, and W.P. Malcolm. Pairs trading. Quantitative Finance, 5(3):271–276, 2005.
7. N. Littlestone and M.K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–226, 1994.
8. J. Ma, J. Theiler, and S. Perkins. Accurate on-line support vector regression. Neural Computation, 15(11):2683–2703, 2003.
9. G. Montana, K. Triantafyllopoulos, and T. Tsagaris. Data stream mining for market-neutral
algorithmic trading. In Proceedings of the ACM Symposium on Applied Computing, pages
966–970, 2008.
10. G. Montana, K. Triantafyllopoulos, and T. Tsagaris. Flexible least squares for
temporal data mining and statistical arbitrage. Expert Systems with Applications,
doi:10.1016/j.eswa.2008.01.062, 2008.
11. J. G. Nicholas. Market-Neutral Investing: Long/Short Hedge Fund Strategies. Bloomberg
Professional Library, 2000.

12. S. Papadimitriou, J. Sun, and C. Faloutsos. Data Streams: Models and Algorithms, chapter
Dimensionality reduction and forecasting on streams, pages 261–278. Springer, 2007.
13. F. Parrella and G. Montana. A note on incremental support vector regression. Technical report,
Imperial College London, 2008.
14. A. Pole. Statistical Arbitrage. Algorithmic Trading Insights and Techniques. Wiley Finance,
2007.
15. V. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
16. J. Weng, Y. Zhang, and W.S. Hwang. Candid covariance-free incremental principal component analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(8):1034–1040, 2003.
