Algorithmic Trading Strategy Based On Massive Data Mining
Abstract
We believe that useful information hides behind the noisy and massive market data and can provide insight into the financial markets. Our goal in this project is to find a strategy that selects profitable U.S. stocks every day by mining public data. To achieve this we build models that predict the daily return of a stock from a set of features. These features are constructed from quote data and external data that are available before the prediction date. We consider both regression and classification approaches, and several supervised learning algorithms are implemented. In order to capture the dynamic nature of the financial market, we carefully design out-of-sample testing and cross-validation procedures to ensure that our historical test results are reasonable and achievable in the real market. Finally, we construct stock portfolios based on our forecast models and illustrate the performance of these portfolios to show that our strategy indeed works.
III. Targets and Features Construction

$$HL_{i,t} = \frac{\mathrm{HighPrice}_{i,t}}{\mathrm{LowPrice}_{i,t}} \tag{4}$$
Figure 1: The upper figure shows the cross-sectional averages of the direct features, including RET, HL, VOL and VOLCHNG. The lower figure shows the cross-sectional averages of the indirect features, including gold, crude oil, natural gas, corn, cotton, USDvsEUR, USDvsJPY and Treasury.
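As an illustration, a minimal sketch of the feature construction is given below. It assumes the daily quotes sit in a pandas DataFrame indexed by (ticker, date) with hypothetical columns HighPrice, LowPrice, ClosePrice and Volume; only HL follows Eq. (4) exactly, while the definitions of RET and VOLCHNG and the way the external indices of Figure 1 are joined are our assumptions rather than the paper's.

```python
import pandas as pd

def build_direct_features(quotes: pd.DataFrame) -> pd.DataFrame:
    """Construct per-stock daily features from quote data.

    `quotes` is assumed to have a MultiIndex with levels ("ticker", "date")
    and columns HighPrice, LowPrice, ClosePrice, Volume.
    """
    feats = pd.DataFrame(index=quotes.index)
    # Eq. (4): intraday high/low ratio.
    feats["HL"] = quotes["HighPrice"] / quotes["LowPrice"]
    # Daily return and volume change (plausible definitions, assumed here).
    grouped = quotes.groupby(level="ticker")
    feats["RET"] = grouped["ClosePrice"].pct_change()
    feats["VOLCHNG"] = grouped["Volume"].pct_change()
    feats["VOL"] = quotes["Volume"]
    return feats

def add_indirect_features(feats: pd.DataFrame, indices: pd.DataFrame) -> pd.DataFrame:
    """Join external index levels (gold, crude oil, ...) onto each (ticker, date)
    row by date, so that only data known before the prediction date is used."""
    return feats.join(indices, on="date")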
Now that we have specified our targets and features, the next step is to implement specific machine learning algorithms.

Since we have specified two ways of defining targets (numerical or categorical), we have two ways of predicting the performance of stocks: classification and regression. In the classification set-up we predict the trend of a stock on a specific day, while in the regression set-up we predict its exact return. For simplicity we first try linear models: logistic regression as the classification model and linear regression as the regression model. Then we implement SVM models (classifier and regression) to explore possible non-linear regularities using kernels. Before feeding the features into the models we also normalize and centralize them to mean 0 and standard deviation 1. However, our model is dynamic rather than fixed, depending on the date at which a return is to be predicted. Specifically, our procedure is as follows:

1. Specify a training window length W (the number of past trading days used to train the models).

2. To predict the performance of stocks on date Ti, use the sample from days Ti − 1, Ti − 2, Ti − 3, ..., Ti − W as the training set to train the models.
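A minimal sketch of this rolling training scheme is shown below, using scikit-learn. The data layout (one feature matrix and one target vector per trading day), the use of StandardScaler for the mean-0/std-1 normalization, and the helper names are our assumptions; only the window logic and the choice of logistic/linear regression come from the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.preprocessing import StandardScaler

def predict_day(features_by_day, returns_by_day, dates, i, W, classify=True):
    """Train on days T_{i-W}..T_{i-1} and predict day T_i.

    features_by_day[d] -> (n_stocks, n_features) array for date d
    returns_by_day[d]  -> (n_stocks,) array of realized daily returns for date d
    """
    train_days = dates[i - W:i]
    X_train = np.vstack([features_by_day[d] for d in train_days])
    y_train = np.concatenate([returns_by_day[d] for d in train_days])
    X_test = features_by_day[dates[i]]

    # Normalize and centralize the features to mean 0, standard deviation 1.
    scaler = StandardScaler().fit(X_train)
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

    if classify:
        # Categorical target: up (1) vs. down (0) move.
        model = LogisticRegression(max_iter=1000).fit(X_train, (y_train > 0).astype(int))
        return model.decision_function(X_test)  # score used to rank stocks
    else:
        # Numerical target: the return itself.
        model = LinearRegression().fit(X_train, y_train)
        return model.predict(X_test)  # predicted daily return
```

Looping this over every trading date, with W = 5 as in Figure 2, yields one prediction per stock per day.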
After we have generated predictions for every day, we should measure how good the predictions are. We can compute each day's classification accuracy for the classification models and the root-mean-square error for the regression models, but it would then be hard to compare across these two categories. We therefore give a more practical and visualizable way of measuring performance: testing the performance of the stocks selected by our models. The methodology is as follows:

1. Specify a portfolio size (the number of stocks to be picked) N.

2. Every day choose N stocks according to the predictions generated by the models.
For the regression models we simply choose the N stocks that are predicted to have the highest returns, and for the classification models we choose the N stocks that are classified most confidently, i.e., with the largest scores from the classifier.

3. Compute the actual return of each day's stock basket and compare this time series to a market index such as the S&P 500. Furthermore, we can label a specific day as 'successful' if the portfolio we selected has a greater return than the market index, and as 'unsuccessful' otherwise. Then we can compute the success rate for each model:

$$SR = \frac{\text{No. of successful days}}{\text{No. of total days}} \tag{11}$$
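A minimal sketch of this evaluation loop is given below. It assumes per-day arrays of model scores and realized returns plus a series of S&P 500 daily returns; the variable names and the equal weighting of the basket are our assumptions.

```python
import numpy as np

def evaluate_portfolio(scores_by_day, returns_by_day, index_return_by_day, N=100):
    """Pick the N highest-scored stocks each day and compute the success rate of Eq. (11).

    scores_by_day[d]       -> (n_stocks,) predicted return or classifier score for date d
    returns_by_day[d]      -> (n_stocks,) realized return for date d
    index_return_by_day[d] -> realized market-index return (e.g., S&P 500) for date d
    """
    successful, total, basket_returns = 0, 0, []
    for d, scores in scores_by_day.items():
        top = np.argsort(scores)[-N:]               # indices of the N best-ranked stocks
        basket_ret = returns_by_day[d][top].mean()  # equal-weighted basket return (assumed)
        basket_returns.append(basket_ret)
        successful += basket_ret > index_return_by_day[d]
        total += 1
    sr = successful / total                          # Eq. (11)
    cumulative = np.prod(1.0 + np.array(basket_returns)) - 1.0
    return sr, cumulative
```

The same loop also yields the cumulative-return curves plotted in Figures 2 and 3.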
Then we compare the different models and present the one with the best success rate. The results are shown in Figure 2 and Figure 3.
V. Discussions and Analysis

The linear models behave well. From the portfolio return graphs we can see that the linear classification model and the linear regression model give similar results; both the regression approach and the classification approach can capture the underlying regularities.

The support vector machines do not behave well enough. One possible reason is that for such extremely noisy data, simple linear models can perform better. The SVM classifier's performance is especially poor. One possible reason is that the 'confidence' of the classification, i.e., the decision function value that we sort as an indicator of potential success, does not make much sense in the non-linear case. Using more data to train each day's model may improve the performance of the SVMs, but the computational cost would increase significantly since we retrain our models for each trading day.
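For reference, a minimal sketch of the SVM variants is given below, reusing the rolling-window data layout from the earlier sketch. The Gaussian (RBF) kernel and C = 0.85 follow the Figure 3 caption; treating the decision function as the ranking score is the interpretation discussed above, and everything else is an assumption.

```python
from sklearn.svm import SVC, SVR

def svm_scores(X_train, y_train_returns, X_test, classify=True):
    """Rank-scores from SVM models with a Gaussian (RBF) kernel and C = 0.85.

    X_train / X_test are assumed to be already normalized to mean 0, std 1.
    """
    if classify:
        clf = SVC(kernel="rbf", C=0.85).fit(X_train, (y_train_returns > 0).astype(int))
        # Signed distance to the separating surface, used (perhaps unreliably,
        # as discussed above) to rank stocks for the top-N portfolio.
        return clf.decision_function(X_test)
    else:
        reg = SVR(kernel="rbf", C=0.85).fit(X_train, y_train_returns)
        return reg.predict(X_test)  # predicted daily return
```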
Another interesting question is whether our models behave stably over time. From the graphs we find that the two linear models behave relatively stably before the end of 2008. To quantify this we compute SR over consecutive 80-day windows and plot the resulting time series in Figure 4.
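A minimal sketch of this block-wise success-rate computation, assuming a boolean array beat_index with one entry per trading day (True when the selected portfolio beat the index); the naming is ours.

```python
import numpy as np

def blockwise_sr(beat_index: np.ndarray, block: int = 80) -> np.ndarray:
    """Success rate of Eq. (11) over consecutive, non-overlapping blocks of trading days."""
    n_blocks = len(beat_index) // block
    trimmed = beat_index[: n_blocks * block].reshape(n_blocks, block)
    return trimmed.mean(axis=1)  # one SR value per 80-day block, as plotted in Figure 4
```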
VI. Conclusion

To conclude, we derived an approach to predict daily returns of U.S. stocks based on their trading data and external financial indices. Our linear models work well in both the regression framework and the classification framework. The best model turns out to be the linear classifier, logistic regression, which gives a 56.65% success rate and a 2000% cumulative return over 14 years. However, as time passes the models tend to behave less stably, especially after 2008.

In the future we can try to find methods that give more stable predictions. From the machine learning perspective we can try mixing different models or training our models with more data every day. From the investment perspective we can make predictions on 'alphas' rather than raw returns. We can also try to extract more information from text data: news and social networks can be excellent information sources for predicting stock returns.
Figure 2: The left figure shows the results of the logistic regression model and the right figure those of the linear regression model; each panel plots the cumulative return of the portfolio selected by the corresponding model. In the implementation we specify the window parameter W = 5 and the portfolio size N = 100. The logistic regression gives SR = 56.65% and the linear regression gives SR = 55.82%. For the linear models, both the classification setup and the regression setup behave well: $1 invested in 2000 became about $20 in 2014 if we had continually implemented the trades suggested by these two models. This success indicates that our approach did extract information from the data we collected.
Figure 3: The left figure shows the SVM classifier and the right figure the SVM regression, each plotting the return of the selected portfolio. We use a Gaussian kernel, and since the financial data is highly noisy we set the parameter C to 0.85. Surprisingly, the SVMs give worse results than the linear models: the SVM classifier achieves SR = 49.56% and the SVM regression SR = 53.18%. Adjusting the window parameter W and the regularization parameter C does not improve the results much.
Figure 4: All models tend to behave less well as time goes by, especially after 2008. We believe that from that time on, daily stock price changes depend on more factors than those we have captured.