
Algorithmic Trading Strategy Based On Massive Data Mining


Haoming Li, Zhijun Yang and Tianlun Li
Stanford University

Abstract

We believe that useful information is hidden behind the noisy, massive data and can give us insight into the financial markets. Our goal in this project is to find a strategy for selecting profitable U.S. stocks every day by mining public data. To achieve this we build models that predict the daily return of a stock from a set of features. These features are constructed from quoted and external data that is available before the prediction date. We consider both regression and classification approaches, and several supervised learning algorithms are implemented. To capture the dynamic nature of the financial market, we carefully design out-of-sample testing and cross-validation procedures to ensure that our historical test results are reasonable and achievable in the real market. Finally, we construct stock portfolios based on our forecast models and illustrate the performance of these portfolios to show that our strategy indeed works.

I. Introduction

How can we discover stocks that will rise in the future? The general answer is to gather as much relevant and non-trivial information as possible. One possible way to obtain such information is to mine the huge amount of financial and Internet data that cannot be easily understood. This data allows us to define various features for each individual stock. For example, we can distinguish different stocks by their historical performance, trading volume, or sensitivity to external economic and financial variables. Then we can use machine learning models to discover the underlying relation between these features and the actual performance of stocks. Finally, we can select the stocks that are predicted to have the highest returns.

The report is organized as follows. In Part 2 we discuss what data we are using and how we collected and processed it. In Part 3 we introduce our methodology for constructing features. Part 4 presents the machine learning models we implement and the procedures for dynamically training and testing them. Part 5 gives the results and the performance of our daily selected portfolio, together with discussion and analysis of those results. In Part 6 we draw the summary.

II. Data description

We collected daily trading data for 2666 U.S. stocks trading (or once traded) on NYSE or NASDAQ from 2000-01-01 to 2014-11-10. This dataset includes each day's open price, close price, highest price, lowest price, and trading volume for every stock. The data is collected from Quandl, a free online database.

Meanwhile, we also collected data that is not directly related to any single stock but may contain additional information for forecasting purposes. This includes the daily quotes of 5 commodity futures contracts (gold, crude oil, natural gas, corn, cotton), 2 foreign currencies (EUR, JPY), and 1 interest rate (the 10-year Treasury rate), all from 2000-01-01 to 2014-11-10. The aggregate size of all data files is 1.11 GB.
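As an illustration of how such a dataset might be assembled, the following sketch pulls daily bars for a few tickers from Quandl and stacks them into a single panel. It is a minimal sketch only: it assumes the quandl Python package and pandas, and the dataset code WIKI/<ticker>, the API-key placeholder, and the column names are illustrative assumptions, not the exact sources or code used in this project.

```python
import pandas as pd
import quandl

# Hypothetical API key; Quandl requires one beyond a small number of anonymous calls.
quandl.ApiConfig.api_key = "YOUR_API_KEY"

def fetch_daily_bars(ticker, start="2000-01-01", end="2014-11-10"):
    """Download one stock's daily OHLCV history (WIKI end-of-day dataset assumed)."""
    df = quandl.get(f"WIKI/{ticker}", start_date=start, end_date=end)
    # Keep only the fields used in this paper: open, high, low, close, volume.
    df = df[["Open", "High", "Low", "Close", "Volume"]]
    df["Ticker"] = ticker
    return df

# Stack a few tickers into one long panel indexed by (date, ticker);
# the ticker list here is an illustrative subset of the 2666 stocks.
tickers = ["AAPL", "MSFT", "XOM"]
panel = pd.concat([fetch_daily_bars(t) for t in tickers])
panel = panel.set_index("Ticker", append=True)
print(panel.head())
```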

III. Targets and Features Construction

As our goal is to predict the daily return of each stock, we naturally define our target as stock i's daily return on day t, for all i and t:

    Target_{i,t} = \frac{ClosePrice_{i,t}}{OpenPrice_{i,t}} - 1    (1)

Note that we can also focus only on the direction of the move, ignoring its amplitude. Another way of defining our targets is:

    Target'_{i,t} = \mathrm{sign}\!\left( \frac{ClosePrice_{i,t}}{OpenPrice_{i,t}} - 1 \right)    (2)

The first definition gives a regression problem and the second a classification problem. Both setups will be tried.

We then have to construct features that help distinguish (or, say, define) each stock every day. These features should be relevant to performance and should be available before the trading day. It is well known that stock performance is correlated with dozens of factors; for simplicity, our model employs only a relatively small number of features in this paper.

The features we constructed can be divided into two categories. The first category, named direct features, contains variables constructed from explicit (and lagged) market data of the stocks, e.g., open, close, high, low, etc. The other category, named indirect features, concerns the information carried by external factors. That is, we construct one feature for each external variable to reflect how a specific stock can be affected on a certain day when that external variable changes. We now discuss in more detail how we construct the features in each category.

Direct Features:
Based on our raw data, we constructed 4 direct features:

    RET_{i,t} = \frac{ClosePrice_{i,t-1}}{OpenPrice_{i,t-1}} - 1 = Target_{i,t-1}    (3)

    HL_{i,t} = \frac{HighPrice_{i,t-1}}{LowPrice_{i,t-1}}    (4)

    VOL_{i,t} = \ln\left( TradingVolume_{i,t-1} \right)    (5)

    VOLCHNG_{i,t} = \ln\left( \frac{TradingVolume_{i,t-1}}{TradingVolume_{i,t-2}} \right)    (6)

These four features are properly lagged so that they can be computed before day t, and they are all relevant since they measure trends or the relative strength of each stock.

Indirect Features:
The intuition behind constructing indirect features corresponding to external economic indices is to compute the 'sensitivity' of each stock's return to these indices and multiply this sensitivity by the latest index values. As mentioned above, we have data on 8 external indices (5 commodity futures, 2 foreign currencies, and 1 interest rate). For index j, we define the corresponding feature for stock i on day t as:

    EXTRN^j_{i,t} = \beta^j_{i,t} \times index^j_{t-1}    (7)

where

    (c, \beta^1_{i,t}, \ldots, \beta^8_{i,t})^T = \arg\min_{\beta} \| X_{i,t}\,\beta - y_{i,t} \|^2    (8)

with

    X_{i,t} = \begin{pmatrix} 1 & index^1_{t-2} & \cdots & index^8_{t-2} \\ \vdots & \vdots & & \vdots \\ 1 & index^1_{t-T-1} & \cdots & index^8_{t-T-1} \end{pmatrix}    (9)

    y_{i,t} = \left( Target_{i,t-1}, \ldots, Target_{i,t-T} \right)^T    (10)

Here T is a window-period parameter used to compute the sensitivity. Intuitively, these 8 indirect features describe the change in stock i's likely price move at time t with respect to the change of index j at time t-1. Since everything defined here is lagged, the values of these 8 indirect features are available before the trading day.

Thus we construct 12 features for each stock. Cross-sectional averages of these features are shown in Figure 1.
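A minimal sketch of this feature construction, assuming pandas and NumPy. It assumes a per-stock DataFrame of daily bars with columns Open, Close, High, Low, Volume indexed by date, a DataFrame of the 8 external index levels on the same dates, and a window length T = 60; all of these names and values are illustrative, not the authors' actual code.

```python
import numpy as np
import pandas as pd

def direct_features(bars: pd.DataFrame) -> pd.DataFrame:
    """Compute Target (eq. 1) and the 4 lagged direct features (eqs. 3-6) for one stock."""
    f = pd.DataFrame(index=bars.index)
    f["Target"] = bars["Close"] / bars["Open"] - 1.0                          # eq. (1), same-day return
    f["RET"] = f["Target"].shift(1)                                           # eq. (3), previous day's return
    f["HL"] = (bars["High"] / bars["Low"]).shift(1)                           # eq. (4), lagged high/low ratio
    f["VOL"] = np.log(bars["Volume"]).shift(1)                                # eq. (5), lagged log volume
    f["VOLCHNG"] = np.log(bars["Volume"] / bars["Volume"].shift(1)).shift(1)  # eq. (6), lagged log volume change
    return f

def indirect_features(targets: pd.Series, indices: pd.DataFrame, T: int = 60) -> pd.Series:
    """Compute the 8 indirect features (eq. 7) for one stock on day t.

    targets : the stock's daily Target series (eq. 1), containing data only up to day t-1.
    indices : daily levels of the 8 external indices, same date index, also up to day t-1.
    T       : window length used to estimate the sensitivities (eq. 8); illustrative default.
    """
    # Regress the last T targets on index levels lagged one additional day (eqs. 9-10).
    y = targets.iloc[-T:].to_numpy()                 # Target_{t-1}, ..., Target_{t-T}
    X = indices.shift(1).iloc[-T:].to_numpy()        # index levels at t-2, ..., t-T-1
    X = np.column_stack([np.ones(len(X)), X])        # intercept column c
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)     # (c, beta_1, ..., beta_8), eq. (8)
    betas = coef[1:]
    latest = indices.iloc[-1].to_numpy()             # index^j_{t-1}
    return pd.Series(betas * latest,                 # eq. (7): sensitivity times latest level
                     index=[f"EXTRN_{c}" for c in indices.columns])
```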

Figure 1: The upper panel shows the cross-sectional averages of the direct features (RET, HL, VOL, VOLCHNG). The lower panel shows the cross-sectional averages of the indirect features (gold, crude oil, natural gas, corn, cotton, USD/EUR, USD/JPY, and the Treasury rate).

IV. Learning Algorithms Implementation

Now that we have specified our targets and features, the next step is to implement specific machine learning algorithms.

Since we have defined the targets in two ways (numerical and categorical), we have two ways of predicting the performance of stocks: regression and classification. In the classification setup we try to predict the direction of a stock's move on a specific day, while in the regression setup we predict its exact return. For simplicity we first try linear models: logistic regression as the classification model and linear regression as the regression model. Then we implement SVM models (classifier and regression) to explore possible non-linear regularities using kernels. Before feeding the features into the models we also normalize and center them to mean 0 and standard deviation 1. Our model is dynamic rather than fixed, depending on the date at which a return is to be predicted. Specifically, our procedure for training and testing models is as follows (a short code sketch is given after the list):

1. Specify a training window parameter W.

2. To predict the performance of stocks on date T_i, use the samples from days T_i - 1, T_i - 2, T_i - 3, ..., T_i - W as the training set to train the models.
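A minimal sketch of this rolling-window procedure, assuming scikit-learn and one feature matrix per trading day. LogisticRegression stands in for the classification model here, and the day list, the default W = 5, and the dictionaries features_by_day / labels_by_day are illustrative assumptions rather than the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def predict_day(features_by_day, labels_by_day, days, i, W=5):
    """Train on the W days preceding days[i] and score every stock on days[i].

    features_by_day[d] : (n_stocks, 12) feature matrix for day d
    labels_by_day[d]   : (n_stocks,) up/down labels (eq. 2) for day d
    """
    # 1. Pool the samples from the training window T_i-1, ..., T_i-W.
    train_days = days[i - W:i]
    X_train = np.vstack([features_by_day[d] for d in train_days])
    y_train = np.concatenate([labels_by_day[d] for d in train_days])

    # 2. Normalize features to mean 0 and standard deviation 1, then fit the model.
    scaler = StandardScaler().fit(X_train)
    model = LogisticRegression().fit(scaler.transform(X_train), y_train)

    # 3. Score the current day's stocks; the probability of the positive ("up")
    #    class is later used to rank stocks and pick the daily portfolio.
    X_today = scaler.transform(features_by_day[days[i]])
    return model.predict_proba(X_today)[:, 1]
```

Ranking each day's stocks by these scores and keeping the top N gives the daily portfolios whose evaluation is described next.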

After we have generated predictions for every day, we need to measure how good those predictions are. We could compute each day's rate of correct classifications for the classification models and the mean squared error for the regression models, but it would then be hard to compare across the two categories. We therefore use a more practical and easily visualized measure of performance: testing the performance of the stocks selected by our models. The methodology is as follows (a short sketch of this evaluation is given after the list):

1. Specify a portfolio size (the number of stocks to be picked) N.

2. Every day choose N stocks according to the predictions generated by the models. For the regression models we simply choose the N stocks that are predicted to have the highest returns; for the classification models we choose the N stocks that are classified most confidently, i.e., with the largest classifier scores.

3. Compute the actual return of each day's stock basket and compare this time series to a market index such as the S&P 500. Furthermore, we call a given day 'successful' if the portfolio we selected has a greater return than the market index and 'unsuccessful' otherwise. We can then compute the success rate for each model:

    SR = \frac{\text{number of successful days}}{\text{number of total days}}    (11)

We then compare the different models and present the one with the best success rate. The results are shown in Figure 2 and Figure 3.
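A minimal sketch of this evaluation step, assuming the daily prediction scores and the realized daily returns are available as NumPy arrays; the function names, the equal weighting of the basket, and the default N = 100 are illustrative assumptions.

```python
import numpy as np

def daily_portfolio_return(scores, realized_returns, N=100):
    """Pick the N stocks with the highest predicted score for one day and
    return the equal-weighted realized return of that basket."""
    top = np.argsort(scores)[-N:]          # indices of the N best-ranked stocks
    return realized_returns[top].mean()

def success_rate(portfolio_returns, benchmark_returns):
    """Fraction of days on which the selected basket beat the market index (eq. 11)."""
    portfolio_returns = np.asarray(portfolio_returns)
    benchmark_returns = np.asarray(benchmark_returns)
    return np.mean(portfolio_returns > benchmark_returns)

# Cumulative growth of $1 invested in the daily baskets (as plotted in Figure 2):
# cumulative = np.cumprod(1 + np.asarray(portfolio_returns))
```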
V. Discussions and Analysis

The linear models behave well. From the portfolio return graphs we can see that the linear classification model and the linear regression model give similar results here; both the regression approach and the classification approach can capture the underlying regularities.

The support vector machines do not behave well enough. One possible reason is that on such extremely noisy data, simple linear models can perform better. The SVM classifier's performance is especially poor. One possible reason is that the 'confidence' of the classification, i.e., the decision-function value that we sort on as an indicator of potential success, does not carry much meaning in the non-linear case. Using more data to train each day's model may improve the SVM's performance, but the computational cost would increase significantly since we retrain the model for every trading day.

Another interesting question is whether our models behave stably over time. From the graphs we find that the two linear models behave relatively stably until the end of 2008. To quantify this we compute SR over consecutive 80-day periods and plot the resulting time series in Figure 4.

VI. Conclusion

To conclude, we derived an approach to predicting the daily returns of U.S. stocks based on their trading data and external financial indices. Our linear models work well in both the regression framework and the classification framework. The best model turns out to be the linear classifier, logistic regression: it gives a 56.65% success rate and a 2000% cumulative return over 14 years. However, as time passes the models tend to behave less stably, especially after 2008.

In the future we can try to find methods that give more stable predictions. From the machine learning perspective we can try mixing different models or training our models with more data every day. From the investment perspective we can make predictions on 'alphas' rather than raw returns. We can also try to extract more information from text data: news and social networks can be excellent information sources for predicting stock returns.

Figure 2: The left figure shows the cumulative return of the stocks selected by the logistic regression model, and the right figure that of the linear regression model. In the implementation we set the window parameter W = 5 and the portfolio size N = 100. The logistic regression gives SR = 56.65% and the linear regression gives SR = 55.82%. For the linear models, both the classification setup and the regression setup behave well: $1 invested in 2000 would have become about $20 by 2014 if we had continually implemented the trades suggested by these two models. This success indicates that our approach did extract information from the data we collected.

Figure 3: The left figure is the SVM classifier and the right the SVM regression. We use a Gaussian kernel and, since the financial data is highly noisy, set the parameter C to 0.85. Surprisingly, the SVMs give worse results than the linear models: the SVM classifier achieves SR = 49.56% and the SVM regression SR = 53.18% on the selected portfolios. Adjusting the window parameter W and the regularization parameter C did not improve the results much.

Figure 4: All models tend to behave less well as time goes by, especially after 2008. We believe that from that time on, daily stock price changes have depended on more factors than we have captured.
