Price Trend Prediction of Stock Market Using Outlier Data Mining Algorithm
Abstract—In this paper we present a novel data mining approach to predict the long-term behavior of stock trends. Traditional techniques for stock trend prediction have shown their limitations when applying time series algorithms or volatility modelling to the price sequence. In our research, a novel outlier mining algorithm is proposed to detect anomalies on the basis of the volume sequence of high-frequency tick-by-tick data of the stock market. Such anomalous trades interfere with the stock price in the stock market. By using the cluster information of such anomalies, our approach predicts the stock trend effectively in the real-world market. Experimental results show that our proposed approach makes profits on the Chinese stock market, especially in long-term usage.

Keywords—Stock trend prediction, data mining, cluster analysis, stock market, anomaly

I. INTRODUCTION

Financial time series change dynamically and selectively. Such time series are difficult to predict because the problem is nonlinear, non-stationary and noisy [4]. The stock price is one kind of time series in the financial domain, and predicting the future stock trend has become one of the most important applications of data mining techniques. However, prediction is difficult in principle: according to the efficient market hypothesis [2], if the market is efficient then the stock price follows a random walk. In addition, a stationary prediction strategy is not possible in an efficient market, because investors will soon discover such a strategy and successful forecasting rules will self-destruct [3]. Many researchers have studied such random walks through time series modeling [5], volatility modeling [6] and even artificial intelligence modelling [4]. But those algorithms are all based on the stock price itself, which is random in nature.

In this paper, we turn our attention to the distribution of volume in the high-frequency tick-by-tick data of the market. Under the efficient market hypothesis the market follows a random walk, so the trading volume should follow some random distribution. Therefore, we assume that if the volume is no longer random, there are anomalies in the distribution. At that point the market is not efficient, which means the stock price is no longer a random walk and a long-term prediction strategy becomes possible. Here, we study whether anomalies detected in historical financial time series data can predict the stock trend effectively.

Our contributions are as follows:

1. We first propose using anomalies in the distribution of trading volume to predict the upward trend of stock prices.
2. We use tick-by-tick data instead of time series data on the stock price in a novel outlier mining algorithm.
3. We select 200 stocks randomly in our experiment. The result shows that using anomalies can predict the upward trend of stock prices effectively.

The rest of the paper is organized as follows. Section 2 introduces the motivation and provides an example to illustrate the problem. Section 3 introduces our approach and explains the outlier algorithm. Section 4 evaluates our approach by applying the method to the data to obtain the metrics. Section 5 reviews related work. Section 6 concludes this paper.

II. MOTIVATION

Stock markets change all the time, and the prediction of the stock trend is a significant issue in the modern financial market. However, according to the efficient market hypothesis [2], the market price follows a random walk and a permanent prediction strategy is not possible. An interesting observation is that, at some trading prices, the real-world market is no longer efficient, which breaks the efficient market hypothesis. The stock price data is then not so random, and prediction of the stock trend becomes possible. A traditional way to predict the stock trend is to apply data mining techniques to stock prices. Unfortunately, stock price data is very noisy [1], and for noisy data people usually build stochastic volatility models, whose predictive efficiency is low.

In the non-efficient case above, when we analyze the volume data there are anomalies in the distribution of trading volumes. Insider trading and market manipulation [7], [8] are the two key anomalies in the stock market. Insider trading is trading on the basis of non-public information by insiders, such as directors, employees and officers [9], [10]. Market manipulation consists of trades or actions that attempt to affect the fair and free operation of the stock market and create a false or misleading appearance of a stock [11]. These anomalies severely impair the stock market and have a long-term influence on stock prices. Thus, anomalies have long-term predictive power for the stock trend, and our method uses them to avoid the effect of price noise when predicting the market trend. In this paper we limit our scope to upward trend prediction, because an upward trend usually means stable and long-term arbitrage opportunities.
Fig. 1: (a) Volume distribution of all the stocks; (b) Volume distribution of "Shenzhen development bank"
Fig. 1 shows one example of an anomaly. The left part of Fig. 1 is the trading volume distribution of all the stocks in the market at price 10.12. Comparing it with the right part of the figure, which is the volume distribution of the stock named "Shenzhen development bank" at price 10.12, we find an anomaly at volume 80, which is an outlier of the distribution and is marked with a circle.

In our approach, we detect all the anomalies and mark them on the price sequence. After that, a dramatic change in the stock trend is easy to predict where such anomalies cluster. For example, in Fig. 2, the anomalies are marked with '+' on the price sequence. The horizontal axis is the index of the trade point in the tick-by-tick data and the vertical axis is the stock price. After the anomalies marked with '+' there is an obvious upward trend in the stock price. The next section introduces our approach in detail.

III. APPROACH

An overview diagram of our approach is shown in Fig. 3. We first fetch the data from the data source and preprocess it. We then transform the high-frequency data into a ratio matrix and feed it into the outlier algorithm to find anomalies. Finally, we make predictions according to the positions of the anomalies and evaluate the result.

A. Data Preprocessing

The data we use in this paper is high-frequency tick-by-tick trading data, a format used frequently in the financial industry. This data records each trade for every stock in the market: if there are 1000 trades for a specific stock on a given day, there will be 1000 records for that stock on that day, so for a relatively long period the data size can be very large. One record R of the tick-by-tick data is defined by the following fields:

t    time of the trade
p    price of the trade
c    change of the price
v    volume of the trade
a    amount of the trade
b    buy or sell signal of the trade

One example of our tick-by-tick data is shown in Table I.

TABLE I: A sample of tick-by-tick data

Time      Price  Change  Volume  Amount   Bs
15:00:19  10.77  –       1785    1923500  b
14:57:01  10.77  –       1       1077     s
14:56:55  10.77  –       10      10770    b
14:56:52  10.77  -0.01   186     200322   s
14:56:46  10.78  –       94      101332   b
14:56:43  10.78  0.01    20      21560    b
14:56:43  10.77  -0.01   75      80775    s

Once we have the data in hand, we can pre-process it into the ratio matrix for later use. This process consists of the following steps:

1. Prepare the tick-by-tick data
2. Fix a price for all the stocks
3. Fix a price for one specific stock
4. Build the ratio matrices for steps 2 and 3 for later use

Here is the explanation for each step:

Step 1: For each stock, collect the tick-by-tick data for all the trading days of interest into a single matrix of records T. Each row of matrix T is a record R as defined earlier.
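Step 1 can be illustrated with a minimal Python sketch. The `Tick` class, the `parse_tick` helper and the whitespace-separated row layout are assumptions made for illustration, mirroring the fields of R and the rows of Table I; they are not taken from the original implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Tick:
    """One tick-by-tick record R with fields t, p, c, v, a, b (illustrative names)."""
    t: str              # time of the trade, "HH:MM:SS"
    p: float            # price of the trade
    c: Optional[float]  # change of the price; None when the table shows a dash
    v: int              # volume of the trade
    a: int              # amount of the trade
    b: str              # "b" for buy, "s" for sell


def parse_tick(row: str) -> Tick:
    """Parse one whitespace-separated row laid out like Table I (assumed format)."""
    t, p, c, v, a, b = row.split()
    change = None if c in {"-", "–"} else float(c)
    return Tick(t, float(p), change, int(v), int(a), b)


# Sample rows copied from Table I.
rows = [
    "15:00:19 10.77 – 1785 1923500 b",
    "14:57:01 10.77 – 1 1077 s",
    "14:56:55 10.77 – 10 10770 b",
    "14:56:52 10.77 -0.01 186 200322 s",
    "14:56:46 10.78 – 94 101332 b",
    "14:56:43 10.78 0.01 20 21560 b",
    "14:56:43 10.77 -0.01 75 80775 s",
]

# T plays the role of the matrix of records: one Tick per row.
T: List[Tick] = [parse_tick(r) for r in rows]
```

Keeping each record as a typed object rather than a raw string makes the later grouping by price and volume straightforward.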
[Figure; legend: Normal cluster, Anomaly cluster]

Algorithm 1: Anomaly price and volume finding algorithm
Input: M(i,:), Ms(i,:), prices
Output: Anomaly price, Anomaly volume
pseq := unique(prices)
for (int i = 1; i <= length(pseq); i++) do
    pi := pseq(i)
    theoryseq := M(i,:)
    actualseq := Ms(i,:)
    difference := actualseq - theoryseq
    k := find(difference > 0.8)
    if k is not empty then
        an anomaly is found on price pi and volume number k
    end if
end for
return the anomaly prices and anomaly volumes found
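Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: row i of `M` (market-wide ratios) and `Ms` (single-stock ratios) is assumed to correspond to the i-th unique price in ascending order, columns are assumed to index volume, indices are 0-based, and the function name `find_anomalies` is hypothetical. The threshold 0.8 is the value used in Algorithm 1.

```python
from typing import List, Sequence, Tuple


def find_anomalies(
    M: Sequence[Sequence[float]],
    Ms: Sequence[Sequence[float]],
    prices: Sequence[float],
    threshold: float = 0.8,
) -> List[Tuple[float, int]]:
    """Return (price, volume_index) pairs where the stock's ratio exceeds
    the market's ratio by more than `threshold`."""
    pseq = sorted(set(prices))  # unique prices in ascending order (assumed row order)
    anomalies: List[Tuple[float, int]] = []
    for i, pi in enumerate(pseq):
        theoryseq = M[i]    # expected (market-wide) volume ratios at price pi
        actualseq = Ms[i]   # observed volume ratios of the specific stock at price pi
        for k, (actual, theory) in enumerate(zip(actualseq, theoryseq)):
            if actual - theory > threshold:
                anomalies.append((pi, k))
    return anomalies


# Toy inputs invented for illustration only.
prices = [10.77, 10.78, 10.77]
M = [[0.60, 0.35, 0.05], [0.40, 0.40, 0.20]]
Ms = [[0.05, 0.05, 0.90], [0.30, 0.50, 0.20]]
result = find_anomalies(M, Ms, prices)
```

In this toy input only the last volume bucket at price 10.77 differs from the market ratio by more than 0.8, so it is the only entry reported.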
Fig. 5: Price histogram of the anomaly cluster for stock code 000623

Fig. 6: Average return of 200 stocks (legend: Stock return for our approach; Stock return for SVM)
normal data in our dataset, thus it is not possible to make a prediction if the normal data occurs later than the anomaly data, while our algorithm has no such issue: we can make a prediction no matter where the anomaly data is.

very obvious in the figure. We also measured the successful rate of our prediction. The successful rate is defined as:

$$\text{Successful Rate} = \frac{\text{number of correct predictions}}{\text{number of all predictions}}$$

where a correct prediction means that the stock return is bigger than 1.

Fig. 7 shows the successful rate. The horizontal axis of Fig. 7 is the index of the trading day starting from 03-31-2014, and the vertical axis is the successful rate of the 200 stocks on each day compared with the price on 03-31-2014. We observe that as time goes by the successful rate rises and finally comes close to 1, which means almost all the stocks changed their trend after the anomaly clusters. The results in Fig. 6 and Fig. 7 show that the predictability of our approach is satisfactory.

Fig. 7: Successful Rate of 200 Stocks

V. RELATED WORKS

A. Neural Network Approaches

Many studies use artificial neural networks (ANNs). A number of successful trials have shown that ANNs can be a powerful tool for time series forecasting and modeling [12]. However, too many factors need to be tuned, and they affect the ANN performance. It is a challenge to design the sampling scheme, choose the training and testing datasets and select the effective factors for improving the prediction performance, and it is difficult to define the structure of the models, such as the hidden layers and the neurons. Zhang et al. [12] presented a piecewise nonlinear model for analyzing stock market tick data. They proposed Prop NN, which can improve the predictability of the stock price, and claimed that it is significantly better than the basic BPN model. But like many other machine learning algorithms, ANNs suffer from over-fitting. They cannot discriminate between useful information and noise, and often the noise level is so high that the algorithm actually fits the noise, in which case prediction on the useful data is impossible. Our algorithm has no such issue: we simply ignore the noise and pick up only the useful information, which is the anomaly volume, and this makes the analysis much easier.
B. SVM based approaches

The support vector machine (SVM) proposed by Boser et al. [13] has attracted increasing attention in recent years. It was originally proposed as a margin-based classifier and is derived from the structural risk minimization principle [14]. By constructing a separating decision hyperplane it can be used in classification and regression analysis, and it can help users make well-informed business decisions. Wang et al. [15] showed that the K-means SVM (KMSVM) algorithm can speed up the response time of classifiers by decreasing the number of support vectors while maintaining an accuracy comparable to SVM. But the situation is the same as for ANN algorithms: if the noise level is high, it is impossible to make a prediction.

VI. CONCLUSION

In this paper, starting from the efficient market hypothesis, we found a way to locate anomalous trade data in high-frequency tick-by-tick data by comparing the distribution of the volume sequence between the market and a specific stock. By building the volume distribution matrix of all the stocks in the market and that of an individual stock, we can compute the difference between them; if that difference is bigger than a certain limit, an anomaly is found. We found that clusters of anomalies predicted an upward trend of the stock price in our experiments. A traditional cluster analysis algorithm can also find the anomalies, but our algorithm is more practical in that it is more effective in making predictions. We tested our novel outlier mining algorithm and found that it is consistent with the k-means clustering algorithm. Finally, the average return and successful rate of our algorithm were tested, and the predictions measured by these two quantities are satisfactory.

REFERENCES

[1] A. Antoniou and C. E. Vorlow, Price clustering and discreteness: is there chaos behind the noise?, Physica A: Statistical Mechanics and its Applications, Vol. 348, 2005, p. 389.
[2] B. G. Malkiel, The efficient market hypothesis and its critics, Journal of Economic Perspectives, Vol. 17, No. 1, 2003, pp. 59-82.
[3] A. Timmermann and C. W. J. Granger, Efficient market hypothesis and forecasting, International Journal of Forecasting, Vol. 20, No. 3, 2004, pp. 15-27.
[4] P. K. Padhiary and A. P. Mishra, Development of improved artificial neural network model for stock market prediction, International Journal of Engineering Science and Technology, Vol. 3, 2011, pp. 1576-1581.
[5] Y. Amihud, Illiquidity and stock returns: cross-section and time-series effects, Social Science Electronic Publishing, Vol. 5, 2002, pp. 31-56. DOI: 10.1016/S1386-4181(01)00024-6.
[6] E. Stein and J. Stein, Stock price distributions with stochastic volatility: an analytic approach, Review of Financial Studies, Vol. 4, No. 4, 1991, pp. 727-752.
[7] F. Allen and G. Gorton, Stock price manipulation, market microstructure and asymmetric information, European Economic Review, 1992, pp. 624-630.
[8] M. Minenna, Insider trading abnormal return and preferential information: Supervising through a probabilistic model, Journal of Banking and Finance, 2003, pp. 59-86.
[9] L. Cheng, M. Firth, T. Leung, and O. Rui, The effects of insider trading on liquidity, Pacific-Basin Finance Journal, 2006, pp. 467-483.
[10] B. Cornell and B. Sirri, The reaction of investors and stock prices to insider trading, Journal of Finance, 1992, pp. 1031-1059.
[11] K. Felixson and A. Pelli, Day end returns: Stock price manipulation, Journal of Multinational Financial Management, 1999, pp. 95-127.
[12] G. Zhang, B. E. Patuwo, and M. Y. Hu, Forecasting with artificial neural networks: The state of the art, International Journal of Forecasting, Vol. 14, 1998, pp. 35-62.
[13] B. E. Boser, I. M. Guyon, and V. N. Vapnik, A training algorithm for optimal margin classifiers, in Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, 1992, pp. 144-152.
[14] K.-J. Kim, Financial time series forecasting using support vector machines, Neurocomputing, Vol. 55, 2003, pp. 307-319.
[15] J. Wang, X. Wu, and C. Zhang, Support vector machines based on K-means clustering for real-time business intelligence systems, International Journal of Business Intelligence and Data Mining, Vol. 1, 2005, pp. 54-64.