2015 IEEE Fifth International Conference on Big Data and Cloud Computing

Price Trend Prediction of Stock Market Using Outlier Data Mining Algorithm

Zhao, Lei
Baylor University
Email: [email protected]

Wang, Lin
Japan Advanced Institute of Science and Technology
Email: [email protected]

Abstract: In this paper we present a novel data mining approach to predict the long-term behavior of stock trends. Traditional techniques for stock trend prediction have shown their limitations when applying time series algorithms or volatility modelling to the price sequence. In our research, a novel outlier mining algorithm is proposed to detect anomalies on the basis of the volume sequence of high-frequency tick-by-tick data of the stock market. Such anomalous trades always interfere with the stock price in the stock market. By using the cluster information of such anomalies, our approach predicts the stock trend effectively in the real-world market. Experiment results show that our proposed approach makes profits on the Chinese stock market, especially in long-term usage.

Keywords: Stock trend prediction, data mining, cluster analysis, stock market, anomaly

I. INTRODUCTION

Financial time series change dynamically and selectively. Such time series are difficult to predict because the problem is nonlinear and non-stationary and the data contain a lot of noise [4]. The stock price is one kind of time series in the financial domain. Predicting the future stock trend has become one of the most important issues in applying data mining techniques. However, prediction is difficult in principle: according to the efficient market hypothesis [2], if the market is efficient then the stock price follows a random walk pattern. In addition, a stationary prediction strategy is not possible if the market is efficient, because investors will soon discover such strategies and successful forecasting rules will self-destruct [3]. Many researchers have studied such random walks by time series modeling [5], volatility modeling [6] and even artificial intelligence modelling [4]. But those algorithms are all based on the stock price itself, which has a random property.

In this paper, we turn our attention to the distribution of volume in the high-frequency tick-by-tick data of the market. The trading volume will follow some random distribution, because under the efficient market hypothesis the market always follows a random walk. Therefore, we assume that if the volume is no longer so random, then there are some anomalies in the distribution. At that point the market is not efficient, which means the stock price is no longer a random walk, so a long-term prediction strategy is possible. Here, we want to study whether anomalies detected in historical financial time series data can predict the stock trend effectively.

Our contributions are as follows:

1. We are the first to propose using anomalies in the distribution of trading volume to predict the upward trend of stock prices.
2. We use tick-by-tick data instead of time series data on stock price in a novel outlier mining algorithm.
3. We select 200 stocks randomly in our experiment. The result shows that using anomalies can predict the upward trend of stock prices effectively.

The rest of the paper is organized as follows. Section 2 introduces the motivation and provides an example to illustrate the problem. Section 3 introduces our approach and explains the outlier algorithm. Section 4 evaluates our approach by applying the method to the data to obtain the metrics. Section 5 discusses related work. Section 6 concludes this paper.

II. MOTIVATION

Stock markets are changing all the time, and prediction of the stock trend is a significant issue in the modern financial market. However, according to the efficient market hypothesis [2], the market price follows a random walk and a permanent prediction strategy is not possible. An interesting issue is that, in the real world, the market is no longer efficient at some trading prices, which breaks the efficient market hypothesis. In that case the stock price data are not so random and prediction of the stock trend becomes possible. A traditional way to predict the stock trend is to apply data mining techniques to stock prices. Unfortunately, stock price data are very noisy [1], and for noisy data people typically build stochastic volatility models, whose predictions have low efficiency.

In the above non-efficient case, when we analyze the volume data there are always anomalies in the distribution of trading volumes. Insider trading and market manipulation [7], [8] are the two key anomalies in the stock market. Insider trading is trading on the basis of non-public information by insiders, such as directors, employees and officers [9], [10]. Market manipulation consists of trades or actions that attempt to affect the fair and free operation of the stock market and create a false or misleading appearance of a stock [11]. These anomalies severely impair the stock market and have a long-term influence on stock prices. Thus, anomalies have long-term predictive power for the stock trend. In our method we utilize these anomalies to remove the effect of price noise when predicting the market trend. In this paper we limit our scope to upward trend prediction, because an upward trend usually means stable, long-term arbitrage opportunities.

© 2015 IEEE. DOI 10.1109/BDCloud.2015.19
[Figure: (a) Volume distribution of all the stocks; (b) Volume distribution of "Shenzhen development bank", with the anomaly point marked]

Fig. 1: Anomaly found in price 10.12 of Shenzhen development bank

Fig. 1 shows one example of an anomaly. The left part of Fig. 1 is the trading volume distribution of all the stocks in the market at price 10.12. Comparing it with the right part of the figure, which is the volume distribution of the stock named "Shenzhen development bank" at price 10.12, we find an anomaly at volume 80, which is an outlier of the distribution and is marked with a circle.

In our approach, we detect all the anomalies and mark them on the price sequence. After that, it is easy to predict that the stock trend changes dramatically where such anomalies cluster. For example, in Fig. 2, the anomalies are marked with '+' on the price sequence. The horizontal axis is the index of the trade point in the tick-by-tick data and the vertical axis is the stock price. We can see that after the anomalies marked with '+' there is an obvious upward trend in the stock price. The next section introduces our approach in detail.

III. APPROACH

An overview diagram of our approach is shown in Fig. 3. We first fetch the data from the data source and preprocess it. After that, we transform the high-frequency data into a ratio matrix and feed it into the outlier algorithm to find anomalies. We can then make predictions according to the position of the anomalies and evaluate the result.

A. Data Preprocessing

The data we use in this paper are high-frequency tick-by-tick trading data. Tick-by-tick data is a format used frequently in the financial industry. It records each trade for every stock in the market: if there are 1000 trades for a specific stock on a given day, then there are 1000 records for that stock on that day, so for a relatively long period the data size can be very large. One record of the tick-by-tick data is defined as:

R = {t, p, c, v, a, b}

The fields of the record R have the following meanings:

t  time of the trade
p  price of the trade
c  change of the price
v  volume of the trade
a  amount of the trade
b  buy or sell signal of the trade

One example of our tick-by-tick data is shown in Table I:

TABLE I: Sample tick-by-tick data (Bs: b = buy, s = sell)

Time      Price  Change  Volume  Amount   Bs
15:00:19  10.77  –       1785    1923500  b
14:57:01  10.77  –       1       1077     s
14:56:55  10.77  –       10      10770    b
14:56:52  10.77  -0.01   186     200322   s
14:56:46  10.78  –       94      101332   b
14:56:43  10.78  0.01    20      21560    b
14:56:43  10.77  -0.01   75      80775    s

Once we have the data in hand, we can start to preprocess the data into the ratio matrix for later use. The following steps are needed for this process.

1. Prepare tick-by-tick data
2. Fix a price for all the stocks
3. Fix a price for one specific stock
4. Make the ratio matrices for steps 2 and 3 for later use

Here is the explanation of each step:

Step 1: For each stock, collect the tick-by-tick data for all the trading days of interest into a single matrix of records T. Each row of matrix T is a record R as defined earlier.

Step 2: We define a new vector by the following means from the matrix T:

Algorithm 1: Anomaly price and volume finding algorithm
Input: M, Ms, prices
Output: Anomaly price, Anomaly volume
pseq := unique(prices)
for (int i = 1; i <= length(pseq); i++) do
    pi := pseq(i)
    theoryseq := M(i,:)
    actualseq := Ms(i,:)
    difference := actualseq - theoryseq
    k := find(difference > 0.8)
    if k is not empty then
        an anomaly is found on price pi and volume number k
    end if
end for
return
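As an illustration only, Algorithm 1 can be sketched in plain Python. The sketch assumes that M and Ms are lists of rows of equal shape, that row i corresponds to the i-th sorted unique price and column k to the k-th volume number, and that the threshold is the 0.8 used in the listing. The function name `find_anomalies`, the `threshold` parameter and the toy numbers are hypothetical and not taken from the paper. Column indices are 0-based here, whereas the listing is 1-based.

```python
def find_anomalies(M, Ms, prices, threshold=0.8):
    """Return (price, volume_index) pairs flagged by the Algorithm 1 comparison.

    M  : market-wide ratio matrix (assumed layout: one row per unique price)
    Ms : ratio matrix of one stock, same assumed layout as M
    """
    pseq = sorted(set(prices))  # unique(prices) in the listing
    if len(M) != len(pseq) or len(Ms) != len(pseq):
        raise ValueError("M and Ms need one row per unique price")

    anomalies = []
    for i, p in enumerate(pseq):  # visits every unique price
        theory_seq = M[i]    # ratios expected from the whole market
        actual_seq = Ms[i]   # ratios observed for this stock
        for k, (actual, theory) in enumerate(zip(actual_seq, theory_seq)):
            if actual - theory > threshold:
                anomalies.append((p, k))
    return anomalies


# Toy data, made up for illustration: two prices, four volume numbers.
M = [[0.06, 0.01, 0.04, 0.02],
     [0.04, 0.01, 0.05, 0.06]]
Ms = [[0.05, 0.02, 0.95, 0.01],
      [0.03, 0.02, 0.04, 0.05]]
prices = [10.12, 10.13, 10.12, 10.13, 10.12]

print(find_anomalies(M, Ms, prices))
```

With this toy input, only the entry at price 10.12 and column 2 exceeds the threshold, so a single anomaly is reported. Returning the pairs as a list, rather than only signalling that an anomaly exists, is a design choice of this sketch so that the result can be passed on to a location step.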

Algorithm 2: Anomaly location finding algorithm
Input: Anomaly price, Anomaly volume, T
Output: Anomaly position
price := Anomaly price
volume := Anomaly volume
index := find(T(:,2) == price and T(:,4) == volume)
for (int i = 1; i <= 3; i++) do
    dif := diff(index)
    indexDiff := find(dif < 5)
    index := index(indexDiff + 1)
end for
if index is not empty then
    an anomaly location is found
end if
return

$$
M = \begin{pmatrix}
0.06 & 0.01 & 0.04 & 0.02 & 0.01 & 0.01 & 0.02 & \cdots \\
0.04 & 0.01 & 0.05 & 0.06 & 0.02 & 0.01 & 0.01 & \cdots \\
0.08 & 0.01 & 0.03 & 0.02 & 0.07 & 0.02 & 0.01 & \cdots \\
0.01 & 0.01 & 0.03 & 0.05 & 0.03 & 0.07 & 0.02 & \cdots
\end{pmatrix}
$$

Each row corresponds to a price in the price sequence P, while each column corresponds to a volume number in the volume sequence T. With all the data prepared, we can start the explanation of our algorithm.

B. Algorithm

We propose an outlier mining algorithm to detect the anomalies in high-frequency trading data. The details of this algorithm are shown in Algorithm 1.

In Algorithm 1, M is the matrix M_iv while Ms is the matrix Ms_iv, so by iterating over the index s we can find anomalies for all the stocks. We first retrieve the ratio sequence from M_iv for the price P_i and compare it with the sequence from Ms_iv for the same price to find the difference between the two ratio sequences. If a value in the difference exceeds a certain limit, then that value is considered to be an anomaly, and the corresponding volume number and price are recorded. A typical anomaly record can be defined as follows:

A = {s, p, v}

where s is the stock index, p is the anomaly price and v is the anomaly volume number. After all the anomalies of a stock are found, we use another algorithm to locate those anomalies in trade time. The implementation of this algorithm is shown in Algorithm 2.

In Algorithm 2, T is the matrix of tick-by-tick records R introduced in Step 1 of the last subsection. This program first locates the anomaly trades in the record matrix T and then uses a simple method to determine whether they are clustered on the row index of matrix T. If they are, a cluster of anomalies is found and the subsequent trend of the stock can be evaluated.

IV. EXPERIMENT RESULTS

The experiments on real exchange data show that our approach is more effective in prediction than the traditional data mining algorithm and that the predictability of our approach is satisfactory. The experiment data are from the Chinese stock exchange for the period 03-31-2014 to 04-30-2015, which includes 272 trading days. The size of the data set is 7.1 GB.

A. Clustering

In this part of the experiment, we apply the k-means clustering algorithm to the rows of matrix Ms_iv and check whether any cluster exists. One typical result is shown in Fig. 4.

[Figure: clustering of the ratio-matrix rows, showing a normal cluster and an anomaly cluster]

Fig. 4: Clustering of stock code 000623

Fig. 4 shows the clustering on the ratio matrix. The points are grouped into two clusters, one marked with dots and one marked with '+'. The cluster on the left represents the major part of the trading and is therefore the normal cluster; the cluster on the right represents the trading that differs from normal and is therefore considered to be the anomaly cluster. Recall that each row in matrix Ms_iv corresponds to a price P_i in the unique price sequence, so each cluster of rows of Ms_iv is also a cluster of prices.

[Figure: histogram; horizontal axis: price, vertical axis: occurrences of that price]

Fig. 5: Price histogram of the anomaly cluster for stock code 000623

Fig. 5 shows the histogram of the prices corresponding to the cluster marked with '+' in Fig. 4. The horizontal axis in Fig. 5 is the price and the vertical axis is the number of occurrences of that price. In Fig. 5, we observe that the price is clustered around 15. This coincides with the anomaly cluster of stock 000623 in Fig. 2a.

This means our outlier mining algorithm is consistent with traditional clustering algorithms. But our algorithm is more effective in prediction: other algorithms need all the information in the data before they can be run, which means that in order to find the anomaly cluster we also need the normal data in our dataset. Thus it is not possible to make a prediction if the normal data occur later than the anomaly data. Our algorithm has no such issue, and we can make a prediction no matter where the anomaly data are.

B. Evaluation of Prediction

We randomly chose 200 stocks in the Chinese Shenzhen stock market and found that 111 of them exhibit clusters of anomalies. We measured the average return of the stocks for 100 days after the anomaly cluster and plotted it in Fig. 6. For comparison, we also measured the average return for the same set of stocks using the support vector machine [13] discussed in the related works section. The horizontal axis of Fig. 6 is the index of the trading day starting from 03-31-2014, and the vertical axis is the mean return of these 200 stocks compared with the price on 03-31-2014. The upward trend is very obvious in the figure.

[Figure: two curves, "Stock return for our approach" and "Stock return for SVM"; horizontal axis: trading day index, vertical axis: mean return]

Fig. 6: Average return of 200 stocks

We also measured the successful rate of our prediction. The successful rate is defined to be:

$$
\text{Successful Rate} = \frac{\text{number of correct predictions}}{\text{number of all predictions}}
$$

Here a correct prediction means the stock return is bigger than 1.

[Figure: two curves, "Successful rate for our approach" and "Successful rate for SVM"; horizontal axis: trading day index, vertical axis: successful rate]

Fig. 7: Successful Rate of 200 Stocks

Fig. 7 shows the successful rate. The horizontal axis of Fig. 7 is the index of the trading day starting from 03-31-2014, and the vertical axis is the successful rate of the 200 stocks on each day compared with the price on 03-31-2014. We observe that as time goes by the successful rate rises and is finally close to 1, which means almost all the stocks changed their trend after the anomaly clusters. The results in Fig. 6 and Fig. 7 show that the predictability of our approach is satisfactory.

V. RELATED WORKS

A. Neural Network Approaches

Many studies use artificial neural networks (ANNs). A lot of successful trials have shown that an ANN can be a powerful tool for time series forecasting and modeling [12]. However, the many factors that need to be tuned affect the ANN performance. It is a challenge to design the sampling scheme, choose the training and testing datasets and select the effective factors for improving the prediction performance, and it is difficult to define the structure of the models, such as the hidden layers and the neurons. Zhang et al. [12] presented a piecewise nonlinear model for analyzing stock market tick data. They proposed Prop NN, which can improve the predictability of the stock price, and they claimed that it is significantly better than the basic BPN model. But like many other machine learning algorithms, ANNs suffer from the problem of overfitting. An ANN cannot discriminate between useful information and noisy information, and often the noise level is so high that the algorithm is actually fitting the noise; in this case prediction on the useful data is impossible. Our algorithm has no such issue: we can simply ignore the noise and pick up only the useful information, which is the anomaly volume. This makes the analysis much easier.
B. SVM Based Approaches

The support vector machine proposed by Boser et al. [13] has attracted increasing attention in recent years. It was introduced as a classification algorithm and is derived from the structural risk minimization principle [14]. By finding a separating decision hyperplane it can be used for classification and also for regression analysis, and it can help users make well-informed business decisions. Wang et al. [15] showed that the K-means SVM (KMSVM) algorithm can speed up the response time of classifiers by decreasing the number of support vectors while maintaining an accuracy comparable to that of SVM. But the situation is the same as for ANN algorithms: if the noise level is high, then it is impossible to make a prediction.

VI. CONCLUSION

In this paper, starting from the efficient market hypothesis, we found a way to locate anomalous trade data in high-frequency tick-by-tick data by comparing the distribution of the volume sequence between the market and a specific stock. By building the volume distribution matrix of all the stocks in the market and of any individual stock, we can measure the difference between them, and if that difference is bigger than a certain limit then an anomaly is found. We found that clusters of anomalies always predict an upward trend of the stock price.

A traditional cluster analysis algorithm can also find the anomalies, but our algorithm is more practical in that it is more effective in making predictions. We tested our novel outlier mining algorithm and found that it is consistent with the k-means clustering algorithm. Finally, the average return and the successful rate of our algorithm were tested, and the predictions for these two quantities are correct and satisfactory.

REFERENCES

[1] A. Antoniou and C. E. Vorlow, Price clustering and discreteness: is there chaos behind the noise?, Physica A: Statistical Mechanics and its Applications, Vol. 348, 2005, p. 389.
[2] B. G. Malkiel, The efficient market hypothesis and its critics, Journal of Economic Perspectives, Vol. 17, No. 1, 2003, pp. 59-82.
[3] A. Timmermann and C. W. J. Granger, Efficient market hypothesis and forecasting, International Journal of Forecasting, Vol. 20, No. 3, 2004, pp. 15-27.
[4] P. K. Padhiary and A. P. Mishra, Development of improved artificial neural network model for stock market prediction, International Journal of Engineering Science and Technology, Vol. 3, 2011, pp. 1576-1581.
[5] Y. Amihud, Illiquidity and stock returns: Cross-section and time-series effects, Social Science Electronic Publishing, Vol. 5, 2002, pp. 31-56. DOI: http://dx.doi.org/10.1016/S1386-4181(01)00024-6.
[6] E. Stein and J. Stein, Stock price distributions with stochastic volatility: An analytic approach, Review of Financial Studies, Vol. 4, No. 4, 1991, pp. 727-752.
[7] F. Allen and G. Gorton, Stock price manipulation, market microstructure and asymmetric information, European Economic Review, 1992, pp. 624-630.
[8] M. Minenna, Insider trading abnormal return and preferential information: Supervising through a probabilistic model, Journal of Banking and Finance, 2003, pp. 59-86.
[9] L. Cheng, M. Firth, T. Leung, and O. Rui, The effects of insider trading on liquidity, Pacific-Basin Finance Journal, 2006, pp. 467-483.
[10] B. Cornell and B. Sirri, The reaction of investors and stock prices to insider trading, Journal of Finance, 1992, pp. 1031-1059.
[11] K. Felixson and A. Pelli, Day end returns: Stock price manipulation, Journal of Multinational Financial Management, 1999, pp. 95-127.
[12] G. Zhang, B. E. Patuwo, and M. Y. Hu, Forecasting with artificial neural networks: The state of the art, International Journal of Forecasting, Vol. 14, 1998, pp. 35-62.
[13] B. E. Boser, I. M. Guyon, and V. N. Vapnik, A training algorithm for optimal margin classifiers, in Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, 1992, pp. 144-152.
[14] K.-J. Kim, Financial time series forecasting using support vector machines, Neurocomputing, Vol. 55, 2003, pp. 307-319.
[15] J. Wang, X. Wu, and C. Zhang, Support vector machines based on K-means clustering for real-time business intelligence systems, International Journal of Business Intelligence and Data Mining, Vol. 1, 2005, pp. 54-64.
