Time Series Data Mining A Case Study With Big
INTELLIGENT TRANSPORTATION
Received December 31, 2019, accepted January 9, 2020, date of publication January 14, 2020, date of current version January 24, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.2966553
ABSTRACT Time series data is common in data sets and has become one of the focuses of current research. Time series prediction can be realized through the mining of time series data, so that we can obtain the development process and regularity of the social and economic phenomena reflected by a time series and extrapolate them to predict its development trend. More and more attention has been paid to time series prediction in the era of big data. Accurately predicting the trend is the basic application of time series prediction. In this paper, we introduce several time series models: the autoregressive (AR) model, the moving average (MA) model, and the ARIMA model, which combines AR and MA. As an application of time series prediction to a specific scenario, ARIMA is applied to the risk prediction of the National SME Stock Trading (New Third Board). The case studies show that the results of our analysis are basically consistent with the actual situation, which greatly helps the prediction of financial risks.
INDEX TERMS Data mining, time series, financial forecast, AR, MA, ARIMA, financial risk.
I. INTRODUCTION
Time series data mining comes from people's need to visualize data models according to their abilities. People rely on complex methods to perform these tasks. In fact, we can ignore small fluctuations to obtain a conceptual model and distinguish different time models based on the similarity between models. The main time series tasks include content-based querying, anomaly detection, pattern recognition, prediction, clustering, classification and segmentation.

A large number of decision-making problems in various research fields of the natural and social sciences cannot be separated from prediction; forecasting is the basis of decision-making [1]. Therefore, we mainly explore time series data analysis and prediction.

Time series prediction methods are divided into traditional time series prediction methods and machine learning methods. The traditional time series forecasting method predicts the future trend of a time series based only on its historical trend. This method fits the historical trend curve by establishing an appropriate mathematical model and predicts the trend of the future time series according to the fitted model curve. Common models include ARMA [2], VAR [3], TAR [4], ARCH [5], etc. The traditional time series method can be applied to a variety of scenarios because it relies on relatively simple data and only needs the historical time series trend curve to build a model. However, the traditional time series prediction method often faces the problem of lag, that is, the predicted value is several time units later than the true value. To improve the accuracy of prediction, machine learning algorithms are introduced into time series prediction. Machine learning methods select features that may affect the predicted value according to the specific application scenario, introduce these features into the model, and finally apply machine learning classification models for prediction. Machine learning methods need to extract more features from data in multiple dimensions. In general, the more complex the model, the more accurate the prediction can be. However, models are often not universal, and features need to be re-extracted for different application scenarios to build models. In practical prediction, machine learning methods are

The associate editor coordinating the review of this manuscript and approving it for publication was Sabah Mohammed.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
14322 VOLUME 8, 2020
F. Wang et al.: Time Series Data Mining: Case Study With Big Data Analytics Approach
vectors are uncorrelated. The smaller the angle between the two vectors, the closer the absolute value of the correlation coefficient is to 1, and the higher the correlation between the two vectors.

The linear correlation between two vectors is measured by the correlation coefficient. In a stationary time series $\{r_t\}$, the linear correlation between $r_t$ and its past value $r_{t-l}$ is measured by the autocorrelation coefficient. The correlation coefficient between $r_t$ and $r_{t-l}$ is called the lag-$l$ autocorrelation coefficient of $r_t$, usually denoted $\rho_l$. Specifically:

$$\rho_l = \frac{\mathrm{Cov}(r_t, r_{t-l})}{\sqrt{\mathrm{Var}(r_t)\,\mathrm{Var}(r_{t-l})}} = \frac{\mathrm{Cov}(r_t, r_{t-l})}{\mathrm{Var}(r_t)}$$

The above formula uses a property of weak stationarity: $\mathrm{Var}(r_t) = \mathrm{Var}(r_{t-l})$. For a sample $\{r_t\}_{t=1}^{T}$ of a stationary time series with sample mean $\bar{r}$, the lag-$l$ sample autocorrelation coefficient is estimated as:

$$\hat{\rho}_l = \frac{\sum_{t=l+1}^{T} (r_t - \bar{r})(r_{t-l} - \bar{r})}{\sum_{t=1}^{T} (r_t - \bar{r})^2}$$

The sequence $\hat{\rho}_1, \hat{\rho}_2, \hat{\rho}_3, \cdots$ is called the sample autocorrelation function of $r_t$. We consider the time series completely uncorrelated when all values of the autocorrelation function are 0. Therefore, we often need to test whether several autocorrelation coefficients are jointly 0.

2) AUTOREGRESSIVE (AR) MODEL
When the lag-1 autocorrelation coefficient (ACF) is significant, the value $r_{t-1}$ at time $t-1$ may be useful in predicting $r_t$ at time $t$. Based on this principle, we can build the following model: $r_t = \phi_0 + \phi_1 r_{t-1} + a_t$, where $\{a_t\}$ is a white noise sequence. This model is called a first-order autoregressive model, AR(1). It generalizes to the AR(p) model:

$$r_t = \phi_0 + \phi_1 r_{t-1} + \phi_2 r_{t-2} + \cdots + \phi_p r_{t-p} + a_t$$

We generally use the partial autocorrelation function and an information criterion to determine the order $p$; the information criterion usually used is AIC. The stationarity of AR(p) is examined as follows. We first assume that the sequence is weakly stationary, so that $E(r_t) = \mu$, $\mathrm{Var}(r_t) = \gamma_0$ and $\mathrm{Cov}(r_t, r_{t-j}) = \gamma_j$, where $\mu$ and $\gamma_0$ are constants. Because $\{a_t\}$ is a white noise sequence, $E(a_t) = 0$ and $\mathrm{Var}(a_t) = \sigma_a^2$. Taking expectations of the model gives:

$$E(r_t) = \phi_0 + \phi_1 E(r_{t-1}) + \phi_2 E(r_{t-2}) + \cdots + \phi_p E(r_{t-p})$$

By stationarity, $E(r_t) = E(r_{t-1}) = E(r_{t-2}) = \cdots = \mu$, so $\mu = \phi_0 + \phi_1 \mu + \phi_2 \mu + \cdots + \phi_p \mu$ and

$$E(r_t) = \mu = \frac{\phi_0}{1 - \phi_1 - \phi_2 - \cdots - \phi_p}$$

provided the denominator is not 0. The equation $1 - \phi_1 x - \phi_2 x^2 - \cdots - \phi_p x^p = 0$ is the characteristic equation of the model. The inverses of all the solutions of this equation are the characteristic roots of the model. The AR(p) sequence is stationary when the moduli of all the characteristic roots are less than 1.

3) MOVING AVERAGE (MA) PREDICTION MODEL
We directly give the form of the MA(q) model [11]:

$$r_t = c_0 + a_t - \theta_1 a_{t-1} - \cdots - \theta_q a_{t-q}$$

where $c_0$ is a constant term and $a_t$ is the perturbation, or new information, of the AR model at time $t$. The model therefore uses the random disturbances, or prediction errors, of the past $q$ periods to linearly express the current prediction value. The autocorrelation function of a $q$-order MA model always cuts off after lag $q$. Therefore, the MA(q) sequence is linearly related only to its first $q$ lagged values, so it is a "limited memory" model. This feature can be used to determine the order of the model. MA models are always weakly stationary because they are a finite linear combination of white noise sequences. Therefore, the MA model has the following properties: stationarity, finality and reversibility.

C. ARIMA PREDICTION MODEL
So far, we have focused on stationary sequences. If the sequence is non-stationary, we can consider using the ARIMA model. ARIMA can be used in statistics and artificial intelligence [12]. ARIMA has one more letter, "I", than ARMA; it stands for "integrated" and reflects the fact that a non-stationary sequence can be transformed into a stationary time series after $d$ rounds of differencing. To find the value of $d$, we first perform a stationarity test on the sequence after the first difference. If it is still non-stationary, we continue differencing until the sequence passes the test after $d$ differences. That number of differences is the value of $d$.

1) UNIT ROOT TEST
ADF is a common unit root test method [13]. Its null hypothesis is that the sequence has a unit root, that is, the sequence is non-stationary. For a stationary time series, the test must be significant at the given significance level so that the null hypothesis is rejected.

TABLE 1. Unit root inspection table.

According to Table 1 and Figure 2, we take as the null hypothesis that the sequence has a unit root. The null hypothesis cannot be rejected because the p-value is 0.1704489, which is much larger than the significance level. Therefore, the daily index series of the Shanghai Stock Index is non-stationary. We difference the sequence, as shown in Figure 3.

Figure 3 shows that the differenced sequence is approximately stationary. An ADF test on it gives a p-value of 2.31245750144e-30, so we can consider the sequence stationary.
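The sample autocorrelation estimator and the effect of differencing described in this section can be illustrated with a minimal sketch. It is not the ADF test and not the authors' code: it only computes the lag-l sample autocorrelation for a simulated random walk (a textbook non-stationary series) and for its first difference. The simulated data, the seed, and the helper names `sample_acf` and `difference` are assumptions made for illustration.

```python
import random


def sample_acf(series, lag):
    """Lag-l sample autocorrelation: sum over t = l+1..T of
    (r_t - mean)(r_{t-l} - mean), divided by sum over t = 1..T of (r_t - mean)^2."""
    n = len(series)
    mean = sum(series) / n
    numerator = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    denominator = sum((value - mean) ** 2 for value in series)
    return numerator / denominator


def difference(series):
    """First-order difference y_t = x_t - x_{t-1}."""
    return [current - previous for previous, current in zip(series, series[1:])]


# Simulated random walk (hypothetical data, not the Shanghai index series).
random.seed(0)
walk = []
level = 0.0
for _ in range(500):
    level += random.gauss(0.0, 1.0)
    walk.append(level)

acf_walk = sample_acf(walk, 1)              # expected to be close to 1
acf_diff = sample_acf(difference(walk), 1)  # expected to be close to 0

print(f"lag-1 ACF of the random walk:      {acf_walk:.3f}")
print(f"lag-1 ACF of its first difference: {acf_diff:.3f}")
```

A lag-1 autocorrelation near 1 that drops to near 0 after one difference is consistent with choosing d = 1, but a formal decision would still rest on a unit root test such as ADF.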
Because the p-value is very close to 0, the null hypothesis is rejected. Since the sequence is stationary after one difference, the value of $d$ for the original sequence can be taken as 1. Once the value of $d$ is determined, an ARMA model can be built on the differenced sequence. At present, ARIMA has been widely used in various fields [14]. Next, we use the ARIMA model to analyze an example in the financial field.

FIGURE 2. Unit root test chart.

FIGURE 3. Sub-difference.

IV. NEW THIRD BOARD RISK FORECAST
A unit root analysis of the stock price increase over a certain period can determine the stationarity of the increase series. When the series is stationary, the probability distribution of different rises and falls over the same period in the future can be inferred from the historical distribution of the increase, so that interested parties can prepare plans for extreme situations that seriously affect the net worth of funds. In recent years, the OTC New Third Board has developed rapidly in China. Careful observation shows that the New Third Board market has two characteristics. First, the overall market price volatility is significantly higher than that of the Shanghai and Shenzhen markets. Second, the volatility distribution is severely right-skewed. Because there is no daily price limit system, the fluctuation risk of individual stocks is often released quickly and violently. In the following, we focus on practical problems in the Chinese NEEQ stock market. Time series analysis is used to estimate the probability distribution of future rises and falls based on the differenced time series of daily rises and falls of stock prices.

A. ANALYTICAL METHOD
We use the three-step analysis method proposed above. In the first stage, a non-stationary sequence is transformed into a stationary time series by differencing. In the second stage, we use the ADF unit root test to check whether the time series is stationary. In the third stage, the probability distribution of different rises and falls within the same time period is inferred from the historical distribution of rises and falls, in order to prepare for extreme situations that may seriously affect the level of NAV.

STEP 1: Data preparation (data preprocessing). A time series is defined as a series of quantitative observations at consecutive times. In the analysis of financial time series, the price time series itself is generally non-stationary, is not completely randomly distributed, and has obvious autocorrelation. At the same time, the law of price distribution may change abruptly due to a variety of factors, so that a law established in a past stage may no longer hold in the future. Therefore, it is generally invalid to analyze the price time series directly in an attempt to find a law or regression formula. If the sequence is non-stationary, we pre-process the time series before applying the ARMA model. The usual method for dealing with a non-stationary time series is to take its first-order difference [15]. Generally, two methods can be used.

The first is to take the difference between adjacent values of the series $x_t$, which gives a new sequence $y_t$:

$$y_t = x_t - x_{t-1}$$

The second is to take the ratio of adjacent values, which gives a new sequence $y_t$:

$$y_t = \frac{x_t}{x_{t-1}}$$

A non-stationary sequence can be transformed into a stationary sequence after $d$ differences. The specific value of $d$ depends on the result of the stationarity test after differencing: if the series is still non-stationary, we continue differencing until the test indicates stationarity after $d$ differences.

The relative ratio of stock prices (the relative increase) is of more concern than the absolute value of stock price changes, so the ratio method is generally used in the analysis of financial product price time series [16]. At the same time, the stock price difference will continue to increase or decrease accordingly after the stock price continues to rise or fall. Therefore, we use the natural logarithm of the ratio of adjacent values in the stock price time series for first-order differencing, constructing a new series $y_t$:

$$y_t = \ln\left(\frac{x_t}{x_{t-1}}\right)$$

A prominent advantage of this method is that the resulting first-order difference sequence $y_t$ is approximately equal to the stock price increase, which can be directly used for probability prediction of the future stock price distribution. In this paper, we use the ratio method to deal with time series.

TABLE 2. Unit root test results.
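The differencing choices in STEP 1 can be compared with a minimal sketch. It computes the plain difference, the ratio, and the log ratio for a short price list and prints the log ratio next to the relative increase, to show why the two are approximately equal for small daily moves. The closing prices and the function names are hypothetical, chosen only for illustration; they are not data from the paper.

```python
import math


def difference(prices):
    """First method: y_t = x_t - x_{t-1}."""
    return [x_t - x_prev for x_prev, x_t in zip(prices, prices[1:])]


def ratio(prices):
    """Second method: y_t = x_t / x_{t-1}."""
    return [x_t / x_prev for x_prev, x_t in zip(prices, prices[1:])]


def log_ratio(prices):
    """Log-ratio method: y_t = ln(x_t / x_{t-1})."""
    return [math.log(x_t / x_prev) for x_prev, x_t in zip(prices, prices[1:])]


# Hypothetical closing prices, invented for illustration only.
closes = [10.00, 10.20, 10.10, 10.50, 10.40]

for x_prev, x_t, y_t in zip(closes, closes[1:], log_ratio(closes)):
    increase = x_t / x_prev - 1.0  # relative increase of the closing price
    print(f"{x_prev:.2f} -> {x_t:.2f}: log ratio {y_t:+.4f}, increase {increase:+.4f}")
```

One further property worth noting is that log ratios add up across periods: the sum of the daily values equals the log ratio of the last price to the first.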
STEP 2: Stationarity check. We apply the unit root test to the logarithmic rise series. Our goal is to investigate the stationarity of the residuals to determine whether the ARMA model is a good model for them. The null hypothesis of the unit root test is that the sequence has a unit root, that is, it is non-stationary. Rejecting the null hypothesis therefore means that the series (or the differenced sequence in this example) is stationary. Specifically, we use the ADF (Augmented Dickey-Fuller) test to check whether the time series is stationary.

STEP 3: Prediction of the probability distribution of risk. With the probability distribution of the next day's price, we can prepare for extreme situations that severely affect asset benefit levels. The basic idea of the ARMA model is to combine the AR and MA models so that the number of parameters used is kept small. The form of the model is:

$$r_t = \phi_0 + \sum_{i=1}^{p} \phi_i r_{t-i} + a_t - \sum_{i=1}^{q} \theta_i a_{t-i}$$

where $\{a_t\}$ is a white noise sequence and $p$ and $q$ are non-negative integers. Using the backshift operator $B$, which shifts a sequence back to the previous moment, the above model can be written as:

$$\left(1 - \phi_1 B - \cdots - \phi_p B^p\right) r_t = \phi_0 + \left(1 - \theta_1 B - \cdots - \theta_q B^q\right) a_t$$

From this we get the expectation of $r_t$:

$$E(r_t) = \frac{\phi_0}{1 - \phi_1 - \cdots - \phi_p}$$

The inverses of all solutions of the equation $1 - \phi_1 x - \phi_2 x^2 - \cdots - \phi_p x^p = 0$ are called the characteristic roots of the model. The ARMA model is stationary if the modulus of every characteristic root is less than 1. To control the amount of calculation, we limit the maximum order of AR to less than 6 and the maximum order of MA to not exceed 4. We then established the ARMA model with the order (3,3) selected by the AIC criterion.

We take Jindalai of the New Third Board as an example and predict the probability distribution of future fluctuations through the analysis of the closing price increase. Since Jindalai changed to market-making transfer on November 25, 2014, the historical data selected are the daily closing prices for 421 transfer days from January 5, 2015 to September 24, 2016. A logarithmic gain sequence is obtained after differencing the closing price sequence. According to our analysis framework (see Figure 1), we next perform a unit root test on the above two time series, using the ADF (Augmented Dickey-Fuller) test in EViews 8.

It can be seen from Table 2 that the existence of a unit root in the closing price time series cannot be rejected even at the 10% significance level, so the closing price time series is basically non-stationary. For the logarithmic rise and fall time series, the unit root hypothesis can be rejected even at the 1% significance level, which means that the time series of logarithmic increases and decreases is basically stationary. Therefore, we can use time series analysis to predict the future probability distribution based on the stock's past gain information. The following is a forecast of the distribution of the gain on the next transfer day based on historical gain information.

TABLE 3. Daily increase probability distribution table.

As can be seen from Table 3, the daily probability distribution of closing price increases is characterized by a peak and long tails. When applied to intraday T+0 trading, we should pay attention to setting reasonable stop-loss prices to avoid large losses caused by small-probability events, as shown in Figure 4.

FIGURE 4. Daily increase probability distribution and cumulative probability distribution.

When we want the increase distribution for a different time period, we cannot directly accumulate the distribution of a shorter period. For example, a weekly or monthly increase distribution cannot be obtained by superimposing daily increase distributions. The correct approach is to take the logarithmic first-order difference of the weekly closing price sequence directly, obtain the time series of weekly logarithmic rises, and process that sequence. For example, statistics on the weekly logarithmic rise time series of Jindalai are shown in Figure 5.

Other distribution probabilities can be obtained by interpolation in the distribution probability table or by taking intersection points on the probability distribution curve.
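The last two points, building the weekly series from weekly closing prices and reading intermediate probabilities by interpolation, can be sketched as follows. This is an illustrative assumption-laden example, not the authors' procedure: the prices are invented, a week is assumed to be five transfer days, the cumulative table is a plain empirical distribution rather than an ARMA-based forecast, and all function names are hypothetical.

```python
import math
from bisect import bisect_right


def log_returns(closes):
    """Logarithmic first-order difference y_t = ln(x_t / x_{t-1})."""
    return [math.log(x_t / x_prev) for x_prev, x_t in zip(closes, closes[1:])]


def empirical_cdf(returns, level):
    """Share of observed returns that are less than or equal to `level`."""
    ordered = sorted(returns)
    return bisect_right(ordered, level) / len(ordered)


def interpolate(table, level):
    """Linear interpolation in a sorted (level, cumulative probability) table."""
    for (x0, p0), (x1, p1) in zip(table, table[1:]):
        if x0 <= level <= x1:
            return p0 + (p1 - p0) * (level - x0) / (x1 - x0)
    raise ValueError("level lies outside the table")


# Hypothetical daily closing prices, invented for illustration only.
daily_closes = [10.0, 10.2, 10.1, 10.5, 10.4, 10.6, 10.3, 10.7,
                10.9, 10.8, 11.0, 10.9, 11.2, 11.1, 11.4, 11.3]

# Assumption: five transfer days per week, so every fifth close is a weekly close.
weekly_closes = daily_closes[::5]

daily_r = log_returns(daily_closes)
weekly_r = log_returns(weekly_closes)  # computed from weekly closes directly

# Cumulative probability table of daily log increases on a coarse grid.
grid = [-0.04, -0.02, 0.0, 0.02, 0.04]
table = [(level, empirical_cdf(daily_r, level)) for level in grid]

# Probability of a daily log increase of at most 1%, read off by interpolation.
p_one_percent = interpolate(table, 0.01)
print(table)
print(p_one_percent)
print(weekly_r)
```

With real data the table would be much finer and would come from the fitted model, so the interpolation error would be correspondingly smaller.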
[16] F. E. Tay and L. Cao, "Modified support vector machines in financial time series forecasting," Neurocomputing, vol. 48, nos. 1–4, pp. 847–861, Oct. 2002.
[17] D. Zhang, "High-speed train control system big data analysis based on fuzzy RDF model and uncertain reasoning," Int. J. Comput., Commun. Control, vol. 12, no. 4, p. 577, Jun. 2017.
[18] W. Xu, L. Liu, Q. Zhang, and P. Liu, "Location decision-making of equipment manufacturing enterprise under dual-channel purchase and sale mode," Complexity, vol. 2018, Dec. 2018, Art. no. 3797131.
[19] D. Zhang, J. Sui, and Y. Gong, "Large scale software test data generation based on collective constraint and weighted combination method," Tehnicki Vjesnik-Tech. Gazette, vol. 24, no. 4, pp. 1041–1049, Jul. 2017.
[20] W. Xu and Y. Yin, "Functional objectives decision-making of discrete manufacturing system based on integrated ant colony optimization and particle swarm optimization approach," Adv. Prod. Eng. Manage., vol. 13, no. 4, pp. 389–404, Dec. 2018.
[21] W. Bao, J. Yue, and Y. Rao, "A deep learning framework for financial time series using stacked autoencoders and long-short term memory," PLoS ONE, vol. 12, no. 7, Jul. 2017, Art. no. e0180944.
[22] J. Y. Chen, "Thrown under the bus and outrunning it! The logic of DiDi and taxi drivers' labour and activism in the on-demand economy," New Media Soc., vol. 20, no. 8, pp. 2691–2711, Aug. 2018.
[23] L. Zhang, J. Lu, J. Zhou, J. Zhu, Y. Li, and Q. Wan, "Complexities' day-to-day dynamic evolution analysis and prediction for a Didi taxi trip network based on complex network theory," Mod. Phys. Lett. B, vol. 32, no. 9, Mar. 2018, Art. no. 1850062.
[24] Y. Lu and X. Xiong, "Topic analysis of microblog about 'didi taxi' based on K-means algorithm," Amer. J. Inf. Sci. Technol., vol. 3, no. 3, p. 72, 2019.
[25] J.-C. Kim and K. Chung, "Mining based time-series sleeping pattern analysis for life big-data," Wireless Pers. Commun., vol. 105, no. 2, pp. 475–489, Mar. 2019.

FANG WANG received the Ph.D. degree in statistics from Beijing Jiaotong University, Beijing, China. She is currently a Postdoctoral Researcher with Beijing Jiaotong University. Her main research involves game theory, machine learning, industrial economics, and industrial security.

MENGGANG LI (Member, IEEE) received the Ph.D. degree in applied economics from Beijing Jiaotong University, Beijing, China. He is currently the Dean of the National Academy of Economic Security, Beijing Jiaotong University, the Director of the Beijing Laboratory of National Economic Security Early-Warning Engineering and the Beijing Philosophy and Social Science Beijing Industrial Security and Development Research Base, and the Chairman of the IEEE Professional Committee in Logistics, Informatics, and Industrial Security System. His current research concerns national economic security, industrial economics, and industrial security.

YIDUO MEI received the B.E. and Ph.D. degrees in computer science and technology from Xi'an Jiaotong University, in 2004 and 2011, respectively. He is currently a Postdoctoral Researcher with the National Academy of Economic Security, Beijing Jiaotong University. His main research interests include blockchain, AI, cloud computing, grid computing, trust management, and big data.

WENRUI LI received the Ph.D. degree in applied economics from Beijing Jiaotong University, Beijing, China. He is currently a Lecturer of industrial economics with Beijing Jiaotong University. His current research involves industrial economics and industrial security.