2022 3rd Information Communication Technologies Conference
Stock Trend Prediction Based on ARIMA-LightGBM Hybrid Model
2022 3rd Information Communication Technologies Conference (ICTC) | 978-1-6654-9508-0/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICTC55111.2022.9778304
Xiuyan Zheng, Jiajing Cai*, Guangfu Zhang
College of Information Engineering, Hainan Vocational University of Science and Technology, Haikou, China
[email protected], [email protected], [email protected]
*Corresponding Author
Abstract: As an important part of the capital market, the stock market plays an increasingly important role in social and economic development. Stock trend prediction has long been a popular research topic among specialists and academics in economics, finance and data analysis. In this paper, Gree Electric Appliances stock is selected as the research object. Once the training set and test set are determined, the ARIMA model and the LightGBM model, both commonly used for forecasting, are each used to predict the trend of the stock, and the benefits and drawbacks of the two models in stock trend prediction are analyzed and summarized. On this basis, we propose the ARIMA-LightGBM hybrid model to predict the trend of Gree Electric Appliances stock over six months. In the proposed hybrid model, the ARIMA model is first used for the six-month prediction of the exogenous variables. Secondly, the LightGBM model is used to model the exogenous variables predicted by the ARIMA model to obtain the predicted stock trend for the next six months. Comparison with the actual Gree Electric Appliances stock price trend shows that the prediction accuracy of the proposed ARIMA-LightGBM hybrid model is better than that of the LightGBM model. At the end of the paper, we also put forward some investment strategies based on the forecast results.

Keywords: ARIMA model, LGBM model, Stock prediction, ARIMA-LGBM hybrid model

I. INTRODUCTION
With the development of the economy, the growth of residents' income and the improvement of the stock market system, the stock market will become the main investment channel for residents in the future. Effective and accurate prediction of stock trends helps investors formulate investment strategies in time and reduce the risk of stock investment. Following policy developments closely and analyzing K-line data are common manual forecasting methods. However, the trend of a stock is determined by a variety of index data, and a manual forecast sometimes deviates greatly from the actual trend, which may lead to serious economic losses. At present, many experts and scholars use different data analysis techniques to study stock forecasting. Some scholars have proposed stock price prediction models based on mathematical and statistical analysis, such as exponential smoothing, multiple linear regression, the autoregressive moving average model and the autoregressive integrated moving average (ARIMA) model [1]. The study by Ariyo et al. showed that ARIMA achieves high processing efficiency and good prediction accuracy in stock time series prediction, especially for short-term prediction [2]. However, it is difficult to obtain better prediction accuracy with this simple linear model, because stock price trends exhibit nonlinear characteristics caused by various indexes. With the development of artificial intelligence technology, many machine learning algorithms, such as the support vector machine, decision tree, random forest and deep neural network, have been widely used to predict stock trends because of their good fitting ability on nonlinear problems [3]. Ye et al. found that a stock prediction model based on LightGBM offers better forecasting accuracy, higher speed and higher returns than other machine learning models [4].

The stock trend prediction models mentioned above are single models. However, Fabbiani et al. pointed out that single models such as linear regression, random forest and the support vector machine are inferior to integrated (ensemble) models [5]. This is mainly because stock prices are easily affected by short-term factors, which makes it difficult for a single model to capture the behavior of the stock price index. Therefore, this paper proposes the ARIMA-LightGBM hybrid model, which integrates the advantages of the ARIMA model and the Light Gradient Boosting Machine (LightGBM) model. We use the three models (ARIMA, LightGBM and the hybrid) to forecast the trend of Gree Electric Appliances stock. According to the RMSE evaluation index and the comparison charts, the proposed model has higher prediction accuracy.

II. METHODOLOGY
A. ARIMA Model
The ARIMA model is one of the most common time series prediction models among statistical models. It combines the advantages of time series and regression analysis and has considerable flexibility, so it is widely used in stock forecasting applications. ARIMA(p, d, q) is called the autoregressive integrated moving average model, which is
the combination of the autoregressive (AR) model, the moving average (MA) model and the differencing method. Here p is the number of autoregressive terms, q is the number of moving average terms, and d is the number of differences applied to make the time series stationary. The parameters p and q are the key parameters of the AR model and the MA model respectively.
The modeling process of the ARIMA(p, d, q) model is as follows:
1) Data preprocessing. The stock data of Gree Electric Appliances are sampled first. Because of stock trading rules, the original data set is discontinuous in time. Therefore, we use linear interpolation to fill the gaps: the value at a missing point is obtained by taking the mean of the left and right adjacent values in the data sequence, which better suits a time series prediction model.
2) Stationarity test. The second step of the ARIMA(p, d, q) modeling process is to test the series for stationarity. Because visual inspection is inevitably subjective, a formal statistical test of stationarity is required. Therefore, the augmented Dickey-Fuller (ADF) test is used for the stationarity test; the related formulas (1), (2) and (3) are shown as follows.
$$\Delta X_t = \delta X_{t-1} + \sum_{i=1}^{m} \gamma_i \Delta X_{t-i} + \varepsilon_t \tag{1}$$

$$\Delta X_t = \alpha + \delta X_{t-1} + \sum_{i=1}^{m} \gamma_i \Delta X_{t-i} + \varepsilon_t \tag{2}$$

$$\Delta X_t = \alpha + \beta t + \delta X_{t-1} + \sum_{i=1}^{m} \gamma_i \Delta X_{t-i} + \varepsilon_t \tag{3}$$

For the stabilized rate-of-increase series of Gree Electric Appliances stock, the p-values are all close to 0, so the ADF statistics reject the null hypothesis of a unit root. It can be concluded that the data are stationary and pass the stationarity test.
3) Model order determination. The parameters p, d and q are central to the ARIMA(p, d, q) model, and their values are determined in this step in order to better predict stock trends. The parameter d is obtained from the stationarity test above. The values of p and q are determined from the partial autocorrelation function (PACF) and the autocorrelation function (ACF) respectively. The autocorrelation function is expressed as follows:

$$\mathrm{ACF}(k) = \rho_k = \frac{\mathrm{Cov}(y_t, y_{t-k})}{\mathrm{Var}(y_t)} \tag{4}$$

Through experiment, the value of d is determined to be 0 according to the stationarity result. For p and q, the orders are determined by observing where the PACF and ACF plots cut off or tail off. The ACF and PACF plots of Gree Electric Appliances stock are therefore drawn in Fig. 1.

Fig. 1. ACF and PACF chart.

The candidate orders were read from the cut-off and tailing-off behavior of the ACF and PACF graphs. Finally, the parameters p and q of the model were determined to be 12 and 3 respectively.

$$\mathrm{AIC} = -2\ln(L) + 2K \tag{5}$$

where L represents the maximum likelihood of the model and K represents the number of model parameters.

$$\mathrm{BIC} = -2\ln(L) + K\ln(n) \tag{6}$$

where n stands for the sample size.
5) Model test. A residual white-noise test and a parameter test are carried out on the fitted model to determine whether it is adequate. If the residual sequence is not a white-noise sequence, return to step 3) and re-establish the model until it passes both the parameter test and the residual white-noise test. The ARIMA(p, d, q) model obtained here has passed the parameter test and the residual white-noise test.
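Three of the numerical pieces in the modeling steps above are small enough to sketch without a statistics library: the gap filling of step 1, the sample autocorrelation of equation (4), and the AIC and BIC criteria. The following is a minimal illustration, not the authors' implementation. The function names, the use of `None` to mark missing days, and the sample values are assumptions made for this example; for a single missing point the interpolation reduces to the mean of its two neighbours, as described in step 1.

```python
import math


def interpolate_gaps(values):
    """Fill None entries by linear interpolation between the nearest known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled  # gaps before the first or after the last known value stay None


def acf(series, k):
    """Sample autocorrelation at lag k: Cov(y_t, y_{t-k}) / Var(y_t)."""
    n = len(series)
    mean = sum(series) / n
    var = sum((y - mean) ** 2 for y in series)
    cov = sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n))
    return cov / var


def aic(log_likelihood, num_params):
    """AIC = -2 ln(L) + 2K."""
    return -2.0 * log_likelihood + 2.0 * num_params


def bic(log_likelihood, num_params, n_obs):
    """BIC = -2 ln(L) + K ln(n)."""
    return -2.0 * log_likelihood + num_params * math.log(n_obs)


# Made-up series with missing (non-trading) days marked as None.
prices = [10.0, None, 12.0, 13.0, None, None, 16.0]
filled = interpolate_gaps(prices)
lag1 = acf(filled, 1)

# The log-likelihood below is a placeholder value, not a fitted result.
print(filled, round(lag1, 4), aic(-100.0, 3), round(bic(-100.0, 3, 100), 4))
```

In practice the log-likelihood would come from the fitted ARIMA model, and lower AIC or BIC values indicate a better trade-off between fit and model size.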
B. LGBM Model
LightGBM (Light Gradient Boosting Machine) is a gradient boosting framework provided by Microsoft [7]. Like XGBoost, it is an efficient implementation of the Gradient Boosting Decision Tree (GBDT). In principle, the negative gradient of the loss function is used as an approximation of the residual of the current model, and a new decision tree is fitted to it. Compared with GBDT and XGBoost, LightGBM adopts a histogram-based decision tree algorithm, the core of which is to discretize continuous floating-point feature values into K discrete values and construct a histogram of width K. The training data are then traversed and the cumulative statistics of each discrete value are accumulated in the histogram. When selecting a split, only the discrete values of the histogram need to be traversed to search for the optimal split point [8]. Because the algorithm stores only the discrete values of the features, it effectively reduces memory usage and improves training speed; it is therefore widely used in machine learning tasks such as ranking and classification.
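The histogram idea described above can be illustrated with a short, self-contained sketch. It is a simplification under stated assumptions and is not LightGBM's actual implementation: bins are equal-width, the loss is squared error (so the sample count per bin stands in for the second-order statistic), and the function name `best_histogram_split` and the sample data are hypothetical.

```python
def best_histogram_split(feature, residuals, num_bins=4):
    """Bin one continuous feature, accumulate per-bin statistics,
    then scan only the bin boundaries for the best split."""
    lo, hi = min(feature), max(feature)
    width = (hi - lo) / num_bins
    if width == 0.0:
        width = 1.0  # constant feature: every sample falls into bin 0

    # Discretize each value into one of num_bins integer bins.
    bin_of = [min(int((x - lo) / width), num_bins - 1) for x in feature]

    # One pass over the data builds the histogram of residual sums and counts.
    resid_sum = [0.0] * num_bins
    count = [0] * num_bins
    for b, r in zip(bin_of, residuals):
        resid_sum[b] += r
        count[b] += 1

    total_r, total_n = sum(resid_sum), len(feature)
    best_gain, best_bin = 0.0, None
    left_r, left_n = 0.0, 0
    for b in range(num_bins - 1):  # candidate split after bin b
        left_r += resid_sum[b]
        left_n += count[b]
        right_r, right_n = total_r - left_r, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        # Reduction in squared error obtained by splitting here.
        gain = left_r ** 2 / left_n + right_r ** 2 / right_n - total_r ** 2 / total_n
        if gain > best_gain:
            best_gain, best_bin = gain, b

    threshold = None if best_bin is None else lo + (best_bin + 1) * width
    return threshold, best_gain


# Made-up data: two clusters of feature values with opposite residuals.
feature = [0.1, 0.2, 0.3, 0.4, 2.1, 2.2, 2.3, 2.4]
residuals = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
threshold, gain = best_histogram_split(feature, residuals, num_bins=4)
print(threshold, gain)
```

The scan visits at most K - 1 candidate thresholds per feature regardless of how many samples there are, which is where the speed and memory savings described above come from.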
This paper builds a prediction model of the stock's rate of increase based on the LightGBM algorithm, and the specific steps are as follows.

[Figure: modeling steps; recoverable labels: Divide, Use, Evaluate]

C. ARIMA-LGBM Hybrid Model
The ARIMA model needs only its own data sequence, without the help of other exogenous variables, and is suited to sequences with apparent regularity and periodicity. LightGBM is an algorithm with strong generalization ability, high fitting accuracy and fast training speed. The ARIMA-LGBM hybrid model is proposed based on the advantages of the two models.
Fig. 2 illustrates the key modeling steps of the proposed model. After the stock to be studied is confirmed, the characteristic indexes of the stock are extracted, and the data of these indexes are preprocessed. The second step is to feed each preprocessed index into the ARIMA model to predict the values of that index over the next 6 months. In the next step, the characteristic indexes and the rise-and-fall rate (the prediction target) of the selected stock are fed into the LightGBM model for training. The trained LightGBM model can then be used to predict changes in the stock under study over the three months starting from September 2021.

Fig. 2. ARIMA-LGBM hybrid model.

III. DATA AND EMPIRICAL ANALYSIS
A. Data Set
The data set of this research consists of index data for Gree Electric Appliances stock, obtained mainly through the Wind database terminal. The data of this stock from August 2020 to November 2021 are selected. Specifically, the data from August 2020 to August 2021 are used as the training set, and the data from September 2021 to November 2021 are used as the test set, to fit the final model and evaluate its merits.
B. Prediction Based on ARIMA Model
The ARIMA model established in this paper is trained on one year of stock index data starting from August 2020, and is then used to test and predict the stock price from September 2021 to November 2021. As shown in Fig. 3, the blue line represents the actual stock price and the red line represents the forecast stock price. As can be seen from the line chart, the predicted stock trend is partly in line with the actual trend, but the model still cannot predict the trend at inflection points. Although the ARIMA model predicts stationary series well, stock prices are highly random, and a non-stationary series usually has to be converted into a stationary one before prediction. It can be inferred that the ARIMA model built here is not well suited to predicting stock trends.

Fig. 3. Prediction diagram of ARIMA model.
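The chronological split described in Section III-A (training on August 2020 to August 2021, testing on September to November 2021) can be sketched as follows. This is a minimal illustration: the function name `chronological_split` and the sample rows are assumptions, and the values are placeholders rather than real Gree Electric Appliances prices.

```python
from datetime import date


def chronological_split(rows, test_start):
    """Split (date, value) rows by date, keeping time order and never shuffling."""
    ordered = sorted(rows, key=lambda row: row[0])
    train = [row for row in ordered if row[0] < test_start]
    test = [row for row in ordered if row[0] >= test_start]
    return train, test


# Placeholder rows, deliberately out of order to show that sorting happens first.
rows = [
    (date(2021, 8, 30), 3.0),
    (date(2020, 8, 3), 1.0),
    (date(2021, 9, 1), 4.0),
    (date(2021, 11, 30), 5.0),
    (date(2021, 1, 4), 2.0),
]
train, test = chronological_split(rows, test_start=date(2021, 9, 1))
print(len(train), len(test))
```

Splitting by date rather than at random keeps every training observation earlier than every test observation, which is what a forecasting evaluation requires.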
C. Prediction Based on LGBM Model
The LGBM model established in this paper was trained on data from August 2020 to August 2021, and model testing and prediction were conducted on data from September 2021 to November 2021. As shown in Fig. 4, the blue line represents the actual stock trend and the red line represents the forecast stock trend. It can be seen that the deviation between the predicted values and the real values is small, so the prediction effect of the model is good.

[Table I row label: ARIMA-LGBM]

According to the stock forecast trend chart, this paper gives the corresponding investment strategy. Investors can increase their holdings of Gree Electric Appliances in the long term until February 2021, but after February they should adopt neutral or avoidance strategies according to the actual situation.
E. Prediction Based on ARIMA-LGBM Hybrid Model
For the ARIMA model, the values of p and q are still determined by manual observation of the PACF and ACF graphs. If the BIC value were used as the criterion for parameter selection, the code would take a long time to run; therefore, the graph observation approach is retained in this study. For the LGBM parameters, the grid search method [9] was considered, but its most significant disadvantage is that it is time-consuming to run. Therefore, the LGBM parameters were set based on empirical values from prior experience, and the forecast trend of the stock in the following three months was obtained.
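The two-stage data flow of the hybrid model (forecast each exogenous indicator first, then map the forecast indicators to the stock trend) can be sketched in a few lines. The sketch below shows only the structure and uses deliberately simple stand-ins so that it runs without external libraries: a least-squares AR(1) model replaces ARIMA(p, d, q), and a one-feature linear regression replaces LightGBM. All function names and the data are assumptions made for illustration; they are not the models or results of this paper.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def forecast_indicator(history, steps):
    """Stage 1 (stand-in for ARIMA): fit x_t = c + phi * x_{t-1}, iterate forward."""
    intercept, phi = fit_line(history[:-1], history[1:])
    forecasts, last = [], history[-1]
    for _ in range(steps):
        last = intercept + phi * last
        forecasts.append(last)
    return forecasts


def hybrid_forecast(indicator_history, target_history, steps):
    """Stage 2 (stand-in for LightGBM): learn target from indicator on history,
    then apply the learned mapping to the stage-1 indicator forecasts."""
    future_indicator = forecast_indicator(indicator_history, steps)
    intercept, slope = fit_line(indicator_history, target_history)
    return [intercept + slope * value for value in future_indicator]


# Synthetic data: the indicator follows x_t = 1 + 0.5 * x_{t-1},
# and the target is an exact linear function of the indicator.
indicator = [1.0, 1.5, 1.75, 1.875, 1.9375]
target = [2.0 * value + 1.0 for value in indicator]
predicted = hybrid_forecast(indicator, target, steps=3)
print(predicted)
```

With several indicators, stage 1 would be repeated once per indicator and stage 2 would take the full feature vector; the separation between the two stages stays the same.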
Fig. 4. Stock simulation forecast chart of LGBM model.
D. Comparative Analysis of ARIMA Model and LGBM Model
In this paper, the root mean square error (RMSE) is used as the indicator for evaluating the models. For example, RMSE = 10 can be read as the predictions differing from the true values by about ten on average. The formula is as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2} \tag{7}$$

where $\hat{y}_i$ is the predicted value, $y_i$ is the real value and n is the number of samples. The range of RMSE is [0, +∞). When the predicted values are identical to the real values, the RMSE equals 0 and the model is perfect; the larger the RMSE value is, the greater the error is.

Fig. 5. Stock forecast trend of proposed model
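The RMSE definition used in the comparative analysis translates directly into code. The following minimal sketch uses only the standard library; the function name `rmse` and the sample numbers are illustrative assumptions, not results from this paper.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equally long sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Made-up values: squared errors are 1, 0 and 4, so RMSE = sqrt(5 / 3).
score = rmse([2.0, 4.0, 6.0], [1.0, 4.0, 8.0])
perfect = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
print(score, perfect)
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than several small ones, which is worth keeping in mind when comparing models near inflection points.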
The ARIMA model, as a relatively mature time series model, is characterized above all by the fact that it does not take into account the influence of external derived variables on the predicted value and considers only the data in the training set. The LGBM model, as a mature ensemble algorithm, can take into account the influence of external variables on the predicted value. Therefore, the LGBM model reflects the influence of the feature indexes selected from research reports, while the ARIMA model does not make full use of the feature indexes selected earlier. The stock trend charts fitted by the two models can thus be compared, as shown in Fig. 3 and Fig. 4, an RMSE value can be obtained for each, and corresponding strategies can be provided.
After the feature indexes from research reports are selected, the fitting effect of the LGBM model is better than that of the ARIMA model, as shown in Table I, which presents the comparison more intuitively.

TABLE I. FITTING EFFECT VALUES

IV. CONCLUSION
In order to improve the accuracy of stock trend prediction, this paper constructed a combined prediction method using ARIMA and LGBM, taking the characteristics of stock trends into account. It established a stock trend prediction model based on time series analysis and an ensemble algorithm. Firstly, a brief introduction to the ARIMA and LightGBM algorithms is given. Secondly, the prediction results are compared and analyzed. Then, the design principle and idea of the ARIMA-LGBM hybrid model are described in detail, and the flow chart of the algorithm model and the pseudo code of the algorithm are given. Theoretical analysis and experimental results show that the combined model is better than the single model. In the face of a complex stock market, the ARIMA-LGBM hybrid model can achieve more rapid and accurate prediction, which can reduce investors' risk to a certain extent. The model handles time series efficiently, is portable, and has practical application value for other time series problems.

REFERENCES
[1] Chatzis S P, Siakoulis V, Petropoulos A, et al. Forecasting stock market crisis events using deep and statistical machine learning techniques[J]. Expert Systems with Applications, 2018, 112: 353-371.
[2] Ariyo A A, Adewumi A O, Ayo C K. Stock price prediction using the ARIMA model[C]//2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation. IEEE, 2014: 106-112.
[3] Weng B, Lin L, Xing W, et al. Predicting short-term stock prices using ensemble methods and online data sources[J]. Expert Systems with Applications, 2018, 112: 258-273.
[4] Ye F, Wang J, Li Z, et al. Jane Street Stock prediction model based on LightGBM[C]//2021 6th International Conference on Intelligent Computing and Signal Processing (ICSP). IEEE, 2021: 385-388.
[5] Yu X. Light Gradient Boosting Machine: An efficient soft computing model for estimating daily reference evapotranspiration with local and external meteorological data[J]. Agricultural Water Management, 2019, 225: 105758.
[6] Ding W Z. Comparison of ARIMA model and LSTM model based on stock forecasting[J]. Industrial Control Computer, 2021, 34(07): 109-112+116.
[7] Yu X. Light Gradient Boosting Machine: An efficient soft computing model for estimating daily reference evapotranspiration with local and external meteorological data[J]. Agricultural Water Management, 2019, 225: 105758.
[8] Guo Y, Li Y, Xu Y. Study on the application of LSTM-LightGBM Model in stock rise and fall prediction[C], MATEC Web of Conferences. EDP Sciences, 2021, 336: 05011.
[9] Ding Hui. Support vector Machine enterprise based on grid optimization model. Research on the application of Credit Rating[J]. Yinchuan Central Sub-Branch of The People's Bank of China, 2021(10) 63:66.