How to make machine select stocks like fund managers? Use scoring and screening model✩
Yanrui Li a, Kaiyou Fu b, Yuchen Zhao a, Chunjie Yang a,∗
a College of Control Science and Engineering, State Key Laboratory of Industrial Control Technology, Zhejiang University, Zheda Road 38, Hangzhou, China
b School of Medicine, Zhejiang University, Hangzhou, China
With the development of technology and the abundance of data, many novel methods such as artificial intelligence and machine learning have emerged for quantitative finance. This work builds a framework with a screening function to help investors create a portfolio of stocks based on data from multiple sources, including historical trading data, factor data, financial data and media data. The framework integrates scoring and screening models. The scoring model consists of a Seq2Seq model using historical trading data and a factor model using a new bottom-up discretization method, while the screening model is composed of a novel discriminative model and a media model based on the weighted stock relation graph. The two types of model are fused to select a portfolio with screening ability. This framework has been verified in China's A-share market and proved to be effective. We also noticed that the fused model is sensitive to the scale of selected stocks and the length of the prediction period, which means it can be quickly adjusted according to the trading strategy.

Keywords: Stock selection; Model fusion; Stock screening; Factor modeling; Deep learning
1. Introduction

Quantitative trading has been used in various fields of finance, including financial instrument selection, timing, arbitrage, and risk control. The goal of stock investment is to create a portfolio of stocks that maximizes the overall return with regard to the risk of the stocks in that portfolio (Markowitz, 1952), and its success depends heavily on selecting the right stocks. The efficient market hypothesis (EMH) (Fama, 1970) states that, in a perfectly efficient market, stock prices reflect all available information and follow random paths, and therefore cannot be predicted. However, many studies have shown that the stock market does not conform to the EMH and that stock fluctuations are not a random walk (Carhart, 1997; Jaworski & Pitera, 2014). Therefore, how to predict the stock trend and establish a profitable portfolio is worth studying. Still, predicting the market is an extremely difficult task, and no strategy can keep winning all the time. In our view, this is because the stock price not only reflects the market value of the corresponding company but also serves as the medium of a game among investors. Hence there cannot be a strictly dominant strategy for this game.

With the development of network technology, an ocean of data collected from various sources, including historical trading data, company relationship data and public opinion data, has become easier to obtain. Many novel methods using artificial intelligence and machine learning also provide us with the ability to process these types of data. Therefore, using new algorithms and different types of data to assist stock investment has been an attractive topic in both academia and business. Some examples are given in Section 2.

However, three universal defects limit the ability of this data intelligence:

(1) Model evaluation index does not match the application. Most models were built as a classification (on price movement direction) or a regression (on price value) task to predict the change of an index or a stock. But in real investment, the objective is to create a portfolio, not just to predict the price of stocks. The model therefore needs to adapt to the stocks of a wide range of different companies and find those that can bring benefit, rather than predict the future price of individual stocks.

(2) Missing screening function. When fund managers choose stocks, they can selectively filter out stocks that they are not familiar with or consider risky, and create a portfolio among those they are familiar with. This screening ability can make their portfolio more reliable. However, most methods lack this screening ability, which makes the established model mediocre because some anomalous stocks are hard to fit with a unified model.

(3) Lack of information fusion. There are many kinds of information related to stock prediction, such as historical trading data, basic information of a company, media news about the company, etc.
✩ This work was supported by National Natural Science Foundation of China (61933015).
∗ Corresponding author.
E-mail addresses: [email protected] (Y. Li), [email protected] (K. Fu), [email protected] (Y. Zhao), [email protected] (C. Yang).
https://doi.org/10.1016/j.eswa.2022.116629
Received 2 November 2021; Received in revised form 25 January 2022; Accepted 29 January 2022
Available online 17 February 2022
0957-4174/© 2022 Published by Elsevier Ltd.
Y. Li et al. Expert Systems With Applications 196 (2022) 116629
Although there have been studies that combine a variety of data, few of them discussed the characteristics of different types of data. For example, historical trading data is more suitable for short-term investment, fundamental analysis is more suitable for long-term investment, and media news is often event-driven. Therefore, how to fuse these data according to their characteristics still needs further research.

To address the mentioned limits, we propose a method with the following advantages. First, we use the average return of the portfolio as our evaluation indicator to build and test our model, with the Wilcoxon rank-sum test to evaluate the performance of the selected portfolio statistically. Second, the framework has filtering capability, obtained by fusing the novel discriminative model and the news sentiment analysis model. Third, complete data related to stocks, including historical trading data, basic information of the company, factor data and media news text, are considered in this fusion model, and the impact range and characteristics of each type of data are discussed. The novelty of our method is reflected in the specific methods used to realize these advantages. First, we use the Seq2Seq model and the t-merge factor model to extract the information in historical trading data and factor data respectively, and fuse them to give the basic ranking. The proposed scoring models show great improvement compared to other methods, especially when they are combined. Second, the novel discriminative model and a media sentiment model based on the weighted stock relation graph are used to detect anomalous stocks. The two models are designed for the screening purpose and extract information from factor data and media news data respectively. Finally, in the model fusion process, the screening model gives reward and punishment to the stock list ranked by the predicted return of the scoring models. This fusion method can increase the stability of the selected portfolio and reduce its risk.

The rest of this paper is organized as follows. Related work is presented in Section 2. In Section 3, we introduce the source of the data used in the proposed framework, the overall structure of the framework and how we evaluate its performance. In Section 4, we describe the detailed structure of each model. In Section 5, the performance of each model and of the fusion model is presented and discussed, followed by the conclusion in Section 6.

2. Related work

From the technical perspective, research related to stock investment mainly uses the following four methods: factor model, time series analysis method, machine learning and deep learning.

(1) Multi-factor model: a multi-factor model is a financial model that employs multiple factors in its calculations to explain the stock price, and it has been widely used for decades. One of the most famous is the Fama–French three-factor model, which uses three factors, namely the market risk, the outperformance of small versus big companies and the outperformance of high book/market versus low book/market companies, to explain the USA stock market (Fama & French, 1992). Later, it was extended to a five-factor model by adding two further factors: profitability and investment (Fama & French, 2017). Although this model was validated in the USA stock market, it was shown to be unable to offer a convincing asset pricing model for the stock market in the UK (Foye, 2017). Therefore, there is still no widely accepted conclusion on what factors should be used to explain different markets. How to find new factors that can explain the market over a relative validity period is becoming the focus of more and more fund companies. The factors are roughly classified as quality factor, Barra factor, risk factor, technical factor, basic factor, etc. (Harvey et al., 2015), and more and more factors have been developed. Besides, automatic feature construction is often used to solve the inconvenience of manual feature construction (LeDell & Poirier, 2020; Li & Yang, 2021).

(2) Machine learning: compared to the multi-factor model, machine learning often uses financial time series data to forecast the price of stocks, as a regression task, or to predict the future market trend, as a classification task (Ayala et al., 2021; Bustos & Pomares-Quimbaya, 2020). Several algorithms, like Support Vector Regression (SVR) (Chen & Hao, 2017), Bayesian analysis (Tsay & Ando, 2012), neural networks (Site et al., 2019) and K-Nearest Neighbors (Vijh et al., 2020), have been developed and applied. In Patel et al. (2015a), the authors compare four prediction models, Artificial Neural Network (ANN), Support Vector Machine (SVM), random forest and naive Bayes, with two approaches for input, and show that the performance improves when trend deterministic data are used. In Patel et al. (2015b), the authors proposed the two-stage fusion approaches SVR-ANN, SVR-random forest (RF) and SVR–SVR to predict the future market index. Compared to traditional time series processing methods based on statistics, such as the Autoregressive Integrated Moving Average model (ARIMA) (Healy, 1964), machine learning methods perform better on nonlinear multidimensional financial data (Kumar et al., 2021). Besides, some researchers address stock selection problems by clustering to control the investment management (Han & Ge, 2020; Sun et al., 2021; Wang, 2011). In Majumdar and Laha (2020), the authors utilize topological data analysis (TDA) to perform time series classification and discern different stock indexes.

(3) Deep learning: deep learning, also called the deep neural network, is widely studied in finance these days, since it has achieved excellent development in the natural language processing (NLP) and image processing fields. Its ability to extract features from a large set of raw data without relying on prior knowledge of the predictors, and to extract complex hidden patterns from finance data in both temporal and spatial dimensions, has significantly increased the application of deep learning in the capital market recently. Most of the deep learning research in the finance field focuses on modeling financial time series using recurrent neural networks (RNN) and convolutional neural networks (CNN) (Bukhari et al., 2020; Chen et al., 2018; Saud & Shakya, 2020). Besides, considering that different stocks are related by shareholders, industrial chains, etc., this kind of graph-structured relationship is introduced into the model by graph convolutional neural networks. For example, Chen et al. (2021) proposed the graph convolutional feature based convolutional neural network (GC-CNN) model, considering both stock market information and individual stock information.

(4) Financial text model: text, which contains the most abundant intuitive information, can be automatically analyzed with the development of technology, and many economic news websites provide researchers with sufficient data to study (Li et al., 2018). One of the main research directions is the sentiment analysis of financial text (Seong & Nam, 2021). These works often use machine learning (Wu et al., 2014) and deep learning methods (Kilimci & Duvar, 2020; Ren et al., 2020) to extract emotion scores from news (Ftiti et al., 2021; Nerger et al., 2021) or social media (Urolagin, 2017) and then validate their correlations with the stock price. Mehta et al. (2021) compared five different machine learning and deep learning methods, including SVM, MNB classifier, linear regression, Naive Bayes and Long Short-Term Memory (LSTM), and LSTM shows the best accuracy. Besides, Gite et al. (2021) applied an Explainable AI (XAI) tool to a sentiment analysis LSTM model to give an understandable explanation for the prediction.

Although the aforementioned studies have each been tested in the market, the fusion of different algorithms and of data from various sources still needs further study. For example, Barak et al. (2017) focus on the fusion of different tree-based methods, and Li et al. (2021) applied a two-level information fusion approach to examine the effects of peer engagement on social media on stock price synchronicity and compare the effects between epidemic and non-epidemic contexts. However, neither of them considered the screening of stocks, and the types of stock-related data they used are insufficient.

3. Preliminaries

Before describing the details of each model, we first introduce the data used in our research, the idea of the overall framework and the way we evaluate our models.
Fig. 1. The visualization of partition of dataset on time series of the CSI 300 index.

3.1. Data introduction

The data used in this paper are Chinese stock market data from July 2020 to January 2021, covering a total of 3760 stocks and including the following four types of data:

(1) Historical trading data: the historical trading data are the time-series stock market data, including the open price, close price, highest price, lowest price, trading volume and turnover of each stock on every day. Note that we use the adjusted price to avoid the discontinuity of the stock price when a stock splits. Besides, we use the change rate of each stock's price relative to the close price of the previous day as the input for modeling and feature extraction, to eliminate the impact of the different price levels of stocks.

(2) Financial data: financial data include the basic information of the company, such as its industry categories, shareholders and financial statements updated every quarter, including the income statement, cash flow statement and balance sheet.

(3) Factor data: more than 300 factors are collected from JoinQuant,¹ covering different types of factors such as quality factor, Barra factor, risk factor and technical factor. Most of the factors are constructed according to published articles (Zura et al., 2015).

(4) Media data: media data include events related to the stocks and the sentiments of people towards the market and stocks. In order to analyze the fluctuation of public opinion during the corresponding time period, we obtained over 30000 pieces of news from the industry news and company news sections of Sinafinance,² an authoritative financial news platform for Chinese and international investors, with 150 to 200 pieces of news per day, and a total of 1773 companies were mentioned in the news.

¹ https://www.joinquant.com/help/api/help#JQData
² https://finance.sina.com.cn/

3.2. Evaluation method

(1) Average return of portfolio

In previous work on stock prediction, most models were built as a classification (on price movement direction) or a regression (on price value) task, which can cause a large discrepancy in the investment revenue. Besides, even a factor model with a negative $R^2$ may yield profitable returns in practice as long as it can correctly rank the stocks by future return (Feng et al., 2018). Therefore, in our work, the comprehensive return (C-return) and pure return (P-return) are used as the evaluation indexes of the model. They are directly related to the benefit obtained by using the model and are defined as follows:

$$C\text{-}return(D) = \frac{\sum_{t=1}^{D} p_{t_0+t}\,(1-\delta)^{t}}{p_{t_0}} \tag{1}$$
$$P\text{-}return(D) = \frac{p_{t_0+D}}{p_{t_0}} \tag{2}$$

The two equations above define the return of a single stock, where $p^i_{t_0}$ is the price of stock $i$ on the last observation day and $p^i_{t_0+t}$ is its price on the $t$-th prediction day, with $D$ prediction days in total (the stock index $i$ is suppressed in Eqs. (1) and (2)). $\delta$ is the depreciation rate, which we set to 0 for the convenience of calculation. The stock selection model selects the $K$ stocks with the highest expected return, and $C\text{-}return(K, D)$ and $P\text{-}return(K, D)$ are used to evaluate the selected set:

$$C\text{-}return(K, D) = \frac{\sum_{i \in S} C\text{-}return(D)_i}{K} \tag{3}$$

$$P\text{-}return(K, D) = \frac{\sum_{i \in S} P\text{-}return(D)_i}{K} \tag{4}$$

where $S$ is the set of the $K$ stocks with the highest expected return and $K$ is the stock selection scale.

Fig. 3. The structure of the Seq2Seq model.
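As a rough illustration of Eqs. (1)–(4), the sketch below computes the two single-stock returns from a price list and then averages them over the top-$K$ stocks. It takes the printed formulas literally (Eq. (1) sums the $D$ price ratios, with no extra averaging or subtraction of 1, since the text does not state either). The function names, the `prices` layout (index 0 is the last observation day) and the `scores` dictionary of expected returns are assumptions made for this example, not part of the original framework.

```python
def c_return(prices, d, delta=0.0):
    """Eq. (1): discounted sum of the D future prices, relative to p_{t0}.

    prices[0] is the price on the last observation day (p_{t0});
    prices[t] is the price on the t-th prediction day.
    """
    p0 = prices[0]
    return sum(prices[t] * (1.0 - delta) ** t for t in range(1, d + 1)) / p0


def p_return(prices, d):
    """Eq. (2): price on the D-th prediction day relative to p_{t0}."""
    return prices[d] / prices[0]


def portfolio_return(per_stock_return, scores, k):
    """Eqs. (3)-(4): mean single-stock return over the K best-scored stocks.

    per_stock_return: {stock: realised C-return(D) or P-return(D)}
    scores:           {stock: expected return used for ranking} (hypothetical input)
    """
    selected = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(per_stock_return[s] for s in selected) / k
```

With `delta=0`, as in the paper, the discount factor in `c_return` is always 1, so the function reduces to the plain sum of price ratios.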
In all model training stages, we set the prediction time $D$ to 36 and use $C\text{-}return(36)$ as the label for each stock. The training and testing datasets are split as shown in Fig. 1: models are trained and validated with data from July to August, with the information in the observation window as input and the stock return in the prediction window as label, and finally tested on the data from September 2020 to January 2021. Note that the prediction period of the test dataset is completely unknown to the models.

(2) Adjust R-squared

In statistics, R-squared is the proportion of the variation in the dependent variable that is predictable from the independent variable, and it is often used to evaluate a model's prediction ability. The calculation is as follows:

$$SS_{res} = \sum_{i} (y_i - f_i)^2 \tag{5}$$

$$SS_{tot} = \sum_{i} (y_i - 1)^2 \tag{6}$$

$$adjust\text{-}R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \tag{7}$$

The difference between the adjust R-squared we use and the normal R-squared is that we use the constant 1 instead of the average return rate to calculate $SS_{tot}$. This is because the average return rate is a future variable, which would bias the evaluation.

(3) Wilcoxon rank-sum test

The Wilcoxon rank-sum test, also called the Mann–Whitney U test, is used in our experiment to evaluate whether the return of the selected stocks is greater than the average.

3.3. Structure of framework

As shown in Fig. 2, four models are built to process different types of data. The fusion model includes two parts: the scoring part and the screening part. In the scoring part, the Seq2Seq model and the t-merge factor model extract the information from historical trading data and factor data respectively. In the screening part, the novel discriminative model and a media sentiment model based on the weighted stock relation graph are used to detect anomalous stocks from factor data and media news data respectively. Finally, in the model fusion process, the screening model gives reward and punishment to the stock list ranked by the predicted return of the scoring models. This fusion model with a screening function can increase the stability of the selected portfolio and reduce its risk. Note that the factors in the figure include not only the downloaded factor data but also features extracted from the trading time series using tsfresh (Christ et al., 2016).

4. Model detail

The four models are described in detail in this section.

4.1. Scoring models

The scoring models give each stock a relatively fair score and build the base rank.

4.1.1. Seq2Seq model

The Seq2Seq model is devised for the historical trading data. As shown in Fig. 3, the Seq2Seq model has an encoder–decoder structure. We use the historical trading data in the observation period as the input to predict the stock price sequence in the prediction period. The Seq2Seq network runs as follows. First, the encoder extracts a vector containing the temporal information of the input serialized trading data. Second, this vector is fed into the decoder to generate the first prediction of the stock price movement. Third, the decoder takes the predicted return of the previous day as input and generates the prediction for the next day. The advantage of this network is its ability to fit complex temporal data and to predict the future return of the stock for every day, rather than offering a single comprehensive return.

4.1.2. Factor model

We build the factor model based on the proposed t-merge discretization method and a histogram-based gradient boosting regression tree (Ke et al., 2017), with input variables containing financial information, common factors and temporal features extracted from historical trading data. The whole process includes the following steps:

(1) Data preparation: the factor data consist of two parts, the normal factors downloaded directly from JoinQuant and the time series factors extracted from the close price and volume by tsfresh (Christ et al., 2016). A total of 947 factors were passed into the subsequent processing.

(2) Factor selection: we use MIC (Albanese et al., 2018; Reshef et al., 2011) to select factors that are highly correlated with the comprehensive return of stocks, for its ability to detect various kinds of relationships.

(3) Data discretization: the factor data are discretized by the t-merge algorithm. The t-merge algorithm has the same bottom-up framework as ChiMerge (Kerber, 1992), but is based on the t value to test the continuous variable for each pair of adjacent intervals. The discretization algorithm is shown in Algorithm 1. Two parameters affect this algorithm: the number of start bins $n_1$ determines the computing time and the sensitivity of the algorithm, and the number of final bins $n_2$ influences the subsequent modeling result. The time complexity of this algorithm is $O(n \log(n)\, k \log(k))$, where $n$ is the number of samples and $k$ is the number of start bins. The discretization method can discretize the data with the least loss of information, balancing stability and sensitivity.

(4) Modeling and validation: As shown in Fig. 4, after discretization, the average return shows great correlation with the value of the factor. Some relationships are linear, such as sales to price ratio, the bigger
Fig. 4. The relationship between factors and the C-return (after the value of factor is discretized).

Short-Term Memory (LSTM) network and Natural visibility encoding (NVE) and a fusion model SVR+LSTM. Note that the LSTM, Seq2Seq and NVE models are built on historical trading data while the others are built on factor data. The label is the C-return(36) of each stock.

The combination of the Seq2Seq model and the t-merge factor model achieves the best accuracy among all results, with an $R^2$ greater than 0.15. Among the single models, the t-merge factor model has the best result, a large improvement over the direct factor model. This result indicates that the proposed t-merge discretization can make the factor model more robust and improve its accuracy. Compared with LSTM, Seq2Seq can predict every day in the future and shows a better result.

4.2. Screening model

The screening models give the fusion model the ability to screen out anomalous stocks and reduce the risk of the selected portfolio.

Algorithm 1: T-merge discretization algorithm
Data: the continuous values of the variable $X(x_1, x_2, \cdots, x_m)$, the corresponding labels $Y(y_1, y_2, \cdots, y_m)$
Input: start bins $n_1$, final bins $n_2$
Output: discretized variable $X_d$
  $D(d_1, d_2, \cdots, d_{n_1})$ ← discretize $X$ into equal-sized buckets based on sample quantiles;
  $T(t_1, t_2, \cdots, t_{n_1-1})$ ← compute the t value for each pair of adjacent intervals;
  while $len(T) > n_2 - 1$ do
    $i$ ← $argmin(T)$;
    delete $t_i$ in $T$;
    $d_i$ ← $d_i + d_{i+1}$ (merge the discretization);
    renew $t_{i-1}$ and $t_i$;
  end
  return $D$
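The bottom-up merging loop of Algorithm 1 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the paper says the merge criterion is "the t value" of adjacent intervals without giving the formula, so the sketch assumes an absolute Welch t statistic computed on the labels $Y$ of the two bins; the names `welch_t` and `t_merge` are hypothetical, and ties in $X$ are not treated specially when forming the start bins.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Absolute Welch t statistic between two label samples (assumed test)."""
    va = variance(a) if len(a) > 1 else 0.0
    vb = variance(b) if len(b) > 1 else 0.0
    se = math.sqrt(va / len(a) + vb / len(b))
    diff = abs(mean(a) - mean(b))
    if se == 0.0:
        return 0.0 if diff == 0.0 else math.inf
    return diff / se


def t_merge(x, y, n_start, n_final):
    """Merge n_start equal-frequency bins of x down to n_final bins.

    Returns a list of bins; each bin is a list of sample indices.
    """
    def labels(b):
        return [y[i] for i in b]

    # Step 1: equal-frequency start bins over the samples sorted by x.
    order = sorted(range(len(x)), key=lambda i: x[i])
    m = len(order)
    bins = [order[k * m // n_start:(k + 1) * m // n_start] for k in range(n_start)]
    bins = [b for b in bins if b]

    # Step 2: t value for every pair of adjacent bins.
    t = [welch_t(labels(bins[k]), labels(bins[k + 1])) for k in range(len(bins) - 1)]

    # Step 3: repeatedly merge the pair of neighbours that differ the least.
    while len(t) > n_final - 1:
        i = min(range(len(t)), key=t.__getitem__)
        bins[i] = bins[i] + bins.pop(i + 1)
        del t[i]
        if i > 0:  # renew t_{i-1}
            t[i - 1] = welch_t(labels(bins[i - 1]), labels(bins[i]))
        if i < len(t):  # renew t_i
            t[i] = welch_t(labels(bins[i]), labels(bins[i + 1]))
    return bins
```

The returned index lists can be mapped back to a bin id per sample to obtain the discretized variable. The sketch recomputes only the two t values next to each merge, as the "renew" step of Algorithm 1 prescribes.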
Fig. 7. Stock relationship graph (part). Three types of nodes are in the graph: the orange ones represent stock nodes, the nodes in red are industry nodes and the nodes in blue are concept nodes. Note that the stocks included in industry and concept nodes are defined in the data according to expert knowledge. Normally, a concept includes some companies with common ground, which means that they may belong to the same industrial chain (like concept 'gene chip') or are related to the same companies (like concept 'Ali related').

The fusion mechanism fuses the screening models and the scoring models with different rules, as shown in Fig. 9. Two types of information are the input: the scores $s_1$, $s_2$ calculated by the scoring models and the marks $m_1$, $m_2$ generated by the screening models. First, the two scores $s_1$, $s_2$ are combined by a weighted average into the score $s$, from which the base ranking is generated. Then the two screening models veto the stocks that are marked as bad and give a reward to the stocks marked as good. This fusion process is determined by three parameters: the scoring weights $\omega$ and $1 - \omega$ for the Seq2Seq model and the factor model, and the thresholds $\theta_f$ and $\theta_m$ that determine whether to veto a stock. In our model, we set $\omega = 0.5$, $\theta_f = 1$ and $\theta_m = 10$. Note that this fusion mechanism is designed under the fact that there is no effective short selling in China's A-share market, so the model cannot earn a profit by predicting the fall of stocks.
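A minimal sketch of such a fusion rule is given below. The exact rules of Fig. 9 are not spelled out in the text, so several details are assumptions made only for illustration: marks are taken to be signed numbers (strongly negative means "bad", strongly positive means "good"), a stock is vetoed when either mark falls at or below the negative threshold, and the reward is a fixed hypothetical `bonus` added to the fused score. Only $\omega$, $\theta_f$ and $\theta_m$ and their values come from the paper.

```python
def fuse_scores(s1, s2, m_factor, m_media, k,
                omega=0.5, theta_f=1.0, theta_m=10.0, bonus=0.01):
    """Rank stocks by a weighted score, with veto and reward from screening marks.

    s1, s2:            {stock: score} from the Seq2Seq and factor models
    m_factor, m_media: {stock: signed mark} from the two screening models
                       (sign convention and `bonus` are assumptions)
    Returns the K best surviving stocks, best first.
    """
    fused = {}
    for stock in s1:
        mf = m_factor.get(stock, 0.0)
        mm = m_media.get(stock, 0.0)
        # Veto: drop stocks that a screening model marks as clearly bad.
        if mf <= -theta_f or mm <= -theta_m:
            continue
        # Base score: weighted average of the two scoring models.
        score = omega * s1[stock] + (1.0 - omega) * s2[stock]
        # Reward: lift stocks that a screening model marks as clearly good.
        if mf >= theta_f:
            score += bonus
        if mm >= theta_m:
            score += bonus
        fused[stock] = score
    return sorted(fused, key=fused.get, reverse=True)[:k]
```

Because a vetoed stock is removed outright rather than pushed down the ranking, the rule never bets on a falling stock, which matches the long-only setting described above.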
5.2. Results

Each separate model and the fusion model are verified on the test dataset, and the results are shown in Table 2. Almost all models have a positive extra return under different trading conditions ($K$ or $D$), which shows their ability to select profitable stocks. The fusion model achieves the best return in most trading conditions, especially when $K = 100$ and $K = 1000$. Besides, an increase of $D$ leads to an increase of the return, which means the rising trend of the selected stocks is enduring.

Next, we further analyze the result through three charts. Fig. 10 shows the decline of the daily return when $K$ increases, which means the models can correctly rank the future return of the stocks rather than merely perform a rise-or-fall classification. Fig. 11 presents the relationship between $D$ and the daily return. The daily returns are higher when $D$ is 36. This shows that the models are sensitive to the length of the prediction period. We can retrain the model with different labels when predicting over different periods of time. Fig. 12 shows the relationship between the two evaluation indexes, which also correspond to two trading strategies. Its abscissa and ordinate each represent the value of one of the two indicators, and each point in the chart represents the values of the two indicators under one setting.

Fig. 8. Influence curve.

Table 1
Prediction performance of different scoring methods.

Algorithm                     RMSE      Adjust-R²
SVR                           0.1006    0.0252
LSTM                          0.0905    0.0622
GBDT                          0.0932    0.0024
LightGBM (Ke et al., 2017)    0.0889    0.0687
T-merge factor                0.0786    0.1374
LSTM                          0.0989    0.0552
Seq2Seq                       0.0868    0.0774
Seq2Seq+t-merge factor        0.0734    0.1528
NVE (Huang et al., 2021)      0.0822    0.1233
PCA-SVR                       0.0934    0.0931
LSTM+GBDT                     0.0794    0.1336
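The two metrics reported in Table 1 can be computed as in the sketch below, which follows Eqs. (5)–(7): the residual sum of squares is compared against deviations from the constant 1 rather than from the mean of the labels. The function names and the `baseline` parameter are assumptions made for this example; the labels are assumed to be returns expressed as price ratios, so that 1 corresponds to "no change".

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between labels and predictions."""
    n = len(y_true)
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(y_true, y_pred)) / n)


def adjust_r2(y_true, y_pred, baseline=1.0):
    """Eqs. (5)-(7): R-squared measured against a constant baseline of 1.

    Using a fixed baseline avoids the mean of the future returns,
    which is not known at prediction time.
    """
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    ss_tot = sum((y - baseline) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot
```

A model that always predicts the baseline scores exactly 0 under this definition, so a positive value means the model beats the naive "price stays the same" forecast.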
Table 2
The experiment results with different settings of K and D. Two values, the extra return rate (ER) and the p-value of the Wilcoxon rank-sum test, are shown to display the performance of the selected portfolio in different settings. C-R: C-return; P-R: P-return.

                            Seq2Seq          Factor model     Media model      Discriminative   Fusion model
Setting          Value      C-R     P-R      C-R     P-R      C-R     P-R      C-R     P-R      C-R     P-R
K = 100, D = 15  ER         0.421   1.965    0.088   2.187    0.727   2.248    0.477   2.102    0.876   3.630
                 p-value    0.441   0.077    0.787   0.039    0.099   0.046    0.226   0.017    0.294   0.018
K = 200, D = 15  ER         0.299   1.114    0.435   1.842    1.202   2.724    0.250   0.682    1.558   2.496
                 p-value    0.557   0.143    0.224   0.068    0.089   0.039    0.352   0.282    0.147   0.036
K = 500, D = 15  ER         0.119   0.257    0.630   1.600    0.626   1.402    −0.109  −0.129   0.815   1.787
                 p-value    0.589   0.443    0.495   0.121    0.222   0.110    0.796   0.855    0.176   0.057
K = 1000, D = 15 ER         −0.025  −0.134   0.192   0.619    -       -        -       -        0.279   0.773
                 p-value    0.774   0.784    0.733   0.521    -       -        -       -        0.544   0.454
K = 100, D = 36  ER         2.329   4.418    2.460   6.883    3.528   6.663    2.969   3.119    3.793   7.924
                 p-value    0.039   0.002    0.036   <0.001   0.008   0.001    0.033   0.019    0.018   <0.001
K = 200, D = 36  ER         2.173   4.936    2.317   5.847    3.377   6.551    1.996   2.871    3.002   6.180
                 p-value    0.053   0.002    0.046   0.001    0.020   0.001    0.037   0.024    0.021   0.001
K = 500, D = 36  ER         1.032   3.146    1.996   4.114    1.934   3.710    0.384   0.565    1.854   3.610
                 p-value    0.133   0.029    0.042   0.003    0.068   0.016    0.656   0.677    0.078   0.013
K = 1000, D = 36 ER         0.397   1.388    0.822   1.642    -       -        -       -        1.093   1.971
                 p-value    0.433   0.090    0.239   0.049    -       -        -       -        0.076   0.041
K = 100, D = 60  ER         3.303   4.518    4.667   6.769    5.064   5.534    3.089   4.544    5.647   9.229
                 p-value    0.022   0.002    0.004   <0.001   0.002   0.002    0.024   0.003    0.002   <0.001
K = 200, D = 60  ER         1.378   2.220    4.308   7.086    4.268   3.984    2.101   2.906    4.681   6.617
                 p-value    0.053   0.015    0.005   <0.001   0.007   0.006    0.041   0.034    0.003   0.001
K = 500, D = 60  ER         2.076   4.442    3.344   5.273    2.644   2.745    0.168   0.577    3.883   5.443
                 p-value    0.029   0.004    0.013   0.001    0.032   0.035    0.545   0.345    0.003   0.006
K = 1000, D = 60 ER         1.535   2.385    1.648   2.761    -       -        -       -        1.649   2.939
                 p-value    0.067   0.049    0.114   0.043    -       -        -       -        0.088   0.026
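For readers who want to reproduce p-values of the kind shown in Table 2, the following is a small self-contained sketch of a one-sided Wilcoxon rank-sum (Mann–Whitney U) test. It uses the normal approximation without tie or continuity correction, which is a simplification; the paper does not say which variant or software was used, and the choice of reference sample (for instance, the returns of all stocks in the universe) is also an assumption. The function name `rank_sum_greater` is hypothetical.

```python
import math


def rank_sum_greater(sample, reference):
    """One-sided p-value for 'sample tends to be larger than reference'.

    Mann-Whitney U with a normal approximation; tied values get
    average ranks, but no tie or continuity correction is applied.
    """
    pooled = sorted([(v, 0) for v in sample] + [(v, 1) for v in reference])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n1, n2 = len(sample), len(reference)
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2.0
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mu) / sigma
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

A small p-value indicates that the returns of the selected portfolio are stochastically larger than those of the reference sample; for real experiments an established statistics library would be the safer choice.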
Fig. 10. The change of daily return of models with different selection scale (under time period $D = 36$).

Fig. 11. The change of daily return of models with different prediction time (under selection scale $K = 100$).

Fig. 12. Relationship of two evaluation indexes.

The ratio of the two types of return can reflect the trend of the stock in the prediction period. If the stock price rises by the same amount every day, the ratio of $P\text{-}return$ and $C\text{-}return$ will be 2, and if the price rises by the same rate every day, the ratio will be much larger. However, most of the results are smaller than 2, which indicates that the model can find stocks that are profitable even if they fall back at the end of the prediction period. Besides, the ratio seems higher when the prediction period is short, indicating that the stocks may maintain the upward trend in this period.

The result of our fusion model is also compared to other algorithms under the experiment setting $K = 100$, $D = 36$, which means the portfolio contains 100 stocks and the label of each model is the $C\text{-}return$ of 36 days. For the other algorithms, the portfolio is created by selecting the 100 stocks with the highest predicted return. The result is shown in Table 3. The fusion model is much better than the other models, which is mostly because the two proposed scoring models can
complement each other and fully use the different types of data, and because of the screening function of the fusion model.

Table 3
The comparison of performance of portfolio selected by different methods.

Algorithm       C-R (p-value)     P-R (p-value)
SVR             2.24 (0.033)      3.43 (0.024)
LSTM            3.27 (0.016)      4.89 (0.005)
GBDT            1.03 (0.178)      4.14 (0.014)
Factor          1.52 (0.069)      3.33 (0.039)
Fusion model    3.793 (0.018)     7.924 (<0.001)
PCA+SVR         0.797 (0.285)     1.219 (0.248)
NVE             2.19 (0.042)      6.32 (0.002)

5.3. Discussion

5.3.1. The effectiveness of model

From Table 2, the factor model and the media model, the most studied and mature models, have the best results among the four separate models. As shown in Fig. 11, the factor model seems to have more advantages in long-term prediction, because it is the only one whose daily income rises when $D = 60$. Besides, the daily return of the screening models drops rapidly as $K$ increases. This indicates that this kind of model has a strong polarization: good stocks and bad stocks that are screened with high scores are more reliable than ones selected by the scoring models, while the stocks screened with low scores are relatively unreliable. Therefore, we set the threshold $\theta$ of the screening model in the fusion mechanism to eliminate unreliable results of the screening model. Due to the fusion mechanism, which gives a bonus to the good stocks and a veto to the bad stocks, the performance of the fusion model is significantly improved compared to the single models, especially when $K$ is 100 and 1000.

6. Conclusion

In this paper, we built a framework that can use four types of financial data to rank the stocks in the A-share market of China for investment stock selection. The framework contains four models, which can be divided into scoring models and screening models. By fusing the four models, the P-return(100,36) of the final result on the test dataset is 9.2% higher than the average value and 2.5% higher than the maximum of the single models, which is a significant improvement. The limitations of the work are as follows. First, the framework is validated over a relatively short time, so the robustness of the model and its expiry time need to be validated further in practice. Second, the calculation time of the model is more than 3 h, so it is only suitable for medium- and long-term investment, not for high-frequency trading. Last but not least, how to combine human knowledge with the proposed framework needs more exploration.

CRediT authorship contribution statement

Yanrui Li: Methodology, Software, Data curation, Writing – original draft. Kaiyou Fu: Data curation, Factor model construction, Writing – review & editing. Yuchen Zhao: Data curation, Media model construction, Writing – review & editing. Chunjie Yang: Writing – review & editing, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work is supported by National Natural Science Foundation of
China (61933015).
5.3.2. The effectiveness of data
In some people’s view, historical data cannot provide any help to References
the stock forecast, because it is too simple for investors to collect
the data, only the financial data commonly used in the qualitative Albanese, D., Riccadonna, S., Donati, C., & Franceschi, P. (2018). A practical tool for
analysis is effective. However, counting the sources of factors with maximal information coefficient analysis. GigaScience, 7(4), giy032.
high correlation in the factor model(among top 40 correlated factors Ayala, J., Garca-Torres, M., Noguera, J. L. V., Gmez-Vela, F., & Divina, F. (2021).
Technical analysis strategy optimization using a machine learning approach in stock
26 are factors extracted from historical trading time series), we find market indices. Knowledge-Based Systems, 225, Article 107119.
the historical trading data is capable of providing sufficient useful Barak, S., Arjmand, A., & Ortobelli, S. (2017). Fusion of multiple diverse predictors in
information whereas some factors that have been proved to be effective stock market. Information Fusion, 36, 90–102.
in many papers are ineffective. This may be due to the characteristics Bukhari, A. H., Raja, M. A. Z., Sulaiman, M., Islam, S., Shoaib, M., & Kumam, P. (2020).
Fractional neuro-sequential ARFIMA-LSTM for financial market forecasting. IEEE
of China’s stock market, or the temporally abnormal circumstances of Access, 8, 71326–71338.
stocks in the post epidemic period. Bustos, O., & Pomares-Quimbaya, A. (2020). Stock market movement forecast: A
Besides, the effectiveness of media data has a strong relationship systematic review. Expert Systems With Applications, 156, Article 113464.
Carhart, M. M. (1997). On persistence in mutual fund performance. The Journal Of
with the source and processing method of media data. In this work, we
Finance, 52(1), 57–82.
just build a simple framework to process the media data, however, the Chen, Y., & Hao, Y. (2017). A feature weighted support vector machine and K-
result proves to be excellent. Therefore, we think it is very important, nearest neighbor algorithm for stock market indices prediction. Expert Systems With
and a customized NLP model may lead to a better result. Applications, 80, 340–355.
Chen, W., Jiang, M., Zhang, W.-G., & Chen, Z. (2021). A novel graph convolutional
feature based convolutional neural network for stock trend prediction. Information
5.3.3. Model fusion mechanism Sciences, 556, 67–94.
Chen, W., Yeo, C. K., Lau, C. T., & Lee, B. S. (2018). Leveraging social media news
When designing the model fusion mechanism, we have completely
to predict stock index movement using RNN-boost. Data & Knowledge Engineering,
different rules for the good stocks and bad stocks obtained from the 118, 14–24.
screening model. We give a more severe punishment, the veto, to the Christ, M., Kempa-Liehr, A. W., & Feindt, M. (2016). Distributed and parallel time
bad stocks compared to the award we give to the good ones. Therefore, series feature extraction for industrial big data applications. ArXiv E-Prints, pages,
arXiv:1610.07717.
the fusion model can get rid of selecting bad stocks. This is mainly
Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work.
because there is no effective short mechanism in China’s A-share market The Journal Of Finance, 25(2), 383–417.
and the accurate prediction of stocks fall is not as important as the FAMA, E. F., & FRENCH, K. R. (1992). The cross-section of expected stock returns. The
accurate prediction of stocks rise. Besides, although investors are often Journal Of Finance, 47(2), 427–465.
Fama, E. F., & French, K. R. (2017). International tests of a five-factor asset pricing
described as risk averse, many of them are, rather, loss averse. This
model. Journal Of Financial Economics, 123(3), 441–463.
mechanism encourages fusion model to chase risky stocks that may Feng, F., He, X., Wang, X., Luo, C., Liu, Y., & Chua, T. (2018). Temporal relational
bring more profits and abandon the ones that will bring more losses. ranking for stock prediction. CoRR, abs/1809.09441.
9
Y. Li et al. Expert Systems With Applications 196 (2022) 116629
Foye, J. (2017). Testing alternative versions of the fama-french five-factor model in the Nerger, G.-L., Huynh, T. L. D., & Wang, M. (2021). Which industries benefited from
UK. SSRN Electronic Journal. Trump environmental policy news? evidence from industrial stock market reactions.
Ftiti, Z., Ben Ameur, H., & Louhichi, W. (2021). Does non-fundamental news related to Research In International Business And Finance, 57, Article 101418.
COVID-19 matter for stock returns? evidence from shanghai stock market. Economic Patel, J., Shah, S., Thakkar, P., & Kotecha, K. (2015a). Predicting stock and stock price
Modelling, 99, Article 105484. index movement using trend deterministic data preparation and machine learning
Gite, S., Khatavkar, H., Kotecha, K., Srivastava, S., Maheshwari, P., & Pandey, N. (2021). techniques. Expert Systems With Applications, 42(1), 259–268.
Explainable stock prices prediction from financial news articles using sentiment Patel, J., Shah, S., Thakkar, P., & Kotecha, K. (2015b). Predicting stock market index
analysis. PeerJ Computer Science, 7, Article e340. using fusion of machine learning techniques. Expert Systems With Applications, 42(4),
Han, J., & Ge, Z. (2020). Effect of dimensionality reduction on stock selection with 2162–2172.
cluster analysis in different market situations. Expert Systems With Applications, 147, Ren, Y., Liao, F., & Gong, Y. (2020). Impact of news on the trend of stock price change:
Article 113226. an analysis based on the deep bidirectiona LSTM model. Procedia Computer Science,
Harvey, C. R., Liu, Y., & Zhu, H. (2015). And the cross-section of expected returns. The 174, 128–140, 2019 International Conference on Identification, Information and
Review Of Financial Studies, 29(1), 5–68. Knowledge in the Internet of Things.
Healy, M. J. R. (1964). Smoothing, forecasting and prediction of discrete time series. Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turn-
Journal Of The Royal Statistical Society: Series A (General), 127(2), 292–293. baugh, P. J., Lander, E. S., Mitzenmacher, M., & Sabeti, P. C. (2011). Detecting
Huang, Y., Mao, X., & Deng, Y. (2021). Natural visibility encoding for time series novel associations in large data sets. Science, 334(6062), 1518–1524.
and its application in stock trend prediction. Knowledge-Based Systems, 232, Article Saud, A. S., & Shakya, S. (2020). Analysis of look back period for stock price prediction
107478. with RNN variants: A case study on banking sector of NEPSE. Procedia Computer
Jaworski, P., & Pitera, M. (2014). On spatial contagion and multivariate GARCH models. Science, 167, 788–798, International Conference on Computational Intelligence and
Applied Stochastic Models In Business And Industry, 30(3), 303–327. Data Science.
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). Seong, N., & Nam, K. (2021). Predicting stock movements based on financial news
LightGBM: A highly efficient gradient boosting decision tree. In I. Guyon, U. with segmentation. Expert Systems With Applications, 164(November 2018), Article
V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), 113988.
30, Advances in neural information processing systems. Curran Associates, Inc.. Site, A., Birant, D., & I0603k, Z. (2019). Stock market forecasting using machine
Kerber, R. (1992). ChiMerge: Discretization of numeric attributes. In Proceedings of the learning models. In 2019 innovations in intelligent systems and applications conference
10th national conference on artificial intelligence. San Jose, CA, July 12-16, 1992.. (pp. 1–6).
Kilimci, Z. H., & Duvar, R. (2020). An efficient word embedding and deep learning Sun, L., Wang, K., Balezentis, T., Streimikiene, D., & Zhang, C. (2021). Extreme point
based model to forecast the direction of stock exchange market using Twitter and bias compensation: A similarity method of functional clustering and its application
financial news sites: A case of Istanbul stock exchange (BIST 100). IEEE Access, 8, to the stock market. Expert Systems With Applications, 164, Article 113949.
188186–188198. Sun, Y., Wang, S., Li, Y., Feng, S., Tian, H., Wu, H., & Wang, H. (2019). ERNIE
Kumar, D., Sarangi, P. K., & Verma, R. (2021). A systematic review of stock market 2.0: A continual pre-training framework for language understanding. arXiv preprint
prediction using machine learning and statistical techniques. Materials Today: arXiv:1907.12412.
Proceedings. Tsay, R. S., & Ando, T. (2012). BayesIan panel data analysis for exploring the impact
LeDell, E., & Poirier, S. (2020). H2O autoML: Scalable automatic machine learning. In of subprime financial crisis on the US stock market. Computational Statistics & Data
7th ICML workshop on automated machine learning (AutoML). Analysis, 56(11), 3345–3365, 1st issue of the Annals of Computational and Financial
Li, Q., Chen, Y., Wang, J., Chen, Y., & Chen, H. (2018). Web media and stock markets Econometrics Sixth Special Issue on Computational Econometrics.
: A survey and future directions from a big data perspective. IEEE Transactions On Urolagin, S. (2017). Text mining of tweet for sentiment classification and association
Knowledge And Data Engineering, 30(2), 381–399. with stock prices. In 2017 international conference on computer and applications (pp.
Li, Y., & Yang, C. (2021). Domain knowledge based explainable feature construction 384–388).
method and its application in ironmaking process. Engineering Applications Of Vijh, M., Chandola, D., Tikkiwal, V. A., & Kumar, A. (2020). Stock closing price
Artificial Intelligence, 100, Article 104197. prediction using machine learning techniques. Procedia Computer Science, 167,
Li, L., Zhu, F., Sun, H., Hu, Y., Yang, Y., & Jin, D. (2021). Multi-source information 599–606, International Conference on Computational Intelligence and Data Science.
fusion and deep-learning-based characteristics measurement for exploring the Wang, R. (2011). Stock selection based on data clustering method. In 2011 seventh
effects of peer engagement on stock price synchronicity. Information Fusion, 69, international conference on computational intelligence and security (pp. 1542–1545).
1–21. Wu, D. D., Zheng, L., & Olson, D. L. (2014). A decision support approach for online
Majumdar, S., & Laha, A. K. (2020). Clustering and classification of time series stock forum sentiment analysis. IEEE Transactions On Systems, Man, And Cybernetics:
using topological data analysis with applications to finance. Expert Systems With Systems, 44(8), 1077–1087.
Applications, 162, Article 113868. Zura, K., Geoffrey, L., & Igor, T. (2015). 101 Formulaic alphas. Ssrn Electronic Journal.
Markowitz, H. (1952). PORTFOLIO selection*. The Journal Of Finance, 7(1), 77–91.
Mehta, P., Pandya, S., & Kotecha, K. (2021). Harvesting social media sentiment analysis
to enhance stock market prediction using deep learning. PeerJ Computer Science, 7,
Article e476.
10
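As an illustration of the asymmetric bonus/veto rule discussed in Section 5.3.3 and of the top-𝐾 portfolio selection used in the comparison experiment, the following is a minimal Python sketch. It is not the fusion mechanism defined in this paper: the function name `fuse_and_select`, the signed screening score (positive for good stocks, negative for bad ones), the additive `bonus`, and the symmetric use of the threshold `theta` are all assumptions made only for this example.

```python
import numpy as np


def fuse_and_select(scores, screen, theta, bonus, k):
    """Hypothetical bonus/veto fusion followed by top-k selection.

    scores : scoring-model outputs, one per stock (higher is better).
    screen : assumed signed screening outputs (positive = good, negative = bad).
    theta  : confidence threshold; screening outputs with |value| < theta are ignored.
    bonus  : assumed additive reward for confidently good stocks.
    k      : number of stocks in the portfolio.
    """
    scores = np.asarray(scores, dtype=float)
    screen = np.asarray(screen, dtype=float)

    fused = scores.copy()
    # Reward: confidently good stocks get a finite bonus.
    fused[screen >= theta] += bonus
    # Veto: confidently bad stocks are excluded regardless of their score.
    fused[screen <= -theta] = -np.inf

    # Rank by fused score, highest first, and drop vetoed stocks.
    order = np.argsort(-fused, kind="stable")
    return [int(i) for i in order if np.isfinite(fused[i])][:k]


if __name__ == "__main__":
    demo_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
    demo_screen = [-0.9, 0.1, 0.8, -0.2, 0.0]
    print(fuse_and_select(demo_scores, demo_screen, theta=0.5, bonus=0.3, k=2))
```

In this toy input, stock 0 has the highest raw score but is vetoed, while stock 2 is promoted by the bonus. The penalty is unbounded and the reward is finite, which mirrors the asymmetry described in the text. Screening outputs below the threshold leave the scoring-model ranking unchanged.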