
Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20)

Special Track on AI in FinTech

Hierarchical Multi-Scale Gaussian Transformer for Stock Movement Prediction


Qianggang Ding1,2,∗, Sifan Wu2,∗, Hao Sun3, Jiadong Guo1 and Jian Guo1,†
1 Peng Cheng Laboratory   2 Tsinghua University   3 The Chinese University of Hong Kong
{dqg18, wusf18}@mails.tsinghua.edu.cn, [email protected], {guojd, guoj}@pcl.ac.cn
∗ Equal contribution.   † Corresponding author.

Abstract

Predicting the price movement of financial securities such as stocks is an important but challenging task, due to the uncertainty of financial markets. In this paper, we propose a novel approach based on the Transformer to tackle the stock movement prediction task. Furthermore, we present several enhancements for the proposed basic Transformer. Firstly, we propose a Multi-Scale Gaussian Prior to enhance the locality of the Transformer. Secondly, we develop an Orthogonal Regularization to avoid learning redundant heads in the multi-head self-attention mechanism. Thirdly, we design a Trading Gap Splitter for the Transformer to learn hierarchical features of high-frequency finance data. Compared with popular recurrent neural networks such as LSTM, the proposed method has the advantage of mining extremely long-term dependencies from financial time series. Experimental results show that our proposed models outperform several competitive methods in stock price prediction tasks on the NASDAQ exchange market and the China A-shares market.

1 Introduction

With the development of stock markets all around the world, the overall capitalization of stock markets worldwide had exceeded 68 trillion U.S. dollars by 2018 [1]. In recent years, more and more quantitative researchers have become involved in predicting the future trends of stocks, helping investors make profitable decisions using state-of-the-art trading strategies. However, the uncertainty of stock prices makes this an extremely challenging problem in the field of data science.

[1] https://data.worldbank.org/indicator/CM.MKT.TRAD.CD?end=2018&start=2018&view=bar

Prediction of stock price movement belongs to the area of time series analysis, which models rich contextual dependencies using statistical or machine learning methods. Traditional approaches for stock price prediction are mainly based on fundamental factors, technical indices, or statistical time series models, which capture explicit or implicit patterns from historical financial data. However, the performance of those methods is limited in two respects. Firstly, they usually require expertise in finance. Secondly, these methods only capture simple patterns and simple dependence structures of financial time series. With the rise of artificial intelligence technology, more and more researchers have attempted to solve this problem using machine learning algorithms such as SVM [Cortes and Vapnik, 1995], Nearest Neighbors [Altman, 1992], and Random Forest [Breiman, 2001]. Recently, since deep neural networks have empirically exhibited powerful capabilities in solving highly uncertain and nonlinear problems, stock prediction research based on deep learning has become more and more popular and shows significant advantages over traditional approaches.

Stock prediction research based on deep learning can roughly be grouped into two categories: (1) fundamental analysis and (2) technical analysis. Fundamental analysis constructs prediction signals using fundamental information such as news text, finance reports, and analyst reports. For example, [Schumaker and Chen, 2009; Xu and Cohen, 2018; Chen et al., 2019] use natural language processing approaches to predict stock price movement by extracting latent features from market-related text such as news, reports, and even rumors. On the other hand, technical analysis predicts the financial market using historical data of stocks. One natural choice is the RNN family, such as RNN [Rumelhart et al., 1986], LSTM [Hochreiter and Schmidhuber, 1997], Conv-LSTM [Xingjian et al., 2015], and ALSTM [Qin et al., 2017]. However, the primary drawback of these methods is that the RNN family struggles to capture extremely long-term dependencies [Li et al., 2019], such as dependencies spanning several months of financial time series.

Recently, a well-known sequence-to-sequence model called the Transformer [Vaswani et al., 2017] has achieved great success on machine translation tasks. Distinct from RNN-based models, the Transformer employs a multi-head self-attention mechanism to learn the relationships among different positions globally, thereby enhancing its capacity to learn long-term dependencies. Nevertheless, the canonical Transformer is designed for natural language tasks, and therefore has a number of limitations in tackling finance prediction: (1) Locality imperception: the global self-attention mechanism in the canonical Transformer is insensitive to local context, whose dependencies are very important in financial time series. (2) Hierarchy poverty: the point-wise dot-product self-attention mechanism lacks the capability to utilize the hierarchical structure of financial time series (e.g., learning intra-day, intra-week, and intra-month features independently). Intuitively, addressing these drawbacks will improve the robustness of the model and lead to better performance in financial time series prediction.

In this paper, we propose a new Transformer-based method for stock movement prediction. The primary highlight of the proposed model is its capability of capturing long-term, short-term, and hierarchical dependencies of financial time series. To this end, we propose several enhancements for the Transformer-based model: (1) a Multi-Scale Gaussian Prior enhances the locality of the Transformer; (2) Orthogonal Regularization avoids learning redundant heads in the multi-head self-attention mechanism; (3) a Trading Gap Splitter enables the Transformer to learn intra-day and intra-week features independently. Numerical comparisons with other competitive methods for time series show the advantages of the proposed method.

In summary, the main contributions of our paper include:

• We propose a Transformer-based method for stock movement prediction. To the best of our knowledge, this is the first work using the Transformer model to tackle financial time series forecasting problems.

• We propose several enhancements for the Transformer model, including the Multi-Scale Gaussian Prior, Orthogonal Regularization, and the Trading Gap Splitter.

• In experiments, the proposed Transformer-based method significantly outperforms several state-of-the-art baselines, such as CNN, LSTM, and ALSTM, on two real-world exchange markets.

2 Related Work

Fundamental Analysis. Machine learning for fundamental analysis has developed with the explosion of alternative finance data, such as news, location, and e-commerce data. [Schumaker and Chen, 2009] proposes a predictive machine learning approach for financial news article analysis using several different textual representations. [Weng et al., 2017] outlines a novel methodology to predict future movements in the value of securities after tapping data from disparate sources. [Xu and Cohen, 2018] uses a stochastic recurrent model (SRM) with an extra discriminator and an attention mechanism to address the adaptability of stock markets. [Chen et al., 2019] proposes to learn event extraction and stock prediction jointly.

Technical Analysis. On the other hand, technical analysis methods extract price-volume information from historical trading data and use machine learning algorithms for prediction. For instance, [Lin et al., 2013] proposes an SVM-based approach for stock market trend prediction. Meanwhile, the LSTM neural network [Hochreiter and Schmidhuber, 1997] has been employed to model stock price movement. [Nelson et al., 2017] proposes an LSTM model to predict stock movement based on technical analysis indicators. [Zhang et al., 2017] proposes an LSTM model on historical data to discover multi-frequency trading patterns. [Wang et al., 2019] proposes a ConvLSTM-based Seq2Seq framework for stock movement prediction. [Qin et al., 2017] proposes an Attentive-LSTM model with an attention mechanism to predict stock price movement, and [Feng et al., 2019] further introduces a data augmentation approach based on adversarial training. However, [Li et al., 2019] points out that LSTM can only distinguish about 50 nearby positions, with an effective context size of about 200. This means that LSTM-based models have difficulty capturing extremely long-term dependencies in time series. To tackle this issue, we propose a Transformer-based method to better mine the intrinsic long-term and complex structures of financial time series.
3 Problem Formulation

Since the exact price of a stock is extremely hard to predict accurately, we follow the setup of [Walczak, 2001] and predict the stock price movement instead. Stock movement prediction is usually treated as a binary classification problem, i.e., discretizing the stock movement into two classes (Rise or Fall). Formally, given the stock features X = [x_{T-∆t+1}, x_{T-∆t+2}, ..., x_T] ∈ R^{∆t×F} (also written as X = [x_1, x_2, ..., x_N] ∈ R^{N×F} in the rest of the paper for simplicity) over the latest ∆t time-steps, the prediction model f_θ(X) with parameters θ outputs the predicted movement label y = I(p_{T+k} > p_T), where T denotes the target trading time, F denotes the dimension of the stock features, and p_t denotes the close price at time-step t. Briefly, the proposed model utilizes the historical data of a stock s in the lag window [T - ∆t + 1, T] (where ∆t is a fixed lag size) to predict the movement class y (0 for Fall, 1 for Rise) over the next k time-steps.
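As a concrete illustration of this formulation, the sketch below builds (X, y) pairs by sliding a lag window over a feature matrix and labeling each window with the k-step-ahead close-price movement. The helper name and array layout are illustrative assumptions, not part of the paper.

```python
import numpy as np

def make_samples(features, close, lag=40, k=1):
    """Slide a lag-size window over the series and label each window
    with the movement of the close price k steps after the window end.
    `features` has shape (T, F); `close` has shape (T,)."""
    X, y = [], []
    for t in range(lag - 1, len(close) - k):
        X.append(features[t - lag + 1 : t + 1])   # latest `lag` time-steps
        y.append(int(close[t + k] > close[t]))    # y = I(p_{T+k} > p_T)
    return np.stack(X), np.array(y)

# toy usage: 300 time-steps, 5 features (open, high, low, close, volume)
feats = np.random.rand(300, 5)
X, y = make_samples(feats, feats[:, 3], lag=40, k=1)
print(X.shape, y.shape)   # (260, 40, 5) (260,)
```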
4 Proposed Method

In this section, we first describe the basic Transformer model we designed. We then introduce the proposed enhancements of the Transformer for financial time series.

4.1 Basic Transformer for Stock Movement Prediction

In our work, we instantiate f_θ(·) with a Transformer-based model. To adapt to the stock movement prediction task, which takes time series as inputs, we design a variant of the Transformer with an encoder-only structure consisting of L blocks of multi-head self-attention layers and position-wise feed-forward layers (see Figure 1). Given the input time series X = [x_1, x_2, ..., x_N] ∈ R^{N×F}, we first add the position encoding and apply a linear layer with a tanh activation function:

    \bar{X} = \sigma_{\tanh}\left( W^{(I)} \, \mathrm{PositionEncoding}(X) \right)    (1)

The multi-head self-attention layers then take X̄ as input and are computed by

    Q_h = W_h^{(Q)} \bar{X}, \quad K_h = W_h^{(K)} \bar{X}, \quad V_h = W_h^{(V)} \bar{X},    (2)

where h = 1, ..., H and W_h^{(Q)}, W_h^{(K)}, W_h^{(V)} are learnable weight matrices for Query, Key, and Value, respectively (refer to [Vaswani et al., 2017] for more details). The attention score matrix a_h ∈ R^{N×N} of the h-th head is then computed by

    a_h = \mathrm{softmax}\!\left( \frac{Q_h K_h^{\top}}{\sqrt{d_k}} \cdot M \right),    (3)

where M is a position-wise mask matrix that masks out attention to future positions, so as to avoid future information leakage. Afterwards, the output of the h-th head is a weighted sum defined as follows:

    [O_h]_i = \sum_{j=1}^{N} (a_h)_{i,j} \cdot [V_h]_j.    (4)

The final output of the multi-head attention layer is the concatenation of all heads, O = [O_1, O_2, ..., O_H]. Afterwards, the position-wise feed-forward layer takes O as input and transforms it into Z via two fully-connected layers and a ReLU activation. On top of the output z_i of the last self-attention layer, a temporal attention layer [Qin et al., 2017] is deployed to aggregate the latent features from each position as m = \sum_{i=1}^{N} \alpha_i z_i. The scalar prediction score ŷ is then computed by a fully-connected layer and a sigmoid transformation:

    \hat{y} = \mathrm{sigmoid}(W_{fc}\, m).    (5)

Our ultimate goal is to maximize the log-likelihood between ŷ and y via the following loss function:

    L_{CE} = (1 - y)\log(1 - \hat{y}) + y\log(\hat{y}).    (6)

Figure 1: The proposed Basic Transformer overview. (The input x_1, ..., x_N passes through L multi-head self-attention blocks with feed-forward layers, followed by temporal attention aggregation with weights α_1, ..., α_N and the prediction ŷ.)
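To make the architecture concrete, the following is a minimal PyTorch sketch of such an encoder-only model: causal multi-head self-attention, a position-wise feed-forward layer, temporal attention aggregation, and a sigmoid output. The layer sizes, the learned position encoding, and the residual/LayerNorm placement are illustrative assumptions, not the authors' released implementation.

```python
import math
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.h, self.dk = n_heads, d_model // n_heads
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                                       # x: (B, N, d_model)
        B, N, _ = x.shape
        split = lambda t: t.view(B, N, self.h, self.dk).transpose(1, 2)
        q, k, v = split(self.q(x)), split(self.k(x)), split(self.v(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.dk)   # (B, H, N, N)
        causal = torch.triu(torch.ones(N, N, device=x.device), 1).bool()
        scores = scores.masked_fill(causal, float('-inf'))      # mask M: no future leakage
        o = torch.softmax(scores, dim=-1) @ v                   # Eqs. (3)-(4)
        return self.out(o.transpose(1, 2).reshape(B, N, -1))

class Block(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x))
        return self.ln2(x + self.ffn(x))

class BasicTransformer(nn.Module):
    def __init__(self, n_feat=5, d_model=64, n_heads=4, n_blocks=3, max_len=640):
        super().__init__()
        self.inp = nn.Linear(n_feat, d_model)                   # linear layer of Eq. (1)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))  # learned position encoding
        self.blocks = nn.ModuleList([Block(d_model, n_heads) for _ in range(n_blocks)])
        self.temporal = nn.Linear(d_model, 1)                   # temporal attention weights
        self.fc = nn.Linear(d_model, 1)

    def forward(self, x):                                       # x: (B, N, n_feat)
        z = torch.tanh(self.inp(x) + self.pos[:, : x.size(1)])  # tanh activation of Eq. (1)
        for blk in self.blocks:
            z = blk(z)
        alpha = torch.softmax(self.temporal(z), dim=1)          # (B, N, 1)
        m = (alpha * z).sum(dim=1)                              # m = sum_i alpha_i z_i
        return torch.sigmoid(self.fc(m)).squeeze(-1)            # Eq. (5)

# usage: a batch of 8 windows, 40 time-steps, 5 features
model = BasicTransformer()
y_hat = model(torch.rand(8, 40, 5))
print(y_hat.shape)   # torch.Size([8])
```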

4.2 Enhancing Locality with Multi-Scale Gaussian Prior

Recently, the Transformer has exhibited a powerful capability for extracting global patterns in natural language processing. However, the self-attention mechanism in the Transformer considers global dependencies with very weak position information. Note that position information encodes the temporally varying patterns of a time series, which is very important. To address this, we incorporate a Multi-Scale Gaussian Prior into the canonical multi-head self-attention mechanism, with the intuition that the relevance of data at two positions is inversely proportional to the temporal distance between them.

To pay more attention to closer time-steps, we add Gaussian-prior biases to the attention score matrices, based on the assumption that such scores obey Gaussian distributions. Note that this operation is equivalent to multiplying the original attention distribution by a Gaussian distribution mask (see [Guo et al., 2019] for the proof). In detail, we transform Eq. (3) into the following form by adding Gaussian biases:

    a_h = \mathrm{softmax}\!\left[ \left( \frac{Q_h K_h^{\top}}{\sqrt{d_k}} + B_h^{(G)} \right) \cdot M \right],    (7)

where B_h^{(G)} ∈ R^{N×N} is a matrix computed by

    [B_h^{(G)}]_{i,j} = \begin{cases} \exp\!\left( -\dfrac{(j-i)^2}{2\sigma_h^2} \right) & j \le i; \\[4pt] 0 & j > i. \end{cases}    (8)

Note that we allow σ_h in B_h^{(G)} to differ across heads in the multi-head self-attention layer.

Besides, we also give an empirical setting for σ_h. Suppose we want to pay more attention to the D_h closest time-steps; the variance can then be empirically set as σ_h = D_h. In this way, we allow different D_h in different attention heads in order to provide a Multi-Scale Gaussian Prior.

In finance, the temporal features from the last 5, 10, 20, or 40 days are usually considered in trading strategies. This means that, for a 4-head self-attention layer, we can empirically assign the window-size set D = {5, 10, 20, 40} to σ_h with h = 1, ..., 4, respectively, as shown in Figure 2. In conclusion, the proposed Multi-Scale Gaussian Prior enables the Transformer to learn multi-scale localities from financial time series.

Figure 2: Visualization of [B_h^{(G)}]_{i,j} in Eq. (8) as a function of j - i, with the window-size set D = {5, 10, 20, 40} (one curve per head: Head-1 with D = 5 through Head-4 with D = 40).
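A small sketch of how the multi-scale Gaussian biases of Eq. (8) can be precomputed, one N × N matrix per head with σ_h set to the head's window size D_h; in the model these biases are added to Q_h K_h^T / sqrt(d_k) before the mask and softmax of Eq. (7). The function and variable names are illustrative.

```python
import torch

def gaussian_prior(n_steps, window_sizes=(5, 10, 20, 40)):
    """Build one N x N Gaussian bias matrix per head (Eq. 8):
    [B_h]_{i,j} = exp(-(j-i)^2 / (2*sigma_h^2)) for j <= i, 0 otherwise,
    with sigma_h set empirically to the head's window size D_h."""
    i = torch.arange(n_steps).unsqueeze(1)          # row index
    j = torch.arange(n_steps).unsqueeze(0)          # column index
    dist2 = (j - i).float() ** 2
    biases = []
    for sigma in window_sizes:
        b = torch.exp(-dist2 / (2 * sigma ** 2))
        b = torch.tril(b)                           # zero out future positions (j > i)
        biases.append(b)
    return torch.stack(biases)                      # (H, N, N), added to Q K^T / sqrt(d_k)

B = gaussian_prior(40)
print(B.shape)          # torch.Size([4, 40, 40])
print(B[0, 10, 5:11])   # decays with distance; exactly 1.0 on the diagonal
```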
4.3 Orthogonal Regularization for Multi-Head Self-Attention Mechanism

With the proposed Multi-Scale Gaussian Prior, we let different heads learn different temporal patterns in the multi-head attention layer. However, previous research [Tao et al., 2018; Li et al., 2018; Lee et al., 2019] indicates that the canonical multi-head self-attention mechanism tends to learn redundant heads. To enhance the diversity among heads, we introduce an orthogonal regularization with respect to the weight tensors W_h^{(V)} in Eq. (2). Specifically, we first form the tensor W^{(V)} = [W_1^{(V)}, W_2^{(V)}, ..., W_H^{(V)}] by concatenating the W_h^{(V)} of all heads. Note that the size of W^{(V)} is H × F × d_v, where d_v denotes the last dimension of V_h. We then flatten the tensor W^{(V)} into a matrix A of size H × (F · d_v), and further normalize it as Ã = A / ||A||_2. Finally, the penalty loss is computed by

    L_p = \left\| \tilde{A} \tilde{A}^{\top} - I \right\|_F,    (9)

where ||·||_F denotes the Frobenius norm of a matrix and I stands for the identity matrix. We simply add the penalty loss to the original loss with a trade-off hyper-parameter γ:

    L = L_{CE} + \gamma L_p.    (10)

For simplicity of exposition, we omit the index of the multi-head self-attention layer here. In our experiments, we sum up the penalty losses from all multi-head self-attention layers as the final penalty loss:

    L_p = L_p^{(1)} + L_p^{(2)} + \cdots + L_p^{(L)}.    (11)
4.4 Trading Gap Splitter

The input of the model is nominally a continuous time series. However, due to trading gaps, the input time series is essentially NOT continuous. Take the 15-minute data from the NASDAQ stock market as an example [2]: one trading day contains 26 15-minute time-steps and one trading week contains 5 trading days. This means there are inter-day and inter-week trading gaps. However, when the basic Transformer model is applied to such data, the self-attention layer treats all time-steps equally and ignores the implicit inter-day and inter-week trading gaps. To solve this problem, we design a new hierarchical self-attention mechanism for the Transformer model to learn the hierarchical features of stock data (see Figure 3(a)).

[2] The NASDAQ Stock Exchange is open 5 days per week for 6.5 hours per day.

Take a 3-block Transformer model as an example: we aim to learn the hierarchical features of stock data in the order "intra-day → intra-week → global". To do so, we apply two extra position-wise masks to the first and second self-attention blocks, respectively, in order to limit their attention scopes. Formally, we modify Eq. (7) into the following form:

    a_h = \mathrm{softmax}\!\left[ \left( \frac{Q_h K_h^{\top}}{\sqrt{d_k}} + B_h^{(G)} \right) \cdot M^{(H)} \cdot M \right],    (12)

where M^{(H)} is an N × N matrix filled with -inf except for continuous sub-matrices of zeros along its diagonal. The M^{(H)} for the first and second self-attention blocks are shown in Figure 3(c). Specifically, the sub-matrices in M^{(H)} for the first block are of size 26 × 26, since one trading day contains 26 15-minute time-steps, and the sub-matrices for the second block grow to 130 × 130 (26 × 5), since one trading week contains 5 trading days. In this way, the first and second self-attention blocks learn the intra-day and intra-week features of the stock data, respectively. For the last self-attention block, we keep the original attention mechanism without M^{(H)} in order to learn global features of the stock data. As a result, the Transformer model with the proposed hierarchical attention mechanism does not suffer from the trading gaps. Note that all attention heads in the same multi-head self-attention layer share the same M^{(H)}.

Figure 3: Trading Gap Splitter overview. (a) Hierarchical features of NASDAQ 15-minute data: 26 15-minute steps per intra-day feature and 5 days per intra-week feature, under a global feature. (b) Hierarchical self-attention mechanism across the three blocks. (c) Hierarchical masks M^{(H)}: 26 × 26 sub-blocks for Block-1 and 130 × 130 sub-blocks for Block-2.
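A sketch of how the block-diagonal scope mask M^{(H)} can be built. Here the mask is applied additively to the pre-softmax scores, which is the usual way to implement a -inf/0 mask; Eq. (12) writes the combination multiplicatively, so this additive formulation and the names are illustrative assumptions.

```python
import torch

def hierarchical_mask(n_steps, scope):
    """Block-diagonal attention-scope mask M^(H): positions may only attend
    within their own `scope`-length segment (e.g. scope=26 for intra-day,
    scope=130 for intra-week on NASDAQ 15-minute data); -inf elsewhere."""
    mask = torch.full((n_steps, n_steps), float('-inf'))
    for start in range(0, n_steps, scope):
        end = min(start + scope, n_steps)
        mask[start:end, start:end] = 0.0
    return mask

# 2 trading weeks of NASDAQ 15-minute bars: 2 * 5 * 26 = 260 time-steps
intra_day = hierarchical_mask(260, 26)      # used in the first self-attention block
intra_week = hierarchical_mask(260, 130)    # used in the second block
# scores = Q @ K.T / sqrt(d_k) + B_gauss + intra_day + causal_mask   # additive reading of Eq. (12)
print(intra_day[0, 25], intra_day[0, 26])   # tensor(0.), tensor(-inf)
```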
5 Experiments

To evaluate the proposed methods, we use two stock datasets: one from the NASDAQ market and the other from the China A-shares market. The details of the two datasets are listed in Table 1. In the following subsections, we introduce the data collection process and present our empirical results. We also conduct an incremental analysis to explore the effectiveness of each proposed enhancement to the Transformer.

Property                  | Daily       | 15-min
--------------------------|-------------|----------------
Market                    | NASDAQ      | China A-shares
Start Date                | 2010/07/01  | 2015/12/01
End Date                  | 2019/07/01  | 2019/12/01
Time Frequency            | 1 day       | 15 minutes
Total Stocks              | 3243        | 500
Total Records             | 9,749,098   | 7,928,000
Rising Threshold β_rise   | 0.55%       | -0.5%
Falling Threshold β_fall  | -0.1%       | 0.105%

Table 1: Data description.

5.1 Data Collection

We collect the daily quote data of all 3243 stocks on the NASDAQ stock market from July 1st, 2010 to July 1st, 2019, and the 15-minute quote data of the 500 CSI-500 component stocks on the China A-shares market from December 1st, 2015 to December 1st, 2019. We move a lag window of size N time-steps along these time series to construct candidate examples. For the NASDAQ data, we construct four datasets with window sizes N = 5, 10, 20, 40 (denoting 5, 10, 20, and 40 days, respectively), with the lag stride fixed to 1. For the China A-shares data, we likewise construct four datasets with window sizes N = 5 × 16, 10 × 16, 20 × 16, 40 × 16 (again denoting 5, 10, 20, and 40 days, since one trading day contains 16 15-minute time-steps on the China A-shares market), with the lag stride fixed to 16 (i.e., 1 day). The features used in all datasets consist of open, high, low, close, and volume, which are all adjusted and normalized following [Feng et al., 2019; Xu and Cohen, 2018]. We initially label both datasets by the strategy mentioned in Section 3 (i.e., y = I(p_{T+k} > p_T)), where k is set to 1 for the NASDAQ data and 16 for the China A-shares data (both denote 1 day). Furthermore, we apply two threshold parameters β_rise and β_fall to the labels in each dataset in order to balance the number of positive and negative samples to roughly 1:1, as follows:

    y = \begin{cases} 1 & \dfrac{p_{T+k} - p_T}{p_T} > \beta_{rise}; \\[4pt] -1 & \dfrac{p_{T+k} - p_T}{p_T} < \beta_{fall}; \\[4pt] \text{abandon} & \text{otherwise}. \end{cases}    (13)

To avoid data leakage, we strictly follow the sequential order when splitting the training/validation/test sets. Specifically, we split the NASDAQ data and the China A-shares data into training/validation/test sets of 8 years/1 year/1 year and 3 years/6 months/6 months, respectively.
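A sketch of the threshold-based labeling of Eq. (13) and the chronological split; the helper names, the synthetic price series, and the 80/10/10 proportions (approximating the 8-year/1-year/1-year split) are illustrative assumptions.

```python
import numpy as np

def label_with_thresholds(close, k, beta_rise, beta_fall):
    """Eq. (13): label 1 if the k-step-ahead return exceeds beta_rise,
    -1 if it falls below beta_fall, and drop the ambiguous middle band
    so positives and negatives stay roughly balanced."""
    ret = (close[k:] - close[:-k]) / close[:-k]       # (p_{T+k} - p_T) / p_T
    labels = np.full(ret.shape, np.nan)
    labels[ret > beta_rise] = 1
    labels[ret < beta_fall] = -1
    keep = ~np.isnan(labels)                          # "abandon otherwise"
    return labels[keep], keep

# toy usage with the NASDAQ daily thresholds from Table 1
close = np.cumprod(1 + 0.01 * np.random.randn(1000)) * 100
y, keep = label_with_thresholds(close, k=1, beta_rise=0.0055, beta_fall=-0.001)

# chronological split (no shuffling) to avoid look-ahead leakage, e.g. 8/1/1 years
n = len(y)
train, valid, test = y[: int(0.8 * n)], y[int(0.8 * n): int(0.9 * n)], y[int(0.9 * n):]
```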

5.2 Evaluation Metrics

Following previous research [Feng et al., 2019; Xu and Cohen, 2018], we evaluate prediction performance with two metrics: Accuracy (Acc) and the Matthews Correlation Coefficient (MCC), which is defined as

    \mathrm{MCC} = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}},    (14)

where tp, tn, fp, and fn denote the number of samples classified as true positive, true negative, false positive, and false negative, respectively.
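For reference, a direct implementation of Eq. (14); the function name and the toy labels are illustrative.

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient (Eq. 14) for binary labels in {0, 1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

print(mcc([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0]))   # ~0.33
```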

5.3 Numerical Experiments

Approaches in comparison. We compare our approaches B-TF, MG-TF, and HMG-TF with the baselines CNN, LSTM, and ALSTM:

• CNN [Selvin et al., 2017]: a 1D-CNN with kernel size 1 × 3.

• LSTM [Nelson et al., 2017]: a variant of the recurrent neural network with feedback connections.

• Attentive LSTM (ALSTM) [Feng et al., 2019]: a variant of the LSTM model with a temporal attentive aggregation layer.

• Basic Transformer (B-TF): the proposed basic Transformer model introduced in Section 4.1.

• Multi-Scale Gaussian Transformer (MG-TF): the B-TF model with the Multi-Scale Gaussian Prior and Orthogonal Regularization enhancements introduced in Section 4.2 and Section 4.3, respectively.

• Hierarchical Multi-Scale Gaussian Transformer (HMG-TF): the MG-TF model with the Trading Gap Splitter enhancement introduced in Section 4.4.

Settings. We implement B-TF, MG-TF, and HMG-TF in the PyTorch framework on an Nvidia Tesla V100 GPU. We use the Adam optimizer with an initial learning rate of 1e-4. The mini-batch size is set to 256 and the trade-off hyper-parameter γ is set to 0.05. All Transformer-based models have 3 multi-head self-attention blocks, each with 4 heads. We train the models end-to-end from raw quote data without any data augmentation. Since the Trading Gap Splitter is less appropriate for low-frequency data such as daily quotes, we only apply MG-TF (rather than HMG-TF) to our NASDAQ dataset.
Results. The performance of the comparison experiments on both datasets is shown in Table 2, from which we make the following observations:

• The proposed approaches B-TF, MG-TF, and HMG-TF show significant gains on both Acc and MCC compared with the baselines in all cases. This shows that Transformer-based approaches have significant performance advantages over RNN- and CNN-based approaches.

• Transformer-based approaches have a better capability of learning long-term dependencies, especially on the China A-shares data, where the 40-day window contains 640 (40 × 16) time-steps and it is hard for RNN-based approaches to learn dependencies across so many time-steps (see Figure 4). The self-attention mechanism enables Transformer-based approaches to achieve consistently better performance as the window size grows.

• The modified Transformer models MG-TF and HMG-TF perform better than the basic Transformer model. A more detailed analysis is given in the next section.

NASDAQ Daily Data
Method  | K = 5         | K = 10        | K = 20         | K = 40
CNN     | 52.33 / 3.16  | 52.02 / 2.68  | 51.84 / 2.28   | 52.60 / 2.52
LSTM    | 53.86 / 7.73  | 53.89 / 7.72  | 53.59 / 7.15   | 53.81 / 7.48
ALSTM   | 54.06 / 8.35  | 53.94 / 7.92  | 54.05 / 8.11   | 54.19 / 8.56
B-TF†   | 54.78 / 8.48  | 54.84 / 8.89  | 54.90 / 9.13   | 56.01 / 9.45
MG-TF†  | 55.10 / 8.98  | 56.18 / 9.74  | 56.77 / 10.39  | 57.30 / 11.46

China A-shares 15-min Data
Method   | K = 5          | K = 10         | K = 20         | K = 40
CNN      | 53.53 / 2.62   | 52.25 / 1.80   | 52.03 / 1.81   | 51.61 / 1.77
LSTM     | 56.59 / 6.42   | 56.70 / 6.19   | 56.18 / 3.74   | 54.93 / 2.98
ALSTM    | 57.03 / 8.23   | 57.42 / 9.16   | 55.69 / 6.65   | 55.68 / 6.65
B-TF†    | 57.14 / 9.68   | 57.42 / 9.52   | 57.32 / 9.14   | 57.55 / 10.41
HMG-TF†  | 57.36 / 10.52  | 57.79 / 9.98   | 57.90 / 10.33  | 58.70 / 14.87

Table 2: Results of the comparison experiments, reported as Accuracy (%) / MCC (×10⁻²) for window size K days. All values are averages over 5 repeated experiments; † denotes our methods.

Figure 4: Accuracy and MCC trends of LSTM, ALSTM, and HMG-TF as the window size K (in days) grows on the China A-shares data.

5.4 Incremental Analysis

To explore the effectiveness of the proposed components (Multi-Scale Gaussian Prior, Orthogonal Regularization, and Trading Gap Splitter), we further conduct an incremental analysis over different settings of the Transformer-based models.

Method  | MG | OR | TS | Acc. (%) | MCC (×10⁻²)
B-TF    | ✗  | ✗  | ✗  | 57.55    | 10.41
MG-TF*  | ✓  | ✗  | ✗  | 58.03    | 12.11
MG-TF   | ✓  | ✓  | ✗  | 58.25    | 12.75
HMG-TF  | ✓  | ✓  | ✓  | 58.70    | 14.87

Table 3: Experimental results of the incremental analysis. MG, OR, and TS denote the Multi-Scale Gaussian Prior, Orthogonal Regularization, and Trading Gap Splitter, respectively. * denotes the variant without OR.

As shown in Table 3, all of these components contribute to the performance of the Transformer-based method. Moreover, we observe that the performance improvement mainly comes from the Multi-Scale Gaussian Prior and the Trading Gap Splitter, while the gain from Orthogonal Regularization is relatively small.

Besides, we illustrate the effectiveness of the Trading Gap Splitter by visualizing the attention score matrices learned by the HMG-TF model with N = 160 (equivalent to K = 10 days, i.e., 2 weeks). As shown in Figure 5, the attention scope grows gradually (16 → 80 → 160) from (a) to (c). That is, with the proposed hierarchical self-attention mechanism, time-steps attend only to those belonging to the same day in the first self-attention block, only to those belonging to the same week in the second block, and without any scope limit in the last block.

Figure 5: Hierarchical attention scores. (a), (b), and (c) show the daily, weekly, and global attention score matrices learned by Block-1, Block-2, and Block-3, respectively.

6 Discussion & Future Work

In this paper, we propose to apply the Transformer model to stock movement prediction, where the attention mechanism helps capture extremely long-term dependencies in financial time series. Furthermore, equipped with the proposed enhancements (Multi-Scale Gaussian Prior, Orthogonal Regularization, and Trading Gap Splitter), our Transformer-based model achieves significant gains over several state-of-the-art baselines on two real-world market datasets. In the future, beyond the model itself, the following aspects can be investigated for further improvement: (1) cross-sectional features of financial data can be incorporated to improve the model; (2) regularization methods can be explored to avoid overfitting on financial data; (3) data augmentation, such as adversarial and stochastic perturbations, can be adopted to improve the robustness of the model. It would also be valuable to investigate theoretical guarantees for our method.
References

[Altman, 1992] Naomi S Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.

[Breiman, 2001] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[Chen et al., 2019] Deli Chen, Yanyan Zou, Keiko Harimoto, Ruihan Bao, Xuancheng Ren, and Xu Sun. Incorporating fine-grained events in stock movement prediction. arXiv preprint arXiv:1910.05078, 2019.

[Cortes and Vapnik, 1995] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[Feng et al., 2019] Fuli Feng, Huimin Chen, Xiangnan He, Ji Ding, Maosong Sun, and Tat-Seng Chua. Enhancing stock movement prediction with adversarial training. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 5843–5849. AAAI Press, 2019.

[Guo et al., 2019] Maosheng Guo, Yu Zhang, and Ting Liu. Gaussian transformer: a lightweight approach for natural language inference. In The Thirty-Third AAAI Conference on Artificial Intelligence, 2019.

[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[Lee et al., 2019] Mingu Lee, Jinkyu Lee, Hye Jin Jang, Byeonggeun Kim, Wonil Chang, and Kyuwoong Hwang. Orthogonality constrained multi-head attention for keyword spotting. arXiv preprint arXiv:1910.04500, 2019.

[Li et al., 2018] Jian Li, Zhaopeng Tu, Baosong Yang, Michael R Lyu, and Tong Zhang. Multi-head attention with disagreement regularization. arXiv preprint arXiv:1810.10183, 2018.

[Li et al., 2019] Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In Advances in Neural Information Processing Systems, pages 5244–5254, 2019.

[Lin et al., 2013] Yuling Lin, Haixiang Guo, and Jinglu Hu. An SVM-based approach for stock market trend prediction. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE, 2013.

[Nelson et al., 2017] David MQ Nelson, Adriano CM Pereira, and Renato A de Oliveira. Stock market's price movement prediction with LSTM neural networks. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 1419–1426. IEEE, 2017.

[Qin et al., 2017] Yao Qin, Dongjin Song, Haifeng Chen, Wei Cheng, Guofei Jiang, and Garrison Cottrell. A dual-stage attention-based recurrent neural network for time series prediction. arXiv preprint arXiv:1704.02971, 2017.

[Rumelhart et al., 1986] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

[Schumaker and Chen, 2009] Robert P Schumaker and Hsinchun Chen. Textual analysis of stock market prediction using breaking financial news: The AZFin Text system. ACM Transactions on Information Systems (TOIS), 27(2):12, 2009.

[Selvin et al., 2017] Sreelekshmy Selvin, R Vinayakumar, EA Gopalakrishnan, Vijay Krishna Menon, and KP Soman. Stock price prediction using LSTM, RNN and CNN-sliding window model. In 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pages 1643–1647. IEEE, 2017.

[Tao et al., 2018] Chongyang Tao, Shen Gao, Mingyue Shang, Wei Wu, Dongyan Zhao, and Rui Yan. Get the point of my utterance! Learning towards effective responses with multi-head attention mechanism. In IJCAI, pages 4418–4424, 2018.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[Walczak, 2001] Steven Walczak. An empirical analysis of data requirements for financial forecasting with neural networks. Journal of Management Information Systems, 17(4):203–222, 2001.

[Wang et al., 2019] Jia Wang, Tong Sun, Benyuan Liu, Yu Cao, and Hongwei Zhu. CLVSA: A convolutional LSTM based variational sequence-to-sequence model with attention for predicting trends of financial markets. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 3705–3711. AAAI Press, 2019.

[Weng et al., 2017] Bin Weng, Mohamed A Ahmed, and Fadel M Megahed. Stock market one-day ahead movement prediction using disparate data sources. Expert Systems with Applications, 79:153–163, 2017.

[Xingjian et al., 2015] SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.

[Xu and Cohen, 2018] Yumo Xu and Shay B Cohen. Stock movement prediction from tweets and historical prices. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1970–1979, 2018.

[Zhang et al., 2017] Liheng Zhang, Charu Aggarwal, and Guo-Jun Qi. Stock price prediction via discovering multi-frequency trading patterns. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2141–2149. ACM, 2017.
