Hierarchical Multi-Scale Gaussian Transformer For Stock Movement Prediction
nism in canonical Transformer is insensitive to local context, whose dependencies are very important in financial time series. (2) Hierarchy poverty: the point-wise dot-product self-attention mechanism lacks the capability of utilizing the hierarchical structure of financial time series (e.g., learning intra-day, intra-week and intra-month features independently). Intuitively, addressing those drawbacks will improve the robustness of the model and lead to better performance in financial time series prediction.

In this paper, we propose a new Transformer-based method for stock movement prediction. The primary highlight of the proposed model is its capability of capturing long-term, short-term as well as hierarchical dependencies of financial time series. To this end, we propose several enhancements to the Transformer model: (1) a Multi-Scale Gaussian Prior enhances the locality of the Transformer; (2) Orthogonal Regularization avoids learning redundant heads in the multi-head self-attention mechanism; (3) a Trading Gap Splitter enables the Transformer to learn intra-day and intra-week features independently. Numerical comparisons with other competitive methods for time series show the advantages of the proposed method.

In summary, the main contributions of our paper include:

• We propose a Transformer-based method for stock movement prediction. To the best of our knowledge, this is the first work using a Transformer model to tackle financial time series forecasting problems.

• We propose several enhancements to the Transformer model, including Multi-Scale Gaussian Prior, Orthogonal Regularization, and Trading Gap Splitter.

• In experiments, the proposed Transformer-based method significantly outperforms several state-of-the-art baselines, such as CNN, LSTM and ALSTM, on two real-world exchange markets.
2 Related Work

Fundamental Analysis. Machine learning for fundamental analysis developed with the explosion of alternative finance data, such as news, location and e-commerce data. [Schumaker and Chen, 2009] proposes a predictive machine learning approach for financial news article analysis using several different textual representations. [Weng et al., 2017] outlines a novel methodology to predict future movements in the value of securities after tapping data from disparate sources. [Xu and Cohen, 2018] uses a stochastic recurrent model (SRM) with an extra discriminator and an attention mechanism to address the adaptability of stock markets. [Chen et al., 2019] proposes to learn event extraction and stock prediction jointly.

Technical Analysis. On the other hand, technical analysis methods extract price-volume information from historical trading data and use machine learning algorithms for prediction. For instance, [Lin et al., 2013] proposes an SVM-based approach for stock market trend prediction. Meanwhile, the LSTM neural network [Hochreiter and Schmidhuber, 1997] has been employed to model stock price movement. [Nelson et al., 2017] proposes an LSTM model to predict stock movement based on technical analysis indicators. [Zhang et al., 2017] proposes an LSTM model on historical data to discover multi-frequency trading patterns. [Wang et al., 2019] proposes a ConvLSTM-based Seq2Seq framework for stock movement prediction. [Qin et al., 2017] proposes an Attentive-LSTM model with an attention mechanism to predict stock price movement, and [Feng et al., 2019] further introduces a data augmentation approach based on the idea of adversarial training. However, [Li et al., 2019] points out that LSTM can only sharply distinguish about the 50 nearest positions within an effective context size of about 200. That means that LSTM-based models have difficulty capturing extremely long-term dependencies in time series. To tackle this issue, we propose a Transformer-based method to better mine the intrinsic long-term and complex structures in financial time series.

3 Problem Formulation

Since the exact price of a stock is extremely hard to predict accurately, we follow the setup of [Walczak, 2001] and predict the stock price movement instead. Usually the stock movement prediction task is treated as a binary classification problem, e.g., discretizing the stock movement into two classes (Rise or Fall). Formally, given the stock features $X = [x_{T-\Delta t+1}, x_{T-\Delta t+2}, \ldots, x_T] \in \mathbb{R}^{\Delta t \times F}$ (also represented as $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{N \times F}$ in the rest of the paper for simplicity) in the latest $\Delta t$ time-steps, the prediction model $f_\theta(X)$ with parameters $\theta$ outputs the predicted movement label $y = \mathbb{I}(p_{T+k} > p_T)$, where $T$ denotes the target trading time, $F$ denotes the dimension of the stock features and $p_t$ denotes the close price at time-step $t$. Briefly, the proposed model utilizes the historical data of a stock $s$ in the lag $[T - \Delta t + 1, T]$ (where $\Delta t$ is a fixed lag size) to predict the movement class $y$ (0 for Fall, 1 for Rise) of the future $k$ time-steps.
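To make the formulation concrete, the following is a minimal sketch of how a training example and its movement label can be derived from a close-price series. It is our own illustration, not code from the paper; names such as `make_movement_label` and `make_example` are ours.

```python
import numpy as np

def make_movement_label(close: np.ndarray, T: int, k: int) -> int:
    """Binary movement label y = 1 if p_{T+k} > p_T else 0 (Rise vs. Fall)."""
    return int(close[T + k] > close[T])

def make_example(features: np.ndarray, T: int, delta_t: int) -> np.ndarray:
    """Slice the lag window X = [x_{T-Δt+1}, ..., x_T], shape (Δt, F)."""
    return features[T - delta_t + 1 : T + 1]

# toy usage: one feature column, Δt = 5, predict k = 1 step ahead
close = np.array([10.0, 10.2, 10.1, 10.4, 10.3, 10.5, 10.6])
X = make_example(close[:, None], T=5, delta_t=5)   # shape (5, 1)
y = make_movement_label(close, T=5, k=1)           # 1, since 10.6 > 10.5
```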
4 Proposed Method

In this section, we first describe the basic Transformer model we designed. Then we introduce the proposed enhancements of the Transformer for financial time series.

4.1 Basic Transformer for Stock Movement Prediction

In our work, we instantiate $f_\theta(\cdot)$ with a Transformer-based model. To adapt to the stock movement prediction task, which takes time series as input, we design a variant of the Transformer with an encoder-only structure, consisting of $L$ blocks of multi-head self-attention layers and position-wise feed-forward layers (see Figure 1).

Figure 1: The proposed Basic Transformer overview.

Given the input time series $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{N \times F}$, we first add the position encoding and adopt a linear layer with a tanh activation function as follows:

$$\bar{X} = \sigma_{\tanh}\!\left(W^{(I)}\,[\mathrm{PositionEncoding}(X)]\right). \qquad (1)$$

Then the multi-head self-attention layers take $\bar{X}$ as input and are computed by

$$Q_h = W_h^{(Q)} \bar{X}, \quad K_h = W_h^{(K)} \bar{X}, \quad V_h = W_h^{(V)} \bar{X}, \qquad (2)$$

where $h = 1, \ldots, H$ and $W_h^{(Q)}$, $W_h^{(K)}$ and $W_h^{(V)}$ are learnable weight matrices for Query, Key and Value, respectively (refer to [Vaswani et al., 2017] for more details). Then the attention score matrix $a_h \in \mathbb{R}^{N \times N}$ of the $h$-th head is computed by

$$a_h = \mathrm{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d_k}} \cdot M\right), \qquad (3)$$

where $M$ is a position-wise mask matrix that masks out attention to future positions, so as to avoid future information leakage. Afterwards, the output of the $h$-th head is a weighted sum defined as follows:

$$[O_h]_i = \sum_{j=1}^{N} (a_h)_{i,j} \cdot [V_h]_j. \qquad (4)$$

The final output of the multi-head attention layer is the concatenation of all heads, $O = [O_1, O_2, \ldots, O_H]$. Afterwards, the position-wise feed-forward layer takes $O$ as input and transforms it into $Z$ by two fully-connected layers with a ReLU activation. Upon the output $z_i$ of the last self-attention block, a temporal attention layer [Qin et al., 2017] is deployed to aggregate the latent features from each position as $m = \sum_{i=1}^{N} \alpha_i z_i$. Then the scalar prediction score $\hat{y}$ is computed by a fully-connected layer and a sigmoid transformation:

$$\hat{y} = \mathrm{sigmoid}(W_{fc}\, m). \qquad (5)$$

Our ultimate goal is to maximize the log-likelihood of the label $y$ under the prediction $\hat{y}$ via the following objective:

$$\mathcal{L}_{CE} = (1 - y)\log(1 - \hat{y}) + y\log(\hat{y}). \qquad (6)$$
LCE = (1 − y)log(1 − ŷ) + ylog(ŷ) (6) series.
4.2 Enhancing Locality with Multi-Scale Gaussian 4.3 Orthogonal Regularization for Multi-Head
Prior Self-Attention Mechanism
Recently, Transformer exhibits its powerful capability of ex- With the proposed Multi-Scale Gaussian Prior, we let differ-
tracting global patterns in natural language processing fields. ent heads learn different temporal patterns in the multi-head
However, the self-attention mechanism in Transformer con- attention layer. However, some previous research [Tao et al.,
siders the global dependencies with very weak position in- 2018][Li et al., 2018][Lee et al., 2019] claims that canonical
formation. Note that, the position information serves as the multi-head self-attention mechanism tend to learn redundant
temporal variant patterns in time series, which is much im- heads. To enhance the diversity between each head, we in-
portant. To address it, we incorporate Multi-Scale Gaussian duce an orthogonal regularization with regard to the weight
(V )
Prior into the canonical multi-head self-attention mechanism tensor Wh in Eq.2. Specifically, we first calculate the
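The bias matrix of Eq. 8 can be precomputed once per sequence length. The sketch below is our own illustration (not the paper's code); the per-head window sizes follow the $D = \{5, 10, 20, 40\}$ choice described above.

```python
import torch

def gaussian_prior_bias(n: int, sigma: float) -> torch.Tensor:
    """B^(G) from Eq. 8: exp(-(j-i)^2 / (2*sigma^2)) for j <= i, and 0 otherwise."""
    idx = torch.arange(n)
    diff = idx.unsqueeze(0) - idx.unsqueeze(1)          # diff[i, j] = j - i
    bias = torch.exp(-diff.float() ** 2 / (2.0 * sigma ** 2))
    return torch.where(diff <= 0, bias, torch.zeros_like(bias))

# one bias matrix per head, with sigma_h = D_h for D = {5, 10, 20, 40}
window_sizes = [5, 10, 20, 40]
N = 64
biases = torch.stack([gaussian_prior_bias(N, d) for d in window_sizes])  # (H, N, N)

# inside an attention layer the bias of head h is added to the raw scores (Eq. 7):
#   scores_h = Q_h @ K_h.T / d_k ** 0.5 + biases[h]   (before masking and softmax)
```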
4.3 Orthogonal Regularization for Multi-Head Self-Attention Mechanism

With the proposed Multi-Scale Gaussian Prior, we let different heads learn different temporal patterns in the multi-head attention layer. However, previous research [Tao et al., 2018; Li et al., 2018; Lee et al., 2019] claims that the canonical multi-head self-attention mechanism tends to learn redundant heads. To enhance the diversity between heads, we introduce an orthogonal regularization with regard to the weight matrices $W_h^{(V)}$ in Eq. 2. Specifically, we first form the tensor $W^{(V)} = [W_1^{(V)}, W_2^{(V)}, \ldots, W_H^{(V)}]$ by concatenating the $W_h^{(V)}$ of all heads. Note that the size of $W^{(V)}$ is $H \times F \times d_v$, where $d_v$ denotes the last dimension of $V_h$. Then we flatten the tensor $W^{(V)}$ into a matrix $A$ of size $H \times (F \cdot d_v)$ and further normalize it as $\tilde{A} = A / \|A\|_2$. Finally, the penalty loss is computed by

$$\mathcal{L}_p = \|\tilde{A}\tilde{A}^{\top} - I\|_F, \qquad (9)$$

where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix and $I$ stands for the identity matrix. We add the penalty loss to the original objective with a trade-off hyper-parameter $\gamma$ as follows:

$$\mathcal{L} = \mathcal{L}_{CE} + \gamma \mathcal{L}_p. \qquad (10)$$

For simplicity of exposition, we omit the number of multi-head self-attention layers in the model here. In our experiments, we sum up the penalty losses from all multi-head self-attention layers as the final penalty loss:

$$\mathcal{L}_p = \mathcal{L}_p^{(1)} + \mathcal{L}_p^{(2)} + \ldots + \mathcal{L}_p^{(L)}. \qquad (11)$$
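A minimal sketch of the penalty in Eqs. 9-11 is given below. It is our own illustration: `value_weights` is assumed to be the list of per-head $W^{(V)}_h$ matrices of one attention layer, and we interpret the normalization $\tilde{A} = A/\|A\|_2$ as per-row L2 normalization, which is a common choice but is an assumption on our part.

```python
import torch

def orthogonal_penalty(value_weights) -> torch.Tensor:
    """L_p = || A_tilde @ A_tilde^T - I ||_F  (Eq. 9), with A of shape (H, F * d_v)."""
    A = torch.stack([w.reshape(-1) for w in value_weights])  # flatten each head: (H, F*d_v)
    A_tilde = A / A.norm(dim=1, keepdim=True)                # per-row L2 normalization (assumed)
    H = A_tilde.shape[0]
    gram = A_tilde @ A_tilde.T                               # (H, H) head-similarity matrix
    return torch.linalg.matrix_norm(gram - torch.eye(H), ord="fro")

# toy usage: H = 4 heads, F = 8 input features, d_v = 4
heads = [torch.randn(8, 4, requires_grad=True) for _ in range(4)]
penalty = orthogonal_penalty(heads)   # one layer's L_p
# total objective (Eqs. 10-11): loss = cross_entropy + gamma * sum of per-layer penalties
```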
4.4 Trading Gap Splitter

The input of the model is nominally a continuous time series. However, due to trading gaps, the input time series is essentially NOT continuous. Take the 15-minute data from the NASDAQ stock market as an example (the NASDAQ Stock Exchange is open 5 days per week for 6.5 hours per day): one trading day contains 26 15-minute time-steps and one trading week contains 5 trading days. This means there are inter-day and inter-week trading gaps. However, when the basic Transformer model is applied to such data, the self-attention layers treat all time-steps equally and ignore the implicit inter-day and inter-week trading gaps. To solve this problem, we design a new hierarchical self-attention mechanism for the Transformer model to learn the hierarchical features of stock data (see Figure 3 (a)).

Figure 3: (a) Hierarchical features of NASDAQ 15-minute data; (b) hierarchical self-attention mechanism; (c) hierarchical masks.

Take a 3-block Transformer model as an example: we aim to learn the hierarchical features of stock data in the order "intra-day → intra-week → global". To do so, we apply two extra position-wise masks to the first and second self-attention blocks, respectively, in order to limit their attention scopes. Formally, we modify Eq. 7 into the following form:

$$a_h = \mathrm{softmax}\!\left[\left(\frac{Q_h K_h^{\top}}{\sqrt{d_k}} + B_h^{(G)}\right) \cdot M^{(H)} \cdot M\right], \qquad (12)$$

where $M^{(H)}$ is an $N \times N$ matrix filled with $-\infty$ whose diagonal is composed of contiguous sub-matrices filled with 0. The $M^{(H)}$ for the first and second self-attention blocks are shown in Figure 3 (c). Specifically, the size of the sub-matrices in $M^{(H)}$ for the first block is $26 \times 26$, since one trading day contains 26 15-minute time-steps, and the size of the sub-matrices for the second block changes to $130 \times 130$ ($26 \times 5$), since one trading week contains 5 trading days. In this way, the first and second self-attention blocks learn the intra-day and intra-week features of the stock data, respectively. Moreover, for the last self-attention block, we keep the original attention mechanism without $M^{(H)}$ to learn global features of the stock data. As a result, the Transformer model with the proposed hierarchical attention mechanism does not suffer from the trading gaps. Note that all attention heads in the same multi-head self-attention layer share the same $M^{(H)}$.
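A minimal sketch of constructing such a block-diagonal mask is given below. It is our own illustration: `block_size` would be 26 for the intra-day block and 130 for the intra-week block in the NASDAQ 15-minute setting, and we apply the mask additively to the raw scores, which is a standard way to realize the masking that $M^{(H)}$ describes rather than the literal multiplication written in Eq. 12.

```python
import torch

def hierarchical_mask(n: int, block_size: int) -> torch.Tensor:
    """Block-diagonal mask: 0 inside each block of `block_size` steps, -inf elsewhere,
    so a time-step can only attend within its own day (or week)."""
    idx = torch.arange(n)
    same_block = (idx.unsqueeze(0) // block_size) == (idx.unsqueeze(1) // block_size)
    mask = torch.full((n, n), float("-inf"))
    return mask.masked_fill(same_block, 0.0)

# NASDAQ 15-minute setting: 26 steps per day, 130 (= 26 * 5) steps per week
N = 2 * 130                                    # two trading weeks of input
intra_day_mask = hierarchical_mask(N, 26)      # for the first self-attention block
intra_week_mask = hierarchical_mask(N, 130)    # for the second self-attention block
# the last block uses no hierarchical mask; before the softmax, the raw scores receive
#   scores + gaussian_bias + hierarchical_mask + causal_mask
```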
5 Experiments

5.1 Data Collection

We collect the daily quote data of all 3,243 stocks on the NASDAQ stock market from July 1st, 2010 to July 1st, 2019, and the 15-minute quote data of the 500 CSI-500 component stocks on the China A-shares market from December 1st, 2015 to December 1st, 2019. We move a lag window of size $N$ time-steps along these time series to construct candidate examples. For the NASDAQ data, we construct a dataset for each window size $N = 5, 10, 20, 40$ (i.e., 5-, 10-, 20- and 40-day windows), and the lag strides are all fixed to 1. For the China A-shares data, we likewise construct a dataset for each window size $N = 5 \times 16, 10 \times 16, 20 \times 16, 40 \times 16$ (again denoting 5, 10, 20 and 40 days, since one trading day contains 16 15-minute time-steps on the China A-shares market), and the lag strides are all fixed to 16 (i.e., 1 day). The features used in all datasets consist of open, high, low, close and volume, which are all adjusted and normalized following [Feng et al., 2019; Xu and Cohen, 2018]. We initially label both datasets by the strategy mentioned in Section 3 (i.e., $y = \mathbb{I}(p_{T+k} > p_T)$), where $k$ is set to 1 and 16 (both denoting 1 day) for the NASDAQ data and the China A-shares data, respectively. Furthermore, we apply two threshold parameters $\beta_{rise}$ and $\beta_{fall}$ to the labels in each dataset in order to balance the number of positive and negative samples to roughly 1:1 as follows:

$$y = \begin{cases} 1 & \dfrac{p_{T+k} - p_T}{p_T} > \beta_{rise}; \\[1ex] -1 & \dfrac{p_{T+k} - p_T}{p_T} < \beta_{fall}; \\[1ex] \text{abandon} & \text{otherwise.} \end{cases} \qquad (13)$$

To avoid the data leakage problem, we strictly follow the sequential order when splitting the training/validation/test sets. Specifically, we split the NASDAQ data and the China A-shares data into training/validation/test sets of 8 years/1 year/1 year and 3 years/6 months/6 months, respectively.

Figure 4: The accuracy and MCC trends along with the window size K (in days) on the China A-shares data.

5.2 Evaluation Metrics

Following previous research [Feng et al., 2019; Xu and Cohen, 2018], we evaluate the prediction performance with two metrics: Accuracy (Acc) and the Matthews Correlation Coefficient (MCC), which is defined as

$$\mathrm{MCC} = \frac{tp \times tn - fp \times fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}, \qquad (14)$$

where $tp$, $tn$, $fp$ and $fn$ denote the number of samples classified as true positive, true negative, false positive and false negative, respectively.

5.3 Numerical Experiments

Approaches in comparison. We compare our approaches B-TF, MG-TF and HMG-TF with the baselines CNN, LSTM and ALSTM:

• CNN [Selvin et al., 2017]: Here we use a 1D-CNN with a kernel size of 1 × 3.

• LSTM [Nelson et al., 2017]: A variant of recurrent neural network with feedback connections.

• Attentive LSTM (ALSTM) [Feng et al., 2019]: A variant of the LSTM model with a temporal attentive aggregation layer.

• Basic Transformer (B-TF): The proposed Basic Transformer model introduced in Section 4.1.
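For completeness, a minimal sketch of the thresholded labeling rule in Eq. 13 and the MCC metric in Eq. 14 is given below. It is our own illustration, not the authors' code; the threshold values are placeholders, since the paper only states that $\beta_{rise}$ and $\beta_{fall}$ are chosen to balance the classes.

```python
import numpy as np

def threshold_label(close, T, k, beta_rise, beta_fall):
    """Eq. 13: +1 (Rise), -1 (Fall), or None ('abandon') based on the k-step return."""
    r = (close[T + k] - close[T]) / close[T]
    if r > beta_rise:
        return 1
    if r < beta_fall:
        return -1
    return None   # sample is abandoned

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient (Eq. 14)."""
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

# toy usage with placeholder thresholds
close = np.array([100.0, 100.2, 101.5, 99.0])
print(threshold_label(close, T=0, k=2, beta_rise=0.01, beta_fall=-0.01))  # +1 (1.5% rise)
print(mcc(tp=60, tn=55, fp=20, fn=15))
```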
Method    MG   OR   TS   Acc. (%)   MCC (×10⁻²)
B-TF      ✗    ✗    ✗    57.55      10.41
MG-TF*    ✓    ✗    ✗    58.03      12.11
MG-TF     ✓    ✓    ✗    58.25      12.75
HMG-TF    ✓    ✓    ✓    58.70      14.87

Table 3: Experimental results of the incremental analysis. MG, OR and TS denote Multi-Scale Gaussian Prior, Orthogonal Regularization and Trading Gap Splitter, respectively. * denotes without OR.