Transformer Architectures for Time Series Forecasting
UNIVERSITÀ DI BOLOGNA
ARTIFICIAL INTELLIGENCE
MASTER THESIS
in
Artificial Intelligence in Industry
CANDIDATE: Andrea Policarpi
SUPERVISOR: Prof. Michele Lombardi
CO-SUPERVISORS: Dr. Rosalia Tatano, Dr. Antonio Mastropietro
Contents

1 Introduction
2 Background
  2.1 Forecasting and Time Series
    2.1.1 The Time Series Forecasting Problem
    2.1.2 Applications
    2.1.3 Challenges of the TSF problem
  2.2 History of models used for the TSF problem
    2.2.1 Non-Transformer-based models
    2.2.2 The SOTA: Transformer-based models
3 Related Work
  3.1 Transformer drawbacks and state of the research
  3.2 Models focusing on local context of input
  3.3 Models with focus on efficiency
  3.4 Models with focus on positional and temporal information
  3.5 Other transformer-based models
4 The Datasets
  4.1 The ETT dataset
  4.2 The CUBEMS dataset
5 The Models
  5.1 Convolutional and LSTM models
  5.2 The TransformerT2V model
  5.3 The Informer model
    5.3.1 Starting input representation
    5.3.2 Input embedding layers
    5.3.3 Encoder layers and ProbSparse Attention
    5.3.4 Conv1D & Pooling layers
    5.3.5 Decoder layers and final dense output
    5.3.6 Informer model hyperparameters
7 Results
  7.1 Model performances on the ETTm1 Dataset
  7.2 Model performances on the CUBEMS Dataset
  7.3 Results on the study of ProbSparse Attention
    7.3.1 RMSE between query scores
    7.3.2 Hamming distance between query rankings
    7.3.3 Jaccard distance between top-u query sets
8 Conclusions
  8.1 Final remarks
  8.2 Future work
Bibliography
List of Figures

3.1 Comparison between the classical query-key construction (b) and the causal convolution one (d), and the portion of input they involve (a, d). The first method is locally-agnostic, while the second one is context-aware (Image from [24]).
3.2 Comparison between the vanilla attention (a) and the LogSparse attention (b) (Image from [24]).
3.3 (a): Working principle of the Feedback Transformer: past hidden representations from all layers are merged into a single vector and stored in a global memory. (b): Comparison between vanilla and Feedback Transformer architectures (Image from [11]).
3.4 Temporal Fusion Transformer model architecture (Image from [25]).
4.6 Plot of the 15-minute sampled "Total Floor 7 Consumption" feature inserted in the CUBEMS dataset (a) and zoomed windows of monthly (b), weekly (c) and daily (d) sizes.
4.7 Example of a daily-level outlier in the CUBEMS dataset. Despite being a Tuesday, 23 October is Chulalongkorn Day, a popular holiday in Thailand, and thus the energy consumption of the building drops to zero.
7.1 ETTm1 test set predictions for the LSTM, CNN, TransformerT2V and Informer architectures.
7.2 CUBEMS test set predictions for the LSTM, CNN, TransformerT2V and Informer architectures.
7.3 Example of a "holiday outlier" and the related Informer predictions at 1, 12 and 24 time steps in the future.
7.4 RMSE values related to the ranking function investigation on the "Full" model.
7.5 RMSE values related to the ranking function investigation on the "Sampled" model.
7.6 Bar charts of the Hamming distance value as a function of c for the "Full" (a) and the "Sampled" (b) models.
7.7 Jaccard distance between exact and approximated top-u query sets for both "Full" and "Sampled" Informer models.
7.8 Jaccard distance matrix associated with all possible c_q and c_k configurations, along with the relative heatmap and row bar charts, for an Informer model trained with c = c_q = c_k = 1.
7.9 Jaccard distance matrix associated with all possible c_q and c_k configurations, along with the relative heatmap and row bar charts, for an Informer model trained with c = c_q = c_k = 3.
7.10 Jaccard distance matrix associated with all possible c_q and c_k configurations, along with the relative heatmap and row bar charts, for an Informer model trained with c = c_q = c_k = 5.
List of Tables

7.3 MSE and MAE scores for predicted data at timesteps t+24, depending on whether the related feature column is used or not. The metrics are also computed separately on timesteps corresponding to working days and to weekends/holidays.
7.4 Normalized Hamming distance between query rankings in the "Full" model. "Full ranking" refers to the full query ordering, while "top-u ranking" to the ordering of the top-u queries only.
7.5 Normalized Hamming distance between query rankings in the "Sampled" model. "Full ranking" refers to the full query ordering, while "top-u ranking" to the ordering of the top-u queries only.
Chapter 1
Introduction
Regarding the first point, the models have been trained on two public datasets, namely ETTm1 and CUBEMS, and their performance has been evaluated both in qualitative and quantitative terms; for the second goal, the experiments have instead focused on the hyperparameter responsible for the degree of the approximations carried out by the ProbSparse mechanism, which have been quantified and evaluated by means of appropriate metrics.
This thesis is structured as follows: Chapters 2 and 3 introduce the background and the related work present in the literature, while Chapters 4 and 5 describe in detail the datasets and the architectures involved in the investigations. Chapter 6 illustrates the performed experiments and the methodology followed for their execution, while their results are provided in Chapter 7. Finally, Chapter 8 is reserved for some final remarks and suggestions for future work.
Chapter 2
Background
The trend reflects a long-term increase or decrease in the data, not necessarily linear [16]; it reflects the overall direction of the series, net of local oscillations. The latter are instead included in the seasonal component of the series: recurrent behaviours dictated by periodic conditions or events, such as a certain time of the day or a month of the year. Seasonality is always of a fixed and known frequency [16]; if multiple patterns occur at different frequencies in the same series, the dominant one is taken into account. As for the residual component, it collects the remainder of the series that is neither trend nor seasonal: mostly noise and irregular fluctuations, and sometimes minor recurring behaviours with frequencies different from the seasonal one.
Figure 2.2: A time series and its decomposition into its three main components
(Image from [4]).
The composition of these three components into the original series can either be additive or multiplicative [16]. For each element $y_t$ of a series $Y$, the additive composition takes the form:

$$y_t = T_t + S_t + R_t \tag{2.2}$$

while the multiplicative one takes the form:

$$y_t = T_t \cdot S_t \cdot R_t \tag{2.3}$$
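To make the decomposition concrete, the following minimal sketch applies an additive decomposition to a synthetic series using the statsmodels library; the series, its period and the decomposition call are illustrative choices, not a procedure used later in this thesis.

```python
# Additive decomposition (Eq. 2.2) of a synthetic daily series with statsmodels.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

t = np.arange(4 * 365)
series = pd.Series(
    0.01 * t                                    # slow upward trend
    + np.sin(2 * np.pi * t / 365)               # yearly seasonal component
    + np.random.normal(0, 0.1, t.size),         # residual noise
    index=pd.date_range("2016-01-01", periods=t.size, freq="D"),
)

result = seasonal_decompose(series, model="additive", period=365)
print(result.trend.dropna().head())             # .seasonal and .resid hold the other components
```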
Time series forecasting, in short TSF, can be carried out in many ways, and some classifications can be made [16][26]. Forecasting problems differ by:

• The prediction object. In the point estimate case we predict the expected future values of a target variable, while in the probabilistic forecasting case we obtain the parameters of a probability distribution (e.g. Gaussian) associated with it, which is useful to take the model's uncertainty into account;

• The input and output dimensionality. Input and output time series can either be univariate or multivariate, thus enabling various combinations. For example, in a multi-to-single forecasting, the past values of multiple series are used to predict the future values of a single target series.
Taking into account the point estimate case, the univariate one-step-ahead forecasting can be formalized as follows:

$$\hat{y}_{t+1} = f(y_t, y_{t-1}, \ldots, y_{t-k})$$

The function $f(\cdot)$ is model-dependent, and can vary from simple to very complex depending on the input elaborations taken into account.
The provided definition can be easily extended to the multi-horizon case by considering a certain forecasting window $M$. Furthermore, the multi-to-single forecasting case is covered by introducing the concept of covariate time series, i.e. additional series used to help explain the target one. We have:

$$\hat{y}_{t+1}, \ldots, \hat{y}_{t+M} = f(y_t, \ldots, y_{t-k}, \, x_t, \ldots, x_{t-k})$$

where $\hat{y}_{t+1}, \ldots, \hat{y}_{t+M}$ are the predicted values of $y_{t+1}, \ldots, y_{t+M}$ and $x_t, \ldots, x_{t-k}$ are the past values of the covariate series.
2.1.2 Applications
Predicting the future involves dealing with the uncertain and the unknown. In the time series forecasting problem, the major complication is given by the fact that predictions far into the future often resemble the behaviour of chaotic systems: given a small perturbation of the initial state (in our case, the input series), the output forecasts may differ very significantly. Extending the forecasting window causes some degree of error accumulation; the farther we try to gaze into the future, the lower our accuracy will be (Fig. 2.3).
employs the idea of moving averages to learn the serial correlation of the series (namely, the correlation between the series and a lagged version of itself). ARIMA models combine three components: an autoregressive (AR) part, a differencing (I) step and a moving-average (MA) part, resulting in the form:

$$y'_t = c + \phi_1 y'_{t-1} + \ldots + \phi_p y'_{t-p} + \theta_1 \epsilon_{t-1} + \ldots + \theta_q \epsilon_{t-q} + \epsilon_t \tag{2.7}$$

where $y'_t$ is the series differenced $d$ times, $\phi$ and $\theta$ are the model parameters, $\epsilon$ is the white noise, $p$ is the order of the autoregressive part and $q$ the order of the moving-average part.
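As a concrete reference, the sketch below fits an ARIMA(p, d, q) model of this form with the statsmodels library on a synthetic series; the chosen order and the forecasting horizon are illustrative and do not correspond to any experiment of this thesis.

```python
# Fitting an ARIMA(p, d, q) model of the form of Eq. 2.7 with statsmodels.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=500))   # synthetic non-stationary series

model = ARIMA(series, order=(2, 1, 1))     # p = 2 AR terms, d = 1 differencing, q = 1 MA term
fitted = model.fit()
forecast = fitted.forecast(steps=24)       # 24-step-ahead point forecasts
print(forecast[:5])
```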
While these models are easy to implement and computationally inexpensive, representing a good tool for low-complexity forecasting applications, they struggle to grasp input dependencies in more difficult problems: they provide a "black-box" approach, in which the output is computed purely from the input data, without a meaningful elaboration of the underlying system's state [26].
Moving on to deep learning-based models, a major representative is the class of Convolutional Neural Networks. This class of neural networks, originally created to analyze image inputs, can be adapted to the elaboration of time series [21][26]. The peculiarity of CNNs resides in the use of convolutional layers, which analyze not the single input values but windows of them, by means of sliding filters (two-dimensional for images, one-dimensional for time series). With this mechanism, a CNN model is able to learn short-term dependencies between a time step and its neighbours. In TSF applications, in order to consider only past correlations (since we don't know future values in advance), the standard convolution is replaced by a causal convolution, in which only the past neighbours are considered for each input element (Fig. 2.4, Fig. 2.6a).
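A minimal sketch of a causal 1D convolution is reported below (PyTorch is assumed here purely for illustration): the input is padded only on the past side, so that each output step depends exclusively on the current and previous time steps.

```python
# Sketch of a causal 1D convolution: each output step only sees current and past inputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation   # amount of left (past-side) padding
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, dilation=dilation)

    def forward(self, x):                  # x: [batch, channels, time]
        x = F.pad(x, (self.pad, 0))        # pad only on the left, i.e. towards the past
        return self.conv(x)

x = torch.randn(1, 1, 96)                  # univariate series, 96 time steps
y = CausalConv1d(1, 8, kernel_size=3)(x)   # output keeps the original length
print(y.shape)                             # torch.Size([1, 8, 96])
```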
Figure 2.4: (a): Example of convolutional neural network architecture for time series forecasting (Image from [23]). (b): 2D convolution with a 3x3 filter (Image from [10]). (c): Difference between standard and causal convolution (Image from [20]).
Figure 2.5: (a): Recurrent layer in its folded (left) and unfolded (right) forms. (b): Internal structure of an LSTM unit (Images from [30]).
Figure 2.6: Input elaboration pipeline for the CNN, RNN and Attention-based models (Image from [26]).
• A positional/temporal embedding;
Figure 2.7: (a): The original Transformer architecture. (b): Scaled Dot-Product Attention representation. (c): Multi-Head Attention representation (Images from [42]).
$$PE_{(pos, i)} = F\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{2.8}$$

where $pos$ is the position and $i$ is the dimension of the input elements. Multiple choices are possible for the periodic function $F$; in the original implementation, a sine/cosine approach is presented:

$$F(i, x) = \begin{cases} \sin(x), & i = 2k \\ \cos(x), & i = 2k + 1 \end{cases} \tag{2.9}$$

$$PE_{(pos, 2k)} = \sin\!\left(\frac{pos}{10000^{2k/d_{model}}}\right) \tag{2.10}$$

$$PE_{(pos, 2k+1)} = \cos\!\left(\frac{pos}{10000^{2k/d_{model}}}\right) \tag{2.11}$$
Figure 2.8: Representation of the sine/cosine encoding (Left image from [1]).
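The following NumPy sketch builds the sine/cosine encoding matrix of Eqs. 2.10 and 2.11 for a sequence of given length and model dimension.

```python
# Sketch of the sine/cosine positional encoding (Eqs. 2.10 and 2.11) in NumPy.
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # positions 0 .. max_len - 1
    two_k = np.arange(0, d_model, 2)[None, :]      # even dimension indices 2k
    angle = pos / np.power(10000.0, two_k / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dimensions get the sine
    pe[:, 1::2] = np.cos(angle)                    # odd dimensions get the cosine
    return pe

print(positional_encoding(96, 512).shape)          # (96, 512)
```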
With respect to the encoder, the decoder presents two differences. The first is that the attention performed by the first decoder block is masked, in order to prevent input elements from attending to future outputs: the predictions for each position t can depend only on the known outputs at positions less than t. The second resides in the fact that each decoder block provides a third multi-head attention sub-layer, which attends over the output of the encoder stack.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V \tag{2.12}$$

where the scaling factor $\frac{1}{\sqrt{d_k}}$ is added to prevent the dot products from growing too large in magnitude and thus hindering the softmax activation.
The aim of self-attention is to relate different positions of a single sequence in order to compute a meaningful representation of it. To do so, self-attention stores in a matrix a compatibility score for each possible query-key combination, and uses these scores to compute a weighted sum of the values. The rationale is that values associated with a higher query-key score are considered "more meaningful" in terms of information, and thus should contribute more to the final output representation. An example of a self-attention matrix is depicted in Fig. 2.9.
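A minimal NumPy sketch of the scaled dot-product attention of Eq. 2.12 is given below; it makes the two steps explicit: the query-key compatibility matrix and the weighted sum of the values.

```python
# Sketch of scaled dot-product attention (Eq. 2.12) in NumPy.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # [L_q, L_k] query-key compatibility matrix
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V                    # weighted sum of the values

L, d = 96, 64
Q, K, V = (np.random.randn(L, d) for _ in range(3))
print(attention(Q, K, V).shape)           # (96, 64)
```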
Why use multiple attention heads? Each head comes with its own parameters, so during training different heads can focus on different parts and internal dependencies of the input. Each head is thus associated with different semantic information, and the concatenation of their outputs allows more useful information to be extracted and retained.
Chapter 3
Related Work
operation.
Figure 3.1: Comparison between the classical query-key construction (b) and the causal convolution one (d), and the portion of input they involve (a, d). The first method is locally-agnostic, while the second one is context-aware (Image from [24]).
Figure 3.2: Comparison between the vanilla attention (a) and the LogSparse
attention (b). (Image from [24]).
of efficient models surveyed by Tay et al., along with their classification and the computational complexity of their attention layers, is provided in Tab. 3.1.

Table 3.1: Efficient transformer models surveyed by Tay et al., along with their attention mechanism complexity and their classification. Complexity abbreviations: n = sequence length, {b, k, m} = pattern window/block size, n_m = memory length, n_c = convolutionally compressed sequence length. Class abbreviations: P = Pattern, M = Memory, LP = Learnable Pattern, LR = Low Rank, KR = Kernel, RC = Recurrence. (Original table from [39]).
Figure 3.3: (a): Working principle of the Feedback Transformer: past hidden representations from all layers are merged into a single vector and stored in a global memory. (b): Comparison between vanilla and Feedback Transformer architectures (Image from [11]).
$$t2v(t)[i] = \begin{cases} w_i t + \phi_i, & \text{if } i = 0 \\ f(w_i t + \phi_i), & \text{if } 1 \le i \le k \end{cases} \tag{3.1}$$
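A minimal sketch of the Time2Vec mapping of Eq. 3.1 is shown below, with f = sin; in the actual layer the weights w and the phases phi are learnable parameters, while here they are simply drawn at random.

```python
# Sketch of the Time2Vec encoding of Eq. 3.1 (NumPy), with f = sin.
import numpy as np

def time2vec(t, w, phi, f=np.sin):
    """t: scalar time value; w, phi: arrays of shape (k + 1,)."""
    linear = w[0] * t + phi[0]              # i = 0: linear, non-periodic term
    periodic = f(w[1:] * t + phi[1:])       # 1 <= i <= k: periodic terms
    return np.concatenate(([linear], periodic))

k = 7
rng = np.random.default_rng(0)
w, phi = rng.normal(size=k + 1), rng.normal(size=k + 1)
print(time2vec(10.0, w, phi).shape)          # (8,)
```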
surely true for simpler models, such as CNNs and RNNs; for transformers, the attention mechanism represents a first step towards explainability, but in most applications the underlying decision process still remains obscure. In this context, Lim et al. proposed the Temporal Fusion Transformer (TFT) [25], a multi-horizon forecasting architecture which also provides insight into how, and which parts of, the input is considered in order to make predictions. The TFT structure, depicted in Fig. 3.4, is constituted by five key components.
Chapter 4
The Datasets
In order to train and evaluate the models of this thesis, two datasets have been chosen: the ETT dataset [47] and CUBEMS [32]. While substantially different, they can be linked to two practical time series forecasting problems, each with its own challenges. The following paragraphs provide a description of the data they enclose, along with the practical scenarios correlated to them.
Figure 4.2: Plot of the full ETTm1 dataset (a) and zoomed windows of
monthly (b), weekly (c) and daily (d) sizes.
challenging TSF problem, due to the short-term irregularities of the target series. But a correct forecast could bring strong benefits: as described in [47], anticipating the electric power demand of specific areas is problematic due to its variation with respect to factors such as weekdays, holidays, seasons, weather and temperature. For this reason, reliable methods to perform long-term predictions of the demand itself with an acceptable precision still do not exist, and a wrong prediction could lead to overheating of the electrical transformer, damaging it. Since the oil temperature can reflect the condition of the electrical transformer, its prediction could be used to implement an anomaly detection mechanism: by comparing the expected behaviour with the currently measured one, if their difference exceeds a certain threshold an alarm signal is sent, and appropriate actions can be taken if deemed necessary. Moreover, since the oil temperature is related to the actual power usage, an indirect estimation of the latter could be obtained, preventing overestimations and thus unnecessary waste of electric energy and equipment degradation.
Each floor of the building is divided into four (for floors 1-2) or five (for floors 3-7) zones, and each zone is subject to six different measurements:

• Air conditioning (AC) load;
• Lighting load;
• Plug load;
• Indoor temperature;
• Relative humidity;
• Ambient light.
are present, the majority of features have a data availability of at least 95%, with some exceptions at the middle floors. Being divided both by year and by floor, CUBEMS is composed of 14 sub-datasets; a summary of the overall structure is depicted in Fig. 4.5.
Figure 4.5: CUBEMS dataset file names (a), types of available measurements
(b) and classification of features contained in the dataset of floor 7 (c) (Original
images from [32]).
For the purposes of this thesis, the original CUBEMS data has undergone some preliminary processing steps. First of all, it has been decided to work at floor level, considering only data from floor 7 as the context of the predictions. The seventh floor in particular has been chosen for two main reasons: it is one of the floors with the largest number of sensors, leading to 29 corresponding features (Fig. 4.5c), while at the same time containing the least amount of missing values.
Secondly, a 15-minute downsampling of the data has been carried out: this has been done not only to adopt the same sampling frequency as ETTm1, but also because, in the considered forecasting problem, a 1-minute granularity has been deemed redundant and computationally inefficient (more input elements to process, without a real gain in meaningful information).
Finally, the forecasting target had to be defined; to this end, the total floor consumption has been computed and inserted in the dataset as the target feature. Its value at each time step is given by the sum of all the AC, light and plug electricity consumption in the floor, regardless of the zone; a plot of this constructed series is depicted in Fig. 4.6.
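The sketch below illustrates these two preprocessing steps with pandas; the file name, the timestamp column and the rule used to select the consumption columns are hypothetical placeholders, not the actual CUBEMS headers.

```python
# Sketch of the CUBEMS preprocessing: 15-minute downsampling and construction of the
# "Total Floor 7 Consumption" target. Names below are illustrative placeholders.
import pandas as pd

df = pd.read_csv("2019Floor7.csv", parse_dates=["Date"], index_col="Date")  # hypothetical file/column names

# Downsample the 1-minute measurements to a 15-minute granularity.
df_15min = df.resample("15min").mean()

# Sum all AC, light and plug consumption columns into the target feature.
consumption_cols = [c for c in df_15min.columns if "(kW)" in c]             # assumed naming convention
df_15min["total_floor_consumption"] = df_15min[consumption_cols].sum(axis=1)
```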
Figure 4.6: Plot of the 15-minute sampled "Total Floor 7 Consumption" feature inserted in the CUBEMS dataset (a) and zoomed windows of monthly (b), weekly (c) and daily (d) sizes.
as suggested by [32].
For all of these situations, CUBEMS data represents a valid starting point to train and test complex forecasting models. Overall, the high number of features and the strongly regular patterns of CUBEMS make it very different from ETTm1, despite both being related to an energy consumption context (addressed directly by the former, and indirectly by the latter). An architecture able to perform well on both would prove its ability to adapt to different situations, and thus its effectiveness on important TSF applications.
Chapter 5
The Models
The experiments carried out in this thesis work are mainly focused on the study and the application, in the TSF domain, of two different transformer-based architectures: a TransformerT2V model and an Informer [47] model. The first is a simple but effective adjustment of the vanilla Transformer to the time series problem, while the second is a complex architecture able to reach SOTA results. In order to compare their performance with that of non-transformer models, two architectures of this latter category have also been trained and evaluated on the proposed datasets: a CNN and an LSTM. The following sections provide a description of these architectures, with particular attention to the Informer model and its main characteristics.
Figure 5.1: LSTM (a) and CNN (b) architectures used as representatives of non-transformer models.
The choice of letting these models focus on a single time step at a given distance in the future, instead of on a whole target window, is meant to assign them an easier prediction task. The hyperparameters of the two models, along with their description, are listed in Tab. 5.1.
Model | Hyperparameter | Description
LSTM | units_dense_lstm | Number of units of the first Dense layer of the LSTM model
LSTM | units_lstm | Number of units of the LSTM layers
CNN | units_dense_conv | Number of units of the first Dense layer of the CNN model
CNN | filters_conv | Number of filters of the convolutional layers
CNN | conv_width | Filter size of the convolutional layers
Figure 5.2: TransformerT2V architecture (a) and internal structure of the encoder attention layers (b).
Figure 5.4: Time window split into the four components of the Informer input.
• The encoder and decoder time inputs, enclosing the temporal information of the series. These two tensors are built with the same procedure followed for the previously mentioned value ones, but in this case the feature space is substituted with a time encoding space of tunable granularity. For the 15-minute-scale encoding used in the Informer model, five time features are created, corresponding to month, day, weekday, hour and minute representations. In this way, each [1, w, F] tensor is mapped into one of shape [1, w, 5]. A visualization of the time encoding is provided in Fig. 5.5, and a sketch of its construction is given below.
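The following sketch shows how such a five-feature time encoding can be obtained from a range of timestamps with pandas; the actual Informer implementation may additionally rescale these integer features.

```python
# Sketch of the 15-minute-scale time encoding: each timestamp is mapped to five
# integer features (month, day, weekday, hour, minute).
import numpy as np
import pandas as pd

timestamps = pd.date_range("2019-01-01", periods=96, freq="15min")

time_features = np.stack([
    timestamps.month,     # 1-12
    timestamps.day,       # 1-31
    timestamps.weekday,   # 0-6
    timestamps.hour,      # 0-23
    timestamps.minute,    # 0, 15, 30, 45
], axis=1)

print(time_features.shape)   # (96, 5): one [w, 5] slice of the [1, w, 5] time tensor
```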
Once created, these four components are further processed by the encoder and decoder embedding layers, described in the following paragraph.

The embedding process is the same for both the encoder and the decoder sides. It is carried out by a dedicated block, depicted in Fig. 5.6, which maps the data into tensors of shape $[1, L_{in}, d_{model}]$, where $L_{in}$ is set to $L_s$ for the encoder and to $L_l$ for the decoder, while $d_{model}$ is the dimension of the internal data representation inside the attention layers.
Taking as input both the value and time tensor elements described in the previous paragraph, each embedding layer outputs the sum of three different components:

• a value embedding $X_{value}$, which projects the value input into the $d_{model}$-dimensional representation space;

• a positional embedding $X_{pos}$, also applied to the value input and represented by the classical sine/cosine encoding of the vanilla Transformer;

• a temporal embedding $X_{time}$, acting on the time input and represented by the sum of five different linear embeddings of dimension $d_{model}$, one for each time feature:
$$X_{time} = \sum_{k \in A} \mathrm{LinearEmbedding}(x_k) \tag{5.2}$$
The encoder layers of the Informer, depicted in Fig. 5.7, are structurally similar to the vanilla Transformer ones, being composed of an attention block followed by a feed-forward projection, with residual connections after each of them.
Figure 5.7: Structure of the Informer encoder blocks. With respect to the
original Transformer model, the standard attention mechanism is substituted
with the ProbSparse one.
The main difference with respect to the canonical model resides in the use of ProbSparse attention layers, which reduce the time and memory complexity of the attention computation from $O(L_k \cdot L_q)$ to $O(L_q \cdot \ln L_k)$ (where $L_q$ and $L_k$ are the numbers of queries and keys) without a loss in overall performance.

The idea behind this mechanism is that computing every query-key dot product is redundant, since the majority of meaningful information is carried by only a few elements [47]. For this reason, ProbSparse allows each key to attend only to the top-u dominant queries, with $u = c \cdot \ln(L_q)$ (where $c$ is a hyperparameter), ranked by means of a sparsity score function
$$M(q_i, K) = \max_{j}\left(\frac{q_i k_j^T}{\sqrt{d_{model}}}\right) - \frac{1}{L_k}\sum_{j=1}^{L_k}\frac{q_i k_j^T}{\sqrt{d_{model}}} \tag{5.4}$$
In other words, for each query the maximum and the mean value of its scaled dot products with all keys are computed, and the difference of these two components is taken. This ranking metric approximates how dissimilar the probability distribution of the query attention scores over the keys is from the uniform distribution: the underlying hypothesis is that queries which are dominant in the attention computation show a peak in their distribution (reflecting an "activation" when coupled with certain keys), while uninteresting ones are associated with a "flat" distribution (producing the same response regardless of their pairing). A detailed formalization of this concept, along with an explanation of how the score function $M(q_i, K)$ is constructed, is provided in Appendix A.
Returning to the ProbSparse attention computation, we can see that up to this point the complexity is still $O(L_q \cdot L_k)$, since for each query $q_i$ its dot product $q_i k_j^T$ with all the keys $k_j$ must be computed. It is here that a second simplification is made: instead of considering the full key matrix $K$, the authors propose to randomly sample $U = c \cdot \ln(L_k)$ keys in order to obtain a sparse matrix $\bar{K}$ in which the rows corresponding to non-sampled keys are padded with zeros and thus do not contribute to the score computation. The approximated score function $\bar{M}$, which is the one used in the Informer, becomes:
$$\bar{M}(q_i, \bar{K}) = \max_{k_j \in \bar{K}}\left(\frac{q_i k_j^T}{\sqrt{d_{model}}}\right) - \frac{1}{U}\sum_{k_j \in \bar{K}}\frac{q_i k_j^T}{\sqrt{d_{model}}} \tag{5.5}$$
With this method, only $L_q \cdot \ln L_k$ dot-product pairs are computed, resulting in a major efficiency gain with respect to the standard attention mechanism. A minimal sketch of the resulting score computation and top-u query selection is given below.
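The following NumPy sketch reproduces the approximated score of Eq. 5.5 and the resulting top-u query selection for a single attention head; it is a simplified, non-batched illustration rather than the actual Informer implementation.

```python
# Sketch of the approximated sparsity score M-bar (Eq. 5.5) and top-u query selection.
import numpy as np

def probsparse_topu(Q, K, c=5):
    L_q, d = Q.shape
    L_k = K.shape[0]
    u = int(c * np.log(L_q))                        # number of dominant queries
    U = int(c * np.log(L_k))                        # number of sampled keys

    idx = np.random.choice(L_k, U, replace=False)   # random key subset K-bar
    scores = Q @ K[idx].T / np.sqrt(d)              # [L_q, U] sampled scaled dot products

    M_bar = scores.max(axis=1) - scores.mean(axis=1)  # max minus mean per query (Eq. 5.5)
    top_u = np.argsort(M_bar)[-u:]                  # indices of the top-u queries
    return M_bar, top_u

Q, K = np.random.randn(96, 64), np.random.randn(96, 64)
M_bar, top_u = probsparse_topu(Q, K, c=5)
print(len(top_u))                                    # roughly c * ln(96) queries
```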
After each encoder attention block, except for the last one, a Conv1D & Pooling layer is in charge of distilling the attention output. This component, whose structure is depicted in Fig. 5.8, performs a 1D convolution (with kernel size 3) along the time dimension, followed by a normalization layer and an ELU activation function. At the end, a max pooling operation with stride 2 is applied, which halves the size of the data along the time dimension. This "distilling" operation, which is responsible for the funnel-shaped structure of the encoder, sharply reduces the overall space complexity and helps discard redundant information as the data traverses the encoder.
Figure 5.8: Internal architecture of the Conv1D & Pooling layers of the Informer.
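A minimal PyTorch sketch of such a distilling block is reported below; the normalization layer is a point where implementations may differ (batch normalization over the channel dimension is used here), but the convolution-activation-pooling structure is the one described above.

```python
# Sketch of a distilling block: Conv1d (kernel 3) along time, normalization, ELU,
# and max pooling with stride 2, which halves the time length.
import torch
import torch.nn as nn

class DistillingLayer(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm1d(d_model)       # normalization choice may differ between implementations
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)  # halves the time dimension

    def forward(self, x):            # x: [batch, time, d_model]
        x = x.transpose(1, 2)        # -> [batch, d_model, time] for Conv1d
        x = self.pool(self.act(self.norm(self.conv(x))))
        return x.transpose(1, 2)     # back to [batch, time / 2, d_model]

x = torch.randn(1, 96, 512)
print(DistillingLayer(512)(x).shape)  # torch.Size([1, 48, 512])
```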
Just like the encoder, the decoder layers of the Informer are similar to the original Transformer ones, except for the use of ProbSparse attention. Their structure is depicted in Fig. 5.9.

Each decoder layer is composed of three parts. The first sub-layer, connected to the embedded decoder input, performs a standard ProbSparse self-attention; in the first decoder layer only, this attention is masked, preventing the model from attending to future time steps. The second component is another ProbSparse block, computing a cross-attention between the decoder queries and the keys and values provided by the encoder output. Finally, a feed-forward layer projects the data outside the block. As usual, residual connections are applied after each sub-layer.
This chapter provides a description of the investigations carried out in this thesis work, along with their associated setup and preliminary steps. The experiments can be split into two main categories: the training and evaluation of the models on the proposed datasets, and the analysis of the ProbSparse attention mechanism of the Informer.
• Data normalization. All features have been scaled by means of a min-max normalization:

$$x' = \frac{x - \min x}{\max x - \min x} \tag{6.1}$$

With this operation, all features are mapped into the [0, 1] interval.
• Train/validation/test split. The data have been split into train, validation and test sets, following an 80%/10%/10% ratio, as depicted in Fig. 6.1.
Figure 6.1: Visualization of ETTm1 and CUBEMS datasets split into train,
validation and test data.
• Input and label creation. Once the lookback window and the forecasting target have been set, the train, validation and test series elements have been arranged to form the model inputs and the associated ground-truth labels (corresponding to the exact values to predict). A sketch of this windowing step is given below.
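The sketch below shows this windowing step in NumPy for a single univariate series; the lookback and foresight values are illustrative and mirror the single-step targets used for the CNN, LSTM and TransformerT2V models.

```python
# Sketch of sliding-window input/label creation: a window of past values forms each
# model input, and the value `foresight` steps after it forms the label.
import numpy as np

def make_windows(series, lookback=128, foresight=24):
    X, y = [], []
    for t in range(lookback, len(series) - foresight):
        X.append(series[t - lookback:t])      # past window of `lookback` steps
        y.append(series[t + foresight - 1])   # single target, `foresight` steps after the window
    return np.array(X), np.array(y)

series = np.sin(np.linspace(0, 50, 2000))
X, y = make_windows(series, lookback=128, foresight=24)
print(X.shape, y.shape)   # (1848, 128) (1848,)
```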
In order to obtain the best results, various hyperparameter choices have been tested for each model, resulting in the final configurations described in Tab. 6.1, 6.2 and 6.3.

As for the forecasting target, for the Informer model a window of 24 steps into the future has been considered, while for the other models two different targets, at 12 and 24 steps into the future, have been chosen in order to compare their predictions with the Informer ones.
Table 6.1: Final hyperparameter configuration chosen for the LSTM and CNN
models.
TransformerT2V
Hyperparameter | Value
seq_len | 128
foresight | 12, 24
d_model | 256
N_heads | 12
FF_dim | 256
N_dense | 64
Dropout | 0.1

Table 6.2: Final hyperparameter configuration chosen for the TransformerT2V model.
Informer
Hyperparameter | Value
seq_len | 96
label_len | 48
pred_len | 24
Factor | 5
d_model | 512
N_heads | 8
enc_layers | 3
dec_layers | 2
d_ff | 512
Dropout | 0.1
Table 6.3: Final hyperparameter configuration chosen for the Informer model.
For all the models, training has been carried out with a batch size of 32 and a maximum of 10 epochs. The Adam optimizer has been used, with a starting learning rate of $10^{-4}$.
The loss function chosen is the mean squared error (MSE):

$$MSE(y_{true}, y_{pred}) = \frac{1}{N}\sum_{i=1}^{N}(y_{true,i} - y_{pred,i})^2 \tag{6.2}$$

while the evaluation metric is the mean absolute error (MAE):

$$MAE(y_{true}, y_{pred}) = \frac{1}{N}\sum_{i=1}^{N}\left|y_{true,i} - y_{pred,i}\right| \tag{6.3}$$
As for the training runtime, a custom schedule has been adopted, with two callbacks: an early stopping callback and a learning-rate reduction on plateau, whose parameters are reported in the table below, followed by a sketch of the corresponding setup.
Training configuration
Batch size | 32
Epochs | 10
Optimizer | Adam
Starting learning rate | $10^{-4}$
Early stopping patience | 4 epochs
Learning rate plateau reduction patience | 2 epochs
Learning rate reduction factor | 0.1
Minimum learning rate | $10^{-10}$
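The sketch below reproduces this training configuration assuming a tf.keras implementation of the non-Informer models; the dummy data and the tiny LSTM model are placeholders used only to make the example self-contained.

```python
# Sketch of the training configuration above, assuming a tf.keras setup.
import numpy as np
import tensorflow as tf

# Dummy data and a tiny model, only to make the sketch runnable on its own.
X_train = np.random.randn(256, 128, 1).astype("float32")
y_train = np.random.randn(256, 1).astype("float32")
X_val = np.random.randn(32, 128, 1).astype("float32")
y_val = np.random.randn(32, 1).astype("float32")
model = tf.keras.Sequential([tf.keras.layers.LSTM(32, input_shape=(128, 1)),
                             tf.keras.layers.Dense(1)])

callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=4),                               # stop after 4 epochs without improvement
    tf.keras.callbacks.ReduceLROnPlateau(factor=0.1, patience=2, min_lr=1e-10), # reduce LR by 10x on plateau
]
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="mse",       # Eq. 6.2
              metrics=["mae"])  # Eq. 6.3
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          batch_size=32, epochs=10, callbacks=callbacks)
```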
The models on which the following experiments have been carried out are two Informer architectures, trained on the CUBEMS dataset. The first is a canonical, "Sampled" model: its probsparse factor hyperparameter, denoted c in compact notation, determines both the number of top-u queries ($u = c \cdot \ln(L_q)$) and the number of sampled keys ($S = c \cdot \ln(L_k)$) used to approximate the key set $K$ with a subset $\bar{K}$. The second model is instead a "Full" one: while the number of top-u queries is still determined by c, all the keys are considered and no sampling is made. Using these trained models as a tool, two questions have been asked.
Given a query set $Q$ and a key set $K$, the "full" score of each query $q_i \in Q$ is given by:

$$M(q_i, K) = \max_{k_j \in K}\left(\frac{q_i k_j^T}{\sqrt{d_{model}}}\right) - \frac{1}{L_K}\sum_{k_j \in K}\frac{q_i k_j^T}{\sqrt{d_{model}}} \tag{6.4}$$

while the approximated score, computed on a randomly sampled key subset $\bar{K}$, is:

$$\bar{M}(q_i, \bar{K}) = \max_{k_j \in \bar{K}}\left(\frac{q_i k_j^T}{\sqrt{d_{model}}}\right) - \frac{1}{U}\sum_{k_j \in \bar{K}}\frac{q_i k_j^T}{\sqrt{d_{model}}} \tag{6.5}$$
Since the subset $\bar{K}$ is random (due to the random key sampling), this procedure has been repeated N times, with N sufficiently large (in these experiments, N = 1000), and the mean RMSE value has been taken as the final result.
Furthermore, the two main components of the score function have been evaluated separately. Recalling Eq. 6.4, $M(q_i, K)$ can be decomposed as

$$M(q_i, K) = MAX(q_i, K) - MEAN(q_i, K)$$

with

$$MAX(q_i, K) = \max_{k_j \in K}\left(\frac{q_i k_j^T}{\sqrt{d_{model}}}\right) \tag{6.8}$$

and

$$MEAN(q_i, K) = \frac{1}{L_K}\sum_{k_j \in K}\frac{q_i k_j^T}{\sqrt{d_{model}}} \tag{6.9}$$
where $MAX(q_i, K)$ represents the peak of the query-key dot-product distribution, while $MEAN(q_i, K)$ its average value. Given their approximated counterparts $\overline{MAX}(q_i, \bar{K})$ and $\overline{MEAN}(q_i, \bar{K})$, the corresponding mean RMSE values have been computed for these two components as well.
Since the ProbSparse attention output is not directly influenced by the $M(q_i, K)$ scores themselves, but only by the resulting choice of top-u queries, it has also been decided to directly measure the distance between the two query rankings $R = [q_1^R, \ldots, q_{L_q}^R]$ and $\bar{R} = [q_1^{\bar{R}}, \ldots, q_{L_q}^{\bar{R}}]$ obtained from $M$ and $\bar{M}$. The rationale is that two different sets of scores could determine two identical query orderings, and consequently the same final result: therefore, regardless of the error between $M$ and $\bar{M}$, if $R$ and $\bar{R}$ are similar enough, the approximation obtained by considering only the key subset $\bar{K} \subset K$ can be deemed valid.

The relation between $R$ and $\bar{R}$ has been observed from two different points of view, each measured with a corresponding metric:
• Queries ordering. This case aims to measure to what extent the two rankings place the same queries at the same positions, considering both the full ranking and the top-u one only. The proposed metric is the normalized Hamming distance, i.e. the fraction of positions at which the two rankings differ, computed as follows:

$$H(R, \bar{R}) = \frac{\sum_{i=1}^{N} f(R[i], \bar{R}[i])}{N} \tag{6.10}$$
with

$$f(R[i], \bar{R}[i]) = \begin{cases} 1, & R[i] \neq \bar{R}[i] \\ 0, & R[i] = \bar{R}[i] \end{cases} \tag{6.11}$$

This metric can also be applied to the top-u-only evaluation, since it does not require the two top-u subsets to share the same queries.
• Queries set membership. This case aims to measure how much the two top-u query sets share the same elements, regardless of their ordering. The proposed metric is the Jaccard distance:

$$J(R_{top\text{-}u}, \bar{R}_{top\text{-}u}) = 1 - \frac{|R_{top\text{-}u} \cap \bar{R}_{top\text{-}u}|}{|R_{top\text{-}u} \cup \bar{R}_{top\text{-}u}|} \tag{6.13}$$

Both metrics are illustrated in the short sketch after this list.
Chapter 7
Results
This chapter provides the results obtained by the CNN, LSTM, TransformerT2V and Informer models on the ETTm1 and CUBEMS datasets, and the outcome of the studies on the ProbSparse mechanism of the Informer.
From the metric values it can be observed that all models perform very well on the ETTm1 dataset, with the Informer architecture performing best, while the TransformerT2V has performance comparable to the CNN and LSTM ones. This could suggest that, for datasets with few features, the vanilla attention mechanism plus the introduction of a time encoding does not provide significant advantages over standard methods such as convolutions and recurrence; another hypothesis is that discarding the decoder component of the Transformer could have hindered the advantages provided by the vanilla architecture.
The situation is different for the Informer model, which outperforms the other architectures by a significant margin and obtains results similar to the ones achieved by the model authors on the same dataset [47].

A visualization of each model's forecast on the ETTm1 test set is depicted in Fig. 7.1. Overall, all the predictions manage to follow the series trend, with some oscillations especially in the TransformerT2V case. The Informer forecast is instead very precise and seems to capture very well the local maxima and minima of the series.
Figure 7.1: ETTm1 test set predictions for the LSTM, CNN, TransformerT2V
and Informer architectures.
Figure 7.2: CUBEMS test set predictions for the LSTM, CNN, TransformerT2V and Informer architectures.
The Informer is still the best performing architecture, with low MSE and MAE scores. Still, it struggles to correctly predict time steps related to holidays, especially in the case of predictions far in the future. An example of this is depicted in Fig. 7.3.
Table 7.3: MSE and MAE scores for predicted data at timesteps t+24, depending on whether the related feature column is used or not. The metrics are also computed separately on timesteps corresponding to working days and to weekends/holidays.
From the table, it can be seen that while the global and working-day metrics stay more or less the same, a small improvement is obtained on the weekend/holiday error, suggesting a beneficial effect of this feature on the overall training.
The mean RMSE values between the exact query scores $M(q_i, K)$ and the approximated ones $\bar{M}(q_i, \bar{K})$, computed over 1000 iterations and as a function of the probsparse factor c, are depicted in Fig. 7.4 for the "Full" model, and in Fig. 7.5 for the "Sampled" one; the same figures also provide the results of the investigation focused on the "max" and "mean" components of the ranking function.
Figure 7.4: RMSE values related to the ranking function investigation on the
”Full” model.
Figure 7.5: RMSE values related to the ranking function investigation on the
”Sampled” model.
From the RMSE values and their associated bar charts, some considerations can be made. First of all, while the error starts higher for low values of c in the "Sampled" model, in both cases it tends to reach the same plateau for high values of c, with a similar descending curve; as expected, the two differently trained models show the same behaviour for high values of the hyperparameter, but even for lower values their difference is not very marked.
Looking at the MAX and MEAN components, it is possible to see that in both cases the RMSE of the latter is relatively low and almost constant regardless of c, while the former starts high and decreases progressively: this suggests that even by sampling a small number of keys the mean value of the distribution is approximated well, while its maximum is not. The fact that c only influences the approximation of the MAX component suggests a possible modification of the original $M(q_i, K)$ score function to give this component a larger weight in the final result, for example by discarding the mean component from the computation.
Still, looking only at the error in the score values is not enough to draw strong conclusions, since different query scores do not necessarily lead to different rankings.
The following tables (Tab. 7.4, Tab. 7.5) show the normalized Hamming distances between the query rankings built from the exact scores $M$ and the approximated ones $\bar{M}$, considering both the full set of $L_q$ queries and the top-u ones only. The associated bar charts are depicted in Fig. 7.6.
"Full" model
Factor | Hamming distance, full ranking | Hamming distance, top-u ranking
1 | 0.92 | 0.84
2 | 0.87 | 0.90
3 | 0.83 | 0.71
4 | 0.78 | 0.69
5 | 0.74 | 0.56
6 | 0.26 | 0.72

Table 7.4: Normalized Hamming distance between query rankings in the "Full" model.

"Sampled" model
Factor | Hamming distance, full ranking | Hamming distance, top-u ranking
1 | 0.90 | 0.84
2 | 0.76 | 0.64
3 | 0.62 | 0.66
4 | 0.51 | 0.59
5 | 0.46 | 0.37
6 | 0.44 | 0.44

Table 7.5: Normalized Hamming distance between query rankings in the "Sampled" model.
Figure 7.6: Bar charts of the Hamming distance value as a function of c for
the ”Full” (a) and the ”Sampled” (b) models.
Unlike in the previous experiment, here the "Full" and the "Sampled" models present different behaviours: the first shows an overall high error in using sampled keys to approximate the query ranking, even for high values of c, while the second performs much better in this sense. In fact, for the "Sampled" model the ranking approximation error decreases almost linearly with increasing values of c, while for the "Full" one it does not decrease significantly; this suggests that pruning the key-distribution information only after training is not as effective as employing that strategy during it.
Still, for both models the approximation error is relatively high, with even the "Sampled" model's best configuration staying over a 0.35 distance score. This does not necessarily lead to errors in the final attention output, since the ProbSparse mechanism treats the chosen queries as an unordered set; the following experiment focuses on this aspect.
The Jaccard distances between the exact and approximated top-u query sets for the two studied models are depicted in Fig. 7.7.
Figure 7.7: Jaccard distance between exact and approximated top-u query sets for both "Full" and "Sampled" Informer models.
It can be seen that for appropriate values of c the error drops considerably; recalling Table 7.4, from certain values of c onwards the choice of queries to involve in the attention computation is similar in both the exact and the approximated computations, even if the corresponding Hamming distance is high. This holds for both models, but it is particularly true for the "Sampled" one, since its initial Jaccard distance is around 0.5 for the minimum value of c (and thus for a small number of sampled keys). Again, this shows the importance of applying the sampling mechanism during the model training.
As previously underlined, with this setup the Jaccard distance is bound to reach zero for the maximum value of c, since in this limit case all queries are considered to be in top positions; this is a consequence of the fact that the probsparse factor c governs both the query and the key extraction. The effects of decoupling c into two sub-hyperparameters $c_q$ and $c_k$, of which the first is responsible for the top-u queries and the second for the sampled keys, are described by the last experiment's results, reported below: they show the Jaccard distance matrix between top-u sets, containing the metric scores associated with all possible $c_q$ and $c_k$ configurations, for three different Informer models trained with c = 1 (Fig. 7.8), 3 (Fig. 7.9) and 5 (Fig. 7.10) respectively.
Figure 7.8: Jaccard distance matrix associated with all possible $c_q$ and $c_k$ configurations, along with the relative heatmap and row bar charts, for an Informer model trained with $c = c_q = c_k = 1$.
Figure 7.9: Jaccard distance matrix associated with all possible $c_q$ and $c_k$ configurations, along with the relative heatmap and row bar charts, for an Informer model trained with $c = c_q = c_k = 3$.
Figure 7.10: Jaccard distance matrix associated with all possible $c_q$ and $c_k$ configurations, along with the relative heatmap and row bar charts, for an Informer model trained with $c = c_q = c_k = 5$.
The results show that, regardless of the choice of c for the training, a common pattern can be observed: depending on the query factor $c_q$, from a certain key sampling parameter $c_k$ onwards a plateau is reached, i.e. the Jaccard distance does not decrease significantly when increasing $c_k$. This is a noteworthy observation, since for certain configurations it is possible to decrease the number of sampled keys, and thus the overall computational burden, with negligible performance degradation.
Chapter 8
Conclusions
classical methods. As expected, of all the models the Informer is the best performing one, outclassing the prediction accuracy of the other considered architectures by a significant margin: this shows the potential benefits of adopting SOTA transformer-based models for TSF-related applications.
The second group of experiments aimed instead at studying the mechanisms behind the distinctive characteristic of the Informer, the ProbSparse attention, and the role of the probsparse hyperparameter c, responsible for both the number of sampled keys and the number of queries involved in the attention computation. The obtained results showed that variations in the choice of c only affect one component of the query score function, suggesting a rework of the latter so that it can be more easily controlled by the hyperparameter. Furthermore, it has been shown that decoupling c into two distinct components could prove beneficial, diminishing the computational burden without a loss in performance. Finally, it has been shown that, from certain values of c onwards, the accuracy of the ProbSparse internal representations reaches a plateau: this could be exploited by fixing a threshold $T_h$ on the approximation error, and tuning the value of c so as to have the smallest number of sampled keys while staying under $T_h$.
Appendix A
Foundations of the ProbSparse Attention mechanism

Recalling the canonical attention formulation

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V \tag{A.1}$$
we can see that it represents a weighted sum of the input values, in which the weights are computed by applying a softmax function to the scaled dot products between pairs of queries and keys. From this consideration, the ProbSparse mechanism of the Informer lays its foundations on the hypothesis that the aforementioned softmax scores follow a long-tail distribution, depicted in Fig. A.1: only a few dot-product pairs contribute to the majority of the attention computation, while most of the others can be ignored without a significant variation of the final result.
For the i-th query, the attention output can be rewritten as

$$\mathrm{Attention}(q_i, K, V) = \sum_{J} \frac{\mathrm{Ker}(q_i, k_J)}{\sum_{l \in L_k} \mathrm{Ker}(q_i, k_l)} V_J \tag{A.2}$$

with the kernel

$$\mathrm{Ker}(q_i, k_J) = e^{\frac{q_i k_J^T}{\sqrt{d}}} \tag{A.3}$$

The attention probability of query $q_i$ over the keys can thus be defined as

$$p(q_i, k_J) = \frac{\mathrm{Ker}(q_i, k_J)}{\sum_{l \in L_k} \mathrm{Ker}(q_i, k_l)} \tag{A.4}$$

which is compared against the uniform distribution

$$q(q_i, k_J) = \frac{1}{L_K} \tag{A.5}$$
In order to measure the similarity between $p(q_i, K)$ and $q(q_i, K)$, a candidate metric is the Kullback–Leibler divergence:

$$KL(q \,\|\, p) = \ln \sum_{l=1}^{L_K} e^{\frac{q_i k_l^T}{\sqrt{d}}} - \frac{1}{L_K}\sum_{J=1}^{L_K}\frac{q_i k_J^T}{\sqrt{d}} - \ln(L_K) \tag{A.6}$$
where, for a given query $q_i$, the first term is the log-sum-exp (LSE) of the scaled dot products computed over all keys, the second is their arithmetic mean, and the third is a constant that can be discarded from the final result. If the value of $KL(q \,\|\, p)$ is high, the query is "active" and has a high chance of producing relevant dot-product values in the attention computation.
The use of the KL divergence as the similarity metric is however computationally expensive, and for this reason a simpler, more efficient query score function $M(q_i, K)$ can be introduced:

$$M(q_i, K) = \max_{J}\left(\frac{q_i k_J^T}{\sqrt{d}}\right) - \frac{1}{L_K}\sum_{J=1}^{L_K}\frac{q_i k_J^T}{\sqrt{d}} \tag{A.7}$$
Bibliography

[4] J. Brownlee. How to Decompose Time Series Data into Trend and Seasonality. URL: https://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/.

[7] J. Connor, R. Martin, and L. Atlas. Recurrent neural networks and robust time series prediction. IEEE Transactions on Neural Networks, 5(2):240-254, 1994. DOI: 10.1109/72.279188.

[12] G. Bao, Y. Wei, X. Sun, and H. Zhang. Double attention recurrent convolution neural network for answer selection. Royal Society Open Science, 7, May 2020. URL: https://doi.org/10.1098/rsos.191517.

[22] J. Lee, Y. Lee, J. Kim, A. R. Kosiorek, S. Choi, and Y. W. Teh. Set Transformer: a framework for attention-based permutation-invariant neural networks, 2019. arXiv: 1810.00825 [cs.LG].

[24] S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y.-X. Wang, and X. Yan. Enhancing the locality and breaking the memory bottleneck of Transformer on time series forecasting, 2020. arXiv: 1907.00235 [cs.LG].

[33] J. Qiu, H. Ma, O. Levy, S. W.-t. Yih, S. Wang, and J. Tang. Blockwise self-attention for long document understanding, 2020. arXiv: 1911.02972 [cs.CL].

[38] Y. Tay, D. Bahri, L. Yang, D. Metzler, and D.-C. Juan. Sparse Sinkhorn attention, 2020. arXiv: 2002.11296 [cs.LG].