
Long-Range Transformers for Dynamic Spatiotemporal Forecasting

Jake Grigsby (University of Virginia†)    Zhe Wang (University of Virginia)
Nam Nguyen (IBM Research)    Yanjun Qi (University of Virginia)

† Work done in part while an intern at IBM Research. Now at UT Austin, email: [email protected].

arXiv:2109.12218v3 [cs.LG] 18 Mar 2023

ABSTRACT

Multivariate time series forecasting focuses on predicting future values based on historical context. State-of-the-art sequence-to-sequence models rely on neural attention between timesteps, which allows for temporal learning but fails to consider distinct spatial relationships between variables. In contrast, methods based on graph neural networks explicitly model variable relationships. However, these methods often rely on predefined graphs that cannot change over time and perform separate spatial and temporal updates without establishing direct connections between each variable at every timestep. Our work addresses these problems by translating multivariate forecasting into a "spatiotemporal sequence" formulation where each Transformer input token represents the value of a single variable at a given time. Long-Range Transformers can then learn interactions between space, time, and value information jointly along this extended sequence. Our method, which we call Spacetimeformer, achieves competitive results on benchmarks from traffic forecasting to electricity demand and weather prediction while learning spatiotemporal relationships purely from data.

Figure 1: Attention in multivariate forecasting. (a) A three-variable sequence with three context points and two target points to predict. (b) Temporal attention in which each token contains all three variables. Darker blue lines between grey tokens represent increasing attention. (c) Temporal attention with spatial interactions modeled within each token's timestep by known spatial graphs (shown in black). (d) Spatiotemporal attention in which each variable at each timestep is a separate token. In practice this graph is densely connected, but edges have been cut for readability. All figures in this paper are best viewed in color.

1 INTRODUCTION

Multivariate forecasting models attempt to predict future outcomes based on historical context; jointly modeling a set of variables allows us to interpret dependency relationships that provide early warning signs of changes in future behavior. A simple example is shown in Figure 1a. Time Series Forecasting (TSF) models typically deal with a small number of variables with long-term temporal dependencies that require historical recall and distant forecasting. This is commonly handled by encoder-decoder sequence-to-sequence (seq2seq) architectures based on recurrent networks or one-dimensional convolutions. Current state-of-the-art TSF models substitute classic seq2seq architectures for neural-attention-based mechanisms. However, these models represent the value of multiple variables per timestep as a single input token. This lets them learn "temporal attention" amongst timesteps but can ignore the distinct spatial relationships that exist between variables. A temporal attention network is shown in Figure 1b.

In contrast, spatial-temporal methods aim to capture the relationships between multiple variables. These models typically involve alternating applications of temporal sequence processing and spatial message passing based on Graph Neural Network (GNN) components that rely on ground-truth variable relationships that are provided in advance or determined by heuristics. However,
hardcoded graphs can be difficult to define in domains that do not have clear physical relationships between variables. Even when they do exist, predefined graphs can create fixed (static) spatial structure. Variable relationships may change over time and can be more accurately modeled by a context-dependent (dynamic) graph. A temporal attention network with spatial graph layers is depicted in Figure 1c.

This paper proposes a general-purpose multivariate forecaster with the long-term prediction ability of a time series model and the dynamic spatial modeling of a GNN without relying on a hardcoded graph. Let N be the number of variables we are predicting and L be the sequence length in timesteps. We flatten multivariate inputs of shape (L, N) into long sequences of shape (L × N, 1) where each input token isolates the value of a single variable at a given timestep. The resulting input allows Transformer architectures to learn attention networks across both space and time jointly, creating the "spatiotemporal attention" mechanism shown in Figure 1d. Spatiotemporal attention learns dynamic variable relationships purely from data, while an encoder-decoder Transformer architecture enables accurate long-term predictions. Our method avoids TSF/GNN domain knowledge by representing the multivariate forecasting problem in a raw format that relies on end-to-end learning; the cost of that simplicity is the engineering challenge of training Transformers at long sequence lengths. We explore a variety of strategies, including fast attention, hybrid convolutional architectures, local-global and shifted-window attention, and a custom spatiotemporal embedding scheme. Our implementation is efficient enough to run on standard resources and scales well with high-memory GPUs. Extensive experiments demonstrate the benefits of Transformers with spatiotemporal attention in benchmarks from traffic forecasting to electricity production, temperature prediction, and metro ridership. We show that a single approach can achieve highly competitive results against specialized baselines from both the time series forecasting and spatial-temporal GNN literature. We open-source a large codebase that includes our model, datasets, and the groundwork for future research directions.

2 BACKGROUND AND RELATED WORK

2.1 Long-Range Transformers

For brevity, we assume the reader is familiar with the Transformer architecture [65]. A brief overview of the self-attention mechanism can be found in Appendix A.1. This paper will focus on the interpretation of Transformers as a learnable message passing graph amongst a set of inputs [28, 42, 86]. Let A ∈ R^{L_q × L_k} be the attention scores of a query sequence of length L_q and a key sequence of length L_k. A acts much like the adjacency matrix of a graph between the tokens of the two sequences, where A[i, j] denotes the strength of the connection between the i-th token of the query sequence and the j-th token in the key sequence.

Because attention involves matching each query to the entire key sequence, its runtime and memory use grow quadratically with the length of its input. As a result, the research community has raced to develop and evaluate Transformer variants for longer sequences [63]. Many of these methods introduce heuristics to sparsify the attention matrix. For example, we can attend primarily to adjacent input tokens [35], select global tokens [22], increasingly distant tokens [83, 8], or a combination thereof [86, 89]. While these methods are effective, their inductive biases about the structure of the trained attention matrix are not always compatible with tasks outside of NLP. Another approach looks to approximate attention in sub-quadratic time while retaining its flexibility [66, 79, 95, 54]. Particularly relevant to this work is the Performer [9]. Performer approximates attention in linear space and time with a kernel of random orthogonal features and enables the long-sequence approach that is central to our work. For a thorough survey of efficient attention mechanisms, see [62].

2.2 Time Series Forecasting and Transformers

Deep learning approaches to TSF are generally based on a seq2seq framework in which a context window of the c most recent timesteps is mapped to a target window of predictions for a time horizon of h steps into the future. Let x_t be a vector of timestamp values (the day, month, year, etc.) at time t and y_t be a vector of variable values at time t. Given a context sequence of time inputs (x_{T-c}, ..., x_T) and variables (y_{T-c}, ..., y_T) up to time T, we output another sequence of variable values (ŷ_{T+1}, ..., ŷ_{T+h}) corresponding to our predictions at future timesteps (x_{T+1}, ..., x_{T+h}).

The most common class of deep TSF models is based on a combination of Recurrent Neural Networks (RNNs) and one-dimensional convolutions (Conv1Ds) [3, 60, 57, 32]. More related to the proposed method is a recent line of work on attention mechanisms that aim to overcome RNNs' autoregressive training and difficulty in interpreting long-term patterns [27, 35, 88, 72, 50]. Notable among these is the Informer [92] - a general encoder-decoder Transformer architecture for TSF. Informer takes the sequence timestamps (x_{T-c}, ..., x_{T+h}) and embeds them in a higher dimension d. The variable values - with zeros replacing the unknown target sequence (y_{T-c}, ..., y_T, 0_{T+1}, ..., 0_{T+h}) - are mapped to the same dimension d. The time (x) and variable (y) components sum to create an input sequence of c + h tokens, written in matrix form as Z ∈ R^{(c+h)×d}. The encoder processes the sub-sequence Z[0 ... c] while the decoder observes the target sequence Z[c+1 ... c+h]. The outputs of the decoder are treated as predictions and error is minimized by regression to the true sequence values. Note that Informer outputs the entire prediction sequence directly in one forward pass, as opposed to decoder-only generative models that output each token iteratively (e.g., large language models). This has the advantage of reducing compute by reusing the encoder representation in each decoder layer and minimizing error accumulation in autoregressive predictions.
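To make the token construction concrete, the following sketch (ours, not the released Informer or Spacetimeformer code; all names are illustrative) builds the (c + h)-token temporal sequence by summing a linear projection of the time features with a linear projection of the variable values, using zeros as placeholders for the unknown target window:

```python
import torch
import torch.nn as nn

class TemporalEmbedding(nn.Module):
    """Builds the (c + h)-token 'temporal' input: one token per timestep, summing a
    projection of the time features x_t with a projection of the multivariate values
    y_t (zeros for the unknown target window)."""
    def __init__(self, n_time_feats: int, n_vars: int, d_model: int):
        super().__init__()
        self.time_proj = nn.Linear(n_time_feats, d_model)
        self.value_proj = nn.Linear(n_vars, d_model)

    def forward(self, x_context, y_context, x_target):
        # x_context: (batch, c, n_time_feats), y_context: (batch, c, n_vars)
        # x_target: (batch, h, n_time_feats)
        y_target = torch.zeros(x_target.shape[0], x_target.shape[1],
                               y_context.shape[-1], device=y_context.device)
        x = torch.cat([x_context, x_target], dim=1)    # (batch, c + h, n_time_feats)
        y = torch.cat([y_context, y_target], dim=1)    # (batch, c + h, n_vars)
        return self.time_proj(x) + self.value_proj(y)  # Z: (batch, c + h, d_model)

# usage sketch: Z[:, :c] feeds the encoder, Z[:, c:] feeds the decoder
```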
Following Informer, a rapidly expanding line of work has looked to improve upon the benchmark results of Transformers in TSF. These methods adjust the model architecture [45, 43] and re-introduce classic time series domain knowledge such as series decomposition, auto-correlation, or computation in frequency space [93, 70]. These time series biases are often used as a domain-specific form of efficient attention [71, 43, 15], as many long-term TSF tasks naturally extend beyond the sequence limits of default Transformers. It is important to note that time series Transformers use one input token per timestep, so that the embedding of the token at time t represents N distinct variables at that moment in time. This is in contrast with domains like natural language processing in which
each token represents just one unified idea (e.g., a single word) [59]. The message-passing graph that results from attention over a multivariate sequence learns patterns across time while keeping variables grouped together (Fig. 1b). This setup forces the variables within each token to receive the same amount of information from other timesteps, despite the fact that variables may have distinct patterns or relationships to each other. Ideally, we would have a way to model these kinds of variable relationships.

2.3 Spatial-Temporal Forecasting and GNNs

Multivariate TSF involves two dimensions of complexity: the forecast sequence's duration, L, and the number of variables N considered at each timestep. As N grows, it becomes increasingly important to model the relationships between each variable. The multivariate forecasting problem can be reformulated as the prediction of a sequence of graphs, where at each timestep t we have a graph G_t consisting of N variable nodes connected by weighted edges. Given a context sequence of graphs (G_{T-c}, ..., G_T), we must predict the node values of the target sequence (G_{T+1}, ..., G_{T+h}).

Graph Neural Networks (GNNs) are a category of deep learning techniques that aim to explicitly model variable relationships as interactions along a network of nodes [74]. Earlier work uses graph convolutional layers [48] to pass information amongst the variables of each timestep while using standard sequence learning architectures (e.g., RNNs [36], Conv1Ds [85], dilated Conv1Ds [76]) to adapt those representations across time. More recent work has extended this formula by replacing the spatial and/or temporal learning mechanisms with attention modules [6, 80, 77]. Temporal attention with intra-token spatial graph learning is depicted in Figure 1c. Existing GNN-based methods have a combination of three common shortcomings:

(1) Their spatial modules require predefined graphs denoting variable relationships. This can make it more difficult to solve abstract forecasting problems where relationships are unknown and must be discovered from data.
(2) They perform separate spatial and temporal updates in alternating layers. This creates information bottlenecks that restrict spatiotemporal message passing.
(3) Their spatial graphs remain static across timesteps. Variable relationships can often change over time and should be modeled by a dynamic graph.

Appendix A.2 provides a detailed overview of related work in the crowded field of spatial-temporal GNNs, and includes a categorization of existing methods according to these three key differences. Our goal is to develop a seq2seq time series model with true spatiotemporal message passing on a dynamic graph that can be competitive with GNNs despite not using predefined variable relationships.

3 SPATIOTEMPORAL TRANSFORMERS

3.1 Spatiotemporal Sequences

We begin by building upon the Informer-style encoder-decoder Transformer framework. As discussed in Sec. 2.2, Informer generates d-dimensional embeddings of the sequence ((x_{T-c}, y_{T-c}), ..., (x_T, y_T), (x_{T+1}, 0_{T+1}), ..., (x_{T+h}, 0_{T+h})), with the result expressed in matrix form as Z ∈ R^{(c+h)×d}. We propose to modify the token embedding input sequence by flattening each multivariate y_t vector into N scalars with a copy of its timestamp x_t, leading to a new sequence: ((x_{T-c}, y^0_{T-c}), ..., (x_{T-c}, y^N_{T-c}), ..., (x_T, y^0_T), ..., (x_T, y^N_T), (x_{T+1}, 0^0_{T+1}), ..., (x_{T+1}, 0^N_{T+1}), ..., (x_{T+h}, 0^N_{T+h})). Embedding this longer sequence results in a Z′ ∈ R^{N(c+h)×d}. When we pass Z′ through a Transformer, the attention matrix A ∈ R^{N(c+h)×N(c+h)} then represents a spatiotemporal graph with a direct path between every variable at every timestep (see Appendix A.2 Fig. 5). In addition, we are now learning spatial relationships that do not rely on predefined variable graphs and can change dynamically according to the time and variable values of the input data. This concept is depicted in Figure 1d. There are two important questions left to answer:

(1) How do we embed this sequence so that the attention network parameters can accurately interpret the information in each token?
(2) Are we able to multiply the sequence length by a factor of N and still scale to real-world datasets?
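The sketch below illustrates one way to produce this flattened layout (following the variable-grouped ordering of Fig. 2c); it is an illustration of the sequence format only, not the released implementation:

```python
import torch

def flatten_spatiotemporal(y, x):
    """Flatten a multivariate window into a spatiotemporal sequence as in Fig. 2c:
    one token per variable per timestep, grouped so each variable's sub-sequence
    is contiguous. y: (batch, L, N) variable values; x: (batch, L, F) time features."""
    batch, L, N = y.shape
    value = y.transpose(1, 2).reshape(batch, N * L, 1)   # one scalar value per token
    time = x.repeat(1, N, 1)                             # each token keeps a copy of x_t
    variable = torch.arange(N).repeat_interleave(L).expand(batch, -1)  # 0 0 ... 1 1 ...
    position = torch.arange(L).repeat(N).expand(batch, -1)             # 0 1 ... 0 1 ...
    return value, time, variable, position
```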
Figure 2: Embedding Multivariate Data. (a) The example context sequence from Fig. 1a. (b) Standard "Temporal" embedding input sequence where each column will become a token. (c) Flattened spatiotemporal input sequence for Fig. 1d with position and variable indices.

3.2 Spatiotemporal Embeddings

Representing Time and Value. The input embedding module of a TSF Transformer determines the way the (x, y) sequence concepts in Sec 3.1 are implemented in practice. We create input sequences consisting of the values of our time series variables (y) and time information (x). Although many seq2seq TSF models discard explicit time information, we use Time2Vec layers [29] to learn seasonal characteristics. Time2Vec maps x_t to sinusoidal patterns of learned offsets and wavelengths. This helps represent periodic relationships that extend past the limited length of the context sequence. The concatenated variable values and time embeddings are then projected to the input dimension of the Transformer model with a feed-forward layer. We refer to the resulting output as the "value+time embedding."
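A minimal sketch of a Time2Vec-style layer is shown below, assuming the common formulation of one linear component plus k sinusoidal components with learned frequencies and offsets; the exact layer in our code may differ in details:

```python
import torch
import torch.nn as nn

class Time2Vec(nn.Module):
    """Time2Vec-style embedding [29]: a linear term plus k sinusoids whose
    frequencies and phase offsets are learned end-to-end."""
    def __init__(self, in_feats: int, k: int):
        super().__init__()
        self.linear = nn.Linear(in_feats, 1)
        self.periodic = nn.Linear(in_feats, k)

    def forward(self, x):
        # x: (batch, length, in_feats) raw time features (minute, hour, day, ...)
        return torch.cat([self.linear(x), torch.sin(self.periodic(x))], dim=-1)
```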
Representing Position. Transformers are permutation invariant, meaning they cannot interpret the order of input tokens by default. This is fixed by adding a position embedding to the tokens; we opt for the fully learnable position embedding variant where we initialize d-dimensional embedding vectors for each timestep up to the maximum sequence length. An example of the way input time series data is organized for standard TSF Transformers is shown in Figure 2a and 2b. Existing models use a variety of implementation details, but the end result is a d-dimensional token that represents the values of all N variables simultaneously.
Figure 3: The Spacetimeformer architecture for joint spatiotemporal learning applied to the sequence shown in Figure 1a. This architecture creates a practical implementation of the spatiotemporal attention concept in Figure 1d.

Representing Space. We create spatiotemporal sequences by flattening the N variables into separate tokens (Fig. 2c). Each token is given a copy of the time information x for its value+time embedding and assigned the position embedding index corresponding to its original timestep order, so that each position now appears N times. We differentiate between the variables at each timestep with an additional "variable embedding." We initialize N d-dimensional embedding vectors for each variable index, much like the position embedding. Note that this means the variable representations are randomly initialized and learned end-to-end during training. The input pattern of position and variable indices is represented by Figure 2c. The use of two learnable embedding sets (space and time) creates interesting parallels between our time series model and Transformers in computer vision. Further discussion and small-scale experiments on image data using our model are in Appendix A.3.

Representing Missing Data. Many real-world applications involve missing data. Existing work often ignores timesteps that have any missing variable values (wasting valuable data) or replaces them with an arbitrary scalar that can confuse the model by being unpredictable or ambiguous. For example, popular traffic benchmarks (Sec. 4.3) replace missing values with zeros, so that it is unclear whether traffic was low or was not recorded. Embedding each variable in its own separate token gives us the flexibility to leave values missing in the data pipeline, replace them with zeros in the forward pass, and then tell the model when values were originally missing with a binary "given embedding." The "value+time", variable, position, and "given" embeddings sum to create the final spatiotemporal input sequence.
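A hedged sketch of how these four components can be combined is shown below; the class and argument names are ours and not the official API:

```python
import torch
import torch.nn as nn

class SpacetimeEmbedding(nn.Module):
    """Illustrative sum of the four embedding components described above:
    value+time projection, learned variable embedding, learned position embedding,
    and a binary 'given' embedding marking originally-missing values."""
    def __init__(self, d_value_time: int, d_model: int, n_vars: int, max_len: int):
        super().__init__()
        self.value_time_proj = nn.Linear(d_value_time, d_model)
        self.variable_emb = nn.Embedding(n_vars, d_model)    # one vector per variable
        self.position_emb = nn.Embedding(max_len, d_model)   # one vector per timestep
        self.given_emb = nn.Embedding(2, d_model)            # 1 = observed, 0 = missing

    def forward(self, value_time, variable_idx, position_idx, given_mask):
        return (self.value_time_proj(value_time)
                + self.variable_emb(variable_idx)
                + self.position_emb(position_idx)
                + self.given_emb(given_mask.long()))
```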
3.3 Scaling to Long Sequence Lengths

Our embedding scheme in Sec 3.2 converts a sequence-to-sequence problem of length L into a new format of length LN. Standard Transformer layers have a maximum sequence length of less than 1,000; the spatiotemporal sequence lengths considered in our experimental results exceed 26,000. Clearly, additional optimization is needed to make these problems feasible.

Scaling with Fast-Attention. However, we are fortunate that there is a rapidly-developing field of research in long-sequence Transformers (Sec. 2.1). The particular choice of attention mechanism is quite flexible, although most results in this paper use Performer FAVOR+ attention [9] - a linear approximation of attention via a random kernel method. The direct (non-iterative) output format of our model does not require causal masking, which can be a key advantage when dealing with approximate attention mechanisms that have difficulty masking an attention matrix that they never explicitly compute. However, there are some domains (Appendix B.6) with variable-length sequences and padding that make masking an important consideration.
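For illustration, the snippet below shows the generic linear-attention computation that this family of kernel methods relies on, replacing softmax(QKᵀ)V with φ(Q)(φ(K)ᵀV). It uses a simple positive feature map rather than Performer's random orthogonal features, so it conveys the memory savings but is not the exact FAVOR+ approximation:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    """Kernelized attention in O(L) memory. q, k, v: (batch, length, heads, dim).
    phi is a simple positive feature map (elu + 1); FAVOR+ instead draws random
    orthogonal features for an unbiased softmax approximation."""
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0
    kv = torch.einsum("blhd,blhe->bhde", k, v)              # sum over length first
    z = 1.0 / (torch.einsum("blhd,bhd->blh", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("blhd,bhde,blh->blhe", q, kv, z)    # (batch, length, heads, dim_v)
```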
Scaling with Initial and Intermediate Conv1Ds. In low-resource settings, we can also look for ways to learn shorter representations of the input with strided convolutions. "Initial" convolutions are applied to the value+time embedding of the encoder. "Intermediate" convolutions occur between attention layers and are a key component of the Informer architecture. However, our flattened spatiotemporal sequences lay out variables in an arbitrary order (Fig. 2c). We rearrange the input so that each variable can be passed through a convolutional block independently and then recombined into the longer spatiotemporal sequence.
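A sketch of this fold-convolve-unfold pattern is shown below, using einops-style rearranges [55]; the exact block in our release may differ:

```python
import torch.nn as nn
from einops import rearrange

class PerVariableConv(nn.Module):
    """Illustrative 'intermediate' strided convolution applied per variable: the flat
    (batch, N*L, d) sequence is folded so the conv only sees one variable's timesteps,
    then the shortened sub-sequences are flattened back together."""
    def __init__(self, d_model: int, stride: int = 2):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, stride=stride, padding=1)

    def forward(self, tokens, n_vars: int):
        x = rearrange(tokens, "b (n l) d -> (b n) d l", n=n_vars)  # one variable per row
        x = self.conv(x)                                           # shorten along time
        return rearrange(x, "(b n) d l -> b (n l) d", n=n_vars)
```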
Scaling with Shifted Attention Windows. We can split the input into smaller chunks or "windows" across time, and perform spatiotemporal attention amongst the tokens of each window separately. Subsequent layers can shift the window boundaries and let information spread between distant windows as the network gets deeper. The shifted window approach is inspired by the connection between Vision Transformers and spatiotemporal attention [44] (see Appendix A.3).
Figure 2c. The use of two learnable embedding sets (space and time) These scaling methods can be mixed and matched based on avail-
creates interesting parallels between our time series model and able GPU memory. Our goal is to learn spatiotemporal graphs across
Transformers in computer vision. Further discussion and small- entire input sequences. Therefore, we try use as few optimizations
scale experiments on image data using our model are in Appendix as possible, even though convolutions and windowed attention
A.3. have shown promise as a way to improve predictions despite not
Representing Missing Data. Many real-world applications in- being a computational necessity. Most results in this paper were
volve missing data. Existing work often ignores timesteps that have collected using fast attention alone with less than 40GBs of GPU
any missing variable values (wasting valuable data) or replaces memory. Strided convolutions are only necessary on the largest
them with an arbitrary scalar that can confuse the model by being datasets. Shifted window attention saves meaningful amounts of
memory when using quadratic attention, so we primarily use it when we need to mask padded sequences.

3.4 Spacetimeformer

Local and Global Architecture. We find that attention over a longer multivariate sequence can complicate learning in problems with large N. We add some architectural bias towards the sequence of each token's own variable with "local" attention modules in each encoder and decoder layer. In a local layer, each token attends to the timesteps of its own spatial variable. Note that this does not mean we are simplifying the "global" attention layer by separating temporal and spatial attention, as is common in spatial-temporal methods (Appendix A.2). Rather, tokens attend to every token in their own variable's sequence and then to every token in the entire spatiotemporal global sequence. We use a Pre-Norm architecture [78] and BatchNorm [26] normalization. Figure 3 shows a one-layer encoder-decoder architecture.
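The sketch below illustrates the local-then-global ordering for a single encoder sub-block, assuming attention modules constructed with batch_first=True; normalization and feed-forward layers are omitted and the names are illustrative rather than the released API:

```python
import torch.nn as nn
from einops import rearrange

def local_then_global(tokens, n_vars, local_attn: nn.MultiheadAttention,
                      global_attn: nn.MultiheadAttention):
    """One encoder sub-block sketch: 'local' self-attention restricted to each
    variable's own sub-sequence, then 'global' self-attention over the entire
    flattened spatiotemporal sequence (batch, N*L, d)."""
    local = rearrange(tokens, "b (n l) d -> (b n) l d", n=n_vars)
    local, _ = local_attn(local, local, local)            # attends within one variable
    tokens = tokens + rearrange(local, "(b n) l d -> b (n l) d", n=n_vars)
    out, _ = global_attn(tokens, tokens, tokens)          # attends across space and time
    return tokens + out
```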
Output and Time Series Tricks. The final feed-forward layer of the decoder outputs a sequence of predictions that can be folded back into the original input format of (x_t, ŷ_t). We can then optimize a variety of forecasting loss functions, depending on the particular problem and the baselines we are comparing against. Our goal is to create a general multivariate sequence model, so we try to avoid adding domain-specific tricks whenever possible. However, we include features from the recent Transformer TSF literature such as seasonal decomposition, input normalization, and the ability to predict the target sequence as the net difference from the output of a simple linear model. These tricks are turned off by default and only used in non-stationary domains where distribution shift is a major concern; we return to this briefly in the Experiments section and in detail in Appendix B.4. We nickname our full model the Spacetimeformer for clarity in experimental results. More implementation details are listed in Appendix B.2, including explanations of several techniques that are not used in the main experimental results but are part of our open-source code release.

4 EXPERIMENTS

Our experiments are designed to answer the following questions:

(1) Is our model competitive with seq2seq methods on highly temporal tasks that require long-term forecasting?
(2) Is our model competitive with graph-based methods on highly spatial tasks, despite not having access to a predefined variable graph?
(3) Does our spatiotemporal sequence formulation let Transformers learn meaningful variable relationships?

We compare against representative methods from the TSF and GNN literature in addition to ablations of our model. MTGNN [75] is a GNN method that is capable of learning its graph structure from data. LinearAR is a basic linear model that iteratively predicts each timestep of the target sequence as a linear combination of the context sequence and previous outputs. We also include a standard encoder-decoder LSTM [24] and the RNN/Conv1D-based LSTNet [32]. The most important baseline is an ablation of our method similar to Informer that controls for implementation details to measure the impact of spatiotemporal attention; this model uses the Spacetimeformer architecture with the more standard temporal sequence embedding (see Fig. 2b). We refer to this as the Temporal model¹. These baselines are included in our open-source release and use the same training loop and evaluation process. We add Time2Vec information to baseline inputs when applicable because Time2Vec has been shown to improve the performance of a variety of sequence models [29]. Time series forecasting tricks like decomposition and input normalization have been implemented for all methods. We also provide reference results from existing work when they are available, including several spatial-temporal GNN models with predefined graph information (Appendix A.2) and recent TSF Transformers (Sec. 2.2). We report evaluation metrics as the average over all the timesteps of the target sequence, and the average of at least three training runs.

¹ One key detail that cannot be applied to the Temporal model is the use of local attention layers, because there is no concept of local vs. global when tokens represent a combination of variables.

4.1 Toy Examples

We begin with a binary multivariate copy task similar to those used to evaluate long-range dependencies in memory-based sequence models [21]. However, we add an extra challenge and shift each variable's output by a unique number of timesteps (visualized in Appendix B.3 Fig. 7). The shifted copy task was created because it requires each variable to attend to different timesteps; Temporal's attention is fundamentally unable to do this, and instead resorts to learning one variable relationship per attention head until it runs out of heads and produces blurry outputs. The attention heads are visualized in Appendix B.3 Fig. 8 while an example sequence result is shown in Appendix B.3 Fig. 9. Spatiotemporal attention is capable of learning all N variable relationships in one attention head (Appendix B.3 Fig. 10), leading to an accurate output (Appendix B.3 Fig. 11).

Next we look at a more traditional forecasting setup inspired by [59] consisting of a multivariate sine-wave sequence with strong inter-variable dependence. Several ablations of our method are considered. This dataset creates a less extreme example of the effect in the shifted copy task, where Temporal models are forced to compromise their attention over timesteps in a way that reduces predictive power over variables with such distinct frequencies. Our method learns an uncompromising spatiotemporal relationship among all tokens to generate the most accurate predictions. Dataset details and results can be found in Appendix B.3.

4.2 Time Series Forecasting

NY-TX Weather. We continue with more realistic time series domains where we must learn to forecast a relatively small number of variables N with unknown relationships over a long duration L. First we evaluate on a custom dataset of temperature values compiled from the ASOS Weather Network [49]. We use three stations in central Texas and three in eastern New York to create two almost unrelated spatial sub-graphs. Temperature values are taken at one-hour intervals, and we investigate the impact of sequence length by predicting 40, 80, and 160 hour target sequences. The results are shown in Table 1. Our spatiotemporal embedding scheme provides the most accurate forecasts, and its improvement over the
Temporal method appears to increase over longer sequences where the lack of flexibility that comes from grouping the two geographic regions together may become more relevant. MTGNN learns spatial relationships, but temporal consistency can be difficult without decoder attention; its convolution-only output mechanism begins to struggle at the 80 and 160 hour lengths.

             LinearAR   LSTM    MTGNN   Temporal   Spacetimeformer
40 hours
  MSE          18.84    14.29   13.32    13.29         12.49
  MAE           3.24     2.84    2.67     2.67          2.57
  RRSE          0.40     0.355   0.34     0.34          0.33
80 hours
  MSE          23.99    18.75   19.27    19.99          17.9
  MAE           3.72     3.29    3.31     3.37          3.19
  RRSE          0.45     0.40    0.41     0.41          0.40
160 hours
  MSE          28.84    22.11   24.28    24.16         21.35
  MAE           4.13     3.63    3.78     3.77          3.51
  RRSE          0.50     0.44    0.46     0.46          0.44

Table 1: NY-TX Weather Results.

ETTm1. The Transformer TSF literature (Sec. 2.2) has settled on a set of common benchmarks for experimental comparison in long-term forecasting, including a dataset of electricity transformer temperature (ETT) series introduced by [92]. We compare against Informer and a selection of follow-up methods on the multivariate minute resolution variant in Table 2. At first glance, Spacetimeformer offers meaningful improvements over Informer and is competitive with later variants that make use of additional time series domain knowledge. However, our work on ETTm1 and similar benchmarks revealed that these results have less to do with advanced attention mechanisms or model architectures than they do with robustness to the distribution shift caused by non-stationary datasets. This is highlighted by the performance of simple linear models like LinearAR, which closely matches the accuracy of ETSFormer. In fact, a few tricks allow most models to achieve near state-of-the-art performance; reversible instance normalization [30], for example, is enough to more than halve the prediction error of the LSTM baseline - noticeably outperforming the original Informer results. This discussion is continued in detail in Appendix B.4 with experiments on ETTm1 and another common benchmark. In addition, we create a custom task to investigate the way time series models handle different kinds of distributional shift.

Prediction Length     24     48     96     288    672
LSTM                 0.63   0.94   0.91   1.12   1.56
Informer             0.37   0.50   0.61   0.79   0.93
Pyraformer           0.49   0.66   0.71    -      -
YFormer              0.36   0.46   0.57   0.59   0.66
Preformer            0.40   0.43   0.45   0.49   0.54
Autoformer           0.40   0.45   0.46   0.53   0.54
ETSFormer            0.34   0.38   0.39   0.42   0.45
LinearAR             0.33   0.37   0.39   0.44   0.48
Spacetimeformer      0.34   0.38   0.40   0.45   0.52

Table 2: ETTm1 test set normalized MAE. Additional results and discussion provided in Appendix B.4.
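As an illustration of how small this kind of trick is, the sketch below applies a simplified, non-learnable version of the reversible instance normalization idea [30] around an arbitrary forecasting model; the published method additionally learns affine parameters per variable:

```python
def instance_norm_forecast(model, y_context, eps=1e-5):
    """Normalize each variable of each context window by its own statistics,
    forecast in the normalized space, then invert the transform on the predictions."""
    mean = y_context.mean(dim=1, keepdim=True)    # (batch, 1, N)
    std = y_context.std(dim=1, keepdim=True) + eps
    y_hat = model((y_context - mean) / std)       # model forecasts the normalized series
    return y_hat * std + mean                     # undo the shift before evaluation
```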
4.3 Spatial-Temporal Forecasting

AL Solar. We turn our focus to problems on the GNN end of the spatiotemporal spectrum where N approaches or exceeds L. The AL Solar dataset consists of solar power production measurements taken at 10-minute intervals from 137 locations. We predict 4-hour horizons, leading to the longest spatiotemporal sequences of our main experiments; the results are shown in Table 3. Spacetimeformer is significantly more accurate than the TSF baselines. We speculate that this is due to an increased ability to forecast unusual changes in power production due to weather or other localized effects. MTGNN learns similar spatial relationships, but its temporal predictions are not as accurate.

        LinearAR   LSTNet   LSTM    MTGNN   Temporal   Spacetimeformer
MSE       14.3      15.09   10.59   11.40     9.94          7.75
MAE        2.29      2.08    1.56    1.76     1.60          1.37

Table 3: AL Solar Results.

Traffic Prediction. Next, we experiment on two datasets common in GNN research. The Metr-LA and Pems-Bay datasets consist of traffic speed readings at 5 minute intervals, and we forecast the conditions for the next hour. For these experiments we include results directly from the literature (Appendix A.2) to get a better comparison with GNN-based spatial models that used predefined road graphs. The results are listed in Table 4. Our method clearly separates itself from TSF models and enters the performance band of dedicated GNN methods on both datasets (without needing predefined graphs).

HZMetro. [40] experiment with a dataset of passenger arrivals and departures at metro stations in Hangzhou, China for a total of 160 variables recorded in 15 minute time intervals. We forecast the next hour and compare against their published results in Table 5. Spacetimeformer again shows that it can be as effective as graph-based methods in spatial domains without requiring predefined graphs. We list the bounds of several trials because the performance gap between methods on this dataset is relatively slim.

Spatiotemporal Attention Patterns. Standard Transformers learn sliding attention patterns that resemble convolutions. Our method learns distinct connections between variables - this leads to attention diagrams that tend to be structured in "variable blocks" due to the way we flatten our input sequence (Fig. 2c). Figure 4 provides an annotated example for the NY-TX weather dataset. Some attention heads convincingly recover the ground-truth relationship between input variables. Our method's GNN-level performance in highly spatial tasks like traffic forecasting supports a similar conclusion when facing more complex graphs.

Ablations and Node Classification. Finally, we perform several ablation experiments to measure the importance of design decisions in our embedding mechanism and model architecture using the NY-TX Weather and Metr-LA Traffic datasets. Results and analysis can be found in Appendix B.5.
            Time Series Models              ST-GNN Models
          LinearAR  LSTM  Temporal  DCRNN  Graph WaveNet  MTGNN  STAWnet  Traffic Trans.  ST-GRAT  Spacetimeformer
Metr-LA
  MAE       4.71    3.87    3.59     3.03      3.08        3.06    3.03       3.09          2.83        2.86
  MSE      94.11   47.5    52.73    37.88     39.05       38.73   39.77      39.84         32.82       38.27
  MAPE     12.7    10.7    10.7      8.27      8.30        8.34    8.25       8.42          7.70        7.80
Pems-Bay
  MAE       2.24    2.41    2.49     1.59      1.64        1.61    1.62       1.63          1.53        1.61
  MSE      27.62   25.49   27.27    13.69     13.98       13.47   13.85      13.87         13.25       13.99
  MAPE      4.98    5.81    6.12     3.61      3.66        3.63    3.65       3.67          3.49        3.63

Table 4: Traffic Forecasting Results. Results for italicized models taken directly from published work.

         LSTM   Temporal   ASTGCN   DCRNN   GCRNN   Graph-WaveNet   PVGCN   Spacetimeformer
MAE      29.1     29.0      28.0     26.1    26.1       26.5         23.8       25.7 ± .3
RMSE     51.3     48.5      47.2    44.64    44.5       44.8         40.1       44.7 ± 1.6

Table 5: HZMetro Ridership Prediction Results. Italicized model results provided by [40].

Figure 4: Discovering Spatial Relationships from Data: We forecast the temperature at three weather stations in Texas (lower left, blue diamonds) and three stations in New York (upper right, purple triangles). Temporal attention stacks all six time series into one input variable and attends across time alone (upper left). Our method recovers the correct spatial relationship between the variables along with the strided temporal relation (lower right) (Dark blue shaded entries → more attention).

5 CONCLUSION AND FUTURE DIRECTIONS

This paper has presented a unified method for multivariate forecasting based on the application of a custom long-range Transformer architecture to elongated spatiotemporal input sequences. Our approach jointly learns temporal and spatial relationships to achieve competitive results on long-sequence time series forecasting, and can scale to high dimensional spatial problems without relying on a predefined graph. We see several promising directions for future development. First, there is room to scale Spacetimeformer to much larger domains and model sizes. This could be accomplished with additional computational resources or by making better use of optimizations like windowed attention and convolutional layers that were underutilized in our experimental results. Next, our main experiments focus on established benchmarks with relatively static spatial relationships and a standard multivariate input format. While it was necessary to verify that our model is competitive in popular domains like traffic forecasting, we feel that new applications with more rapid changes in variable behavior could take better advantage of our fully dynamic and learnable graph. Our embedding scheme also enables flexible input formats with irregularly/unevenly sampled series and exogenous variables (Appendix B.2). Finally, we see an opportunity to experiment with multi-dataset generalization as is common for Transformers in many other areas of machine learning. Appendix B.6 provides further discussion of this direction, and the foundation for this work is included in our open-source release.

REFERENCES

[1] Ali Araabi and Christof Monz. Optimizing Transformer for Low-Resource Neural Machine Translation. 2020. arXiv: 2011.02266 [cs.CL].
[2] Lei Bai et al. "Adaptive graph convolutional recurrent network for traffic forecasting". In: Advances in Neural Information Processing Systems 33 (2020), pp. 17804–17815.
[3] Anastasia Borovykh, Sander Bohte, and Cornelis W Oosterlee. "Conditional time series forecasting with convolutional neural networks". In: arXiv preprint arXiv:1703.04691 (2017).
[4] Tom B. Brown et al. Language Models are Few-Shot Learners. 2020. doi: 10.48550/ARXIV.2005.14165. url: https://arxiv.org/abs/2005.14165.
[5] Khac-Hoai Nam Bui, Jiho Cho, and Hongsuk Yi. "Spatial-temporal graph neural network for traffic forecasting: An overview and open research issues". In: Applied Intelligence (2021), pp. 1–12.
[6] Ling Cai et al. "Traffic transformer: Capturing the continuity and periodicity of time series for traffic forecasting". In: Transactions in GIS 24.3 (2020), pp. 736–755.
[7] Vitor Cerqueira, Luis Torgo, and Igor Mozetič. "Evaluating time series forecasting models: An empirical study on performance estimation methods". In: Machine Learning 109.11 (2020), pp. 1997–2028.
[8] Rewon Child et al. "Generating long sequences with sparse transformers". In: arXiv preprint arXiv:1904.10509 (2019).
[9] Krzysztof Choromanski et al. Rethinking Attention with Performers. 2021. arXiv: 2009.14794 [cs.LG].
[10] Xiangxiang Chu et al. "Twins: Revisiting the design of spatial attention in vision transformers". In: Advances in Neural Information Processing Systems 34 (2021).
[11] Razvan-Gabriel Cirstea et al. "Towards spatio-temporal aware traffic time series forecasting". In: 2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE. 2022, pp. 2900–2913.
[12] Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019. arXiv: 1810.04805 [cs.CL].
[13] Xiaoyi Dong et al. "Cswin transformer: A general vision transformer backbone with cross-shaped windows". In: arXiv preprint arXiv:2107.00652 (2021).
[14] Alexey Dosovitskiy et al. "An image is worth 16x16 words: Transformers for image recognition at scale". In: arXiv preprint arXiv:2010.11929 (2020).
[15] Dazhao Du, Bing Su, and Zhewei Wei. Preformer: Predictive Transformer with Multi-Scale Segment-wise Correlations for Long-Term Time Series Forecasting. 2022. doi: 10.48550/ARXIV.2202.11356. url: https://arxiv.org/abs/2202.11356.
[16] WA Falcon et al. "PyTorch Lightning". In: GitHub. Note: https://github.com/PyTorchLightning/pytorch-lightning 3 (2019).
[17] Angela Fan, Edouard Grave, and Armand Joulin. Reducing Transformer Depth on Demand with Structured Dropout. 2019. arXiv: 1909.11556 [cs.LG].
[18] Yuchen Fang et al. "Spatio-Temporal meets Wavelet: Disentangled Traffic Flow Forecasting via Efficient Spectral Graph Attention Network". In: arXiv e-prints (2021), arXiv–2112.
[19] Tryambak Gangopadhyay et al. "Spatiotemporal attention for multivariate time series prediction and interpretation". In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2021, pp. 3560–3564.
[20] Rakshitha Godahewa et al. "Monash time series forecasting archive". In: arXiv preprint arXiv:2105.06643 (2021).
[21] Alex Graves, Greg Wayne, and Ivo Danihelka. "Neural turing machines". In: arXiv preprint arXiv:1410.5401 (2014).
[22] Qipeng Guo et al. "Star-transformer". In: arXiv preprint arXiv:1902.09113 (2019).
[23] Shengnan Guo et al. "Attention Based Spatial-Temporal Graph Convolutional Networks for Traffic Flow Forecasting". In: Proceedings of the AAAI Conference on Artificial Intelligence 33.01 (July 2019), pp. 922–929. doi: 10.1609/aaai.v33i01.3301922. url: https://ojs.aaai.org/index.php/AAAI/article/view/3881.
[24] Sepp Hochreiter and Jürgen Schmidhuber. "Long Short-Term Memory". In: Neural Comput. 9.8 (Nov. 1997), pp. 1735–1780. issn: 0899-7667. doi: 10.1162/neco.1997.9.8.1735. url: https://doi.org/10.1162/neco.1997.9.8.1735.
[25] Xiao Shi Huang et al. "Improving Transformer Optimization Through Better Initialization". In: Proceedings of the 37th International Conference on Machine Learning. Ed. by Hal Daumé III and Aarti Singh. Vol. 119. Proceedings of Machine Learning Research. PMLR, July 2020, pp. 4475–4483. url: https://proceedings.mlr.press/v119/huang20f.html.
[26] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. 2015. doi: 10.48550/ARXIV.1502.03167. url: https://arxiv.org/abs/1502.03167.
[27] Tomoharu Iwata and Atsutoshi Kumagai. "Few-shot Learning for Time-series Forecasting". In: arXiv preprint arXiv:2009.14379 (2020).
[28] Chaitanya Joshi. "Transformers are Graph Neural Networks". In: The Gradient (2020).
[29] Seyed Mehran Kazemi et al. "Time2vec: Learning a vector representation of time". In: arXiv preprint arXiv:1907.05321 (2019).
[30] Taesung Kim et al. "Reversible instance normalization for accurate time-series forecasting against distribution shift". In: International Conference on Learning Representations. 2021.
[31] Louis Kirsch et al. "General-purpose in-context learning by meta-learning transformers". In: arXiv preprint arXiv:2212.04458 (2022).
[32] Guokun Lai et al. Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks. 2018. arXiv: 1703.07015 [cs.LG].
[33] Shiyong Lan et al. "DSTAGNN: Dynamic Spatial-Temporal Aware Graph Neural Network for Traffic Flow Forecasting". In: Proceedings of the 39th International Conference on Machine Learning. Ed. by Kamalika Chaudhuri et al. Vol. 162. Proceedings of Machine Learning Research. PMLR, July 2022, pp. 11906–11917. url: https://proceedings.mlr.press/v162/lan22a.html.
[34] Mengzhang Li and Zhanxing Zhu. "Spatial-Temporal Fusion Graph Neural Networks for Traffic Flow Forecasting". In: Proceedings of the AAAI Conference on Artificial Intelligence 35.5 (May 2021), pp. 4189–4196. url: https://ojs.aaai.org/index.php/AAAI/article/view/16542.
[35] Shiyang Li et al. "Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting". In: Advances in Neural Information Processing Systems 32 (2019), pp. 5243–5253.
[36] Yaguang Li et al. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. 2018. arXiv: 1707.01926 [cs.LG].
[37] Yawei Li et al. "Localvit: Bringing locality to vision transformers". In: arXiv preprint arXiv:2104.05707 (2021).
[38] Bryan Lim et al. Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting. 2020. arXiv: 1912.09363 [stat.ML].
[39] Aoyu Liu and Yaying Zhang. "Spatial-Temporal Interactive Dynamic Graph Convolution Network for Traffic Forecasting". In: arXiv preprint arXiv:2205.08689 (2022).
[40] Lingbo Liu et al. "Physical-virtual collaboration modeling for intra- and inter-station metro ridership prediction". In: IEEE Transactions on Intelligent Transportation Systems (2020).
[41] Liyuan Liu et al. Understanding the Difficulty of Training Transformers. 2020. arXiv: 2004.08249 [cs.LG].
[42] Pengfei Liu et al. "Contextualized non-local neural networks for sequence learning". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 01. 2019, pp. 6762–6769.
[43] Shizhan Liu et al. "Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting". In: International Conference on Learning Representations. 2021.
[44] Ze Liu et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. 2021. doi: 10.48550/ARXIV.2103.14030. url: https://arxiv.org/abs/2103.14030.
[45] Kiran Madhusudhanan et al. Yformer: U-Net Inspired Transformer Architecture for Far Horizon Time Series Forecasting. 2021. doi: 10.48550/ARXIV.2110.08255. url: https://arxiv.org/abs/2110.08255.
[46] Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. "The M4 Competition: 100,000 time series and 61 forecasting methods". In: International Journal of Forecasting 36.1 (2020). M4 Competition, pp. 54–74. issn: 0169-2070. doi: https://doi.org/10.1016/j.ijforecast.2019.04.014. url: https://www.sciencedirect.com/science/article/pii/S0169207019301128.
[47] Toan Q Nguyen and Julian Salazar. "Transformers without tears: Improving the normalization of self-attention". In: arXiv preprint arXiv:1910.05895 (2019).
[48] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning Convolutional Neural Networks for Graphs. 2016. arXiv: 1605.05273 [cs.LG].
[49] NOAA. National Weather Service Automated Surface Observing System (ASOS). 2021. url: https://www.weather.gov/asos/.
[50] Boris N. Oreshkin et al. Meta-learning framework with applications to zero-shot time-series forecasting. 2020. arXiv: 2002.02887 [cs.LG].
[51] Cheonbok Park et al. "ST-GRAT: A Novel Spatio-temporal Graph Attention Networks for Accurately Forecasting Dynamically Changing Road Speed". In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management (Oct. 2020). doi: 10.1145/3340531.3411940. url: http://dx.doi.org/10.1145/3340531.3411940.
[52] Adam Paszke et al. "PyTorch: An Imperative Style, High-Performance Deep Learning Library". In: Advances in Neural Information Processing Systems 32. Ed. by H. Wallach et al. 2019, pp. 8024–8035.
[53] Yanjun Qin et al. "DMGCRN: Dynamic Multi-Graph Convolution Recurrent Network for Traffic Forecasting". In: arXiv preprint arXiv:2112.02264 (2021).
[54] Zhen Qin et al. cosFormer: Rethinking Softmax in Attention. 2022. doi: 10.48550/ARXIV.2202.08791. url: https://arxiv.org/abs/2202.08791.
[55] Alex Rogozhnikov. "Einops: Clear and Reliable Tensor Manipulations with Einstein-like Notation". In: International Conference on Learning Representations. 2022. url: https://openreview.net/forum?id=oapKSVM2bcj.
[56] Benedek Rozemberczki et al. PyTorch Geometric Temporal: Spatiotemporal Signal Processing with Neural Machine Learning Models. 2021. arXiv: 2104.07788 [cs.LG].
[57] David Salinas, Valentin Flunkert, and Jan Gasthaus. DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks. 2019. arXiv: 1704.04110 [cs.AI].
[58] Sheng Shen et al. PowerNorm: Rethinking Batch Normalization in Transformers. 2020. arXiv: 2003.07845 [cs.CL].
[59] Shun-Yao Shih, Fan-Keng Sun, and Hung-yi Lee. "Temporal pattern attention for multivariate time series forecasting". In: Machine Learning 108.8 (2019), pp. 1421–1441.
[60] Slawek Smyl. "A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting". In: International Journal of Forecasting 36.1 (2020), pp. 75–85.
[61] Chao Song et al. "Spatial-Temporal Synchronous Graph Convolutional Networks: A New Framework for Spatial-Temporal Network Data Forecasting". In: Proceedings of the AAAI Conference on Artificial Intelligence 34.01 (Apr. 2020), pp. 914–921. doi: 10.1609/aaai.v34i01.5438. url: https://ojs.aaai.org/index.php/AAAI/article/view/5438.
[62] Yi Tay et al. Efficient Transformers: A Survey. 2020. doi: 10.48550/ARXIV.2009.06732. url: https://arxiv.org/abs/2009.06732.
[63] Yi Tay et al. Long Range Arena: A Benchmark for Efficient Transformers. 2020. arXiv: 2011.04006 [cs.LG].
[64] Chenyu Tian and Wai Kin Chan. "Spatial-temporal attention wavenet: A deep learning framework for traffic prediction considering spatial-temporal dependencies". In: IET Intelligent Transport Systems 15.4 (2021), pp. 549–561.
[65] Ashish Vaswani et al. "Attention is all you need". In: Advances in Neural Information Processing Systems. 2017, pp. 5998–6008.
[66] Sinong Wang et al. Linformer: Self-Attention with Linear Complexity. 2020. arXiv: 2006.04768 [cs.LG].
[67] Wenhai Wang et al. "Pyramid vision transformer: A versatile backbone for dense prediction without convolutions". In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, pp. 568–578.
[68] Yuhu Wang et al. "TVGCN: Time-variant graph convolutional network for traffic forecasting". In: Neurocomputing 471 (2022), pp. 118–129.
[69] Chen Weikang et al. "Spatial-Temporal Adaptive Graph Convolution with Attention Network for Traffic Forecasting". In: arXiv preprint arXiv:2206.03128 (2022).
[70] Gerald Woo et al. "ETSformer: Exponential Smoothing Transformers for Time-series Forecasting". In: arXiv preprint arXiv:2202.01381 (2022).
[71] Haixu Wu et al. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. 2021. doi: 10.48550/ARXIV.2106.13008. url: https://arxiv.org/abs/2106.13008.
[72] Sifan Wu et al. "Adversarial Sparse Transformer for Time Series Forecasting". In: (2020).
[73] Zhen Wu et al. UniDrop: A Simple yet Effective Technique to Improve Transformer without Extra Cost. 2021. arXiv: 2104.04946 [cs.CL].
[74] Zonghan Wu et al. "A Comprehensive Survey on Graph Neural Networks". In: IEEE Transactions on Neural Networks and Learning Systems 32.1 (Jan. 2021), pp. 4–24. issn: 2162-2388. doi: 10.1109/tnnls.2020.2978386. url: http://dx.doi.org/10.1109/TNNLS.2020.2978386.
[75] Zonghan Wu et al. Connecting the Dots: Multivariate Time Series Forecasting with Graph Neural Networks. 2020. arXiv: 2005.11650 [cs.LG].
[76] Zonghan Wu et al. "Graph wavenet for deep spatial-temporal graph modeling". In: arXiv preprint arXiv:1906.00121 (2019).
[77] Zonghan Wu et al. TraverseNet: Unifying Space and Time in Message Passing. 2021. arXiv: 2109.02474 [cs.LG].
[78] Ruibin Xiong et al. On Layer Normalization in the Transformer Architecture. 2020. arXiv: 2002.04745 [cs.LG].
[79] Yunyang Xiong et al. Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention. 2021. arXiv: 2102.03902 [cs.CL].
[80] Mingxing Xu et al. Spatial-Temporal Transformer Networks for Traffic Flow Forecasting. 2021. arXiv: 2001.02908 [eess.SP].
[81] Jiexia Ye et al. "Multi-stgcnet: A graph convolution based spatial-temporal framework for subway passenger flow forecasting". In: 2020 International Joint Conference on Neural Networks (IJCNN). IEEE. 2020, pp. 1–8.
[82] Xue Ye et al. "Meta Graph Transformer: A Novel Framework for Spatial–Temporal Traffic Prediction". In: Neurocomputing 491 (2022), pp. 544–563. issn: 0925-2312. doi: https://doi.org/10.1016/j.neucom.2021.12.033. url: https://www.sciencedirect.com/science/article/pii/S0925231221018725.
[83] Zihao Ye et al. "Bp-transformer: Modelling long-range context via binary partitioning". In: arXiv preprint arXiv:1911.04070 (2019).
[84] Xueyan Yin et al. "STNN: A Spatial-Temporal Graph Neural Network for Traffic Prediction". In: 2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS). IEEE. 2021, pp. 146–152.
[85] Bing Yu, Haoteng Yin, and Zhanxing Zhu. "Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting". In: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (July 2018). doi: 10.24963/ijcai.2018/505. url: http://dx.doi.org/10.24963/ijcai.2018/505.
[86] Manzil Zaheer et al. "Big Bird: Transformers for Longer Sequences." In: NeurIPS. 2020.
[87] Ailing Zeng et al. Are Transformers Effective for Time Series Forecasting? 2022. doi: 10.48550/ARXIV.2205.13504. url: https://arxiv.org/abs/2205.13504.
[88] George Zerveas et al. "A Transformer-based Framework for Multivariate Time Series Representation Learning". In: arXiv preprint arXiv:2010.02803 (2020).
[89] Hang Zhang et al. Poolingformer: Long Document Modeling with Pooling Attention. 2021. arXiv: 2105.04371 [cs.CL].
[90] Chuanpan Zheng et al. "Gman: A graph multi-attention network for traffic prediction". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 01. 2020, pp. 1234–1241.
[91] Chuanpan Zheng et al. Spatio-Temporal Joint Graph Convolutional Networks for Traffic Forecasting. 2021. arXiv: 2111.13684 [cs.LG].
[92] Haoyi Zhou et al. "Informer: Beyond efficient transformer for long sequence time-series forecasting". In: Proceedings of AAAI. 2021.
[93] Tian Zhou et al. FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting. 2022. doi: 10.48550/ARXIV.2201.12740. url: https://arxiv.org/abs/2201.12740.
[94] Wangchunshu Zhou et al. Scheduled DropHead: A Regularization Method for Transformer Models. 2020. arXiv: 2004.13342 [cs.CL].
[95] Chen Zhu et al. Long-Short Transformer: Efficient Transformers for Language and Vision. 2021. arXiv: 2107.02192 [cs.CV].

A ADDITIONAL BACKGROUND AND RELATED WORK

A.1 Transformers and Self Attention

The Transformer [65] is a deep learning architecture for sequence-to-sequence prediction that is widely used in natural language processing (NLP) [12]. Transformers operate on two sequences of d-dimensional vectors, represented in matrix form as X ∈ R^{L_x×d} and Z ∈ R^{L_z×d}, where L_x and L_z are sequence lengths. The primary component of the model is the attention mechanism that updates the representation of tokens in X with information from Z. Tokens in Z are mapped to key vectors with learned parameters W^K, while tokens in X generate query vectors with W^Q. The dot-product similarity between query and key vectors is re-normalized to determine the attention matrix A(X, Z) ∈ R^{L_x×L_z}:

    A(X, Z) = softmax( W^Q(X) (W^K(Z))^T / √d )    (1)

As mentioned in the main text, our method focuses on the interpretation of attention as a form of message passing along a dynamically generated adjacency matrix [28, 42, 86], where A(X, Z)[i, j] denotes the strength of the connection between the i-th token in X and the j-th token in Z. The information passed along the edges of this graph are value vectors of sequence Z, generated with parameters W^V. We create a new representation of sequence X according to the attention-weighted sum of W^V(Z):

    Attention(X, Z) ∈ R^{L_x×d} := A(X, Z) W^V(Z)    (2)

Let Z correspond to the context sequence of the time series forecasting framework (Sec. 2.2) and X be the target sequence. In an encoder-decoder Transformer, a stack of consecutive encoder layers observe the context sequence and perform self-attention between that sequence and itself (Attention(Z, Z)), leading to an updated representation of the context sequence Z′. Decoder layers process the target sequence X and alternate between self-attention
Long-Range Transformers for Dynamic Spatiotemporal Forecasting

(Attention(X, X)) and cross-attention with the output of the en- Transformer component that handles temporal learning. The overall
coder (Attention(X, Z ′ )). Each encoder and decoder layer also in- architecture is one example of the “spatial, then temporal" pattern
cludes normalization, fully connected layers, and residual connec- that is common in ST-GNNs, and is depicted in Figure 1c. Once
tions applied to each token [65, 78]. The output of the last decoder again, this framework does not require an attention mechanism that
layer is passed through a linear transformation to generate a se- looks like a Transformer - it just needs a way to share information
quence of predictions. across space and time in an alternating fashion. DCRNN, for example,
is a foundational work in this literature that merges diffusion based
A.2 Spatial-Temporal Graph Neural Networks graph layers with the recurrent temporal processing of an RNN.
Here we expand upon the discussion of spatial-temporal related One drawback of graph convolutions on hardcoded adjacency
work summarized by Section 2.3. The method proposed in this pa- matrices is their inability to adapt spatial connections for specific
per is focused on “spatiotemporal" forecasting, where we learn both points in time. In our traffic forecasting example, roadway connec-
spatial relationships amongst multiple variables and temporal rela- tions may change due to closure, accidents, or high volume. More
tionships across multiple timesteps. There are countless papers on formally, some domains require an extension of Eq. 3 to a dynamic
sequence models for forecasting that learn representations across sequence of adjacency matrices:
time, and a growing literature on Graph Neural Network (GNN)
models that learn representations across space. At the intersection [𝑌𝑇 +1, . . . , 𝑌𝑇 +ℎ ] = 𝑓𝜃 ([(𝑌𝑇 −𝑐 , A𝑇 −𝑐 ), . . . , (𝑌𝑇 , A𝑇 )]) (4)
of seq2seq TSF and GNNs is the spatial-temporal GNN (ST-GNN)
literature centered around graph operations over a sequence of However it is rare to have such a sequence available, so practical
variable nodes. The spatial-temporal literature can be difficult to implementations often rely on attention mechanisms to re-weight
categorize due to heavy overloading of vocabulary like “spatial their spatial relationships based on time and node attributes, as in
attention" and concurrent publication of similar methods. In this ASTGCN [23]. This can allow for some dynamic adaptation of the
section, we try to abstract away implementation details and catego- spatial graph, although we are still unable to learn edges that were
rize existing works based on the ways they learn representations not provided in advance. GMAN [90] enables a fully dynamic spatial
of multivariate data. graph by using A to create a spatiotemporal embedding for each
We begin with non-GNN seq2seq time series models as a point of node 𝑣 ∈ V𝑡 . Spatial attention modules based on these embeddings
reference. Informer, for example, groups 𝑁 variables per timestep combine with more typical temporal self attention for accurate
into 𝑑 dimensional vector representations, and then uses self atten- predictions.
tion to share information between timesteps (Sec. 2.2). This makes While adjacency matrices may be available for traffic forecast-
models like Informer and later variants (Autoformer, ETSFormer, ing where road networks are clearly defined, many multivariate
etc.) purely temporal methods. The Temporal category is not spe- domains have unknown spatial relationships that need to be discov-
cific to self attention but to any method that learns patterns across ered from data. One approach - used by Graph WaveNet [76], MTGNN
time. For example, the LSTM model used in our experiments learns [75], AGCRN [2], and others - is to randomly initialize trainable node
temporal patterns by selectively updating a compressed represen- embeddings and use the similarity scores between them to construct
tation of past timesteps given the current timestep. a learned adjacency matrix. However, these graphs are updated
ST-GNNs reformulate multivariate forecasting as the prediction by gradient descent and are then static after training is complete.
of a sequence of graphs. A graph at timestep 𝑡, denoted G𝑡 , has Methods like STAWnet [64] learn dynamic graphs by making the
a set of 𝑁 nodes (V𝑡 ) connected by edges (E𝑡 ). The 𝑖th node 𝑣𝑡𝑖 spatial relationships dependent on both the node embeddings and
contains a vector of values/attributes 𝑦𝑡𝑖 , with the attributes of the time/value of the current input.
all 𝑁 nodes represented by a matrix 𝑌𝑡 . Nodes are connected by More recent work takes spatial-temporal learning a step further.
weighted edges that can be represented by an adjacency matrix Rather than alternating between spatial and temporal layers, fully
A𝑡 ∈ R𝑁 ×𝑁 . In traffic forecasting, for example, nodes are road spatiotemporal methods spread information across space and time
sensors with the current traffic velocity as attributes and weighted jointly by adding edges between the graphs of multiple timesteps.
edges corresponding to road lengths. Using graph notation, the fore- When using alternating spatial/temporal layers, information from
casting problem can be written as the prediction of future node val- past timesteps of neighboring nodes must take an indirect route
ues [𝑌𝑇 +1, 𝑌𝑇 +2, . . . , 𝑌𝑇 +ℎ ] given previous graphs [G𝑇 −𝑐 , . . . , G𝑇 ], through the representation of another node. In other words, a spa-
where ℎ is the horizon and 𝑐 is the context length. In practice, we tial layer must store irrelevant information in a node just so that it
typically look to learn a parameterized function 𝑓𝜃 from past node can be moved by a temporal layer to the timestep where it is rele-
values and a fixed (static) adjacency matrix: vant and vice versa. This effect is most evident in attention-based
models where two-step message passing relies on the queries/keys
of unrelated nodes, but also occurs in recurrent/convolutional mod-
[𝑌𝑇 +1, 𝑌𝑇 +2, . . . , 𝑌𝑇 +ℎ ] = 𝑓𝜃 ([𝑌𝑇 −𝑐 , . . . , 𝑌𝑇 ; A]) (3) els due to the need to compress information into a fixed amount
Traffic Transformer [6] is a prototypical example of using the of space. A illustrative example using self-attention terminology is
graph-based formulation with self-attention components. Traffic provided in Figure 5. STSGCN [34], STJGCN [91], and TraverseNet
Transformer uses the predefined adjacency matrix A to perform [77] expand their predefined graphs by connecting node neighbor-
a graph convolution, sharing information between the nodes of hoods across short segments of time. Spacetimeformer processes
each timestep according to the hard-coded spatial relationships. joint spatiotemporal relationships across a learnable dynamic graph,
The node representations are then passed through a more standard providing the most flexible and assumption-free combination of the
Jake Grigsby, Zhe Wang, Nam Nguyen, and Yanjun Qi

ST-GNN literature. We accomplish this by leveraging efficient at- 2.2) is as the completion of the rightmost columns of a grayscale
tention mechanisms to dynamically generate the adjacency matrix image given the leftmost columns. We let 𝑥 be the scalar index
of a densely connected graph between all nodes at all timesteps. of a column and 𝑦 be the vector of pixel values for that column.
The number of variables corresponds to the number of rows in the
image. An example of using Spacetimeformer to complete MNIST
images given the first 10 columns of pixels is shown in Figure 6.
Our model solves this problem with the same approach used in all
yt0 yt+1
0 the other forecasting results presented in this paper. The image-
completion perspective can be an intuitive way to identify the
A B problem with standard “temporal" attention in TSF Transformers.
If faced with the image completion task in Figure 6, we would be
C Spatial Attn unlikely to try and model all of the rows of the image together and
perform attention over columns - but that is exactly what models
like Informer end up doing. Pixel shapes can have complex local
C D
1 1
structure and each region should have the freedom to attend to
yt yt+1 distinct parts of the image that are most relevant. We run similar
spatiotemporal (Spacetimeformer) and temporal (Temporal) ar-
chitectures on MNIST and see an 8% reduction in prediction error.
Spatial, then Temporal However, MNIST is a simple dataset and we would expect a much
Temporal, then Spatial larger gap in performance on higher-resolution images.
The key difference between image and time series data is that
Spatiotemporal both dimensions of vision data have meaningful order while the
order of our spatial axis must be assumed to be arbitrary. This
Figure 5: “Spatial and Temporal" vs. Spatiotemporal Atten- can limit our ability to apply common vision techniques to reduce
tion. We depict a two variable series with purple (top) and sequence length. For example, the original Vision Transformer [14]
red (bottom) variable sequences. Nodes corresponding to established a convention of “patching" the image into a grid of small
timesteps 𝑡 and 𝑡 + 1 and have been labeled A, B, C, D for sim- (16 × 16) pixel squares. The input to the Transformer then becomes
plicity. Suppose that the information at node A is important a feed-forward layer’s projection of regions of pixels rather than
to the representation we want to produce at node D. Alter- every pixel individually - dramatically reducing the input length
nating temporal and spatial layers force this information to of the self-attention mechanism. Square patches of multivariate
take an indirect route through the attention mechanism of time series data would arbitrarily group multiple variables together
an unrelated token (e.g., B or C). The A → D path is then de- based on the order their columns appeared in the dataset. However,
pendent on the similarity of B and A or C and A, as well as we do have the option to patch along rectangular (1 × 𝑘) slices
any other tokens involved in B’s temporal attention or C’s of time. The initial convolution approach (Sec. 3.3) used on large
spatial attention. In contrast, true spatiotemporal attention datasets can be seen as a kind of overlapping patching to reduce
creates a direct path A → D with no risk of information loss. sequence length.
Another common theme in Vision Transformer work is the com-
In summary, the ST-GNN literature has three key methodological bination of attention and convolutions. Convolutions provide an
differences that define models’ flexibility and accuracy: architectural bias towards groups of adjacent pixels while attention
allows for global connections even in shallow layers where convolu-
(1) The type of graph used in learning. Are we performing spa-
tional receptive fields are small [37]. Spacetimeformer “intermedi-
tial and temporal updates in an alternating fashion, or can
ate convolutions" (Sec. 3.3) rearrange the flattened spatiotemporal
we learn relationships across space and time jointly with a
attention sequence to a Conv1D input format to get a similar effect.
true spatiotemporal graph?
Some Vision Transformers create a bias towards nearby pixels with
(2) The requirement of a predefined adjacency matrix based on
efficient attention mechanisms that resemble convolutions over lo-
known relationships between nodes.
cal regions but form sparse hierarchies across the full image [13, 10,
(3) The ability to dynamically adapt spatial relationships accord-
67]. Spacetimeformer’s local attention layers can be interpreted
ing to the current timestep and node values.
as a version of this approach. However, our model also contains
We categorize related work according to these characteris- true global layers that would be the equivalent of attention between
tics in Table 6. For more background on ST-GNNs, see [5]. each pixel and every other pixel in an image - something that is not
usually attempted in vision architectures. An interesting technique
A.3 Connection to Vision Transformers related to the local vs. global approach and convolutional networks
Spacetimeformer’s architecture and embedding scheme prompt is Shifted Window Attention [44]. Attention is performed only
some interesting parallels with work on Transformers in computer amongst the pixels or patches of a “window" or neighborhood of
vision. Both domains involve two dimensional data (rows/columns pixels, but the window boundaries are redrawn in each layer so that
of pixels in vision and space/time in forecasting). In fact, another information can spread between distant windows as the network
way to look at the standard multivariate forecasting problem (Sec. gets deeper. This is directly analogous to the expanding receptive
Long-Range Transformers for Dynamic Spatiotemporal Forecasting

Message Passing Predefined Dynamic


Method
Type Spatial Graph Spatial Graph
Informer [92] Temporal ✗ ✗
TFT [38] Temporal ✗ ✗
DCRNN [36] Spatial + Temporal ✓ ✗
TSE-SC [6] Spatial + Temporal ✓ ✗
PVCGN [40] Spatial + Temporal ✓ ✗
Multi-STGCnet [81] Spatial + Temporal ✓ ✗
STFGNN [61] Spatial + Temporal ✓ ✗
ASTGCN [23] Spatial + Temporal ✓ ✓-
ST-GRAT [51] Spatial + Temporal ✓ ✓-
DMGCRN [53] Spatial + Temporal ✓ ✓-
STWave [18] Spatial + Temporal ✓ ✓-
MGT [82] Spatial + Temporal ✓ ✓-
STTN [80] Spatial + Temporal ✓ ✓
STIDGCN [39] Spatial + Temporal ✓ ✓
GMAN [90] Spatial + Temporal ✓ ✓
STNN [84] Spatial + Temporal ✓ ✓
Graph WaveNet [76] Spatial + Temporal ✗ ✗
MTGNN [75] Spatial + Temporal ✗ ✗
AGCRN [2] Spatial + Temporal ✗ ✗
DSTAGNN [33] Spatial + Temporal ✗ ✓-
ST-WA [11] Spatial + Temporal ✗ ✓
TVGCN [68] Spatial + Temporal ✗ ✓
STAAN [69] Spatial + Temporal ✗ ✓
TPA-LSTM [59] Spatial + Temporal ✗ ✓
STAWNet [64] Spatial + Temporal ✗ ✓
STAM [19] Spatial + Temporal ✗ ✓
STSGCN [34] Short-Term Spatiotemporal ✓ ✗
TraverseNet [77] Short-Term Spatiotemporal ✓ ✓-
STJGCN [91] Short-Term Spatiotemporal ✓ ✓
Spacetimeformer Spatiotemporal ✗ ✓
Table 6: Spatial-Temporal forecasting related work categorized by graph type (Fig 5), the requirement of a hard-coded variable
graph, and the ability to dynamically adapt spatial relationships across time. “Short-Term Spatiotemporal" refers to spatiotem-
poral graphs that are restricted to a short range of timesteps. The “✓-" rating in the dynamic spatial graph column indicates
models that re-weight their adjacency matrix without creating new connections.

field that results from down-sampled convolutional architectures.


Spacetimeformer implements shifted windowed attention in one
dimension where neighborhoods of data are defined by slices in
time. This mechanism is not used in the primary experimental re-
sults because fast attention provides sufficient memory savings.
In general, the overlap between vision and time series techniques
appears to be a promising direction for the future scalability of
spatiotemporal forecasting. Our public codebase provides fast setup
for time series models to train on the MNIST task in Fig. 6. We also
include a more difficult CIFAR-10 task where the images’ rows and
columns have been flattened into a sequence with three variables
Figure 6: Image completion as a multivariate forecasting
corresponding to the red, green, and blue color channels.
problem. Spacetimeformer learns to complete images given
the leftmost columns.
Jake Grigsby, Zhe Wang, Nam Nguyen, and Yanjun Qi

B ADDITIONAL RESULTS AND DETAILS across the state of Alabama in 2006. We use a context length of
168 and a target length of 24.
B.1 Real-World Dataset Details
• Pems-Bay Traffic [36]. Similar to Metr-LA but covering the
Details of the real-world datasets used in our experimental results Bay Area over 6 months in 2017. The context and target lengths
are listed below. Descriptions of toy datasets are deferred to the are 12 and we use the standard train/test split.
appendix sections where their results are discussed. We try to follow • HZMetro [40]. A dataset of the number of riders arriving and
the train/val/test splits of prior work where possible to ensure a departing from 80 metro stations in Hangzhou, China during
fair comparison. Splits are often determined by dividing the series the month of January 2019. The (𝐿 × 𝑁 , 1) input format of our
by a specific point in time, so that the earliest data is the train set model requires the arrivals and departures to be separated into
and the most recent data is the test set. This evaluation scheme their own variable nodes, leading to a total of 160 variables. The
helps measure the future predictive power of the model and is context and target windows are set to a length of 4 as in previous
especially important in non-stationary datasets [7]. Datasets are work.
either released directly with our source code or made available for • ETT [92]. An electricity transformer dataset covering 2016 −
download in the proper format. 2018. We use the ETTm1 variant which is logged at one minute
intervals to provide as much data as possible. We evaluate on the
Variables Length Size set of target sequence lengths {24, 48, 96, 288, 672} established
(𝑁 ) (𝐿) (Timesteps) by Informer.
• Weather [92]. A German dataset of 21 weather-related variables
NY-TX Weather 6 800 569,443 recorded every 10 minutes in 2020. We use the same set of target
AL Solar 137 192 52,560 sequence lengths and context window selection approach as in
Metr-LA Traffic 207 24 34,272 ETTm1.
Pems-Bay Traffic 325 24 52,116
ETTm1 7 1344 69,680
HZMetro 160 8 1650 B.2 Code and Implementation Details
Weather 21 1440 52,696 The code for our work is open-sourced and available on GitHub at
Table 7: Real-World Dataset Summary. Sequence length (𝐿) is QData/spacetimeformer. All models are implemented in PyTorch
reported as the largest combined length of the context and [52] and the training process is conducted with PyTorch Lightning
target windows used in results. [16]. Our LSTNet and MTGNN implementations are based on public
code [56] and verified by replicating experiments from the original
papers. Generic models like LSTM and LinearAR are implemented
from scratch and we made an effort to ensure the results are com-
• NY-TX Weather (new). We obtain hourly temperature readings petitive. The code for the Spacetimeformer model was originally
from the ASOS weather network. We use three stations located in based on the Informer open-source release.
central Texas and three more located hundreds of miles away in Data Preprocessing. As mentioned in the previous section,
eastern New York. The data covers the years 1949−2021, making train/val/test splits are based on existing work or determined by a
this a very large dataset by TSF standards. Many of the longest temporal splits where the most recent data forms the test set. Data
active stations are airport weather stations. We use the airport is organized into time sequences (𝑥) and variable values (𝑦). We
codes ACT (Waco, TX), ABI (Abilene, TX), AMA (Amarillo, TX), follow established variable normalization schemes of prior work to
ALB (Albany, New York) as well as the two largest airports in the ensure a fair comparison, and default to z-score normalization in
New York City metro, LGA and JFK. The NY-TX benchmark was other cases. Most real-world datasets use 𝑥 values that correspond
created for this paper, and has two key advantages when evaluat- to date/time information. We represent calendar dates by splitting
ing timeseries Transformers. First, it has far more timesteps than the information into separate year, month, day, hour, minute, and
popular datasets; Transformers are data-hungry, but we can more second values and then re-scaling each to be ∈ [0, 1]. This works
safely ignore overfitting issues here. Second, this dataset is almost out so that only the year value is unbounded, but we divide by the
perfectly stationary; popular datasets are not, and this creates latest year present in the training set. We discard time variables
an evaluation problem that is not related to the modeling power that are not robust or prone to overfitting. For example, a dataset
of the architecture. This issue has been widely overlooked and that only spans two months would drop the month and year values.
allows basic linear models that are more robust to non-stationary Embedding. The time variables 𝑥 are passed through a Time2Vec
to outperform recent methods that are more powerful on paper. layer [29]. If 𝑥 is a time representation with three elements {hour,
For more on this see [87] and Appendix B.4. minute, second}, the Time2Vec output would be shape (3, 𝑘) where
• Metr-LA Traffic [36]. A popular benchmark dataset in the GNN 𝑘 is the time embedding dimension. The first of the 𝑘 elements has
literature consisting of traffic measurements from Los Angeles no activation function while the remaining 𝑘 − 1 use a sine func-
highway sensors at 5-minute intervals over 4 months in 2012. tion with trainable parameters to represent periodic patterns. After
Both the context and target sequence lengths are set to 12. We flattening (3, 𝑘) → (3 × 𝑘, ), the time embedding is concatenated
use the same train/test splits as [36]. with its corresponding 𝑦 value. When using our spatiotemporal
• AL Solar [32]. A popular benchmark dataset in the time series embedding, the y value will be a single scalar and time values are
literature consisting of solar power production measurements duplicated to account for the flattened sequence. Note that we use
Long-Range Transformers for Dynamic Spatiotemporal Forecasting

terminology like “flatten" because most datasets in practice are The choice of attention implementation is flexible across the archi-
structured such that there are 𝑁 variables sampled at the same tecture, allowing progress in the sub-field of long-range attention
moment in time, and the conversion to the spatiotemporal format to improve future scalability. We default to the ReLU version of
looks like we have laid out the rows of a dataframe end-to-end. Performer [9] due to its memory savings and compatibility with
However, embedding the value of each variable at every timestep both self and cross attention. However, we also experimented with
as its own separate token gives us a lot of flexibility to use alternate Nystromformer [79] and ProbSparse attention [92].
dataset formats where variables may be sampled at different inter- Metrics. For completeness, the evaluation metrics used in our
vals. We do not take advantage of this in our experimental results results tables are listed below. 𝑦𝑡𝑛 and 𝑦^𝑡𝑛 correspond to the true and
because it is not relevant to common benchmark datasets and is predicted values of the 𝑛th variable at timestep 𝑡, respectively. 𝑦¯ is
not applicable to all the baselines we consider. shorthand for the mean value of 𝑦.
The combined value and time are projected to the Transformer
model dimension (𝑑) with a single feed-forward layer. We experi-
mented with two approaches to the position embedding. The first re- 𝑁 𝑇 +ℎ
purposes Time2Vec to learn a periodic 𝑑-dimensional embeddings - 1 ∑︁ ∑︁ 𝑛
MSE := (𝑦𝑡 − 𝑦^𝑡𝑛 ) 2 (5)
essentially learning the frequency hyperparameters of the original ℎ𝑁 𝑛=1
𝑡 =𝑇
fixed position embedding [65] automatically. We also experimented 𝑁 𝑇∑︁+ℎ
with fully learnable lookup-table-style position embeddings com- 1 ∑︁
MAE := (𝑦𝑡 − 𝑦^𝑡 ) (6)
monly used for word embeddings in natural language processing. ℎ𝑁 𝑛=1
𝑡 =𝑇
Both are provided in the code and appeared to lead to similar perfor- v
u
t 𝑁 𝑇 +ℎ
mance. However, we decided that the fully learnable option was the 1 ∑︁ ∑︁
RMSE := (𝑦𝑡 − 𝑦^𝑡 ) 2 (7)
safer choice to make sure the position embedding had the freedom ℎ𝑁 𝑛=1
𝑡 =𝑇
to differentiate itself from the other components that make up our v

t 𝑁 Í𝑇 +ℎ (𝑦 − 𝑦^ ) 2
embedding scheme. Position indices are repeated as necessary to 𝑛=1 𝑡 =𝑇 𝑡 𝑡
account for the flattened 𝑦 values. Variable embeddings are imple- RRSE := Í𝑁 Í𝑇 +ℎ (8)
𝑛=1 𝑡 =𝑇 (𝑦𝑡 − 𝑦) ¯2
mented similarly with indices assigned arbitrarily from 0, . . . , 𝑁 .
𝑁 𝑇 +ℎ
The repeating pattern of tokens’ variable and position indices is 1 ∑︁ ∑︁ (𝑦𝑡 − 𝑦^𝑡 )
best explained by Figure 2. MAPE := (9)
ℎ𝑁 𝑛=1 𝑦𝑡
𝑡 =𝑇
“Given" embeddings are a third lookup-table layer with two en-
tries that indicate whether the 𝑦 value for a token contains missing
values. They are only used in the encoder’s embedding layer be- The main detail to note here is that we are reporting the average
cause all of the values are missing/empty in the decoder sequence. over the length of the forecast sequence of ℎ timesteps. However, it
The value+time, variable, position, and given embeddings sum to is somewhat common (especially in the ST-GNN datasets) to report
create the final 𝑑-dimensional embedding. metrics for multiple timesteps independently to see how accuracy
Architecture and Training Loop. There is significant empiri- changes as we get further from the known context. We take the
cal work investigating technical improvements to the Transformer mean of all reported timesteps to recover the metrics of related
architecture and training routine [25, 41, 73, 94, 1, 17]. We incorpo- work for Tables 4 and 5. All metrics discard missing values.
rate some of these techniques to increase performance and hyperpa- As mentioned in Sec 3.4, the output of our model is a sequence
rameter robustness while retaining simplicity. A Pre-Norm architec- of predictions that can be restored to the original (𝐿, 𝑁 ) input for-
ture [78] is used to forego the standard learning rate warmup phase. mat. This lets us choose from several different loss functions. Most
We also find that replacing LayerNorm with BatchNorm is advanta- datasets use either Mean Squared Error (MSE) or Mean Absolute
geous in the time series domain. [58] argue that BatchNorm is more Error (MAE). We occasionally compare Root Relative Squared Error
popular in computer vision applications because reduced training (RRSE) and Mean Absolute Percentage Error (MAPE) - though they
variance improves performance over the LayerNorms that are the are never used as an objective function. We train all models with
default in NLP. Our experiments add empirical evidence that this early stopping and learning rate reductions based on validation
may also be the case in time series problems. We also experiment performance.
with PowerNorm [58] and ScaleNorm [47] layers with mixed results. Hyperparameters and Compute. Due to the large number of
All four variants are included as a configurable hyperparameter in experiments and baselines used in our results, we choose to defer
our open-source release. hyperparamter information to the source code by providing the nec-
The encoder and decoder have separate embedding layers to essary training commands. This lets us include the settings of minor
enable the context sequence to contain a different set of input vari- details that have not been explained in writing. Spacetimeformer’s
ables than those we are forecasting in the target sequence. Dropout long-sequence attention architecture is mainly constrained by GPU
can be applied to: 1) query, key, and value layer parameters. 2) The memory. Most results in this paper were gathered on 1 − 4 12GB
output of embedding layers. 3) The attention matrix (when applica- cards, although larger A100 80GB GPUs were used in some later
ble). 4) The feed-forward layers at the end of each encoder/decoder results (e.g., Pems-Bay). We hope to provide training commands
layer. Local attention layers are implemented with rearrange-style that serve as a competitive alternative for resource-constrained set-
operations [55] and can be used with any efficient attention variant. tings. Training is relatively fast due to small dataset sizes, with the
longest results being Metr-LA and Pems-Bay at roughly 5 hours.
Jake Grigsby, Zhe Wang, Nam Nguyen, and Yanjun Qi

B.3 Toy Experiments Next we run the same Transformer with spatiotemporal atten-
Shifted Copy Task. Copying binary input sequences is a common tion and see a roughly 17× reduction in MAE, with outputs that are
test of long-sequence or memory-based models [21]. We generate near-perfect reproductions of the input (Figure 10). By flattening
binary masks with shape (𝐿, 𝑁 ), where elements are set to 1 with the context and target tokens into a spatiotemporal graph, the vari-
some probability 𝑝. Each of the 𝑁 variables is associated with ables of each timestep are given independent attention mechanisms.
a “shift" value. The target sequence is a duplicate of the context Each attention head is capable of learning 𝑁 2 relationships between
sequence with each variable zero-padded and translated forward variables. Of course, this toy problem only has 𝑁 important spatial
in time by its shift value. An example with 𝐿 = 100, 𝑁 = 5, and relationships (each variable only needs to attend to itself). Figure
shifts of {0, 5, 10, 15, 25} is shown in Figure 7. The rows have been 11 shows the spatiotemporal pattern with attention shifting along
color-coded to make the shift easier to identify. the diagonal.

Dependent Sine Waves. Next we recreate a version of the toy


dataset used to emphasize the necessity of spatial modeling in [59].
We generate 𝐷 sequences where sequence 𝑖 at timestep 𝑡 is defined
by:
𝐷
2𝜋𝑖𝑡 1 ∑︁ 2𝜋 𝑗𝑡
𝑌𝑡𝑖 = 𝑠𝑖𝑛( )+ 𝑠𝑖𝑛( ) (10)
64 𝐷 + 1 𝑗=1,𝑗≠𝑖 64
We map 2, 000 timesteps to a sequence of daily calendar dates
Figure 7: Shifted Copy Task. Copy a binary sequence with beginning on Jan 1, 2000. We set 𝐷 = 20 and use a context length
each row shifted by increasing amounts top to bottom (col- of 128 and a target length of 32. The final quarter of the time series
orized for visualization). is held out as a test set.
Several ablations of our method are considered. Temporal modi-
fies the spatiotemporal embedding as discussed at the end of Sec
First we train a standard (“Temporal") Transformer on an eight 3.4. ST Local skips the global attention layers but includes spa-
variable version of the shifted copy task. An example of the correct tial information in the embedding. The “Deeper" variants attempt
output (bottom) and predicted sequence (top) is shown in Figure 8. to compensate for the additional parameters of the local+global
While the first few rows appear to be accurately copied, the lower attention architecture of our full method. All models use a small
variable outputs are blurry or missing entirely. Transformer model and optimize for MSE. The results are shown
in Table 8. The Temporal embedding is forced to compromise its
attention over timesteps in a way that reduces predictive power
over variables with such distinct frequencies. Standard (“Full") at-
tention fits in memory with the Temporal embedding but is well
approximated by Performer. Our method learns an uncompromis-
ing spatiotemporal relationship among all tokens to generate the
most accurate predictions by all three metrics.
Figure 8: Temporal attention discrete copy example. Top im-
age shows predicted output sequence; bottom shows ground- Temporal
Temporal
ST Local
Temporal (Deeper & ST Local Spacetimeformer
truth output sequence. Standard attention forces each col- (Deeper)
Full Attn)
(Deeper)
umn to share attention graphs, leading to a “blurry" copy MSE 0.006 0.010 0.005 0.021 0.014 0.003
output on variables with high shift. MAE 0.056 0.070 0.056 0.104 0.090 0.042
RRSE 0.093 0.129 0.094 0.180 0.153 0.070
Table 8: Toy Dataset Results
To try and understand where the model is going wrong, we
visualize the cross attention patterns of several heads in the first
decoder layer. The results are shown in Figure 9. To perform this
task correctly, the tokens of each element in the target sequence
need to learn to attend to a specific position in the context sequence.
B.4 Distributional Shift and Time Series
The problem is that each variable needs to attend to different posi- Transformers
tions due to their temporal shift. The standard Transformer groups While collecting baseline results on ETT and other datasets used
all of the variables into the same token, making it very difficult to in recent Transformer TSF work we noticed significant gaps in
pick timesteps to attend to that are useful for all of them. Instead it performance between train and test set prediction error. However,
correctly learns the shift of a single variable per attention head and the apparent overfitting effect showed few signs of improvement
eventually runs out of heads, leading to the inaccurate results in when given additional data and smaller, highly-regularized archi-
the bottom half of the output. Note that in this problem the correct tectures. Recently, [87] released an investigation into the failures
attention pattern would form a line where the distance from the of Transformers in long-term TSF. They show that a simple model,
diagonal corresponds to a variable’s shift value. “DLinear," can outperform Transformers in a variety of common
Long-Range Transformers for Dynamic Spatiotemporal Forecasting

Figure 9: Temporal attention fails with too few heads. Because each variable (row) requires an independent attention graph, a
strong optimizer will attempt to put one relationship on each attention head. However, we quickly run out of heads and are
left with an inaccurate copy. In a more complex problem, Temporal attention could require as many as 𝑁 2 heads to accurately
model every variable relationship.

linear combination of the previous timesteps. We verify a similar


result with the tiny LinearAR model in Table 9 - nearly matching
the performance of ETSFormer on the ETTm1 dataset.
Comparisons between the NY-TX Weather dataset - where LinearAR
performs worse than our Transformer baselines - and ETTm1 show
the latter dataset has a much larger gap between the magnitude
of the train and test sequences. This type of distributional shift is
common in non-stationary time series and is caused by the use of
the most recent data as a test set; long-term trends that begin in
the train set can continue into the test set and alter the distribution
of our inputs. We hypothesize that distribution shift is a key part
of the Transformer’s test set inaccuracy and that linear models are
less effected because their outputs are a more direct combination of
the magnitude of their inputs. Reversible Instance Normalization
(RevIN) [30] normalizes a model’s input based on the statistics of
each context sequence so that its parameters are less sensitive to
changes in scale. The predicted output can then be inverse normal-
ized to the original range. We experiment with RevIN as a way to
combat distribution shift on both ETTm1 and the popular Weather
dataset. The results are displayed in Tables 9 and 10. Input normal-
ization improves the performance of all models and even makes
LSTM competitive with advanced Transformers. This may suggest
that the improved performance of recent methods may not be the
Figure 10: Spatiotemporal attention represents 𝑁 2 relation- result of more complex architectures but from improved resistance
ships per head. Thanks to its spatiotemporal sequence em- to distributional shift. We investigate this further by implementing
bedding, Spacetimeformer can represent the underlying seasonal decomposition - another common detail of Transformers
variable relationship in a single accurate head. This capabil- in time series. Seasonal decomposition separates a series into short
ity spares room for optimization errors and multiple rela- and long-term trends to be processed independently, and is also
tionships between variables in more complex problems. included in the DLinear model [87]. We evaluate its impact on the
Weather dataset in Table 10. Performance is comparable to instance
normalization, potentially because decomposition has the effect of
standardizing inputs by transforming them into a difference from a
moving average. The Spacetimeformer results in Table 2 use both
seasonal decomposition and reversible instance normalization for
further improved results.
However, input normalization is not always enough, especially
Figure 11: Spacetimeformer outputs an accurate copy (top) when patterns are in fact scale-dependent. For a brief example
of the input sequence (bottom) despite the row shift. consider a continuous version of the copy task in the previous
section using sine curves instead of binary masks. We randomly
generate 𝑁 different context and target sequences according to:
benchmarks. Their DLinear model is a slightly more advanced ver-
sion of our LinearAR baseline, which predicts each timestep as a
Jake Grigsby, Zhe Wang, Nam Nguyen, and Yanjun Qi

Prediction Length
24 48 96 288 672
𝑦 (𝑡 ) = 𝑎 sin(𝑏𝑡 + 𝑐) + 𝑑 + 𝜖 (𝑡), 𝜖 (𝑡) ∼ N (0, .1)
LSTM 0.63 0.94 0.91 1.12 1.56
Repeat Last 0.54 0.76 0.78 0.83 0.87
where 𝑎, 𝑏, 𝑐 and 𝑑 are parameters drawn from distributions that Informer 0.37 0.50 0.61 0.79 0.93
can vary between the train and test sets. Our experiments show Pyraformer 0.49 0.66 0.71
that RevIN-style input normalization can reduce the generalization YFormer 0.36 0.46 0.57 0.59 0.66
gap for out-of-distribution 𝑑 to nearly zero for LSTM, temporal and Preformer 0.40 0.43 0.45 0.49 0.54
spatiotemporal attention models. However, if we shift the curve Autoformer 0.40 0.45 0.46 0.53 0.54
between the context and target such that: ETSFormer 0.34 0.38 0.39 .42 .45
LSTM + RevIN 0.37 0.44 0.47 0.53 0.55
𝑦context (𝑡) = 𝑎 sin(𝑏𝑡 + 𝑐) + 𝑑 + 𝜖 (𝑡), 𝜖 (𝑡) ∼ N (0, .1) Temporal + RevIN .32 0.40 0.44 0.50 0.55
Spatiotemporal + RevIN 0.34 0.40 0.45 0.50 0.57
𝑦target (𝑡) = 𝑦𝑐𝑜𝑛𝑡𝑒𝑥𝑡 (𝑡) + (1 + |𝑑 |) 2
LinearAR 0.33 .37 .39 0.44 0.48
RevIN is unable to generalize. This is because when we pass Table 9: Normalized Mean Absolute Error (MAE) on ETTm1
our model a normalized input we are discarding the magnitude (𝑑) test set. “Repeat Last" is a simple heuristic that predicts the
information necessary to make accurate predictions. An example most recent value in the context sequence for the duration
of a Spacetimeformer+RevIN prediction is plotted in Figure 12. of the forecast.
The behavior of many real-world time series also changes with
the overall magnitude of the variable and more work is needed to
make large Transformer TSF models robust to small, non-stationary
datasets. Prediction Length
96 192 336 720
Informer 0.38 0.54 0.52 0.74
LSTM (from Autoformer) 0.41 0.44 0.45 0.52
Autoformer 0.34 0.37 0.40 0.43
Linear Shared 0.24 0.29 0.32 0.36
Linear Ind Decomp 0.22 0.27 0.31 0.36
LSTM (Ours) 0.30 0.34 0.41 0.53
LSTM RevIN 0.23 0.26 0.32 0.36
LSTM Decomp 0.23 0.27 0.30 0.36
Transformer (Ours) 0.24 0.33 0.40 0.42
Transformer RevIN 0.23 0.27 0.35 0.37
Transformer Decomp 0.24 0.27 0.31 0.37
Figure 12: Modeling Without Magnitude. Input normaliza- Table 10: Normalized Mean Absolute Error (MAE) on the
tion removes the context information that is needed to Weather test set. “Linear Ind Decomp" adds independent pa-
make scale-dependent predictions. rameters for each variable to a linear model with seasonal
decomposition.

B.5 Ablations
Ablation results are listed in Table 11. During our development
process, we became interested in the amount of spatial information spatial tasks like Metr-LA, accuracy begins to decline over time.
that makes it through the encoder and decoder layers. We created Because this decline corresponds to increased forecasting accuracy,
a way to measure this by adding a softmax classification layer to we interpret this as a positive indication that related nodes are
the encoder and decoder output sequences. This layer interprets grouped with similar variable embeddings that become difficult for
each token and outputs a prediction for the time series variable it a single-layer classification model to distinguish.
originated from. Importantly, we detach this classification loss from The relative importance of space and time embedding infor-
the forecasting loss our model is optimizing; only the classification mation changes based on whether the problem is more temporal
layer is trained on this objective. Classification accuracy begins like NY-TX Weather or spatial like traffic forecasting. While global
near zero but spikes upwards of 99% in all problems within a few spatiotemporal attention is key to state-of-the-art performance on
hundred gradient updates due to distinct (randomly initialized) Metr-LA, it is interesting to see that local attention on its own is
variable embeddings that are well-preserved by the residual Pre- a competitive baseline. We investigate this further by removing
Norm Transformer architecture. In TSF tasks like NY-TX weather, both global attention and the variable embeddings that differentiate
this accuracy is maintained throughout training. However, in more traffic nodes, which leads to the most inaccurate predictions. This
Long-Range Transformers for Dynamic Spatiotemporal Forecasting

implies that the local attention graph is adapting to each node based of semi-independent series that do not occur on the same time
on the static spatial information learned by variable embeddings. interval and therefore cannot be cast in the standard multivariate
format. For example, we might have the sales data of a collection of
Classification Acc. thousands of different products sold in different stores several years
MAE
(%) apart. ML-based solutions use a univariate context/target window
NY-TX Weather (𝐿 = 200, 𝑁 = 6) that is trained jointly across batches from every series. Implicitly,
these models need to infer the behavior of the current series they
Full Spatiotemporal 2.57 99
are observing from this limited context. Long-sequence Transform-
No Local Attention 2.62 99
No Variable Embedding 2.66 45
ers create an opportunity for genuine “in-context" learning [31, 4]
Temporal Embedding/Attention 2.67 - in the time series domain, where the context sequence is all existing
No Value Embedding 3.83 99 data for a given varible up until the current moment, and the target
No Time Embedding 4.01 99 sequence is all future values we might want to predict. A model
trained in this format would have all available information to iden-
Metr-LA Traffic (𝐿 = 24, 𝑁 = 207)
tify the characteristics of the current series and learn generalizable
Full Spatiotemporal 2.83 58 patterns for more accurate forecasting. The main implementation
No Global Attention 3.00 46 difference here is the varying number of samples per series, which
No Time Embedding 3.11 50 requires the use of sequence padding and masking and can limit
Temporal Embedding/Attention 3.14 - our choice of efficient attention variant. For multi-series problems,
No Local Attention 3.27 54 we revert to vanilla (quadratic) attention with arbitrary masking,
No Variable Embedding 3.29 2 and make use of shifted-window attention (Sec. 3.3) to reduce GPU
No Global Attention, No Variable Emb. 3.48 1 memory usage. Our open-source code release includes fully imple-
Table 11: Ablation Results. mented training routines for the Monash dataset [20], M4 dataset
[46], and Wikipedia site traffic Kaggle competition. Preliminary
results suggest this method can be competitive with leaderboard
scores of ML-based approaches on the M4 and Wikipedia compe-
B.6 Future Directions in Multi-series titions, and we hope that our codebase can be a starting point for
Generalization further development on this topic. The multivariate attention ca-
Some time series prediction problems - particularly competitions pabilities of Spacetimeformer allow an extension of meta-TSF to
like the M series [46] and Kaggle - are based on a large collection datasets of multiple multivariate time series.

You might also like