
IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 22, NO. 8, AUGUST 2021

A Spatial–Temporal Attention Approach for Traffic Prediction

Xiaoming Shi, Heng Qi, Yanming Shen, Genze Wu, and Baocai Yin, Member, IEEE

Abstract—Accurate traffic forecasting is important to enable intelligent transportation systems in a smart city. This problem is challenging due to the complicated spatial, short-term temporal, and long-term periodical dependencies. Existing approaches have considered these factors in modeling. Most solutions apply CNN, or its extension Graph Convolution Networks (GCN), to model the spatial correlation. However, the convolution operator may not adequately model the non-Euclidean pair-wise correlations. In this paper, we propose a novel Attention-based Periodic-Temporal neural Network (APTN), an end-to-end solution for traffic forecasting that captures spatial, short-term, and long-term periodical dependencies. APTN first uses an encoder attention mechanism to model both the spatial and periodical dependencies. Our model can capture these dependencies more easily because every node attends to all other nodes in the network, which brings a regularization effect to the model and avoids overfitting between nodes. Then, a temporal attention is applied to select relevant encoder hidden states across all time steps. We evaluate our proposed model using real-world traffic datasets and observe consistent improvements over state-of-the-art baselines.

Index Terms—Attention mechanism, traffic prediction, neural networks.

Manuscript received July 6, 2019; revised December 6, 2019 and February 27, 2020; accepted March 23, 2020. Date of publication April 9, 2020; date of current version August 9, 2021. This work was supported in part by the National Natural Science Foundation of China under Grant U1811463 and Grant 61772112, and in part by the Innovation Foundation of Science and Technology of Dalian under Grant 2018J11CY010 and Grant 2019J12GX037. The Associate Editor for this article was Y. Kamarianakis. (Corresponding author: Yanming Shen.)

Xiaoming Shi, Heng Qi, and Genze Wu are with the School of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian 116024, China. Yanming Shen is with the School of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian 116024, China, and also with the Key Laboratory of Intelligent Control and Optimization for Industrial Equipment, Ministry of Education, Dalian University of Technology, Dalian 116024, China (e-mail: [email protected]). Baocai Yin is with the School of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian 116024, China, and also with the Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China.

Digital Object Identifier 10.1109/TITS.2020.2983651

I. INTRODUCTION

Accurate traffic prediction is essential to many real-world applications. For example, traffic volume prediction can help a city better manage traffic to alleviate congestion, and car-hailing demand prediction can help car-sharing companies pre-allocate cars to high-demand regions. The growing number of available traffic-related datasets provides potential new perspectives on this problem. However, due to complex spatial-temporal correlations, this problem is challenging.

Time-series based approaches, such as autoregressive integrated moving average (ARIMA), Kalman filtering, and latent space models, have been widely applied to traffic prediction problems ([4], [9], [14]), but they do not capture the complex non-linear spatial-temporal dependency well. Recent advances in deep learning enable promising results in modeling the complex spatiotemporal relationships in traffic forecasting. Existing deep learning approaches usually adopt a CNN for spatial correlation extraction and an RNN, or its variants LSTM/GRU, for temporal dependency modeling. For example, several studies ([27], [28]) have modeled citywide traffic as a heatmap image and used a CNN to model the non-linear spatial dependency, and [25] used a recurrent neural network based framework for modeling temporal dependency. Recent studies further proposed methods to jointly model spatial, temporal, and external feature dependencies by integrating CNN and LSTM ([10], [16], [22], [23]). However, these convolution based approaches may not adequately model the spatial correlation, including non-Euclidean pair-wise correlations, since the convolution relies on Euclidean distance to capture spatial correlation. Reference [5] alleviates this problem using multi-graph convolutions, which take into account distance, functional similarity, and transportation connectivity when modeling spatial dependencies, but it still relies on underlying spatial structures, e.g., the distance, functional similarity, and transportation connectivity of different regions.

Encoder-decoder networks ([2], [18]) have also been applied to traffic prediction. The key idea is to encode the source sequence as a fixed-length vector and use the decoder to generate the prediction. One problem with encoder-decoder networks is that their performance deteriorates rapidly as the length of the input sequence increases. To resolve this issue, the attention mechanism was proposed in [1]. However, such models may not be suitable for traffic prediction, since they do not account for the spatial correlation.

In this paper, we propose the Attention-based Periodic-Temporal neural Network (APTN), which models the spatial, short-term, and long-term periodical dependencies. APTN incorporates a novel attention based encoder-decoder architecture. It first processes the long-term periodic input with a recurrent skip neural network, and then encodes the spatial and periodical dependencies in the encoder. In the decoder, a temporal attention mechanism is applied to capture the dependencies from encoder hidden states across all time steps. In this way, APTN can adaptively select the most relevant input features as well as capture the long-term temporal dependencies appropriately. The proposed spatial attention learns the weight of each node, which represents its correlation with the entire graph. In other words, the nodes with different weights can reconstruct the entire graph. The motivation is that the values in the entire graph are related to the values of several most important nodes, which brings a regularization effect to the model and avoids overfitting between nodes.


The main contributions of this paper are:
• We propose a novel end-to-end framework for traffic prediction, which models spatial, short-term, and long-term periodical dependencies using an attention mechanism.
• We design an attention mechanism for obtaining dynamic spatial correlation. Compared with CNN/GCN approaches, it better captures the spatial dependencies, leading to a significant performance gain.
• Through extensive experiments, we show that our model outperforms state-of-the-art methods.

This paper is organized as follows. Section II presents related work. In Section III, we introduce notations and give the problem definition. Section IV describes the detailed design of our proposed model. The experimental evaluation is presented in Section V, and Section VI concludes this paper.

II. RELATED WORK

Data-driven traffic prediction has received wide attention for decades. Essentially, the problem of traffic prediction is to predict a traffic-related value for a location at a given timestamp based on historical data.

Deep learning models provide a promising way to capture non-linear spatio-temporal relations in traffic prediction. In a seminal work, [27] applied a convolutional structure to capture spatial correlation for traffic volume prediction. Sequential dependency is modeled using a recurrent neural network in [25]. However, while these studies explicitly model temporal sequential dependency or spatial dependency, none of them consider both aspects simultaneously. Recently, several studies use convolutional LSTM [16], [17] to handle spatial and temporal dependency for taxi demand prediction [10]. [23] further proposed a multi-view spatial-temporal network for demand prediction, which learns the spatial-temporal dependency simultaneously by integrating LSTM, local CNN, and semantic network embedding. [22] improves the work in [23] by modeling dynamic spatial similarity and long-term periodic information with a flow gating and periodically shifted attention mechanism. For these approaches, the spatial structure of traffic data is formulated as a matrix whose entries represent rectangular regions. Regions and their pair-wise relationships then naturally form a Euclidean structure, and consequently convolutional neural networks are leveraged for effective prediction.

Based on the road network, several studies extended traditional CNN and RNN structures to graph-based CNN and RNN for traffic prediction, such as graph convolutional GRU ([6], [12], [24]) and graph attention ([26]). These approaches extend traffic prediction from Euclidean space to non-Euclidean road networks. However, in these studies, the similarity between roads is generally based on static distance, road structure, and/or semantics. These static approaches are not always accurate. For example, consider two stadiums: they are semantically similar, but if one is hosting an event and the other is not, their semantic similarity may lead to a false conclusion. Also, spatial proximity or semantic similarity only captures partial spatial correlations; connectivity or other factors may also have an effect. Reference [15] uses input attention to capture the correlation between a time series and others. Similarly, [13] introduces global spatial attention to capture the correlation between the target time series of a sensor and the time series of other sensors, and further uses local spatial attention to capture the correlation between a feature time series and other time series of the same sensor. The spatial attention weights in these two models are used to evaluate the importance of different sensors on the predicted target sensor. Therefore, these two methods output predictions for a particular sensor, i.e., they train a model for each sensor, which takes a lot of time when predictions are needed for the entire network. Also, these approaches overlook the long-term periodic influence. In summary, our proposed model explicitly handles dynamic spatial similarity as well as short-term and long-term temporal dependencies.

III. NOTATIONS AND PROBLEM FORMULATION

We define the set of road traffic volume sensors as A = {1, 2, . . . , N}. The entire time period (e.g., one month) is split into equal-length continuous time intervals. Let x_t^i represent the observation of the traffic volume sensor in time interval t at road segment i. Then, x_t = (x_t^1, x_t^2, . . . , x_t^N) ∈ R^N represents the observations of the traffic volume sensors for all road segments at time t. The notations in this paper are listed in Table I.
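To make the notation concrete, the following small NumPy sketch (illustrative only, not from the paper's code release; the array name `readings` is our own) shows how raw sensor readings map to x_t and to the history matrix used later in the problem definition:

```python
import numpy as np

# Assume one traffic-volume value per 5-minute interval t and per road segment i.
T, N = 288, 170                    # e.g., one day of 5-minute slots for 170 detectors
readings = np.random.rand(T, N)    # placeholder for real PeMS measurements

x_t = readings[10]                 # x_t in R^N: all segments at one time interval
X = readings.T                     # X in R^{N x T}: the historical observations
assert x_t.shape == (N,) and X.shape == (N, T)
```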


Problem Definition: Given the historical observations X = (x_1, x_2, . . . , x_T) ∈ R^{N×T}, traffic volume prediction aims to predict X̂ = (x̂_{T+1}, x̂_{T+2}, . . . , x̂_{T+K}) ∈ R^{K×N}, where K is the horizon of interest for different tasks.

IV. ATTENTION-BASED PERIODIC-TEMPORAL NEURAL NETWORK

The key point of APTN is that we model spatial, temporal, and periodical correlations using an attention mechanism. Figure 1 presents the architecture of the proposed APTN. First, a fully connected neural network is applied to extract the features of the input vectors. After that, we process the long-term periodic input with a recurrent skip neural network. Then, in the encoder, we propose a novel attention mechanism that encodes the spatial and periodical dependencies. Finally, in the decoder, a temporal attention mechanism is applied to capture the dependencies from encoder hidden states across all time steps.

Fig. 1. The architecture of our proposed solution, where AR stands for the autoregressive model and FC stands for a fully connected neural network.

For traffic datasets, there is a clear daily/weekly pattern. To predict the traffic at time t for today, in addition to the most recent records, an accurate model also needs to leverage the records at t on historical days. Therefore, in our model, we consider both short-term and long-term periodical dependencies. To model short-term temporal dependencies, the recent time slot inputs are X_S = (x_{1,n+1}, . . . , x_{Ts-1,n+1}, x_{Ts,n+1}) ∈ R^{Ts×N} (Figure 2, shaded in red), where Ts is the number of slots that our model utilizes and x_{t,n+1} represents the traffic volume at time t of the current period. For long-term periodical modeling, we need Ts slots of data within each period. Assuming n periods are considered, the data we need is shown in Figure 2 (shaded in blue), denoted by X_L = (x_{1,1}, x_{2,1}, . . . , x_{Ts,1}, x_{1,2}, . . . , x_{Ts,n}) ∈ R^{nTs×N}, where x_{t,j} in X_L represents the traffic volume at time t of the j-th period. Let Tl denote the period parameter (usually one day for traffic data). X_S and X_L are the inputs to our model and are fed to a fully-connected layer.

Fig. 2. Illustration of the input sequence.

The fully-connected layer extracts the feature representation of the input vectors,

z_t = g(x_t) = ReLU(x_t W_v + b_v),   (1)

where ReLU(x) = max(0, x) is the activation function, g is the mapping from an input vector to its feature representation, W_v ∈ R^{N×v} and b_v ∈ R^v are the learnable parameters of g, v is the feature representation dimension, and z_t ∈ R^v is the zipped representation of x_t. Then Z_L = g(X_L) = (z_{1,1}, z_{2,1}, . . . , z_{Ts,1}, z_{1,2}, . . . , z_{Ts,n}) ∈ R^{nTs×v} and Z_S = g(X_S) = (z_{1,n+1}, z_{2,n+1}, . . . , z_{Ts,n+1}) ∈ R^{Ts×v} represent the feature representation matrices of the long-term data X_L and the short-term data X_S respectively, and are the inputs to the attention part, where z_{t,j} in Z_L is the embedding at time t of the j-th period and z_{t,n+1} in Z_S is the embedding at time t of the current period.

A. Long-Term Periodical Component

Our long-term periodical component is based on recurrent neural networks, specifically LSTM. The operations of an LSTM unit can be formulated as [7],

f_t = σ_g(W_f [h_{t-1}; i_t] + b_f)
ı_t = σ_g(W_i [h_{t-1}; i_t] + b_i)
o_t = σ_g(W_o [h_{t-1}; i_t] + b_o)
s_t = f_t ◦ s_{t-1} + ı_t ◦ tanh(W_s [h_{t-1}; i_t] + b_c)
h_t = o_t ◦ tanh(s_t),

where ◦ is the element-wise product, f_t, ı_t, and o_t are the forget gate, input gate, and output gate respectively, i_t is the input of the LSTM unit, h_t is the hidden state at time t, [h_{t-1}; i_t] is the concatenation of the previous hidden state and the current input, tanh is the hyperbolic tangent function, and σ_g is the sigmoid function. The above formulas can be summarized as

h_t = f_LSTM(h_{t-1}, i_t),

where f_LSTM is the mapping from i_t to h_t that the LSTM learns. In this paper, for consistency, we set the dimension of the hidden representation of all LSTM units to the same value, m.

The long-term periodical dependencies can hardly be captured by standard LSTM or GRU [3] units due to the vanishing gradient problem. Therefore, as shown in Figure 3, to model the long-term periodical dependency we use a recurrent structure with temporal skip-connections [11], where skip-links are added between the current hidden cell and the hidden cells at the same phase of adjacent periods. The input to the long-term periodical component is the long-term embedded traffic information Z_L, and the updating process can be formulated as

h^L_{t,j} = f_p(h^L_{t,j-1}, z_{t,j})  for t ∈ [1, Ts], j ∈ [1, n],   (2)

where f_p is the mapping function that the periodic LSTM learns, and h^L_{t,j-1} ∈ R^m is the hidden state of the LSTM unit at time t in the previous period (Tl slots earlier).


Fig. 3. The recurrent skip connection.

As illustrated in Figure 3, for a given time t we take its input series (z_{t,1}, z_{t,2}, . . . , z_{t,n}) over the n periods and feed it to the periodic LSTM to obtain n hidden states (h^L_{t,1}, h^L_{t,2}, . . . , h^L_{t,n}); the last hidden state h^L_{t,n} is taken as the output. We do the same for all Ts phases and finally obtain Ts hidden states H^L = (h^L_{1,n}, h^L_{2,n}, . . . , h^L_{Ts,n}) ∈ R^{Ts×m}. We call H^L the long-term periodical representations, which will be fed to the encoder.

B. Encoder With Spatial Attention

Traffic is highly spatiotemporally correlated. To capture the complicated dependencies, most existing solutions use CNN and/or RNN based approaches. However, these approaches usually rely on underlying spatial structures, such as distance, functional similarity, and connectivity. Attention mechanisms have become successful in sequence modeling, allowing dependencies to be modeled without regard to their distance in the input sequences. Inspired by this, we propose a novel attention-based encoder that captures the spatial correlations. It learns the weight of each node, which represents the correlation with the entire graph at a time.

Given the short-term traffic volume input Z_S and the output of the long-term temporal component H^L, we construct an LSTM-based encoder with attention, as shown in Figure 4, which calculates the correlations among different roads,

z_t^embed = ReLU(W_e Z_S + b_we)   (3)
l_t^embed = ReLU(U_l h^L_{t,n} + b_ul)   (4)
ẽ_t = ReLU(U_e [e_{t-1}; s_{t-1}] + b_ue)   (5)
α_t = softmax(V_e [z_t^embed; l_t^embed; ẽ_t] / √v),   (6)

where ";" means concatenation, W_e ∈ R^{Ts}, b_we ∈ R^v, U_e ∈ R^{v×2m}, b_ue ∈ R^v, U_l ∈ R^{v×m}, b_ul ∈ R^v, and V_e ∈ R^{v×3v} are learnable parameters, and e_{t-1} ∈ R^m and s_{t-1} ∈ R^m are the hidden state and cell state of the previous encoder LSTM unit. z_t^embed and l_t^embed are the embeddings of the zipped and long-term periodical representations respectively, and we calculate the correlations among the zipped representation z_t^embed, the long-term periodical representation l_t^embed, and the embedding of the current time ẽ_t. Our approach therefore takes into account spatial, short-term, and long-term temporal dynamics. The result α_t ∈ R^v is the spatial attention weight vector, measuring the importance of each value of z_{t,n+1} to the current entire graph. The motivation is that the values in the entire graph are related to the values of several most important nodes, which brings a regularization effect to the model and avoids overfitting between nodes. Following a similar motivation as in [19], given a large feature dimension, when the dot product becomes large in magnitude, it results in extremely small gradients for the softmax function. To address this problem, we scale the dot product by the square root of the feature dimension.

With the periodic vectors and the attention weights, we construct the input of the encoder LSTM. First, we calculate the weighted input r_t at time t,

r_t = α_t ◦ z_{t,n+1}.   (7)

Then we concatenate r_t and h^L_{t,n} as the encoder input, which considers both short-term and long-term periodic effects,

e_t = f_1(e_{t-1}, [r_t; h^L_{t,n}]),   (8)

where f_1 is the mapping function that the encoder LSTM learns, and e_t is the output at time t. Let E = (e_1, . . . , e_{Ts-1}, e_{Ts}) ∈ R^{Ts×m} be the encoder hidden states obtained by the encoder LSTM at each step.
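The sketch below gives one plausible reading of Eqs. (3)-(8), again as an illustration rather than the authors' implementation: the layer shapes are simplified, and the class and attribute names are ours. Three ReLU embeddings of the short-term feature, the long-term state, and the previous encoder state are concatenated, scored, and passed through a scaled softmax to obtain α_t, which gates the encoder input r_t.

```python
import math
import torch
import torch.nn as nn

class SpatialAttentionEncoder(nn.Module):
    def __init__(self, v: int, m: int):
        super().__init__()
        self.embed_z = nn.Linear(v, v)        # Eq. (3), simplified shape
        self.embed_l = nn.Linear(m, v)        # Eq. (4)
        self.embed_e = nn.Linear(2 * m, v)    # Eq. (5)
        self.score = nn.Linear(3 * v, v)      # V_e in Eq. (6)
        self.cell = nn.LSTMCell(v + m, m)     # f_1 in Eq. (8)
        self.v, self.m = v, m

    def forward(self, z_short, h_long):       # (batch, Ts, v), (batch, Ts, m)
        batch, Ts, _ = z_short.shape
        e = z_short.new_zeros(batch, self.m)
        s = z_short.new_zeros(batch, self.m)
        states = []
        for t in range(Ts):
            z_emb = torch.relu(self.embed_z(z_short[:, t]))           # Eq. (3)
            l_emb = torch.relu(self.embed_l(h_long[:, t]))            # Eq. (4)
            e_emb = torch.relu(self.embed_e(torch.cat([e, s], -1)))   # Eq. (5)
            logits = self.score(torch.cat([z_emb, l_emb, e_emb], -1))
            alpha = torch.softmax(logits / math.sqrt(self.v), dim=-1) # Eq. (6)
            r = alpha * z_short[:, t]                                 # Eq. (7)
            e, s = self.cell(torch.cat([r, h_long[:, t]], -1), (e, s))  # Eq. (8)
            states.append(e)
        return torch.stack(states, dim=1)      # E: (batch, Ts, m)
```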
C. Decoder With Temporal Attention

We use another LSTM-based recurrent neural network to decode the encoded information. Since the performance of an encoder-decoder architecture degrades rapidly as the length of the input sequence increases [1], an important improvement is to add a temporal attention mechanism, which can adaptively select the relevant encoder hidden states to produce the output sequence. In this way, we can better model the dynamic temporal correlations between different time intervals. Specifically, the attention weight at time t is calculated based upon the previous decoder hidden state d_{t-1} and the cell state s_{t-1},

e_t^embed = ReLU(E W_d + b_hd)   (9)
d̃_t = ReLU(U_d [d_{t-1}; s_{t-1}] + b_ud)   (10)
β_t = softmax(V_d [e_t^embed; d̃_t] / √Ts),   (11)

where W_d ∈ R^m, b_hd ∈ R^{Ts}, U_d ∈ R^{Ts×2m}, b_ud ∈ R^{Ts}, and V_d ∈ R^{Ts×2Ts} are learnable parameters, d_{t-1} ∈ R^m and s_{t-1} ∈ R^m are the hidden state and cell state of the previous decoder LSTM unit, and β_t ∈ R^{Ts} is the temporal attention weight vector, measuring the importance of each time step. We then use these attention weights and the short-term input vector to construct the input of the decoder,

c_t = β_t E   (12)
c̃_t = ReLU(W_c [c_t; z_{t,n+1}] + b_c),   (13)

where W_c ∈ R^{m×(m+v)} and b_c ∈ R^m are learnable parameters, and c_t ∈ R^m is the weighted sum of the encoder hidden states. The newly computed context vector c̃_t ∈ R^m is used to update the decoder hidden state at time t,

d_t = f_2(d_{t-1}, c̃_t),   (14)

where f_2 is the mapping function that the decoder LSTM learns.
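A companion sketch for the temporal attention decoder of Eqs. (9)-(14) follows, under the same caveats: it is a simplified illustration with our own class and variable names, not the released code. Encoder states E are scored against the previous decoder state, a softmax scaled by √Ts yields β_t, and the context c_t = β_t E is fused with the current short-term embedding before updating the decoder LSTM.

```python
import math
import torch
import torch.nn as nn

class TemporalAttentionDecoder(nn.Module):
    def __init__(self, v: int, m: int, Ts: int):
        super().__init__()
        self.embed_E = nn.Linear(m, 1)            # W_d, b_hd in Eq. (9)
        self.embed_d = nn.Linear(2 * m, Ts)       # U_d, b_ud in Eq. (10)
        self.score = nn.Linear(2 * Ts, Ts)        # V_d in Eq. (11)
        self.fuse = nn.Linear(m + v, m)           # W_c, b_c in Eq. (13)
        self.cell = nn.LSTMCell(m, m)             # f_2 in Eq. (14)
        self.Ts, self.m = Ts, m

    def forward(self, E, z_short):                # (batch, Ts, m), (batch, Ts, v)
        batch = E.shape[0]
        d = E.new_zeros(batch, self.m)
        s = E.new_zeros(batch, self.m)
        for t in range(self.Ts):
            e_emb = torch.relu(self.embed_E(E)).squeeze(-1)            # Eq. (9)
            d_emb = torch.relu(self.embed_d(torch.cat([d, s], -1)))    # Eq. (10)
            logits = self.score(torch.cat([e_emb, d_emb], -1))
            beta = torch.softmax(logits / math.sqrt(self.Ts), dim=-1)  # Eq. (11)
            c = torch.bmm(beta.unsqueeze(1), E).squeeze(1)             # Eq. (12)
            c_tilde = torch.relu(self.fuse(torch.cat([c, z_short[:, t]], -1)))  # Eq. (13)
            d, s = self.cell(c_tilde, (d, s))                          # Eq. (14)
        return d                                   # d_Ts, later combined with e_Ts in Eq. (15)
```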


Fig. 4. The architecture of the encoder and decoder. The spatial attention mechanism computes the attention weights conditioned on the previous hidden state e_{t-1} in the encoder and h^L_{t,n} in the recurrent skip network; the newly computed r_t is then fed into the encoder LSTM unit. The temporal attention computes the attention weights based on the previous decoder hidden state d_{t-1} and represents the input information as a weighted sum of the encoder hidden states across all time steps. The generated context vector c̃_t and z_{t,n+1} are then used as inputs to the decoder LSTM unit.
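Continuing the illustrative sketches above (all module, tensor, and size choices are ours, not the authors'), one way the pieces could be wired together for a single forward pass, assuming batch = 8, N = 307, Ts = 24, n = 7, and m = v = 128:

```python
import torch

x_long = torch.rand(8, 7, 24, 307)     # X_L: n periods of Ts slots each
x_short = torch.rand(8, 24, 307)       # X_S: the current period

periodic = PeriodicComponent(num_nodes=307, v=128, m=128)
encoder = SpatialAttentionEncoder(v=128, m=128)
decoder = TemporalAttentionDecoder(v=128, m=128, Ts=24)

H_L = periodic(x_long)                         # long-term representations, Eq. (2)
Z_S = torch.relu(periodic.embed(x_short))      # shared embedding g, Eq. (1)
E = encoder(Z_S, H_L)                          # encoder states, Eqs. (3)-(8)
d_Ts = decoder(E, Z_S)                         # final decoder state, Eqs. (9)-(14)
# [e_Ts; d_Ts] would then feed the output layers of Eqs. (15)-(18).
```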

Then a fully connected neural network is applied to obtain the final output,

o_Ts = ReLU(V_i ReLU(W_i [e_Ts; d_Ts] + b_wi) + b_vi),   (15)

where W_i ∈ R^{v×2m} and V_i ∈ R^{v×v} are learnable parameters, and o_Ts ∈ R^v is the output representation of the neural network, which is used to generate predictions for each horizon.

D. Generating the Prediction

As shown in Figure 1, the same encoder-decoder architecture is used for all future K horizons, with shared parameters. In this way, we can achieve the effect of multi-task learning, obtain the features of all K horizons, and reduce model overfitting. After the encoder-decoder, we use a feedforward neural network to obtain the final output of the neural network; the predicted output x̂^nn_{T+k} at time T+k is

x̂^nn_{T+k} = V^k_m ReLU(W^k_m o_Ts + b^k_wm) + b^k_m,   (16)

where W^k_m ∈ R^{N×v}, b^k_wm ∈ R^N, V^k_m ∈ R^{N×N}, and b^k_m ∈ R^N are learnable parameters.

Due to the non-linearity of the recurrent components, one major drawback of the neural network model is that the scale of its outputs is not sensitive to the scale of its inputs. However, for real traffic datasets, the scale of the input constantly changes in a non-periodic manner, which lowers the forecasting accuracy of the neural network model. To address this deficiency, following the previous work [11], we decompose the final prediction into a linear part produced by an autoregressive (AR) model, which primarily handles the local scaling issue, plus a non-linear part containing the recurring patterns. This stabilizes the gradient flow and makes the neural network easy to train. The output of the autoregressive part at time T+k is

x̂^ar_{T+k} = Σ_{j=0}^{Tar-1} W^{k,j}_ar x_{T-j} + b^k_ar,   (17)

where Tar is the size of the input window over the short-term input, W^{k,j}_ar is the j-th value of the learnable parameter vector W^k_ar ∈ R^{Tar}, and b^k_ar ∈ R. The final prediction of APTN is then the sum of the outputs of the neural network and the AR component,

x̂_{T+k} = x̂^nn_{T+k} + x̂^ar_{T+k}.   (18)

In our experiments, we adopt the squared error as the loss function of our model in training,

L = (1 / (ℓK)) Σ_{i=1}^{ℓ} Σ_{k=1}^{K} |x̂^i_{T+k} − x^i_{T+k}|^2,   (19)

where ℓ is the number of training samples, and x̂^i_{T+k} and x^i_{T+k} are the prediction and ground truth of the i-th sample at horizon k respectively.

V. EXPERIMENTS

A. Settings

1) Dataset: To test the performance of our model, we use two large-scale public real-world datasets, PeMSD4 and PeMSD8, from California [6]. The data is collected in real time every 30 seconds and is aggregated into 5-minute intervals from the raw data.
1) PeMSD4: traffic data from the San Francisco Bay Area, containing 3848 detectors on 29 roads, from which we choose 307 detectors. The time span of this dataset is from January to February 2018.
2) PeMSD8: traffic data from San Bernardino from July to August 2016, containing 1979 detectors on 8 roads, from which we choose 170 detectors.
The split of data for training, validation, and test is 6:2:2.

2) Baselines: We compare APTN with the following widely used baselines:
1) Historical Average (HA): models the traffic demand as a seasonal process and uses the average of previous seasons as the prediction. The period used is one week, and the prediction is based on aggregated data from the same time in previous weeks.
2) Auto-Regressive Integrated Moving Average (ARIMA) [20]: a generalization of the autoregressive moving average (ARMA) model, which considers moving average and autoregressive components.
3) Vector Auto-Regressive (VAR) [29]: also a time series model, capturing the pairwise relationships among all traffic flow series.
4) Long Short-Term Memory Network (LSTM): a special RNN for sequence modeling. The size of the LSTM unit is set to be the same as in our model.
5) Dual-stage Attention-based Recurrent Neural Network (DA-RNN) [15]: a dual-stage attention-based recurrent neural network for time series prediction. In the first stage, it introduces an input attention mechanism to adaptively extract relevant driving series at each time step by referring to the previous encoder hidden state. In the second stage, it uses a temporal attention mechanism to select relevant encoder hidden states across all time steps.
6) Geo-sensory Multi-level Attention Networks (GeoMAN) [13]: a multi-level attention network for geo-sensory time series prediction. It introduces global spatial attention to capture the correlation between the target time series of a sensor and the time series of other sensors, and uses local spatial attention to capture the correlation between a feature time series and other time series.
7) Spatial-Temporal Graph Convolution Network (STGCN) [24]: comprises several spatio-temporal convolutional blocks, each a combination of graph convolutional layers and convolutional sequence learning layers, to model spatial and temporal dependencies.
8) Diffusion Convolutional Recurrent Neural Network (DCRNN) [12]: a diffusion convolutional recurrent neural network for traffic forecasting. The diffusion convolution operation builds a latent representation by scanning a diffusion process across each node in a graph-structured input, where the diffusion process is based on random walks on the graph.
9) Graph WaveNet [21]: a graph convolution network using a constructed adjacency matrix to uncover unseen graph structures from data. It proposes a CNN-based graph convolution layer in which a self-adaptive adjacency matrix, preserving hidden spatial dependencies, is learned from the data through end-to-end supervised training.
10) Attention based Spatial-Temporal Graph Convolutional Networks (ASTGCN) [6]: an attention based spatial-temporal graph convolutional network for traffic flow forecasting. It designs a novel spatial-temporal convolution module consisting of graph convolutions for capturing spatial features from the original graph-based traffic network structure, and convolutions in the temporal dimension for describing dependencies from nearby time slices.

In our experiments, five commonly used metrics are employed to evaluate the models: root mean square error (RMSE), mean absolute error (MAE), median absolute error (MdAE), mean absolute scaled error (MASE) [8], and mean absolute percentage error (MAPE). They are defined as follows:

RMSE = sqrt( (1/(ℓK)) Σ_{i=1}^{ℓ} Σ_{k=1}^{K} |x̂^i_{T+k} − x^i_{T+k}|^2 ),   (20)
MAE = (1/(ℓK)) Σ_{i=1}^{ℓ} Σ_{k=1}^{K} |x̂^i_{T+k} − x^i_{T+k}|,   (21)
MdAE = median_{1≤i≤ℓ, 1≤k≤K} |x̂^i_{T+k} − x^i_{T+k}|,   (22)
MASE = (1/(ℓK)) Σ_{i=1}^{ℓ} Σ_{k=1}^{K} |x̂^i_{T+k} − x^i_{T+k}| / MAE_in-sample,   (23)
MAPE = (1/(ℓK)) Σ_{i=1}^{ℓ} Σ_{k=1}^{K} |(x̂^i_{T+k} − x^i_{T+k}) / x^i_{T+k}|,   (24)

where x̂^i_{T+k} and x^i_{T+k} are the prediction and ground truth of the i-th sample at the k-th horizon over all nodes, and MAE_in-sample is the mean absolute error of the twelve-step-ahead random walk forecast; its values on the PeMSD4 and PeMSD8 datasets are 42.66 and 35.48. In addition, on the PeMSD4 and PeMSD8 test sets, the RMSE and MAE of the random walk one-step forecast are (34.28, 21.06) and (24.82, 16.04).

3) Hyperparameter Settings: We use grid search to find the optimal hyperparameters and optimization algorithm. From n = [1, 2, 4, 6, 7, 8, 9, 10, 11, 12], Ts = [1, 2, 4, 8, 12, 16, 24, 32], m = v = [32, 64, 128, 256], batch size = [16, 32, 64, 128], learning rate = [0.1, 0.01, 0.001, 0.0001], dropout = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5], and optimizer = (SGD, Momentum, Adagrad, Adam), we select the combination with which the model performs best on the validation set. One time slot of the PeMSD4 and PeMSD8 datasets is 5 minutes. We set Ts to 24 (corresponding to two hours) and n to 7 (corresponding to one week) for all datasets. For the long-term temporal information, we set the period Tl to one day, i.e., Tl is 288 (288 × 5 mins = 1 day). The dimension of the hidden state of all LSTM units, m, is set to 128, and the feature representation dimension v is also set to 128. In our experiments, the batch size is set to 64, the learning rate is set to 0.001, and the Adam optimization algorithm is used to train the model. Both the dropout and recurrent dropout rates in the LSTMs are set to 0.2.

The source code and the two datasets are available at https://github.com/Maple728/APTN.

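As a small, self-contained illustration of the five evaluation metrics in Eqs. (20)-(24), the NumPy sketch below assumes prediction and ground-truth arrays of shape (num_samples, K, N) and a precomputed in-sample MAE of the random-walk reference for MASE (42.66 on PeMSD4 and 35.48 on PeMSD8, as stated above); the function name and averaging over all entries are our own simplifications.

```python
import numpy as np

def evaluate(pred, truth, mae_in_sample):
    """pred, truth: arrays of shape (num_samples, K, N)."""
    err = pred - truth
    rmse = np.sqrt(np.mean(err ** 2))          # Eq. (20)
    mae = np.mean(np.abs(err))                 # Eq. (21)
    mdae = np.median(np.abs(err))              # Eq. (22)
    mase = mae / mae_in_sample                 # Eq. (23)
    mape = np.mean(np.abs(err / truth))        # Eq. (24)
    return {"RMSE": rmse, "MAE": mae, "MdAE": mdae, "MASE": mase, "MAPE": mape}
```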
B. Results

TABLE II: Comparison with different baselines on PeMSD4 and PeMSD8.

Table II shows the results of our proposed method compared to the other baselines on the PeMSD4 and PeMSD8 datasets. We report the average prediction results of traffic volume over the next one hour (K = 12). Our proposed APTN outperforms all competing baselines, which suggests the effectiveness of the proposed approach for modeling spatiotemporal correlations. In Figure 5, we show the actual and predicted time series at different horizons for detector 38 on February 28 on the PeMSD4 dataset. For different horizons, the predicted series closely follows the actual series. Specifically, since the traditional time-series prediction methods (HA and ARIMA) only rely on historical records without considering spatial features, they have the worst performance. VAR captures spatial correlations by considering pairwise relationships and achieves better performance, but it fails to capture the complex non-linear temporal dependencies and the dynamic spatial relationships.

Fig. 5. Predicted vs. ground truth for detector 38 on February 28.

APTN also outperforms the deep learning based methods. LSTM only captures temporal information and overlooks the spatial dependency. DCRNN, Graph WaveNet, STGCN, and ASTGCN simultaneously take both the temporal and spatial correlations into account, leading to better performance. However, they mainly focus on modeling the correlations among spatially adjacent roads and the static dependencies among different roads, while our attention based mechanism can also capture pair-wise correlations among possibly distant roads and the dynamic dependencies over time, which is critical for accurate forecasting. DA-RNN and GeoMAN capture the correlation between one time series and the rest, but they focus on the short-term dependency between one time series and the rest; in traffic, when the correlation between two roads changes, the captured dependency may be overfitted.

Fig. 6. Performance at different horizons.

In Figure 6, we show the performance of different models at different horizons on the PeMSD4 dataset. The performance of the GCN based models (STGCN, ASTGCN) at the first horizon is very good, but it declines considerably as the horizon increases. Although the performance of our model also declines with increasing horizon, it remains better than the other baselines and is consistently better across horizons.

TABLE III: Peak and off-peak performance.

In addition, we test our model's performance during peak and off-peak periods on the PeMSD4 dataset, and the results are in Table III. The RMSE and MAE of peak hours are higher than those of off-peak hours, since the traffic volume in peak hours is much higher than in off-peak hours.
For MAPE, due to the small traffic volume in off-peak hours, the off-peak value is higher than that of peak hours. We also test the performance of all baselines during peak and off-peak periods, and the results are shown in Figure 7. Our model outperforms all competing baselines.

Fig. 7. Peak and off-peak performance of all models.

C. Spatial Attention Analysis

Fig. 8. Spatial attention weights visualization.

The weights obtained from the spatial attention measure the importance of each node to the entire road network. Figure 8(a) illustrates the attention weights; the six roads marked by darker red are the most heavily weighted roads. Next, we show how these six roads affect the prediction. Figure 8(b) shows how the traffic volume of each road changes with time, where the six solid red lines represent the selected six roads with the largest weights and the black dotted lines represent the rest of the roads. The trends of the solid red lines at the top and bottom represent the trends of the high-traffic-volume and low-traffic-volume groups of roads. Because these two groups contain fewer roads, only one road in each group has a higher attention weight. There are many roads with medium volume, which account for a large proportion of the weight in the entire graph and have various trends; therefore, four roads with large attention weights (the four solid red lines in the middle) represent the trend of medium traffic volume. The trend of each road with a higher attention weight reflects the main trend of its group, and these trends collectively reflect the main trend of the entire graph.

D. Ablation Analysis

Our proposed APTN model mainly consists of the following four components: spatial attention, temporal attention, periodical information, and AR. To further investigate the effectiveness of each component, we compare APTN with the following variants:
1) APTN/SA: remove the spatial attention. The input of the encoder LSTM is the concatenation of z_{t,n+1} and h^L_{t,n} instead of the weighted input r_t and h^L_{t,n}. This means our model deteriorates to a standard LSTM model based on the encoder-decoder architecture with temporal attention.
2) APTN/TA: remove the temporal attention. The input of the decoder LSTM is the hidden state of the encoder LSTM, e_t, instead of the context vector c_t.
3) APTN/PI: remove the periodical input. The long-term periodical component is removed, and the input of the encoder LSTM is just the weighted input r_t.
4) APTN/AR: remove the AR part. The prediction of our model is just the output of the neural network, x̂^nn_{T+k}.

TABLE IV: Ablation analysis on the PeMSD4 dataset.

Table IV shows the results of each model. Among all components, spatial attention has the biggest effect: without it, the RMSE increases from 31.00 to 34.78. The spatial attention can selectively focus on certain roads rather than treating all roads equally, and this data-driven attention mechanism also captures the dynamic correlations.

From Table IV, AR has the second biggest effect. This shows that AR better obtains the linear output, complementing the non-linear output of the neural network.

Another moderately important component in our proposed APTN is the long-term periodic dependency modeling, since periodicity is typical for traffic data. From Table IV, without the periodical component the RMSE increases from 31.00 to 31.83.

Fig. 9. Effect of hyperparameters.

We also evaluate the effect of the period parameter n, choosing values from the set {1, 2, 4, 6, 7, 8, 9, 10, 11, 12}. Figure 9(a) shows the result on the PeMSD4 dataset. The performance is best when n = 7 (i.e., 7 days), because traffic data also has a weekly pattern. Note that the weekly pattern can be incorporated into our model in the same way as the daily pattern.

In our model, the temporal attention mechanism is employed to determine the relevant encoder hidden states for making predictions. Table IV shows that our model outperforms APTN/TA, since the temporal attention mechanism enhances the long-term predictive performance.

We also try different encoder lengths Ts to verify their effect. Figure 9(b) shows the result on the PeMSD4 dataset. When Ts is about 15 (corresponding to 15 × 5 mins = 75 mins), the performance stabilizes.

VI. CONCLUSION

In this paper, we investigated the traffic prediction problem. We proposed a novel Attention-based Periodic-Temporal neural Network (APTN), which captures the spatial, temporal, and periodical correlations. When evaluated on real-world datasets, the proposed approach achieved better results than state-of-the-art baselines. One potential limitation of our model is that it has more parameters (2.2M for APTN versus 1.3M for ASTGCN), which means it needs more data to perform better than the other baselines. For future work, we plan to take external factors, e.g., weather, social events, and POIs, into account to further improve the forecasting accuracy.

REFERENCES

[1] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," 2014, arXiv:1409.0473.
[2] K. Cho et al., "Learning phrase representations using RNN encoder-decoder for statistical machine translation," 2014, arXiv:1406.1078.
[3] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," 2014, arXiv:1412.3555.
[4] D. Deng, C. Shahabi, U. Demiryurek, L. Zhu, R. Yu, and L. Yan, "Latent space model for road networks to predict time-varying traffic," in Proc. KDD, 2016, pp. 1525–1534.
[5] X. Geng et al., "Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting," in Proc. AAAI Conf. Artif. Intell., vol. 33, Jul. 2019, pp. 3656–3663.
[6] S. Guo, Y. Lin, N. Feng, C. Song, and H. Wan, "Attention based spatial-temporal graph convolutional networks for traffic flow forecasting," in Proc. AAAI Conf. Artif. Intell., vol. 33, Jul. 2019, pp. 922–929.
[7] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[8] R. J. Hyndman and A. B. Koehler, "Another look at measures of forecast accuracy," Int. J. Forecasting, vol. 22, no. 4, pp. 679–688, Oct. 2006.
[9] S. Ishak and H. Al-Deek, "Performance evaluation of short-term time-series traffic prediction model," J. Transp. Eng., vol. 128, no. 6, pp. 490–498, Nov. 2002.
[10] J. Ke, H. Zheng, H. Yang, and X. Chen, "Short-term forecasting of passenger demand under on-demand ride services: A spatio-temporal deep learning approach," Transp. Res. C, Emerg. Technol., vol. 85, pp. 591–608, Dec. 2017.
[11] G. Lai, W.-C. Chang, Y. Yang, and H. Liu, "Modeling long- and short-term temporal patterns with deep neural networks," in Proc. 41st Int. ACM SIGIR Conf. Res. Develop. Inf. Retr. (SIGIR), 2018, pp. 95–104.
[12] Y. Li, R. Yu, C. Shahabi, and Y. Liu, "Diffusion convolutional recurrent neural networks: Data-driven traffic forecasting," in Proc. ICLR, 2018, pp. 1–15.
[13] Y. Liang, S. Ke, J. Zhang, X. Yi, and Y. Zheng, "GeoMAN: Multi-level attention networks for geo-sensory time series prediction," in Proc. 27th Int. Joint Conf. Artif. Intell., Jul. 2018, pp. 3428–3434.
[14] I. Okutani and Y. J. Stephanedes, "Dynamic prediction of traffic volume through Kalman filtering theory," Transp. Res. B, Methodol., vol. 18, no. 1, pp. 1–11, Feb. 1984.
[15] Y. Qin, D. Song, H. Chen, W. Cheng, G. Jiang, and G. Cottrell, "A dual-stage attention-based recurrent neural network for time series prediction," 2017, arXiv:1704.02971.
[16] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W. Wong, and W. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," in Proc. NIPS, 2015, pp. 802–810.
[17] X. Shi, Z. Gao, L. Lausen, H. Wang, and D.-Y. Yeung, "Deep learning for precipitation nowcasting: A benchmark and a new model," in Proc. NIPS, 2017, pp. 5617–5627.
[18] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 3104–3112.
[19] A. Vaswani et al., "Attention is all you need," in Proc. NIPS, 2017, pp. 5998–6008.
[20] B. M. Williams and L. A. Hoel, "Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: Theoretical basis and empirical results," J. Transp. Eng., vol. 129, no. 6, pp. 664–672, Nov. 2003.
[21] Z. Wu, S. Pan, G. Long, J. Jiang, and C. Zhang, "Graph WaveNet for deep spatial-temporal graph modeling," in Proc. 28th Int. Joint Conf. Artif. Intell., Aug. 2019, pp. 1907–1913.
[22] H. Yao, X. Tang, H. Wei, G. Zheng, and Z. Li, "Revisiting spatial-temporal similarity: A deep learning framework for traffic prediction," in Proc. AAAI Conf. Artif. Intell., vol. 33, Jul. 2019, pp. 5668–5675.
[23] H. Yao et al., "Deep multi-view spatial-temporal network for taxi demand prediction," in Proc. AAAI, 2018, pp. 2588–2595.
[24] B. Yu, H. Yin, and Z. Zhu, "Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting," in Proc. 27th Int. Joint Conf. Artif. Intell., Jul. 2018, pp. 1–5.
[25] R. Yu, Y. Li, C. Shahabi, U. Demiryurek, and Y. Liu, "Deep learning: A generic approach for extreme condition traffic forecasting," in Proc. SIAM Int. Conf. Data Mining, 2017, pp. 777–785.
[26] J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D.-Y. Yeung, "GaAN: Gated attention networks for learning on large and spatiotemporal graphs," in Proc. UAI, 2018, pp. 1–10.
[27] J. Zhang, Y. Zheng, and D. Qi, "Deep spatio-temporal residual networks for citywide crowd flows prediction," in Proc. AAAI, 2017, pp. 1655–1661.
[28] J. Zhang, Y. Zheng, D. Qi, R. Li, and X. Yi, "DNN-based prediction model for spatio-temporal data," in Proc. 24th ACM SIGSPATIAL Int. Conf. Adv. Geographic Inf. Syst. (GIS), 2016, pp. 1–4.
[29] E. Zivot and J. Wang, "Vector autoregressive models for multivariate time series," in Modeling Financial Time Series with S-PLUS, 2006, pp. 385–429.

Xiaoming Shi received the B.S. degree from Dalian Maritime University in 2017. He is currently pursuing the master's degree in computer science with the Dalian University of Technology. His research interests include data mining and time series forecasting.

Heng Qi received the B.S. degree from Hunan University in 2004, and the M.E. and Ph.D. degrees from the Dalian University of Technology in 2006 and 2012, respectively. He was a JSPS Overseas Research Fellow with the Graduate School of Information Science, Nagoya University, Japan, from 2016 to 2017. He is currently an Associate Professor with the School of Computer Science and Technology, Dalian University of Technology, China. His research interests include computer networks and multimedia computing.

Yanming Shen received the B.S. degree in automation from Tsinghua University in 2000, and the Ph.D. degree from the Department of Electrical and Computer Engineering, Polytechnic University (now NYU Tandon School of Engineering) in 2007. He is currently a Professor with the School of Computer Science and Technology, Dalian University of Technology, China. His general research interests include big data analytics, distributed systems, and networking. He is a recipient of the 2011 Best Paper Award for Multimedia Communications, awarded by the IEEE Communications Society.

Genze Wu is currently pursuing the bachelor's degree with the Dalian University of Technology, majoring in computer science. His research interests include data mining and computer vision.

Baocai Yin (Member, IEEE) received the M.S. and Ph.D. degrees in computational mathematics from the Dalian University of Technology, Dalian, China, in 1988 and 1993, respectively. He is currently a Professor of computer science and technology with the Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology. He is also a Researcher with the Beijing Key Laboratory of Multimedia and Intelligent Software Technology and the Beijing Advanced Innovation Center for Future Internet Technology. He has authored or coauthored more than 200 academic articles in prestigious international journals, including the IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), the IEEE Transactions on Multimedia (T-MM), the IEEE Transactions on Image Processing (T-IP), the IEEE Transactions on Neural Networks and Learning Systems (T-NNLS), the IEEE Transactions on Cybernetics (T-CYB), and the IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT), and in top-level conferences such as CVPR, AAAI, INFOCOM, IJCAI, and ACM SIGGRAPH. His research interests include multimedia, image processing, computer vision, and pattern recognition.