

Few-Sample Traffic Prediction With Graph Networks Using Locale as Relational Inductive Biases

Mingxi Li, Yihong Tang, and Wei Ma, Member, IEEE

Abstract— Accurate short-term traffic prediction plays a pivotal role in various smart mobility operation and management systems. Currently, most of the state-of-the-art prediction models are based on graph neural networks (GNNs), and the required training samples are proportional to the size of the traffic network. In many cities, the available amount of traffic data is substantially below the minimum requirement due to the data collection expense. It is still an open question to develop traffic prediction models with a small size of training data on large-scale networks. We notice that the traffic states of a node for the near future only depend on the traffic states of its localized neighborhoods, which can be represented using the graph relational inductive biases. In view of this, this paper develops a graph network (GN)-based deep learning model LocaleGN that depicts the traffic dynamics using localized data aggregating and updating functions, as well as node-wise recurrent neural networks. LocaleGN is a light-weighted model designed for training on few samples without over-fitting, and hence it can solve the problem of few-sample traffic prediction. The proposed model is examined on predicting both traffic speed and flow with six datasets, and the experimental results demonstrate that LocaleGN outperforms existing state-of-the-art baseline models. It is also demonstrated that the learned knowledge from LocaleGN can be transferred across cities. The research outcomes can help to develop light-weighted traffic prediction systems, especially for cities lacking historically archived traffic data.

Index Terms— Traffic prediction, few-sample learning, graph networks, transfer learning, intelligent transportation systems.

Manuscript received 5 March 2022; revised 30 August 2022 and 7 October 2022; accepted 1 November 2022. Date of publication 10 November 2022; date of current version 8 February 2023. This work was supported in part by the National Natural Science Foundation of China under Grant 52102385; in part by the Research Grants Council of the Hong Kong Special Administrative Region, China, under Project PolyU/25209221; and in part by the Research Institute for Sustainable Urban Development (RISUD) at the Hong Kong Polytechnic University under Project P0038288. The Associate Editor for this article was B. Singh. (Corresponding author: Wei Ma.)

Mingxi Li is with the Department of Civil and Environmental Engineering, The Hong Kong Polytechnic University, Hong Kong, SAR, China (e-mail: [email protected]).

Yihong Tang is with the Department of Urban Planning and Design (DUPAD), The University of Hong Kong, Hong Kong, SAR, China (e-mail: [email protected]).

Wei Ma is with the Department of Civil and Environmental Engineering and the Research Institute for Sustainable Urban Development, The Hong Kong Polytechnic University, Hong Kong, SAR, China, and also with The Hong Kong Polytechnic University Shenzhen Research Institute, Shenzhen, Guangdong 518057, China (e-mail: [email protected]).

Digital Object Identifier 10.1109/TITS.2022.3219618

Fig. 1. Relationship of network and data size for current traffic prediction models.

I. INTRODUCTION

SMART traffic operation and management systems rely on accurate and real-time network-wide traffic prediction [1], [2], and it is the essential input for various smart mobility applications such as personal map services [3], connected and autonomous vehicles [4], traffic signal control [5], and the advanced traveler information system/advanced traffic management system (ATIS/ATMS). In many megacities like Los Angeles and New York, massive traffic data have been collected and archived, which include vehicle speeds, traffic volumes, origin-destination (OD) matrices, etc., and these data are widely used to generate traffic predictions. Among different traffic prediction models, deep neural networks, especially the graph-based neural networks, such as graph convolutional networks (GCN) [6], [7], achieve state-of-the-art accuracy and have been widely deployed in various industry-level smart mobility applications. For example, Uber has been using deep learning for travel time prediction [8].

To further improve the prediction accuracy on large-scale transportation networks, the model complexity and number of trainable parameters of newly developed traffic prediction models have increased drastically in recent years [9], [10]. Most of the existing models require historically archived traffic data for a long time period, and we define the length of the time period as the data size [11]. In contrast, the size of the transportation network is referred to as the network size, as illustrated in Figure 1. Many cities are gradually installing traffic sensors during the development of the Intelligent Transportation System (ITS). After a new installation, the maximum amount of available training data depends on the data amount of the newest sensor. Therefore, the size of the training data for conventional traffic prediction models will be reduced dramatically when the number of sensors keeps increasing. Every time new traffic sensors are installed, we need to re-train the traffic prediction model with a smaller dataset and a larger graph.


Fig. 2. Distribution of available detectors in Hong Kong in September 2021 (left) and January 2022 (right).

Fig. 3. Illustration of the concept of locale.

For the majority of deep learning models, the number of model parameters is proportional to the road network size. Thus, existing traffic prediction models could overfit on the few training data, and the prediction accuracy could drop significantly.

In general, traffic prediction accuracy increases with respect to the data size, and the number of trainable parameters increases with respect to the network size. When the network size becomes larger, more data (a longer time period of historical data) is required to prevent overfitting due to the large number of trainable parameters [12]. Overall, existing well-performed deep learning models require a large data size for training to ensure good performance on large-scale networks. However, because of the high expenses in data collection and sensor maintenance [13], it is not practical to expect every city to archive a comprehensive and long-history dataset. As shown in Figure 1, how to develop traffic prediction models with a small data size on large-scale networks is the key research question addressed in this paper.

We further motivate this study using a practical example. Hong Kong aims to transform itself into a smart city within the next decade, and the Smart City Blueprint for Hong Kong 2.0 was released in December 2020, which outlines the future smart city applications in Hong Kong. The blueprint plans to "complete the installation of about 1,200 traffic detectors along major roads and all strategic roads to provide additional real-time traffic information" for Hong Kong's smart mobility system. Consequently, Hong Kong's Transport Department is gradually installing traffic sensors and releasing the data starting from the middle of 2021. The number of traffic sensors has increased drastically in the recent year. The duration of the historical traffic data from the new sensors can be less than one month, making it impractical to train existing traffic prediction models. Similar situations also happen in many cities like Paris, Shenzhen, and Liverpool, as the concept of smart cities just steps into the deployment phase globally. Therefore, a network-wide traffic prediction model, which achieves state-of-the-art performance with a small size of traffic data, could enable the smooth transition and early deployment of smart mobility applications. Hence this study has practical values for many cities.

To this end, we define the task of few-sample traffic prediction, which aims to train on historical traffic data with a short history, and generate accurate short-term and long-term traffic predictions on large-scale networks. This task is essentially challenging as a complicated prediction model on a large-scale network is prone to overfit with limited data [14], [15]. To address this issue, we notice that the short-term traffic state on a certain node (target node) only depends on the traffic states of its localized neighbors. More specifically, as shown in Figure 3, the traffic state at time τ + 1 is mainly affected by the traffic states of itself and its neighborhoods at time τ. It is straightforward to observe that the change of traffic state on a node is attributed to the traffic (e.g., vehicle, flow) exchanges; the nodes far away from the target node cannot exchange traffic directly with the target node, and hence the impact of those nodes is indirect and marginal. We define the concept locale of a target node as a collection of information on its neighboring nodes, and the information includes, but is not limited to, traffic states (e.g., speed, flow, OD), static data (e.g., road type, speed limit), and auxiliary information (e.g., weather). Finally, it is safe to claim that the traffic state of the target node in the near future mainly depends on its current locale.

The locale of a node can be viewed as the relational inductive biases for the prediction model, which enforces the entity relations in a deep learning architecture. In our case, the connections between nodes allow the traffic exchanges, while no direct traffic exchanges are allowed if two nodes are not connected. Additionally, other static information such as location information and road properties affects the speed and frequency of those traffic exchanges. To make use of the relational inductive biases, we adopt the Graph Network (GN) to capture the dynamic and localized traffic states of a target node. The GN demonstrates great potential in modeling the relational inductive biases and has been widely used to depict the localized relationship among different entities [16]. It is also a generalized form of many existing graph-based network structures [17], [18]. In this paper, we extend the GN model to learn the locally spatial and temporal patterns of traffic states for generating predictions. Importantly, GN is applied to each node separately, and it can be applied to different nodes with various topologies. GN is also light-weighted with a small number of trainable parameters, making it easy to train given the limited training samples.

It is noteworthy that the similar concept of locale is widely adopted for the spatial approaches of the graph neural networks (GNN) defined in [6], and one of the representative examples is the message passing neural network (MPNN) [19]. However, current traffic prediction models overlook the properties of locale and rely on GNN to implicitly learn the locally


spatial dependency of the traffic states, which also explains why the existing traffic prediction models generally require a large amount of data. To the best of our knowledge, among the commonly used traffic prediction models, DCRNN and Graph WaveNet depict the diffusion process of traffic states, which share the most similar idea to locale [20], [21]. However, due to their complex model structures, the performance of both models degrades drastically with a few data samples, which will later be shown in the numerical experiments. This observation also suggests a light-weighted structure for the few-sample traffic prediction task.

To summarize, the small data size and large network size present great challenges to existing traffic prediction models, and it is promising to utilize the localized traffic information and data (locale) to generate traffic predictions without modeling the entire network simultaneously. In view of this, this paper proposes the Localized Temporal Graph Network (LocaleGN) to model the locally spatial and temporal dependencies on graphs for predicting traffic states in the near future. This is the first time that GN is adopted and extended for traffic prediction tasks, and the choice of GN is attributed to its previous successful applications in predicting physical systems, such as fluid dynamics and glassy systems [16]. LocaleGN can learn from different nodes on the large network, and hence it can be regarded as a light-weighted model that makes accurate short-term predictions from few training samples. The contributions of this paper can be summarized as follows:

• We propose the few-sample traffic prediction task and highlight the importance of using localized information (locale) of each node to address the issue of lacking data in the proposed task.

• We develop a locally spatial and temporal graph architecture, LocaleGN, for the few-sample traffic prediction task. The locally spatial pattern is modeled by the graph network with relational inductive biases, and the temporal pattern is depicted with a modified recurrent neural network.

• We conduct experiments on three real-world traffic speed datasets and three real-world traffic flow datasets, respectively. The results show that LocaleGN consistently outperforms other state-of-the-art baseline models for the few-sample traffic prediction. It is also demonstrated that the learned knowledge from LocaleGN can be transferred across cities.

Source codes of LocaleGN are publicly available at https://github.com/MingxiLii/LocaleGN. The remainder of this paper is organized as follows. Section II reviews the related studies on traffic prediction, few-sample learning, and graph neural networks. Section III formulates the task of few-sample traffic prediction and presents details of LocaleGN. In Section IV, numerical experiments are conducted to demonstrate the advantages of LocaleGN. Lastly, conclusions and future research are summarized in Section V.

II. RELATED WORK

In this section, we first summarize the existing traffic prediction models, then applications of few-sample learning in transportation are discussed. Lastly, we review the emergence of GN and justify the choice of GN for the few-sample traffic prediction.

1) Traffic Prediction Models: Traffic data, as spatio-temporal information on road networks (graphs), is a typical non-Euclidean data type. There are two types of models to deal with graph data: spatial graph-based models and spectral graph-based models [6]. In smart mobility applications, GCN is a widely used spectral graph-based model to learn the complex structure of spatio-temporal data, and the representative models include T-GCN [22], ST-GCN [23], TGC-LSTM [24], STEMGNN [25] and ASTGCN [26]. The convolutional operation is based on the whole graph with its Laplacian matrix or other variants, such as the dynamic Laplacian matrix [27]. Reference [28] presents a dedicated prediction model for passenger inflow and outflow at traffic stations, such as subway or bus stations. In other words, the graph consists of nodes (traffic stations) with two attributes (inflow and outflow). The graph information is further processed to obtain the subgraph for each node. This is an enlightening setting, in which spatial-temporal patterns can be learned by a subgraph centered on one node. However, the spectral graph-based models do not have parameter-sharing characteristics, thus these models are tied to specific graph topologies. For example, T-GCN and ST-GCN are inflexible and impose restrictions on transfer learning tasks across different graphs [29].

For spatial graph-based models, local aggregators are explored in order to deal with the direct convolution operation with different numbers of neighbors [6]. In DCRNN [20], the bidirectional random walk is introduced to model the spatial correlation of traffic flow on graphs. In Graph WaveNet, an adaptive dependency matrix is adopted through the node embedding process, and the 1D convolution component is used to make the relation between entities trainable [21]. Moreover, attention-based spatial methods are developed to operate the attention mechanism on graphs. In GATCN, the graph attention network is applied to extract the spatial features of the road network [30]. We note that existing spatial graph-based models have flexible graph representations but require large historical data to achieve competitive accuracy. For example, in DCRNN, the pair-wise spatial correlation makes graph structures trainable, but its complex encoder and decoder structures also make the training data-intensive.

The methods mentioned above in the spectral and spatial domains mainly focus on modeling the spatial dependency of the traffic data. In addition to capturing spatial patterns, many other models are combined to learn temporal patterns. For example, the Recurrent Neural Network (RNN) is adopted in time series prediction tasks [31], [32]. Both LSTM and GRU are typical RNN models [33], and the attention mechanism can also be used for inferring temporal dependencies (e.g., AST-GAT [34], attention with LSTM [35], and APTN [36]).

In addition to deep learning, there are also other methods to deal with traffic prediction. For example, in [37], reinforcement learning is utilized to address data defect issues in traffic flow prediction. The dynamic graph is generated by the policy gradient algorithm, and the differences of the graph are used as reward signals for reinforcement learning to generate action


sequences on the traffic flow transfer graph. The linear models have more compact and simple structures, which are used in many traffic prediction settings. For example, [38] proposes to predict the short-term traffic dynamics based on shockwave analysis and bottleneck identification. Specifically, the historical dataset is clustered by a Gaussian Mixture Model, and the clustering result is used for generating the congestion map bounded by shockwave speed. Besides, [39] develops a dynamic linear model with recursive coefficient updates. On top of [39], the large-scale dynamic linear model (LSDLM) further incorporates graph topological information and makes it feasible to make predictions on large-scale networks [40]. In general, it is difficult to adapt the above models to traffic prediction settings with a changing number of sensors.

2) Few-Shot and Few-Sample Learning in Transportation: The majority of research applying deep learning to small datasets focuses on few-shot prediction, which includes pre-training on a large dataset and fine-tuning on a small dataset [41], [42], [43]. However, works for directly learning from small datasets without external data are scarce [44]. Few-sample learning can be viewed as a specific example of few-shot learning without pre-training. Thus, it is a promising methodology to deal with the data inefficiency issue in the field of transportation. There is no direct work about few-sample learning in transportation, but some few-shot learning approaches are explored. The existing few-shot learning models can be categorized into two types: one is the gradient-based models [45] and the other is the metric-based models [46]. These methodologies work well in the case of independent and identically distributed data. However, few-shot learning on graphs is more challenging as the nodes and edges on the graph are connected and correlated with each other. Transportation data is a typical type of graph-based data. In recent years, studies have explored few-shot learning with transportation applications such as vehicle detection [47] and traffic sign recognition [48]. In traffic prediction tasks, region-based transfer learning across cities is applied by matching similar sub-regions among different cities [49]. Overall, although lacking few-sample learning, the explorations of few-shot learning for traffic prediction can provide inspiration for building few-sample models without external data.

3) Graph Networks: Spectral GCN models and spatial GNN models can be adopted for traffic prediction tasks [7]. The inflexibility of the former makes it unfeasible to construct few-sample or transfer prediction models. Therefore, it is worthwhile to look into spatial-based GNN models. There are many popular spatial-based GNN models such as diffusion graph convolution (DGC) [50], MPNN [19], GraphSAGE [51], and graph attention networks (GAT) [52]. Traffic speed or flow prediction models based on DGC, MPNN, GraphSAGE and GAT for short-term prediction are developed with sufficient traffic data. However, these models may fail in the cases of few data samples. In view of this, we note that the Graph Network (GN) with relational inductive biases could address the issue of data intensity. In [16], the relational inductive bias is widely discussed when utilizing deep learning models to deal with structured data. In recent years, GN has been used to simulate physical systems [53], [54], predict the structure in glassy systems [55], and simulate molecular dynamic systems [56]. All these studies demonstrate the great potential of GN in modeling the localized relationship that is viewed as the relational inductive biases. Referring to spatio-temporal networks, GN is utilized to predict climate data, and the encoder-GN-decoder structure is developed to strengthen the representation ability in uncovering the complex and diverse connectivity of spatio-temporal networks [17]. Overall, it is worth exploring to extend GN to traffic prediction, as the road network is also viewed as a typical spatio-temporal network.

III. MODEL

In this section, we formulate the few-sample traffic prediction task. The overall framework of LocaleGN is presented and each component is introduced in detail. Lastly, the computational steps in LocaleGN are summarized.

A. Few-Sample Traffic Prediction

The traffic prediction task can be formulated as a regression problem. Various traffic data, such as traffic speed and flow, are graph-based, which can be modeled on a spatio-temporal graph. Different types of data might associate with nodes or edges. For example, point detector data is associated with nodes, while travel time data is associated with edges. We define the graph associated with the traffic data as a directed graph G = (V, E), where each node is represented by i, and the set of node indices V = {1, 2, ..., i, ..., N_v}. N_v is the number of traffic sensors. Similarly, the set of edges is defined as E = {1, 2, ..., k, ..., N_e}, where N_e represents the number of edges. The connectivity among nodes and edges represents the relational inductive biases on the graph.

In this paper, we formulate the traffic prediction on time-dependent (dynamic) graphs. Suppose the set of time intervals during the study period is denoted as T; for each time interval τ, we define the time-dependent data on graph G as G_τ = (V_τ, E_τ), where V_τ ∈ R^{N_v×M} and E_τ ∈ R^{N_e×M}. To be precise, V_τ = [v_1^τ; ...; v_i^τ; ...; v_{N_v}^τ] and v_i^τ = [v_i^{τ−M}, ..., v_i^τ]^T ∈ R^{1×M}, where v_i^τ represents the traffic data for node i at time τ, and M is the look-back window. The edge-based data E_τ = [e_1^τ; ...; e_k^τ; ...; e_{N_e}^τ] and e_k^τ = [e_k^{τ−M}, ..., e_k^τ]^T, where e_k^τ represents the traffic data on edge k at time τ. We note the time-dependent data in G_τ can not only include conventional traffic data, but also other graph-based datasets such as weather, road properties, and so on. In this paper, without loss of generality, we suppose the nodes carry the point detector data, while the edges include the road properties (e.g., length, width, category).

Based on the above notations, existing traffic prediction models can be viewed as a function Φ with Φ(G_τ) = V_{τ+1}. Φ can be viewed as a dynamical system that evolves the graph-based traffic data from time τ to τ + 1. For each τ, suppose G_τ follows a certain distribution D, so that G_τ ∼ D. Given a sufficiently large T, it is expected that we can use a deep learning model to approximate Φ such that the prediction error ‖Φ̂(G_τ) − V_{τ+1}‖ is minimized, where Φ̂ represents the deep learning-based approximator.

Fig. 4. An overview of LocaleGN (H represents the size of hidden layers).

In the few-sample traffic prediction task, the set of archived time intervals T is relatively small, say |T| = 10, and then training Φ̂ can easily overfit, making the existing traffic prediction models not suitable for the few-sample tasks. To address this issue, we propose to develop a function Φ(v_i^τ, L_τ(i)) = v_i^{τ+1}, ∀i, where L_τ(i) represents the locale of node i at time τ. For example, L_τ(i) can represent all the e_k and v_i that are within the K-hop neighborhoods of node i. To obtain an approximator of Φ, we notice that the available number of training data becomes |T|N_v. If the deep learning-based approximator Φ̂ could learn the locally spatial and temporal pattern for each node with |T|N_v samples, then the few-sample traffic prediction task can be solved. In the following sections, we present and validate LocaleGN as a suitable selection for Φ̂.

B. LocaleGN

In this section, we introduce LocaleGN in detail. Firstly, an overview of LocaleGN is shown in Figure 4. LocaleGN mainly consists of four major components: the node Gate Recurrent Unit (NodeGRU), the node and edge encoders, the graph network (GN) and the node decoder. At each time τ, the input of LocaleGN is G_τ, and the node data V_τ is fed into the NodeGRU for learning the temporal patterns of each node. Simultaneously, both the edge data E_τ and the node data V_τ are embedded by the edge encoder and node encoder, respectively. The GN is used to model the locally spatial patterns on both nodes and edges, and the edge information is aggregated to nodes. Then, the aggregated node information is further decoded by the node decoder. Lastly, the temporal patterns from the NodeGRU are concatenated with the locally spatial patterns from the GN, and a dense layer is used to generate the final prediction for v_i^{τ+1}, ∀i.

1) NodeGRU: The NodeGRU focuses on capturing the temporal correlation in v_i for each node separately, as presented in Equation 1. The temporal pattern of v_i is embedded in the GRU, which will be used for the inference of time τ + 1 in the following steps. We note that the GRU is applied to each node separately, which makes the trainable parameters independent of the network size.

v_i^{τ,GRU} = GRU(v_i^τ), ∀i, τ    (1)
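As a concrete reading of Equation 1, the sketch below (our illustration; the layer sizes are assumptions, not the paper's finalized hyper-parameters) applies one shared GRU to every node by folding the node dimension into the batch dimension, so the parameter count does not grow with N_v.

```python
import torch
import torch.nn as nn

class NodeGRU(nn.Module):
    """Shared GRU applied to each node's look-back window (Equation 1)."""
    def __init__(self, hidden_size: int = 64):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden_size, batch_first=True)

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (N_v, M) -- one scalar reading per node per interval.
        out, _ = self.gru(v.unsqueeze(-1))   # (N_v, M, hidden): nodes act as the batch
        return out[:, -1, :]                 # v_i^{tau,GRU}: last hidden state per node

v = torch.rand(207, 12)                      # e.g., the LA dataset: 207 nodes, M = 12
print(NodeGRU()(v).shape)                    # torch.Size([207, 64])
```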


2) Node and Edge Encoder: To encode the node and edge data in G_τ, we employ two multilayer perceptrons, MLP_E and MLP_V, for the edges and nodes respectively, as shown in Equation 2.

ē_k^τ = MLP_E(e_k^τ), ∀k, τ
v̄_i^τ = MLP_V(v_i^τ), ∀i, τ    (2)

The encoded node and edge data form the encoded graph Ḡ_τ = (V̄_τ, Ē_τ). In general, the two encoders can better learn the latent representations of these node and edge data, and the learned representations will be used to further mine the spatial and temporal relationships by the GN.

3) Graph Network: The graph network (GN) is the essential component in LocaleGN. In general, the GN models the evolution of dynamic graphs using updating and aggregating operations on nodes and edges. In particular, we aim to use the GN to evolve Ḡ_τ to Ḡ_{τ+1}.

In the GN, two updating functions φ^V, φ^E are employed to update the per-node data and per-edge data respectively, and one aggregating function ρ^{E→V} is used to aggregate the per-edge data for each node. To provide more details, the GN models the spatio-temporal propagation of the dynamic graph Ḡ_τ based on the following three steps:

i) In the first step, the edge updating function φ^E is applied to every single edge in the graph to calculate the per-edge updates. For each edge k, we combine the encoded edge data ē_k^τ and the node data for both the tail node tail(k) and head node head(k) of k, and then the combined data is fed into φ^E, as shown in the following equation:

ē′_k^τ = φ^E(ē_k^τ, v̄_{tail(k)}^τ, v̄_{head(k)}^τ), ∀k, τ    (3)

where φ^E is modeled through an MLP (MLP_{φ^E}).

ii) In step two, the aggregating function ρ^{E→V} is applied to all the edges that point to node i. Mathematically, we denote Ē′_{→i}^τ = {ē′_k^τ | tail(k) = i, ∀k} as the set of the updated edge data pointing to i at time τ. The function ρ^{E→V} aggregates the information in Ē′_{→i}^τ and outputs a representation of all the edges pointing to node i, as shown in Equation 4. We note that the aggregating function should work on different sizes of Ē′_{→i}^τ, and hence the element-wise mean function is chosen as ρ^{E→V}.

ē_{→i}^τ = ρ^{E→V}(Ē′_{→i}^τ), ∀i, τ    (4)

iii) In step three, the node updating function φ^V is applied to every single node in the graph. For node i, the function φ^V takes the encoded node data v̄_i^τ and the representation of the edge data pointing to node i computed in the previous step as input, and the output is the updated node data v̄′_i^τ, as represented in the following equation:

v̄′_i^τ = φ^V(ē_{→i}^τ, v̄_i^τ), ∀i, τ    (5)

where φ^V is modeled as a different MLP (MLP_{φ^V}).

Overall, the GN model decomposes the complex topological graph structure into updating and aggregating operations on each single node and edge, and the localized relationship on the graph can be modeled accordingly. The GN is a generalized module that can be reduced to many existing GNNs, such as graph convolution neural networks. Mathematically, the locale of node i can be rigorously defined as ē_{→i}^τ, v̄_i^τ, Ē′_{→i}^τ and all the e_k^τ, v_{tail(k)}^τ, v_{head(k)}^τ for all the k that are connecting to i. The aggregating function ρ^{E→V} can be applied to different numbers of edges for each node, making our model flexible for various graph topologies. Indeed, the updating and aggregating operations in the GN mimic the localized traffic exchanges shown in Figure 3, and hence it is powerful in uncovering the locally spatial patterns of traffic data. Additionally, the GN is applied to each node and edge separately, so it is independent of the network size, which is another attractive feature of the GN.

4) Node Decoder: The updated node data v̄′_i^τ from the GN model is further decoded by the node decoder. Similar to the node encoder, the node decoder is modeled through an MLP, as shown in Equation 6.

v′_i^τ = MLP_{V′}(v̄′_i^τ), ∀i, τ    (6)

where v′_i^τ represents the decoded data for node i at time τ.

5) Output Layer: In the output layer, we combine the locally spatial and temporal information obtained from the node decoder and the NodeGRU, respectively. Mathematically, we concatenate v′_i^τ and v_i^{τ,GRU}, and the resulting vector is fed into an MLP for predicting the traffic states at time τ + 1, as shown in Equation 7.

v̂_i^{τ+1} = MLP_Output(v′_i^τ ⊕ v_i^{τ,GRU}), ∀i, τ    (7)

where ⊕ is the vector concatenation operator, v̂_i^{τ+1} is the prediction for v_i^{τ+1}, and MLP_Output represents another MLP. To summarize, all the computational steps in LocaleGN are presented in Algorithm 1.

Finally, the difference between the prediction v̂_i^{τ+1} and the actual data v_i^{τ+1} is measured by the ℓ2 norm Σ_i (v_i^{τ+1} − v̂_i^{τ+1})². The error is back-propagated to update all the parameters in LocaleGN.

In addition, the computational complexity of the proposed LocaleGN is O(N·P + N·M), in which N is the number of nodes, P is the number of GN layers and M is the length of the look-back window.


Algorithm 1 Computational Steps in LocaleGN

Require: graph structure G = (V, E), node data V_τ = [v_1^τ; ...; v_i^τ; ...; v_{N_v}^τ] and edge data E_τ = [e_1^τ; ...; e_k^τ; ...; e_{N_e}^τ].
Ensure: GRU in Equation 1, MLP_E and MLP_V in Equation 2, MLP_{φ^E} and MLP_{φ^V} in the GN, MLP_{V′} in Equation 6, and MLP_Output in Equation 7 are properly initialized and trained.
for i ∈ {1, ..., N_v} do
    Compute v_i^{τ,GRU} based on Equation 1.
    Compute v̄_i^τ based on Equation 2.
end for
for k ∈ {1, ..., N_e} do
    Compute ē_k^τ based on Equation 2.
    Update the edge data to ē′_k^τ using the edge updating function based on Equation 3.
end for
for i ∈ {1, ..., N_v} do
    Form the set of the updated edges pointing to node i as Ē′_{→i}^τ.
    Compute ē_{→i}^τ based on Equation 4 and Ē′_{→i}^τ.
    Compute v̄′_i^τ based on Equation 5.
    Decode the updated node data to v′_i^τ based on Equation 6.
    Concatenate v′_i^τ and v_i^{τ,GRU}.
    Predict v̂_i^{τ+1} based on Equation 7.
end for
return v̂_i^{τ+1}, ∀i ∈ {1, ..., N_v}
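Algorithm 1 maps one-to-one onto a forward pass. The sketch below is an illustrative composition (reusing NodeGRU, mlp and GNLayer from the earlier sketches; the sizes are assumptions, not the finalized specifications in Appendix A) that wires Equations 1-7 together in the order of Algorithm 1.

```python
import torch
import torch.nn as nn

class LocaleGNSketch(nn.Module):
    """Illustrative end-to-end pass over Equations 1-7."""
    def __init__(self, m: int = 12, d: int = 64):
        super().__init__()
        self.node_gru = NodeGRU(hidden_size=d)        # Eq. 1
        self.enc_v = mlp(m, d)                        # MLP_V in Eq. 2
        self.enc_e = mlp(m, d)                        # MLP_E in Eq. 2
        self.gn = GNLayer(d)                          # Eqs. 3-5
        self.dec_v = mlp(d, d)                        # MLP_V' in Eq. 6
        self.out = mlp(2 * d, 1)                      # MLP_Output in Eq. 7

    def forward(self, v, e, tail, head):
        # v: (N_v, M) node look-back windows; e: (N_e, M) edge data.
        h_gru = self.node_gru(v)                      # temporal pattern per node
        v_bar, e_bar = self.enc_v(v), self.enc_e(e)   # encoded graph
        v_upd, _ = self.gn(v_bar, e_bar, tail, head)  # locally spatial pattern
        v_dec = self.dec_v(v_upd)                     # decoded node data
        return self.out(torch.cat([v_dec, h_gru], dim=-1)).squeeze(-1)  # v_hat^{tau+1}
```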
 
C. Transferability of LocaleGN

In this section, we discuss how LocaleGN differs from the existing traffic prediction models in terms of transferability. The most important feature of LocaleGN is that both the GN and NodeGRU modules make use of the parameter-sharing characteristic so that the knowledge learned for traffic prediction can be transferred among nodes. Though most of the existing GNNs can be cast into message passing neural networks and hence their parameters can be shared [19], existing traffic prediction models avoid using the parameter-sharing characteristic because it could impede the prediction performance.

In this study, we particularly consider the edge data e_k^τ, which represents road properties (e.g., road length, width, and types), location information (e.g., weather, socio-demographic factors), and so on. For node i, we define the locale of node i at time τ to be L_τ(i) = {v_i^τ, Ē_{→i}^τ}, and hence the information contained in the locale can be used to conduct traffic prediction independently. The learning of traffic states becomes inductive, i.e., LocaleGN universally learns the traffic states conditional on the locale, instead of learning the traffic states for each node separately. Later in the numerical experiments, we demonstrate that with the consideration of edge data, the node embedding can be learned through a universal encoder, i.e., the GN module, and the traffic prediction can be conducted per node. Additionally, the information contained in the locale is permutation-invariant, hence LocaleGN is suitable for different network topologies for traffic prediction.

Fig. 5. Determining the number of GN layers.

D. Determining the Number of GN Layers

One can see that the GN (spatial unit) in LocaleGN depicts the traffic states on each node, and K layers of GN can be stacked to model the localized traffic exchanges for the K-hop neighbors. In this section, we discuss how to determine K with a theoretical analysis.

In general, traffic states are formed through forward traffic flow propagation and congestion spillover. To predict the traffic states on node i for the next L time intervals, we should ensure that the traffic exchanges in the next L time intervals are within the K-hop neighbors, so that the GN has the capability to capture the changes in traffic states. Without loss of generality, we assume that around node i, the free-flow speed is f and the shockwave speed is w, respectively. The free-flow speed f can be obtained from OpenStreetMap, which contains road level information (e.g., highway, local roads) that can be used to infer the free-flow speed. Similarly, the shockwave speed w can be determined based on the road properties in OpenStreetMap. We define d_i(j), ∀j ∈ V, which represents the distance from node i to node j, and K can be determined using Equation 8.

K = max( ∪_{i∈V} I_i^forward ∪ I_i^backward )
I_i^forward = {hop_i(j) | d_i(j) ≤ L·f, ∀j ∈ V}
I_i^backward = {hop_j(i) | d_j(i) ≤ L·w, ∀j ∈ V}    (8)

where hop_i(j) counts the number of edges in the shortest path from i to j. I_i^forward and I_i^backward denote the sets of hop counts that the forward traffic flow and the congestion spillover could reach from node i, respectively. Using Figure 5 as an example, the impact of a source node can spread to its directly connected neighbors in 5 minutes. It will take a longer time to spread the impact further, and hence it suffices to consider a small K based on Equation 8.
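Equation 8 can be evaluated directly on the sensor graph. The sketch below is our illustration (it assumes a networkx DiGraph whose edges carry a "length" attribute in the same distance units as L·f and L·w); it combines weighted shortest-path distances for d_i(j) with unweighted shortest paths for the hop counts.

```python
import networkx as nx

def num_gn_layers(G: nx.DiGraph, L: int, f: float, w: float) -> int:
    """K from Equation 8: deepest hop that flow (L*f) or spillover (L*w) reaches."""
    hops = dict(nx.all_pairs_shortest_path_length(G))                   # hop_i(j)
    dist = dict(nx.all_pairs_dijkstra_path_length(G, weight="length"))  # d_i(j)
    K = 0
    for i in G.nodes:
        fwd = [hops[i][j] for j, d in dist[i].items()
               if d <= L * f and j in hops[i]]                # I_i^forward
        bwd = [hops[j][i] for j in G.nodes
               if i in dist.get(j, {}) and dist[j][i] <= L * w
               and i in hops.get(j, {})]                      # I_i^backward
        K = max([K] + fwd + bwd)
    return K
```

For 5-minute intervals, L·f and L·w correspond to short distances, so K stays small, which is consistent with the Figure 5 argument.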
IV. NUMERICAL EXPERIMENTS

In this section, the proposed model is examined on real-world flow and speed datasets to test the short-term prediction performance.

A. Datasets

We evaluate LocaleGN and other baseline models on three traffic speed and three traffic flow datasets, respectively. The detailed dataset information is listed as follows:

1) Traffic Speed Data: Three speed datasets are used, and the time interval is five minutes:

• LA: The LA dataset is collected from 207 loop detectors in Los Angeles, and the data ranges from March 1st to March 7th, 2012.
• SacS: also known as PEMSD7, which contains speed data from 228 detectors in the Sacramento area. The data ranges from May to June 2012.
• ST: ST contains speed data from 170 sensors in the Seattle area. The data ranges from Jan 1st to Feb 1st, 2015.


2) Traffic Flow Data: Three traffic flow datasets in [57] are utilized, and the time interval is five minutes:

• SF: also known as PEMS04, which includes 307 sensors in the San Francisco Bay area, and the data ranges from September 1st to November 7th, 2018.
• SacF: also known as PEMS07, which contains traffic flow data from 883 detectors in the Sacramento area. The data ranges from January 1st to August 7th, 2018.
• SanB: also known as PEMS08, which contains traffic flow data from 170 sensors in the San Bernardino area, and the data ranges from July 1st to August 7th, 2016.

B. Baseline Models

The following baseline models are compared with LocaleGN. Particularly, NodeGRU and GN are presented for the ablation study.

• GCN: Graph Convolutional Network for graph-based data prediction [58].
• LSTM: Long Short-Term Memory network for time series prediction [59].
• T-GCN: Temporal Graph Convolutional Network [22]. Spatial dependency is captured by the GCN module and temporal correlation is abstracted by the GRU.
• ST-GCN: Spatial-Temporal Graph Convolutional Network [23]. ChebNet and a 2D convolutional network are utilized to capture the spatial and temporal correlations, respectively.
• DCRNN: Diffusion Convolutional Recurrent Neural Network [20].
• Graph WaveNet: Graph WaveNet for Deep Spatial-Temporal Graph Modeling [21].
• STEMGNN: Spectral Temporal Graph Neural Network [25].
• LSDLM: Large-scale Dynamic Linear Model [40].
• NodeGRU: Gated Recurrent Unit [60] for every single node's data in the graph. This model can be viewed as LocaleGN with only the GRU component.
• GN: Graph network model with relational inductive bias [20]. This model can be viewed as LocaleGN without the NodeGRU module.

C. Experimental Settings

All experiments are conducted on a desktop with an Intel Core i9-10900K CPU @3.7GHz × 10, 2666MHz × 2 × 16GB RAM, GeForce RTX 2080 Ti × 2, and a 500GB SSD. We divide all speed and flow datasets of seven days with a ratio of 5:1:1 into a training set, validation set and testing set. Since the most commonly used historical data ranges from 7 days to 30 days, we choose 7 days' data as the minimum sufficient data for training the majority of existing prediction models. To simulate the situation of limited training samples, we randomly select 20% of the training set for training. The sufficient dataset for prediction is 7 days (including five days' length of training, a 1-day length of validation and a 1-day length of testing), and we utilize 1-day data as the insufficient training data (20% of the original training dataset) and keep the validation and testing sets unchanged. When constructing G_τ, we set the edge data to be the normalized distance of the edges between two nodes, and the node data are the traffic speed or traffic flow from the different datasets. After parameter fine-tuning with the validation set, we select one hour of historical data to predict the next 5, 15, 30, 45, and 60 minutes' data, meaning M = 12 and L = 1, 3, 6, 9, 12. Hyper-parameters in LocaleGN are determined by the validation set, and the finalized model specifications are presented in Appendix A.
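The few-sample split above is straightforward to reproduce. The sketch below is our reading of the protocol (the `series` array, the contiguous day-wise split, and the random seed are assumptions): it carves a seven-day, 5-minute series into 5:1:1 train/validation/test and keeps a random 20% of the training data.

```python
import numpy as np

def split_days(series: np.ndarray, ratio=(5, 1, 1), frac: float = 0.2, seed: int = 0):
    """series: (T, N_v) seven days of 5-minute readings -> few-sample train split."""
    day = 288                                    # 5-minute intervals per day
    n_tr, n_va, _ = [r * day for r in ratio]
    train, val, test = series[:n_tr], series[n_tr:n_tr + n_va], series[n_tr + n_va:]
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(train), size=int(frac * len(train)), replace=False)
    return train[np.sort(idx)], val, test       # only 20% of training intervals kept

series = np.random.rand(7 * 288, 207)
tr, va, te = split_days(series)
print(tr.shape, va.shape, te.shape)             # (288, 207) (288, 207) (288, 207)
```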
D. Evaluation Metrics

Three different metrics are chosen to evaluate the prediction performance of LocaleGN and other baseline models by comparing v̂_i^{τ+1} with the ground truth v_i^{τ+1}.

• Root Mean Squared Error (RMSE):
  RMSE = sqrt( (1/N_v) Σ_{i=1}^{N_v} (v̂_i^{τ+1} − v_i^{τ+1})² )
• Mean Absolute Error (MAE):
  MAE = (1/N_v) Σ_{i=1}^{N_v} |v̂_i^{τ+1} − v_i^{τ+1}|
• Mean Absolute Percentage Error (MAPE):
  MAPE = (100%/N_v) Σ_{i=1}^{N_v} |(v̂_i^{τ+1} − v_i^{τ+1}) / v_i^{τ+1}|
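For completeness, the three metrics can be computed in a few lines of numpy (a straightforward sketch; the function and variable names are ours). The instability of MAPE when flows approach zero, discussed at the end of Section IV-E, is visible directly in the division.

```python
import numpy as np

def rmse(pred: np.ndarray, true: np.ndarray) -> float:
    return float(np.sqrt(np.mean((pred - true) ** 2)))

def mae(pred: np.ndarray, true: np.ndarray) -> float:
    return float(np.mean(np.abs(pred - true)))

def mape(pred: np.ndarray, true: np.ndarray) -> float:
    # Unstable when true values are close to zero (see the flow-data remark below).
    return float(100.0 * np.mean(np.abs((pred - true) / true)))
```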
E. Experimental Results

Table I and Table II present the prediction accuracy of different models with few training samples on the three speed and three flow datasets, respectively. One can see the proposed LocaleGN outperforms the other baseline models on the six datasets most of the time. LocaleGN outperforms the large-scale dynamic linear model (LSDLM) in terms of testing accuracy, which shows that the knowledge learned in our proposed model can be applied to unseen scenarios. LSTM utilizes temporal correlations, and GCN models the spatial dependencies; neither model can fully take the spatio-temporal dependencies into consideration at the same time. With few training samples, the performance of state-of-the-art models like T-GCN and ST-GCN degrades, while LocaleGN achieves higher prediction accuracy. Besides, STEMGNN achieves acceptable prediction performance but is not as accurate as LocaleGN. Among parameter-sharing models, LocaleGN also outperforms DCRNN and Graph WaveNet. This is probably due to the complicated encoder and decoder structure of DCRNN and the specified node embedding of Graph WaveNet: the complicated structures make them overfit the limited available data. In contrast, with parameter-sharing characteristics, LocaleGN is a light-weighted model with a well-designed structure that meets the requirements of short-term traffic prediction with few samples.

We note that for the flow datasets, the standard deviation is high for the baseline models, which suggests that these models may overfit with few training samples. Besides, there is a large gap between the prediction accuracy of NodeGRU and LSTM, although GRU and LSTM are similar RNN modules. This is because a complicated model is inclined to overfit on small datasets: GRU is less complicated than LSTM, and LSTM is applied to all nodes' data while NodeGRU handles each node separately, which means LSTM has more trainable parameters and is more complicated. The MAPE for some models


TABLE I
PERFORMANCE OF LOCALEGN AND OTHER BASELINE MODELS ON TRAFFIC SPEED DATASETS (AVERAGE ± STANDARD DEVIATION ACROSS 5 EXPERIMENTAL REPEATS; UNIT FOR RMSE AND MAE: MILES/HOUR)

(e.g., GCN, LSTM) in the flow datasets is very high, because the flow can be close to zero, resulting in an arbitrarily large MAPE if the flow prediction is not accurate.

1) Ablation Study: The ablation study for LocaleGN has been incorporated in Table I and Table II by comparing with NodeGRU and GN, as both models are the essential components of LocaleGN. In the speed datasets, we observe that LocaleGN is consistently better than the other two models. In the flow datasets, NodeGRU and GN occasionally outperform LocaleGN, which suggests that either the GRU component or the GN component might dominate for a specific dataset. Overall, LocaleGN achieves satisfactory performance for all datasets under different metrics.

2) Justifying the Use of Each Module: We further conduct more experiments to justify the use of NodeGRU and GN. To this end, we try to replace each module with other modules with similar functionality. For example, NodeGRU can be replaced with a residual connection [61], and GN can be replaced with a self-attention module [62]. We find that both NodeGRU and GN outperform their counterparts, which justifies the use of both modules. Details are presented in Appendix B.

F. Number of Model Parameters

According to the few-sample experiment results, although inferior to LocaleGN, ST-GCN achieves better prediction


TABLE II
PERFORMANCE OF LOCALEGN AND OTHER BASELINE MODELS ON TRAFFIC FLOW DATASETS (AVERAGE ± STANDARD DEVIATION ACROSS 5 EXPERIMENTAL REPEATS; UNIT FOR RMSE AND MAE: VEHICLES/HOUR)

TABLE III
COMPARISON OF THE NUMBER OF TRAINABLE PARAMETERS BETWEEN ST-GCN AND LOCALEGN (UNIT: IN THOUSANDS)

accuracy than the other baseline models. We compare the number of trainable parameters of LocaleGN and ST-GCN in Table III. For any network of arbitrary size, the number of trainable parameters in LocaleGN remains the same. However, for ST-GCN, the number of parameters is positively correlated with the network size. For example, the flow dataset SacF has a large node number of 883, and the number of trainable parameters in ST-GCN is ten times larger than that of LocaleGN. However, the prediction performance of LocaleGN is even better than that of ST-GCN. It further proves that short-term traffic prediction relies on localized information, and this can be efficiently learned from few samples with graph relational inductive biases.

In the traditional graph model for spatio-temporal data prediction, the parameter sets are different for different nodes, which causes difficulty for the training process on large graphs. However, LocaleGN makes it feasible to learn across nodes. It can be seen as a transfer learning model among different nodes on a network. One node and its neighborhoods can be interpreted as a sub-graph, and LocaleGN is able to extract useful information from the sub-graphs. This suggests the reason why LocaleGN requires fewer samples for training.

TABLE IV
TRANSFERRING PERFORMANCE OF DCRNN, GRAPH WAVENET AND LOCALEGN ON TRAFFIC SPEED DATASETS (AVERAGE ± STANDARD DEVIATION ACROSS 5 EXPERIMENTAL REPEATS; UNIT FOR RMSE AND MAE: MILES/HOUR)

Fig. 6. Comparison of prediction performance among LocaleGN, T-GCN and ST-GCN with the percentage of training set ranging from 20% to 100% (first row: speed data LA, ST, and SacS, unit: miles/hour; second row: flow data SF, SacF, and SanB, unit: vehicles/hour).

G. Sensitivity Analysis

Apart from the evaluation on few training samples (20% of the training data), we also compare the prediction performance of LocaleGN and other baseline models with different percentages of training data (from 20% to 100%); the results are presented in Figure 6. The prediction performance remains stable for LocaleGN when the training ratio decreases from 100% to 20%, while the prediction performance of ST-GCN gradually degrades when reducing the number of training samples.

Overall, LocaleGN outperforms T-GCN and ST-GCN with all percentages of training data, and the advantage is notably large when the ratio is 20% or 40%. In other words, T-GCN and ST-GCN overfit and become unstable when the training set is small. In contrast, LocaleGN has better learning ability in capturing the localized spatial and temporal correlations than T-GCN and ST-GCN. The results further demonstrate that LocaleGN is a powerful and efficient model for the task of few-sample traffic prediction.

H. Cross-City Transfer Analysis

In this section, we consider a more challenging situation, in which traffic sensors are just installed or the traffic prediction service is just deployed in a city. In this case, we would like to use the knowledge learned from some city and directly apply it to the new city. To this end, we consider one city (target graph) that does not have historical traffic data, while some other cities (source graphs) have archived historical data. One can see this setting focuses on transfer learning across cities. The well-performed LocaleGN model is designed for extracting locally spatial and temporal traffic patterns, which is independent of graph structures and has the potential to be transferred to the target city. As the spectral graph-based GNNs cannot be applied across different graph structures, we compare LocaleGN with DCRNN and Graph WaveNet, the two state-of-the-art spatial graph-based models.

We compare the three models on the speed and flow datasets respectively. Each model is pre-trained on a source graph with a random 20% of the training data and then directly tested on the target graph without further training. The prediction results on the target graph are presented in Table IV and Table V. One can see that LocaleGN consistently outperforms DCRNN and Graph WaveNet on different cities for both speed and flow datasets, and the prediction accuracy is very stable. The results show that LocaleGN has great potential to transfer pre-trained models across cities even without fine-tuning on the data of the target city, because universal traffic patterns and physical rules, regardless of network structures, can be learned by LocaleGN.

In Figure 7, we compare the prediction results of Graph WaveNet and LocaleGN from 8:00 to 20:00 for the three traffic speed datasets and three traffic flow datasets, respectively. In general, LocaleGN can make more accurate predictions than Graph WaveNet on all datasets. It is notable that when the speed/flow surges or decreases sharply, the performance of Graph WaveNet declines significantly while the prediction of LocaleGN remains stable. This demonstrates that although Graph WaveNet and DCRNN contain parameter-sharing characteristics, they cannot learn the universal rule for short-term traffic prediction, and their performance still depends on graph structures. Consequently, directly applying Graph WaveNet and DCRNN to the target city will result in low prediction accuracy. In contrast, the performance of LocaleGN remains satisfactory. The structure of LocaleGN can serve as an essential building block for transferring prediction models between cities. Besides, the embedding and temporal components can be easily replaced with other state-of-the-art deep learning modules.

Fig. 7. Comparison of prediction performance with LocaleGN and Graph WaveNet for the cross-city transfer analysis (first row: speed data LA, SacS, and ST, unit: miles/hour; second row: flow data SF, SacF, and SanB, unit: vehicles/hour).

TABLE V
TRANSFERRING PERFORMANCE OF DCRNN, GRAPH WAVENET AND LOCALEGN ON TRAFFIC FLOW DATASETS (AVERAGE ± STANDARD DEVIATION ACROSS 5 EXPERIMENTAL REPEATS; UNIT FOR RMSE AND MAE: VEHICLES/HOUR)

V. CONCLUSION

In this paper, we discuss and define the research question of few-sample traffic prediction on large-scale networks. A graph network-based model, LocaleGN, is developed to learn the locally spatial and temporal patterns of traffic data, so that accurate short-term predictions can be generated. The parameter-sharing characteristics help LocaleGN prevent overfitting with limited training data. Additionally, the learned knowledge in LocaleGN can be transferred across cities. Extensive evaluations on six real-world traffic speed or flow datasets demonstrate that LocaleGN outperforms other baseline models. LocaleGN is also more light-weighted as it contains fewer trainable parameters, which also suggests a shorter training time and lower data requirements. An ablation study and sensitivity analysis are conducted to show the compactness and robustness of LocaleGN. The cross-city transfer analysis also demonstrates its great potential in developing traffic prediction services without training in a new city. Overall, this paper sheds light on utilizing the transferability of LocaleGN for traffic prediction and on developing traffic prediction models for cities with few historically archived data.

As for future research directions, firstly, it is necessary to interpret the knowledge learned by LocaleGN among different nodes, as this paper has demonstrated that the learned knowledge among nodes can contribute to the prediction accuracy. Secondly, because LocaleGN can be applied to different cities, it is interesting to develop a unified and fair framework for training and testing LocaleGN on multiple cities without biases. Essentially, LocaleGN has capacities in transferring knowledge not only among nodes in a single graph, but also across different graphs. Thirdly, LocaleGN can be further extended to predict OD demand for public transit and ride-hailing services [63], [64]. Indeed, the enlightening idea of relational inductive bias and the simple structure of LocaleGN can be applied to learn various spatio-temporal datasets, and the localized patterns can be extracted. These datasets include human mobility patterns, social media, epidemics, and climate-related data.

TABLE VI
FUNCTIONS IN LOCALEGN

TABLE VII
HYPER-PARAMETERS IN LOCALEGN

TABLE VIII
PERFORMANCE OF LOCALEGN, RGN AND AGN ON TRAFFIC SPEED DATASETS

TABLE IX
PERFORMANCE OF LOCALEGN, RGN AND AGN ON TRAFFIC FLOW DATASETS

APPENDIX

A. Model Specifications

In this section, we present the details of the model specifications of LOCALEGN. There are four modules in LOCALEGN: the Encoder, NODEGRU, GN, and the Output layer. The detailed layer information of each module is listed in Table VI. Besides, we provide the hyper-parameters for the model settings and the training process in Table VII.
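To make the module stack concrete, the following is a minimal PyTorch-style sketch of the four modules chained in one forward pass. The hidden sizes, the internals of the helper class GNBlock, and all identifiers here are illustrative assumptions for exposition rather than the exact layers of Table VI; the released code repository contains the actual implementation.

```python
import torch
import torch.nn as nn


class GNBlock(nn.Module):
    """A minimal graph-network block (illustrative, not the released code):
    each edge builds a message from its two endpoint features, and each node
    is updated from the mean of its incoming messages, i.e., its 1-hop locale."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.edge_mlp = nn.Linear(2 * hidden_dim, hidden_dim)
        self.node_mlp = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, h: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # h: (N, H) node features; edge_index: (2, E) source/target node ids
        src, dst = edge_index
        msg = torch.relu(self.edge_mlp(torch.cat([h[src], h[dst]], dim=-1)))
        agg = torch.zeros_like(h).index_add_(0, dst, msg)     # sum of messages
        deg = h.new_zeros(h.size(0), 1).index_add_(
            0, dst, h.new_ones(dst.size(0), 1)).clamp(min=1.0)  # in-degree
        return torch.relu(self.node_mlp(torch.cat([h, agg / deg], dim=-1)))


class LocaleGNSketch(nn.Module):
    """Illustrative chaining of Encoder -> NodeGRU -> GN -> Output layer;
    hidden sizes are assumptions, not the values of Table VII."""

    def __init__(self, in_dim: int = 1, hidden_dim: int = 64):
        super().__init__()
        self.encoder = nn.Linear(in_dim, hidden_dim)            # Encoder
        self.node_gru = nn.GRU(hidden_dim, hidden_dim,
                               batch_first=True)                # NodeGRU
        self.gn = GNBlock(hidden_dim)                           # GN module
        self.output = nn.Linear(hidden_dim, 1)                  # Output layer

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x: (N, T, in_dim) recent observations of each of the N nodes
        h = torch.relu(self.encoder(x))                         # (N, T, H)
        _, h_last = self.node_gru(h)                            # (1, N, H)
        h_gn = self.gn(h_last.squeeze(0), edge_index)           # localized update
        return self.output(h_gn)                                # (N, 1) prediction
```

Because the Encoder, NODEGRU, and GN weights are shared by every node and edge, the parameter count in this sketch is independent of the network size, which is the property that allows training on few samples and applying one trained model to graphs of different sizes.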

B. Additional Experiments

In this section, we provide the results of some additional experiments to further justify the proposed structure of LOCALEGN.

1) Residual Graph Network: On top of the basic architecture of Encoder, GN, and Decoder, we also try adding residual connections to the GN module. Instead of using NODEGRU, the GN module is developed to learn the differences between the input graph and the expected output graph. The encoded data are processed by the GN module, in which the residual connection updates the node attributes. The design of the Residual Graph Network (RGN) model is inspired by the theory of dynamical systems, in which the output at time τ + 1 can be calculated from the input at time τ together with the evolution function GN. Mathematically, the residual connection is presented in Equation 9:

G_{τ+1} = GN(G_τ) + G_τ    (9)
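As a minimal sketch of this residual step (reusing the GNBlock sketched in Appendix A; the function name rgn_step is ours and not part of the released code):

```python
import torch


def rgn_step(gn, g_tau: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
    """One residual update following Eq. (9): the GN learns the increment
    between consecutive graph states. `gn` is assumed to be a GNBlock-style
    module as sketched above."""
    return g_tau + gn(g_tau, edge_index)  # G_{tau+1} = GN(G_tau) + G_tau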
2) Attention Graph Network: We also try replacing the NODEGRU module in LOCALEGN with the SELF-ATTENTION mechanism to capture the temporal dependency of the traffic data. Different from applying GRU to the data of each node separately, the SELF-ATTENTION module is applied to all the node data simultaneously. The reason we compare against the SELF-ATTENTION module is that recent studies indicate its strong potential in transferability [62], [65]. The proposed Attention Graph Network (AGN) can be represented by Equation 10:

G_{τ+1} = GN(G_τ + SELF-ATTENTION(G_τ))    (10)

where the key, query, and value in SELF-ATTENTION are embedded with the historical node data. The output vector from SELF-ATTENTION is transformed to a lower-dimensional vector to obtain the compressed temporal patterns, which are then combined with the original node data. The combined vector is fed into an encoder to generate the input of GN, in order to further infer the spatio-temporal patterns at time τ + 1.
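A corresponding sketch of Equation 10 is given below, where a standard nn.MultiheadAttention layer stands in for the SELF-ATTENTION module; the head count, the hidden size, and the folding of the dimension-reduction step into the attention output projection are our simplifying assumptions.

```python
import torch
import torch.nn as nn


class AGNStep(nn.Module):
    """One update following Eq. (10): self-attention over all node features
    at once, added to the node attributes before the localized GN update."""

    def __init__(self, hidden_dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                          batch_first=True)
        self.gn = GNBlock(hidden_dim)  # the GN block sketched earlier

    def forward(self, g_tau: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # g_tau: (N, H); the N nodes are treated as one unbatched "sequence"
        q = g_tau.unsqueeze(0)                    # (1, N, H)
        attn_out, _ = self.attn(q, q, q)          # attention across all nodes
        return self.gn(g_tau + attn_out.squeeze(0), edge_index)
```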


3) Results Comparison: Using the same experiment settings as in Section IV-C, the two additional models are examined on the three speed datasets and the three flow datasets, respectively. The prediction errors of LOCALEGN, RGN, and AGN are listed in Table VIII and Table IX.

For traffic speed prediction, LOCALEGN outperforms RGN and AGN on both LA and ST, but slightly under-performs RGN on the SacS dataset. For traffic flow prediction, LOCALEGN outperforms RGN and AGN on all datasets in terms of MAPE, but its performance is slightly lower than that of RGN on the SacF dataset in terms of MAE and RMSE. On SF, the differences between AGN and LOCALEGN in MAE and RMSE are within ±0.01 and negligible, while the MAPE of LOCALEGN is substantially smaller than that of AGN. Overall, LOCALEGN has the best and most stable performance among the three proposed models on both the traffic speed and the traffic flow prediction tasks. The detailed implementations of all three models are also available in the code repository.
REFERENCES

[1] F. Li et al., "Dynamic graph convolutional recurrent network for traffic prediction: Benchmark and solution," ACM Trans. Knowl. Discovery Data, vol. 16, Apr. 2021.
[2] G. Meena, D. Sharma, and M. Mahrishi, "Traffic prediction for intelligent transportation system using machine learning," in Proc. 3rd Int. Conf. Emerg. Technol. Comput. Eng., Mach. Learn. Internet Things (ICETCE), Feb. 2020, pp. 145–148.
[3] S. Kaffash, A. T. Nguyen, and J. Zhu, "Big data algorithms and applications in intelligent transportation system: A review and bibliometric analysis," Int. J. Prod. Econ., vol. 231, Jan. 2021, Art. no. 107868.
[4] Y. Ma, X. Zhu, S. Zhang, R. Yang, W. Wang, and D. Manocha, "TrafficPredict: Trajectory prediction for heterogeneous traffic-agents," in Proc. AAAI Conf. Artif. Intell., vol. 33, 2019, pp. 6120–6127.
[5] X. Zang, H. Yao, G. Zheng, N. Xu, K. Xu, and Z. Li, "MetaLight: Value-based meta-reinforcement learning for traffic signal control," in Proc. AAAI Conf. Artif. Intell., 2020, vol. 34, no. 1, pp. 1153–1160.
[6] J. Zhou et al., "Graph neural networks: A review of methods and applications," AI Open, vol. 1, pp. 57–81, Jan. 2020.
[7] W. Jiang and J. Luo, "Graph neural network for traffic forecasting: A survey," 2021, arXiv:2101.11174.
[8] F. Bell and S. Smyl. (Sep. 2018). Forecasting at Uber: An Introduction. [Online]. Available: https://eng.uber.com/forecasting-introduction/
[9] X. Yin, G. Wu, J. Wei, Y. Shen, H. Qi, and B. Yin, "Deep learning on traffic prediction: Methods, analysis and future directions," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 6, pp. 4927–4943, Jun. 2022.
[10] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. de Freitas, "Predicting parameters in deep learning," in Proc. NIPS, 2013, pp. 1–9.
[11] F. Emmert-Streib, Z. Yang, H. Feng, S. Tripathi, and M. Dehmer, "An introductory review of deep learning for prediction models with big data," Frontiers Artif. Intell., vol. 3, p. 4, Feb. 2020.
[12] P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler, "Benign overfitting in linear regression," Proc. Nat. Acad. Sci. USA, vol. 117, no. 48, pp. 30063–30070, Dec. 2020.
[13] G. Leduc, "Road traffic data: Collection methods and applications," Work. Papers Energy, Transport Climate Change, vol. 1, pp. 1–55, Nov. 2008.
[14] R. Caruana, S. Lawrence, and L. Giles, "Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping," in Proc. Adv. Neural Inf. Process. Syst., 2001, pp. 402–408.
[15] A. Ghasemian, H. Hosseinmardi, and A. Clauset, "Evaluating overfit and underfit in models of network community structure," IEEE Trans. Knowl. Data Eng., vol. 32, no. 9, pp. 1722–1735, Sep. 2020.
[16] P. W. Battaglia et al., "Relational inductive biases, deep learning, and graph networks," 2018, arXiv:1806.0126.
[17] S. Seo and Y. Liu, "Differentiable physics-informed graph networks," 2019, arXiv:1902.02950.
[18] A. Sanchez-Gonzalez et al., "Graph networks as learnable physics engines for inference and control," in Proc. Int. Conf. Mach. Learn., 2018, pp. 4470–4479.
[19] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, "Neural message passing for quantum chemistry," in Proc. Int. Conf. Mach. Learn., 2017, pp. 1263–1272.
[20] Y. Li, R. Yu, C. Shahabi, and Y. Liu, "Diffusion convolutional recurrent neural network: Data-driven traffic forecasting," in Proc. Int. Conf. Learn. Represent., 2018, pp. 1–16.
[21] Z. Wu, S. Pan, G. Long, J. Jiang, and C. Zhang, "Graph WaveNet for deep spatial–temporal graph modeling," in Proc. 28th Int. Joint Conf. Artif. Intell., Aug. 2019, pp. 1–7.
[22] L. Zhao et al., "T-GCN: A temporal graph convolutional network for traffic prediction," IEEE Trans. Intell. Transp. Syst., vol. 21, no. 9, pp. 3848–3858, Sep. 2020.
[23] B. Yu, H. Yin, and Z. Zhu, "Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting," in Proc. 27th Int. Joint Conf. Artif. Intell., Jul. 2018, pp. 1–7.
[24] Z. Cui, K. Henrickson, R. Ke, and Y. Wang, "Traffic graph convolutional recurrent neural network: A deep learning framework for network-scale traffic learning and forecasting," IEEE Trans. Intell. Transp. Syst., vol. 21, no. 11, pp. 4883–4894, Nov. 2020.
[25] D. Cao et al., "Spectral temporal graph neural network for multivariate time-series forecasting," in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 17766–17778.
[26] S. Guo, Y. Lin, N. Feng, C. Song, and H. Wan, "Attention based spatial–temporal graph convolutional networks for traffic flow forecasting," in Proc. AAAI, vol. 33, 2019, pp. 922–929.
[27] Z. Diao, X. Wang, D. Zhang, Y. Liu, K. Xie, and S. He, "Dynamic spatial–temporal graph convolutional neural networks for traffic forecasting," in Proc. AAAI Conf. Artif. Intell., vol. 33, 2019, pp. 890–897.
[28] H. Peng et al., "Spatial–temporal incidence dynamic graph neural networks for traffic flow forecasting," Inf. Sci., vol. 521, pp. 277–290, Jun. 2020.
[29] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, "The graph neural network model," IEEE Trans. Neural Netw., vol. 20, no. 1, pp. 61–80, Dec. 2009.
[30] G. Ge and W. Yuan, "Short-term traffic speed forecasting based on graph attention temporal convolutional networks," Neurocomputing, vol. 410, no. 14, pp. 387–393, Oct. 2020.
[31] Z. C. Lipton, J. Berkowitz, and C. Elkan, "A critical review of recurrent neural networks for sequence learning," 2015, arXiv:1506.00019.
[32] Y. Yu, X. Si, C. Hu, and Z. Jianxun, "A review of recurrent neural networks: LSTM cells and network architectures," Neural Comput., vol. 31, no. 7, pp. 1235–1270, Jul. 2019.
[33] Y. Tian and L. Pan, "Predicting short-term traffic flow by long short-term memory recurrent neural network," in Proc. IEEE Int. Conf. Smart City/SocialCom/SustainCom (SmartCity), Dec. 2015, pp. 153–158.
[34] D. Li and J. Lasenby, "Spatiotemporal attention-based graph convolution network for segment-level traffic prediction," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 7, pp. 8337–8345, Jul. 2022.
[35] H. Zheng, F. Lin, X. Feng, and Y. Chen, "A hybrid deep learning model with attention-based conv-LSTM networks for short-term traffic flow prediction," IEEE Trans. Intell. Transp. Syst., vol. 22, no. 11, pp. 6910–6920, Nov. 2021.
[36] X. Shi, H. Qi, Y. Shen, G. Wu, and B. Yin, "A spatial–temporal attention approach for traffic prediction," IEEE Trans. Intell. Transp. Syst., vol. 22, no. 8, pp. 4909–4918, Aug. 2021.
[37] H. Peng et al., "Dynamic graph convolutional network for long-term traffic flow prediction with reinforcement learning," Inf. Sci., vol. 578, pp. 401–416, Nov. 2021.
[38] M. Yildirimoglu and N. Geroliminis, "Experienced travel time prediction for congested freeways," Transp. Res. B, Methodol., vol. 53, pp. 45–63, Jul. 2013.
[39] S. Kwak and N. Geroliminis, "Travel time prediction for congested freeways with a dynamic linear model," IEEE Trans. Intell. Transp. Syst., vol. 22, no. 12, pp. 7667–7677, Dec. 2021.
[40] S. Kwak, N. Geroliminis, and P. Frossard, "Traffic signal prediction on transportation networks using spatio-temporal correlations on graphs," IEEE Trans. Signal Inf. Process. Over Netw., vol. 7, pp. 648–659, 2021.
[41] B. Y. Lin, F. F. Xu, E. Q. Liao, and K. Q. Zhu, "Transfer learning for traffic speed prediction: A preliminary study," in Proc. Workshops 32nd AAAI Conf. Artif. Intell., 2018.
[42] C. Zhang, H. Zhang, J. Qiao, D. Yuan, and M. Zhang, "Deep transfer learning for intelligent cellular traffic prediction based on cross-domain big data," IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1389–1401, Jun. 2019.
[43] Y. Ren, X. Chen, S. Wan, K. Xie, and K. Bian, "Passenger flow prediction in traffic system based on deep neural networks and transfer learning method," in Proc. 4th Int. Conf. Intell. Transp. Eng. (ICITE), Sep. 2019, pp. 115–120.


[44] B. Barz and J. Denzler, "Deep learning on small datasets without pre-training using cosine loss," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2020, pp. 1371–1380.
[45] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in Proc. Int. Conf. Mach. Learn., 2017, pp. 1126–1135.
[46] J. Snell, K. Swersky, and R. Zemel, "Prototypical networks for few-shot learning," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–11.
[47] L. Cao, R. Ji, C. Wang, and J. Li, "Towards domain adaptive vehicle detection in satellite image by supervised super-resolution transfer," in Proc. AAAI Conf. Artif. Intell., vol. 30, no. 1, 2016, pp. 1–7.
[48] C. Lin, L. Li, W. Luo, K. C. P. Wang, and J. Guo, "Transfer learning based traffic sign recognition using inception-V3 model," Periodica Polytechnica Transp. Eng., vol. 47, no. 3, pp. 242–250, Aug. 2018.
[49] L. Wang, X. Geng, X. Ma, F. Liu, and Q. Yang, "Cross-city transfer learning for deep spatio-temporal prediction," in Proc. 28th Int. Joint Conf. Artif. Intell., Aug. 2019, pp. 1–10.
[50] J. Atwood and D. Towsley, "Diffusion-convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., vol. 29, 2016, pp. 1–9.
[51] W. Hamilton, Z. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–11.
[52] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," Stat, vol. 1050, p. 20, Oct. 2017.
[53] H. Tang, Z. Huang, J. Gu, B.-L. Lu, and H. Su, "Towards scale-invariant graph-related problem solving by iterative homogeneous GNNs," in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 15811–15822.
[54] A. Sanchez-Gonzalez, J. Godwin, T. Pfaff, R. Ying, J. Leskovec, and P. Battaglia, "Learning to simulate complex physics with graph networks," in Proc. Int. Conf. Mach. Learn., 2020, pp. 8459–8468.
[55] V. Bapst et al., "Unveiling the predictive power of static structure in glassy systems," Nature Phys., vol. 16, no. 4, pp. 448–454, 2020.
[56] T. Xie, A. France-Lanord, Y. Wang, Y. Shao-Horn, and J. C. Grossman, "Graph dynamical networks for unsupervised learning of atomic scale dynamics in materials," Nature Commun., vol. 10, no. 1, pp. 1–9, Dec. 2019.
[57] C. Song, Y. Lin, S. Guo, and H. Wan, "Spatial–temporal synchronous graph convolutional networks: A new framework for spatial–temporal network data forecasting," in Proc. AAAI, 2020, vol. 34, no. 1, pp. 914–921.
[58] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and deep locally connected networks on graphs," in Proc. 2nd Int. Conf. Learn. Represent., 2014.
[59] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[60] K. Cho et al., "Learning phrase representations using RNN encoder–decoder for statistical machine translation," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1724–1734.
[61] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[62] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–11.
[63] X. Geng et al., "Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting," in Proc. AAAI Conf. Artif. Intell., 2019, vol. 33, no. 1, pp. 3656–3663.
[64] H. Yao et al., "Deep multi-view spatial–temporal network for taxi demand prediction," in Proc. AAAI Conf. Artif. Intell., 2018, vol. 32, no. 1, pp. 1–8.
[65] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol. (NAACL-HLT), vol. 1, J. Burstein, C. Doran, and T. Solorio, Eds. Minneapolis, MN, USA: Association for Computational Linguistics, 2019, pp. 4171–4186.

Mingxi Li graduated from Sichuan University. She is currently pursuing the Ph.D. degree with the Department of Civil and Environmental Engineering, The Hong Kong Polytechnic University (PolyU). Her research interests include deep learning, multisource traffic data mining, urban computing, and the corresponding applications in intelligent transportation systems (ITS).

Yihong Tang received the bachelor's degree in computer science and technology from the Beijing University of Posts and Telecommunications (BUPT). He is currently pursuing the Master of Philosophy (M.Phil.) degree with the Department of Urban Planning and Design, The University of Hong Kong (HKU). His research interests include data and graph mining, urban computing, demand and mobility modeling, privacy and security issues in emerging internet of things systems, and intelligent transportation applications.

Wei Ma (Member, IEEE) received the bachelor's degree in civil engineering and mathematics from Tsinghua University, China, and the master's degree in machine learning and civil and environmental engineering and the Ph.D. degree in civil and environmental engineering from Carnegie Mellon University, USA. He is currently an Assistant Professor with the Department of Civil and Environmental Engineering, The Hong Kong Polytechnic University (PolyU). His research interests include the intersection of machine learning, data mining, and transportation network modeling, with applications for smart and sustainable mobility systems. He has received the 2020 Mao Yisheng Outstanding Dissertation Award and the Best Paper Award (theoretical track) at the INFORMS Data Mining and Decision Analytics Workshop. He is now serving on the Early Career Editorial Advisory Board of Transportation Research Part C: Emerging Technologies.

