Stacked Bidirectional and Unidirectional LSTM Recurrent Neural Network for Network-wide Traffic Speed Prediction
Zhiyong Cui, Ruimin Ke, Ziyuan Pu, Yinhai Wang

Zhiyong Cui, Ruimin Ke, Ziyuan Pu, and Yinhai Wang are with the Department of Civil and Environmental Engineering, University of Washington, Seattle, WA 98195 USA.

Abstract— Short-term traffic forecasting based on deep learning methods, especially long short-term memory (LSTM) neural networks, has received much attention in recent years. However, the potential of deep learning methods in traffic forecasting has not yet been fully exploited in terms of the depth of the model architecture, the spatial scale of the prediction area, and the predictive power of spatial-temporal data. In this paper, a deep stacked bidirectional and unidirectional LSTM (SBU-LSTM) neural network architecture is proposed, which considers both forward and backward dependencies in time series data, to predict network-wide traffic speed. A bidirectional LSTM (BDLSTM) layer is exploited to capture spatial features and bidirectional temporal dependencies from historical data. To the best of our knowledge, this is the first time that BDLSTMs have been applied as building blocks for a deep architecture model to measure the backward dependency of traffic data for prediction. The proposed model can handle missing values in input data by using a masking mechanism. Further, this scalable model can predict traffic speed for both freeway and complex urban traffic networks. Comparisons with other classical and state-of-the-art models indicate that the proposed SBU-LSTM neural network achieves superior prediction performance for the whole traffic network in both accuracy and robustness.

Index Terms—Deep learning, bidirectional LSTM, backward dependency, traffic prediction, network-wide traffic

I. INTRODUCTION

THE performance of intelligent transportation systems (ITS) applications largely relies on the quality of traffic information. Recently, with the significant increases in both the total traffic volume and the data they generate, opportunities and challenges exist in transportation management and research in terms of how to efficiently and accurately understand and exploit the essential information underneath these massive datasets. Short-term traffic forecasting based on data-driven models for ITS applications has been one of the biggest developing research areas in utilizing massive traffic data, and has great influence on the overall performance of a variety of modern transportation systems [1].

In the last three decades, a large number of methods have been proposed for traffic forecasting in terms of predicting speed, volume, density and travel time. Studies in this area normally focus on the methodology components, aiming at developing different models to improve prediction accuracy, efficiency, or robustness. Previous literature indicates that the existing models can be roughly divided into two categories, i.e. classical statistical methods and computational intelligence (CI) approaches [2]. Most statistical methods for traffic forecasting were proposed at an earlier stage when traffic conditions were less complex and transportation datasets were relatively small in size. Later on, with the rapid development in traffic sensing technologies and computational power, as well as traffic data volume, the majority of more recent work focuses on CI approaches for traffic forecasting.

With the ability to deal with high dimensional data and the capability of capturing non-linear relationships, CI approaches tend to outperform the statistical methods, such as auto-regressive integrated moving average (ARIMA) [36], with respect to handling complex traffic forecasting problems [38]. However, the full potential of artificial intelligence was not exploited until the rise of neural network (NN) based methods. Ever since the precursory study of utilizing NN in the traffic prediction problem was proposed [39], many NN-based methods, like feed forward NN [41], fuzzy NN [40], recurrent NN (RNN) [42], and hybrid NN [25], have been adopted for traffic forecasting problems. Recurrent Neural Networks (RNNs) model sequence data by maintaining a chain-like structure and internal memory with loops [4] and, due to the dynamic nature of transportation, are especially suitable to capture the temporal evolution of traffic status. However, the chain-like structure and the depth of the loops make RNNs difficult to train because of the vanishing or blowing up gradient problems during the back-propagating process. There have been a number of attempts to overcome the difficulty of training RNNs over the years. These difficulties were successfully addressed by the Long Short-Term Memory networks (LSTMs) [3], which are a type of RNN with a gated structure to learn long-term dependencies of sequence-based tasks.

As a representative deep learning method handling sequence data, LSTMs have been proved to be able to process sequence data [4] and applied in many real-world problems, like speech recognition [6], image captioning [7], music composition [8] and human trajectory prediction [9]. In recent years, LSTMs have been gaining popularity in traffic forecasting due to their ability to model long-term dependencies. Several studies [2, 22-25, 34, 43, 44, 45] have been done to examine the applicability of LSTMs in traffic forecasting, and the results demonstrate the advantages of LSTMs. However, the potential of LSTMs is far from being fully exploited in the domain of transportation. The three primary limitations in previous work on LSTMs in traffic forecasting can be summarized as follows: 1) Traffic forecasting has generally focused on a small collection of locations rather than on the network level. 2) Most of the structures of LSTM-based methods are shallow. 3) The long-term dependencies are normally learned from chronologically arranged input data considering only forward dependencies, while backward dependencies learned from reverse-chronologically ordered data have never been explored.

From the perspective of the scale of prediction area, predicting large-scale transportation network traffic has become an important and challenging topic. Most existing studies utilize traffic data at a sensor location or along a corridor, and thus, network-wide prediction could not be achieved unless N models were trained for a traffic network with N nodes [22]. Instead, learning complex spatial-temporal features of a large-scale traffic network with only one model should be explored.

Regarding the depth of the structure of LSTM-based models, the structure should have the ability to capture the dynamic nature of the traffic system. Most of the newly proposed LSTM-based prediction models have relatively shallow structures with only one hidden layer to deal with time series data [2, 22, 44]. Existing studies [20, 21] have shown that deep LSTM architectures with several hidden layers can build up progressively higher levels of representations of sequence data. Although some studies [23-25] utilized more than one hidden LSTM layer, the influences of the number of LSTM layers in different LSTM-based models need to be further compared and explained.

In terms of the dependency in prediction problems, all of the information contained in time series data should be fully utilized. Normally, the dataset fed to an LSTM model is chronologically arranged, with the result that the information in the LSTMs is passed in a positive direction from the time step $t - 1$ to the time step $t$ along the chain-like structure. Thus, the LSTM structure only makes use of the forward dependencies [5]. But in this process, it is highly possible that useful information is filtered out or not efficiently passed through the chain-like gated structure. Therefore, it may be informative to take backward dependencies, which pass information in a negative direction, into consideration. Another reason for including backward dependency in our study is the periodicity of traffic. Unlike wind speed forecasting [15], traffic incident forecasting [16], or many other time series forecasting problems with strong randomness, traffic conditions have strong periodicity and regularity, and even short-term periodicity can be observed [17]. Analysing the periodicity of time series data, especially for recurring traffic patterns, from both forward and backward temporal perspectives will enhance the predictive performance [28]. However, based on our review of the literature, few studies on traffic analysis utilized the backward dependency. To fill this gap, a bidirectional LSTM (BDLSTM) with the ability to deal with both forward and backward dependencies is adopted as a component of the network structure in this study.

In addition, when predicting the network-wide traffic speed, rather than the speed at a single location, the impact of upstream and downstream speeds on each location in the traffic network should not be neglected. Previous studies [26, 27], which only make use of the forward dependencies of time series data, have found that the past speed values of upstream as well as downstream locations influence the future speed values of a location along a corridor. However, for complicated traffic networks with intersections and loops, upstream and downstream both refer to relative positions, and two arbitrary locations can be upstream and downstream of each other. Upstream and downstream are defined with respect to space, while forward and backward dependencies are defined with respect to time. With the help of forward and backward dependencies of spatial-temporal data, the learned features will be more comprehensive.

In this paper, we propose a stacked bidirectional and unidirectional LSTM (SBU-LSTM) neural network, combining LSTM and BDLSTM, for network-wide traffic speed prediction. The proposed model is capable of handling input data with missing values and is tested on both large-scale freeway and urban traffic networks in the Seattle area. Experimental results show that our model achieves network-wide traffic speed prediction with a high prediction accuracy. The influence of the number of layers, the number of time lags (the length of time series input), the dimension of weight matrices in LSTM/BDLSTM layers, and the impact of additional volume and occupancy data are further analysed. The model's scalability and its potential applications are also discussed. In summary, our contributions can be stated as follows: 1) we expand the traffic forecasting area from a specific location or several adjacent locations along a corridor to large-scale traffic networks, varying from freeway traffic network to complex urban traffic network; 2) we propose a deep architecture considering backward dependencies by combining LSTM and BDLSTM to enhance the feature learning from the large-scale spatial time series data; 3) a masking mechanism is adopted to handle missing values; and 4) we evaluate many of the model's internal and external influential factors.

II. METHODOLOGY

In this section, the components and the architecture of the proposed SBU-LSTM are introduced in detail. Here, speed prediction is defined as predicting future speed based on historical speed information. The illustrations of the models in the following sub-sections all take traffic speed prediction as an example.
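The definition of speed prediction given in this section, predicting future speed from historical speed, can be illustrated as a sliding-window framing of a speed series. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `make_samples`, the use of plain Python lists, and the toy speed values are all hypothetical and chosen for illustration.

```python
from typing import List, Sequence, Tuple


def make_samples(speeds: Sequence[float], n: int) -> Tuple[List[List[float]], List[float]]:
    """Split a speed series into (history, target) pairs.

    Hypothetical helper for illustration only.
    speeds: speed values at consecutive time steps for one location.
    n: number of historical time steps (time lags) used per sample.
    """
    if n <= 0 or n >= len(speeds):
        raise ValueError("n must satisfy 0 < n < number of time steps")
    # Each sample uses steps t-n .. t-1 as history and step t as the target.
    histories = [list(speeds[t - n:t]) for t in range(n, len(speeds))]
    targets = [speeds[t] for t in range(n, len(speeds))]
    return histories, targets


# Toy series: six time steps at a single location (made-up values, in mph).
series = [60.0, 58.0, 55.0, 40.0, 42.0, 50.0]
histories, targets = make_samples(series, n=3)
print(len(histories), histories[0], targets[0])  # 3 [60.0, 58.0, 55.0] 40.0
```

A series with T time steps and n time lags yields T - n samples under this framing, since the first n steps have no complete history.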
A. Network-wide Traffic Speed Data

Traffic speed prediction at one location normally uses a sequence of speed values with $n$ historical time steps as the input data [2, 22, 23], which can be represented by a vector,

$$X_T = [x_{T-n}, x_{T-(n-1)}, \ldots, x_{T-2}, x_{T-1}] \tag{1}$$

However, the traffic speed at one location may be influenced by the speeds of nearby locations or even faraway locations, especially when a traffic jam propagates through the traffic network. To take these network-wide influences into account, the proposed and compared models in this study take the network-wide traffic speed data as the input. Suppose the traffic network consists of $P$ locations and we need to predict the traffic speeds at time $T$ using $n$ historical time frames (steps); the input can then be characterized as a speed data matrix,

$$X_T^P = \begin{bmatrix} x^1 \\ x^2 \\ \vdots \\ x^P \end{bmatrix} = \begin{bmatrix} x^1_{T-n} & x^1_{T-n+1} & \cdots & x^1_{T-2} & x^1_{T-1} \\ x^2_{T-n} & x^2_{T-n+1} & \cdots & x^2_{T-2} & x^2_{T-1} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x^P_{T-n} & x^P_{T-n+1} & \cdots & x^P_{T-2} & x^P_{T-1} \end{bmatrix} \tag{2}$$

where each element $x_t^p$ represents the speed of the $t$-th time frame at the $p$-th location. To reflect the temporal attributes of the speed data and simplify the expressions of the equations in the following subsections, the speed matrix is represented by a vector, $\mathbf{X}_T^P = [x_{T-n}, x_{T-(n-1)}, \ldots, x_{T-2}, x_{T-1}]$, in which each element is a vector of the $P$ locations' speed values.

B. RNNs

An RNN is a class of powerful deep neural networks that uses internal memory with loops to deal with sequence data. The architecture of RNNs, which is also the basic structure of LSTMs, is illustrated in Fig. 1. A hidden layer in an RNN receives the input vector, $\mathbf{X}_T^P$, and generates the output vector, $\mathbf{Y}_T$. The unfolded structure of RNNs, shown in the right part of Fig. 1, presents the calculation process: at each time iteration, $t$, the hidden layer maintains a hidden state, $h_t$, and updates it based on the layer input, $x_t$, and the previous hidden state, $h_{t-1}$, using the following equation:

$$h_t = \sigma_h(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \tag{3}$$

where $W_{xh}$ is the weight matrix from the input layer to the hidden layer, $W_{hh}$ is the weight matrix between two consecutive hidden states ($h_{t-1}$ and $h_t$), $b_h$ is the bias vector of the hidden layer and $\sigma_h$ is the activation function to generate the hidden state. The network output can be characterized as:

$$y_t = \sigma_y(W_{hy} h_t + b_y) \tag{4}$$

where $W_{hy}$ is the weight matrix from the hidden layer to the output layer, $b_y$ is the bias vector of the output layer and $\sigma_y$ is the activation function of the output layer. By applying Equation (3) and Equation (4), the parameters of the RNN are trained and updated iteratively via the back-propagation (BP) method. In each time step $t$, the hidden layer will generate a value, $y_t$, and the last output, $y_T$, is the desired predicted speed in the next time step, namely $\hat{x}_{T+1} = y_T$.

Although RNNs exhibit a superior capability for modeling nonlinear time series problems [2], regular RNNs suffer from the vanishing or blowing up gradient during the BP process, and thus are incapable of learning from long time lags [10], or, in other words, long-term dependencies [11].

Fig. 1 Standard RNN architecture and an unfolded structure with T time steps

C. LSTMs

To handle the aforementioned problems of RNNs, several sophisticated recurrent architectures, like the LSTM architecture [3] and the Gated Recurrent Unit (GRU) architecture [12], have been proposed. It has been shown that LSTMs work well on sequence-based tasks with long-term dependencies, but GRU, a simplified LSTM architecture, was only recently introduced and used in the context of machine translation [13]. Although a variety of typical LSTM variants have been proposed in recent years, a large-scale analysis of LSTM variants shows that none of the variants can improve upon the standard LSTM architecture significantly [14]. Thus, the standard LSTM architecture is adopted in this study as a part of the proposed network structure and is introduced in this section.

Fig. 2 LSTM architecture. The pink circles are arithmetic operators and the colored rectangles are the gates in LSTM.

The only different component between the standard LSTM architecture and the RNN architecture is the hidden layer [10]. The hidden layer of the LSTM is also called the LSTM cell, which is shown in Fig. 2. Like RNNs, at each time iteration, $t$, the LSTM cell has the layer input, $x_t$, and the layer output, $h_t$. The complicated cell also takes the cell input state, $\tilde{C}_t$, the cell output state, $C_t$, and the previous cell output state, $C_{t-1}$, into account while training and updating parameters. Due to the gated structure, LSTM can deal with long-term dependencies to allow useful information to pass along the LSTM network. There are three gates in an LSTM cell: an input gate, a forget gate, and an output gate. The gated structure, especially the forget gate, helps LSTM to be an effective and scalable model for several learning problems related to sequential data [14]. At time $t$, the input gate, the forget gate, and the output gate are denoted as $i_t$, $f_t$, and $o_t$, respectively. The input gate, the forget gate, the output gate and the input cell state, which are represented by colorful boxes in the LSTM cell in Fig. 2, can be calculated using the following equations:
$$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f) \tag{5}$$
$$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i) \tag{6}$$
$$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o) \tag{7}$$
$$\tilde{C}_t = \tanh(W_C x_t + U_C h_{t-1} + b_C) \tag{8}$$

where $W_f$, $W_i$, $W_o$, and $W_C$ are the weight matrices mapping the hidden layer input to the three gates and the input cell state, while $U_f$, $U_i$, $U_o$, and $U_C$ are the weight matrices connecting the previous cell output state to the three gates and the input cell state. $b_f$, $b_i$, $b_o$, and $b_C$ are four bias vectors. $\sigma_g$ is the gate activation function, which normally is the sigmoid function, and $\tanh$ is the hyperbolic tangent function. Based on the results of the four equations above, at each time iteration $t$, the cell output state, $C_t$, and the layer output, $h_t$, can be calculated as follows:

$$C_t = f_t \ast C_{t-1} + i_t \ast \tilde{C}_t \tag{9}$$
$$h_t = o_t \ast \tanh(C_t) \tag{10}$$

The final output of an LSTM layer should be a vector of all the outputs, represented by $\mathbf{Y}_T = [h_{T-n}, \ldots, h_{T-1}]$. Here, when taking the speed prediction problem as an example, only the last element of the output vector, $h_{T-1}$, is what we want to predict. Thus, the predicted speed value ($\hat{x}$) for the next time iteration, $T$, is $h_{T-1}$, namely $\hat{x}_T = h_{T-1}$.

D. BDLSTMs

The idea of BDLSTMs comes from the bidirectional RNN [18], which processes sequence data in both forward and backward directions with two separate hidden layers. BDLSTMs connect the two hidden layers to the same output layer. It has been proved that bidirectional networks are substantially better than unidirectional ones in many fields, like phoneme classification [19] and speech recognition [20]. However, bidirectional LSTMs have not been used in the traffic prediction problem, based on our review of the literature [2, 22, 23, 24, 25].

In this section, the structure of an unfolded BDLSTM layer, containing a forward LSTM layer and a backward LSTM layer, is introduced and illustrated in Fig. 3. The forward layer output sequence, $\overrightarrow{h}$, is iteratively calculated using inputs in a positive sequence from time $T - n$ to time $T - 1$, while the backward layer output sequence, $\overleftarrow{h}$, is calculated using the inputs from time $T - n$ to $T - 1$ in reverse order. Both the forward and backward layer outputs are calculated by using the standard LSTM updating equations, Equations (5) - (10). The BDLSTM layer generates an output vector, $\mathbf{Y}_T$, in which each element is calculated by using the following equation:

$$y_t = \sigma(\overrightarrow{h}_t, \overleftarrow{h}_t) \tag{11}$$

where the $\sigma$ function is used to combine the two output sequences. It can be a concatenating function, a summation function, an average function or a multiplication function. Similar to the LSTM layer, the final output of a BDLSTM layer can be represented by a vector, $\mathbf{Y}_T = [y_{T-n}, \ldots, y_{T-1}]$, in which the last element, $y_{T-1}$, is the predicted speed for the next time iteration when taking speed prediction as an example.

Fig. 3 Unfolded architecture of bidirectional LSTM with three consecutive steps

E. Masking Layer for Time Series Data with Missing Values

In reality, traffic sensors, like inductive-loop detectors, may fail due to breakdown of wire insulation, poor sealants, damage caused by construction activities, or electronics unit failure. Sensor failure in turn causes missing values in the collected time series data. For the LSTM-based prediction problem, if the input time series data contains missing/null values, the LSTM-based model will fail because null values cannot be computed during the training process. If the missing values are set as zero, or some other pre-defined values, the training and testing results will be highly biased. Thus, we adopt a masking mechanism to overcome the potential missing values problem.

Fig. 4 demonstrates the details of the masking mechanism. The (BD)LSTM cell denotes an LSTM-based layer, like an LSTM layer or a BDLSTM layer. A mask value, ∅, is pre-defined, which normally is 0 or Null, and all missing values in the time series data are set as ∅. For an input time series data $X_T$, if $x_t$ is the missing element, which equals ∅, the training process at the $t$-th step will be skipped, and thus, the calculated cell state of the $(t - 1)$-th step will be directly input into the $(t + 1)$-th step. In this case, the output of the $t$-th step also equals ∅, which will be considered as a missing value and, if needed, input to the subsequent layer. Similarly, we can deal with input data with consecutive missing values using the masking mechanism.

Fig. 4 Masking layer for time series data with missing values

F. Stacked Bidirectional and Unidirectional LSTM Networks

Existing studies [20, 21] have shown that deep LSTM architectures with several hidden layers can build up progressively higher levels of representations of sequence data, and thus work more effectively. Deep LSTM architectures are networks with several stacked LSTM hidden layers, in which the output of an LSTM hidden layer is fed as the input into the subsequent LSTM hidden layer. This stacked-layers mechanism, which can enhance the power of neural networks, is adopted in this study. As mentioned in previous sections, BDLSTMs can make use of both forward and
backward dependencies. When feeding the spatial-temporal information of the traffic network to the BDLSTMs, both the spatial correlation of the speeds in different locations of the traffic network and the temporal dependencies of the speed values can be captured during the feature learning process. In this regard, the BDLSTMs are very suitable for being the first layer of a model to learn more useful information from spatial time series data. When predicting future speed values, the top layer of the architecture only needs to utilize learned features, namely the outputs from lower layers, to calculate iteratively along the forward direction and generate the predicted values. Thus, an LSTM layer, which is fit for capturing forward dependency, is a better choice to be the last (top) layer of the model.

In this study, we propose a novel deep architecture named stacked bidirectional and unidirectional LSTM network (SBU-LSTM) to predict the network-wide traffic speed values. Fig. 5 illustrates the graphical architecture of the proposed model. If the input contains missing values, a masking layer should be adopted by the SBU-LSTM. Each SBU-LSTM contains a BDLSTM layer as the first feature-learning layer and an LSTM layer as the last layer. To make full use of the input data and learn complex and comprehensive features, the SBU-LSTM can optionally be filled with one or more LSTM/BDLSTM layers in the middle. Fig. 5 shows that the SBU-LSTM takes the spatial time series data as the input and predicts future speed values for one time step. The SBU-LSTM is also capable of predicting values for multiple future time steps based on historical data. But this property is not shown in Fig. 5, since the target of this study is to predict network-wide traffic speed for one future time step. The detailed spatial structure of input data is described in the experiment section.

Fig. 5 The SBU-LSTM architecture necessarily consists of a BDLSTM layer and an LSTM layer. A masking layer for handling missing values and multiple LSTM or BDLSTM layers as middle layers are optional.

III. EXPERIMENTS

A. Dataset Description

In this study, two types of traffic state datasets are utilized to carry out experiments to test the proposed model. One is a station-/point-based dataset, called loop detector data [46], collected by inductive loop detectors deployed on the roadway surface. Multiple loop detectors are connected to a detector station deployed around every half a mile. The collected data from each station are grouped and aggregated as station-based traffic state data according to directions. This aggregated and quality-controlled dataset contains traffic speed, volume, and occupancy information. In the experiments, the loop detector data cover four connected freeways, which are I-5, I-405, I-90 and SR-520 in the Seattle area, and are extracted from the Digital Roadway Interactive Visualization and Evaluation Network (DRIVE Net) system [29, 30]. The traffic sensor stations are shown in Fig. 6 (a), represented by small blue icons. This dataset contains traffic state data of 323 sensor stations in 2015 and the time step interval of this dataset is 5 minutes.

The other dataset used in this study is a segment-based dataset, called INRIX data [46], which measures traffic speeds of both freeway and urban roadway segments. INRIX data is selected by the U.S. Federal Highway Administration as the National Performance Management Research Data Set. INRIX
data provide wide coverage and accurate traffic information by aggregating GPS probe data from a wide array of commercial vehicle fleets, connected cars and mobile apps. An entire traffic network in the Seattle downtown area, which contains more than 1000 roadway segments, shown in Fig. 6 (b), is selected as the experimental dataset. The dataset covers the whole year of 2012 and its time step interval is also 5 minutes.

Fig. 6 Loop detector stations on the freeway network in Seattle area

B. Experiment Results Analysis and Comparison

In this sub-section, only the loop detector data, due to its high data quality [46], are used to measure the performance of the proposed approach and compare it with other models. Hence, the network-wide traffic is characterized by the 323 station speed values and the spatial dimension of the input data is set as $P = 323$. Since the unit of a time step in loop detector data is 5 minutes, the dataset has $\frac{60\,(\text{min})}{5\,(\text{min})} \times 24\,(\text{hours}) \times 365\,(\text{days}) = 105120$ time steps in total. Suppose the number of time lags is set as $n = 10$, which means the model uses a set of data with 10 consecutive time steps (covering 50 minutes) to predict the following 5-minute speed value; the dataset is then separated into samples with 10 time lags and the sample size is $N = 105110$ $(105120 - 10)$.

Based on the descriptions of the model, each sample of the input data, $\mathbf{X}_T^P$, is a 2-D vector with the dimension of $[n, P] = [10, 323]$, and each sample of the output data is a 1-dimension vector with 323 components. The input of the model is a 3-D vector, whose dimension is $[N, n, P]$. Before being fed into the model, all the samples are randomized and divided into training set, validation set, and test set with the ratio 7:2:1.

In the training process, the mini-batch gradient descent method is used when the model optimizes the mean squared error (MSE) loss using the RMSProp optimizer, and an early stopping mechanism is used to avoid over-fitting. To measure the effectiveness of different traffic speed prediction algorithms, the Mean Absolute Errors (MAE) and Mean Absolute Percentage Errors (MAPE) are computed using the following equations:

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \hat{x}_i| \tag{12}$$

$$\text{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{x_i - \hat{x}_i}{x_i} \right| \tag{13}$$

where $x_i$ is the observed traffic speed, and $\hat{x}_i$ is the predicted speed. All the compared models in this section are trained and tested multiple times to eliminate outliers, and the presented results are averaged to reduce random errors.

In this section, the results of the proposed SBU-LSTMs are analyzed and compared with classical methods and other RNN-based models. Further analyses of the influence of the number of time lags, the dimension of weight matrices, the number of layers, the impact of volume and occupancy information, spatial feature learning, and model robustness are carried out to shed more light on the characteristics of the proposed model.

1) Comparison with Classical Models for Single Location Traffic Speed Prediction

Many classical baseline models are used in traffic forecasting problems, like ARIMA [2, 23], Support Vector Regression (SVR) [37], and the Kalman filter [35]. Based on our literature review [2], the performances of the ARIMA and Kalman filter methods are far behind the others, and thus, these two methods are not compared in this study. Most of the mentioned classical models are not suitable for predicting network-wide traffic speed via a single model, since they normally cannot process 3-D spatial temporal vectors. To compare our proposed model with these baseline models, experiments are carried out for single loop detector stations, whose input data is a 2-D vector without spatial dimension. The results are averaged to measure the overall performance of these models.

We compared the performance of the SBU-LSTMs with SVR, random forest, feed-forward NN, and GRU NN. In this comparison, the proposed model does not use the masking layer and optional middle layers. Among these baseline models, the feed-forward NN model, also called Multilayer Perceptron (MLP), has superior performance for traffic flow prediction [32], and decision tree and SVR are very efficient models for prediction [23, 37]. For the SVR method, the Radial Basis Function (RBF) kernel is utilized, and for the Random Forest method, 10 trees are built, and no maximum depth of the trees is set. In this experiment, the feed-forward NN model consists of two hidden layers with 323 nodes in each layer.

TABLE I
PERFORMANCE COMPARISON OF THE PROPOSED MODEL WITH OTHER BASELINE MODELS FOR SINGLE DETECTOR STATIONS SPEED PREDICTION

Models                               MAE (mph)   MAPE (%)
SVM                                  9.23        20.39
Random Forest                        2.64        6.30
Feed-forward NN (2-hidden layers)    2.63        6.41
GRU NN                               3.43        8.02
SBU-LSTMs                            2.42        5.67

Table I demonstrates the prediction performance of different algorithms for the single detector stations. The number of input time lags in this experiment is set as 10. Among the non-neural network algorithms, random forest performs much better, with the MAE of 2.64, than the SVM method, which makes sense
7

TABLE Ⅱ
PERFORMANCE COMPARISON OF THE PROPOSED MODEL WITH OTHER LSTM-BASED MODELS FOR NETWORK-WIDE TRAFFIC SPEED PREDICTION
Number of LSTM / BDLSTM layers
Model N=0 N=1 N=2 N=3 N=4
MAE MAPE MAE MAPE MAE MAPE MAE MAPE MAE MAPE
N-layers LSTM 2.886 6.585 2.502 5.929 2.483 5.950 2.529 6.114
N-layers LSTM
2.652 6.489 2.581 6.332 2.630 6.438 2.646 6.586
+ 1-layer DNN
N-layers LSTM
2.668 6.506 2.557 6.274 2.595 6.447 2.647 6.602
+ Hour of Day + Day of Week
N-layers BDLSTM 3.021 6.758 2.472 5.819 2.476 5.846 2.526 5.988
SBU-LSTMs: 1-layer BDLSTM
+ N middle BDLSTM layers 2.426 5.674 2.465 5.787 2.502 5.950 2.549 6.191 2.576 6.227
+ 1-layer LSTM

due to the majority-vote mechanism of random forest. The feed-forward NN, whose MAE is 2.63, performs very close to the random forest method. Although the GRU NN is a kind of recurrent NN, it does not outperform the feed-forward NN or random forest. The single-layer structure and the simplified gates in the GRU NN may be the reasons. To sum up, the proposed SBU-LSTM model is clearly superior to the other four methods in this single-detector-station experiment.

2) Comparison with LSTM-based models for Network-wide Traffic Speed Prediction

The SBU-LSTM is proposed for predicting network-wide traffic speed, and thus other methods capable of predicting multi-dimensional time series data are compared in this section. Since the proposed model combines BDLSTMs and LSTMs, pure deep (N-layer) BDLSTMs and LSTMs are compared. A deep LSTM NN with an added fully connected deep neural network (DNN) layer, which has been shown to boost the LSTM NN [33], is also compared. To measure the influence of temporal information on the network-wide traffic speed, a multilayer LSTM model combining day of week and hour of day is also tested in this experiment. Meanwhile, the influence of the depth of the neural networks, namely the number of layers of the models, is tested in this section. All the experiments in this section use the dataset covering the whole traffic network with 10 time lags. The number of time lags, 10, is set within a reasonable range for traffic forecasting based on the literature [25, 32] and our experiments. The spatial dimension of the weight matrices in each LSTM or BDLSTM layer in this experiment is set as the number of loop detector stations, 323, to ensure that the spatial features can be fully captured. The comparison results are averaged over multiple tests to reduce random errors.

Table Ⅱ shows the comparison results, where the column headers give the number of LSTM or BDLSTM layers in the models. In terms of the influence of the depth of the neural network, all the compared models achieve their best performance when they have two layers, and their performances share the same trend: the values of MAE and MAPE increase as the number of layers increases from two to four. Table Ⅱ contains a special “(N=0)” column, denoting no middle layer, to represent the basic structure of the SBU-LSTM. The performance of the SBU-LSTM is in conformity with the trends of the compared models, in that the MAE and MAPE increase as the number of layers rises from zero to four.

The proposed SBU-LSTM outperforms the others for all the layer numbers. When the SBU-LSTM has no middle layer, it achieves the best MAE, 2.426 mph, and MAPE, 5.674%. The test errors of the multilayer LSTM NN and BDLSTM NN turn out to be larger than that of the proposed model. They achieve their best MAEs of 2.502 and 2.472, respectively, when they both have two layers. It should be noted that, for the one-layer case, the BDLSTM NN model gets the worst performance in our experiments shown in Table Ⅱ. This indicates that a one-layer BDLSTM may be good enough for capturing features, but it is not satisfactory for predicting the results. Except for the one-layer case, the model combining deep LSTM and DNN is not comparable with the others. These test results show that adding DNN layers to a deep LSTM cannot make improvements for the network-wide traffic prediction problem, which is consistent with the finding in a previous study [33]. The performance of the multilayer LSTM with added temporal information is very close to that of the LSTM combined with DNN. Thus, incorporating the day of week and time of day features cannot improve the performance for this study. This is in accordance with the results of previous works [23, 24].

Fig. 6 Boxplot of MAE versus number of time lags in SBU-LSTMs. One unit of time lag is 5 minutes.

3) Influence of number of time lags

The number of time lags, $n$, is the temporal dimension of the input data, which may influence the performance of the

proposed model. Fig. 6 shows the boxplot of the MAE versus the number of time lags, in which the spatial dimensions of all weight matrices are set as $P = 323$. When the number of time lags equals 8, 10, or 12, the MAEs are very close, around 2.4, and their deviations are relatively small. When the number of time lags is set as 6, the MAE is much higher, and the deviation is much larger than in the other cases. That means, given the 5-minute time step interval and the studied traffic network, input data with 6 time steps are not enough for the model to accurately predict network-wide traffic speed. To sum up, the number of time lags tends to influence the predictive performance, especially when the number is relatively small.

4) Influence of dimension of weight matrices

In the experiment, the dimension of each data sample is $[n, P]$, where $P$ is the spatial dimension representing the number of loop detector stations. According to the matrix multiplication rule, the spatial dimension of the weight matrices in the first layer of the SBU-LSTM must be in accordance with the value of $P$. But the spatial dimension of the weight matrices in other layers can be customized. In this section, we measure the influence of the dimension of weight matrices in the basic SBU-LSTM.

When the model’s last LSTM layer has different spatial dimensions, including $\lceil \frac{1}{4} P \rceil$, $\lceil \frac{1}{2} P \rceil$, $P$, $2P$ and $4P$, very close prediction results are observed. Here, $P$ equals 323 and $\lceil \cdot \rceil$ is the ceiling function. Table Ⅲ shows the comparison results. The MAE, MAPE, and standard deviations are nearly the same. Hence, the variation of the dimension of the weight matrices in the LSTM layer has almost no influence on the predictive performance, if the dimension is set as a reasonable value close to the number of sensor locations.

TABLE Ⅲ
PERFORMANCE COMPARISON OF SBU-LSTMS WITH DIFFERENT SPATIAL DIMENSIONS OF WEIGHT MATRICES

Spatial dimension of weight matrices
in the last layer (LSTM layer)         MAE     MAPE    STD
1/4 P                                  2.486   5.903   0.675
1/2 P                                  2.425   5.680   0.643
P = 323                                2.426   5.674   0.630
2 P                                    2.431   5.736   0.636
4 P                                    2.411   5.696   0.636

5) Spatial features learning

Spatial features of a traffic network are critical for predicting network-wide traffic states. By carefully studying the LSTM methodology, we can find that the spatial features can be inherently learned by the weights in LSTM or BDLSTM layers during the training process. No matter what the network’s spatial structure is, and no matter what the spatial order of the input data is, the traffic speed relationship between each pair of locations in the traffic network can be captured by the LSTM weight matrices.

In this section, we measure the influence of the spatial order of the input data on the spatial feature learning. Firstly, we order the spatial dimension of the input data based on the milepost and direction of freeways. Fig. 7 displays the heatmaps of true speed and predicted speed for the freeway network on a randomly selected day, taking 01/09/2015 (January 9), a Friday, as an example. The extreme similarity between the shapes in the two heatmaps shows that the proposed model is capable of learning spatial features. Then, we randomly rearrange the spatial dimension of the input data. By training and testing the model multiple times, we find that the predictive performance barely changes, and the MAEs are all around 2.42 mph. To the best of our knowledge, at least two reasons lead to the good performance. One is that the BDLSTM, measuring both forward and backward dependencies, helps learn better features. The other is that the inherent spatial correlation between locations is learned and stored in the weight matrices during the training process. Hence, the order of the spatial dimension of the input data basically does not affect the model performance.

Fig. 7 Heatmaps of ground truth and predicted speed values for the freeway traffic network on 01/09/2015. The two plots share the same meanings of the two axes, where the two horizontal axes represent the index and the arrangement order of sensor stations based on the mileposts and directions of the four freeways, respectively.

6) Influence of volume and occupancy

Speed, volume (flow), and occupancy are the three fundamental factors in analyzing traffic flow. Considering that the loop detector data contain speed, volume, and occupancy information, it is informative to investigate the influence of these factors on the proposed model’s predictive performance. In previous experiments, each element of the model input, $x_t^p$, is the speed ($s$) at a specific location, $p$, at time $t$, where $x_t^p = s_t^p$. In this experiment, by contrast, an element of the model input combines speed ($s$) with volume ($v$) and occupancy ($o$), where

$x_t^p$ can be $(s, o)_t^p$, $(s, v)_t^p$, or $(s, v, o)_t^p$.

Fig. 8 Fundamental scatter diagrams of traffic flow: (a) speed-volume diagram, (b) speed-occupancy diagram, and (c) volume-occupancy diagram.
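The input variants described in this subsection can be illustrated with a small sketch. It assumes a hypothetical data layout (one record per time step, holding speed, volume, and occupancy lists for the $P$ stations), and the function name `build_sample` is an illustrative choice, not the authors' implementation. Each sample keeps the $[n, P]$ layout, and each element becomes a tuple of the selected features.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical record layout: one dict per time step, mapping a feature name
# ("s" = speed, "v" = volume, "o" = occupancy) to a list of P station values.
Record = Dict[str, List[float]]


def build_sample(records: Sequence[Record], end: int, n_lags: int,
                 features: Sequence[str] = ("s",)) -> List[List[Tuple[float, ...]]]:
    """Return one input sample of shape [n_lags, P].

    The sample covers time steps end - n_lags + 1 .. end (inclusive); each
    element is a tuple holding the selected features for one station.
    """
    start = end - n_lags + 1
    if start < 0 or end >= len(records):
        raise ValueError("not enough history for the requested window")
    sample = []
    for t in range(start, end + 1):
        columns = [records[t][f] for f in features]  # one list of P values per feature
        sample.append(list(zip(*columns)))           # P tuples, one per station
    return sample


# Toy data: 4 time steps, P = 2 stations (values are made up for illustration).
records = [
    {"s": [60.0, 55.0], "v": [10.0, 12.0], "o": [0.05, 0.06]},
    {"s": [58.0, 50.0], "v": [14.0, 15.0], "o": [0.07, 0.09]},
    {"s": [40.0, 45.0], "v": [20.0, 18.0], "o": [0.15, 0.12]},
    {"s": [35.0, 42.0], "v": [22.0, 19.0], "o": [0.20, 0.14]},
]

speed_only = build_sample(records, end=3, n_lags=3)                      # x = s
speed_occ = build_sample(records, end=3, n_lags=3, features=("s", "o"))  # x = (s, o)
all_three = build_sample(records, end=3, n_lags=3, features=("s", "v", "o"))
```

Keeping the feature selection as a parameter makes it straightforward to rerun the same experiment with $(s, o)$, $(s, v)$, or $(s, v, o)$ inputs without changing the windowing logic.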

Before investigating the influence of volume and occupancy, the relationship between these factors is directly evaluated by three scatter diagrams, plotted based on the loop detector dataset and shown in Fig. 8. Although obvious noise points can be seen in Fig. 8 (b) and Fig. 8 (c), the main distributions in the three diagrams follow traffic flow theory [47]. Our experiments show that when solely combining volume data, there is nearly no improvement over the prediction accuracy shown in Section 2). But when inputting speed and occupancy, or all three factors, the model performs slightly better, with less than a 5% increase in prediction accuracy. Therefore, volume and occupancy have only a slight influence on the traffic speed prediction based on our experiment results.

7) Model robustness

The optional masking layer makes the SBU-LSTM more robust, in that the model can handle input data with missing values. In this section, we test the model’s robustness by randomly selecting a specific proportion of elements from the spatial-temporal matrix in each input sample and setting them as $Null$. The model’s prediction accuracy varies as we change the proportion of $Null$ values in the input samples.

The performance of the model is presented in Table Ⅳ. It is obvious that the prediction accuracy decreases as the proportion of missing values increases. We can also notice that the MAE values in Table Ⅳ are nearly double the MAE obtained when there are no missing values, listed in Table Ⅱ, which means the missing values really affect the performance of the model. In conclusion, the model’s capability of dealing with missing values is acceptable, but there is still room for us to improve the model robustness.

TABLE Ⅳ
PERFORMANCE COMPARISON OF THE SBU-LSTM WITH DIFFERENT PROPORTIONS OF MISSING VALUES

Proportion of missing values
in input data                   MAE     MAPE
5%                              3.828   9.053
10%                             4.294   9.968
15%                             4.328   10.348
20%                             4.765   11.118
30%                             4.962   11.507

C. Model Scalability

All the above experiments are conducted based on freeway traffic state data, which does not cover urban roadways. To test the scalability of the proposed approach, we adopt the INRIX data, which covers a wide range of roadway segments, including freeways, arterials, urban streets, and even ramps, to predict the speed of the whole traffic network in the Seattle downtown area. The INRIX traffic network is shown in Fig. 6 (b). This prediction task is more challenging, not only because the traffic network consists of multiple types of roads, but also because the speed limits of these roads vary from 20 mph to 60 mph. In this section, we use the basic SBU-LSTM model to predict speed for the INRIX traffic network containing more than 1000 roadway segments. The prediction MAE and MAPE are 1.126 mph and 4.212%, respectively, which is better than that of the experiments based on loop detector data. This implies that the variation of the INRIX speed data might be relatively small.

Further, the scalability of the model turns out to be quite remarkable when we scale the size of the traffic network. Fig. 9 shows the MAEs when applying the model to size-varying freeway and INRIX traffic networks. It shows that, when the size of the INRIX traffic network increases, the prediction MAE increases slightly. In addition, the size of the freeway network also does not affect the prediction performance that much, considering that the horizontal axis of Fig. 9 is exponentially scaled. Hence, the results indicate that the proposed approach is able to deal with multiple types of traffic network and works well when the size of the traffic network changes.

Fig. 9 Prediction MAE of the SBU-LSTM model versus traffic network size for both freeway and INRIX network. The x-axis is exponentially scaled to show the results for both datasets.

D. Visualization and Potential Applications

Besides its theoretical contribution, the proposed model has potential impact on applications related to traffic speed prediction. The proposed model and its visualized prediction results will soon be implemented on an extended version of a transportation data analytics platform [29, 30], which mainly utilizes artificial intelligence methods to solve transportation problems. The model’s predicted results and the corresponding visualized traffic networks, like the studied freeway and INRIX traffic networks shown in Fig. 10, will be publicly accessible via the platform.

Previous sections have shown that the network-wide prediction accuracy is high. By investigating the predicted traffic speed for single locations, we find that the prediction performance is also very good and the trends of predicted and true values are quite similar. For example, Fig. 11 (a) and Fig. 11 (b) show the predicted and true speed values at two randomly selected locations from the freeway and INRIX traffic networks, respectively, during randomly selected

Fig. 10 Visualization of predicted traffic speed on freeway and INRIX traffic networks. (a) Freeway traffic network. (b) INRIX traffic network.

Fig. 11 Prediction performance for single locations on a randomly selected weekday: (a) a sensor station at milepost 167.56 on I-5, (b) a TMC in the INRIX network, and (c) a TMC in the INRIX network, at which a non-recurring congestion occurred. The x-axis represents the time of day.
weekdays. It is obvious that recurring congestion at morning and evening peak hours is successfully predicted by the proposed approach. However, non-recurring congestion cannot be easily predicted by models without enough training data, such as weather data, event data and incident data. The red box in Fig. 11 (c) tags a sudden congestion in the late evening at TMC location 114-04200 on the INRIX network, which is highly likely to be a non-recurring congestion. Thus, by feeding more data sources, like weather data and incident data, to the proposed model, distinguishing recurring and non-recurring congestion may be another applicable scenario soon.

IV. CONCLUSION AND FUTURE WORK

A deep stacked bidirectional and unidirectional LSTM neural network is proposed in this paper for network-wide traffic speed prediction. The improvements and contributions in this study mainly focus on four aspects: 1) we expand the traffic forecasting area to the whole traffic network, including freeway and urban traffic networks; 2) we propose a deep stacked architecture considering both forward and backward dependencies of network-wide traffic data; 3) multiple influential factors for the proposed model are analysed in detail; and 4) a masking mechanism is adopted to handle missing values.

Experiment results indicate that the two-layer SBU-LSTM without middle layers is the best structure for network-wide traffic speed prediction. Compared to LSTM, BDLSTM and other LSTM-based methods, the structure of stacking BDLSTM and LSTM layers turns out to be more efficient at learning spatial-temporal features from the dataset. If the number of time lags of historical data is not large enough, prediction performance may decrease. The spatial order of input data and the spatial dimension of weight matrices in the last layer of the model have almost no influence on the prediction results. Additional information, like volume and occupancy, cannot significantly improve the predictive performance. Further, the results show that the proposed model is suitable for predicting traffic speed on different types of traffic network.

Further improvements and extensions can be made based on this study. The model will be improved towards a graph-based structure to learn and interpret spatial features. The model will be implemented on an artificial intelligence based transportation analytical platform. Potential applications, like non-recurring congestion detection, will be explored by combining other datasets.

REFERENCES

[1] E. I. Vlahogianni, M. G. Karlaftis, and J. C. Golias, “Short-term traffic forecasting: Where we are and where we’re going,” Transportation Research Part C: Emerging Technologies, vol. 43, pp. 3–19, 2014.
[2] X. Ma, Z. Tao, Y. Wang, H. Yu, and Y. Wang, “Long short-term memory neural network for traffic speed prediction using remote microwave sensor data,” Transportation Research Part C: Emerging Technologies, vol. 54, pp. 187–197, 2015.
[3] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[4] R. Jozefowicz, W. Zaremba, and I. Sutskever, “An empirical exploration of recurrent network architectures,” in Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015, pp. 2342–2350.
[5] Z. C. Lipton, J. Berkowitz, and C. Elkan, “A critical review of recurrent neural networks for sequence learning,” arXiv preprint arXiv:1506.00019, 2015.

[6] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 6645–6649.
[7] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.
[8] D. Eck and J. Schmidhuber, “A first look at music composition using LSTM recurrent neural networks,” Istituto Dalle Molle Di Studi Sull Intelligenza Artificiale, vol. 103, 2002.
[9] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social LSTM: Human trajectory prediction in crowded spaces,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 961–971.
[10] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual prediction with LSTM,” 1999.
[11] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[12] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” arXiv preprint arXiv:1409.1259, 2014.
[13] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
[14] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “LSTM: A search space odyssey,” IEEE Transactions on Neural Networks and Learning Systems, 2017.
[15] M.-D. Wang, Q.-R. Qiu, and B.-W. Cui, “Short-term wind speed forecasting combined time series method and ARCH model,” in Machine Learning and Cybernetics (ICMLC), 2012 International Conference on, vol. 3. IEEE, 2012, pp. 924–927.
[16] X. Zheng and M. Liu, “An overview of accident forecasting methodologies,” Journal of Loss Prevention in the Process Industries, vol. 22, no. 4, pp. 484–491, 2009.
[17] X. Jiang and H. Adeli, “Wavelet packet-autocorrelation function method for traffic flow pattern analysis,” Computer-Aided Civil and Infrastructure Engineering, vol. 19, no. 5, pp. 324–337, 2004.
[18] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[19] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Networks, vol. 18, no. 5, pp. 602–610, 2005.
[20] A. Graves, N. Jaitly, and A.-r. Mohamed, “Hybrid speech recognition with deep bidirectional LSTM,” in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013, pp. 273–278.
[21] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[22] Y. Duan, Y. Lv, and F.-Y. Wang, “Travel time prediction with LSTM neural network,” in Intelligent Transportation Systems (ITSC), 2016 IEEE 19th International Conference on. IEEE, 2016, pp. 1053–1058.
[23] Y.-y. Chen, Y. Lv, Z. Li, and F.-Y. Wang, “Long short-term memory model for traffic congestion prediction with online open data,” in Intelligent Transportation Systems (ITSC), 2016 IEEE 19th International Conference on. IEEE, 2016, pp. 132–137.
[24] Y. Wu and H. Tan, “Short-term traffic flow forecasting with spatial-temporal correlation in a hybrid deep learning framework,” arXiv preprint arXiv:1612.01022, 2016.
[25] R. Yu, Y. Li, C. Shahabi, U. Demiryurek, and Y. Liu, “Deep learning: A generic approach for extreme condition traffic forecasting,” in Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM, 2017, pp. 777–785.
[26] S. R. Chandra and H. Al-Deek, “Predictions of freeway traffic speeds and volumes using vector autoregressive models,” Journal of Intelligent Transportation Systems, vol. 13, no. 2, pp. 53–72, 2009.
[27] Y. Kamarianakis, H. O. Gao, and P. Prastacos, “Characterizing regimes in daily cycles of urban traffic using smooth-transition regressions,” Transportation Research Part C: Emerging Technologies, vol. 18, no. 5, pp. 821–840, 2010.
[28] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung, Time series analysis: forecasting and control. John Wiley & Sons, 2015.
[29] Z. Cui, S. Zhang, K. C. Henrickson, and Y. Wang, “New progress of drive net: An e-science transportation platform for data sharing, visualization, modeling, and analysis,” in Smart Cities Conference (ISC2), 2016 IEEE International. IEEE, 2016, pp. 1–2.
[30] X. Ma, Y.-J. Wu, and Y. Wang, “Drive net: E-science transportation platform for data sharing, visualization, modeling, and analysis,” Transportation Research Record: Journal of the Transportation Research Board, no. 2215, pp. 37–49, 2011.
[31] K. Henrickson, Y. Zou, and Y. Wang, “Flexible and robust method for missing loop detector data imputation,” Transportation Research Record: Journal of the Transportation Research Board, no. 2527, pp. 29–36, 2015.
[32] Y. Lv, Y. Duan, W. Kang, Z. Li, and F.-Y. Wang, “Traffic flow prediction with big data: a deep learning approach,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 2, pp. 865–873, 2015.
[33] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, “Convolutional, long short-term memory, fully connected deep neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4580–4584.
[34] X. Song, H. Kanasugi, and R. Shibasaki, “Deeptransport: Prediction and simulation of human mobility and transportation mode at a citywide level,” in IJCAI, 2016, pp. 2618–2624.
[35] J. Guo, W. Huang, and B. M. Williams, “Adaptive Kalman filter approach for stochastic short-term traffic flow rate prediction and uncertainty quantification,” Transportation Research Part C: Emerging Technologies, vol. 43, pp. 50–64, 2014.
[36] Q. Ye, W. Y. Szeto, and S. C. Wong, “Short-term traffic speed forecasting based on data recorded at irregular intervals,” IEEE Transactions on Intelligent Transportation Systems, vol. 13, no. 4, pp. 1727–1737, 2012.
[37] C.-H. Wu, J.-M. Ho, and D.-T. Lee, “Travel-time prediction with support vector regression,” IEEE Transactions on Intelligent Transportation Systems, vol. 5, no. 4, pp. 276–281, 2004.
[38] M. G. Karlaftis and E. I. Vlahogianni, “Statistical methods versus neural networks in transportation research: Differences, similarities and some insights,” Transportation Research Part C: Emerging Technologies, vol. 19, no. 3, pp. 387–399, 2011.
[39] J. Hua and A. Faghri, “Applications of artificial neural networks to intelligent vehicle-highway systems,” Transportation Research Record, vol. 1453, p. 83, 1994.
[40] H. Yin, S. Wong, J. Xu, and C. Wong, “Urban traffic flow prediction using a fuzzy-neural approach,” Transportation Research Part C: Emerging Technologies, vol. 10, no. 2, pp. 85–98, 2002.
[41] D. Park and L. R. Rilett, “Forecasting freeway link travel times with a multilayer feedforward neural network,” Computer-Aided Civil and Infrastructure Engineering, vol. 14, no. 5, pp. 357–367, 1999.
[42] J. Van Lint, S. Hoogendoorn, and H. Van Zuylen, “Freeway travel time prediction with state-space neural networks: modeling state-space dynamics with recurrent neural networks,” Transportation Research Record: Journal of the Transportation Research Board, no. 1811, pp. 30–39, 2002.
[43] H. Yu, Z. Wu, S. Wang, Y. Wang, and X. Ma, “,” arXiv preprint arXiv:1705.02699, 2017.
[44] R. Fu, Z. Zhang, and L. Li, “Using LSTM and GRU neural network methods for traffic flow prediction,” in Chinese Association of Automation (YAC), Youth Academic Annual Conference of. IEEE, 2016, pp. 324–328.
[45] Z. Zhao, W. Chen, X. Wu, P. C. Chen, and J. Liu, “LSTM network: a deep learning approach for short-term traffic forecast,” IET Intelligent Transport Systems, vol. 11, no. 2, pp. 68–75, 2017.
[46] Y. Wang, R. Ke, W. Zhang, Z. Cui, and K. Henrickson, “Digital roadway interactive visualization and evaluation network applications to WSDOT operational data usage,” University of Washington Seattle, Washington, Tech. Rep., 2016.
[47] J. Li and H. Zhang, “Fundamental diagram of traffic flow: new identification scheme and further evidence from empirical data,” Transportation Research Record: Journal of the Transportation Research Board, no. 2260, pp. 50–59, 2011.