Transportation Research Part C: Jiajie Yu, Pierre-Antoine Laharotte, Yu Han, Ludovic Leclercq
Keywords: Traffic Signal Control; Bus Holding; Multi-Modal Network; Deep Reinforcement Learning; Artificial Neural Network

Abstract
Managing traffic flow at intersections in a large-scale network remains challenging. Multi-modal signalized intersections integrate various objectives, including minimizing the queue length and maintaining constant bus headways. Inefficient traffic signal and bus headway control strategies may cause severe traffic jams, high delays for bus passengers, and bus bunching that harms bus line operations. To simultaneously improve the level of service for car traffic and the bus system in a multi-modal network, this paper integrates bus priority and holding with traffic signal control via decentralized controllers based on Reinforcement Learning (RL). The controller agents act and learn from a synthetic traffic environment built with the microscopic traffic simulator SUMO. Action information is shared among agents to achieve cooperation, forming a Multi-Agent Reinforcement Learning (MARL) framework. The agents simultaneously aim to minimize vehicles' total stopping time and to homogenize the forward and backward space headways of buses approaching intersections at each decision step. The Deep Q-Network (DQN) algorithm is applied to manage the continuity of the state space. The tradeoff between the bus transit and car traffic objectives is discussed using various numerical experiments. The introduced method is tested in scenarios with distinct bus lane layouts and bus line deployments. The proposed controller outperforms model-based adaptive control methods and a centralized RL method regarding global traffic efficiency and bus transit stability. Furthermore, the remarkable scalability and transferability of the trained models are demonstrated by applying them to several different test networks without retraining.
1. Introduction
To avoid traffic conflicts, traffic signal control efficiently allocates green times to the different vehicle movements at a signalized intersection (Wu et al., 2007). Inadequate traffic signal control strategies may cause severe traffic congestion and, further, wasted energy and exhaust pollution (Zhao et al., 2012). Optimal traffic signal strategies at the network level with multi-modal objectives remain challenging because of complexity and scalability issues (Wang et al., 2021b). Many model-based and model-free strategies have been explored and developed to cope with traffic signal coordination. Centralized control methods manage signals to match an overall goal but are often limited in scale and reduce the coordination ability to a local set of intersections (Ma et al., 2009; Wang et al.,
2021a; Yu et al., 2018). For example, Maxband and its extensions (Little, 1966; Xu et al., 2022a; Yang et al., 2015; Zhang et al., 2015) introduce coordination at the arterial level by enlarging the green wave bandwidth, effectively decreasing vehicle stop times. Performance-based algorithms aim to improve the network's global performance indicators (e.g., total delay, total queue length, and average vehicle speed) over a specific period. Some consider multiple objectives, environmental concerns, and robustness targets (Ma et al., 2020; Mohebifard et al., 2019; Yin, 2008; Zhang et al., 2013). These models are usually formulated as mixed-integer programs and their derivations, whose complexity increases exponentially with the number of signals and control periods. Therefore, such strategies are efficient for coordinating several intersections offline based on daily traffic patterns. Still, the effort required to solve such an optimization problem in terms of computation time (especially for a large number of traffic signals) is incompatible with real-time applications (Chu et al., 2019).
Decentralized methods are more robust and easily scalable, but coordination should be carefully addressed (Le et al., 2015; Yu et al., 2021). For example, the max pressure method prioritizes the phase with the maximum pressure (the maximal difference between upstream and downstream queues) to accommodate the real-time traffic demand (Varaiya, 2013). The max pressure algorithm is particularly appealing at the network scale as it overcomes the computational complexity: its structure is fully decentralized, and each signal is regarded as an independent agent (Levin et al., 2020; Varaiya, 2013). The capacity of each link is regarded as unlimited in the original max pressure method (Sun and Yin, 2018). Gregoire et al. (2014) and Yu et al. (2021) reformulated the pressure expression to account for link capacity and achieved better stabilization in a simulated environment. However, these strategies still lack consideration of agent coordination and long-term system performance. Thus, the global benefit of the control strategy may deteriorate when applied to a large network over an extensive period (Korecki and Helbing, 2022).
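To make the decentralized character of this rule concrete, the following minimal sketch selects the phase with the maximum pressure from per-movement queue counts; all function and identifier names are illustrative assumptions and do not refer to any cited implementation.

def phase_pressure(phase_movements, queue):
    # phase_movements: list of (upstream_lane, downstream_lane) pairs served by the phase
    # queue: dict mapping lane id -> current queue length (vehicles)
    return sum(queue[up] - queue[down] for up, down in phase_movements)

def max_pressure_phase(phases, queue):
    # phases: dict mapping phase index -> list of served (upstream, downstream) movements
    # returns the index of the phase with the largest total pressure
    return max(phases, key=lambda p: phase_pressure(phases[p], queue))

# Example: two phases at one intersection, queues measured in vehicles
queues = {"N_in": 8, "S_in": 6, "E_in": 3, "W_in": 2,
          "N_out": 1, "S_out": 0, "E_out": 4, "W_out": 2}
phases = {0: [("N_in", "S_out"), ("S_in", "N_out")],
          1: [("E_in", "W_out"), ("W_in", "E_out")]}
print(max_pressure_phase(phases, queues))  # -> 0 (the north-south phase has the larger pressure)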
Recent developments in Artificial Intelligence make this trend even more promising, as machine learning techniques can be applied to learn optimal local policies from observations, while a distributed framework introduces cooperation between local agents through information sharing (Abdoos, 2020; Li et al., 2021). Several attempts have been made to apply machine learning techniques to traffic signal control (Fadlullah et al., 2017; Lee et al., 2020; Wu et al., 2017), mainly using Reinforcement Learning (RL). The agent in the RL framework tends to choose, at each step, the control action that leads to the optimal long-term performance based on the system's real-time state (Sutton and Barto, 2018; Wang et al., 2021b). Artificial Neural Networks (ANNs) are commonly combined with the RL framework to increase the agents' learning efficiency and accommodate continuous state spaces (Mnih et al., 2013; Prashanth and Bhatnagar, 2010; Sutton and Barto, 2018). Encouraging results have been obtained compared to pre-timed signal controllers and model-based control strategies, particularly under variable traffic conditions (Casas, 2017; Chin et al., 2012; Gao et al., 2017; Genders and Razavi, 2016; Li et al., 2016; Yang et al., 2019).
Similar to the computational burden issue of centralized model-based control strategies, the state-action space increases exponentially with the number of signals in a centralized RL framework, leading to scalability and learning convergence challenges for the agents (Chu et al., 2019). Therefore, the Multi-Agent Reinforcement Learning (MARL) framework is commonly used to address these issues. The MARL framework allows multiple agents to act simultaneously in a shared environment. The decentralized agents in MARL observe partial information from the environment, simplifying the state and action spaces compared to a centralized agent (Arel et al., 2010). Chu et al. (2019) presented a signal control strategy with MARL and tested it in an extended synthetic grid network and a real-world traffic network; the results demonstrate the robustness and efficiency of the proposed algorithm. Zhang et al. (2019a) developed a traffic simulator, CityFlow, which can build a MARL environment for large-scale city traffic scenarios. Abdoos (2020) proposed a cooperative MARL framework for network signal control, which integrates game theory and Q-learning to provide a more cooperative traffic signal control strategy than the one generated by Q-learning alone. Communication among agents, i.e., sharing state or action information, is encouraged to foster cooperation and improve global indicators. Li et al. (2021) proposed a signal control method with knowledge sharing among all agents in the MARL framework: each agent contributes experiences to and accesses a shared knowledge container of the traffic environment. The knowledge-sharing agents accelerate the learning convergence and improve the traffic efficiency of large-scale networks compared to non-communicating agents. Wang et al. (2021b) introduced a novel way to define the agent in their MARL framework for traffic signal control: each agent represents a group of signals, and a region-aware cooperative strategy incorporating the spatial information of the surrounding agents is computed. Each agent decides whether the group of signals needs to perform a green wave. Better traffic performance is obtained compared with existing algorithms in large-scale networks.
However, the aforementioned studies focus on car traffic only and do not account for bus line operations. Indeed, several transport modes co-exist in urban areas and may receive different priority levels at intersections. Among others, the transit service has specific objectives, such as maintaining constant bus headways or fulfilling timetables, that might be difficult to achieve without adequate priority at traffic signals. Ineffective bus headway control may lead to bus bunching and further delays for passengers (Daganzo, 2009). Thus, investigating cooperative and decentralized traffic signal control strategies considering multiple transportation modes is crucial. Accounting for multiple and possibly competing objectives is a critical challenge when designing the signal control strategy.
To take bus transit into account in signal timing, Transit Signal Priority (TSP) is a widely explored strategy. Transit priority can be pre-set based on the bus timetable, known as passive TSP, which lacks robustness (Ni et al., 2022). Active and adaptive TSP combine real-time bus information with signal timing to enhance the control of buses. The combination of fixed control and TSP guarantees buses priority for passing the intersection through green time extension, green phase rotation, or green phase splitting based on real-time bus information (Ma and Yang, 2007). Transit arrival and bus dwell time prediction strategies have been developed to improve TSP performance (Ding et al., 2015; Ekeila et al., 2009; Ghanim and Abu-Lebdeh, 2015). With emerging signal control strategies and real-time traffic state detection techniques, TSP has been integrated with more advanced adaptive signal control strategies. Xu et al. (2022b) integrated TSP with max pressure, prioritizing the incoming lane with a bus in a max pressure control background. A numerical simulation of a network equipped with dedicated bus lanes suggests that the method reduces bus travel time without breaking the stability of the control compared to the original max pressure. Chen et al. (2022) combined bus priority with rhythmic
control. The control framework is designed for a fully automated vehicle environment: the controller simultaneously guides automated vehicles along conflict-free time-space trajectories computed to handle any movement at the intersection. Both studies reduce bus delay via traffic light controllers. Long et al. (2022) proposed a TSP strategy based on Deep Reinforcement Learning (DRL) that deals with multiple conflicting bus priority requests. They extended Dueling Double Deep Q-learning (D3QN) for their algorithm, achieving faster convergence and lower average person delay than other RL benchmarks and active TSP strategies in a single-intersection simulation environment. Nevertheless, systematically allocating priority to buses is not necessarily the best option, as it might favor bus bunching when an early bus joins a late one. Therefore, homogenizing headways is crucial to avoid such phenomena and minimize passengers' waiting times at bus stops, thus enhancing bus service efficiency and reliability (Wang and Sun, 2020).
Compared to bus priority, bus holding control is a more effective way to equalize bus headways when bus bunching occurs (Berrebi et al., 2018; Hans et al., 2015). It adjusts bus headways by holding buses at stops when necessary (Laskaris et al., 2020). The squared coefficient of variation of headways and the mean holding time are typical direct indicators of bus control performance (Berrebi et al., 2018). Wang and Sun (2020) proposed a MARL framework to implement a dynamic bus holding control strategy at bus stops. Each bus is regarded as a decentralized agent that aims to minimize the weighted sum of the forward and backward headway difference and the bus holding time. Simulation tests show promising results for a one-way bus corridor with uniformly distributed bus stops. Since bus holding control at bus stops may not be well perceived by passengers, holding can be achieved silently at traffic signals if early buses are not granted priority. To combine bus holding control with signal timing, Chow et al. (2021) applied DRL to adaptive signal control considering bus service reliability. They developed a centralized controller to manage traffic delays and bus headway control synchronously. Compared to fixed signal control integrated with a TSP strategy, improvements in vehicle travel time and bus headways are obtained in a macroscopic traffic environment with a one-way bus corridor. However, due to the centralized nature of Chow's model, scalability and transferability are not guaranteed. Furthermore, using a macroscopic model to feed the DRL framework tends to smooth the traffic indicators and might not capture the fluctuations observed in field conditions. Refinements of the above methodologies need to be explored to achieve more efficient (for both car traffic and bus transit) and scalable methods for controlling traffic signals in a multi-modal network.
In conclusion, the main limitations of existing studies are:
• Traffic signal control strategies considering TSP are well explored with emerging techniques. However, bus bunching cannot be effectively resolved by TSP alone; it requires bus holding control. Studies on signal control integrating TSP and bus holding control remain limited, especially in a decentralized formulation.
• DRL has achieved significant improvements in both signal control and bus control. However, the training and testing networks are usually identical in most existing studies, which cannot properly demonstrate the scalability (the adaptability of the proposed agent design to larger-scale networks) and transferability (the applicability of trained agents in various test environments without retraining) of the model, although these are crucial properties of RL agents (Ye et al., 2022; Zoph et al., 2018).
• Regarding bus control strategies, most studies lack either a fully defined traffic environment (e.g., car traffic disturbances, two-way bus lines, multiple conflicting bus lines, and mixed traffic lanes) or a sufficient number of intersections. The disturbances from car traffic and diverse bus line operations have to be better accounted for in bus control schemes.
This paper proposes a decentralized traffic signal control model based on DRL, where agents cooperate through communication with their immediate neighborhood. The control is designed for a multi-modal network consisting of private car traffic and bus transit. Vehicle accumulations and queue lengths on all intersection legs, as well as bus positions and speeds, are required for real-time operation. The model simultaneously minimizes the traffic delay and the bus headway variations for closed-loop bus lines and can accommodate different road layouts (e.g., dedicated bus lanes and mixed traffic lanes) and multiple conflicting bus lines. Scalability and transferability are demonstrated by applying trained models to other, similar intersections without retraining, reducing the training cost in an extensive transportation system. A broad benchmark against the most representative methods is performed in numerical experiments using the microscopic traffic simulator SUMO (Alvarez Lopez et al., 2018). The proposed approach shows promising performance.
The remainder of this paper is organized as follows: Section 2 describes the DRL algorithm used in this paper and formalizes the
agent design in detail. Section 3 discusses the tradeoff between car traffic and bus transit in the agent’s reward and compares the
proposed method with benchmarks via several numerical experiments. The conclusions and perspectives are summarized in Section 4.
2. Methodology
To achieve scalable and decentralized traffic signal control, we set each signalized intersection in the multi-modal network as an agent that learns strategies offline or inherits a trained model from other agents. The dataset of available trained models is built by training the agents in various small-scale networks. For a more extensive or distinct testing network, each signalized intersection is paired with a trained model from the dataset based on the proximity of the intersection configuration and neighbor connections. One trained model can be reused for several agents. If no suitable match is found, a new model is trained specifically for that intersection, as illustrated in Fig. 1. $Ma_i$ in Fig. 1 is the label of trained model i, and $Ma_{new}$ represents a model newly trained for that signalized intersection.
The traffic signal control process for each intersection is represented by an agent that dynamically triggers one phase. The optimal signal timing plan is supposed to reduce the overall car traffic delay and the variance of bus headways in the long term. The timing plan needs to be evaluated and updated after each decision taken by the agent controlling the traffic signal. This interaction between the traffic light and the traffic environment follows Markov Decision Processes (MDPs) (Mannion et al., 2016). In MDPs, the agent estimates the value of each action for a given state and selects the optimal one (Sutton and Barto, 2018), which matches the signal controller's behavior. In this study, the real-time car traffic and bus headway information is retrieved at the beginning of each decision step. The agent chooses an action depending only on the current state, which satisfies the Markov property. In corridor- and network-level signal control, a group of signalized agents acts in a shared environment, each pursuing its individual goal, forming the MARL framework (Buşoniu et al., 2010). Communication among agents is used for better coordination.
To build the multi-modal traffic signal control method in a MARL framework, as shown in Fig. 2, the environment consists of the intersections, the road network, car traffic flow, and bus transit. Each signalized intersection is regarded as an agent. In this framework, the state observations consist of real-time traffic and bus information and the last actions of the traffic lights. The same variables are used to define the reward function, i.e., total stopping time, occupancy, and bus headways. In this paper, we set the reward to minimize the cumulative number of stopped vehicles on all incoming legs of the intersection and to homogenize (keep constant) the headways between buses. The tradeoff between the car traffic-related objectives and bus bunching avoidance is defined by the parameter c.
We adopt a Deep Q-Network (DQN) to achieve an action-value function approximation in this study (Mnih et al., 2013; Mnih et al., 2015). The MDPs can be defined by $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state space and action space, respectively, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \mapsto \mathcal{S}$ is the state transition function, $\mathcal{R}$ represents the reward function, and $\gamma$ is the discount factor for future rewards. The future discounted reward at decision step t is defined as $R_t = \sum_{\hat{t}=t}^{T} \gamma^{\hat{t}-t} r_{\hat{t}}$, where T is the maximum time step and $r_{\hat{t}}$ is the immediate return at decision step $\hat{t}$. An ANN is used to learn a mapping from states to Q-values in the learning process of DQN. The agent tends to select actions that maximize the expected cumulative discounted future reward. The optimal action value of agent i is given by Eq. (1):

$$Q_i^*(s, a) = \max_{\pi} \mathbb{E}\left[r_{i,t} + \gamma r_{i,t+1} + \gamma^2 r_{i,t+2} + \cdots \mid S_{i,t} = s,\ a_{i,t} = a,\ \pi\right] \tag{1}$$

In tabular Q-learning, the action value is updated at each decision step with the temporal-difference rule of Eq. (2):

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right] \tag{2}$$

where $\alpha$ is the learning rate of agents, $(s, a)$ is the state-action pair, and $(s', a')$ denotes the state-action pair at the next decision step.
In order to retrieve random samples for the agent to learn the action-value function, the state, action, reward, and next-step state are recorded at every update. The memory of agent i at decision step t is labeled $\mathcal{D}_{i,t} = \{e_{i,1}, \ldots, e_{i,t}\}$, where $e_{i,t} = (S_{i,t-1}, a_{i,t-1}, r_{i,t-1}, S_{i,t})$, to perform experience replay. For each learning phase, a batch of samples is drawn randomly from the total memory and
used to compute the temporal difference error. The loss function adopts the formulation given below:
$$L_k(\theta_k) = \mathbb{E}_{(s,a,r,s') \sim U(M)}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_k^-) - Q(s, a; \theta_k)\right)^2\right] \tag{3}$$

where k denotes the kth iteration, and $\theta_k$ and $\theta_k^-$ are the parameters of the Q-network and the target network, respectively, at iteration k. The loss function $L_k(\theta_k)$ is minimized to reduce the deviation between the target and the current Q-value.
An ANN is traditionally used to estimate the action-value function since it achieves a nonlinear function approximation (Sutton and Barto, 2018). An ANN consists of an input layer, interconnected hidden layers, and an output layer. The structure of the ANN in this paper, shown in Fig. 3, follows the method developed in Vidali (2021) due to the similar state and action dimensions. The ANN approximates the optimal action-value function by minimizing the temporal difference error. In the test process, the agent uses the ANN in operating mode: it predicts the expected reward of each action based on the state via the fitted action-value function to find the optimal action at each decision step. The details regarding the learning process are summarized in Appendix A.
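To illustrate how the action-value approximation and the loss of Eq. (3) can be implemented, the sketch below (PyTorch, with illustrative layer sizes and names; the actual architecture follows Vidali (2021) and is not reproduced here) defines a small fully connected Q-network and performs one gradient step on a replayed minibatch. In practice, the target network is a periodically synchronized copy of the Q-network, and actions are selected epsilon-greedily from the predicted Q-values.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Maps a state vector to one Q-value per predefined green phase.
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.95):
    # batch: tensors (states, actions, rewards, next_states) sampled uniformly from the replay memory;
    # actions is an int64 tensor of shape [batch_size]
    states, actions, rewards, next_states = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)           # Q(s, a; theta_k)
    with torch.no_grad():
        target = rewards + gamma * target_net(next_states).max(dim=1).values  # r + gamma * max_a' Q(s', a'; theta_k^-)
    loss = nn.functional.mse_loss(q_sa, target)                               # squared temporal-difference error, Eq. (3)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()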
The agent is a traffic signal controller associated with an intersection. The agent's action consists of a single integer representing the index of a predefined green phase. According to the action index, the signal activates one of the predefined phases for the following decision step. If the chosen action differs from the last action, a 5 s yellow and all-red phase is activated. Table 1 details the notations adopted in the proposed framework.
Table 1
Notations of variables.
$d_{i,m,t}$: Control distance for buses on the mth incoming leg of agent i at decision step t
$d'_{i,m,t}$: Control distance for buses on the mth incoming leg equipped with a dedicated bus lane of agent i at decision step t
$d''_{i,m,t}$: Control distance for buses on the mth incoming leg equipped with a mixed traffic lane of agent i at decision step t
$v_{max}$: Speed limit for buses
$a_{dec}/a_{acc}$: Acceptable deceleration/acceleration for buses
$\Delta t$: Time duration of a decision step
$S_{i,t}$: State tuple of agent i at decision step t
$S^{traffic}_{i,t}/S^{transit}_{i,t}/S^{coop}_{i,t}$: State retrieved from car traffic/bus transit/cooperativeness
$M_i$: Set of all incoming legs of intersection i
$M^B_i$: Set of all incoming legs of intersection i equipped with a bus line
$M^R_{i,t}$: Set of all incoming legs of intersection i that are in red phases at decision step t
$x^s_{i,m}$: Distance between intersection i and the bus stop on the mth incoming leg; $x^s_{i,m} = \infty$ if there is no stop on the mth incoming leg of intersection i
$x^b_{i,m,t}$: Distance between intersection i and the bus on the mth incoming leg at time step t; $x^b_{i,m,t} = 0$ if there is no bus on the mth incoming leg of intersection i
$v_{i,m,t}$: Speed of the bus closest to intersection i on the mth incoming leg at time step t
$v_{cri}$: Critical speed to measure the buses' state
$D_{i,t}$: Summation of the total stopping time of all vehicles on the incoming legs of intersection i during decision step t
$O_{i,t}$: Set of occupancies of all incoming legs of intersection i at time step t
$O_{i,m,t}$: Occupancy of the mth incoming leg of intersection i at time step t
$O_{cri}$: Critical occupancy for the reward definition
$n^m_{i,t}$: Number of stopped vehicles on the mth incoming leg of intersection i at time step t
$h^f_{i,m,t}/h^b_{i,m,t}$: Forward/backward space headway of the bus on the mth incoming leg of intersection i at time step t
$r_{i,t}$: Total reward of agent i at decision step t
$r^{traffic}_{i,t}/r^{agent}_{i,t}/r^{transit}_{i,t}$: Reward received from car traffic/the agent's action/bus transit of agent i at decision step t
$a_{i,t}$: Action of agent i at decision step t
$y_{i,t}$: Number of yellow phases during the last ten actions of agent i at decision step t
$g^{bus}_{i,m}$: Bus phase on the mth incoming leg of intersection i
$y$: Critical occurrence count of yellow and all-red phases among the last ten actions
$c$: Weight of the bus transit reward
In the table, t denotes the index of the decision step; each decision step lasts 5 s in the proposed model. t' denotes the index of the simulation step, each lasting 1 s, so there are five simulation steps within each decision step. During each simulation step, the stopping time equals the number of stopped vehicles, and the sum over all simulation steps within a decision step gives the total stopping time of vehicles for that decision step.
The number of switches, $y_{i,t}$, is an integer variable ranging from 0 to 10, which monitors how frequently the controller switches phases. If the frequency is high, a large amount of the available capacity is lost to excessive yellow and all-red phases. In real-world signal settings, the lower bound of the cycle length is typically 50 to 60 s. Therefore, we chose a span of 10 steps, lasting 50 s, to be consistent with real-world signal settings. It is important for controllers to consider this information in their decision process as it is not reflected in any other state variable.
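For concreteness, a minimal sketch of how one decision step could be executed in SUMO through the TraCI interface is given below. Only the traci calls are part of the standard SUMO API; all other names, the signal state strings, and the assumption that the 5 s transition precedes the decision step are illustrative and not the authors' implementation.

from collections import deque
import traci  # SUMO TraCI Python client

switch_history = deque(maxlen=10)   # 1 if a yellow/all-red transition occurred, else 0

def run_decision_step(tls_id, incoming_edges, action, last_action, green_states, yellow_states):
    # green_states / yellow_states: signal state strings per action index (illustrative placeholders)
    switched = int(action != last_action)
    switch_history.append(switched)
    if switched:
        traci.trafficlight.setRedYellowGreenState(tls_id, yellow_states[last_action])
        for _ in range(5):                       # 5 s yellow + all-red transition (assumed to precede the step)
            traci.simulationStep()
    traci.trafficlight.setRedYellowGreenState(tls_id, green_states[action])
    stopping_time = 0
    for _ in range(5):                           # one decision step = five 1 s simulation steps
        traci.simulationStep()
        stopping_time += sum(traci.edge.getLastStepHaltingNumber(e) for e in incoming_edges)
    y_it = sum(switch_history)                   # number of switches among the last ten actions
    return stopping_time, y_it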
The bus transit state $S^{transit}_{i,t} = \left[h^f_{i,m,t}, h^b_{i,m,t}\right], \forall m \in M^B_i$, gathers the real-time forward and backward space headways of the bus on the mth incoming leg of intersection i at decision step t. All incoming legs with bus lines need to be detected. For example, if two incoming legs of an intersection are equipped with the bus line, there are two sets of forward and backward space headways in this agent's state. Note that the optimal bus service is usually obtained when time headways are homogeneous. However, the previous buses' arrival data would be required to calculate the time headways, which could violate the Markov property. Bus control can also be applied by monitoring space headways, since there is a strong correlation and consistent performance between time headway and space headway (Abul-Magd, 2007; Ampountolas and Kring, 2021; Liu and Wang, 2012; Nagatani, 2001), and the space headway can be detected in real time. Therefore, the bus control strategy here aims to equalize the forward and backward space headways of all buses approaching intersections.
If the bus is too far from the signal, the signal's latest action does not contribute to the bus service, so the agent should act regardless of the bus state. When the bus's distance to the traffic signal is below the control distance (for bus transit), $d_{i,m,t}$, the agent has to find the tradeoff between car traffic-related and bus transit-related objectives by activating the reward for bus transit. The control distance $d_{i,m,t}$ is a parameter defining the maximal distance within which agent i needs to consider the impact of incoming buses when taking the following action.
The calculations of the control distance on dedicated bus lanes and mixed traffic lanes differ; they are denoted $d'_{i,m,t}$ and $d''_{i,m,t}$, respectively. The control distance for dedicated bus lanes is the maximum of the distance needed for a bus to decelerate to a halt at an acceptable deceleration and the distance covered by a bus moving at the maximum speed during one decision step. Assuming the bus performs a uniform deceleration:

$$d'_{i,m,t} = \max\left\{\frac{v_{max}^2}{2a_{dec}},\ v_{max}\,\Delta t\right\} \tag{5}$$

For the mixed traffic lane, the control distance is defined according to the speed of the bus, as Eq. (6) displays. If the bus moves at a low speed or is queueing, the action of the signal can influence the queue and hence the bus; no matter how far the bus is from the intersection, the control distance then equals the distance between the bus and the intersection. If the bus is running at a regular or high speed, the control distance follows Eq. (5). The control distance in the mixed traffic lane is given by:

$$d''_{i,m,t} = \begin{cases} x^b_{i,m,t}, & v_{i,m,t} < v_{cri} \\[4pt] \max\left\{\dfrac{v_{max}^2}{2a_{dec}},\ v_{max}\,\Delta t\right\}, & v_{i,m,t} \geqslant v_{cri} \end{cases} \tag{6}$$

For example, if a bus travels on a mixed traffic lane at a low speed (compared to the critical speed) and is 250 m from the intersection, the control distance is 250 m. If the bus travels at a regular speed, the control distance is calculated with the second line of Eq. (6), which yields a much smaller value than 250 m.
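A minimal sketch of Eqs. (5) and (6) follows; the parameter values are illustrative placeholders, not the values used in the experiments (those are given in Table 3).

V_MAX = 13.9      # speed limit for buses (m/s), illustrative
A_DEC = 1.5       # acceptable deceleration (m/s^2), illustrative
DELTA_T = 5.0     # duration of a decision step (s)

def control_distance_dedicated():
    # Eq. (5): braking distance from v_max versus distance travelled at v_max in one decision step
    return max(V_MAX ** 2 / (2 * A_DEC), V_MAX * DELTA_T)

def control_distance_mixed(bus_distance, bus_speed, v_cri):
    # Eq. (6): a slow (queued) bus is always within control; otherwise fall back to Eq. (5)
    if bus_speed < v_cri:
        return bus_distance
    return control_distance_dedicated()

# Example from the text: slow bus 250 m upstream -> control distance 250 m
print(control_distance_mixed(250.0, 2.0, v_cri=3.47))   # 250.0
print(control_distance_mixed(250.0, 8.0, v_cri=3.47))   # about 69.5 m with these illustrative parameters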
If the bus is outside the control distance or there is no bus on the incoming leg, the forward and backward space headways are set to 0. When several buses are on the same leg, only the state of the bus closest to the signal is collected. To exclude the situation where a dwelling bus is regarded as queueing, we define the critical position at which the bus, departing from a stop, speeds up to $v_{cri}$. If the bus's distance to the traffic signal is larger than this critical position, the bus state is set to 0. Thus, if $x^b_{i,m,t} > x^s_{i,m} - \frac{v_{cri}^2}{2a_{acc}}$, then $S^{transit}_{i,t} = [0, 0]$. $S^{transit}_{i,t}$ can be summarized as Eq. (7):

$$S^{transit}_{i,t} = \begin{cases} \left[h^f_{i,m,t},\ h^b_{i,m,t}\right], & x^b_{i,m,t} < \min\left\{d_{i,m,t},\ x^s_{i,m} - \dfrac{v_{cri}^2}{2a_{acc}}\right\} \\[6pt] [0, 0], & \text{otherwise} \end{cases} \quad \forall m \in M^B_i \tag{7}$$
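Continuing the previous sketch, the transit-state entry of one incoming leg can be assembled as below; control_distance_mixed and the acceleration value are the illustrative helpers and placeholders introduced above, and a leg with a dedicated bus lane would use Eq. (5) directly.

A_ACC = 1.0   # acceptable acceleration (m/s^2), illustrative

def transit_state_leg(bus_distance, bus_speed, forward_headway, backward_headway,
                      stop_distance, v_cri):
    # Eq. (7): report headways only for a bus that is inside the control distance
    # and past the position where a bus leaving the stop reaches v_cri.
    if bus_distance == 0:                       # no bus on the leg
        return [0.0, 0.0]
    d = control_distance_mixed(bus_distance, bus_speed, v_cri)
    critical_position = stop_distance - v_cri ** 2 / (2 * A_ACC)
    if bus_distance < min(d, critical_position):
        return [forward_headway, backward_headway]
    return [0.0, 0.0]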
The cooperativeness state $S^{coop}_{i,t} = \left[a_{i-1,t-1}, a_{i,t-1}, a_{i+1,t-1}\right]$ consists of two components: the last action of agent i, $a_{i,t-1}$, and the set of actions taken by the immediate neighborhood of agent i. There are two neighbors ($a_{i-1,t-1}$, $a_{i+1,t-1}$) in single-arterial scenarios and four in network or multi-arterial scenarios.
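As a purely illustrative sketch, the full state tuple of an agent could then be concatenated as follows; the exact composition of the car-traffic component (here, the stopping time and the occupancies of the incoming legs, following Table 1) is an assumption made for illustration.

def agent_state(stopping_time, occupancies, transit_states, last_actions):
    # stopping_time: D_{i,t} accumulated over the last decision step
    # occupancies: list of O_{i,m,t} for all incoming legs (assumed car-traffic component)
    # transit_states: [h^f, h^b] pairs for all legs carrying a bus line, from Eq. (7)
    # last_actions: [a_{i-1,t-1}, a_{i,t-1}, a_{i+1,t-1}] cooperativeness component
    state = [stopping_time] + list(occupancies)
    for pair in transit_states:
        state += list(pair)
    state += list(last_actions)
    return state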
The car traffic reward is based on the total stopping time of vehicles on all incoming legs. Therefore, the total stopping times of the current decision step and of the last step are compared to define the reward $r^{traffic'}_{i,t}$. The reward of occupancy on the mth incoming leg, $r^{traffic''}_{i,m,t}$, has to be calculated for all incoming legs to which a red phase was assigned during the last decision step. The reward is penalized cumulatively for the incoming legs with an occupancy larger than the critical one. Eqs. (8) and (9) describe the calculation of the two rewards.

$$r^{traffic'}_{i,t} = \begin{cases} 1, & D_{i,t} < D_{i,t-1} \\ -1, & D_{i,t} \geqslant D_{i,t-1} \end{cases} \tag{8}$$

$$r^{traffic''}_{i,m,t} = \begin{cases} 0, & O_{i,m,t} < O_{cri} \\ -1, & O_{i,m,t} \geqslant O_{cri} \end{cases} \tag{9}$$
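A compact sketch of Eqs. (8) and (9), with illustrative variable names:

def traffic_rewards(stop_time, prev_stop_time, red_leg_occupancies, o_cri):
    # Eq. (8): +1 if the total stopping time decreased, -1 otherwise
    r_stop = 1 if stop_time < prev_stop_time else -1
    # Eq. (9): -1 for every red-phase leg whose occupancy reaches the critical value
    r_occ = sum(0 if occ < o_cri else -1 for occ in red_leg_occupancies)
    return r_stop, r_occ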
Agents try to avoid wasting green time caused by frequent phase switches. Thus, the reward from the agent's action is given by Eq. (10), where y is a predefined integer (0 < y < 10) representing the acceptable number of phase switches over 10 steps. To match the real-world setting, the value of y can be chosen as the number of green phases (which equals the number of yellow phases) in one fixed control cycle. Since agents need to switch phases in time to respond to real-time traffic, there is no positive reward that encourages agents to switch less frequently.

$$r^{agent}_{i,t} = \begin{cases} 0, & y_{i,t} \leqslant y \\ -1, & y_{i,t} > y \end{cases} \tag{10}$$
For bus transit, $r^{transit}_{i,t} = \sum_{m \in M^B_i} r^{transit}_{i,m,t}$. If the forward space headway of a bus is larger than the backward one, i.e., $h^f_{i,m,t} > h^b_{i,m,t}$, the bus needs to be prioritized to shorten the forward headway and thus equalize the forward and backward headways. In this case, the transit reward is positive if the agent gives green priority to the bus's incoming lane in the following decision step. Similarly, when $h^f_{i,m,t} < h^b_{i,m,t}$, the bus needs to be held to equalize the forward and backward headways. The reward from the bus system can be summarized as Eq. (11), where $g^{bus}_{i,m}$ is the bus phase on the mth incoming leg of intersection i. For example, if a bus travels from West to East and the phase that gives green to this movement is labeled Phase 1, then $g^{bus}_{i,m} = 0$, since action 0 activates Phase 1. Since the bus line route and the signal phases are predefined, $g^{bus}_{i,m}$ is a known constant for each intersection.

$$r^{transit}_{i,m,t} = \begin{cases} 1, & h^f_{i,m,t-1} > h^b_{i,m,t-1} \text{ and } a_{i,t-1} = g^{bus}_{i,m} \\ 1, & h^f_{i,m,t-1} < h^b_{i,m,t-1} \text{ and } a_{i,t-1} \neq g^{bus}_{i,m} \\ 0, & h^f_{i,m,t-1} = h^b_{i,m,t-1} \\ -1, & \text{otherwise} \end{cases} \tag{11}$$
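The components above are combined into the agent's total reward. The exact combination equation is not reproduced here; the sketch below assumes a sum in which the transit term is scaled by the parameter c, which is consistent with the reward ranges reported in Section 3.2 (car traffic in [-3, 1] and bus transit in [-3, 3] for c = 3).

def total_reward(r_stop, r_occ, r_agent, transit_terms, c):
    # transit_terms: list of r^transit_{i,m,t} values, one per incoming leg with a bus line (Eq. (11))
    # Assumed combination: traffic terms plus the c-weighted sum of transit terms.
    return r_stop + r_occ + r_agent + c * sum(transit_terms)

# Example: stopping time decreased (+1), one congested red leg (-1), no switch penalty (0),
# one bus correctly prioritized (+1), with c = 3  ->  reward 3
print(total_reward(1, -1, 0, [1], c=3))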
Fig. 5. Structure of intersections and predefined phases in scenario 2. (a) Structure of intersections, (b) predefined phase 1, (c) predefined phase 2.
3. Numerical experiments
Our numerical experiments are based on three scenarios with various road and bus line configurations. Multiple traffic control strategies are tested and compared across these scenarios. They were selected because of their proximity to our approach or their wide use in the literature and field operation: fixed-time control, the longest queue first rule, max pressure, max pressure and fixed-time control combined with TSP, and the centralized RL controller of Chow et al. (2021).
The analysis is performed according to three scenarios implemented in the SUMO simulation framework (Alvarez Lopez et al., 2018). Scenarios differ in:
• The bus lane layout: dedicated bus lane or mixed traffic lane,
• The bus-line deployment: one bus line or multiple (crossing) bus lines.
The first scenario is further used to calibrate the tradeoff parameter c. The scalability and transferability of trained models are
demonstrated in the first and last scenarios by applying existing models without retraining.
Scenario 1: One bus line artery with dedicated bus lanes. This scenario includes one bus line driving on dedicated bus lanes. Buses are not affected by the car traffic conditions, only by the traffic lights.
To highlight the scalability and transferability of the strategies learned by the agents, we trained and tested them on distinct networks with similar intersection configurations. There are five signalized intersections in the training artery (Fig. 4(a)) and ten in the test artery (Fig. 4(b)). The numbers displayed in the network figures are the link lengths. Due to the decentralized framework, five models (labeled $Ma_1$ to $Ma_5$ from West to East) are obtained after training. They are transferred to the test network and applied to agents managing signals with similar road configurations. Specifically, $Ma_1$, $Ma_2$, $Ma_4$, and $Ma_5$ are assigned to the first and last two intersections of the test artery, while all the others reuse $Ma_3$.
The sketch of each intersection and the predefined signal phases are shown in Fig. 4(c), (d), and (e). There are two lanes in each direction on the main road (one being the dedicated bus lane), one lane on the crossroads, and two optional phases for the signal. Straight movements are always prioritized over left-turn movements in the same phase: left-turn vehicles need to wait for gaps in the straight flows to pass through the intersection. In the training process, four buses travel in loops on the artery. All buses take a U-turn at the artery terminal and continue their route in the opposite direction. Six bus stops are located along the main artery, as shown in Fig. 4(a). In the testing network, the number of buses is increased to eight, and there are ten bus stops, as shown in Fig. 4(b).
Fig. 4. Network and intersection structure in scenario 1. (a) Training artery, (b) testing artery, (c) structure of intersections, (d) predefined phase 1, (e) predefined phase 2.
Scenario 2: One bus line artery with mixed traffic lanes. This scenario is similar to scenario 1, except that the dedicated bus lanes are converted into mixed traffic lanes. The buses are affected by the surrounding car traffic conditions and the signal lights.
The training and test networks are the same as the 5-signal artery in scenario 1. However, the intersection pattern is adjusted to the mixed traffic lane. The traffic demand is adjusted to maintain a realistic demand profile, which is detailed in Section 3.1.2. The intersection structure and predefined phases are displayed in Fig. 5. The trained models are labeled Ma1' to Ma5' from West to East and are transferred to agents in scenario 3.
Fig. 6. Network and intersection structure in scenario 3. (a) Training and testing network; structure of intersections (b) in the horizontal artery and (c) in the vertical artery.
Scenario 3: Two bus lines network with mixed traffic lanes. This scenario mimics an urban network of two crossing corridors with bus lines on mixed traffic lanes. The agents must simultaneously deal with the car traffic conditions and with buses located on various legs. It introduces a new agent model accounting for two crossing bus lines whose buses arrive from four incoming legs, while in the previous scenarios agents were dealing with buses coming from two opposite legs.
The training and test networks, including the bus stop information, are shown in Fig. 6(a). There are two mixed traffic lanes on both main arteries and one on all side roads. Fig. 6(b) and 6(c) show the detailed structure of the intersections. There is one bus line on each artery, so buses come from all directions at the central intersection. Bus line 1 runs along the horizontal artery, and bus line 2 along the vertical one.
Table 2
Traffic demand in scenarios 1, 2, and 3.
Scenario Direction Traffic demand (pcu/h)
* Random with a 10% interval higher or lower than the value displayed.
Note: for any flow direction at all intersections, 5%-10% of the total displayed demand is assigned to left or right turns.
Table 3
Parameters in RL-based strategy.
Parameter Value
Several random seeds are used to generate different traffic demands. A wide random range for the demand of each side road is set to ensure that different demand patterns can be generated during training. Since the range of side road demand is wide, the general demand level also varies. All simulations last 7200 s. The departure interval of buses is 192 s in all networks of the three scenarios.
3.2. Calibration of parameter c: Finding a tradeoff between car traffic and bus service with scenario 1
This section seeks a suitable tradeoff between car traffic-related and bus-related objectives, modeled by the parameter c. Consequently, we trained and tested different c values (ranging from 1 to 5) and two extreme cases with scenario 1. In the two extreme cases, only the reward of car traffic or of bus transit is considered, denoted 'RL – only traffic' and 'RL – only bus', respectively. For each setting, agents were trained for 70 episodes, and the sensitivity to the c value was tested on the resulting models. The reward curves of all agents in each training model are shown in Fig. 7. Models obtained at episodes 50, 60, and 70 were all saved for testing. Models trained for 50 episodes consistently perform satisfactorily among these strategies. Therefore, early stopping at 50 episodes, a common form of regularization used when the final model does not provide the best performance due to overfitting (Malik et al., 2021; Zhang et al., 2019b), is applied in this study.
Fig. 7. Total reward of each agent along the training episodes in RL-based models with (a) only traffic, (b) c = 1, (c) c = 2, (d) c = 3, (e) c = 4, and (f) only transit.
We compare the average queue length and the standard deviation of bus space/time headways among these cases. The results are shown in Table 4. The time headway is the time gap between two successive buses arriving at the same bus stop. We set different random seeds to generate various traffic demands for testing. The notation 'RL - n' refers to the proposed RL model with c = n. The average space headway in Table 4 is always 1516 m because buses travel in a loop and the space headway between the first and last buses is also considered. Thus, the average space headway is always the total length of a round trip divided by the number of buses, also denoted as the nominal headway.
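For instance, with the eight buses of the scenario 1 test artery, this nominal headway corresponds to a round-trip length of roughly

$$L_{loop} = N_{bus} \times h_{nominal} = 8 \times 1516\ \text{m} \approx 12.1\ \text{km}.$$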
A larger c value forces the model to put more weight on the bus than on traffic performance in the reward; thus, the car traffic performance might be impacted. Fig. 8 compares the total queue length of all control strategies along the simulation. A continuous increase in total queue length is observed from 'RL – only traffic' to 'RL - n' and then 'RL – only bus'. According to Table 4, when c is set to 3, car traffic and bus performances appear well balanced in all three tests; for other c values, only the car traffic or the bus performance is satisfactory. We therefore choose c = 3 in the subsequent sections. With this setting, the car traffic and bus transit rewards range from -3 to 1 and from -3 to 3, respectively: bus transit has the same weight as car traffic in the negative reward range and a larger weight in the positive range.
Table 4
Testing results of each control method in scenario 1.
Random seed Control method Space headway Time headway Average queue length (vehs)
We further compare the performance of our approach with the benchmark approaches. Table 4 displays the results for various simulation seeds in scenario 1. Three different seeds are tested to rule out coincidental performance of the proposed model. The average queue length of 'RL - 3' is always shorter than that of the benchmark approaches. On average over the three seeds, 'RL - 3' decreases the average queue length by 30.91%, 19.59%, and 6.80% compared to fixed control, longest queue first, and max pressure, respectively. According to Fig. 8, the max pressure method performs slightly better than 'RL - 3' during the first half of the simulation. When the demand from the side roads increases during the second half of the simulation, 'RL - 3' outperforms max pressure and obtains a better global performance.
'RL – 3' also outperforms the benchmark strategies in headway control. The space and time headways of the control methods in the test with random seed = 25000 are shown in Fig. 9 and Fig. 10. Fig. 9 displays the travel distance of each bus during the simulation, and Fig. 10 presents the arrival time of each bus at each bus stop. Buses travel in loops on the artery, so all bus stops are passed several times by all buses. The bus stop value on the x-axis of Fig. 10 is therefore the cumulative number of stops that buses have reached. In Fig. 9 and Fig. 10, two lines getting closer or even crossing correspond to bus bunching, which is highlighted with red circles in the space headway figures. This phenomenon is observed in the trajectories of buses 1 and 2 and buses 4 and 5 in the longest queue first approach, and of buses 1 and 2 and buses 7 and 0 in the max pressure strategy. On the contrary, the RL-based methods effectively prevent buses from bunching.
Fig. 9. Travel distance of buses along the simulation steps in scenario 1. (a) Longest queue first, (b) max pressure, (c) RL – 3, (d) RL – only bus.
Fig. 10. Arrival time of buses at each stop in scenario 1. (a) Longest queue first, (b) max pressure, (c) RL – 3, (d) RL – only bus.
Fig. 11 displays the distribution of bus headways for each control strategy. The space and time headways are more concentrated around the average value in our RL-based models than in the benchmark strategies. This observation is consistent with Table 4: the standard deviations of headways in 'RL-3' and 'RL – only bus' are smaller than those of the benchmark approaches. We calculate the percentage of small headways (less than 50% of the nominal headway) for the control methods: 17.64%, 16.12%, 6.96%, and 3.03% for longest queue first, max pressure, RL – 3, and RL – only bus, respectively. The proposed approaches provide a significant improvement. The scalability and transferability of the proposed method are thus verified by the test performance.
In this scenario, since buses drive in mixed traffic and might be affected by car traffic conditions, choosing an appropriate critical speed for buses is essential. According to Eq. (6), the critical speed directly determines the speed state and the control distance for buses. We tested different values, 25% (3.47 m/s) and 50% (6.95 m/s) of the free-flow speed, as well as values around them (2 m/s and 5 m/s), as the critical speed in scenario 2 (Chen et al., 2021). They are denoted 'RL – C3.47', 'RL – C6.95', 'RL – C2', and 'RL – C5', respectively.
Fig. 8. Comparisons of queue length along the simulation in scenario 1 when (a) random seed = 15000, (b) random seed = 20000, and (c) random
seed = 25000.
The test results for each model in scenario 2 are presented in Table 5 (random seed = 15000). The performance of the RL-based control strategies varies significantly with the critical speed. Overall, the models with critical speeds of 3.47 m/s and 5 m/s outperform the other RL-based models. Based on the space headway deviation, which is the direct optimization indicator in our model, the model with a critical speed of 3.47 m/s exhibits the best performance in both headway control and traffic delay.
In this scenario, the centralized RL model has a goal similar to ours: improving car traffic and bus performance through traffic signal control (Chow et al., 2021). That model is built on a macroscopic traffic flow environment with the Cell Transmission Model (CTM). Since the traffic flow environment in our model is microscopic, Edie's definitions are applied to the trajectory data to ensure a proper estimation of the density and outflow fed to the centralized RL model (Edie, 1963; Leclercq et al., 2014).
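As a reminder, Edie's generalized definitions estimate density and flow from the trajectory pieces observed in a space-time region of area L x T: density is the total time spent divided by the area, and flow is the total distance traveled divided by the area. A minimal sketch, with illustrative names:

def edie_density_flow(trajectory_pieces, region_length, region_duration):
    # trajectory_pieces: list of (time_spent, distance_travelled) per vehicle inside the region
    area = region_length * region_duration                   # |A| = L * T (m * s)
    total_time = sum(t for t, _ in trajectory_pieces)        # veh * s
    total_distance = sum(d for _, d in trajectory_pieces)    # veh * m
    density = total_time / area        # veh/m
    flow = total_distance / area       # veh/s
    return density, flow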
The signal strategies ’MP + TSP’ and ’Fixed + TSP’ prioritize the bus phase whenever a bus is on the incoming legs, regardless of
other traffic conditions. However, if the frequency of buses is high, this can lead to poor performance in average queue length. In this
scenario, four buses simultaneously travel through the 5-signal artery, and more than two signals need to activate TSP throughout the
simulation. The average queue length of these two strategies is heavily affected. Therefore, we focus solely on comparing the bus
control performances of ’MP + TSP’ and ’Fixed + TSP’ with other strategies.
According to the results in Table 5 and Fig. 12, the performance of the centralized RL model in our environment is acceptable but still falls short of the results reported in the original paper. This is a common issue for centralized RL methods, which lack transferability and scalability, leading to poor convergence when the model is applied to a different scenario. Moreover, the agent of the centralized RL model was formulated in a macroscopic traffic environment in the original paper, ignoring the disturbances and fluctuations of a microscopic, real-world traffic environment. These factors may explain why the model underperforms its initial results and the other benchmarks in our microscopic environment.
Regarding car traffic performance, 'RL - C3.47' obtains better results than the other models, reducing the average queue length by 9.30% compared to the best benchmark performance. The max pressure and longest queue first control methods underperform fixed control on average queue length because their signal phases switch frequently, resulting in an excessive waste of green time. The average yellow phase rates of max pressure, longest queue first, fixed control, and the proposed method are 31.46%, 31.53%, 11.11%, and 10.82%, respectively. Although the minimum and maximum green times of these control strategies are identical, the switch rates of the max pressure and longest queue first strategies remain much higher than that of the proposed method.
Regarding bus headway performance, the results in Table 5 show that ’MP + TSP’ outperforms all other strategies, reducing the
standard deviation of time headway by 8.77% compared to ’RL - C3.47′. In terms of space headway, ’RL – C3.47′ still performs the best.
’Fixed + TSP’ outperforms other benchmarks but is still inferior to the proposed method and ’MP + TSP’. Furthermore, Figs. 13 to 15
present the bus headway performance of ’RL – C3.47′, ’MP + TSP’, and ’Fixed + TSP’, respectively, showing homogenized trajectories
for both space and time.
To demonstrate the effectiveness of the proposed model in handling multiple buses on the same link, two unstable scenarios with bus bunching are tested. In the first, a 10-signal artery with dedicated bus lanes (the same as the test network in scenario 1) is simulated, with three buses departing within 30 s. The second is a 5-signal artery with mixed traffic lanes (the same as the test network in scenario 2), with two buses departing within 10 s. The space trajectories of the buses, depicted in Figs. 16 and 17 below, indicate that the proposed model can effectively resolve bus bunching and homogenize space headways in both scenarios. In the dedicated bus lane scenario, bus bunching is resolved after 340 action steps, and the space headways of the eight buses are homogeneous over the last 300 action steps of the simulation. In the mixed traffic lane scenario, the bus bunching is resolved after about 260 action steps, and the space headways are homogenized afterwards. The 'MP + TSP' strategy is also tested on these two unstable scenarios; the space trajectories of the buses are shown in Fig. 18. In both scenarios, the bus bunching is not resolved by the signals. These results indicate that: i) the proposed model is capable of handling multiple buses traveling on the same link without other interventions, and ii) systematically prioritizing buses may not be appropriate under bus bunching, as it preserves the bunched state throughout.
In this scenario, the competition between two crossing bus lines and car traffic takes place within one network. To further test the transferability of the trained models, the models for the signals of the horizontal artery (except the central one) are all taken from scenario 2. The agent models for the other traffic signals in this network require specific training since they involve new connections with their neighborhood. Table 6 reports the test results for the RL-based methods and the benchmark strategies (random seed = 15000). 'RL – only traffic' and 'RL' represent our approach considering only car traffic in the reward and our approach with the tradeoff, respectively. Compared to the best performance among the benchmark strategies, 'RL – only traffic' achieves a 30.87% decrease in average queue length. Fig. 19 shows the total queue length of each method along the simulation. 'RL' sacrifices some of the improvement in car traffic performance to reach the tradeoff, but it still outperforms the benchmarks on average queue length. Meanwhile, the standard deviation of bus headways is significantly improved for both bus lines: compared to the best benchmark performance, the decreases in the standard deviation of time headway are 58.45% and 43.13% for bus lines 1 and 2, respectively. Fig. 20 and Fig. 21 display the space and time trajectories of the two bus lines. The performance of both bus lines is guaranteed as there is no trajectory crossing. Thus, the transferability of the proposed approach is promising.
When buses from both directions request priority (a green phase) simultaneously, there is no difference between the bus transit rewards of the two actions. Therefore, the agent chooses the action that yields the larger car traffic and cooperative reward to maximize the global reward. To illustrate how the central agent deals with multiple conflicting bus lines, the trajectories of two buses simultaneously approaching the central intersection are retrieved and plotted in Fig. 22 below. Several such situations occurred in the simulation; we randomly chose one to display. The yellow (respectively blue) line represents the distance between the W-E (resp. N-S) traveling bus and the intersection. A distance of 0 indicates that the bus has arrived at the intersection. The time steps highlighted by red circles denote when the two buses pass through the intersection. In this figure, the bus in the W-E direction passes the intersection first, while the other one is held by the signal.
Fig. 11. Box plots of the bus headway distributions in different control strategies: (a) space headway and (b) time headway.
Table 5
Testing results of each control method in scenario 2.
Control method Space headway Time headway Average queue length (vehs)
Note: the notation 'RL - Cn' reflects the critical speed n used to define the bus's state in the proposed RL model.
Fig. 13. Bus trajectories in the proposed strategy (RL - C3.47): (a) space trajectories and (b) time trajectories.
Fig. 14. Bus trajectories in the 'Max pressure + TSP' strategy: (a) space trajectories and (b) time trajectories.
Fig. 15. Bus trajectories in the 'Fixed + TSP' strategy: (a) space trajectories and (b) time trajectories.
Fig. 16. Space trajectories of buses on dedicated bus lanes. Left: 8 buses, 0–1400 action steps; right: 4 buses, 0–800 action steps.
Fig. 17. Space trajectories of buses on mixed traffic lanes. Left: 0–1400 action steps; right: 0–600 action steps.
Fig. 18. Bus space trajectories of 'MP + TSP' on (a) dedicated bus lanes and (b) mixed traffic lanes.
Table 6
Testing results of each control method in scenario 3.
Control method; Space headway for bus lines 1 and 2: Average (m), Standard deviation; Time headway for bus lines 1 and 2: Average (s), Standard deviation; Average queue length (vehs)
RL – only traffic 1637.78 1221.50 903.71 532.06 265.75 195.55 233.99 80.97 170.38
RL 243.88 187.96 288.65 36.14 251.02 54.58 240.50
Fixed 876.33 664.63 278.54 144.28 233.04 95.98 246.46
Max pressure 538.16 679.75 330.82 86.99 351.56 142.76 277.01
Longest queue first 651.14 799.76 307.17 113.81 266.66 134.41 251.06
Fig. 19. Total queue length of each control method along the simulation in scenario 3.
Fig. 20. Bus trajectories of bus line 1: (a) space trajectories and (b) time trajectories.
Fig. 21. Bus trajectories of bus line 2: (a) space trajectories and (b) time trajectories.
4. Conclusion
This study proposes a MARL framework for traffic signal control in a multi-modal network consisting of private car traffic and bus transit. The decentralized agents combine bus priority and holding control with traffic signal control. A Deep Q-Network is applied to address the continuous state space. The crucial concept of the proposed framework is to homogenize bus headways via traffic signals without reducing traffic efficiency. The tradeoff between car traffic-related and bus transit-related rewards is discussed based on numerical experiments. The proposed model is tested in various scenarios, including different bus lane layouts (dedicated bus lanes or mixed traffic lanes) and bus line deployments (a single bus line or multiple crossing bus lines). In the configuration with buses driving in mixed traffic, the agent's performance strongly depends on the critical speed setting, which defines the signal control distance for buses. The transferability and scalability are demonstrated with several stochastic-demand tests by applying learned agents to similarly configured intersections and different scales of networks without retraining. The decentralized method performs far better on traffic delay and bus headway control than model-based adaptive control methods and the centralized RL method.
In future work, the agent performance should be improved with more advanced RL algorithms (e.g., Double DQN, Dueling DQN, and Dueling Double DQN), since promising performance has been verified by applying these algorithms in traffic and bus control (Han et al., 2022; Li et al., 2022; Long et al., 2022; Qi et al., 2019). Furthermore, the proposed approach should be investigated in non-closed-loop bus line systems with heterogeneous passenger demand. Further exploration of scenarios involving multiple bus lines sharing an artery is warranted. Grouping several signals into one agent will be explored to address the city-scale signal control problem. Finally, we need to further explore the proposed method's portability by transferring this agent design to intersections with different configurations and traffic demand patterns. The robustness of the proposed method also needs to be
tested with a full range of demand levels. A relevant perspective would be to explore the notion of similarity between trained agents according to their features related to traffic demand and intersection configuration, and then to define the range within which a trained agent can be reused at a slightly different intersection configuration.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
J. Yu acknowledges funding from the China Scholarship Council. L. Leclercq acknowledges funding from the European Union’s Horizon 2020 research and innovation program under Grant Agreement no. 953783 (DIT4TraM).
Appendix. Pseudocode of the DQN training process

Initialize the total training episodes M, the max decision step T in each episode
Initialize the number of training epochs P, the replay memory D, and the minibatch size
Initialize the parameters θ of the action-value function Q
for episode = 1 to M do
  for t = 1 to T do
    Retrieve state s_t
    With probability ε select a random action a_t,
    otherwise select a_t = argmax_a Q(s_t, a; θ)
    Update the environment to s_{t+1} and feed back reward r_t
    Store transition (s_t, a_t, r_t, s_{t+1}) in D
  end for
  for epoch = 1 to P do
    Sample a random minibatch of transitions (s_j, a_j, r_j, s_{j+1}) from D
    Set y_j = r_j for terminal s_{j+1}, and y_j = r_j + γ max_{a'} Q(s_{j+1}, a'; θ) for non-terminal s_{j+1}
    Perform a gradient descent step on (y_j − Q(s_j, a_j; θ))² according to Eq. (3)
  end for
end for
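For readers who prefer code to pseudocode, the following minimal Python sketch mirrors the training loop above using PyTorch. The environment interface (env.reset/env.step), the QNetwork class, and all hyperparameter values are illustrative assumptions; the actual agents in this work interact with SUMO and use the state, action, and reward definitions described in the methodology. As in the listing above, a single network is updated without a separate target network.

# Minimal sketch of the training loop above (not the authors' implementation).
# Assumptions: a Gym-style env with reset()/step(), states as flat float vectors,
# and a small PyTorch Q-network; hyperparameter values are placeholders.
import random
from collections import deque

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Feed-forward approximator of Q(s, a; theta); layer sizes are arbitrary.
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, s):
        return self.net(s)

def train(env, state_dim, n_actions, M=100, T=200, P=50,
          batch_size=32, gamma=0.99, eps=0.1, lr=1e-3):
    q = QNetwork(state_dim, n_actions)
    optimizer = torch.optim.Adam(q.parameters(), lr=lr)
    memory = deque(maxlen=50000)  # replay memory D

    for episode in range(M):
        s = env.reset()
        for t in range(T):
            # Epsilon-greedy action selection.
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    a = q(torch.as_tensor(s, dtype=torch.float32)).argmax().item()
            s_next, r, done, _ = env.step(a)
            memory.append((s, a, r, s_next, float(done)))  # store transition in D
            s = s_next
            if done:
                break

        # Learn from replayed transitions for P epochs.
        for epoch in range(P):
            if len(memory) < batch_size:
                break
            batch = random.sample(memory, batch_size)
            s_b, a_b, r_b, s2_b, d_b = zip(*batch)
            s_b = torch.as_tensor(s_b, dtype=torch.float32)
            a_b = torch.as_tensor(a_b, dtype=torch.int64)
            r_b = torch.as_tensor(r_b, dtype=torch.float32)
            s2_b = torch.as_tensor(s2_b, dtype=torch.float32)
            d_b = torch.as_tensor(d_b, dtype=torch.float32)

            # y_j = r_j for terminal s_{j+1}, r_j + gamma * max_a' Q(s_{j+1}, a') otherwise.
            with torch.no_grad():
                y = r_b + gamma * (1.0 - d_b) * q(s2_b).max(dim=1).values
            q_sa = q(s_b).gather(1, a_b.unsqueeze(1)).squeeze(1)
            loss = nn.functional.mse_loss(q_sa, y)  # (y_j - Q(s_j, a_j; theta))^2

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return q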
References
Abdoos, M., 2020. A cooperative multiagent system for traffic signal control using game theory and reinforcement learning. IEEE Intell. Transp. Syst. Mag. 13 (4),
6–16.
Abul-Magd, A.Y., 2007. Modeling highway-traffic headway distributions using superstatistics. Phys. Rev. E 76 (5), 057101.
Alvarez Lopez, P., Behrisch, M., Bieker-Walz, L., Erdmann, J., Flötteröd, Y.-P., Hilbrich, R., Lücken, L., Rummel, J., Wagner, P., Wießner, E., 2018. Microscopic traffic simulation using SUMO. The 21st IEEE International Conference on Intelligent Transportation Systems. IEEE, Maui, USA, pp. 2575–2582.
Ampountolas, K., Kring, M., 2021. Mitigating bunching with bus-following models and bus-to-bus cooperation. IEEE Trans. Intell. Transp. Syst. 22 (5), 2637–2646.
Arel, I., Liu, C., Urbanik, T., Kohls, A.G., 2010. Reinforcement learning-based multi-agent system for network traffic signal control. IET Intel. Transport Syst. 4 (2),
128–135.
Berrebi, S.J., Hans, E., Chiabaut, N., Laval, J.A., Leclercq, L., Watkins, K.E., 2018. Comparing bus holding methods with and without real-time predictions. Transp.
Res. Part C: Emerging Technol. 87, 197–211.
Buşoniu, L., Babuška, R., De Schutter, B., 2010. Multi-agent reinforcement learning: an overview. In: Srinivasan, D., Jain, L.C. (Eds.), Innovations in Multi-Agent Systems and Applications - 1. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 183–221.
Casas, N., 2017. Deep deterministic policy gradient for urban traffic light control. arXiv preprint arXiv:1703.09035.
Chen, Y., Chen, C., Wu, Q., Ma, J., Zhang, G., Milton, J., 2021. Spatial-temporal traffic congestion identification and correlation extraction using floating car data.
J. Intell. Transp. Syst. 25 (3), 263–280.
Chen, X., Lin, X., Li, M., He, F., 2022. Network-level control of heterogeneous automated traffic guaranteeing bus priority. Transp. Res. Part C: Emerging Technol. 140,
103671.
Chin, Y.K., Kow, W.Y., Khong, W.L., Tan, M.K., Teo, K.T.K., 2012. Q-learning traffic signal optimization within multiple intersections traffic network. 2012 Sixth UKSim/AMSS European Symposium on Computer Modeling and Simulation. IEEE, pp. 343–348.
Chow, A.H.F., Su, Z.C., Liang, E.M., Zhong, R.X., 2021. Adaptive signal control for bus service reliability with connected vehicle technology via reinforcement
learning. Transp. Res. Part C: Emerging Technol. 129, 103264.
Chu, T., Wang, J., Codecà, L., Li, Z., 2019. Multi-agent deep reinforcement learning for large-scale traffic signal control. IEEE Trans. Intell. Transp. Syst. 21 (3),
1086–1095.
Daganzo, C.F., 2009. A headway-based approach to eliminate bus bunching: systematic analysis and comparisons. Transp. Res. B Methodol. 43 (10), 913–921.
Ding, J., Yang, M., Wang, W., Xu, C., Bao, Y., 2015. Strategy for multiobjective transit signal priority with prediction of bus dwell time at stops. Transp. Res. Rec. 2488
(1), 10–19.
Edie, L.C., 1963. Discussion of traffic stream measurements and definitions. Port of New York Authority, New York.
Ekeila, W., Sayed, T., Esawey, M.E., 2009. Development of dynamic transit signal priority strategy. Transp. Res. Rec. 2111 (1), 1–9.
Fadlullah, Z.M., Tang, F., Mao, B., Kato, N., Akashi, O., Inoue, T., Mizutani, K., 2017. State-of-the-art deep learning: Evolving machine intelligence toward tomorrow’s
intelligent network traffic control systems. IEEE Commun. Surv. Tutorials 19 (4), 2432–2455.
Gao, J., Shen, Y., Liu, J., Ito, M., Shiratori, N., 2017. Adaptive traffic signal control: Deep reinforcement learning algorithm with experience replay and target network.
arXiv preprint arXiv:1705.02755.
Genders, W., Razavi, S., 2016. Using a deep reinforcement learning agent for traffic signal control. arXiv preprint arXiv:1611.01142.
Ghanim, M.S., Abu-Lebdeh, G., 2015. Real-time dynamic transit signal priority optimization for coordinated traffic networks using genetic algorithms and artificial
neural networks. J. Intell. Transp. Syst. 19 (4), 327–338.
Gregoire, J., Qian, X., Frazzoli, E., De La Fortelle, A., Wongpiromsarn, T., 2014. Capacity-aware backpressure traffic signal control. IEEE Trans. Control Network Syst.
2 (2), 164–173.
Han, Y., Hegyi, A., Zhang, L., He, Z., Chung, E., Liu, P., 2022. A new reinforcement learning-based variable speed limit control approach to improve traffic efficiency
against freeway jam waves. Transp. Res. Part C: Emerging Technol. 144, 103900.
Hans, E., Chiabaut, N., Leclercq, L., Bertini, R.L., 2015. Real-time bus route state forecasting using particle filter and mesoscopic modeling. Transp. Res. Part C:
Emerging Technol. 61, 121–140.
Kirchner, M., Schubert, P., Haas, C.T., 2014. Characterisation of real-world bus acceleration and deceleration signals. J. Signal and Information Processing 5, 42694.
Korecki, M., Helbing, D., 2022. Analytically guided machine learning for green IT and fluent traffic. IEEE Access 10, 96348–96358.
Laskaris, G., Seredynski, M., Viti, F., 2020. Enhancing bus holding control using cooperative ITS. IEEE Trans. Intell. Transp. Syst. 21 (4), 1767–1778.
Le, T., Kovács, P., Walton, N., Vu, H.L., Andrew, L.L., Hoogendoorn, S.S., 2015. Decentralized signal control for urban road networks. Transp. Res. Part C: Emerging
Technol. 58, 431–450.
Leclercq, L., Chiabaut, N., Trinquier, B., 2014. Macroscopic fundamental diagrams: a cross-comparison of estimation methods. Transp. Res. B Methodol. 62, 1–12.
Lee, S., Kim, Y., Kahng, H., Lee, S.-K., Chung, S., Cheong, T., Shin, K., Park, J., Kim, S.B., 2020. Intelligent traffic control for autonomous vehicle systems based on
machine learning. Expert Syst. Appl. 144, 113074.
Levin, M.W., Hu, J., Odell, M., 2020. Max-pressure signal control with cyclical phase structure. Transp. Res. Part C: Emerging Technol. 120, 102828.
Li, L., Lv, Y., Wang, F.-Y., 2016. Traffic signal timing via deep reinforcement learning. IEEE/CAA J. Automatica Sinica 3 (3), 247–254.
Li, G., Yang, Y., Li, S., Qu, X., Lyu, N., Li, S.E., 2022. Decision making of autonomous vehicles in lane change scenarios: Deep reinforcement learning approaches with
risk awareness. Transp. Res. Part C: Emerging Technol. 134, 103452.
Li, Z., Yu, H., Zhang, G., Dong, S., Xu, C.-Z., 2021. Network-wide traffic signal control optimization using a multi-agent deep reinforcement learning. Transp. Res. Part
C: Emerging Technol. 125, 103059.
Little, J.D., 1966. The synchronization of traffic signals by mixed-integer linear programming. Oper. Res. 14, 568–594.
Liu, Y., Wang, D., 2012. Minimum time headway model by using safety space headway. World Automation Congress 2012. IEEE, pp. 1–4.
Lo, H.K., Chang, E., Chan, Y.C., 2001. Dynamic network traffic control. Transp. Res. A Policy Pract. 35 (8), 721–744.
Long, M., Zou, X., Zhou, Y., Chung, E., 2022. Deep reinforcement learning for transit signal priority in a connected environment. Transp. Res. Part C: Em. Technol.
142, 103814.
Ma, W., Yang, X., 2007. A passive transit signal priority approach for bus rapid transit system. 2007 IEEE Intelligent Transportation Systems Conference, pp. 413–418.
Ma, Y., Chiu, Y., Yang, X., 2009. Urban traffic signal control network automatic partitioning using Laplacian eigenvectors. 2009 12th International IEEE Conference on Intelligent Transportation Systems, pp. 1–5.
Ma, W., Wan, L., Yu, C., Zou, L., Zheng, J., 2020. Multi-objective optimization of traffic signals based on vehicle trajectory data at isolated intersections. Transp. Res. Part C: Emerging Technol. 120, 102821.
Malik, S., Anwar, U., Aghasi, A., Ahmed, A., 2021. Inverse constrained reinforcement learning. International Conference on Machine Learning. PMLR, pp. 7390–7399.
Mannion, P., Duggan, J., Howley, E., 2016. An experimental review of reinforcement learning algorithms for adaptive traffic signal control. In: McCluskey, T.L.,
Kotsialos, A., Müller, J.P., Klügl, F., Rana, O., Schumann, R. (Eds.), Autonomic Road Transport Support Systems. Springer International Publishing, Cham,
pp. 47–66.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M., 2013. Playing atari with deep reinforcement learning. arXiv preprint
arXiv:1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., 2015. Human-level control
through deep reinforcement learning. Nature 518 (7540), 529–533.
Mohebifard, R., Al Islam, S.B., Hajbabaie, A., 2019. Cooperative traffic signal and perimeter control in semi-connected urban-street networks. Transp. Res. Part C: Em.
Technol. 104, 408–427.
Nagatani, T., 2001. Bunching transition in a time-headway model of a bus route. Phys. Rev. E 63 (3), 036115.
Ni, Y.-C., Lo, H.-H., Hsu, Y.-T., Huang, H.-J., 2022. Exploring the effects of passive transit signal priority design on bus rapid transit operation: a microsimulation-
based optimization approach. Transp. Lett. 14 (1), 14–27.
Prashanth, L., Bhatnagar, S., 2010. Reinforcement learning with function approximation for traffic signal control. IEEE Trans. Intell. Transp. Syst. 12 (2), 412–421.
Qi, X., Luo, Y., Wu, G., Boriboonsomsin, K., Barth, M., 2019. Deep reinforcement learning enabled self-learning control for energy efficient driving. Transp. Res. Part
C: Em. Technol. 99, 67–81.
Seredynski, M., Khadraoui, D., 2014. Complementing transit signal priority with speed and dwell time extension advisories. 17th International IEEE Conference on Intelligent Transportation Systems (ITSC). IEEE, pp. 1009–1014.
Sun, W., Schmöcker, J.-D., Fukuda, K., 2021. Estimating the route-level passenger demand profile from bus dwell times. Transp. Res. Part C: Em. Technol. 130,
103273.
Sun, X., Yin, Y., 2018. A simulation study on max pressure control of signalized intersections. Transp. Res. Rec. 2672 (18), 117–127.
Sutton, R.S., Barto, A.G., 2018. Reinforcement learning: an introduction. MIT Press.
Varaiya, P., 2013. Max pressure control of a network of signalized intersections. Transp. Res. Part C: Em. Technol. 36, 177–195.
Vidali, A., 2021. Deep Q-Learning Agent for Traffic Signal Control. GitHub, https://github.com/AndreaVidali/Deep-QLearning-Agent-for-Traffic-Signal-Control.
Wang, T., Cao, J., Hussain, A., 2021b. Adaptive Traffic signal control for large-scale scenario with cooperative group-based multi-agent reinforcement learning.
Transp. Res. Part C: Em. Technol. 125, 103046.
Wang, J., Sun, L., 2020. Dynamic holding control to avoid bus bunching: a multi-agent deep reinforcement learning framework. Transp. Res. Part C: Emerging
Technol. 116, 102661.
Wang, Q., Yuan, Y., Yang, X.T., Huang, Z., 2021a. Adaptive and multi-path progression signal control under connected vehicle environment. Transp. Res. Part C: Em.
Technol. 124, 102965.
Webster, F.V., 1958. Traffic signal settings. Road Research Laboratory, London, U.K.
Wu, J., Abbas-Turki, A., Correia, A., Moudni, A.E., 2007. Discrete intersection signal control. 2007 IEEE International Conference on Service Operations and Logistics, and Informatics, pp. 1–6.
Wu, C., Kreidieh, A., Parvate, K., Vinitsky, E., Bayen, A.M., 2017. Flow: Architecture and benchmarking for reinforcement learning in traffic control. arXiv preprint
arXiv:1710.05465 10.
Wunderlich, R., Liu, C., Elhanany, I., Urbanik, T., 2008. A novel signal-scheduling algorithm with quality-of-service provisioning for an isolated intersection. IEEE
Trans. Intell. Transp. Syst. 9 (3), 536–547.
Xu, T., Barman, S., Levin, M.W., Chen, R., Li, T., 2022b. Integrating public transit signal priority into max-pressure signal control: methodology and simulation study
on a downtown network. Transp. Res. Part C: Em. Technol. 138, 103614.
Xu, L., Xu, J., Qu, X., Jin, S., 2022a. An origin-destination demands-based multipath-band approach to time-varying arterial coordination. IEEE Trans. Intell. Transp.
Syst. https://doi.org/10.1109/TITS.2022.3150977.
Yang, X., Cheng, Y., Chang, G.-L., 2015. A multi-path progression model for synchronization of arterial traffic signals. Transp. Res. Part C: Emerging Technol. 53, 93–111.
Yang, S., Yang, B., Wong, H.-S., Kang, Z., 2019. Cooperative traffic signal control using multi-step return and off-policy asynchronous advantage actor-critic graph
algorithm. Knowl.-Based Syst. 183, 104855.
Ye, Z., Wang, K., Chen, Y., Jiang, X., Song, G., 2022. Multi-UAV navigation for partially observable communication coverage by graph reinforcement learning. IEEE
Trans. Mob. Comput.
Yin, Y., 2008. Robust optimal traffic signal timing. Transp. Res. B Methodol. 42 (10), 911–924.
Yu, H., Ma, R., Zhang, H.M., 2018. Optimal traffic signal control under dynamic user equilibrium and link constraints in a general network. Transp. Res. B Methodol.
110, 302–325.
Yu, H., Liu, P., Fan, Y., Zhang, G., 2021. Developing a decentralized signal control strategy considering link storage capacity. Transp. Res. Part C: Emerging Technol.
124, 102971.
Zhang, Y., Clavera, I., Tsai, B., Abbeel, P., 2019b. Asynchronous methods for model-based reinforcement learning. arXiv preprint arXiv:1910.12453.
Zhang, H., Feng, S., Liu, C., Ding, Y., Zhu, Y., Zhou, Z., Zhang, W., Yu, Y., Jin, H., Li, Z., 2019a. CityFlow: a multi-agent reinforcement learning environment for large-scale city traffic scenario. The World Wide Web Conference, pp. 3620–3624.
Zhang, C., Xie, Y., Gartner, N.H., Stamatiadis, C., Arsava, T., 2015. AM-band: an asymmetrical multi-band model for arterial traffic signal coordination. Transp. Res.
Part C: Em. Technol. 58, 515–531.
Zhang, L., Yin, Y., Chen, S., 2013. Robust signal timing optimization with environmental concerns. Transp. Res. Part C: Emerging Technol. 29, 55–71.
Zhao, D., Dai, Y., Zhang, Z., 2012. Computational intelligence in urban traffic signal control: a survey. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 42 (4), 485–494.
Zlatkovic, M., Stevanovic, A., Martin, P.T., 2012. Development and evaluation of algorithm for resolution of conflicting transit signal priority requests. Transp. Res.
Rec. 2311 (1), 167–175.
Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V., 2018. Learning transferable architectures for scalable image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710.