Transportation Research Part C: Jiajie Yu, Pierre-Antoine Laharotte, Yu Han, Ludovic Leclercq
Keywords: Traffic Signal Control; Bus Holding; Multi-Modal Network; Deep Reinforcement Learning; Artificial Neural Network

Abstract
Managing traffic flow at intersections in a large-scale network remains challenging. Multi-modal signalized intersections integrate various objectives, including minimizing the queue length and maintaining constant bus headways. Inefficient traffic signal and bus headway control strategies may cause severe traffic jams, high delays for bus passengers, and bus bunching that harms bus line operations. To simultaneously improve the level of service for car traffic and the bus system in a multi-modal network, this paper integrates bus priority and holding with traffic signal control via decentralized controllers based on Reinforcement Learning (RL). The controller agents act and learn from a synthetic traffic environment built with the microscopic traffic simulator SUMO. Action information is shared among agents to achieve cooperation, forming a Multi-Agent Reinforcement Learning (MARL) framework. The agents simultaneously aim to minimize vehicles' total stopping time and to homogenize the forward and backward space headways of buses approaching intersections at each decision step. The Deep Q-Network (DQN) algorithm is applied to manage the continuity of the state space. The tradeoff between the bus transit and car traffic objectives is discussed using various numerical experiments. The introduced method is tested in scenarios with distinct bus lane layouts and bus line deployments. The proposed controller outperforms model-based adaptive control methods and a centralized RL method regarding global traffic efficiency and bus transit stability. Furthermore, the remarkable scalability and transferability of the trained models are demonstrated by applying them to several different test networks without retraining.
1. Introduction
To avoid traffic conflicts, traffic signal control efficiently allocates green times to the different vehicle movements at a signalized intersection (Wu et al., 2007). Inadequate traffic signal control strategies may cause severe traffic congestion and, further, wasted energy and exhaust pollution (Zhao et al., 2012). Optimal traffic signal strategies at the network level with multi-modal objectives remain challenging because of complexity and scalability issues (Wang et al., 2021b). Many model-based and model-free strategies have been explored and developed to cope with traffic signal coordination. Centralized control methods manage signals to match an overall goal but are often limited in scale and reduce the coordination ability to a local set of intersections (Ma et al., 2009; Wang et al.,
2021a; Yu et al., 2018). For example, Maxband and its extensions (Little, 1966; Xu et al., 2022a; Yang et al., 2015; Zhang et al., 2015) introduce coordination at the arterial level by enlarging the green wave bandwidth, effectively decreasing vehicle stop times. Performance-based algorithms aim to improve the network's global performance indicators (e.g., total delay, total queue length, and average vehicle speed) over a specific period. Some consider multiple objectives, environmental concerns, and robustness targets (Ma et al., 2020; Mohebifard et al., 2019; Yin, 2008; Zhang et al., 2013). These models are usually formulated as mixed-integer programs and their derivations, whose complexity increases exponentially with the number of signals and control periods. Therefore, such strategies are efficient for coordinating several intersections offline based on daily traffic patterns. Still, the effort required to solve such an optimization problem in terms of computation time (especially for a large number of traffic signals) is incompatible with real-time applications (Chu et al., 2019).
Decentralized methods are more robust and easily scalable, but coordination should be carefully addressed (Le et al., 2015; Yu et al., 2021). For example, the max pressure method prioritizes the phase with the maximum pressure (the maximal difference between upstream and downstream queues) to accommodate the real-time traffic demand (Varaiya, 2013). The max pressure algorithm is particularly appealing at the network scale as it overcomes the computational complexity: its structure is fully decentralized, and each signal is regarded as an independent agent (Levin et al., 2020; Varaiya, 2013). The capacity of each link is regarded as unlimited in the original max pressure method (Sun and Yin, 2018). Gregoire et al. (2014) and Yu et al. (2021) reformulated the pressure expression to account for link capacity and achieved better stabilization in a simulated environment. However, these strategies still lack consideration of agent coordination and long-term system performance. Thus, the global benefit of the control strategy may deteriorate when applied to a large network over an extensive period (Korecki and Helbing, 2022).
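To make the decentralized character of this rule concrete, the following minimal sketch selects the phase with the maximum pressure from per-movement queue counts; all function and identifier names are illustrative assumptions and do not refer to any cited implementation.

def phase_pressure(phase_movements, queue):
    # phase_movements: list of (upstream_lane, downstream_lane) pairs served by the phase
    # queue: dict mapping lane id -> current queue length (vehicles)
    return sum(queue[up] - queue[down] for up, down in phase_movements)

def max_pressure_phase(phases, queue):
    # phases: dict mapping phase index -> list of served (upstream, downstream) movements
    # returns the index of the phase with the largest total pressure
    return max(phases, key=lambda p: phase_pressure(phases[p], queue))

# Example: two phases at one intersection, queues measured in vehicles
queues = {"N_in": 8, "S_in": 6, "E_in": 3, "W_in": 2,
          "N_out": 1, "S_out": 0, "E_out": 4, "W_out": 2}
phases = {0: [("N_in", "S_out"), ("S_in", "N_out")],
          1: [("E_in", "W_out"), ("W_in", "E_out")]}
print(max_pressure_phase(phases, queues))  # -> 0 (the north-south phase has the larger pressure)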
Recent developments in Artificial Intelligence make this trend even more promising, as machine learning techniques can be applied to learn optimal local policies from observations, while a distributed framework introduces cooperation between local agents through information sharing (Abdoos, 2020; Li et al., 2021). Several attempts have been made to apply machine learning techniques to traffic signal control (Fadlullah et al., 2017; Lee et al., 2020; Wu et al., 2017), mainly using Reinforcement Learning (RL). The agent in the RL framework tends to choose, at each step, the control action that leads to the optimal long-term performance based on the system's real-time state (Sutton and Barto, 2018; Wang et al., 2021b). Artificial Neural Networks (ANNs) are commonly combined with the RL framework to increase the agents' learning efficiency and accommodate continuous state spaces (Mnih et al., 2013; Prashanth and Bhatnagar, 2010; Sutton and Barto, 2018). Encouraging results have been obtained compared to pre-timed signal controllers and model-based control strategies, particularly under variable traffic conditions (Casas, 2017; Chin et al., 2012; Gao et al., 2017; Genders and Razavi, 2016; Li et al., 2016; Yang et al., 2019).
Similar to the computational burden issue of centralized model-based control strategies, the state-action space increases exponentially with the number of signals in a centralized RL framework, leading to scalability and learning convergence challenges for the agents (Chu et al., 2019). Therefore, the Multi-Agent Reinforcement Learning (MARL) framework is commonly used to address these issues. The MARL framework allows multiple agents to act simultaneously in a shared environment. The decentralized agents in MARL observe partial information from the environment, simplifying the state and action spaces compared to a centralized agent (Arel et al., 2010). Chu et al. (2019) presented a signal control strategy with MARL and tested it in an extended synthetic grid network and a real-world traffic network; the results demonstrate the robustness and efficiency of the proposed algorithm. Zhang et al. (2019a) developed a traffic simulator, CityFlow, which can build a MARL environment for large-scale city traffic scenarios. Abdoos (2020) proposed a cooperative MARL framework for network signal control, which integrates game theory and Q-learning to provide a more cooperative traffic signal control strategy than the one generated by Q-learning alone. Communication among agents, i.e., sharing state or action information, is encouraged to foster cooperation and improve global indicators. Li et al. (2021) proposed a signal control method with knowledge sharing among all agents in the MARL framework: each agent contributes experiences to and accesses a shared knowledge container of the traffic environment. The knowledge-sharing agents accelerate the learning convergence and improve the traffic efficiency of large-scale networks compared to non-communicating agents. Wang et al. (2021b) introduced a novel way to define the agent in their MARL framework for traffic signal control: each agent represents a group of signals, and a region-aware cooperative strategy incorporating the spatial information of the surrounding agents is computed. Each agent decides whether the group of signals needs to perform a green wave. Better traffic performance is obtained compared with existing algorithms in large-scale networks.
However, the aforementioned studies focus on car traffic only and do not account for bus line operations. Indeed, several transport modes co-exist in urban areas and may receive different priority levels at intersections. Among others, the transit service has specific objectives, such as maintaining constant bus headways or fulfilling timetables, that might be difficult to achieve without adequate priority at traffic signals. Ineffective bus headway control may lead to bus bunching and further delays for passengers (Daganzo, 2009). Thus, investigating cooperative and decentralized traffic signal control strategies considering multiple transportation modes is crucial. Accounting for multiple and possibly competing objectives is a critical challenge when designing the signal control strategy.
To take bus transit into account in signal timing, Transit Signal Priority (TSP) is a widely explored strategy. Transit priority can be pre-set based on the bus timetable, known as passive TSP, which lacks robustness (Ni et al., 2022). Active and adaptive TSP combine real-time bus information with signal timing to enhance the control of buses. The combination of fixed control and TSP guarantees buses priority for passing the intersection through green time extension, green phase rotation, or green phase splitting based on real-time bus information (Ma and Yang, 2007). Transit arrival and bus dwell time prediction strategies have been developed to improve TSP performance (Ding et al., 2015; Ekeila et al., 2009; Ghanim and Abu-Lebdeh, 2015). With emerging signal control strategies and real-time traffic state detection techniques, TSP has been integrated with more advanced adaptive signal control strategies. Xu et al. (2022b) integrated TSP with max pressure, prioritizing the incoming lane with a bus in a max pressure control background. A numerical simulation of a network equipped with dedicated bus lanes suggests that the method reduces bus travel time without breaking the stability of the control compared to the original max pressure. Chen et al. (2022) combined bus priority with rhythmic
control. The control framework is designed for a fully automated vehicle environment: the controller simultaneously guides automated vehicles along conflict-free time-space trajectories computed to handle any movement at the intersection. Both studies reduce bus delay via traffic light controllers. Long et al. (2022) proposed a TSP strategy based on Deep Reinforcement Learning (DRL) that deals with multiple conflicting bus priority requests. They extended Dueling Double Deep Q-learning (D3QN) for their algorithm, achieving faster convergence and lower average person delay than other RL benchmarks and active TSP strategies in a single-intersection simulation environment. Nevertheless, systematically allocating priority to buses is not necessarily the best option, as it might favor bus bunching when an early bus joins a late one. Therefore, homogenizing headways is crucial to avoid such phenomena and minimize passengers' waiting times at bus stops, thus enhancing bus service efficiency and reliability (Wang and Sun, 2020).
Compared to bus priority, bus holding control is a more effective way to equalize bus headways when bus bunching occurs (Berrebi et al., 2018; Hans et al., 2015). It adjusts bus headways by holding buses at stops when necessary (Laskaris et al., 2020). The squared coefficient of variation of headways and the mean holding time are typical direct indicators of bus control performance (Berrebi et al., 2018). Wang and Sun (2020) proposed a MARL framework to implement a dynamic bus holding control strategy at bus stops. Each bus is regarded as a decentralized agent that aims to minimize the weighted sum of the forward and backward headway difference and the bus holding time. Simulation tests show promising results for a one-way bus corridor with uniformly distributed bus stops. Since bus holding control at bus stops may not be well perceived by passengers, holding can be achieved silently at traffic signals if early buses are not granted priority. To combine bus holding control with signal timing, Chow et al. (2021) applied DRL to adaptive signal control considering bus service reliability. They developed a centralized controller to manage traffic delays and bus headway control synchronously. Compared to fixed signal control integrated with a TSP strategy, improvements in vehicle travel time and bus headways are obtained in a macroscopic traffic environment with a one-way bus corridor. However, due to the centralized nature of Chow's model, scalability and transferability are not guaranteed. Furthermore, using a macroscopic model to feed the DRL framework tends to smooth the traffic indicators and might not capture the fluctuations observed in field conditions. Refinements of the above methodologies need to be explored to achieve more efficient (for both car traffic and bus transit) and scalable methods for controlling traffic signals in a multi-modal network.
In conclusion, the main limitations of existing studies are:
• Traffic signal control strategies considering TSP are well explored with emerging techniques. However, bus bunching cannot be effectively resolved by TSP alone; it requires bus holding control. Studies on signal control integrating TSP and bus holding control remain limited, especially in a decentralized formulation.
• DRL has achieved significant improvements in both signal control and bus control. However, the training and testing networks are usually identical in most existing studies, which cannot properly demonstrate the scalability (the adaptability of the proposed agent design to larger-scale networks) and transferability (the applicability of trained agents in various test environments without retraining) of the model, although these are crucial properties of RL agents (Ye et al., 2022; Zoph et al., 2018).
• Regarding bus control strategies, most studies lack either a fully defined traffic environment (e.g., car traffic disturbances, two-way bus lines, multiple conflicting bus lines, and mixed traffic lanes) or a sufficient number of intersections. The disturbances from car traffic and diverse bus line operations have to be better accounted for in bus control schemes.
This paper proposes a decentralized traffic signal control model based on DRL, where agents cooperate through communication with their immediate neighborhood. The control is designed for a multi-modal network consisting of private car traffic and bus transit. Vehicle accumulations and queue lengths on all intersection legs, as well as bus positions and speeds, are required for real-time operation. The model simultaneously minimizes the traffic delay and the bus headway variations for closed-loop bus lines and can accommodate different road layouts (e.g., dedicated bus lanes and mixed traffic lanes) and multiple conflicting bus lines. Scalability and transferability are demonstrated by applying trained models to other, similar intersections without retraining, reducing the training cost in an extensive transportation system. A broad benchmark against the most representative methods is performed in numerical experiments using the microscopic traffic simulator SUMO (Alvarez Lopez et al., 2018). The proposed approach shows promising performance.
The remainder of this paper is organized as follows: Section 2 describes the DRL algorithm used in this paper and formalizes the
agent design in detail. Section 3 discusses the tradeoff between car traffic and bus transit in the agent’s reward and compares the
proposed method with benchmarks via several numerical experiments. The conclusions and perspectives are summarized in Section 4.
2. Methodology
To achieve scalable and decentralized traffic signal control, we set each signalized intersection in the multi-modal network as an agent that learns strategies offline or inherits a trained model from other agents. The dataset of available trained models is built by training the agents in various small-scale networks. For a more extensive or distinct testing network, each signalized intersection is paired with a trained model from the dataset based on the proximity of the intersection configuration and neighbor connections. One trained model can be reused for several agents. If no suitable match is found, a new model is trained specifically for that intersection, as illustrated in Fig. 1. $Ma_i$ in Fig. 1 is the label of trained model i, and $Ma_{new}$ represents a model newly trained for that signalized intersection.
The traffic signal control process for each intersection is represented by an agent that dynamically triggers one phase. The optimal signal timing plan is supposed to reduce the overall car traffic delay and the variance of bus headways in the long term. The timing plan needs to be evaluated and updated after each decision taken by the agent controlling the traffic signal. This interaction between the traffic light and the traffic environment follows Markov Decision Processes (MDPs) (Mannion et al., 2016). In MDPs, the agent estimates the value of each action for a given state and selects the optimal one (Sutton and Barto, 2018), which matches the signal controller's behavior. In this study, the real-time car traffic and bus headway information is retrieved at the beginning of each decision step. The agent chooses an action depending only on the current state, which satisfies the Markov property. In corridor- and network-level signal control, a group of signalized agents acts in a shared environment, each pursuing its individual goal, forming the MARL framework (Buşoniu et al., 2010). Communication among agents is used for better coordination.
To build the multi-modal traffic signal control method in a MARL framework, as shown in Fig. 2, the environment consists of the intersections, the road network, car traffic flow, and bus transit. Each signalized intersection is regarded as an agent. In this framework, the state observations consist of real-time traffic and bus information and the last actions of the traffic lights. The same variables are used to define the reward function, i.e., total stopping time, occupancy, and bus headways. In this paper, we set the reward to minimize the cumulative number of stopped vehicles on all incoming legs of the intersection and to homogenize (keep constant) the headways between buses. The tradeoff between the car traffic-related objectives and bus bunching avoidance is defined by the parameter c.
We adopt a Deep Q-Network (DQN) to achieve an action-value function approximation in this study (Mnih et al., 2013; Mnih et al., 2015). The MDPs can be defined by $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state space and action space, respectively, $\mathcal{P}: \mathcal{S} \times \mathcal{A} \mapsto \mathcal{S}$ is the state transition function, $\mathcal{R}$ represents the reward function, and $\gamma$ is the discount factor for future rewards. The future discounted reward at decision step t is defined as $R_t = \sum_{\hat{t}=t}^{T} \gamma^{\hat{t}-t} r_{\hat{t}}$, where T is the maximum time step and $r_{\hat{t}}$ is the immediate return at decision step $\hat{t}$. An ANN is used to learn a mapping from states to Q-values in the learning process of DQN. The agent tends to select actions that maximize the expected cumulative discounted future reward. The optimal action value of agent i is given by Eq. (1):

$$Q_i^*(s, a) = \max_{\pi} \mathbb{E}\left[r_{i,t} + \gamma r_{i,t+1} + \gamma^2 r_{i,t+2} + \cdots \mid S_{i,t} = s,\ a_{i,t} = a,\ \pi\right] \tag{1}$$

In tabular Q-learning, the action value is updated at each decision step with the temporal-difference rule of Eq. (2):

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right] \tag{2}$$

where $\alpha$ is the learning rate of agents, $(s, a)$ is the state-action pair, and $(s', a')$ denotes the state-action pair at the next decision step.
In order to retrieve random samples for the agent to learn the action-value function, the state, action, reward, and next-step state are recorded at every update. The memory of agent i at decision step t is labeled $\mathcal{D}_{i,t} = \{e_{i,1}, \ldots, e_{i,t}\}$, where $e_{i,t} = (S_{i,t-1}, a_{i,t-1}, r_{i,t-1}, S_{i,t})$, to perform experience replay. For each learning phase, a batch of samples is drawn randomly from the total memory and
used to compute the temporal difference error. The loss function adopts the formulation given below:
$$L_k(\theta_k) = \mathbb{E}_{(s,a,r,s') \sim U(M)}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_k^-) - Q(s, a; \theta_k)\right)^2\right] \tag{3}$$

where k denotes the kth iteration, and $\theta_k$ and $\theta_k^-$ are the parameters of the Q-network and the target network, respectively, at iteration k. The loss function $L_k(\theta_k)$ is minimized to reduce the deviation between the target and the current Q-value.
An ANN is traditionally used to estimate the action-value function since it achieves a nonlinear function approximation (Sutton and Barto, 2018). An ANN consists of an input layer, interconnected hidden layers, and an output layer. The structure of the ANN in this paper, shown in Fig. 3, follows the method developed in Vidali (2021) due to the similar state and action dimensions. The ANN approximates the optimal action-value function by minimizing the temporal difference error. In the test process, the agent uses the ANN in operating mode: it predicts the expected reward of each action based on the state via the fitted action-value function to find the optimal action at each decision step. The details regarding the learning process are summarized in Appendix A.
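To illustrate how the action-value approximation and the loss of Eq. (3) can be implemented, the sketch below (PyTorch, with illustrative layer sizes and names; the actual architecture follows Vidali (2021) and is not reproduced here) defines a small fully connected Q-network and performs one gradient step on a replayed minibatch. In practice, the target network is a periodically synchronized copy of the Q-network, and actions are selected epsilon-greedily from the predicted Q-values.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Maps a state vector to one Q-value per predefined green phase.
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.95):
    # batch: tensors (states, actions, rewards, next_states) sampled uniformly from the replay memory;
    # actions is an int64 tensor of shape [batch_size]
    states, actions, rewards, next_states = batch
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)           # Q(s, a; theta_k)
    with torch.no_grad():
        target = rewards + gamma * target_net(next_states).max(dim=1).values  # r + gamma * max_a' Q(s', a'; theta_k^-)
    loss = nn.functional.mse_loss(q_sa, target)                               # squared temporal-difference error, Eq. (3)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()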
The agent is a traffic signal controller associated with an intersection. The agent's action consists of a single integer representing the index of a predefined green phase. According to the action index, the signal activates one of the predefined phases for the following decision step. If the chosen action differs from the last action, a 5 s yellow and all-red phase is activated. Table 1 details the notations adopted in the proposed framework.
Table 1
Notations of variables.
$d_{i,m,t}$: Control distance for buses on the mth incoming leg of agent i at decision step t
$d'_{i,m,t}$: Control distance for buses on the mth incoming leg equipped with a dedicated bus lane of agent i at decision step t
$d''_{i,m,t}$: Control distance for buses on the mth incoming leg equipped with a mixed traffic lane of agent i at decision step t
$v_{max}$: Speed limit for buses
$a_{dec}/a_{acc}$: Acceptable deceleration/acceleration for buses
$\Delta t$: Time duration of a decision step
$S_{i,t}$: State tuple of agent i at decision step t
$S^{traffic}_{i,t}/S^{transit}_{i,t}/S^{coop}_{i,t}$: State retrieved from car traffic/bus transit/cooperativeness
$M_i$: Set of all incoming legs of intersection i
$M^B_i$: Set of all incoming legs of intersection i equipped with a bus line
$M^R_{i,t}$: Set of all incoming legs of intersection i that are in red phases at decision step t
$x^s_{i,m}$: Distance between intersection i and the bus stop on the mth incoming leg; $x^s_{i,m} = \infty$ if there is no stop on the mth incoming leg of intersection i
$x^b_{i,m,t}$: Distance between intersection i and the bus on the mth incoming leg at time step t; $x^b_{i,m,t} = 0$ if there is no bus on the mth incoming leg of intersection i
$v_{i,m,t}$: Speed of the bus closest to intersection i on the mth incoming leg at time step t
$v_{cri}$: Critical speed to measure the buses' state
$D_{i,t}$: Summation of the total stopping time of all vehicles on the incoming legs of intersection i during decision step t
$O_{i,t}$: Set of occupancies of all incoming legs of intersection i at time step t
$O_{i,m,t}$: Occupancy of the mth incoming leg of intersection i at time step t
$O_{cri}$: Critical occupancy for the reward definition
$n^m_{i,t}$: Number of stopped vehicles on the mth incoming leg of intersection i at time step t
$h^f_{i,m,t}/h^b_{i,m,t}$: Forward/backward space headway of the bus on the mth incoming leg of intersection i at time step t
$r_{i,t}$: Total reward of agent i at decision step t
$r^{traffic}_{i,t}/r^{agent}_{i,t}/r^{transit}_{i,t}$: Reward received from car traffic/the agent's action/bus transit of agent i at decision step t
$a_{i,t}$: Action of agent i at decision step t
$y_{i,t}$: Number of yellow phases during the last ten actions of agent i at decision step t
$g^{bus}_{i,m}$: Bus phase on the mth incoming leg of intersection i
$y$: Critical occurrence count of yellow and all-red phases among the last ten actions
$c$: Weight of the bus transit reward
In the table, t denotes the index of the decision step; each decision step lasts 5 s in the proposed model. t' denotes the index of the simulation step, each lasting 1 s, so there are five simulation steps within each decision step. During each simulation step, the stopping time equals the number of stopped vehicles, and the sum over all simulation steps within a decision step gives the total stopping time of vehicles for that decision step.
The number of switches, $y_{i,t}$, is an integer variable ranging from 0 to 10, which monitors how frequently the controller switches phases. If the frequency is high, a large amount of the available capacity is lost to excessive yellow and all-red phases. In real-world signal settings, the lower bound of the cycle length is typically 50 to 60 s. Therefore, we chose a span of 10 steps, lasting 50 s, to be consistent with real-world signal settings. It is important for controllers to consider this information in their decision process as it is not reflected in any other state variable.
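For concreteness, a minimal sketch of how one decision step could be executed in SUMO through the TraCI interface is given below. Only the traci calls are part of the standard SUMO API; all other names, the signal state strings, and the assumption that the 5 s transition precedes the decision step are illustrative and not the authors' implementation.

from collections import deque
import traci  # SUMO TraCI Python client

switch_history = deque(maxlen=10)   # 1 if a yellow/all-red transition occurred, else 0

def run_decision_step(tls_id, incoming_edges, action, last_action, green_states, yellow_states):
    # green_states / yellow_states: signal state strings per action index (illustrative placeholders)
    switched = int(action != last_action)
    switch_history.append(switched)
    if switched:
        traci.trafficlight.setRedYellowGreenState(tls_id, yellow_states[last_action])
        for _ in range(5):                       # 5 s yellow + all-red transition (assumed to precede the step)
            traci.simulationStep()
    traci.trafficlight.setRedYellowGreenState(tls_id, green_states[action])
    stopping_time = 0
    for _ in range(5):                           # one decision step = five 1 s simulation steps
        traci.simulationStep()
        stopping_time += sum(traci.edge.getLastStepHaltingNumber(e) for e in incoming_edges)
    y_it = sum(switch_history)                   # number of switches among the last ten actions
    return stopping_time, y_it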
The bus transit state $S^{transit}_{i,t} = \left[h^f_{i,m,t}, h^b_{i,m,t}\right], \forall m \in M^B_i$, gathers the real-time forward and backward space headways of the bus on the mth incoming leg of intersection i at decision step t. All incoming legs with bus lines need to be detected. For example, if two incoming legs of an intersection are equipped with the bus line, there are two sets of forward and backward space headways in this agent's state. Note that the optimal bus service is usually obtained when time headways are homogeneous. However, the previous buses' arrival data would be required to calculate the time headways, which could violate the Markov property. Bus control can also be applied by monitoring space headways, since there is a strong correlation and consistent performance between time headway and space headway (Abul-Magd, 2007; Ampountolas and Kring, 2021; Liu and Wang, 2012; Nagatani, 2001), and the space headway can be detected in real time. Therefore, the bus control strategy here aims to equalize the forward and backward space headways of all buses approaching intersections.
If the bus is too far from the signal, the signal's latest action does not contribute to the bus service, so the agent should act regardless of the bus state. When the bus's distance to the traffic signal is below the control distance (for bus transit), $d_{i,m,t}$, the agent has to find the tradeoff between car traffic-related and bus transit-related objectives by activating the reward for bus transit. The control distance $d_{i,m,t}$ is a parameter defining the maximal distance within which agent i needs to consider the impact of incoming buses when taking the following action.
The calculations of the control distance on dedicated bus lanes and mixed traffic lanes differ; they are denoted $d'_{i,m,t}$ and $d''_{i,m,t}$, respectively. The control distance for dedicated bus lanes is the maximum of the distance needed for a bus to decelerate to a halt at an acceptable deceleration and the distance covered by a bus moving at the maximum speed during one decision step. Assuming the bus performs a uniform deceleration:

$$d'_{i,m,t} = \max\left\{\frac{v_{max}^2}{2a_{dec}},\ v_{max}\,\Delta t\right\} \tag{5}$$

For the mixed traffic lane, the control distance is defined according to the speed of the bus, as Eq. (6) displays. If the bus moves at a low speed or is queueing, the action of the signal can influence the queue and hence the bus; no matter how far the bus is from the intersection, the control distance then equals the distance between the bus and the intersection. If the bus is running at a regular or high speed, the control distance follows Eq. (5). The control distance in the mixed traffic lane is given by:

$$d''_{i,m,t} = \begin{cases} x^b_{i,m,t}, & v_{i,m,t} < v_{cri} \\[4pt] \max\left\{\dfrac{v_{max}^2}{2a_{dec}},\ v_{max}\,\Delta t\right\}, & v_{i,m,t} \geqslant v_{cri} \end{cases} \tag{6}$$

For example, if a bus travels on a mixed traffic lane at a low speed (compared to the critical speed) and is 250 m from the intersection, the control distance is 250 m. If the bus travels at a regular speed, the control distance is calculated with the second line of Eq. (6), which yields a much smaller value than 250 m.
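A minimal sketch of Eqs. (5) and (6) follows; the parameter values are illustrative placeholders, not the values used in the experiments (those are given in Table 3).

V_MAX = 13.9      # speed limit for buses (m/s), illustrative
A_DEC = 1.5       # acceptable deceleration (m/s^2), illustrative
DELTA_T = 5.0     # duration of a decision step (s)

def control_distance_dedicated():
    # Eq. (5): braking distance from v_max versus distance travelled at v_max in one decision step
    return max(V_MAX ** 2 / (2 * A_DEC), V_MAX * DELTA_T)

def control_distance_mixed(bus_distance, bus_speed, v_cri):
    # Eq. (6): a slow (queued) bus is always within control; otherwise fall back to Eq. (5)
    if bus_speed < v_cri:
        return bus_distance
    return control_distance_dedicated()

# Example from the text: slow bus 250 m upstream -> control distance 250 m
print(control_distance_mixed(250.0, 2.0, v_cri=3.47))   # 250.0
print(control_distance_mixed(250.0, 8.0, v_cri=3.47))   # about 69.5 m with these illustrative parameters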
If the bus is outside the control distance or there is no bus on the incoming leg, the forward and backward space headways are set to 0. When several buses are on the same leg, only the state of the bus closest to the signal is collected. To exclude the situation where a dwelling bus is regarded as queueing, we define the critical position at which the bus, departing from a stop, speeds up to $v_{cri}$. If the bus's distance to the traffic signal is larger than this critical position, the bus state is set to 0. Thus, if $x^b_{i,m,t} > x^s_{i,m} - \frac{v_{cri}^2}{2a_{acc}}$, then $S^{transit}_{i,t} = [0, 0]$. $S^{transit}_{i,t}$ can be summarized as Eq. (7):

$$S^{transit}_{i,t} = \begin{cases} \left[h^f_{i,m,t},\ h^b_{i,m,t}\right], & x^b_{i,m,t} < \min\left\{d_{i,m,t},\ x^s_{i,m} - \dfrac{v_{cri}^2}{2a_{acc}}\right\} \\[6pt] [0, 0], & \text{otherwise} \end{cases} \quad \forall m \in M^B_i \tag{7}$$
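Continuing the previous sketch, the transit-state entry of one incoming leg can be assembled as below; control_distance_mixed and the acceleration value are the illustrative helpers and placeholders introduced above, and a leg with a dedicated bus lane would use Eq. (5) directly.

A_ACC = 1.0   # acceptable acceleration (m/s^2), illustrative

def transit_state_leg(bus_distance, bus_speed, forward_headway, backward_headway,
                      stop_distance, v_cri):
    # Eq. (7): report headways only for a bus that is inside the control distance
    # and past the position where a bus leaving the stop reaches v_cri.
    if bus_distance == 0:                       # no bus on the leg
        return [0.0, 0.0]
    d = control_distance_mixed(bus_distance, bus_speed, v_cri)
    critical_position = stop_distance - v_cri ** 2 / (2 * A_ACC)
    if bus_distance < min(d, critical_position):
        return [forward_headway, backward_headway]
    return [0.0, 0.0]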
The cooperativeness state $S^{coop}_{i,t} = \left[a_{i-1,t-1}, a_{i,t-1}, a_{i+1,t-1}\right]$ consists of two components: the last action of agent i, $a_{i,t-1}$, and the set of actions taken by the immediate neighborhood of agent i. There are two neighbors ($a_{i-1,t-1}$, $a_{i+1,t-1}$) in single-arterial scenarios and four in network or multi-arterial scenarios.
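As a purely illustrative sketch, the full state tuple of an agent could then be concatenated as follows; the exact composition of the car-traffic component (here, the stopping time and the occupancies of the incoming legs, following Table 1) is an assumption made for illustration.

def agent_state(stopping_time, occupancies, transit_states, last_actions):
    # stopping_time: D_{i,t} accumulated over the last decision step
    # occupancies: list of O_{i,m,t} for all incoming legs (assumed car-traffic component)
    # transit_states: [h^f, h^b] pairs for all legs carrying a bus line, from Eq. (7)
    # last_actions: [a_{i-1,t-1}, a_{i,t-1}, a_{i+1,t-1}] cooperativeness component
    state = [stopping_time] + list(occupancies)
    for pair in transit_states:
        state += list(pair)
    state += list(last_actions)
    return state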
The car traffic reward is based on the total stopping time of vehicles on all incoming legs. Therefore, the total stopping times of the current decision step and of the last step are compared to define the reward $r^{traffic'}_{i,t}$. The reward of occupancy on the mth incoming leg, $r^{traffic''}_{i,m,t}$, has to be calculated for all incoming legs to which a red phase was assigned during the last decision step. The reward is penalized cumulatively for the incoming legs with an occupancy larger than the critical one. Eqs. (8) and (9) describe the calculation of the two rewards.

$$r^{traffic'}_{i,t} = \begin{cases} 1, & D_{i,t} < D_{i,t-1} \\ -1, & D_{i,t} \geqslant D_{i,t-1} \end{cases} \tag{8}$$

$$r^{traffic''}_{i,m,t} = \begin{cases} 0, & O_{i,m,t} < O_{cri} \\ -1, & O_{i,m,t} \geqslant O_{cri} \end{cases} \tag{9}$$
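A compact sketch of Eqs. (8) and (9), with illustrative variable names:

def traffic_rewards(stop_time, prev_stop_time, red_leg_occupancies, o_cri):
    # Eq. (8): +1 if the total stopping time decreased, -1 otherwise
    r_stop = 1 if stop_time < prev_stop_time else -1
    # Eq. (9): -1 for every red-phase leg whose occupancy reaches the critical value
    r_occ = sum(0 if occ < o_cri else -1 for occ in red_leg_occupancies)
    return r_stop, r_occ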
Agents try to avoid wasting green time caused by frequent phase switches. Thus, the reward from the agent's action is given by Eq. (10), where y is a predefined integer (0 < y < 10) representing the acceptable number of phase switches over 10 steps. To match the real-world setting, the value of y can be chosen as the number of green phases (which equals the number of yellow phases) in one fixed control cycle. Since agents need to switch phases in time to respond to real-time traffic, there is no positive reward that encourages agents to switch less frequently.

$$r^{agent}_{i,t} = \begin{cases} 0, & y_{i,t} \leqslant y \\ -1, & y_{i,t} > y \end{cases} \tag{10}$$
For bus transit, $r^{transit}_{i,t} = \sum_{m \in M^B_i} r^{transit}_{i,m,t}$. If the forward space headway of a bus is larger than the backward one, i.e., $h^f_{i,m,t} > h^b_{i,m,t}$, the bus needs to be prioritized to shorten the forward headway and thus equalize the forward and backward headways. In this case, the transit reward is positive if the agent gives green priority to the bus's incoming lane in the following decision step. Similarly, when $h^f_{i,m,t} < h^b_{i,m,t}$, the bus needs to be held to equalize the forward and backward headways. The reward from the bus system can be summarized as Eq. (11), where $g^{bus}_{i,m}$ is the bus phase on the mth incoming leg of intersection i. For example, if a bus travels from West to East and the phase that gives green to this movement is labeled Phase 1, then $g^{bus}_{i,m} = 0$, since action 0 activates Phase 1. Since the bus line route and the signal phases are predefined, $g^{bus}_{i,m}$ is a known constant for each intersection.

$$r^{transit}_{i,m,t} = \begin{cases} 1, & h^f_{i,m,t-1} > h^b_{i,m,t-1} \text{ and } a_{i,t-1} = g^{bus}_{i,m} \\ 1, & h^f_{i,m,t-1} < h^b_{i,m,t-1} \text{ and } a_{i,t-1} \neq g^{bus}_{i,m} \\ 0, & h^f_{i,m,t-1} = h^b_{i,m,t-1} \\ -1, & \text{otherwise} \end{cases} \tag{11}$$
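The components above are combined into the agent's total reward. The exact combination equation is not reproduced here; the sketch below assumes a sum in which the transit term is scaled by the parameter c, which is consistent with the reward ranges reported in Section 3.2 (car traffic in [-3, 1] and bus transit in [-3, 3] for c = 3).

def total_reward(r_stop, r_occ, r_agent, transit_terms, c):
    # transit_terms: list of r^transit_{i,m,t} values, one per incoming leg with a bus line (Eq. (11))
    # Assumed combination: traffic terms plus the c-weighted sum of transit terms.
    return r_stop + r_occ + r_agent + c * sum(transit_terms)

# Example: stopping time decreased (+1), one congested red leg (-1), no switch penalty (0),
# one bus correctly prioritized (+1), with c = 3  ->  reward 3
print(total_reward(1, -1, 0, [1], c=3))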
Fig. 5. Structure of intersections and predefined phases in scenario 2. (a) Structure of intersections, (b) predefined phase 1, (c) predefined phase 2.
3. Numerical experiments
Our numerical experiments are based on three scenarios with various road and bus line configurations. Multiple traffic control strategies are tested and compared across these scenarios. They were selected because of their proximity to our approach or their wide use in the literature and field operation: fixed-time control, the longest queue first rule, max pressure, max pressure and fixed-time control combined with TSP, and the centralized RL controller of Chow et al. (2021).
The analysis is performed according to three scenarios implemented in the SUMO simulation framework (Alvarez Lopez et al., 2018). Scenarios differ in:
• The bus lane layout: dedicated bus lane or mixed traffic lane,
• The bus-line deployment: one bus line or multiple (crossing) bus lines.
The first scenario is further used to calibrate the tradeoff parameter c. The scalability and transferability of trained models are
demonstrated in the first and last scenarios by applying existing models without retraining.
Scenario 1: One bus line artery with dedicated bus lanes. This scenario includes one bus line driving on dedicated bus lanes. Buses are not affected by the car traffic conditions, only by the traffic lights.
To highlight the scalability and transferability of the strategies learned by the agents, we trained and tested them on distinct networks with similar intersection configurations. There are five signalized intersections in the training artery (Fig. 4(a)) and ten in the test artery (Fig. 4(b)). The numbers displayed in the network figures are the link lengths. Due to the decentralized framework, five models (labeled $Ma_1$ to $Ma_5$ from West to East) are obtained after training. They are transferred to the test network and applied to agents managing signals with similar road configurations. Specifically, $Ma_1$, $Ma_2$, $Ma_4$, and $Ma_5$ are assigned to the first and last two intersections of the test artery, while all the others reuse $Ma_3$.
The sketch of each intersection and the predefined signal phases are shown in Fig. 4(c), (d), and (e). There are two lanes in each direction on the main road (one being the dedicated bus lane), one lane on the crossroads, and two optional phases for the signal. Straight movements are always prioritized over left-turn movements in the same phase: left-turn vehicles need to wait for gaps in the straight flows to pass through the intersection. In the training process, four buses travel in loops on the artery. All buses take a U-turn at the artery terminal and continue their route in the opposite direction. Six bus stops are located along the main artery, as shown in Fig. 4(a). In the testing network, the number of buses is increased to eight, and there are ten bus stops, as shown in Fig. 4(b).
Fig. 4. Network and intersection structure in scenario 1. (a) Training artery, (b) testing artery, (c) structure of intersections, (d) predefined phase 1, (e) predefined phase 2.
Scenario 2: One bus line artery with mixed traffic lanes. This scenario is similar to scenario 1, except that the dedicated bus lanes are converted into mixed traffic lanes. The buses are affected by the surrounding car traffic conditions and the signal lights.
The training and test networks are the same as the 5-signal artery in scenario 1. However, the intersection pattern is adjusted to the mixed traffic lane. The traffic demand is adjusted to maintain a realistic demand profile, which is detailed in Section 3.1.2. The intersection structure and predefined phases are displayed in Fig. 5. The trained models are labeled Ma1' to Ma5' from West to East and are transferred to agents in scenario 3.
Fig. 6. Network and intersection structure in scenario 3. (a) Training and testing network; structure of intersections (b) in the horizontal artery and (c) in the vertical artery.
Scenario 3: Two bus lines network with mixed traffic lanes. This scenario mimics an urban network of two crossing corridors with bus lines on mixed traffic lanes. The agents must simultaneously deal with the car traffic conditions and with buses located on various legs. It introduces a new agent model accounting for two crossing bus lines whose buses arrive from four incoming legs, while in the previous scenarios agents were dealing with buses coming from two opposite legs.
The training and test networks, including the bus stop information, are shown in Fig. 6(a). There are two mixed traffic lanes on both main arteries and one on all side roads. Fig. 6(b) and 6(c) show the detailed structure of the intersections. There is one bus line on each artery, so buses come from all directions at the central intersection. Bus line 1 runs along the horizontal artery, and bus line 2 along the vertical one.
Table 2
Traffic demand in scenarios 1, 2, and 3.
Scenario Direction Traffic demand (pcu/h)
* Random with a 10% interval higher or lower than the value displayed.
Note: for any flow direction at all intersections, 5%-10% of the total displayed demand is assigned to left or right turns.
Table 3
Parameters in RL-based strategy.
Parameter Value
Several random seeds are used to generate different traffic demands. A wide random range for the demand of each side road is set to ensure that different demand patterns can be generated during training. Since the range of side road demand is wide, the general demand level also varies. All simulations last 7200 s. The departure interval of buses is 192 s in all networks of the three scenarios.
3.2. Calibration of parameter c: Finding a tradeoff between car traffic and bus service with scenario 1
This section seeks a suitable tradeoff between car traffic-related and bus-related objectives, modeled by the parameter c. Consequently, we trained and tested different c values (ranging from 1 to 5) and two extreme cases with scenario 1. In the two extreme cases, only the reward of car traffic or of bus transit is considered, denoted 'RL – only traffic' and 'RL – only bus', respectively. For each setting, agents were trained for 70 episodes, and the sensitivity to the c value was tested on the resulting models. The reward curves of all agents in each training model are shown in Fig. 7. Models obtained at episodes 50, 60, and 70 were all saved for testing. Models trained for 50 episodes consistently perform satisfactorily among these strategies. Therefore, early stopping at 50 episodes, a common form of regularization used when the final model does not provide the best performance due to overfitting (Malik et al., 2021; Zhang et al., 2019b), is applied in this study.
Fig. 7. Total reward of each agent along the training episodes in RL-based models with (a) only traffic, (b) c = 1, (c) c = 2, (d) c = 3, (e) c = 4, and (f) only transit.
We compare the average queue length and the standard deviation of bus space/time headways among these cases. The results are shown in Table 4. The time headway is the time gap between two successive buses arriving at the same bus stop. We set different random seeds to generate various traffic demands for testing. The notation 'RL - n' refers to the proposed RL model with c = n. The average space headway in Table 4 is always 1516 m because buses travel in a loop and the space headway between the first and last buses is also considered. Thus, the average space headway is always the total length of a round trip divided by the number of buses, also denoted as the nominal headway.
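For instance, with the eight buses of the scenario 1 test artery, this nominal headway corresponds to a round-trip length of roughly

$$L_{loop} = N_{bus} \times h_{nominal} = 8 \times 1516\ \text{m} \approx 12.1\ \text{km}.$$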
A larger c value forces the model to put more weight on the bus than on traffic performance in the reward; thus, the car traffic performance might be impacted. Fig. 8 compares the total queue length of all control strategies along the simulation. A continuous increase in total queue length is observed from 'RL – only traffic' to 'RL - n' and then 'RL – only bus'. According to Table 4, when c is set to 3, car traffic and bus performances appear well balanced in all three tests; for other c values, only the car traffic or the bus performance is satisfactory. We therefore choose c = 3 in the subsequent sections. With this setting, the car traffic and bus transit rewards range from -3 to 1 and from -3 to 3, respectively: bus transit has the same weight as car traffic in the negative reward range and a larger weight in the positive range.
Table 4
Testing results of each control method in scenario 1.
Random seed Control method Space headway Time headway Average queue length (vehs)
We further compare the performance of our approach with the benchmark approaches. Table 4 displays the results for various simulation seeds in scenario 1. Three different seeds are tested to rule out coincidental performance of the proposed model. The average queue length of 'RL - 3' is always shorter than that of the benchmark approaches. On average over the three seeds, 'RL - 3' decreases the average queue length by 30.91%, 19.59%, and 6.80% compared to fixed control, longest queue first, and max pressure, respectively. According to Fig. 8, the max pressure method performs slightly better than 'RL - 3' during the first half of the simulation. When the demand from the side roads increases during the second half of the simulation, 'RL - 3' outperforms max pressure and obtains a better global performance.
'RL – 3' also outperforms the benchmark strategies in headway control. The space and time headways of the control methods in the test with random seed = 25000 are shown in Fig. 9 and Fig. 10. Fig. 9 displays the travel distance of each bus during the simulation, and Fig. 10 presents the arrival time of each bus at each bus stop. Buses travel in loops on the artery, so all bus stops are passed several times by all buses. The bus stop value on the x-axis of Fig. 10 is therefore the cumulative number of stops that buses have reached. In Fig. 9 and Fig. 10, two lines getting closer or even crossing correspond to bus bunching, which is highlighted with red circles in the space headway figures. This phenomenon is observed in the trajectories of buses 1 and 2 and buses 4 and 5 in the longest queue first approach, and of buses 1 and 2 and buses 7 and 0 in the max pressure strategy. On the contrary, the RL-based methods effectively prevent buses from bunching.
Fig. 9. Travel distance of buses along the simulation steps in scenario 1. (a) Longest queue first, (b) max pressure, (c) RL – 3, (d) RL – only bus.
Fig. 10. Arrival time of buses at each stop in scenario 1. (a) Longest queue first, (b) max pressure, (c) RL – 3, (d) RL – only bus.
Fig. 11 displays the distribution of bus headways for each control strategy. The space and time headways are more concentrated around the average value in our RL-based models than in the benchmark strategies. This observation is consistent with Table 4: the standard deviations of headways in 'RL-3' and 'RL – only bus' are smaller than those of the benchmark approaches. We calculate the percentage of small headways (less than 50% of the nominal headway) for the control methods: 17.64%, 16.12%, 6.96%, and 3.03% for longest queue first, max pressure, RL – 3, and RL – only bus, respectively. The proposed approaches provide a significant improvement. The scalability and transferability of the proposed method are thus verified by the test performance.
In this scenario, since buses drive in mixed traffic and might be affected by car traffic conditions, choosing an appropriate critical speed for buses is essential. According to Eq. (6), the critical speed directly determines the speed state and the control distance for buses. We tested different values, 25% (3.47 m/s) and 50% (6.95 m/s) of the free-flow speed, as well as values around them (2 m/s and 5 m/s), as the critical speed in scenario 2 (Chen et al., 2021). They are denoted 'RL – C3.47', 'RL – C6.95', 'RL – C2', and 'RL – C5', respectively.
Fig. 8. Comparisons of queue length along the simulation in scenario 1 when (a) random seed = 15000, (b) random seed = 20000, and (c) random
seed = 25000.
The test results for each model in scenario 2 are presented in Table 5 (random seed = 15000). The performance of the RL-based control strategies varies significantly with the critical speed. Overall, the models with critical speeds of 3.47 m/s and 5 m/s outperform the other RL-based models. Based on the space headway deviation, which is the direct optimization indicator in our model, the model with a critical speed of 3.47 m/s exhibits the best performance in both headway control and traffic delay.
In this scenario, the centralized RL model has a goal similar to ours: improving car traffic and bus performance through traffic signal control (Chow et al., 2021). That model is built on a macroscopic traffic flow environment with the Cell Transmission Model (CTM). Since the traffic flow environment in our model is microscopic, Edie's definitions are applied to the trajectory data to ensure a proper estimation of the density and outflow fed to the centralized RL model (Edie, 1963; Leclercq et al., 2014).
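As a reminder, Edie's generalized definitions estimate density and flow from the trajectory pieces observed in a space-time region of area L x T: density is the total time spent divided by the area, and flow is the total distance traveled divided by the area. A minimal sketch, with illustrative names:

def edie_density_flow(trajectory_pieces, region_length, region_duration):
    # trajectory_pieces: list of (time_spent, distance_travelled) per vehicle inside the region
    area = region_length * region_duration                   # |A| = L * T (m * s)
    total_time = sum(t for t, _ in trajectory_pieces)        # veh * s
    total_distance = sum(d for _, d in trajectory_pieces)    # veh * m
    density = total_time / area        # veh/m
    flow = total_distance / area       # veh/s
    return density, flow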
The signal strategies ’MP + TSP’ and ’Fixed + TSP’ prioritize the bus phase whenever a bus is on the incoming legs, regardless of
other traffic conditions. However, if the frequency of buses is high, this can lead to poor performance in average queue length. In this
scenario, four buses simultaneously travel through the 5-signal artery, and more than two signals need to activate TSP throughout the
simulation. The average queue length of these two strategies is heavily affected. Therefore, we focus solely on comparing the bus
control performances of ’MP + TSP’ and ’Fixed + TSP’ with other strategies.
According to the results in Table 5 and Fig. 12, the performance of the centralized RL model in our environment is acceptable but still falls short of the results reported in the original paper. This is a common issue for centralized RL methods, which lack transferability and scalability, leading to poor convergence when the model is applied to a different scenario. Moreover, the agent of the centralized RL model was formulated in a macroscopic traffic environment in the original paper, ignoring the disturbances and fluctuations of a microscopic, real-world traffic environment. These factors may explain why the model underperforms its initial results and the other benchmarks in our microscopic environment.
Regarding car traffic performance, 'RL - C3.47' obtains better results than the other models, reducing the average queue length by 9.30% compared to the best benchmark performance. The max pressure and longest queue first control methods underperform fixed control on average queue length because their signal phases switch frequently, resulting in an excessive waste of green time. The average yellow phase rates of max pressure, longest queue first, fixed control, and the proposed method are 31.46%, 31.53%, 11.11%, and 10.82%, respectively. Although the minimum and maximum green times of these control strategies are identical, the switch rates of the max pressure and longest queue first strategies remain much higher than that of the proposed method.
Regarding bus headway performance, the results in Table 5 show that ’MP + TSP’ outperforms all other strategies, reducing the
standard deviation of time headway by 8.77% compared to ’RL - C3.47′. In terms of space headway, ’RL – C3.47′ still performs the best.
’Fixed + TSP’ outperforms other benchmarks but is still inferior to the proposed method and ’MP + TSP’. Furthermore, Figs. 13 to 15
present the bus headway performance of ’RL – C3.47′, ’MP + TSP’, and ’Fixed + TSP’, respectively, showing homogenized trajectories
for both space and time.
To demonstrate the effectiveness of the proposed model in handling multiple buses on the same link, two unstable scenarios with bus bunching are tested. In the first, a 10-signal artery with dedicated bus lanes (the same as the test network in scenario 1) is simulated, with three buses departing within 30 s. The second is a 5-signal artery with mixed traffic lanes (the same as the test network in scenario 2), with two buses departing within 10 s. The space trajectories of the buses, depicted in Figs. 16 and 17 below, indicate that the proposed model can effectively resolve bus bunching and homogenize space headways in both scenarios. In the dedicated bus lane scenario, bus bunching is resolved after 340 action steps, and the space headways of the eight buses are homogeneous over the last 300 action steps of the simulation. In the mixed traffic lane scenario, the bus bunching is resolved after about 260 action steps, and the space headways are homogenized afterwards. The 'MP + TSP' strategy is also tested on these two unstable scenarios; the space trajectories of the buses are shown in Fig. 18. In both scenarios, the bus bunching is not resolved by the signals. These results indicate that: i) the proposed model is capable of handling multiple buses traveling on the same link without other interventions, and ii) systematically prioritizing buses may not be appropriate under bus bunching, as it preserves the bunched state throughout.
In this scenario, the competition between two crossing bus lines and car traffic takes place within one network. To further test the transferability of the trained models, the models for the signals of the horizontal artery (except the central one) are all taken from scenario 2. The agent models for the other traffic signals in this network require specific training since they involve new connections with their neighborhood. Table 6 reports the test results for the RL-based methods and the benchmark strategies (random seed = 15000). 'RL – only traffic' and 'RL' represent our approach considering only car traffic in the reward and our approach with the tradeoff, respectively. Compared to the best performance among the benchmark strategies, 'RL – only traffic' achieves a 30.87% decrease in average queue length. Fig. 19 shows the total queue length of each method along the simulation. 'RL' sacrifices some of the improvement in car traffic performance to reach the tradeoff, but it still outperforms the benchmarks on average queue length. Meanwhile, the standard deviation of bus headways is significantly improved for both bus lines: compared to the best benchmark performance, the decreases in the standard deviation of time headway are 58.45% and 43.13% for bus lines 1 and 2, respectively. Fig. 20 and Fig. 21 display the space and time trajectories of the two bus lines. The performance of both bus lines is guaranteed as there is no trajectory crossing. Thus, the transferability of the proposed approach is promising.
When buses from both directions request priority (a green phase) simultaneously, there is no difference between the bus transit rewards of the two actions. Therefore, the agent chooses the action that yields the larger car traffic and cooperative reward to maximize the global reward. To illustrate how the central agent deals with multiple conflicting bus lines, the trajectories of two buses simultaneously approaching the central intersection are retrieved and plotted in Fig. 22 below. Several such situations occurred in the simulation; we randomly chose one to display. The yellow (respectively blue) line represents the distance between the W-E (resp. N-S) traveling bus and the intersection. A distance of 0 indicates that the bus has arrived at the intersection. The time steps highlighted by red circles denote when the two buses pass through the intersection. In this figure, the bus in the W-E direction passes the intersection first, while the other one is held by the signal.
Fig. 11. Box plots of the bus headway distributions in different control strategies: (a) space headway and (b) time headway.
Table 5
Testing results of each control method in scenario 2.
Control method Space headway Time headway Average queue length (vehs)
Note: the notation 'RL - Cn' reflects the critical speed n used to define the bus's state in the proposed RL model.
Fig. 13. Bus trajectories in the proposed strategy (RL - C3.47): (a) space trajectories and (b) time trajectories.
Fig. 14. Bus trajectories in the 'Max pressure + TSP' strategy: (a) space trajectories and (b) time trajectories.
Fig. 15. Bus trajectories in the 'Fixed + TSP' strategy: (a) space trajectories and (b) time trajectories.
Fig. 16. Space trajectories of buses on dedicated bus lanes. Left: 8 buses, 0–1400 action steps; right: 4 buses, 0–800 action steps.
Fig. 17. Space trajectories of buses on mixed traffic lanes. Left: 0–1400 action steps; right: 0–600 action steps.
Fig. 18. Bus space trajectories of 'MP + TSP' on (a) dedicated bus lanes and (b) mixed traffic lanes.
Table 6
Testing results of each control method in scenario 3.
Control method; Space headway for bus lines 1 and 2: Average (m), Standard deviation; Time headway for bus lines 1 and 2: Average (s), Standard deviation; Average queue length (vehs)
RL – only traffic 1637.78 1221.50 903.71 532.06 265.75 195.55 233.99 80.97 170.38
RL 243.88 187.96 288.65 36.14 251.02 54.58 240.50
Fixed 876.33 664.63 278.54 144.28 233.04 95.98 246.46
Max pressure 538.16 679.75 330.82 86.99 351.56 142.76 277.01
Longest queue first 651.14 799.76 307.17 113.81 266.66 134.41 251.06
Fig. 19. Total queue length of each control method along the simulation in scenario 3.
Fig. 20. Bus trajectories of bus line 1: (a) space trajectories and (b) time trajectories.
Fig. 21. Bus trajectories of bus line 2: (a) space trajectories and (b) time trajectories.
4. Conclusion
This study proposes a MARL framework for traffic signal control in a multi-modal network consisting of private car traffic and bus transit. The decentralized agents combine bus priority and holding control with traffic signal control. A Deep Q-Network is applied to address the continuous state space. The crucial concept of the proposed framework is to homogenize bus headways via traffic signals without reducing traffic efficiency. The tradeoff between car traffic-related and bus transit-related rewards is discussed based on numerical experiments. The proposed model is tested in various scenarios, including different bus lane layouts (dedicated bus lanes or mixed traffic lanes) and bus line deployments (a single bus line or multiple crossing bus lines). In the configuration with buses driving in mixed traffic, the agent's performance strongly depends on the critical speed setting, which defines the signal control distance for buses. The transferability and scalability are demonstrated with several stochastic-demand tests by applying learned agents to similarly configured intersections and different scales of networks without retraining. The decentralized method performs far better on traffic delay and bus headway control than model-based adaptive control methods and the centralized RL method.
In future work, the agent performance should be improved with more advanced RL algorithms (e.g., Double DQN, Dueling DQN, and Dueling Double DQN), since promising performance has been verified by applying these algorithms in traffic and bus control (Han et al., 2022; Li et al., 2022; Long et al., 2022; Qi et al., 2019). Furthermore, the proposed approach should be investigated in non-closed-loop bus line systems with heterogeneous passenger demand. Further exploration of scenarios involving multiple bus lines sharing an artery is warranted. Grouping several signals into one agent will be explored to address the city-scale signal control problem. Finally, we need to further explore the proposed method's portability by transferring this agent design to intersections with different configurations and traffic demand patterns. The robustness of the proposed method also needs to be
tested with a full range of demand levels. A relevant perspective would be to explore the notion of similarity between trained agents according to their features related to traffic demand and intersection configuration, and then to define the range within which a trained agent can be reused at a slightly different intersection configuration.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
J. Yu acknowledges funding from the China Scholarship Council. L. Leclercq acknowledges funding from the European Union’s Horizon 2020 research and innovation program under Grant Agreement no. 953783 (DIT4TraM).
Appendix. Pseudocode of the DQN training process

Initialize the total training episodes M, the max decision step T in each episode
Initialize the number of training epochs P, the replay memory D, and the minibatch size
Initialize the parameters θ of the action-value function Q
for episode = 1 to M do
  for t = 1 to T do
    Retrieve state s_t
    With probability ε select a random action a_t,
    otherwise select a_t = argmax_a Q(s_t, a; θ)
    Update the environment to s_{t+1} and feed back reward r_t
    Store transition (s_t, a_t, r_t, s_{t+1}) in D
  end for
  for epoch = 1 to P do
    Sample a random minibatch of transitions (s_j, a_j, r_j, s_{j+1}) from D
    Set y_j = r_j for terminal s_{j+1}, and y_j = r_j + γ max_{a'} Q(s_{j+1}, a'; θ) for non-terminal s_{j+1}
    Perform a gradient descent step on (y_j − Q(s_j, a_j; θ))² according to Eq. (3)
  end for
end for
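For readers who prefer code to pseudocode, the following minimal Python sketch mirrors the training loop above using PyTorch. The environment interface (env.reset/env.step), the QNetwork class, and all hyperparameter values are illustrative assumptions; the actual agents in this work interact with SUMO and use the state, action, and reward definitions described in the methodology. As in the listing above, a single network is updated without a separate target network.

# Minimal sketch of the training loop above (not the authors' implementation).
# Assumptions: a Gym-style env with reset()/step(), states as flat float vectors,
# and a small PyTorch Q-network; hyperparameter values are placeholders.
import random
from collections import deque

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Feed-forward approximator of Q(s, a; theta); layer sizes are arbitrary.
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, s):
        return self.net(s)

def train(env, state_dim, n_actions, M=100, T=200, P=50,
          batch_size=32, gamma=0.99, eps=0.1, lr=1e-3):
    q = QNetwork(state_dim, n_actions)
    optimizer = torch.optim.Adam(q.parameters(), lr=lr)
    memory = deque(maxlen=50000)  # replay memory D

    for episode in range(M):
        s = env.reset()
        for t in range(T):
            # Epsilon-greedy action selection.
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    a = q(torch.as_tensor(s, dtype=torch.float32)).argmax().item()
            s_next, r, done, _ = env.step(a)
            memory.append((s, a, r, s_next, float(done)))  # store transition in D
            s = s_next
            if done:
                break

        # Learn from replayed transitions for P epochs.
        for epoch in range(P):
            if len(memory) < batch_size:
                break
            batch = random.sample(memory, batch_size)
            s_b, a_b, r_b, s2_b, d_b = zip(*batch)
            s_b = torch.as_tensor(s_b, dtype=torch.float32)
            a_b = torch.as_tensor(a_b, dtype=torch.int64)
            r_b = torch.as_tensor(r_b, dtype=torch.float32)
            s2_b = torch.as_tensor(s2_b, dtype=torch.float32)
            d_b = torch.as_tensor(d_b, dtype=torch.float32)

            # y_j = r_j for terminal s_{j+1}, r_j + gamma * max_a' Q(s_{j+1}, a') otherwise.
            with torch.no_grad():
                y = r_b + gamma * (1.0 - d_b) * q(s2_b).max(dim=1).values
            q_sa = q(s_b).gather(1, a_b.unsqueeze(1)).squeeze(1)
            loss = nn.functional.mse_loss(q_sa, y)  # (y_j - Q(s_j, a_j; theta))^2

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return q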
References
Abdoos, M., 2020. A cooperative multiagent system for traffic signal control using game theory and reinforcement learning. IEEE Intell. Transp. Syst. Mag. 13 (4),
6–16.
Abul-Magd, A.Y., 2007. Modeling highway-traffic headway distributions using superstatistics. Phys. Rev. E 76 (5), 057101.
Alvarez Lopez, P., Behrisch, M., Bieker-Walz, L., Erdmann, J., Flötteröd, Y.-P., Hilbrich, R., Lücken, L., Rummel, J., Wagner, P., Wießner, E., 2018. Microscopic traffic simulation using SUMO. The 21st IEEE International Conference on Intelligent Transportation Systems. IEEE, Maui, USA, pp. 2575–2582.
Ampountolas, K., Kring, M., 2021. Mitigating bunching with bus-following models and bus-to-bus cooperation. IEEE Trans. Intell. Transp. Syst. 22 (5), 2637–2646.
Arel, I., Liu, C., Urbanik, T., Kohls, A.G., 2010. Reinforcement learning-based multi-agent system for network traffic signal control. IET Intel. Transport Syst. 4 (2),
128–135.
Berrebi, S.J., Hans, E., Chiabaut, N., Laval, J.A., Leclercq, L., Watkins, K.E., 2018. Comparing bus holding methods with and without real-time predictions. Transp.
Res. Part C: Emerging Technol. 87, 197–211.
Buşoniu, L., Babuška, R., De Schutter, B., 2010. Multi-agent reinforcement learning: an overview. In: Srinivasan, D., Jain, L.C. (Eds.), Innovations in Multi-Agent Systems and Applications - 1. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 183–221.
Casas, N., 2017. Deep deterministic policy gradient for urban traffic light control. arXiv preprint arXiv:1703.09035.
Chen, Y., Chen, C., Wu, Q., Ma, J., Zhang, G., Milton, J., 2021. Spatial-temporal traffic congestion identification and correlation extraction using floating car data.
J. Intell. Transp. Syst. 25 (3), 263–280.
Chen, X., Lin, X., Li, M., He, F., 2022. Network-level control of heterogeneous automated traffic guaranteeing bus priority. Transp. Res. Part C: Emerging Technol. 140,
103671.
Chin, Y.K., Kow, W.Y., Khong, W.L., Tan, M.K., Teo, K.T.K., 2012. Q-learning traffic signal optimization within multiple intersections traffic network. 2012 Sixth UKSim/AMSS European Symposium on Computer Modeling and Simulation. IEEE, pp. 343–348.
Chow, A.H.F., Su, Z.C., Liang, E.M., Zhong, R.X., 2021. Adaptive signal control for bus service reliability with connected vehicle technology via reinforcement
learning. Transp. Res. Part C: Emerging Technol. 129, 103264.
Chu, T., Wang, J., Codecà, L., Li, Z., 2019. Multi-agent deep reinforcement learning for large-scale traffic signal control. IEEE Trans. Intell. Transp. Syst. 21 (3),
1086–1095.
Daganzo, C.F., 2009. A headway-based approach to eliminate bus bunching: systematic analysis and comparisons. Transp. Res. B Methodol. 43 (10), 913–921.
Ding, J., Yang, M., Wang, W., Xu, C., Bao, Y., 2015. Strategy for multiobjective transit signal priority with prediction of bus dwell time at stops. Transp. Res. Rec. 2488
(1), 10–19.
Edie, L.C., 1963. Discussion of traffic stream measurements and definitions. Port of New York Authority, New York.
Ekeila, W., Sayed, T., Esawey, M.E., 2009. Development of dynamic transit signal priority strategy. Transp. Res. Rec. 2111 (1), 1–9.
Fadlullah, Z.M., Tang, F., Mao, B., Kato, N., Akashi, O., Inoue, T., Mizutani, K., 2017. State-of-the-art deep learning: Evolving machine intelligence toward tomorrow’s
intelligent network traffic control systems. IEEE Commun. Surv. Tutorials 19 (4), 2432–2455.
Gao, J., Shen, Y., Liu, J., Ito, M., Shiratori, N., 2017. Adaptive traffic signal control: Deep reinforcement learning algorithm with experience replay and target network.
arXiv preprint arXiv:1705.02755.
Genders, W., Razavi, S., 2016. Using a deep reinforcement learning agent for traffic signal control. arXiv preprint arXiv:1611.01142.
Ghanim, M.S., Abu-Lebdeh, G., 2015. Real-time dynamic transit signal priority optimization for coordinated traffic networks using genetic algorithms and artificial
neural networks. J. Intell. Transp. Syst. 19 (4), 327–338.
Gregoire, J., Qian, X., Frazzoli, E., De La Fortelle, A., Wongpiromsarn, T., 2014. Capacity-aware backpressure traffic signal control. IEEE Trans. Control Network Syst.
2 (2), 164–173.
Han, Y., Hegyi, A., Zhang, L., He, Z., Chung, E., Liu, P., 2022. A new reinforcement learning-based variable speed limit control approach to improve traffic efficiency
against freeway jam waves. Transp. Res. Part C: Emerging Technol. 144, 103900.
Hans, E., Chiabaut, N., Leclercq, L., Bertini, R.L., 2015. Real-time bus route state forecasting using particle filter and mesoscopic modeling. Transp. Res. Part C:
Emerging Technol. 61, 121–140.
Kirchner, M., Schubert, P., Haas, C.T., 2014. Characterisation of real-world bus acceleration and deceleration signals. J. Signal and Information Processing 5, 42694.
Korecki, M., Helbing, D., 2022. Analytically guided machine learning for green IT and fluent traffic. IEEE Access 10, 96348–96358.
Laskaris, G., Seredynski, M., Viti, F., 2020. Enhancing bus holding control using cooperative ITS. IEEE Trans. Intell. Transp. Syst. 21 (4), 1767–1778.
Le, T., Kovács, P., Walton, N., Vu, H.L., Andrew, L.L., Hoogendoorn, S.S., 2015. Decentralized signal control for urban road networks. Transp. Res. Part C: Emerging
Technol. 58, 431–450.
Leclercq, L., Chiabaut, N., Trinquier, B., 2014. Macroscopic fundamental diagrams: a cross-comparison of estimation methods. Transp. Res. B Methodol. 62, 1–12.
Lee, S., Kim, Y., Kahng, H., Lee, S.-K., Chung, S., Cheong, T., Shin, K., Park, J., Kim, S.B., 2020. Intelligent traffic control for autonomous vehicle systems based on
machine learning. Expert Syst. Appl. 144, 113074.
Levin, M.W., Hu, J., Odell, M., 2020. Max-pressure signal control with cyclical phase structure. Transp. Res. Part C: Emerging Technol. 120, 102828.
Li, L., Lv, Y., Wang, F.-Y., 2016. Traffic signal timing via deep reinforcement learning. IEEE/CAA J. Automatica Sinica 3 (3), 247–254.
Li, G., Yang, Y., Li, S., Qu, X., Lyu, N., Li, S.E., 2022. Decision making of autonomous vehicles in lane change scenarios: Deep reinforcement learning approaches with
risk awareness. Transp. Res. Part C: Emerging Technol. 134, 103452.
Li, Z., Yu, H., Zhang, G., Dong, S., Xu, C.-Z., 2021. Network-wide traffic signal control optimization using a multi-agent deep reinforcement learning. Transp. Res. Part
C: Emerging Technol. 125, 103059.
Little, J.D., 1966. The synchronization of traffic signals by mixed-integer linear programming. Oper. Res. 14, 568–594.
Liu, Y., Wang, D., 2012. Minimum time headway model by using safety space headway. World Automation Congress 2012. IEEE, pp. 1–4.
Lo, H.K., Chang, E., Chan, Y.C., 2001. Dynamic network traffic control. Transp. Res. A Policy Pract. 35 (8), 721–744.
Long, M., Zou, X., Zhou, Y., Chung, E., 2022. Deep reinforcement learning for transit signal priority in a connected environment. Transp. Res. Part C: Em. Technol.
142, 103814.
Ma, W., Yang, X., 2007. A passive transit signal priority approach for bus rapid transit system. 2007 IEEE Intelligent Transportation Systems Conference, pp. 413–418.
Ma, Y., Chiu, Y., Yang, X., 2009. Urban traffic signal control network automatic partitioning using Laplacian eigenvectors. 2009 12th International IEEE Conference on Intelligent Transportation Systems, pp. 1–5.
Ma, W., Wan, L., Yu, C., Zou, L., Zheng, J., 2020. Multi-objective optimization of traffic signals based on vehicle trajectory data at isolated intersections. Transp. Res. Part C: Emerging Technol. 120, 102821.
Malik, S., Anwar, U., Aghasi, A., Ahmed, A., 2021. Inverse constrained reinforcement learning. International Conference on Machine Learning. PMLR, pp. 7390–7399.
Mannion, P., Duggan, J., Howley, E., 2016. An experimental review of reinforcement learning algorithms for adaptive traffic signal control. In: McCluskey, T.L.,
Kotsialos, A., Müller, J.P., Klügl, F., Rana, O., Schumann, R. (Eds.), Autonomic Road Transport Support Systems. Springer International Publishing, Cham,
pp. 47–66.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M., 2013. Playing atari with deep reinforcement learning. arXiv preprint
arXiv:1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., 2015. Human-level control
through deep reinforcement learning. Nature 518 (7540), 529–533.
Mohebifard, R., Al Islam, S.B., Hajbabaie, A., 2019. Cooperative traffic signal and perimeter control in semi-connected urban-street networks. Transp. Res. Part C: Em.
Technol. 104, 408–427.
Nagatani, T., 2001. Bunching transition in a time-headway model of a bus route. Phys. Rev. E 63 (3), 036115.
Ni, Y.-C., Lo, H.-H., Hsu, Y.-T., Huang, H.-J., 2022. Exploring the effects of passive transit signal priority design on bus rapid transit operation: a microsimulation-
based optimization approach. Transp. Lett. 14 (1), 14–27.
Prashanth, L., Bhatnagar, S., 2010. Reinforcement learning with function approximation for traffic signal control. IEEE Trans. Intell. Transp. Syst. 12 (2), 412–421.
Qi, X., Luo, Y., Wu, G., Boriboonsomsin, K., Barth, M., 2019. Deep reinforcement learning enabled self-learning control for energy efficient driving. Transp. Res. Part
C: Em. Technol. 99, 67–81.
Seredynski, M., Khadraoui, D., 2014. Complementing transit signal priority with speed and dwell time extension advisories. 17th International IEEE Conference on Intelligent Transportation Systems (ITSC). IEEE, pp. 1009–1014.
Sun, W., Schmöcker, J.-D., Fukuda, K., 2021. Estimating the route-level passenger demand profile from bus dwell times. Transp. Res. Part C: Em. Technol. 130,
103273.
Sun, X., Yin, Y., 2018. A simulation study on max pressure control of signalized intersections. Transp. Res. Rec. 2672 (18), 117–127.
Sutton, R.S., Barto, A.G., 2018. Reinforcement learning: an introduction. MIT Press.
Varaiya, P., 2013. Max pressure control of a network of signalized intersections. Transp. Res. Part C: Em. Technol. 36, 177–195.
Vidali, A., 2021. Deep Q-Learning Agent for Traffic Signal Control. GitHub, https://github.com/AndreaVidali/Deep-QLearning-Agent-for-Traffic-Signal-Control.
Wang, T., Cao, J., Hussain, A., 2021b. Adaptive Traffic signal control for large-scale scenario with cooperative group-based multi-agent reinforcement learning.
Transp. Res. Part C: Em. Technol. 125, 103046.
Wang, J., Sun, L., 2020. Dynamic holding control to avoid bus bunching: a multi-agent deep reinforcement learning framework. Transp. Res. Part C: Emerging
Technol. 116, 102661.
Wang, Q., Yuan, Y., Yang, X.T., Huang, Z., 2021a. Adaptive and multi-path progression signal control under connected vehicle environment. Transp. Res. Part C: Em.
Technol. 124, 102965.
Webster, F.V., 1958. Traffic signal settings. Road Research Laboratory, London, U.K.
Wu, J., Abbas-Turki, A., Correia, A., Moudni, A.E., 2007. Discrete intersection signal control. 2007 IEEE International Conference on Service Operations and Logistics, and Informatics, pp. 1–6.
Wu, C., Kreidieh, A., Parvate, K., Vinitsky, E., Bayen, A.M., 2017. Flow: Architecture and benchmarking for reinforcement learning in traffic control. arXiv preprint
arXiv:1710.05465 10.
Wunderlich, R., Liu, C., Elhanany, I., Urbanik, T., 2008. A novel signal-scheduling algorithm with quality-of-service provisioning for an isolated intersection. IEEE
Trans. Intell. Transp. Syst. 9 (3), 536–547.
Xu, T., Barman, S., Levin, M.W., Chen, R., Li, T., 2022b. Integrating public transit signal priority into max-pressure signal control: methodology and simulation study
on a downtown network. Transp. Res. Part C: Em. Technol. 138, 103614.
Xu, L., Xu, J., Qu, X., Jin, S., 2022a. An origin-destination demands-based multipath-band approach to time-varying arterial coordination. IEEE Trans. Intell. Transp.
Syst. https://doi.org/10.1109/TITS.2022.3150977.
Yang, X., Cheng, Y., Chang, G.-L., 2015. A multi-path progression model for synchronization of arterial traffic signals. Transp. Res. Part C: Emerging Technol. 53, 93–111.
Yang, S., Yang, B., Wong, H.-S., Kang, Z., 2019. Cooperative traffic signal control using multi-step return and off-policy asynchronous advantage actor-critic graph
algorithm. Knowl.-Based Syst. 183, 104855.
Ye, Z., Wang, K., Chen, Y., Jiang, X., Song, G., 2022. Multi-UAV navigation for partially observable communication coverage by graph reinforcement learning. IEEE
Trans. Mob. Comput.
Yin, Y., 2008. Robust optimal traffic signal timing. Transp. Res. B Methodol. 42 (10), 911–924.
Yu, H., Ma, R., Zhang, H.M., 2018. Optimal traffic signal control under dynamic user equilibrium and link constraints in a general network. Transp. Res. B Methodol.
110, 302–325.
Yu, H., Liu, P., Fan, Y., Zhang, G., 2021. Developing a decentralized signal control strategy considering link storage capacity. Transp. Res. Part C: Emerging Technol.
124, 102971.
Zhang, Y., Clavera, I., Tsai, B., Abbeel, P., 2019b. Asynchronous methods for model-based reinforcement learning. arXiv preprint arXiv:1910.12453.
Zhang, H., Feng, S., Liu, C., Ding, Y., Zhu, Y., Zhou, Z., Zhang, W., Yu, Y., Jin, H., Li, Z., 2019a. CityFlow: a multi-agent reinforcement learning environment for large-scale city traffic scenario. The World Wide Web Conference, pp. 3620–3624.
Zhang, C., Xie, Y., Gartner, N.H., Stamatiadis, C., Arsava, T., 2015. AM-band: an asymmetrical multi-band model for arterial traffic signal coordination. Transp. Res.
Part C: Em. Technol. 58, 515–531.
Zhang, L., Yin, Y., Chen, S., 2013. Robust signal timing optimization with environmental concerns. Transp. Res. Part C: Emerging Technol. 29, 55–71.
Zhao, D., Dai, Y., Zhang, Z., 2012. Computational intelligence in urban traffic signal control: a survey. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 42 (4), 485–494.
Zlatkovic, M., Stevanovic, A., Martin, P.T., 2012. Development and evaluation of algorithm for resolution of conflicting transit signal priority requests. Transp. Res.
Rec. 2311 (1), 167–175.
Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V., 2018. Learning transferable architectures for scalable image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710.