
Expert Systems With Applications 255 (2024) 124627


Optimizing gate control coordination signal for urban traffic network boundaries using multi-agent deep reinforcement learning
Leilei Kang a, Hao Huang a, Weike Lu b,c, Lan Liu a,d,*

a School of Transportation and Logistics, Southwest Jiaotong University, Chengdu 610031, China
b School of Rail Transportation, Soochow University, Suzhou 215131, China
c Alabama Transportation Institute, Tuscaloosa, AL 35487, USA
d National and Local Joint Engineering Laboratory of Integrated Intelligent Transportation, Southwest Jiaotong University, Chengdu 610031, China

A R T I C L E  I N F O

Keywords: Macroscopic fundamental diagram; Deep reinforcement learning; Boundary multi-agent gate control; Communication mechanism

A B S T R A C T

The macro-aggregated dynamic characteristics of urban network traffic flow embodied in macroscopic fundamental diagram theory provide a concise perspective for network-level traffic control. However, accurately translating inter-regional macro-traffic flow control into fine-grained control at each intersection along the boundary remains challenging. To address this issue, a collaborative control method for regional boundary intersections is proposed, leveraging the model-free, data-driven advantages of multi-agent deep reinforcement learning. First, the proposed model develops a real-time communication module using a bi-directional long short-term memory model and an attention mechanism, enabling intersections to capture spatiotemporal features of the boundary traffic flow. Second, the feedback reward mechanism of the boundary intersection agent is meticulously constructed by considering the traffic state of the protected region, the boundary control pressure, and the traffic pressure of the intersection. The proposed model fosters cooperation among boundary intersections and balances their macro–micro control contradictions based on mastery of the boundary state information of the road network. Data-driven boundary gating control can finely regulate the traffic flow distribution of the network from both macro and micro perspectives to improve its operational efficiency. Simulation experiments verify the effectiveness of the proposed multi-agent cooperative signal control method.

1. Introduction

As economic development progresses, metropolitan areas accommodate increasing numbers of citizens and vehicles, leading to severe traffic congestion, safety incidents, and air pollution (Albalate and Fageda, 2021; Zhong et al., 2017). Given the limited urban land resources, extensive enhancements and expansions of transportation infrastructure are not feasible. Therefore, the development of an efficient and intelligent traffic signal management system is crucial for enhancing traffic efficiency. Early researchers mainly focused on single-point intersections (Allsop, 1976; Webster, 1958) and arterial signal optimization (Little et al., 1981). However, these methods only consider a limited number of intersections, making it challenging to meet signal control requirements at the regional level. Subsequently, researchers developed regional signal systems such as SCATS (Sims, 1981), SCOOT (Robertson and Bretherton, 1991), and TRANSYT (Pastor et al., 2004). These regional signal control methods primarily coordinate parameters such as signal period and offset between intersections based on the traffic arrival distribution or the expected distribution at the upstream intersections. However, they struggle to address complex traffic flow conditions, and the control is relatively decentralized and cannot effectively regulate the overall traffic flow (Zhou and Gayah, 2023). Furthermore, urban network traffic flow exhibits significant nonlinearity and non-stationarity (Vlahogianni et al., 2006). This brings challenges in accurately modelling network traffic flow and implementing precise control for network nodes.

Fortunately, the proposal and development of macroscopic fundamental diagram (MFD) theory (Daganzo and Geroliminis, 2008; Godfrey, 1969; Geroliminis and Daganzo, 2008) have ushered in a concise and efficient perspective for traffic flow control at the network level. Specifically, the MFD can be viewed as a mapping function that expresses macro-aggregated traffic information. It can indicate the unimodal,

* Corresponding author.
E-mail addresses: [email protected] (L. Kang), [email protected] (H. Huang), [email protected] (W. Lu), [email protected]
(L. Liu).

https://doi.org/10.1016/j.eswa.2024.124627
Received 1 February 2024; Received in revised form 21 June 2024; Accepted 24 June 2024
Available online 27 June 2024
0957-4174/© 2024 Elsevier Ltd. All rights are reserved, including those for text and data mining, AI training, and similar technologies.
L. Kang et al. Expert Systems With Applications 255 (2024) 124627

low-scattering macro-mapping relationship between the cumulative number of vehicles and the weighted traffic volume within the traffic network. Many scholars have applied the macro property reflected by the MFD to study urban network boundary control (Aboudolas and Geroliminis, 2013; Chen et al., 2022; Daganzo, 2007; Ding et al., 2020; Keyvan-Ekbatani et al., 2012, 2013, 2015a, b; Geroliminis et al., 2012; Ni and Cassidy, 2019; Zhou and Gayah, 2023). In essence, boundary control is a demand management solution for excessive vehicle loading within MFD regions, also referred to as protected regions. It mitigates traffic congestion by delaying or preventing vehicles from entering MFD areas. This may lead to vehicle over-queuing at the boundary intersections, affecting the overall operating efficiency of the road network. Additionally, most current research concentrates on determining the macro-inflow for the MFD region and lacks micro fine-grained regulation from the macro-inflow down to the boundary intersections. Thus, this research aims to implement refined control of boundary intersections by integrating the macro–micro perspectives of urban traffic flow, thereby enhancing the efficacy of boundary control.

The complexity of spatio-temporal dynamic modelling of traffic networks can be effectively reduced by harnessing MFD theory. Daganzo (2007) proposed the benefits of monitoring and controlling traffic flow from the macro perspective of a traffic network. Hence, the design of boundary control strategies based on the MFD has attracted the attention of many researchers. Current research in this community mainly focuses on the following two aspects: (1) model-driven region boundary control strategies and (2) data-driven region boundary control strategies.

1.1. Model-driven region boundary control strategy

Multiple researchers have proposed various boundary control methods by integrating MFD theory with traditional model-driven approaches. For instance, Geroliminis et al. (2012) simulated the macro traffic flow dynamics between two MFD regions and used model predictive control (MPC) to regulate the macro inflow rate between regions. Haddad et al. (2013) applied the MPC method to attain optimal flow control between regions in a hybrid network of urban roads and expressways. However, the MPC method necessitates precise predictions from external models, posing a substantial challenge for practical traffic flow control. Keyvan-Ekbatani et al. (2012, 2013, 2015a, b) devised proportional-integral (PI) controllers to regulate the number of vehicles in the protected region to maintain an optimal level. The objective of the PI controller is to ascertain the permissible macroscopic inflow into the protected region. PI control is also known as gating, where each intersection acts as a valve to regulate the entry and exit of traffic volume. Fu et al. (2017) developed a three-layer control strategy to determine the macro inflow rate for the boundary of a two-region traffic network. They utilized a heuristic genetic algorithm to solve local control objectives, which may be computationally expensive for real-time control. Ding et al. (2020) proposed a boundary control model to prevent MFD regional state degradation. Haddad (2017) derived the optimal boundary control solution for two MFD regions considering boundary queue length and input flow constraints. Li et al. (2021) proposed a robust boundary control model to minimize vehicle travel time within the network by considering boundary queuing. The aforementioned model-driven boundary control methods require specific assumptions about the environment or model (Chen et al., 2022; Zhou and Gayah, 2021). In addition, most research remains at the macro-aggregation level and ignores the specific intersection level.

1.2. Data-driven region boundary control strategy

To overcome the limitations of model-driven methods, some researchers have adopted data-driven algorithms to investigate boundary control. The advantage of data-driven approaches is that they do not need pre-assumptions, concentrating instead on learning from the operational data generated by system interactions to optimize decision-making. Especially with the rapid advancement of artificial intelligence technology (Mnih et al., 2015; Silver et al., 2016; Sutton and Barto, 2018), deep reinforcement learning (DRL) has paved the way for tackling the above model-driven issues.

Yoon et al. (2020) integrated DRL with numerical simulation of macro traffic data to control urban network boundaries. Zhou and Gayah (2021, 2023) modelled the macro traffic dynamics of regions using MFD theory. Based on this groundwork, DRL-based perimeter control strategies have been proposed to manage macro traffic flow shifts at boundaries. However, these strategies are insufficient for effectively guiding the setting of specific signal control plans. Ni and Cassidy (2019) employed graph neural networks to learn the spatiotemporal transformations of boundary traffic flow. Chen et al. (2022) utilized the Lyapunov method to prove the stability of their proposed DRL model. Both methods consider relatively simple scenarios, lacking a depiction of the macro–micro control role of boundary intersections. From the above, it is evident that the application of DRL in urban network boundary control is still in its early stages. Furthermore, there is insufficient research on boundary multi-agent traffic signal cooperation.

1.3. Multi-agent deep reinforcement learning

Numerous studies have used DRL for signal control at a single intersection and demonstrated promising results (Wei et al., 2018; Noaeen et al., 2022; Liu et al., 2023; Kumar et al., 2024; Merbah et al., 2024). However, the direct employment of a single agent to control multiple intersections faces challenges. The primary reason is that, as the number of intersections increases, the joint action policy of a centralized agent expands exponentially, potentially leading to the curse of dimensionality in the action space (Prashanth and Bhatnagar, 2010). A more straightforward approach involves treating each intersection as an agent to manage signal control (Genders and Razavi, 2019). However, intersections do not exist independently within the traffic network, and the dynamics of their surrounding neighbours influence them (Tampuu et al., 2017). That is, the optimal control policy of each agent changes in response to changes in the policies of other agents, potentially resulting in instability in inter-agent learning. To address these deficiencies, many studies have begun to consider promoting inter-agent cooperation through information exchange to improve the control effect. For instance, Peng et al. (2017) employed a bidirectional recurrent neural network architecture to facilitate communication between agents in StarCraft games. Chu et al. (2019) enhanced multi-agent cooperation by incorporating state information and rewards from neighbouring agents into the decision-making process of the target agent. Wei et al. (2019) adopted the attention mechanism to consider the impact of neighbour intersection states on the target intersection. Bokade et al. (2023) developed a flexible communication strategy for agents, enabling them to accept messages from one another more flexibly and thus promote multi-agent cooperation. Yang et al. (2023) applied causal inference and multi-agent DRL to promote cooperation for traffic signal control in road traffic networks. Yan et al. (2023) employed a graph network structure to coordinate intersection agents to enhance their cooperation. An increasing number of studies utilize graph deep reinforcement learning for adaptive signal control of large-scale road networks (Kolat et al., 2023; Wang et al., 2024; Devailly et al., 2024; Zhao et al., 2024). Han et al. (2024) integrated the attention mechanism with multi-agent DRL to develop the MAPPO algorithm for the adaptive signal control of traffic networks. Huang et al. (2024) established information interaction between traffic lights and vehicles, promoting the simultaneous optimization of traffic signals and vehicle speeds through multi-agent deep reinforcement learning. Currently, few studies focus on cooperative signal control for boundary intersections by the DRL method.

1.4. Main content of this study

To address the above challenges, we propose a Boundary Multi-


Fig. 1. Schematic of the logical relationship of network boundary signal control: (a) Traffic network example; (b) MFD mapping relationship; (c) Boundary
intersection control diagram.

Agent traffic signal Cooperative Control model (BMACC) based on DRL. The proposed model aims to achieve a trade-off between vehicles entering the MFD area and vehicles queuing at the boundary intersections through cooperative signal control at the boundary intersections. Moreover, our research contributes to improving the overall operational efficiency of the traffic network. The proposed method accomplishes these objectives primarily through the following novelties: 1) The proposed model is a data-driven multi-agent DRL method to optimize urban boundary intersection control; 2) It employs a bidirectional long short-term memory (Bi-LSTM) model combined with a self-attention approach to establish a state information communication mechanism among the multiple agents situated at boundary intersections; 3) The Bi-LSTM facilitates the exchange of private observation information among boundary intersections, enabling each boundary intersection agent to initially grasp the traffic flow dynamics across the entire boundary; 4) Building upon the third point, the agent utilizes the self-attention mechanism to extract relevant information from the multiple agents along the same boundary, which enables the agent to acquire boundary state information with greater accuracy; 5) A residual connection is adopted to feed information from each module into the decision-making layer so that the agent perceives information more fully; 6) The proposed reward mechanism can guide the boundary intersection agent in balancing the macro–micro control contradictions. Furthermore, the state definition takes into account the traffic dynamics of MFD regions and the state of the boundary intersections. Given the advantages of the Dueling Double Deep Q Network algorithm (Liang et al., 2019) for training discrete-action agents, our study employs this algorithm to train the agents.

To sum up, the principal contributions of our study can be summarized as follows:

(1) We propose a novel multi-agent collaborative control approach for urban network boundary intersections that is capable of concurrently accounting for the macro traffic dynamics within protected regions and the spatiotemporal distribution of traffic flows at boundary intersections, offering adaptive traffic signal control solutions.
(2) We utilize the Bi-LSTM model combined with an attention mechanism to convey the traffic state information perceived by the boundary intersection agents.
(3) The definitions of state and reward consider the macro–micro control characteristics of boundary intersections.
(4) The effectiveness of the proposed method is verified through a simulated network, demonstrating superior performance compared to the comparison models.

The paper is arranged as follows. In Section 2, we describe the main ideas of urban network boundary control and the DRL method. The proposed boundary multi-agent signal cooperative control model is described in detail in Section 3. The simulation network, benchmark models, evaluation criteria, and experimental results are presented in Section 4. Finally, Section 5 concludes with a summary of the findings of the case study.

2. Related work

2.1. Macroscopic characteristics and boundary gate control of traffic network

Geroliminis and Daganzo (2008) verified that different macroscopic aggregation features of traffic networks can establish an inverted U-shaped mapping relationship, also referred to as the MFD. For instance, the traffic network depicted in Fig. 1(a) can generate the macro-aggregation mapping function illustrated in Fig. 1(b). The x-axis represents the cumulative number $n$ of vehicles in the MFD region; the y-axis represents the weighted average flow $q_w$ of the protected area. These indexes can be obtained from the following equations:

$n(\tau) = \sum_{e \in G} n_e(\tau)$  (1)

where $n(\tau)$ is the number of vehicles in the protected region at time $\tau$; $\tau =$


0, 1, 2, ... is a discrete time index; $e$ denotes a road lane segment; $G$ is the set of all lanes in the protected area; $n_e(\tau)$ is the number of vehicles in lane $e$ at time $\tau$.

$q_w(\tau) = \dfrac{\sum_{e \in G} l_e \cdot q_e(\tau)}{\sum_{e \in G} l_e}$  (2)

where $q_w(\tau)$ is the weighted average flow at time $\tau$; $l_e$ is the length of lane $e$; $q_e(\tau)$ is the flow in lane $e$ at time $\tau$.

The mapping curve is derived from the discrete points calculated by Eqs. (1) and (2), and it reflects the traffic operation state of the MFD region, as shown in Fig. 1(b). As the number of vehicles in the network increases, the weighted average flow, which reflects the operational efficiency, initially rises and then declines in Fig. 1(b). $\hat{n}$ is the critical accumulation for the MFD area. When the cumulative vehicle count exceeds this critical threshold, network vehicles experience increasing saturation, resulting in diminishing network operational efficiency. Hence, the primary objective of boundary gating is to keep the number of vehicles in the protected area close to the critical cumulative count, thereby ensuring network operational efficiency. Fig. 1(c) illustrates the primary tasks of a boundary intersection: monitoring the cumulative number of vehicles in the protected area and implementing signal control to temporarily delay vehicles from entering the region. This strategy can mitigate an excessive influx of vehicles into the MFD region.

The schematic diagram above concisely explains the fundamental process of boundary gating control for the MFD region. However, imposing precise signal control at boundary intersections is not simple, mainly because boundary gating involves more than just blocking excess vehicles from entering the region. The core purpose of this method is to ensure traffic operation efficiency within the protected area while preventing excessive vehicle queues outside it, thereby preserving overall traffic operation efficiency. To achieve this, coordinated signal control at boundary intersections is necessary to temporarily restrict vehicles from entering the protected area. To better achieve the goal of boundary control, this study defines boundary intersections as agents and adopts multi-agent DRL to optimize the boundary gating control task.

2.2. Multi-agent Markov decision process

The above process can be formalized as a partially observable Markov decision process (POMDP) (Li et al., 2021; Zhou and Gayah, 2023). The boundary intersection agents observe the state of the environment $E$ and then make decisions that act on the environment. An agent is denoted as $N_i^j \in \mathcal{N}$, where $N_i^j$ represents the $j$th intersection of the $i$th boundary, $i = 1, \dots, I$, $j = 1, 2, \dots, J_i$. The global state information contained in the environment at time $t$ is $s_t \in \mathcal{S}$. Limited by current technical conditions, the observation of an agent is partial state information $o_i^j(t) \in \mathcal{O}$ of the environment. The policy $\pi_i^j(\cdot) \in \mathcal{Z}$ of the agent takes action $a_i^j(t) \in \mathcal{A}$ based on the observation information $o_i^j(t)$, that is, $\pi_i^j(a_i^j(t) \mid o_i^j(t))$. The actions of all agents constitute the overall control action $u_t \in \mathcal{U}$. According to the state transition function, after applying the global action $u_t$ in state $s_t$, the state of the environment changes to $s_{t+1}$, that is, $\mathcal{P}(s_{t+1} \mid s_t, u_t): \mathcal{S} \times \mathcal{U} \to \mathcal{S}$. The agent also receives a feedback reward $r_i^j(t) \in \mathcal{R}$ provided by the environment. The reward function is defined as $r_i^j(t) = r(o_i^j(t), a_i^j(t), o_i^j(t+1))$ in this research. Furthermore, the rewards received by each agent over the time dimension add up to the return. The return $R_i^j$ of the agent at time $t$ is calculated as follows:

$R_i^j(t) = \sum_{\chi = t}^{T} \gamma^{\chi - t} \cdot r_i^j(\chi)$  (3)

where $\chi$ is the discrete time point; $T$ is the total number of decision steps; $\gamma \in [0, 1]$ is a discount factor employed to discern the disparity between present-moment and future-moment rewards.

It can be seen from the above that the boundary multi-agent gating problem in this research can be defined as a POMDP composed of the tuple $\langle \mathcal{S}, \mathcal{O}, \mathcal{U}, \mathcal{P}, \mathcal{R}, \mathcal{Z}, \mathcal{N}, \gamma \rangle$. This study will define appropriate tuple elements and introduce relevant DRL algorithms to address this decision process more effectively.

2.3. Q-learning algorithm

Q-learning is a classical reinforcement learning algorithm for obtaining the Q value of a discrete control action through iterative calculation (Watkins and Dayan, 1992). The action Q-value, denoted as $Q^{\pi}(o_t, a_t)$, represents the expected return that the agent's policy $\pi$ can obtain by taking action $a_t$ after observing the environmental state $o_t$:

$Q^{\pi}(o, a) = \mathbb{E}[R \mid o = o_t, a = a_t]$  (4)

To obtain the optimal action-value function $Q^*(o, a)$, the agent needs to collect trajectories $(o_t, a_t, r_t, o_{t+1})$ from the interaction between its policy and the environment. During training, the agent adopts an exploration–exploitation strategy to select actions to prevent it from falling into a suboptimal solution. Exploration means selecting actions randomly with a certain probability. Exploitation means selecting actions greedily:

$a_t = \pi(o_t) = \arg\max_{a \in \mathcal{A}} Q(o_t, a)$  (5)

Finally, the Q-learning algorithm calculates the optimal action-value function iteratively:

$Q(o_t, a_t) \leftarrow Q(o_t, a_t) + \alpha \cdot (r_t + \gamma \cdot \max_{a \in \mathcal{A}} Q(o_{t+1}, a) - Q(o_t, a_t))$  (6)

where $\alpha$ is the learning rate.

The Q-learning algorithm has received widespread attention since it was proposed, but the Q function is limited to tabular form (Abdoos et al., 2011; Araghi et al., 2013). The tabular approach can only address problems with limited state-action dimensions, whereas actual situations may involve complex state-action spaces. To overcome this shortcoming, researchers have proposed a variety of function forms to estimate action Q-values (Sutton and Barto, 2018; van Hasselt et al., 2018). Mnih et al. (2015) proposed using deep neural networks to estimate action Q-values, namely the Deep Q-Network (DQN) algorithm. Their study extended the application of reinforcement learning to DRL, demonstrating that control strategies achieved through DRL can rival human-level operations. This work has inspired the formulation and advancement of numerous DRL algorithms (Hessel et al., 2018; Silver et al., 2016; Wang et al., 2016). In the DQN algorithm, a target Q-value function $\hat{Q}$ is constructed to enhance the fitting of the Q-value function. The parameters $\theta$ of the neural network are then learned and updated using the difference between the Q function and the target Q function:

$L(\theta) = \mathbb{E}_{(o,a,r,o') \sim M}\big[(Q(o, a \mid \theta) - \hat{y})^2\big]$, where $\hat{y} = r + \gamma \cdot \max_{a' \in \mathcal{A}} \hat{Q}(o', a' \mid \hat{\theta})$  (7)

where $\hat{y}$ is the target action Q-value. Adopting the target Q value mitigates the overestimation of the online Q value, thereby stabilizing the training process. After $\theta$ has been updated for a certain number of rounds, its parameters are copied to $\hat{\theta}$. During training, a replay buffer $M$ is established to store the trajectories. To calculate more efficiently, a batch of trajectories is randomly sampled from the replay buffer $M$ to compute Eq. (7).

3. Methodology

3.1. Problem statement

The proposed BMACC model is employed to deal with boundary
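As a concrete reading of Eqs. (1) and (2), the two MFD indexes can be computed directly from per-lane measurements. The sketch below uses illustrative lane lengths, counts, and flows, not values from this study:

```python
def mfd_indexes(lane_lengths, lane_counts, lane_flows):
    """Compute the MFD indexes of Eqs. (1)-(2) for one time step tau.

    lane_lengths: length l_e of each lane e in the protected region (m)
    lane_counts:  vehicle count n_e(tau) on each lane
    lane_flows:   flow q_e(tau) on each lane (veh/h)
    """
    # Eq. (1): accumulation n(tau) = sum over lanes of n_e(tau)
    n = sum(lane_counts)
    # Eq. (2): length-weighted average flow
    # q_w(tau) = sum_e l_e * q_e(tau) / sum_e l_e
    q_w = sum(l * q for l, q in zip(lane_lengths, lane_flows)) / sum(lane_lengths)
    return n, q_w

# Toy three-lane protected region (illustrative numbers only)
n, q_w = mfd_indexes([300.0, 200.0, 500.0], [12, 8, 20], [400.0, 350.0, 500.0])
```

Sweeping such (n, q_w) pairs over time yields the discrete points from which the MFD curve of Fig. 1(b) is drawn.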

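The ε-greedy selection of Eq. (5) and the tabular update of Eq. (6) can be sketched as follows; the state/action encoding and hyperparameter values here are illustrative, not those used in this study:

```python
import random
from collections import defaultdict

class TabularQLearning:
    """Tabular Q-learning with epsilon-greedy exploration (Eqs. (5)-(6))."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)  # Q(o, a) table, zero-initialised

    def act(self, obs):
        # Exploration: random action with probability epsilon
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        # Exploitation: greedy action, Eq. (5)
        return max(self.actions, key=lambda a: self.q[(obs, a)])

    def update(self, obs, action, reward, next_obs):
        # Eq. (6): Q <- Q + alpha * (r + gamma * max_a' Q(o', a') - Q)
        target = reward + self.gamma * max(self.q[(next_obs, a)] for a in self.actions)
        self.q[(obs, action)] += self.alpha * (target - self.q[(obs, action)])

agent = TabularQLearning(actions=[0, 1], epsilon=0.0)
agent.update(obs="s0", action=1, reward=1.0, next_obs="s1")
```

The DQN variant of Eq. (7) replaces the table `self.q` with a neural network and samples minibatches of stored trajectories instead of updating one entry at a time.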

intersection gate control for two concentric areas (see Fig. 2). This control approach is more conducive to protecting the traffic operation efficiency within the urban network (Keyvan-Ekbatani et al., 2015b). The intersections on the boundaries are regarded as agents, and the agents communicate through the constructed communication mechanism. Boundary control of urban traffic networks is a task that requires coordination and cooperation among boundary intersections. From the perspective of DRL, the control task can be formalized as a POMDP represented by the tuple $\langle \mathcal{S}, \mathcal{O}, \mathcal{U}, \mathcal{P}, \mathcal{R}, \mathcal{Z}, \mathcal{N}, \gamma \rangle$. The objective of the boundary intersection agents is to temporarily limit the boundary vehicles to appropriate locations through collaboration, so as to ensure the efficiency of the MFD area and of the overall traffic operation. By observing the macroscopic traffic state of the network and its own traffic state, each agent transmits its own particular observation information to other agents, receives communication information from them, and ultimately takes an appropriate action. The proposed model architecture and its communication mechanism are described below.

Fig. 2. Overview of boundary signal control scenarios.

3.2. The architecture of the BMACC model

The proposed BMACC model architecture is depicted in Fig. 3. It is mainly divided into the following three sections: (a) the state perception layer, which perceives the traffic state of the environment and the boundary intersections; (b) the communication layer, which builds a communication module for the traffic state information perceived by the perception layer; and (c) the decision-making layer, which synthesizes all the information to enable the agent to take appropriate actions. It should be noted that, for convenience of writing, the time symbol $t$ is omitted in Fig. 3.

3.2.1. Traffic state perception
At phase decision time $t$ for the boundary intersections, an agent $A_i^j$ receives both private observations and macro-level information from the monitoring equipment, thereby obtaining its observation information $o_i^j(t)$. The information can be mapped to the latent variable space through a fully connected multi-layer perceptron (MLP):

$m_i^j(t) = \mathrm{ReLU}(o_i^j(t) W_{i,m}^j + b_{i,m}^j)$  (8)

where $m_i^j(t) \in \mathbb{R}^{1 \times \kappa_1}$ is a state variable mapped from the observation information $o_i^j(t)$; $W_{i,m}^j \in \mathbb{R}^{\kappa_0 \times \kappa_1}$ and $b_{i,m}^j \in \mathbb{R}^{1 \times \kappa_1}$ are the learnable weight matrix and bias vector of the MLP; $\mathrm{ReLU}(\cdot)$ is a nonlinear activation function. Eq. (8) can be regarded as the preliminary feature extraction of the observation information for the boundary intersection.

3.2.2. The state information communication for the agent
After the agent completes its preliminary feature extraction, it needs to communicate the obtained feature information with other agents. This enables the agents to acquire the traffic state characteristics of the boundary and then make appropriate traffic phase selections, promoting cooperation among agents. This study aims to enhance information communication among agents in the following aspects.

I Bi-LSTM communication
This study first constructs a two-way communication mechanism between boundary intersections through a Bi-LSTM neural network. The Bi-LSTM (Graves et al., 2005) consists of two layers: one layer processes the feature information of the boundary intersections in the forward direction, and the other layer processes the features in the reverse direction. It enables the agents to obtain traffic state characteristics in both directions comprehensively. The LSTM model is composed of input gates, forgetting gates, output gates, and corresponding internal state update

Fig. 3. Framework of the BMACC algorithm.
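The preliminary feature extraction of Eq. (8) is a single fully connected layer with a ReLU activation. A minimal sketch (the dimensions κ0 = 4 and κ1 = 3 and the random inputs are illustrative only):

```python
import numpy as np

def perceive(obs, W, b):
    """Eq. (8): preliminary feature extraction m = ReLU(o W + b).

    obs: observation row vector of one agent, shape (1, kappa0)
    W:   learnable weight matrix, shape (kappa0, kappa1)
    b:   learnable bias vector, shape (1, kappa1)
    """
    return np.maximum(obs @ W + b, 0.0)  # ReLU zeroes out negative features

# Toy dimensions kappa0 = 4, kappa1 = 3 (illustrative only)
rng = np.random.default_rng(0)
o = rng.normal(size=(1, 4))
m = perceive(o, rng.normal(size=(4, 3)), np.zeros((1, 3)))
```

In the full model, one such mapping is applied per boundary intersection before the resulting features $m_i^j$ enter the communication layer.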

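The two-direction message passing described above can be sketched as a bidirectional LSTM sweep over the sequence of per-intersection features. For brevity, this sketch shares one set of gate parameters per direction, whereas the paper's notation indexes the weights by intersection j; dimensions and random inputs are illustrative:

```python
import numpy as np

def lstm_step(m, h_prev, c_prev, P):
    """One LSTM unit: input, forget, and output gates plus memory update.

    m: input features of one agent, shape (1, k1)
    h_prev, c_prev: hidden/memory state from the neighbouring agent, shape (1, k2)
    P: dict of weights W_* (k1, k2), U_* (k2, k2), b_* (1, k2)
    """
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    z_in = sig(m @ P["W_in"] + h_prev @ P["U_in"] + P["b_in"])      # input gate
    z_f = sig(m @ P["W_f"] + h_prev @ P["U_f"] + P["b_f"])          # forget gate
    z_out = sig(m @ P["W_out"] + h_prev @ P["U_out"] + P["b_out"])  # output gate
    c_tilde = np.tanh(m @ P["W_c"] + h_prev @ P["U_c"] + P["b_c"])  # candidate memory
    c = z_f * c_prev + z_in * c_tilde                               # memory update
    h = z_out * np.tanh(c)                                          # hidden state
    return h, c

def init_params(k1, k2, rng):
    """Random gate parameters for one sweep direction (illustrative)."""
    P = {}
    for g in ["in", "f", "out", "c"]:
        P["W_" + g] = rng.normal(scale=0.1, size=(k1, k2))
        P["U_" + g] = rng.normal(scale=0.1, size=(k2, k2))
        P["b_" + g] = np.zeros((1, k2))
    return P

def bi_lstm_communicate(ms, P_fwd, P_bwd, k2):
    """Sweep the boundary sequence in both directions and concatenate
    the forward and backward hidden states per agent."""
    J = len(ms)
    h, c = np.zeros((1, k2)), np.zeros((1, k2))
    fwd = []
    for j in range(J):                    # forward pass: agent 1 .. J
        h, c = lstm_step(ms[j], h, c, P_fwd)
        fwd.append(h)
    h, c = np.zeros((1, k2)), np.zeros((1, k2))
    bwd = [None] * J
    for j in reversed(range(J)):          # backward pass: agent J .. 1
        h, c = lstm_step(ms[j], h, c, P_bwd)
        bwd[j] = h
    return [np.concatenate([f, b], axis=1) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(1)
P_f, P_b = init_params(4, 3, rng), init_params(4, 3, rng)
ms = [rng.normal(size=(1, 4)) for _ in range(3)]  # features of 3 boundary agents
hs = bi_lstm_communicate(ms, P_f, P_b, k2=3)
```

Each agent's concatenated state thus carries information from both ends of its boundary, which is the property the subsequent attention step builds on.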

mechanisms:

1) Forward LSTM communication

$\overrightarrow{Z}_{i,in}^{j}(t) = \sigma(m_i^j(t)\,\overrightarrow{W}_{i,inm}^{j} + \overrightarrow{h}_i^{j-1}(t)\,\overrightarrow{W}_{i,inh}^{j} + \overrightarrow{b}_{i,in}^{j})$  (9)

where $\overrightarrow{Z}_{i,in}^{j}(t)$ is the feature vector obtained after substituting $m_i^j(t)$ and $\overrightarrow{h}_i^{j-1}(t) \in \mathbb{R}^{1 \times \kappa_2}$ into the input gate; $\overrightarrow{h}_i^{j-1}(t)$ is the hidden state variable acquired by the agent at the position preceding the $j$th agent on the $i$th boundary through a complete LSTM calculation unit. When $j = 1$ and $t = 0$, $\overrightarrow{h}_i^{j-1}(t) = \overrightarrow{h}_i^{0}(0)$; $\overrightarrow{h}_i^{0}(0)$ is the initial value, and its assigned values are all zero. $\overrightarrow{W}_{i,inm}^{j} \in \mathbb{R}^{\kappa_1 \times \kappa_2}$ and $\overrightarrow{W}_{i,inh}^{j} \in \mathbb{R}^{\kappa_2 \times \kappa_2}$ are learnable weight parameters; $\overrightarrow{b}_{i,in}^{j} \in \mathbb{R}^{1 \times \kappa_2}$ is a learnable bias term; $\sigma(\cdot)$ is the sigmoid nonlinear activation function, which transforms the calculation results to between 0 and 1. The input gate assesses the current input information to help the candidate memory unit screen essential information.

$\overrightarrow{Z}_{i,f}^{j}(t) = \sigma(m_i^j(t)\,\overrightarrow{W}_{i,fm}^{j} + \overrightarrow{h}_i^{j-1}(t)\,\overrightarrow{W}_{i,fh}^{j} + \overrightarrow{b}_{i,f}^{j})$  (10)

where $\overrightarrow{Z}_{i,f}^{j}(t)$ is the feature vector obtained through the forget gate; $\overrightarrow{W}_{i,fm}^{j} \in \mathbb{R}^{\kappa_1 \times \kappa_2}$ and $\overrightarrow{W}_{i,fh}^{j} \in \mathbb{R}^{\kappa_2 \times \kappa_2}$ are learnable parameter matrices; $\overrightarrow{b}_{i,f}^{j} \in \mathbb{R}^{1 \times \kappa_2}$ is a learnable bias term. The primary function of the forget gate is to determine which information to forget. This allows LSTM computing units to preserve vital information in the long term, and these memories can be dynamically modified in response to new inputs.

$\overrightarrow{Z}_{i,out}^{j}(t) = \sigma(m_i^j(t)\,\overrightarrow{W}_{i,outm}^{j} + \overrightarrow{h}_i^{j-1}(t)\,\overrightarrow{W}_{i,outh}^{j} + \overrightarrow{b}_{i,out}^{j})$  (11)

where $\overrightarrow{Z}_{i,out}^{j}(t)$ is the feature vector acquired through the output gate; $\overrightarrow{W}_{i,outm}^{j} \in \mathbb{R}^{\kappa_1 \times \kappa_2}$ and $\overrightarrow{W}_{i,outh}^{j} \in \mathbb{R}^{\kappa_2 \times \kappa_2}$ are learnable parameter matrices; $\overrightarrow{b}_{i,out}^{j} \in \mathbb{R}^{1 \times \kappa_2}$ is a learnable bias. The output gate assesses the current input information to extract crucial feature information.

$\widetilde{\overrightarrow{C}}_i^{j}(t) = \tanh(m_i^j(t)\,\overrightarrow{W}_{i,\tilde{c}m}^{j} + \overrightarrow{h}_i^{j-1}(t)\,\overrightarrow{W}_{i,\tilde{c}h}^{j} + \overrightarrow{b}_{i,\tilde{c}}^{j})$  (12)

where $\widetilde{\overrightarrow{C}}_i^{j}(t)$ represents the candidate memory feature calculated from $m_i^j(t)$ and $\overrightarrow{h}_i^{j-1}(t)$; $\overrightarrow{W}_{i,\tilde{c}m}^{j} \in \mathbb{R}^{\kappa_1 \times \kappa_2}$ and $\overrightarrow{W}_{i,\tilde{c}h}^{j} \in \mathbb{R}^{\kappa_2 \times \kappa_2}$ are the neural network weight parameters used in calculating the candidate memory; $\overrightarrow{b}_{i,\tilde{c}}^{j} \in \mathbb{R}^{1 \times \kappa_2}$ is a learnable bias term; $\tanh(\cdot)$ is a nonlinear activation function that transforms the results to between $-1$ and $1$. The candidate memory information is filtered to obtain practical information based on the current input feature information.

from the feature vector derived from the output gate and the feature vector obtained from the memory unit.

Similarly, by combining the computational processes from Eq. (9) to Eq. (14), the calculation process for the boundary intersections in the opposite direction can be naturally obtained.

2) Backward LSTM communication

$\overleftarrow{Z}_{i,in}^{j}(t) = \sigma(m_i^j(t)\,\overleftarrow{W}_{i,inm}^{j} + \overleftarrow{h}_i^{j+1}(t)\,\overleftarrow{W}_{i,inh}^{j} + \overleftarrow{b}_{i,in}^{j})$  (15)

$\overleftarrow{Z}_{i,f}^{j}(t) = \sigma(m_i^j(t)\,\overleftarrow{W}_{i,fm}^{j} + \overleftarrow{h}_i^{j+1}(t)\,\overleftarrow{W}_{i,fh}^{j} + \overleftarrow{b}_{i,f}^{j})$  (16)

$\overleftarrow{Z}_{i,out}^{j}(t) = \sigma(m_i^j(t)\,\overleftarrow{W}_{i,outm}^{j} + \overleftarrow{h}_i^{j+1}(t)\,\overleftarrow{W}_{i,outh}^{j} + \overleftarrow{b}_{i,out}^{j})$  (17)

$\widetilde{\overleftarrow{C}}_i^{j}(t) = \tanh(m_i^j(t)\,\overleftarrow{W}_{i,\tilde{c}m}^{j} + \overleftarrow{h}_i^{j+1}(t)\,\overleftarrow{W}_{i,\tilde{c}h}^{j} + \overleftarrow{b}_{i,\tilde{c}}^{j})$  (18)

$\overleftarrow{C}_i^{j}(t) = \overleftarrow{Z}_{i,f}^{j}(t) \odot \overleftarrow{C}_i^{j+1}(t) + \overleftarrow{Z}_{i,in}^{j}(t) \odot \widetilde{\overleftarrow{C}}_i^{j}(t)$  (19)

$\overleftarrow{h}_i^{j}(t) = \overleftarrow{Z}_{i,out}^{j}(t) \odot \tanh(\overleftarrow{C}_i^{j}(t))$  (20)

where $\overleftarrow{W}_{i,inm}^{j} \in \mathbb{R}^{\kappa_1 \times \kappa_2}$, $\overleftarrow{W}_{i,inh}^{j} \in \mathbb{R}^{\kappa_2 \times \kappa_2}$, $\overleftarrow{W}_{i,fm}^{j} \in \mathbb{R}^{\kappa_1 \times \kappa_2}$, $\overleftarrow{W}_{i,fh}^{j} \in \mathbb{R}^{\kappa_2 \times \kappa_2}$, $\overleftarrow{W}_{i,outm}^{j} \in \mathbb{R}^{\kappa_1 \times \kappa_2}$, $\overleftarrow{W}_{i,outh}^{j} \in \mathbb{R}^{\kappa_2 \times \kappa_2}$, $\overleftarrow{W}_{i,\tilde{c}m}^{j} \in \mathbb{R}^{\kappa_1 \times \kappa_2}$, and $\overleftarrow{W}_{i,\tilde{c}h}^{j} \in \mathbb{R}^{\kappa_2 \times \kappa_2}$ are learnable weight matrix parameters; $\overleftarrow{b}_{i,in}^{j}$, $\overleftarrow{b}_{i,f}^{j}$, $\overleftarrow{b}_{i,out}^{j}$, $\overleftarrow{b}_{i,\tilde{c}}^{j} \in \mathbb{R}^{1 \times \kappa_2}$ are learnable bias vectors. The meanings of the other parameters can be referred from the forward calculation.

By executing the Bi-LSTM computations on the boundary intersections, the characteristic information of the target intersection can be obtained in both directions of the boundary:

$h_i^j(t) = \mathrm{CON}[\overrightarrow{h}_i^{j}(t), \overleftarrow{h}_i^{j}(t)]$  (21)

where $\mathrm{CON}(\cdot)$ represents the splicing (concatenation) operation, splicing $\overrightarrow{h}_i^{j}(t)$ and $\overleftarrow{h}_i^{j}(t)$; $h_i^j(t) \in \mathbb{R}^{1 \times \kappa_3}$, $\kappa_3 = 2\kappa_2$, is the hidden state variable of the boundary intersection obtained after the Bi-LSTM calculation.

It is noteworthy that a single LSTM can only communicate implicitly in one direction of the boundary, while two-way communication compensates for this limitation and strengthens the agent's understanding of the overall boundary state. To further improve the awareness of the overall state of the boundaries, this study introduces an attention mechanism for communication among boundary intersections.

II Attention communication
The attention mechanism (Vaswani et al., 2017) is a method that mimics human attention. It employs a specialized neural network structure to learn autonomously and selectively concentrate on crucial features. The primary purpose is to select information more critical to
→j →j →j− 1 →j →j
C i (t) = Z i,f (t) ⊙ C i (t) + Z i,in (t) ⊙ C
̃ (t) (13) the current goal from numerous pieces of information. It enables target
i
intersections to be associated with features from other intersections
→j →j− 1 rather than relying solely on adjacent locations.
where C i (t) is the memory feature vector output; The C i (t) ∈ R1×κ2
→j− 1 Q j,x x j
i (t) = Wi,Q hi (t) (22)
value case is obtained in the same way as the h i (t); ⊙ is the Hadamard
product that represents the element-to-element product calculation be­
tween two vectors; Eq. (13) utilizes input gate information to preserve K j,x x j
i (t) = Wi,K hi (t) (23)
crucial features from the current memory and leverages forgetting gate
information to retain critical features from past memories. V j,x x j
i (t) = Wi,V hi (t) (24)

→j →j →j T √̅̅̅̅
h i (t) = Z i,out (t) ⊙ tanh( C i (t)) (14) exp(Q j,x jʹ,x
i (t))⋅(K i (t)) / D)
αj,ji ,x (t) = ∑
ʹ
T √̅̅̅̅ (25)
Ji j,x jʹ,x
→j jʹ exp(Q i (t)⋅(K i (t)) / D)
where h i (t) ∈ R1×κ2 is the hidden state feature output. It is computed

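Equations (22)–(25) are standard scaled dot-product attention applied across the agents of one boundary. A minimal numpy sketch for a single head follows; the sizes ($J_i = 4$ agents, $\kappa_3 = 128$, $\kappa_4 = 32$) and random weights are illustrative assumptions, not the trained parameters of the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
J, k3, k4 = 4, 128, 32               # J_i agents on one boundary; kappa_3, kappa_4
H = rng.normal(size=(J, k3))         # h_i^j(t) for every intersection j on boundary i
Wq, Wk, Wv = (rng.normal(size=(k3, k4)) * 0.05 for _ in range(3))

Q, K, V = H @ Wq, H @ Wk, H @ Wv                  # Eqs. (22)-(24), one head
scores = (Q @ K.T) / np.sqrt(k4)                  # scaled dot products, D = kappa_4
alpha = np.apply_along_axis(softmax, 1, scores)   # Eq. (25): attention weights
d = alpha @ V                                     # per-head weighted sum of value vectors

print(d.shape)                                    # one kappa_4-dim message per agent
```

Each row of `alpha` is a distribution over the $J_i$ intersections of the boundary, so every agent receives a boundary-wide weighted summary rather than only its neighbours' states.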
L. Kang et al. Expert Systems With Applications 255 (2024) 124627

$$d_i^j(t) = CON\Big[\sum_{j'=1}^{J_i}\alpha_i^{j,j',x}(t)V_i^{j',x}(t),\ \forall x\Big] \tag{26}$$

where $Q_i^{j,x}(t) \in R^{1\times\kappa_4}$, $K_i^{j,x}(t) \in R^{1\times\kappa_4}$, and $V_i^{j,x}(t) \in R^{1\times\kappa_4}$ are the query, key, and value feature vectors, which are derived by multiplying $h_i^j(t)$ with three distinct learnable weight parameters $W_{i,Q}^{x} \in R^{\kappa_3\times\kappa_4}$, $W_{i,K}^{x} \in R^{\kappa_3\times\kappa_4}$, and $W_{i,V}^{x} \in R^{\kappa_3\times\kappa_4}$. Matching $Q_i^{j,x}(t)$ with $K_i^{j,x}(t)$ acquires the correlation between boundary intersections, from which the attention weight between the two is calculated. The attention mechanism output for the target intersection features is computed by performing a weighted sum of the value feature vectors of the boundary intersections, where the weights are the attention weights. The number of attention heads in the above equations is $X$; $x = 1, 2, ..., X$ denotes a specific attention head under consideration. $\alpha_i^{j,j',x}(t)$ represents the attention weight, in the $x$th head, of the $j$th intersection feature vector on the $i$th boundary with respect to the $j'$th ($j' = 1, 2, ..., J_i$) intersection feature on the same boundary. $D = \kappa_4$ represents the feature dimension of $Q_i^{j,x}$ and $K_i^{j,x}$, and it is adopted in Eq. (25) mainly to prevent the result of the dot product of $Q_i^{j,x}$ and $K_i^{j,x}$ from being too large. $d_i^j(t) \in R^{1\times\kappa_5}$ is the output of the attention mechanism, which aggregates the results of each attention head calculation.

Fig. 4. Discrete phase signal configuration.

3.2.3. Decision-making layer
The preceding state perception and communication information calculations engender valuable features that assist the boundary agent in decision-making. In an endeavour to augment the decision-making capabilities of the boundary agent, this study integrates information from both the perception and communication layers:

$$w_i^j(t) = CON\big[m_i^j(t), h_i^j(t), d_i^j(t)\big] \tag{27}$$

where $w_i^j(t) \in R^{1\times\kappa_6}$ ($\kappa_6 = \kappa_1 + \kappa_3 + \kappa_5$) is the feature information input to the decision-making layer after splicing.
At the decision-making stage, the Dueling structure is utilized to calculate the action-Q value. Firstly, we compute the optimal advantage approximation, followed by the calculation of the optimal state value approximation:

$$\phi_i^{j,1}(t) = ReLU\big(w_i^j(t)W_i^{j,1} + b_i^{j,1}\big) \tag{28}$$

$$\phi_i^{j,adv}(t) = \phi_i^{j,1}(t)W_i^{j,adv} + b_i^{j,adv} \tag{29}$$

where $\phi_i^{j,1}(t) \in R^{1\times\kappa_7}$ is an intermediary variable; $\phi_i^{j,adv}(t) \in R^{1\times\kappa_8}$ is the optimal action advantage approximation; $W_i^{j,1} \in R^{\kappa_6\times\kappa_7}$ and $W_i^{j,adv} \in R^{\kappa_7\times\kappa_8}$ are learnable weights; $b_i^{j,1} \in R^{1\times\kappa_7}$ and $b_i^{j,adv} \in R^{1\times\kappa_8}$ are learnable bias parameters.

$$\varphi_i^{j,2}(t) = ReLU\big(w_i^j(t)W_i^{j,2} + b_i^{j,2}\big) \tag{30}$$

$$\varphi_i^{j,val}(t) = \varphi_i^{j,2}(t)W_i^{j,val} + b_i^{j,val} \tag{31}$$

where $\varphi_i^{j,2}(t) \in R^{1\times\kappa_7}$ is an intermediary variable; $\varphi_i^{j,val}(t) \in R^{1\times1}$ represents the approximate optimal state value; $W_i^{j,2} \in R^{\kappa_6\times\kappa_7}$ and $W_i^{j,val} \in R^{\kappa_7\times1}$ are learnable weight values; $b_i^{j,2} \in R^{1\times\kappa_7}$ and $b_i^{j,val} \in R^{1\times1}$ are learnable biases.
When the values of $\phi_i^{j,adv}(t)$ and $\varphi_i^{j,val}(t)$ are obtained, the action-Q value can be further estimated:

$$Q_i^j(t) = \varphi_i^{j,val}(t) + \phi_i^{j,adv}(t) - mean\big(\phi_i^{j,adv}(t)\big) \tag{32}$$

$$a_i^j(t) = \mathop{argmax}_{a\in A}\big(Q_i^j(t)\big) \tag{33}$$

where $Q_i^j(t) \in R^{1\times\kappa_8}$ is the approximate action-Q value; Eq. (33) represents selecting the action with the maximal Q value.

3.2.4. Learning algorithm
Section 3.2 above describes the forward process for calculating the optimal phase decision at the boundary intersection. To acquire the optimal learning parameters for the equations above, a corresponding training algorithm is required for backpropagation calculations. The parameters of the agent $A_i^j$ are denoted as $\theta_i^j$, and the parameters of its target agent are $\widehat{\theta}_i^j$. We utilize the Double DQN algorithm to train the learning parameters of the agents:

$$\widehat{y}_i^j(t) = r_i^j(t) + \gamma\cdot Q_i^j\big(o_i^j(t+1), a^{-}\,\big|\,\widehat{\theta}_i^j\big),\quad \text{where } a^{-} = \mathop{argmax}_{a\in A}\big(Q_i^j(o_i^j(t+1), a\,|\,\theta_i^j)\big) \tag{34}$$

$$L(\theta_i^j) = E_{[o_i^j(t),a_i^j(t),r_i^j(t),o_i^j(t+1)]\sim M}\Big[\big(Q_i^j(o_i^j(t), a_i^j(t)\,|\,\theta_i^j) - \widehat{y}_i^j(t)\big)^2\Big] \tag{35}$$

$$\theta_i^j \leftarrow \theta_i^j - \eta\cdot\nabla_{\theta_i^j}L(\theta_i^j) \tag{36}$$

where $L(\theta_i^j)$ is the batch-mean loss computed from a batch of data extracted from the replay buffer $M$. Applying this loss to perform backpropagation yields the updated agent parameter values.

3.3. Basic elements of agent

The above mainly describes the computational details of the agents. The state, action, and reward definitions are also critical in the agent control of boundary intersections.

3.3.1. State
To comprehensively reflect the macro–micro characteristics of the traffic network observable at each intersection, this study defines the observation state $o_i^j(t)$ of the boundary intersection agent at time $t$ as follows: (a) the number of vehicles in the first protected region; (b) the number of vehicles in the non-overlapping area between the first and the second region; (c) the number of vehicles outside the protected region; (d) the current phase status; (e) the number of vehicles on the entrance lanes of the intersection; (f) the number of vehicles on the downstream exit lanes connected to the entrance lanes of the intersection. Furthermore, certain state information is normalized before being input into the feature extraction layer to aid the decision-making of the agent.

3.3.2. Action
The action of the boundary intersection is set to four discrete phases. The overall right-of-way of each entry section of the intersection is divided into discrete phases, and the right-turn phase of each entrance is restricted. The form of the discrete phases can be seen in Figs. 1 and 4. The discrete phase selection is $a_i^j \in A$, and $A = \{1, 2, 3, 4\}$.

3.3.3. Reward
Within the context of DRL control tasks, reward signals are a vital component that serves to guide the agent towards achieving the control objectives throughout the training process. From the above analysis, it is evident that intersections at the boundaries of urban networks primarily undertake the following two tasks: 1) guarantee that traffic efficiency within protected areas is sustained at a relatively good level; 2) ensure that boundary intersections do not suffer from long-term congestion.


From these perspectives, the following reward mechanism is formulated:

$$\ell_i^j(t) = \Big[\sum_{e\in G_i^{j,in}}\big(C_i^{j,e} - n_i^{j,e}(t)\big)\Big]\Big/\partial_i^j \tag{37}$$

$$\delta_i^j(t) = \begin{cases} 0 & \text{if } \ell_i^j(t) = 0 \\[4pt] \dfrac{\ell_i^j(t)}{\sum_{j}^{J_i}\ell_i^j(t)} & \text{if } \ell_i^j(t) \neq 0 \end{cases} \tag{38}$$

$$\mu_i^j(t) = \begin{cases} 0 & \text{if } \widehat{n}_i > n_i(t) \\ \widehat{n}_i - n_i(t) & \text{if } \widehat{n}_i \leq n_i(t) \end{cases} \tag{39}$$

$$r_i^j(t) = \beta_1\cdot\delta_i^j(t)\cdot\mu_i^j(t) - \beta_2\cdot\zeta_i^j(t) \tag{40}$$

where $\ell_i^j(t)$ represents the remaining vehicle capacity on the entrance lanes of the $j$th intersection at the $i$th boundary at time $t$; $G_i^{j,in}$ is the specific lane set at the boundary intersection, implying that vehicles on these lanes will enter the protected region; $C_i^{j,e}$ represents the vehicle capacity of entry lane $e$; $n_i^{j,e}(t)$ is the number of vehicles on lane $e$ at time $t$. When two road segments at the boundary intersection belong to the entry road segments, $\partial_i^j = 2$; when only one road segment belongs to the entry road, $\partial_i^j = 1$. $r_i^j(t)$ signifies the reward value acquired after the target intersection executes an action at time $t$ and waits for the action to be completed; $\mu_i^j(t)$ is the number of vehicles in the $i$th protected region that exceeds the ideal value of the area at time $t$; $\delta_i^j(t)$ denotes the proportion of the residual capacity of the target intersection relative to the remaining capacity of the entire boundary. When an intersection exhibits a higher value of $\delta_i^j(t)$, the target intersection must assume a greater responsibility for the excessive influx of vehicles into the protected area. $\zeta_i^j(t)$ is the total number of vehicles queuing in all entrance lanes of the target intersection; $\beta_1$ and $\beta_2$ denote hyperparameter values. The reward is expressed as a negative value, with larger values indicating better decisions. To facilitate agent training, the reward value for the same boundary agent is divided by a relatively large value. Refer to Appendix A regarding the division of entrance roads at boundary intersections.

4. Experiments

To evaluate the effectiveness of the proposed model, comparative experiments are conducted using the Simulation of Urban Mobility (SUMO) software (Krajzewicz et al., 2012). The Traci interface in SUMO provides real-time traffic state information for the control model and promptly executes the decision actions generated by the control model. The simulation experiments are run on a computing platform configured with an i7-11700 CPU, 64 GB RAM, and an RTX 3060 GPU.

4.1. General setups

Fig. 5. Simulation network and its intersections.

This research conducted experiments on a simulated road network with 11 × 11 intersections (see Fig. 5). Fig. 5 illustrates the simulated road network and its components.
Fig. 5(a) depicts the overall structure of the road network, with boundary intersections enclosed by blue-shaded areas. Fig. 5(b) displays the phase composition form of the boundary intersections, which is consistent with that described in Fig. 1(c). Fig. 5(c) illustrates a non-boundary intersection whose phase composition is north–south straight, east–west straight, north–south left, east–west left, and unrestricted right turns. The internal boundary is the first boundary. The external boundary corresponds to the second boundary. The boundary intersection circled in green in Fig. 5(a) is numbered first, followed by the other boundary intersections in a clockwise sequence.
In the simulated network, vehicles are set to be homogeneous. The car-following and lane-changing behaviours adopt the default models in SUMO. The maximum speed of all vehicles is 13.89 m/s; the maximum acceleration is 2.6 m/s²; the maximum deceleration is 4.5 m/s². The traffic demand input is represented as origin–destination pairs, specifying only the start and end points for vehicles in the simulation file. The advantage is that a vehicle can evaluate the network state at departure and choose a more suitable route, which is also consistent with the actual situation. When the boundary control model manages the boundary intersections, a flexible discrete four-phase selection is executed during the simulation experiment.

Table 1
The hyperparameter values of the proposed learning model.

Hyperparameter | Value | Description
Training rounds | 150 | The number of simulation rounds required to train the learning model
Replay buffer size | 50,000 | The number of trajectory data records that can be stored
Batch size | 128 | The number of randomly selected trajectory samples when training the learning model
Discount factor | 0.98 | Discount factor for rewards at future moments
Initial learning rate | 0.0001 | Learning rate during training of the learning model
Decay rate | 0.96 | Starting from round 121 of the simulation, the learning rate of each round decays to 0.96 times that of the previous round
Initial exploration rate | 1.0 | The probability that the agent makes a random choice
Final exploration rate | 0.01 | The exploration rate is reduced by 0.01 in each simulation round relative to the previous round until it reaches 0.01
Target network update | 64 | After the network parameters have been updated 64 times, the target network is assigned its parameters
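Under sizes consistent with Table 1 and Section 4.1 ($\kappa_1 = \kappa_2 = 64$, four attention heads, four actions), the forward pass of one boundary's agents (Sections 3.2.1–3.2.3) can be sketched in PyTorch. This is a simplified illustration, not the paper's implementation: the explicit per-gate formulas in Eqs. (10)–(26) are replaced by torch's built-in `nn.LSTM` and `nn.MultiheadAttention`, a single parameter set is shared across the boundary's agents, and all class and variable names are our own.

```python
import torch
import torch.nn as nn

class BoundaryAgentNet(nn.Module):
    """Perception feature -> Bi-LSTM communication -> attention -> Dueling Q head."""
    def __init__(self, k1=64, k2=64, heads=4, n_actions=4):
        super().__init__()
        self.bilstm = nn.LSTM(k1, k2, bidirectional=True, batch_first=True)  # cf. Eqs. (10)-(21)
        self.attn = nn.MultiheadAttention(2 * k2, heads, batch_first=True)   # cf. Eqs. (22)-(26)
        k6 = k1 + 2 * k2 + 2 * k2                                            # kappa_6 = 320
        self.adv = nn.Sequential(nn.Linear(k6, 64), nn.ReLU(), nn.Linear(64, n_actions))  # Eqs. (28)-(29)
        self.val = nn.Sequential(nn.Linear(k6, 64), nn.ReLU(), nn.Linear(64, 1))          # Eqs. (30)-(31)

    def forward(self, m):                  # m: (1, J, k1), the J agents of one boundary
        h, _ = self.bilstm(m)              # bidirectional hidden states, (1, J, 2*k2)
        d, _ = self.attn(h, h, h)          # attention messages across the boundary
        w = torch.cat([m, h, d], dim=-1)   # Eq. (27): splice perception + communication
        adv, val = self.adv(w), self.val(w)
        return val + adv - adv.mean(dim=-1, keepdim=True)   # Eq. (32): Dueling aggregation

net = BoundaryAgentNet()
q = net(torch.randn(1, 16, 64))            # first boundary: 16 intersections
print(q.shape)                             # one 4-dim action-Q vector per intersection
```

Greedy phase selection (Eq. (33)) then reduces to `q.argmax(dim=-1)` per intersection.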

Fig. 6. MFD relationship of the protected region for simulation network.

A fixed signal plan is implemented for the non-boundary intersections. The green time of each phase in the intersection signal control in the simulation is G = 20 s. The duration of the yellow light is Y = 3 s. The duration of the all-red interval is AR = 1 s.
The number of agent intersections at the first boundary of the simulated road network is 16; the number of second-boundary agent intersections is 32. The dimensions of the hidden space states in the above model are: $\kappa_1 = 64$, $\kappa_2 = 64$, $\kappa_3 = 128$, $\kappa_4 = 32$, $\kappa_5 = 128$, $\kappa_6 = 320$, $\kappa_7 = 64$, $\kappa_8 = 4$, $\beta_1 = 2$, $\beta_2 = 1$. The number of attention heads is four. The hyperparameters of the proposed boundary multi-agent model are presented in Table 1.

4.2. MFD relationship of simulation network

By obtaining the MFD relationship of the protected area of the road network, the critical vehicle accumulation in that area is determined. Firstly, the lane flow and the number of vehicles in each lane during the simulation run are collected. Next, Eqs. (1) and (2) are adopted to calculate the macro aggregation points of the protected area. Finally, these data are fitted using a quadratic function to obtain the macro mapping relationship (that is, the MFD). To enhance the credibility of the results, nine simulation experiments are conducted using different random seeds. Subsequently, nine sets of MFD relationships are derived, as illustrated in Fig. 6.
As depicted in Fig. 6, the MFDs under the various simulation tests exhibit a similar trend. Based on comprehensive research and analysis, it is concluded that the critical vehicle accumulations in the protected regions of the simulation network can be determined as follows: $\widehat{n}_1 = 1800$, $\widehat{n}_2 = 6000$.

4.3. The comparison model for the boundary signal control

To comprehensively assess the effectiveness of the proposed Boundary Multi-Agent traffic signal Cooperative Control method (BMACC), we introduce the following boundary signal control methods for comparison.
Logic Control Methods.
(1) FT: All intersections in the simulation network adopt the Fixed Timing (FT) method.
(2) MP-BB: This study combines the Maximum Pressure and Bang-Bang (MP-BB) control methods to propose an MP-BB boundary signal control logic, the details of which can be found in Appendix A.
(3) MP-PI: Likewise, the study combines the Maximum Pressure and PI (MP-PI) control methods to introduce an MP-PI boundary signal control logic, the details of which can be found in Appendix A.
DRL Methods.
(4) BMAC1: Boundary Multi-Agent Control (BMAC) strategy without communication.
(5) BMAC2: There is no explicit state communication mechanism between agents, and parameters are shared between agents at the same boundary.
(6) BMAC3: The communication mode between agents on the same boundary under this model is that each agent can obtain the hidden state information of all agents on the boundary.
Ablation methods.
(7) BMAC4: This model achieves state information exchange among agents at the same boundary using the Bi-LSTM model. Unlike the state communication in BMACC, it does not incorporate the attention mechanism.
(8) BMAC5: In this model, the state communication process among

Fig. 7. Training effects of different learning models.


agents at the same boundary relies on the attention mechanism, excluding the Bi-LSTM model used in BMACC.
It should be noted that BMAC4 and BMAC5 constitute a discussion of different communication modes between agents at boundary intersections and an ablation experiment on the BMACC model.

4.4. Experiment results and discussions

4.4.1. Training convergence level
During the training of the multi-agent DRL model, a training round refers to the duration from the start to the end of the simulation. In this study, one simulation run lasts for 2.5 h, with a total of 150 training rounds. The convergence of the proposed learning models during the training process is illustrated in Fig. 7. Each convergence curve represents the average of the rewards of all agents on the same boundary.
Fig. 7 demonstrates that most DRL boundary signal control models have converged and exhibit relative stability. This suggests that the proposed state definition and reward mechanism are reasonable. The training effect trends of the first-boundary agents in Fig. 7(a) and the second-boundary agents in Fig. 7(b) show certain similarities. Among them, the proposed BMACC boundary signal control model demonstrates an outstanding training effect, followed by BMAC5, BMAC2, BMAC1, BMAC3, and BMAC4. The training results of the BMACC and BMAC5 models indicate that state information communication between agents facilitates the convergence of the learning models. The training effect of BMAC2 indicates that jointly training the trajectory data collected at all intersections of the same boundary with a parameter-shared agent is a relatively effective training method when the agents are homogeneous and the reward targets are consistent. The training results of BMAC3 and BMAC4 are inferior to BMAC1, which emphasizes the significance of effectively exchanging state information among agents positioned on the same boundary. The above results also preliminarily confirm the feasibility of applying DRL methods to urban network boundary control.

4.4.2. Computation cost
To further evaluate the computational performance of the proposed traffic signal control model, this study documents the time required to train and test the models. The training time for each model during the simulation is summarized in Fig. 8. The testing time of each model for one simulation round is presented in Fig. 9.

Fig. 8. Computational time on model training.

Fig. 9. Computational time spent on model testing.

Fig. 8 illustrates that training a DRL model on a large simulated road network is a computationally expensive task. This is primarily due to the highly time-consuming nature of large-scale road network simulations, which is further exacerbated when the road network experiences congestion. Fig. 8 shows that BMAC2, BMAC5, and BMACC are the learning models that require relatively little time to train. This may be attributed to the increasingly effective control of these learning models, which facilitates smoother traffic flow and consequently reduces the simulation runtime. During the test, one round of simulation lasts for 2.5 h. The duration of a simulation round for all learning control models in Fig. 9 is notably less than 2.5 h. This indicates that the DRL agent is a controller capable of responding quickly during forward calculation, and its real-time application potential is considerable. Furthermore, based on the training effect in Fig. 7, the training cost of the proposed model is acceptable.

4.4.3. Signal control performance comparison
When testing the learning methods described in Section 4.3, the simulation time remains 2.5 h. During the test, the neural network parameters that demonstrated the best performance during the 150 training rounds are selected.
I The cumulative status of vehicle changes for the protected region
One of the core goals of urban network boundary signal control is to maintain efficient traffic operation in the protected regions, that is, ensuring that the number of vehicles within the protected area does not exceed a specific value. Fig. 10 illustrates the effect of the control measures in maintaining the number of vehicles within the protected area.
Fig. 10 provides an intuitive view of the impact of the different boundary signal control models on vehicle changes within the protected region. The interrupted grey lines parallel to the x-axis in Fig. 10(a) and (b) represent the critical numbers of vehicles inside the protected areas: $\widehat{n}_1$ and $\widehat{n}_2$. As seen in the figure, the boundary control methods that consider the macroscopic traffic state of the road network reduce vehicle accumulation within the protected area to varying extents. BMACC demonstrates optimal control effectiveness at both the first and second boundaries. Additionally, BMAC1, BMAC2, and BMAC5 also exhibit favourable control efficacy. The control effect of the BMAC3 method meets the basic requirements of boundary control. The control effectiveness of the BMAC4 method is subpar. FT does not consider the function of boundary intersections in intercepting external vehicles, resulting in more vehicles entering the protected region. MP-BB and MP-PI are responsive traffic signal control strategies for urban network boundaries designed to prevent the number of vehicles in the protected area from surpassing a critical threshold.
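A proportional–integral gating law of the kind that MP-PI builds on can be sketched generically as follows. The gains, bounds, and update cadence here are illustrative assumptions — the paper's actual MP-PI logic is specified in its Appendix A.

```python
def pi_gate(q_prev, n_now, n_prev, n_crit, kp=2.0, ki=0.5, q_min=0.0, q_max=1800.0):
    """Generic PI gating: adapt the total inflow allowed across the boundary (veh/h)
    from the protected region's vehicle accumulation n(t) and its critical value."""
    q = q_prev - kp * (n_now - n_prev) - ki * (n_now - n_crit)
    return max(q_min, min(q_max, q))   # keep the ordered inflow within feasible bounds

# Accumulation drifting above the critical value n_crit = 1800 (first boundary):
q = 1200.0
for n_prev, n_now in [(1750, 1800), (1800, 1900), (1900, 2000)]:
    q = pi_gate(q, n_now, n_prev, n_crit=1800)
print(round(q, 1))   # the permitted inflow shrinks as the region overfills
```

The resulting inflow target would then still have to be translated into per-intersection green times — the step that the multi-agent formulation in this paper handles directly.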


Fig. 10. Changes in the number of vehicles inside the protected region.

Fig. 11. Average queue status of vehicles at different intersections of the road network.

Fig. 10 illustrates that the effective learning models prioritize the long-term impact of the current decision more than the other control methods. The effective learning models achieve a relatively obvious reduction in the number of vehicles within the protected area during the later stages of the simulation. This indicates that maintaining the number of vehicles in protected areas within a specific range can enhance the operational efficiency of the urban road network, thereby enabling more vehicles to complete their trips as quickly as possible. Furthermore, comparing the control effect of BMACC with the other control methods reveals that the communication of state information among agents can promote cooperation. The above analysis further verifies the advantages of applying DRL to boundary control in mitigating the accumulation of vehicles in protected areas. However, this alone does not fully elucidate the advantages of using DRL for collaborative signal control at boundary intersections. A further comparison of the control performance of the proposed method at the road-network level is needed.
II Operation status of road network intersections
To assess the operational status of intersections at various locations within the road network under boundary control, the vehicle queues at different intersections are documented during the simulation test, as depicted in Fig. 11.
Fig. 11(a) illustrates the average queue length of vehicles at all intersections along the first boundary during the simulation of the test road network; these intersections are numbered 101 to 116. Fig. 11(b) displays the average vehicle queue length at all intersections along the second boundary, numbered from 201 to 232. Fig. 11(c) depicts the average queue length of vehicles observed at the outermost 42 intersections of the network, numbered from 1 to 42. Fig. 11 indicates that the most effective control method during the simulation testing process is the proposed BMACC model. The model demonstrates relatively favourable control performance at both the first and second boundaries. It is important to note that this study only controls intersections on the first and second boundaries; all intersections shown in Fig. 11(c) adopt fixed signal timing. When the road network boundary employs the BMACC control method, the vehicle load at the outermost intersections of the road network is also relatively low. This demonstrates that the BMACC method facilitates the operational efficiency of the road network through coordinated signal control at its boundary intersections. Fig. 11 also indicates that the FT control method leads to vehicle clustering in the core area, inducing congestion and diminishing network efficiency. BMAC3 and BMAC4 exhibit poor boundary signal control, which can cause boundary congestion and propagate this congestion to upstream intersections. This indirectly proves that BMAC3 and BMAC4 employ communication modes that fail to accurately identify the key state information of the network boundary. The effectiveness of the other boundary control methods falls between that of FT and BMACC.
III Comparison of the overall operating performance of the road network
Another prerequisite for boundary control is ensuring that the overall traffic operation efficiency of the road network remains unaffected. To validate this, we use three key indicators to evaluate the operation efficiency of the network: the number of vehicle trips completed, the average delay time of vehicles, and the average speed. To bolster the credibility of the comparison results, each control method undergoes simulation testing with ten different random seeds.


Table 2
Friedman test results.

Performance index | $\chi_F^2$ | p
Average vehicle delay | 67.86 | ≪ 0.01
Trip completed number | 65.71 | ≪ 0.01
The detailed results are presented in the figure below.


Figs. 12 to 14 present the mean and standard deviation of the performance evaluation metrics for the simulation tests of each traffic signal control method under ten random seeds. Meanwhile, these figures visually illustrate the percentage improvement in performance of the proposed method compared to the others. Fig. 12 displays that BMACC outperforms the other methods in terms of average vehicle delay in the road network: compared with FT, MP-BB, MP-PI, BMAC1, BMAC2, BMAC3, BMAC4, and BMAC5, this method reduces the average vehicle delay by 23.4 %, 15.6 %, 18.8 %, 12.3 %, 11.3 %, 34.4 %, 31.4 %, and 4.2 %, respectively. As depicted in Fig. 13, with respect to the average speed of vehicles in the road network, BMACC outperforms FT, MP-BB, MP-PI, BMAC1, BMAC2, BMAC3, BMAC4, and BMAC5 by 26.2 %, 13.9 %, 16.9 %, 9.0 %, 5.7 %, 35.4 %, 26.6 %, and 1.2 %, respectively. As shown in Fig. 14, the BMACC method achieves better performance regarding vehicle trips completed, increasing by 14.2 %, 11.3 %, 14.2 %, 4.5 %, 6.8 %, 54.8 %, 36.5 %, and 1.1 % compared with the other methods. These results indicate that the BMACC model has obvious application potential for controlling boundary intersections in urban traffic networks.

Fig. 12. Average vehicle delay of the road network.

Fig. 13. Average vehicle speed of the road network.

The FT control represents a fixed signal timing method that does not consider the macroscopic traffic conditions within the road network. The MP-BB and MP-PI methods incorporate macroscopic traffic conditions into the signal control decisions at boundary intersections. Additionally, the efficacy of these two methods in controlling traffic surpasses that of the FT control approach. This demonstrates that road network boundary control can improve the operating efficiency of the entire road network to some extent. The BMAC5 method exhibits the second-best performance in boundary signal control based on the three overall operational evaluation indicators. This proves that effective communication of state information among intersection agents can improve the efficiency of traffic signal control at the road network boundary. BMAC3 and BMAC4 demonstrate the poorest control performance, which is even far inferior to that of FT signal control. This further underscores the significance of effective communication of state information between boundary intersection agents. From the results above, it is evident that BMAC1 and BMAC2 have also shown relatively favourable control outcomes. This suggests that when the rewards for agents at boundary intersections are consistent, focusing solely on their local state is feasible. Some methods exhibit large standard deviations in the tested performance indicators. This indicates that there are relatively obvious differences in traffic flow distribution across the different random seed tests.
IV Nonparametric tests.
The study utilizes the Friedman test (Zimmerman and Zumbo, 1993),
a nonparametric statistical method, to assess significant differences
among signal control models. The average vehicle delay and vehicle trip
completion number obtained by various methods under different
random seed tests were evaluated. The results are presented in Table 2.
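The Friedman test takes one sample of per-seed measurements per control method and tests whether the methods differ significantly. A minimal sketch with `scipy.stats.friedmanchisquare`, using hypothetical per-seed delay values (four methods rather than the paper's nine, purely for illustration):

```python
from scipy.stats import friedmanchisquare

# Hypothetical average-delay measurements (s): one row per control method,
# one column per random seed (the paper compares nine methods over ten seeds).
delay_by_method = [
    [210, 205, 220, 215, 212, 208, 218, 214, 211, 217],   # FT
    [185, 182, 190, 188, 186, 184, 189, 187, 185, 188],   # MP-BB
    [195, 192, 199, 197, 196, 193, 198, 196, 194, 197],   # MP-PI
    [160, 158, 166, 163, 161, 159, 165, 162, 160, 164],   # BMACC
]
stat, p = friedmanchisquare(*delay_by_method)
print(f"chi2_F = {stat:.2f}, p = {p:.6f}")   # a small p-value -> methods differ significantly
```

Because the test only ranks the methods within each seed, it makes no normality assumption about the delay measurements.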
Fig. 14. Trip completed number of the road network.

As shown in Table 2, for the average vehicle delay indicator, the Friedman test statistic across the different algorithms is 67.86, with a p-value far less than 0.01. This substantiates the presence of noteworthy disparities among the control algorithms, thereby indirectly affirming the efficacy of the proposed control algorithm. The Friedman test results for the number of vehicle trips completed also support this conclusion. However, the Friedman test can only determine the presence of significant differences among multiple model measurements and cannot show the differences between any two models. Therefore, this study further employs the post-hoc Nemenyi test (Bannò and Matassoni, 2023) to investigate the differences between the control models. The statistical test results


Fig. 15. Significant differences among various control algorithms.

are shown in Fig. 15.

The smaller the value in Fig. 15, the more significant the control
performance difference between the two algorithms. When comparing the
proposed BMACC method with the other control algorithms, the performance
disparity is most pronounced between BMACC and the FT, MP-PI, BMAC3, and
BMAC4 algorithms. BMACC shows a certain degree of difference from MP-BB,
BMAC1, and BMAC2. The difference in control performance between BMACC
and BMAC5 is relatively small, which indicates that BMAC5 is also an
effective boundary control method. This outcome is anticipated, since
the BMAC5 method is derived from ablation experiments conducted on
BMACC. Overall, the statistical results from the Friedman test and
post-hoc Nemenyi test indicate that BMACC is an effective method for
urban network boundary control. The control differences between the
algorithms reflected in Fig. 15 also support the results analyzed in the
previous chapters.

5. Conclusion

This paper proposes a novel data-driven, model-free cooperative signal
control method for boundary intersections. Its novelty mainly lies in:
(a) utilizing multi-agent DRL to optimize the signal timing of urban
road network boundary intersections; (b) combining the Bi-LSTM model and
the attention mechanism to communicate the state information of the
boundary intersections, thereby promoting signal cooperation between
them; (c) establishing state and reward mechanisms aligned with the
boundary signal control purpose, based on MFD theory and the actual
conditions of boundary intersections. The efficacy of the proposed
multi-agent DRL control model is validated through numerous simulation
experiments. The simulation results also verify the following points:
(1) effective boundary signal control for urban road networks not only
preserves the traffic operational efficiency of the protected region but
also enhances the overall operational efficiency of the entire road
network; (2) establishing a state information communication mechanism
between boundary intersections can further promote the effect of
boundary signal control; (3) the DRL method, with the aid of neural
networks, can better capture the traffic state evolution of boundaries
and road networks.

However, our study still has some limitations and should be further
improved in the future. Future research should not only incorporate more
urban traffic network boundaries but also focus on state communication
between intersection agents from different boundaries so that these
agents can obtain more comprehensive state information about the road
network. Additionally, integrating this study with urban public
transport priority strategies presents an interesting direction for
future research, which may further enhance the service efficiency of
urban transportation networks.

CRediT authorship contribution statement

Leilei Kang: Conceptualization, Methodology, Software, Investigation,
Visualization, Writing – original draft. Hao Huang: Investigation,
Validation, Writing – review & editing. Weike Lu: Investigation,
Validation, Writing – review & editing. Lan Liu: Data curation,
Methodology, Investigation, Writing – review & editing, Funding
acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial
interests or personal relationships that could have appeared to
influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

The project received research funding support from the National Natural
Science Foundation of China (No. 61873216, No. 62103292).

Appendix A. MP-BB and MP-PI control algorithms

This study first classifies the entrance roads of boundary intersections into three types: 1) entrance roads whose vehicles are about to enter the protected area, recorded as road_in; 2) entrance roads whose right- or left-turn lanes carry vehicles about to leave the protected area, recorded as road_mid; 3) entrance roads whose straight lanes carry vehicles about to leave the protected area, recorded as road_out. In Table A.1, maxpressure(·) means selecting the phase with the most significant "pressure" value from the four candidate phases of the boundary intersection; max_occ(·) returns the maximum vehicle density over all input roads; and outbb, midbb, and inbb are the occupancy thresholds that determine whether the corresponding road segment obtains the right of way, set to 0.2, 0.3, and 0.45, respectively. It should be noted that the time symbol t is omitted for simplicity of writing.


Table A.1. MP-BB control algorithm.

MP-BB control algorithm for boundary intersections
 1: Initialize and load simulation
 2: for i = 1 to I do
 3:   if n_i < n̄_i then
 4:     for j = 1 to J_i do
 5:       a_i^j ← maxpressure(·)
 6:     end for
 7:   end if
 8:   if n_i ≥ n̄_i then
 9:     for j = 1 to J_i do
10:       road_out^occ ← max_occ(road_out)
11:       if road_out^occ ≥ outbb then
12:         a_i^j ← phase(road_out^occ)
13:       else
14:         if road_mid != None then
15:           road_mid^occ ← max_occ(road_mid)
16:           if road_mid^occ ≥ midbb then
17:             a_i^j ← phase(road_mid^occ)
18:           else
19:             road_in^occ ← max_occ(road_in)
20:             if road_in^occ ≥ inbb then
21:               a_i^j ← phase(road_in^occ)
22:             else
23:               a_i^j ← maxpressure(road_mid, road_out)
24:             end if
25:           end if
26:         else
27:           road_in^occ ← max_occ(road_in)
28:           if road_in^occ ≥ inbb then
29:             a_i^j ← phase(road_in^occ)
30:           else
31:             a_i^j ← phase(road_out^occ)
32:           end if
33:         end if
34:       end if
35:     end for
36:   end if
37: end for
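The threshold cascade of Table A.1 for a congested boundary intersection (n_i ≥ n̄_i) can be sketched compactly. In the snippet below, the function only decides which approach group should receive the green; `max_occ`, `phase`, and `maxpressure` from the table are abstracted away, and the occupancy arguments stand in for detector readings in [0, 1]:

```python
# Thresholds outbb, midbb, inbb from Appendix A.
OUTBB, MIDBB, INBB = 0.2, 0.3, 0.45

def mp_bb_action(occ_in, occ_mid, occ_out, has_mid=True):
    """Decide which approach gets the green at a congested boundary
    intersection: 'out', 'mid', 'in', or fall back to 'maxpressure'."""
    if occ_out >= OUTBB:        # outbound straight lanes congested: release them
        return "out"
    if has_mid:
        if occ_mid >= MIDBB:    # turning lanes leaving the protected area
            return "mid"
        if occ_in >= INBB:      # inbound road saturated: let it discharge
            return "in"
        return "maxpressure"    # nothing critical: max-pressure fallback
    if occ_in >= INBB:          # no road_mid at this intersection
        return "in"
    return "out"

print(mp_bb_action(0.5, 0.1, 0.1))  # inbound saturated -> 'in'
```

The ordering encodes the priority of the rule: relieving outbound traffic first protects the region's discharge capacity, and inbound roads only win the green once their occupancy exceeds the much higher inbb threshold.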
In Table A.2, PI_i(·) represents the PI controller of the ith boundary; a detailed introduction to this controller can be found in Keyvan-Ekbatani et al. (2012, 2015b). B_i is the number of vehicles allowed to flow into the protected area, as calculated by the controller PI_i(·). n_ine is a dictionary that stores, for each boundary intersection, the intersection name and its entry section carrying the most vehicles. V(road_in) counts the number of vehicles on road_in; sort(n_ine) sorts n_ine by its key values; vehs_in is the number of vehicles expected to enter the protected area via a specific entry road segment of the target boundary intersection; nd(a) denotes the action value of boundary intersection nd; and MP-BB(nd) invokes the MP-BB control algorithm of Table A.1. inpi, midpi, and outpi are the occupancy thresholds used to determine whether an entrance road of the boundary intersection obtains the right of way; they are set to 0.5, 0.3, and 0.2, respectively. The controller gains are Kp1 = 110 and Ki1 = 6 for the first boundary, and Kp2 = 140 and Ki2 = 2 for the second boundary.

Table A.2. MP-PI control algorithm.

MP-PI control algorithm for boundary intersections
 1: Initialize and load simulation
 2: for i = 1 to I do
 3:   if n_i < n̄_i then
 4:     for j = 1 to J_i do
 5:       nd(a) ← maxpressure(·)
 6:     end for
 7:   end if
 8:   if n_i ≥ n̄_i then
 9:     B_i ← PI_i(·)
10:     n_ine ← {}
11:     for j = 1 to J_i do
12:       n_ine[nd(j)] ← max(V(road_in))
13:     end for
14:     n_ine ← dict(sort(n_ine))
15:     for nd in n_ine do
16:       if B_i > 0 then
17:         road_in^occ ← max_occ(road_in)
18:         if road_in^occ ≥ inpi then
19:           nd(a) ← phase(road_in^occ)
20:           vehs_in ← passvehs(road_in)
21:           B_i ← B_i − vehs_in
22:         else
23:           road_out^occ ← max_occ(road_out)
24:           if road_out^occ ≥ outpi then
25:             nd(a) ← phase(road_out^occ)
26:           else
27:             if road_mid != None then
28:               road_mid^occ ← max_occ(road_mid)
29:               if road_mid^occ ≥ midpi then
30:                 nd(a) ← phase(road_mid^occ)
31:               else
32:                 nd(a) ← maxpressure(road_mid)
33:                 B_i ← B_i
34:               end if
35:             else
36:               nd(a) ← phase(road_out^occ)
37:             end if
38:           end if
39:         end if
40:       else
41:         nd(a) ← MP-BB(nd)
42:         B_i ← B_i
43:       end if
44:     end for
45:   end if
46: end for
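The PI_i(·) gating law cited above follows the feedback-gating form of Keyvan-Ekbatani et al. (2012): the permitted inflow is corrected proportionally by the change in the protected region's accumulation and integrally by its deviation from a set-point. A minimal sketch is given below; only the gains Kp = 110 and Ki = 6 come from Appendix A, while the set-point and vehicle counts are illustrative:

```python
def pi_gated_inflow(q_prev, n_now, n_prev, n_hat, Kp=110.0, Ki=6.0):
    """Vehicles allowed through the boundary gates in the next control
    interval (the B_i of Table A.2), per the feedback-gating PI law:
    q(k) = q(k-1) - Kp*[n(k) - n(k-1)] + Ki*[n_hat - n(k)]."""
    q = q_prev - Kp * (n_now - n_prev) + Ki * (n_hat - n_now)
    return max(q, 0.0)  # inflow cannot be negative

# Accumulation above the set-point and still rising: the gate fully closes.
print(pi_gated_inflow(q_prev=400.0, n_now=3200.0, n_prev=3150.0,
                      n_hat=3000.0))  # -> 0.0
```

With the accumulation steady at the set-point, both correction terms vanish and the previous inflow is simply carried forward, which is the usual equilibrium behavior of this controller.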

References

Abdoos, M., Mozayani, N., & Bazzan, A. L. (2011). Traffic light control in non-stationary environments based on multi agent Q-learning. In 2011 14th International IEEE Conference on Intelligent Transportation Systems (ITSC) (pp. 1580–1585). IEEE.
Allsop, R. E. (1976). SIGCAP: A computer program for assessing the traffic capacity of signal-controlled road junctions. Traffic Engineering & Control, 17(Analytic).
Aboudolas, K., & Geroliminis, N. (2013). Perimeter and boundary flow control in multi-reservoir heterogeneous networks. Transportation Research Part B: Methodological, 55, 265–281.
Albalate, D., & Fageda, X. (2021). On the relationship between congestion and road safety in cities. Transport Policy, 105, 145–152.
Bokade, R., Jin, X., & Amato, C. (2023). Multi-agent reinforcement learning based on representational communication for large-scale traffic signal control. IEEE Access.
Chu, T., Wang, J., Codecà, L., & Li, Z. (2019). Multi-agent deep reinforcement learning for large-scale traffic signal control. IEEE Transactions on Intelligent Transportation Systems, 21(3), 1086–1095.
Chen, C., Huang, Y. P., Lam, W. H. K., Pan, T. L., Hsu, S. C., Sumalee, A., & Zhong, R. X. (2022). Data efficient reinforcement learning and adaptive optimal perimeter control of network traffic dynamics. Transportation Research Part C: Emerging Technologies, 142, Article 103759.
Daganzo, C. F. (2007). Urban gridlock: Macroscopic modeling and mitigation approaches. Transportation Research Part B: Methodological, 41(1), 49–62.
Daganzo, C. F., & Geroliminis, N. (2008). An analytical approximation for the macroscopic fundamental diagram of urban traffic. Transportation Research Part B: Methodological, 42(9), 771–781.
Ding, H., Zhou, J., Zheng, X., Zhu, L., Bai, H., & Zhang, W. (2020). Perimeter control for congested areas of a large-scale traffic network: A method against state degradation risk. Transportation Research Part C: Emerging Technologies, 112, 28–45.
Devailly, F. X., Larocque, D., & Charlin, L. (2024). Model-based graph reinforcement learning for inductive traffic signal control. IEEE Open Journal of Intelligent Transportation Systems.
Fu, H., Liu, N., & Hu, G. (2017). Hierarchical perimeter control with guaranteed stability for dynamically coupled heterogeneous urban traffic. Transportation Research Part C: Emerging Technologies, 83, 18–38.
Godfrey, J. W. (1969). The mechanism of a road network. Traffic Engineering & Control, 8(8).
Graves, A., Fernández, S., & Schmidhuber, J. (2005, September). Bidirectional LSTM networks for improved phoneme classification and recognition. In International Conference on Artificial Neural Networks (pp. 799–804). Berlin, Heidelberg: Springer Berlin Heidelberg.
Geroliminis, N., & Daganzo, C. F. (2008). Existence of urban-scale macroscopic fundamental diagrams: Some experimental findings. Transportation Research Part B: Methodological, 42(9), 759–770.
Geroliminis, N., Haddad, J., & Ramezani, M. (2012). Optimal perimeter control for two urban regions with macroscopic fundamental diagrams: A model predictive approach. IEEE Transactions on Intelligent Transportation Systems, 14(1), 348–359.
Genders, W., & Razavi, S. (2019). An open-source framework for adaptive traffic signal control. arXiv preprint arXiv:1909.00395.
Haddad, J., Ramezani, M., & Geroliminis, N. (2013). Cooperative traffic control of a mixed network with two urban regions and a freeway. Transportation Research Part B: Methodological, 54, 17–36.
Haddad, J. (2017). Optimal perimeter control synthesis for two urban regions with aggregate boundary queue dynamics. Transportation Research Part B: Methodological, 96, 1–25.
Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., ... & Silver, D. (2018, April). Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 32, No. 1).
Han, G., Liu, X., Wang, H., Dong, C., & Han, Y. (2024). An attention reinforcement learning-based strategy for large-scale adaptive traffic signal control system. Journal of Transportation Engineering, Part A: Systems, 150(3), 04024001.
Liang, X., Du, X., Wang, G., & Han, Z. (2019). A deep reinforcement learning network for traffic light cycle control. IEEE Transactions on Vehicular Technology, 68(2), 1243–1253.
Huang, H., Hu, Z., Li, M., Lu, Z., & Wen, X. (2024). Cooperative optimization of traffic signals and vehicle speed using a novel multi-agent deep reinforcement learning. IEEE Transactions on Vehicular Technology.
Liu, J., Qin, S., Su, M., Luo, Y., Zhang, S., Wang, Y., & Yang, S. (2023). Traffic signal control using reinforcement learning based on the teacher-student framework. Expert Systems with Applications, 228, Article 120458.
Krajzewicz, D., Erdmann, J., Behrisch, M., & Bieker, L. (2012). Recent development and applications of SUMO–Simulation of Urban MObility. International Journal on Advances in Systems and Measurements, 5(3&4).
Keyvan-Ekbatani, M., Kouvelas, A., Papamichail, I., & Papageorgiou, M. (2012). Exploiting the fundamental diagram of urban networks for feedback-based gating. Transportation Research Part B: Methodological, 46(10), 1393–1403.
Keyvan-Ekbatani, M., Papageorgiou, M., & Papamichail, I. (2013). Urban congestion gating control based on reduced operational network fundamental diagrams. Transportation Research Part C: Emerging Technologies, 33, 74–87.
Keyvan-Ekbatani, M., Papageorgiou, M., & Knoop, V. L. (2015). Controller design for gating traffic control in presence of time-delay in urban road networks. Transportation Research Procedia, 7, 651–668.
Keyvan-Ekbatani, M., Yildirimoglu, M., Geroliminis, N., & Papageorgiou, M. (2015). Multiple concentric gating traffic control in large-scale urban networks. IEEE Transactions on Intelligent Transportation Systems, 16(4), 2141–2154.
Kolat, M., Kővári, B., Bécsi, T., & Aradi, S. (2023). Multi-agent reinforcement learning for traffic signal control: A cooperative approach. Sustainability, 15(4), 3479.
Kumar, R., Sharma, N. V. K., & Chaurasiya, V. K. (2024). Adaptive traffic light control using deep reinforcement learning technique. Multimedia Tools and Applications, 83(5), 13851–13872.
Little, J., Kelson, M. D., & Gartner, N. H. (1981). MAXBAND: A program for setting signals on arteries and triangular networks. Transportation Research Record 795. TRB, National Research Council, Washington.
Sims, A. G. (1981, January). SCAT: The Sydney coordinated adaptive traffic system. In Symposium on Computer Control of Transport 1981: Preprints of Papers (pp. 22–26). Barton, ACT: Institution of Engineers, Australia.
Li, Y., Yildirimoglu, M., & Ramezani, M. (2021). Robust perimeter control with cordon queues and heterogeneous transfer flows. Transportation Research Part C: Emerging Technologies, 126, Article 103043.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
Merbah, A., & Ben-Othman, J. (2024). Optimizing traffic flow with reinforcement learning: A study on traffic light management. IEEE Transactions on Intelligent Transportation Systems.
Ni, W., & Cassidy, M. J. (2019). Cordon control with spatially-varying metering rates: A reinforcement learning approach. Transportation Research Part C: Emerging Technologies, 98, 358–369.
Noaeen, M., Naik, A., Goodman, L., Crebo, J., Abrar, T., Abad, Z. S. H., & Far, B. (2022). Reinforcement learning in urban network traffic signal control: A systematic literature review. Expert Systems with Applications, 199, Article 116830.
Prashanth, L. A., & Bhatnagar, S. (2010). Reinforcement learning with function approximation for traffic signal control. IEEE Transactions on Intelligent Transportation Systems, 12(2), 412–421.
Pastor, E., Pena, M., & Solé, M. (2004). A short introduction to the TRANSYT verification tool. Dept. Comput. Archit., Tech. Univ. Catalonia, Barcelona, Spain, UPC/DAC Tech. Rep. RR-2004/14.
Peng, P., Wen, Y., Yang, Y., Yuan, Q., Tang, Z., Long, H., & Wang, J. (2017). Multi-agent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069.
Robertson, D. I., & Bretherton, R. D. (1991). Optimizing networks of traffic signals in real time: The SCOOT method. IEEE Transactions on Vehicular Technology, 40(1), 11–15.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., & Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
Tampuu, A., Matiisen, T., Kodelja, D., Kuzovkin, I., Korjus, K., Aru, J., & Vicente, R. (2017). Multi-agent cooperation and competition with deep reinforcement learning. PLoS ONE, 12(4), e0172395.
Vlahogianni, E. I., Karlaftis, M. G., & Golias, J. C. (2006). Statistical methods for detecting nonlinearity and non-stationarity in univariate short-term time-series of traffic volume. Transportation Research Part C: Emerging Technologies, 14(5), 351–367.
Van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., & Modayil, J. (2018). Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Webster, F. V. (1958). Traffic signal settings (No. 39).
Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279–292.
Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., & Freitas, N. (2016). Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning (pp. 1995–2003). PMLR.
Wei, H., Zheng, G., Yao, H., & Li, Z. (2018, July). IntelliLight: A reinforcement learning approach for intelligent traffic light control. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 2496–2505).
Wei, H., Xu, N., Zhang, H., Zheng, G., Zang, X., Chen, C., ... & Li, Z. (2019, November). CoLight: Learning network-level cooperation for traffic signal control. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (pp. 1913–1922).
Wang, T., Zhu, Z., Zhang, J., Tian, J., & Zhang, W. (2024). A large-scale traffic signal control algorithm based on multi-layer graph deep reinforcement learning. Transportation Research Part C: Emerging Technologies, 162, Article 104582.
Yoon, J., Kim, S., Byon, Y. J., & Yeo, H. (2020). Design of reinforcement learning for perimeter control using network transmission model based macroscopic traffic simulation. PLoS ONE, 15(7), e0236655.
Yan, L., Zhu, L., Song, K., Yuan, Z., Yan, Y., Tang, Y., & Peng, C. (2023). Graph cooperation deep reinforcement learning for ecological urban traffic signal control. Applied Intelligence, 53(6), 6248–6265.
Yang, S., Yang, B., Zeng, Z., & Kang, Z. (2023). Causal inference multi-agent reinforcement learning for traffic signal control. Information Fusion, 94, 243–256.
Zimmerman, D. W., & Zumbo, B. D. (1993). Relative power of the Wilcoxon test, the Friedman test, and repeated-measures ANOVA on ranks. The Journal of Experimental Education, 62(1), 75–86.
Zhong, N., Cao, J., & Wang, Y. (2017). Traffic congestion, ambient air pollution, and health: Evidence from driving restrictions in Beijing. Journal of the Association of Environmental and Resource Economists, 4(3), 821–856.
Zhou, D., & Gayah, V. V. (2021). Model-free perimeter metering control for two-region urban networks using deep reinforcement learning. Transportation Research Part C: Emerging Technologies, 124, Article 102949.
Zhou, D., & Gayah, V. V. (2023). Scalable multi-region perimeter metering control for urban networks: A multi-agent deep reinforcement learning approach. Transportation Research Part C: Emerging Technologies, 148, Article 104033.
Zhao, Z., Wang, K., Wang, Y., & Liang, X. (2024). Enhancing traffic signal control with composite deep intelligence. Expert Systems with Applications, 244, Article 123020.
