Abstract—In order to solve the problem that existing traffic signal control methods usually need to predefine the phase sequence or phase duration, a single-intersection traffic signal control method based on improved deep reinforcement learning Double DQN (DDQN) is proposed. Firstly, a traffic signal control model based on the DDQN deep reinforcement learning algorithm is constructed to optimize the estimated value of the action-value function and the iterative process of the target value; the action space is extended to a three-dimensional space of phase selection, phase duration and a loss factor, so that the agent adaptively decides the next phase and its duration. (2) Secondly, a new loss function based on the loss factor is proposed, in which the weight of each loss term is assigned intelligently and the update process of the loss function is dynamically optimized. (3) Thirdly, a random vehicle generation model based on the Weibull distribution is designed to simulate the traffic flow in different periods.

TSC methods based on deep reinforcement learning (DRL) can be divided into three categories according to their action setting scheme: phase switching, phase selection and phase duration [4]. Aslani et al. [5] designed the state space based on the queue length at the intersection and set the traffic signal control action as phase switching. Tan et al. [6] designed the state space based on the current traffic light phase, queue length and average speed at the intersection.
II. THREE ELEMENTS OF TRAFFIC SIGNAL CONTROL BASED
ON DEEP REINFORCEMENT LEARNING
In this study, a simulation environment of a typical intersection with 12 lanes is built based on SUMO. As shown in Fig. 1, each entrance direction contains three lanes: a left-turn lane, a straight lane, and a shared straight and right-turn lane. The intersection contains four phases: north-south straight (lanes 1, 2, 7, 8), north-south left turn (lanes 3, 9), east-west straight (lanes 4, 5, 10, 11), and east-west left turn (lanes 6, 12), where right-turn vehicles are not controlled by the traffic signals.
Fig. 1 Schematic diagram of an intersection.

The TSC model can be approximated as a typical Markov Decision Process (MDP), that is, the dynamic process described by the tuple (S, A, P, R, γ) [13]. Here, S is the state space, the set of variables that describe the state information of the intersection, and s represents the traffic state at a given scale. A is the action space, the set of all phases in the traffic signal control phase library. P is the state transition probability, i.e., the probability that the intersection moves to the next state after an action is executed in the current state; in this study, the temporal difference method is used to approximate P. R is the reward set, and r represents the reward fed back by the environment after performing action a in state s. γ is the discount factor, which indicates the degree to which the executed action influences subsequent states.

A. State Space
In this study, the traffic state of the intersection is discretized along the entrance lanes of the intersection. The number of discretized monitoring grids H is determined by the effective monitoring length of the entrance lane and the effective grid length of a standard vehicle [14]. As shown in Fig. 2, the effective monitoring length of each incoming lane is discretized into grids, and each grid describes the discrete index of a vehicle in the monitoring area. The distribution of vehicles in a given approach lane is shown in Fig. 2b, and the corresponding vehicle position table and speed table are shown in Fig. 2c and Fig. 2d. In Fig. 2c, a grid value of 0 indicates that there is no vehicle at the monitoring location corresponding to that grid, while a grid value of 1 indicates that a vehicle occupies the corresponding monitoring location. In Fig. 2d, a grid value greater than 0 gives the speed of the vehicle at the corresponding monitoring location.

Fig. 2 Schematic diagram of the state space.
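To make the grid encoding of Fig. 2 concrete, the following is a minimal sketch of how one entrance lane could be discretized into the position and speed tables (Fig. 2c/2d) using SUMO's TraCI Python API; the cell length, monitoring length and lane ID are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
import traci  # SUMO's Python TraCI client

CELL_LENGTH = 7.0        # assumed effective grid length of a standard vehicle (m)
MONITOR_LENGTH = 400.0   # assumed effective monitoring length of an entrance lane (m)
N_CELLS = int(MONITOR_LENGTH // CELL_LENGTH)

def lane_state(lane_id):
    """Discretize one entrance lane into position and speed grids (Fig. 2c/2d)."""
    position = np.zeros(N_CELLS)   # 1 -> a vehicle occupies this cell, 0 -> empty
    speed = np.zeros(N_CELLS)      # >0 -> speed (m/s) of the vehicle in this cell
    lane_length = traci.lane.getLength(lane_id)
    for veh in traci.lane.getLastStepVehicleIDs(lane_id):
        # Distance to the stop line, so cell 0 is the cell closest to the intersection.
        dist = lane_length - traci.vehicle.getLanePosition(veh)
        cell = int(dist // CELL_LENGTH)
        if cell < N_CELLS:
            position[cell] = 1.0
            speed[cell] = traci.vehicle.getSpeed(veh)
    return position, speed
```

Stacking these per-lane vectors over the 12 entrance lanes would give the full state tensor fed to the agent.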
B. Action Space
A typical traffic signal control action space mainly consists of a non-conflicting phase selection scheme or a variable phase duration scheme [15]. Although the phase selection scheme can quickly switch to the phase with the largest current traffic demand, it requires a predefined phase duration as the agent's decision interval. When traffic demand is unbalanced across directions, the non-conflicting phase selection scheme therefore cannot flexibly exploit the potential of DRL in TSC. In the proposed method, the original one-dimensional action space is discretized into a three-dimensional space containing phase selection, phase duration and a loss factor, and the weight of each term in the loss function is adjusted by the agent. The action is represented as

  a_t = p_t ∪ g_t ∪ β_t,  p_t ∈ P,  g_t ∈ G,                    (2)

where β_t is the loss factor, p_t denotes a phase from the predefined phase library P, and g_t denotes a duration from the predefined phase duration range G.
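As an illustration of the three-dimensional action in Eq. (2), the sketch below enumerates a flat index over (phase, duration, loss factor) triples. The four phases follow Fig. 1; the duration grid inside [Gmin, Gmax] = [10, 30] s and the candidate loss-factor values are assumptions made only for this example.

```python
from itertools import product

PHASES = ["NS_straight", "NS_left", "EW_straight", "EW_left"]   # phase library P (Fig. 1)
DURATIONS = [10, 15, 20, 25, 30]       # assumed discretization of G = [Gmin, Gmax] in seconds
LOSS_FACTORS = [0.25, 0.5, 0.75, 1.0]  # assumed candidate values of the loss factor beta

# Flat index <-> (phase p, duration g, loss factor beta), as in a_t of Eq. (2).
ACTIONS = list(product(range(len(PHASES)), DURATIONS, LOSS_FACTORS))

def decode(action_index):
    phase_idx, duration, beta = ACTIONS[action_index]
    return PHASES[phase_idx], duration, beta

print(len(ACTIONS))   # 4 * 5 * 4 = 80 discrete actions
print(decode(0))      # ('NS_straight', 10, 0.25)
```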
C. Reward Function
A reward is the reward or punishment that the environment gives to the agent after the agent has performed an action; this reward and punishment information is the direction and key for the agent to find the optimal decision [16]. In DRL-based TSC methods, the reward function is usually defined by traffic efficiency indicators (such as the total number of queued vehicles or the total waiting time of vehicles). In this study, considering the high saturation of the traffic flow, the reward function is designed as the difference in the number of queued vehicles, with the queue length at the intersection at the previous decision time taken as the baseline, so as to guide the agent to accurately judge the quality of an action. The reward at decision time t is given by

  r_t = q_{t-1} - q_t,                                          (3)

where r_t and q_t are the reward obtained by the agent at time t and the number of queued vehicles at the intersection, respectively. When r_t > 0, the traffic condition at the intersection at time t has improved relative to that at time t-1; the agent then positively updates the neural network parameters according to this feedback signal and positively rewards the actions performed within the time step.
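A minimal sketch of the queue-difference reward in Eq. (3), using TraCI's halted-vehicle count as the queue measure; the lane IDs are placeholders.

```python
import traci

ENTRANCE_LANES = ["E_in_0", "E_in_1", "E_in_2"]   # placeholder lane IDs for the monitored approaches

def queue_length():
    """Total number of halted (queued) vehicles on the monitored entrance lanes."""
    return sum(traci.lane.getLastStepHaltingNumber(lane) for lane in ENTRANCE_LANES)

prev_queue = queue_length()           # q_{t-1}, the baseline at the last decision time
# ... the agent acts and the simulation advances to the next decision time ...
curr_queue = queue_length()           # q_t
reward = prev_queue - curr_queue      # Eq. (3): r_t > 0 means the queue shrank
prev_queue = curr_queue
```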
III. DEEP REINFORCEMENT LEARNING ALGORITHM DESIGN

A. Basic Framework of TSC Based on DRL
Reinforcement learning is a field of machine learning that emphasizes observing the environment and acting so as to maximize the expected benefit. As shown in Fig. 3, the agent senses the state of the environment and decides the optimal action to take in the environment. After the environment executes the action, the state changes and a reward-and-punishment signal is generated and fed back to the agent. The agent uses this feedback signal to continuously optimize its policy until it receives the termination signal from the environment. The state of the environment, the feedback reward-and-punishment signal and the action output by the agent constitute the basic framework of a reinforcement learning algorithm, forming a dynamic Markov decision system.

Fig. 3 A reinforcement learning model.

The goal of DRL is to decide the optimal policy among different strategies through continuous interaction with the environment and trial and error, so that the expected value of the cumulative return under this policy is maximized [17].

B. TSC Method Based on Improved DDQN Algorithm
The core of the improved DDQN algorithm is to introduce a phase duration action space while keeping phase selection as the original action space, so that the agent can decide both the phase and the phase duration of the TSC model and flexibly determine the optimal timing scheme [19]. Therefore, in order to intelligently adjust the weight of each loss term in the loss function, this study designs the Q function as an action-value function with phase selection, phase duration and the loss factor as parameters:

  Q(s_t, p_t, g_t, β_t; θ) = r_t + γ · Q(s_{t+1}, argmax_{p, g, β} Q(s_{t+1}, p, g, β; θ); θ'),   (4)

where p_t, g_t and β_t represent the phase selection, phase duration and loss factor of the agent at time t, respectively, θ denotes the parameters of the evaluation network and θ' those of the target network.
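The following is a minimal PyTorch sketch of the double-estimator target in Eq. (4): the greedy action for the next state is chosen with the evaluation network θ and then valued with the target network θ'. For brevity the (p, g, β) triples are flattened into a single action index; the network objects and batch tensors are assumptions, not the paper's implementation.

```python
import torch

GAMMA = 0.95  # discount factor from Table I

@torch.no_grad()
def ddqn_target(reward, next_state, q_net, target_net):
    """Eq. (4): select the greedy next action with q_net (theta), evaluate it with target_net (theta')."""
    next_q_online = q_net(next_state)                        # shape [batch, n_actions]
    best_action = next_q_online.argmax(dim=1, keepdim=True)  # argmax over flattened (p, g, beta)
    next_q_target = target_net(next_state).gather(1, best_action).squeeze(1)
    return reward + GAMMA * next_q_target
```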
The TSC method proposed in this paper is based on the improved DDQN; its specific execution process is summarized as Algorithm 1, and the framework of the improved DRL algorithm is shown in Fig. 4.

Algorithm 1: TSC method based on improved DDQN
1  Initialize the training parameters: training rounds N, maximum number of training steps Q, etc.;
2  Initialize the experiment parameters: iteration period T, greedy coefficient ε, learning rate α, etc.;
3  for episode = 0 to N do
4      Initialize the road network environment and load the traffic flow data;
5      for t = 1 to Q do
6          Observe the traffic state s_t and calculate the estimated Q values;
7          Perform action a_t with phase duration g_t;
8          Calculate the reward r_t and obtain the next state s_{t+1};
9          Store (s_t, a_t, g_t, r_t, s_{t+1}) in the experience pool;
10         Sample B transitions to train the neural network;
11         Update the evaluation network parameters θ;
12         if (number of decisions mod T) = 0 then
13             Update the target network parameters θ';
14         end if
15     end for
16 end for
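As a small illustration of steps 11-14 of Algorithm 1, the sketch below keeps an evaluation network θ and a target network θ' with identical structure and copies θ into θ' every T decisions; the layer sizes are arbitrary placeholders, while T follows Table I.

```python
import torch.nn as nn

# Minimal stand-ins for the evaluation network (theta) and the target network (theta').
q_net = nn.Sequential(nn.Linear(48, 128), nn.ReLU(), nn.Linear(128, 80))
target_net = nn.Sequential(nn.Linear(48, 128), nn.ReLU(), nn.Linear(128, 80))
target_net.load_state_dict(q_net.state_dict())   # start both networks from identical parameters

SYNC_PERIOD = 100   # iteration period T from Table I

def maybe_sync_target(decision_count):
    """Steps 12-14 of Algorithm 1: copy theta into theta' every T decisions."""
    if decision_count % SYNC_PERIOD == 0:
        target_net.load_state_dict(q_net.state_dict())
```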
In order to verify the performance of the proposed method, an intersection simulation environment is built on the microscopic traffic simulation platform SUMO, and a control platform written in Python interacts with SUMO through the TraCI interface.
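A minimal sketch of such a Python/TraCI control loop is shown below; the configuration file name, traffic-light ID, decision interval and phase indices are placeholders rather than the settings used in this study.

```python
import traci

# "cross.sumocfg" and "tls0" are placeholder names for the SUMO scenario and traffic light.
traci.start(["sumo", "-c", "cross.sumocfg"])

step = 0
while step < 3600:                    # one simulated hour at 1 s per step
    if step % 30 == 0:                # a decision point (fixed interval here, for illustration only)
        traci.trafficlight.setPhase("tls0", (step // 30) % 4)   # pick one of the four phases
        traci.trafficlight.setPhaseDuration("tls0", 30)         # and its green duration
    traci.simulationStep()
    step += 1

traci.close()
```

In the actual method, the phase index and duration at each decision point would come from the agent rather than from this fixed cycle.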
Intersection parameter setting: As shown in Fig. 1, a typical intersection is taken as the research object in this study. Each road is bidirectional, and the entrance of each direction contains three lanes: a left-turn lane, a straight lane, and a shared straight and right-turn lane. The length of each lane is 400 m, and the maximum speed limit is 50 km/h. In addition, the maximum phase duration of the intersection is set to 100 s, and the yellow-light switching time between different phases is set to 3 s.
Traffic flow setting at the intersection: The traffic flow information is shown in Table 1, where E, W, S and N represent east, west, south and north, respectively, and EW denotes the lanes running from east to west. The proportions of straight, right-turn and left-turn traffic flow in the entrance lanes of the intersection are set based on the statistics of the data set. The arrival of traffic flow follows a Weibull distribution, whose probability density is

  f(x; λ, k) = (k/λ)(x/λ)^(k-1) e^(-(x/λ)^k),  x ≥ 0
  f(x; λ, k) = 0,                              x < 0            (9)

where x is a random variable, λ is the scale coefficient, and k is the shape coefficient.
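A minimal sketch of drawing vehicle departure times from a Weibull distribution with NumPy is given below; the shape, scale and demand values are illustrative assumptions, not the parameters fitted in this study.

```python
import numpy as np

rng = np.random.default_rng(42)

K, LAM = 2.0, 900.0      # illustrative shape k and scale lambda (seconds into the hour)
N_VEHICLES = 600         # illustrative demand for one simulated hour

# np.random.Generator.weibull samples with unit scale, so multiply by lambda to apply the scale coefficient.
depart_times = np.sort(LAM * rng.weibull(K, size=N_VEHICLES))
depart_times = depart_times[depart_times < 3600.0]   # keep departures inside the simulation horizon

print(depart_times[:5])  # earliest departure times, e.g. for writing a SUMO route file
```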
Algorithm parameter setting: In the improved DDQN algorithm, the target network and the evaluation network are both fully connected neural networks; each network outputs three different value distributions, and the different value distributions share the first three fully connected layers. The specific parameters are shown in Table I.

TABLE I. PARAMETERS CONFIGURATION
Parameters                              Value
Learning rate α                         0.0003
Discount factor γ                       0.95
Experience pool size M                  50000
Mini-batch size B                       256
Greedy strategy coefficient ε           0.01
Iteration period T                      100
Simulation rounds N                     300
Maximum simulation steps Q              3600
Phase duration range Gmin, Gmax (s)     10, 30
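The sketch below illustrates the described structure with a hypothetical PyTorch module: three shared fully connected layers followed by three separate output heads for phase, duration and loss factor. The state dimension, layer widths and head sizes are assumptions; the learning rate is the α value from Table I.

```python
import torch
import torch.nn as nn

class ThreeHeadQNet(nn.Module):
    """Shared trunk of three FC layers with separate value heads for phase, duration and loss factor."""
    def __init__(self, state_dim=48, n_phases=4, n_durations=5, n_factors=4):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.phase_head = nn.Linear(128, n_phases)        # value distribution over phases
        self.duration_head = nn.Linear(128, n_durations)  # value distribution over durations
        self.factor_head = nn.Linear(128, n_factors)      # value distribution over loss factors

    def forward(self, state):
        h = self.trunk(state)
        return self.phase_head(h), self.duration_head(h), self.factor_head(h)

net = ThreeHeadQNet()
optimizer = torch.optim.Adam(net.parameters(), lr=3e-4)   # learning rate alpha from Table I
```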
B. Analysis of comparative algorithms
Timing control: Timed traffic signal control is based on a single-intersection scenario in which vehicles arrive uniformly over a certain historical period. A mathematical model is used to calculate the cycle length and phase splits, and the controller switches sequentially according to the predefined phase sequence and phase durations.
Adaptive control: Adaptive traffic signal control counts the number of vehicles in each lane from the vehicles arriving at the single intersection in real time. The agent extends the current phase or switches phases sequentially according to whether there are waiting vehicles in the lanes served by the current phase (a sketch of this extend-or-switch rule is given after these descriptions).
Phase switching traffic signal control: In the phase switching method, the agent monitors the single intersection in real time, extracts characteristic information of the traffic state, and intelligently decides whether to extend the current phase or switch to the next phase in order.
Phase selection traffic signal control: In the phase selection method, the agent extracts characteristic information of the traffic state of the single intersection in real time and intelligently decides whether to extend the current phase or jump to any phase in the predefined phase library.
Phase duration traffic signal control: In the phase duration method, the agent intelligently decides the duration of each phase within a fixed phase sequence based on the traffic state information of the intersection.
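A minimal sketch of the extend-or-switch rule shared by the adaptive and phase-switching baselines above; the mapping from phases to their green lanes uses placeholder lane IDs.

```python
import traci

PHASE_LANES = {                      # placeholder mapping from phase index to its green lanes
    0: ["NS_s_0", "NS_s_1"],         # north-south straight
    1: ["NS_l_0"],                   # north-south left turn
    2: ["EW_s_0", "EW_s_1"],         # east-west straight
    3: ["EW_l_0"],                   # east-west left turn
}

def next_phase(current_phase):
    """Extend the current phase while its lanes still have waiting vehicles, else switch in sequence."""
    waiting = sum(traci.lane.getLastStepHaltingNumber(l) for l in PHASE_LANES[current_phase])
    return current_phase if waiting > 0 else (current_phase + 1) % len(PHASE_LANES)
```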
C. Experimental Results
Comparison of traffic benefit indicators in the training set: Fig. 5 shows the cumulative reward curves of the four control methods during training on the training set. Through comparison, the convergence of the DDQN algorithm based on improved DRL is significantly better than that of the other three control methods, which indicates that it can more accurately decide the optimal phase timing scheme. To evaluate the performance of the control methods on the TSC task, the traffic efficiency indicators of the different control methods are compared. Figs. 5-8 show the change curves of the average queue length, the average waiting time and the average number of stops of the six control methods. In the early stage of training, because the agent is still exploring and the sample size of the experience pool is small, the traffic signal timing scheme decided by the agent is not yet accurate, so the traffic efficiency indicators increase significantly. As the number of training rounds increases, the traffic efficiency indicators decrease and gradually converge. As shown in Table II, compared with the other five algorithms, the average queue length is reduced by 29.7%, 38.3%, 45.6%, 53.9% and 55.5%, the average waiting time is reduced by 28.1%, 31.5%, 36.5%, 43.9% and 56.7%, and the average number of stops is reduced by 18.8%, 32.5%, 44.2%, 48.4% and 52.6%, respectively. In summary, the DDQN algorithm based on improved deep reinforcement learning achieves better performance on the three traffic benefit indicators of average queue length, average waiting time and average number of stops. This shows that controlling the phase selection and the phase duration of traffic signals at the same time can more effectively alleviate traffic congestion and guide the agent toward the optimal phase timing decision process.
TABLE II. COMPARISON OF THE PERFORMANCE OF DIFFERENT METHODS ON THE TRAINING SET
Algorithm            Queue length/m   Waiting time/s   Stops
Timing control       376.8            87.6             44.7
Phase duration       363.9            67.6             41.1
Adaptive control     308.1            59.7             38.0
Phase switching      271.6            55.3             31.4
Phase selection      238.4            52.7             26.1
Improved DDQN        167.5            37.9             21.2

Fig. 7 Average waiting time of different TSC methods.
Fig. 10 Average waiting time of different TSC methods.
Fig. 11 Average braking times of different TSC methods.

TABLE III. COMPARISON OF THE PERFORMANCE OF DIFFERENT METHODS
Algorithm            Queue length/m   Waiting time/s   Stops
Phase duration       556.6            103.7            62.6
Timing control       528.7            99.6             61.3
Adaptive control     466.1            95.2             58.9
Phase switching      420.1            90.8             48.9
Phase selection      340.7            83.1             38.6
Improved DDQN        200.3            54.3             24.3

V. CONCLUSION
We propose a DDQN algorithm based on improved DRL, which uses a neural network to extract the traffic state features of the intersection, maps them to the next phase and phase duration of the TSC model, designs a new loss function based on the loss factor, and dynamically optimizes the update process of the loss function. An intersection traffic signal control scene is built on the SUMO microscopic traffic simulation platform; a real single-intersection traffic flow data set is used to train the traffic signal control model, and multiple traffic flow test sets are constructed to evaluate the model. The training and test results show that the DDQN algorithm based on improved DRL effectively overcomes the shortcomings of a predefined phase sequence and phase duration. Compared with the other traffic signal control algorithms, the average queue length, average waiting time and average number of stops of vehicles are significantly reduced, which effectively improves the efficiency of intersection traffic. However, the current research is limited to the TSC problem of a single intersection; the coordination of arterial traffic signals and the cooperative control of a local road network will be the focus of the next stage of research.

REFERENCES
[1] Z. Yu, N.-W. Ning, Y.-L. Zheng, et al., "Review of Intelligent Traffic Signal Control Strategy Driven by Deep Reinforcement Learning," Computer Science, 2023, 50(04): 159-171.
[2] B. Zhou, X.-D. Wu, D.-F. Ma, et al., "A Review of Deep Reinforcement Learning Application in Urban Traffic Signal Control Methods," Modern Transportation and Metallurgical Materials, 2022, 2(03): 84-93.
[3] W.-C. Yang, L. Zhang, Y.-P. Shi, et al., "Application Review of Agent Technology in Urban Traffic Signal Control System," Journal of Wuhan University of Technology (Transportation Science & Engineering), 2014, 38(04): 709-718.
[4] X.-Q. Chen, Y.-Z. Zhu, C.-F. Lv, "Intersection Signal Phase and Timing Optimization Method Based on Mixed Proximal Policy Optimization," Journal of Transportation Systems Engineering and Information Technology, 2023, 23(01): 106-113.
[5] M. Aslani, M. S. Mesgari, M. Wiering, "Adaptive traffic signal control with actor-critic methods in a real-world traffic network with different traffic disruption events," Transportation Research Part C: Emerging Technologies, 2017, 85: 732-752.
[6] K. L. Tan, A. Sharma, S. Sarkar, "Robust deep reinforcement learning for traffic signal control," Journal of Big Data Analytics in Transportation, 2020, 2: 263-274.
[7] T. Wu, P. Zhou, K. Liu, et al., "Multi-Agent Deep Reinforcement Learning for Urban Traffic Light Control in Vehicular Networks," IEEE Transactions on Vehicular Technology, 2020, 69(8): 8243-8256.
[8] G.-Q. Zhang, F.-R. Chang, J.-L. Jin, et al., "Safety-Driven Adaptive Signal Control Method for Urban Intersections," China Safety Science Journal, 2023, 19(10): 192-199.
[9] H. Wei, C. Chen, G. Zheng, et al., "PressLight: Learning max pressure control to coordinate traffic signals in arterial network," Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019.
[10] M. Xu, J. Wu, L. Huang, et al., "Network-wide traffic signal control based on the discovery of critical nodes and deep reinforcement learning," Journal of Intelligent Transportation Systems, 2019, 24(1): 1-10.
[11] X. Liang, X. Du, G. Wang, et al., "A deep reinforcement learning network for traffic light cycle control," IEEE Transactions on Vehicular Technology, 2019, 68(2): 1243-1253.
[12] B.-L. Ye, W. Wu, K. Ruan, et al., "A survey of model predictive control methods for traffic signal control," IEEE/CAA Journal of Automatica Sinica, 2019, 6(3): 623-640.
[13] Y. Hua, X.-F. Wang, B. Jin, "A Survey on Multi-Agent Reinforcement Learning for Urban Traffic Signal Optimization," Operations Research Transactions, 2023, 27(02): 49-62.
[14] M. Kolat, B. Kővári, T. Bécsi, et al., "Multi-agent reinforcement learning for traffic signal control: A cooperative approach," Sustainability, 2023, 15(4): 3479.
[15] Z.-M. Liu, B.-L. Ye, Y.-D. Zhu, et al., "Traffic Signal Control Method Based on Deep Reinforcement Learning," Journal of Zhejiang University (Engineering Science), 2022, 56(6): 1249-1256.
[16] A. Jamal, M. Tauhidur Rahman, H. M. Al-Ahmadi, et al., "Intelligent intersection control for delay optimization: Using meta-heuristic search algorithms," Sustainability, 2020, 12(5): 1896.
[17] Z.-D. Zhang, Y.-N. Wang, Y.-K. Liu, et al., "Reinforcement Learning Algorithm for Road Network Traffic Control Based on Nash-Stackelberg Hierarchical Game Model," Journal of Southeast University (Natural Science Edition), 2023, 53(02): 334-341.
[18] S.-F. Ding, W. Du, L.-L. Guo, et al., "Multi-Agent Deep Deterministic Policy Gradient Method Based on Dual Critics," Journal of Computer Research and Development, 2023, 60(10): 2394-2404.
[19] L. Zhu, P. Peng, Z. Lu, et al., "MetaVIM: Meta variationally intrinsic motivated reinforcement learning for decentralized traffic signal control," IEEE Transactions on Knowledge and Data Engineering, 2023, 35(11): 11570-11584.
[20] X.-Y. Peng, H. Wang, "A Review of Combined Optimization of Traffic Assignment and Signal Control," Journal of Transportation Engineering and Information, 2023, 21(01): 1-18.