Cooperative Reinforcement Learning On Traffic Signal Control
Abstract—Traffic signal control is a challenging real-world problem aiming to minimize overall travel time by coordinating vehicle movements at road intersections. Existing traffic signal …

… great effects on the policy decision. To make decisions on a continuous space, recent policy-based RL methods [10], [11], [12] have become more popularly adopted in traffic signal control …

… action design that can solve and predict this phase remaining time. On the other hand, RL methods can be classified according to the adopted action schemes, such as: (1) setting the length of the green light, (2) choosing whether to change phase, and (3) choosing the next phase. The DQN and AC methods are suitable for action schemes (2) and (3), but improper for setting the length of the green light, since the action space of DQN is discrete and huge for this task, and much computation is wasted for the AC method. Compared with DQN, whose action space is discrete, DDPG [17] can handle continuous action spaces, which are more suitable for modeling the length of the green light. Although DDPG originated from the AC …

III. RL BACKGROUND AND NOTATIONS

The basic elements of an RL problem for traffic signal control can be formulated as a Markov Decision Process (MDP), a mathematical framework ⟨S, A, T, R, γ⟩ with the following definitions:
• S denotes the set of states, which is the set of all lanes containing all possible vehicles. s_t ∈ S is the state at time step t for an agent.
• A denotes the set of possible actions, which is the duration of the green light. In our scenarios, the durations of a traffic cycle and of the yellow light are both fixed. Then, once the duration of the green light is chosen, the duration of the red light is determined. At time step t, the agent can take an action a_t from A.
• T denotes the transition function, which stores the probability of an agent transiting from state s_t (at time t) to s_{t+1} (at time t+1) if action a_t is taken; that is, T(s_{t+1} | s_t, a_t): S × A × S → [0, 1].
• R denotes the reward, where at time step t the agent obtains a reward r_t specified by a reward function R(s_t, a_t) if action a_t is taken under state s_t.
• γ denotes the discount factor, which not only controls the importance of the immediate reward versus future rewards, but also ensures the convergence of the value function, where γ ∈ [0, 1).

At time step t, the agent determines its next action a_t based on the current state s_t. After executing a_t, it transits to the next state s_{t+1} and receives a reward r_t(s, a); that is, r_t(s, a) = E[R_t | s_t = s, a_t = a], where R_t is called the one-step reward. The way the RL agent chooses an action is called the policy and is denoted by π. The policy is a function π(s) that chooses an action from the current state s; that is, π(s): S → A. The goal of this paper is to find such a policy that maximizes the future return G_t:

    G_t = Σ_{k=0}^{∞} γ^k R_{t+k}.                                  (1)

A value function V(s_t) indicates how good the agent is at state s_t, and is defined as the expected total return of the agent starting from s_t. If V(s_t) is conditioned on a given policy π, it is written as V^π(s_t); that is, V^π(s_t) = E[G_t | s_t = s], ∀ s_t ∈ S. The optimal policy π* at state s_t can be found by solving

    π*(s_t) = arg max_π V^π(s_t).                                   (2)

…

With the Q function, the optimal policy π* at state s_t can be found by solving

    π*(s_t) = arg max_a Q*(s_t, a).                                 (6)

Q*(s_t, a) is the sum of two terms: (i) the instant reward after a period of execution in state s_t, and (ii) the discounted expected future reward after the transition to the next state s_{t+1}. Then, we can use the Bellman equation [29] to express Q*(s_t, a) as follows:

    Q*(s_t, a) = R(s_t, a) + γ E_{s_{t+1}}[V*(s_{t+1})].            (7)

V*(s_t) is the maximum expected total reward from state s_t to the end. It is the maximum value of Q*(s, a) among all possible actions. Then, V* can be obtained from Q* as follows:

    V*(s_t) = max_a Q*(s_t, a),  ∀ s_t ∈ S.                         (8)

Two strategies, i.e., value iteration and policy iteration, can be used to calculate the optimal value function V*(s_t). Value iteration calculates the optimal state value function by iteratively improving the estimate of V(s); it repeatedly updates the values of Q(s, a) and V(s) until they converge. Instead of repeatedly improving the estimate of V(s), policy iteration redefines the policy at each step and calculates the value according to this new policy until the policy converges.

Deep Q-Network (DQN): In [30], [31], a deep neural network is used to approximate the Q function, which enables the RL algorithm to learn Q well in high-dimensional spaces. Let Q_tar be the targeted true value, expressed as Q_tar = r + γ max_{a'} Q(s', a'; θ), and let Q(s, a; θ) be the estimated value, where θ is the set of parameters of the Q-network. …
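To make the updates in Eqs. (7) and (8) and the DQN target Q_tar concrete, here is a minimal tabular value-iteration sketch in Python. The toy MDP (3 states, 2 actions, the reward table) is invented for illustration only and is not taken from the paper.

    import numpy as np

    # Tabular value iteration following Eqs. (7)-(8):
    # Q*(s,a) = R(s,a) + gamma * E_{s'}[V*(s')],  V*(s) = max_a Q*(s,a).
    # The MDP below is a made-up example, not from the paper.
    n_states, n_actions, gamma = 3, 2, 0.9
    R = np.array([[0.0, 1.0], [2.0, 0.0], [0.0, 5.0]])                      # R[s, a]
    P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']

    V = np.zeros(n_states)
    for _ in range(1000):
        Q = R + gamma * P @ V          # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        V_new = Q.max(axis=1)          # Eq. (8): V*(s) = max_a Q*(s, a)
        if np.max(np.abs(V_new - V)) < 1e-8:   # sup-norm stopping criterion
            V = V_new
            break
        V = V_new

    # DQN-style target for a sampled transition (s, a, r, s'),
    # cf. Q_tar = r + gamma * max_a' Q(s', a'; theta):
    def dqn_target(r, s_next, Q_table):
        return r + gamma * Q_table[s_next].max()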
IV. METHOD

This paper proposes a cooperative, multi-objective architecture with age-decaying weights for traffic signal control optimization. This architecture represents each intersection with a DDPG architecture, which contains a critic network and an actor network. In a given state, the critic network is responsible for judging the value of taking an action, and the actor network tries to make the best decision, i.e., to output an action. The outcome of an action is the number of seconds of green light. This paper assumes that the duration of a phase cycle (green, yellow, red) differs across intersections but is fixed at each intersection. In addition, the duration of the yellow light is the same and fixed for all intersections. Then, the duration of the red light can be directly derived once the duration of the green light is known.

Fig. 3. Architectures for global agent. (a) Global critic. (b) Global actor.

The RL-based methods for traffic signal control can be value-based or policy-based. A value-based method can achieve faster convergence on traffic signal control, but the time durations it can output are discrete and cannot reflect the real requirements for optimizing traffic conditions. A policy-based method can infer a non-discrete phase duration, but its gradient estimation strongly depends on sampling and is not stable due to sampling bias. Thus, the DDPG method is adopted in this paper to concurrently learn the desired Q-function and the corresponding policy.
The original DDPG uses off-policy data and the Bellman equation [29] (see Eq. (7)) to learn the Q-function, and then to derive the policy. It interleaves learning an approximator for the best Q*(s, a) with learning another approximator for the optimal action a*(s), and in this way the action space is continuous. The output of this DDPG is a continuous probability function that represents an action; in this paper, an action corresponds to the number of seconds of green light. Although DDPG is off-policy, we can mix past data into the training set, and thus make the distribution of the training set diverse, by feeding current environment parameters to a traffic simulation software named TSIS [33] to provide on-policy data for RL training. In a general DDPG, to give the agents more opportunities to explore the environment, noise (random sampling) is added to the output action during the training process, but this also makes the agent explore the environment blindly. In the scenario of traffic signal control, most SoTA methods adopt local agents to model different intersections. During training, the same mechanism of "adding noise to the action model" is used to make each agent explore the environment more. However, "increasing the whole throughput" is a goal shared by all local agents. The learning strategy of "adding noise to the action model" decreases not only the effectiveness of learning but also the whole throughput, since blind exploration makes local agents choose actions that conflict with other agents. Therefore, a cooperation mechanism should be added to the DDPG framework among different local agents to increase the final throughput during the learning process. The major novelty of this paper is to introduce a cooperative learning mechanism via a global agent that keeps local agents from blindly exploring the environment, so that the whole throughput and the learning effectiveness can be significantly improved.
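Before the cooperative architecture is detailed, the following PyTorch sketch shows how a per-intersection DDPG actor/critic pair of the kind described above could be laid out: the actor maps a state to a green-light duration in seconds, and the critic scores a (state, duration) pair. The state dimension, hidden sizes, and the [10 s, 60 s] green-time range are illustrative assumptions, not values from the paper.

    import torch
    import torch.nn as nn

    STATE_DIM, G_MIN, G_MAX = 16, 10.0, 60.0   # assumed dimensions/bounds for illustration

    class Actor(nn.Module):
        """Maps an intersection state to a green-light duration in seconds."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                                     nn.Linear(64, 64), nn.ReLU(),
                                     nn.Linear(64, 1), nn.Sigmoid())
        def forward(self, s):
            return G_MIN + (G_MAX - G_MIN) * self.net(s)   # continuous action (seconds)

    class Critic(nn.Module):
        """Scores a (state, green-duration) pair, i.e., Q(s, a)."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(STATE_DIM + 1, 64), nn.ReLU(),
                                     nn.Linear(64, 64), nn.ReLU(),
                                     nn.Linear(64, 1))
        def forward(self, s, a):
            return self.net(torch.cat([s, a], dim=-1))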
A. Cooperative DDPG Network Architecture

Most policy-based RL methods [10], [11], [12] use only local agents to perform RL learning for traffic control. The requirements of one local agent easily conflict with those of other agents, which results in a divergence problem during optimization. A cooperative DDPG architecture is designed in this paper, where a local agent controls each intersection and a global agent manages all intersections. Details of this COMMA-DDPG algorithm are described in Algorithm 1. Although the DDPG method is off-policy, we use TSIS [33] to collect on-policy data for RL training. Details of the on-policy data collection process are described in the GOD (Generating On-policy Data) algorithm (see Algorithm 2). With the set of on-policy data, the parameters of the local and global agents are then updated by the LAU (Local Agent Updating) and GAU (Global Agent Updating) algorithms, respectively. The global agent is involved only during the training stage to generate on-policy data. Let W^m_G represent the global agent's importance to the mth intersection. Then, the importance W^m_L of the mth local agent is 1 − W^m_G, i.e., W^m_L = 1 − W^m_G. For the mth intersection, the GOD algorithm predicts the next actions by using the local agent and the global agent, each via an epsilon-greedy exploration scheme. The competition between the output seconds of the global agent and the local agent depends on W^m_G and W^m_L: the one with higher importance is chosen to output the seconds. The output seconds are fed into TSIS [33] to generate on-policy data for RL training. To avoid the training being too biased to one side, we apply a penalty mechanism based on time decay: assuming that the model selects the global output for t consecutive times, the global weight is multiplied by (0.95)^t, that is, W^m_G = W^m_G × (0.95)^t. The COMMA-DDPG method runs one hour of simulation with an epsilon-greedy exploration scheme to collect on-policy data. The set of on-policy data collected for training the mth local agent is denoted by B^m, and the on-policy data set B for training the global agent is the union of all B^m, i.e., B = (B^1, ..., B^m, ..., B^M). In the following, details of each local agent and the global agent are described.
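The competition and time-decay penalty described above can be summarized in a few lines of Python. The sketch below is our own illustration of that rule (function and variable names are ours); it is not the paper's implementation.

    # Illustrative sketch of the global/local competition with the (0.95)^t penalty.
    def choose_green_seconds(local_seconds, global_seconds, w_global, consecutive_global_picks):
        # Penalize the global agent if it has been chosen t consecutive times:
        # W_G <- W_G * 0.95^t, W_L = 1 - W_G
        w_global = w_global * (0.95 ** consecutive_global_picks)
        w_local = 1.0 - w_global
        if w_local > w_global:
            return local_seconds, w_global, 0                          # local agent wins
        return global_seconds, w_global, consecutive_global_picks + 1  # global agent wins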
B. Generating On-policy Data

As mentioned earlier, the major contribution of this paper is to add a global agent that trades off the requirements of different local agents and makes them cooperate to find better strategies for traffic signal control. Here we explain how the global output and the local output compete. The global output contains the number of seconds and a weight W^m_G, which represents the global agent's importance to the mth intersection. Then, the value W^m_L = (1 − W^m_G) is the importance of the mth local agent to the global agent. The one with higher importance is chosen to output the seconds. To avoid the training being too biased to one side, we apply a penalty mechanism based on time decay: assuming that the model selects the global output for t consecutive times, the global weight is multiplied by (0.95)^t, that is, W^m_G = W^m_G × (0.95)^t.

Our method is based on MA-DDPG [34] with M local agents, using the decentralized reinforcement learning method [35]. The proposed COMMA-DDPG method adds a global agent, based on MA-DDPG, to control all intersections by using the average stopped delay time of vehicles as the reward. Its actor output is not only the duration of the green light at each intersection, but also the weight W^m_G relative to each local agent m. It is involved only during the training stage to generate on-policy data. During the data generation process, a specific local agent is created at each intersection, using the clearance degree as the reward. As shown in Fig. 2(a), its critic's input state not only contains information about its own intersection, but also takes the actions of other agents as part of its own state, so as to achieve information transmission among all the agents during the training process.

During the RL-based training process, before starting each epoch, we perform a one-hour simulation to collect data (see Algorithm 2) and store it in the replay buffer B. In the process of interacting with the environment, we add the epsilon-greedy and weight-decayed methods to the selection of actions. In particular, the epsilon-greedy method gradually reduces epsilon from 0.9 to 0.1.

C. Local Agent

In our scenario, a fixed duration of the traffic signal change cycle is assigned to each intersection. In addition, 5 seconds are reserved for the yellow light. Then, we only need to model the phase duration of the green light; after that, the phase duration of the red light can be directly estimated. At each intersection, a DDPG-based architecture is constructed to model the local agent for traffic control. To describe this local agent, some definitions are given as follows. …
… critic network at time step i. Then, the loss function for updating θ^Q_G is defined as follows:

    L^G_critic = (1/N_b) Σ_i (y^G_i − Q_G(S_i, A_i | θ^Q_G))^2.        (18)

The network architecture used to calculate the value function Q_G is shown in Fig. 3(a). Notice that the output of this global critic network is a scalar value, i.e., the predicted total waiting time across the whole site. To train θ^µ_G, we use the loss function:

    L^G_actor = −(1/N_b) Σ_i Q_G(S_i, µ_G(S_i | θ^µ_G) | θ^Q_G).       (19)

Fig. 3(b) shows the architecture used to calculate µ_G. In addition, the output of the global actor network is a vector that includes the actions of all intersections and the weights W^m_G, which represent the global agent's importance to the mth intersection. All the local agents and the global agent are modeled as DDPGs. Details on updating the global agent are described in Algorithm 4.
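To make Eqs. (18) and (19) concrete, the following sketch shows how the global critic and actor losses could be computed for one minibatch. The callables `global_critic` and `global_actor` stand in for Q_G(·|θ^Q_G) and µ_G(·|θ^µ_G); they are placeholders for illustration, not the paper's released code.

    import torch

    def global_losses(global_critic, global_actor, S, A, y_G):
        # Eq. (18): mean squared error between the targets y^G_i and Q_G(S_i, A_i)
        critic_loss = torch.mean((y_G - global_critic(S, A)) ** 2)
        # Eq. (19): push the global actor toward joint actions the global critic scores highly
        actor_loss = -torch.mean(global_critic(S, global_actor(S)))
        return critic_loss, actor_loss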
Algorithm 1: COMMA-DDPG traffic signal control RL algorithm
    Initialize the critic network Q(s, a | θ^Q) and the actor network µ(s | θ^µ) with random weights θ^Q and θ^µ.
    Initialize the target networks Q′ and µ′ with weights θ^{Q′} ← θ^Q, θ^{µ′} ← θ^µ, and also initialize the replay buffer R.
    for t = 1, ..., T do
        Clean the replay buffer B.
        /* B = (B^1, ..., B^m, ..., B^M) */
        /* B^m: on-policy data for the mth intersection */
        /* Generate on-policy data */
        B = GOD(t);
        for episode = 1, ..., 400 do
            for m = 1, ..., M, Global do
                if m ≠ Global then
                    LAU(B, m);   // Update local agents
                end
                if m = Global then
                    GAU(B);      // Update the global agent
                end
            end
        end
    end

Algorithm 2: GOD (Generating On-policy Data)
    /* Run one hour of simulation with noise η */
    Input:  t: timestamp;
            θ^µ_m: parameters of the mth actor network;
            θ^µ_G: parameters of the global actor network
    Output: B: on-policy data
    β = 0.95^t;   // rate for time decay
    for m = 1, ..., M do
        Get W^m_G from the global actor network with the parameters θ^µ_G;
        W^m_G = β × W^m_G;  W^m_L = 1 − W^m_G;
        for l = 1, ..., 3600 do
            /* ε: the probability of choosing to explore */
            /* η_m: noise for epsilon-greedy exploration */
            p = random(0, 1);
            η_m = 0 if p ≤ ε, otherwise random(−5, 5);
            a^m_l = µ(s_l | θ^µ_m) + η_m if W^m_L > W^m_G, otherwise µ_G(s_l | θ^µ_G)(m) + η_m;
            Execute a^m_l and observe r^m_l and s^m_{l+1};
            Store the transition (s^m_l, a^m_l, r^m_l, s^m_{l+1}) in B^m;
        end
    end
    B = (B^1, ..., B^m, ..., B^M);
    Return(B);
Algorithm 3: LAU (Local Agent Updating)
    Input:  B: on-policy data; m: the mth agent;
            θ^Q_m: set of parameters of the local critic network;
            θ^µ_m: set of parameters of the local actor network;
            (θ^{Q′}_m, θ^{µ′}_m): sets of parameters of the target networks
    Output: θ^Q_m: new parameters of the mth critic network;
            θ^µ_m: new parameters of the mth actor network;
            (θ^{Q′}_m, θ^{µ′}_m): new parameters of the target networks
    Sample a random minibatch of N_b transitions (S_i, A_i, R_i, S_{i+1}) from B;
    Set y^m_i = R_i(m) + γ Q′(S_{i+1}, µ′(S_{i+1} | θ^{µ′}_m) | θ^{Q′}_m);
    Update the critic parameters θ^Q_m by minimizing the loss: L^m_critic = (1/N_b) Σ_i (y^m_i − Q(S_i, A_i | θ^Q_m))^2;
    Update the actor parameters θ^µ_m by minimizing the loss: L^m_actor = −(1/N_b) Σ_i Q(S_i, µ(S_i | θ^µ_m) | θ^Q_m);
    Update the target networks: θ^{Q′}_m ← (1 − τ)θ^Q_m + τ θ^{Q′}_m,  θ^{µ′}_m ← (1 − τ)θ^µ_m + τ θ^{µ′}_m;
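A minimal PyTorch-style sketch of one LAU step as reconstructed in Algorithm 3 (TD target, critic and actor losses, and the soft target update with rate τ) is given below. Network and optimizer objects are assumed to already exist, and the γ and τ defaults are placeholders; this is an illustration, not the authors' implementation.

    import torch

    def lau_step(critic, actor, target_critic, target_actor,
                 critic_opt, actor_opt, batch, gamma=0.99, tau=0.01):
        S, A, R_m, S_next = batch                                   # minibatch of N_b transitions
        with torch.no_grad():
            y = R_m + gamma * target_critic(S_next, target_actor(S_next))   # y_i^m
        critic_loss = torch.mean((y - critic(S, A)) ** 2)           # L^m_critic
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

        actor_loss = -torch.mean(critic(S, actor(S)))               # L^m_actor
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

        # Soft update as written in Algorithm 3: theta' <- (1 - tau) * theta + tau * theta'
        for tgt, src in ((target_critic, critic), (target_actor, actor)):
            for p_t, p in zip(tgt.parameters(), src.parameters()):
                p_t.data.mul_(tau).add_((1 - tau) * p.data)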
V. EXPERIMENTAL RESULTS

A. Environment Setup

It is difficult to test and evaluate traffic signal control strategies in the real world due to high cost and intensive labor. Simulation is a useful alternative before actual implementation for most SoTA methods [3]. To build the simulation data, real data were collected from five real intersections in a city in Asia over half a year. The simulation with real traffic flow at the intersections was performed based on the traffic simulation software TSIS [33]. Through TSIS, we can control the behavior of each traffic light with a plug-in program and use it as the simulation software needed for performance evaluation.
B. Results

To train our method, the fixed-time control model was first used to pretrain our COMMA-DDPG model to speed up training. The baseline is a fixed-time strategy. Several SoTA methods were compared in this paper: CGRL [38], MADDPG [34], TD3 [36], PPO [37], PressLight [26], IntelliLight [7], and CoLight [27]. Table I shows the comparisons of the waiting time and average speed of vehicles among the different methods. Clearly, our method performs better than the fixed-time model and the other SoTA methods. Table II shows the comparisons of throughput between COMMA-DDPG and other methods at different intersections. Due to the control of the global agent, our method performs much better than the other methods. In Algorithms 3 and 4, a soft update with rate τ is used to update the network parameters. Table III shows the effects of changing τ on the training result in different situations; it shows that better performance can be gained if the model is not changed frequently. DDPG is an off-policy method. In …

TABLE IV
COMPARISONS BETWEEN ON-POLICY AND OFF-POLICY TRAINING.
    Method        Waiting Time    Average Speed
    on-policy     269747          43
    off-policy    275868          29

Fig. 4. Time and space diagram.

VI. CONCLUSIONS

This paper proposed a novel cooperative RL architecture that handles cooperation problems by adding a global agent. Since the global agent knows the information of all intersections, it can guide the local agents to take better actions during training, so that the local agents do not rely on random noise to explore the environment blindly but instead explore in a directed manner. Since RL training requires a large amount of data, we hope to incorporate data augmentation into RL in the future, so that training can be more efficient.

VII. APPENDIX FOR CONVERGENCE PROOF

In this section, we prove that the value function in our method actually converges.

Definition VII.1. A metric space ⟨M, d⟩ is complete (or Cauchy) if and only if all Cauchy sequences in M converge in M. In other words, in a complete metric space, for any sequence of points a_1, a_2, · · · ∈ M, if the sequence is Cauchy, then the sequence converges to a point of M:

    lim_{n→∞} a_n ∈ M.

Definition VII.2. Let (X, d) be a complete metric space. Then a map T: X → X is called a contraction mapping on X if there exists q ∈ [0, 1) such that d(T(x), T(y)) < q d(x, y), ∀ x, y ∈ X.
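To make Definition VII.2 and the fixed-point argument below concrete, here is a small numerical illustration in Python. The reward vector and transition matrix are toy values (not from the paper); the sketch checks that the backup T(u) = R^π + λP^π u shrinks sup-norm distances by at least a factor λ and that repeating it converges to a unique fixed point.

    import numpy as np

    lam = 0.9
    R = np.array([1.0, 0.5, 2.0])                # toy reward vector R^pi
    P = np.array([[0.0, 0.6, 0.4],
                  [0.5, 0.0, 0.5],
                  [0.3, 0.7, 0.0]])              # row sums <= 1, zero diagonal as in Eq. (21)

    T = lambda u: R + lam * P @ u                # Bellman backup T^pi(u) = R^pi + lam * P^pi u
    u, v = np.zeros(3), np.random.randn(3)
    # Contraction check: d(T(u), T(v)) <= lam * d(u, v) in the sup norm
    print(np.max(np.abs(T(u) - T(v))) <= lam * np.max(np.abs(u - v)))   # True

    x = np.zeros(3)
    for _ in range(200):                         # fixed-point iteration
        x = T(x)
    print(np.allclose(x, T(x)))                  # True: x is (numerically) the unique fixed point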
Theorem 1 (Banach fixed-point theorem). Let (X, d) be a non-empty complete metric space with a contraction mapping T: X → X. Then T admits a unique fixed point x* in X, i.e., T(x*) = x*.

Theorem 2 (Gershgorin circle theorem). Let A be a complex n × n matrix with entries a_ij. For i ∈ {1, 2, ..., n}, let R_i be the sum of the absolute values of the non-diagonal entries in the ith row:

    R_i = Σ_{j ≠ i} |a_ij|.

Let D(a_ii, R_i) ⊆ C be the closed disc centered at a_ii with radius R_i. Then every eigenvalue of A lies within at least one of the Gershgorin discs D(a_ii, R_i).

Lemma 3. We claim that the value function of RL actually converges, and we apply this result to traffic control.

Proof. The value function calculates the value of each state and is defined as follows:

    V^π(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [r + γ V^π(s′)]
           = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) r
           + Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [γ V^π(s′)].                 (20)

Since the immediate reward is determined, it can be regarded as a constant term relative to the second term. Assuming that the state set is finite, we express the state value function in matrix form below. Set the state set S = {S_0, S_1, · · · , S_n}, V^π = {V^π(s_0), V^π(s_1), · · · , V^π(s_n)}^T, and the transition matrix

    P^π = [    0        P^π_{0,1}   · · ·   P^π_{0,n} ]
          [ P^π_{1,0}      0        · · ·   P^π_{1,n} ]
          [   · · ·       · · ·     · · ·     · · ·   ]
          [ P^π_{n,0}   P^π_{n,1}   · · ·       0     ],                       (21)

where P^π_{i,j} = Σ_a π(a|s_i) p(s_j, r | s_i, a). The constant term is expressed as R^π = {R_0, R_1, · · · , R_n}^T. Then we can rewrite the state-value function as

    V^π = R^π + λ P^π V^π.                                                     (22)

Above we defined the state value function vector as V^π = {V^π(s_0), V^π(s_1), · · · , V^π(s_n)}^T, which belongs to the value function space V. We consider V to be the full space of n-dimensional vectors, and define the metric of this space as the infinity norm:

    d(u, v) = ‖u − v‖_∞ = max_{s∈S} |u(s) − v(s)|,  ∀ u, v ∈ V.                 (23)

Since ⟨V, d⟩ is the full space of vectors, V is a complete metric space. Then, the iteration of the state value function is u_new = T^π(u) = R^π + λ P^π u. We can show that it is a contraction mapping:

    d(T^π(u), T^π(v)) = ‖(R^π + λ P^π u) − (R^π + λ P^π v)‖_∞
                      = ‖λ P^π (u − v)‖_∞
                      ≤ ‖λ P^π‖_∞ ‖u − v‖_∞.                                    (24)

From Theorem 2, we can show that every eigenvalue of P^π lies in the disc centered at (0, 0) with radius 1; that is, the maximum absolute value of the eigenvalues is at most 1. Hence

    d(T^π(u), T^π(v)) ≤ ‖λ P^π‖_∞ ‖u − v‖_∞
                      ≤ λ ‖u − v‖_∞
                      = λ d(u, v).                                              (25)

From Theorem 1, the iteration of Eq. (22) converges to the unique fixed point V^π.

REFERENCES

[1] S. Alemzadeh, R. Moslemi, R. Sharma, and M. Mesbahi, "Adaptive traffic control with deep reinforcement learning: Towards state-of-the-art and beyond," arXiv preprint arXiv:2007.10960, 2020.
[2] G. Zheng, X. Zang, N. Xu, H. Wei, Z. Yu, V. Gayah, K. Xu, and Z. Li, "Diagnosing reinforcement learning for traffic signal control," arXiv preprint arXiv:1905.04716, 2019.
[3] H. Wei, G. Zheng, V. Gayah, and Z. Li, "Recent advances in reinforcement learning for traffic signal control: A survey of models and evaluation," ACM SIGKDD Explorations Newsletter, vol. 22, no. 2, pp. 12–18, 2021.
[4] P. Mannion, J. Duggan, and E. Howley, "An experimental review of reinforcement learning algorithms for adaptive traffic signal control," Autonomic Road Transport Support Systems, pp. 47–66, 2016.
[5] T. T. Pham, T. Brys, M. E. Taylor, T. Brys, M. M. Drugan, P. Bosman, M.-D. Cock, C. Lazar, L. Demarchi, D. Steenhoff et al., "Learning coordinated traffic light control," in Proceedings of the Adaptive and Learning Agents Workshop (at AAMAS-13), vol. 10. IEEE, 2013, pp. 1196–1201.
[6] E. Van der Pol and F. A. Oliehoek, "Coordinated deep reinforcement learners for traffic light control," Proceedings of Learning, Inference and Control of Multi-Agent Systems (at NIPS 2016), 2016.
[7] H. Wei, G. Zheng, H. Yao, and Z. Li, "IntelliLight: A reinforcement learning approach for intelligent traffic light control," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 2496–2505.
[8] I. Arel, C. Liu, T. Urbanik, and A. G. Kohls, "Reinforcement learning-based multi-agent system for network traffic signal control," IET Intelligent Transport Systems, vol. 4, no. 2, pp. 128–135, 2010.
[9] J. A. Calvo and I. Dusparic, "Heterogeneous multi-agent deep reinforcement learning for traffic lights control," in AICS, 2018, pp. 2–13.
[10] T. Chu, J. Wang, L. Codecà, and Z. Li, "Multi-agent deep reinforcement learning for large-scale traffic signal control," IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 3, pp. 1086–1095, 2019.
[11] T. Nishi, K. Otaki, K. Hayakawa, and T. Yoshimura, "Traffic signal control based on reinforcement learning with graph convolutional neural nets," in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 877–883.
[12] S. S. Mousavi, M. Schukat, and E. Howley, "Traffic light control using deep policy-gradient and value-function-based reinforcement learning," IET Intelligent Transport Systems, vol. 11, no. 7, pp. 417–423, 2017.
[13] M. Aslani, M. S. Mesgari, and M. Wiering, "Adaptive traffic signal control with actor-critic methods in a real-world traffic network with different traffic disruption events," Transportation Research Part C: Emerging Technologies, vol. 85, pp. 732–752, 2017.
[14] M. Aslani, S. Seipel, M. S. Mesgari, and M. Wiering, "Traffic signal optimization through discrete and continuous reinforcement learning with robustness analysis in downtown Tehran," Advanced Engineering Informatics, vol. 38, pp. 639–655, 2018.
[15] H. Pang and W. Gao, "Deep deterministic policy gradient for traffic signal control of single intersection," in 2019 Chinese Control And Decision Conference (CCDC). IEEE, 2019, pp. 5861–5866.
[16] H. Wu, "Control method of traffic signal lights based on DDPG reinforcement learning," in Journal of Physics: Conference Series, vol. 1646, no. 1. IOP Publishing, 2020, p. 012077.
[17] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[18] R. P. Roess, E. S. Prassas, and W. R. McShane, Traffic Engineering. Pearson/Prentice Hall, 2004.