
Cooperative Reinforcement Learning on Traffic Signal Control

Chi-Chun Chao, Student Member, IEEE, Jun-Wei Hsieh, Member, IEEE, and Bor-Shiun Wang, Student Member, IEEE

arXiv:2205.11291v2 [cs.AI] 6 Aug 2022

Abstract—Traffic signal control is a challenging real-world problem that aims to minimize overall travel time by coordinating vehicle movements at road intersections. Existing traffic signal control systems in use still rely heavily on oversimplified information and rule-based methods. Specifically, the periodicity of green/red light alternations can be considered as a prior for better planning of each agent in policy optimization. Traditional RL-based methods, however, can only return a fixed length from a predefined action pool and use only local agents, which makes such adaptive and predictive priors hard to learn. If there is no cooperation between these agents, some agents often conflict with other agents and thus decrease the whole throughput. This paper proposes a cooperative, multi-objective architecture with age-decaying weights to better estimate multiple reward terms for traffic signal control optimization, termed COoperative Multi-Objective Multi-Agent Deep Deterministic Policy Gradient (COMMA-DDPG). Two types of agents run to maximize rewards of different goals - one for local traffic optimization at each intersection and the other for global waiting-time optimization. The global agent is used to guide the local agents as a means of aiding faster learning, but it is not used in the inference phase. We also provide an analysis of solution existence together with a convergence proof for the proposed RL optimization. Evaluation is performed using real-world traffic data collected with traffic cameras from an Asian country. Our method can effectively reduce the total delayed time by 60%. Results demonstrate its superiority when compared to SoTA methods.

Index Terms—Reinforcement learning, Traffic signal control

I. INTRODUCTION

Traffic signal control is a challenging real-world problem whose goal is to minimize the overall vehicle travel time by coordinating the traffic movements at road intersections. Existing traffic signal control systems in use still rely heavily on manually designed rules which cannot adapt to dynamic traffic changes. Recent advances in reinforcement learning (RL), especially deep RL [1], [2], offer excellent capability to work with high-dimensional data, where agents can learn a state abstraction and policy approximation directly from input states. This paper explores the possibility of applying RL to on-policy traffic signal control with fewer assumptions.

In the literature, different RL-based frameworks [3] have been proposed for traffic signal control. Most of them [2], [4], [5], [6], [7], [8], [9] are value-based and can achieve faster convergence on traffic signal control. However, the actions, states, and time space they can handle are discrete. Thus, the time slots for each action to be executed are fixed and cannot reflect the real requirements to optimize traffic conditions. Moreover, a small change in the value function will cause great effects on the policy decision. To make decisions on a continuous space, recent policy-based RL methods [10], [11], [12] have become more popularly adopted in traffic signal control so that a non-discrete length of phase duration can be inferred. However, their gradient estimation is strongly dependent on sampling and is not stable if the sampled cases are not general, and thus easily gets trapped in a non-optimal solution.

Another vital problem of the above RL-based methods is that their agents are trained in an off-policy way. They are not retrained on the fly during inference. This means their policy decision strategy cannot be amended and adapted to real traffic conditions. Moreover, they make action decisions continuously along the time axis and give drivers very short reaction time to change their behaviors. Outcomes from these methods are less practical since, in real-world traffic control scenarios, choosing the next traffic light phase from pre-defined discrete cyclic sequences of red/green lights is actually important to let drivers know how much remaining time there is before the next traffic signal phase change. For traffic optimization, this means the agent makes decisions not only on which action is to be performed but also on how long it should be performed. To determine a proper period for an action to be executed, some frameworks [13], [14] pre-define some time slots for the agent to choose from, for computation efficiency and traffic control simplification. However, this solution of pre-defining time slots is less flexible than an on-demand solution to better relieve traffic congestion.

To bridge the gaps between value-based and policy-based RL approaches, the actor-critic framework is widely adopted to stabilize the RL training process, where the policy structure is known as the actor and the estimated value function is known as the critic. Thus, several actor-critic frameworks have been proposed for traffic signal control. For example, in [15], [16], a model-free actor-critic method named "Deep Deterministic Policy Gradient" (DDPG) [17] was adopted to learn a deterministic policy mapping states to actions. However, it is a "single-agent" solution and cannot output a proper execution period for the chosen action to more effectively relieve traffic congestion. This paper incorporates multiple agents in an actor-critic framework to develop a COoperative Multi-Objective Multi-Agent DDPG (COMMA-DDPG) for optimal traffic signal control. The novelty of our method is to introduce a global agent that trades off different local agents' requirements and makes them cooperate to find better strategies for traffic signal control. The global agent is used as a means of aiding faster learning and is not used in the inference phase. This idea is very different from other RL-based multi-agent methods, which use only local agents to search for solutions and often produce conflicts between two agents' strategies, thus decreasing the whole throughput.
Fig. 1 shows the diagram of this COMMA-DDPG architecture. A local agent is first used to learn and optimize the policy at an intersection. During the training process, we introduce a global agent to coordinate the local agents and then optimize the global traffic throughput among all intersections in the whole observed site. With the actor-critic framework, the global agent can optimize and send various information exchanged among local intersections to the local agents so as to optimize the final reward globally. It can aid faster learning while not constraining the deployment since it is not used during the inference phase. To achieve this goal, the parameters of each local DDPG-based agent are initialized by the global agent. Thus, the COMMA-DDPG framework can select the best policy for controlling the periodical phases of traffic signals that maximizes throughput by trading off the requirements and reducing conflicts between agents. It can yield a dynamic length of the next traffic light phase in seconds. This is very different from other RL-based methods which can only return a fixed length from a predefined action pool. Then, the remaining seconds of a traffic light phase can be dynamically predicted and sent to the drivers for making their next driving plans. It is noted again that the global agent is only used during the training process to coordinate different local agents' requirements.

Fig. 1. COMMA-DDPG architecture.

Convergence Analysis. To prove the convergence of COMMA-DDPG, we also analyze the existence and uniqueness of our actor-critic model in the Appendix. This proof provides theoretical support for the convergence of our COMMA-DDPG approach.

Evaluation. Our traffic data consists of visual traffic monitoring sequences from five consecutive intersections during morning rush hour in Taiwan. We conducted various ablation studies on COMMA-DDPG against different SoTA methods to compare performance. Results show that COMMA-DDPG significantly improves the overall traffic waiting time and better alleviates traffic congestion.

The remainder of this paper is organized as follows. Section 2 surveys related work on intelligent traffic control. Section 3 describes the RL notations and schemes. Section 4 describes the architecture of COMMA-DDPG. Section 5 shows the experimental results.

II. RELATED WORK

A. Traditional traffic control

Traditional traffic control methods can be categorized into the following classes: (1) fixed-time control [18], (2) actuated control [19], [20], and (3) adaptive control [21], [2]. They are mostly based on human knowledge or manual effort to design an appropriate cycle length and strategies for better traffic control. The involved manual tasks make parameter settings very cumbersome and difficult to satisfy different scenarios' requirements, including peak hours, normal hours, and off-peak hours. Fixed-time control is simple and easy, and thus has become the most commonly adopted method in traffic signal control. Its time length for each traffic light phase is pre-calculated and kept the same even when traffic conditions have changed. Actuated control determines traffic conditions using pre-determined thresholds; that is, if the traffic condition (e.g., the car queue length) exceeds a threshold, a green light will be issued accordingly. Adaptive control methods, including SCATS [22] and SCOOT [23], determine the best signal phase according to the on-going traffic conditions, and thus can achieve more effective traffic optimization.

B. RL traffic control

The recent advancement of RL sheds light on automatic traffic control improvement. RL agents use knowledge from the traffic data to learn and optimize the control policies without human intervention. There are two main approaches to solving traffic signal control problems, i.e., value-based and policy-based. There is also a hybrid, actor-critic approach, which employs both value-based and policy-based searches. The value-based method first estimates the value (expected return) of being in a given state and then finds the best policy from the estimated value function. One of the most used value-based methods is Q-learning [24]. The first Q-learning method applied to control traffic signals at a street intersection can be traced to [25]. However, in Q-learning, a huge table must be created and updated to store the Q values of each action in each state. Thus, it is both memory- and time-consuming, and improper for problems with complicated states and actions. Recently, the advent of deep learning has had significant impact on many areas such as object detection, speech recognition, and language translation. Thus, deep reinforcement learning methods such as the value-based DQN (Deep Q Network) [2], [26], [27] and AC (Actor Critic) methods [13], [28] are widely used in traffic control. However, to the best of our knowledge, none of the above methods can reliably predict the remaining seconds of a traffic light phase. We note that such predicting capability can provide useful information for drivers to prepare their next driving behaviors in real-life traffic control regarding passenger safety. The proposed COMMA-DDPG model in this paper is a new action design that can solve and predict this phase remaining time.
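To make the Q-table bookkeeping discussed above concrete, the sketch below shows a plain tabular Q-learning update in Python. It is only an illustration of the classical method in [24], not part of the proposed model; the state/action sizes, the learning rate, and the example transition are arbitrary assumptions.

```python
import numpy as np

# A minimal tabular Q-learning update (illustrative only, not the proposed method).
# The discrete state/action sizes below are assumed for the example; the table
# Q is exactly the memory bottleneck noted in the text.
n_states, n_actions = 100, 4
Q = np.zeros((n_states, n_actions))   # the "huge table" of Q values
alpha, gamma = 0.1, 0.95              # learning rate and discount

def q_update(s, a, r, s_next):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma*max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

# Example transition: in state 3, action 1 yielded reward -2 and led to state 7.
q_update(3, 1, -2.0, 7)
```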
On the other hand, RL methods can be classified according to the adopted action schemes, such as: (1) setting the length of the green light, (2) choosing whether to change phase, and (3) choosing the next phase. The DQN and AC methods are suitable for action schemes 2 and 3, but improper for setting the length of the green light since the action space of DQN is discrete and huge for this task, and much calculation efficacy is wasted for the AC method. Compared with DQN, whose action space is discrete, DDPG [17] can handle continuous action spaces, which are more suitable to model the length of the green light. Although DDPG originated from the AC method, it performs more robustly than the AC method since it creates two networks as regression models to estimate values. Therefore, this paper uses DDPG [17] as the main architecture to adapt and model the action space of green light length. Given a range of green light duration, the proposed DDPG-based method can easily output seconds within the range. In the past, the DDPG-based traffic control frameworks [15], [16] focused on only a single intersection. This paper uses the idea of DDPG to model traffic conditions at multiple intersections by introducing a global agent to trade off different local agents' requirements and then find better strategies for traffic signal control.

III. RL BACKGROUND AND NOTATIONS

The basic elements of an RL problem for traffic signal control can be formulated as the Markov Decision Process (MDP) mathematical framework of < S, A, T, R, γ >, with the following definitions:
• S denotes the set of states, which is the set of all lanes containing all possible vehicles. s_t ∈ S is a state at time step t for an agent.
• A denotes the set of possible actions, which is the duration of the green light. In our scenarios, both the duration of a traffic cycle and of a yellow light are fixed. Then, once the duration of the green light is chosen, the duration of the red light can be determined. At time step t, the agent can take an action a_t from A.
• T denotes the transition function, which stores the probability of an agent transiting from state s_t (at time t) to s_{t+1} (at time t + 1) if the action a_t is taken; that is, T(s_{t+1}|s_t, a_t): S × A → S.
• R denotes the reward, where at time step t the agent obtains a reward r_t specified by a reward function R(s_t, a_t) if the action a_t is taken under state s_t.
• γ denotes the discount, which not only controls the importance of the immediate reward versus future rewards, but also ensures the convergence of the value function, where γ ∈ [0, 1).

At time step t, the agent determines its next action a_t based on the current state s_t. After executing a_t, it transits to the next state s_{t+1} and receives a reward r_t(s, a); that is, r_t(s, a) = E[R_t|s_t = s, a_t = a], where R_t is named the one-step reward. The way that the RL agent chooses an action is named the policy and is denoted by π. The policy is a function π(s) that chooses an action from the current state s; that is, π(s): S → A. The goal of this paper is to find such a policy to maximize the future reward G_t:

G_t = Σ_{k=0}^{∞} γ^k R_{t+k}.   (1)

A value function V(s_t) indicates how good the agent is at state s_t, and is defined as the expected total return of the agent starting from s_t. If V(s_t) is conditioned on a given strategy π, it is expressed by V^π(s_t); that is, V^π(s_t) = E[G_t|s_t = s], ∀s_t ∈ S. The optimal policy π* at state s_t can be found by solving

π*(s_t) = arg max_π V^π(s_t),   (2)

where V^π(s_t) is the state-value function for a policy π. Similarly, we can define the expected return of taking action a in state s_t under a policy π, denoted by a Q function:

Q^π(s_t, a_t) = E[G_t|s_t = s, a_t = a].   (3)

The relationship between Q^π(s_t, a_t) and V^π(s_t) is derived as

V^π(s) = Σ_{a∈A} π(a|s) Q^π(s, a).   (4)

Then, the optimal solution Q*(s_t, a) is found by iteratively solving:

Q*(s_t, a) = max_π Q^π(s_t, a).   (5)

With the Q function, the optimal policy π* at state s_t can be found by solving:

π*(s_t) = arg max_a Q*(s_t, a).   (6)

Q*(s_t, a) is the sum of two terms: (i) the instant reward after a period of execution in the state s_t, and (ii) the discounted expected future reward after the transition to the next state s_{t+1}. Then, we can use the Bellman equation [29] to express Q*(s_t, a) as follows:

Q*(s_t, a) = R(s_t, a) + γ E_{s_{t+1}}[V*(s_{t+1})].   (7)

V*(s_t) is the maximum expected total reward from state s_t to the end. It is the maximum value of Q*(s, a) among all possible actions. Then, V* can be obtained from Q* as follows:

V*(s_t) = max_a Q*(s_t, a), ∀s_t ∈ S.   (8)

Two strategies, i.e., value iteration and policy iteration, can be used to calculate the optimal value function V*(s_t). Value iteration calculates the optimal state value function by iteratively improving the estimate of V(s). It repeatedly updates the values of Q(s, a) and V(s) until they converge. Instead of repeatedly improving the estimate of V(s), policy iteration redefines the policy at each step and calculates the value according to this new policy until the policy converges.
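As a concrete illustration of the value-iteration procedure just described, the following Python sketch runs tabular value iteration on a toy MDP. The transition tensor, rewards, and sizes are random placeholder assumptions, not the traffic MDP used in this paper.

```python
import numpy as np

# Tabular value iteration on a toy MDP (sizes and dynamics are assumptions).
n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.uniform(-1, 1, size=(n_states, n_actions))                # R(s, a)

V = np.zeros(n_states)
for _ in range(1000):
    # Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s'), cf. Eq. (7)
    Q = R + gamma * P @ V
    V_new = Q.max(axis=1)             # V*(s) = max_a Q*(s,a), Eq. (8)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
pi_star = Q.argmax(axis=1)            # greedy policy, Eq. (6)
```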
Deep Q-Network (DQN): In [30], [31], a deep neural network is used to approximate the Q function, which enables the RL algorithm to learn Q well in high-dimensional spaces. Let Q_tar be the targeted true value, which is expressed as Q_tar = r + γ max_{a'} Q(s', a'; θ). In addition, let Q(s, a; θ) be the estimated value, where θ is the set of parameters of the deep neural network used. We define the loss function for training the DQN as:

L(θ) = E_{s,a,r,s'}[(Q_tar − Q(s, a; θ))^2].   (9)

As described in [32], the value of Q_tar is constantly changing and often overestimated during training, which results in unstable convergence of the Q function. In [31], DDQN (Double DQN) was proposed to deal with this instability by separating the neural network into two DQNs with two value functions, such that there are two sets of weights θ and φ parameterizing the original value function and the second target network, respectively. The second DQN Q_tar with parameters φ is a lagged copy of the first DQN Q(s, a; θ) to fairly evaluate the Q value; that is,

Q_tar = r + γ Q(s', arg max_{a'} Q(s', a'; θ); φ).   (10)

Deep Deterministic Policy Gradient (DDPG): DDPG is also model-free and off-policy, and it also uses a deep neural network for function approximation. But unlike DQN, which can only solve discrete and low-dimensional action spaces, DDPG can handle continuous action spaces. In addition, DQN is a value-based method, while DDPG is an Actor-Critic method, which has both a value function network (critic) and a policy network (actor). The critic network used in DDPG is the same as in the actor-critic network described before. The difference between DDPG and the actor-critic network is that, inheriting from DDQN [31], DDPG makes the training process more robust by creating two networks (target and current) to estimate the value functions.

Fig. 2. Architectures for local agent. (a) Local critic. (b) Local actor.

Fig. 3. Architectures for global agent. (a) Global critic. (b) Global actor.

IV. METHOD

This paper proposes a cooperative, multi-objective architecture with age-decaying weights for traffic signal control optimization. This architecture represents each intersection with a DDPG architecture which contains a critic network and an actor network. In a given state, the critic network is responsible for judging the value of doing an action, and the actor network tries to make the best decision to output an action. The outcome of an action is the number of seconds of green light. This paper assumes that the duration of a phase cycle (green, yellow, red) is different at different intersections but fixed at an intersection. In addition, the duration of the yellow light is the same and fixed for all intersections. Then, the duration of the red light can be directly derived once the duration of green is known.

The RL-based method for traffic signal control can be value-based or policy-based. The value-based method can achieve faster convergence on traffic signal control, but its time space is discrete and cannot reflect the real requirements to optimize traffic conditions. The policy-based method can infer a non-discrete length of phase duration, but its gradient estimation is strongly dependent on sampling and is not stable due to sampling bias. Thus, the DDPG method is adopted in this paper to concurrently learn the desired Q-function and the corresponding policy.

The original DDPG uses off-policy data and the Bellman equation [29] (see Eq.(7)) to learn the Q-function, and then to derive the policy. It interleaves learning an approximator to find the best Q*(s, a) with learning another approximator to decide the optimal action a*(s), and so in a way the action space is continuous. The output of this DDPG is a continuous probability function to represent an action. In this paper, an action corresponds to the seconds of green light.
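For illustration, a minimal DDPG-style actor/critic pair for a single intersection could look like the following PyTorch sketch. The layer sizes, the state dimension, and the duration bounds D_MIN/D_MAX are assumptions made for readability; they are not the exact networks of Figs. 2 and 3.

```python
import torch
import torch.nn as nn

# A minimal DDPG-style actor/critic pair (sketch only; sizes are assumptions).
STATE_DIM, ACTION_DIM, D_MIN, D_MAX = 16, 1, 20.0, 90.0

class Actor(nn.Module):
    """Deterministic policy: state -> green-light duration in [D_MIN, D_MAX] seconds."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(),
                                 nn.Linear(64, 64), nn.Tanh(),
                                 nn.Linear(64, ACTION_DIM), nn.Tanh())
    def forward(self, s):
        # tanh output in [-1, 1] is rescaled to the allowed duration range
        return D_MIN + (self.net(s) + 1.0) * 0.5 * (D_MAX - D_MIN)

class Critic(nn.Module):
    """Q(s, a): expected return of taking duration a in state s."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.Tanh(),
                                 nn.Linear(64, 64), nn.Tanh(),
                                 nn.Linear(64, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```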
Although DDPG is off-policy, we can mix the past data into the training set and thus make the distribution of the training set diverse by feeding current environment parameters to a traffic simulation software named TSIS [33] to provide on-policy data for RL training. In a general DDPG, to give the agents more opportunities to explore the environment, noise (random sampling) is added to the output action space during the training process, but this also makes the agent blindly explore the environment. In the scenarios of traffic signal control, most SoTA methods adopt local agents to model different intersections. During training, the same training mechanism of "adding noise to the action model" is used to make each agent explore the environment more. However, "increasing the whole throughput" is the same goal for all local agents. The learning strategy of "adding noise to the action model" decreases not only the effectiveness of learning but also the whole throughput, since blind exploration makes local agents choose actions that conflict with other agents. This means a cooperation mechanism among different local agents should be included in the DDPG scheme to increase the final throughput during the learning process. The major novelty of this paper is to introduce a cooperative learning mechanism via a global agent to avoid local agents blindly exploring the environment, so that the whole throughput and the learning effectiveness can be significantly improved.

A. Cooperative DDPG Network Architecture

Most policy-based RL methods [10], [11], [12] use only local agents to perform RL learning for traffic control. The requirements of a local agent easily produce conflicts with other agents and result in a divergence problem during optimization. A cooperative DDPG architecture is designed in this paper, where a local agent controls each intersection and a global agent manages all intersections. Details of this COMMA-DDPG algorithm are described in Algorithm 1. Although the DDPG method is off-policy, we use TSIS [33] to collect on-policy data for RL training. Details of the on-policy data collection process are described in the GOD (Generating On-policy Data) algorithm (see Algorithm 2). With the set of on-policy data, the parameters of local and global agents are then updated by the LAU (Local Agent Updating) algorithm and the GAU (Global Agent Updating) algorithm, respectively. The global agent is involved only during the training stage to generate on-policy data. Let W_G^m represent the global agent's importance to the mth intersection. Then, the importance W_L^m of the mth local agent will be 1 − W_G^m, i.e., W_L^m = 1 − W_G^m. For the mth intersection, the GOD algorithm predicts the next actions by using the local agent and the global agent via an epsilon-greedy exploration scheme, respectively. The competition between the output seconds of the global agent and the local agent depends on W_G^m and W_L^m. Then, the one with higher importance is chosen to output seconds. The output seconds are fed into TSIS [33] to generate on-policy data for RL training. To avoid the training being too biased to one side, we introduce a penalty mechanism via a time-decay method. Assuming that the model selects the global output for t consecutive times, the global weight is multiplied by (0.95)^t; that is, W_global^m = W_global^m × (0.95)^t. This COMMA-DDPG method runs one hour of simulation with an epsilon-greedy exploration scheme to collect on-policy data. The set of on-policy data collected for training the mth local agent is denoted by B^m. The on-policy data set B for training the global agent is the union of all B^m, i.e., B = (B^1, ..., B^m, ..., B^M). In the following, details of each local agent and the global agent are described.

B. Generating On-policy Data

As mentioned earlier, the major contribution of this paper is to add a global agent to trade off different local agents' requirements and make them cooperate to find better strategies for traffic signal control. Here we explain how the global output and the local output compete. The global output contains the number of seconds and the weight W_G^m, which represents the global agent's importance to the mth intersection. Then, the value W_L^m = (1 − W_G^m) is the importance of the mth local agent relative to the global agent. The one with higher importance is chosen to output seconds. To avoid the training being too biased to one side, we introduce a penalty mechanism via the time-decay method. Assuming that the model selects the global output for t consecutive times, the global weight is multiplied by (0.95)^t; that is, W_G^m = W_G^m × (0.95)^t.

Our method is based on MA-DDPG [34] with M local agents, using the decentralized reinforcement learning method [35]. The proposed COMMA-DDPG method adds a global agent, based on MA-DDPG, to control all intersections by using the average stopped delay time of vehicles as the reward. Its actor output is not only the duration of the green light of each intersection, but also the weight W_G^m relative to each local agent m. It is involved only during the training stage to generate on-policy data. During the data generation process, a specific local agent is created at each intersection, using the clearance degree as the reward. As shown in Fig. 2(a), its critic's input state not only has information about its own intersection, but also takes the actions of other agents as its own state, so as to achieve information transmission between all the agents during the training process.

During the RL-based training process, before starting each epoch, we perform a one-hour simulation to collect data (see Algorithm 2) and store it in the replay buffer B. In the process of interacting with the environment, we add the epsilon-greedy and weight-decay methods to the selection of actions. In particular, the epsilon-greedy method gradually reduces epsilon from 0.9 to 0.1.
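The competition between the local and global outputs described above, with the (0.95)^t decay and epsilon-greedy noise, can be summarized by the following Python sketch. It mirrors Algorithm 2 at a high level; the actor interfaces (weight_for, act) are hypothetical names introduced only for this illustration.

```python
import random

def select_action(m, s, consecutive_global, local_actor, global_actor, eps):
    """Sketch of the per-intersection action choice during on-policy data generation
    (mirrors Algorithm 2; the actor call signatures here are assumptions)."""
    w_g = global_actor.weight_for(m, s)          # hypothetical: W_G^m from the global actor
    w_g *= 0.95 ** consecutive_global            # decay after consecutive global picks
    w_l = 1.0 - w_g                              # local importance W_L^m = 1 - W_G^m

    # epsilon-greedy noise on the output seconds
    noise = 0.0 if random.random() <= eps else random.uniform(-5.0, 5.0)

    if w_l > w_g:
        seconds = local_actor.act(m, s) + noise      # local agent wins the competition
        consecutive_global = 0
    else:
        seconds = global_actor.act(m, s) + noise     # global agent wins
        consecutive_global += 1
    return seconds, consecutive_global
```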
C. Local Agent

In our scenario, a fixed duration of a traffic signal change cycle is assigned to each intersection. In addition, 5 seconds are reserved for the yellow light. Then, we only need to model the phase duration of the green light. After that, the phase duration of the red light can be directly estimated. At each intersection, a DDPG-based architecture is constructed to model the local agent for traffic control. To describe this local agent, some definitions are given as follows.

1) The duration of a traffic phase ranges from Dmin to Dmax seconds.
2) Stopped vehicles are defined as those vehicles whose speeds are less than 3 km/hr.
3) The state at an intersection is defined by a vector in which each entry records the number of stopped vehicles in each lane at this intersection at the end of the green light, together with the current traffic signal phase.

The reward evaluating the quality of a state at an intersection is defined as the clearance degree of this state at this intersection, i.e., the number of vehicles remaining in the intersection when the period of green light ends. There are two cases in which a reward qualifies a state; that is, (1) the green light ends but there is still traffic, and (2) the green light is still on but there is no traffic. There is no reward or penalty for other cases. Let N_{m,t} denote the number of vehicles in intersection m at time t, and Nmax be the maximum traffic flow in the mth intersection. This paper uses the clearance degree as a reward for qualifying the mth local agent. When the green light ends and there is no traffic, a pre-defined max reward Rmax is assigned to the mth local agent. If there is still traffic, a penalty proportional to N_{m,t} is given to this local agent. More precisely, for Case 1, the reward r^{local}_{m,t} for the mth intersection is defined as:

Case 1: If the green light ends but there is still traffic,

r^{local}_{m,t} = { Rmax, if N_{m,t}/Nmax ≤ 1/Nmax;  −Rmax × N_{m,t}/Nmax, otherwise.   (11)

For Case 2, if there is no traffic but a long period still remains for the green light, vehicles moving on another road must stop and wait until this green light turns off. To avoid this case happening again, a penalty should be given to this local agent. Let g_{m,t} denote the remnant green light time (counted in seconds) when there is no traffic flow in the mth intersection at time step t, and Gmax the largest duration of green light. Then, the reward function for Case 2 is defined as follows.

Case 2: If there is no traffic but the green light is still on,

r^{local}_{m,t} = { Rmax, if g_{m,t}/Gmax ≤ 1/Gmax;  −Rmax × g_{m,t}/Gmax, otherwise.   (12)

Detailed architectures for local agents are shown in Fig. 2. Fig. 2(a) shows the proposed local critic architecture. Its inputs include the numbers of stopped vehicles at the end of the green light in each lane, the remaining green light seconds, and the current traffic signal phases of all intersections. Thus, the input dimension for each local critic network is (2M + Σ_{m=1}^{M} N^m_lane), where M denotes the number of intersections and N^m_lane is the number of lanes in the mth intersection. Then, a hyperbolic tangent function is used as an activation function to normalize all the input and output values. Two hidden fully-connected layers are used to model the Q-value. The output is the expected value of the future return of doing the action at the state.

The architecture of the local actor network is shown in Fig. 2(b). The inputs used to model this network include the numbers of stopped vehicles at the end of the green light in each lane, and the current traffic signal phases of all intersections. Thus, the input dimension for each local actor network is (M + Σ_{m=1}^{M} N^m_lane).

Let θ^Q_m and θ^µ_m denote the sets of parameters of the mth local critic and actor networks, respectively. To train θ^Q_m and θ^µ_m, we sample a random minibatch of N_b transitions (S_i, A_i, R_i, S_{i+1}) from B, where
1) each state S_i is an M × 1 vector and contains the local states of all intersections;
2) each action A_i is represented as an M × 1 vector which contains the seconds of the current phase of all intersections;
3) each reward R_i is an M × 1 vector which contains the rewards obtained from each intersection after performing A_i at the state S_i. In addition, R_i(m) denotes the reward of the mth intersection after performing A_i.

Let y^m_i denote the reward obtained from the mth target critic network. The loss function for updating θ^Q_m is then defined as follows:

L^m_critic = (1/N_b) Σ_{i=1}^{N_b} (y^m_i − Q(S_i, A_i|θ^Q_m))^2.   (13)

In addition, the loss function for updating θ^µ_m is defined as

L^m_actor = −(1/N_b) Σ_{i=1}^{N_b} Q(S_i, µ(S_i|θ^µ_m)|θ^Q_m).   (14)

With θ^Q_m and θ^µ_m, the parameters θ^{Q'}_m and θ^{µ'}_m for the target network are updated as follows:

θ^{Q'}_m ← (1 − τ)θ^Q_m + τ θ^{Q'}_m,   (15)

and

θ^{µ'}_m ← (1 − τ)θ^µ_m + τ θ^{µ'}_m.   (16)

The parameter τ is set to 0.8 for updating the target network. Details of updating the parameters of local agents are described in Algorithm 3.
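A possible PyTorch rendering of one local-agent update step following Eqs. (13)-(16) is sketched below. The network classes, optimizers, and minibatch layout are assumptions; the sketch only illustrates the two losses and the soft target update with ratio τ.

```python
import torch

def lau_step(critic, actor, target_critic, target_actor,
             critic_opt, actor_opt, batch, m, gamma=0.99, tau=0.8):
    """One local-agent update following Eqs. (13)-(16) (sketch; batch layout assumed
    to be joint tensors S, A, S_next and a per-intersection reward matrix R)."""
    S, A, R, S_next = batch                       # minibatch of N_b transitions

    # Target value y_i^m from the target networks (cf. Algorithm 3)
    with torch.no_grad():
        y = R[:, m:m+1] + gamma * target_critic(S_next, target_actor(S_next))

    # Critic loss, Eq. (13)
    critic_loss = ((y - critic(S, A)) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor loss, Eq. (14)
    actor_loss = -critic(S, actor(S)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft target updates, Eqs. (15)-(16): theta' <- (1 - tau)*theta + tau*theta'
    for tgt, src in ((target_critic, critic), (target_actor, actor)):
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(tau).add_((1.0 - tau) * p.data)
```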
D. Global Agent

To make the output action no longer blindly explore the environment, we introduce a global agent to explore the environment more precisely when outputting actions. The global agent controls the total waiting time at all intersections. Fig. 3 shows the detailed architectures of the global critic and actor networks, where (a) is the global critic network and (b) is the global actor network. For the mth intersection, we use V_m to denote the number of its total vehicles, and T^{w,i}_{m,n} to denote the waiting time of vehicle n at time step i. Then, the total waiting time across the whole site is used to define the global reward as follows:

r^G_i = −(1/M) Σ_{m=1}^{M} Σ_{n=1}^{V_m} T^{w,i}_{m,n}.   (17)

Let θ^Q_G and θ^µ_G denote the parameters of the global critic and actor networks, respectively. To train θ^Q_G and θ^µ_G, we sample a random minibatch of N_b transitions (S_i, A_i, R_i, S_{i+1}) from B. Let y^G_i denote the reward obtained from the global target critic network at time step i. Then, the loss function for updating θ^Q_G is defined as follows:

L^G_critic = (1/N_b) Σ_i (y^G_i − Q_G(S_i, A_i|θ^Q_G))^2.   (18)

The network architecture to calculate the value function Q_G is shown in Fig. 3(a). Note that the output of this global critic network is a scalar value, i.e., the predicted total waiting time across the whole site. To train θ^µ_G, we use the loss function:

L^G_actor = −(1/N_b) Σ_i Q_G(S_i, µ_G(S_i|θ^µ_G)|θ^Q_G).   (19)

Fig. 3(b) shows the architecture to calculate µ_G. In addition, the output of the global actor network is a vector which includes the actions of all intersections and the weight W^m_G, which represents the global agent's importance to the mth intersection. All the local agents and the global agent are modeled as DDPG. Details of updating the global agent are described in Algorithm 4.

Algorithm 1: COMMA-DDPG traffic signal control RL algorithm
  Initialize critic network Q(s, a|θ^Q) and actor network µ(s|θ^µ) with random weights θ^Q and θ^µ.
  Initialize target networks Q' and µ' with weights θ^{Q'} ← θ^Q, θ^{µ'} ← θ^µ, and also initialize the replay buffer R.
  for t = 1, ..., T do
    Clean the replay buffer B.
    /* B = (B^1, ..., B^m, ..., B^M); B^m: on-policy data for the mth intersection */
    /* Generate on-policy data */
    B = GOD(t);
    for episode = 1, ..., 400 do
      for m = 1, ..., M, Global do
        if m ≠ Global then
          LAU(B, m);  // Update local agents
        end
        if m = Global then
          GAU(B);  // Update the global agent
        end
      end
    end
  end

Algorithm 2: GOD (Generating On-policy Data)
  /* Run one hour of simulation with noise η */
  Input: t: timestamp;
    θ^µ_m: parameters for the mth actor network;
    θ^µ_G: parameters for the global actor network
  Output: B: on-policy data
  β = 0.95^t;  // rate for time decay
  for m = 1, ..., M do
    Get W^m_G from the global actor network with the parameters θ^µ_G;
    W^m_G = β × W^m_G; W^m_L = 1 − W^m_G;
    for l = 1, ..., 3600 do
      /* ε: the probability of choosing to explore */
      /* η_m: noise for epsilon-greedy exploration */
      p = random(0, 1);
      η_m = { 0, if p ≤ ε;  random(−5, 5), if p > ε };
      a^m_l = { µ(s_l|θ^µ_m) + η_m, if W^m_L > W^m_G;  µ_G(s_l|θ^µ_G)(m) + η_m, if W^m_L < W^m_G };
      Execute a^m_l and observe r^m_l, s^m_{l+1};
      Store transition (s^m_l, a^m_l, r^m_l, s^m_{l+1}) in B^m;
    end
  end
  B = (B^1, ..., B^m, ..., B^M);
  Return(B);

Algorithm 3: LAU (Local Agent Updating)
  Input: B: on-policy data; m: the mth agent;
    θ^Q_m: parameters of the local critic network;
    θ^µ_m: parameters of the local actor network;
    (θ^{Q'}_m, θ^{µ'}_m): parameters of the target networks
  Output: θ^Q_m: new parameters of the mth critic network;
    θ^µ_m: new parameters of the mth actor network;
    (θ^{Q'}_m, θ^{µ'}_m): new parameters of the target networks
  Sample a random minibatch of N_b transitions (S_i, A_i, R_i, S_{i+1}) from B;
  Set y^m_i = R_i(m) + γ Q'(S_{i+1}, µ'(S_{i+1}|θ^{µ'}_m)|θ^{Q'}_m);
  Update the critic parameters θ^Q_m by minimizing the loss: L^m_critic = (1/N_b) Σ_i (y^m_i − Q(S_i, A_i|θ^Q_m))^2;
  Update the actor parameters θ^µ_m by minimizing the loss: L^m_actor = −(1/N_b) Σ_i Q(S_i, µ(S_i|θ^µ_m)|θ^Q_m);
  Update the target networks: θ^{Q'}_m ← (1 − τ)θ^Q_m + τ θ^{Q'}_m, θ^{µ'}_m ← (1 − τ)θ^µ_m + τ θ^{µ'}_m;

Algorithm 4: GAU (Global Agent Updating)
  Sample a random minibatch of N_b transitions (S_i, A_i, R_i, S_{i+1}) from B;
  Calculate r^G_i from Eq.(17);
  Set y^G_i = r^G_i + γ Q'_G(S_{i+1}, µ'_G(S_{i+1}|θ^{µ'}_G)|θ^{Q'}_G);
  Update the critic parameters θ^Q_G by minimizing the loss: L^G_critic = (1/N_b) Σ_i (y^G_i − Q_G(S_i, A_i|θ^Q_G))^2;
  Update the actor parameters θ^µ_G by minimizing the loss: L^G_actor = −(1/N_b) Σ_i Q_G(S_i, µ_G(S_i|θ^µ_G)|θ^Q_G);
  Update the target networks: θ^{Q'}_G ← (1 − τ)θ^Q_G + τ θ^{Q'}_G, θ^{µ'}_G ← (1 − τ)θ^µ_G + τ θ^{µ'}_G;
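Putting the pieces together, the overall training procedure of Algorithm 1 can be summarized in a few lines of Python. The callables god, lau, and gau below are hypothetical stand-ins for the GOD, LAU, and GAU procedures above; their signatures are assumptions made only for this sketch.

```python
# A compact rendering of the training loop in Algorithm 1 (sketch only).
def train_comma_ddpg(agents, global_agent, god, lau, gau, T=100, episodes=400):
    for t in range(1, T + 1):
        replay = god(t, agents, global_agent)      # regenerate on-policy data each epoch
        for _ in range(episodes):
            for m in range(len(agents)):           # update every local agent ...
                lau(replay, m, agents[m])
            gau(replay, global_agent)              # ... and then the global agent
```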
V. EXPERIMENTAL RESULTS

A. Environment Setup

It is difficult to test and evaluate traffic signal control strategies in the real world due to high cost and intensive labor. Simulation is a useful alternative before actual implementation for most SoTA methods [3]. To build the simulation data, real data were collected from five real intersections in a city in Asia over half a year. The simulation with real traffic flow at the intersections was performed based on the traffic simulation software TSIS [33]. Through TSIS, we can control the behavior of each traffic light with a plug-in program and use it as the simulation software needed for performance evaluation.

B. Results

To train our method, the fixed-time control model was first used to pretrain our COMMA-DDPG model to speed up training. The baseline is a fixed strategy. Several SoTA methods were compared in this paper, that is, CGRL [38], MADDPG [34], TD3 [36], PPO [37], Presslight [26], IntelliLight [7], and CoLight [27]. Table I shows the comparisons of waiting time and average speed of vehicles among different methods. Clearly, our method performs better than the fixed-time model and other SoTA methods. Table II shows the comparisons of throughput between COMMA-DDPG and other methods at different intersections. Due to the control of the global agent, our method performs much better than other methods. In Algorithms 3 and 4, there is a soft update ratio τ to update the network parameters. Table III shows the effect of changing τ on the training result in different situations. It means better performance can be gained if the model is not changed frequently. DDPG is an off-policy method. In Algorithm 2, an on-policy data collection method is proposed to train the agents. The results in Table IV illustrate the point made earlier, that is, using on-policy training can achieve better results.

Fig. 4 shows that the distance between the two parallel lines is the Green Band, and its slope represents the driving speed. It shows that even during peak working hours, vehicles can continue to pass through all intersections in the system without being hindered by red lights if the average speed is used as the designed continuous driving speed.

TABLE I
COMPARISONS OF WAITING TIME AND SPEED BETWEEN COMMA-DDPG AND OTHER TRADITIONAL RL METHODS.

Method | Waiting Time | Average Speed
Fixed | 750628 | 19
IntelliLight [7] | xxx | xx
MADDPG [34] | 420561 | 20
TD3 [36] | 716481 | 20
PPO [37] | 873585 | 12
CGRL [38] | xxx | xx
Presslight [26] | xxx | xx
CoLight [27] | xxx | xx
COMMA-DDPG w/o Global Agent | xxx | xxx
COMMA-DDPG with Global Agent | 269747 | 43

TABLE II
COMPARISONS OF THROUGHPUT BETWEEN COMMA-DDPG AND OTHER TRADITIONAL RL METHODS.

Method | I1 | I2 | I3 | I4 | I5
Fixed | 1530 | 1560 | 1996 | 2288 | 2291
MADDPG | 1782 | 1819 | 2098 | 1896 | 2400
TD3 | 1370 | 1394 | 1787 | 2070 | 2147
PPO | 979 | 957 | 1206 | 1517 | 1619
COMMA-DDPG | 2225 | 2310 | 2784 | 3052 | 2868

TABLE III
COMPARISONS OF τ BETWEEN RANDOM SAMPLE AND FIXED.

Updating Ratio τ | Waiting Time | Average Speed
random(0, 1) | 436538 | 36
random(0.8, 1) | 626936 | 27
random(0.9, 1) | 675703 | 26
τ = 0.995 | 269747 | 43

TABLE IV
COMPARISONS BETWEEN ON-POLICY AND OFF-POLICY TRAINING.

Method | Waiting Time | Average Speed
on-policy | 269747 | 43
off-policy | 275868 | 29

Fig. 4. Time and space diagram.

VI. CONCLUSIONS

This paper proposed a novel cooperative RL architecture that handles cooperation problems by adding a global agent. Since the global agent knows the information of all intersections, it can guide the local agents to take better actions during the training process, so that the local agents do not rely on random noise to blindly explore the environment but explore in a directed way. Since RL training requires a large amount of data, we hope to incorporate data augmentation into RL in the future, so that training can be more efficient.

VII. APPENDIX FOR CONVERGENCE PROOF

In this section, we prove that the value function in our method actually converges.

Definition VII.1. A metric space < M, d > is complete (or Cauchy) if and only if all Cauchy sequences in M converge in M. In other words, in a complete metric space, for any point sequence a1, a2, · · · ∈ M, if the sequence is Cauchy, then the sequence converges to a limit in M:

lim_{n→∞} a_n ∈ M.

Definition VII.2. Let (X, d) be a complete metric space. Then a map T: X → X is called a contraction mapping on X if there exists q ∈ [0, 1) such that d(T(x), T(y)) < q d(x, y), ∀x, y ∈ X.
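Before stating the fixed-point theorem used below, the contraction property that the proof establishes for the Bellman backup T(v) = R + λPv can also be checked numerically. The following NumPy sketch uses a random row-stochastic matrix as a toy stand-in for P^π; it is an illustration only, not part of the proof.

```python
import numpy as np

# Numerical check: for row-stochastic P and lambda < 1, T(v) = R + lambda*P*v
# shrinks sup-norm distances by at least a factor lambda (toy sizes).
rng = np.random.default_rng(1)
n, lam = 6, 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # row-stochastic
R = rng.uniform(-1, 1, n)
T = lambda v: R + lam * P @ v

u, v = rng.normal(size=n), rng.normal(size=n)
print(np.max(np.abs(T(u) - T(v))) <= lam * np.max(np.abs(u - v)))  # True

# Iterating T from any start converges to the unique fixed point V = R + lam*P*V.
V = np.zeros(n)
for _ in range(500):
    V = T(V)
print(np.allclose(V, R + lam * P @ V))  # True
```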
Theorem 1 (Banach fixed-point theorem). Let (X, d) be a non-empty complete metric space with a contraction mapping T: X → X. Then T admits a unique fixed point x* in X, i.e., T(x*) = x*.

Theorem 2 (Gershgorin circle theorem). Let A be a complex n × n matrix with entries a_{ij}. For i ∈ {1, 2, ..., n}, let R_i be the sum of the absolute values of the non-diagonal entries in the ith row:

R_i = Σ_{j=1, j≠i}^{n} |a_{ij}|.

Let D(a_{ii}, R_i) ⊆ C be a closed disc centered at a_{ii} with radius R_i. Then every eigenvalue of A lies within at least one of the Gershgorin discs D(a_{ii}, R_i).

Lemma 3. We claim that the value function of RL actually converges, and we also apply it to traffic control.

Proof. The value function calculates the value of each state, which is defined as follows:

V^π(s) = Σ_a π(a|s) Σ_{s',r} p(s', r|s, a)[r + γV^π(s')]
       = Σ_a π(a|s) Σ_{s',r} p(s', r|s, a) r + Σ_a π(a|s) Σ_{s',r} p(s', r|s, a)[γV^π(s')].   (20)

Since the immediate reward is determined, it can be regarded as a constant term relative to the second term. Assuming that the state space is finite, we express the state value function in matrix form below. Set the state set S = {S_0, S_1, · · · , S_n}, V^π = {V^π(s_0), V^π(s_1), · · · , V^π(s_n)}^T, and the transition matrix

P^π = [ 0          P^π_{0,1}   · · ·   P^π_{0,n}
        P^π_{1,0}   0          · · ·   P^π_{1,n}
        · · ·       · · ·      · · ·   · · ·
        P^π_{n,0}   P^π_{n,1}  · · ·   0        ],   (21)

where P^π_{i,j} = Σ_a π(a|s_i) p(s_j, r|s_i, a). The constant term is expressed as R^π = {R_0, R_1, · · · , R_n}^T. Then we can rewrite the state-value function as:

V^π = R^π + λP^π V^π.   (22)

Above we defined the state value function vector as V^π = {V^π(s_0), V^π(s_1), · · · , V^π(s_n)}^T, which belongs to the value function space V. We consider V to be the full space of n-dimensional vectors, and define the metric of this space as the infinity norm; that is,

d(u, v) = ‖u − v‖_∞ = max_{s∈S} |u(s) − v(s)|, ∀u, v ∈ V.   (23)

Since < V, d > is the full space of vectors, V is a complete metric space. Then, the iteration of the state value function is u_new = T^π(u) = R^π + λP^π u. We can show that it is a contraction mapping:

d(T^π(u), T^π(v)) = ‖(R^π + λP^π u) − (R^π + λP^π v)‖_∞
                  = ‖λP^π(u − v)‖_∞
                  ≤ ‖λP^π‖ ‖u − v‖_∞.   (24)

From Theorem 2, we can show that every eigenvalue of P^π lies in the disc centered at (0, 0) with radius 1. That is, the maximum absolute value of its eigenvalues will be less than 1. Hence,

d(T^π(u), T^π(v)) ≤ ‖λP^π‖ ‖u − v‖_∞
                  ≤ λ ‖u − v‖_∞
                  = λ d(u, v).   (25)

From Theorem 1, the iteration of Eq. (22) converges to the unique V^π.

REFERENCES

[1] S. Alemzadeh, R. Moslemi, R. Sharma, and M. Mesbahi, "Adaptive traffic control with deep reinforcement learning: Towards state-of-the-art and beyond," arXiv preprint arXiv:2007.10960, 2020.
[2] G. Zheng, X. Zang, N. Xu, H. Wei, Z. Yu, V. Gayah, K. Xu, and Z. Li, "Diagnosing reinforcement learning for traffic signal control," arXiv preprint arXiv:1905.04716, 2019.
[3] H. Wei, G. Zheng, V. Gayah, and Z. Li, "Recent advances in reinforcement learning for traffic signal control: A survey of models and evaluation," ACM SIGKDD Explorations Newsletter, vol. 22, no. 2, pp. 12–18, 2021.
[4] P. Mannion, J. Duggan, and E. Howley, "An experimental review of reinforcement learning algorithms for adaptive traffic signal control," Autonomic Road Transport Support Systems, pp. 47–66, 2016.
[5] T. T. Pham, T. Brys, M. E. Taylor, T. Brys, M. M. Drugan, P. Bosman, M.-D. Cock, C. Lazar, L. Demarchi, D. Steenhoff et al., "Learning coordinated traffic light control," in Proceedings of the Adaptive and Learning Agents Workshop (at AAMAS-13), vol. 10. IEEE, 2013, pp. 1196–1201.
[6] E. Van der Pol and F. A. Oliehoek, "Coordinated deep reinforcement learners for traffic light control," Proceedings of Learning, Inference and Control of Multi-Agent Systems (at NIPS 2016), 2016.
[7] H. Wei, G. Zheng, H. Yao, and Z. Li, "IntelliLight: A reinforcement learning approach for intelligent traffic light control," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 2496–2505.
[8] I. Arel, C. Liu, T. Urbanik, and A. G. Kohls, "Reinforcement learning-based multi-agent system for network traffic signal control," IET Intelligent Transport Systems, vol. 4, no. 2, pp. 128–135, 2010.
[9] J. A. Calvo and I. Dusparic, "Heterogeneous multi-agent deep reinforcement learning for traffic lights control," in AICS, 2018, pp. 2–13.
[10] T. Chu, J. Wang, L. Codecà, and Z. Li, "Multi-agent deep reinforcement learning for large-scale traffic signal control," IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 3, pp. 1086–1095, 2019.
[11] T. Nishi, K. Otaki, K. Hayakawa, and T. Yoshimura, "Traffic signal control based on reinforcement learning with graph convolutional neural nets," in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 877–883.
[12] S. S. Mousavi, M. Schukat, and E. Howley, "Traffic light control using deep policy-gradient and value-function-based reinforcement learning," IET Intelligent Transport Systems, vol. 11, no. 7, pp. 417–423, 2017.
[13] M. Aslani, M. S. Mesgari, and M. Wiering, "Adaptive traffic signal control with actor-critic methods in a real-world traffic network with different traffic disruption events," Transportation Research Part C: Emerging Technologies, vol. 85, pp. 732–752, 2017.
[14] M. Aslani, S. Seipel, M. S. Mesgari, and M. Wiering, "Traffic signal optimization through discrete and continuous reinforcement learning with robustness analysis in downtown Tehran," Advanced Engineering Informatics, vol. 38, pp. 639–655, 2018.
[15] H. Pang and W. Gao, "Deep deterministic policy gradient for traffic signal control of single intersection," in 2019 Chinese Control And Decision Conference (CCDC). IEEE, 2019, pp. 5861–5866.
[16] H. Wu, "Control method of traffic signal lights based on DDPG reinforcement learning," in Journal of Physics: Conference Series, vol. 1646, no. 1. IOP Publishing, 2020, p. 012077.
[17] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[18] R. P. Roess, E. S. Prassas, and W. R. McShane, Traffic Engineering. Pearson/Prentice Hall, 2004.
[19] M. Fellendorf, "VISSIM: A microscopic simulation tool to evaluate actuated signal control including bus priority," in 64th Institute of Transportation Engineers Annual Meeting, vol. 32. Springer, 1994, pp. 1–9.
[20] P. Mirchandani and L. Head, "A real-time traffic signal control system: architecture, algorithms, and analysis," Transportation Research Part C: Emerging Technologies, vol. 9, no. 6, pp. 415–432, 2001.
[21] G. Zheng, Y. Xiong, X. Zang, J. Feng, H. Wei, H. Zhang, Y. Li, K. Xu, and Z. Li, "Learning phase competition for traffic signal control," in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 1963–1972.
[22] P. Lowrie, SCATS, Sydney Co-ordinated Adaptive Traffic System: A Traffic Responsive Method of Controlling Urban Traffic. Darlinghurst, NSW, Australia, 1990.
[23] P. Hunt, D. Robertson, R. Bretherton, and R. Winton, "SCOOT - a traffic responsive method of coordinating signals," Transport and Road Research Laboratory (TRRL), Tech. Rep., 1981.
[24] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[25] M. Abdoos, N. Mozayani, and A. L. Bazzan, "Traffic light control in non-stationary environments based on multiagent Q-learning," in 2011 14th International IEEE Conference on Intelligent Transportation Systems (ITSC). IEEE, 2011, pp. 1580–1585.
[26] H. Wei, C. Chen, G. Zheng, K. Wu, V. Gayah, K. Xu, and Z. Li, "PressLight: Learning max pressure control to coordinate traffic signals in arterial network," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 1290–1298.
[27] H. Wei, N. Xu, H. Zhang, G. Zheng, X. Zang, C. Chen, W. Zhang, Y. Zhu, K. Xu, and Z. Li, "CoLight: Learning network-level cooperation for traffic signal control," in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 1913–1922.
[28] Y. Xiong, G. Zheng, K. Xu, and Z. Li, "Learning traffic signal control from demonstrations," in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 2289–2292.
[29] E. N. Barron and H. Ishii, "The Bellman equation for minimizing the maximum cost," Nonlinear Analysis: Theory, Methods and Applications, vol. 13, no. 9, pp. 1067–1090, 1989.
[30] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband et al., "Deep Q-learning from demonstrations," in Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[31] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
[32] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[33] L. E. Owen, Y. Zhang, L. Rao, and G. McHale, "Traffic flow simulation using CORSIM," in 2000 Winter Simulation Conference Proceedings (Cat. No. 00CH37165), vol. 2. IEEE, 2000, pp. 1143–1147.
[34] J. K. Gupta, M. Egorov, and M. Kochenderfer, "Cooperative multi-agent control using deep reinforcement learning," in International Conference on Autonomous Agents and Multiagent Systems. Springer, 2017, pp. 66–83.
[35] L. Matignon, G. J. Laurent, and N. Le Fort-Piat, "Hysteretic Q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams," in 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2007, pp. 64–69.
[36] S. Fujimoto, H. Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," in International Conference on Machine Learning. PMLR, 2018, pp. 1587–1596.
[37] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[38] E. Van der Pol and F. A. Oliehoek, "Coordinated deep reinforcement learners for traffic light control," in NIPS'16 Workshop on Learning, Inference and Control of Multi-Agent Systems, Dec. 2016. [Online]. Available: https://sites.google.com/site/malicnips2016/papers
