

Using Reinforcement Learning With Partial Vehicle Detection for Intelligent Traffic Signal Control

Rusheng Zhang, Akihiro Ishikawa, Wenli Wang, Benjamin Striner, and Ozan K. Tonguz

Abstract— Intelligent Traffic Signal Control (ITSC) systems have attracted the attention of researchers and the general public alike as a means of alleviating traffic congestion. Recently, vehicular wireless technologies have enabled a cost-efficient way to achieve ITSC by detecting vehicles using Vehicle-to-Infrastructure (V2I) wireless communications. Traditional ITSC algorithms, in most cases, assume that every vehicle is detected, such as by a camera or a loop detector, but a V2I implementation would detect only those vehicles equipped with wireless communications capability. We examine a family of transportation systems, which we will refer to as 'Partially Detected Intelligent Transportation Systems'. An algorithm that can perform well under a small detection rate is highly desirable due to the gradually increasing penetration rates of the underlying technologies, such as Dedicated Short Range Communications (DSRC) technology. The Reinforcement Learning (RL) approach in Artificial Intelligence (AI) could provide indispensable tools for such problems, where only a small portion of vehicles are detected by the ITSC system. In this paper, we report a new RL algorithm for Partially Detected Intelligent Traffic Signal Control (PD-ITSC) systems. The performance of this system is studied under different car flows, detection rates, and types of road network. Our system is able to efficiently reduce the average waiting time of vehicles at an intersection, even with a low detection rate, thus reducing the travel time of vehicles.

Index Terms— Reinforcement learning, artificial intelligence, intelligent transportation systems, partially detected intelligent transportation systems, vehicle-to-infrastructure communications.

Manuscript received June 28, 2018; revised December 22, 2018, May 16, 2019, and September 6, 2019; accepted November 18, 2019. This work was supported by the King Abdulaziz City of Science and Technology (KACST), Riyadh, Kingdom of Saudi Arabia. The Associate Editor for this article was J. W. Choi. (Corresponding author: Rusheng Zhang.) Rusheng Zhang, Akihiro Ishikawa, Wenli Wang, and Ozan K. Tonguz are with the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213-3890 USA (e-mail: rushengz@andrew.cmu.edu). Benjamin Striner is with the Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213-3890 USA. Digital Object Identifier 10.1109/TITS.2019.2958859

I. INTRODUCTION

Traffic congestion is a daunting problem that affects the daily lives of billions of people in most countries across the world [1]. Over the last 30 years, many Intelligent Traffic Signal Control (ITSC) systems have been designed and demonstrated as one of the effective ways to reduce traffic congestion [2]–[9]. These systems use real-time traffic information measured or collected by video cameras or loop detectors and optimize the cycle split of a traffic light accordingly [10]. Unfortunately, such intelligent traffic signal control schemes are expensive and, therefore, they exist only at a small percentage of intersections in the United States, Europe, and Asia.

Recently, several more cost-effective approaches to implement ITSC systems were proposed by leveraging Dedicated Short-Range Communication (DSRC) technology [11]–[13]. DSRC is potentially a much cheaper technology for detecting the presence of vehicles on the approaches of an intersection. However, at the early stages of deployment, only a small percentage of vehicles will be equipped with DSRC radios. Meanwhile, the rapid development of the Internet of Things (IoT) has created new technology applicable for sensing vehicles for ITSC. Other than DSRC, applicable technologies include, but are not limited to, RFID, Bluetooth, Ultra-Wide Band (UWB), Zigbee, and even cellphone apps such as Google Maps [14]–[16]. All these systems are more economical than traditional loop detectors or cameras. Performance-wise, most of these systems are able to track vehicles in a continuous manner, while loop detectors can only detect the presence of vehicles. The ITSC systems mentioned above are all promising technologies that could bring the expensive price of traditional ITSC systems down dramatically; however, these systems have a common critical shortcoming: they are not able to detect vehicles unequipped with the communication device (i.e., DSRC radios, RFID tags, Bluetooth devices, etc.).

Since this adoption stage of the aforementioned systems could possibly take several years [17], new control algorithms that can handle partial detection of vehicles are required. One potential AI algorithm that could be very helpful is deep reinforcement learning (DRL), which has recently been explored by several groups [18], [19]. These results show an improvement in terms of the waiting time and queue length experienced at an intersection in a fully observable environment. Hence, in this paper, we investigate this promising approach in a partially observable environment. As expected, we observe an asymptotically improving result as we increase the penetration rate of DSRC-equipped vehicles.

In this paper, we explore the capability of DRL for handling ITSC systems using partial detection. For simplicity, in some sections we use a DSRC detection based system as the example system, but the scheme described in this paper is very general and can therefore be used for any possible form of partial detection, such as vehicle detection based on RFID, Bluetooth Low Energy 5.0 (BLE 5.0), or cellular (LTE or 5G).


Via extensive simulations, we analyze the performance of the Reinforcement Learning (RL) method. Our results clearly show that reinforcement learning is capable of providing an excellent traffic management scheme that is able to reduce the waiting time of commuters at intersections, even at a low penetration rate. The results also show a different performance for detected and undetected vehicles, suggesting a built-in business model, which could be the key to eventually pushing forward the large-scale deployment of ITSC.

The remainder of this paper is organized as follows. In Section II, we review the related work in this area. Section III gives a detailed problem formulation. Section IV outlines the approach we use. Section V presents the results of our study in terms of performance and sensitivity to critical system parameters. In Section VI, a discussion is presented that highlights the practical implications of our results for intelligent transportation systems, in addition to highlighting important extensions for future work. Finally, Section VII concludes the paper.

II. RELATED WORKS

Traffic signal control using Artificial Intelligence (AI), especially reinforcement learning (RL), has been an active field of research for the last 20 years. In 1994, Mikami et al. proposed distributed reinforcement learning (Q-learning) using a Genetic Algorithm to present a traffic signal control scheme that effectively increased the throughput of a road network [20]. Due to the limitations of computing power in 1994, however, it could not be implemented at that time.

Bingham proposed RL for parameter search of a fuzzy-neural traffic signal controller for a single intersection [21], while Choy et al. adapted RL on the fuzzy-neural system in a cooperative scheme, achieving adaptive control for a large area [22]. These algorithms are based on RL, but the major goal of RL there is parameter tuning of the fuzzy-neural system. Abdulhai et al. proposed the first truly adaptive traffic signal, which learns to control the traffic signal dynamically based on a Cerebellar Model Articulation Controller (CMAC) as a Q-estimation network [23]. Silva et al. and Oliveira et al. then proposed a context detector (CD) in conjunction with RL to further improve the performance under non-stationary traffic conditions [24], [25]. Several researchers have focused on multi-agent reinforcement learning for implementing it on a large scale [26]–[29].

Recently, with the development of GPUs and computing power, DRL has become an attractive method in several fields. Several attempts have been made using Deep Q-learning for ITSC systems, including [18], [19], [30], [31]. These results show that a DQN-based Q-learning algorithm is capable of optimizing the traffic in an intelligent manner.

Traditional intelligent traffic signal systems use loop detectors, magnetic detectors and cameras for improving the performance of traffic lights. In the past few decades, various adaptive traffic systems were developed and implemented. Some of these traffic systems, such as SCOOT [4] and SCATS [3], are based on dynamic traffic coordination [5], and can be viewed as a traffic-responsive version of TRANSYT [2]. These systems optimize the offsets of traffic signals in the network, based on current traffic demand, and generate a 'green wave' for the major car flow. Meanwhile, some other model-based systems have been proposed, including OPAC [6], RHODES [7], and PRODYN [8]. These systems use both the current traffic arrivals and the prediction of future arrivals, and choose a signal phase plan that optimizes the objective functions. While these systems work efficiently, they do have some significant shortcomings. The cost of these systems is generally very high [32], [33]. Even though RL yields impressive results for these cases, it does not outperform current systems. Hence, the progress of these algorithms, while interesting, is of limited impact, since traditional ITSC systems perform comparably.

Meanwhile, the recent advancements in Vehicle-to-Everything (V2X) communication have made traffic signal control schemes based on such technology a rising field, as the cost is significantly lower than a traditional ITSC system [11]–[13]. Within these schemes, a system known as Virtual Traffic Lights (VTL) is very attractive, as it proposes an infrastructure-free DSRC-based solution, by installing traffic control devices in vehicles and having the vehicles decide the right-of-way at an intersection locally. Different aspects of VTL technology have been studied by different research groups in the last few years [11], [34]–[45]. However, a VTL system requires all vehicles in the road network to be equipped with DSRC devices; therefore, a transition scheme for the current transportation systems to smoothly transition to a VTL system is needed.

On the other hand, several methods have been proposed for floating vehicle data gathered from the Global Positioning System (GPS) that are used to detect, estimate and predict traffic states based on fuzzy logic, Genetic Algorithms (GA), Support Vector Machines (SVM) and other statistical learning algorithms [46]–[51]. The success of these works suggests the possibility of optimizing traffic control based on partial detection (such a system is formally introduced in Section III).

There are a few research projects currently available using partial detection. For example, COLOMBO is one of the projects that focuses on a low penetration rate of DSRC-equipped vehicles [52]–[54]. The system uses information provided by V2X technology and feeds the information to a traffic management system. Since COLOMBO cannot directly react to real-time traffic flow (the detected and undetected vehicles have the same performance), under low to medium car flow it will NOT achieve optimum performance, as the optimal strategy under low-to-medium car flow has to react according to detected car arrivals. Another very recent system is DSRC-Actuated Traffic Lights, which is one of our previous implementations using DSRC radios for traffic control. The designed prototype of this system was publicly demonstrated in Riyadh, Saudi Arabia, in July 2018 [55], [56]. DSRC-Actuated Traffic Lights, however, is based on the arrival of each vehicle, and hence works well under low to medium car flow rates, but it does not work well under high car flow rates.


Fig. 1. Illustration of partially detected intelligent transportation system.

The main contributions of this paper are:
1) Explore a new kind of intelligent system that is based on partial detection of vehicles, which is a cost-effective alternative to current ITSC systems and an important problem not addressed by traditional ITSC systems.
2) Propose a transition scheme to VTL. Not only do we reduce the average commute time for all users, but those users that can be detected have a much lower commute time, which attracts additional users to adopt the device or service.
3) Design a new RL-based traffic signal control algorithm and system design that performs well under low penetration ratios and detection rates.
4) Provide a detailed performance analysis. Our results show that, under a low detection rate, the system can perform almost as well as an ITSC system that employs full detection. This is a very attractive solution considering its cost-effectiveness.

III. PROBLEM STATEMENT

A. What is a Partial Detection Based ITSC System?

Figure 1 gives an illustration of a Partially Detected Intelligent Traffic Signal Control (PD-ITSC) system. There are two kinds of vehicles in the system: the red vehicles in the figure are the vehicles that the traffic lights are able to detect; we denote these vehicles as detected vehicles. The blue semi-transparent vehicles in the figure, on the other hand, are undetectable by the traffic system and are denoted as undetected vehicles. In a PD-ITSC system, both kinds of vehicles co-exist. The system, based on the information from the detected vehicles, decides the current phase at the intersections, in order to minimize the delay at the intersection for both detected vehicles and undetected vehicles.

Many example systems can be categorized as PD-ITSC, especially the newly proposed systems from the last decade based on wireless communications and IoT [14]–[16]. In these systems, the vehicles are equipped with communication devices that communicate with traffic lights. Vehicles equipped with the communication device are detected vehicles and vehicles NOT equipped with the device are undetected vehicles. In this paper, we choose one of the typical PD-ITSC systems, the traffic signal system based on DSRC radios, as an example. The detected vehicles are vehicles equipped with DSRC radios, and the undetected vehicles are those unequipped with DSRC radios. Observe that other kinds of PD-ITSC systems are analogous, thus making the methodologies described in this paper applicable to them as well.

B. Example PD-ITSC System Design Based on DSRC

Fig. 2. One possible system design for the proposed scheme.

We provide here one of the possible system realizations for the proposed scheme, based on Dedicated Short-Range Communications (DSRC). The system has an 'On Roadside' unit and an 'On Vehicle' unit, as shown in Figure 2. The DSRC RoadSide Unit (RSU) senses the Basic Safety Message (BSM) broadcast by the DSRC OnBoard Unit (OBU), parses the useful information out, and sends it to the RL Based Decision Making Unit. This unit will then make a decision based on the information provided by the RSU.

Even though the example system won't be able to detect all vehicles, it will collect more detailed information about the detected vehicles: while in traditional ITSC systems based on loop detectors only the vehicle occupancy is detected, the system based on DSRC technology can provide a rich set of attributes including the speed, distance, trajectory, and even destination of each detected vehicle. It is worth mentioning here that such properties are NOT unique to the example system considered in this section that uses DSRC technology; in fact, the same properties exist in most other partial detection ITSC systems as well, since they are based on similar wireless technologies. Therefore, the algorithm designed for handling PD-ITSC systems should be able to integrate all these pieces of information. Obviously, developing a purely analytical algorithm that takes all this information into consideration is non-trivial, thus making RL a very attractive and promising method, as it does not require a comprehensive theoretical analysis of the environment to find a near-optimal solution.

It is clear that since most of the traditional ITSC schemes do not take undetected vehicles into account, they are not suitable for PD-ITSC systems. Moreover, an ideal scheme for PD-ITSC should also:
1) perform well even with a low detection rate;
2) accelerate the transition to a higher adoption rate and therefore a higher detection rate (this point will be discussed in more detail in Section VI).
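As a small illustration of the per-vehicle information flow described in Section III-B, the sketch below turns one decoded BSM-like message into the record a decision-making unit could consume. This is not the authors' implementation; the field names, units, and the four-approach layout are illustrative assumptions only.

```python
from dataclasses import dataclass

@dataclass
class DetectedVehicle:
    """Attributes a DSRC-based PD-ITSC system could report per detected
    vehicle (Section III-B); undetected vehicles simply never appear."""
    vehicle_id: str
    speed: float              # m/s, taken from the BSM
    distance_to_stop: float   # m, distance to the stop line of its approach
    approach: str             # e.g. "north", "east", "south", "west"

def parse_bsm(bsm: dict, stop_line_positions: dict) -> DetectedVehicle:
    """Turn one decoded BSM-like message into the record forwarded to the
    RL based decision making unit. Field names are hypothetical."""
    approach = bsm["approach"]
    return DetectedVehicle(
        vehicle_id=bsm["id"],
        speed=bsm["speed"],
        distance_to_stop=stop_line_positions[approach] - bsm["position"],
        approach=approach,
    )
```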


IV. APPROACH AND THE UNDERLYING THEORY

A. Q-Learning Algorithm

We refer to Watkins [57] for a detailed explanation of general reinforcement learning and Q-learning, but we will provide a brief review of the underlying theory in this section.

The goal of reinforcement learning is to train an agent that interacts with the environment by selecting the action in a way that maximizes the future reward. At every time step, the agent gets the state (the current observation of the environment) and reward information (the quantified indicator of performance from the last time step) from the environment and makes an action. During this process, the agent tries to optimize (maximize/minimize) the cumulative reward of its action policy. The beauty of this kind of algorithm is the fact that it doesn't need any supervision, since the agent observes the environment and tries to optimize its performance without human intervention.

RL algorithms come in two categories: policy based algorithms, such as Trust Region Policy Optimization (TRPO) [58], Advantage Actor Critic (A2C) [59], and Proximal Policy Optimization (PPO) [60], that optimize the policy that maps from states to actions; and value based algorithms, such as Q-learning [57], double Q-learning [61], and soft Q-learning [62], that directly maximize the cumulative rewards. While policy based algorithms have achieved good results and will potentially be applicable to the problem proposed in this paper [63], [64], in this paper we choose the deep Q-learning algorithm.

In the Q-learning approach, the agent learns a 'Q-value', denoted $Q(s_t, a_t)$, which is a function of observed state $s_t$ and action $a_t$ that outputs the expected cumulative discounted future reward. Here, $t$ denotes the discrete time index. The cumulative discounted future reward is defined as:

$$Q(s_t, a_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \dots$$

Here, $r_t$ is the reward at each time step, the meaning of which needs to be specified according to the actual problem, and $\gamma < 1$ is the discount factor. At every time step, the agent updates its Q function by an update of the Q value:

$$Q(s_t, a_t) := Q(s_t, a_t) + \alpha\left(r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right)$$

In most cases, including the traffic control scenarios of interest, due to the complexity of the state space and action space, deep neural networks can be used to approximate the Q function. Instead of updating the Q value, we use the value

$$Q(s_t, a_t) + \alpha\left(r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right)$$

as the output target of a Q network and do a step of back propagation on the input $s_t, a_t$.

We utilized two known methods to stabilize the training process [65], [66]:
1) Two Q-networks are maintained, a target Q-network and an on-line Q-network. The target Q-network is used to approximate the true Q-values, and the on-line Q-network is back-propagated at every step. In the training period, the agent makes decisions with the target Q-network, and the results from each time instance are used to update the on-line Q-network. At periodic intervals, the on-line Q-network's weights are synchronized with the target Q-network. This keeps the agent's decision network relatively stable, instead of changing at every step.
2) Instead of training after every step the agent has taken, past experience is stored in a memory buffer, and training data is sampled from the memory for a certain batch size. This experience replay aims to break the time correlation between samples [67].

In this paper, we train the traffic light agents using a Deep Q-network (DQN) [67]. With the Q-learning algorithm described above, our work focuses on the definition of the agents' actions and the assignment of the states and rewards, which is discussed in the following subsection IV-B.

B. Parameter Modeling

We consider a traffic light controller, which takes reward and state observations from the environment and chooses an action. In this subsection, we introduce our design of actions, rewards, and states for the aforementioned PD-ITSC system problem.

1) Agent Action: In our context, the relevant action of the agent is either to keep the current traffic light phase, or to switch to the next traffic light phase. At every time step, the agent makes an observation and takes an action accordingly, achieving intelligent control of traffic.

2) Reward: For traffic optimization problems, the goal is to decrease the average traffic delay of commuters in the network by using a traffic light phasing strategy S. Specifically, find the best traffic light phasing strategy S, such that $t_S - t_{\min}$ is minimum, where $t_S$ is the average travel time of commuters in the network under the traffic control scheme S, and $t_{\min}$ is the physically possible lowest average travel time. Consider traveling the same distance $d$,

$$d = \int_0^{t_S} v_S(t)\,dt = t_{\min}\, v_{\max}$$

Here, $v_{\max}$ is some maximum reasonable speed for the vehicle, such as the speed limit of the road of interest. $v_S(t)$ denotes the actual vehicle speed under strategy S at time $t$. Therefore,

$$t_{\min} = \frac{1}{v_{\max}} \int_0^{t_S} v_S(t)\,dt$$

$$t_S - t_{\min} = \int_0^{t_S} 1\,dt - \frac{1}{v_{\max}} \int_0^{t_S} v_S(t)\,dt = \frac{1}{v_{\max}} \int_0^{t_S} \left[v_{\max} - v_S(t)\right] dt$$

Therefore, obtaining the minimum delay $t_S - t_{\min}$ is equivalent to minimizing, at each step $t$ and for each vehicle:

$$\frac{1}{v_{\max}}\left[v_{\max} - v_S(t)\right] \qquad (1)$$

We note that this is equivalent to maximizing $v_S(t)$, if $v_{\max}$ is the same on all roads for all cars. If different vehicles have different $v_{\max}$, the reward function is taken as the arithmetic average of the function over all vehicles.

We define the statement in (1) as the penalty of each step. Our goal is to minimize the penalty of each step. Since reinforcement learning tries to maximize the reward (minimize the penalty), we define the opposite of the penalty as the reward for the reinforcement learning problem:

$$r_t = -\frac{1}{v_{\max}}\left[v_{\max} - v_S(t)\right] \qquad (2)$$
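The two stabilization methods above (target network and experience replay) can be summarized in a short sketch. The code below is an illustrative outline only, not the authors' implementation: the network sizes, optimizer, exploration strategy, and all hyperparameter values are assumptions, and the update is written in the standard DQN form, regressing the on-line network towards $r + \gamma \max_a Q_{\text{target}}(s', a)$.

```python
# Illustrative DQN update with a target network and experience replay.
import random
from collections import deque

import torch
import torch.nn as nn

GAMMA = 0.95        # discount factor (assumed value)
BATCH_SIZE = 32     # replay batch size (assumed value)
SYNC_EVERY = 500    # steps between target-network synchronizations (assumed)

def make_q_net(state_dim, n_actions=2):
    # Two actions: keep the current phase or switch to the next phase.
    return nn.Sequential(
        nn.Linear(state_dim, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, n_actions),
    )

class DQNAgent:
    def __init__(self, state_dim):
        self.online = make_q_net(state_dim)
        self.target = make_q_net(state_dim)
        self.target.load_state_dict(self.online.state_dict())
        self.opt = torch.optim.Adam(self.online.parameters(), lr=1e-3)
        self.memory = deque(maxlen=100_000)
        self.steps = 0

    def act(self, state, epsilon=0.05):
        # Decisions are made with the target network, as described in the text;
        # epsilon-greedy exploration is an assumption added for the sketch.
        if random.random() < epsilon:
            return random.randrange(2)
        with torch.no_grad():
            q = self.target(torch.as_tensor(state, dtype=torch.float32))
        return int(q.argmax())

    def remember(self, s, a, r, s_next):
        self.memory.append((s, a, r, s_next))

    def train_step(self):
        if len(self.memory) < BATCH_SIZE:
            return
        batch = random.sample(self.memory, BATCH_SIZE)   # experience replay
        s, a, r, s_next = map(
            lambda x: torch.as_tensor(x, dtype=torch.float32), zip(*batch))
        a = a.long()
        # Target value: r + gamma * max_a' Q_target(s', a')
        with torch.no_grad():
            y = r + GAMMA * self.target(s_next).max(dim=1).values
        q = self.online(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(q, y)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        # Periodically synchronize the target network with the on-line network.
        self.steps += 1
        if self.steps % SYNC_EVERY == 0:
            self.target.load_state_dict(self.online.state_dict())
```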


TABLE I
DETAILS OF STATE REPRESENTATION

Fig. 3. Control logic of RL based decision making unit.

In some cases, especially when the traffic flow is heavy, one can shape the rewards to guide the agent's action, such as avoiding big traffic jams [68]. This is certainly an interesting direction for future research.

3) State Representation: For optimal decision making, a system should consider as much relevant information about traffic processes as possible. Traditional ITSC systems typically only detect simple information such as the presence of vehicles. In a PD-ITSC system, only a portion of the vehicles are detected, but it is likely that more specific information about these vehicles, such as speed and position, is available due to the capabilities of the underlying wireless technologies (discussed in Section III-B).

RL enables experimentation with many possible choices of inputs and input representations. Further research is required to determine the experimental benefits of each option, and that goes beyond the scope of this paper. Based on initial experiments, for the purpose of this paper, we selected a state representation including the distance to the nearest vehicle at each approach, the number of vehicles at each approach, an amber phase indicator, the current traffic light phase elapsed time, and the current time, as shown in Table I. Note that the current traffic light phase (green or red) is represented by a sign change in the per-lane detected car count and distance rather than by a separate indicator. In initial experiments, we observed slightly faster convergence using this distributed representation (sign representation) than a separate indicator (shown in Figure 5). We hypothesize that, in combination with Rectified Linear Unit (ReLU) activation, this encoding biases the network to utilize different combinations of neurons for different phases. ReLU units are active if the output is positive and inactive if the output is negative, so our representation may encourage different units to be utilized during different phases, accelerating learning. There are many possible representations and our experimentation with different representations is not exhaustive, but we found that RL was able to handle several different representations with reasonable performance.

C. System

Figure 3 gives a flow chart of how the RL based control unit makes decisions. As shown in the figure, the control unit gets the state representation periodically and calculates the Q-value for all the possible actions; if the action of keeping the current phase has the bigger Q-value, it retains the phase; otherwise, it switches to the next phase.

Other than the main logic discussed above, a sanity check is performed on the agent: a mandatory maximum and minimum phase duration. If the current phase duration is less than the minimum phase time, the agent will keep the current phase no matter what action the DQN is choosing; similarly, if the phase duration is larger than or equal to the maximum phase time, the phase will be forced to switch.
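To make the state encoding and the keep/switch logic concrete, the sketch below builds a feature vector in the spirit of Table I (per-approach nearest detected-vehicle distance and count, sign-flipped according to the current phase, plus an amber indicator, phase elapsed time, and time of day) and applies the minimum/maximum phase-time sanity check of Figure 3. The specific constants, approach names, and scaling are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

KEEP, SWITCH = 0, 1
MIN_PHASE_TIME = 5.0        # seconds (assumed values for illustration)
MAX_PHASE_TIME = 60.0
MAX_SENSING_RANGE = 200.0   # distance reported when no vehicle is detected

def build_state(detected, green_approaches, amber, phase_elapsed, time_of_day):
    """detected: {approach: list of distances (m) of detected vehicles}.

    The phase is encoded by a sign change on the per-approach features
    (positive for approaches that currently have green, negative otherwise),
    rather than by a separate phase indicator.
    """
    features = []
    for approach in ("north", "east", "south", "west"):
        dists = detected.get(approach, [])
        nearest = min(dists) if dists else MAX_SENSING_RANGE
        sign = 1.0 if approach in green_approaches else -1.0
        features += [sign * nearest, sign * len(dists)]
    features += [1.0 if amber else 0.0, phase_elapsed, time_of_day]
    return np.array(features, dtype=np.float32)

def choose_action(q_values, phase_elapsed):
    """Fig. 3 logic plus the mandatory min/max phase-time sanity check."""
    if phase_elapsed < MIN_PHASE_TIME:
        return KEEP                    # too early to switch
    if phase_elapsed >= MAX_PHASE_TIME:
        return SWITCH                  # phase has lasted long enough
    return KEEP if q_values[KEEP] >= q_values[SWITCH] else SWITCH
```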


D. Implementation

Fig. 4. The deployment scheme.

In this section, we describe the design of the proposed scheme at the system level. The implementation of the system contains two phases, the training phase and the deployment phase. As shown in Figure 4, the agent is first trained with a simulator, and is then ported to the intersection and connected to the real traffic signal, after which it starts to control the traffic.

1) Training Phase: The agent is trained by interacting with a traffic simulator. The simulator randomly generates vehicle arrivals, then determines whether each vehicle can be detected by drawing from a Bernoulli distribution parameterized by p, the detection rate. In the context of DSRC-based vehicle detection systems, the detection rate corresponds to the DSRC penetration rate. The simulator obtains the traffic state $s_t$, calculates the current reward $r_t$ accordingly, and feeds them to the agent. Using the Q-learning updating formula cited in previous sections, the agent updates itself based on the information from the simulator. Meanwhile, the agent chooses an action $a_t$ and forwards the action to the simulator. The simulator will then update and change the traffic light phase according to the agent's indication. These steps are repeated until convergence, at which point the agent is trained.

The performance of an agent relies heavily on the quality of the simulator. To obtain an arrival pattern similar to the real world, the simulator generates car flow from the historical record of the vehicle arrival rate on the same map as the real intersection. To address the variance in car flow in different parts of the day, the current time of day is also specified in the state representation, so that after training the agent is able to adapt to different car flows at different times of the day. Other factors that affect car flow, such as the day of the week, could also be parameterized in the state representation.

The goal of training is to have the traffic control scheme achieve the shortest average commute time for all commuters. In the training period, the machine tries different control schemes and eventually converges to an optimal scheme which yields a minimum average commute time.

2) Deployment Phase: In the deployment phase, the software agent is moved to the intersection for controlling the traffic signal. Here, the agent will not update the learned Q-function, but simply controls the traffic signal. Namely, the detector will feed the agent the currently detected traffic state $s_t$; based on $s_t$, the agent chooses an action using the trained Q-network and directs the traffic signal to switch/keep the phase accordingly. This step is performed in real time, thus enabling continuous traffic control.


V. PERFORMANCE ANALYSIS

In this section, we present several simulation scenarios to evaluate various aspects of the performance of the proposed scheme. The simulations are performed with SUMO, a microscopic traffic simulator [69]–[71]. Different scenarios are considered, in order to provide a comprehensive analysis of the proposed scheme.

Qualitatively speaking, we see the performance of the agent reacting to the traffic intelligently from the GUI. It makes reasonable decisions for the arriving vehicles. We demonstrate the performance of the agent after different periods of training in a video available in [72].

Fig. 5. Penalty function decreasing with the number of iterations in training; the situation shown in the figure is plotted from training with dense car flow at a single intersection.

Figure 5 shows a typical training process curve. Both phase representations have similar trends, but we do observe that the sign representation has a slightly faster convergence rate in all experiments (see Section IV-B3).

We provide a quantitative analysis in the following subsections. Though currently there are no analytical results for PD-ITSC systems, we can predict what will be observed by considering the following two extreme cases:
• When the car flow rate is extremely low, vehicles come to the intersection independently. For detected vehicles, the optimal traffic signal should switch phases on their arrival to yield zero waiting time; for the undetected vehicles, the traffic agent won't be able to do anything. In this case, vehicles can be considered as independent 'particles', and the optimal traffic agent reacts to each of their arrivals independently. Therefore, we should observe much better performance for the detected vehicles than for the undetected vehicles, which corresponds to the case shown in Figure 7b.
• When the car flow rate is extremely heavy (at the point of saturation), the optimal traffic agent should take a completely different strategy: instead of only taking care of the detected vehicles, the agent should be aware of the fact that the detected vehicles are only representatives of the car flow, and react in a way that minimizes the overall waiting time. The waiting time of detected vehicles and undetected vehicles should be similar, because they belong to the same car flow. The vehicles here should be considered as 'liquid' instead of the 'particles' of the previous case. This can be seen in Figure 7a.

The rest of the section is organized as follows: Subsection V-A evaluates the performance of the system under different detection rates. One should expect different performance for different car flow rates for the reasons mentioned above. Subsection V-B gives an estimate of the benefit of the designed agent during different times of the day. Finally, Subsections V-C and V-D show that when the implementation scenario is slightly different from the training scenario, the performance of the designed agent is still reasonably good.

A. Performance for Different Detection Rates

In this subsection, we present performance results under different detection rates, to qualify the performance of a PD-ITSC system as the detection rate increases from 0% to 100%. We compare to the performance of a typical pre-timed signal with a green phase duration of 24 seconds, shown in dashed lines as a simple reference.

Fig. 6. Waiting time under different detection rates under medium car flow.

Figure 6 shows a typical trend we obtained in simulations. The figure shows the waiting time of vehicles at a single intersection under car flows from north, east, south, and west of 0.02 veh/s, 0.1 veh/s, 0.02 veh/s, and 0.05 veh/s, respectively, with vehicles arriving as a Poisson process. One can make several interesting observations from this figure. First of all, the system under AI control is much better than the traditional pre-timed traffic signal, even under a low detection rate. We can also observe that the overall waiting time (red line) within this system decreases as the detection rate increases. This is intuitive, since as more vehicles are detected, the more information the system has and thus the better the system is able to optimize the car flow.

Additionally, from the figure one can observe that approximately 80% of the benefit happens in the first 20% of the transition. This finding is quite significant in that we find a transition scheme that asymptotically gets better as the system gradually evolves to a 100% detection rate, and will be able to receive much of the ultimate benefit during the initial transition.

Another important observation is that during the transition, although the agent is rewarded for optimizing the overall average commute time for both detected and undetected vehicles, the detected vehicles (green line in Figure 6) have a lower commute time than the undetected vehicles (blue line in Figure 6). This provides an interesting 'potential' or 'incentive' to the system, to transition from no vehicles equipped with the IoT device to all vehicles equipped with the device. Drivers of those vehicles not yet equipped with the device now have a good reason and strong incentive to install one.

Here, we also compare with our previously designed system known as DSRC-ATL [55], which is an algorithm designed for dealing with partial detection under sparse to medium car flow. We see that though the algorithms exhibit similar trends, the RL agents have better performance during the whole transition from 0 to 1 detection rate.

Fig. 7. Waiting time under different detection rates under dense and sparse car flow.

Figure 7 shows the performance under the other two cases: when the car flow is very sparse (0.02 veh/s at each lane) or very dense (0.5 veh/s at each lane). For the sparse situation in Figure 7b, the trend is similar to the medium flow case shown in Figure 6.

One can see from Figure 7a that under the dense situation, the curve becomes quite flat. This is because when the car flow is high, detecting individual vehicles becomes less important. When many cars arrive at the intersection, the car flow has 'liquid' qualities, as opposed to the 'particle' qualities of the previous two situations. The trained RL agent is able to seamlessly transition from a 'particle arrival' optimization agent which handles random arrivals to a 'liquid arrival' optimization agent which handles macroscopic flow. This result shows that RL is able to capture the main factors that affect the traffic system's performance and performs differently under different car arrival rates. Hence, RL provides a much desired adaptive behavior.

B. Performance of a Whole Day

Fig. 8. Typical car flow in a day.

Section V-A examines the effect of flow rate on system performance. Since the car flow differs at different times of the day, we simulate an entire day of traffic. To generate realistic car flow for a day, we refer to the whole-day car flow reported in [73]. To adapt the reported arrival rate to the simulation system, we multiply the car flow in [73] by a factor so that the peak volume matches the saturation flow rate of the simulated roads. Figure 8 shows the car flow rate we used for the simulation; the car flow reaches peaks of 1.2 vehicles/s at 8 am in the morning and 6 pm in the afternoon, and the car flow during regular hours is around 0.7 vehicles/s. It is worth mentioning that the car flow of different intersections in the real world might be very different, so the result presented here is just an example of what the performance looks like under a typical traffic volume of a whole day.
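The following snippet illustrates the kind of flow generation just described: scale a reported hourly profile so its peak matches an assumed saturation flow, then draw Poisson (exponential inter-arrival) vehicle arrivals for a given hour. The profile values and saturation rate are placeholders, not the data of [73].

```python
import random

def scale_profile(hourly_flow, saturation_rate):
    """Scale a reported hourly flow profile (veh/s) so that its peak
    matches the saturation flow rate of the simulated roads."""
    factor = saturation_rate / max(hourly_flow)
    return [f * factor for f in hourly_flow]

def poisson_arrivals(rate, duration):
    """Exponential inter-arrival times give a Poisson arrival process."""
    t, times = 0.0, []
    while True:
        t += random.expovariate(rate)
        if t >= duration:
            return times
        times.append(t)

# Placeholder 24-value profile with peaks near 8 am and 6 pm (hour 8 and 18).
profile = ([0.05] * 6 + [0.4, 0.9, 1.0, 0.6] + [0.55] * 7
           + [0.9, 1.0, 0.6] + [0.3, 0.2, 0.1, 0.05])
scaled = scale_profile(profile, saturation_rate=1.2)   # peak -> 1.2 veh/s
arrivals_8am = poisson_arrivals(scaled[8], duration=3600.0)
```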


Fig. 9. Expected performance by time.

Figure 9 shows the performance of different vehicles over a whole day. One can observe from this figure that the performance at a 20% detection rate (red line) is very close to the performance at a 100% detection rate (green line) at most times of the day (from 5 am to 9 pm). During rush hours, the system with a 100% detection rate is almost the same as the system with a 20% detection rate. Though a traffic system under a 100% detection rate performs visibly better at midnight, the performance at that time is not as critical as the performance during the busier daytime. This result indicates that by detecting 20% of vehicles, we can perform almost the same as detecting all vehicles. But those detectable vehicles (yellow lines) will have a benefit over those undetectable vehicles (dashed line).

These results confirm intuition. With a large volume of cars, a low detection rate should still provide a relatively low-variance estimate of traffic flow. If there are few cars and a low detection rate, the estimate of traffic flow can have very high variance. Late at night with only a single detected car, an ITSC system can give that car a green immediately, which would not be possible with an undetected car.

C. Sensitivity Analysis

The results obtained above used agents trained and evaluated under the same environmental parameters, since traffic patterns only fluctuate slightly from day to day. Below, we evaluate the sensitivity of the agents to two environmental parameters: the car flow and the detection rate.

Fig. 10. Sensitivity analysis of flow rate.

1) Sensitivity to Car Flow: Figure 10 shows the agents' sensitivity to car flow. Figure 10a shows the performance of an agent trained under 0.1 veh/s car flow, operating at different flow rates. Figure 10b shows the sensitivity of an agent trained under 0.5 veh/s car flow. The blue curve in the figure is the trained agent's performance, while the red one is the performance of the optimal agent (the agent trained under that situation and tested under that situation). Both agents perform well over a range of flow rates. The agent trained under 0.1 veh/s flow can handle flow rates from 0 to 0.15 veh/s at near-optimal levels. At higher flow rates, it still performs reasonably well. The agent trained on 0.5 veh/s flow will perform reasonably from 0.25 veh/s to 0.5 veh/s, but under 0.25 veh/s the agent will start to perform substantially worse than the optimal agent. Since traffic patterns are not expected to fluctuate heavily, these results give a strong indication that the agent trained on the data will be able to adapt to the environment even when the trained situation is slightly different.

Fig. 11. Sensitivity analysis of detection rate.

2) Sensitivity to Detection Rate: In most situations, the detection rate can only be approximately measured. It is likely that an agent trained under one detection rate needs to operate under a slightly different detection rate, so we test the sensitivity of agents to detection rates.

Figure 11 shows the sensitivity for two cases. Figure 11a shows the sensitivity at a low detection rate (0.2); Figure 11b shows the sensitivity at a high detection rate (0.8).

We observe that the agent trained under a 0.2 detection rate performs at an optimal level from a 0.1 to 0.4 detection rate. The sensitivity upward is better than downward. This indicates that at the early deployment of this system, it is better to under-estimate the detection rate, since the agent's performance is more stable for higher detection rates.

Figure 11b shows the sensitivity of the agent trained under a high detection rate (0.8). We can see that the performance of this agent is at an optimal level when the detection rate is from 0.5 to 1. Though the sensitivity for an agent under a low detection rate is different from the sensitivity under a high detection rate, in both cases the agent shows a level of stability, which means that as long as the detection rate used for training is not too different from the actual detection rate, the performance of the agent will not be affected a lot.
Authorized licensed use limited to: University of Wollongong. Downloaded on April 05,2020 at 08:51:39 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

ZHANG et al.: USING RL WITH PARTIAL VEHICLE DETECTION FOR ITSC 9

D. Robustness Between Training and Deployment Scenario

There are many differences between the training and the actual deployment scenario, as the simulator, though quite sophisticated, will never be able to take all the factors of the real scenario into account. This simulation aims to evaluate and verify that those minor factors, such as stop-and-go vehicles, arrival patterns and other factors, won't affect the system in a major way. We choose a newly published realistic scenario known as Luxembourg SUMO Traffic (LuST) [74]. The scenario is generated on the real map of Luxembourg, and the activity of vehicles is generated according to the demographic data published by the government. The authors of this scenario compared the generated traffic with a data set collected between March and April 2015 in Luxembourg, which contains 6,000,000 floating vehicle samples, and achieved similar speed distributions; hence the LuST scenario has a high degree of realism.

In our simulation, we don't directly train the traffic light on this scenario; instead, we use the scenario as ground truth to evaluate the trained traffic light. The simulation steps we performed are as follows:
1) Choose a certain intersection from LuST with a high rate of car flow (intersection -12408).
2) Measure the hourly traffic volume of that intersection.
3) Build a simple intersection in a separate simulator and generate car flow with the new simulator, according to the hourly traffic volume measured in step 2.
4) Train an agent on the simplified scenario we built in step 3.
5) After training, evaluate the performance on the original LuST scenario, by substituting the traffic agent of that intersection with the new traffic agent we trained.

It is worth mentioning here that this simulation follows the steps of an actual implementation in the real world (described in Section IV-D), so the performance here can be considered as a reference for the performance of an actual deployment when the simulator and the real world have major differences in details. Other than the differences in the map and car flow, there are more differences between training and evaluation; the scenario used for evaluation is rich in details. In Table II, we list all the differences between the LuST scenario (for evaluation) and the simulator used for training.

TABLE II
DIFFERENCES IN TRAINING AND EVALUATION SCENARIO

Notice that the simulator is sophisticated enough to take all the factors listed in the table into account. Here we intentionally introduce differences between training and evaluation. This is a judicious choice on our part. Our goal is to give a reasonable estimate of the performance in a real-world implementation, where the simulation scenario is slightly different from the real-world scenario.

We choose three different times of the day to present the results:
1) Midnight: 2 AM; in this case, the car flow at the intersection is sparse.
2) Rush hours: 8 AM; this is a situation where the car flow is dense.
3) Regular hours: 2 PM; this is the situation during regular hours, where the car flow is in between the midnight car flow and the rush-hour car flow (medium car flow).

Fig. 12. Performance of the agent in LuST scenario.

Figure 12 shows the performance of the agent in the LuST scenario. We can clearly see that even though the evaluated situation is quite different from the training situation, we still observe that the performance improves asymptotically as the detection rate grows, which exhibits the same trend as we observed in Section V-A.
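Step 5 above amounts to driving the chosen LuST intersection with the trained agent instead of its original signal program. A minimal sketch of how this could be done through SUMO's TraCI Python interface is shown below; the config path, junction ID handling, state-building helper, and waiting-time metric are placeholder assumptions, not the authors' evaluation code.

```python
# Minimal sketch: drive one SUMO traffic light with a trained agent via TraCI.
import traci

SUMO_CMD = ["sumo", "-c", "lust.sumocfg"]   # placeholder config file
TLS_ID = "-12408"                           # intersection chosen from LuST
SWITCH = 1

def evaluate(agent, build_state, steps=3600):
    """Replace the intersection's signal program with the trained agent
    and accumulate vehicle waiting time as a simple performance measure."""
    traci.start(SUMO_CMD)
    n_phases = len(traci.trafficlight.getAllProgramLogics(TLS_ID)[0].phases)
    phase_elapsed, total_wait = 0, 0.0
    for _ in range(steps):
        state = build_state(TLS_ID, phase_elapsed)   # e.g. the Table I features
        if agent.act(state) == SWITCH:
            next_phase = (traci.trafficlight.getPhase(TLS_ID) + 1) % n_phases
            traci.trafficlight.setPhase(TLS_ID, next_phase)
            phase_elapsed = 0
        traci.simulationStep()
        phase_elapsed += 1
        total_wait += sum(traci.vehicle.getWaitingTime(v)
                          for v in traci.vehicle.getIDList())
    traci.close()
    return total_wait / steps
```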


VI. DISCUSSION

As the simulation results show, while all vehicles will experience a shorter waiting time under an RL-based traffic controller, detected vehicles will have a shorter commute time than undetected vehicles. This property makes it possible for hardware manufacturers, software companies, and vehicle manufacturers to help push forward the proposed scheme, rather than the Department of Transportation (DoT) alone, for the simple reason that all of them can profit from this system. For example, it would be valuable for a certain navigation app to advertise that its customers can save 30% on commute time.

Therefore, we view this technology as a new generation of Intelligent Transportation Systems, as it inherently comes with a lucrative commercial business model. The burden of spreading the penetration rate in this system is distributed to a lot of companies, as opposed to the traditional ITSC systems which put all the burden on the DoT alone. This makes it financially feasible to have the system installed at most of the intersections in a city, as opposed to the current situation where only a small proportion of intersections are equipped with ITSC.

The mechanism of the system solution described will also make it possible to have dynamic pricing. Dynamic pricing refers to reserving certain roads during rush hours exclusively for paid users. This method has been scuttled by public or political opposition, and only a few cities have implemented dynamic pricing [75], [76]. Those few successful examples, however, cannot be easily copied or adapted to other cities, as the method depends hugely on road topologies. In our solution, we can accomplish dynamic pricing in a more intelligent way, by simply considering vehicle detection as a service. Compared to existing solutions, this service will not require reserving roads, making the scheme flexible and easy to implement. The user will also be able to choose to pay for a prioritized signal phase whenever they are in a hurry.

Further research is needed to make this AI-based Intelligent Traffic Control System more practical. First of all, the system currently needs to be fully trained in a simulator; under the partial observation setup, the system will not be able to observe the reward, hence it won't be able to do any incremental training after deployment. Clearly, this is a drawback or shortcoming of the proposed system. Some solutions to this problem are reported in a follow-up paper [77]. Another future direction would be to further develop the system to achieve multi-agent coordination so that, with the help of DSRC radios (or other forms of communication), traffic lights will be able to communicate with each other. Clearly, designing such a system will significantly improve the performance of the PD-ITSC system. Further research is also required to investigate whether the RL agent will be able to pick up the drivers' behavior accurately at each intersection [78]–[82].

VII. CONCLUSION

In this paper, we have proposed reinforcement learning, specifically deep Q-learning, for traffic control with partial detection of vehicles. The results of our study show that reinforcement learning is a promising new approach to optimizing traffic control problems under partial detection scenarios, such as traffic control systems using DSRC technology. This is a very promising outcome that is highly desirable, since the industry forecast on the DSRC penetration process seems gradual as opposed to abrupt.

The numerical results on sparse, medium, and dense arrival rates suggest that reinforcement learning is able to handle all kinds of traffic flow. Although the optimization of traffic for sparse arrivals and dense arrivals is, in general, very different, the results show that reinforcement learning is able to leverage the 'particle' property of the vehicle flow, as well as the 'liquid' property, thus providing a very powerful overall optimization scheme.

ACKNOWLEDGMENT

The authors would like to thank Dr. H. Liu from the Language Technologies Institute, Carnegie Mellon University, for informative discussions and many suggestions on the methods reported in this paper. The authors would also like to thank Dr. L. Gallo from Eurecom, France, and Mr. M. E. Diaz-Granados of Yahoo, U.S., for the initial attempt to solve this problem in 2016.

REFERENCES

[1] (2017). Traffic Congestion and Reliability: Trends and Advanced Strategies for Congestion Mitigation. Accessed: Aug. 19, 2017. [Online]. Available: https://ops.fhwa.dot.gov/congestion_report/executive_summary.htm
[2] D. I. Robertson, "'Transyt' method for area traffic control," Traffic Eng. Control, vol. 8, no. 8, 1969.
[3] P. Lowrie, "SCATS, Sydney co-ordinated adaptive traffic system: A traffic responsive method of controlling urban traffic," Roads Traffic Authority NSW, Sydney, NSW, Australia, Tech. Rep. 00772163, 1990. [Online]. Available: https://trid.trb.org/view/488852
[4] P. Hunt, D. Robertson, R. Bretherton, and M. C. Royle, "The SCOOT on-line traffic signal optimisation technique," Traffic Eng. Control, vol. 23, no. 4, 1982.
[5] J. Luk, "Two traffic-responsive area traffic control methods: SCAT and SCOOT," Traffic Eng. Control, vol. 25, no. 1, 1984.
[6] N. H. Gartner, "OPAC: A demand-responsive strategy for traffic signal control," U.S. Dept. Transp., Washington, DC, USA, Tech. Rep. 906, 1983.
[7] P. Mirchandani and L. Head, "A real-time traffic signal control system: Architecture, algorithms, and analysis," Transp. Res. C, Emerg. Technol., vol. 9, no. 6, pp. 415–432, Dec. 2001.
[8] J.-J. Henry, J. L. Farges, and J. Tuffal, "The PRODYN real time traffic algorithm," in Control in Transportation Systems. Amsterdam, The Netherlands: Elsevier, 1984, pp. 305–310.
[9] R. Vincent and J. Peirce, "'MOVA': Traffic responsive, self-optimising signal control for isolated intersections," Transp. Road Res. Lab., Crowthorne, U.K., Tech. Rep. RR 170, 1988.
[10] (2016). Traffic Light Control and Coordination. Accessed: Mar. 23, 2016. [Online]. Available: https://en.wikipedia.org/wiki/Traffic_light_control_and_coordination
[11] M. Ferreira, R. Fernandes, H. Conceição, W. Viriyasitavat, and O. K. Tonguz, "Self-organized traffic control," in Proc. 7th ACM Int. Workshop Veh. InterNETworking (VANET), 2010, pp. 85–90.
[12] N. S. Nafi and J. Y. Khan, "A VANET based intelligent road traffic signalling system," in Proc. Australas. Telecommun. Netw. Appl. Conf. (ATNAC), Nov. 2012, pp. 1–6.
[13] V. Milanes, J. Villagra, J. Godoy, J. Simo, J. Perez, and E. Onieva, "An intelligent V2I-based traffic management system," IEEE Trans. Intell. Transp. Syst., vol. 13, no. 1, pp. 49–58, Mar. 2012.
[14] A. Chattaraj, S. Bansal, and A. Chandra, "An intelligent traffic control system using RFID," IEEE Potentials, vol. 28, no. 3, pp. 40–43, May 2009.
[15] M. R. Friesen and R. D. McLeod, "Bluetooth in intelligent transportation systems: A survey," Int. J. Intell. Transp. Syst. Res., vol. 13, no. 3, pp. 143–153, Sep. 2015.
[16] F. Qu, F.-Y. Wang, and L. Yang, "Intelligent transportation spaces: Vehicles, traffic, communications, and beyond," IEEE Commun. Mag., vol. 48, no. 11, pp. 136–142, Nov. 2010.


[17] Average Age of Cars on US. Accessed: Aug. 21, 2017. [Online]. Available: https://www.usatoday.com/story/money/2015/07/29/new-car-sales-soaring-but-cars-getting-older-too/30821191/
[18] W. Genders and S. Razavi, “Using a deep reinforcement learning agent for traffic signal control,” 2016, arXiv:1611.01142. [Online]. Available: https://arxiv.org/abs/1611.01142
[19] E. van der Pol, “Deep reinforcement learning for coordination in traffic light control,” Ph.D. dissertation, Univ. Amsterdam, Amsterdam, The Netherlands, 2016.
[20] S. Mikami and Y. Kakazu, “Genetic reinforcement learning for cooperative traffic signal control,” in Proc. 1st IEEE Conf. Evol. Comput. IEEE World Congr. Comput. Intell., Dec. 2002, pp. 223–228.
[21] E. Bingham, “Reinforcement learning in neurofuzzy traffic signal control,” Eur. J. Oper. Res., vol. 131, no. 2, pp. 232–241, Jun. 2001.
[22] M. Chee Choy, D. Srinivasan, and R. Long Cheu, “Hybrid cooperative agents with online reinforcement learning for traffic control,” in Proc. IEEE World Congr. Comput. Intell. IEEE Int. Conf. Fuzzy Syst. (FUZZ), Jun. 2003, pp. 1015–1020.
[23] B. Abdulhai, R. Pringle, and G. J. Karakoulas, “Reinforcement learning for true adaptive traffic signal control,” J. Transp. Eng., vol. 129, no. 3, pp. 278–285, 2003.
[24] A. B. C. da Silva, D. de Oliveria, and E. Basso, “Adaptive traffic control with reinforcement learning,” in Proc. Conf. Auto. Agents Multiagent Syst. (AAMAS), 2006, pp. 80–86.
[25] D. de Oliveira et al., “Reinforcement learning based control of traffic lights in non-stationary environments: A case study in a microscopic simulator,” in Proc. EUMAS, 2006.
[26] M. Abdoos, N. Mozayani, and A. L. C. Bazzan, “Traffic light control in non-stationary environments based on multi agent Q-learning,” in Proc. 14th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Oct. 2011, pp. 1580–1585.
[27] J. C. Medina and R. F. Benekohal, “Traffic signal control using reinforcement learning and the max-plus algorithm as a coordinating strategy,” in Proc. 15th Int. IEEE Conf. Intell. Transp. Syst., Sep. 2012, pp. 596–601.
[28] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, “Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): Methodology and large-scale application on downtown Toronto,” IEEE Trans. Intell. Transp. Syst., vol. 14, no. 3, pp. 1140–1150, Sep. 2013.
[29] M. A. Khamis and W. Gomaa, “Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework,” Eng. Appl. Artif. Intell., vol. 29, pp. 134–151, Mar. 2014.
[30] L. Li, Y. Lv, and F.-Y. Wang, “Traffic signal timing via deep reinforcement learning,” IEEE/CAA J. Autom. Sinica, vol. 3, no. 3, pp. 247–254, Apr. 2016.
[31] D. Garg, M. Chli, and G. Vogiatzis, “Deep reinforcement learning for autonomous traffic light control,” in Proc. 3rd IEEE Int. Conf. Intell. Transp. Eng. (ICITE), Sep. 2018.
[32] (2016). Intelligent Traffic System Cost. Accessed: Nov. 23, 2017. [Online]. Available: http://www.itscosts.its.dot.gov/ITS/benecost.nsf/ID/C1A22DD1C3BA1ED285257CD60062C3BB?OpenDocument&Query=CApp/
[33] (2016). Scats System Cost. Accessed: May 13, 2018. [Online]. Available: https://www.itscosts.its.dot.gov/ITS/benecost.nsf/0/9E957998C8AB79A885257B1E0049CAFF?OpenDocument&Query=Home
[34] T. Neudecker, N. An, O. K. Tonguz, T. Gaugel, and J. Mittag, “Feasibility of virtual traffic lights in non-line-of-sight environments,” in Proc. 9th ACM Int. Workshop Veh. Inter-Netw., Syst., Appl. (VANET), 2012, pp. 103–106.
[35] M. Ferreira and P. M. D’orey, “On the impact of virtual traffic lights on carbon emissions mitigation,” IEEE Trans. Intell. Transp. Syst., vol. 13, no. 1, pp. 284–295, Mar. 2012.
[36] M. Nakamurakare, W. Viriyasitavat, and O. K. Tonguz, “A prototype of Virtual Traffic Lights on Android-based smartphones,” in Proc. IEEE Int. Conf. Sens., Commun. Netw. (SECON), Jun. 2013, pp. 236–238.
[37] W. Viriyasitavat, J. M. Roldan, and O. K. Tonguz, “Accelerating the adoption of Virtual Traffic Lights through policy decisions,” in Proc. Int. Conf. Connected Vehicles Expo (ICCVE), Dec. 2013, pp. 443–444.
[38] A. Bazzi, A. Zanella, B. M. Masini, and G. Pasolini, “A distributed algorithm for virtual traffic lights with IEEE 802.11p,” in Proc. Eur. Conf. Netw. Commun. (EuCNC), Jun. 2014, pp. 1–5.
[39] F. Hagenauer, P. Baldemaier, F. Dressler, and C. Sommer, “Advanced leader election for virtual traffic lights,” ZTE Commun., Special Issue VANET, vol. 12, no. 1, pp. 11–16, Mar. 2014.
[40] O. Tonguz, W. Viriyasitavat, and J. Roldan, “Implementing virtual traffic lights with partial penetration: A game-theoretic approach,” IEEE Commun. Mag., vol. 52, no. 12, pp. 173–182, Dec. 2014.
[41] J. Yapp and A. J. Kornecki, “Safety analysis of virtual traffic lights,” in Proc. 20th Int. Conf. Methods Models Autom. Robot. (MMAR), Aug. 2015, pp. 505–510.
[42] A. Bazzi, A. Zanella, and B. M. Masini, “A distributed virtual traffic light algorithm exploiting short range V2V communications,” Ad Hoc Netw., vol. 49, pp. 42–57, Oct. 2016.
[43] O. K. Tonguz and W. Viriyasitavat, “A self-organizing network approach to priority management at intersections,” IEEE Commun. Mag., vol. 54, no. 6, pp. 119–127, Jun. 2016.
[44] R. Zhang et al., “Virtual traffic lights: System design and implementation,” 2018, arXiv:1807.01633. [Online]. Available: https://arxiv.org/abs/1807.01633
[45] O. K. Tonguz, “Red light, green light-no light: Tomorrow’s communicative cars could take turns at intersections,” IEEE Spectr. Mag., vol. 55, no. 10, pp. 24–29, Oct. 2018.
[46] J. Lu and L. Cao, “Congestion evaluation from traffic flow information based on fuzzy logic,” in Proc. IEEE Int. Conf. Intell. Transp. Syst., vol. 1, Apr. 2004, pp. 50–53.
[47] B. Kerner et al., “Traffic state detection with floating car data in road networks,” in Proc. IEEE Intell. Transp. Syst., Oct. 2005, pp. 44–49.
[48] W. Pattara-atikom, P. Pongpaibool, and S. Thajchayapong, “Estimating road traffic congestion using vehicle velocity,” in Proc. 6th Int. Conf. ITS Telecommun., Jun. 2006, pp. 1001–1004.
[49] C. De Fabritiis, R. Ragona, and G. Valenti, “Traffic estimation and prediction based on real time floating car data,” in Proc. 11th Int. IEEE Conf. Intell. Transp. Syst., Oct. 2008, pp. 197–203.
[50] Y. Feng, J. Hourdos, and G. A. Davis, “Probe vehicle based real-time traffic monitoring on urban roadways,” Transp. Res. C, Emerg. Technol., vol. 40, pp. 160–178, Mar. 2014.
[51] X. Kong, Z. Xu, G. Shen, J. Wang, Q. Yang, and B. Zhang, “Urban traffic congestion estimation and prediction based on floating car trajectory data,” Future Gener. Comput. Syst., vol. 61, pp. 97–107, Aug. 2016.
[52] P. Bellavista, F. Caselli, and L. Foschini, “Implementing and evaluating V2X protocols over iTETRIS: Traffic estimation in the COLOMBO project,” in Proc. 4th ACM Int. Symp. Develop. Anal. Intell. Veh. Netw. Appl. (DIVANet), 2014, pp. 25–32.
[53] D. Krajzewicz et al., “COLOMBO: Investigating the potential of V2X for traffic management purposes assuming low penetration rates,” in Proc. ITS Eur., 2013.
[54] P. Bellavista, L. Foschini, and E. Zamagni, “V2X protocols for low-penetration-rate and cooperative traffic estimations,” in Proc. IEEE 80th Veh. Technol. Conf. (VTC-Fall), Sep. 2014, pp. 1–6.
[55] R. Zhang et al., “Increasing traffic flows with DSRC technology: Field trials and performance evaluation,” in Proc. 44th Annu. Conf. IEEE Ind. Electron. Soc. (IECON), Oct. 2018, pp. 6191–6196.
[56] O. K. Tonguz and R. Zhang, “Harnessing vehicular broadcast communications: DSRC-actuated traffic control,” IEEE Trans. Intell. Transp. Syst., to be published.
[57] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Mach. Learn., vol. 8, nos. 3–4, pp. 279–292, 1992.
[58] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in Proc. Int. Conf. Mach. Learn., 2015, pp. 1889–1897.
[59] V. Mnih et al., “Asynchronous methods for deep reinforcement learning,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 1928–1937.
[60] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” 2017, arXiv:1707.06347. [Online]. Available: https://arxiv.org/abs/1707.06347
[61] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. AAAI, Phoenix, AZ, USA, vol. 2, 2016, p. 5.
[62] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, “Reinforcement learning with deep energy-based policies,” 2017, arXiv:1702.08165. [Online]. Available: https://arxiv.org/abs/1702.08165
[63] F. Belletti, D. Haziza, G. Gomes, and A. M. Bayen, “Expert level control of ramp metering based on multi-task deep reinforcement learning,” IEEE Trans. Intell. Transp. Syst., vol. 19, no. 4, pp. 1198–1207, Apr. 2018.
[64] C. Wu, A. Kreidieh, K. Parvate, E. Vinitsky, and A. M. Bayen, “Flow: Architecture and benchmarking for reinforcement learning in traffic control,” 2017, arXiv:1710.05465. [Online]. Available: https://arxiv.org/abs/1710.05465


[65] L.-J. Lin, “Reinforcement learning for robots using neural networks,” Ph.D. dissertation, School Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA, 1993.
[66] V. Mnih et al., “Playing Atari with deep reinforcement learning,” 2013, arXiv:1312.5602. [Online]. Available: https://arxiv.org/abs/1312.5602
[67] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
[68] A. Y. Ng et al., “Policy invariance under reward transformations: Theory and application to reward shaping,” in Proc. ICML, Jun. 1999, pp. 278–287.
[69] D. Krajzewicz, J. Erdmann, M. Behrisch, and L. Bieker, “Recent development and applications of SUMO–Simulation of Urban Mobility,” Int. J. Adv. Syst. Meas., vol. 5, nos. 3–4, 2012.
[70] S. Krauss, P. Wagner, and C. Gawron, “Metastable states in a microscopic model of traffic flow,” Phys. Rev. E, Stat. Phys. Plasmas Fluids Relat. Interdiscip. Top., vol. 55, no. 5, pp. 5597–5602, Jul. 2002.
[71] P. A. Lopez et al., “Microscopic traffic simulation using SUMO,” in Proc. 21st Int. Conf. Intell. Transp. Syst. (ITSC), Nov. 2018. [Online]. Available: https://elib.dlr.de/124092/
[72] Reinforcement Learning for Traffic Optimization. Accessed: May 12, 2018. [Online]. Available: https://youtu.be/HkXriL9SOW4
[73] (2014). Traffic Monitoring Guide. Accessed: May 13, 2018. [Online]. Available: https://ops.fhwa.dot.gov/freewaymgmt/publications/frwy_mgmt_handbook/chapter1_01.htm
[74] L. Codeca, R. Frank, S. Faye, and T. Engel, “Luxembourg SUMO traffic (LuST) scenario: Traffic demand evaluation,” IEEE Intell. Transp. Syst. Mag., vol. 9, no. 2, pp. 52–63, Apr. 2017.
[75] A. De Palma and R. Lindsey, “Traffic congestion pricing methodologies and technologies,” Transp. Res. C, Emerg. Technol., vol. 19, no. 6, pp. 1377–1399, Dec. 2011.
[76] B. Schaller, “New York City’s congestion pricing experience and implications for road pricing acceptance in the United States,” Transp. Policy, vol. 17, no. 4, pp. 266–273, Aug. 2010.
[77] R. Zhang, R. Leteurtre, B. Striner, A. Alanazi, A. Alghafis, and O. K. Tonguz, “Partially detected intelligent traffic signal control: Environmental adaptation,” 2019, arXiv:1910.10808. [Online]. Available: https://arxiv.org/abs/1910.10808
[78] D. A. Noyce, D. B. Fambro, and K. C. Kacir, “Traffic characteristics of protected/permitted left-turn signal displays,” Transp. Res. Rec., vol. 1708, no. 1, pp. 28–39, Jan. 2000.
[79] K. Tang and H. Nakamura, “A comparative study on traffic characteristics and driver behavior at signalized intersections in Germany and Japan,” in Proc. Eastern Asia Soc. Transp. Stud. 7th Int. Conf. Eastern Asia Soc. Transp. Stud., vol. 6, 2007, p. 324.
[80] T. J. Gates and D. A. Noyce, “Dilemma zone driver behavior as a function of vehicle type, time of day, and platooning,” Transp. Res. Rec., vol. 2149, no. 1, pp. 84–93, Jan. 2010.
[81] L. Rittger, G. Schmidt, C. Maag, and A. Kiesel, “Driving behaviour at traffic light intersections,” Cogn., Technol. Work, vol. 17, no. 4, pp. 593–605, Nov. 2015.
[82] J. Li, X. Jia, and C. Shao, “Predicting driver behavior during the yellow interval using video surveillance,” Int. J. Environ. Res. Public Health, vol. 13, no. 12, p. 1213, Dec. 2016.

Rusheng Zhang was born in Chengdu, China, in 1990. He received the B.E. degree in microelectromechanical systems and a second B.E. degree in applied mathematics from Tsinghua University, Beijing, in 2013, and the M.S. degree in electrical and computer engineering from Carnegie Mellon University, in 2015, where he is currently pursuing the Ph.D. degree. His research areas include vehicular networks, intelligent transportation systems, wireless computer networks, artificial intelligence, and intravehicular sensor networks.

Akihiro Ishikawa received the M.S. degree from the Electrical and Computer Engineering Department, Carnegie Mellon University, in 2017. His research interests include vehicular networks, wireless networks, and artificial intelligence.

Wenli Wang received the B.S. degree in statistics and the B.A. degree in fine arts from the University of California, Los Angeles, in 2016, and the M.S. degree from the Electrical and Computer Engineering Department, Carnegie Mellon University, in 2018. Her research interests include machine learning and its applications in wireless networks and computer vision.

Benjamin Striner received the B.A. degree in neuroscience and psychology from Oberlin College in 2005. He is currently pursuing the master’s degree with the Machine Learning Department, Carnegie Mellon University. He was a patent expert witness and engineer, especially in wireless communications. His research interests include reinforcement learning, generative networks, and better understandability and explainability in machine learning.

Ozan K. Tonguz is currently a tenured Full Professor with the Electrical and Computer Engineering Department, Carnegie Mellon University (CMU). He currently leads substantial research efforts at CMU in the broad areas of telecommunications and networking. He is the Founder and CEO of the CMU startup known as Virtual Traffic Lights, LLC, which specializes in providing solutions to acute transportation problems using vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications paradigms. He has published about 300 research papers in IEEE journals and conference proceedings in the areas of wireless networking, optical communications, and computer networks. He is the author (with G. Ferrari) of the book Ad Hoc Wireless Networks: A Communication-Theoretic Perspective (Wiley, 2006). He is the inventor of 21 issued or pending patents (18 U.S. patents and three international patents). His current research interests include vehicular networks, wireless ad hoc networks, sensor networks, self-organizing networks, artificial intelligence (AI), statistical machine learning, smart grid, bioinformatics, and security. He currently serves or has served as a consultant or expert for several companies, major law firms, and government agencies in the USA, Europe, and Asia.
