Abstract—Intelligent Traffic Signal Control (ITSC) systems have attracted the attention of researchers and the general public alike as a means of alleviating traffic congestion. Recently, vehicular wireless technologies have enabled a cost-efficient way to achieve ITSC by detecting vehicles using Vehicle-to-Infrastructure (V2I) wireless communications. Traditional ITSC algorithms, in most cases, assume that every vehicle is detected, such as by a camera or a loop detector, but a V2I implementation would detect only those vehicles equipped with wireless communications capability. We examine a family of transportation systems, which we will refer to as 'Partially Detected Intelligent Transportation Systems'. An algorithm that can perform well under a small detection rate is highly desirable due to the gradually increasing penetration rates of the underlying technologies, such as Dedicated Short-Range Communications (DSRC) technology. A Reinforcement Learning (RL) approach from Artificial Intelligence (AI) could provide indispensable tools for such problems, where only a small portion of vehicles are detected by the ITSC system. In this paper, we report a new RL algorithm for Partially Detected Intelligent Traffic Signal Control (PD-ITSC) systems. The performance of this system is studied under different car flows, detection rates, and types of road network. Our system is able to efficiently reduce the average waiting time of vehicles at an intersection, even with a low detection rate, thus reducing the travel time of vehicles.

Index Terms—Reinforcement learning, artificial intelligence, intelligent transportation systems, partially detected intelligent transportation systems, vehicle-to-infrastructure communications.

Manuscript received June 28, 2018; revised December 22, 2018, May 16, 2019, and September 6, 2019; accepted November 18, 2019. This work was supported by the King Abdulaziz City of Science and Technology (KACST), Riyadh, Kingdom of Saudi Arabia. The Associate Editor for this article was J. W. Choi. (Corresponding author: Rusheng Zhang.)
Rusheng Zhang, Akihiro Ishikawa, Wenli Wang, and Ozan K. Tonguz are with the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213-3890 USA (e-mail: [email protected]).
Benjamin Striner is with the Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213-3890 USA.
Digital Object Identifier 10.1109/TITS.2019.2958859

I. INTRODUCTION

TRAFFIC congestion is a daunting problem that affects the daily lives of billions of people in most countries across the world [1]. Over the last 30 years, many Intelligent Traffic Signal Control (ITSC) systems have been designed and demonstrated as one of the effective ways to reduce traffic congestion [2]–[9]. These systems use real-time traffic information measured or collected by video cameras or loop detectors and optimize the cycle split of a traffic light accordingly [10]. Unfortunately, such intelligent traffic signal control schemes are expensive and, therefore, they exist only at a small percentage of intersections in the United States, Europe, and Asia.

Recently, several more cost-effective approaches to implement ITSC systems were proposed by leveraging Dedicated Short-Range Communication (DSRC) technology [11]–[13]. DSRC is potentially a much cheaper technology for detecting the presence of vehicles on the approaches of an intersection. However, at the early stages of deployment, only a small percentage of vehicles will be equipped with DSRC radios. Meanwhile, the rapid development of the Internet of Things (IoT) has created new technology applicable for sensing vehicles for ITSC. Other than DSRC, applicable technologies include, but are not limited to, RFID, Bluetooth, Ultra-Wide Band (UWB), Zigbee, and even cellphone apps such as Google Maps [14]–[16]. All these systems are more economical than traditional loop detectors or cameras. Performance-wise, most of these systems are able to track vehicles in a continuous manner, while loop detectors can only detect the presence of vehicles. The ITSC systems mentioned above are all promising technologies that could bring the expensive price of traditional ITSC systems down dramatically; however, these systems have a common critical shortcoming: they are not able to detect vehicles unequipped with the communication device (i.e., DSRC radios, RFID tags, Bluetooth devices, etc.).

Since this adoption stage of the aforementioned systems could possibly take several years [17], new control algorithms that can handle partial detection of vehicles are required. One potential AI algorithm that could be very helpful is deep reinforcement learning (DRL), which has recently been explored by several groups [18], [19]. These results show an improvement in terms of waiting time and queue length experienced at an intersection in a fully observable environment. Hence, in this paper, we investigate this promising approach in a partially observable environment. As expected, we observe an asymptotically improving result as we increase the penetration rate of DSRC-equipped vehicles.

In this paper, we explore the capability of DRL for handling ITSC systems using partial detection. For simplicity, in some sections, we use a DSRC detection based system as the example system, but the scheme described in this paper is very general and therefore can be used for any possible form of partial detection, such as vehicle detection based on RFID, Bluetooth Low Energy 5.0 (BLE 5.0), or cellular (LTE or 5G). Via extensive simulations, we analyze the performance of
the Reinforcement Learning (RL) method. Our results clearly show that reinforcement learning is capable of providing an excellent traffic management scheme that is able to reduce the waiting time of commuters at intersections, even at a low penetration rate. The results also show a different performance for detected vehicles and undetected vehicles, suggesting a built-in business model, which could be the key to eventually pushing forward the large-scale deployment of ITSC.

The remainder of this paper is organized as follows. In Section II, we review the related work in this area. Section III gives a detailed problem formulation. Section IV outlines the approach we use. Section V presents the results of our study in terms of performance and sensitivity to critical system parameters. In Section VI, a discussion is presented that highlights the practical implications of our results for intelligent transportation systems, in addition to highlighting important extensions of our work for future work. Finally, Section VII concludes the paper.

II. RELATED WORKS

Traffic signal control using Artificial Intelligence (AI), especially reinforcement learning (RL), has been an active field of research for the last 20 years. In 1994, Mikami et al. proposed distributed reinforcement learning (Q-learning) using a Genetic Algorithm to present a traffic signal control scheme that effectively increased the throughput of a road network [20]. Due to the limitations of computing power in 1994, however, it could not be implemented at that time.

Bingham proposed RL for parameter search of a fuzzy-neural traffic signal controller for a single intersection [21], while Choy et al. adapted RL on the fuzzy-neural system in a cooperative scheme, achieving adaptive control for a large area [22]. These algorithms are based on RL, but the major goal of RL is parameter tuning of the fuzzy-neural system. Abdulhai et al. proposed the first truly adaptive traffic signal, which learns to control the traffic signal dynamically based on a Cerebellar Model Articulation Controller (CMAC) as a Q-estimation network [23]. Silva et al. and Oliveira et al. then proposed a context detector (CD) in conjunction with RL to further improve the performance under non-stationary traffic conditions [24], [25]. Several researchers have focused on multi-agent reinforcement learning for implementing it on a large scale [26]–[29].

Recently, with the development of GPUs and computing power, DRL has become an attractive method in several fields. Several attempts have been made using deep Q-learning for ITSC systems, including [18], [19], [30], [31]. These results show that a DQN-based Q-learning algorithm is capable of optimizing the traffic in an intelligent manner.

Traditional intelligent traffic signal systems use loop detectors, magnetic detectors, and cameras for improving the performance of traffic lights. In the past few decades, various adaptive traffic systems were developed and implemented. Some of these traffic systems, such as SCOOT [4] and SCATS [3], are based on dynamic traffic coordination [5], and can be viewed as a traffic-responsive version of TRANSYT [2]. These systems optimize the offsets of traffic signals in the network based on current traffic demand, and generate a 'green wave' for the major car flow. Meanwhile, some other model-based systems have been proposed, including OPAC [6], RHODES [7], and PRODYN [8]. These systems use both the current traffic arrivals and the prediction of future arrivals, and choose a signal phase plan that optimizes the objective functions. While these systems work efficiently, they do have some significant shortcomings: the cost of these systems is generally very high [32], [33].

Even though RL yields impressive results for these cases, it does not outperform current systems. Hence, the progress of these algorithms, while interesting, is of limited impact, since traditional ITSC systems perform comparably.

Meanwhile, the recent advancements in Vehicle-to-Everything (V2X) communication have made traffic signal control schemes based on such technology a rising field, as the cost is significantly lower than that of a traditional ITSC system [11]–[13]. Within these schemes, a system known as Virtual Traffic Lights (VTL) is very attractive, as it proposes an infrastructure-free DSRC-based solution by installing traffic control devices in vehicles and having the vehicles decide the right-of-way at an intersection locally. Different aspects of VTL technology have been studied by different research groups in the last few years [11], [34]–[45]. However, a VTL system requires all vehicles in the road network to be equipped with DSRC devices; therefore, a transition scheme for the current transportation systems to smoothly transition to a VTL system is needed.

On the other hand, several methods have been proposed for floating vehicle data gathered from the Global Positioning System (GPS) that are used to detect, estimate, and predict traffic states based on fuzzy logic, Genetic Algorithms (GA), Support Vector Machines (SVM), and other statistical learning algorithms [46]–[51]. The success of these works suggests the possibility of optimizing traffic control based on partial detection (such a system is formally introduced in Section III).

There are a few research projects currently available using partial detection. For example, COLOMBO is one of the projects that focuses on a low penetration rate of DSRC-equipped vehicles [52]–[54]. The system uses information provided by V2X technology and feeds the information to a traffic management system. Since COLOMBO cannot directly react to real-time traffic flow (the detected and undetected vehicles have the same performance), under low to medium car flow it will not achieve optimum performance, as the optimal strategy under low-to-medium car flow has to react according to detected car arrivals. Another very recent system is DSRC-Actuated Traffic Lights, one of our previous implementations using DSRC radios for traffic control. The designed prototype of this system was publicly demonstrated in Riyadh, Saudi Arabia, in July 2018 [55], [56]. DSRC-Actuated Traffic Lights, however, is based on the arrival of each vehicle, and hence works well under low to medium car flow rates, but it does not work well under high car flow rates.
in a way that maximizes the future reward. At every time step, the agent gets the state (the current observation of the environment) and reward information (the quantified indicator of performance from the last time step) from the environment and makes an action. During this process, the agent tries to optimize (maximize/minimize) the cumulative reward for its action policy. The beauty of this kind of algorithm is the fact that it doesn't need any supervision, since the agent observes the environment and tries to optimize its performance without human intervention.

RL algorithms come in two categories: policy-based algorithms, such as Trust Region Policy Optimization (TRPO) [58], Advantage Actor Critic (A2C) [59], and Proximal Policy Optimization (PPO) [60], which optimize the policy that maps from states to actions; and value-based algorithms, such as Q-learning [57], double Q-learning [61], and soft Q-learning [62], which directly maximize the cumulative rewards. While policy-based algorithms have achieved good results and will potentially be applicable to the problem proposed in this paper [63], [64], in this paper we choose the deep Q-learning algorithm.

In the Q-learning approach, the agent learns a 'Q-value', denoted Q(s_t, a_t), which is a function of the observed state s_t and action a_t that outputs the expected cumulative discounted future reward. Here, t denotes the discrete time index. The cumulative discounted future reward is defined as:

Q(s_t, a_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots

Here, r_t is the reward at each time step, the meaning of which needs to be specified according to the actual problem, and \gamma < 1 is the discount factor. At every time step, the agent updates its Q function by an update of the Q value:

Q(s_t, a_t) := Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right)

In most cases, including the traffic control scenarios of interest, due to the complexity of the state space and action space, deep neural networks can be used to approximate the Q function. Instead of updating the Q value, we use the value:

Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right)

as the output target of a Q network and do a step of back-propagation on the input of s_t, a_t.

We utilized two known methods to stabilize the training process [65], [66]:
1) Two Q-networks are maintained: a target Q-network and an on-line Q-network. The target Q-network is used to approximate the true Q-values, and the on-line Q-network is back-propagated every step. In the training period, the agent makes decisions with the target Q-network, and the results from each time instance are used to update the on-line Q-network. At periodic intervals, the on-line Q-network's weights are synchronized with the target Q-network. This keeps the agent's decision network relatively stable, instead of changing at every step.
2) Instead of training after every step the agent has taken, past experience is stored in a memory buffer, and training data is sampled from the memory for a certain batch size. This experience replay aims to break the time correlation between samples [67].

In this paper, we train the traffic light agents using a Deep Q-network (DQN) [67]. With the Q-learning algorithm described above, our work focuses on the definition of the agents' actions and the assignment of the states and rewards, which is discussed in the following subsection IV-B.

B. Parameter Modeling

We consider a traffic light controller, which takes a reward and a state observation from the environment and chooses an action. In this subsection, we introduce our design of actions, rewards, and states for the aforementioned PD-ITSC system problem.

1) Agent Action: In our context, the relevant action of the agent is either to keep the current traffic light phase or to switch to the next traffic light phase. At every time step, the agent makes an observation and takes an action accordingly, achieving intelligent control of traffic.

2) Reward: For traffic optimization problems, the goal is to decrease the average traffic delay of commuters in the network by using a traffic light phasing strategy S. Specifically, we want to find the best traffic light phasing strategy S such that t_S - t_{min} is minimized, where t_S is the average travel time of commuters in the network under the traffic control scheme S, and t_{min} is the physically possible lowest average travel time. Consider traveling the same distance d:

d = \int_0^{t_S} v_S(t)\,dt = t_{min}\, v_{max}

Here, v_{max} is some maximum reasonable speed for the vehicle, such as the speed limit of the road of interest, and v_S(t) denotes the actual vehicle speed under strategy S at time t. Therefore,

t_{min} = \frac{1}{v_{max}} \int_0^{t_S} v_S(t)\,dt

t_S - t_{min} = \int_0^{t_S} 1\,dt - \frac{1}{v_{max}} \int_0^{t_S} v_S(t)\,dt = \frac{1}{v_{max}} \int_0^{t_S} \left[ v_{max} - v_S(t) \right] dt

Therefore, obtaining the minimum delay t_S - t_{min} is equivalent to minimizing, at each step t, for each vehicle:

\frac{1}{v_{max}} \left[ v_{max} - v_S(t) \right]    (1)

We note that this is equivalent to maximizing v_S(t) if v_{max} is the same on all roads for all cars. If different vehicles have different v_{max}, the reward function is taken as the arithmetic average of the function over all vehicles.

We define the expression in (1) as the penalty of each step. Our goal is to minimize the penalty of each step. Since reinforcement learning tries to maximize the reward (minimize the penalty), we define the negative of the penalty as the reward for the reinforcement learning problem:

r_t = -\frac{1}{v_{max}} \left[ v_{max} - v_S(t) \right]    (2)
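To make the training procedure above concrete, the following is a minimal sketch, not the authors' released implementation, of how the per-step reward in Eq. (2) and the stabilized Q-update described in items 1) and 2) could be wired together. The network sizes, hyperparameters, and helper names (QNetwork, state_dim, step_reward) are illustrative assumptions rather than details taken from the paper; with alpha = 1 the regression target reduces to the standard DQN target of [67].

```python
import random
from collections import deque

import torch
import torch.nn as nn

def step_reward(speeds, v_max):
    """Eq. (2): negative normalized speed deficit, averaged over vehicles."""
    if not speeds:
        return 0.0                                   # no vehicles, no penalty
    return -sum((v_max - v) / v_max for v in speeds) / len(speeds)

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action (0 = keep phase, 1 = switch)."""
    def __init__(self, state_dim, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

state_dim = 16                                       # assumed observation size
online_q = QNetwork(state_dim)                       # back-propagated every step
target_q = QNetwork(state_dim)                       # used to act and build targets
target_q.load_state_dict(online_q.state_dict())
optimizer = torch.optim.Adam(online_q.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)                       # experience replay buffer
gamma, alpha = 0.99, 1.0

def train_step(batch_size=32):
    """One gradient step on a random minibatch (breaks time correlation)."""
    if len(replay) < batch_size:
        return
    s, a, r, s_next = zip(*random.sample(replay, batch_size))
    s = torch.tensor(s, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64).unsqueeze(1)
    r = torch.tensor(r, dtype=torch.float32)
    s_next = torch.tensor(s_next, dtype=torch.float32)

    with torch.no_grad():
        q_sa_old = online_q(s).gather(1, a).squeeze(1)
        max_next = target_q(s_next).max(dim=1).values
        # Target from the text: Q(s,a) + alpha * (r + gamma * max_a' Q_target(s',a') - Q(s,a))
        target = q_sa_old + alpha * (r + gamma * max_next - q_sa_old)

    q_sa = online_q(s).gather(1, a).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, target)      # back-propagate the on-line net
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def sync_target():
    """Called at periodic intervals to keep the decision network stable."""
    target_q.load_state_dict(online_q.state_dict())
```

In use, each simulation step would append a transition (s_t, a_t, r_{t+1}, s_{t+1}) to replay and call train_step(), with sync_target() invoked at some fixed interval; the exact interval and buffer size would be tuning choices, not values specified in this section.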
[TABLE I: Details of State Representation]
[Fig. 6: Waiting time under different detection rates under medium car flow.]
[Fig. 9: Expected performance by time.]
[Fig. 10: Sensitivity analysis of flow rate.]

Although the system under a 100% detection rate performs visibly better at midnight, the performance at that time is not as critical as the performance during the busier daytime. This result indicates that by detecting 20% of vehicles, we can perform almost the same as detecting all vehicles. However, the detectable vehicles (yellow lines) will have an advantage over the undetectable vehicles (dashed line).

These results confirm intuition. With a large volume of cars, a low detection rate should still provide a relatively low-variance estimate of traffic flow. If there are few cars and a low detection rate, the estimate of traffic flow can have very high variance. Late at night, with only a single detected car, an ITSC system can give that car a green light immediately, which would not be possible with an undetected car.

C. Sensitivity Analysis

The results obtained above used agents trained and evaluated under the same environmental parameters, since traffic patterns only fluctuate slightly from day to day. Below, we evaluate the sensitivity of the agents to two environmental parameters: the car flow and the detection rate.

1) Sensitivity to Car Flow: Figure 10 shows the agents' sensitivity to car flow. Figure 10a shows the performance of an agent trained under 0.1 veh/s car flow, operating at different flow rates. Figure 10b shows the sensitivity of an agent trained under 0.5 veh/s car flow. The blue curve in the figure is the trained agent's performance, while the red one is the performance of the optimal agent (the agent trained under that situation and tested under that situation). Both agents perform well over a range of flow rates. The agent trained under 0.1 veh/s flow can handle flow rates from 0 to 0.15 veh/s at near-optimal levels. At higher flow rates, it still performs reasonably well. The agent trained on 0.5 veh/s flow will perform reasonably from 0.25 veh/s to 0.5 veh/s, but under 0.25 veh/s, the agent will start to perform substantially worse than the optimal agent. Since traffic patterns are not expected to fluctuate heavily, these results give a strong indication that the trained agent will be able to adapt to the environment even when the actual situation is slightly different from the trained one.

2) Sensitivity to Detection Rate: In most situations, the detection rate can only be approximately measured. It is likely that an agent trained under one detection rate needs to operate under a slightly different detection rate, so we test the sensitivity of agents to detection rates.

Figure 11 shows the sensitivity for two cases: Figure 11a shows the sensitivity of an agent trained under a low detection rate (0.2), and Figure 11b shows the sensitivity under a high detection rate (0.8).

We observe that the agent trained under a 0.2 detection rate performs at an optimal level from a 0.1 to a 0.4 detection rate. The sensitivity upward is better than downward. This indicates that at early deployment of this system, it is better to under-estimate the detection rate, since the agent's performance is more stable toward higher detection rates.

Figure 11b shows the sensitivity of the agent trained under a high detection rate (0.8). We can see that the performance of this agent is at an optimal level when the detection rate is between 0.5 and 1. Although the sensitivity behavior of an agent trained under a low detection rate differs from that of an agent trained under a high detection rate, in both cases the agent shows a level of stability, which means that as long as the detection rate used for training is not too different from the actual detection rate, the performance of the agent will not be affected much.

D. Robustness Between Training and Deployment Scenario

There are many differences between the training and the actual deployment scenario, as the simulator, though quite sophisticated, will never be able to take all the factors of the real scenario into account. This evaluation aims to verify that those minor factors, such as stop-and-go vehicles, arrival patterns, and other factors, won't affect the system in a major way. We choose a newly published realistic scenario known as Luxembourg SUMO Traffic (LuST) [74]. The scenario is generated on the real map of Luxembourg, and the activity of vehicles is generated according to the demographic data published by the government. The authors of this scenario
[TABLE II: Difference in Training and Evaluation Scenario]
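As a rough illustration of the evaluation methodology in the sensitivity and robustness experiments above, the sketch below treats partial detection as a Bernoulli draw per vehicle and sweeps the deployment detection rate for an agent trained at a fixed rate. The helpers run_episode and agent are hypothetical placeholders for the simulation interface, not functions from the paper or from SUMO.

```python
import random

def detected_vehicles(vehicles, detection_rate):
    """Mark each vehicle as detected (e.g., DSRC-equipped) with probability = detection_rate."""
    return [v for v in vehicles if random.random() < detection_rate]

def detection_rate_sweep(agent, run_episode, trained_rate, deploy_rates, runs=10):
    """Average waiting time of one agent when deployed at different detection rates.

    run_episode(agent, rate) is assumed to simulate one episode in which the
    agent's state is built only from detected_vehicles(...) and to return the
    average waiting time at the intersection.
    """
    results = {}
    for rate in deploy_rates:
        waits = [run_episode(agent, rate) for _ in range(runs)]   # repeat runs
        results[rate] = sum(waits) / len(waits)
    print(f"Agent trained at detection rate {trained_rate}:")
    for rate in sorted(results):
        print(f"  deployed at {rate:.1f} -> mean waiting time {results[rate]:.1f} s")
    return results
```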
controller, detected vehicles will have a shorter commute time than undetected vehicles. This property makes it possible for hardware manufacturers, software companies, and vehicle manufacturers to help push forward the proposed scheme, rather than the Department of Transportation (DoT) alone, for the simple reason that all of them can profit from this system. For example, it would be valuable for a certain navigation app to advertise that its customers can save 30% on commute time.

Therefore, we view this technology as a new generation of Intelligent Transportation Systems, as it inherently comes with a lucrative commercial business model. The burden of spreading the penetration rate of this system is distributed over many companies, as opposed to the traditional ITSC systems, which put all the burden on the DoT alone. This makes it financially feasible to have the system installed at most of the intersections in a city, as opposed to the current situation where only a small proportion of intersections are equipped with ITSC.

The mechanism of the system solution described will also make it possible to have dynamic pricing. Dynamic pricing refers to reserving certain roads during rush hours exclusively for paid users. This method has been scuttled by public or political opposition, and only a few cities have implemented dynamic pricing [75], [76]. Those few successful examples, however, cannot be easily copied or adapted to other cities, as the method depends hugely on road topologies. In our solution, we can accomplish dynamic pricing in a more intelligent way, by simply considering vehicle detection as a service. Compared to existing solutions, this service will not require reserving roads, making the scheme flexible and easy to implement. The user will also be able to choose to pay for a prioritized signal phase whenever they are in a hurry.

Further research is needed to make this AI-based Intelligent Traffic Control System more practical. First of all, the system currently needs to be fully trained in a simulator; under the partial observation setup, the system will not be able to observe the reward, hence it won't be able to do any incremental training after deployment. Clearly, this is a drawback or shortcoming of the proposed system. Some solutions to this problem are reported in a follow-up paper [77]. Another future direction would be to further develop the system to achieve multi-agent coordination so that, with the help of DSRC radios (or other forms of communication), traffic lights will be able to communicate with each other. Clearly, designing such a system will significantly improve the performance of the PD-ITSC system. Further research is also required to investigate whether the RL agent will be able to pick up the drivers' behavior accurately at each intersection [78]–[82].

VII. CONCLUSION

In this paper, we have proposed reinforcement learning, specifically deep Q-learning, for traffic control with partial detection of vehicles. The results of our study show that reinforcement learning is a promising new approach to optimizing traffic control problems under partial detection scenarios, such as traffic control systems using DSRC technology. This is a very promising outcome that is highly desirable, since industry forecasts suggest that the DSRC penetration process will be gradual as opposed to abrupt.

The numerical results on sparse, medium, and dense arrival rates suggest that reinforcement learning is able to handle all kinds of traffic flow. Although the optimization of traffic on sparse arrivals and dense arrivals is, in general, very different, the results show that reinforcement learning is able to leverage the 'particle' property of the vehicle flow as well as the 'liquid' property, thus providing a very powerful overall optimization scheme.

ACKNOWLEDGMENT

The authors would like to thank Dr. H. Liu from the Language Technologies Institute, Carnegie Mellon University, for informative discussions and many suggestions on the methods reported in this paper. The authors would also like to thank Dr. L. Gallo from Eurecom, France, and Mr. M. E. Diaz-Granados of Yahoo, U.S., for the initial attempt to solve this problem in 2016.

REFERENCES

[1] (2017). Traffic Congestion and Reliability: Trends and Advanced Strategies for Congestion Mitigation. Accessed: Aug. 19, 2017. [Online]. Available: https://ops.fhwa.dot.gov/congestion_report/executive_summary.htm
[2] D. I. Robertson, "'TRANSYT' method for area traffic control," Traffic Eng. Control, vol. 8, no. 8, 1969.
[3] P. Lowrie, "SCATS, Sydney co-ordinated adaptive traffic system: A traffic responsive method of controlling urban traffic," Roads Traffic Authority NSW, Sydney, NSW, Australia, Tech. Rep. 00772163, 1990. [Online]. Available: https://trid.trb.org/view/488852
[4] P. Hunt, D. Robertson, R. Bretherton, and M. C. Royle, "The SCOOT on-line traffic signal optimisation technique," Traffic Eng. Control, vol. 23, no. 4, 1982.
[5] J. Luk, "Two traffic-responsive area traffic control methods: SCAT and SCOOT," Traffic Eng. Control, vol. 25, no. 1, 1984.
[6] N. H. Gartner, "OPAC: A demand-responsive strategy for traffic signal control," U.S. Dept. Transp., Washington, DC, USA, Tech. Rep. 906, 1983.
[7] P. Mirchandani and L. Head, "A real-time traffic signal control system: Architecture, algorithms, and analysis," Transp. Res. C, Emerg. Technol., vol. 9, no. 6, pp. 415–432, Dec. 2001.
[8] J.-J. Henry, J. L. Farges, and J. Tuffal, "The PRODYN real time traffic algorithm," in Control in Transportation Systems. Amsterdam, The Netherlands: Elsevier, 1984, pp. 305–310.
[9] R. Vincent and J. Peirce, "'MOVA': Traffic responsive, self-optimising signal control for isolated intersections," Transp. Road Res. Lab., Crowthorne, U.K., Tech. Rep. RR 170, 1988.
[10] (2016). Traffic Light Control and Coordination. Accessed: Mar. 23, 2016. [Online]. Available: https://en.wikipedia.org/wiki/Traffic_light_control_and_coordination
[11] M. Ferreira, R. Fernandes, H. Conceição, W. Viriyasitavat, and O. K. Tonguz, "Self-organized traffic control," in Proc. 7th ACM Int. Workshop Veh. InterNETworking (VANET), 2010, pp. 85–90.
[12] N. S. Nafi and J. Y. Khan, "A VANET based intelligent road traffic signalling system," in Proc. Australas. Telecommun. Netw. Appl. Conf. (ATNAC), Nov. 2012, pp. 1–6.
[13] V. Milanes, J. Villagra, J. Godoy, J. Simo, J. Perez, and E. Onieva, "An intelligent V2I-based traffic management system," IEEE Trans. Intell. Transp. Syst., vol. 13, no. 1, pp. 49–58, Mar. 2012.
[14] A. Chattaraj, S. Bansal, and A. Chandra, "An intelligent traffic control system using RFID," IEEE Potentials, vol. 28, no. 3, pp. 40–43, May 2009.
[15] M. R. Friesen and R. D. McLeod, "Bluetooth in intelligent transportation systems: A survey," Int. J. Intell. Transp. Syst. Res., vol. 13, no. 3, pp. 143–153, Sep. 2015.
[16] F. Qu, F.-Y. Wang, and L. Yang, "Intelligent transportation spaces: Vehicles, traffic, communications, and beyond," IEEE Commun. Mag., vol. 48, no. 11, pp. 136–142, Nov. 2010.
[17] Average Age of Cars on US. Accessed: Aug. 21, 2017. [Online]. Available: https://www.usatoday.com/story/money/2015/07/29/new-car-sales-soaring-but-cars-getting-older-too/30821191/
[18] W. Genders and S. Razavi, "Using a deep reinforcement learning agent for traffic signal control," 2016, arXiv:1611.01142. [Online]. Available: https://arxiv.org/abs/1611.01142
[19] E. van der Pol, "Deep reinforcement learning for coordination in traffic light control," Ph.D. dissertation, Univ. Amsterdam, Amsterdam, The Netherlands, 2016.
[20] S. Mikami and Y. Kakazu, "Genetic reinforcement learning for cooperative traffic signal control," in Proc. 1st IEEE Conf. Evol. Comput. IEEE World Congr. Comput. Intell., Dec. 2002, pp. 223–228.
[21] E. Bingham, "Reinforcement learning in neurofuzzy traffic signal control," Eur. J. Oper. Res., vol. 131, no. 2, pp. 232–241, Jun. 2001.
[22] M. Chee Choy, D. Srinivasan, and R. Long Cheu, "Hybrid cooperative agents with online reinforcement learning for traffic control," in Proc. IEEE World Congr. Comput. Intell. IEEE Int. Conf. Fuzzy Syst. (FUZZ), Jun. 2003, pp. 1015–1020.
[23] B. Abdulhai, R. Pringle, and G. J. Karakoulas, "Reinforcement learning for true adaptive traffic signal control," J. Transp. Eng., vol. 129, no. 3, pp. 278–285, 2003.
[24] A. B. C. da Silva, D. de Oliveira, and E. Basso, "Adaptive traffic control with reinforcement learning," in Proc. Conf. Auto. Agents Multiagent Syst. (AAMAS), 2006, pp. 80–86.
[25] D. de Oliveira et al., "Reinforcement learning based control of traffic lights in non-stationary environments: A case study in a microscopic simulator," in Proc. EUMAS, 2006.
[26] M. Abdoos, N. Mozayani, and A. L. C. Bazzan, "Traffic light control in non-stationary environments based on multi agent Q-learning," in Proc. 14th Int. IEEE Conf. Intell. Transp. Syst. (ITSC), Oct. 2011, pp. 1580–1585.
[27] J. C. Medina and R. F. Benekohal, "Traffic signal control using reinforcement learning and the max-plus algorithm as a coordinating strategy," in Proc. 15th Int. IEEE Conf. Intell. Transp. Syst., Sep. 2012, pp. 596–601.
[28] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, "Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): Methodology and large-scale application on downtown Toronto," IEEE Trans. Intell. Transp. Syst., vol. 14, no. 3, pp. 1140–1150, Sep. 2013.
[29] M. A. Khamis and W. Gomaa, "Adaptive multi-objective reinforcement learning with hybrid exploration for traffic signal control based on cooperative multi-agent framework," Eng. Appl. Artif. Intell., vol. 29, pp. 134–151, Mar. 2014.
[30] L. Li, Y. Lv, and F.-Y. Wang, "Traffic signal timing via deep reinforcement learning," IEEE/CAA J. Autom. Sinica, vol. 3, no. 3, pp. 247–254, Apr. 2016.
[31] D. Garg, M. Chli, and G. Vogiatzis, "Deep reinforcement learning for autonomous traffic light control," in Proc. 3rd IEEE Int. Conf. Intell. Transp. Eng. (ICITE), Sep. 2018.
[32] (2016). Intelligent Traffic System Cost. Accessed: Nov. 23, 2017. [Online]. Available: http://www.itscosts.its.dot.gov/ITS/benecost.nsf/ID/C1A22DD1C3BA1ED285257CD60062C3BB?OpenDocument&Query=CApp/
[33] (2016). SCATS System Cost. Accessed: May 13, 2018. [Online]. Available: https://www.itscosts.its.dot.gov/ITS/benecost.nsf/0/9E957998C8AB79A885257B1E0049CAFF?OpenDocument&Query=Home
[34] T. Neudecker, N. An, O. K. Tonguz, T. Gaugel, and J. Mittag, "Feasibility of virtual traffic lights in non-line-of-sight environments," in Proc. 9th ACM Int. Workshop Veh. Inter-Netw., Syst., Appl. (VANET), 2012, pp. 103–106.
[35] M. Ferreira and P. M. d'Orey, "On the impact of virtual traffic lights on carbon emissions mitigation," IEEE Trans. Intell. Transp. Syst., vol. 13, no. 1, pp. 284–295, Mar. 2012.
[36] M. Nakamurakare, W. Viriyasitavat, and O. K. Tonguz, "A prototype of Virtual Traffic Lights on Android-based smartphones," in Proc. IEEE Int. Conf. Sens., Commun. Netw. (SECON), Jun. 2013, pp. 236–238.
[37] W. Viriyasitavat, J. M. Roldan, and O. K. Tonguz, "Accelerating the adoption of Virtual Traffic Lights through policy decisions," in Proc. Int. Conf. Connected Vehicles Expo (ICCVE), Dec. 2013, pp. 443–444.
[38] A. Bazzi, A. Zanella, B. M. Masini, and G. Pasolini, "A distributed algorithm for virtual traffic lights with IEEE 802.11p," in Proc. Eur. Conf. Netw. Commun. (EuCNC), Jun. 2014, pp. 1–5.
[39] F. Hagenauer, P. Baldemaier, F. Dressler, and C. Sommer, "Advanced leader election for virtual traffic lights," ZTE Commun., Special Issue VANET, vol. 12, no. 1, pp. 11–16, Mar. 2014.
[40] O. Tonguz, W. Viriyasitavat, and J. Roldan, "Implementing virtual traffic lights with partial penetration: A game-theoretic approach," IEEE Commun. Mag., vol. 52, no. 12, pp. 173–182, Dec. 2014.
[41] J. Yapp and A. J. Kornecki, "Safety analysis of virtual traffic lights," in Proc. 20th Int. Conf. Methods Models Autom. Robot. (MMAR), Aug. 2015, pp. 505–510.
[42] A. Bazzi, A. Zanella, and B. M. Masini, "A distributed virtual traffic light algorithm exploiting short range V2V communications," Ad Hoc Netw., vol. 49, pp. 42–57, Oct. 2016.
[43] O. K. Tonguz and W. Viriyasitavat, "A self-organizing network approach to priority management at intersections," IEEE Commun. Mag., vol. 54, no. 6, pp. 119–127, Jun. 2016.
[44] R. Zhang et al., "Virtual traffic lights: System design and implementation," 2018, arXiv:1807.01633. [Online]. Available: https://arxiv.org/abs/1807.01633
[45] O. K. Tonguz, "Red light, green light—no light: Tomorrow's communicative cars could take turns at intersections," IEEE Spectr. Mag., vol. 55, no. 10, pp. 24–29, Oct. 2018.
[46] J. Lu and L. Cao, "Congestion evaluation from traffic flow information based on fuzzy logic," in Proc. IEEE Int. Conf. Intell. Transp. Syst., vol. 1, Apr. 2004, pp. 50–53.
[47] B. Kerner et al., "Traffic state detection with floating car data in road networks," in Proc. IEEE Intell. Transp. Syst., Oct. 2005, pp. 44–49.
[48] W. Pattara-atikom, P. Pongpaibool, and S. Thajchayapong, "Estimating road traffic congestion using vehicle velocity," in Proc. 6th Int. Conf. ITS Telecommun., Jun. 2006, pp. 1001–1004.
[49] C. De Fabritiis, R. Ragona, and G. Valenti, "Traffic estimation and prediction based on real time floating car data," in Proc. 11th Int. IEEE Conf. Intell. Transp. Syst., Oct. 2008, pp. 197–203.
[50] Y. Feng, J. Hourdos, and G. A. Davis, "Probe vehicle based real-time traffic monitoring on urban roadways," Transp. Res. C, Emerg. Technol., vol. 40, pp. 160–178, Mar. 2014.
[51] X. Kong, Z. Xu, G. Shen, J. Wang, Q. Yang, and B. Zhang, "Urban traffic congestion estimation and prediction based on floating car trajectory data," Future Gener. Comput. Syst., vol. 61, pp. 97–107, Aug. 2016.
[52] P. Bellavista, F. Caselli, and L. Foschini, "Implementing and evaluating V2X protocols over iTETRIS: Traffic estimation in the COLOMBO project," in Proc. 4th ACM Int. Symp. Develop. Anal. Intell. Veh. Netw. Appl. (DIVANet), 2014, pp. 25–32.
[53] D. Krajzewicz et al., "COLOMBO: Investigating the potential of V2X for traffic management purposes assuming low penetration rates," in Proc. ITS Eur., 2013.
[54] P. Bellavista, L. Foschini, and E. Zamagni, "V2X protocols for low-penetration-rate and cooperative traffic estimations," in Proc. IEEE 80th Veh. Technol. Conf. (VTC-Fall), Sep. 2014, pp. 1–6.
[55] R. Zhang et al., "Increasing traffic flows with DSRC technology: Field trials and performance evaluation," in Proc. 44th Annu. Conf. IEEE Ind. Electron. Soc. (IECON), Oct. 2018, pp. 6191–6196.
[56] O. K. Tonguz and R. Zhang, "Harnessing vehicular broadcast communications: DSRC-actuated traffic control," IEEE Trans. Intell. Transp. Syst., to be published.
[57] C. J. C. H. Watkins and P. Dayan, "Q-learning," Mach. Learn., vol. 8, nos. 3–4, pp. 279–292, 1992.
[58] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in Proc. Int. Conf. Mach. Learn., 2015, pp. 1889–1897.
[59] V. Mnih et al., "Asynchronous methods for deep reinforcement learning," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1928–1937.
[60] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, arXiv:1707.06347. [Online]. Available: https://arxiv.org/abs/1707.06347
[61] H. van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proc. AAAI, Phoenix, AZ, USA, vol. 2, 2016, p. 5.
[62] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, "Reinforcement learning with deep energy-based policies," 2017, arXiv:1702.08165. [Online]. Available: https://arxiv.org/abs/1702.08165
[63] F. Belletti, D. Haziza, G. Gomes, and A. M. Bayen, "Expert level control of ramp metering based on multi-task deep reinforcement learning," IEEE Trans. Intell. Transp. Syst., vol. 19, no. 4, pp. 1198–1207, Apr. 2018.
[64] C. Wu, A. Kreidieh, K. Parvate, E. Vinitsky, and A. M. Bayen, "Flow: Architecture and benchmarking for reinforcement learning in traffic control," 2017, arXiv:1710.05465. [Online]. Available: https://arxiv.org/abs/1710.05465
[65] L.-J. Lin, "Reinforcement learning for robots using neural networks," Ph.D. dissertation, School Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA, 1993.
[66] V. Mnih et al., "Playing Atari with deep reinforcement learning," 2013, arXiv:1312.5602. [Online]. Available: https://arxiv.org/abs/1312.5602
[67] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
[68] A. Y. Ng et al., "Policy invariance under reward transformations: Theory and application to reward shaping," in Proc. ICML, Jun. 1999, pp. 278–287.
[69] D. Krajzewicz, J. Erdmann, M. Behrisch, and L. Bieker, "Recent development and applications of SUMO—Simulation of Urban Mobility," Int. J. Adv. Syst. Meas., vol. 5, nos. 3–4, 2012.
[70] S. Krauss, P. Wagner, and C. Gawron, "Metastable states in a microscopic model of traffic flow," Phys. Rev. E, Stat. Phys. Plasmas Fluids Relat. Interdiscip. Top., vol. 55, no. 5, pp. 5597–5602, Jul. 2002.
[71] P. A. Lopez et al., "Microscopic traffic simulation using SUMO," in Proc. 21st Int. Conf. Intell. Transp. Syst. (ITSC), Nov. 2018. [Online]. Available: https://elib.dlr.de/124092/
[72] Reinforcement Learning for Traffic Optimization. Accessed: May 12, 2018. [Online]. Available: https://youtu.be/HkXriL9SOW4
[73] (2014). Traffic Monitoring Guide. Accessed: May 13, 2018. [Online]. Available: https://ops.fhwa.dot.gov/freewaymgmt/publications/frwy_mgmt_handbook/chapter1_01.htm
[74] L. Codeca, R. Frank, S. Faye, and T. Engel, "Luxembourg SUMO Traffic (LuST) scenario: Traffic demand evaluation," IEEE Intell. Transp. Syst. Mag., vol. 9, no. 2, pp. 52–63, Apr. 2017.
[75] A. De Palma and R. Lindsey, "Traffic congestion pricing methodologies and technologies," Transp. Res. C, Emerg. Technol., vol. 19, no. 6, pp. 1377–1399, Dec. 2011.
[76] B. Schaller, "New York City's congestion pricing experience and implications for road pricing acceptance in the United States," Transp. Policy, vol. 17, no. 4, pp. 266–273, Aug. 2010.
[77] R. Zhang, R. Leteurtre, B. Striner, A. Alanazi, A. Alghafis, and O. K. Tonguz, "Partially detected intelligent traffic signal control: Environmental adaptation," 2019, arXiv:1910.10808. [Online]. Available: https://arxiv.org/abs/1910.10808
[78] D. A. Noyce, D. B. Fambro, and K. C. Kacir, "Traffic characteristics of protected/permitted left-turn signal displays," Transp. Res. Rec., vol. 1708, no. 1, pp. 28–39, Jan. 2000.
[79] K. Tang and H. Nakamura, "A comparative study on traffic characteristics and driver behavior at signalized intersections in Germany and Japan," in Proc. Eastern Asia Soc. Transp. Stud. 7th Int. Conf. Eastern Asia Soc. Transp. Stud., vol. 6, 2007, p. 324.
[80] T. J. Gates and D. A. Noyce, "Dilemma zone driver behavior as a function of vehicle type, time of day, and platooning," Transp. Res. Rec., vol. 2149, no. 1, pp. 84–93, Jan. 2010.
[81] L. Rittger, G. Schmidt, C. Maag, and A. Kiesel, "Driving behaviour at traffic light intersections," Cogn., Technol. Work, vol. 17, no. 4, pp. 593–605, Nov. 2015.
[82] J. Li, X. Jia, and C. Shao, "Predicting driver behavior during the yellow interval using video surveillance," Int. J. Environ. Res. Public Health, vol. 13, no. 12, p. 1213, Dec. 2016.

Rusheng Zhang was born in Chengdu, China, in 1990. He received the first B.E. degree in micro electrical mechanical systems and the second B.E. degree in applied mathematics from Tsinghua University, Beijing, in 2013, and the M.S. degree in electrical and computer engineering from Carnegie Mellon University, in 2015, where he is currently pursuing the Ph.D. degree. His research areas include vehicular networks, intelligent transportation systems, wireless computer networks, artificial intelligence, and intravehicular sensor networks.

Akihiro Ishikawa received the M.S. degree from the Electrical and Computer Engineering Department, Carnegie Mellon University, in 2017. His research interests include vehicular networks, wireless networks, and artificial intelligence.

Wenli Wang received the B.S. degree in statistics and the B.A. degree in fine arts from the University of California, Los Angeles, in 2016, and the M.S. degree from the Electrical and Computer Engineering Department, Carnegie Mellon University, in 2018. Her research interests include machine learning and its applications in wireless networks and computer vision.

Benjamin Striner received the B.A. degree in neuroscience and psychology from Oberlin College in 2005. He is currently pursuing the master's degree with the Machine Learning Department, Carnegie Mellon University. He was a patent expert witness and engineer, especially in wireless communications. His research interests include reinforcement learning, generative networks, and better understandability and explainability in machine learning.

Ozan K. Tonguz is currently a tenured Full Professor with the Electrical and Computer Engineering Department, Carnegie Mellon University (CMU). He currently leads substantial research efforts at CMU in the broad areas of telecommunications and networking. He is the Founder and CEO of the CMU startup known as Virtual Traffic Lights, LLC, which specializes in providing solutions to acute transportation problems using vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications paradigms. He has published about 300 research papers in IEEE journals and conference proceedings in the areas of wireless networking, optical communications, and computer networks. He is the author (with G. Ferrari) of the book Ad Hoc Wireless Networks: A Communication-Theoretic Perspective (Wiley, 2006). He is the inventor of 21 issued or pending patents (18 U.S. patents and three international patents). His current research interests include vehicular networks, wireless ad hoc networks, sensor networks, self-organizing networks, artificial intelligence (AI), statistical machine learning, smart grid, bioinformatics, and security. He currently serves or has served as a consultant or expert for several companies, major law firms, and government agencies in the USA, Europe, and Asia.