Reinforcement Learning for Adaptive Traffic Signal Control

Baher Abdulhai, Rob Pringle, and Grigoris J. Karakoulas
Abstract: The ability to exert real-time, adaptive control of transportation processes is the core of many intelligent transportation
systems decision support tools. Reinforcement learning, an artificial intelligence approach undergoing development in the machine-
learning community, offers key advantages in this regard. The ability of a control agent to learn relationships between control actions and
their effect on the environment while pursuing a goal is a distinct improvement over prespecified models of the environment. Prespecified
models are a prerequisite of conventional control methods and their accuracy limits the performance of control agents. This paper contains
an introduction to Q-learning, a simple yet powerful reinforcement learning algorithm, and presents a case study involving application to
traffic signal control. Encouraging results of the application to an isolated traffic signal, particularly under variable traffic conditions, are
presented. A broader research effort is outlined, including extension to linear and networked signal systems and integration with dynamic
route guidance. The research objective involves optimal control of heavily congested traffic across a two-dimensional road network—a
challenging task for conventional traffic signal control methodologies.
DOI: 10.1061/(ASCE)0733-947X(2003)129:3(278)
CE Database subject headings: Traffic signal controllers; Intelligent transportation systems; Traffic control; Traffic management;
Adaptive systems.
Fig. 1. Illustration of reinforcement learning: (a) Gridworld; (b) first episode; (c) second episode; (d) selected final Q-estimates; (e) one possible optimal policy
ing, which could be integrated with dynamic route guidance to provide effective traffic control in highly congested conditions. Effective traffic control in the face of severe congestion on a two-dimensional road network is a challenging task for existing signal control methodologies.

Reinforcement Learning: Brief Primer

Illustrative Example

In its simplest terms, reinforcement learning involves an agent that wishes to learn how to achieve a goal. It does so by interacting dynamically with its environment, trying different actions in different situations in order to determine the best action or sequence of actions to achieve its goal from any given situation. Feedback signals provided by the environment allow the agent to determine to what extent an action actually contributed to the achievement of the desired goal.

To illustrate the concept of reinforcement learning, consider the following simplified example of a mobile robot navigating within the gridworld shown in Fig. 1(a). This is actually an illustration of Q-learning, developed by Watkins (1989; Watkins and Dayan 1992), one of a number of possible reinforcement learning algorithms and the one used in the case study presented later in this paper. Imagine that the robot starts behind Door A and that its goal is to pass through Door B, for which it gains a reward of 100 units. No other actions are rewarded. Once it passes through Door B, it remains there (perhaps waiting for a further task) and gains no further rewards. At each time step, the robot can move to an adjacent grid square but cannot move diagonally. Let us also define a discount rate that has the effect of reducing the value of future rewards relative to more immediate rewards. In this case, the discount rate also has the effect of encouraging the robot to learn the shortest possible path to Door B. For this example, assume that the discount rate is 0.9.

Initially, all potential moves from any given grid square, except that involving passing through Door B from the square in front of it, have a value of zero, in the sense that no reward appears to be gained by implementing them. On its first journey, therefore, the robot explores in a random fashion, possibly following the path shown in Fig. 1(b), until it eventually passes through Door B. In doing so, it gains a reward of 100 units and remains there, ending the current episode. The value assigned to the move preceding the move through Door B is updated using the reward of 100 units, factored by the discount rate of 0.9, since the reward was gained one time step into the future, to give a net value of 90 units. On its second journey from Door A, the robot explores until it reaches a square adjacent to the one in front of Door B. As before, the preceding move is assigned a value of 90 units, factored by the discount rate of 0.9, to give a net value of 81 units, as shown in Fig. 1(c). Each journey or episode thereafter may result in another move being assigned a value. At some point, the robot might find itself confronted with a choice between making a move with zero value and making one with some previously assigned positive value. In this situation, the robot must choose whether or not to explore the move with a current value of zero, on the chance that it might be better than exploiting its current knowledge by making the move that it knows has a positive value.

After a sufficient number of episodes, each move from any given square will have been assigned a value. In most practical problems, particularly in stochastic domains, many episodes are required before these values achieve useful convergence. Fig. 1(d) shows a selection of these values, each of which represents the sum of discounted future rewards if one follows an optimal path from that particular grid square to Door B. The robot has therefore learned an estimate of the value function Q. At this point, the
robot can implement an optimal sequence of actions, or policy, by greedily taking the action with the highest value, regardless of where it starts from or finds itself, until it reaches Door B. Fig. 1(e) shows one of several possible optimal policies.
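To make the mechanics of the example concrete, the following sketch runs tabular Q-learning on a small, deterministic gridworld of this kind. It is a minimal illustration rather than the authors' implementation: the grid dimensions, start square, and goal square are assumptions, and because the environment is deterministic the training rate is simply taken as 1.

```python
import random

# Minimal sketch (not the authors' code): tabular Q-learning on a small,
# deterministic gridworld like the one in Fig. 1. Grid size, start square,
# and goal square are illustrative assumptions.
ROWS, COLS = 3, 4
START, GOAL = (0, 0), (2, 3)                   # stand-ins for Door A and Door B
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
GAMMA = 0.9                                    # discount rate used in the example

Q = {((r, c), a): 0.0 for r in range(ROWS) for c in range(COLS) for a in ACTIONS}

def step(state, action):
    """Move one square if possible; only the move into the goal earns the 100-unit reward."""
    r, c = state
    dr, dc = ACTIONS[action]
    next_state = (max(0, min(ROWS - 1, r + dr)), max(0, min(COLS - 1, c + dc)))
    reward = 100.0 if next_state == GOAL else 0.0
    return next_state, reward

for episode in range(500):                     # many episodes are needed in practice
    state = START
    while state != GOAL:
        action = random.choice(list(ACTIONS))  # pure random exploration
        next_state, reward = step(state, action)
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        # Deterministic backup: value = reward + discounted value of the best next move
        Q[(state, action)] = reward + GAMMA * best_next
        state = next_state

# After learning, a greedy policy simply takes the highest-valued move from any square.
print({a: round(Q[(START, a)], 1) for a in ACTIONS})
```

With a discount rate of 0.9, the learned values fall off by a factor of 0.9 for each additional move needed to reach Door B, reproducing the 100, 90, 81, ... pattern described above.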
More Precise Definition of Q-learning

Building on the example described in the preceding section, let us now formulate a more precise, although still basic, definition of Q-learning. Consider the system shown in Fig. 2, which shows the key elements of Q-learning.

Fig. 2. Key elements of Q-learning

1. The agent is the entity responsible for interpreting sensory inputs from the environment, choosing actions on the basis of the fused inputs, and learning on the basis of the effects of its actions on the environment. At time t, the Q-learning agent receives from the environment a signal describing its current state s. The state is a group of key variables that together describe those current characteristics of the environment that are relevant to the problem. Theoretically, the state information must exhibit the Markov property, in that this information, together with a description of the action being taken, is all that is needed to predict the effect on the environment. The agent does not need to know the history of its previous states or actions. In practice, it is assumed that the process is Markovian, although this may not be strictly true.
2. Based on its perception of the state s, the agent selects an action a from the set of possible actions. This decision depends on the relative value of the various possible actions, or more precisely on the estimated Q-values Q_{s,a}, which reflect the value to the agent of undertaking action a while in state s, resulting in a transition to state s', and following a currently optimal policy (sequence of actions) thereafter. At the outset, the agent does not have any values for the Q-estimates and must learn these by randomly exploring alternative actions from each state. A gradual shift is effected from exploration to exploitation of those state/action combinations found to perform well.
3. As a result of taking action a in state s, the agent receives a reinforcement or reward r_{s,a}, which depends upon the effect of this action on the agent's environment. There may be a delay between the time of the action and the receipt of the reward. The objective of the agent in seeking the optimum policy is to maximize the accumulated reward (or minimize the accumulated penalty) over time. A discount rate may be used to bound the reward, particularly in the case of continuous episodes. The discount rate reflects the higher value of short-term future rewards relative to those in the longer term.
4. The estimate of the Q-value for the state-action pair just experienced is then updated, using the reward received and the discounted value of the best action available from the resulting state, weighted by a training rate (the standard form of this update is sketched after this list). This particular training rule is relevant to stochastic environments such as the traffic environment in the case study outlined in the next section. Decreasing the training rate over time is one of the conditions necessary for convergence of the Q-function in a stochastic environment. The other condition requires that each state-action combination be visited infinitely often, although in most practical problems, the portion of the state-space that is of primary interest will be visited often but not infinitely often. If penalties are received rather than rewards, the MIN function is used in place of MAX.
5. The updated estimate of the Q-value is then stored for later reuse. The Q-values may be stored in an unaltered form in a look-up table, although this requires a significant amount of memory. They may also be used as inputs to a function approximation process designed to generalize the Q-function so that Q-value estimates may be obtained for state/action combinations not yet visited but similar to combinations that have been visited.
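The training rule referred to in item 4 is not reproduced in this excerpt. For reference, the standard one-step Q-learning update (Watkins 1989), which matches the description above, can be written as follows; the visit-count schedule shown for the training rate is one common choice consistent with the decreasing rate described later in the case study, not necessarily the authors' exact schedule.

```latex
% Standard one-step Q-learning training rule (sketch); \alpha_{s,a} is the
% training rate, \gamma the discount rate, and r_{s,a} the reward received.
% When penalties are used instead of rewards, \min replaces \max.
Q_{s,a} \leftarrow Q_{s,a}
  + \alpha_{s,a}\left[ r_{s,a} + \gamma \max_{a'} Q_{s',a'} - Q_{s,a} \right],
\qquad
\alpha_{s,a} = \frac{1}{1 + n_{s,a}}
% n_{s,a} is the number of visits to the state-action pair, giving a training
% rate that decreases over time as convergence requires.
```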
Adaptive Traffic Signal Control—Case Study Using Q-learning

Background

Until relatively recently, capital improvements, such as building new roads or adding traffic lanes, and a variety of operational improvements have been the primary tools used to address increasing congestion due to growth in road traffic volumes. However, increasingly tight constraints on financial resources and physical space, as well as environmental considerations, have required consideration of a wider range of options. Enhancing the intelligence of traffic signal control systems is an approach that has shown potential to improve the efficiency of traffic flow. Off-line signal coordination methods, such as the maximization of through-bandwidths using time-space diagrams and optimization with the TRANSYT family of programs, are gradually giving way in larger cities to real-time methods such as the split, cycle, offset optimization technique (SCOOT) (Hunt et al. 1981; Bretherton 1996; Bretherton et al. 1998). Research is continuing into traffic signal control systems that adapt to changing traffic conditions (Gartner and Al-Malik 1996; Yagar and Dion 1996; Spall and Chin 1997; Sadek et al. 1998).

Severe traffic congestion, both recurring and nonrecurring, presents a difficult challenge to existing control methodologies, particularly in the case of two-dimensional road networks. Such congestion is often experienced in conjunction with busy urban cores and, on a more localized basis, in association with major sports and entertainment events, major accidents or other incidents, and road construction and maintenance. There is an appar-
approach does not appear to be directly applicable to real-time traffic signal control. Bingham (1998) applied reinforcement learning in the context of a neuro-fuzzy approach to traffic signal control, but met with limited success due to the insensitivity of the approach, limited exploration in what is a stochastic environment, and an off-line approach to value updating.

Advantages of Q-learning for Traffic Signal Control

In comparison to other state-of-the-art techniques used for traffic signal control, and many other dynamic programming and machine learning approaches, Q-learning offers some potentially significant advantages, as discussed next.

Q-learning does not require a prespecified model of the environment on which to base action selection. Instead, relationships between states, actions, and rewards are learned through dynamic interaction with the environment. By way of contrast, existing traffic signal control methods usually require prespecified models of traffic flow to generate short-term predictions of traffic conditions or to assess the impacts of possible control decisions. If a single, general model is used, it is possible and even likely that conditions around individual intersections will vary from the conditions upon which the model was based.

Another benefit of Q-learning, and reinforcement learning in general, is that supervision of the learning process is not required. Supervised machine-learning algorithms require, for training purposes, a large number of examples, consisting of sets of inputs and associated outcomes, which adequately cover the range of environmental conditions expected on deployment. They involve supervision in the sense that the appropriate outcome is provided for each combination of inputs so that any inherent relationships can be learned. The machine learning methods, such as artificial neural networks, that have been the most widely studied and applied to transportation systems to date, typically involve supervised learning. An example is the work on incident detection by Abdulhai and Ritchie (1999a, b). In the case of Q-learning, which is unsupervised, the outcome associated with taking a particular action in any state encountered is learned through dynamic trial-and-error exploration of alternative actions and observation of the relative outcomes. Rather than being presented with a large set of training examples, the generation of which is a challenging task in many cases, even for a domain expert, a Q-learning agent essentially generates its own training experiences from its environment. The learning process can be initiated on a simulator, with refinement and optimization for the intended environment occurring after deployment.

It is important to note that not all real-time algorithms are truly adaptive. The two terms are often used interchangeably and possibly confused. Real-time algorithms are those able to respond to sensory inputs in real time, although the internal logic and param-

ing useful experience even while exploring actions that may later turn out to be nonoptimal.

Key Elements of Case-study Implementation

The initial test application of Q-learning to the problem of traffic signal control involved a single, isolated intersection. This simple example was used to gain experience with this method in the stochastic traffic environment and to establish useful ranges for the various parameters involved. Application of Q-learning in a multiagent context to a linear system of traffic signals is now under way, and this will be followed by extension to a two-dimensional road network and signal system. The isolated signal and linear system implementations involve two-phase operation without turning flows. The network implementation will consider turning movements and more flexible phasing arrangements. The following discussion outlines the essential elements of the isolated signal case study and identifies modifications being tested in the linear, multiagent application as a result of insights gained through the initial application.

Description of Test-beds

The isolated traffic signal test-bed consisted of a simulated two-phase signal controlling the intersection of two two-lane roads. Vehicle arrivals were generated using individual Poisson processes with predefined average arrival rates on each of the four approaches. The average rates could be varied over time to represent different peak-period traffic profiles over the 2-h simulated episodes. In practice, the agent would operate continuously.

In the case of the linear signal system, autonomous Q-learning agents, each controlling a single intersection with two-phase control similar to that used in the isolated signal case, comprise the test-bed. Individual Poisson processes are used to generate vehicle arrivals on each approach to the system. Traffic movement within the system is simulated at a microscopic level. Each road link is divided into blocks; vehicles advance one block per time step, provided the downstream block is not occupied. In cases where there is insufficient space on the downstream link to exit an intersection, vehicles may enter the intersection probabilistically and be trapped there until there is an opportunity to move ahead. This may block following and crossing flows, and allows heavily congested traffic conditions to be simulated more realistically.
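As a rough illustration of the test-bed mechanics just described, the sketch below draws Poisson arrivals on four approaches and advances vehicles block by block. The arrival rates, number of blocks per link, and the fixed-time placeholder signal are illustrative assumptions, not the parameters of the authors' simulator.

```python
import math
import random

# Minimal sketch (assumptions throughout, not the authors' simulator): Poisson
# vehicle arrivals on each approach and block-by-block vehicle movement, as
# described for the test-beds.
APPROACHES = ["NB", "SB", "EB", "WB"]
ARRIVAL_RATE = {a: 0.15 for a in APPROACHES}   # expected vehicles per 1-s step
BLOCKS_PER_LINK = 20
EPISODE_STEPS = 2 * 3600                       # a 2-h simulated episode, 1-s steps

# Each link is a list of blocks; True means the block is occupied by a vehicle.
links = {a: [False] * BLOCKS_PER_LINK for a in APPROACHES}

def arrival(rate):
    """True if at least one Poisson arrival occurs during this 1-s step."""
    return random.random() < 1.0 - math.exp(-rate)

def advance(link, exit_open):
    """Vehicles move one block per step, provided the downstream block is free."""
    if exit_open and link[-1]:
        link[-1] = False                       # the front vehicle clears the stop line
    for i in range(len(link) - 1, 0, -1):
        if not link[i] and link[i - 1]:
            link[i], link[i - 1] = True, False

for t in range(EPISODE_STEPS):
    green_ns = (t // 30) % 2 == 0              # placeholder fixed-time signal, 30-s phases
    for a in APPROACHES:
        if arrival(ARRIVAL_RATE[a]) and not links[a][0]:
            links[a][0] = True                 # a vehicle enters the upstream block
        advance(links[a], exit_open=green_ns if a in ("NB", "SB") else not green_ns)
```

The queue lengths used in the state and reward definitions below could then be read off as the number of occupied blocks waiting upstream of the stop line.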
State, Action, and Reward Definitions

In the case of the isolated intersection, the state information available to the agent included the queue lengths on the four approaches and the elapsed phase time. The multiagent situation permits additional state information, since communication between agents extends the effective field of view of individual agents. In addition to local queue lengths, various combinations
lected an action—either remain with the current signal indication or change it. Considering the potential need to transmit, receive, and process communicated information, 1-s intervals between action-selection decisions were not considered to be sufficiently flexible in the case of the multiagent system. In this case, action selection consists of a decision, made at the time of the previous phase change, as to when to make the next phase change. To provide additional flexibility, cycle lengths are not fixed in this case, but minimum and maximum limits are placed on phase lengths, as before, to ensure practicality. In the case where projected phase-change times at adjacent signals are included as state elements, the agents are provided with an opportunity to respond to this information. When any individual agent makes a phase change and decides on the time of the subsequent change, the other agents, in order of adjacency, are provided with an opportunity to review and adjust their currently projected change times if the increase in benefit would exceed a minimum threshold. Where these review points are insufficiently close in time, intermediate reviews can be scheduled to allow any significant changes in state to be considered as they occur. This review process is repeated, as required and as time permits, in an attempt to reach equilibrium. In both the single-agent and multiagent settings, it is possible to constrain phase lengths so that they do not vary by more than a prespecified time from the previous phase length. This may be seen as desirable to limit variability in successive cycles, although some degradation of performance is likely.
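The decision logic just described might be sketched as follows. The candidate phase lengths, limits, and benefit threshold are assumptions, and q_estimate stands in for whatever Q-value store an individual agent maintains; the sketch is written as if larger estimates are better, so with the delay penalties used in the case study the comparisons would be reversed.

```python
# Minimal sketch (assumptions throughout) of the multiagent action selection
# described above: at each phase change the agent picks the duration of the
# next phase within practical limits, and a neighbouring agent later revises
# its own projected change time only if the estimated benefit is large enough.
PHASE_MIN, PHASE_MAX = 10, 60                  # illustrative limits, in seconds
REVIEW_THRESHOLD = 5.0                         # minimum benefit required to adjust

def choose_next_phase_length(q_estimate, state):
    """Pick the candidate phase length with the best Q-estimate for this state."""
    candidates = range(PHASE_MIN, PHASE_MAX + 1, 5)
    return max(candidates, key=lambda d: q_estimate(state, d))

def review_projected_change(q_estimate, state, current_plan):
    """Adjust a projected change time only if the estimated benefit of the
    best alternative exceeds the review threshold."""
    best_plan = choose_next_phase_length(q_estimate, state)
    benefit = q_estimate(state, best_plan) - q_estimate(state, current_plan)
    return best_plan if benefit > REVIEW_THRESHOLD else current_plan
```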
The definition of reward (actually a penalty in this case) is relatively straightforward in the single-agent case, being the total delay incurred between successive decision points by vehicles in the queues of the four approaches. The delay in each 1-s step, being directly proportional to the queue length, was modified using a power function to encourage approximate balancing of queue lengths. Otherwise, the agent was found to be indifferent between situations involving very long and very short queues and situations involving equal-length queues, both with the same average queue length and therefore delay. In the multiagent case, a key issue is the extent to which global rewards (or penalties) are necessary to promote cooperation among the agents. It is hypothesized that interacting agents must respond to a reward structure that incorporates not only local rewards, as in the single-agent case, but also global rewards to avoid agents acting solely on the basis of self-interest and compromising overall effectiveness and efficiency. In addition to the local reward used by the isolated agent, alternative global reward formulations are being investigated for the multiagent case, including delay and the incidence of intersection blockage along the main streets and across the network. Weighting of the global rewards relative to local rewards is also being evaluated.
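A minimal sketch of the local penalty just described follows; the exponent of the power function is an assumption, since the excerpt does not state the value used.

```python
# Minimal sketch of the single-agent penalty described above: total delay over
# the interval between decision points, with each approach's queue length
# raised to a power so that unbalanced queues are penalized more heavily.
# The exponent is an illustrative assumption.
def step_penalty(queue_lengths, exponent=1.5):
    """Penalty for one 1-s step: delay is proportional to queue length, and the
    power function discourages letting any one queue grow very long."""
    return sum(q ** exponent for q in queue_lengths)

def interval_penalty(queue_history, exponent=1.5):
    """Accumulate the per-step penalty between successive decision points."""
    return sum(step_penalty(queues, exponent) for queues in queue_history)

# Four balanced queues of 5 vehicles incur a smaller penalty than a single
# queue of 20 with the same total, which encourages approximate queue balancing.
print(interval_penalty([[5, 5, 5, 5]]), interval_penalty([[0, 0, 0, 20]]))
```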
It is possible to define rewards or penalties related to other objectives or priorities. For example, the throughput of the inter-

test several exploration policies. An ε-greedy policy was tested, where the best action is exploited with probability ε and an exploratory action is chosen randomly with probability 1−ε. A range of values for ε was evaluated, and a value of 0.9 was found to yield good results. A softmax exploration policy was also tested, where the probability of choosing an action was proportional to the Q-estimate or value for that action given the current state. Good results were achieved where the probability of choosing the best action was annealed, starting with random exploration and increasing to the point where the best action was chosen with a probability of 0.9, provided that state had been visited at least 35–50 times. Both techniques required that the definition of random exploration be modified to avoid exploratory change actions being implemented consistently within the first few seconds of the phase. With the change in the action-space for the multiagent test-bed, the standard softmax procedure is being used, although further testing is necessary to ensure that the shift from exploration to exploitation is sufficiently gradual so as not to inhibit convergence.
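The two exploration policies described above might look like the sketch below. The annealing schedule and temperature values are assumptions rather than the ones used in the case study, and the functions are written in terms of rewards, so with the delay penalties actually used the max would become a min.

```python
import math
import random

# Minimal sketch of the exploration policies described above (assumptions, not
# the authors' implementation). q_values maps each available action to its
# current Q-estimate for the present state.
def epsilon_greedy(q_values, epsilon=0.9):
    """Exploit the best-valued action with probability epsilon (0.9 worked well
    in the isolated-signal tests); otherwise pick an action at random."""
    if random.random() < epsilon:
        return max(q_values, key=q_values.get)
    return random.choice(list(q_values))

def annealed_softmax(q_values, visits, anneal_after=40):
    """Softmax selection whose temperature falls as the state is visited more
    often, shifting gradually from exploration toward exploitation."""
    temperature = max(0.1, 10.0 / (1.0 + visits / anneal_after))
    top = max(q_values.values())               # subtract the best value for numerical safety
    weights = {a: math.exp((q - top) / temperature) for a, q in q_values.items()}
    total = sum(weights.values())
    r = random.uniform(0.0, total)
    cumulative = 0.0
    for action, weight in weights.items():
        cumulative += weight
        if r <= cumulative:
            return action
    return action                               # guard against floating-point round-off
```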
Function Approximation and Generalization

In both the single-agent and multiagent test-beds, the Cerebellar Model Articulation Controller (CMAC), as pioneered by Albus (1975a, b), is used for storage and generalization of the Q-estimates. The CMAC is conceptually similar to an artificial neural network, although the implementation used in this case, as described by Smith (1998), operates more like a sophisticated look-up table. The CMAC fulfills a function approximation and generalization role by allowing Q-estimates for any given state-action pair to influence those of nearby state-action pairs. This effectively smooths the decision hypersurface and enables Q-estimates to be derived for state-action pairs not yet visited, but similar to pairs that have been visited. The actual storage of Q-estimates was accomplished using hash tables. This minimizes memory requirements, since the high-dimensional arrays required, one dimension for each element of the state-space, are typically sparsely populated.
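A CMAC of the kind described can be sketched as tile coding over a hashed, sparse table, as below. This is an illustrative reconstruction rather than the implementation described by Smith (1998); the default layer count and resolution simply echo the values reported for the case study in the next paragraph.

```python
from collections import defaultdict

# Illustrative sketch of CMAC-style storage of Q-estimates (an assumed
# reconstruction, not the implementation described by Smith 1998): several
# offset tilings each map a state to one cell, and the Q-estimate is the sum
# of the weights of those cells. A dict provides hashed, sparse storage.
class TiledQ:
    def __init__(self, n_layers=11, resolution=2.0):
        # 11 association layers, with a resolution that maps two adjacent
        # queue lengths to the same cell, echo the case-study settings.
        self.n_layers = n_layers
        self.resolution = resolution
        self.weights = defaultdict(float)      # only updated cells consume memory

    def _cells(self, state):
        """Yield one cell index per layer, each layer offset by a fraction of a cell."""
        for layer in range(self.n_layers):
            offset = layer * self.resolution / self.n_layers
            yield (layer,) + tuple(int((s + offset) // self.resolution) for s in state)

    def estimate(self, state):
        return sum(self.weights.get(c, 0.0) for c in self._cells(state))

    def update(self, state, target, alpha=0.1):
        """Nudge the estimate toward the target, spreading the correction across
        layers so that neighbouring states are generalized to as well."""
        error = target - self.estimate(state)
        for c in self._cells(state):
            self.weights[c] += alpha * error / self.n_layers
```

As noted in the next paragraph, the isolated-signal case study kept two such structures, one for each of the change and don't-change actions.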
In the case of the isolated intersection, two CMACs were used—one for the change action and one for the don't-change action. Testing showed that 11 association units or layers in the CMAC, in combination with a resolution of 50% (mapping two adjacent queue lengths—for example, 23 vehicles and 24 vehicles—into the same Q-estimate), yielded good results without requiring excessive memory for the storage of the Q-estimates. Despite the fact that the CMAC implies nonlinear function approximation, possibly problematic in the case of Q-learning in a stochastic environment, lack of convergence did not appear to be an issue. Various values for the training rate α_{s,a} were tested, and the best results were achieved when α_{s,a} was gradually decreased in inverse proportion to the number of visits to that particular
in the initial stages of deployment, but this exploration should be incremental and should not produce control decisions that appear to drivers to be obviously inappropriate to the situation. A "continuing education" strategy should also be developed that will enable the agent to assess its performance on an ongoing basis in light of possibly changing conditions and determine when and how much additional on-line training may be required.

A sensory subsystem is needed to provide the required inputs to the agents. This may involve adaptation of existing induction loop technology, although a video imaging system would likely be more effective. A contingency plan would also be required to deal with a potential loss of communications. Evaluation of simulated agent performance using only local state inputs, but in a multiagent context, is planned to provide insights into a possible strategy for this scenario.

will involve extension to control of a two-dimensional system of traffic signals and the integration of traffic signal control based on Q-learning with dynamic route guidance. Comparison of the Q-learning approach to traffic signal control with existing state-of-the-art methods such as SCOOT will also be pursued.

Acknowledgments

The second writer wishes to acknowledge the financial assistance provided by the Natural Sciences and Engineering Research Council of Canada and the University of Toronto.

References