
Reinforcement Learning for True Adaptive Traffic Signal Control

Baher Abdulhai,1 Rob Pringle,2 and Grigoris J. Karakoulas3

1Assistant Professor and Director, Intelligent Transportation Systems Centre, Dept. of Civil Engineering, Univ. of Toronto, Toronto, ON, Canada M5S 1A4. E-mail: [email protected]
2PhD Candidate, Intelligent Transportation Systems Centre, Dept. of Civil Engineering, Univ. of Toronto, Toronto, ON, Canada M5S 1A4. E-mail: [email protected]
3Dept. of Computer Science, Univ. of Toronto, Pratt Building LP283E, 6 King's College, Toronto, ON, Canada M5S 1A4. E-mail: [email protected]

Note. Discussion open until October 1, 2003. Separate discussions must be submitted for individual papers. To extend the closing date by one month, a written request must be filed with the ASCE Managing Editor. The manuscript for this paper was submitted for review and possible publication on October 30, 2001; approved on May 21, 2002. This paper is part of the Journal of Transportation Engineering, Vol. 129, No. 3, May 1, 2003. ©ASCE, ISSN 0733-947X/2003/3-278–285/$18.00.

Abstract: The ability to exert real-time, adaptive control of transportation processes is the core of many intelligent transportation systems decision support tools. Reinforcement learning, an artificial intelligence approach undergoing development in the machine-learning community, offers key advantages in this regard. The ability of a control agent to learn relationships between control actions and their effect on the environment while pursuing a goal is a distinct improvement over prespecified models of the environment. Prespecified models are a prerequisite of conventional control methods and their accuracy limits the performance of control agents. This paper contains an introduction to Q-learning, a simple yet powerful reinforcement learning algorithm, and presents a case study involving application to traffic signal control. Encouraging results of the application to an isolated traffic signal, particularly under variable traffic conditions, are presented. A broader research effort is outlined, including extension to linear and networked signal systems and integration with dynamic route guidance. The research objective involves optimal control of heavily congested traffic across a two-dimensional road network—a challenging task for conventional traffic signal control methodologies.

DOI: 10.1061/(ASCE)0733-947X(2003)129:3(278)

CE Database subject headings: Traffic signal controllers; Intelligent transportation systems; Traffic control; Traffic management; Adaptive systems.

Introduction

The ability to exert real-time, adaptive control over a transportation process is potentially useful for a variety of intelligent transportation systems services, including control of a system of traffic signals, control of the dispatching of paratransit vehicles, and control of the changeable message displays or other cues in a dynamic route guidance system, to name a few. In each case, the controlling actions should respond to actual environmental conditions—vehicular demand in the case of a signal system, the demand for multiple paratransit trip origins and destinations, or the road network topology and traffic conditions in the case of dynamic route guidance. Even more valuable is the ability to control in accordance with an optimal strategy defined in terms of one or more performance objectives. For example, one might wish to have a signal control strategy that minimizes delay, a paratransit dispatching system that minimizes wait time and vehicle kilometers traveled, or a dynamic route guidance system that minimizes travel time.

A key limitation of conventional control systems is a requirement for one or more prespecified models of the environment. The purpose of these might be to convert sensory inputs into a useful picture of current or impending conditions or provide an assessment of the probable impacts of alternative control actions in a given situation. Such models require domain expertise to construct. Furthermore, they must often be sufficiently general to cover a variety of conditions, as it is usually impractical to provide separate models to address each potential situation. For example, some state-of-the-art traffic signal control systems rely on a platoon-dispersion model to predict the arrival pattern of vehicles at a downstream signal based on departures from an upstream signal. A generalized model designed to represent all road links cannot possibly reflect the impacts of the different combinations of side streets and driveways generating and absorbing traffic between the upstream and downstream signals.

What if a controlling agent could directly learn the various relationships inherent in its world from its experience with different situations in that world? Not only would the need for model prespecification be obviated or at least minimized, but such an agent could effectively tailor its control actions to specific situations based on its past experience with the same or similar situations. The machine-learning research community, related to the artificial intelligence community, provides us with a variety of methods that might be adapted to transportation control problems. One of these, particularly useful due to its conceptual simplicity, yet impressive in its potential, is reinforcement learning [see Sutton and Barto (1998) or Kaelbling et al. (1996) for comprehensive overviews, or Bertsekas and Tsitsiklis (1996) for a more rigorous treatment].

This paper provides a brief introduction to the concept of reinforcement learning. As a case study, reinforcement learning is applied to the case of an isolated traffic signal, with encouraging results. This is the first stage in a research program to develop a signal system control methodology, based on reinforcement learning, which could be integrated with dynamic route guidance to provide effective traffic control in highly congested conditions. Effective traffic control in the face of severe congestion on a two-dimensional road network is a challenging task for existing signal control methodologies.


Fig. 1. Illustration of reinforcement learning: (a) Gridworld; (b) first episode; (c) second episode; (d) selected final Q-estimates; (e) one possible optimal policy

Reinforcement Learning: Brief Primer

Illustrative Example

In its simplest terms, reinforcement learning involves an agent that wishes to learn how to achieve a goal. It does so by interacting dynamically with its environment, trying different actions in different situations in order to determine the best action or sequence of actions to achieve its goal from any possible given situation. Feedback signals provided by the environment allow the agent to determine to what extent an action actually contributed to the achievement of the desired goal.

To illustrate the concept of reinforcement learning, consider the following simplified example of a mobile robot navigating within the gridworld shown in Fig. 1(a). This is actually an illustration of Q-learning, developed by Watkins (1989; Watkins and Dayan 1992), which is one of a number of possible reinforcement learning algorithms and the one used in the case study presented later in this paper. Imagine that the robot starts behind Door A and that its goal is to pass through Door B, for which it gains a reward of 100 units. No other actions are rewarded. Once it passes through Door B, it remains there (perhaps waiting for a further task) and gains no further rewards. At each time step, the robot can move to an adjacent grid square but cannot move diagonally. Let us also define a discount rate that has the effect of reducing the value of future rewards relative to more immediate rewards. In this case, the discount rate also has the effect of encouraging the robot to learn the shortest possible path to Door B. For this example, assume that the discount rate is 0.9.

Initially, all potential moves from any given grid square, except that involving passing through Door B from the square in front of it, have a value of zero, in the sense that no reward appears to be gained by implementing them. On its first journey, therefore, the robot explores in a random fashion, possibly following the path shown in Fig. 1(b), until it eventually passes through Door B. In doing so, it gains a reward of 100 units and remains there, ending the current episode. The value assigned to the move preceding the move through Door B is updated using the reward of 100 units, factored by the discount rate of 0.9, since the reward was gained one time step into the future, to give a net value of 90 units. On its second journey from Door A, the robot explores until it reaches a square adjacent to that in front of Door B. As before, the preceding move is assigned a value of 90 units, factored by the discount rate of 0.9, to give a net value of 81 units, as shown in Fig. 1(c). Each journey or episode thereafter may result in another move being assigned a value. At some point, the robot might find itself confronted with a choice between making a move with zero value and making one with some previously assigned positive value. In this situation, the robot must choose whether or not to explore the move with a current value of zero, on the chance that it might be better than exploiting its current knowledge by making the move that it knows has a positive value.

After a sufficient number of episodes, each move from any given square will have been assigned a value. In most practical problems, particularly in stochastic domains, many episodes are required before these values achieve useful convergence. Fig. 1(d) shows a selection of these values, each of which represents the sum of discounted future rewards if one follows an optimal path from that particular grid square to Door B. The robot has therefore learned an estimate of the value function Q. At this point, the robot can implement an optimal sequence of actions, or policy, by greedily taking the action with the highest value, regardless of where it starts from or finds itself, until it reaches Door B. Fig. 1(e) shows one of several possible optimal policies.
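To make the gridworld walk-through concrete, the following Python sketch reproduces the value propagation described above. The grid dimensions, start square, and episode count are illustrative assumptions rather than details from the paper, and the deterministic backup (equivalent to a training rate of 1) is a simplification of the general training rule presented in the next section.

```python
import random

# Illustrative 4x4 gridworld: the robot is rewarded 100 units only for the
# move that passes through Door B (modeled here as any action taken from the
# square in front of Door B). Grid size and start square are assumptions.
ROWS, COLS = 4, 4
GOAL = (0, 3)                                   # square in front of Door B
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right
GAMMA = 0.9                                     # discount rate from the example

Q = {((r, c), a): 0.0 for r in range(ROWS) for c in range(COLS) for a in ACTIONS}

def step(state, action):
    """Move to an adjacent square; passing through Door B ends the episode."""
    if state == GOAL:
        return state, 100.0, True               # reward for passing through Door B
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < ROWS and 0 <= c < COLS):
        r, c = state                            # bump into a wall: stay put
    return (r, c), 0.0, False

for episode in range(500):
    state = (ROWS - 1, 0)                       # behind Door A
    done = False
    while not done:
        action = random.choice(ACTIONS)         # pure exploration, as in the first journeys
        next_state, reward, done = step(state, action)
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
        # Deterministic backup: the value of a move is its reward plus the
        # discounted value of the best move from the next square.
        Q[(state, action)] = reward + GAMMA * best_next
        state = next_state

print(max(Q[(GOAL, a)] for a in ACTIONS))       # 100.0: the move through Door B
print(max(Q[((0, 2), a)] for a in ACTIONS))     # 90.0: one square away, as in Fig. 1(b)
```

Running the sketch reproduces the 100, 90, 81, ... pattern of discounted values radiating back from Door B, which is exactly the Q-estimate surface sketched in Fig. 1(d).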


More Precise Definition of Q-learning

Building on the example described in the preceding section, let us now formulate a more precise, although still basic, definition of Q-learning. Consider the system shown in Fig. 2, which shows the key elements of Q-learning.

Fig. 2. Key elements of Q-learning

1. The agent is the entity responsible for interpreting sensory inputs from the environment, choosing actions on the basis of the fused inputs, and learning on the basis of the effects of its actions on the environment. At time t, the Q-learning agent receives from the environment a signal describing its current state s. The state is a group of key variables that together describe those current characteristics of the environment that are relevant to the problem. Theoretically, the state information must exhibit the Markov property, in that this information, together with a description of the action being taken, is all that is needed to predict the effect on the environment. The agent does not need to know the history of its previous states or actions. In practice, it is assumed that the process is Markovian, although this may not be strictly true.
2. Based on its perception of the state s, the agent selects an action a from the set of possible actions. This decision depends on the relative value of the various possible actions, or more precisely on the estimated Q-values Q_{s,a}, which reflect the value to the agent of undertaking action a while in state s, resulting in a transition to state s', and following a currently optimal policy (sequence of actions) thereafter. At the outset, the agent does not have any values for the Q-estimates and must learn these by randomly exploring alternative actions from each state. A gradual shift is effected from exploration to exploitation of those state/action combinations found to perform well.
3. As a result of taking action a in state s, the agent receives a reinforcement or reward r_{s,a}, which depends upon the effect of this action on the agent's environment. There may be a delay between the time of the action and the receipt of the reward. The objective of the agent in seeking the optimum policy is to maximize the accumulated reward (or minimize the accumulated penalty) over time. A discount rate may be used to bound the reward, particularly in the case of continuous episodes. The discount rate reflects the higher value of short-term future rewards relative to those in the longer term.
4. The combination of state s, action a, and reward r_{s,a} is then used to update the previous estimate of the Q-value Q_{t-1}(s,a) recursively according to the following training rule:

\[
\delta = \alpha_{s,a}\left\{ r_{s,a} + \gamma_t \cdot \max_{a'}\bigl[Q_{t-1}(s',a')\bigr] - Q_{t-1}(s,a) \right\} \tag{1}
\]

where δ = increment to be added to the previously estimated Q-value, Q_{t-1}(s,a), to get Q_t(s,a); α_{s,a} = training rate in the interval [0,1]; r_{s,a} = reward received for taking action a while in state s; γ_t = discount rate in the interval [0,1], applied to future rewards; max_{a'}[Q_{t-1}(s',a')] = previously estimated Q-value following the optimum policy starting in state s'; and Q_{t-1}(s,a) = previous estimate of the Q-value of taking action a while in state s.
This particular training rule is relevant to stochastic environments such as the traffic environment in the case study outlined in the next section. Decreasing the training rate over time is one of the conditions necessary for convergence of the Q-function in a stochastic environment. The other condition requires that each state-action combination be visited infinitely often, although in most practical problems, the portion of the state-space that is of primary interest will be visited often but not infinitely often. If penalties are received rather than rewards, the MIN function is used in place of MAX.
5. The updated estimate of the Q-value is then stored for later reuse. The Q-values may be stored in an unaltered form in a look-up table, although this requires a significant amount of memory. They may also be used as inputs to a function approximation process designed to generalize the Q-function so that Q-value estimates may be obtained for state/action combinations not yet visited but similar to combinations that have been visited.
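As a minimal illustration of the training rule in Eq. (1), the following Python function applies one update to a table of Q-estimates. The dictionary-based storage and argument names are assumptions chosen for clarity; the case study described below stores its Q-estimates in CMACs rather than a plain table.

```python
def q_update(Q, s, a, reward, s_next, actions, alpha, gamma):
    """One-step Q-learning update following Eq. (1):

        delta = alpha_{s,a} * (r_{s,a} + gamma * max_a' Q_{t-1}(s', a') - Q_{t-1}(s, a))

    Q is a dict mapping (state, action) pairs to estimates; unseen pairs
    default to zero. With penalties instead of rewards, the max over the
    next state's actions would be replaced by a min, as noted in the text.
    """
    best_next = max(Q.get((s_next, a_next), 0.0) for a_next in actions)
    delta = alpha * (reward + gamma * best_next - Q.get((s, a), 0.0))
    Q[(s, a)] = Q.get((s, a), 0.0) + delta
    return delta
```

Decaying alpha for each (s, a) pair as it is revisited, as required for convergence in a stochastic environment, is simply a matter of passing a smaller alpha on each call for that pair.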
Adaptive Traffic Signal Control—Case Study Using Q-learning

Background

Until relatively recently, capital improvements, such as building new roads or adding traffic lanes, and a variety of operational improvements have been the primary tools used to address increasing congestion due to growth in road traffic volumes. However, increasingly tight constraints on financial resources and physical space, as well as environmental considerations, have required consideration of a wider range of options. Enhancing the intelligence of traffic signal control systems is an approach that has shown potential to improve the efficiency of traffic flow. Off-line signal coordination methods, such as the maximization of through-bandwidths using time-space diagrams and optimization with the TRANSYT family of programs, are gradually giving way in larger cities to real-time methods such as the split, cycle, offset optimization technique (SCOOT) (Hunt et al. 1981; Bretherton 1996; Bretherton et al. 1998). Research is continuing into traffic signal control systems that adapt to changing traffic conditions (Gartner and Al-Malik 1996; Yagar and Dion 1996; Spall and Chin 1997; Sadek et al. 1998).

Severe traffic congestion, both recurring and nonrecurring, presents a difficult challenge to existing control methodologies, particularly in the case of two-dimensional road networks. Such congestion is often experienced in conjunction with busy urban cores and, on a more localized basis, in association with major sports and entertainment events, major accidents or other incidents, and road construction and maintenance.



There is an apparent need to continue development of traffic signal control techniques to more effectively address these situations, and it is to this niche that the research described in the following case study is directed.

Recent research literature includes three related efforts where reinforcement learning or related dynamic programming algorithms were applied to the problem of traffic signal control. Sen and Head (1997) utilized dynamic programming to develop a phasing plan for each cycle based on short-term traffic predictions. However, their approach lacks the ability to learn from experience and requires a traffic prediction model. Thorpe (1997) used reinforcement learning to minimize the time required to discharge a fixed volume of traffic through a road network, but his approach does not appear to be directly applicable to real-time traffic signal control. Bingham (1998) applied reinforcement learning in the context of a neuro-fuzzy approach to traffic signal control, but met with limited success due to the insensitivity of the approach, limited exploration in what is a stochastic environment, and an off-line approach to value updating.

Advantages of Q-learning for Traffic Signal Control

In comparison to other state-of-the-art techniques used for traffic signal control, and many other dynamic programming and machine learning approaches, Q-learning offers some potentially significant advantages, as discussed next.

Q-learning does not require a prespecified model of the environment on which to base action selection. Instead, relationships between states, actions, and rewards are learned through dynamic interaction with the environment. By way of contrast, existing traffic signal control methods usually require prespecified models of traffic flow to generate short-term predictions of traffic conditions or to assess the impacts of possible control decisions. If a single, general model is used, it is possible and even likely that conditions around individual intersections will vary from the conditions upon which the model was based.

Another benefit of Q-learning, and reinforcement learning in general, is that supervision of the learning process is not required. Supervised machine-learning algorithms require, for training purposes, a large number of examples, consisting of sets of inputs and associated outcomes, which adequately cover the range of environmental conditions expected on deployment. They involve supervision in the sense that the appropriate outcome is provided for each combination of inputs so that any inherent relationships can be learned. The machine learning methods, such as artificial neural networks, that have been the most widely studied and applied to transportation systems to date typically involve supervised learning. An example is the work on incident detection by Abdulhai and Ritchie (1999a, b). In the case of Q-learning, which is unsupervised, the outcome associated with taking a particular action in any state encountered is learned through dynamic trial-and-error exploration of alternative actions and observation of the relative outcomes. Rather than being presented with a large set of training examples, the generation of which is a challenging task in many cases, even for a domain expert, a Q-learning agent essentially generates its own training experiences from its environment. The learning process can be initiated on a simulator, with refinement and optimization for the intended environment occurring after deployment.

It is important to note that not all real-time algorithms are truly adaptive. The two terms are often used interchangeably and possibly confused. Real-time algorithms are those able to respond to sensory inputs in real time, although the internal logic and parameters of the controller remain unchanged. On the other hand, an essential feature of adaptive algorithms is their ability to adjust their internal logic and parameters in response to major changes in the environment—changes that may make the knowledge base in a nonadaptive controller obsolete. One of the advantages of reinforcement learning is that such algorithms are truly adaptive, in the sense that they are capable of responding to not only dynamic sensory inputs from the environment, but also a dynamically changing environment, through ongoing learning and adaptation. Since the one-step Q-learning algorithm updates the Q-estimates at short intervals in conjunction with each action, it is also readily adaptable to on-line, real-time learning. Furthermore, Q-learning is an off-policy algorithm, in the sense that it is gaining useful experience even while exploring actions that may later turn out to be nonoptimal.

Key Elements of Case-study Implementation

The initial test application of Q-learning to the problem of traffic signal control involved a single, isolated intersection. This simple example was used to gain experience with this method in the stochastic traffic environment and to establish useful ranges for the various parameters involved. Application of Q-learning in a multiagent context to a linear system of traffic signals is now under way, and this will be followed by extension to a two-dimensional road network and signal system. The isolated signal and linear system implementations involve two-phase operation without turning flows. The network implementation will consider turning movements and more flexible phasing arrangements. The following discussion outlines the essential elements of the isolated signal case study and identifies modifications being tested in the linear, multiagent application as a result of insights gained through the initial application.

Description of Test-beds

The isolated traffic signal test-bed consisted of a simulated two-phase signal controlling the intersection of two two-lane roads. Vehicle arrivals were generated using individual Poisson processes with predefined average arrival rates on each of the four approaches. The average rates could be varied over time to represent different peak-period traffic profiles over the 2-h simulated episodes. In practice, the agent would operate continuously.

In the case of the linear signal system, autonomous Q-learning agents, each controlling a single intersection with two-phase control similar to that used in the isolated signal case, comprise the test-bed. Individual Poisson processes are used to generate vehicle arrivals on each approach to the system. Traffic movement within the system is simulated at a microscopic level. Each road link is divided into blocks; vehicles advance one block per time step, provided the downstream block is not occupied. In cases where there is insufficient space on the downstream link to exit an intersection, vehicles may enter the intersection probabilistically and be trapped there until there is an opportunity to move ahead. This may block following and crossing flows, and allows heavily congested traffic conditions to be simulated more realistically.
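The following sketch illustrates, under stated assumptions, the two simulation mechanics just described: Poisson vehicle generation on an approach and block-by-block vehicle advance along a link. The 1-s time step, rate units, and occupancy encoding are illustrative choices rather than specifications from the paper, and signal-controlled departures at the stop line are omitted.

```python
import math
import random

def poisson_arrivals(rate_per_hour, dt=1.0):
    """Number of vehicles arriving on one approach during a dt-second step.

    A Poisson arrival process implies the count per step is Poisson
    distributed with mean rate * dt; the rate profile over the 2-h episode
    would be supplied by the peak-period traffic profile being simulated.
    """
    lam = rate_per_hour * dt / 3600.0
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:                       # Knuth-style sampling, fine for small means
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

def advance_link(blocks):
    """Advance vehicles one block per time step if the downstream block is free.

    `blocks` is a list of 0/1 occupancies ordered from upstream to stop line.
    Processing downstream blocks first lets gaps propagate backward within a
    step, which is the block-based propagation described for the linear test-bed.
    """
    for i in range(len(blocks) - 2, -1, -1):
        if blocks[i] and not blocks[i + 1]:
            blocks[i], blocks[i + 1] = 0, 1
    return blocks
```

A full test-bed would, of course, add signal-dependent departures at the stop line and the probabilistic intersection-blocking behavior described above; this fragment only shows the arrival and link-propagation skeleton.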
State, Action, and Reward Definitions

In the case of the isolated intersection, the state information available to the agent included the queue lengths on the four approaches and the elapsed phase time. The multiagent situation permits additional state information, since communication between agents extends the effective field of view of individual agents.


In addition to local queue lengths, various combinations of upstream and downstream queue lengths and the offset of signal changes controlling upstream and downstream movements are being evaluated as state elements. Since the addition of elements to the state definition dramatically increases the size of the state-space, a balance has to be sought between the benefit of this information and its impact on problem tractability. Sensing of queue lengths would be most effectively achieved using video imaging technology in combination with artificial neural network or other pattern recognition techniques.

The isolated signal agent was operated with a fixed cycle length as context. Each second, between a point 10 s into the cycle and a point 10 s from the end of the cycle, arbitrary limits fixed to ensure minimum practical phase lengths, the agent selected an action—either remain with the current signal indication or change it. Considering the potential need to transmit, receive, and process communicated information, 1-s intervals between action-selection decisions were not considered to be sufficiently flexible in the case of the multiagent system. In this case, action selection consists of a decision, made at the time of the previous phase change, as to when to make the next phase change. To provide additional flexibility, cycle lengths are not fixed in this case, but minimum and maximum limits are placed on phase lengths, as before, to ensure practicality. In the case where projected phase-change times at adjacent signals are included as state elements, the agents are provided with an opportunity to respond to this information. At the time of a phase change, and decision on the time of the subsequent change, by any individual agent, the other agents, in order of adjacency, are provided with an opportunity to review and adjust their currently projected change times if the increase in benefit would exceed a minimum threshold. Where these review points are insufficiently close in time, intermediate reviews can be scheduled to allow any significant changes in state to be considered as they occur. This review process is repeated, as required and as time permits, in an attempt to reach equilibrium. In both the single and multiagent settings, it is possible to constrain phase lengths so that they do not vary by more than a prespecified time from the previous phase length. This may be seen as desirable to limit variability in successive cycles, although some degradation of performance is likely.

The definition of reward (actually a penalty in this case) is relatively straightforward in the single-agent case, being the total delay incurred between successive decision points by vehicles in the queues of the four approaches. The delay in each 1-s step, being directly proportional to the queue length, was modified using a power function to encourage approximate balancing of queue lengths. Otherwise, the agent was found to be indifferent between situations involving very long and very short queues and situations involving equal-length queues, both with the same average queue length and therefore delay. In the multiagent case, a key issue is the extent to which global rewards (or penalties) are necessary to promote cooperation among the agents. It is hypothesized that interacting agents must respond to a reward structure that incorporates not only local rewards, as in the single-agent case, but also global rewards, to avoid agents acting solely on the basis of self-interest and compromising overall effectiveness and efficiency. In addition to the local reward used by the isolated agent, alternative global reward formulations are being investigated for the multiagent case, including delay and the incidence of intersection blockage along the main streets and across the network. Weighting of the global rewards relative to local rewards is also being evaluated.

It is possible to define rewards or penalties related to other objectives or priorities. For example, the throughput of the intersection could be used in place of, or in combination with, delay. Delay (or throughput) on main roads could be weighted more heavily than that on lesser streets. Vehicle emissions or fuel consumption could also be incorporated, given suitable methods for their estimation.
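A minimal sketch of how the isolated-signal state, action, and penalty definitions above might be encoded is given below. The class and field names, the power-function exponent, and the 50-s cycle length (inferred from 144 cycles per 2-h episode) are illustrative assumptions rather than values reported by the authors.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SignalState:
    """State for the isolated intersection: four approach queues plus the
    elapsed phase time. Field names are illustrative."""
    queues: tuple            # (approach1, ..., approach4) queue lengths in vehicles
    elapsed_phase_time: int  # seconds since the last phase change

ACTIONS = ("stay", "change")   # the two actions considered each second

def admissible_actions(time_in_cycle, cycle_length=50):
    """Allow a phase change only between 10 s into the cycle and 10 s before
    its end, mirroring the minimum-phase-length limits described above.
    The 50-s cycle is an inference from 144 cycles over a 2-h episode."""
    if 10 <= time_in_cycle <= cycle_length - 10:
        return ACTIONS
    return ("stay",)

def penalty(queues, exponent=1.5):
    """Delay-based penalty for one 1-s step.

    Per-step delay is proportional to total queue length; raising each queue
    to a power greater than 1 penalizes unbalanced queues, encouraging the
    approximate queue balancing described in the text. The exponent value
    is an assumption; the paper does not report the power used."""
    return sum(q ** exponent for q in queues)
```

With this encoding, the agent would query its Q-estimates for ("stay", "change") at each admissible decision point and accumulate the penalty between successive decision points as the reinforcement signal (minimizing rather than maximizing, per the MIN variant of Eq. (1)).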
Exploration Policies

Convergence of a Q-learning agent on a suitable Q-function, particularly where the process being controlled is stochastic, requires adequate exploration to ensure that all areas of interest across the state space are visited sufficiently often. Limiting attention too soon to promising early results may mean that the optimum policy is not discovered. The single-agent test-bed was used to test several exploration policies. An ε-greedy policy was tested, where the best action is exploited with probability ε and an exploratory action is chosen randomly with probability 1−ε. A range of values for ε was evaluated, and a value of 0.9 was found to yield good results. A softmax exploration policy was also tested, where the probability of choosing an action was proportional to the Q-estimate or value for that action given the current state. Good results were achieved where the probability of choosing the best action was annealed, starting with random exploration and increasing to the point where the best action was chosen with a probability of 0.9, provided that state had been visited at least 35–50 times. Both techniques required that the definition of random exploration be modified to avoid exploratory change actions being implemented consistently within the first few seconds of the phase. With the change in the action-space for the multiagent test-bed, the standard softmax procedure is being used, although further testing is necessary to ensure that the shift from exploration to exploitation is sufficiently gradual so as not to inhibit convergence.
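The two exploration schemes can be sketched as follows. The ε-greedy function follows the convention used in this paper (exploit with probability ε, with ε = 0.9 reported to work well), while the softmax function uses the standard Boltzmann form with a temperature parameter; the paper's variant, with probabilities proportional to the Q-estimates themselves and annealing tied to visit counts, would replace the exponential weighting. The q_values mapping is assumed to hold reward-style values (higher is better); with the delay penalty of the case study the comparisons would flip to the minimum.

```python
import math
import random

def epsilon_greedy(q_values, epsilon=0.9):
    """Exploit the best action with probability epsilon (the paper's
    convention); otherwise choose an exploratory action at random."""
    actions = list(q_values)
    if random.random() < epsilon:
        return max(actions, key=lambda a: q_values[a])
    return random.choice(actions)

def softmax(q_values, temperature):
    """Standard softmax (Boltzmann) selection: probabilities proportional to
    exp(Q / T). Annealing the temperature downward over visits shifts the
    agent from near-random exploration toward greedy exploitation; the
    schedule is an assumption, not a value taken from the paper."""
    actions = list(q_values)
    prefs = [math.exp(q_values[a] / temperature) for a in actions]
    total = sum(prefs)
    r, cumulative = random.random() * total, 0.0
    for action, p in zip(actions, prefs):
        cumulative += p
        if r <= cumulative:
            return action
    return actions[-1]
```

The modification mentioned in the text, preventing exploratory "change" actions in the first few seconds of a phase, would simply restrict the action set passed to either function at those decision points.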
Function Approximation and Generalization

In both the single-agent and multiagent test-beds, the Cerebellar Model Articulation Controller (CMAC), as pioneered by Albus (1975a, b), is used for storage and generalization of the Q-estimates. The CMAC is conceptually similar to an artificial neural network, although the implementation used in this case, as described by Smith (1998), operates more like a sophisticated look-up table. The CMAC fulfills a function approximation and generalization role by allowing Q-estimates for any given state-action pair to influence those of nearby state-action pairs. This effectively smooths the decision hypersurface and enables Q-estimates to be derived for state-action pairs not yet visited, but similar to pairs that have been visited. The actual storage of Q-estimates was accomplished using hash tables. This minimizes memory requirements, since the high-dimensional arrays required, one dimension for each element of the state-space, are typically sparsely populated.

In the case of the isolated intersection, two CMACs were used—one for the change action and one for the don't-change action. Testing showed that 11 association units or layers in the CMAC, in combination with a resolution of 50% (mapping two adjacent queue lengths—for example, 23 vehicles and 24 vehicles—into the same Q-estimate), yielded good results without requiring excessive memory for the storage of the Q-estimates. Despite the fact that the CMAC implies nonlinear function approximation, possibly problematic in the case of Q-learning in a stochastic environment, lack of convergence did not appear to be an issue.


Various values for the training rate α_{s,a} were tested, and the best results were achieved when α_{s,a} was gradually decreased in inverse proportion to the number of visits to that particular state s and action a. The multiagent structure is designed so that each agent, and therefore each intersection, employs two CMACs—one for changes from a green to a red indication, and one for changes from red to green. Theoretically, all agents could share a single pair of CMACs, implying faster training. However, this limits effective learning of optimal policies in cases where local environments (road section lengths, road configuration, intervening side streets or major generators, etc.) are dissimilar and may result in oscillation of the Q-estimates and hindering of convergence.
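The following is a minimal tile-coding sketch of the CMAC storage just described. It is not Smith's (1998) implementation: the 11 overlapping tilings and a cell width that merges two adjacent queue lengths mirror the settings reported for the isolated signal, but the offset scheme, hashed table size, and update rule are generic assumptions.

```python
class CMAC:
    """Minimal CMAC-style tile coder for Q-estimates (a sketch only)."""

    def __init__(self, n_tilings=11, resolution=2.0, table_size=2 ** 16):
        self.n_tilings = n_tilings            # association units / layers
        self.resolution = resolution          # 2 units per cell ~ the "50% resolution"
        self.table = [0.0] * table_size       # hashed weight table (sparse in effect)
        self.size = table_size

    def _tiles(self, state):
        # Each tiling is offset by a fraction of a cell so that nearby states
        # share some, but not all, cells; this is what provides generalization.
        for t in range(self.n_tilings):
            offset = t / self.n_tilings
            coords = tuple(int(x / self.resolution + offset) for x in state)
            yield hash((t, coords)) % self.size

    def value(self, state):
        """Q-estimate for a state: sum of the weights in its active cells."""
        return sum(self.table[i] for i in self._tiles(state))

    def update(self, state, target, alpha):
        """Move the estimate toward a target, spreading the correction evenly
        across the active cells."""
        error = target - self.value(state)
        for i in self._tiles(state):
            self.table[i] += alpha * error / self.n_tilings
```

In the isolated-signal configuration, two such coders would be kept, one per action (change and don't-change), each queried with the five-element state of four queue lengths plus elapsed phase time; the multiagent design keeps a separate pair per intersection for the reasons given above.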

Multiagent Architecture and Communication Strategy


One of the key issues being explored with the multiagent test-bed is the role of communication between agents. While it is hypothesized that communication should expand the agents' perceptual horizons and their ability to cooperate toward a globally optimal policy, it is also recognized that excessive information can increase the dimensionality of the problem, increase the computational burden, and reduce robustness should the communication system malfunction. In this research, the benefits of various levels of communication are being compared to each other and to the baseline case without communication, where each agent has only local sensory inputs and acts independently.

There are several key opportunities to incorporate information communicated between agents. The first involves the state of the environment where, as noted previously, tests are being conducted on alternative state definitions that include different forms of communicated information. The communication of intended actions in the form of projected phase-change times provides an opportunity for cooperative, real-time review of proposed actions as the environment changes. Another application of communicated information is in the reward structure where, as discussed earlier, the inclusion of weighted global rewards may assist in convergence toward a policy that is optimal, considering the entire network. In the multiagent test-bed, it is likely that the actions of one agent in the system have an impact on not only adjacent agents, but others as well. If an agent allows a local queue to build up so that it extends through upstream intersections, both following and cross-street traffic may be blocked.

Preliminary Test Results

The following discussion presents selected results obtained from the isolated signal test-bed. In this case, performance was compared with that of a commonly used pretimed signal controller. Comparison with semi- or fully actuated controllers might be considered a more appropriate test of performance, but, in the heavily congested conditions that are the subject of this research, these typically default to what is essentially pretimed control. At the time of writing, testing with the multiagent test-bed had not progressed to the point where useful conclusions could be drawn. These will be reported on at a later time. In the multiagent case, comparisons will be drawn with other commonly used signal system control methodologies such as through-bandwidth maximization and TRANSYT (off-line) and SCOOT (on-line).

Tests were conducted using three different traffic profiles to evaluate the performance of the Q-learning agent under varying conditions. Fig. 3 summarizes the results of these tests. The graphs in Fig. 3 reflect the average vehicular delay across individual sets of 50 test episodes, typically conducted after each of 10, 25, 50, 100, 150, 200, 250, and 500 training episodes. Each training and testing episode was equivalent to a 2-h peak period involving 144 signal cycles. In accordance with typical practice, the pretimed signal-phasing plan used as a baseline for comparison utilized constant phase times based on the critical peak-hour flow rates by direction.

Fig. 3. Isolated traffic signal: Average delay per vehicle ratio (Q-learning/pretimed)

Where traffic flow rates were uniform across the approaches at any given point in time, or where there was a constant ratio between the main and side-street flow rates, the Q-learning agent performed generally on a par with, or slightly better than, the pretimed signal controller. With the uniform flow rates, the differences in mean delay per vehicle between pretimed operation and the results obtained by the Q-learning agent were not statistically significant beyond 50 training episodes. In the case of constant-ratio flow rates, the differences in mean delay per vehicle were statistically significant in favor of the Q-learning agent between 300 and 400 training episodes. The similarity in performance of the pretimed and Q-learning approaches does not argue against the effectiveness of the latter, since these conditions are amenable to pretimed signal control. That the Q-learning agent was able to outperform the pretimed controller at all under these conditions was due to its ability to adapt to minor random fluctuations in flow. The initially higher delays for the Q-learning agent reflect the early stages of training, before the Q-function has stabilized. Starting with zeroed initial Q-estimates, the Q-learning agent was able to achieve effective and reasonably stable performance within 200–400 training episodes. Where the traffic flows were more variable, the Q-learning agent produced delays that were only 38–44% of those obtained using pretimed signal control and outperformed the pretimed controller over virtually 100% of the test episodes.

Tests were also conducted with smoothed signal changes, where the difference between subsequent phase lengths was limited. When the maximum difference was set to 6 s, the delay understandably increased, typically by 5–10%. The ability of the agent to generalize to different cycle lengths was also evaluated. The average delays associated with doubling the cycle length were significantly higher, although this is an extreme case not usually contemplated in practice. This result, in part, motivated the use of a flexible cycle length for the multiagent test-bed.

Implementation Issues

Implementation of reinforcement-learning-based signal control systems is contemplated primarily for the multiagent, signal network situation, as it is in this case that the benefits of learning should be most apparent and useful.



Several key deployment issues need to be addressed, as discussed next.

It is envisaged that the agents would be pretrained on a simulator prior to actual deployment. Tests to date with the isolated signal test-bed have shown that pretraining requirements are not onerous. Analogous testing with the multiagent test-bed is required to determine pretraining requirements and to ascertain how close to expected operational conditions the simulated pretraining scenarios would have to be to ensure reasonable generalization to the expected range of operating conditions upon deployment. Once deployed, the agents would be programmed to continue their training to refine the Q-estimates based on actual environmental and operating conditions. Exploration would be necessary in the initial stages of deployment, but this exploration should be incremental and should not produce control decisions that appear to drivers to be obviously inappropriate to the situation. A "continuing education" strategy should also be developed that will enable the agent to assess its performance on an ongoing basis in light of possibly changing conditions and determine when and how much additional on-line training may be required.

A sensory subsystem is needed to provide the required inputs to the agents. This may involve adaptation of existing induction loop technology, although a video imaging system would likely be more effective. A contingency plan would also be required to deal with a potential loss of communications. Evaluation of simulated agent performance using only local state inputs, but in a multiagent context, is planned to provide insights into a possible strategy for this scenario.

Future Research

Following the completion of the current evaluation of reinforcement-learning-based signal control for a linear signal system, extension to a two-dimensional network will be pursued. To fully evaluate the reinforcement-learning approach, it will be necessary to compare its performance with that of state-of-the-art control methodologies, such as SCOOT.

The final stage of this ongoing research effort involves integrating the multiagent traffic control system with dynamic route guidance, also based on reinforcement learning. This is seen as a two-way interaction. Collective perceptions of the Q-learning agents concerning the distribution of congestion across the network could be used as a basis for advising drivers of less-congested routes using variable-message signs, local-area radio broadcasts, or other means. The other side of this interaction would involve the real-time adaptation of the agents across the network to the changes in traffic flows resulting from the reaction of drivers to the guidance information, thus completing the feedback loop. Again, the ability of Q-learning to provide adaptive, real-time control is seen as the key to the effective integration of dynamic route guidance with traffic signal system control.

Conclusions

Reinforcement learning appears to offer significant advantages in the application to transportation processes where real-time, adaptive control is the key to improving effectiveness and efficiency. The ability to learn through dynamic interaction with the environment is seen as a significant benefit relative to control methodologies that rely on prespecified models of these processes.

The current research effort outlined in this paper, one phase of which was presented as a case study, involves the application of reinforcement learning to the problem of traffic signal control, with particular emphasis on heavily congested conditions in a two-dimensional road network. Preliminary results from the application of Q-learning to an isolated, two-phase traffic signal are encouraging. The Q-learning agent performed on a par with pretimed signals under traffic conditions amenable to pretimed control, involving constant or constant-ratio flow rates. Under more variable traffic conditions, the Q-learning agent demonstrated marked superiority due to its ability to adapt to changing circumstances.

Research is currently under way to extend the reinforcement-learning approach to a linear signal system and will be reported on in the near future. Subsequent phases of this research effort will involve extension to control of a two-dimensional system of traffic signals and the integration of traffic signal control based on Q-learning with dynamic route guidance. Comparison of the Q-learning approach to traffic signal control with existing state-of-the-art methods such as SCOOT will also be pursued.

Acknowledgments

The second writer wishes to acknowledge the financial assistance provided by the Natural Science and Engineering Research Council of Canada and the University of Toronto.

References

Abdulhai, B., and Ritchie, S. G. (1999a). "Enhancing the universality and transferability of freeway incident detection using a Bayesian-based neural network." Transportation Research—Part C, 7, 261–280.
Abdulhai, B., and Ritchie, S. G. (1999b). "Towards adaptive incident detection algorithms." Proc., 6th World Congress on Intelligent Transport Systems.
Albus, J. S. (1975a). "Data storage in the Cerebellar Model Articulation Controller (CMAC)." J. Dyn. Syst., Meas., Control, 97, 228–233.
Albus, J. S. (1975b). "A new approach to manipulator control: The cerebellar model articulation controller (CMAC)." J. Dyn. Syst., Meas., Control, 97, 220–227.
Bertsekas, D. P., and Tsitsiklis, J. N. (1996). Neuro-dynamic programming, Athena Scientific, Belmont, Mass.
Bingham, E. (1998). "Neurofuzzy traffic signal control." Master's thesis, Dept. of Engineering Physics and Mathematics, Helsinki Univ. of Technology, Helsinki, Finland.
Bretherton, D. (1996). "Current developments in SCOOT: Version 3." Transportation Research Record 1554, Transportation Research Board, Washington, D.C., 48–52.
Bretherton, D., Wood, K., and Raha, N. (1998). "Traffic monitoring and congestion management in the SCOOT urban traffic control system." Transportation Research Record 1634, Transportation Research Board, Washington, D.C., 118–122.
Gartner, N. H., and Al-Malik, M. (1996). "Combined model for signal control and route choice in urban traffic networks." Transportation Research Record 1554, Transportation Research Board, Washington, D.C., 27–35.
Hunt, P. B., Robertson, D. I., Bretherton, D., and Winton, R. I. (1981). "SCOOT—A traffic responsive method of coordinating signals." Laboratory Rep. 1014, Transport and Road Research Laboratory.
Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). "Reinforcement learning: A survey." J. Artif. Intell. Res., 4, 237–285.
Sadek, A. W., Smith, B. L., and Demetsky, M. J. (1998). "Artificial intelligence-based architecture for real-time traffic flow management." Transportation Research Record 1651, Transportation Research Board, Washington, D.C., 53–58.
Sen, S., and Head, K. L. (1997). "Controlled optimization of phases at an intersection." Transp. Sci., 31(1), 5–17.



Smith, R. (1998). "Intelligent motion control with an artificial cerebellum." PhD thesis, Dept. of Electrical and Electronic Engineering, Univ. of Auckland, Auckland, New Zealand.
Spall, J. C., and Chin, D. C. (1997). "Traffic-responsive signal timing for system-wide traffic control." Transp. Res., Part C: Emerg. Technol., 5(3/4), 153–163.
Sutton, R. S., and Barto, A. G. (1998). Reinforcement learning—An introduction, MIT Press, Cambridge, Mass.
Thorpe, T. L. (1997). "Vehicle traffic light control using SARSA." Master's Project Rep., Computer Science Dept., Colorado State Univ., Fort Collins, Colo.
Watkins, C. J. C. H. (1989). "Learning from delayed rewards." PhD thesis, King's College, Univ. of Cambridge, Cambridge, U.K.
Watkins, C. J. C. H., and Dayan, P. (1992). "Q-learning." Mach. Learn., 8, 279–292.
Yagar, S., and Dion, F. (1996). "Distributed approach to real-time control of complex signalized networks." Transportation Research Record 1554, Transportation Research Board, Washington, D.C., 1–8.