A Modified Reinforcement Learning Algorithm For Solving Coordinated Signalized Networks
Article history: Received 11 November 2013; Received in revised form 8 March 2015; Accepted 8 March 2015; Available online 22 March 2015.

Keywords: Reinforcement learning; Coordinated signalized network; TRANSYT-7F; Signal timing optimization

Abstract

This study proposes a Reinforcement Learning (RL) based algorithm for finding optimum signal timings in Coordinated Signalized Networks (CSN) for a fixed set of link flows. For this purpose, the MOdified REinforcement Learning algorithm with TRANSYT-7F (MORELTRANS) model is proposed by combining an RL algorithm and TRANSYT-7F. The modified RL differs from other RL algorithms since it takes advantage of the best solution obtained from the previous learning episode by generating, at each learning episode, a sub-environment of the same size as the original environment. The TRANSYT-7F traffic model, in turn, is used to determine the network performance index, namely the disutility index. A numerical application is conducted on a medium-sized coordinated signalized road network. Results indicated that the MORELTRANS produced slightly better results than the GA in signal timing optimization in terms of objective function value, while it outperformed the HC. In order to show the capability of the proposed model under heavy demand conditions, two cases in which link flows are increased by 20% and 50% with respect to the base case are considered. It is found that the MORELTRANS is able to reach good solutions for signal timing optimization even as demand increases.
1. Introduction
The optimization of traffic signal timings has been at the heart of urban traffic control for many years. It is well known that traffic signal control, which encompasses delay, queuing, pollution and fuel consumption, is a multi-objective optimization problem. For a signal-controlled road network, the use of optimization techniques to determine signal timings has been discussed extensively for decades. Due to the complexity of the signal timing optimization problem, new methods and approaches are needed to improve the efficiency of traffic control in signalized road networks. Although the optimization of signal timings for an isolated junction is relatively easy, the problem requires further research for Coordinated Signalized Networks (CSN) because of the additional "offset" and "network cycle time" components.
For the CSN, TRAffic Network StudY Tool (TRANSYT) is one of the most useful tools for optimizing signal timings and also
the most widely used program of its type. It is a stage-based optimization program developed by the Transport and Road Research Laboratory (Robertson, 1969). TRANSYT consists of two main parts: a traffic model and a signal timing opti-
mizer. The traffic model utilizes a platoon dispersion model that simulates the normal dispersion of platoons as they travel
downstream. It simulates traffic in a network of signalized intersections to produce a cyclic flow profile of arrivals at each
intersection that is used to compute a Performance Index (PI) for a given signal timing and staging plan. The performance
index is defined as the sum of a weighted linear combination of estimated delay and number of stops per unit time for all
signal-controlled traffic streams and is used to measure the overall cost of traffic congestion associated with the traffic con-
trol plan.
TRANSYT version 7 was originally modified for the Federal HighWay Administration (FHWA); thus, it is called "TRANSYT-7F." The PI in TRANSYT-7F may be defined in a number of ways; one of them is the Disutility Index (DI). The DI is a measure of disadvantageous operation, that is, stops, delay, fuel consumption, etc. Optimization in
TRANSYT-7F consists of a series of trial simulation runs using the simulation engine. Each simulation run is assigned a
unique signal timing plan by the optimization processor. The optimizer applies the Hill-Climbing (HC) or Genetic
Algorithm (GA) searching strategies. The trial simulation run resulting in the best performance is reported as optimal.
Although the GA is mathematically better suited for determining global or near global optimal solution, relative to HC
optimization, it generally requires longer CPU times than the HC optimization (McTrans Center, 2008).
In modeling the signal timing optimization problem, different objectives and methods have been sought in the literature.
Wong (1995) provided approximate expressions for the derivatives of performance index with respect to the phase-based
control variables. The derivatives calculated from these expressions were compared with those obtained from numerical dif-
ferentiation. The results for both methods were found to be in good agreement but the approximate expressions required
much less computational effort. Wong (1996) also proposed an approach for area traffic control using group-based control
variables. In this case, the TRANSYT performance index is considered as a function of the group-based control variables, cycle
time, start and duration of green time. About 10% improvements in the optimal performance was gained over the stage-
based method in TRANSYT. Since this approach requires much longer computational time, parallel computing was also
investigated to reduce the computational time for optimization of signal settings by Wong (1997). Heydecker (1996) pro-
posed a decomposition approach to optimize signal timings based on group-based variables without taking the effect of
the coordination between adjacent intersections into account. Afterwards, Wong et al. (2002) developed a time-dependent
TRANSYT traffic model for the evaluation of performance index. Three scenarios have been considered and a microscopic
simulation model has been used to evaluate the performance indices for the signal plans derived from these three scenarios.
Using the proposed group-based methodology in Scenario 3, a remarkable improvement over Scenario 1 taking the average
flows for analysis has been obtained. Moreover, when compared with the signal plans from Scenario 2 based on independent
analyses, a good improvement has also been found. Girianna and Benekohal (2002) presented two different GA techniques
which are applied on signal coordination for oversaturated networks. Their paper reveals that micro GA implementation on
signal coordination problems reaches the near-optimal values of signal timing much earlier than simple GA implementation.
Similarly, Ceylan (2006) developed a GA with TRANSYT-HC optimization tool, and proposed a method for decreasing the
search space to solve the area traffic control problem. Proposed approach was found better than TRANSYT regarding optimal
values of signal timings and performance index. Chen and Xu (2006) investigated the application of Particle Swarm
Optimization (PSO) algorithm to solve signal timing optimization problem. Their results showed that PSO can be applied
to this problem under different traffic demands. Dan and Xiaohong (2008) developed an improved GA in order to find opti-
mal signal plans for signal optimization problem, which takes the coordination of signal timings for all signal-controlled
junctions into account. The results showed that the method based on GA could minimize delay and improve capacity of net-
work. Li (2011) presented an arterial signal optimization model that considers queue blockage among intersection lane
groups under oversaturated conditions. The proposed model captures traffic dynamics with the cell transmission concept,
which takes into account complex flow interactions among different lane groups. Liu and Chang (2011) further developed
an arterial signal optimization model which considers physical queue evolution on arterial links by lane-group and the
dynamic interactions of spillback queues among lane groups. The solution procedure developed with GA has been tested
with an example arterial under different demand scenarios. Results revealed that the proposed model may be considered
for use in design of arterial signals in comparison with TRANSYT-7F. He et al. (2012) presented a platoon-based mathemati-
cal formulation, which aims to provide multi-modal dynamical progression on the arterial based on the probe information,
to perform arterial traffic signal control. VISSIM software shows that the proposed model can easily handle two common
traffic modes, transit buses and automobiles, and significantly reduce delays for both modes under both non-saturated
and oversaturated traffic conditions as compared with timings optimized by SYNCHRO. Jones et al. (2013) addressed the
problem of determining robust signal controls in a road network considering interdependency of signal controls and traffic
flow patterns and uncertainty in the travel demands. According to the results of case studies performed, their approach
seems to provide a robust performance for solving signal control problem. On the other hand, Hu and Liu (2013) developed
a data-driven arterial offset optimization model taking some inherent problems with vehicle-actuated signal coordination
into consideration. The aim of this model is to minimize total delay for the main coordinated direction and to maximize
the performance of the opposite direction as well. Results obtained from the field experiments show that the proposed
model can reduce travel delay of coordinated direction significantly without compromising the performance of the opposite
approach. Hu et al. (2013) proposed a model which maximizes the discharging capacity along oversaturated routes by con-
sidering green time constraints. In order to obtain the maximum flow, a forward–backward procedure was used in the model
which tested using a microscopic traffic simulation model for an arterial network. Results indicated that the model can effec-
tively reduce oversaturation and thus improve system performance. Varaiya (2013) considered the control of a network of
signalized intersections and introduced the max pressure control which selects a stage that depends only on the queues
adjacent to the intersection. Results show that max pressure control with some modifications, which guarantee minimum green for each approach and consider weighted queues, is able to control signalized networks, although priority
service and fully actuated control may not be stabilized in some cases. Maher et al. (2013) investigated the application of the
cross-entropy method to find optimal signal timings with fixed time control in a given road network. Results showed that the
proposed method could be used efficiently for different types of signal optimization problems. Zhang et al.
(2013) formulated a bi-objective optimization model to determine signal timing plans for coordinated traffic signals using
simulation-based GA by considering environmental concerns. Dell’Orco et al. (2013) proposed Harmony Search (HS) algo-
rithm for optimizing traffic signal timings. The results of HS have been first compared with those obtained using the GA
and the HC on a two-junction network for a fixed set of link flows. Secondly, the HS algorithm with equilibrium link flows
has been applied to the medium-sized network to show the applicability of the proposed algorithm. Likewise, Dell’Orco et al.
(2014) used Artificial Bee Colony (ABC) algorithm with TRANSYT-7F for finding optimal setting of traffic signals in the CSN
for fixed set of link flows. Results showed that the proposed model is slightly better in signal timing optimization in terms of
objective function compared with GA and HC methods. Recently, Cesme and Furth (2014) explored a new paradigm for traf-
fic signal control called self-organizing signals based on local actuated control. Results show that overall delay can be
reduced up to 14% compared to an optimized coordinated-actuated scheme where there is no transit priority.
Additionally, it has been found that transit signal priority with self-organizing control may be more effective than coordi-
nated-actuated control. He et al. (2014) aimed to address the conflicting issues between actuated-coordination and multi-
modal priority control which have different control objectives. For this purpose, a request-based mixed-integer linear pro-
gram is formulated considering coordination and vehicle actuation. The simulation tests show that the proposed control
model is capable of reducing different types of average delay especially for highly congested condition.
The reviewed literature shows that heuristic methods are commonly preferred by researchers rather than using conven-
tional mathematical methods for finding optimal signal timings due to complexity of the problem. On the other hand, few
studies have been carried out on traffic signal control using Reinforcement Learning (RL) based algorithms. Thorpe (1997)
applied the RL algorithm to traffic signal control problem. Martin and Brauer (2000) presented a fuzzy model based on RL
approach, and applied to the problem of optimal signal plan selection. Wiering (2000) studied the use of multi-agent RL algo-
rithms for learning traffic signal controllers. Bingham (2001) applied the RL in the context of a neuro-fuzzy approach to traf-
fic signal control. Similarly, Abdulhai et al. (2003) applied a RL based algorithm to an isolated traffic signal in a road network.
In addition, Camponogara and Kraus (2003) studied a simple scenario with only two intersections using stochastic game the-
ory and RL. Additionally, Cai et al. (2009) presented an adaptive signal controller for real-time operation in road networks.
For this approach, temporal-difference learning and perturbation learning methods have been investigated as learning tech-
niques. Bazzan et al. (2010) investigated the task of multi-agent RL for control of traffic signals. Arel et al. (2010) introduced a
novel method based on RL to obtain an efficient traffic signal control policy by minimizing the average delay and congestion.
The method took advantage of the Q-learning algorithm with feed-forward neural network. Results obviously show advan-
tages of the proposed method. El-Tantawy and Abdulhai (2010) developed a Q-learning based signal control system that uses
a variable phasing sequence. The proposed model was tested on a typical multiphase intersection for different traffic con-
ditions in order to minimize vehicle delay, and outperformed the widely used Webster pre-timed optimized signal control
strategy. Another solution algorithm was presented in order to solve dynamic user equilibrium network design problem
using RL based approach by Ozan et al. (2014). The proposed algorithm was tested on the medium sized network and
encouraging results were obtained. El-Tantawy et al. (2013) presented a novel system of multi-agent RL for adaptive traffic signal control. The proposed system was tested on a large-scale simulated network and the results show a significant reduction
in the average intersection delay and saving in travel-time. Recently, Zhu et al. (2015) proposed Junction Tree Algorithm
based on RL for coordinated signal control problems in which traffic signals are modeled as intelligent agents interacting
with the stochastic traffic environment. Results showed that the proposed algorithm outperforms other RL based methods
in terms of average delay, number of stops, and vehicular emissions.
Although there are many studies in literature with various heuristic methods to optimize traffic signal timings for the
CSN, there are few applications of RL in this area. In addition, the optimization of signal timings on the CSN, which includes
a set of non-linear mathematical formulations, is very difficult. Therefore, new methods and approaches are needed to
improve efficiency of signal control in a road network due to complexity of the signal timing optimization problem. Thus,
this study proposes MOdified REinforcement Learning algorithm with TRANSYT-7F (MORELTRANS) model in which modi-
fied RL algorithm and TRANSYT-7F traffic model are combined for solving the signal timing optimization problem. In this
model, modified RL algorithm is applied to optimize traffic signal timings in the CSN using fixed set of link flows while
TRANSYT-7F traffic model is used to estimate network performance index, namely DI.
The remaining content of this paper is organized as follows. The modified RL algorithm and its solution process are given
in Section 2. Problem formulation and model development are provided in Section 3. Numerical application is presented in
Section 4. Finally, some concluding remarks plus future directions are given in Section 5.
2. Modified RL algorithm and its solution process

The RL algorithm is considered to be a straightforward framing of the problem of learning from interaction to achieve a
goal (Sutton and Barto, 1998). In the RL, the decision-maker is called an agent that interacts with its environment.
Information on the environment is given to the agent through interaction between each other. Based on the information
taken, the agent chooses an action to perform in the environment. The action changes the environment in different ways,
and this change is communicated to the agent through a scalar reinforcement signal. The environment produces rewards
as special numerical values which the agent tries to maximize over time. The agent and environment interact in a sequence
of steps. At each step, the agent receives some representation of the environment's state, s ∈ S, where S is the set of possible states, and on that basis, the agent selects an action, a ∈ A(s), where A(s) is the set of actions available in state s. One step later, the agent receives a numerical reward, r ∈ R, and finds itself in a new state, s′. Fig. 1 shows the fundamental agent–
environment interaction.
There are three major categories of RL methods: dynamic programming, Monte Carlo and temporal difference learning. In fact, temporal difference learning methods combine ideas from both dynamic programming meth-
ods and Monte Carlo techniques (Abdulhai and Kattan, 2003). As known, Q-learning is one of the temporal difference
methods. It is a model free approach that does not require the agent to have access to information about how the environ-
ment works. Q-learning works by estimating state-action values (Q values) which are numerical estimators of quality for a
given pair of a state and an action (Bazzan et al., 2010). It uses the experience of each state transition to update one element
of a table which is called Q-table (Sutton and Barto, 1998). This table has an entry, Q(s,a), for each pair of a state, s, and an
action, a. The Q-learning algorithm determines the Q value, which reflects the value of an action a executed in a state s, and
selects the best actions (Vanhulsel et al., 2009). The Q table is populated during the learning process as shown in Table 1.
The learning process is carried out for a number of learning episodes. Each learning episode starts in a random state s, and
then the agent selects and executes an action. It receives the immediate reward and observes the next state, s′. Based
on this information, the agent updates the Q value according to Eq. (1):

$$Q_t(s,a) \leftarrow (1-\alpha)\,Q_t(s,a) + \alpha\Big[\,r_t(s,a) + \gamma\,\overbrace{Q^{\mathrm{best}}_{t-1}(s,a)}^{\text{next state } s'}\,\Big] \qquad (1)$$

where $Q_t(s,a)$ and $r_t(s,a)$ are the updated Q value and the reward value at the $t$th learning episode, respectively; $Q^{\mathrm{best}}_{t-1}(s,a)$ is the best Q value obtained in the previous learning episode; $\alpha$ is the learning rate and $\gamma$ is the discounting factor.
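As a minimal illustration of the update in Eq. (1), the following Python sketch applies it to a single state–action entry; the function and variable names (update_q, q_current, q_best_prev) are ours rather than the authors', and the numerical values are arbitrary, with the learning rate and discounting factor set to the values used later in the paper (0.8 and 0.2).

```python
def update_q(q_current, reward, q_best_prev, alpha=0.8, gamma=0.2):
    """Modified Q-learning update of Eq. (1):
    Q_t(s,a) <- (1 - alpha)*Q_t(s,a) + alpha*(r_t(s,a) + gamma*Q_best_{t-1}(s,a))."""
    return (1.0 - alpha) * q_current + alpha * (reward + gamma * q_best_prev)

# Arbitrary example: current Q value 40.0, reward 0.05 and best Q value 38.0
# carried over from the previous learning episode.
print(update_q(q_current=40.0, reward=0.05, q_best_prev=38.0))
```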
Although the Q-learning algorithm can be used for finding optimal or near optimal solution for a given optimization prob-
lem, we modified it in order to be able to obtain better solutions in solving the signal timing optimization problem by means
of some improvements, which will be explained in the next sub-section.
Hybridizing heuristic algorithms with other methods or modified heuristic algorithms is an effective and efficient way to
solve optimization problems in various areas. The literature describes several studies in which RL has been integrated with
other heuristic methods or modified to solve various optimization problems. Liu and Zeng (2009) proposed an improved GA
with reinforcement mutation to solve the traveling salesman problem. Their results showed that the proposed hybrid algo-
rithm could obtain a nearly optimal tour in a reasonable amount of time. Maravall et al. (2009) integrated RL with evolution-
ary algorithms to solve the problem of autonomous motion control of robots. They demonstrated the efficiency of their
hybrid approach and also examined the efficiency of RL in real-time and on-line situations.
Fig. 1. The fundamental interaction between the agent and its environment.
Table 1
Q-learning process (Kaelbling et al., 1996).
Initialize Q values
Repeat t times (t = number of learning episodes)
Select a random state s
Repeat until the end of the learning episode
Select an action a
Receive an immediate reward r
Observe the next state s′
Update the Q table according to the update rule
Set s = s′
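The generic Q-learning process in Table 1 can be sketched in code as follows; this is only a schematic, in which the environment is represented by a placeholder step(s, a) function and the standard Q-learning update is used for concreteness, so none of the names or behaviour are taken from the authors' implementation.

```python
import random

def q_learning(num_episodes, states, actions, step, alpha=0.8, gamma=0.2):
    """Schematic Q-learning loop following Table 1.
    `step(s, a)` is a placeholder environment returning (reward, next_state, done)."""
    q = {(s, a): 0.0 for s in states for a in actions}      # initialize Q values
    for _ in range(num_episodes):                           # repeat t times
        s = random.choice(states)                           # select a random state s
        done = False
        while not done:                                     # repeat until episode ends
            a = random.choice(actions)                      # select an action a
            r, s_next, done = step(s, a)                    # immediate reward, next state
            best_next = 0.0 if done else max(q[(s_next, b)] for b in actions)
            q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * (r + gamma * best_next)
            s = s_next                                      # set s = s'
    return q
```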
Chen et al. (2009) developed an enhancement of a stock trading model using genetic network programming with an RL algorithm. They carried out a sim-
ulation and compared their hybrid method with other methods to confirm its effectiveness. Their results showed that the
proposed approach outperformed all the other methods considered. Wu et al. (2011) developed a novel multi-agent RL algo-
rithm for job scheduling problems. Their simulation results presented that the proposed method can achieve the goal of bal-
ancing loads effectively.
In the light of the given literature, we can see that RL based algorithms have been successfully applied to various optimization problems, but they may be further improved to obtain better solutions for any given optimization problem. The core of the modified RL algorithm used in this study is to generate, at each learning episode, a sub-environment of the same size as the original environment, based on the best solution available in the previous learning episode, as shown in Fig. 2.
In Fig. 2, m is the size of the original environment, n is the number of decision variables of a given optimization problem at
the tth learning episode and f is the value of objective function. As given in Fig. 2, the best solution vector obtained from the
previous learning episode is stored in the (m + 1)th row to avoid being trapped at local optimum. At the tth learning episode,
a sub-environment of the same size as the original environment is also randomly generated and placed in the (m + 2)th to (2m + 1)th rows of the matrix, around the best solution from the previous learning episode, as given in Eq. (2):

$$\mathrm{rnd}\left[\,Q^{\mathrm{best}}_{t-1}(s,a) - \beta,\;\; Q^{\mathrm{best}}_{t-1}(s,a) + \beta\,\right] \qquad (2)$$
By means of the generated sub-environment, the global optimum is sought around the best solution, using a reduced search space whose extent is controlled by a β value determined during the algorithm process. β is used to reduce the size of the search space during the application of the modified RL algorithm. The bounds of the sub-environment are determined by β_j, j = 1, 2, . . ., n, with n being the number of decision variables. The range of β may be chosen between the minimum and maximum bounds of any given optimization problem (Baskan et al., 2009). After that, the solution vectors populated in the environment and sub-environment are sorted from best to worst, based on their objective function values, to identify potentially better Q values. In this way, the solution vectors in the sub-environment and the best solution stored in the Q-table are compared with the solution vectors existing in the original environment. If one of these solution vectors provides a better functional value than the worst one, the new vector is included in the original environment and the worst vector is excluded from the environment. Thus, the modified RL algorithm may quickly reach the global or near-global optimum without being trapped at a local optimum. Fig. 3 shows the steps of the modified RL algorithm.
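The sub-environment generation of Eq. (2) and the sort-and-replace step described above can be sketched as follows; the solution vectors are assumed to be NumPy arrays and evaluate is a placeholder for the TRANSYT-7F evaluation, so the function names, bounds and random generator are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def sub_environment(best_solution, beta, lower, upper, m, rng):
    """Generate m new solution vectors around the best solution of the previous
    learning episode, as in Eq. (2), clipped to the decision-variable bounds."""
    low = np.maximum(best_solution - beta, lower)
    high = np.minimum(best_solution + beta, upper)
    return rng.uniform(low, high, size=(m, best_solution.size))

def update_environment(env, candidates, evaluate):
    """Replace the worst vectors of the original environment by any candidate
    (sub-environment vector or stored best) that has a better objective value."""
    values = np.array([evaluate(v) for v in env])
    for cand in candidates:
        worst = int(np.argmax(values))              # DI is minimized, so worst = largest
        cand_value = evaluate(cand)
        if cand_value < values[worst]:
            env[worst] = cand
            values[worst] = cand_value
    return env, values
```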
In the modified RL algorithm, the function r(s, a) rewards action a in state s in the search for the global or near-global optimum of any given optimization problem, as shown in Eq. (3):

$$r_t(s,a) = \frac{Q^{\mathrm{best}}_t(s,a) - Q_t(s,a)}{Q_t(s,a)} \qquad (3)$$

where $r_t(s,a)$ is the reward function, $Q_t(s,a)$ is the Q value and $Q^{\mathrm{best}}_t(s,a)$ is the best Q value obtained in the $t$th learning episode. The reward value for each decision variable is evaluated by dividing the difference between the best Q value and the Q value by the Q value. In the modified RL algorithm, the reward values approach "0" because of the form of the reward function. When the global or near-global optimum of any given optimization problem is sought, solutions located further from the global optimum take larger reward values than solutions closer to it, again due to the form of the reward function. Thus, the probability that Q values located further away reach the global optimum may be increased. On the other hand, the reward function developed here ensures that Q values are associated with smaller rewards as they approach the global optimum; therefore, this function may be regarded as a penalty rather than a reward.
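A minimal sketch of this reward (penalty) computation, applied element-wise to every solution vector as later illustrated in Eq. (12), is given below; the array layout and function name are our assumptions for illustration only.

```python
import numpy as np

def reward_table(q_table, best_row):
    """Element-wise reward of Eq. (3): r = (Q_best - Q) / Q.
    `q_table` holds one solution vector per row; `best_row` is the index of the
    best solution vector, whose own rewards are therefore all zero."""
    best = q_table[best_row]
    return (best - q_table) / q_table

# Example with three 4-variable solution vectors; the second row is the best one.
q = np.array([[40.0, 35.0, 10.0, 80.0],
              [42.0, 36.0, 12.0, 82.0],
              [38.0, 30.0,  9.0, 78.0]])
print(reward_table(q, best_row=1))
```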
Fig. 2. Structure of the Q-table: the original environment (rows 1 to m), the best solution vector of the previous learning episode (row m + 1) and the generated sub-environment (rows m + 2 to 2m + 1), each row with its corresponding objective function value f.
3. Problem formulation and model development

In this study, it is aimed to find optimum signal timings in the CSN for a fixed set of link flows. The objective function and corresponding constraints are given in Eq. (4).
$$\min_{\psi}\; DI(\psi,q) = \sum_{a \in L} \left[\, w_{d_a}\, d_a(\psi) + K\, w_{s_a}\, S_a(\psi) \,\right]$$

$$\text{subject to } \psi(c,\theta,\phi) \in \Omega_0: \quad c_{\min} \le c \le c_{\max}, \quad 0 \le \theta \le c, \quad \phi_{\min} \le \phi \le c, \quad \sum_{i=1}^{z} (\phi + I)_i = c \qquad (4)$$

where $d_a$ is the delay on link $a$, $a \in L$; $w_{d_a}$ is the link-specific weighting factor for the delay $d_a$ on link $a$; $K$ is the stop penalty factor expressing the importance of stops relative to delay; $S_a$ is the number of stops per second on link $a$; $w_{s_a}$ is the link-specific weighting factor for the stops $S_a$ on link $a$; $q$ is the fixed set of link flows; $\psi$ is the set of signal setting parameters; $c$ is the network cycle time (s); $\theta$ is the offset time (s); $\phi$ is the stage green time (s); $\Omega_0$ is the feasible region for signal timings; $I$ is the intergreen time (s); and $z$ is the number of stages at each signalized intersection in a given road network.
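In the MORELTRANS the DI is returned by the TRANSYT-7F traffic model; purely to illustrate the structure of the objective in Eq. (4), the following sketch sums weighted delay and stops over links with made-up values (in practice the delay and stop estimates come from TRANSYT-7F, not from user input).

```python
def disutility_index(links, K=1.0):
    """DI = sum over links of (w_d * d_a + K * w_s * S_a), following Eq. (4).
    Each link is a dict with its delay, stops per second and weighting factors."""
    return sum(l["w_d"] * l["delay"] + K * l["w_s"] * l["stops"] for l in links)

# Illustrative (made-up) values for two links.
links = [
    {"delay": 12.5, "stops": 0.20, "w_d": 1.0, "w_s": 1.0},
    {"delay":  8.3, "stops": 0.15, "w_d": 1.0, "w_s": 1.0},
]
print(disutility_index(links, K=1.0))
```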
In this study, the MORELTRANS is developed for solving the signal timing optimization problem in the CSN. The proposed
model includes two main parts: (i) modified RL algorithm; (ii) TRANSYT-7F traffic model. The modified RL algorithm opti-
mizes traffic signal timings under fixed set of link flows while TRANSYT-7F is used to estimate network performance index,
DI, for a given signal timing and staging plan in a road network. The MORELTRANS consists of 6 steps as given below:
In the first step, the objective function, the number of signal timing variables (n), the constraints for each decision variable, the fixed set of link flows (q), the saturation flows (s), the free-flow travel time ($t^0_a$) on link a, a ∈ L, the maximum number of learning episodes (tmax), the size of the environment (m), the value of β for each signal timing variable, the learning rate (α) and the discounting factor (γ) are specified.
In the second step, since the environment contains the possible states, S, the number of states is assumed to equal the size of the environment, m. At each state, there is a set of possible actions, A(s), and the agent randomly selects an action, a, from the set of possible actions for each state, s ∈ S. The set of possible actions includes values between the minimum and maximum bounds of each decision variable. The randomly selected action at each state is defined as Q(s, a) in the Q-table. Therefore, the Q-table represents solution vectors which contain the values of the signal timing variables. This process is explained below:
(i) Network cycle time

In the MORELTRANS, the agent selects an action for the network cycle time according to the constraints cmin and cmax. The network cycle time is randomly generated as shown in Eq. (5).
(ii) Offsets
According to the generated network cycle time, c, the agent performs actions for offset variables which are randomly gen-
erated between 0 and c as shown in Eq. (6) for each intersection in a given road network.
$$\theta_j = \mathrm{int}\left[\mathrm{rnd}(0,1)\cdot c\right] \qquad (6)$$
where i is the stage number and $\phi_{\min}$ is the minimum green timing considered for each stage. In addition, the green timings should be distributed to all signal stages so that the cycle time constraint is satisfied, according to Eq. (8) (Ceylan and Bell, 2004).
$$\phi_i = \phi_{\min} + \frac{p_i}{\sum_{k=1}^{z} p_k}\left(c - \sum_{k=1}^{z} I_k - \sum_{k=1}^{z} \phi_{\min,k}\right), \quad i = 1,2,\ldots,z \qquad (8)$$

where $\phi_i$ and $\phi_{\min}$ are the green timing (s) and minimum green timing (s) for stage $i$, respectively; $p_i$ is the randomly generated green timing (s) for stage $i$; $z$ is the number of stages; $I$ is the intergreen time (s) and $c$ is the network cycle time (s).
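The generation of one random solution vector can be sketched as follows. The uniform draws for the network cycle time and for the provisional stage green timings $p_i$ are our assumptions about the forms of Eqs. (5) and (7), which are not reproduced here, while the offsets and the green-time scaling follow Eqs. (6) and (8); a uniform intergreen of 5 s per stage and the two- or three-stage structure of the test network are likewise assumptions taken from the numerical application.

```python
import random

def random_solution(c_min, c_max, n_junctions, stages_per_junction, phi_min, intergreen):
    """One random signal timing plan: network cycle time, one offset per junction
    (Eq. (6)) and stage green timings scaled to fill the cycle (Eq. (8))."""
    c = random.randint(c_min, c_max)                                   # assumed form of Eq. (5)
    offsets = [int(random.random() * c) for _ in range(n_junctions)]   # Eq. (6)
    greens = []
    for z in stages_per_junction:
        p = [random.uniform(phi_min, c) for _ in range(z)]             # assumed form of Eq. (7)
        available = c - z * intergreen - z * phi_min
        greens.append([phi_min + p_i / sum(p) * available for p_i in p])  # Eq. (8)
    return c, offsets, greens

# Example: six junctions with two or three stages each (as in Table 3),
# a minimum green of 7 s and an intergreen of 5 s.
c, offsets, greens = random_solution(36, 120, 6, [2, 2, 2, 3, 3, 2], 7, 5)
print(c, offsets, [[round(g, 1) for g in junction] for junction in greens])
```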
After the signal timing variables are randomly generated, the Q-table is created as given in Eq. (9), in which the green timings, offsets and network cycle time are presented in each row as Q values. As aforementioned, the signal timing variables take the place of the Q values, and the resulting form of Eq. (9) can be seen in Eq. (10).
$$Q = \begin{bmatrix}
\overbrace{Q^{11}(s,a)}^{\phi_{11}}, & \overbrace{Q^{12}(s,a)}^{\phi_{12}}, & \ldots, & \overbrace{Q^{1n}(s,a)}^{c_1} \\
Q^{21}(s,a), & Q^{22}(s,a), & \ldots, & Q^{2n}(s,a) \\
\vdots & \vdots & & \vdots \\
Q^{m1}(s,a), & Q^{m2}(s,a), & \ldots, & Q^{mn}(s,a)
\end{bmatrix}_{m\times n} \qquad (9)$$

$$\psi(c,\theta,\phi) = \begin{bmatrix}
\phi_{11}, & \phi_{12}, & \ldots, & \theta_{11}, & \theta_{12}, & \ldots, & c_1 \\
\phi_{21}, & \phi_{22}, & \ldots, & \theta_{21}, & \theta_{22}, & \ldots, & c_2 \\
\vdots & & & \vdots & & & \vdots \\
\phi_{m1}, & \phi_{m2}, & \ldots, & \theta_{m1}, & \theta_{m2}, & \ldots, & c_m
\end{bmatrix}_{m\times n} \qquad (10)$$

Each solution vector (row) of the Q-table is then evaluated with the TRANSYT-7F traffic model to obtain its objective function value, as illustrated in Eq. (11):

$$\begin{bmatrix}
\phi_{11}, & \ldots, & \theta_{11}, & \ldots, & c_1 \\
\phi_{21}, & \ldots, & \theta_{21}, & \ldots, & c_2 \\
\vdots & & \vdots & & \vdots \\
\phi_{m1}, & \ldots, & \theta_{m1}, & \ldots, & c_m
\end{bmatrix}_{m\times n}
\Rightarrow
\begin{bmatrix}
f_1(\psi_1,q) \\ f_2(\psi_2,q) \\ \vdots \\ f_m(\psi_m,q)
\end{bmatrix} \qquad (11)$$
Step 4. Generating sub-environment.
Using the best solution vector stored in the (m + 1)th row of the Q-table and the β value, a sub-environment of the same size as the original environment is randomly generated and placed in the (m + 2)th to (2m + 1)th rows of the Q-table.
Afterwards, solution vectors in both the original and the sub-environment are input to TRANSYT-7F and corresponding
objective function values are calculated. The solution vectors are put in order from best to worst based on their objective
function values. If one of the solution vectors gives a better functional value than the worst one, the new solution vector
is included in the original environment and the worst solution vector is excluded.
The best solution vector is determined with respect to the objective function values stored in the Q-table. Accordingly, reward values are calculated for each signal timing variable in each solution vector using Eq. (3). As an illustration, if the best solution vector is assumed to be $\psi = (\phi_{21}, \phi_{22}, \ldots, \theta_{21}, \theta_{22}, \ldots, c_2)$, the corresponding reward values are calculated and their final representation can be seen in Eq. (12).
$$r = \begin{bmatrix}
\overbrace{\dfrac{\phi_{21}}{\phi_{11}}-1}^{r^{11}(s,a)}, & \overbrace{\dfrac{\phi_{22}}{\phi_{12}}-1}^{r^{12}(s,a)}, & \ldots, & \dfrac{\theta_{21}}{\theta_{11}}-1, & \ldots, & \overbrace{\dfrac{c_2}{c_1}-1}^{r^{1n}(s,a)} \\[2ex]
\overbrace{\dfrac{\phi_{21}}{\phi_{21}}-1}^{0}, & \overbrace{\dfrac{\phi_{22}}{\phi_{22}}-1}^{0}, & \ldots, & \overbrace{\dfrac{\theta_{21}}{\theta_{21}}-1}^{0}, & \ldots, & \overbrace{\dfrac{c_2}{c_2}-1}^{0} \\[2ex]
\vdots & \vdots & & \vdots & & \vdots \\[1ex]
\dfrac{\phi_{21}}{\phi_{m1}}-1, & \dfrac{\phi_{22}}{\phi_{m2}}-1, & \ldots, & \dfrac{\theta_{21}}{\theta_{m1}}-1, & \ldots, & \dfrac{c_2}{c_m}-1
\end{bmatrix} \qquad (12)$$
Due to the form of the reward function, the reward values in the second row take the value of "0" since the solution vector located in the second row is assumed to be the best solution vector. Finally, the Q values are updated using Eq. (1).
The algorithm is terminated when a preset stopping criterion is satisfied, and the best solution vector is selected as the final solution. Otherwise, the algorithm repeats Steps 4–6. Fig. 4 shows the flowchart of the MORELTRANS.
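Putting the steps together, the overall flow of the MORELTRANS (Fig. 4) can be summarised by the sketch below. Here run_transyt7f is only a placeholder for the external TRANSYT-7F evaluation (not a real API), the environment is assumed to be a list of NumPy solution vectors, the 4% stopping rule is implemented under one possible interpretation, and the reward/Q-value update of Eqs. (1), (3) and (12) is omitted for brevity.

```python
import numpy as np

def run_transyt7f(solution):
    """Placeholder for the external TRANSYT-7F run returning the DI of a timing plan."""
    raise NotImplementedError("the interface to TRANSYT-7F is site-specific")

def moreltrans(env, beta, lower, upper, t_max=1000, tol=0.04, evaluate=run_transyt7f):
    """Schematic MORELTRANS loop: generate a sub-environment around the best
    solution of the previous episode, keep any better vectors, and stop when the
    best and average DI values differ by less than `tol`."""
    rng = np.random.default_rng()
    m = len(env)
    values = np.array([evaluate(v) for v in env])        # evaluate initial environment
    for _ in range(t_max):
        best = env[int(np.argmin(values))].copy()        # best solution so far
        # Step 4: sub-environment of size m generated around the best solution (Eq. (2))
        low = np.maximum(best - beta, lower)
        high = np.minimum(best + beta, upper)
        candidates = list(rng.uniform(low, high, size=(m, best.size))) + [best]
        for cand in candidates:                          # keep any candidate better than the worst
            worst = int(np.argmax(values))
            cand_value = evaluate(cand)
            if cand_value < values[worst]:
                env[worst], values[worst] = cand, cand_value
        # Stopping criterion: best vs. average DI within tol (4%)
        if (values.mean() - values.min()) / values.mean() < tol:
            break
    best_idx = int(np.argmin(values))
    return env[best_idx], values[best_idx]
```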
4. Numerical application
In order to show the effectiveness of the MORELTRANS, it was applied to Allsop and Charlesworth’s well-known test road
network taken from Ceylan (2002). This network was chosen since it is probably the most used network for solving trans-
portation related problems (Chiou, 2003, 2014; Ceylan and Bell, 2004, 2005; Ceylan, 2006; Ceylan and Ceylan, 2012;
Dell’Orco et al., 2013; Maher et al., 2013). Thus some results obtained from this network in this study may be compared with
those obtained by other methods in the literature. Basic layouts of the network and stage plans are given in Figs. 5 and 6.
This network includes 23 links and 21 signal setting variables at six signal-controlled junctions. The fixed set of link flows,
taken from Ceylan (2002), is given in Table 2. The constraints for each signal timing variable are set as follows:
$$36 \le c \le 120, \qquad 0 \le \theta \le c, \qquad 7 \le \phi \le c, \qquad I_{12} = I_{21} = 5\ \mathrm{s}$$
The MORELTRANS was coded in the MATLAB 2009 environment and run on a PC with an Intel Core2 2.66 GHz CPU and 4 GB of RAM. It is performed with the following user-specified parameters: the learning rate (α) is 0.8, the discounting factor (γ) is 0.2, the environment size (m) is 20, and the maximum number of learning episodes (tmax) is 1000. The solution process is repeated
until a preset stopping criterion is met. The MORELTRANS is terminated when the difference between the best and the aver-
age values of DI is less than 4%.
4.1. Base case

Before we test the possible effect of demand increase, the MORELTRANS was applied to the base case using the existing set of
link flows, and the convergence of the model is given in Fig. 7. At the 337th learning episode, the algorithm was terminated since the stopping criterion was met. In this learning episode, the best value of the objective function was found to be 363.80, while it was 527.80 at the first learning episode. In other words, the improvement rate is 45% with respect to the initial value of the objective function. As seen in Fig. 7, the average objective function values fluctuate during the optimization process, since new solution vectors, which may be located further away from the solution vectors existing in the original environment, are generated in the sub-environment; such fluctuation is therefore expected. Through the generated sub-environment, the global optimum is searched for around the best signal setting parameters using a reduced search space during the algorithm process. Moreover, the best solution vector obtained from the previous learning episode is stored in order to avoid being trapped at a local optimum.
For comparison, the problem of finding optimum signal timings on Allsop and Charlesworth’s network is also solved using
TRANSYT-7F in which GA and HC optimization tools exist. For this network, the MORELTRANS and TRANSYT-7F optimizers’
results are given in Table 3. While the best value of DI is 443.00 in TRANSYT-7F with HC, it is found as 364.70 with GA. The
network cycle time is determined as 65 and 80 s with HC and GA, respectively. The MORELTRANS is slightly better than the
GA solution since the best value of DI for the MORELTRANS is 363.80. Moreover, 22% improvement is achieved relative to HC
optimization tool in terms of final values of DI in the proposed model. After solving signal optimization problem using modi-
fied RL and GA methods, the obtained results are quite close to each other in terms of DI, as seen in Table 3. However, the findings show that the proposed method may be considered preferable to the GA since its application is easier than that of the GA, which requires a binary coding/encoding procedure. On the other hand, as in similar studies, it is desirable that a heuristic method provides better results than the compared methods as traffic demand increases, rather than only under regular traffic conditions.
Additionally, the values of operating cost and total travel time are in good agreement with the final values of DI. As can be
seen in Table 3, HC gives maximum value for both total travel time and operating cost as well as for DI. On the other hand,
Fig. 5. Layout of Allsop and Charlesworth's road network: origin–destination points, the six signal-controlled junctions and the numbered links.
the only disadvantage of the MORELTRANS is that it requires much more CPU time than the GA and HC. While the CPU times of the HC and GA are approximately 10 and 40 min, respectively, the MORELTRANS requires 5.74 h for the whole computation. Although the difference in required CPU times between the compared algorithms seems enormous, the MORELTRANS reached 90% of its final solution value within 188 min. Moreover, another essential difference is that the GA and HC tools are embedded in TRANSYT-7F, and this property substantially decreases the CPU times of their solutions. On the other hand, we coded the modified RL algorithm in the MATLAB environment and combined it with TRANSYT-7F; therefore, it is to be expected that the MORELTRANS requires much more CPU time than the GA and the HC. Although this seems a drawback, it can be overcome by embedding the modified RL algorithm into TRANSYT-7F in the future. Thus, the MORELTRANS may be considered for optimizing signal timings in the CSN instead of the GA and HC optimization tools. On the other hand, two cases are considered in order to test the possible effect of demand increase. For this purpose, the link flows given in Table 2 are increased by 20% and 50%.
4.2. Case A
The MORELTRANS was applied to Allsop and Charlesworth's network with demand increased by 20%. The convergence of the proposed model for this case can be seen in Fig. 8.

At the 477th learning episode, the MORELTRANS reached its final solution of 781.30, while the objective function value was 1043.00 at the first learning episode. In the base case, the final objective function value of the MORELTRANS had been obtained as 363.80, while it was 781.30 in the case of a demand increase of 20%. This result shows that the demand increase approximately doubles the final value of DI. For Case A, the MORELTRANS and TRANSYT-7F optimizers' results are given in Table 4. Depending on the demand growth, the network cycle times determined by the MORELTRANS, GA, and HC are 112, 108
Fig. 6. Stage plans for the six signal-controlled junctions of the test network.
Table 2
Fixed set of link flows.
Link number Link flow (veh/h) Saturation flow (veh/h) Free-flow travel time (s)
1 716 2000 1
2 463 1600 1
3 716 3200 10
4 569 3200 15
5 636 1800 20
6 173 1850 20
7 462 1800 10
8 478 1850 15
9 120 1700 15
10 479 2200 10
11 499 2000 1
12 250 1800 1
13 450 2200 1
14 789 3200 20
15 790 2600 15
16 663 2900 10
17 409 1700 10
18 350 1700 15
19 625 1500 10
20 1290 2800 1
21 1057 3200 15
22 1250 3600 1
23 837 3200 15
Fig. 7. Convergence of the MORELTRANS for the base case: best and average objective function values versus the number of learning episodes.
Table 3
The best values of DI and signal timings for base case.
Method | Disutility Index (DI) | Operating cost ($/h) | Total travel time (veh-h/h) | Cycle time c (s) | Junction number j | Duration of stages φj,1, φj,2, φj,3 (s) | Offset θj (s)
MORELTRANS 363.80 1358 169 82 1 43 39 – 0
2 51 31 – 80
3 52 30 – 46
4 27 29 26 27
5 18 27 37 28
6 36 46 – 42
TRANSYT-7F 443.00 1451 202 65 1 25 40 – 0
with HC 2 35 30 – 2
3 41 24 – 2
4 24 21 20 22
5 12 23 30 56
6 28 37 – 2
TRANSYT-7F 364.70 1360 169 80 1 42 38 – 0
with GA 2 50 30 – 12
3 50 30 – 71
4 27 27 26 62
5 12 34 34 11
6 40 40 – 0
and 93 s, respectively. It is shown that demand growth causes the network cycle time to increase with respect to the base case, as expected. As for the objective function values obtained from the HC and GA, we can see that the MORELTRANS outperforms both optimization tools. The proposed model improves the network's DI by 3% and 14% compared with the GA and HC, respectively.
4.3. Case B
In this case, the link flows are increased by up to 50% in order to show the effectiveness of the MORELTRANS under a heavy demand condition. The signal timings and corresponding parameters obtained are given in Table 5. Relative to the initial solution, the MORELTRANS improved the best objective function value by 20%. At the 550th learning episode, the algorithm was stopped and the best value of DI was found to be 2137.00. As can be seen in Fig. 9, the average objective function values show a higher tendency to fluctuate than in the case of a demand increase of 20%. The underlying reason is that the higher demand leads to increased traffic congestion and makes it more difficult to find the optimal solution of the signal optimization problem. Even under this heavy demand condition, the MORELTRANS was capable of finding a near-global optimum solution and produced a better objective function value than the GA and HC. To further analyze the effect of the heavy demand condition, network operating parameters such as the degree of saturation, total delay and stops are given in Table 6 for all cases.
Fig. 8. Convergence of the MORELTRANS for Case A: objective function values versus the number of learning episodes.
Table 4
The best values of DI and signal timings for Case A.
Method | Disutility Index (DI) | Operating cost ($/h) | Total travel time (veh-h/h) | Cycle time c (s) | Junction number j | Duration of stages φj,1, φj,2, φj,3 (s) | Offset θj (s)
MORELTRANS 781.30 2144 355 112 1 56 56 – 0
2 67 45 – 56
3 70 42 – 31
4 39 35 38 8
5 17 40 55 75
6 51 61 – 81
TRANSYT-7F 890.70 2272 404 93 1 35 58 – 0
with HC 2 50 43 – 20
3 59 34 – 56
4 34 30 29 42
5 14 34 45 28
6 40 53 – 28
TRANSYT-7F 804.00 2171 364 108 1 54 54 – 0
with GA 2 67 41 – 78
3 69 39 – 103
4 37 36 35 50
5 14 47 47 80
6 54 54 – 31
Table 5
The best values of DI and signal timings for Case B.
Method | Disutility Index (DI) | Operating cost ($/h) | Total travel time (veh-h/h) | Cycle time c (s) | Junction number j | Duration of stages φj,1, φj,2, φj,3 (s) | Offset θj (s)
MORELTRANS 2137.00 4241 964 120 1 58 62 – 0
2 74 46 – 95
3 77 43 – 87
4 44 37 39 57
5 18 40 62 81
6 49 71 – 13
TRANSYT-7F 2286.00 4438 1033 106 1 39 67 – 0
with HC 2 57 49 – 88
3 67 39 – 40
4 39 34 33 4
5 15 39 52 12
6 46 60 – 40
TRANSYT-7F 2228.00 4346 1005 108 1 55 53 – 0
with GA 2 65 43 – 70
3 66 42 – 88
4 40 34 34 51
5 14 47 47 30
6 54 54 – 57
Fig. 9. Convergence of the MORELTRANS for Case B: objective function values versus the number of learning episodes.
Table 6
Network operating parameters resulting from the MORELTRANS.
As can be seen from Table 6, the degrees of saturation on some links, such as links 16 and 19, exceed the critical value of 100% even for the base case. Furthermore, the degrees of saturation on six links are higher than the critical value of "1" when demand is increased by 20% in Case A. Similarly, as the demand is increased further, the number of links whose degree of saturation exceeds the critical value of 100% continues to increase gradually, as given for Case B. As shown in Table 6, the results for total delay and stops on the links are in good agreement with the degrees of saturation; that is, growing demand causes an increase in the degree of saturation as well as in total delay and stops.
5. Conclusions

This study deals with finding optimum signal timings in the CSN for a fixed set of link flows. For this purpose, the MORELTRANS, which includes two main parts, namely the modified RL algorithm and the TRANSYT-7F traffic model, was developed. The modified RL is based on the Q-learning algorithm and differs from other RL algorithms in that a sub-environment of the same size as the original environment is generated at each learning episode using the best solution obtained from the previous learning episode. In the other part of the MORELTRANS, the TRANSYT-7F traffic model was used to estimate the total network performance index.
The proposed model was tested on a medium-sized coordinated signalized road network which contains six junctions. Results obtained from the numerical application showed that the MORELTRANS produced slightly better results than the GA in signal timing optimization in terms of objective function value, while it outperformed the HC optimization tool. To investigate the capability of the MORELTRANS under heavy demand conditions, two cases in which link flows were increased by 20% and 50% with respect to the base case were considered. Results indicated that the MORELTRANS was also able to find the best objective function value and corresponding optimal signal timings in both cases, even as demand increased.
Consequently, the results showed that the MORELTRANS may be used for optimizing traffic signal timings in the CSN for a fixed set of link flows. Hence, it may provide an alternative to the HC and GA optimization tools in TRANSYT-7F. In future studies, it is aimed to apply the MORELTRANS to large-scale road networks and to solve the signal timing optimization problem under equilibrium link flows.
Acknowledgements
The authors would like to thank the anonymous referees for their constructive and useful comments during the devel-
opment stage of this paper. The Scientific Research Foundation of Pamukkale University (Project No. 2010-FBE-063) is also acknowledged.
References
Abdulhai, B., Kattan, L., 2003. Reinforcement learning: introduction to theory and potential for transport applications. Can. J. Civ. Eng. 30, 981–991.
Abdulhai, B., Pringle, R., Karakoulas, G.J., 2003. Reinforcement learning for true adaptive traffic signal control. J. Transport. Eng. 129 (3), 278–285.
Arel, I., Liu, C., Urbanik, T., Kohls, A.G., 2010. Reinforcement learning-based multi-agent system for network traffic signal control. IET Intell. Transp. Syst. 4
(2), 128–135.
Baskan, O., Haldenbilen, S., Ceylan, H., Ceylan, H., 2009. A new solution algorithm for improving performance of ant colony optimization. Appl. Math.
Comput. 211, 75–84.
Bazzan, A.L.C., Oliveira, D., Silva, B.C., 2010. Learning in groups of traffic signals. Eng. Appl. Artif. Intell. 23, 560–568.
Bingham, E., 2001. Reinforcement learning in neurofuzzy traffic signal control. Eur. J. Oper. Res. 131, 232–241.
Cai, C., Wong, C.K., Heydecker, B.G., 2009. Adaptive traffic signal control using approximate dynamic programming. Transport. Res. Part C 17, 456–474.
Camponogara, E., Kraus Jr., W., 2003. Distributed learning agents in urban traffic control. In: Moura-Pires, F., Abreu, S. (Eds.), EPIA, pp. 324–335.
Cesme, B., Furth, P.G., 2014. Self-organizing traffic signals using secondary extension and dynamic coordination. Transport. Res. Part C 48, 1–15.
Ceylan, H., 2002. A Genetic Algorithm Approach to the Equilibrium Network Design Problem. Ph.D. Thesis, University of Newcastle upon Tyne, UK.
Ceylan, H., 2006. Developing combined genetic algorithm hill-climbing optimization method for area traffic control. J. Transport. Eng. 132 (8), 663–671.
Ceylan, H., Bell, M.G.H., 2004. Traffic signal timing optimisation based on genetic algorithm approach, including drivers’ routing. Transport. Res. Part B 38
(4), 329–342.
Ceylan, H., Bell, M.G.H., 2005. Genetic algorithm solution for the stochastic equilibrium transportation networks under congestion. Transport. Res. Part B 39
(2), 169–185.
Ceylan, H., Ceylan, H., 2012. A Hybrid Harmony Search and TRANSYT hill climbing algorithm for signalized stochastic equilibrium transportation networks.
Transport. Res. Part C 25, 152–167.
Chen, J., Xu, L., 2006. Road-junction traffic signal timing optimization by an adaptive particle swarm algorithm. In: 9th International Conference on Control,
Automation, Robotics and Vision, vol. 1–5, pp. 1103–1109.
Chen, Y., Mabu, S., Shimada, K., Hirasawa, K., 2009. A genetic network programming with learning approach for enhanced stock trading model. Expert Syst.
Appl. 36, 12537–12546.
Chiou, S.-W., 2003. TRANSYT derivatives for area traffic control optimisation with network equilibrium flows. Transport. Res. Part B 37, 263–290.
Chiou, S.-W., 2014. Optimization of robust area traffic control with equilibrium flow under demand uncertainty. Comput. Oper. Res. 41, 399–411.
Dan, C., Xiaohong, G., 2008. Study on intelligent control of traffic signal of urban area and microscopic simulation. In: Proceedings of the Eighth International
Conference of Chinese Logistics and Transportation Professionals, Logistics: The Emerging Frontiers of Transportation and Development in China, pp.
4597–4604.
Dell’Orco, M., Baskan, O., Marinelli, M., 2013. A Harmony Search algorithm approach for optimizing traffic signal timings. Promet Traffic Transport. 25 (4),
349–358.
Dell’Orco, M., Baskan, O., Marinelli, M., 2014. Artificial bee colony-based algorithm for optimising traffic signal timings. In: Snášel, V., Krömer, P., Köppen, M.,
Schaefer, G. (Eds.), Soft Computing in Industrial Applications, Advances in Intelligent Systems and Computing, vol. 223. Springer, Berlin/Heidelberg, pp.
327–337.
El-Tantawy, S., Abdulhai, B., 2010. An agent-based learning towards decentralized and coordinated traffic signal control. In: Proceedings of the 13th
International IEEE Annual Conference on Intelligent Transportation Systems, Madeira Island, Portugal, pp. 665–670.
El-Tantawy, S., Abdulhai, B., Abdelgawad, H., 2013. Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-
ATSC): methodology and large-scale application on downtown Toronto. IEEE Trans. Intell. Transport. Syst. 14 (3), 1140–1150.
Girianna, M., Benekohal, R.F., 2002. Application of genetic algorithms to generate optimum signal coordination for congested networks. Proc. Seventh Int.
Conf. Appl. Adv. Technol. Transport., 762–769
He, Q., Head, K.L., Ding, J., 2012. PAMSCOD: Platoon-based arterial multi-modal signal control with online data. Transport. Res. Part C 20, 164–184.
He, Q., Head, K.L., Ding, J., 2014. Multi-modal traffic signal control with priority, signal actuation and coordination. Transport. Res. Part C 46, 65–82.
Heydecker, B.G., 1996. A decomposed approach for signal optimization in road networks. Transport. Res. Part B 30 (2), 99–114.
Hu, H., Liu, H.X., 2013. Arterial offset optimization using archived high-resolution traffic signal data. Transport. Res. Part C 37, 131–144.
Hu, H., Wu, X., Liu, H.X., 2013. Managing oversaturated signalized arterials: a maximum flow based approach. Transport. Res. Part C 36, 196–211.
Jones, L.K., Deshpande, R., Gartner, N.H., Stamatiadis, C., Zou, F., 2013. Robust controls for traffic networks: the near-Bayes near-Minimax strategy.
Transport. Res. Part C 27, 205–218.
Kaelbling, L.P., Littman, M.L., Moore, A.W., 1996. Reinforcement learning: a survey. J. Artif. Intell. Res. 4, 237–285.
Li, Z., 2011. Modeling arterial signal optimization with enhanced cell transmission formulations. J. Transport. Eng. 137 (7), 445–454.
Liu, Y., Chang, G.-L., 2011. An arterial signal optimization model for intersections experiencing queue spillback and lane blockage. Transport. Res. Part C 19,
130–144.
Liu, F., Zeng, G., 2009. Study of genetic algorithm with reinforcement learning to solve the TSP. Expert Syst. Appl. 36, 6995–7001.
Maher, M., Liu, R., Ngoduy, D., 2013. Signal optimisation using the cross entropy method. Transport. Res. Part C 27, 76–88.
Maravall, D., Lope, J.de., Martin, H.J.A., 2009. Hybridizing evolutionary computation and reinforcement learning for the design of almost universal
controllers for autonomous robots. Neurocomputing 72, 887–894.
Martin, A., Brauer, W., 2000. Fuzzy model-based reinforcement learning. In: European Symposium on Intelligent Techniques (ESIT), Aachen Germany, 14–15
September 2000, pp. 14–15.
McTrans Center, 2008. TRANSYT-7F Release 11.3 Users Guide. University of Florida, Gainesville, Florida.
Ozan, C., Ceylan, H., Haldenbilen, S., 2014. Solving network design problem with dynamic network loading profiles using modified reinforcement learning
method. In: Proceedings of the 16th Meeting of the EURO Working Group on Transportation, Procedia – Social and Behavioral Sciences, vol. 111, pp. 38–
47.
Robertson, D.I., 1969. TRANSYT: A Traffic Network Study Tool, RRL Report, LR 253. Transport and Road Research Laboratory, Crowthorne.
Sutton, R.S., Barto, A.G., 1998. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, Massachusetts, USA/London, England.
Thorpe, T.L., 1997. Vehicle traffic light control using SARSA. Master’s Project Report, Computer Science Department. Colorado State University, Colo.
Vanhulsel, M., Janssens, D., Wets, G., Vanhoof, K., 2009. Simulation of sequential data: an enhanced reinforcement learning approach. Expert Syst. Appl. 36,
8032–8039.
Varaiya, P., 2013. Max pressure control of a network of signalized intersections. Transport. Res. Part C 36, 177–195.
Wiering, M.A., 2000. Learning to control traffic lights with multi-agent reinforcement learning. In: First World Congress of the Game Theory Society Games,
Utrecht, Netherlands, Basque Country University and Foundation, Spain.
Wong, S.C., 1995. Derivatives of the performance index for the traffic model from TRANSYT. Transport. Res. Part B 29 (5), 303–327.
Wong, S.C., 1996. Group-based optimisation of signal timings using the TRANSYT traffic model. Transport. Res. Part B 30 (3), 217–244.
Wong, S.C., 1997. Group-based optimisation of signal timings using parallel computing. Transport. Res. Part C 5 (2), 123–139.
Wong, S.C., Wong, W.T., Leung, C.M., Tong, C.O., 2002. Group-based optimization of a time-dependent TRANSYT traffic model for area traffic control.
Transport. Res. Part B 36, 291–312.
Wu, J., Xu, X., Zhang, P., Liu, C., 2011. A novel multi-agent reinforcement learning approach for job scheduling in grid computing. Fut. Gener. Comput. Syst.
27, 430–439.
Zhang, L., Yin, Y., Chen, S., 2013. Robust signal timing optimization with environmental concerns. Transport. Res. Part C 29, 55–71.
Zhu, F., Abdul Aziz, H.M., Qian, X., Ukkusuri, S.V., 2015. A junction-tree based learning algorithm to optimize network wide traffic control: a coordinated
multi-agent framework. Transport. Res. Part C. http://dx.doi.org/10.1016/j.trc.2014.12.009 (in press).