Optimizing Traffic Signal Control With Deep Reinforcement Learning: Exploring Decay Rate Tuning For Enhanced Exploration-Exploitation Trade-off
Saidulu Thadikamalla, Department of Computer Science, Indian Institute of Information Technology, Sri City, Chittoor, India ([email protected])
Piyush Joshi, Department of Computer Science, Indian Institute of Information Technology, Sri City, Chittoor, India ([email protected])
Balaji Raman, The Mathworks Inc, Bengaluru, India ([email protected])
Abstract—Traffic congestion is an increasingly prevalent global issue that necessitates the advancement of Traffic Signal Control technologies. Deep Reinforcement Learning has emerged as a prominent machine learning paradigm, leveraging trial-and-error experimentation in conjunction with Deep Neural Network models to facilitate autonomous and coordinated management of traffic signal lights spanning numerous intersections within a traffic network. Reinforcement Learning methodologies employ diverse exploration strategies such as Epsilon greedy, Softmax, Upper Confidence Bound, among others, to ascertain an optimal policy. In the pursuit of long-term rewards, an effective exploration strategy must adeptly balance the exploitation of the current policy with the exploration of novel alternatives. The Epsilon greedy algorithm stands out as a widely adopted approach for navigating this trade-off in Reinforcement Learning. However, its performance is intricately tied to the initially hand-crafted exploration rate. This work contributes significantly to the attainment of an optimal policy by primarily emphasizing two key aspects. Firstly, it underscores the criticality of meticulously tuning the ε-decay rate, which governs the progression of the exploration rate, in order to cultivate an optimal traffic signal control system. Secondly, this work delves into an in-depth exploration of the constraints inherent in the epsilon decay rate and offers potential avenues for future research in this domain.

Index Terms—Traffic signal control, Deep reinforcement learning, Exploration-exploitation trade-off, Epsilon greedy algorithm, Traffic congestion

I. INTRODUCTION

The exponential proliferation of motor vehicle utilization has precipitated a pronounced escalation in traffic congestion, notably in major urban conurbations. The expansion of the vehicular fleet has outpaced the growth of the road network, engendering a confluence of adverse consequences, including reduced vehicular speeds, queuing, protracted travel durations, and a plethora of indirect repercussions on environmental sustainability and overall quality of life. In the face of saturated public transportation systems and constrained infrastructure expansion, Traffic Signal Control (TSC) has emerged as an efficacious remedy for traffic management. However, conventional TSC technologies exhibit limitations primarily due to the stochastic nature of traffic dynamics. In contrast, the application of Artificial Intelligence (AI) methodologies, such as fuzzy logic, Q-Learning, and Deep Q-Learning, has facilitated the development of TSC systems endowed with the capability to dynamically adapt to rapidly shifting traffic scenarios, optimize traffic flow, and alleviate congestion.

Balancing the trade-off between exploration and exploitation plays a pivotal role in optimizing the efficiency of Traffic Signal Control systems. The Epsilon greedy strategy stands as a ubiquitous technique employed to strike this balance and derive an optimal policy for TSC.

This study sets out to meticulously investigate the ramifications of various decay rates associated with the parameter ε within the Epsilon greedy strategy on the efficacy of Deep Reinforcement Learning (DRL) methodologies in addressing TSC challenges. These innovative RL solutions are geared towards the explicit management of the exploration-exploitation conundrum with the overarching goal of enhancing traffic flow optimization and congestion reduction within dynamic traffic environments.

Moreover, this research underscores the intrinsic limitations inherent to the Epsilon greedy approach and proffers a suite of enhancements designed to ameliorate its deficiencies. Through
As a consequence, researchers have embraced artificial intelligence (AI) paradigms such as fuzzy logic [5], swarm intelligence, genetic algorithms, Q-Learning [6], and Q-Learning with neural networks for function approximation [7], [8] to cater to the exigencies of perpetually shifting traffic dynamics [19]. Fuzzy techniques, for instance, compute the optimal signal extension times using fuzzy logic, rendering them more adaptable compared to fixed-time controllers. However, it is imperative to note that these operations entail substantial computational resources, eventually undermining system efficiency. Consequently, researchers have investigated the

Here, Q(s_t, a_t) denotes the envisaged reward for state s_t and action a_t, α signifies the learning rate, R(s_t, a_t) represents the return for action a_t in state s_t, γ stands for the discount factor, s_{t+1} indicates the sequential state, and a denotes the best course of action in the subsequent state.
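The symbols in the preceding paragraph describe the tabular Q-learning update; the equation itself (numbered (1) in the paper) falls in a portion of the text not reproduced here, so the standard form it presumably refers to is reconstructed below as an assumption:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R(s_t, a_t) + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \quad (1)$$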
B. Double Deep Q Network

The Double Deep Q-Network (DDQN) [18], [19], an effective iteration of the Deep Q-Network [11]–[14], is tailored for Traffic Signal Control (TSC). By leveraging two neural networks—a core (main) network for Q-value estimation and a
target network to mitigate overestimation—DDQN improves traffic flow. The agent adopts a search policy for action selection and maintains transitions in a replay buffer. The core network updates via the Q-learning loss function, while the target network undergoes periodic updates to stay aligned. DDQN, coupled with an exploration strategy [20], emerges as a potent solution for TSC challenges.
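As a concrete illustration of the two-network update described above, the following is a minimal NumPy sketch of the Double DQN target computation; it is an illustrative reconstruction rather than the authors' implementation, and the array names are placeholders.

```python
import numpy as np

def ddqn_targets(q_online_next, q_target_next, rewards, dones, gamma=0.95):
    """Double DQN bootstrap target for a batch of transitions.

    q_online_next, q_target_next: (batch, n_actions) Q-values for the next
    states from the core (online) network and the target network.
    """
    # The core network *selects* the greedy next action ...
    best_actions = np.argmax(q_online_next, axis=1)
    # ... while the target network *evaluates* it, decoupling selection
    # from evaluation to curb overestimation of Q-values.
    evaluated = q_target_next[np.arange(len(best_actions)), best_actions]
    # Bootstrap only for non-terminal transitions.
    return rewards + gamma * (1.0 - dones) * evaluated
```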
C. Epsilon Greedy Search Algorithm

The Epsilon greedy search algorithm [20] epitomizes a prevalent exploration strategy in RL, adept at balancing exploration and exploitation. Governed by an epsilon decay rate, this strategy regulates the reduction of the exploration rate, denoted by "ε", over time. In the context of TSC, the Epsilon greedy method facilitates optimal traffic signal determination, adjusting the exploration rate dynamically. At each time step, the algorithm probabilistically selects the optimal traffic signal with probability (1 − ε_t) or randomly explores with probability ε_t.

$$a_t = \begin{cases} \arg\max_{a} Q(s_t, a) & \text{with probability } 1 - \epsilon_t \\ \text{a random action} & \text{with probability } \epsilon_t \end{cases} \quad (2)$$

In equation (2), the action taken at time t is denoted by a_t, the envisaged reward for action a at time t is represented by Q(s_t, a), and ε_t signifies the chance of selecting an action randomly at time t. The algorithm iterates until convergence to the optimal policy, modulating exploration via decay in ε.
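A minimal Python sketch of the selection rule in equation (2), included for illustration only; the Q-values would come from the DDQN's core network, and the handling of ε over episodes is examined in the sections that follow.

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """Return a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit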
IV. PROBLEM STATEMENT

The challenge of exploration holds paramount importance in the training of Reinforcement Learning agents for Traffic Signal Control. Within this domain, the widely employed exploration strategy becomes a focal point. This strategy entails the RL agent randomly exploring with a probability denoted as "ε" while favoring the selection of the best-known action with a probability of "(1 − ε)". However, it is imperative to acknowledge the inherent limitations of the strategy. It adheres to a fixed exploration rate, which can introduce perturbations into the learning process, yielding suboptimal results. The static nature of the exploration rate can further engender inefficient exploration, necessitating meticulous tuning to enable effective training.

The principal aim of this research revolves around the comprehensive analysis of epsilon decay rates within the context of Traffic Signal Control through Reinforcement Learning. Specifically, our objective is to scrutinize the impact of dynamic exploration rates, encompassing epsilon decay, on the training and performance of RL agents tasked with the control of traffic signals. The central challenge we address pertains to the design and evaluation of epsilon decay strategies that adapt the exploration probability over time.

This research encompasses several pivotal facets, including:
1) The integration of epsilon decay exploration strategies within RL-based Traffic Signal Control.
2) A thorough and extensive examination of the ramifications of diverse epsilon decay rate profiles on the management of traffic and the reduction of congestion.
3) An in-depth exploration into how dynamically modulating exploration rates affects the efficacy of RL agents in the domain of traffic signal control.

To gauge the impact of epsilon decay rates, we will leverage a real-world traffic simulator. This simulator will facilitate the generation of a diverse spectrum of traffic scenarios, catering to both training and testing requirements. The exploration strategies, featuring a gamut of epsilon decay profiles, will be trained on this dataset and subsequently subjected to assessment on a held-out set of traffic conditions. The evaluation will pivot around critical performance metrics, including:
• The average travel time of vehicles.
• The length of vehicle queues at intersections.

This research is poised to provide invaluable insights into the optimization of exploration strategies in RL-based Traffic Signal Control, with a specific emphasis on epsilon decay rates. The anticipated findings are poised to make substantial contributions to the realm of more efficient and adaptive traffic management in urban environments.

V. METHODOLOGY

In the realm of Traffic Signal Control research, our central research objective revolves around a methodological exploration and comparative analysis of distinct epsilon decay rates within the Epsilon greedy exploration strategy as applied to reinforcement learning agents. The core focus of our investigation is to discern the intricate interplay between these epsilon decay rates and their influence on the equilibrium between exploration and exploitation within the field of traffic signal control. This methodological framework is meticulously structured to address several pivotal facets critical to our research:
1) Data Collection: We systematically procure comprehensive data similar to real-world traffic from a sophisticated traffic simulator, incorporating a wealth of information, involving counts of vehicles, speeds, and queue lengths across numerous intersections.
2) Agent Training: Our training paradigm features the application of a robust Double Deep Q Network architecture, enabling our reinforcement learning agent to interact dynamically with the traffic simulator. This interaction refines the traffic signal control policies of the agent as it systematically alters the epsilon decay rates in its exploration strategy.
3) Exploration Strategies: We rigorously implement the Epsilon greedy exploration strategy, incorporating an array of epsilon decay rates. These dynamic epsilon decay rates significantly shape the agent's action selection process, systematically influencing the exploration and exploitation trade-offs.
4) Performance Metrics: Our evaluation is anchored in the meticulous scrutiny of diverse epsilon decay rates.
Key performance metrics, such as the average duration of vehicle journeys and the number of vehicles waiting in queues at intersections, serve as integral gauges for assessing the efficacy of traffic management under diverse exploration strategies.
5) Comparative Analysis: Central to our research is a rigorous comparative analysis, designed to delineate the performance disparities inherent to various epsilon decay rates. Our overarching aim is to pinpoint the epsilon decay rate that engenders optimal traffic flow efficiency while concurrently mitigating congestion.

Fig. 2: Visualization of a generated intersection in SUMO simulator

Through this comprehensive methodology, our research seeks to shed light on the nuanced dynamics of epsilon decay rates within the Epsilon greedy strategy, providing valuable insights into the intricacies of exploration and exploitation in Traffic Signal Control. These insights are poised to inform the development of more refined and responsive traffic regulation methods in the realm of reinforcement learning.
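To tie the five methodological components above together, the outline below is a deliberately simplified, hypothetical Python training loop; env, agent, and memory are placeholder wrappers for the SUMO interface, the DDQN agent, and the prioritized replay buffer, and none of the names are drawn from the authors' code.

```python
def train(env, agent, memory, episodes=50, batch_size=100):
    """Sketch of one training configuration for a single epsilon-decay rate."""
    for episode in range(episodes):
        state = env.reset()                      # (1) fresh simulated traffic scenario
        done = False
        while not done:
            action = agent.act(state)            # (3) epsilon-greedy choice of signal phase
            next_state, reward, done = env.step(action)
            memory.add(state, action, reward, next_state, done)
            agent.learn(memory.sample(batch_size))  # (2) DDQN update from replayed transitions
            state = next_state
        agent.decay_epsilon()                    # exploration schedule under study
        env.log_metrics()                        # (4) queue length and average travel time
    # (5) comparative analysis happens across runs with different decay rates
```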
are regularly utilized for establishing vehicle departure times [27], [28]. The Weibull distribution, featuring an invariant shape parameter of 2, accounts for departure time diversity while preserving consistency. Concurrently, using the normal distribution, departure times can be generated with a mean-centered distribution, thereby improving the naturalism of traffic simulation. In both instances, distribution parameters are deliberately chosen to represent the intended traits of the traffic scenario. The Weibull distribution [27], characterized by a shape parameter of 2, governs the timing of vehicle arrivals. This configuration effectively captures real-world traffic scenario intricacies, creating a sturdy platform for a systematic investigation of traffic signal control strategies derived from RL principles. Modeled traffic flow is generated within each episode for a time frame of 3800 seconds, incorporating both straight (Northbound-Southbound and Eastbound-Westbound) movements and turning maneuvers (Left-turns and Right-turns). Detailed traffic movement configurations are outlined in Table I.
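For illustration, a small NumPy sketch of how Weibull-distributed departure times with shape parameter 2 might be generated and mapped onto the 3800 s episode; the rescaling step is an assumption about the implementation, not a detail stated in the paper.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def weibull_departure_times(n_vehicles, horizon=3800, shape=2.0):
    """Draw departure offsets from a Weibull(shape=2) distribution and
    stretch them onto the simulation horizon (seconds), sorted so that
    vehicles enter the network in chronological order."""
    draws = rng.weibull(shape, size=n_vehicles)
    times = draws / draws.max() * horizon   # assumed rescaling onto [0, horizon]
    return np.sort(times).astype(int)
```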
D. Experiment parameter configurations

Following multiple experimental investigation trials and carefully adjusting parameters, we set the following specifications (refer to Table I). The cumulative number of training episodes (N) is set at 50, illustrating the collective iterations throughout which the RL agent engages with the elements of the traffic environment to refine its governing policy. The utmost limit of the Prioritized Experience Replay (PER) memory, labeled as max_size, is defined as 20,000, operating as a repository facility for preceding incidents. A mini-batch size (b_size) of 100 is employed, governing the count of incidents selected from the PER repository for policy updates during each training cycle. The learning rate (learning_rate) is established at 0.001, influencing the magnitude of policy adjustments based on unseen incidents. The gamma parameter (γ), with a value of 0.95, impacts the weight given to future rewards in Q-value updates. The minimum permissible epsilon value (ε_min) is set to 0.02, controlling the gradual decline of exploration. The update interval for the target network (C) specifies how frequently the target network aligns with the core network's weights, established to occur every 5 simulation cycles. Ultimately, designated durations for the yellow light interval (Y_Dur) and green light interval (G_Dur) of traffic lights are set as 3 seconds and 15 seconds, correspondingly, delineating TSC phases encompassing transition and operation. Together, these configurations establish the experimental conditions that form the basis of the learning curve of the RL agent for traffic signal coordination.

TABLE I: Parameter settings for experiment

Parameter | Description | Value
N | Cumulative number of training episodes | 50
max_size | Limit of the PER memory | 20,000
b_size | Sample batch size | 100
learning_rate | Learning rate | 0.001
γ | Gamma (discount) parameter | 0.95
ε_min | Minimum permissible epsilon | 0.02
C | Update interval for target network (simulation cycles) | 5
Y_Dur | Yellow light duration | 3 s
G_Dur | Green light duration | 15 s
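Table I fixes ε_min at 0.02, but the exact functional form of the ε-decay schedule is not spelled out in this excerpt; the sketch below assumes a simple multiplicative decay applied once per episode, purely to illustrate how a decay-rate value and ε_min interact.

```python
def epsilon_schedule(decay_rate, episodes=50, eps_start=1.0, eps_min=0.02):
    """Illustrative (assumed) schedule: epsilon *= decay_rate after each
    episode, clipped at eps_min from Table I."""
    eps, schedule = eps_start, []
    for _ in range(episodes):
        schedule.append(eps)
        eps = max(eps_min, eps * decay_rate)
    return schedule

# Example: epsilon_schedule(0.6)[:4] -> approximately [1.0, 0.6, 0.36, 0.216]
```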
E. Performance assessment metrics

To appraise the effectiveness of different traffic signal control methodologies, two metrics are utilized in our experiments:
1) Average Travel Time (ATT): This indicator is employed as the key performance benchmark, illustrating the time taken by vehicles to pass through the intersection. It is determined by dividing the cumulative travel time of all vehicles by the total count of vehicles.

$$\text{Average Travel Time} = \frac{1}{N} \sum_{i=1}^{N} \text{Travel Time}_i \quad (3)$$

2) Queue Length (QL): The queue length of a lane represents the total number of vehicles waiting in line on that lane. It is an essential metric for evaluating congestion and traffic monitoring.

These metrics are integrated into the reward and evaluation mechanisms, guiding the agent toward decisions that enhance traffic conditions and overall system effectiveness. The reinforcement learning agent endeavors to learn policies that result in diminished queue lengths and shorter average travel times.
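A small, self-contained Python sketch of how these two metrics could be tracked during a simulation run; the bookkeeping interface is hypothetical and simply mirrors the definitions above (vehicle entry/exit times for ATT, per-step halted-vehicle counts for QL).

```python
class MetricsTracker:
    """Record per-vehicle travel times and per-step queue lengths."""

    def __init__(self):
        self.entry_time = {}     # vehicle id -> time it entered the network
        self.travel_times = []   # completed trips, in seconds
        self.queue_samples = []  # total halted vehicles, one sample per step

    def vehicle_entered(self, vid, t):
        self.entry_time[vid] = t

    def vehicle_left(self, vid, t):
        self.travel_times.append(t - self.entry_time.pop(vid))

    def record_queue(self, halted_per_lane):
        self.queue_samples.append(sum(halted_per_lane))

    def average_travel_time(self):
        # Equation (3): cumulative travel time divided by the number of vehicles.
        return sum(self.travel_times) / max(len(self.travel_times), 1)
```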
VII. EXPERIMENTAL RESULTS AND ANALYSIS

In our analysis, we examined Queue Length and Average Travel Time under various ε-decay rates in the Epsilon greedy exploration strategy. To enhance clarity, four graphs were created, each corresponding to a specific performance metric at varying ε-decay rates. Figure 3 succinctly illustrates the performance trends for each training episode, with each curve representing the evolution of the reinforcement learning agent over 50 episodes. We provide a nuanced understanding of the agent's learning process by showcasing individual episodes. Future work may explore presenting averaged performance metrics to highlight consistent improvements in system efficiency.

1. Average Travel Time vs. ε-Decay Rate: The first graph 3a illustrates the correlation between Average Travel Time and the ε-Decay Rate. Each data point on the graph corresponds to a specific ε-Decay Rate. By examining this graph, it becomes apparent how different ε-Decay Rates impact the efficiency of vehicle travel times. Lower ε-Decay Rates (0.2 and 0.6) result in shorter travel times, indicating faster traffic flow, while higher ε-Decay Rates (0.4 and 0.5) lead to extended travel durations due to congestion.
Fig. 3: Effect of epsilon decay on reinforcement learning agents in traffic signal control systems: (a) average travel time of vehicles for different decay values during training; (b) queue length of vehicles for different decay values during training; (c) average travel time of vehicles for different decay values during testing; (d) queue length of vehicles for different decay values during testing.
2. Queue Length vs. ε-Decay Rate: The second graph 3b depicts the relationship between Queue Length (the total number of vehicles queuing at the intersection) and the ε-Decay Rate. Each point on the graph corresponds to a different ε-Decay Rate, ranging from 0.1 to 0.9. By visually analyzing this graph, it becomes evident how the Queue Length fluctuates as the exploration-exploitation balance changes. Lower ε-Decay Rates (0.1 and 0.2) lead to reduced congestion, resulting in lower Queue Length values, while higher ε-Decay Rates (0.4 and 0.5) correspond to more pronounced congestion, resulting in higher Queue Length.

3. Average Travel Time for Testing vs. ε-Decay Rate: The third graph 3c presents the Average Travel Time specifically for testing scenarios across various ε-Decay Rates. This graph allows us to focus on the performance during testing and highlights the effect of different ε-Decay Rates on travel times in the testing phase.

4. Queue Length for Testing vs. ε-Decay Rate: The fourth graph 3d provides a view of the Queue Length in the testing phase at different ε-Decay Rates. This graph helps assess congestion levels during testing for various exploration-exploitation strategies.

These graphs not only provide a clearer representation of the experimental results but also help in visualizing the relationship between performance metrics and exploration strategies, facilitating a more comprehensive understanding of the impact of ε-Decay Rates on Traffic Signal Control.

Continuing with our presentation of the experimental results, we have also compiled a comprehensive metric table, as seen in Table II. This table provides a structured overview of the performance metrics for different ε-Decay Rates, ranging from 0.1 to 0.9. Two crucial performance indicators, Queue Length and Average Travel Time, were measured for each ε-decay rate to investigate the influence of exploration-exploitation strategies on traffic signal optimization.

In addition to the tabulated results shown in Table II, Figure 4 provides a visual comparison of the experiment results, illustrating the variation in Queue Length and Average Travel Time under various ε-Decay Rates. Our analysis revealed distinct and nuanced patterns in the relationship between Queue Length and Average Travel Time as we manipulated the ε-Decay Rate. Significantly, when the ε-Decay Rate was set to
0.6, we observed a pronounced reduction in both QL and ATT, signifying an optimal equilibrium between exploration and exploitation strategies. Conversely, higher ε-Decay Rates (0.8 and 0.4) led to escalated values of both QL and ATT, implying that excessively exploratory or exploitative behaviors had adverse effects on traffic conditions. On the other hand, lower ε-Decay Rates (0.2 and 0.9) exhibited favorable outcomes in either QL or ATT individually, highlighting a potential trade-off between these two critical metrics. Additionally, an ε-Decay Rate of 0.5 resulted in the highest values for both ATT and QL, indicative of suboptimal performance in traffic control. Analyzing these outcomes, we derive the following insights:

• An ε-Decay Rate of 0.6 appears to be optimal for minimizing both QL and ATT, showcasing the most favorable outcomes for both metrics.
• Elevated ε-Decay rates (0.8 and 0.4) result in increased QL and ATT, signifying that overly exploratory or exploitative behavior may lead to deteriorated traffic conditions.
• Lower ε-Decay rates (0.2 and 0.9) exhibit favorable results in either QL or ATT individually, elucidating an inherent trade-off between these metrics.
• An ε-Decay Rate of 0.5 engenders the highest values for both ATT and QL, indicative of suboptimal traffic control performance.

TABLE II: Results of the Experiment

ε-Decay Rate | Queue Length (m) | Avg. Travel Time (s)
0.1 | 386.68 | 160.01
0.2 | 304.23 | 143.31
0.3 | 314.45 | 153.39
0.4 | 607.21 | 268.01
0.5 | 520.31 | 311.06
0.6 | 164.88 | 165.65
0.7 | 338.18 | 171.24
0.8 | 515.43 | 242.76
0.9 | 375.19 | 185.97

Fig. 4: Comparison of Queue Length and Avg. Travel Time for Different ε-Decay Rates
The observation that Average Travel Time is occasionally lower even when Queue Length is high can be elucidated by comprehending the intricate relationship between these two metrics within the perspective of traffic signal supervision. QL quantifies the number of vehicles queuing or waiting in traffic lanes, with elevated values denoting congestion and slower traffic flow. In contrast, ATT represents the average travel time for vehicles to traverse the traffic network, with lower values indicating expedited travel times. While high QL often signals congestion, the efficiency of traffic signal optimization can counterintuitively lead to lower ATT even in congested scenarios. Conversely, lower QL may imply reduced queuing, yet overly conservative traffic signals that limit vehicle flow can result in longer travel times. The equilibrium between these two metrics is intricately tied to the prevailing traffic conditions and the efficacy of the traffic signal control strategy in play.

These results underscore the imperative of judiciously selecting an appropriate ε-Decay Rate, accounting for specific traffic conditions and the overarching objectives of traffic signal control. Achieving the desired equilibrium between QL and ATT hinges upon the effectiveness of the chosen exploration strategy and the overarching aim of optimizing traffic flow.

VIII. CONCLUSION

In this comprehensive investigation of Traffic Signal Control, we have delved into the detailed intricacies of ε-decay rates within the Epsilon greedy exploration strategy. Using the SUMO traffic simulation tool and a Reinforcement Learning agent powered by the Prioritized Experience Replay integrated into the Double Deep Q Network, we have uncovered critical insights.
Our investigation has underscored the pivotal role of the ε-decay rate in shaping the performance of TSC systems. The optimal ε-decay rate of 0.6 has emerged as a key driver for minimizing both Queue Length and Average Travel Time, indicating a dual reduction in congestion and enhanced traffic efficiency. However, we have also discerned that the interplay between these metrics at varying decay rates necessitates precision in tailoring exploration strategies to specific traffic objectives. The results of this study open avenues for future technical inquiries: dynamic exploration strategies that autonomously adjust the ε-decay rate in response to real-time traffic conditions, multi-intersection coordination within complex urban networks, enhanced Reinforcement Learning techniques, and addressing real-world deployment challenges in actual urban settings.

It is paramount to align the chosen traffic signal control objectives with the unique imperatives of the specific traffic ecosystem. This alignment ensures the judicious calibration of the ε-Decay Rate within the exploration strategy, optimally harmonizing the trade-off between exploration and exploitation to achieve the desired outcomes in traffic flow dynamics.

In summary, this study advances the field of finely tuned traffic signal control systems, offering a foundation for the development of more efficient, adaptive, and technically sophisticated traffic management solutions to meet the evolving demands of urban environments.
ACKNOWLEDGMENT

This research received funding from the DST NMICPS Technology Innovation Hub On Autonomous Navigation Foundation (TiHAN IIT Hyderabad).

REFERENCES

[1] Uddin, Azeem. “Traffic congestion in Indian cities: Challenges of a rising power.” Kyoto of the cities, Naples (2009).
[2] Serafini, Paolo, and Walter Ukovich. “A mathematical model for the fixed-time traffic control problem.” European Journal of Operational Research 42.2 (1989): 152-165.
[3] Zhao, Dongbin, Yujie Dai, and Zhen Zhang. “Computational intelligence in urban traffic signal control: A survey.” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42.4 (2011): 485-494.
[4] Gartner, Nathan H., and Mohammed Al-Malik. “Combined model for
signal control and route choice in urban traffic networks.” Transportation
Research Record 1554.1 (1996): 27-35.
[5] Askerzade, I. N., and Mustafa Mahmood. “Control the extension time of
traffic light in single junction by using fuzzy logic.” International Journal
of Electrical Computer Sciences IJECS–IJENS 10.2 (2010): 48-55.
[6] Liao, Yongquan, and Xiangjun Cheng. “Study on traffic signal control
based on q-learning.” 2009 Sixth International Conference on Fuzzy
Systems and Knowledge Discovery. Vol. 3. IEEE, 2009.
[7] Mousavi, Seyed Sajad, Michael Schukat, and Enda Howley. “Traffic light
control using deep policy-gradient and value-function-based reinforce-
ment learning.” IET Intelligent Transport Systems 11.7 (2017): 417-423.
[8] Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An
introduction. MIT press, 2018.
[9] Thrun, Sebastian. “Efficient exploration in reinforcement learning.” Tech-
nical Report. Carnegie Mellon University (1992).
[10] Brafman, Ronen I., and Moshe Tennenholtz. “R-max-a general polyno-
mial time algorithm for near-optimal reinforcement learning.” Journal of
Machine Learning Research 3.Oct (2002): 213-231.
[11] Ishii, Shin, Wako Yoshida, and Junichiro Yoshimoto. “Control of ex-
ploitation–exploration meta-parameter in reinforcement learning.” Neural
networks 15.4-6 (2002): 665-687.
[12] Watkins, Christopher John Cornish Hellaby. “Learning from delayed
rewards.” (1989).
[13] Chapelle, Olivier, and Lihong Li. “An empirical evaluation of thompson
sampling.” Advances in neural information processing systems 24 (2011).
[14] Caelen, Olivier, and Gianluca Bontempi. “Improving the exploration
strategy in bandit algorithms.” Learning and Intelligent Optimization:
Second International Conference, LION 2007 II, Trento, Italy, December
8-12, 2007. Selected Papers 2. Springer Berlin Heidelberg, 2008.
[15] Van Hasselt, Hado, Arthur Guez, and David Silver. “Deep reinforcement
learning with double q-learning.” Proceedings of the AAAI conference
on artificial intelligence. Vol. 30. No. 1. 2016.
[16] Behrisch, Michael, et al. “SUMO–simulation of urban mobility: an
overview. Proceedings of SIMUL 2011, The Third International Con-
ference on Advances in System Simulation. ThinkMind, 2011.
[17] Vidali, Andrea, et al. “A Deep Reinforcement Learning Approach to
Adaptive Traffic Lights Management.” WOA. 2019.
[18] Schaul, Tom, et al. “Prioritized experience replay.” arXiv preprint
arXiv:1511.05952 (2015).
[19] Miller, Alan J. “Road traffic flow considered as a stochastic process.”
Mathematical Proceedings of the Cambridge Philosophical Society. Vol.
58. No. 2. Cambridge University Press, 1962.
[20] Watkins, Christopher John Cornish Hellaby. “Learning from delayed
rewards.” (1989).
[21] Thrun, S.B.: Efficient exploration in reinforcement learning. Technical
Report CMU-CS-92-102, Carnegie Mellon University, Pittsburgh, PA,
USA (1992)
[22] Caelen, Olivier, and Gianluca Bontempi. “Improving the exploration
strategy in bandit algorithms.” International Conference on Learning and
Intelligent Optimization. Berlin, Heidelberg: Springer Berlin Heidelberg,
2007.
[23] Kuleshov, Volodymyr, and Doina Precup. “Algorithms for multi-armed bandit problems.” arXiv preprint arXiv:1402.6028 (2014).
[24] Caelen, Olivier, and Gianluca Bontempi. “Improving the exploration strategy in bandit algorithms.” Learning and Intelligent Optimization: Second International Conference, LION 2007 II, Trento, Italy, December 8-12, 2007. Selected Papers 2. Springer Berlin Heidelberg, 2008.
[25] Schaul, Tom, et al. “Prioritized experience replay.” arXiv preprint arXiv:1511.05952 (2015).
[26] Behrisch, Michael, et al. “SUMO–simulation of urban mobility: an overview.” Proceedings of SIMUL 2011, The Third International Conference on Advances in System Simulation. ThinkMind, 2011.
[27] Hallinan Jr, Arthur J. “A review of the Weibull distribution.” Journal of Quality Technology 25.2 (1993): 85-93.
[28] Ahsanullah, Mohammad, et al. “Normal distribution.” Normal and Student’s t Distributions and Their Applications (2014): 7-50.
[29] Agarwal, Rishabh, et al. “Deep reinforcement learning at the edge of the statistical precipice.” Advances in neural information processing systems 34 (2021): 29304-29320.
[30] Genders, Wade, and Saiedeh Razavi. “Using a deep reinforcement learning agent for traffic signal control.” arXiv preprint arXiv:1611.01142 (2016).