
Optimizing Traffic Signal Control with Deep Reinforcement Learning: Exploring Decay Rate Tuning for Enhanced Exploration-Exploitation Trade-off

2024 11th International Conference on Signal Processing and Integrated Networks (SPIN) | DOI: 10.1109/SPIN60856.2024.10511583

1st Saidulu Thadikamalla, Department of Computer Science, Indian Institute of Information Technology, Sri City, Chittoor, India, [email protected]
2nd Piyush Joshi, Department of Computer Science, Indian Institute of Information Technology, Sri City, Chittoor, India, [email protected]
3rd Balaji Raman, The MathWorks Inc., Bengaluru, India, [email protected]
4th Manipriya S, Department of Computer Science, Indian Institute of Information Technology, Sri City, Chittoor, India, [email protected]
5th Rakesh Kumar Sanodiya, Department of Computer Science, Indian Institute of Information Technology, Sri City, Chittoor, India, [email protected]

Abstract—Traffic congestion is an increasingly prevalent global issue that necessitates the advancement of Traffic Signal Control technologies. Deep Reinforcement Learning has emerged as a prominent machine learning paradigm, leveraging trial-and-error experimentation in conjunction with Deep Neural Network models to facilitate autonomous and coordinated management of traffic signal lights spanning numerous intersections within a traffic network. Reinforcement Learning methodologies employ diverse exploration strategies such as Epsilon greedy, Softmax, Upper Confidence Bound, among others, to ascertain an optimal policy. In the pursuit of long-term rewards, an effective exploration strategy must adeptly balance the exploitation of the current policy with the exploration of novel alternatives. The Epsilon greedy algorithm stands out as a widely adopted approach for navigating this trade-off in Reinforcement Learning. However, its performance is intricately tied to the initially hand-crafted exploration rate. This work contributes significantly to the attainment of an optimal policy by primarily emphasizing two key aspects. Firstly, it underscores the criticality of meticulously tuning the ε-decay rate, which governs the progression of the exploration rate, in order to cultivate an optimal traffic signal control system. Secondly, this work delves into an in-depth exploration of the constraints inherent in the epsilon decay rate and offers potential avenues for future research in this domain.

Index Terms—Traffic signal control, Deep reinforcement learning, Exploration-exploitation trade-off, Epsilon greedy algorithm, Traffic congestion

I. INTRODUCTION

The exponential proliferation of motor vehicle utilization has precipitated a pronounced escalation in traffic congestion, notably in major urban conurbations. The expansion of the vehicular fleet has outpaced the growth of the road network, engendering a confluence of adverse consequences, including reduced vehicular speeds, queuing, protracted travel durations, and a plethora of indirect repercussions on environmental sustainability and overall quality of life. In the face of saturated public transportation systems and constrained infrastructure expansion, Traffic Signal Control (TSC) has emerged as an efficacious remedy for traffic management. However, conventional TSC technologies exhibit limitations primarily due to the stochastic nature of traffic dynamics. In contrast, the application of Artificial Intelligence (AI) methodologies, such as fuzzy logic, Q-Learning, and Deep Q-Learning, has facilitated the development of TSC systems endowed with the capability to dynamically adapt to rapidly shifting traffic scenarios, optimize traffic flow, and alleviate congestion.

Balancing the trade-off between exploration and exploitation plays a pivotal role in optimizing the efficiency of Traffic Signal Control systems. The Epsilon greedy strategy stands as a ubiquitous technique employed to strike this balance and derive an optimal policy for TSC.

This study sets out to meticulously investigate the ramifications of various decay rates associated with the ε parameter within the Epsilon greedy strategy on the efficacy of Deep Reinforcement Learning (DRL) methodologies in addressing TSC challenges. These innovative RL solutions are geared towards the explicit management of the exploration-exploitation conundrum with the overarching goal of enhancing traffic flow optimization and congestion reduction within dynamic traffic environments.

Moreover, this research underscores the intrinsic limitations inherent to the Epsilon greedy approach and proffers a suite of enhancements designed to ameliorate its deficiencies. Through extensive experimentation with diverse, randomly selected epsilon decay rates, the study unveils the profound impact of decay rate selection on the effectiveness of Reinforcement Learning models. The findings underscore the indispensable nature of adaptive exploration techniques rooted in Reinforcement Learning for the augmentation of TSC systems.

We also highlight the limitations of the Epsilon greedy approach and propose improvements to address them. Conducting experiments with different randomly chosen epsilon decay rates demonstrated that the selection of the decay rate had a substantial effect on the effectiveness of the Reinforcement Learning model. To achieve a suitable trade-off between exploration and exploitation, it is imperative to employ adaptive exploration techniques in TSC systems that are based on RL. This study emphasizes the necessity of such techniques in enhancing the performance of RL-based TSC systems.

The paper is structured as follows: In Section 1, we introduce the paper's background and motivation. Section 2 provides an overview of related work, focusing on reinforcement learning-based traffic signal control systems and the importance of exploration strategies. Section 3 presents the preliminaries necessary to understand our testing model. Section 4 discusses the problem statement in detail. In Section 5, we delve into the methodology used in this study. The experimental setup, including parameters and data collection, is covered in Section 6. The experimental results and analysis are discussed in Section 7, shedding light on the findings and their implications. Finally, Section 8 concludes the paper, summarizing the main contributions and providing an overview of the study's key takeaways.

II. RELATED WORK

Traffic Signal Control is universally acknowledged as the paramount and efficacious approach to traffic management. Conventional TSC systems employ fixed-time strategies where each signal operates on a predetermined schedule, merging optimization methodologies with mathematical models [2] [3] [4]. Nonetheless, these traditional TSC approaches are inherently constrained in dynamically evolving traffic environments, resulting in traffic congestion and extended travel times. The intricacies of traffic signal optimization in contemporary traffic scenarios elude conventional methods. In light of this, extensive research has been conducted to elevate the sophistication of traffic management systems at intersections.

As a consequence, researchers have embraced artificial intelligence (AI) paradigms such as fuzzy logic [5], swarm intelligence, genetic algorithms, Q-Learning [6], and Q-Learning with neural networks for function approximation [7] [8] to cater to the exigencies of perpetually shifting traffic dynamics [19]. Fuzzy techniques, for instance, compute the optimal signal extension times using fuzzy logic, rendering them more adaptable compared to fixed-time controllers. However, it is imperative to note that these operations entail substantial computational resources, eventually undermining system efficiency. Consequently, researchers have investigated the applicability of Q-Learning to effectively manage and optimize traffic flow with the aim of congestion reduction.

Q-Learning algorithms have demonstrated their proficiency in learning from stochastic environments, while exploration strategies have been harnessed to enhance the efficacy of Reinforcement Learning in the realm of TSC [17]. A multitude of strategies have been devised to strike a balance between exploration and exploitation in Reinforcement Learning, including those employing counters [9], biologically-inspired model learning [10], or reward comparison. Nevertheless, the Epsilon greedy technique often takes precedence in the majority of cases, as advocated by Sutton et al. [8] [12]. This preference for the Epsilon greedy strategy stems from its propensity to yield near-optimal outcomes in a diverse array of applications with the simple adjustment of a single parameter, obviating the need to store any exploration-specific data. However, despite its prevalence, the research landscape remains bereft of strategies for adapting the exploration rate of the Epsilon greedy method in accordance with the learning progress. Only a limited number of strategies, such as ε-first or decreasing-ε [14], take temporal considerations into account while reducing the exploration probability. It is patently clear from the foregoing discussions that the decay rate of epsilon in the Epsilon greedy approach can wield substantial influence over the performance of the RL model. Therefore, meticulous calibration of the ε-decay rate [15] is imperative when employing the Epsilon greedy strategy as an exploration technique, particularly in the development of an optimal TSC system, integral to the broader Intelligent Traffic Signal Control System (ITSC).

III. PRELIMINARIES

A. Reinforcement Learning

Reinforcement Learning (RL) [9] is a pivotal machine learning paradigm wherein an agent learns to maximize cumulative rewards by interacting with the environment. Through actions and subsequent assessments in the form of rewards or penalties, the agent refines its decision-making process. Q-Learning [10]–[13] stands as a prominent RL algorithm, tasked with approximating the optimal policy (π*) without prior knowledge.

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ R(s_t, a_t) + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \big] \qquad (1)$$

Here, Q(s_t, a_t) denotes the envisaged reward for state s_t and action a_t, α signifies the learning rate, R(s_t, a_t) represents the return for action a_t in state s_t, γ stands for the discount factor, s_{t+1} indicates the subsequent state, and a denotes the best course of action in the subsequent state.
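To make the update in equation (1) concrete, the following is a minimal tabular sketch in Python; the state and action space sizes are illustrative placeholders, while the learning rate and discount factor mirror the values later listed in Table I. It is a sketch of the standard rule, not the paper's implementation.

import numpy as np

# Tabular Q-Learning update corresponding to equation (1).
# n_states and n_actions are illustrative; alpha and gamma follow Table I.
n_states, n_actions = 100, 4
alpha, gamma = 0.001, 0.95
Q = np.zeros((n_states, n_actions))

def q_update(s_t, a_t, reward, s_next):
    # TD target: R(s_t, a_t) + gamma * max_a Q(s_{t+1}, a)
    td_target = reward + gamma * np.max(Q[s_next])
    # Move the current estimate toward the target by the learning rate
    Q[s_t, a_t] += alpha * (td_target - Q[s_t, a_t])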
B. Double Deep Q Network

The Double Deep Q-Network (DDQN) [18], [19], an effective iteration of the Deep Q-Network [11]–[14], is tailored for Traffic Signal Control (TSC). By leveraging two neural networks, a core (main) network for Q-value estimation and a target network to mitigate overestimation, DDQN improves traffic flow. The agent adopts a search policy for action selection and maintains transitions in a replay buffer. The core network updates via the Q-learning loss function, while the target network undergoes periodic updates to stay aligned. DDQN, coupled with an exploration strategy [20], emerges as a potent solution for TSC challenges.
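The division of labour between the two networks can be illustrated with a simplified, framework-agnostic sketch; here NumPy arrays stand in for the outputs of the core and target networks on a batch of next states, and the function name is ours rather than one from the paper.

import numpy as np

# Simplified DDQN target computation: the core (online) network selects the
# next action, the target network evaluates it.
def ddqn_targets(rewards, q_core_next, q_target_next, gamma=0.95, done=None):
    rewards = np.asarray(rewards, dtype=float)
    best_actions = np.argmax(q_core_next, axis=1)                      # argmax_a Q_core(s', a)
    evaluated = q_target_next[np.arange(len(rewards)), best_actions]   # Q_target(s', best_action)
    if done is None:
        done = np.zeros_like(rewards)
    # y = r + gamma * Q_target(s', argmax_a Q_core(s', a)), zeroed for terminal states
    return rewards + gamma * (1.0 - done) * evaluated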
C. Epsilon Greedy Search Algorithm

The Epsilon greedy search algorithm [20] epitomizes a prevalent exploration strategy in RL, adept at balancing exploration and exploitation. Governed by an epsilon decay rate, this strategy regulates the reduction of the exploration rate, denoted by “ε”, over time. In the context of TSC, the Epsilon greedy method facilitates optimal traffic signal determination, adjusting the exploration rate dynamically. At each time step, the algorithm probabilistically selects the optimal traffic signal with probability (1 − ε_t) or randomly explores with probability ε_t.

$$a_t = \begin{cases} \arg\max_{a} Q(s_t, a) & \text{with probability } 1 - \epsilon_t \\ \text{a random action} & \text{with probability } \epsilon_t \end{cases} \qquad (2)$$

In equation (2), the action taken at time t is denoted by a_t, the envisaged reward for action a at time t is represented by Q_t(a), and ε_t signifies the probability of selecting an action randomly at time t. The algorithm iterates until convergence to the optimal policy, modulating exploration via decay in ε.
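A minimal sketch of this selection rule, with a decay step appended, is shown below. The paper compares decay-rate settings from 0.1 to 0.9 but does not spell out the exact decay formula in this excerpt; multiplicative per-episode decay clipped at the ε_min of Table I is one common interpretation and is used here purely as an assumption.

import random

def epsilon_greedy_action(q_values, epsilon):
    # Equation (2): explore with probability epsilon, otherwise act greedily
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decay_epsilon(epsilon, decay_rate, epsilon_min=0.02):
    # Assumed schedule: epsilon <- max(epsilon_min, epsilon * decay_rate) per episode
    return max(epsilon_min, epsilon * decay_rate)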
IV. PROBLEM STATEMENT

The challenge of exploration holds paramount importance in the training of Reinforcement Learning agents for Traffic Signal Control. Within this domain, the widely employed exploration strategy becomes a focal point. This strategy entails the RL agent randomly exploring with a probability denoted as “ε” while favoring the selection of the best-known action with a probability of “(1 − ε)”. However, it is imperative to acknowledge the inherent limitations of the strategy. It adheres to a fixed exploration rate, which can introduce perturbations into the learning process, yielding suboptimal results. The static nature of the exploration rate can further engender inefficient exploration, necessitating meticulous tuning to engender effective training.

The principal aim of this research revolves around the comprehensive analysis of epsilon decay rates within the context of Traffic Signal Control through Reinforcement Learning. Specifically, our objective is to scrutinize the impact of dynamic exploration rates, encompassing epsilon decay, on the training and performance of RL agents tasked with the control of traffic signals. The central challenge we address pertains to the design and evaluation of epsilon decay strategies that adapt the exploration probability over time.

This research encompasses several pivotal facets, including:
1) The integration of epsilon decay exploration strategies within RL-based Traffic Signal Control.
2) A thorough and extensive examination of the ramifications of diverse epsilon decay rate profiles on the management of traffic and the reduction of congestion.
3) An in-depth exploration into how dynamically modulating exploration rates affects the efficacy of RL agents in the domain of traffic signal control.

To gauge the impact of epsilon decay rates, we will leverage a real-world traffic simulator. This simulator will facilitate the generation of a diverse spectrum of traffic scenarios, catering to both training and testing requirements. The exploration strategies, featuring a gamut of epsilon decay profiles, will be trained on this dataset and subsequently subjected to assessment on a held-out set of traffic conditions. The evaluation will pivot around critical performance metrics, including:
• The average travel time of vehicles.
• The length of vehicle queues at intersections.

This research is poised to provide invaluable insights into the optimization of exploration strategies in RL-based Traffic Signal Control, with a specific emphasis on epsilon decay rates. The anticipated findings are poised to make substantial contributions to the realm of more efficient and adaptive traffic management in urban environments.

V. METHODOLOGY

In the realm of Traffic Signal Control research, our central research objective revolves around a methodological exploration and comparative analysis of distinct epsilon decay rates within the Epsilon greedy exploration strategy as applied to reinforcement learning agents. The core focus of our investigation is to discern the intricate interplay between these epsilon decay rates and their influence on the equilibrium between exploration and exploitation within the field of traffic signal control. This methodological framework is meticulously structured to address several pivotal facets critical to our research:
1) Data Collection: We systematically procure comprehensive data similar to real-world traffic from a sophisticated traffic simulator, incorporating a wealth of information involving counts of vehicles, speeds, and queue lengths across numerous intersections.
2) Agent Training: Our training paradigm features the application of a robust Double Deep Q Network architecture, enabling our reinforcement learning agent to interact dynamically with the traffic simulator. This interaction refines the traffic signal control policies of the agent as it systematically alters the epsilon decay rates in its exploration strategy.
3) Exploration Strategies: We rigorously implement the Epsilon greedy exploration strategy, incorporating an array of epsilon decay rates. These dynamic epsilon decay rates significantly shape the agent's action selection process, systematically influencing the exploration and exploitation trade-offs.
4) Performance Metrics: Our evaluation is anchored in the meticulous scrutiny of diverse epsilon decay rates. Key performance metrics, such as the average duration of vehicle journeys and the number of vehicles waiting in queues at intersections, serve as integral gauges for assessing the efficacy of traffic management under diverse exploration strategies.
5) Comparative Analysis: Central to our research is a rigorous comparative analysis, designed to delineate the performance disparities inherent to various epsilon decay rates. Our overarching aim is to pinpoint the epsilon decay rate that engenders optimal traffic flow efficiency while concurrently mitigating congestion.

Through this comprehensive methodology, our research seeks to shed light on the nuanced dynamics of epsilon decay rates within the Epsilon greedy strategy, providing valuable insights into the intricacies of exploration and exploitation in Traffic Signal Control. These insights are poised to inform the development of more refined and responsive traffic regulation methods in the realm of reinforcement learning.
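The comparative protocol described in this section reduces to a simple outer loop: train one agent per ε-decay rate and record the evaluation metrics. The sketch below is only a schematic under that reading; make_env, make_agent, and evaluate are hypothetical callables supplied by the caller, not interfaces from the paper.

def compare_decay_rates(make_env, make_agent, evaluate,
                        decay_rates=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9),
                        n_episodes=50):
    # Train one DDQN+PER agent per epsilon-decay rate and evaluate it on
    # held-out traffic conditions (average travel time, queue length).
    results = {}
    for rate in decay_rates:
        env = make_env()                      # simulator-backed training environment
        agent = make_agent(epsilon_decay=rate)
        for _ in range(n_episodes):           # N = 50 training episodes (Table I)
            agent.train_one_episode(env)
        results[rate] = evaluate(agent, env)  # metrics on a held-out set of scenarios
    return results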

Fig. 1: An illustration showcasing the DDQN model with ε-decay-driven exploration

VI. EXPERIMENTAL CONFIGURATION

The focal point of this study revolves around optimizing traffic flow and mitigating congestion by fine-tuning the synchronization of traffic signals at junctions. Employing the traffic simulation tool SUMO [26], our empirical framework integrates critical components involving representing the state, selecting actions, evaluating performance, and generating data with an RL agent (specifically, DDQN) strengthened by Prioritized Experience Replay (PER) [25]. In this context, the agent symbolizes an Intersection Management System (IMS), tasked with interacting with the environment to optimize a predefined metric of traffic efficiency. With this overarching goal in sight, this paper articulates its fundamental problem: given the current intersection state, how can the agent, armed with a predetermined assortment of actions, decide on the optimal traffic signal timing to maximize rewards and thereby augment intersection traffic efficiency?
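Since PER [25] is central to the training loop, a stripped-down proportional-priority buffer is sketched below for orientation; the class name, the priority exponent, and the omission of importance-sampling weights are simplifications of ours, not details reported by the paper.

import numpy as np

# Minimal proportional Prioritized Experience Replay (after [25]),
# capped at the max_size value later given in Table I.
class SimplePERBuffer:
    def __init__(self, max_size=20000, alpha_priority=0.6, eps=1e-5):
        self.max_size = max_size
        self.alpha_priority = alpha_priority   # how strongly priorities skew sampling
        self.eps = eps                         # keeps every priority strictly positive
        self.storage, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        if len(self.storage) >= self.max_size: # discard the oldest transition
            self.storage.pop(0)
            self.priorities.pop(0)
        self.storage.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha_priority)

    def sample(self, batch_size=100):
        p = np.asarray(self.priorities)
        p = p / p.sum()                        # sampling probability proportional to priority
        idx = np.random.choice(len(self.storage), size=batch_size, p=p)
        return [self.storage[i] for i in idx], idx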

A. Intersection Design

A meticulously designed intersection with four arms is constructed within the SUMO traffic simulator (illustrated in Figure 2). Every approach to the intersection features four lanes serving incoming and outgoing traffic, encompassing lanes that are 750 meters long [16]. The lane functionalities are delineated: the far-left lane exclusively accommodates left turns, the far-right lane facilitates both rightward movement and straight travel, while the two middle lanes are earmarked for straight travel. Regarding the traffic signal arrangement, an individual signal governs the far-left lane, whereas the remaining lanes all rely on a single signal.

Fig. 2: Visualization of a generated intersection in the SUMO simulator

B. Agent Workflow

The regular operation of the agent is presented in Figure 1. Importantly, within this SUMO-oriented application, time progression is computed in simulation cycles. Nonetheless, the agent is only activated periodically, after sufficient evolution of the environment. We label each occurrence dedicated to the agent's assignments as a “cycle of agency”, while extensive simulation intervals are termed “simulation cycles”. Subsequently, following a set number of simulation cycles, the agent initiates its lineup of assignments by capturing the prevailing environment state. Furthermore, the agent computes the reward for the preceding action derived from an evaluation of existing traffic situations, and a data sample comprising extensive data from recent simulation cycles is preserved in memory for succeeding training sessions. At this juncture, the agent proceeds to pick a novel action influenced by the current conditions of the environment, thereby carrying on with the simulation until the next engagement with the agent transpires.
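The workflow just described can be summarized as a loop over simulation cycles in which the agent acts only every few cycles. The following schematic is a hedged paraphrase of Figure 1: the agent and sim objects, their method names, and the activation interval are placeholders for the paper's DDQN agent and its SUMO interface, not actual APIs.

def run_episode(agent, sim, agency_interval=10, max_steps=3800):
    # One training episode: the simulator advances every cycle, but the agent
    # only observes, learns, and acts every `agency_interval` simulation cycles.
    prev_state, prev_action = None, None
    for step in range(max_steps):
        sim.step()                                   # advance one simulation cycle
        if step % agency_interval != 0:
            continue                                 # agent is activated only periodically
        state = sim.get_state()                      # capture the prevailing environment state
        if prev_action is not None:
            reward = sim.get_reward()                # reward for the preceding action
            agent.store(prev_state, prev_action, reward, state)  # sample kept for training
        action = agent.select_action(state)          # epsilon-greedy choice of signal phase
        sim.apply_action(action)
        prev_state, prev_action = state, action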
C. Traffic Flow

The simulation generates traffic flow, measured in vehicles per hour. We intricately adjust the SUMO simulator with specific settings to ensure high realism. Following this, we produce a route file that mimics the movement of the desired number of vehicles, here 1000. Two principal probability distributions, namely the Weibull distribution and the normal distribution, are regularly utilized for establishing vehicle departure times [27], [28]. The Weibull distribution, featuring an invariant shape parameter of 2, accounts for departure time diversity while preserving consistency. Concurrently, using the normal distribution, departure times can be generated with a mean-centered distribution, thereby improving the naturalism of traffic simulation. In both instances, distribution parameters are deliberately chosen to represent the intended traits of the traffic scenario. The Weibull distribution [27], characterized by a shape parameter of 2, governs the timing of vehicle arrivals. This configuration effectively captures real-world traffic scenario intricacies, creating a sturdy platform for a systematic investigation of traffic signal control strategies derived from RL principles. Modeled traffic flow is generated within each episode for a time frame of 3800 seconds, incorporating both straight (Northbound-Southbound and Eastbound-Westbound) movements and turning maneuvers (Left-turns and Right-turns). Detailed traffic movement configurations are outlined in Table I.
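As an illustration of the departure-time generation described above, the snippet below draws 1000 departure times over a 3800 s episode from a shape-2 Weibull distribution and, alternatively, from a mean-centered normal distribution. Only the shape parameter, vehicle count, and episode length come from the text; the rescaling, mean, and standard deviation are our own illustrative assumptions.

import numpy as np

rng = np.random.default_rng(42)
n_vehicles, horizon = 1000, 3800                 # vehicles per episode, episode length in seconds

# Weibull departures: shape parameter 2 as in the text; rescaled to the episode window.
weibull_departures = np.sort(rng.weibull(2.0, size=n_vehicles))
weibull_departures = weibull_departures / weibull_departures.max() * horizon

# Normal departures: mean-centered within the episode; the spread is an assumption.
normal_departures = np.sort(rng.normal(loc=horizon / 2, scale=horizon / 6, size=n_vehicles))
normal_departures = np.clip(normal_departures, 0, horizon)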
TABLE I: Parameter settings for experiment

Parameter | Description | Value
N | The cumulative number of training episodes | 50
max_size | Limit in PER memory | 20,000
b_size | Sample batch size | 100
learning_rate | Learning rate | 0.001
γ | Gamma parameter | 0.95
ε_min | Minimum permissible epsilon | 0.02
C | Update interval for target network (simulation cycles) | 5
Y_Dur | Yellow light duration | 3 s
G_Dur | Green light duration | 15 s

D. Experiment parameter configurations

Following multiple experimental investigation trials and carefully adjusting parameters, we set the following specifications (refer to Table I). The cumulative number of training episodes (N) is set at 50, illustrating the collective iterations throughout which the RL agent engages with the elements of the traffic environment to refine its governing policy. The utmost limit of the Prioritized Experience Replay (PER) memory, labeled as max_size, is defined as 20,000, operating as a repository facility for preceding incidents. A mini-batch size (b_size) of 100 is employed, governing the count of incidents selected from the PER repository for policy updates during each training cycle. The learning rate (learning_rate) is established at 0.001, influencing the magnitude of policy adjustments based on unseen incidents. The gamma parameter (γ), with a value of 0.95, impacts the weight given to future rewards in Q-value updates. The minimum permissible epsilon value (ε_min) is set to 0.02, controlling the gradual decline of exploration. The update interval for the target network (C) specifies how frequently the target network aligns with the core network's weights, established to occur every 5 simulation cycles. Ultimately, designated durations for the yellow light interval (Y_Dur) and green light interval (G_Dur) of traffic lights are set as 3 seconds and 15 seconds, correspondingly, delineating TSC phases encompassing transition and operation. Together, these configurations establish the experimental conditions that form the basis of the learning curve of the RL agent for traffic signal coordination.
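For reference, the settings of Table I can be collected in a single configuration object; the dictionary below simply mirrors the reported values, and the key names follow the table rather than any code released with the paper.

# Experiment configuration mirroring Table I.
CONFIG = {
    "N": 50,                 # training episodes
    "max_size": 20_000,      # PER memory limit
    "b_size": 100,           # sample batch size
    "learning_rate": 0.001,
    "gamma": 0.95,           # discount factor
    "epsilon_min": 0.02,     # minimum permissible epsilon
    "C": 5,                  # target-network update interval (simulation cycles)
    "Y_Dur": 3,              # yellow light duration, seconds
    "G_Dur": 15,             # green light duration, seconds
}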
E. Performance assessment metrics

To appraise the effectiveness of different traffic signal control methodologies, two metrics are utilized in our experiments:
1) Average Travel Time (ATT): This indicator is employed as the key performance benchmark, illustrating the time taken by vehicles to pass through the intersection. It is determined by dividing the cumulative travel time of all vehicles by the total count of vehicles.

$$\text{Average Travel Time} = \frac{1}{N} \sum_{i=1}^{N} \text{Travel Time}_i \qquad (3)$$

2) Queue Length (QL): The queue length of a lane represents the total number of vehicles waiting in line on that lane. It is an essential metric for evaluating congestion and traffic monitoring.

These metrics are integrated into the reward and evaluation mechanisms, guiding the agent toward decisions that enhance traffic conditions and overall system effectiveness. The reinforcement learning agent endeavors to learn policies that result in diminished queue lengths and shorter average travel times.
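Both metrics are straightforward to compute from per-vehicle logs; the helpers below are a direct transcription of equation (3) and the queue-length definition, with function names and input formats of our own choosing.

def average_travel_time(travel_times):
    # Equation (3): sum of per-vehicle travel times divided by the number of vehicles
    return sum(travel_times) / len(travel_times)

def total_queue_length(waiting_vehicles_per_lane):
    # Queue length: total number of vehicles waiting across the incoming lanes
    return sum(waiting_vehicles_per_lane)

# Example: average_travel_time([120.0, 95.5, 140.2]) returns roughly 118.57 seconds.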

VII. EXPERIMENTAL RESULTS AND ANALYSIS

In our analysis, we examined Queue Length and Average Travel Time under various ε-decay rates in the Epsilon greedy exploration strategy. To enhance clarity, four graphs were created, each corresponding to a specific performance metric at varying ε-decay rates. Figure 3 succinctly illustrates the performance trends for each training episode, with each curve representing the evolution of the reinforcement learning agent over 50 episodes. We provide a nuanced understanding of the agent's learning process by showcasing individual episodes. Future work may explore presenting averaged performance metrics to highlight consistent improvements in system efficiency.

Fig. 3: Effect of epsilon decay on reinforcement learning agents in traffic signal control systems. (a) Average travel time of vehicles for different ε decay values during training. (b) Queue length of vehicles for different ε decay values during training. (c) Average travel time of vehicles for different ε decay values during testing. (d) Queue length of vehicles for different ε decay values during testing.

1. Average Travel Time vs. ε-Decay Rate: The first graph 3a illustrates the correlation between Average Travel Time and the ε-Decay Rate. Each data point on the graph corresponds to a specific ε-Decay Rate. By examining this graph, it becomes apparent how different ε-Decay Rates impact the efficiency of vehicle travel times. Lower ε-Decay Rates (0.2 and 0.6) result in shorter travel times, indicating faster traffic flow, while higher ε-Decay Rates (0.4 and 0.5) lead to extended travel durations due to congestion.

2. Queue Length vs. ε-Decay Rate: The second graph 3b depicts the relationship between Queue Length (the total number of vehicles queuing at the intersection) and the ε-Decay Rate. Each point on the graph corresponds to a different ε-Decay Rate, ranging from 0.1 to 0.9. By visually analyzing this graph, it becomes evident how the Queue Length fluctuates as the exploration-exploitation balance changes. Lower ε-Decay Rates (0.1 and 0.2) lead to reduced congestion, resulting in lower Queue Length values, while higher ε-Decay Rates (0.4 and 0.5) correspond to more pronounced congestion, resulting in higher Queue Length.

3. Average Travel Time for Testing vs. ε-Decay Rate: The third graph 3c presents the Average Travel Time specifically for testing scenarios across various ε-Decay Rates. This graph allows us to focus on the performance during testing and highlights the effect of different ε-Decay Rates on travel times in the testing phase.

4. Queue Length for Testing vs. ε-Decay Rate: The fourth graph 3d provides a view of the Queue Length in the testing phase at different ε-Decay Rates. This graph helps assess congestion levels during testing for various exploration-exploitation strategies.

These graphs not only provide a clearer representation of the experimental results but also help in visualizing the relationship between performance metrics and exploration strategies, facilitating a more comprehensive understanding of the impact of ε-Decay Rates on Traffic Signal Control.

Continuing with our presentation of the experimental results, we have also compiled a comprehensive metric table, as seen in Table II. This table provides a structured overview of the performance metrics for different ε-Decay Rates, ranging from 0.1 to 0.9. Two crucial performance indicators, Queue Length and Average Travel Time, were measured for each ε-decay rate to investigate the influence of exploration-exploitation strategies on traffic signal optimization.

TABLE II: Results of the Experiment

ε-Decay Rate | Queue Length (m) | Avg. Travel Time (s)
0.1 | 386.68 | 160.01
0.2 | 304.23 | 143.31
0.3 | 314.45 | 153.39
0.4 | 607.21 | 268.01
0.5 | 520.31 | 311.06
0.6 | 164.88 | 165.65
0.7 | 338.18 | 171.24
0.8 | 515.43 | 242.76
0.9 | 375.19 | 185.97

Fig. 4: Comparison of Queue Length and Avg. Travel Time for Different ε-Decay Rates

In addition to the tabulated results shown in Table II, Figure 4 provides a visual comparison of the experiment results, illustrating the variation in Queue Length and Average Travel Time under various ε-Decay Rates. Our analysis revealed distinct and nuanced patterns in the relationship between Queue Length and Average Travel Time as we manipulated the ε-Decay Rate. Significantly, when the ε-Decay Rate was set to 0.6, we observed a pronounced reduction in both QL and ATT, signifying an optimal equilibrium between exploration and exploitation strategies. Conversely, higher ε-Decay Rates (0.8 and 0.4) led to escalated values of both QL and ATT, implying that excessively exploratory or exploitative behaviors had adverse effects on traffic conditions. On the other hand, lower ε-Decay Rates (0.2 and 0.9) exhibited favorable outcomes in either QL or ATT individually, highlighting a potential trade-off between these two critical metrics. Additionally, an ε-Decay Rate of 0.5 resulted in the highest values for both ATT and QL, indicative of suboptimal performance in traffic control. Analyzing these outcomes, we derive the following insights:
• An ε-Decay Rate of 0.6 appears to be optimal for minimizing both QL and ATT, showcasing the most favorable outcomes for both metrics.
• Elevated ε-Decay Rates (0.8 and 0.4) result in increased QL and ATT, signifying that overly exploratory or exploitative behavior may lead to deteriorated traffic conditions.
• Lower ε-Decay Rates (0.2 and 0.9) exhibit favorable results in either QL or ATT individually, elucidating an inherent trade-off between these metrics.
• An ε-Decay Rate of 0.5 engenders the highest values for both ATT and QL, indicative of suboptimal traffic control performance.

The observation that Average Travel Time is occasionally lower even when Queue Length is high can be elucidated by comprehending the intricate relationship between these two metrics within the perspective of traffic signal supervision. QL quantifies the number of vehicles queuing or waiting in traffic lanes, with elevated values denoting congestion and slower traffic flow. In contrast, ATT represents the average travel time for vehicles to traverse the traffic network, with lower values indicating expedited travel times. While high QL often signals congestion, the efficiency of traffic signal optimization can counterintuitively lead to lower ATT even in congested scenarios. Conversely, lower QL may imply reduced queuing, yet overly conservative traffic signals that limit vehicle flow can result in longer travel times. The equilibrium between these two metrics is intricately tied to the prevailing traffic conditions and the efficacy of the traffic signal control strategy in play.

These results underscore the imperative of judiciously selecting an appropriate ε-Decay Rate, accounting for specific traffic conditions and the overarching objectives of traffic signal control. Achieving the desired equilibrium between QL and ATT hinges upon the effectiveness of the chosen exploration strategy and the overarching aim of optimizing traffic flow.

VIII. CONCLUSION

In this comprehensive investigation of Traffic Signal Control, we have delved into the detailed intricacies of ε-decay rates within the Epsilon greedy exploration strategy. Using the SUMO traffic simulation tool and a Reinforcement Learning agent powered by Prioritized Experience Replay integrated into the Double Deep Q Network, we have uncovered critical insights.

Our investigation has underscored the pivotal role of the ε-decay rate in shaping the performance of TSC systems. The optimal ε-decay rate of 0.6 has emerged as a key driver for minimizing both Queue Length and Average Travel Time, indicating a dual reduction in congestion and enhanced traffic efficiency. However, we have also discerned that the interplay between these metrics at varying decay rates necessitates precision in tailoring exploration strategies to specific traffic objectives. The results of this study open avenues for future technical inquiries: dynamic exploration strategies that autonomously adjust the ε-decay rate in response to real-time traffic conditions, multi-intersection coordination within complex urban networks, enhanced Reinforcement Learning techniques, and addressing real-world deployment challenges in actual urban settings.

It is paramount to align the chosen traffic signal control objectives with the unique imperatives of the specific traffic ecosystem. This alignment ensures the judicious calibration of the ε-Decay Rate within the exploration strategy, optimally harmonizing the trade-off between exploration and exploitation to achieve the desired outcomes in traffic flow dynamics. In summary, this study advances the field of finely tuned traffic signal control systems, offering a foundation for the development of more efficient, adaptive, and technically sophisticated traffic management solutions to meet the evolving demands of urban environments.

ACKNOWLEDGMENT

This research received funding from the DST NMICPS Technology Innovation Hub On Autonomous Navigation Foundation (TiHAN IIT Hyderabad).

REFERENCES

[1] Uddin, Azeem. “Traffic congestion in Indian cities: Challenges of a rising power.” Kyoto of the cities, Naples (2009).
[2] Serafini, Paolo, and Walter Ukovich. “A mathematical model for the fixed-time traffic control problem.” European Journal of Operational Research 42.2 (1989): 152-165.
[3] Zhao, Dongbin, Yujie Dai, and Zhen Zhang. “Computational intelligence in urban traffic signal control: A survey.” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42.4 (2011): 485-494.
[4] Gartner, Nathan H., and Mohammed Al-Malik. “Combined model for
signal control and route choice in urban traffic networks.” Transportation
Research Record 1554.1 (1996): 27-35.
[5] Askerzade, I. N., and Mustafa Mahmood. “Control the extension time of
traffic light in single junction by using fuzzy logic.” International Journal
of Electrical Computer Sciences IJECS–IJENS 10.2 (2010): 48-55.
[6] Liao, Yongquan, and Xiangjun Cheng. “Study on traffic signal control
based on q-learning.” 2009 Sixth International Conference on Fuzzy
Systems and Knowledge Discovery. Vol. 3. IEEE, 2009.
[7] Mousavi, Seyed Sajad, Michael Schukat, and Enda Howley. “Traffic light
control using deep policy-gradient and value-function-based reinforce-
ment learning.” IET Intelligent Transport Systems 11.7 (2017): 417-423.
[8] Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An
introduction. MIT press, 2018.
[9] Thrun, Sebastian. “Efficient exploration in reinforcement learning.” Tech-
nical Report. Carnegie Mellon University (1992).
[10] Brafman, Ronen I., and Moshe Tennenholtz. “R-max-a general polyno-
mial time algorithm for near-optimal reinforcement learning.” Journal of
Machine Learning Research 3.Oct (2002): 213-231.
[11] Ishii, Shin, Wako Yoshida, and Junichiro Yoshimoto. “Control of ex-
ploitation–exploration meta-parameter in reinforcement learning.” Neural
networks 15.4-6 (2002): 665-687.
[12] Watkins, Christopher John Cornish Hellaby. “Learning from delayed
rewards.” (1989).
[13] Chapelle, Olivier, and Lihong Li. “An empirical evaluation of thompson
sampling.” Advances in neural information processing systems 24 (2011).
[14] Caelen, Olivier, and Gianluca Bontempi. “Improving the exploration
strategy in bandit algorithms.” Learning and Intelligent Optimization:
Second International Conference, LION 2007 II, Trento, Italy, December
8-12, 2007. Selected Papers 2. Springer Berlin Heidelberg, 2008.
[15] Van Hasselt, Hado, Arthur Guez, and David Silver. “Deep reinforcement
learning with double q-learning.” Proceedings of the AAAI conference
on artificial intelligence. Vol. 30. No. 1. 2016.
[16] Behrisch, Michael, et al. “SUMO–simulation of urban mobility: an
overview.” Proceedings of SIMUL 2011, The Third International Con-
ference on Advances in System Simulation. ThinkMind, 2011.
[17] Vidali, Andrea, et al. “A Deep Reinforcement Learning Approach to
Adaptive Traffic Lights Management.” WOA. 2019.
[18] Schaul, Tom, et al. “Prioritized experience replay.” arXiv preprint
arXiv:1511.05952 (2015).
[19] Miller, Alan J. “Road traffic flow considered as a stochastic process.”
Mathematical Proceedings of the Cambridge Philosophical Society. Vol.
58. No. 2. Cambridge University Press, 1962.
[20] Watkins, Christopher John Cornish Hellaby. “Learning from delayed
rewards.” (1989).
[21] Thrun, S.B.: Efficient exploration in reinforcement learning. Technical
Report CMU-CS-92-102, Carnegie Mellon University, Pittsburgh, PA,
USA (1992)
[22] Caelen, Olivier, and Gianluca Bontempi. “Improving the exploration
strategy in bandit algorithms.” International Conference on Learning and
Intelligent Optimization. Berlin, Heidelberg: Springer Berlin Heidelberg,
2007.
[23] Kuleshov, Volodymyr, and Doina Precup. “Algorithms for multi-armed
bandit problems.” arXiv preprint arXiv:1402.6028 (2014).
[24] Caelen, Olivier, and Gianluca Bontempi. “Improving the exploration strategy in bandit algorithms.” Learning and Intelligent Optimization: Second International Conference, LION 2007 II, Trento, Italy, December 8-12, 2007. Selected Papers 2. Springer Berlin Heidelberg, 2008.
[25] Schaul, Tom, et al. “Prioritized experience replay.” arXiv preprint arXiv:1511.05952 (2015).
[26] Behrisch, Michael, et al. “SUMO–simulation of urban mobility: an overview.” Proceedings of SIMUL 2011, The Third International Conference on Advances in System Simulation. ThinkMind, 2011.
[27] Hallinan Jr., Arthur J. “A review of the Weibull distribution.” Journal of Quality Technology 25.2 (1993): 85-93.
[28] Ahsanullah, Mohammad, et al. “Normal distribution.” Normal and Student's t Distributions and Their Applications (2014): 7-50.
[29] Agarwal, Rishabh, et al. “Deep reinforcement learning at the edge of the statistical precipice.” Advances in Neural Information Processing Systems 34 (2021): 29304-29320.
[30] Genders, Wade, and Saiedeh Razavi. “Using a deep reinforcement learning agent for traffic signal control.” arXiv preprint arXiv:1611.01142 (2016).

