Decision-Making For Autonomous Vehicles On Highway: Deep Reinforcement Learning With Continuous Action Horizon
Furthermore, Reference [13] conducted a comprehensive survey of the prevailing applications of RL and DRL for automated vehicles, encompassing agent training, evaluation techniques, and robust estimation. Nevertheless, several limitations curtail the real-world viability of DRL-based decision-making strategies. These encompass challenges related to sample efficiency, slow learning rates, and operational safety.
Fig. 1. An efficient and safe decision-making control framework based on PPO-DRL for autonomous vehicles. (Framework blocks: driving scenario and vehicle kinematics, with the bicycle model and the N-lane highway as inputs; reinforcement learning.)
This study aims to develop an effective and secure decision-making policy for autonomous driving (AD). To achieve this goal, a proximal policy optimization (PPO)-enhanced deep reinforcement learning (DRL) approach is presented for highway scenarios with a continuous action horizon, as illustrated in Fig. 1. Initially, the vehicle's kinematics and driving scenarios are established, wherein the autonomous ego vehicle is designed to operate efficiently and safely. Through the utilization of the policy gradient method, the PPO-enhanced DRL framework enables direct acquisition of control actions while maintaining a trust region with bounded objectives. The specific implementation details of this DRL algorithm are subsequently elaborated upon. Ultimately, a series of comprehensive test experiments are designed to assess the optimality, learning efficiency, and adaptability of the proposed decision-making policy within the context of highway scenarios.

This work introduces three key contributions and innovations: 1) the development of an advanced, efficient, and safe decision-making policy for AD on highways; 2) the application of PPO-enhanced DRL to address the transferred control optimization challenge in autonomous vehicle scenarios; 3) the establishment of an adaptive estimation framework to assess the adaptability of the proposed approach. This endeavor represents a concerted effort to enhance the efficiency and safety of decision-making policies through the utilization of cutting-edge DRL methodologies.

To elucidate the contributions of this article, the subsequent sections are organized as follows. Section II outlines the vehicle kinematics and driving scenarios on the highway. The PPO-enhanced DRL framework employed in this research is expounded upon in Section III. The ensuing section, Section IV, delves into the analysis of pertinent simulation outcomes for the proposed decision-making strategy. Ultimately, Section V presents the concluding remarks.

II. VEHICLE KINEMATICS AND DRIVING SCENARIOS

In this section, we establish the highway driving scenario for our research. This environment encompasses the autonomous ego vehicle (AEV) and the surrounding vehicles. We also detail the vehicle kinematics of these entities, allowing for the calculation of longitudinal and lateral speeds. Additionally, we introduce reference models for driving maneuvers in both the longitudinal and lateral directions.

A. Vehicle Kinematics

In this study, we elucidate the vehicle kinematics through the application of the widely acknowledged bicycle model [14]-[15], characterized by nonlinear continuous horizon equations. The representation of the inertial frame is illustrated in Fig. 2. Computation of the differentials for position and inertial heading is outlined as follows:

$\dot{x} = v \cos(\psi + \beta)$  (1)

$\dot{y} = v \sin(\psi + \beta)$  (2)

$\dot{\psi} = \frac{v}{l_r} \sin\beta$  (3)

where $(x, y)$ represents the positional coordinates of the vehicle within the inertial frame. The vehicle velocity is denoted as $v$, while $l_r$ signifies the distance between the center of mass and the rear axle. Additionally, $\psi$ represents the inertial heading, and $\beta$ denotes the slip angle at the center of gravity. This angle and the vehicle speed can be further expressed as:
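For concreteness, a minimal Python sketch of one integration step of (1)-(3) is given below. The Euler step size, the axle distances, and the slip-angle relation beta = arctan(l_r tan(delta_f) / (l_f + l_r)), taken from the standard kinematic bicycle model of [14] because the corresponding equations are not reproduced here, are assumptions rather than values from the paper.

```python
import math

def bicycle_step(x, y, psi, v, a, delta_f, lf=2.5, lr=2.5, dt=0.1):
    """One Euler step of the kinematic bicycle model, Eqs. (1)-(3).

    a       : longitudinal acceleration command [m/s^2]
    delta_f : front steering angle [rad]
    lf, lr  : distances from the center of mass to the front/rear axle
              (illustrative values, not taken from the paper).
    """
    # Slip angle at the center of gravity (standard kinematic bicycle relation [14]).
    beta = math.atan(lr / (lf + lr) * math.tan(delta_f))
    x += v * math.cos(psi + beta) * dt    # Eq. (1)
    y += v * math.sin(psi + beta) * dt    # Eq. (2)
    psi += v / lr * math.sin(beta) * dt   # Eq. (3)
    v += a * dt                           # speed update from the acceleration command
    return x, y, psi, v
```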
Fig. 3. Driving scenario on highway with N lanes for decision-making policy.

The overtaking behavior signifies instances when the subject vehicle surpasses nearby vehicles through a combination of lane-changing and accelerating maneuvers. In the longitudinal direction, the car-following behavior is regulated by the intelligent driver model (IDM) [17], whose desired gap is expressed as follows:

$d_r = d_0 + T v + \frac{v \, \Delta v}{2\sqrt{a_{max} b}}$  (7)

where $d_0$ represents the minimum relative distance between two vehicles on the same lane, while $T$ signifies the desired time interval for ensuring safety. $\Delta v$ accounts for the relative speed gap between the subject vehicle and the vehicle ahead, and $b$ denotes the value of deceleration aligned with comfortable criteria.
The specific parameters of the IDM employed in this study are detailed in Table I.

TABLE I
DEFAULT PARAMETERS OF IDM

Parameter (Symbol)                     Value   Unit
Maximum acceleration (amax)            6       m/s²
Acceleration argument (δ)              4       /
Desired time gap (T)                   1.5     s
Comfortable deceleration rate (b)      -5      m/s²
Minimum relative distance (d0)         10      m
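As an illustration, a minimal sketch of the desired-gap computation in (7) with the default parameters of Table I; interpreting b as the magnitude of the comfortable deceleration inside the square root is an assumption.

```python
import math

# Default IDM parameters from Table I.
A_MAX = 6.0    # maximum acceleration [m/s^2]
T_GAP = 1.5    # desired time gap [s]
B_COMF = 5.0   # magnitude of the comfortable deceleration rate [m/s^2]
D_0 = 10.0     # minimum relative distance [m]

def idm_desired_gap(v, delta_v):
    """Desired gap d_r of Eq. (7) for speed v and speed difference delta_v
    with respect to the vehicle ahead."""
    return D_0 + T_GAP * v + v * delta_v / (2.0 * math.sqrt(A_MAX * B_COMF))
```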
Upon establishing the longitudinal acceleration of the adjacent vehicle, MOBIL is employed to govern lateral lane-changing decisions [19]. The MOBIL framework incorporates two pivotal conditions: the safety criterion and the incentive condition. The safety criterion dictates that during a lane change, the subsequent vehicle should avoid excessive deceleration to prevent collisions. The mathematical representation of this acceleration constraint is as follows:

$\tilde{a}_n \geq -b_{safe}$  (8)

where $\tilde{a}_n$ denotes the acceleration experienced by the newly following vehicle after executing a lane change, while $b_{safe}$ represents the upper limit for deceleration applied to the new follower. Equation (8) is strategically employed to establish conditions that guarantee collision-free scenarios.

Assume that $a_n$ and $\tilde{a}_n$ denote the accelerations of the new follower prior to and after lane-changing, while $a_o$ and $\tilde{a}_o$ refer to the accelerations of the previous follower before and after the lane-change event. The incentive condition is then established through the imposition of an acceleration constraint, which is articulated as follows:

$\tilde{a}_e - a_e + p\left((\tilde{a}_n - a_n) + (\tilde{a}_o - a_o)\right) \geq a_{th}$  (9)

where $a_e$ and $\tilde{a}_e$ represent the accelerations of the AEV prior to and following a lane change. $p$ corresponds to the politeness coefficient, which quantifies the degree of influence exerted by the followers during the lane-changing process. Additionally, $a_{th}$ denotes the threshold for making lane-changing decisions. This criterion signifies that the intended lane must offer a higher level of safety compared to the current one. Notably, the accelerations employed in the MOBIL approach are determined by the IDM at each time step. Moreover, the AEV possesses the capability to execute overtaking maneuvers by transitioning between the right and left lanes. The specific parameters for MOBIL are outlined in Table II.

TABLE II
MOBIL CONFIGURATION

Keyword (Symbol)                         Value   Unit
Safe deceleration limitation (bsafe)     2       m/s²
Politeness factor (p)                    0.001   /
Lane-changing decision threshold (ath)   0.2     m/s²
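For clarity, a minimal sketch that evaluates the two MOBIL conditions (8)-(9) with the parameters of Table II; the convention that all accelerations are supplied by the IDM at the current time step follows the description above, while the function interface itself is an assumption.

```python
# MOBIL parameters from Table II.
B_SAFE = 2.0      # safe deceleration limitation [m/s^2]
P_POLITE = 0.001  # politeness factor
A_TH = 0.2        # lane-changing decision threshold [m/s^2]

def mobil_lane_change_ok(a_e, a_e_new, a_n, a_n_new, a_o, a_o_new):
    """Return True when both the safety criterion (8) and the incentive
    condition (9) of MOBIL are satisfied.

    a_e, a_e_new : AEV acceleration before / after the candidate lane change
    a_n, a_n_new : new-follower acceleration before / after the lane change
    a_o, a_o_new : old-follower acceleration before / after the lane change
    (all accelerations evaluated with the IDM at the current time step)
    """
    safety = a_n_new >= -B_SAFE  # Eq. (8)
    incentive = (a_e_new - a_e
                 + P_POLITE * ((a_n_new - a_n) + (a_o_new - a_o))) >= A_TH  # Eq. (9)
    return safety and incentive
```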
III. PROXIMAL POLICY OPTIMIZATION-ENABLED DEEP REINFORCEMENT LEARNING

This section elucidates the procedural steps involved in implementing the PPO-enhanced DRL method under study. Initially, the foundations of reinforcement learning (RL) methods and the rationale behind employing a continuous-time horizon are presented. Subsequently, the conventional format of the policy gradient technique is detailed. Finally, the utilization of PPO-enabled DRL is illuminated, facilitating the derivation of a decision-making strategy for the control problem established in Section II.

A. Necessity of Continuous Horizon

RL has emerged as a methodological approach to tackle sequential decision-making problems through a trial-and-error process [20]-[22]. This dynamic process is exemplified by the interplay between an intelligent agent and its environment. The agent takes control actions within the environment and subsequently receives evaluations of its choices from the environment [23]-[24]. Broadly, RL methods are categorized into policy-based approaches (e.g., policy gradient algorithms) and value-based approaches (e.g., Q-learning and Sarsa).

In the context of the highway decision-making problem, the intelligent agent functions as the decision-making controller for the AEV, while the surrounding vehicles comprise the environment. This interaction is typically emulated through the application of Markov decision processes (MDPs) with the Markov property [25]. The MDP is characterized by a pivotal tuple (S, A, P, R, γ), where S and A are the sets of state variables and control actions, P denotes the transition model of the state variable, and R corresponds to the reward model associated with the state-action pair (s, a). γ is referred to as the discount factor, serving to strike a balance between immediate and future rewards.
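As a sketch of this agent-environment interplay, the loop below collects one episode of transitions; the Gym-style reset()/step() interface (with the four-value step return) is an assumption and not the paper's code.

```python
def run_episode(env, policy, horizon=200):
    """One episode of the agent-environment interplay in the MDP (S, A, P, R, gamma):
    the agent applies a control action and the environment returns the next state
    together with an evaluative reward."""
    s = env.reset()
    transitions = []
    for t in range(horizon):
        a = policy(s)                         # decision-making controller of the AEV
        s_next, r, done, info = env.step(a)   # surrounding traffic acts as the environment
        transitions.append((s, a, r, s_next))
        s = s_next
        if done:
            break
    return transitions
```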
The goal of RL techniques is to select a sequence of control actions from set A to maximize cumulative rewards. The cumulative reward, denoted as $R_t$, is the sum of the current reward and the discounted future rewards:

$R_t = \sum_{t=0}^{\infty} \gamma^{t} r_t$  (10)

where $t$ is the time step, and $r_t$ represents the corresponding instantaneous reward. Two distinct value functions are defined to convey the significance of selecting control actions. These functions are identified as the state-value function V and the state-action function Q:

$V^{\pi}(s_t) = \mathbb{E}[R_t \mid s_t, \pi]$  (11)

$Q^{\pi}(s_t, a_t) = \mathbb{E}[R_t \mid s_t, a_t, \pi]$  (12)

where $\pi$ is a specific control policy. It is evident that various control policies result in distinct values of the value functions, with the pursuit of optimal performance being desirable. The optimal control policy is defined as follows:

$\pi^{*}(s_t) = \arg\max_{a_t} Q(s_t, a_t)$  (13)
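A small numerical illustration of the discounted return (10) and the greedy rule (13), assuming a finite set of candidate actions for the arg max; the example values are illustrative only.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Cumulative reward R_t of Eq. (10) for a finite sequence of instantaneous rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def greedy_action(q_values):
    """Optimal control action of Eq. (13): the index maximizing Q(s_t, a_t)
    over a finite set of candidate actions (q_values is a 1-D array)."""
    return int(np.argmax(q_values))

# Example: three rewards and four candidate actions (illustrative numbers only).
print(discounted_return([1.0, 0.5, 0.2]))             # 1.0 + 0.99*0.5 + 0.99**2*0.2
print(greedy_action(np.array([0.1, 0.7, 0.3, 0.2])))  # -> 1
```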
Equation (13) guides the agent in discovering an optimal control strategy. For DRL, the value function is approximated using a neural network. From (12), the state-action function assumes the form of a matrix, with its rows and columns corresponding to the quantities of state variables and control actions. In scenarios characterized by extensive state variable and control action spaces, the process of updating the value function and searching for an appropriate control policy can become inefficient.

To address this limitation, this study models the control actions as the vehicle's throttle and steering angle. The throttle governs acceleration, while the steering angle directly impacts lane-changing behavior. These two actions operate within continuous-time horizons, specifically within ranges of [-5, 5] m/s² for acceleration and [-π/4, π/4] rad (where π ≈ 3.1416) for steering angle. This approach enables the AEV to iteratively determine control action pairs at each time step, thereby establishing the vehicle's kinematics as detailed in Section II.A.
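A minimal sketch of drawing such a continuous action pair from a Gaussian policy head and clipping it to the stated ranges; the Gaussian parameterization and the clipping (rather than, e.g., a tanh squashing) are assumptions about one possible realization, not the paper's exact implementation.

```python
import numpy as np

ACC_RANGE = (-5.0, 5.0)                # throttle / acceleration [m/s^2]
STEER_RANGE = (-np.pi / 4, np.pi / 4)  # steering angle [rad]

def sample_action(mean, std, rng=None):
    """Draw a continuous (acceleration, steering) pair from a Gaussian policy
    head and clip it to the admissible ranges used in this study.

    mean, std : 2-element arrays produced by the policy network (assumed shape).
    """
    rng = rng or np.random.default_rng()
    raw = rng.normal(mean, std)
    acc = float(np.clip(raw[0], *ACC_RANGE))
    steer = float(np.clip(raw[1], *STEER_RANGE))
    return acc, steer
```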
B. Policy Gradient

In policy-based RL methods, an estimator of the policy gradient is computed for a stochastic policy, which is represented as follows:

The clipped surrogate objective of PPO [27] restrains the policy update by applying a clipping mechanism to the probability ratio. Here, τ is a hyperparameter set to a value of 0.2. In the second term, the probability ratio $r_t(\theta)$ is bounded within the range of $1-\tau$ to $1+\tau$, subsequently forming the clipped objective through multiplication with the advantage function. The inclusion of this clipped version serves to prevent excessively significant updates to the policy derived from the previous policy.

In order to establish parameter sharing between the policy and value functions using a neural network, the loss function is reformulated by combining the policy surrogate with an error term from the value function [27]. The resulting modified loss function is constructed as follows:

$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\left[ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) - c_2 S[\pi_\theta](s_t) \right]$  (18)

where $L_t^{VF}(\theta)$ is the squared-error loss of the state-value function, $(V_\theta(s_t) - V_t^{tar})^2$, and $S$ indicates an entropy loss. $c_1$ and $c_2$ are the corresponding coefficients.

TABLE III
IMPLEMENTATION CODE OF PPO ALGORITHM

PPO Algorithm, Actor-Critic Style
1. For iteration = 1, 2, …, do
2.     For actor = 1, 2, …, M do
3.         Run policy $\pi_{\theta_{old}}$ in environment for T timesteps
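As a complement to the listing in Table III, a minimal NumPy sketch of the per-batch objective in (18) is given below; the coefficient values c1 and c2 and the batch shapes are assumptions, and the sign convention follows (18) as written, with τ passed as the clip_eps argument.

```python
import numpy as np

def ppo_objective(ratio, advantage, value_pred, value_target, entropy,
                  clip_eps=0.2, c1=0.5, c2=0.01):
    """Batch-averaged PPO objective following Eq. (18) as written in the text.

    ratio        : probability ratio r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    advantage    : estimated advantage at each timestep
    value_pred   : V_theta(s_t); value_target : V_t^tar
    entropy      : policy entropy at each s_t
    clip_eps     : clipping hyperparameter tau (0.2 in this study); c1, c2 are assumed weights.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    l_clip = np.minimum(unclipped, clipped)       # clipped surrogate objective
    l_vf = (value_pred - value_target) ** 2       # squared-error value loss
    return np.mean(l_clip - c1 * l_vf - c2 * entropy)
```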
Finally, the reward function R in this article encompasses three components, reflecting the objectives of efficiency, safety, and lane preference. Specifically, the AEV aims to maximize its speed, prioritize the right lane, and prevent collisions with other surrounding vehicles. The instantaneous reward at time step t is defined as follows:
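The exact expression is not reproduced in this excerpt. A minimal sketch combining the three stated objectives could look as follows; the weights, the speed normalization range, and the individual term definitions are assumptions, not the paper's values.

```python
def instantaneous_reward(speed, lane_index, rightmost_lane, collided,
                         v_min=20.0, v_max=30.0,
                         w_speed=0.4, w_lane=0.1, w_collision=1.0):
    """Illustrative reward with the three stated objectives:
    high speed, right-lane preference, and collision avoidance.
    All weights and the speed normalization range are assumed values."""
    speed_term = (speed - v_min) / (v_max - v_min)             # encourage driving fast
    lane_term = 1.0 if lane_index == rightmost_lane else 0.0   # prefer the right lane
    collision_term = -1.0 if collided else 0.0                 # penalize collisions
    return w_speed * speed_term + w_lane * lane_term + w_collision * collision_term
```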
The performance of the PPO-enabled policy surpasses that of the other two methods. The rewards consistently outperform those achieved by CEM and IDM+MOBIL. Consequently, the control policy derived through the PPO approach demonstrates superiority over the other two strategies.

TABLE IV
COLLISION CONDITIONS IN THREE COMPARED APPROACHES

Algorithms    Collision rate (%)   Success rate (%)
PPO-DRL       0.59                 99.03
CEM           4.32                 91.55
IDM+MOBIL     7.10                 87.21

Fig. 6. Value of loss function in two DRL methods: CEM and PPO.
To scrutinize the learning rate disparities between the PPO and CEM algorithms, Fig. 7 illustrates the trajectories of cumulative rewards across these two methodologies. As defined in (10), the cumulative rewards encompass the summation of the current reward and discounted future rewards, thereby serving as a pivotal determinant for control action selection. As is evident in Fig. 7, the PPO consistently outperforms the CEM, signifying that the control policy furnished by PPO is superior. The AEV operating under the PPO paradigm acquires a more extensive repertoire of knowledge and experiential insights pertaining to the driving environment. This augmentation can be ascribed to the innovative loss function delineated in (18), which expedites the intelligent agent's quest for an optimal control policy.

This subsection introduces an adaptive estimation framework aimed at substantiating the efficacy of the proposed decision-making policy. The fundamental determinants of a given driving scenario encompass the count of lanes and vehicles therein. In this vein, we manipulate these parameters to instantiate two novel driving scenarios. The first scenario entails four lanes, each accommodating five vehicles (designated as driving scenario 1). Conversely, the second scenario entails two lanes, with ten vehicles occupying each lane (designated as driving scenario 2). A total of 10 testing episodes are conducted for each of these distinct scenarios. The speed and position attributes of the surrounding vehicles are also subject to randomized assignment. These designed driving scenarios serve as emblematic representations of the variegated uncertainties inherent in actual driving environments. Moreover, they facilitate a lucid demonstration of the adaptability inherent in the proposed decision-making policy.
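For reference, scenarios of this kind can be instantiated in a simulator such as highway-env [29]; the configuration keys below follow that library's documented interface, while the episode duration and the exact call signatures (which may differ across library versions) are assumptions.

```python
import gym
import highway_env  # registers the "highway-v0" environment [29]

def make_scenario(lanes, vehicles_per_lane):
    """Instantiate a test scenario with the given lane and vehicle counts
    (driving scenario 1: lanes=4, vehicles_per_lane=5;
     driving scenario 2: lanes=2, vehicles_per_lane=10)."""
    env = gym.make("highway-v0")
    env.configure({
        "lanes_count": lanes,
        "vehicles_count": lanes * vehicles_per_lane,  # surrounding vehicles, randomly placed
        "duration": 40,                               # assumed episode length [s]
    })
    env.reset()
    return env
```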
Fig. 8. Rewards in the testing experiments for two new driving scenarios.

Fig. 8 depicts the aggregate rewards achieved by the PPO algorithm within these two novel driving scenarios. A higher reward value signifies a more fitting control policy tailored to the specific scenario. Conversely, lower reward values can be attributed to two distinct factors. Firstly, the random positioning of the surrounding vehicles may lead to the obstruction of all lanes, impeding the AEV's ability to execute efficient lane changes. Secondly, the AEV might engage in hazardous lane-changing maneuvers in exceptional circumstances, resulting in collisions. Notably, Fig. 8 showcases that the acquired decision-making policy exhibited superior performance within the context of the first scenario. This can be attributed to the fact that, in the initial driving scenario, an additional lane was introduced while maintaining the same count of surrounding vehicles. This augmentation in lane availability provides the AEV with increased opportunities for successful lane-changing and collision avoidance. To further elucidate the adaptability of the proposed decision-making policy, we meticulously analyze two specific episodes.

Episode snapshots: "Duration: 10. Action: Success overtaking 1 to accelerate."; "Duration: 18. Action: Success overtaking 2 to accelerate."

The graphical representation in Fig. 10 reveals that the AEV undertakes a daring lane-changing maneuver in the presence of numerous surrounding vehicles. In the training procedure, the AEV might not have encountered this particular situation, rendering it challenging to accurately predict potential collisions in such circumstances. To address this challenge, two potential research avenues could be pursued to enhance the AEV's decision-making capabilities. Firstly, extending the training process could allow the AEV to accumulate more insights from diverse driving environments, thereby refining its decision-making skills. Secondly, the integration of communication technology to provide the AEV with real-time information about its surroundings could facilitate more informed and judicious decision-making on the highway.
[4] J. Nie, J. Zhang, E. Ding, X. Wan, X. Chen, and B. Ran, “Decentralized cooperative lane-changing decision-making for connected autonomous vehicles,” IEEE Access, vol. 4, pp. 9413-9420, 2016.
[5] W. Song, G. Xiong, and H. Chen, “Intention-aware autonomous driving decision-making in an uncontrolled intersection,” Math. Probl. Eng., 2016.
[6] L. Li, K. Ota, and M. Dong, “Humanlike driving: Empirical decision-making system for autonomous vehicles,” IEEE Trans. Veh. Technol., vol. 67, no. 8, pp. 6814-6823, 2018.
[7] C. Hoel, K. Driggs-Campbell, K. Wolff, L. Laine, and M. Kochenderfer,
“Combining planning and deep reinforcement learning in tactical deci-
sion making for autonomous driving,” IEEE Transactions on Intelligent
Vehicles, vol. 5, no. 2, pp. 294-305, 2019.
[8] G. Wang, J. Hu, Z. Li, and L. Li, “Cooperative lane changing via deep
reinforcement learning,” arXiv preprint arXiv:1906.08662, 2019.
[9] Z. Cao, D. Yang, S. Xu, H. Peng, B. Li, S. Feng, and D. Zhao, “Highway
Exiting Planner for Automated Vehicles Using Reinforcement Learning,”
IEEE Trans. Intell. Transp. Syst., 2020.
[10] N. Sakib. “Highway Lane change under uncertainty with Deep Rein-
forcement Learning based motion planner,” 2020.
[11] A. Alizadeh, M. Moghadam, Y. Bicer, N. Ure, U. Yavas, and C. Kurtulus,
“Automated Lane Change Decision Making using Deep Reinforcement
Learning in Dynamic and Uncertain Highway Environment,” In 2019
IEEE Intelligent Transportation Systems Conference (ITSC), pp. 1399-
1404, 2019.
[12] S. Zhang, H. Peng, S. Nageshrao, and E. Tseng, “Discretionary Lane
Change Decision Making using Reinforcement Learning with Model-
Based Exploration,” In 2019 18th IEEE International Conference on Ma-
chine Learning and Applications (ICMLA), pp. 844-850, 2019.
[13] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. Sallab, S. Yogamani,
and P. Pérez, “Deep reinforcement learning for autonomous driving: A
survey,” arXiv preprint arXiv:2002.00444, 2020.
[14] J. Kong, M. Pfeiffer, G. Schildbach, and F. Borrelli, “Kinematic and dy-
namic vehicle models for autonomous driving control design,” In 2015
IEEE Intelligent Vehicles Symposium (IV), pp. 1094-1099, June 2015.
[15] R. Rajamani, Vehicle Dynamics and Control, ser. Mechanical Engineer-
ing Series. Springer, 2011.
[16] F. Ye, X. Cheng, P. Wang, and C. Chan, “Automated lane change strategy
using proximal policy optimization-based deep reinforcement learning,”
arXiv preprint arXiv:2002.02667, 2020.
[17] M. Treiber, A. Hennecke, and D. Helbing, “Congested traffic states in
empirical observations and microscopic simulations,” Phys. Rev. E, vol.
62, pp. 1805-1824, 2000.
[18] M. Zhou, X. Qu, and S. Jin, “On the impact of cooperative autonomous
vehicles in improving freeway merging: a modified intelligent driver
model-based approach,” IEEE Trans. Intell. Transp. Syst., vol. 18, no. 6,
pp. 1422-1428, June 2017.
[19] A. Kesting, M. Treiber, and D. Helbing, “General lane-changing model
MOBIL for car-following models,” Transportation Research Record, vol.
1999, no. 1, pp. 86-94, 2007.
[20] T. Liu, X. Hu, W. Hu, Y. Zou, “A heuristic planning reinforcement learn-
ing-based energy management for power-split plug-in hybrid electric ve-
hicles,” IEEE Trans. Ind. Inform., vol. 15, no. 12, pp. 6436-6445, 2019.
[21] T. Liu, X. Tang, H. Wang, H. Yu, and X. Hu, “Adaptive Hierarchical Energy Management Design for a Plug-in Hybrid Electric Vehicle,” IEEE Trans. Veh. Technol., vol. 68, no. 12, pp. 11513-11522, 2019.
[22] T. Liu, B. Wang, and C. Yang, “Online Markov Chain-based energy man-
agement for a hybrid tracked vehicle with speedy Q-learning,” Energy,
vol. 160, pp. 544-555, 2018.
[23] T. Liu, H. Yu, H. Guo, Y. Qin, and Y. Zou, “Online energy management
for multimode plug-in hybrid electric vehicles,” IEEE Trans. Ind. Inform.,
vol. 15, no. 7, pp. 4352-4361, July 2019.
[24] J. Duan, S. E. Li, Y. Guan, Q. Sun, and B. Cheng, “Hierarchical rein-
forcement learning for self-driving decision-making without reliance on
labelled driving data,” IET Intell. Transp. Syst., vol. 14, no. 5, pp. 297-
305, 2020.
[25] M. L., Puterman, “Markov decision processes: discrete stochastic dy-
namic programming,” John Wiley & Sons, 2014.
[26] X. Hu, T. Liu, X. Qi, and M. Barth, “Reinforcement learning for hybrid and plug-in hybrid electric vehicle energy management: Recent advances and prospects,” IEEE Ind. Electron. Mag., vol. 13, no. 3, pp. 16-25, 2019.
[27] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[28] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, pp. 1889-1897, 2015.
[29] L. Edouard, “An environment for autonomous driving decision-making,” https://github.com/eleurent/highway-env, GitHub, 2018.
[30] I. Szita and A. Lörincz, “Learning Tetris using the noisy cross-entropy method,” Neural Computation, vol. 18, no. 12, pp. 2936-2941, 2006.