Decision-Making For Autonomous Vehicles On Highway: Deep Reinforcement Learning With Continuous Action Horizon
Furthermore, Reference [13] conducted a comprehensive survey of the prevailing applications of RL and DRL for automated vehicles, encompassing agent training, evaluation techniques, and robust estimation. Nevertheless, several limitations curtail the real-world viability of DRL-based decision-making strategies. These encompass challenges related to sample efficiency, slow learning rates, and operational safety.
Fig. 1. An efficient and safe decision-making control framework based on PPO-DRL for autonomous vehicles. (Framework blocks: driving scenario and vehicle kinematics, with the bicycle model and the N-lane highway as inputs; reinforcement learning.)
This study aims to develop an effective and secure decision-making policy for autonomous driving (AD). To achieve this goal, a proximal policy optimization (PPO)-enhanced deep reinforcement learning (DRL) approach is presented for highway scenarios with a continuous action horizon, as illustrated in Fig. 1. Initially, the vehicle's kinematics and driving scenarios are established, wherein the autonomous ego vehicle is designed to operate efficiently and safely. Through the utilization of the policy gradient method, the PPO-enhanced DRL framework enables direct acquisition of control actions while maintaining a trust region with bounded objectives. The specific implementation details of this DRL algorithm are subsequently elaborated upon. Ultimately, a series of comprehensive test experiments are designed to assess the optimality, learning efficiency, and adaptability of the proposed decision-making policy within the context of highway scenarios.

This work introduces three key contributions and innovations: 1) the development of an advanced, efficient, and safe decision-making policy for AD on highways; 2) the application of PPO-enhanced DRL to address the transferred control optimization challenge in autonomous vehicle scenarios; 3) the establishment of an adaptive estimation framework to assess the adaptability of the proposed approach. This endeavor represents a concerted effort to enhance the efficiency and safety of decision-making policies through the utilization of cutting-edge DRL methodologies.

To elucidate the contributions of this article, the subsequent sections are organized as follows. Section II outlines the vehicle kinematics and driving scenarios on the highway. The PPO-enhanced DRL framework employed in this research is expounded upon in Section III. The ensuing section, Section IV, delves into the analysis of pertinent simulation outcomes for the proposed decision-making strategy. Ultimately, Section V presents the concluding remarks.

II. VEHICLE KINEMATICS AND DRIVING SCENARIOS

In this section, we establish the highway driving scenario for our research. This environment encompasses the autonomous ego vehicle (AEV) and the surrounding vehicles. We also detail the vehicle kinematics of these entities, allowing for the calculation of longitudinal and lateral speeds. Additionally, we introduce reference models for driving maneuvers in both the longitudinal and lateral directions.

A. Vehicle Kinematics

In this study, we elucidate the vehicle kinematics through the application of the widely acknowledged bicycle model [14]-[15], characterized by nonlinear continuous horizon equations. The representation of the inertial frame is illustrated in Fig. 2. Computation of the differentials for position and inertial heading is outlined as follows:

$\dot{x} = v \cos(\psi + \beta)$  (1)

$\dot{y} = v \sin(\psi + \beta)$  (2)

$\dot{\psi} = \frac{v}{l_r} \sin\beta$  (3)

where $(x, y)$ represents the positional coordinates of the vehicle within the inertial frame. The vehicle velocity is denoted as $v$, while $l_r$ signifies the distance between the center of mass and the rear axle. Additionally, $\psi$ represents the inertial heading, and $\beta$ denotes the slip angle at the center of gravity. This angle and the vehicle speed can be further expressed as:
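For concreteness, a minimal Python sketch of one integration step of (1)-(3) is given below. The Euler step size, the axle distances, and the slip-angle relation beta = arctan(l_r tan(delta_f) / (l_f + l_r)), taken from the standard kinematic bicycle model of [14] because the corresponding equations are not reproduced here, are assumptions rather than values from the paper.

```python
import math

def bicycle_step(x, y, psi, v, a, delta_f, lf=2.5, lr=2.5, dt=0.1):
    """One Euler step of the kinematic bicycle model, Eqs. (1)-(3).

    a       : longitudinal acceleration command [m/s^2]
    delta_f : front steering angle [rad]
    lf, lr  : distances from the center of mass to the front/rear axle
              (illustrative values, not taken from the paper).
    """
    # Slip angle at the center of gravity (standard kinematic bicycle relation [14]).
    beta = math.atan(lr / (lf + lr) * math.tan(delta_f))
    x += v * math.cos(psi + beta) * dt    # Eq. (1)
    y += v * math.sin(psi + beta) * dt    # Eq. (2)
    psi += v / lr * math.sin(beta) * dt   # Eq. (3)
    v += a * dt                           # speed update from the acceleration command
    return x, y, psi, v
```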
Fig. 3. Driving scenario on highway with N lanes for decision-making policy.

The overtaking behavior signifies instances when the subject vehicle surpasses nearby vehicles through a combination of lane-changing and accelerating maneuvers. In the longitudinal direction, the car-following behavior is regulated by the intelligent driver model (IDM) [17], whose desired gap is expressed as follows:

$d_r = d_0 + T v + \frac{v \, \Delta v}{2\sqrt{a_{max} b}}$  (7)

where $d_0$ represents the minimum relative distance between two vehicles on the same lane, while $T$ signifies the desired time interval for ensuring safety. $\Delta v$ accounts for the relative speed gap between the subject vehicle and the vehicle ahead, and $b$ denotes the value of deceleration aligned with comfortable criteria.
The specific parameters of the IDM employed in this study are detailed in Table I.

TABLE I
DEFAULT PARAMETERS OF IDM

Parameter (Symbol)                     Value   Unit
Maximum acceleration (amax)            6       m/s²
Acceleration argument (δ)              4       /
Desired time gap (T)                   1.5     s
Comfortable deceleration rate (b)      -5      m/s²
Minimum relative distance (d0)         10      m
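As an illustration, a minimal sketch of the desired-gap computation in (7) with the default parameters of Table I; interpreting b as the magnitude of the comfortable deceleration inside the square root is an assumption.

```python
import math

# Default IDM parameters from Table I.
A_MAX = 6.0    # maximum acceleration [m/s^2]
T_GAP = 1.5    # desired time gap [s]
B_COMF = 5.0   # magnitude of the comfortable deceleration rate [m/s^2]
D_0 = 10.0     # minimum relative distance [m]

def idm_desired_gap(v, delta_v):
    """Desired gap d_r of Eq. (7) for speed v and speed difference delta_v
    with respect to the vehicle ahead."""
    return D_0 + T_GAP * v + v * delta_v / (2.0 * math.sqrt(A_MAX * B_COMF))
```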
Upon establishing the longitudinal acceleration of the adjacent vehicle, MOBIL is employed to govern lateral lane-changing decisions [19]. The MOBIL framework incorporates two pivotal conditions: the safety criterion and the incentive condition. The safety criterion dictates that during a lane change, the subsequent vehicle should avoid excessive deceleration to prevent collisions. The mathematical representation of this acceleration constraint is as follows:

$\tilde{a}_n \geq -b_{safe}$  (8)

where $\tilde{a}_n$ denotes the acceleration experienced by the newly following vehicle after executing a lane change, while $b_{safe}$ represents the upper limit for deceleration applied to the new follower. Equation (8) is strategically employed to establish conditions that guarantee collision-free scenarios.

Assume that $a_n$ and $\tilde{a}_n$ denote the accelerations of the new follower prior to and after lane-changing, while $a_o$ and $\tilde{a}_o$ refer to the accelerations of the previous follower before and after the lane-change event. The incentive condition is then established through the imposition of an acceleration constraint, which is articulated as follows:

$\tilde{a}_e - a_e + p\left((\tilde{a}_n - a_n) + (\tilde{a}_o - a_o)\right) \geq a_{th}$  (9)

where $a_e$ and $\tilde{a}_e$ represent the accelerations of the AEV prior to and following a lane change. $p$ corresponds to the politeness coefficient, which quantifies the degree of influence exerted by the followers during the lane-changing process. Additionally, $a_{th}$ denotes the threshold for making lane-changing decisions. This criterion signifies that the intended lane must offer a higher level of safety compared to the current one. Notably, the accelerations employed in the MOBIL approach are determined by the IDM at each time step. Moreover, the AEV possesses the capability to execute overtaking maneuvers by transitioning between the right and left lanes. The specific parameters for MOBIL are outlined in Table II.

TABLE II
MOBIL CONFIGURATION

Keyword (Symbol)                         Value   Unit
Safe deceleration limitation (bsafe)     2       m/s²
Politeness factor (p)                    0.001   /
Lane-changing decision threshold (ath)   0.2     m/s²
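For clarity, a minimal sketch that evaluates the two MOBIL conditions (8)-(9) with the parameters of Table II; the convention that all accelerations are supplied by the IDM at the current time step follows the description above, while the function interface itself is an assumption.

```python
# MOBIL parameters from Table II.
B_SAFE = 2.0      # safe deceleration limitation [m/s^2]
P_POLITE = 0.001  # politeness factor
A_TH = 0.2        # lane-changing decision threshold [m/s^2]

def mobil_lane_change_ok(a_e, a_e_new, a_n, a_n_new, a_o, a_o_new):
    """Return True when both the safety criterion (8) and the incentive
    condition (9) of MOBIL are satisfied.

    a_e, a_e_new : AEV acceleration before / after the candidate lane change
    a_n, a_n_new : new-follower acceleration before / after the lane change
    a_o, a_o_new : old-follower acceleration before / after the lane change
    (all accelerations evaluated with the IDM at the current time step)
    """
    safety = a_n_new >= -B_SAFE  # Eq. (8)
    incentive = (a_e_new - a_e
                 + P_POLITE * ((a_n_new - a_n) + (a_o_new - a_o))) >= A_TH  # Eq. (9)
    return safety and incentive
```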
III. PROXIMAL POLICY OPTIMIZATION-ENABLED DEEP REINFORCEMENT LEARNING

This section elucidates the procedural steps involved in implementing the PPO-enhanced DRL method under study. Initially, the foundations of reinforcement learning (RL) methods and the rationale behind employing a continuous-time horizon are presented. Subsequently, the conventional format of the policy gradient technique is detailed. Finally, the utilization of PPO-enabled DRL is illuminated, facilitating the derivation of a decision-making strategy for the control problem established in Section II.

A. Necessity of Continuous Horizon

RL has emerged as a methodological approach to tackle sequential decision-making problems through a trial-and-error process [20]-[22]. This dynamic process is exemplified by the interplay between an intelligent agent and its environment. The agent takes control actions within the environment and subsequently receives evaluations of its choices from the environment [23]-[24]. Broadly, RL methods are categorized into policy-based approaches (e.g., policy gradient algorithms) and value-based approaches (e.g., Q-learning and Sarsa).

In the context of the highway decision-making problem, the intelligent agent functions as the decision-making controller for the AEV, while the surrounding vehicles comprise the environment. This interaction is typically emulated through the application of Markov decision processes (MDPs) with the Markov property [25]. The MDP is characterized by a pivotal tuple (S, A, P, R, γ), where S and A are the sets of state variables and control actions, P denotes the transition model of the state variable, and R corresponds to the reward model associated with the state-action pair (s, a). γ is referred to as the discount factor, serving to strike a balance between immediate and future rewards.
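As a sketch of this agent-environment interplay, the loop below collects one episode of transitions; the Gym-style reset()/step() interface (with the four-value step return) is an assumption and not the paper's code.

```python
def run_episode(env, policy, horizon=200):
    """One episode of the agent-environment interplay in the MDP (S, A, P, R, gamma):
    the agent applies a control action and the environment returns the next state
    together with an evaluative reward."""
    s = env.reset()
    transitions = []
    for t in range(horizon):
        a = policy(s)                         # decision-making controller of the AEV
        s_next, r, done, info = env.step(a)   # surrounding traffic acts as the environment
        transitions.append((s, a, r, s_next))
        s = s_next
        if done:
            break
    return transitions
```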
The goal of RL techniques is to select a sequence of control actions from set A to maximize cumulative rewards. The cumulative reward, denoted as $R_t$, is the sum of the current reward and the discounted future rewards:

$R_t = \sum_{t=0}^{\infty} \gamma^{t} r_t$  (10)

where $t$ is the time step, and $r_t$ represents the corresponding instantaneous reward. Two distinct value functions are defined to convey the significance of selecting control actions. These functions are identified as the state-value function V and the state-action function Q:

$V^{\pi}(s_t) = \mathbb{E}[R_t \mid s_t, \pi]$  (11)

$Q^{\pi}(s_t, a_t) = \mathbb{E}[R_t \mid s_t, a_t, \pi]$  (12)

where $\pi$ is a specific control policy. It is evident that various control policies result in distinct values of the value functions, with the pursuit of optimal performance being desirable. The optimal control policy is defined as follows:

$\pi^{*}(s_t) = \arg\max_{a_t} Q(s_t, a_t)$  (13)
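A small numerical illustration of the discounted return (10) and the greedy rule (13), assuming a finite set of candidate actions for the arg max; the example values are illustrative only.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Cumulative reward R_t of Eq. (10) for a finite sequence of instantaneous rewards."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def greedy_action(q_values):
    """Optimal control action of Eq. (13): the index maximizing Q(s_t, a_t)
    over a finite set of candidate actions (q_values is a 1-D array)."""
    return int(np.argmax(q_values))

# Example: three rewards and four candidate actions (illustrative numbers only).
print(discounted_return([1.0, 0.5, 0.2]))             # 1.0 + 0.99*0.5 + 0.99**2*0.2
print(greedy_action(np.array([0.1, 0.7, 0.3, 0.2])))  # -> 1
```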
Equation (13) guides the agent in discovering an optimal control strategy. For DRL, the value function is approximated using a neural network. From (12), the state-action function assumes the form of a matrix, with its rows and columns corresponding to the quantities of state variables and control actions. In scenarios characterized by extensive state variable and control action spaces, the process of updating the value function and searching for an appropriate control policy can become inefficient.

To address this limitation, this study models the control actions as the vehicle's throttle and steering angle. The throttle governs acceleration, while the steering angle directly impacts lane-changing behavior. These two actions operate within continuous-time horizons, specifically within ranges of [-5, 5] m/s² for acceleration and [-π/4, π/4] rad (where π ≈ 3.1416) for steering angle. This approach enables the AEV to iteratively determine control action pairs at each time step, thereby establishing the vehicle's kinematics as detailed in Section II.A.
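A minimal sketch of drawing such a continuous action pair from a Gaussian policy head and clipping it to the stated ranges; the Gaussian parameterization and the clipping (rather than, e.g., a tanh squashing) are assumptions about one possible realization, not the paper's exact implementation.

```python
import numpy as np

ACC_RANGE = (-5.0, 5.0)                # throttle / acceleration [m/s^2]
STEER_RANGE = (-np.pi / 4, np.pi / 4)  # steering angle [rad]

def sample_action(mean, std, rng=None):
    """Draw a continuous (acceleration, steering) pair from a Gaussian policy
    head and clip it to the admissible ranges used in this study.

    mean, std : 2-element arrays produced by the policy network (assumed shape).
    """
    rng = rng or np.random.default_rng()
    raw = rng.normal(mean, std)
    acc = float(np.clip(raw[0], *ACC_RANGE))
    steer = float(np.clip(raw[1], *STEER_RANGE))
    return acc, steer
```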
B. Policy Gradient

In policy-based RL methods, an estimator of the policy gradient is computed for a stochastic policy, which is represented as follows:

The clipped surrogate objective of PPO [27] restrains the policy update by applying a clipping mechanism to the probability ratio. Here, τ is a hyperparameter set to a value of 0.2. In the second term, the probability ratio $r_t(\theta)$ is bounded within the range of $1-\tau$ to $1+\tau$, subsequently forming the clipped objective through multiplication with the advantage function. The inclusion of this clipped version serves to prevent excessively significant updates to the policy derived from the previous policy.

In order to establish parameter sharing between the policy and value functions using a neural network, the loss function is reformulated by combining the policy surrogate with an error term from the value function [27]. The resulting modified loss function is constructed as follows:

$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\left[ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) - c_2 S[\pi_\theta](s_t) \right]$  (18)

where $L_t^{VF}(\theta)$ is the squared-error loss of the state-value function, $(V_\theta(s_t) - V_t^{tar})^2$, and $S$ indicates an entropy loss. $c_1$ and $c_2$ are the corresponding coefficients.

TABLE III
IMPLEMENTATION CODE OF PPO ALGORITHM

PPO Algorithm, Actor-Critic Style
1. For iteration = 1, 2, …, do
2.     For actor = 1, 2, …, M do
3.         Run policy $\pi_{\theta_{old}}$ in environment for T timesteps
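As a complement to the listing in Table III, a minimal NumPy sketch of the per-batch objective in (18) is given below; the coefficient values c1 and c2 and the batch shapes are assumptions, and the sign convention follows (18) as written, with τ passed as the clip_eps argument.

```python
import numpy as np

def ppo_objective(ratio, advantage, value_pred, value_target, entropy,
                  clip_eps=0.2, c1=0.5, c2=0.01):
    """Batch-averaged PPO objective following Eq. (18) as written in the text.

    ratio        : probability ratio r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    advantage    : estimated advantage at each timestep
    value_pred   : V_theta(s_t); value_target : V_t^tar
    entropy      : policy entropy at each s_t
    clip_eps     : clipping hyperparameter tau (0.2 in this study); c1, c2 are assumed weights.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    l_clip = np.minimum(unclipped, clipped)       # clipped surrogate objective
    l_vf = (value_pred - value_target) ** 2       # squared-error value loss
    return np.mean(l_clip - c1 * l_vf - c2 * entropy)
```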
Finally, the reward function R in this article encompasses three components, reflecting the objectives of efficiency, safety, and lane preference. Specifically, the AEV aims to maximize its speed, prioritize the right lane, and prevent collisions with other surrounding vehicles. The instantaneous reward at time step t is defined as follows:
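The exact expression is not reproduced in this excerpt. A minimal sketch combining the three stated objectives could look as follows; the weights, the speed normalization range, and the individual term definitions are assumptions, not the paper's values.

```python
def instantaneous_reward(speed, lane_index, rightmost_lane, collided,
                         v_min=20.0, v_max=30.0,
                         w_speed=0.4, w_lane=0.1, w_collision=1.0):
    """Illustrative reward with the three stated objectives:
    high speed, right-lane preference, and collision avoidance.
    All weights and the speed normalization range are assumed values."""
    speed_term = (speed - v_min) / (v_max - v_min)             # encourage driving fast
    lane_term = 1.0 if lane_index == rightmost_lane else 0.0   # prefer the right lane
    collision_term = -1.0 if collided else 0.0                 # penalize collisions
    return w_speed * speed_term + w_lane * lane_term + w_collision * collision_term
```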
The performance of the PPO-enabled policy surpasses that of the other two methods. The rewards consistently outperform those achieved by CEM and IDM+MOBIL. Consequently, the control policy derived through the PPO approach demonstrates superiority over the other two strategies.

TABLE IV
COLLISION CONDITIONS IN THREE COMPARED APPROACHES

Algorithms    Collision rate (%)   Success rate (%)
PPO-DRL       0.59                 99.03
CEM           4.32                 91.55
IDM+MOBIL     7.10                 87.21

Fig. 6. Value of loss function in two DRL methods: CEM and PPO.
To scrutinize the learning rate disparities between the PPO and CEM algorithms, Fig. 7 illustrates the trajectories of cumulative rewards across these two methodologies. As defined in (10), the cumulative rewards encompass the summation of the current reward and discounted future rewards, thereby serving as a pivotal determinant for control action selection. As is evident in Fig. 7, the PPO consistently outperforms the CEM, signifying that the control policy furnished by PPO is superior. The AEV operating under the PPO paradigm acquires a more extensive repertoire of knowledge and experiential insights pertaining to the driving environment. This augmentation can be ascribed to the innovative loss function delineated in (18), which expedites the intelligent agent's quest for an optimal control policy.

This subsection introduces an adaptive estimation framework aimed at substantiating the efficacy of the proposed decision-making policy. The fundamental determinants of a given driving scenario encompass the count of lanes and vehicles therein. In this vein, we manipulate these parameters to instantiate two novel driving scenarios. The first scenario entails four lanes, each accommodating five vehicles (designated as driving scenario 1). Conversely, the second scenario entails two lanes, with ten vehicles occupying each lane (designated as driving scenario 2). A total of 10 testing episodes are conducted for each of these distinct scenarios. The speed and position attributes of the surrounding vehicles are also subject to randomized assignment. These designed driving scenarios serve as emblematic representations of the variegated uncertainties inherent in actual driving environments. Moreover, they facilitate a lucid demonstration of the adaptability inherent in the proposed decision-making policy.
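For reference, scenarios of this kind can be instantiated in a simulator such as highway-env [29]; the configuration keys below follow that library's documented interface, while the episode duration and the exact call signatures (which may differ across library versions) are assumptions.

```python
import gym
import highway_env  # registers the "highway-v0" environment [29]

def make_scenario(lanes, vehicles_per_lane):
    """Instantiate a test scenario with the given lane and vehicle counts
    (driving scenario 1: lanes=4, vehicles_per_lane=5;
     driving scenario 2: lanes=2, vehicles_per_lane=10)."""
    env = gym.make("highway-v0")
    env.configure({
        "lanes_count": lanes,
        "vehicles_count": lanes * vehicles_per_lane,  # surrounding vehicles, randomly placed
        "duration": 40,                               # assumed episode length [s]
    })
    env.reset()
    return env
```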
Fig. 8. Rewards in the testing experiments for two new driving scenarios.

Fig. 8 depicts the aggregate rewards achieved by the PPO algorithm within these two novel driving scenarios. A higher reward value signifies a more fitting control policy tailored to the specific scenario. Conversely, lower reward values can be attributed to two distinct factors. Firstly, the random positioning of the surrounding vehicles may lead to the obstruction of all lanes, impeding the AEV's ability to execute efficient lane changes. Secondly, the AEV might engage in hazardous lane-changing maneuvers in exceptional circumstances, resulting in collisions. Notably, Fig. 8 showcases that the acquired decision-making policy exhibited superior performance within the context of the first scenario. This can be attributed to the fact that, in the initial driving scenario, an additional lane was introduced while maintaining the same count of surrounding vehicles. This augmentation in lane availability provides the AEV with increased opportunities for successful lane-changing and collision avoidance. To further elucidate the adaptability of the proposed decision-making policy, we meticulously analyze two specific episodes.

Episode snapshots: "Duration: 10. Action: Success overtaking 1 to accelerate."; "Duration: 18. Action: Success overtaking 2 to accelerate."

The graphical representation in Fig. 10 reveals that the AEV undertakes a daring lane-changing maneuver in the presence of numerous surrounding vehicles. In the training procedure, the AEV might not have encountered this particular situation, rendering it challenging to accurately predict potential collisions in such circumstances. To address this challenge, two potential research avenues could be pursued to enhance the AEV's decision-making capabilities. Firstly, extending the training process could allow the AEV to accumulate more insights from diverse driving environments, thereby refining its decision-making skills. Secondly, the integration of communication technology to provide the AEV with real-time information about its surroundings could facilitate more informed and judicious decision-making on the highway.
[4] J. Nie, J. Zhang, E. Ding, X. Wan, X. Chen, and B. Ran, “Decentralized cooperative lane-changing decision-making for connected autonomous vehicles,” IEEE Access, vol. 4, pp. 9413-9420, 2016.
[5] W. Song, G. Xiong, and H. Chen, “Intention-aware autonomous driving decision-making in an uncontrolled intersection,” Math. Probl. Eng., 2016.
[6] L. Li, K. Ota, and M. Dong, “Humanlike driving: Empirical decision-making system for autonomous vehicles,” IEEE Trans. Veh. Technol., vol. 67, no. 8, pp. 6814-6823, 2018.
[7] C. Hoel, K. Driggs-Campbell, K. Wolff, L. Laine, and M. Kochenderfer,
“Combining planning and deep reinforcement learning in tactical deci-
sion making for autonomous driving,” IEEE Transactions on Intelligent
Vehicles, vol. 5, no. 2, pp. 294-305, 2019.
[8] G. Wang, J. Hu, Z. Li, and L. Li, “Cooperative lane changing via deep
reinforcement learning,” arXiv preprint arXiv:1906.08662, 2019.
[9] Z. Cao, D. Yang, S. Xu, H. Peng, B. Li, S. Feng, and D. Zhao, “Highway
Exiting Planner for Automated Vehicles Using Reinforcement Learning,”
IEEE Trans. Intell. Transp. Syst., 2020.
[10] N. Sakib. “Highway Lane change under uncertainty with Deep Rein-
forcement Learning based motion planner,” 2020.
[11] A. Alizadeh, M. Moghadam, Y. Bicer, N. Ure, U. Yavas, and C. Kurtulus,
“Automated Lane Change Decision Making using Deep Reinforcement
Learning in Dynamic and Uncertain Highway Environment,” In 2019
IEEE Intelligent Transportation Systems Conference (ITSC), pp. 1399-
1404, 2019.
[12] S. Zhang, H. Peng, S. Nageshrao, and E. Tseng, “Discretionary Lane
Change Decision Making using Reinforcement Learning with Model-
Based Exploration,” In 2019 18th IEEE International Conference on Ma-
chine Learning and Applications (ICMLA), pp. 844-850, 2019.
[13] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. Sallab, S. Yogamani,
and P. Pérez, “Deep reinforcement learning for autonomous driving: A
survey,” arXiv preprint arXiv:2002.00444, 2020.
[14] J. Kong, M. Pfeiffer, G. Schildbach, and F. Borrelli, “Kinematic and dy-
namic vehicle models for autonomous driving control design,” In 2015
IEEE Intelligent Vehicles Symposium (IV), pp. 1094-1099, June 2015.
[15] R. Rajamani, Vehicle Dynamics and Control, ser. Mechanical Engineer-
ing Series. Springer, 2011.
[16] F. Ye, X. Cheng, P. Wang, and C. Chan, “Automated lane change strategy
using proximal policy optimization-based deep reinforcement learning,”
arXiv preprint arXiv:2002.02667, 2020.
[17] M. Treiber, A. Hennecke, and D. Helbing, “Congested traffic states in
empirical observations and microscopic simulations,” Phys. Rev. E, vol.
62, pp. 1805-1824, 2000.
[18] M. Zhou, X. Qu, and S. Jin, “On the impact of cooperative autonomous
vehicles in improving freeway merging: a modified intelligent driver
model-based approach,” IEEE Trans. Intell. Transp. Syst., vol. 18, no. 6,
pp. 1422-1428, June 2017.
[19] A. Kesting, M. Treiber, and D. Helbing, “General lane-changing model
MOBIL for car-following models,” Transportation Research Record, vol.
1999, no. 1, pp. 86-94, 2007.
[20] T. Liu, X. Hu, W. Hu, Y. Zou, “A heuristic planning reinforcement learn-
ing-based energy management for power-split plug-in hybrid electric ve-
hicles,” IEEE Trans. Ind. Inform., vol. 15, no. 12, pp. 6436-6445, 2019.
[21] T. Liu, X. Tang, H. Wang, H. Yu, and X. Hu, “Adaptive Hierarchical Energy Management Design for a Plug-in Hybrid Electric Vehicle,” IEEE Trans. Veh. Technol., vol. 68, no. 12, pp. 11513-11522, 2019.
[22] T. Liu, B. Wang, and C. Yang, “Online Markov Chain-based energy man-
agement for a hybrid tracked vehicle with speedy Q-learning,” Energy,
vol. 160, pp. 544-555, 2018.
[23] T. Liu, H. Yu, H. Guo, Y. Qin, and Y. Zou, “Online energy management
for multimode plug-in hybrid electric vehicles,” IEEE Trans. Ind. Inform.,
vol. 15, no. 7, pp. 4352-4361, July 2019.
[24] J. Duan, S. E. Li, Y. Guan, Q. Sun, and B. Cheng, “Hierarchical rein-
forcement learning for self-driving decision-making without reliance on
labelled driving data,” IET Intell. Transp. Syst., vol. 14, no. 5, pp. 297-
305, 2020.
[25] M. L., Puterman, “Markov decision processes: discrete stochastic dy-
namic programming,” John Wiley & Sons, 2014.
[26] X. Hu, T. Liu, X. Qi, and M. Barth, “Reinforcement learning for hybrid and plug-in hybrid electric vehicle energy management: Recent advances and prospects,” IEEE Ind. Electron. Mag., vol. 13, no. 3, pp. 16-25, 2019.
[27] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[28] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, pp. 1889-1897, 2015.
[29] L. Edouard, “An environment for autonomous driving decision-making,” https://github.com/eleurent/highway-env, GitHub, 2018.
[30] I. Szita and A. Lörincz, “Learning Tetris using the noisy cross-entropy method,” Neural Computation, vol. 18, no. 12, pp. 2936-2941, 2006.