
Received August 14, 2020, accepted August 31, 2020, date of publication September 9, 2020, date of current version October 8, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.3022755

Decision-Making Strategy on Highway for Autonomous Vehicles Using Deep Reinforcement Learning

JIANGDONG LIAO1, TENG LIU2, (Member, IEEE), XIAOLIN TANG2, (Member, IEEE), XINGYU MU2, BING HUANG2, AND DONGPU CAO3
1 School of Mathematics and Statistics, Yangtze Normal University, Chongqing 408100, China
2 College of Automotive Engineering, Chongqing University, Chongqing 400044, China
3 Department of Mechanical and Mechatronics Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada

Corresponding authors: Teng Liu ([email protected]) and Xiaolin Tang ([email protected])


This work was supported in part by the State Key Laboratory of Mechanical System and Vibration under Grant MSV202016.

ABSTRACT Autonomous driving is a promising technology to reduce traffic accidents and improve driving efficiency. In this work, a deep reinforcement learning (DRL)-enabled decision-making policy is constructed for autonomous vehicles to address overtaking behaviors on the highway. First, a highway driving environment is established, wherein the ego vehicle aims to pass the surrounding vehicles with an efficient and safe maneuver. A hierarchical control framework is presented to control these vehicles, in which the upper level manages the driving decisions and the lower level regulates vehicle speed and acceleration. Then, a particular DRL method named the dueling deep Q-network (DDQN) algorithm is applied to derive the highway decision-making strategy. The detailed computational procedures of the deep Q-network and DDQN algorithms are discussed and compared. Finally, a series of simulation experiments are conducted to evaluate the effectiveness of the proposed highway decision-making policy. The advantages of the proposed framework in convergence rate and control performance are demonstrated. Simulation results reveal that the DDQN-based overtaking policy can accomplish highway driving tasks efficiently and safely.

INDEX TERMS Autonomous driving, decision-making, deep reinforcement learning, dueling deep
Q-network, deep Q-learning, overtaking policy.

I. INTRODUCTION

Autonomous driving (AD) enables the vehicle to engage in different driving missions without a human driver [1], [2]. Motivated by the enormous potential of artificial intelligence (AI), autonomous vehicles or automated vehicles have become one of the research hotspots all over the world [3]. Many automobile manufacturers, such as Toyota, Tesla, Ford, Audi, Waymo, Mercedes-Benz, and General Motors, are developing their own autonomous cars and achieving tremendous progress. Meanwhile, automotive researchers are paying attention to overcoming the essential technologies to build automated cars with full automation [4].

Four significant modules are contained in autonomous vehicles, which are perception, decision-making, planning, and control [5]. Perception indicates that the autonomous vehicle knows the information about the driving environment based on the functions of a variety of sensors, such as radar, lidar, the global positioning system (GPS), and so on [6]. The decision-making controller manages the driving behaviors of the vehicle, and these behaviors include acceleration, braking, lane-changing, lane-keeping, and so on [7]. The planning function helps the automated car find reasonable running trajectories from one point to another. Finally, the control module commands the onboard powertrain components to operate accurately to finish the driving maneuvers and follow the planned path. According to the intelligent degrees of these mentioned modules, AD is classified into six levels, from L0 to L5 [8].

The decision-making strategy is regarded as the human brain of the vehicle and is extremely important in autonomous vehicles [9]. This policy is often generated by manual rules based on human driving experiences or by imitated manipulation learned from supervised learning approaches [10]. For example, Song et al. applied a continuous hidden Markov chain to predict the motion intention of the surrounding vehicles. Then, a partially observable Markov decision process (POMDP) is used to construct the general decision-making framework [11], [12].


FIGURE 1. The constructed deep reinforcement learning-enabled highway overtaking driving policy for autonomous vehicles.

The authors in [13] developed an advanced ability to make appropriate decisions in city road traffic situations. The presented decision-making policy follows multiple criteria, which helps the city cars make feasible choices in different conditions. In Ref. [14], Nie et al. discussed the lane-changing decision-making strategy for connected automated cars. The related model combines cooperative car-following models and a candidate decision generation module. Furthermore, the authors in [15] mentioned the thought of a human-like driving system. It could adjust the driving decisions by considering the driving demand of human drivers.

Deep reinforcement learning (DRL) techniques are taken as a powerful tool to deal with long sequential decision-making problems [16]. In recent years, many attempts have been implemented to study DRL-based autonomous driving topics. For example, Duan et al. built a hierarchical structure to learn the decision-making policy via the reinforcement learning (RL) method [17]. The advantage of this work is its independence from historical labeled driving data. Refs. [18], [19] utilized DRL approaches to handle the collision avoidance and path following problems for automated vehicles. The relevant control performance is better than that of the conventional RL methods in these two findings. Furthermore, the authors in [20], [21] considered not only path planning but also the fuel consumption of autonomous vehicles. The related algorithm is deep Q-learning (DQL), and it was proven to accomplish these two driving missions suitably. Han et al. employed the DQL algorithm to decide the lane change or lane keep for connected autonomous cars, in which the information of the nearby vehicles is treated as feedback knowledge from the network [22]. The resulting policy is able to promote traffic flow and driving comfort. However, the common DRL methods are unable to address the highway overtaking problems because of the continuous action space and large state space [23].

In this work, a DRL-enabled highway overtaking driving policy is constructed for autonomous vehicles. The proposed decision-making strategy is evaluated and estimated to be adaptive to other complicated scenarios, as depicted in Fig. 1. First, the studied driving environment is founded on the highway, wherein an ego vehicle aims to run through a particular driving scenario efficiently and safely. Then, a hierarchical control structure is shown to manipulate the lateral and longitudinal motions of the ego and surrounding vehicles. Furthermore, the special DRL algorithm called dueling deep Q-network (DDQN) is derived and utilized to obtain the highway decision-making strategy. The DQL and DDQN algorithms are compared and analyzed theoretically. Finally, the performance of the proposed control framework is discussed via executing a series of simulation experiments. Simulation results reveal that the DDQN-based overtaking policy could accomplish highway driving tasks efficiently and safely.

The main contributions and innovations of this work can be cast into three perspectives: 1) an adaptive and optimal DRL-based highway overtaking strategy is proposed for automated vehicles; 2) the dueling deep Q-network (DDQN) algorithm is leveraged to address the large state space of the decision-making problem; 3) the convergence rate and control optimization of the derived decision-making policy are demonstrated by multiple designed experiments.

The organization of this article is given as follows: the highway driving environment and the control modules of the ego and surrounding vehicles are described in Section II. The DQL and DDQN algorithms are defined in Section III, in which the parameters of the RL framework are discussed in detail. Section IV shows the relevant results of a series of simulation experiments. Finally, the conclusion is given in Section V.


II. DRIVING ENVIRONMENT AND CONTROL MODULE

In this section, the studied driving scenario on the highway is introduced. Without loss of generality, a three-lane freeway environment is constructed. Furthermore, a hierarchical motion controller is described to manage the lateral and longitudinal movements of the ego and surrounding vehicles. The upper level contains two models, which are the intelligent driver model (IDM) and minimizing overall braking induced by lane changes (MOBIL) [24]. The lower level focuses on regulating vehicle velocity and acceleration.

A. HIGHWAY DRIVING SCENARIO

Decision-making in autonomous driving means selecting a sequence of reasonable driving behaviors to achieve special driving missions. On the highway, these behaviors involve lane-changing, lane-keeping, acceleration, and braking. The main objectives are avoiding collisions, running efficiently, and driving on the preferred lane. Accelerating and surpassing other vehicles is a typical driving behavior called overtaking.

This work discusses the decision-making problem on the highway for autonomous vehicles, and the researched driving scenario is depicted in Fig. 2. The orange vehicle is the ego vehicle, and the other green cars are named surrounding vehicles. There are three lanes in the driving environment, and the derived decision-making policy in this paper is easily generalized to different situations. The ego vehicle is initialized in the middle lane at a random speed.

FIGURE 2. Highway driving environment for the decision-making problem with three lanes.

The objective of the ego vehicle is to run as quickly as possible without crashing into the surrounding vehicles. Hence, this goal is interpreted as efficiency and safety. The initial velocity and position of the surrounding vehicles are designed randomly. It implies the driving scenario contains uncertainties, as in actual driving. Furthermore, to imitate real conditions, the ego vehicle prefers to stay on lane 1 (L = 1), and it can overtake other vehicles from the right or left sides.

At the beginning of this driving task, all the surrounding vehicles are located in front of the ego vehicle. In each lane, the number of surrounding vehicles is M, which indicates there are 3M nearby cars in this situation. Two conditions would interrupt the ego vehicle, which are crashing into other vehicles or reaching the time limit. The procedure of running from the starting point to the ending point is called one episode in this work.

Without loss of generality, the parameters of the driving scenario are set as follows: the initial speed of the ego vehicle is chosen from [23, 25] m/s, its maximum speed is 40 m/s, and the length and width of all vehicles are 5 m and 2 m. The duration of one episode is 100 s, and the simulation frequency is 20 Hz. The initial velocity of the surrounding vehicles is randomly chosen from [20, 23] m/s, and their behaviors are manipulated by IDM and MOBIL. The next section discusses these two models in detail.

B. VEHICLE BEHAVIOR CONTROLLER

The movements of all the vehicles in the highway environment are mastered by a hierarchical control framework, as shown in Fig. 3. The upper level applies IDM and MOBIL to manage the vehicle behaviors, and the lower level aims to enable the ego vehicle to track a given target speed and follow a target lane. In this work, the DRL method is used to control the ego vehicle. The reference model implies that the ego vehicle is controlled by the bi-level structure in Fig. 3, which is taken as a benchmark to evaluate the DRL-based decision-making strategy.

IDM in the upper level is a prevalent microscopic model [25] to realize car-following and collision-free driving. In the adaptive cruise controller of automated cars, the longitudinal behavior is usually decided by IDM. In general, the longitudinal acceleration in IDM is determined as [26]:

a = a_max · [1 − (v / v_tar)^δ − (d_tar / d)²]   (1)

where v and a are the current vehicle speed and acceleration, a_max is the maximum acceleration, d is the distance to the front car, and δ is named the constant acceleration parameter. v_tar and d_tar are the target velocity and distance, and the desired speed is achieved by a_max and d_tar. In IDM, the expected distance d_tar is affected by the front vehicle and is calculated as follows:

d_tar = d_0 + T·v + (v·Δv) / (2·√(a_max·b))   (2)

where d_0 is the predefined minimum relative distance, T is the expected time interval for the safety goal, Δv is the relative speed between the two vehicles, and b is the deceleration rate according to the comfort purpose.

In IDM, the relative speed and distance are defined a priori to induce the vehicle velocity and acceleration at each time step. The default configuration is introduced as follows: the maximum acceleration a_max is 6 m/s², the acceleration exponent δ is 4, the desired time gap T is 1.5 s, the comfortable deceleration rate b is −5 m/s², and the minimum relative distance d_0 is 10 m.
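As a concrete illustration of how the upper-level longitudinal model can be evaluated, the following minimal Python sketch computes (1) and (2) with the default parameters listed above. The function names and the example inputs are illustrative assumptions, not the authors' implementation.

```python
import math

# Default IDM parameters quoted in the text above (b is used as a positive
# deceleration magnitude inside the square root of (2)).
A_MAX = 6.0    # maximum acceleration [m/s^2]
DELTA = 4.0    # acceleration exponent
T_GAP = 1.5    # desired time gap T [s]
B_COMF = 5.0   # comfortable deceleration magnitude [m/s^2]
D_MIN = 10.0   # minimum relative distance d_0 [m]

def idm_desired_gap(v: float, dv: float) -> float:
    """Desired gap d_tar of (2) for speed v and approach rate dv = v - v_front."""
    return D_MIN + T_GAP * v + v * dv / (2.0 * math.sqrt(A_MAX * B_COMF))

def idm_acceleration(v: float, v_tar: float, d: float, dv: float) -> float:
    """Longitudinal acceleration of (1) given the gap d to the front car."""
    return A_MAX * (1.0 - (v / v_tar) ** DELTA - (idm_desired_gap(v, dv) / d) ** 2)

# Example: ego at 25 m/s, free-flow target of 40 m/s, 30 m gap, closing at 2 m/s.
print(idm_acceleration(v=25.0, v_tar=40.0, d=30.0, dv=2.0))
```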


FIGURE 3. The hierarchical control framework discussed in this work for the ego vehicle and surrounding vehicles.

Since the IDM is utilized to determine the longitudinal behavior, MOBIL is employed to make the lateral lane-change decisions [27]. MOBIL states that lane-changing behaviors should obey two restrictions, which are the safety criterion and the incentive condition. These constraints are related to the ego vehicle e, the follower i (of the ego vehicle) in the current lane, and the follower j in the target lane of the lane change. Assume a_i^old and a_j^old are the accelerations of these followers before changing, and a_i^new and a_j^new are the accelerations after changing.

The safety criterion requires the follower in the desired lane (after changing) to limit its acceleration to avoid a collision. The mathematical expression is shown as:

a_j^new ≥ −b_safe   (3)

where b_safe is the maximum braking imposed on the follower in the lane-changing behavior. By following (3), collisions and accidents could be avoided effectively.

The incentive condition is imposed on the ego vehicle and its followers by an acceleration threshold a_th:

a_e^new − a_e^old + z·[(a_i^new − a_i^old) + (a_j^new − a_j^old)] > a_th   (4)

where z is named the politeness coefficient to determine the degree to which the followers affect the lane-changing behaviors. This incentive condition means the desired lane should be safer than the old lane. For application, the parameters in MOBIL are defined as follows: the politeness factor z is 0.001, the safe deceleration limit b_safe is 2 m/s², and the acceleration threshold a_th is 0.2 m/s². After deciding the longitudinal and lateral behaviors in the upper level, the lower level is applied to follow the target speed and lane.
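The MOBIL test in (3)-(4) reduces to two inequality checks once the before/after accelerations of the affected vehicles have been predicted (for example, by re-running IDM on the hypothetical lane configuration). The sketch below uses the parameter values quoted above; the function and argument names are assumptions made for illustration.

```python
Z_POLITE = 0.001  # politeness factor z
B_SAFE = 2.0      # safe deceleration limit b_safe [m/s^2]
A_TH = 0.2        # acceleration gain threshold a_th [m/s^2]

def mobil_lane_change_ok(a_e_new, a_e_old,
                         a_i_new, a_i_old,
                         a_j_new, a_j_old) -> bool:
    """Return True if a candidate lane change satisfies both MOBIL criteria."""
    # Safety criterion (3): the new follower j must not brake harder than b_safe.
    if a_j_new < -B_SAFE:
        return False
    # Incentive criterion (4): ego gain plus politeness-weighted follower
    # changes must exceed the threshold a_th.
    gain = (a_e_new - a_e_old) + Z_POLITE * ((a_i_new - a_i_old) + (a_j_new - a_j_old))
    return gain > A_TH
```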
C. VEHICLE MOTION CONTROLLER

In the lower level, the motions of the vehicles in the longitudinal and lateral directions are controlled. The former regulates the acceleration by a proportional controller as:

a = K_p · (v_tar − v)   (5)

where K_p is the proportional gain.

In the lateral direction, the controller deals with the position and heading of the vehicle with a simple proportional-derivative action. For the position, the lateral speed v_lat of the vehicle is computed as follows:

v_lat = −K_p,lat · Δ_lat   (6)

where K_p,lat is named the position gain and Δ_lat is the lateral position of the vehicle with respect to the center-line of the lane. Then, the heading control is related to the yaw-rate command ϕ̇ as:

ϕ̇ = K_p,ϕ · (ϕ_tar − ϕ)   (7)

where ϕ_tar is the target heading angle to follow the desired lane and K_p,ϕ is the heading gain.

Hence, the movements of the surrounding vehicles are achieved by the bi-level control framework in Fig. 3. The position, speed, and acceleration of these vehicles are assumed to be known to the ego vehicle. This limitation propels the ego vehicle to learn how to drive in the scenario via a trial-and-error procedure. In the next section, the DRL approach is introduced and established to realize this learning process and derive the highway decision-making policy.
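A minimal sketch of the lower-level controllers in (5)-(7) is given below. The gain values are placeholders, since the paper does not report them, and should be read as assumptions.

```python
K_P = 0.5        # speed proportional gain K_p (assumed value)
K_P_LAT = 0.6    # lateral position gain K_p,lat (assumed value)
K_P_PHI = 1.0    # heading gain K_p,phi (assumed value)

def longitudinal_control(v: float, v_tar: float) -> float:
    """Acceleration command of (5)."""
    return K_P * (v_tar - v)

def lateral_control(lat_offset: float, heading: float, heading_tar: float):
    """Lateral speed command of (6) and yaw-rate command of (7)."""
    v_lat = -K_P_LAT * lat_offset
    yaw_rate = K_P_PHI * (heading_tar - heading)
    return v_lat, yaw_rate
```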


III. DRL METHODOLOGY

This section introduces the RL method and exhibits the special DRL algorithms. The interaction in RL between the agent and the environment is first explained. Then, the DQL algorithm that incorporates the neural network and the Q-learning algorithm is formulated. Finally, a dueling network is constructed in the DQL algorithm to reconstitute the output layer of the neural network and thus raise the DDQN method.

A. RL CONCEPT

The RL approach describes the process in which an intelligent agent interacts with its environment. It is powerful and useful for solving sequential decision-making problems. The goal of the agent is to search for an optimal sequence of control actions based on feedback from the environment. Owing to its characteristics of self-evaluation and self-promotion, RL is widely used in many research fields [2], [28]–[31].

In the decision-making problem on the highway, the agent and the environment are the ego vehicle and the surrounding vehicles (including the driving conditions), respectively. This problem can be mimicked by a Markov decision process (MDP), which indicates that the next state variable is only concerned with the current state and action [32]. It means the discussed sequential decision-making problem of autonomous driving has the Markov property. The related MDP often represents the RL interaction as a tuple (S, A, P, R, γ), in which S and A are the state and control sets. P and R are the significant elements of the environment in RL, and they mean the transition and reward models, respectively. In RL, the current action would influence the immediate and future rewards synchronously. Hence, γ is a discount factor to balance these two parts of rewards.

To represent the list of future rewards, the accumulated reward R_t is defined as follows:

R_t = Σ_{t}^{∞} γ^t · r_t   (8)

where t is the time instant and r_t is the relevant reward. To record the worth of the state s and the state-action pair (s, a), two value functions are expressed by the accumulated reward as:

V^π(s_t) = E_π[R_t | s_t, π]   (9)

Q^π(s_t, a_t) = E_π[R_t | s_t, a_t, π]   (10)

where π is called the control action policy, V is the state-value function, and Q is the state-action function (called the Q table for short). To be updated easily, the state-action function is usually rewritten in the recursive form:

Q^π(s_t, a_t) = E_π[r_t + γ · max_{a_{t+1}} Q^π(s_{t+1}, a_{t+1})]   (11)

Finally, the optimal control action with respect to the control policy π is determined by the state-action function:

π(s_t) = arg max_{a_t} Q(s_t, a_t)   (12)

Therefore, the essence of different RL algorithms is updating the state-action function Q(s, a) in various ways. According to the style of the updating rules, the RL algorithms have diverse classifications, such as model-based and model-free, policy-based and value-based, temporal-difference (TD), and Monte-Carlo (MC) [33].
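To make (8) and (12) concrete, the toy snippet below accumulates a discounted return over a finite reward sequence and extracts the greedy action from a tabular Q function. It is only a didactic sketch with made-up dimensions, not part of the authors' toolchain.

```python
import numpy as np

GAMMA = 0.8  # discount factor, matching the value reported later in this paper

def discounted_return(rewards, gamma=GAMMA):
    """Accumulated reward R_t of (8) for a finite reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def greedy_action(q_table, state):
    """Greedy policy of (12): the action maximizing Q(s, a)."""
    return int(np.argmax(q_table[state]))

# Toy example: 3 states x 5 actions, with one clearly preferable action in state 0.
q = np.zeros((3, 5))
q[0, 2] = 1.0
print(discounted_return([1.0, 0.5, 0.25]), greedy_action(q, 0))
```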
of the control actions at each step. A new neural network
B. DEEP Q NETWORK

The deep Q-network (DQN) was first presented to play the Atari games in [34]. It synthesizes the strengths of deep learning (neural network) and Q-learning to obtain the new state-value function. In the common Q-learning, the updating rule of this function is narrated as follows:

Q(s, a) ← Q(s, a) + α·[r + γ · max_{a'} Q(s', a') − Q(s, a)]   (13)

where α ∈ [0, 1] is named the learning rate to trade off the old and new learned experiences from the environment, and s' and a' are the state and action at the next time step.

The common Q-learning is unable to handle problems with a large state space because it needs an enormous amount of time to obtain the mutable Q table. Thus, in DQN, a neural network is employed to approximate the Q table as Q(s, a; θ). For the neural network, the inputs are the arrays of state variables and control actions, and the output is the state-value function [34].

To measure the discrepancy between the approximated and actual Q table in DQN, the loss function is introduced as the following expression:

L(θ) = E[Σ_{t=1}^{N} (y_t − Q(s, a; θ))²]   (14)

where

y_t = r_t + γ · max_{a'} Q(s', a'; θ')   (15)

As can be seen, there are two parameter sets (θ and θ') of the neural network, which delegate two networks in DQN. These networks are the prediction and target networks. The former is applied to estimate the current control action, and the latter aims to generate the target value. In general, the target network copies the parameters from the prediction network every certain number of time steps. By doing this, the target Q table converges toward the predicted one to some extent to remit the network instability.

In DQN, the online neural network is updated by gradient descent as follows:

∇_θ L(θ) = E[(y_t − Q(s, a; θ)) · ∇_θ Q(s, a; θ)]   (16)

This operation makes DQN an off-policy algorithm, and the states and rewards are acquired by a special criterion. This rule is known as epsilon-greedy, which indicates that the agent executes exploration (choosing a random action) with probability ε and exploitation (using the current best action) with probability 1 − ε.
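The prediction/target-network machinery of (13)-(16) and the epsilon-greedy rule can be sketched as follows. This is a generic PyTorch rendering under assumed layer sizes, not the authors' code; the batch would typically be drawn from an experience replay buffer, an implementation detail the paper does not spell out.

```python
import random
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Prediction (or target) network approximating Q(s, .; theta)."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)  # one Q value per discrete action

def epsilon_greedy(q_net: QNet, state: torch.Tensor, eps: float, n_actions: int) -> int:
    if random.random() < eps:                 # exploration with probability eps
        return random.randrange(n_actions)
    with torch.no_grad():                     # exploitation, as in (12)
        return int(q_net(state).argmax().item())

def dqn_loss(q_net: QNet, target_net: QNet, batch, gamma: float = 0.8) -> torch.Tensor:
    s, a, r, s_next, done = batch             # tensors; a is int64, done is 0/1 float
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)               # Q(s, a; theta)
    with torch.no_grad():
        y = r + gamma * (1 - done) * target_net(s_next).max(1).values  # target of (15)
    return nn.functional.mse_loss(q_sa, y)    # squared error of (14)
```

Copying θ into θ' every fixed number of steps, for example with `target_net.load_state_dict(q_net.state_dict())`, implements the periodic parameter synchronization described above.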


FIGURE 4. The dueling network combined with state-value network and advantage network for Q table updating.

C. DUELING DQN ALGORITHM

In some RL problems, the selection of the current control action may not cause apparent negative results. For example, in the highway environment, many actions would not lead to a collision. However, these choices may indirectly result in bad rewards afterward [34]. Motivated by this insight, a dueling network is adopted in this work to estimate the worth of the control actions at each step. A new neural network is constructed to approximate the Q table in the highway decision-making problem, as shown in Fig. 4.

Two streams of fully connected layers are used to estimate the state-value function V(s) and the advantage function A(s, a) of each action. Therefore, the state-action function (Q table) is constituted as follows:

Q^π(s, a) = A^π(s, a) + V^π(s)   (17)

It is obvious that the output of this new dueling network is also a Q table, and thus the neural network used in DQN can also be employed to approximate this Q table. The network with two parameter sets is computed as:

Q^π(s, a; θ) = V^π(s; θ_1) + A^π(s, a; θ_2)   (18)

where θ_1 and θ_2 are the parameters of the state-value function and the advantage function, respectively.

To update the Q table in DDQN and achieve the optimal control action, (18) is reformulated as follows:

Q^π(s, a; θ) = V^π(s; θ_1) + (A^π(s, a; θ_2) − max_{a'} A^π(s, a'; θ_2))   (19)

a* = arg max_{a'} Q(s, a'; θ) = arg max_{a'} A(s, a'; θ_2)   (20)

It can be discerned that the input-output interfaces in DDQN and DQN are the same. Hence, the gradient descent in (16) is capable of being recycled to train the Q table in this work.
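The dueling head of (18)-(19) only changes how the Q values are assembled from the two streams, so it can be dropped into the DQN sketch above without touching the loss. A minimal PyTorch version, with assumed sizes, is:

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Shared trunk feeding a state-value stream V(s; theta1) and an
    advantage stream A(s, a; theta2), combined as in (19)."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # V(s; theta1)
        self.advantage = nn.Linear(hidden, n_actions)   # A(s, a; theta2)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        h = self.trunk(s)
        v = self.value(h)                               # shape [B, 1]
        a = self.advantage(h)                           # shape [B, n_actions]
        # Q(s, a; theta) = V + (A - max_a' A), following (19)
        return v + a - a.max(dim=1, keepdim=True).values

# Same interface as a plain DQN head, so the loss of (14)-(16) is reused unchanged.
q = DuelingQNet(state_dim=10, n_actions=5)
print(q(torch.zeros(1, 10)).shape)  # torch.Size([1, 5])
```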
where collision ∈ {0, 1} and the goal of the DDQN-based
D. VARIABLES SPECIFICATION

To derive the DDQN-based decision-making strategy, the preliminaries are initialized as follows, and the calculative procedure is easily transformed into an analogous driving environment. The control actions are the longitudinal and lateral accelerations (a_1 and a_2) with the unit m/s²:

a_1 ∈ [−5, 5] m/s²   (21)

a_2 ∈ [−1, 1] m/s²   (22)

It is noticed that when these two accelerations are zero, the ego vehicle adopts an idling control.

After obtaining the acceleration actions, the speed and position of the vehicle can be computed as follows:

v_1^{t+1} = v_1^t + a_1 · Δt,  v_2^{t+1} = v_2^t + a_2 · Δt   (23)

d_1^t = v_1^t · Δt + (1/2) · a_1 · Δt²,  d_2^t = v_2^t · Δt + (1/2) · a_2 · Δt²   (24)

where v_1 and v_2 are the longitudinal and lateral speeds of the vehicle, respectively, and similarly for d_1 and d_2. The policy frequency is 1 Hz, which indicates the time interval Δt is 1 second. It should be noticed that (23) and (24) are feasible for the ego vehicle and the surrounding vehicles simultaneously, and these expressions are considered as the transition model P in RL. Then, the state variables are defined as the relative speed and distance between the ego and nearby cars:

Δd_t = d_t^ego − d_t^sur   (25)

Δv_t = v_t^ego − v_t^sur   (26)

where the superscripts ego and sur represent the ego vehicle and the surrounding vehicles, respectively.
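A literal transcription of the transition model (23)-(24) and the relative state (25)-(26) is shown below; the variable names are illustrative only.

```python
# Minimal sketch of the transition model (23)-(24) and the relative state
# (25)-(26) with the 1 Hz policy frequency stated in the text.
DT = 1.0  # policy time interval Delta-t [s]

def step_kinematics(v_lon, v_lat, a_lon, a_lat, dt=DT):
    """Advance speeds and displacements one policy step, following (23)-(24)."""
    d_lon = v_lon * dt + 0.5 * a_lon * dt ** 2
    d_lat = v_lat * dt + 0.5 * a_lat * dt ** 2
    return v_lon + a_lon * dt, v_lat + a_lat * dt, d_lon, d_lat

def relative_state(d_ego, v_ego, d_sur, v_sur):
    """Relative distance and speed of (25)-(26) used as RL state variables."""
    return d_ego - d_sur, v_ego - v_sur
```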
Finally, the reward model R is constituted by the optimal control objectives, which are avoiding collisions, running as fast as possible, and trying to drive on lane 1 (L = 1). To bring this insight to fruition, the instantaneous reward function is defined as follows:

r_t = −1 · collision − 0.1 · (v_t^ego − v_max^ego)² − 0.4 · (L − 1)²   (27)

where collision ∈ {0, 1}, and the goal of the DDQN-based highway decision-making strategy is maximizing the cumulative reward.

The proposed decision-making control policy is trained and evaluated in the simulation environment based on the OpenAI Gym Python toolkit [35]. The numbers of lanes and surrounding vehicles are 3 and 30. The discount factor γ and the learning rate α are 0.8 and 0.2. The layers of the value network and the advantage network both have 128 units. The value of ε decreases from 1 to 0.05 within 6000 time steps. The number of training episodes in the different DRL approaches is 2000. The next section discusses the effectiveness of the presented decision-making strategy for autonomous vehicles.
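The reward in (27) is equally direct to transcribe. The weights are those stated in the text; the example call at the end is an arbitrary assumed situation, not a value from the paper.

```python
V_MAX = 40.0  # maximum ego speed [m/s]

def instantaneous_reward(collision: int, v_ego: float, lane: int) -> float:
    """r_t = -1*collision - 0.1*(v_ego - v_max)^2 - 0.4*(lane - 1)^2, as in (27)."""
    return (-1.0 * collision
            - 0.1 * (v_ego - V_MAX) ** 2
            - 0.4 * (lane - 1) ** 2)

# Example: no collision, driving at 35 m/s on lane 2.
print(instantaneous_reward(collision=0, v_ego=35.0, lane=2))
```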


IV. RESULTS AND EVALUATION

In this section, the proposed highway decision-making policy is estimated by comparing it with the benchmark methods. These techniques are the reference model in Fig. 3 and the common DQN in Section III.B. The optimality is analyzed by conducting a comparison of these three methods. Furthermore, the adaptability of the presented approach is verified by implementing the trained model in a similar highway driving scenario.

A. OPTIMALITY EVALUATION

The reference model, DQN, and DDQN are compared in this subsection. All of them adopt a hierarchical control framework. The lower levels are the same and utilize (5)-(7) to regulate the acceleration, position, and heading. The upper levels are different: IDM and MOBIL, the DQN algorithm in Section III.B, and the DDQN algorithm in Section III.C, respectively. The default parameters are the same in DQN and DDQN.

FIGURE 5. Average reward variation in the three compared methods: the reference model, DQN, and DDQN.

Fig. 5 depicts the normalized average rewards of these three methods. Based on the definition of the reward function in (27), a higher reward indicates driving on the preferred lane with a more efficient maneuver. It is obvious that the training stability and learning speed of DDQN are better than those of the other two approaches. Besides, after about 500 episodes, the reward in DDQN is greater than that of the other two approaches and keeps this momentum all the time, and both learning methods are better than the reference model. This is mainly caused by the advantage network in DDQN. This network could assess the worth of the chosen action at each step, which helps the ego vehicle find a better decision-making policy quickly.

FIGURE 6. Vehicle speed and traveling distance of the ego vehicle in each episode of the compared techniques.

To observe the trajectories of the state variables in this work, Fig. 6 shows the average vehicle speed and traveling distance of these three compared techniques. They are all trained for 2000 episodes. A higher average speed implies that the ego vehicle could run through the driving scenario faster and achieve greater cumulative rewards. The travel distance is affected by the collision conditions and the vehicle speed. A higher traveling distance means the ego vehicle could drive longer within the time limit without collision. The random setting of the initial speeds and positions of the surrounding vehicles is the reason why the average speed and travel distance (of the ego vehicle) fluctuate until the training is over. These results directly reflect the safety and efficiency demands. The noticeable differences are able to certify the optimality of the proposed algorithm.

FIGURE 7. Collision conditions of the ego vehicle in each compared method: collision = 0, the ego vehicle does not crash into other vehicles; collision = 1, the ego vehicle crashes into other vehicles.

As the ego vehicle is not willing to crash into the surrounding vehicles, the collision conditions of these three control cases are described in Fig. 7, wherein collision takes two values (collision = 0 or 1). It can be noticed that the DDQN-, DQN-, and reference model-enabled agents could avoid collisions after about 1300, 1700, and 1950 episodes, respectively. This appearance can also prove that the DDQN-based agent is more intelligent than the other two agents. As the safety claim is the first concern for the actual application of automated driving, the learned decision-making model based on DDQN is more promising to be employed in real-world environments.


Furthermore, to show that the concrete control actions are different in these three methods, the curves of the control action sequences of one successful episode (meaning the ego vehicle drives from the starting point to the ending point) are given in Fig. 8. The actions in the longitudinal and lateral directions of the ego vehicle are discretized into five selections: changing to the left lane, changing to the right lane, idling, running faster, and running slower. The differences between these trajectories indicate that the proposed decision-making policy is different from the two benchmark methods (in the same successful episode). Overall, according to all the displayed results in this subsection, the optimality of the DDQN-enabled decision-making strategy is illuminated.

FIGURE 8. Control actions in one successful episode of the three compared methods: Index = 1, changing to the left lane; Index = 2, idling speed; Index = 3, changing to the right lane; Index = 4, running faster; Index = 5, running slower.

B. COMPARISON BETWEEN DQN AND DDQN

Since DQN and DDQN are two prominent DRL algorithms, this experiment aims to appraise the learning and training procedures of these two approaches, as the target of the neural network is acquiring the mutable Q table. The normalized mean discrepancy of the Q table in the training process of these two methods is displayed in Fig. 9. The downtrend graphs indicate that both ego vehicles become more familiar with the driving environment by interacting with it. Furthermore, it can be discerned that the DDQN could learn more knowledge about the traffic situations within the same number of episodes, and thus results in a faster learning course. Hence, the ego vehicle could be manipulated more efficiently and safely under the guidance of the DDQN algorithm.

FIGURE 9. Mean discrepancy of the Q table in the training process of the two DRL approaches.

To exhibit the usage of the dueling network in future decisions in this autonomous driving problem, Fig. 10 shows the track of the cumulative rewards. The uptrend variation implies that the control action choices are capable of improving future rewards. As the cumulative reward of DDQN is larger than that of DQN, it signifies that the related agent could achieve better control performance. This is also attributed to the advantage network in DDQN, which enables the ego vehicle to quantify the potential worth of the current control action. To assess the above-mentioned decision-making policies in a similar driving condition, the next subsection discusses the adaptability of these strategies.

FIGURE 10. Accumulated rewards in DQN and DDQN: the higher accumulated reward indicates better control action choices.

C. ADAPTABILITY ESTIMATION

After learning and training the automated vehicles in the highway driving environments, a short episode is applied to test their adaptive capacity. The testing number of episodes is 10 in this work. The default settings and the numbers of lanes and surrounding vehicles are the same as in the training process. The learned parameters of the neural networks are saved and can be utilized directly in the new conditions. The most concerning elements are the average reward and the collision conditions of the testing operation.


FIGURE 11. Normalized reward in the testing experiment of the three compared methods.

Fig. 11 shows the normalized average reward of the reference model, DQN, and DDQN methods in the testing experiment. From (27), the reward is mainly influenced by the collision conditions and the vehicle speed. The average reward may not achieve the highest score (100 in this work) because the ego vehicle has to slow down sometimes to avoid a collision. The ego vehicle also needs to change to other lanes to realize the overtaking process. Without loss of generality, two typical situations (two episodes, the A and B points in Fig. 10) are chosen to analyze the decision-making behaviors of the ego vehicle.

FIGURE 12. One typical testing driving condition: the ego vehicle has to execute car-following behavior for a long time.

Fig. 12 depicts one driving situation in which there are three surrounding vehicles in front of the ego vehicle (the episode represented by the A point). The ego vehicle has to execute the car-following maneuver for a long time and wait for the opportunity to overtake them. As a consequence, the vehicle speed may not reach the maximum value, and the ego vehicle may not surpass all the surrounding vehicles before the destination. Furthermore, an infrequent driving condition is described in Fig. 13 (the episode represented by the B point). The ego vehicle wants to achieve a risky lane change to obtain higher rewards. However, it crashed into nearby vehicles because the operation space was not enough. This situation may not have happened in the training process, and thus the ego vehicle could cause a collision.

FIGURE 13. Another representative testing driving condition: the ego vehicle makes a dangerous lane change and a collision happens.

Based on the detailed analysis of Figs. 12 and 13, it hints that more time should be spent training the mutable decision-making strategy. These results also remind us that the relevant control policy has the potential to be applied in real-world environments.

TABLE 1. Training and testing time of the DQN and DDQN methods.

Table 1 provides the training and testing times of the DQN and DDQN approaches. Although the training can only be realized offline, the learned parameters and policies are able to be utilized online. This inspires us to implant our decision-making policy in visualization simulation environments and to conduct the related loop experiments in the future.

V. CONCLUSION

This paper discusses the highway decision-making problem using the DRL technique. By applying the DDQN algorithm in the designed driving environments, an efficient and safe control framework is constructed. Depending on a series of simulation experiments, the optimality, convergence rate, and adaptability are demonstrated. In addition, the testing results are analyzed, and the potential of the presented method to be applied in real-world environments is proven. Future work includes the online application of highway decision-making by executing hardware-in-the-loop (HIL) experiments. Moreover, the real-world collected highway database can be used to estimate the related overtaking strategy.
REFERENCES
[1] A. Raj, J. A. Kumar, and P. Bansal, "A multicriteria decision making approach to study barriers to the adoption of autonomous vehicles," Transp. Res. Part A, Policy Pract., vol. 133, pp. 122–137, Mar. 2020.
[2] T. Liu, B. Tian, Y. Ai, L. Chen, F. Liu, and D. Cao, "Dynamic states prediction in autonomous vehicles: Comparison of three different methods," in Proc. IEEE Intell. Transp. Syst. Conf. (ITSC), Oct. 2019, pp. 3750–3755.
[3] A. Rasouli and J. K. Tsotsos, "Autonomous vehicles that interact with pedestrians: A survey of theory and practice," IEEE Trans. Intell. Transp. Syst., vol. 21, no. 3, pp. 900–918, Mar. 2020.
[4] C. Gkartzonikas and K. Gkritza, "What have we learned? A review of stated preference and choice studies on autonomous vehicles," Transp. Res. Part C, Emerg. Technol., vol. 98, pp. 323–337, Jan. 2019.
[5] C.-J. Hoel, K. Driggs-Campbell, K. Wolff, L. Laine, and M. J. Kochenderfer, "Combining planning and deep reinforcement learning in tactical decision making for autonomous driving," IEEE Trans. Intell. Vehicles, vol. 5, no. 2, pp. 294–305, Jun. 2020.
[6] C. Yang, Y. Shi, L. Li, and X. Wang, "Efficient mode transition control for parallel hybrid electric vehicle with adaptive dual-loop control framework," IEEE Trans. Veh. Technol., vol. 69, no. 2, pp. 1519–1532, Feb. 2020.
[7] C.-J. Hoel, K. Wolff, and L. Laine, "Tactical decision-making in autonomous driving by reinforcement learning with uncertainty estimation," 2020, arXiv:2004.10439. [Online]. Available: http://arxiv.org/abs/2004.10439
[8] SAE On-Road Automated Vehicle Standards Committee, "Taxonomy and definitions for terms related to on-road motor vehicle automated driving systems," SAE Standard J., vol. 3016, pp. 1–16, 2014.


[9] Y. Qin, X. Tang, T. Jia, Z. Duan, J. Zhang, Y. Li, and L. Zheng, "Noise and vibration suppression in hybrid electric vehicles: State of the art and challenges," Renew. Sustain. Energy Rev., vol. 124, May 2020, Art. no. 109782.
[10] P. Hart and A. Knoll, "Using counterfactual reasoning and reinforcement learning for decision-making in autonomous driving," 2020, arXiv:2003.11919. [Online]. Available: http://arxiv.org/abs/2003.11919
[11] W. Song, G. Xiong, and H. Chen, "Intention-aware autonomous driving decision-making in an uncontrolled intersection," Math. Problems Eng., vol. 2016, pp. 1–15, Apr. 2016.
[12] C. Yang, S. You, W. Wang, L. Li, and C. Xiang, "A stochastic predictive energy management strategy for plug-in hybrid electric vehicles based on fast rolling optimization," IEEE Trans. Ind. Electron., vol. 67, no. 11, pp. 9659–9670, Nov. 2020, doi: 10.1109/TIE.2019.2955398.
[13] A. Furda and L. Vlacic, "Enabling safe autonomous driving in real-world city traffic using multiple criteria decision making," IEEE Intell. Transp. Syst. Mag., vol. 3, no. 1, pp. 4–17, Spring 2011.
[14] J. Nie, J. Zhang, W. Ding, X. Wan, X. Chen, and B. Ran, "Decentralized cooperative lane-changing decision-making for connected autonomous vehicles," IEEE Access, vol. 4, pp. 9413–9420, 2016.
[15] L. Li, K. Ota, and M. Dong, "Humanlike driving: Empirical decision-making system for autonomous vehicles," IEEE Trans. Veh. Technol., vol. 67, no. 8, pp. 6814–6823, Aug. 2018.
[16] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
[17] J. Duan, S. Eben Li, Y. Guan, Q. Sun, and B. Cheng, "Hierarchical reinforcement learning for self-driving decision-making without reliance on labelled driving data," IET Intell. Transp. Syst., vol. 14, no. 5, pp. 297–305, May 2020.
[18] M. Kim, S. Lee, J. Lim, J. Choi, and S. G. Kang, "Unexpected collision avoidance driving strategy using deep reinforcement learning," IEEE Access, vol. 8, pp. 17243–17252, 2020.
[19] Q. Zhang, J. Lin, Q. Sha, B. He, and G. Li, "Deep interactive reinforcement learning for path following of autonomous underwater vehicle," IEEE Access, vol. 8, pp. 24258–24268, 2020.
[20] C. Chen, J. Jiang, N. Lv, and S. Li, "An intelligent path planning scheme of autonomous vehicles platoon using deep reinforcement learning on network edge," IEEE Access, vol. 8, pp. 99059–99069, 2020.
[21] C. Yang, M. Zha, W. Wang, K. Liu, and C. Xiang, "Efficient energy management strategy for hybrid electric vehicles/plug-in hybrid electric vehicles: Review and recent advances under intelligent transportation system," IET Intell. Transp. Syst., vol. 14, no. 7, pp. 702–711, Jul. 2020.
[22] S. Han and F. Miao, "Behavior planning for connected autonomous vehicles using feedback deep reinforcement learning," 2020, arXiv:2003.04371. [Online]. Available: http://arxiv.org/abs/2003.04371
[23] S. Nageshrao, H. E. Tseng, and D. Filev, "Autonomous highway driving using deep reinforcement learning," in Proc. IEEE Int. Conf. Syst., Man Cybern. (SMC), Oct. 2019, pp. 2326–2331.
[24] T. Liu, B. Huang, Z. Deng, H. Wang, X. Tang, X. Wang, and D. Cao, "Heuristics-oriented overtaking decision making for autonomous vehicles using reinforcement learning," IET Elect. Syst. Transp., vol. 1, no. 99, pp. 1–8, 2020.
[25] M. Treiber, A. Hennecke, and D. Helbing, "Congested traffic states in empirical observations and microscopic simulations," Phys. Rev. E, Stat. Phys. Plasmas Fluids Relat. Interdiscip. Top., vol. 62, no. 2, pp. 1805–1824, Aug. 2000.
[26] M. Zhou, X. Qu, and S. Jin, "On the impact of cooperative autonomous vehicles in improving freeway merging: A modified intelligent driver model-based approach," IEEE Trans. Intell. Transp. Syst., vol. 18, no. 6, pp. 1422–1428, Jun. 2017.
[27] A. Kesting, M. Treiber, and D. Helbing, "General lane-changing model MOBIL for car-following models," Transp. Res. Rec., J. Transp. Res. Board, vol. 1999, no. 1, pp. 86–94, Jan. 2007.
[28] T. Liu, X. Hu, W. Hu, and Y. Zou, "A heuristic planning reinforcement learning-based energy management for power-split plug-in hybrid electric vehicles," IEEE Trans. Ind. Informat., vol. 15, no. 12, pp. 6436–6445, Dec. 2019.
[29] T. Liu, X. Tang, H. Wang, H. Yu, and X. Hu, "Adaptive hierarchical energy management design for a plug-in hybrid electric vehicle," IEEE Trans. Veh. Technol., vol. 68, no. 12, pp. 11513–11522, Dec. 2019.
[30] X. Hu, T. Liu, X. Qi, and M. Barth, "Reinforcement learning for hybrid and plug-in hybrid electric vehicle energy management: Recent advances and prospects," IEEE Ind. Electron. Mag., vol. 13, no. 3, pp. 16–25, Sep. 2019.
[31] T. Liu, H. Yu, H. Guo, Y. Qin, and Y. Zou, "Online energy management for multimode plug-in hybrid electric vehicles," IEEE Trans. Ind. Informat., vol. 15, no. 7, pp. 4352–4361, Jul. 2019.
[32] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. Hoboken, NJ, USA: Wiley, 2014.
[33] R. Sutton and A. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA, USA: MIT Press, 2018.
[34] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas, "Dueling network architectures for deep reinforcement learning," in Proc. ICML, Jun. 2016, pp. 1995–2003.
[35] L. Edouard. (2018). An Environment for Autonomous Driving Decision-Making. GitHub. [Online]. Available: https://github.com/eleurent/highway-env

JIANGDONG LIAO received the M.S. degree in mathematics and computer engineering from Chongqing Normal University. He works on mathematics and statistics at Yangtze Normal University. His current research interests include driving behavior analysis, vehicle motion prediction, and risk assessment of autonomous driving.

TENG LIU (Member, IEEE) received the B.S. degree in mathematics and the Ph.D. degree in automotive engineering from the Beijing Institute of Technology (BIT), Beijing, China, in 2011 and 2017, respectively. His Ph.D. dissertation, under the supervision of Prof. F. Sun, was entitled Reinforcement Learning-Based Energy Management for Hybrid Electric Vehicles. He was a Research Fellow with Vehicle Intelligence Pioneers Ltd., from 2017 to 2018. He was also a Postdoctoral Fellow with the Department of Mechanical and Mechatronics Engineering, University of Waterloo, Canada, from 2018 to 2020. He is currently a Professor with the Department of Automotive Engineering, Chongqing University, Chongqing, China. He has more than eight years of research and working experience in renewable vehicles and connected autonomous vehicles. He has published over 40 SCI articles and 15 conference papers in these areas. His current research interests include reinforcement learning (RL)-based energy management in hybrid electric vehicles, RL-based decision making for autonomous vehicles, and CPSS-based parallel driving. He is a member of the IEEE VTS, the IEEE ITS, the IEEE IES, the IEEE TEC, and the IEEE/CAA. He received the Merit Student of Beijing in 2011, the Teli Xu Scholarship (Highest Honor) from the Beijing Institute of Technology in 2015, the Top 10 from the IEEE VTS Motor Vehicle Challenge in 2018, and the Sole Outstanding Winner from the ABB Intelligent Technology Competition in 2018. He serves as the Workshop Co-Chair for the 2018 IEEE Intelligent Vehicles Symposium (IV 2018). He serves as a Reviewer for multiple SCI journals, including the IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, the IEEE TRANSACTIONS ON INTELLIGENT VEHICLES, the IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS, the IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, and Advances in Mechanical Engineering.


XIAOLIN TANG (Member, IEEE) received the B.S. degree in mechanics engineering and the M.S. degree in vehicle engineering from Chongqing University, Chongqing, China, in 2006 and 2009, respectively, and the Ph.D. degree in mechanical engineering from Shanghai Jiao Tong University, China, in 2015. From August 2017 to August 2018, he was a Visiting Professor with the Department of Mechanical and Mechatronics Engineering, University of Waterloo, Waterloo, ON, Canada. He is currently an Associate Professor with the Department of Automotive Engineering, Chongqing University. He has led and been involved in more than ten research projects, such as the National Natural Science Foundation of China. He has published more than 30 articles. His research interests include hybrid electric vehicles (HEVs), vehicle dynamics, noise and vibration, and transmission control. He is a Committeeman with the Technical Committee on Vehicle Control and Intelligence, Chinese Association of Automation (CAA).

XINGYU MU received the B.S. degree in automotive engineering from Chongqing University, where he is currently pursuing the M.S. degree. His current research interest includes the left-turn decision-making problem of autonomous driving at intersections.

BING HUANG received the B.S. degree in automotive engineering from Chongqing University, where he is currently pursuing the M.S. degree. His current research interest includes decision-making for autonomous driving.

DONGPU CAO received the Ph.D. degree from Concordia University, Canada, in 2008. He is currently an Associate Professor and the Director of the Driver Cognition and Automated Driving (DC-Auto) Laboratory, University of Waterloo, Canada. He has contributed more than 170 publications. He holds one U.S. Patent. His research interests include vehicle dynamics and control, driver cognition, and automated driving and parallel driving. He has been serving on the SAE International Vehicle Dynamics Standards Committee, ASME, SAE, and the IEEE technical committees. He received the ASME AVTT'2010 Best Paper Award and the 2012 SAE Arch T. Colwell Merit Award. He serves as the Co-Chair for the IEEE ITSS Technical Committee on Cooperative Driving. He serves as an Associate Editor for the IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, the IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, the IEEE/ASME TRANSACTIONS ON MECHATRONICS, the IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, and the Journal of Dynamic Systems, Measurement and Control (ASME). He serves as a Guest Editor for Vehicle System Dynamics and the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS.
