Robust Lane Change Decision Making for Autonomous Vehicles: An Observation Adversarial Reinforcement Learning Approach
Abstract—Reinforcement learning holds the promise of allowing autonomous vehicles to learn complex decision making behaviors through interacting with other traffic participants. However, many real-world driving tasks involve unpredictable perception errors or measurement noises which may mislead an autonomous vehicle into making unsafe decisions, or even cause catastrophic failures. In light of these risks, to ensure safety under perception uncertainty, autonomous vehicles are required to be able to cope with worst-case observation perturbations. Therefore, this paper proposes a novel observation adversarial reinforcement learning approach for robust lane change decision making of autonomous vehicles. A constrained observation-robust Markov decision process is presented to model the lane change decision making behaviors of autonomous vehicles under policy constraints and observation uncertainties. Meanwhile, a black-box attack technique based on Bayesian optimization is implemented to approximate the optimal adversarial observation perturbations efficiently. Furthermore, a constrained observation-robust actor-critic algorithm is advanced to optimize autonomous driving lane change policies while keeping the variations of the policies attacked by the optimal adversarial observation perturbations within bounds. Finally, the robust lane change decision making approach is evaluated in three stochastic mixed traffic flows with different densities. The results demonstrate that the proposed method can not only enhance the performance of an autonomous vehicle but also improve the robustness of lane change policies against adversarial observation perturbations.

Index Terms—Autonomous vehicle, lane change decision making, robust decision making, reinforcement learning, adversarial attack.

I. INTRODUCTION

Applying reinforcement learning (RL) to the decision making task of autonomous driving has become a hot topic for researchers [9].

While existing RL based decision making methods for autonomous vehicles have achieved many compelling results [10], [11], [12], [13], real-world driving tasks involve unavoidable measurement errors or sensor noises which may mislead an autonomous vehicle into making suboptimal decisions, or even cause catastrophic failures. In light of these risks, autonomous vehicles are required to ensure that their decision making systems can handle the natural observation uncertainties arising from the sensing and perception system, and especially adversarial perturbations. However, few studies address this challenge.

Therefore, in this paper, a novel observation adversarial RL (OARL) approach for robust lane change decision making is proposed to improve the performance of an autonomous vehicle while enhancing the robustness of its driving policies against adversarial observation perturbations. The main contributions of this paper are summarized as follows:

• A constrained observation-robust Markov decision process (COR-MDP) is advanced to model the lane change decision making behaviors of an autonomous vehicle under policy constraints and observation perturbations. Meanwhile, a black-box attack technique with Bayesian optimization is implemented to approximate the optimal adversarial observation perturbations efficiently.

• A constrained observation-robust actor-critic (COR-AC) algorithm is presented to optimize lane change policies while keeping the variations of the policies attacked by the optimal adversarial observation perturbations within bounds.
II. RELATED WORK

According to the different driving behaviors (e.g., lane change, acceleration or deceleration) or tasks (e.g., overtaking or ramp merging) considered in existing studies, RL based decision making for autonomous vehicles can roughly be divided into three categories: longitudinal, lateral and coordinated decision making [9]. RL based longitudinal decision-making methods generally adopt an RL algorithm to determine the speed modes of autonomous vehicles, such as keeping, acceleration and deceleration [11], [16], [17], [18].

A. Reinforcement Learning based Lateral Decision Making for Autonomous Vehicles

RL based lateral decision making approaches for autonomous vehicles mostly employ an RL algorithm to learn lane change behaviors or to select target lanes. One popular paradigm is the lateral decision making scheme based on the deep Q-network (DQN) or its variants. A lane change decision-making framework for autonomous vehicles is developed to learn risk-sensitive driving policies using risk-awareness prioritized replay DQN in [12]. A lane change decision making method is presented for autonomous vehicles through DQN with safety verification in [19]. A harmonious lane-changing decision making approach based on DQN is advanced to improve overall traffic efficiency in [20]. A DQN method with rule-based constraints is developed for lane change decision making of autonomous vehicles in [21]. A lane change decision-making approach for autonomous vehicles is developed via double DQN with the structure of Deep Sets in [22]. A lane change decision making method based on a partially observed Markov decision process and DQN is introduced for autonomous vehicles in [23]. The above methods are simple but effective. Moreover, combined with rule based constraints, the driving safety of autonomous vehicles can be guaranteed. However, these schemes do not necessarily find the optimal driving policies.

In addition to the DQN based paradigms, there are autonomous driving lateral decision making approaches with other RL algorithms. A proximal policy optimization (PPO) based lane change decision-making method is presented for autonomous driving in [13]. A multi-objective approximate policy iteration algorithm is proposed to implement lane change decision making of an autonomous vehicle in [24]. A lane change decision-making scheme based on attention-based hierarchical deep RL is proposed for autonomous vehicles in [25]. Although these methods may achieve better performance than the DQN based schemes, none of them studies the robust decision-making problem of autonomous vehicles.

B. Reinforcement Learning based Coordinated Decision Making for Autonomous Vehicles

RL based coordinated decision making schemes usually leverage an RL algorithm to determine the longitudinal and lateral driving behaviors of autonomous vehicles simultaneously. A longitudinal and lateral coordinated decision making approach based on the AlphaGo Zero algorithm is developed for autonomous vehicles in [26]. The requested speed and target lane can be determined by the five decision making behaviors of the RL agent simultaneously. A DQN based decision making method is advanced in [27], which can simultaneously determine discrete speed modes and lane change behaviors of an autonomous vehicle. An optimization embedded RL with an actor-critic framework is presented to determine longitudinal and lateral coordinated decision making behaviors for autonomous vehicles in [28]. A coordinated decision making method based on the deep deterministic policy gradient algorithm is developed to determine throttle and steering maneuvers for autonomous driving in [29]. Unfortunately, the above methods mostly assume that the state observations are free of unexpected perturbations. Such an assumption can hardly hold in real-world scenarios.

III. OBSERVATION ADVERSARIAL REINFORCEMENT LEARNING FOR ROBUST DECISION MAKING

A. Robust Lane Change Decision Making Framework for Autonomous Vehicles

Fig. 1. Framework of the proposed robust lane change decision-making approach for autonomous driving.

Since existing lane change decision-making frameworks for autonomous vehicles mostly do not take perception uncertainty into account, a robust lane change decision making framework with the OARL algorithm is proposed to cope with adversarial perturbations on the state observations in autonomous driving, as shown in Fig. 1. The ego vehicle, shown in red, is the autonomous vehicle. The longitudinal decision making of the ego vehicle is implemented by the SUMO based intelligent driver model (IDM). The vehicles of other colors are social vehicles, and the longitudinal and lateral driving behaviors of the social vehicles are determined by the IDM of SUMO. The social vehicles can perform lane change maneuvers via the LC2013 model [30] in SUMO. Moreover, the output of the ego vehicle is discrete and includes lane keeping, left lane changing and right lane changing.
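The paper does not provide an implementation of this discrete interface; as a rough illustration only, the action set and one possible mapping onto SUMO's TraCI Python API could look as follows. The vehicle identifier, the lane-change duration and the use of traci.vehicle.changeLane are our assumptions, not details given by the paper:

    import enum
    import traci  # SUMO's TraCI Python client

    class LaneChangeAction(enum.IntEnum):
        KEEP_LANE = 0
        CHANGE_LEFT = 1
        CHANGE_RIGHT = 2

    def apply_action(ego_id: str, action: LaneChangeAction, duration: float = 2.0) -> None:
        """Issue the selected lateral maneuver to the ego vehicle (sketch only)."""
        lane = traci.vehicle.getLaneIndex(ego_id)   # SUMO counts lane indices from the right
        if action == LaneChangeAction.CHANGE_LEFT:
            traci.vehicle.changeLane(ego_id, lane + 1, duration)
        elif action == LaneChangeAction.CHANGE_RIGHT and lane > 0:
            traci.vehicle.changeLane(ego_id, lane - 1, duration)
        # KEEP_LANE: nothing to do; longitudinal control is handled by the SUMO IDM.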
Our RL autonomous driving agent seeks to maximize the expected return while satisfying the policy constraints. In Fig. 1, the block with respect to COR-MDP and COR-AC is used for optimizing the robust driving policy and interacting with the environment. Its input includes the state s, the reward r and the optimal adversarial observation perturbations ∆*; t denotes the time step. The optimal adversarial observation perturbation ∆* contains the optimal adversarial multiplicative-perturbation ∆*_m and the optimal adversarial additive-perturbation ∆*_a. The output is the action a based on the policy π(a|s).

The block with regard to the black-box attacks is employed to approximate the optimal adversarial perturbations. The input of this block includes the state s and the policy π(a|s), and its output is the optimal adversarial perturbation. Additionally, the block associated with the environment is leveraged to generate the state s and the reward r. Its input is the action a based on the policy π(a|s), and its output contains the state s and the reward r.

B. Constrained Observation-robust Markov Decision Process

To model the decision making behaviors of the RL based autonomous driving agent under policy constraints and observation perturbations, the proposed COR-MDP is introduced in this section.

Definition 1: A COR-MDP can be characterized via a 7-tuple (S, A, p, r, c, ∆, γ). S is the set of states called the state space. A is the set of actions called the action space. p is the transition probability distribution of the next state s′ ∈ S given the current state s ∈ S and action a ∈ A. r : S × A → R represents the reward function, and c denotes the constraint function. ∆ indicates the observation perturbation. γ ∈ (0, 1) is the discount factor.

COR-MDP attempts to solve the following problem:

    max_π E[ Σ_{t=0}^{T} γ^t r(s_t, a_t) ],
    s.t. E[c(s, s′, ∆)] ≤ ε,                                                        (1)

where T is the timestep and ε is an expected minimum deviation.

C. Black-Box Attack with Bayesian Optimization

In this section, the black-box attack based on Bayesian optimization is implemented to approximate the optimal adversarial observation perturbations.

Bayesian optimization is a black-box optimization algorithm based on Bayes' theorem [31]. This approach works by building a probabilistic model of the objective function, called the surrogate model, that is then searched efficiently through an acquisition function before candidate samples are determined for evaluation on the real objective function [32], [33].

The JS divergence is a symmetrized and smoothed version of the Kullback–Leibler (KL) divergence [34], [35]. More importantly, the JS divergence of two probability distributions has a finite value that is bounded by 1. Hence, the JS divergence is employed to measure the average variation distance of the policies attacked by the observation perturbations. The optimization objective with JS divergence can be designed as:

    c(s, s′, ∆) = D_JS(π(a|s) || π(ã|s̃)) + D_JS(π(a|s′) || π(ã|s̃′))
                = (1/2) D_KL(π(a|s) || m) + (1/2) D_KL(π(ã|s̃) || m)
                  + (1/2) D_KL(π(a|s′) || m′) + (1/2) D_KL(π(ã|s̃′) || m′),          (2)

where D_JS represents the distance based on the JS divergence, D_KL denotes the KL divergence, and

    s̃  = ∆_m s  + ∆_a,
    s̃′ = ∆_m s′ + ∆_a,                                                              (3)

    m  = (1/2)(π(a|s)  + π(ã|s̃)),
    m′ = (1/2)(π(a|s′) + π(ã|s̃′)),                                                  (4)

where ã, s̃ and s̃′ are the action, the state and the next state perturbed by the observation perturbations, respectively.
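As a minimal sketch (not code from the paper), the perturbation model of Eq. (3) and the JS-divergence objective of Eqs. (2) and (4) can be written in Python as follows, assuming the policy is a callable that maps a state vector to a discrete action distribution; all function names are ours:

    import numpy as np

    def kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
        """Kullback-Leibler divergence D_KL(p || q) for discrete distributions."""
        return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

    def js(p: np.ndarray, q: np.ndarray) -> float:
        """Jensen-Shannon divergence via the mixture distribution m of Eq. (4)."""
        m = 0.5 * (p + q)
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    def perturb(s: np.ndarray, delta_m: np.ndarray, delta_a: np.ndarray) -> np.ndarray:
        """Perturbed observation s~ = Delta_m * s + Delta_a (Eq. 3, elementwise)."""
        return delta_m * s + delta_a

    def attack_objective(policy, s, s_next, delta_m, delta_a) -> float:
        """c(s, s', Delta): JS-divergence shift of the policy under the perturbation (Eq. 2)."""
        s_t = perturb(s, delta_m, delta_a)
        s_next_t = perturb(s_next, delta_m, delta_a)
        return js(policy(s), policy(s_t)) + js(policy(s_next), policy(s_next_t))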
Therefore, our black-box attack approach is formulated as:

    ∆* ∈ arg max_∆ E[c(s, s′, ∆)],
    s.t. ‖∆_m − ∆_m^0‖ ≤ δ_m, ‖∆_a − ∆_a^0‖ ≤ δ_a,                                  (5)

where ∆ = [∆_m ∆_a] represents the observation perturbation, ∆_m and ∆_a are the multiplicative-perturbation and the additive-perturbation, ∆_m^0 and ∆_a^0 are the reference values of the multiplicative-perturbation and the additive-perturbation, and δ_m and δ_a are the desired bounds of the multiplicative-perturbation and the additive-perturbation, respectively.

Algorithm 1 outlines the black-box attack method using Bayesian optimization. The acquisition function is designed through the upper confidence bound (UCB) [36]. Additionally, a Gaussian process is leveraged to build the surrogate model for the optimization objective in our algorithm.

Algorithm 1 Black-box attack with Bayesian optimization
for i = 1, 2, ..., I do
    Find a new adversarial observation perturbation ∆_i = [∆_m^i ∆_a^i] by optimizing the acquisition function UCB(·) over the Gaussian process model:
        ∆_i = arg max_∆ UCB(∆ | M_{1:i−1}),
        s.t. ‖∆_m − ∆_m^0‖ ≤ δ_m, ‖∆_a − ∆_a^0‖ ≤ δ_a.
    Compute the objective function E[c(s, s′, ∆_i)].
    Augment the data to the memory M:
        M_{1:i} = M_{1:i−1} ∪ {(∆_i, E[c(s, s′, ∆_i)])}.
    Update the Gaussian process model.
end for
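For illustration, a simplified version of Algorithm 1 is sketched below. It reuses the attack_objective helper from the earlier sketch, estimates E[c(s, s′, ∆)] over a batch of transitions, and uses scikit-learn's Gaussian process regressor with a UCB acquisition maximized by random candidate search; these library and design choices are ours and are not prescribed by the paper:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    def bo_attack(batch, policy, dim, delta_m0, delta_a0, bound_m, bound_a,
                  n_iters=20, n_candidates=256, kappa=2.0, seed=0):
        """Approximate Delta* = argmax E[c(s, s', Delta)] inside the box of Eq. (5)."""
        rng = np.random.default_rng(seed)
        lo = np.concatenate([delta_m0 - bound_m, delta_a0 - bound_a])  # box constraints
        hi = np.concatenate([delta_m0 + bound_m, delta_a0 + bound_a])

        def objective(delta):  # Monte-Carlo estimate of E[c(s, s', Delta)] over the batch
            dm, da = delta[:dim], delta[dim:]
            return float(np.mean([attack_objective(policy, s, s_next, dm, da)
                                  for s, s_next in batch]))

        X = [rng.uniform(lo, hi)]                     # initial random perturbation
        y = [objective(X[0])]
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        for _ in range(n_iters):
            gp.fit(np.asarray(X), np.asarray(y))      # update the surrogate model
            cand = rng.uniform(lo, hi, size=(n_candidates, 2 * dim))
            mu, std = gp.predict(cand, return_std=True)
            nxt = cand[int(np.argmax(mu + kappa * std))]   # UCB acquisition
            X.append(nxt)
            y.append(objective(nxt))                  # evaluate the true objective
        best = np.asarray(X)[int(np.argmax(y))]
        return best[:dim], best[dim:]                 # (Delta_m*, Delta_a*)

With the values reported in Table IV, delta_m0 and delta_a0 would be vectors filled with 1.00 and 0.00, and bound_m and bound_a vectors filled with 0.2 and 0.05, respectively.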
D. Constrained Observation-Robust Actor-Critic

To learn the robust optimal lane change policy, the proposed COR-AC algorithm is introduced in this section. COR-AC attempts to solve the following optimization problem:

    max_π E[ Σ_{t=0}^{T} γ^t r(s_t, a_t) ],
    s.t. E[c(s, s′, ∆*)] ≤ ε,                                                        (6)

where ∆* = [∆*_m ∆*_a] represents the optimal adversarial observation perturbation.

A policy iteration (PI) scheme is employed to solve the COR-MDP, which is called constrained observation-robust PI (COR-PI). COR-PI consists of policy evaluation and policy improvement, and they are iteratively updated until convergence.

According to Lagrange duality theory [37], the Lagrange function of the optimization problem (6) can be derived as:

    L(π, λ) = E[ Σ_{t=0}^{T} γ^t r(s_t, a_t) + λ(ε − c(s, s′, ∆*)) ],                (7)

where λ is the dual variable of the RL agent.

1) Constrained Observation-Robust Policy Evaluation: The action-value function Q^π(·) with adversarial observation perturbations can be learned under a fixed policy iteratively, starting from any action-value function Q^π(·) : S → R^|A| and repeatedly applying a modified Bellman backup operator T^π given via:

    T^π Q^π(s_t) := r(s_t, a_t) + γ E[V^π(s_{t+1})],                                 (8)

where

    E[V^π(s_{t+1})] = π(s_{t+1})^⊤ [Q^π(s_{t+1}) − λ c(s, s′, ∆*)]                   (9)

is the expected value function with adversarial observation perturbations. Since the policy model outputs a discrete action distribution, the expectation of the value function V^π(·) can be calculated directly.

To speed up training, the COR-AC algorithm adopts two parameterized action-value functions with network parameters φ_z, z ∈ {1, 2}. The action-value function parameters can be updated via minimizing the following loss function of the critic network:

    J_c(φ_z) = E_{a_{t+1}∼π, T_s∼D} [ ‖y_t − Q^π(s_t; φ_z)‖_2^2 ],                   (10)

where T_s represents a transition sampled from the replay buffer D, and y_t is the target value of the action-value function.

2) Constrained Observation-Robust Policy Improvement: In COR-PI, policy improvement designates optimizing and updating the policies of the RL agent. The RL agent attempts to maximize the expected return of the policy while satisfying the nonlinear constraint c(·).

With Eq. (7), the Lagrange dual function can be written as:

    L̄(λ) = max_π L(π, λ)
          = max_π E[ Σ_{t=0}^{T} γ^t r(s_t, a_t) + λ(ε − c(s, s′, ∆*)) ].            (13)

Furthermore, the Lagrange dual problem associated with the problem (6) can be represented as:

    min_{λ≥0} L̄(λ) = min_{λ≥0} max_π L(π, λ)
                   = min_{λ≥0} max_π E[ Σ_{t=0}^{T} γ^t r(s_t, a_t) + λ(ε − c(s, s′, ∆*)) ].   (14)

The optimal policy π* and the optimal dual variable λ* can be approximated iteratively: first, given a fixed λ, solve for the best policy π* by maximizing L(π, λ); then, plug in π* and find λ* by minimizing L(π*, λ). Therefore, with Eq. (14), the following expressions can be derived:

    π* = arg max_π L(π, λ),                                                          (15)

    λ* = arg min_{λ≥0} L(π*, λ).                                                     (16)

The value function V^π(·) is implicitly defined through the action-value function Q^π(·), the policy π(·) and the constraint c(·). With the double Q(·) trick in Eq. (11), Eq. (9) and Eq. (15), the policy model parameters θ can be optimized via maximizing the following objective function of the actor network:

    J_a(θ) = E_{a_t∼π, T_s∼D} [ π(s_t; θ)^⊤ [ min_{z∈{1,2}} Q^π(s_t; φ_z) − λ c(s, s′, ∆*) ] ].   (17)

Additionally, with Eq. (16), the dual variable can be updated via minimizing the following loss function:

    J_d(λ) = E_{a_t∼π, T_s∼D} [ π(s_t; θ)^⊤ [ λ(ε − c(s, s′, ∆*)) ] ].               (18)
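A schematic PyTorch version of one COR-AC update, covering the critic loss of Eq. (10), the actor objective of Eq. (17) and the dual update of Eq. (18), is sketched below under several assumptions of ours: the critics output Q-values for all discrete actions, the actor outputs an action distribution, the targets y_t and the per-transition constraint values c(s, s′, ∆*) are computed elsewhere, and λ is parameterized through its logarithm to keep it non-negative:

    import torch
    import torch.nn.functional as F

    def cor_ac_step(actor, critic1, critic2, log_lambda, opt, s, a, y_target, c_val, epsilon):
        """One schematic COR-AC update. s: [B, obs_dim]; a: [B] (long); y_target: [B];
        c_val: per-transition estimate of c(s, s', Delta*) [B]; epsilon: constraint bound."""
        lam = log_lambda.exp()                                  # keeps the dual variable >= 0

        # Critic update (Eq. 10): regress both Q networks toward the target value y_t.
        q1 = critic1(s).gather(1, a.unsqueeze(1)).squeeze(1)
        q2 = critic2(s).gather(1, a.unsqueeze(1)).squeeze(1)
        critic_loss = F.mse_loss(q1, y_target) + F.mse_loss(q2, y_target)
        opt["critic"].zero_grad(); critic_loss.backward(); opt["critic"].step()

        # Actor update (Eq. 17): maximize pi(s)^T [min(Q1, Q2) - lambda * c(s, s', Delta*)].
        probs = actor(s)                                        # [B, n_actions]
        q_min = torch.min(critic1(s), critic2(s)).detach()
        actor_obj = (probs * (q_min - lam.detach() * c_val.unsqueeze(1))).sum(dim=1).mean()
        opt["actor"].zero_grad(); (-actor_obj).backward(); opt["actor"].step()

        # Dual update (Eq. 18): minimize lambda * (epsilon - E[c]), so lambda grows when
        # the constraint E[c] <= epsilon is violated and shrinks otherwise.
        dual_loss = lam * (epsilon - c_val.detach().mean())
        opt["dual"].zero_grad(); dual_loss.backward(); opt["dual"].step()

Parameterizing the dual variable through its logarithm is one simple way to enforce λ ≥ 0 in Eq. (16); the paper does not specify how this projection is realized.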
Algorithm 2 Observation Adversarial Reinforcement Learning
1: Initialize the actor network parameters θ and the critic network parameters φ_1 and φ_2.
2: Initialize the target action-value function network parameters φ̄_1 ← φ_1 and φ̄_2 ← φ_2.
3: Initialize the dual variable λ and an empty replay buffer D.
4: for iteration step n = 1, 2, ..., N do
5:     Reset the state s_0.
6:     for timestep in the environment t = 1, 2, ..., M do
7:         Select an action based on the policy: a_t ∼ π_θ(a_t|s_t).
8:         Sample a transition from the environment: s_{t+1}, r_t, d_t ∼ p(s_{t+1}|s_t, a_t).
9:         Store the transition in the replay buffer: D ← D ∪ {(s_t, a_t, r_t, s_{t+1}, d_t)}.
10:    end for
11:    Sample a batch of transitions from the replay buffer D.
12:    Generate the optimal adversarial observation perturbations through Algorithm 1:
           ∆* ← max_∆ E[c(s, s′, ∆)], s.t. ‖∆_m − ∆_m^0‖ ≤ δ_m, ‖∆_a − ∆_a^0‖ ≤ δ_a.
13:    Update the actor network parameters through Eq. (17): θ ← ∇_θ J_a(θ).
14:    Update the critic network parameters through Eq. (10): φ_1 ← ∇_{φ_1} J_c(φ_1), φ_2 ← ∇_{φ_2} J_c(φ_2).
15:    Update the dual variable through Eq. (18): λ ← ∇_λ J_d(λ).
16:    Update the target action-value function network parameters through Eq. (12):
           φ̄_1 ← ρ φ̄_1 + (1 − ρ) φ_1, φ̄_2 ← ρ φ̄_2 + (1 − ρ) φ_2.
17: end for

Algorithm 3 Reward Function Design for the RL Agent
Input: State and action of the RL agent.
1: r(·) = v_0 / 35.                          ▷ Encourage the agent to be more efficient
2: if d_1 < 30 then
3:     r(·) = r(·) − 0.1.                    ▷ Encourage lane change behavior
4: end if
5: if |3.14 · ω_0 / 180| > k · µ̄ · g / v_0 and v_0 > 30 then
6:     r(·) = r(·) − 0.05.                   ▷ Penalize dynamics instability
7: end if
8: if the vehicle changes lane and v_0 > 20 then
9:     r(·) = r(·) − v_0 / 350.              ▷ Penalize high-speed lane changes
10: end if
11: if a collision occurs then
12:    r(·) = r(·) − 0.1.                    ▷ Penalize collisions
13: end if
Output: r(·)

TABLE I
STATE OBSERVED BY THE AUTONOMOUS DRIVING RL AGENT

Parameter (Unit)    Definition
a_0 (m/s²)          Longitudinal acceleration of the autonomous vehicle
ω_0 (rad/s)         Yaw rate of the autonomous vehicle
v_0 (m/s)           Velocity of the autonomous vehicle
v_1 (m/s)           Velocity of the vehicle in front in the same lane
d_1 (m)             Distance from the vehicle in front in the same lane
v_2 (m/s)           Velocity of the vehicle behind in the same lane
d_2 (m)             Distance from the vehicle behind in the same lane
v_3 (m/s)           Velocity of the vehicle in front in the left lane
d_3 (m)             Distance from the vehicle in front in the left lane
v_4 (m/s)           Velocity of the vehicle behind in the left lane
d_4 (m)             Distance from the vehicle behind in the left lane
v_5 (m/s)           Velocity of the vehicle in front in the right lane
d_5 (m)             Distance from the vehicle in front in the right lane
v_6 (m/s)           Velocity of the vehicle behind in the right lane
d_6 (m)             Distance from the vehicle behind in the right lane
l_index             Index of the lane in which the autonomous vehicle is located
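Algorithm 3 translates almost directly into Python; in the sketch below the argument names and the gravity constant g = 9.81 m/s² are our additions, and the 3.14/180 conversion is kept exactly as written in the pseudocode:

    def reward(v0, omega0, d1, changed_lane, collided,
               k=0.85, mu_bar=0.90, g=9.81):
        """Step reward of Algorithm 3; thresholds and constants follow the pseudocode."""
        r = v0 / 35.0                                   # encourage transport efficiency
        if d1 < 30.0:                                   # close leading vehicle:
            r -= 0.1                                    #   encourage a lane change
        if v0 > 30.0 and abs(3.14 * omega0 / 180.0) > k * mu_bar * g / v0:
            r -= 0.05                                   # penalize dynamics instability
        if changed_lane and v0 > 20.0:
            r -= v0 / 350.0                             # penalize high-speed lane changes
        if collided:
            r -= 0.1                                    # penalize collisions
        return r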
Fig. 3. Schematic diagram of evaluation method using SUMO-based mixed traffic flow with a random number of vehicles.
Fig. 5. Training curves obtained by DQN, PPO, SAC and OARL algorithms. (a): Average return; (b): Average speed; (c): Average collision times.
TABLE III
EVALUATION OF THE POLICIES TRAINED BY DIFFERENT ALGORITHMS IN THREE STOCHASTIC MIXED TRAFFIC FLOWS
OARL is enhanced by about 83.33%, 250.00% and 16.67%, respectively. It can be seen that PPO is superior to OARL in terms of the final driving speed. However, the collision safety of the PPO method is the worst.

Eq. (2) is utilized to measure the robustness of the policy models against adversarial observation perturbations. We evaluate the final policy models trained by each method with different random seeds. Additionally, the average metrics are counted over 40000 time steps (200 episodes × 200 time steps).

Table III shows the test results of the different policy models. The OARL policies outperform DQN, PPO and SAC in the three stochastic mixed traffic flows with different densities, especially in terms of the robustness metric. For instance, in contrast to the DQN, PPO and SAC policies, OARL gains 16.25%, 24.83% and 7.10% improvements with respect to return in the mixed traffic flow with low density, respectively. Meanwhile, compared with the DQN, PPO and SAC methods, the traffic efficiency of the OARL policies is improved by about 10.73%, 19.11% and 1.97%, respectively. It can be inferred that, to ensure transport efficiency, the autonomous vehicle based on the OARL policies performs more lane changes to overtake than one driven by the baseline policies. Additionally, the robustness metric of the OARL policies remains almost unchanged under adversarial observation perturbations.

In the stochastic mixed traffic flow scenario with normal density, the average return of the OARL policies outperforms that of the DQN, PPO and SAC policies. Hence, although each of the PPO and SAC policies has one metric that is superior to that of the OARL policies, the OARL policies have better comprehensive performance than the baseline policies.

In the stochastic mixed traffic flow scenario with high density, the OARL policies perform comparably to the SAC policies and outperform the DQN and PPO policies in terms of transport efficiency under adversarial observation perturbations. Moreover, in contrast to the DQN, PPO and SAC policies, OARL gains 257.14%, 16.28% and 8.70% improvements with respect to return, respectively. Compared with the DQN, PPO and SAC policies, the collision safety of the OARL policies is improved by about 332.14%, 28.57% and 28.57%, respectively. It is obvious that the robustness of the OARL policies against adversarial observation perturbations is superior to that of the DQN, PPO and SAC policies. Hence, it can be seen that the proposed method performs consistently in the three different highway scenarios.

Furthermore, Fig. 6 visually shows the performance of the DQN, PPO, SAC and OARL policies in the stochastic mixed traffic flows with low and high densities. It can be seen that the OARL policies outperform the baseline policies by a large margin in terms of return, robustness and collision safety. Moreover, the performance and robustness of the OARL policies are scarcely influenced by adversarial observation perturbations. This means that the proposed robust lane change decision-making approach with OARL is able to improve both the performance and the robustness of lane change policies against adversarial observation perturbations.
Fig. 6. Evaluation results for DQN, PPO, SAC and OARL policy models. (a): Average return; (b): Average robustness metric; (c): Average collision times.
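As an illustration of the evaluation protocol described above, the robustness metric could be estimated as sketched below, assuming a gym-style environment interface and reusing the attack_objective helper from the earlier sketch; the names and the greedy action selection are our assumptions:

    import numpy as np

    def evaluate_robustness(env, policy, delta_m_star, delta_a_star,
                            episodes=200, steps=200):
        """Average the Eq. (2) objective over the evaluation horizon (lower is more robust)."""
        scores = []
        for _ in range(episodes):
            s = env.reset()
            for _ in range(steps):
                a = int(np.argmax(policy(s)))           # greedy action from the discrete policy
                s_next, _, done, _ = env.step(a)
                scores.append(attack_objective(policy, s, s_next, delta_m_star, delta_a_star))
                if done:
                    break
                s = s_next
        return float(np.mean(scores))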
D. Ablation

In this section, we evaluate the impact of the nonlinear constraint on the performance of the OARL agent. A scheme called actor-critic (AC) is implemented by removing the terms associated with the constraint in OARL. The AC and OARL methods are assessed in the stochastic mixed traffic flow with normal density. Moreover, we train five different instances with different random seeds.

As shown in Fig. 7, the proposed OARL algorithm outperforms the AC scheme by a large margin in terms of average return. It can be found that the AC algorithm fails to make any progress during policy model training. Hence, we can offer two possible explanations for this phenomenon: (1) our constraint setting is able to encourage the RL agent to explore and avoid falling into a local optimum; (2) updating the policy gradients in more directions may be beneficial to improving model performance.

Additionally, the performance of our OARL scheme with a double hidden layer based network (DHLN) is evaluated in the stochastic mixed traffic flow with normal density. It can be seen from Fig. 7 that OARL with a single hidden layer based neural network performs comparably to OARL with DHLN in terms of average return.

Fig. 7. Evaluation results of the ablation and comparative study.

VI. CONCLUSION

This paper introduces a novel OARL approach for robust lane change decision making of autonomous vehicles. A COR-MDP is presented to model the lane change decision making behaviors of autonomous vehicles under policy constraints and observation uncertainties. Meanwhile, the black-box attack technique with Bayesian optimization is implemented to find the optimal adversarial observation perturbations efficiently. Furthermore, a COR-AC algorithm is advanced to optimize autonomous driving lane change policies while keeping the variations of the policies attacked by the optimal adversarial
observation perturbations within bounds.

The experimental results in three stochastic mixed traffic flows with different densities demonstrate that the proposed scheme can make lane change decisions robustly under observation uncertainties. In comparison with three baseline methods, the policy models trained by the proposed algorithm show superior generalization and robustness against adversarial observation perturbations.

Future work involves evaluating the robust lane change decision making approach with OARL in more scenarios. Moreover, OARL with continuous actions will be investigated to cope with the longitudinal decision making problem of autonomous vehicles.

APPENDIX

TABLE IV
THE MAIN HYPERPARAMETERS OF THE PROPOSED ALGORITHM

Parameter                     Value      Parameter                      Value
Decay factor λ                0.95       Adhesion coefficient µ̄        0.90
Dynamic factor k              0.85       Learning rate of actor l_a     0.0001
Learning rate of dual l_α     0.0005     Learning rate of critic l_c    0.001
Scale coefficient ρ           0.995      Constraint threshold ε         0.0001
Reference value ∆_m^0         1.00       Reference value ∆_a^0          0.00
Desired bound δ_m             0.2        Desired bound δ_a              0.05

REFERENCES

[1] W. Schwarting, J. Alonso-Mora, and D. Rus, "Planning and decision-making for autonomous vehicles," Annual Review of Control, Robotics, and Autonomous Systems, vol. 1, pp. 187–210, 2018.
[2] S. Feng, X. Yan, H. Sun, Y. Feng, and H. X. Liu, "Intelligent driving intelligence test for autonomous vehicles with naturalistic and adversarial environment," Nature Communications, vol. 12, no. 1, pp. 1–14, 2021.
[3] C. Pek, S. Manzinger, M. Koschi, and M. Althoff, "Using online verification to prevent autonomous vehicles from causing accidents," Nature Machine Intelligence, vol. 2, no. 9, pp. 518–528, 2020.
[4] C. Hubmann, J. Schulz, M. Becker, D. Althoff, and C. Stiller, "Automated driving in uncertain environments: Planning with interaction and uncertain maneuver prediction," IEEE Transactions on Intelligent Vehicles, vol. 3, no. 1, pp. 5–17, 2018.
[5] Z. Hu, C. Lv, P. Hang, C. Huang, and Y. Xing, "Data-driven estimation of driver attention using calibration-free eye gaze and scene features," IEEE Transactions on Industrial Electronics, 2021.
[6] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[7] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., "Mastering the game of go without human knowledge," Nature, vol. 550, no. 7676, pp. 354–359, 2017.
[8] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev et al., "Grandmaster level in starcraft ii using multi-agent reinforcement learning," Nature, vol. 575, no. 7782, pp. 350–354, 2019.
[9] B. R. Kiran, I. Sobh, V. Talpaert, P. Mannion, A. A. Al Sallab, S. Yogamani, and P. Pérez, "Deep reinforcement learning for autonomous driving: A survey," IEEE Transactions on Intelligent Transportation Systems, 2021.
[10] P. R. Wurman, S. Barrett, K. Kawamoto, J. MacGlashan, K. Subramanian, T. J. Walsh, R. Capobianco, A. Devlic, F. Eckert, F. Fuchs et al., "Outracing champion gran turismo drivers with deep reinforcement learning," Nature, vol. 602, no. 7896, pp. 223–228, 2022.
[11] S. Nageshrao, H. E. Tseng, and D. Filev, "Autonomous highway driving using deep reinforcement learning," in 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC). IEEE, 2019, pp. 2326–2331.
[12] G. Li, Y. Yang, S. Li, X. Qu, N. Lyu, and S. E. Li, "Decision making of autonomous vehicles in lane change scenarios: Deep reinforcement learning approaches with risk awareness," Transportation Research Part C: Emerging Technologies, p. 103452, 2021.
[13] F. Ye, X. Cheng, P. Wang, C.-Y. Chan, and J. Zhang, "Automated lane change strategy using proximal policy optimization-based deep reinforcement learning," in 2020 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2020, pp. 1746–1752.
[14] M. Behrisch, L. Bieker, J. Erdmann, and D. Krajzewicz, "Sumo–simulation of urban mobility: An overview," in Proceedings of SIMUL 2011, The Third International Conference on Advances in System Simulation. ThinkMind, 2011.
[15] P. A. Lopez, M. Behrisch, L. Bieker-Walz, J. Erdmann, Y.-P. Flötteröd, R. Hilbrich, L. Lücken, J. Rummel, P. Wagner, and E. Wießner, "Microscopic traffic simulation using sumo," in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 2575–2582.
[16] Y. Fu, C. Li, F. R. Yu, T. H. Luan, and Y. Zhang, "A decision-making strategy for vehicle autonomous braking in emergency via deep reinforcement learning," IEEE Transactions on Vehicular Technology, vol. 69, no. 6, pp. 5876–5888, 2020.
[17] H. Wang, H. Gao, S. Yuan, H. Zhao, K. Wang, X. Wang, K. Li, and D. Li, "Interpretable decision-making for autonomous vehicles at highway on-ramps with latent space reinforcement learning," IEEE Transactions on Vehicular Technology, vol. 70, no. 9, pp. 8707–8719, 2021.
[18] H. Shu, T. Liu, X. Mu, and D. Cao, "Driving tasks transfer using deep reinforcement learning for decision-making of autonomous vehicles in unsignalized intersection," IEEE Transactions on Vehicular Technology, 2021.
[19] B. Mirchevska, C. Pek, M. Werling, M. Althoff, and J. Boedecker, "High-level decision making for safe and reasonable autonomous lane changing using reinforcement learning," in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 2156–2162.
[20] G. Wang, J. Hu, Z. Li, and L. Li, "Harmonious lane changing via deep reinforcement learning," IEEE Transactions on Intelligent Transportation Systems, 2021.
[21] J. Wang, Q. Zhang, D. Zhao, and Y. Chen, "Lane change decision-making through deep reinforcement learning with rule-based constraints," in 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 2019, pp. 1–6.
[22] M. Huegle, G. Kalweit, B. Mirchevska, M. Werling, and J. Boedecker, "Dynamic input for deep reinforcement learning in autonomous driving," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 7566–7573.
[23] S. Jiang, J. Chen, and M. Shen, "An interactive lane change decision making model with deep reinforcement learning," in 2019 7th International Conference on Control, Mechatronics and Automation (ICCMA). IEEE, 2019, pp. 370–376.
[24] X. Xu, L. Zuo, X. Li, L. Qian, J. Ren, and Z. Sun, "A reinforcement learning approach to autonomous decision making of intelligent vehicles on highways," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 50, no. 10, pp. 3884–3897, 2018.
[25] Y. Chen, C. Dong, P. Palanisamy, P. Mudalige, K. Muelling, and J. M. Dolan, "Attention-based hierarchical deep reinforcement learning for lane change behaviors in autonomous driving," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[26] C.-J. Hoel, K. Driggs-Campbell, K. Wolff, L. Laine, and M. J. Kochenderfer, "Combining planning and deep reinforcement learning in tactical decision making for autonomous driving," IEEE Transactions on Intelligent Vehicles, vol. 5, no. 2, pp. 294–305, 2019.
[27] C.-J. Hoel, K. Wolff, and L. Laine, "Automated speed and lane change decision making using deep reinforcement learning," in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 2148–2155.
[28] Y. Zhang, B. Gao, L. Guo, H. Guo, and H. Chen, "Adaptive decision-making for automated vehicles under roundabout scenarios using optimization embedded reinforcement learning," IEEE Transactions on Neural Networks and Learning Systems, 2020.
[29] H. An and J.-i. Jung, "Decision-making system for lane change using deep reinforcement learning in connected and automated driving," Electronics, vol. 8, no. 5, p. 543, 2019.
[30] J. Erdmann, "Sumo's lane-changing model," in Modeling Mobility with Open Data. Springer, 2015, pp. 105–123.
[31] J. Snoek, H. Larochelle, and R. P. Adams, "Practical bayesian optimization of machine learning algorithms," Advances in Neural Information Processing Systems, vol. 25, 2012.
[32] M. Pelikan, D. E. Goldberg, E. Cantú-Paz et al., "BOA: The bayesian optimization algorithm," in Proceedings of the Genetic and Evolutionary Computation Conference GECCO-99, vol. 1. Citeseer, 1999, pp. 525–532.
[33] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas, "Taking the human out of the loop: A review of bayesian optimization," Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2015.
[34] J. Lin, "Divergence measures based on the shannon entropy," IEEE Transactions on Information Theory, vol. 37, no. 1, pp. 145–151, 1991.
[35] F. Huszár, "How (not) to train your generative model: Scheduled sampling, likelihood, adversary?" arXiv preprint arXiv:1511.05101, 2015.
[36] N. Srinivas, A. Krause, S. Kakade, and M. Seeger, "Gaussian process optimization in the bandit setting: No regret and experimental design," in Proceedings of the 27th International Conference on Machine Learning, 2010, pp. 1015–1022.
[37] S. Boyd, S. P. Boyd, and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[38] R. Rajamani, Vehicle Dynamics and Control. Springer Science & Business Media, 2011.
[39] X. He, Y. Liu, C. Lv, X. Ji, and Y. Liu, "Emergency steering control of autonomous vehicle for collision avoidance and stabilisation," Vehicle System Dynamics, vol. 57, no. 8, pp. 1163–1187, 2019.
[40] P. Christodoulou, "Soft actor-critic for discrete action settings," arXiv preprint arXiv:1910.07207, 2019.

Chen Lv is a Nanyang Assistant Professor at the School of Mechanical and Aerospace Engineering and the Cluster Director in Future Mobility Solutions, Nanyang Technological University, Singapore. He received his PhD degree from the Department of Automotive Engineering, Tsinghua University, China, in January 2016. He was a joint PhD researcher at UC Berkeley, USA, during 2014-2015, and worked as a Research Fellow at Cranfield University, UK, during 2016-2018. He joined NTU and founded the Automated Driving and Human-Machine System (AutoMan) Research Lab in June 2018. His research focuses on intelligent vehicles, automated driving, and human-machine systems, where he has contributed 2 books and over 100 papers, and obtained 12 granted patents. He serves as Associate Editor for IEEE T-ITS, IEEE TVT, and IEEE T-IV. He has received many awards and honors, including the Highly Commended Paper Award of IMechE UK in 2012, the Japan NSK Outstanding Mechanical Engineering Paper Award in 2014, the Tsinghua University Outstanding Doctoral Thesis Award in 2016, the IEEE IV Best Workshop/Special Session Paper Award in 2018, the Automotive Innovation Best Paper Award in 2020, the winner of the Waymo Open Dataset Challenges at CVPR 2021, and the Machines Young Investigator Award in 2022.