Interpretable End-to-End Urban Autonomous Driving With Latent Deep Reinforcement Learning
Abstract— Unlike the popular modularized framework, end-to-end autonomous driving seeks to solve the perception, decision, and control problems in an integrated way, which can be more adaptable to new scenarios and easier to generalize at scale. However, existing end-to-end approaches often lack interpretability and can only deal with simple driving tasks such as lane keeping. In this article, we propose an interpretable deep reinforcement learning method for end-to-end autonomous driving that is able to handle complex urban scenarios. A sequential latent environment model is introduced and learned jointly with the reinforcement learning process. With this latent model, a semantic birdeye mask can be generated, which is enforced to connect with certain intermediate properties of today's modularized framework in order to explain the behaviors of the learned policy. The latent space also significantly reduces the sample complexity of reinforcement learning. Comparison tests in a realistic driving simulator show that the performance of our method in urban scenarios with crowded surrounding vehicles dominates many baselines, including DQN, DDPG, TD3 and SAC. Moreover, through the masked outputs, the learned model is able to provide a better explanation of how the car reasons about the driving environment.

Index Terms— Autonomous driving, deep reinforcement learning, probabilistic graphical model, interpretability.

Manuscript received March 19, 2020; revised July 7, 2020 and November 16, 2020; accepted December 17, 2020. Date of publication February 3, 2021; date of current version May 31, 2022. This work was supported by DENSO International at America. The Associate Editor for this article was B. Fidan. (Corresponding author: Jianyu Chen.)
Jianyu Chen was with the University of California at Berkeley, Berkeley, CA 94720 USA. He is now with the Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China, and also with the Shanghai Qi Zhi Institute, Shanghai 200030, China (e-mail: [email protected]).
Shengbo Eben Li is with the State Key Lab of Automotive Safety and Energy, School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China (e-mail: [email protected]).
Masayoshi Tomizuka is with the Department of Mechanical Engineering, University of California at Berkeley, Berkeley, CA 94720 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/TITS.2020.3046646

I. INTRODUCTION

MOST of today's autonomous driving systems use a highly modularized, hand-engineered approach, with separate modules for perception, localization, behavior prediction, decision making, motion control, etc. [1], [2]. Take the perception module as an example: even though some learning techniques are used, its design still requires tedious hand-engineered work such as selecting representation features for each type of road user. Even though it works well in a few driving tasks, this modularized framework starts to touch its performance limits in urban driving scenarios because (1) too much human heuristics can lead to conservative driving policies; (2) it is hard to generalize, as we might need to redesign the heuristics for each new scenario and task; and (3) these modules are strongly entangled with each other, so the whole system becomes expensive to scale and maintain.

Those limitations might be avoided with end-to-end autonomous driving approaches, in which a driving policy can be learned and generalized to new tasks without much hand-engineering involved [3]–[5]. Moreover, the learned policy can be continuously optimized during driving, which makes it possible to achieve superhuman performance. The two main branches of end-to-end autonomous driving are imitation learning (IL) [3], [4], [6], [7], which learns a driving policy by imitating collected expert driving data, and reinforcement learning (RL) [8]–[10], which learns a policy by self-exploration and reinforcement. However, existing end-to-end methods are criticized for two main shortcomings: 1) the learned policies lack interpretability. When an end-to-end policy is learned directly from raw observations to control commands, we cannot explain how it works, since the deep neural network is like a black box; 2) they usually only deal with simple driving tasks such as lane keeping. Urban autonomous driving, however, is much more complex due to highly dynamic road traffic and strong road user interactions. The various urban scenarios and street views significantly increase the sample complexity, making it extremely challenging to learn a good end-to-end driving policy.

This article introduces maximum entropy RL with sequential latent variables to address these problems in end-to-end autonomous driving. The latent space is employed to encode the complex urban driving environment, including visual inputs, spatial features, road conditions and road users' states. Historical high-dimensional raw observations are compressed into this low-dimensional latent space with a sequential latent environment model, which is learned jointly with the reinforcement learning process.

The introduced latent space enables an interpretable explanation of how the policy reasons about the environment by decoding the latent state to a semantic birdeye mask. During training, this mask is enforced to connect with some intermediate properties of today's modularized framework, for example, localization & mapping, object detection, and behavior prediction, thus providing an explanation of the learned policy. Meanwhile, the latent space provides a much more compact state representation, which significantly reduces the sample complexity, resulting in a large performance improvement. We implemented our method to learn an end-to-end driving policy from raw camera and lidar inputs in a realistic
driving simulator. Experimental evaluation demonstrates that our method significantly outperforms prior methods in crowded urban scenarios. Examples of decoded semantic bird-eye masks are presented to illustrate how our autonomous car understands the driving situations.

II. RELATED WORK

Recent advances in machine learning enable learning-based end-to-end approaches for autonomous driving. There are two main approaches: imitation learning (IL) and reinforcement learning (RL). IL learns a driving policy from expert driving data [3], [4], [6], [7]. With expert samples as labelled data, a driving policy is often easy to train, and it generally works well in structured driving tasks if one can collect enough expert data. However, IL has fundamental limitations: (1) it is data hungry, and its performance is limited by the level of the expert policy; (2) it is unable to learn skills that are not provided, or are rare, in the demonstration data. This makes it difficult to deal with some dangerous scenarios, such as near-collision cases, because they might never be demonstrated by the expert.

Combined with deep learning techniques, RL has shown its power in tackling complex decision making and planning problems, bringing a series of breakthroughs in recent years. Agents trained with deep RL techniques achieve superhuman-level performance in game playing [11], [12], go playing [13], [14], and robotics [15], [16]. Related deep RL algorithms range from value-based methods such as DQN [11], [12] and double DQN [17], to actor-critic methods such as A3C [18], DDPG [9] and TD3 [19], policy optimization methods such as TRPO [20] and PPO [21], and maximum entropy RL methods such as SAC [22], [23]. With RL, a policy can be learned automatically without any expert data. It can explore various kinds of possible cases, including some dangerous ones, and then learn useful skills. It also has the potential to achieve superhuman performance.

Researchers have been trying to apply deep RL to the domain of autonomous driving. Wolf et al. [8] used DQN to learn to steer an autonomous car to keep to the track in simulation. Its action space is discrete and only allows coarse steering angles. Lillicrap et al. [9] proposed a continuous-control deep RL algorithm that learned a deep neural network policy able to drive the autonomous car on a simulated racing track. Chen et al. [24] proposed a hierarchical deep RL framework to solve driving scenarios with complex decision making, such as traffic light passing. Kendall et al. [10] demonstrated the first application of deep RL to real-world autonomous cars, learning a deep lane keeping policy using a single front-view camera image as input. There are many other related works not mentioned here. However, existing works either target simple scenarios without complex road conditions and multi-agent interactions, or use manually designed feature representations.

Another problem of learning-based approaches for autonomous driving is that they lack interpretability. The learned deep neural network policy is like a black box, which is not ideal since autonomous driving is a safety-critical application. It is important for us to know whether and how the autonomous car understands the environment. Some works have made efforts in this direction. Bojarski et al. [25] visualized NVIDIA's deep neural network based driving system by extracting the convolutional layer feature maps and highlighting the salient objects. Kim et al. [26] used a visual attention model with a causal filter to visualize the attention heatmap. Sauer et al. [27] analyzed the decision making process of the deep neural network by using gradient-weighted class activation maps to obtain the attention of the network. However, the interpretable information they provide, mostly just which part of the observed image receives attention, is rather weak.

The probabilistic graphical model (PGM) is a generic and powerful tool for formulating many machine learning problems [28]. In autonomous driving research, it is widely used for modeling human driving behaviours [29], [30]. More recently, the sequential latent model [31]–[35], an application of PGM that is very relevant to this work, uses PGM to formulate stochastic time-sequence processes with latent variables. Close connections have also been found between PGM and maximum entropy reinforcement learning [36]–[38]. Some recent works propose to integrate sequential latent model learning and reinforcement learning [34], [35], [39], [40]. Such methods show great potential for end-to-end learning of deep policies with high-dimensional inputs. However, no prior works have used this branch of techniques to formulate and solve autonomous driving problems. Furthermore, they do not provide interpretability of the learned model, and do not handle multiple sources of sensor inputs, which is essential for autonomous driving systems.

III. PGM FOR ENVIRONMENT MODELING AND REINFORCEMENT LEARNING

A. Probabilistic Graphical Model (PGM)

A probabilistic graphical model (PGM) uses a graph to represent the conditional dependence between random variables [28]. PGMs are widely used in Bayesian statistics and Bayesian learning. Fig.1 shows a simple example of a PGM with 4 nodes A, B, C and D. The nodes represent random variables, which can be observable quantities, unobservable latents, or unknown parameters. The edges between nodes represent their conditional dependencies. In Fig.1, C is conditioned on A and B, while D is conditioned on C. Each edge is associated with a conditional probability, such as p(C|A, B) and p(D|C). With the ability to describe complex causal effects and probabilistic transitions, PGM can be used as a generic tool to describe probabilistic processes. In this article, we use PGM to formulate both the driving environment and the reinforcement learning process.

B. PGM for Sequential Latent Environment Modeling

To obtain the optimal policy, it is crucial to accurately model the environment. Most environments by nature have the following characteristics: (1) High-dimensional observations: whether for a human being or an autonomous car, the raw observations are usually high dimensional, such as RGB images; (2) Time sequence probabilistic dynamics: the
needs to be summarized by historical observations.

Here we introduce a probabilistic sequential latent environment model, which satisfies the above characteristics. Similar structures of this model are adopted in recent literature [31], [34], [35]. As shown in Fig.2, x_t represents the observation at time step t, which can be high-dimensional sensor inputs such as RGB images; a_t is the action chosen at t; and z_t is the latent state variable at t, a description of the current situation summarizing historical information, e.g., the positions, velocities and intentions of other road participants, the drivable areas, and the road markings. The observation x_t is a decoding of the latent state z_t, defined by p(x_t|z_t). The latent state z_t, together with the action a_t, determines the latent state at the next time step through the state transition function p(z_{t+1}|z_t, a_t).

This environment model is quite generic, as there are no restrictions on the formats and physical meanings of the observations, actions, and latent states. Furthermore, the observation decoding function p(x_t|z_t) and the state transition function p(z_{t+1}|z_t, a_t) can be arbitrarily complex, such as deep neural networks.

By introducing an additional filtering function p(z_{t+1}|z_t, x_{t+1}, a_t), the latent state can be inferred in a recursive Bayesian filtering way. Given a new observation x_{t+1}, we have p(z_{t+1}) = \int p(z_{t+1}|z_t, x_{t+1}, a_t) p(z_t) dz_t, where a_t is the action executed at the last time step. The latent state for the first time step is obtained by p(z_1) = p(z_1|x_1). Furthermore, we can make probabilistic predictions by rolling out future states based on the state transition function:

p(z_{\tau:\tau+H} | a_{\tau:\tau+H-1}) = p(z_\tau) \prod_{t=\tau}^{\tau+H-1} p(z_{t+1} | z_t, a_t)    (1)
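To make the rollout in (1) concrete, the sketch below propagates a latent trajectory when the transition p(z_{t+1}|z_t, a_t) is parameterized as a diagonal-Gaussian neural network. The module `TransitionNet`, the latent/action dimensions, and the random placeholder policy are illustrative assumptions, not the architecture used in this article; the filtering step with q(z_{t+1}|z_t, x_{t+1}, a_t) would be analogous, with the new observation concatenated to the input.

```python
import torch
import torch.nn as nn

class TransitionNet(nn.Module):
    """Illustrative Gaussian transition p(z_{t+1} | z_t, a_t)."""
    def __init__(self, z_dim=32, a_dim=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim + a_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z_dim))

    def forward(self, z, a):
        mean, log_std = self.net(torch.cat([z, a], dim=-1)).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.exp())

def rollout(transition, policy, z_tau, horizon):
    """Sample z_{tau:tau+H} as in Eq. (1), choosing actions from a policy."""
    zs, z = [z_tau], z_tau
    for _ in range(horizon):
        a = policy(z)                    # a_t given z_t
        z = transition(z, a).rsample()   # z_{t+1} ~ p(z_{t+1} | z_t, a_t)
        zs.append(z)
    return torch.stack(zs)

# usage: a random placeholder policy over a 2-D action space, one latent state
trans = TransitionNet()
policy = lambda z: torch.tanh(torch.randn(z.shape[0], 2))
traj = rollout(trans, policy, torch.zeros(1, 32), horizon=10)
print(traj.shape)  # (11, 1, 32): the predicted latent trajectory
```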
Note that here we do not explicitly write the discount factor γ in the accumulated rewards; instead, we incorporate the discount factor by modifying the state transition model [36]. If the initial state transitions are given by p(z_{t+1}|z_t, a_t), adding a discount factor is equivalent to the undiscounted problem under the modified state transitions \bar{p}(z_{t+1}|z_t, a_t) = γ p(z_{t+1}|z_t, a_t), where there is an additional transition with probability 1 − γ, regardless of the action, into an absorbing state with zero reward. The discount factor allows convergence of the value function in infinite-horizon settings. Without loss of generality, we omit γ from the PGM-related derivations in this article; it can be inserted trivially in all cases simply by modifying the state transition models as described above. The discount factor is revisited explicitly in our reinforcement learning algorithm implementation in V-C.
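A quick numerical sanity check of this equivalence, under the toy assumption of a constant unit reward: terminating each rollout with probability 1 − γ per step and summing undiscounted rewards matches, in expectation, the usual γ-discounted sum.

```python
import random

gamma, r, steps, episodes = 0.99, 1.0, 200, 20000
random.seed(0)

# Discounted return of a constant reward stream, truncated at `steps`.
discounted = sum(gamma ** t * r for t in range(steps))

# Undiscounted return under the modified transitions: with probability
# 1 - gamma the process jumps into the zero-reward absorbing state.
total = 0.0
for _ in range(episodes):
    for t in range(steps):
        total += r
        if random.random() > gamma:   # transition into the absorbing state
            break
absorbing = total / episodes

print(round(discounted, 2), round(absorbing, 2))  # the two estimates agree closely
```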
Maximum entropy reinforcement learning (MaxEnt RL) [22], [36], [42] modifies the above standard RL by adding an entropy regularization term H(π(a_t|z_t)) = −log π(a_t|z_t) to the reward. Considering that we use a parametric function as the policy π_φ, for example a deep neural network with weights φ, the objective of MaxEnt RL can be written as:

\phi^* = \arg\max_\phi \; \mathbb{E}_{z_1 \sim p(z_1),\; a_t \sim \pi_\phi(a_t|z_t),\; z_{t+1} \sim p(z_{t+1}|z_t, a_t)} \Big[ \sum_{t=1}^{H} r(z_t, a_t) - \log \pi_\phi(a_t|z_t) \Big]    (4)
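The sketch below estimates the MaxEnt objective in (4) by Monte Carlo, adding the negative log-probability of the sampled action to the reward at every step. The Gaussian policy, toy dynamics, and quadratic action cost are placeholder assumptions used only to show where the entropy term enters the return.

```python
import torch

torch.manual_seed(0)

def maxent_return(policy_dist_fn, transition_fn, reward_fn, z1, H):
    """Monte Carlo estimate of E[sum_t r(z_t, a_t) - log pi(a_t | z_t)]."""
    z, total = z1, torch.zeros(z1.shape[0])
    for _ in range(H):
        dist = policy_dist_fn(z)            # pi_phi(a_t | z_t)
        a = dist.rsample()
        log_pi = dist.log_prob(a).sum(-1)   # log pi_phi(a_t | z_t)
        total = total + reward_fn(z, a) - log_pi
        z = transition_fn(z, a)             # z_{t+1} ~ p(z_{t+1} | z_t, a_t)
    return total.mean()

# placeholder policy, dynamics and reward for illustration only
policy = lambda z: torch.distributions.Normal(torch.tanh(z[:, :2]), 0.5)
dynamics = lambda z, a: z + 0.1 * torch.randn_like(z)
reward = lambda z, a: -(a ** 2).sum(-1)

print(maxent_return(policy, dynamics, reward, torch.zeros(64, 8), H=20))
```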
There are several reasons why we would like to use MaxEnt RL instead of standard RL [43]. First, it performs better exploration. Standard RL requires specific exploration strategies
Fig. 4. A PGM for interpretable end-to-end urban autonomous driving.

Fig. 6. The bird-view semantic mask for urban autonomous driving.
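The bird-view semantic mask of Fig. 6 can be thought of as a multi-channel top-down rasterization around the ego vehicle. The toy renderer below illustrates that data structure only; the channel layout (ego vehicle, surrounding vehicles, free space) and the grid resolution are guesses for illustration, not the mask definition used by the authors.

```python
import numpy as np

def render_birdview_mask(ego_box, other_boxes, size=64):
    """Toy 3-channel bird-view mask: ego vehicle, surrounding vehicles, free space."""
    mask = np.zeros((3, size, size), dtype=np.uint8)
    mask[2] = 1                                    # channel 2: free space by default
    def draw(channel, box):
        x, y, w, h = box                           # top-left corner and extent, in grid cells
        mask[channel, y:y + h, x:x + w] = 1
        mask[2, y:y + h, x:x + w] = 0              # occupied cells are no longer free space
    draw(0, ego_box)                               # channel 0: ego vehicle
    for box in other_boxes:
        draw(1, box)                               # channel 1: surrounding vehicles
    return mask

mask = render_birdview_mask(ego_box=(30, 40, 4, 8),
                            other_boxes=[(30, 20, 4, 8), (10, 30, 8, 4)])
print(mask.shape, mask.sum(axis=(1, 2)))          # per-channel occupancy counts
```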
q(z^w, a^p | \bar{x}, \bar{a}) = q(\bar{z} | \bar{x}, \bar{a}) \, \pi(a_{\tau+H} | z_{\tau+H}) \prod_{t=\tau+1}^{\tau+H-1} p(z_{t+1} | z_t, a_t) \, \pi(a_t | z_t)    (15)

where q(\bar{z} | \bar{x}, \bar{a}) is the posterior of the latent states given the historical sensor inputs and actions. The remaining part of the right-hand side represents the trajectory distribution obtained by executing the policy π(a_t|z_t) under the latent state transition p(z_{t+1}|z_t, a_t).

Now eliminate the integration in (14) by introducing an expectation, and apply Jensen's inequality:

\log p(\bar{x}, \bar{m}, O^p | \bar{a}) = \log \mathbb{E}_{q(z^w, a^p | \bar{x}, \bar{a})} \left[ \frac{p(\bar{x}, \bar{m}, O^p, z^w, a^p | \bar{a})}{q(z^w, a^p | \bar{x}, \bar{a})} \right]
\geq \mathbb{E}_{q(z^w, a^p | \bar{x}, \bar{a})} \big[ \log p(\bar{x}, \bar{m}, O^p, z^w, a^p | \bar{a}) - \log q(z^w, a^p | \bar{x}, \bar{a}) \big] = \mathrm{ELBO}    (16)

where "ELBO" stands for the evidence lower bound. We can maximize the original log-likelihood by maximizing the ELBO. Let us now derive p(\bar{x}, \bar{m}, O^p, z^w, a^p | \bar{a}) by probability factorization according to the PGM in Fig.4:

p(\bar{x}, \bar{m}, O^p, z^w, a^p | \bar{a}) = p(\bar{x}, \bar{m}, O^p, z_{\tau+2:\tau+H}, a^p | \bar{z}, \bar{a}) \, p(\bar{z} | \bar{a})
= p(\bar{x} | \bar{z}) \, p(\bar{m} | \bar{z}) \, p(O^p, z_{\tau+2:\tau+H}, a^p | z_{\tau+1}) \, p(\bar{z} | \bar{a})
= p(\bar{x} | \bar{z}) \, p(\bar{m} | \bar{z}) \, p(\bar{z} | \bar{a}) \, \frac{p(O^p, z^p, a^p)}{p(z_{\tau+1})}    (17)

According to the soft optimality assumption:

p(O^p, z^p, a^p) = p(z^p, a^p) \, p(O^p | z^p, a^p)
= p(a^p) \, p(z_{\tau+1}) \prod_{t=\tau+1}^{\tau+H-1} p(z_{t+1} | z_t, a_t) \, \exp\Big( \sum_{t=\tau+1}^{\tau+H} r(z_t, a_t) \Big)    (18)

Substituting (15), (17) and (18) into (16) and expanding the logarithms gives

\mathrm{ELBO} = \mathbb{E}_{q(z^w, a^p | \bar{x}, \bar{a})} \Big[ \log p(\bar{x} | \bar{z}) + \log p(\bar{m} | \bar{z}) + \log p(\bar{z} | \bar{a}) + \sum_{t=\tau+1}^{\tau+H-1} \log p(z_{t+1} | z_t, a_t) + \sum_{t=\tau+1}^{\tau+H} r(z_t, a_t)
- \log q(\bar{z} | \bar{x}, \bar{a}) - \sum_{t=\tau+1}^{\tau+H} \log \pi(a_t | z_t) - \sum_{t=\tau+1}^{\tau+H-1} \log p(z_{t+1} | z_t, a_t) + \log p(a^p) \Big]    (20)

Noticing the cancellations in (20), we have:

\mathrm{ELBO} = \mathbb{E}_{q(z^w, a^p | \bar{x}, \bar{a})} \big[ \log p(\bar{x} | \bar{z}) + \log p(\bar{m} | \bar{z}) + \log p(\bar{z} | \bar{a}) - \log q(\bar{z} | \bar{x}, \bar{a}) \big]
+ \mathbb{E}_{q(z^w, a^p | \bar{x}, \bar{a})} \Big[ \sum_{t=\tau+1}^{\tau+H} \big( r(z_t, a_t) - \log \pi(a_t | z_t) + \log p(a_t) \big) \Big]    (21)

The first part of the right-hand side of (21) corresponds to learning the environment model, while the second part corresponds to learning the driving policy. We derive the details of the two parts in V-B and V-C, respectively.

B. Environment Model Learning

The environment model can be learned by optimizing the first part of (21):

\mathbb{E}_{q(\bar{z} | \bar{x}, \bar{a})} \big[ \log p(\bar{x} | \bar{z}) + \log p(\bar{m} | \bar{z}) + \log p(\bar{z} | \bar{a}) - \log q(\bar{z} | \bar{x}, \bar{a}) \big]    (22)

where we replace \mathbb{E}_{q(z^w, a^p | \bar{x}, \bar{a})} with \mathbb{E}_{q(\bar{z} | \bar{x}, \bar{a})} because this part of the ELBO is only related to z_{1:\tau+1}. Now let us further derive the components in (22) by unfolding them over time, according to the conditional dependence of the PGM in Fig.4. The generative models can be unfolded as:

\log p(\bar{x} | \bar{z}) = \log \prod_{t=1}^{\tau+1} p(x_t | z_t) = \sum_{t=1}^{\tau+1} \log p(x_t | z_t),
\qquad \log p(\bar{m} | \bar{z}) = \log \prod_{t=1}^{\tau+1} p(m_t | z_t) = \sum_{t=1}^{\tau+1} \log p(m_t | z_t)    (23)
The prior model can be unfolded using the latent state transition function:

\log p(\bar{z} | \bar{a}) = \log \Big[ p(z_1) \prod_{t=1}^{\tau} p(z_{t+1} | z_t, a_t) \Big] = \log p(z_1) + \sum_{t=1}^{\tau} \log p(z_{t+1} | z_t, a_t)    (24)

The posterior inference model can be unfolded as:

\log q(\bar{z} | \bar{x}, \bar{a}) = \log \Big[ q(z_1 | \bar{x}, \bar{a}) \prod_{t=1}^{\tau} q(z_{t+1} | z_t, \bar{x}, \bar{a}) \Big]
\approx \log \Big[ q(z_1 | x_1) \prod_{t=1}^{\tau} q(z_{t+1} | z_t, x_{t+1}, a_t) \Big]
= \log q(z_1 | x_1) + \sum_{t=1}^{\tau} \log q(z_{t+1} | z_t, x_{t+1}, a_t)    (25)

Note that here we approximate q(z_1 | \bar{x}, \bar{a}) and q(z_{t+1} | z_t, \bar{x}, \bar{a}) with q(z_1 | x_1) and q(z_{t+1} | z_t, x_{t+1}, a_t) for simplicity. To obtain the exact values, bi-directional recurrent neural networks should be used to compute the posterior probabilities conditioned on the whole trajectory sequence (\bar{x}, \bar{a}) [31].

We can now unfold (22) with time:

\mathbb{E}_{q(\bar{z} | \bar{x}, \bar{a})} \big[ \log p(\bar{x} | \bar{z}) + \log p(\bar{m} | \bar{z}) + \log p(\bar{z} | \bar{a}) - \log q(\bar{z} | \bar{x}, \bar{a}) \big]
\approx \mathbb{E}_{q(\bar{z} | \bar{x}, \bar{a})} \Big[ \sum_{t=1}^{\tau+1} \log p(x_t | z_t) + \sum_{t=1}^{\tau+1} \log p(m_t | z_t)
- D_{\mathrm{KL}}\big( q(z_1 | x_1) \,\|\, p(z_1) \big) - \sum_{t=1}^{\tau} D_{\mathrm{KL}}\big( q(z_{t+1} | z_t, x_{t+1}, a_t) \,\|\, p(z_{t+1} | z_t, a_t) \big) \Big]    (26)
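As a sketch of how (23)–(26) could be turned into a training loss, the snippet below sums the reconstruction log-likelihoods of the observation x_t and the mask m_t and subtracts the KL terms between the filtering posterior and the transition prior. The diagonal-Gaussian/Bernoulli distribution choices and the calling convention are assumptions for illustration, not the exact networks used in this article.

```python
import torch
from torch.distributions import Normal, Bernoulli, kl_divergence

def model_loss(posteriors, priors, obs_decoders, mask_decoders, x_seq, m_seq):
    """Negative of the unfolded model-learning objective in Eq. (26).

    posteriors[t] / priors[t]         : q(z_t | .) and the matching prior/transition p(z_t | .)
    obs_decoders[t] / mask_decoders[t]: p(x_t | z_t) and p(m_t | z_t) evaluated at sampled z_t
    x_seq / m_seq                     : observed sensor inputs and semantic masks
    """
    recon = 0.0
    for t in range(len(x_seq)):
        recon = recon + obs_decoders[t].log_prob(x_seq[t]).sum()    # log p(x_t | z_t)
        recon = recon + mask_decoders[t].log_prob(m_seq[t]).sum()   # log p(m_t | z_t)
    kl = sum(kl_divergence(q, p).sum() for q, p in zip(posteriors, priors))
    return -(recon - kl)   # minimize the negative ELBO

# toy usage: one time step, 4-D latent, 8-D binary "images"
post = [Normal(torch.zeros(4), torch.ones(4))]
prior = [Normal(torch.zeros(4), torch.ones(4))]
px = [Bernoulli(probs=torch.full((8,), 0.5))]
pm = [Bernoulli(probs=torch.full((8,), 0.5))]
x = [torch.randint(0, 2, (8,)).float()]
m = [torch.randint(0, 2, (8,)).float()]
print(model_loss(post, prior, px, pm, x, m))
```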
C. Driving Policy Learning

The driving policy can be learned by optimizing the second part of (21):

\max \; \mathbb{E}_{q(z^p, a^p | \bar{x}, \bar{a})} \Big[ \sum_{t=\tau+1}^{\tau+H} \big( r(z_t, a_t) - \log \pi_\phi(a_t | z_t) + \log p(a_t) \big) \Big]
= \mathbb{E}_{z_{\tau+1} \sim p(z_{\tau+1} | \bar{x}, \bar{a}),\; a_t \sim \pi_\phi(a_t | z_t),\; z_{t+1} \sim p(z_{t+1} | z_t, a_t)} \Big[ \sum_{t=\tau+1}^{\tau+H} \big( r(z_t, a_t) - \log \pi_\phi(a_t | z_t) \big) \Big]    (27)

where log p(a_t) is dropped since we assume a uniform action prior. The optimization problem (27) then becomes a standard MaxEnt RL problem.

We use soft actor-critic (SAC) [22] to solve this MaxEnt RL problem. SAC is a function-approximation version of soft policy iteration (SPI). SPI extends standard policy iteration to the maximum entropy case by iteratively applying the soft policy evaluation:

\mathcal{T}^\pi Q(z_t, a_t) = r(z_t, a_t) + \gamma \, \mathbb{E}_{z_{t+1} \sim p,\; a_{t+1} \sim \pi} \big[ Q(z_{t+1}, a_{t+1}) - \log \pi(a_{t+1} | z_{t+1}) \big]    (28)

and the soft policy improvement:

\pi_{\mathrm{new}} = \arg\min_\pi D_{\mathrm{KL}}\left( \pi(\cdot | z_t) \,\Big\|\, \frac{\exp\big(Q^{\pi_{\mathrm{old}}}(z_t, \cdot)\big)}{Z^{\pi_{\mathrm{old}}}(z_t)} \right)    (29)

where Z^{\pi_{\mathrm{old}}}(z_t) is the normalization term.

The function-approximation implementation optimizes loss functions that address the soft policy evaluation and the soft policy improvement. The loss functions are the Bellman residual in (28):

J_Q = \mathbb{E}_{z_\tau \sim q(\bar{z} | \bar{x}, \bar{a})} \Big[ \frac{1}{2} \big( Q(z_\tau, a_\tau) - \hat{Q}(z_\tau, a_\tau) \big)^2 \Big]    (30)

and the KL divergence in (29):

J_\pi = \mathbb{E}_{z_{\tau+1} \sim q(\bar{z} | \bar{x}, \bar{a}),\; a_{\tau+1} \sim \pi(a_{\tau+1} | z_{\tau+1})} \big[ \log \pi(a_{\tau+1} | z_{\tau+1}) - Q(z_{\tau+1}, a_{\tau+1}) \big]    (31)

Note that

\hat{Q}(z_\tau, a_\tau) = r_\tau + \gamma \, \mathbb{E}_{z_{\tau+1} \sim q(\bar{z} | \bar{x}, \bar{a}),\; a_{\tau+1} \sim \pi(a_{\tau+1} | z_{\tau+1})} \big[ \bar{Q}(z_{\tau+1}, a_{\tau+1}) - \log \pi(a_{\tau+1} | z_{\tau+1}) \big]    (32)

where \bar{Q} is a delayed (target) Q network.

Thus, the joint learning algorithm uses SGD to maximize the model-learning part of the ELBO in (26) and to minimize J_Q in (30) and J_\pi in (31).
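A minimal sketch of the SAC losses (30)–(32) computed on latent states follows. The small placeholder networks, the fixed entropy weight `alpha` (set to 1 to match the unit entropy coefficient in the equations), and the omitted tanh-squashing log-probability correction are simplifications for illustration, not the implementation used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Critic(nn.Module):
    def __init__(self, z_dim=32, a_dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim + a_dim, 128), nn.ReLU(), nn.Linear(128, 1))
    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1)).squeeze(-1)

class Actor(nn.Module):
    def __init__(self, z_dim=32, a_dim=2):
        super().__init__()
        self.net = nn.Linear(z_dim, 2 * a_dim)
    def sample(self, z):
        mean, log_std = self.net(z).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mean, log_std.exp())
        a = dist.rsample()
        return torch.tanh(a), dist.log_prob(a).sum(-1)   # squashing correction omitted for brevity

def sac_losses(q_net, q_target, actor, z, a, r, z_next, gamma=0.99, alpha=1.0):
    """Critic loss J_Q (Eqs. 30, 32) and actor loss J_pi (Eq. 31) on latent states."""
    with torch.no_grad():
        a_next, log_pi_next = actor.sample(z_next)
        q_hat = r + gamma * (q_target(z_next, a_next) - alpha * log_pi_next)   # Eq. (32)
    j_q = 0.5 * F.mse_loss(q_net(z, a), q_hat)                                 # Eq. (30)
    a_new, log_pi = actor.sample(z)
    j_pi = (alpha * log_pi - q_net(z, a_new)).mean()                           # Eq. (31)
    return j_q, j_pi

# toy batch of latent transitions sampled from a replay buffer
z, a, r, z_next = torch.randn(16, 32), torch.randn(16, 2), torch.randn(16), torch.randn(16, 32)
print(sac_losses(Critic(), Critic(), Actor(), z, a, r, z_next))
```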
VI. EXPERIMENTS

A. Simulation Setup

We train and evaluate our proposed method in the CARLA simulator [45]. CARLA is a high-definition open-source simulation platform for autonomous driving research. It simulates not only the driving environment and vehicle dynamics, but also raw sensor inputs such as camera RGB images and lidar point clouds, using rendering and ray-casting techniques. Fig.7 (a) shows a sample view of the driving simulation environment we use.

Fig.7 (b) shows the map layout of the virtual town in CARLA that we use for training. It includes various urban scenarios such as intersections and roundabouts. The map covers an area of 400m × 400m, with about 6km total length of roads. 100 vehicles run autonomously in the virtual town to simulate a multi-agent environment. The vehicles randomly choose a direction at intersections, then follow the route, slowing down for front vehicles and stopping when the front traffic light turns red.
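For readers unfamiliar with CARLA, a crowded multi-agent town similar to the one described above can be set up roughly as follows. This is a sketch using CARLA's Python API (0.9.x-style calls) and assumes a locally running server on the default port; it is not the exact configuration script used for the experiments.

```python
import random
import carla

# connect to a running CARLA server (default port assumed)
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()

# spawn ~100 autopilot vehicles to create a multi-agent urban environment
blueprints = world.get_blueprint_library().filter("vehicle.*")
spawn_points = world.get_map().get_spawn_points()
random.shuffle(spawn_points)

vehicles = []
for transform in spawn_points[:100]:
    bp = random.choice(blueprints)
    actor = world.try_spawn_actor(bp, transform)   # returns None if the spot is occupied
    if actor is not None:
        actor.set_autopilot(True)                  # built-in behavior: follow routes, obey lights
        vehicles.append(actor)

print(f"spawned {len(vehicles)} background vehicles")
```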
B. Implementation Details

1) Reward Function: We use the following reward function in our experiments:

r = 200\, r_{\mathrm{collision}} + v_{\mathrm{lon}} + 10\, r_{\mathrm{fast}} + r_{\mathrm{out}} - 5\alpha^2 + 0.2\, r_{\mathrm{lat}} - 0.1    (33)
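The reward in (33) is a weighted sum of terms computed by the environment wrapper; their exact definitions are cut off in the text above. The sketch below simply evaluates that weighted sum, with the interpretation of α as a steering angle and the toy input values being assumptions for illustration only.

```python
def reward(r_collision, v_lon, r_fast, r_out, alpha, r_lat):
    """Weighted sum of Eq. (33); each term is assumed to be precomputed elsewhere."""
    return (200.0 * r_collision + v_lon + 10.0 * r_fast + r_out
            - 5.0 * alpha ** 2 + 0.2 * r_lat - 0.1)

# illustrative call: no collision, 5 m/s longitudinal speed, small steering angle (assumed meaning of alpha)
print(reward(r_collision=0.0, v_lon=5.0, r_fast=0.0, r_out=0.0, alpha=0.05, r_lat=0.0))
```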
size 256 and learning rate 0.0003. The sequential latent model
is trained with batch size 32 and learning rate 0.0001. The
length of trajectories used for training is τ = 10. The discount
factor γ = 0.99.
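For convenience, the training hyperparameters stated above can be collected in a single configuration. Which component the first batch size and learning rate refer to is cut off above, so the key names below are a guess; the values themselves are taken directly from the text.

```python
# hyperparameters reported in the text above
config = {
    "rl_batch_size": 256,          # attribution to the RL updates is assumed
    "rl_learning_rate": 3e-4,
    "model_batch_size": 32,        # sequential latent model updates
    "model_learning_rate": 1e-4,
    "trajectory_length_tau": 10,   # length of training trajectories
    "discount_gamma": 0.99,
}
print(config)
```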
Fig. 9. Sampled frames to illustrate the interpretability of our method. For each sample, left to right: camera, lidar, bird-view image. First row: original
sensor inputs and ground truth mask. Second row: reconstructed images. Only the raw camera and lidar images are observed.
Fig. 10. Examples of failure cases interpretation.

a vehicle occupying part of its lane. From the reconstructed mask we can see that although the agent recognizes the vehicle, it mistakenly localizes it in its own lane. This might be because this is a very rare situation and almost all training data consists of vehicles running in their own lanes, suggesting that more diverse data needs to be collected.

IX. CONCLUSION

In this article, we proposed an interpretable end-to-end reinforcement learning algorithm for autonomous driving in urban driving scenarios. The driving policy was learned jointly with a sequential latent environment model. The learned driving policy took camera and lidar images as input and generated control commands to navigate the autonomous car through urban driving scenarios. The learned environment model provided an interpretable explanation of how the autonomous car understood the driving situation by generating a bird-view semantic mask. The mask was enforced to connect with certain intermediate properties in traditional autonomous driving frameworks, thus providing an explanation of the learned policy. The method was implemented and evaluated in the CARLA simulator, and was shown to have significantly better performance than the baseline methods.

Although our framework is able to provide interpretable explanations of how the learned model understands the driving environment, it does not provide any intuition about how it makes decisions, because the driving policy is obtained in a model-free style. In the future, model-based methods will be investigated within this framework to further improve the performance and interpretability. We also plan to use real data in the future. However, instead of directly running RL in real-world driving, we will deploy an offline RL method, which will directly learn a good driving policy from offline collected data.

REFERENCES

[1] S. Thrun et al., "Stanley: The robot that won the DARPA grand challenge," J. Field Robot., vol. 23, no. 9, pp. 661–692, 2006.
[2] C. Urmson et al., "Autonomous driving in urban environments: Boss and the urban challenge," J. Field Robot., vol. 25, no. 8, pp. 425–466, 2008.
[3] M. Bojarski et al., "End to end learning for self-driving cars," 2016, arXiv:1604.07316. [Online]. Available: http://arxiv.org/abs/1604.07316
[4] F. Codevilla, M. Müller, A. Lopez, V. Koltun, and A. Dosovitskiy, "End-to-end driving via conditional imitation learning," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2018, pp. 1–9.
[5] H. Xu, Y. Gao, F. Yu, and T. Darrell, "End-to-end learning of driving models from large-scale video datasets," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2174–2182.
[6] M. Bansal, A. Krizhevsky, and A. Ogale, "ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst," 2018, arXiv:1812.03079. [Online]. Available: http://arxiv.org/abs/1812.03079
[7] J. Chen, B. Yuan, and M. Tomizuka, "Deep imitation learning for autonomous driving in generic urban scenarios with enhanced safety," 2019, arXiv:1903.00640. [Online]. Available: http://arxiv.org/abs/1903.00640
[8] P. Wolf et al., "Learning how to drive in a real world simulation with deep Q-Networks," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2017, pp. 244–250.
[9] T. P. Lillicrap et al., "Continuous control with deep reinforcement learning," 2015, arXiv:1509.02971. [Online]. Available: http://arxiv.org/abs/1509.02971
[10] A. Kendall et al., "Learning to drive in a day," in Proc. Int. Conf. Robot. Autom. (ICRA), May 2019, pp. 8248–8254.
[11] V. Mnih et al., "Playing Atari with deep reinforcement learning," 2013, arXiv:1312.5602. [Online]. Available: http://arxiv.org/abs/1312.5602
[12] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[13] D. Silver et al., "Mastering the game of go with deep neural networks and tree search," Nature, vol. 529, no. 7587, p. 484, 2016.
[14] D. Silver et al., "Mastering the game of go without human knowledge," Nature, vol. 550, no. 7676, pp. 354–359, Oct. 2017.
[15] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," J. Mach. Learn. Res., vol. 17, no. 1, pp. 1334–1373, 2015.
[16] D. Kalashnikov et al., "QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation," 2018, arXiv:1806.10293. [Online]. Available: http://arxiv.org/abs/1806.10293
[17] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proc. AAAI Conf. Artif. Intell., 2016, pp. 2094–2100.
[18] V. Mnih et al., "Asynchronous methods for deep reinforcement learning," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1928–1937.
[19] S. Fujimoto, H. van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," 2018, arXiv:1802.09477. [Online]. Available: http://arxiv.org/abs/1802.09477
[20] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in Proc. Int. Conf. Mach. Learn., 2015, pp. 1889–1897.
[21] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, arXiv:1707.06347. [Online]. Available: http://arxiv.org/abs/1707.06347
[22] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," 2018, arXiv:1801.01290. [Online]. Available: http://arxiv.org/abs/1801.01290
[23] T. Haarnoja et al., "Soft actor-critic algorithms and applications," 2018, arXiv:1812.05905. [Online]. Available: http://arxiv.org/abs/1812.05905
[24] J. Chen, Z. Wang, and M. Tomizuka, "Deep hierarchical reinforcement learning for autonomous driving with distinct behaviors," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2018, pp. 1239–1244.
[25] M. Bojarski et al., "Explaining how a deep neural network trained with end-to-end learning steers a car," 2017, arXiv:1704.07911. [Online]. Available: http://arxiv.org/abs/1704.07911
[26] J. Kim and J. Canny, "Interpretable learning for self-driving cars by visualizing causal attention," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2942–2950.
[27] A. Sauer, N. Savinov, and A. Geiger, "Conditional affordance learning for driving in urban environments," 2018, arXiv:1806.06498. [Online]. Available: http://arxiv.org/abs/1806.06498
[28] K. P. Murphy, Machine Learning: A Probabilistic Perspective. Cambridge, MA, USA: MIT Press, 2012.
[29] C. Dong, J. M. Dolan, and B. Litkouhi, "Intention estimation for ramp merging control in autonomous driving," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2017, pp. 1584–1589.
[30] C. Dong, J. M. Dolan, and B. Litkouhi, "Interactive ramp merging planning in autonomous driving: Multi-merging leading PGM (MML-PGM)," in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Oct. 2017, pp. 1–6.
[31] R. G. Krishnan, U. Shalit, and D. Sontag, "Deep Kalman filters," 2015, arXiv:1511.05121. [Online]. Available: http://arxiv.org/abs/1511.05121
[32] M. Karl, M. Soelch, J. Bayer, and P. van der Smagt, "Deep variational Bayes filters: Unsupervised learning of state space models from raw data," 2016, arXiv:1605.06432. [Online]. Available: http://arxiv.org/abs/1605.06432
[33] M. Fraccaro, S. Kamronn, U. Paquet, and O. Winther, "A disentangled recognition and nonlinear dynamics model for unsupervised learning," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 3601–3610.
[34] D. Hafner et al., "Learning latent dynamics for planning from pixels," 2018, arXiv:1811.04551. [Online]. Available: http://arxiv.org/abs/1811.04551
[35] A. X. Lee, A. Nagabandi, P. Abbeel, and S. Levine, "Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model," 2019, arXiv:1907.00953. [Online]. Available: http://arxiv.org/abs/1907.00953
[36] S. Levine, "Reinforcement learning and control as probabilistic inference: Tutorial and review," 2018, arXiv:1805.00909. [Online]. Available: http://arxiv.org/abs/1805.00909
[37] K. Rawlik, M. Toussaint, and S. Vijayakumar, "On stochastic optimal control and reinforcement learning by approximate inference," in Proc. 23rd Int. Joint Conf. Artif. Intell., 2013.
[38] B. D. Ziebart, "Modeling purposeful adaptive behavior with the principle of maximum causal entropy," M.S. thesis, Carnegie Mellon Univ., Pittsburgh, PA, USA, 2018, doi: 10.1184/R1/6720692.v1.
[39] D. Ha and J. Schmidhuber, "World models," 2018, arXiv:1803.10122. [Online]. Available: http://arxiv.org/abs/1803.10122
[40] M. Okada, N. Kosaka, and T. Taniguchi, "PlaNet of the Bayesians: Reconsidering and improving deep planning network by incorporating Bayesian inference," 2020, arXiv:2003.00370. [Online]. Available: http://arxiv.org/abs/2003.00370
[41] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[42] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, "Reinforcement learning with deep energy-based policies," in Proc. 34th Int. Conf. Mach. Learn., vol. 70, 2017, pp. 1352–1361.
[43] B. Eysenbach and S. Levine, "If MaxEnt RL is the answer, what is the question?" 2019, arXiv:1910.01913. [Online]. Available: http://arxiv.org/abs/1910.01913
[44] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," 2013, arXiv:1312.6114. [Online]. Available: http://arxiv.org/abs/1312.6114
[45] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, "CARLA: An open urban driving simulator," 2017, arXiv:1711.03938. [Online]. Available: http://arxiv.org/abs/1711.03938
[46] J. Chen, B. Yuan, and M. Tomizuka, "Model-free deep reinforcement learning for urban autonomous driving," in Proc. IEEE Intell. Transp. Syst. Conf. (ITSC), Oct. 2019, pp. 2765–2771.
[47] E. Yurtsever, J. Lambert, A. Carballo, and K. Takeda, "A survey of autonomous driving: Common practices and emerging technologies," 2019, arXiv:1906.05113. [Online]. Available: http://arxiv.org/abs/1906.05113

Jianyu Chen received the bachelor's degree from Tsinghua University in 2015 and the Ph.D. degree from the University of California at Berkeley, Berkeley, in 2020. He was with the University of California at Berkeley, under the supervision of Prof. Masayoshi Tomizuka. Since 2020, he has been an Assistant Professor with the Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University. He works at the intersection of machine learning, robotics and control to build intelligent systems which can efficiently learn safe and reliable sensori-motor control policies. Applications of his work mainly focus on robotic systems such as autonomous driving and industrial robots. His research interests include reinforcement learning, control, deep learning, autonomous driving, and robotics.

Shengbo Eben Li received the M.S. and Ph.D. degrees from Tsinghua University in 2006 and 2009, respectively. He was with Stanford University, the University of Michigan, and UC Berkeley. He is currently with the Intelligent Driving Lab (iDLab), Tsinghua University. His current research interests include intelligent vehicles and driver assistance, reinforcement learning and optimal control, and distributed control and estimation. He has authored more than 100 peer-reviewed journal/conference articles and is the co-inventor of more than 30 patents. He was a recipient of the National Award for Technological Invention of China in 2013, the Best Paper Award at 2014 IEEE ITS, the Best Paper Award at the 14th Asian ITS, the Excellent Young Scholar of NSF China in 2016, the Young Professorship of the Changjiang Scholar Program in 2016, the Tsinghua University Excellent Professorship Award in 2017, the National Award for Progress in Science and Technology of China in 2018, and the Distinguished Young Scholar of Beijing NSF in 2018. He also serves on the Board of Governors of the IEEE ITS Society and as an Associate Editor for the IEEE Intelligent Transportation Systems Magazine (ITSM) and the IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS.

Masayoshi Tomizuka (Life Fellow, IEEE) received the Ph.D. degree in mechanical engineering from MIT in February 1974. In 1974, he joined the Faculty of the Department of Mechanical Engineering, University of California at Berkeley, where he currently holds the Cheryl and John Neerhout, Jr., Distinguished Professorship Chair. He served as the Program Director for the Dynamic Systems and Control Program of the Civil and Mechanical Systems Division of NSF from 2002 to 2004. His current research interests include optimal and adaptive control, digital control, signal processing, motion control, and control problems related to robotics, precision motion control, and vehicles. He is a fellow of the ASME and IFAC. He was a recipient of the Charles Russ Richards Memorial Award (ASME, 1997), the Rufus Oldenburger Medal (ASME, 2002), and the John R. Ragazzini Award in 2006. He served as a Technical Editor for the ASME Journal of Dynamic Systems, Measurement and Control (J-DSMC) from 1988 to 1993, and as Editor-in-Chief of the IEEE/ASME TRANSACTIONS ON MECHATRONICS from 1997 to 1999.