
IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 6, June 2022

Interpretable End-to-End Urban Autonomous Driving With Latent Deep Reinforcement Learning

Jianyu Chen, Shengbo Eben Li, and Masayoshi Tomizuka, Life Fellow, IEEE

Abstract—Unlike the popular modularized framework, end-to-end autonomous driving seeks to solve the perception, decision and control problems in an integrated way, which can be more adaptable to new scenarios and easier to generalize at scale. However, existing end-to-end approaches often lack interpretability and can only deal with simple driving tasks such as lane keeping. In this article, we propose an interpretable deep reinforcement learning method for end-to-end autonomous driving that is able to handle complex urban scenarios. A sequential latent environment model is introduced and learned jointly with the reinforcement learning process. With this latent model, a semantic birdeye mask can be generated, which is enforced to connect with certain intermediate properties of today's modularized framework in order to explain the behaviors of the learned policy. The latent space also significantly reduces the sample complexity of reinforcement learning. Comparison tests in a realistic driving simulator show that the performance of our method in urban scenarios with crowded surrounding vehicles dominates many baselines, including DQN, DDPG, TD3 and SAC. Moreover, through the masked outputs, the learned model is able to provide a better explanation of how the car reasons about the driving environment.

Index Terms—Autonomous driving, deep reinforcement learning, probabilistic graphical model, interpretability.

Manuscript received March 19, 2020; revised July 7, 2020 and November 16, 2020; accepted December 17, 2020. Date of publication February 3, 2021; date of current version May 31, 2022. This work was supported by DENSO International at America. The Associate Editor for this article was B. Fidan. (Corresponding author: Jianyu Chen.)
Jianyu Chen was with the University of California at Berkeley, Berkeley, CA 94720 USA. He is now with the Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China, and also with the Shanghai Qi Zhi Institute, Shanghai 200030, China (e-mail: jianyuchen@tsinghua.edu.cn).
Shengbo Eben Li is with the State Key Lab of Automotive Safety and Energy, School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China (e-mail: [email protected]).
Masayoshi Tomizuka is with the Department of Mechanical Engineering, University of California at Berkeley, Berkeley, CA 94720 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/TITS.2020.3046646

I. INTRODUCTION

MOST of today's autonomous driving systems use a highly modularized, hand-engineered approach comprising, for example, perception, localization, behavior prediction, decision making and motion control [1], [2]. Take the perception module as an example: even though some learning techniques are used, its design still needs tedious hand-engineered work such as selecting representation features for each type of road user. Even though it works well in a few driving tasks, this modularized framework starts to touch its performance limitation in urban driving scenarios because (1) too much human heuristics can lead to conservative driving policies; (2) it is hard to generalize, as we might need to redesign the heuristics for each new scenario and task; and (3) these modules are strongly entangled with each other, and the whole system becomes expensive to scale and maintain.

Those limitations might be avoided with end-to-end autonomous driving approaches, in which a driving policy can be learned and generalized to new tasks without much hand-engineering involvement [3]–[5]. Moreover, the learned policy can be continuously optimized during driving, which makes it possible to achieve superhuman performance. The two main branches of end-to-end autonomous driving are imitation learning (IL) [3], [4], [6], [7], which learns a driving policy by imitating collected expert driving data, and reinforcement learning (RL) [8]–[10], which learns a policy by self-exploration and reinforcement. However, existing end-to-end methods are criticized for two main shortcomings: 1) the learned policies lack interpretability. When an end-to-end policy is learned directly from raw observations to control commands, we cannot explain how it works, since the deep neural network is like a black box; 2) they usually only deal with simple driving tasks such as lane keeping. Urban autonomous driving is much more complex due to highly dynamic road traffic and strong road user interactions. The various urban scenarios and street views significantly increase the sample complexity, making it extremely challenging to learn a good end-to-end driving policy.

This article introduces maximum entropy RL with sequential latent variables to address these problems in end-to-end autonomous driving. The latent space is employed to encode the complex urban driving environment, including visual inputs, spatial features, road conditions and road users' states. Historical high-dimensional raw observations are compressed into this low-dimensional latent space with a sequential latent environment model, which is learned jointly with the reinforcement learning process.

The introduced latent space enables an interpretable explanation of how the policy reasons about the environment by decoding the latent state into a semantic birdeye mask. During training, this mask is enforced to connect with intermediate properties of today's modularized framework, for example, localization & mapping, object detection, and behavior prediction, thus providing an explanation of the learned policy. Meanwhile, the latent space provides a much more compact state representation, which significantly reduces the sample complexity and results in a large performance improvement. We implemented our method to learn an end-to-end driving policy from raw camera and lidar inputs in a realistic driving simulator. Experimental evaluation demonstrates that our method significantly outperforms prior methods in crowded urban scenarios. Examples of decoded semantic birdeye masks are presented to illustrate how our autonomous car understands the driving situations.
II. RELATED WORKS

Recent advances in machine learning enable learning-based end-to-end approaches for autonomous driving. There are two main approaches: imitation learning (IL) and reinforcement learning (RL). IL learns a driving policy from expert driving data [3], [4], [6], [7]. With expert samples as labelled data, a driving policy is often easy to train, and it generally works well in structured driving tasks if one can collect enough expert data. However, there are fundamental limitations to IL: (1) IL is data hungry, and its performance is limited by the level of the expert policy; (2) IL is unable to learn skills that are not provided or are rare in the demonstration data. This makes it difficult to deal with some dangerous scenarios, such as near-collision cases, because they might never be demonstrated by the expert.

Combined with deep learning techniques, RL shows its power in tackling complex decision making and planning problems, bringing a series of breakthroughs in recent years. Agents trained with deep RL techniques achieve super-human-level performance in game playing [11], [12], go playing [13], [14], and robotics [15], [16]. Related deep RL algorithms range from value-based methods such as DQN [11], [12] and double DQN [17], actor-critic based methods such as A3C [18], DDPG [9] and TD3 [19], and policy optimization based methods such as TRPO [20] and PPO [21], to maximum entropy RL methods such as SAC [22], [23]. With RL, a policy can be learned automatically without any expert data. It can explore various kinds of possible cases, including some dangerous ones, and then learn useful skills. It also has the potential to achieve superhuman performance.

Researchers have been trying to apply deep RL to the domain of autonomous driving. Wolf et al. [8] used DQN to learn to steer an autonomous car to stay in the track in simulation. Its action space is discrete and only allows coarse steering angles. Lillicrap et al. [9] proposed a continuous control deep RL algorithm which learned a deep neural network policy that was able to drive the autonomous car on a simulated racing track. Chen et al. [24] proposed a hierarchical deep RL framework to solve driving scenarios with complex decision making, such as traffic light passing. Kendall et al. [10] demonstrated the first application of deep RL to real-world autonomous cars. They learned a deep lane keeping policy using a single front-view camera image as input. There are many other related works not mentioned here. However, existing works either target simple scenarios without complex road conditions and multi-agent interactions, or use manually designed feature representations.

Another problem of learning-based approaches for autonomous driving is that they lack interpretability. The learned deep neural network policy is like a black box, which is not ideal since autonomous driving is a safety-critical application. It is important for us to know whether and how the autonomous car understands the environment. Some works have made efforts in this direction. Bojarski et al. [25] visualized NVIDIA's deep neural network based driving system by extracting the convolutional layer feature maps and highlighting the salient objects. Kim et al. [26] used a visual attention model with a causal filter to visualize the attention heatmap. Sauer et al. [27] analyzed the decision making process of the deep neural network by using gradient-weighted class activation maps to obtain the attention of the network. However, the interpretable information they provide (mostly just which part of the observed image is within attention) is rather weak.

The probabilistic graphical model (PGM) is a generic and powerful tool to formulate many machine learning problems [28]. In autonomous driving research, it is widely used for modeling human driving behaviours [29], [30]. More recently, the sequential latent model [31]–[35] is one of the applications of PGM that is very relevant to this work; it uses a PGM to formulate stochastic time sequence processes with latent variables. Close connections are also found between PGMs and maximum entropy reinforcement learning [36]–[38]. Some recent works propose to integrate sequential latent model learning and reinforcement learning [34], [35], [39], [40]. Such methods show great potential in end-to-end learning of deep policies with high dimensional inputs. However, no prior works have used this branch of techniques to formulate and solve autonomous driving problems. Furthermore, they do not provide interpretability of the learned model, and do not take multiple sources of sensor inputs, which is essential for autonomous driving systems.

III. PGM FOR ENVIRONMENT MODELING AND REINFORCEMENT LEARNING

A. Probabilistic Graphical Model (PGM)

A probabilistic graphical model (PGM) uses a graph to represent conditional dependence between random variables [28]. PGMs are widely used in Bayesian statistics and Bayesian learning. Fig. 1 shows a simple example of a PGM. There are in total 4 nodes A, B, C and D. These nodes represent random variables, which can stand for observable quantities, unobservable latents, or unknown parameters. The edges between nodes represent their conditional dependencies. In Fig. 1, C is conditioned on A and B, while D is conditioned on C. Each edge is associated with a conditional probability, such as p(C|A, B) and p(D|C). With the ability to describe complex causal effects and probabilistic transitions, a PGM can be used as a generic tool to describe probabilistic processes. In this article, we will use PGMs to formulate both the driving environment and the reinforcement learning process.
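To make the factorization that Fig. 1 encodes concrete, the short Python sketch below writes the example graph as a set of conditional probability tables and draws joint samples by ancestral sampling. The numerical probabilities are made-up illustrative values, not taken from this article.

```python
import random

# Illustrative conditional probability tables for the PGM of Fig. 1:
# A and B are root nodes, C depends on (A, B), and D depends on C.
# All numbers are hypothetical, chosen only for the example.
p_A = 0.3                                    # p(A = 1)
p_B = 0.6                                    # p(B = 1)
p_C_given_AB = {(0, 0): 0.1, (0, 1): 0.4,
                (1, 0): 0.5, (1, 1): 0.9}    # p(C = 1 | A, B)
p_D_given_C = {0: 0.2, 1: 0.7}               # p(D = 1 | C)

def joint_probability(a, b, c, d):
    """p(A,B,C,D) = p(A) p(B) p(C|A,B) p(D|C), the factorization the graph encodes."""
    pa = p_A if a else 1.0 - p_A
    pb = p_B if b else 1.0 - p_B
    pc = p_C_given_AB[(a, b)] if c else 1.0 - p_C_given_AB[(a, b)]
    pd = p_D_given_C[c] if d else 1.0 - p_D_given_C[c]
    return pa * pb * pc * pd

def ancestral_sample():
    """Sample (A, B, C, D) by following the directed edges parent-first."""
    a = int(random.random() < p_A)
    b = int(random.random() < p_B)
    c = int(random.random() < p_C_given_AB[(a, b)])
    d = int(random.random() < p_D_given_C[c])
    return a, b, c, d

if __name__ == "__main__":
    total = sum(joint_probability(a, b, c, d)
                for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1))
    print("joint probabilities sum to", round(total, 6))  # sanity check: 1.0
    print("one ancestral sample:", ancestral_sample())
```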
Fig. 1. A simple example of a probabilistic graphical model.

Fig. 2. A PGM for sequential latent environment modeling.

B. PGM for Sequential Latent Environment Modeling

To obtain the optimal policy, it is crucial to accurately model the environment. Most environments by their nature have the following characteristics: (1) High dimensional observations: whether for a human being or an autonomous car, the raw observations are usually high dimensional, such as RGB images. (2) Time sequence probabilistic dynamics: the state of the environment changes with time, thus time sequence relations should be modeled. (3) Partial observability: the observation at the current time alone might not be enough to recover the full state of the environment, so historical information needs to be summarized from historical observations.

Here we introduce a probabilistic sequential latent environment model which satisfies the above characteristics. Similar structures are adopted in recent literature [31], [34], [35]. As shown in Fig. 2, x_t represents the observation at time step t, which can be high dimensional sensor inputs such as RGB images. a_t is the action chosen at t. z_t is the latent state variable at t, which is a description of the current situation summarizing historical information, e.g., the positions, velocities and intentions of other road participants, the drivable areas, and the road markings. The observation x_t is a decoding of the latent state z_t, defined by p(x_t|z_t). The latent state z_t, together with the action a_t, decides the latent state at the next time step, defined by the state transition function p(z_{t+1}|z_t, a_t).

This environment model is quite generic, as there are no restrictions on the formats and physical meanings of the observations, actions, and latent states. Furthermore, the observation decoding function p(x_t|z_t) and the state transition function p(z_{t+1}|z_t, a_t) can be arbitrarily complex, such as deep neural networks.

By introducing an additional filtering function p(z_{t+1}|z_t, x_{t+1}, a_t), the latent state can be inferred in a recursive Bayesian filtering way. Given a new observation x_{t+1}, we have p(z_{t+1}) = \int p(z_{t+1}|z_t, x_{t+1}, a_t) p(z_t) dz_t, where a_t is the action executed at the last time step. The latent state distribution for the first time step is obtained by p(z_1) = p(z_1|x_1). Furthermore, we can make probabilistic predictions by rolling out the future states based on the state transition function:

    p(z_{\tau:\tau+H} | a_{\tau:\tau+H-1}) = p(z_\tau) \prod_{t=\tau}^{\tau+H-1} p(z_{t+1} | z_t, a_t)    (1)

Furthermore, from the latent states we can not only decode to the raw observations, but can also decode to any other representations, such as a semantic mask that provides interpretable explanations.

We can fit the parameters \psi of this PGM from a dataset composed of observation-action trajectory sequences D = {(x^i_{1:\tau}, a^i_{1:\tau})}_{i=1}^{N}, by maximizing the likelihood of the data:

    \max_\psi \sum_{i=1}^{N} \log p(x^i_{1:\tau} | a^i_{1:\tau})    (2)
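The following sketch illustrates, under strong simplifying assumptions (a one-dimensional latent state with hand-picked linear-Gaussian transition and observation models, and a particle approximation of the belief), how the recursive Bayesian filtering update and the prediction rollout of Eq. (1) could be realized. It is a toy stand-in for the learned neural-network model, not the article's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the model components of Fig. 2 (all forms are assumptions):
#   transition  p(z_{t+1} | z_t, a_t) ~ N(0.9 z_t + 0.5 a_t, 0.1^2)
#   observation p(x_t | z_t)          ~ N(z_t, 0.2^2)
def transition_sample(z, a):
    return 0.9 * z + 0.5 * a + 0.1 * rng.standard_normal(z.shape)

def filter_update(particles, x_new, a_prev):
    """One recursive Bayesian filtering step with a particle approximation:
    propagate p(z_t) through the transition, then reweight by p(x_{t+1} | z_{t+1})."""
    proposed = transition_sample(particles, a_prev)
    log_w = -0.5 * ((x_new - proposed) / 0.2) ** 2        # Gaussian observation likelihood
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    idx = rng.choice(len(proposed), size=len(proposed), p=w)  # resample
    return proposed[idx]

def rollout(particles, future_actions):
    """Probabilistic prediction as in Eq. (1): roll the transition forward, no observations."""
    trajectory, z = [], particles
    for a in future_actions:
        z = transition_sample(z, a)
        trajectory.append(z.mean())      # summarize the predicted latent distribution
    return trajectory

# Filter over a short observed sequence, then predict a few future steps.
particles = rng.standard_normal(500)     # samples of p(z_1 | x_1), assumed standard normal
for x_t, a_t in [(0.4, 1.0), (0.9, 1.0), (1.3, 0.0)]:
    particles = filter_update(particles, x_t, a_t)
print("filtered mean of z:", particles.mean())
print("predicted latent means:", rollout(particles, future_actions=[0.0, 0.0, 1.0]))
```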
C. PGM for Reinforcement Learning

Under the settings of reinforcement learning [41], at each time step an agent observes the state z_t, executes an action a_t generated by its policy a_t ~ \pi(a_t|z_t), and then receives the reward r(z_t, a_t). The state is then updated according to the state transition z_{t+1} ~ p(z_{t+1}|z_t, a_t). Assume there are H time steps in an episode and the initial state is generated by z_1 ~ p(z_1); then the objective of reinforcement learning is to find a policy that optimizes the expected accumulated rewards:

    \pi^* = \arg\max_\pi E_{z_1 \sim p(z_1),\, a_t \sim \pi(a_t|z_t),\, z_{t+1} \sim p(z_{t+1}|z_t, a_t)} \left[ \sum_{t=1}^{H} r(z_t, a_t) \right]    (3)

Note that here we do not explicitly write the discount factor \gamma in the accumulated rewards; instead we incorporate the discount factor by modifying the state transition model [36]. If the initial state transitions are given by p(z_{t+1}|z_t, a_t), adding a discount factor is equivalent to the undiscounted problem under the modified state transitions \bar{p}(z_{t+1}|z_t, a_t) = \gamma p(z_{t+1}|z_t, a_t), where there is an additional transition with probability 1 - \gamma, regardless of action, into an absorbing state with reward zero. The discount factor allows convergence of the value function in infinite-horizon settings. Without loss of generality, we will omit \gamma from the PGM-related derivations in this article, but it can be inserted trivially in all cases simply by modifying the state transition models as mentioned above. The discount factor is revisited as an explicit consideration in our reinforcement learning algorithm implementation in Section V-C.

Maximum entropy reinforcement learning (MaxEnt RL) [22], [36], [42] modifies the above standard RL by adding an entropy regularization term H(\pi(a_t|z_t)) = -\log \pi(a_t|z_t) to the reward. Now consider using a parametric function as the policy \pi_\phi, for example a deep neural network with weights \phi; then the objective of MaxEnt RL can be written as:

    \phi^* = \arg\max_\phi E_{z_1 \sim p(z_1),\, a_t \sim \pi_\phi(a_t|z_t),\, z_{t+1} \sim p(z_{t+1}|z_t, a_t)} \left[ \sum_{t=1}^{H} r(z_t, a_t) - \log \pi_\phi(a_t|z_t) \right]    (4)

There are several reasons why we would like to use MaxEnt RL instead of standard RL [43]. First, it performs better exploration. Standard RL requires specific exploration strategies such as adding noise to the policy. MaxEnt RL, however, has a stochastic policy by default, so the policy itself includes the exploration strategy, which is optimized during RL training. In practice, the performance of MaxEnt RL is usually better and more robust than standard RL algorithms.

Fig. 3. A PGM for maximum entropy reinforcement learning.

Second, MaxEnt RL can be interpreted as learning a PGM. As shown in Fig. 3, z_t represents the state, a_t is the action, and O_t is a binary random variable. The use of O_t is to indicate whether the agent is acting optimally at time step t. Its conditional probability is defined by:

    p(O_t = 1 | z_t, a_t) = \exp(r(z_t, a_t))    (5)

thus a higher reward indicates higher optimality. Therefore, to make the agent act optimally, we want to maximize the probability of optimality over the whole trajectory, p(O_{1:H}). Let us now look at its log likelihood:

    \log p(O_{1:H}) = \log \int p(O_{1:H}, z_{1:H}, a_{1:H}) \, dz_{1:H} \, da_{1:H}
                    = \log \int p(O_{1:H}, z_{1:H}, a_{1:H}) \frac{q(z_{1:H}, a_{1:H})}{q(z_{1:H}, a_{1:H})} \, dz_{1:H} \, da_{1:H}
                    = \log E_{q(z_{1:H}, a_{1:H})} \left[ \frac{p(O_{1:H}, z_{1:H}, a_{1:H})}{q(z_{1:H}, a_{1:H})} \right]
                    \ge E_{q(z_{1:H}, a_{1:H})} \left[ \log p(O_{1:H}, z_{1:H}, a_{1:H}) - \log q(z_{1:H}, a_{1:H}) \right]    (6)

The above inequality is obtained by introducing a variational distribution q(z_{1:H}, a_{1:H}) and then applying Jensen's inequality. The variational distribution should be the trajectory distribution generated by the current policy \pi(a_t|z_t):

    q(z_{1:H}, a_{1:H}) = p(z_1) \, \pi(a_H | z_H) \prod_{t=1}^{H-1} p(z_{t+1} | z_t, a_t) \, \pi(a_t | z_t)    (7)

The optimality distribution of the trajectory is:

    p(O_{1:H}, z_{1:H}, a_{1:H}) = p(O_{1:H} | z_{1:H}, a_{1:H}) \, p(z_{1:H}, a_{1:H})
                                 = \exp\left( \sum_{t=1}^{H} r(z_t, a_t) \right) p(z_1) \prod_{t=1}^{H-1} p(z_{t+1} | z_t, a_t)    (8)

By cancellation of repeated terms, the inequality (6) becomes:

    \log p(O_{1:H}) \ge E_{q(z_{1:H}, a_{1:H})} \left[ \sum_{t=1}^{H} r(z_t, a_t) - \log \pi(a_t | z_t) \right]    (9)

Note that we can indirectly maximize the left side by maximizing the right side, and the right side of the inequality is exactly the objective of MaxEnt RL. This means we can use MaxEnt RL to maximize the likelihood of the optimality variables in the PGM of Fig. 3. In this sense, the reinforcement learning problem is reformulated into a learning problem for the PGM shown in Fig. 3.

IV. INTERPRETABLE END-TO-END URBAN AUTONOMOUS DRIVING

A. PGM for Interpretable Urban Autonomous Driving

There are two main building blocks for urban autonomous driving. The first is the perception and recognition module, which helps the autonomous car understand the current driving situation, such as where the ego vehicle is, what the road condition is, and where the surrounding road participants are. Furthermore, it needs to be able to reason about what will happen in the future, such as where the ego car and the surrounding road participants will go. This information should be obtained from the historical high-dimensional raw sensor inputs. The second module is planning and control, which helps the autonomous car decide what actions to take.

Using the methods described in Section III, the above two building blocks can be formulated by two PGMs separately, and it is natural to combine the two PGMs into a single one. Inspired by recent works that combine latent representation learning and reinforcement learning [34], [35], [39], we present a PGM for urban autonomous driving, which is shown in Fig. 4. Consistent with the notation in Section III, z_t represents the latent state, a_t the action, O_t the optimality variable, and x_t the sensor inputs. Note that here we allow sensor inputs from multiple sources.

We also have a newly introduced variable, m_t, which we call the mask. It contains semantic meanings of the environment in a human understandable way. Details about this mask are described in Section IV-B. The main purpose of the mask is to provide interpretability for the system. At training time we need to provide the ground truth labels of the mask, but at test time the mask can be decoded from the latent state, showing how the system understands the environment semantically.

Fig. 4. A PGM for interpretable end-to-end urban autonomous driving.

After learning the PGM in Fig. 4, the following models can be obtained:
1) Policy p(a_t|z_t): Given the latent state, the policy tells how to choose the action.
2) Inference p(z_{t+1}|x_{1:t+1}, a_{1:t}): With historical sensor inputs and actions, the inference model infers the current latent state.
3) Latent Dynamics p(z_{t+1}|z_t, a_t): This helps predict the future states.
4) Generative Models p(x_t|z_t), p(m_t|z_t): p(x_t|z_t) decodes the latent state z_t to the raw sensor inputs x_t, showing how much information the latent state captures. p(m_t|z_t) generates the semantic mask m_t to provide interpretability.

The whole PGM can be trained end-to-end. After training, an intelligent driving agent containing an interpretable environment model and a driving policy is obtained. As shown in Fig. 5, the agent takes multiple-source sensor inputs from the driving environment and then outputs control commands to drive the car in urban scenarios. In the meantime, the agent generates a semantic mask to interpret how it understands the current driving situation.

Fig. 5. The interpretable end-to-end urban autonomous driving agent.
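As a rough illustration of how the four learned models could be wired together at test time, the sketch below uses randomly initialized linear maps as hypothetical stand-ins for the neural networks (the actual architectures are described in Section VI-B). It only shows the data flow of Fig. 5: act from the latent state, filter in a new observation, and decode the semantic mask for interpretability.

```python
import numpy as np

rng = np.random.default_rng(1)
LATENT_DIM, ACTION_DIM, OBS_DIM, MASK_DIM = 8, 2, 16, 16  # toy sizes, not the paper's

# Hypothetical linear stand-ins for the learned networks of Fig. 4.
W_policy = rng.normal(scale=0.1, size=(ACTION_DIM, LATENT_DIM))
W_dynamics = rng.normal(scale=0.1, size=(LATENT_DIM, LATENT_DIM + ACTION_DIM))
W_filter = rng.normal(scale=0.1, size=(LATENT_DIM, LATENT_DIM + OBS_DIM + ACTION_DIM))
W_obs_dec = rng.normal(scale=0.1, size=(OBS_DIM, LATENT_DIM))
W_mask_dec = rng.normal(scale=0.1, size=(MASK_DIM, LATENT_DIM))

def policy(z):
    """p(a_t | z_t): a squashed stochastic action, SAC-style."""
    return np.tanh(W_policy @ z + 0.1 * rng.standard_normal(ACTION_DIM))

def latent_dynamics(z, a):
    """p(z_{t+1} | z_t, a_t): predicts the next latent state."""
    return W_dynamics @ np.concatenate([z, a]) + 0.05 * rng.standard_normal(LATENT_DIM)

def filter_step(z, x_next, a):
    """q(z_{t+1} | z_t, x_{t+1}, a_t): folds a new observation into the latent belief."""
    return W_filter @ np.concatenate([z, x_next, a])

def decode(z):
    """p(x_t | z_t) and p(m_t | z_t): reconstruct sensors and the semantic mask."""
    return W_obs_dec @ z, W_mask_dec @ z

# One interaction step of the agent in Fig. 5: act, observe, update belief, explain.
z = np.zeros(LATENT_DIM)
a = policy(z)
x_next = rng.standard_normal(OBS_DIM)          # stands in for camera + lidar features
z = filter_step(z, x_next, a)
x_hat, mask_hat = decode(z)
print("action:", a)
print("mask decoding (interpretability output), first values:", mask_hat[:4])
```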
B. Sensor Inputs and Mask

We use two sensors to provide the observations: camera and lidar. For the camera, the sensor input is a front-view RGB image, which can be represented by a tensor in R^{64x64x3}. For the lidar, we project the point clouds onto the ground plane and render them into a 2D lidar image. The lidar image is also represented by a tensor in R^{64x64x3}, with each pixel rendered in red or green depending on whether there are lidar points at or above ground level in the corresponding pixel cell. The desired route, constituted of waypoints, is rendered in blue.

We use camera and lidar together because they are both important sensor sources and provide complementary information. Lidar point clouds provide accurate spatial information of other road participants and obstacles in a 360-degree view, while the front-view camera is good at providing information about the road conditions.

The semantic mask provides bird-view semantics of the road conditions and objects, and is represented by a tensor in R^{64x64x3}. As shown in Fig. 6, the mask is composed of the following four parts:
1) Map: The map contains information about road conditions. Drivable areas and lane markings are rendered in the map.
2) Routing: Routing contains the information of waypoints, which is provided by a route planner. It is rendered as a thick blue polyline.
3) Detected Objects: Historical bounding boxes of detected surrounding road participants (e.g., vehicles, bicycles and pedestrians) are rendered as green boxes.
4) Ego State: The bounding box of the ego vehicle is rendered as a red box.

Fig. 6. The bird-view semantic mask for urban autonomous driving.
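A minimal sketch of how such a bird-view mask could be rasterized is given below. The routing, detected-object and ego colors follow the description above; the gray used for the drivable area and all geometry values are assumptions made only for this example.

```python
import numpy as np

H = W = 64  # mask resolution used in the article

def draw_box(mask, center, size, color):
    """Fill an axis-aligned box given a (row, col) center and (height, width) size."""
    r, c = center
    h, w = size
    mask[max(r - h // 2, 0):min(r + h // 2, H), max(c - w // 2, 0):min(c + w // 2, W)] = color

def render_mask(drivable_rows, route_cols, objects, ego):
    """Compose the four mask layers: map, routing, detected objects, ego state."""
    mask = np.zeros((H, W, 3), dtype=np.uint8)
    mask[drivable_rows, :, :] = (80, 80, 80)                 # assumed gray drivable area
    for r in range(H):                                       # thick blue route polyline
        for c in route_cols:
            mask[r, max(c - 1, 0):min(c + 2, W)] = (0, 0, 255)
    for center, size in objects:
        draw_box(mask, center, size, (0, 255, 0))            # surrounding vehicles: green
    draw_box(mask, ego[0], ego[1], (255, 0, 0))              # ego vehicle: red
    return mask

mask = render_mask(
    drivable_rows=slice(20, 44),                             # a straight road band
    route_cols=[32],
    objects=[((28, 40), (6, 4)), ((36, 20), (6, 4))],
    ego=((32, 32), (8, 4)),
)
print("mask shape:", mask.shape, "dtype:", mask.dtype)
```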
V. JOINT LEARNING OF ENVIRONMENT MODEL AND DRIVING POLICY

A. Variational Inference for Joint Model Learning and Policy Learning

The environment model and the driving policy can be learned jointly by learning the PGM shown in Fig. 4. For convenience, we first introduce some notation. Denote a trajectory composed of sensor inputs, masks, actions and rewards as:

    \bar{x} = x_{1:\tau+1}, \quad \bar{m} = m_{1:\tau+1}, \quad \bar{a} = a_{1:\tau}, \quad \bar{r} = r_{1:\tau}    (10)

The dataset, which comes from the replay buffer that collects the online exploration experiences during the reinforcement learning process (see details in Section VI), is then written as D = {(\bar{x}^i, \bar{m}^i, \bar{a}^i, \bar{r}^i)}_{i=1}^{N}. We further denote:

    \bar{z} = z_{1:\tau+1}, \quad \bar{z}^w = z_{1:\tau+H}, \quad \bar{z}^p = z_{\tau+1:\tau+H}, \quad \bar{O}^p = O_{\tau+1:\tau+H}, \quad \bar{a}^p = a_{\tau+1:\tau+H}    (11)

where the superscript "p" stands for "post" and "w" stands for "whole". The learning objective is to maximize the log likelihood of the sensor inputs, the masks and the optimality variables:

    \log \prod_{(\bar{x}, \bar{m}, \bar{a}, \bar{r}) \in D} p(\bar{x}, \bar{m}, \bar{O}^p | \bar{a}) = \sum_{(\bar{x}, \bar{m}, \bar{a}, \bar{r}) \in D} \log p(\bar{x}, \bar{m}, \bar{O}^p | \bar{a})    (12)

This can be maximized by stochastic gradient descent (SGD), which optimizes parametric functions by gradient descent, with the gradient estimated by sampling a batch of data points. To make SGD applicable to our problem, p(\bar{x}, \bar{m}, \bar{O}^p | \bar{a}) needs to be represented by parametric functions; then auto-differentiation tools (e.g., TensorFlow) can be used to calculate its gradient. We use variational inference [44] to compute this log likelihood. We first introduce the latent variables \bar{z}^w and \bar{a}^p:

    \log p(\bar{x}, \bar{m}, \bar{O}^p | \bar{a}) = \log \int p(\bar{x}, \bar{m}, \bar{O}^p, \bar{z}^w, \bar{a}^p | \bar{a}) \, d\bar{z}^w \, d\bar{a}^p    (13)

Then we introduce a variational distribution q(\bar{z}^w, \bar{a}^p | \bar{x}, \bar{a}) into (13):

    \log p(\bar{x}, \bar{m}, \bar{O}^p | \bar{a}) = \log \int p(\bar{x}, \bar{m}, \bar{O}^p, \bar{z}^w, \bar{a}^p | \bar{a}) \frac{q(\bar{z}^w, \bar{a}^p | \bar{x}, \bar{a})}{q(\bar{z}^w, \bar{a}^p | \bar{x}, \bar{a})} \, d\bar{z}^w \, d\bar{a}^p    (14)

The variational distribution is defined as:

    q(\bar{z}^w, \bar{a}^p | \bar{x}, \bar{a}) = q(\bar{z} | \bar{x}, \bar{a}) \, \pi(a_{\tau+H} | z_{\tau+H}) \prod_{t=\tau+1}^{\tau+H-1} p(z_{t+1} | z_t, a_t) \, \pi(a_t | z_t)    (15)

where q(\bar{z} | \bar{x}, \bar{a}) is the posterior of the latent states given the historical sensor inputs and actions. The rest of the right-hand side represents the trajectory distribution obtained by executing the policy \pi(a_t|z_t) with the latent state transition p(z_{t+1}|z_t, a_t).

Now eliminate the integration in (14) by introducing an expectation, and apply Jensen's inequality:

    \log p(\bar{x}, \bar{m}, \bar{O}^p | \bar{a}) = \log E_{q(\bar{z}^w, \bar{a}^p | \bar{x}, \bar{a})} \left[ \frac{p(\bar{x}, \bar{m}, \bar{O}^p, \bar{z}^w, \bar{a}^p | \bar{a})}{q(\bar{z}^w, \bar{a}^p | \bar{x}, \bar{a})} \right]
        \ge E_{q(\bar{z}^w, \bar{a}^p | \bar{x}, \bar{a})} \left[ \log p(\bar{x}, \bar{m}, \bar{O}^p, \bar{z}^w, \bar{a}^p | \bar{a}) - \log q(\bar{z}^w, \bar{a}^p | \bar{x}, \bar{a}) \right] = \mathrm{ELBO}    (16)

where "ELBO" stands for the evidence lower bound. We can maximize the original log likelihood by maximizing the ELBO. Let us now derive p(\bar{x}, \bar{m}, \bar{O}^p, \bar{z}^w, \bar{a}^p | \bar{a}) by probability factorization according to the PGM in Fig. 4:

    p(\bar{x}, \bar{m}, \bar{O}^p, \bar{z}^w, \bar{a}^p | \bar{a}) = p(\bar{x}, \bar{m}, \bar{O}^p, z_{\tau+2:\tau+H}, \bar{a}^p | \bar{z}, \bar{a}) \, p(\bar{z} | \bar{a})
        = p(\bar{x} | \bar{z}) \, p(\bar{m} | \bar{z}) \, p(\bar{O}^p, z_{\tau+2:\tau+H}, \bar{a}^p | z_{\tau+1}) \, p(\bar{z} | \bar{a})
        = p(\bar{x} | \bar{z}) \, p(\bar{m} | \bar{z}) \, \frac{p(\bar{O}^p, \bar{z}^p, \bar{a}^p)}{p(z_{\tau+1})} \, p(\bar{z} | \bar{a})    (17)

According to the soft optimality assumption:

    p(\bar{O}^p, \bar{z}^p, \bar{a}^p) = p(\bar{z}^p, \bar{a}^p) \, p(\bar{O}^p | \bar{z}^p, \bar{a}^p)
        = p(\bar{a}^p) \, p(z_{\tau+1}) \prod_{t=\tau+1}^{\tau+H-1} p(z_{t+1} | z_t, a_t) \, \exp\left( \sum_{t=\tau+1}^{\tau+H} r(z_t, a_t) \right)    (18)

We thus have:

    p(\bar{x}, \bar{m}, \bar{O}^p, \bar{z}^w, \bar{a}^p | \bar{a}) = p(\bar{x} | \bar{z}) \, p(\bar{m} | \bar{z}) \, p(\bar{a}^p) \prod_{t=\tau+1}^{\tau+H-1} p(z_{t+1} | z_t, a_t) \, \exp\left( \sum_{t=\tau+1}^{\tau+H} r(z_t, a_t) \right) p(\bar{z} | \bar{a})    (19)

Substituting the variational distribution (15) into (16), we have:

    \mathrm{ELBO} = E_{q(\bar{z}^w, \bar{a}^p | \bar{x}, \bar{a})} \Big[ \log p(\bar{x} | \bar{z}) + \log p(\bar{m} | \bar{z}) + \log p(\bar{z} | \bar{a}) + \sum_{t=\tau+1}^{\tau+H-1} \log p(z_{t+1} | z_t, a_t) + \sum_{t=\tau+1}^{\tau+H} r(z_t, a_t)
        - \log q(\bar{z} | \bar{x}, \bar{a}) - \sum_{t=\tau+1}^{\tau+H} \log \pi(a_t | z_t) - \sum_{t=\tau+1}^{\tau+H-1} \log p(z_{t+1} | z_t, a_t) + \log p(\bar{a}^p) \Big]    (20)

Noticing the cancellations in (20), we have:

    \mathrm{ELBO} = E_{q(\bar{z}^w, \bar{a}^p | \bar{x}, \bar{a})} \left[ \log p(\bar{x} | \bar{z}) + \log p(\bar{m} | \bar{z}) + \log p(\bar{z} | \bar{a}) - \log q(\bar{z} | \bar{x}, \bar{a}) \right]
        + E_{q(\bar{z}^w, \bar{a}^p | \bar{x}, \bar{a})} \left[ \sum_{t=\tau+1}^{\tau+H} \big( r(z_t, a_t) - \log \pi(a_t | z_t) + \log p(a_t) \big) \right]    (21)

The first part of the right-hand side of (21) corresponds to learning the environment model, while the second part corresponds to learning the driving policy. We derive the details of the two parts in Sections V-B and V-C, respectively.

B. Environment Model Learning

The environment model can be learned by optimizing the first part of (21):

    E_{q(\bar{z} | \bar{x}, \bar{a})} \left[ \log p(\bar{x} | \bar{z}) + \log p(\bar{m} | \bar{z}) + \log p(\bar{z} | \bar{a}) - \log q(\bar{z} | \bar{x}, \bar{a}) \right]    (22)

where we replace E_{q(\bar{z}^w, \bar{a}^p | \bar{x}, \bar{a})} with E_{q(\bar{z} | \bar{x}, \bar{a})} because this part of the ELBO is only related to z_{1:\tau+1}. Now let us further derive the components in (22) by unfolding them over time, considering the conditional dependence of the PGM in Fig. 4. The generative models can be unfolded as:

    \log p(\bar{x} | \bar{z}) = \log \prod_{t=1}^{\tau+1} p(x_t | z_t) = \sum_{t=1}^{\tau+1} \log p(x_t | z_t), \qquad
    \log p(\bar{m} | \bar{z}) = \log \prod_{t=1}^{\tau+1} p(m_t | z_t) = \sum_{t=1}^{\tau+1} \log p(m_t | z_t)    (23)

The prior model can be unfolded using the latent state transition function:

    \log p(\bar{z} | \bar{a}) = \log \left[ p(z_1) \prod_{t=1}^{\tau} p(z_{t+1} | z_t, a_t) \right] = \log p(z_1) + \sum_{t=1}^{\tau} \log p(z_{t+1} | z_t, a_t)    (24)

The posterior inference model can be unfolded as:

    \log q(\bar{z} | \bar{x}, \bar{a}) = \log \left[ q(z_1 | \bar{x}, \bar{a}) \prod_{t=1}^{\tau} q(z_{t+1} | z_t, \bar{x}, \bar{a}) \right]
        \approx \log q(z_1 | x_1) + \sum_{t=1}^{\tau} \log q(z_{t+1} | z_t, x_{t+1}, a_t)    (25)

Note that here we approximate q(z_1 | \bar{x}, \bar{a}) and q(z_{t+1} | z_t, \bar{x}, \bar{a}) with q(z_1 | x_1) and q(z_{t+1} | z_t, x_{t+1}, a_t) for simplicity. If we want to obtain the exact values, bi-directional recurrent neural networks should be used to obtain the posterior probabilities conditioned on the whole trajectory sequence (\bar{x}, \bar{a}) [31].

We can now unfold (22) over time:

    E_{q(\bar{z} | \bar{x}, \bar{a})} \left[ \log p(\bar{x} | \bar{z}) + \log p(\bar{m} | \bar{z}) + \log p(\bar{z} | \bar{a}) - \log q(\bar{z} | \bar{x}, \bar{a}) \right]
        \approx E_{q(\bar{z} | \bar{x}, \bar{a})} \Big[ \sum_{t=1}^{\tau+1} \log p(x_t | z_t) + \sum_{t=1}^{\tau+1} \log p(m_t | z_t) - D_{KL}\big( q(z_1 | x_1) \,\|\, p(z_1) \big)
        - \sum_{t=1}^{\tau} D_{KL}\big( q(z_{t+1} | z_t, x_{t+1}, a_t) \,\|\, p(z_{t+1} | z_t, a_t) \big) \Big]    (26)
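Assuming diagonal-Gaussian posteriors and priors for the latent states and Gaussian reconstruction terms (consistent with the Gaussian output layers described in Section VI-B, though the exact implementation is not specified here), the model-learning objective of Eq. (26) could be estimated for one sampled latent trajectory as in the sketch below.

```python
import numpy as np

def gaussian_log_prob(x, mean, std):
    """Elementwise log N(x; mean, std), summed over the last axis."""
    return np.sum(-0.5 * ((x - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi), axis=-1)

def gaussian_kl(mu_q, std_q, mu_p, std_p):
    """KL( N(mu_q, std_q) || N(mu_p, std_p) ) for diagonal Gaussians, summed over the last axis."""
    return np.sum(np.log(std_p / std_q) + (std_q ** 2 + (mu_q - mu_p) ** 2) / (2 * std_p ** 2) - 0.5, axis=-1)

def model_objective(x, m, x_hat, m_hat, q_mus, q_stds, p_mus, p_stds, recon_std=1.0):
    """Monte-Carlo estimate of the model-learning part of the ELBO, Eq. (26):
    observation and mask reconstruction log-likelihoods over t = 1..tau+1,
    minus the KL terms between posterior and prior latent distributions."""
    recon = gaussian_log_prob(x, x_hat, recon_std).sum() + gaussian_log_prob(m, m_hat, recon_std).sum()
    kl = gaussian_kl(q_mus, q_stds, p_mus, p_stds).sum()
    return recon - kl   # maximize this (equivalently minimize its negative)

# Toy shapes: tau+1 = 4 time steps, 16-dim flattened observations, 8-dim latent.
rng = np.random.default_rng(2)
T, OBS, LAT = 4, 16, 8
x, m = rng.standard_normal((T, OBS)), rng.standard_normal((T, OBS))
x_hat, m_hat = x + 0.1 * rng.standard_normal((T, OBS)), m + 0.1 * rng.standard_normal((T, OBS))
q_mus, q_stds = rng.standard_normal((T, LAT)), np.full((T, LAT), 0.5)
p_mus, p_stds = np.zeros((T, LAT)), np.ones((T, LAT))   # row 0 plays the role of p(z_1)
print("model-learning objective:", model_objective(x, m, x_hat, m_hat, q_mus, q_stds, p_mus, p_stds))
```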
C. Driving Policy Learning

The driving policy can be learned by optimizing the second part of (21):

    \max_\phi \; E_{q(\bar{z}^p, \bar{a}^p | \bar{x}, \bar{a})} \left[ \sum_{t=\tau+1}^{\tau+H} r(z_t, a_t) - \log \pi_\phi(a_t | z_t) + \log p(a_t) \right]
        = E_{z_{\tau+1} \sim p(z_{\tau+1} | \bar{x}, \bar{a}),\, a_t \sim \pi_\phi(a_t | z_t),\, z_{t+1} \sim p(z_{t+1} | z_t, a_t)} \left[ \sum_{t=\tau+1}^{\tau+H} r(z_t, a_t) - \log \pi_\phi(a_t | z_t) \right]    (27)

where \log p(a_t) is ignored since we assume a uniform action prior. The optimization problem (27) then becomes a standard MaxEnt RL problem.

We use soft actor-critic (SAC) [22] to solve this MaxEnt RL problem. SAC is a function approximation version of soft policy iteration (SPI). SPI is an extension of standard policy iteration to the maximum entropy case, which iteratively applies the soft policy evaluation:

    T^\pi Q(z_t, a_t) = r(z_t, a_t) + \gamma \, E_{z_{t+1} \sim p} \left[ E_{a_{t+1} \sim \pi} \left[ Q(z_{t+1}, a_{t+1}) - \log \pi(a_{t+1} | z_{t+1}) \right] \right]    (28)

and the soft policy improvement:

    \pi_{new} = \arg\min_\pi \; D_{KL}\left( \pi(\cdot | z_t) \,\Big\|\, \frac{\exp(Q^{\pi_{old}}(z_t, \cdot))}{Z^{\pi_{old}}(z_t)} \right)    (29)

where Z^{\pi_{old}}(z_t) is the normalization term.

The function approximation implementation optimizes loss functions that address the soft policy evaluation and the soft policy improvement. The loss functions are the Bellman residual in (28):

    J_Q = E_{z_\tau \sim q(\bar{z} | \bar{x}, \bar{a})} \left[ \frac{1}{2} \left( Q(z_\tau, a_\tau) - \hat{Q}(z_\tau, a_\tau) \right)^2 \right]    (30)

and the KL divergence in (29):

    J_\pi = E_{z_{\tau+1} \sim q(\bar{z} | \bar{x}, \bar{a}),\, a_{\tau+1} \sim \pi(a_{\tau+1} | z_{\tau+1})} \left[ \log \pi(a_{\tau+1} | z_{\tau+1}) - Q(z_{\tau+1}, a_{\tau+1}) \right]    (31)

Note that

    \hat{Q}(z_\tau, a_\tau) = r_\tau + \gamma \, E_{z_{\tau+1} \sim q(\bar{z} | \bar{x}, \bar{a}),\, a_{\tau+1} \sim \pi(a_{\tau+1} | z_{\tau+1})} \left[ \bar{Q}(z_{\tau+1}, a_{\tau+1}) - \log \pi(a_{\tau+1} | z_{\tau+1}) \right]    (32)

where \bar{Q} is a delayed Q network.

Thus, the joint learning algorithm uses SGD to maximize the model learning part of the ELBO in (26) and to minimize J_Q in (30) and J_\pi in (31).
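The sketch below shows how Monte-Carlo estimates of the critic loss J_Q in (30), its target (32) and the policy loss J_\pi in (31) could be assembled on a batch of latent transitions. The tanh-squashed Gaussian policy and the linear critic are toy stand-ins for the networks of Section VI-B, introduced here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
GAMMA = 0.99

# Hypothetical stand-ins for the critic, the delayed (target) critic and the policy.
W_q = rng.normal(scale=0.1, size=(8 + 2,))
W_q_delayed = W_q.copy()

def q_value(weights, z, a):
    return np.concatenate([z, a], axis=-1) @ weights

def sample_action_and_logp(z):
    """Tanh-squashed Gaussian policy sample with its log-probability (toy parameterization)."""
    mean, log_std = 0.1 * z[..., :2], -1.0
    std = np.exp(log_std)
    eps = rng.standard_normal(mean.shape)
    a = np.tanh(mean + std * eps)
    logp = np.sum(-0.5 * eps ** 2 - log_std - 0.5 * np.log(2 * np.pi)
                  - np.log(1 - a ** 2 + 1e-6), axis=-1)   # tanh change-of-variables term
    return a, logp

def sac_losses(z, a, r, z_next):
    """Monte-Carlo estimates of J_Q (Eq. 30) and J_pi (Eq. 31) on a batch of latent transitions."""
    a_next, logp_next = sample_action_and_logp(z_next)
    q_target = r + GAMMA * (q_value(W_q_delayed, z_next, a_next) - logp_next)   # Eq. (32)
    j_q = 0.5 * np.mean((q_value(W_q, z, a) - q_target) ** 2)                    # Eq. (30)
    a_pi, logp_pi = sample_action_and_logp(z)
    j_pi = np.mean(logp_pi - q_value(W_q, z, a_pi))                              # Eq. (31)
    return j_q, j_pi

batch = 32
z, z_next = rng.standard_normal((batch, 8)), rng.standard_normal((batch, 8))
a, r = np.tanh(rng.standard_normal((batch, 2))), rng.standard_normal(batch)
print("J_Q, J_pi:", sac_losses(z, a, r, z_next))
```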
VI. EXPERIMENTS

A. Simulation Setup

We train and evaluate our proposed method in the CARLA simulator [45]. CARLA is a high-definition open-source simulation platform for autonomous driving research. It simulates not only the driving environment and vehicle dynamics, but also the raw sensor inputs such as camera RGB images and lidar point clouds, using rendering and ray-casting techniques. Fig. 7(a) shows a sample view of the driving simulation environment we use.

Fig. 7. Simulation environment.

Fig. 7(b) shows the map layout of the virtual town in CARLA that we use for training. It includes various urban scenarios such as intersections and roundabouts. The range of the map is 400 m x 400 m, with about 6 km total length of roads. 100 vehicles run autonomously in the virtual town to simulate a multi-agent environment. The vehicles randomly choose a direction at intersections and then follow the route, while slowing down for front vehicles and stopping when the front traffic light turns red.

B. Implementation Details

1) Reward Function: We use the following reward function in our experiments:

    r = 200 \, r_{collision} + v_{lon} + 10 \, r_{fast} + r_{out} - 5 \alpha^2 + 0.2 \, r_{lat} - 0.1    (33)

where r_{collision} is the term related to collision, which is set to -1 if the ego vehicle collides and 0 otherwise. v_{lon} is the speed of the ego vehicle. r_{fast} is the term related to running too fast; it is set to -1 if the speed exceeds the desired speed (8 m/s here) and 0 otherwise. r_{out} is set to -1 if the ego vehicle runs out of the lane, and 0 otherwise. \alpha is the steering angle of the ego vehicle in rad. r_{lat} is the term related to lateral acceleration, computed as r_{lat} = -|\alpha| v_{lon}^2. The last constant term is added to prevent the ego vehicle from standing still.

training process, without storing any HD maps or manually


designing any localization algorithms.
On the other hand, object detection is of fundamental
importance for autonomous driving, as failing to detect road
participants and obstacles might lead to serious incidents. The
environment model obtained in our method also has the ability
to detect surrounding vehicles by fusing camera and lidar
sensor inputs.
Fig.9 shows some sampled frames of the sensor inputs,
ground truth masks, and reconstructions when running with
the learned model and policy. For each sample, the first row
contains the raw sensor inputs and ground truth mask (left to
right: camera, lidar, bird-view mask). The second row contains
the corresponding reconstructed images from the latent state.
Note here only the raw sensor inputs are observed, the ground
truth bird-view image is displayed only for comparison. From
Fig. 8. Comparison of learning curves with baseline RL algorithms. Average
returns calculated with 5 trials, each with 10 episodes. Shaded area indicates the reconstructed bird-view mask, we can see that it can
standard deviation. accurately locate the ego car and decode the map information
(e.g, drivable areas and road markings), even though there is
no direct information from the raw sensor inputs indicating
To make a fair comparison, we use the same encoding whether the ego car is in an intersection or an roundabout.
networks with our proposed method for those baseline algo- We can also see that our model can accurately detect the
rithms, but now without decoders. We use recurrent neural surrounding vehicles (green boxes) given raw camera and lidar
networks (RNN), since our proposed method also considers observations.
time sequence. The type of RNN we use is long short term
memory (LSTM), with LSTM size of 40 and output size of
100. B. Quantified Evaluation
We quantify the interpretability of our method by calculat-
C. Evaluation Results ing the average pixel difference between the decoded masks
and the ground truth masks with massive simulation tests in
The performance comparison is shown in Fig.8. We draw
the virtual city. The metric is defined as:
the learning curves composed of average returns (the average
N  
discounted cumulative rewards of multiple testing episodes) 1 sum |m̂ i − m i |
vs environment steps. We can see that all variants of our e= (34)
N W ×H ×C
proposed method are significantly better than the baselines. i=1
Actually, most baselines almost do not work at all. Note that where m̂ i is the predicted mask, m i is the ground truth mask,
our baselines implemented here are already better than existing N is the number of samples we evaluate. W , H and C are
RL methods for autonomous driving, which mostly only take the size of the mask image. In our case, W = H = 64,
front-view camera images as the input, do not consider time C = 3. Values in m i and m̂ i are RGB values scaled to [0, 1].
sequence, and do not use some state-of-the-art RL algorithms After evaluating N = 104 frames in the simulation, we got
such as SAC and TD3. an average pixel difference e = 0.032, which indicates high
accuracy when decoding the bird-view semantic mask images.
VIII. I NTERPRETABILITY
Besides the performance, our proposed method also has C. Failure Cases Interpretation
significant advantages in terms of intepretability by decoding
Although we can learn a significantly better driving pol-
a semantic mask from the latent state. However, since the
icy than baseline RL methods as shown in Section VII-C,
baseline RL algorithms do not have a latent space, they are not
we can still observe some failure cases such as collisions
able to provide an interpretable semantic mask. In this section,
with surrounding vehicles during testing. Our method can help
we will explain how our method is able to interpret how the
interpret why the agent fails. Fig.10 shows examples of our
autonomous car understands the environment.
failure cases interpretation. Same as Fir.9, the first row shows
the sensor inputs and ground truth masks, while the second
A. Detection & Localization Functionality row shows the reconstructed sensor inputs and masks. The left
It is essential to localize the autonomous car and under- example shows a case where the agent collides with another
stand the road conditions around the car. Traditionally, this is vehicle in an intersection. From the reconstructed mask we can
enabled by a separate localization & mapping system, which see that the agent does not recognize the surrounding vehicle.
requires the collection of an HD map and implementation of This might be caused by the low resolution of the sensor
nontrivial SLAM [47] algorithms. However, our method is inputs, as we can hardly see the vehicle in the camera image.
able to obtain all those information within the end-to-end RL The right example shows a case where the agent collides with

Authorized licensed use limited to: TIANJIN UNIVERSITY. Downloaded on October 14,2024 at 14:14:01 UTC from IEEE Xplore. Restrictions apply.
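The metric of Eq. (34) amounts to a mean absolute per-pixel, per-channel difference on masks scaled to [0, 1]; a small sketch (with random masks standing in for the decoded and ground-truth ones) is given below.

```python
import numpy as np

def average_pixel_difference(pred_masks, true_masks):
    """Metric of Eq. (34): mean absolute per-pixel, per-channel difference,
    with mask values scaled to [0, 1]."""
    pred = np.asarray(pred_masks, dtype=np.float64) / 255.0
    true = np.asarray(true_masks, dtype=np.float64) / 255.0
    per_frame = np.abs(pred - true).reshape(len(pred), -1).mean(axis=1)  # sum / (W*H*C) per frame
    return per_frame.mean()

# Toy check on random 64x64x3 masks (a real evaluation would use decoded vs. ground-truth masks).
rng = np.random.default_rng(4)
true = rng.integers(0, 256, size=(10, 64, 64, 3))
noisy = np.clip(true + rng.integers(-10, 11, size=true.shape), 0, 255)
print("average pixel difference e =", round(average_pixel_difference(noisy, true), 4))
```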
C. Failure Cases Interpretation

Although we can learn a significantly better driving policy than baseline RL methods, as shown in Section VII-C, we can still observe some failure cases, such as collisions with surrounding vehicles during testing. Our method can help interpret why the agent fails. Fig. 10 shows examples of our failure case interpretation. As in Fig. 9, the first row shows the sensor inputs and ground truth masks, while the second row shows the reconstructed sensor inputs and masks. The left example shows a case where the agent collides with another vehicle in an intersection. From the reconstructed mask we can see that the agent does not recognize the surrounding vehicle. This might be caused by the low resolution of the sensor inputs, as we can hardly see the vehicle in the camera image. The right example shows a case where the agent collides with a vehicle occupying part of its lane. From the reconstructed mask we can see that although the agent recognizes the vehicle, it mistakenly localizes it in its own lane. This might be because this is a very rare situation and almost all training data consists of vehicles running in their own lanes, suggesting that more diverse data needs to be collected.

Fig. 10. Examples of failure cases interpretation.

IX. CONCLUSION

In this article, we proposed an interpretable end-to-end reinforcement learning algorithm for autonomous driving in urban driving scenarios. The driving policy was learned jointly with a sequential latent environment model. The learned driving policy took camera and lidar images as input and generated control commands to navigate the autonomous car through urban driving scenarios. The learned environment model provided an interpretable explanation of how the autonomous car understood the driving situation by generating a bird-view semantic mask. The mask was enforced to connect with certain intermediate properties of traditional autonomous driving frameworks, thus providing an explanation of the learned policy. The method was implemented and evaluated in the CARLA simulator, where it was shown to have significantly better performance than the baseline methods.

Although our framework is able to provide interpretable explanations about how the learned model understands the driving environment, it does not provide any intuition about how it makes its decisions, because the driving policy is obtained in a model-free style. In the future, model-based methods will be investigated within this framework to further improve the performance and interpretability. We are also planning to use real data in the future. However, instead of directly running RL in real-world driving, we will deploy an offline RL method, which will directly learn a good driving policy from offline collected data.

REFERENCES

[1] S. Thrun et al., "Stanley: The robot that won the DARPA grand challenge," J. Field Robot., vol. 23, no. 9, pp. 661–692, 2006.
[2] C. Urmson et al., "Autonomous driving in urban environments: Boss and the urban challenge," J. Field Robot., vol. 25, no. 8, pp. 425–466, 2008.
[3] M. Bojarski et al., "End to end learning for self-driving cars," 2016, arXiv:1604.07316. [Online]. Available: http://arxiv.org/abs/1604.07316
[4] F. Codevilla, M. Müller, A. Lopez, V. Koltun, and A. Dosovitskiy, "End-to-end driving via conditional imitation learning," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2018, pp. 1–9.
[5] H. Xu, Y. Gao, F. Yu, and T. Darrell, "End-to-end learning of driving models from large-scale video datasets," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2174–2182.
[6] M. Bansal, A. Krizhevsky, and A. Ogale, "ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst," 2018, arXiv:1812.03079. [Online]. Available: http://arxiv.org/abs/1812.03079
[7] J. Chen, B. Yuan, and M. Tomizuka, "Deep imitation learning for autonomous driving in generic urban scenarios with enhanced safety," 2019, arXiv:1903.00640. [Online]. Available: http://arxiv.org/abs/1903.00640
[8] P. Wolf et al., "Learning how to drive in a real world simulation with deep Q-networks," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2017, pp. 244–250.
[9] T. P. Lillicrap et al., "Continuous control with deep reinforcement learning," 2015, arXiv:1509.02971. [Online]. Available: http://arxiv.org/abs/1509.02971
[10] A. Kendall et al., "Learning to drive in a day," in Proc. Int. Conf. Robot. Autom. (ICRA), May 2019, pp. 8248–8254.
[11] V. Mnih et al., "Playing Atari with deep reinforcement learning," 2013, arXiv:1312.5602. [Online]. Available: http://arxiv.org/abs/1312.5602
[12] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[13] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, p. 484, 2016.
[14] D. Silver et al., "Mastering the game of Go without human knowledge," Nature, vol. 550, no. 7676, pp. 354–359, Oct. 2017.
[15] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," J. Mach. Learn. Res., vol. 17, no. 1, pp. 1334–1373, 2015.
[16] D. Kalashnikov et al., "QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation," 2018, arXiv:1806.10293. [Online]. Available: http://arxiv.org/abs/1806.10293
[17] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," in Proc. AAAI Conf. Artif. Intell., 2016, pp. 2094–2100.
[18] V. Mnih et al., "Asynchronous methods for deep reinforcement learning," in Proc. Int. Conf. Mach. Learn., 2016, pp. 1928–1937.
[19] S. Fujimoto, H. van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," 2018, arXiv:1802.09477. [Online]. Available: http://arxiv.org/abs/1802.09477
[20] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in Proc. Int. Conf. Mach. Learn., 2015, pp. 1889–1897.
[21] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, arXiv:1707.06347. [Online]. Available: http://arxiv.org/abs/1707.06347
[22] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," 2018, arXiv:1801.01290. [Online]. Available: http://arxiv.org/abs/1801.01290
[23] T. Haarnoja et al., "Soft actor-critic algorithms and applications," 2018, arXiv:1812.05905. [Online]. Available: http://arxiv.org/abs/1812.05905
[24] J. Chen, Z. Wang, and M. Tomizuka, "Deep hierarchical reinforcement learning for autonomous driving with distinct behaviors," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2018, pp. 1239–1244.
[25] M. Bojarski et al., "Explaining how a deep neural network trained with end-to-end learning steers a car," 2017, arXiv:1704.07911. [Online]. Available: http://arxiv.org/abs/1704.07911
[26] J. Kim and J. Canny, "Interpretable learning for self-driving cars by visualizing causal attention," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2942–2950.
[27] A. Sauer, N. Savinov, and A. Geiger, "Conditional affordance learning for driving in urban environments," 2018, arXiv:1806.06498. [Online]. Available: http://arxiv.org/abs/1806.06498
[28] K. P. Murphy, Machine Learning: A Probabilistic Perspective. Cambridge, MA, USA: MIT Press, 2012.
[29] C. Dong, J. M. Dolan, and B. Litkouhi, "Intention estimation for ramp merging control in autonomous driving," in Proc. IEEE Intell. Vehicles Symp. (IV), Jun. 2017, pp. 1584–1589.
[30] C. Dong, J. M. Dolan, and B. Litkouhi, "Interactive ramp merging planning in autonomous driving: Multi-merging leading PGM (MML-PGM)," in Proc. IEEE 20th Int. Conf. Intell. Transp. Syst. (ITSC), Oct. 2017, pp. 1–6.
[31] R. G. Krishnan, U. Shalit, and D. Sontag, "Deep Kalman filters," 2015, arXiv:1511.05121. [Online]. Available: http://arxiv.org/abs/1511.05121
[32] M. Karl, M. Soelch, J. Bayer, and P. van der Smagt, "Deep variational Bayes filters: Unsupervised learning of state space models from raw data," 2016, arXiv:1605.06432. [Online]. Available: http://arxiv.org/abs/1605.06432
[33] M. Fraccaro, S. Kamronn, U. Paquet, and O. Winther, "A disentangled recognition and nonlinear dynamics model for unsupervised learning," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 3601–3610.
[34] D. Hafner et al., "Learning latent dynamics for planning from pixels," 2018, arXiv:1811.04551. [Online]. Available: http://arxiv.org/abs/1811.04551
[35] A. X. Lee, A. Nagabandi, P. Abbeel, and S. Levine, "Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model," 2019, arXiv:1907.00953. [Online]. Available: http://arxiv.org/abs/1907.00953
[36] S. Levine, "Reinforcement learning and control as probabilistic inference: Tutorial and review," 2018, arXiv:1805.00909. [Online]. Available: http://arxiv.org/abs/1805.00909
[37] K. Rawlik, M. Toussaint, and S. Vijayakumar, "On stochastic optimal control and reinforcement learning by approximate inference," in Proc. 23rd Int. Joint Conf. Artif. Intell., 2013.
[38] B. D. Ziebart, "Modeling purposeful adaptive behavior with the principle of maximum causal entropy," M.S. thesis, Carnegie Mellon Univ., Pittsburgh, PA, USA, 2018, doi: 10.1184/R1/6720692.v1.
[39] D. Ha and J. Schmidhuber, "World models," 2018, arXiv:1803.10122. [Online]. Available: http://arxiv.org/abs/1803.10122
[40] M. Okada, N. Kosaka, and T. Taniguchi, "PlaNet of the Bayesians: Reconsidering and improving deep planning network by incorporating Bayesian inference," 2020, arXiv:2003.00370. [Online]. Available: http://arxiv.org/abs/2003.00370
[41] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.
[42] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine, "Reinforcement learning with deep energy-based policies," in Proc. 34th Int. Conf. Mach. Learn., vol. 70, 2017, pp. 1352–1361.
[43] B. Eysenbach and S. Levine, "If MaxEnt RL is the answer, what is the question?" 2019, arXiv:1910.01913. [Online]. Available: http://arxiv.org/abs/1910.01913
[44] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," 2013, arXiv:1312.6114. [Online]. Available: http://arxiv.org/abs/1312.6114
[45] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, "CARLA: An open urban driving simulator," 2017, arXiv:1711.03938. [Online]. Available: http://arxiv.org/abs/1711.03938
[46] J. Chen, B. Yuan, and M. Tomizuka, "Model-free deep reinforcement learning for urban autonomous driving," in Proc. IEEE Intell. Transp. Syst. Conf. (ITSC), Oct. 2019, pp. 2765–2771.
[47] E. Yurtsever, J. Lambert, A. Carballo, and K. Takeda, "A survey of autonomous driving: Common practices and emerging technologies," 2019, arXiv:1906.05113. [Online]. Available: http://arxiv.org/abs/1906.05113

Jianyu Chen received the bachelor's degree from Tsinghua University in 2015 and the Ph.D. degree from the University of California at Berkeley, Berkeley, in 2020, under the supervision of Prof. Masayoshi Tomizuka. Since 2020, he has been an Assistant Professor with the Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University. He works at the intersection of machine learning, robotics and control to build intelligent systems which can efficiently learn safe and reliable sensori-motor control policies. Applications of his work mainly focus on robotic systems such as autonomous driving and industrial robots. His research interests include reinforcement learning, control, deep learning, autonomous driving, and robotics.

Shengbo Eben Li received the M.S. and Ph.D. degrees from Tsinghua University in 2006 and 2009, respectively. He was with Stanford University, the University of Michigan, and UC Berkeley. He is currently with the Intelligent Driving Lab (iDLab), Tsinghua University. His current research interests include intelligent vehicles and driver assistance, reinforcement learning and optimal control, and distributed control and estimation. He has authored more than 100 peer-reviewed journal/conference articles and is the co-inventor of more than 30 patents. He was a recipient of the National Award for Technological Invention of China in 2013, the Best Paper Award at the 2014 IEEE ITS, the Best Paper Award at the 14th Asian ITS, the Excellent Young Scholar of NSF China in 2016, the Young Professorship of the Changjiang Scholar Program in 2016, the Tsinghua University Excellent Professorship Award in 2017, the National Award for Progress in Science and Technology of China in 2018, and the Distinguished Young Scholar of Beijing NSF in 2018. He also serves as a member of the Board of Governors of the IEEE ITS Society and an Associate Editor of the IEEE Intelligent Transportation Systems Magazine (ITSM) and the IEEE Transactions on Intelligent Transportation Systems.

Masayoshi Tomizuka (Life Fellow, IEEE) received the Ph.D. degree in mechanical engineering from MIT in February 1974. In 1974, he joined the faculty of the Department of Mechanical Engineering, University of California at Berkeley, where he currently holds the Cheryl and John Neerhout, Jr., Distinguished Professorship Chair. He served as the Program Director of the Dynamic Systems and Control Program of the Civil and Mechanical Systems Division of NSF from 2002 to 2004. His current research interests include optimal and adaptive control, digital control, signal processing, motion control, and control problems related to robotics, precision motion control and vehicles. He is a Fellow of the ASME and IFAC. He was a recipient of the Charles Russ Richards Memorial Award (ASME, 1997), the Rufus Oldenburger Medal (ASME, 2002), and the John R. Ragazzini Award in 2006. He served as a Technical Editor of the ASME Journal of Dynamic Systems, Measurement and Control (J-DSMC) from 1988 to 1993, and as the Editor-in-Chief of the IEEE/ASME Transactions on Mechatronics from 1997 to 1999.
