Towards Explainability in Modular Autonomous Vehicle Software
Explainability is crucial in post-hoc analysis where blame assignment might be necessary. In this paper, we provide positioning on how researchers could consider incorporating explainability and interpretability into the design and optimization of separate Autonomous Vehicle modules, including Perception, Planning, and Control.

I. INTRODUCTION

According to the Morning Consult and Politico poll [1], only 16% of respondents are "very likely" to ride as a passenger in an autonomous vehicle, while 28% of respondents state that they are "not likely at all". Moreover, only 22% of respondents believe self-driving cars are safer than the average human driver, while 35% believe self-driving cars are less safe than the average human driver. The public's distrust in Autonomous Vehicles (AVs) shows that improving explainability in AV software is a necessity.

There exist many surveys on explainable AI (XAI) and robotics [2]–[5]. Specifically, [6]–[8] survey explainability in Autonomous Driving. Atakishiyev et al. [7] argue that AVs need to provide regulatory compliant operational safety and explainability in real-time decisions, and structure their discussion through a cause-effect-solution perspective. Zablocki et al. [6] provide an in-depth overview of XAI methods for deep vision-based approaches, but are limited to the scope of perception only. Omeiza et al. [8] also provide an overview of explanations in AVs across the full self-driving pipeline. Gilpin et al. [9] pose explainability as the trade-off between interpretability and completeness. As described in [9], to be interpretable is to describe the internals of a system in a way that is understandable to humans; to be complete is to describe the operation of a system in an accurate way.

We position ourselves to provide insight into augmenting explainability in an Autonomous Vehicle's sense-plan-act software modules as a task of balancing interpretability and completeness. In this paper, we look at the explainability in existing works and in our recent contributions in localization, planning, and control. In each case, we want to be able to quantify the uncertainty at each step of the decision making and to interpret the provenance of the outcome of the algorithm.

∗ Authors contributed equally. All authors are with the Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104, USA. Emails: {hongruiz, zzang, yangs1, rahulm}@seas.upenn.edu

II. EXPLAINABILITY IN LOCALIZATION

The mapping between measurements and poses can be ambiguous, because sensor measurements from multiple distant poses can be similar. Therefore, to tightly integrate the localization module with other parts of the software stack, and for the engineers implementing and tuning the algorithm, the explainability of localization algorithms that use neural networks becomes important. We need to estimate the uncertainty of the localization results and, in the worst case, to know when and why the robot fails to localize on a certain map.

Monte Carlo Localization (MCL) [10], the widely adopted method, uses random hypothesis sampling and sensor measurement updates to infer the pose. In MCL, the proposed particles are explicit poses on the map, and we can interpret the distribution of the particles as the uncertainty. The random generation of the particles can be tuned with parameters that have physical meaning, providing an interface for humans to adjust the behavior of the algorithm. Many developments in localization seek to improve within the framework of MCL [11]–[13]. Although the particle filter has been a popular localization framework for its robustness and reliable performance, it introduces random jitter into the localization results.
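To make the interpretability of MCL concrete, the sketch below shows one predict-weight-resample step of a particle filter. It is our own simplified illustration rather than the implementation of [10]–[13]; `motion_model` and `measurement_likelihood` are hypothetical user-supplied callables, and the spread of the particle cloud directly serves as the uncertainty estimate.

```python
import numpy as np

def mcl_update(particles, weights, control, scan, motion_model, measurement_likelihood):
    """One Monte Carlo Localization step: predict, weight, resample.

    particles: (N, 3) array of pose hypotheses (x, y, theta).
    motion_model / measurement_likelihood: hypothetical user-supplied callables.
    """
    # Predict: propagate every pose hypothesis through a noisy motion model.
    particles = motion_model(particles, control)

    # Update: weight each hypothesis by how well the scan matches the map there.
    weights = weights * measurement_likelihood(particles, scan)
    weights = weights / np.sum(weights)

    # Resample: concentrate particles in high-likelihood regions of the map.
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx]
    weights = np.full(len(particles), 1.0 / len(particles))

    # The spread of the particle cloud is a directly interpretable uncertainty estimate.
    covariance = np.cov(particles.T)
    return particles, weights, covariance
```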
Other common approaches are to use Bayesian filtering [14] or to find more distinguishable global descriptors on the map [15], [16]. In Bayesian filtering, the explainability lies in the conditional probabilities attached to the motion model and to each measurement; the estimation of such probabilities is challenging. For the global descriptor approach, manual selection of map features is often needed, which increases the explainability of the system but also increases the human workload and reduces robustness. Developments in localization research usually propose better measurement models or feature extractors within these frameworks [17], [18].

Recent research in localization has also focused on the use of learning-based methods outside of the above frameworks [19]. Although learning-based methods may provide better localization precision with lower latency, the interpretability of the method decreases. While traditional localization methods can be manually tuned for specific user scenarios, learning-based localization methods are usually not tunable once the network is trained. Uncertainty estimation for the neural networks also becomes a challenge for learning-based methods. There are efforts to approximate the uncertainty [20]–[22], but they have not been widely applied.
Our contribution: In our recent paper, Local INN, we proposed a new approach that frames localization as an ambiguous inverse problem and solves it with an invertible neural network (INN) [23]. Local INN stores the map data implicitly inside the neural network. Under the assumption that the environment does not change significantly from the map, we can obtain robot poses from LiDAR scans by evaluating the reverse path of the neural network. The network also provides uncertainty estimates and is capable of learning and providing localization for complex environments.
Localization is an inverse problem: finding a robot's pose using a map and sensor measurements. This reverse process of inferring the pose from sensor measurements is ambiguous. Invertible neural networks such as normalizing flows [24] have been used to solve ambiguous inverse problems in various fields [25]–[28]. The version of normalizing flows we use is called RealNVP [29], which uses a mathematical structure called coupling layers to ensure invertibility while performing transformations with arbitrary neural network layers, such as MLPs. This framework of solving inverse problems with normalizing flows was introduced by Ardizzone et al. [25] and was later extended by [26], [30] to include a conditional input that is concatenated to the vectors inside the coupling layers. They proposed to use normalizing flows to learn a bijective mapping between two distributions and to use a normally distributed latent variable to encode the information lost in training due to the ambiguity of the problem. The network can be evaluated in both the forward and reverse paths. During evaluation, repeatedly sampling the latent variable yields the full posterior distribution given the input.
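As a concrete illustration, the following PyTorch sketch shows a conditional affine coupling layer in the spirit of RealNVP [29] and the conditional extension of [26], [30]. The layer sizes and the way the condition is injected are illustrative assumptions, not the exact Local INN architecture; the point is that the layer is invertible by construction regardless of what the inner networks compute.

```python
import torch
import torch.nn as nn

class ConditionalCoupling(nn.Module):
    """Affine coupling layer: invertible by construction, arbitrary inner networks."""

    def __init__(self, dim, cond_dim, hidden=128):
        super().__init__()
        self.half = dim // 2
        # The scale/translation network can be arbitrary (e.g., an MLP);
        # invertibility of the layer does not depend on it.
        self.net = nn.Sequential(
            nn.Linear(self.half + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)))

    def forward(self, x, cond):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(torch.cat([x1, cond], dim=1)).chunk(2, dim=1)
        y2 = x2 * torch.exp(s) + t          # affine transform of the second half
        return torch.cat([x1, y2], dim=1)

    def inverse(self, y, cond):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        s, t = self.net(torch.cat([y1, cond], dim=1)).chunk(2, dim=1)
        x2 = (y2 - t) * torch.exp(-s)       # exact inverse of the forward transform
        return torch.cat([y1, x2], dim=1)
```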
Fig. 1: Top: Evaluation of Local INN in the forward direction gives compressed map information, and in the reverse direction gives accurate localization with fast runtime and uncertainty estimation. Bottom: Network structure of Local INN. Solid arrows are from pose to LiDAR scan; dashed arrows are from LiDAR scan to pose. The conditional input is calculated from the previous pose of the robot.

In Local INN, we use pose-scan data pairs to train such a bijective mapping. As shown in Fig. 1, the forward path is from pose to scan and the reverse path is from scan to pose. We use a conditional input calculated from the previous pose of the robot to reduce the ambiguity of the problem. Because INNs require the same input and output dimensions, we use a Variational Autoencoder [31] to reduce the dimension of the LiDAR scans and use Positional Encoding [32] to augment that of the poses. The network is trained with supervised loss functions on both sides. The data used for training Local INN can be simulated or real data recorded from LiDARs. In our experiments, we tested on both real and simulated data with 2D and 3D LiDARs. To collect training data, we uniformly sample the x, y position and heading θ on the drivable surface of each map and use a LiDAR simulator to find the corresponding LiDAR ranges. This means the trained network is able to localize everywhere on the map. For each different map, we need to train a separate network. Map files are compressed inside the neural network and are no longer needed during evaluation.
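A sketch of this data-collection loop is given below; `map_bounds`, `is_drivable`, and `lidar_simulator` are hypothetical helpers standing in for the map representation and the simulator, so the details should be read as assumptions rather than our exact tooling.

```python
import numpy as np

def collect_training_pairs(map_bounds, is_drivable, lidar_simulator, n_samples):
    """Uniformly sample poses on the drivable surface and simulate their LiDAR scans."""
    (x_min, x_max), (y_min, y_max) = map_bounds
    pairs = []
    while len(pairs) < n_samples:
        pose = np.array([np.random.uniform(x_min, x_max),     # x
                         np.random.uniform(y_min, y_max),     # y
                         np.random.uniform(-np.pi, np.pi)])   # heading theta
        if not is_drivable(pose[:2]):
            continue                      # keep only poses on the drivable surface
        scan = lidar_simulator(pose)      # ranges the LiDAR would measure at this pose
        pairs.append((pose, scan))
    return pairs
```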
We claim that the INN is naturally suitable for the localization problem, with improved explainability compared to other learning-based methods. In particular, uncertainty estimation and map representation are the two advantages that Local INN provides in the context of explainability.

A. Explainability from Uncertainty Estimation

When we use Local INN to localize, the input to the reverse path of the INN consists of the LiDAR scan concatenated with a latent vector sampled from a normal distribution. With this sampling of the latent vector, the network can output not just a pose but a distribution of inferred poses. The covariance of this distribution can be used as the confidence of the neural network when fusing with other sensors. Uncertainty estimation improves explainability by providing information on the measurement quality of the prediction. Compared to learning methods that do not provide uncertainty estimates, it is much easier to determine whether the prediction of the neural network is lower in accuracy due to higher uncertainty, and to improve the prediction results by augmentation. In our experiments, we used an EKF to fuse the localization result with the odometry information. The results show that this fusion significantly improved localization accuracy where the map geometry is ambiguous, which means this covariance is very effective in revealing the confidence of the network.
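The sketch below shows how repeated latent sampling through the reverse path can be summarized into a pose covariance; `local_inn.reverse`, its signature, and `latent_dim` are assumed interfaces for illustration, not the exact Local INN API. The resulting covariance can then serve as the measurement covariance when fusing with odometry in an EKF.

```python
import numpy as np

def localize_with_uncertainty(local_inn, scan_encoding, condition, n_latent=64):
    """Sample the latent vector repeatedly and summarize the inferred pose cloud."""
    poses = []
    for _ in range(n_latent):
        z = np.random.randn(local_inn.latent_dim)   # latent sampled from N(0, I)
        # Hypothetical reverse-path call: scan encoding + latent + condition -> pose.
        poses.append(local_inn.reverse(scan_encoding, z, condition))
    poses = np.stack(poses)
    mean_pose = poses.mean(axis=0)
    covariance = np.cov(poses.T)   # spread of the inferred poses = network confidence
    return mean_pose, covariance
```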
As shown in Table I, the accuracy of Local INN is on par with current localization efforts. See [23] for a comparative analysis of Local INN localization accuracy on 2D and 3D maps.

TABLE I: Local INN Experiments: Map Reconstruction and RMS Localization Errors with 2D LiDAR (xy [m], θ [°])

                    Race Track (Simulation)    Outdoor (Real)
(Original map, map reconstruction, and test trajectory shown as images for each environment.)
Online PF           0.168, 2.107               0.047, 1.371
Local INN+EKF       0.056, 0.284               0.046, 1.130
B. Explainability from Map Representation

Local INN provides an implicit map representation and a localization method within one neural network. The guaranteed invertibility of the neural network gives the user a direct way to check the neural network's 'understanding' of the map by reproducing part of the map from poses. That is, we can compare the reconstructed map to the original map to see how much detail is used by the neural network in localization. Again, this feature improves explainability in failure scenarios. When the localization fails, this comparison can help us explain the failure and guide us in improving the method. For example, we can train with more data from a particular location on the map that was difficult to localize in.

As an example of how the map information stored in the forward path of the neural network can help us explain the localization results, let us consider the algorithm for global localization. Global localization is needed when a robot starts with an unknown pose or when the robot encounters the 'kidnapping problem'. In this case, it is challenging to find the correct position on the map due to the ambiguity of the problem. MCL algorithms usually perform global localization by spreading the covariance all around the map and using iterations of control inputs and measurements to decrease the covariance, which gives an explicit visualization of the progress of the global localization process. For other learning-based methods, this process is usually hard to explain, as we rely on the neural network to output poses as a black box.

With Local INN, we can randomly initialize a set of poses on the map as conditional inputs and use new LiDAR scans to narrow down the hypotheses. In other words, we initially have multiple random assumptions of the robot's location on the map and use them as the conditional inputs for the neural network. As shown in Figure 2, iteration 1, when we input a LiDAR scan to the network along with these assumptions, it outputs multiple possible pose distributions. In our algorithm, for each possible pose distribution, we compare the sensor measurement with what the neural network expects at this location, and use the reciprocal of the error term to weight the assumptions differently. The weights of the assumptions are used to determine how many latent-variable samples we spend on each. This process repeats with each new LiDAR scan we get from the sensor. In our experiments, the convergence of the candidate poses is fast and accurate. As shown in iterations 2 and 3 in Figure 2, even though we still track multiple poses, in this example the correct pose is determined at the 2nd iteration in a highly symmetrical map. The bottom part of Figure 2 shows plots of the expected LiDAR scan from the neural network and the current LiDAR measurement. This reveals the black-box process of the neural network, so that we can see why multiple poses are possible and how we should decide which one to pick.
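One iteration of this procedure is sketched below. `local_inn.forward` (pose to expected scan encoding) and `local_inn.reverse` (scan encoding, latent, and conditional pose to pose) are hypothetical interfaces, and the way latent samples are allocated to hypotheses is an illustrative assumption rather than the exact algorithm of [23].

```python
import numpy as np

def global_localization_step(local_inn, candidates, scan_encoding, n_latent=64):
    """Weight candidate poses by how well the network's expected scan matches the
    measurement, then re-estimate each candidate with weighted latent sampling."""
    errors = []
    for pose in candidates:
        expected_scan = local_inn.forward(pose)        # what the network expects here
        errors.append(np.linalg.norm(expected_scan - scan_encoding))
    weights = 1.0 / (np.array(errors) + 1e-6)          # reciprocal of the error term
    weights /= weights.sum()

    new_candidates = []
    for pose, w in zip(candidates, weights):
        # Spend more latent samples on the more plausible hypotheses.
        k = max(1, int(round(w * n_latent)))
        samples = [local_inn.reverse(scan_encoding,
                                     np.random.randn(local_inn.latent_dim), pose)
                   for _ in range(k)]
        new_candidates.append(np.mean(samples, axis=0))
    return new_candidates, weights
```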
Fig. 2: Example of global localization finding the correct pose at the 2nd iteration (green arrow). Top: Narrowing down of the candidate poses in the first 3 iterations. We can see candidate poses on the map (orange dots), the correct pose (red dot), and the selected pose (green dot). Bottom: Examples of LiDAR scan ranges at candidate poses at iteration 3 (orange boxes) and at the selected pose (green box). In the range plots, the horizontal axis is the angle of the LiDAR scan and the vertical axis is the measured distance. We can see a comparison of the network's expected ranges (blue curve) at various poses and the actual LiDAR measurement (orange curve). The correct pose is selected where the measurement best matches the expected shape. The network is trained on simulated data and tested on real LiDAR data.

To summarize, the explainability of localization methods generally lies in the uncertainty estimation and in the ability to explain and tune the methods when localization fails. Traditional localization methods usually offer higher interpretability than learning-based methods, whereas learning-based methods can provide better empirical performance. The new method we recently proposed, Local INN, uses an invertible neural network architecture to solve the localization problem. It offers interpretability by giving uncertainty estimates through the covariance of the inferred pose distributions, and by ensuring the invertibility of the network so that we can reveal what information the neural network is using during localization. At the same time, it does sacrifice some completeness by using a Variational Autoencoder (VAE) to model a latent space of the LiDAR scans.

III. EXPLAINABILITY IN PLANNING

Planning is the task of finding a viable motion plan for the robot to reach some predetermined goal given the current observation of the environment through various sensors. The planning step usually comes after perception or localization. Traditionally, sampling-based and model-predictive methods are the most popular choices in Autonomous Vehicle motion planning. Planning algorithms provide explainability through human-designed objectives: e.g., maximizing the distance of a planned trajectory to obstacles and undrivable areas, maximizing velocity along the trajectory, or minimizing the lateral acceleration the vehicle experiences on a planned trajectory. We propose a unique position in adding interpretability and completeness to planning algorithms: explainability through abstraction. Next, we show our approach based on our recent paper on Game-Theoretic Objective Space Planning [33].

In this case study, our primary context is two-player racing games in close proximity. An overall depiction of the pipeline is shown in Figure 3. We choose a game-theoretic approach that models racing as a zero-sum extensive game with imperfect information and perfect recall. Extensive games model sequential decision-making of players and naturally form a game tree where each node is a decision point for a player. However, the planning problem presented in autonomous racing is continuous, and the state space of the agents, and in turn the game tree of the extensive game, would be infinitely large if we modeled the game in the vehicle's planning space. Since a decision made by a game-theoretic algorithm in the planning space cannot be explained in a way that a human can understand, we use a lower dimensional space for planning. We define the notion of the Objective Space O. For each short rollout in an adversarial environment, we can compute multiple metrics regarding an agent's performance, such as safety and aggressiveness. These metrics also add to the interpretability of our planning algorithm while not losing completeness. O models the average outcome of each agent against competent opponents. Using O, our planner maps complex agent behaviors to a lower dimension where only the episodic outcome is recorded instead of the entire decision-making process in the planning space. We define an action in our game as a movement along a dimension of O. This action space is analogous to the planning space of a grid world with actions that move the agent to a neighboring cell. This means the planning problem is much simpler than the original problem.

Fig. 3: Overview of Game-theoretic Objective Space Planning: short rollouts are abstracted through scoring into the Objective Space (aggressiveness vs. conservativeness), planning is done in this low dimension, and the result is mapped back to motion plans in the original space.
In our case study, we choose aggressiveness and restraint as the two dimensions of O. Aggressiveness is scored based on an agent's lead over the other at the end of the rollout, and restraint is scored based on an agent's time to collision with the environment and the other agent. Two movements are available in each dimension: increasing or decreasing by a fixed distance along the axis. For example, four discrete actions are available at each turn when O ∈ R². Even with this formulation of the agent action space, the possible objective values of an opponent agent, i.e., its possible positions in O, are still infinite. We propose a predictive model for regret values within Counterfactual Regret Minimization (CFR) [34] to make the problem tractable. Finally, with head-to-head racing experiments, we demonstrate that the proposed planning pipeline significantly improves the win rate, and that this improvement generalizes to unseen opponents in an unseen environment.
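The sketch below illustrates this abstraction: a short rollout is scored into a point of O, and the planner acts by taking fixed-size steps along the axes of O. The normalization constants and the step size are illustrative assumptions, not the exact scoring used in [33].

```python
import numpy as np

def score_rollout(ego_lead, min_ttc, lead_scale=5.0, ttc_scale=3.0):
    """Map a short rollout to a point in O = (aggressiveness, restraint).

    ego_lead: ego's lead over the opponent at the end of the rollout [m].
    min_ttc:  minimum time to collision with the environment/opponent [s].
    """
    aggressiveness = np.clip(ego_lead / lead_scale, -1.0, 1.0)
    restraint = np.clip(min_ttc / ttc_scale, 0.0, 1.0)
    return np.array([aggressiveness, restraint])

# Four discrete actions when O is 2-D: move a fixed step along one axis,
# analogous to moving to a neighboring cell in a grid world.
STEP = 0.25
ACTIONS = {
    "more_aggressive": np.array([+STEP, 0.0]),
    "less_aggressive": np.array([-STEP, 0.0]),
    "more_restrained": np.array([0.0, +STEP]),
    "less_restrained": np.array([0.0, -STEP]),
}

def apply_action(point_in_objective_space, action_name):
    return point_in_objective_space + ACTIONS[action_name]
```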
A. Explainability in Agent Actions

In this section, we examine specific cases of agents moving in the lower dimension and investigate whether we achieve the task of instilling explainability in our algorithm. We choose a 2-D space that encodes aggressiveness and restraint. In Figure 4, we show four examples of races between two agents. The Ego (orange) uses our proposed game-theoretic approach in the Objective Space, and the Opponent (green) is static in the Objective Space. In the first two cases, the opponent is in the lower right quadrant, meaning that it is more conservative than aggressive. Hence our planner chooses to increase aggressiveness continuously to win the races. In the last two cases, the opponent is in the upper left quadrant, meaning that it is more aggressive than conservative. Hence our planner chooses to first become more conservative and, once a chance presents itself, increase aggressiveness and win the races.

Fig. 4: Trajectories of ego moves in O. Panels A–D show four races; the horizontal axis is Aggressiveness (lower is more aggressive) and the vertical axis is Restraint (lower is more conservative), with the near-optimal point, the Ego objective value, and the Opponent objective value marked.

In Figure 5, we inspect a specific move to show its effect in the motion planning space. At the beginning of the rollout, both agents are side by side. At a decision point, our planner locates the opponent in the lower right quadrant as more conservative than aggressive. The ego then decides to increase aggressiveness and finally overtakes the opponent. From these examples, it is clear that moving the planning problem into a lower dimension that encodes metrics interpretable to humans does not decrease the capability of the planner, or the completeness of the algorithm.

Fig. 5: Effect of making a move in O in the motion planning space.
IV. EXPLAINABILITY IN CONTROL

With robust localization, the autonomous vehicle can plan its trajectory, but it requires a safe and interpretable controller to execute the plan. In this section, we show our recent progress on learning-based safety-critical control through control barrier functions (CBFs) and provide our positioning on explainable safe control. Specifically, we show that our proposed differentiable safety filter has more completeness than a non-differentiable safety filter without sacrificing interpretability.
A. Safety-critical control

Learning-based control can provide high empirical performance and has thus become popular for controlling complex dynamical systems. However, learning-based controllers, such as neural networks (NNs), generally lack formal safety guarantees because of their black-box nature. This limits their deployment in complex safety-critical systems. To address the safety risk, multiple methods have been proposed, such as model predictive control (MPC) [35], Hamilton-Jacobi reachability analysis [36], contraction theory [37], and control barrier functions (CBFs) [38].

Among the many safety-critical control techniques, the CBF is becoming a popular choice since it explicitly specifies a safe control set by using a Lyapunov-like condition and guards the system inside a safe invariant set. When a continuous-time control-affine system is considered, the projection of a reference control input onto this safe control set reduces to a convex quadratic program (QP), which is referred to as the CBF-QP. Due to their simplicity, flexibility, and formal safety guarantees, CBFs have been applied in safe learning control with many successful applications [39]–[43].
Compared with MPC, which needs to handle a possibly nonconvex optimization problem in the face of nonlinear dynamical systems, the CBF-QP is computationally efficient to solve online. However, unlike MPC, the QP-based safety filter only operates in a minimally invasive manner, i.e., it generates the safe control input closest to the reference control input (in the Euclidean norm), as shown in Fig. 6, unaware of the long-term effects of its action. This means that the effects of the safety filter on the performance of the closed-loop system are hard to predict. Therefore, the application of the safety filter may give rise to myopic controllers [44] that induce subpar performance in the long term.

Fig. 6: Illustration of the QP: if the NN output is not in the safe control set Q, it is projected onto Q in a minimally invasive way; otherwise, the control stays the same.
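For concreteness, the following is a sketch of the standard CBF-QP safety filter for a control-affine system ẋ = f(x) + g(x)u, written with cvxpy; the barrier function h, the dynamics, and the input bounds are placeholders to be supplied for a specific system, not the controller of [45].

```python
import numpy as np
import cvxpy as cp

def cbf_qp_filter(u_ref, f_x, g_x, h_x, dh_dx, alpha=1.0, u_min=-1.0, u_max=1.0):
    """Minimally invasive CBF-QP filter for x_dot = f(x) + g(x) u.

    Solves: min_u ||u - u_ref||^2  s.t.  dh/dx (f(x) + g(x) u) + alpha h(x) >= 0.
    """
    u = cp.Variable(len(u_ref))
    lf_h = float(dh_dx @ f_x)            # Lie derivative of h along f
    lg_h = dh_dx @ g_x                   # Lie derivative of h along g
    constraints = [lf_h + lg_h @ u + alpha * h_x >= 0,   # CBF condition keeps h >= 0
                   u >= u_min, u <= u_max]               # actuator limits
    cp.Problem(cp.Minimize(cp.sum_squares(u - u_ref)), constraints).solve()
    return u.value

# Toy example: single integrator x_dot = u with safe set {x <= 1}, i.e., h(x) = 1 - x.
u_safe = cbf_qp_filter(u_ref=np.array([1.0]),
                       f_x=np.array([0.0]),
                       g_x=np.array([[1.0]]),
                       h_x=0.2,                 # h evaluated at x = 0.8
                       dh_dx=np.array([-1.0]))
# The reference input 1.0 is projected down to about 0.2, the closest input
# that still satisfies the CBF condition.
```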
To address the issue of myopic CBF-based safety filters, in our recent work [45] we propose to utilize CBFs to construct safe-by-construction NN controllers that allow end-to-end learning. Incorporating safety layers into the NN controller allows the learning agent to take the effects of the safety filter into account during training in order to maximize long-term performance.

We design a differentiable safety layer using the gauge map [46] (as shown in Fig. 7), which establishes a bijective mapping between the polytopic set of NN outputs (e.g., an ℓ∞ norm ball) and the CBF-based safe control set. The
Fig. 7: Illustration of the gauge map from the ℓ∞ ball B∞ to a polytopic set Q. The original point in B∞ and the mapped point in Q are in the same level set and have the same direction.
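A numpy sketch of the gauge-map idea illustrated in Fig. 7 is given below (an illustration of the mathematical map only, not the differentiable layer of [45]): a point v in the unit ℓ∞ ball is rescaled onto a polytope Q = {u : Au ≤ b}, assumed to contain the origin in its interior, keeping its direction and matching level sets via the Minkowski gauge of Q.

```python
import numpy as np

def gauge_map(v, A, b, eps=1e-9):
    """Map v from the unit l-infinity ball onto the polytope Q = {u : A u <= b}.

    Assumes all entries of b are positive (0 is in the interior of Q). The mapped
    point keeps the direction of v and lies on the level set of Q corresponding
    to the l-infinity level set of v.
    """
    gauge_ball = np.max(np.abs(v))      # Minkowski gauge of v w.r.t. the l_inf ball
    gauge_poly = np.max(A @ v / b)      # Minkowski gauge of v w.r.t. Q
    if gauge_poly <= eps:
        return np.zeros_like(v)         # v = 0 maps to the origin
    return (gauge_ball / gauge_poly) * v

# Example: a corner of the l_inf ball lands on the boundary of the box
# Q = {|u1| <= 2, |u2| <= 0.5}.
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([2.0, 2.0, 0.5, 0.5])
print(gauge_map(np.array([1.0, 1.0]), A, b))   # -> [0.5, 0.5]
```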
(Block diagram: the system state is fed to the NN, whose output passes through the CBF-QP to produce the safe action; the combination forms the NN-QP controller.)

Fig. 10: Results of adaptive cruise control. (a) CBF values. (b) Velocity of the ego car. CBF values (left) and velocity of the ego car (right) under different controllers are evaluated in closed loop for 20 s. A CBF value below zero indicates unsafety, and the optimal behavior of the ego car is expected to have a steady-state velocity of 16 m/s, the same as the leading car.