0% found this document useful (0 votes)

144 views20 pages

End-To-End Autonomous Driving Challenges and Frontiers

This survey paper analyzes over 270 studies on end-to-end autonomous driving, highlighting the shift from modular to integrated algorithm frameworks that utilize raw sensor data for vehicle motion planning. Key challenges such as interpretability, robustness, and the incorporation of large-scale data are discussed, along with advancements in foundation models and visual pre-training. The paper emphasizes the need for a holistic approach to algorithm design to enhance safety and performance in autonomous driving systems.

Uploaded by

Chandra Sekhar

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

144 views20 pages

End-To-End Autonomous Driving Challenges and Frontiers

Uploaded by

Chandra Sekhar

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 20

10164 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO.

12, DECEMBER 2024

End-to-End Autonomous Driving: Challenges

and Frontiers
Li Chen , Penghao Wu , Kashyap Chitta , Bernhard Jaeger , Andreas Geiger ,
and Hongyang Li , Senior Member, IEEE

(Survey paper)

Abstract—The autonomous driving community has witnessed as perception, prediction, and planning, is individually devel-
a rapid growth in approaches that embrace an end-to-end algo- oped and integrated into onboard vehicles. The planning or
rithm framework, utilizing raw sensor input to generate vehicle control module, responsible for generating steering and accel-
motion plans, instead of concentrating on individual tasks such as
detection and motion prediction. End-to-end systems, in compar- eration outputs, plays a crucial role in determining the driving
ison to modular pipelines, benefit from joint feature optimization experience. The most common approach for planning in modular
for perception and planning. This field has flourished due to the pipelines involves using sophisticated rule-based designs, which
availability of large-scale datasets, closed-loop evaluation, and the are often ineffective in addressing the vast number of situations
increasing need for autonomous driving algorithms to perform that occur on road. Therefore, there is a growing trend to leverage
effectively in challenging scenarios. In this survey, we provide a
comprehensive analysis of more than 270 papers, covering the mo- large-scale data and to use learning-based planning as a viable
tivation, roadmap, methodology, challenges, and future trends in alternative.
end-to-end autonomous driving. We delve into several critical chal- We define end-to-end autonomous driving systems as fully
lenges, including multi-modality, interpretability, causal confusion, differentiable programs that take raw sensor data as input and
robustness, and world models, amongst others. Additionally, we produce a plan and/or low-level control actions as output.
discuss current advancements in foundation models and visual
pre-training, as well as how to incorporate these techniques within Fig. 1(a)-(b) illustrates the difference between the classical
the end-to-end driving framework. and end-to-end formulation. The conventional approach feeds
the output of each component, such as bounding boxes and
Index Terms—Autonomous driving, end-to-end system design,
policy learning, simulation.
vehicle trajectories, directly into subsequent units (dashed ar-
rows). In contrast, the end-to-end paradigm propagates fea-
ture representations across components (gray solid arrow). The
I. INTRODUCTION optimized function is set to be, for example, the planning
ONVENTIONAL autonomous driving systems adopt a performance, and the loss is minimized via back-propagation
C modular design strategy, wherein each functionality, such (red arrow). Tasks are jointly and globally optimized in this
process.
In this survey, we conduct an extensive review of this emerg-
Manuscript received 24 June 2023; revised 19 April 2024; accepted 22 July ing topic. Fig. 1 provides an overview of our work. We be-
2024. Date of publication 30 July 2024; date of current version 5 November 2024.
The work of Li Chen, Penghao Wu, and Hongyang Li were partially supported
gin by discussing the motivation and roadmap for end-to-end
by the National Key R&D Program of China under Grant 2022ZD0160104, in autonomous driving systems. End-to-end approaches can be
part by NSFC under Grant 62206172, and in part by the Shanghai Committee broadly classified into imitation and reinforcement learning,
of Science and Technology under Grant 23YF1462000. The work of Andreas
and we give a brief review of these methodologies. We cover
Geiger and Bernhard Jaeger were supported by the ERC Starting Grant LEGO-
3D (850533), in part by the BMWi in the project KI Delta Learning under Grant datasets and benchmarks for both closed and open-loop eval-
19A19013O, and in part by the DFG EXC number 2064/1 - project number uation. We summarize a series of critical challenges, includ-
390727645. Kashyap Chitta was supported by the German Federal Ministry
of Education and Research (BMBF): Tübingen AI Center, FKZ, under Grant
ing interpretability, generalization, world models, causal con-
01IS18039A. Recommended for acceptance by H. Li. (Corresponding author: fusion, etc. We conclude by discussing future trends that we
Hongyang Li.) think should be embraced by the community to incorporate
Li Chen and Hongyang Li are with the OpenDriveLab, Shanghai AI Lab,
Shanghai 200233, China, and also with the University of Hong Kong, Hong
the latest developments from data engines, and large founda-
Kong, China (e-mail: [email protected]). tion models, amongst others. Note that this review is mainly
Penghao Wu is with the OpenDriveLab, Shanghai AI Lab, Shanghai 200233, orchestrated from a theoretical perspective. Engineering efforts
China.
Kashyap Chitta, Bernhard Jaeger, and Andreas Geiger are with the University
such as version control, unit testing, data servers, data clean-
of Tübingen, 72074 Tübingen, Germany, and also with the Tübingen AI Center, ing, software-hardware co-design, etc., play crucial roles in
72076 Tübingen, Germany. deploying the end-to-end technology. Publicly available infor-
We maintain an active repository that contains up-to-date literature and
open-source projects at https://github.com/OpenDriveLab/End-
mation regarding the latest practices on these topics is limited.
to-end-Autonomous-Driving. We invite the community towards more openness in future
Digital Object Identifier 10.1109/TPAMI.2024.3435937 discussions.

© 2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see
https://creativecommons.org/licenses/by-nc-nd/4.0/
CHEN et al.: END-TO-END AUTONOMOUS DRIVING: CHALLENGES AND FRONTIERS 10165

Fig. 1. Survey at A Glance. (a) Pipeline and Methods. We define end-to-end autonomous driving as a learning-based algorithm framework with raw sensor
input and planning/control output. We deepdive into 270+ papers and categorize into imitation learning (IL) and reinforcement learning (RL). (b) Benchmarking.
We group popular benchmarks into closed-loop and open-loop evaluation, respectively. We cover various aspects of closed-loop simulation and the limitations of
open-loop evaluation for this problem. (c) Challenges. This is the main section of our work. We list key challenges from a wide range of topics and extensively
analyze why these concerns are crucial. Promising resolutions to these challenges are covered as well. (d) Future Trends. We discuss how end-to-end paradigm
could benefit by aid of the rapid development of foundation models, visual pre-training, etc. Partial photos by courtesy of online resources.

A. Motivation of an End-to-End System Note that the end-to-end paradigm does not necessarily indi-
In the classical pipeline, each model serves a standalone cate one black box with only planning/control outputs. It could
have intermediate representations and outputs (Fig. 1(b)) as in
component and corresponds to a specific task (e.g., traffic light
detection). Such a design is beneficial in terms of interpretabil- classical approaches. In fact, several state-of-the-art systems [1],
ity and ease of debugging. However, since the optimization [2] propose a modular design but optimize all components
together to achieve superior performance.
objectives across modules are different, with detection pur-
suing mean average precision (mAP) while planning aiming B. Roadmap
for driving safety and comfort, the entire system may not be
aligned with a unified target, i.e., the ultimate planning/control Fig. 2 depicts a chronological roadmap of critical achieve-
task. Errors from each module, as the sequential procedure ments in end-to-end autonomous driving, where each part indi-
proceeds, could be compounded and result in an information cates an essential paradigm shift or performance boost. The his-
loss. Moreover, compared to one end-to-end neural network, tory of end-to-end autonomous driving dates back to 1988 with
the multi-task, multi-model deployment which involves multiple ALVINN [3], where the input was two “retinas“ from a camera
encoders and message transmission systems, may increase the and a laser range finder, and a simple neural network generated
computational burden and potentially lead to sub-optimal use of steering output. NVIDIA designed a prototype end-to-end CNN
compute. system, which reestablished this idea in the new era of GPU
In contrast to its classical counterpart, an end-to-end au- computing [8]. Notable progress has been achieved with the
tonomous system offers several advantages. (a) The most appar- development of deep neural networks, both in imitation learn-
ent merit is its simplicity in combining perception, prediction, ing [15], [16] and reinforcement learning [4], [17], [18], [19].
and planning into a single model that can be jointly trained. (b) The policy distillation paradigm proposed in LBC [5] and related
The whole system, including its intermediate representations, approaches [20], [21], [22], [23] has significantly improved
is optimized towards the ultimate task. (c) Shared backbones closed-loop performance by mimicking a well-behaved expert.
increase computational efficiency. (d) Data-driven optimization To enhance generalization ability due to the discrepancy between
has the potential to improve the system by simply scaling train- the expert and learned policy, several papers [10], [24], [25] have
ing resources. proposed aggregating on-policy data [26] during training.
10166 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO. 12, DECEMBER 2024

Fig. 2. Roadmap of End-to-end Autonomous Driving. We present the key milestones chronologically, grouping similar works under the same theme. The
representative or first work is shown in bold with an illustration, while the date of the rest of the literature in the same theme may vary. We also display the score
for each year’s top entry in the CARLA leaderboard [13] (DS, ranging from 0 to 100) and the recent nuPlan challenge [14] (Score ranging from 0 to 1).

A significant turning point occurred around 2021. With di- motivation, methodologies, benchmarks, and more. In-
verse sensor configurations available within a reasonable com- stead of optimizing a single block, we advocate for a
putational budget, attention was focused on incorporating more philosophy to design the algorithm framework as a whole,
modalities and advanced architectures (e.g., Transformers [27]) with the ultimate target of achieving safe and comfortable
to capture global context and representative features, as in Trans- driving.
Fuser [6], [28] and many variants [29], [30], [31]. Combined b) We extensively investigate the critical challenges that
with more insights about the simulation environment, these concurrent approaches face. Out of the more than 270
advanced designs resulted in a substantial performance boost papers surveyed, we summarize major aspects and provide
on the CARLA benchmark [13]. To improve the interpretability in-depth analysis, including topics on generalizability,
and safety of autonomous systems, approaches [11], [32], [33] language-guided learning, causal confusion, etc.
explicitly involve various auxiliary modules to better supervise c) We cover the broader impact of how to embrace large
the learning process or utilize attention visualization. Recent foundation models and data engines. We believe that this
works prioritize generating safety-critical data [7], [34], [35], line of research and the large scale of high-quality data it
pre-training a foundation model or backbone curated for policy provides could significantly advance this field. To facilitate
learning [12], [36], [37], and advocating a modular end-to-end future research, we maintain an active repository updated
planning philosophy [1], [2], [38], [39]. Meanwhile, the new and with new literature and open-source projects.
challenging CARLA v2 [13] and nuPlan [14] benchmarks have
been introduced to facilitate research into this area. II. METHODS
This section reviews fundamental principles behind most
C. Comparison to Related Surveys existing end-to-end self-driving approaches. Section II-A dis-
We would like to clarify the difference between our survey cusses methods using imitation learning and provides details on
and previous related surveys [40], [41], [42], [43], [44], [45], the two most popular sub-categories, namely behavior cloning
[46], [47], [48]. Some prior surveys [40], [41], [42], [43] cover and inverse optimal control. Section II-B summarizes methods
content similar to ours in the sense of an end-to-end system. that follow the reinforcement learning paradigm.
However, they do not cover new benchmarks and approaches
that arose with the significant recent transition in the field, A. Imitation Learning
and place a minor emphasis on frontiers and challenges. The
Imitation learning (IL), also referred to as learning from
others focus on specific topics in this domain, such as imitation
demonstrations, trains an agent to learn the policy by
learning [44], [45], [46] or reinforcement learning [47], [48].
imitating the behavior of an expert. IL requires a dataset D =
In contrast, our survey provides up-to-date information on the
{ξi } containing trajectories collected under the expert’s policy
latest developments in this field, covering a wide span of topics
πβ , where each trajectory is a sequence of state-action pairs. The
and providing in-depth discussions of critical challenges.
goal of IL is to learn an agent policy π that matches πβ .
The policy π can output planned trajectories or control signals.
D. Contributions Early works usually adopt control outputs, due to the ease of
To summarize, this survey has three key contributions: collection. However, predicting controls at different steps could
a) We provide a comprehensive analysis of end-to-end au- lead to discontinuous maneuvers and the network inherently
tonomous driving for the first time, including high-level specializes to the vehicle dynamics which hinders generalization
CHEN et al.: END-TO-END AUTONOMOUS DRIVING: CHALLENGES AND FRONTIERS 10167

to other vehicles. Another genre of works predicts waypoints. It a reasonable cost c(·) and use algorithmic trajectory samplers to
considers a relatively longer time horizon. Meanwhile, convert- select the trajectory τ ∗ with the minimum cost, as illustrated in
ing trajectories for vehicles to track into control signals needs Fig. 3.
additional controllers, which is non-trivial and involves vehicle Regarding cost design, it has representations including a
models and control algorithms. Since no clear performance gap learned cost volume in a bird’s-eye-view (BEV) [32], joint
has been observed between these two paradigms, we do not energy calculated from other agents’ future motion [69], or a
differentiate them explicitly in this survey. An interesting and set of probabilistic semantic occupancy or freespace layers [39],
more in-depth discussion can be found in [22]. [70], [71]. On the other hand, trajectories are typically sampled
One widely used category of IL is behavior cloning (BC) [49], from a fixed expert trajectory set [1] or processed by parameter
which reduces the problem to supervised learning. Inverse sampling with a kinematic model [32], [38], [39], [70]. Then,
Optimal Control (IOC), also known as Inverse Reinforcement a max-margin loss is adopted as in classic IOC methods to
Learning (IRL) [50] is another type of IL method that utilizes encourage the expert demonstration to have a minimal cost while
expert demonstrations to learn a reward function. We elaborate others have high costs.
on these two categories below. Several challenges exist with cost learning approaches. In
1) Behavior Cloning: In BC, matching the agent’s pol- particular, in order to generate more realistic costs, HD maps,
icy with the expert’s is accomplished by minimizing plan- auxiliary perception tasks, and multiple sensors are typically
ning loss as supervised learning over the collected dataset: incorporated, which increases the difficulty of learning and
E(s,a) (πθ (s), a). Here, (πθ (s), a) represents a loss function constructing datasets for multi-modal multi-task frameworks.
that measures the distance between the agent action and the Nevertheless, the aforementioned cost learning methods sig-
expert action. nificantly enhance the safety and interpretability of decisions
Early applications of BC for driving [3], [8], [51] utilized (see Section IV-F), and we believe that the industry-inspired
an end-to-end neural network to generate control signals from end-to-end system design is a viable approach for real-world
camera inputs. Further enhancements, such as multi-sensor in- applications.
puts [6], [52], auxiliary tasks [16], [28], and improved expert
design [21], have been proposed to enable BC-based end-to-end
driving models to handle challenging urban scenarios. B. Reinforcement Learning
BC is advantageous due to its simplicity and efficiency, as it Reinforcement learning (RL) [72], [73] is a field of learning
does not require hand-crafted reward design, which is crucial for by trial and error. The success of deep Q networks (DQN) [74]
RL. However, there are some common issues. During training, in achieving human-level control on the Atari benchmark [75]
it treats each state as independently and identically distributed, has popularized deep RL. DQN trains a neural network called
resulting in an important problem known as covariate shift. the critic (or Q network), which takes as input the current state
For general IL, several on-policy methods have been proposed and an action, and predicts the discounted return of that action.
to address this issue [26], [53], [54], [55]. In the context of The policy is then implicitly defined by selecting the action with
end-to-end autonomous driving, DAgger [26] has been adopted the highest predicted return.
in [5], [10], [25], [56]. Another common problem with BC RL requires an environment that allows potentially unsafe
is causal confusion, where the imitator exploits and relies on actions to be executed, to collect novel data (e.g., via ran-
false correlations between certain input components and output dom actions). Additionally, RL requires significantly more data
signals. This issue has been discussed in the context of end- to train than IL. For this reason, modern RL methods often
to-end autonomous driving in [57], [58], [59], [60]. These two parallelize data collection across multiple environments [76].
challenging problems are further discussed in Section IV-I and Meeting these requirements in the real world presents great chal-
Section IV-H, respectively. lenges. Therefore, almost all papers that use RL in driving have
2) Inverse Optimal Control: Traditional IOC algorithms only investigated the technique in simulation. Most use different
learn an unknown reward function R(s, a) from expert demon- extensions of DQN. The community has not yet converged on a
strations, where the expert’s reward function can be represented specific RL algorithm.
as a linear combination of features [50], [61], [62], [63], [64]. RL has successfully learned lane following on a real car on
However, in continuous, high-dimensional autonomous driving an empty street [4]. Despite this encouraging result, it must
scenarios, the definition of the reward is implicit and difficult to be noted that a similar task was already accomplished by IL
optimize. three decades prior [3]. To date, no report has shown results for
Generative adversarial imitation learning [65], [66], [67] is end-to-end training with RL that are competitive with IL. The
a specialized approach in IOC that designs the reward function reason for this failure likely is that the gradients obtained via
as an adversarial objective to distinguish the expert and learned RL are insufficient to train deep perception architectures (i.e.,
policies, similar to the concept of generative adversarial net- ResNet) required for driving. Models used in benchmarks like
works [68]. Recently, several works propose optimizing a cost Atari, where RL succeeds, are relatively shallow, consisting of
volume or cost function with auxiliary perceptual tasks. Since only a few layers [77].
a cost is an alternative representation of the reward, we classify RL has been successfully applied in end-to-end driving
these methods as belonging to the IOC domain. We define the when combined with supervised learning (SL). Implicit affor-
cost learning framework as follows: end-to-end approaches learn dances [18], [19] pre-train the CNN encoder using SL with tasks
10168 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO. 12, DECEMBER 2024

Fig. 3. Overview of methods in end-to-end autonomous driving. We illustrate three popular paradigms, including two imitation learning frameworks (behavior
cloning and inverse optimal control), as well as online reinforcement learning.

like semantic segmentation. In the second stage, this encoder is in prize money for autonomously navigating a 240 km route
frozen, and a shallow policy head is trained on the features from through the Mojave desert, which no team achieved [85]. The
the frozen encoder with a modern version of Q-learning [78]. RL final series event, called the DARPA Urban Challenge, required
can also be used to finetune full networks that were pre-trained vehicles to navigate a 96 km mock-up town course, adhering
using IL [17], [79]. to traffic laws and avoiding obstacles [86]. These races fostered
RL can also been effectively applied, if the network has access important developments in autonomous driving, such as LiDAR
to privileged simulator information. [48], [80], [81]. Privileged sensors. Following this spirit, the University of Michigan estab-
RL agents can be used for dataset curation. Roach [21] trains an lished MCity [87], a large controlled real-world environment
RL agent on privileged BEV semantic maps and uses the policy designed to facilitate testing autonomous vehicles. However,
to automatically collect a dataset with which a downstream IL such academic ventures have not been widely employed for end-
agent is trained. WoR [20] employs a Q-function and tabular to-end systems due to a lack of data and vehicles. In contrast, in-
dynamic programming to generate additional or improved labels dustries with the resources to deploy fleets of driverless vehicles
for a static dataset. could rely on real-world evaluation to benchmark improvements
A challenge in the field is to transfer the findings from in their algorithms.
simulation to the real world. In RL, the objective is expressed as
reward functions, and many algorithms require them to be dense
and provide feedback at each environment step. Current works B. Online/Closed-Loop Simulation
typically use simple objectives, such as progress and collision Conducting tests of self-driving systems in the real world
avoidance. These simplistic designs potentially encourage risky is costly and risky. To address this challenge, simulation is a
behaviors [80]. Devising or learning better reward functions viable alternative [14], [88], [89], [90], [91], [92]. Simulators
remains an open problem. Another direction would be to develop facilitate rapid prototyping and testing, enable the quick iteration
RL algorithms that can handle sparse rewards, enabling the opti- of ideas, and provide low-cost access to diverse scenarios for
mization of relevant metrics directly. RL can be effectively com- unit testing. In addition, simulators offer tools for measuring
bined with world models [82], [83], [84], though this presents performance accurately. However, their primary disadvantage
specific challenges (See Section IV-C). Current RL solutions for is that the results obtained in a simulated environment do not
driving rely heavily on low-dimensional representations of the necessarily generalize to the real world (Section IV-I-3).
scene, and this issue is further discussed in Section IV-B-2. Closed-loop evaluation involves building a simulated environ-
ment that closely mimics a real-world driving environment. The
III. BENCHMARKING evaluation entails deploying the driving system in simulation and
measuring its performance. The system has to navigate safely
Autonomous driving systems require a comprehensive eval-
through traffic while progressing toward a designated goal loca-
uation to ensure safety. Researchers must benchmark these sys-
tion. There are four main sub-tasks involved in developing such
tems using appropriate datasets, simulators, metrics, and hard-
simulators: parameter initialization, traffic simulation, sensor
ware to accomplish this. This section delineates three approaches
simulation, and vehicle dynamics simulation. We briefly de-
for benchmarking end-to-end autonomous driving systems: (1)
scribe these sub-tasks below, followed by a summary of currently
real-world evaluation, (2) online or closed-loop evaluation in
available open-source simulators for closed-loop benchmarks.
simulation, and (3) offline or open-loop evaluation on driving
1) Parameter Initialization: Simulation offers the benefit of
datasets. We focus on the scalable and principled online simula-
a high degree of control over the environment, including weather,
tion setting and summarize real-world and offline assessments
maps, 3D assets, and low-level attributes such as the arrangement
for completeness.
of objects in a traffic scene. While powerful, the number of
these parameters is substantial, resulting in a challenging design
A. Real-World Evaluation
problem. Current simulators tackle this in two ways:
Early efforts on benchmarking self-driving involved real- Procedural Generation: Traditionally, initial parameters are
world evaluation. Notably, DARPA initiated a series of races hand-tuned by 3D artists and engineers [88], [89], [90],
to advance autonomous driving. The first event offered $1M [91]. This limits scalability. Recently, some of the simulation
CHEN et al.: END-TO-END AUTONOMOUS DRIVING: CHALLENGES AND FRONTIERS 10169

properties can be sampled from a probabilistic distribution TABLE I

OPEN-SOURCE SIMULATORS WITH ACTIVE BENCHMARKS FOR CLOSED-LOOP
with computer algorithms, which we refer to as procedural EVALUATION OF AUTONOMOUS DRIVING
generation [93]. Procedural generation algorithms combine
rules, heuristics, and randomization to create diverse road net-
works, traffic patterns, lighting conditions, and object place-
ments [94], [95]. Due to its efficiency compared to fully manual
design, it has become one of the most commonly used methods
of initialization for video games and simulations. Nevertheless,
the process still needs pre-defined parameters and algorithms
to control generation reliability, which is time-consuming and
requires a lot of expertise. scans that the driving system would receive from different
Data-Driven: Data-driven approaches for simulation initial- viewpoints in the simulator [105], [106], [107]. This process
ization aim to learn the required parameters. Arguably, the needs to take into account noise and occlusions to realistically
simplest way is to sample from real-world driving logs [14], assess the autonomous system. There are two main branches of
[92], where parameters such as road maps or traffic patterns are ideas concerning sensor simulation, as described below.
directly extracted from pre-recorded datasets. The advantage Graphics-Based: Recent computer graphics simulators use
of log sampling is its ability to capture the natural variability 3D models of the environment, along with traffic entity models,
present in real-world data, leading to more realistic simulation to generate sensor data via approximations of physical rendering
scenarios. However, it may not encompass rare situations that processes in the sensors [89], [90]. For example, this can involve
are critical for testing the robustness of autonomous driving occlusions, shadows, and reflections present in real-world envi-
systems. The initial parameters can be optimized to increase ronments while simulating camera images. However, the realism
the representation of such scenarios [7], [34], [35]. Another of graphics-based simulation is often subpar or comes at the cost
advanced data-driven approach to initialization is generative of heavy computation, making parallelization non-trivial [108].
modeling, where machine learning algorithms are utilized to It is closely tied to the quality of the 3D models and the ap-
learn the underlying structure and distributions of real-world proximations used in modeling the sensors. A comprehensive
data. They can then generate novel scenarios that resemble the survey of graphics-based rendering for driving data is provided
real world but were not included in the original data [96], [97], in [109].
[98], [99]. Data-Driven: Data-driven sensor simulation leverages real-
2) Traffic Simulation: Traffic simulation involves generating world sensor data to create the simulation where both the ego
and positioning virtual entities in the environment with realistic vehicle and background traffic may move differently from the
motion [97], [100]. These entities often include vehicles (such as way they did in recordings [110], [111], [112]. Popular methods
cars, motorcycles, bicycles, etc.) and pedestrians. Traffic simula- are Neural Radiance Fields (NeRF) [113] and 3D Gaussian
tors must account for the effects of speed, acceleration, braking, Splatting [114], which can generate novel views of a scene by
obstructions, and the behavior of other entities. Moreover, traffic learning an implicit representation of the scene’s geometry and
light states must be periodically updated to simulate realistic city appearance. These methods can produce more realistic sensor
driving. There are two popular approaches for traffic simulation, data visually than graphics-based approaches, but they have
which we describe below. limitations such as high rendering times or requiring indepen-
Rule-Based: Rule-based traffic simulators use pre-defined dent training for each scene being reconstructed [107], [115],
rules to generate the motion of traffic entities. The most promi- [116], [117], [118]. Another approach to data-driven sensor
nent implementation of this concept is the Intelligent Driver simulation is domain adaptation, which aims to minimize the gap
Model (IDM) [101]. IDM is a car-following model that computes between real and graphics-based simulated sensor data [119].
acceleration for each vehicle based on its current speed, the Deep learning techniques such as GANs can be employed to
speed of the leading vehicle, and a desired safety distance. improve realism (Section IV-I-3).
Although widely used and straightforward, this approach may be 4) Vehicle Dynamics Simulation: The final aspect of driv-
inadequate to simulate realistic motion and complex interactions ing simulation pertains to ensuring that the simulated vehicle
in urban environments. adheres to physically plausible motion. Most existing publicly
Data-Driven: Realistic human traffic behavior is highly inter- available simulators use highly simplified vehicle models, such
active and complex, including lane changing, merging, sudden as the unicycle model [120] or the bicycle model [121]. How-
stopping, etc. To model such behavior, data-driven traffic sim- ever, in order to facilitate seamless transfer of algorithms from
ulation utilizes data collected from real-world driving. These simulation to the real world, it is essential to incorporate more
models can capture more nuanced, realistic behavior but require accurate physical modeling of vehicle dynamics. For instance,
significant amounts of labeled data for training. A wide variety of CARLA adopts a multi-body system approach, representing a
learning-based techniques have been proposed for this task [97], vehicle as a collection of sprung masses on four wheels. For a
[98], [100], [102], [103], [104]. comprehensive review, please refer to [122].
3) Sensor Simulation: Sensor simulation is crucial for evalu- 5) Benchmarks: We give a succinct overview of end-to-end
ating end-to-end self-driving systems. This involves generating driving benchmarks available up to date in Table I. In 2019,
simulated raw sensor data, such as camera images or LiDAR the original benchmark released with CARLA [90] was solved
10170 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO. 12, DECEMBER 2024

with near-perfect scores [5]. The subsequent NoCrash bench-

mark [123] involves training on a single CARLA town under
specific weather conditions and testing generalization to another
town and set of weathers. Instead of a single town, the Town05
benchmark [6] involves training on all available towns while
withholding Town05 for testing. Similarly, the LAV benchmark
trains on all towns except Town02 and Town05, which are both
reserved for testing. Roach [21] uses a setting with 3 test towns,
albeit all seen during training, and without the safety-critical
scenarios in Town05 and LAV. Finally, the Longest6 bench-
mark [28] uses 6 test towns. Two online servers, the leaderboard
(v1 and v2) [13], ensure fair comparisons by keeping evaluation
routes confidential. Leaderboard v2 is highly challenging due
to the long route length (over 8 km on average, as opposed to
1-2 km on v1) and a wide variety of new traffic scenarios.
The nuPlan simulator is currently accessible for evaluating
end-to-end systems via the NAVSIM project [124]. Further, there Fig. 4. Examples of input modality and fusion strategy. Different modalities
are two benchmarks on which agents input maps and object have distinct characteristics, leading to the challenge of effective sensor fusion.
We take point clouds and images as examples to depict various fusion strategies.
properties via the data-driven parameter initialization for nuPlan
(Section III-B-1). Val14, proposed in [125], uses a validation
split of nuPlan. The leaderboard, a submission server with the of real-world driving traversals with varying degrees of diffi-
private test set, was used in the 2023 nuPlan challenge, but it is culty. However, open-loop results do not provide conclusive
no longer public for submissions. evidence of improved driving behavior in closed-loop, due to the
aforementioned drawbacks [123], [125], [129], [130]. Overall, a
realistic closed-loop benchmarking, if available and applicable,
is recommended in future research.
C. Offline/Open-Loop Evaluation
Open-loop evaluation mainly assesses a system’s perfor- IV. CHALLENGES
mance against pre-recorded expert driving behavior. This
Following each topic illustrated in Fig. 1, we now walk
method requires evaluation datasets that include (1) sensor read-
through current challenges, related works or potential reso-
ings, (2) goal locations, and (3) corresponding future driving
lutions, risks, and opportunities. We start with challenges in
trajectories, usually obtained from human drivers. Given sensor
handling different input modalities in Section IV-A, followed by
inputs and goal locations as inputs, performance is measured
a discussion on visual abstraction for efficient policy learning
by comparing the system’s predicted future trajectory against
in Section IV-B. Further, we introduce learning paradigms such
the trajectory in the driving log. Systems are evaluated based on
as world model learning (Section IV-C), multi-task frameworks
how closely their trajectory predictions match the human ground
(Section IV-D), and policy distillation (Section IV-E). Finally,
truth, as well as auxiliary metrics such as the collision probability
we discuss general issues that impede safe and reliable end-
with other agents. The advantage of open-loop evaluation is that
to-end autonomous driving, including interpretability in Sec-
it is easy to implement using realistic traffic and sensor data,
tion IV-F, safety guarantees in Section IV-G, causal confusion
as it does not require a simulator. However, the key disadvan-
in Section IV-H, and robustness in Section IV-I.
tage is that it does not measure performance in the actual test
distribution encountered during deployment. During testing, the
driving system may deviate from the expert driving corridor, and A. Dilemma Over Sensing and Input Modalities
it is essential to verify the system’s ability to recover from such 1) Sensing and Multi-Sensor Fusion. Sensing: Though early
drift (Section IV-I-2). Furthermore, the distance between the work [8] successfully achieved following a lane with a monoc-
predicted and the recorded trajectories is not an ideal metric in a ular camera, this single input modality cannot handle complex
multi-modal scenario. For example, in the case of merging into scenarios. Therefore, various sensors in Fig. 4 have been intro-
a turning lane, both the options of merging immediately or later duced for recent self-driving vehicles. Particularly, RGB images
could be valid, but open-loop evaluation penalizes the option from cameras replicate how humans perceive the world, with
that was not observed in the data. Therefore, besides measuring abundant semantic details; LiDARs or stereo cameras provide
collision probability and prediction errors, a few metrics were accurate 3D spatial knowledge. Emerging sensors like mmWave
proposed to cover more comprehensive aspects such as traffic radars and event cameras excel at capturing objects’ relative
violations, progress, and driving comfort [125]. movement. Additionally, vehicle states from speedometers and
This approach requires comprehensive datasets of trajecto- IMUs, together with navigation commands, are other lines of
ries to draw from. The most popular datasets for this purpose input that guide the driving system. However, various sensors
include nuScenes [126], Argoverse [127], Waymo [128], and possess distinct perspectives, data distributions, and huge price
nuPlan [14]. All of these datasets comprise a large number gaps, thereby posing challenges in effectively designing the
CHEN et al.: END-TO-END AUTONOMOUS DRIVING: CHALLENGES AND FRONTIERS 10171

sensory layout and fusing them to complement each other for also formulate the driving task as a question-answering problem
autonomous driving. and construct corresponding benchmarks [171], [172]. They
Multi-sensor fusion has predominantly been discussed in highlight that LLMs offer opportunities to handle sophisticated
perception-related fields, e.g., object detection [131], [132] instructions and generalize to different data domains, which
and semantic segmentation [133], [134], and is typically share similar advantages to applications in robotic areas [173].
categorized into three groups: early, mid, and late fusion. However, LLMs for on-road driving could be challenging at
End-to-end autonomous driving algorithms explore similar present, considering their long inference time, low quantitative
fusion schemes. Early fusion combines sensory inputs before accuracy, and instability of outputs. Potential resolutions could
feeding them into shared feature extractors, where concatenation be employing LLMs on the cloud specifically for complex sce-
is a common way for fusion [32], [135], [136], [137], [138]. To narios and using them solely for high-level behavior prediction.
resolve the view discrepancy, some works project point clouds
on images [139] or vice versa (predicting semantic labels for
LiDAR points [52], [140]). On the other hand, late fusion
combines multiple results from multi-modalities. It is less B. Dependence on Visual Abstraction
discussed due to its inferior performance [6], [141]. Contrary End-to-end autonomous driving systems roughly have two
to these methods, middle fusion achieves multi-sensor fusion stages: encoding the state into a latent feature representation,
within the network by separately encoding inputs and then and then decoding the driving policy with intermediate fea-
fusing them at the feature level. Naive concatenation is also tures. In urban driving, the input state, i.e., the surrounding
frequently adopted [15], [22], [30], [142], [143], [144], [145], environment and ego state, is much more diverse and high-
[146]. Recently, works have employed Transformers [27] to dimensional compared to common policy learning benchmarks
model interactions among features [6], [28], [29], [147], [148]. such as video games [18], [174], which might lead to the
The attention mechanism in Transformers has demonstrated misalignment between representations and necessary attention
great effectiveness in aggregating the context of different sensor areas for policy making. Hence, it is helpful to design “good”
inputs and achieving safer end-to-end driving. intermediate perception representations, or first pre-train visual
Inspired by the progress in perception, it is beneficial to model encoders using proxy tasks. This enables the network to extract
modalities in a unified space such as BEV [131], [132]. End- useful information for driving effectively, thus facilitating the
to-end driving also requires identifying policy-related contexts subsequent policy stage. Furthermore, this can improve the
and discarding irrelevant details. We discuss perception-based sample efficiency for RL methods.
representations in Section IV-B-1. Besides, the self-attention 1) Representation Design: Naive representations are ex-
layer, interconnecting all tokens freely, incurs a significant tracted with various backbones. Classic convolutional neural
computational cost and cannot guarantee useful information networks (CNNs) still dominate, with advantages in transla-
extraction. Advanced Transformer-based fusion mechanisms in tion equivariance and high efficiency [175]. Depth-pre-trained
the perception field, such as [149], [150], hold promise for CNNs [176] significantly boost perception and downstream
application to the end-to-end driving task. performance. In contrast, Transformer-based feature extrac-
2) Language as Input: Humans drive using both visual per- tors [177], [178] show great scalability in perception tasks while
ception and intrinsic knowledge which together form causal be- not being widely adopted for end-to-end driving yet. For driving-
haviors. In areas related to autonomous driving such as embodied specific representations, researchers introduce the concept of
AI, incorporating natural language as fine-grained knowledge bird’s-eye-view (BEV), fusing different sensor modalities and
and instructions to control the visuomotor agent has achieved temporal information within a unified 3D space [131], [132],
notable progress [151], [152], [153], [154]. However, compared [179], [180]. It also facilitates easy adaptions to downstream
to robotic applications, the driving task is more straightforward tasks [2], [30], [181]. In addition, grid-based 3D occupancy is
without the need for task decomposition, and the outdoor envi- developed to capture irregular objects and used for collision
ronment is much more complex with highly dynamic agents but avoidance in planning [182]. Nevertheless, the dense representa-
few distinctive anchors for grounding. tion brings huge computation costs compared to BEV methods.
To incorporate linguistic knowledge into driving, a few Another unsettled problem is representations of the map.
datasets are proposed to benchmark outdoor grounding and Traditional autonomous driving relies on HD Maps. Due to the
visual language navigation tasks [155], [156], [157], [158]. high cost of availability of HD Maps, online mapping methods
HAD [159] takes human-to-vehicle advice and adds a visual have been devised with different formulations, such as BEV
grounding task. Sriram et al. [160] translate natural language segmentation [183], vectorized lanlines [184], centerlines and
instructions into high-level behaviors, while [161], [162] di- their topology [185], [186], and lane segments [187]. However,
rectly ground the texts. CLIP-MC [163] and LM-Nav [164] the most suitable formulation for end-to-end systems remains
utilize CLIP [165] to extract both linguistic knowledge from unvalidated.
instructions and visual features from images. Though various representation designs offer possibilities of
Recently, observing the rapid development of large language how to design the subsequent decision-making process, they
models (LLMs) [166], [167], works encode the perceived scene also place challenges as co-designing both parts is necessary
into tokens and prompt them to LLMs for control prediction for a whole framework. Besides, given the trends observed in
and text-based explanations [168], [169], [170]. Researchers several simple yet effective approaches with scaling up training
10172 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO. 12, DECEMBER 2024

resources [22], [28], the ultimate necessity of explicit represen- transition dynamics into a non-reactive world model and a sim-
tations such as maps is uncertain. ple kinematic bicycle model. In [137], a probabilistic sequential
2) Representation Learning: Representation learning often latent model is used as the world model. To address the potential
incorporates certain inductive biases or prior information. There inaccuracy of the learned world model, Henaff et al. [200] train
inevitably exist possible information bottlenecks in the learned the policy network with dropout regularization to estimate the
representation, and redundant context unrelated to decisions uncertainty cost. Another approach [201] uses an ensemble of
may be removed. multiple world models to provide uncertainty estimation, based
Some early methods directly utilize semantic segmentation on which imaginary rollouts could be truncated and adjusted
masks from off-the-shelf networks as the input representation accordingly. Motivated by Dreamer [82], ISO-Dream [202]
for subsequent policy training [188], [189]. SESR [190] further decouples visual dynamics into controllable and uncontrollable
encodes segmentation masks into class-disentangled representa- states, and trains the policy on the disentangled states.
tions through a VAE [191]. In [192], [193], predicted affordance It is worth noting that learning world models in raw image
indicators, such as traffic light states, offset to the lane center, space is non-trivial for autonomous driving. Important small
and distance to the leading vehicle, are used as representations details, such as traffic lights, would easily be missed in pre-
for policy learning. dicted images. To tackle this, GenAD [203] and DriveWM [204]
Observing that results like segmentation as representations employ the prevailing diffusion technique [205]. MILE [206]
can create bottlenecks defined by humans and result in loss incorporates the Dreamer-style world model learning in the
of useful information, some have chosen intermediate features BEV segmentation space as an auxiliary task besides imitation
from pre-training tasks as effective representations for RL train- learning. SEM2 [136] also extends the Dreamer structure but
ing [18], [19], [194], [195]. In [196], latent features in VAE with BEV map inputs, and uses RL for training. Besides directly
are augmented by attention maps obtained from the diffused using the learned world model for MBRL, DeRL [195] combines
boundary of segmentation and depth maps to highlight important a model-free actor-critic framework with the world model, by
regions. TARP [197] utilizes data from a series of previous tasks fusing self-assessments of the action or state from both models.
to perform different tasks-related prediction tasks to acquire use- World model learning for end-to-end autonomous driving
ful representations. In [198], the latent representation is learned is an emerging and promising direction as it greatly reduces
by approximating the π-bisimulation metric, which is comprised the sample complexity for RL, and understanding the world
of differences of rewards and outputs from the dynamics model. is helpful for driving. However, as the driving environment is
ACO [36] learns discriminative features by adding steering angle highly complex and dynamic, further study is still needed to
categorization into the contrastive learning structure. Recently, determine what needs to be modeled and how to model the world
PPGeo [12] proposes to learn effective representation through effectively.
motion prediction together with depth estimation in a self-
supervised way on uncalibrated driving videos. ViDAR [199]
utilizes the raw image-point cloud pairs and pretrains the visual D. Reliance on Multi-Task Learning
encoder with a point cloud forecasting pre-task. These works Multi-task learning (MTL) involves jointly performing sev-
demonstrate that self-supervised representation learning from eral related tasks based on a shared representation through
large-scale unlabeled data for policy learning is promising and separate heads. MTL provides advantages such as computational
worthy of future exploration. cost reduction, the sharing of relevant domain knowledge, and
the ability to exploit task relationships to improve model’s gen-
eralization ability [207]. Consequently, MTL is well-suited for
C. Complexity of World Modeling for Model-Based RL end-to-end driving, where the ultimate policy prediction requires
Besides the ability to better abstract perceptual representa- a comprehensive understanding of the environment. However,
tions, it is essential for end-to-end models to make reasonable the optimal combination of auxiliary tasks and appropriate
predictions about the future to take safe maneuvers. In this sec- weighting of losses to achieve the best performance presents
tion, we mainly discuss the challenges of current model-based a significant challenge.
policy learning works, where a world model provides explicit In contrast to common vision tasks where dense predictions
future predictions for the policy model. are closely correlated, end-to-end driving predicts a sparse sig-
Deep RL typically suffers from the high sample complex- nal. The sparse supervision increases the difficulty of extracting
ity, which is pronounced in autonomous driving. Model-based useful information for decision-making in the encoder. For
reinforcement learning (MBRL) offers a promising direction to image input, auxiliary tasks such as semantic segmentation [28],
improve sample efficiency by allowing agents to interact with the [31], [139], [208], [209], [210] and depth estimation [28],
learned world model instead of the actual environment. MBRL [31], [208], [209], [210] are commonly adopted in end-to-end
methods employ an explicit world (environment) model, which autonomous driving models. Semantic segmentation helps the
is composed of transition dynamics and reward functions. This model gain a high-level understanding of the scene; depth esti-
is particularly helpful in driving, as simulators like CARLA are mation enables the model to capture the 3D geometry of the envi-
relatively slow. ronment and better estimate distances to critical objects. Besides
However, modeling the highly dynamic environment is a chal- auxiliary tasks on perspective images, 3D object detection [28],
lenging task. To simplify the problem, Chen et al. [20] factor the [31], [52] is also useful for LiDAR encoders. As BEV becomes
CHEN et al.: END-TO-END AUTONOMOUS DRIVING: CHALLENGES AND FRONTIERS 10173

L2 feature loss between teacher and student networks, while

CaT [23] aligns features in BEV. WoR [20] learns a model-based
action-value function and then uses it to supervise the visuomo-
tor policy. Roach [21] trains a stronger privileged expert with
RL, eliminating the upper bound of BC. It incorporates multiple
distillation targets, i.e., action distribution, values/rewards, and
latent features. By leveraging the powerful RL expert, TCP [22]
achieves a new state-of-the-art on the CARLA leaderboard with
a single camera as visual input. DriveAdpater [181] learns a
perception-only student and adapters with the feature alignment
objective. The decoupled paradigm fully enjoys the teacher’s
knowledge and student’s training efficiency.
Though huge efforts have been devoted to designing a robust
expert and transferring knowledge at various levels, the teacher-
student paradigm still suffers from inefficient distillation. For
Fig. 5. Policy distillation. (a) The privileged agent learns a robust policy with
access to privileged ground-truth information. The expert is labeled with dashed instance, the privileged agent has access to ground-truth states of
lines to indicate that it is not mandatory if the privileged agent is trained via RL. traffic lights, which are small objects in images and thus hard to
(b) The sensorimotor agent imitates the privileged agent through both feature distill corresponding features. As a result, the visuomotor agents
distillation and output imitation.
exhibit large performance gaps compared to their privileged
agents. It may also lead to causal confusion for students (see
a natural and popular representation for autonomous driving, Section IV-H). It is worth exploring how to draw more inspi-
tasks such as BEV segmentation are included in models [11], ration from general distillation methods in machine learning to
[23], [28], [29], [30], [31], [52], [148] that aggregate features minimize the gap.
in BEV space. Moreover, in addition to these vision tasks, [29],
[208], [211] also predict visual affordances including traffic light
states, distances to opposite lanes, etc. Nonetheless, constructing
F. Lack of Interpretability
large-scale datasets with multiple types of aligned and high-
quality annotations is non-trivaial for real-world applications, Interpretability plays a critical role in autonomous driv-
which remain as a great concern due to current models’ reliance ing [214]. It enables engineers to better debug the system,
on MTL. provides performance guarantees from a societal perspective,
and promotes public acceptance. Achieving interpretability for
end-to-end driving models, which are often referred to as “black
E. Inefficient Experts and Policy Distillation
boxes”, is more essential and challenging.
As imitation learning, or its predominant sub-category, be- Given trained models, some post-hoc X-AI (explainable AI)
havior cloning, is simply supervised learning that mimics expert techniques could be applied to gain saliency maps [208], [215],
behaviors, corresponding methods usually follow the “Teacher- [216], [217], [218]. Saliency maps highlight specific regions in
Student” paradigm. There lie two main challenges: (1) Teachers, the visual input on which the model primarily relies for planning.
such as the handcrafted expert autopilot provided by CARLA, However, this approach provides limited information, and its
are not perfect drivers, though having access to ground-truth effectiveness and validity are difficult to evaluate. Instead, we
states of surrounding agents and maps. (2) Students are super- focus on end-to-end frameworks that directly enhance inter-
vised by the recorded output with sensor input only, requiring pretability in their model design. We introduce each category
them to extract perceptual features and learn policy from scratch of interpretability in Fig. 6 below.
simultaneously. Attention Visualization: The attention mechanism provides a
A few studies propose to divide the learning process into two certain degree of interpretability. In [33], [208], [211], [218],
stages, i.e., training a stronger teacher network and then distilling [219], a learned attention weight is applied to aggregate impor-
the policy to the student. In particular, Chen et al. [5], [52] first tant features from intermediate feature maps. Attention weights
employ a privileged agent to learn how to act with access to can also adaptively combine ROI pooled features from different
the state of the environment, then let the sensorimotor agent object regions [220] or a fixed grid [221]. NEAT [11] iteratively
(student) closely imitate the privileged agent with distillation at aggregates features to predict attention weights and refine the ag-
the output stage. More compact BEV representations as input gregated feature. Recently, Transformer attention blocks are em-
for the privileged agent provide stronger generalization abilities ployed to better fuse different sensor inputs, and attention maps
and supervision than the original expert. The process is depicted display important regions in the input for driving decisions [28],
in Fig. 5. [29], [31], [147], [222]. In PlanT [223], attention layers process
Apart from solely supervising planning results, several works features from different vehicles, providing interpretable insights
also distill knowledge at the feature level. For example, FM- into the corresponding action. Similar to post-hoc saliency meth-
Net [212] employs segmentation and optical flow models as ods, although attention maps offer straightforward clues about
auxiliary teachers to guide feature training. SAM [213] adds models’ focus, their faithfulness and utility remain limited.
10174 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO. 12, DECEMBER 2024

resort to the progress of multi-modality and foundation models,

using LLMs/VLMs to provide decision-related explanations, as
discussed in Section IV-A-2.
Uncertainty Modeling: Uncertainty is a quantitative ap-
proach for interpreting the dependability of deep learning model
outputs [231], [232], which can be helpful for designers and
users to identify uncertain cases for improvement or necessary
intervention. For deep learning, there are two types of uncer-
tainty: aleatoric uncertainty and epistemic uncertainty. Aleatoric
uncertainty is inherent to the task, while epistemic uncertainty is
due to limited data or modeling capacity. In [233], authors lever-
age certain stochastic regularizations in the model to perform
multiple forward passes as samples to measure the uncertainty.
However, the requirement of multiple forward passes is not
Fig. 6. Summary of the different forms of interpretability. They aid in human feasible in real-time scenarios. Loquercio et al. [232] and Filos
comprehension of decision-making processes of end-to-end models and the et al. [234] propose capturing epistemic uncertainty with an
reliability of outputs. ensemble of expert likelihood models and aggregating the results
to perform safe planning. Regarding methods modeling aleatoric
Interpretable Tasks: Many IL-based works introduce inter- uncertainty, driving actions/planning and uncertainty (usually
pretability by decoding the latent feature representations into represented by variance) are explicitly predicted in [146], [235],
other meaningful information besides policy prediction, such [236]. Such methods directly model and quantify the uncertainty
as semantic segmentation [2], [11], [15], [28], [29], [31], [52], at the action level as a variable for the network to predict. The
[139], [163], [208], [209], [210], [224], depth estimation [15], planner would generate the final action based on the predicted
[28], [31], [208], [209], object detection [2], [28], [31], [52], uncertainty, either choosing the action with the lowest uncer-
affordance predictions [29], [208], [211], motion prediction [2], tainty from multiple actions [235] or generating a weighted com-
[52], and gaze map estimation [225]. Although these methods bination of proposed actions based on the uncertainties [146].
provide interpretable information, most of them only treat these Currently, predicted uncertainty is mainly utilized in combina-
predictions as auxiliary tasks [11], [15], [28], [31], [139], [208], tion with hard-coded rules. Exploring better ways to model and
[209], [211], with no explicit impact on final driving decisions. utilize uncertainty for autonomous driving is necessary.
Some [29], [52] do use these outputs for final actions, but they are
incorporated solely for performing an additional safety check.
G. Lack of Safety Guarantees
Rules Integration and Cost Learning: As discussed in Sec-
tion II-A-2, cost learning-based methods share similarities with Ensuring safety is of utmost importance when deploying
traditional modular systems and thus exhibit a certain level of autonomous driving systems in real-world scenarios. However,
interpretability. NMP [32] and DSDNet [226] construct the cost the learning-based nature of end-to-end frameworks inherently
volume in conjunction with detection and motion prediction lacks precise mathematical guarantees regarding safety, unlike
results. P3 [39] combines predicted semantic occupancy maps traditional rule-based approaches [237].
with comfort and traffic rules constraints to construct the cost Nevertheless, it should be noted that modular driving stacks
function. Various representations, such as probabilistic occu- have already incorporated specific safety-related constraints or
pancy and temporal motion fields [1], emergent occupancy [71], optimizations within their motion planning or speed prediction
and freespace [70], are employed to score sampled trajectories. modules to enforce safety [238], [239], [240]. These mecha-
In [38], [125], [227], human expertise and pre-defined rules nisms can potentially be adapted for integration into end-to-end
including safety, comfort, traffic rules, and routes based on models as post-process steps or safety checks, thereby provid-
perception and prediction outputs are explicitly included to ing additional safety guarantees. Furthermore, the intermediate
form the cost for trajectory scoring, demonstrating improved interpretability predictions, as discussed in Section IV-F, such
robustness and safety. as detection and motion prediction results, can be utilized in
Linguistic Explainability: As one aspect of interpretability post-processing procedures.
is to help humans understand the system, natural language is
a suitable choice for this purpose. Kim et al. [33] and Xu
H. Causal Confusion
et al. [228] develop datasets pairing driving videos or images
with descriptions and explanations, and propose end-to-end Driving is a task that exhibits temporal smoothness, which
models with both control and explanation outputs. BEEF [229] makes past motion a reliable predictor of the next action. How-
fuses the predicted trajectory and the intermediate perception ever, methods trained with multiple frames can become overly
features to predict justifications for the decision. ADAPT [230] reliant on this shortcut [241] and suffer from catastrophic failure
proposes a Transformer-based network to jointly estimate ac- during deployment. This problem is referred to as the copycat
tion, narration, and reasoning. Recently, [169], [171], [172] problem [57] in some works and is a manifestation of causal
CHEN et al.: END-TO-END AUTONOMOUS DRIVING: CHALLENGES AND FRONTIERS 10175

vehicles’ past states. This technique has been used in multiple

works [1], [32], [52], though it was not motivated this way.
However, these studies have used environments that are modi-
fied to simplify studying the causal confusion problem. Showing
performance improvements in state-of-the-art settings as men-
tioned in Section III-B-5 remains an open problem.

I. Lack of Robustness
1) Long-Tailed Distribution: One important aspect of the
long-tailed distribution problem is dataset imbalance, where a
few classes make up the majority, as shown in Fig. 8(a). This
poses a big challenge for models to generalize to diverse environ-
Fig. 7. Causal Confusion. The current action of a car is strongly correlated ments. Various methods mitigate this issue with data processing,
with low-dimensional spurious features such as the velocity or the car’s past
trajectory. End-to-End models may latch on to them leading to causal confusion. including over-sampling [246], [247], under-sampling [248],
[249], and data augmentation [250], [251]. Besides, weighting-
based approaches [252], [253] are also commonly used.
confusion [242], where access to more information leads to In the context of end-to-end autonomous driving, the long-
worse performance. tailed distribution issue is particularly severe. Most drives are
Causal confusion in imitation learning has been a persistent repetitive and uninteresting e.g., following a lane for many
challenge for nearly two decades. One of the earliest reports of frames. Conversely, interesting safety-critical scenarios occur
this effect was made by LeCun et al. [243]. They used a single rarely but are diverse in nature, and hard to replicate in the
input frame for steering prediction to avoid such extrapolation. real world for safety reasons. To tackle this, some works rely
Though simplistic, this is still a preferred solution in current on handcrafted scenarios [13], [100], [254], [255], [256] to
state-of-the-art IL methods [22], [28]. Unfortunately, using a generate more diverse data in simulation. LBC [5] leverages the
single frame makes it hard to extract the motion of surrounding privileged agent to create imaginary supervisions conditioned on
actors. Another source of causal confusion is speed measure- different navigational commands. LAV [52] includes trajectories
ment [16]. Fig. 7 showcases an example of a car waiting at a red of non-ego agents for training to promote data diversity. In [257],
light. The action of the car could highly correlate with its speed a simulation framework is proposed to apply importance-
because it has waited for many frames where the speed is zero sampling strategies to accelerate the evaluation of rare-event
and the action is the brake. Only when the traffic light changes probabilities.
from red to green does this correlation break down. Another line of research [7], [34], [35], [258], [259], [260]
There are several approaches to combat the causal confusion generates safety-critical scenarios in a data-driven manner
problem when using multiple frames. In [57], the authors attempt through adversarial attacks. In [258], Bayesian Optimization is
to remove spurious temporal correlations from the bottleneck employed to generate adversarial scenarios. Learning to col-
representation by training an adversarial model that predicts lide [35] represents driving scenarios as the joint distribution
the ego agent’s past action. Intuitively, the resulting min-max over building blocks and applies policy gradient RL methods to
optimization trains the network to eliminate its past from in- generate risky scenarios. AdvSim [34] modifies agents’ trajecto-
termediate layers. It works well in MuJoCo but does not scale ries to cause failures, while still adhering to physical plausibility.
to complex vision-based driving. OREO [59] maps images to KING [7] proposes an optimization algorithm for safety-critical
discrete codes representing semantic objects and applies random perturbations using gradients through differentiable kinematics
dropout masks to units that share the same discrete code, which models.
helps in confounded Atari. In end-to-end driving, Chauffeur- In general, efficiently generating realistic safety-critical sce-
Net [244] addresses the causal confusion issue by using the past narios that cover the long-tailed distribution remains a significant
ego-motion as intermediate BEV abstractions and dropping out challenge. While many works focus on adversarial scenarios
it with a 50% probability during training. Wen et al. [58] propose in simulators, it is also essential to better utilize real-world
upweighting keyframes in the training loss, where a decision data for critical scenario mining and potential adaptation to
change occurs (and hence are not predictable by extrapolating simulation. Besides, a systematic, rigorous, comprehensive, and
the past). PrimeNet [60] improves performance compared to realistic testing framework is crucial for evaluating end-to-end
keyframes by using an ensemble, where the prediction of a autonomous driving methods under these long-tailed distributed
single-frame model is given as additional input to a multi-frame safety-critical scenarios.
model. Chuang et al. [245] do the same but supervise the 2) Covariate Shift: As discussed in Section II-A, one impor-
multi-frame network with action residuals instead of actions. In tant challenge for behavior cloning is covariate shift. The state
addition, the problem of causal confusion can be circumvented distributions from the expert’s policy and those from the trained
by using only LiDAR histories (with a single frame image) agent’s policy differ, leading to compounding errors when the
and realigning point clouds into one coordinate system. This trained agent is deployed in unseen testing environments or when
removes ego-motion while retaining information about other the reactions from other agents differ from training time. This
10176 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO. 12, DECEMBER 2024

Fig. 8. Challenges in robustness. Three primary generalization issues arise in relation to dataset distribution discrepancies, namely long-tailed and normal cases,
expert demonstration and test scenarios, and domain shift in locations, weather, etc.

could result in the trained agent being in a state that is outside domains into a common latent space or representations like seg-
the expert’s distribution for training, leading to severe failures. mentation maps [262], [263]. LUSR [264] and UAIL [235] adopt
An illustration is presented in Fig. 8(b). a Cycle-Consistent VAE and GAN, respectively, to project im-
DAgger (Dataset Aggregation) [26] is a common solution for ages into a latent representation comprised of a domain-specific
this issue. DAgger is an iterative training process. The current part and a domain-general part. In SESR [190], class disentan-
trained policy is rolled out in each iteration to collect new gled encodings are extracted from a semantic segmentation mask
data, and the expert is used to label the visited states. This to reduce the sim-to-real gap. Domain randomization [265],
enriches the dataset by adding examples of how to recover from [266], [267] is also a simple and effective sim-to-real technique
suboptimal states that an imperfect policy might visit. The policy for RL policy learning, which is further adapted for end-to-end
is then trained on the augmented dataset, and the process repeats. autonomous driving [188], [268]. It is realized by randomizing
However, one downside of DAgger is the need for an available the rendering and physical settings of the simulators to cover the
expert to query online. variability of the real world during training.
For end-to-end autonomous driving, DAgger is adopted Currently, sim-to-real adaptation through source target image
in [24] with an MPC-based expert. To reduce the cost of con- mapping or domain-invariant feature learning is the focus. Other
stantly querying the expert, SafeDAgger [25] extends the origi- DA cases are handled by constructing a diverse and large-scale
nal DAgger algorithm by learning a safety policy that estimates dataset. Given that current methods mainly concentrate on the
the deviation between the current policy and the expert policy. visual gap in images, and LiDAR has become a popular input
The expert is only queried when the deviation is large. MetaDAg- modality for driving, specific adaptation techniques tailored for
ger [56] uses meta-learning with DAgger to aggregate data from LiDARs must also be designed. Besides, traffic agents’ behavior
multiple environments. LBC [5] adopts DAgger and resamples gaps between the simulator and the real world should be noticed
the data with higher loss more frequently. In DARB [10], to as well. Incorporating real-world data into simulation through
better utilize failure or safety-related samples, it proposes several techniques such as NeRF [113] is another promising direction.
mechanisms, including task-based, policy-based, and policy &
expert-based mechanisms, to sample such critical states. V. FUTURE TRENDS
3) Domain Adaptation: Domain adaptation (DA) is a type
of transfer learning in which the target task is the same as the Considering the challenges and opportunities discussed, we
source task, but the domains differ. Here we discuss scenarios list some crucial directions for future research that may have a
where labels are available for the source domain while there are broader impact in this field.
no labels or a limited amount of labels available for the target
domain. A. Zero-Shot and Few-Shot Learning
As shown in Fig. 8(c), domain adaptation for autonomous It is inevitable for autonomous driving models to eventually
driving tasks encompasses several cases [261]:
r Sim-to-real: the large gap between simulators used for encounter real-world scenarios that lie beyond the training data
distribution. This raises the question of whether we can suc-
training and the real world used for deployment.
r Geography-to-geography: different geographic locations cessfully adapt the model to an unseen target domain where
limited or no labeled data is available. Formalizing this task
with varying environmental appearances.
r Weather-to-weather: changes in sensor inputs caused by for the end-to-end driving domain and incorporating techniques
from the zero-shot/few-shot learning literature are the key steps
weather conditions such as rain, fog, and snow.
r Day-to-night: illumination variations in visual inputs. toward achieving this [269], [270].
r Sensor-to-sensor: possible differences in sensor character-
istics, e.g., resolution and relative position. B. Modular End-to-End Planning
Note that the aforementioned cases often overlap. The modular end-to-end planning framework optimizes mul-
Typically, domain-invariant feature learning is achieved with tiple modules while prioritizing the ultimate planning task,
image translators and discriminators to map images from two which enjoys the advantages of interpretability as indicated in
CHEN et al.: END-TO-END AUTONOMOUS DRIVING: CHALLENGES AND FRONTIERS 10177

Section IV-F. This is advocated in recent literature [2], [271] Therefore, an increasing number of companies have started ex-
and certain industry solutions (Tesla, Wayve, etc.) have involved ploring end-to-end autonomous driving techniques specifically
similar ideas. When designing these differentiable perception tailored for these environments. It is envisioned that with ex-
modules, several questions arise regarding the choice of loss tensive high-quality data collection, large-scale model training,
functions, such as the necessity of 3D bounding boxes for and the establishment of reliable benchmarks, the end-to-end
object detection, whether opting for BEV segmentation over lane approach will have enormous potential over modular stacks in
topology for static scene perception, or the training strategies terms of performance and effectiveness. In summary, end-to-end
with limited modules’ data. autonomous driving faces great opportunities and challenges
simultaneously, with the ultimate goal of building generalist
agents. In this era of emerging technologies, we hope this survey
C. Data Engine could serve as a starting point to shed new light on this domain.
The importance of large-scale and high-quality data for au-
tonomous driving can never be emphasized enough [272]. Estab-
ACKNOWLEDGMENT
lishing a data engine with an automatic labeling pipeline [273]
could greatly facilitate the iterative development of both data The authors would like to thank the International Max Planck
and models. The data engine for autonomous driving, especially Research School for Intelligent Systems (IMPRS-IS) for sup-
modular end-to-end planning systems, needs to streamline the porting Bernhard Jaeger and Kashyap Chitta.
process of annotating high-quality perception labels with the
aid of large perception models in an automatic way. It should
REFERENCES
also support mining hard/corner cases, scene generation, and
editing to facilitate the data-driven evaluations discussed in [1] S. Casas, A. Sadat, and R. Urtasun, “MP3: A unified model to map,
perceive, predict and plan,” in Proc. IEEE Conf. Comput. Vis. Pattern
Section III-B and promote diversity of data and the generaliza- Recognit., 2021, pp. 14403–14412.
tion ability of models (Section IV-I). A data engine would enable [2] Y. Hu et al., “Planning-oriented autonomous driving,” in Proc. IEEE
autonomous driving models to make consistent improvements. Conf. Comput. Vis. Pattern Recognit., 2023, pp. 17853–17862.
[3] D. A. Pomerleau, “Alvinn: An autonomous land vehicle in a neural net-
work,” in Proc. Int. Conf. Neural Inf. Process. Syst., 1988, pp. 305–313.
[4] A. Kendall et al., “Learning to drive in a day,” in Proc. IEEE Int. Conf.
D. Foundation Model Robot. Automat., 2019, pp. 8248–8254.
[5] D. Chen, B. Zhou, V. Koltun, and P. Krähenbühl, “Learning by cheating,”
Recent advancements in foundation models in both lan- in Proc. Conf. Robot Learn., 2020, pp. 66–75.
guage [166], [167], [274] and vision [273], [275], [276] have [6] A. Prakash, K. Chitta, and A. Geiger, “Multi-modal fusion transformer
proved that large-scale data and model capacity can unleash for end-to-end autonomous driving,” in Proc. IEEE Conf. Comput. Vis.
Pattern Recognit., 2021, pp. 7077–7087.
the immense potential of AI in high-level reasoning tasks. The [7] N. Hanselmann, K. Renz, K. Chitta, A. Bhattacharyya, and A. Geiger,
paradigm of finetuning [277] or prompt learning [278], opti- “King: Generating safety-critical driving scenarios for robust imita-
mization in the form of self-supervised reconstruction [279] or tion via kinematics gradients,” in Proc. Eur. Conf. Comput. Vis., 2022,
pp. 335–352.
contrastive pairs [165], etc., are all applicable to the end-to-end [8] M. Bojarski et al., “End to end learning for self-driving cars,” 2016,
driving domain. However, we contend that the direct adoption of arXiv:1604.07316.
LLMs for driving might be tricky. The output of an autonomous [9] F. Codevilla, M. Müller, A. López, V. Koltun, and A. Dosovitskiy, “End-
to-End driving via conditional imitation learning,” in Proc. IEEE Int.
agent requires steady and accurate measurements, whereas the Conf. Robot. Automat., 2018, pp. 4693–4700.
generative output in language models aims to behave like hu- [10] A. Prakash, A. Behl, E. Ohn-Bar, K. Chitta, and A. Geiger, “Exploring
mans, irrespective of its accuracy. A feasible solution to develop data aggregation in policy learning for vision-based urban autonomous
driving,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020,
a “foundation” driving model is to train a world model that can pp. 11763–11773.
forecast the reasonable future of the environment, either in 2D, [11] K. Chitta, A. Prakash, and A. Geiger, “Neat: Neural attention fields for
3D, or latent space. To perform well on downstream tasks like end-to-end autonomous driving,” in Proc. IEEE Int. Conf. Comput. Vis.,
2021, pp. 15793–15803.
planning, the objective to be optimized for the model needs to [12] P. Wu, L. Chen, H. Li, X. Jia, J. Yan, and Y. Qiao, “Policy pre-training
be sophisticated enough, beyond frame-level perception. for autonomous driving via self-supervised geometric modeling,” in Proc.
Int. Conf. Learn. Representations, 2023.
[13] CARLA, “CARLA autonomous driving leaderboard,” 2022. [Online].
VI. CONCLUSION AND OUTLOOK Available: https://leaderboard.carla.org/
[14] H. Caesar et al., “NuPlan: A closed-loop ML-based planning benchmark
In this survey, we provide an overview of fundamental for autonomous vehicles,” in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit. Workshops, 2021.
methodologies and summarize various aspects of simulation [15] J. Hawke et al., “Urban driving with conditional imitation learning,” in
and benchmarking. We thoroughly analyze the extensive Proc. IEEE Int. Conf. Robot. Automat., 2020.
literature to date, and highlight a wide range of critical [16] F. Codevilla, E. Santana, A. M. López, and A. Gaidon, “Exploring the
limitations of behavior cloning for autonomous driving,” in Proc. IEEE
challenges and promising resolutions. Int. Conf. Comput. Vis., 2019, pp. 9329–9338.
Outlook: The industry has dedicated considerable effort over [17] X. Liang, T. Wang, L. Yang, and E. Xing, “CIRL: Controllable imitative
the years to develop advanced modular-based systems capa- reinforcement learning for vision-based self-driving,” in Proc. Eur. Conf.
Comput. Vis., 2018, pp. 584–599.
ble of achieving autonomous driving on highways. However, [18] M. Toromanoff, E. Wirbel, and F. Moutarde, “End-to-end model-free
these systems face significant challenges when confronted with reinforcement learning for urban driving using implicit affordances,” in
complex scenarios, e.g., inner-city streets and intersections. Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 7153–7162.
10178 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO. 12, DECEMBER 2024

[19] R. Chekroun, M. Toromanoff, S. Hornauer, and F. Moutarde, “GRI: Gen- [45] L. Le Mero, D. Yi, M. Dianati, and A. Mouzakitis, “A survey on
eral reinforced imitation and its application to vision-based autonomous imitation learning techniques for end-to-end autonomous vehicles,”
driving,” Robotics, vol. 12, 2023, Art. no. 217. IEEE Trans. Intell. Transp. Syst., vol. 23, no. 9, pp. 14128–14147,
[20] D. Chen, V. Koltun, and P. Krähenbühl, “Learning to drive from a world Sep. 2022.
on rails,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 15590–15599. [46] B. Zheng, S. Verma, J. Zhou, I. W. Tsang, and F. Chen, “Imitation
[21] Z. Zhang, A. Liniger, D. Dai, F. Yu, and L. Van Gool, “End-to-end urban learning: Progress, taxonomies and challenges,” IEEE Trans. Neural
driving by imitating a reinforcement learning coach,” in Proc. IEEE Int. Netw. Learn. Syst., vol. 35, no. 5, pp. 6322–6337, May 2024.
Conf. Comput. Vis., 2021, pp. 15222–15232. [47] Z. Zhu and H. Zhao, “A survey of deep RL and IL for autonomous
[22] P. Wu, X. Jia, L. Chen, J. Yan, H. Li, and Y. Qiao, “Trajectory-guided driving policy learning,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 9,
control prediction for end-to-end autonomous driving: A simple yet pp. 14043–14065, Sep. 2022.
strong baseline,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2022, [48] B. R. Kiran et al., “Deep reinforcement learning for autonomous driving:
pp. 6119–6132. A survey,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 6, pp. 4909–4926,
[23] J. Zhang, Z. Huang, and E. Ohn-Bar, “Coaching a teachable student,” in Jun. 2022.
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 7805–7815. [49] M. Bain and C. Sammut, “A framework for behavioural cloning,” Mach.
[24] Y. Pan et al., “Agile autonomous driving using end-to-end deep imitation Intell., vol. 15, 1995.
learning,” in Proc. Robotics: Sci. Sys. Conf., 2017. [50] B. D. Ziebart et al., “Maximum entropy inverse reinforcement learning,”
[25] J. Zhang and K. Cho, “Query-efficient imitation learning for end-to-end in Proc. AAAI Conf. Artif. Intell., 2008, pp. 1433–1438.
simulated driving,” in Proc. AAAI Conf. Artif. Intell., 2017, pp. 2891– [51] Y. Lecun, E. Cosatto, J. Ben, U. Muller, and B. Flepp, “DAVE: Au-
2897. tonomous off-road vehicle control using end-to-end learning,” Courant
[26] S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning Institute/CBLL, Tech. Rep. DARPA-IPTO Final Report, 2004. [Online].
and structured prediction to no-regret online learning,” in Proc. Int. Conf. Available: http://www.cs.nyu.edu/\∼{ }yann/research/dave/index.html
Artif. Intell. Statist., 2011, pp. 627–635. [52] D. Chen and P. Krähenbühl, “Learning from all vehicles,” in Proc. IEEE
[27] A. Vaswani et al., “Attention is all you need,” in Proc. Int. Conf. Neural Conf. Comput. Vis. Pattern Recognit., 2022, pp. 17222–17231.
Inf. Process. Syst., 2017, pp. 6000–6010. [53] K. Judah, A. P. Fern, T. G. Dietterich, and P. Tadepalli, “Active imitation
[28] K. Chitta, A. Prakash, B. Jaeger, Z. Yu, K. Renz, and A. Geiger, “Trans- learning: Formal and practical reductions to IID learning,” J. Mach.
fuser: Imitation with transformer-based sensor fusion for autonomous Learn. Res., vol. 15, pp. 4105–4143, 2014.
driving,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 11, [54] S. Ross and D. Bagnell, “Efficient reductions for imitation learning,” in
pp. 12878–12895, Nov. 2023. Proc. Int. Conf. Artif. Intell. Statist., 2010, pp. 661–668.
[29] H. Shao, L. Wang, R. Chen, H. Li, and Y. Liu, “Safety-enhanced au- [55] S. Ross and J. A. Bagnell, “Reinforcement and imitation learning via
tonomous driving using interpretable sensor fusion transformer,” in Proc. interactive no-regret learning,” 2014, arXiv:1406.5979.
Conf. Robot Learn., 2022, pp. 726–737. [56] A. E. Sallab, M. Saeed, O. A. Tawab, and M. Abdou, “Meta learning
[30] X. Jia et al., “Think twice before driving: Towards scalable decoders framework for automated driving,” 2017, 1706.04038.
for end-to-end autonomous driving,” in Proc. IEEE Conf. Comput. Vis. [57] C. Wen, J. Lin, T. Darrell, D. Jayaraman, and Y. Gao, “Fighting copycat
Pattern Recognit., 2023, pp. 21983–21994. agents in behavioral cloning from observation histories,” in Proc. Int.
[31] B. Jaeger, K. Chitta, and A. Geiger, “Hidden biases of end-to- Conf. Neural Inf. Process. Syst., 2020, pp. 2564–2575.
end driving models,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, [58] C. Wen, J. Lin, J. Qian, Y. Gao, and D. Jayaraman, “Keyframe-focused
pp. 8240–8249. visual imitation learning,” in Proc. Int. Conf. Mach. Learn., 2021,
[32] W. Zeng et al., “End-to-end interpretable neural motion planner,” in Proc. pp. 11123–11133.
IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 8660–8669. [59] J. Park et al., “Object-aware regularization for addressing causal confu-
[33] J. Kim, A. Rohrbach, T. Darrell, J. Canny, and Z. Akata, “Textual sion in imitation learning,” in Proc. Int. Conf. Neural Inf. Process. Syst.,
explanations for self-driving vehicles,” in Proc. Eur. Conf. Comput. Vis., 2021, pp. 3029–3042.
2018, pp. 563–578. [60] C. Wen, J. Qian, J. Lin, J. Teng, D. Jayaraman, and Y. Gao, “Fighting fire
[34] J. Wang et al., “Advsim: Generating safety-critical scenarios for self- with fire: Avoiding DNN shortcuts through priming,” in Proc. Int. Conf.
driving vehicles,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Mach. Learn., 2022, pp. 23723–23750.
2021, pp. 9909–9918. [61] D. Brown, W. Goo, P. Nagarajan, and S. Niekum, “Extrapolating
[35] W. Ding, B. Chen, M. Xu, and D. Zhao, “Learning to collide: An adaptive beyond suboptimal demonstrations via inverse reinforcement learn-
safety-critical scenarios generating method,” in Proc. IEEE/RSJ Int. Conf. ing from observations,” in Proc. Int. Conf. Mach. Learn., 2019,
Intell. Robots Syst., 2020, pp. 2243–2250. pp. 783–792.
[36] Q. Zhang, Z. Peng, and B. Zhou, “Learning to drive by watching youtube [62] C. Finn, S. Levine, and P. Abbeel, “Guided cost learning: Deep inverse
videos: Action-conditioned contrastive policy pretraining,” in Proc. Eur. optimal control via policy optimization,” in Proc. Int. Conf. Mach. Learn.,
Conf. Comput. Vis., 2022, pp. 111–128. 2016, pp. 49–58.
[37] J. Zhang, R. Zhu, and E. Ohn-Bar, “SelfD: Self-learning large-scale [63] S. Reddy, A. D. Dragan, and S. Levine, “SQIL: Imitation learning via
driving policies from the web,” in Proc. IEEE Conf. Comput. Vis. Pattern reinforcement learning with sparse rewards,” 2019, arXiv:1905.11108.
Recognit., 2022, pp. 17316–17326. [64] S. Luo, H. Kasaei, and L. Schomaker, “Self-imitation learning by plan-
[38] S. Hu, L. Chen, P. Wu, H. Li, J. Yan, and D. Tao, “ST-P3: End-to-end ning,” in Proc. IEEE Int. Conf. Robot. Automat., 2021, pp. 4823–4829.
vision-based autonomous driving via spatial-temporal feature learning,” [65] J. Ho and S. Ermon, “Generative adversarial imitation learning,” in Proc.
in Proc. Eur. Conf. Comput. Vis., 2022, pp. 533–549. Int. Conf. Neural Inf. Process. Syst., 2016, pp. 4572–4580.
[39] A. Sadat, S. Casas, M. Ren, X. Wu, P. Dhawan, and R. Urtasun, “Perceive, [66] Y. Li, J. Song, and S. Ermon, “InfoGAIL: Interpretable imitation learning
predict, and plan: Safe motion planning through interpretable semantic from visual demonstrations,” in Proc. Int. Conf. Neural Inf. Process. Syst.,
representations,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 414–430. 2017, pp. 3815–3825.
[40] J. Janai, F. Güney, A. Behl, and A. Geiger, “Computer vision for [67] G. Lee, D. Kim, W. Oh, K. Lee, and S. Oh, “MixGAIL: Autonomous
autonomous vehicles: Problems, datasets and state-of-the-art,” 2017, driving using demonstrations with mixed qualities,” in Proc. IEEE/RSJ
arXiv:1704.05519. Int. Conf. Intell. Robots Syst., 2020, pp. 5425–5430.
[41] A. Tampuu, T. Matiisen, M. Semikin, D. Fishman, and N. Muhammad, [68] I. Goodfellow et al., “Generative adversarial networks,” Commun. ACM,
“A survey of end-to-end driving: Architectures and training methods,” vol. 63, pp. 139–144, 2020.
IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 4, pp. 1364–1384, [69] H. Wang, P. Cai, R. Fan, Y. Sun, and M. Liu, “End-to-end interactive
Apr. 2022. prediction and planning with optical flow distillation for autonomous
[42] S. Teng et al., “Motion planning for autonomous driving: The state of driving,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops,
the art and future perspectives,” IEEE Trans. Intell. Veh., vol. 8, no. 6, 2021, pp. 2229–2238.
pp. 3692–3711, Jun. 2023. [70] P. Hu, A. Huang, J. Dolan, D. Held, and D. Ramanan, “Safe local motion
[43] D. Coelho and M. Oliveira, “A review of end-to-end autonomous driving planning with self-supervised freespace forecasting,” in Proc. IEEE Conf.
in urban environments,” IEEE Access, vol. 10, pp. 75296–75311, 2022. Comput. Vis. Pattern Recognit., 2021, pp. 12727–12736.
[44] A. O. Ly and M. Akhloufi, “Learning to drive by imitation: An overview [71] T. Khurana, P. Hu, A. Dave, J. Ziglar, D. Held, and D. Ramanan,
of deep behavior cloning methods,” IEEE Trans. Intell. Veh., vol. 6, no. 2, “Differentiable raycasting for self-supervised occupancy forecasting,”
pp. 195–209, Jun. 2021. in Proc. Eur. Conf. Comput. Vis., 2022, pp. 353–369.
CHEN et al.: END-TO-END AUTONOMOUS DRIVING: CHALLENGES AND FRONTIERS 10179

[72] R. S. Sutton and A. G. Barto, “Reinforcement learning: An introduction,” [99] K. Chitta, D. Dauner, and A. Geiger, “SLEDGE: Synthesizing sim-
IEEE Trans. Neural Netw. Learn. Syst., vol. 9, no. 5, pp. 1054–1054, ulation environments for driving agents with generative models,”
Sep. 1998. 2024, arXiv:2403.17933.
[73] B. Jaeger and A. Geiger, “An invitation to deep reinforcement learning,” [100] S. Suo, S. Regalado, S. Casas, and R. Urtasun, “TrafficSm: Learning to
2023, arXiv:2312.08365. simulate realistic multi-agent behaviors,” in Proc. IEEE Conf. Comput.
[74] V. Mnih et al., “Human-level control through deep reinforcement learn- Vis. Pattern Recognit., 2021, pp. 10395–10404.
ing,” Nature, vol. 518, pp. 529–533, 2015. [101] M. Treiber, A. Hennecke, and D. Helbing, “Congested traffic states
[75] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade in empirical observations and microscopic simulations,” Phys. Rev. E,
learning environment: An evaluation platform for general agents,” J. Artif. vol. 62, 2000, Art. no. 1805.
Intell. Res., vol. 47, pp. 253–279, 2013. [102] Z. Zhong et al., “Guided conditional diffusion for controllable traf-
[76] D. Horgan et al., “Distributed prioritized experience replay,” fic simulation,” in Proc. IEEE Int. Conf. Robot. Automat., 2023,
2018, arXiv:1803.00933. pp. 3560–3566.
[77] J. Bjorck, C. P. Gomes, and K. Q. Weinberger, “Towarddeeper deep [103] D. Xu, Y. Chen, B. Ivanovic, and M. Pavone, “Bits: Bi-level imitation
reinforcement learning with spectral normalization,” in Proc. Int. Conf. for traffic simulation,” in Proc. IEEE Int. Conf. Robot. Automat., 2023,
Neural Inf. Process. Syst., 2021, pp. 8242–8255. pp. 2929–2936.
[78] M. Toromanoff, E. Wirbel, and F. Moutarde, “Is deep reinforce- [104] Z. Zhang, A. Liniger, D. Dai, F. Yu, and L. Van Gool, “TrafficBots:
ment learning really superhuman on atari? Leveling the playing field,” Towards world models for autonomous driving simulation and mo-
2019, arXiv:1908.04683. tion prediction,” in Proc. IEEE Int. Conf. Robot. Automat., 2023,
[79] E. Ohn-Bar, A. Prakash, A. Behl, K. Chitta, and A. Geiger, “Learning pp. 1522–1529.
situational driving,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., [105] S. Manivasagam et al., “LiDARsi: Realistic LiDAR simulation by lever-
2020, pp. 11293–11302. aging the real world,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
[80] W. B. Knox, A. Allievi, H. Banzhaf, F. Schmitt, and P. Stone, “Re- 2020, pp. 11167–11176.
ward (Mis)design for autonomous driving,” Artif. Intell., vol. 316, 2023, [106] Y. Chen et al., “Geosim: Realistic video simulation via geometry-aware
Art. no. 103829. composition for self-driving,” in Proc. IEEE Conf. Comput. Vis. Pattern
[81] C. Zhang et al., “Rethinking closed-loop training for autonomous driv- Recognit., 2021, pp. 72300–7240.
ing,” in Proc. Eur. Conf. Comput. Vis., 2022, pp. 264–282. [107] Z. Yang et al., “UniSim: A neural closed-loop sensor simulator,” in Proc.
[82] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi, “Dream to control: IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 1389–1399.
Learning behaviors by latent imagination,” in Proc. Int. Conf. Learn. [108] A. Petrenko, E. Wijmans, B. Shacklett, and V. Koltun, “Megaverse:
Representations, 2020. Simulating embodied agents at one million experiences per second,” in
[83] D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba, “Mastering atari with Proc. Int. Conf. Mach. Learn., 2021, pp. 8556–8566.
discrete world models,” in Proc. Int. Conf. Learn. Representations, [109] Z. Song et al., “Synthetic datasets for autonomous driving: A survey,”
2021. IEEE Trans. Intell. Veh., vol. 9, no. 1, pp. 1847–1864, Jan. 2024.
[84] D. Ha and J. Schmidhuber, “Recurrent world models facilitate pol- [110] A. Amini et al., “Learning robust control policies for end-to-end au-
icy evolution,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2018, tonomous driving from data-driven simulation,” IEEE Robot. Automat.
pp. 2455–2467. Lett., vol. 5, no. 2, pp. 1143–1150, Apr. 2020.
[85] M. Buehler, K. Iagnemma, and S. Singh, in The 2005 DARPA Grand [111] A. Amini et al., “VISTA 2.0: An open, data-driven simulator for mul-
Challenge: The Great Robot Race. Berlin, Germany: springer, 2007. timodal sensing and policy learning for autonomous vehicles,” in Proc.
[86] M. Buehler, K. Iagnemma, and S. Singh, The DARPA Urban Challenge: IEEE Int. Conf. Robot. Automat., 2022, pp. 2419–2426.
Autonomous Vehicles in City Traffic. Berlin, Germany: Springer, 2009. [112] T.-H. Wang, A. Amini, W. Schwarting, I. Gilitschenski, S. Karaman, and
[87] U. of Michigan, “Mcity,” 2015. [Online]. Available: https://mcity.umich. D. Rus, “Learning interactive driving policies via data-driven simulation,”
edu/ in Proc. IEEE Int. Conf. Robot. Automat., 2022, pp. 7745–7752.
[88] T. Team, “Torcs, the open racing car simulator.” 2000. [Online]. Avail- [113] B. Mildenhall, P.P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi,
able: https://sourceforge.net/projects/torcs/ and R. Ng, “NeRF: Representing scenes as neural radiance fields for view
[89] M. Martinez, C. Sitawarin, K. Finch, L. Meincke, A. Yablonski, and synthesis,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 405–421.
A. Kornhauser, “Beyond grand theft auto V for training, testing and [114] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3D gaussian
enhancing deep learning in self driving cars,” 2017, arXiv:1712.01397. splatting for real-time radiance field rendering,” ACM Trans. Graph.,
[90] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: vol. 42, 2023, Art. no. 139.
An open urban driving simulator,” in Proc. Conf. Robot Learn., 2017, [115] M. Tancik et al., “Block-neRF: Scalable large scene neural view syn-
pp. 1–16. thesis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022,
[91] D. Team, “Deepdrive: A simulator that allows anyone with a PC to push pp. 8238–8248.
the state-of-the-art in self-driving,” 2020. [Online]. Available: https:// [116] H. Turki, D. Ramanan, and M. Satyanarayanan, “Mega-NERF: Scalable
github.com/deepdrive/deepdrive construction of large-scale nerfs for virtual fly-throughs,” in Proc. IEEE
[92] Q. Li, Z. Peng, L. Feng, Q. Zhang, Z. Xue, and B. Zhou, “Metadrive: Conf. Comput. Vis. Pattern Recognit., 2022, pp. 12922–12931.
Composing diverse driving scenarios for generalizable reinforcement [117] A. Kundu et al., “Panoptic neural fields: A semantic object-aware neural
learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 3, scene representation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog-
pp. 3461–3475, Mar. 2023. nit., 2022, pp. 12871–12881.
[93] M. Hendrikx, S. Meijer, J. Van Der Velden, and A. Iosup, “Procedural [118] Y. Yang, Y. Yang, H. Guo, R. Xiong, Y. Wang, and Y. Liao, “Urbangiraffe:
content generation for games: A survey,” ACM Trans. Multimedia Com- Representing urban scenes as compositional generative neural feature
put. Commun. Appl., vol. 9, pp. 1–22, 2013. fields,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, pp. 9199–9210.
[94] D. J. Fremont, T. Dreossi, S. Ghosh, X. Yue, A. L. Sangiovanni- [119] S. R. Richter, H. A. Alhaija, and V. Koltun, “Enhancing photorealism
Vincentelli, and S. A. Seshia, “Scenic: A language for scenario specifica- enhancement,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 2,
tion and scene generation,” in Proc. 40th ACM SIGPLAN Conf. Program. pp. 1700–1715, Feb. 2023.
Lang. Des. Implementation, 2019, pp. 63–78. [120] A. Schoonwinkel, Design and Test of a Computer Stabilized Unicycle,
[95] F. Hauer, T. Schmidt, B. Holzmüller, and A. Pretschner, “Did we test Stanford, CA, USA: Stanford University, 1987. [Online]. Available:
all scenarios for automated and autonomous driving systems?,” in Proc. https://books.google.com/books?id=LA8lGwAACAAJ
IEEE Intell. Transp. Syst. Conf., 2019, pp. 2950–2955. [121] P. Polack, F. Altché, B. d’Andréa Novel, and A. de La Fortelle, “The
[96] S. Tan et al., “SceneGen: Learning to generate realistic traffic scenes,” in kinematic bicycle model: A consistent model for planning feasible tra-
Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 892–901. jectories for autonomous vehicles?,” in Proc. IEEE Intell. Veh. Symp.,
[97] L. Bergamini et al., “SimNet: Learning reactive self-driving simulations 2017, pp. 812–818.
from real-world observations,” in Proc. IEEE Int. Conf. Robot. Automat., [122] R. Rajamani, Vehicle Dynamics and Control. Berlin, Germany: Springer,
2021, pp. 5119–5125. 2011.
[98] L. Feng, Q. Li, Z. Peng, S. Tan, and B. Zhou, “TrafficGen: Learning to [123] F. Codevilla, A. M. Lopez, V. Koltun, and A. Dosovitskiy, “On offline
generate diverse and realistic traffic scenarios,” in Proc. IEEE Int. Conf. evaluation of vision-based driving models,” in Proc. Eur. Conf. Comput.
Robot. Automat., 2023, pp. 3567–3575. Vis., 2018, pp. 236–251.
10180 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO. 12, DECEMBER 2024

[124] N. Contributors, “NAVSIM: Data-driven non-reactive autonomous [148] H. Shao, L. Wang, R. Chen, S. L. Waslander, H. Li, and Y. Liu, “Reason-
vehicle simulation,” 2024. [Online]. Available: https://github.com/ Net: End-to-end driving with temporal and global reasoning,” in Proc.
autonomousvision/navsim IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 13723–13733.
[125] D. Dauner, M. Hallgarten, A. Geiger, and K. Chitta, “Parting with [149] Y. Li et al., “DeepFusion: Lidar-camera deep fusion for multi-modal 3D
misconceptions about learning-based vehicle motion planning,” in Proc. object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
Conf. Robot Learn., 2023, pp. 1268–1281. 2022, pp. 17161–17170.
[126] H. Caesar et al., “nuScenes: A multimodal dataset for autonomous [150] S. Borse et al., “X-align: Cross-modal cross-view alignment for bird’s-
driving,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, eye-view segmentation,” in Proc. IEEE Winter Conf. Appl. Comput. Vis.,
pp. 1618–11628. 2023, pp. 3287–3297.
[127] B. Wilson et al., “Argoverse 2: Next generation datasets for self-driving [151] P. Anderson et al., “Vision-and-language navigation: Interpreting
perception and forecasting,” in Proc. Int. Conf. Neural Inf. Process. Syst. visually-grounded navigation instructions in real environments,” in Proc.
Datasets Benchmarks, 2021. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 3674–3683.
[128] P. Sun et al., “Scalability in perception for autonomous driving: Waymo [152] M. Shridhar, L. Manuelli, and D. Fox, “Cliport: What and where pathways
open dataset,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, for robotic manipulation,” in Proc. Conf. Robot Learn., 2022, pp. 894–
pp. 2446–2454. 906.
[129] J.-T. Zhai et al., “Rethinking the open-loop evaluation of end-to-end [153] J. Duan, S. Yu, H. L. Tan, H. Zhu, and C. Tan, “A survey of embodied AI:
autonomous driving in nuscenes,” 2023, arXiv:2305.10430. From simulators to research tasks,” IEEE Trans. Emerg. Topics Comput.
[130] Z. Li et al., “Is ego status all you need for open-loop end-to-end au- Intell., vol. 6, no. 2, pp. 230–244, Apr. 2022.
tonomous driving?,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., [154] S. Vemprala, R. Bonatti, A. Bucker, and A. Kapoor, “Chat-
2024, pp. 14864–14873. GPT for robotics: Design principles and model abilities,” 2023,
[131] T. Liang et al., “BEVFusion: A simple and robust liDAR-camera fu- arXiv:2306.17582.
sion framework,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2022, [155] T. Deruyttere, S. Vandenhende, D. Grujicic, L. Van Gool, and M. F.
pp. 10421–10434. Moens, “Talk2car: Taking control of your self-driving car,” in Proc. Conf.
[132] Z. Liu et al., “Bevfusion: Multi-task multi-sensor fusion with unified Empirical Methods Natural Lang. Process., 2019.
bird’s-eye view representation,” in Proc. IEEE Int. Conf. Robot. Automat., [156] P. Mirowski et al., “Learning to navigate in cities without a map,” in Proc.
2023, pp. 2774–2781. Int. Conf. Neural Inf. Process. Syst., 2018, pp. 2424–2435.
[133] R. Zhang, S. A. Candra, K. Vetter, and A. Zakhor, “Sensor fusion for [157] H. Chen, A. Suhr, D. Misra, N. Snavely, and Y. Artzi, “TOUCHDOWN:
semantic segmentation of urban scenes,” in Proc. IEEE Int. Conf. Robot. Natural language navigation and spatial reasoning in visual street envi-
Automat., 2015, pp. 1850–1857. ronments,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019,
[134] G. P. Meyer, J. Charland, D. Hegde, A. Laddha, and C. Vallespi-Gonzalez, pp. 12530–12539.
“Sensor fusion for joint 3D object detection and semantic segmentation,” [158] R. Schumann and S. Riezler, “Generating landmark navigation instruc-
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2019, tions from maps as a graph-to-text problem,” in Proc. Annu. Meeting
pp. 1230–1237. Assoc. Comput. Linguistics, 2021, pp. 489–502.
[135] B. Zhou, P. Krähenbühl, and V. Koltun, “Does computer vision matter [159] J. Kim, T. Misu, Y.-T. Chen, A. Tawari, and J. Canny, “Grounding human-
for action?,” Sci. Robot., vol. 4, 2019. to-vehicle advice for self-driving vehicles,” in Proc. IEEE Conf. Comput.
[136] Z. Gao et al., “Enhance sample efficiency and robustness of end-to- Vis. Pattern Recognit., 2019, pp. 10583–10591.
end urban autonomous driving via semantic masked world model,” [160] S. Narayanan, T. Maniar, J. Kalyanasundaram, V. Gandhi, B. Bhowmick,
2022, arXiv:2210.04017. and K. M. Krishna, “Talk to the vehicle: Language conditioned au-
[137] J. Chen, S. E. Li, and M. Tomizuka, “Interpretable end-to-end ur- tonomous navigation of self driving cars,” in Proc. IEEE/RSJ Int. Conf.
ban autonomous driving with latent deep reinforcement learning,” Intell. Robots Syst., 2019, pp. 5284–5290.
IEEE Trans. Intell. Transp. Syst., vol. 23, no. 6, pp. 5068–5078, [161] J. Kim, S. Moon, A. Rohrbach, T. Darrell, and J. Canny, “Advisable
Jun. 2022. learning for self-driving vehicles by internalizing observation-to-action
[138] P. Cai, S. Wang, H. Wang, and M. Liu, “Carl-lead: Lidar-based end-to- rules,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020,
end autonomous driving with contrastive deep reinforcement learning,” pp. 9658–9667.
2021, arXiv:2109.08473. [162] J. Roh, C. Paxton, A. Pronobis, A. Farhadi, and D. Fox, “Conditional
[139] Z. Huang, C. Lv, Y. Xing, and J. Wu, “Multi-modal sensor fusion-based driving from natural language instructions,” in Proc. Conf. Robot Learn.,
deep neural network for end-to-end autonomous driving with scene 2019, pp. 540–551.
understanding,” IEEE Sensors J., vol. 21, no. 10, pp. 11781–11790, [163] K. Jain, V. Chhangani, A. Tiwari, K. M. Krishna, and V. Gandhi, “Ground
May 2021. then navigate: Language-guided navigation in dynamic scenes,” in Proc.
[140] O. Natan and J. Miura, “Fully end-to-end autonomous driving with IEEE Int. Conf. Robot. Automat., 2023, pp. 4113–4120.
semantic depth cloud mapping and multi-agent,” IEEE Trans. Intell. Veh., [164] D. Shah, B. Osiński, B. Ichter, and S. Levine, “LM-Nav: Robotic navi-
vol. 8, no. 1, pp. 557–571, Jun. 2022. gation with large pre-trained models of language, vision, and action,” in
[141] Y. Xiao, F. Codevilla, A. Gurram, O. Urfalioglu, and A. M. López, “Multi- Proc. Conf. Robot Learn., 2023, pp. 492–504.
modal end-to-end autonomous driving,” IEEE Trans. Intell. Transp. Syst., [165] A. Radford et al., “Learning transferable visual models from natural
vol. 23, no. 1, pp. 537–547, Jan. 2022. language supervision,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 8748–
[142] I. Sobh et al., “End-to-end multi-modal sensors fusion system for urban 8763.
automated driving,” in Proc. Int. Conf. Neural Inf. Process. Syst. Work- [166] OpenAI, “GPT-4 technical report,” 2023, arXiv:2303.08774.
shops, 2018. [167] H. Touvron et al., “LLaMA: Open and efficient foundation language
[143] Y. Chen et al., “LiDAR-video driving dataset: Learning driving policies models,” 2023, arXiv:2302.13971.
effectively,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, [168] J. Mao, Y. Qian, H. Zhao, and Y. Wang, “GPT-driver: Learning to drive
pp. 5870–5878. with GPT,” 2023, arXiv:2310.01415.
[144] H. M. Eraqi, M. N. Moustafa, and J. Honer, “Dynamic conditional [169] Z. Xu et al., “DriveGPT4: Interpretable end-to-end autonomous driving
imitation learning for autonomous driving,” IEEE Trans. Intell. Transp. via large language model,” 2023, arXiv:2310.01412.
Syst., vol. 23, no. 12, pp. 22988–23001, Dec. 2022. [170] H. Shao, Y. Hu, L. Wang, S. L. Waslander, Y. Liu, and H. Li, “LMDrive:
[145] S. Chowdhuri, T. Pankaj, and K. Zipser, “Multinet: Multi-modal multi- Closed-loop end-to-end driving with large language models,” in Proc.
task learning for autonomous driving,” in Proc. IEEE Winter Conf. Appl. IEEE Conf. Comput. Vis. Pattern Recognit., 2024, pp. 15120–15130.
Comput. Vis., 2019, pp. 1496–1504. [171] C. Sima et al., “DriveLM: Driving with graph visual question answering,”
[146] P. Cai, S. Wang, Y. Sun, and M. Liu, “Probabilistic end-to-end vehicle 2023, arXiv:2312.14150.
navigation in complex dynamic environments with multimodal sensor [172] T. Qian, J. Chen, L. Zhuo, Y. Jiao, and Y.-G. Jiang, “Nuscenes-
fusion,” IEEE Robot. Automat. Lett., vol. 5, no. 3, pp. 4218–4224, QA: A multi-modal visual question answering benchmark for au-
Jul. 2020. tonomous driving scenario,” in Proc. AAAI Conf. Artif. Intell., 2024,
[147] Q. Zhang, M. Tang, R. Geng, F. Chen, R. Xin, and L. Wang, “MMFN: pp. 4542–4550.
Multi-modal-fusion-net for end-to-end driving,” in Proc. IEEE/RSJ Int. [173] Z. Yang, X. Jia, H. Li, and J. Yan, “A survey of large language models
Conf. Intell. Robots Syst., 2022, pp. 8638–8643. for autonomous driving,” 2023, arXiv:2311.01043.
CHEN et al.: END-TO-END AUTONOMOUS DRIVING: CHALLENGES AND FRONTIERS 10181

[174] B. Hilleli and R. El-Yaniv, “Toward deep reinforcement learning without [200] M. Henaff, A. Canziani, and Y. LeCun, “Model-predictive policy learning
a simulator: An autonomous steering example,” in Proc. AAAI Conf. Artif. with uncertainty regularization for driving in dense traffic,” in Proc. Int.
Intell., 2018, pp. 1471–1478. Conf. Learn. Representations, 2019.
[175] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image [201] J. Wu, Z. Huang, and C. Lv, “Uncertainty-aware model-based reinforce-
recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, ment learning: Methodology and application in autonomous driving,”
pp. 770–778. IEEE Trans. Intell. Veh., vol. 8, no. 1, pp. 194–203, Jan. 2022.
[176] Y. Lee, J.-W. Hwang, S. Lee, Y. Bae, and J. Park, “An energy and GPU- [202] M. Pan, X. Zhu, Y. Wang, and X. Yang, “Iso-dream: Isolating and
computation efficient backbone network for real-time object detection,” leveraging noncontrollable visual dynamics in world models,” in Proc.
in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2019, Int. Conf. Neural Inf. Process. Syst., 2022, pp. 23178–23191.
pp. 752–760. [203] J. Yang et al., “Generalized predictive model for autonomous driv-
[177] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for ing,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2024,
image recognition at scale,” in Proc. Int. Conf. Learn. Representations, pp. 14662–14672.
2021. [204] Y. Wang, J. He, L. Fan, H. Li, Y. Chen, and Z. Zhang, “Driving into
[178] M. Dehghani et al., “Scaling vision transformers to 22 billion parame- the future: Multiview visual forecasting and planning with world model
ters,” in Proc. Int. Conf. Mach. Learn., 2023, pp. 7480–7512. for autonomous driving,” in Proc. IEEE Conf. Comput. Vis. Pattern
[179] H. Li et al., “Delving into the devils of bird’s-eye-view perception: A Recognit., 2024, pp. 14749–14759.
review, evaluation and recipe,” IEEE Trans. Pattern Anal. Mach. Intell., [205] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-
vol. 46, no. 4, pp. 2151–2170, Apr. 2024. resolution image synthesis with latent diffusion models,” in Proc. IEEE
[180] Z. Li et al., “BEVFormer: Learning bird’s-eye-view representation from Conf. Comput. Vis. Pattern Recognit., 2022, pp. 10684–10695.
multi-camera images via spatiotemporal transformers,” in Proc. Eur. [206] A. Hu et al., “Model-based imitation learning for urban driving,” in Proc.
Conf. Comput. Vis., 2022, pp. 1–18. Int. Conf. Neural Inf. Process. Syst., 2022.
[181] X. Jia, Y. Gao, L. Chen, J. Yan, P. L. Liu, and H. Li, “DriveAdapter: [207] R. Caruana, “Multitask learning,” Mach. Learn., vol. 28, pp. 41–75,
Breaking the coupling barrier of perception and planning in end-to-end 1997.
autonomous driving,” in Proc. IEEE Int. Conf. Comput. Vis., 2023, [208] K. Ishihara, A. Kanervisto, J. Miura, and V. Hautamaki, “Multi-task learn-
pp. 7953–7963. ing with attention for end-to-end autonomous driving,” in Proc. IEEE
[182] W. Tong et al., “Scene as occupancy,” in Proc. IEEE Int. Conf. Comput. Conf. Comput. Vis. Pattern Recognit. Workshops, 2021, pp. 2896–2905.
Vis., 2023, pp. 8406–8415. [209] Z. Li, T. Motoyoshi, K. Sasaki, T. Ogata, and S. Sugano, “Rethinking
[183] Q. Li, Y. Wang, Y. Wang, and H. Zhao, “HDMapNet: An online HD map self-driving: Multi-task knowledge for better generalization and accident
construction and evaluation framework,” in Proc. IEEE Int. Conf. Robot. explanation ability,” 2018, arXiv:1809.11100.
Automat., 2022, pp. 4628–4634. [210] H. Xu, Y. Gao, F. Yu, and T. Darrell, “End-to-end learning of driving
[184] B. Liao et al., “MapTR: Structured modeling and learning for online vec- models from large-scale video datasets,” in Proc. IEEE Conf. Comput.
torized HD map construction,” in Proc. Int. Conf. Learn. Representations, Vis. Pattern Recognit., 2017, pp. 2174–2182.
2023. [211] A. Mehta, A. Subramanian, and A. Subramanian, “Learning end-to-end
[185] H. Wang et al., “Openlane-v2: A topology reasoning benchmark for autonomous driving using guided auxiliary supervision,” in Proc. 11th
unified 3D HD mapping,” in Proc. Int. Conf. Neural Inf. Process. Syst. Indian Conf. Comput. Vis. Graph. Image Process., 2018, Art. no. 11.
Datasets Benchmarks, 2023, pp. 18873–18884. [212] Y. Hou, Z. Ma, C. Liu, and C. C. Loy, “Learning to steer by mimicking
[186] T. Li et al., “Topology reasoning for driving scenes,” 2023, features from heterogeneous auxiliary networks,” in Proc. AAAI Conf.
arXiv:2304.05277. Artif. Intell., 2019, pp. 8433–8440.
[187] T. Li et al., “Lanesegnet: Map learning with lane segment perception for [213] A. Zhao, T. He, Y. Liang, H. Huang, G. Van den Broeck, and S. Soatto,
autonomous driving,” in Proc. Int. Conf. Learn. Representations, 2024. “SAM: Squeeze-and-mimic networks for conditional visual driving pol-
[188] G. Wang, H. Niu, D. Zhu, J. Hu, X. Zhan, and G. Zhou, “A versatile icy learning,” in Proc. Conf. Robot Learn., 2020, pp. 156–175.
and efficient reinforcement learning framework for autonomous driving,” [214] É. Zablocki, H. Ben-Younes, P. Pérez, and M. Cord, “Explainability of
2021, arXiv:2110.11573. deep vision-based autonomous driving systems: Review and challenges,”
[189] A. Behl, K. Chitta, A. Prakash, E. Ohn-Bar, and A. Geiger, “Label Int. J. Comput. Vis., vol. 130, pp. 2425–2452, 2022.
efficient visual abstractions for autonomous driving,” in Proc. IEEE/RSJ [215] M. Bojarski et al., “Explaining how a deep neural network trained with
Int. Conf. Intell. Robots Syst., 2020, pp. 2338–2345. end-to-end learning steers a car,” 2017, arXiv:1704.07911.
[190] S.-H. Chung, S.-H. Kong, S. Cho, and I. M. A. Nahrendra, “Segmented [216] M. Bojarski et al., “VisualBackPro: Efficient visualization of CNNs for
encoding for Sim2Real of RL-based end-to-end autonomous driving,” in autonomous driving,” in Proc. IEEE Int. Conf. Robot. Automat., 2018,
Proc. IEEE Intell. Veh. Symp., 2022, pp. 1290–1296. pp. 4701–4708.
[191] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” 2013, [217] S. Mohseni, A. Jagadeesh, and Z. Wang, “Predicting model failure using
arXiv:1312.6114. saliency maps in autonomous driving systems,” 2019, arXiv:1905.07679.
[192] M. Ahmed, A. Abobakr, C. P. Lim, and S. Nahavandi, “Policy-based [218] J. Kim and J. Canny, “Interpretable learning for self-driving cars by
reinforcement learning for training autonomous driving agents in urban visualizing causal attention,” in Proc. IEEE Int. Conf. Comput. Vis., 2017,
areas with affordance learning,” IEEE Trans. Intell. Transp. Syst., vol. 23, pp. 2961–2969.
no. 8, pp. 12562–12571, Aug. 2022. [219] K. Mori, H. Fukui, T. Murase, T. Hirakawa, T. Yamashita, and H. Fu-
[193] A. Sauer, N. Savinov, and A. Geiger, “Conditional affordance learning jiyoshi, “Visual explanation by attention branch network for end-to-end
for driving in urban environments,” in Proc. Conf. Robot Learn., 2018, learning-based self-driving,” in Proc. IEEE Intell. Veh. Symp., 2019,
pp. 237–252. pp. 1577–1582.
[194] X. Zhang, M. Wu, H. Ma, T. Hu, and J. Yuan, “Multi-task long-range ur- [220] D. Wang, C. Devin, Q.-Z. Cai, F. Yu, and T. Darrell, “Deep object-centric
ban driving based on hierarchical planning and reinforcement learning,” policies for autonomous driving,” in Proc. IEEE Int. Conf. Robot. Au-
in Proc. IEEE Int. Intell. Transp. Syst. Conf., 2021, pp. 726–733. tomat., 2019, pp. 8853–8859.
[195] C. Huang et al., “Deductive reinforcement learning for visual autonomous [221] L. Cultrera, L. Seidenari, F. Becattini, P. Pala, and A. Del Bimbo,
urban driving navigation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, “Explaining autonomous driving by learning end-to-end visual attention,”
no. 12, pp. 5379–5391, Dec. 2021. in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2020,
[196] R. Cheng, C. Agia, F. Shkurti, D. Meger, and G. Dudek, “Latent attention pp. 1389–1398.
augmentation for robust autonomous driving policies,” in Proc. IEEE/RSJ [222] Y. Xiao, F. Codevilla, D. P. Bustamante, and A. M. Lopez, “Scaling
Int. Conf. Intell. Robots Syst., 2021, pp. 130–136. self-supervised end-to-end driving with multi-view attention learning,”
[197] J. Yamada, K. Pertsch, A. Gunjal, and J. J. Lim, “Task-induced represen- 2023, arXiv:2302.03198.
tation learning,” in Proc. Int. Conf. Learn. Representations, 2022. [223] K. Renz, K. Chitta, O.-B. Mercea, A. S. Koepke, Z. Akata, and A. Geiger,
[198] J. Chen and S. Pan, “Learning generalizable representations for rein- “Plant: Explainable planning transformers via object-level representa-
forcement learning via adaptive meta-learner of behavioral similarities,” tions,” in Proc. Conf. Robot Learn., 2022, pp. 459–470.
in Proc. Int. Conf. Learn. Representations, 2022. [224] Y. Sun, X. Wang, Y. Zhang, J. Tang, X. Tang, and J. Yao, “In-
[199] Z. Yang, L. Chen, Y. Sun, and H. Li, “Visual point cloud forecasting terpretable end-to-end driving model for implicit scene understand-
enables scalable autonomous driving,” in Proc. IEEE Conf. Comput. Vis. ing,” in Proc. IEEE 26th Int. Conf. Intell. Transp. Syst., 2023,
Pattern Recognit., 2024, pp. 14673–14684. pp. 2874–2880.
10182 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO. 12, DECEMBER 2024

[225] C. Liu, Y. Chen, M. Liu, and B. E. Shi, “Using eye gaze to en- [250] S. Gidaris and N. Komodakis, “Dynamic few-shot visual learning without
hance generalization of imitation networks to unseen environments,” forgetting,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018,
IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 5, pp. 2066–2074, pp. 4367–4375.
May 2021. [251] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond
[226] W. Zeng, S. Wang, R. Liao, Y. Chen, B. Yang, and R. Urtasun, “DSDNET: empirical risk minimization,” in Proc. Int. Conf. Learn. Representations,
Deep structured self-driving network,” in Proc. Eur. Conf. Comput. Vis., 2017.
2020, pp. 156–172. [252] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for
[227] A. Cui, S. Casas, A. Sadat, R. Liao, and R. Urtasun, “Lookout: Diverse dense object detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2017,
multi-future prediction and planning for self-driving,” in Proc. IEEE Int. pp. 2980–2988.
Conf. Comput. Vis., 2021, pp. 16107–16116. [253] Y. Cui, M. Jia, T.-Y. Lin, Y. Song, and S. Belongie, “Class-balanced loss
[228] Y. Xu et al., “Explainable object-induced action decision for autonomous based on effective number of samples,” in Proc. IEEE Conf. Comput. Vis.
vehicles,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, Pattern Recognit., 2019, pp. 9268–9277.
pp. 9523–9532. [254] S. Akhauri, L. Y. Zheng, and M. C. Lin, “Enhanced transfer learning
[229] H. Ben-Younes, É. Zablocki, P. Pérez, and M. Cord, “Driving behavior for autonomous driving with systematic accident simulation,” in Proc.
explanation with multi-level fusion,” Pattern Recognit., vol. 123, 2022, IEEE/RSJ Int. Conf. Intell. Robots Syst., 2020, pp. 5986–5993.
Art. no. 108421. [255] Q. Li, Z. Peng, Q. Zhang, C. Liu, and B. Zhou, “Improving the gen-
[230] B. Jin et al., “Adapt: Action-aware driving caption transformer,” in Proc. eralization of end-to-end driving through procedural generation,” 2020,
IEEE Int. Conf. Robot. Automat., 2023, pp. 7554–7561. arXiv:2012.13681.
[231] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration [256] P. A. Lopez et al., “Microscopic traffic simulation using SUMO,” in Proc.
of modern neural networks,” in Proc. Int. Conf. Mach. Learn., 2017, 21st Int. Conf. Intell. Transp. Syst., 2018, pp. 2575–2582.
pp. 1321–1330. [257] M. O’Kelly, A. Sinha, H. Namkoong, R. Tedrake, and J. C. Duchi, “Scal-
[232] A. Loquercio, M. Segu, and D. Scaramuzza, “A general framework for able end-to-end autonomous vehicle testing via rare-event simulation,”
uncertainty estimation in deep learning,” IEEE Robot. Automat. Lett., in Proc. Int. Conf. Neural Inf. Process. Syst., 2018, pp. 9849–9860.
vol. 5, no. 2, pp. 3153–3160, Apr. 2020. [258] Y. Abeysirigoonawardena, F. Shkurti, and G. Dudek, “Generating adver-
[233] R. Michelmore, M. Kwiatkowska, and Y. Gal, “Evaluating uncer- sarial driving scenarios in high-fidelity simulators,” in Proc. IEEE Int.
tainty quantification in end-to-end autonomous driving control,” 2018, Conf. Robot. Automat., 2019, pp. 8271–8277.
arXiv:1811.06817. [259] W. Ding, B. Chen, B. Li, K. J. Eun, and D. Zhao, “Multimodal safety-
[234] A. Filos, P. Tigkas, R. McAllister, N. Rhinehart, S. Levine, and Y. critical scenarios generation for decision-making algorithms evaluation,”
Gal, “Can autonomous vehicles identify, recover from, and adapt IEEE Robot. Automat. Lett., vol. 6, no. 2, pp. 1551–1558, Apr. 2021.
to distribution shifts?,” in Proc. Int. Conf. Mach. Learn., 2020, [260] L. Zhang, Z. Peng, Q. Li, and B. Zhou, “CAT: Closed-loop adversarial
pp. 3145–3153. training for safe end-to-end driving,” in Proc. Conf. Robot Learn., 2023,
[235] L. Tai, P. Yun, Y. Chen, C. Liu, H. Ye, and M. Liu, “Visual-based pp. 2357–2372.
autonomous driving deployment from a stochastic and uncertainty-aware [261] L. T. Triess, M. Dreissig, C. B. Rist, and J. M. Zöllner, “A survey on
perspective,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2019, deep domain adaptation for LiDAR perception,” in Proc. IEEE Intell.
pp. 2622–2628. Veh. Symp. Workshops, 2021, pp. 350–357.
[236] P. Cai, Y. Sun, H. Wang, and M. Liu, “VTGNet: A vision-based trajectory [262] Y. You, X. Pan, Z. Wang, and C. Lu, “Virtual to real reinforcement
generation network for autonomous vehicles in urban environments,” learning for autonomous driving,” in Proc. Brit. Mach. Vis. Conf., 2017.
IEEE Trans. Intell. Veh., vol. 6, no. 3, pp. 419–429, Sep. 2021. [263] A. Bewley et al., “Learning to drive from simulation without real world
[237] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “On a formal model labels,” in Proc. IEEE Int. Conf. Robot. Automat., 2019, pp. 4818–4824.
of safe and scalable self-driving cars,” 2017, arXiv:1708.06374. [264] J. Xing, T. Nagata, K. Chen, X. Zou, E. Neftci, and J. L. Krichmar,
[238] T. Brüdigam, M. Olbrich, D. Wollherr, and M. Leibold, “Stochastic model “Domain adaptation in reinforcement learning via latent unified state rep-
predictive control with a safety guarantee for automated driving,” IEEE resentation,” in Proc. AAAI Conf. Artif. Intell., 2021, pp. 10452–10459.
Trans. Intell. Veh., vol. 8, no. 1, pp. 22–36, Jan. 2023. [265] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel,
[239] Y. Lyu, W. Luo, and J. M. Dolan, “Probabilistic safety-assured adaptive “Domain randomization for transferring deep neural networks from
merging control for autonomous vehicles,” in Proc. IEEE Int. Conf. simulation to the real world,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots
Robot. Automat., 2021, pp. 10764–10770. Syst., 2017, pp. 23–30.
[240] J. P. Allamaa, P. Patrinos, T. Ohtsuka, and T. D. Son, “Real-time MPC with [266] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-real
control barrier functions for autonomous driving using safety enhanced transfer of robotic control with dynamics randomization,” in Proc. IEEE
collocation,” 2024, arXiv:2401.06648. Int. Conf. Robot. Automat., 2018, pp. 3803–3810.
[241] R. Geirhos et al., “Shortcut learning in deep neural networks,” Nature [267] J. Matas, S. James, and A. J. Davison, “Sim-to-real reinforcement learn-
Mach. Intell., vol. 2, pp. 665–673, 2020. ing for deformable object manipulation,” in Proc. Conf. Robot Learn.,
[242] P. de Haan, D. Jayaraman, and S. Levine, “Causal confusion in imi- 2018, pp. 734–743.
tation learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, [268] B. Osiński et al., “Simulation-based reinforcement learning for real-world
pp. 11698–11709. autonomous driving,” in Proc. IEEE Int. Conf. Robot. Automat., 2020,
[243] U. Muller, J. Ben, E. Cosatto, B. Flepp, and Y. LeCun, “Off-road obstacle pp. 6411–6418.
avoidance through end-to-end learning,” in Proc. Int. Conf. Neural Inf. [269] R. Kirk, A. Zhang, E. Grefenstette, and T. Rocktäschel, “A survey of
Process. Syst., 2005, pp. 739–746. zero-shot generalisation in deep reinforcement learning,” J. Artif. Intell.
[244] M. Bansal, A. Krizhevsky, and A. S. Ogale, “ChauffeurNet: Learning to Res., vol. 76, pp. 201–264, 2023.
drive by imitating the best and synthesizing the worst,” Robotics: Sci. [270] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot
Syst. Conf, 2019. learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2017, pp. 4080–
[245] C. Chuang, D. Yang, C. Wen, and Y. Gao, “Resolving copycat problems 4090.
in visual imitation learning via residual action prediction,” in Proc. Eur. [271] P. Karkus, B. Ivanovic, S. Mannor, and M. Pavone, “Diffstack: A differ-
Conf. Comput. Vis., 2022, pp. 392–409. entiable and modular control stack for autonomous vehicles,” in Proc.
[246] M. Buda, A. Maki, and M. A. Mazurowski, “A systematic study of Conf. Robot Learn., 2022, pp. 2170–2180.
the class imbalance problem in convolutional neural networks,” Neural [272] H. Li et al., “Open-sourced data ecosystem in autonomous driving: The
Netw., vol. 106, pp. 249–259, 2018. present and future,” 2023, arXiv:2312.03408.
[247] J. Byrd and Z. Lipton, “What is the effect of importance weighting in [273] A. Kirillov et al., “Segment anything,” in Proc. IEEE Int. Conf. Comput.
deep learning?,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 872–881. Vis., 2023, pp. 4015–4026.
[248] I. Mani and I. Zhang, “KNN approach to unbalanced data distributions: [274] S. Narang and A. Chowdhery, “Pathways language model (PaLM): Scal-
A case study involving information extraction,” in Proc. Int. Conf. Mach. ing to 540 billion parameters for breakthrough performance,” J. Mach.
Learn. Workshops, 2003. Learn. Res., vol. 24, pp. 11324–11436, 2022.
[249] X.-Y. Liu, J. Wu, and Z.-H. Zhou, “Exploratory undersampling for class- [275] Y. Fang et al., “Exploring the limits of masked visual representation
imbalance learning,” IEEE Trans. Syst., Man, Cybern. B Cybern., vol. 39, learning at scale,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
no. 2, pp. 539–550, Apr. 2009. 2023, pp. 19358–19369.
CHEN et al.: END-TO-END AUTONOMOUS DRIVING: CHALLENGES AND FRONTIERS 10183

[276] M. Oquab et al., “DINOv2: Learning robust visual features without Bernhard Jaeger received the BSc degree in in-
supervision,” Trans. Mach. Learn. Res., 2024. formatics: Games engineering from the Technical
[277] J.-B. Alayrac et al., “Flamingo: A visual language model for few-shot University of Munich, in 2018, and the MSc degree
learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2022, pp. 23716– in computer science from the University of Tübin-
23736. gen, in 2021. He is currently working toward the
[278] L. Ouyang et al., “Training language models to follow instructions with PhD degree with the Autonomous Vision Group led
human feedback,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2022, by Prof. Andreas Geiger, part of the University of
pp. 27730–27744. Tübingen and Tübingen AI Center, Germany. His
[279] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked research interests include most aspects of embodied
autoencoders are scalable vision learners,” in Proc. IEEE Conf. Comput. intelligence such as vision and decision making, with
Vis. Pattern Recognit., 2022, pp. 16000–16009. a focus on autonomous driving.

Li Chen received the BE degree in mechanical en-

gineering from Shanghai Jiao Tong University, and Andreas Geiger received the PhD degree from the
the MS degree in robotics from the University of Karlsruhe Institute of Technology (KIT) in 2013.
Michigan, Ann Arbor, USA. He is currently work- He is a professor with the University of Tübingen.
ing toward the PhD degree with the Department of Prior to this, he was a visiting professor with ETH
Computer Science, University of Hong Kong. He Zürich and a group leader with the Max Planck In-
also serves as a researcher with OpenDriveLab with stitute for Intelligent Systems. He studied with KIT,
Shanghai AI Laboratory. His research interests in- EPFL and MIT. His research interests include the
clude autonomous driving and computer vision. intersection of computer vision, machine learning
and robotics, with a particular focus on 3D scene
perception, deep representation learning, generative
Penghao Wu received the BE degree in electrical models, and sensori-motor control in the context of
and computer engineering from the University of autonomous systems.
Michigan-Shanghai Jiao Tong University Joint Insti-
tute (UM-SJTU JI), Shanghai Jiao Tong University
(SJTU), and the MS degree in computer science Hongyang Li (Senior Member, IEEE) received the
from UCSD, USA. His research interests include PhD degree in computer science from the Chinese
autonomous driving and computer vision. University of Hong Kong in 2019. He is currently
a research scientist and leads OpenDriveLab with
Shanghai AI Lab. His expertise focuses on perception
and cognition, end-to-end autonomous driving, and
foundation models. He won as PI the CVPR 2023 Best
Paper Award, and proposed BEVFormer that claimed
Kashyap Chitta received the MS degree in computer Top AI 100 Papers 2022. He serves as area chair for
vision from Carnegie Mellon University, in 2018. NeurIPS, CVPR and referee for Nature Communi-
He is currently working toward the PhD degree with cations, Working Group Chair for IEEE Standards
the Autonomous Vision Group led by Prof. Andreas P3474.
Geiger, part of the University of Tübingen and Tübin-
gen AI Center, Germany. His research is on data-
driven solutions to decision making tasks, with a focus
on autonomous driving.

(众凌) 6110U 6120U Controller
0% (1)
(众凌) 6110U 6120U Controller
1 page
(TS) HS70A - Booting Failed On System Start
No ratings yet
(TS) HS70A - Booting Failed On System Start
6 pages
Urdu Solved Mcqs
100% (1)
Urdu Solved Mcqs
8 pages
End-To-End Autonomous Driving: Challenges and Frontiers
No ratings yet
End-To-End Autonomous Driving: Challenges and Frontiers
21 pages
End-To-End Autonomous Driving: Challenges and Frontiers
No ratings yet
End-To-End Autonomous Driving: Challenges and Frontiers
20 pages
Recent Advancements in End-to-End Autonomous Driving Using Deep Learning: A Survey
No ratings yet
Recent Advancements in End-to-End Autonomous Driving Using Deep Learning: A Survey
28 pages
A Survey of End-To-End Driving Architectures and Training Methods
No ratings yet
A Survey of End-To-End Driving Architectures and Training Methods
21 pages
An End-to-End Curriculum Learning Approach For Aut
No ratings yet
An End-to-End Curriculum Learning Approach For Aut
10 pages
Interpretable End-to-End Urban Autonomous Driving With Latent Deep Reinforcement Learning
No ratings yet
Interpretable End-to-End Urban Autonomous Driving With Latent Deep Reinforcement Learning
11 pages
Motion Planning For Autonomous Driving: The State of The Art and Future Perspectives
No ratings yet
Motion Planning For Autonomous Driving: The State of The Art and Future Perspectives
21 pages
Planning-Oriented Autonomous Driving
No ratings yet
Planning-Oriented Autonomous Driving
24 pages
2019 - Chen Et Al. - Brain-Inspired Cognitive Model With Attention For Self-Driving Cars
No ratings yet
2019 - Chen Et Al. - Brain-Inspired Cognitive Model With Attention For Self-Driving Cars
13 pages
2023 CVPR UniID
No ratings yet
2023 CVPR UniID
10 pages
Paper 11
No ratings yet
Paper 11
15 pages
Motion Planning For Autonomous Driving The State of The Art and Future Perspectives
No ratings yet
Motion Planning For Autonomous Driving The State of The Art and Future Perspectives
20 pages
Integrating Deep Reinforcement Learning With Model-Based Path Planner
No ratings yet
Integrating Deep Reinforcement Learning With Model-Based Path Planner
6 pages
T-04 A - Survey - On - Approximate - Edge - AI - For - Energy - Efficient - Autonomous - Driving - Services
No ratings yet
T-04 A - Survey - On - Approximate - Edge - AI - For - Energy - Efficient - Autonomous - Driving - Services
41 pages
A Survey of Deep Learning Techniques For Autonomous Driving
No ratings yet
A Survey of Deep Learning Techniques For Autonomous Driving
25 pages
Advanced Self Driving Car Using Machine Learning
No ratings yet
Advanced Self Driving Car Using Machine Learning
5 pages
IcetranAutonomous Car Driving One Possible Implementation Using Machine Learning Algorithm
No ratings yet
IcetranAutonomous Car Driving One Possible Implementation Using Machine Learning Algorithm
7 pages
Interpretable End-To-End Urban Autonomous Driving With Latent Deep Reinforcement Learning
No ratings yet
Interpretable End-To-End Urban Autonomous Driving With Latent Deep Reinforcement Learning
11 pages
Driver GPT 4
No ratings yet
Driver GPT 4
16 pages
Attention-Based Highway Safety Planner For Autonomous Driving Via Deep Reinforcement Learning
No ratings yet
Attention-Based Highway Safety Planner For Autonomous Driving Via Deep Reinforcement Learning
14 pages
Bsse 211129 Asad Ullah Ai Assignment
No ratings yet
Bsse 211129 Asad Ullah Ai Assignment
11 pages
Developing Path Planning With Behavioral Cloning and Proximal Policy Optimization For Path-Tracking and Static Obstacle Nudging
No ratings yet
Developing Path Planning With Behavioral Cloning and Proximal Policy Optimization For Path-Tracking and Static Obstacle Nudging
6 pages
Learning For Autonomous Vehicles: A Focus On Expert Demonstration
No ratings yet
Learning For Autonomous Vehicles: A Focus On Expert Demonstration
24 pages
GenAD - Generative End-To-End Autonomous Driving
No ratings yet
GenAD - Generative End-To-End Autonomous Driving
10 pages
DDPG For Obstacle Avoidance
No ratings yet
DDPG For Obstacle Avoidance
22 pages
Research Paper (3) (1) 2
No ratings yet
Research Paper (3) (1) 2
8 pages
Real Time Lane Detection and Collision Avoidance Method For Autonomous Vehicle
No ratings yet
Real Time Lane Detection and Collision Avoidance Method For Autonomous Vehicle
17 pages
Autoware Challenge 2023
No ratings yet
Autoware Challenge 2023
13 pages
Sensors 23 09225
No ratings yet
Sensors 23 09225
27 pages
Thesis - (4) - Removed
No ratings yet
Thesis - (4) - Removed
35 pages
Ref 4
No ratings yet
Ref 4
9 pages
Autonomous Vehicles and Systems - A Technological and Societal Perspective
No ratings yet
Autonomous Vehicles and Systems - A Technological and Societal Perspective
464 pages
End To End DL Using PX
No ratings yet
End To End DL Using PX
9 pages
Yang 等 - 2024 - LLM4Drive a Survey of Large Language Models for Autonomous Driving
No ratings yet
Yang 等 - 2024 - LLM4Drive a Survey of Large Language Models for Autonomous Driving
19 pages
Deep Learning For Safe Autonomous Driving Current Challenges and Future Directions
No ratings yet
Deep Learning For Safe Autonomous Driving Current Challenges and Future Directions
21 pages
ROS2 Project
No ratings yet
ROS2 Project
12 pages
Path Reader and Intelligent Lane Navigator by Autonomous Vehicle
No ratings yet
Path Reader and Intelligent Lane Navigator by Autonomous Vehicle
11 pages
A Survey of Deep Learning Techniques For Autonomous Driving
No ratings yet
A Survey of Deep Learning Techniques For Autonomous Driving
28 pages
End-To-End Contextual Perception and Prediction With Interaction Transformer
No ratings yet
End-To-End Contextual Perception and Prediction With Interaction Transformer
8 pages
Learning For Autonomous Vehicles: A Focus On Expert Demonstration
No ratings yet
Learning For Autonomous Vehicles: A Focus On Expert Demonstration
26 pages
Model-Based and Machine Learning-Based High-Level Controller For Autonomous Vehicle Navigation: Lane Centering and Obstacles Avoidance
No ratings yet
Model-Based and Machine Learning-Based High-Level Controller For Autonomous Vehicle Navigation: Lane Centering and Obstacles Avoidance
14 pages
1 s2.0 S0957417423014720 Main
No ratings yet
1 s2.0 S0957417423014720 Main
15 pages
Literature 2
No ratings yet
Literature 2
12 pages
Evaluation of Deep Reinforcement Learning Algorithms For Autonomous Driving
No ratings yet
Evaluation of Deep Reinforcement Learning Algorithms For Autonomous Driving
7 pages
Nuplan: A Closed-Loop Ml-Based Planning Benchmark For Autonomous Vehicles
No ratings yet
Nuplan: A Closed-Loop Ml-Based Planning Benchmark For Autonomous Vehicles
5 pages
Artificial Intelligence AI Framework For Multi-Mod
No ratings yet
Artificial Intelligence AI Framework For Multi-Mod
9 pages
Lanes Curves Detection Report
No ratings yet
Lanes Curves Detection Report
38 pages
Paper ID 75
No ratings yet
Paper ID 75
6 pages
A Survey of Deep Learning Applications To Autonomous Vehicle Control
No ratings yet
A Survey of Deep Learning Applications To Autonomous Vehicle Control
23 pages
A Survey of Autonomous Driving:: Common Practices and Emerging Technologies
No ratings yet
A Survey of Autonomous Driving:: Common Practices and Emerging Technologies
28 pages
Iotmini
No ratings yet
Iotmini
21 pages
Emerging Technologies
No ratings yet
Emerging Technologies
27 pages
Chen DeepDriving Learning Affordance ICCV 2015 Paper
No ratings yet
Chen DeepDriving Learning Affordance ICCV 2015 Paper
9 pages
Real-Time Vehicle and Lane Detection Using Modified OverFeat CNN
No ratings yet
Real-Time Vehicle and Lane Detection Using Modified OverFeat CNN
9 pages
Self Driving Car Using Raspberry Pi: Keywords:Raspberry Pi, Lane Detection, Obstacle Detection, Opencv, Deep Learning
No ratings yet
Self Driving Car Using Raspberry Pi: Keywords:Raspberry Pi, Lane Detection, Obstacle Detection, Opencv, Deep Learning
8 pages
Review of PP Algms-Elsevier
No ratings yet
Review of PP Algms-Elsevier
46 pages
Self-Driving Cars Using Genetic Algorithm
No ratings yet
Self-Driving Cars Using Genetic Algorithm
6 pages
Comparing DRL Architectures
No ratings yet
Comparing DRL Architectures
14 pages
SDV IEEE Access V3
No ratings yet
SDV IEEE Access V3
16 pages
PATNAS A Path-Based Training-Free Neural Architecture Search
No ratings yet
PATNAS A Path-Based Training-Free Neural Architecture Search
17 pages
Deep Learning-Based Robust Positioning For All-Weather Autonomous Driving
No ratings yet
Deep Learning-Based Robust Positioning For All-Weather Autonomous Driving
16 pages
Sonata 2021 J. Phys. Conf. Ser. 1869 012071
No ratings yet
Sonata 2021 J. Phys. Conf. Ser. 1869 012071
7 pages
FocalPose Focal Length and Object Pose Estimation Via Render and Compare
No ratings yet
FocalPose Focal Length and Object Pose Estimation Via Render and Compare
18 pages
Computational Intelligence and Neuroscience - 2021 - Wang - A Real Time Object Detector For Autonomous Vehicles Based On
No ratings yet
Computational Intelligence and Neuroscience - 2021 - Wang - A Real Time Object Detector For Autonomous Vehicles Based On
11 pages
A Class of Quasi Cuk DC DC Converters Steady State Analysis and Design PDF
No ratings yet
A Class of Quasi Cuk DC DC Converters Steady State Analysis and Design PDF
20 pages
FDA Practical - Book
No ratings yet
FDA Practical - Book
66 pages
Unit One
No ratings yet
Unit One
13 pages
SIM-PA Simplified Consensus Protocol Simulator Applications To Proof of Reputation-X and Proof of Contribution
No ratings yet
SIM-PA Simplified Consensus Protocol Simulator Applications To Proof of Reputation-X and Proof of Contribution
12 pages
Srs Project
No ratings yet
Srs Project
40 pages
Marc 8 Midi PDF
No ratings yet
Marc 8 Midi PDF
1 page
Plutoconfig
No ratings yet
Plutoconfig
10 pages
7tour and Use The Power BI Service
No ratings yet
7tour and Use The Power BI Service
7 pages
HackTheBox - Bucket Walkthrough
No ratings yet
HackTheBox - Bucket Walkthrough
11 pages
DLP 7-3 Installing Drivers (SDLP)
No ratings yet
DLP 7-3 Installing Drivers (SDLP)
4 pages
(Rahman) Assignment#1
No ratings yet
(Rahman) Assignment#1
9 pages
C# IMP Notes (E-Next - In) PDF
100% (1)
C# IMP Notes (E-Next - In) PDF
116 pages
Novo9 Spark User Manual
No ratings yet
Novo9 Spark User Manual
34 pages
Delhi Public School Bangalore East Subject: Computer Science Chapter: Log On To Access (Answer Key) Grade Viii
No ratings yet
Delhi Public School Bangalore East Subject: Computer Science Chapter: Log On To Access (Answer Key) Grade Viii
2 pages
4431 Question Paper
No ratings yet
4431 Question Paper
2 pages
Ajmani International School: Timestampcamera App and E-Mail It To Your Class Teacher
No ratings yet
Ajmani International School: Timestampcamera App and E-Mail It To Your Class Teacher
2 pages
USA Comcast 2023
No ratings yet
USA Comcast 2023
30 pages
Recover Table With RMAN
No ratings yet
Recover Table With RMAN
5 pages
LegOSC - Mindstorms NXT Robotics Programming For A
No ratings yet
LegOSC - Mindstorms NXT Robotics Programming For A
7 pages
How To Delete Multiple CBO Entries
No ratings yet
How To Delete Multiple CBO Entries
22 pages
Visvesvaraya Technological University: "Car Rental Management System"
No ratings yet
Visvesvaraya Technological University: "Car Rental Management System"
31 pages
Duplication - Typecasting-Problem Statement
100% (1)
Duplication - Typecasting-Problem Statement
3 pages
2022-04-01.txt (SHARED) (1) .
No ratings yet
2022-04-01.txt (SHARED) (1) .
7 pages
Computer Knowledge Section Test
No ratings yet
Computer Knowledge Section Test
2 pages
SPE-203755-MS Digital Transformation of The Standing and Katz Compressibility Factor Chart For Natural Gases
No ratings yet
SPE-203755-MS Digital Transformation of The Standing and Katz Compressibility Factor Chart For Natural Gases
17 pages
Camm 4e Ch01 PPT
No ratings yet
Camm 4e Ch01 PPT
48 pages
Temp Based Fan
No ratings yet
Temp Based Fan
14 pages
Birth Certificate PDF Public Key Certificate Kerala
No ratings yet
Birth Certificate PDF Public Key Certificate Kerala
1 page

End-To-End Autonomous Driving Challenges and Frontiers

Uploaded by

End-To-End Autonomous Driving Challenges and Frontiers

Uploaded by

10164 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 46, NO.

12, DECEMBER 2024

End-to-End Autonomous Driving: Challenges

properties can be sampled from a probabilistic distribution TABLE I

with near-perfect scores [5]. The subsequent NoCrash bench-

L2 feature loss between teacher and student networks, while

resort to the progress of multi-modality and foundation models,

vehicles’ past states. This technique has been used in multiple

Li Chen received the BE degree in mechanical en-

You might also like